Alpesh Nakrani
May 11, 2026 · 11 min

LLM Quantization: When 4-Bit Pays (and When It Bites)

LLM quantization stores a model at fewer bits per weight, cutting memory and cost. The trade-off: quality holds on most tasks and quietly breaks on a few.

LLM quantization is the practice of storing a model's weights (and sometimes its activations) at lower numerical precision, dropping from 16-bit floats down to 8-bit or 4-bit integers, so the model takes less memory and runs cheaper. The cost is quality: most of the time the drop is small enough to hide inside the noise of your eval suite, and some of the time it is large enough to break the feature in production. The whole game is knowing which task you have.

I run revenue at Devlyn, where we ship customer-facing AI into a retail eyewear experience that touches real people in stores. I came up as an engineer, and I still read the inference bill line by line. Quantization is one of the few levers that moves the bill by a multiple rather than a few percent, which is exactly why it deserves more rigor than it usually gets. Most teams either refuse to quantize out of superstition or quantize everything and ship a quietly degraded model. Both are expensive mistakes.

This is a supporting piece in my guide to LLM inference cost. Quantization sits next to model sizing and prompt caching as one of the three biggest cost levers you control, and it is the one teams understand the least.

Key takeaways

  • Quantization is a margin lever, not a quality compromise, as long as you quantize the tasks where the quality drop falls inside your eval noise and leave the ones where it does not.
  • 8-bit is close to free; 4-bit is a real trade. INT8 and FP8 are near-lossless on most tasks. At 4-bit you cut weight memory by roughly 4x versus 16-bit and pay a real, measurable quality cost on hard tasks.
  • 4-bit is fine for summarization, classification, and extraction, and genuinely costly for math, multi-step reasoning, and long-context faithfulness. The bit-width is not a single decision across your whole product.
  • The kernel matters as much as the format. The same 4-bit weights can run 10x faster or slower depending on which inference kernel serves them.
  • Quantization is what makes self-hosting pay. Fitting a 70B model on one GPU drops the break-even point against an API from "never" to "sooner than you think."

What LLM quantization actually is, and what you trade

A model weight is just a number. Train a model in the usual way and each of those numbers is a 16-bit float, which is the default precision most modern LLMs ship in. Quantization replaces those 16-bit numbers with lower-precision ones: 8-bit integers, 4-bit integers, or 8-bit floats. Fewer bits per number means a smaller model in memory and less data to move per token, and moving data is most of what inference actually spends time on.
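To make "fewer bits per number" concrete, here is a minimal sketch of the simplest possible scheme: symmetric round-to-nearest with a single scale. The weights are made up, and real methods (GPTQ, AWQ, and the rest) are considerably smarter about how they round; this only shows what the precision loss looks like.

```python
def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale


def max_error(weights, bits):
    """Largest gap between an original weight and its dequantized value."""
    codes, scale = quantize_absmax(weights, bits)
    return max(abs(w - c * scale) for w, c in zip(weights, codes))


# Made-up weights, chosen only to make the rounding visible.
weights = [0.12, -0.53, 0.31, 0.07, -0.88, 0.44, -0.02, 0.65]
err8 = max_error(weights, 8)
err4 = max_error(weights, 4)
print(f"8-bit max error: {err8:.4f}")
print(f"4-bit max error: {err4:.4f}")
```

The 4-bit grid has only 15 usable levels in this scheme, so each weight can land much further from its trained value than it does at 8-bit.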

There are two things you can quantize, and the distinction drives everything downstream. Weight-only quantization compresses the stored weights but runs the math in higher precision (often written W4A16, meaning 4-bit weights and 16-bit activations). Weight-and-activation quantization compresses both (W8A8 is the common one). Weight-only is easier to do without hurting quality, because activations carry outliers that low precision handles badly. That is why most 4-bit deployments are weight-only.
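The outlier problem is easy to see in a toy example. The sketch below uses invented activation values and a single per-tensor 8-bit scale, which is an assumption for illustration; production W8A8 schemes use finer-grained scales and smoothing tricks precisely to avoid this.

```python
def roundtrip(values, bits=8):
    """Quantize to signed integers with one absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)


typical = [0.5, -0.3, 0.8, 0.1]   # invented "normal" activations
with_outlier = typical + [60.0]   # one invented outlier channel

clean_err = mean_abs_error(typical, roundtrip(typical))
# Error on the same four values once the outlier has set the scale.
outlier_err = mean_abs_error(typical, roundtrip(with_outlier)[:4])

print(f"without outlier: {clean_err:.4f}")
print(f"with outlier:    {outlier_err:.4f}")
```

One large value stretches the scale, and every ordinary value gets squeezed onto a handful of integer levels. Weights rarely have outliers this extreme, which is part of why weight-only quantization is the gentler option.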

The other split is when you quantize. Post-training quantization (PTQ) takes a finished model and compresses it using a small calibration dataset, in minutes to hours, with no retraining. Quantization-aware training (QAT) bakes the low precision into the training loop, which recovers more quality but costs a full training run. For almost every team I talk to, PTQ is the right starting point, and the methods below are all PTQ.

The trade you are making is precision for size. Less precision means each weight is a coarser approximation of what training produced. On easy, redundant tasks that coarseness washes out. On tasks that depend on a long chain of exact intermediate steps, the errors compound. The skill is not "should I quantize," it is "which precision survives this specific task," and that is an empirical question you answer by measuring, not by reading a blog post, including this one.

The methods that matter: GPTQ, AWQ, GGUF, and FP8

Four names cover most of what you will actually deploy. They are not interchangeable, and the differences are practical rather than academic.

GPTQ is a one-shot post-training method that uses approximate second-order information to decide how to round each weight while minimizing the error it introduces. The original work showed it could quantize a 175-billion-parameter model down to 3 or 4 bits in roughly four GPU hours with negligible accuracy loss relative to the full-precision baseline (Frantar et al., ICLR 2023). It is fast to produce and well supported. It tends to lose a little more on code and reasoning than the alternatives, for reasons I will get to.

AWQ (Activation-aware Weight Quantization) starts from a sharp observation: not all weights matter equally. Roughly 1% of weight channels are salient, identified by looking at the activation distribution rather than the weights themselves, and protecting just those channels before quantizing the rest sharply reduces error. It uses no backpropagation and no reconstruction, so it preserves the model's general behavior instead of overfitting to the calibration set (Lin et al., MLSys 2024, which won that conference's best paper award). In practice AWQ holds quality at 4-bit a bit better than GPTQ.
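The "look at activations, not weights" idea can be sketched in a few lines. This is only the selection step under my own simplifications: the function name and the synthetic activations are hypothetical, and the real method goes on to rescale the salient channels rather than simply flagging them.

```python
def salient_channels(activations, fraction=0.01):
    """Rank input channels by mean |activation| and return the top fraction."""
    n_channels = len(activations[0])
    mean_mag = [
        sum(abs(row[c]) for row in activations) / len(activations)
        for c in range(n_channels)
    ]
    keep = max(1, round(n_channels * fraction))
    ranked = sorted(range(n_channels), key=lambda c: mean_mag[c], reverse=True)
    return sorted(ranked[:keep])


# Synthetic calibration activations: 8 samples x 200 channels, all small...
activations = [[0.1 * ((r + c) % 5 + 1) for c in range(200)] for r in range(8)]
for row in activations:
    row[17] *= 50   # ...except two channels that consistently fire large
    row[142] *= 40

print(salient_channels(activations))
```

Note that the weights never appear in the selection: a channel is important because of what flows through it, which is the observation the paper builds on.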

GGUF is the format from the llama.cpp project, and it is what you reach for when you are serving on CPU, on mixed hardware, or on Apple Silicon, where it is the most common practical option. Its K-quant variants (Q4_K_M and friends) give you fine-grained control over the quality-size trade by storing more scale data per block. If your deployment is a laptop, an edge device, or a heterogeneous fleet, GGUF is usually the answer.
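Why per-block scale data helps can be shown with a toy comparison. The block size of 32 and the 16-bit scale are assumptions for illustration, and the actual K-quant layouts are more elaborate than this sketch.

```python
def roundtrip_4bit(values):
    """Quantize to 4-bit signed integers with one absmax scale, then dequantize."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]


def blockwise_roundtrip(values, block_size):
    """Same scheme, but every block of weights gets its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(roundtrip_4bit(values[start:start + block_size]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def bits_per_weight(weight_bits, scale_bits, block_size):
    """Effective storage cost once the per-block scale is amortized."""
    return weight_bits + scale_bits / block_size


# Invented weights: one small-magnitude block followed by one large-magnitude block.
weights = [0.01 * (i % 8 - 4) for i in range(32)] + [1.0 * (i % 8 - 4) for i in range(32)]
small = weights[:32]

tensor_small_err = mean_abs_error(small, roundtrip_4bit(weights)[:32])
block_small_err = mean_abs_error(small, blockwise_roundtrip(weights, 32)[:32])

print(f"small block, one global scale: {tensor_small_err:.5f}")
print(f"small block, per-block scale:  {block_small_err:.5f}")
print(f"effective bits per weight:     {bits_per_weight(4, 16, 32)}")
```

With a single global scale the small block collapses to zeros; with its own scale it keeps its shape. The price is the extra half bit per weight, which is the quality-size dial the variants let you turn.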

FP8 is the newer entrant and a different idea. Instead of integers it uses 8-bit floating point (the E4M3 and E5M2 formats), which keeps a wider dynamic range than INT8 and so handles the outlier activations that wreck integer quantization. On the latest hardware FP8 is close to lossless for both weights and activations, which makes it the cleanest way to get an 8-bit speedup without an eval babysitting it. If your GPUs support it natively, FP8 is often the easiest win in this whole list.
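The dynamic-range claim can be checked from the bit layouts alone. This sketch assumes the commonly used definitions: E5M2 reserves its top exponent code for infinities and NaN in the IEEE style, while E4M3 reserves only the all-ones mantissa at the top exponent. The helper names are mine.

```python
def largest_finite(exp_bits, man_bits, ieee_style):
    """Largest finite value of a small float format.

    ieee_style=True reserves the whole top exponent code for inf/NaN (E5M2).
    ieee_style=False reserves only the all-ones mantissa there for NaN (E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1
    top_code = 2 ** exp_bits - 1
    if ieee_style:
        exponent = (top_code - 1) - bias
        mantissa = 2 - 2 ** -man_bits
    else:
        exponent = top_code - bias
        mantissa = 2 - 2 ** -(man_bits - 1)
    return mantissa * 2 ** exponent


def smallest_positive(exp_bits, man_bits):
    """Smallest positive subnormal value."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias - man_bits)


e4m3_max = largest_finite(4, 3, ieee_style=False)
e5m2_max = largest_finite(5, 2, ieee_style=True)

print(f"INT8: max 127, ratio of largest to smallest nonzero magnitude 127")
print(f"E4M3: max {e4m3_max:g}, ratio {e4m3_max / smallest_positive(4, 3):g}")
print(f"E5M2: max {e5m2_max:g}, ratio {e5m2_max / smallest_positive(5, 2):g}")
```

INT8 spends its 256 codes evenly across a narrow range, while the float formats spend theirs logarithmically across a wide one. That is why an outlier costs FP8 far less than it costs INT8.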

Quality by bit-width: where 4-bit is free and where it bites

Here is the part teams get wrong: they treat bit-width as one decision for the whole product, when it is really a decision per task. The same 4-bit model that is indistinguishable from full precision on one job is visibly worse on another.

At 8-bit the trade barely exists. FP8 is essentially lossless, and INT8 with sensible tuning lands within about 1 to 3% of the full-precision model on most tasks. If you are nervous about quantization and want the safe first step, quantize to 8-bit and move on. You get roughly half the memory for a quality cost you will struggle to measure.

At 4-bit the trade is real but narrow. Across the common 4-bit formats, perplexity stays within about 6% of the 16-bit baseline, which sounds reassuring until you look at task-specific numbers. On HumanEval, a code-generation benchmark, one widely cited comparison put full precision at 56.1% pass@1, with AWQ and GGUF Q4_K_M both holding 51.8% and GPTQ falling to 46% (The AI Engineer, 2026). A four-to-ten-point drop on code is not noise. On a summarizer you would never see it.
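It helps to restate those HumanEval figures as both an absolute drop and a share of the baseline retained. The snippet uses only the numbers quoted above.

```python
baseline = 56.1  # full-precision pass@1, as quoted
scores = {"AWQ": 51.8, "GGUF Q4_K_M": 51.8, "GPTQ": 46.0}

summary = {}
for method, score in scores.items():
    summary[method] = {
        "drop_points": round(baseline - score, 1),
        "retained_pct": round(100 * score / baseline, 1),
    }
    print(method, summary[method])
```

A model can keep roughly 92% of its code score while its perplexity barely moves, which is why perplexity alone is a poor guide to what survives.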

That pattern is the rule, not the exception. Four-bit quantization is close to free on summarization, classification, extraction, and retrieval-grounded answering, where the model has slack and the task tolerates approximation. It bites on math, multi-step reasoning, code generation, and long-context faithfulness, where each step depends on the last and a coarse weight throws the chain off. So the operative question is never "is 4-bit good enough," it is "is 4-bit good enough for this task," and the answer changes across the surfaces of a single product.
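A toy model shows why chained tasks are the fragile ones. It assumes each step succeeds independently with the same probability, and the per-step numbers are invented; real reasoning errors are messier than this, so read it as intuition rather than measurement.

```python
def chain_success(per_step, steps):
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps


# Hypothetical per-step accuracies before and after quantization.
full_precision_step = 0.99
quantized_step = 0.98

for steps in (1, 5, 20):
    full = chain_success(full_precision_step, steps)
    quant = chain_success(quantized_step, steps)
    print(f"{steps:>2} steps: full {full:.3f}  quantized {quant:.3f}  gap {full - quant:.3f}")
```

A one-point loss per step is invisible on a single-step task and becomes a double-digit gap over twenty steps. The same per-weight coarseness produces very different outcomes depending on how long the chain is.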

Bit-width is not one decision for your whole product. It is a decision per task, and the same 4-bit model can be flawless on one job and visibly broken on the next.

At Devlyn this plays out concretely. Our intent classifier and our preference-extraction step run quantized at 4-bit and we cannot find the quality difference in our evals. The step that reasons over a prescription and a set of frame constraints stays at higher precision, because when we quantized it as an experiment we lost a few points on the numeric sub-eval and that step has to be right in front of a customer. Same product, two different precisions, one eval suite that told us where the line was. Those numbers are illustrative of the shape, not a published benchmark, but the decision process is exactly what I would defend in a review.

The memory and throughput math

The memory story is the easy one. Going from 16-bit to 4-bit cuts model size by roughly 4x; 8-bit cuts it by about 2x. A 70-billion-parameter model needs about 140GB for its weights at 16-bit and roughly 35GB at 4-bit, which fits on a single 80GB GPU. That single fact, "it fits on one card," is what turns a lot of self-hosting math from impossible to routine, because you stop paying for multi-GPU sharding and the interconnect headaches that come with it.
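The weight-memory arithmetic is one line. This sketch counts weights only and ignores the KV cache, activations, and the small overhead of quantization scales, so treat the results as a floor.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (decimal), ignoring scales and metadata."""
    return params_billions * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"70B at {bits:>2}-bit: {weight_memory_gb(70, bits):.0f} GB")
```

Only the 4-bit figure leaves comfortable headroom on an 80GB card, which is why 4-bit, not 8-bit, is the format that changes the single-GPU question for a model this size.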

Throughput is more interesting and more misunderstood. Quantization speeds up inference because you move less data per token, but how much speedup you get depends heavily on the inference kernel, not just the format. In one H200 benchmark, the exact same AWQ weights served 68 tokens per second under a default vLLM path and 741 tokens per second under an optimized Marlin kernel, a roughly 10x difference from kernel choice alone, with the optimized path running about 1.6x faster than the FP16 baseline (The AI Engineer, 2026).
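Working through the ratios in that benchmark is worth a moment. The FP16 throughput below is not a reported number; it is implied by the quoted 1.6x figure, so treat it as my derivation.

```python
default_tps = 68          # AWQ weights, default vLLM path (quoted)
marlin_tps = 741          # same weights, Marlin kernel (quoted)
marlin_vs_fp16 = 1.6      # quoted speedup of the optimized path over FP16

kernel_gap = marlin_tps / default_tps
implied_fp16_tps = marlin_tps / marlin_vs_fp16   # derived, not reported
default_vs_fp16 = default_tps / implied_fp16_tps

print(f"kernel gap:          {kernel_gap:.1f}x")
print(f"implied FP16:        {implied_fp16_tps:.0f} tokens/s")
print(f"default path / FP16: {default_vs_fp16:.2f}x")
```

If that derivation holds, the unoptimized path was not merely a smaller win than expected. It was several times slower than not quantizing at all.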

The lesson there is one I have learned the expensive way: if you quantize, benchmark, and conclude "quantization didn't speed anything up," you probably benchmarked the kernel, not the format. The format determines the ceiling. The serving stack determines whether you reach it. Most disappointing quantization results are serving problems wearing a quantization costume.

When self-hosting a quantized model beats an API

This is the section that actually moves money, so I want to be honest about both sides of it. The reason quantization matters to a CRO is that it changes the break-even point between paying a hosted API per token and running your own model on your own GPUs.

The arithmetic is roughly this. A dedicated H100 runs about $2 to $2.80 per hour on demand, which is roughly $1,460 to $2,040 a month (about 730 hours) if you keep it busy around the clock. Quantization lets a strong open model fit on that one card. Against frontier API pricing, self-hosting tends to break even somewhere in the low millions of tokens per day; one widely cited comparison had a 4-bit Llama 70B setup costing around $4,360 a month at 500 million tokens a day versus roughly $22,500 on an API, about a 5x swing in self-hosting's favor at that volume (sourced from current self-host-versus-API cost analyses, 2026). Below the break-even volume, the API usually wins.
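Converting the quoted monthly totals into a per-million-token cost makes the comparison easier to carry around. The 30-day month is my assumption; everything else is the quoted figure.

```python
tokens_per_day_millions = 500
days_per_month = 30            # assumption
self_host_monthly = 4360       # quoted 4-bit Llama 70B setup
api_monthly = 22500            # quoted API cost at the same volume

monthly_tokens_millions = tokens_per_day_millions * days_per_month
self_per_million = self_host_monthly / monthly_tokens_millions
api_per_million = api_monthly / monthly_tokens_millions
ratio = api_monthly / self_host_monthly

print(f"self-hosted: ${self_per_million:.3f} per million tokens")
print(f"API:         ${api_per_million:.2f} per million tokens")
print(f"ratio:       {ratio:.2f}x")
```

The per-million view also lets you swap in your own provider's price and see how quickly the ratio moves.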

Now the honest counterweight, because the per-token math is a trap. Self-hosting is not just the GPU bill. Realistic estimates put ongoing maintenance, monitoring, and incident response at 10 to 20 engineering hours a month, which at senior rates is another $750 to $3,000, and the all-in cost of a self-hosted deployment commonly runs 3 to 5x the raw GPU price once you count it all. A break-even calculation that ignores the engineer is a break-even calculation that will be wrong in the direction that hurts.
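Here is what those overheads do to the comparison at 500 million tokens a day. The hourly rates of $75 and $150 are my reading of the quoted $750 to $3,000 range, and applying the 3 to 5x multiplier to the $4,360 figure assumes that figure is raw GPU cost; both are assumptions.

```python
gpu_monthly = 4360    # quoted self-hosted cost, treated here as raw GPU spend
api_monthly = 22500   # quoted API cost at the same volume


def with_labor(gpu_cost, hours, hourly_rate):
    """GPU bill plus the engineer who keeps it running."""
    return gpu_cost + hours * hourly_rate


def with_multiplier(gpu_cost, multiplier):
    """All-in cost expressed as a multiple of the raw GPU bill."""
    return gpu_cost * multiplier


scenarios = {
    "GPU only": gpu_monthly,
    "plus 10h at $75": with_labor(gpu_monthly, 10, 75),
    "plus 20h at $150": with_labor(gpu_monthly, 20, 150),
    "all-in at 3x": with_multiplier(gpu_monthly, 3),
    "all-in at 5x": with_multiplier(gpu_monthly, 5),
}

for label, cost in scenarios.items():
    print(f"{label:<18} ${cost:>6,}/month  API costs {api_monthly / cost:.2f}x as much")
```

Under the 5x assumption the advantage at this volume nearly disappears, so the multiplier you believe matters more than the GPU price you negotiate.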

The per-token math says self-hosting wins at volume. The all-in math, including the engineer who babysits it, moves the break-even point and is the only version worth presenting to a board.

So the real rule is: self-hosting a quantized model pays when you have sustained high volume, a latency or data-residency requirement that an API cannot meet, and the engineering capacity to operate it. If you are below a few million tokens a day, or you cannot staff the operations, the API is cheaper even though the per-token number looks worse. This is the same outcomes-over-velocity discipline I apply to model sizing in the case for shipping smaller models: the cheapest token is the one you serve on infrastructure you can actually run.

If you want this break-even modeled against your real traffic and your real quality bar rather than a blog post's averages, that is the kind of work the Devlyn engineering team does on inference economics.

A comparison you can paste into a deck

Here is the practical decision in one table: the method, the bit-width it is usually run at, the quality you typically keep, and where it fits. Quality figures are directional, drawn from the public benchmarks cited above, not a guarantee for your model.

| Method | Typical bit-width | Quality retention | Best use case |
|---|---|---|---|
| FP8 (E4M3) | 8-bit float | Near-lossless | The safe default on FP8-capable GPUs |
| INT8 (W8A8) | 8-bit int | ~1-3% drop | Conservative speedup, broad task safety |
| AWQ | 4-bit, weight-only | ~99% on most tasks | 4-bit when you want to protect quality |
| GPTQ | 4-bit, weight-only | Within ~6% perplexity; weaker on code | Fast to produce, GPU serving at scale |
| GGUF (Q4_K_M) | 4-bit, weight-only | ~92% of perplexity | CPU, edge, mixed fleets, Apple Silicon |

If you take one thing from the table: start at 8-bit when in doubt, reach for AWQ when you need 4-bit on a GPU and care about quality, and use GGUF when your hardware is not a data-center NVIDIA card. Everything else is tuning.

The pitfalls nobody puts on the slide

Quantization fails in quiet, specific ways, and the failures rarely show up in the headline accuracy number. These are the ones that have cost me time.

Calibration-set mismatch. PTQ methods tune the quantization using a small calibration dataset. If that data does not resemble your production traffic, the model is optimized for the wrong distribution and degrades on exactly the inputs you care about. Calibrate on your data, not on whatever sample shipped with the tool.

The KV cache is a separate decision. Quantizing the weights does not quantize the key-value cache, which is its own large and growing chunk of memory at long context. Teams quantize weights, celebrate the memory win, then hit an out-of-memory wall on long conversations because the cache was never touched. Treat KV-cache quantization as its own line item with its own quality check.
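The cache size is easy to estimate before it surprises you. The layer count, KV-head count, and head dimension below are assumptions matching a typical 70B-class model with grouped-query attention; substitute the values from your own model's config.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Bytes held by the KV cache: one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


GIB = 2 ** 30
# Assumed architecture: 80 layers, 8 KV heads, head dimension 128.
config = dict(layers=80, kv_heads=8, head_dim=128)

for batch in (1, 8):
    fp16 = kv_cache_bytes(**config, seq_len=32768, batch=batch, bytes_per_value=2)
    int8 = kv_cache_bytes(**config, seq_len=32768, batch=batch, bytes_per_value=1)
    print(f"batch {batch}: FP16 cache {fp16 / GIB:.0f} GiB, 8-bit cache {int8 / GIB:.0f} GiB")
```

Under these assumptions, eight concurrent 32K-token conversations need more cache memory than the 4-bit weights themselves, and none of it shrinks when you quantize the weights.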

You must evaluate the quantized model, not the original. The most common mistake I see is running the eval suite on the full-precision model, approving it, and then deploying the quantized one. You validated a model you are not shipping. Quantization is a model change, so it goes through the same gate as any other model change, on a frozen, production-sampled set. If you do not have that gate yet, build it first; my guide to LLM evaluation covers how, and it is the prerequisite for quantizing safely.
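A gate like that can start very small. In this sketch the task names, scores, and noise tolerances are all hypothetical placeholders, not our production numbers; the point is the shape: per-task comparison of the quantized model against the baseline, with a tolerance set from your eval's run-to-run noise.

```python
def quantization_gate(baseline, quantized, noise):
    """Return the tasks whose score drop exceeds that task's noise tolerance."""
    failures = {}
    for task, base_score in baseline.items():
        drop = base_score - quantized[task]
        if drop > noise[task]:
            failures[task] = round(drop, 2)
    return failures


# Hypothetical eval results on a frozen, production-sampled set.
baseline = {"intent_classification": 94.0, "preference_extraction": 91.5, "numeric_reasoning": 88.0}
quantized = {"intent_classification": 93.6, "preference_extraction": 91.2, "numeric_reasoning": 84.5}
noise = {"intent_classification": 1.0, "preference_extraction": 1.0, "numeric_reasoning": 1.0}

failures = quantization_gate(baseline, quantized, noise)
print("blocked tasks:", failures or "none")
```

Because the gate reports per task rather than as one averaged score, it naturally produces the mixed-precision answer: ship the quantized model for the tasks that pass, keep higher precision for the ones that do not.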

Kernel and hardware lock-in. The fast path for a given quantized format often depends on a specific kernel and a specific GPU generation. A format that flies on an H200 may crawl on older hardware or in a different serving stack. Pin down the format-plus-kernel-plus-hardware combination before you commit, because the quality and the speed both depend on all three.

None of these are reasons not to quantize. They are reasons to quantize with a measurement loop attached. Quantization without evals is just hoping the trade went your way, and hope is not a cost strategy. If you want the deeper treatment of sizing the model down before you compress it, the companion read is my book on small models, and the question of whether to fine-tune the quantized model or leave it alone is its own decision I work through in To Fine-Tune or Not.

Frequently asked questions

What is LLM quantization in simple terms? It is storing a model's numbers at lower precision, dropping from 16-bit floats to 8-bit or 4-bit, so the model uses less memory and runs cheaper. You trade a small amount of accuracy for a large reduction in size and cost. On most tasks the accuracy you lose is too small to matter; on math and reasoning tasks it can be large enough to matter a lot.

Does 4-bit quantization hurt model quality? Sometimes, and it depends entirely on the task. On summarization, classification, and extraction the quality drop is usually inside your eval noise. On code, multi-step reasoning, and long-context faithfulness it is measurable, often a few points on the relevant benchmark. The only honest answer comes from running your own eval suite on the quantized model.

Which is better, GPTQ or AWQ? For 4-bit GPU serving where you care about quality, AWQ usually holds accuracy slightly better because it protects the small set of salient weight channels that matter most. GPTQ is fast to produce and serves well at scale but tends to lose a bit more on code and reasoning. If you are on CPU or Apple Silicon, neither applies and you want GGUF.

When is self-hosting a quantized model cheaper than an API? When you have sustained high volume (commonly in the low millions of tokens per day against frontier pricing), plus a latency or data-residency need an API cannot meet, plus the engineering capacity to operate the stack. Below that volume, or without the operations capacity, the per-token API price usually wins once you count the real cost of running your own infrastructure.

Quantization is one of the clearest examples of the principle that runs through all of my inference-cost work: the model is an input to a system you design, not the system itself. Measure the trade, quantize the tasks that survive it, and keep full precision where the chain has to hold. If you would rather have a team instrument the evals and model the break-even against your own traffic, that is what Devlyn's observability and monitoring work is built for. Compress what you can measure. Leave alone what you cannot.
