LLM Evaluation Metrics That Matter (and the Ones That Lie)

The LLM evaluation metrics that matter measure what breaks in production. The ones that lie measure what looks good in a deck. Here is how to tell them apart.

The metrics that matter are task accuracy on a frozen, production-sampled set; human-disagreement rate; faithfulness; latency at p95; and cost per resolved task. The ones that lie are aggregate accuracy on a set you keep editing and vanity benchmark scores like MMLU. The first group tells you whether to ship. The second group tells you whether you feel good. Those are different questions.

I have watched a team celebrate 94% accuracy on a Friday and roll back the model on Monday. Nothing about the model changed over the weekend. The metric was always lying; the weekend just gave production enough time to prove it. This piece is about how to tell the two kinds of LLM metrics apart before a customer does it for you. It is the metric chapter of my complete guide to LLM evaluation; for the harness these metrics plug into, see how to build an LLM evaluation framework.

A metric that cannot go down is not measuring your model. It is measuring your willingness to edit the test.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Aggregate accuracy on an elastic eval set is the most common lie in AI. If the set grows whenever the number dips, the number is about the test, not the model.
Faithfulness and human-disagreement rate predict production failures that accuracy hides. Fluent and wrong still scores as a pass on a loose rubric.
Vanity benchmarks like MMLU are saturated. Frontier models cluster at 88-93%, so the number cannot rank them on your task.
Cost per resolved task is the one metric the business should see. It ties evaluation directly to the P&L.

Why most LLM metrics lie: the elastic ruler

The single biggest reason an LLM evaluation metric lies is that the set under it keeps changing. A developer adds easy cases when the score dips. A PM quietly drops the case that always fails. The number climbs, and everyone reads it as the model improving. It is not. The ruler got shorter.

I cover the mechanics of freezing and versioning a set in my essay on evals that predict production. The short version: sample your eval set from real production traffic, freeze it as a named artifact, and never let it grow organically. On a frozen set the score is free to fall when the model gets worse, which is exactly what makes it honest. You want a fixed ruler, not a rubber band.
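
One way to make "freeze it as a named artifact" concrete is to fingerprint the set when you cut it and check the fingerprint before every run. The sketch below is a minimal illustration, not a prescribed harness: `freeze_eval_set`, `verify_frozen`, the seed, and the `production_log` records are all hypothetical names and data.

```python
import hashlib
import json
import random


def freeze_eval_set(records, sample_size, seed):
    """Sample records reproducibly and return (jsonl_text, sha256_hex)."""
    rng = random.Random(seed)  # fixed seed, so the same sample can be re-cut
    sample = rng.sample(records, sample_size)
    # sort_keys gives every record a stable byte representation for hashing
    text = "\n".join(json.dumps(r, sort_keys=True) for r in sample) + "\n"
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest


def verify_frozen(text, expected_digest):
    """Return True only if the set is byte-for-byte what was frozen."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest() == expected_digest


# Hypothetical production log: 1,000 anonymized tickets.
production_log = [{"id": i, "input": f"ticket {i}"} for i in range(1000)]
frozen_text, frozen_digest = freeze_eval_set(production_log, sample_size=200, seed=24)

print(verify_frozen(frozen_text, frozen_digest))                   # untouched set
print(verify_frozen(frozen_text + '{"id": -1}\n', frozen_digest))  # one easy case added
```

The second check fails because a single appended case changes the hash. Refusing to run when the digest does not match is what stops the ruler from quietly getting shorter.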

This is also why benchmark scores belong in the "lies" column for production decisions. MMLU and HumanEval are saturated: GPT-5, Claude Opus, and Gemini all cluster in the high 80s and low 90s, so the score range has compressed until noise exceeds signal (benchmarkingagents.com). Worse, popular benchmark questions leak into training data, so a model can recall the answer instead of reasoning to it. A high MMLU score may only tell you the model has seen the test. It tells you nothing about your traffic.

Reference-based vs reference-free metrics: where each one lies

Most LLM evaluation metrics split into two families, and knowing which family you are holding tells you most of where it can lie. Reference-based metrics compare the output to a known correct answer. Reference-free metrics judge the output on its own, with no answer key. You need both, and you mislead yourself when you reach for the wrong one.

Reference-based metrics include exact match, BLEU, ROUGE, and BERTScore. BLEU and ROUGE measure surface word overlap; BERTScore measures embedding similarity. They are fast, cheap, and deterministic, which makes them tempting. They also miss a correct answer phrased differently, and they reward a wrong answer that happens to share words with the reference. Use them only where there is a tight expected output, like translation or extraction, and never as the gate for an open-ended chatbot. A high BLEU score on a support reply tells you the model copied the template, not that it solved the ticket.
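
Both failure modes are easy to reproduce. The sketch below uses a simple token-overlap F1 as a stand-in for the overlap family; it is not BLEU or ROUGE themselves, and the support replies are invented for illustration.

```python
from collections import Counter


def tokens(text):
    """Lowercase, drop periods and commas, split on whitespace."""
    return text.lower().replace(".", "").replace(",", "").split()


def overlap_f1(candidate, reference):
    """Token-overlap F1 between a candidate answer and a reference answer."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    shared = sum((cand & ref).values())  # multiset intersection
    if shared == 0:
        return 0.0
    precision = shared / sum(cand.values())
    recall = shared / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Your refund was issued and will arrive in 5 business days."
right_but_reworded = "We sent the money back, expect it within a week."
wrong_but_similar = "Your refund was not issued and will arrive in 5 business days."

print(round(overlap_f1(right_but_reworded, reference), 2))
print(round(overlap_f1(wrong_but_similar, reference), 2))
```

The correct paraphrase shares no tokens with the reference and scores 0.0. The reply that flips the meaning with one added "not" scores about 0.96. An overlap metric used as a gate would ship the wrong answer and reject the right one.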

Reference-free metrics include faithfulness, answer relevancy, and most LLM-as-a-judge scores. They assess the output directly against the question or the source context, so they handle open-ended tasks where no single answer exists. That flexibility is also their weakness: they inherit the blind spots of whatever judge model grades them. A reference-free score is only as honest as the model behind it, which is why the judge needs its own calibration before you trust it. The practical rule: reference-based for closed tasks with a real answer key, reference-free for the open-ended traffic that makes up most production systems.

The metrics that matter, one at a time

Each metric below earns its place because it catches a failure mode the others miss. For each, here is what it measures, when it lies, and how to read it.

Task accuracy on a frozen, production-sampled set. This measures whether the model gets your real cases right, scored against a reference, on a set that does not move. It lies the moment you let the set drift or score it with a loose rubric that accepts "close enough." Read it as a trend, not a snapshot: the same questions, the same rubric, this model versus the last one. A single accuracy number with no frozen denominator behind it is a vanity figure wearing a lab coat.

Human-disagreement rate. This is the fraction of cases where the model's output diverges from a calibrated human reference beyond a tolerance you set in advance. It is the metric I report to leadership, because it is anchored to human performance and to a fixed distribution. It lies if your raters are not blinded to the model version, because reviewers give quiet benefit of the doubt to a model they helped tune. Read it directionally: up means worse, down means better, and you can open the exact cases that moved.
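
As a minimal sketch, assume each case reduces to a number, such as a 1-5 severity rating that the model assigns and that a calibrated human also assigned. Free-text outputs would need a distance function in place of the subtraction. The function name, scores, and tolerance below are hypothetical.

```python
def human_disagreement_rate(model_scores, human_scores, tolerance):
    """Return (rate, indices of the cases that diverged beyond tolerance)."""
    if not model_scores or len(model_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diverged = [
        i
        for i, (m, h) in enumerate(zip(model_scores, human_scores))
        if abs(m - h) > tolerance
    ]
    return len(diverged) / len(model_scores), diverged


# Hypothetical 1-5 ratings on ten frozen cases.
model_scores = [5, 4, 2, 5, 3, 1, 4, 5, 4, 2]
human_scores = [5, 5, 4, 5, 3, 4, 4, 4, 4, 2]

TOLERANCE = 1  # fixed before the run, not tuned after seeing the result
rate, moved = human_disagreement_rate(model_scores, human_scores, TOLERANCE)
print(f"human disagree {rate:.1%}, cases to open: {moved}")
```

Returning the indices alongside the rate is the useful part: when the number moves, you can open the exact cases that moved it.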

Faithfulness (groundedness). Faithfulness measures whether every claim in the answer can be inferred from the provided context. Ragas defines it concretely as the number of claims supported by the context divided by the total claims in the answer (Ragas docs). It catches the failure accuracy cannot see: a fluent, confident answer that contradicts the source. It lies when you score it with a cheap judge model that under-detects contradiction; in 2026, faithfulness scoring is only reliable with a strong judge model behind it. Read a low faithfulness score as a hallucination alarm, not a style note, and pair it with the discipline in my piece on measuring and reducing hallucination.
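
The ratio itself is simple arithmetic once a judge has labeled each claim. The sketch below assumes those per-claim verdicts already exist; it does not call the Ragas library, and the claims are invented for illustration.

```python
def faithfulness(claim_verdicts):
    """claim_verdicts: list of (claim_text, supported_by_context) pairs."""
    if not claim_verdicts:
        raise ValueError("answer contains no claims to score")
    supported = sum(1 for _, ok in claim_verdicts if ok)
    return supported / len(claim_verdicts)


# Hypothetical judge output for one answer: each claim, and whether the
# retrieved context supports it.
verdicts = [
    ("Refunds take 5 business days", True),
    ("Refunds go back to the original card", True),
    ("Refunds over $500 need manager approval", False),  # absent from context
    ("The refund was issued on Monday", True),
]

score = faithfulness(verdicts)
unsupported = [claim for claim, ok in verdicts if not ok]
print(score, unsupported)
```

Three of four claims are supported, so the score is 0.75. The hard part is not this division; it is whether the judge that produced the `True` and `False` labels can be trusted to spot a contradiction.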

Latency at p95. Average latency is a comfort metric. The 95th percentile is the truth, because your slowest 5% of requests are where users rage-quit and where timeouts cascade. It lies only when you report the mean instead. Read p95 as a hard product constraint: a model that is 2% more accurate and 600ms slower at p95 may lose you more revenue than it earns.
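
To see how far the mean can drift from what slow requests feel like, here is a small sketch using the nearest-rank definition of a percentile, which is one of several common definitions. The latency sample is made up.

```python
import math
import statistics


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with pct% of data at or below it."""
    if not values:
        raise ValueError("no latencies recorded")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: most requests are quick, 6% hit a slow path.
latencies_ms = [800] * 90 + [900] * 4 + [4000] * 6

mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(f"mean {mean_ms} ms, p95 {p95_ms} ms")
```

The mean comes out at 996 ms, which looks healthy. The p95 is 4,000 ms, which is where the timeouts are.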

Cost per resolved task. More on this below, because it is the metric that connects the whole harness to the business.

A comparison you can paste into a deck

Here is the same set of LLM evaluation metrics in one table: what each one measures, and whether it tells the truth about production.

Metric	What it measures	When it lies	Verdict
Task accuracy (frozen set)	Correct answers on real, version-locked cases	When the set drifts or the rubric goes loose	Matters
Human-disagreement rate	Model vs. calibrated human on a fixed set	When raters are not blinded to model version	Matters
Faithfulness	Claims supported by the provided context	When scored by a weak judge model	Matters
Latency p95	Worst-case response time users actually feel	When you report the mean instead	Matters
Cost per resolved task	Spend divided by tasks fully handled	When you count attempts, not resolutions	Matters
Aggregate accuracy (elastic set)	Average correctness on a set you keep editing	Whenever the set changes to chase the number	Lies
Benchmark score (MMLU, etc.)	Performance on a public, saturated test	Contamination and saturation; not your traffic	Lies

How to read AI evaluation metrics together, not alone

No single metric gates a deploy. A model can ace task accuracy and still fail faithfulness, which means it is confidently wrong on the cases where it diverges. A model can win on faithfulness and lose on p95, which means it is honest and too slow to use. The metrics are a panel, and the panel disagreeing is itself a signal.

Here is what that panel looks like coming out of a real eval runner, scored against a frozen set. The numbers are realistic, not from a specific live system.

# eval run against frozen set eval-set-2026-w24-v1.jsonl
python -m eval.runner \
  --suite eval-set-2026-w24-v1.jsonl \
  --model prod-candidate-2026-06-15

# metrics summary
task accuracy    0.883       # frozen set, up 0.006 vs prior
human disagree   6.8%        # threshold 8.0%, PASS
faithfulness     0.91        # threshold 0.90, PASS
latency p95      1,920 ms    # +180 ms vs baseline, FLAG
cost / resolved  $0.041      # up from $0.034, FLAG
verdict          GATE CLEAR  # pending p95 + cost review

Notice that two metrics flag without blocking. The point of a flag is to make a trade-off visible instead of letting it ship silently. Accuracy went up, and so did latency and cost. Whether that trade is worth it is a business decision, and the harness puts the numbers in front of the people who should make it.
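
The difference between a flag and a block can be written down in a few lines. This is a sketch of one possible gate, not the runner shown above: the `Check` class and `run_gate` are hypothetical, the choice of which checks block is an assumption, and the latency and cost budgets are invented because the run only reports those two against a baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    value: float
    limit: float
    higher_is_better: bool
    blocking: bool  # True: a miss blocks the deploy. False: a miss only flags.


def run_gate(checks):
    """Return ({metric: 'PASS' | 'FLAG' | 'BLOCK'}, overall verdict)."""
    results = {}
    for c in checks:
        ok = c.value >= c.limit if c.higher_is_better else c.value <= c.limit
        results[c.name] = "PASS" if ok else ("BLOCK" if c.blocking else "FLAG")
    verdict = "GATE BLOCKED" if "BLOCK" in results.values() else "GATE CLEAR"
    return results, verdict


checks = [
    Check("human disagree", 0.068, 0.080, higher_is_better=False, blocking=True),
    Check("faithfulness", 0.91, 0.90, higher_is_better=True, blocking=True),
    # The two limits below are hypothetical budgets, not values from the run.
    Check("latency p95 ms", 1920, 1800, higher_is_better=False, blocking=False),
    Check("cost / resolved", 0.041, 0.036, higher_is_better=False, blocking=False),
]

results, verdict = run_gate(checks)
print(results, verdict)
```

With these inputs the quality checks pass, latency and cost flag, and the gate stays clear. Limits are set before the run, so nobody gets to move them after seeing the numbers.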

The one metric the business should see: cost per resolved task

Cost per resolved task is total inference spend divided by the number of tasks the system fully handled without a human finishing the job. It is the metric I put in front of revenue and finance, because it converts every engineering choice into money.

It lies in exactly one way, and the way is common: counting attempts instead of resolutions. A cheaper model that resolves 70% of tickets is often not cheaper than a pricier model that resolves 90%: its spend is spread over fewer resolutions, and a human still has to clean up the other 30%. The token cost per call dropped. The cost per resolved task went up. Teams optimize the first number and wonder why the support line got more expensive.
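
A worked example makes the gap visible. Every number below is hypothetical: the per-call prices, the ticket volume, and the $4.00 cost of a human finishing a ticket are chosen only to illustrate the arithmetic.

```python
def cost_per_resolved_task(total_tasks, cost_per_call, resolution_rate):
    """Inference spend divided by the tasks the model fully handled."""
    resolved = total_tasks * resolution_rate
    if resolved <= 0:
        raise ValueError("no resolved tasks to divide by")
    return total_tasks * cost_per_call / resolved


def loaded_cost_per_task(total_tasks, cost_per_call, resolution_rate, human_cost):
    """Inference spend plus human cleanup, averaged over every task."""
    unresolved = total_tasks * (1 - resolution_rate)
    return (total_tasks * cost_per_call + unresolved * human_cost) / total_tasks


TASKS = 1000
HUMAN_COST = 4.00  # hypothetical cost of a human finishing one ticket

cheap = cost_per_resolved_task(TASKS, cost_per_call=0.030, resolution_rate=0.70)
pricey = cost_per_resolved_task(TASKS, cost_per_call=0.035, resolution_rate=0.90)
cheap_loaded = loaded_cost_per_task(TASKS, 0.030, 0.70, HUMAN_COST)
pricey_loaded = loaded_cost_per_task(TASKS, 0.035, 0.90, HUMAN_COST)

print(f"per resolved task: cheap ${cheap:.4f}, pricey ${pricey:.4f}")
print(f"with human cleanup: cheap ${cheap_loaded:.3f}, pricey ${pricey_loaded:.3f}")
```

Under these assumptions the cheap model costs about $0.043 per resolved task against about $0.039 for the pricier one, even though each call is half a cent cheaper. Once the human cleanup is priced in, the gap widens to roughly $1.23 versus $0.44 per ticket.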

Token cost is what the model charges you. Cost per resolved task is what the model costs you. Only one of them is on the P&L.

This is also where evaluation stops being an engineering hobby and becomes a revenue lever. When you can say "this model resolves 8% more tasks at $0.007 less per resolution," you have turned an eval run into a margin argument. That is the sentence that gets an AI project funded, and the sentence most teams cannot say because they never measured resolution, only accuracy. Tracking cost per resolved task in production is squarely an AI observability and monitoring problem, not a one-off spreadsheet.

One honest trade-off: cost per resolved task is harder to compute than token cost. You need a reliable definition of "resolved," which often means instrumenting downstream outcomes and accepting some noise in attribution. It is worth the trouble. A metric that is approximately right about money beats one that is precisely right about tokens.

Where these metrics still fall short

Even the metrics that matter have a ceiling. A frozen set drifts from reality as the world changes, so it needs a scheduled refresh, versioned the way code releases are versioned. Faithfulness scoring inherits the blind spots of whatever judge model grades it. And cost per resolved task depends on a "resolved" definition that a product team has to own and defend.

None of this argues for a human reviewing every output instead. That does not scale, and I make the full case in why a human in the loop is not a plan. The answer is not more review. It is a metric panel that earns the right to gate a deploy, plus a human who designs and audits that panel. The machine does the work. The human evaluates the work, and the metrics are how the evaluation scales.

Frequently asked questions

What LLM evaluation metrics actually matter? Five: task accuracy on a frozen production-sampled set, human-disagreement rate, faithfulness, latency at p95, and cost per resolved task. They matter because each catches a production failure the others miss, and none of them improves just because you edited the test.

How do I measure LLM performance without lying to myself? Freeze your eval set, sample it from real traffic, blind your raters to the model version, and set pass thresholds before the run, not after. Read every metric as a trend on a fixed ruler, and never let the set grow to rescue a number.

Are benchmark scores like MMLU useful AI evaluation metrics? For ranking frontier models on your task, no. MMLU and HumanEval are saturated and contaminated, so scores cluster in the high 80s and low 90s and can reflect memorization as much as reasoning. Use them for rough capability filtering, never as a deploy gate.

What is the difference between reference-based and reference-free metrics? Reference-based metrics (exact match, BLEU, ROUGE, BERTScore) compare the output to a known answer key, so they fit closed tasks like translation or extraction. Reference-free metrics (faithfulness, answer relevancy, LLM-as-a-judge) score the output directly with no answer key, so they fit the open-ended traffic most production systems handle. Use reference-based where a tight expected answer exists, reference-free everywhere else.

What is the single most important LLM metric for the business? Cost per resolved task. It divides total inference spend by tasks fully handled, so it captures quality and price in one number and puts evaluation directly on the P&L.

If you want the full harness these metrics plug into, including label-blinding protocols and how to handle disagreement, my book A Field Guide to Evals walks through it end to end. And if you would rather have a team instrument cost per resolved task and the rest of this panel in your stack from day one, that is exactly what Devlyn's AI observability and monitoring work is for. Measure what breaks. Ignore what flatters.