How to Measure (and Reduce) Hallucination
Measure hallucination as faithfulness against a source on a frozen set, then reduce it with grounding, constrained decoding, and calibrated abstention.
Hallucination is a property of language models to manage, not a bug you patch once and forget. You measure it as faithfulness, or groundedness, against a known source on a frozen evaluation set. Then you reduce it with three interventions that earn their keep: retrieval grounding, constrained decoding, and calibrated abstention. The order is not optional. You cannot reduce what you do not measure, and most teams skip straight to reduction with no ruler in hand.
I have watched a team ship a "fixed" model after a week of prompt-tweaking, with no number to show the fix worked. It didn't. The complaints just moved to a different question class. Measuring hallucination first would have told them that in an afternoon. This piece is the harness I trust for measuring hallucination, and the interventions that move the number versus the ones that are theater. If you want the wider discipline this sits inside, start with my guide to LLM evaluation; this article is the hallucination-specific cut of it.
Key takeaways
If you read nothing else, these are the load-bearing claims:
- Hallucination is the model guessing fluently when it should abstain. It is mechanical, not mystical, and it is measurable.
- Measure faithfulness on a frozen set before you touch the model. Faithfulness is the fraction of claims in an answer that the source supports.
- Retrieval grounding, constrained decoding, and calibrated abstention measurably help. Bigger prompts and "be accurate" instructions are theater.
- Confident-and-wrong is the expensive failure. A model that abstains costs you a deflection; a model that fabricates costs you a customer.
What hallucination actually is, mechanically
A language model predicts the next token from a distribution. It does not look anything up by default. When the training data thins out around a fact, the distribution stays smooth and confident anyway, and the model emits the most likely-sounding continuation. That fluent guess is a hallucination. There is no internal alarm that fires when the model leaves the region it actually knows.
The mechanism gets worse because of how we score models. OpenAI's 2025 paper argues that hallucination persists largely because evaluation rewards it: binary scoring treats "I don't know" the same as a wrong answer, so a model that guesses outscores one that abstains (Kalai et al., arXiv 2509.04664). Their numbers are stark. An older model abstained on 1% of cases and was wrong on 75%; a newer one trained to abstain hit a 52% abstention rate with 26% errors, at similar accuracy. We trained the guessing in. That is the thing we now have to measure out.
How to measure hallucination: faithfulness and groundedness
Measuring hallucination means scoring an answer against a source it is supposed to be true to. The 2026 evaluation stack splits this into three checks, and they are not interchangeable (Braintrust, 2026):
- Groundedness checks the answer against the specific retrieved passages you put in the context.
- Faithfulness checks the answer against the full source text, catching claims that drift during summarizing or rewriting.
- Factuality checks a claim against general world knowledge, with no provided source.
For any retrieval-grounded system, faithfulness is the metric I gate on. Ragas defines it concretely: break the answer into atomic claims, then divide the number of claims the context supports by the total number of claims in the answer (Ragas docs). A faithfulness of 0.78 means 22% of what the model asserted, roughly a fifth, was not in the source it was handed. That unsupported share, 1 minus faithfulness, is your hallucination rate, expressed as a number you can track.
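The ratio itself is simple enough to show in a few lines. This is a minimal sketch, assuming the claim verdicts have already been produced by some judge; `ClaimVerdict` and the sample claims are hypothetical names for illustration, not part of the Ragas API.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One atomic claim from an answer and whether the context supports it."""
    claim: str
    supported: bool


def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Supported claims divided by total claims in the answer."""
    if not verdicts:
        raise ValueError("answer produced no claims to score")
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hypothetical answer with 9 atomic claims, 7 of them supported by the context.
verdicts = [ClaimVerdict(f"claim {i}", supported=i < 7) for i in range(9)]
score = faithfulness(verdicts)
print(f"faithfulness={score:.2f} unsupported={1 - score:.2f}")
# prints: faithfulness=0.78 unsupported=0.22
```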
The hard rule: measure faithfulness on a frozen, production-sampled set, not a set you keep editing. I make the full case for freezing the ruler in my essay on evals that predict production. A frozen set cannot be edited to flatter the model, so its score moves only when the system does, which is exactly what makes it honest about hallucination. If your hallucination number only ever improves, you are measuring your willingness to edit the test.
Detection methods that work in production
You cannot have a human read every answer for fabrication. That does not scale, and I make the case in my piece on why a human in the loop is not a plan. Detection has to be automatic and continuous. Three methods carry the load in 2026.
Claim extraction plus verification. This is the engine under faithfulness scoring. A judge model breaks the answer into atomic claims, then verifies each one against the retrieved context, and you report the supported fraction. It is the most direct signal because it tells you which claim hallucinated, not just that something did. The cost is real: it is several extra model calls per answer, so you sample in production rather than scoring every request.
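A rough sketch of that loop, with the judge calls injected as plain callables so the structure is visible. Everything named here is hypothetical: `score_answer`, `should_score`, and especially `toy_extract` and `toy_verify`, which are string-matching stand-ins so the example runs without a model. A real pipeline would put a judge model behind both.

```python
import hashlib
from typing import Callable, Optional

Extractor = Callable[[str], list[str]]  # answer -> atomic claims
Verifier = Callable[[str, str], bool]   # (claim, context) -> supported?


def score_answer(
    answer: str, context: str, extract: Extractor, verify: Verifier
) -> tuple[Optional[float], list[str]]:
    """Return (supported fraction, unsupported claims) for one answer."""
    claims = extract(answer)
    if not claims:
        return None, []  # nothing asserted, so nothing to score
    unsupported = [c for c in claims if not verify(c, context)]
    return (len(claims) - len(unsupported)) / len(claims), unsupported


def should_score(request_id: str, sample_rate: float) -> bool:
    """Deterministically pick roughly `sample_rate` of requests for scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_rate * 10_000


# Toy stand-ins (assumptions, NOT a real judge): split on periods,
# then call a claim supported only if it appears verbatim in the context.
def toy_extract(answer: str) -> list[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_verify(claim: str, context: str) -> bool:
    return claim.lower() in context.lower()


context = "Refunds are available within 30 days. Shipping is free over $50."
answer = "Refunds are available within 30 days. Refunds are paid in cash."
score, unsupported = score_answer(answer, context, toy_extract, toy_verify)
print(score, unsupported)
```

Returning the unsupported claims alongside the score is the point: the list is what you hand to whoever has to fix retrieval or the prompt. Hashing the request ID keeps the sample stable across reruns, which is one reasonable choice among several.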
LLM-as-a-judge against a rubric. A strong model grades the answer for groundedness on a fixed scale. It is cheaper than full claim extraction and good for trend monitoring. The catch is that a weak judge under-detects contradiction, so the grade is only reliable with a strong reference model behind it. I cover when to trust the grader in my piece on when to trust LLM-as-a-judge.
Self-consistency sampling. Generate an answer several times at nonzero temperature and measure agreement. When a model knows a fact, the samples converge; when it is hallucinating, they scatter. Disagreement across samples is a cheap, model-internal signal of low confidence, and it needs no source document to compute.
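The agreement measure can be as simple as the share of samples that match the most common answer. This is a minimal sketch, assuming short answers where exact match after normalization is meaningful; free-form answers would need a semantic comparison instead. The `agreement` function and the sample lists are illustrative, not output from a real model.

```python
from collections import Counter


def agreement(samples: list[str]) -> float:
    """Fraction of samples that match the most common normalized answer."""
    if not samples:
        raise ValueError("need at least one sample")
    # Normalize case and whitespace so trivial variations count as agreement.
    normalized = [" ".join(s.lower().split()) for s in samples]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


# Illustrative samples, as if drawn at nonzero temperature.
converged = ["Paris", "paris", "Paris ", "Paris", "Paris"]
scattered = ["1947", "1952", "1949", "1947", "1961"]

print(agreement(converged))  # 1.0
print(agreement(scattered))  # 0.4
```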
Here is what a faithfulness run looks like coming out of an eval runner, scored against a frozen set. The numbers are realistic, not from a specific live system.
The line that matters is the last data row. Confident-and-wrong, the cases where the model asserted something the source did not support and gave no hedge, is the failure that costs money. A 0.88 faithfulness looks fine in a deck. A 3.1% confident-wrong rate is the number that should hold the deploy.
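Both headline numbers can be derived from per-answer records. The sketch below is one way to do it, with assumptions stated: the field names are hypothetical, faithfulness is averaged per answer rather than pooled across claims, and "hedged" is taken as a flag your pipeline already sets. The records are invented for illustration and are not the run described above.

```python
from dataclasses import dataclass


@dataclass
class ScoredAnswer:
    total_claims: int      # atomic claims extracted from the answer
    supported_claims: int  # claims the source supports
    hedged: bool           # True if the answer abstained or flagged uncertainty


def summarize(answers: list[ScoredAnswer]) -> dict[str, float]:
    """Mean per-answer faithfulness plus the confident-and-wrong rate."""
    if not answers or any(a.total_claims <= 0 for a in answers):
        raise ValueError("every scored answer needs at least one claim")
    per_answer = [a.supported_claims / a.total_claims for a in answers]
    # Confident-and-wrong: at least one unsupported claim and no hedge.
    confident_wrong = sum(
        1 for a in answers if a.supported_claims < a.total_claims and not a.hedged
    )
    return {
        "faithfulness": sum(per_answer) / len(per_answer),
        "confident_wrong_rate": confident_wrong / len(answers),
    }


def passes_gate(summary: dict[str, float], max_confident_wrong: float) -> bool:
    """Hold the deploy when confident-and-wrong exceeds the budget."""
    return summary["confident_wrong_rate"] <= max_confident_wrong


answers = [
    ScoredAnswer(5, 5, hedged=False),
    ScoredAnswer(4, 3, hedged=False),  # unsupported claim, no hedge
    ScoredAnswer(4, 2, hedged=True),   # unsupported claims, but hedged
    ScoredAnswer(2, 2, hedged=False),
]
summary = summarize(answers)
print(summary, passes_gate(summary, max_confident_wrong=0.01))
```

The 1% budget passed to `passes_gate` is a placeholder. The right threshold depends on what a fabricated answer costs in your product.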
Interventions that measurably help vs. theater
Once you have a faithfulness number, you can tell a real intervention from a comfortable one. The field shifted in 2026 from chasing a truthful model to building a truthful system: retrieval, validation, calibration, and structural constraints around a base model that will always guess if you let it (Lakera, 2026). Three interventions move the number.
Retrieval grounding. Give the model the source instead of its memory. Retrieve from a verified knowledge base, rerank with a cross-encoder, and set a similarity threshold that filters weak chunks before they reach the prompt. Grounding works because it converts a factuality problem, which the model fails, into a faithfulness problem, which you can score and gate. The trade-off is that bad retrieval grounds the model in the wrong passage, so retrieval quality becomes your new hallucination surface. I cover that decay in why RAG pipelines fail in month three.
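The threshold step is the part teams most often leave out, so here is a minimal sketch of it. It assumes the cross-encoder has already attached a relevance score to each chunk, with higher meaning more relevant; `Chunk`, `select_context`, and the 0.5 cutoff are hypothetical and would need tuning against your own frozen set.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    rerank_score: float  # assumed: cross-encoder relevance, higher is better


def select_context(
    chunks: list[Chunk], min_score: float = 0.5, top_k: int = 3
) -> list[Chunk]:
    """Drop weak chunks, then keep the top_k strongest for the prompt."""
    kept = [c for c in chunks if c.rerank_score >= min_score]
    kept.sort(key=lambda c: c.rerank_score, reverse=True)
    return kept[:top_k]


chunks = [
    Chunk("kb-12", "Refunds are available within 30 days.", 0.91),
    Chunk("kb-40", "Refunds go back to the original payment method.", 0.62),
    Chunk("kb-07", "Our office is closed on public holidays.", 0.18),
]
context = select_context(chunks)
if not context:
    # Nothing cleared the bar: abstain rather than answer from memory.
    print("abstain")
else:
    print([c.doc_id for c in context])  # ['kb-12', 'kb-40']
```

An empty result is a signal, not an error. Routing it to abstention keeps a retrieval miss from turning into a fluent guess.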
Constrained decoding. Force the output into a schema with a JSON grammar, and require a citations array and a confidence field for every claim. This kills structural hallucination outright: the model cannot invent a field or cite a document that is not in the allowed set. It is the cheapest high-leverage fix because it is a decoding constraint, not a model change.
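One way to express that constraint is a JSON Schema whose citation values are an enum of the retrieved document IDs. How the schema gets enforced at decode time depends on your inference stack and is not shown here. The `violations` function below is a standard-library stand-in that checks the same rules after the fact, and all names and field choices are assumptions for illustration.

```python
import json


def build_schema(allowed_doc_ids: set[str]) -> dict:
    """JSON Schema: every claim carries text, citations, and a confidence."""
    claim = {
        "type": "object",
        "additionalProperties": False,  # the model cannot invent a field
        "required": ["text", "citations", "confidence"],
        "properties": {
            "text": {"type": "string"},
            "citations": {
                "type": "array",
                "minItems": 1,
                "items": {"enum": sorted(allowed_doc_ids)},  # only retrieved docs
            },
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
    }
    return {
        "type": "object",
        "additionalProperties": False,
        "required": ["claims"],
        "properties": {"claims": {"type": "array", "items": claim}},
    }


def violations(raw_output: str, allowed_doc_ids: set[str]) -> list[str]:
    """Post-hoc check of the same rules; an empty list means the output conforms."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if (
        not isinstance(data, dict)
        or set(data) != {"claims"}
        or not isinstance(data["claims"], list)
    ):
        return ["top level must be an object with only a 'claims' array"]
    problems = []
    for i, claim in enumerate(data["claims"]):
        if not isinstance(claim, dict) or set(claim) != {"text", "citations", "confidence"}:
            problems.append(f"claim {i}: unexpected or missing fields")
            continue
        if not isinstance(claim["text"], str):
            problems.append(f"claim {i}: text must be a string")
        cites = claim["citations"]
        if not isinstance(cites, list) or not cites:
            problems.append(f"claim {i}: needs at least one citation")
        else:
            unknown = [c for c in cites if not isinstance(c, str) or c not in allowed_doc_ids]
            if unknown:
                problems.append(f"claim {i}: cites documents outside the allowed set: {unknown}")
        conf = claim["confidence"]
        if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
            problems.append(f"claim {i}: confidence must be a number in [0, 1]")
    return problems


allowed = {"kb-12", "kb-40"}
bad = json.dumps(
    {"claims": [{"text": "Refunds take 90 days.", "citations": ["kb-99"], "confidence": 0.9}]}
)
print(violations(bad, allowed))
```

Keeping the validator even when the decoder enforces the schema gives you a cheap tripwire if the constraint is ever misconfigured.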
Calibrated abstention. Let the model say "I don't know" and reward it for doing so. Set a confidence threshold below which the system abstains or escalates, and measure abstention as a first-class outcome, not a failure. This is the direct answer to the incentive problem: if your eval gives credit for a well-placed refusal, you stop training the guessing back in (Kalai et al.).
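To see how the scoring rule changes the ranking, here is a small sketch of an abstention-aware score: correct earns 1, "I don't know" earns 0, and a wrong answer costs a penalty. Setting the penalty to t/(1-t) is an assumption chosen so that answering has positive expected value only when the model's chance of being right exceeds t. The rates reuse the abstention and error figures quoted earlier, with the correct rate taken as the remainder.

```python
def penalty_for_threshold(t: float) -> float:
    """Wrong-answer penalty that makes guessing break even at confidence t."""
    if not 0.0 <= t < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    return t / (1.0 - t)


def expected_score(correct: float, wrong: float, abstain: float, wrong_penalty: float) -> float:
    """Expected per-question score: +1 correct, 0 abstain, -penalty wrong."""
    if abs(correct + wrong + abstain - 1.0) > 1e-9:
        raise ValueError("rates must sum to 1")
    return correct - wrong_penalty * wrong


# Rates as fractions: (correct, wrong, abstain).
guesser = (0.24, 0.75, 0.01)    # rarely abstains, often wrong
abstainer = (0.22, 0.26, 0.52)  # abstains on about half the cases

binary = 0.0                              # binary scoring: a wrong answer costs nothing
penalized = penalty_for_threshold(0.5)    # t = 0.5 gives a penalty of 1.0

print("binary:   ", expected_score(*guesser, binary), expected_score(*abstainer, binary))
print("penalized:", expected_score(*guesser, penalized), expected_score(*abstainer, penalized))
```

Under binary scoring the guesser comes out ahead. Once a wrong answer costs something, the abstainer wins by a wide margin, which is the incentive flip the paragraph above describes.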
The theater list is shorter and louder. Adding "be accurate and do not make things up" to the system prompt does not move faithfulness; the model has no mechanism to comply. Stacking few-shot examples of good answers does not teach the model where its knowledge ends. Raising the model size buys you fluency, which makes hallucinations harder to spot, not rarer. None of these survive contact with a frozen faithfulness set, which is precisely why you measure first.
| Intervention | What it fixes | Helps or theater |
|---|---|---|
| Retrieval grounding + reranking | Model lacks the fact | Helps |
| Constrained decoding (schema + citations) | Invented fields and sources | Helps |
| Calibrated abstention | Guessing when unsure | Helps |
| "Be accurate" prompt instructions | Nothing measurable | Theater |
| More few-shot examples | Style, not knowledge boundary | Theater |
| Bigger model alone | Hides hallucination behind fluency | Theater |
Calibrated abstention is the whole game
A model that abstains when it does not know is worth more than a model that is slightly more accurate and never hedges. Abstention turns an unbounded risk into a bounded cost. The system says "I can't answer that from the sources I have" and routes to a human or a fallback. That is a deflection you can price. A fabricated answer delivered with confidence is a liability you cannot.
Calibration means the model's stated confidence matches its real accuracy. When it says 90% sure, it should be right about 90% of the time. You measure this by bucketing answers by stated confidence and checking accuracy within each bucket. A well-calibrated model lets you set one threshold and trust it. An overconfident model makes every threshold a gamble, which is the state most untuned models ship in.
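The bucketing check can be sketched in a few lines. This assumes each record is a pair of stated confidence in [0, 1] and whether the answer was actually correct. The function names, the five equal-width buckets, and the sample records are illustrative choices. The summary number is the expected calibration error, a common way to collapse the table into one figure.

```python
def calibration_table(records: list[tuple[float, bool]], n_buckets: int = 5) -> list[dict]:
    """Group answers by stated confidence and report accuracy per bucket."""
    if n_buckets < 1:
        raise ValueError("need at least one bucket")
    grouped: list[list[tuple[float, bool]]] = [[] for _ in range(n_buckets)]
    for confidence, correct in records:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        idx = min(int(confidence * n_buckets), n_buckets - 1)  # 1.0 goes in the top bucket
        grouped[idx].append((confidence, correct))
    rows = []
    for idx, items in enumerate(grouped):
        if not items:
            continue  # skip empty buckets rather than report a fake accuracy
        rows.append({
            "bucket": (idx / n_buckets, (idx + 1) / n_buckets),
            "count": len(items),
            "mean_confidence": sum(c for c, _ in items) / len(items),
            "accuracy": sum(1 for _, ok in items if ok) / len(items),
        })
    return rows


def expected_calibration_error(rows: list[dict]) -> float:
    """Count-weighted average gap between stated confidence and accuracy."""
    total = sum(r["count"] for r in rows)
    if total == 0:
        raise ValueError("no records to score")
    return sum(r["count"] / total * abs(r["accuracy"] - r["mean_confidence"]) for r in rows)


# Illustrative records: (stated confidence, was the answer correct?)
records = [(0.95, True), (0.92, True), (0.91, False),
           (0.55, True), (0.52, False), (0.15, False)]
rows = calibration_table(records)
for row in rows:
    print(row)
print("ECE:", round(expected_calibration_error(rows), 3))
```

A bucket where mean confidence sits well above accuracy is where a single abstention threshold will mislead you. With real traffic you would want enough answers per bucket for the accuracy figure to mean something.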
Here is the revenue tie, in one line a CRO understands. A support agent that abstains on 8% of tickets sends them to a human at a known cost per ticket. A support agent that confidently fabricates a refund policy on 3% of tickets creates a chargeback, a complaint, and a churned account. The first is an operating expense. The second is an uncapped loss with your brand attached. Measuring and budgeting the confident-wrong rate is the difference between an AI feature you can underwrite and one you are quietly gambling on.
This is also where measuring hallucination stops being an eval exercise and becomes a continuous-monitoring problem. Faithfulness and confident-wrong rate drift as your corpus and traffic change, so they belong on a dashboard, not in a one-time spreadsheet. That is squarely an AI observability and monitoring job, scored on live traffic against a frozen reference.
Frequently asked questions
How do you measure hallucination in an LLM? Score the answer against its source on a frozen evaluation set. For retrieval systems, measure faithfulness: the fraction of claims in the answer that the provided context supports. Break the answer into atomic claims, verify each against the source, and report the supported ratio as faithfulness. The unsupported remainder, 1 minus faithfulness, is your hallucination rate.
What is the difference between faithfulness and factuality? Faithfulness checks an answer against a specific source you provided, so it asks "is this true to the document." Factuality checks a claim against general world knowledge with no source, so it asks "is this true at all." For RAG and grounded systems you gate on faithfulness, because it is the property you actually control.
Can you eliminate LLM hallucination completely? No. Hallucination is a property of how language models generate, not a bug with a final patch. You manage it down with retrieval grounding, constrained decoding, and calibrated abstention, and you keep measuring it because the rate drifts as your data and traffic change.
What is the most expensive hallucination failure? Confident-and-wrong: the model asserts something false with no hedge. Abstention is cheap because it routes to a fallback at a known cost. A confident fabrication is an uncapped loss, because it ships to a customer as if it were verified.
If you want the full harness this plugs into, including frozen sets, judge calibration, and abstention budgets, my book A Field Guide to Evals walks through it end to end, and Observability for AI covers monitoring hallucination on live traffic. For why a frozen ruler is the whole point, read my essay on evals that predict production. And if you would rather have a team instrument faithfulness and confident-wrong rate in your stack from day one, that is exactly what Devlyn's AI observability and monitoring work is for. Measure it first. Then reduce what the number tells you to.
