How to evaluate clinical note summarization with a section-level framework that reflects clinical risk and supports safer release

Ambient scribes, chart recap tools, inbox copilots, and other agentic clinical AI systems are quickly becoming core operational infrastructure in clinical settings, helping to reduce documentation burden and strengthen care coordination. In these workflows, hallucinations are consequential because they can alter the chart, which can affect downstream clinical actions and ultimately increase verification burden.
A recent npj Digital Medicine study shows that binary labels are insufficient for detecting hallucinations in clinical summaries: hallucinations need to be judged in context, because a fabricated medication change or follow-up detail carries far greater risk than a minor phrasing error in a lower-risk section.
A practical alternative to binary hallucination labels is to evaluate note summarizers section by section, distinguish between major and minor hallucinations, and apply release rules according to the downstream risk of each section. The goal is not to remove every imperfection, but to align release thresholds with the real documentation risk carried by different parts of the note. A hallucination budget gives teams a structured way to encode the heterogeneous impact of hallucinations into release policy, rather than assuming all summarization errors carry the same operational significance.
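To make that concrete, here is a minimal sketch of a budget encoded as data rather than prose, assuming a team has already agreed on section names and per-section tolerances. Every name and number below is a hypothetical placeholder; real thresholds belong to clinical governance, not to an engineering default.

```python
# A hypothetical section-level hallucination budget. Section names and
# threshold values are illustrative placeholders, not recommendations.
HALLUCINATION_BUDGET = {
    # Higher-risk, action-driving sections: near-zero tolerance for
    # fabrications that could change clinical meaning.
    "medications":            {"risk": "high", "max_major": 0.00, "max_minor": 0.01},
    "assessment_and_plan":    {"risk": "high", "max_major": 0.00, "max_minor": 0.02},
    "follow_up_instructions": {"risk": "high", "max_major": 0.00, "max_minor": 0.02},
    # Relatively lower-risk narrative sections: minor, meaning-preserving
    # errors may be tolerable when review reliably catches them.
    "history_of_present_illness": {"risk": "lower", "max_major": 0.01, "max_minor": 0.05},
}
```

Encoding the budget as data keeps the release policy inspectable and versionable alongside the model and prompts.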
Lower-risk sections should be understood as relatively lower risk, not risk-free. They may carry fewer downstream consequences than action-driving sections like medication changes, assessment and plan, follow-up instructions, or disposition, but they still require careful review. In these relatively lower-risk sections, minor errors may be acceptable if they do not change clinical meaning and can be caught during review. In higher-risk sections, error tolerance must be very low, since errors can shape treatment decisions, escalation, and care coordination.
Hallucination budgets let teams treat note summarization as a collection of section-level tasks with distinct risks, thresholds, and release criteria. This improves the precision of failure analysis, makes governance more transparent, and supports more defensible deployment decisions. It also creates a stronger bridge to context-switching evaluation by turning note summarization into a set of comparable section-level checks across settings. Section-level checks give teams a practical way to detect when a change in documentation style, workflow, or note structure alters risk in one part of the note even if overall performance still looks acceptable.
A credible hallucination budget for clinical note summarization requires more than a note-level hallucination rate. Teams need a section-level framework that reflects clinical risk, distinguishes major from minor errors, and ties observed performance to explicit release criteria.
Under a note-level approach, a team might review 100 summaries, report one overall hallucination rate, and approve the system if the aggregate result looks acceptable. Under a section-level approach, the same team would inspect medication, assessment, and instruction sections separately, score major and minor failures by class, and allow release only for the sections whose error pattern stays inside the approved budget. The second method is narrower, but it is much more useful for deployment.
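As a sketch of the second approach, assuming reviewers emit one labeled record per reviewed section with severity major, minor, or none (the record fields, section names, and thresholds here are hypothetical):

```python
from collections import Counter

# Hypothetical reviewer output: one record per reviewed section instance.
LABELS = [
    {"summary_id": 1, "section": "medications", "severity": "none"},
    {"summary_id": 2, "section": "medications", "severity": "major"},
    # ... one record per (summary, section) pair across the review sample ...
]

# Placeholder thresholds; in practice these come from clinical governance.
BUDGETS = {
    "medications":            {"max_major": 0.00, "max_minor": 0.01},
    "assessment_and_plan":    {"max_major": 0.00, "max_minor": 0.02},
    "follow_up_instructions": {"max_major": 0.00, "max_minor": 0.02},
}

def release_decisions(labels, budgets):
    """Score each section against its own budget, so an out-of-budget
    section blocks its release instead of hiding in an aggregate rate."""
    totals, errors = Counter(), Counter()
    for rec in labels:
        totals[rec["section"]] += 1
        if rec["severity"] != "none":
            errors[(rec["section"], rec["severity"])] += 1

    decisions = {}
    for section, budget in budgets.items():
        n = totals[section]
        if n == 0:
            decisions[section] = {"release": False, "reason": "no reviewed samples"}
            continue
        major_rate = errors[(section, "major")] / n
        minor_rate = errors[(section, "minor")] / n
        decisions[section] = {
            "major_rate": major_rate,
            "minor_rate": minor_rate,
            "release": major_rate <= budget["max_major"]
                       and minor_rate <= budget["max_minor"],
        }
    return decisions

print(release_decisions(LABELS, BUDGETS))
```

A single note-level rate would average these sections together; the per-section decision surfaces exactly which part of the note is out of budget.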
Release is only the first checkpoint. For clinical note summarization to be safe and scalable, teams need to monitor whether section-specific error patterns stay within budget as models, prompts, workflows, note structures, and patient populations change over time. Quantiles helps teams operationalize that process by turning section definitions, evaluations, thresholds, and re-review triggers into reusable evaluation infrastructure that persists across model versions, deployments, and post-deployment monitoring workflows. This infrastructure gives teams a more durable, rigorous way to govern note summarization as a documentation system, not just evaluate it once as a generic text-generation feature.
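One way to picture that monitoring loop, sketched under the assumption that human-reviewed section instances arrive as a stream (the class and parameter names below are illustrative, not a Quantiles API):

```python
from collections import deque

class SectionMonitor:
    """Rolling post-release check for one note section: flag for re-review
    when the recent major-error rate drifts above the approved budget."""

    def __init__(self, max_major_rate: float, window: int = 200):
        self.max_major_rate = max_major_rate
        self.recent = deque(maxlen=window)  # 1 = major error found in review, 0 = clean

    def observe(self, had_major_error: bool) -> bool:
        """Record one reviewed instance; return True if the section has
        drifted out of budget and a re-review should be triggered."""
        self.recent.append(1 if had_major_error else 0)
        rate = sum(self.recent) / len(self.recent)
        return rate > self.max_major_rate

# One monitor per section, fed by ongoing spot review after release.
medications = SectionMonitor(max_major_rate=0.0)
if medications.observe(had_major_error=True):
    print("medications section out of budget -- trigger re-review")
```

A real monitor would also track minor errors and stratify by workflow or population, but the core idea is the same: the re-review trigger lives next to the budget, per section.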
Common questions this article helps answer