How to evaluate clinical note summarization with a section-level framework that reflects clinical risk and supports safer release

Ambient scribes, chart recap tools, inbox copilots, and other agentic clinical AI systems are quickly becoming core operational infrastructure in clinical settings, helping to reduce documentation burden and strengthen care coordination. In these workflows, hallucinations are consequential because they can alter the chart, which can affect downstream clinical actions and ultimately increase verification burden.
A recent npj Digital Medicine study shows that binary labels are insufficient for detecting hallucinations in clinical summaries: hallucinations need to be judged in context, because a fabricated medication change or follow-up detail carries far greater risk than a minor phrasing error in a lower-risk section.
A practical alternative to binary hallucination labels is to evaluate note summarizers section by section, distinguish between major and minor hallucinations, and apply release rules according to the downstream risk of each section. The goal is not to remove every imperfection, but to align release thresholds with the real documentation risk carried by different parts of the note. A hallucination budget gives teams a structured way to encode the heterogeneous impact of hallucinations into release policy, rather than assuming all summarization errors carry the same operational significance.
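To make that concrete, here is a minimal sketch of a budget encoded as data rather than prose, assuming a team has already agreed on section names and per-section tolerances. Every name and number below is a hypothetical placeholder; real thresholds belong to clinical governance, not to an engineering default.

```python
# A hypothetical section-level hallucination budget. Section names and
# threshold values are illustrative placeholders, not recommendations.
HALLUCINATION_BUDGET = {
    # Higher-risk, action-driving sections: near-zero tolerance for
    # fabrications that could change clinical meaning.
    "medications":            {"risk": "high", "max_major": 0.00, "max_minor": 0.01},
    "assessment_and_plan":    {"risk": "high", "max_major": 0.00, "max_minor": 0.02},
    "follow_up_instructions": {"risk": "high", "max_major": 0.00, "max_minor": 0.02},
    # Relatively lower-risk narrative sections: minor, meaning-preserving
    # errors may be tolerable when review reliably catches them.
    "history_of_present_illness": {"risk": "lower", "max_major": 0.01, "max_minor": 0.05},
}
```

Encoding the budget as data keeps the release policy inspectable and versionable alongside the model and prompts.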
Lower-risk sections should be understood as relatively lower risk, not risk-free. They may carry fewer downstream consequences than action-driving sections like medication changes, assessment and plan, follow-up instructions, or disposition, but they still require careful review. In these relatively lower-risk sections, minor errors may be acceptable if they do not change clinical meaning and can be caught during review. In higher-risk sections, error tolerance must be very low, since errors can shape treatment decisions, escalation, and care coordination.
Hallucination budgets let teams treat note summarization as a collection of section-level tasks with distinct risks, thresholds, and release criteria. This improves the precision of failure analysis, makes governance more transparent, and supports more defensible deployment decisions. It also creates a stronger bridge to context-switching evaluation by turning note summarization into a set of comparable section-level checks across settings. Section-level checks give teams a practical way to detect when a change in documentation style, workflow, or note structure alters risk in one part of the note even if overall performance still looks acceptable.
A credible hallucination budget for clinical note summarization requires more than a note-level hallucination rate. Teams need a section-level framework that reflects clinical risk, distinguishes major from minor errors, and ties observed performance to explicit release criteria.
Under a note-level approach, a team might review 100 summaries, report one overall hallucination rate, and approve the system if the aggregate result looks acceptable. Under a section-level approach, the same team would inspect medication, assessment, and instruction sections separately, score major and minor failures by class, and allow release only for the sections whose error pattern stays inside the approved budget. The second method is narrower, but it is much more useful for deployment.
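As a sketch of the second approach, assuming reviewers emit one labeled record per reviewed section with severity major, minor, or none (the record fields, section names, and thresholds here are hypothetical):

```python
from collections import Counter

# Hypothetical reviewer output: one record per reviewed section instance.
LABELS = [
    {"summary_id": 1, "section": "medications", "severity": "none"},
    {"summary_id": 2, "section": "medications", "severity": "major"},
    # ... one record per (summary, section) pair across the review sample ...
]

# Placeholder thresholds; in practice these come from clinical governance.
BUDGETS = {
    "medications":            {"max_major": 0.00, "max_minor": 0.01},
    "assessment_and_plan":    {"max_major": 0.00, "max_minor": 0.02},
    "follow_up_instructions": {"max_major": 0.00, "max_minor": 0.02},
}

def release_decisions(labels, budgets):
    """Score each section against its own budget, so an out-of-budget
    section blocks its release instead of hiding in an aggregate rate."""
    totals, errors = Counter(), Counter()
    for rec in labels:
        totals[rec["section"]] += 1
        if rec["severity"] != "none":
            errors[(rec["section"], rec["severity"])] += 1

    decisions = {}
    for section, budget in budgets.items():
        n = totals[section]
        if n == 0:
            decisions[section] = {"release": False, "reason": "no reviewed samples"}
            continue
        major_rate = errors[(section, "major")] / n
        minor_rate = errors[(section, "minor")] / n
        decisions[section] = {
            "major_rate": major_rate,
            "minor_rate": minor_rate,
            "release": major_rate <= budget["max_major"]
                       and minor_rate <= budget["max_minor"],
        }
    return decisions

print(release_decisions(LABELS, BUDGETS))
```

A single note-level rate would average these sections together; the per-section decision surfaces exactly which part of the note is out of budget.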
Release is only the first checkpoint. For clinical note summarization to be safe and scalable, teams need to monitor whether section-specific error patterns stay within budget as models, prompts, workflows, note structures, and patient populations change over time. Quantiles helps teams operationalize that process by turning section definitions, evaluations, thresholds, and re-review triggers into reusable evaluation infrastructure that persists across model versions, deployments, and post-deployment monitoring workflows. This infrastructure gives teams a more durable, rigorous way to govern note summarization as a documentation system, not just evaluate it once as a generic text-generation feature.
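One way to picture that monitoring loop, sketched under the assumption that human-reviewed section instances arrive as a stream (the class and parameter names below are illustrative, not a Quantiles API):

```python
from collections import deque

class SectionMonitor:
    """Rolling post-release check for one note section: flag for re-review
    when the recent major-error rate drifts above the approved budget."""

    def __init__(self, max_major_rate: float, window: int = 200):
        self.max_major_rate = max_major_rate
        self.recent = deque(maxlen=window)  # 1 = major error found in review, 0 = clean

    def observe(self, had_major_error: bool) -> bool:
        """Record one reviewed instance; return True if the section has
        drifted out of budget and a re-review should be triggered."""
        self.recent.append(1 if had_major_error else 0)
        rate = sum(self.recent) / len(self.recent)
        return rate > self.max_major_rate

# One monitor per section, fed by ongoing spot review after release.
medications = SectionMonitor(max_major_rate=0.0)
if medications.observe(had_major_error=True):
    print("medications section out of budget -- trigger re-review")
```

A real monitor would also track minor errors and stratify by workflow or population, but the core idea is the same: the re-review trigger lives next to the budget, per section.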
Common questions this article helps answer