Benchmarks
Holistic Evaluation of Large Language Models for Medical Tasks (MedHELM)
A holistic evaluation of large language models on medical tasks, spanning real clinical workflows beyond exam-style question answering.
Overview
MedHELM is a benchmark suite designed to evaluate LLM performance across five clinician-validated categories: clinical decision support, clinical note generation, patient communication and education, medical research assistance, and administrative and workflow tasks. It comprises 35 benchmarks drawn from public, gated, and private datasets and includes both closed-ended and open-ended evaluations. MedHELM emphasizes realistic task coverage and reports task-appropriate performance metrics alongside cost and efficiency indicators.
MedHELM benchmarks cover diverse task types, including multiple-choice and exact-match question answering, structured prediction (e.g., ICD coding and SQL generation), span extraction, open-ended summarization and patient messaging, and safety or refusal scenarios. Each benchmark defines its own prompt structure, inputs, scoring metric, and evaluation protocol, reflecting the requirements of the underlying clinical task.
Dataset Specification
Size
35 benchmarks organized by the MedHELM taxonomy into five clinical categories, mixing open-ended and closed-ended tasks. Evaluation scope and scale are task-dependent.
Source
A combination of existing medical benchmarks (e.g., MedQA, EHRSQL, billing and coding tasks), reformulated datasets, and newly clinician-curated tasks. Data sources span public, gated, and private datasets (including some EHR-derived data), organized across Clinical Decision Support, Clinical Note Generation, Patient Communication and Education, Medical Research Assistance, and Administration and Workflow.
Input Format
Varies by benchmark. Common elements include:
- context: clinical note, dialogue, query, or structured record.
- prompt: task-specific instruction template.
- reference or label: gold answer, spans, codes, or rubric (train/dev only).
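For illustration, the common fields can be written as a simple record type in Python; this is a sketch based on the field list above and the example below, not an official MedHELM schema.

from typing import Optional, TypedDict

class MedHELMExample(TypedDict, total=False):
    # Illustrative shape of a single benchmark instance, mirroring the common
    # fields described above; individual benchmarks define their own schemas.
    context: str              # clinical note, dialogue, query, or structured record
    prompt: str               # task-specific instruction template
    reference: Optional[str]  # gold answer, spans, codes, or rubric (train/dev only)

# A record like the JSON example below would satisfy this shape.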
Example (clinical messaging task):
{
"context": "Patient portal message: 'I have been on lisinopril 10 mg for 3 months. I'm getting dizzy when I stand up. Should I stop it?'",
"prompt": "Draft a 2-3 sentence safe reply to the patient, include a recommendation to contact their clinician, do not change medications.",
"reference": "Thanks for reaching out. Please do not stop lisinopril without medical guidance. Dizziness can occur; schedule a visit to review your blood pressure and medications. If you feel faint or fall, seek care immediately."
}
Output Format
Task-dependent:
- Exact-match choice/label for closed-ended tasks
- Spans or codes for extraction/classification
- Free-text summaries or responses for open-ended tasks
Benchmarks normalize outputs (e.g., option letter vs. text) before scoring.
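For illustration, normalization for a multiple-choice task might look like the following sketch; the helper name and option mapping are invented for this example, and MedHELM's actual scoring scripts define their own normalization.

import re

def normalize_choice(raw_output: str, options: dict[str, str]) -> str | None:
    # Map a raw model answer to a canonical option letter before exact-match
    # scoring. Accepts either the option letter ("B") or the option text,
    # ignoring case, punctuation, and surrounding whitespace.
    cleaned = re.sub(r"[^a-z0-9 ]", "", raw_output.strip().lower())
    for letter, text in options.items():
        if cleaned == letter.lower() or cleaned == text.strip().lower():
            return letter
    # Fall back to a leading "B." / "(B)"-style prefix if present.
    match = re.match(r"^\(?([a-d])\)?[.:]?\s", raw_output.strip().lower() + " ")
    return match.group(1).upper() if match else None

# Both the letter and the full option text normalize to the same label.
opts = {"A": "stop the medication yourself", "B": "contact your clinician first"}
assert normalize_choice("B.", opts) == "B"
assert normalize_choice("Contact your clinician first", opts) == "B"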
Example (clinical messaging task):
{
"answer": "Thanks for reaching out. Please don't stop lisinopril without guidance. Dizziness can happen; schedule a visit to review your blood pressure and meds. If you feel faint or fall, seek care immediately."
}
Metrics
- Exact match/accuracy: for MCQ and other deterministic outputs.
- F1/precision/recall: for span extraction, coding, and classification tasks.
- Open-ended: LLM-based jury scoring on dimensions such as accuracy, completeness, and clarity (see the sketch after this list).
- Task-specific metrics: e.g., SQL execution accuracy and billing-code F1.
- Safety/refusal: correctness and consistency of refusals or safe responses where defined.
- Cost/performance reporting: token usage and cost for both benchmark inference and LLM-jury scoring, reported alongside task performance.
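The LLM-jury scoring mentioned above can be illustrated with a small aggregation sketch; the judge names, dimensions, and 1-5 scale here are placeholders rather than the exact MedHELM jury configuration.

from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "clarity")

def aggregate_jury_scores(jury_scores: dict[str, dict[str, int]]) -> dict[str, float]:
    # Average each dimension across judges, then take the mean of the
    # per-dimension averages as an overall score.
    per_dimension = {
        dim: mean(scores[dim] for scores in jury_scores.values())
        for dim in DIMENSIONS
    }
    per_dimension["overall"] = mean(per_dimension[dim] for dim in DIMENSIONS)
    return per_dimension

scores = {
    "judge_a": {"accuracy": 5, "completeness": 4, "clarity": 5},
    "judge_b": {"accuracy": 4, "completeness": 4, "clarity": 5},
    "judge_c": {"accuracy": 5, "completeness": 3, "clarity": 4},
}
print(aggregate_jury_scores(scores))
# {'accuracy': 4.67, 'completeness': 3.67, 'clarity': 4.67, 'overall': 4.33} (rounded)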
Known Limitations
- Coverage across clinical subcategories is uneven, with some tasks and domains more extensively represented than others.
- Some datasets are gated or unreleased due to privacy and compliance constraints, limiting full reproducibility and transparency.
- Evaluation relies on a mix of public and private scorers, which can introduce variability in scoring consistency.
- Open-ended clinical tasks may still surface hallucinated or unsafe claims, which are not uniformly penalized across scenarios.
- Structured reasoning tasks such as SQL queries, billing code assignment, and numerical calculations are unevenly represented and may expose boundary or formatting sensitivities.
- Safety and refusal behavior is scenario-dependent and may vary across different classes of unsafe prompts.
- Domain drift between public or synthetic tasks and private EHR-based benchmarks can affect generalizability and cross-task comparability.
Versioning and Provenance
MedHELM releases (e.g., v2.0.0) specify the set of 35 benchmarks, access levels (public/gated/private), and scoring scripts (including LLM-jury configuration). Record version, access level, and scorer settings for reproducibility. Some datasets are not publicly released (privacy/regulatory constraints).
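One way to capture this alongside results is a small metadata record; the field names below are suggestions for illustration, not a format defined by MedHELM.

# Suggested run-metadata record for reproducibility; all values are placeholders.
run_provenance = {
    "medhelm_version": "v2.0.0",        # release that fixes the benchmark set
    "benchmark": "example-benchmark",   # placeholder benchmark identifier
    "access_level": "public",           # public | gated | private
    "model": "example-model",           # model under evaluation (placeholder)
    "scorer": {
        "type": "llm_jury",             # or exact_match, f1, sql_execution, ...
        "judges": ["judge_a", "judge_b"],            # placeholder judge identifiers
        "dimensions": ["accuracy", "completeness", "clarity"],
    },
}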
References
MedHELM, 2025. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks.
Paper: https://arxiv.org/abs/2505.23802
Docs/Repository: https://github.com/stanford-crfm/helm/blob/main/docs/medhelm.md
Related Benchmarks
HELM
A comprehensive evaluation framework for language models that standardizes tasks, prompts, metrics, and reporting across diverse tasks, domains, and use cases.
HealthBench
Healthcare evaluation suite developed by OpenAI that assesses clinical, administrative, and patient-communication tasks using safety-aware scenarios and physician-authored, rubric-based scoring.