Benchmarks
MedQA
A USMLE-style multiple-choice question-answering benchmark for clinical and biomedical knowledge and basic clinical reasoning, derived from professional medical board exams.
Overview
MedQA is a USMLE-style medical multiple-choice question answering benchmark focused on clinical and biomedical knowledge and short-form reasoning. Questions are exam-like rather than free-form clinical notes, and emphasize selecting the best option among distractors (incorrect or less-correct answers). MedQA is useful for probing knowledge breadth and basic reasoning, but it is not a full clinical safety or deployment benchmark; results should be interpreted as exam-style performance, not as evidence of bedside safety or production readiness.
The model is given a question stem and a fixed set of answer options and must select a single correct option. MedQA is typically evaluated in closed-form multiple-choice mode. Systems that generate free-text responses must deterministically map outputs back to the provided option set for scoring (if this mapping is not deterministic, evaluation results are unreliable).
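As a concrete illustration of closed-form evaluation, the minimal Python sketch below assembles a lettered multiple-choice prompt from a MedQA item. The prompt wording and the commented-out query_model call are illustrative assumptions, not part of the dataset or any official harness.
# Minimal sketch: format a MedQA item as a lettered multiple-choice prompt.
# The template and the query_model() call are illustrative assumptions.
from string import ascii_uppercase

def build_prompt(item: dict) -> str:
    """Render the question stem and options as a lettered multiple-choice prompt."""
    lines = [item["question"], ""]
    for letter, option in zip(ascii_uppercase, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("")
    lines.append("Answer with the letter of the single best option.")
    return "\n".join(lines)

item = {
    "question": "Which of the following drugs is contraindicated in pregnancy?",
    "options": ["Amlodipine", "Warfarin", "Metformin", "Labetalol"],
}
prompt = build_prompt(item)
# response = query_model(prompt)  # hypothetical model call; substitute your own client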
Dataset Specification
Size
Approximately 12.7k English multiple-choice questions (exact counts vary by release and preprocessing), with standard train/dev/test splits and additional simplified and traditional Chinese subsets.
Source
Professional medical board exam–style (USMLE-like) questions spanning major clinical specialties, including internal medicine, surgery, pediatrics, OB/GYN, and psychiatry. Some releases include explanation or reference annotations.
Input Format
question: string
options: list of strings
answer: string label in the dataset (ground truth), not provided to the model at inference
Model input example (answer omitted from inference):
{
"question": "Which of the following drugs is contraindicated in pregnancy?",
"options": [
"Amlodipine",
"Warfarin",
"Metformin",
"Labetalol"
]
}
Output Format
A single choice, either as the option letter (A, B, …) or the option text; free-text outputs are normalized to option letters for scoring.
{
"answer": "B"
}
Note that the evaluation pipeline should deterministically normalize free-text outputs to option indices:
{ "answer": "Warfarin" }
(normalized to option B).
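A minimal sketch of one such deterministic normalization (exact letter match, then case-insensitive option-text match) follows; the matching rules and the normalize_answer helper are illustrative assumptions, not a prescribed MedQA convention.
# Sketch of deterministic normalization from a raw model answer to an option letter.
# The matching rules here are one reasonable convention, not an official standard.
from string import ascii_uppercase
from typing import List, Optional

def normalize_answer(raw: str, options: List[str]) -> Optional[str]:
    """Map a raw answer to an option letter, or None if there is no unambiguous match."""
    text = raw.strip().rstrip(".")
    letters = ascii_uppercase[: len(options)]
    # Case 1: the model answered with a bare option letter ("B" or "b").
    if text.upper() in letters:
        return text.upper()
    # Case 2: the model answered with the option text ("Warfarin").
    matches = [letters[i] for i, opt in enumerate(options) if opt.lower() == text.lower()]
    return matches[0] if len(matches) == 1 else None

print(normalize_answer("Warfarin", ["Amlodipine", "Warfarin", "Metformin", "Labetalol"]))  # "B"
Outputs that cannot be mapped unambiguously are best logged and scored as incorrect rather than silently guessed.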
Metrics
- Accuracy (primary): fraction of questions where the predicted option matches the ground-truth option.
MedQA's multiple-choice format makes accuracy the simple, standard primary metric; a minimal computation sketch follows this list.
- Optional: calibrated accuracy, confidence-weighted accuracy, response latency.
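A minimal sketch of the accuracy computation, under the assumption that predictions that could not be normalized (None) count as incorrect:
# Sketch of the primary accuracy metric: fraction of questions where the normalized
# prediction matches the gold option letter; unmapped predictions count as incorrect.
from typing import List, Optional

def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    assert len(predictions) == len(gold) and gold
    correct = sum(1 for pred, label in zip(predictions, gold) if pred == label)
    return correct / len(gold)

print(accuracy(["B", "A", None, "C"], ["B", "D", "A", "C"]))  # 0.5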
Known Limitations
- Exam-style, single-turn multiple-choice questions with short, well-formed stems that do not reflect real-world clinical notes, longitudinal context, or workflow complexity.
- Encourages pattern recognition over mechanistic or causal clinical reasoning, and can mask confusion between similar diagnoses or overlapping treatments.
- Limited sensitivity to negation, qualifiers, and subtle contextual distinctions that are common in real clinical documentation.
- Underrepresents rare conditions, edge-case contraindications, or high-risk safety scenarios.
- Correct answer selection does not guarantee clinically sound reasoning; models may still generate plausible but incorrect explanations.
Versioning and Provenance
MedQA appears in multiple processed releases (e.g., v1 and other community preprocessings). Exact question counts and splits vary by source. Record the dataset version, preprocessing pipeline, and split definitions used for each run to ensure reproducibility.
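As one possible convention, a per-run metadata record might look like the sketch below; the field names and values are illustrative placeholders, not a required schema.
# Illustrative run-metadata record; field names and values are example placeholders.
run_metadata = {
    "dataset": "MedQA",
    "dataset_version": "us-4-options",        # use the identifier of the release you evaluate
    "preprocessing": "https://github.com/jind11/MedQA default scripts",
    "split": "test",
    "num_questions": 1273,                    # record the actual size of your split
    "model": "example-model-v1",              # hypothetical model name
    "prompt_template": "lettered-mc-v1",      # hypothetical template identifier
}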
References
Jin et al., 2020. What Disease Does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Paper: https://arxiv.org/abs/2009.13081
GitHub Repository: https://github.com/jind11/MedQA
Related Benchmarks
MMLU
Broad multi-domain benchmark with ~15k questions across 57 subjects that evaluates general knowledge and multiple-choice reasoning.
PubMedQA
Biomedical research QA benchmark with ~1k expert-annotated questions that evaluates evidence-grounded answering over PubMed abstracts.