Benchmarks
Massive Multitask Language Understanding (MMLU)
Multi-domain, multiple-choice question-answering benchmark covering 57 academic and professional subjects to probe broad knowledge and reasoning.
Overview
MMLU is a multiple-choice benchmark that spans 57 subjects across STEM, humanities, social sciences, and professional exams. It tests general knowledge breadth and short-form reasoning in exam-style questions. MMLU is useful for assessing domain breadth and multi-choice reasoning but is not a clinical safety or deployment benchmark. Results should be interpreted as exam-style performance, not as evidence of production readiness in specialized domains.
The model receives a question stem and a fixed set of answer options and must output the single correct option. MMLU is typically scored in closed-form multiple-choice mode. If free-text answers are produced, they should be deterministically mapped to the option set before scoring.
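To make the closed-form setup concrete, the sketch below renders one item as a lettered prompt. The `build_mc_prompt` helper and its template are assumptions for illustration; MMLU does not prescribe a prompt format, and, as noted under Known Limitations, reported accuracy is sensitive to the format chosen.

```python
import string

def build_mc_prompt(item: dict) -> str:
    """Render a question and its options as a lettered multiple-choice prompt.

    The layout is illustrative only: MMLU does not mandate a specific template,
    and measured accuracy is known to shift with the exact phrasing used.
    """
    lines = [item["question"]]
    for letter, option in zip(string.ascii_uppercase, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

example = {
    "question": "Which vitamin deficiency most commonly leads to night blindness?",
    "options": ["Vitamin A", "Vitamin B1", "Vitamin C", "Vitamin K"],
}
print(build_mc_prompt(example))
```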
Dataset Specification
Size
Approximately 15k multiple-choice questions across 57 subjects, with public development and validation subsets for few-shot evaluation and a held-out test set used for official scoring. Exact counts vary by processed version.
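As a sketch of how the development subset is commonly used for few-shot evaluation, the snippet below prepends k solved development items to the unsolved test question. The helper names, the "Answer:" cue, and the assumption that development items carry a letter-valued `answer` field are illustrative; field and split names differ across processed releases.

```python
import string

def _format_item(item: dict, with_answer: bool) -> str:
    # Lettered layout matching the prompt sketch in the Overview; the trailing
    # "Answer:" cue is an assumption, not a format mandated by the benchmark.
    lines = [item["question"]]
    for letter, option in zip(string.ascii_uppercase, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(dev_items: list[dict], test_item: dict, k: int = 5) -> str:
    """Concatenate k solved development items ahead of the unsolved test question."""
    shots = [_format_item(x, with_answer=True) for x in dev_items[:k]]
    return "\n\n".join(shots + [_format_item(test_item, with_answer=False)])
```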
Source
Publicly available or derived academic and professional exam-style questions spanning 57 subject areas, including math, physics, chemistry, computer science, biology, medicine, law, business, economics, history, philosophy, and more.
Input Format
- question: string
- options: list of strings
- answer: string label in the dataset (ground truth), not provided to the model at inference
Model input example (answer omitted from inference):
```json
{
  "question": "Which vitamin deficiency most commonly leads to night blindness?",
  "options": [
    "Vitamin A",
    "Vitamin B1",
    "Vitamin C",
    "Vitamin K"
  ]
}
```

Output Format
A single choice, either the option letter (A, B, …) or the option text. Outputs are normalized to the provided option set before scoring.
```json
{
  "answer": "A"
}
```

Alternate: { "answer": "Vitamin A" } (normalized to option A).
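The deterministic mapping mentioned above can be as simple as the sketch below, which assumes the model output is either a bare option letter or text that exactly matches one option (compared case-insensitively); production harnesses typically add more robust parsing.

```python
import string

def normalize_answer(raw: str, options: list[str]) -> str | None:
    """Map a model output to an option letter, or None if it cannot be resolved.

    Covers the two shapes shown above: a bare letter ("A") and an exact option
    text ("Vitamin A"). Anything else is left unresolved rather than guessed,
    so scoring stays deterministic.
    """
    text = raw.strip().rstrip(".").strip()
    letters = string.ascii_uppercase[: len(options)]

    # Case 1: a bare option letter such as "A" or "a".
    if len(text) == 1 and text.upper() in letters:
        return text.upper()

    # Case 2: the option text itself, compared case-insensitively.
    lowered = [opt.strip().lower() for opt in options]
    if text.lower() in lowered:
        return letters[lowered.index(text.lower())]

    return None

options = ["Vitamin A", "Vitamin B1", "Vitamin C", "Vitamin K"]
assert normalize_answer("A", options) == "A"
assert normalize_answer("Vitamin A", options) == "A"
```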
Metrics
- Accuracy (primary): fraction of questions where the predicted option matches the ground-truth option. MMLU's multiple-choice format makes this the standard primary metric; a minimal computation sketch follows this list.
- Optional: calibrated accuracy, confidence-weighted accuracy, response latency.
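A minimal sketch of the primary accuracy metric, assuming predictions have already been normalized to option letters as above. Counting unresolved outputs as incorrect is a convention chosen here for illustration, not part of the benchmark definition; some harnesses fall back to a default or random choice instead.

```python
def accuracy(predictions: list[str | None], gold: list[str]) -> float:
    """Fraction of items whose normalized prediction matches the gold option letter."""
    if not gold:
        raise ValueError("empty evaluation set")
    correct = sum(pred == label for pred, label in zip(predictions, gold))
    return correct / len(gold)

print(accuracy(["A", "C", None, "B"], ["A", "B", "D", "B"]))  # 0.5
```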
Known Limitations
- Exam-style, closed-form multiple-choice questions favor test-taking strategies and surface pattern recognition over deep, subject-specific or causal reasoning.
- Offers limited signal on a model's ability to disambiguate closely related concepts, interpret negation, and handle subtle qualifiers, particularly in subjects with overlapping definitions.
- Sensitivity to phrasing, prompt format, and few-shot example selection can materially affect reported accuracy.
- Subject coverage and difficulty are uneven, with weaker signal for low-resource or niche domains.
- Not designed to evaluate real-world safety, deployment behavior, or downstream decision impact.
Versioning and Provenance
MMLU has multiple processed releases and community variants. Exact counts and subject splits can differ across repositories. Record the dataset version, preprocessing steps, and split definitions used for each evaluation run to ensure reproducibility and comparability.
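One way to capture this provenance is a small machine-readable record written alongside each run. The schema below is illustrative only; the field names and placeholder values are assumptions, not a standard.

```python
import json
from datetime import datetime, timezone

# Illustrative provenance record; field names are assumptions, not a standard schema.
run_record = {
    "benchmark": "MMLU",
    "dataset_version": "<release tag or commit hash>",
    "splits": {"few_shot_source": "dev", "scored": "test"},
    "preprocessing": ["answer letters uppercased", "options kept in source order"],
    "num_shots": 5,
    "prompt_template_id": "lettered-options-v1",  # hypothetical identifier
    "timestamp_utc": datetime.now(timezone.utc).isoformat(),
}

with open("mmlu_run_metadata.json", "w") as f:
    json.dump(run_record, f, indent=2)
```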
References
Hendrycks et al., 2020. Measuring Massive Multitask Language Understanding.
Paper: https://arxiv.org/abs/2009.03300
GitHub Repository: https://github.com/hendrycks/test
Related Benchmarks
MedQA
USMLE-style medical multiple-choice QA benchmark (~12k items) evaluating diagnostic reasoning, treatment selection, and contraindication assessment across major clinical domains.
PubMedQA
Biomedical research QA benchmark with ~1k questions that evaluates evidence grounded answering using PubMed abstracts.