Benchmarks
MedMCQA
Large-scale medical multiple-choice question answering benchmark based on professional entrance and licensing exam-style questions.
Overview
MedMCQA is a large medical multiple-choice QA benchmark that evaluates clinical and biomedical knowledge across major specialties. Questions are exam-like and intended to test knowledge retrieval and short-form reasoning under constrained answer options.
The model receives a question stem and answer choices and must select one option. Evaluation is usually performed in closed-form multiple-choice mode. If a model outputs free text, predictions should be deterministically normalized to one option label prior to scoring.
Dataset Specification
Size
Approximately 194k question-answer items (exact counts vary by preprocessing and split), making it substantially larger than many prior medical QA benchmarks.
Source
Questions are derived from AIIMS and NEET-PG style medical entrance and licensing exam content covering subjects including medicine, surgery, pathology, pharmacology, and related disciplines.
Input Format
- question: string
- options: list of strings
- answer: option label or index (ground truth), not provided to the model at inference time
Model input example (answer omitted from inference):
{
  "question": "Which vitamin deficiency causes megaloblastic anemia?",
  "options": ["Vitamin C", "Vitamin B12", "Vitamin D", "Vitamin K"]
}
Output Format
A single selected option, expressed either as a label (A, B, C, D) or as option text that can be deterministically mapped to one label.
{
  "answer": "B"
}
Metrics
- Accuracy (primary): fraction of questions where the predicted option matches the ground-truth option.
Known Limitations
- Exam-style multiple-choice format can reward recognition patterns that do not fully reflect real clinical reasoning in practice.
- Performance may overstate practical utility in patient-specific workflows where context is longitudinal and often incomplete.
- Accuracy alone does not measure calibration, uncertainty handling, or safety under ambiguous or high-risk cases.
- Dataset variants and preprocessing choices can affect comparability if split definitions are not consistently tracked.
Versioning and Provenance
MedMCQA is distributed through multiple repositories and processed variants. For reproducibility, record dataset source, revision/config, split, label-mapping policy, and any preprocessing used in each run.
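A run record covering those fields might look like the following sketch; the field names and values are illustrative, not a standard schema:

```python
# Hypothetical provenance record for one evaluation run.
run_record = {
    "dataset": "medmcqa",
    "source": "https://github.com/medmcqa/medmcqa",
    "revision": "<commit-or-config-id>",    # pin the exact dataset revision used
    "split": "validation",
    "label_mapping": "index -> A/B/C/D",    # how option indices map to labels
    "preprocessing": ["strip_whitespace"],  # any transforms applied to items
}
```

Logging such a record alongside each run makes reported accuracies comparable across dataset variants.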
References
Pal et al., 2022. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering.
Paper: https://arxiv.org/abs/2203.14371
GitHub Repository: https://github.com/medmcqa/medmcqa
Related Benchmarks
MedQA
USMLE-style medical multiple-choice QA benchmark (~12k items) evaluating diagnostic reasoning, treatment selection, and contraindication assessment across major clinical domains.
MMLU
Broad multi-domain benchmark with ~15k questions across 57 subjects that evaluates general knowledge and multiple-choice reasoning.
PubMedQA
Biomedical research QA benchmark with ~1k expert-labeled questions that evaluates evidence-grounded answering over PubMed abstracts.