Benchmarks

MedMCQA

Large-scale medical multiple-choice question answering benchmark based on professional entrance and licensing exam-style questions.

Overview

MedMCQA is a large medical multiple-choice QA benchmark that evaluates clinical and biomedical knowledge across major specialties. Questions are exam-like and intended to test knowledge retrieval and short-form reasoning under constrained answer options.

The model receives a question stem and answer choices and must select one option. Evaluation is usually performed in closed-form multiple-choice mode. If a model outputs free text, predictions should be deterministically normalized to one option label prior to scoring.

Dataset Specification

Size

Approximately 194k question-answer items (exact counts vary by preprocessing and split), making it substantially larger than many prior medical QA benchmarks.

Source

Questions are derived from AIIMS and NEET-PG style medical entrance and licensing exam content covering subjects including medicine, surgery, pathology, pharmacology, and related disciplines.

Input Format

  • question: string
  • options: list of strings
  • answer: option label or index (ground truth), not provided to the model at inference time

Model input example (the answer field is omitted at inference time):


{
  "question": "Which vitamin deficiency causes megaloblastic anemia?",
  "options": ["Vitamin C", "Vitamin B12", "Vitamin D", "Vitamin K"]
}
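An input item in the format above can be rendered into a closed-form multiple-choice prompt. The template below is a minimal sketch; MedMCQA does not mandate a specific prompt format, and the exact wording is an assumption.

```python
def format_prompt(item: dict) -> str:
    """Render a question dict into a lettered multiple-choice prompt.

    Illustrative template only; real harnesses vary the instruction text,
    label style, and few-shot context.
    """
    labels = "ABCD"
    lines = [item["question"]]
    for label, option in zip(labels, item["options"]):
        lines.append(f"{label}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The trailing "Answer:" cue nudges the model toward emitting a bare label, which simplifies the normalization step at scoring time.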

Output Format

A single option selection: either a label (A, B, C, D) or the option's text, which must map deterministically to exactly one label.

{
  "answer": "B"
}

Metrics

  • Accuracy (primary): fraction of questions where the predicted option matches the ground-truth option.
    \text{Accuracy} = \frac{\text{correct}}{\text{total}}
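A minimal scorer for this metric might look like the following sketch. The convention of counting unparseable predictions (here `None`) as incorrect is our own assumption; it keeps the denominator fixed at the full question count.

```python
def accuracy(predictions: list, references: list[str]) -> float:
    """Fraction of questions where the predicted label matches the reference.

    Predictions that could not be normalized to a label (None) count as
    incorrect rather than being dropped from the denominator.
    """
    assert len(predictions) == len(references), "length mismatch"
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```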

Known Limitations

  • Exam-style multiple-choice format can reward recognition patterns that do not fully reflect real clinical reasoning in practice.
  • Performance may overstate practical utility in patient-specific workflows where context is longitudinal and often incomplete.
  • Accuracy alone does not measure calibration, uncertainty handling, or safety under ambiguous or high-risk cases.
  • Dataset variants and preprocessing choices can affect comparability if split definitions are not consistently tracked.

Versioning and Provenance

MedMCQA is distributed through multiple repositories and processed variants. For reproducibility, record dataset source, revision/config, split, label-mapping policy, and any preprocessing used in each run.
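One way to track the provenance fields listed above is a small run manifest. The field names and the hashing convention here are our own; the sketch only shows that serializing the fields deterministically lets identical configurations be detected by hash.

```python
import hashlib
import json

def run_manifest(source: str, revision: str, split: str,
                 label_policy: str, preprocessing: list[str]) -> str:
    """Serialize run provenance to deterministic JSON plus a short hash.

    Hypothetical helper: sort_keys makes the JSON byte-stable, so two runs
    with identical configuration produce identical digests.
    """
    manifest = {
        "dataset_source": source,
        "revision": revision,
        "split": split,
        "label_mapping_policy": label_policy,
        "preprocessing": preprocessing,
    }
    blob = json.dumps(manifest, sort_keys=True)
    digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]
    return f"{digest} {blob}"
```

Storing the manifest alongside each run's scores makes cross-run comparisons auditable when dataset variants or split definitions differ.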

References

Pal et al., 2022. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering.

Paper: https://arxiv.org/abs/2203.14371

GitHub Repository: https://github.com/medmcqa/medmcqa

Related Benchmarks