Benchmarks

MedMCQA

Large-scale medical multiple-choice question answering benchmark based on professional entrance and licensing exam-style questions.

Overview

MedMCQA is a large medical multiple-choice QA benchmark that evaluates clinical and biomedical knowledge across major specialties. Questions are exam-like and intended to test knowledge retrieval and short-form reasoning under constrained answer options.

The model receives a question stem and answer choices and must select one option. Evaluation is usually performed in closed-form multiple-choice mode. If a model outputs free text, predictions should be deterministically normalized to one option label prior to scoring.

Dataset Specification

Size

Approximately 194k question-answer items (exact counts vary by preprocessing and split), making it substantially larger than many prior medical QA benchmarks.

Source

Questions are derived from AIIMS and NEET-PG style medical entrance and licensing exam content covering subjects including medicine, surgery, pathology, pharmacology, and related disciplines.

Input Format

  • question: string
  • options: list of strings
  • answer: option label or index (ground truth), not provided to the model at inference time

Model input example (the answer field is omitted at inference time):


{
  "question": "Which vitamin deficiency causes megaloblastic anemia?",
  "options": ["Vitamin C", "Vitamin B12", "Vitamin D", "Vitamin K"]
}
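An input item in the format above can be rendered into a closed-form multiple-choice prompt. The template below is a minimal sketch; MedMCQA does not mandate a specific prompt format, and the exact wording is an assumption.

```python
def format_prompt(item: dict) -> str:
    """Render a question dict into a lettered multiple-choice prompt.

    Illustrative template only; real harnesses vary the instruction text,
    label style, and few-shot context.
    """
    labels = "ABCD"
    lines = [item["question"]]
    for label, option in zip(labels, item["options"]):
        lines.append(f"{label}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The trailing "Answer:" cue nudges the model toward emitting a bare label, which simplifies the normalization step at scoring time.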

Output Format

A single option selection: either a label (A, B, C, D) or the option's text, which must map deterministically to exactly one label.

{
  "answer": "B"
}

Metrics

  • Accuracy (primary): fraction of questions where the predicted option matches the ground-truth option.
    \text{Accuracy} = \frac{\text{correct}}{\text{total}}
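A minimal scorer for this metric might look like the following sketch. The convention of counting unparseable predictions (here `None`) as incorrect is our own assumption; it keeps the denominator fixed at the full question count.

```python
def accuracy(predictions: list, references: list[str]) -> float:
    """Fraction of questions where the predicted label matches the reference.

    Predictions that could not be normalized to a label (None) count as
    incorrect rather than being dropped from the denominator.
    """
    assert len(predictions) == len(references), "length mismatch"
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```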

Known Limitations

  • Exam-style multiple-choice format can reward recognition patterns that do not fully reflect real clinical reasoning in practice.
  • Performance may overstate practical utility in patient-specific workflows where context is longitudinal and often incomplete.
  • Accuracy alone does not measure calibration, uncertainty handling, or safety under ambiguous or high-risk cases.
  • Dataset variants and preprocessing choices can affect comparability if split definitions are not consistently tracked.

Versioning and Provenance

MedMCQA is distributed through multiple repositories and processed variants. For reproducibility, record dataset source, revision/config, split, label-mapping policy, and any preprocessing used in each run.
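One way to track the provenance fields listed above is a small run manifest. The field names and the hashing convention here are our own; the sketch only shows that serializing the fields deterministically lets identical configurations be detected by hash.

```python
import hashlib
import json

def run_manifest(source: str, revision: str, split: str,
                 label_policy: str, preprocessing: list[str]) -> str:
    """Serialize run provenance to deterministic JSON plus a short hash.

    Hypothetical helper: sort_keys makes the JSON byte-stable, so two runs
    with identical configuration produce identical digests.
    """
    manifest = {
        "dataset_source": source,
        "revision": revision,
        "split": split,
        "label_mapping_policy": label_policy,
        "preprocessing": preprocessing,
    }
    blob = json.dumps(manifest, sort_keys=True)
    digest = hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]
    return f"{digest} {blob}"
```

Storing the manifest alongside each run's scores makes cross-run comparisons auditable when dataset variants or split definitions differ.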

References

Pal et al., 2022. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering.

Paper: https://arxiv.org/abs/2203.14371

GitHub Repository: https://github.com/medmcqa/medmcqa

Related Benchmarks