Benchmarks
PubMedQA
Biomedical question-answering benchmark using PubMed abstracts to test factuality and evidence-grounded reasoning.
Overview
PubMedQA is a biomedical QA benchmark built from PubMed abstracts. It focuses on factual correctness and evidence-grounded reasoning over short scientific contexts, emphasizing evidence-based claim verification over exam-style recall. It is not a clinical safety or deployment benchmark: tasks are grounded in the research literature rather than in patient-specific workflows.
The model receives a question and an associated abstract or context snippet and must output the correct label or short answer. Evaluation is typically closed-form over the label set, with free-text mapped to the allowed labels before scoring.
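For concreteness, the Python sketch below shows one way to format a record into a closed-form prompt over the three labels. The prompt wording and the query_model stub are illustrative assumptions, not part of an official PubMedQA harness.

# Minimal sketch of turning a PubMedQA record into a closed-form prompt.
# The prompt template and query_model stub are illustrative assumptions.

LABELS = ("yes", "no", "maybe")

def build_prompt(record: dict) -> str:
    """Format a question/context pair into a prompt constrained to the label set."""
    return (
        "Answer the question using only the abstract below. "
        f"Reply with one word: {', '.join(LABELS)}.\n\n"
        f"Abstract: {record['context']}\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )

def query_model(prompt: str) -> str:
    """Placeholder for an actual model call (API or local inference)."""
    raise NotImplementedError

record = {
    "question": "Does metformin use associate with lower cancer incidence?",
    "context": "We evaluated observational studies on metformin use and cancer risk...",
}
# prediction = query_model(build_prompt(record))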
Dataset Specification
Size
Approximately 1k expert-labeled biomedical question-answer pairs used for evaluation, with a standard 500-question test set and the remaining labeled items forming the dev split. Additional unlabeled and artificially generated corpora are available but are not used in standard scoring.
Source
Publicly available PubMed abstracts and derived question-answer pairs spanning biomedical research topics across multiple clinical and scientific specialties.
Input Format
question: string
context: string (PubMed abstract or snippet without conclusion)
answer: "yes" | "no" | "maybe" (ground truth)
long_answer (optional): string (abstract conclusion / rationale; present in dataset but not used for scoring)
Model input example (answer field omitted at inference time):
{
"question": "Does metformin use associate with lower cancer incidence?",
"context": "We evaluated observational studies on metformin use and cancer risk...",
}Output Format
A single label from the allowed set: "yes", "no", or "maybe". Free-text outputs are normalized to the label set before scoring.
{
"answer": "yes"
}
Metrics
- Accuracy (primary): fraction of questions where the predicted label matches the ground truth. The small, discrete label space makes accuracy the standard primary metric.
- Optional: calibrated accuracy, confidence-weighted accuracy, latency, and faithfulness to provided context.
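A minimal scoring sketch is shown below, assuming a simple prefix/substring heuristic for mapping free-text outputs onto the three labels; the normalization rules are illustrative, not the official evaluation script.

# Sketch of label normalization and accuracy scoring for PubMedQA-style outputs.
# The normalization heuristics are illustrative assumptions; the official
# evaluation expects exact "yes" / "no" / "maybe" labels.

LABELS = ("yes", "no", "maybe")

def normalize(raw_output):
    """Map a free-text model answer onto the allowed label set (or None)."""
    text = raw_output.strip().lower()
    for label in LABELS:
        if text == label or text.startswith(label):
            return label
    # Fall back to the first allowed label mentioned anywhere in the output.
    for label in LABELS:
        if label in text:
            return label
    return None  # unparseable output, scored as incorrect

def accuracy(predictions, references):
    """Primary metric: fraction of predictions matching the ground-truth label."""
    correct = sum(normalize(p) == r.lower() for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy(["Yes, the data suggest so.", "maybe"], ["yes", "no"]))  # 0.5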
Known Limitations
- Small labeled evaluation set with coarse answer labels (yes / no / maybe), limiting resolution and robustness of performance comparisons.
- Based on PubMed abstracts rather than full-text articles, which can omit critical methodological details or evidentiary context.
- Abstract-focused questions encourage shallow evidence extraction and may overstate the strength of reported findings.
- Limited ability to assess nuanced scientific reasoning, including distinguishing correlation from causation and interpreting hedging language such as “may” or “suggests.”
- Susceptible to hallucinated or unsupported claims when models rely on prior knowledge rather than the provided abstract.
- Borderline cases are difficult to label reliably and may be misclassified due to inconclusive abstracts.
- Not designed to evaluate clinical safety, patient-level decision-making, or real-world deployment behavior.
Versioning and Provenance
PubMedQA has multiple processed releases and community variants (different label schemes and splits). Record the dataset version, preprocessing steps, and split definitions used for each evaluation to ensure reproducibility.
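One lightweight way to do this is to log a small metadata record alongside each results file. The Python sketch below is illustrative; the field names are assumptions, not a required schema.

# Illustrative run-metadata record stored next to evaluation results.
# Field names and values are examples only, not a required schema.
run_metadata = {
    "dataset": "PubMedQA",
    "subset": "PQA-L (expert-labeled)",
    "split": "standard 500-question test set",
    "dataset_version": "<release tag or commit hash>",
    "preprocessing": "context without conclusion; long_answer excluded from input",
    "label_normalization": "lowercase, prefix match to yes/no/maybe",
}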
References
Jin et al., 2019. PubMedQA: A Dataset for Biomedical Research Question Answering.
Paper: https://arxiv.org/abs/1909.06146
GitHub Repository: https://github.com/pubmedqa/pubmedqa
Related Benchmarks
MedQA
USMLE-style medical multiple-choice QA benchmark (~12k items) evaluating diagnostic reasoning, treatment selection, and contraindication assessment across major clinical domains.
MMLU
Broad multi-domain benchmark with ~15k questions across 57 subjects that evaluates general knowledge and multiple-choice reasoning.