Benchmark Hub
The Quantiles Benchmark Hub is a library of evaluations and metrics designed to reveal how AI models behave, especially in healthcare contexts. Rather than collapsing performance into a single score, the benchmarks are task-focused and probe distinct dimensions of model behavior, including reasoning, factual accuracy, hallucinations, calibration, robustness, and clinical safety.
What the Hub is for
The Benchmark Hub serves as a centralized reference for:
- Curated descriptions of widely used and emerging AI evaluation benchmarks
- Clear explanations of what each benchmark measures and how it is typically used
- The strengths and limitations of commonly cited benchmarks
- A common reference point for technical and clinical stakeholders
Benchmarks & Metrics
AUPRC
Area under the precision-recall curve. Summarizes precision-recall trade-offs across decision thresholds and is especially informative for imbalanced data.
AUROC
Area under the receiver operating characteristic curve. Summarizes performance across thresholds by measuring how well the model ranks positives above negatives.
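A minimal sketch of computing both ranking metrics with scikit-learn; the labels and scores below are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Illustrative labels and predicted scores for a binary classifier.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.80, 0.05, 0.90])

print("AUROC:", roc_auc_score(y_true, y_score))            # ranking quality across thresholds
print("AUPRC:", average_precision_score(y_true, y_score))  # average precision summarizes the PR curve
```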
BBQ
Question-answering benchmark for detecting social bias through stereotype reliance in ambiguous contexts.
BERTScore
Contextual embedding-based similarity metric for scoring generated text against reference outputs.
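A minimal sketch assuming the open-source `bert-score` package is installed; the sentence pair is illustrative:

```python
from bert_score import score

candidates = ["The patient was discharged in stable condition."]
references = ["The patient left the hospital in stable condition."]

# Returns per-example precision, recall, and F1 tensors based on contextual embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```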
BLEU
Corpus-level n-gram overlap metric (MT-originated) used to score generated text against reference translations or summaries.
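A minimal sketch assuming the `sacrebleu` package; the hypothesis and reference are illustrative:

```python
import sacrebleu

hypotheses = ["the patient denies chest pain or shortness of breath"]
references = [["the patient denies any chest pain or shortness of breath"]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # corpus-level score on a 0-100 scale
```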
Brier Score
Proper scoring rule measuring the mean squared error between predicted probabilities and observed binary outcomes; used to assess calibration and reliability.
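A minimal sketch with scikit-learn; the outcomes and probabilities are illustrative:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Illustrative binary outcomes and predicted event probabilities.
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.80, 0.60, 0.30, 0.90])

# Equivalent to np.mean((y_prob - y_true) ** 2); 0 is perfect, lower is better.
print("Brier score:", brier_score_loss(y_true, y_prob))
```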
Calibration Curve
Reliability diagram showing how predicted probabilities align with observed outcome frequencies across bins.
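A minimal sketch of the binned reliability values with scikit-learn, using simulated probabilities:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)                          # illustrative predicted probabilities
y_true = (rng.uniform(size=1000) < y_prob).astype(int)   # outcomes simulated from those probabilities

# Mean predicted probability vs. observed event rate in each of 10 bins.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```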
CrowS-Pairs
Bias benchmark using minimal pairs to measure preference for stereotyped vs. anti-stereotyped sentences.
Decision Curve Analysis
Clinical utility evaluation that compares net benefit across decision thresholds to determine whether acting on a model's predictions outperforms treat-all or treat-none strategies.
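A minimal sketch of the standard net-benefit calculation; the outcomes, risks, and 20% threshold are illustrative:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk meets `threshold`."""
    treat = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

# Illustrative outcomes and risks; compare the model with treat-all at a 20% risk threshold.
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.10, 0.70, 0.20, 0.05, 0.90, 0.60, 0.30, 0.15, 0.80, 0.25])

pt = 0.20
print("model:    ", round(net_benefit(y_true, y_prob, pt), 3))
print("treat-all:", round(net_benefit(y_true, np.ones_like(y_prob), pt), 3))
print("treat-none: 0.0")
```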
Expected Calibration Error
Calibration metric that quantifies the discrepancy between predicted probabilities and observed accuracy across probability bins.
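A minimal sketch of an equal-width binned ECE; the binning scheme and data are illustrative:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width binned ECE for binary probabilistic predictions."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = (y_prob <= hi) if hi == 1.0 else (y_prob < hi)
        mask = (y_prob >= lo) & upper
        if not mask.any():
            continue
        gap = abs(y_prob[mask].mean() - y_true[mask].mean())  # mean confidence vs. observed event rate
        ece += mask.mean() * gap                              # weighted by the share of samples in the bin
    return ece

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.20, 0.90, 0.70, 0.40, 0.80, 0.10, 0.60, 0.95])
print(f"ECE: {expected_calibration_error(y_true, y_prob, n_bins=5):.3f}")
```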
F1 Score
Balanced metric that summarizes precision and recall into one harmonic-mean score for classification performance.
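A short illustration with scikit-learn on made-up predictions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative binary predictions.
y_true = [0, 1, 1, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
# F1 is the harmonic mean: 2 * p * r / (p + r).
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1_score(y_true, y_pred):.2f}")
```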
HealthBench
Healthcare evaluation suite developed by OpenAI that assesses clinical, administrative, and patient-communication tasks using safety-aware scenarios and physician-authored, rubric-based scoring.
HELM
A comprehensive evaluation framework for language models that standardizes scenarios, prompts, metrics, and reporting across diverse tasks, domains, and use cases.
HolisticBias
Benchmark for measuring social bias across demographic attributes using templated prompts and model completions.
Log Loss
Aggregates probabilistic prediction error as the negative average log-likelihood of the true labels, penalizing confident wrong predictions most heavily.
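A minimal sketch with scikit-learn; the labels and probabilities are illustrative:

```python
from sklearn.metrics import log_loss

# Illustrative binary labels and predicted probabilities of the positive class.
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.6, 0.4]

# Equivalent to -mean(y*log(p) + (1-y)*log(1-p)); confident mistakes are penalized most.
print("log loss:", log_loss(y_true, y_prob))
```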
MCC
Matthews correlation coefficient: a correlation-based metric that accounts for true/false positives and negatives and is robust to class imbalance.
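Another scikit-learn sketch on made-up predictions:

```python
from sklearn.metrics import matthews_corrcoef

# Illustrative binary predictions.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect prediction).
print("MCC:", matthews_corrcoef(y_true, y_pred))
```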
MedHELM
A healthcare-focused evaluation suite that assesses large language models across 35 medical benchmarks covering clinical, biomedical, and healthcare-related tasks.
MedQA
USMLE-style medical multiple-choice QA benchmark (~12k items) evaluating diagnostic reasoning, treatment selection, and contraindication assessment across major clinical domains.
MMLU
Broad multi-domain benchmark with ~15k questions across 57 subjects that evaluates general knowledge and multiple-choice reasoning.
MT-Bench
Multi-turn conversational benchmark evaluated using LLM-as-judge scoring to assess instruction adherence, coherence, and response quality across dialogue turns.
PPV & NPV
Positive predictive value (precision) and negative predictive value, measuring correctness for predicted positives and negatives.
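A minimal sketch deriving both values from a confusion matrix with scikit-learn; the predictions are illustrative:

```python
from sklearn.metrics import confusion_matrix

# Illustrative binary predictions.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ppv = tp / (tp + fp)  # precision: how often a positive prediction is correct
npv = tn / (tn + fn)  # how often a negative prediction is correct
print(f"PPV: {ppv:.2f}, NPV: {npv:.2f}")
```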
PubMedQA
Biomedical research QA benchmark with ~1k questions that evaluates evidence-grounded answering using PubMed abstracts.
ROUGE
Recall-oriented overlap metrics (ROUGE-N, ROUGE-L, ROUGE-Lsum) for comparing generated summaries to reference summaries.
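A minimal sketch assuming the `rouge-score` package; the reference and generated text are illustrative:

```python
from rouge_score import rouge_scorer

reference = "the patient was discharged home in stable condition"
generated = "patient discharged home, condition stable"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # argument order is (target, prediction)
for name, s in scores.items():
    print(f"{name}: recall={s.recall:.2f}, f1={s.fmeasure:.2f}")
```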
Sensitivity & Specificity
Companion metrics measuring true positive rate (sensitivity) and true negative rate (specificity).
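A brief scikit-learn sketch treating specificity as recall of the negative class; the predictions are illustrative:

```python
from sklearn.metrics import recall_score

# Illustrative binary predictions.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

sensitivity = recall_score(y_true, y_pred)               # true positive rate: positives correctly flagged
specificity = recall_score(y_true, y_pred, pos_label=0)  # true negative rate: negatives correctly cleared
print(f"sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}")
```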
TruthfulQA
Question-answering benchmark designed to detect model falsehoods and overconfident incorrect answers across adversarial prompts.
WILDS
Benchmark suite of real-world datasets designed to evaluate model robustness under distribution shift, with explicit domain and subgroup metadata.