Benchmark Hub

The Quantiles Benchmark Hub is a library of evaluations and metrics designed to reveal how AI models behave, especially in healthcare contexts. Rather than reducing a model to a single score, the benchmarks are task-focused: each probes a distinct dimension of model behavior, such as reasoning, factual accuracy, hallucination, calibration, robustness, or clinical safety.

What the Hub is for

The Benchmark Hub provides:

Curated descriptions of widely used and emerging AI evaluation benchmarks
Clear explanations of what each benchmark measures and how it is typically used
The strengths and limitations of commonly cited benchmarks
A common reference point for technical and clinical stakeholders

Benchmarks & Metrics

AUPRC

Ranking & Discrimination

Summarizes precision-recall performance across thresholds and is especially informative for imbalanced data.

Predicted scores + ground-truth labels | Probabilistic classification tasks | AUPRC
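As a rough illustration, AUPRC is often estimated by average precision: the mean of the precision values measured at the rank of each positive example. The sketch below assumes binary labels (0/1) and distinct scores; the function name `average_precision` is illustrative and not part of the Hub.

```python
def average_precision(scores, labels):
    """Average precision: mean of precision at the rank of each positive.

    Assumes binary labels (0 or 1) and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this cut-off
    if not precisions:
        raise ValueError("need at least one positive label")
    return sum(precisions) / len(precisions)


ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(f"average precision = {ap:.4f}")
```

Unlike AUROC, the baseline for this value is the positive-class prevalence rather than 0.5, which is why it is more informative when positives are rare.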

AUROC

Ranking & Discrimination

Summarizes ROC curve performance across thresholds by measuring ranking quality between positives and negatives.

Predicted scores + ground-truth labels | Binary classification (ranking) | AUROC
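One way to read AUROC as a ranking measure: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A minimal sketch of that pairwise definition follows; the function name `auroc` is illustrative, and the quadratic loop is meant for clarity rather than speed.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


print(auroc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```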

BBQ

Bias & Fairness

Question-answering benchmark for detecting social bias through stereotype reliance under ambiguous context.

Question with ambiguous or disambiguated context | Paired ambiguous and disambiguated QA prompts | Stereotype preference · accuracy · demographic breakdowns

BERTScore

Text Generation - Semantic Similarity

Contextual embedding-based similarity metric for scoring generated text against reference outputs.

Model output + one or more reference strings | Text generation tasks with references | BERTScore (Precision/Recall/F1)

BLEU

Text Generation - Surface Similarity

Corpus-level n-gram overlap metric, originating in machine translation, that scores generated text against reference translations or summaries.

Model output + one or more reference strings | Text generation tasks with references | BLEU score
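To make the n-gram overlap idea concrete, here is a simplified sketch: clipped n-gram precision combined with a brevity penalty. It is an assumption-laden toy, not standard BLEU: it scores a single sentence against a single reference, uses whitespace tokenization, stops at bigrams (`max_n=2`), and applies no smoothing, whereas BLEU proper is corpus-level and typically uses up to 4-grams. The names `ngram_counts` and `simple_bleu` are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Toy sentence-level BLEU: clipped n-gram precision x brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages very short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat", "the cat sat"))
```

Clipping is what prevents a degenerate output such as "the the the" from scoring well against a reference that contains "the" only once.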

BOLD

Bias & Fairness

Bias in Open-ended Language Generation benchmark with Wikipedia-derived prompts across profession, gender, race, religion, and political ideology.

Natural language prompts grouped by demographic domain and subgroup | Wikipedia-derived open-ended prompt dataset | Sentiment · Regard · Toxicity · Psycholinguistic norms · Gender polarity

Brier Score

Calibration & Trustworthiness

Proper scoring rule that measures the mean squared error between predicted probabilities and observed binary outcomes; used to assess calibration and reliability.

Predicted probabilities + ground-truth labels | Binary probabilistic classification | Brier score
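A minimal sketch of the binary Brier score as described above, assuming labels are 0 or 1 and predictions are probabilities of the positive class. The function name `brier_score` is illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


print(brier_score([0.9, 0.2], [1, 0]))
```

Lower is better: 0 corresponds to perfectly confident, correct predictions, and always predicting 0.5 yields 0.25.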

Calibration Curve

Calibration & Trustworthiness

Reliability diagram showing how predicted probabilities align with observed outcome frequencies across bins.

Predicted probabilities + ground-truth labels | Probabilistic classification tasks | Calibration curve
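The points of a reliability diagram can be computed by binning predictions and comparing each bin's mean predicted probability with its observed positive rate. The sketch below assumes equal-width bins over [0, 1] and binary labels; the function name `calibration_points` and the default `n_bins=5` are illustrative choices.

```python
def calibration_points(probs, labels, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) per bin.

    Uses equal-width bins over [0, 1]; empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    points = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            points.append((mean_pred, frac_pos, len(members)))
    return points


for mean_pred, frac_pos, count in calibration_points(
    [0.1, 0.15, 0.9, 0.95], [0, 0, 1, 1], n_bins=2
):
    print(f"predicted={mean_pred:.3f} observed={frac_pos:.3f} n={count}")
```

A well-calibrated model produces points close to the diagonal, where predicted probability equals observed frequency.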

CrowS-Pairs

Bias & Fairness

Bias benchmark using minimal pairs to measure preference for stereotyped vs. anti-stereotyped sentences.

Sentence pairs with stereotype label | Crowdsourced sentence pairs with demographic attributes | Likelihood gap · bias direction · stereotype score

Decision Curve Analysis

Clinical Utility & Decision Support

Clinical utility evaluation that compares net benefit across decision thresholds to determine whether a model improves outcomes versus treat-all or treat-none strategies.

Predicted probabilities + ground-truth labels | Binary classification with risk predictions | Net benefit curve
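Net benefit at a decision threshold is commonly defined as the true-positive rate per patient minus the false-positive rate per patient, weighted by the odds of the threshold. The sketch below uses that standard formulation and compares a model against the treat-all strategy (treat-none has a net benefit of 0 by definition). The function names and the toy data are illustrative.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    odds = threshold / (1 - threshold)  # harm of a false positive vs. benefit
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(labels, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


probs, labels = [0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]
for t in (0.2, 0.5):
    print(t, net_benefit(probs, labels, t), net_benefit_treat_all(labels, t))
```

Plotting these values over a range of thresholds gives the net benefit curve; a model is clinically useful at a threshold only if its curve sits above both reference strategies there.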

Expected Calibration Error

Calibration & Trustworthiness

Calibration metric that quantifies the discrepancy between predicted probabilities and observed accuracy across probability bins.

Predicted probabilities + ground-truth labels | Probabilistic classification tasks | ECE score
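For reference, the binned form of ECE is usually written as below. The notation is an assumption introduced here, not taken from the Hub: B is the number of bins, n_b the number of predictions in bin b, N the total number of predictions, acc(b) the observed accuracy (or positive rate) in the bin, and conf(b) the mean predicted probability in the bin.

```latex
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{n_b}{N}\,
  \bigl|\, \mathrm{acc}(b) - \mathrm{conf}(b) \,\bigr|
```

The value depends on the number of bins and the binning scheme, so ECE scores are only comparable when those choices match.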

F1 Score

Thresholded Decision Performance

Balanced metric that summarizes precision and recall into one harmonic-mean score for classification performance.

Discrete predictions + labels | Binary and multiclass classification (aggregated) | F1 score
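A minimal sketch of binary F1 as the harmonic mean of precision and recall, assuming 0/1 labels with 1 as the positive class. Returning 0.0 when there are no true positives is a convention chosen here; the function name `f1_score` is illustrative.

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention: undefined precision/recall scored as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(f1_score([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))
```

For multiclass tasks, this per-class value is typically aggregated by macro, micro, or weighted averaging.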

HealthBench

Evaluation Suites (Multi-task / Multi-domain)

Healthcare evaluation suite developed by OpenAI that assesses clinical, administrative, and patient-communication tasks using safety-aware scenarios and physician-authored, rubric-based scoring.

Task-specific prompts (clinical, admin, comms) | Curated health prompts and datasets (mix of sources) | Rubric evaluation · LLM-judge scoring

HELM

Evaluation Suites (Multi-task / Multi-domain)

A comprehensive evaluation framework for language models that standardizes tasks, prompts, metrics, and reporting across diverse tasks, domains, and use cases.

Task-specific prompts and references | Mixed public benchmarks across domains | Task-appropriate metrics · calibration · efficiency · robustness

HolisticBias

Bias & Fairness

Benchmark for measuring social bias across demographic attributes using templated prompts and model completions.

Templated prompts with demographic attributes | Synthetic, templated bias prompts | Likelihood Bias · Full and Partial Gen Bias · Offensiveness rate

Log Loss

Probabilistic Quality

Averages the negative log-likelihood of the true labels under the predicted probabilities, penalizing confident but incorrect predictions most heavily.

Predicted probabilities + ground-truth labels | Probabilistic classification tasks | Log loss
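A minimal sketch of binary log loss, assuming 0/1 labels and predicted probabilities for the positive class. The clipping constant `eps` is an assumption added to avoid taking the log of zero; the function name `log_loss` is illustrative.

```python
import math


def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# A confident wrong prediction costs far more than a hesitant one.
print(log_loss([0.01], [1]), log_loss([0.4], [1]))
```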

MCC

Agreement & Robustness

Correlation-based metric that uses all four confusion-matrix cells (true/false positives and negatives) and remains informative under class imbalance.

Discrete predictions + ground-truth labels | Binary classification tasks | MCC
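The Matthews correlation coefficient can be sketched directly from the four confusion-matrix counts. The example assumes 0/1 labels; returning 0.0 when the denominator is zero is a common convention adopted here, and the function name `mcc` is illustrative.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Always predicting the majority class: 90% accuracy, but MCC is 0.
y_true = [1] + [0] * 9
y_pred = [0] * 10
print(mcc(y_true, y_pred))
```

The majority-class example shows why MCC is preferred over accuracy on imbalanced data: a classifier that never finds a positive gets no credit.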

MedHELM

Evaluation Suites (Multi-task / Multi-domain)

A healthcare-focused evaluation suite that assesses large language models across 35 medical benchmarks covering clinical, biomedical, and healthcare-related tasks.

Task-specific prompts across clinical domains | Mix of public, gated, and private medical datasets | Task-appropriate metrics including accuracy, faithfulness, safety

MedMCQA

Knowledge & Question Answering

Large-scale multiple-choice medical QA benchmark built from AIIMS/NEET-PG style exam questions, used to evaluate medical knowledge recall and question-level reasoning.

Question stem + answer options | Medical exam-style multiple-choice QA | Accuracy

MedQA

Knowledge & Question Answering

USMLE-style medical multiple-choice QA benchmark (~12k items) evaluating diagnostic reasoning, treatment selection, and contraindication assessment across major clinical domains.

Question stem + answer options | De-identified clinical Q&A | Accuracy

MMLU

Knowledge & Question Answering

Broad multi-domain benchmark with ~15k questions across 57 subjects that evaluates general knowledge and multiple-choice reasoning.

Question stem + answer options | Public academic/professional exam-style QA | Accuracy

MT-Bench

Conversational & Instruction Following

Multi-turn conversational benchmark evaluated using LLM-as-judge scoring to assess instruction adherence, coherence, and response quality across dialogue turns.

Multi-turn dialogue history | Curated multi-turn prompts across task categories | LLM judge score · category scores

PPV & NPV

Thresholded Decision Performance

Positive predictive value (precision) and negative predictive value, measuring correctness for predicted positives and negatives.

Thresholded predictions + ground-truth labels | Binary classification (prevalence-dependent) | PPV · NPV
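The prevalence dependence noted above follows from Bayes' rule: with sensitivity and specificity held fixed, PPV and NPV shift as the share of true positives in the population changes. A minimal sketch under that assumption; the function name `predictive_values` and the example figures are illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence              # true positives per patient
    fp = (1 - specificity) * (1 - prevalence)  # false positives per patient
    tn = specificity * (1 - prevalence)        # true negatives per patient
    fn = (1 - sensitivity) * prevalence        # false negatives per patient
    return tp / (tp + fp), tn / (tn + fn)


# Same classifier (90% sensitivity and specificity), two populations.
for prevalence in (0.5, 0.1):
    ppv, npv = predictive_values(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence}: PPV={ppv:.3f} NPV={npv:.3f}")
```

In the lower-prevalence population, PPV falls sharply even though the classifier itself is unchanged, which is why PPV and NPV reported on one cohort do not transfer directly to another.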

PubMedQA

Knowledge & Question Answering

Biomedical research QA benchmark with ~1k questions that evaluates evidence-grounded answering using PubMed abstracts.

Question + PubMed abstract context | Curated PubMed-derived QA dataset | Accuracy + Macro-F1

ROUGE

Text Generation - Surface Similarity

Recall-oriented overlap metrics (ROUGE-N, ROUGE-L, ROUGE-Lsum) for comparing generated summaries to reference summaries.

Model output + one or more reference strings | Text generation tasks with references | ROUGE-N · ROUGE-L · ROUGE-Lsum
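A simplified sketch of the recall side of ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence). It assumes whitespace tokenization, a single reference, and no stemming; standard ROUGE implementations also report precision and F-measure. The function names are illustrative.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


cand, ref = "the cat sat on the mat", "the cat lay on the mat"
print(rouge_n_recall(cand, ref, 1), rouge_n_recall(cand, ref, 2), rouge_l_recall(cand, ref))
```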

Sensitivity & Specificity

Thresholded Decision Performance

Companion metrics measuring true positive rate (sensitivity) and true negative rate (specificity).

Thresholded predictions + ground-truth labels | Binary classification tasks | Sensitivity · Specificity
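Both rates can be read off the confusion matrix. The sketch below assumes 0/1 labels and that both classes are present in `y_true`; the function name `sensitivity_specificity` is illustrative.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Because each rate is computed within one true class, neither depends on prevalence, in contrast to PPV and NPV.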

TruthfulQA

Knowledge & Question Answering

Question-answering benchmark designed to detect model falsehoods and overconfident incorrect answers across adversarial prompts.

Question with optional answer choices | Curated adversarial QA prompts | Truthfulness · informativeness

WILDS

Distribution Shift & Robustness

Benchmark suite of real-world datasets designed to evaluate model robustness under distribution shift, with explicit domain and subgroup metadata.

Task-specific inputs + labels + domain metadata | Multi-domain datasets across modalities | Average and worst-group performance
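To illustrate the difference between average and worst-group performance, the sketch below computes accuracy per group from hypothetical group metadata. It is a generic illustration with made-up data and does not use the WILDS package or its datasets; the function name `group_accuracies` is illustrative.

```python
from collections import defaultdict


def group_accuracies(preds, labels, groups):
    """Accuracy per group, keyed by group identifier."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical predictions from two sites, "a" and "b".
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]

per_group = group_accuracies(preds, labels, groups)
average = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
worst = min(per_group.values())
print(f"average={average:.3f} worst-group={worst:.3f} per-group={per_group}")
```

A large gap between the two numbers signals that aggregate performance is masking a subgroup or domain on which the model performs poorly.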