Decision Curve Analysis

Decision curve analysis evaluates clinical utility by comparing net benefit across thresholds to treat-all and treat-none strategies.

Overview

Decision curve analysis (DCA) is a benchmark methodology for evaluating whether a risk model delivers clinical benefit across a range of decision thresholds. It does so by comparing model-guided decisions to default policies such as treating everyone or treating no one.

Rather than focusing on discrimination or calibration alone, DCA evaluates decision policies derived from model predictions, asking whether acting on those predictions improves outcomes when the costs of false positives and false negatives differ. The decision threshold encodes that cost tradeoff: choosing a threshold p_t weights each false positive by the odds p_t / (1 - p_t) relative to a true positive. The resulting net benefit curve makes these tradeoffs explicit and identifies thresholds where the model provides meaningful clinical value.

Extensions of decision curve analysis exist for time-to-event outcomes and alternative decision settings, but standard binary-outcome DCA remains the most common and interpretable form.

Dataset Specification

Size

No fixed dataset size. DCA is cohort-dependent and applicable to datasets of varying scale; stability improves with larger samples, especially for low-prevalence outcomes.
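One way to gauge how stable a net benefit estimate is for a given cohort is a percentile bootstrap. The sketch below is illustrative rather than a prescribed procedure: the function names, the 1,000 resamples, the fixed seed, and the rule that a prediction equal to the threshold counts as "treat" are all assumptions. It uses the net benefit formula given under Metrics.

```python
import random


def net_benefit(predictions, labels, threshold):
    """Net benefit at one threshold: TP/N - FP/N * p_t / (1 - p_t)."""
    n = len(labels)
    weight = threshold / (1.0 - threshold)
    # Assumption: predictions equal to the threshold are treated as positive.
    tp = sum(1 for p, y in zip(predictions, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p >= threshold and y == 0)
    return tp / n - fp / n * weight


def bootstrap_interval(predictions, labels, threshold,
                       n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for net benefit (resampling subjects)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stats.append(net_benefit([predictions[i] for i in idx],
                                 [labels[i] for i in idx],
                                 threshold))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

A wide interval at a threshold of interest suggests the cohort is too small, or the outcome too rare, to support a conclusion at that threshold.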

Source

Not tied to a single dataset. DCA applies to labeled cohorts with binary outcomes and model-generated risk predictions, including clinical trial data, observational studies, registries, and real-world clinical data.

Input Format

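predictions: array of predicted probabilities in [0, 1], one per subject (raw risk scores must be mapped to probabilities first, because thresholds are probabilities)
labels: array of binary outcomes (1 = event, 0 = no event), aligned with predictions
thresholds: list of decision thresholds to evaluate, each a probability in (0, 1)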

Example:

{
  "predictions": [0.91, 0.62, 0.44, 0.2, 0.05],
  "labels": [1, 1, 0, 0, 0],
  "thresholds": [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
}
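Before computing anything, it can help to check that a payload has the shape described above. The following is a minimal sketch; `validate_dca_input` is a hypothetical helper name, and the specific rules it enforces (probabilities in [0, 1], thresholds strictly between 0 and 1) are assumptions about what a reasonable implementation would require.

```python
import math


def validate_dca_input(payload):
    """Check a DCA input payload; raise ValueError on the first problem found."""
    preds = payload["predictions"]
    labels = payload["labels"]
    thresholds = payload["thresholds"]

    if len(preds) == 0:
        raise ValueError("predictions must not be empty")
    if len(preds) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    for p in preds:
        if not (isinstance(p, (int, float)) and math.isfinite(p) and 0.0 <= p <= 1.0):
            raise ValueError("predictions must be finite probabilities in [0, 1]")
    if any(y not in (0, 1) for y in labels):
        raise ValueError("labels must be 0 or 1")
    # Assumption: thresholds of exactly 0 or 1 are rejected, since the
    # weight p_t / (1 - p_t) is undefined at 1 and degenerate at 0.
    if any(not (0.0 < t < 1.0) for t in thresholds):
        raise ValueError("thresholds must lie strictly between 0 and 1")
    return True
```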

Output Format

Net benefit values computed at each threshold, typically returned as a curve for the model and for the treat-all and treat-none baseline strategies.

Example (illustrative values, not computed from the input example above):

{
  "net_benefit": [
    { "threshold": 0.1, "model": 0.12, "treat_all": 0.08, "treat_none": 0.0 },
    { "threshold": 0.2, "model": 0.09, "treat_all": 0.04, "treat_none": 0.0 }
  ]
}

Metrics

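Net benefit: benefit of treating true positives minus a threshold-weighted penalty for treating false positives. TP and FP are the counts of true and false positives when predictions are dichotomized at the decision threshold p_t, and N is the total cohort size.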
$\text{NB} = \frac{\text{TP}}{N} - \frac{\text{FP}}{N} \cdot \frac{p_t}{1 - p_t}$
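Baseline strategies: compare against the treat-all and treat-none curves. Treat-none has a net benefit of 0 at every threshold; treat-all has $\text{NB}_{\text{all}} = \pi - (1 - \pi) \cdot \frac{p_t}{1 - p_t}$, where $\pi$ is the outcome prevalence. The model offers positive incremental benefit at thresholds where its net benefit exceeds both baselines.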
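The formula and baselines above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the reference one: `net_benefit_curve` is a hypothetical name, the returned structure simply mirrors the Output Format section, and the choice to treat a prediction equal to the threshold as positive is an assumption that should be documented for any real run.

```python
def net_benefit_curve(predictions, labels, thresholds):
    """Net benefit of the model, treat-all, and treat-none at each threshold."""
    n = len(labels)
    prevalence = sum(labels) / n
    rows = []
    for pt in thresholds:
        weight = pt / (1.0 - pt)  # odds of the threshold
        # Assumption: ties (prediction == threshold) count as "treat".
        tp = sum(1 for p, y in zip(predictions, labels) if p >= pt and y == 1)
        fp = sum(1 for p, y in zip(predictions, labels) if p >= pt and y == 0)
        rows.append({
            "threshold": pt,
            "model": tp / n - fp / n * weight,
            # Treat-all: every event is a TP, every non-event is an FP.
            "treat_all": prevalence - (1.0 - prevalence) * weight,
            "treat_none": 0.0,
        })
    return {"net_benefit": rows}


result = net_benefit_curve(
    predictions=[0.91, 0.62, 0.44, 0.2, 0.05],
    labels=[1, 1, 0, 0, 0],
    thresholds=[0.2, 0.5],
)
for row in result["net_benefit"]:
    print(row)
```

On the five-subject example from the Input Format section, the model's net benefit at a threshold of 0.2 is 2/5 - (2/5)(0.25) = 0.30, against 0.25 for treat-all. At 0.5 it is 0.40, against -0.20 for treat-all. A cohort this small is only useful for checking the arithmetic.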

Known Limitations

Sensitive to calibration: thresholds are applied directly to predicted probabilities, so poorly calibrated models can yield misleading net benefit curves.
Interpretation depends on clinical context and the true costs of false positives and false negatives.
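Threshold ranges should be limited to values that are clinically plausible for the decision at hand; at extreme thresholds the weight p_t / (1 - p_t) becomes very small or very large, which can distort comparisons.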
Does not replace causal evaluation of downstream clinical outcomes.
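To make the dependence on cost assumptions and threshold range concrete, the short sketch below prints the weight that the net benefit formula places on each false positive at several thresholds. The helper name `false_positive_weight` and the thresholds chosen are illustrative assumptions.

```python
def false_positive_weight(threshold):
    """Odds of the threshold: the penalty per false positive, in true-positive units."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return threshold / (1.0 - threshold)


for pt in (0.05, 0.2, 0.5, 0.9, 0.99):
    print(f"p_t={pt:.2f}  weight per false positive={false_positive_weight(pt):.2f}")
```

The weight grows from roughly 0.05 at a threshold of 0.05 to about 99 at 0.99. Near the upper end, a handful of false positives can dominate the result, which is one reason to report only thresholds a clinician would plausibly use.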

Versioning and Provenance

Net benefit calculations vary based on threshold grids, prevalence, and the definition of a positive outcome. For reproducibility, document the cohort, outcome definition, threshold range, and how ties or missing predictions are handled.
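One lightweight way to capture these details is to store a small metadata record alongside each set of results. The sketch below is only a suggestion: the class `DCARunRecord`, its field names, and the example values are hypothetical and do not come from any DCA library or standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DCARunRecord:
    """Hypothetical provenance record for one decision curve analysis run."""
    cohort: str                   # which cohort or data extract was used
    outcome_definition: str       # what counts as a positive outcome
    threshold_min: float          # lower end of the threshold grid
    threshold_max: float          # upper end of the threshold grid
    threshold_step: float         # grid spacing
    tie_rule: str                 # how prediction == threshold is handled
    missing_prediction_rule: str  # how rows without a prediction are handled


record = DCARunRecord(
    cohort="example-cohort (placeholder)",
    outcome_definition="example outcome definition (placeholder)",
    threshold_min=0.05,
    threshold_max=0.5,
    threshold_step=0.05,
    tie_rule="treat if prediction >= threshold",
    missing_prediction_rule="exclude row and report count",
)
print(json.dumps(asdict(record), indent=2))
```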

References

Vickers & Elkin, 2006. Decision curve analysis: a novel method for evaluating prediction models.

Paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC2577036/

GitHub: https://github.com/MSKCC-Epi-Bio/dcurves

Related Benchmarks

HealthBench

Evaluation Suites (Multi-task / Multi-domain)

Healthcare evaluation suite developed by OpenAI that assesses clinical, administrative, and patient-communication tasks using safety-aware scenarios and physician-authored, rubric-based scoring.

Task-specific prompts (clinical, admin, comms)
Curated health prompts and datasets (mix of sources)
Rubric evaluation · LLM-judge scoring