Calibration Curve
Visualizes how predicted probabilities align with observed outcome frequencies across a dataset.
Overview
A calibration curve, also known as a reliability diagram, is an evaluation artifact used to assess whether a model’s predicted probabilities are well calibrated, most commonly for binary classification tasks. It compares predicted probabilities to observed outcome frequencies, answering the question: when a model predicts a given probability, how often does that outcome occur?
Calibration curves are not benchmarks themselves. They are applied after running a benchmark or evaluation that produces probabilistic outputs. In healthcare AI and other high-stakes domains, they are commonly used alongside metrics like Brier score and Expected Calibration Error (ECE) to diagnose overconfidence, underconfidence, and systematic probability misalignment.
A well-calibrated model produces probabilities that closely reflect observed event rates, which is critical for clinical risk prediction, triage, and decision support. Calibration evaluates probability reliability, in contrast to discrimination metrics (e.g., AUROC) that measure ranking or separation ability.
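As a minimal illustration of this idea, consider a small, hypothetical set of predictions: among examples scored near 0.8, a well-calibrated model should see roughly 80% positive outcomes.

import numpy as np

# Hypothetical scores and outcomes: among cases predicted near 0.8,
# a well-calibrated model should observe roughly 80% positives.
predictions = np.array([0.78, 0.81, 0.83, 0.79, 0.80])
labels = np.array([1, 1, 0, 1, 1])

mask = (predictions >= 0.75) & (predictions < 0.85)
print(predictions[mask].mean())  # mean predicted probability ~0.80
print(labels[mask].mean())       # observed outcome rate = 0.80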
Input Format
predictions: array of model-predicted probabilities in the [0, 1] range
labels: array of ground-truth binary outcomes encoded as 0 or 1, one per example
Example:
{
  "predictions": [0.82, 0.14, 0.67, 0.91],
  "labels": [1, 0, 1, 1]
}
Output Format
A calibration curve plotted across probability bins, optionally accompanied by summary statistics.
{
  "binning": {
    "n_bins": 10,
    "strategy": "uniform"
  },
  "bins": [
    {
      "mean_predicted": 0.15,
      "observed_rate": 0.12,
      "count": 87
    },
    {
      "mean_predicted": 0.45,
      "observed_rate": 0.41,
      "count": 102
    },
    {
      "mean_predicted": 0.82,
      "observed_rate": 0.79,
      "count": 64
    }
  ]
}
Optional outputs may include:
- Confidence intervals per bin
- Sample counts per bin
- Overlay with perfect-calibration reference line
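A sketch of how such a report might be assembled with scikit-learn's calibration_curve, using hypothetical arrays. The report keys simply mirror the example above; calibration_curve returns only the per-bin observed rates and mean predicted probabilities (dropping empty bins), so per-bin counts would need to be recomputed separately.

import numpy as np
from sklearn.calibration import calibration_curve

predictions = np.array([0.82, 0.14, 0.67, 0.91, 0.55, 0.08, 0.73, 0.95])
labels = np.array([1, 0, 1, 1, 0, 0, 1, 1])

# prob_true[i] is the observed positive rate in bin i,
# prob_pred[i] the mean predicted probability in that bin.
prob_true, prob_pred = calibration_curve(
    labels, predictions, n_bins=10, strategy="uniform"
)

report = {
    "binning": {"n_bins": 10, "strategy": "uniform"},
    "bins": [
        {"mean_predicted": round(float(p), 3), "observed_rate": round(float(t), 3)}
        for p, t in zip(prob_pred, prob_true)
    ],
}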
Metrics
Calibration curve: plots the observed outcome frequency (y-axis) against the average predicted probability (x-axis) across probability bins. The diagonal line represents perfect calibration.
Points above the diagonal indicate underconfidence, while points below indicate overconfidence. Calibration curves visualize calibration behavior across the probability range and do not produce a single scalar value.
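A minimal plotting sketch with matplotlib, on hypothetical arrays; the dashed diagonal is the perfect-calibration reference line.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

predictions = np.array([0.82, 0.14, 0.67, 0.91, 0.55, 0.08, 0.73, 0.95])
labels = np.array([1, 0, 1, 1, 0, 0, 1, 1])

prob_true, prob_pred = calibration_curve(labels, predictions, n_bins=10, strategy="uniform")

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")  # diagonal reference
ax.plot(prob_pred, prob_true, marker="o", label="model")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Observed outcome frequency")
ax.legend()
plt.show()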
Known Limitations
- Sensitive to binning strategy (number of bins, binning method).
- Can be noisy for small datasets or rare outcomes.
- Visual interpretation may obscure uncertainty without confidence intervals.
- Does not assess discrimination (e.g., ranking ability), only probability reliability.
- Requires probabilistic outputs and is not applicable to label-only predictions.
- A single global calibration curve can mask subgroup-specific miscalibration (e.g., across demographic or clinical subgroups), so group-wise calibration curves may be needed in fairness- or safety-critical settings.
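Regarding the last point, a minimal sketch of group-wise calibration, assuming a per-example subgroup identifier (groups) is available alongside the usual predictions and labels:

import numpy as np
from sklearn.calibration import calibration_curve

predictions = np.array([0.82, 0.14, 0.67, 0.91, 0.35, 0.58, 0.22, 0.77])
labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # hypothetical subgroup labels

# Compute one calibration curve per subgroup instead of a single global curve.
for g in np.unique(groups):
    mask = groups == g
    prob_true, prob_pred = calibration_curve(
        labels[mask], predictions[mask], n_bins=5, strategy="uniform"
    )
    print(g, list(zip(prob_pred.round(2), prob_true.round(2))))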
Versioning and Provenance
Calibration curves depend on the binning scheme (fixed-width vs adaptive/quantile), outcome prevalence, sample size per bin, and whether smoothing or confidence intervals are applied. For reproducibility, document the binning strategy, minimum bin counts, dataset version, evaluation cohort definition, and implementation details.
When comparing curves across model versions, ensure identical bin definitions and evaluation cohorts, and indicate whether probabilities are raw model outputs or post-processed with a calibration method (e.g., Platt scaling, isotonic regression).
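If probabilities are post-processed, the recalibration step itself should be documented. A minimal sketch of the two methods mentioned above, fit on a hypothetical held-out validation split:

import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out predictions and outcomes used to fit a post-hoc calibrator.
val_preds = np.array([0.82, 0.14, 0.67, 0.91, 0.35, 0.58, 0.22, 0.77])
val_labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# Isotonic regression: monotone, non-parametric mapping from raw to calibrated probabilities.
iso = IsotonicRegression(out_of_bounds="clip").fit(val_preds, val_labels)
calibrated_iso = iso.predict(val_preds)

# Platt scaling: a logistic regression on the raw scores
# (in practice it is often fit on logits rather than probabilities).
platt = LogisticRegression().fit(val_preds.reshape(-1, 1), val_labels)
calibrated_platt = platt.predict_proba(val_preds.reshape(-1, 1))[:, 1]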
References
Niculescu-Mizil & Caruana, 2005. Predicting Good Probabilities with Supervised Learning.
Paper: https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
scikit-learn calibration documentation: https://scikit-learn.org/stable/modules/calibration.html
Related Metrics
Expected Calibration Error
Calibration metric that quantifies the discrepancy between predicted probabilities and observed accuracy across probability bins.
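A minimal sketch of the binary-outcome variant of ECE, using the same uniform binning described above (the standard multiclass definition bins by top-class confidence and compares against accuracy):

import numpy as np

def expected_calibration_error(predictions, labels, n_bins=10):
    # Weighted average of |observed rate - mean predicted probability| over uniform bins.
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(predictions, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            weight = mask.sum() / len(predictions)
            ece += weight * abs(labels[mask].mean() - predictions[mask].mean())
    return ece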
Brier Score
Proper scoring rule measuring the mean squared error between predicted probabilities and observed binary outcomes, used to assess calibration and reliability.
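A one-line computation with scikit-learn, on hypothetical values:

from sklearn.metrics import brier_score_loss

# Mean squared error between predicted probabilities and binary outcomes (lower is better).
brier = brier_score_loss([1, 0, 1, 1], [0.82, 0.14, 0.67, 0.91])
print(brier)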