
Expected calibration error (ECE)

Measures how closely predicted probabilities match observed outcomes using binned confidence versus accuracy.

Overview

Expected calibration error (ECE) evaluates how well predicted probabilities align with empirical outcomes: when a well-calibrated model predicts with confidence 0.8, its predictions are correct about 80% of the time. ECE is a metric rather than a benchmark and requires probability predictions alongside ground-truth labels.

ECE partitions predictions into confidence bins (e.g., 10 bins across 0 to 1) and computes the weighted average of the absolute difference between average confidence and accuracy per bin. Lower is better, with 0 indicating perfect calibration.
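
A minimal sketch of this computation for the binary case, assuming predictions are probabilities of the positive class and equal-width bins (the function and parameter names are illustrative, not tied to any particular library):

import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for binary probability predictions."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Top-label convention: the predicted class is 1 when prob >= 0.5,
    # and the confidence is the probability assigned to that class.
    preds = (probs >= 0.5).astype(int)
    conf = np.maximum(probs, 1.0 - probs)
    correct = preds == labels

    # Assign each prediction to one of n_bins equal-width confidence bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # mask.mean() == |B_m| / N
    return ece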

Input Format

  • prediction: predicted probability or confidence score per example (binary) or per class (multiclass)
  • label: ground-truth label per example
{
  "prediction": 0.87,
  "label": 1
}
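
As a brief usage sketch, records in this format (e.g., one JSON object per line in a hypothetical predictions.jsonl file) can be collected into arrays and passed to an ECE routine such as the one above:

import json
import numpy as np

# Hypothetical JSON-lines file; each line matches the input format above.
with open("predictions.jsonl") as f:
    records = [json.loads(line) for line in f]

probs = np.array([r["prediction"] for r in records], dtype=float)
labels = np.array([r["label"] for r in records], dtype=int)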

Output Format

A single numeric ECE aggregated over the dataset. Optional outputs may include bin-level accuracies and confidences used to build a reliability diagram.

{
  "ece": 0.034
}
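
A sketch of how the optional bin-level statistics could be returned alongside the scalar score and turned into a reliability diagram (the output keys, synthetic data, and plotting choices below are assumptions, not a fixed schema):

import numpy as np
import matplotlib.pyplot as plt

def ece_with_bins(conf, correct, n_bins=10):
    """Return ECE plus per-bin accuracy and confidence for non-empty bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    bin_acc, bin_conf, bin_weight = [], [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            bin_acc.append(float(correct[mask].mean()))
            bin_conf.append(float(conf[mask].mean()))
            bin_weight.append(float(mask.mean()))
    ece = sum(w * abs(a - c) for w, a, c in zip(bin_weight, bin_acc, bin_conf))
    return {"ece": ece, "bin_acc": bin_acc, "bin_conf": bin_conf}

# Synthetic, roughly calibrated data for illustration only.
rng = np.random.default_rng(0)
probs = rng.uniform(size=500)
labels = (rng.uniform(size=500) < probs).astype(int)
conf = np.maximum(probs, 1.0 - probs)
correct = (probs >= 0.5).astype(int) == labels

out = ece_with_bins(conf, correct)
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.plot(out["bin_conf"], out["bin_acc"], marker="o", label="model")
plt.xlabel("confidence"); plt.ylabel("accuracy"); plt.legend(); plt.show()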

Metrics

  • ECE: weighted average of the per-bin absolute difference between accuracy and confidence.
    \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|

    Here, B_m denotes the set of predictions in bin m, acc(B_m) is the fraction of correct predictions in that bin, conf(B_m) is the average predicted confidence of those predictions, N is the total number of predictions, and M is the number of bins. A worked example follows this list.

  • Optional: maximum calibration error (MCE) and bin statistics for reliability diagrams.
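
A worked example with made-up bin statistics: suppose N = 100 predictions fall into three non-empty bins with sizes 20, 50, and 30, per-bin accuracies 0.55, 0.78, 0.93, and per-bin confidences 0.62, 0.75, 0.96. Then

\text{ECE} = \frac{20}{100}\,|0.55 - 0.62| + \frac{50}{100}\,|0.78 - 0.75| + \frac{30}{100}\,|0.93 - 0.96| = 0.014 + 0.015 + 0.009 = 0.038

while the MCE is the largest per-bin gap, \max(0.07, 0.03, 0.03) = 0.07.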

Known Limitations

  • Sensitive to binning strategy and the number of bins, which can change the reported score.
  • Aggregates across bins, hiding where calibration errors occur.
  • Not directly comparable across datasets with different base rates or label distributions.
  • For multiclass problems, the definition of confidence and binning requires explicit specification (see the top-label sketch after this list).
  • Common ECE formulations consider only the most confident class and therefore can hide miscalibration in lower-probability classes.
  • ECE measures calibration rather than discrimination, so a model may achieve a low ECE while still exhibiting poor predictive utility or ranking performance.
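
A minimal sketch of the common top-label reduction for multiclass inputs, assuming each prediction is a full probability vector (the array shapes and names are illustrative):

import numpy as np

def top_label_confidence(prob_matrix, labels):
    """Reduce multiclass probabilities to top-label confidence and correctness.

    prob_matrix: (n_examples, n_classes) predicted class probabilities.
    labels: (n_examples,) integer ground-truth labels.
    """
    prob_matrix = np.asarray(prob_matrix, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = prob_matrix.argmax(axis=1)                 # most confident class
    conf = prob_matrix[np.arange(len(labels)), preds]  # its probability
    correct = preds == labels
    return conf, correct  # feed into the same binning and aggregation as above

Because only the top class contributes, miscalibration in the remaining class probabilities is invisible to this reduction; classwise variants bin every class probability instead.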

Versioning and Provenance

ECE implementations vary by bin count, binning strategy (equal width vs. equal mass), and whether confidence is derived from the maximum class probability or from class-specific probabilities. For reproducibility, document binning choices, label encoding, and implementation details. Calibration-curve utilities (e.g., scikit-learn’s calibration_curve) define the binning and empirical accuracy calculations used by ECE, but the final weighted aggregation into a single ECE score is typically implemented separately.
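
As an illustration, one way to aggregate scikit-learn's calibration_curve output into a single score; note that this measures calibration of the positive-class probability rather than top-label confidence, and the bin-count reconstruction below assumes the default equal-width binning (a sketch, not part of scikit-learn's API):

import numpy as np
from sklearn.calibration import calibration_curve

def ece_from_calibration_curve(y_true, y_prob, n_bins=10):
    """Weighted aggregation of calibration_curve's per-bin statistics."""
    # prob_true: fraction of positives per non-empty bin;
    # prob_pred: mean predicted probability per non-empty bin.
    prob_true, prob_pred = calibration_curve(
        y_true, y_prob, n_bins=n_bins, strategy="uniform"
    )
    # calibration_curve drops empty bins, so recover per-bin counts with the
    # same equal-width edges in order to weight the aggregation by |B_m| / N.
    counts, _ = np.histogram(y_prob, bins=np.linspace(0.0, 1.0, n_bins + 1))
    weights = counts[counts > 0] / counts.sum()
    return float(np.sum(weights * np.abs(prob_true - prob_pred)))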

References

Guo et al., 2017. On Calibration of Modern Neural Networks.

Paper: https://arxiv.org/abs/1706.04599

Implementation: https://github.com/EFS-OpenSource/calibration-framework

Related Metrics