Metrics

Brier score

A proper scoring rule for probabilistic classification that measures how close predicted probabilities are to observed outcomes.

Overview

The Brier score evaluates the quality of probabilistic predictions for binary classification tasks. Lower scores indicate better-calibrated, more reliable probability estimates. Unlike accuracy, the Brier score reflects how confident a model is in its predictions and penalizes both overconfidence and underconfidence. It is a metric rather than a benchmark and requires paired probability predictions and ground-truth labels.

The model outputs a probability for a binary outcome. The Brier score is computed as a mean squared difference between the predicted probability and the observed outcome. For multiclass problems, the score can be extended using one-hot encoded labels or computed per class and aggregated.
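The binary computation can be sketched in a few lines of plain Python (a minimal illustration; the function name brier_score is ours, not part of any library API):

```python
def brier_score(predictions, labels):
    """Mean squared difference between predicted probabilities and observed outcomes."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(predictions)

# Using the example payload from the Input Format section:
print(brier_score([0.82, 0.14, 0.67, 0.91], [1, 0, 1, 1]))  # ≈ 0.042
```

Note that a score of 0.042 is simply the mean of the four squared residuals (0.18², 0.14², 0.33², 0.09²); no normalization beyond the 1/N average is applied.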

Input Format

  • predictions: array of predicted probabilities for the positive class (0–1)
  • labels: array of ground-truth binary outcomes (0 or 1)

Example:

{
  "predictions": [0.82, 0.14, 0.67, 0.91],
  "labels": [1, 0, 1, 1]
}

Output Format

Expected output: a numeric Brier score aggregated over the dataset. It can optionally include per-class or per-group scores for calibration slices.

{
  "brier": 0.042
}

Metrics

  • Brier score: mean squared error of predicted probabilities (p_i) vs. binary labels (y_i).
    Brier = (1/N) * Σ_{i=1}^{N} (p_i - y_i)^2

    Lower scores indicate better calibrated and more informative probabilistic predictions. Scores range from 0 to 1, where 0 indicates perfect performance and ≈0.25 corresponds to random guessing on a balanced binary task.

  • Optional: reliability (calibration), resolution (discrimination), and uncertainty (Murphy decomposition).
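The Murphy decomposition mentioned above can be sketched as follows. Grouping samples by unique predicted value makes the identity Brier = reliability − resolution + uncertainty hold exactly; the function name and structure here are illustrative, not a standard API:

```python
from collections import defaultdict

def murphy_decomposition(predictions, labels):
    """Decompose the Brier score into reliability, resolution, and uncertainty.

    Samples are grouped by unique predicted probability, so
    Brier = reliability - resolution + uncertainty holds exactly.
    """
    n = len(predictions)
    groups = defaultdict(list)
    for p, y in zip(predictions, labels):
        groups[p].append(y)
    base_rate = sum(labels) / n
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

In practice, implementations often bin predictions (e.g., 10 equal-width bins) rather than grouping by unique values, in which case the identity holds only approximately; document the binning used.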

Known Limitations

  • Sensitivity to class imbalance: the aggregate score is dominated by the majority class in imbalanced data.
  • No assessment of ranking or discrimination quality (use AUROC or precision-recall metrics).
  • Collapses different error profiles into a single scalar, obscuring where errors occur.
  • No signal for safety or clinical appropriateness; it should not be used alone in clinical settings.
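The class-imbalance caveat is easy to demonstrate: a constant predictor that always outputs the base rate achieves a deceptively low Brier score on imbalanced data while providing no discrimination at all (a small illustrative sketch with made-up numbers):

```python
# 1% positive rate: 990 negatives, 10 positives
labels = [0] * 990 + [1] * 10
constant = [0.01] * 1000  # always predicts the base rate; no discrimination

brier = sum((p - y) ** 2 for p, y in zip(constant, labels)) / len(labels)
print(round(brier, 4))  # ≈ 0.0099 -- looks excellent despite being uninformative
```

This is why the score should be paired with a ranking metric such as AUROC when class balance is skewed.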

Versioning and Provenance

Brier implementations vary by smoothing and aggregation, especially in multiclass settings. For reproducibility, document the implementation (e.g., scikit-learn's brier_score_loss), label encoding, dataset version, and any calibration or binning used for decomposition.

References

Brier, 1950. Verification of forecasts expressed in terms of probability.

Paper: https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Implementation (scikit-learn brier_score_loss): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.brier_score_loss.html

Related Metrics