Log loss (Negative Log Likelihood)

Proper scoring rule that penalizes incorrect probabilistic predictions, with steeper penalties for confident mistakes.

Overview

Log loss measures the quality of probabilistic predictions by comparing predicted probabilities to ground-truth labels. It rewards well-calibrated probabilities and strongly penalizes overconfident errors. Log loss is a metric rather than a benchmark and requires probability predictions paired with labels.

For binary classification, log loss is the negative log-likelihood of the true labels under the predicted probabilities. For multiclass problems, it generalizes to the cross-entropy between one-hot label vectors and the predicted probability distribution over classes.
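
To make the binary definition concrete, here is a minimal from-scratch sketch in Python (the function name and eps constant are illustrative choices, not from any particular library):

import math

def binary_log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip away from 0 and 1 so log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)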

Input Format

  • predictions: array of probabilities for the positive class (binary) or per-class probability vectors (multiclass)
  • labels: array of ground-truth labels (binary: 0/1) or one-hot vectors (multiclass)

Example:

{
  "predictions": [0.93, 0.12, 0.78, 0.05],
  "labels": [1, 0, 1, 0]
}
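
These predictions and labels can be scored directly with scikit-learn; for this particular input the result works out to roughly 0.125 (computed here purely for illustration):

from sklearn.metrics import log_loss

predictions = [0.93, 0.12, 0.78, 0.05]
labels = [1, 0, 1, 0]

# Mean negative log-likelihood over the four examples; about 0.125 for this input.
print(log_loss(labels, predictions))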

Output Format

A single numeric log loss value averaged over the dataset.

{
  "log_loss": 0.24
}

Metrics

  • Log loss: negative log-likelihood of the true labels under predicted probabilities.
    \text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]

    Lower scores indicate better probabilistic accuracy and calibration. The metric is unbounded above and heavily penalizes confident incorrect predictions: predicting 0.99 for a negative case contributes -log(0.01) ≈ 4.6 to the average, versus ≈ 0.01 for a well-calibrated 0.99 on a positive case.

  • Optional: multiclass log loss (cross-entropy) and per-class loss breakdowns (a multiclass sketch follows this list).
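
As a sketch of the multiclass variant mentioned above, the cross-entropy between one-hot labels and predicted class distributions can be written in a few lines of NumPy (the array shapes and eps constant are illustrative assumptions):

import numpy as np

def multiclass_log_loss(one_hot_labels, probs, eps=1e-15):
    """Cross-entropy for (N, C) one-hot labels against (N, C) predicted probabilities."""
    probs = np.clip(probs, eps, 1 - eps)  # guard against log(0)
    return -np.mean(np.sum(one_hot_labels * np.log(probs), axis=1))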

Known Limitations

  • Sensitive to label noise; a few mislabeled examples can inflate the score substantially.
  • Not a ranking metric and should be paired with AUROC or AUPRC when ranking quality matters.
  • Scores are not directly comparable across datasets with different label distributions.
  • Requires careful handling of probabilities near 0 or 1 to avoid numerical instability (see the clipping sketch after this list).
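
As a sketch of why the last point matters, consider an unclipped probability of exactly 1.0 for an example whose true label is 0 (the eps value here is a common but arbitrary choice):

import math

p = 1.0   # overconfident prediction; true label is 0
eps = 1e-15

# Without clipping, math.log(1 - p) raises "ValueError: math domain error".
p = min(max(p, eps), 1 - eps)  # force p into the open interval (0, 1)
print(-math.log(1 - p))        # large but finite, ≈ 34.5 instead of infinite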

Versioning and Provenance

Log loss implementations differ in clipping strategies for numerical stability and in multiclass averaging conventions. For reproducibility, document probability clipping, label encoding, and implementation (e.g., scikit-learn's log_loss).
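
One way to pin down the label-encoding convention with scikit-learn is to pass the labels argument explicitly, so the class ordering does not depend on which classes happen to appear in a given batch (clipping behavior itself varies across scikit-learn versions and should be recorded separately):

from sklearn.metrics import log_loss

score = log_loss(y_true=[1, 0, 1, 0],
                 y_pred=[0.93, 0.12, 0.78, 0.05],
                 labels=[0, 1])  # fix the class order explicitly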

References

Goodfellow, Bengio, and Courville, 2016. Deep Learning. Chapter 6: Deep Feedforward Networks.

Book: https://www.deeplearningbook.org/

Implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

Related Metrics