Log loss (Negative Log Likelihood)
Proper scoring rule that penalizes incorrect probabilistic predictions, with steeper penalties for confident mistakes.
Overview
Log loss measures the quality of probabilistic predictions by comparing predicted probabilities to ground-truth labels. It rewards well-calibrated probabilities and strongly penalizes overconfident errors. Log loss is a metric rather than a benchmark and requires probability predictions paired with labels.
For binary classification, log loss is the negative log-likelihood of the true labels under the predicted probabilities. For multiclass problems, it generalizes to the cross-entropy between the one-hot labels and the predicted class-probability vector, averaged over examples.
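In standard notation, for $N$ examples with binary labels $y_i \in \{0,1\}$ and predicted positive-class probabilities $\hat{p}_i$:

$$\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \,\big]$$

The multiclass form replaces the bracketed term with $\sum_{k=1}^{K} y_{i,k} \log \hat{p}_{i,k}$, where $y_{i,k}$ is the one-hot indicator for class $k$.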
Input Format
- predictions: array of probabilities for the positive class (binary) or per-class probability vectors (multiclass)
- labels: array of ground-truth labels (binary: 0/1) or one-hot vectors (multiclass)
Example:
{
"predictions": [0.93, 0.12, 0.78, 0.05],
"labels": [1, 0, 1, 0]
}
Output Format
A single numeric log loss aggregated over the dataset.
{
"log_loss": 0.24
}
Metrics
- Log loss: negative log-likelihood of the true labels under predicted probabilities.
Lower scores indicate better probabilistic accuracy and calibration. The metric is unbounded above and heavily penalizes confident incorrect predictions (e.g., predicting 0.99 for a negative case); see the worked sketch after this list.
- Optional: multiclass log loss (cross-entropy) and per-class loss breakdowns.
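As a worked example, a minimal sketch that scores the Input Format data above with scikit-learn's log_loss (the equivalent manual computation is shown in the comment):

import numpy as np
from sklearn.metrics import log_loss

# Illustrative data from the Input Format example above.
predictions = np.array([0.93, 0.12, 0.78, 0.05])
labels = np.array([1, 0, 1, 0])

# Mean negative log-likelihood of the true labels; equivalent to
# -np.mean(labels * np.log(predictions) + (1 - labels) * np.log(1 - predictions)).
score = log_loss(labels, predictions)
print(f"log_loss: {score:.4f}")  # ~0.1250 for these four predictions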
Known Limitations
- Sensitive to label noise; a few mislabeled examples can inflate the score substantially.
- Not a ranking metric; pair it with AUROC or AUPRC when ranking quality matters.
- Scores are not directly comparable across datasets with different label distributions.
- Requires careful handling of probabilities near 0 or 1 to avoid numerical instability; see the clipping sketch after this list.
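A minimal sketch of such clipping (the epsilon value here is an illustrative choice, not a standard; implementations differ, as noted in the next section):

import numpy as np

def clipped_log_loss(labels, probs, eps=1e-15):
    # Clip probabilities away from exact 0 and 1 so log() stays finite.
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    return -np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))

# Without clipping, a hard 0.0 or 1.0 prediction would yield -inf.
print(clipped_log_loss([1, 0], [1.0, 0.0]))  # finite, ~0 instead of inf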
Versioning and Provenance
Log loss implementations differ in their clipping strategies for numerical stability and in multiclass averaging conventions. For reproducibility, document the probability clipping threshold, the label encoding, and the implementation used (e.g., scikit-learn's log_loss); the sketch below shows one way to make the label encoding explicit.
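A minimal sketch (the data and class count are illustrative) of pinning the multiclass label encoding with scikit-learn's labels argument, which is needed whenever some classes are absent from the observed labels:

import numpy as np
from sklearn.metrics import log_loss

# Per-class probability vectors; column k holds P(class k) and each row sums to 1.
probs = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
])
y_true = [0, 1]  # class 2 never appears in the observed labels

# Without labels=, the class set would be inferred from y_true alone (two
# classes), which conflicts with the three probability columns.
score = log_loss(y_true, probs, labels=[0, 1, 2])
print(f"multiclass log_loss: {score:.4f}")  # ~0.2899 for this data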
References
Goodfellow, Bengio, and Courville, 2016. Deep Learning. Chapter 6: Deep Feedforward Networks.
Book: https://www.deeplearningbook.org/
Implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
Related Metrics
Brier Score
Proper scoring rule measuring the mean squared error between predicted probabilities and observed binary outcomes, used to assess calibration and reliability.
Expected Calibration Error
Calibration metric that quantifies the discrepancy between predicted probabilities and observed accuracy across probability bins.