F1 score
Harmonic mean of precision and recall for classification tasks, commonly used when classes are imbalanced.
Overview
The F1 score summarizes precision and recall into a single metric that balances false positives and false negatives. It is widely used for binary and multiclass classification tasks where accuracy can be misleading due to class imbalance. F1 is a metric rather than a benchmark and requires discrete predictions alongside ground-truth labels.
F1 is defined for binary classification and extended to multiclass settings through averaging strategies. In binary classification, results depend on which class is designated as the positive class.
For multiclass settings, F1 can be computed per class and aggregated using micro, macro, or weighted averaging. For probabilistic models, the thresholding strategy used to convert probabilities into labels has a direct impact on the F1 score.
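As a minimal, illustrative sketch (the data below is invented; scikit-learn's f1_score is assumed as the implementation), binary F1 with an explicit positive class and decision threshold might look like:

# Minimal sketch: binary F1 with an explicit positive class and threshold.
# The labels and scores are illustrative, not from a real model.
from sklearn.metrics import f1_score

labels = ["positive", "negative", "positive", "positive", "negative"]
scores = [0.90, 0.40, 0.35, 0.80, 0.20]  # model scores for the "positive" class

threshold = 0.5  # the resulting F1 depends directly on this choice
predictions = ["positive" if s >= threshold else "negative" for s in scores]

# pos_label designates which class counts as positive in the binary case.
print(f1_score(labels, predictions, pos_label="positive"))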
Input Format
- prediction: array of discrete predicted class labels
- label: array of ground-truth class labels
Example (per-item):
{
  "prediction": "positive",
  "label": "positive"
}
Output Format
A single numeric F1 score aggregated over the dataset. Optional breakdowns may include per-class F1 and averaged variants (macro, micro, weighted).
{
  "f1": 0.81
}
Metrics
- F1 score: harmonic mean of precision and recall, F1 = 2 · precision · recall / (precision + recall).
Precision is the share of predicted positives that are correct (TP / (TP + FP)), while recall is the share of actual positives captured (TP / (TP + FN)). Scores range from 0 to 1, where 1 indicates perfect precision and recall.
- Optional: macro, micro, and weighted F1 for multiclass classification; per-class F1 for error analysis.
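A short sketch of these breakdowns, again assuming scikit-learn's f1_score and invented labels:

# Sketch: per-class and averaged F1 for a small multiclass example.
from sklearn.metrics import f1_score

labels      = ["cat", "dog", "dog", "bird", "cat", "bird"]
predictions = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Per-class F1 for error analysis (scores returned in the order of `labels=`).
print(f1_score(labels, predictions, labels=["bird", "cat", "dog"], average=None))

# Aggregated variants: macro averages per-class scores equally, weighted
# weights them by class frequency, micro pools all decisions first.
for avg in ("macro", "weighted", "micro"):
    print(avg, f1_score(labels, predictions, average=avg))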
Known Limitations
- Ignores true negatives, which can overstate performance on heavily imbalanced datasets.
- Sensitive to the probability threshold used to convert scores into predicted labels (see the sketch after this list).
- Collapses different error profiles into a single scalar, obscuring which errors dominate.
- Not a calibration metric and should be paired with probability quality checks when confidence matters.
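A toy illustration of the threshold sensitivity noted above, assuming scikit-learn's f1_score (the scores are invented and the exact values are not meaningful):

# Toy illustration: the same scores give different F1 values at different cutoffs.
from sklearn.metrics import f1_score

labels = [1, 0, 1, 1, 0, 1, 0, 0]
scores = [0.90, 0.60, 0.55, 0.45, 0.40, 0.35, 0.30, 0.10]

for threshold in (0.3, 0.5, 0.7):
    predictions = [int(s >= threshold) for s in scores]
    print(f"threshold={threshold}: F1={f1_score(labels, predictions):.3f}")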
Versioning and Provenance
F1 implementations vary by averaging strategy (micro/macro/weighted), label encoding, positive-class definition, and handling of undefined values when a precision or recall denominator is zero. For reproducibility, document the averaging scheme, label set, thresholding rules, and implementation (e.g., scikit-learn's f1_score).
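One concrete example of the undefined case, assuming scikit-learn's f1_score, whose zero_division argument controls the fallback value when the score is 0/0:

# Sketch: when no positive labels or predictions exist, F1 is 0/0 and the
# reported value is whatever zero_division specifies, so record that choice.
from sklearn.metrics import f1_score

labels      = [0, 0, 0, 0]
predictions = [0, 0, 0, 0]  # the positive class never appears

print(f1_score(labels, predictions, zero_division=0))  # reports 0.0
print(f1_score(labels, predictions, zero_division=1))  # reports 1.0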
References
van Rijsbergen, 1979. Information Retrieval.
Book: https://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf
Implementation (scikit-learn f1_score): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
Related Metrics
Sensitivity & Specificity
Companion metrics measuring true positive rate (sensitivity) and true negative rate (specificity).
MCC (Matthews correlation coefficient)
Correlation-based metric that accounts for true/false positives and negatives, robust to class imbalance.