F1 score

Harmonic mean of precision and recall for classification tasks, commonly used when classes are imbalanced.

Overview

The F1 score summarizes precision and recall into a single metric that balances false positives and false negatives. It is widely used for binary and multiclass classification tasks where accuracy can be misleading due to class imbalance. F1 is a metric rather than a benchmark and requires discrete predictions alongside ground-truth labels.

F1 is defined for binary classification and extended to multiclass settings through averaging strategies. In binary classification, results depend on which class is designated as the positive class.

For multiclass settings, F1 can be computed per class and aggregated using micro, macro, or weighted averaging. For probabilistic models, the thresholding strategy used to convert probabilities into labels has a direct impact on the F1 score.
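
For example, a minimal sketch of the averaging options using scikit-learn's f1_score; the three-class toy labels below are illustrative only.

from sklearn.metrics import f1_score

label      = ["cat", "dog", "dog", "bird", "cat", "bird"]
prediction = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Per-class F1, then the three averaged variants.
per_class = f1_score(label, prediction, labels=["cat", "dog", "bird"], average=None)
macro = f1_score(label, prediction, average="macro")        # unweighted mean of per-class F1
micro = f1_score(label, prediction, average="micro")        # pools TP/FP/FN across classes
weighted = f1_score(label, prediction, average="weighted")  # per-class F1 weighted by class support

print(dict(zip(["cat", "dog", "bird"], per_class.round(2))), macro, micro, weighted)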

Input Format

  • prediction: array of discrete predicted class labels
  • label: array of ground-truth class labels

Example (per-item):

{
  "prediction": "positive",
  "label": "positive"
}
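
A minimal sketch of scoring a dataset of per-item records in this format, assuming a hypothetical items list; the records below are illustrative only.

from sklearn.metrics import f1_score

items = [
    {"prediction": "positive", "label": "positive"},
    {"prediction": "negative", "label": "positive"},
    {"prediction": "positive", "label": "negative"},
    {"prediction": "negative", "label": "negative"},
]

prediction = [item["prediction"] for item in items]
label = [item["label"] for item in items]

# pos_label picks which class counts as positive; swapping it changes the score.
score = f1_score(label, prediction, pos_label="positive", average="binary")
print({"f1": round(score, 2)})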

Output Format

A single numeric F1 score aggregated over the dataset. Optional breakdowns may include per-class F1 and averaged variants (macro, micro, weighted).

{
  "f1": 0.81
}

Metrics

  • F1 score: harmonic mean of precision and recall.
    \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

    Precision is the share of predicted positives that are correct (\text{TP} / (\text{TP} + \text{FP})), while recall is the share of actual positives captured (\text{TP} / (\text{TP} + \text{FN})). Scores range from 0 to 1, where 1 indicates perfect precision and recall. A counts-based sketch follows this list.

  • Optional: macro, micro, and weighted F1 for multiclass classification; per-class F1 for error analysis.
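
A counts-based sketch of the formula above; the tp, fp, and fn values are illustrative, and returning 0 when precision and recall are both zero is one common convention rather than a universal rule.

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were captured.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)

print(f1_from_counts(tp=8, fp=2, fn=3))  # precision 0.80, recall ~0.73 -> F1 ~0.76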

Known Limitations

  • Ignores true negatives, so performance on the negative class goes unmeasured; this can overstate overall quality on heavily imbalanced datasets.
  • Sensitive to the probability threshold used to convert scores into predicted labels (illustrated in the sketch after this list).
  • Collapses different error profiles into a single scalar, obscuring which errors dominate.
  • Not a calibration metric and should be paired with probability quality checks when confidence matters.
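
For example, a minimal sketch of threshold sensitivity with made-up probabilities: the same scores yield different F1 values as the cutoff moves.

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.6, 0.55, 0.4, 0.35, 0.3, 0.8, 0.2])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)  # convert probabilities to labels
    print(threshold, round(f1_score(y_true, y_pred), 3))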

Versioning and Provenance

F1 implementations vary by averaging strategy (micro/macro/weighted), label encoding, positive-class definition, and handling of undefined values when precision or recall is zero. For reproducibility, document the averaging scheme, label set, thresholding rules, and implementation (e.g., scikit-learn's f1_score).
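
For example, a minimal sketch of how these choices move the reported number, using scikit-learn's zero_division argument and an explicit label set; the toy labels are illustrative only.

from sklearn.metrics import f1_score

label = ["cat", "dog", "cat"]
prediction = ["cat", "dog", "dog"]

# "bird" is declared in the label set but absent from this batch, so its per-class
# F1 is undefined; zero_division decides the value, which shifts the macro average.
print(f1_score(label, prediction, labels=["cat", "dog", "bird"], average="macro", zero_division=0))  # ~0.44
print(f1_score(label, prediction, labels=["cat", "dog", "bird"], average="macro", zero_division=1))  # ~0.78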

References

van Rijsbergen, 1979. Information Retrieval.

Book: https://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf

Implementation (scikit-learn f1_score): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Related Metrics