Sensitivity & Specificity
Complementary classification metrics that measure true positive rate (sensitivity) and true negative rate (specificity).
Overview
Sensitivity (also called recall or true positive rate) measures the fraction of actual positives that are correctly identified. Specificity (true negative rate) measures the fraction of actual negatives that are correctly identified. Together, they summarize different failure modes and are often used in clinical evaluation to balance missed cases vs. false alarms.
Sensitivity and specificity are metrics computed from thresholded (discrete) predictions paired with ground-truth labels. The threshold used to convert probabilities into labels has a direct impact on both.
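As a minimal sketch (assuming positives are encoded as 1 and negatives as 0; the probability values and the 0.5 cutoff are illustrative, chosen so that thresholding reproduces the example predictions in the next section), both rates can be computed directly from the confusion-matrix counts:

def sensitivity_specificity(y_true, y_pred):
    # Count the four confusion-matrix cells, treating 1 as the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")  # true negative rate
    return sensitivity, specificity

# Illustrative probabilities; thresholding at 0.5 yields the example predictions below.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.1, 0.4, 0.6, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]   # [1, 1, 1, 0, 0, 0, 1, 0]
print(sensitivity_specificity(y_true, y_pred))    # (0.75, 0.75)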
Input Format
- predictions: array of thresholded predicted labels
- labels: array of ground-truth labels
Example:
{
"predictions": [1, 1, 1, 0, 0, 0, 1, 0],
"labels": [1, 1, 1, 0, 0, 0, 0, 1]
}
Output Format
Numeric sensitivity and specificity aggregated over the dataset. Optional outputs may include confusion-matrix counts.
{
"sensitivity": 0.88,
"specificity": 0.8
}
Metrics
- Sensitivity: true positive rate, TP / (TP + FN).
- Specificity: true negative rate, TN / (TN + FP).
- Optional: report the full confusion matrix and the threshold used (see the scikit-learn sketch below).
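A sketch of the scikit-learn route mentioned under Versioning and Provenance, applied to the example arrays from the Input Format section (it assumes the 0/1 label encoding shown there):

from sklearn.metrics import confusion_matrix, recall_score

labels      = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 1, 1, 0, 0, 0, 1, 0]

# confusion_matrix with labels=[0, 1] returns [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(labels, predictions, labels=[0, 1]).ravel()
sensitivity = recall_score(labels, predictions)  # tp / (tp + fn) -> 0.75
specificity = tn / (tn + fp)                     # -> 0.75
print({"sensitivity": sensitivity, "specificity": specificity,
       "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp)})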
Known Limitations
- Sensitive to the decision threshold: changing the cutoff shifts both metrics (see the sweep sketch after this list).
- Does not summarize performance across thresholds (use AUROC or AUPRC when ranking matters).
- Does not reflect probability calibration or confidence quality.
- Can hide subgroup performance differences without stratified reporting.
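To make the first limitation concrete, a small sweep over cutoffs (the probability values are invented for illustration) shows sensitivity and specificity moving in opposite directions:

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.1, 0.4, 0.6, 0.3]  # illustrative model outputs

for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    print(f"threshold={threshold}: sensitivity={tp / (tp + fn):.2f}, specificity={tn / (tn + fp):.2f}")
# threshold=0.3: sensitivity=1.00, specificity=0.50
# threshold=0.5: sensitivity=0.75, specificity=0.75
# threshold=0.7: sensitivity=0.75, specificity=1.00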
Versioning and Provenance
Implementations vary by label encoding, threshold selection, and how ties or uncertain outputs are handled. For reproducibility, document the threshold, label mapping, and implementation (e.g., scikit-learn's recall_score for sensitivity and confusion_matrix for specificity).
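For instance, a small provenance record could be stored alongside the reported scores (the field names here are an assumption, not a required schema):

# Illustrative provenance record; field names are an assumption, not a required schema.
provenance = {
    "threshold": 0.5,
    "label_mapping": {"negative": 0, "positive": 1},
    "implementation": "scikit-learn (recall_score, confusion_matrix)",
}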
References
Fawcett, 2006. An introduction to ROC analysis.
Paper: https://doi.org/10.1016/j.patrec.2005.10.010
Implementation (scikit-learn): https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
Related Metrics
PPV & NPV
Positive predictive value (precision) and negative predictive value, measuring correctness for predicted positives and negatives.
F1 Score
Balanced metric that summarizes precision and recall into one harmonic-mean score for classification performance.
MCC
Correlation-based metric that accounts for true/false positives and negatives, robust to class imbalance.