Metric
Area Under the Receiver Operating Characteristic Curve (AUROC)
Threshold-free ranking metric that summarizes the tradeoff between true positive rate and false positive rate.
Overview
AUROC measures how well a model ranks positive examples above negative ones across all possible thresholds. It is widely used for binary classification tasks and summarizes the ROC curve, which plots true positive rate (sensitivity) against false positive rate (1 - specificity).
AUROC is a metric rather than a benchmark and requires probabilistic predictions (or scores) paired with ground-truth labels.
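Equivalently, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of this pairwise interpretation in plain Python (the function name pairwise_auroc is our own, and the O(n_pos * n_neg) loop is for illustration, not production use):

from itertools import product

def pairwise_auroc(predictions, labels):
    """AUROC via the pairwise ranking interpretation: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as half."""
    pos = [p for p, y in zip(predictions, labels) if y == 1]
    neg = [p for p, y in zip(predictions, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

print(pairwise_auroc([0.72, 0.31, 0.89, 0.45], [1, 0, 1, 0]))  # 1.0: all pairs ranked correctly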
Input Format
- predictions: array of numbers (model scores or probabilities; higher values indicate stronger confidence in the positive class)
- labels: array of binary ground-truth labels (0 or 1)
Example:
{
"predictions": [0.72, 0.31, 0.89, ...],
"labels": [1, 0, 1, ...]
}
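Given arrays in this format, a minimal scoring sketch with scikit-learn's roc_auc_score (the array values below are illustrative stand-ins, not the truncated arrays above):

from sklearn.metrics import roc_auc_score

predictions = [0.72, 0.31, 0.89, 0.45]  # model scores; illustrative values
labels = [1, 0, 1, 0]                   # binary ground truth

auroc = roc_auc_score(labels, predictions)  # note the order: labels first, scores second
print(round(auroc, 2))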
Output Format
A single numeric AUROC value aggregated over the dataset. Optional outputs may include the ROC curve points for plotting.
{
"auroc": 0.91
}
Metrics
- AUROC: area under the ROC curve; measures how well a model ranks positive examples above negative ones across thresholds. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation (values below 0.5 indicate worse-than-random ranking).
- Optional: ROC curve points, TPR/FPR at selected thresholds.
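Where ROC curve points are reported, they can be derived with scikit-learn's roc_curve; a small sketch (data values are illustrative):

from sklearn.metrics import roc_curve

labels = [1, 0, 1, 0, 1, 0]
predictions = [0.9, 0.8, 0.7, 0.4, 0.35, 0.1]

# One (FPR, TPR) point per distinct threshold, suitable for plotting.
fpr, tpr, thresholds = roc_curve(labels, predictions)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")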
Known Limitations
- Can appear overly optimistic on highly imbalanced datasets, because the large number of negatives keeps the false positive rate low even when many predicted positives are wrong (see the sketch after this list).
- Does not reflect calibration or absolute probability quality.
- Performance at clinically relevant thresholds can differ from the aggregate AUROC.
- Not directly comparable across datasets with different base rates without context.
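To illustrate the imbalance caveat, a hedged sketch contrasting AUROC with average precision on a skewed synthetic dataset (the make_classification settings are arbitrary choices for demonstration, not a recommendation):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Roughly 1% positives: AUROC typically stays high while average precision
# reveals how many top-ranked predictions are still false positives.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("AUROC:", round(roc_auc_score(y_te, scores), 3))
print("Average precision:", round(average_precision_score(y_te, scores), 3))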
Versioning and Provenance
AUROC implementations vary in interpolation strategy and handling of tied scores. For reproducibility, document the implementation (e.g., scikit-learn's roc_auc_score), label encoding, and the score type used.
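One way to record that provenance alongside the score (the record layout below is our own assumption, not a standard):

import sklearn
from sklearn.metrics import roc_auc_score

labels = [1, 0, 1, 0]
predictions = [0.72, 0.31, 0.89, 0.45]

result = {
    "auroc": float(roc_auc_score(labels, predictions)),
    # Provenance fields: implementation, version, label encoding, score type.
    "implementation": "sklearn.metrics.roc_auc_score",
    "sklearn_version": sklearn.__version__,
    "label_encoding": "binary {0, 1}, 1 = positive class",
    "score_type": "predicted probability of the positive class",
}
print(result)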
References
Hanley and McNeil, 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve.
Paper: https://doi.org/10.1148/radiology.143.1.7063747
Implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html