Matthews Correlation Coefficient (MCC)

Balanced classification metric that accounts for true and false positives and negatives and stays informative when classes are imbalanced.

Overview

MCC summarizes a confusion matrix into a single correlation coefficient between predicted and true labels. It is especially useful when classes are imbalanced because it incorporates all four outcomes (TP, TN, FP, FN) rather than focusing only on the positive class.

MCC is a metric rather than a benchmark and requires discrete predictions alongside ground-truth labels.

Input Format

predictions: array of predicted class labels
labels: array of ground-truth class labels

Example:

{
  "predictions": [1, 0, 1, 1, 0, 0],
  "labels": [1, 0, 0, 1, 0, 1]
}
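
As a minimal sketch of how this payload maps onto a confusion matrix, the snippet below tallies the four outcomes for the example input. The helper name `confusion_counts` and the `positive` parameter are illustrative assumptions, not part of any specific library.

```python
from collections import Counter

def confusion_counts(predictions, labels, positive=1):
    """Tally TP, TN, FP, FN for binary labels (hypothetical helper).

    `positive` marks which label value is treated as the positive class.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    counts = Counter()
    for pred, true in zip(predictions, labels):
        if pred == positive:
            counts["TP" if true == positive else "FP"] += 1
        else:
            counts["FN" if true == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}

payload = {
    "predictions": [1, 0, 1, 1, 0, 0],
    "labels": [1, 0, 0, 1, 0, 1],
}
print(confusion_counts(payload["predictions"], payload["labels"]))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```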

Output Format

A single numeric MCC aggregated over the dataset. Optional outputs may include the underlying confusion matrix.

{
  "mcc": 0.33
}

Metrics

MCC: correlation between predicted and true labels using all confusion matrix terms.
$\text{MCC} = \frac{(\text{TP} \cdot \text{TN}) - (\text{FP} \cdot \text{FN})}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$
Scores range from −1 to 1, where 1 indicates perfect prediction, 0 indicates performance no better than chance, and −1 indicates total disagreement between predictions and labels. The metric is undefined when any of the four sums in the denominator is zero, which happens when all predictions or all labels fall into a single class; some implementations return 0 in those cases.
Optional: report the confusion matrix and per-class MCC for multiclass variants.
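
The formula can be sketched directly from the four counts. This is an illustrative implementation under stated assumptions, not a reference one: the function name `mcc_from_counts` and the `zero_division` parameter are hypothetical, and returning 0 for a zero denominator mirrors the convention some implementations use. The counts TP=2, TN=2, FP=1, FN=1 correspond to the example under Input Format.

```python
import math

def mcc_from_counts(tp, tn, fp, fn, zero_division=0.0):
    """Binary MCC from confusion-matrix counts (hypothetical helper).

    Returns `zero_division` when any of the four denominator sums is zero,
    which is a convention choice, not a mathematical result.
    """
    numerator = tp * tn - fp * fn
    denominator_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denominator_sq == 0:
        return zero_division
    return numerator / math.sqrt(denominator_sq)

print(round(mcc_from_counts(2, 2, 1, 1), 3))  # 0.333
print(mcc_from_counts(3, 3, 0, 0))            # 1.0, perfect prediction
print(mcc_from_counts(0, 0, 3, 3))            # -1.0, total disagreement
print(mcc_from_counts(3, 0, 2, 0))            # 0.0, every prediction is positive
```

Exposing the zero-division value as a parameter makes the convention explicit, so it can be recorded alongside the reported score.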

Known Limitations

Requires discrete predictions, so thresholding choices directly affect the score.
Can be unstable for very small datasets where the confusion matrix is sparse.
Multiclass definitions vary across implementations, which can lead to differences in reported results.
Not a probability quality metric and should be paired with calibration checks when confidence estimates matter.
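
To illustrate the thresholding sensitivity noted above, the sketch below binarizes the same scores at three cutoffs and recomputes MCC each time. The scores and labels are made-up illustrative data, and `mcc_at_threshold` is a hypothetical helper that returns 0 when the denominator is zero.

```python
import math

def mcc_at_threshold(scores, labels, threshold):
    """Binarize `scores` at `threshold`, then compute binary MCC (hypothetical helper)."""
    preds = [1 if score >= threshold else 0 for score in scores]
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when the score is undefined.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Illustrative data only: model confidence for the positive class.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    print(threshold, round(mcc_at_threshold(scores, labels, threshold), 3))
# 0.25 0.447
# 0.5 0.333
# 0.75 0.707
```

The same model output yields three different scores, so the threshold belongs in any report of MCC computed from continuous scores.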

Versioning and Provenance

MCC implementations differ in how they handle zero-division and multiclass averaging. For reproducibility, document the averaging scheme, label encoding, and implementation (e.g., scikit-learn's matthews_corrcoef).
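
As a hedged illustration of why the averaging scheme needs documenting, the sketch below compares two multiclass definitions on the same data: the confusion-matrix generalization often written as R_K (understood to be the formulation behind scikit-learn's multiclass behavior, though this should be verified against the version in use) and a macro average of one-vs-rest binary MCC values. All function names and the sample data are illustrative assumptions.

```python
import math

def confusion_matrix(predictions, labels, classes):
    """Rows index the true class, columns the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for pred, true in zip(predictions, labels):
        matrix[index[true]][index[pred]] += 1
    return matrix

def multiclass_mcc(matrix, zero_division=0.0):
    """R_K-style generalization of MCC computed from a K x K confusion matrix."""
    n = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    true_totals = [sum(row) for row in matrix]
    pred_totals = [sum(col) for col in zip(*matrix)]
    cov_tp = correct * n - sum(t * p for t, p in zip(true_totals, pred_totals))
    cov_tt = n * n - sum(t * t for t in true_totals)
    cov_pp = n * n - sum(p * p for p in pred_totals)
    if cov_tt == 0 or cov_pp == 0:
        return zero_division  # assumed convention for the undefined case
    return cov_tp / math.sqrt(cov_tt * cov_pp)

def macro_one_vs_rest_mcc(matrix, zero_division=0.0):
    """Unweighted mean of per-class binary MCC, each class against the rest."""
    n = sum(sum(row) for row in matrix)
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp
        fp = sum(row[k] for row in matrix) - tp
        tn = n - tp - fn - fp
        denom_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
        value = (tp * tn - fp * fn) / math.sqrt(denom_sq) if denom_sq else zero_division
        scores.append(value)
    return sum(scores) / len(scores)

# Illustrative three-class data.
labels = [0, 0, 0, 1, 1, 2]
predictions = [0, 0, 1, 1, 1, 2]
matrix = confusion_matrix(predictions, labels, classes=[0, 1, 2])

print(round(multiclass_mcc(matrix), 3))         # 0.773
print(round(macro_one_vs_rest_mcc(matrix), 3))  # 0.805
```

Both numbers could reasonably be reported as "multiclass MCC", which is why the scheme and implementation should accompany the score.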

References

Matthews, 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

Paper: https://pubmed.ncbi.nlm.nih.gov/1180967/

Implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html

Related Metrics

F1 Score

Thresholded Decision Performance

Balanced metric that summarizes precision and recall into one harmonic-mean score for classification performance.

Discrete predictions + labels | Binary and multiclass classification (aggregated) | F1 score

Sensitivity & Specificity

Thresholded Decision Performance

Companion metrics measuring true positive rate (sensitivity) and true negative rate (specificity).

Thresholded predictions + ground-truth labels | Binary classification tasks | Sensitivity · Specificity