Matthews Correlation Coefficient (MCC)

A balanced classification metric that uses true and false positives and negatives, and stays informative when classes are imbalanced.

Overview

MCC summarizes a confusion matrix into a single correlation coefficient between predicted and true labels. It is especially useful when classes are imbalanced because it incorporates all four outcomes (TP, TN, FP, FN) rather than focusing only on the positive class.

MCC is a metric rather than a benchmark and requires discrete predictions alongside ground-truth labels.

Input Format

  • predictions: array of predicted class labels
  • labels: array of ground-truth class labels

Example:

{
  "predictions": [1, 0, 1, 1, 0, 0],
  "labels": [1, 0, 0, 1, 0, 1]
}
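As a minimal sketch of how these two arrays map onto the four confusion-matrix cells, the snippet below tallies TP, TN, FP, and FN for the example above. The function name `confusion_counts` and its `positive` parameter are illustrative assumptions rather than part of any specified interface, and the sketch assumes binary labels with 1 as the positive class.

```python
def confusion_counts(predictions, labels, positive=1):
    """Tally (TP, TN, FP, FN) for binary labels; `positive` marks the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    tp = tn = fp = fn = 0
    for pred, true in zip(predictions, labels):
        if pred == positive and true == positive:
            tp += 1  # predicted positive, actually positive
        elif pred != positive and true != positive:
            tn += 1  # predicted negative, actually negative
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn


counts = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(counts)  # (2, 2, 1, 1)
```

Mismatched array lengths raise an error here instead of being silently truncated by `zip`, which would otherwise hide a data-alignment bug.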

Output Format

A single numeric MCC aggregated over the dataset. Optional outputs may include the underlying confusion matrix. For the example input above (TP = 2, TN = 2, FP = 1, FN = 1), MCC = 3/9 ≈ 0.33:

{
  "mcc": 0.33
}

Metrics

  • MCC: correlation between predicted and true labels using all confusion matrix terms.
    $$\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$$

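    Scores range from −1 to 1, where 1 indicates perfect prediction, 0 indicates performance no better than chance, and −1 indicates total disagreement between predictions and labels. The metric is undefined when the denominator is zero, which happens when any of the four sums (TP + FP, TP + FN, TN + FP, TN + FN) is zero, for example when every prediction falls in a single class. Some implementations return 0 in those cases.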

  • Optional: report the confusion matrix and per-class MCC for multiclass variants.
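The binary formula can be sketched directly from the four counts. This is an illustrative sketch, not a reference implementation: the function name `binary_mcc` and the `zero_division` parameter are assumptions introduced here, and returning 0 for a zero denominator mirrors the convention mentioned above without being the only valid choice. The counts used are those of the example under Input Format.

```python
import math


def binary_mcc(tp, tn, fp, fn, zero_division=0.0):
    """Binary MCC from confusion-matrix counts.

    `zero_division` (hypothetical parameter) is returned when the
    denominator is zero and the coefficient is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return zero_division
    return numerator / denominator


# Counts for predictions [1, 0, 1, 1, 0, 0] against labels [1, 0, 0, 1, 0, 1]
score = binary_mcc(tp=2, tn=2, fp=1, fn=1)
print(round(score, 4))  # 0.3333
```

Making the zero-denominator value an explicit argument keeps that convention visible to whoever reports the score.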

Known Limitations

  • Requires discrete predictions, so thresholding choices directly affect the score.
  • Can be unstable for very small datasets where the confusion matrix is sparse.
  • Multiclass definitions vary across implementations, which can lead to differences in reported results.
  • Not a probability quality metric and should be paired with calibration checks when confidence estimates matter.
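To illustrate the first limitation, the sketch below sweeps a decision threshold over a set of scores and recomputes MCC each time. The scores are hypothetical values invented for this illustration, the helper name `mcc_at_threshold` is an assumption, and a zero denominator is mapped to 0 as one possible convention.

```python
import math


def mcc_at_threshold(scores, labels, threshold):
    """Binarize scores at `threshold` (score >= threshold means class 1), then compute MCC."""
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for the undefined case
    return (tp * tn - fp * fn) / denominator


scores = [0.9, 0.2, 0.6, 0.8, 0.1, 0.4]  # hypothetical model scores
labels = [1, 0, 0, 1, 0, 1]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, round(mcc_at_threshold(scores, labels, threshold), 4))
# 0.3 0.7071
# 0.5 0.3333
# 0.7 0.7071
```

The same scores yield different MCC values depending only on where the cut is placed, so a reported score is hard to interpret unless the threshold is reported with it.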

Versioning and Provenance

MCC implementations differ in how they handle zero-division and multiclass averaging. For reproducibility, document the averaging scheme, label encoding, and implementation (e.g., scikit-learn's matthews_corrcoef).
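To make the multiclass point concrete, here is a sketch of one commonly used multiclass generalization, Gorodkin's R_K statistic, written in plain Python. It is shown as one possible definition under stated assumptions, not as a reproduction of any particular library's behavior; the function name `multiclass_mcc` and the `zero_division` parameter are hypothetical. For two classes it reduces to the binary formula.

```python
import math
from collections import Counter


def multiclass_mcc(predictions, labels, zero_division=0.0):
    """R_K-style multiclass MCC computed from label counts.

    `zero_division` (hypothetical parameter) is returned when the
    denominator is zero, e.g. when all predictions fall in one class.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    n = len(labels)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    true_counts = Counter(labels)       # how often each class occurs
    pred_counts = Counter(predictions)  # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)
    cross = sum(true_counts[c] * pred_counts[c] for c in classes)
    numerator = correct * n - cross
    denominator = math.sqrt(
        (n * n - sum(v * v for v in pred_counts.values()))
        * (n * n - sum(v * v for v in true_counts.values()))
    )
    if denominator == 0:
        return zero_division
    return numerator / denominator


preds = ["cat", "dog", "bird", "cat", "dog", "dog"]
truth = ["cat", "dog", "bird", "dog", "dog", "bird"]
print(multiclass_mcc(preds, truth))  # 0.5
```

Because the computation only compares labels for equality, the label encoding (strings or integers) does not change the result, but it is still worth documenting so that others can reproduce the confusion matrix.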

References

Matthews, 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

Paper: https://pubmed.ncbi.nlm.nih.gov/1180967/

Implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html
