BERTScore
Semantic similarity metric that compares candidate and reference text using contextual embeddings.
Overview
BERTScore evaluates generated text by aligning tokens between candidate and reference sequences using contextual embeddings from a pretrained encoder (e.g., RoBERTa or BERT). It captures semantic similarity beyond surface overlap, offering a stronger signal than n-gram metrics for paraphrases and meaning-preserving rewrites. BERTScore is a metric rather than a benchmark and should be paired with task-specific and safety evaluations.
The metric computes precision, recall, and F1 based on cosine similarity between token embeddings. Scores are typically reported as F1, with optional IDF weighting to reduce the impact of common tokens. When multiple references are provided, scores are computed against each reference and aggregated using the maximum similarity.
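A minimal usage sketch with the reference implementation's Python package (bert_score, linked under References); the lang, idf, and rescale_with_baseline arguments follow recent releases and may differ in older ones:

# Minimal bert_score usage sketch; requires the bert-score package (and PyTorch).
from bert_score import score

candidates = ["The patient was discharged with follow-up in two weeks."]
references = ["The patient was discharged and will return in two weeks for follow-up."]

# P, R, F1 are tensors with one entry per candidate.
P, R, F1 = score(
    candidates,
    references,
    lang="en",                    # selects a default English encoder (roberta-large)
    idf=False,                    # set True to down-weight common tokens
    rescale_with_baseline=False,  # set True to apply the package's baseline rescaling
)
print(f"P={P.mean():.3f} R={R.mean():.3f} F1={F1.mean():.3f}")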
Input Format
- candidate: string (model-generated output)
- references: array of strings (one or more reference outputs)
Example:
{
  "candidate": "The patient was discharged with follow-up in two weeks.",
  "references": [
    "The patient was discharged and will return in two weeks for follow-up.",
    "Patient discharged; follow-up visit scheduled in two weeks."
  ]
}
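To make the maximum-over-references aggregation from the Overview concrete for this input, here is a small hypothetical helper (the package may also accept multiple references per candidate directly; the explicit max here just makes the aggregation rule visible):

# Hypothetical helper: score one candidate against several references and keep the best F1.
from bert_score import score

def best_reference_f1(candidate: str, references: list[str], lang: str = "en") -> float:
    cands = [candidate] * len(references)      # pair the candidate with each reference
    _, _, F1 = score(cands, references, lang=lang)
    return float(F1.max())                     # aggregate by taking the maximum

example = {
    "candidate": "The patient was discharged with follow-up in two weeks.",
    "references": [
        "The patient was discharged and will return in two weeks for follow-up.",
        "Patient discharged; follow-up visit scheduled in two weeks.",
    ],
}
print(best_reference_f1(example["candidate"], example["references"]))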
Output Format
A numeric BERTScore summary, usually reported as F1. Optional outputs include the precision and recall components and per-example scores.
{
  "bertscore_f1": 0.91
}
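As an illustration of assembling this summary from per-example scores (the bertscore_f1 field comes from the example above; the mean aggregation and extra key are illustrative):

# Illustrative aggregation of per-example F1 scores into the summary above.
import json

per_example_f1 = [0.93, 0.89, 0.91]  # one F1 per evaluated example
summary = {
    "bertscore_f1": round(sum(per_example_f1) / len(per_example_f1), 2),
    "per_example_f1": per_example_f1,  # optional per-example scores
}
print(json.dumps(summary, indent=2))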
Metrics
- BERTScore: Precision (P_BERT) and recall (R_BERT) are computed from greedy, bidirectional maximum cosine-similarity token alignments and combined into an F1 score (F_BERT); see the sketch after this list.
Token alignment computes cosine similarity for every candidate–reference token pair and, for each token, selects the most similar token on the other side: recall matches each reference token to the candidate, precision matches each candidate token to the reference.
BERTScore F1 is a similarity score that usually falls between 0 and 1, although the exact scale depends on the embedding model and whether optional baseline rescaling is applied.
- Optional: report precision and recall, IDF-weighted scores, and the embedding model used (e.g., roberta-large).
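A minimal NumPy sketch of the alignment and scoring described in this list, assuming token embeddings have already been produced by the encoder and L2-normalized so that dot products are cosine similarities; the function name and the optional IDF weights are illustrative:

# Greedy-matching BERTScore-style P/R/F1 from precomputed, normalized token embeddings.
import numpy as np

def bertscore_from_embeddings(cand_emb, ref_emb, cand_idf=None, ref_idf=None):
    # cand_emb: [m, d] candidate token embeddings; ref_emb: [n, d] reference token embeddings.
    sim = cand_emb @ ref_emb.T                                    # [m, n] cosine similarities
    cand_idf = np.ones(len(cand_emb)) if cand_idf is None else cand_idf
    ref_idf = np.ones(len(ref_emb)) if ref_idf is None else ref_idf
    # Precision: each candidate token is matched to its most similar reference token.
    precision = np.average(sim.max(axis=1), weights=cand_idf)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = np.average(sim.max(axis=0), weights=ref_idf)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with random "embeddings" just to show the shapes involved.
rng = np.random.default_rng(0)
cand = rng.normal(size=(7, 16)); cand /= np.linalg.norm(cand, axis=1, keepdims=True)
ref = rng.normal(size=(9, 16)); ref /= np.linalg.norm(ref, axis=1, keepdims=True)
print(bertscore_from_embeddings(cand, ref))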
Known Limitations
- Sensitive to the choice of embedding model, tokenizer, and IDF weighting.
- Can reward semantic similarity even when factual details are incorrect or unsafe.
- Computationally heavier than n-gram overlap metrics for large corpora.
- Scores are not directly comparable across different embedding models, domains, or languages without careful normalization.
Versioning and Provenance
BERTScore results depend on the base encoder, IDF weighting, and tokenizer version. For reproducibility, document the embedding model name and language scope (e.g., roberta-large), IDF settings, whether rescaling or baseline subtraction is applied, and the BERTScore package version.
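One way to capture these settings next to reported scores (field names are illustrative, not a required schema):

# Illustrative provenance record for a BERTScore run; field names are not a required schema.
import json
from importlib.metadata import version

provenance = {
    "metric": "BERTScore",
    "model_type": "roberta-large",                     # embedding model and language scope
    "idf": True,                                       # whether IDF weighting was enabled
    "rescale_with_baseline": False,                    # whether baseline rescaling was applied
    "bert_score_version": version("bert-score"),       # BERTScore package version
    "transformers_version": version("transformers"),   # tokenizer/encoder stack version
}
print(json.dumps(provenance, indent=2))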
References
Zhang et al., 2019. BERTScore: Evaluating Text Generation with BERT.
Paper: https://arxiv.org/abs/1904.09675
Implementation: https://github.com/Tiiiger/bert_score
Related Metrics
BLEU
Corpus-level n-gram overlap metric, originally developed for machine translation, used to score generated text against reference translations or summaries.
ROUGE
Recall-oriented overlap metrics (ROUGE-N, ROUGE-L, ROUGE-Lsum) for comparing generated summaries to reference summaries.