BERTScore
Semantic similarity metric that compares candidate and reference text using contextual embeddings.
Overview
BERTScore evaluates generated text by aligning tokens between candidate and reference sequences using contextual embeddings from a pretrained encoder (e.g., RoBERTa or BERT). It captures semantic similarity beyond surface overlap, offering a stronger signal than n-gram metrics for paraphrases and meaning-preserving rewrites. BERTScore is a metric rather than a benchmark and should be paired with task-specific and safety evaluations.
The metric computes precision, recall, and F1 based on cosine similarity between token embeddings. Scores are typically reported as F1, with optional IDF weighting to reduce the impact of common tokens. When multiple references are provided, scores are computed against each reference and aggregated using the maximum similarity.
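A minimal usage sketch with the reference implementation's Python package (bert_score, linked under References); the lang, idf, and rescale_with_baseline arguments follow recent releases and may differ in older ones:

# Minimal bert_score usage sketch; requires the bert-score package (and PyTorch).
from bert_score import score

candidates = ["The patient was discharged with follow-up in two weeks."]
references = ["The patient was discharged and will return in two weeks for follow-up."]

# P, R, F1 are tensors with one entry per candidate.
P, R, F1 = score(
    candidates,
    references,
    lang="en",                    # selects a default English encoder (roberta-large)
    idf=False,                    # set True to down-weight common tokens
    rescale_with_baseline=False,  # set True to apply the package's baseline rescaling
)
print(f"P={P.mean():.3f} R={R.mean():.3f} F1={F1.mean():.3f}")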
Input Format
- candidate: string (model-generated output)
- references: array of strings (one or more reference outputs)
Example:
{
  "candidate": "The patient was discharged with follow-up in two weeks.",
  "references": [
    "The patient was discharged and will return in two weeks for follow-up.",
    "Patient discharged; follow-up visit scheduled in two weeks."
  ]
}
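To make the maximum-over-references aggregation from the Overview concrete for this input, here is a small hypothetical helper (the package may also accept multiple references per candidate directly; the explicit max here just makes the aggregation rule visible):

# Hypothetical helper: score one candidate against several references and keep the best F1.
from bert_score import score

def best_reference_f1(candidate: str, references: list[str], lang: str = "en") -> float:
    cands = [candidate] * len(references)      # pair the candidate with each reference
    _, _, F1 = score(cands, references, lang=lang)
    return float(F1.max())                     # aggregate by taking the maximum

example = {
    "candidate": "The patient was discharged with follow-up in two weeks.",
    "references": [
        "The patient was discharged and will return in two weeks for follow-up.",
        "Patient discharged; follow-up visit scheduled in two weeks.",
    ],
}
print(best_reference_f1(example["candidate"], example["references"]))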
Output Format
A numeric BERTScore summary, usually reported as F1. Optional outputs include the precision and recall components and per-example scores.
{
  "bertscore_f1": 0.91
}
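As an illustration of assembling this summary from per-example scores (the bertscore_f1 field comes from the example above; the mean aggregation and extra key are illustrative):

# Illustrative aggregation of per-example F1 scores into the summary above.
import json

per_example_f1 = [0.93, 0.89, 0.91]  # one F1 per evaluated example
summary = {
    "bertscore_f1": round(sum(per_example_f1) / len(per_example_f1), 2),
    "per_example_f1": per_example_f1,  # optional per-example scores
}
print(json.dumps(summary, indent=2))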
Metrics
- BERTScore: Precision (P_BERT) and recall (R_BERT) are computed from greedy, bidirectional maximum cosine-similarity token alignments and combined into an F1 score (F_BERT); see the sketch after this list.
Token alignment computes cosine similarity for every candidate–reference token pair and, for each token, selects the most similar token on the other side: recall matches each reference token to the candidate, precision matches each candidate token to the reference.
BERTScore F1 is a similarity score that usually falls between 0 and 1, although the exact scale depends on the embedding model and whether optional baseline rescaling is applied.
- Optional: report precision and recall, IDF-weighted scores, and the embedding model used (e.g., roberta-large).
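A minimal NumPy sketch of the alignment and scoring described in this list, assuming token embeddings have already been produced by the encoder and L2-normalized so that dot products are cosine similarities; the function name and the optional IDF weights are illustrative:

# Greedy-matching BERTScore-style P/R/F1 from precomputed, normalized token embeddings.
import numpy as np

def bertscore_from_embeddings(cand_emb, ref_emb, cand_idf=None, ref_idf=None):
    # cand_emb: [m, d] candidate token embeddings; ref_emb: [n, d] reference token embeddings.
    sim = cand_emb @ ref_emb.T                                    # [m, n] cosine similarities
    cand_idf = np.ones(len(cand_emb)) if cand_idf is None else cand_idf
    ref_idf = np.ones(len(ref_emb)) if ref_idf is None else ref_idf
    # Precision: each candidate token is matched to its most similar reference token.
    precision = np.average(sim.max(axis=1), weights=cand_idf)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = np.average(sim.max(axis=0), weights=ref_idf)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with random "embeddings" just to show the shapes involved.
rng = np.random.default_rng(0)
cand = rng.normal(size=(7, 16)); cand /= np.linalg.norm(cand, axis=1, keepdims=True)
ref = rng.normal(size=(9, 16)); ref /= np.linalg.norm(ref, axis=1, keepdims=True)
print(bertscore_from_embeddings(cand, ref))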
Known Limitations
- Sensitive to the choice of embedding model, tokenizer, and IDF weighting.
- Can reward semantic similarity even when factual details are incorrect or unsafe.
- Computationally heavier than n-gram overlap metrics for large corpora.
- Scores are not directly comparable across different embedding models, domains, or languages without careful normalization.
Versioning and Provenance
BERTScore results depend on the base encoder, IDF weighting, and tokenizer version. For reproducibility, document the embedding model name and language scope (e.g., roberta-large), IDF settings, whether rescaling or baseline subtraction is applied, and the BERTScore package version.
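One way to capture these settings next to reported scores (field names are illustrative, not a required schema):

# Illustrative provenance record for a BERTScore run; field names are not a required schema.
import json
from importlib.metadata import version

provenance = {
    "metric": "BERTScore",
    "model_type": "roberta-large",                     # embedding model and language scope
    "idf": True,                                       # whether IDF weighting was enabled
    "rescale_with_baseline": False,                    # whether baseline rescaling was applied
    "bert_score_version": version("bert-score"),       # BERTScore package version
    "transformers_version": version("transformers"),   # tokenizer/encoder stack version
}
print(json.dumps(provenance, indent=2))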
References
Zhang et al., 2019. BERTScore: Evaluating Text Generation with BERT.
Paper: https://arxiv.org/abs/1904.09675
Implementation: https://github.com/Tiiiger/bert_score
Related Metrics
BLEU
Corpus-level n-gram overlap metric, originally developed for machine translation, used to score generated text against reference translations or summaries.
ROUGE
Recall-oriented overlap metrics (ROUGE-N, ROUGE-L, ROUGE-Lsum) for comparing generated summaries to reference summaries.