Bilingual Evaluation Understudy (BLEU)

Corpus-level n-gram overlap metric for evaluating generated text against reference outputs.

Overview

BLEU is a corpus-level n-gram overlap metric originally developed for machine translation and widely reused for summarization and other text generation tasks. It quantifies surface similarity between model outputs and reference texts but does not directly or effectively assess reasoning quality, factual correctness, or safety. BLEU is a metric rather than a benchmark and should be used with task-specific and safety-focused evaluations.

Given a model-generated candidate string and one or more reference texts, BLEU computes modified n-gram precision by clipping each candidate n-gram's count to its maximum count in any single reference. Precisions across multiple n-gram orders (typically n = 1 to 4) are combined using a weighted geometric mean, with a brevity penalty applied to discourage overly short outputs, yielding a single corpus-level score.
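
For illustration, the sketch below implements the clipped n-gram precision described above; the helper names (ngrams, modified_precision) and the example sentences are ours for illustration, not from any particular library.

# Minimal sketch of modified (clipped) n-gram precision, the building block of BLEU.
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clip each candidate n-gram count to its maximum count in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Illustrative unigram precision for a candidate against two references.
cand = "the patient was discharged".split()
refs = ["the patient was discharged home".split(),
        "patient discharged after review".split()]
print(modified_precision(cand, refs, 1))  # 1.0: every candidate unigram appears in a reference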

Input Format

  • candidate: string (model-generated output)
  • references: array of strings (one or more reference outputs)

Example (per-item):

{
  "candidate": "The patient was discharged with follow-up in two weeks.",
  "references": [
    "The patient was discharged and will return in two weeks for follow-up.",
    "Patient discharged; follow-up visit scheduled in two weeks."
  ]
}
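
A corpus-level score is computed over all items at once, so per-item records like the one above are typically collected into parallel lists first. The sketch below assumes a JSONL file (items.jsonl, a hypothetical name) with one such object per line and the same number of references per item; the transposed reference layout matches what corpus-level scorers such as SacreBLEU expect.

import json

def load_items(path):
    """Yield {'candidate': ..., 'references': [...]} dicts from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

candidates = []
reference_streams = None  # reference_streams[i][j] = i-th reference for the j-th item
for item in load_items("items.jsonl"):  # hypothetical input file
    candidates.append(item["candidate"])
    if reference_streams is None:
        reference_streams = [[] for _ in item["references"]]
    for stream, ref in zip(reference_streams, item["references"]):
        stream.append(ref)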

Output Format

A single numeric BLEU score, usually computed at the corpus level. Sentence-level BLEU scores exist but are less stable and should be interpreted with caution.

{
  "bleu": 0.42
}
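
A hedged sketch of producing this output with SacreBLEU (v2-style API) for the single example item above; SacreBLEU reports scores on a 0-100 scale, so the result is divided by 100 to match the 0-1 convention shown here.

from sacrebleu.metrics import BLEU

candidates = ["The patient was discharged with follow-up in two weeks."]
reference_streams = [
    ["The patient was discharged and will return in two weeks for follow-up."],
    ["Patient discharged; follow-up visit scheduled in two weeks."],
]

bleu = BLEU()
result = bleu.corpus_score(candidates, reference_streams)  # one list per reference stream
print({"bleu": round(result.score / 100, 2)})  # dict matching the output format above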

Metrics

  • BLEU (primary): geometric mean of modified (clipped) n-gram precisions p_n (typically n = 1 to 4) multiplied by a brevity penalty BP; a worked sketch of this combination follows this list.
    \text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)

    Scores range from 0 to 1 (often reported on a 0-100 scale), where 0 indicates no n-gram overlap with the references and 1 means an exact match (extremely rare outside identical copies).

  • Optional: report brevity penalty and individual n-gram precisions for debugging.
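
The sketch below works through this combination with uniform weights (w_n = 1/4) and illustrative precision and length values; the helper names and numbers are assumptions for illustration only.

import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

def combine_bleu(precisions, bp, weights=None):
    """Weighted geometric mean of n-gram precisions, scaled by the brevity penalty."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; smoothed BLEU variants handle this case differently
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

precisions = [0.78, 0.55, 0.40, 0.29]  # illustrative p_1..p_4
bp = brevity_penalty(candidate_len=18, reference_len=20)
print(round(combine_bleu(precisions, bp), 4))  # roughly 0.42 with these illustrative numbers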

Known Limitations

  • Outputs with good semantics but low surface overlap cause BLEU to underestimate quality (illustrated in the sketch after this list).
  • Acceptable paraphrases, lexical variations, or reordered content are over-penalized.
  • Models can game the metric by modifying output length (e.g. by lengthening outputs to avoid brevity penalties).
  • Ignores lexical variation or synonyms, penalizing diverse but equivalent phrasing.
  • In clinical settings, high BLEU scores may mask factual errors or unsafe statements not captured by n-gram overlap.
  • Scores are not directly comparable across datasets, domains, or language pairs.
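
A small demonstration of the paraphrase limitation, assuming the SacreBLEU v2 API; the sentences are illustrative, and a single-item "corpus" is used only to keep the example short.

from sacrebleu.metrics import BLEU

reference = ["The patient was discharged and will return in two weeks for follow-up."]
paraphrase = ["Discharge is complete; the follow-up appointment happens a fortnight from now."]
near_copy = ["The patient was discharged and will return in two weeks for followup."]

bleu = BLEU()
# The faithful paraphrase scores far lower than the near copy despite equivalent meaning.
print("paraphrase:", bleu.corpus_score(paraphrase, [reference]).score)
print("near copy: ", bleu.corpus_score(near_copy, [reference]).score)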

Versioning and Provenance

BLEU implementations vary in tokenization, case handling, smoothing, and n-gram order. These differences can make reported scores non-reproducible, and in clinical settings the underlying n-gram overlap may still mask factual errors or unsafe content. Always document the BLEU variant and version (e.g. SacreBLEU v2.5.0), preprocessing steps (e.g., lowercasing), and the dataset format and version for reproducibility.
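
A hedged sketch of recording that provenance alongside the score, assuming the SacreBLEU v2 API in which the metric object exposes get_signature(); the dataset identifier is hypothetical.

from sacrebleu.metrics import BLEU

bleu = BLEU()
result = bleu.corpus_score(
    ["The patient was discharged with follow-up in two weeks."],
    [["The patient was discharged and will return in two weeks for follow-up."]],
)
record = {
    "bleu": round(result.score / 100, 2),
    "signature": str(bleu.get_signature()),  # encodes tokenizer, casing, smoothing, version
    "dataset": "discharge-summaries-v1",     # hypothetical dataset name and version
}
print(record)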

References

Papineni et al., 2002. BLEU: a Method for Automatic Evaluation of Machine Translation.

Paper: https://aclanthology.org/P02-1040

Implementation (SacreBLEU): https://github.com/mjpost/sacrebleu

Related Metrics