Metrics
Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
A set of recall-oriented n-gram overlap metrics commonly used to evaluate automatic summarization and other text generation tasks.
Overview
ROUGE is a family of overlap metrics (e.g., ROUGE-N, ROUGE-L, ROUGE-Lsum) that compare generated text to one or more reference summaries. It was designed for summarization and is recall-oriented, emphasizing how much of the reference content appears in the candidate output. ROUGE is a metric, not a standalone benchmark; pair it with task-specific quality and safety checks.
The model generates a candidate summary given a source document. ROUGE is implemented in several variants, including (a usage sketch follows this list):
- ROUGE-N - measures n-gram overlap against reference data
- ROUGE-L - measures longest common subsequence (LCS) against reference data
- ROUGE-Lsum - measures summary-level LCS overlap against reference data, with summaries split into sentences
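A minimal scoring sketch, assuming the google-research rouge-score package (pip install rouge-score); variant names and options follow that implementation and the texts are hypothetical.

from rouge_score import rouge_scorer

# Hypothetical texts; use_stemmer=True is an optional setting, not a requirement.
reference = "The cats quickly climbed the old oak tree in the garden."
candidate = "Cats climbed an old oak tree in the garden quickly."

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)

# score(target, prediction) returns a dict mapping each variant to a
# Score(precision, recall, fmeasure) tuple. In this implementation, rougeLsum
# expects newline-separated sentences in multi-sentence texts.
scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")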
Input Format
Logical schema:
candidate: string (model-generated summary)
references: array of strings (one or more reference summaries)
Example (per-item):
{
  "candidate": "Patient discharged with two-week follow-up scheduled.",
  "references": [
    "The patient was discharged and will return in two weeks for follow-up.",
    "Discharge completed; follow-up visit set for two weeks."
  ]
}
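A sketch of scoring one item in this schema against multiple references, again assuming the rouge-score package; taking the maximum F1 over references is one common convention, not a fixed rule.

from rouge_score import rouge_scorer

item = {
    "candidate": "Patient discharged with two-week follow-up scheduled.",
    "references": [
        "The patient was discharged and will return in two weeks for follow-up.",
        "Discharge completed; follow-up visit set for two weeks.",
    ],
}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# Score the candidate against each reference and keep the best F1 per variant.
best = {}
for ref in item["references"]:
    for name, s in scorer.score(ref, item["candidate"]).items():
        best[name] = max(best.get(name, 0.0), s.fmeasure)

print(best)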
Output Format
Expected output: one score per reported variant, as in the example below. ROUGE is typically reported at the corpus level to reduce variance; per-example (sentence-level) scores are implementation-dependent and should be interpreted cautiously.
{
  "rouge1": 0.52,
  "rouge2": 0.28,
  "rougeL": 0.46
}
Metrics
- ROUGE-N (primary: ROUGE-1, ROUGE-2): n-gram overlap; originally recall-based, often reported as precision/recall/F1
- ROUGE-L: F1 computed from longest common subsequence (LCS) precision and recall between candidate and reference summaries
- ROUGE-Lsum: summary-level longest common subsequence (LCS), computed with candidate and reference summaries split into sentences; standard in summarization benchmarks
- Optional: report precision/recall/F1 breakdowns and document whether stemming or stopword removal was used.
ROUGE scores range from 0 to 1 (sometimes reported as 0–100%), where 0 indicates no overlap with the reference text and 1 indicates complete overlap, which typically occurs only for identical or near-identical texts.
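A minimal corpus-level sketch with precision/recall/F1 breakdowns, assuming the rouge-score package and its BootstrapAggregator; the candidate/reference pairs are hypothetical.

from rouge_score import rouge_scorer, scoring

# Hypothetical (reference, candidate) pairs standing in for a full evaluation set.
pairs = [
    ("The patient was discharged and will return in two weeks for follow-up.",
     "Patient discharged with two-week follow-up scheduled."),
    ("The committee approved the budget after a short debate.",
     "Budget approved by the committee following a brief debate."),
]

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
aggregator = scoring.BootstrapAggregator()

for reference, candidate in pairs:
    aggregator.add_scores(scorer.score(reference, candidate))

# aggregate() returns AggregateScore(low, mid, high) per variant, where each
# bound is a Score(precision, recall, fmeasure); "mid" is the point estimate.
for name, agg in aggregator.aggregate().items():
    print(f"{name}: F1={agg.mid.fmeasure:.3f} "
          f"R={agg.mid.recall:.3f} P={agg.mid.precision:.3f}")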
Known Limitations
- Relies on surface overlap, underestimating quality for valid paraphrases or reordered content.
- Hallucinated or unsupported facts can still yield high ROUGE scores when surface overlap with references is strong.
- Overlong summaries can inflate recall-oriented ROUGE variants at the expense of precision (variant-dependent).
- ROUGE-L and ROUGE-Lsum are sensitive to content order.
- All ROUGE variants are susceptible to metric gaming via keyword stuffing, which can boost unigram overlap without improving coherence or faithfulness (illustrated in the sketch after this list).
- Does not assess factual grounding, reasoning quality, safety, or clinical appropriateness.
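An illustrative sketch of the keyword-stuffing failure mode, assuming the rouge-score package; the texts are hypothetical.

from rouge_score import rouge_scorer

reference = "The patient was discharged and will return in two weeks for follow-up."
fluent = "Patient discharged with two-week follow-up scheduled."
stuffed = "patient discharged return two weeks follow-up patient discharged follow-up"

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

# The incoherent, keyword-stuffed string matches most reference unigrams, so its
# ROUGE-1 recall can exceed that of the fluent summary despite being unusable.
for label, cand in [("fluent", fluent), ("stuffed", stuffed)]:
    s = scorer.score(reference, cand)["rouge1"]
    print(f"{label}: R={s.recall:.3f} P={s.precision:.3f} F1={s.fmeasure:.3f}")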
Versioning and Provenance
ROUGE implementations (e.g., py-rouge, rouge-score, SacreROUGE) vary in tokenization, stemming, stopword handling, and averaging. Always document version, options (e.g., stemming, case-folding), reference set, and dataset for reproducibility.
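A minimal provenance sketch, assuming the rouge-score package; the record fields and dataset identifier are illustrative, not a standard format.

import importlib.metadata
import json

rouge_types = ["rouge1", "rouge2", "rougeLsum"]
use_stemmer = True

provenance = {
    "metric": "ROUGE",
    "implementation": "rouge-score",
    "implementation_version": importlib.metadata.version("rouge-score"),
    "rouge_types": rouge_types,
    "use_stemmer": use_stemmer,
    "case_folding": "lowercase",
    "multi_reference_handling": "max F1 over references",  # hypothetical convention
    "dataset": "example-summarization-set-v1",             # hypothetical identifier
}
print(json.dumps(provenance, indent=2))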
References
Lin, 2004. ROUGE: A Package for Automatic Evaluation of Summaries.
Paper: https://aclanthology.org/W04-1013
Implementation (rouge-score): https://github.com/google-research/google-research/tree/master/rouge
Related Metrics
BLEU
Corpus-level n-gram overlap metric originating in machine translation, used to score generated text against reference translations or summaries.
BERTScore
Contextual embedding-based similarity metric for scoring generated text against reference outputs.