Metrics

Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

A set of recall-oriented n-gram overlap metrics commonly used to evaluate automatic summarization and other text generation tasks.

Overview

ROUGE is a family of overlap metrics (e.g., ROUGE-N, ROUGE-L, ROUGE-Lsum) that compare generated text to one or more reference summaries. It was designed for summarization and is recall-oriented, emphasizing how much of the reference content appears in the candidate output. ROUGE is a metric, not a standalone benchmark; pair it with task-specific quality and safety checks.

The model generates a candidate summary given a source document. ROUGE has several variants, including (see the scoring sketch after this list):

  • ROUGE-N - measures n-gram overlap against reference data
  • ROUGE-L - measures longest common subsequence (LCS) against reference data
  • ROUGE-Lsum - measures summary-level LCS, computed sentence by sentence, against reference data
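
A minimal scoring sketch for these variants, assuming the rouge-score package linked under References (pip install rouge-score); the candidate and reference strings are illustrative.

# Minimal sketch using the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

# Variants mirror the list above; use_stemmer is an optional preprocessing choice.
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    use_stemmer=True,
)

reference = "The patient was discharged and will return in two weeks for follow-up."
candidate = "Patient discharged with two-week follow-up scheduled."

# score(target, prediction) returns a dict of Score(precision, recall, fmeasure).
# Note: rougeLsum splits summaries on newlines, so multi-sentence summaries
# should separate sentences with "\n".
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")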

Input Format

Logical schema:

  • candidate: string (model-generated summary)
  • references: array of strings (one or more reference summaries)

Example (per-item):

{
  "candidate": "Patient discharged with two-week follow-up scheduled.",
  "references": [
    "The patient was discharged and will return in two weeks for follow-up.",
    "Discharge completed; follow-up visit set for two weeks."
  ]
}
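
A hedged per-item sketch for the schema above: score the candidate against each reference and keep the best F1 per variant. Taking the max over references is a common convention, not the only option.

# Per-item sketch: best F1 over multiple references (a common convention;
# other aggregations exist). Field names follow the JSON schema above.
from rouge_score import rouge_scorer

item = {
    "candidate": "Patient discharged with two-week follow-up scheduled.",
    "references": [
        "The patient was discharged and will return in two weeks for follow-up.",
        "Discharge completed; follow-up visit set for two weeks.",
    ],
}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

best = {}
for ref in item["references"]:
    for name, s in scorer.score(ref, item["candidate"]).items():
        best[name] = max(best.get(name, 0.0), s.fmeasure)

print(best)  # {"rouge1": ..., "rouge2": ..., "rougeL": ...}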

Output Format

Expected output: ROUGE is typically reported at the corpus level to reduce variance. Per-item scores are implementation-dependent and should be interpreted cautiously.

{
  "rouge1": 0.52,
  "rouge2": 0.28,
  "rougeL": 0.46
}
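
One way to produce the corpus-level dict above, assuming rouge-score's BootstrapAggregator and, for brevity, only the first reference per item; both choices are simplifications, not requirements.

# Corpus-level sketch: aggregate per-item scores with bootstrap resampling.
from rouge_score import rouge_scorer, scoring

def corpus_rouge(items):
    """items: list of {"candidate": str, "references": [str, ...]} dicts."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()
    for item in items:
        # Simplification: score against the first reference only.
        aggregator.add_scores(scorer.score(item["references"][0], item["candidate"]))
    result = aggregator.aggregate()
    # Report the mid (point-estimate) F1 per variant, matching the JSON above.
    return {name: round(agg.mid.fmeasure, 2) for name, agg in result.items()}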

Metrics

  • ROUGE-N (primary: ROUGE-1, ROUGE-2): n-gram overlap; originally recall-based, often reported as precision/recall/F1
  • Recall form (a from-scratch sketch follows this list): \text{ROUGE-N}_R = \frac{\sum \text{overlapping n-grams}}{\sum \text{reference n-grams}}
  • ROUGE-L: F1 computed from longest common subsequence (LCS) precision and recall between candidate and reference summaries
  • ROUGE-Lsum: summary-level LCS in which summaries are split into sentences and sentence-level LCS scores are aggregated; standard in summarization benchmarks
  • ROUGE scores range from 0 to 1 (sometimes reported as 0–100%), where 0 indicates no overlap with the reference text and 1 indicates complete overlap, which typically occurs only for identical or near-identical texts.

  • Optional: report precision/recall/F1 breakdowns and the preprocessing used (stemming, stopword removal).
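
The from-scratch sketch below implements the recall formula above (plus precision and F1) for a single reference, using whitespace tokenization and clipped counts; real implementations add tokenization, stemming, and multi-reference handling.

# From-scratch ROUGE-N sketch: clipped n-gram overlap with one reference.
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())              # clipped overlapping n-grams
    recall = overlap / max(sum(ref.values()), 1)      # formula above: / reference n-grams
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}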

Known Limitations

  • Relies on surface overlap, underestimating quality for valid paraphrases or reordered content.
  • Hallucinated or unsupported facts can still yield high ROUGE scores when surface overlap with references is strong.
  • Overlong summaries can inflate recall-oriented ROUGE variants at the expense of precision (variant-dependent).
  • ROUGE-L and ROUGE-Lsum are sensitive to content order.
  • All ROUGE variants are susceptible to metric gaming via keyword stuffing, which can boost unigram overlap without improving coherence or faithfulness.
  • Does not assess factual grounding, reasoning quality, safety, or clinical appropriateness.

Versioning and Provenance

ROUGE implementations (e.g., py-rouge, rouge-score, SacreROUGE) vary in tokenization, stemming, stopword handling, and averaging. Always document version, options (e.g., stemming, case-folding), reference set, and dataset for reproducibility.
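
A small, hedged example of recording that configuration alongside results; the field names and dataset labels are illustrative, not a standard schema.

# Illustrative provenance record for a ROUGE run (field names are placeholders).
import json
from importlib.metadata import version

eval_config = {
    "metric": "ROUGE",
    "implementation": "rouge-score",
    "implementation_version": version("rouge-score"),  # PyPI distribution name
    "variants": ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    "options": {"use_stemmer": True},
    "reference_set": "example-reference-set-v1",  # placeholder identifier
    "dataset": "example-dataset-v1",              # placeholder identifier
}

with open("rouge_eval_config.json", "w") as f:
    json.dump(eval_config, f, indent=2)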

References

Lin, Chin-Yew. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out (ACL 2004 Workshop).

Paper: https://aclanthology.org/W04-1013

Implementation (rouge-score): https://github.com/google-research/google-research/tree/master/rouge

Related Metrics