Metrics
Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
A set of recall-oriented n-gram overlap metrics commonly used to evaluate automatic summarization and other text generation tasks.
Overview
ROUGE is a family of overlap metrics (e.g., ROUGE-N, ROUGE-L, ROUGE-Lsum) that compare generated text to one or more reference summaries. It was designed for summarization and is recall-oriented, emphasizing how much of the reference content appears in the candidate output. ROUGE is a metric, not a standalone benchmark; pair it with task-specific quality and safety checks.
The model generates a candidate summary given a source document. ROUGE is implemented in several variants, including (a usage sketch follows this list):
- ROUGE-N - measures n-gram overlap against reference data
- ROUGE-L - measures longest common subsequence (LCS) against reference data
- ROUGE-Lsum - measures summary-level LCS overlap against reference data, with summaries split into sentences
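A minimal scoring sketch, assuming the google-research rouge-score package (pip install rouge-score); variant names and options follow that implementation and the texts are hypothetical.

from rouge_score import rouge_scorer

# Hypothetical texts; use_stemmer=True is an optional setting, not a requirement.
reference = "The cats quickly climbed the old oak tree in the garden."
candidate = "Cats climbed an old oak tree in the garden quickly."

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)

# score(target, prediction) returns a dict mapping each variant to a
# Score(precision, recall, fmeasure) tuple. In this implementation, rougeLsum
# expects newline-separated sentences in multi-sentence texts.
scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")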
Input Format
Logical schema:
candidate: string (model-generated summary)
references: array of strings (one or more reference summaries)
Example (per-item):
{
  "candidate": "Patient discharged with two-week follow-up scheduled.",
  "references": [
    "The patient was discharged and will return in two weeks for follow-up.",
    "Discharge completed; follow-up visit set for two weeks."
  ]
}
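A sketch of scoring one item in this schema against multiple references, again assuming the rouge-score package; taking the maximum F1 over references is one common convention, not a fixed rule.

from rouge_score import rouge_scorer

item = {
    "candidate": "Patient discharged with two-week follow-up scheduled.",
    "references": [
        "The patient was discharged and will return in two weeks for follow-up.",
        "Discharge completed; follow-up visit set for two weeks.",
    ],
}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# Score the candidate against each reference and keep the best F1 per variant.
best = {}
for ref in item["references"]:
    for name, s in scorer.score(ref, item["candidate"]).items():
        best[name] = max(best.get(name, 0.0), s.fmeasure)

print(best)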
Output Format
Expected output: one score per reported variant, as in the example below. ROUGE is typically reported at the corpus level to reduce variance; per-example (sentence-level) scores are implementation-dependent and should be interpreted cautiously.
{
  "rouge1": 0.52,
  "rouge2": 0.28,
  "rougeL": 0.46
}
Metrics
- ROUGE-N (primary: ROUGE-1, ROUGE-2): n-gram overlap; originally recall-based, often reported as precision/recall/F1
- ROUGE-L: F1 computed from longest common subsequence (LCS) precision and recall between candidate and reference summaries
- ROUGE-Lsum: summary-level longest common subsequence (LCS), computed with candidate and reference summaries split into sentences; standard in summarization benchmarks
- Optional: report precision/recall/F1 breakdowns and document whether stemming or stopword removal was used.
ROUGE scores range from 0 to 1 (sometimes reported as 0–100%), where 0 indicates no overlap with the reference text and 1 indicates complete overlap, which typically occurs only for identical or near-identical texts.
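A minimal corpus-level sketch with precision/recall/F1 breakdowns, assuming the rouge-score package and its BootstrapAggregator; the candidate/reference pairs are hypothetical.

from rouge_score import rouge_scorer, scoring

# Hypothetical (reference, candidate) pairs standing in for a full evaluation set.
pairs = [
    ("The patient was discharged and will return in two weeks for follow-up.",
     "Patient discharged with two-week follow-up scheduled."),
    ("The committee approved the budget after a short debate.",
     "Budget approved by the committee following a brief debate."),
]

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
aggregator = scoring.BootstrapAggregator()

for reference, candidate in pairs:
    aggregator.add_scores(scorer.score(reference, candidate))

# aggregate() returns AggregateScore(low, mid, high) per variant, where each
# bound is a Score(precision, recall, fmeasure); "mid" is the point estimate.
for name, agg in aggregator.aggregate().items():
    print(f"{name}: F1={agg.mid.fmeasure:.3f} "
          f"R={agg.mid.recall:.3f} P={agg.mid.precision:.3f}")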
Known Limitations
- Relies on surface overlap, underestimating quality for valid paraphrases or reordered content.
- Hallucinated or unsupported facts can still yield high ROUGE scores when surface overlap with references is strong.
- Overlong summaries can inflate recall-oriented ROUGE variants at the expense of precision (variant-dependent).
- ROUGE-L and ROUGE-Lsum are sensitive to content order.
- All ROUGE variants are susceptible to metric gaming via keyword stuffing, which can boost unigram overlap without improving coherence or faithfulness (illustrated in the sketch after this list).
- Does not assess factual grounding, reasoning quality, safety, or clinical appropriateness.
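An illustrative sketch of the keyword-stuffing failure mode, assuming the rouge-score package; the texts are hypothetical.

from rouge_score import rouge_scorer

reference = "The patient was discharged and will return in two weeks for follow-up."
fluent = "Patient discharged with two-week follow-up scheduled."
stuffed = "patient discharged return two weeks follow-up patient discharged follow-up"

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

# The incoherent, keyword-stuffed string matches most reference unigrams, so its
# ROUGE-1 recall can exceed that of the fluent summary despite being unusable.
for label, cand in [("fluent", fluent), ("stuffed", stuffed)]:
    s = scorer.score(reference, cand)["rouge1"]
    print(f"{label}: R={s.recall:.3f} P={s.precision:.3f} F1={s.fmeasure:.3f}")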
Versioning and Provenance
ROUGE implementations (e.g., py-rouge, rouge-score, SacreROUGE) vary in tokenization, stemming, stopword handling, and averaging. Always document version, options (e.g., stemming, case-folding), reference set, and dataset for reproducibility.
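A minimal provenance sketch, assuming the rouge-score package; the record fields and dataset identifier are illustrative, not a standard format.

import importlib.metadata
import json

rouge_types = ["rouge1", "rouge2", "rougeLsum"]
use_stemmer = True

provenance = {
    "metric": "ROUGE",
    "implementation": "rouge-score",
    "implementation_version": importlib.metadata.version("rouge-score"),
    "rouge_types": rouge_types,
    "use_stemmer": use_stemmer,
    "case_folding": "lowercase",
    "multi_reference_handling": "max F1 over references",  # hypothetical convention
    "dataset": "example-summarization-set-v1",             # hypothetical identifier
}
print(json.dumps(provenance, indent=2))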
References
Lin, 2004. ROUGE: A Package for Automatic Evaluation of Summaries.
Paper: https://aclanthology.org/W04-1013
Implementation (rouge-score): https://github.com/google-research/google-research/tree/master/rouge
Related Metrics
BLEU
Corpus-level n-gram overlap metric originating in machine translation, used to score generated text against reference translations or summaries.
BERTScore
Contextual embedding-based similarity metric for scoring generated text against reference outputs.