Benchmarks

Holistic Evaluation of Language Models (HELM)

A comprehensive evaluation framework for language models that enables transparent, multi-dimensional assessment across diverse tasks and domains using standardized scenarios, metrics, and reporting.

Overview

HELM is designed to support consistent, comparable evaluation of language models across a broad range of capabilities, including task performance, robustness, fairness, safety, efficiency, and calibration. Rather than optimizing for a single benchmark or metric, HELM emphasizes standardized evaluation practices that make model behavior interpretable and comparable across settings.

HELM evaluates language models using scenario-based prompts spanning question answering, summarization, reasoning, and safety-related text generation. Each scenario is paired with task-appropriate metrics, and results are reported across multiple dimensions to surface trade-offs between accuracy, robustness, fairness, calibration, and efficiency.
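
As a concrete illustration of this pairing, the sketch below represents scenarios together with the dimensions reported for them. It is a minimal sketch with hypothetical names, not HELM's actual API.

# Illustrative only; class, field, and scenario names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ScenarioSpec:
    name: str                          # e.g., "mmlu:subject=anatomy"
    task_type: str                     # "generation" or "multiple_choice"
    metrics: list[str] = field(default_factory=list)

# Each scenario is paired with task-appropriate metrics, and results are
# reported per dimension rather than collapsed into a single score.
runs = [
    ScenarioSpec("summarization_cnndm", "generation",
                 ["rouge_l", "toxicity", "inference_time"]),
    ScenarioSpec("mmlu:subject=anatomy", "multiple_choice",
                 ["exact_match", "calibration", "robustness"]),
]

for spec in runs:
    print(f"{spec.name}: report {', '.join(spec.metrics)}")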

Dataset Specification

Size

No fixed dataset size. Evaluation is scenario-dependent, with HELM v0 reporting ~42 million total model evaluations.

Source

Public datasets combined with standardized evaluation scenarios spanning multiple dimensions such as task accuracy, robustness, calibration, fairness, toxicity, and efficiency. Scenarios draw from established benchmarks (e.g., MMLU and widely used bias and toxicity datasets), and some require API access to hosted models (e.g., OpenAI, Anthropic).

Input Format

Scenario-specific: HELM does not define a single canonical input schema. Each scenario specifies its own prompt template, inputs, and references based on task type.

Common scenario types:

Generation scenarios:

  • prompt: instruction/context for the task (e.g., source text to summarize).
  • references: optional list of gold answers/summaries for scoring.
  • expected_format: optional guidance (e.g., bullet list, JSON schema).

Classification / multiple-choice scenarios:

  • doc: structured input (e.g., question, context, answer choices).
  • target: ground-truth label.
  • references: label(s) used for scoring.

Example 1: generation / summarization scenario:

{
  "prompt": "Summarize the following article in 2 sentences: ...",
  "references": [
    "Reference summary 1",
    "Reference summary 2"
  ],
  "expected_format": "short paragraph"
}

Example 2: classification / multiple-choice scenario (e.g., MMLU):

{
  "doc": {
    "question": "Which vitamin deficiency most commonly leads to night blindness?",
    "choice": ["A. Vitamin A", "B. Vitamin B1", "C. Vitamin C", "D. Vitamin K"]
  },
  "target": "A",
  "references": ["A"]
}
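
The structured doc is rendered into a prompt by each scenario's own template. Below is a hedged sketch of that step for Example 2, assuming a simple question-plus-choices layout; HELM's real templates are scenario-specific.

# Hypothetical prompt rendering for a multiple-choice instance; illustrative only.
doc = {
    "question": "Which vitamin deficiency most commonly leads to night blindness?",
    "choices": ["A. Vitamin A", "B. Vitamin B1", "C. Vitamin C", "D. Vitamin K"],
}

# Join the question and labeled choices, then ask for a single option letter.
prompt = doc["question"] + "\n" + "\n".join(doc["choices"]) + "\nAnswer:"
print(prompt)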

Output Format

Task-dependent:
  • Classification scenarios output a discrete label or option index.
  • Generation scenarios output free-form text, optionally constrained by formatting rules.

Outputs are normalized according to scenario-specific scoring logic (e.g., option mapping, whitespace/case normalization, or structured validation).

Example 1: generation / summarization scenario:

{
  "answer": "The article describes ..."
}

Example 2: classification / multiple-choice scenario (e.g., MMLU):

{
  "answer": "A"
}
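
To make the normalization step concrete, here is a minimal sketch, assuming a simple option-letter mapping plus whitespace/case cleanup; it is not HELM's scoring code.

import re

def normalize_choice(raw_answer, valid_labels=("A", "B", "C", "D")):
    """Map a free-form response to the first standalone option letter, if any."""
    match = re.search(r"\b([A-D])\b", raw_answer.strip().upper())
    if match and match.group(1) in valid_labels:
        return match.group(1)
    return None

def normalize_text(raw_answer):
    """Whitespace and case normalization for open-ended generation scoring."""
    return " ".join(raw_answer.split()).lower()

print(normalize_choice("The answer is A."))             # -> "A"
print(normalize_text("  The article describes ...  "))  # -> "the article describes ..."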

Metrics

  • Task quality: Exact Match (EM), accuracy, or F1 for closed-ended tasks such as classification and multiple-choice QA (see the scoring sketch after this list).
  • Generation quality: ROUGE, BLEU, BERTScore, or model-/human-judge scores for open-ended generation tasks (e.g., summarization).
  • Safety and fairness: Content-safety checks, refusal behavior, or subgroup performance analyses where defined by the scenario.
  • Robustness: Performance under adversarial perturbations, input variations, or distribution shifts, evaluated per scenario.
  • Calibration: Alignment between predicted confidence and observed accuracy, where confidence estimates are available.
  • Efficiency: Latency, throughput, and cost per example or per run.
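
The closed-ended and calibration entries above can be illustrated with a short sketch. This is illustrative only; HELM's metric implementations differ in detail.

def exact_match(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)

def expected_calibration_error(confidences, correct, num_bins=10):
    """Average |confidence - accuracy| gap, weighted by the number of items per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        lo, hi = b / num_bins, (b + 1) / num_bins
        # Assign items to the bin (lo, hi]; place exact zeros in the first bin.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece

print(exact_match(["A", "B", "C"], ["A", "B", "D"]))                     # ~0.67
print(expected_calibration_error([0.9, 0.8, 0.6], [True, True, False]))  # 0.30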

Known Limitations

  • Coverage, difficulty, and reliability vary substantially across scenarios, making aggregate scores sensitive to scenario selection.
  • Results can be brittle to prompt perturbations or minor input changes in some scenarios.
  • Some evaluations rely on automated or model-based judges, which may introduce scoring noise or bias.
  • Safety and refusal behavior is scenario-dependent and may surface only under specific red-team prompts.
  • Grounded generation tasks may still exhibit hallucinations despite reference-based or retrieval-augmented setups.
  • Scores can be sensitive to formatting requirements when structured outputs are expected.
  • Calibration quality is inconsistent across tasks and domains, limiting cross-scenario comparability.
  • Not domain-specific to healthcare and does not directly assess clinical safety, workflows, or deployment readiness.

Versioning and Provenance

HELM publishes versioned releases (e.g., helm-latest, helm-vX). For reproducibility, record the scenario set, metrics, and scorer versions used, and note any gated assets or private datasets included in a run.
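
A minimal run-provenance record might look like the sketch below; the field names are hypothetical, not a HELM schema.

# Hypothetical provenance record for one evaluation run (not a HELM schema).
run_provenance = {
    "helm_release": "helm-vX",                 # pin an exact release rather than "latest"
    "scenarios": ["mmlu:subject=anatomy", "summarization_cnndm"],
    "metrics": ["exact_match", "rouge_l", "calibration"],
    "scorer_versions": {"rouge": "x.y.z", "bertscore": "x.y.z"},
    "gated_assets": [],                        # any private or gated datasets used
    "model_endpoints": ["openai/...", "anthropic/..."],
}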

References

Liang et al., 2022. Holistic Evaluation of Language Models.

Paper: https://arxiv.org/abs/2211.09110

GitHub Repository: https://github.com/stanford-crfm/helm

Related Benchmarks