Benchmarks
MT-Bench
A multi-turn conversational benchmark for evaluating instruction following and dialogue quality using LLM-based judges.
Overview
MT-Bench evaluates how language models perform in multi-turn dialogue settings that require maintaining context, following instructions, and responding coherently over multiple exchanges. The benchmark spans diverse task categories, including writing, reasoning, math, coding, extraction, STEM topics, roleplay, and humanities/social science. Rather than measuring single-turn correctness, MT-Bench focuses on sustained conversational behavior and instruction adherence over time.
Models are evaluated on structured multi-turn dialogues consisting of one or more user turns, with optional system context. Outputs are assessed using rubric-guided comparisons by LLM judges, which score responses based on qualities such as helpfulness, coherence, accuracy, and safety across turns. Results are typically reported as aggregate scores, category-level scores, and head-to-head win rates between models.
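To make this flow concrete, the sketch below (Python) runs one benchmark item turn by turn and then hands the full transcript to a judge. The generate_reply and judge_transcript callables are hypothetical stand-ins for the model under test and the LLM judge; they are not part of MT-Bench or FastChat.

# Minimal sketch of a multi-turn evaluation loop.
# generate_reply() and judge_transcript() are hypothetical stand-ins,
# not an official MT-Bench API.

def run_dialogue(user_turns, generate_reply):
    """Feed user turns one at a time, carrying the conversation forward."""
    conversation = []
    responses = []
    for turn in user_turns:
        conversation.append({"role": "user", "content": turn})
        reply = generate_reply(conversation)              # model under test
        conversation.append({"role": "assistant", "content": reply})
        responses.append(reply)
    return conversation, responses

def evaluate_item(item, generate_reply, judge_transcript):
    """Run one benchmark item and score the full transcript with the judge."""
    user_turns = [m["content"] for m in item["messages"] if m["role"] == "user"]
    transcript, responses = run_dialogue(user_turns, generate_reply)
    score = judge_transcript(transcript)                  # rubric-guided judge call
    return {"responses": responses, "score": score}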
Dataset Specification
Size
80 multi-turn dialogue scenarios spanning eight task categories: writing, roleplay, reasoning, mathematics, coding, extraction, STEM, and humanities/social science. Each scenario consists of two user turns, with ten scenarios per category. The benchmark is used as a single evaluation set with category labels for reporting.
Source
Human-curated multi-turn prompts designed to assess conversational quality and instruction following, with prompts filtered and refined using GPT-4 to ensure appropriate quality and difficulty.
Input Format
messages: array of role-annotated dialogue turns (e.g., user or system messages).
Example:
{
  "messages": [
    { "role": "user", "content": "This fruit has more vitamin C than oranges. Ask 3 clarifying questions." },
    { "role": "user", "content": "Now answer in one sentence with the fruit name and one health benefit." }
  ]
}
Output Format
Model responses for each turn, captured in turn order (here as a plain array of strings) and passed to the judge for evaluation.
{
  "responses": [
    "Is it small and green with fuzzy skin, commonly called a kiwi, and known for high vitamin C content?",
    "Kiwi — it is rich in vitamin C, which supports immune function."
  ]
}
Note: These examples are illustrative, not original MT-Bench items.
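As a light sanity check on this schema, the sketch below loads items stored one JSON object per line and verifies the messages structure shown above. The questions.jsonl file name and the per-line layout are assumptions, not an official MT-Bench artifact.

import json

# Sketch of loading and validating items in the messages format above.
def load_items(path="questions.jsonl"):
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            msgs = item["messages"]
            assert isinstance(msgs, list) and msgs, "messages must be a non-empty list"
            for m in msgs:
                assert m["role"] in {"system", "user"}, f"unexpected role: {m['role']}"
                assert isinstance(m["content"], str), "content must be a string"
            items.append(item)
    return items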
Metrics
Judges evaluate outputs based on helpfulness, coherence, relevance, accuracy, and safety, typically using GPT-4-class models.
- MT-Bench score: aggregate performance, most commonly reported as the average of single-answer judge ratings on a 1-10 scale across turns and questions, with pairwise comparison as an alternative mode (see the aggregation sketch after this list)
- Category scores: per-domain averages (writing, reasoning, math, coding, etc.)
- Win rates: judge preference percentages from head-to-head model comparisons (e.g., Model A wins 65% of judgments)
- Optional: inter-judge agreement when multiple judges are used
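A minimal aggregation sketch, assuming each judged record is a dict with a category label, a 1-10 single-answer score, and an optional pairwise verdict ("A", "B", or "tie"); these field names and the tie-handling convention are illustrative rather than mandated by MT-Bench.

from collections import defaultdict

# Sketch of aggregating judge records into the reported metrics.
# Assumes a non-empty list of records; field names are illustrative.
def aggregate(records):
    overall = sum(r["score"] for r in records) / len(records)  # MT-Bench score (single-answer mode)

    by_category = defaultdict(list)
    for r in records:
        by_category[r["category"]].append(r["score"])
    category_scores = {c: sum(s) / len(s) for c, s in by_category.items()}  # per-domain averages

    # Head-to-head win rate for "Model A"; ties counted as half a win here.
    verdicts = [r["verdict"] for r in records if r.get("verdict")]
    wins = sum(v == "A" for v in verdicts) + 0.5 * sum(v == "tie" for v in verdicts)
    win_rate = wins / len(verdicts) if verdicts else None

    return {"mt_bench_score": overall,
            "category_scores": category_scores,
            "win_rate_model_a": win_rate}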
Known Limitations
- Relies on synthetic, human-curated prompts that may not fully reflect real-world conversational use or deployment contexts.
- Evaluation outcomes are sensitive to the choice of LLM judge, judge prompting, and rubric design, which can affect score stability and comparability.
- Performance may vary with prompt phrasing or formatting, making results brittle to minor input changes.
- Multi-turn evaluations can surface context loss, hallucinations, or factual drift across turns, which are not uniformly penalized across tasks.
- Models may optimize for perceived judge preferences rather than underlying conversational quality or correctness.
- Not domain-specific to healthcare and does not assess clinical reasoning, safety, or real-world decision-making.
Versioning and Provenance
MT-Bench scores vary significantly by prompt version, judge model and temperature (e.g., GPT-4 at temperature 0), scoring mode (pairwise comparison versus single-answer grading), and any prompt or rubric changes. These details should always be documented, as they materially affect score comparability.
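One lightweight way to meet this requirement is to attach a small provenance record to every reported result, as in the sketch below; the field names and values are illustrative, not a standard schema.

from dataclasses import dataclass, asdict

@dataclass
class EvalProvenance:
    # Illustrative record of the settings that affect MT-Bench comparability.
    prompt_version: str       # benchmark release or commit hash
    judge_model: str          # e.g., "gpt-4"
    judge_temperature: float  # e.g., 0.0
    scoring_mode: str         # "single-answer" or "pairwise"
    rubric_version: str       # identifier for any judge-prompt or rubric changes

provenance = EvalProvenance(
    prompt_version="v1.0",
    judge_model="gpt-4",
    judge_temperature=0.0,
    scoring_mode="single-answer",
    rubric_version="default",
)
# Attach to any reported results before publishing or comparing runs.
report = {"scores": {}, "provenance": asdict(provenance)}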
References
Zheng et al., 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Paper: https://arxiv.org/abs/2306.05685
Repository: https://github.com/lm-sys/FastChat
Related Benchmarks
HELM
A comprehensive evaluation framework for language models that standardizes scenarios, prompts, metrics, and reporting across diverse tasks, domains, and use cases.
HealthBench
Healthcare evaluation suite developed by OpenAI that assesses clinical, administrative, and patient-communication tasks using safety-aware scenarios and physician-authored, rubric-based scoring.