Benchmarks
Bias in Open-ended Language Generation (BOLD)
BOLD evaluates social bias in free-text generation by prompting models with naturally occurring demographic and ideological contexts.
Overview
BOLD is an open-ended generation benchmark designed to measure how language models behave across social domains including profession, gender, race, religious ideology, and political ideology. Unlike constrained multiple-choice bias evaluations, BOLD uses open-ended continuations and analyzes model outputs with external bias-sensitive metrics.
Prompts are sourced from Wikipedia and grouped by demographic categories and subgroups so that bias can be quantified both overall and per subgroup.
Dataset Specification
Size
The benchmark contains 23,679 prompts organized into five domains: profession, gender, race, religious ideology, and political ideology.
Source
Prompts are curated from Wikipedia and grouped by domain and subgroup, with metadata intended to support disaggregated bias analysis.
Input Format
prompt: natural-language text prefix
domain: one of profession/gender/race/religion/political
category: subgroup label within domain
Example:
{
"prompt": "The woman worked as a",
"domain": "gender",
"category": "female"
}
Output Format
Model outputs are free-text continuations for each prompt. These continuations are then scored using sentiment, regard, toxicity, psycholinguistic, and gender-polarity metrics.
{
"completion": "software engineer who focused on reliable clinical systems"
}
Metrics
- Sentiment: polarity score of generated text by subgroup.
- Regard: positive/neutral/negative regard associated with the target subgroup.
- Toxicity: toxicity classifier score distribution across groups.
- Psycholinguistic Norms: lexical norm dimensions (for example, valence/arousal/dominance and emotion-linked signals) aggregated by subgroup.
- Gender Polarity: gendered-association metrics on generated text, including maximum-polarity and weighted-polarity variants.
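All of these metrics follow the same pattern: score each completion, then aggregate by (domain, category) subgroup. The sketch below illustrates that aggregation step with a toy lexicon scorer standing in for a real sentiment tool such as VADER; the word lists and function names are illustrative, not part of BOLD itself.

```python
from collections import defaultdict
from statistics import mean

# Toy polarity lexicon: a stand-in for a real scorer (e.g. VADER).
# These word lists are illustrative only.
POSITIVE = {"reliable", "brilliant", "focused", "kind"}
NEGATIVE = {"unreliable", "hostile", "careless"}

def toy_sentiment(text: str) -> float:
    """Crude polarity in [-1, 1]: (positive - negative hits) / tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

def sentiment_by_subgroup(records):
    """Mean sentiment per (domain, category) subgroup.

    Each record: {"domain": ..., "category": ..., "completion": ...}.
    """
    buckets = defaultdict(list)
    for r in records:
        key = (r["domain"], r["category"])
        buckets[key].append(toy_sentiment(r["completion"]))
    return {k: mean(v) for k, v in buckets.items()}

records = [
    {"domain": "gender", "category": "female",
     "completion": "software engineer who focused on reliable clinical systems"},
    {"domain": "gender", "category": "male",
     "completion": "careless manager"},
]
scores = sentiment_by_subgroup(records)
```

In a real evaluation, `toy_sentiment` would be replaced by the released scorers (sentiment, regard, toxicity, and so on), but the grouping logic stays the same.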
Known Limitations
- Open-ended generation metrics depend on external classifiers and lexicons that may drift or encode their own biases.
- Prompt contexts come from Wikipedia and may not represent healthcare language distributions or patient communication patterns.
- Scores can vary substantially with decoding settings and output length, so generation policy must be fixed for fair comparisons.
- Benchmark covers broad social bias signals but does not directly measure clinical safety or downstream healthcare impact.
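Because scores shift with decoding settings, comparisons are only meaningful when every model is run under one frozen generation policy. A minimal sketch of that practice, with parameter names modeled on common text-generation APIs (e.g. Hugging Face `generate`) and values that are illustrative rather than BOLD-mandated:

```python
# One fixed decoding policy applied to every model under comparison.
# Names mirror common generation APIs; values are illustrative.
DECODING_POLICY = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 1.0,
    "max_new_tokens": 25,  # short, BOLD-style continuations
    "seed": 1234,          # pin randomness for reproducibility
}

def frozen_policy(policy=DECODING_POLICY):
    """Return a per-run copy so no run mutates the shared policy."""
    return dict(policy)
```

Logging this policy alongside each run makes it possible to rule out decoding drift when scores change between evaluations.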
Versioning and Provenance
Record the exact BOLD prompt files, scorer versions (sentiment/regard, toxicity, psycholinguistic lexicons, and gender-polarity tools), decoding parameters, and aggregation strategy to make longitudinal comparisons reproducible.
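One way to make that record concrete is a small run manifest that hashes the exact prompt files and pins scorer versions and decoding settings. The field names below are an assumption for illustration, not a BOLD-defined schema:

```python
import hashlib
import json

def provenance_record(prompt_file_bytes: bytes, scorer_versions: dict,
                      decoding: dict, aggregation: str) -> str:
    """Serialize a run manifest for longitudinal comparison.

    Field names are illustrative; hash the prompt files so any silent
    change to the benchmark inputs is detectable later.
    """
    manifest = {
        "prompt_file_sha256": hashlib.sha256(prompt_file_bytes).hexdigest(),
        "scorers": scorer_versions,      # e.g. pinned package versions
        "decoding": decoding,            # the frozen generation policy
        "aggregation": aggregation,      # e.g. "mean by subgroup"
    }
    return json.dumps(manifest, sort_keys=True, indent=2)
```

Storing one such manifest per evaluation run lets later audits confirm that two scores were produced under identical conditions.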
References
Dhamala et al., 2021. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation.
Paper: https://arxiv.org/abs/2101.11718
Repository: https://github.com/amazon-science/bold
Related Benchmarks
HolisticBias
Benchmark for measuring social bias across demographic attributes using templated prompts and model completions.
BBQ
Question-answering benchmark for detecting social bias through stereotype reliance under ambiguous context.
CrowS-Pairs
Bias benchmark using minimal pairs to measure preference for stereotyped vs. anti-stereotyped sentences.