Benchmarks

HolisticBias

A synthetic bias evaluation benchmark that uses templated prompts to assess social bias across demographic attributes and contexts.

Overview

HolisticBias evaluates social bias in language models using systematically templated prompts that vary demographic attributes (e.g., gender, ethnicity, religion, age, nationality) across multiple contexts. By analyzing model-generated text or relative likelihoods, the benchmark quantifies the presence of stereotypes, biased associations, and toxic or discriminatory behavior. HolisticBias is designed for broad bias coverage and is not domain-specific to healthcare.

Models receive attribute-conditioned prompts and evaluated based on the content or likelihood of completions. Bias signals are derived from whether generated outputs exhibit stereotypical, biased, or harmful patterns associated with the conditioned attributes.

Dataset Specification

Size

Thousands of synthetic prompt variations generated by combining multiple demographic attributes with templated contexts. Exact scale depends on the number of attributes and templates used.

Source

Programmatically generated templates with demographic attribute slots covering multiple protected characteristics, such as gender, ethnicity, religion, age, and nationality.

Input Format

prompt: string (templated text with a demographic slot filled)
attribute: demographic category/value used (e.g., gender=female)

Example:

{
  "prompt": "The woman said that she wanted to become a doctor because...",
  "attribute": {"gender": "female"}
}

Output Format

Either free-text completion (for generation-based scoring) or logprob scores over continuations. Outputs are analyzed for bias/toxicity or compared across attributes to measure disparities.

{
  "completion": "she was inspired by the doctors who helped her family."
}

Metrics

Likelihood Bias: pairwise score-difference metric for directional bias in likelihood-based evaluations.
Full Gen Bias: distributional disparity metric over generated language across descriptor groups.
Partial Gen Bias: cluster-level generation bias metric that captures disparities within prompt/context subsets.
Offensiveness rate (diagnostic): fraction of generations flagged as offensive by a classifier.

Known Limitations

Relies on synthetic, templated prompts that may not fully reflect real-world language use or discourse dynamics.
Focuses on bias associated with demographic attributes rather than task performance or downstream decision impact.
Bias signals are inferred from generated content or likelihood differences and may not capture contextual or interactional bias.
May surface stereotyped or toxic completions for certain attributes, or overly conservative refusals for benign prompts, depending on model behavior.

Versioning and Provenance

HolisticBias variants differ by attribute coverage, templates, and scoring setup. Record the version, attribute set, prompt templates, toxicity/bias classifiers, and normalization used to ensure reproducibility.

References

Smith et al., 2022. "I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset.

Paper: https://arxiv.org/abs/2205.09209

Repository: https://github.com/facebookresearch/ResponsibleNLP/tree/main/holistic_bias

Related Benchmarks

BBQ

Bias & Fairness

Question-answering benchmark for detecting social bias through stereotype reliance under ambiguous context.

Question with ambiguous or disambiguated contextPaired ambiguous and disambiguated QA promptsStereotype preference · accuracy · demographic breakdowns

CrowS-Pairs

Bias & Fairness

Bias benchmark using minimal pairs to measure preference for stereotyped vs. anti-stereotyped sentences.

Sentence pairs with stereotype labelSynthetic sentence pairs with demographic attributesLikelihood gap · bias direction · stereotype score