
SimpleQA Verified

A 1,000-prompt short-form factuality benchmark from Google DeepMind and Google Research that tests LLMs’ closed-book parametric recall on niche factual QA prompts spanning people, places, dates, numbers, geography, sports, art, music, and biography.

Overview

SimpleQA Verified is a 1,000-prompt benchmark for evaluating whether language models can answer short, fact-seeking questions from their internal parameters. It builds on OpenAI's SimpleQA benchmark, but was re-curated by Google DeepMind and Google Research to reduce noisy labels, topical bias, question redundancy, and ambiguous or conflicting source evidence.

The benchmark is designed for closed-book factual recall. The model receives a short factual question and generates an answer, which is automatically graded as correct, incorrect, or not attempted. It is not designed to evaluate retrieval, search, grounding, or agentic research behavior. When external tools are enabled, the task becomes much easier and no longer isolates parametric knowledge.

Dataset Specification

Size

1,000 evaluation prompts selected from the original 4,326 SimpleQA questions. Curation steps:

  • Unique-source filtering.
  • Semantic and TF-IDF de-duplication.
  • Publisher-preference filtering.
  • Topic and answer-type balancing.
  • Source reconciliation.
  • Numeric-answer range rewriting.
  • Final filtering that retains the most difficult remaining questions to preserve benchmark headroom.

Source

Human-crafted short-form factual questions derived from SimpleQA, curated by Google DeepMind and Google Research.

Input Format

  • original_index: index linking the example back to the original SimpleQA benchmark.
  • problem: short factual question presented to the model.
  • answer: gold answer used for grading, not provided to the model at inference time.
  • topic: topical category metadata.
  • answer_type: answer category metadata, such as date, number, person, place, or other.
  • multi_step and requires_reasoning: boolean metadata flags indicating whether the question requires combining multiple pieces of information or more complex reasoning.
  • urls: supporting source URLs for the gold answer.

Example (answer and supporting URLs omitted):

{
  "original_index": 75,
  "problem": "What is the first vampire number in recreational mathematics obtained by a 3 x 3-digit multiplication?",
  "topic": "Other",
  "answer_type": "Number",
  "multi_step": false,
  "requires_reasoning": false
}
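
The split between what the model sees and what is held back for grading can be sketched as follows. This is a minimal illustration, not the benchmark's own loader: the helper name split_record and the GOLD_FIELDS constant are hypothetical, and the record is the example above with the answer and URLs still omitted.

```python
import json

# Fields holding the gold label or its evidence; they must never reach the model.
# (GOLD_FIELDS and split_record are hypothetical names used only in this sketch.)
GOLD_FIELDS = ("answer", "urls")


def split_record(record: dict) -> tuple[str, dict]:
    """Return (prompt_text, gold_fields) for one benchmark record."""
    prompt = record["problem"]  # the only text shown to the model
    gold = {key: record[key] for key in GOLD_FIELDS if key in record}
    return prompt, gold


raw = """{
  "original_index": 75,
  "problem": "What is the first vampire number in recreational mathematics obtained by a 3 x 3-digit multiplication?",
  "topic": "Other",
  "answer_type": "Number",
  "multi_step": false,
  "requires_reasoning": false
}"""

record = json.loads(raw)
prompt, gold = split_record(record)
print(prompt)
```

Keeping the gold fields in a separate object makes it harder to leak the answer or its source URLs into the prompt by accident.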

Output Format

A short free-text answer. The model may also abstain if it does not know the answer. During evaluation, the predicted answer is passed to an autorater, which maps the response to correct, incorrect, or not attempted. Responses that hedge by listing several possible answers without committing to one should be treated as not attempted.

Example response to the sample question above:

{
  "answer": "102510"
}

Metrics

  • F1 score (primary): harmonic mean of overall accuracy and accuracy given attempted. It rewards models that answer correctly when they attempt a question while penalizing excessive abstention.
  • Accuracy: fraction of all examples answered correctly.
  • Accuracy given attempted: fraction of attempted examples answered correctly.
  • Attempted and hedged rates: auxiliary rates that show how often the model commits to an answer, is rated not attempted, or gives a hedged response.
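
The relationship between these metrics can be sketched in a few lines of Python. This is an illustrative calculation based on the definitions above, not the official scorer; the label strings "correct", "incorrect", and "not_attempted" and the function name summarize_grades are assumptions made for this sketch.

```python
from collections import Counter


def summarize_grades(grades: list[str]) -> dict:
    """Compute summary metrics from per-example autorater grades.

    Each grade is assumed to be "correct", "incorrect", or "not_attempted".
    """
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    accuracy = correct / total if total else 0.0
    accuracy_given_attempted = correct / attempted if attempted else 0.0

    # F1 is the harmonic mean of the two accuracy figures.
    denominator = accuracy + accuracy_given_attempted
    f1 = 2 * accuracy * accuracy_given_attempted / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "accuracy_given_attempted": accuracy_given_attempted,
        "attempted_rate": attempted / total if total else 0.0,
        "f1": f1,
    }


# Toy input: 6 correct, 2 incorrect, 2 not attempted.
example = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
print(summarize_grades(example))
```

On the toy input, accuracy is 6/10 and accuracy given attempted is 6/8, so the harmonic mean falls between the two. A model that abstains on everything scores zero on all metrics.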

How to Run in Quantiles

Use the following command to run SimpleQA Verified in Quantiles using either the built-in demo model or your own provider-backed model:

qt run simpleqa-verified

When using a coding agent, install SKILL.md first, then copy the prompt below into the agent:

Use the Quantiles eval skill to run the full SimpleQA Verified benchmark and summarize the results.

Known Limitations

  • Designed for no-tool evaluation. Search, retrieval, browsing, or other external tools can make the benchmark near-trivial and change the measured capability.
  • Short-answer factual recall does not measure long-form factual consistency, citation quality, source synthesis, or grounded generation.
  • Public, static benchmark data can be memorized or contaminated in model training and post-training corpora.
  • Autorater scoring can still introduce measurement noise, especially in edge cases with partial answers, approximate numeric answers, alternate phrasings, and hedged responses.
  • Questions emphasize verifiable factoids and tail knowledge, which may not reflect everyday user queries or high-stakes domain workflows.
  • Supporting web sources can change, disappear, or disagree after the benchmark release, so provenance should be pinned for reproducible runs.

Versioning and Provenance

Record the exact SimpleQA Verified release, dataset source, file hash, scorer implementation, autorater model and prompt version, decoding settings, no-tool policy, and numeric-answer handling. The technical report uses gpt-4.1-2025-04-14 as the autorater for the reported results. Changing the autorater, prompt, numeric-answer handling, or abstention rules can materially affect score comparability.
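
One way to capture these settings is a small run manifest stored alongside the results. The sketch below is an assumption about how such a record could look, not a Quantiles feature: the function names file_sha256 and build_manifest and all manifest keys are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_path: Path, **settings) -> dict:
    """Combine the dataset hash with caller-supplied run settings.

    The keyword names are chosen by the caller; nothing here is a fixed schema.
    """
    manifest = {
        "dataset_file": dataset_path.name,
        "dataset_sha256": file_sha256(dataset_path),
    }
    manifest.update(settings)
    return manifest


def save_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest as sorted, indented JSON for easy diffing."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

A caller might pass settings such as autorater_model="gpt-4.1-2025-04-14", tools_enabled=False, and the decoding temperature, then compare manifests before comparing scores across runs.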

References

Haas et al., 2025. SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge.

Paper: https://arxiv.org/abs/2509.07968

Hugging Face Dataset: https://huggingface.co/datasets/google/simpleqa-verified
