Built-in Benchmarks

Built-in benchmarks are ready-to-run evals with predefined datasets, scoring methodologies, and metrics. Use them when you want a standardized evaluation that provides a common reference point, a repeatable baseline, or a well-defined implementation of an industry benchmark.

By default, built-in benchmarks run against the demo model. This lets you validate the installation and inspect the evaluation workflow, execution steps, recorded inputs and outputs, scoring behavior, and reported metrics without calling an external model API or incurring usage charges.

Use the following command pattern to run a built-in benchmark:

qt run "$BENCHMARK"

All built-in benchmarks can be customized with a Configuration file.

Available Built-in Benchmarks

Quantiles includes built-in benchmarks to make common evaluations easier to run. These benchmarks are open source and commonly used across industry evaluation workflows.

| Code | About | Details |
| --- | --- | --- |
| qt run simpleqa-verified | 1,000 short-form factuality prompts for testing parametric knowledge | SimpleQA Verified |
| qt run pubmedqa | 1,000 expert-labeled biomedical yes/no/maybe QA instances | PubMedQA |

When given a built-in benchmark, the qt run command does the following:

  1. Selects the benchmark dataset and loads its examples into the local run.
  2. Executes the benchmark with the configured sampler.
  3. Scores each example with the benchmark’s evaluation metric.
  4. Saves per-example inputs, outputs, and step records in the run history.
  5. Emits aggregate metrics that you can inspect with qt show and later compare against other runs with qt compare.

Limit the number of samples

To evaluate a subset of benchmark samples, set a samples key in the configuration file or pass a samples key in the --input data on the command line:

qt run "$BENCHMARK" --input '{"samples":10}'
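When the sample count comes from a variable (for example, in a CI script), hand-quoting the JSON is easy to get wrong. The sketch below shows one way to build the payload. Note that samples_input is a hypothetical helper, not part of the qt CLI, and it assumes the samples key used in the command above.

```shell
# Hypothetical helper (not part of the qt CLI): build the JSON payload
# for --input from a sample count, assuming the "samples" key shown above.
samples_input() {
  case "$1" in
    # Reject empty values, non-digits, and leading zeros (including 0),
    # so the output is always valid JSON holding a positive integer.
    ''|*[!0-9]*|0*)
      echo "sample count must be a positive integer" >&2
      return 1
      ;;
  esac
  printf '{"samples":%s}\n' "$1"
}

# Possible usage with the documented flag:
#   qt run "$BENCHMARK" --input "$(samples_input 10)"
```

Validating the count before calling qt means that a bad value fails fast in the script rather than partway into a run.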

Inspect the run

Inspect the full run record, including inputs, outputs, sample-level results, and metadata:

qt show "$RUN_ID"

Use qt list to view run history and find a run_id to pass to qt show.

Run a Benchmark Against Your Own Model

To evaluate your own model against a built-in benchmark, override the benchmark defaults with a local configuration file. Create either quantiles.toml or .quantiles.toml in the current working directory and use it to set options such as the model, sample limit, and other run configuration.

See configuration documentation for configuration details.

Run a smoke test

We recommend starting with a small sample count (e.g., 10) to verify the benchmark runs end to end with your selected model before committing time or inference cost to a full evaluation.

qt run pubmedqa --input '{"limit":10}'

To run the full benchmark, use qt run without input overrides, as long as no sample limit is set in your configuration file.

qt run pubmedqa
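The smoke test and the full run can be chained so that the full benchmark starts only if the small run succeeds. This is a minimal sketch, not a built-in feature: smoke_then_full is a hypothetical wrapper, and it assumes that qt exits with a non-zero status when a run fails.

```shell
# Hypothetical wrapper (not part of the qt CLI): run a 10-sample smoke
# test, then the full benchmark only if the smoke test succeeded.
# Assumption: qt exits with a non-zero status when a run fails.
smoke_then_full() {
  # $1 is the benchmark name, e.g. pubmedqa.
  if ! qt run "$1" --input '{"limit":10}'; then
    echo "smoke test failed for $1; skipping full run" >&2
    return 1
  fi
  # No --input override here, so the full benchmark runs unless the
  # configuration file sets its own sample limit.
  qt run "$1"
}

# Possible usage:
#   smoke_then_full pubmedqa
```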

Compare Benchmark Runs

Use the qt CLI to compare and analyze the differences between two benchmark runs:

qt compare "$RUN_ID_A" "$RUN_ID_B"

See Compare Evals for full documentation on run comparison.

Resume Interrupted Benchmark Runs

Quantiles evaluation and benchmark workflows are designed to recover from interruptions (e.g., rate limits, timeouts, interrupted processes). If a benchmark is interrupted after completing some steps, you can resume the run using the run_id:

qt resume "$RUN_ID"
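For unattended runs, the resume command can be wrapped in a bounded retry loop. The following is only a sketch: resume_with_retries is a hypothetical helper, the attempt count and delay are illustrative defaults, and it assumes qt resume exits with a non-zero status when the run is interrupted again.

```shell
# Hypothetical retry loop around qt resume (not a built-in qt feature).
# Assumption: qt resume exits non-zero if the run is interrupted again.
resume_with_retries() {
  run_id="$1"
  max_attempts="${2:-3}"    # illustrative default
  delay_seconds="${3:-30}"  # illustrative default
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if qt resume "$run_id"; then
      return 0
    fi
    echo "resume attempt $attempt of $max_attempts failed" >&2
    # Wait before retrying so rate limits have time to clear,
    # but do not sleep after the final attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Possible usage:
#   resume_with_retries "$RUN_ID" 5 60
```

Bounding the number of attempts keeps a persistently failing run from looping forever.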

See Resume Runs for the full recovery workflow.

Useful Commands

| Command | Description | Example |
| --- | --- | --- |
| qt run <benchmark> | Run the given built-in benchmark with optional input arguments | qt run simpleqa-verified |
| qt run --input '{"limit":<count>}' <benchmark> | Specify the number of samples to run | --input '{"limit":5}' |
| qt list | Show all evaluation and benchmark runs | qt list |
| qt show "$RUN_ID" --json | Show details of a given evaluation or benchmark run | qt show 1 |
| qt compare "$RUN_A_ID" "$RUN_B_ID" | Compare two evaluation or benchmark runs | qt compare 1 2 |
| qt resume "$RUN_ID" | Resume an interrupted evaluation or benchmark run | qt resume 1 |

See the CLI Reference for full command details and output examples.

Request a Built-in Benchmark

If there is an open-source benchmark you would like to see added as a built-in benchmark, file an issue in the Quantiles GitHub repo.

Helpful requests include the benchmark name, source dataset or repository, license, and any reference implementation.
