Evaluation Quickstart

Quantiles lets teams run benchmarks and evaluations from the command line, inspect sample-level outputs, compare runs against baselines, and define custom evaluations with minimal setup.

Install the CLI

To install the Quantiles CLI, run the following command:

curl -fsSL https://cli.quantiles.io/install.sh | bash

Because Quantiles is local by default, your project directory holds the database for run history, metadata, samples, and metrics, all under .quantiles/.
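Because run history lives under .quantiles/ in the project directory, a script can check whether a directory already holds a local workspace before it tries to list or compare runs. The sketch below is only an illustration: the function name has_quantiles_workspace is hypothetical and not part of the qt CLI, and it checks only that the directory exists, not what it contains.

```shell
# Hypothetical helper (not part of the qt CLI): succeeds if the given
# directory (default: the current directory) contains a .quantiles/ folder.
has_quantiles_workspace() {
  [ -d "${1:-.}/.quantiles" ]
}

# Example usage:
#   has_quantiles_workspace . && qt list
```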

Run a sample benchmark

Run a simpleqa-verified sample benchmark with the built-in demo model to validate the installation and inspect how Quantiles runs evaluations and records inputs, outputs, and results.

qt run simpleqa-verified

The SimpleQA-verified benchmark consists of 1,000 short-form factuality prompts that test a model's parametric knowledge. See the arXiv paper for details.

The above command uses a demo model that generates random text. It makes no external model API calls and incurs no usage charges, so runs with the demo model are useful only for validating the workflow. See Built-in Benchmarks for more details.

Inspect the run

The first Quantiles run in a new workspace will have a run_id of 1. Use the run_id to inspect the complete run record, including workflow inputs, outputs, sample-level results, metrics, and execution metadata, in JSON format.

qt show 1 --json

You can omit the --json flag to see a human-readable summary of the run.

Use qt list to view run history and find a run_id to analyze.

Run a Benchmark Against Your Own Model

To evaluate your own model against a built-in benchmark, override the benchmark defaults with a local configuration file. Create either quantiles.toml or .quantiles.toml in the current working directory and use it to set options such as the model, sample limit, and other run configuration.
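Since either file name is accepted, it can help to confirm which one is actually present in the working directory before a run. The following is a minimal sketch, not Quantiles behavior: the function name find_quantiles_config is hypothetical, and the order in which the two names are checked is an assumption, because the text above does not say which file takes precedence when both exist.

```shell
# Hypothetical helper: print the first Quantiles config file found in the
# current working directory. The lookup order is an assumption; it is not
# documented which file wins when both are present.
find_quantiles_config() {
  for candidate in quantiles.toml .quantiles.toml; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no quantiles.toml or .quantiles.toml in $(pwd)" >&2
  return 1
}

# Example usage:
#   find_quantiles_config && qt run simpleqa-verified
```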

Configure the model

The config block below customizes how qt run executes the built-in simpleqa-verified benchmark:

[benchmarks.simpleqa-verified]
# Limit to the first 10 samples of the benchmark
samples = 10
# Test against OpenAI's GPT 5.4-nano model
model = "openai:gpt-5.4-nano"

Paste this block into a quantiles.toml file in your repository, then put your OpenAI API key into the OPENAI_API_KEY environment variable.
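A run against a hosted model cannot succeed without the key, so a small pre-flight check can save a failed start. This is a sketch under stated assumptions: require_openai_key is a hypothetical helper, not a qt feature, and it verifies only that the variable is non-empty, not that the key is valid.

```shell
# Hypothetical pre-flight check: fail early if OPENAI_API_KEY is unset or empty.
require_openai_key() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set; export it before running qt" >&2
    return 1
  fi
}

# Example usage (replace the placeholder with your own key):
#   export OPENAI_API_KEY="<your-api-key>"
#   require_openai_key && qt run simpleqa-verified
```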

See the configuration documentation for details on customizing benchmarks.

Run the benchmark

The qt run simpleqa-verified command will read and use this new configuration to run simpleqa-verified with only the first 10 samples, using the OpenAI model.

The above config sets samples to 10 to verify the benchmark runs end to end with your selected model before committing time or inference cost to a full evaluation.

qt run simpleqa-verified --json

You can pass the --input '{"samples": <number_of_samples>}' flag to qt run to override the configured sample count.
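When the sample count comes from a script variable, building the JSON by hand invites quoting mistakes. The sketch below shows one way to generate the --input payload; samples_input is a hypothetical helper, and it assumes the "samples" key shown above is the one your qt version expects.

```shell
# Hypothetical helper: print the JSON payload for --input, given a sample count.
# Rejects empty values, non-digits, and leading zeros (which are invalid JSON).
samples_input() {
  case "$1" in
    ''|*[!0-9]*|0?*)
      echo "sample count must be a whole number, got: '$1'" >&2
      return 1
      ;;
  esac
  printf '{"samples": %s}\n' "$1"
}

# Example usage, with --input placed before the benchmark name:
#   qt run --input "$(samples_input 5)" simpleqa-verified
```

Quoting the command substitution keeps the JSON as a single argument even though it contains a space.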

Create a Fully Custom Benchmark

A custom evaluation is a Python program that is run by the qt CLI and uses the Quantiles API to execute an eval. Use custom evaluations when you need to measure behavior that is specific to your product, workflow, prompt, dataset, rubric, or release process.

See Custom Evaluations for a complete walkthrough.

Compare Eval Runs

Use the qt CLI to compare and analyze the differences between two eval runs:

qt compare "$RUN_ID_A" "$RUN_ID_B"

See Compare Evals for full documentation on run comparison.

Using Coding Agents to Run Evaluation Workflows

For an agent-assisted workflow, install the Quantiles skill and use your preferred coding agent to run evaluations. The skill works with agents such as Codex, Claude, and others.

Useful Commands

| Command | Description | Example |
| --- | --- | --- |
| qt run <benchmark> | Run <benchmark> with the given input and optional args | qt run simpleqa-verified |
| qt run --input '{"limit":<count>}' <benchmark> | Specify the number of samples to run | qt run --input '{"limit":5}' simpleqa-verified |
| qt list | Show all evaluation and benchmark runs | qt list |
| qt show "$RUN_ID" --json | Show details of a given evaluation or benchmark run | qt show 1 |
| qt compare "$RUN_A_ID" "$RUN_B_ID" | Compare two evaluation or benchmark runs | qt compare 1 2 |
| qt resume "$RUN_ID" | Resume an interrupted evaluation or benchmark run | qt resume 1 |

See the CLI Reference for full command details and output examples.
