Evaluation Quickstart
Quantiles lets teams run benchmarks and evaluations from the command line, inspect sample-level outputs, compare runs against baselines, and define custom evaluations with minimal setup.
Install the CLI
To install the Quantiles CLI, run the following command:
curl -fsSL https://cli.quantiles.io/install.sh | bash
Because Quantiles is local by default, your project directory holds the database for run history, metadata, samples, and metrics, all under .quantiles/.
Run a sample benchmark
Run a simpleqa-verified sample benchmark with the built-in demo model to validate the installation and inspect how Quantiles runs evaluations and records inputs, outputs, and results.
qt run simpleqa-verified
The SimpleQA-verified benchmark uses 1,000 short-form factuality prompts to test parametric knowledge. See the arXiv paper for details.
The above command uses a demo model that generates random text. It makes no external model API calls and incurs no usage charges, so runs using the demo model are only useful for validating the workflow. See Built-in Benchmarks for more details.
Inspect the run
The first Quantiles run in a new workspace will have a run_id of 1. Use the run_id to inspect the complete run record, including workflow inputs, outputs, sample-level results, metrics, and execution metadata, in JSON format.
qt show 1 --json
You can omit the --json flag to see a human-readable summary of the run.
Use qt list to view run history and find a run_id to analyze.
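If you want to post-process a run record in a script, one option is to capture the JSON printed by qt show <run_id> --json and parse it. The sketch below is an illustration only and is not part of Quantiles: the helper names top_level_keys and show_run are hypothetical, and because this page does not describe the JSON schema, the code assumes only that the record is a JSON object and lists whatever top-level keys it finds.

```python
import json
import subprocess


def top_level_keys(run_json: str) -> list[str]:
    """Return the sorted top-level keys of a run record given as JSON text."""
    record = json.loads(run_json)
    # Assumption: the run record is a JSON object; its fields are not documented here.
    if not isinstance(record, dict):
        raise ValueError("expected the run record to be a JSON object")
    return sorted(record)


def show_run(run_id: int) -> str:
    """Hypothetical wrapper: run `qt show <run_id> --json` and return its stdout."""
    result = subprocess.run(
        ["qt", "show", str(run_id), "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if qt exits with a non-zero status
    )
    return result.stdout
```

With the qt CLI on your PATH and at least one completed run, top_level_keys(show_run(1)) returns the field names present in the first run's record, which is a quick way to see what is available before writing further analysis code.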
Run a Benchmark Against Your Own Model
To evaluate your own model against a built-in benchmark, override the benchmark defaults with a local configuration file. Create either quantiles.toml or .quantiles.toml in the current working directory and use it to set options such as the model, sample limit, and other run configuration.
Configure the model
The following config block customizes how qt run executes the built-in simpleqa-verified benchmark:
[benchmarks.simpleqa-verified]
# Limit to the first 10 samples of the benchmark
samples = 10
# Test against OpenAI's GPT 5.4-nano model
model = "openai:gpt-5.4-nano"
Paste this block into a quantiles.toml file in your repository, then set the OPENAI_API_KEY environment variable to your OpenAI API key.
See the configuration documentation for details on customizing benchmarks.
Run the benchmark
The qt run simpleqa-verified command now reads this configuration and runs simpleqa-verified on only the first 10 samples, using the OpenAI model.
The config sets samples to 10 so that you can verify the benchmark runs end to end with your selected model before committing time or inference cost to a full evaluation.
qt run simpleqa-verified --json
You can pass the --input '{"samples": <number_of_samples>}' flag to qt run to override the configured sample count.
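When a script launches runs, hand-writing the JSON for --input invites quoting mistakes. The sketch below builds the argument list with json.dumps instead. It is illustrative only: build_run_command is a hypothetical helper, the argument order follows the Useful Commands table, and the JSON key is a parameter because this page shows both "samples" and "limit" in --input examples, so check the CLI Reference for the key your version expects.

```python
import json
import shlex


def build_run_command(benchmark: str, samples: int, key: str = "samples") -> list[str]:
    """Build the argv for `qt run` with an --input override for the sample count."""
    if samples < 1:
        raise ValueError("samples must be a positive integer")
    # json.dumps produces valid JSON, so no manual escaping of quotes is needed.
    payload = json.dumps({key: samples})
    return ["qt", "run", "--input", payload, benchmark]


argv = build_run_command("simpleqa-verified", 5)
# shlex.join shows the command as you would type it in a POSIX shell.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) runs the command without going through a shell, so the JSON payload reaches qt exactly as built.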
Create a Fully Custom Benchmark
A custom evaluation is a Python program that is run by the qt CLI and uses the Quantiles API to execute an eval. Use custom evaluations when you need to measure behavior that is specific to your product, workflow, prompt, dataset, rubric, or release process.
See Custom Evaluations for a complete walkthrough.
Compare Eval Runs
Use the qt CLI to compare and analyze the differences between two eval runs:
qt compare "$RUN_ID_A" "$RUN_ID_B"
See Compare Evals for full documentation on run comparison.
Using Coding Agents to Run Evaluation Workflows
For an agent-assisted workflow, install the Quantiles skill and use your preferred coding agent to run evaluations. The skill works with agents such as Codex, Claude, and others.
- Agent Quickstart: Use coding agents to run, inspect, compare, and summarize evals
- Agent Overview: Learn how agents fit into the Quantiles evaluation workflow
Useful Commands
| Command | Description | Example |
|---|---|---|
| qt run <benchmark> | Run the given benchmark | qt run simpleqa-verified |
| qt run --input '{"limit":<count>}' <benchmark> | Specify the number of samples to run | qt run --input '{"limit":5}' simpleqa-verified |
| qt list | Show all evaluation and benchmark runs | qt list |
| qt show "$RUN_ID" --json | Show details of a given evaluation or benchmark run as JSON | qt show 1 --json |
| qt compare "$RUN_A_ID" "$RUN_B_ID" | Compare two evaluation or benchmark runs | qt compare 1 2 |
| qt resume "$RUN_ID" | Resume an interrupted evaluation or benchmark run | qt resume 1 |
See the CLI Reference for full command details and output examples.