Inspect Evaluation Results
Quantiles is a local-first, offline system: it runs benchmarks and evals on your computer and, by default, keeps all execution data, metadata, metrics, and analysis data there. Every benchmark or eval run is stored as an inspectable record that includes:
- Run metadata, including ID, eval name, status, timestamps, and error state
- Workflow input passed through qt run --input
- Workflow output returned by the workflow handler
- Numeric metrics emitted with emit
- Step records, including step key, status, input hash, and timestamps
- Lifecycle events for runs and steps
Step outputs are captured for durability, run resumability, and comparisons. If you need to inspect a value directly, put it in the workflow output or emit it as a metric.
Inspect One Evaluation Run
To inspect one evaluation, first find its run ID, then use qt show to view the recorded result.
List recorded evaluation runs
Use qt list to see recorded runs with their IDs, eval names, status, sample count, creation time, and duration:
qt list
Example output:
ID  EVAL            STATUS     SAMPLES  CREATED                     DURATION
2   support-triage  completed  5000     2026-7-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813      2026-7-01T18:15:00.000000Z  2.000s
View high-level run information
Use qt show to inspect a single run by its run_id:
qt show "$RUN_ID"
The output begins with a summary of high-level run information, including status and any errors. The metrics table that follows reports the aggregate metrics emitted during the run.
Run 2
eval: support-triage
status: completed
created: 2026-7-02T18:30:00.000000Z
duration: 5.000s
input: {"promptVersion":"B","model":"openai:gpt-5.5"}
output: {"accuracy":0.92,"correct":4600,"total":5000}
error: -
Metrics
NAME           VALUE  UNIT
accuracy       0.92   -
correct_count  4600   -
total_count    5000   -
View sample-level inputs, outputs, and metadata
Use --json when you need structured output with detailed run information, including sample-level results. JSON output is recommended for agents and scripts that need to parse evaluation results.
qt show "$RUN_ID" --json
JSON output includes run details, metrics, and sample-level step results in a machine-readable format. Sample objects include fields like:
{
  "samples": [
    {
      "step_key": "case:double-charge",
      "status": "completed",
      "input_hash": "ebc8f717a2ae7bf6",
      "started_at": "2026-7-02T18:30:01.000000Z",
      "finished_at": "2026-7-02T18:30:02.000000Z",
      "metrics": { "latency_ms": 1000 }
    }
  ]
}
Use this sample-level step data to identify which steps completed, failed, or were left incomplete. The input hash helps verify that a resumed run reached the same step with the same input.
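The triage described above can be scripted. The following is a minimal Python sketch, assuming the JSON printed by qt show "$RUN_ID" --json has a top-level samples array whose objects carry the step_key and status fields shown in the sample. The function name group_steps_by_status is hypothetical and not part of Quantiles, and status values are reported exactly as they appear in the JSON.

```python
from __future__ import annotations

import json
from collections import defaultdict


def group_steps_by_status(show_json: str) -> dict[str, list[str]]:
    """Group step keys by status from the text of `qt show "$RUN_ID" --json`.

    Assumes a top-level "samples" array of objects with "step_key" and
    "status" fields, as in the sample above. Other fields are ignored.
    """
    run = json.loads(show_json)
    by_status: dict[str, list[str]] = defaultdict(list)
    for sample in run.get("samples", []):
        # Samples without a recorded status are grouped under "unknown".
        by_status[sample.get("status", "unknown")].append(sample["step_key"])
    return dict(by_status)
```

To use it, capture the command's output as a string (for example, by redirecting it to a file and reading that file) and pass it to the function. Any step key listed under a status other than completed is a candidate for closer review.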
Compare Evaluation Results
Compare evaluation results between two runs with the following command:
qt compare "$RUN_ID_A" "$RUN_ID_B"
The comparison output includes a run summary for each run, including status and model configuration, followed by a metrics table that reports aggregate metrics and deltas.
Comparing runs 2 and 3
                     Run 2              Run 3              Delta
Eval                 simpleqa-verified  simpleqa-verified  SAME
Status               COMPLETE           COMPLETE
Duration             23.201s            18.720s            -4.481s
Model                openai:gpt-5.5     openai:gpt-5.5     SAME
max_similarity       0.7409             0.7618             +0.0209
mean_similarity      0.5634             0.5647             +0.0013
median_similarity    0.56               0.56               -0.000041
min_similarity       0.4501             0.4341             -0.0161
p95_similarity       0.6387             0.6397             +0.001
p99_similarity       0.6779             0.6798             +0.0018
stdev_similarity     0.0409             0.0407             -0.000228
variance_similarity  0.0017             0.0017             -0.000019
Use --json when you need structured output with detailed run information, including sample-level results. JSON output is recommended for agents and scripts that need to parse evaluation results.
qt compare "$RUN_ID_A" "$RUN_ID_B" --json
See Compare Evals for more information.
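In the comparison table, each delta appears to be the second run's value minus the first run's. The following minimal Python sketch reproduces that arithmetic, assuming you have already extracted each run's aggregate metrics into a name-to-value dict. The function name metric_deltas and the input shape are hypothetical, and nothing here assumes the structure of the compare --json output.

```python
from __future__ import annotations


def metric_deltas(run_a: dict[str, float], run_b: dict[str, float]) -> dict[str, float]:
    """Return run_b - run_a for every metric present in both runs."""
    return {name: run_b[name] - run_a[name] for name in run_a if name in run_b}


# Values copied from the comparison of runs 2 and 3 above.
run_2 = {"max_similarity": 0.7409, "mean_similarity": 0.5634}
run_3 = {"max_similarity": 0.7618, "mean_similarity": 0.5647}

for name, delta in metric_deltas(run_2, run_3).items():
    # Print with an explicit sign, rounded to four decimal places.
    print(f"{name}: {delta:+.4f}")
```

Metrics that exist in only one run are skipped rather than reported with a misleading delta. Small differences from the table, such as in min_similarity, are expected because the table's run values are themselves rounded for display.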
Resume Interrupted Evaluation Runs
Evaluation and benchmark workflows can be resumed after interruptions such as rate limits or timeouts. If a benchmark is interrupted after completing some steps, you can resume the run from its last successful step using the run_id:
qt resume "$RUN_ID"
In the following example, run_id 1 failed to complete:
ID  EVAL            STATUS     SAMPLES  CREATED                     DURATION
2   support-triage  completed  5000     2026-7-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813      2026-7-01T18:15:00.000000Z  2.000s
When the failed run is inspected with qt show, no aggregate metrics are displayed:
qt show 1
Run 1
eval: support-triage
status: failed
created: 2026-7-01T18:15:00.000000Z
duration: -
input: {"model":"openai:gpt-5.5","num_samples":813}
output: -
error: -
Aggregated Metrics
No metrics found.
To resume and complete the failed evaluation, use the following command:
qt resume 1
Resume a run only when you want to continue with the same evaluation configuration as the original run. Start a new run when you intentionally change the model, prompt, dataset, rubric, workflow input, or step input, so that results are measured against the correct evaluation configuration. See Resume Runs for the full resume workflow.
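One way to check that a resumed run reached the same steps with the same inputs is to compare input hashes across two JSON snapshots of the run. The sketch below is a minimal Python example, assuming you saved the output of qt show "$RUN_ID" --json before and after resuming, and that both snapshots contain a samples array with the step_key and input_hash fields shown earlier on this page. The function name changed_input_hashes is hypothetical and not part of Quantiles.

```python
from __future__ import annotations

import json


def changed_input_hashes(before_json: str, after_json: str) -> list[str]:
    """Return step keys whose input_hash differs between two snapshots.

    Only steps present in both snapshots are compared; steps that first
    appear after the resume have no earlier hash to compare against.
    """
    def hashes(text: str) -> dict[str, str]:
        samples = json.loads(text).get("samples", [])
        return {s["step_key"]: s["input_hash"] for s in samples}

    before, after = hashes(before_json), hashes(after_json)
    return sorted(k for k in before if k in after and before[k] != after[k])
```

An empty result means every step recorded before the interruption kept the same input hash after the resume. A non-empty result suggests the step input changed, in which case starting a new run may be the better choice.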
Using Coding Agents to Analyze Evaluation Results
The Quantiles agent skill lets coding agents run repository-based evaluations with the qt CLI. The skill supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.
Install the skill
Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:
Please install the Quantiles skill at github.com/quantiles-evals/skill
Alternatively, copy SKILL.md into your agent’s skill directory.
Prompt the coding agent
After installation, use your coding agent to run and inspect an evaluation or compare results across two runs. Customize the prompt template below to fit your needs:
Run $BENCHMARK, then summarize the aggregate metrics, sample-level results, failures, and any notable errors. Identify the highest-impact issues to review first, and recommend specific next steps to improve quality and reliability.
See Agents Overview for more detail on using agents with Quantiles.