Inspect Evaluation Results

Quantiles is a local-first, offline system: it runs benchmarks and evals on your computer and, by default, keeps all execution, metadata, metrics, and analysis data there. Every benchmark or eval run is stored as an inspectable record that includes:

  • Run metadata, including ID, eval name, status, timestamps, and error state
  • Workflow input passed through qt run --input
  • Workflow output returned by the workflow handler
  • Numeric metrics emitted with emit
  • Step records, including step key, status, input hash, and timestamps
  • Lifecycle events for runs and steps

Step outputs are captured for durability, run resumability, and comparisons. Put any values you need to inspect directly in the workflow output or in metrics.

Inspect One Evaluation Run

To inspect one evaluation, first find its run ID, then use qt show to view the recorded result.

List recorded evaluation runs

Use qt list to see recorded runs with their IDs, eval names, status, sample count, creation time, and duration:

qt list

Example output:

ID  EVAL            STATUS     SAMPLES  CREATED                     DURATION
2   support-triage  completed  5000     2026-7-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813      2026-7-01T18:15:00.000000Z  2.000s
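When a script needs to pick run IDs out of this listing, the table can be split into columns. The following is a minimal Python sketch, assuming the output keeps the layout of the example above: one run per line, six whitespace-separated columns, and no spaces inside a column. The names RunRow, parse_qt_list, and failed_run_ids are illustrative and not part of Quantiles.

```python
from typing import NamedTuple


class RunRow(NamedTuple):
    run_id: str
    eval_name: str
    status: str
    samples: int
    created: str
    duration: str


def parse_qt_list(text: str) -> list[RunRow]:
    """Parse the whitespace-separated table printed by `qt list`.

    Assumption: one run per line, six columns, no spaces inside a column.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header row and anything that does not have six columns.
        if len(parts) != 6 or parts[0] == "ID":
            continue
        run_id, eval_name, status, samples, created, duration = parts
        rows.append(RunRow(run_id, eval_name, status, int(samples), created, duration))
    return rows


def failed_run_ids(rows: list[RunRow]) -> list[str]:
    """Return the IDs of runs whose status is 'failed'."""
    return [row.run_id for row in rows if row.status == "failed"]


# Example output copied from this page.
SAMPLE_LIST_OUTPUT = """\
ID  EVAL            STATUS     SAMPLES  CREATED                     DURATION
2   support-triage  completed  5000     2026-7-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813      2026-7-01T18:15:00.000000Z  2.000s
"""

rows = parse_qt_list(SAMPLE_LIST_OUTPUT)
print(failed_run_ids(rows))
```

If the real listing contains values with embedded spaces, this column split would need to be replaced with something stricter.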

View high-level run information

Use qt show with a run ID to inspect a single run:

qt show "$RUN_ID"

This displays a top section that summarizes high-level run information, including status and any errors. The metrics table reports the aggregate metrics emitted during the run.

Run 2
eval: support-triage
status: completed
created: 2026-7-02T18:30:00.000000Z
duration: 5.000s
input: {"promptVersion":"B","model":"openai:gpt-5.5"}
output: {"accuracy":0.92,"correct":4600,"total":5000}
error: -

Metrics
NAME           VALUE  UNIT
accuracy       0.92   -
correct_count  4600   -
total_count    5000   -

View sample-level inputs, outputs, and metadata

Use --json when you need structured output with detailed run information, including sample-level results. JSON output is recommended for agents and scripts that need to parse evaluation results.

qt show "$RUN_ID" --json

JSON output includes run details, metrics, and sample-level step results in a machine-readable format. Sample objects include fields like:

{
  "samples": [
    {
      "step_key": "case:double-charge",
      "status": "completed",
      "input_hash": "ebc8f717a2ae7bf6",
      "started_at": "2026-7-02T18:30:01.000000Z",
      "finished_at": "2026-7-02T18:30:02.000000Z",
      "metrics": {
        "latency_ms": 1000
      }
    }
  ]
}

Use this sample-level step data to identify which steps completed, failed, or were left incomplete. The input hash helps verify that a resumed run reached the same step with the same input.
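Both checks can be scripted against the JSON output. The following is a minimal Python sketch, assuming the JSON has a top-level "samples" array with the fields shown above and that a finished step reports the status "completed". The function names are illustrative, and the second sample in the demo is hypothetical data added only to show a non-completed step.

```python
import json
import subprocess


def load_run(run_id: str) -> dict:
    """Run `qt show <run_id> --json` and parse its standard output."""
    result = subprocess.run(
        ["qt", "show", run_id, "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def steps_not_completed(run: dict) -> list[str]:
    """Return step keys whose status is anything other than 'completed'."""
    return [
        sample["step_key"]
        for sample in run.get("samples", [])
        if sample.get("status") != "completed"
    ]


def changed_input_hashes(before: dict, after: dict) -> list[str]:
    """Return step keys present in both snapshots whose input_hash differs."""
    before_hashes = {
        sample["step_key"]: sample.get("input_hash")
        for sample in before.get("samples", [])
    }
    return [
        sample["step_key"]
        for sample in after.get("samples", [])
        if sample["step_key"] in before_hashes
        and before_hashes[sample["step_key"]] != sample.get("input_hash")
    ]


# Sample object copied from this page (timestamps and metrics omitted).
SAMPLE_JSON = """
{"samples": [{"step_key": "case:double-charge",
              "status": "completed",
              "input_hash": "ebc8f717a2ae7bf6"}]}
"""

before = json.loads(SAMPLE_JSON)
after = json.loads(SAMPLE_JSON)
# Hypothetical extra step, not from the docs, to show a non-completed status.
after["samples"].append(
    {"step_key": "case:example-failed", "status": "failed", "input_hash": "0000000000000000"}
)

print(steps_not_completed(after))
print(changed_input_hashes(before, after))
```

In practice, before and after would come from calling load_run with the same run ID before and after qt resume. An empty list from changed_input_hashes suggests every shared step saw the same input.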

Compare Evaluation Results

Compare evaluation results between two runs with the following command:

qt compare "$RUN_ID_A" "$RUN_ID_B"

The comparison output includes a run summary for each run, including status and model configuration, followed by a metrics table that reports aggregate metrics and deltas.

Comparing runs 2 and 3

                     Run 2              Run 3              Delta
Eval                 simpleqa-verified  simpleqa-verified  SAME
Status               COMPLETE           COMPLETE
Duration             23.201s            18.720s            -4.481s
Model                openai:gpt-5.5     openai:gpt-5.5     SAME
max_similarity       0.7409             0.7618             +0.0209
mean_similarity      0.5634             0.5647             +0.0013
median_similarity    0.56               0.56               -0.000041
min_similarity       0.4501             0.4341             -0.0161
p95_similarity       0.6387             0.6397             +0.001
p99_similarity       0.6779             0.6798             +0.0018
stdev_similarity     0.0409             0.0407             -0.000228
variance_similarity  0.0017             0.0017             -0.000019
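To make the Delta column concrete, the following is a minimal Python sketch that subtracts the first run's metric from the second run's metric, using three rows from the table above. It illustrates the arithmetic only and is not how qt compare is implemented; metric_deltas is an illustrative name.

```python
def metric_deltas(run_a: dict[str, float], run_b: dict[str, float]) -> dict[str, float]:
    """Return run_b - run_a for every metric present in both runs."""
    return {name: run_b[name] - run_a[name] for name in run_a if name in run_b}


# Rounded values copied from the comparison table on this page.
run_2 = {"max_similarity": 0.7409, "mean_similarity": 0.5634, "min_similarity": 0.4501}
run_3 = {"max_similarity": 0.7618, "mean_similarity": 0.5647, "min_similarity": 0.4341}

for name, delta in metric_deltas(run_2, run_3).items():
    # A positive delta means the second run scored higher on that metric.
    print(f"{name}: {delta:+.4f}")
```

Because the table displays rounded values, deltas recomputed from them can differ from the Delta column in the last digit. For example, min_similarity comes out as -0.0160 here rather than -0.0161.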

Use --json when you need structured output with detailed run information, including sample-level results. JSON output is recommended for agents and scripts that need to parse evaluation results.

qt compare "$RUN_ID_A" "$RUN_ID_B" --json

See Compare Evals for more information.

Resume Interrupted Evaluation Runs

Evaluation and benchmark workflows can be resumed after interruptions such as rate limits or timeouts. If a benchmark is interrupted after completing some steps, you can resume the run from its last successful step using the run_id:

qt resume "$RUN_ID"

In the following example, run_id 1 failed to complete:

ID  EVAL            STATUS     SAMPLES  CREATED                     DURATION
2   support-triage  completed  5000     2026-7-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813      2026-7-01T18:15:00.000000Z  2.000s

When the failed run is inspected with qt show, no aggregate metrics are displayed:

qt show 1

Run 1
eval: support-triage
status: failed
created: 2026-7-01T18:15:00.000000Z
duration: -
input: {"model":"openai:gpt-5.5","num_samples":813}
output: -
error: -

Aggregated Metrics
No metrics found.

To resume and complete the failed evaluation, use the following command:

qt resume 1

Resume a run only when you want to continue with the same evaluation configuration as the original run. Start a new run when you intentionally change the model, prompt, dataset, rubric, workflow input, or step input, so that results are measured against the correct evaluation configuration. See Resume Runs for the full resume workflow.
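One part of this decision can be checked mechanically: whether the workflow input you plan to use matches the input recorded for the original run. The following is a minimal Python sketch, assuming you have copied the recorded input JSON from qt show. The function names and the alternative model string are hypothetical. The check covers workflow input only, so changes to a prompt, dataset, or rubric that are not reflected in that input go undetected.

```python
import json

_MISSING = object()  # Marks a key that is absent from one of the inputs.


def changed_keys(original_input: str, planned_input: str) -> list[str]:
    """Return the sorted top-level keys whose values differ between two JSON inputs."""
    original = json.loads(original_input)
    planned = json.loads(planned_input)
    return sorted(
        key
        for key in original.keys() | planned.keys()
        if original.get(key, _MISSING) != planned.get(key, _MISSING)
    )


def resume_or_new_run(original_input: str, planned_input: str) -> str:
    """Suggest 'resume' when the inputs match and 'new run' when they differ."""
    return "new run" if changed_keys(original_input, planned_input) else "resume"


# Input recorded for run 1 in the example on this page.
recorded = '{"model":"openai:gpt-5.5","num_samples":813}'

# Same configuration with keys in a different order.
same = '{"num_samples":813,"model":"openai:gpt-5.5"}'

# Hypothetical changed configuration; the model name is a placeholder.
different = '{"model":"example:other-model","num_samples":813}'

print(resume_or_new_run(recorded, same))
print(changed_keys(recorded, different), resume_or_new_run(recorded, different))
```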

Using Coding Agents to Analyze Evaluation Results

With the Quantiles agent skill, coding agents can run repository-based evaluations with the qt CLI. The skill supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.

Install the skill

Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:

Please install the Quantiles skill at github.com/quantiles-evals/skill

Alternatively, copy SKILL.md into your agent’s skill directory.

Prompt the coding agent

After installation, use your coding agent to run and inspect an evaluation or to compare results across two runs. Customize the prompt template below to fit your needs:

Run $BENCHMARK, then summarize the aggregate metrics, sample-level results, failures, and any notable errors. Identify the highest-impact issues to review first, and recommend specific next steps to improve quality and reliability.

See Agents Overview for more detail on using agents with Quantiles.
