Results

Inspect Evaluation Results

Quantiles is a local-first, offline system: it runs benchmarks and evals on your computer and, by default, keeps all execution data, metadata, metrics, and analysis data there. Every benchmark or eval run is stored as an inspectable record that includes:

Run metadata, including ID, eval name, status, timestamps, and error state
Workflow input passed through qt run --input
Workflow output returned by the workflow handler
Numeric metrics emitted with emit
Step records, including step key, status, input hash, and timestamps
Lifecycle events for runs and steps

Step outputs are captured for durability, resumability, and comparisons. If you need to inspect a value directly, put it in the workflow output or emit it as a metric.

Inspect One Evaluation Run

To inspect one evaluation, first find its run ID, then use qt show to view the recorded result.

List recorded evaluation runs

Use qt list to see recorded runs with their IDs, eval names, status, sample count, creation time, and duration:


qt list

Example output:


ID  EVAL            STATUS     SAMPLES    CREATED                     DURATION
2   support-triage  completed  5000       2026-07-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813        2026-07-01T18:15:00.000000Z  2.000s
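
A script that needs the IDs of failed runs (for example, to decide which runs to resume) could parse this table. The sketch below is a minimal illustration, not part of Quantiles: it assumes the columns are separated by whitespace and that no value contains a space, as in the example output. The names `parse_run_table` and `failed_run_ids` are hypothetical helpers, and the sample text is modeled on the example rather than captured from a real run.

```python
# Sample text modeled on the example output above; a real script would
# capture the stdout of `qt list` instead.
LIST_OUTPUT = """\
ID  EVAL            STATUS     SAMPLES    CREATED                      DURATION
2   support-triage  completed  5000       2026-07-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813        2026-07-01T18:15:00.000000Z  2.000s
"""


def parse_run_table(text):
    """Turn the whitespace-aligned table into a list of dicts keyed by header."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    headers = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Assumption: no column value contains spaces, as in the example.
        rows.append(dict(zip(headers, line.split())))
    return rows


def failed_run_ids(rows):
    """Return the IDs of runs whose STATUS column is 'failed'."""
    return [row["ID"] for row in rows if row["STATUS"] == "failed"]


runs = parse_run_table(LIST_OUTPUT)
print(failed_run_ids(runs))  # prints ['1']
```

Parsing a human-readable table is fragile, so treat this as a convenience for quick scripts only.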

View high-level run information

Use qt show with a run ID to inspect a single run:


qt show "$RUN_ID"

The top section of the output summarizes high-level run information, including status and any errors. The metrics table reports the aggregate metrics emitted during the run.


Run 2
  eval:        support-triage
  status:      completed
  created:     2026-07-02T18:30:00.000000Z
  duration:    5.000s
  input:       {"promptVersion":"B","model":"openai:gpt-5.5"}
  output:      {"accuracy":0.92,"correct":4600,"total":5000}
  error:       -

Metrics
  NAME                     VALUE          UNIT
  accuracy                 0.92           -
  correct_count            4600           -
  total_count              5000           -
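
In this example the reported accuracy matches correct divided by total (4600 / 5000 = 0.92). That relationship is inferred from the example values, not from a documented guarantee, so the check below is only a sketch of how a script might sanity-check a run's output. `accuracy_is_consistent` is a hypothetical helper.

```python
import json
import math

# The output value shown for Run 2 in the example above.
raw_output = '{"accuracy":0.92,"correct":4600,"total":5000}'


def accuracy_is_consistent(output, tolerance=1e-9):
    """Check that accuracy equals correct / total within a small tolerance."""
    if output["total"] == 0:
        # Avoid dividing by zero; an empty run cannot be verified this way.
        return False
    expected = output["correct"] / output["total"]
    return math.isclose(output["accuracy"], expected, abs_tol=tolerance)


output = json.loads(raw_output)
print(output["correct"] / output["total"])  # prints 0.92
print(accuracy_is_consistent(output))       # prints True
```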

View sample-level inputs, outputs, and metadata

Use --json when you need structured output with detailed run information, including sample-level results. JSON output is recommended for agents and scripts that need to parse evaluation results.


qt show "$RUN_ID" --json

JSON output includes run details, metrics, and sample-level step results in a machine-readable format. Sample objects include fields like:


{
  "samples": [
    {
      "step_key": "case:double-charge",
      "status": "completed",
      "input_hash": "ebc8f717a2ae7bf6",
      "started_at": "2026-07-02T18:30:01.000000Z",
      "finished_at": "2026-07-02T18:30:02.000000Z",
      "metrics": { "latency_ms": 1000 }
    }
  ]
}

Use this sample-level step data to identify which steps completed, failed, or were left incomplete. The input hash helps verify that a resumed run reached the same step with the same input.
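
As a rough illustration, a script could group step keys by status and flag steps whose input hash changed between two snapshots of a run. The sketch assumes the JSON has a top-level `samples` array shaped like the example above. Only the `completed` step status appears in that example, so other status strings such as `failed` are assumptions here. `load_run`, `steps_by_status`, and `changed_inputs` are hypothetical helpers, not part of Quantiles.

```python
import json
import subprocess
from collections import defaultdict


def load_run(run_id):
    """Run `qt show <run_id> --json` and parse its stdout (needs qt on PATH)."""
    result = subprocess.run(
        ["qt", "show", str(run_id), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def steps_by_status(run):
    """Group step keys by their status string."""
    groups = defaultdict(list)
    for sample in run.get("samples", []):
        groups[sample["status"]].append(sample["step_key"])
    return dict(groups)


def changed_inputs(before, after):
    """Return step keys present in both snapshots whose input_hash differs."""
    old = {s["step_key"]: s["input_hash"] for s in before.get("samples", [])}
    return [
        s["step_key"]
        for s in after.get("samples", [])
        if s["step_key"] in old and old[s["step_key"]] != s["input_hash"]
    ]


# Sample object trimmed from the example above, used here instead of a live run.
example = json.loads("""
{"samples": [{"step_key": "case:double-charge", "status": "completed",
              "input_hash": "ebc8f717a2ae7bf6"}]}
""")
print(steps_by_status(example))          # prints {'completed': ['case:double-charge']}
print(changed_inputs(example, example))  # prints []
```

With qt installed, `steps_by_status(load_run(run_id))` would apply the same grouping to a recorded run.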

Compare Evaluation Results

Compare evaluation results between two runs with the following command:


qt compare "$RUN_ID_A" "$RUN_ID_B"

The comparison output includes a run summary for each run, including status and model configuration, followed by a metrics table that reports aggregate metrics and deltas.


Comparing runs 2 and 3
                      Run 2              Run 3              Delta
 Eval                 simpleqa-verified  simpleqa-verified  SAME
 Status               COMPLETE           COMPLETE
 Duration             23.201s            18.720s            -4.481s
 Model                openai:gpt-5.5     openai:gpt-5.5     SAME
 max_similarity       0.7409             0.7618             +0.0209
 mean_similarity      0.5634             0.5647             +0.0013
 median_similarity    0.56               0.56               -0.000041
 min_similarity       0.4501             0.4341             -0.0161
 p95_similarity       0.6387             0.6397             +0.001
 p99_similarity       0.6779             0.6798             +0.0018
 stdev_similarity     0.0409             0.0407             -0.000228
 variance_similarity  0.0017             0.0017             -0.000019
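
Judging from the table, each delta is the second run's value minus the first run's value (for example, 18.720s - 23.201s = -4.481s). The sketch below reproduces that arithmetic for a few rows. The sign convention is inferred from the example, and `metric_deltas` and the `duration_s` key are hypothetical names used only for illustration.

```python
def metric_deltas(run_a, run_b, digits=4):
    """Return {metric: run_b - run_a} for metrics present in both runs."""
    return {
        name: round(run_b[name] - run_a[name], digits)
        for name in run_a
        if name in run_b
    }


# Values copied from the Run 2 and Run 3 columns of the table above.
run_2 = {"duration_s": 23.201, "max_similarity": 0.7409, "mean_similarity": 0.5634}
run_3 = {"duration_s": 18.720, "max_similarity": 0.7618, "mean_similarity": 0.5647}

print(metric_deltas(run_2, run_3))
# prints {'duration_s': -4.481, 'max_similarity': 0.0209, 'mean_similarity': 0.0013}
```

The table displays rounded values, so a delta recomputed from the displayed columns can differ from the reported delta in the last digit (as with min_similarity, where the displayed values give -0.0160 but the table reports -0.0161).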

Use --json when you need structured output with detailed run information, including sample-level results. JSON output is recommended for agents and scripts that need to parse evaluation results.


qt compare "$RUN_ID_A" "$RUN_ID_B" --json

See Compare Evals for more information.

Resume Interrupted Evaluation Runs

Evaluation and benchmark workflows can be resumed after interruptions such as rate limits or timeouts. If a benchmark is interrupted after completing some steps, you can resume the run from its last successful step using the run_id:


qt resume "$RUN_ID"

In the following example, run_id 1 failed to complete:


ID  EVAL            STATUS     SAMPLES  CREATED                     DURATION
2   support-triage  completed  5000     2026-07-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813      2026-07-01T18:15:00.000000Z  2.000s

When the failed run is inspected with qt show, no aggregate metrics are displayed:


qt show 1
Run 1
  eval:        support-triage
  status:      failed
  created:     2026-07-01T18:15:00.000000Z
  duration:    -
  input:       {"model":"openai:gpt-5.5","num_samples":813}
  output:      -
  error:       -
 
Aggregated Metrics
  No metrics found.

To resume and complete the failed evaluation, use the following command:


qt resume 1

Resume a run only when you want to continue with the same evaluation configuration that the original run used. Start a new run when you intentionally change the model, prompt, dataset, rubric, workflow input, or step input, so that results are measured against the correct evaluation configuration. See Resume Runs for the full resume workflow.
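
One way to apply this rule in a script is to compare the workflow input recorded for the original run with the input you plan to use next. The sketch below is an assumption-laden illustration: it only compares workflow input, so it cannot detect changes to a prompt, dataset, or rubric that live outside that input. `config_differences` and `next_action` are hypothetical helpers, and `example:other-model` is a placeholder value.

```python
def config_differences(original_input, planned_input):
    """Return {key: (original, planned)} for every key that differs."""
    diffs = {}
    for key in sorted(set(original_input) | set(planned_input)):
        if (
            key not in original_input
            or key not in planned_input
            or original_input[key] != planned_input[key]
        ):
            # A missing key is reported as None on the side that lacks it.
            diffs[key] = (original_input.get(key), planned_input.get(key))
    return diffs


def next_action(run_id, original_input, planned_input):
    """Suggest `qt resume` only when the planned input is unchanged."""
    if config_differences(original_input, planned_input):
        return "start a new run"
    return f"qt resume {run_id}"


# Input recorded for the failed run in the example above.
original = {"model": "openai:gpt-5.5", "num_samples": 813}

print(next_action(1, original, dict(original)))  # prints qt resume 1

# Placeholder model name, used only to show a changed configuration.
changed = {"model": "example:other-model", "num_samples": 813}
print(config_differences(original, changed))
# prints {'model': ('openai:gpt-5.5', 'example:other-model')}
print(next_action(1, original, changed))  # prints start a new run
```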

Using Coding Agents to Analyze Evaluation Results

The Quantiles agent skill lets coding agents run repository-based evaluations with the qt CLI. The skill supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.

Install the skill

Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:


Please install the Quantiles skill at github.com/quantiles-evals/skill

Alternatively, copy SKILL.md into your agent’s skill directory.

Prompt the coding agent

After installation, use your coding agent to run and inspect an evaluation or to compare results across two runs. Customize the prompt template below to fit your needs:


Run $BENCHMARK, then summarize the aggregate metrics, sample-level results, failures, and any notable errors. Identify the highest-impact issues to review first, and recommend specific next steps to improve quality and reliability.

See Agents Overview for more detail on using agents with Quantiles.