Results

Compare Evaluation Runs

Many system changes can affect model behavior, including changes to prompts, datasets, code, infrastructure, and the model itself. Run evaluations before and after these changes so you can detect regressions, measure improvements, and understand whether a model update helped. The following is a typical comparison workflow using the Quantiles stack:

1. Run the baseline eval.
2. Change the model, prompt, parameter, etc.
3. Re-run the same eval against the new changes.
4. Compare the two evaluation results.

Use the qt CLI to compare and analyze the differences between two eval runs with a single command:


qt compare "$RUN_ID_A" "$RUN_ID_B"

Comparison Workflow

Use qt list to find the run IDs for the two evaluations or benchmarks you want to compare:


 ID  EVAL               STATUS     SAMPLES  CREATED                      DURATION
 5   pubmedqa           completed  1000     2026-06-10T19:33:17.133737Z  3.108s
 4   pubmedqa           completed  1000     2026-06-07T01:00:25.014097Z  3.408s
 3   simpleqa-verified  completed  1000     2026-06-07T00:58:22.872712Z  18.720s
 2   simpleqa-verified  completed  1000     2026-06-07T00:47:18.830474Z  23.201s
 1   pubmedqa           completed  1000     2026-06-07T00:43:37.204221Z  1.705s

Use any two run IDs from the ID column as inputs to qt compare. For the clearest comparison, choose two completed runs from the same evaluation or benchmark where only the intended model, prompt, parameter, dataset, or code differs.


qt compare 2 3

The qt CLI displays a warning when comparing runs where the evaluation / benchmark names differ.
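If you want to choose a comparable pair programmatically, the following is a minimal sketch that parses the text of the listing shown above and returns the two most recent completed runs for one evaluation. It assumes the listing is plain whitespace-separated text with exactly the six columns shown. The function name `latest_pair` is hypothetical and is not part of the qt CLI.

```python
from typing import Optional, Tuple

# Listing text copied from the example above.
SAMPLE = """\
 ID  EVAL               STATUS     SAMPLES  CREATED                      DURATION
 5   pubmedqa           completed  1000     2026-06-10T19:33:17.133737Z  3.108s
 4   pubmedqa           completed  1000     2026-06-07T01:00:25.014097Z  3.408s
 3   simpleqa-verified  completed  1000     2026-06-07T00:58:22.872712Z  18.720s
 2   simpleqa-verified  completed  1000     2026-06-07T00:47:18.830474Z  23.201s
 1   pubmedqa           completed  1000     2026-06-07T00:43:37.204221Z  1.705s
"""


def latest_pair(listing: str, eval_name: str) -> Optional[Tuple[int, int]]:
    """Return (older_id, newer_id) for the two newest completed runs of eval_name."""
    # Skip the header row, then split each data row on whitespace.
    rows = [line.split() for line in listing.strip().splitlines()[1:]]
    runs = [
        (created, int(run_id))
        for run_id, name, status, _samples, created, _duration in rows
        if name == eval_name and status == "completed"
    ]
    if len(runs) < 2:
        return None
    # Timestamps in this format sort correctly as plain strings.
    runs.sort()
    return runs[-2][1], runs[-1][1]


print(latest_pair(SAMPLE, "simpleqa-verified"))
```

The returned pair is ordered oldest first, so the older run becomes Run A and the delta reads as the change introduced by the newer run.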

Interpreting `qt compare` output

The comparison output includes a run summary for each run, including status and model configuration, followed by a metrics table that reports aggregate metrics and deltas.


Comparing runs 2 and 3
                      Run 2              Run 3              Delta
 Eval                 simpleqa-verified  simpleqa-verified  SAME
 Status               COMPLETE           COMPLETE
 Duration             23.201s            18.720s            -4.481s
 Model                demo-builtin       demo-builtin       SAME
 max_similarity       0.7409             0.7618             +0.0209
 mean_similarity      0.5634             0.5647             +0.0013
 median_similarity    0.56               0.56               -0.000041
 min_similarity       0.4501             0.4341             -0.0161
 p95_similarity       0.6387             0.6397             +0.001
 p99_similarity       0.6779             0.6798             +0.0018
 stdev_similarity     0.0409             0.0407             -0.000228
 variance_similarity  0.0017             0.0017             -0.000019

Metrics

The metrics table compares numeric values emitted with emit. Metrics are grouped by name, and each row shows the value from each run and the delta.

Run A and Run B - The metric values emitted in each run. The table labels these columns with the run IDs, such as Run 2 and Run 3 in the example above.
Delta - The numeric difference between the two runs, calculated as Run B - Run A. Delta indicates direction and magnitude only. It does not determine whether the change is an improvement or a regression.

Interpret metric direction using the meaning of the metric. For example, a positive delta may be good for accuracy, but bad for latency, cost, or token usage.
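As a small illustration of that rule, the sketch below labels a delta once you have decided which direction is desirable for a metric. The function `classify_delta` and the direction labels are hypothetical helpers, not part of Quantiles. Treating similarity as "higher is better" is an assumption made only for this example.

```python
from typing import Literal

Direction = Literal["higher_is_better", "lower_is_better"]


def classify_delta(delta: float, direction: Direction, tolerance: float = 0.0) -> str:
    """Label a delta (Run B - Run A) as improvement, regression, or unchanged."""
    if abs(delta) <= tolerance:
        return "unchanged"
    # A positive delta helps only when larger values are desirable.
    improved = delta > 0 if direction == "higher_is_better" else delta < 0
    return "improvement" if improved else "regression"


# Values taken from the sample output above (runs 2 and 3).
mean_delta = round(0.5647 - 0.5634, 4)  # Run B - Run A for mean_similarity
print(mean_delta, classify_delta(mean_delta, "higher_is_better"))

# The duration delta is negative, which is desirable for a time measurement.
print(classify_delta(-4.481, "lower_is_better"))
```

The `tolerance` argument lets you ignore changes that are too small to matter, such as the near-zero median delta in the sample output.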

`--json` output format

Use --json to emit the aggregate metrics shown above along with detailed step-level comparison output as JSON. This format is useful for agents and for scripts that need to parse and analyze the output.


qt compare "$RUN_ID_A" "$RUN_ID_B" --json

This emits a single JSON object with:

run_a and run_b metadata (id, workflow_name, status)
differs, a boolean indicating whether any dimension changed
input_comparison and output_comparison objects, with high-level metadata indicating how the input and output data differed across the runs
warning, emitted if the evaluation or benchmark names of the two runs differ
steps, an array of per-step comparisons
output_differences, an array of field-level diffs
metrics, an array of metric comparisons with values and deltas
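A script might reduce that object to a short summary, as in the sketch below. It relies only on the top-level fields listed above. The sample payload is hypothetical: its values, the integer type of `id`, and the empty placeholder entries are illustrative, because this guide does not show the shape of individual step, diff, or metric entries.

```python
import json


def summarize_comparison(raw: str) -> dict:
    """Reduce the JSON text from `qt compare --json` to a few headline facts."""
    report = json.loads(raw)
    return {
        "run_ids": (report["run_a"]["id"], report["run_b"]["id"]),
        "differs": report["differs"],
        # The warning field is only emitted when names differ, so it may be absent.
        "warning": report.get("warning"),
        "step_count": len(report.get("steps", [])),
        "output_difference_count": len(report.get("output_differences", [])),
        "metric_count": len(report.get("metrics", [])),
    }


# Hypothetical payload for illustration only; real output will contain more detail.
sample = json.dumps({
    "run_a": {"id": 2, "workflow_name": "simpleqa-verified", "status": "completed"},
    "run_b": {"id": 3, "workflow_name": "simpleqa-verified", "status": "completed"},
    "differs": True,
    "input_comparison": {},
    "output_comparison": {},
    "steps": [{}],                # placeholder entries, shape not documented here
    "output_differences": [{}],
    "metrics": [{}, {}],
})

print(summarize_comparison(sample))
```

In practice you would pass the captured standard output of `qt compare "$RUN_ID_A" "$RUN_ID_B" --json` to `summarize_comparison` instead of the sample string.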

Using Coding Agents to Compare Evaluation Runs

Use the Quantiles agent skill to have coding agents compare repository-based evaluations with the qt CLI. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.

If you haven’t already installed the skill, see the Agent Quickstart documentation.

After installation, use your coding agent to compare results across two evaluation runs. Customize the prompt template below to suit your needs:


Compare "$RUN_ID_A" "$RUN_ID_B", then summarize the aggregate metrics, sample-level results, failures, and any notable errors. Identify the highest-impact issues to review first, and recommend specific next steps.

See Agents Overview for more detail on using agents with Quantiles.

Exit codes

qt compare uses exit codes to signal whether the two runs matched or differed, independent of the output format you choose.

0 - runs are identical in every checked dimension (same workflow inputs, workflow outputs, steps, and aggregate metrics)
1 - runs differ in at least one dimension
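A wrapper script can branch on these codes, as in the following sketch. Only the codes 0 and 1 are documented above, so treating any other value as "unexpected" is an assumption. The helper names `interpret_exit_code` and `compare_runs` are hypothetical.

```python
import subprocess


def interpret_exit_code(code: int) -> str:
    """Map a qt compare exit code to a label."""
    if code == 0:
        return "identical"
    if code == 1:
        return "different"
    # Assumption: any other code means the command itself did not run cleanly.
    return "unexpected"


def compare_runs(run_a: str, run_b: str) -> str:
    """Run qt compare and label the result. Requires qt on the PATH."""
    # check=False because exit code 1 is a normal "runs differ" result, not a crash.
    result = subprocess.run(["qt", "compare", run_a, run_b], check=False)
    return interpret_exit_code(result.returncode)
```

Calling `compare_runs("2", "3")` would invoke the same command as the earlier example and return one of the three labels.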

Run `qt compare` in CI/CD systems

This exit code behavior makes qt compare useful for detecting unexpected drift in CI. For example, a GitHub Actions job might have a step that looks like the following:


- name: Check benchmark drift
  run: qt compare ${{ steps.baseline.outputs.run_id }} ${{ steps.current.outputs.run_id }}

This job fails when qt compare detects differences in checked fields, including:

Workflow input
Workflow output
Step presence
Step input hashes
Step statuses
Step outputs
Aggregate metrics

Note that metric improvements still count as differences, so qt compare can fail even when the new run is better. Some summary fields, such as duration, model, and benchmark name, are displayed for context, but not all of them feed directly into the exit-code policy.

To enforce threshold-based behavior, such as accepting accuracy improvements but rejecting latency regressions, use --json from a script and apply your own pass/fail rules.
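One way to structure such a script is sketched below. It assumes each entry in the `metrics` array exposes the metric name under a `name` key and the difference under a `delta` key. Those key names are guesses, since this guide does not document the entry shape, so check them against real output before relying on this. The metric names and limits in `RULES` are also hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical policy: metric name -> (desired direction, largest tolerated adverse change).
RULES: Dict[str, Tuple[str, float]] = {
    "accuracy": ("higher_is_better", 0.0),
    "latency_ms": ("lower_is_better", 5.0),
}


def violations(report: dict, rules: Dict[str, Tuple[str, float]]) -> List[str]:
    """Return one message per metric whose adverse change exceeds its limit."""
    problems = []
    for metric in report.get("metrics", []):
        name = metric.get("name")    # assumed key name
        delta = metric.get("delta")  # assumed key name, Run B - Run A
        if name not in rules or delta is None:
            continue
        direction, allowed = rules[name]
        # Convert the delta so that a positive number always means "got worse".
        adverse = -delta if direction == "higher_is_better" else delta
        if adverse > allowed:
            problems.append(f"{name}: delta {delta:+g} exceeds allowed regression {allowed:g}")
    return problems


# Hypothetical parsed report: accuracy improved, latency regressed by 12 ms.
example = {"metrics": [
    {"name": "accuracy", "delta": 0.02},
    {"name": "latency_ms", "delta": 12.0},
]}
print(violations(example, RULES))
```

A CI step could parse the JSON output, call `violations`, and exit with a nonzero status only when the returned list is not empty, so improvements no longer fail the job.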