Compare Evaluation Runs
Many system changes can affect model behavior, including changes to prompts, datasets, code, infrastructure, and the model itself. Run evaluations before and after these changes so you can detect regressions, measure improvements, and understand whether a model update helped. The following is a typical comparison workflow using the Quantiles stack:
- Run the baseline eval
- Change the model, prompt, parameter, etc.
- Re-run the same eval against the new changes
- Compare the two evaluation results
Use the qt CLI to compare and analyze the differences between two eval runs with a single command:
qt compare "$RUN_ID_A" "$RUN_ID_B"
Comparison Workflow
Use qt list to find the run IDs for the two evaluations or benchmarks you want to compare:
ID  EVAL               STATUS     SAMPLES  CREATED                      DURATION
5   pubmedqa           completed  1000     2026-06-10T19:33:17.133737Z  3.108s
4   pubmedqa           completed  1000     2026-06-07T01:00:25.014097Z  3.408s
3   simpleqa-verified  completed  1000     2026-06-07T00:58:22.872712Z  18.720s
2   simpleqa-verified  completed  1000     2026-06-07T00:47:18.830474Z  23.201s
1   pubmedqa           completed  1000     2026-06-07T00:43:37.204221Z  1.705s
Use any two run IDs from the ID column as inputs to qt compare. For the clearest comparison, choose two completed runs from the same evaluation or benchmark where only the intended model, prompt, parameter, dataset, or code differs.
qt compare 2 3
The qt CLI displays a warning when comparing runs whose evaluation or benchmark names differ.
Interpreting qt compare output
The comparison output includes a run summary for each run, including status and model configuration, followed by a metrics table that reports aggregate metrics and deltas.
Comparing runs 2 and 3
                     Run 2              Run 3              Delta
Eval                 simpleqa-verified  simpleqa-verified  SAME
Status               COMPLETE           COMPLETE
Duration             23.201s            18.720s            -4.481s
Model                demo-builtin       demo-builtin       SAME
max_similarity       0.7409             0.7618             +0.0209
mean_similarity      0.5634             0.5647             +0.0013
median_similarity    0.56               0.56               -0.000041
min_similarity       0.4501             0.4341             -0.0161
p95_similarity       0.6387             0.6397             +0.001
p99_similarity       0.6779             0.6798             +0.0018
stdev_similarity     0.0409             0.0407             -0.000228
variance_similarity  0.0017             0.0017             -0.000019
Metrics
The metrics table compares numeric values emitted with emit. Metrics are grouped by name, and each row shows the value from each run and the delta.
- Run A and Run B - the metric values emitted in each eval or benchmark run. In the example output above, these are the Run 2 and Run 3 columns.
- Delta - the numeric difference between the two runs, calculated as Run B - Run A. Delta indicates direction and magnitude only. It does not determine whether the change is an improvement or a regression.
Interpret metric direction using the meaning of the metric. For example, a positive delta may be good for accuracy, but bad for latency, cost, or token usage.
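The delta rule and the direction caveat above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the qt CLI: the HIGHER_IS_BETTER table and the metric names "accuracy" and "latency_ms" are assumptions chosen for the example, and you would replace them with the metrics your own evals emit.

```python
# Hypothetical table: which direction counts as "better" for each metric.
# The names and choices here are assumptions for illustration only.
HIGHER_IS_BETTER = {
    "accuracy": True,
    "mean_similarity": True,
    "latency_ms": False,
}


def delta(run_a, run_b):
    """Delta as defined in the metrics table: Run B - Run A."""
    return run_b - run_a


def interpret(metric, run_a, run_b):
    """Classify a change using the metric's meaning, not just the delta's sign."""
    d = delta(run_a, run_b)
    if d == 0:
        return "unchanged"
    higher_is_better = HIGHER_IS_BETTER.get(metric)
    if higher_is_better is None:
        # No direction is known for this metric, so the sign alone says nothing.
        return "unknown"
    improved = (d > 0) == higher_is_better
    return "improved" if improved else "regressed"
```

For example, interpret("mean_similarity", 0.5634, 0.5647) classifies the positive delta from the sample output as an improvement, while the same positive sign on "latency_ms" would be classified as a regression.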
--json output format
Use --json to emit the above aggregated metrics along with detailed step-level comparison output, all in JSON format. These data are useful for agents or when your own scripts need to parse and analyze the output.
qt compare "$RUN_ID_A" "$RUN_ID_B" --json
This emits a single JSON object with:
- run_a and run_b metadata (id, workflow_name, status)
- differs, a boolean indicating whether any dimension changed
- input_comparison and output_comparison objects, with high-level metadata indicating how the input and output data differed across the runs
- warning, emitted if the names of the runs differed
- steps, an array of per-step comparisons
- output_differences, an array of field-level diffs
- metrics, an array of metric comparisons with values and deltas
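A script might consume this object roughly as follows. The top-level keys (differs, warning, steps, metrics) come from the list above, but the field names inside each metrics entry ("name", "run_a", "run_b", "delta") and the SAMPLE payload are assumptions made for this sketch. Check the real output of qt compare --json before relying on them.

```python
import json


def summarize(raw):
    """Reduce a qt compare --json payload to a few headline facts."""
    report = json.loads(raw)
    metrics = report.get("metrics", [])
    return {
        "differs": bool(report.get("differs")),
        "warning": report.get("warning"),
        "step_count": len(report.get("steps", [])),
        # Assumed entry shape: {"name": ..., "delta": ...}. Entries with a
        # zero or missing delta are treated as unchanged.
        "changed_metrics": [m.get("name") for m in metrics if m.get("delta")],
    }


# Hypothetical payload for illustration; the nested field names are assumed.
SAMPLE = json.dumps({
    "run_a": {"id": 2, "workflow_name": "simpleqa-verified", "status": "completed"},
    "run_b": {"id": 3, "workflow_name": "simpleqa-verified", "status": "completed"},
    "differs": True,
    "steps": [],
    "output_differences": [],
    "metrics": [
        {"name": "max_similarity", "run_a": 0.7409, "run_b": 0.7618, "delta": 0.0209},
        {"name": "example_unchanged", "run_a": 1.0, "run_b": 1.0, "delta": 0.0},
    ],
})

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Using .get() with defaults keeps the sketch tolerant of keys that are absent, such as warning, which the list above describes as emitted only when the run names differ.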
Using Coding Agents to Compare Evaluation Runs
Use the Quantiles agent skill to have coding agents compare repository-based evaluations with the qt CLI. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.
If you haven’t already installed the skill, see the Agent Quickstart documentation.
After installation, use your coding agent to compare results across two evaluation runs. Customize the prompt template below to fit your needs:
Compare "$RUN_ID_A" "$RUN_ID_B", then summarize the aggregate metrics, sample-level results, failures, and any notable errors. Identify the highest-impact issues to review first, and recommend specific next steps.
See Agents Overview for more detail on using agents with Quantiles.
Exit codes
qt compare uses exit codes to signal whether the two runs matched or differed, independent of the output format you choose.
- 0 - runs are identical in every checked dimension (same workflow inputs, workflow outputs, steps, and aggregate metrics)
- 1 - runs differ in at least one dimension
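A wrapper script could branch on these codes as in the sketch below. Only codes 0 and 1 are documented above, so treating any other code as an error is an assumption of this example, and the helper names describe_exit_code and runs_differ are hypothetical.

```python
import subprocess


def describe_exit_code(code):
    """Map a qt compare exit code to the meaning documented above."""
    if code == 0:
        return "identical"
    if code == 1:
        return "differs"
    # Assumption: codes other than 0 and 1 signal a failure to compare.
    return "unexpected"


def runs_differ(run_a, run_b):
    """Run qt compare and return True when the two runs differ."""
    result = subprocess.run(
        ["qt", "compare", str(run_a), str(run_b)],
        capture_output=True,
        text=True,
    )
    status = describe_exit_code(result.returncode)
    if status == "unexpected":
        raise RuntimeError(
            "qt compare exited with code %d: %s" % (result.returncode, result.stderr)
        )
    return status == "differs"
```

Calling runs_differ(2, 3) requires the qt CLI on your PATH. describe_exit_code is kept separate so the mapping can be checked without running the CLI.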
Run qt compare in CI/CD systems
This exit code behavior makes qt compare useful for detecting unexpected drift in CI. For example, a GitHub Actions job might have a step that looks like the following:
- name: Check benchmark drift
  run: qt compare ${{ steps.baseline.outputs.run_id }} ${{ steps.current.outputs.run_id }}
This job fails when qt compare detects differences in checked fields, including:
- Workflow input
- Workflow output
- Step presence
- Step input hashes
- Step statuses
- Step outputs
- Aggregate metrics
Note that metric improvements still count as differences, so qt compare can fail even when the new run is better. Some summary fields, such as duration, model, and benchmark name, are displayed for context, but not all of them feed directly into the exit-code policy.
To enforce threshold-based behavior, such as accepting accuracy improvements but rejecting latency regressions, use --json from a script and apply your own pass/fail rules.
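One way to express such rules is a small Python gate over the parsed --json report. This is a sketch under stated assumptions: the RULES table, the metric names "accuracy" and "latency_ms", and the "name" and "delta" fields inside each metrics entry are all hypothetical, so adjust them to match the actual JSON your runs produce.

```python
# Hypothetical policy: the largest tolerated move in the "bad" direction.
RULES = {
    "accuracy": {"higher_is_better": True, "max_regression": 0.0},
    "latency_ms": {"higher_is_better": False, "max_regression": 50.0},
}


def violations(report, rules=RULES):
    """Return the names of metrics that regressed beyond their allowed threshold."""
    failed = []
    for metric in report.get("metrics", []):
        rule = rules.get(metric.get("name"))
        delta = metric.get("delta")  # assumed to be Run B - Run A
        if rule is None or delta is None:
            continue  # metrics without a rule never fail the gate
        # Express the change as "how much worse", whatever the metric's direction.
        regression = -delta if rule["higher_is_better"] else delta
        if regression > rule["max_regression"]:
            failed.append(metric["name"])
    return failed
```

In CI, a script would parse the stdout of qt compare --json with json.loads, call violations on the result, and exit non-zero only when the returned list is non-empty. Because qt compare itself exits with 1 whenever the runs differ, the script should base its pass/fail result on these rules and not on the command's own exit code.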