
Compare Evaluation Runs

Many system changes can affect model behavior, including changes to prompts, datasets, code, infrastructure, and the model itself. Run evaluations before and after these changes so you can detect regressions, measure improvements, and understand whether a model update helped. The following is a typical comparison workflow using the Quantiles stack:

  1. Run the baseline eval
  2. Change the model, prompt, parameter, etc.
  3. Re-run the same eval against the new changes
  4. Compare the two evaluation results

Use the qt CLI to compare and analyze the differences between two eval runs with a single command:

qt compare "$RUN_ID_A" "$RUN_ID_B"

Comparison Workflow

Use qt list to find the run IDs for the two evaluations or benchmarks you want to compare:

ID  EVAL               STATUS     SAMPLES  CREATED                      DURATION
5   pubmedqa           completed  1000     2026-06-10T19:33:17.133737Z  3.108s
4   pubmedqa           completed  1000     2026-06-07T01:00:25.014097Z  3.408s
3   simpleqa-verified  completed  1000     2026-06-07T00:58:22.872712Z  18.720s
2   simpleqa-verified  completed  1000     2026-06-07T00:47:18.830474Z  23.201s
1   pubmedqa           completed  1000     2026-06-07T00:43:37.204221Z  1.705s

Use any two run IDs from the ID column as inputs to qt compare. For the clearest comparison, choose two completed runs from the same evaluation or benchmark where only the intended model, prompt, parameter, dataset, or code differs.

qt compare 2 3

The qt CLI displays a warning when comparing runs where the evaluation / benchmark names differ.
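As a rough sketch of the selection step described above, the following Python helper picks the two most recent completed runs of one evaluation from saved `qt list` text. It assumes the listing is whitespace-separated with exactly the six columns shown in the example table, and that the CREATED timestamps share one ISO 8601 UTC format so they sort as strings. `pick_run_pair` and `LISTING` are hypothetical names used for illustration; they are not part of the qt CLI.

```python
from __future__ import annotations

# Sample listing copied from the example table in this section.
LISTING = """\
ID  EVAL               STATUS     SAMPLES  CREATED                      DURATION
5   pubmedqa           completed  1000     2026-06-10T19:33:17.133737Z  3.108s
4   pubmedqa           completed  1000     2026-06-07T01:00:25.014097Z  3.408s
3   simpleqa-verified  completed  1000     2026-06-07T00:58:22.872712Z  18.720s
2   simpleqa-verified  completed  1000     2026-06-07T00:47:18.830474Z  23.201s
1   pubmedqa           completed  1000     2026-06-07T00:43:37.204221Z  1.705s
"""


def pick_run_pair(listing: str, eval_name: str) -> tuple[str, str]:
    """Return (older_id, newer_id) for the two latest completed runs of eval_name."""
    rows = []
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a six-column data row.
        if len(fields) != 6 or fields[0] == "ID":
            continue
        run_id, name, status, _samples, created, _duration = fields
        if name == eval_name and status == "completed":
            rows.append((created, run_id))
    if len(rows) < 2:
        raise ValueError(
            f"need two completed runs of {eval_name!r}, found {len(rows)}"
        )
    # Assumption: identically formatted UTC timestamps sort chronologically.
    rows.sort()
    (_, older), (_, newer) = rows[-2:]
    return older, newer


if __name__ == "__main__":
    older, newer = pick_run_pair(LISTING, "simpleqa-verified")
    print(f"qt compare {older} {newer}")
```

Passing the older run first keeps the delta oriented as "new minus baseline", which matches the Run B - Run A convention used in the metrics table.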

Interpreting qt compare output

The comparison output includes a run summary for each run, including status and model configuration, followed by a metrics table that reports aggregate metrics and deltas.

Comparing runs 2 and 3

                     Run 2              Run 3              Delta
Eval                 simpleqa-verified  simpleqa-verified  SAME
Status               COMPLETE           COMPLETE
Duration             23.201s            18.720s            -4.481s
Model                demo-builtin       demo-builtin       SAME
max_similarity       0.7409             0.7618             +0.0209
mean_similarity      0.5634             0.5647             +0.0013
median_similarity    0.56               0.56               -0.000041
min_similarity       0.4501             0.4341             -0.0161
p95_similarity       0.6387             0.6397             +0.001
p99_similarity       0.6779             0.6798             +0.0018
stdev_similarity     0.0409             0.0407             -0.000228
variance_similarity  0.0017             0.0017             -0.000019

Metrics

The metrics table compares numeric values emitted with emit. Metrics are grouped by name, and each row shows the value from each run and the delta.

  • Run A and Run B - The metric values emitted in each run
  • Delta - The numeric difference between the two runs, calculated as Run B - Run A. Delta indicates direction and magnitude only. It does not determine whether the change is an improvement or a regression.

Interpret metric direction using the meaning of the metric. For example, a positive delta may be good for accuracy, but bad for latency, cost, or token usage.
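The delta rule and the direction caveat above can be illustrated with a small Python sketch. The `HIGHER_IS_BETTER` table and the metric names `accuracy` and `latency_ms` are hypothetical; qt compare itself reports only the delta and does not classify it.

```python
# Hypothetical direction table: True means larger values are better.
HIGHER_IS_BETTER = {"accuracy": True, "latency_ms": False}


def delta(run_a: float, run_b: float) -> float:
    """Delta as defined in the metrics table: Run B minus Run A."""
    return run_b - run_a


def classify(metric: str, run_a: float, run_b: float) -> str:
    """Label a change using the metric's meaning, not just the sign of the delta."""
    d = delta(run_a, run_b)
    if d == 0:
        return "unchanged"
    if metric not in HIGHER_IS_BETTER:
        # Without a known direction, the sign alone says nothing about quality.
        return "unknown"
    improved = (d > 0) == HIGHER_IS_BETTER[metric]
    return "improvement" if improved else "regression"


if __name__ == "__main__":
    # max_similarity from the example output: 0.7618 - 0.7409
    print(round(delta(0.7409, 0.7618), 4))
    print(classify("accuracy", 0.80, 0.85))
    print(classify("latency_ms", 120.0, 150.0))
```

Both hypothetical changes have a positive delta, yet one is classified as an improvement and the other as a regression, which is why the direction has to come from the metric's meaning.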

--json output format

Use --json to emit the above aggregated metrics along with detailed step-level comparison output, all in JSON format. These data are useful for agents or when your own scripts need to parse and analyze the output.

qt compare "$RUN_ID_A" "$RUN_ID_B" --json

This emits a single JSON object with:

  • run_a and run_b metadata (id, workflow_name, status)
  • differs, a boolean indicating whether any dimension changed
  • input_comparison and output_comparison objects, with high-level metadata indicating how the input and output data differed across the runs
  • warning, emitted if the evaluation or benchmark names of the two runs differed
  • steps, an array of per-step comparisons
  • output_differences, an array of field-level diffs
  • metrics, an array of metric comparisons with values and deltas
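A script might consume this object roughly as follows. The sketch reads only the top-level keys listed above plus the documented `id` field of `run_a` and `run_b`. The `SAMPLE` payload is hand-written for illustration and is not real qt output; the inner shape of `steps`, `output_differences`, and `metrics` entries is not specified here, so the sketch only counts them. It also assumes `warning` is absent or null when the names match.

```python
from __future__ import annotations

import json

# Hand-written stand-in for `qt compare ... --json` output (NOT real output).
# Entries inside "metrics" are left empty because their fields are not documented here.
SAMPLE = """
{
  "run_a": {"id": 2, "workflow_name": "simpleqa-verified", "status": "completed"},
  "run_b": {"id": 3, "workflow_name": "simpleqa-verified", "status": "completed"},
  "differs": true,
  "steps": [],
  "output_differences": [],
  "metrics": [{}, {}]
}
"""


def summarize(payload: dict) -> list[str]:
    """Build a short, human-readable summary from the top-level fields."""
    lines = [
        f"run_a={payload['run_a']['id']} run_b={payload['run_b']['id']}",
        f"differs={payload['differs']}",
    ]
    # Assumption: the warning key is missing or null when names match.
    if payload.get("warning"):
        lines.append(f"warning: {payload['warning']}")
    lines.append(f"steps compared: {len(payload.get('steps', []))}")
    lines.append(f"field-level diffs: {len(payload.get('output_differences', []))}")
    lines.append(f"metrics compared: {len(payload.get('metrics', []))}")
    return lines


if __name__ == "__main__":
    for line in summarize(json.loads(SAMPLE)):
        print(line)
```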

Using Coding Agents to Compare Evaluation Runs

Use the Quantiles agent skill to have coding agents compare repository-based evaluations with the qt CLI. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.

If you haven’t already installed the skill, see the Agent Quickstart documentation.

After installation, use your coding agent to compare results across two evaluation runs. Customize the below prompt template to your needs:

Compare "$RUN_ID_A" "$RUN_ID_B", then summarize the aggregate metrics, sample-level results, failures, and any notable errors. Identify the highest-impact issues to review first, and recommend specific next steps.

See Agents Overview for more detail on using agents with Quantiles.

Exit codes

qt compare uses exit codes to signal whether the two runs matched or differed, independent of the output format you choose.

  • 0 - runs are identical in every checked dimension (same workflow inputs, workflow outputs, steps, and aggregate metrics)
  • 1 - runs differ in at least one dimension
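A wrapper script could translate these exit codes as in the sketch below. Only codes 0 and 1 are described above, so any other value is reported as unknown rather than guessed at. `describe_exit_code` and `compare_runs` are hypothetical helper names, and `compare_runs` assumes the `qt` executable is on `PATH`.

```python
import subprocess


def describe_exit_code(code: int) -> str:
    """Map a qt compare exit code to the meaning documented above."""
    if code == 0:
        return "identical"
    if code == 1:
        return "differ"
    # Other codes are not documented in this section.
    return "unknown"


def compare_runs(run_a: str, run_b: str) -> str:
    """Run qt compare and describe the result without raising on exit code 1."""
    # check=False: a non-zero exit code is an expected outcome here, not an error.
    result = subprocess.run(["qt", "compare", run_a, run_b], check=False)
    return describe_exit_code(result.returncode)
```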

Run qt compare in CI/CD systems

This exit code behavior makes qt compare useful for detecting unexpected drift in CI. For example, a GitHub Actions job might include a step like the following:

- name: Check benchmark drift
  run: qt compare ${{ steps.baseline.outputs.run_id }} ${{ steps.current.outputs.run_id }}

This job fails when qt compare detects differences in checked fields, including:

  • Workflow input
  • Workflow output
  • Step presence
  • Step input hashes
  • Step statuses
  • Step outputs
  • Aggregate metrics

Note that metric improvements still count as differences, so qt compare can fail even when the new run is better. Some summary fields, such as duration, model, and benchmark name, are displayed for context, but not all of them feed directly into the exit code.

To enforce threshold-based behavior, such as accepting accuracy improvements but rejecting latency regressions, use --json from a script and apply your own pass/fail rules.
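One way to apply such rules is sketched below in Python. It assumes each entry in the `metrics` array exposes `name` and `delta` fields; those field names are assumptions, so check them against real `--json` output before relying on this. The `RULES` table, its metric names, and its tolerances are hypothetical.

```python
from __future__ import annotations

# Hypothetical rules: metric name -> (better direction, tolerated worsening).
RULES = {
    "accuracy": ("higher", 0.01),    # allow accuracy to drop by at most 0.01
    "latency_ms": ("lower", 25.0),   # allow latency to rise by at most 25 ms
}


def find_regressions(metrics: list[dict], rules: dict) -> list[str]:
    """Return one message per metric that moved too far in its bad direction."""
    failures = []
    for entry in metrics:
        # ASSUMED field names; the metrics entry schema is not documented here.
        name, delta = entry["name"], entry["delta"]
        if name not in rules:
            continue  # metrics without a rule never fail the check
        direction, tolerance = rules[name]
        # Express the change as "how much worse", whatever the better direction is.
        worsening = -delta if direction == "higher" else delta
        if worsening > tolerance:
            failures.append(
                f"{name}: delta {delta:+g} exceeds tolerance {tolerance:g}"
            )
    return failures


if __name__ == "__main__":
    sample = [
        {"name": "accuracy", "delta": 0.03},     # improvement, accepted
        {"name": "latency_ms", "delta": 40.0},   # regression, rejected
    ]
    problems = find_regressions(sample, RULES)
    for problem in problems:
        print(problem)
    raise SystemExit(1 if problems else 0)
```

In a CI job, the `metrics` list would come from parsing the output of `qt compare "$RUN_ID_A" "$RUN_ID_B" --json`. Because qt compare exits with 1 whenever the runs differ, the calling script should capture the output without treating that exit code as a failure, and let the rules above decide the job result.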
