Restart and Resume Runs

Quantiles runs are designed as durable workflows that are broken up into individual steps. If a step crashes or is otherwise interrupted, the run can be restarted from the last successful step instead of from the beginning. With the qt CLI, resume a run by its ID:


qt resume "$RUN_ID"

When a run is restarted, outputs from previously successful steps are loaded from cache to avoid re-execution. Failed steps are retried, and any newly added steps are executed normally.

Use qt resume for operational failures like timeouts, rate limits, network errors, malformed samples, local process exits, or interrupted long-running evaluations. Start a new run when you intentionally change the evaluation and want a clean result to compare against the previous run.

Situation	Command	Reason
Network timeout halfway through an evaluation	`qt resume "$RUN_ID"`	Reuse completed steps and retry failed work
Rate limit after many model calls	`qt resume "$RUN_ID"`	Continue the same run and avoid duplicate model calls
Laptop sleeps or the process exits	`qt resume "$RUN_ID"`	Recover from the last recorded step state
Prompt, model, dataset, or rubric intentionally changed	`qt run "$EVAL_NAME"`	Start a new evaluation run so every sample is evaluated against the updated configuration
A completed evaluation or sample should be recomputed	`qt run "$EVAL_NAME"`	Successfully completed evaluations cannot be resumed. Run a new evaluation.
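
For transient failures such as rate limits, it can be convenient to wrap qt resume in a small retry loop. The following is a minimal sketch, not part of Quantiles: resume_with_backoff, its parameters, and the delay values are hypothetical, and it assumes that qt resume exits with code 0 when the resumed run completes, which you should verify for your version of the CLI.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_cli(args: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(args)).returncode


def resume_with_backoff(
    run_id: str,
    max_attempts: int = 3,
    base_delay_s: float = 30.0,
    runner: Callable[[Sequence[str]], int] = run_cli,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `qt resume <run_id>` until it succeeds or attempts run out.

    ASSUMPTION: a zero exit code means the resumed run completed.
    """
    for attempt in range(1, max_attempts + 1):
        if runner(["qt", "resume", run_id]) == 0:
            return True
        if attempt < max_attempts:
            # Exponential backoff between attempts: 30s, 60s, 120s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    return False
```

Calling resume_with_backoff("1") would invoke qt resume 1 up to three times. A loop like this only helps with operational failures; if the resume fails because a step input changed, retrying the same command will fail the same way.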

The Quantiles SDKs include features for custom evaluations, such as the Python dataset API, that are built on top of this resume infrastructure. When building custom evals, we recommend using these features wherever possible to take full advantage of the reliability and resilience that restart/resume offers.

Resume Workflow

Use qt list to see all evaluation runs. In the following example, run 1 failed to complete:


ID  EVAL            STATUS     SAMPLES     CREATED                      DURATION
2   support-triage  completed  5000        2026-07-02T18:30:00.000000Z  5.000s
1   support-triage  failed     813         2026-07-01T18:15:00.000000Z  2.000s

When the failed run is inspected with qt show, no aggregate metrics are displayed:


Run 1
  eval:        support-triage
  status:      failed
  created:     2026-07-01T18:15:00.000000Z
  duration:    -
  input:       {"model":"demo-builtin","samples":5000}
  output:      -
  error:       -
 
Aggregated Metrics
  No metrics found.

To resume and complete the failed evaluation, use the following command:


qt resume 1

This command automatically reuses the input recorded for the run, which is the input field shown in the qt show output above. You do not need to pass it again.

How `qt resume` works

This section contains details that are important if you’re building custom code evals. If you’re just using built-in benchmarks, you do not need to read it.

When you run qt resume "$RUN_ID", the CLI:

Opens the local Quantiles workspace.
Loads the existing run.
Rejects the command if the run is already completed.
Resets the run status to running, clears the run error, and records a run.resumed event.
Starts the local server if the run is a custom evaluation.
Runs your command with the existing run ID.
Skips all steps already marked completed, and uses each step’s cached output instead.
Runs all non-completed steps to completion, and records each step’s output.
Marks the run completed or failed based on the eval’s exit status.
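
The run-level part of the sequence above (rejecting completed runs, resetting status, clearing the error, and recording the event) can be modeled in a few lines. This is an illustrative sketch only: Run, begin_resume, and finish_resume are hypothetical names rather than the Quantiles data model, and treating exit status 0 as success is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    """Toy stand-in for a recorded run; not the Quantiles data model."""
    run_id: int
    status: str  # "running", "failed", or "completed"
    error: Optional[str] = None
    events: List[str] = field(default_factory=list)


def begin_resume(run: Run) -> Run:
    """Apply the run-level checks and resets described above."""
    if run.status == "completed":
        # Completed runs are final and cannot be resumed.
        raise ValueError(f"run {run.run_id} is already completed")
    run.status = "running"
    run.error = None
    run.events.append("run.resumed")
    return run


def finish_resume(run: Run, exit_status: int) -> Run:
    """Mark the run from the eval's exit status.

    ASSUMPTION: exit status 0 means the eval succeeded.
    """
    run.status = "completed" if exit_status == 0 else "failed"
    return run
```

In this model, a second begin_resume on a run that finish_resume marked completed raises an error, which mirrors the CLI rejecting qt resume for completed runs.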

Step caching

When you resume a run, Quantiles replays the workflow and checks each step(...) call against the steps already recorded for that run.

If a completed step has the same step_key and input hash, Quantiles returns the stored output instead of running it again. Failed, running, or missing steps are executed normally. If a step with the same step_key is reached with a different input hash, Quantiles stops with an error so that one run does not accidentally mix outputs from different models, prompts, dataset rows, or rubrics.

Existing step state	Resume behavior
Completed with same input	Reuse stored output
Failed with same input	Retry the step
Running with same input	Run again using the same step record
Missing	Create and run a new step
Same key with different input	Error
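
The table above can be read as a small decision procedure applied to each step during replay. The sketch below is a toy in-memory model of that procedure, written to make the rules concrete. StepRecord, InputMismatchError, and replay_step are hypothetical names, and the real SDK's step(...) signature and storage are not shown here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class StepRecord:
    """Toy stand-in for a recorded step; not the Quantiles data model."""
    status: str  # "completed", "failed", or "running"
    input_hash: str
    output: Any = None


class InputMismatchError(RuntimeError):
    """Raised when a step key is reached with a different input hash."""


def replay_step(
    records: Dict[str, StepRecord],
    step_key: str,
    input_hash: str,
    fn: Callable[[], Any],
) -> Any:
    """Apply the resume table to one step and return its output."""
    record = records.get(step_key)
    if record is None:
        # Missing: create a new step record, then run it below.
        record = StepRecord(status="running", input_hash=input_hash)
        records[step_key] = record
    elif record.input_hash != input_hash:
        # Same key, different input: stop rather than mix outputs.
        raise InputMismatchError(f"step {step_key!r}: input hash changed")
    elif record.status == "completed":
        # Completed with same input: reuse the stored output.
        return record.output

    # Failed, running, or newly created: execute using the same record.
    record.status = "running"
    try:
        record.output = fn()
    except Exception:
        record.status = "failed"
        raise
    record.status = "completed"
    return record.output
```

Note that the callback fn is never invoked for a completed step, which is why a resumed run avoids duplicate model calls.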

Stable Step Inputs

Because of the step caching behavior above, resume works best when each step input includes every value that can affect the output:

Dataset row ID and relevant row fields
Model name and sampling parameters
Prompt text or prompt version
Retrieved context
Tool configuration
Judge model, judge prompt, and rubric version

Avoid unstable values in step inputs unless they are part of the behavior being evaluated. Timestamps, random IDs, request IDs, and temporary file paths can cause unnecessary input changes and prevent reuse.
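
To see why unstable values prevent reuse, consider how an input hash might be derived. The source does not describe Quantiles' actual hashing scheme, so the following is only an assumed illustration using a SHA-256 digest of canonical JSON. The input_hash helper and the field names row_id, prompt_version, and requested_at are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def input_hash(step_input: Dict[str, Any]) -> str:
    """Hash a JSON-serializable step input in a key-order-independent way."""
    canonical = json.dumps(step_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Stable input: only values that affect the step's output.
stable = {
    "row_id": "ticket-0042",
    "model": "demo-builtin",
    "temperature": 0.0,
    "prompt_version": "v3",
}

# Unstable input: the same values plus a per-request timestamp.
first_try = {**stable, "requested_at": "2026-07-01T18:15:00Z"}
retry = {**stable, "requested_at": "2026-07-01T18:20:00Z"}

# Key order does not matter, so the stable input hashes the same on retry.
reordered = dict(reversed(list(stable.items())))
same_on_retry = input_hash(stable) == input_hash(reordered)

# The timestamp alone changes the hash, so the step could not be reused.
changed_by_timestamp = input_hash(first_try) != input_hash(retry)
```

Under this scheme, same_on_retry and changed_by_timestamp are both true: the stable input matches across attempts, while the timestamped input looks like a different step input every time.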

If you omit explicit step input, the step is effectively cached by key alone. That is only safe when the output does not depend on changing values closed over by the step callback.
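
The risk of caching by key alone is easiest to see in a toy model. In the sketch below, toy_step is a hypothetical in-memory stand-in, not the SDK's step(...) function. It shows a closed-over value changing without the cache noticing, and the same change being caught once the value is passed as explicit input.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# key -> (input fingerprint or None, stored output)
_records: Dict[str, Tuple[Optional[str], Any]] = {}


def toy_step(
    key: str,
    fn: Callable[[], Any],
    step_input: Optional[Dict[str, Any]] = None,
) -> Any:
    """Return the stored output for `key`, or run `fn` and store it."""
    fingerprint = None if step_input is None else repr(sorted(step_input.items()))
    if key in _records:
        stored_fingerprint, output = _records[key]
        if stored_fingerprint != fingerprint:
            raise RuntimeError(f"step {key!r}: input changed")
        return output
    output = fn()
    _records[key] = (fingerprint, output)
    return output


# No explicit input: the callback closes over `temperature`.
temperature = 0.0
first = toy_step("generate", lambda: f"answer at T={temperature}")
temperature = 0.9
second = toy_step("generate", lambda: f"answer at T={temperature}")
# `second` is the stale output from T=0.0; the change went unnoticed.

# Explicit input: the same change is detected instead of silently reused.
toy_step("judge", lambda: "score", {"temperature": 0.0})
try:
    toy_step("judge", lambda: "score", {"temperature": 0.9})
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
```

Here second equals first even though temperature changed, while the step with explicit input fails loudly on the same change.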

Common Failure Cases

qt resume "$RUN_ID" fails when the target run is no longer resumable, most often because it already completed or because step/run inputs changed. The cases below describe common error messages and recovery paths.

The CLI says the run is already completed

Start a new evaluation run. Completed runs are final, so qt resume is rejected on them:


qt run "$EVAL_NAME"

A step errors with a different input hash

The resumed workflow reached a step with the same key but different input. Check whether the command, run input, prompt version, dataset row, model parameters, or rubric changed.

Use the following command to inspect step keys and input hashes:


qt show "$RUN_ID" --json

If the change was intentional, start a new evaluation run. If it was accidental, revert or remove the changed input value and retry qt resume "$RUN_ID".
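
An input hash tells you that something changed, but not which field. If your eval code can reconstruct the step input from the original attempt and from the current one, a small diff helper can narrow down the cause. This is a sketch under that assumption: changed_fields is a hypothetical helper, and the source does not specify whether qt show --json exposes the full step inputs, so the two dictionaries are values you assemble yourself.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # placeholder for a field absent on one side


def changed_fields(
    recorded: Dict[str, Any], current: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (recorded_value, current_value)} for every difference."""
    diff = {}
    for name in sorted(set(recorded) | set(current)):
        old = recorded.get(name, MISSING)
        new = current.get(name, MISSING)
        if old != new:
            diff[name] = (old, new)
    return diff
```

For example, comparing an input recorded with prompt version "v1" against a current input with "v2" returns only the prompt version field, which points to an edited prompt as the reason the resume was rejected.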

A completed step reused an old output

This is expected during resume: a completed step with the same step key and input hash always returns its stored output. To force recomputation, start a new evaluation run. If the output should have changed, add the missing behavior-changing value to the step input before running a new evaluation.

Using Coding Agents to Resume Interrupted Eval Runs

If you haven’t already installed the skill, see the Agent Quickstart documentation. After installation, use your coding agent to resume failed or interrupted evaluation runs. Customize the prompt template below to fit your needs:


Resume run <run_id>. Inspect the original failure, verify whether the resumed run completed successfully, summarize aggregate metrics and sample-level results, and recommend specific next actions.

See Agents Overview for more detail on using agents with Quantiles.

Best Practices

Use qt resume only to recover a run that failed or was interrupted.
Start a new run for intentional code, model, prompt, dataset, or rubric changes, even if the old run failed.
Keep workflow input identical when resuming a run.
Keep step keys stable across retries.
Put every behavior-changing value in step input.
Use qt show "$RUN_ID" --json before and after a resume to verify which steps ran, retried, or reused cached output.