Restart and Resume Runs
Quantiles runs are designed as durable workflows and are broken up into individual steps. If a step crashes or is otherwise interrupted, the entire run can be restarted from the last successful step. With the qt CLI, you can resume a run using its ID with:
qt resume "$RUN_ID"
When a run is restarted, outputs from previously successful steps are loaded from cache to avoid re-execution. Failed steps are retried, and any newly added steps are executed normally.
Use qt resume for operational failures like timeouts, rate limits, network errors, malformed samples, local process exits, or interrupted long-running evaluations. Start a new run when you intentionally change the evaluation and want a clean result to compare against the previous run.
| Situation | Command | Reason |
|---|---|---|
| Network timeout halfway through an evaluation | qt resume "$RUN_ID" | Reuse completed steps and retry failed work |
| Rate limit after many model calls | qt resume "$RUN_ID" | Continue the same run and avoid duplicate model calls |
| Laptop sleeps or the process exits | qt resume "$RUN_ID" | Recover from the last recorded step state |
| Prompt, model, dataset, or rubric intentionally changed | qt run "$EVAL_NAME" | Start a new evaluation run so every sample is evaluated against the updated configuration |
| A completed evaluation or sample should be recomputed | qt run "$EVAL_NAME" | Successfully completed evaluations cannot be resumed. Run a new evaluation. |
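The decision table above can be sketched as a small helper. This is an illustration only and not part of the Quantiles SDK: the function name `choose_command` and its parameters are hypothetical, and it only builds the argument list for the `qt resume` and `qt run` commands shown in the table.

```python
from typing import List


def choose_command(status: str, config_changed: bool,
                   run_id: str, eval_name: str) -> List[str]:
    """Return the qt command (as an argv list) suggested by the table.

    Hypothetical helper: `status` is the run status reported by qt,
    `config_changed` is True when the prompt, model, dataset, or rubric
    was changed on purpose.
    """
    # Intentional changes always get a clean run, even if the old run failed.
    if config_changed:
        return ["qt", "run", eval_name]
    # Completed runs are final and cannot be resumed.
    if status == "completed":
        return ["qt", "run", eval_name]
    # Failed or interrupted runs reuse their completed steps.
    return ["qt", "resume", run_id]


print(choose_command("failed", False, "1", "support-triage"))
# ['qt', 'resume', '1']
```

The returned list can be passed to `subprocess.run` if you want a script to launch the command for you.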
The Quantiles SDKs include features for custom evaluations built on top of this resume infrastructure, like the Python dataset API. When building custom evals, we recommend using these features wherever possible, to take full advantage of the reliability and resilience offered by restart/resume.
Resume Workflow
Use the qt list command to list all evaluation runs. In the following example, run 1 failed to complete:
ID EVAL STATUS SAMPLES CREATED DURATION
2 support-triage completed 5000 2026-7-02T18:30:00.000000Z 5.000s
1 support-triage failed 813 2026-7-01T18:15:00.000000Z 2.000s
When the failed run is inspected with qt show, no aggregate metrics are displayed:
Run 1
eval: support-triage
status: failed
created: 2026-7-01T18:15:00.000000Z
duration: -
input: {"model":"demo-builtin","samples":5000}
output: -
error: -
Aggregated Metrics
No metrics found.
To resume and complete the failed evaluation, use the following command:
qt resume 1
This command automatically reuses the input shown in the qt show output above.
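If you want to find resumable runs from a script, the listing shown earlier can be parsed as text. The sketch below assumes the output is whitespace-separated with a single header row, exactly as in the sample above; the real `qt list` output may differ, and the helper names `parse_listing` and `failed_run_ids` are hypothetical.

```python
from typing import Dict, List

# Copied from the sample listing in this section.
SAMPLE_LISTING = """\
ID EVAL STATUS SAMPLES CREATED DURATION
2 support-triage completed 5000 2026-7-02T18:30:00.000000Z 5.000s
1 support-triage failed 813 2026-7-01T18:15:00.000000Z 2.000s
"""


def parse_listing(text: str) -> List[Dict[str, str]]:
    """Split each row on whitespace and pair the values with the header names."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [name.lower() for name in lines[0].split()]
    return [dict(zip(header, line.split())) for line in lines[1:]]


def failed_run_ids(text: str) -> List[str]:
    """Return the IDs of runs whose status column reads 'failed'."""
    return [row["id"] for row in parse_listing(text) if row["status"] == "failed"]


print(failed_run_ids(SAMPLE_LISTING))  # ['1']
```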
How qt resume works
This section contains details that are important if you’re building custom code evals. If you’re just using built-in benchmarks, you do not need to read it.
When you run qt resume "$RUN_ID", the CLI:
- Opens the local Quantiles workspace.
- Loads the existing run.
- Rejects the command if the run is already completed.
- Resets the run status to running, clears the run error, and records a run.resumed event.
- Starts the local server if the run is a custom evaluation.
- Runs your command with the existing run ID.
- Skips all steps already marked completed, and uses each step’s cached output instead.
- Runs all non-completed steps to completion, and records each step’s output.
- Marks the run completed or failed based on the eval’s exit status.
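The sequence above can be modeled with a small in-memory simulation. This is a sketch for building intuition, not the Quantiles implementation: the `Run` and `Step` classes and the `resume` function are hypothetical, the workspace and local server are omitted, and a raised exception stands in for a non-zero exit status.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Step:
    status: str          # "completed", "failed", or "running"
    output: Any = None


@dataclass
class Run:
    status: str
    error: Optional[str] = None
    events: List[str] = field(default_factory=list)
    steps: Dict[str, Step] = field(default_factory=dict)


def resume(run: Run, workflow: Dict[str, Callable[[], Any]]) -> List[str]:
    """Replay `workflow` against `run` and return the step keys that were executed."""
    # Completed runs are final, so the resume is rejected.
    if run.status == "completed":
        raise ValueError("run is already completed; start a new run instead")
    # Reset status, clear the error, and record the resume event.
    run.status, run.error = "running", None
    run.events.append("run.resumed")
    executed: List[str] = []
    for key, fn in workflow.items():
        step = run.steps.get(key)
        if step is not None and step.status == "completed":
            continue  # skip: the cached output is reused
        executed.append(key)
        try:
            run.steps[key] = Step("completed", fn())
        except Exception as exc:  # stands in for a failing eval process
            run.steps[key] = Step("failed")
            run.status, run.error = "failed", str(exc)
            return executed
    run.status = "completed"
    return executed


run = Run(status="failed", error="timeout",
          steps={"load": Step("completed", [1, 2, 3]), "score": Step("failed")})
workflow = {"load": lambda: [1, 2, 3], "score": lambda: 0.5, "report": lambda: "ok"}
print(resume(run, workflow), run.status)  # ['score', 'report'] completed
```

In this example the `load` step is skipped because it already completed, `score` is retried, and `report` runs for the first time.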
Step caching
When you resume a run, Quantiles replays the workflow and checks each step(...) call against the steps already recorded for that run.
If a completed step has the same step_key and input hash, Quantiles returns the stored output instead of running it again. Failed, running, or missing steps are executed normally. If a step with the same step_key is reached with a different input hash, Quantiles stops with an error so that one run does not accidentally mix outputs from different models, prompts, dataset rows, or rubrics.
| Existing step state | Resume behavior |
|---|---|
| Completed with same input | Reuse stored output |
| Failed with same input | Retry the step |
| Running with same input | Run again using the same step record |
| Missing | Create and run a new step |
| Same key with different input | Error |
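The table above can be written as one decision function. This is a minimal sketch under stated assumptions, not the SDK's code: `resume_action` and its return labels are hypothetical, and a recorded step is represented as a `(status, input_hash)` tuple.

```python
from typing import Optional, Tuple


def resume_action(existing: Optional[Tuple[str, str]], input_hash: str) -> str:
    """Decide what to do with one step on resume.

    `existing` is (status, input_hash) for the step already recorded under
    this step key, or None when no step with this key has been recorded.
    """
    if existing is None:
        return "create-and-run"
    status, recorded_hash = existing
    # Same key, different input: refuse to mix outputs within one run.
    if recorded_hash != input_hash:
        raise ValueError("step input changed for an existing step key")
    if status == "completed":
        return "reuse-stored-output"
    if status == "failed":
        return "retry"
    # Status "running": the step was interrupted mid-flight.
    return "run-again-same-record"


print(resume_action(("completed", "abc"), "abc"))  # reuse-stored-output
print(resume_action(None, "abc"))                  # create-and-run
```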
Stable Step Inputs
Because of the step caching behavior above, resume works best when each step input includes every value that can affect the output:
- Dataset row ID and relevant row fields
- Model name and sampling parameters
- Prompt text or prompt version
- Retrieved context
- Tool configuration
- Judge model, judge prompt, and rubric version
Avoid unstable values in step inputs unless they are part of the behavior being evaluated. Timestamps, random IDs, request IDs, and temporary file paths can cause unnecessary input changes and prevent reuse.
If you omit explicit step input, the step is effectively cached by key alone. That is only safe when the output does not depend on changing values closed over by the step callback.
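To see why unstable values prevent reuse, consider hashing a step input yourself. How Quantiles computes its input hash is not described here, so the sketch below assumes a SHA-256 digest over canonical JSON purely for illustration; the `input_hash` function and the field names other than the `demo-builtin` model are hypothetical.

```python
import hashlib
import json


def input_hash(step_input: dict) -> str:
    """Hash a step input via canonical JSON (sorted keys, fixed separators).

    Illustrative only: the hashing scheme Quantiles uses may differ.
    """
    canonical = json.dumps(step_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


stable = {"row_id": "row-17", "model": "demo-builtin",
          "temperature": 0.0, "prompt_version": "v3"}
# Same values in a different key order: the canonical form is identical.
reordered = {"prompt_version": "v3", "temperature": 0.0,
             "model": "demo-builtin", "row_id": "row-17"}
# A timestamp changes on every attempt, so the hash changes too.
with_timestamp = dict(stable, requested_at="2026-07-01T18:15:00Z")

print(input_hash(stable) == input_hash(reordered))       # True
print(input_hash(stable) == input_hash(with_timestamp))  # False
```

Under this assumption, the timestamped input would never match a recorded step, so a resume could not reuse its output.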
Common Failure Cases
qt resume "$RUN_ID" fails when the target run is no longer resumable, most often because it already completed or because step/run inputs changed. The cases below describe common error messages and recovery paths.
The CLI says the run is already completed
Start a new evaluation run. Completed runs are final, so qt resume is rejected on them:
qt run "$EVAL_NAME"
A step errors with a different input hash
The resumed workflow reached a step with the same key but different input. Check whether the command, run input, prompt version, dataset row, model parameters, or rubric changed.
Use the following command to inspect step keys and input hashes:
qt show "$RUN_ID" --json
If the change was intentional, start a new evaluation run. If it was accidental, fix or remove the changed input and retry qt resume "$RUN_ID".
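When a hash mismatch appears, comparing the step input from the original attempt with the current one shows which field changed. The sketch below assumes you can obtain both inputs as dictionaries, for example by logging them in your own eval code; it makes no assumption about the JSON schema printed by qt show, and `changed_fields` is a hypothetical helper.

```python
_MISSING = "<missing>"


def changed_fields(recorded: dict, current: dict) -> dict:
    """Map each differing key to its (recorded, current) pair of values."""
    diff = {}
    for key in sorted(set(recorded) | set(current)):
        old = recorded.get(key, _MISSING)
        new = current.get(key, _MISSING)
        if old != new:
            diff[key] = (old, new)
    return diff


recorded = {"model": "demo-builtin", "prompt_version": "v3"}
current = {"model": "demo-builtin", "prompt_version": "v4", "request_id": "abc123"}
print(changed_fields(recorded, current))
# {'prompt_version': ('v3', 'v4'), 'request_id': ('<missing>', 'abc123')}
```

Here the prompt version change looks intentional and calls for a new run, while the request ID is an unstable value that should be removed from the step input.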
A completed step reused an old output
That is expected during resume. To force recomputation, start a new evaluation run. If the output should have changed, add the missing behavior-changing value to the step input before running a new evaluation.
Using Coding Agents to Resume Interrupted Eval Runs
If you haven’t already installed the skill, see the Agent Quickstart documentation. After installation, use your coding agent to resume failed or interrupted evaluation runs. Customize the prompt template below to fit your needs:
Resume run <run_id>. Inspect the original failure, verify whether the resumed run completed successfully, summarize aggregate metrics and sample-level results, and recommend specific next actions.
See Agents Overview for more detail on using agents with Quantiles.
Best Practices
- Use qt resume only when recovering a previously failed run.
- Start a new run for intentional code, model, prompt, dataset, or rubric changes, even if the old run failed.
- Keep workflow input identical when resuming a run.
- Keep step keys stable across retries.
- Put every behavior-changing value in step input.
- Use qt show "$RUN_ID" --json before and after a resume to verify which steps ran, retried, or reused cached output.