Workflows and Steps
Quantiles uses workflows and steps to make benchmark runs durable, inspectable, and repeatable.
Workflows
A workflow defines the entrypoint for a benchmark or custom eval. It typically loads a dataset, runs the model or agent under test, scores outputs, emits metrics, and returns a final summary. Built-in benchmarks ship with the qt CLI and are themselves implemented as workflows. Custom code evals are Python programs whose entrypoint is configured in quantiles.toml and executed with qt run:
[benchmarks.support-triage]
type = "custom_code"
command = ["uv", "run", "eval.py"]

qt run support-triage

After this command runs, the qt CLI will assign the workflow run a new run_id to track step execution and manage caching.
Custom code evals must be written against a small Python API that controls workflow execution and metrics emission. Your own code remains responsible for dataset loading, model calls, scoring logic, and metric calculation.
Steps
Each step is a durable unit of execution. Quantiles records its cache input, output, and status so failed or interrupted runs can resume without rerunning completed work. A step can wrap:
- Loading a dataset batch
- Running one model call
- Grading one sample
- Computing one expensive measurement
- Calling an external tool or agent
Step keys and step inputs determine whether Quantiles can reuse stored step outputs during qt resume.
Each step in Quantiles is identified by a step_key value and an optional input dictionary. step_key is a stable identifier for a unit of work, like sample:42. Inputs, while optional, are used by the durable workflow engine to distinguish identically-named steps, such as steps created in a for loop that iterates over rows in a dataset. The following loop passes the input as input_value:
for row in dataset_rows:
    await step(
        ctx,
        # This is the name of the step.
        step_key="sample",
        # This is the step-specific input.
        #
        # Since the step is named "sample" in all iterations of the loop,
        # the input helps distinguish steps from each other.
        #
        # Internally, these values are hashed to uniquely identify this
        # specific step.
        input_value={
            "row_id": row.id,
            "prompt_version": "v1",
            "model": "openai:gpt-5.5",
        },
        # This function is what actually runs, and its return value is
        # stored as the step output.
        execute=lambda: run_model(row.prompt),
    )
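To make the hashing idea concrete, here is a minimal sketch of how a step key and an input dictionary could be combined into one stable identifier. This is an illustration under stated assumptions, not Quantiles' actual implementation: the `step_identity` helper, the canonical JSON encoding, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import json
from typing import Optional


def step_identity(step_key: str, input_value: Optional[dict] = None) -> str:
    """Return a stable hex digest for a (step_key, input) pair.

    Hypothetical helper: the real engine's hashing scheme may differ.
    """
    # sort_keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(
        {"step_key": step_key, "input": input_value or {}},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same key, different row IDs: the identities differ.
a = step_identity("sample", {"row_id": 1, "prompt_version": "v1"})
b = step_identity("sample", {"row_id": 2, "prompt_version": "v1"})

# Same content, different key order: the identity is unchanged.
c = step_identity("sample", {"prompt_version": "v1", "row_id": 1})
```

Under this scheme, two loop iterations that share the key "sample" still get distinct identities as long as their inputs differ, which is the property the loop above relies on.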
Choosing step input values
As described above, the step input is optional in the Quantiles SDK, but it distinguishes identically-named steps, which matters for caching. Step inputs also appear in execution traces, so you can use them to record the execution context of a step, such as:
- Model name or version
- Prompt text or prompt version
- Hyperparameter configuration like temperature, structured output schema, or max tokens
- Dataset row ID and relevant row fields
- Judge prompt or rubric version
- Tool configuration
Avoid putting unstable values like random numbers or timestamps in the input. They make it harder to identify a step execution reliably, cause unnecessary cache misses, and make a benchmark harder to reproduce.
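The effect of an unstable value can be seen with a small experiment. The sketch below assumes, purely for illustration, that the input hash is a SHA-256 digest of canonical JSON. The `input_hash`, `stable_input`, and `unstable_input` names are hypothetical and not part of the Quantiles SDK.

```python
import hashlib
import json
import random


def input_hash(input_value: dict) -> str:
    """Hypothetical stand-in for the engine's input hash."""
    canonical = json.dumps(input_value, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stable_input(row_id: int) -> dict:
    # Only values that are identical on every replay of the workflow.
    return {"row_id": row_id, "prompt_version": "v1", "temperature": 0.0}


def unstable_input(row_id: int) -> dict:
    # A fresh random number on each call changes the hash on replay.
    return {"row_id": row_id, "nonce": random.random()}


# Stable input: the original run and the replay produce the same hash.
stable_matches = input_hash(stable_input(42)) == input_hash(stable_input(42))

# Unstable input: the replay no longer matches the recorded step.
unstable_matches = input_hash(unstable_input(42)) == input_hash(unstable_input(42))
```

Here `stable_matches` is true and `unstable_matches` is false. If a value must vary between runs, one option is to keep it out of the step input and record it in the step output instead.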
Note that step inputs are different from workflow inputs. Step inputs are used to uniquely identify a step inside a workflow, while workflow inputs allow you to pass data into your entire workflow, and record what you passed for later analysis.
Resume behavior
When you resume a run, Quantiles replays the workflow and checks each step(...) call against the steps already recorded for that run.
Completed steps are reused when their step_key and input hash match, returning the stored output instead of executing again. Failed, running, or missing steps execute normally. If a step is reached with the same step_key but a different input hash, Quantiles stops with an error to prevent a single run from mixing outputs produced under different inputs, models, prompts, dataset rows, or rubrics.
| Existing step state | Resume behavior |
|---|---|
| Completed with same input | Reuse stored output |
| Failed with same input | Retry the step |
| Running with same input | Run again using the same step record |
| Missing | Create and run a new step |
| Same key with different input | Error |
See Resume Runs for details on recovering failed runs.
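The resume table above can be read as a small decision function. The following is a simplified in-memory sketch, not the Quantiles engine: `StepRecord`, `InputMismatchError`, and `resume_step` are hypothetical names, and records are keyed by step_key alone, so the sketch assumes each key appears once per run and ignores how the real engine tells apart loop iterations that share a key.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class InputMismatchError(Exception):
    """Raised when a step key is replayed with a different input hash."""


@dataclass
class StepRecord:
    input_hash: str
    status: str  # "completed", "failed", or "running"
    output: Any = None


def resume_step(
    records: Dict[str, StepRecord],
    step_key: str,
    input_hash: str,
    execute: Callable[[], Any],
) -> Any:
    record = records.get(step_key)

    # Same key with different input: stop instead of mixing outputs.
    if record is not None and record.input_hash != input_hash:
        raise InputMismatchError(f"input changed for step {step_key!r}")

    # Completed with same input: reuse the stored output.
    if record is not None and record.status == "completed":
        return record.output

    # Missing: create a new record. Failed or running: reuse the record.
    if record is None:
        record = StepRecord(input_hash=input_hash, status="running")
        records[step_key] = record
    record.status = "running"

    try:
        record.output = execute()
    except Exception:
        record.status = "failed"
        raise
    record.status = "completed"
    return record.output


# State left behind by an interrupted run (made-up values).
records = {
    "load": StepRecord("h1", "completed", output="cached"),
    "grade": StepRecord("h2", "failed"),
}
reused = resume_step(records, "load", "h1", lambda: "fresh")
retried = resume_step(records, "grade", "h2", lambda: "fresh")
created = resume_step(records, "score", "h3", lambda: "fresh")
```

In this sketch `reused` is the stored value "cached", while `retried` and `created` both come from running `execute` again. Calling `resume_step(records, "load", "other-hash", ...)` raises `InputMismatchError`, mirroring the last row of the table.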