
Workflows and Steps

Quantiles uses workflows and steps to make benchmark runs durable, inspectable, and repeatable.

Workflows

A workflow defines the entrypoint for a benchmark or custom eval. It typically loads a dataset, runs the model or agent under test, scores outputs, emits metrics, and returns a final summary. Built-in benchmarks are part of the qt CLI itself and are built around the workflow concept. Custom code evals are Python programs whose entrypoint is configured in quantiles.toml and executed with qt run:

[benchmarks.support-triage]
type = "custom_code"
command = ["uv", "run", "eval.py"]

qt run support-triage

When you run this command, the qt CLI assigns the workflow run a new run_id to track step execution and manage caching.

Custom code evals must be written against a small Python API that controls workflow execution and metrics emission. Your code stays in control of dataset loading, model calls, scoring logic, and metric calculation.

Steps

Each step is a durable unit of execution. Quantiles records its cache input, output, and status so failed or interrupted runs can resume without rerunning completed work. A step can wrap:

  • Loading a dataset batch
  • Running one model call
  • Grading one sample
  • Computing one expensive measurement
  • Calling an external tool or agent

Step keys and step inputs determine whether Quantiles can reuse stored step outputs during qt resume.

Each step in Quantiles is identified by a step_key value and an optional input dictionary (passed as input_value in the example below). step_key is a stable identifier for a unit of work, like sample:42. Inputs, while optional, are used by the durable workflow engine to distinguish identically named steps, such as steps created in a for loop that iterates over rows in a dataset:

for row in dataset_rows:
    await step(
        ctx,
        # This is the name of the step.
        step_key="sample",
        # This is the step-specific input.
        #
        # Since the step is named "sample" in all iterations of the loop,
        # the input helps distinguish steps from each other.
        #
        # Internally, these values are hashed to uniquely identify this
        # specific step.
        input_value={
            "row_id": row.id,
            "prompt_version": "v1",
            "model": "openai:gpt-5.5",
        },
        # This function is what actually runs, and its return value is
        # stored.
        execute=lambda: run_model(row.prompt),
    )
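To make the idea of hashing step inputs more concrete, here is a minimal sketch of one way a step key and an input dictionary could be combined into a stable identifier. The helper name step_identity and the choice of canonical JSON plus SHA-256 are assumptions made for illustration only; this page does not specify how Quantiles computes its hashes.

```python
import hashlib
import json
from typing import Optional


def step_identity(step_key: str, input_value: Optional[dict] = None) -> str:
    """Hypothetical helper: derive a stable ID from a step key and its input."""
    # Canonical JSON (sorted keys, fixed separators) so that the order in
    # which keys were written in the dictionary does not affect the hash.
    payload = json.dumps(
        {"step_key": step_key, "input": input_value or {}},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same key and same input, written in a different order: same identity.
a = step_identity("sample", {"row_id": 1, "prompt_version": "v1"})
b = step_identity("sample", {"prompt_version": "v1", "row_id": 1})

# Same key but a different row: a different identity.
c = step_identity("sample", {"row_id": 2, "prompt_version": "v1"})
```

In this sketch, two loop iterations that share the step key "sample" still get distinct identities because their row_id values differ.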

Choosing step input values

As described above, the step input field is optional in the Quantiles SDK, but it helps distinguish identically named steps, which is important for caching. Step inputs also show up in execution traces, so you can use them to record information about a step's execution context, such as:

  • Model name or version
  • Prompt text or prompt version
  • Hyperparameter configuration like temperature, structured output schema, or max tokens
  • Dataset row ID and relevant row fields
  • Judge prompt or rubric version
  • Tool configuration

Avoid putting unstable values like random numbers or timestamps in the input. They make a step execution harder to identify reliably, cause unnecessary cache misses, and make a benchmark harder to reproduce.
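The effect of an unstable value can be illustrated with a toy input hash. The input_hash function below is a hypothetical stand-in (SHA-256 over canonical JSON), not the hashing scheme Quantiles uses, and the started_at field is an invented example of a timestamp that differs between runs.

```python
import hashlib
import json


def input_hash(input_value: dict) -> str:
    """Hypothetical stand-in for an input hash: SHA-256 over canonical JSON."""
    canonical = json.dumps(input_value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


stable = {"row_id": 42, "prompt_version": "v1", "temperature": 0.0}

# The same stable input on two runs hashes identically, so a stored
# output could be matched to it.
first_run = input_hash(stable)
second_run = input_hash(dict(stable))

# A timestamp differs on every run, so the hashes never match even though
# the model, prompt, and row are unchanged.
with_ts_first = input_hash({**stable, "started_at": "2024-01-01T00:00:00Z"})
with_ts_second = input_hash({**stable, "started_at": "2024-01-01T00:05:00Z"})
```

If run metadata such as a start time is worth recording, it is better kept out of the values that identify the step.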

Note that step inputs are different from workflow inputs. Step inputs are used to uniquely identify a step inside a workflow, while workflow inputs allow you to pass data into your entire workflow, and record what you passed for later analysis.

Resume behavior

When you resume a run, Quantiles replays the workflow and checks each step(...) call against the steps already recorded for that run.

Completed steps are reused when their step_key and input hash match, returning the stored output instead of executing again. Failed, running, or missing steps execute normally. If a step is reached with the same step_key but a different input hash, Quantiles stops with an error to prevent a single run from mixing outputs produced under different inputs, models, prompts, dataset rows, or rubrics.

Existing step state              Resume behavior
Completed with same input        Reuse stored output
Failed with same input           Retry the step
Running with same input          Run again using the same step record
Missing                          Create and run a new step
Same key with different input    Error
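The resume rules above can be sketched as a small decision function. StepRecord, resume_action, and the status strings are hypothetical names chosen for this illustration; they are not part of the Quantiles SDK, and the sketch takes as given that the engine has already located the stored record (if any) for the step being replayed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepRecord:
    """Hypothetical record of a step already stored for this run."""
    step_key: str
    input_hash: str
    status: str  # "completed", "failed", or "running"


def resume_action(existing: Optional[StepRecord], input_hash: str) -> str:
    """Return the resume behavior for one replayed step(...) call."""
    if existing is None:
        # No record for this step yet: it has never run in this run.
        return "create and run a new step"
    if existing.input_hash != input_hash:
        # Same key, different input: refuse to mix outputs in one run.
        raise ValueError(
            f"step {existing.step_key!r} was recorded with a different input"
        )
    if existing.status == "completed":
        return "reuse stored output"
    if existing.status == "failed":
        return "retry the step"
    # Remaining case: the step was still running when the run stopped.
    return "run again using the same step record"
```

For example, resume_action(StepRecord("sample", "abc", "completed"), "abc") returns "reuse stored output", while passing a different hash for the same record raises ValueError.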

See Resume Runs for details on recovering failed runs.
