Workflows and Steps

Quantiles uses workflows and steps to make benchmark runs durable, inspectable, and repeatable.

Workflows

A workflow defines the entrypoint for a benchmark or custom eval. It typically loads a dataset, runs the model or agent under test, scores outputs, emits metrics, and returns a final summary. Built-in benchmarks ship with the qt CLI and are built around the same workflow concept. Custom code evals are Python programs whose entrypoint is configured in quantiles.toml and executed with qt run:


[benchmarks.support-triage]
type = "custom_code"
command = ["uv", "run", "eval.py"]


qt run support-triage

When this command runs, the qt CLI assigns the workflow run a new run_id, which it uses to track step execution and manage caching.

Custom code evals must be written against a small Python API that controls workflow execution and metrics emission. Your code stays in control of dataset loading, model calls, scoring logic, and metric calculation.

Steps

Each step is a durable unit of execution. Quantiles records its cache input, output, and status so failed or interrupted runs can resume without rerunning completed work. A step can wrap:

Loading a dataset batch
Running one model call
Grading one sample
Computing one expensive measurement
Calling an external tool or agent

Step keys and step inputs determine whether Quantiles can reuse stored step outputs during qt resume.

Each step in Quantiles is identified by a step_key value and an optional input dictionary. step_key is a stable identifier for a unit of work, like sample:42. Inputs, while optional, let the durable workflow engine distinguish identically named steps, such as those created in a for loop that iterates over rows in a dataset. See Choosing step input values below for details on what to put in the input.


for row in dataset_rows:
    await step(
        ctx,
        # This is the name of the step.
        step_key="sample",
        # This is the step-specific input.
        #
        # Since the step is named "sample" in all iterations of the loop,
        # the input helps distinguish steps from each other.
        #
        # Internally, these values are hashed to uniquely identify this
        # specific step.
        input_value={
            "row_id": row.id,
            "prompt_version": "v1",
            "model": "openai:gpt-5.5",
        },
        # This function is what actually runs; its return value is
        # stored as the step output.
        execute=lambda: run_model(row.prompt),
    )
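
The source does not specify how Quantiles hashes step inputs. As a minimal sketch of the general idea, assuming a SHA-256 digest over canonical JSON (the helper name step_identity is hypothetical and not part of the Quantiles SDK), a stable identifier can be derived from a step key and its input like this:

```python
import hashlib
import json
from typing import Optional


def step_identity(step_key: str, input_value: Optional[dict] = None) -> str:
    """Hypothetical helper: derive a stable identifier from a step key and input.

    This illustrates the concept only; Quantiles' real hashing scheme may differ.
    """
    # Canonical JSON: sorted keys and fixed separators, so that the order in
    # which dictionary keys were written does not change the resulting hash.
    canonical = json.dumps(
        {"step_key": step_key, "input": input_value},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same key and same input (in a different key order) give the same identity.
a = step_identity("sample", {"row_id": 1, "prompt_version": "v1"})
b = step_identity("sample", {"prompt_version": "v1", "row_id": 1})

# Same key with a different row gives a different identity.
c = step_identity("sample", {"row_id": 2, "prompt_version": "v1"})
```

Under this assumption, two loop iterations that share the step key "sample" remain distinguishable as long as their inputs differ in at least one field.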

Choosing step input values

As described above, the step input field is optional in the Quantiles SDK, but it helps distinguish identically named steps, which is important for caching. Step inputs also show up in execution traces, so you can use them to record information about the execution context of a step, such as:

Model name or version
Prompt text or prompt version
Hyperparameter configuration like temperature, structured output schema, or max tokens
Dataset row ID and relevant row fields
Judge prompt or rubric version
Tool configuration

Avoid putting unstable values like random numbers or timestamps in the input. They make a step execution harder to identify reliably, cause unnecessary cache misses, and make a benchmark harder to reproduce.
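
To illustrate the difference, here is a small sketch in plain Python. The helper build_step_input and its field names are hypothetical and are not part of the Quantiles SDK; the point is only that an input built from stable fields is identical across attempts, while one carrying a per-attempt random value never is.

```python
import uuid


def build_step_input(row_id, model, prompt_version, temperature):
    """Hypothetical helper: collect only stable, reproducible fields."""
    return {
        "row_id": row_id,
        "model": model,
        "prompt_version": prompt_version,
        "temperature": temperature,
    }


# Two attempts at the same unit of work produce identical inputs.
stable_first = build_step_input(42, "openai:gpt-5.5", "v1", 0.0)
stable_second = build_step_input(42, "openai:gpt-5.5", "v1", 0.0)

# Anti-pattern: a random value generated per attempt makes every attempt
# look like a different step, so a stored output could never be matched.
unstable_first = {**stable_first, "attempt_id": uuid.uuid4().hex}
unstable_second = {**stable_second, "attempt_id": uuid.uuid4().hex}
```

If a value like an attempt ID is worth keeping for analysis, it is probably better returned as part of the step output than placed in the step input.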

Note that step inputs are different from workflow inputs. Step inputs are used to uniquely identify a step inside a workflow, while workflow inputs allow you to pass data into your entire workflow, and record what you passed for later analysis.

Resume behavior

When you resume a run, Quantiles replays the workflow and checks each step(...) call against the steps already recorded for that run.

Completed steps are reused when their step_key and input hash match, returning the stored output instead of executing again. Failed, running, or missing steps execute normally. If a step is reached with the same step_key but a different input hash, Quantiles stops with an error to prevent a single run from mixing outputs produced under different inputs, models, prompts, dataset rows, or rubrics.

Existing step state	Resume behavior
Completed with same input	Reuse stored output
Failed with same input	Retry the step
Running with same input	Run again using the same step record
Missing	Create and run a new step
Same key with different input	Error
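
The table above can be read as a small decision function. The following is a sketch of that logic only, not the Quantiles implementation; the function name, the status strings, and the returned action labels are all assumptions made for illustration.

```python
from typing import Optional


def resume_action(
    existing_status: Optional[str],
    existing_hash: Optional[str],
    new_hash: str,
) -> str:
    """Hypothetical sketch: decide what to do when a step is reached on resume.

    existing_status is None when no step with this key has been recorded.
    """
    if existing_status is None:
        # Missing: create and run a new step.
        return "create_and_run"
    if existing_hash != new_hash:
        # Same key with different input: stop rather than mix outputs.
        raise ValueError("step reached with the same key but a different input")
    if existing_status == "completed":
        return "reuse_output"
    if existing_status == "failed":
        return "retry"
    if existing_status == "running":
        # Interrupted mid-step: run again using the same step record.
        return "run_again"
    raise ValueError(f"unknown step status: {existing_status!r}")
```

For example, resume_action("completed", "abc", "abc") returns "reuse_output", while resume_action("completed", "abc", "xyz") raises an error, matching the last row of the table.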

See Resume Runs for details on recovering failed runs.