Custom Evaluations

A custom evaluation is a Python program that is run by the qt CLI and uses the Quantiles API to execute an eval. Your code owns the evaluation logic:

Loading data (possibly using the dataset API)
Calling a model or agent
Scoring outputs
Computing and emitting metrics

Quantiles, in turn, records the run, durable steps, emitted metrics, events, inputs, outputs, and comparisons in durable local storage, and provides observability and analysis tools through qt show, qt list, qt compare, and related commands.

Use custom evaluations when you need to measure behavior that is specific to your product, workflow, prompt, dataset, rubric, or release process and that built-in benchmarks cannot measure.

Evaluation Structure

Most custom evaluations follow the same shape:

Accept structured run input, such as model name, prompt version, dataset slice, or rubric version.
Load or define evaluation cases.
Run one durable step per sample, task, model call, judge call, or agent turn.
Score each result with deterministic metrics or a documented judge rubric.
Emit numeric metrics with emit.
Return a JSON-serializable summary that can be inspected with qt show and compared with qt compare.
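
The six steps above can be sketched as a single async function. This is a minimal illustration, not the SDK's reference implementation: the `emit` and `step` functions below are in-memory stand-ins whose call shapes mirror the snippets on this page, `classify` is a keyword rule standing in for a model call, and the cases are hypothetical. Registration through `workflow` and `entrypoint` is omitted because their signatures are not shown here.

```python
import asyncio
import inspect
from dataclasses import dataclass


@dataclass
class Case:
    id: str
    text: str
    expected: str


# Hypothetical in-memory cases; a real eval would load a dataset here.
CASES = [
    Case("1", "I was charged twice", "billing"),
    Case("2", "The app crashes on login", "bug"),
    Case("3", "How do I export my data?", "how_to"),
]

EMITTED = []  # stand-in metric sink


async def emit(ctx, name, value, unit=None):
    """Stand-in for the SDK's emit: records the metric in memory only."""
    EMITTED.append((name, value, unit))


async def step(ctx, step_key, input_value, execute):
    """Stand-in for the SDK's step: runs execute once, with no durability."""
    result = execute()
    return await result if inspect.isawaitable(result) else result


def classify(text, prompt_version):
    """Placeholder for a model call: a keyword rule instead of an LLM."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    # Version "B" recognizes how-to questions; version "A" does not.
    return "how_to" if prompt_version == "B" and "how" in lowered else "other"


async def support_triage(ctx, run_input):
    prompt_version = run_input["prompt_version"]  # 1. structured run input
    correct = 0
    for case in CASES:  # 2. evaluation cases
        label = await step(  # 3. one step per case
            ctx,
            step_key=f"case:{case.id}",
            input_value={
                "case_id": case.id,
                "text": case.text,
                "prompt_version": prompt_version,
            },
            # Bind `case` as a default so the lambda does not capture the loop variable.
            execute=lambda case=case: classify(case.text, prompt_version),
        )
        correct += int(label == case.expected)  # 4. deterministic scoring
    accuracy = correct / len(CASES)
    await emit(ctx, "accuracy", accuracy)  # 5. numeric metric
    # 6. JSON-serializable summary
    return {"accuracy": accuracy, "correct": correct, "total": len(CASES)}
```

With the real SDK, the body of `support_triage` would stay largely the same while `step` and `emit` come from the SDK rather than from these stubs.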

The Python SDK provides a Pythonic API around the core durable workflow and metrics primitives detailed in the Workflow and Steps documentation:

Primitive	Purpose
`workflow`	Defines a named evaluation entry point that `qt run` can execute
`entrypoint`	Dispatches the CLI-provided workflow name to the right workflow
`step`	Records, retries, and reuses durable units of work
`emit`	Records numeric metrics such as accuracy, latency, cost, or token usage

All custom evaluations must be defined with type = "custom_code" in the quantiles.toml configuration file.

Example code

Copy the example code from GitHub, then run two prompt versions and compare the results:


qt run support-triage --input '{"prompt_version":"A"}'
 
qt run support-triage --input '{"prompt_version":"B"}'
 
qt compare 1 2

You can load datasets manually, but the Python SDK also provides a dataset(...) helper that loads them more efficiently. See the Accessing Datasets documentation for details.

Metrics and Outputs

Use metrics for numeric values that should appear in qt show and qt compare. For example, in Python:


await emit(ctx, "accuracy", 0.92)
await emit(ctx, "latency_ms", 1840, "ms")
await emit(ctx, "tokens_used", 12034, "tokens")
await emit(ctx, "cost_usd", 0.41, "usd")

Metric collection has very low runtime and storage overhead, so we encourage you to record as much telemetry as possible during every workflow run to enable future debugging, evaluation, and regression analysis.
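
One way to get richer telemetry is to aggregate per-sample measurements into several named metrics before emitting them. The sketch below is plain Python with no SDK calls; the sample records and the metric names other than those used above are illustrative assumptions.

```python
import statistics


def summarize(samples):
    """Aggregate per-sample records into (name, value, unit) metric tuples."""
    latencies = [s["latency_ms"] for s in samples]
    return [
        ("accuracy", sum(s["correct"] for s in samples) / len(samples), None),
        ("latency_ms_mean", statistics.fmean(latencies), "ms"),
        ("latency_ms_max", max(latencies), "ms"),
        ("tokens_used", sum(s["tokens"] for s in samples), "tokens"),
    ]


# Hypothetical per-sample measurements collected during a run.
samples = [
    {"correct": True, "latency_ms": 120.0, "tokens": 310},
    {"correct": False, "latency_ms": 180.0, "tokens": 420},
    {"correct": True, "latency_ms": 150.0, "tokens": 270},
    {"correct": True, "latency_ms": 150.0, "tokens": 200},
]
metrics = summarize(samples)
```

Inside a workflow, each tuple would map onto one emit call of the form shown above, with the unit argument left out when it is None.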

You can also return richer JSON output, both numeric and non-numeric, from your workflow run. These data can include per-sample results, representative failures, summaries, and any other fields you want to inspect later:


return {
    "accuracy": accuracy,
    "correct": correct,
    "total": total,
    "failures": [result for result in results if not result.correct],
}
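
If per-sample results are objects rather than dicts, they need to be converted before they are returned, because the summary has to be JSON-serializable. The following sketch assumes a hypothetical `SampleResult` dataclass and uses `dataclasses.asdict` for the conversion; a `json.dumps` round trip serves as a cheap check before returning.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SampleResult:
    sample_id: str
    expected: str
    predicted: str
    correct: bool


def build_summary(results):
    """Build a JSON-serializable summary from a list of SampleResult objects."""
    correct = sum(r.correct for r in results)
    total = len(results)
    return {
        "accuracy": correct / total if total else 0.0,
        "correct": correct,
        "total": total,
        # asdict turns each dataclass instance into a plain dict.
        "failures": [asdict(r) for r in results if not r.correct],
    }


results = [
    SampleResult("t-1", "billing", "billing", True),
    SampleResult("t-2", "bug", "other", False),
]
summary = build_summary(results)
json.dumps(summary)  # raises TypeError if any value is not serializable
```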

Metrics, aggregations, and workflow output will appear when you run qt show:


qt show 1

Inputs

Inputs allow you to pass structured data into workflows to run different variations of your benchmarks with the same code. The input data passed to a run are recorded with the run itself, to ensure the run is reproducible.

The prompt eval example above takes input that specifies the prompt version. You can supply that value with the --input flag or a config file. For other evals, use either method to pass the data that define the experiment, such as:

Model or agent version
Prompt version
Dataset split, revision, or row limit
Sampling parameters
Judge prompt or rubric version
Tool configuration

The example below uses the --input flag to pass both a prompt version and a model to the support-triage workflow:


qt run support-triage --input '{"prompt_version":"A","model":"openai:gpt-5.5"}'

Inputs must be JSON-serializable and should be deterministic wherever possible. Avoid timestamps, random IDs, transient request IDs, and local file paths unless required.
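
A workflow can enforce these rules at the top of the run. The sketch below assumes the input reaches your code as a dict, however the SDK delivers it; the `DEFAULTS` keys and the `row_limit` field are illustrative, not a schema defined by Quantiles.

```python
import json

# Hypothetical defaults for a support-triage style workflow.
DEFAULTS = {"prompt_version": "A", "model": "openai:gpt-5.5", "row_limit": 50}


def normalize_run_input(data):
    """Fill in defaults and reject unknown keys so typos fail fast."""
    if not isinstance(data, dict):
        raise ValueError("run input must be a JSON object")
    unknown = set(data) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input keys: {sorted(unknown)}")
    return {**DEFAULTS, **data}


def canonical(run_input):
    """Serialize with sorted keys so equal inputs produce identical text."""
    return json.dumps(run_input, sort_keys=True, separators=(",", ":"))


run_input = normalize_run_input({"prompt_version": "B"})
```

Rejecting unknown keys catches mistakes such as a misspelled field name, which would otherwise silently fall back to a default.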

Step Inputs

Steps are the unit of durability and comparison. A step should wrap work that is expensive, failure-prone, or useful to inspect later, such as a model call, tool call, judge call, retrieval query, or agent turn.

Steps, like workflows, have inputs, but step inputs define the values on which the step depends. The input data are used by the SDK and the qt tool to intelligently cache step outputs, and to surface context about step execution in the output of qt show and qt compare.

Step inputs are optional, but we recommend passing them wherever possible. Pass them through the input_value parameter of the step function:


await step(
    ctx,
    step_key=f"case:{sample.id}",
    input_value={
        "sample_id": sample.id,
        "model": model,
        "prompt": prompt,
    },
    execute=lambda: call_model(sample.prompt),
)

For each step input, include every value that can change the output:

Sample ID and relevant sample fields
Prompt text or prompt version
Model name and model parameters
Tool configuration
Retrieved context
Judge rubric and judge model
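
To see why every output-changing value belongs in the step input, it can help to fingerprint the input yourself. The sketch below is only an illustration: this page does not describe how Quantiles keys its cache, and `build_step_input` and `fingerprint` are hypothetical helpers, not SDK functions.

```python
import hashlib
import json


def build_step_input(sample, prompt_version, model, model_params, retrieved_context):
    """Collect every value that can change the step's output."""
    return {
        "sample_id": sample["id"],
        "sample_text": sample["text"],
        "prompt_version": prompt_version,
        "model": model,
        "model_params": model_params,
        "retrieved_context": retrieved_context,
    }


def fingerprint(step_input):
    """Hash a canonical JSON encoding; key order does not affect the result."""
    canonical = json.dumps(step_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


sample = {"id": "t-1", "text": "I was charged twice"}
base = build_step_input(sample, "A", "openai:gpt-5.5", {"temperature": 0.0}, [])
warmer = build_step_input(sample, "A", "openai:gpt-5.5", {"temperature": 0.7}, [])

# Same values in a different key order give the same fingerprint.
same_input = fingerprint(base) == fingerprint(dict(reversed(list(base.items()))))
# A changed sampling parameter gives a different fingerprint.
changed_input = fingerprint(base) != fingerprint(warmer)
```

If the temperature were left out of the step input, the two calls would look identical to any input-based cache even though their outputs can differ.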

Using Coding Agents to Create Custom Evaluations

The Quantiles agent skill lets coding agents run repository-based evaluations with the qt CLI. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.

Install the skill

Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:


Please install the Quantiles skill at github.com/quantiles-evals/skill

Alternatively, copy SKILL.md into your agent’s skill directory.

Prompt the coding agent

After installation, have your coding agent build and run a custom evaluation. Customize the prompt template below to fit your needs:


Write a Quantiles custom code evaluation using the Python SDK that uses the <your dataset> dataset, runs samples through the <your model> model, and measures the output using the following metrics:

<list your metrics here>.

Call the evaluation <name>, and make sure to include it in the `quantiles.toml` config file. When you're done, run the new eval and summarize the results.

See Agents Overview for more detail on using agents with Quantiles.

Best Practices

Start with a small dataset slice and a smoke-test metric before scaling up.
Keep step keys deterministic and stable across runs.
Put every behavior-changing value in the step input.
Emit metrics often so that you have enough data for later analysis.
Return per-sample JSON details when you need to debug or review failures.
Version prompts, judge rubrics, datasets, and tool configurations explicitly.
Resume interrupted workflows instead of restarting them, so that completed steps are reused.
Compare runs after each meaningful prompt, model, code, or rubric change.
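
Several of these practices can be captured in small helpers. The sketch below is plain Python under stated assumptions: the prompt texts, the `kind:sample_id` key format (modeled on the `case:{sample.id}` key used earlier), and the `id` field on each row are all illustrative.

```python
# Prompts are versioned explicitly, so a run input can refer to them by version.
PROMPTS = {
    "A": "Classify the support ticket as billing, bug, or other.",
    "B": "Classify the support ticket as billing, bug, how_to, or other.",
}


def get_prompt(version):
    """Look up a prompt by version and fail loudly on unknown versions."""
    try:
        return PROMPTS[version]
    except KeyError:
        known = sorted(PROMPTS)
        raise ValueError(f"unknown prompt_version {version!r}; known: {known}") from None


def step_key(kind, sample_id):
    """Build a key from stable identifiers only: no timestamps or random IDs."""
    return f"{kind}:{sample_id}"


def smoke_slice(rows, limit=5):
    """Pick the same small slice on every run by sorting on a stable field."""
    return sorted(rows, key=lambda row: row["id"])[:limit]


rows = [{"id": "t-3"}, {"id": "t-1"}, {"id": "t-2"}]
first_two = smoke_slice(rows, limit=2)
```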