Custom Evaluations
A custom evaluation is a Python program that is run by the qt CLI and uses the Quantiles API to execute an eval. Your code owns the evaluation logic:
- Loading data (possibly using the dataset API)
- Calling a model or agent
- Scoring outputs
- Computing and emitting metrics
Quantiles, in turn, records the run, durable steps, emitted metrics, events, inputs, outputs, and comparisons in durable local storage, and provides observability and analysis tools such as qt show, qt list, and qt compare.
Use custom evaluations when you need to measure behavior that is specific to your product, workflow, prompt, dataset, rubric, or release process and that built-in benchmarks cannot measure.
Evaluation Structure
Most custom evaluations follow the same shape:
- Accept structured run input, such as model name, prompt version, dataset slice, or rubric version.
- Load or define evaluation cases.
- Run one durable `step` per sample, task, model call, judge call, or agent turn.
- Score each result with deterministic metrics or a documented judge rubric.
- Emit numeric metrics with `emit`.
- Return a JSON-serializable summary that can be inspected with `qt show` and compared with `qt compare`.
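The shape above can be sketched without the SDK. The block below is a minimal illustration and not the Quantiles API: `fake_step` and `fake_emit` are in-memory stand-ins whose argument names only mirror the `step` and `emit` calls shown later on this page, and the cases, the `classify` function, and `run_eval` are hypothetical.

```python
import asyncio
import json

# Hypothetical evaluation cases; a real eval would load these from a dataset.
CASES = [
    {"id": "t1", "text": "I was charged twice", "expected": "billing"},
    {"id": "t2", "text": "App crashes on login", "expected": "bug"},
    {"id": "t3", "text": "How do I export data?", "expected": "how_to"},
]

METRICS = {}     # stand-in for recorded metrics
STEP_CACHE = {}  # stand-in for durable step storage


async def fake_step(step_key, input_value, execute):
    """Stand-in for a durable step: run once per (key, input), then reuse."""
    cache_key = (step_key, json.dumps(input_value, sort_keys=True))
    if cache_key not in STEP_CACHE:
        STEP_CACHE[cache_key] = await execute()
    return STEP_CACHE[cache_key]


async def fake_emit(name, value, unit=None):
    """Stand-in for emitting a numeric metric."""
    METRICS[name] = (value, unit)


async def classify(text, prompt_version):
    """Hypothetical 'model call': a keyword rule instead of a real model."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    # Pretend prompt version B handles how-to questions and A does not.
    return "how_to" if prompt_version == "B" else "other"


async def run_eval(run_input):
    prompt_version = run_input["prompt_version"]  # structured run input
    results = []
    for case in CASES:  # evaluation cases
        label = await fake_step(  # one step per sample
            step_key=f"case:{case['id']}",
            input_value={"sample_id": case["id"], "prompt_version": prompt_version},
            execute=lambda case=case: classify(case["text"], prompt_version),
        )
        # Deterministic scoring: exact match against the expected label.
        results.append({"id": case["id"], "label": label,
                        "correct": label == case["expected"]})
    correct = sum(r["correct"] for r in results)
    accuracy = correct / len(results)
    await fake_emit("accuracy", accuracy)  # numeric metric
    return {  # JSON-serializable summary
        "accuracy": accuracy,
        "correct": correct,
        "total": len(results),
        "failures": [r for r in results if not r["correct"]],
    }


summary_a = asyncio.run(run_eval({"prompt_version": "A"}))
summary_b = asyncio.run(run_eval({"prompt_version": "B"}))
print(json.dumps(summary_a, indent=2))
```

Because the step input includes the prompt version, the two runs do not share cached step results, even though they use the same step keys.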
The Python SDK provides a Pythonic API around the core durable workflow and metrics primitives detailed in the Workflow and Steps documentation:
| Primitive | Purpose |
|---|---|
| `workflow` | Defines a named evaluation entry point that qt run can execute |
| `entrypoint` | Dispatches the CLI-provided workflow name to the right workflow |
| `step` | Records, retries, and reuses durable units of work |
| `emit` | Records numeric metrics such as accuracy, latency, cost, or token usage |
All custom evaluations must be defined with `type = "custom_code"` in the `quantiles.toml` configuration file.
Example code
Copy the example code from GitHub, then run two prompt versions and compare the results:
qt run support-triage --input '{"prompt_version":"A"}'
qt run support-triage --input '{"prompt_version":"B"}'
qt compare 1 2
While you can load datasets manually, the Python SDK also provides a dataset(...) helper that loads them more efficiently. See the Accessing Datasets documentation for details.
Metrics and Outputs
Use metrics for numeric values that should appear in qt show and qt compare. For example, in Python:
await emit(ctx, "accuracy", 0.92)
await emit(ctx, "latency_ms", 1840, "ms")
await emit(ctx, "tokens_used", 12034, "tokens")
await emit(ctx, "cost_usd", 0.41, "usd")
Metric collection has very low runtime and storage overhead, so we encourage recording rich telemetry during every workflow run to support future debugging, evaluation, and regression analysis.
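The values passed to emit are ordinary numbers that your code computes. As a minimal sketch, assuming hypothetical per-sample records with `correct`, `latency_ms`, and `tokens` fields, the aggregates could be derived like this before emitting them. The metric names and the `nearest_rank_percentile` helper are illustrative and not part of the SDK.

```python
import math
import statistics

# Hypothetical per-sample records collected during a run.
results = [
    {"correct": True, "latency_ms": 1200, "tokens": 310},
    {"correct": True, "latency_ms": 1850, "tokens": 402},
    {"correct": False, "latency_ms": 2400, "tokens": 515},
    {"correct": True, "latency_ms": 1500, "tokens": 298},
]


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [r["latency_ms"] for r in results]

# Each entry is (name, value, unit); unit is None when the metric is unitless.
metrics = [
    ("accuracy", sum(r["correct"] for r in results) / len(results), None),
    ("latency_mean_ms", statistics.mean(latencies), "ms"),
    ("latency_p95_ms", nearest_rank_percentile(latencies, 95), "ms"),
    ("tokens_used", sum(r["tokens"] for r in results), "tokens"),
]

for name, value, unit in metrics:
    # In a workflow, each tuple would be passed to emit(ctx, name, value[, unit]).
    print(name, value, unit)
```

Emitting a tail percentile alongside the mean makes latency regressions easier to spot than the mean alone.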
You can also return richer JSON output, numeric or non-numeric, from your workflow run. These data can include per-sample results, representative failures, summaries, and any other fields you want to inspect later:
return {
    "accuracy": accuracy,
    "correct": correct,
    "total": total,
    "failures": [result for result in results if not result.correct],
}
Metrics, aggregations, and workflow output will appear when you run qt show:
qt show 1
Inputs
Inputs allow you to pass structured data into workflows to run different variations of your benchmarks with the same code. The input data passed to a run are recorded with the run itself, to ensure the run is reproducible.
The workflows in the prompt eval example above take input data that specify the prompt version. You can supply this value with the --input flag or a config file. Other evals can use either method to pass any data that define the experiment, such as:
- Model or agent version
- Prompt version
- Dataset split, revision, or row limit
- Sampling parameters
- Judge prompt or rubric version
- Tool configuration
The example below uses the --input flag to pass both a prompt version and a model to the support-triage workflow above:
qt run support-triage --input '{"prompt_version":"A","model":"openai:gpt-5.5"}'
Inputs must be JSON-serializable and should be deterministic wherever possible. Avoid timestamps, random IDs, transient request IDs, and local file paths unless required.
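One way to keep inputs well-formed is to validate them at the top of the workflow. The sketch below assumes the decoded input reaches your code as a plain dict. `RunConfig`, `parse_run_input`, `canonical_json`, and the `row_limit` default are hypothetical names for illustration, not SDK features.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical typed view of a run's input."""
    prompt_version: str
    model: str
    row_limit: int = 50  # assumed default, for illustration only


def parse_run_input(raw: dict) -> RunConfig:
    """Validate a decoded input payload and apply defaults."""
    allowed = {"prompt_version", "model", "row_limit"}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    missing = {"prompt_version", "model"} - set(raw)
    if missing:
        raise ValueError(f"missing input fields: {sorted(missing)}")
    return RunConfig(**raw)


def canonical_json(value) -> str:
    """Serialize with sorted keys so equal inputs produce equal text.

    json.dumps raises TypeError for values that are not JSON-serializable.
    """
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


raw_input = json.loads('{"prompt_version":"A","model":"openai:gpt-5.5"}')
config = parse_run_input(raw_input)
print(config)
print(canonical_json(raw_input))
```

Rejecting unknown fields catches misspelled keys early, before a run silently falls back to a default.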
Step Inputs
Steps are the unit of durability and comparison. A step should wrap work that is expensive, failure-prone, or useful to inspect later, such as a model call, tool call, judge call, retrieval query, or agent turn.
Steps, like workflows, have inputs, but step inputs define the values on which the step depends. The input data are used by the SDK and the qt tool to intelligently cache step outputs, and to surface context about step execution in the output of qt show and qt compare.
Step inputs are optional in the step function, but we recommend passing them wherever possible. Do so with the input_value argument of the step function:
await step(
    ctx,
    step_key=f"case:{sample.id}",
    input_value={
        "sample_id": sample.id,
        "model": model,
        "prompt": prompt,
    },
    execute=lambda: call_model(sample.prompt),
)
For each step input, include every value that can change the output:
- Sample ID and relevant sample fields
- Prompt text or prompt version
- Model name and model parameters
- Tool configuration
- Retrieved context
- Judge rubric and judge model
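To see why completeness matters, consider any cache that is keyed on the step input. The sketch below is an illustration only and makes no claim about how Quantiles computes its cache keys: the `fingerprint` and `build_step_input` helpers, the field names, and the model name are all hypothetical.

```python
import hashlib
import json


def fingerprint(step_input: dict) -> str:
    """Hash a step input via canonical JSON (illustrative only)."""
    text = json.dumps(step_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def build_step_input(sample: dict, model: str, temperature: float, prompt: str) -> dict:
    """Collect every value that can change the model's output."""
    return {
        "sample_id": sample["id"],
        "sample_text": sample["text"],
        "model": model,
        "temperature": temperature,
        "prompt": prompt,
    }


sample = {"id": "t1", "text": "I was charged twice"}
base = build_step_input(sample, "model-a", 0.0, "Classify the ticket.")
same = build_step_input(sample, "model-a", 0.0, "Classify the ticket.")
hotter = build_step_input(sample, "model-a", 0.7, "Classify the ticket.")

# Identical inputs share a fingerprint; a changed parameter does not.
print(fingerprint(base) == fingerprint(same))
print(fingerprint(base) == fingerprint(hotter))

# The pitfall: leaving temperature out makes two different calls look identical.
incomplete_a = {k: v for k, v in base.items() if k != "temperature"}
incomplete_b = {k: v for k, v in hotter.items() if k != "temperature"}
print(fingerprint(incomplete_a) == fingerprint(incomplete_b))
```

If a value that affects the output is missing from the step input, two different calls become indistinguishable, and a cached result from one could be reused for the other.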
Using Coding Agents to Create Custom Evaluations
The Quantiles agent skill lets coding agents run repository-based evaluations with the qt CLI. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.
Install the skill
Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:
Please install the Quantiles skill at github.com/quantiles-evals/skill
Alternatively, copy SKILL.md into your agent’s skill directory.
Prompt the coding agent
After installation, have your coding agent build and run a custom evaluation. Customize the prompt template below to your needs:
Write a Quantiles custom code evaluation using the Python SDK that uses the <your dataset> dataset, runs samples through the <your model> model, and measures the output using the following metrics:
<list your metrics here>.
Call the evaluation <name>, and make sure to include it in the `quantiles.toml` config file. When you're done, run the new eval and summarize the results.
See Agents Overview for more detail on using agents with Quantiles.
Best Practices
- Start with a small dataset slice and a smoke-test metric before scaling up.
- Keep step keys deterministic and stable across runs.
- Put every behavior-changing value in the step input.
- Emit metrics often so that you have the data you need for later analysis.
- Return per-sample JSON details when you need to debug or review failures.
- Version prompts, judge rubrics, datasets, and tool configurations explicitly.
- Resume interrupted workflows rather than starting over, so that completed steps are reused.
- Compare runs after each meaningful prompt, model, code, or rubric change.