Python SDK

The Python SDK brings Quantiles evaluation workflow primitives to Python, adding durable execution, efficient run handling, and built-in performance optimizations to existing codebases. It supports durable workflow steps, emitted metrics, typed dataset loading, simple concurrency helpers, generalized LLM calls, and common metric utilities.

Installation

Install the Python SDK with your preferred dependency management system. We recommend using the uv tool:


uv add quantiles

The Quantiles Python SDK requires Python 3.12 or newer.

Example


from quantiles import emit, entrypoint, step, workflow
from quantiles.types import JsonValue
from quantiles.workflow_context import WorkflowContext
 
# A workflow is an `async` function that `qt run` executes.
# The function accepts a `ctx` and, optionally, an `input_value` that carries
# the data passed through the `--input` flag on `qt run`.
# `call_model` and `is_correct` below are your own functions, defined elsewhere.
async def my_eval_handler(input_value: dict[str, JsonValue], ctx: WorkflowContext):
    # Execute a durable step in this workflow. The input value, output value, and latency are
    # recorded in the trace, and the output value is cached in case this workflow fails later.
    result = await step(
        ctx,
        step_key="call-model",
        input_value=input_value,
        execute=call_model,
    )
 
    # output metrics for this workflow
    await emit(ctx, "tokens_used", 42)
    await emit(ctx, "correct", is_correct(result))
 
    # the return value will be captured as part of the workflow trace
    return result
 
 
my_eval = workflow("my_eval", my_eval_handler)
 
if __name__ == "__main__":
    entrypoint(my_eval)

Run a Custom Evaluation Workflow

Custom Python workflows must be registered in a quantiles.toml or .quantiles.toml configuration file, so qt run knows which command to execute for a given custom code eval.


[benchmarks.my_eval]
type = "custom_code"
# This command assumes that your custom code eval is in a file called
# my_eval.py in the current directory.
command = ["uv", "run", "./my_eval.py"]

The eval workflow can then be run through the CLI:


qt run my_eval --input '{"model":"openai:gpt-5.5"}'

The SDK automatically reads the runtime environment variables that qt run and qt resume inject, such as QUANTILES_RUN_ID, QUANTILES_WORKFLOW_NAME, and QUANTILES_INPUT. See the CLI Reference for the full list.
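
When debugging a workflow, it can help to look at these variables directly. The sketch below is not part of the SDK: read_runtime_env is a hypothetical helper, and it assumes that QUANTILES_INPUT holds the JSON document passed through --input.

```python
import json
import os
from collections.abc import Mapping
from typing import Any


def read_runtime_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Collect the run settings injected by `qt run` (hypothetical helper)."""
    raw_input = env.get("QUANTILES_INPUT")
    return {
        "run_id": env.get("QUANTILES_RUN_ID"),
        "workflow_name": env.get("QUANTILES_WORKFLOW_NAME"),
        # Assumption: QUANTILES_INPUT is the JSON document given to --input.
        "input": json.loads(raw_input) if raw_input else None,
    }
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dict.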

Core Concepts

Steps

Steps are durable units of work inside a workflow. Each step has:

step_key: a stable name for the work
input_value: optional JSON-serializable input used for cache invalidation
execute: the function to run if the step is not already cached


async def evaluate_sample(ctx: WorkflowContext, sample_id: str, question: str, model_id: str):
    return await step(
        ctx,
        step_key=f"sample:{sample_id}",
        input_value={
            "sample_id": sample_id,
            "question": question,
            "model_id": model_id,
        },
        execute=lambda: call_model(question, model_id),
    )

Use stable step keys. In dataset loops, prefer sample IDs or deterministic row numbers. Put every value that should invalidate the output into input_value, including the model, prompt version, row content, sampling parameters, judge configuration, and rubric version.
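
One way to follow this advice is to build the key and the input in a single place. The helper below is a sketch with a hypothetical name (build_step_args) and hypothetical parameters (prompt_version, temperature); it only prepares the step_key and input_value arguments and does not call the SDK.

```python
from typing import Any


def build_step_args(
    sample_id: str,
    question: str,
    model_id: str,
    prompt_version: str,
    temperature: float,
) -> dict[str, Any]:
    """Return `step_key` and `input_value` for one dataset row (hypothetical helper)."""
    return {
        # The key depends only on the stable sample ID, never on run-time state.
        "step_key": f"sample:{sample_id}",
        # Everything that should invalidate a cached output goes here.
        "input_value": {
            "sample_id": sample_id,
            "question": question,
            "model_id": model_id,
            "prompt_version": prompt_version,
            "temperature": temperature,
        },
    }
```

The returned dict could then be unpacked into a step call, for example step(ctx, **args, execute=...).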

If a completed step with the same key and input hash already exists in the run, Quantiles reuses its stored output. If a failed step has the same key and input hash, Quantiles retries it on resume. If the same step key appears with a different input hash in the same run, Quantiles treats that as a conflict and returns an error.
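
These rules can be modeled in a few lines. The sketch below is an illustration only, not the Quantiles implementation: the hashing scheme (SHA-256 over canonical JSON), the records mapping, and the names input_hash and decide are all assumptions.

```python
import hashlib
import json
from typing import Any, Literal

Decision = Literal["run", "reuse", "retry", "conflict"]


def input_hash(input_value: Any) -> str:
    """Hash JSON-serializable input; sorted keys make the hash order-independent."""
    canonical = json.dumps(input_value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def decide(
    records: dict[str, tuple[str, str]],
    step_key: str,
    input_value: Any,
) -> Decision:
    """Decide what to do with a step, given `records` of key -> (input hash, status)."""
    if step_key not in records:
        return "run"  # first time this key is seen in the run
    stored_hash, status = records[step_key]
    if stored_hash != input_hash(input_value):
        return "conflict"  # same key, different input
    return "reuse" if status == "completed" else "retry"
```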

Metrics

Use emit to record numeric metrics for the run. Metrics are shown by qt show and compared by qt compare.


await emit(ctx, "accuracy", 0.92)
await emit(ctx, "latency_ms", 120, "ms")
await emit(ctx, "tokens_used", 408)

Metric values must be numbers. The unit is optional and used for display.
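
If your scoring code produces booleans or other loosely typed values, you may want to normalize them before calling emit. The helper below is a sketch with a hypothetical name (to_metric_value). Whether emit itself accepts booleans or non-finite values is not stated here, so the helper takes the conservative route and converts or rejects them.

```python
import math


def to_metric_value(value: object) -> float:
    """Convert a score to a finite float, or raise ValueError (hypothetical helper)."""
    if isinstance(value, bool):
        # Booleans become 1.0 / 0.0 so a "correct" flag can be averaged later.
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)) and math.isfinite(value):
        return float(value)
    raise ValueError(f"not a finite number: {value!r}")
```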

Datasets

The Python SDK can load Hugging Face datasets natively through the local Quantiles dataset API. You can also choose to extend the Python DatasetSource class to load from other sources. Datasets loaded through the native dataset API are batched and wrapped in durable steps.

Wherever possible, we recommend the Quantiles-curated datasets on Hugging Face at huggingface.co/quantiles/datasets, because they are stable. You can, however, use the dataset(...) API to load any dataset on Hugging Face, and you can subclass DatasetSource to load datasets from anywhere else.


from pydantic import BaseModel
from quantiles import dataset
 
 
class PubMedQARow(BaseModel):
    sample_id: str
    question: str
    context: str
    gold_answer: str
 
 
# prepare to iterate over the dataset at https://huggingface.co/datasets/quantiles/PubMedQA
ds = await dataset(
    ctx,
    source="huggingface://quantiles/PubMedQA",
    row_type=PubMedQARow,
    batch_size=25,
    config="pqa_labeled",
    split="train",
    max_rows=100,
    on_error="skip",
)
 
# ds.iter_rows() returns an AsyncIterator that yields PubMedQARow objects. You can loop over it
# directly, as shown here, or use the concurrency helpers described under "Concurrency Helpers"
# to process rows concurrently.
async for row in ds.iter_rows():
    ...

Supported dataset options include:

source: huggingface://... or hf://... (you can also iterate a custom DatasetSource manually with an async for loop)
row_type: a Pydantic model used to validate each row
batch_size: number of rows to fetch per dataset batch
config: Hugging Face dataset config
split: dataset split
revision: optional dataset revision
max_rows: optional cap for local runs or smoke tests
on_error: "fail" or "skip" for row validation errors
transform: optional function that converts raw rows into row_type
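
To make the transform and on_error options concrete, here is a self-contained sketch of the behavior they describe. It does not use the SDK: a stdlib dataclass (Row) stands in for the Pydantic row model, load_rows is a hypothetical loop, and the raw field names (pubid, contexts, final_decision) are assumptions about the upstream records.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Literal


@dataclass(frozen=True)
class Row:
    # Stand-in for the Pydantic row model; same fields as PubMedQARow.
    sample_id: str
    question: str
    context: str
    gold_answer: str


def transform(raw: dict[str, Any]) -> Row:
    """Map one raw record to a Row. The raw field names are assumptions."""
    return Row(
        sample_id=str(raw["pubid"]),
        question=raw["question"],
        context=" ".join(raw["contexts"]),
        gold_answer=raw["final_decision"],
    )


def load_rows(
    raw_rows: Iterable[dict[str, Any]],
    on_error: Literal["fail", "skip"] = "fail",
) -> list[Row]:
    """Convert rows, either raising on the first bad row or dropping it."""
    rows: list[Row] = []
    for raw in raw_rows:
        try:
            rows.append(transform(raw))
        except (KeyError, TypeError):
            if on_error == "fail":
                raise
            # on_error == "skip": ignore the malformed row and keep going.
    return rows
```

With "skip", a smoke test keeps running past malformed rows; with "fail", the first bad row stops the load so the problem is visible immediately.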

Private Hugging Face datasets can use HF_TOKEN through the CLI dataset server. For non-Hugging Face private datasets, implement DatasetSource; custom sources run inside the Python workflow process instead of through the CLI dataset API.

Concurrency Helpers

To process dataset rows concurrently, so that your eval finishes sooner, use the map_dataset function.


from dataclasses import dataclass

from quantiles import collect_async_iter, map_dataset
 
 
@dataclass(frozen=True)
class EvalResult:
    # Details about the row's evaluation results should go here
    pass
 
async def evaluate(row: PubMedQARow) -> EvalResult:
    # logic to evaluate the sample should go here
    return EvalResult()
 
 
# map_dataset returns an `AsyncIterator`, and `collect_async_iter` consumes every item from that
# iterator and stores the results in a list[EvalResult].
#
# Use `collect_async_iter` with care, because it can potentially store very large datasets in memory.
results = await collect_async_iter(
    map_dataset(
        ds,
        evaluate,
        max_concurrency=8,
        # pass "input" here when result order should match dataset order
        # pass "completion" when you want results as soon as each task finishes
        yield_order="input",
    )
)

The map_dataset function returns an AsyncIterator that processes the rows of the given Dataset concurrently, in batches whose size is set by the max_concurrency parameter. For each row, it calls your callback with that row as its only argument. The callback is where you sample the model with the row's data.
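
To show what bounded concurrency and the two yield orders mean in practice, here is a stdlib-only sketch. It is not how map_dataset is implemented: bounded_map is a hypothetical function, it takes a plain sequence rather than a Dataset, and it limits concurrency with an asyncio.Semaphore instead of fixed batches.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable, Sequence
from typing import Literal, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    items: Sequence[T],
    fn: Callable[[T], Awaitable[R]],
    max_concurrency: int,
    yield_order: Literal["input", "completion"] = "input",
) -> AsyncIterator[R]:
    """Apply `fn` to every item with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await fn(item)

    tasks = [asyncio.create_task(guarded(item)) for item in items]
    try:
        if yield_order == "input":
            # Await tasks in submission order, so results line up with inputs.
            for task in tasks:
                yield await task
        else:
            # Yield each result as soon as its task finishes.
            for next_done in asyncio.as_completed(tasks):
                yield await next_done
    finally:
        # Cancel outstanding work if the consumer stops early or a task fails.
        for task in tasks:
            task.cancel()
```

Input order is the simpler choice when results must be matched back to rows by position; completion order lets downstream work begin sooner when individual calls vary widely in latency.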

LLM Helpers

Many modern evals are built to measure the performance of large language models (LLMs). Some also use LLMs to help with the measurement, like LLM-as-judge evals. To make sampling from LLMs easier, the Python SDK provides a simple call_llm function for calling OpenAI chat models.

call_llm currently supports the "openai" provider and requires the OPENAI_API_KEY environment variable to be set.


from quantiles import call_llm
 
 
result = await call_llm(
    "openai",
    "gpt-5-nano",
    [
        {"role": "system", "content": "Return only yes, no, or maybe."},
        {"role": "user", "content": "Is aspirin an antiplatelet medication?"},
    ],
)
 
print(result["content"])
print(result["tokens"])

Metric Utilities

After sampling a model, you need to score its outputs to determine how well it performs. The Python SDK includes the Statistics and Classification utilities to help with these measurements.

Use Statistics for aggregate numeric metrics:


from quantiles import Statistics
 
accuracy = Statistics.accuracy(correct_count, total_count)
mean_score = Statistics.mean(scores)
lower, upper = Statistics.confidence_interval(scores)
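
For intuition about what such aggregates compute, here is a stdlib sketch. It is not the Statistics implementation: the function names mirror the calls above for readability only, and the interval method (a normal approximation with z = 1.96) and the empty-input behavior are assumptions.

```python
import math
import statistics
from collections.abc import Sequence


def accuracy(correct_count: int, total_count: int) -> float:
    """Fraction of correct samples; 0.0 for an empty run."""
    return correct_count / total_count if total_count else 0.0


def confidence_interval(
    scores: Sequence[float],
    z: float = 1.96,
) -> tuple[float, float]:
    """Normal-approximation interval for the mean (z = 1.96 is roughly 95%)."""
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return (mean, mean)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return (mean - half_width, mean + half_width)
```

Note that a normal approximation can produce bounds outside [0, 1] for small samples of 0/1 scores.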

Use Classification for binary classifier metrics:


from quantiles import Classification
 
precision = Classification.precision(y_true, y_pred)
recall = Classification.recall(y_true, y_pred)
specificity = Classification.specificity(y_true, y_pred)
f1 = Classification.f1(y_true, y_pred)

Core Exports

The main SDK exports are:

Export	Use
`workflow`	Define a named workflow that can be run by `qt run`
`entrypoint`	Dispatch from the CLI-provided workflow name to a workflow
`step`	Run or reuse a durable step inside a workflow
`emit`	Record a numeric metric on the current run
`QuantilesClient`	Low-level async client for the local `qt` server
`QuantilesRun`	Low-level handle for one run
`StepParams`	Parameters for `QuantilesRun.step(...)` in the low-level client API
`Dataset`	Typed dataset returned by `dataset(...)`
`DatasetSource`	Protocol for custom public or private dataset sources
`dataset`	Load typed Hugging Face datasets or custom `DatasetSource` objects
`map_dataset`	Apply an async function to dataset rows
`iter_async_with_concurrency`	Run async work over an iterable with bounded concurrency
`collect_async_iter`	Collect an async iterator into a list
`call_llm`	Call an OpenAI chat model and return content plus token usage
`LLMMessage`	Type for messages passed to `call_llm`
`ModelProvider`	Type for supported model providers
`SystemMessage`	Type for system messages passed to `call_llm`
`UserMessage`	Type for user messages passed to `call_llm`
`Statistics`	Compute aggregate metrics such as accuracy, mean, variance, and confidence intervals
`Classification`	Compute precision, recall, specificity, and F1

Low-Level Client

Most evals are best served by the workflow, step, and emit functions. These abstract the lower-level details of encoding request payloads, issuing RPCs to the local CLI server, and decoding responses. In some cases, however, you might need direct access to that functionality, and the QuantilesClient class provides it.


from quantiles import QuantilesClient, StepParams
 
 
async with QuantilesClient() as client:
    await client.health()
    run = await client.create_run("manual-eval", {"model": "gpt-5-nano"})
    result = await run.step(
        StepParams(
            key="call-model",
            input_value={"prompt": "hello"},
            execute=lambda: call_model("hello"),
        )
    )
    await run.emit("accuracy", 0.95)
    await run.complete()

By default, QuantilesClient reads the QUANTILES_BASE_URL environment variable to determine where the qt server is running. If that environment variable doesn’t exist, it defaults to http://127.0.0.1:8765.
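
The lookup described here amounts to an environment read with a fallback. The sketch below mirrors that rule for illustration; resolve_base_url and DEFAULT_BASE_URL are hypothetical names and not part of the SDK.

```python
import os
from collections.abc import Mapping

DEFAULT_BASE_URL = "http://127.0.0.1:8765"


def resolve_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Return QUANTILES_BASE_URL if it is set, otherwise the documented default."""
    return env.get("QUANTILES_BASE_URL", DEFAULT_BASE_URL)
```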