Python SDK
The Python SDK brings Quantiles evaluation workflow primitives to Python, adding durable execution, efficient run handling, and built-in performance optimizations to existing codebases. It supports durable workflow steps, emitted metrics, typed dataset loading, simple concurrency helpers, generalized LLM calls, and common metric utilities.
Installation
Install the Python SDK with your preferred dependency management system. We recommend using the uv tool:
uv add quantiles
The Quantiles Python SDK requires Python 3.12 or newer.
Example
from quantiles import emit, entrypoint, step, workflow
from quantiles.types import JsonValue
from quantiles.workflow_context import WorkflowContext
# A workflow is an `async` function that `qt run` executes.
# The function accepts a `ctx` and, optionally, an `input_value` if you want
# to pass data through the `--input` flag on `qt run`.
# `call_model` and `is_correct` stand in for functions you define elsewhere.
async def my_eval_handler(input_value: dict[str, JsonValue], ctx: WorkflowContext):
    # Execute a durable step in this workflow. The input value, output value, and latency are
    # recorded in the trace, and the output value is cached in case this workflow fails later.
    result = await step(
        ctx,
        step_key="call-model",
        input_value=input_value,
        execute=call_model,
    )
    # Output metrics for this workflow.
    await emit(ctx, "tokens_used", 42)
    await emit(ctx, "correct", is_correct(result))
    # The return value is captured as part of the workflow trace.
    return result
my_eval = workflow("my_eval", my_eval_handler)
if __name__ == "__main__":
    entrypoint(my_eval)
Run a Custom Evaluation Workflow
Custom Python workflows must be registered in a quantiles.toml or .quantiles.toml configuration file, so qt run knows which command to execute for a given custom code eval.
[benchmarks.my_eval]
type = "custom_code"
# This command assumes that your custom code eval is in a file called
# my_eval.py in the current directory.
command = ["uv", "run", "./my_eval.py"]
The eval workflow can then be run through the CLI:
qt run my_eval --input '{"model":"openai:gpt-5.5"}'
The SDK automatically reads runtime environment variables injected by qt run and qt resume, such as QUANTILES_RUN_ID, QUANTILES_WORKFLOW_NAME, and QUANTILES_INPUT. See the CLI Reference for the full list.
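The SDK handles these variables for you, but it can help to inspect them while debugging a workflow. The following is a minimal sketch, not part of the SDK: `read_runtime_env` is a hypothetical helper, and it assumes that QUANTILES_INPUT carries the JSON text passed to `--input`, which this page does not state explicitly.

```python
import json
import os
from typing import Any, Mapping


def read_runtime_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Collect the Quantiles runtime variables named in this section.

    Hypothetical debugging helper, not an SDK function.
    Assumption: QUANTILES_INPUT holds the JSON text given to `--input`.
    """
    raw_input = env.get("QUANTILES_INPUT")
    return {
        "run_id": env.get("QUANTILES_RUN_ID"),
        "workflow_name": env.get("QUANTILES_WORKFLOW_NAME"),
        # Decode only when the variable is present and non-empty.
        "input": json.loads(raw_input) if raw_input else None,
    }
```

Calling `read_runtime_env()` outside of `qt run` returns `None` for every field, which is a quick way to confirm whether a process was launched by the CLI.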
Core Concepts
Steps
Steps are durable units of work inside a workflow. Each step has:
- step_key: a stable name for the work
- input_value: optional JSON-serializable input used for cache invalidation
- execute: the function to run if the step is not already cached
async def evaluate_sample(ctx: WorkflowContext, sample_id: str, question: str, model_id: str):
    return await step(
        ctx,
        step_key=f"sample:{sample_id}",
        input_value={
            "sample_id": sample_id,
            "question": question,
            "model_id": model_id,
        },
        execute=lambda: call_model(question, model_id),
    )
Use stable step keys. In dataset loops, prefer sample IDs or deterministic row numbers. Put every value that should invalidate the output into input_value, including the model, prompt version, row content, sampling parameters, judge configuration, and rubric version.
If a completed step with the same key and input hash already exists in the run, Quantiles reuses its stored output. If a failed step has the same key and input hash, Quantiles retries it on resume. If the same step key appears with a different input hash in the same run, Quantiles treats that as a conflict and returns an error.
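The reuse, retry, and conflict rules above can be illustrated with a small in-memory model. This is a sketch for intuition only, not how Quantiles stores steps: `ToyRun`, `StepConflictError`, and `input_hash` are hypothetical names, and hashing canonical JSON with SHA-256 is an assumption about how an input hash could be computed.

```python
import hashlib
import json
from typing import Any, Callable


class StepConflictError(Exception):
    """Raised when a step key is reused with a different input in one run."""


def input_hash(input_value: Any) -> str:
    # Assumption: canonical JSON (sorted keys) so equal inputs hash the same.
    encoded = json.dumps(input_value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class ToyRun:
    """In-memory stand-in for one run's step records (hypothetical)."""

    def __init__(self) -> None:
        # step_key -> (input hash, status, output)
        self._steps: dict[str, tuple[str, str, Any]] = {}

    def step(self, step_key: str, input_value: Any, execute: Callable[[], Any]) -> Any:
        digest = input_hash(input_value)
        record = self._steps.get(step_key)
        if record is not None:
            stored_digest, status, output = record
            if stored_digest != digest:
                # Same key, different input hash: treated as a conflict.
                raise StepConflictError(step_key)
            if status == "completed":
                # Same key and hash, already completed: reuse the output.
                return output
            # Otherwise the earlier attempt failed, so fall through and retry.
        try:
            output = execute()
        except Exception:
            self._steps[step_key] = (digest, "failed", None)
            raise
        self._steps[step_key] = (digest, "completed", output)
        return output
```

In this model, changing any field inside `input_value` changes the hash, which is why every value that affects the output belongs there rather than in a closure captured by `execute`.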
Metrics
Use emit to record numeric metrics for the run. Metrics are shown by qt show and compared by qt compare.
await emit(ctx, "accuracy", 0.92)
await emit(ctx, "latency_ms", 120, "ms")
await emit(ctx, "tokens_used", 408)
Metric values must be numbers. The unit is optional and used for display.
Datasets
The Python SDK can load Hugging Face datasets natively through the local Quantiles dataset API. You can also choose to extend the Python DatasetSource class to load from other sources. Datasets loaded through the native dataset API are batched and wrapped in durable steps.
Wherever possible, we recommend using Quantiles-curated datasets on Hugging Face at huggingface.co/quantiles/datasets, because they’re curated and stable. You can, however, use the dataset(...) API to load any dataset on Hugging Face, and you can subclass DatasetSource to load datasets from anywhere else.
from pydantic import BaseModel
from quantiles import dataset
class PubMedQARow(BaseModel):
    sample_id: str
    question: str
    context: str
    gold_answer: str
# Prepare to iterate over the dataset at https://huggingface.co/datasets/quantiles/PubMedQA
ds = await dataset(
    ctx,
    source="huggingface://quantiles/PubMedQA",
    row_type=PubMedQARow,
    batch_size=25,
    config="pqa_labeled",
    split="train",
    max_rows=100,
    on_error="skip",
)
# ds.iter_rows() creates an AsyncIterator that yields PubMedQARow objects. You can iterate over
# it as you would any other async iterator. If you need more throughput, use the concurrency
# helpers instead. See below for more details.
async for row in ds.iter_rows():
    ...
Supported dataset options include:
- source: huggingface://... or hf://...
  - You can also iterate a custom DatasetSource manually with an async for loop
- row_type: a Pydantic model used to validate each row
- batch_size: number of rows to fetch per dataset batch
- config: Hugging Face dataset config
- split: dataset split
- revision: optional dataset revision
- max_rows: optional cap for local runs or smoke tests
- on_error: "fail" or "skip" for row validation errors
- transform: optional function that converts raw rows into row_type
Private Hugging Face datasets can use HF_TOKEN through the CLI dataset server. For non-Hugging Face private datasets, implement DatasetSource; custom sources run inside the Python workflow process instead of through the CLI dataset API.
Concurrency Helpers
If you need to process Dataset rows concurrently, so your eval finishes more quickly, you can use the map_dataset function.
from dataclasses import dataclass
from quantiles import collect_async_iter, map_dataset
@dataclass(frozen=True)
class EvalResult:
    # Details about the row's evaluation results should go here
    pass
async def evaluate(row: PubMedQARow) -> EvalResult:
    # Logic to evaluate the sample should go here
    return EvalResult()
# map_dataset returns an `AsyncIterator`, and `collect_async_iter` loops through all items in that
# iterator and stores them in a list[EvalResult].
#
# Use `collect_async_iter` with care, because it can hold very large result sets in memory.
results = await collect_async_iter(
    map_dataset(
        ds,
        evaluate,
        max_concurrency=8,
        # pass "input" here when result order should match dataset order
        # pass "completion" when you want results as soon as each task finishes
        yield_order="input",
    )
)
The map_dataset function returns an AsyncIterator that processes the rows of the given Dataset concurrently, in batches whose size is set by its max_concurrency parameter. For each row, it calls the callback you provide, which takes a single parameter representing one row in the dataset. Your callback should sample the model with the data in that row.
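To make the idea of bounded concurrency with input-ordered results concrete, here is a standalone sketch built only on asyncio. It is not the implementation of map_dataset: `bounded_map` is a hypothetical function, and it covers only the behaviour that `yield_order="input"` describes, using a sliding window of in-flight tasks.

```python
import asyncio
from collections import deque
from typing import AsyncIterator, Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    rows: AsyncIterator[T],
    fn: Callable[[T], Awaitable[R]],
    max_concurrency: int,
) -> AsyncIterator[R]:
    """Yield fn(row) for each row in input order, keeping at most
    max_concurrency calls in flight at once (hypothetical helper)."""
    pending: deque[asyncio.Task[R]] = deque()
    try:
        async for row in rows:
            pending.append(asyncio.ensure_future(fn(row)))
            if len(pending) >= max_concurrency:
                # Window is full: wait for the oldest task before reading more rows.
                yield await pending.popleft()
        while pending:
            # Source exhausted: drain the remaining tasks in order.
            yield await pending.popleft()
    finally:
        # Cancel leftovers if the consumer stops early or a task raises.
        for task in pending:
            task.cancel()
```

Because the window always waits on the oldest task, one slow row can hold back faster rows behind it. That trade-off is what a completion-ordered mode avoids, at the cost of results no longer lining up with dataset order.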
LLM Helpers
Many modern evals are built to measure the performance of large language models (LLMs). Some also use LLMs to help with the measurement, like LLM-as-judge evals. To make sampling from LLMs easier, the Python SDK provides a simple call_llm function for calling OpenAI chat models.
call_llm currently supports the "openai" provider and requires the OPENAI_API_KEY environment variable to be set.
from quantiles import call_llm
result = await call_llm(
    "openai",
    "gpt-5-nano",
    [
        {"role": "system", "content": "Return only yes, no, or maybe."},
        {"role": "user", "content": "Is aspirin an antiplatelet medication?"},
    ],
)
print(result["content"])
print(result["tokens"])
Metric Utilities
After sampling a model, you need to measure the results to determine how well the model performs. The Python SDK includes utilities to help take these measurements.
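For reference, the sketch below writes out the standard textbook definitions of the binary classification metrics and one common confidence interval in plain Python. It is not the SDK's code: all function names are hypothetical, returning 0.0 for an undefined ratio is an assumption, and the interval shown is a 95% normal approximation, which may differ from the method the SDK uses.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num: float, den: float) -> float:
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    return num / den if den else 0.0


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return safe_div(tp, tp + fp)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return safe_div(tp, tp + fn)


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return safe_div(tn, tn + fp)


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return safe_div(2 * p * r, p + r)


def normal_confidence_interval(scores: Sequence[float], z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for the mean; needs at least two scores."""
    center = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return center - half_width, center + half_width
```

With small samples or scores near 0 or 1, the normal approximation can produce bounds outside the valid range, so treat it as a rough guide rather than an exact interval.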
Use Statistics for aggregate numeric metrics:
from quantiles import Statistics
accuracy = Statistics.accuracy(correct_count, total_count)
mean_score = Statistics.mean(scores)
lower, upper = Statistics.confidence_interval(scores)
Use Classification for binary classifier metrics:
from quantiles import Classification
precision = Classification.precision(y_true, y_pred)
recall = Classification.recall(y_true, y_pred)
specificity = Classification.specificity(y_true, y_pred)
f1 = Classification.f1(y_true, y_pred)
Core Exports
The main SDK exports are:
| Export | Use |
|---|---|
| workflow | Define a named workflow that can be run by qt run |
| entrypoint | Dispatch from the CLI-provided workflow name to a workflow |
| step | Run or reuse a durable step inside a workflow |
| emit | Record a numeric metric on the current run |
| QuantilesClient | Low-level async client for the local qt server |
| QuantilesRun | Low-level handle for one run |
| StepParams | Parameters for QuantilesRun.step(...) in the low-level client API |
| Dataset | Typed dataset returned by dataset(...) |
| DatasetSource | Protocol for custom public or private dataset sources |
| dataset | Load typed Hugging Face datasets or custom DatasetSource objects |
| map_dataset | Apply an async function to dataset rows |
| iter_async_with_concurrency | Run async work over an iterable with bounded concurrency |
| collect_async_iter | Collect an async iterator into a list |
| call_llm | Call an OpenAI chat model and return content plus token usage |
| LLMMessage | Type for messages passed to call_llm |
| ModelProvider | Type for supported model providers |
| SystemMessage | Type for system messages passed to call_llm |
| UserMessage | Type for user messages passed to call_llm |
| Statistics | Compute aggregate metrics such as accuracy, mean, variance, and confidence intervals |
| Classification | Compute precision, recall, specificity, and F1 |
Low-Level Client
Most evals are best served by the workflow, step, and emit functions. These abstract away the lower-level details of encoding request payloads, issuing RPCs to the local CLI server, and decoding responses. In some cases, however, you might need direct access to that functionality, which the QuantilesClient class provides.
from quantiles import QuantilesClient, StepParams
async with QuantilesClient() as client:
    await client.health()
    run = await client.create_run("manual-eval", {"model": "gpt-5-nano"})
    result = await run.step(
        StepParams(
            key="call-model",
            input_value={"prompt": "hello"},
            execute=lambda: call_model("hello"),
        )
    )
    await run.emit("accuracy", 0.95)
    await run.complete()
By default, QuantilesClient reads the QUANTILES_BASE_URL environment variable to determine where the qt server is running. If that environment variable doesn’t exist, it defaults to http://127.0.0.1:8765.
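The lookup described above amounts to an environment read with a fallback. The snippet below is a minimal sketch of that behaviour, useful when you want to log or verify which server address a process will target. `resolve_base_url` is a hypothetical helper rather than an SDK function, and treating an empty value like an unset variable is an assumption.

```python
import os
from typing import Mapping

# Default address stated in this section.
DEFAULT_BASE_URL = "http://127.0.0.1:8765"


def resolve_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Return QUANTILES_BASE_URL when set, else the documented default."""
    # Assumption: an empty value is treated the same as an unset variable.
    return env.get("QUANTILES_BASE_URL") or DEFAULT_BASE_URL
```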