
Python SDK

The Python SDK brings Quantiles evaluation workflow primitives to Python, adding durable execution, efficient run handling, and built-in performance optimizations to existing codebases. It supports durable workflow steps, emitted metrics, typed dataset loading, simple concurrency helpers, generalized LLM calls, and common metric utilities.

Installation

Install the Python SDK with your preferred dependency management system. We recommend using the uv tool:

uv add quantiles

The Quantiles Python SDK requires Python 3.12 or newer.

Example

from quantiles import emit, entrypoint, step, workflow
from quantiles.types import JsonValue
from quantiles.workflow_context import WorkflowContext

# A workflow is an `async` function that `qt run` executes.
# The function accepts a `ctx` and, optionally also an `input_value` if you want
# to pass data through the `--input` flag on `qt run`
async def my_eval_handler(input_value: dict[str, JsonValue], ctx: WorkflowContext):
    # execute a durable step in this workflow. the input value, output value, and latency will be
    # recorded in the trace, and the output value will be cached in case this workflow fails later
    result = await step(
        ctx,
        step_key="call-model",
        input_value=input_value,
        execute=call_model,
    )

    # output metrics for this workflow
    await emit(ctx, "tokens_used", 42)
    await emit(ctx, "correct", is_correct(result))

    # the return value will be captured as part of the workflow trace
    return result

my_eval = workflow("my_eval", my_eval_handler)

if __name__ == "__main__":
    entrypoint(my_eval)

Run a Custom Evaluation Workflow

Custom Python workflows must be registered in a quantiles.toml or .quantiles.toml configuration file, so qt run knows which command to execute for a given custom code eval.

[benchmarks.my_eval]
type = "custom_code"
# This command assumes that your custom code eval is in a file called
# my_eval.py in the current directory.
command = ["uv", "run", "./my_eval.py"]

The eval workflow can then be run through the CLI:

qt run my_eval --input '{"model":"openai:gpt-5.5"}'

The SDK automatically reads the runtime environment variables that qt run and qt resume inject, such as QUANTILES_RUN_ID, QUANTILES_WORKFLOW_NAME, and QUANTILES_INPUT. See the CLI Reference for the full list.
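When debugging a workflow, it can help to inspect these variables yourself. The following is a minimal sketch, not part of the SDK: `read_run_context` is a hypothetical helper, and it assumes that QUANTILES_INPUT carries JSON text mirroring the `--input` flag, which this page does not state explicitly.

```python
import json
import os
from collections.abc import Mapping
from typing import Any


def read_run_context(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Collect the run metadata that `qt run` and `qt resume` inject."""
    raw_input = env.get("QUANTILES_INPUT")
    return {
        "run_id": env.get("QUANTILES_RUN_ID"),
        "workflow_name": env.get("QUANTILES_WORKFLOW_NAME"),
        # Assumption: the input is JSON-encoded, like the value given to `--input`.
        "input": json.loads(raw_input) if raw_input else None,
    }


# Passing an explicit mapping keeps the helper easy to exercise outside `qt run`.
example = read_run_context(
    {
        "QUANTILES_RUN_ID": "run-1",
        "QUANTILES_WORKFLOW_NAME": "my_eval",
        "QUANTILES_INPUT": '{"model": "example-model"}',
    }
)
```

In normal use the SDK handles this for you, so a helper like this is only useful for logging or troubleshooting.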

Core Concepts

Steps

Steps are durable units of work inside a workflow. Each step has:

  • step_key: a stable name for the work
  • input_value: optional JSON-serializable input used for cache invalidation
  • execute: the function to run if the step is not already cached

async def evaluate_sample(ctx: WorkflowContext, sample_id: str, question: str, model_id: str):
    return await step(
        ctx,
        step_key=f"sample:{sample_id}",
        input_value={
            "sample_id": sample_id,
            "question": question,
            "model_id": model_id,
        },
        execute=lambda: call_model(question, model_id),
    )

Use stable step keys. In dataset loops, prefer sample IDs or deterministic row numbers. Put every value that should invalidate the output into input_value, including the model, prompt version, row content, sampling parameters, judge configuration, and rubric version.
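To see why every relevant value belongs in input_value, consider how an input hash might be derived. This page does not document the hashing scheme Quantiles uses, so the sketch below is an assumption for illustration only: it hashes a canonical JSON encoding with SHA-256, and `input_hash` is a hypothetical name.

```python
import hashlib
import json
from typing import Any


def input_hash(input_value: Any) -> str:
    """Hash a JSON-serializable value so that equal content gives equal hashes."""
    # sort_keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(input_value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {"sample_id": "s1", "question": "Is X true?", "model_id": "model-a"}
reordered = {"model_id": "model-a", "question": "Is X true?", "sample_id": "s1"}
new_prompt = {**base, "prompt_version": "v2"}

# Same content in a different key order hashes identically.
same_content_matches = input_hash(base) == input_hash(reordered)
# Adding or changing any field produces a different hash.
new_field_differs = input_hash(base) != input_hash(new_prompt)
```

Under a scheme like this, a value that is left out of input_value (for example the prompt version) cannot influence the hash, so a cached output could be reused even though the prompt changed.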

If a completed step with the same key and input hash already exists in the run, Quantiles reuses its stored output. If a failed step has the same key and input hash, Quantiles retries it on resume. If the same step key appears with a different input hash in the same run, Quantiles treats that as a conflict and returns an error.
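The three rules above can be modeled with a small in-memory sketch. This is not how Quantiles is implemented; `RunStepCache`, `StepRecord`, and `StepConflictError` are hypothetical names, and the input hashes are treated as opaque strings.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class StepConflictError(Exception):
    """The same step key was seen with a different input hash in one run."""


@dataclass
class StepRecord:
    input_hash: str
    status: str  # "completed" or "failed"
    output: Any = None


@dataclass
class RunStepCache:
    records: dict[str, StepRecord] = field(default_factory=dict)

    def run_step(self, key: str, input_hash: str, execute: Callable[[], Any]) -> Any:
        record = self.records.get(key)
        if record is not None and record.input_hash != input_hash:
            raise StepConflictError(f"step {key!r} reused with a different input")
        if record is not None and record.status == "completed":
            return record.output  # reuse the stored output; execute() is not called
        # No record yet, or a failed record with the same hash: run the work (again).
        try:
            output = execute()
        except Exception:
            self.records[key] = StepRecord(input_hash, "failed")
            raise
        self.records[key] = StepRecord(input_hash, "completed", output)
        return output


cache = RunStepCache()
calls: list[str] = []


def call_model() -> str:
    calls.append("called")
    return "answer"


first = cache.run_step("call-model", "hash-a", call_model)   # executes the work
second = cache.run_step("call-model", "hash-a", call_model)  # reuses the stored output

try:
    cache.run_step("call-model", "hash-b", call_model)  # same key, different input
    conflict = False
except StepConflictError:
    conflict = True
```

In the model, the second call returns the stored output without invoking call_model again, and the third call raises because the key was reused with a different input hash.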

Metrics

Use emit to record numeric metrics for the run. Metrics are shown by qt show and compared by qt compare.

await emit(ctx, "accuracy", 0.92)
await emit(ctx, "latency_ms", 120, "ms")
await emit(ctx, "tokens_used", 408)

Metric values must be numbers. The unit is optional and used for display.

Datasets

The Python SDK can load Hugging Face datasets natively through the local Quantiles dataset API. You can also choose to extend the Python DatasetSource class to load from other sources. Datasets loaded through the native dataset API are batched and wrapped in durable steps.

Wherever possible, we recommend using the Quantiles-maintained datasets on Hugging Face at huggingface.co/quantiles/datasets, because they are curated and stable. You can, however, use the dataset(...) API to load any dataset on Hugging Face, and you can subclass DatasetSource to load datasets from anywhere else.

from pydantic import BaseModel
from quantiles import dataset

class PubMedQARow(BaseModel):
    sample_id: str
    question: str
    context: str
    gold_answer: str

# prepare to iterate over the dataset at https://huggingface.co/datasets/quantiles/PubMedQA
ds = await dataset(
    ctx,
    source="huggingface://quantiles/PubMedQA",
    row_type=PubMedQARow,
    batch_size=25,
    config="pqa_labeled",
    split="train",
    max_rows=100,
    on_error="skip",
)

# ds.iter_rows() creates an AsyncIterator that yields PubMedQARow objects. You can iterate over
# this iterator as you would any other, but if you need more efficiency, you can also use concurrency
# helpers to iterate through this iterator more efficiently. See below for more details.
async for row in ds.iter_rows():
    ...

Supported dataset options include:

  • source: huggingface://..., hf://...
    • You can also iterate a custom DatasetSource manually with an async for loop
  • row_type: a Pydantic model used to validate each row
  • batch_size: number of rows to fetch per dataset batch
  • config: Hugging Face dataset config
  • split: dataset split
  • revision: optional dataset revision
  • max_rows: optional cap for local runs or smoke tests
  • on_error: "fail" or "skip" for row validation errors
  • transform: optional function that converts raw rows into row_type
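The difference between the two on_error modes can be illustrated without the SDK. The sketch below is a standard-library stand-in under stated assumptions: `Row`, `parse_row`, and `validated_rows` are hypothetical names, and a plain dataclass takes the place of the Pydantic row_type.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator


@dataclass(frozen=True)
class Row:
    sample_id: str
    question: str


def parse_row(raw: dict[str, Any]) -> Row:
    """Stand-in for model validation: raises if a field is missing or mistyped."""
    sample_id, question = raw["sample_id"], raw["question"]
    if not isinstance(sample_id, str) or not isinstance(question, str):
        raise TypeError("sample_id and question must be strings")
    return Row(sample_id, question)


def validated_rows(
    raw_rows: Iterable[dict[str, Any]],
    parse: Callable[[dict[str, Any]], Row],
    on_error: str = "fail",
) -> Iterator[Row]:
    for raw in raw_rows:
        try:
            row = parse(raw)
        except (KeyError, TypeError, ValueError):
            if on_error == "skip":
                continue  # drop the invalid row and keep going
            raise  # on_error == "fail": surface the first bad row
        yield row


raw = [
    {"sample_id": "a", "question": "q1"},
    {"sample_id": "b"},  # missing "question"
    {"sample_id": "c", "question": "q3"},
]

kept = [row.sample_id for row in validated_rows(raw, parse_row, on_error="skip")]

try:
    list(validated_rows(raw, parse_row, on_error="fail"))
    failed = False
except KeyError:
    failed = True
```

With "skip", the malformed row is silently dropped, which suits smoke tests; with "fail", the first invalid row stops the iteration, which is usually safer for runs whose metrics you intend to compare.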

Private Hugging Face datasets can use HF_TOKEN through the CLI dataset server. For non-Hugging Face private datasets, implement DatasetSource; custom sources run inside the Python workflow process instead of through the CLI dataset API.

Concurrency Helpers

To process Dataset rows concurrently so that your eval finishes sooner, use the map_dataset function.

from dataclasses import dataclass

from quantiles import collect_async_iter, map_dataset

@dataclass(frozen=True)
class EvalResult:
    # Details about the row's evaluation results should go here
    pass

async def evaluate(row: PubMedQARow) -> EvalResult:
    # logic to evaluate the sample should go here
    return EvalResult()

# map_dataset returns an `AsyncIterator`, and `collect_async_iter` loops through all items in that
# iterator and stores them in a list[EvalResult]
#
# Use `collect_async_iter` with care, because it can potentially store very large datasets in memory.
results = await collect_async_iter(
    map_dataset(
        ds,
        evaluate,
        max_concurrency=8,
        # pass "input" here when result order should match dataset order
        # pass "completion" when you want results as soon as each task finishes
        yield_order="input",
    )
)

The map_dataset function returns an AsyncIterator that processes the rows of the given Dataset concurrently, in batches whose size is set by the max_concurrency parameter. For each row, it calls the callback you provide, which takes a single parameter representing one row of the dataset. Typically, this callback samples the model with the data in that row.
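The effect of yield_order is easiest to see in a plain asyncio sketch. The code below is not the map_dataset implementation; `bounded_map` is a hypothetical helper that caps concurrency with a semaphore and shows the two orderings side by side.

```python
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable, Iterable
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_map(
    items: Iterable[T],
    fn: Callable[[T], Awaitable[R]],
    max_concurrency: int,
    yield_order: str = "input",
) -> AsyncIterator[R]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:  # at most max_concurrency calls run at once
            return await fn(item)

    tasks = [asyncio.create_task(guarded(item)) for item in items]
    try:
        if yield_order == "input":
            for task in tasks:  # wait for each task in submission order
                yield await task
        else:
            for finished in asyncio.as_completed(tasks):  # earliest finisher first
                yield await finished
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished


async def slow_double(x: int) -> int:
    await asyncio.sleep(0.06 / x)  # larger inputs finish sooner
    return x * 2


async def demo() -> tuple[list[int], list[int]]:
    in_order = [r async for r in bounded_map([1, 2, 3], slow_double, 3, "input")]
    by_completion = [r async for r in bounded_map([1, 2, 3], slow_double, 3, "completion")]
    return in_order, by_completion


in_order, by_completion = asyncio.run(demo())
```

Note that this sketch creates one task per item up front, which is fine for a short list but would not suit a large dataset; it is meant only to show how input order and completion order differ.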

LLM Helpers

Many modern evals are built to measure the performance of large language models (LLMs). Some also use LLMs to help with the measurement, like LLM-as-judge evals. To make sampling from LLMs easier, the Python SDK provides a simple call_llm function for calling OpenAI chat models.

call_llm currently supports the "openai" provider and requires the OPENAI_API_KEY environment variable to be set.

from quantiles import call_llm

result = await call_llm(
    "openai",
    "gpt-5-nano",
    [
        {"role": "system", "content": "Return only yes, no, or maybe."},
        {"role": "user", "content": "Is aspirin an antiplatelet medication?"},
    ],
)

print(result["content"])
print(result["tokens"])
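Even with a strict system prompt, model output often carries stray whitespace, capitalization, or punctuation, so it is usually worth normalizing the content before scoring it. The helper below is a hypothetical sketch, not part of the SDK; it assumes the content is a plain string, as the example above suggests.

```python
from typing import Optional

ALLOWED_ANSWERS = {"yes", "no", "maybe"}


def normalize_answer(content: str) -> Optional[str]:
    """Map raw model text to 'yes', 'no', or 'maybe'; return None for anything else."""
    # Trim whitespace first, then surrounding punctuation or quotes, then lowercase.
    cleaned = content.strip().strip(".!\"'").lower()
    return cleaned if cleaned in ALLOWED_ANSWERS else None


answer = normalize_answer(" Yes.\n")
unparseable = normalize_answer("Yes, aspirin is an antiplatelet medication.")
```

Returning None for unparseable output, rather than guessing, lets the workflow count such rows separately instead of silently scoring them as wrong.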

Metric Utilities

After sampling a model, you need to score the results to determine how well the model performs. The Python SDK includes utilities to help take these measurements.

Use Statistics for aggregate numeric metrics:

from quantiles import Statistics

accuracy = Statistics.accuracy(correct_count, total_count)
mean_score = Statistics.mean(scores)
lower, upper = Statistics.confidence_interval(scores)
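For intuition about what such aggregates compute, here is a standard-library sketch. It is not the Statistics implementation: this page does not say which confidence interval method the SDK uses, so the normal approximation with z = 1.96 (roughly 95%) is an assumption, and the function names are hypothetical.

```python
import math
import statistics
from collections.abc import Sequence


def accuracy(correct_count: int, total_count: int) -> float:
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return correct_count / total_count


def mean_confidence_interval(scores: Sequence[float], z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for the mean (z=1.96 is roughly 95%)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * standard_error, mean + z * standard_error


scores = [1.0, 0.0, 1.0, 1.0]
acc = accuracy(3, 4)
lower, upper = mean_confidence_interval(scores)
```

With only four samples the interval is wide and its upper bound exceeds 1.0, which is a reminder that a normal approximation is crude for small samples of 0/1 scores.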

Use Classification for binary classifier metrics:

from quantiles import Classification

precision = Classification.precision(y_true, y_pred)
recall = Classification.recall(y_true, y_pred)
specificity = Classification.specificity(y_true, y_pred)
f1 = Classification.f1(y_true, y_pred)
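The four metrics follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions and is not the Classification implementation; it assumes labels are encoded as 1 (positive) and 0 (negative) and returns 0.0 when a ratio is undefined, both of which are illustrative choices rather than documented SDK behavior.

```python
from collections.abc import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_ratio(numerator: float, denominator: float) -> float:
    # Convention for this sketch: report 0.0 when the ratio is undefined.
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": safe_ratio(tn, tn + fp),
        "f1": safe_ratio(2 * precision * recall, precision + recall),
    }


metrics = binary_metrics(y_true=[1, 1, 0, 0, 1], y_pred=[1, 0, 0, 1, 1])
```

In this example there are two true positives, one false positive, one true negative, and one false negative, so precision, recall, and F1 are each 2/3 and specificity is 1/2.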

Core Exports

The main SDK exports are:

| Export | Use |
| --- | --- |
| workflow | Define a named workflow that can be run by qt run |
| entrypoint | Dispatch from the CLI-provided workflow name to a workflow |
| step | Run or reuse a durable step inside a workflow |
| emit | Record a numeric metric on the current run |
| QuantilesClient | Low-level async client for the local qt server |
| QuantilesRun | Low-level handle for one run |
| StepParams | Parameters for QuantilesRun.step(...) in the low-level client API |
| Dataset | Typed dataset returned by dataset(...) |
| DatasetSource | Protocol for custom public or private dataset sources |
| dataset | Load typed Hugging Face datasets or custom DatasetSource objects |
| map_dataset | Apply an async function to dataset rows |
| iter_async_with_concurrency | Run async work over an iterable with bounded concurrency |
| collect_async_iter | Collect an async iterator into a list |
| call_llm | Call an OpenAI chat model and return content plus token usage |
| LLMMessage | Type for messages passed to call_llm |
| ModelProvider | Type for supported model providers |
| SystemMessage | Type for system messages passed to call_llm |
| UserMessage | Type for user messages passed to call_llm |
| Statistics | Compute aggregate metrics such as accuracy, mean, variance, and confidence intervals |
| Classification | Compute precision, recall, specificity, and F1 |

Low-Level Client

Most evals are best served by the workflow, step, and emit functions. These abstract away the lower-level details of encoding request payloads, issuing RPCs to the local CLI server, and decoding responses. In some cases, however, you may need direct access to that functionality, and the QuantilesClient class provides it.

from quantiles import QuantilesClient, StepParams

async with QuantilesClient() as client:
    await client.health()
    run = await client.create_run("manual-eval", {"model": "gpt-5-nano"})

    result = await run.step(
        StepParams(
            key="call-model",
            input_value={"prompt": "hello"},
            execute=lambda: call_model("hello"),
        )
    )

    await run.emit("accuracy", 0.95)
    await run.complete()

By default, QuantilesClient reads the QUANTILES_BASE_URL environment variable to determine where the qt server is running. If that environment variable doesn’t exist, it defaults to http://127.0.0.1:8765.
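If you want to log or verify which server address will be used, the same lookup can be reproduced in a few lines. This is a sketch of the rule as described above, not the client's actual code; `resolve_base_url` is a hypothetical helper.

```python
import os
from collections.abc import Mapping

DEFAULT_BASE_URL = "http://127.0.0.1:8765"


def resolve_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Return QUANTILES_BASE_URL when it is set, otherwise the documented default."""
    return env.get("QUANTILES_BASE_URL", DEFAULT_BASE_URL)


from_default = resolve_base_url({})
from_env = resolve_base_url({"QUANTILES_BASE_URL": "http://127.0.0.1:9000"})
```

The port 9000 in the second call is an arbitrary example value.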
