
Access Datasets

Building and evaluating AI systems requires data, and many evaluation workflows rely on datasets that are too large or too slow to load row by row over the internet. Quantiles provides a local dataset server and REST API through the qt CLI, and a Python SDK helper for efficient dataset loading.

The current qt CLI dataset implementation natively supports HuggingFace datasets (internally, it uses the HuggingFace Dataset Viewer API). The Python SDK provides a simple, typed AsyncIterator interface for such datasets. For other dataset sources, Python users writing custom code evals can implement the DatasetSource class and iterate samples themselves. Native support for a wider variety of dataset sources is planned.

Currently, any client can use the Dataset REST API, but only the Python SDK has a higher-level convenience API for datasets.

How it works

When a Python workflow calls dataset(...) with a HuggingFace URL, the SDK talks to the local qt server:

The SDK initializes the dataset through POST /dataset/init.
The CLI resolves the dataset config and split.
The SDK requests rows in batches through POST /datasets/batch.
Each dataset batch is wrapped in a step, so dataset loading has the same observability, failure recovery and caching features as the rest of the workflow.
The CLI immutably caches fetched batches locally on disk, to avoid re-fetching from the internet in subsequent runs.

When dataset(...) receives a DatasetSource object instead of a HuggingFace URL string, the SDK initializes that source in the Python workflow process and calls load_batch(...) directly for rows. Custom batch loading is still wrapped in Quantiles steps, so custom sources keep the same validation, iteration, observability, failure recovery and batch-level caching behavior at the workflow level.
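The batch loop described above can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation: `InMemorySource` and `iter_batched_rows` are hypothetical names, and only the `load_batch(offset, batch_size)` signature is taken from this page. Step wrapping, validation, and caching are left out.

```python
import asyncio
from typing import Any, AsyncIterator, Optional


class InMemorySource:
    """Hypothetical stand-in for a dataset source; only load_batch is modeled."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    async def load_batch(self, offset: int, batch_size: int) -> list[dict[str, Any]]:
        # Slicing past the end returns an empty list, which signals the end.
        return self._rows[offset : offset + batch_size]


async def iter_batched_rows(
    source: InMemorySource,
    batch_size: int,
    max_rows: Optional[int] = None,
) -> AsyncIterator[dict[str, Any]]:
    """Yield rows batch by batch until an empty batch or max_rows is reached."""
    offset = 0
    yielded = 0
    while True:
        batch = await source.load_batch(offset, batch_size)
        if not batch:
            return  # an empty batch marks the end of the dataset
        for row in batch:
            if max_rows is not None and yielded >= max_rows:
                return
            yield row
            yielded += 1
        offset += len(batch)


async def main(max_rows: Optional[int] = None) -> list[int]:
    source = InMemorySource([{"id": i} for i in range(7)])
    rows = iter_batched_rows(source, batch_size=3, max_rows=max_rows)
    return [row["id"] async for row in rows]


print(asyncio.run(main(max_rows=5)))  # [0, 1, 2, 3, 4]
```

Because rows are yielded as each batch arrives, memory use in this sketch stays proportional to the batch size rather than to the dataset size.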

Python Example

Use dataset from the Python SDK inside a workflow handler.


from pydantic import BaseModel
from quantiles import collect_async_iter, dataset, entrypoint, map_dataset, workflow
 
 
class Row(BaseModel):
    sample_id: str
    question: str
    context: str
    gold_answer: str
 
 
async def handler(input_value, ctx):
    ds = await dataset(
        ctx,
        # Use the `huggingface://` prefix to pull from HuggingFace datasets
        # using the native Quantiles dataset loader. For private HuggingFace
        # datasets, set `HF_TOKEN` before running the workflow.
        source="huggingface://$ORGANIZATION/$DATASET",
        row_type=Row,
        batch_size=25,
        max_rows=100,
        on_error="skip",
    )
 
    async def evaluate(row: Row):
        return {
            "sample_id": row.sample_id,
            "has_context": row.context != "",
        }
 
    results = await collect_async_iter(
        map_dataset(
            ds,
            evaluate,
            max_concurrency=8,
            yield_order="input",
        )
    )
 
    return {"rows_evaluated": len(results)}
 
 
pubmedqa = workflow("my_eval", handler)
 
if __name__ == "__main__":
    entrypoint(pubmedqa)

Wire the workflow as a custom_code eval in quantiles.toml, then run it with qt run:


[benchmarks.my_eval]
type = "custom_code"
command = ["python", "my_eval.py"]


qt run my_eval

Configure Dataset Loading

Use these settings to tell Quantiles where to fetch a dataset, which rows to load, and how raw rows should be validated before your workflow processes them. For reproducible HuggingFace-backed benchmarks, set the source, config, split, and revision explicitly.

Dataset Parameters

The Python SDK dataset(...) helper accepts:

Option	Description	Default	Notes
`ctx`	Workflow context passed to the workflow handler
`source`	Dataset URI, currently `huggingface://...` or `hf://...`, or a Python `DatasetSource` implementation		`config`, `split` and `revision` only apply to HuggingFace URL sources
`row_type`	Pydantic model used to validate rows		Make sure this is a class that inherits from `BaseModel`
`batch_size`	Number of rows to fetch per batch	`100`	You usually won’t need to change this value
`on_error`	What to do when a row cannot be validated against the `BaseModel` in `row_type`: `"fail"` stops iterating, `"skip"` skips the row and continues iterating	`"fail"`
`transform`	Optional function that converts raw row dictionaries into `row_type`	`None`
`config`	Hugging Face dataset config	`None`
`split`	Hugging Face dataset split
`revision`	Optional Hugging Face dataset revision
`max_rows`	Optional limit for smoke tests or smaller benchmark runs		Set this to a low value for quick, local checks before you run a larger benchmark

Supported Sources

The Python SDK supports Hugging Face datasets with either the huggingface://$OWNER/$DATASET or hf://$OWNER/$DATASET format. For non-HuggingFace public or private data sources, implement the Python DatasetSource protocol and pass an instance as source.
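To make the two accepted URI forms concrete, here is a small sketch of how such a source string could be split into owner and dataset name. `parse_hf_source` is a hypothetical helper written for this page, not part of the SDK, and the SDK's own parsing may accept or reject inputs differently.

```python
def parse_hf_source(source: str) -> tuple[str, str]:
    """Split a huggingface:// or hf:// URI into (owner, dataset)."""
    for prefix in ("huggingface://", "hf://"):
        if source.startswith(prefix):
            path = source[len(prefix):]
            break
    else:
        # No known prefix matched.
        raise ValueError(f"unsupported dataset source: {source!r}")

    owner, sep, name = path.partition("/")
    if not owner or not sep or not name or "/" in name:
        raise ValueError(f"expected $OWNER/$DATASET, got: {path!r}")
    return owner, name


print(parse_hf_source("huggingface://quantiles/PubMedQA"))
print(parse_hf_source("hf://quantiles/PubMedQA"))
```

Both calls print the same pair, since the two prefixes are treated as equivalent.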

For example, if you pass source="huggingface://quantiles/PubMedQA", the CLI fetches dataset metadata and rows from the Hugging Face Dataset Viewer API at https://datasets-server.huggingface.co. For the qt CLI to read it, the dataset must be available through that API. If you have a private dataset repository on HuggingFace, set the HF_TOKEN environment variable before running the workflow:


export HF_TOKEN="secret-token"
qt run "$EVAL_NAME"

Note: the above qt run ... command assumes you have set up a custom_code eval in your configuration file of the desired name.

Custom Python Sources

Use DatasetSource when rows need to come from a non-HuggingFace source, such as a private database, private object store, internal API, or local file. Custom sources execute in the Python workflow process. The qt CLI still records workflow steps and run metadata, but it does not fetch custom source rows through the Dataset REST API.


from quantiles import DatasetSource, JsonValue, dataset
 
 
class PrivateDatasetSource(DatasetSource):
    @property
    def source_id(self) -> str:
        return "private-dataset:v1"
 
    async def initialize(self) -> JsonValue:
        return {"source": self.source_id}
 
    async def load_batch(
        self,
        offset: int,
        batch_size: int,
    ) -> list[dict[str, JsonValue]]:
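        # `load_rows_from_private_system` is a placeholder for your own
        # data access code; it is not provided by Quantiles.
        return await load_rows_from_private_system(
            offset=offset,
            limit=batch_size,
        )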
 
 
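# Inside a workflow handler, pass an instance of your source to dataset(...)
# and iterate it with a standard "async for" loop, for example:
#
#     ds = await dataset(ctx, source=PrivateDatasetSource(), row_type=Row)
#     async for row in ds.iter_rows():
#         ...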

For custom sources, do not pass config, split, or revision; those options only apply to HuggingFace URL sources.

Configs and Splits

If you pass config, Quantiles attempts to use that configuration when loading the dataset from HuggingFace. If you omit it, the CLI infers a config from the Hugging Face Dataset Viewer API.

If you pass split, Quantiles validates that the dataset split exists. If you omit it, the CLI chooses the first available split in this priority order:

test
validation
eval
train

If none of those names exist, Quantiles uses the first split reported by Hugging Face.
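The split selection rules above can be expressed as a short function. This is a sketch of the documented behavior only: `choose_split` and `SPLIT_PRIORITY` are hypothetical names, and the exact error raised by Quantiles for a missing split is an assumption.

```python
from typing import Optional

# Priority order used when no split is requested, as listed above.
SPLIT_PRIORITY = ("test", "validation", "eval", "train")


def choose_split(available: list[str], requested: Optional[str] = None) -> str:
    """Pick a split from `available`, in the order Hugging Face reports them."""
    if not available:
        raise ValueError("dataset reports no splits")
    if requested is not None:
        # An explicit split is validated, never silently replaced.
        if requested not in available:
            raise ValueError(f"split {requested!r} not found in {available}")
        return requested
    for name in SPLIT_PRIORITY:
        if name in available:
            return name
    # None of the preferred names exist: fall back to the first reported split.
    return available[0]


print(choose_split(["train", "validation"]))     # validation
print(choose_split(["train", "test"], "train"))  # train
print(choose_split(["dev", "holdout"]))          # dev
```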

For fully reproducible HuggingFace-backed benchmarks, set config, split, and revision explicitly whenever possible.

Row Validation and Transforms

Rows are validated with the Pydantic model you pass as row_type.

Use on_error="fail" when invalid rows should stop the run. Use on_error="skip" when the dataset may contain rows that don’t match the schema of the Pydantic model you passed in row_type.
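The difference between the two modes can be illustrated without the SDK. The sketch below uses a standard-library dataclass and a hand-written `parse_row` check as a stand-in for Pydantic validation; `parse_row` and `validated_rows` are hypothetical names, and the real SDK validates through your `row_type` model instead.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator


@dataclass
class Row:
    sample_id: str
    question: str


def parse_row(raw: dict[str, Any]) -> Row:
    """Stand-in for Pydantic validation: both fields must be strings."""
    sample_id = raw.get("sample_id")
    question = raw.get("question")
    if not isinstance(sample_id, str) or not isinstance(question, str):
        raise ValueError(f"invalid row: {raw!r}")
    return Row(sample_id=sample_id, question=question)


def validated_rows(
    raws: Iterable[dict[str, Any]],
    on_error: str = "fail",
) -> Iterator[Row]:
    """Yield valid rows; invalid rows either stop iteration or are skipped."""
    if on_error not in ("fail", "skip"):
        raise ValueError(f"unknown on_error value: {on_error!r}")
    for raw in raws:
        try:
            row = parse_row(raw)
        except ValueError:
            if on_error == "skip":
                continue  # drop the bad row and keep going
            raise  # "fail": propagate and stop iterating
        yield row


RAW_ROWS = [
    {"sample_id": "a", "question": "q1"},
    {"sample_id": "b"},  # missing "question", so this row is invalid
    {"sample_id": "c", "question": "q3"},
]

print([row.sample_id for row in validated_rows(RAW_ROWS, on_error="skip")])
```

With `on_error="fail"`, iterating the same input raises at the second row after yielding the first.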

If raw dataset rows do not map directly onto your row_type schema, pass a function as the transform parameter. It receives each raw row dictionary and returns a row_type instance.


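from typing import Any


def transform_row(raw: dict[str, Any]) -> Row:
    # Raw values are not all strings ("contexts" is a list), so the
    # dictionary values are typed as Any rather than str.
    return Row(
        sample_id=str(raw.get("id", "")),
        question=str(raw.get("question", "")),
        context="\n".join(str(c) for c in raw.get("contexts") or []),
        gold_answer=str(raw.get("final_decision", "")),
    )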
 
 
ds = await dataset(
    ctx,
    source="huggingface://$ORGANIZATION/$DATASET",
    row_type=Row,
    split="train",
    transform=transform_row,
    on_error="skip",
)

Efficient Iteration

The dataset API offers both an imperative and a declarative way to iterate over dataset rows.

Sequential processing

Use iter_rows(), the imperative variant, to process rows sequentially:


async for row in ds.iter_rows():
    ...

Concurrent processing

Use map_dataset, the declarative variant, to process rows concurrently:


results = map_dataset(
    ds,
    evaluate_row,
    max_concurrency=8,
    yield_order="input",
)

The `yield_order` parameter

The map_dataset function returns an AsyncIterator, and the order in which it yields items is affected by the yield_order parameter. Pass one of two values to this parameter:

yield_order="input" - returns an iterator that yields results in the same order as they appear in the dataset.
yield_order="completion" - returns an iterator that yields results as soon as each concurrent worker finishes.
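The two orderings can be demonstrated with plain asyncio. This sketch is not how map_dataset is implemented: `map_concurrently` is a hypothetical function that starts a task for every item up front, which a streaming implementation would likely avoid. It only shows what "input" and "completion" order mean for the consumer.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_concurrently(
    items: Iterable[T],
    fn: Callable[[T], Awaitable[R]],
    max_concurrency: int,
    yield_order: str = "input",
) -> AsyncIterator[R]:
    if yield_order not in ("input", "completion"):
        raise ValueError(f"unknown yield_order: {yield_order!r}")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(item: T) -> R:
        # At most max_concurrency calls to fn run at the same time.
        async with semaphore:
            return await fn(item)

    tasks = [asyncio.ensure_future(run(item)) for item in items]
    if yield_order == "input":
        # Await tasks in submission order, even if later ones finish first.
        for task in tasks:
            yield await task
    else:
        # Yield each result as soon as its task finishes.
        for finished in asyncio.as_completed(tasks):
            yield await finished


async def slow_echo(delay: float) -> float:
    await asyncio.sleep(delay)
    return delay


async def demo(yield_order: str) -> list[float]:
    delays = [0.3, 0.1, 0.2]
    results = map_concurrently(
        delays, slow_echo, max_concurrency=3, yield_order=yield_order
    )
    return [value async for value in results]


print(asyncio.run(demo("input")))       # [0.3, 0.1, 0.2]
print(asyncio.run(demo("completion")))  # [0.1, 0.2, 0.3]
```

Input order is easier to join back to the dataset, while completion order lets fast rows flow downstream without waiting behind slow ones.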

The `collect_async_iter` function

This function is provided for convenience to collect every element in the AsyncIterator returned by map_dataset into an in-memory list:


# This is an AsyncIterator.
results_iter = map_dataset(
    ds,
    evaluate_row,
    max_concurrency=8,
    yield_order="input",
)
 
# This iterates the entire iterator and stores all the
# yielded elements in an in-memory list.
elts_list = await collect_async_iter(results_iter)

Avoid using collect_async_iter unless you know the result set is small enough to fit in memory. For larger benchmarks, stream results with a standard async for loop and calculate per-sample metrics as you iterate.
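As an illustration of the streaming approach, the sketch below computes accuracy over an async iterator while holding only two counters in memory. `fake_results` is a hypothetical stand-in for the iterator returned by map_dataset, and the `"correct"` field is an assumed result shape chosen for this example.

```python
import asyncio
from typing import AsyncIterator


async def fake_results() -> AsyncIterator[dict[str, bool]]:
    """Hypothetical stand-in for the iterator returned by map_dataset."""
    for correct in (True, False, True, True):
        yield {"correct": correct}


async def streaming_accuracy(results: AsyncIterator[dict[str, bool]]) -> float:
    """Consume results one at a time, keeping only running counters."""
    total = 0
    correct = 0
    async for result in results:
        total += 1
        correct += int(result["correct"])
    return correct / total if total else 0.0


print(asyncio.run(streaming_accuracy(fake_results())))  # 0.75
```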

Runtime Behavior

For HuggingFace URL sources, each dataset batch that qt caches is keyed by dataset ID, config, split, revision, offset, and limit, to reduce the chance of naming conflicts. Cached batches are stored as Parquet files. Empty batches are cached too, so repeated scans can finish cleanly without refetching the end of a dataset.
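To show why keying on all six fields avoids collisions, here is a sketch that derives a cache file path from them. The hashing scheme, file layout, and the name `batch_cache_path` are assumptions made for illustration; this page does not document how qt actually names its cache files.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def batch_cache_path(
    cache_dir: Path,
    dataset_id: str,
    config: Optional[str],
    split: str,
    revision: Optional[str],
    offset: int,
    limit: int,
) -> Path:
    """Map one batch request to a deterministic Parquet file path."""
    # Serialize every key field; sort_keys keeps the encoding stable.
    key = json.dumps(
        {
            "dataset_id": dataset_id,
            "config": config,
            "split": split,
            "revision": revision,
            "offset": offset,
            "limit": limit,
        },
        sort_keys=True,
    )
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.parquet"


cache_dir = Path("cache")
first = batch_cache_path(cache_dir, "quantiles/PubMedQA", None, "test", None, 0, 25)
again = batch_cache_path(cache_dir, "quantiles/PubMedQA", None, "test", None, 0, 25)
second = batch_cache_path(cache_dir, "quantiles/PubMedQA", None, "test", None, 25, 25)

print(first == again)   # True: same request, same file
print(first == second)  # False: a different offset gets its own file
```

Changing any one field, including revision, yields a different path, so a batch fetched for one revision is never served for another.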

Custom DatasetSource objects do not use the CLI Dataset REST API or the CLI dataset cache. They run in the Python workflow process. Quantiles still records dataset initialization metadata and wraps each custom batch fetch in a durable step.

Dataset `step`s

The datasets API is an extension of the foundational step API. HuggingFace URL sources use the Dataset REST API; custom DatasetSource objects run locally in the Python workflow process. The Python SDK records dataset initialization metadata and wraps each batch fetch in durable steps:

dataset-init
dataset-batch-<offset>

Those steps are visible in qt show "$RUN_ID" --json and participate in caching and resume behavior.