Get Started with Quantiles

Quantiles is a local-first CLI and SDK for running AI evaluation workflows with fast, continuous feedback. Teams can iterate on model behavior, prompts, agent workflows, and debugging scripts while preserving the metrics and run histories needed to understand what improved, what regressed, and why.

With Quantiles, teams can rely on built-in infrastructure for running, recording, analyzing, and comparing AI evaluations:

Local evaluation workflows through the qt CLI, with support for benchmark runs, custom evals, smoke tests, and repeatable execution from the development environment.
Automatic run recording for eval runs, steps, metrics, events, inputs, outputs, errors, timing, metadata, and final results.
Local-first execution history stored on disk so teams can inspect past runs, reproduce results, and maintain evaluation evidence without depending on a hosted service.
Step-level traces that make it possible to debug individual samples, inspect intermediate inputs and outputs, and understand how a result was produced.
Resilient execution with caching, retries, durable step reuse, and resumable runs so failed or interrupted evaluations can restart without repeating completed work.
Efficient dataset handling for loading, limiting, slicing, and iterating over benchmark or custom evaluation datasets.
Model sampling infrastructure for running evaluations against demo models or provider-backed models with consistent configuration and recorded model inputs.
Reliable scoring and grading using deterministic metrics, reference-based checks, rubric scoring, model judges, or custom evaluators.
Metrics and metadata storage for tracking model, prompt, dataset, scorer, judge, sampling, and run configuration across evaluation runs.
Integration with coding agents using reusable Quantiles instructions for running evals, inspecting results, comparing runs, and summarizing regressions directly from the development workflow.
Run inspection and comparison directly from the same qt CLI, including sample-level review, metric differences, changed outputs, failed steps, and regression analysis.
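The resilient-execution idea in the list above (caching, retries, and reuse of completed steps) can be illustrated in plain Python. This is a minimal sketch of the general technique, not the Quantiles SDK or its API: `step_key`, `run_step`, the on-disk JSON cache layout, and `flaky_grade` are all hypothetical names chosen for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def step_key(name, inputs):
    """Derive a stable cache key from the step name and its JSON-serializable inputs."""
    payload = json.dumps({"step": name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(cache_dir, name, inputs, fn, max_attempts=3):
    """Return a cached result if one exists; otherwise run fn with retries and cache it."""
    path = Path(cache_dir) / f"{step_key(name, inputs)}.json"
    if path.exists():
        # Completed work is reused instead of being repeated.
        return json.loads(path.read_text(encoding="utf-8"))["output"]

    last_error = None
    for _ in range(max_attempts):
        try:
            output = fn(**inputs)
        except Exception as exc:  # this sketch retries on any failure
            last_error = exc
            continue
        record = {"step": name, "inputs": inputs, "output": output}
        path.write_text(json.dumps(record), encoding="utf-8")
        return output
    raise RuntimeError(f"step {name!r} failed after {max_attempts} attempts") from last_error


# Hypothetical step that fails on its first call, then succeeds.
calls = []


def flaky_grade(answer, reference):
    calls.append((answer, reference))
    if len(calls) == 1:
        raise TimeoutError("simulated transient failure")
    return {"correct": answer.strip() == reference.strip()}


with tempfile.TemporaryDirectory() as cache_dir:
    inputs = {"answer": "42 ", "reference": "42"}
    first = run_step(cache_dir, "grade", inputs, flaky_grade)   # retried once, then cached
    second = run_step(cache_dir, "grade", inputs, flaky_grade)  # served from the cache
```

Keying the cache on the step name plus its inputs means a restarted run only recomputes steps whose inputs changed or that never finished. A real implementation would also need to account for changes to the step's code and configuration.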

Key capabilities

Evaluation workflows quickly outgrow one-off scripts once teams need caching, retries, dataset handling, metrics capture, and run comparison. Quantiles gives teams those primitives without slowing down iteration:

Run built-in benchmarks with minimal setup and configuration.
Build custom evals with standard Python and familiar, Pythonic patterns.
Run eval workflows locally from the qt CLI, with reproducible results.
Automatically record runs, steps, metrics, events, inputs, and final outputs.
Store execution history locally in open data formats.
Debug individual samples with full step-by-step traces, inputs, and outputs.
Inspect and compare runs directly from the same CLI.
Recover from failures by default, with step caching and restartable runs.

This architecture lets teams run evaluations on the order of 10,000 samples immediately, without deploying and maintaining infrastructure or managing complex cloud environments.
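To make run comparison concrete, here is a small sketch of the underlying idea: take the metrics from two runs, compute per-metric differences, and flag drops beyond a tolerance. It does not use or describe the qt CLI. `MetricDelta`, `compare_runs`, the metric names, and the numbers are made up for illustration, and the sketch assumes that higher is better for every metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDelta:
    name: str
    baseline: float
    candidate: float

    @property
    def change(self) -> float:
        return self.candidate - self.baseline


def compare_runs(baseline, candidate, tolerance=0.01):
    """Compare metrics present in both runs.

    Returns (deltas, regressions), where regressions lists the metrics that
    dropped by more than `tolerance`. Assumes higher values are better.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = [MetricDelta(name, baseline[name], candidate[name]) for name in shared]
    regressions = [d.name for d in deltas if d.change < -tolerance]
    return deltas, regressions


# Illustrative values only; these are not results from any real run.
baseline_run = {"accuracy": 0.82, "exact_match": 0.75, "format_valid": 0.99}
candidate_run = {"accuracy": 0.85, "exact_match": 0.70, "format_valid": 0.99}

deltas, regressions = compare_runs(baseline_run, candidate_run)
for d in deltas:
    print(f"{d.name}: {d.baseline:.2f} -> {d.candidate:.2f} ({d.change:+.2f})")
print("regressions:", regressions)
```

The tolerance keeps small run-to-run noise from being reported as a regression; the right value depends on the metric and the sample count.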

Python SDK

Use the official Python SDK to build custom evals with primitives like durable steps, structured inputs and outputs, and metrics emission, all following familiar Pythonic patterns. The SDK integrates tightly with the qt CLI’s local API for running, recording, and analyzing benchmarks.

The CLI’s local API can also be used without the SDK, but you then have to build the integration yourself.

Local-first experience

The qt CLI and Python SDK are designed to run your evals locally, without internet access. Code executes on your machine, data stays in open local formats, and teams retain control over their datasets, eval logic, benchmark results, and release decisions.
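As an illustration of what keeping data in an open local format can look like, the sketch below appends per-sample records to a JSON Lines file and reads them back with only the Python standard library. The page does not say which formats Quantiles uses, so JSON Lines, the file name, the record fields, and the helper names `append_record` and `load_records` are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def append_record(path, record):
    """Append one JSON object per line so the file stays readable by any JSON tooling."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_records(path):
    """Read every non-empty line back as a dictionary."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    run_file = Path(tmp) / "run-0001.jsonl"  # hypothetical file name
    append_record(run_file, {"sample_id": "s1", "output": "Paris", "score": 1.0})
    append_record(run_file, {"sample_id": "s2", "output": "Lyon", "score": 0.0})
    records = load_records(run_file)

# Aggregate without any hosted service: everything needed is on disk.
mean_score = sum(r["score"] for r in records) / len(records)
```

Because each line is an independent JSON object, an interrupted run leaves a file that is still valid up to the last complete line.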

Get started with local evals.