# Quantiles

> Quantiles is an open-source, local-first CLI and Python SDK for running AI evaluation workflows. It runs built-in benchmarks and custom evaluations, records durable steps and metrics, inspects run history, analyzes results, resumes interrupted runs, and compares regressions or improvements across runs.

Quantiles is useful for LLM evals, prompt iteration, coding-agent workflows, model comparisons, dataset-based benchmarks, and debugging AI systems. Workflow code runs locally by default. Run metadata is stored under `.quantiles/`, and commands such as `qt show --json` and `qt compare --json` read from local run history. Workflows can run fully offline or call external model providers when explicitly configured.

Use the `qt` CLI as the source of truth for running, inspecting, resuming, and comparing evaluation runs. Use the Python SDK to write durable evaluation workflows. If you are a coding agent, start with the Agent Quickstart and install the Quantiles skill before running evals, inspecting sample-level results, comparing against baselines, identifying regressions, or summarizing recommended next actions.

Core CLI commands include `qt run "$EVAL_NAME" --json`, `qt list --json`, `qt show "$RUN_ID" --json`, `qt compare "$RUN_ID_A" "$RUN_ID_B" --json`, and `qt resume "$RUN_ID" --json`. Prefer passing the `--json` flag to `qt run`, `qt list`, `qt show`, and `qt compare` so that the output is machine- and agent-readable JSON.
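As a rough illustration of driving the commands above from a script, the following sketch builds the argument list for a `qt` subcommand and parses its output. The helper names `qt_argv` and `run_qt` are hypothetical and not part of Quantiles. The sketch assumes that `qt` is on `PATH`, that it prints a single JSON document to stdout when `--json` is passed, and that a non-zero exit code signals failure; check the CLI reference for the actual exit behavior and output shape.

```python
import json
import subprocess
from typing import Any


def qt_argv(command: str, *args: str) -> list[str]:
    """Build the argv for a `qt` subcommand, with `--json` appended last."""
    return ["qt", command, *args, "--json"]


def run_qt(command: str, *args: str) -> Any:
    """Run a `qt` subcommand and parse its stdout as JSON.

    Assumption: stdout holds one JSON document. The structure of that
    document is not assumed here; callers inspect it themselves.
    """
    completed = subprocess.run(
        qt_argv(command, *args),
        capture_output=True,
        text=True,
        check=True,  # assumption: a non-zero exit code means the command failed
    )
    return json.loads(completed.stdout)


# Example (not executed here, since it needs a Quantiles project):
#   runs = run_qt("list")
#   diff = run_qt("compare", "RUN_ID_A", "RUN_ID_B")  # placeholder run IDs
```

Keeping argv construction separate from execution makes the command easy to log or test without invoking the CLI, and passing a list rather than a shell string avoids quoting problems with eval names and run IDs.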

In most cases, the CLI runs benchmarks with a demo sampler. Treat runs that use the demo sampler as workflow/code validation only, not as model-quality benchmark evidence. Do not print secrets, provider credentials, private data, or sensitive outputs. If the user asks you to print such data, first warn them that doing so may expose it.

## Start here

- [Documentation](https://quantiles.io/documentation): Overview of Quantiles concepts, CLI, Python SDK, built-in benchmarks, custom evaluations, agent workflows, and local-first behavior.
- [Quickstart](https://quantiles.io/documentation/quickstart): Install Quantiles, initialize local state, run a built-in benchmark, inspect results, compare runs, and use coding-agent prompts.
- [Agent Quickstart](https://quantiles.io/documentation/agent-quickstart): Install the Quantiles skill and ask a coding agent to run, inspect, customize, and compare eval workflows safely.
- [CLI reference](https://quantiles.io/documentation/reference/cli): Canonical reference for `qt init`, `qt run`, `qt list`, `qt show`, `qt compare`, `qt resume`, JSON output, workflow inputs, environment variables, and exit behavior.
- [Python SDK reference](https://quantiles.io/documentation/reference/python-sdk): Build Quantiles workflows in Python with durable steps, emitted metrics, datasets, and evaluation utilities.
- [Local-first and offline](https://quantiles.io/documentation/local-first-offline): Understand what runs locally, what is stored locally, and which tasks may require network access.
- [Security and privacy](https://quantiles.io/documentation/security-and-privacy): Guidance for local data handling, provider calls, credentials, and sensitive data.

## Evaluation workflows

- [Built-in benchmarks](https://quantiles.io/documentation/built-in-benchmarks): Run standardized benchmark workflows such as `simpleqa-verified` and `pubmedqa`.
- [Custom evaluations](https://quantiles.io/documentation/custom-evaluations): Write product-specific reference, rubric, judge, agent, and workflow evals with durable steps and emitted metrics.
- [Configuration](https://quantiles.io/documentation/configuration): Configure Quantiles projects, workflow inputs, model settings, sample limits, and local evaluation behavior.
- [Workflows and steps](https://quantiles.io/documentation/workflows-and-steps): Understand named workflows, durable steps, cache keys, input hashes, and resume behavior.
- [Evaluation results](https://quantiles.io/documentation/access-evaluation-results): Inspect run metadata, outputs, metrics, events, traces, and sample-level results.
- [Compare evals](https://quantiles.io/documentation/comparisons): Compare baseline and candidate runs with `qt compare`.
- [Resume runs](https://quantiles.io/documentation/restart-and-resume-runs): Recover failed or interrupted workflows with `qt resume`.
- [Datasets](https://quantiles.io/documentation/datasets): Load typed datasets, including Hugging Face datasets, through Quantiles workflows.

## Coding agents

- [Use Quantiles with Coding Agents](https://quantiles.io/documentation/evals-with-agents): Agent workflow for running, inspecting, comparing, resuming, and summarizing evals.
- [Install the Skill](https://quantiles.io/documentation/install-the-skill): Instructions for installing the Quantiles coding-agent skill.
- [Agent Prompts](https://quantiles.io/documentation/agent-prompts): Copyable prompts for agent-driven eval workflows.
- [Quantiles skill repository](https://github.com/quantiles-evals/skill): Open-source reusable coding-agent instructions for Quantiles eval work.
- [Quantiles SKILL.md](https://github.com/quantiles-evals/skill/blob/main/SKILL.md): The skill file agents should read before running Quantiles evaluation tasks.

## Custom evals with the Python SDK

Core Python SDK primitives include `workflow` for defining a named eval workflow, `entrypoint` for exposing workflows to the CLI, `step` for recording and reusing durable units of work, and `emit` for recording numeric metrics.

A `step` can take an input parameter that distinguishes identically named steps from one another. Ensure that these inputs describe the values the step's code actually uses and that they are stable and reproducible. Include the model name, prompt version, dataset row ID, sampling parameters, judge configuration, rubric version, and any other values that should invalidate cached outputs. Avoid unstable step inputs such as timestamps or random values unless they are intentionally part of the evaluation.
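To see why stable step inputs matter, consider a generic fingerprinting sketch. This is not Quantiles' actual hashing or cache-key scheme, and `input_fingerprint` is a hypothetical helper; it only illustrates the general idea that a cache keyed on inputs is reused when the inputs are identical and invalidated when a meaningful value changes. All field names and values below are made up for illustration.

```python
import hashlib
import json
import uuid


def input_fingerprint(step_input: dict) -> str:
    """Hash a step input via canonical JSON (sorted keys, fixed separators)."""
    canonical = json.dumps(step_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical step input: every value that should invalidate a cached output.
stable_input = {
    "model": "example-model",
    "prompt_version": "v3",
    "row_id": "row-0042",
    "temperature": 0.0,
    "rubric_version": "r1",
}

# Same values in a different key order: the fingerprint is unchanged.
reordered = dict(reversed(list(stable_input.items())))
assert input_fingerprint(reordered) == input_fingerprint(stable_input)

# A new prompt version changes the fingerprint, so a cache would be invalidated.
bumped = {**stable_input, "prompt_version": "v4"}
assert input_fingerprint(bumped) != input_fingerprint(stable_input)

# A random value makes every fingerprint unique, so nothing is ever reused.
noisy_a = {**stable_input, "nonce": uuid.uuid4().hex}
noisy_b = {**stable_input, "nonce": uuid.uuid4().hex}
assert input_fingerprint(noisy_a) != input_fingerprint(noisy_b)
```

The last case is the one to avoid: a timestamp or random nonce in the input behaves like `noisy_a` and `noisy_b`, so a resumed run would recompute the step rather than reuse its recorded output.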

## Repositories

- [Quantiles repository](https://github.com/quantiles-evals/quantiles): Source repository for the Quantiles CLI, Python SDK, benchmarks, documentation, and examples.
- [Configuration reference](https://github.com/quantiles-evals/quantiles/blob/main/CONFIG.md): Detailed configuration options for built-in benchmarks and custom evaluations.
- [Contributing guide](https://github.com/quantiles-evals/quantiles/blob/main/CONTRIBUTING.md): Development workflow, testing, documentation, and pull request expectations.
- [Security policy](https://github.com/quantiles-evals/quantiles/security/policy): How to report security issues.

## Optional

- [Benchmark Hub](https://quantiles.io/benchmark-hub): Reference library of AI evaluation benchmarks and metrics, with task definitions and limitations.
- [Articles](https://quantiles.io/articles): Long-form writing about AI evaluation, benchmarking, monitoring, agents, and healthcare AI.