The evaluation layer for the AI stack

Measure AI system behavior across models, configurations,
prompts, and workflows from prototype to production.

Built by AI and technology experts from OpenAI, Stanford Medicine, Microsoft, and Genentech.

Evaluations give teams the evidence to understand behavior, improve performance, and build better systems.

Run an eval

Custom evaluations

Turn product-specific workflows into repeatable evals with reference outputs, task-specific rubrics, and sample-level metrics.

  1. Reference Labels: gold answers + accepted variants
  2. Scoring Criteria: exact match, rubric, pass rate
  3. Slice Accuracy: worst slice 68.9%
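
As a rough illustration of the three steps above, the sketch below scores outputs by exact match against a gold answer or any accepted variant, then reports the pass rate per slice and the worst slice. It is not the Quantiles API: the `Sample` fields, the `normalize` rule, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One eval item. All field names are hypothetical."""
    slice_name: str
    output: str
    gold: str
    accepted_variants: list[str] = field(default_factory=list)


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not count as misses (an assumed rule).
    return " ".join(text.lower().split())


def is_correct(sample: Sample) -> bool:
    # Exact match against the gold answer or any accepted variant.
    accepted = {normalize(sample.gold), *map(normalize, sample.accepted_variants)}
    return normalize(sample.output) in accepted


def slice_accuracy(samples: list[Sample]) -> dict[str, float]:
    # Pass rate per slice: correct samples / total samples in the slice.
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for s in samples:
        totals[s.slice_name] += 1
        passes[s.slice_name] += is_correct(s)
    return {name: passes[name] / totals[name] for name in totals}


def worst_slice(samples: list[Sample]) -> tuple[str, float]:
    # The slice with the lowest pass rate.
    acc = slice_accuracy(samples)
    name = min(acc, key=acc.get)
    return name, acc[name]


# Toy data, invented for illustration only.
samples = [
    Sample("dates", "March 3, 2024", "2024-03-03", ["March 3, 2024"]),
    Sample("dates", "3/3/24", "2024-03-03", ["March 3, 2024"]),
    Sample("units", "5 km", "5 km"),
    Sample("units", "5  KM", "5 km"),
]

print(slice_accuracy(samples))
print(worst_slice(samples))
```

Reporting the worst slice alongside the overall pass rate keeps a weak category from hiding behind a strong average.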

Baseline comparisons

Compare prompts, models, and edge-case inputs against baseline runs to measure each iteration and catch regressions early.
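
One way to picture a baseline comparison is a per-sample diff between two runs. The sketch below assumes each run is a mapping from a sample ID to a pass/fail flag; the function name, the result keys, and the data are hypothetical and do not reflect how Quantiles stores runs.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    # Only samples present in both runs can be compared fairly.
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no sample IDs")

    # A regression passed in the baseline and fails in the candidate.
    regressions = [sid for sid in shared if baseline[sid] and not candidate[sid]]
    # A fix failed in the baseline and passes in the candidate.
    fixes = [sid for sid in shared if not baseline[sid] and candidate[sid]]

    base_rate = sum(baseline[sid] for sid in shared) / len(shared)
    cand_rate = sum(candidate[sid] for sid in shared) / len(shared)
    return {
        "baseline_pass_rate": base_rate,
        "candidate_pass_rate": cand_rate,
        "delta": cand_rate - base_rate,
        "regressions": regressions,
        "fixes": fixes,
    }


# Toy runs, invented for illustration only.
baseline = {"a": True, "b": True, "c": False, "d": True}
candidate = {"a": True, "b": False, "c": True, "d": True}

report = compare_runs(baseline, candidate)
print(report)
```

In this toy case the aggregate pass rate is unchanged, yet sample "b" regressed. That is why the diff is done per sample and not only on the headline number.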

Evidence

Resumable eval runs

Resume interrupted evaluation workflows, reuse cached steps, and preserve run history across long-running evals.

$ qt resume 3
model_v1.7.2, edge cases, prompt variants
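
The general idea behind resumable runs can be sketched with an append-only checkpoint file: each finished sample is written to disk right away, and a later run skips whatever is already recorded. This illustrates the technique only. It is not how `qt resume` is implemented, and the file format and function names are assumptions.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(sample_ids, evaluate, checkpoint_path: Path) -> dict:
    # Load results recorded by an earlier, possibly interrupted, run.
    done = {}
    if checkpoint_path.exists():
        for line in checkpoint_path.read_text().splitlines():
            record = json.loads(line)
            done[record["id"]] = record["result"]

    with checkpoint_path.open("a") as fh:
        for sid in sample_ids:
            if sid in done:
                continue  # cached step: reuse the stored result
            result = evaluate(sid)
            # Persist each result as soon as it exists, so an
            # interruption loses at most the sample in flight.
            fh.write(json.dumps({"id": sid, "result": result}) + "\n")
            fh.flush()
            done[sid] = result
    return done


# Demo: the evaluator fails once on "s3" to simulate an interruption.
calls: list[str] = []
interrupt = {"armed": True}


def evaluate(sid: str) -> int:
    if sid == "s3" and interrupt["armed"]:
        interrupt["armed"] = False
        raise RuntimeError("simulated interruption")
    calls.append(sid)
    return len(sid)


ids = ["s1", "s2", "s3", "s4"]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run.jsonl"
    try:
        run_with_checkpoint(ids, evaluate, path)
    except RuntimeError:
        pass  # first attempt stops partway through
    results = run_with_checkpoint(ids, evaluate, path)

print(calls)
print(results)
```

Each sample is evaluated exactly once across both attempts: the second call picks up at "s3" without repeating "s1" and "s2".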
CLI-first tooling

Eval on command

Build and run evaluations directly from the Quantiles CLI. Configure evaluation logic in code, execute runs from the terminal, and review metrics, outputs, and failure patterns as part of your development workflow.

$ qt run my-eval

eval: my-eval
status: completed
created: loading-pacific-time
duration: 2.23s
input: {"model":"openai:gpt-5.5","num_samples":1000}
output: {"samples_completed":1000}
error: -

$ qt show 1 --json
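
For scripted review, a run summary like the one above can be turned into a dictionary. The sketch below parses the `key: value` text exactly as it is displayed on this page. The actual CLI output format and the schema behind `--json` are not documented here, so treat the parser as an assumption. The `created` line is left out because its value is not rendered in the source.

```python
import json

# Summary text copied from the example run above.
SUMMARY = """\
eval: my-eval
status: completed
duration: 2.23s
input: {"model":"openai:gpt-5.5","num_samples":1000}
output: {"samples_completed":1000}
error: -
"""


def parse_summary(text: str) -> dict:
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ": " only; JSON values contain colons too.
        key, _, value = line.partition(": ")
        value = value.strip()
        if value == "-":
            record[key] = None  # "-" is shown for an empty field
        elif value.startswith("{"):
            record[key] = json.loads(value)
        else:
            record[key] = value
    return record


summary = parse_summary(SUMMARY)
requested = summary["input"]["num_samples"]
completed = summary["output"]["samples_completed"]
print(summary["status"], f"{completed}/{requested} samples")
```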

Run evals with the same config at every scale.

Scale the same eval config from local smoke tests to full evaluation runs without rewriting the harness. Quantiles uses local-first execution for fast iteration, checkpointed execution for resumable eval runs, and sample-level results for deep analysis.
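
A minimal sketch of the "same config, different scale" idea: one config object drives both a smoke test and a full run, and only the sample count changes. `EvalConfig`, `select_samples`, and the seeding scheme are hypothetical and are not the Quantiles configuration format.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical eval config; field names are assumptions."""
    model: str
    num_samples: int
    seed: int = 0


def select_samples(dataset_size: int, config: EvalConfig) -> list[int]:
    # A seeded shuffle makes the selection deterministic, so a smaller
    # run evaluates a prefix of what the larger run evaluates.
    rng = random.Random(config.seed)
    indices = list(range(dataset_size))
    rng.shuffle(indices)
    return indices[: config.num_samples]


full = EvalConfig(model="openai:gpt-5.5", num_samples=1000)
# The smoke test reuses every setting and overrides only the scale.
smoke = replace(full, num_samples=10)

smoke_ids = select_samples(5000, smoke)
full_ids = select_samples(5000, full)
print(len(smoke_ids), len(full_ids))
```

Because the smoke run covers a subset of the full run's samples, a failure seen locally should reappear at full scale, not vanish due to a different draw.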

Run evals with speed
Run with agents

Use coding agents to run evaluations and guide model iteration.

Coding agent prompt

Coding agents can run evaluations, compare eval runs, and surface regressions or performance gains before a change is merged, shipped, or promoted.

Codex

Use Codex to run eval commands as models, prompts, or configs change.

Claude Code

Trigger CLI benchmark and eval runs with Claude Code.

Gemini

Compare eval results with Gemini across model updates.

Any coding agent

Works with coding agents that can run shell commands and read evaluation outputs.

Open SDK, local CLI

Run Quantiles locally with the open-source CLI

A full-featured, local-native toolchain for running and analyzing evals at scale.
