Local-first AI evaluation
for developers and agents

Run, analyze, and compare reproducible evaluations across models,
prompts, benchmarks, and agentic workflows from the terminal.


Built by AI and technology experts from

Evaluations give teams the evidence to understand behavior, improve performance, and build better systems.

Run an eval

Custom evaluations

Turn product-specific workflows into repeatable evals with reference outputs, task-specific rubrics, and sample-level metrics.

Reference Labels: gold answers + accepted variants
Scoring Criteria: exact match, rubric, pass rate
Slice Accuracy: worst slice 68.9%
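As a rough illustration of the scoring ideas above, the sketch below computes exact match against a gold answer plus accepted variants, an overall pass rate, and per-slice accuracy with the worst slice. It is plain Python and does not use the Quantiles API; the sample fields (`slice`, `prediction`, `gold`, `variants`), the helper names, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold: str, variants: tuple[str, ...] = ()) -> bool:
    """True if the prediction equals the gold answer or any accepted variant."""
    accepted = {normalize(gold), *(normalize(v) for v in variants)}
    return normalize(prediction) in accepted


def is_pass(sample: dict) -> bool:
    """Score one sample dict (hypothetical schema, see the toy data below)."""
    return exact_match(
        sample["prediction"], sample["gold"], tuple(sample.get("variants", ()))
    )


def slice_accuracy(samples: list[dict]) -> dict[str, float]:
    """Accuracy per slice, so a weak slice is not hidden by the overall average."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for sample in samples:
        totals[sample["slice"]] += 1
        hits[sample["slice"]] += is_pass(sample)
    return {name: hits[name] / totals[name] for name in totals}


# Toy data, invented for illustration only.
samples = [
    {"slice": "dates", "prediction": "March 3", "gold": "3 March", "variants": ["March 3"]},
    {"slice": "dates", "prediction": "April 1", "gold": "1 May"},
    {"slice": "names", "prediction": "Ada Lovelace", "gold": "ada lovelace"},
]

per_slice = slice_accuracy(samples)
worst_slice = min(per_slice, key=per_slice.get)
pass_rate = sum(is_pass(s) for s in samples) / len(samples)
```

Reporting the worst slice next to the overall pass rate shows whether a good headline number hides a subgroup that fails far more often.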

Baseline comparisons

Compare prompts, models, and edge-case inputs against baseline runs to measure each iteration and catch regressions early.
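One way to catch regressions that an aggregate score can hide is to diff pass/fail outcomes sample by sample. The sketch below is a minimal, library-independent example; the function name, the sample IDs, and the `dict[str, bool]` representation of a run are assumptions for illustration, not the format Quantiles stores.

```python
def sample_regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare per-sample pass/fail outcomes between two runs.

    Only samples present in both runs are compared.
    """
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed in the baseline, fails in the candidate.
        "regressed": sorted(k for k in shared if baseline[k] and not candidate[k]),
        # Failed in the baseline, passes in the candidate.
        "fixed": sorted(k for k in shared if not baseline[k] and candidate[k]),
    }


# Toy outcomes, invented for illustration only.
baseline = {"s1": True, "s2": True, "s3": False, "s4": True}
candidate = {"s1": True, "s2": False, "s3": True, "s4": True}

diff = sample_regressions(baseline, candidate)
baseline_pass_rate = sum(baseline.values()) / len(baseline)
candidate_pass_rate = sum(candidate.values()) / len(candidate)
```

In this toy case both runs have the same pass rate, yet one sample regressed and another was fixed, which is why a sample-level diff is worth keeping alongside the summary metric.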

Evidence

Resumable eval runs

Resume interrupted evaluation workflows, reuse cached steps, and preserve run history across long-running evals.

$ qt resume 3
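To make the idea of reusing cached steps concrete, here is a generic sketch of checkpointing: each step result is written to disk under a key derived from the step name and its input, so a restarted process can skip work that already finished. This is not how Quantiles implements resumption; `StepCache`, its methods, and `fake_model` are hypothetical names for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


class StepCache:
    """Persist JSON-serializable step results keyed by (step_key, input)."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, step_key: str, input_value: Any) -> Path:
        # sort_keys makes the hash independent of dict key order.
        payload = json.dumps([step_key, input_value], sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def run(self, step_key: str, input_value: Any, execute: Callable[[Any], Any]) -> Any:
        path = self._path(step_key, input_value)
        if path.exists():
            # Checkpoint hit: return the stored result without re-executing.
            return json.loads(path.read_text())
        result = execute(input_value)
        path.write_text(json.dumps(result))
        return result


calls: list[dict] = []


def fake_model(input_value: dict) -> dict:
    """Stand-in for an expensive model call; records each real execution."""
    calls.append(input_value)
    return {"answer": input_value["question"].upper()}


cache_dir = Path(tempfile.mkdtemp())
first = StepCache(cache_dir).run("call-model", {"question": "hi"}, fake_model)
# A fresh StepCache over the same directory stands in for a resumed process.
second = StepCache(cache_dir).run("call-model", {"question": "hi"}, fake_model)
```

Because the second call finds the checkpoint file, `fake_model` executes only once even though the step is requested twice.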

CLI-first tooling

Eval on command

Build and run evaluations directly from the Quantiles CLI. Configure evaluation logic in code, execute runs from the terminal, and review metrics, outputs, and failure patterns as part of your development workflow.


runs.py

from quantiles import emit, entrypoint, step, workflow
from quantiles.types import JsonValue
from quantiles.workflow_context import WorkflowContext

async def my_eval_handler(
    input_value: dict[str, JsonValue],
    ctx: WorkflowContext,
):
    result = await step(
        ctx,
        step_key="call-model",
        input_value=input_value,
        execute=call_model,
    )

metric	base	new
accuracy	91.4%	95.0%
specificity	80.2%	78.9%
latency	412ms	438ms
precision	88.7%	92.1%
recall	84.9%	89.3%
ece	0.081	0.088
f1	86.8%	90.5%
auroc	93.2%	96.4%
log loss	0.31	0.22
brier	0.094	0.061
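Reading the table above requires knowing which direction is "better" for each metric. The sketch below encodes the table's values and flags metrics that moved the wrong way. The choice of lower-is-better metrics (latency, ECE, log loss, Brier score) follows their standard definitions; the variable and function names are assumptions for this example and not part of the Quantiles CLI.

```python
# Metrics where a smaller value is an improvement.
LOWER_IS_BETTER = {"latency", "ece", "log loss", "brier"}

# (base, new) pairs copied from the table; percentages as percent, latency in ms.
runs: dict[str, tuple[float, float]] = {
    "accuracy": (91.4, 95.0),
    "specificity": (80.2, 78.9),
    "latency": (412.0, 438.0),
    "precision": (88.7, 92.1),
    "recall": (84.9, 89.3),
    "ece": (0.081, 0.088),
    "f1": (86.8, 90.5),
    "auroc": (93.2, 96.4),
    "log loss": (0.31, 0.22),
    "brier": (0.094, 0.061),
}


def find_regressions(metrics: dict[str, tuple[float, float]]) -> list[str]:
    """Return metrics whose new value is worse than the base value."""
    flagged = []
    for name, (base, new) in metrics.items():
        worse = new > base if name in LOWER_IS_BETTER else new < base
        if worse:
            flagged.append(name)
    return flagged


regressed = find_regressions(runs)
```

For these numbers, specificity, latency, and ECE are flagged, even though most headline metrics improved.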

qt run my-eval

eval: my-eval
status: completed
created: loading-pacific-time
duration: 2.23s
input: {"model":"openai:gpt-5.5","num_samples":1000}
output: {"samples_completed":1000}
error: -

qt show 1 --json

Local-first, sample-level results, checkpoints

Run evals with the same config at every scale.

Scale the same eval config from local smoke tests to full evaluation runs without rewriting the harness. Quantiles uses local-first execution for fast iteration, checkpointed execution for resumable eval runs, and sample-level results for deep analysis.
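One simple way to keep a single config across scales is to derive the smoke-test variant from the full config and change only the sample count. The sketch below assumes a hypothetical `EvalConfig` dataclass whose fields mirror the `model` and `num_samples` keys in the run input shown earlier; the smoke-test size of 20 is arbitrary, and Quantiles may represent configs differently.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical eval config; fields mirror the run input shown earlier."""

    model: str
    num_samples: int


full = EvalConfig(model="openai:gpt-5.5", num_samples=1000)

# Same config, smaller sample count for a quick local smoke test.
smoke = replace(full, num_samples=20)

smoke_input = asdict(smoke)
```

Deriving the smoke config from the full one means any other setting changes in a single place and both scales stay in sync.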

Run evals with speed

Run with agents

Use coding agents to run evaluations and guide model iteration.

Coding agent prompt


Coding agents can run evaluations, compare eval runs, and surface regressions or performance gains before a change is merged, shipped, or promoted.


Run my-eval, compare it against the most recent completed baseline, identify sample-level regressions, and recommend the highest-impact fixes for reliability, cost, and latency.

Worked for 20s

I ran my-eval and compared it with baseline run 42.

The regression is localized, not suite-wide. pass_rate dropped from 87.2% to 84.6%, with 14 sample-level regressions out of 1,500 samples. The failures are concentrated in multi-step tool use, ambiguous retrieval cases, and incomplete final answers. p95 latency increased 18%, mostly from repeated tool calls, and average cost per sample increased 11%.

Highest-impact fixes:

Add input-confirmation logic before irreversible tool calls.
Pin the 14 regressed samples as a replay set for future changes.
Cache repeated retrieval calls inside the workflow to reduce p95 latency and cost.
Add a final-answer validation step to catch partial or unsupported responses before emitting results.

I’d prioritize the tool-use fix first. It covers the largest regression cluster and is unlikely to increase cost materially.
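The suggestion above to cache repeated retrieval calls could be sketched as an in-memory memoizer for an async function, so that repeated or concurrent requests for the same query share one underlying call. This is a generic asyncio pattern, not a Quantiles feature; `cached_async`, `retrieve`, and the queries are hypothetical names and data for this example.

```python
import asyncio
from typing import Awaitable, Callable


def cached_async(
    fn: Callable[[str], Awaitable[list[str]]]
) -> Callable[[str], Awaitable[list[str]]]:
    """Memoize an async function by query string.

    Caveat: a failed call stays cached too, so production code would
    likely evict entries whose task raised an exception.
    """
    tasks: dict[str, asyncio.Future] = {}

    async def wrapper(query: str) -> list[str]:
        if query not in tasks:
            # Start the call once; later callers await the same task.
            tasks[query] = asyncio.ensure_future(fn(query))
        return await tasks[query]

    return wrapper


executed: list[str] = []


async def retrieve(query: str) -> list[str]:
    """Stand-in for a slow retrieval call; records each real execution."""
    executed.append(query)
    await asyncio.sleep(0)
    return [f"doc for {query}"]


async def main() -> list[list[str]]:
    cached = cached_async(retrieve)
    return await asyncio.gather(
        cached("refund policy"),
        cached("refund policy"),
        cached("shipping"),
    )


results = asyncio.run(main())
```

Three lookups are requested, but only two distinct queries reach `retrieve`, which is the mechanism by which this kind of cache can cut both latency and cost.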