Agents

Use Quantiles with Coding Agents

Quantiles gives coding agents a CLI-native workflow for running evaluations, inspecting and analyzing evaluation inputs and outputs, and comparing results across runs. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.

Agent Quickstart
Quickly connect a coding agent to the Quantiles eval workflow.
Agent Skill
Give coding agents reusable Quantiles instructions for running, inspecting, comparing, and resuming evals.

What agents can do

With Quantiles, coding agents can:

Run and/or configure a built-in Quantiles benchmark (e.g. pubmedqa, simpleqa-verified)
Run and/or configure a custom code Quantiles evaluation
Inspect and analyze an evaluation or benchmark run
Compare two evaluation or benchmark runs
Resume a failed or interrupted evaluation or benchmark run
Debug failed samples, metrics, scorers, or run outputs
Write a new custom evaluation using the Quantiles Python SDK
Convert an ad-hoc Python evaluation script into a durable Quantiles evaluation
Summarize regressions, failures, and recommended next steps

Agent evaluation workflow

Agents using Quantiles follow the workflow below. SKILL.md defines the reusable Quantiles instructions, and the qt CLI executes each step.


Install SKILL.md -> Run eval -> Inspect run -> Compare against a baseline when relevant -> Inspect and analyze inputs and outputs
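
As a rough illustration of how a script or agent might assemble the qt commands behind these steps, here is a minimal Python sketch. It relies only on what this page states: qt list, qt run, qt show, and qt compare accept --json, and an interrupted run is resumed with qt resume "$RUN_ID". Placing --json after any other arguments is an assumption, and the helper names qt_command and qt_resume are hypothetical. The sketch prints commands rather than executing them.

```python
import shlex

# Subcommands this page lists as accepting --json.
JSON_SUBCOMMANDS = {"list", "run", "show", "compare"}


def qt_command(subcommand, *args):
    """Build an argv list for qt, appending --json where this page says it applies.

    Assumption: --json may follow the other arguments.
    """
    argv = ["qt", subcommand, *args]
    if subcommand in JSON_SUBCOMMANDS:
        argv.append("--json")
    return argv


def qt_resume(run_id):
    """Build the resume command for an interrupted run (qt resume "$RUN_ID")."""
    return ["qt", "resume", run_id]


if __name__ == "__main__":
    # Print instead of executing so a human can review the exact command first.
    # "run-123" is a placeholder run ID, not real Quantiles output.
    for argv in (qt_command("list"), qt_resume("run-123")):
        print(shlex.join(argv))
```

Building commands as argv lists (rather than one shell string) keeps run IDs from being re-interpreted by the shell and makes it easy to report the exact command used.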

Install the skill

Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:


Please install the Quantiles skill at github.com/quantiles-evals/skill

Alternatively, copy SKILL.md into your agent’s skill directory.

Troubleshooting

Most agent issues come from unclear instructions or a missing local setup. The agent may not know which skill to use, may not have qt available in its shell, or may run too broad an eval too early. Use direct language so the agent can correct course.

Problem	Solution
Agent does not use the skill	Confirm the Quantiles SKILL.md is installed in the location your coding agent expects.
`qt` command not found	Run `which qt` or `qt --version` to confirm that Quantiles is installed and available in the agent’s shell. If the CLI is missing, run the install command.
Run is too expensive	Start with a small sample limit before running the full benchmark. The SKILL.md instructs agents to ask for approval before running provider-backed or full benchmark evaluations.
Results are hard to compare	Compare completed runs from the same evaluation or benchmark with the same dataset, model settings, prompt version, rubric, and benchmark configuration.
Agent summarizes vaguely	Ask the agent to ground every conclusion in run data and require specific metrics, sample IDs, observed failures, changed outputs, and concrete recommendations instead of general observations. See an example prompt.
Configuration is not applied	Check whether `quantiles.toml` or `.quantiles.toml` exists in the current working directory. Confirm the run is using the expected model, sample limit, and benchmark configuration.
Provider-backed run fails	Verify that the required provider API key is configured without printing the key value, then check for authentication, quota, rate-limit, timeout, or provider availability errors.
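
Several of the checks in this table can be scripted before a run starts. The following is a minimal Python sketch, not part of Quantiles: it looks for qt on the PATH, looks for quantiles.toml or .quantiles.toml in the working directory, and confirms that a provider API key is set without ever printing its value. The function name preflight_findings and the environment variable name EXAMPLE_PROVIDER_API_KEY are hypothetical; substitute the variable your provider actually requires.

```python
import os
import shutil
from pathlib import Path

# Config file names mentioned in the troubleshooting table.
CONFIG_NAMES = ("quantiles.toml", ".quantiles.toml")


def preflight_findings(api_key_var, cwd="."):
    """Return human-readable findings; an empty list means every check passed."""
    findings = []
    # Equivalent to `which qt`: is the CLI visible in this environment?
    if shutil.which("qt") is None:
        findings.append("qt not found on PATH")
    # A missing config file is only a note: it explains unapplied settings.
    if not any((Path(cwd) / name).is_file() for name in CONFIG_NAMES):
        findings.append(f"no quantiles.toml or .quantiles.toml in {cwd}")
    # Report only whether the key is set; never echo its value.
    if not os.environ.get(api_key_var):
        findings.append(f"{api_key_var} is not set")
    return findings


if __name__ == "__main__":
    # EXAMPLE_PROVIDER_API_KEY is a placeholder name, not a real Quantiles setting.
    for finding in preflight_findings("EXAMPLE_PROVIDER_API_KEY"):
        print("preflight:", finding)
```

Run the script from the same shell and directory the agent uses, because both the PATH and the working directory can differ from your interactive terminal.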

Guardrails and Cost Control

Agent-driven evals can become expensive or hard to interpret if too many variables change at once. Give your agent direct operating rules before it runs qt, especially around sample counts, model configuration, comparisons, and secrets.

Use instructions like these:

Start with a small sample count before a full benchmark
Ask before running a full benchmark
Use limit for smoke tests and sample subsets
Use --json for qt list, qt run, qt show, and qt compare when reporting to an agent or script
Treat demo model runs as eval workflow validation only, not model-quality results
Do not run provider-backed evals unless the user asks for a real model run or provides a model
Check required provider API keys without printing their values
Keep the dataset, model settings, prompt version, and sample count stable when comparing runs
Never print API keys, .env values, secrets, or private credentials
Report the exact command used
Preserve every run ID
Use qt compare for run-to-run changes instead of manually eyeballing outputs
Use qt resume "$RUN_ID" for interrupted eval runs
Do not edit benchmark definitions, datasets, or scoring code unless I ask
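
The rule about keeping the dataset, model settings, prompt version, and sample count stable can be checked mechanically before trusting a comparison. The sketch below assumes you have each run's metadata as a Python dict; the key names (dataset, model_settings, prompt_version, sample_count) and the function comparability_issues are hypothetical and do not reflect the actual Quantiles output schema, so map them to whatever fields your runs report.

```python
# Fields this page says to hold stable between compared runs.
# The key names are hypothetical; adapt them to your run metadata.
STABLE_FIELDS = ("dataset", "model_settings", "prompt_version", "sample_count")


def comparability_issues(baseline, candidate, fields=STABLE_FIELDS):
    """Return (field, baseline_value, candidate_value) for each field that differs."""
    return [
        (name, baseline.get(name), candidate.get(name))
        for name in fields
        if baseline.get(name) != candidate.get(name)
    ]


if __name__ == "__main__":
    # Illustrative values only, not real run data.
    baseline = {
        "dataset": "pubmedqa",
        "model_settings": {"temperature": 0},
        "prompt_version": "v1",
        "sample_count": 50,
    }
    candidate = dict(baseline, sample_count=10)
    for name, old, new in comparability_issues(baseline, candidate):
        print(f"{name}: {old!r} -> {new!r}")
```

If the list is non-empty, metric differences between the two runs may come from the changed field rather than from the change you meant to evaluate.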

This is not an exhaustive checklist. Coding agents can still make mistakes, misunderstand instructions, or produce incorrect summaries, so review commands and results before relying on them for release or evaluation decisions.