Use Quantiles with Coding Agents
Quantiles gives coding agents a CLI-native workflow for running evaluations, inspecting and analyzing evaluation inputs and outputs, and comparing results across runs. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.
- Agent Quickstart
Quickly connect a coding agent to the Quantiles eval workflow.
- Agent Skill
Give coding agents reusable Quantiles instructions for running, inspecting, comparing, and resuming evals.
What agents can do
With Quantiles, coding agents can:
- Run and/or configure a built-in Quantiles benchmark (e.g. pubmedqa, simpleqa-verified)
- Run and/or configure a custom code Quantiles evaluation
- Inspect and analyze an evaluation or benchmark run
- Compare two evaluation or benchmark runs
- Resume a failed or interrupted evaluation or benchmark run
- Debug failed samples, metrics, scorers, or run outputs
- Write a new custom evaluation using the Quantiles Python SDK
- Convert an ad-hoc Python evaluation script into a durable Quantiles evaluation
- Summarize regressions, failures, and recommended next steps
Agent evaluation workflow
This is the evaluation workflow for agents using Quantiles. The SKILL.md defines reusable Quantiles instructions and the qt CLI executes the evaluation workflow.
Install SKILL.md -> Run eval -> Inspect run -> Compare against a baseline when relevant -> Inspect and analyze inputs and outputs
Install the skill
Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:
Please install the Quantiles skill at github.com/quantiles-evals/skill
Alternatively, copy SKILL.md into your agent’s skill directory.
Troubleshooting
Most agent issues come from unclear instructions or a missing local setup. The agent may not know which skill to use, may not have qt available in its shell, or may run too broad an eval too early. Use direct language so the agent can correct course.
| Problem | Solution |
|---|---|
| Agent does not use the skill | Confirm the Quantiles SKILL.md is installed in the location your coding agent expects. |
| qt command not found | Run `which qt` or `qt --version` to confirm that Quantiles is installed and available in the agent’s shell. If the CLI is missing, run the install command. |
| Run is too expensive | Start with a small sample limit before running the full benchmark. The SKILL.md instructs agents to ask for approval before running provider-backed or full benchmark evaluations. |
| Results are hard to compare | Compare completed runs from the same evaluation or benchmark with the same dataset, model settings, prompt version, rubric, and benchmark configuration. |
| Agent summarizes vaguely | Ask the agent to ground every conclusion in run data and require specific metrics, sample IDs, observed failures, changed outputs, and concrete recommendations instead of general observations. See an example prompt. |
| Configuration is not applied | Check whether quantiles.toml or .quantiles.toml exists in the current working directory. Confirm the run is using the expected model, sample limit, and benchmark configuration. |
| Provider-backed run fails | Verify that the required provider API key is configured without printing the key value, then check for authentication, quota, rate-limit, timeout, or provider availability errors. |
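
Several of the checks in the table above (CLI on the path, config file present, provider key configured) can be gathered into one preflight step. The following is a minimal Python sketch using only the standard library; it is not part of Quantiles. The function names are illustrative, `EXAMPLE_PROVIDER_API_KEY` is a placeholder for whatever variable your provider requires, and the order in which the two config file names are checked is an assumption, since this page does not say which one takes precedence.

```python
import os
import shutil
from pathlib import Path

# Config file names taken from the troubleshooting table above.
# The lookup order is an assumption, not documented Quantiles behavior.
CONFIG_NAMES = ("quantiles.toml", ".quantiles.toml")


def find_cli(name="qt"):
    """Return the absolute path of the CLI on PATH, or None if it is missing."""
    return shutil.which(name)


def find_config(directory="."):
    """Return the first config file found in `directory`, or None."""
    for candidate in CONFIG_NAMES:
        path = Path(directory) / candidate
        if path.is_file():
            return path
    return None


def key_is_set(env_var):
    """Report whether a key is configured without exposing its value."""
    return bool(os.environ.get(env_var, "").strip())


def preflight(env_var, directory="."):
    """Collect the checks into a dict that is safe to print or log."""
    return {
        "qt_on_path": find_cli() is not None,
        "config_file": str(find_config(directory) or ""),
        "api_key_set": key_is_set(env_var),
    }


if __name__ == "__main__":
    # Placeholder variable name; substitute the one your provider uses.
    print(preflight("EXAMPLE_PROVIDER_API_KEY"))
```

The result contains only booleans and a file path, so an agent can paste it into a report without leaking a credential.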
Guardrails and Cost Control
Agent-driven evals can become expensive or hard to interpret if too many variables change at once. Give your agent direct operating rules before it runs qt, especially around sample counts, model configuration, comparisons, and secrets.
Use instructions like these:
- Start with a small sample count before a full benchmark
- Ask before running a full benchmark
- Use `limit` for smoke tests and sample subsets
- Use `--json` for `qt list`, `qt run`, `qt show`, and `qt compare` when reporting to an agent or script
- Treat demo model runs as eval workflow validation only, not model-quality results
- Do not run provider-backed evals unless the user asks for a real model run or provides `model`
- Check required provider API keys without printing their values
- Keep the dataset, model settings, prompt version, and sample count stable when comparing runs
- Never print API keys, `.env` values, secrets, or private credentials
- Report the exact command used
- Preserve every run ID
- Use `qt compare` for run-to-run changes instead of manually eyeballing outputs
- Use `qt resume "$RUN_ID"` for interrupted eval runs
- Do not edit benchmark definitions, datasets, or scoring code unless I ask
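
If an agent drives qt from a script, a few of these rules can be enforced mechanically rather than by instruction alone. The sketch below is a hypothetical Python wrapper, not a Quantiles API: it adds `--json` for the four subcommands listed above and records the exact command line before running it. Placing `--json` after the other arguments is an assumption about the CLI, and `RUN_A` is a placeholder run ID.

```python
import shlex
import subprocess

# Subcommands the guidance above pairs with --json.
JSON_SUBCOMMANDS = {"list", "run", "show", "compare"}


def build_command(subcommand, *args):
    """Build a qt argv list, adding --json where the guidance calls for it."""
    argv = ["qt", subcommand, *args]
    if subcommand in JSON_SUBCOMMANDS:
        # Assumption: the flag is accepted after the other arguments.
        argv.append("--json")
    return argv


def record(argv, log):
    """Append the exact, shell-quoted command to `log` and return it."""
    line = shlex.join(argv)
    log.append(line)
    return line


def run_logged(argv, log):
    """Record the command first, then run it and return its stdout."""
    record(argv, log)
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


if __name__ == "__main__":
    commands = []
    print(record(build_command("list"), commands))             # qt list --json
    print(record(build_command("resume", "RUN_A"), commands))  # qt resume RUN_A
```

Passing an argument list to `subprocess.run` instead of a shell string keeps run IDs from being reinterpreted by the shell, and the `commands` list gives the agent a verbatim record to include in its summary.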
This is not an exhaustive checklist. Coding agents can still make mistakes, misunderstand instructions, or produce incorrect summaries, so review commands and results before relying on them for release or evaluation decisions.
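
One part of that review can be automated: checking that two runs were configured alike before trusting a comparison between them. The following is a minimal Python sketch under stated assumptions. The field names and the dictionaries are hypothetical and do not reflect the actual schema of Quantiles run output; map them to whatever your run metadata records.

```python
# Hypothetical field names for settings that should stay stable across runs.
STABLE_FIELDS = ("dataset", "model_settings", "prompt_version", "sample_count")


def config_drift(baseline, candidate, fields=STABLE_FIELDS):
    """Return {field: (baseline_value, candidate_value)} for each field that differs."""
    return {
        name: (baseline.get(name), candidate.get(name))
        for name in fields
        if baseline.get(name) != candidate.get(name)
    }


if __name__ == "__main__":
    # Illustrative metadata only; these values are placeholders.
    baseline = {"dataset": "pubmedqa", "prompt_version": "v1", "sample_count": 50}
    candidate = {"dataset": "pubmedqa", "prompt_version": "v2", "sample_count": 50}
    print(config_drift(baseline, candidate))  # {'prompt_version': ('v1', 'v2')}
```

An empty result means the listed settings match. Any entry names a variable that changed between runs and should be reported alongside the metric differences.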