Use Quantiles with Coding Agents

Quantiles gives coding agents a CLI-native workflow for running evaluations, inspecting and analyzing evaluation inputs and outputs, and comparing results across runs. It supports Codex, Claude Code, Cursor, GitHub Copilot, Gemini CLI, OpenCode, and other agents that use reusable skills or instruction files.

What agents can do

With Quantiles, coding agents can:

  • Run and/or configure a built-in Quantiles benchmark (e.g. pubmedqa, simpleqa-verified)
  • Run and/or configure a custom code Quantiles evaluation
  • Inspect and analyze an evaluation or benchmark run
  • Compare two evaluation or benchmark runs
  • Resume a failed or interrupted evaluation or benchmark run
  • Debug failed samples, metrics, scorers, or run outputs
  • Write a new custom evaluation using the Quantiles Python SDK
  • Convert an ad-hoc Python evaluation script into a durable Quantiles evaluation
  • Summarize regressions, failures, and recommended next steps

Agent evaluation workflow

Agents using Quantiles follow the workflow below. SKILL.md defines the reusable Quantiles instructions, and the qt CLI executes each step.

Install SKILL.md -> Run eval -> Inspect run -> Compare against a baseline when relevant -> Inspect and analyze inputs and outputs

Install the skill

Use the prompt below to set up your coding agent with the Quantiles CLI and agent skill:

Please install the Quantiles skill at github.com/quantiles-evals/skill

Alternatively, copy SKILL.md into your agent’s skill directory.

Troubleshooting

Most agent issues come from unclear instructions or a missing local setup. The agent may not know which skill to use, may not have qt available in its shell, or may run too broad an eval too early. Use direct language so the agent can correct course.

| Problem | Solution |
| --- | --- |
| Agent does not use the skill | Confirm the Quantiles SKILL.md is installed in the location your coding agent expects. |
| qt command not found | Run which qt or qt --version to confirm that Quantiles is installed and available in the agent’s shell. If the CLI is missing, run the install command. |
| Run is too expensive | Start with a small sample limit before running the full benchmark. The SKILL.md instructs agents to ask for approval before running provider-backed or full benchmark evaluations. |
| Results are hard to compare | Compare completed runs from the same evaluation or benchmark with the same dataset, model settings, prompt version, rubric, and benchmark configuration. |
| Agent summarizes vaguely | Ask the agent to ground every conclusion in run data, and require specific metrics, sample IDs, observed failures, changed outputs, and concrete recommendations instead of general observations. See an example prompt. |
| Configuration is not applied | Check whether quantiles.toml or .quantiles.toml exists in the current working directory. Confirm the run is using the expected model, sample limit, and benchmark configuration. |
| Provider-backed run fails | Verify that the required provider API key is configured without printing the key value, then check for authentication, quota, rate-limit, timeout, or provider availability errors. |
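
Several of the checks above can be scripted so the agent reports facts instead of guesses. The following is a minimal Python sketch, not part of Quantiles itself: it checks whether qt is on the PATH, which of the two configuration file names exist in a directory, and whether an API key variable is set, without ever reading the key value into the output. The variable name PROVIDER_API_KEY and the function name preflight are hypothetical placeholders; substitute the variable your provider actually requires.

```python
import os
import shutil
from pathlib import Path

# Configuration file names mentioned in the troubleshooting table.
CONFIG_NAMES = ("quantiles.toml", ".quantiles.toml")


def preflight(cwd=".", key_var="PROVIDER_API_KEY", environ=None):
    """Collect local setup facts without exposing any secret value.

    key_var is a hypothetical placeholder; pass the environment variable
    name your provider requires.
    """
    env = os.environ if environ is None else environ
    found = [name for name in CONFIG_NAMES if (Path(cwd) / name).is_file()]
    return {
        # True when a `qt` executable is visible on the current PATH.
        "qt_on_path": shutil.which("qt") is not None,
        # Which of the known config file names exist in cwd.
        "config_files": found,
        # Only a boolean is recorded, never the key itself.
        "api_key_set": bool(env.get(key_var)),
    }


if __name__ == "__main__":
    for name, value in preflight().items():
        print(f"{name}: {value}")
```

Because the result holds only booleans and file names, it is safe to paste into an agent transcript or an issue report.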

Guardrails and cost control

Agent-driven evals can become expensive or hard to interpret if too many variables change at once. Give your agent direct operating rules before it runs qt, especially around sample counts, model configuration, comparisons, and secrets.

Use instructions like these:

  • Start with a small sample count before a full benchmark
  • Ask before running a full benchmark
  • Use limit for smoke tests and sample subsets
  • Use --json for qt list, qt run, qt show, and qt compare when reporting to an agent or script
  • Treat demo model runs as eval workflow validation only, not model-quality results
  • Do not run provider-backed evals unless the user asks for a real model run or provides a model
  • Check required provider API keys without printing their values
  • Keep the dataset, model settings, prompt version, and sample count stable when comparing runs
  • Never print API keys, .env values, secrets, or private credentials
  • Report the exact command used
  • Preserve every run ID
  • Use qt compare for run-to-run changes instead of manually eyeballing outputs
  • Use qt resume "$RUN_ID" for interrupted eval runs
  • Do not edit benchmark definitions, datasets, or scoring code unless I ask
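
Two of the rules above, using --json for qt list, qt run, qt show, and qt compare and reporting the exact command used, can be applied mechanically by a thin wrapper. The Python sketch below is illustrative only. The helper names are hypothetical, and it assumes that --json can be appended after the other arguments and that the command writes a single JSON document to stdout; neither assumption is confirmed here, so check qt's own help output before relying on it.

```python
import json
import shlex
import subprocess

# Subcommands the guardrail list names for --json output.
JSON_SUBCOMMANDS = {"list", "run", "show", "compare"}


def build_qt_command(subcommand, *args):
    """Build the argv for a qt call, adding --json where the rules ask for it.

    Appending --json last is an assumption about qt's argument parsing.
    """
    argv = ["qt", subcommand, *args]
    if subcommand in JSON_SUBCOMMANDS and "--json" not in argv:
        argv.append("--json")
    return argv


def run_qt(subcommand, *args):
    """Print the exact command, run it, and parse stdout as JSON.

    Assumes stdout is one JSON document; raises CalledProcessError if qt
    exits with a non-zero status.
    """
    argv = build_qt_command(subcommand, *args)
    print("command:", shlex.join(argv))  # report the exact command used
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

For example, build_qt_command("resume", run_id) leaves the arguments untouched because resume is not in the --json list, while build_qt_command("list") yields ["qt", "list", "--json"].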

This is not an exhaustive checklist. Coding agents can still make mistakes, misunderstand instructions, or produce incorrect summaries, so review commands and results before relying on them for release or evaluation decisions.
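
One review step that is easy to automate is the rule about keeping the dataset, model settings, prompt version, and sample count stable when comparing runs. The Python sketch below checks two run-metadata dictionaries for mismatches before a comparison is trusted. The key names are hypothetical and do not reflect the actual structure of Quantiles run output; map them to whatever fields your run metadata really contains.

```python
# Hypothetical metadata keys; adjust to the fields your runs actually record.
COMPARABILITY_KEYS = ("dataset", "model_settings", "prompt_version", "sample_count")


def comparability_issues(baseline, candidate, keys=COMPARABILITY_KEYS):
    """Return one message per setting that differs between two runs."""
    issues = []
    for key in keys:
        left, right = baseline.get(key), candidate.get(key)
        if left != right:
            issues.append(f"{key}: {left!r} != {right!r}")
    return issues


baseline = {"dataset": "pubmedqa", "model_settings": {"temperature": 0},
            "prompt_version": "v1", "sample_count": 50}
candidate = dict(baseline, sample_count=200)

for issue in comparability_issues(baseline, candidate):
    print("not comparable:", issue)  # prints: not comparable: sample_count: 50 != 200
```

An empty list means the listed settings match, so any difference in results is less likely to come from a changed configuration.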
