Agent Prompts

After installing the skill, use the following prompt templates for common agent-driven Quantiles evaluation workflows. Use them as written, adapt them to your benchmarks, or use their structure to create your own prompts.

Each prompt tells the agent what to run, what to inspect, and what to report so evaluation results are consistent and reviewable. The Quantiles skill instructs the agent to use --json for structured run output.

Prompt Templates

Use these prompts as starting points for common Quantiles agent workflows. Replace the placeholders with your evaluation name, run IDs, model configuration, and evaluation goals before passing the prompt to your coding agent.
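If you generate prompts from a script instead of editing them by hand, the `$EVALUATION` and `$RUN_ID` placeholders used in the templates below happen to match the syntax of Python's standard `string.Template`. The following is a minimal sketch of that substitution step. The helper name `fill_prompt` and the example value are illustrative assumptions, not part of Quantiles.

```python
from string import Template

# Prompt text copied from the smoke-test template in this guide.
# The single quotes around the placeholder are part of the prompt.
SMOKE_TEST_PROMPT = Template(
    "Run 10 samples of '$EVALUATION' as a smoke test. Summarize the results "
    "and tell me if it is safe to continue to a full run."
)


def fill_prompt(template: Template, **values: str) -> str:
    """Fill every placeholder; raises KeyError if one is left unfilled."""
    return template.substitute(**values)


# "simpleqa-verified" is one of the built-in benchmarks named in this guide.
prompt = fill_prompt(SMOKE_TEST_PROMPT, EVALUATION="simpleqa-verified")
print(prompt)
```

`substitute` fails when a placeholder is missing, so a half-filled prompt never reaches the agent. If you also prepend the `$quantiles` prefix, write it as `$$quantiles` inside the template so that `string.Template` treats it as literal text.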

Run your first agent-driven evaluation

Use the built-in demo model to run a built-in evaluation locally without external model calls, provider API keys, or inference costs.

Pass the following prompt to your coding agent:


Run a SimpleQA Verified benchmark and summarize the results.

If your coding agent does not detect the Quantiles skill automatically, add the $quantiles prefix to your prompt.

Run a subset of samples

Use a sample limit to run a subset of any built-in benchmark or custom code evaluation. When no sample limit is specified in the command or configuration file, the full evaluation runs by default.


Run 10 samples of '$EVALUATION' as a smoke test. Summarize the results and tell me if it is safe to continue to a full run.

Built-in benchmarks you can try: pubmedqa, simpleqa-verified.

Customize the benchmark

To customize a built-in benchmark, such as simpleqa-verified, ask your coding agent to configure a hosted LLM provider of your choice and run a subset of benchmark samples with the following prompt:


Configure the `simpleqa-verified` benchmark in a Quantiles config file to use 10 samples and the <your model here> model, then run the benchmark and summarize the results.

See the configuration documentation for more details.

See a list of all evaluation runs

Your agent can list all of your evaluation runs, including each run’s ID, status, creation time, and other summary metadata.


Give me a list of all my evaluation runs.

Compare two evaluation runs

Use this prompt to have your coding agent locate the two most recent runs for an evaluation, compare their aggregate metrics, and summarize sample-level regressions, improvements, and output differences.


Compare the two most recent runs for '$EVALUATION'. Summarize the aggregate metrics, sample-level results, failures, and any notable errors. Identify the highest-impact issues to review first, and recommend specific next steps.

See Compare Evals for more information on comparing evaluations.
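To make the comparison concrete, the sketch below shows the kind of check this prompt asks the agent to perform: a delta on an aggregate metric, plus the samples that regressed or improved between two runs. The data shape (a mapping from sample ID to a pass/fail boolean) and every name here are assumptions for illustration. This is not the Quantiles output format or API.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-sample pass/fail results from two runs (hypothetical shape)."""
    # Only samples present in both runs can be classified as changed.
    shared = sorted(baseline.keys() & candidate.keys())
    regressions = [s for s in shared if baseline[s] and not candidate[s]]
    improvements = [s for s in shared if not baseline[s] and candidate[s]]

    def accuracy(results: dict[str, bool]) -> float:
        # Fraction of passing samples; 0.0 for an empty run.
        return sum(results.values()) / len(results) if results else 0.0

    return {
        "baseline_accuracy": accuracy(baseline),
        "candidate_accuracy": accuracy(candidate),
        "accuracy_delta": accuracy(candidate) - accuracy(baseline),
        "regressions": regressions,
        "improvements": improvements,
    }


# Made-up results for two runs of the same four samples.
older = {"q1": True, "q2": True, "q3": False, "q4": False}
newer = {"q1": True, "q2": False, "q3": True, "q4": True}

report = compare_runs(older, newer)
print(report["regressions"])   # ['q2']
print(report["improvements"])  # ['q3', 'q4']
```

Reporting regressions separately from the aggregate delta matters because an overall improvement can hide individual samples that got worse, which is why the prompt asks for both.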

Inspect a failed or interrupted evaluation

Use this prompt when you want the agent to diagnose failed or interrupted runs before deciding whether to resume, rerun, or change the evaluation configuration. The agent should identify what failed, why it likely failed, which samples were affected, and what to fix next.


List recent evaluation runs, identify the most recent failed or interrupted run, and inspect its run record. Summarize the failure reason, affected samples or steps, likely causes, and recommended fixes. Do not resume or rerun the evaluation unless explicitly asked.

Resume a failed or interrupted run

Use this prompt when a run was interrupted by a timeout, rate limit, stopped process, or other recoverable failure. The agent will resume the run and preserve completed work.


Resume run '$RUN_ID'. Inspect the original failure, verify whether the resumed run completed successfully, summarize aggregate metrics and sample-level results, and recommend specific next actions.

See Resume Runs for more information on the recovery workflow.

Create a custom evaluation

For custom evaluations that use your own datasets, models, and measurement techniques, you can build evaluations with the Quantiles Python SDK.

To have your coding agent build and run a custom evaluation, adapt the prompt template below to your needs:


Write a Quantiles custom code evaluation using the Python SDK that uses the <your dataset> dataset, runs samples through the <your model> model, and measures the output using the following metrics:

<list your metrics here>.

Call the evaluation <name>, and make sure to include it in the `quantiles.toml` config file. When you're done, run the new eval and summarize the results.

See the custom evaluations documentation for details on how to write custom evaluations with your own code, using the Quantiles SDKs and tooling.