Test, lint, and report on AI agent tool use before shipping.
Task definitions live in tasks.json.
The starter fixture is examples/calendar-email/tasks.json. Each task includes:
- id
- prompt
- expectedTool
- successCriteria

Use expectedTool to name the tool the agent should choose. Use none when no tool should be selected.
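A quick pre-flight check can catch tasks that are missing one of the four fields above. The sketch below is illustrative only and is not this project's linter: the function names are hypothetical, it assumes tasks.json holds a top-level JSON array of task objects, and it assumes every field is required and non-empty.

```python
import json

# Field names come from the task description above. Treating all four as
# required and non-empty is an assumption of this sketch.
REQUIRED_FIELDS = ("id", "prompt", "expectedTool", "successCriteria")


def validate_task(task):
    """Return a list of human-readable problems found in one task dict."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Absent keys and empty values ("" / None / []) are both reported.
        if not task.get(field):
            problems.append(f"missing or empty field: {field}")
    return problems


def validate_tasks(raw_json):
    """Validate a JSON document assumed to be a top-level array of tasks.

    Returns a dict mapping a task label to its list of problems.
    An empty dict means no problems were found.
    """
    tasks = json.loads(raw_json)
    if not isinstance(tasks, list):
        return {"<root>": ["expected a JSON array of tasks"]}

    report = {}
    for index, task in enumerate(tasks):
        if not isinstance(task, dict):
            report[f"<task {index}>"] = ["task is not a JSON object"]
            continue
        problems = validate_task(task)
        if problems:
            task_id = task.get("id")
            # Fall back to the array position when the id itself is unusable.
            label = task_id if isinstance(task_id, str) and task_id else f"<task {index}>"
            report[label] = problems
    return report
```

To run it against the starter fixture, read the file contents and pass them in, for example validate_tasks(open("examples/calendar-email/tasks.json").read()). If the real file wraps tasks in an outer object, the array check would need adjusting.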
Good eval sets include:
Tags are not a first-class task field yet. Use clear id values and successCriteria until a future schema adds tagging.
Example:
{
  "id": "email-status-update",
  "prompt": "Email Jordan a short status update about the release.",
  "expectedTool": "send_email"
}
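To make the role of expectedTool concrete, here is a minimal sketch of how a harness might compare it with the tool an agent actually selected, including the none case. The function names, the shape of the selections mapping, and the use of None to mean "no tool call" are all assumptions for illustration, not this project's scoring code.

```python
NO_TOOL = "none"  # value used in a task when no tool should be selected


def tool_choice_matches(expected_tool, selected_tool):
    """Compare a task's expectedTool with the tool the agent selected.

    selected_tool is None when the agent made no tool call (an assumption
    about how a harness would report that case).
    """
    if expected_tool == NO_TOOL:
        return selected_tool is None
    return selected_tool == expected_tool


def score(tasks, selections):
    """Return the fraction of tasks whose tool choice matched.

    selections maps task id -> selected tool name, or None for no tool call.
    A task id absent from selections is treated as "no tool call".
    """
    if not tasks:
        return 0.0
    hits = sum(
        tool_choice_matches(task["expectedTool"], selections.get(task["id"]))
        for task in tasks
    )
    return hits / len(tasks)
```

For the example task above, score([task], {"email-status-update": "send_email"}) counts as a match, while any other tool name, or no tool call at all, counts as a miss. This checks tool selection only; judging successCriteria would need a separate step.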