Test, lint, and report on AI agent tool use before shipping.
ToolSmith can fail builds when eval scores fall below a threshold:
npm run dev -- eval examples/calendar-email --fail-under 80
If the score is below the threshold, the command exits with a non-zero status and prints:
Fail-under threshold: 80%
CI result: failed
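The threshold rule above can be sketched as a small function. This is an illustration only, not ToolSmith's actual implementation: the name `failUnderExitCode` is hypothetical, and the exit code `1` is an assumption, since the text above only promises a non-zero status.

```typescript
// Hypothetical helper illustrating the --fail-under rule; not ToolSmith's code.
// Assumes scores and thresholds are percentages on the same 0-100 scale.
function failUnderExitCode(scorePercent: number, failUnder: number): number {
  // Only a score strictly below the threshold fails; an equal score passes.
  // The value 1 is an assumption: the docs only say "non-zero".
  return scorePercent < failUnder ? 1 : 0;
}

console.log(failUnderExitCode(79, 80)); // 1 (below the threshold, build fails)
console.log(failUnderExitCode(80, 80)); // 0 (meets the threshold, build passes)
```

Under this reading, a score exactly equal to the threshold passes, because the failure condition is "below the threshold".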
Compare a baseline run with a current run:
npm run dev -- compare baseline.json .toolsmith/runs/latest.json
Fail on score regression:
npm run dev -- compare baseline.json .toolsmith/runs/latest.json --fail-on-regression
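As a rough sketch of what a regression check might look like, the function below compares two run summaries. The `RunSummary` shape with a single `score` field and the name `hasRegressed` are assumptions made for illustration; the real format of `baseline.json` and `.toolsmith/runs/latest.json` is not described here and may differ.

```typescript
// Hypothetical shape for a run file; the real ToolSmith JSON format may differ.
interface RunSummary {
  score: number; // assumed to be an overall percentage score
}

// Hypothetical check: a regression is any drop in score relative to the baseline.
function hasRegressed(baseline: RunSummary, current: RunSummary): boolean {
  return current.score < baseline.score;
}

const baselineRun: RunSummary = { score: 85 };
const currentRun: RunSummary = { score: 82 };

// A CI wrapper could map this boolean to a non-zero exit status.
console.log(hasRegressed(baselineRun, currentRun)); // true
```

In this sketch an unchanged score does not count as a regression; whether ToolSmith applies a tolerance or compares per-case scores is not stated above.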
The docs-only GitHub Actions example lives at:
docs/examples/github-actions.md
No GitHub Actions workflow is enabled in this repo. GitHub Pages is used for documentation only and does not run CI checks.
Before using CI checks publicly, review docs/RELEASE_CHECKLIST.md and verify macOS/Windows expectations in docs/CROSS_PLATFORM.md.