The Agent Eval Blog

Everything on AI agent evaluation, from first principles to CI integration.

Agents make decisions, call tools, and branch based on state, so unit tests and conversation replays don't work. Here's what does.

Most eval cases are too narrow. Here's a framework for writing scenarios that surface the real-world failures your users will hit.

The most dangerous regressions are the ones you don't see coming. How to detect them before your users do.

Step-by-step: GitHub Actions, CircleCI, eval thresholds, and what to do when an eval fails in CI.

A 95% pass rate sounds great until you realize the 5% that fails is your most important customer path. How to interpret eval scores honestly.

Ready to start evaluating?

Free plan. First eval in 5 minutes.