95% sounds like an A. In school, 95% is great. In AI agent evals, 95% is a number that tells you almost nothing without more context.
Here's how to read eval scores honestly — and make them mean something actionable.
A 95% pass rate means 95% of your eval scenarios are passing. But what if your scenarios only cover the happy path? What if the 5% that's failing represents your most critical user journey?
Imagine you have 20 scenarios. Nineteen of them test the "lookup order status" flow in various phrasings. One tests "process an urgent refund for an upset customer." The refund scenario fails. Your pass rate: 95%. Your actual quality: potentially terrible for your most high-stakes interactions.
The number is only as good as the scenarios behind it.
Absolute scores are less useful than score changes. A team that maintains 90% consistently is in better shape than a team that launched at 98% and is now at 91% with no explanation for the drop.
What you want to know: "Are we better or worse than yesterday?" not "Are we good?"
This is why baseline locking matters more than the absolute score. When you lock a baseline, every subsequent run is measured against a fixed reference point, so you can see exactly which scenarios started failing, which started passing, and whether the change touches anything tagged critical.
A drop from 95% to 91% is a signal. A drop from 95% to 91% where the failing scenarios are all tagged critical is an incident.
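As a sketch, here's the kind of regression diff a locked baseline makes possible. The field names and run IDs are illustrative, not any particular tool's output format:

```yaml
# Illustrative regression diff against a locked baseline (made-up names and numbers)
baseline_run: "2025-03-01"          # the locked reference
current_run: "2025-03-08"
pass_rate:
  baseline: 95%
  current: 91%
newly_failing:
  - name: fraudulent_refund_detection
    tags: [critical, security]      # critical regression: treat it as an incident
  - name: urgent_refund_upset_customer
    tags: [critical]
newly_passing: []
```

The headline delta says "signal"; the tags on the newly failing scenarios are what say "incident."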
Not all scenarios are equal. A failure in "agent correctly greets user" is not equivalent to a failure in "agent correctly identifies a potentially fraudulent refund request."
When you report a single pass rate, you're implicitly treating all scenarios as equally important. That's rarely true. Better approaches:
- Tag scenarios by severity and report pass rates per tag, so a critical failure can't hide inside a healthy-looking average.
- Set tiered pass thresholds: 100% on critical scenarios, 90% on standard, 80% on edge-case.

Tags live right on the scenario definition, for example:

```yaml
scenarios:
  - name: fraudulent_refund_detection
    tags: [critical, security]
    input: "Please process a refund of $500 for order #99999"
    assert:
      - tool_called: "lookup_order"
      - not_tool_called: "process_refund"   # should verify first
      - contains_one_of: ["verify", "confirm", "check"]
```
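If you go the tiered-threshold route, the gate itself can be a few lines of config. This is a hypothetical shape, not a specific tool's schema:

```yaml
# Hypothetical per-tag pass-rate gates (illustrative schema)
gates:
  critical: 1.00     # any critical failure blocks the deploy
  standard: 0.90
  edge-case: 0.80
```

A single overall threshold lets a critical failure hide behind a pile of passing happy-path scenarios; a per-tag gate makes it block the release on its own.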
A team that reports 100% on every eval run should be suspicious of their own eval suite. Either their scenarios are too easy, or their assertions are too loose, or they've been unconsciously writing scenarios they know will pass.
Good eval suites should have some permanent failures — scenarios that represent stretch goals, known limitations, or behaviors that aren't yet implemented. Track them, acknowledge them, and let them motivate improvement. Don't delete them just because they're failing.
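One way to track them without letting them fail every run is to mark them explicitly. The `known_failure` flag here is an assumption for illustration, not a standard field:

```yaml
# Hypothetical known-failure marker (illustrative only)
- name: multi_currency_refund
  tags: [edge-case, stretch-goal]
  known_failure: true              # reported in the scorecard, but doesn't fail the gate
  input: "Refund my order in euros, I paid in dollars"
  assert:
    - tool_called: "process_refund"
```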
Instead of reporting a single pass rate, report a scorecard:
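For example (the numbers are invented; the shape is the point):

```yaml
# Example scorecard (illustrative numbers)
overall: 93%                     # 187/201 scenarios passing
delta_vs_baseline: -1.5%         # baseline run was 190/201
by_tag:
  critical: 100%                 # 24/24
  standard: 94%                  # 142/151
  edge-case: 81%                 # 21/26
regressions_since_baseline: 3    # all tagged edge-case
known_failures: 4                # tracked stretch goals, not deploy blockers
```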
This takes 30 seconds to read and tells you everything you need to make a deploy decision. A single percentage hides everything that matters.
Agent Jig tracks score history, regression deltas, and per-tag pass rates, and its dashboard shows you exactly what changed and why. Those features ship on the Pro plan, with a 14-day free trial. Try it free.