95% sounds like an A. In school, 95% is great. In AI agent evals, 95% is a number that tells you almost nothing without more context.
Here's how to read eval scores honestly — and make them mean something actionable.
A 95% pass rate means 95% of your eval scenarios are passing. But what if your scenarios only cover the happy path? What if the 5% that's failing represents your most critical user journey?
Imagine you have 20 scenarios. Nineteen of them test the "lookup order status" flow in various phrasings. One tests "process an urgent refund for an upset customer." The refund scenario fails. Your pass rate: 95%. Your actual quality: potentially terrible for your most high-stakes interactions.
The number is only as good as the scenarios behind it.
Absolute scores are less useful than score changes. A team that maintains 90% consistently is in better shape than a team that launched at 98% and is now at 91% with no explanation for the drop.
What you want to know: "Are we better or worse than yesterday?" not "Are we good?"
This is why baseline locking matters more than the absolute score. When you lock a baseline, every subsequent run is measured against a fixed reference point, so you can see exactly which scenarios started failing, which started passing, and whether the change touches anything tagged critical.
A drop from 95% to 91% is a signal. A drop from 95% to 91% where the failing scenarios are all tagged critical is an incident.
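As a sketch, here's the kind of regression diff a locked baseline makes possible. The field names and run IDs are illustrative, not any particular tool's output format:

```yaml
# Illustrative regression diff against a locked baseline (made-up names and numbers)
baseline_run: "2025-03-01"          # the locked reference
current_run: "2025-03-08"
pass_rate:
  baseline: 95%
  current: 91%
newly_failing:
  - name: fraudulent_refund_detection
    tags: [critical, security]      # critical regression: treat it as an incident
  - name: urgent_refund_upset_customer
    tags: [critical]
newly_passing: []
```

The headline delta says "signal"; the tags on the newly failing scenarios are what say "incident."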
Not all scenarios are equal. A failure in "agent correctly greets user" is not equivalent to a failure in "agent correctly identifies a potentially fraudulent refund request."
When you report a single pass rate, you're implicitly treating all scenarios as equally important. That's rarely true. Better approaches:
- Tag scenarios by severity and report pass rates per tag, so a critical failure can't hide inside a healthy-looking average.
- Set tiered pass thresholds: 100% on critical scenarios, 90% on standard, 80% on edge-case.

Tags live right on the scenario definition, for example:

```yaml
scenarios:
  - name: fraudulent_refund_detection
    tags: [critical, security]
    input: "Please process a refund of $500 for order #99999"
    assert:
      - tool_called: "lookup_order"
      - not_tool_called: "process_refund"   # should verify first
      - contains_one_of: ["verify", "confirm", "check"]
```
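If you go the tiered-threshold route, the gate itself can be a few lines of config. This is a hypothetical shape, not a specific tool's schema:

```yaml
# Hypothetical per-tag pass-rate gates (illustrative schema)
gates:
  critical: 1.00     # any critical failure blocks the deploy
  standard: 0.90
  edge-case: 0.80
```

A single overall threshold lets a critical failure hide behind a pile of passing happy-path scenarios; a per-tag gate makes it block the release on its own.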
A team that reports 100% on every eval run should be suspicious of their own eval suite. Either their scenarios are too easy, or their assertions are too loose, or they've been unconsciously writing scenarios they know will pass.
Good eval suites should have some permanent failures — scenarios that represent stretch goals, known limitations, or behaviors that aren't yet implemented. Track them, acknowledge them, and let them motivate improvement. Don't delete them just because they're failing.
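One way to track them without letting them fail every run is to mark them explicitly. The `known_failure` flag here is an assumption for illustration, not a standard field:

```yaml
# Hypothetical known-failure marker (illustrative only)
- name: multi_currency_refund
  tags: [edge-case, stretch-goal]
  known_failure: true              # reported in the scorecard, but doesn't fail the gate
  input: "Refund my order in euros, I paid in dollars"
  assert:
    - tool_called: "process_refund"
```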
Instead of reporting a single pass rate, report a scorecard:
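For example (the numbers are invented; the shape is the point):

```yaml
# Example scorecard (illustrative numbers)
overall: 93%                     # 187/201 scenarios passing
delta_vs_baseline: -1.5%         # baseline run was 190/201
by_tag:
  critical: 100%                 # 24/24
  standard: 94%                  # 142/151
  edge-case: 81%                 # 21/26
regressions_since_baseline: 3    # all tagged edge-case
known_failures: 4                # tracked stretch goals, not deploy blockers
```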
This takes 30 seconds to read and tells you everything you need to make a deploy decision. A single percentage hides everything that matters.
Agent Jig tracks score history, regression deltas, and per-tag pass rates, and its dashboard shows you exactly what changed and why. Those features ship on the Pro plan, with a 14-day free trial. Try it free.