How to write eval cases that actually catch regressions

Most teams start their eval suite with the happy path: write a scenario for the textbook version of each feature, verify it passes, move on. Six months later, they have 200 scenarios, 100% passing, and a production incident every other sprint.

The happy path is the path you thought of. Regressions live everywhere else.

The three scenario types you actually need

A good eval suite covers three categories, not one:

1. Happy path scenarios

The canonical version of each task. These should always pass. If they don't, something is seriously broken. Don't over-index on these — they're necessary but not sufficient.
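For reference, a minimal happy-path scenario in the same format as the examples later in this post (the name and input are illustrative):

```yaml
scenarios:
  # Happy path: the textbook phrasing — should always pass
  - name: refund_happy_path
    input: "I want a refund for order #12345"
    assert:
      - tool_called: "lookup_order"
      - contains: "refund"
```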

2. Edge case scenarios

Variations on the happy path that expose brittleness. Misspellings, ambiguous inputs, incomplete information, uncommon phrasings. If your agent handles "I want a refund" but fails on "can u plz cancel my order and get my money back" — that's a regression waiting to happen.
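One way to encode those variations, reusing the scenario format from this post (scenario names and inputs are illustrative):

```yaml
scenarios:
  # Edge cases: same intent as the happy path, brittle phrasings
  - name: refund_informal_phrasing
    input: "can u plz cancel my order and get my money back"
    assert:
      - tool_called: "lookup_order"
      - contains: "refund"
  - name: refund_missing_order_id
    input: "I want a refund"    # no order number given
    assert:
      - contains: "order"       # agent should ask which order
      - not_contains: "Your refund has been processed"
```

Each edge case keeps the assertions from the happy path where they still apply, so a fix for one phrasing can't silently break another.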

3. Regression scenarios

The most important category: scenarios written specifically to prevent bugs from recurring. Every time you fix a production bug, write a scenario that would have caught it. This is your institutional memory.

The best eval suite is a graveyard of production bugs. Every scenario should have a story.

The anatomy of a good scenario

A well-written eval scenario has four properties: a single intent, a realistic input, behavioral assertions (what the agent must and must not do), and a name or tag that ties it back to its story.

scenarios:
  # Good: single intent, realistic input, behavioral assertions
  - name: refund_ambiguous_phrasing
    input: "hey can u help me cancel my thing from last week and get money back"
    assert:
      - tool_called: "lookup_order"
      - contains: "refund"
      - not_contains: "I don't understand"
      - sentiment: "helpful"

  # Also good: regression scenario for a past bug
  - name: regression_no_hallucinate_order
    input: "I want a refund for order #99999"
    tags: [regression, bug-2024-11-12]
    assert:
      - tool_called: "lookup_order"
      - tool_arg: { fn: "lookup_order", key: "order_id", value: "99999" }
      # Bug was: agent hallucinated a response when order wasn't found
      - not_contains: "Your refund has been processed"

How many scenarios do you need?

Not as many as you think. A focused suite of 30 high-quality scenarios will catch more regressions than a sprawling suite of 300 happy-path clones.

A reasonable starting split weights edge cases and regressions over happy paths — the happy path is the part you're least likely to break. And the regression category should only grow: every production bug becomes a permanent eval case.

The "scenario as documentation" principle

A side effect of a well-written eval suite: it documents your agent's expected behavior better than any README. When someone new joins your team, the eval suite tells them: here is every case the agent should handle, and here is exactly what correct behavior looks like.

Write your eval cases as if they're the spec. Because they are.

What not to do

Don't pad the suite with happy-path clones: 300 near-duplicates of the textbook case catch nothing the first one didn't. And don't treat a perpetually green suite as a sign of health — if a scenario has never failed, ask whether it could.

Agent Jig makes YAML eval cases first-class — write them in your repo, run them in CI, add new ones every time you fix a bug. Start your eval suite free.
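To make the CI part concrete, here is a sketch of a GitHub Actions step. The `agentjig` command name, the file path, and the `--fail-on-regression` flag are assumptions for illustration — substitute your actual CLI invocation:

```yaml
# .github/workflows/evals.yml — hypothetical CI wiring
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Hypothetical CLI invocation; check the Agent Jig docs for the real one
      - run: agentjig run evals/scenarios.yaml --fail-on-regression
```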

Write your first 5 scenarios today.

YAML-based, CLI-runnable, CI-integrated. Free plan to get started.
