▲ Catch regressions before your users do

Test your AI agent
before your users do.

Define eval scenarios in YAML. Run them in CI. Get pass/fail scores and regression diffs on every deploy.

agent-jig · CI run #247 · main → deploy
$ agent-jig run --config eval.yaml --baseline main

  refund_request              PASS  142ms
  escalate_angry_customer     PASS   89ms
  missing_order_lookup        PASS  201ms
  multi_step_checkout         PASS  167ms
  ambiguous_intent_fallback   PASS   73ms

5/5 passing · +0 regressions · baseline locked ✓

Shipping agents without evals is flying blind.

Teams update a prompt or swap a model and push to production — then find out three days later their agent is worse.

Day 3

The silent regression

Your model provider upgrades their backend. Your agent's tool-calling rate drops 18%. You find out via a support ticket from your biggest customer.

Week 2

The prompt update gamble

You improve the system prompt for one persona. The fix works — but now edge cases that were handled fine are silently failing for 4% of users.

Forever

No regression baseline

Without a locked eval baseline, you have no objective answer to "is this version better or worse than the one from last Tuesday?" Just vibes.

The fixture that holds your agent still.

A jig holds a workpiece in place so every cut is repeatable. Agent Jig does the same for your evals.

📄

YAML-defined scenarios

Write eval cases as human-readable YAML. Input, expected output, assertions: all in one place. Review them in PRs like any other code. A one-scenario preview follows this feature list.

🔁

CI-native execution

One CLI command runs your full eval suite. Works in GitHub Actions, CircleCI, and any shell. Fails the build when your agent regresses.

📊

Pass/fail scoring

Each scenario produces a deterministic pass or fail. No "it depends." Track score trends over time and see exactly when and why quality changed.

🔍

Regression diffs

Every run is compared against your locked baseline. See which scenarios regressed, by how much, and what changed in the agent's output.

🧪

Framework-agnostic

Wraps any agent: LangChain, LlamaIndex, CrewAI, custom Python, Node. If it takes input and returns output, Agent Jig can evaluate it.

🔒

Baseline locking

Lock a known-good eval run as your baseline. All future runs are diffed against it. Ship with confidence: green means no regressions.
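Here is a taste of what those YAML scenarios look like. The scenario name, input, and tool name below are made up for illustration; the assertion keys (contains, tool_called) are the same ones used in the full eval.yaml example further down the page.

scenarios:
  - name: password_reset                    # illustrative scenario, not from a real suite
    input: "I can't log in to my account"
    assert:
      - contains: "reset"                   # the response should mention a reset
      - tool_called: "send_password_reset"  # and the agent should call the reset tool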

First eval in 5 minutes.

No infrastructure to provision. No SDK to learn. Just YAML and a CLI command.

1

Write your eval.yaml

Define test scenarios: the input to your agent, what you expect back, and which assertions to apply. Start with 5 scenarios. Grow from there.

2

Run against your agent

Point Agent Jig at your agent's endpoint or local process. It feeds each scenario in, captures the response, and scores every assertion. A sketch of what that target can look like follows these three steps.

3

Lock and integrate CI

Lock the passing run as your baseline. Add one step to your CI pipeline; a GitHub Actions sketch follows the eval.yaml example below. Now every PR shows a regression diff before it merges.
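From step 2, what pointing Agent Jig at your agent can look like. The target keys below are hypothetical and shown only to illustrate the difference between an HTTP endpoint and a local process; the real configuration syntax may differ.

# Hypothetical "target" block, for illustration only
target:
  endpoint: "http://localhost:8000/chat"   # Option A: an agent served over HTTP
  # command: "python my_agent.py"          # Option B: spawn a local process instead

# ...followed by the scenarios block shown in the eval.yaml example below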

eval.yaml
scenarios:
  - name: refund_request
    input: "I want a refund for order #12345"
    assert:
      - contains: "refund"
      - tool_called: "lookup_order"

  - name: escalate_angry_customer
    input: "This is the worst service ever"
    assert:
      - sentiment: "empathetic"
      - tool_called: "create_ticket"
      - not_contains: "I cannot help"
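And from step 3, a minimal GitHub Actions job. The workflow file name and checkout step here are assumptions, not prescribed setup; the agent-jig command is the same one shown in the terminal run at the top of the page, and a regression fails the build as described above.

.github/workflows/agent-evals.yml
name: agent-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install agent-jig however your project ships it (pip, npm, or a binary);
      # the install step is assumed and not shown here.
      - name: Run the eval suite against the locked baseline
        run: agent-jig run --config eval.yaml --baseline main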

Everything you need to know about AI agent evals.

Start shipping with confidence.

Free plan. First eval in 5 minutes. No infrastructure required.

Run your first eval free