Running evals manually before a deploy is better than nothing. Running them automatically on every commit is the only thing that scales.
This is a practical guide to wiring AI agent evals into your CI pipeline — step by step, for GitHub Actions and CircleCI, including how to handle failures and set sensible thresholds.
Before integrating evals into CI, you need the Agent Jig CLI, an eval config (eval.yaml), and an Agent Jig API key. Install the CLI:
pip install agent-jig # or npm install -g agent-jig
Verify it works locally first:
agent-jig run --config eval.yaml --agent-url http://localhost:8080
Add your Agent Jig API key as a CI secret. In GitHub Actions, add it as AGENT_JIG_API_KEY in your repository secrets (Settings → Secrets and variables → Actions).
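If you use the GitHub CLI, you can add the secret from a terminal instead of the web UI (this assumes gh is installed and authenticated). The workflow below also reads an OPENAI_API_KEY secret for the agent itself, so set that one too:

gh secret set AGENT_JIG_API_KEY   # prompts for the value, or accepts it piped on stdin
gh secret set OPENAI_API_KEY      # the agent process reads this in the workflow below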
.github/workflows/eval.yml:
name: Agent Evals

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Agent Jig
        run: pip install agent-jig
      - name: Start agent
        run: |
          python agent/server.py &
          sleep 5  # wait for server to start
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Run evals
        run: |
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --fail-below 90
        env:
          AGENT_JIG_API_KEY: ${{ secrets.AGENT_JIG_API_KEY }}
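One optional hardening: the fixed sleep 5 is fragile on slow runners. If your agent answers HTTP on some route, polling until it responds is more reliable. The /health path below is an assumption for illustration; point curl at whatever your server actually serves.

      - name: Start agent
        run: |
          python agent/server.py &
          # Poll for up to ~30 seconds until the server responds, instead of a fixed sleep.
          # /health is an assumed route; substitute an endpoint your agent really exposes.
          for i in $(seq 1 30); do
            curl -sf http://localhost:8080/health && break
            sleep 1
          done
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}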
.circleci/config.yml (eval job):
eval:
  docker:
    - image: cimg/python:3.11
  steps:
    - checkout
    - run:
        name: Install Agent Jig
        command: pip install agent-jig
    - run:
        name: Start agent
        command: python agent/server.py   # background: true keeps it running while later steps execute
        background: true
    - run:
        name: Run evals
        command: |
          sleep 5  # wait for the agent to come up before hitting it
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --fail-below 90
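Two CircleCI-specific notes. Secrets are not referenced per step the way they are in the GitHub Actions workflow above: add AGENT_JIG_API_KEY and OPENAI_API_KEY as project environment variables (Project Settings → Environment Variables) or via a context, and they are available to every step automatically. The job also has to be invoked from a workflow; a minimal stanza in the same config.yml looks like this:

workflows:
  run-evals:
    jobs:
      - eval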
The --fail-below flag sets the minimum passing score: if fewer than that percentage of scenarios pass, the eval run fails and the CI job fails with it. Choosing this threshold carefully matters. Start at 90% and raise it as your eval suite stabilizes. If some scenarios are expected to be occasionally flaky (LLM-judged assertions on ambiguous inputs, for example), tag them and exclude them from the threshold calculation.
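What tagging looks like depends on Agent Jig's config schema, which this guide does not spell out. As a purely illustrative sketch (the field names below are assumptions, not documented Agent Jig syntax), the idea is to mark flaky scenarios so they can be excluded from the threshold and to mark critical ones for the gating discussed later:

# Illustrative eval.yaml excerpt: field names are assumptions, not Agent Jig's documented schema
scenarios:
  - name: escalate_angry_customer
    tags: [critical]    # a regression here should always block
    assertions:
      - tool_called: create_ticket
  - name: refund_ambiguous_phrasing
    tags: [flaky]       # LLM-judged on ambiguous input; exclude from the --fail-below calculation
    assertions:
      - sentiment: empathetic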
When evals fail in CI, Agent Jig outputs a regression diff:
✗ Eval run failed (86/100 scenarios passing — below threshold 90%)

Regressions vs. baseline:

  FAIL escalate_angry_customer (was: PASS)
       assertion "tool_called: create_ticket" failed
       agent called "search_kb" instead

  FAIL refund_ambiguous_phrasing (was: PASS)
       assertion "sentiment: empathetic" failed
       actual sentiment score: 0.31 (threshold: 0.6)
This tells you exactly what regressed, not just that something went wrong. In most cases, you can see the root cause immediately — a prompt change that affected tool selection, a model update that changed tone, a new code path that short-circuits a behavior.
Not every eval failure should block a deploy. Some teams use a tiered approach: regressions in scenarios tagged critical block the merge outright, while score drops limited to non-critical scenarios only warn. The key insight: block on regressions, not just on total score. A drop from 97% to 94% on non-critical scenarios is less concerning than any regression in a scenario tagged critical.
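How you enforce that split depends on what filtering your eval tool exposes. As a sketch only, using hypothetical --tags and --fail-on-regression flags (illustrative names, not confirmed agent-jig options), the GitHub Actions "Run evals" step could be split into a hard gate on critical scenarios and a softer threshold on the rest:

      # --tags and --fail-on-regression below are hypothetical flag names used for illustration;
      # adapt them to whatever filtering and regression gating your eval tool actually provides.
      - name: Run critical evals (hard gate)
        run: |
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --tags critical \
            --fail-on-regression
        env:
          AGENT_JIG_API_KEY: ${{ secrets.AGENT_JIG_API_KEY }}
      - name: Run full suite (threshold gate)
        run: |
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --fail-below 90
        env:
          AGENT_JIG_API_KEY: ${{ secrets.AGENT_JIG_API_KEY }}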
Agent Jig Pro includes CI integration with GitHub Actions and CircleCI, regression diffs, and critical-scenario gating. Start your 14-day free trial.