Running evals manually before a deploy is better than nothing. Running them automatically on every commit is the only thing that scales.
This is a practical guide to wiring AI agent evals into your CI pipeline — step by step, for GitHub Actions and CircleCI, including how to handle failures and set sensible thresholds.
Before integrating evals into CI, you need the Agent Jig CLI, an eval config (eval.yaml), and an Agent Jig API key. Install the CLI:
pip install agent-jig # or npm install -g agent-jig
Verify it works locally first:
agent-jig run --config eval.yaml --agent-url http://localhost:8080
Add your Agent Jig API key as a CI secret. In GitHub Actions, add it as AGENT_JIG_API_KEY in your repository secrets (Settings → Secrets and variables → Actions).
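If you use the GitHub CLI, you can add the secret from a terminal instead of the web UI (this assumes gh is installed and authenticated). The workflow below also reads an OPENAI_API_KEY secret for the agent itself, so set that one too:

gh secret set AGENT_JIG_API_KEY   # prompts for the value, or accepts it piped on stdin
gh secret set OPENAI_API_KEY      # the agent process reads this in the workflow below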
.github/workflows/eval.yml:
name: Agent Evals

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Agent Jig
        run: pip install agent-jig
      - name: Start agent
        run: |
          python agent/server.py &
          sleep 5  # wait for server to start
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Run evals
        run: |
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --fail-below 90
        env:
          AGENT_JIG_API_KEY: ${{ secrets.AGENT_JIG_API_KEY }}
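One optional hardening: the fixed sleep 5 is fragile on slow runners. If your agent answers HTTP on some route, polling until it responds is more reliable. The /health path below is an assumption for illustration; point curl at whatever your server actually serves.

      - name: Start agent
        run: |
          python agent/server.py &
          # Poll for up to ~30 seconds until the server responds, instead of a fixed sleep.
          # /health is an assumed route; substitute an endpoint your agent really exposes.
          for i in $(seq 1 30); do
            curl -sf http://localhost:8080/health && break
            sleep 1
          done
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}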
.circleci/config.yml (eval job):
eval:
  docker:
    - image: cimg/python:3.11
  steps:
    - checkout
    - run:
        name: Install Agent Jig
        command: pip install agent-jig
    - run:
        name: Start agent
        command: python agent/server.py   # background: true keeps it running while later steps execute
        background: true
    - run:
        name: Run evals
        command: |
          sleep 5  # wait for the agent to come up before hitting it
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --fail-below 90
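Two CircleCI-specific notes. Secrets are not referenced per step the way they are in the GitHub Actions workflow above: add AGENT_JIG_API_KEY and OPENAI_API_KEY as project environment variables (Project Settings → Environment Variables) or via a context, and they are available to every step automatically. The job also has to be invoked from a workflow; a minimal stanza in the same config.yml looks like this:

workflows:
  run-evals:
    jobs:
      - eval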
The --fail-below flag sets the minimum passing score: if fewer than that percentage of scenarios pass, the eval run fails and the CI job fails with it. Choosing this threshold carefully matters. Start at 90% and raise it as your eval suite stabilizes. If some scenarios are expected to be occasionally flaky (LLM-judged assertions on ambiguous inputs, for example), tag them and exclude them from the threshold calculation.
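What tagging looks like depends on Agent Jig's config schema, which this guide does not spell out. As a purely illustrative sketch (the field names below are assumptions, not documented Agent Jig syntax), the idea is to mark flaky scenarios so they can be excluded from the threshold and to mark critical ones for the gating discussed later:

# Illustrative eval.yaml excerpt: field names are assumptions, not Agent Jig's documented schema
scenarios:
  - name: escalate_angry_customer
    tags: [critical]    # a regression here should always block
    assertions:
      - tool_called: create_ticket
  - name: refund_ambiguous_phrasing
    tags: [flaky]       # LLM-judged on ambiguous input; exclude from the --fail-below calculation
    assertions:
      - sentiment: empathetic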
When evals fail in CI, Agent Jig outputs a regression diff:
✗ Eval run failed (86/100 scenarios passing — below threshold 90%)

Regressions vs. baseline:

  FAIL escalate_angry_customer (was: PASS)
       assertion "tool_called: create_ticket" failed
       agent called "search_kb" instead

  FAIL refund_ambiguous_phrasing (was: PASS)
       assertion "sentiment: empathetic" failed
       actual sentiment score: 0.31 (threshold: 0.6)
This tells you exactly what regressed, not just that something went wrong. In most cases, you can see the root cause immediately — a prompt change that affected tool selection, a model update that changed tone, a new code path that short-circuits a behavior.
Not every eval failure should block a deploy. Some teams use a tiered approach: regressions in scenarios tagged critical block the merge outright, while score drops limited to non-critical scenarios only warn. The key insight: block on regressions, not just on total score. A drop from 97% to 94% on non-critical scenarios is less concerning than any regression in a scenario tagged critical.
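How you enforce that split depends on what filtering your eval tool exposes. As a sketch only, using hypothetical --tags and --fail-on-regression flags (illustrative names, not confirmed agent-jig options), the GitHub Actions "Run evals" step could be split into a hard gate on critical scenarios and a softer threshold on the rest:

      # --tags and --fail-on-regression below are hypothetical flag names used for illustration;
      # adapt them to whatever filtering and regression gating your eval tool actually provides.
      - name: Run critical evals (hard gate)
        run: |
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --tags critical \
            --fail-on-regression
        env:
          AGENT_JIG_API_KEY: ${{ secrets.AGENT_JIG_API_KEY }}
      - name: Run full suite (threshold gate)
        run: |
          agent-jig run \
            --config eval.yaml \
            --agent-url http://localhost:8080 \
            --baseline main \
            --fail-below 90
        env:
          AGENT_JIG_API_KEY: ${{ secrets.AGENT_JIG_API_KEY }}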
Agent Jig Pro includes CI integration with GitHub Actions and CircleCI, regression diffs, and critical-scenario gating. Start your 14-day free trial.