Your PR Passed. Your Agent Broke Checkout: Why CI Needs Action-Level Verification

A pull request goes green.

Unit tests pass. Integration tests pass. Snapshot diffs look clean. CI/CD stamps the change as safe. The agent that wrote the code opens a tidy summary: refactored checkout state handling, updated API contract, improved test coverage.

Then production tells a different story.

Users can add items to cart, but the cart disappears after login. Checkout renders, but the payment callback never marks the order as paid. The success page loads, but fulfillment never starts because the queue event shape changed by one field. Support sees failed purchases. Engineering sees green checks. Leadership sees revenue dropping while everyone argues that the PR was tested.

This is the new failure mode of AI-assisted development.

Code gets generated faster than user behavior gets verified.

That is the gap. And most teams are still treating it like a test coverage problem when it is actually a workflow verification problem.

Traditional testing is still useful. You still need unit tests, integration tests, static checks, and code review. But when AI coding agents can produce large, plausible, internally consistent changes at high speed, those layers stop being enough to establish confidence. They tell you the code satisfies local assertions. They do not tell you whether the business action succeeded end to end.

For critical PR validation, the real question is no longer just “Did the code behave as specified in isolated tests?” It is “Did the user actually complete the action, and did every system involved uphold the right invariants?”

That is action-level verification.

It means your CI should answer questions like:

Did signup complete and create the right account state?
Did cart state persist across auth, refresh, and device boundaries?
Did permissions hold after role changes and redirects?
Did the browser, API, database, queue, email system, and payment provider hand off correctly?
Did the side effects that matter to the business actually happen?

If your pipeline cannot answer those questions, your green build is weaker than it looks.

The problem: CI validates code paths, but users experience workflows

Most engineering teams built their testing strategy around the shape of the codebase.

Functions get unit tests.
Services get integration tests.
Components get snapshots.
APIs get contract tests.
Releases get a little manual QA if time allows.

That model made sense when developers were writing code incrementally and humans could reason through most changes by inspection. It makes less sense when an AI agent can update ten files, two API contracts, a serializer, a retry policy, and a state machine in one pass. The output may be syntactically correct, type-safe, and well-tested against mocked expectations. It can still break the actual workflow users care about.

Why? Because users do not experience your architecture in layers.

They click a button in a browser. A token gets refreshed. A backend endpoint mutates state. A queue publishes an event. A third-party payment system confirms a charge. A webhook comes back. A job updates fulfillment. A UI poll or redirect displays the final state.

That whole chain is the product.

Testing only the layers is like validating a car by separately testing the steering wheel, brakes, and engine mount. It tells you something. It does not prove the car can finish a turn at speed.

AI increases the frequency of this problem because it tends to optimize for local correctness. Agents are very good at making tests pass. They are much less reliable at preserving real-world behavior across hidden assumptions, system boundaries, and weird production-shaped timing.

That is exactly why conventional CI now creates false confidence.

Why current approaches fail

Unit tests are too narrow

Unit tests are valuable because they are fast, deterministic, and precise. They also only prove exactly what you asserted.

If your cart reducer preserves items in a local state object, that says nothing about whether cart state survives login, session refresh, server reconciliation, or currency normalization from the pricing API.

An AI agent can change the point where state is rehydrated, update the reducer tests, and still break the actual cart persistence flow.

A passing unit test suite often means one thing: the internal implementation still satisfies the assumptions encoded by developers in the past.

That is not the same as saying the user journey works today.

Integration tests often validate mocked systems, not real handoffs

Integration tests are supposed to reduce the gap between isolated logic and system behavior. In practice, many teams mock the parts that fail most often in production:

payment providers
queues
email delivery
webhooks
auth redirects
object storage
feature flag systems
background workers

This is understandable. Real dependencies are slow, flaky, expensive, and harder to seed in CI/CD. But once you mock the important handoffs, you are no longer testing the workflow. You are testing your assumptions about the workflow.

That distinction matters.

A payment integration test that asserts createCharge() is called with the right payload does not verify that a successful browser checkout results in an order in paid state after the webhook, job processing, and persistence layers complete.

It verifies a function call.

Your business depends on the state transition.
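
To make the gap concrete, here is a minimal Python sketch. Everything in it is hypothetical: `Order`, `checkout`, `handle_webhook`, and the `status`/`state` payload keys are illustrative names, not a real payment SDK. The mock-level check passes because the function call has the right shape. The action-level check fails because the webhook handler reads the wrong field and the order never leaves pending.

```python
from dataclasses import dataclass
from unittest.mock import Mock


@dataclass
class Order:
    id: str
    status: str = "pending"


def checkout(order: Order, payment_client) -> None:
    # Hypothetical checkout step: submit the charge and leave the order
    # pending until the provider's webhook confirms it.
    payment_client.create_charge(order_id=order.id, amount_cents=4900)


def handle_webhook(order: Order, event: dict) -> None:
    # Deliberately broken handler: it reads "state", but the (assumed)
    # provider payload carries the result under "status".
    if event.get("state") == "succeeded":
        order.status = "paid"


def mock_level_check() -> bool:
    """Passes as long as the function call has the right shape."""
    client = Mock()
    order = Order(id="ord_1")
    checkout(order, client)
    client.create_charge.assert_called_once_with(order_id="ord_1", amount_cents=4900)
    return True


def action_level_check() -> bool:
    """Passes only if the order actually ends up paid."""
    client = Mock()
    order = Order(id="ord_1")
    checkout(order, client)
    handle_webhook(order, {"order_id": "ord_1", "status": "succeeded"})
    return order.status == "paid"


if __name__ == "__main__":
    print("mock-level check passes:", mock_level_check())
    print("action-level check passes:", action_level_check())
```

Both checks exercise the same code. Only the second one asks the question the business cares about.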

Snapshot tests protect structure, not outcomes

Snapshot tests are especially dangerous as confidence theater. They catch rendering changes. They do not tell you whether the page did anything useful.

A checkout page can render perfectly while silently failing to attach the event handler that submits payment. An order confirmation view can match its snapshot while showing stale client state that never came from the backend.

Agents are particularly good at keeping snapshots green because they are good at preserving shape. Shape is not behavior.

Manual QA cannot keep up with agent velocity

Many teams compensate for weak automated verification with a human tester or a quick pre-merge walkthrough. That works until change volume spikes.

With AI-assisted development, more code lands faster. More branches are opened. More “small” changes touch wider surfaces. The burden shifts onto humans exactly when the system becomes harder to reason about by hand.

Manual QA is still useful for exploration. It is not a scalable primary defense against workflow regressions in high-velocity codebases.

CI/CD rewards what is measurable, not what matters

This is the uncomfortable part.

Most CI/CD pipelines are optimized around what is easy to automate:

linting
static analysis
unit test pass rate
coverage percentage
type checks
container builds
deploy previews

Those signals are useful. None of them directly answer whether checkout works.

And because they are fast and crisp, organizations over-trust them. Green pipelines become a substitute for confidence rather than a component of confidence.

That worked better before AI because humans naturally throttled complexity. Now a coding agent can generate broad changes faster than your test strategy can evolve, and the green pipeline becomes actively misleading.

The core insight: verify actions, not just assertions

Action-level verification starts from the user outcome and works backward.

Instead of asking, “Did each component do what its test expected?” ask, “Did the business action complete successfully under realistic conditions?”

For example, for checkout, the action is not:

button click fired
API returned 200
createOrder() was called
payment client was invoked

The action is:

a user added an item to cart
cart state persisted appropriately
login or guest flow behaved correctly
payment was submitted
provider callback or webhook was processed
order ended in the correct paid state
fulfillment side effect was triggered
the user saw the correct confirmation
no authorization or duplication invariant was violated

That is a bigger claim. It is also the claim your product actually needs.

Action-level verification is not a replacement for lower-level tests. It sits above them and catches what they systematically miss: regressions in orchestration, state continuity, async boundaries, and side effects.

This is especially important for AI-generated changes because agents often preserve local behavior while accidentally altering system choreography.

The code still “works.” The workflow breaks.

What action-level verification looks like in practice

You do not need a giant end-to-end pyramid or a brittle UI suite that takes 45 minutes to run. The goal is not to automate every click path. The goal is to validate a small set of business-critical actions in PR pipelines using production-shaped conditions.

For most products, that list is surprisingly short:

user can sign up
user can log in
user can add item to cart
user can complete checkout
user permissions restrict access correctly
user can create, edit, or publish the primary domain object
notifications and background jobs fire correctly
billing changes apply to account state

These are actions. They cross system boundaries. They matter.

A good action-level check usually has four properties:

Real browser interaction for the critical user path
Seeded environment with controlled but realistic data
Invariant checks on final business state, not just HTTP responses
Production-shaped side effects such as queues, jobs, webhooks, and auth flows

Let’s make this concrete.

Example: checkout verification with Playwright

Below is a simplified Playwright test that does more than click through UI. It verifies the final business result.

ts
import { test, expect } from '@playwright/test';

const baseURL = process.env.APP_BASE_URL!;
const adminToken = process.env.ADMIN_API_TOKEN!;

test('guest checkout completes and order is paid', async ({ page, request }) => {
  // Seed product and clean test customer state
  const seed = await request.post(`${baseURL}/test/seed-checkout`, {
    headers: { Authorization: `Bearer ${adminToken}` },
    data: {
      sku: 'sku_ci_checkout_001',
      priceCents: 4900,
      inventory: 10,
      customerEmail: 'ci-checkout@example.com'
    }
  });
  expect(seed.ok()).toBeTruthy();

  await page.goto(`${baseURL}/products/sku_ci_checkout_001`);
  await page.getByRole('button', { name: 'Add to cart' }).click();

  await page.goto(`${baseURL}/cart`);
  await expect(page.getByText('$49.00')).toBeVisible();

  await page.getByRole('button', { name: 'Checkout' }).click();
  await page.getByLabel('Email').fill('ci-checkout@example.com');
  await page.getByLabel('Card number').fill('4242424242424242');
  await page.getByLabel('Expiration').fill('12/34');
  await page.getByLabel('CVC').fill('123');
  await page.getByRole('button', { name: 'Pay now' }).click();

  await page.waitForURL(/order-confirmation/);
  await expect(page.getByText('Payment successful')).toBeVisible();

  // Verify backend state, not just UI success
  const orderRes = await request.get(`${baseURL}/test/orders/by-email/ci-checkout@example.com`, {
    headers: { Authorization: `Bearer ${adminToken}` }
  });
  expect(orderRes.ok()).toBeTruthy();

  const order = await orderRes.json();
  expect(order.status).toBe('paid');
  expect(order.totalCents).toBe(4900);
  expect(order.fulfillmentState).toBe('queued');

  // Verify event handoff happened
  const eventRes = await request.get(`${baseURL}/test/events`, {
    headers: { Authorization: `Bearer ${adminToken}` },
    params: { type: 'order.paid', orderId: order.id }
  });
  expect(eventRes.ok()).toBeTruthy();
  const events = await eventRes.json();
  expect(events.length).toBeGreaterThan(0);
});

This test does three important things many teams skip:

It uses a real browser flow.
It verifies final persisted state.
It checks that a downstream side effect occurred.

That is action-level verification.

Notice what it does not do. It does not mock the “difficult” part away. It does not stop at a 200 response. It does not assume the success page means the system is consistent.

Example: invariant checks in Python

Sometimes the right assertion is not tied to one endpoint. It is an invariant across systems. For example: after checkout, exactly one paid order exists, inventory decreased by one, and the customer does not have an orphaned pending payment.

python
import os
import requests

BASE_URL = os.environ["APP_BASE_URL"]
ADMIN_TOKEN = os.environ["ADMIN_API_TOKEN"]
HEADERS = {"Authorization": f"Bearer {ADMIN_TOKEN}"}


def fetch_checkout_state(email: str):
    r = requests.get(
        f"{BASE_URL}/test/checkout-state",
        headers=HEADERS,
        params={"email": email},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()


def assert_checkout_invariants(email: str):
    state = fetch_checkout_state(email)

    assert len(state["paid_orders"]) == 1, "Expected exactly one paid order"
    assert len(state["pending_payments"]) == 0, "Unexpected pending payments remain"
    assert state["inventory_delta"] == -1, "Inventory was not decremented correctly"
    assert state["fulfillment_jobs_enqueued"] >= 1, "Fulfillment was not triggered"
    assert state["latest_webhook_status"] == "processed", "Payment webhook not processed"


if __name__ == "__main__":
    assert_checkout_invariants("ci-checkout@example.com")
    print("Checkout invariants hold")

This kind of verification is powerful because it catches the weird failures that local tests miss:

duplicate order creation due to retries
inventory not updated on the eventual-consistency path
webhook accepted but never processed
paid UI shown while payment remains pending internally

Those are real production bugs. They are also exactly the kind of bugs that appear when agents modify orchestration code.

Seeded environments matter more than perfect mocks

If you want action-level verification in CI/CD, your environment strategy matters.

A lot of teams try to make end-to-end tests reliable by mocking everything unstable. That lowers flakiness, but it also lowers truth. Instead, aim for a seeded environment that is controlled and deterministic while still exercising real boundaries.

That usually means:

ephemeral preview or test environments per PR, or stable shared staging with isolation
seed endpoints or scripts that create realistic data fixtures
test-mode integrations for payment, email, and auth providers
deterministic queue processing or bounded waits for async jobs
reset hooks for idempotent reruns

A seed script might look like this in Node:

js
// Node 18+ ships a global fetch, so no import is needed here.

const baseURL = process.env.APP_BASE_URL;
const adminToken = process.env.ADMIN_API_TOKEN;

async function seedCheckoutFixture() {
  const res = await fetch(`${baseURL}/test/seed-checkout`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${adminToken}`
    },
    body: JSON.stringify({
      sku: 'sku_ci_checkout_001',
      title: 'CI Hoodie',
      priceCents: 4900,
      inventory: 25,
      coupon: 'CI10',
      paymentProviderMode: 'test'
    })
  });

  if (!res.ok) {
    throw new Error(`Failed seeding fixture: ${res.status}`);
  }

  return res.json();
}

seedCheckoutFixture()
  .then(data => {
    console.log('Seeded checkout fixture', data);
  })
  .catch(err => {
    console.error(err);
    process.exit(1);
  });

This is not glamorous infrastructure. It is worth more than another 300 unit tests around helper functions nobody buys your product for.

Production-shaped side effects are where many regressions hide

Checkout failures are often not UI bugs. They are handoff bugs.

Examples:

Browser submits payment intent with stale client secret.
API persists order before payment confirmation but never reconciles on webhook.
Queue consumer expects customer_id; agent changed payload to user_id.
Feature flag gates tax calculation in backend but not frontend.
Auth refresh happens during redirect and loses cart session binding.
Payment provider retries callback and creates duplicate fulfillment jobs.

These are not hypothetical. They are normal distributed-system failures.

Action-level verification should explicitly check the edges where systems hand off responsibility. In practice, that means asserting things like:

event emitted with expected semantic payload
downstream worker processed event
idempotency key prevented duplicates
permission boundary held across redirect or token refresh
persisted state matches UI-visible state
external callback transitioned entity to final state

If you only test code paths inside one process, you miss the product.
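
Two of those handoff assertions, payload contract and idempotency, can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the required field set, `missing_fields`, and `FulfillmentQueue` are hypothetical names, and a real consumer would sit behind your actual queue.

```python
# Assumed contract for the order.paid event; adjust to your real schema.
REQUIRED_ORDER_PAID_FIELDS = {"order_id", "customer_id", "total_cents"}


def missing_fields(event: dict) -> set:
    """Return the contract fields the event payload does not carry."""
    return REQUIRED_ORDER_PAID_FIELDS - set(event)


class FulfillmentQueue:
    """Toy consumer that enqueues at most one job per order."""

    def __init__(self) -> None:
        self.jobs = {}

    def handle(self, event: dict) -> None:
        gaps = missing_fields(event)
        if gaps:
            # A renamed field (customer_id -> user_id) fails loudly here
            # instead of silently producing a job nobody can fulfill.
            raise ValueError(f"order.paid payload missing {sorted(gaps)}")
        # order_id doubles as the idempotency key: setdefault keeps the
        # first job and ignores provider retries of the same event.
        self.jobs.setdefault(event["order_id"], {"customer_id": event["customer_id"]})


if __name__ == "__main__":
    queue = FulfillmentQueue()
    paid = {"order_id": "ord_1", "customer_id": "cus_1", "total_cents": 4900}
    queue.handle(paid)
    queue.handle(paid)  # simulated provider retry
    print("jobs enqueued:", len(queue.jobs))
    renamed = {"order_id": "ord_2", "user_id": "cus_1", "total_cents": 4900}
    print("missing in renamed payload:", missing_fields(renamed))
```

An action-level check would assert the same two facts against the real system: one job per paid order, and no event that the consumer cannot read.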

CI implementation: a lightweight but serious pipeline

You do not need to run a giant full-suite end-to-end matrix on every PR. A practical CI/CD strategy is layered:

Fast checks on every commit: lint, types, unit tests, focused integration tests
Action-level critical path checks on every PR affecting key surfaces
Broader workflow suites on main branch or pre-release
Production synthetic checks after deploy

Here is a GitHub Actions example:

yaml
name: pr-validation

on:
  pull_request:
    branches: [main]

jobs:
  fast-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
      - run: npm run test:unit
      - run: npm run test:integration

  action-checks:
    runs-on: ubuntu-latest
    needs: fast-checks
    # Runs only when the PR carries the 'run-action-checks' label; delete this line to run on every PR.
    if: contains(github.event.pull_request.labels.*.name, 'run-action-checks')
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Start app
        run: docker compose up -d --build
      - name: Wait for app
        run: ./scripts/wait-for-app.sh
      - name: Seed environment
        run: node scripts/seed-checkout.js
        env:
          APP_BASE_URL: http://localhost:3000
          ADMIN_API_TOKEN: ${{ secrets.ADMIN_API_TOKEN }}
      - name: Install Playwright browsers
        run: npx playwright install --with-deps chromium
      - name: Run action-level checks
        run: npx playwright test tests/action
        env:
          APP_BASE_URL: http://localhost:3000
          ADMIN_API_TOKEN: ${{ secrets.ADMIN_API_TOKEN }}
      - name: Upload traces
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: playwright-traces
          path: test-results/

The key idea is not the YAML. It is the prioritization.

You are promoting a handful of business-critical actions to first-class CI citizens.

That is a much better use of CI time than endlessly expanding low-signal assertions.

Tools comparison: what each layer is good for

No single tool solves this. You need a stack with different purposes.

Layer	Best for	Strengths	Blind spots
Unit tests	Pure logic, edge cases, fast feedback	Very fast, deterministic, precise	Miss orchestration and system boundaries
Integration tests	Service interactions inside controlled scope	Good for contract validation and data flow	Often over-mocked, weak on real side effects
Snapshot tests	Catching structural UI changes	Cheap regression signal	Very weak behavioral confidence
Playwright/Cypress browser checks	User workflows through real UI	Strong on realistic interaction and rendering	Can become flaky if environment is poor
API-level invariant tests	Business state validation across systems	Great for side effects and correctness	Need good observability/test hooks
Synthetic post-deploy checks	Production confidence	Validates real deployment behavior	Usually narrow and not PR-blocking

A few opinions, bluntly:

If you have to choose one browser automation tool for modern action-level verification, Playwright is usually the best default. It has strong debugging ergonomics, good tracing, and solid multi-browser support.
Cypress is still useful, especially for frontend-heavy teams, but many engineering organizations now prefer Playwright for broader workflow coverage and CI flexibility.
Pure API tests are not enough for user-critical flows that depend on browser behavior, auth redirects, cookies, or frontend state continuity.
Snapshot-heavy suites are one of the easiest ways to fool yourself.

The point is not to declare old tools obsolete. The point is to stop asking them to prove things they cannot prove.

Debugging action-level failures is actually better for developer productivity

Some teams resist end-to-end style verification because they assume failures will be harder to debug. That is only true when the tests are vague and the environment is opaque.

Well-built action-level checks often improve debugging and developer productivity because they fail at the level users feel.

Instead of “expected mocked function to be called once,” you get:

user reached payment page
payment submit succeeded
order remained pending
webhook event never transitioned state
fulfillment job not enqueued

That narrows the search dramatically.

To make this work, add debug outputs intentionally:

browser traces and videos
server logs correlated by request ID or test run ID
event stream snapshots
database state summaries
queue/job introspection endpoints for test environments
artifact upload on CI failure
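
Correlating logs by test run ID is the cheapest of these to build. The sketch below assumes the application writes one JSON object per log line and tags each record with a `test_run_id` field; both the field name and the `logs_for_run` helper are hypothetical.

```python
import json
from typing import Iterable, List


def logs_for_run(lines: Iterable[str], run_id: str) -> List[dict]:
    """Keep only the structured log records tagged with this test run."""
    records = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if isinstance(record, dict) and record.get("test_run_id") == run_id:
            records.append(record)
    return records


if __name__ == "__main__":
    sample = [
        "server listening on :3000",
        '{"test_run_id": "run-1", "msg": "order created"}',
        '{"test_run_id": "run-2", "msg": "unrelated request"}',
        '{"test_run_id": "run-1", "msg": "webhook received"}',
    ]
    for record in logs_for_run(sample, "run-1"):
        print(record["msg"])
```

Upload the filtered records next to the browser trace and a failed check arrives with its own server-side story attached.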

The best action-level tests are not just gates. They are executable debugging systems.

Practical patterns for adding action-level verification without making CI miserable

You do not need to boil the ocean. Start with practices that create high leverage.

1. Identify the top five business-critical actions

Ask a simple question: if this breaks in production, who gets paged or which metric drops?

That list is where to start. Usually:

signup
login
checkout
primary object creation
permission-sensitive action

2. Define success as a state transition, not a UI click

“Submit button worked” is not enough.

Prefer assertions like:

account created with expected role
document published and searchable
order paid and fulfillment queued
invitation accepted and permissions updated

3. Seed data explicitly

Do not rely on leftover environment state. Use fixtures, seed APIs, or scripts. Make tests rerunnable and isolated.
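
One way to get isolation on a shared environment is to scope every fixture to the CI run. The Python sketch below is an assumption-laden illustration: `fixture_email` and `checkout_seed_payload` are hypothetical helpers, and the payload fields simply mirror the seed request used earlier in this article. `GITHUB_RUN_ID` is a default GitHub Actions environment variable; other CI systems expose an equivalent.

```python
import hashlib
import os


def fixture_email(run_id: str, action: str) -> str:
    """Derive a stable, run-scoped customer email for one action check."""
    digest = hashlib.sha256(f"{run_id}:{action}".encode()).hexdigest()[:12]
    return f"ci-{action}-{digest}@example.com"


def checkout_seed_payload(run_id: str) -> dict:
    """Build the seed request body for the checkout fixture of this run."""
    return {
        "sku": "sku_ci_checkout_001",
        "priceCents": 4900,
        "inventory": 10,
        "customerEmail": fixture_email(run_id, "checkout"),
    }


if __name__ == "__main__":
    # Fall back to "local" so the same script works on a laptop.
    run_id = os.environ.get("GITHUB_RUN_ID", "local")
    print(checkout_seed_payload(run_id))
```

Because the email is derived rather than random, a rerun of the same job finds the same customer, and two concurrent PRs never collide.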

4. Minimize mocks on critical paths

Mocks are fine for peripheral dependencies. Avoid them on the handoffs most likely to break the business action.

5. Add test-only observability

A secure /test/* namespace in non-production environments can expose:

seeded entity lookup
event inspection
queue status
webhook processing status
invariant summaries

This dramatically improves reliability and debugging.
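
The word “secure” is doing real work in that sentence. A minimal guard might look like the Python sketch below. The environment names and the `allow_test_route` function are assumptions for illustration; wire the same rule into whatever middleware your framework provides.

```python
import hmac

# Assumed names for environments where /test/* may be served.
NON_PRODUCTION = {"ci", "preview", "staging"}


def allow_test_route(environment: str, presented_token: str, admin_token: str) -> bool:
    """Serve /test/* only outside production and only with the admin token."""
    if environment not in NON_PRODUCTION:
        return False
    if not admin_token:
        return False  # an unset token must never match an empty header
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(presented_token.encode(), admin_token.encode())


if __name__ == "__main__":
    print(allow_test_route("ci", "s3cret", "s3cret"))
    print(allow_test_route("production", "s3cret", "s3cret"))
```

The allow-list is deliberate: an unknown or misspelled environment name fails closed instead of exposing test hooks.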

6. Run only the right action checks per PR

Not every PR needs the full critical-path suite. Trigger checks based on changed paths, service ownership, or risk labels.

Examples:

auth changes trigger signup/login/permissions checks
checkout or pricing changes trigger cart/checkout checks
worker or event schema changes trigger side-effect verification
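
A path-based selector can be very small. The Python sketch below assumes a hypothetical repository layout and suite names; the mapping is the part you would own and tune.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from changed-path globs to action-check suites.
SUITES_BY_GLOB = {
    "services/auth/*": {"signup", "login", "permissions"},
    "services/checkout/*": {"cart", "checkout"},
    "services/pricing/*": {"cart", "checkout"},
    "workers/*": {"side-effects"},
    "schemas/events/*": {"side-effects"},
}


def suites_for(changed_files):
    """Return the sorted action-check suites triggered by a set of changed paths."""
    selected = set()
    for path in changed_files:
        for pattern, suites in SUITES_BY_GLOB.items():
            # fnmatch's "*" also matches "/", so nested files are covered.
            if fnmatchcase(path, pattern):
                selected |= suites
    return sorted(selected)


if __name__ == "__main__":
    changed = ["services/pricing/tax.py", "workers/fulfillment.py", "README.md"]
    print(suites_for(changed))
```

Feed it the PR's changed file list and pass the result to your test runner as a filter. When in doubt, err toward running a suite: a skipped check is a silent gap.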

7. Keep the suite small and ruthless

Ten excellent action-level checks beat 300 mediocre end-to-end scripts.

You are not trying to reproduce every manual QA scenario. You are trying to kill false confidence.

8. Verify invariants after the UI flow

This is where most teams stop too early. After the UI flow succeeds, always ask: What persisted? What was emitted? What reconciled? What was duplicated? What failed silently?

9. Design for idempotency and retries

If your workflows are async, your tests should tolerate eventual consistency but still assert correctness. Use bounded polling, deterministic retries, and idempotent seed/reset operations.
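
Bounded polling is the piece teams most often replace with a fixed sleep. Here is a minimal Python sketch; `poll_until` is a hypothetical helper, and the default timeout and interval are placeholders to tune for your own queues.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    done: Callable[[T], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> T:
    """Call fetch() until done(value) holds, or fail once the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        value = fetch()
        if done(value):
            return value
        if time.monotonic() >= deadline:
            # Include the last observed value so the failure is debuggable.
            raise TimeoutError(
                f"condition not met within {timeout_s}s; last value: {value!r}"
            )
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated order status that becomes "paid" on the third read.
    statuses = iter(["pending", "pending", "paid"])
    final = poll_until(lambda: next(statuses), lambda s: s == "paid", interval_s=0.01)
    print("final status:", final)
```

The test still asserts correctness: the order must reach paid. It just gives the webhook and the worker a bounded window to get there, and it reports the last observed state when they do not.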

10. Treat failures as product bugs, not flaky test annoyances

A flaky action-level test often reveals one of two things:

your test architecture is weak
your system is timing-sensitive in ways users already feel

Both are worth fixing.

Where AI coding agents make this urgent

None of this is only about AI. Distributed systems have always failed at the seams. But AI coding agents amplify the seam failures in three ways.

First, they increase change surface area. A single prompt can alter frontend state handling, backend serialization, and test fixtures together.

Second, they optimize toward passing existing checks. An agent that iterates until the pipeline goes green will satisfy whatever the pipeline measures. If CI rewards narrow assertions, the agent will satisfy narrow assertions.

Third, they produce changes that look internally coherent. That is the dangerous part. The code often appears clean, typed, and well-structured. Humans over-trust coherence.

This is why old testing rituals are now insufficient. They were built for a world where code volume and change breadth were naturally constrained by human effort. That constraint is gone.

So your validation has to move up a level.

Not more assertions about code internals. More verification about user-visible actions and business-critical state.

A simple mental model

If an executive asks, “How do we know this PR won’t break checkout?” and your answer is “the integration tests passed,” that is not good enough.

A better answer is:

we executed a real checkout in CI
we used seeded production-shaped data
we verified the order reached paid
we confirmed fulfillment was queued
we checked no duplicate payment or permission regression occurred

That is a meaningful statement.

It is also much closer to what reliability actually means.

Conclusion

The old model of PR validation assumes that if enough local assertions pass, the product is probably fine. In the age of AI-assisted development, that assumption breaks faster and more often.

Agents can make every unit, integration, and snapshot test pass while still shipping broken user flows. Not because testing is useless, but because most testing sits below the level where the failure actually happens.

Users do not care whether your mocked service returned the expected payload. They care whether signup worked, whether cart state persisted, whether permissions held, and whether payment completed.

Your CI/CD pipeline should care about the same things.

So keep the unit tests. Keep the integration tests. Keep the fast checks that support developer productivity.

But stop pretending they are enough for critical path confidence.

For the actions that matter to the business, verify the action.

Run the browser. Seed the environment. Check the final state. Inspect the side effects. Assert the invariants.

Because a green PR is only useful if the user can still buy the product after it merges.