Your PR Passed. Your Agent Broke Checkout: Why CI Needs Action-Level Verification
A pull request goes green.
Unit tests pass. Integration tests pass. Snapshot diffs look clean. CI/CD stamps the change as safe. The agent that wrote the code opens a tidy summary: refactored checkout state handling, updated API contract, improved test coverage.
Then production tells a different story.
Users can add items to cart, but the cart disappears after login. Checkout renders, but the payment callback never marks the order as paid. The success page loads, but fulfillment never starts because the queue event shape changed by one field. Support sees failed purchases. Engineering sees green checks. Leadership sees revenue dropping while everyone argues that the PR was tested.
This is the new failure mode of AI-assisted development.
Code gets generated faster than user behavior gets verified.
That is the gap. And most teams are still treating it like a testing coverage problem when it is actually a workflow verification problem.
Traditional testing is still useful. You still need unit tests, integration tests, static checks, and code review. But when AI coding agents can produce large, plausible, internally consistent changes at high speed, those layers stop being enough to establish confidence. They tell you the code satisfies local assertions. They do not tell you whether the business action succeeded end to end.
For critical PR validation, the real question is no longer just did the code behave as specified in isolated tests? It is did the user actually complete the action, and did every system involved uphold the right invariants?
That is action-level verification.
It means your CI should answer questions like:
- Did signup complete and create the right account state?
- Did cart state persist across auth, refresh, and device boundaries?
- Did permissions hold after role changes and redirects?
- Did the browser, API, database, queue, email system, and payment provider hand off correctly?
- Did the side effects that matter to the business actually happen?
If your pipeline cannot answer those questions, your green build is weaker than it looks.
The problem: CI validates code paths, but users experience workflows
Most engineering teams built their testing strategy around the shape of the codebase.
- Functions get unit tests.
- Services get integration tests.
- Components get snapshots.
- APIs get contract tests.
- Releases get a little manual QA if time allows.
That model made sense when developers were writing code incrementally and humans could reason through most changes by inspection. It makes less sense when an AI agent can update ten files, two API contracts, a serializer, a retry policy, and a state machine in one pass. The output may be syntactically correct, type-safe, and well-tested against mocked expectations. It can still break the actual workflow users care about.
Why? Because users do not experience your architecture in layers.
They click a button in a browser. A token gets refreshed. A backend endpoint mutates state. A queue publishes an event. A third-party payment system confirms a charge. A webhook comes back. A job updates fulfillment. A UI poll or redirect displays the final state.
That whole chain is the product.
Testing only the layers is like validating a car by separately testing the steering wheel, brakes, and engine mount. It tells you something. It does not prove the car can finish a turn at speed.
AI increases the frequency of this problem because it tends to optimize for local correctness. Agents are very good at making tests pass. They are much less reliable at preserving real-world behavior across hidden assumptions, system boundaries, and weird production-shaped timing.
That is exactly why conventional CI now creates false confidence.
Why current approaches fail
Unit tests are too narrow
Unit tests are valuable because they are fast, deterministic, and precise. They also only prove exactly what you asserted.
If your cart reducer preserves items in a local state object, that says nothing about whether cart state survives login, session refresh, server reconciliation, or currency normalization from the pricing API.
An AI agent can change the point where state is rehydrated, update the reducer tests, and still break the actual cart persistence flow.
A passing unit test suite often means one thing: the internal implementation still satisfies the assumptions encoded by developers in the past.
That is not the same as saying the user journey works today.
Integration tests often validate mocked systems, not real handoffs
Integration tests are supposed to reduce the gap between isolated logic and system behavior. In practice, many teams mock the parts that fail most often in production:
- payment providers n- queues
- email delivery
- webhooks
- auth redirects
- object storage
- feature flag systems
- background workers
This is understandable. Real dependencies are slow, flaky, expensive, and harder to seed in CI/CD. But once you mock the important handoffs, you are no longer testing the workflow. You are testing your assumptions about the workflow.
That distinction matters.
A payment integration test that asserts createCharge() is called with the right payload does not verify that a successful browser checkout results in an order in paid state after the webhook, job processing, and persistence layers complete.
It verifies a function call.
Your business depends on the state transition.
Snapshot tests protect structure, not outcomes
Snapshot tests are especially dangerous as confidence theater. They catch rendering changes. They do not tell you whether the page did anything useful.
A checkout page can render perfectly while silently failing to attach the event handler that submits payment. An order confirmation view can match its snapshot while showing stale client state that never came from the backend.
Agents are particularly good at keeping snapshots green because they are good at preserving shape. Shape is not behavior.
Manual QA cannot keep up with agent velocity
Many teams compensate for weak automated verification with a human tester or a quick pre-merge walkthrough. That works until change volume spikes.
With AI-assisted development, more code lands faster. More branches are opened. More “small” changes touch wider surfaces. The burden shifts onto humans exactly when the system becomes harder to reason about by hand.
Manual QA is still useful for exploration. It is not a scalable primary defense against workflow regressions in high-velocity codebases.
CI/CD rewards what is measurable, not what matters
This is the uncomfortable part.
Most CI/CD pipelines are optimized around what is easy to automate:
- linting
- static analysis
- unit test pass rate
- coverage percentage
- type checks
- container builds
- deploy previews
Those signals are useful. None of them directly answer whether checkout works.
And because they are fast and crisp, organizations over-trust them. Green pipelines become a substitute for confidence rather than a component of confidence.
That worked better before AI because humans naturally throttled complexity. Now a coding agent can generate broad changes faster than your test strategy can evolve, and the green pipeline becomes actively misleading.
The core insight: verify actions, not just assertions
Action-level verification starts from the user outcome and works backward.
Instead of asking, “Did each component do what its test expected?” ask, “Did the business action complete successfully under realistic conditions?”
For example, for checkout, the action is not:
- button click fired
- API returned 200
createOrder()was called- payment client was invoked
The action is:
- a user added an item to cart
- cart state persisted appropriately
- login or guest flow behaved correctly
- payment was submitted
- provider callback or webhook was processed
- order ended in the correct paid state
- fulfillment side effect was triggered
- the user saw the correct confirmation
- no authorization or duplication invariant was violated
That is a bigger claim. It is also the claim your product actually needs.
Action-level verification is not a replacement for lower-level tests. It sits above them and catches what they systematically miss: regressions in orchestration, state continuity, async boundaries, and side effects.
This is especially important for AI-generated changes because agents often preserve local behavior while accidentally altering system choreography.
The code still “works.” The workflow breaks.
What action-level verification looks like in practice
You do not need a giant end-to-end pyramid or a brittle UI suite that takes 45 minutes to run. The goal is not to automate every click path. The goal is to validate a small set of business-critical actions in PR pipelines using production-shaped conditions.
For most products, that list is surprisingly short:
- user can sign up
- user can log in
- user can add item to cart
- user can complete checkout
- user permissions restrict access correctly
- user can create, edit, or publish the primary domain object
- notifications and background jobs fire correctly
- billing changes apply to account state
These are actions. They cross system boundaries. They matter.
A good action-level check usually has four properties:
- Real browser interaction for the critical user path
- Seeded environment with controlled but realistic data
- Invariant checks on final business state, not just HTTP responses
- Production-shaped side effects such as queues, jobs, webhooks, and auth flows
Let’s make this concrete.
Example: checkout verification with Playwright
Below is a simplified Playwright test that does more than click through UI. It verifies the final business result.
tsimport { test, expect } from '@playwright/test'; const baseURL = process.env.APP_BASE_URL!; const adminToken = process.env.ADMIN_API_TOKEN!; test('guest checkout completes and order is paid', async ({ page, request }) => { // Seed product and clean test customer state const seed = await request.post(`${baseURL}/test/seed-checkout`, { headers: { Authorization: `Bearer ${adminToken}` }, data: { sku: 'sku_ci_checkout_001', priceCents: 4900, inventory: 10, customerEmail: 'ci-checkout@example.com' } }); expect(seed.ok()).toBeTruthy(); await page.goto(`${baseURL}/products/sku_ci_checkout_001`); await page.getByRole('button', { name: 'Add to cart' }).click(); await page.goto(`${baseURL}/cart`); await expect(page.getByText('$49.00')).toBeVisible(); await page.getByRole('button', { name: 'Checkout' }).click(); await page.getByLabel('Email').fill('ci-checkout@example.com'); await page.getByLabel('Card number').fill('4242424242424242'); await page.getByLabel('Expiration').fill('12/34'); await page.getByLabel('CVC').fill('123'); await page.getByRole('button', { name: 'Pay now' }).click(); await page.waitForURL(/order-confirmation/); await expect(page.getByText('Payment successful')).toBeVisible(); // Verify backend state, not just UI success const orderRes = await request.get(`${baseURL}/test/orders/by-email/ci-checkout@example.com`, { headers: { Authorization: `Bearer ${adminToken}` } }); expect(orderRes.ok()).toBeTruthy(); const order = await orderRes.json(); expect(order.status).toBe('paid'); expect(order.totalCents).toBe(4900); expect(order.fulfillmentState).toBe('queued'); // Verify event handoff happened const eventRes = await request.get(`${baseURL}/test/events`, { headers: { Authorization: `Bearer ${adminToken}` }, params: { type: 'order.paid', orderId: order.id } }); const events = await eventRes.json(); expect(events.length).toBeGreaterThan(0); });
This test does three important things many teams skip:
- It uses a real browser flow.
- It verifies final persisted state.
- It checks that a downstream side effect occurred.
That is action-level verification.
Notice what it does not do. It does not mock the “difficult” part away. It does not stop at a 200 response. It does not assume the success page means the system is consistent.
Example: invariant checks in Python
Sometimes the right assertion is not tied to one endpoint. It is an invariant across systems. For example: after checkout, exactly one paid order exists, inventory decreased by one, and the customer does not have an orphaned pending payment.
pythonimport os import requests BASE_URL = os.environ["APP_BASE_URL"] ADMIN_TOKEN = os.environ["ADMIN_API_TOKEN"] HEADERS = {"Authorization": f"Bearer {ADMIN_TOKEN}"} def fetch_checkout_state(email: str): r = requests.get( f"{BASE_URL}/test/checkout-state", headers=HEADERS, params={"email": email}, timeout=10, ) r.raise_for_status() return r.json() def assert_checkout_invariants(email: str): state = fetch_checkout_state(email) assert len(state["paid_orders"]) == 1, "Expected exactly one paid order" assert len(state["pending_payments"]) == 0, "Unexpected pending payments remain" assert state["inventory_delta"] == -1, "Inventory was not decremented correctly" assert state["fulfillment_jobs_enqueued"] >= 1, "Fulfillment was not triggered" assert state["latest_webhook_status"] == "processed", "Payment webhook not processed" if __name__ == "__main__": assert_checkout_invariants("ci-checkout@example.com") print("Checkout invariants hold")
This kind of verification is powerful because it catches the weird failures that local tests miss:
- duplicate order creation due to retries
- inventory not updated on the eventual-consistency path
- webhook accepted but never processed
- paid UI shown while payment remains pending internally
Those are real production bugs. They are also exactly the kind of bugs that appear when agents modify orchestration code.
Seeded environments matter more than perfect mocks
If you want action-level verification in CI/CD, your environment strategy matters.
A lot of teams try to make end-to-end tests reliable by mocking everything unstable. That lowers flakiness, but it also lowers truth. Instead, aim for a seeded environment that is controlled and deterministic while still exercising real boundaries.
That usually means:
- ephemeral preview or test environments per PR, or stable shared staging with isolation
- seed endpoints or scripts that create realistic data fixtures
- test-mode integrations for payment, email, and auth providers
- deterministic queue processing or bounded waits for async jobs
- reset hooks for idempotent reruns
A seed script might look like this in Node:
jsimport fetch from 'node-fetch'; const baseURL = process.env.APP_BASE_URL; const adminToken = process.env.ADMIN_API_TOKEN; async function seedCheckoutFixture() { const res = await fetch(`${baseURL}/test/seed-checkout`, { method: 'POST', headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${adminToken}` }, body: JSON.stringify({ sku: 'sku_ci_checkout_001', title: 'CI Hoodie', priceCents: 4900, inventory: 25, coupon: 'CI10', paymentProviderMode: 'test' }) }); if (!res.ok) { throw new Error(`Failed seeding fixture: ${res.status}`); } return res.json(); } seedCheckoutFixture() .then(data => { console.log('Seeded checkout fixture', data); }) .catch(err => { console.error(err); process.exit(1); });
This is not glamorous infrastructure. It is worth more than another 300 unit tests around helper functions nobody buys your product for.
Production-shaped side effects are where many regressions hide
Checkout failures are often not UI bugs. They are handoff bugs.
Examples:
- Browser submits payment intent with stale client secret.
- API persists order before payment confirmation but never reconciles on webhook.
- Queue consumer expects
customer_id; agent changed payload touser_id. - Feature flag gates tax calculation in backend but not frontend.
- Auth refresh happens during redirect and loses cart session binding.
- Payment provider retries callback and creates duplicate fulfillment jobs.
These are not hypothetical. They are normal distributed-system failures.
Action-level verification should explicitly check the edges where systems hand off responsibility. In practice, that means asserting things like:
- event emitted with expected semantic payload
- downstream worker processed event
- idempotency key prevented duplicates
- permission boundary held across redirect or token refresh
- persisted state matches UI-visible state
- external callback transitioned entity to final state
If you only test code paths inside one process, you miss the product.
CI implementation: a lightweight but serious pipeline
You do not need to run a giant full-suite end-to-end matrix on every PR. A practical CI/CD strategy is layered:
- Fast checks on every commit: lint, types, unit tests, focused integration tests
- Action-level critical path checks on every PR affecting key surfaces
- Broader workflow suites on main branch or pre-release
- Production synthetic checks after deploy
Here is a GitHub Actions example:
yamlname: pr-validation on: pull_request: branches: [main] jobs: fast-checks: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-node@v4 with: node-version: 20 - run: npm ci - run: npm run lint - run: npm run typecheck - run: npm run test:unit - run: npm run test:integration action-checks: runs-on: ubuntu-latest needs: fast-checks if: contains(join(github.event.pull_request.labels.*.name, ','), 'run-action-checks') || true steps: - uses: actions/checkout@v4 - uses: actions/setup-node@v4 with: node-version: 20 - run: npm ci - name: Start app run: docker compose up -d --build - name: Wait for app run: ./scripts/wait-for-app.sh - name: Seed environment run: node scripts/seed-checkout.js env: APP_BASE_URL: http://localhost:3000 ADMIN_API_TOKEN: ${{ secrets.ADMIN_API_TOKEN }} - name: Install Playwright browsers run: npx playwright install --with-deps chromium - name: Run action-level checks run: npx playwright test tests/action env: APP_BASE_URL: http://localhost:3000 ADMIN_API_TOKEN: ${{ secrets.ADMIN_API_TOKEN }} - name: Upload traces if: failure() uses: actions/upload-artifact@v4 with: name: playwright-traces path: test-results/
The key idea is not the YAML. It is the prioritization.
You are promoting a handful of business-critical actions to first-class CI citizens.
That is a much better use of CI time than endlessly expanding low-signal assertions.
Tools comparison: what each layer is good for
No single tool solves this. You need a stack with different purposes.
| Layer | Best for | Strengths | Blind spots |
|---|---|---|---|
| Unit tests | Pure logic, edge cases, fast feedback | Very fast, deterministic, precise | Miss orchestration and system boundaries |
| Integration tests | Service interactions inside controlled scope | Good for contract validation and data flow | Often over-mocked, weak on real side effects |
| Snapshot tests | Catching structural UI changes | Cheap regression signal | Very weak behavioral confidence |
| Playwright/Cypress browser checks | User workflows through real UI | Strong on realistic interaction and rendering | Can become flaky if environment is poor |
| API-level invariant tests | Business state validation across systems | Great for side effects and correctness | Need good observability/test hooks |
| Synthetic post-deploy checks | Production confidence | Validates real deployment behavior | Usually narrow and not PR-blocking |
A few opinions, bluntly:
- If you have to choose one browser automation tool for modern action-level verification, Playwright is usually the best default. It has strong debugging ergonomics, good tracing, and solid multi-browser support.
- Cypress is still useful, especially for frontend-heavy teams, but many engineering organizations now prefer Playwright for broader workflow coverage and CI flexibility.
- Pure API tests are not enough for user-critical flows that depend on browser behavior, auth redirects, cookies, or frontend state continuity.
- Snapshot-heavy suites are one of the easiest ways to fool yourself.
The point is not to declare old tools obsolete. The point is to stop asking them to prove things they cannot prove.
Debugging action-level failures is actually better for developer productivity
Some teams resist end-to-end style verification because they assume failures will be harder to debug. That is only true when the tests are vague and the environment is opaque.
Well-built action-level checks often improve debugging and developer productivity because they fail at the level users feel.
Instead of “expected mocked function to be called once,” you get:
- user reached payment page
- payment submit succeeded
- order remained
pending - webhook event never transitioned state
- fulfillment job not enqueued
That narrows the search dramatically.
To make this work, add debug outputs intentionally:
- browser traces and videos
- server logs correlated by request ID or test run ID
- event stream snapshots
- database state summaries
- queue/job introspection endpoints for test environments
- artifact upload on CI failure
The best action-level tests are not just gates. They are executable debugging systems.
Practical patterns for adding action-level verification without making CI miserable
You do not need to boil the ocean. Start with practices that create high leverage.
1. Identify the top five business-critical actions
Ask a simple question: if this breaks in production, who gets paged or which metric drops?
That list is where to start. Usually:
- signup
- login
- checkout
- primary object creation
- permission-sensitive action
2. Define success as a state transition, not a UI click
“Submit button worked” is not enough.
Prefer assertions like:
- account created with expected role
- document published and searchable
- order paid and fulfillment queued
- invitation accepted and permissions updated
3. Seed data explicitly
Do not rely on leftover environment state. Use fixtures, seed APIs, or scripts. Make tests rerunnable and isolated.
4. Minimize mocks on critical paths
Mocks are fine for peripheral dependencies. Avoid them on the handoffs most likely to break the business action.
5. Add test-only observability
A secure /test/* namespace in non-production environments can expose:
- seeded entity lookup
- event inspection
- queue status
- webhook processing status
- invariant summaries
This dramatically improves reliability and debugging.
6. Run only the right action checks per PR
Not every PR needs the full critical-path suite. Trigger checks based on changed paths, service ownership, or risk labels.
Examples:
- auth changes trigger signup/login/permissions checks
- checkout or pricing changes trigger cart/checkout checks
- worker or event schema changes trigger side-effect verification
7. Keep the suite small and ruthless
Ten excellent action-level checks beat 300 mediocre end-to-end scripts.
You are not trying to reproduce every manual QA scenario. You are trying to kill false confidence.
8. Verify invariants after the UI flow
This is where most teams stop too early. Always ask: what persisted? what emitted? what reconciled? what duplicated? what failed silently?
9. Design for idempotency and retries
If your workflows are async, your tests should tolerate eventual consistency but still assert correctness. Use bounded polling, deterministic retries, and idempotent seed/reset operations.
10. Treat failures as product bugs, not flaky test annoyances
A flaky action-level test often reveals one of two things:
- your test architecture is weak
- your system is timing-sensitive in ways users already feel
Both are worth fixing.
Where AI coding agents make this urgent
None of this is only about AI. Distributed systems have always failed at the seams. But AI coding agents amplify the seam failures in three ways.
First, they increase change surface area. A single prompt can alter frontend state handling, backend serialization, and test fixtures together.
Second, they optimize toward passing existing checks. Agents learn quickly what the pipeline rewards. If CI rewards narrow assertions, the agent will satisfy narrow assertions.
Third, they produce changes that look internally coherent. That is the dangerous part. The code often appears clean, typed, and well-structured. Humans over-trust coherence.
This is why old testing rituals are now insufficient. They were built for a world where code volume and change breadth were naturally constrained by human effort. That constraint is gone.
So your validation has to move up a level.
Not more assertions about code internals. More verification about user-visible actions and business-critical state.
A simple mental model
If an executive asked, “How do we know this PR won’t break checkout?” and your answer is “the integration tests passed,” that is not good enough.
A better answer is:
- we executed a real checkout in CI
- we used seeded production-shaped data
- we verified the order reached
paid - we confirmed fulfillment was queued
- we checked no duplicate payment or permission regression occurred
That is a meaningful statement.
It is also much closer to what reliability actually means.
Conclusion
The old model of PR validation assumes that if enough local assertions pass, the product is probably fine. In the age of AI-assisted development, that assumption breaks faster and more often.
Agents can make every unit, integration, and snapshot test pass while still shipping broken user flows. Not because testing is useless, but because most testing sits below the level where the failure actually happens.
Users do not care whether your mocked service returned the expected payload. They care whether signup worked, whether cart state persisted, whether permissions held, and whether payment completed.
Your CI/CD pipeline should care about the same things.
So keep the unit tests. Keep the integration tests. Keep the fast checks that support developer productivity.
But stop pretending they are enough for critical path confidence.
For the actions that matter to the business, verify the action.
Run the browser. Seed the environment. Check the final state. Inspect the side effects. Assert the invariants.
Because a green PR is only useful if the user can still buy the product after it merges.
