A signup flow passed every unit test, every API contract test, and every CI check. The pull request was green. The deploy went out on schedule.

Then sales asked why new trial users weren’t showing up in the CRM.

The answer wasn’t in the signup endpoint. It wasn’t in the frontend validation logic either. The app accepted the form, created the user record, returned a 200, and redirected to the dashboard exactly as designed. But one field name had changed in a downstream event payload. The billing system still provisioned the account. The email provider still sent a welcome message. The CRM sync rejected the event silently because the schema no longer matched. No alert fired. No test failed. The workflow broke at the handoff.

That is what modern production failures look like.

As AI agents generate more application code, this class of bug gets more common, not less. Agents are very good at making local changes that satisfy nearby tests. They are much worse at understanding the operational consequences of changing how data, state, and timing move across systems. A refactor that looks harmless inside a repo can break the exact point where your product depends on coordination: frontend to backend, app to queue, auth to billing, form submit to email, CRM, analytics, and webhook chains.

The uncomfortable truth is that most teams still test code paths while users depend on workflow paths. That gap is where “green” CI/CD pipelines create false confidence.

The new failure surface is the boundary, not the function

Traditional testing assumes the dangerous parts of the system live inside the code you directly control. So teams write unit tests for business logic, integration tests for endpoints, and UI tests for happy-path interactions. Those layers matter. But they mostly validate components in isolation.

Users do not experience your product in isolation.

A real workflow spans systems with different schemas, consistency models, retry behavior, authorization scopes, and operational guarantees. Consider a simple “start trial” action:

User fills a frontend form.
Browser sends a request to your backend.
Backend creates a user and organization.
Auth provider issues session state.
Billing provider provisions a trial subscription.
Event is published to a queue.
Worker enriches the account.
CRM contact is created.
Welcome email is triggered.
Analytics event is emitted.
Internal Slack notification is sent.

From a customer perspective, that is one action. From your system’s perspective, it is a chain of handoffs.

Any one of these can fail in ways your API tests will never catch:

The backend returns success before the queue publish actually commits.
The worker retries and creates duplicate CRM records.
The auth token lacks a scope required by billing in production only.
The webhook consumer accepts malformed payloads in staging but rejects them behind stricter production validation.
The frontend navigates away before an idempotency key is persisted.
A field rename keeps the email flow working but breaks downstream segmentation.
A background job succeeds after the user has already hit an error screen and retried, creating split state.

None of this is theoretical. These are standard distributed system failures, hidden behind otherwise healthy components.

The more systems you compose, the more your reliability depends on verifying the transitions between them.

Why agent-generated changes amplify handoff failures

AI-assisted development is not the root cause. It is a force multiplier.

Agents optimize for what they can see and measure. In most repositories, what they can see is local code, local tests, static types, and maybe some mocked integrations. That means they are naturally biased toward changes that preserve local correctness while accidentally weakening global behavior.

Here are a few common ways this happens.

1. Agents refactor semantics, not just syntax

An agent updates a serializer, extracts a shared utility, renames a field, or standardizes error handling. All of those can be reasonable changes. But in workflow-heavy systems, semantics matter more than formatting.

Changing customerId to customer_id might be harmless inside one service and catastrophic at a webhook boundary.

Changing the timing of when an event is emitted might still satisfy endpoint tests while breaking every consumer that assumed the record already existed.
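One way to make a rename like this visible before it ships is to snapshot the keys of each outgoing payload and diff them in a test. The sketch below is illustrative only: `payload_shape`, `shape_diff`, and the sample payloads are hypothetical names, not part of any library or of the system described here.

```python
def payload_shape(payload, prefix=""):
    """Return the set of dotted key paths in a nested dict payload."""
    paths = set()
    for key, value in payload.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested objects.
            paths |= payload_shape(value, prefix=f"{path}.")
        else:
            paths.add(path)
    return paths


def shape_diff(old_payload, new_payload):
    """Report key paths that disappeared or appeared between two payload versions."""
    old_shape = payload_shape(old_payload)
    new_shape = payload_shape(new_payload)
    return {
        "removed": sorted(old_shape - new_shape),
        "added": sorted(new_shape - old_shape),
    }


# Hypothetical before/after payloads for the same event.
before = {"customerId": "c_1", "plan": "trial", "org": {"id": "o_1"}}
after = {"customer_id": "c_1", "plan": "trial", "org": {"id": "o_1"}}

diff = shape_diff(before, after)
print(diff)
```

A non-empty `removed` list on a payload that crosses a service boundary is a reasonable trigger for a human to check every consumer before merging.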

2. Agents satisfy mocks that don’t behave like production

A test double often encodes an idealized integration:

instant responses
stable schema
no retries
no rate limits
no partial failures
no eventual consistency

An agent can make tests greener by aligning code more tightly to those mocks. In production, the real system is slower, stricter, or simply different.
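A test double can be made less idealized by giving it some of the behaviors listed above. The following is a minimal sketch, assuming a hypothetical CRM client with a single `upsert_contact` method. The class name, exception types, and field lists are invented for illustration.

```python
class RateLimited(Exception):
    """Raised by the fake to imitate a provider throttling response."""


class SchemaRejected(Exception):
    """Raised by the fake when the payload has unexpected or missing fields."""


class StrictCrmFake:
    """In-memory CRM stand-in that enforces a schema and fails the first N calls."""

    REQUIRED = {"external_id", "email"}
    OPTIONAL = {"company"}

    def __init__(self, fail_first=0):
        self.fail_first = fail_first
        self.calls = 0
        self.contacts = {}

    def upsert_contact(self, **fields):
        self.calls += 1
        if self.calls <= self.fail_first:
            raise RateLimited("simulated throttling")
        unknown = set(fields) - self.REQUIRED - self.OPTIONAL
        missing = self.REQUIRED - set(fields)
        if unknown or missing:
            raise SchemaRejected(f"unknown={sorted(unknown)} missing={sorted(missing)}")
        # Keyed by external_id, so a redelivery does not duplicate the contact.
        self.contacts[fields["external_id"]] = fields


def sync_with_retry(crm, attempts=3, **fields):
    """Retry only on throttling; schema errors propagate immediately."""
    for attempt in range(1, attempts + 1):
        try:
            crm.upsert_contact(**fields)
            return attempt
        except RateLimited:
            if attempt == attempts:
                raise
```

A fake like this fails loudly on a renamed field, which is the behavior the idealized mock hides.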

3. Agents create broad changes with shallow understanding

A senior engineer might recognize that a billing callback and a CRM sync both depend on the same event envelope. An agent often won’t, unless those relationships are explicit and close by in code.

This is the core risk: changes with a larger surface area, shipped faster, into systems whose true contracts are social, implicit, or distributed across teams.

4. Agents increase deployment frequency without increasing verification depth

More code gets merged. More PRs are green. More changes reach production. If your testing strategy still centers on component-level validation, you just increased the rate at which boundary failures are introduced.

That is not an indictment of AI tools. It is a reminder that debugging and testing strategies must evolve with how code is produced.

Why current approaches fail: CI, unit tests, and manual QA

Most teams already have “good coverage” by conventional definitions. They still get burned by broken workflows because the common layers of validation stop too early.

CI/CD validates build integrity, not operational completion

A standard CI/CD pipeline answers questions like:

Does the code compile?
Do unit tests pass?
Does the API return the expected response?
Can the container build?
Can we deploy safely?

Those are useful gates. They are not proof that a user action completes.

A pipeline can be perfectly green while a critical workflow silently stalls after the first hop.

Here is a familiar CI config that looks solid and still misses the real risk:

yaml
name: ci

on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
      redis:
        image: redis:7
        ports:
          - 6379:6379

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm run test
      - run: npm run test:integration
      - run: npm run build

This pipeline proves local integrity. It does not prove that “user submits form and account appears in billing, CRM, and email systems without duplication or silent loss.”

That difference matters more than ever.

Unit tests are precise but narrow

Unit tests are excellent for logic. They are terrible proxies for workflows.

If the bug lives in a transformation, retry policy, missing idempotency key, race condition, or downstream schema mismatch, your unit tests can all pass while customers are stuck in limbo.

A unit test for an event publisher might look like this:

javascript
import { publishSignupEvent } from './events';
import { queue } from './queue';

jest.mock('./queue', () => ({
  queue: { publish: jest.fn().mockResolvedValue({ ok: true }) }
}));

test('publishes signup event', async () => {
  await publishSignupEvent({
    userId: 'u_123',
    email: 'user@example.com',
    plan: 'trial'
  });

  expect(queue.publish).toHaveBeenCalledWith('user.signup', {
    userId: 'u_123',
    email: 'user@example.com',
    plan: 'trial'
  });
});

This is fine as far as it goes. But it tells you nothing about:

whether the event survives serialization
whether consumers expect plan_type instead of plan
whether the publish is transactional with the user record creation
whether retries create duplicates
whether downstream systems reject the payload
whether the user sees success before the workflow actually completes

Unit tests can confirm that a function behaved correctly under assumptions. They cannot validate whether the assumptions hold across systems.
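One of the gaps above, whether consumers expect plan_type instead of plan, can be narrowed by running the producer's serialized output through the consumer's own parsing code in a test. This is a sketch under assumptions: `SignupEvent` and `parse_signup_event` stand in for whatever a real consumer uses, and the consumer's expectation of `plan_type` is hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SignupEvent:
    """Hypothetical consumer-side view of the event; note it expects plan_type."""
    userId: str
    email: str
    plan_type: str


def parse_signup_event(raw):
    """Parse a JSON message strictly: fail on missing or unexpected fields."""
    data = json.loads(raw)
    expected = {f.name for f in fields(SignupEvent)}
    missing = expected - set(data)
    unexpected = set(data) - expected
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
    return SignupEvent(**data)


# What the producer in the unit test above would put on the wire.
produced = json.dumps({"userId": "u_123", "email": "user@example.com", "plan": "trial"})

try:
    parse_signup_event(produced)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

The producer's unit test still passes. The mismatch only shows up when both sides of the boundary are exercised together.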

Manual QA checks interfaces, not system truth

Manual QA has a place, especially for exploratory work. But workflow failures are often asynchronous, conditional, and delayed.

The QA tester sees:

form submits successfully
UI redirects correctly
success toast appears
dashboard loads

What they don’t necessarily see:

lead missing in the CRM 3 minutes later
duplicate charge after a retry storm
webhook delivery dead-lettered
account provisioned but email never sent
billing trial created without the entitlement row your app needs

Human testing is not built to continuously verify invisible handoffs. That is an automation problem.

The core insight: stop testing requests, start testing actions

The right abstraction is not “did endpoint X return 200?”

It is “did the user action complete all required handoffs?”

That shift sounds small, but it changes what a passing test has to prove.

A request is local. An action is distributed.

An action-level test starts at the interface the user or operator touches, triggers the real workflow, and verifies the side effects at each boundary that matter. It treats success as operational completion, not merely application acknowledgment.

This is what modern reliability work needs to look like:

Start from a user action.
Observe every critical handoff.
Verify completion across systems.
Detect stalls, drops, duplicates, and misordered side effects.
Run under production-like timing and auth assumptions.

If you do that, you stop being fooled by green builds that only validate the first hop of the workflow.

What action-level verification looks like in practice

You do not need to simulate the entire internet. You need to define the workflows that matter most and prove the critical handoffs complete.

For a signup flow, your verification might assert:

frontend form submission succeeds
backend user row exists
organization row exists
billing trial exists with expected plan
queue event is emitted exactly once
worker processes event successfully
CRM contact exists with correct identifiers
welcome email trigger is recorded
analytics event appears once

That is the difference between testing your app and testing your business process.

Example: a brittle implementation that passes local tests

Here is a simplified Node.js handler that looks reasonable but can fail at the workflow boundary.

javascript
app.post('/api/signup', async (req, res) => {
  const { email, companyName, plan } = req.body;

  const user = await db.user.create({ data: { email } });
  const org = await db.organization.create({ data: { name: companyName, ownerId: user.id } });

  await billing.createTrial({
    email,
    externalCustomerId: org.id,
    plan
  });

  queue.publish('user.signup', {
    userId: user.id,
    orgId: org.id,
    email,
    plan
  });

  res.status(200).json({ ok: true, userId: user.id, orgId: org.id });
});

Problems:

queue.publish is not awaited.
There is no transactional outbox or handoff guarantee.
Billing succeeds before event publication is confirmed, so a failed publish leaves a trial that no downstream system knows about.
A client can retry and create duplicate downstream state.
Response success implies workflow success even though only part of it completed.

A local integration test may still pass every time, because a local or mocked queue accepts the publish immediately and nothing retries.

Better: explicit handoff guarantees and idempotency

A more resilient version separates acknowledgment from completion and makes the workflow observable.

javascript
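import { randomUUID } from 'node:crypto';

app.post('/api/signup', async (req, res) => {
  const { email, companyName, plan } = req.body;
  // Without a client-supplied key, retries cannot be deduplicated; the random
  // fallback only keeps the column populated.
  const idempotencyKey = req.header('Idempotency-Key') || randomUUID();

  const existing = await db.signupRequest.findUnique({ where: { idempotencyKey } });
  if (existing) {
    return res.status(202).json({
      requestId: existing.id,
      status: existing.status
    });
  }

  let signupRequest;
  try {
    // The request row and the outbox row commit together or not at all.
    signupRequest = await db.$transaction(async (tx) => {
      const request = await tx.signupRequest.create({
        data: {
          idempotencyKey,
          email,
          companyName,
          plan,
          status: 'pending'
        }
      });

      await tx.outbox.create({
        data: {
          topic: 'signup.requested',
          aggregateId: request.id,
          payload: JSON.stringify({
            requestId: request.id,
            email,
            companyName,
            plan
          })
        }
      });

      return request;
    });
  } catch (err) {
    // Two concurrent requests with the same key can both pass the lookup above.
    // Assuming a unique index on idempotencyKey and Prisma-style error codes
    // (P2002 = unique constraint violation), the loser returns the winner's row.
    if (err.code !== 'P2002') throw err;
    signupRequest = await db.signupRequest.findUnique({ where: { idempotencyKey } });
  }

  // 202: the request is acknowledged; completion is reported by the workflow status.
  res.status(202).json({ requestId: signupRequest.id, status: signupRequest.status });
});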

A worker can then process the outbox reliably and update workflow status as each boundary completes.

python
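async def process_signup_requested(event, db, billing, crm, emailer):
    """Drive one signup request through each downstream handoff.

    Intended to be safe to re-run for the same event: the database and CRM
    steps are upserts and the billing call carries an idempotency key.
    """
    request_id = event["requestId"]

    request = await db.signup_request.get(request_id)
    if request.status == "completed":
        # Redelivered event for a finished workflow: nothing left to do.
        return

    # Upserts keep a retry from creating a second user or organization.
    user = await db.user.upsert_by_email(request.email)
    org = await db.organization.upsert_by_request(request_id, request.company_name, user.id)

    # The key is derived from the request, so a retry maps to the same trial.
    billing_result = await billing.ensure_trial(
        external_customer_id=org.id,
        email=request.email,
        plan=request.plan,
        idempotency_key=f"signup:{request_id}:billing"
    )

    await crm.upsert_contact(
        external_id=user.id,
        email=request.email,
        company=request.company_name
    )

    # Not idempotent as written: if the status update below fails and the event
    # is retried, the welcome email goes out again.
    await emailer.send_welcome(
        to=request.email,
        template="trial_welcome",
        metadata={"request_id": request_id}
    )

    # Mark completion only after every handoff above has succeeded.
    await db.signup_request.update(
        request_id,
        {
            "status": "completed",
            "user_id": user.id,
            "org_id": org.id,
            "billing_customer_id": billing_result["customer_id"]
        }
    )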

This does not magically remove failures. It makes them controllable, retryable, and testable.

Testing the workflow with Playwright plus backend assertions

UI testing alone is not enough, but UI initiation is often the correct entry point because that matches the user action.

With Playwright, you can trigger the workflow from the browser, then verify the downstream state through internal admin endpoints, database checks, or test-only observability hooks.

javascript
import { test, expect } from '@playwright/test';

test('signup completes all critical handoffs', async ({ page, request }) => {
  const email = `test-${Date.now()}@example.com`;

  await page.goto('/signup');
  await page.fill('[name="companyName"]', 'Boundary Test Inc');
  await page.fill('[name="email"]', email);
  await page.selectOption('[name="plan"]', 'trial');
  await page.click('button[type="submit"]');

  await expect(page.getByText('Check your email to continue')).toBeVisible();

  await expect.poll(async () => {
    const res = await request.get(`/test-api/workflows/signup-status?email=${email}`);
    return res.json();
  }, {
    timeout: 30000,
    intervals: [500, 1000, 2000]
  }).toMatchObject({
    appUserCreated: true,
    organizationCreated: true,
    billingTrialCreated: true,
    crmContactCreated: true,
    welcomeEmailTriggered: true,
    status: 'completed'
  });
});

That test is much closer to what the business actually needs.

Critically, it verifies the handoffs instead of stopping at the first success screen.

Add production-like failure modes to your tests

If your verification only checks the happy path, you are still under-testing boundaries.

You need scenarios for:

delayed downstream systems
duplicate deliveries
partial completion
auth token expiry
webhook retry and reorder
queue processing lag
provider validation drift

For example, test that retry behavior does not create duplicate billing state:

javascript
test('signup is idempotent across client retry', async ({ request }) => {
  const email = `retry-${Date.now()}@example.com`;
  const key = `idem-${Date.now()}`;

  const payload = {
    email,
    companyName: 'Retry Co',
    plan: 'trial'
  };

  const [r1, r2] = await Promise.all([
    request.post('/api/signup', {
      headers: { 'Idempotency-Key': key },
      data: payload
    }),
    request.post('/api/signup', {
      headers: { 'Idempotency-Key': key },
      data: payload
    })
  ]);

  // Both the first request and the deduplicated retry are acknowledged with 202.
  expect(r1.status()).toBe(202);
  expect(r2.status()).toBe(202);

  await expect.poll(async () => {
    const res = await request.get(`/test-api/workflows/signup-status?email=${email}`);
    return res.json();
  }, {
    // Async handoffs can outlast the default poll timeout.
    timeout: 30000
  }).toMatchObject({
    billingTrialCount: 1,
    crmContactCount: 1,
    status: 'completed'
  });
});

This kind of test catches a whole class of workflow bugs that API contract tests miss completely.

CI/CD should run workflow verification, not just code verification

If workflow boundaries are the failure surface, your CI/CD pipeline needs a stage that explicitly validates them.

That does not mean every PR must spin up every real external dependency. It means you need a strategy for meaningful workflow verification at the right layers.

A practical model looks like this:

PR checks: unit, integration, schema compatibility, critical workflow smoke tests against controlled environments
main branch: broader end-to-end workflow suite
pre-release or canary: production-like verification against real integrations or high-fidelity sandboxes
post-deploy: synthetic action-level checks continuously exercising critical workflows

Here is an example GitHub Actions workflow with a dedicated workflow-verification job:

yaml
name: workflow-verification

on:
  pull_request:
  push:
    branches: [main]

jobs:
  app-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run test
      - run: npm run test:integration

  workflow-smoke:
    needs: app-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: docker compose up -d --wait  # block until services are running or healthy
      - run: npm run db:migrate
      - run: npm run test:workflow
        env:
          BILLING_BASE_URL: http://localhost:4010
          CRM_BASE_URL: http://localhost:4020
          EMAIL_BASE_URL: http://localhost:4030

  canary-workflows:
    if: github.ref == 'refs/heads/main'
    needs: workflow-smoke
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run test:workflow:canary
        env:
          APP_BASE_URL: ${{ secrets.CANARY_APP_BASE_URL }}
          BILLING_SANDBOX_KEY: ${{ secrets.BILLING_SANDBOX_KEY }}
          CRM_SANDBOX_KEY: ${{ secrets.CRM_SANDBOX_KEY }}

This is closer to reality because it acknowledges that developer productivity is not just about writing code faster. It is about reducing the time between change and reliable confidence.

What to verify at each boundary

Not every integration deserves the same investment. Focus on workflows where failure has customer or revenue impact.

For each boundary, verify four things:

1. Delivery

Did the handoff happen at all?

Examples:

event published
webhook sent
job enqueued
API call made

2. Acceptance

Did the receiving system accept it?

Examples:

the provider returned a 2xx for the request itself, not merely a successful connection
payload validated
auth scopes accepted
consumer deserialized without dropping fields

3. Effect

Did the intended side effect occur?

Examples:

customer created in billing
contact visible in CRM
entitlement row created
email trigger recorded

4. Uniqueness and ordering

Did it happen once, in the expected sequence?

Examples:

no duplicate subscriptions
no duplicate emails
no worker processing before source row committed
no stale webhook overwriting newer state

If your test suite does not answer these questions for critical workflows, it is not actually validating reliability.
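The four questions can be turned into a small evaluator that a workflow test calls with whatever it observed at each boundary. This is a sketch, not a framework: `BoundaryObservation`, its fields, and the failure messages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BoundaryObservation:
    """What a test observed at one handoff (field names are illustrative)."""
    name: str
    delivered: bool          # the message, call, or job left the sender
    accepted: bool           # the receiver acknowledged and validated it
    effect_count: int        # how many times the side effect exists downstream
    must_follow: tuple = ()  # boundaries that should have been observed first


def evaluate(observations):
    """Return human-readable failures; the list order is the observed order."""
    failures = []
    observed = []
    for obs in observations:
        if not obs.delivered:
            failures.append(f"{obs.name}: not delivered")
        elif not obs.accepted:
            failures.append(f"{obs.name}: delivered but rejected")
        elif obs.effect_count == 0:
            failures.append(f"{obs.name}: accepted but no effect")
        elif obs.effect_count > 1:
            failures.append(f"{obs.name}: duplicated {obs.effect_count}x")
        for dependency in obs.must_follow:
            if dependency not in observed:
                failures.append(f"{obs.name}: ran before {dependency}")
        observed.append(obs.name)
    return failures
```

An empty result means every boundary was delivered, accepted, effective, unique, and in order. Anything else names the handoff that broke.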

Tools comparison: what each layer is good for

No single tool solves this. The point is to use each one for the right job.

| Tool/Approach | Good for | Weak at |
| --- | --- | --- |
| Unit tests | business logic, edge cases, transformations | cross-system behavior, timing, retries |
| API integration tests | endpoint contracts, database effects | downstream async workflows, external side effects |
| Playwright/Cypress UI tests | user initiation, visible UX, browser state | invisible backend handoffs unless extended |
| Contract tests | schema compatibility between known producers/consumers | operational completion, retries, ordering |
| Mock servers | deterministic local testing, failure injection | fidelity to real production integrations |
| Sandbox integration tests | realistic provider behavior | cost, slowness, environment drift |
| Synthetic production checks | real operational confidence | narrower debugging signal, needs guardrails |
| Observability/tracing | debugging workflow failures in production | not a substitute for pre-deploy verification |

The mistake is expecting any one of these to carry the full load.

The right stack combines them around critical actions.

Actionable practices for teams shipping agent-assisted code

This is the part that matters. If you want fewer boundary failures, make the workflow explicit in both architecture and testing.

1. Define your critical user workflows

List the actions that actually matter to the business:

signup and onboarding
upgrade and billing change
password reset
invite teammate
import data
submit lead form
cancel subscription
generate report and notify user

If a failure would lose revenue, break trust, or create operational cleanup, it needs action-level verification.

2. Model handoffs as first-class states

Do not compress a distributed workflow into a single “success” response.

Track states like:

requested
accepted
processing
billing_completed
crm_completed
email_completed
completed
failed

This makes debugging easier and enables testing against real workflow progress rather than guessing from logs.
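As a rough sketch, the states above can be encoded with an explicit transition table so that a skipped or repeated step fails loudly instead of being inferred from logs. The strictly linear ordering and the choice to treat `failed` as terminal are assumptions made for this example; a real workflow might run some steps in parallel or allow retries out of `failed`.

```python
from enum import Enum


class SignupState(str, Enum):
    REQUESTED = "requested"
    ACCEPTED = "accepted"
    PROCESSING = "processing"
    BILLING_COMPLETED = "billing_completed"
    CRM_COMPLETED = "crm_completed"
    EMAIL_COMPLETED = "email_completed"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed linear ordering: each state may move to the next one or to FAILED.
_ORDER = [s for s in SignupState if s is not SignupState.FAILED]
ALLOWED = {a: {b, SignupState.FAILED} for a, b in zip(_ORDER, _ORDER[1:])}
ALLOWED[SignupState.COMPLETED] = set()  # terminal
ALLOWED[SignupState.FAILED] = set()     # terminal in this sketch


def advance(current, target):
    """Return the new state, or raise if the transition skips or repeats a step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

A worker would call `advance` before persisting each status change, and a test can assert the exact sequence of states a request passed through.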

3. Add idempotency everywhere retries can happen

At the boundary, retries are normal. Networks fail. Browsers retry. Workers restart. Webhooks redeliver.

If your workflow cannot safely repeat, it will eventually corrupt state.

Use:

idempotency keys on external writes
dedupe keys on events
upserts on downstream records
monotonic version checks where ordering matters
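Two of these techniques, dedupe keys on events and monotonic version checks, fit in a few lines. The store below is an in-memory stand-in with hypothetical names. In a real system the dedupe record and the data write would need to commit in the same transaction.

```python
class ContactStore:
    """In-memory stand-in for a downstream record store (illustrative only)."""

    def __init__(self):
        self.records = {}         # key -> (version, data)
        self.seen_events = set()  # dedupe keys of events already handled

    def apply(self, event_id, key, version, data):
        """Handle each event at most once, and never let an older version win."""
        if event_id in self.seen_events:
            return "duplicate"
        self.seen_events.add(event_id)
        current = self.records.get(key)
        if current is not None and current[0] >= version:
            # A late or reordered delivery must not overwrite newer state.
            return "stale"
        self.records[key] = (version, data)
        return "applied"
```

Returning a reason instead of silently ignoring the event gives tests and metrics something to count.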

4. Prefer outbox and inbox patterns over best-effort async calls

If you create state and emit an event, make the handoff durable.

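Best-effort publish after commit is where workflows disappear: if the process crashes or the broker is unreachable between the commit and the publish, the record exists but no consumer ever hears about it.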

Transactional outbox patterns are not glamorous, but they close one of the most common reliability gaps in modern systems.
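Here is a minimal, self-contained sketch of the pattern using SQLite from the Python standard library. The table names, the `signup.requested` topic, and the `publish` callback are assumptions for illustration. A production relay would also need batching, locking across multiple relays, and backoff.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE signup_request (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT NOT NULL, "
        "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
    )


def record_signup(conn, email):
    """Write the business row and its outbox row in one transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        cur = conn.execute("INSERT INTO signup_request (email) VALUES (?)", (email,))
        request_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("signup.requested", json.dumps({"requestId": request_id, "email": email})),
        )
    return request_id


def relay_once(conn, publish):
    """Publish pending outbox rows in order; mark each only after publish returns."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # may raise; the row then stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Note the delivery guarantee this gives: a crash after `publish` but before the `UPDATE` republishes the event on the next pass, so delivery is at-least-once and consumers still need their own dedupe.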

5. Build test-only workflow status endpoints or fixtures

Engineers often avoid workflow testing because downstream verification feels hard. Make it easier.

Expose internal test helpers that answer questions like:

has the CRM sync completed?
how many billing subscriptions exist for this request?
did the welcome email trigger fire?
what state is the workflow in now?

You are not cheating. You are making invisible system behavior observable enough to test.
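A status helper of this kind can be very small. The sketch below assembles a summary shaped like the one the Playwright tests earlier in this article poll for. The function name, its arguments, and the rule for deriving `status` are assumptions. Each list stands in for a query against the app database or a record of provider calls.

```python
def signup_status(email, users, orgs, billing_trials, crm_contacts, email_triggers):
    """Summarize downstream state for one signup, for use by tests only.

    Each argument after `email` is a list of dicts with an "email" key.
    """
    def count(rows):
        return sum(1 for row in rows if row["email"] == email)

    summary = {
        "appUserCreated": count(users) > 0,
        "organizationCreated": count(orgs) > 0,
        "billingTrialCount": count(billing_trials),
        "crmContactCount": count(crm_contacts),
        "welcomeEmailTriggered": count(email_triggers) > 0,
    }
    summary["billingTrialCreated"] = summary["billingTrialCount"] > 0
    summary["crmContactCreated"] = summary["crmContactCount"] > 0

    required = (
        "appUserCreated",
        "organizationCreated",
        "billingTrialCreated",
        "crmContactCreated",
        "welcomeEmailTriggered",
    )
    # Derived status: complete only when every required handoff is visible.
    summary["status"] = "completed" if all(summary[k] for k in required) else "pending"
    return summary
```

Exposing counts rather than only booleans is what lets a test assert that a side effect happened exactly once.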

6. Run a small set of high-value workflow tests on every meaningful change

Do not start with fifty brittle end-to-end scenarios. Start with five workflows that matter.

A few high-signal action tests catch more production-threatening bugs than a mountain of component checks that only pretend to represent reality.

7. Continuously exercise workflows after deploy

Pre-merge checks are not enough. Real integrations drift. Auth scopes change. Provider behavior changes. Queues back up.

Run synthetic workflow checks in staging and production-like environments. Alert on incomplete actions, not just service uptime.

A homepage returning 200 tells you nothing about whether customers can actually onboard.
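A synthetic check of this kind reduces to "trigger the action, then poll for completion until a deadline". The sketch below assumes two hypothetical callables, `trigger` and `fetch_status`, that wrap real calls into the system. The clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time


def run_synthetic_check(trigger, fetch_status, timeout_s=60.0, interval_s=2.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Trigger one action, then poll until it completes or the deadline passes.

    `trigger` starts the workflow and returns an id; `fetch_status` takes that id
    and returns a dict with a "status" key. Both are placeholders for real calls.
    """
    started = clock()
    action_id = trigger()
    last = None
    while clock() - started < timeout_s:
        last = fetch_status(action_id)
        if last.get("status") == "completed":
            return {"ok": True, "elapsed_s": clock() - started, "last": last}
        if last.get("status") == "failed":
            break
        sleep(interval_s)
    # Incomplete or failed: this is the condition worth alerting on.
    return {"ok": False, "elapsed_s": clock() - started, "last": last}
```

The `last` status in a failing result shows how far the workflow got, which is more useful in an alert than a bare timeout.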

8. Use tracing to connect the workflow for debugging

When a handoff fails, developers need to see the chain quickly.

Correlate:

browser/session request id
app request id
outbox event id
worker execution id
provider request id
webhook delivery id

Without this, debugging turns into log archaeology across six systems. With it, you can answer where the workflow stalled in minutes instead of hours.
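Once every hop logs a shared correlation id, finding the stall point is a grouping problem. The sketch below assumes log records are dicts with hypothetical `correlation_id` and `stage` keys, and that the stage order in `STAGES` matches the real chain.

```python
from collections import defaultdict

# Assumed stage order for one workflow; adjust to the real chain.
STAGES = [
    "app_request",
    "outbox_event",
    "worker_execution",
    "provider_request",
    "webhook_delivery",
]


def find_stalls(log_records):
    """Group records by correlation id and report the first missing stage for each."""
    seen = defaultdict(set)
    for record in log_records:
        seen[record["correlation_id"]].add(record["stage"])

    stalls = {}
    for correlation_id, stages in seen.items():
        for stage in STAGES:
            if stage not in stages:
                stalls[correlation_id] = stage
                break
    return stalls
```

Workflows that reached every stage are omitted from the result, so the output is exactly the list of requests that need attention and where each one stopped.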

9. Review agent-generated changes for boundary semantics

When AI proposes changes, reviewers should ask:

Did any payload shape change?
Did any event timing change?
Did any retry behavior change?
Did any field rename cross a service boundary?
Did any mock hide a real integration requirement?
Is idempotency still preserved?

This is a better review checklist than just asking whether the code is clean.

10. Measure workflow completion as a product health metric

Track metrics like:

signup completion rate across all downstream systems
median and p95 time to full workflow completion
duplicate side-effect rate
dead-letter rate by workflow type
partial completion rate

If you do not measure handoff completion, you will keep optimizing the wrong thing.
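These metrics are cheap to compute once each workflow run is recorded with a final status and a duration. The sketch below is one possible shape. The run record keys (`status`, `duration_s`, `duplicate_effects`) and the status values are illustrative, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def workflow_metrics(runs):
    """Summarize workflow runs.

    Each run is a dict with "status" ("completed", "partial", or "dead_lettered"),
    "duration_s" (set when completed), and "duplicate_effects" (an int).
    """
    total = len(runs)
    durations = [r["duration_s"] for r in runs if r["status"] == "completed"]

    def rate(predicate):
        return sum(1 for r in runs if predicate(r)) / total if total else 0.0

    return {
        "completion_rate": rate(lambda r: r["status"] == "completed"),
        "partial_rate": rate(lambda r: r["status"] == "partial"),
        "dead_letter_rate": rate(lambda r: r["status"] == "dead_lettered"),
        "duplicate_rate": rate(lambda r: r["duplicate_effects"] > 0),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
    }
```

Durations here measure time to full completion across all handoffs, not time to the first HTTP response.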

A better mental model for modern testing

Old model: confidence comes from proving each component works.

Better model: confidence comes from proving important actions complete under realistic conditions.

The old model was always incomplete, but it was tolerable when systems were smaller, changes were slower, and fewer integrations were involved.

That world is gone.

Now we have:

more services
more vendors
more async behavior
more generated code
more deploys
more hidden contracts between systems

In that environment, workflow boundaries are where reliability is won or lost.

Testing has to follow that reality.

Conclusion

Your tests may cover the API. Your CI/CD pipeline may be green. Your agent may have produced perfectly plausible code.

None of that proves the user action completed.

The real failure surface in modern software is the handoff between systems: where one component says “done” and another quietly disagrees. That is where signups stall, charges duplicate, emails vanish, CRM records drift, and operators learn too late that a workflow only partially existed.

If AI-assisted development is increasing the speed and volume of change, then the answer is not more faith in isolated tests. It is better testing aimed at the actual risk.

Test actions, not just endpoints. Verify handoffs, not just handlers. Measure completion, not just response codes.

That is what real reliability looks like now. And it is the only kind of confidence that survives contact with production.