A user clicks Approve refund. The button disables. A success toast appears. The page refreshes and the refund row shows Processing.

Every visible part of the product says the action worked.

But nothing happened.

The frontend emitted an event with the wrong payload shape. The backend accepted it because the request was syntactically valid. The auth layer stripped a claim needed by the refund worker. A queue message was published without the merchant context. The worker retried twice, then dead-lettered. The webhook to the payment processor never fired. Finance never got the approval task. The customer support agent moved on because the UI looked done.

The tests were green.

This is the failure pattern more teams are running into as AI writes a larger share of product code. Agent-written pull requests are often competent at local correctness. They update a component. They add an endpoint. They patch a serializer. They satisfy type checks. They regenerate fixtures. They make the UI test pass. Sometimes they even improve unit coverage.

And still the change fails in production, not because the code is obviously broken, but because the handoff is broken.

That is the real failure surface: the point where a workflow crosses a boundary between systems, roles, or asynchronous stages. Frontend to API. API to auth. Auth to job worker. Worker to webhook. Webhook to third-party API. System to human approver. One service to another team's contract. Those boundaries are where software silently stops honoring user intent.

This is not just a testing blind spot. It is a delivery blind spot. Modern CI/CD pipelines are optimized to verify isolated code paths, not workflow commitments. Traditional testing proves that components behave under controlled assumptions. Real users do not operate inside those assumptions. They traverse the full chain.

If you care about reliability, debugging, and developer productivity, this distinction matters. A system is not correct because a screen rendered the right state or an endpoint returned 200. A system is correct when the intended action actually propagates across the chain and reaches the real external consequence.

The Problem: Correct Locally, Broken at the Boundary

A lot of engineering teams still describe failures using implementation language:

the queue consumer broke
the webhook schema drifted
auth middleware dropped context
the UI sent stale data
the approval state machine got stuck

That language is accurate, but it hides what the user experienced.

The user experience is simpler: I did the thing, and the thing did not happen.

Boundary failures are dangerous because the product often appears correct at each local checkpoint:

The button click handler fired.
The API request succeeded.
The database row updated.
The worker picked up the job.
The external API acknowledged receipt.
The UI showed an optimistic status.

Yet the workflow still fails because the guarantee between stages was never actually verified.

Agent-written changes amplify this because agents are excellent at satisfying explicit local constraints and weak at implicit cross-system intent unless you encode that intent directly. Given a task like “add approval flow for refunds,” an agent will usually wire up the obvious pieces. It may even read neighboring tests and imitate their patterns. But if your test suite only proves that the button renders, the controller returns 200, and the worker runs when manually invoked, the agent has no reason to discover that the actual product commitment is:

A support rep clicks approve.
The request carries the correct user and merchant identity.
The server records an approval event with durable correlation.
A job is enqueued with all required context.
The worker calls the payment processor with the approved amount.
A webhook updates the refund status.
Finance receives the audit artifact.
The support rep sees the final state only after the external action is confirmed.

That is the flow. That is what needs testing.

Most suites do not test that flow. They test fragments.

Why Green Tests Still Lie

The problem is not that unit tests are useless or that end-to-end tests do not matter. The problem is that teams mistake partial verification for workflow verification.

Unit tests prove functions, not commitments

Unit tests are good at checking deterministic behavior in isolation. That is valuable. They catch regressions quickly. They support refactoring. They improve debugging when failures are local and obvious.

But a unit test cannot tell you whether the user action propagated across a queue with the right identity, timing, retry semantics, and downstream side effects. The more mocks you use around a boundary, the easier it is to accidentally certify a fantasy.

An agent can write a passing unit test for this:

javascript
it('publishes refund job after approval', async () => {
  const queue = { publish: vi.fn().mockResolvedValue(true) };
  const service = new RefundApprovalService({ queue });

  await service.approve({
    refundId: 'r_123',
    approvedBy: 'user_9',
    merchantId: 'm_77'
  });

  expect(queue.publish).toHaveBeenCalledWith('refund.approved', {
    refundId: 'r_123',
    approvedBy: 'user_9',
    merchantId: 'm_77'
  });
});

Looks solid. But if the real queue publisher serializes merchant_id while the worker expects merchantId, this test says nothing about production behavior. It only proves that the mock was called with the arguments the test itself supplied.
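
One way to catch that mismatch is to run the producer's real serializer against the consumer's real parser, with no mock in between. The sketch below uses Python for brevity. `serialize_refund_approved` and `parse_refund_approved` are hypothetical stand-ins for whatever your publisher and worker actually use, not functions from any real library.

```python
import json
from typing import Optional


def serialize_refund_approved(refund_id: str, approved_by: str, merchant_id: str) -> str:
    """Hypothetical producer-side serializer; note the snake_case keys."""
    return json.dumps(
        {"refund_id": refund_id, "approved_by": approved_by, "merchant_id": merchant_id}
    )


def parse_refund_approved(raw: str) -> dict:
    """Hypothetical consumer-side parser; it expects camelCase keys."""
    payload = json.loads(raw)
    required = ("refundId", "approvedBy", "merchantId")
    missing = [key for key in required if key not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return {key: payload[key] for key in required}


def round_trip_error(**fields: str) -> Optional[str]:
    """Feed real producer output to the real consumer; return the error text, if any."""
    try:
        parse_refund_approved(serialize_refund_approved(**fields))
    except ValueError as exc:
        return str(exc)
    return None


# A mocked unit test would pass here, but the round trip exposes the key mismatch.
error = round_trip_error(refund_id="r_123", approved_by="user_9", merchant_id="m_77")
print(error)
```

A test like this is still local and fast, but it fails for the same reason production would: the two sides disagree about the shape of the message.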

E2E tests often stop at the UI boundary

A surprising number of so-called end-to-end tests are really frontend integration tests with network stubbing. They launch the browser, click buttons, intercept requests, return fixtures, and assert UI state.

That is useful for frontend confidence. It is not the full workflow.

This Playwright test can pass while the real refund never happens:

javascript
import { test, expect } from '@playwright/test';

test('support rep can approve a refund', async ({ page }) => {
  await page.route('**/api/refunds/r_123/approve', async route => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ status: 'processing' })
    });
  });

  await page.goto('/refunds/r_123');
  await page.getByRole('button', { name: 'Approve refund' }).click();

  await expect(page.getByText('Refund approved')).toBeVisible();
  await expect(page.getByText('Processing')).toBeVisible();
});

The UI works. The user journey in production may not.

The hidden assumption is that the backend, queue, worker, webhook, and processor will all behave consistently with the stub. That assumption is exactly where many failures live.

CI/CD validates mergeability, not reality

CI/CD systems are built around fast feedback and deterministic automation. That pushes teams toward checks that are cheap to run:

linting
type checks
unit tests
integration tests with local mocks
short browser tests
ephemeral environments with limited dependencies

None of that is wrong. But the result is predictable: pipelines tell you whether a change is internally consistent in a controlled environment, not whether it will survive real handoffs.

The deeper problem is that CI usually has no model of the workflow commitment. It knows a PR should compile, not that a user approval must lead to a real downstream settlement event within five minutes with an audit trail and retry visibility.

That gap creates false confidence. A green pipeline feels like proof. It is not proof. It is evidence of a narrow class of correctness.

QA cannot reliably catch boundary failures by exploration alone

Manual QA can find obvious broken flows, but boundary failures are often timing-sensitive, permission-sensitive, data-shape-sensitive, or environment-specific. They may require:

the wrong role token
a real webhook signature
queue lag
third-party sandbox behavior
a delayed callback
a mid-flow manual approval
stale browser state
duplicate delivery

These are difficult to explore systematically by hand, and almost impossible to cover thoroughly as product complexity grows.

If your workflow correctness depends on a human tester noticing that the final consequence never occurred three systems later, you do not have a reliable verification strategy.

The Core Insight: Test the Handoff, Not Just the Step

The right mental model is simple:

Every meaningful product action is a chain of commitments across boundaries.

Testing should assert not just that each step can execute, but that the output of one boundary becomes the valid, sufficient, and observable input to the next.

That means you need to verify propagation, not just local success.

For a real workflow, ask these questions:

What user intent starts the chain?
What evidence proves that intent crossed the first boundary correctly?
What context must survive each hop?
What external effect is the workflow promising?
What observable signal confirms the effect happened?
What should the user see before that confirmation versus after it?
Where can timing, retries, auth, idempotency, and schema drift break the chain?

This shifts testing from “did this component respond correctly?” to “did the system uphold the user’s intended action across all boundaries?”

That is a harder question. It is also the one that matters.

Where Agent-Written Changes Commonly Break

AI-assisted development does not create boundary failures, but it makes them easier to introduce, because generated code often mirrors existing local patterns without reflecting the full workflow contract.

1. Frontend event to backend contract

An agent updates a form or button handler and sends fields that look reasonable but do not preserve the backend’s real invariants.

Examples:

sends display currency instead of settlement currency
omits actor role or tenant scope
renames a field to match frontend style
fires duplicate events because optimistic state changed twice
treats “accepted” as “completed”

2. Auth boundary

The request arrives, but key authorization context is lost or transformed.

Examples:

background job runs under system identity without original actor metadata
service-to-service token omits tenant claim
admin UI path works in dev with broad permissions, fails in prod with scoped roles
approval requires step-up auth that tests never modeled

3. Async queue boundary

The server says success after enqueue, but the enqueued message is insufficient, malformed, or semantically incomplete.

Examples:

missing correlation IDs
incompatible schema version
serialized enum mismatch
worker depends on data not included in payload
race between DB commit and message publish
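
The race between DB commit and message publish is commonly handled with a transactional outbox: the message is written to a table in the same transaction as the business record, and a separate relay publishes it afterwards. The following is a minimal sketch using an in-memory SQLite database. The table names, `approve_with_outbox`, and `relay_outbox` are illustrative assumptions, not part of any system described here.

```python
import json
import sqlite3


def approve_with_outbox(conn: sqlite3.Connection, refund_id: str, message: dict) -> None:
    """Write the approval and the outgoing message in one transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute(
            "INSERT INTO refund_approvals (refund_id, status) VALUES (?, ?)",
            (refund_id, "approved_pending_execution"),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload, published) VALUES (?, ?, 0)",
            ("refund.approved.v2", json.dumps(message)),
        )


def relay_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish committed, unpublished rows; mark each one only after publish returns."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refund_approvals (refund_id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT, published INTEGER)"
)

sent = []  # stands in for a real queue client
approve_with_outbox(conn, "r_123", {"refundId": "r_123", "correlationId": "c_1"})
relayed = relay_outbox(conn, lambda topic, payload: sent.append((topic, payload)))
```

If the relay crashes between publishing and marking the row, the message is sent again on the next run, so this gives at-least-once delivery and the consumer has to be idempotent.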

4. Webhook or callback boundary

The downstream system responds later, and your app assumes ideal callback behavior.

Examples:

callback signature validation rejects sandbox payloads after a library change
duplicate webhook creates inconsistent state
out-of-order events regress status
timeout path marks UI as failed while downstream action later succeeds
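
Duplicate and out-of-order callbacks can be made harmless by deduplicating on an event ID and refusing to move a status backwards. Here is a small sketch of that idea. The status names, their ordering in `STATUS_RANK`, and the `apply_webhook` helper are assumptions for illustration; a real processor defines its own event model.

```python
from dataclasses import dataclass, field

# Assumed ordering of processor statuses; a later status is never overwritten by an earlier one.
STATUS_RANK = {"submitted": 0, "processing": 1, "succeeded": 2}


@dataclass
class RefundRecord:
    status: str = "submitted"
    seen_event_ids: set = field(default_factory=set)


def apply_webhook(record: RefundRecord, event_id: str, status: str) -> str:
    """Apply one webhook event and report what happened to it."""
    if status not in STATUS_RANK:
        raise ValueError(f"unknown status: {status!r}")
    if event_id in record.seen_event_ids:
        return "duplicate"  # redelivery of an event we already handled
    record.seen_event_ids.add(event_id)
    if STATUS_RANK[status] <= STATUS_RANK[record.status]:
        return "stale"  # arrived late; applying it would regress the status
    record.status = status
    return "applied"


record = RefundRecord()
results = [
    apply_webhook(record, "evt_2", "succeeded"),
    apply_webhook(record, "evt_1", "processing"),  # arrives after the later event
    apply_webhook(record, "evt_2", "succeeded"),   # redelivered
]
print(results, record.status)
```

In a real service the seen-event set and the status would live in the database and be updated in one transaction, so that two concurrent deliveries cannot both be treated as new.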

5. Third-party API boundary

Tests usually mock this boundary. Real systems rarely behave like mocks.

Examples:

202 Accepted instead of 200 OK
partial success requiring polling
stricter rate limits
undocumented required field combinations
idempotency keys handled differently in production

6. Human approval or operational handoff boundary

This is the most ignored category because it sits between software and process.

Examples:

approval creates a task in the wrong queue
required reviewer never notified
status changes before legal/compliance signoff
audit logs miss who approved what
operators lack enough context to complete the handoff

These failures are not edge cases. They are normal consequences of workflows that cross boundaries.

A Better Verification Strategy: Assert End-to-End Propagation

You do not need to replace all existing testing. You need to add tests and observability that prove workflow propagation.

The key pattern is:

Start from a real user-triggered action.
Allow the chain to cross actual boundaries where possible.
Assert on downstream evidence, not just upstream acknowledgment.
Record correlation across steps so debugging is possible when the chain breaks.

Example architecture under test

Assume this flow:

React UI sends refund approval request
Node API validates and writes approval event
API publishes job to queue
Python worker calls payment processor
Processor emits webhook
API updates refund status
Finance review task is created

A superficial test checks the UI response. A handoff-aware test verifies the full chain.

Code Example: Strengthening the Backend Contract

First, make the handoff explicit in code. Do not publish vague messages and hope downstream services infer missing context.

javascript
// api/refunds/approveRefund.js
import { randomUUID } from 'node:crypto';

export async function approveRefund(req, res, { db, queue }) {
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  const actor = {
    userId: req.auth.userId,
    role: req.auth.role,
    tenantId: req.auth.tenantId,
  };

  const { refundId } = req.params;
  const { amount, reason } = req.body;

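  // Use one timestamp so the approval row and the message agree.
  const approvedAt = new Date().toISOString();

  await db.transaction(async tx => {
    await tx.refundApprovals.insert({
      refundId,
      amount,
      reason,
      approvedBy: actor.userId,
      tenantId: actor.tenantId,
      correlationId,
      status: 'approved_pending_execution',
      createdAt: approvedAt,
    });
  });

  // Publish only after the transaction has committed. Publishing inside it can
  // hand the worker a message whose approval row is not yet visible, or one
  // whose transaction later rolls back. A crash between commit and publish can
  // still drop the message; a transactional outbox is one common way to close
  // that gap.
  await queue.publish('refund.approved.v2', {
    correlationId,
    refundId,
    tenantId: actor.tenantId,
    actor,
    amount,
    reason,
    approvedAt,
    schemaVersion: 2,
  });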

  res.status(202).json({
    correlationId,
    status: 'approved_pending_execution',
  });
}

Several things matter here:

202 reflects asynchronous reality better than 200 with fake completion semantics.
correlationId gives you a chain-wide handle for debugging.
Actor and tenant context are preserved explicitly.
The workflow state names the handoff honestly: pending execution, not completed.

This alone improves reliability because it reduces semantic ambiguity.

Code Example: Python Worker with Boundary Assertions

python
# worker/refund_processor.py
from dataclasses import dataclass

@dataclass
class RefundApprovedMessage:
    correlation_id: str
    refund_id: str
    tenant_id: str
    actor: dict
    amount: int
    reason: str
    schema_version: int


def process_refund(message: RefundApprovedMessage, payment_api, db, task_service, logger):
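    # Validate with explicit raises: `assert` statements are skipped under `python -O`.
    if message.schema_version != 2:
        raise ValueError(f"unsupported schema version: {message.schema_version}")
    if not message.tenant_id:
        raise ValueError(f"tenant_id is required (correlation_id={message.correlation_id})")
    if not message.actor.get("userId"):
        raise ValueError(f"actor.userId is required (correlation_id={message.correlation_id})")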

    logger.info("processing refund", extra={
        "correlation_id": message.correlation_id,
        "refund_id": message.refund_id,
        "tenant_id": message.tenant_id,
    })

    response = payment_api.create_refund(
        refund_id=message.refund_id,
        tenant_id=message.tenant_id,
        amount=message.amount,
        idempotency_key=message.correlation_id,
        metadata={
            "approved_by": message.actor["userId"],
            "reason": message.reason,
        },
    )

    db.refund_executions.insert({
        "refund_id": message.refund_id,
        "correlation_id": message.correlation_id,
        "provider_ref": response["provider_ref"],
        "status": "submitted_to_processor",
    })

    task_service.create_finance_review({
        "refund_id": message.refund_id,
        "correlation_id": message.correlation_id,
        "provider_ref": response["provider_ref"],
        "tenant_id": message.tenant_id,
    })

This is still not enough by itself, but it demonstrates an important habit: enforce required cross-boundary context at the consumer. If a producer omitted critical fields, fail loudly with correlation, not silently with a half-broken downstream state.
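
One detail the two examples leave open: the API publishes camelCase keys (`correlationId`, `tenantId`, `schemaVersion`), while the worker's dataclass uses snake_case fields. Something has to translate between them, and that translation is itself a boundary. The sketch below shows one possible decoder. `FIELD_MAP` and `decode_refund_approved` are hypothetical names, and the dataclass is repeated only so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class RefundApprovedMessage:  # mirrors the worker's dataclass
    correlation_id: str
    refund_id: str
    tenant_id: str
    actor: dict
    amount: int
    reason: str
    schema_version: int


# Wire key (as published by the API) -> dataclass field.
FIELD_MAP = {
    "correlationId": "correlation_id",
    "refundId": "refund_id",
    "tenantId": "tenant_id",
    "actor": "actor",
    "amount": "amount",
    "reason": "reason",
    "schemaVersion": "schema_version",
}


def decode_refund_approved(payload: dict) -> RefundApprovedMessage:
    """Translate the wire payload, failing loudly (with correlation) on missing fields."""
    missing = [wire for wire in FIELD_MAP if payload.get(wire) in (None, "")]
    if missing:
        raise ValueError(
            f"refund.approved.v2 missing {missing} "
            f"(correlationId={payload.get('correlationId')!r})"
        )
    return RefundApprovedMessage(
        **{attr: payload[wire] for wire, attr in FIELD_MAP.items()}
    )


good = {
    "correlationId": "c_1", "refundId": "r_123", "tenantId": "t_1",
    "actor": {"userId": "user_9"}, "amount": 500, "reason": "damaged item",
    "schemaVersion": 2, "approvedAt": "2024-01-01T12:00:00.000Z",
}
message = decode_refund_approved(good)

# A producer that drops tenant scope is rejected at the boundary, not three steps later.
bad = {key: value for key, value in good.items() if key != "tenantId"}
try:
    decode_refund_approved(bad)
    decode_error = None
except ValueError as exc:
    decode_error = str(exc)
```

Keeping the mapping in one explicit table means a renamed field shows up as a decode error with a correlation ID attached, rather than as a `None` that surfaces later inside the payment call.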

Playwright Example: Verify the Full Workflow, Not the Toast

A realistic browser test should not stop after the click. It should verify that the system eventually reaches the real, externally confirmed state.

javascript
import { test, expect } from '@playwright/test';

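// Poll a test-only status endpoint until the refund reports completion.
async function pollRefundStatus(request, refundId, attempts = 20, delayMs = 1000) {
  for (let i = 0; i < attempts; i++) {
    const response = await request.get(`/api/test/refunds/${refundId}/status`);
    if (!response.ok()) {
      throw new Error(`Status endpoint returned ${response.status()}`);
    }
    const body = await response.json();
    if (body.status === 'completed') return body;
    await new Promise(r => setTimeout(r, delayMs));
  }
  throw new Error(`Refund ${refundId} did not complete after ${attempts} attempts`);
}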

test('refund approval propagates across queue, processor, webhook, and finance task', async ({ page, request }) => {
  const refundId = 'r_123';

  await page.goto(`/refunds/${refundId}`);
  await page.getByRole('button', { name: 'Approve refund' }).click();

  await expect(page.getByText('Approval submitted')).toBeVisible();
  await expect(page.getByText('Pending execution')).toBeVisible();

  const finalState = await pollRefundStatus(request, refundId);

  expect(finalState.processorStatus).toBe('succeeded');
  expect(finalState.webhookReceived).toBe(true);
  expect(finalState.financeTaskCreated).toBe(true);
  expect(finalState.approvedBy).toBe('support-user-1');

  await page.reload();
  await expect(page.getByText('Completed')).toBeVisible();
});

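This test assumes you expose a test-only status endpoint or equivalent harness in a pre-production environment. It also assumes Playwright's baseURL is configured and the browser session is already signed in as support-user-1, since no login step is shown. Some teams resist the test-only endpoint because it feels impure. In practice, it is often the difference between verifying a workflow and pretending to.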

CI/CD Example: Add a Boundary Verification Stage

Your pipeline should distinguish between local correctness and workflow correctness.

yaml
name: ci

on:
  pull_request:
  push:
    branches: [main]

jobs:
  fast-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
      - run: npm test

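  browser-tests:
    runs-on: ubuntu-latest
    needs: fast-checks
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run test:e2e

  boundary-verification:
    runs-on: ubuntu-latest
    needs: browser-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # The workflow tests need their dependencies installed in this job too.
      - run: npm ci
      - run: docker compose up -d api worker queue webhook-mock finance-mock
      - run: ./scripts/wait-for-stack.sh
      - run: npm run test:workflow -- --grep "@boundary"
      - run: python scripts/assert_no_dead_letters.py
      - run: python scripts/assert_webhook_delivery.py
      - run: python scripts/assert_finance_tasks.py
      # Keep container logs when a handoff assertion fails.
      - if: failure()
        run: docker compose logs --no-color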

This does not mean every PR needs a 45-minute gauntlet. It means your CI/CD system should contain at least one stage that verifies critical workflow handoffs with real infrastructure or high-fidelity substitutes.

For high-risk flows, that stage should be release-blocking.

Tools Comparison: What Each Layer Catches and Misses

Here is the practical tradeoff.

Unit tests

Good for:

pure logic
serializers
validation rules
state transitions
fast debugging feedback

Misses:

queue contracts
auth propagation
webhook reality
third-party behavior
multi-stage workflow commitments

Integration tests

Good for:

database interactions
API contracts inside one service
message producer/consumer compatibility
internal modules working together

Misses:

full user intent propagation
real browser behavior
production auth topology
external callback timing

Playwright/browser tests

Good for:

real UI interactions
accessibility regressions
form behavior
visible workflow states
basic end-user confidence

Misses when heavily stubbed:

real backend semantics
queue and worker failures
true async completion
third-party edge behavior

Contract tests

Good for:

schema compatibility across service boundaries
catching drift earlier than full-system tests
enforcing versioned expectations

Misses:

sequencing issues
retries and timing
operational visibility
human handoff correctness

Workflow or boundary verification tests

Good for:

proving user intent reaches downstream consequence
catching propagation failures
validating async handoffs
exposing false confidence from green UI/API tests

Costs:

slower
more environment complexity
more harness work
more careful test data management

That cost is justified for flows that affect revenue, trust, compliance, provisioning, billing, identity, or irreversible user actions.

Actionable Practices for Teams Shipping Agent-Written PRs

If you use AI-assisted development, adopt these practices aggressively.

1. Define workflow commitments explicitly

For every critical flow, write a short contract:

trigger
boundaries crossed
required context at each hop
final external consequence
observable evidence of completion
acceptable timeout/retry behavior

If this is not written down, your tests will drift toward local implementation details.
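
Such a contract does not have to live only in a document. As a rough sketch, it can be expressed as data and checked against what a test run actually observed at each hop. The `Hop` and `WorkflowCommitment` types, the hop names, and the field lists below are illustrative assumptions for the refund example, with the timeout taken from the five-minute expectation mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    required_context: tuple


@dataclass(frozen=True)
class WorkflowCommitment:
    trigger: str
    hops: tuple
    final_consequence: str
    completion_evidence: str
    timeout_seconds: int


def missing_context(commitment: WorkflowCommitment, observed: dict) -> dict:
    """Map each hop name to the required fields absent from what was observed there."""
    gaps = {}
    for hop in commitment.hops:
        seen = observed.get(hop.name, {})
        absent = [key for key in hop.required_context if key not in seen]
        if absent:
            gaps[hop.name] = absent
    return gaps


REFUND_APPROVAL = WorkflowCommitment(
    trigger="support rep clicks Approve refund",
    hops=(
        Hop("api_request", ("userId", "tenantId", "refundId")),
        Hop("queue_message", ("correlationId", "tenantId", "actor", "amount")),
        Hop("processor_call", ("correlationId", "tenantId", "amount")),
    ),
    final_consequence="payment processor confirms the refund",
    completion_evidence="webhook received and finance review task created",
    timeout_seconds=300,
)

# What a test harness captured: tenant scope was lost at the queue, and the
# processor was never called at all.
observed = {
    "api_request": {"userId": "user_9", "tenantId": "t_1", "refundId": "r_123"},
    "queue_message": {"correlationId": "c_1", "actor": {"userId": "user_9"}, "amount": 500},
}
gaps = missing_context(REFUND_APPROVAL, observed)
print(gaps)
```

The value of writing it this way is that the contract names the hops and the context each one needs, so a failure report points at a boundary instead of at a symptom.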

2. Treat “accepted” and “completed” as different states

A huge amount of user confusion and debugging waste comes from UIs and APIs collapsing async acceptance into completion.

Use state names that reflect reality:

submitted
queued
processing
awaiting_callback
completed
failed_needs_retry
failed_needs_manual_review

This reduces false confidence for both users and engineers.
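
To keep those states honest in code, one option is an explicit transition table in which completed is reachable only from the state that waits for external confirmation. The following is a minimal sketch; the state names come from the list above, but the allowed transitions are an assumption and would need to match your own workflow.

```python
from enum import Enum


class RefundState(str, Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    PROCESSING = "processing"
    AWAITING_CALLBACK = "awaiting_callback"
    COMPLETED = "completed"
    FAILED_NEEDS_RETRY = "failed_needs_retry"
    FAILED_NEEDS_MANUAL_REVIEW = "failed_needs_manual_review"


S = RefundState

# Assumed transition table: COMPLETED can only follow AWAITING_CALLBACK,
# so acceptance can never be recorded as completion.
ALLOWED = {
    S.SUBMITTED: {S.QUEUED},
    S.QUEUED: {S.PROCESSING},
    S.PROCESSING: {S.AWAITING_CALLBACK, S.FAILED_NEEDS_RETRY, S.FAILED_NEEDS_MANUAL_REVIEW},
    S.AWAITING_CALLBACK: {S.COMPLETED, S.FAILED_NEEDS_RETRY, S.FAILED_NEEDS_MANUAL_REVIEW},
    S.FAILED_NEEDS_RETRY: {S.QUEUED, S.FAILED_NEEDS_MANUAL_REVIEW},
    S.COMPLETED: set(),
    S.FAILED_NEEDS_MANUAL_REVIEW: set(),
}


def transition(current: RefundState, target: RefundState) -> RefundState:
    """Return the new state, or raise if the move skips a required stage."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = transition(S.SUBMITTED, S.QUEUED)

# Jumping straight from submission to completion is rejected.
try:
    transition(S.SUBMITTED, S.COMPLETED)
    shortcut_error = None
except ValueError as exc:
    shortcut_error = str(exc)
```

With a table like this, an optimistic code path that tries to mark a refund completed before the callback arrives fails at the point of the mistake.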

3. Propagate correlation IDs everywhere

If a workflow crosses boundaries without a shared identifier, debugging becomes archaeology.

Correlation IDs should appear in:

frontend-initiated requests
queue messages
worker logs
webhook processing logs
task creation records
audit trails

This is not optional for serious systems.
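
On the worker side, one way to get the ID into every log line without passing it through each function call is a context variable plus a logging filter. This is a sketch using only the Python standard library; `handle_message`, the logger name, and the log format are placeholders.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID for whatever message is currently being handled.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True


def handle_message(message: dict, logger: logging.Logger) -> None:
    """Adopt the incoming correlation ID (or mint one) for the duration of the handler."""
    token = correlation_id_var.set(message.get("correlationId") or str(uuid.uuid4()))
    try:
        logger.info("processing refund %s", message["refundId"])
    finally:
        correlation_id_var.reset(token)


# Log to an in-memory stream here so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())

logger = logging.getLogger("refund_worker_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

handle_message({"correlationId": "c_1", "refundId": "r_123"}, logger)
log_line = stream.getvalue().strip()
print(log_line)
```

Resetting the variable in a `finally` block matters for long-lived workers: without it, a message that arrives with no ID could inherit the previous message's ID.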

4. Preserve actor and tenant context explicitly

Do not assume downstream systems can reconstruct identity correctly. Carry the minimal required context across boundaries, and validate it at the consumer.

Many production failures that look like “queue bugs” are actually authorization context bugs.

5. Add a small number of critical boundary tests

Do not try to fully system-test every path. Pick the flows where a broken handoff is expensive:

payment approval
account provisioning
subscription changes
invitation and access grants
document signing
refund execution
compliance review submission

Build durable workflow tests for those first.

6. Avoid excessive stubbing in “end-to-end” tests

If you stub the exact boundaries where failures usually happen, your tests are performing confidence theater.

Use stubs selectively for non-critical dependencies, but keep critical handoffs real enough to expose contract, timing, and propagation issues.

7. Add dead-letter and retry assertions to test runs

A workflow test should fail not just when the UI looks wrong, but when hidden operational signals indicate a broken handoff.

Examples:

dead-letter queue not empty
webhook retry count exceeded
missing finance task
orphaned approval record with no execution record
status stuck in intermediate state beyond threshold

These checks surface silent failures that normal UI assertions miss.
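
A post-run check over those signals might look like the sketch below. It works on plain in-memory records so that it runs on its own; in a real pipeline the inputs would come from your queue, database, and task system. The record shapes and the `find_broken_handoffs` helper are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_broken_handoffs(approvals, executions, dead_letters, now, stuck_after):
    """Return human-readable problems; an empty list means the hidden signals look clean."""
    problems = []
    if dead_letters:
        problems.append(f"dead-letter queue holds {len(dead_letters)} message(s)")
    executed = {row["correlation_id"] for row in executions}
    for approval in approvals:
        cid = approval["correlation_id"]
        if cid not in executed:
            problems.append(f"orphaned approval {cid}: no execution record")
        if approval["status"] != "completed" and now - approval["updated_at"] > stuck_after:
            problems.append(f"approval {cid} stuck in {approval['status']}")
    return problems


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
approvals = [
    {"correlation_id": "c_1", "status": "completed",
     "updated_at": now - timedelta(minutes=60)},
    {"correlation_id": "c_2", "status": "approved_pending_execution",
     "updated_at": now - timedelta(minutes=10)},
]
executions = [{"correlation_id": "c_1"}]
dead_letters = [{"correlation_id": "c_2"}]

problems = find_broken_handoffs(
    approvals, executions, dead_letters, now, stuck_after=timedelta(minutes=5)
)
for problem in problems:
    print(problem)
```

A CI step can run a check like this after the workflow tests and exit non-zero when the list is not empty, so a run in which the UI assertions passed but a message was dead-lettered still fails.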

8. Require agent-generated changes to identify boundary impact

When an agent opens a PR, require structured notes such as:

boundaries touched
contracts changed
async behavior changed
auth context changed
new observable signals added
workflow tests updated

This forces both the agent and reviewer to reason beyond local code diffs.

9. Review for handoff semantics, not just code style

A senior review comment should often sound like this:

What confirms the downstream action really happened?
Where is tenant context preserved for the worker?
What happens if the webhook arrives twice?
Why does the UI say completed before processor confirmation?
How do we debug this with one correlation ID?

That is much more valuable than arguing about helper function naming.

10. Make boundary observability part of the feature

Observability is not a post-incident concern. For async workflows, it is part of the product.

When you ship a feature crossing boundaries, also ship:

correlation IDs
structured logs
event timelines
stuck-state alerts
queue depth dashboards
webhook failure metrics
audit entries for human approvals

Without these, debugging turns into distributed guesswork.

A Review Heuristic: Follow the Intent Chain

When reviewing a PR, especially an agent-written one, trace one user action through the full chain.

Ask:

What exact intent is the user expressing?
How is that intent encoded in the first request?
What context must survive each boundary?
What evidence proves the next system accepted it meaningfully?
What downstream effect are we actually promising?
What confirms that effect occurred?
What does the user see if the chain is delayed, duplicated, or broken?
How do we debug one failed execution end to end?

If the PR cannot answer those questions, it is not production-ready, no matter how green the tests are.

Why This Matters More Now

AI increases output. That is already obvious. The less obvious consequence is that teams can now generate a lot more locally correct code than they can deeply verify.

That changes the bottleneck.

The bottleneck is no longer writing the component, the endpoint, or even the test file. The bottleneck is ensuring that changes preserve real workflow commitments across boundaries.

If your testing strategy remains centered on isolated correctness, you will ship more regressions with more confidence and spend more time debugging failures that only appear when a real user traverses a real workflow.

This is why developer productivity cannot be measured by merged PR count or even by green CI/CD runs alone. Productive teams are the ones that reduce the distance between a user’s intended action and the system’s verified outcome.

That requires better testing, better observability, and better review discipline.

Conclusion

The most dangerous failures in modern software are often not inside components. They are between them.

A UI can render correctly. An endpoint can return success. A queue can accept a message. A worker can start. A webhook can eventually arrive. And the product can still fail the user because the handoff between those stages was never truly verified.

Agent-written changes make this more visible, not because agents are uniquely reckless, but because they expose a weakness that already existed: most teams test steps, not chains.

If you want reliability, stop treating green tests as proof that the workflow works. Start asking whether the intended action actually propagated across the boundaries that matter.

Test the handoff. Assert the downstream consequence. Preserve context across every hop. Make async state honest. Instrument the chain so debugging is possible.

That is how you reduce false confidence.

That is how you make CI/CD mean something.

And that is how you build systems that work for real users, not just for test runners.