A pull request is opened by an agent. The diff is tidy. Type checks pass. Unit tests are green. CI says the branch is healthy. The reviewer scans the code, leaves one comment about naming, approves, and merges.
An hour later, nobody can complete checkout.
Not because the payment API is down. Not because production fell over. The failure is smaller and more common than that: the discount field shifts focus at the wrong time, a client-side state update drops the selected shipping method, the submit button becomes enabled before the tax quote finishes loading, and the request payload reaches the backend missing one field the old UI always sent.
Everything looked fine in the artifacts developers are used to checking. The code compiled. The tests passed. The PR was reviewed. The deploy succeeded.
What failed was the workflow.
That gap matters more now because software teams are generating more change than they can manually reason about. AI coding agents make it cheap to produce diffs, refactors, migrations, and “small” UI updates. That is useful. It is also dangerous. The limiting factor in delivery is no longer writing code. It is verifying that the thing users actually do still works after the code changes.
That is the new accountability gap in modern delivery: your agent opened the PR, but nobody proved the user journey still completes.
The failure mode nobody owns
Most teams have clear ownership for code quality in theory:
- Developers own implementation.
- Reviewers own code review.
- CI/CD owns automated validation.
- QA owns pre-release checks.
- Product owns acceptance.
In practice, workflow failures slip through because each layer validates a different abstraction.
Developers verify functions, types, and local behavior. Reviewers verify structure, maintainability, and obvious logic issues. CI verifies that the repository builds and that selected tests still pass. QA verifies whatever fits into the time available before release. Product verifies goals, not every branch of every interaction.
None of those, by default, answer the question that actually matters in production:
Can a real user still complete the task this feature is supposed to support?
That question is uncomfortable because it cuts across frontend state, backend responses, auth, browser timing, third-party integrations, environment config, and asynchronous behavior. It is not owned by a single function or service. It exists at the seam between systems.
And seams are exactly where generated code causes trouble.
An AI agent can make a change that is internally consistent and locally correct while still damaging the end-to-end action. It can rename fields consistently in one layer but miss a dependency in another. It can simplify a form component but unintentionally change event order. It can update API handling to match a new schema but overlook a redirect edge case. It can satisfy every coded expectation in the repository while violating the uncoded expectation that users must still be able to sign in, pay, upload, or recover an account.
The result is a kind of false confidence that modern engineering teams increasingly recognize but still often tolerate: green checks on code artifacts standing in for real product correctness.
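The field-rename case can be made concrete. The sketch below is illustrative only: the required-field list, the sample payloads, and the helper name `missingOrderFields` are assumptions, not part of any real API. It shows how a payload can be well-formed in its own layer and still be missing a field the backend requires.

```ts
// Hypothetical contract: fields the backend requires on an order payload.
const REQUIRED_ORDER_FIELDS = ["address", "shipping_option_id", "tax_quote_id"] as const;

type OrderPayload = Record<string, unknown>;

// Returns the required fields that are missing or empty in a payload.
function missingOrderFields(payload: OrderPayload): string[] {
  return REQUIRED_ORDER_FIELDS.filter((field) => {
    const value = payload[field];
    return value === undefined || value === null || value === "";
  });
}

// A payload that matches the backend contract (illustrative values).
const completePayload: OrderPayload = {
  address: "100 Market St",
  shipping_option_id: "express",
  tax_quote_id: "q_123",
};

// A payload from a form layer that still uses the old field name.
const driftedPayload: OrderPayload = {
  address: "100 Market St",
  shippingMethod: "express", // the backend expects shipping_option_id
  tax_quote_id: "q_123",
};

const completeMissing = missingOrderFields(completePayload); // no missing fields
const driftedMissing = missingOrderFields(driftedPayload);   // "shipping_option_id" is missing
```

A check like this closes one seam. It says nothing about sequence or timing, so it complements workflow verification rather than replacing it.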
Why PR review catches style but misses action-level regressions
Code review is good at catching certain classes of failure:
- obvious logic bugs
- missing null checks
- bad abstractions
- security smells
- maintainability issues
- naming and consistency problems
It is bad at simulating workflow execution in the reviewer’s head.
That is not because reviewers are lazy. It is because the cognitive task is unreasonable.
Consider a PR that touches:
- a React checkout form
- a shared validation utility
- a backend endpoint that now expects shipping_option_id instead of shippingMethod
- a debounce hook used for address autocomplete
- analytics instrumentation added by the agent
A reviewer can confirm the code looks coherent. They can verify the new field names line up where they appear in the diff. They can note that tests were updated. What they usually cannot do from static review alone is predict that selecting an address now triggers a rerender that resets the shipping option, producing a payload mismatch only when a user edits their ZIP code after applying a coupon.
That is an action-level regression. It emerges from sequence, timing, and state transitions. It is not visible in syntax. It is not even visible in isolated logic.
This is especially true with AI-generated changes because the diffs often look mechanically complete. Agents are very good at producing code that appears consistent. That visual completeness makes reviewers more likely to trust the change set than deeply question workflow behavior.
Review becomes a check for plausibility, not proof.
And plausibility is not reliability.
Why CI/CD still over-indexes on code artifacts
A lot of teams say they have “strong CI” when what they really have is a fast artifact validation pipeline.
Their pipeline checks things like:
- linting
- type safety
- unit tests
- integration tests against mocks
- coverage thresholds
- image build success
- dependency scans
- migration checks
All useful. None sufficient.
CI/CD systems are usually designed around repository outputs because those are easy to automate and cheap to parallelize. User workflows are slower, more brittle, and often require environments that resemble reality. So they get deprioritized or shoved to the end as a smoke test nobody trusts.
This leads to a harmful inversion: the pipeline becomes rigorous about whether code looks correct to machines, and weak about whether the product still works for humans.
You can see this in how teams describe incidents:
- “All tests passed.”
- “The deployment was successful.”
- “The endpoint was healthy.”
- “Nothing failed in CI.”
Those statements are operationally comforting, and irrelevant to the product if users cannot finish the task.
A successful build is not a successful workflow.
A passing test suite is not evidence that auth redirects terminate correctly.
A healthy service is not proof that file uploads survive the browser, CDN, signed URL exchange, and asynchronous post-processing.
CI/CD tends to fail here for structural reasons.
1. It optimizes for determinism over realism
Mocked dependencies and isolated tests reduce flake, but they also remove the conditions where workflow bugs happen.
2. It measures implementation confidence, not outcome confidence
Most pipelines answer “did the code behave as expected in this narrow harness?” instead of “did the user objective complete under production-like conditions?”
3. It treats end-to-end testing as optional garnish
Workflow tests are often few, slow, and run only nightly or after merge. That means the actual release gate is still code-centric, not user-centric.
4. It assumes regressions are component-local
Real product failures are often cross-layer. A tiny frontend change can invalidate a backend assumption or break a browser-only interaction path that no service-level test sees.
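The first reason is easy to demonstrate without a browser. The sketch below is a toy model, and every name in it (`QuoteResponse`, `lastArrivalWins`, `latestRequestWins`, the sample latencies) is an assumption made for illustration. It shows a handler that looks correct when a mock answers instantly and applies a stale value once responses arrive out of order.

```ts
// A response to a tax-quote request, tagged with the order it was requested in.
interface QuoteResponse {
  requestId: number; // 1 = first request, 2 = second, ...
  arrivalMs: number; // when the response reaches the client
  tax: number;
}

// Naive handler: whichever response arrives last wins.
function lastArrivalWins(responses: QuoteResponse[]): number | null {
  let tax: number | null = null;
  const byArrival = responses.slice().sort((a, b) => a.arrivalMs - b.arrivalMs);
  for (const r of byArrival) {
    tax = r.tax;
  }
  return tax;
}

// Guarded handler: drop any response older than the newest one already applied.
function latestRequestWins(responses: QuoteResponse[]): number | null {
  let tax: number | null = null;
  let newestApplied = 0;
  const byArrival = responses.slice().sort((a, b) => a.arrivalMs - b.arrivalMs);
  for (const r of byArrival) {
    if (r.requestId < newestApplied) continue; // stale response
    newestApplied = r.requestId;
    tax = r.tax;
  }
  return tax;
}

// Mocked harness: responses arrive instantly, in request order.
const mocked: QuoteResponse[] = [
  { requestId: 1, arrivalMs: 0, tax: 4.0 },
  { requestId: 2, arrivalMs: 0, tax: 7.5 },
];

// Realistic latency: the first request is slow and lands after the second.
const realistic: QuoteResponse[] = [
  { requestId: 1, arrivalMs: 900, tax: 4.0 },
  { requestId: 2, arrivalMs: 300, tax: 7.5 },
];

const naiveMocked = lastArrivalWins(mocked);         // 7.5, looks correct
const naiveRealistic = lastArrivalWins(realistic);   // 4, the stale quote wins
const guardedRealistic = latestRequestWins(realistic); // 7.5
```

The naive handler passes under the mock because the mock removed the one condition that breaks it. That is the trade the deterministic harness makes.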
Why unit tests and QA are not enough anymore
Unit tests are still necessary. They are just answering a smaller question than many teams pretend.
A unit test can prove a validation function rejects empty input. It cannot prove a user can sign up through a third-party auth provider, return through the redirect, land on the right screen, and complete onboarding without local state being lost.
A backend integration test can prove an endpoint accepts multipart form data. It cannot prove the browser correctly obtains a signed upload URL, preserves the file metadata, handles progress state, and posts the final asset reference after upload completes.
Manual QA helps, but it does not scale to the rate and shape of change introduced by AI-assisted development. When agents can produce many small PRs quickly, the old idea that a human will click through every meaningful workflow before merge becomes a fantasy.
QA also tends to be constrained by time and environment mismatch:
- They test only major paths.
- They test late.
- They test on staging conditions that differ from production.
- They test happy paths more than stateful edge cases.
None of this is criticism of QA. It is just reality. The volume of change has increased faster than manual verification capacity.
So teams end up with a dangerous set of beliefs:
- if unit tests are green, the logic is good
- if CI is green, the change is safe
- if QA did not flag it, users are probably fine
That stack of assumptions collapses the moment a workflow depends on timing, redirects, browser behavior, external services, or state carried across steps.
The core insight: reliability lives at the workflow layer
If your business depends on users completing actions, then your release confidence has to be tied to actions, not just code paths.
That means elevating workflow verification to a first-class release gate.
Not a nice-to-have. Not a nightly suite everyone ignores. Not three brittle browser tests built years ago and muted whenever they fail.
A first-class gate means:
- identifying the workflows that matter commercially or operationally
- expressing them as executable tests
- running them in CI/CD against realistic environments
- failing the release when they break
- treating workflow traces as primary debugging artifacts
This is the shift many teams still resist because it sounds expensive. But compare the cost to the current alternative: producing more code faster while measuring the wrong thing.
If checkout, login, account creation, upload, search, or provisioning are core user actions, then “tests passed” should not mean “linters and unit tests succeeded.” It should mean “the user journey completed.”
That is a more honest definition of done.
Example: checkout passes tests and still fails users
Here is a simplified frontend example in JavaScript. The code looks fine. The unit tests may even pass.
```js
import { useEffect, useState } from 'react'
// AddressPicker and ShippingSelector are assumed to be imported from elsewhere in the app.

export function CheckoutForm({ quoteTax, submitOrder }) {
  const [address, setAddress] = useState(null)
  const [shippingMethod, setShippingMethod] = useState('standard')
  const [taxReady, setTaxReady] = useState(false)
  const [submitting, setSubmitting] = useState(false)

  // Requote tax whenever the address changes.
  useEffect(() => {
    if (!address) return
    setTaxReady(false)
    quoteTax(address).then(() => {
      // Reset shipping to the default once the quote returns.
      setShippingMethod('standard')
      setTaxReady(true)
    })
  }, [address, quoteTax])

  async function onSubmit() {
    setSubmitting(true)
    try {
      await submitOrder({
        address,
        shippingMethod,
      })
    } finally {
      // Re-enable the button even if the request fails.
      setSubmitting(false)
    }
  }

  return (
    <div>
      <AddressPicker onChange={setAddress} />
      <ShippingSelector value={shippingMethod} onChange={setShippingMethod} />
      <button disabled={!address || !taxReady || submitting} onClick={onSubmit}>
        Place order
      </button>
    </div>
  )
}
```
The bug is subtle: whenever the address changes and tax is requoted, the selected shipping method resets to standard. If a user picks express, then tweaks the address, the UI may silently overwrite their selection. Unit tests for submitOrder payload shape still pass. Component tests that assert the button disables during tax quote still pass. Reviewers may not notice.
What catches it is a workflow test that behaves like a user.
```ts
import { test, expect } from '@playwright/test'

test('user can complete checkout with express shipping', async ({ page }) => {
  // Relative URLs assume baseURL is set in the Playwright config.
  await page.goto('/checkout')

  // Enter an address, then choose a non-default shipping method.
  await page.getByLabel('Address').fill('100 Market St')
  await page.getByRole('button', { name: 'Use entered address' }).click()
  await page.getByLabel('Shipping method').selectOption('express')

  // Edit the address after choosing shipping, which triggers a tax requote.
  await page.getByLabel('Address').fill('100 Market Street')
  await page.getByRole('button', { name: 'Use entered address' }).click()

  // The earlier shipping choice must survive the requote.
  await expect(page.getByLabel('Shipping method')).toHaveValue('express')

  // The journey only counts if the order is actually confirmed.
  await page.getByRole('button', { name: 'Place order' }).click()
  await expect(page).toHaveURL(/confirmation/)
  await expect(page.getByText('Order confirmed')).toBeVisible()
})
```
That test validates the workflow, not just the implementation detail. It captures sequence and state preservation. It proves the user action still completes.
Example: auth redirect loops with green backend checks
Authentication flows are a classic source of false confidence because the pieces are often individually healthy.
- Auth provider works.
- Callback endpoint responds without errors.
- Session cookie is set.
- Protected API returns data for valid sessions.
And yet users get stuck in a redirect loop.
A simplified Python backend example:
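```python
from flask import Flask, redirect, render_template, request, session

app = Flask(__name__)
app.secret_key = 'replace-me'  # placeholder; Flask sessions require a secret key

# exchange_code_for_token and decode_user are assumed to be defined elsewhere.


@app.route('/auth/callback')
def auth_callback():
    # Trade the authorization code for a token and store the user in the session.
    token = exchange_code_for_token(request.args['code'])
    session['user'] = decode_user(token)
    # A real handler should validate next_url to avoid open redirects.
    next_url = request.args.get('next', '/dashboard')
    return redirect(next_url)


@app.route('/dashboard')
def dashboard():
    # Without a session, send the user back through login.
    if 'user' not in session:
        return redirect('/login?next=/dashboard')
    return render_template('dashboard.html')
```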
Now imagine a frontend change or proxy rule causes the session cookie to be set with the wrong SameSite or Domain attribute, and only on the callback route. Backend tests pass because they inspect the route behavior in isolation. The auth provider integration looks healthy. CI stays green.
Users, however, authenticate successfully and land right back on login.
Only a browser-level workflow test sees the complete failure.
```ts
import { test, expect } from '@playwright/test'

test('user can sign in and reach dashboard', async ({ page }) => {
  await page.goto('/login')
  await page.getByRole('button', { name: 'Continue with SSO' }).click()

  // Complete the provider's login form.
  await page.getByLabel('Email').fill('user@example.com')
  await page.getByLabel('Password').fill('correct-horse-battery-staple')
  await page.getByRole('button', { name: 'Sign in' }).click()

  // The redirect chain must end on the dashboard, not back on login.
  await expect(page).toHaveURL(/dashboard/)
  await expect(page.getByText('Welcome back')).toBeVisible()
})
```
This is the level where product correctness lives.
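To see why the server-side checks stay green in the cookie scenario, it helps to model the decision the browser makes. The sketch below is a simplified take on cookie domain matching as described in RFC 6265. The hostnames are hypothetical, and the function deliberately ignores IP addresses and host-only cookies.

```ts
// Simplified cookie domain matching, after RFC 6265 section 5.1.3.
// Assumptions: hosts are plain DNS names (no IP addresses), and the cookie
// carries an explicit Domain attribute (host-only cookies are not modeled).
function domainMatches(requestHost: string, cookieDomain: string): boolean {
  const host = requestHost.toLowerCase();
  // A leading dot in the Domain attribute is ignored.
  const domain = cookieDomain.toLowerCase().replace(/^\./, "");
  const suffix = "." + domain;
  // Match when identical, or when the host ends with ".<domain>".
  return host === domain || host.slice(-suffix.length) === suffix;
}

// Hypothetical hosts: the callback runs behind auth.example.com,
// the dashboard is served from app.example.com.
const dashboardHost = "app.example.com";

const sentWithNarrowDomain = domainMatches(dashboardHost, "auth.example.com"); // false
const sentWithParentDomain = domainMatches(dashboardHost, "example.com");      // true
```

With the narrow domain, the callback sets the cookie and returns its redirect, so a route-level test passes. The browser then withholds the cookie from the dashboard request, and the dashboard sends the user back to login.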
Example: file uploads fail in the seams
File uploads are another perfect example because they span browser APIs, backend signing, object storage, and asynchronous processing.
The backend endpoint can be correct. The storage bucket can be healthy. The UI can render. And users can still fail to upload.
Common workflow-level failures include:
- drag-and-drop events no longer attach the file
- the MIME type is altered or omitted
- signed URL expiration is too short under normal latency
- progress state completes before the final asset registration call succeeds
- redirect after upload clears transient state
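One failure in that list, progress reaching "complete" before the asset is registered, comes down to how upload state is modeled. The sketch below is one possible shape, not a prescribed design: the state names, event names, and `nextUploadState` are all assumptions. It makes "complete" unreachable until registration has succeeded.

```ts
type UploadState =
  | { status: "idle" }
  | { status: "uploading"; percent: number }
  | { status: "registering" } // bytes are stored, the asset record is not
  | { status: "complete"; assetId: string }
  | { status: "failed"; reason: string };

type UploadEvent =
  | { type: "start" }
  | { type: "progress"; percent: number }
  | { type: "bytesStored" }
  | { type: "registered"; assetId: string }
  | { type: "error"; reason: string };

function nextUploadState(state: UploadState, event: UploadEvent): UploadState {
  // Terminal states ignore further events.
  if (state.status === "complete" || state.status === "failed") return state;
  if (event.type === "error") return { status: "failed", reason: event.reason };

  switch (state.status) {
    case "idle":
      return event.type === "start" ? { status: "uploading", percent: 0 } : state;
    case "uploading":
      if (event.type === "progress") return { status: "uploading", percent: event.percent };
      if (event.type === "bytesStored") return { status: "registering" };
      return state;
    case "registering":
      return event.type === "registered"
        ? { status: "complete", assetId: event.assetId }
        : state;
    default:
      return state;
  }
}

// 100% progress and stored bytes still do not mean "complete".
const uploadEvents: UploadEvent[] = [
  { type: "start" },
  { type: "progress", percent: 100 },
  { type: "bytesStored" },
];
const idle: UploadState = { status: "idle" };
const afterBytes = uploadEvents.reduce<UploadState>(nextUploadState, idle);

const afterRegistration = nextUploadState(afterBytes, { type: "registered", assetId: "asset_1" });
const afterFailure = nextUploadState(afterBytes, { type: "error", reason: "registration call failed" });
```

A model like this can be unit tested, but only a browser-level run shows whether the UI actually waits for the final state before telling the user the upload is done.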
A realistic Playwright test catches what lower-level checks miss:
```ts
import { test, expect } from '@playwright/test'

test('user can upload a profile image', async ({ page }) => {
  await page.goto('/settings/profile')

  // Attach the fixture file through the real file input.
  await page.setInputFiles('input[type="file"]', 'tests/fixtures/avatar.png')
  await expect(page.getByText('Upload complete')).toBeVisible()

  // Saving must persist the uploaded asset, not just the form.
  await page.getByRole('button', { name: 'Save profile' }).click()
  await expect(page.getByText('Profile updated')).toBeVisible()
  await expect(page.locator('img[alt="Profile photo"]')).toBeVisible()
})
```
Notice what matters here: not whether one helper function worked, but whether the user completed the job.
What workflow verification should look like in CI/CD
If workflow verification is a release gate, it must live inside delivery, not beside it.
A practical model looks like this:
Tier 1: fast artifact checks
Keep linting, type checks, unit tests, and build validation. They are still valuable for debugging and developer productivity.
Tier 2: targeted workflow tests on every PR
Run a small, curated set of business-critical user journeys against ephemeral or preview environments.
Examples:
- sign up
- login
- checkout
- create project
- upload file
- invite user
- complete payment
Tier 3: broader cross-browser and edge-case coverage post-merge
Run expanded suites after merge or on release candidates, including retries, device variation, and less common paths.
The key is that Tier 2 must be blocking. If the primary workflow fails, the PR should not merge just because unit coverage is high.
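The blocking rule can be stated as a small decision function. This is a sketch under stated assumptions: the `CheckResult` shape, the check names, and `evaluateMergeGate` are hypothetical and not tied to any CI provider. It treats a run with no workflow check as a failure, so that green artifact checks alone cannot open the gate.

```ts
type CheckKind = "artifact" | "workflow";

interface CheckResult {
  name: string;
  kind: CheckKind;
  passed: boolean;
}

interface GateDecision {
  canMerge: boolean;
  blockingFailures: string[];
}

// Every failed check blocks the merge, and so does the absence of any
// workflow check: "nothing failed" is not the same as "the journey ran".
function evaluateMergeGate(results: CheckResult[]): GateDecision {
  const blockingFailures = results.filter((r) => !r.passed).map((r) => r.name);
  const hasWorkflowCheck = results.some((r) => r.kind === "workflow");
  if (!hasWorkflowCheck) {
    blockingFailures.push("no workflow check ran");
  }
  return { canMerge: blockingFailures.length === 0, blockingFailures };
}

// Artifact checks are green, but the checkout journey failed.
const decision = evaluateMergeGate([
  { name: "lint", kind: "artifact", passed: true },
  { name: "unit tests", kind: "artifact", passed: true },
  { name: "workflow: checkout", kind: "workflow", passed: false },
]);
// decision.canMerge is false; decision.blockingFailures is ["workflow: checkout"]
```

In most CI systems the same effect comes from marking the workflow job as a required status check. The function only makes the policy explicit.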
Here is a simple GitHub Actions example:
```yaml
name: ci

on:
  pull_request:
  push:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
      - run: npm run test:unit
      - run: npm run build

  workflow-verification:
    runs-on: ubuntu-latest
    needs: build-and-test
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run start:preview &
      - run: npx wait-on http://localhost:3000
      - run: npm run test:workflows
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-report
          path: playwright-report/
```
That last artifact matters. When workflow tests fail, debugging should begin with the browser trace, screenshots, console logs, and network history—not with someone arguing that the unit tests are probably enough.
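For those artifacts to exist, the test runner has to be told to record them. A minimal Playwright configuration sketch follows. The `testDir` path and `baseURL` are assumptions chosen to line up with the workflow above, so adjust both to your project.

```ts
// playwright.config.ts
import { defineConfig } from '@playwright/test'

export default defineConfig({
  testDir: './tests/workflows', // assumed location of the workflow tests
  retries: process.env.CI ? 1 : 0,
  // Write the HTML report where the upload-artifact step expects it.
  reporter: [['html', { outputFolder: 'playwright-report', open: 'never' }]],
  use: {
    baseURL: 'http://localhost:3000', // assumed to match the preview server
    trace: 'retain-on-failure',       // keep a full trace for failed tests
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
})
```

With this in place, a failed run leaves a trace that can be opened locally with `npx playwright show-trace` or viewed from the HTML report.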
Workflow tests are debugging tools, not just release gates
A common mistake is treating browser and end-to-end tests as ceremonial checks: useful only if they pass, annoying if they fail.
That mindset ignores their biggest value. Good workflow tests produce the best debugging evidence you can get for user-visible failures.
When a workflow test fails, you can inspect:
- DOM state at failure time
- console errors
- network requests and responses
- redirect chains
- timing and waits
- screenshots and video
- traces of user actions
That is far more actionable than a support ticket saying “checkout is broken for some users.”
It also changes team behavior. Once engineers get fast access to workflow traces in CI/CD, they stop debating whether a regression is “real enough” and start fixing the actual break.
In a world of AI-generated code, this matters even more. Agents accelerate code production, but they do not own production debugging. Humans do. So the validation system should optimize for the evidence humans need when generated changes go wrong.
A realistic tools comparison
No single testing layer solves everything. The right question is what each tool verifies and what it cannot.
| Approach | Good at | Misses | Best use |
|---|---|---|---|
| Linting/type checks | syntax, contracts, consistency | runtime behavior, user flows | fast baseline validation |
| Unit tests | local logic, edge cases in functions | cross-layer state, browser behavior | core logic correctness |
| Integration tests with mocks | service boundaries, API contracts | real redirects, timing, external systems | component/service interaction |
| Manual QA | exploratory testing, UX judgment | scale, repeatability, per-PR gating | release review and discovery |
| Browser workflow tests (Playwright/Cypress) | real user actions, redirects, async UI, critical paths | deep internal logic coverage | workflow verification |
| Production monitoring | real-world failures | pre-release prevention | detection and feedback |
If you are deciding where to invest, the underfunded layer in most teams is obvious: browser-driven workflow verification.
Playwright is especially effective here because it supports modern debugging artifacts, parallel execution, reliable selectors, and trace inspection. Cypress can also work well, especially for teams already invested in it, but the broader point is not brand selection. The point is to choose tooling that validates outcomes users care about.
Actionable practices for teams shipping with AI agents
If AI agents are writing or reshaping meaningful portions of your codebase, you need stronger release discipline at the workflow layer. Here is a practical operating model.
1. Define your critical workflows explicitly
Make a short list of actions that must never silently break.
For most SaaS products, this includes some version of:
- sign up
- sign in
- password reset
- checkout or upgrade
- file upload
- project creation
- invitation acceptance
- core search or query flow
If you cannot name your critical workflows, your CI/CD cannot protect them.
2. Map each workflow to an executable test
Do not stop at test plans in docs. Turn them into browser automation.
The test should prove completion, not just page rendering. For example, “checkout works” means an order was submitted and confirmation appeared—not that the form loaded.
3. Run the smallest meaningful workflow suite on every PR
Do not wait for nightly runs. If a workflow is important enough to care about, it is important enough to block on.
Keep the PR suite focused. Five to fifteen high-value workflows are often enough to materially improve reliability.
4. Use production-like environments where it matters
Auth, payments, uploads, and redirects often fail because staging differs from reality. Use preview environments and realistic configuration. Mock only what you must.
5. Treat flaky workflow tests as engineering debt, not a reason to remove the gate
A flaky critical-path test usually means one of three things:
- the product has timing instability
- the environment is unrealistic
- the test is poorly designed
All three are worth fixing. Flake is not an excuse to abandon workflow verification.
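Fixing flake starts with telling it apart from a real regression. One common working definition, used here as an assumption, is that a test is flaky if it both passed and failed on the same commit. The `TestRun` shape and `findFlakyTests` below are hypothetical and would need to be fed from your CI history.

```ts
interface TestRun {
  test: string;
  commit: string;
  passed: boolean;
}

// Returns the names of tests that both passed and failed on the same commit.
// A test that fails every time on a commit is treated as a real failure, not flake.
function findFlakyTests(runs: TestRun[]): string[] {
  const seen: Record<string, { test: string; pass: boolean; fail: boolean }> = {};
  for (const run of runs) {
    const key = run.test + "@" + run.commit;
    const entry = seen[key] || (seen[key] = { test: run.test, pass: false, fail: false });
    if (run.passed) entry.pass = true;
    else entry.fail = true;
  }

  const flaky: string[] = [];
  for (const key of Object.keys(seen)) {
    const entry = seen[key];
    if (entry.pass && entry.fail && flaky.indexOf(entry.test) === -1) {
      flaky.push(entry.test);
    }
  }
  return flaky.sort();
}

const history: TestRun[] = [
  { test: "checkout", commit: "abc", passed: false },
  { test: "checkout", commit: "abc", passed: true },  // passed on retry: flaky
  { test: "login", commit: "abc", passed: true },
  { test: "login", commit: "abc", passed: true },
  { test: "upload", commit: "abc", passed: false },
  { test: "upload", commit: "abc", passed: false },   // fails consistently: a real break
];

const flakyTests = findFlakyTests(history); // ["checkout"]
```

The consistently failing upload test is the one that should block the release. The checkout test is the one that needs an owner and a fix.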
6. Review traces, not just logs
When tests fail, require trace artifacts. Make them easy to access in CI.
This improves debugging speed and developer productivity because engineers can replay the failure instead of reproducing it blindly.
7. Ask agents to generate tests, but do not trust them blindly
AI can help draft Playwright tests, fixtures, and selectors. That is useful. But generated tests can be just as shallow as generated app code.
A bad workflow test often asserts that a button exists. A good one proves the user objective completed.
Use agents to accelerate test authoring, not to outsource judgment.
8. Make workflow pass/fail visible at the PR level
Do not bury workflow validation in a separate dashboard. Put it next to lint, unit, and build status so reviewers can see whether the actual user journey was verified.
9. Align ownership with user outcomes
The team that ships the feature should own the workflow checks for that feature. If ownership stops at code merge, workflow reliability will remain everyone’s problem and no one’s job.
10. Redefine “tests passed” in team language
This sounds small, but it matters culturally.
If someone says “tests passed,” ask: which tests? Did the critical workflow run? Did the user journey complete?
Teams become more reliable when they stop using broad reassuring phrases for narrow technical facts.
The uncomfortable truth about speed
AI increases code throughput. It does not automatically increase confidence throughput.
That distinction is where many teams are getting hurt.
The old bottleneck was implementation. Now implementation is cheaper. Validation is the constraint. If you ignore that and keep measuring delivery maturity by PR velocity, deploy frequency, or green CI runs, you will create the exact conditions for frequent workflow regressions.
This is why some teams feel paradoxically slower after adopting AI-assisted coding. They are shipping more change, but also spending more time debugging weird, cross-layer failures that escaped traditional testing.
The solution is not to reject AI. It is to stop pretending that code-centric checks are enough when code production gets dramatically easier.
As code generation accelerates, workflow verification becomes more—not less—important.
Conclusion
The real risk in AI-assisted delivery is not that agents write bad syntax. It is that they can produce plausible, passing, reviewable changes that still break the product exactly where users try to do something important.
That is why PR review, unit tests, and conventional CI/CD checks are no longer enough as the primary definition of software quality. They validate code artifacts. Your users experience workflows.
If reliability matters, your release gate has to match that reality.
Verify checkout, not just reducers.
Verify auth completion, not just callback handlers.
Verify uploads, not just signed URL helpers.
In other words: stop treating “tests passed” as proof unless the user journey actually ran.
Your agent may have opened the PR. Your CI may have gone green. But if nobody verified the workflow, nobody verified the product.
