Deterministic Replay for Browser Agents: Taming DOM Drift and Enabling Reproducible Debugging in CI and CD
Browser automation is growing up. We have agents that can read, click, type, and reason, guided by large models or rule-based planners. But anyone who has shipped a nontrivial browser agent knows the pain: a test that passes locally flakes in CI, a click that worked yesterday hits the wrong element today, or a canary that looked healthy suddenly explodes in production. Root cause is often the same trifecta: non-deterministic timing, network variance, and DOM drift.
This article lays out a pragmatic blueprint for deterministic replay of AI browser agent runs. The goal: capture and freeze enough of the world to replay the exact run and reproduce the bug with confidence. We will cover the architecture, concrete code-level hooks, a privacy-safe scrubber, and strict approaches to neutralize A/B tests and canary randomness. By the end, you should be able to design a recorder and replayer that stabilize flaky runs and raise the engineering bar for web automation.
Opinionated thesis:
- If you cannot replay a failure deterministically, you are guessing.
- Determinism for browser agents is doable today with commodity tooling: Chrome DevTools Protocol, Playwright or Puppeteer, and a thin layer of instrumentation.
- Reproducibility demands end-to-end control of network, time, randomness, and the DOM. Partial measures reduce flakes but do not eradicate them.
Why deterministic replay matters now
Modern web apps are a chaos engine for automation:
- Aggressive hydration and lazy loading reorder the DOM between runs.
- Ads, chat widgets, and analytics inject nodes at random times.
- Experimentation frameworks flip between variants using cookies, local storage, or opaque headers.
- Network variability, retries, and server-side A/B tests return different responses to the same URL.
- Timing is slippery. Microtasks, timers, and animation frames interleave differently across small environmental changes.
Browser agents driven by LLMs magnify these issues because they rely on context: current DOM and text content shape the next action. When the world shifts, the action diverges and the run derails.
Deterministic replay is the antidote:
- Debugging: reproduce failures byte-for-byte, including DOM and network, to see what the agent saw and why it acted.
- Flake triage: separate environmental non-determinism from true bugs.
- Safer canaries: capture failing canary sessions and replay them locally without re-hitting user data.
- CI stability: make headless runs repeatable across machines and time.
What we mean by deterministic replay
Given an agent run R, we want to record a bundle B such that:
- Replayer loads B and produces the same sequence of observable effects: the DOM states seen by the agent, event dispatch order, responses to network fetches, and time-dependent values.
- Agent actions applied to the replayed environment follow the same path with no external network or random differences.
- The run is portable across machines and time subject to a pinned browser build and OS container.
Determinism here does not require simulating the full browser engine. We target a practically deterministic environment by controlling four pillars:
- Network: capture all HTTP(S) and WebSocket I/O and stub it at replay time.
- DOM: capture an initial snapshot and incremental mutations in a form that can be reapplied.
- Time and scheduling: virtualize time and control timers, animation frames, and PRNG.
- Inputs: capture user agent actions from the agent runner and reapply them deterministically.
Additionally, we neutralize server-side and client-side experiments and scrub sensitive data before storage.
Architecture at a glance
- Recorder library attached to a headless browser run collects:
  - Network: request and response metadata, bodies, headers, cookies, WebSocket frames.
  - DOM: a full snapshot at T0 plus a stream of MutationObserver diffs at agent-visible boundaries.
  - Timing: a virtual clock timeline and hooks for timers and animation frames.
  - Inputs: clicks, key presses, scrolls, pointer moves that the agent performed.
  - Environment: browser build, OS, viewport, UA string, locale, timezone.
- Scrubber pipeline:
  - Redact or transform PII across DOM and network payloads using a deterministic scheme that preserves referential integrity.
- Bundle packager:
  - Assemble JSONL logs, compressed response bodies, and metadata into an artifact.
- Replayer:
  - Spin up the same browser build inside a pinned container.
  - Disable all external network and feed captured responses through a network stub.
  - Install a virtual time policy and determinize Math.random and Date.
  - Reconstruct the DOM from snapshot plus diffs, or rely on the stubbed network to let the app rebuild to the recorded state.
  - Reapply the recorded inputs with the recorded timestamps on virtual time.
- Validator:
  - Compare hashes of key DOM checkpoints and log warnings on divergence.
- CI and canary integration:
  - On failure, upload the bundle as an artifact and expose a one-click replay command and viewer.
Concrete implementation plan with Playwright and CDP
We assume Node.js and Playwright with access to the Chrome DevTools Protocol for low-level control. Puppeteer or WebDriver BiDi can be used similarly; CDP gives you the broadest capability for timing and snapshots.
Pin the environment
- Use a Docker image with a known Chromium version. Pin exact build numbers. Prefer the same engine for recording and replay.
- Fix locale, timezone, fonts, and GPU state. Use the new headless mode and disable the GPU for determinism when possible.
- Freeze UA string and Accept-Language.
Example Dockerfile snippet:
```dockerfile
FROM mcr.microsoft.com/playwright:v1.48.0-jammy
# Contains pinned Chromium, Firefox, and WebKit builds
ENV TZ=UTC
ENV LANG=en_US.UTF-8
# Optionally add custom fonts if the app needs them
```
Connect to CDP
```ts
import { chromium } from 'playwright';

const browser = await chromium.launch({
  headless: true,
  args: [
    '--disable-gpu',
    '--no-sandbox',
    '--disable-dev-shm-usage',
    '--disable-features=InterestCohort,Fledge',
  ],
});
const context = await browser.newContext({
  userAgent: 'MyAgent 1.0; Chrome pinned',
  locale: 'en-US',
  timezoneId: 'UTC',
});
const page = await context.newPage();
const client = await context.newCDPSession(page);
```
Establish virtual time
Chrome DevTools Protocol exposes Emulation.setVirtualTimePolicy which lets you control how time advances. This is critical to reproducible timer and animation frame ordering.
```ts
await client.send('Emulation.setVirtualTimePolicy', {
  policy: 'pause',
  budget: 0,
});

// Our own virtual clock state
let vtNow = 0; // ms

function advanceVirtualTime(ms: number) {
  vtNow += ms;
  return client.send('Emulation.setVirtualTimePolicy', {
    policy: 'advance',
    budget: ms,
  });
}
```
To make page scripts observe virtual time deterministically, also patch timer sources. Inject a preload script that replaces Date.now, new Date, performance.now, and Math.random with deterministic implementations driven by a seeded PRNG and vtNow from a binding.
Note: code passed to addInitScript runs inside the page, where Node imports like seedrandom are unavailable, so we inline a small seeded PRNG (mulberry32) instead. We also use a counter for rAF ids rather than Math.random, so instrumenting does not itself consume PRNG values.

```ts
await context.addInitScript({
  content: `
(function () {
  // Seeded PRNG: hash the seed string, then run mulberry32
  function mulberry32(a) {
    return function () {
      a |= 0; a = (a + 0x6D2B79F5) | 0;
      let t = Math.imul(a ^ (a >>> 15), 1 | a);
      t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
      return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
  }
  function hashSeed(s) {
    let h = 2166136261;
    for (let i = 0; i < s.length; i++) { h ^= s.charCodeAt(i); h = Math.imul(h, 16777619); }
    return h >>> 0;
  }
  let rng = mulberry32(hashSeed('default')); // seed overridden via binding below
  window._replaySeedRandom = {
    setSeed: function (seed) { rng = mulberry32(hashSeed(seed)); },
    random: function () { return rng(); },
  };

  const origDate = Date;
  let vtNow = 0;
  function nowMs() { return vtNow; }

  // Override Date, performance.now, and Math.random
  const DateProxy = function (...args) {
    if (this instanceof DateProxy) {
      if (args.length === 0) return new origDate(nowMs());
      return new origDate(...args);
    }
    return new origDate(nowMs()).toString();
  };
  DateProxy.now = () => nowMs();
  DateProxy.UTC = origDate.UTC;
  DateProxy.parse = origDate.parse;
  DateProxy.prototype = origDate.prototype;
  Object.defineProperty(window, 'Date', { value: DateProxy });
  Object.defineProperty(window, 'performance', {
    value: new Proxy(performance, {
      get(target, prop) {
        if (prop === 'now') return () => nowMs();
        return Reflect.get(target, prop);
      },
    }),
  });
  Math.random = function () { return window._replaySeedRandom.random(); };

  // requestAnimationFrame scheduling driven by virtual time
  const rafQueue = [];
  let nextRafId = 1;
  window.requestAnimationFrame = function (cb) {
    const id = nextRafId++;
    rafQueue.push({ id, cb });
    return id;
  };
  window.cancelAnimationFrame = function (id) {
    const idx = rafQueue.findIndex((x) => x.id === id);
    if (idx >= 0) rafQueue.splice(idx, 1);
  };
  function _replayTick(newNow) {
    vtNow = newNow;
    // Flush a stable snapshot of queued rAF callbacks
    const q = rafQueue.splice(0, rafQueue.length);
    for (const { cb } of q) {
      try { cb(vtNow); } catch (e) {}
    }
  }

  // Binding sinks for the host to advance virtual time and reseed
  window.__replay__updateTime = function (newNow) { _replayTick(newNow); };
  window.__replay__setSeed = function (seed) { window._replaySeedRandom.setSeed(seed); };
})();
`,
});

// Initialize seed and clock after page creation
await page.evaluate(() => {
  (window as any).__replay__setSeed('run-uuid-seed');
  (window as any).__replay__updateTime(0);
});
```
Note: CDP virtual time and patched timers must agree. On each advanceVirtualTime, call into the page to update vt.
```ts
async function tick(ms: number) {
  await advanceVirtualTime(ms);
  const now = vtNow;
  await page.evaluate((n) => (window as any).__replay__updateTime(n), now);
}
```
Intercept and record network
Use CDP Network and Fetch domains to see requests and, at replay time, to stub responses. Playwright also offers route handlers, but CDP gives access to raw encoded bodies and WebSocket frames.
Recording phase outline:
- Enable the Network domain and the Fetch domain (with requestPaused patterns) to intercept all requests.
- For each request, assign a stable id and record metadata and body.
- Let the request proceed to the network in record mode, capturing the response body and headers.
- For WebSocket, subscribe to Network.webSocketCreated, webSocketFrameSent, webSocketFrameReceived.
```ts
await client.send('Network.enable', {});
await client.send('Fetch.enable', { patterns: [{ urlPattern: '*' }] });

const networkLog: any[] = [];

client.on('Fetch.requestPaused', async (evt) => {
  const reqId = evt.requestId;
  const { request } = evt;
  // Collect request body; sanitize here or later in the scrubber
  const postData = request.postData || null;
  networkLog.push({
    type: 'request',
    id: reqId,
    url: request.url,
    method: request.method,
    headers: request.headers,
    body: postData,
    ts: vtNow,
  });
  // Continue to the real network while recording
  await client.send('Fetch.continueRequest', { requestId: reqId });
});

client.on('Network.responseReceived', async (evt) => {
  const { requestId, response } = evt;
  // Get the body for text/JSON responses; store binaries as base64
  try {
    const bodyRes = await client.send('Network.getResponseBody', { requestId });
    networkLog.push({
      type: 'response',
      id: requestId,
      status: response.status,
      headers: response.headers,
      mimeType: response.mimeType,
      body: bodyRes.base64Encoded ? { base64: bodyRes.body } : { text: bodyRes.body },
      ts: vtNow,
    });
  } catch (e) {
    // There may be no body for redirects or cached responses
    networkLog.push({
      type: 'response',
      id: requestId,
      status: response.status,
      headers: response.headers,
      mimeType: response.mimeType,
      body: null,
      ts: vtNow,
    });
  }
});

client.on('Network.webSocketCreated', (evt) => {
  networkLog.push({ type: 'ws_created', url: evt.url, id: evt.requestId, ts: vtNow });
});
client.on('Network.webSocketFrameSent', (evt) => {
  networkLog.push({ type: 'ws_tx', id: evt.requestId, payload: evt.response.payloadData, ts: vtNow });
});
client.on('Network.webSocketFrameReceived', (evt) => {
  networkLog.push({ type: 'ws_rx', id: evt.requestId, payload: evt.response.payloadData, ts: vtNow });
});
```
Replay phase outline:
- Enable Fetch and, for every paused request, look up a recorded response by a stable lookup key. Keying purely by requestId is not portable; instead, build a key from the URL, method, a header whitelist, and a body hash.
- Respond with Fetch.fulfillRequest using the recorded payload.
- Block any request without a match; optionally error to surface drift.
```ts
await client.send('Fetch.enable', { patterns: [{ urlPattern: '*' }] });

const index = new Map<string, any>();
for (const ev of networkLog) {
  if (ev.type === 'request') {
    const key = buildKey(ev.url, ev.method, ev.headers, ev.body);
    const resp = findResponseFor(ev.id, networkLog);
    if (resp) index.set(key, resp);
  }
}

client.on('Fetch.requestPaused', async (evt) => {
  const req = evt.request;
  const body = req.postData || null;
  const key = buildKey(req.url, req.method, req.headers, body);
  const resp = index.get(key);
  if (!resp) {
    // Strict mode: fail unmatched requests to surface drift; lenient mode would continueRequest
    await client.send('Fetch.failRequest', {
      requestId: evt.requestId,
      errorReason: 'BlockedByClient',
    });
    return;
  }
  // Fetch.fulfillRequest expects base64; pass binary bodies through untouched
  // instead of round-tripping them through UTF-8, which would corrupt them
  const bodyB64 =
    resp.body?.base64 ?? Buffer.from(resp.body?.text ?? '', 'utf-8').toString('base64');
  await client.send('Fetch.fulfillRequest', {
    requestId: evt.requestId,
    responseCode: resp.status,
    responseHeaders: headersToArray(resp.headers),
    body: bodyB64,
  });
});
```
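The snippets above reference buildKey and findResponseFor without defining them. Here is one possible sketch of buildKey and headersToArray (both hypothetical helper names): key on URL, method, a small header allowlist, and a body hash, so the key is deterministic across sessions.

```ts
import { createHash } from 'node:crypto';

// Headers that participate in the lookup key; everything else is ignored
// so that volatile headers (cookies, trace ids) do not break matching.
const KEYED_HEADERS = ['content-type', 'accept'];

function buildKey(
  url: string,
  method: string,
  headers: Record<string, string>,
  body: string | null,
): string {
  // Normalize header names so casing differences do not change the key
  const lower: Record<string, string> = {};
  for (const [k, v] of Object.entries(headers)) lower[k.toLowerCase()] = v;
  const headerPart = KEYED_HEADERS.map((h) => `${h}=${lower[h] ?? ''}`).join('&');
  const bodyHash = createHash('sha256').update(body ?? '').digest('hex').slice(0, 16);
  return `${method} ${url} ${headerPart} ${bodyHash}`;
}

function headersToArray(headers: Record<string, string>) {
  // CDP Fetch.fulfillRequest expects headers as [{ name, value }] pairs
  return Object.entries(headers).map(([name, value]) => ({ name, value }));
}
```

The header allowlist is the tunable part: too many headers and replay misses on harmless variance, too few and distinct requests collide.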
This approach neutralizes server responses, makes experiments stable, and makes the page build to the same DOM. For heavy SPAs, this is often enough without recording DOM diffs. For extra safety or to catch nondeterminism early, record DOM snapshots as well.
DOM snapshot and incremental diffs
CDP exposes DOMSnapshot.captureSnapshot across several buckets: DOM nodes, layout tree, and computed styles. This is excellent for coarse checkpoints.
- Take a full snapshot after initial navigation and after any major route change.
- Between checkpoints, use a MutationObserver installed via addInitScript to stream mutations into a log. Assign stable node ids to avoid reliance on brittle selectors.
Recorder injection:
```ts
await context.addInitScript({
  content: `
(function () {
  const idMap = new WeakMap();
  let nextId = 1;
  function getId(node) {
    if (!idMap.has(node)) idMap.set(node, nextId++);
    return idMap.get(node);
  }
  function serializeNode(node) {
    const id = getId(node);
    if (node.nodeType === Node.TEXT_NODE) return { id, text: node.data };
    if (node.nodeType === Node.ELEMENT_NODE) {
      const attrs = {};
      for (const a of node.attributes) attrs[a.name] = a.value;
      const tag = node.tagName.toLowerCase();
      const children = [];
      for (const c of node.childNodes) children.push(serializeNode(c));
      return { id, tag, attrs, children };
    }
    return { id, other: node.nodeType };
  }
  const observer = new MutationObserver((muts) => {
    const events = [];
    for (const m of muts) {
      if (m.type === 'attributes') {
        events.push({ t: 'attr', id: getId(m.target), name: m.attributeName, value: m.target.getAttribute(m.attributeName) });
      } else if (m.type === 'characterData') {
        events.push({ t: 'text', id: getId(m.target), value: m.target.data });
      } else if (m.type === 'childList') {
        const added = Array.from(m.addedNodes).map((n) => serializeNode(n));
        const removed = Array.from(m.removedNodes).map((n) => getId(n));
        events.push({ t: 'child', parent: getId(m.target), before: m.nextSibling ? getId(m.nextSibling) : null, added, removed });
      }
    }
    window.__replay__emit && window.__replay__emit({ kind: 'dom_mut', events, now: Date.now() });
  });
  observer.observe(document, { attributes: true, childList: true, characterData: true, subtree: true });

  // Initial emit of the document root
  window.__replay__emit && window.__replay__emit({ kind: 'dom_snapshot', root: serializeNode(document.documentElement), now: Date.now() });
})();
`,
});

// Bind an event sink to collect payloads from the page into Node
const domLog: any[] = [];
await page.exposeFunction('__replay__emit', (payload: unknown) => {
  domLog.push(payload);
});
```
Replayer applies an initial snapshot into a clean document and then applies incremental events in timestamp order as virtual time advances. If you are also stubbing network, you can run the natural app code and only use the DOM logs to validate and hash checkpoints rather than rebuild the DOM yourself. That reduces drift while catching differences early.
Capture agent inputs deterministically
Record pointers, keys, wheel, focus, and scroll with enough identifiers to reapply to the same nodes under replay. Avoid brittle selectors; instead record a stable strategy bundle:
- Node id if known from DOM logs, otherwise an anchored CSS selector and a text fingerprint of nearby text.
- Bounding rectangle at time of action to sanity-check target mapping.
Example input schema entry:
json{ "ts": 1234, "type": "click", "target": { "nodeId": 4242, "selector": "#buy-button", "text": "Buy now", "rect": { "x": 100, "y": 200, "w": 80, "h": 24 } }, "button": "left" }
For replay, resolve the target in the following order:
- Node id to current live node via a map built from the snapshot and mutation stream.
- If not found, query selector and pick a node whose text matches the fingerprint.
- As last resort, pick element at recorded coordinates within the viewport and assert text near it matches.
This approach tames DOM drift during replay. If the replayer cannot resolve a target, you flag a divergence early.
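The resolution order above can be sketched as a small pure function. CandidateNode and RecordedTarget are simplified illustrative shapes, not real DOM types; in practice the candidates come from the live page via the snapshot id map.

```ts
// Simplified stand-ins for a live DOM node and a recorded action target.
interface Rect { x: number; y: number; w: number; h: number; }
interface CandidateNode { nodeId: number; selector: string; text: string; rect: Rect; }
interface RecordedTarget { nodeId: number; selector: string; text: string; rect: Rect; }

function resolveTarget(target: RecordedTarget, live: CandidateNode[]): CandidateNode | null {
  // 1. Exact node id from the snapshot + mutation-stream map
  const byId = live.find((n) => n.nodeId === target.nodeId);
  if (byId) return byId;
  // 2. Selector match whose text matches the recorded fingerprint
  const bySelector = live.find((n) => n.selector === target.selector && n.text === target.text);
  if (bySelector) return bySelector;
  // 3. Last resort: element whose rect contains the recorded click point
  const cx = target.rect.x + target.rect.w / 2;
  const cy = target.rect.y + target.rect.h / 2;
  return (
    live.find(
      (n) =>
        cx >= n.rect.x && cx <= n.rect.x + n.rect.w &&
        cy >= n.rect.y && cy <= n.rect.y + n.rect.h,
    ) ?? null
  );
}
```

A null result is itself valuable: it marks the exact step where the replayed world diverged from the recorded one.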
Neutralize A/B tests and client randomness
Nondeterminism often hides inside experimentation and analytics frameworks. You need a defense-in-depth strategy:
Server side:
- Record and stub network so that server-side variants are locked to captured responses.
- Normalize request headers that flip variants: Cookie, X-Experiment, X-User-Bucket. In recorder, log them. In replay, force exact values.
- Freeze geo and IP by going through a stable proxy during recording, or better, stub network entirely.
Client side:
- Override Math.random with seeded PRNG as shown above.
- Block or neutralize common experiment libraries and tags via a URL-substring blocklist. Either stub their network responses or restrict loads via a CSP rule at context creation time.
- Optionally inject a small script that sets variant overrides in known frameworks, for example setting window.optimizely to a mock that returns a deterministic variant.
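One way to make the variant override itself deterministic is to derive bucket assignments from the run seed, so the same bundle always replays the same bucket. This is an illustrative sketch, not tied to any real experimentation framework's API:

```ts
import { createHmac } from 'node:crypto';

// Derive a stable variant per experiment from the run seed. Equal
// (seed, experiment) pairs always map to the same bucket.
function variantFor(runSeed: string, experiment: string, variants: string[]): string {
  const mac = createHmac('sha256', runSeed).update(experiment).digest();
  const bucket = mac.readUInt32BE(0) % variants.length;
  return variants[bucket];
}
```

Inject the chosen variant into the framework mock at page init, alongside the PRNG seed.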
Playwright can set a strict CSP by injecting a meta tag at document start, or you can set the context option bypassCSP: true and control loaded scripts via Fetch interception.
Privacy-safe scrubbing and PII retention strategy
A robust replay system cannot ship raw production data logs for analysis. You need a privacy budget that is strict, reproducible, and auditable.
Principles:
- Scrub at record time before data leaves the machine or container.
- Preserve referential integrity with deterministic transforms so that equal values map to equal tokens within a run, but tokens are different across runs or projects.
- Minimize data: store only what is required to replay and debug.
Practical scrubbing strategy:
- Network: remove Authorization, Cookie, Set-Cookie, X-Csrf-Token, and any header not on a policy allowlist. For request and response bodies with text or JSON content types, apply field-level redaction driven by a ruleset: keys like email, phone, name, address, ssn, and cc map to tokens.
- DOM: before storing snapshots and diffs, traverse text nodes and input values and tokenize likely PII. Use heuristics: Luhn checks for card-like numbers, regexes for emails and phones, and autocomplete attribute hints (cc-number, cc-csc, email, address-line1).
- Images and canvas: do not store them, or mask them with a placeholder. If visual replay is required, consider a blur mask around input elements, and run on-device OCR to detect digits and emails, then blur them.
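The Luhn and regex heuristics above can be sketched as follows. The patterns are deliberately simple and illustrative; a production ruleset needs tuning against real false-positive rates.

```ts
// Luhn check for card-like digit runs (12-19 digits after stripping separators)
function luhnValid(digits: string): boolean {
  if (!/^\d{12,19}$/.test(digits)) return false;
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}

const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/;

// Coarse PII detector for text nodes and input values
function looksLikePII(text: string): boolean {
  const digitsOnly = text.replace(/[\s-]/g, '');
  return EMAIL_RE.test(text) || luhnValid(digitsOnly);
}
```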
Deterministic tokenization example with HMAC and run-scoped salt:
```ts
import crypto from 'crypto';

const runSalt = crypto.randomBytes(32);

function tokenFor(value: unknown) {
  const mac = crypto
    .createHmac('sha256', runSalt)
    .update(String(value))
    .digest('hex')
    .slice(0, 16);
  return `tok_${mac}`;
}
```
Use this to replace DOM text and JSON values. Maintain a small dictionary inside the bundle mapping tok to original only if absolutely required for debugging; otherwise store no reverse map.
Document the policy and build automated diff tests to ensure no sensitive headers or fields escape.
Bundle format and storage
Favor a streaming-friendly structure. JSONL for event logs, binary blobs for response bodies, zstd compression.
Proposed layout:
- meta.json: environment info, browser version, seeds, viewport, UA, os, runner version.
- network.jsonl: sequence of request and response events with ids, keyed by a deterministic key.
- dom.jsonl: initial snapshot followed by mutation events with virtual timestamps.
- inputs.jsonl: agent actions with timestamps.
- checksums.json: SHA256 of each file, number of events, start and end vt.
- blobs dir: response bodies keyed by content hash.
Schema stability matters. Version each file and write a migration tool.
Replayer: making it actually deterministic
Launching the replayer follows the same steps as recorder with extra safeguards:
- Pin browser build and OS container.
- Set offline mode for the external network. In Chromium, you can call Network.emulateNetworkConditions with offline: true and still fulfill requests via Fetch.fulfillRequest.
- Install timer and PRNG patches with the same seed used during recording.
- Route all requests through the stub. If an unrecorded request appears, treat as fatal and stop the run.
Time advancement policy:
- Drive the virtual clock based on recorded input and network event timestamps. For instance, sort all inputs by ts and advance vt to each in order before applying it.
- If the page schedules rAF or timers, release time in small quanta to flush queues deterministically.
Replay loop sketch:
```ts
let cursorInputs = 0;
const inputs = loadInputs();

await page.goto('about:blank');
await page.setViewportSize({ width: 1280, height: 800 });

while (cursorInputs < inputs.length) {
  const ev = inputs[cursorInputs];
  await tick(ev.ts - vtNow); // advance virtual time to the event timestamp
  await applyInput(ev);
  cursorInputs++;
  // Allow microtasks and rAF callbacks to drain at this timestamp
  await tick(0);
}
// Final drain
await tick(50);
```
Input application must resolve targets deterministically using the snapshot id map and fallback strategy described earlier. Always assert a hash of the DOM subtree around the target equals the recorded one. If not, abort with a diff so engineers see the divergence.
Handling modern web features and edge cases
- Service workers: if the app registers a service worker, it can disrupt stubbed network. In recorder, capture service worker scripts and registration events. In replay, either disable service workers via args like --disable-features=ServiceWorker or preinstall a neutral SW that forwards requests to the Fetch stub. Easiest is to disable SW for replay runs.
- WebSockets and SSE: recorded frames must be replayed deterministically. For client initiated WS, block the connect and synthesize an in-page mock that delivers recorded frames at recorded times. Alternatively, fulfill the upgrade via Fetch and intercept frames at CDP Network.webSocketFrameReceived and inject to the page via a mocked socket wrapper. The first approach is simpler: replace window.WebSocket with a shim during replay that plays back frames.
- WebGL and canvas: rendering is not deterministic across hardware and drivers. Disable GPU and test without relying on pixel-perfect assertions. For visual diff, protect with large tolerances.
- Fonts and layout: small font differences cause layout drift. Install the same fonts in the container and force font fallback via CSS injection if possible. Capture viewport size and device scale factor and pin them.
- React hydration flicker: server renders markup and client hydrates. Hydration can reorder and replace nodes, which means DOM ids assigned by WeakMap are unstable if assigned before hydration completes. Strategy: delay id assignment until after DOMContentLoaded plus a short idle window, or use mutation logs to remap ids on the fly. Better: assign ids lazily only when an element becomes a target or changes.
- CSP: sites with strict CSP may block injected scripts. Use Playwright context option bypassCSP or CDP Page.setBypassCSP to allow instrumentation.
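The WebSocket shim mentioned above reduces to a small playback scheduler: recorded rx frames are released in timestamp order as virtual time advances. This is a hypothetical sketch of that scheduler; RecordedFrame mirrors the ws_rx entries from the network log.

```ts
interface RecordedFrame {
  ts: number; // virtual-time ms at which the frame arrived during recording
  payload: string;
}

class FramePlayback {
  private cursor = 0;
  private ordered: RecordedFrame[];
  constructor(frames: RecordedFrame[], private onMessage: (payload: string) => void) {
    // Sort defensively; recorded logs should already be in order
    this.ordered = [...frames].sort((a, b) => a.ts - b.ts);
  }
  // Called from the virtual-time tick; delivers every frame due by vtNow
  advanceTo(vtNow: number): void {
    while (this.cursor < this.ordered.length && this.ordered[this.cursor].ts <= vtNow) {
      this.onMessage(this.ordered[this.cursor].payload);
      this.cursor++;
    }
  }
}
```

The in-page WebSocket shim then just forwards onMessage calls to the socket's message listeners.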
DOM drift: causes and practical mitigation
DOM drift is any change between runs that alters node identity or position from an agent perspective. Typical culprits:
- Ad slots that load or collapse.
- Live chat or support widgets that insert portals.
- Experiment banners and interstitials.
- Infinite lists with dynamic keying.
- Locale and personalization content toggles.
Mitigation blueprint:
- Network stubbing so that the same HTML and data are fed to the app.
- Script blocklist for known drift sources by url pattern. Replace with no-ops or static mocks.
- Stable targeting in the agent: do not rely solely on CSS selectors. Use multi-signal matching: selector, semantic role, accessible name, nearby text, and geometry consistency checks.
- During recording, capture a compact text fingerprint of the subtree around the target. For example, 2 to 3 unique words and element roles in a window of siblings. Store it with each input.
- During replay, if the primary map misses, search by fingerprint and assert similarity above threshold.
This approach reflects how robust testing frameworks like Playwright recommend selecting elements by role and accessible names rather than brittle selectors.
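The similarity threshold above can be as simple as Jaccard overlap on lowercase word sets. An illustrative sketch, with the threshold value an assumption to tune per app:

```ts
// Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|
function fingerprintSimilarity(a: string, b: string): number {
  const words = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const wa = words(a);
  const wb = words(b);
  if (wa.size === 0 && wb.size === 0) return 1;
  let inter = 0;
  for (const w of wa) if (wb.has(w)) inter++;
  const union = wa.size + wb.size - inter;
  return union === 0 ? 0 : inter / union;
}

function fingerprintMatches(recorded: string, live: string, threshold = 0.6): boolean {
  return fingerprintSimilarity(recorded, live) >= threshold;
}
```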
Validation and confidence building
Determinism is never perfect on the first iteration. Build validation into your bundle and replays:
- Checkpoint hashes: compute a rolling hash of text content and tag names for key subtrees after each input. Store in dom.jsonl. On replay, recompute and compare.
- Network coverage: ensure every request made during replay maps to a recorded response. Fail on misses.
- Timer sanity: record counts of timers fired and rAF callbacks per logical step. On replay, compare counts.
- Metrics: log how many diffs occurred and if any target resolution needed fallback. Use these numbers to track drift regressions over time.
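A checkpoint hash can be a SHA-256 fold over tag names and text content in document order. This illustrative version operates on the serialized node shape from the DOM snapshot earlier; a real implementation would hash the live subtree in-page.

```ts
import { createHash } from 'node:crypto';

interface SerializedNode {
  tag?: string;
  text?: string;
  children?: SerializedNode[];
}

// Fold tag names and text content into one digest, in document order,
// so any structural or textual change flips the hash.
function checkpointHash(root: SerializedNode): string {
  const h = createHash('sha256');
  const walk = (n: SerializedNode) => {
    if (n.tag) h.update(`<${n.tag}>`);
    if (n.text) h.update(n.text);
    for (const c of n.children ?? []) walk(c);
  };
  walk(root);
  return h.digest('hex');
}
```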
CI and canary integration
The recorder should be available both in developer runs and in CI. Suggested workflow:
- CI runner wraps each agent test with recording enabled. On success, discard bundles by default. On failure, upload the bundle as an artifact with a retention period and post a link in the PR comment.
- Provide a command line tool replay-run that takes a bundle and runs an interactive or headless replay locally with a DOM viewer.
- Provide a GitHub Action that runs the replay on a pinned container, captures a screenshot or video, and attaches it to the build.
Sample GitHub Actions step:
```yaml
- name: Run agent tests with recording
  run: |
    pnpm test:agents --record --output ./artifacts
- name: Upload replay bundles
  uses: actions/upload-artifact@v4
  if: failure()
  with:
    name: replay-bundles
    path: artifacts/**/*.replay.tar.zst
- name: Reproduce failure
  if: failure()
  run: |
    npx my-replayer run artifacts/failing.replay.tar.zst --headless
```
Canary rollout:
- Enable recording for a small fraction of agent sessions in canary environment. Do not send bundles off the box until the run fails or a canary monitor triggers.
- On canary error, immediately package and upload the bundle to an isolated bucket with strict access controls.
- Trigger an automated rule to run deterministic replay and extract a minimal bug report: last successful step, offending network response, and a set of DOM diffs.
This elevates canary signal from red light to actionable evidence.
Performance and data volume
Recording everything can be heavy. Practical tips:
- Use allowlists so that response bodies are stored only for HTML, JSON, and scripts. Store other responses by content hash and fetch the body only if it is not cacheable.
- Compress aggressively with zstd at level 6 to 10. JSONL compresses well.
- For DOM mutations, checkpoint sparsely: initial snapshot plus post-navigation and post-major action points. Do not store every micro-mutation unless needed.
- Use binary blobs for large bodies, referenced by digest in JSONL to avoid duplication.
- Cap bundle size and provide a sampler mode that stops recording if cap exceeded and marks the run as partial.
Overhead targets in practice:
- Many app flows will generate bundles in the tens of megabytes, which is acceptable for failed runs kept for a week.
- CPU overhead is primarily in serialization and compression; offload to a worker thread when possible.
Related work and ecosystem
- Playwright's trace viewer captures network, console, and screenshots and provides a great baseline. You can piggyback on it, but it is not a deterministic network replay by default.
- Chrome DevTools Recorder and user flows capture interactions but do not enforce deterministic virtual time.
- rr time-travel debugger and Pernosco inspire the approach but operate at process level for native code.
- Webrecorder and WARC focus on archiving web sessions; similar network capture ideas apply.
- WebDriver BiDi standardizes events and commands, maturing toward some of CDP's capabilities.
A purpose-built replay system for agents pulls these threads together into a developer-friendly tool that is CI ready.
Minimal schema sketch
Keep it simple and versioned.
json{ "meta": { "schema": 1, "browser": { "name": "chromium", "version": "120.0.6099.109" }, "os": "linux-jammy", "viewport": { "w": 1280, "h": 800, "dpr": 1 }, "ua": "MyAgent 1.0; Chrome pinned", "locale": "en-US", "timezone": "UTC", "seed": "run-uuid-seed", "startedAt": 0, "endedAt": 75234 }, "network": "network.jsonl", "dom": "dom.jsonl", "inputs": "inputs.jsonl", "blobsDir": "blobs/", "checksums": "checksums.json" }
Event entries are JSON lines with fields documented in your repo. Include a simple merkle hash across files in checksums.json to quickly verify bundle integrity.
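The merkle-style integrity root can be as simple as hashing the per-file digests in canonical order. An illustrative sketch:

```ts
import { createHash } from 'node:crypto';

// Combine per-file SHA-256 digests into a single root in sorted-name order,
// so any changed file (or renamed file) changes the root.
function bundleRoot(fileHashes: Record<string, string>): string {
  const h = createHash('sha256');
  for (const name of Object.keys(fileHashes).sort()) {
    h.update(`${name}:${fileHashes[name]}\n`);
  }
  return h.digest('hex');
}
```

Verifiers recompute each file hash, then the root, and compare against checksums.json before replaying.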
Rollout plan in a real codebase
- Phase 1: Network-only record replay. Stub all network and ensure your agent can run against recorded responses. This alone kills a majority of flakes.
- Phase 2: Add time and PRNG determinization. Replay run time order should be stable across machines.
- Phase 3: Introduce DOM checkpoints and input target stabilization with fingerprints. Start validating DOM hashes to catch drift early.
- Phase 4: Privacy scrubber hardening with static analysis on allowed headers and keys. Add e2e tests that plant fake PII and ensure it is scrubbed.
- Phase 5: CI integration and artifact viewer with one-click local replay.
- Phase 6: Canary recorder with strict access gates and automatic replay jobs.
Measure success by flake rate drop, mean time to reproduction, and the ratio of divergence-free replays on first try.
Common pitfalls and how to avoid them
- Relying on requestId as a stable key across sessions. Use deterministic keys from url, method, and body hash.
- Forgetting to pin the browser build. Cross-version replay is a flake magnet.
- Allowing any live network during replay. Even a single analytics request can mutate state via cookies.
- Recording too much without scrubbing. Privacy debt will halt adoption.
- Overfitting to one framework. Keep instrumentation framework-agnostic; observe the browser, not the app.
Final thoughts
Deterministic replay for browser agents is not a moonshot. With a pinned browser, CDP level control, disciplined network stubbing, a virtual clock, and a light DOM log, you can make flaky failures reproducible and debugging humane. The payoff compounds: faster root cause, fewer he-said-she-said triages, and the confidence to ship canaries that you can actually diagnose when they go wrong.
Once your team gets a taste of one-click replay on a red CI job, they will not go back. The engineering investment pays for itself the first time a production-only failure is explainable from your desk without touching user data.
Build it incrementally, enforce it in CI, and treat privacy as a first-class concern. Your agents and your developers will thank you.
