From "What Is My Browser Agent?" to Clean Datasets: Labeling User-Agent, DOM, and Network Traces for Auto-Agent AI and Agentic Browsers
Agentic browsers and auto-agent AI systems are only as good as the datasets they learn from. If you’ve ever debugged a brittle UI test, you’ve already felt the issues: selectors that rot, event logs with no causal story, and a vague recollection of which user-agent variant was used when things broke. Building robust learning systems requires moving beyond raw clickstreams and screenshots to datasets where user-agent state, DOM mutations, and network I/O are all captured and linked with explicit causal edges.
This article proposes a practical, opinionated blueprint for building such datasets—focusing on the essentials: capturing UA/switcher variants, labeling DOM/network/tool steps with causal links, scrubbing PII safely, deduplicating sessions, and constructing stratified splits that generalize across sites and devices. The audience is technical and the prescriptions are concrete: schemas, instrumentation patterns, code snippets, and quality checks that reduce iteration time and improve model performance.
Why a Better Browser Dataset Now?
Agentic systems are graduating from toy tasks to business-critical flows. They must:
- Generalize across sites, devices, locales, and privacy defenses.
- Reason about causality: which action caused which network request and which DOM change.
- Handle non-determinism: AB tests, cookie walls, pop-ups, and service workers.
- Be safe: avoid exfiltrating PII and comply with platform policies.
The gap between "record a session" and "train a reliable policy" is mostly a data problem. Most public web datasets omit the very signals agents need: initiator stacks, stable element identity, mutation diffs, UA-CH, and tool-level semantics for actions.
Design Goals
- Causal labeling: Link tool actions → DOM events → network requests → DOM mutations → outcome.
- Explicit UA context: Capture classic UA string and User-Agent Client Hints (UA-CH), device emulation, viewport, locale, timezone, proxy/ASN.
- Selector provenance: How was the target element found? What features made it stable?
- PII safety: Aggressive redaction with semantics-preserving tokenization.
- Deduplication and stratified splits: Prevent leakage while maximizing OOD generalization.
- Reproducibility: Environment fingerprinting (browser version, flags, extensions) and determinism where possible.
This blueprint assumes Playwright or CDP-based instrumentation, but applies to WebDriver BiDi as well.
A Data Model That Teaches Causality
You need an event graph, not a flat log. The following schemas are intentionally opinionated; adapt to your stack but preserve their spirit.
Core Entities
- Session: A single run of a task on a site with a specific browser/device profile.
- Step: A high-level tool action (navigate, click, type, select, scroll, wait, etc.).
- DOM Snapshot: A captured representation of the DOM (HTML plus optional layout and screenshots).
- Mutation: A targeted diff between pre/post snapshots.
- ElementRef: A stable reference to a DOM element across time.
- NetworkRequest/Response: Full request/response metadata with initiator stack and body hashes.
- CausalLink: Edges connecting steps → requests → mutations.
Example JSON Schemas (abridged)
```json
{
  "Session": {
    "id": "uuid",
    "started_at": "iso-datetime",
    "site_domain": "string",
    "task": {
      "goal_text": "string",
      "constraints": ["string"],
      "success_criteria": ["string"]
    },
    "browser": {
      "name": "chrome|firefox|safari|webkit",
      "version": "string",
      "headful": true,
      "flags": ["string"],
      "extensions": ["string"]
    },
    "ua": {
      "user_agent": "string",
      "ua_ch": {
        "Sec-CH-UA": "string",
        "Sec-CH-UA-Platform": "string",
        "Sec-CH-UA-Model": "string",
        "Sec-CH-UA-Mobile": "?0|?1"
      }
    },
    "device_profile": {
      "name": "Desktop Chrome 120|iPhone 14 Pro|...",
      "viewport": {"width": 1280, "height": 800, "deviceScaleFactor": 2},
      "timezone": "America/Los_Angeles",
      "locale": "en-US",
      "geolocation": {"lat": 37.78, "lon": -122.41},
      "proxy": {"region": "us-west", "asn_hash": "sha256"}
    },
    "env": {
      "os": "macOS 14.5",
      "node": "v20",
      "playwright": "1.46",
      "cdp_protocol": "v133"
    },
    "labels": {
      "inferred_botsignal": {"captcha": false, "403": false, "fingerprint_block": false},
      "session_outcome": "success|failure|partial",
      "notes": "string"
    },
    "split": "train|val|test|ood"
  }
}
```
```json
{
  "Step": {
    "id": "uuid",
    "session_id": "uuid",
    "index": 7,
    "timestamp": "iso-datetime",
    "tool": "navigate|click|type|select|scroll|wait|keypress|upload|download",
    "args": {
      "url": "https://...",
      "text": "search term",
      "key": "Enter",
      "scroll": {"x": 0, "y": 600}
    },
    "target": {
      "element_id": "uuid",
      "selector_provenance": {
        "methods": ["role", "text", "css"],
        "features": {
          "role": "button",
          "name": "Add to cart",
          "css": "#add-to-cart",
          "xpath": "//button[@id='add-to-cart']",
          "stability_score": 0.92
        }
      }
    },
    "pre_snapshot_id": "uuid",
    "post_snapshot_id": "uuid",
    "network_request_ids": ["uuid", "uuid"],
    "causal_links": [
      {"type": "initiated", "to": "network_request_id"},
      {"type": "caused", "to": "mutation_id"}
    ],
    "status": "ok|retry|timeout|no-op",
    "rationale": "Clicking adds item to cart"
  }
}
```
```json
{
  "NetworkRequest": {
    "id": "uuid",
    "session_id": "uuid",
    "request": {
      "url": "https://example.com/api/cart",
      "method": "POST",
      "headers": {"content-type": "application/json"},
      "body_hash": "sha256:...",
      "body_redacted": true
    },
    "response": {
      "status": 200,
      "headers": {"content-type": "application/json"},
      "mime_type": "application/json",
      "size_bytes": 8432,
      "body_hash": "sha256:...",
      "body_redacted": true
    },
    "timing": {
      "start_time": 0.0,
      "dns": 10.4,
      "connect": 28.9,
      "tls": 35.1,
      "ttfb": 120.5,
      "download": 8.2
    },
    "protocol": "h2|h3|http/1.1",
    "initiator": {
      "type": "script|parser|preload|user",
      "stack": ["function addToCart (cart.js:87)", "onclick (product.html:142)"]
    }
  }
}
```
```json
{
  "Mutation": {
    "id": "uuid",
    "session_id": "uuid",
    "pre_snapshot_id": "uuid",
    "post_snapshot_id": "uuid",
    "type": "childList|attributes|characterData",
    "target_element_id": "uuid",
    "diff": {
      "added_nodes": ["<div class=\"toast\">Added to cart</div>"],
      "removed_nodes": [],
      "attributes_changed": {"aria-busy": ["true", "false"]}
    },
    "region_hash": "simhash:...",
    "visual_bbox": {"x": 882, "y": 46, "w": 320, "h": 64}
  }
}
```
This structure enforces causal paths and preserves enough detail to debug models and compute fine-grained rewards.
Instrumentation: Capturing the Right Signals
Instrument once, use for years. Use Playwright with CDP sessions for maximal coverage, or WebDriver BiDi if you need standardization.
Playwright + CDP Skeleton (Python)
```python
import asyncio, json, hashlib
from playwright.async_api import async_playwright


def sha256_bytes(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()


async def run(session_meta):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent=session_meta["ua"]["user_agent"],
            locale=session_meta["device_profile"]["locale"],
            timezone_id=session_meta["device_profile"]["timezone"],
            viewport=session_meta["device_profile"]["viewport"],
            device_scale_factor=session_meta["device_profile"]["viewport"]["deviceScaleFactor"],
        )
        page = await context.new_page()

        # Attach a CDP session for low-level events
        client = await context.new_cdp_session(page)
        await client.send("Network.enable")
        await client.send("Page.enable")
        await client.send("Runtime.enable")
        await client.send("DOM.enable")
        await client.send("Performance.enable")

        requests = {}
        events = []

        # Network events with initiator linkages
        client.on("Network.requestWillBeSent", lambda e: requests.setdefault(
            e["requestId"], {"request": e, "response": None}
        ))

        def on_response(e):
            rid = e["requestId"]
            if rid not in requests:
                requests[rid] = {"request": None, "response": e}
            else:
                requests[rid]["response"] = e

        client.on("Network.responseReceived", on_response)

        # Inject a MutationObserver before any page script runs
        await page.add_init_script("""
            (function(){
              window.__mutations = [];
              const obs = new MutationObserver((list) => {
                for (const m of list) {
                  window.__mutations.push({
                    type: m.type,
                    target: m.target && m.target.outerHTML ? m.target.outerHTML.slice(0, 1024) : null,
                    added: Array.from(m.addedNodes).map(n => n.outerHTML || n.textContent).slice(0, 5),
                    removed: Array.from(m.removedNodes).map(n => n.outerHTML || n.textContent).slice(0, 5),
                    attrName: m.attributeName || null,
                    ts: performance.now()
                  });
                }
              });
              obs.observe(document.documentElement,
                          {subtree: true, childList: true, attributes: true, characterData: true});
            })();
        """)

        async def snapshot_dom():
            # Use CDP DOMSnapshot at scale; fall back to full HTML
            try:
                snap = await client.send("DOMSnapshot.captureSnapshot", {"computedStyles": []})
                return {"type": "domsnapshot", "data": snap}
            except Exception:
                html = await page.content()
                return {"type": "html", "data": html[:5_000_000]}

        async def get_mutations():
            return await page.evaluate(
                "window.__mutations.splice(0, window.__mutations.length)"
            )

        # Example step: navigate → wait → capture
        async def step_navigate(url):
            pre = await snapshot_dom()
            await page.goto(url, wait_until="networkidle")
            post = await snapshot_dom()
            muts = await get_mutations()
            return pre, post, muts

        pre, post, muts = await step_navigate(session_meta["task"]["start_url"])  # example

        # Collate network events with body hashes
        for rid, obj in requests.items():
            # Note: to capture bodies reliably, use Network.getResponseBody
            try:
                body = await client.send("Network.getResponseBody", {"requestId": rid})
                bh = sha256_bytes(body.get("body", "").encode("utf-8", errors="ignore"))
            except Exception:
                bh = None
            events.append({"requestId": rid, "request": obj["request"],
                           "response": obj["response"], "body_hash": bh})

        await browser.close()
        return {"snap_pre": pre, "snap_post": post, "mutations": muts, "network": events}

# Usage:
# asyncio.run(run(session_meta))
```
Notes:
- Use DOMSnapshot (Chrome) for scalable snapshots. For Firefox/WebKit, fallback to HTML.
- Preserve initiator stacks from CDP (Network.requestWillBeSent.initiator) to link actions → requests.
- For UA-CH capture, read `navigator.userAgentData` (where available) and record request headers seen server-side.
- Inject a MutationObserver early via `add_init_script` so mutations during early page load are not missed.
Selector Provenance and Stable References
Agents suffer when selectors are brittle. Record multiple selector strategies and measure stability:
- Role/name-based: ARIA role and accessible name.
- Text features: innerText tokens, case-folded, stopword-trimmed.
- CSS selector: shortest unique CSS path.
- XPath: for redundancy.
- Geometry: bounding box and z-index.
- Visual cues: screenshot crop hash near the element, if allowed.
Assign a stability score using historical presence across snapshots. Promote selectors that survive layout shifts and A/B variants.
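A minimal sketch of that scoring, assuming you track, per selector strategy, whether it resolved to the same element in each historical snapshot (the `history` shape and selector strings are illustrative, not a fixed schema):

```python
# Sketch: score selector strategies by how often they resolved to the same
# element across historical snapshots. `history` maps a selector string to a
# list of booleans (resolved-and-matched per snapshot); names are illustrative.

def stability_score(history: dict[str, list[bool]]) -> dict[str, float]:
    return {
        selector: sum(hits) / len(hits) if hits else 0.0
        for selector, hits in history.items()
    }

def best_selector(history: dict[str, list[bool]]) -> str:
    scores = stability_score(history)
    return max(scores, key=scores.get)
```

Strategies whose score stays high across A/B variants and layout shifts get promoted into `selector_provenance.features.stability_score`.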
Network Peculiarities to Capture
- HTTP/2 vs HTTP/3, ALPN negotiated, and TLS versions.
- Server Push (legacy) and Preload/Preconnect hints.
- WebSockets/SSE: map messages with timestamps; hash message payloads with redaction.
- Service worker involvement: was the request intercepted? Record `fromServiceWorker` and the SW version hash.
- Cache state: whether the request hit memory/disk cache. For fairness, consider deterministic cache control (fresh profile per session).
User-Agent and Switcher Variants
If your dataset doesn’t vary device and UA, it will overfit to a single fingerprint and collapse on real traffic. Capture:
- Classic UA strings: Chrome, Firefox, Safari variants; iOS/Android device UAs; headless signatures.
- UA-CH headers: Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Mobile, Sec-CH-UA-Model (where applicable). Note: some servers require Grease brand handling.
- Device emulation: viewport, DPR, touch, hardware concurrency, platform, media features (prefers-color-scheme/dark mode), reduced motion.
- Timezone, locale, and keyboard layout variations.
- Network egress: residential vs datacenter proxy; ASN category; country/region. Record hashed ASN and country to avoid PII.
Maintain a matrix of profiles (site × UA × device × locale) and schedule coverage budgets so early training data is not dominated by a single profile. Log bot-detection outcomes (403, CAPTCHA, suspicious script) and tie them to UA/egress changes.
Example UA profiles:
- Desktop Chrome stable (e.g., Chrome/129, Windows 11).
- iPhone Safari (iOS 17) with mobile viewport.
- Android Chrome (Pixel).
- Firefox ESR.
- Headless Chromium with UA-CH disabled (for stress tests).
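One way to schedule that coverage budget is to round-robin the full site × profile matrix, so no single fingerprint dominates early collection. A sketch (site and profile names are illustrative):

```python
# Sketch: round-robin a (site x profile) coverage matrix so early collection
# isn't dominated by one fingerprint. Names are illustrative placeholders.
from itertools import cycle, product

def schedule(sites, profiles, budget):
    """Yield (site, profile) pairs, cycling the full matrix up to `budget` runs."""
    cells = cycle(product(sites, profiles))
    return [next(cells) for _ in range(budget)]
```

Weighted or failure-aware scheduling (e.g., boosting profiles that trigger bot detection) can replace the plain cycle once baseline coverage exists.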
Causal Links: From Action to Request to Mutation
Causality is the differentiator for agent training. Build an explicit causal DAG:
- Edge A: Step (user intent) → Network request(s).
  - Evidence: `initiator.type == "script"` with stack frames pointing to the onclick handler for element E, or `parser` for navigations.
  - Timing: the request starts within [0, 2500ms] after the step.
  - Throttling heuristic: nearest-in-time request from the same frame as the target element.
- Edge B: Network response(s) → DOM mutation(s).
  - Evidence: `PerformanceObserver` resource timing end time < mutation timestamp, same frame, same origin, or response used by the script causing the mutation.
  - Use the response body hash to compute a semantic signature; if the mutation inserts text found in the response payload (after redaction), raise confidence.
- Edge C: Step → Mutation (via A and B).
Keep a confidence score per edge (0–1). For training, use high-confidence edges; for analysis, show full DAG with edge weights. When ambiguous (e.g., background analytics requests), mark as non-causal.
Minimal Causal Heuristic (JS snippet)
```javascript
// In-page helper to map requests to mutations via resource timings
(function(){
  window.__causal = {links: []};

  const perfObs = new PerformanceObserver((list) => {
    list.getEntries().forEach(e => {
      if (e.initiatorType === 'xmlhttprequest' || e.initiatorType === 'fetch') {
        window.__causal.links.push({type: 'req', name: e.name, start: e.startTime, end: e.responseEnd});
      }
    });
  });
  perfObs.observe({entryTypes: ['resource']});

  const mutObs = new MutationObserver((ml) => {
    const t = performance.now();
    const reqs = window.__causal.links.filter(r => r.type === 'req' && r.end <= t);
    const lastReq = reqs[reqs.length - 1];
    if (lastReq) {
      window.__causal.links.push({type: 'mut', when: t, causedBy: lastReq.name});
    }
  });
  mutObs.observe(document, {subtree: true, childList: true, characterData: true, attributes: true});
})();
```
This is simplistic but surprisingly useful when combined with CDP initiator stacks.
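On the collection side, the edge-confidence score can combine initiator and timing evidence. A sketch — the weights and the 2500 ms window mirror the heuristics above, but all of them are tunable assumptions, not measured values:

```python
# Sketch: combine initiator and timing evidence into one causal-edge
# confidence in [0, 1]. Weights and the 2500 ms window are assumptions.

def edge_confidence(initiator_type: str, stack_mentions_target: bool,
                    delay_ms: float) -> float:
    score = 0.0
    if initiator_type in ("script", "parser"):
        score += 0.4                       # initiator kind consistent with a user step
    if stack_mentions_target:
        score += 0.4                       # stack frame names the handler/element
    if 0 <= delay_ms <= 2500:
        score += 0.2 * (1 - delay_ms / 2500)  # decays with delay after the step
    return min(score, 1.0)
```

Edges below a chosen threshold (e.g., background analytics requests) are kept in the DAG but marked non-causal for training.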
Labeling Protocol: What to Capture and Why
Human-in-the-loop labeling remains crucial. Label at the step level:
- Intent: natural language rationale for the step.
- Target: the intended UI element(s) and selection provenance (role, text, css, features).
- Expected outcome: the minimal DOM region change or network response indicating success.
- Observed outcome: mutation(s), network(s), and visual signal(s) actually observed.
- Confidence: did the step do what it was supposed to do?
Recommended guidelines:
- Prefer role/name selectors; record visible text verbatim and tokenized.
- When multiple elements match, annotate disambiguation cues (position, iconography, ARIA).
- For forms, label field semantics (email, phone, search) rather than only CSS selectors.
- For paginated tables, label the algorithm: "scroll until row where column 'Status' == 'Complete' appears; then click the row." Encode that as a declarative goal.
Provide annotators with a timeline view: steps, network waterfall, animation of DOM diffs. Without visualization, causal labeling is error-prone.
PII Scrubbing With Semantics Intact
You cannot release raw HTML and payloads for arbitrary sites.
Redaction strategy:
- Define high-risk fields: emails, phone numbers, names, addresses, IBAN/credit-card, SSNs, cookies, auth tokens, CSRF tokens, JWTs.
- Replace matches with type-preserving tokens, e.g., EMAIL:hash, PHONE:hash. Preserve length class (short/long), format class (E.164 vs local), domain TLD class for emails, and case characteristics.
- Hash with keyed HMAC, not raw SHA, to avoid dictionary reversal. Store keys offline.
- For HTML, parse and redact inside text nodes and attributes (value, href, src, data-*) and form inputs.
- For JSON payloads, parse and redact keys/values by schema plus regex fallback.
Example Python redactor (simplified):
```python
import re, hmac, hashlib

SECRET = b"dataset-redaction-key"

def token(kind, s):
    h = hmac.new(SECRET, s.encode('utf-8'), hashlib.sha256).hexdigest()[:16]
    return f"{kind}:{h}"

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?[0-9][0-9\-()\s]{6,}[0-9]")
CC_RE = re.compile(r"\b(?:\d[ -]*?){13,19}\b")
JWT_RE = re.compile(r"eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+")

REPLACERS = [
    (EMAIL_RE, lambda m: token("EMAIL", m.group(0).lower())),
    (PHONE_RE, lambda m: token("PHONE", re.sub(r"\D", "", m.group(0)))),
    (CC_RE, lambda m: token("CC", re.sub(r"\D", "", m.group(0))[-4:])),
    (JWT_RE, lambda m: token("JWT", m.group(0)[:16])),
]

def redact_text(text: str) -> str:
    out = text
    for rx, fn in REPLACERS:
        out = rx.sub(fn, out)
    return out
```
Where possible, replace values with a synthetic, format-preserving placeholder (e.g., keep same number of digits). Keep an allowlist of neutral terms (e.g., generic product names) to avoid over-redaction that harms learning.
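A format-preserving placeholder can be sketched like this: digits are replaced by a deterministic, HMAC-derived digit stream while separators and length are kept (the key name is illustrative; this is a sketch, not a vetted FPE scheme):

```python
# Sketch: format-preserving placeholder. Digits are swapped for digits drawn
# from a keyed-HMAC digest, so length, grouping, and separators survive while
# the value is destroyed. Deterministic: same input -> same placeholder.
import hashlib, hmac

KEY = b"dataset-redaction-key"  # illustrative; store real keys offline

def preserve_format(value: str) -> str:
    digest = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()
    stream = (str(int(c, 16) % 10) for c in digest * 4)  # deterministic digit stream
    return "".join(next(stream) if ch.isdigit() else ch for ch in value)
```

Because the mapping is deterministic per key, repeated occurrences of the same phone number stay linkable within a dataset release without exposing the value.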
For body payloads:
- Do not store raw credentials, even redacted; use synthetic accounts for login flows.
- For cookies/localStorage/IndexedDB, hash keys and values with type markers.
- Consider differential privacy for aggregated metrics, not for raw per-event logs.
Record body_redacted: true and body_hash so researchers can detect changes without seeing payloads.
Deduplication: Sessions and Steps
Redundancy is endemic in web tasks. Dedup to avoid overfitting and to maximize coverage.
Signals for dedup:
- DOM similarity: Use SimHash/MinHash on sanitized HTML or DOMSnapshot tokens.
- Visual similarity: Perceptual hash (pHash) of screenshots for key states.
- Action sequence similarity: tokenized tool steps (navigate→click(button[name='Buy'])→type('q="gpu"')...).
- Network sequence similarity: n-gram hashing over request URL patterns and status codes.
A simple approach:
- Build a 4-way signature per step: {dom_simhash, action_hash, network_minhash, visual_phash}.
- Mark near-duplicates when 3/4 exceed thresholds.
- Deduplicate at session level if ≥80% of steps are near-duplicates in order.
Example DOM SimHash sketch:
```python
from dataclasses import dataclass
import mmh3

@dataclass
class SimHasher:
    nbits: int = 128

    def simhash(self, tokens):
        v = [0] * self.nbits
        for t in tokens:
            h = mmh3.hash128(t)
            for i in range(self.nbits):
                bit = (h >> i) & 1
                v[i] += 1 if bit else -1
        out = 0
        for i, val in enumerate(v):
            if val > 0:
                out |= (1 << i)
        return out

# tokens = extract from HTML: tags, classes, roles, data-testids
```
Weight tokens such as ARIA roles and data-testids higher than style classes.
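Per signal, the near-duplicate threshold reduces to a Hamming-distance cutoff between signatures. A sketch (the 6-bit cutoff is an illustrative choice, tuned per corpus):

```python
# Sketch: near-duplicate test between two SimHash values via Hamming distance.
# The 6-bit cutoff is an illustrative starting point, not a fixed constant.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicate(sig_a: int, sig_b: int, max_bits: int = 6) -> bool:
    return hamming(sig_a, sig_b) <= max_bits
```

Running this per-signal and then applying the 3-of-4 rule above gives the step-level duplicate flag.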
Stratified Splits That Actually Generalize
Random splits leak patterns. Use hierarchical stratification:
- Group by site domain and template cluster (e.g., product page vs search results). Keep entire site or template clusters out of test sets for OOD evaluation.
- Stratify by device/UA: ensure each split has coverage across desktop/mobile, Chrome/Safari/Firefox.
- Time-based splits: keep latest sessions as test to measure drift robustness.
- Bot-defense strata: include samples with cookie walls, consent modals, simple CAPTCHAs; hold out some to test resilience.
Recommended splits:
- IID: random within site+template buckets (for sanity checks).
- Site-OOD: domains unseen in training.
- Device-OOD: device/UA profiles unseen in training (e.g., Safari-only test).
- Time-OOD: last N% by timestamp.
Store split attribution in the Session object and enforce with static lists to prevent accidental leakage in re-shuffles.
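Deterministic assignment makes that enforcement easy: hold OOD domains in a static list, and bucket everything else by a stable hash of the domain so re-runs never reshuffle. A sketch (domain names and the 80/10/10 ratio are illustrative):

```python
# Sketch: deterministic split assignment. OOD domains come from a static list;
# other domains are bucketed by a stable hash so re-shuffles can't cause
# leakage. Domain names and the 80/10/10 split are illustrative.
import hashlib

OOD_DOMAINS = {"holdout-shop.example"}

def assign_split(domain: str) -> str:
    if domain in OOD_DOMAINS:
        return "ood"
    bucket = int(hashlib.sha256(domain.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    return "val" if bucket < 90 else "test"
```

Note this buckets whole domains, not sessions, so all sessions from one site land in the same split; template-cluster stratification layers on top of it.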
Quality Control and Validation
Automate checks:
- Schema validation: JSON Schema for each entity; reject missing referential integrity.
- Causal coverage: ≥95% of Steps should link to at least one NetworkRequest or Mutation (depending on type).
- Selector stability: Targets present in both pre/post snapshots when expected; flag if element identity changes unexpectedly.
- Redaction coverage: No raw emails/phones/JWTs remain; fail build on detection.
- Bot-signal classification: Label sessions with CAPTCHA/403; store reasons (response status, DOM keywords, hCaptcha iframe presence).
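The redaction-coverage gate can be as simple as re-running the PII regexes over serialized records and failing the build on any hit. A sketch (patterns mirror the redactor above; the record shape is an assumption):

```python
# Sketch: redaction-coverage gate. Re-scan serialized records with the same
# PII patterns the redactor uses and report any raw survivors; a CI step
# fails the build when the list is non-empty.
import re

PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "jwt": re.compile(r"eyJ[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+"),
}

def redaction_violations(text: str) -> list[str]:
    return [kind for kind, rx in PII_PATTERNS.items() if rx.search(text)]
```

Redacted tokens like `EMAIL:a1b2c3` pass the scan, while any surviving raw value is reported by kind.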
Add statistical monitors:
- Distribution of steps per session, requests per step, mutations per request.
- Time-to-first-meaningful-mutation after actionable steps.
- Rate of retries/timeouts by device/UA; regressions often show here first.
Storage and Reproducibility
- Columnar formats: Parquet/Arrow for events; store big blobs (screenshots, DOM HTML) separately with content-addressed storage (CAS) using hashes.
- Compression: Zstd for JSONL; WebP/AVIF for images; MP4/H.264 for short videos if you store screencasts.
- Versioning: Schema version field and migration scripts; semver for dataset releases.
- Environment capture: browser build, flags, feature-detection results (e.g., `navigator.hardwareConcurrency`, `platform`, `language`).
- Determinism: Disable autoplay/animations where legal; set `prefers-reduced-motion` for consistency in some splits; freeze time via CDP emulation for a subset of runs.
Training and Evaluation Pipeline
From Dataset to Policy
- Supervised fine-tuning (SFT): Turn Steps into instruction-output pairs.
- Input: goal + current DOM snapshot features + recent network context + candidate target elements with selector provenance.
- Output: tool call (action + args + element_ref) and expected outcome signature.
- DAgger-style data aggregation: Roll out current policy, collect corrections, relabel.
- RL fine-tuning: Reward by causal outcomes (mutation achieves goal, or network response matches expected schema). Penalize extraneous actions and regressions.
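The SFT conversion can be sketched minimally — field names follow the Step schema above, while the prompt template and serialization choices are assumptions:

```python
# Sketch: turn a labeled Step into an SFT (prompt, completion) pair. Field
# names follow the Step schema; the prompt template is an assumption.
import json

def step_to_sft(goal: str, dom_summary: str, step: dict) -> tuple[str, str]:
    prompt = (
        f"Goal: {goal}\n"
        f"DOM: {dom_summary}\n"
        f"Respond with a tool call as JSON."
    )
    completion = json.dumps({
        "tool": step["tool"],
        "args": step["args"],
        "target": step.get("target", {}).get("element_id"),
    })
    return prompt, completion
```

In practice the DOM summary would carry the candidate elements with their selector provenance, and the completion would include the expected outcome signature for reward checking.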
Feature representations:
- DOM as a tree: tag/role tokens, attributes, text embeddings, geometry.
- Network context: recent requests/responses encoded as URL pattern tokens and status classes.
- UA context: categorical embedding (device family, browser family) and numeric features (viewport dims, DPR).
Evaluation Metrics
- Success rate per task and per stratum (site/device/time).
- Steps-to-success and action efficiency score.
- Correction rate (human or scripted) during rollouts.
- Extraneous actions per success.
- Robustness: success under layout shifts, cookie walls, and pop-ups.
- OOD gap: delta between IID and site/device/time OOD splits.
Build an open evaluation harness that replays tasks against live or recorded sites. Prefer deterministic replays (service worker capture) for consistency; run live sanity checks to detect site drift.
Hard Problems You Should Plan For
- Cross-origin iframes: Hard to snapshot and to inject observers. Record frame tree and per-frame events. Use out-of-process iframe restrictions to guide expectations.
- Content Security Policy (CSP): Curb injection. Use extension-based instrumentation or devtools protocol instead of in-page scripts when blocked.
- AB tests and personalization: Randomized DOM. Record experiment cookies/IDs; use synthetic, de-personalized accounts.
- Cookie consent and privacy banners: Treat as first-class obstacles with labeled solutions.
- Service workers and caches: Requests may not hit the network; link to SW version and URL scope.
- Anti-bot defenses: Invisible challenges, fingerprint checks. Rotate device/UA properly; log rejections and do not attempt to bypass strong defenses.
- File uploads/downloads: Redact file names and content; use synthetic files; hash binary payloads and store offline.
- Internationalization: Non-Latin scripts, RTL, mixed locales. Include language tags and font availability as features.
Opinions: What Matters and What Doesn’t
- Don’t obsess over pixel-perfect screenshots. Visuals help debugging but are secondary to a faithful DOM/network causal graph.
- Do invest in selector provenance and element identity. Without this, both training and evaluation degrade.
- Don’t release raw payloads. Hash + redact + schema-extract; it’s enough for learning.
- Do vary UA/device aggressively. Real users aren’t a single Chrome-on-macOS fingerprint.
- Don’t hide bot-detection outcomes. Label them; agents must learn to navigate or gracefully fail.
- Do publish a dataset card with collection policies, consent status, and ToS compliance.
Compliance and Ethics
- Honor robots.txt and site ToS. Prefer opt-in sites or your own synthetic environments for public releases.
- GDPR/CCPA: Avoid personal data; document redaction; provide contact for takedown.
- Rate limits and backoffs: Be a good citizen; throttle collection; respect CAPTCHAs—do not circumvent.
- Security: Never collect real credentials; use scoped test accounts and rotating keys; store secrets outside dataset.
A Minimal Working Blueprint (Checklist)
- Capture stack:
- Playwright/WebDriver with CDP or BiDi events enabled.
- UA string + UA-CH + device profile + locale/timezone.
- DOMSnapshot and MutationObserver.
- Network requests with initiator stacks and response bodies hashed.
- Labeling:
- Step-level intents, targets with selector provenance, expected outcomes.
- Causal DAG edges with confidence scores.
- Safety:
- HMAC-based redaction of PII in HTML and payloads.
- Cookie/auth token hashing; no raw secrets.
- Cleaning:
- Near-duplicate detection via DOM/visual/action/network signatures.
- Bot-signal classification and tagging.
- Splits:
- Site-, device-, and time-stratified partitions; OOD sets.
- Packaging:
- Parquet/Arrow for events; CAS for blobs; schema versioning.
- Dataset card with policies and limitations.
Closing Thoughts
Agentic browsers need more than click logs and screenshots. They need causality: the ability to map an intent to an element, an element to a request, a request to a mutation, and a mutation to success. Once you capture that graph—under varied user-agent/device conditions and with PII handled responsibly—you can train policies that generalize across the chaotic real web.
The blueprint above is intentionally prescriptive: build these schemas, wire these instruments, enforce these checks, and your models will improve. Skip them, and you’ll chase flaky selectors and inexplicable failures for months. Make the causal graph your first-class citizen, and the rest of the pipeline will fall into place.
