Build a Flight Recorder for Code Debugging AI: Repro Traces, Env Snapshots, and One-Click Rollbacks
AI code debugging agents are finally good enough to fix broken tests, propose refactors, and generate non-trivial patches. But when an agent does something surprising in production, the worst possible answer is "sorry, we cannot reproduce it." Flight-recorder-style instrumentation is how we prevent that answer.
In aviation, flight recorders capture everything necessary to reconstruct events deterministically: control inputs, telemetry, environment snapshots. In software, we have equivalents: traces, prompts, diffs, artifacts, seeds, execution logs, and environment manifests. The goal is simple: given a flight recording from an AI debugging session, any developer should be able to replay the exact sequence of decisions, validate every tool call, inspect diffs and artifacts, and, if needed, roll back the agent to a safe state with one click.
This article shows how to build that flight recorder for AI code debugging. We will cover architecture, event schemas, prompt capture, diffs and artifacts, environment snapshots, reproducible seeds, privacy and compliance, CI integration, and a replayer tool that reconstructs the environment and re-executes the session. The result is deterministic repros, auditable histories, and safe rollbacks that fit into fast CI and trunk-based development.
Why AI debugging needs a flight recorder
Traditional unit tests and logs are not enough for AI-assisted debugging because:
- Non-determinism: model sampling and tool latencies can change outcomes from one run to the next.
- Hidden context: system prompts, retrieved context, and tool specs modify behavior but are often not logged.
- Ephemeral worlds: the workspace and environment seen by the agent constantly change during the run.
- Multi-hop decisions: a single failure may depend on a chain of prior model calls, edits, and test runs.
Without complete capture, teams cannot guarantee that a fix is correct or that a regression can be diagnosed after the fact. The solution is an always-on, low-overhead recorder that makes every run reproducible.
Design principles
- Deterministic reproduction: capture enough data to re-run the session bit-for-bit, including model seeds and exact prompts.
- Low friction: invisible in the happy path, trivial to enable deeper capture on anomalies.
- Portable repro pack: a single artifact that anyone can run on any machine or a container runner.
- Governance first: built-in PII redaction, encryption, access control, and retention policies.
- Observability native: standard traces, spans, and metrics via OpenTelemetry so you can plug into existing tools.
- CI friendly: artifacts attach to build summaries, PR comments, and release candidates; one-click rollback aligns with feature flags or config versioning.
Architecture at a glance
A minimal flight recorder architecture:
- Agent runtime: code-debugging agent orchestrating tools like repo search, patch apply, test runner, and build.
- Instrumentation layer: hooks that emit traces and artifacts for every key transition.
- Storage: write-once object store for artifacts and a trace backend for spans and events.
- Repro pack builder: a bundler that collects a manifest, traces, prompts, env snapshot, diffs, and tools output.
- Replayer: CLI to reconstruct the environment and replay the recorded run.
- Control plane: retention, access policies, and rollback configuration management.
ASCII diagram:
```
[Developer/CI] -> [Agent] -> [Instrumentation] -> [Trace backend + Object store]
                                  \-> [Repro pack builder] -> [Repro .tar.zst]
                                  \-> [Replayer] -> [Deterministic replay]
```
Event and data model: what to capture
Represent your session as a tree of traces and spans. OpenTelemetry is a pragmatic default because it integrates with the rest of your observability stack and supports semantic conventions. You can add a small layer for LLM specifics and code editing tools.
Recommended span hierarchy:
- session: the entire debugging session
- task: e.g., fix failing test test_foo or refactor module bar
- tool-call: repo_search, apply_patch, run_tests, format, build
- llm-request and llm-response: each model invocation
- external-http: calls to package registries, code hosts, model providers
- file-patch: edits and diffs
- artifact: outputs like compiled binaries, coverage reports, logs
Attach common attributes to every span (a minimal sketch follows this list):
- session_id, task_id, parent_span
- git: commit_sha, branch, repo_remote, dirty_flag, diff_base
- runtime: os, arch, container_digest, python_version, node_version
- llm: provider, model, model_revision, temperature, top_p, seed, max_tokens, stop, logprobs if available
- prompts: system, user, tool_spec, tool_json_schema, retrieval_snippets
- privacy: redaction_policy_version, redaction_hashes
- environment snapshot IDs
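Here is a minimal sketch of that hierarchy in code, assuming the same OpenTelemetry tracer used throughout this article; the span and attribute names are illustrative rather than a fixed convention:
```python
from opentelemetry import trace

tracer = trace.get_tracer('debug-agent')

def debug_session(session_id, base_commit):
    # session -> task -> tool-call nesting; child spans inherit the trace context
    with tracer.start_as_current_span('session') as session:
        session.set_attribute('session_id', session_id)
        session.set_attribute('git.commit_sha', base_commit)
        with tracer.start_as_current_span('task') as task:
            task.set_attribute('task.description', 'fix failing test test_foo')
            with tracer.start_as_current_span('tool-call') as tool:
                tool.set_attribute('tool.name', 'run_tests')
                tool.set_attribute('tool.args', 'pytest -x tests/test_foo.py')
                # ... invoke the tool here and record exit code, duration, artifact hashes
```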
Be explicit about seeds. Many providers now support a seed parameter that makes sampling deterministic for a given prompt. When available, set seed on every call and record it. If the provider does not support seeds, capture full responses, store provider request headers for audit, and keep a local deterministic fallback for offline replay.
Prompt capture without footguns
Store the exact text that the model saw, not just the template. Recommendations:
- Expand templates at runtime and log the expanded system, user, and tool messages.
- Store tool schemas and function signatures that were advertised to the model.
- Capture retrieval results and their ordering with content digests.
- Redact sensitive fields in two layers: an irreversible masked form for general logs and a reversibly encrypted form for restricted repros. Keep a keyed hash for join operations (see the sketch after this list).
- Canonicalize whitespace and line endings to avoid phantom diffs between systems.
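A minimal sketch of the two-layer redaction, assuming the cryptography package for the reversible layer; the key handling shown here is illustrative, and in practice both keys come from your secrets manager:
```python
import hashlib
import hmac

from cryptography.fernet import Fernet  # reversible layer; the key belongs in your vault

JOIN_KEY = b'per-tenant-join-key'        # keyed-hash secret, never stored next to the data
fernet = Fernet(Fernet.generate_key())   # in practice, load the Fernet key from the vault

def redact_for_logs(value: str) -> str:
    # Irreversible masked form for general logs, plus a keyed hash for join operations
    digest = hmac.new(JOIN_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]
    return f'<redacted:{digest}>'

def redact_for_restricted_repro(value: str) -> bytes:
    # Reversibly encrypted form for privileged repro packs
    return fernet.encrypt(value.encode())
```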
Example Python instrumentation for an LLM call:
```python
from contextlib import contextmanager

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer('debug-agent')

@contextmanager
def capture_llm_call(model, system_prompt, user_prompt, tools=None, seed=None, **kwargs):
    with tracer.start_as_current_span('llm.request', kind=SpanKind.INTERNAL) as span:
        span.set_attribute('llm.provider', 'openai')
        span.set_attribute('llm.model', model)
        if seed is not None:
            span.set_attribute('llm.seed', seed)
        span.set_attribute('llm.temperature', kwargs.get('temperature', 0))
        span.set_attribute('llm.top_p', kwargs.get('top_p', 1))
        span.set_attribute('llm.max_tokens', kwargs.get('max_tokens', 2048))
        # Store the expanded prompts, not the templates
        span.set_attribute('llm.prompt.system', system_prompt)
        span.set_attribute('llm.prompt.user', user_prompt)
        if tools:
            span.set_attribute('llm.tools.schema', str(tools))  # or serialize to a stored artifact
        yield span
```
Then actually call the model and attach the response in a paired span:
```python
def ask_llm(client, **params):
    with capture_llm_call(**params) as span:
        resp = client.responses.create(
            model=params['model'],
            input=params['user_prompt'],
            instructions=params['system_prompt'],
            temperature=params.get('temperature', 0),
            top_p=params.get('top_p', 1),
            max_output_tokens=params.get('max_tokens', 2048),
            tools=params.get('tools'),
            # pass the seed through here as well if your provider's endpoint accepts one;
            # it is already recorded on the llm.request span above
        )
        with tracer.start_as_current_span('llm.response') as rspan:
            rspan.set_attribute('llm.response.id', getattr(resp, 'id', 'unknown'))
            rspan.set_attribute('llm.response.text', getattr(resp, 'output_text', str(resp)))
            rspan.set_attribute('llm.token.usage', str(getattr(resp, 'usage', {})))
        return resp
```
Use whichever provider SDK you prefer. The key is to set seed when supported and to capture the actual rendered prompts, the tool schemas, and the full text response.
Capturing code edits and diffs
The agent will propose patches repeatedly. Capture them as both unified diffs and materialized files so replayers can apply them without external context.
- Record git baseline: commit SHA and a workspace dirty flag.
- For each edit, store a patch artifact with the diff and a content hash of both before and after states.
- Maintain a patch index with order, authoring span, and reasoning snippet that led to the change.
Example helper to emit a patch artifact and span:
```python
import hashlib
import subprocess
from pathlib import Path

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        h.update(f.read())
    return h.hexdigest()

def capture_patch(repo_root, paths, reason):
    # Create a unified diff of the working tree against the index
    cmd = ['git', '-C', repo_root, 'diff', '--', *paths]
    diff = subprocess.check_output(cmd).decode('utf-8')
    with tracer.start_as_current_span('file-patch') as span:
        before_hashes = {}
        after_hashes = {}
        for p in paths:
            fp = Path(repo_root) / p
            if fp.exists():
                after_hashes[p] = file_sha256(fp)
            try:
                # Hash the committed (pre-edit) content for the same path
                blob = subprocess.check_output(
                    ['git', '-C', repo_root, 'show', f'HEAD:{p}'],
                    stderr=subprocess.DEVNULL,
                )
                before_hashes[p] = hashlib.sha256(blob).hexdigest()
            except subprocess.CalledProcessError:
                pass  # new file, no committed baseline
        span.set_attribute('patch.reason', reason)
        span.set_attribute('patch.diff', diff)
        span.set_attribute('patch.before_hashes', str(before_hashes))
        span.set_attribute('patch.after_hashes', str(after_hashes))
        # Store the diff as a separate artifact file through your object store API
        # store_artifact('patches', diff)
    return diff
```
Artifacts: logs, coverage, build outputs
Artifacts are the tangible outputs that make a repro useful:
- Test logs and junit xml
- Coverage reports
- Lint and typecheck logs
- Build outputs and version manifests
- Console transcripts from tool invocations
Store with content addressable naming: artifacts keyed by sha256 and a metadata index that maps span ids to artifact hashes. This makes deduplication and retention easier.
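A sketch of content-addressed storage on a local filesystem; a real deployment would target an object store, and the index.jsonl metadata index shown here is just one way to map spans to hashes:
```python
import hashlib
import json
from pathlib import Path

ARTIFACT_ROOT = Path('repro/artifacts')

def store_artifact(span_id: str, kind: str, data: bytes) -> str:
    # Key the blob by its sha256 so identical artifacts deduplicate naturally
    digest = hashlib.sha256(data).hexdigest()
    blob_path = ARTIFACT_ROOT / digest[:2] / digest
    blob_path.parent.mkdir(parents=True, exist_ok=True)
    if not blob_path.exists():
        blob_path.write_bytes(data)
    # Metadata index mapping span ids to artifact hashes
    with open(ARTIFACT_ROOT / 'index.jsonl', 'a') as idx:
        idx.write(json.dumps({'span_id': span_id, 'kind': kind, 'sha256': digest}) + '\n')
    return digest
```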
Environment snapshot: the other half of determinism
Even perfect trace capture fails if the environment differs. Record a precise snapshot that includes:
- OS and kernel
- Container or VM image digest
- CPU and GPU details, CUDA and cuDNN versions, OpenCL
- Python version, exact package set with hashes
- Node version, package lock snapshots
- System tools versions: git, make, clang, gcc, rustc
- Environment variables with a safe allowlist and hash-only for secrets
- External endpoints and model provider versions
A simple portable script can gather most of this:
```bash
set -euo pipefail
mkdir -p repro/env
uname -a > repro/env/os.txt
python3 -V > repro/env/python.txt || true
node -v > repro/env/node.txt || true
pip freeze --disable-pip-version-check > repro/env/pip-freeze.txt || true
python3 -c 'import platform,sys;print(platform.platform());print(sys.version)' > repro/env/python-platform.txt || true
# GPU info
nvidia-smi -q > repro/env/nvidia-smi.txt 2>/dev/null || true
# Git state
{
  echo repo_remote: $(git config --get remote.origin.url || true)
  echo commit_sha: $(git rev-parse HEAD || true)
  echo branch: $(git rev-parse --abbrev-ref HEAD || true)
  echo dirty: $(git status --porcelain | wc -l)
} > repro/env/git.txt
# Toolchain
which gcc && gcc --version > repro/env/gcc.txt 2>/dev/null || true
which clang && clang --version > repro/env/clang.txt 2>/dev/null || true
which rustc && rustc --version > repro/env/rustc.txt 2>/dev/null || true
```
Prefer a locked environment when possible:
- Python: uv or pip-tools or conda-lock
- Node: package-lock.json or pnpm-lock.yaml
- System deps: Nix flake or container image digest
You can include a Nix flake or a Dockerfile with the exact base image digest in the repro pack. Or export a Conda environment yaml with build numbers.
Recording external calls safely
AI agents often call external APIs: model providers, code hosting, package registries. Capture these interactions for audit and replay:
- Log request method, URL, headers, and a response digest.
- Do not store raw secrets. Replace tokens with stable identifiers and store a reference to a secure vault path.
- Optionally record body payloads to enable offline replays with a VCR-style cassette (a cassette sketch follows the wrapper below).
Minimal requests wrapper:
```python
import hashlib

import requests

def http_call(method, url, **kwargs):
    with tracer.start_as_current_span('external-http') as span:
        span.set_attribute('http.method', method)
        span.set_attribute('http.url', url)
        redacted_headers = {
            k: ('<redacted>' if 'authorization' in k.lower() else v)
            for k, v in (kwargs.get('headers') or {}).items()
        }
        span.set_attribute('http.headers', str(redacted_headers))
        resp = requests.request(method, url, **kwargs)
        span.set_attribute('http.status_code', resp.status_code)
        span.set_attribute('http.response.sha256', hashlib.sha256(resp.content).hexdigest())
        return resp
```
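A sketch of a minimal cassette layer on top of the wrapper above, keyed by method, URL, and body hash; libraries such as vcrpy do this more thoroughly, so treat this as an illustration of the idea rather than a complete implementation:
```python
import hashlib
import json
from pathlib import Path

CASSETTE = Path('repro/cassettes/external-http.jsonl')

def cassette_key(method, url, body=b''):
    return hashlib.sha256(f'{method} {url} '.encode() + body).hexdigest()

def record_exchange(method, url, body, resp):
    CASSETTE.parent.mkdir(parents=True, exist_ok=True)
    with open(CASSETTE, 'a') as f:
        f.write(json.dumps({
            'key': cassette_key(method, url, body),
            'status': resp.status_code,
            'body': resp.text,  # consider redacting before writing
        }) + '\n')

def replay_exchange(method, url, body=b''):
    # Offline mode: return the recorded response, or None if there is no match
    key = cassette_key(method, url, body)
    if not CASSETTE.exists():
        return None
    with open(CASSETTE) as f:
        for line in f:
            entry = json.loads(line)
            if entry['key'] == key:
                return entry
    return None
```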
The repro pack: a portable, self-describing bundle
Package everything into a single tarball with a manifest. Suggested layout:
```
repro/
  manifest.yaml
  traces.jsonl
  prompts/
    0001-system.txt
    0001-user.txt
  patches/
    0001.diff
    0002.diff
  artifacts/
    test-logs/
    coverage/
  env/
    os.txt
    pip-freeze.txt
    git.txt
  cassettes/
    external-http.jsonl
  replayer/
    replay.py
    container.Dockerfile
```
Example manifest.yaml:
```yaml
version: 1
session_id: a017c2d6-91a3-4d9c-8d2f-40e2b8cfa0ee
created_at: 2025-11-24T13:42:17Z
agent:
  name: repo-debugger
  commit: 6d1e5f2
  config_version: 42
inputs:
  git:
    repo_remote: git@github.com:example/service.git
    base_commit: 9ac3f7f
    working_dir: ./
environment:
  container_digest: ghcr.io/org/base@sha256:abc123
  python: 3.11.8
  lockfiles:
    - requirements.lock
    - package-lock.json
llm:
  provider: openai
  model: gpt-4o-mini
  seed: 424242
  temperature: 0
  top_p: 1
artifacts:
  - path: artifacts/test-logs/junit.xml
  - path: patches/0001.diff
privacy:
  redaction_policy: default-v3
  encrypted_payloads: true
```
The replayer: deterministic replay with minimal friction
A replayer should:
- Reconstruct the environment using an explicit container image or a lockfile resolver.
- Restore the workspace at the recorded base commit and apply patches in order.
- Replay external HTTP calls using cassettes, falling back to live calls only if permitted.
- Re-execute tool invocations and run tests with the same flags.
- Optionally run the model calls in two modes: offline mode using recorded responses or online mode using seed and recorded prompts for a live validation.
Pseudocode for a Python replayer:
```python
import yaml  # pyyaml, used to parse the manifest
from pathlib import Path

class Replayer:
    def __init__(self, repro_root):
        self.root = Path(repro_root)
        with open(self.root / 'manifest.yaml', 'r') as f:
            self.manifest = yaml.safe_load(f)

    def restore_workspace(self, target_dir):
        # Clone the repo at the recorded base commit,
        # then apply patches/NNNN.diff in numerical order.
        pass

    def setup_env(self):
        # Build the recorded container image or create a venv from the lockfiles.
        pass

    def replay_http(self):
        # Start a local proxy that serves cassettes/external-http.jsonl.
        pass

    def run_tests(self):
        # Re-run the recorded test commands from traces.jsonl.
        pass

if __name__ == '__main__':
    r = Replayer('repro')
    r.setup_env()
    r.restore_workspace('/workspace')
    r.replay_http()
    r.run_tests()
```
Keep replayer usage simple. One command should be enough:
replayer run repro.tar.zst
One-click rollback: configuration, not panic
CI should treat agent behavior as versioned configuration. Rollback means pointing production to a previous known-good agent config, not editing code under pressure.
- Store agent parameters and prompts in a versioned registry: temperature, seed, tool schemas, safety policies, chain definitions.
- Tag each production deployment with the agent config version.
- Implement a revert button that updates an environment-specific pointer to a prior version and redeploys the config only.
- Gate rollbacks behind a quick smoke evaluation to prevent obvious regressions.
A GitHub Actions snippet that posts repro packs and enables quick rollback via config versions:
```yaml
name: ai-agent-ci
on:
  push:
    branches: [ main ]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install
        run: |
          pip install -r requirements.txt
      - name: Run agent integration tests
        run: |
          python -m agent_ci.run --record --output repro
      - name: Tar repro pack
        run: |
          tar -I 'zstd -19' -cf repro-${{ github.sha }}.tar.zst repro
      - name: Upload repro pack
        uses: actions/upload-artifact@v4
        with:
          name: repro-${{ github.sha }}
          path: repro-${{ github.sha }}.tar.zst
      - name: Post PR comment
        if: ${{ github.event_name == 'pull_request' }}
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          header: repro-pack
          message: |
            Repro pack uploaded: repro-${{ github.sha }}.tar.zst
            Use replayer run to reproduce locally.
```
For the rollback control plane, store a small JSON or yaml doc in a config repo:
```yaml
environment: production
agent_config_version: 41
llm:
  model: gpt-4o-mini
  seed: 424242
  temperature: 0
features:
  speculative_decoding: false
  context_window: 128k
```
Promote or roll back by changing agent_config_version and letting your deployment system sync it.
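A sketch of the one-click rollback itself, assuming the config doc above lives in a config repo at the path shown and a hypothetical run_smoke_eval gate; your deployment system then syncs the committed change:
```python
import subprocess

import yaml

CONFIG_PATH = 'deploy/production/agent.yaml'  # the doc shown above, in a config repo

def rollback(steps: int = 1):
    with open(CONFIG_PATH) as f:
        cfg = yaml.safe_load(f)
    target = cfg['agent_config_version'] - steps
    if not run_smoke_eval(target):  # hypothetical quick evaluation gate
        raise RuntimeError(f'smoke eval failed for config version {target}')
    cfg['agent_config_version'] = target
    with open(CONFIG_PATH, 'w') as f:
        yaml.safe_dump(cfg, f)
    # Commit and push only the config change; no code is edited under pressure
    subprocess.check_call(['git', 'commit', '-am', f'rollback agent config to {target}'])
    subprocess.check_call(['git', 'push'])
```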
Sampling strategies that do not torpedo performance
Always-on full capture can be expensive. Control cost with:
- Head-based sampling for traces but tail-based upsampling on errors and slow runs.
- Artifact sampling: keep full artifacts for failing runs and metadata-only for passes (see the sketch after this list).
- Deduplication by content hash. Do not store the same model response text twice if you can reference it.
- Compression: zstd works very well for text artifacts.
- Retention tiers: 7 days for full data, 90 days for minimal traces, 1 year for audit summaries.
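A sketch of the tail-based decision, made after a run completes; run is a hypothetical result object and the thresholds are placeholders you would tune:
```python
def capture_level(run) -> str:
    # Decide after the fact how much to keep: full capture for failures and
    # slow runs, metadata-only for routine passes. `run` is a hypothetical
    # object summarizing the finished session.
    if run.failed or run.error_count > 0:
        return 'full'
    if run.duration_seconds > 600:  # placeholder threshold for "slow"
        return 'full'
    if run.is_release_candidate:
        return 'full'
    return 'metadata-only'
```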
Track overhead explicitly: measure P50 and P95 trace sizes in KB and the CPU cost of serialization and hashing. With careful batching and non-blocking exports, flight recorder overhead can typically stay under 5 percent CPU.
Privacy, security, and compliance without drama
- Redaction policies: define allowlists for environment variables and denylists for known sensitive keys. Hash or encrypt secrets.
- Access boundaries: developers get masked views by default; compliance and SRE can request an encrypted repro with justification.
- Storage: encrypt artifacts at rest and in transit. Use per-tenant keys.
- WORM and legal hold: support write once storage for critical releases and attach chain-of-custody hashes to the manifest.
- Audit: store a small summary per session with timestamps, principal identity, and reason codes for high risk tool calls, like force pushing to a repo.
Handling provider differences and non-determinism
Even with a seed, different providers may not guarantee stable outputs forever when models are upgraded. To hedge:
- Record the provider declared model revision or snapshot ID when available.
- Save full text responses in the repro pack for offline replay.
- For live verification, compute a similarity score between recorded and fresh responses rather than requiring exact equality (see the sketch after this list).
- Pin a model snapshot when you need exact matching for a critical release window.
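A minimal similarity check using only the standard library; in practice you might prefer an embedding distance, but a ratio like this is often enough to flag drift:
```python
from difflib import SequenceMatcher

def response_drift(recorded: str, fresh: str, threshold: float = 0.9) -> bool:
    # Returns True when the fresh response has drifted too far from the recording
    similarity = SequenceMatcher(None, recorded, fresh).ratio()
    return similarity < threshold
```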
Also stabilize your tool layer:
- Capture whitespace normalization rules for patch application.
- Fix random seeds in your own tools: sort directory listings, set LC_ALL, and set numpy and torch seeds if you use them (a stabilization sketch follows this list).
- Avoid wall clock dependencies in prompts; record a fixed timestamp or provide an explicit now.
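A sketch of that stabilization step, run once at tool startup; the numpy and torch calls apply only if you use those libraries, and AGENT_FIXED_NOW is a hypothetical hook your prompt builder would read instead of the wall clock:
```python
import os
import random

def stabilize_runtime(seed: int = 0, fixed_now: str = '2025-01-01T00:00:00Z'):
    os.environ['LC_ALL'] = 'C.UTF-8'            # stable sort order in subprocesses
    os.environ['PYTHONHASHSEED'] = str(seed)    # affects Python subprocesses the agent spawns
    os.environ['AGENT_FIXED_NOW'] = fixed_now   # explicit "now" instead of wall clock
    random.seed(seed)
    try:
        import numpy
        numpy.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```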
Case study: finding a silent temperature bump
A team shipped a minor agent update and suddenly saw a spike in flaky patch proposals. Tests passed locally, but CI runs created alternate diffs. The flight recorder made the root cause obvious:
- The llm.request spans showed temperature 0.2 on some calls, despite default 0.
- Comparing traces revealed that a missing parameter in a new wrapper fell back to the provider SDK's default temperature of 0.2 on a retry path.
- Environment snapshots were identical; diffs were similar but not exactly the same, implicating sampling noise.
- Fix: set temperature 0 and seed on all calls. Add a unit test that inspects the llm.request span attributes.
- Rollback: toggle agent_config_version back one step while the fix baked.
Without the recorder, this would have been guesswork.
Integrating with OpenTelemetry
OpenTelemetry handles the heavy lifting for trace propagation and export. Add custom attributes for LLM spans and code edit spans.
- Use span names like llm.request, llm.response, tool.apply_patch, test.run.
- Set status and events. For model call retries, add a retry event with an incrementing attempt attribute.
- Export to an OTLP collector that routes to your preferred backend.
You can also adopt the emerging OpenTelemetry semantic conventions for LLMs. Until they fully stabilize, namespaced attributes like llm.model and llm.seed are practical.
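A minimal setup sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages and a collector reachable at the endpoint shown; it wires a batch processor to the collector and records a retry event the way the list above describes:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Route spans through a non-blocking batch processor to an OTLP collector
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='http://otel-collector:4317'))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer('debug-agent')

with tracer.start_as_current_span('llm.request') as span:
    span.set_attribute('llm.model', 'gpt-4o-mini')
    # On a model call retry, record an event with an incrementing attempt counter
    span.add_event('retry', {'attempt': 2})
```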
Evaluations and guardrails as part of the recorder
Flight recorders are a foundation for continuous evaluation:
- Golden tasks: curated repro packs that encode regressions you never want again.
- Model drift checks: periodically replay a subset online to detect behavioral shifts.
- Contract tests: snapshot prompts and tool schemas and fail CI if a change expands tool permissions unintentionally.
Add a simple declarative test:
```yaml
name: fix-flaky-serializer
seed: 1337
temperature: 0
input:
  failing_test: tests/test_serializer.py::test_roundtrip
expect:
  patch_contains: ensure_ascii=False
  tests_pass: true
```
Your CI runner turns this into a repro and keeps it forever.
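A sketch of how a CI runner might interpret that declarative test, assuming a hypothetical run_agent entry point that accepts the recorded parameters and returns the proposed patch and test outcome:
```python
import yaml

def run_golden_task(path: str):
    with open(path) as f:
        spec = yaml.safe_load(f)
    # run_agent is a hypothetical entry point into your debugging agent
    result = run_agent(
        failing_test=spec['input']['failing_test'],
        seed=spec['seed'],
        temperature=spec['temperature'],
        record=True,  # produce a repro pack as a side effect
    )
    expect = spec['expect']
    assert expect['patch_contains'] in result.patch
    assert result.tests_pass == expect['tests_pass']
```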
Common pitfalls and how to avoid them
- Logging the template not the rendered prompt: always capture expanded prompts.
- Forgetting seeds on retry: wrap the client and enforce seed.
- Redacting too aggressively: keep an encrypted channel for privileged repros.
- Enormous artifacts: dedupe by content hash, compress, and sample only on failures.
- Missing base image digests: never use latest; pin by sha256.
- Incomplete diffs when the workspace was dirty at start: store the full uncommitted patch as baseline.
Minimal reference implementation checklist
- Instrumentation
  - OTel tracer with session and span types
  - LLM wrapper that sets seed and records prompts and responses
  - Tool wrappers for patch, build, test, and external HTTP
- Artifact store
  - Content addressed, zstd compressed
  - Metadata index mapping spans to artifact hashes
- Environment snapshot
  - OS, image digest, language runtimes, lockfiles, git state
- Repro pack builder
  - manifest.yaml, traces.jsonl, env, patches, artifacts
- Replayer CLI
  - Containerized or virtualenv-based
  - Offline mode using cassettes and recorded LLM outputs
- CI glue
  - Upload artifacts
  - Comment on PRs
  - One-click rollback via config versions
Frequently asked design questions
- Should I record full model responses or rely on seeds only?
  - Record full responses for high value runs. Seeds help, but provider behavior can change.
- How big are repro packs?
  - A typical end-to-end session with 30 to 50 tool calls and 10 model calls compresses to 1 to 20 MB, depending on logs and coverage.
- How do I keep PII out?
  - Redaction policies, allowlists, and environment filters. Provide an encrypted path for privileged repros.
- Can I replay without credentials?
  - Yes, if you rely on cassettes. For code hosts and registries, either store cassettes or mount read-only creds in the replay container.
Closing thoughts
AI code debugging has crossed the threshold where nondeterministic behavior becomes operational risk. A flight recorder gives you deterministic reproduction, trustworthy audits, and fast, safe rollback, while creating a virtuous cycle of evaluations that continuously hardens your agent.
The pattern is not exotic: traces and spans for control flow, content-addressed artifacts for evidence, a portable environment manifest for determinism, and a replayer to tie it together. The result is boring reliability for a system that otherwise feels like magic. That is exactly where you want to be.
Build it once, wire it into your CI, and your future self will thank you the next time a production patch looks weird at 2 AM.
