RAG for Stack Traces: Building a Debugging AI That Reproduces Bugs, Not Just Explains Them
Most LLM-based debugging assistants excel at explaining failures in plain English. That’s useful, but insufficient. Teams don’t merge explanations; they merge fixes. To ship fixes confidently, you need reproducibility: the same failure, under a controlled environment, replaying the same causal chain, producing the same outcome, as often as you need.
This article outlines a practical, engineering-first architecture for a debugging AI that reproduces bugs, not just narrates them. The core ideas:
- Capture execution traces on demand and on failure.
- Build causal slices that isolate only the events that mattered.
- Use Retrieval-Augmented Generation (RAG) over traces ("trace‑RAG") to query similar failures, known root causes, and proven fixes.
- Apply delta debugging to minimize the failing input and environment so reproduction is small, deterministic, and reviewable.
- Auto‑triage flaky tests by classifying nondeterminism and constructing reproducible schedules or seeds.
The result: a system that can reliably turn a failing CI job into a minimal reproduction capsule, propose a fix grounded in prior evidence, and give reviewers a safe, deterministic test that guards against regressions.
Why Reproduction, Not Just Explanation
Explanations are hypotheses. Reproductions are experiments. If your debugging workflow stops at an LLM paragraph, you’re still guessing. In contrast, a reproducible failure has:
- A precise failure predicate (e.g., assertion A fails with value V).
- A deterministic environment (inputs, seed, system clock policy, network conditions).
- A verified causal chain (statements and conditions necessary for that predicate).
- A minimal reproducer (small input, minimal dependency surface, short run time).
Reproduction turns debugging into a controlled scientific process. It also improves developer trust in AI suggestions: when the assistant generates a patch, it must prove the fix by rerunning the reproduction and demonstrating the failure disappears while adjacent invariants remain intact.
Architectural Overview
A debugging AI that reproduces bugs can be broken into the following components:
- Failure detector and trigger
- When a test/job fails, capture failure metadata and rerun under instrumentation.
- Trace capture
- Collect stack traces, function call events, and key data dependencies with low overhead.
- Causal slicing
- From the failure predicate, compute a dynamic backward slice that retains only events relevant to the failure.
- Trace‑RAG index
- Chunk and embed trace slices, stack fingerprints, error messages, and diffs. Store in a hybrid (vector + keyword) index.
- Delta debugging engine
- Minimize inputs, environment variables, dependency versions, and schedules until the smallest still-failing case remains.
- Reproduction capsule builder
- Pack the minimized scenario into a hermetic artifact (container/Nix derivation/rr trace) used by humans and CI.
- Auto‑triage for flakiness
- Characterize nondeterministic failures, cluster by signature, and derive deterministic replays.
- Governance and review
- Generate a proposed fix and the test that proves it; require review; gate with policy and property checks.
Each area has prior art we can leverage: delta debugging (Zeller, 1999–2002), dynamic slicing (Weiser 1981; Korel & Laski 1988), record‑and‑replay (rr, TTD), and vector search for RAG. The novelty is integrating them in CI around LLMs to close the loop from failure to reproducible fix.
1) Capturing Execution Traces Without Melting Your CI
You need traces, but you don’t want 50% overhead on every green run. The pragmatic approach:
- Default: Run tests normally, collect lightweight metadata (build SHA, test name, seed, timestamps, container image digest, OS/kernel).
- On failure: Automatically rerun failing test(s) under an instrumented mode that captures traces and environment deltas.
- On flakiness: Trigger perturbation runs (shuffled order, time dilation, network chaos) with tracing to pin down nondeterminism.
Recommended options by platform:
- Python: `sys.setprofile`, `faulthandler`, `trace`, pytest hooks, `coverage`, optional `sys.settrace` for granularity. For C extensions, consider eBPF/uprobes.
- JVM: Java Flight Recorder (JFR), Async Profiler, ByteBuddy/ASM instrumentation, structured logs.
- Node.js: `--trace-events-enabled`, V8 inspector, `async_hooks`.
- Go: pprof, execution tracer (`go test -trace`), hooks in the testing framework.
- Native (C/C++/Rust): rr (record and replay), Intel PT (via `perf`), PIN/Valgrind/DynamoRIO for heavier runs; eBPF for syscall-level data.
Use a ring buffer to keep overhead low. On failure trigger, flush the last N seconds of rich context plus a focused rerun with full tracing. Example trace schema (JSON Lines) that balances detail and cost:
```json
{"ts": 1733304871.112, "thread": 17, "span": "T42", "event": "call", "file": "orders.py", "func": "apply_discount", "line": 77, "args": {"user_id": 120, "price": 19.99}}
{"ts": 1733304871.113, "thread": 17, "span": "T42", "event": "read", "var": "loyalty_points", "value": 200}
{"ts": 1733304871.115, "thread": 17, "span": "T42", "event": "branch", "cond": "points > threshold", "value": true}
{"ts": 1733304871.119, "thread": 17, "span": "T42", "event": "exception", "type": "ZeroDivisionError", "msg": "division by zero", "file": "orders.py", "line": 89}
```
You don’t need every local variable every time. Focus on:
- Function entry/exit, exceptions, branches taken.
- Reads/writes for key variables (via static config, taint heuristics, or automatic backprop from the failure variable).
- External I/O (filesystem paths, network endpoints, system time, randomness sources).
Security note: traces may contain secrets or PII. Redact based on allowlists/denylists; encrypt at rest; enforce retention policies.
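As an illustration of both points, the tracer can be driven by a small probe configuration that allowlists what to record and scrubs what must never leave the host. The structure below is a hypothetical sketch; the field names and patterns are not from any particular library:

```python
import hashlib
import re

# Hypothetical probe configuration: what to record, what to scrub.
PROBE_CONFIG = {
    "record_events": {"call", "return", "exception", "branch"},
    # Only track reads/writes for variables we actually care about.
    "tracked_vars": {"price", "loyalty_points", "threshold"},
    # External boundaries worth capturing for later reproduction.
    "capture_io": {"filesystem", "network", "time", "random"},
    # Denylist patterns applied before the trace is persisted.
    "redact_patterns": [re.compile(p, re.I) for p in (r"token", r"password", r"secret", r"email")],
}

def should_record(event_type, var_name=None):
    """Decide whether a trace event is worth keeping at all."""
    if event_type not in PROBE_CONFIG["record_events"]:
        return False
    return var_name is None or var_name in PROBE_CONFIG["tracked_vars"]

def redact(key, value):
    """Replace sensitive values with a stable hash so correlations survive redaction."""
    if any(p.search(key) for p in PROBE_CONFIG["redact_patterns"]):
        return "sha256:" + hashlib.sha256(str(value).encode()).hexdigest()[:16]
    return value
```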
A minimal Python hook for failure reruns
```python
# conftest.py (pytest)
import contextlib
import json
import os
import sys
import time
import traceback

import pytest

TRACE_PATH = os.getenv("TRACE_PATH", ".artifacts/traces")
os.makedirs(TRACE_PATH, exist_ok=True)

@contextlib.contextmanager
def trace_context(test_id):
    events = []

    def tracer(frame, event, arg):
        try:
            if event in ("call", "return", "exception"):
                code = frame.f_code
                rec = {
                    "ts": time.time(),
                    "event": event,
                    "file": code.co_filename,
                    "func": code.co_name,
                    "line": frame.f_lineno,
                }
                if event == "exception":
                    exc, val, tb = arg
                    rec["exc_type"] = getattr(exc, "__name__", str(exc))
                    rec["exc_msg"] = str(val)
                    rec["tb"] = traceback.format_exception(exc, val, tb)
                events.append(rec)
        except Exception:
            pass
        return tracer

    try:
        # settrace (not setprofile) so the tracer also receives "exception" events
        sys.settrace(tracer)
        yield
    finally:
        sys.settrace(None)
        with open(f"{TRACE_PATH}/{test_id}.jsonl", "w") as f:
            for e in events:
                f.write(json.dumps(e) + "\n")

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    # Only enable tracing on reruns or when the env flag is set;
    # the hookwrapper runs pytest's own test call inside our trace context.
    if os.getenv("TRACE_ON_FAIL"):
        test_id = item.nodeid.replace("::", "_").replace("/", "_")
        with trace_context(test_id):
            yield
    else:
        yield

def pytest_runtest_makereport(item, call):
    if call.when == "call" and call.excinfo is not None:
        # mark for rerun with tracing
        os.environ["TRACE_ON_FAIL"] = "1"
```
In CI, detect the failure, rerun the failed test with `TRACE_ON_FAIL=1`, then archive the resulting JSONL trace in your artifact store.
2) From Trace to Causal Slice
A raw trace is noisy. What you want is the minimal set of events that were necessary for the failure. This is the essence of dynamic slicing.
Background:
- Program slicing (Weiser, 1981) introduced the idea of computing subsets of program statements relevant to a slicing criterion (e.g., variable v at line l).
- Dynamic slicing (Korel & Laski, 1988) uses a concrete execution trace to compute a slice for that specific run.
- Zeller’s “Cause-Effect Chains” and Delta Debugging (1999–2002) add a practical orientation: isolate differences that cause failure.
We combine them: start with the failure predicate and walk backward through the trace along data and control dependencies.
High-level algorithm (conceptual):
- Define the failure predicate F (e.g., assertion XY failed at line L with observed value V).
- Identify the variable(s) and condition(s) that directly influence F.
- Backward-slice: for each influencing value, find its last definition in the trace; add that event to the slice.
- Recursively include control dependencies (branches whose outcome affected whether the failing code executed) and data dependencies.
- Stop when reaching inputs or external I/O boundaries; record those as root dependencies.
This yields a “causal slice”: a much smaller sub-trace with only the events that could have made the failure happen. Often, slices reduce megabytes of trace to a few dozen lines, making them ideal for human review and LLM grounding.
Pseudocode for a simple dynamic backward slice over a JSONL trace:
```python
from collections import defaultdict, deque

class Event:
    def __init__(self, rec):
        self.ts = rec["ts"]
        self.event = rec["event"]
        self.file = rec.get("file")
        self.func = rec.get("func")
        self.line = rec.get("line")
        self.exc_type = rec.get("exc_type")
        self.exc_msg = rec.get("exc_msg")
        self.vars_read = set(rec.get("reads", []))
        self.vars_written = set(rec.get("writes", []))
        self.branch = rec.get("cond")

def build_backward_slice(events, failure_predicate):
    # events: chronological list of Event
    # failure_predicate identifies final variables/locations
    slice_events = set()
    required = set(failure_predicate.variables)  # variables to justify
    last_defs = defaultdict(list)  # var -> list of indices where written
    for i, e in enumerate(events):
        for v in e.vars_written:
            last_defs[v].append(i)
    work = deque(required)
    while work:
        v = work.popleft()
        if v not in last_defs:
            continue
        i = last_defs[v][-1]
        e = events[i]
        if i not in slice_events:
            slice_events.add(i)
            # everything read by this def becomes required
            for r in e.vars_read:
                work.append(r)
            # include control dependencies heuristically,
            # e.g., prior branch events in the same function
            for j in range(i - 1, max(-1, i - 50), -1):
                if events[j].event == "branch":
                    slice_events.add(j)
    return sorted(slice_events)
```
In practice, you’ll augment events with explicit read/write sets. For dynamic languages, approximate using bytecode analysis or configurable probes (e.g., only track a set of critical locals or return values). For native code, rr or Intel PT can give you high-fidelity traces; you can post-process to annotate dependencies.
The output is a compact slice with:
- The minimal statements and branches that mattered.
- The exact values that flowed into the failure.
- The external inputs that must be controlled in reproduction.
3) Trace‑RAG: Retrieval Over Slices and Stack Fingerprints
Traditional RAG uses documents. Here, your “documents” are:
- Causal slices (JSONL or text summaries).
- Stack fingerprints (hashes over ordered frames; see the sketch after this list).
- Error messages and key-value context.
- Git diffs (failing commit vs last green), build metadata, environment deltas.
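The stack fingerprint mentioned above can be as simple as a stable hash over normalized frames; dropping line numbers keeps the fingerprint stable across small edits. A minimal sketch, with illustrative frame fields:

```python
import hashlib

def stack_fingerprint(frames):
    """Hash ordered (file, function) pairs; ignore line numbers so small edits don't split clusters.

    frames: list of dicts like {"file": "orders.py", "func": "apply_discount", "line": 89}
    """
    normalized = "|".join(f'{f["file"]}:{f["func"]}' for f in frames)
    return hashlib.sha256(normalized.encode()).hexdigest()[:20]

# Two failures that share the call path but differ in line numbers collide on purpose.
fp = stack_fingerprint([
    {"file": "orders.py", "func": "checkout", "line": 12},
    {"file": "orders.py", "func": "apply_discount", "line": 89},
])
```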
Index them in a hybrid search system:
- Vector index: embeddings of slice summaries, error messages, and stack frames using code-aware models (e.g., `jina-embeddings-code`, `text-embedding-3-large`, or `bge-large-en`).
- Keyword/BM25: exact matching for exception types, module names, line numbers, feature flags, test names.
- Filters: time window, repo, language, commit range, service.
The goal is to ask questions like:
- “Have we seen a similar causal slice before?”
- “Which commits changed functions in this stack?”
- “What fixes historically resolved the same stack fingerprint?”
Example: building an embedding for a slice record in Python:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def slice_to_text(slice_events):
    # Convert a slice into a compact textual form
    lines = []
    for e in slice_events:
        if e.event == "exception":
            lines.append(f"EXC {e.exc_type}: {e.exc_msg} at {e.file}:{e.line}")
        elif e.event == "call":
            lines.append(f"CALL {e.func}({','.join(sorted(e.vars_read))}) -> {','.join(sorted(e.vars_written))}")
        elif e.event == "branch":
            lines.append(f"BRANCH {e.branch}")
    return "\n".join(lines)

text = slice_to_text(slice_events)
vec = model.encode(text)
# store vec into vector DB with metadata (repo, commit, test, tags)
```
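To persist these vectors alongside keyword-filterable metadata, a sketch against Qdrant might look like the following; the collection name and payload fields are assumptions, and any hybrid vector + keyword store works the same way. It continues with the `vec` computed above:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

# all-MiniLM-L6-v2 produces 384-dimensional vectors.
client.recreate_collection(
    collection_name="trace_slices",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

client.upsert(
    collection_name="trace_slices",
    points=[
        models.PointStruct(
            id=1,
            vector=vec.tolist(),  # embedding from the snippet above
            payload={
                "repo": "shop/orders",
                "commit": "abc123",
                "test": "test_apply_discount",
                "exc_type": "ZeroDivisionError",
            },
        )
    ],
)

# Retrieval: nearest slices, restricted to the same exception type.
hits = client.search(
    collection_name="trace_slices",
    query_vector=vec.tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="exc_type", match=models.MatchValue(value="ZeroDivisionError"))]
    ),
    limit=5,
)
```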
To feed the LLM, construct a grounded context:
- Top-k similar slices with their confirmed root causes and fix diffs.
- The current causal slice and the full stack trace.
- The minimal environment diff (from delta debugging, next section).
- A query: “Generate a deterministic reproduction recipe and propose a minimal fix; cite which prior case(s) you used.”
Critical: instruct the LLM to only make claims that cite event IDs or retrieved artifacts. Reject answers that introduce non-grounded facts.
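One lightweight way to enforce that rule is to label every slice line with an event ID when assembling the context and to check citations before accepting an answer; a sketch with illustrative field names:

```python
import re

def build_context(predicate, slice_text, env_delta, similar_cases):
    """Assemble the grounded prompt context; every slice line gets an [E#] label."""
    labeled = "\n".join(f"[E{i}] {line}" for i, line in enumerate(slice_text.splitlines()))
    return {
        "failure_predicate": predicate,
        "causal_slice": labeled,
        "environment_delta": env_delta,
        "similar_cases": similar_cases,  # top-k retrieved (slice summary, root cause, fix diff)
    }

def answer_is_grounded(answer, context):
    """Reject answers that cite no event IDs, or cite IDs that do not exist in the slice."""
    cited = set(re.findall(r"\[E(\d+)\]", answer))
    known = set(re.findall(r"\[E(\d+)\]", context["causal_slice"]))
    return bool(cited) and cited <= known
```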
4) Delta Debugging in CI: Make Failures Small and Deterministic
Delta debugging (Andreas Zeller, "Yesterday, my program worked. Today, it does not. Why?", 1999) systematically minimizes failure-inducing differences. It’s the right tool to turn a big messy CI failure into a crisp reproduction.
Surfaces to minimize:
- Input files or payloads (JSON, fixtures, seeds).
- Test selection and order.
- Environment variables and feature flags.
- Dependency versions and container layers.
- Concurrency and timing (thread count, scheduling seeds).
Core idea: given a predicate P(X) that returns FAIL or PASS for a configuration X, search for a minimal X that still fails. The classic ddmin algorithm splits X into subsets and tests combinations to shrink X while preserving failure.
A practical shell-around-any-command implementation:
```python
import subprocess

def fails(cmd, env=None):
    try:
        r = subprocess.run(cmd, shell=True, env=env, timeout=600)
        return r.returncode != 0
    except subprocess.TimeoutExpired:
        return True  # treat timeout as failure

# ddmin over environment variables
def ddmin_env(base_env, cmd):
    items = list(base_env.items())
    n = 2
    current = items[:]
    while len(current) >= 1:
        subset_size = max(1, len(current) // n)
        some_progress = False
        for i in range(0, len(current), subset_size):
            trial = current[:i] + current[i + subset_size:]
            trial_env = dict(trial)
            if fails(cmd, env=trial_env):
                current = trial
                n = 2
                some_progress = True
                break
        if not some_progress:
            if n == len(current):
                break
            n = min(len(current), n * 2)
    return dict(current)
```
For inputs, represent X as a list of chunks (e.g., JSON fields, lines, events) and reuse the same ddmin structure.
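A sketch of that generalization, reusing the `fails` predicate from the snippet above; the temp-file hand-off via `FIXTURE_PATH` is an assumption about how the failing test consumes its input:

```python
import tempfile

def ddmin_chunks(chunks, still_fails):
    """ddmin-style reduction over a list of opaque chunks (lines, JSON fields, events)."""
    n = 2
    current = list(chunks)
    while len(current) >= 2:
        subset_size = max(1, len(current) // n)
        progressed = False
        for i in range(0, len(current), subset_size):
            trial = current[:i] + current[i + subset_size:]
            if trial and still_fails(trial):
                current, n, progressed = trial, 2, True
                break
        if not progressed:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current

def still_fails_with_lines(lines):
    # Write the candidate input to a temp fixture and rerun the failing command against it.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        f.write("\n".join(lines))
        fixture_path = f.name
    return fails(f"FIXTURE_PATH={fixture_path} pytest -q tests/test_orders.py")

minimal_lines = ddmin_chunks(open("fixtures/orders.json").read().splitlines(), still_fails_with_lines)
```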
Integrate into CI:
- On failure, capture the environment snapshot (env vars, feature flags, dependency locks, container image digest, kernel version, locale, timezone, CPU features, random seeds).
- Launch a delta-debugging job that:
- Minimizes env vars against the failing command.
- Minimizes test selection: start from the full shard; reduce to the single failing test; minimize adjacent tests if order-dependent.
- For inputs/fixtures, apply structural shrinking (e.g., JSON shrinking, property-based test shrinkers); see the sketch after this list.
- For dependencies, try narrowing to the minimal set of version bumps that still fail.
- Stop when further shrinking flips to PASS. Save the minimal failing configuration.
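For the structural-shrinking step, a small greedy reducer over parsed JSON covers most fixtures. It is not full ddmin, just one-at-a-time removal, and `still_fails` is whatever wraps your failing command for a candidate fixture:

```python
def shrink_json(value, still_fails):
    """Greedily drop dict keys or list elements while the failure predicate still holds.

    still_fails: callable(candidate) -> bool, e.g. serializes the candidate to the
    fixture path and reruns the failing test.
    """
    if isinstance(value, dict):
        changed = True
        while changed:
            changed = False
            for key in list(value):
                trimmed = {k: v for k, v in value.items() if k != key}
                if still_fails(trimmed):
                    value, changed = trimmed, True
                    break
        return value
    if isinstance(value, list):
        changed = True
        while changed:
            changed = False
            for i in range(len(value)):
                trimmed = value[:i] + value[i + 1:]
                if still_fails(trimmed):
                    value, changed = trimmed, True
                    break
        return value
    return value
```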
Output a "reproduction capsule":
- Dockerfile with pinned base image digest and apt indexes.
- `requirements.txt` or lockfiles pinned to exact versions.
- A wrapper script that sets minimal env vars and runs the single failing command.
- A tarball of fixtures reduced by delta debugging.
- Optional: rr trace or JFR file.
Developers then run:
```bash
# build the repro capsule
$ docker build -t repro:abc123 .
# verify reproduction
$ docker run --rm repro:abc123 ./repro.sh
```
Reproducibility beats explanation every time during review.
5) Auto‑Triage Flaky Tests with Deterministic Replays
Flaky tests resist reproduction because the failure predicate depends on scheduling, time, or environment noise. Your goal is to turn a flaky outcome into a deterministic replay.
Flake taxonomy (practical):
- Order-dependent: test A passes alone but fails after B due to shared state.
- Time-sensitive: timeouts, clocks, sleep-based races.
- Concurrency: data races, deadlocks, inconsistent visibility.
- Resource-sensitive: low disk/memory, file descriptor leaks.
- External: network instability, API rate limits.
Triage strategy:
- Rerun under perturbations:
- Randomize test order and isolate with fresh processes.
- Inject jitter: delay syscalls (eBPF), throttle network, simulate DNS failures.
- Control time: monotonic clock, frozen time libraries, time dilation.
- Capture traces for both PASS and FAIL; compute differential slices.
- Apply probabilistic delta debugging: search for a minimal schedule/seed that yields failure with high probability (e.g., >95%).
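A crude but effective sketch of that search: sample candidate seeds, estimate each seed's empirical failure rate over repeated runs, and keep the first seed that fails reliably. The `TEST_SEED` variable is an assumption about how your suite consumes seeds:

```python
import os
import random
import subprocess

def failure_rate(cmd, seed, runs=20):
    """Empirical probability that `cmd` fails under a fixed seed."""
    env = dict(os.environ, TEST_SEED=str(seed))  # assumed seed hand-off to the test suite
    failures = sum(
        subprocess.run(cmd, shell=True, env=env, timeout=300).returncode != 0
        for _ in range(runs)
    )
    return failures / runs

def find_reliable_seed(cmd, candidates=50, threshold=0.95):
    """Return a seed that reproduces the flake with probability >= threshold, or None."""
    for _ in range(candidates):
        seed = random.randrange(2**32)
        if failure_rate(cmd, seed) >= threshold:
            return seed
    return None
```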
Key techniques:
- Seed everything: random, fuzzers, property-tests, any sampling logic. Record seeds in test metadata.
- Deterministic schedulers:
  - JVM: `DeterministicRunner`, Loom virtual threads with a scheduling seed.
  - Go: `-race`, `GOMAXPROCS=1` for simplification, a custom goroutine scheduler for tests.
  - Python: `pytest-randomly`, `faulthandler`, monkeypatch `time` and `random` with a seeded implementation.
- Order-dependent detection: compute minimal inter-test dependency using ddmin on test order (binary search the smallest prefix that flips PASS→FAIL).
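A sketch of that prefix search, assuming order-dependence is roughly monotonic in the prefix length (more predecessors, more pollution) and that `pytest` drives the suite:

```python
import subprocess

def order_fails(prefix, victim):
    """Run the suspected polluting tests, then the victim, in one fresh process."""
    cmd = ["pytest", "-q", "-p", "no:randomly", *prefix, victim]
    return subprocess.run(cmd).returncode != 0

def minimal_polluting_prefix(preceding_tests, victim):
    """Binary-search the shortest prefix of the original order that still makes `victim` fail."""
    lo, hi = 0, len(preceding_tests)
    while lo < hi:
        mid = (lo + hi) // 2
        if order_fails(preceding_tests[:mid], victim):
            hi = mid        # failure reproduces with fewer predecessors
        else:
            lo = mid + 1    # need more of the original order
    return preceding_tests[:lo]
```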
For classification, score flakiness based on:
- Failure signature stability (stack fingerprint entropy).
- Sensitivity to order (delta after permutations).
- Sensitivity to time (delta after clock perturbation).
Once triaged, store a deterministic replay recipe (seed + order + environment). Attach to the test as a "known flake reproducer" so the next occurrence is instant to validate.
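A replay recipe needs very little structure; a JSON file per flake signature, written by the triage job, is enough. The paths, field names, and values below are illustrative:

```python
import json
import os

replay_recipe = {
    "test": "tests/test_checkout.py::test_apply_discount",
    "seed": 1882433280,
    "order": [
        "tests/test_inventory.py::test_reserve",
        "tests/test_checkout.py::test_apply_discount",
    ],
    "env": {"TZ": "UTC", "FEATURE_NEW_PRICING": "1"},
    "command": "TEST_SEED=1882433280 pytest -q -p no:randomly "
               "tests/test_inventory.py::test_reserve tests/test_checkout.py::test_apply_discount",
    "expected_failure": "ZeroDivisionError at orders.py:89",
}

os.makedirs(".artifacts/flake_reproducers", exist_ok=True)
with open(".artifacts/flake_reproducers/test_apply_discount.json", "w") as f:
    json.dump(replay_recipe, f, indent=2)
```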
6) Building the LLM Agent That Reproduces, Then Fixes
A robust agent shouldn’t write code first. It should:
- Reproduce the failure deterministically.
- Explain the causal slice in human terms.
- Retrieve similar past fixes via trace‑RAG.
- Propose a minimal test and code change.
- Validate: rerun the reproduction capsule, confirm the failure disappears, and check for new regressions (at least on an impacted test subset).
Suggested tools and guardrails:
- Tools available to the agent:
  - `run(cmd)`: execute the repro command inside the capsule.
  - `git_checkout(ref)`, `git_diff()`, `git_apply(patch)`.
  - `search_index(query, filters)`: query the trace‑RAG store.
- Output schema: always produce a structured plan: reproduction steps, expected outcome, diff, and impacted tests.
- Guard the LLM with deterministic checks: the system refuses to advance to "patch proposal" until reproduction is verified.
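A minimal sketch of that gate: run the capsule twice and unlock the patch-proposal stage only when both runs fail with the same signature (the image tag and `repro.sh` entrypoint mirror the capsule example earlier):

```python
import subprocess

def run_capsule(image="repro:abc123"):
    """Run the reproduction capsule once; report whether it failed and the tail of its output."""
    r = subprocess.run(
        ["docker", "run", "--rm", image, "./repro.sh"],
        capture_output=True, text=True, timeout=900,
    )
    return r.returncode != 0, (r.stdout + r.stderr)[-2000:]

def reproduction_verified(image="repro:abc123"):
    """Require two consecutive failing runs with a matching last line before drafting any patch."""
    first_fail, first_log = run_capsule(image)
    second_fail, second_log = run_capsule(image)
    same_signature = first_log.splitlines()[-1:] == second_log.splitlines()[-1:]
    return first_fail and second_fail and same_signature

# The agent loop calls reproduction_verified() and refuses to prompt for a patch until it returns True.
```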
A prompt template for fix generation:
System: You are a debugging assistant. Only use cited trace events and retrieved cases. Do not invent code or APIs that are not present in the repository.
User:
- Failure predicate: {predicate}
- Causal slice: {slice_text} (events are labeled [E#])
- Environment delta: {env_delta}
- Similar cases: {top_k_cases_with_fixes}
Tasks:
1) Produce a Reproduction Recipe using only the provided artifacts. Cite event IDs [E#] for each step that depends on an event.
2) State the hypothesized root cause as a cause-effect chain using only cited events.
3) Suggest a minimal patch. Cite prior cases used. Limit changes to files that appear in the slice or the retrieved fixes.
4) Propose a unit/integration test that fails before the patch and passes after.
5) Provide a validation plan (commands) that replays the failure, applies the patch, and verifies.
This keeps the model grounded and verifiable.
7) Implementation Blueprint
Here’s a minimal but end-to-end path to get started.
- CI Setup
  - On failure, rerun the failed test with tracing enabled.
  - Archive trace, logs, and environment snapshot.
  - Trigger a minimization job (delta debugging).
  - Produce a reproduction capsule artifact.
- Trace Slice and Index
  - Post-process the trace to compute a causal slice.
  - Generate a textual summary of the slice.
  - Store the slice and stack fingerprint to a vector DB (e.g., Qdrant, Weaviate) and a keyword store (e.g., OpenSearch).
- Agent Workflow
  - Pull the latest failing capsule.
  - Verify reproduction by running it twice.
  - Query the index for similar slices/fixes.
  - Draft patch and a test under constraints.
  - Validate locally in the capsule.
  - Open a PR with:
    - The capsule recipe and how to run it.
    - The causal slice and failure explanation.
    - The patch and test.
    - Benchmarks showing no regression if available.
Sample GitHub Actions snippet:
```yaml
name: Repro-and-Slice
on:
  workflow_run:
    workflows: ["CI"]
    types:
      - completed

jobs:
  repro:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install deps
        run: |
          pip install -r requirements.txt
      - name: Rerun failed tests with tracing
        run: |
          FAILED=$(python scripts/parse_failed_tests.py ${{ github.event.workflow_run.id }})
          echo "FAILED=$FAILED" >> "$GITHUB_ENV"  # make $FAILED visible to later steps
          TRACE_ON_FAIL=1 pytest -q $FAILED || true
      - name: Archive traces
        uses: actions/upload-artifact@v4
        with:
          name: traces
          path: .artifacts/traces
      - name: Delta debug environment
        run: |
          python tools/dd_env.py --cmd "pytest -q $FAILED" --out .artifacts/repro_env.json
      - name: Build reproduction capsule
        run: |
          python tools/build_capsule.py --env .artifacts/repro_env.json --out dist/capsule
      - name: Upload capsule
        uses: actions/upload-artifact@v4
        with:
          name: capsule
          path: dist/capsule
```
8) Safety, Privacy, and Cost Controls
- Data minimization: only collect reads/writes for variables in the causal slice. Avoid global dumps of locals.
- Redaction: tokenize or hash suspect fields (auth tokens, emails, PII). Use vault-backed rehydration only when needed.
- Performance: use on-failure reruns for full tracing, and ring buffers otherwise. Cap traces by wall-clock and event count.
- Retention: keep only the latest N traces per test signature; gate access with RBAC.
- Rollout: start with one repo/service, benchmark overhead and MTTR improvement, then expand.
9) Opinionated Guidance
- Reproduction is your North Star. If your assistant can’t replay the bug, it’s not ready to propose a fix.
- Prefer hermetic builds. Bazel/Nix/containers reduce environmental variance and make delta debugging ruthlessly effective.
- Start coarse, then refine. Basic call/exception traces + simple dynamic slice will get you most of the gains before diving into taint tracking or Intel PT.
- Treat flaky tests as first-class citizens. Invest in deterministic schedules and seeds; you’ll pay it back in hours saved.
- Make patches reviewable. Always ship a minimal failing test with the fix. Reviewers should be able to reproduce in one command.
10) Limitations and Pitfalls
- Over-instrumentation can distort timing-sensitive bugs. Keep the minimal instrumentation necessary.
- Some bugs require full record-and-replay to reproduce (e.g., kernel-level races). Use rr or TTD for those, but only on demand.
- Embedding drift: retrain or refresh your vector index periodically; include strong keyword filters to avoid semantically similar but irrelevant matches.
- LLM hallucinations: require citations to event IDs and retrieved cases; reject non-cited claims in the pipeline.
- Privacy: treat traces as sensitive. Avoid copying prod data into CI traces; sanitize aggressively.
11) Measuring Success
Track these KPIs:
- Reproduction rate: percentage of failures that yield a deterministic capsule within X minutes.
- Minimization ratio: size/time reduction from original failure to minimal reproducer.
- MTTR: mean time to resolve for failures processed by the system vs. baseline.
- Flake detection precision/recall: ability to correctly label and replay flaky tests.
- Patch acceptance rate: percent of AI-proposed fixes that pass review and stay green.
Aim for a >80% reproduction rate on common classes (assertions, type errors, order-dependent tests) within 15 minutes.
12) References and Further Reading
- Weiser, M. (1981). Program Slicing. ICSE.
- Korel, B., & Laski, J. (1988). Dynamic Program Slicing. Information Processing Letters.
- Zeller, A., & Hildebrandt, R. (2002). Simplifying and Isolating Failure-Inducing Input. IEEE TSE. (Delta Debugging)
- Zeller, A. (2009). Why Programs Fail: A Guide to Systematic Debugging.
- Mozilla rr: Lightweight record-and-replay for debugging. https://rr-project.org/
- Java Flight Recorder (JFR) and Async Profiler.
- eBPF tracing tools (bcc, bpftrace) for syscall and kernel-level insight.
- Sentence Transformers and modern code embeddings for retrieval.
Closing
RAG for traces reorients LLM debugging toward evidence. With causal slices, delta debugging, and deterministic capsules, your assistant doesn’t just talk about bugs—it reenacts them. That makes fixes safer, reviews faster, and engineering cultures more scientific. Start small: a rerun-on-failure tracer, a simple slice, and a minimizer. Add trace‑RAG when you have even 100 historical failures. Within a few sprints, you’ll wonder how you ever debugged without a button that says: “Reproduce.”
