Repo-RAG: Feeding Your Code Debugging AI with Tests, Traces, and Diffs for Root-Cause Analysis
Modern software is awash in evidence: failing tests, flaky CI logs, OpenTelemetry traces, code coverage heatmaps, and Git diffs that tell the story of how the system got here. The problem isn’t lack of data—it’s that your debugging loop can’t ingest and reason over all of it coherently and quickly.
Repo-RAG (Repository Retrieval-Augmented Generation) is a pragmatic pattern to make a code debugging AI actually useful: not by making the model bigger, but by making its context smarter. You build a repo-aware retrieval pipeline that can fetch the right tests, traces, logs, coverage, and Git history; localize the fault; rank hypotheses; and propose verifiable fixes. Done right, it’s fast, private, and CI-friendly.
This article lays out the blueprint: data schemas, indexing strategies, retrieval operators, ranking math, prompting patterns, CI integration, evaluation metrics, and a step-by-step case study. The goal is to move from generic chat-with-your-code to precise, reproducible root-cause analysis (RCA) and verifiable patches.
TL;DR
- Repo-RAG makes LLMs effective at debugging by grounding them in test artifacts, traces, logs, coverage, and diffs.
- Combine code search, spectrum-based fault localization (SBFL), Git churn, and trace slicing to rank root-cause hypotheses.
- Feed only the minimal, high-signal context to the model; ask it to propose both a patch and a test; verify in CI.
- Keep it private and deterministic with local indices, deterministic retrieval, and reproducible builds.
Why Repo-RAG (and why now)
RAG improved question answering for documents. But source code is different:
- The relevant context isn’t just “nearby text.” It’s dynamic behavior (tests, logs, traces), historical edits (diffs, blame), and explicit runtime failures.
- Success isn’t a persuasive paragraph. It’s a patch that compiles, passes tests, and avoids regressions.
- Ground truth and verifiability exist: tests and reproducible builds.
Repo-RAG treats your repository as a living knowledge graph: code artifacts connected to runtime evidence and history. It gives a debugging model the minimum evidence needed to reason effectively, not the maximum tokens it can swallow.
The architecture at a glance
Think of Repo-RAG as a modular pipeline with four phases:
- Collect: Ingest artifacts from CI and dev runs: test results, logs, traces, coverage, build metadata, and Git history.
- Index: Build code maps, symbol graphs, failure-indexed logs/traces, and diff-based history indices. Store with fast retrieval keys.
- Retrieve: For a given failure, deterministically fetch high-signal slices: failing tests, relevant stack frames, touched files, suspect diffs, coverage deltas, and historical flakiness.
- Reason and act: Let the model (or a small agent) rank hypotheses, propose a patch and a test, and orchestrate a validation run.
Each phase is pluggable and can run offline/on-prem for privacy.
Data you must collect (and the minimum useful schema)
You don’t need a data lake. You need evidence with stable keys and timestamps. Start with this minimal schema and expand as needed.
- Test artifacts (per CI run):
- identity: commit SHA, branch, CI run ID, job ID
- results: test name, outcome, duration, error type, stack trace, stdout/stderr
- coverage: per-file and per-line hit counts
- Logs:
- timestamped lines grouped by test or request ID; severity; source component; optional JSON fields
- Traces (OpenTelemetry/Jaeger):
- trace ID, span IDs, name, attributes, status, events, links to logs and code locations
- Git history:
- commits, parents, diffs per file, rename/copy detection, author time, message, change hunks
- blame per line; churn metrics; tags for revert/reintroduce patterns
- Build metadata:
- environment (OS, runtime version), compiler flags, feature flags, container image digest
A portable JSON representation you can store in S3, GCS, artifact store, or a local directory works fine. Example:
```json
{
  "run": {
    "sha": "a1b2c3...",
    "branch": "feature/xyz",
    "ci_run_id": "12345",
    "started_at": "2025-11-10T12:00:00Z",
    "env": {"python": "3.11.6", "os": "ubuntu-22.04"}
  },
  "tests": [
    {
      "name": "tests/test_tz.py::test_midnight_rollover",
      "status": "failed",
      "duration_ms": 182,
      "error": {
        "type": "AssertionError",
        "message": "expected 2025-01-01, got 2024-12-31",
        "stack": [
          {"file": "app/time_util.py", "line": 84, "fn": "rollover"},
          {"file": "tests/test_tz.py", "line": 42, "fn": "test_midnight_rollover"}
        ]
      },
      "stdout": "...",
      "stderr": "...",
      "trace_id": "07a3c..."
    }
  ],
  "coverage": {
    "files": {
      "app/time_util.py": {
        "lines": {"82": 1, "83": 1, "84": 1, "85": 0}
      }
    }
  }
}
```
Indexing your repository for debugging
Indexing for Repo-RAG is not just vectorizing files. You need deterministic, structured indices:
- Code map index:
- A symbol graph (functions, classes, modules) with locations and cross-references. Use tree-sitter, ctags, or language servers (LSP) to build it.
- Derived structures: call graph (static approximation), file-module mapping, test-to-code mapping via imports and coverage.
- Test index:
- Key by test name; include last N outcomes, failure frequency, min/max durations, flaky score, and links to coverage and traces.
- Trace/log index:
- Key by test or trace ID; store top-K spans on error path; keep inverted index by error messages and span attributes.
- Git history index:
- For each file: commits touching it, churn, time decay, rename history, revert pairs.
- Precompute SZZ suspect commits: link failing tests to candidate culprit commits using failure introduction points.
- Embeddings index (optional but useful):
- Chunk by symbol, not arbitrary tokens; embed docstrings, signatures, and key code lines. This supports semantic matching for novel failures.
Store metadata in a lightweight relational database (SQLite, DuckDB, or Postgres), keep per-run history in the same store (SQLite/DuckDB often suffice), and add a search backend (ripgrep for lexical, a vector store for embeddings). Keep it simple; optimize later.
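As a concrete starting point, the test index described above can be a single SQLite table. This is a minimal sketch, not prescribed tooling: the table layout and the flip-counting flaky score are illustrative choices you would tune for your repo.

```python
import sqlite3

def init_test_index(conn: sqlite3.Connection) -> None:
    # One row per (test, CI run); the index key is the test name.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS test_runs (
            test_name   TEXT NOT NULL,
            sha         TEXT NOT NULL,
            status      TEXT NOT NULL,    -- 'passed' | 'failed'
            duration_ms INTEGER,
            PRIMARY KEY (test_name, sha)
        )
    """)

def record(conn, test_name, sha, status, duration_ms):
    conn.execute(
        "INSERT OR REPLACE INTO test_runs VALUES (?, ?, ?, ?)",
        (test_name, sha, status, duration_ms),
    )

def flaky_score(conn, test_name: str, last_n: int = 20) -> float:
    # Fraction of pass/fail flips across the last N runs:
    # 0.0 = stable, 1.0 = alternating every run.
    rows = conn.execute(
        "SELECT status FROM test_runs WHERE test_name = ? ORDER BY rowid DESC LIMIT ?",
        (test_name, last_n),
    ).fetchall()
    statuses = [r[0] for r in rows]
    if len(statuses) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(statuses, statuses[1:]) if a != b)
    return flips / (len(statuses) - 1)
```

The flaky score computed here feeds directly into the SBFL weighting discussed later: a test that alternates every run scores 1.0 and should contribute little to fault localization.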
The retrieval operators that matter
Design retrieval as composable operators with clear inputs/outputs. For a failure F at commit C:
- select_failing_tests(F):
- Return failing test objects with stack frames and error kinds.
- slice_trace(F):
- Fetch traces/logs for F; return the shortest failing path and last N spans with status=ERROR.
- map_frames_to_symbols(frames):
- Use code map to resolve frames to functions/classes; return neighborhoods (file +/- 30 lines, callers/callees).
- suspects_from_git(C, files):
- Use SZZ and churn to rank suspect commits and hunks affecting those files or symbols.
- coverage_delta(F):
- Compare coverage of failing vs passing runs; highlight newly executed lines near failures.
- semmatch(F):
- Optional: embed error message + frames; retrieve semantically similar past failures and fixes.
These operators keep the retrieval deterministic and explainable. You can log their outputs to audit why the model saw what it saw.
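To make the operator contract concrete, here is one way map_frames_to_symbols could look. The Symbol and Frame shapes are assumptions matching the schema earlier in the article, and the lookup is a linear scan for clarity (a real index would query the symbol graph):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Symbol:
    name: str
    file: str
    start_line: int
    end_line: int

@dataclass
class Frame:
    file: str
    line: int
    fn: str

def map_frames_to_symbols(frames: List[Frame], symbols: List[Symbol],
                          radius: int = 30) -> List[dict]:
    """Resolve stack frames to enclosing symbols and return code neighborhoods."""
    out = []
    for frame in frames:
        hit: Optional[Symbol] = next(
            (s for s in symbols
             if s.file == frame.file and s.start_line <= frame.line <= s.end_line),
            None,
        )
        if hit is not None:
            out.append({
                "symbol": hit.name,
                "file": hit.file,
                # Neighborhood window: +/- radius lines around the frame.
                "window": (max(1, frame.line - radius), frame.line + radius),
            })
    return out

symbols = [Symbol("app.time_util.rollover", "app/time_util.py", 76, 94)]
frames = [Frame("app/time_util.py", 84, "rollover")]
print(map_frames_to_symbols(frames, symbols))
# → [{'symbol': 'app.time_util.rollover', 'file': 'app/time_util.py', 'window': (54, 114)}]
```

Because the operator returns plain data (symbol name, file, window), its output can be logged alongside the other operators' outputs for the audit trail mentioned above.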
Ranking hypotheses: blend signals, not vibes
Don’t ask the model to guess blindly. Provide it with scored hypotheses derived from classic software engineering research plus repo signals. A simple, effective triad:
- Spectrum-Based Fault Localization (SBFL) via Ochiai or Tarantula.
- Git priors: churn, recency, revert history.
- Trace depth and error proximity.
Compute a score for each candidate line, hunk, or symbol. Example Ochiai for a statement s:
- ef: number of failing tests that execute s
- nf: number of failing tests that do not execute s
- ep: number of passing tests that execute s
- np: number of passing tests that do not execute s
- Ochiai(s) = ef / sqrt((ef + nf) * (ef + ep))
Python snippet:
```python
import math
from dataclasses import dataclass

@dataclass
class Spectrum:
    ef: int
    nf: int
    ep: int
    np: int

def ochiai(spec: Spectrum) -> float:
    denom = math.sqrt((spec.ef + spec.nf) * (spec.ef + spec.ep))
    return (spec.ef / denom) if denom else 0.0

# Combine with Git churn and trace proximity
@dataclass
class Features:
    ochiai: float
    churn: float         # normalized per file
    recency_days: float  # more recent ==> higher weight
    trace_depth: int     # closer to error ==> higher weight

def score(features: Features) -> float:
    # weights tuned empirically; keep monotonicity
    w_ochiai = 0.6
    w_churn = 0.2
    w_recency = 0.1
    w_trace = 0.1
    return (
        w_ochiai * features.ochiai
        + w_churn * features.churn
        + w_trace * (1.0 / (1 + features.trace_depth))
        + w_recency * (1.0 / (1 + features.recency_days))
    )
```
Present the model with the top K hunks/symbols, their scores, and the minimal surrounding context.
Prompting the debugging model: be precise and verifiable
A good Repo-RAG prompt has these features:
- Task framing: localize fault; propose minimal patch; propose a test that fails before and passes after.
- Constraints: don’t touch unrelated files; respect style; include rationale tied to observed evidence.
- Context: failing tests, narrowed code slices, suspect hunks, relevant diffs, trace excerpt.
- Output schema: patch as a unified diff, plus a test patch.
Template:
You are a software debugging assistant. Use the provided evidence to localize the bug and propose a minimal fix with a verifiable test.
Evidence:
- Commit: {sha}
- Failing tests: {test_names}
- Error: {error_type}: {error_message}
- Stack frames (top 5):
{frames}
- Trace excerpt (last spans until error):
{trace_excerpt}
- Candidate hunks (ranked):
{hunks_with_scores}
- Coverage delta: lines executed only by failing tests: {lines}
- Relevant recent diffs:
{diff_snippets}
Requirements:
1) Explain the root cause in 2-4 sentences, citing specific lines/hunks.
2) Produce a minimal patch as a unified diff.
3) Produce or modify a test so that it fails before your patch and passes after.
4) Do not change public APIs unless strictly necessary.
5) Keep edits within these files only: {allowed_files}.
Output JSON strictly matching this schema:
{
"explanation": "...",
"patch_diff": "--- a/...\n+++ b/...\n...",
"test_diff": "--- a/...\n+++ b/...\n..."
}
Keep the prompt compact: summarize logs and traces to the shortest path-to-error; include only the top 1-3 hunks per file; limit code to +/- 30 lines around candidates.
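Since the model is asked for strict JSON, validate the response before letting anything touch the repo. A minimal sketch, assuming the three-key schema from the template above (the structural diff check is a deliberately cheap heuristic, not a full diff parser):

```python
import json

REQUIRED_KEYS = {"explanation", "patch_diff", "test_diff"}

def parse_proposal(raw: str) -> dict:
    """Parse and sanity-check model output against the expected schema."""
    proposal = json.loads(raw)
    missing = REQUIRED_KEYS - proposal.keys()
    if missing:
        raise ValueError(f"model output missing keys: {sorted(missing)}")
    for key in ("patch_diff", "test_diff"):
        # Cheap structural check: a unified diff must carry file headers.
        if "--- a/" not in proposal[key] or "+++ b/" not in proposal[key]:
            raise ValueError(f"{key} is not a unified diff")
    return proposal

raw = json.dumps({
    "explanation": "Naive date arithmetic ignores the timezone.",
    "patch_diff": "--- a/app/time_util.py\n+++ b/app/time_util.py\n...",
    "test_diff": "--- a/tests/test_tz.py\n+++ b/tests/test_tz.py\n...",
})
print(parse_proposal(raw)["explanation"])
```

Rejecting malformed output here, before any `git apply`, is the first line of defense against the hallucinated-patch pitfall discussed later.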
Building the collector and indexer
Start with a portable, language-agnostic setup. You can add language-specific enrichments later.
Shell snippets for collection in CI:
```bash
# 1) Test artifacts (pytest example)
pytest -q --maxfail=1 --disable-warnings \
  --junitxml=artifacts/junit.xml \
  --cov=app --cov-report=xml:artifacts/coverage.xml \
  | tee artifacts/test.log

# 2) OpenTelemetry traces (assuming OTEL exporter is set)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Your app emits spans during tests; collector writes to artifacts/traces.jsonl

# 3) Git metadata
git rev-parse HEAD > artifacts/sha.txt
git show -s --format='%H %ct %an %s' > artifacts/commit.txt

# 4) Normalize artifacts
python tools/collect_ci_artifacts.py \
  --junit artifacts/junit.xml \
  --coverage artifacts/coverage.xml \
  --logs artifacts/test.log \
  --traces artifacts/traces.jsonl \
  --out artifacts/run.json
```
Simplified Python for JUnit and coverage normalization:
```python
# tools/collect_ci_artifacts.py
import argparse
import json
from xml.etree import ElementTree as ET

def parse_junit(path):
    root = ET.parse(path).getroot()
    tests = []
    for case in root.iter('testcase'):
        name = f"{case.get('classname')}::{case.get('name')}"
        # Note: use an explicit None check; Element truthiness is unreliable
        # (an element with no children is falsy), so `find(...) or find(...)` misbehaves.
        failure = case.find('failure')
        if failure is None:
            failure = case.find('error')
        status = 'failed' if failure is not None else 'passed'
        tests.append({
            'name': name,
            'status': status,
            'error': None if status == 'passed' else {
                'type': failure.get('type'),
                'message': failure.get('message'),
                'stack': failure.text.splitlines()[:50] if failure.text else []
            }
        })
    return tests

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('--junit')
    ap.add_argument('--coverage')
    ap.add_argument('--logs')
    ap.add_argument('--traces')
    ap.add_argument('--out')
    args = ap.parse_args()
    run = {'tests': parse_junit(args.junit)}
    # parse coverage/logs/traces similarly; elided for brevity
    with open(args.out, 'w') as f:
        json.dump(run, f)

if __name__ == '__main__':
    main()
```
Indexing code and Git with tree-sitter and ripgrep:
```bash
# Build a symbol map (example using universal-ctags)
ctags -R --fields=+n --languages=Python,JavaScript --extras=+q -f artifacts/tags .

# Lexical search baseline
rg -n "\brollover\b" app/ > artifacts/search.txt

# Git history JSON (per-file churn)
git log --numstat --date=iso --pretty=format:'commit:%H%nparent:%P%nauthor:%an%ndate:%ad%n' \
  | python tools/log_to_json.py > artifacts/git.jsonl
```
You can move to a more robust graph store later (e.g., sqlite with tables for symbols, refs, tests, traces, commits, hunks). Start with reproducibility and inspectability.
Suspect commits with SZZ and diffs
The SZZ algorithm identifies bug-introducing commits by tracing from a bug-fixing commit to the lines it modifies, then blaming those lines back to the earlier commits that last touched them. For online debugging, you can use a “forward SZZ” heuristic: given failing hunks and frames, blame the lines to find the last commits touching them and rank by recency and churn.
Example shell:
```bash
# For each candidate file
file=app/time_util.py
awk 'NR>=70 && NR<=110 {print NR":"$0}' "$file" > /tmp/slice.txt

# git blame for the slice; in --line-porcelain output each group starts with
# a 40-character commit SHA, not the word "commit"
git blame -L 70,110 --line-porcelain "$file" \
  | grep -E '^[0-9a-f]{40} |^author-time ' > /tmp/blame.txt
```
Use this data to compute a suspect prior per hunk. Combine with SBFL and trace depth for the final ranking.
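Turning the blame slice into a prior can be done with a few lines of parsing. A sketch under stated assumptions: it reads only the SHA header and `author-time` lines of `git blame --line-porcelain` output, and the recency decay (lines blamed × 1/(1+age in days)) is an illustrative weighting, not a calibrated one:

```python
import re

SHA_RE = re.compile(r"^[0-9a-f]{40} ")

def suspect_priors(porcelain: str, now: float) -> dict:
    """Turn `git blame --line-porcelain` output into a recency-decayed prior per commit."""
    counts, last_author_time = {}, {}
    sha = None
    for line in porcelain.splitlines():
        if SHA_RE.match(line):
            sha = line.split()[0]
            counts[sha] = counts.get(sha, 0) + 1
        elif line.startswith("author-time ") and sha is not None:
            last_author_time[sha] = int(line.split()[1])
    priors = {}
    for s, n in counts.items():
        age_days = max(0.0, (now - last_author_time.get(s, now)) / 86400)
        # More blamed lines and more recent edits => higher prior.
        priors[s] = n * (1.0 / (1.0 + age_days))
    return priors

# Fabricated porcelain fragment: one blamed line, authored one day before `now`.
porcelain = "\n".join([
    "a" * 40 + " 84 84 1",
    "author-time 1700000000",
    "\treturn (dt + timedelta(seconds=seconds)).date()",
])
print(suspect_priors(porcelain, now=1700000000 + 86400))  # one commit, prior 0.5
```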
Putting it together: the Repo-RAG flow
Pseudocode orchestrator:
```python
class RepoRAG:
    def __init__(self, indexes):
        self.idx = indexes

    def analyze_failure(self, sha: str, run_path: str) -> dict:
        run = load_run(run_path)
        failing = [t for t in run['tests'] if t['status'] == 'failed']
        frames = top_frames(failing)
        trace = slice_traces(run, failing)
        cand_symbols = map_frames_to_symbols(frames, self.idx.symbols)
        cov = coverage_delta(run, self.idx.recent_pass)
        hunks = expand_symbols_to_hunks(cand_symbols, radius=30)
        scores = rank_hunks(hunks, cov, self.idx.git, trace)
        context = build_context(failing, frames, trace, scores.topk(10))
        prompt = render_prompt(context)
        model_out = call_model(prompt)
        return model_out
```
A small agent can then apply the patch, run tests, and iterate if necessary. Keep the loop bounded to avoid CI sprawl.
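The bounded loop itself can be expressed independently of the patch and test tooling. In this sketch the callables (propose, apply_patch, run_tests, revert — all hypothetical names) are injected, which makes the policy explicit and testable: try, verify, roll back, give up after N rounds:

```python
from typing import Callable, List, Optional

def bounded_repair(propose: Callable[[Optional[str]], str],
                   apply_patch: Callable[[str], bool],
                   run_tests: Callable[[], str],   # "" on success, else failure log
                   revert: Callable[[str], None],
                   max_rounds: int = 2) -> Optional[str]:
    """Run at most max_rounds propose->apply->test cycles; return the accepted diff or None."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        diff = propose(feedback)        # re-prompt the model with the last failure output
        if not apply_patch(diff):       # e.g., `git apply --check` failed
            feedback = "patch did not apply"
            continue
        failure = run_tests()
        if failure == "":
            return diff                 # verified fix
        revert(diff)                    # roll back before the next attempt
        feedback = failure
    return None

# Toy run: the first proposal fails tests, the second passes.
attempts = iter(["diff-v1", "diff-v2"])
applied: List[str] = []
result = bounded_repair(
    propose=lambda fb: next(attempts),
    apply_patch=lambda d: (applied.append(d) or True),
    run_tests=lambda: "" if applied[-1] == "diff-v2" else "1 failed",
    revert=lambda d: applied.remove(d),
)
print(result)  # → diff-v2
```

In CI the callables would wrap `git apply --check`/`git apply` and a targeted `pytest -k` run; keeping max_rounds small is what prevents the CI sprawl mentioned above.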
CI integration: GitHub Actions example
Make it easy to adopt by binding to PRs and failing tests. Minimal workflow:
```yaml
name: repo-rag-debug
on:
  pull_request:
    types: [opened, synchronize, reopened]
  workflow_dispatch: {}

jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install deps
        run: |
          pip install -r requirements-dev.txt
          pip install pytest pytest-cov opentelemetry-sdk
      - name: Run tests and collect artifacts
        run: |
          mkdir -p artifacts
          pytest -q --junitxml=artifacts/junit.xml \
            --cov=. --cov-report=xml:artifacts/coverage.xml | tee artifacts/test.log
      - name: Build indexes
        run: |
          sudo apt-get update && sudo apt-get install -y universal-ctags ripgrep
          ctags -R -f artifacts/tags .
          python tools/collect_ci_artifacts.py --junit artifacts/junit.xml \
            --coverage artifacts/coverage.xml --logs artifacts/test.log \
            --out artifacts/run.json
      - name: Repo-RAG debug
        env:
          MODEL_ENDPOINT: ${{ secrets.MODEL_ENDPOINT }}
          MODEL_TOKEN: ${{ secrets.MODEL_TOKEN }}
        run: |
          python tools/repo_rag_debug.py \
            --sha $(git rev-parse HEAD) \
            --run artifacts/run.json \
            --out artifacts/proposal.json
      - name: Post patch suggestion as PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const p = JSON.parse(fs.readFileSync('artifacts/proposal.json', 'utf8'));
            const body = `RCA: ${p.explanation}\n\nPatch:\n\n\`\`\`diff\n${p.patch_diff}\n\`\`\`\n\nTest:\n\n\`\`\`diff\n${p.test_diff}\n\`\`\``;
            github.rest.issues.createComment({ ...context.repo, issue_number: context.issue.number, body });
```
Keep the model call behind a secret endpoint. For privacy, run a local model (e.g., Llama, DeepSeek-Coder) on a self-hosted runner.
Case study: timezone rollover bug
Scenario: A Python service computes the next date at midnight local time. A test fails around New Year’s.
Artifacts:
- Failing test: tests/test_tz.py::test_midnight_rollover
- Error: AssertionError: expected 2025-01-01, got 2024-12-31
- Stack: app/time_util.py:84 in rollover
- Coverage: line 85 is unexecuted in passing runs but executed in the failing test
- Git: last week, commit abc123 changed time_util.py lines 80-90 to handle DST
Retrieval results:
- Frames mapped to symbol: function app.time_util.rollover at lines 76-94
- Hunks: lines 82-88 recently edited; blame points to abc123
- Trace slice: span compute_next_date -> rollover -> error raised; attributes tz='America/Los_Angeles'
- SBFL: ef=1, ep=120 (many passing tests don’t touch this code); Ochiai points to line 84 where naive date arithmetic crosses midnight without timezone awareness
Model prompt includes:
- The 30-line window around lines 76-94
- The recent diff for lines 80-90
- The error and trace snippet with tz attribute
- Coverage delta showing the failing-only path
- Allowed files: app/time_util.py, tests/test_tz.py
Proposed patch (illustrative):
```diff
--- a/app/time_util.py
+++ b/app/time_util.py
@@ def rollover(dt: datetime, tz: tzinfo) -> date:
-    # naive: add seconds until midnight in naive local time
-    seconds = 24*3600 - (dt.hour*3600 + dt.minute*60 + dt.second)
-    return (dt + timedelta(seconds=seconds)).date()
+    # Correct: convert to tz-aware, normalize at local midnight, then add one day if needed
+    if dt.tzinfo is None:
+        dt = dt.replace(tzinfo=tz)
+    local = dt.astimezone(tz)
+    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
+    if local.time() == time(0, 0):
+        return local.date()
+    return (midnight + timedelta(days=1)).date()
```
Test patch:
```diff
--- a/tests/test_tz.py
+++ b/tests/test_tz.py
@@ def test_midnight_rollover():
-    dt = datetime(2024, 12, 31, 23, 59, 59)
-    assert rollover(dt, ZoneInfo('America/Los_Angeles')) == date(2025, 1, 1)
+    la = ZoneInfo('America/Los_Angeles')
+    dt = datetime(2024, 12, 31, 23, 59, 59, tzinfo=la)
+    assert rollover(dt, la) == date(2025, 1, 1)
+    # Midnight edge case should return same local date
+    dt2 = datetime(2025, 1, 1, 0, 0, 0, tzinfo=la)
+    assert rollover(dt2, la) == date(2025, 1, 1)
```
Verification:
- Before patch: the added midnight test fails
- After patch: both tests pass
This is a toy example, but the pattern generalizes: timezone and DST bugs are classic cases where traces (attributes), diffs (recent DST handling), and SBFL (lines executed by failing-only tests) converge.
Privacy and security: keep it in your repo bubble
- Run the collector and indexer inside CI; store artifacts in your existing artifact store.
- Prefer local or VPC-hosted models; if you must use a cloud model, strip PII and redact secrets from logs.
- Enforce least-privilege: the debugging job needs read-only to code and artifacts; write access only to post comments or push a patch branch.
- Scan patches with secret scanners and static analyzers (e.g., Trivy, Gitleaks, CodeQL) before proposing merges.
Performance and reliability tips
- Cache indices by commit SHA; use incremental updates per diff rather than full re-index.
- Shard large repos by language or module; build per-package symbol maps.
- Bound retrieval: top 3 failing tests, top 10 hunks, code windows under 3k lines total.
- Summarize logs and traces: keep only path-to-error spans, de-duplicate repeated stack frames.
- Deterministic retrieval order: sort by score, then file path, then line number for reproducibility.
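The deterministic ordering in the last tip is one line of code but worth pinning down, because floating-point score ties are common in practice. Sorting by (-score, path, line) makes retrieval output byte-stable across runs (the Candidate shape here is illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    path: str
    line: int
    score: float

def deterministic_order(cands: List[Candidate]) -> List[Candidate]:
    # Descending score; ties broken by file path, then line number.
    return sorted(cands, key=lambda c: (-c.score, c.path, c.line))

cands = [
    Candidate("app/time_util.py", 84, 0.91),
    Candidate("app/calendar.py", 12, 0.91),   # same score: path breaks the tie
    Candidate("app/time_util.py", 40, 0.55),
]
print([(c.path, c.line) for c in deterministic_order(cands)])
# → [('app/calendar.py', 12), ('app/time_util.py', 84), ('app/time_util.py', 40)]
```

Python's `sorted` is stable, but relying on input order for ties would still make the context sensitive to index iteration order, which is exactly the nondeterminism the tie-breaker removes.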
Evaluation: measure RCA, not just token usage
Define a small, actionable metrics suite:
- Time-to-RCA: time from test failure to proposed root cause explanation.
- Top-1/Top-5 localization: whether the actual faulty file/hunk is in the top K candidates.
- MRR (mean reciprocal rank): for faulty symbol position.
- Patch acceptance rate: percentage of proposed patches that merge without revert.
- Revert rate: percentage of merged patches reverted within 7/30 days.
- Test delta quality: coverage increase for modified files; mutation kill rate on added tests.
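Top-K localization is a few lines to compute; this sketch evaluates it over a corpus represented as (ranked candidate symbols, ground-truth faulty symbol) pairs, a shape assumed for illustration:

```python
from typing import List, Tuple

def top_k_hit_rate(results: List[Tuple[List[str], str]], k: int) -> float:
    """Fraction of bugs whose true faulty symbol appears in the top K candidates."""
    hits = sum(1 for ranked, truth in results if truth in ranked[:k])
    return hits / len(results) if results else 0.0

results = [
    (["time_util.rollover", "calendar.next_day"], "time_util.rollover"),  # rank 1
    (["calendar.next_day", "time_util.rollover"], "time_util.rollover"),  # rank 2
    (["calendar.next_day", "io.read"], "time_util.rollover"),             # missed
]
print(top_k_hit_rate(results, k=1))  # → 0.3333333333333333
print(top_k_hit_rate(results, k=2))  # → 0.6666666666666666
```

Tracking Top-1 and Top-5 together tells you whether ranking-weight tweaks are moving the true fault into view or merely reshuffling the tail.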
Build an offline harness:
- Seed bugs with mutation tools (mutmut for Python, Stryker for JS/TS); generate failing tests.
- Replay real-world bug datasets: Defects4J (Java), BugsInPy (Python), ManyBugs (C/C++), Bears (Java).
- Run your pipeline end-to-end and record metrics. Tune retrieval weights, not just model prompts.
Minimal evaluator sketch:
```python
from collections import defaultdict

def mrr(ranked, ground_truth):
    for i, item in enumerate(ranked, 1):
        if item == ground_truth:
            return 1.0 / i
    return 0.0

results = defaultdict(list)
for bug in corpus:
    ranked = repo_rag.rank_candidates(bug)
    results['mrr'].append(mrr([c.symbol for c in ranked], bug.faulty_symbol))
print('MRR', sum(results['mrr']) / len(results['mrr']))
```
Common pitfalls (and how to avoid them)
- Overstuffed context: Flooding the model with entire files degrades reasoning. Slice aggressively.
- Inconsistent artifacts: Missing JUnit or trace links lead to empty retrieval. Fail fast with clear diagnostics and fallbacks.
- Flaky tests mislead SBFL: Incorporate flakiness scores; weigh consistently failing tests higher.
- Rename churn breaks blame: Enable Git rename detection; map symbols across renames.
- Hallucinated patches: Enforce schema; validate that the patch applies and compiles before asking CI to run the full suite.
Extensions and roadmap
- Multi-repo/microservices: Correlate traces to repos by service name; fetch versions via SBOM or deployment manifests.
- Static analysis integration: Incorporate CodeQL alerts and dataflow slices into retrieval and ranking.
- IDE loop: Provide a local mode that runs on the developer machine with cached indices and on-demand traces.
- Feedback learning: Use accepted patches as supervised examples to fine-tune ranking weights (not necessarily the model).
Tools and libraries that help
- Parsing and code maps: tree-sitter, universal-ctags, language servers (pyright, clangd, jdtls)
- Searching: ripgrep for lexical, sqlite/duckdb for tabular, a lightweight vector DB for embeddings
- Testing and coverage: pytest + coverage.py, JUnit/JaCoCo, Jest/Istanbul
- Tracing/logging: OpenTelemetry SDKs, Jaeger/Zipkin, structured logging with request/test IDs
- Git and history: pygit2, gitpython, SZZ implementations (e.g., SZZ Unleashed, PySZZ)
- CI: GitHub Actions, GitLab CI, Buildkite; artifact storage native to your CI
References (selected)
- Abreu et al., “An Evaluation of Similarity Coefficients for Software Fault Localization,” 2009. Ochiai/Tarantula foundations.
- Dallmeier et al., “Lightweight Bug Localization with Snapshots of Program State,” 2005. Early dynamic approaches.
- Mockus and Votta, “Identifying Reasons for Software Changes using Historic Databases,” 2000. Churn as a defect predictor.
- Defects4J: https://github.com/rjust/defects4j
- BugsInPy: https://github.com/soarsmu/BugsInPy
- ManyBugs: http://repairbenchmarks.cs.umass.edu/ManyBugs/
- OpenTelemetry: https://opentelemetry.io/
Conclusion
Repo-RAG reframes debugging with AI from “ask a model to read the whole repo” to “give the model the right evidence.” By building a repo-aware pipeline that diligently collects tests, traces, logs, coverage, and diffs—and by ranking hypotheses with established fault-localization signals—you enable precise, verifiable fixes that fit naturally into your CI.
Start small: normalize test artifacts, build a symbol map, wire in SBFL and blame-based suspects, and present the top 10 hunks to your model with a strict output schema. Measure localization and patch acceptance, tighten your retrieval, and only then worry about fancier models. The result is an assistant that earns trust by shipping fixes—not by writing essays.
