Repo-RAG: Feeding Your Code Debugging AI with Tests, Traces, and Diffs for Root-Cause Analysis
Modern software is awash in evidence: failing tests, flaky CI logs, OpenTelemetry traces, code coverage heatmaps, and Git diffs that tell the story of how the system got here. The problem isn’t lack of data—it’s that your debugging loop can’t ingest and reason over all of it coherently and quickly.
Repo-RAG (Repository Retrieval-Augmented Generation) is a pragmatic pattern to make a code debugging AI actually useful: not by making the model bigger, but by making its context smarter. You build a repo-aware retrieval pipeline that can fetch the right tests, traces, logs, coverage, and Git history; localize the fault; rank hypotheses; and propose verifiable fixes. Done right, it’s fast, private, and CI-friendly.
This article lays out the blueprint: data schemas, indexing strategies, retrieval operators, ranking math, prompting patterns, CI integration, evaluation metrics, and a step-by-step case study. The goal is to move from generic chat-with-your-code to precise, reproducible root-cause analysis (RCA) and verifiable patches.
TL;DR
- Repo-RAG makes LLMs effective at debugging by grounding them in test artifacts, traces, logs, coverage, and diffs.
- Combine code search, spectrum-based fault localization (SBFL), Git churn, and trace slicing to rank root-cause hypotheses.
- Feed only the minimal, high-signal context to the model; ask it to propose both a patch and a test; verify in CI.
- Keep it private and deterministic with local indices, deterministic retrieval, and reproducible builds.
Why Repo-RAG (and why now)
RAG improved question answering for documents. But source code is different:
- The relevant context isn’t just “nearby text.” It’s dynamic behavior (tests, logs, traces), historical edits (diffs, blame), and explicit runtime failures.
- Success isn’t a persuasive paragraph. It’s a patch that compiles, passes tests, and avoids regressions.
- Ground truth and verifiability exist: tests and reproducible builds.
Repo-RAG treats your repository as a living knowledge graph: code artifacts connected to runtime evidence and history. It gives a debugging model the minimum evidence needed to reason effectively, not the maximum tokens it can swallow.
The architecture at a glance
Think of Repo-RAG as a modular pipeline with four phases:
- Collect: Ingest artifacts from CI and dev runs: test results, logs, traces, coverage, build metadata, and Git history.
- Index: Build code maps, symbol graphs, failure-indexed logs/traces, and diff-based history indices. Store with fast retrieval keys.
- Retrieve: For a given failure, deterministically fetch high-signal slices: failing tests, relevant stack frames, touched files, suspect diffs, coverage deltas, and historical flakiness.
- Reason and act: Let the model (or a small agent) rank hypotheses, propose a patch and a test, and orchestrate a validation run.
Each phase is pluggable and can run offline/on-prem for privacy.
Data you must collect (and the minimum useful schema)
You don’t need a data lake. You need evidence with stable keys and timestamps. Start with this minimal schema and expand as needed.
- Test artifacts (per CI run):
- identity: commit SHA, branch, CI run ID, job ID
- results: test name, outcome, duration, error type, stack trace, stdout/stderr
- coverage: per-file and per-line hit counts
- Logs:
- timestamped lines grouped by test or request ID; severity; source component; optional JSON fields
- Traces (OpenTelemetry/Jaeger):
- trace ID, span IDs, name, attributes, status, events, links to logs and code locations
- Git history:
- commits, parents, diffs per file, rename/copy detection, author time, message, change hunks
- blame per line; churn metrics; tags for revert/reintroduce patterns
- Build metadata:
- environment (OS, runtime version), compiler flags, feature flags, container image digest
A portable JSON representation you can store in S3, GCS, artifact store, or a local directory works fine. Example:
```json
{
  "run": {
    "sha": "a1b2c3...",
    "branch": "feature/xyz",
    "ci_run_id": "12345",
    "started_at": "2025-11-10T12:00:00Z",
    "env": {"python": "3.11.6", "os": "ubuntu-22.04"}
  },
  "tests": [
    {
      "name": "tests/test_tz.py::test_midnight_rollover",
      "status": "failed",
      "duration_ms": 182,
      "error": {
        "type": "AssertionError",
        "message": "expected 2025-01-01, got 2024-12-31",
        "stack": [
          {"file": "app/time_util.py", "line": 84, "fn": "rollover"},
          {"file": "tests/test_tz.py", "line": 42, "fn": "test_midnight_rollover"}
        ]
      },
      "stdout": "...",
      "stderr": "...",
      "trace_id": "07a3c..."
    }
  ],
  "coverage": {
    "files": {
      "app/time_util.py": {
        "lines": {"82": 1, "83": 1, "84": 1, "85": 0}
      }
    }
  }
}
```
Indexing your repository for debugging
Indexing for Repo-RAG is not just vectorizing files. You need deterministic, structured indices:
- Code map index:
- A symbol graph (functions, classes, modules) with locations and cross-references. Use tree-sitter, ctags, or language servers (LSP) to build it.
- Derived structures: call graph (static approximation), file-module mapping, test-to-code mapping via imports and coverage.
- Test index:
- Key by test name; include last N outcomes, failure frequency, min/max durations, flaky score, and links to coverage and traces.
- Trace/log index:
- Key by test or trace ID; store top-K spans on error path; keep inverted index by error messages and span attributes.
- Git history index:
- For each file: commits touching it, churn, time decay, rename history, revert pairs.
- Precompute SZZ suspect commits: link failing tests to candidate culprit commits using failure introduction points.
- Embeddings index (optional but useful):
- Chunk by symbol, not arbitrary tokens; embed docstrings, signatures, and key code lines. This supports semantic matching for novel failures.
Store metadata in a lightweight relational database (SQLite, DuckDB, or Postgres), keep per-run history in the same store (SQLite/DuckDB often suffice), and add a search backend (ripgrep for lexical, a vector store for embeddings). Keep it simple; optimize later.
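As a concrete starting point, the test index described above can be a single SQLite table. This is a minimal sketch, not prescribed tooling: the table layout and the flip-counting flaky score are illustrative choices you would tune for your repo.

```python
import sqlite3

def init_test_index(conn: sqlite3.Connection) -> None:
    # One row per (test, CI run); the index key is the test name.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS test_runs (
            test_name   TEXT NOT NULL,
            sha         TEXT NOT NULL,
            status      TEXT NOT NULL,    -- 'passed' | 'failed'
            duration_ms INTEGER,
            PRIMARY KEY (test_name, sha)
        )
    """)

def record(conn, test_name, sha, status, duration_ms):
    conn.execute(
        "INSERT OR REPLACE INTO test_runs VALUES (?, ?, ?, ?)",
        (test_name, sha, status, duration_ms),
    )

def flaky_score(conn, test_name: str, last_n: int = 20) -> float:
    # Fraction of pass/fail flips across the last N runs:
    # 0.0 = stable, 1.0 = alternating every run.
    rows = conn.execute(
        "SELECT status FROM test_runs WHERE test_name = ? ORDER BY rowid DESC LIMIT ?",
        (test_name, last_n),
    ).fetchall()
    statuses = [r[0] for r in rows]
    if len(statuses) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(statuses, statuses[1:]) if a != b)
    return flips / (len(statuses) - 1)
```

The flaky score computed here feeds directly into the SBFL weighting discussed later: a test that alternates every run scores 1.0 and should contribute little to fault localization.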
The retrieval operators that matter
Design retrieval as composable operators with clear inputs/outputs. For a failure F at commit C:
- select_failing_tests(F):
- Return failing test objects with stack frames and error kinds.
- slice_trace(F):
- Fetch traces/logs for F; return the shortest failing path and last N spans with status=ERROR.
- map_frames_to_symbols(frames):
- Use code map to resolve frames to functions/classes; return neighborhoods (file +/- 30 lines, callers/callees).
- suspects_from_git(C, files):
- Use SZZ and churn to rank suspect commits and hunks affecting those files or symbols.
- coverage_delta(F):
- Compare coverage of failing vs passing runs; highlight newly executed lines near failures.
- semmatch(F):
- Optional: embed error message + frames; retrieve semantically similar past failures and fixes.
These operators keep the retrieval deterministic and explainable. You can log their outputs to audit why the model saw what it saw.
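To make the operator contract concrete, here is one way map_frames_to_symbols could look. The Symbol and Frame shapes are assumptions matching the schema earlier in the article, and the lookup is a linear scan for clarity (a real index would query the symbol graph):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Symbol:
    name: str
    file: str
    start_line: int
    end_line: int

@dataclass
class Frame:
    file: str
    line: int
    fn: str

def map_frames_to_symbols(frames: List[Frame], symbols: List[Symbol],
                          radius: int = 30) -> List[dict]:
    """Resolve stack frames to enclosing symbols and return code neighborhoods."""
    out = []
    for frame in frames:
        hit: Optional[Symbol] = next(
            (s for s in symbols
             if s.file == frame.file and s.start_line <= frame.line <= s.end_line),
            None,
        )
        if hit is not None:
            out.append({
                "symbol": hit.name,
                "file": hit.file,
                # Neighborhood window: +/- radius lines around the frame.
                "window": (max(1, frame.line - radius), frame.line + radius),
            })
    return out

symbols = [Symbol("app.time_util.rollover", "app/time_util.py", 76, 94)]
frames = [Frame("app/time_util.py", 84, "rollover")]
print(map_frames_to_symbols(frames, symbols))
# → [{'symbol': 'app.time_util.rollover', 'file': 'app/time_util.py', 'window': (54, 114)}]
```

Because the operator returns plain data (symbol name, file, window), its output can be logged alongside the other operators' outputs for the audit trail mentioned above.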
Ranking hypotheses: blend signals, not vibes
Don’t ask the model to guess blindly. Provide it with scored hypotheses derived from classic software engineering research plus repo signals. A simple, effective triad:
- Spectrum-Based Fault Localization (SBFL) via Ochiai or Tarantula.
- Git priors: churn, recency, revert history.
- Trace depth and error proximity.
Compute a score for each candidate line, hunk, or symbol. Example Ochiai for a statement s:
- ef: number of failing tests that execute s
- nf: number of failing tests that do not execute s
- ep: number of passing tests that execute s
- np: number of passing tests that do not execute s
- Ochiai(s) = ef / sqrt((ef + nf) * (ef + ep))
Python snippet:
```python
import math
from dataclasses import dataclass

@dataclass
class Spectrum:
    ef: int
    nf: int
    ep: int
    np: int

def ochiai(spec: Spectrum) -> float:
    denom = math.sqrt((spec.ef + spec.nf) * (spec.ef + spec.ep))
    return (spec.ef / denom) if denom else 0.0

# Combine with Git churn and trace proximity
@dataclass
class Features:
    ochiai: float
    churn: float         # normalized per file
    recency_days: float  # more recent ==> higher weight
    trace_depth: int     # closer to error ==> higher weight

def score(features: Features) -> float:
    # weights tuned empirically; keep monotonicity
    w_ochiai = 0.6
    w_churn = 0.2
    w_recency = 0.1
    w_trace = 0.1
    return (
        w_ochiai * features.ochiai
        + w_churn * features.churn
        + w_trace * (1.0 / (1 + features.trace_depth))
        + w_recency * (1.0 / (1 + features.recency_days))
    )
```
Present the model with the top K hunks/symbols, their scores, and the minimal surrounding context.
Prompting the debugging model: be precise and verifiable
A good Repo-RAG prompt has these features:
- Task framing: localize fault; propose minimal patch; propose a test that fails before and passes after.
- Constraints: don’t touch unrelated files; respect style; include rationale tied to observed evidence.
- Context: failing tests, narrowed code slices, suspect hunks, relevant diffs, trace excerpt.
- Output schema: patch as a unified diff, plus a test patch.
Template:
You are a software debugging assistant. Use the provided evidence to localize the bug and propose a minimal fix with a verifiable test.
Evidence:
- Commit: {sha}
- Failing tests: {test_names}
- Error: {error_type}: {error_message}
- Stack frames (top 5):
{frames}
- Trace excerpt (last spans until error):
{trace_excerpt}
- Candidate hunks (ranked):
{hunks_with_scores}
- Coverage delta: lines executed only by failing tests: {lines}
- Relevant recent diffs:
{diff_snippets}
Requirements:
1) Explain the root cause in 2-4 sentences, citing specific lines/hunks.
2) Produce a minimal patch as a unified diff.
3) Produce or modify a test so that it fails before your patch and passes after.
4) Do not change public APIs unless strictly necessary.
5) Keep edits within these files only: {allowed_files}.
Output JSON strictly matching this schema:
{
"explanation": "...",
"patch_diff": "--- a/...\n+++ b/...\n...",
"test_diff": "--- a/...\n+++ b/...\n..."
}
Keep the prompt compact: summarize logs and traces to the shortest path-to-error; include only the top 1-3 hunks per file; limit code to +/- 30 lines around candidates.
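Since the model is asked for strict JSON, validate the response before letting anything touch the repo. A minimal sketch, assuming the three-key schema from the template above (the structural diff check is a deliberately cheap heuristic, not a full diff parser):

```python
import json

REQUIRED_KEYS = {"explanation", "patch_diff", "test_diff"}

def parse_proposal(raw: str) -> dict:
    """Parse and sanity-check model output against the expected schema."""
    proposal = json.loads(raw)
    missing = REQUIRED_KEYS - proposal.keys()
    if missing:
        raise ValueError(f"model output missing keys: {sorted(missing)}")
    for key in ("patch_diff", "test_diff"):
        # Cheap structural check: a unified diff must carry file headers.
        if "--- a/" not in proposal[key] or "+++ b/" not in proposal[key]:
            raise ValueError(f"{key} is not a unified diff")
    return proposal

raw = json.dumps({
    "explanation": "Naive date arithmetic ignores the timezone.",
    "patch_diff": "--- a/app/time_util.py\n+++ b/app/time_util.py\n...",
    "test_diff": "--- a/tests/test_tz.py\n+++ b/tests/test_tz.py\n...",
})
print(parse_proposal(raw)["explanation"])
```

Rejecting malformed output here, before any `git apply`, is the first line of defense against the hallucinated-patch pitfall discussed later.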
Building the collector and indexer
Start with a portable, language-agnostic setup. You can add language-specific enrichments later.
Shell snippets for collection in CI:
```bash
# 1) Test artifacts (pytest example)
pytest -q --maxfail=1 --disable-warnings \
  --junitxml=artifacts/junit.xml \
  --cov=app --cov-report=xml:artifacts/coverage.xml \
  | tee artifacts/test.log

# 2) OpenTelemetry traces (assuming OTEL exporter is set)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Your app emits spans during tests; collector writes to artifacts/traces.jsonl

# 3) Git metadata
git rev-parse HEAD > artifacts/sha.txt
git show -s --format='%H %ct %an %s' > artifacts/commit.txt

# 4) Normalize artifacts
python tools/collect_ci_artifacts.py \
  --junit artifacts/junit.xml \
  --coverage artifacts/coverage.xml \
  --logs artifacts/test.log \
  --traces artifacts/traces.jsonl \
  --out artifacts/run.json
```
Simplified Python for JUnit and coverage normalization:
```python
# tools/collect_ci_artifacts.py
import argparse
import json
from xml.etree import ElementTree as ET

def parse_junit(path):
    root = ET.parse(path).getroot()
    tests = []
    for case in root.iter('testcase'):
        name = f"{case.get('classname')}::{case.get('name')}"
        # Note: use an explicit None check; Element truthiness is unreliable
        # (an element with no children is falsy), so `find(...) or find(...)` misbehaves.
        failure = case.find('failure')
        if failure is None:
            failure = case.find('error')
        status = 'failed' if failure is not None else 'passed'
        tests.append({
            'name': name,
            'status': status,
            'error': None if status == 'passed' else {
                'type': failure.get('type'),
                'message': failure.get('message'),
                'stack': failure.text.splitlines()[:50] if failure.text else []
            }
        })
    return tests

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('--junit')
    ap.add_argument('--coverage')
    ap.add_argument('--logs')
    ap.add_argument('--traces')
    ap.add_argument('--out')
    args = ap.parse_args()
    run = {'tests': parse_junit(args.junit)}
    # parse coverage/logs/traces similarly; elided for brevity
    with open(args.out, 'w') as f:
        json.dump(run, f)

if __name__ == '__main__':
    main()
```
Indexing code and Git with tree-sitter and ripgrep:
```bash
# Build a symbol map (example using universal-ctags)
ctags -R --fields=+n --languages=Python,JavaScript --extras=+q -f artifacts/tags .

# Lexical search baseline
rg -n "\brollover\b" app/ > artifacts/search.txt

# Git history JSON (per-file churn)
git log --numstat --date=iso --pretty=format:'commit:%H%nparent:%P%nauthor:%an%ndate:%ad%n' \
  | python tools/log_to_json.py > artifacts/git.jsonl
```
You can move to a more robust graph store later (e.g., sqlite with tables for symbols, refs, tests, traces, commits, hunks). Start with reproducibility and inspectability.
Suspect commits with SZZ and diffs
The SZZ algorithm identifies bug-introducing commits by tracing from a bug-fixing commit to the lines it modifies, then blaming those lines back to the earlier commits that last touched them. For online debugging, you can use a “forward SZZ” heuristic: given failing hunks and frames, blame the lines to find the last commits touching them and rank by recency and churn.
Example shell:
```bash
# For each candidate file
file=app/time_util.py
awk 'NR>=70 && NR<=110 {print NR":"$0}' "$file" > /tmp/slice.txt

# git blame for the slice; in --line-porcelain output each group starts with
# a 40-character commit SHA, not the word "commit"
git blame -L 70,110 --line-porcelain "$file" \
  | grep -E '^[0-9a-f]{40} |^author-time ' > /tmp/blame.txt
```
Use this data to compute a suspect prior per hunk. Combine with SBFL and trace depth for the final ranking.
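Turning the blame slice into a prior can be done with a few lines of parsing. A sketch under stated assumptions: it reads only the SHA header and `author-time` lines of `git blame --line-porcelain` output, and the recency decay (lines blamed × 1/(1+age in days)) is an illustrative weighting, not a calibrated one:

```python
import re

SHA_RE = re.compile(r"^[0-9a-f]{40} ")

def suspect_priors(porcelain: str, now: float) -> dict:
    """Turn `git blame --line-porcelain` output into a recency-decayed prior per commit."""
    counts, last_author_time = {}, {}
    sha = None
    for line in porcelain.splitlines():
        if SHA_RE.match(line):
            sha = line.split()[0]
            counts[sha] = counts.get(sha, 0) + 1
        elif line.startswith("author-time ") and sha is not None:
            last_author_time[sha] = int(line.split()[1])
    priors = {}
    for s, n in counts.items():
        age_days = max(0.0, (now - last_author_time.get(s, now)) / 86400)
        # More blamed lines and more recent edits => higher prior.
        priors[s] = n * (1.0 / (1.0 + age_days))
    return priors

# Fabricated porcelain fragment: one blamed line, authored one day before `now`.
porcelain = "\n".join([
    "a" * 40 + " 84 84 1",
    "author-time 1700000000",
    "\treturn (dt + timedelta(seconds=seconds)).date()",
])
print(suspect_priors(porcelain, now=1700000000 + 86400))  # one commit, prior 0.5
```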
Putting it together: the Repo-RAG flow
Pseudocode orchestrator:
```python
class RepoRAG:
    def __init__(self, indexes):
        self.idx = indexes

    def analyze_failure(self, sha: str, run_path: str) -> dict:
        run = load_run(run_path)
        failing = [t for t in run['tests'] if t['status'] == 'failed']
        frames = top_frames(failing)
        trace = slice_traces(run, failing)
        cand_symbols = map_frames_to_symbols(frames, self.idx.symbols)
        cov = coverage_delta(run, self.idx.recent_pass)
        hunks = expand_symbols_to_hunks(cand_symbols, radius=30)
        scores = rank_hunks(hunks, cov, self.idx.git, trace)
        context = build_context(failing, frames, trace, scores.topk(10))
        prompt = render_prompt(context)
        model_out = call_model(prompt)
        return model_out
```
A small agent can then apply the patch, run tests, and iterate if necessary. Keep the loop bounded to avoid CI sprawl.
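The bounded loop itself can be expressed independently of the patch and test tooling. In this sketch the callables (propose, apply_patch, run_tests, revert — all hypothetical names) are injected, which makes the policy explicit and testable: try, verify, roll back, give up after N rounds:

```python
from typing import Callable, List, Optional

def bounded_repair(propose: Callable[[Optional[str]], str],
                   apply_patch: Callable[[str], bool],
                   run_tests: Callable[[], str],   # "" on success, else failure log
                   revert: Callable[[str], None],
                   max_rounds: int = 2) -> Optional[str]:
    """Run at most max_rounds propose->apply->test cycles; return the accepted diff or None."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        diff = propose(feedback)        # re-prompt the model with the last failure output
        if not apply_patch(diff):       # e.g., `git apply --check` failed
            feedback = "patch did not apply"
            continue
        failure = run_tests()
        if failure == "":
            return diff                 # verified fix
        revert(diff)                    # roll back before the next attempt
        feedback = failure
    return None

# Toy run: the first proposal fails tests, the second passes.
attempts = iter(["diff-v1", "diff-v2"])
applied: List[str] = []
result = bounded_repair(
    propose=lambda fb: next(attempts),
    apply_patch=lambda d: (applied.append(d) or True),
    run_tests=lambda: "" if applied[-1] == "diff-v2" else "1 failed",
    revert=lambda d: applied.remove(d),
)
print(result)  # → diff-v2
```

In CI the callables would wrap `git apply --check`/`git apply` and a targeted `pytest -k` run; keeping max_rounds small is what prevents the CI sprawl mentioned above.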
CI integration: GitHub Actions example
Make it easy to adopt by binding to PRs and failing tests. Minimal workflow:
```yaml
name: repo-rag-debug
on:
  pull_request:
    types: [opened, synchronize, reopened]
  workflow_dispatch: {}

jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install deps
        run: |
          pip install -r requirements-dev.txt
          pip install pytest pytest-cov opentelemetry-sdk
      - name: Run tests and collect artifacts
        run: |
          mkdir -p artifacts
          pytest -q --junitxml=artifacts/junit.xml \
            --cov=. --cov-report=xml:artifacts/coverage.xml | tee artifacts/test.log
      - name: Build indexes
        run: |
          sudo apt-get update && sudo apt-get install -y universal-ctags ripgrep
          ctags -R -f artifacts/tags .
          python tools/collect_ci_artifacts.py --junit artifacts/junit.xml \
            --coverage artifacts/coverage.xml --logs artifacts/test.log \
            --out artifacts/run.json
      - name: Repo-RAG debug
        env:
          MODEL_ENDPOINT: ${{ secrets.MODEL_ENDPOINT }}
          MODEL_TOKEN: ${{ secrets.MODEL_TOKEN }}
        run: |
          python tools/repo_rag_debug.py \
            --sha $(git rev-parse HEAD) \
            --run artifacts/run.json \
            --out artifacts/proposal.json
      - name: Post patch suggestion as PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const p = JSON.parse(fs.readFileSync('artifacts/proposal.json', 'utf8'));
            const body = `RCA: ${p.explanation}\n\nPatch:\n\n\`\`\`diff\n${p.patch_diff}\n\`\`\`\n\nTest:\n\n\`\`\`diff\n${p.test_diff}\n\`\`\``;
            github.rest.issues.createComment({ ...context.repo, issue_number: context.issue.number, body });
```
Keep the model call behind a secret endpoint. For privacy, run a local model (e.g., Llama, DeepSeek-Coder) on a self-hosted runner.
Case study: timezone rollover bug
Scenario: A Python service computes the next date at midnight local time. A test fails around New Year’s.
Artifacts:
- Failing test: tests/test_tz.py::test_midnight_rollover
- Error: AssertionError: expected 2025-01-01, got 2024-12-31
- Stack: app/time_util.py:84 in rollover
- Coverage: line 85 is unexecuted in passing runs but executed in the failing test
- Git: last week, commit abc123 changed time_util.py lines 80-90 to handle DST
Retrieval results:
- Frames mapped to symbol: function app.time_util.rollover at lines 76-94
- Hunks: lines 82-88 recently edited; blame points to abc123
- Trace slice: span compute_next_date -> rollover -> error raised; attributes tz='America/Los_Angeles'
- SBFL: ef=1, ep=120 (many passing tests don’t touch this code); Ochiai points to line 84 where naive date arithmetic crosses midnight without timezone awareness
Model prompt includes:
- The 30-line window around lines 76-94
- The recent diff for lines 80-90
- The error and trace snippet with tz attribute
- Coverage delta showing the failing-only path
- Allowed files: app/time_util.py, tests/test_tz.py
Proposed patch (illustrative):
```diff
--- a/app/time_util.py
+++ b/app/time_util.py
@@ def rollover(dt: datetime, tz: tzinfo) -> date:
-    # naive: add seconds until midnight in naive local time
-    seconds = 24*3600 - (dt.hour*3600 + dt.minute*60 + dt.second)
-    return (dt + timedelta(seconds=seconds)).date()
+    # Correct: convert to tz-aware, normalize at local midnight, then add one day if needed
+    if dt.tzinfo is None:
+        dt = dt.replace(tzinfo=tz)
+    local = dt.astimezone(tz)
+    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
+    if local.time() == time(0, 0):
+        return local.date()
+    return (midnight + timedelta(days=1)).date()
```
Test patch:
```diff
--- a/tests/test_tz.py
+++ b/tests/test_tz.py
@@ def test_midnight_rollover():
-    dt = datetime(2024, 12, 31, 23, 59, 59)
-    assert rollover(dt, ZoneInfo('America/Los_Angeles')) == date(2025, 1, 1)
+    la = ZoneInfo('America/Los_Angeles')
+    dt = datetime(2024, 12, 31, 23, 59, 59, tzinfo=la)
+    assert rollover(dt, la) == date(2025, 1, 1)
+    # Midnight edge case should return same local date
+    dt2 = datetime(2025, 1, 1, 0, 0, 0, tzinfo=la)
+    assert rollover(dt2, la) == date(2025, 1, 1)
```
Verification:
- Before patch: the added midnight test fails
- After patch: both tests pass
This is a toy example, but the pattern generalizes: timezone and DST bugs are classic cases where traces (attributes), diffs (recent DST handling), and SBFL (lines executed by failing-only tests) converge.
Privacy and security: keep it in your repo bubble
- Run the collector and indexer inside CI; store artifacts in your existing artifact store.
- Prefer local or VPC-hosted models; if you must use a cloud model, strip PII and redact secrets from logs.
- Enforce least-privilege: the debugging job needs read-only to code and artifacts; write access only to post comments or push a patch branch.
- Scan patches with secret scanners and static analyzers (e.g., Trivy, Gitleaks, CodeQL) before proposing merges.
Performance and reliability tips
- Cache indices by commit SHA; use incremental updates per diff rather than full re-index.
- Shard large repos by language or module; build per-package symbol maps.
- Bound retrieval: top 3 failing tests, top 10 hunks, code windows under 3k lines total.
- Summarize logs and traces: keep only path-to-error spans, de-duplicate repeated stack frames.
- Deterministic retrieval order: sort by score, then file path, then line number for reproducibility.
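The deterministic ordering in the last tip is one line of code but worth pinning down, because floating-point score ties are common in practice. Sorting by (-score, path, line) makes retrieval output byte-stable across runs (the Candidate shape here is illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    path: str
    line: int
    score: float

def deterministic_order(cands: List[Candidate]) -> List[Candidate]:
    # Descending score; ties broken by file path, then line number.
    return sorted(cands, key=lambda c: (-c.score, c.path, c.line))

cands = [
    Candidate("app/time_util.py", 84, 0.91),
    Candidate("app/calendar.py", 12, 0.91),   # same score: path breaks the tie
    Candidate("app/time_util.py", 40, 0.55),
]
print([(c.path, c.line) for c in deterministic_order(cands)])
# → [('app/calendar.py', 12), ('app/time_util.py', 84), ('app/time_util.py', 40)]
```

Python's `sorted` is stable, but relying on input order for ties would still make the context sensitive to index iteration order, which is exactly the nondeterminism the tie-breaker removes.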
Evaluation: measure RCA, not just token usage
Define a small, actionable metrics suite:
- Time-to-RCA: time from test failure to proposed root cause explanation.
- Top-1/Top-5 localization: whether the actual faulty file/hunk is in the top K candidates.
- MRR (mean reciprocal rank): for faulty symbol position.
- Patch acceptance rate: percentage of proposed patches that merge without revert.
- Revert rate: percentage of merged patches reverted within 7/30 days.
- Test delta quality: coverage increase for modified files; mutation kill rate on added tests.
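Top-K localization is a few lines to compute; this sketch evaluates it over a corpus represented as (ranked candidate symbols, ground-truth faulty symbol) pairs, a shape assumed for illustration:

```python
from typing import List, Tuple

def top_k_hit_rate(results: List[Tuple[List[str], str]], k: int) -> float:
    """Fraction of bugs whose true faulty symbol appears in the top K candidates."""
    hits = sum(1 for ranked, truth in results if truth in ranked[:k])
    return hits / len(results) if results else 0.0

results = [
    (["time_util.rollover", "calendar.next_day"], "time_util.rollover"),  # rank 1
    (["calendar.next_day", "time_util.rollover"], "time_util.rollover"),  # rank 2
    (["calendar.next_day", "io.read"], "time_util.rollover"),             # missed
]
print(top_k_hit_rate(results, k=1))  # → 0.3333333333333333
print(top_k_hit_rate(results, k=2))  # → 0.6666666666666666
```

Tracking Top-1 and Top-5 together tells you whether ranking-weight tweaks are moving the true fault into view or merely reshuffling the tail.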
Build an offline harness:
- Seed bugs with mutation tools (mutmut for Python, Stryker for JS/TS); generate failing tests.
- Replay real-world bug datasets: Defects4J (Java), BugsInPy (Python), ManyBugs (C/C++), Bears (Java).
- Run your pipeline end-to-end and record metrics. Tune retrieval weights, not just model prompts.
Minimal evaluator sketch:
```python
from collections import defaultdict

def mrr(ranked, ground_truth):
    for i, item in enumerate(ranked, 1):
        if item == ground_truth:
            return 1.0 / i
    return 0.0

results = defaultdict(list)
for bug in corpus:
    ranked = repo_rag.rank_candidates(bug)
    results['mrr'].append(mrr([c.symbol for c in ranked], bug.faulty_symbol))
print('MRR', sum(results['mrr']) / len(results['mrr']))
```
Common pitfalls (and how to avoid them)
- Overstuffed context: Flooding the model with entire files degrades reasoning. Slice aggressively.
- Inconsistent artifacts: Missing JUnit or trace links lead to empty retrieval. Fail fast with clear diagnostics and fallbacks.
- Flaky tests mislead SBFL: Incorporate flakiness scores; weigh consistently failing tests higher.
- Rename churn breaks blame: Enable Git rename detection; map symbols across renames.
- Hallucinated patches: Enforce schema; validate that the patch applies and compiles before asking CI to run the full suite.
Extensions and roadmap
- Multi-repo/microservices: Correlate traces to repos by service name; fetch versions via SBOM or deployment manifests.
- Static analysis integration: Incorporate CodeQL alerts and dataflow slices into retrieval and ranking.
- IDE loop: Provide a local mode that runs on the developer machine with cached indices and on-demand traces.
- Feedback learning: Use accepted patches as supervised examples to fine-tune ranking weights (not necessarily the model).
Tools and libraries that help
- Parsing and code maps: tree-sitter, universal-ctags, language servers (pyright, clangd, jdtls)
- Searching: ripgrep for lexical, sqlite/duckdb for tabular, a lightweight vector DB for embeddings
- Testing and coverage: pytest + coverage.py, JUnit/JaCoCo, Jest/Istanbul
- Tracing/logging: OpenTelemetry SDKs, Jaeger/Zipkin, structured logging with request/test IDs
- Git and history: pygit2, gitpython, SZZ implementations (e.g., SZZ Unleashed, PySZZ)
- CI: GitHub Actions, GitLab CI, Buildkite; artifact storage native to your CI
References (selected)
- Abreu et al., “An Evaluation of Similarity Coefficients for Software Fault Localization,” 2009. Ochiai/Tarantula foundations.
- Dallmeier et al., “Lightweight Bug Localization with Snapshots of Program State,” 2005. Early dynamic approaches.
- Mockus and Votta, “Identifying Reasons for Software Changes using Historic Databases,” 2000. Churn as a defect predictor.
- Defects4J: https://github.com/rjust/defects4j
- BugsInPy: https://github.com/soarsmu/BugsInPy
- ManyBugs: http://repairbenchmarks.cs.umass.edu/ManyBugs/
- OpenTelemetry: https://opentelemetry.io/
Conclusion
Repo-RAG reframes debugging with AI from “ask a model to read the whole repo” to “give the model the right evidence.” By building a repo-aware pipeline that diligently collects tests, traces, logs, coverage, and diffs—and by ranking hypotheses with established fault-localization signals—you enable precise, verifiable fixes that fit naturally into your CI.
Start small: normalize test artifacts, build a symbol map, wire in SBFL and blame-based suspects, and present the top 10 hunks to your model with a strict output schema. Measure localization and patch acceptance, tighten your retrieval, and only then worry about fancier models. The result is an assistant that earns trust by shipping fixes—not by writing essays.
