Stop Hallucinated Fixes: A RAG Blueprint for Code Debugging AI That Reads Your Build Graph
Generative models are useful for debugging, but they hallucinate when they guess across gaps in context. The fastest way to reduce noisy, incorrect patches is to make the model read what your build already knows: targets, symbol bindings, dependency edges, stack traces, ASTs, and logs. This article lays out an end-to-end retrieval-augmented generation (RAG) blueprint for a repo-and-build-aware code debugging assistant, with deep details on indexing, query planning, caching, and CI integration.
The design here is informed by what works in production developer tooling: precise context, narrow patches, and automated validation. The end result is not a black box “copilot for fixes,” but a grounded system that treats your build graph as a source of truth and uses it to reduce hallucinations.
Why code debugging RAG must be build-aware
A pure text RAG over a repo is a start, but large codebases are not simple document collections. They’re structured graphs:
- Files belong to build targets.
- Symbols resolve to definitions via language-specific rules.
- Stack traces point into frames in specific modules.
- Logs and test outputs are tied to targets, tests, and commits.
If your retrieval layer ignores these relationships, the model will:
- Suggest edits in the wrong file or the wrong layer of the dependency graph.
- Propose renames that break symbol resolution.
- Patch call sites without touching the definition where the bug actually resides.
- Overfit to noisy logs.
By contrast, a build-aware retrieval layer:
- Prioritizes files that are topologically close to the failing test or target.
- Resolves symbols through ASTs and indexers, so the model sees the true definition and references.
- Splices in stack frames and relevant log shards.
- Anchors proposals to the exact build target and test that fail in CI.
You reduce hallucinations not by nagging the model, but by feeding it authoritative evidence in a predictable, structured way.
High-level architecture
At a glance, the system has four major parts:
1) Ingestion and indexing
- Parse the repo to ASTs and symbols.
- Build and cache the build graph (Bazel, Buck, Pants, Gradle, Maven, CMake, etc.).
- Index stack traces, test outputs, compiler errors, and logs.
- Create both keyword and vector indices, plus a small knowledge base of build metadata.
2) Query planning and retrieval
- Normalize the incoming debugging question (e.g., failing CI job or a local stack trace).
- Determine query plan: which sources to consult in what order.
- Retrieve by build-target locality, symbol resolution, and semantic similarity.
- Construct a compact, provenanced context window.
3) Caching and incremental updates
- Cache by commit hash, target, test name, and symbol.
- Content-address your artifacts (AST shards, logs) to get great cache hits.
- Incrementally refresh indices on CI builds.
4) Patch generation and validation
- Generate a minimal patch constrained to retrieved files and symbols.
- Validate via quick compile/test runs, static checks, and guards.
- Post patches back to CI/PR with confidence and provenance.
Below we dig into each area with concrete choices and implementation notes.
Ingestion layer: what to index and how
A robust indexing pipeline is the foundation of build-aware RAG.
1) Repository content
- Use language-aware parsing:
- tree-sitter for broad language support.
- clang/LLVM tooling for C/C++ ASTs.
- Java: Eclipse JDT or javaparser.
- Python: libCST or parso for concrete syntax trees.
- TypeScript/JavaScript: TypeScript compiler API.
- Chunking for retrieval:
- Prefer symbol-centric chunks (function/class definitions, docstrings, top-level comments).
- Include leading import statements for context.
- Annotate each chunk with: file path, symbol name, language, build target(s), dependency in-degree/out-degree, and git commit hash.
- Embeddings:
- Use code-aware embeddings when possible (e.g., CodeBERT-like, text-embedding-3-large, or other high-quality code-text models).
- Normalize with language tag prompts to avoid cross-language drift.
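To make the annotation scheme concrete, here is a minimal sketch of a symbol-centric chunk record with a language-tagged embedding input. The field names and the embed_text call are illustrative assumptions, not a fixed schema.

```python
# Sketch: a symbol-centric chunk carrying the annotations described above.
from dataclasses import dataclass
from typing import List

@dataclass
class CodeChunk:
    file_path: str
    symbol: str
    language: str
    build_targets: List[str]
    in_degree: int
    out_degree: int
    commit: str
    text: str  # definition body plus leading imports and docstring

    def embedding_input(self) -> str:
        # Prefix a language tag so embeddings of different languages
        # don't drift toward each other for similar-looking code.
        return f"[lang={self.language}] [symbol={self.symbol}]\n{self.text}"

chunk = CodeChunk(
    file_path="src/server/feature_flags.py",
    symbol="parse_bool_flag",
    language="python",
    build_targets=["//server:feature_flags"],
    in_degree=3,
    out_degree=1,
    commit="abc123",
    text="def parse_bool_flag(value: str) -> bool: ...",
)
# vector = embed_text(chunk.embedding_input())  # embed_text: your embedding client
```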
2) Symbol index
- Build a symbol table per language with cross-references:
- Definitions: name, signature, start/end span, file, AST type.
- References: usage sites with call/attribute context.
- Canonicalize names (e.g., fully qualified names for Java, Python module paths, C++ namespaces).
- Store call graph edges when feasible; at least record intra-target call edges.
- Link symbols to build targets: which rules produce the artifact that includes this symbol.
3) Build graph
- Extract target graph:
- Bazel: use query, cquery, and aquery; export target edges and actions.
- Gradle: configure-on-demand and build scans; extract task dependency graph.
- Maven: effective POM and dependency tree.
- CMake: compile_commands.json and target dependency info.
- Persist:
- Nodes: target label, language, src files, outputs, test flag.
- Edges: depends_on.
- Metadata: compiler flags, environment, platform.
- Provide target locality metrics:
- Graph distance to failing test target.
- Shared source file ratio.
- Affected by last N commits.
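A minimal sketch of a uniform graph representation and the distance-based locality metric follows; the node fields and function names are assumptions rather than a prescribed schema.

```python
# Sketch: uniform build-graph schema plus graph distance to the failing target.
from collections import deque
from typing import Dict, Set

class BuildGraph:
    def __init__(self):
        self.deps: Dict[str, Set[str]] = {}   # target -> targets it depends on
        self.rdeps: Dict[str, Set[str]] = {}  # reverse edges
        self.srcs: Dict[str, Set[str]] = {}   # target -> source files

    def add_edge(self, target: str, dep: str) -> None:
        self.deps.setdefault(target, set()).add(dep)
        self.rdeps.setdefault(dep, set()).add(target)

    def distance(self, start: str, goal: str) -> int:
        # Undirected BFS over dependency edges: how many hops separate a
        # candidate target from the failing test target.
        if start == goal:
            return 0
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, d = queue.popleft()
            for nxt in self.deps.get(node, set()) | self.rdeps.get(node, set()):
                if nxt == goal:
                    return d + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
        return 10_000  # effectively "unrelated"

g = BuildGraph()
g.add_edge("//server:test_e2e", "//server:feature_flags")
assert g.distance("//server:feature_flags", "//server:test_e2e") == 1
```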
4) Stack traces and error events
- Parse stack traces with structured extractors:
- Python: parse traceback frames and exception types.
- Java: JVM stack frames with class/method/line.
- Node/TS: V8 frames; source maps for transpiled code.
- C/C++: symbolized stack from addr2line/LLDB.
- Annotate with:
- Frame file, function, line; exception message; test name; CI job id; commit.
- Link frames to symbols and targets via path+symbol mapping.
- Store as searchable documents with both keyword and vector fields.
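As a concrete example, a minimal Python traceback extractor might look like the sketch below. Real extractors need to handle chained exceptions, re-raised frames, and non-standard formats.

```python
# Sketch: structured extraction of Python traceback frames.
import re
from typing import Dict, List

FRAME_RE = re.compile(r'  File "(?P<file>.+)", line (?P<line>\d+), in (?P<func>.+)')

def parse_python_traceback(text: str) -> Dict:
    frames: List[Dict] = []
    exception = None
    for raw in text.splitlines():
        m = FRAME_RE.match(raw)
        if m:
            frames.append({
                "file": m.group("file"),
                "line": int(m.group("line")),
                "function": m.group("func"),
            })
        elif raw and not raw.startswith(" ") and ":" in raw:
            exception = raw  # e.g. "ValueError: invalid flag value 'TRUE'"
    return {"frames": frames, "exception": exception}

trace = '''Traceback (most recent call last):
  File "src/server/feature_flags.py", line 87, in parse_bool_flag
    raise ValueError(f"invalid flag value {value!r}")
ValueError: invalid flag value 'TRUE'
'''
parsed = parse_python_traceback(trace)
assert parsed["frames"][0]["function"] == "parse_bool_flag"
```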
5) Logs and test outputs
- Store test logs in a columnar store (e.g., Parquet) keyed by test target, test name, commit, shard, and timestamp.
- Tokenize into blocks with error signatures and context windows (e.g., last 100 lines before failure).
- Annotate with normalized error codes, regex’d common failures, and runtime metrics.
- Consider OpenTelemetry spans/traces for distributed systems tests; link service logs to repo modules.
6) Inverted and vector indices
- Maintain a hybrid retrieval stack:
- Inverted index (Elasticsearch/OpenSearch) for exact symbol/file/line queries and error strings.
- Vector index (FAISS, ScaNN, Milvus, pgvector) for semantic retrieval over code and messages.
- Join retrieval results by boosting:
- Build-target proximity.
- Symbol resolution confidence.
- Stack-frame depth and frequency across failing shards.
7) Provenance and schema
- Every chunk/document includes a provenance header:
- repository, commit
- file path, symbol name/signature
- build target label
- source type (AST, stack trace, test log, compile error)
- line ranges
- This enables strict grounding: suggested patches must reference retrieved provenance.
Query planning: from error to context to fix
A planful retrieval process improves both quality and cost. The core idea: turn the incoming error into an explicit series of lookups rooted in the build graph and symbols.
Inputs
- A failing CI job or a developer-provided signal, including any of:
- Stack trace (raw text).
- Test name/target and failing commit.
- Compiler error output.
- Log snippet.
- Developer query (“Why is test FooBar failing?”).
Planner goals
- Resolve the failure to one or more build targets.
- Find the likely buggy symbol(s) and files.
- Fetch minimal but sufficient context: definitions, callsites, relevant logs, recent diffs.
- Construct a compact, deterministic context pack for the model.
Planning steps
- Normalize and parse inputs.
- Extract error class, messages, paths, line numbers, function/class names.
- For Python, map ModuleNotFoundError to import path; for Java, map NoSuchMethodError to JAR and class; etc.
- Resolve symbols and files.
- Use symbol index and stack frames to find primary definitions.
- Map to build targets via file->target index.
- Rank candidate targets/files by locality and recency.
- Graph distance to failing test target.
- Recent commits touching the file/target.
- Frequency of error across shards (systemic vs flaky).
- Assemble evidence packs.
- Primary: symbol definitions + immediate callsites + AST snippet around failure lines (e.g., ±30 lines).
- Secondary: dependent/dependee symbols by call graph edges.
- Tertiary: log blocks that include the error signature.
- Metadata: compiler flags, environment, versions.
- De-duplicate and budget.
- Enforce token budget; prefer definitions and failing frame lines over broad files.
- Include only neighboring modules within a small radius unless the failure indicates cross-cutting impact.
- Formulate task-specific prompt and constraints.
- Explain the failure context in structured bullet points with provenance IDs.
- Constrain patch location(s) to retrieved files.
- Require that any new imports/symbols exist or are added in the patch.
Simple planner skeleton (Python pseudocode)
```python
from typing import Dict, List

class QueryPlanner:
    def __init__(self, indices, build_graph):
        self.idx = indices
        self.bg = build_graph

    def plan(self, inputs: Dict) -> Dict:
        # 1) Parse signals
        stack = parse_stacktrace(inputs.get("stacktrace", ""))
        test_target = inputs.get("test_target")
        commit = inputs.get("commit")
        compile_errors = parse_compile_errors(inputs.get("compile_log", ""))

        # 2) Resolve to symbols/files/targets
        frames = resolve_frames(stack, self.idx.symbols)
        files = {f.file for f in frames}
        targets = sorted({self.idx.file_to_target.get(f)
                          for f in files if f in self.idx.file_to_target})
        if test_target:
            targets.append(test_target)

        # 3) Rank candidates by locality
        ranked = rank_by_build_locality(targets, self.bg, test_target)

        # 4) Retrieve evidence
        evidence = []
        for t in ranked[:5]:
            evidence += self.idx.retrieve_symbols_for_target(t)
            evidence += self.idx.retrieve_logs_for_target(t, commit)
        evidence += retrieve_recent_diffs(commit, files)

        # 5) Pack with budgets
        packed = pack_evidence(evidence, token_budget=20_000)

        # 6) Constraints
        constraints = {
            'allowed_files': list(files),
            'must_compile': True,
            'max_changes_per_file': 1,
        }
        return {'evidence': packed, 'constraints': constraints}
```
This planner does not need to be a stochastic chain-of-thought system. It’s a deterministic, tool-using component that translates error signals into targeted retrieval and constraints.
Retrieval scoring that respects the build graph
Move beyond generic vector cosine similarity. A composite score can significantly boost precision:
Score(doc) = w1 * text_similarity + w2 * symbol_match + w3 * build_locality + w4 * stack_frame_overlap + w5 * recency_decay
- text_similarity: embedding cosine with the error text and test description.
- symbol_match: binary or graded match on definition/reference names; boost exact matches.
- build_locality: a function of graph distance to the failing target (shorter is better) and shared sources.
- stack_frame_overlap: whether doc spans include lines in failing frames.
- recency_decay: penalize stale docs based on age or last-change.
Tune weights offline against a gold set of past failures and fixes.
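Expressed as code, the composite score might look like the following sketch. The cosine and graph_distance helpers, the doc/query fields, and the weights are all illustrative placeholders to be calibrated offline.

```python
# Sketch: composite retrieval score combining text, symbol, build, and recency signals.
import math

WEIGHTS = {"text": 0.35, "symbol": 0.25, "locality": 0.20, "frames": 0.15, "recency": 0.05}

def score(doc, query) -> float:
    text_similarity = cosine(doc.embedding, query.embedding)             # assumed helper
    symbol_match = 1.0 if doc.symbol in query.symbols else 0.0
    build_locality = 1.0 / (1.0 + graph_distance(doc.target, query.failing_target))
    stack_frame_overlap = 1.0 if doc.spans_failing_frame else 0.0
    recency_decay = math.exp(-doc.age_days / 30.0)                       # 30-day decay constant
    return (WEIGHTS["text"] * text_similarity
            + WEIGHTS["symbol"] * symbol_match
            + WEIGHTS["locality"] * build_locality
            + WEIGHTS["frames"] * stack_frame_overlap
            + WEIGHTS["recency"] * recency_decay)
```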
Caching: CI-friendly and commit-scoped
RAG for debugging benefits massively from caching because inputs are commit-scoped and repetitive across CI shards.
- Content-address all artifacts:
- Use SHA256 of file content for AST and symbol records.
- Key logs by commit+test_target+test_name+shard.
- Layered caches:
- Local developer cache (~1–5 GB) for day-to-day tests.
- CI workspace cache shared across jobs (restore/save mechanisms in Actions/GitLab/Bazel remote cache).
- Remote object storage (S3/GCS/Azure Blob) for persistent artifacts.
- Invalidation:
- On commit, invalidate only targets impacted by changed files using the build graph.
- Keep soft TTLs for logs; evict oldest commits or commits past merge.
- Warmup:
- On PR open, pre-index changed targets and precompute embeddings.
- Prefetch dependent targets when tests begin.
A well-tuned cache turns evidence retrieval into a tens-of-milliseconds lookup in the common case.
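A minimal sketch of the keying scheme described above; the key layouts are illustrative assumptions, and the cache client in the usage comment is hypothetical.

```python
# Sketch: content-addressed keys for AST/symbol shards, commit-scoped keys for logs.
import hashlib
from pathlib import Path

def content_key(path: Path) -> str:
    # AST and symbol shards are keyed by file content, so unchanged files
    # hit the cache across commits, branches, and CI shards.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return f"ast/{digest}"

def log_key(commit: str, test_target: str, test_name: str, shard: int) -> str:
    # Logs are not content-addressed; they are scoped to the run that produced them.
    return f"logs/{commit}/{test_target}/{test_name}/{shard}"

# Usage (hypothetical cache client):
# cache.get_or_compute(content_key(Path("src/server/feature_flags.py")), build_ast_shard)
# cache.put(log_key("abc123", "//server:test_e2e", "test_flags", 3), log_blob)
```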
Context construction: compact, provenanced, reproducible
The context fed to the LLM should be both informative and minimal. A typical pack includes:
- Problem summary (generated deterministically by the planner): error class, messages, files, targets, commit, failing tests.
- Code snippets:
- Function/class definitions of frames from the stack trace.
- Neighboring helper functions if directly referenced.
- Interface/contract definitions (types/signatures) if the error is type or interface related.
- Logs and traces:
- 50–150 line windows around the error signature.
- Any repeated error patterns across shards (with counts).
- Build metadata:
- Target labels, toolchain versions, important flags.
- Provenance tokens:
- IDs like [PROV:file=src/foo.py:123-170:commit=abc123:target=//foo:test] prefixed to each block.
Make the pack reproducible: two identical failures on the same commit should yield identical context, making behavior stable and debuggable.
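A small sketch of deterministic pack assembly with [PROV:...] headers; the block fields and the chars-per-token estimate are assumptions.

```python
# Sketch: deterministic, budgeted context pack with provenance headers per block.
from typing import Dict, List

def prov_header(block: Dict) -> str:
    return (f"[PROV:file={block['file']}:{block['start']}-{block['end']}"
            f":commit={block['commit']}:target={block['target']}]")

def build_context_pack(blocks: List[Dict], token_budget: int) -> str:
    # Sort by (priority, provenance) so the same failure on the same commit
    # always produces byte-identical context.
    ordered = sorted(blocks, key=lambda b: (b["priority"], prov_header(b)))
    parts, used = [], 0
    for b in ordered:
        cost = len(b["text"]) // 4  # rough chars-per-token estimate
        if used + cost > token_budget:
            break
        parts.append(prov_header(b) + "\n" + b["text"])
        used += cost
    return "\n\n".join(parts)
```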
Patch generation: constrain, compile, and test
LLMs can propose fixes, but they do better when kept on a short leash.
- Constrain patch scope:
- Only allow changes in allowed_files retrieved during planning.
- Limit edits to within ±N lines of the failing frames or the retrieved symbol definition, unless the planner expands scope.
- Disallow renames/moves unless the evidence includes all references.
- Require explicit references:
- Every modified symbol must appear in the retrieved provenance.
- New imports must resolve; if adding a new function, include its definition in the patch.
- Prefer minimal diffs:
- Encourage single-change patches; multiple-file changes only when planner indicates cross-cutting fix.
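One simple enforcement mechanism is to check the proposed diff against the planner's allow-list before anything else runs. The sketch below handles plain unified diffs only (renames and binary patches need more care), and the diff content is illustrative.

```python
# Sketch: reject a patch that touches files outside the planner's allowed_files.
from typing import List, Set

def touched_files(diff_text: str) -> Set[str]:
    files = set()
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            files.add(line[len("+++ b/"):])
    return files

def patch_in_scope(diff_text: str, allowed_files: List[str]) -> bool:
    return touched_files(diff_text).issubset(set(allowed_files))

diff = """--- a/src/server/feature_flags.py
+++ b/src/server/feature_flags.py
@@ -85,7 +85,7 @@
-    return value == "True"
+    return value.strip().lower() == "true"
"""
assert patch_in_scope(diff, ["src/server/feature_flags.py"])
```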
Guardrails before posting a patch
- Static checks:
- Run linters/formatters to ensure style and syntax correctness.
- For typed languages: type check (mypy/pyright, javac, tsc, clang++ with -Wall).
- Compile/test subset:
- Build and run only the impacted targets and failing tests (fast feedback).
- If green, optionally broaden run to affected dependents.
- Confidence scoring:
- Combine model self-score with validation outcomes: compiles, tests, static checks.
- If confidence below threshold, post a diagnostic analysis without a patch.
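A minimal sketch of how validation outcomes could fold into a single confidence score; the weights and threshold are guesses to calibrate against your own history.

```python
# Sketch: combine model self-score with validation outcomes, then gate posting.
def patch_confidence(model_self_score: float, compiled: bool,
                     tests_passed: bool, static_checks_passed: bool) -> float:
    conf = 0.3 * model_self_score
    conf += 0.2 if compiled else 0.0
    conf += 0.4 if tests_passed else 0.0
    conf += 0.1 if static_checks_passed else 0.0
    return conf

def decide(conf: float, threshold: float = 0.7) -> str:
    return "post_patch" if conf >= threshold else "post_analysis_only"

assert decide(patch_confidence(0.8, True, True, True)) == "post_patch"
assert decide(patch_confidence(0.9, False, False, True)) == "post_analysis_only"
```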
Patch format for CI
- Unified diff with file headers.
- Include a machine-readable section listing provenance IDs of edited snippets.
- Include a summary explaining the hypothesis and why the change should fix it, referencing evidence blocks.
Wiring into CI: sources of truth and safety
Integrating with CI ensures the assistant stays grounded and measurable.
GitHub Actions example
```yaml
name: Debug Assistant
on:
  workflow_run:
    workflows: ["Build and Test"]
    types: ["completed"]

jobs:
  propose-fix:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Restore cache
        uses: actions/cache@v4
        with:
          path: .rag_cache
          key: rag-${{ github.sha }}
          restore-keys: |
            rag-

      - name: Ingest build graph
        run: |
          ./tools/rag/index_build.sh --commit $GITHUB_SHA

      - name: Parse failing logs
        run: |
          ./tools/rag/collect_ci_logs.sh ${{ github.event.workflow_run.id }} > ci_logs.json

      - name: Plan, retrieve, and generate patch
        run: |
          python tools/rag/propose_fix.py \
            --commit $GITHUB_SHA \
            --logs ci_logs.json \
            --out patch.diff \
            --report report.md

      - name: Validate patch
        run: |
          git apply --check patch.diff && git apply patch.diff
          ./gradlew :failing:tests --info

      - name: Post results
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('report.md', 'utf8');
            // Note: workflow_run triggers carry no issue context; in practice,
            // resolve the PR number from the triggering run's head SHA or payload.
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body
            });
```
Key properties:
- Reads the exact build graph that CI used.
- Grounds on the failing job’s logs, not local reproductions.
- Shares a cache across retries.
- Posts a report with provenance and a minimal patch.
Security and privacy
- If using external APIs, redact secrets and proprietary logs.
- Optionally run embeddings and LLM on-prem for sensitive code.
- Log all retrievals and patch proposals for audit.
Example end-to-end scenario
Let’s walk a concrete failure in a mixed Python/TypeScript repo built with Bazel and tested with pytest and Playwright.
- Failure: CI shows pytest failure with a ValueError at src/server/feature_flags.py:87. The stack trace points to parse_bool_flag, called by load_flags during test setup. Multiple shards fail the same way.
- Planner steps:
- Parse stack trace, extract frames and lines.
- Map src/server/feature_flags.py to Bazel target //server:feature_flags.
- Graph distance to failing test target //server:test_e2e is 1 (direct dependency).
- Retrieve AST for parse_bool_flag and load_flags. Retrieve callsites from server/__init__.py.
- Pull log blocks containing the ValueError message and configuration values.
- Retrieve recent diffs affecting feature flags; notice a commit that switched truthy strings from ["true", "false"] to ["True", "False"].
- Construct context:
- Code for parse_bool_flag, load_flags.
- Callers in test setup.
- Log lines showing flag value "TRUE".
- Diff of recent commit touching the flag parser.
- Model prompt constraints:
- Only edit src/server/feature_flags.py within ±40 lines of parse_bool_flag.
- Any new constants must be declared in the same file.
- Patch proposal:
- Normalize case when parsing flags, accepting variants like "true", "TRUE", "True".
- Validation:
- Run pytest subset for targets depending on //server:feature_flags; pass.
- Static checks (flake8, mypy strict) pass.
- CI posts a patch with rationale referencing provenance blocks and the problem disappears.
Notice what prevented hallucination: it never touched unrelated files, it justified the fix with logs and recent diffs, and it respected build-target boundaries.
Reducing noisy patches: additional guardrails
- No symbol creation without definition:
- If a patch introduces a new function or class, it must include its full definition in the patch; otherwise reject.
- Reference check:
- Verify that all referenced symbols exist or are added; use the symbol index to confirm.
- Cross-file edits:
- Disallow unless planner finds cross-target evidence; otherwise keep patches one-file.
- Renames:
- Block unless all references are updated and the index confirms complete coverage.
- Compile fence:
- A patch that doesn’t compile is not posted; instead, post a diagnostic analysis with retrieved evidence.
Metrics and evaluation
You can’t improve what you don’t measure. Track both retrieval and end-to-end outcomes.
- Retrieval metrics:
- Symbol recall@k: fraction of ground-truth buggy symbol definitions appearing in top-k retrieved chunks.
- Frame coverage: fraction of failing stack frames covered by retrieved code spans.
- Build-locality precision: percentage of retrieved chunks within N hops of failing target.
- Generation metrics:
- Patch compile rate: percentage of proposed patches that compile.
- Test-pass rate: percentage that make failing tests pass locally.
- PR acceptance rate and time-to-green delta versus baseline.
- Cost and latency:
- Retrieval latency P50/P95; token usage per attempt; cache hit rates.
For ground truth, use your own historical failures and patches. For public benchmarks, consider repository-level bug-fix datasets and real-world issue-fix pairs. The key is to evaluate on repo-scale tasks, not isolated snippets.
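For example, symbol recall@k can be computed directly from a gold set of past failures; the record shapes below are assumptions.

```python
# Sketch: symbol recall@k over a gold set of historical failures and fixes.
from typing import Dict, List

def symbol_recall_at_k(cases: List[Dict], k: int) -> float:
    scores = []
    for case in cases:
        retrieved = {doc["symbol"] for doc in case["retrieved"][:k]}
        gold = set(case["gold_symbols"])
        scores.append(len(gold & retrieved) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0

cases = [
    {"gold_symbols": ["parse_bool_flag"],
     "retrieved": [{"symbol": "load_flags"}, {"symbol": "parse_bool_flag"}]},
    {"gold_symbols": ["load_flags"],
     "retrieved": [{"symbol": "parse_int_flag"}]},
]
print(symbol_recall_at_k(cases, k=2))  # 0.5
```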
Implementation details and choices
- Embeddings:
- Use a high-quality code-text model for symbols and code blocks; use a general text model for logs and error messages.
- Store both to enable cross-modal retrieval.
- Index storage:
- Keep symbol and AST shards in a document store (e.g., SQLite/Parquet) with a small metadata DB for fast joins.
- Maintain separate indices per language to avoid conflation.
- Build graph connectors:
- Implement adapters that output a uniform graph schema (nodes, edges, attributes). Cache graph per commit.
- Language servers:
- Optionally integrate with LSPs (pyright, tsserver, clangd) to enrich symbol references and types.
- Static analysis integration:
- Run linters/type checkers as tools the planner can call for additional evidence in type- or interface-related failures.
- Token budgeting:
- Pre-shrink code with lossless elision: keep signatures and bodies for suspect functions, collapse other regions with markers like // … elided … while preserving line numbers in comments for mapping back.
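A line-based sketch of that elision, assuming the planner already knows which line ranges are suspect; a real implementation would derive the keep-ranges from the AST.

```python
# Sketch: collapse non-suspect regions to markers while preserving line numbers.
from typing import List, Tuple

def elide(source: str, keep_ranges: List[Tuple[int, int]]) -> str:
    lines = source.splitlines()
    keep = set()
    for start, end in keep_ranges:  # 1-based, inclusive line ranges
        keep.update(range(start, end + 1))
    out, i = [], 1
    while i <= len(lines):
        if i in keep:
            out.append(f"{i:5d} | {lines[i - 1]}")
            i += 1
        else:
            j = i
            while j <= len(lines) and j not in keep:
                j += 1
            out.append(f"      | # … lines {i}-{j - 1} elided …")
            i = j
    return "\n".join(out)

# snippet = elide(open("src/server/feature_flags.py").read(), keep_ranges=[(80, 95)])
```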
Analyzing logs without overwhelming the model
Logs are high-volume and high-noise. Treat them as signals, not payloads.
- Signature extraction:
- Regex error classes and common patterns; compute hashes of error message shapes to deduplicate.
- Context windows:
- Keep small windows (50–150 lines) around signatures.
- Cross-shard aggregation:
- If N shards show the same failure signature, include frequency counts instead of repeating blocks.
- Correlate with code:
- Map log messages to source via structured logging fields or embedded file:line tags if available.
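A small sketch of signature extraction and cross-shard aggregation; the normalization regexes are illustrative and will need tuning per stack.

```python
# Sketch: normalize error messages into stable signatures, then count across shards.
import hashlib
import re
from collections import Counter
from typing import List

def error_signature(message: str) -> str:
    # Strip volatile details (hex addresses, numbers, quoted values) so the
    # same failure shape hashes identically across shards and runs.
    shape = re.sub(r"0x[0-9a-fA-F]+", "<addr>", message)
    shape = re.sub(r"\d+", "<num>", shape)
    shape = re.sub(r"'[^']*'", "<str>", shape)
    return hashlib.sha1(shape.encode()).hexdigest()[:12]

def aggregate(messages: List[str]) -> Counter:
    return Counter(error_signature(m) for m in messages)

msgs = [
    "ValueError: invalid flag value 'TRUE' at line 87",
    "ValueError: invalid flag value 'True' at line 87",
]
print(aggregate(msgs))  # one signature with count 2
```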
Handling polyglot repos and generated code
- Generated code:
- Index generators and their templates. When a frame is in generated code, retrieve the template and generation parameters instead of the generated file.
- Transpiled languages:
- Use source maps for TS->JS or similar to map frames back to authored code; store and index maps.
- Monorepos:
- Namespacing by workspace or top-level module to keep indices manageable.
- Shard indices by language and by top-level package.
Failure modes and mitigations
- Ambiguous stack traces:
- Use last in-project frame heuristics; de-prioritize third-party frames.
- If ambiguity persists, post an analysis-only comment with likely candidates and ask for reproduction steps.
- Flaky tests:
- Detect via historical flake rates; avoid proposing patches for known flaky signatures without code changes.
- Massive diffs:
- If the model proposes large multi-file changes, reject and fall back to analysis.
- Hidden environmental issues:
- Pull environment metadata (OS, Python/Node/Java versions, flags); many failures are version skew or missing deps.
Minimal reference implementation sketch
To make this concrete, here’s a more detailed sketch for a Python/TypeScript/Bazel repo using OpenSearch + FAISS and a Python orchestration layer.
```python
# tools/rag/index_repo.py
import os, json, hashlib
from pathlib import Path
from typing import Dict, List

from parsers import ts_parser, py_parser
from embeddings import embed_text
from storage import DocStore, VectorStore

class Indexer:
    def __init__(self, repo_root: str, commit: str):
        self.repo = Path(repo_root)
        self.commit = commit
        self.docs = DocStore(path=".rag_cache/docs.parquet")
        self.vec = VectorStore(index_path=".rag_cache/faiss.idx")

    def index_file(self, path: Path) -> List[Dict]:
        text = path.read_text(encoding="utf-8", errors="ignore")
        lang = detect_lang(path)
        ast_chunks = parse_to_chunks(text, lang)
        docs = []
        for ch in ast_chunks:
            doc = {
                'id': sha(path, ch['span']),
                'repo': str(self.repo),
                'commit': self.commit,
                'path': str(path),
                'lang': lang,
                'symbol': ch.get('symbol'),
                'span': ch['span'],
                'text': ch['text'],
                'provenance': {
                    'file': str(path),
                    'lines': ch['lines'],
                },
            }
            docs.append(doc)
        return docs

    def run(self):
        for path in self.repo.rglob("*.py"):
            docs = self.index_file(path)
            self.docs.add_all(docs)
            vecs = embed_text([d['text'] for d in docs])
            self.vec.add([d['id'] for d in docs], vecs)
        # Repeat for TS/TSX and others
        self.docs.commit()
        self.vec.commit()
```
```python
# tools/rag/planner.py
from retrieval import hybrid_search, retrieve_logs, retrieve_stack

class Planner:
    def __init__(self, stores, build_graph):
        self.stores = stores
        self.bg = build_graph

    def plan(self, failure):
        frames = parse_stacktrace(failure['stack'])
        primary_files = [f['file'] for f in frames if is_repo_path(f['file'])]
        targets = {file_to_target(p) for p in primary_files}
        ranked_targets = rank_targets(targets, failure['test_target'])

        # Retrieve symbol defs for top targets and frames
        evidence = []
        for f in frames:
            evidence += hybrid_search(query=f["function"],
                                      filters={"path": f["file"]}, k=5)
        for t in list(ranked_targets)[:3]:
            evidence += hybrid_search(query=failure['message'],
                                      filters={"target": t}, k=20)
        evidence += retrieve_logs(failure['commit'], failure['test_target'])
        return pack(evidence)
```
This is not production-ready code; it conveys the shape and separation of concerns. In production you’ll add proper build adapters, symbol resolution, batching, cache management, and robust validation.
Cost control and performance
- Token economy:
- Cap context to a tight budget; prefer more retrieval precision to bigger context.
- Summarize long logs before including; keep raw logs available via link.
- Parallel parsing and vectorization:
- Use a worker pool to parse files and build AST chunks; vectorize in batches.
- Latency goals:
- P50 < 1s for retrieval with warm cache; < 5s with cold index.
- Patch attempts under 2–5 minutes end-to-end for impacted tests only.
What “good” looks like after adoption
Teams that implement this blueprint usually see:
- Big drop in “noisy patches” on PRs: fewer suggestions that touch the wrong file or symbol.
- Faster time-to-green: the assistant posts a viable patch or a precise analysis quickly.
- Better developer trust: suggestions are accompanied by evidence and provenance.
- Lower compute cost: planner-driven retrieval reduces token sprawl and unnecessary full-repo context dumps.
Checklist: getting started this quarter
- Choose build graph adapters for your build system.
- Stand up a simple hybrid index: OpenSearch + FAISS/pgvector.
- Implement AST/symbol extraction for your two most common languages.
- Parse stack traces and logs in CI and push into your store.
- Write a deterministic planner with:
- Stack frame resolution
- Build-target locality
- Minimal evidence packing
- Add commit-scoped caching and prewarm on PR open.
- Enforce guardrails: compile fence, test subset, minimal diffs.
- Add CI bot to post analysis or patch with provenance.
- Start measuring retrieval recall@k and patch compile/test-pass rates.
Closing opinion
Most “AI debugging” disappoints because it treats a repository like a bag of strings. Your build system already encodes ground truth about what depends on what, where symbols live, and which tests verify which behavior. If you make your retrieval layer read that truth—and you constrain generations to it—you’ll turn a guessing machine into a reliable collaborator. The blueprint above is not glamorous, but it’s the shortest path to fewer hallucinations and more fixes that actually land.
References and further reading
- Bazel Query and Aquery docs (build graph and actions)
- Gradle Build Scans and dependency insight
- CMake compile_commands.json and target graphs
- tree-sitter (multi-language parsing)
- libCST (Python concrete syntax trees)
- TypeScript Compiler API
- LLVM/Clang tooling (AST and indexing)
- FAISS, Milvus, pgvector (vector search)
- OpenSearch/Elasticsearch (inverted search)
- OpenTelemetry (logs/tracing)
- GitHub Actions/GitLab CI docs for cache and artifact management
These are battle-tested tools. Combining them in a build-aware RAG pipeline is how you stop hallucinated fixes and start shipping grounded patches.
