Stop Feeding Your Whole Repo to Debug AI: Program Slicing and Log‑RAG for Precise Fixes
If your debugging assistant needs the whole repository to reason about a bug, your context strategy is broken. You are paying more, getting slower answers, and increasing the chance of a wrong fix. Worse, you are training yourself to rely on brute force instead of engineering.
A better way is both old and new: use classic program analysis to precisely bound what matters, and retrieve live context from production logs and traces. Static program slicing, symbol graphs, and log‑RAG (retrieval augmented by structured logs and traces) form a tight loop that feeds your model only what it needs.
This guide details a practical pipeline to build minimal, relevant context for debugging AI:
- Start with a slicing criterion and compute a compact static slice.
- Expand context along a symbol graph only as needed.
- Anchor analysis with production evidence via log‑RAG.
- Compress and package context for the model in a predictable, reproducible format.
The result: fewer tokens, fewer hallucinations, faster triage, and fixes that actually compile.
TL;DR
- Whole‑repo ingest is a last resort. Most bugs require less than 1% of the repo.
- A three‑layer pipeline consistently outperforms naive context stuffing:
  - static slicing around the failure site,
  - symbol‑graph expansion for interprocedural context,
  - log‑RAG to add runtime evidence.
- Expect 5–50x token reduction, lower latency, and higher fix precision.
- You can build this today with off‑the‑shelf parsers (Tree‑sitter, LSP), graph libs (NetworkX), and standard IR stores (SQLite FTS, OpenSearch, or vector DBs).
Why whole‑repo context is a bad default
It is tempting to be generous with context when you are desperate for a fix. But dumping everything into the prompt has concrete downsides:
- Accuracy drop from noise: retrieval theory and empirical LLM work both show that adding non‑relevant tokens can degrade responses. The model’s attention is finite; noise pushes probability mass toward spurious correlations.
- Latency and cost: context windows are bigger, but bandwidth and inference time do not scale linearly. Every extra 10–100k tokens adds seconds, sometimes minutes.
- Security and compliance: repos often contain secrets, customer data samples, or internal IP outside the bug’s scope.
- Reproducibility: dynamic repos change. Whole‑repo snapshots are hard to cache and diff, making audit and rollbacks brittle.
Program slicing and log‑RAG shift you from volume to precision. You give the model less, but you give it the right less.
Debugging is not codegen: the model needs constraints, not volume
Bug fixing differs from greenfield coding. The model’s job is to:
- localize the fault,
- reason over data/control dependencies that explain the failure,
- propose minimal, safe edits that pass tests and keep contracts intact.
Volume rarely helps with any of those. Structure does. Static dependencies, symbol relationships, and live signals (logs/traces) are the right inputs because they mirror how engineers actually debug.
The minimal‑context pipeline
We will build a three‑layer pipeline. Think of it as progressive disclosure:
- Static program slice around a target line or failing test.
- Symbol graph expansion across files and modules.
- Log‑RAG retrieval of runtime evidence tied to the slice.
Finally, we package the context into a compact, deterministic bundle for the model.
A running example
Assume a Python microservice where a checkout endpoint fails with intermittent 500s. A failing test produces this traceback:
```
Traceback (most recent call last):
  File 'tests/test_checkout.py', line 42, in test_discount_applied
    assert apply_discount(total, user) == expected
  File 'svc/pricing.py', line 118, in apply_discount
    rate = get_discount_rate(user.tier)
  File 'svc/pricing.py', line 76, in get_discount_rate
    return RATES[segment]
KeyError: 'vip'
```
You do not need the whole repo. You need:
- the functions apply_discount and get_discount_rate,
- the definition and use of RATES and user.tier,
- any higher‑level caller context that constrains expected behavior (tests, API contract),
- the production logs for requests that triggered KeyError: 'vip'.
Let us compute that slice and augment with logs.
Static program slicing 101
Program slicing extracts the subset of code that may affect a variable or expression at a program point. The classic references are Weiser (1981) for static slicing and Korel & Laski (1988) for dynamic slicing.
Key distinctions:
- Static vs dynamic slices:
  - Static: conservative over‑approximation using syntax and control/data flow, independent of a particular run.
  - Dynamic: precise to one or more executions; requires traces.
- Backward vs forward:
  - Backward: from a slicing criterion (e.g., variable at a line) to statements that can influence it.
  - Forward: from a statement to those it can influence downstream.
For debugging, backward static slices are a great start: they are fast, reproducible, and do not require running the system.
Choose a slicing criterion
Your criterion should be specific. In our example: variable segment at pricing.py:76 or the dictionary access RATES[segment] at the same line.
Minimal implementation strategy
- Parse source into ASTs (Tree‑sitter is language‑agnostic and fast; the built‑in ast module works for Python).
- Build a def‑use map (who defines what symbols, and where they are used).
- Build control flow (basic blocks and edges) and a simple call graph.
- Compute a backward slice by walking data dependencies and control predecessors until a budget or boundary (module, package) is reached.
You do not need a full compiler to get a useful slice. Here is a compact Python prototype to compute a conservative backward slice using the stdlib ast and NetworkX:
```python
# Minimal, illustrative slicer for Python
import ast
from pathlib import Path

import networkx as nx


class DefUseVisitor(ast.NodeVisitor):
    def __init__(self):
        self.defs = {}         # symbol -> set of (file, lineno)
        self.uses = {}         # symbol -> set of (file, lineno)
        self.calls = []        # (caller, callee, file, lineno)
        self.assignments = []  # (target_symbol, file, lineno)

    def visit_FunctionDef(self, node):
        self.defs.setdefault(node.name, set()).add((self._file, node.lineno))
        self.generic_visit(node)

    def visit_Assign(self, node):
        for target in node.targets:
            if isinstance(target, ast.Name):
                self.assignments.append((target.id, self._file, node.lineno))
                self.defs.setdefault(target.id, set()).add((self._file, node.lineno))
        self.generic_visit(node)

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Load):
            self.uses.setdefault(node.id, set()).add((self._file, node.lineno))
        self.generic_visit(node)

    def visit_Call(self, node):
        # Best effort: record direct name calls only.
        if isinstance(node.func, ast.Name):
            self.calls.append((None, node.func.id, self._file, node.lineno))
        self.generic_visit(node)

    def index_file(self, path: Path):
        self._file = str(path)
        tree = ast.parse(path.read_text())
        self.visit(tree)


def build_symbol_graph(files):
    v = DefUseVisitor()
    for f in files:
        v.index_file(Path(f))
    G = nx.DiGraph()
    # Edges from definition sites to use sites of the same symbol.
    for sym, def_sites in v.defs.items():
        for d in def_sites:
            G.add_node(('def', sym, d))
        for use in v.uses.get(sym, set()):
            for d in def_sites:
                G.add_edge(('def', sym, d), ('use', sym, use))
    # Call sites point at the callee symbol.
    for _, callee, file, lineno in v.calls:
        G.add_edge(('callsite', callee, (file, lineno)), ('callee', callee, None))
    return G, v


def backward_slice(v: DefUseVisitor, target_symbol: str, target_file: str,
                   target_line: int, depth=3):
    # Naive: include all defs of the symbol and their dependent defs recursively.
    work = [(target_symbol, target_file, target_line, 0)]
    included = set()
    while work:
        sym, f, l, d = work.pop()
        if d > depth:
            continue
        for (df, dl) in v.defs.get(sym, set()):
            included.add((df, dl))
            # Pull in other symbols used on the defining line.
            # Very rough: re-read the line and collect identifier-like tokens.
            src_line = Path(df).read_text().splitlines()[dl - 1]
            for token in src_line.split():
                if token.isidentifier() and token != sym:
                    work.append((token, df, dl, d + 1))
    return included


# usage
files = ['svc/pricing.py', 'svc/user.py', 'tests/test_checkout.py']
G, v = build_symbol_graph(files)
sl = backward_slice(v, 'segment', 'svc/pricing.py', 76, depth=3)
for f, l in sorted(sl):
    print(f'{f}:{l}')
```
This is crude but demonstrates the pattern: pick a criterion, crawl definitions and related symbols, bound the depth, and get a small set of lines to include. Production systems enrich this with control flow, interprocedural aliasing, imports, and more accurate def‑use. Tree‑sitter grammars, Jedi/Pyright (Python), or ts‑morph (TypeScript) make this reliable without writing your own parser.
Slicing tips that matter
- Favor backward slices from failures, not forward slices from entrypoints. It is closer to how humans localize bugs.
- Cap the slice by:
  - maximum files (e.g., 15–50),
  - maximum lines per file (keep only the needed spans),
  - call depth (e.g., 3–5),
  - same package or module boundary unless crossing is justified by imports.
- Always include nearby context lines (±5–10) to preserve syntax and local contracts.
- Store slices by a stable key: repo SHA + criterion (file:line + symbol). This enables caching and reproducibility; a sketch of span padding and cache keying follows this list.
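To make the padding and caching tips concrete, here is a minimal sketch. The names (merge_spans, slice_cache_key) are illustrative and only loosely mirror the orchestration sketch later; the budget policy is an assumption, not a prescription.

```python
# Sketch: merge sliced lines into padded spans and derive a stable cache key.
# merge_spans / slice_cache_key are illustrative names, not a fixed API.
import hashlib


def merge_spans(lines_by_file: dict[str, set[int]], pad: int = 8,
                max_lines_per_file: int = 200) -> dict[str, list[tuple[int, int]]]:
    spans = {}
    for file, lines in lines_by_file.items():
        merged = []
        for ln in sorted(lines):
            start, end = max(1, ln - pad), ln + pad  # keep +/- context lines
            if merged and start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        # Enforce the per-file line budget by truncating trailing spans.
        kept, budget = [], max_lines_per_file
        for s, e in merged:
            if budget <= 0:
                break
            e = min(e, s + budget - 1)
            kept.append((s, e))
            budget -= e - s + 1
        spans[file] = kept
    return spans


def slice_cache_key(repo_sha: str, file: str, line: int, symbol: str) -> str:
    # repo SHA + criterion uniquely identifies a slice, enabling reuse.
    raw = f'{repo_sha}:{file}:{line}:{symbol}'
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
```

Keyed this way, identical criteria on the same commit hit the cache instead of re‑slicing.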
Symbol graphs: precision across files and modules
Slices inside one file are not enough when the fault spans modules. A symbol graph connects definitions, references, and call edges across the codebase. At minimum you want:
- symbol to defining file/line,
- reference edges (who uses the symbol),
- call graph edges (which function calls which),
- import graph (modules to modules),
- type/interface edges (if typed languages).
Many languages hand you this via LSP servers. Examples:
- Python: Jedi, Pyright, pylsp.
- TypeScript/JavaScript: tsserver via ts‑morph.
- Go: gopls.
- Java: Eclipse JDT LS.
A minimal TypeScript call graph build using ts‑morph:
```ts
// Build a simple symbol/call graph in TypeScript using ts-morph
import { Project, SyntaxKind } from 'ts-morph';

const project = new Project({ tsConfigFilePath: 'tsconfig.json' });
project.addSourceFilesAtPaths('src/**/*.ts');

const calls: Array<{ caller: string; callee: string; file: string; line: number }> = [];

for (const sf of project.getSourceFiles()) {
  sf.forEachDescendant(node => {
    if (node.getKind() === SyntaxKind.CallExpression) {
      const ce = node.asKindOrThrow(SyntaxKind.CallExpression);
      const symbol = ce.getExpression().getSymbol();
      const callee = symbol?.getName() ?? 'unknown';
      const fn = node.getFirstAncestorByKind(SyntaxKind.FunctionDeclaration);
      const caller = fn?.getName() ?? 'anon';
      const { line } = sf.getLineAndColumnAtPos(node.getStart());
      calls.push({ caller, callee, file: sf.getFilePath(), line });
    }
  });
}

console.log(calls);
```
With this graph, you can expand the static slice outward only along edges relevant to the failure criterion. If get_discount_rate is called only from apply_discount, you do not need every other consumer of RATES.
Practical graph heuristics
- Expand along edges that are on paths between the failure site and the public entrypoint or failing test.
- Prefer shortest paths first; cap the number of alternative paths (see the sketch after this list).
- Collapse repeated library code: add a symbolic stub for standard library or vendored dependencies instead of inlining their source.
- Merge adjacent spans in the same file to reduce glue lines.
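Here is one way to sketch the path‑guided expansion with NetworkX, assuming you can name the entrypoint node (failing test or public handler) and the failure node in the graph. Treat it as a simplified stand‑in for the expand_via_graph step in the orchestration sketch later.

```python
# Sketch: expand context only along shortest paths between the entrypoint
# and the failure site in the symbol graph; caps keep the slice small.
import networkx as nx


def expand_via_paths(G: nx.DiGraph, entry_node, failure_node,
                     max_paths: int = 3, max_nodes: int = 40) -> set:
    UG = G.to_undirected()  # direction-agnostic: callers and callees both matter
    selected = set()
    try:
        # Shortest paths first, capped at a few alternatives.
        for i, path in enumerate(nx.shortest_simple_paths(UG, entry_node, failure_node)):
            if i >= max_paths or len(selected) >= max_nodes:
                break
            selected.update(path)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # Nothing connects them in the index: fall back to the failure
        # site's immediate neighborhood.
        if failure_node in UG:
            selected.update(nx.ego_graph(UG, failure_node, radius=2).nodes)
    return selected
```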
Log‑RAG: runtime evidence beats speculation
Static context alone misses what actually happened in production. Retrieval‑augmented generation is usually framed around embeddings of documents. For debugging, your most valuable corpus is structured logs and traces with context keys like request IDs, user IDs, and stack traces.
Log‑RAG means: retrieve the most relevant log and trace fragments for the current failure, then feed that evidence to the model alongside the code slice.
Crucially, logs are already in the developer’s language: they contain symbol names, function names, and error messages. You do not need advanced embeddings to do well; BM25 or plain keyword search often beats vector search for error retrieval because of exact matches on identifiers.
Structure your logs for retrieval
- Emit structured logs, not raw strings. Include keys for service, file, function, line, request_id, trace_id, user/tier, and a stable error_code.
- Include stack traces with file:line pairs. They align exactly with your slice.
- Use OpenTelemetry to propagate trace_id through services.
- Adopt a small schema and stick to it. The fewer free‑form fields, the better your retrieval quality.
Example Python log emitter with structlog:
```python
import structlog

log = structlog.get_logger()


def get_discount_rate(segment: str) -> float:
    try:
        return RATES[segment]
    except KeyError:
        log.error('discount_rate_missing',
                  file='svc/pricing.py',
                  func='get_discount_rate',
                  segment=segment)
        raise
```
Index logs for fast, precise recall
A simple, robust setup:
- Send logs to OpenSearch/Elasticsearch and create an index on fields: service, file, func, error_code, and a full‑text field for message and stack.
- Alternatively, use SQLite FTS5 for local development indexing.
- Keep a retention of at least N days for debugging; keep a high‑cardinality shard for recent hours.
Minimal SQLite FTS example:
```python
import sqlite3

conn = sqlite3.connect('logs.db')
cur = conn.cursor()
cur.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS logs USING fts5(
  ts UNINDEXED, level UNINDEXED,
  service, file, func, error_code, request_id, message, stack
);
""")

# insert logs elsewhere ...

# query by identifiers from the slice; quote the path so FTS5 treats it
# as a phrase instead of choking on the '/' and '.'
q = 'get_discount_rate RATES KeyError vip "svc/pricing.py"'
rows = cur.execute(
    'SELECT ts, service, file, func, message, stack FROM logs WHERE logs MATCH ?',
    (q,),
).fetchmany(20)
for r in rows:
    print(r)
```
This retrieval uses the exact identifiers from the code slice and failure. You can also filter by time window (e.g., last 2 hours) or by error_code if you emit stable codes.
Trace‑aware retrieval
Traces provide causality across services. With OpenTelemetry and a trace store (Tempo, Jaeger), you can:
- fetch all spans linked to errors that contain the target file/function,
- extract baggage and attributes (e.g., user.tier) that inform the bug,
- reconstruct the minimal sequence of operations that led to the failure.
Your RAG query becomes: given failure site S, retrieve traces where S appears, then extract the top spans by error frequency and latency. Add the minimal annotated trace to the model context.
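A store‑agnostic sketch of that extraction, assuming spans have already been fetched and normalized into dicts; the field names below are assumptions, not any particular store's schema.

```python
# Sketch: keep only error spans that touch the failure site, plus their
# direct parents, ordered by start time to preserve causality.
# The span dict shape is an assumption, not a specific store's schema.
def minimal_error_spans(spans: list[dict], target_func: str) -> list[dict]:
    by_id = {s['span_id']: s for s in spans}
    hits = [
        s for s in spans
        if s.get('status') == 'error'
        and target_func in s.get('attributes', {}).get('code.function', '')
    ]
    keep = {}
    for s in hits:
        keep[s['span_id']] = s
        parent = by_id.get(s.get('parent_span_id'))
        if parent:
            keep[parent['span_id']] = parent
    return sorted(keep.values(), key=lambda s: s.get('start_time', 0))
```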
Assembling the context pack
Now that you have a static slice, a bounded symbol graph, and runtime evidence, you need to package it compactly. The goals are:
- determinism: same inputs produce identical packs,
- compression: no redundant code or logs,
- readability: the model and a human can parse it easily.
A simple YAML‑like layout works well:
```yaml
# Context Pack v1
meta:
  repo_sha: abcdef1
  target:
    file: svc/pricing.py
    line: 76
    symbol: segment
  budgets:
    max_files: 25
    max_lines_per_file: 200
slice:
  - file: svc/pricing.py
    spans:
      - start: 60
        end: 130
        note: get_discount_rate and apply_discount
  - file: svc/user.py
    spans:
      - start: 10
        end: 45
        note: User.tier derivation
symbols:
  calls:
    - caller: apply_discount
      callee: get_discount_rate
      file: svc/pricing.py
      line: 118
  defs:
    - symbol: RATES
      file: svc/pricing.py
      line: 20
runtime:
  logs:
    - ts: 2025-11-12T10:22:31Z
      level: error
      file: svc/pricing.py
      func: get_discount_rate
      message: discount_rate_missing segment='vip'
      stack: |
        File 'svc/pricing.py', line 76, in get_discount_rate
          return RATES[segment]
    - ts: 2025-11-12T10:22:31Z
      level: info
      file: svc/user.py
      func: resolve_segment
      message: segment='vip' user_id=123 tier='gold'
constraints:
  expected:
    - "'vip' should map to same rate as 'gold' unless overridden by experiment XYZ."
  tests:
    - tests/test_checkout.py::test_discount_applied
```
Note how minimal this is. It omits entire directories that do not matter. Any log without the target identifiers is excluded.
Compression tricks that keep meaning but cut tokens
- Minify code but preserve lines: strip comments and docstrings after extracting spans, then keep a line map.
- Deduplicate logs by template: cluster messages by tokenizing literals vs variables (e.g., via regex or Drain3); include the template and top 3 parameter instantiations (see the sketch after this list).
- Summarize traces: include only error spans and their direct parents/children.
- Replace large constants with stubs and include a single example.
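For the template step, Drain3's TemplateMiner does the clustering. A sketch, assuming messages arrive as plain strings:

```python
# Sketch: cluster log messages into templates with Drain3, keeping the
# template plus a few example instantiations per cluster.
from collections import defaultdict

from drain3 import TemplateMiner

miner = TemplateMiner()


def dedup_logs(messages: list[str], examples_per_template: int = 3) -> list[dict]:
    templates, examples = {}, defaultdict(list)
    for msg in messages:
        result = miner.add_log_message(msg)
        cid = result['cluster_id']
        templates[cid] = result['template_mined']
        if len(examples[cid]) < examples_per_template:
            examples[cid].append(msg)
    return [{'template': tpl, 'examples': examples[cid]}
            for cid, tpl in templates.items()]
```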
Orchestrating the pipeline
Here is a high‑level orchestration sketch you can implement as a service or CLI:
```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    file: str
    line: int
    symbol: str


@dataclass
class ContextPack:
    meta: dict
    slice: List[dict]
    symbols: dict
    runtime: dict
    constraints: dict


def build_context_pack(repo_root: str, crit: Criterion, budgets: dict) -> ContextPack:
    # 1) parse and index
    files = discover_files(repo_root, budgets)
    sym_graph, analyzer = build_symbol_graph(files)
    # 2) static slice
    sl = backward_slice(analyzer, crit.symbol, crit.file, crit.line,
                        depth=budgets.get('depth', 4))
    spans = merge_spans(sl, budgets)
    # 3) symbol expansion
    expanded = expand_via_graph(sym_graph, spans, budgets)
    # 4) runtime retrieval
    log_hits = query_logs_for_slice(expanded, time_window=budgets.get('time_window'))
    trace_hits = query_traces_for_slice(expanded, time_window=budgets.get('time_window'))
    # 5) compress
    code_snippets = extract_and_minify(expanded)
    log_snippets = dedup_logs(log_hits)
    trace_snippets = summarize_traces(trace_hits)
    # 6) assemble
    return ContextPack(
        meta={'repo_sha': git_sha(repo_root), 'target': crit.__dict__, 'budgets': budgets},
        slice=code_snippets,
        symbols=serialize_symbols(sym_graph, expanded),
        runtime={'logs': log_snippets, 'traces': trace_snippets},
        constraints=collect_constraints(repo_root, crit),
    )
```
The details depend on your language ecosystem, but the skeleton is universal.
Prompting the model with the context pack
You want a standard, low‑surprise prompt shape. Focus on:
- a crisp task definition,
- the compact context pack,
- explicit output constraints (patch format, test expectations),
- refusal to modify code outside the slice unless justified by a missing definition.
Example instruction prompt (system or top of user message):
Task: Identify the root cause of the failure at svc/pricing.py:76 (KeyError: 'vip') and produce a minimal patch.
Rules:
- Use only the provided code spans. If a symbol is missing, request it explicitly.
- Do not modify code outside the slice unless the fix requires a definition present in the symbol graph.
- Explain the dependency path that leads to the failure in 2–4 bullets.
- Output a unified diff patch.
Then append the context pack. Many teams wrap it in fenced blocks like:
=== BEGIN CONTEXT PACK ===
...
=== END CONTEXT PACK ===
This bracketing reliably anchors the model to the relevant section.
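Assembling the final prompt is then mechanical; a sketch, assuming the pack is already serialized (to YAML or similar):

```python
# Sketch: wrap the serialized context pack in stable markers so the model
# can anchor on it reliably.
def build_prompt(instructions: str, context_pack: str) -> str:
    return (
        f'{instructions}\n\n'
        '=== BEGIN CONTEXT PACK ===\n'
        f'{context_pack}\n'
        '=== END CONTEXT PACK ===\n'
    )
```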
Evaluation: measure recall, not just vibes
If you are replacing naive whole‑repo prompts with slicing + log‑RAG, measure it. Suggested metrics:
- Slice recall: fraction of truly relevant lines included in the slice (approximate via human or test oracle). Aim for 0.7–0.9 with compact budgets. A computation sketch follows this list.
- Token count: total input tokens. Track p50/p90; expect 5–50x reductions.
- Latency: end‑to‑end time from criterion to model response.
- Fix success rate: patch applies and passes tests. Use a benchmark corpus like SWE‑bench (Python) or Defects4J (Java) to standardize.
- Human minutes saved: time to triage and land the fix in real incidents.
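The first two metrics are cheap to automate. A minimal sketch, using a crude character‑based token estimate rather than any specific tokenizer:

```python
# Sketch: slice recall against an oracle set of relevant lines, plus a
# rough token count. Swap in your model's tokenizer for real numbers.
def slice_recall(slice_lines: set[tuple[str, int]],
                 oracle_lines: set[tuple[str, int]]) -> float:
    if not oracle_lines:
        return 1.0
    return len(slice_lines & oracle_lines) / len(oracle_lines)


def rough_token_count(text: str) -> int:
    # Crude approximation: ~4 characters per token for code-heavy English.
    return max(1, len(text) // 4)
```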
An incremental adoption pattern:
- A/B test on a sample of incidents. Route half through whole‑repo baseline, half through slicing + log‑RAG.
- Freeze the model to isolate context effects.
- After 2–4 weeks, compare success and latency. Expand rollout when the numbers are clear.
Operational playbook: start small, escalate only when needed
Treat context acquisition as a budgeted search process:
- Start with the failure site: last frame’s file:line, variable/expression, and failing test.
- Build a backward static slice up to depth D.
- Expand along the symbol graph only on paths from test or public entrypoint to the failure site.
- Retrieve logs for identifiers in the slice, capped to N results and a time window.
- If the model requests more, iterate: increase depth or include a requested file/span.
- Only if the model is still stuck, escalate to a broader scope (e.g., entire module). Whole repo is the nuclear option.
This mirrors how senior engineers debug. The AI should not get more privileges than a human needs to get the job done.
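One way to encode the playbook is a budgeted loop. Everything below is a sketch: the pack builder and model client are injected, and the "request more" protocol is an assumption about your own prompt contract.

```python
# Sketch: budgeted escalation. Scope widens only when the model explicitly
# asks for more or fails to produce a patch; whole repo stays a last resort.
from typing import Callable


def debug_with_escalation(criterion, build_pack: Callable, ask_model: Callable,
                          max_depth: int = 6):
    depth, extra = 2, set()
    while depth <= max_depth:
        pack = build_pack(criterion, depth=depth, extra=extra)
        answer = ask_model(pack)  # expected shape: {'kind': 'patch'|'request', ...}
        if answer['kind'] == 'patch':
            return answer
        if answer['kind'] == 'request':
            # The model named a missing symbol or file: include it and retry.
            extra.update(answer.get('files', []))
        else:
            depth += 1  # widen the slice only when the model is stuck
    return None  # caller escalates to module scope
```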
Real‑world pitfalls and how to avoid them
- Dynamic features and reflection: Python, Ruby, and Java reflection can hide dependencies. Mitigation: fall back to dynamic execution traces in tests to seed the slice (hybrid static‑dynamic slicing).
- Generated code: do not include generated bundles; include the source generator and the schema or IDL that drives it.
- Multi‑language repos: build per‑language slices and join them at well‑defined boundaries (HTTP handlers, protobuf/grpc endpoints, CLI invocations).
- Version skew: logs and code must match the same deploy hash. Always include the service/version in log retrieval and the repo SHA in the context pack.
- Privacy: redact PII from logs and from the context pack. Prefer templates over raw messages.
- Flaky tests: log‑RAG with request IDs helps disambiguate data‑dependent flakes.
Concrete token savings and accuracy gains
Teams that move from whole‑repo prompting to slice + log‑RAG consistently report:
- token reductions of 80–98% for medium monorepos,
- end‑to‑end latency dropping from tens of seconds to under 5 seconds for most bugs,
- fewer hallucinated edits outside the touched module,
- higher patch acceptance on first try.
Your exact numbers will vary by repo size and language. The trend will not.
Tooling you can use today
- Parsers and analysis:
  - Tree‑sitter (fast incremental parsing, many languages)
  - Pyright/Jedi (Python), ts‑morph (TypeScript), gopls (Go), JDT LS (Java)
  - CodeQL for deep semantic queries (heavier but powerful)
- Graph libs:
  - NetworkX (Python), igraph, rustworkx
- Log and trace infra:
  - OpenTelemetry SDKs and collectors, Jaeger/Tempo, OpenSearch/Elasticsearch, SQLite FTS5
  - Template miners: Drain3 for log template clustering
- Orchestration:
  - A simple Python CLI or a service in your infra. Cache by (repo SHA, criterion) keys.
A brief worked fix with the running example
Our slice includes get_discount_rate, apply_discount, RATES, and the User.tier computation. Logs show many errors with segment 'vip' and user tier 'gold'. The likely bug: the segment value 'vip' is not in the RATES dict, but business rules say VIPs should use the 'gold' rate unless an experiment overrides it.
A minimal, safe patch:
- Map unknown segments to a fallback derived from tier.
- Log once at warn level with a stable error_code.
- Add test coverage for 'vip'.
Example diff:
```diff
--- a/svc/pricing.py
+++ b/svc/pricing.py
@@
-def get_discount_rate(segment: str) -> float:
-    return RATES[segment]
+def get_discount_rate(segment: str, tier: str | None = None) -> float:
+    try:
+        return RATES[segment]
+    except KeyError:
+        # Fallback to tier mapping; VIPs use gold unless overridden
+        fallback = 'gold' if (segment == 'vip' and tier) else (tier or 'bronze')
+        log.warning('discount_rate_fallback', segment=segment, tier=tier, fallback=fallback)
+        return RATES.get(fallback, 1.0)
@@
 def apply_discount(total: float, user: User) -> float:
-    rate = get_discount_rate(user.tier)
+    rate = get_discount_rate(user.segment, user.tier)
     return round(total * rate, 2)
```
Even if your actual business logic differs, the point stands: a small patch in the exact slice, informed by runtime evidence. No need to show the model every microservice in the monorepo.
Governance and reproducibility
- Deterministic builds: pin tool versions (parsers, indexers) and emit their versions in the context pack meta.
- Caching: key the slice by (repo SHA, criterion, budgets). Store packs in object storage for reuse and audit.
- Redaction: run a filter that drops secrets and PII from logs and code snippets before the model sees them (see the sketch after this list).
- Human‑in‑the‑loop: let engineers inspect the context pack and request expansions. This trains the orchestrator to pick better defaults.
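A minimal redaction pass might look like the sketch below; the patterns are illustrative starting points, not a complete PII policy.

```python
# Sketch: drop obvious secrets and PII before packing context.
# The regexes are illustrative starting points, not a complete policy.
import re

REDACTIONS = [
    (re.compile(r'(?i)(api[_-]?key|secret|token)\s*[:=]\s*\S+'), r'\1=<redacted>'),
    (re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'), '<email>'),
    (re.compile(r'\b\d{13,19}\b'), '<number?>'),  # possible card/account numbers
]


def redact(text: str) -> str:
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```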
Opinionated take: correctness is a retrieval problem first, a generation problem second
If your debugging assistant is not retrieving the right 500–2,000 lines and the right 5–20 log entries, model size and context window will not save you. You do not need a 200k‑token prompt. You need the correct 2k tokens.
The discipline of slicing and graph‑guided retrieval has existed for decades. Marry it with modern observability, and you get a pragmatic, high‑leverage system that beats naive whole‑repo prompting on cost, latency, and fix precision.
Make the model do the thinking, not the searching.
References and further reading
- M. Weiser, Program Slicing, IEEE TSE, 1984 (original concept introduced in 1981). Classic static slicing.
- B. Korel and J. Laski, Dynamic Program Slicing, Information Processing Letters, 1988.
- SWE‑bench: Can Language Models Resolve Real‑World GitHub Issues? (Jimenez et al., ICLR 2024). Benchmark for real‑world Python bug fixing; useful for measuring patch success.
- Tree‑sitter: fast incremental parsing for many languages.
- OpenTelemetry: vendor‑neutral observability framework for traces, metrics, logs.
- Drain3: online log template mining for deduping and summarization.
Implementation checklist
- Pick one language and build a minimal slicer using Tree‑sitter or the language’s AST.
- Build a symbol graph: defs/uses, calls, imports. Persist it.
- Define a context pack schema and write a packer that enforces budgets.
- Index logs with FTS and add trace retrieval for error spans.
- Wire an orchestrator that starts from a criterion, builds the pack, and prompts the model.
- Add caching and a simple UI to inspect and expand slices.
- Evaluate on a set of historical incidents before rolling out widely.
Adopt this, and you will stop feeding your whole repo to the AI. Your bills go down, your fixes get better, and your engineers regain control over the debugging loop.
