RAG for Code Debugging AI: Turning Logs and Diffs into Instant Fixes
Software teams don’t need another chatbot that explains what a NullPointerException is. They need fixes. Concretely: given a stack trace, a failing test, and the last few months of diffs and PR discussions, the system should propose a small, safe patch and run a targeted test plan. Retrieval-augmented generation (RAG) is the correct backbone for this, but only if we treat engineering artifacts as first-class signals and design for real-world constraints like private code, tenant isolation, latency, and regression risk.
This article lays out a practical, opinionated blueprint for building a private RAG stack that turns logs and diffs into instant fixes. We will focus on:
- Data: ingesting stack traces, test failures, and PR history into a robust schema
- Retrieval: hybrid and structure-aware search tailored to code and CI noise
- Generation: prompts, patch formats, and execution sandboxes
- Guardrails: policy, evaluation, and risk scoring to keep changes safe
- Operations: cost, latency budgets, and observability in production
If you already have CI logs, a code host, and a vector database, you can get a first version working in a week. The difference between a demo and a dependable system is in the details below.
Why RAG for code debugging is different
Generic RAG retrieves public webpages and synthesizes a prose answer. Code debugging RAG operates on:
- Highly structured artifacts: stack traces, JUnit XML, diff hunks, code symbols, PR review threads
- Strong priors: the fix usually localizes near the top stack frame, the changed file in the last PR touching that module, or a config toggle mentioned in the error
- Hard correctness constraints: any change must compile, pass tests, and respect coding standards
- Tenant privacy and organizational context: on-prem code, internal libraries, and tribal knowledge in PR comments
That means your RAG system must be:
- Private by default: data never leaves your trust boundary
- Schema- and relation-aware: not just bag-of-words vectors
- Feedback-driven: test outcomes, revert signals, and developer votes must update ranking and generation behaviors
System overview
A minimal, pragmatic architecture:
- Ingest
  - CI logs (build output, JUnit XML, failing tests)
  - Runtime errors (Sentry-like events, stack traces)
  - Git history (commits, PRs, diffs, review comments)
  - Code symbols (function signatures, types) and dependency edges (call graph when feasible)
- Normalize
  - Canonical schemas per artifact with cross-links (e.g., a stack trace frame links to a code symbol and to PRs that last modified that file)
- Index
  - Vector: embeddings for code, diffs, and error messages
  - Lexical: BM25 for exact error tokens and identifiers
  - Symbol index: function and class lookup
  - Graph edges for temporal and structural proximity
- Retrieve
  - Build a multi-channel query from the latest failure: top stack frames, error type, failing test name, repo, branch, and recent PRs touching those files
  - Hybrid search with filters (repo, branch, visibility)
  - Rerank with a cross-encoder that understands code and diffs
- Generate
  - Constrained patch proposal in a sandboxed workspace
  - Targeted test execution and lints
  - Produce a PR or commit message with references back to context
- Guardrail
  - Diff risk scoring, policy checks, static analysis gates
  - Strict fail-safe: comment with analysis if patch risk exceeds thresholds or tests cannot be reproduced deterministically
- Observe and learn
  - Capture telemetry on retrieval quality, patch outcomes, reverts
  - Continuous evals on a corpus of known failures and synthetic mutants
Data: schemas that survive real failures
Most debugging context arrives messy. A reliable RAG system imposes a structure that can be indexed and joined. Below is a recommended document schema set and link strategy.
Core document types
- stack_trace
  - fields: tenant_id, repo_id, commit_sha, branch, service, env, error_type, error_message, frames, timestamp, incident_id, ci_job_id
  - frames: list of { file_path, function, line, module, language, in_repo, symbol_id }
- test_failure
  - fields: tenant_id, repo_id, commit_sha, branch, test_suite, test_name, classname, failure_message, failure_type, stack_trace_id, artifacts_path, junit_xml_path, duration_ms, timestamp, ci_job_id
- pr
  - fields: tenant_id, repo_id, pr_number, title, description, authors, reviewers, labels, created_at, merged_at, merge_commit_sha
- diff_hunk
  - fields: tenant_id, repo_id, pr_number, file_path, hunk_id, before_range, after_range, patch_text, summary, symbols_touched, risk_score_baseline
- code_symbol
  - fields: tenant_id, repo_id, path, language, symbol_name, kind (function, class, method), signature, start_line, end_line, last_modified_commit, code_excerpt
- ci_job
  - fields: tenant_id, repo_id, provider, job_id, workflow_name, status, logs_path, artifacts_path, started_at, ended_at, commit_sha, branch
- incident
  - fields: tenant_id, incident_id, severity, tags, primary_error_type, first_seen, last_seen, frequency, related_stack_trace_ids
Edges and joins
- stack_trace.frames[].symbol_id -> code_symbol
- code_symbol.last_modified_commit -> commit -> pr
- test_failure.stack_trace_id -> stack_trace
- diff_hunk.symbols_touched -> code_symbol
- ci_job -> test_failure -> stack_trace -> code_symbol
Represent edges explicitly to enable graph-style expansion. Even if you use a document store, keep a lightweight adjacency collection:
```yaml
# edge documents
- from_id: 'stack_trace:abc123'
  to_id: 'code_symbol:repoA/src/foo.py:Foo.bar'
  relation: 'mentions'
  weight: 0.9
- from_id: 'code_symbol:...'
  to_id: 'pr:42'
  relation: 'touched_by'
  weight: 0.8
```
Normalization tips
- Map frame file paths to repo canonical paths; resolve symlinks and monorepo workspaces
- Deduplicate identical stack traces by error signature + top N frames and track a frequency counter (see the sketch after this list)
- Merge PR review comments into the diff_hunk context; often the rationale for a change is in comments
- Extract configuration references (flags, env keys) with simple regex into a normalized table; they are frequently the cause of flaky tests and env-specific errors
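As a sketch of the deduplication tip above, one way to build an error signature from the error type and the top N in-repo frames; dropping line numbers so that small edits do not split groups is an assumption to adjust per team:

```python
import hashlib

def error_signature(error_type, frames, top_n=5):
    # Keep only file path and function name of the top N in-repo frames so
    # line-number churn from unrelated edits does not break deduplication.
    parts = [error_type]
    for fr in frames[:top_n]:
        if fr.get('in_repo', True):
            parts.append(f"{fr['file_path']}:{fr.get('function', '')}")
    return hashlib.sha256('|'.join(parts).encode('utf-8')).hexdigest()
```

Store the signature on the stack_trace document and bump its frequency counter on collision instead of indexing a duplicate.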
Ingestion pipelines
You want deterministic, resumable ETL with clear SLAs. Dagster or Airflow works well; for simpler orgs, a handful of idempotent workers is enough.
Connectors
- Git hosting: GitHub/GitLab APIs for PRs, commits, review comments, and diff patches
- CI: GitHub Actions, GitLab CI, Jenkins, CircleCI; collect logs, artifacts, and JUnit XML
- Error tracking: Sentry-like webhook or your own error aggregator. Even plain stderr logs from Kubernetes can work if you parse them
Example: minimal GitHub Actions step exporting JUnit XML and logs to object storage and posting metadata to your RAG service:
```yaml
name: test
on: [push, pull_request]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pytest -q --maxfail=1 --disable-warnings --junitxml=report.xml
      - name: upload artifacts
        if: always()  # keep the JUnit XML even when tests fail
        uses: actions/upload-artifact@v4
        with:
          name: junit
          path: report.xml
      - name: notify rag
        if: always()  # failures are exactly what we want to ingest
        run: |
          curl -sSf -X POST ${{ secrets.RAG_URL }}/ingest/ci \
            -H 'Authorization: Bearer ${{ secrets.RAG_TOKEN }}' \
            -F repo_id=${{ github.repository }} \
            -F commit_sha=${{ github.sha }} \
            -F branch=${{ github.ref_name }} \
            -F ci_job_id=${{ github.run_id }} \
            -F workflow_name='unit' \
            -F junit_xml=@report.xml \
            -F logs=@$GITHUB_STEP_SUMMARY
```
Parsing stack traces
For each language, implement a small parser producing canonical frames. Keep it simple; you can expand later.
- Python: parse lines matching File "path", line N, in function, plus the following code context line
- Java: parse at package.Class.method(File.java:line)
- Node: parse at function (file:line:col); support sourcemaps if present
- Go: parse function lines and the source lines that follow them
Attach language and in_repo flags for filtering. For each frame, attempt to resolve to a code_symbol by scanning your symbol index (built from ctags, tree-sitter, or LSP servers).
Example Python-ish sketch:
```python
import re

# Real CPython tracebacks quote the path: File "path", line N, in func
PY_FRAME = re.compile(r'File "(.+)", line (\d+), in (.+)')

def parse_python_trace(trace_text):
    frames = []
    for line in trace_text.splitlines():
        m = PY_FRAME.search(line)
        if not m:
            continue
        path, line_no, func = m.group(1), int(m.group(2)), m.group(3)
        frames.append({
            'file_path': path,
            'line': line_no,
            'function': func,
            'language': 'python',
        })
    return frames
```
Parsing test failures
Most frameworks export JUnit XML. Extract suite, classname, test name, failure type, and message.
```python
from xml.etree import ElementTree as ET

def read_junit(path):
    doc = ET.parse(path)
    failures = []
    for tc in doc.iterfind('.//testcase'):
        # Check for None explicitly: Element truthiness depends on child count,
        # so `find('failure') or find('error')` can silently skip real failures.
        failure = tc.find('failure')
        if failure is None:
            failure = tc.find('error')
        if failure is None:
            continue
        failures.append({
            'test_suite': tc.get('classname'),
            'test_name': tc.get('name'),
            'failure_type': failure.get('type') or 'Failure',
            'failure_message': (failure.text or '').strip()[:4000],
        })
    return failures
```
Git diffs and PR history
For each PR, persist:
- Hunk-level patches
- Touched symbols (mapped by line-range overlap; see the sketch below)
- A short summary of each hunk, generated offline with a small local model
- Review threads, keyed by file and range
- Temporal metadata: created_at, merged_at, deployment windows if available
Chunk diffs at the hunk level; this is the natural unit to retrieve later.
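A sketch of the line-range overlap mapping referenced above; after_range is assumed to be the (start, length) pair from the hunk's @@ header, and symbols come from the symbol index described next:

```python
def symbols_touched(hunk, symbols):
    # A hunk touches a symbol when its post-image line range overlaps the symbol's range.
    start, length = hunk['after_range']  # assumption: (start, length) from the @@ header
    end = start + max(length, 1) - 1
    touched = []
    for sym in symbols:
        if sym['path'] != hunk['file_path']:
            continue
        if sym['start_line'] <= end and start <= sym['end_line']:
            touched.append(sym['symbol_name'])
    return touched
```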
Symbol index
Produce code_symbol docs by scanning all repositories.
- Use tree-sitter or ctags to enumerate functions, methods, classes, and their line ranges
- Store signature, language, and an excerpt (e.g., first 30 lines)
- Maintain a per-repo, per-branch index, and optionally per-commit snapshots for hot branches
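As a single-language sketch of this scan, the standard library ast module covers Python repositories; tree-sitter or ctags generalizes the same idea across languages. Field names follow the code_symbol schema above, with tenant_id and signature omitted for brevity:

```python
import ast
from pathlib import Path

def scan_python_symbols(repo_root, repo_id, branch):
    symbols = []
    for path in Path(repo_root).rglob('*.py'):
        source = path.read_text(errors='ignore')
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue
        lines = source.splitlines()
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                kind = 'class' if isinstance(node, ast.ClassDef) else 'function'
                end = getattr(node, 'end_lineno', node.lineno)
                symbols.append({
                    'repo_id': repo_id,
                    'branch': branch,
                    'path': str(path.relative_to(repo_root)),
                    'language': 'python',
                    'symbol_name': node.name,
                    'kind': kind,
                    'start_line': node.lineno,
                    'end_line': end,
                    # excerpt: first 30 lines starting at the definition
                    'code_excerpt': '\n'.join(lines[node.lineno - 1:node.lineno - 1 + 30]),
                })
    return symbols
```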
Indexing: hybrid and structure-aware
Search for debugging is not purely semantic. The top-1 signal is often the exact error token or identifier match. The second is temporal proximity. Semantic similarity helps connect noisy failure messages to relevant PR commentary or config toggles. Use a hybrid approach:
- Lexical index (BM25 or equivalent): for error tokens, identifiers, file paths
- Vector index: embeddings per doc chunk, with model specialized or at least competent on code and diffs
- Metadata filters: repo, branch, language, visibility
- Multi-vector strategy: create multiple embeddings per doc (raw text, code-only, and summary); use multi-vector retrieval if your vector DB supports it
- Graph edges: for post-retrieval expansion (e.g., hunk -> corresponding code_symbol -> PR)
Embeddings
Good defaults as of 2024:
- Code-aware embeddings for code and diffs: options include Jina Code v2, Voyage code, or e5-base for mixed text/code
- General text embeddings for error messages and PR comments: text-embedding-3-large or comparable private model
- Keep dimensionality ~768–1024; HNSW index with M ~ 16–32, efConstruction ~ 128–256
Store per-chunk vectors with fields:
- doc_id, chunk_id
- type: stack_trace, test_failure, diff_hunk, code_symbol, pr_comment
- text, code, summary vectors
- metadata: repo_id, branch, commit_sha, file_path, symbol_id, timestamp, severity
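As one concrete setup, a sketch using Qdrant's Python client with named vectors for the multi-vector strategy; the collection name, vector sizes, and field choices are illustrative:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url='http://localhost:6333')

client.create_collection(
    collection_name='diff_hunks',
    vectors_config={
        # one named vector per representation, as in the multi-vector strategy
        'text': models.VectorParams(size=1024, distance=models.Distance.COSINE),
        'code': models.VectorParams(size=768, distance=models.Distance.COSINE),
    },
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=128),
)

# payload indexes keep filtered searches (repo, branch, type) fast
client.create_payload_index(
    collection_name='diff_hunks',
    field_name='repo_id',
    field_schema=models.PayloadSchemaType.KEYWORD,
)
```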
Lexical index
- Tokenize identifiers by splitting camelCase and snake_case (see the sketch after this list)
- Preserve file paths and error codes as exact tokens
- Maintain per-repo shards to keep posting lists tight
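A sketch of the identifier analyzer, splitting on camelCase boundaries and common separators before lowercasing; keeping the original string as an exact token is an assumption that favors precise matches:

```python
import re

# split between a lowercase/digit and an uppercase letter, and inside acronym runs (HTTPServer -> HTTP, Server)
CAMEL = re.compile(r'(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')

def tokenize_identifier(identifier):
    # OrderServiceV2.create_order -> ['order', 'service', 'v2', 'create', 'order', ...]
    tokens = []
    for part in re.split(r'[._\-/]+', identifier):
        for piece in CAMEL.split(part):
            if piece:
                tokens.append(piece.lower())
    # keep the original as an exact token too, so precise matches still win
    tokens.append(identifier)
    return tokens
```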
Temporal weighting
Boost items that occurred around the failure window (e.g., merged within last 30 days). Temporal recency frequently correlates with causality.
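One simple way to apply this, assuming you can attach a score multiplier at ranking time, is an exponential decay with a tunable half-life:

```python
import math

def temporal_boost(base_score, age_days, half_life_days=30.0):
    # Halve the extra boost every half_life_days; for old items the decay
    # approaches zero, so the multiplier bottoms out at 1.0.
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    return base_score * (1.0 + decay)
```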
Retrieval: turning a failure into a context plan
Retrieval starts with a query builder that takes a single failure and produces multiple sub-queries with weights. For example, given a test failure with a stack trace:
- Query A: error_type + normalized error_message tokens (lexical boost)
- Query B: top stack frame symbol_id and file_path (lexical + vector)
- Query C: recent PR hunks touching the file_path or symbol_id (temporal boost)
- Query D: test_name and suite for known flakiness or previous fixes
Pseudocode outline:
```python
def build_queries(failure, repo_id, branch):
    tokens = important_tokens(failure['failure_message'])
    frames = failure['frames'][:3]
    q = []
    # A: error tokens
    q.append({
        'chan': 'lexical',
        'query': ' '.join(tokens),
        'filters': {'repo_id': repo_id, 'branch': branch},
        'weight': 1.0,
    })
    # B: top frame symbol or file
    for fr in frames:
        q.append({
            'chan': 'hybrid',
            'query': f"{fr.get('function') or ''} {fr['file_path']}",
            'filters': {'repo_id': repo_id, 'branch': branch},
            'weight': 1.2,
        })
    # C: recent PRs touching these paths
    for fr in frames:
        q.append({
            'chan': 'temporal',
            'query': fr['file_path'],
            'filters': {'repo_id': repo_id, 'branch': branch,
                        'type': 'diff_hunk', 'merged_after_days': 45},
            'weight': 1.3,
        })
    # D: test name
    q.append({
        'chan': 'lexical',
        'query': f"{failure['test_suite']} {failure['test_name']}",
        'filters': {'repo_id': repo_id},
        'weight': 0.7,
    })
    return q
```
Execute each sub-query, take top-k (e.g., k=25 per channel), then dedupe and rerank globally with a cross-encoder tuned for code and diffs. A good, simple reranking input pairs the failure synopsis with each candidate chunk:
- Left: normalized failure summary including error_type, top 2 frames, test_suite/test_name
- Right: candidate text (diff hunk summary + patch, code excerpt, or PR comment)
Keep final context small but rich: 10–20 chunks is plenty when each chunk is specific (hunks, symbol excerpts, review comments). Add a few graph expansions: for each selected diff_hunk, pull the code_symbol and one parent hunk for surrounding context.
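A sketch of the fuse-then-rerank step described above. The search_fn hook, the chunk fields, and the cross-encoder model name are placeholders; swap in a code-aware reranker in practice:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  # placeholder; prefer a code-aware model

def retrieve_and_rerank(queries, search_fn, final_k=15):
    # search_fn(query, k) runs one sub-query against the hybrid index and returns scored chunks
    seen, candidates = set(), []
    for q in queries:
        for hit in search_fn(q, k=25):
            if hit['chunk_id'] in seen:
                continue
            seen.add(hit['chunk_id'])
            candidates.append(hit)
    if not candidates:
        return []
    # left side of each pair: the failure synopsis; reusing the error-token query is a simplification
    failure_summary = queries[0]['query']
    pairs = [(failure_summary, c['text']) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [c for c, _ in ranked[:final_k]]
```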
Mandatory filters and negative constraints
Always filter by repo_id and branch unless you intentionally search across repos. Avoid cross-tenant leakage. Add negative constraints:
- Exclude generated files and vendored dependencies unless the stack trace points there
- Exclude secrets or files tagged sensitive
Aggregated context pack
Construct a structured context object for the generator, not just a blob of text:
```python
context = {
    'failure': {
        'error_type': 'AssertionError',
        'message': 'expected 200 got 500',
        'top_frames': [
            {'file': 'src/api/order.py', 'line': 128, 'func': 'create_order'},
            {'file': 'src/db/tx.py', 'line': 42, 'func': 'commit'},
        ],
        'test': {'suite': 'tests.api', 'name': 'test_create_order_happy_path'},
    },
    'candidates': {
        'diff_hunks': [...],
        'symbols': [...],
        'pr_comments': [...],
    },
    'repo': {'id': 'shop/api', 'branch': 'main', 'commit_sha': 'abc123'},
}
```
Feeding the model a structured context object rather than a blob of text makes it easier to steer, especially when tool calls are involved.
Generation: from context to a patch that actually lands
Generation is where many systems fail by over-editing or hallucinating. Keep it disciplined.
Output contract
Ask the model for a minimal unified diff patch plus a short rationale. Use a strict schema, or function-calling if available.
- Files changed must exist and be within retrieved files unless explicitly allowed
- Max changed lines per file (e.g., 30)
- No new dependencies unless the context includes a relevant PR that added them
Prompt skeleton (abridged):
System: You are a senior engineer tasked with proposing a minimal patch to fix the described failure. Only edit files in the allowed set. The patch must compile and satisfy static checks.
User:
- Failure summary: ...
- Top frames: ...
- Test: ...
- Allowed files: src/api/order.py, src/db/tx.py
- Retrieved context (diff hunks, symbols, PR notes):
1) Diff hunk summary: ...
2) Code excerpt: ...
Return:
- Rationale (3 bullet points)
- Patch (unified diff)
- Targeted tests to run (list)
Constrained decoding
If you can, constrain output to a patch grammar to avoid leakage or chatty outputs. At minimum, enforce a post-processor that:
- Validates unified diff format
- Ensures only allowed files were modified
- Checks line ranges exist and apply cleanly
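A sketch of such a post-processor, assuming the unidiff package for parsing and an explicit allowed-file set taken from the prompt:

```python
from unidiff import PatchSet

def validate_patch(patch_text, allowed_files, max_lines_per_file=30):
    try:
        patch = PatchSet(patch_text)
    except Exception:
        return False, 'not a valid unified diff'
    for pfile in patch:
        if pfile.path not in allowed_files:
            return False, f'file not allowed: {pfile.path}'
        if pfile.added + pfile.removed > max_lines_per_file:
            return False, f'too many changed lines in {pfile.path}'
    return True, 'ok'
```

Whether the patch applies cleanly is then confirmed in the sandbox (for example with git apply --check) before running the test plan.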
Sandbox execution and test plan
Before surfacing any patch to humans, run it in an ephemeral, isolated workspace:
- Clone repo at commit_sha
- Apply patch with three-way merge if necessary
- Run a minimal test plan: failing test + its direct dependents + smoke tests
- Run static checks: linters, type checkers
A small orchestrator sketch:
```python
import os
import subprocess
import tempfile

def apply_and_test(repo_url, commit_sha, patch_text, tests):
    work = tempfile.mkdtemp()
    subprocess.check_call(['git', 'clone', '--filter=blob:none', repo_url, work])
    subprocess.check_call(['git', 'checkout', commit_sha], cwd=work)
    patch_path = os.path.join(work, 'fix.patch')
    with open(patch_path, 'w') as f:
        f.write(patch_text)
    # apply patch
    subprocess.check_call(['git', 'apply', '--index', patch_path], cwd=work)
    # run targeted tests (dependency install omitted for brevity)
    r = subprocess.call(['pytest', '-q'] + tests, cwd=work)
    return r == 0
```
Tool use: repo-aware actions
Expose a minimal toolset to the model or the orchestrator:
- read_file(path, start_line, end_line)
- list_callers(symbol_id) or show_refs(path, symbol)
- run_tests(test_names)
- format_code(file)
Whether the model or the orchestrator calls tools is an architectural choice. A pragmatic approach: do retrieval and tool execution outside the model and keep the model focused on code synthesis and small planning steps.
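A sketch of the orchestrator-side registry under that approach; read_file is the only tool implemented here, and the bounded slice is an assumption to keep prompts small:

```python
import os

def read_file(workspace, path, start_line=1, end_line=200):
    # Return a bounded slice so the model never receives an entire large file
    with open(os.path.join(workspace, path)) as f:
        lines = f.readlines()
    return ''.join(lines[start_line - 1:end_line])

TOOLS = {
    'read_file': read_file,
    # 'list_callers': backed by the symbol/call-graph index
    # 'run_tests':    backed by the sandbox orchestrator above
}
```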
Guardrails: keep changes safe and reviews happy
Guardrails turn a clever demo into a dependable assistant.
Policy constraints
- Only modify files surfaced by retrieval or within N lines of a retrieved symbol
- Max total lines changed per patch (e.g., 80)
- No changes to public API signatures unless the test clearly indicates a backward-incompatible bug and there is prior art in the retrieved context
- Require successful local tests and lints
Static analysis and type checks
Gate the patch through:
- Language-specific linters (flake8, eslint, golangci-lint)
- Type checkers (mypy, tsc)
- Security scanners for obvious anti-patterns (e.g., disable TLS verification)
Diff risk scoring
Score a patch before running tests. High-risk diffs trigger safe-mode (analysis-only comment or small refactor suggestion).
Signals for risk:
- Touches many files or widely used symbols (from call graph)
- Changes in core packages (e.g., auth, billing)
- Low retrieval confidence (reranker scores) or no matching prior PR context
- Adds or removes conditionals around critical paths
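A sketch that folds these signals into one score with hand-tuned weights; every threshold and weight below is an assumption to calibrate against your own revert history:

```python
CORE_PACKAGES = ('src/auth/', 'src/billing/')  # assumption: org-specific critical paths

def diff_risk(changed_files, lines_changed, caller_count, max_reranker_score):
    """Return a rough 0..1 risk score; higher means prefer analysis-only mode."""
    score = 0.0
    score += 0.15 * max(0, len(changed_files) - 1)   # breadth: many files touched
    score += 0.005 * lines_changed                   # size of the patch
    score += 0.05 * min(caller_count, 10)            # blast radius from the call graph
    if any(f.startswith(CORE_PACKAGES) for f in changed_files):
        score += 0.4                                 # core packages are high stakes
    if max_reranker_score < 0.3:                     # weak retrieval evidence
        score += 0.3
    return min(score, 1.0)

SAFE_MODE_THRESHOLD = 0.7  # above this, post analysis instead of a patch
```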
PII and secret hygiene
- Redact secrets before indexing logs (see the sketch after this list)
- Strip request bodies and cookies unless explicitly allowed
- Apply tenant isolation: ACL checks on every retrieval and before attaching context to prompts
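A sketch of log redaction before indexing; the patterns are illustrative, not exhaustive, and should be layered with a dedicated secret scanner:

```python
import re

REDACTIONS = [
    (re.compile(r'(?i)(authorization:\s*bearer)\s+\S+'), r'\1 [REDACTED]'),
    (re.compile(r'(?i)(api[_-]?key|secret|password|token)\s*[=:]\s*\S+'), r'\1=[REDACTED]'),
    (re.compile(r'AKIA[0-9A-Z]{16}'), '[REDACTED_AWS_KEY]'),
]

def redact(log_text):
    for pattern, replacement in REDACTIONS:
        log_text = pattern.sub(replacement, log_text)
    return log_text
```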
Evals: prove it works and keeps working
RAG systems drift. Evaluations must be continuous and actionable.
Offline evals
Build a corpus that mimics your org’s failures:
- Real failures: sample CI jobs and incidents with known human fixes
- Synthetic faults: mutation testing (e.g., mutmut for Python, PIT for Java) to create controlled, fixable bugs
For each case, store:
- Inputs: stack trace, failing test XML, repo state
- Gold contexts: hunks, symbols, or PRs that humans consulted (derived from blame and PR references)
- Gold fix: final patch and test outcomes
Metrics:
- Retrieval: recall@k, nDCG@k against gold contexts (see the sketch after this list); time-to-first-relevant
- Generation: patch success rate (applies cleanly and fixes failure), test pass rate, minimality (LOC changed), revert rate (proxy through manual review feedback if available)
- Safety: static check pass rate; false edit rate (patch produced when only analysis was allowed)
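A minimal sketch of the retrieval metrics above, assuming binary relevance against gold context IDs:

```python
import math

def recall_at_k(retrieved_ids, gold_ids, k):
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in gold_ids)
    return hits / max(len(gold_ids), 1)

def ndcg_at_k(retrieved_ids, gold_ids, k):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved_ids[:k]) if doc_id in gold_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold_ids), k)))
    return dcg / ideal if ideal else 0.0
```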
Ablations to run:
- With vs without PR comment contexts
- With vs without temporal boost
- Hybrid vs vector-only
- Reranker families: general text vs code-aware
Online evals
- Shadow mode: generate patches but never post; compare against human fixes
- Canary deployments on low-risk repos or branches
- A/B during work hours only; respect developer load
Track:
- Mean time to mitigation on flaky vs deterministic failures
- Developer acceptance rate (patch merged or used as starting point)
- Regression incidents linked to AI-suggested patches
Latency and cost: budgets that fit CI
You need results under a few minutes to be useful during CI or within an IDE session. A concrete budget:
- Ingestion and parsing: amortized, but within 30–60 s of job completion
- Retrieval per query: 50–150 ms for lexical + vector; 150–300 ms with reranker on top 100 candidates
- Generation: 1–8 s depending on model and patch size
- Sandbox: targeted tests under 60–120 s; keep it small by selecting only failing tests and closest dependencies
Cost levers:
- Cache embeddings for recurring error patterns and top frames
- Use a small local reranker for candidate narrowing; run a larger reranker only if scores are borderline
- Prefer local or VPC-hosted models for generation where compliance demands it; use distillation to smaller code models for patch formats
Tooling stack: boring, reliable choices
- Vector DB: Qdrant, Weaviate, Milvus, or pgvector with HNSW. Pick one with hybrid search support and per-collection filters
- Orchestration: Dagster or Airflow for ETL; Celery or a lightweight queue for on-demand patch runs
- LLM serving: vLLM or TGI for on-prem; or use a vendor with VPC peering
- Reranking: a cross-encoder that understands code and diffs; Cohere rerank or a fine-tuned MiniLM/CodeT5+ variant
- Parsing and symbols: tree-sitter for robust, multi-language ASTs
- Telemetry: OpenTelemetry traces for end-to-end requests; Prometheus for RAG-specific counters (retrieval hits, reranker distributions, patch outcome labels)
Worked example: from a failure to a minimal patch
Scenario: a Python service fails a test.
- Failure: tests.api::test_create_order_happy_path fails with AssertionError: expected 200 got 500
- Stack trace: top frame in src/api/order.py: create_order at line 128; next frame src/db/tx.py: commit line 42
Retrieval plan:
- Lexical search for error tokens: AssertionError, expected 200 got 500, create_order
- Hybrid search for file path src/api/order.py and symbol create_order
- Temporal search for PRs touching src/api/order.py in the last 45 days
Results:
- diff_hunk A: PR 842 changed error handling in create_order; added a try/except that returns 500 on any exception
- pr_comment: reviewer noted a missing validation before DB commit
- code_symbol: create_order signature shows it returns a tuple (payload, status)
The context summary presented to the model suggests catching only a specific exception or moving validation before the commit.
Proposed patch (sketch):
--- a/src/api/order.py
+++ b/src/api/order.py
@@ def create_order(req):
- try:
- order = validate_and_build(req)
- db.save(order)
- tx.commit()
- return {'id': order.id}, 200
- except Exception as e:
- logger.exception('create_order failed: %s', e)
- return {'error': 'internal'}, 500
+ order = validate_and_build(req)
+ try:
+ db.save(order)
+ tx.commit()
+ except DbConflictError as e:
+ logger.info('conflict: %s', e)
+ return {'error': 'conflict'}, 409
+ return {'id': order.id}, 200
Sandbox run:
- Apply patch
- Run tests.api::test_create_order_happy_path and tests.api::test_create_order_conflict
- Lints and type checks pass
Guardrail checks:
- Lines changed under limit
- Allowed files only
- Diff risk low (single endpoint, two symbols)
The assistant posts the patch to a draft PR with rationale: restoring 200 path for happy case, narrowing exception scope to DbConflictError, and moving validation out of the blanket try/except.
Privacy and security: private by construction
- Tenant isolation: every document is tagged with tenant_id and repo_id; all queries must include tenant_id filters (enforced as in the sketch at the end of this list)
- Encryption: at rest for object storage and vector DB; in transit with mTLS
- Egress control: no external calls from generation path without explicit allowlists; build-time checks in CI
- Audit: log every retrieval and generation event with hashed context IDs, not raw content
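A sketch of the query-time enforcement from the first bullet; index.search is a stand-in for whatever retrieval client you use:

```python
def search_with_isolation(index, query, filters, tenant_id):
    # Fail closed: every retrieval must carry the caller's tenant_id,
    # and the filter cannot be overridden by upstream query builders.
    if not tenant_id:
        raise PermissionError('retrieval without tenant_id is not allowed')
    scoped = dict(filters or {})
    scoped['tenant_id'] = tenant_id
    return index.search(query=query, filters=scoped)
```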
Failure modes and mitigations
- Wrong branch context: always include branch in filters; if a symbol is missing on the current branch, fall back to the commit_sha snapshot
- Stale index: incrementally update on every merge; schedule nightly re-index for symbols
- Flaky tests: detect historical flakiness and switch to analysis-only mode with a proposed flake quarantine plan
- Over-broad patches: cap line count, penalize large diffs in scoring, require stronger retrieval evidence when expanding beyond top-1 frame
- Non-reproducible errors: run in container images matching CI; if still non-reproducible, switch to analysis-only report with environment diff suggestions
Minimal service skeleton
Expose a simple endpoint for CI or incident responders to request a fix proposal.
```python
from fastapi import FastAPI, Body

app = FastAPI()

@app.post('/propose-fix')
async def propose_fix(payload: dict = Body(...)):
    # 1) Build failure object
    failure = payload['failure']
    repo = payload['repo']
    # 2) Retrieve context
    queries = build_queries(failure, repo['id'], repo['branch'])
    candidates = retrieve_and_rerank(queries)
    context = assemble_context(failure, candidates, repo)
    # 3) Ask model for patch under constraints
    patch = generate_patch(context)
    if not patch:
        return {'mode': 'analysis_only', 'reason': 'no_safe_patch'}
    # 4) Sandbox
    ok = apply_and_test(repo['url'], repo['commit_sha'], patch['diff'], patch['tests'])
    if not ok:
        return {'mode': 'analysis_only', 'reason': 'tests_failed', 'patch': patch}
    # 5) Return patch and rationale
    return {'mode': 'patch', 'patch': patch}
```
Configuration example for connectors in YAML:
```yaml
repos:
  - id: shop/api
    url: git@github.com:org/shop-api.git
    language: python
    default_branch: main
connectors:
  ci:
    provider: github_actions
    bucket: s3://org-ci-artifacts
  code_host:
    provider: github
    app_installation_id: 123456
retrieval:
  vector_db: qdrant
  embedding_models:
    code: jina-code-v2
    text: text-embedding-3-large
  reranker: cohere-rerank-3
  k: 25
  final_k: 15
  temporal_days: 45
policy:
  max_lines_changed: 80
  allowed_file_patterns:
    - '^src/'
    - '^tests/'
```
Roadmap: compounding gains
- Better localization: train a small classifier to predict the most probable fix file from failure features
- CRAG (compressive RAG): pre-summarize long PR threads into factual kernels to shrink context while preserving rationale
- Delta indexing: update only changed symbols and hunks per merge to keep ingestion cheap
- Self-play: generate synthetic bugs by mutating recent diffs, then let the system fix them; add successful cases to training/eval sets
- IDE integration: let developers trigger propose-fix locally with cached retrieval contexts; upload sanitized traces when allowed
Final take
Most debugging is a retrieval problem disguised as code generation. Once you feed the model the right hunks, symbols, and review notes, the patch is usually surgical. The craft is in building a private, relation-aware retrieval stack; constraining generation to minimal patches; and wrapping it with tests and policy gates that mirror how your team already works.
Start small: index your last 90 days of PRs and CI failures, wire up a strict patch contract, and run in shadow mode. Measure retrieval recall@k and patch success on a fixed set of incidents. Iterate on chunking and reranking before swapping models. The fastest path to reliable, org-aware fixes is a boring, well-instrumented RAG pipeline that treats logs and diffs as first-class citizens.