Time-Travel RAG for Code Debugging AI: Build-Specific Context to Fix Regressions Fast
Debugging regressions is hard enough for humans. For an AI, it borders on impossible without the exact context that produced the failure. The code has likely changed since the failing build, environment flags have flipped, dependencies have shifted, and the logs you care about are buried in a thousand lines of unrelated output.
The fix is simple to state but non-trivial to implement: give your debugging AI the precise, immutable context of the failing build—code, logs, configs, dependencies, even the container image digest—and make it trivially retrievable by build ID or commit. This is Time-Travel Retrieval Augmented Generation (RAG): an AI agent that can “time-travel” to the system state at the moment of failure, reason locally, and propose a precise, reproducible patch.
Below I outline a practical, production-grade design for a repo-time-indexed knowledge base integrated with your CI. We’ll cover the data model, indexing pipeline, retrieval strategies, guardrails against hallucination, suggested prompts, patch generation and validation, security, costs, and metrics. Code snippets are included for GitHub Actions, Jenkins, and a reference indexer/retriever service.
TL;DR
- Build a time-indexed, repo-scoped knowledge base. For every build, persist: source snapshot (or commit + submodule lock), logs, test reports, configs, env variables, container image digests, SBOM/lockfiles, and key artifacts.
- Use hybrid retrieval: deterministic (by build_id/commit), symbolic (stack traces → files/symbols), and vector search (embeddings on code/logs) with strong metadata filters.
- Enforce temporal grounding: the AI can only read artifacts pinned to the failing build’s state. No fetching from HEAD or the internet during patch planning.
- Close the loop: propose patch → apply on a branch → run the same tests in the same container → report back diffs and status → iterate.
- Measure: top-k retrieval accuracy of ground-truth file, time-to-first-green, and patch acceptance rate.
Why regressions are hard (and why RAG helps)
- Code drift: The commit that failed is not the one currently on main. Suggestions based on HEAD are often wrong.
- Environment drift: OS packages, toolchains, feature flags, secrets, and runtime env can change hour-to-hour.
- Observability drift: Logs are non-deterministic; pipeline steps differ across branches.
- Context overload: The relevant 200 lines are scattered across code, logs, and config.
RAG addresses the last point—but vanilla RAG is not enough. Retrieval must be constrained by time and repository state to avoid hallucinations and ensure reproducibility. The AI must only see what was true at the time of failure.
What is Time-Travel RAG?
Time-Travel RAG is a retrieval-augmented system where the knowledge base is indexed by repository, commit, and build. The AI retriever fetches:
- The exact code version (by Git commit SHA) and file contents.
- Build metadata: CI provider, job IDs, timestamps, branch, PR.
- Logs: build logs, test logs, stack traces, coverage reports.
- Config: env vars, flags, config files, secrets metadata (redacted), CI YAML, infra manifests.
- Environment: container image digests, toolchain versions, SBOM/lockfiles.
- Artifacts: compiled outputs (when safe), symbol maps, crash dumps, core stack traces.
Crucially, retrieval is filtered by build_id or commit_sha so the LLM cannot drift to non-deterministic context.
Architecture Overview
Components:
- CI Event Listener (webhook/queue)
  - Receives build_started/build_finished events from GitHub Actions, Jenkins, Buildkite, CircleCI, etc.
  - Triggers snapshot and indexing jobs.
- Snapshotter
  - Resolves repo, commit SHA, submodules/LFS, and workspace state.
  - Captures: source tree (or references), env, CI config, toolchain versions, container digests.
  - Produces a manifest and uploads artifacts to object storage (e.g., s3://ci-artifacts/<repo>/<build_id>/...).
- Indexer
  - Tokenizes and chunks code, logs, and configs; builds embeddings and inverted indices.
  - Annotates with metadata: {repo, build_id, commit_sha, branch, path, language, symbol, test_name, step_id, timestamp}.
  - Stores in: document store (Postgres/Elastic), vector DB (FAISS/Weaviate/Milvus/PGVector), and an object storage catalog.
- Retriever API
  - Query planner parses failure content (stack traces, error messages) into structured intents.
  - Hybrid retrieval: metadata filters + BM25 + vector search + symbol-aware expansion.
  - Temporal guard: restricts to the given build_id/commit.
- Debugging Agent
  - Runs with a deterministic system prompt enforcing time-travel constraints.
  - Generates patches, proposes changes, links to provenance.
- Executor/Sandbox
  - Applies patches on a branch pinned to the original commit.
  - Replays tests in the same container image; reuses caches when safe.
  - Reports results (pass/fail, new logs) back to the agent.
- Audit/Provenance
  - Stores all retrieval URIs and checksums alongside the generated patch for traceability.
Data Model: Build-Scoped Knowledge Graph
Represent a build as a first-class, queryable entity with strict immutability.
- Build
  - build_id: string
  - repo: string
  - commit_sha: string
  - branch/tag/PR: string
  - timestamp: RFC3339
  - ci_provider: enum
  - status: enum
  - container_image_digests: list<string>
  - env_snapshot: map<string,string> (redacted-at-source for secrets)
  - manifests: list<URI> (CI YAML, infra configs)
  - sbom/spdx/cyclonedx: URI
- Sources
  - source_manifest: URI (includes Git URL, commit SHA, submodule SHAs, LFS pointers)
  - files: object storage paths keyed by repo/build_id/path
  - symbols: optional symbol graph (generated via tree-sitter/ctags)
- Logs & Reports
  - build_logs: list<URI>
  - test_reports: JUnit XML, coverage, crash dumps
  - step_logs: map<step_id, URI>
- Artifacts
  - artifact_manifest: URIs of build outputs, checksums
  - debug_symbols (if applicable)
- Index
  - doc_id → metadata (repo, build_id, path, lang, symbol, test_name)
  - embeddings vector
  - BM25/inverted index tokens
Partition indexes by repo and time for scale, with metadata filters that enforce build_id first and fall back to commit_sha.
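To make the model concrete, here is a minimal sketch of the Build record as a Python dataclass; field names mirror the list above, and `frozen=True` is one way to enforce the immutability requirement in application code. A production system would more likely back this with a Postgres table plus object-storage pointers, per the architecture above.

```python
# Minimal sketch of the Build record; field names follow the model above.
# frozen=True makes instances immutable once a build is recorded.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Build:
    build_id: str
    repo: str
    commit_sha: str
    ref: str                      # branch/tag/PR
    timestamp: str                # RFC3339
    ci_provider: str
    status: str
    container_image_digests: List[str] = field(default_factory=list)
    env_snapshot: Dict[str, str] = field(default_factory=dict)  # secrets redacted at source
    manifests: List[str] = field(default_factory=list)          # URIs: CI YAML, infra configs
    sbom_uri: Optional[str] = None                              # SPDX/CycloneDX document
```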
CI Integration: Capture the World, Cheaply and Safely
The goal: minimal friction for engineers and no unbounded cost growth.
Recommended capture scope per build:
- Git commit SHA, branch, PR number.
- Source snapshot: either store diffs or reference remote Git by SHA; also capture submodule/LFS SHAs.
- Container image digests (e.g., docker images --digests) and runtime versions (Node, Python, Java, Go).
- Lockfiles: package-lock.json, poetry.lock, Pipfile.lock, go.sum, Cargo.lock, Gemfile.lock, requirements.txt.
- Env snapshot: whitelist of non-sensitive vars; redact secrets at source.
- CI configs: .github/workflows/*.yml, Jenkinsfile, buildkite.yaml.
- Logs: build logs, test logs, JUnit results, failing test names.
- Feature flag snapshots (IDs and values; no PII).
Avoid storing large binaries unless needed for debugging. Use object storage lifecycle policies to expire bulky artifacts after N days; keep metadata, manifests, and minimal text context longer.
GitHub Actions example
```yaml
name: ci
on:
  push:
    branches: [ main ]
  pull_request:
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0          # full history for submodule/SHA pinning
          submodules: recursive
      - name: Capture environment info
        id: envinfo
        run: |
          echo "commit_sha=$(git rev-parse HEAD)" >> "$GITHUB_OUTPUT"
          echo "branch=${GITHUB_REF_NAME}" >> "$GITHUB_OUTPUT"
          echo "node_version=$(node -v || true)" >> "$GITHUB_OUTPUT"
          echo "python_version=$(python3 -V 2>&1 || true)" >> "$GITHUB_OUTPUT"
          # IMAGE_DIGEST is assumed to be exported by your image build step
          echo "container_digest=${IMAGE_DIGEST:-}" >> "$GITHUB_OUTPUT"
      - name: Run tests
        id: test
        continue-on-error: true
        run: |
          set -o pipefail
          npm ci
          npm test 2>&1 | tee test.log
      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: build-${{ github.run_id }}
          path: |
            test.log
            junit/**/*.xml
            coverage/**
            package.json
            package-lock.json
            .github/workflows/**
      - name: Index build to Time-Travel RAG
        if: always()
        env:
          RAG_API: ${{ secrets.RAG_API }}
          RAG_TOKEN: ${{ secrets.RAG_TOKEN }}
          NODE_VERSION: ${{ steps.envinfo.outputs.node_version }}
          PYTHON_VERSION: ${{ steps.envinfo.outputs.python_version }}
        run: |
          python3 - <<'PY'
          import os, subprocess
          import requests

          def sh(cmd):
              return subprocess.check_output(cmd, shell=True, text=True).strip()

          payload = {
              "repo": os.environ.get("GITHUB_REPOSITORY"),
              "build_id": os.environ.get("GITHUB_RUN_ID"),
              "commit_sha": sh("git rev-parse HEAD"),
              "branch": os.environ.get("GITHUB_REF_NAME"),
              "ci_provider": "github_actions",
              "artifacts": ["artifact://github/build-{}".format(os.environ.get("GITHUB_RUN_ID"))],
              "env": {
                  "node": os.environ.get("NODE_VERSION", ""),
                  "python": os.environ.get("PYTHON_VERSION", ""),
              },
          }
          r = requests.post(
              os.environ["RAG_API"] + "/index",
              headers={"Authorization": "Bearer " + os.environ["RAG_TOKEN"]},
              json=payload,
          )
          print(r.status_code, r.text)
          PY
```
Jenkins pipeline snippet
```groovy
pipeline {
  agent any
  stages {
    stage('Checkout') {
      steps { checkout scm }
    }
    stage('Test') {
      steps {
        sh '''#!/bin/bash
          set -euo pipefail
          mvn -q -DskipTests=false test 2>&1 | tee build.log || true
        '''
      }
    }
    stage('Index to Time-Travel RAG') {
      steps {
        sh '''#!/bin/bash
          curl -H "Authorization: Bearer ${RAG_TOKEN}" \
               -H "Content-Type: application/json" \
               -d '{
                 "repo": "'${JOB_NAME}'",
                 "build_id": "'${BUILD_ID}'",
                 "commit_sha": "'$(git rev-parse HEAD)'",
                 "ci_provider": "jenkins"
               }' \
               ${RAG_API}/index
        '''
      }
    }
  }
}
```
Indexing: From Build Artifacts to Searchable Context
Indexing is a pipeline. Keep it simple but strict:
- Ingest manifests and logs from object storage.
- Chunk code by language-aware boundaries (e.g., tree-sitter) with 200–400 line windows and 20–40 line overlaps.
- Extract symbols (functions, classes) and attach to chunk metadata.
- Normalize logs; extract test names, error codes, stack traces.
- Build:
  - Inverted index (BM25) for logs and code comments.
  - Embeddings for code and logs (separate models or multi-vector per doc type).
- Store pointers, not bytes, when possible. Keep immutable URIs with checksums.
Reference indexer sketch:
```python
# indexer.py
from typing import Dict, List
import hashlib, json

class Doc:
    def __init__(self, uri, text, meta):
        self.uri = uri
        self.text = text
        self.meta = meta
        self.id = hashlib.sha256((uri + json.dumps(meta)).encode()).hexdigest()

class Indexer:
    def __init__(self, vecdb, invdb, objstore):
        self.vecdb = vecdb
        self.invdb = invdb
        self.objstore = objstore

    def index_build(self, build: Dict):
        build_id = build['build_id']
        repo = build['repo']
        # 1) list artifacts
        artifacts = self.objstore.list(prefix=f"{repo}/{build_id}/")
        # 2) parse files
        docs: List[Doc] = []
        for art in artifacts:
            text, meta = self._parse(art, build)
            if not text:
                continue
            docs.append(Doc(uri=art.uri, text=text, meta=meta))
        # 3) write indices
        self.invdb.add([(d.id, d.text, d.meta) for d in docs])
        self.vecdb.add([(d.id, self._embed(d.text), d.meta) for d in docs])
        return len(docs)

    def _parse(self, art, build):
        # return (text, metadata); implement language- and filetype-aware parsing
        ...

    def _embed(self, text):
        # call your embedding model
        ...
```
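The `_parse` and `_embed` hooks are deliberately elided. For the chunking policy described above, a minimal sketch of a fixed-window chunker might look like the following; a production version would align windows to function and class boundaries via tree-sitter rather than raw line counts.

```python
# Sketch of a fixed-window chunker with overlap; a language-aware version
# would align windows to function/class boundaries (e.g., via tree-sitter).
from typing import Iterator, Tuple

def chunk_lines(text: str, window: int = 200, overlap: int = 30) -> Iterator[Tuple[int, int, str]]:
    lines = text.splitlines()
    if not lines:
        return
    step = max(window - overlap, 1)
    for start in range(0, len(lines), step):
        end = min(start + window, len(lines))
        yield start + 1, end, "\n".join(lines[start:end])  # 1-indexed line span
        if end == len(lines):
            break
```

Each `(start_line, end_line, chunk)` triple maps directly onto the chunk metadata above, so retrieved snippets can cite exact line ranges.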
Retrieval: Deterministic First, Semantic Second
For debugging, retrieval should be predictable. Use a hierarchical approach:
- Anchor by build_id or commit_sha. All queries must include this filter. If missing, reject the query.
- Deterministic signals:
  - Parse stack traces to extract file paths, line numbers, and symbols.
  - Map test failure names to files (via test reports).
  - Use git diff against the parent commit to prioritize changed files.
- Symbolic expansion: build a small dependency closure using call graphs or import graphs.
- Semantic search: use embeddings only within the build-scoped corpus to fill gaps (e.g., related configs or docs).
- Score fusion: combine BM25, symbol hits, recency-within-build (e.g., later steps), and embedding similarity.
Retriever sketch:
```python
# retriever.py
from typing import Dict, List

def plan_query(error_text: str) -> Dict:
    # naive example: extract paths like src/foo/bar.py:123
    ...

class Retriever:
    def __init__(self, invdb, vecdb):
        self.invdb = invdb
        self.vecdb = vecdb

    def retrieve(self, repo: str, build_id: str, question: str, k: int = 30):
        plan = plan_query(question)
        filters = {"repo": repo, "build_id": build_id}
        hard_hits = []
        if plan.get("paths"):
            for p in plan["paths"]:
                hard_hits.extend(self.invdb.lookup(path=p, filters=filters))
        bm_hits = self.invdb.search(question, filters=filters, top_k=k)
        vec_hits = self.vecdb.search(question, filters=filters, top_k=k)
        return self._fuse(hard_hits, bm_hits, vec_hits)

    def _fuse(self, hard, bm, vec):
        # de-duplicate and score
        ...
```
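The `plan_query` and `_fuse` bodies are left abstract above. A hedged sketch of both — regex extraction of path:line anchors, plus reciprocal rank fusion as one reasonable fusion choice — could look like this; the regex is illustrative, not exhaustive.

```python
# Sketch of the deterministic query planner: pull path:line anchors out of
# an error message or stack trace. The regex is illustrative, not exhaustive.
import re
from typing import Dict, List

PATH_LINE = re.compile(r'(?P<path>[\w./-]+\.\w{1,8}):(?P<line>\d+)')

def plan_query(error_text: str) -> Dict:
    paths: List[str] = []
    lines: Dict[str, int] = {}
    for m in PATH_LINE.finditer(error_text):
        p = m.group("path")
        if p not in paths:
            paths.append(p)
            lines[p] = int(m.group("line"))
    return {"paths": paths, "lines": lines, "raw": error_text}

def rrf_fuse(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    # Reciprocal-rank fusion across hard, BM25, and vector hit lists;
    # one common choice for _fuse, not the only one.
    scores: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For the error in the next section, `plan_query` would return `{"paths": ["src/users/normalize.ts", "src/users/normalize.test.ts"], ...}`, which drives the deterministic lookups before any semantic search runs.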
Guardrails Against Hallucinations
Opinionated but effective constraints:
- Temporal lock: All retrieval must specify build_id or commit_sha. Reject otherwise.
- Repo scoping: No cross-repo context unless explicitly permitted.
- Immutable URIs: Only read from object storage or Git at the pinned commit.
- No internet during planning: The agent cannot web-search or pull third-party snippets.
- Token budget policy: Prioritize files directly referenced in stack traces and tests; add minimal surrounding context.
- Answer invalidation: If the agent references files not present in the build snapshot, flag it and force a retry.
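The temporal lock is cheapest to enforce at the API boundary rather than in prompts. A minimal sketch, assuming a dict-shaped request payload:

```python
# Sketch of temporal-lock enforcement at the retriever API boundary:
# reject any query that is not pinned to a build or commit.
class TemporalLockError(ValueError):
    pass

def validate_retrieval_request(req: dict) -> dict:
    if not req.get("repo"):
        raise TemporalLockError("repo is required")
    if not (req.get("build_id") or req.get("commit_sha")):
        raise TemporalLockError("queries must be pinned to a build_id or commit_sha")
    return req
```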
System prompt fragment:
```text
You are a debugging assistant operating on build ${build_id} of repo ${repo}.
Rules:
- Only use context retrieved for build ${build_id}. Do not assume HEAD state.
- If a required file is missing in the build snapshot, ask for it or state it explicitly.
- Propose changes as minimal patches against commit ${commit_sha} with unified diff.
- Always link each claim to a source (URI + checksum) from the build snapshot.
```
Example: From Failure Log to Patch
Suppose a failing Jest test prints:
```text
TypeError: Cannot read properties of undefined (reading 'toLowerCase')
    at normalizeUser (src/users/normalize.ts:42:17)
    at Object.<anonymous> (src/users/normalize.test.ts:15:5)
```
Workflow:
- Query planner extracts src/users/normalize.ts:42 and failing test file.
- Retriever collects:
  - normalize.ts chunk around lines 30–60
  - normalize.test.ts
  - package.json and tsconfig.json
  - recent changes touching src/users
  - relevant logs showing input shape
- Agent inspects the code and sees that user.email.toLowerCase() is called without a null/undefined check.
- Agent proposes patch with tests.
Proposed patch (unified diff):
```diff
--- a/src/users/normalize.ts
+++ b/src/users/normalize.ts
@@
 export function normalizeUser(user: Partial<User>): User {
-  return {
-    ...user,
-    email: user.email.toLowerCase(),
-    name: user.name.trim(),
-  } as User;
+  return {
+    ...user,
+    email: typeof user.email === 'string' ? user.email.toLowerCase() : undefined,
+    name: typeof user.name === 'string' ? user.name.trim() : undefined,
+  } as User;
 }
```
Executor applies patch on a branch from commit_sha, runs tests in the same Node image, and reports back: all tests green. The agent posts explanation plus provenance:
- Evidence URIs:
- s3://ci-artifacts/acme/users/build-123/logs/test.log#L211-240
- git://acme/users@commit:abc123:path/src/users/normalize.ts#L36-56
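For the replay step, the executor can be a thin wrapper around git and the container runtime. A hedged Python sketch, assuming `image_digest` (a pinned reference like ghcr.io/acme/app@sha256:...) and `test_cmd` come from the build manifest:

```python
# Sketch of the executor replay: check out the pinned commit, apply the
# proposed diff, and rerun tests inside the original container image.
# image_digest and test_cmd are assumed to come from the build manifest.
import subprocess, tempfile

def replay(repo_url: str, commit_sha: str, patch: str,
           image_digest: str, test_cmd: str) -> int:
    with tempfile.TemporaryDirectory() as ws:
        subprocess.run(["git", "clone", repo_url, ws], check=True)
        subprocess.run(["git", "checkout", commit_sha], cwd=ws, check=True)
        subprocess.run(["git", "apply"], cwd=ws, input=patch,
                       text=True, check=True)
        # Run the same tests in the same image, with networking disabled.
        result = subprocess.run(
            ["docker", "run", "--rm", "--network", "none",
             "-v", f"{ws}:/src", "-w", "/src", image_digest,
             "sh", "-c", test_cmd])
        return result.returncode  # 0 means the patch went green
```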
Reproducibility: Environment and Provenance
Many “heisenbugs” are environment-sensitive. Capture and pin:
- Container images by digest (e.g., ghcr.io/acme/app@sha256:...)
- Toolchain versions (gcc/clang, glibc, JDK, Node, Python)
- OS packages (apt/dnf snapshot hashes), or use fully hermetic builds (Bazel/Nix)
- SBOM: SPDX or CycloneDX for dependency graphs
- Feature flags: snapshot ID → value mapping at build time
Provenance record per patch:
- Inputs: URIs + checksums used in retrieval
- Build metadata: build_id, commit_sha, branch
- Execution: container digest, test command, exit codes
- Outputs: diff, test results, coverage delta
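Concretely, a provenance record could look like this (field names and values are illustrative):

```json
{
  "patch_id": "patch-789",
  "build_id": "12345",
  "commit_sha": "abc123",
  "branch": "main",
  "inputs": [
    {"uri": "s3://ci-artifacts/acme/web/12345/logs/test.log#L200-260", "checksum": "sha256-..."},
    {"uri": "git://acme/web@commit:abc123:path/src/users/normalize.ts#L36-56", "checksum": "sha256-..."}
  ],
  "execution": {
    "container_digest": "ghcr.io/acme/app@sha256:...",
    "test_command": "npm test",
    "exit_code": 0
  },
  "outputs": {
    "diff_uri": "s3://ci-artifacts/acme/web/12345/patches/patch-789.diff",
    "tests": "pass",
    "coverage_delta": "+0.1%"
  }
}
```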
This makes your AI’s proposal auditable and aligns with supply-chain best practices (e.g., SLSA, in-toto).
Ranking Signals That Matter
- Direct references: file path and line numbers from stack traces rank highest.
- Test failure proximity: failing test file, related fixtures, and mocks.
- Recent diffs within the same commit: changed files associated with the area.
- Symbol proximity: functions/classes called around the error line.
- Config influence: environment variables and flags referenced in code.
- Cross-signal agreement: documents retrieved by both BM25 and embeddings get a boost.
Avoid global recency bias—time travel is by build, not by calendar.
Evaluation: Measure Retrieval and Fix Quality
Quantitative:
- Top-k recall of ground-truth file (does the failing file appear in top-5?).
- Time-to-first-useful-context (P50 ms for initial retrieval).
- Time-to-first-green (median number of iterations to green test).
- Patch acceptance rate (merged without human edits).
- Cost per fix (tokens + compute + storage).
Qualitative:
- Hallucination rate (references to non-existent files).
- Explanation quality (links to provenance, clarity of root cause).
Run offline benchmarks: replay a corpus of historical failures with known fixes. Score retrieval and patch accuracy.
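A sketch of the top-k recall computation for such a replay corpus, assuming each case records the files touched by the known fix and that `retrieve` returns ranked file paths:

```python
# Sketch of offline top-k recall: for each historical failure, check whether
# any file from the known fix appears in the retriever's top-k results.
from typing import Callable, Dict, List

def top_k_recall(cases: List[Dict], retrieve: Callable[..., List[str]], k: int = 5) -> float:
    hits = 0
    for case in cases:
        results = retrieve(repo=case["repo"], build_id=case["build_id"],
                           question=case["error_text"], top_k=k)
        if any(path in results for path in case["ground_truth_files"]):
            hits += 1
    return hits / len(cases) if cases else 0.0
```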
Security and Privacy
- Secrets hygiene: never store plaintext secrets. Redact at collection time. For env snapshots, whitelist known-safe vars.
- PII: scrub logs for sensitive data; apply regex/structured redaction.
- Access control: tie retriever access to repo permissions; isolate data per repo/org.
- Tenant isolation: separate object storage buckets and vector DB namespaces per tenant.
- Data retention: aggressive TTLs on large artifacts; keep minimal text for learning.
- Egress control: the agent should not exfiltrate code; restrict outbound calls.
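A minimal sketch of redact-at-source env capture — a whitelist plus a secret-looking-name filter, both illustrative:

```python
# Sketch of redact-at-source env capture: keep only whitelisted variables
# and drop anything whose name looks secret-like. Both lists are illustrative.
import os, re

ENV_WHITELIST = {"CI", "GITHUB_REF_NAME", "NODE_ENV", "JAVA_HOME", "PATH"}
SECRET_NAME = re.compile(r"(TOKEN|SECRET|PASSWORD|KEY|CREDENTIAL)", re.IGNORECASE)

def snapshot_env() -> dict:
    snap = {}
    for name, value in os.environ.items():
        if name in ENV_WHITELIST and not SECRET_NAME.search(name):
            snap[name] = value
    return snap
```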
Cost and Performance
Storage:
- Text compresses well. Logs and source diffs are cheap to store.
- Large binaries: avoid unless essential; use lifecycle policies (e.g., 7–30 days).
Compute:
- Index incrementally per build; parallelize by repo.
- Cache embeddings for unchanged files (content-hash keys).
Latency:
- Co-locate vector DB and object storage; use pre-signed URLs.
- Warm caches for recent builds on hot repos.
Token budget:
- Prioritize stack-trace files and surrounding context.
- Summarize long logs; chunk and retrieve only relevant sections.
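The content-hash embedding cache is a few lines; this sketch assumes any dict-like cache backend and an `embed_fn` you supply:

```python
# Sketch of a content-hash embedding cache: unchanged files across builds
# hash to the same key, so their embeddings are computed only once.
import hashlib

def embed_cached(text: str, cache: dict, embed_fn):
    key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(text)
    return cache[key]
```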
Rollout Plan
- Phase 0: Passive capture
  - Start collecting build artifacts and metadata without AI.
  - Validate manifests, ensure costs are controlled, and confirm the data is clean.
- Phase 1: Human-in-the-loop retrieval
  - Build a CLI: given a build_id, print the top-10 relevant files and log snippets (a sketch follows this list).
  - Iterate on ranking signals and chunking.
- Phase 2: AI explanations only
  - Let the agent diagnose failures with time-travel constraints, but no patching.
- Phase 3: Patch proposals
  - Generate diffs and open PRs; humans review and run CI.
- Phase 4: Fully automated fix loop for low-risk areas
  - Auto-merge if tests pass and coverage does not drop; feature-flag by repo.
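A sketch of the Phase 1 CLI, assuming the Retriever sketched earlier is importable and wired to your index backends:

```python
# cli.py — Phase 1 sketch: given a build_id, print the top-k relevant
# documents. Assumes the Retriever class from retriever.py is importable.
import argparse

from retriever import Retriever  # hypothetical wiring; see the retriever sketch

def main():
    p = argparse.ArgumentParser(description="Time-Travel RAG retrieval CLI")
    p.add_argument("--repo", required=True)
    p.add_argument("--build-id", required=True)
    p.add_argument("--question", required=True)
    p.add_argument("--top-k", type=int, default=10)
    args = p.parse_args()

    retriever = Retriever(invdb=..., vecdb=...)  # inject your index backends here
    for hit in retriever.retrieve(args.repo, args.build_id, args.question, k=args.top_k):
        print(hit)

if __name__ == "__main__":
    main()
```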
Common Pitfalls (and How to Avoid Them)
- Missing submodule/LFS pins: store submodule SHAs and LFS pointers; fetch by exact versions.
- HEAD drift: forbid retrieval without build_id/commit; enforce at API level.
- Token bloat from logs: chunk logs by test/step; summarize long sections first.
- Non-hermetic test runs: pin container images; log image digests explicitly.
- Multi-package monorepos: index per package with shared artifacts; include workspace graphs (e.g., Bazel query).
- Flaky tests: capture multiple runs; incorporate flakiness signals to avoid chasing noise.
Example: End-to-End Contract
Retriever API:
```http
POST /retrieve
{
  "repo": "acme/web",
  "build_id": "12345",
  "question": "Jest failure TypeError at src/users/normalize.ts:42",
  "top_k": 20
}
```
Response (trimmed):
json{ "results": [ { "uri": "s3://ci-artifacts/acme/web/12345/src/users/normalize.ts#L30-70", "meta": {"path":"src/users/normalize.ts","lang":"ts","build_id":"12345"}, "checksum": "sha256-...", "snippet": "export function normalizeUser(...) { ... }" }, { "uri": "s3://ci-artifacts/acme/web/12345/logs/test.log#L200-260", "meta": {"step":"test","test":"users/normalize"}, "checksum": "sha256-...", "snippet": "TypeError: cannot read properties of undefined ..." } ] }
Agent output policy:
- Provide unified diffs only.
- Each claim must cite one or more URIs from the retriever response.
- If missing context, request specific build-scoped files by path.
Tooling and Libraries That Help
- Parsing and symbols: tree-sitter, universal-ctags.
- Vector and search: FAISS, Milvus, Weaviate, Elasticsearch/OpenSearch, PGVector.
- SBOM: SPDX, CycloneDX; generators for Node, Java, Python, Go, Rust.
- CI: GitHub Actions, Jenkins, Buildkite, CircleCI.
- Hermetic builds: Bazel, Nix, Buck2.
Note: Retrieval-Augmented Generation is described in Lewis et al., 2020, "Retrieval-Augmented Generation for Knowledge-Intensive NLP" (NeurIPS). Treat that as the conceptual backbone; adapt for code and CI context.
Opinionated Defaults
- Always pin by build_id first and commit_sha second; build IDs disambiguate builds of the same commit (e.g., reruns or custom patches).
- Refuse to answer if context is insufficient; ask for specific files.
- Use hybrid retrieval with hard constraints; embeddings alone are not reliable for stack traces.
- Store provenance for every patch; make it easy for humans to audit.
- Redact at source; don’t rely on post-hoc scrubbing.
Conclusion
Time-Travel RAG reframes AI debugging: the agent isn’t a general knowledge oracle—it’s a precise, deterministic processor of a frozen system state. With a repo-time-indexed knowledge base tied into your CI, you can give it the exact code, logs, configs, and environment that produced a regression. The result is fewer hallucinations, faster root-cause analysis, and precise, reproducible fixes that pass the same tests in the same environment.
Start small: capture artifacts, build a retriever CLI, and iterate on ranking. When you’re satisfied with retrieval quality, let the agent propose patches. The forward path is clear: reproducibility and provenance first; cleverness second. That’s how you fix regressions fast, at scale.
