Bisection Bots: Code Debugging AI That Reproduces, Bisects, and Auto-Patches Regressions in CI
Most engineering teams ship with a persistent drag: regressions that slip into mainline, late discovery of failures, and time-consuming hunts for root causes. We can do better. A bisection bot is a pragmatic, verifiable debugging agent that runs inside CI to: (1) reproduce failures, (2) bisect the failure-inducing change, and (3) synthesize and validate a minimal patch. This article lays out an end-to-end pipeline that blends classic software engineering research with modern program analysis and LLM-based repair. It is structured for teams that want a system they can trust in production.
The goal is not a black-box robot that edits your code base at night. The goal is a transparent, auditable pipeline with hard guardrails, measurable outcomes, and reversible actions.
What a Bisection Bot Does (and Does Not Do)
- It mines CI logs and telemetry to detect likely regressions.
- It reconstructs a minimal, hermetic repro of the failure.
- It localizes the offending change via commit bisect and diff-level delta debugging.
- It generates potential patches (LLM or rule-based), and validates them in a sandbox.
- It proposes or auto-merges patches under strict policy gates.
- It measures impact with metrics and rolls back autonomously if risk conditions trigger.
What it does not do: rewrite your architectural boundaries, modify critical crypto or auth logic without approvals, or bypass your review and SRE protocols. Autonomy is earned by measurable performance and containment.
System Architecture Overview
A robust bisection bot has five main stages:
- Signal ingestion and triage
- Reproduction and minimal repro construction
- Bisection and root cause localization
- Patch synthesis and sandboxed validation
- Guardrails, policies, rollout and rollback
Each stage is designed for verification and determinism. The pipeline is repeatable, observable, and degrades gracefully into human-in-the-loop when uncertainty is high.
1) Signal Ingestion and Triage
You cannot fix what you cannot reliably detect. The bot should ingest:
- CI failures (build, unit, integration, e2e)
- Runtime errors (crash loops, 5xx spikes, SLO breaches), with sampling and scrubbing
- Static analysis and security scanner findings
- Performance regressions (latency, p95/p99 budgets)
Key techniques:
- Failure fingerprinting: Hash stack traces and error messages to cluster repeats. Store in a failure index to avoid duplicate work (a minimal sketch follows this list).
- Noise control: Distinguish between flakes and real regressions using historical pass/fail ratios, change-point detection, and retry heuristics. Require N out of M failed re-runs before escalation.
- Data minimization: Capture the minimal artifact set necessary to repro: test name, seed, commit SHA, build flags, environment variables, and dependency versions.
- Privacy: Strip PII and secrets early. Treat logs as untrusted input.
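Fingerprinting can be as simple as normalization plus hashing. A minimal sketch, assuming stack traces arrive as plain text (the `fingerprint` helper and its regexes are illustrative, not exhaustive):

```python
# Illustrative fingerprinting: normalize away run-specific noise (addresses,
# line numbers, temp paths) before hashing so repeats cluster together.
import hashlib
import re

def fingerprint(stack_trace: str) -> str:
    normalized = []
    for line in stack_trace.splitlines():
        line = re.sub(r'0x[0-9a-fA-F]+', '0xADDR', line)   # pointer values
        line = re.sub(r'line \d+', 'line N', line)         # shifting line numbers
        line = re.sub(r'/tmp/[^\s:]+', '/tmp/PATH', line)  # per-run temp paths
        normalized.append(line.strip())
    return hashlib.sha256('\n'.join(normalized).encode()).hexdigest()[:16]

# Two crashes differing only in addresses and line numbers share one cluster ID.
a = 'ValueError at parse_json (0x7f3a91)\n  File "app.py", line 42'
b = 'ValueError at parse_json (0x55ee10)\n  File "app.py", line 57'
assert fingerprint(a) == fingerprint(b)
```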
Design the ingestion stage to produce a crisp triage record:
- fingerprint: stable cluster ID
- repro_spec: test target, seed, platform image, timeouts
- suspected_scope: PR number or commit range
- severity: mapping to SLO impact or critical paths
This record feeds the repro engine.
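An illustrative shape for that record (field names follow the list above; the exact schema is up to your pipeline):

```python
# Hypothetical triage record emitted by the ingestion stage.
from dataclasses import dataclass

@dataclass
class TriageRecord:
    fingerprint: str      # stable cluster ID
    repro_spec: dict      # test target, seed, platform image, timeouts
    suspected_scope: str  # PR number or commit range
    severity: str         # mapping to SLO impact or critical paths

record = TriageRecord(
    fingerprint='9f2ab4c1d0e3f587',
    repro_spec={'test': 'tests/test_api.py::test_parse_payload',
                'seed': 1234, 'image': 'repro:latest', 'timeout_s': 600},
    suspected_scope='abc000..abc123',
    severity='SLO:availability',
)
```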
2) Hermetic Reproduction and Minimal Repro
Reproducing a failure deterministically is the foundation of everything that follows. Non-determinism kills bisect accuracy and patch validation.
Recommended ingredients:
- Hermetic builds: Use Bazel/Buck/Pants or deterministic Docker/Nix images. Pin toolchains, OS image, and third-party dependencies.
- Environment capture: Dump relevant env vars, feature flags, locale, timezone, CPU architecture, and kernel version. Persist as part of repro_spec.
- Seed control: For property-based or fuzz tests, record RNG seeds.
- External dependencies: Stub or record/replay network and file system interactions. For services, prefer ephemeral test doubles over shared staging (a toy record/replay sketch follows this list).
- Time travel: For heisenbugs, record execution with rr (a Linux record-and-replay debugger); on the JVM, Flight Recorder captures rich telemetry, though it is not deterministic replay. Deterministic replay dramatically improves the signal available for localization.
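For HTTP dependencies, record/replay can start very small. A toy sketch, not any particular library's API: the first run records responses into a cassette file; later runs replay them without touching the network:

```python
# Toy record/replay double for an HTTP dependency (illustrative, stdlib-only).
import json
import os
import urllib.request

class ReplayHTTP:
    def __init__(self, cassette: str = 'cassette.json'):
        self.cassette = cassette
        if os.path.exists(cassette):
            with open(cassette) as f:
                self.cache = json.load(f)  # replay mode: cassette exists
        else:
            self.cache = {}                # record mode: will hit the network once

    def get(self, url: str) -> str:
        if url not in self.cache:
            with urllib.request.urlopen(url) as resp:  # record once
                self.cache[url] = resp.read().decode()
            with open(self.cassette, 'w') as f:
                json.dump(self.cache, f)
        return self.cache[url]
```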
Minimal repro is a reduction problem: shrink inputs and the executing code surface until the failure persists but the search space becomes tractable. Use delta debugging (Zeller) to reduce inputs and even patch sets.
Example: ddmin-inspired reducer for input chunks:
```python
# Minimal reducer inspired by Zeller's delta debugging (ddmin)
# chunks: a list of input components (lines, config entries, diff hunks)
# test: function that returns 'pass', 'fail', or 'unresolved'
def ddmin(chunks, test):
    n = 2
    while len(chunks) >= 1:
        subset_size = max(1, len(chunks) // n)
        reduced = False
        for i in range(0, len(chunks), subset_size):
            # Try removing one subset; keep the complement if it still fails
            complement = chunks[:i] + chunks[i + subset_size:]
            outcome = test(complement)
            if outcome == 'fail':
                chunks = complement
                n = max(2, n - 1)
                reduced = True
                break
        if not reduced:
            if subset_size == 1:
                break
            n = min(len(chunks), n * 2)  # refine granularity
    return chunks
```
Use it to reduce:
- Input files (e.g., test fixtures, API traces)
- Config flags
- Test steps in an end-to-end scenario
- Diff hunks for delta-bisect (see below)
A reproducible minimal repro should be invocable as a single command, in a fixed container image, with bounded time and memory. Budget resource limits early; timeouts reduce long-tail costs.
Example Dockerfile for a hermetic test runner:
```dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-venv git build-essential
WORKDIR /repo
# Copy only what is needed for the failing test target
COPY pyproject.toml .
RUN python3 -m venv .venv && . .venv/bin/activate && pip install -U pip
# Dependencies are pinned in pyproject.lock or constraints.txt
COPY constraints.txt .
RUN . .venv/bin/activate && pip install -r constraints.txt
COPY src/ src/
COPY tests/ tests/
ENV PYTHONHASHSEED=0
# Exec-form CMD requires double-quoted JSON strings
CMD ["bash", "-lc", "set -euxo pipefail; . .venv/bin/activate; pytest -q -k failing_test --maxfail=1 --disable-warnings"]
```
Note: Use a lockfile or constraints file for exact dep versions. Avoid pulling latest at runtime.
3) Bisecting to Localize the Failure-Inducing Change
Commit-level bisect: Given a known good and bad commit, systematically search the commit range for the first bad commit.
- Git supports this out of the box via git bisect.
- Use the minimal repro command as the test predicate.
- Annotate flakiness with retries; treat unresolved runs carefully.
Example script:
```bash
#!/usr/bin/env bash
set -euo pipefail
# Inputs: GOOD_SHA, BAD_SHA; ./repro.sh exits nonzero on failure
git bisect start
git bisect bad "$BAD_SHA"
git bisect good "$GOOD_SHA"

bisect_runner() {
  # Retry repro up to 3 times to guard against flakes
  for i in 1 2 3; do
    if ./repro.sh; then
      # Repro passed: this commit is good -> exit 0 for `git bisect run`
      return 0
    fi
  done
  # Failed all retries: this commit is bad -> nonzero exit (1-124)
  return 1
}
# Export the function so the child shell spawned by `git bisect run` sees it
export -f bisect_runner
git bisect run bash -c 'bisect_runner'

culprit=$(git rev-parse refs/bisect/bad)
echo "culprit=$culprit" > bisect_result.txt
# Reset bisect state
git bisect reset
```
Delta-bisect on diffs: Large PRs can bundle unrelated changes. Even after identifying the culprit commit, finding the exact failure-inducing hunk improves patch synthesis.
Approach:
- Split the diff into logical hunks.
- Use ddmin to find the smallest subset of hunks that triggers the failure.
- If the failure requires interaction of multiple hunks, ddmin tends to find a minimal set rather than a single hunk.
Pseudo-code:
```python
# Sketch: delta-bisect over diff hunks. compute_hunks, apply_patch,
# revert_patch, and run_repro are repo-specific helpers: produce hunks as
# independent patch fragments, apply subsets in a temporary tree, and test.
hunks = compute_hunks(base_tree, head_tree)

def test_hunks(subset):
    apply_patch(subset)   # apply to a temporary working tree
    rc = run_repro()
    revert_patch(subset)
    return 'fail' if rc != 0 else 'pass'

failure_hunks = ddmin(hunks, test_hunks)
```
Heuristics to rank suspect hunks:
- Stack-trace correlation: Hunks touching files on the stack + coverage changes
- Dependency graph proximity: Modified modules on the call path to failure
- Historical fault-proneness: Files with high past-failure density
- Temporal coupling: Files that often change together
These features are standard in defect prediction literature and help prioritize patch strategies.
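A toy ranking function combining these features (the `Hunk` shape and the weights are placeholders, not tuned values):

```python
# Illustrative suspicion score over diff hunks; higher means try it first.
from dataclasses import dataclass

@dataclass
class Hunk:
    file: str
    on_stack: bool               # file appears in the failing stack trace
    call_path_distance: int      # hops from modified module to failure site
    past_failure_density: float  # historical fault-proneness, 0..1
    temporal_coupling: float     # co-change frequency with other suspects, 0..1

def suspicion(h: Hunk) -> float:
    score = 2.0 if h.on_stack else 0.0
    score += 1.0 / (1 + h.call_path_distance)
    score += 0.5 * h.past_failure_density
    score += 0.25 * h.temporal_coupling
    return score

def rank_hunks(hunks: list[Hunk]) -> list[Hunk]:
    # Patch synthesis starts with the most suspicious hunks.
    return sorted(hunks, key=suspicion, reverse=True)
```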
4) Patch Synthesis: Pattern-Based and Model-Assisted
Patch generation should be conservative. Favor simple, explainable patterns before leaning on a generative model. The synthesis engine can use a portfolio:
- Rule-based templates: Null checks, off-by-one adjustments, input validation, exception propagation, correct resource cleanup, configuration fallback. These patterns cover a surprising fraction of regressions.
- Learned fix patterns: Mine repo history or use tools like Getafix-style pattern mining to produce edit templates.
- Constraint-based repairs: For type or contract violations, integrate static analyzers (Infer, Error Prone) and synthesize the minimal fix that satisfies constraints.
- LLM proposals: Prompt a code model with the minimal repro, culprit hunks, stack traces, and invariants. Constrain the edit region to the failure hunks and their immediate context. Ask for a minimal patch with justification and a test augmentation.
Guidelines for safe patch generation:
- Edit locality: restrict edits to files identified by bisect/delta-bisect, plus test files.
- Single-responsibility patches: one root cause, one fix.
- Minimal diff size: fewer lines changed correlates with higher acceptance in APR literature.
- Require test strengthening: add or tighten a test that would have caught the bug.
Simple example of a patch runner script:
```bash
#!/usr/bin/env bash
set -euo pipefail
patch_file=${1:-patch.diff}
branch="bisection-bot/fix-$(date +%s)"
# Apply patch in a temp branch
git checkout -b "$branch"
git apply --index "$patch_file"
# Build and test in sandbox
if ./scripts/sandbox_test.sh; then
  git commit -m 'bisection-bot: minimal fix for regression X; adds test'
  git push origin HEAD
else
  # Discard the failed patch and delete the temp branch
  git reset --hard
  git checkout -
  git branch -D "$branch"
  exit 1
fi
```
LLM prompting considerations (a prompt-assembly sketch follows the list):
- Provide only necessary context: culprit commit diff, failing test, stack trace, runtime variables. Avoid dumping the entire repo.
- Include invariants: performance budgets, public API contracts, security rules.
- Forbid changes outside the suspect path. Reject patches adding new dependencies or changing build files unless allowed by policy.
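One way to assemble such a constrained prompt, sketched with illustrative field names (the `build_repair_prompt` helper is hypothetical):

```python
# Hypothetical prompt assembly: include only what the model needs, and state
# the edit constraints explicitly.
def build_repair_prompt(culprit_diff: str, failing_test: str, stack_trace: str,
                        invariants: list[str], allowed_files: list[str]) -> str:
    return '\n'.join([
        'You are proposing a MINIMAL patch for a regression.',
        f'Only edit these files: {", ".join(allowed_files)}.',
        'Do not add dependencies or modify build files.',
        'Invariants that must hold:',
        *[f'- {inv}' for inv in invariants],
        '--- Culprit diff ---', culprit_diff,
        '--- Failing test ---', failing_test,
        '--- Stack trace (scrubbed) ---', stack_trace,
        'Return a unified diff, a one-paragraph justification, and a new test.',
    ])
```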
5) Sandboxed Validation and Safety Nets
All candidate patches run in a hardened sandbox with multiple gates:
- Test suite: Run the failing test and relevant dependent suites; augment with property-based tests and fuzzers if cheap.
- Security checks: Static analyzers, SAST/DAST where applicable, license compliance, and dependency scanning.
- Performance guardrails: Microbenchmarks or regression checks for critical paths; p95 or p99 budgets.
- API compatibility: Check public API signatures and JSON schemas remain backward compatible if required.
- Non-determinism check: Run the repro N times. Measure flake rate; patches should reduce or eliminate flakiness.
Sandbox hardening:
- Containers with seccomp profiles and read-only file systems.
- No outbound network access unless specifically permitted and recorded.
- Per-run CPU/memory/time budgets with cgroups.
- Declarative environments for reproducibility (Docker, Nix).
Example sandbox_test.sh skeleton:
```bash
#!/usr/bin/env bash
set -euo pipefail
# 1) Build
bazelisk test //:compile_checks
# 2) Run targeted tests with retries for determinism assessment
passes=0
for i in $(seq 1 5); do
  if bazelisk test //tests:failed_target --test_arg='--seed=1234'; then
    passes=$((passes+1))
  fi
  sleep 1
done
if [ "$passes" -lt 4 ]; then
  echo 'Flakiness too high'
  exit 1
fi
# 3) Security, style, and licenses
./scripts/static_checks.sh
# 4) Perf smoke on critical path
./scripts/perf_guard.sh --budget_ms 5 --target //pkg:hot_fn
```
6) Guardrails and Policy: When to Auto-Merge vs. Propose
Automated patching in production requires explicit policy. Store policy in version control.
Example policy config (YAML):
```yaml
# .bisection-bot/policy.yaml
risk:
  max_diff_lines: 20
  allowed_files:
    - src/**
    - tests/**
  forbidden_patterns:
    - '**/crypto/**'
    - '**/auth/**'
    - '**/payment/**'
  require_new_test: true
  require_changed_test_to_fail_without_patch: true
  min_successful_retries: 3
  min_confidence_score: 0.8
  max_perf_delta_pct: 2
review:
  auto_merge:
    enabled: true
    gates:
      - all_tests_green
      - coverage_not_reduced
      - risk_score_below: 0.3
      - owner_approval_for_sensitive_paths
  fallback_to_pr:
    enabled: true
    reviewers:
      - oncall-bugfixers
      - codeowners
```
Risk scoring features:
- Diff size and entropy (Levenshtein distance in suspect region)
- Patch pattern class (known-safe patterns get a small risk prior)
- Flake reduction observed vs. baseline
- API and ABI surface touched
- Security and perf scan deltas
A simple logistic regression or gradient boosting model on historical patch outcomes can provide an interpretable risk score. Keep the model simple and auditable; log the feature vector and decision.
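A sketch of such a model, assuming scikit-learn and a handful of labeled historical outcomes (the features and training rows here are illustrative):

```python
# Illustrative risk model: features per patch are
# [diff_lines, edit_entropy, is_known_safe_pattern,
#  flake_reduction, api_surface_touched, perf_delta_pct].
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([
    [8,   0.2, 1, 0.9, 0, 0.1],  # small pattern fix, kept
    [120, 0.8, 0, 0.1, 1, 3.5],  # large risky change, reverted
    [15,  0.3, 1, 0.7, 0, 0.4],  # kept
    [60,  0.6, 0, 0.2, 1, 1.9],  # reverted
])
y_train = np.array([0, 1, 0, 1])  # 1 = bad outcome (reverted post-merge)

model = LogisticRegression().fit(X_train, y_train)
risk = model.predict_proba([[10, 0.25, 1, 0.8, 0, 0.2]])[0, 1]
print(f'risk={risk:.2f}')  # log the feature vector and score for audit
```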
7) Progressive Rollout and Instant Rollback
Treat patches like any other change: progressive exposure and fast rollback.
- Submit as a PR with a clear title, context, and links to the repro and bisect results.
- Merge with a staged rollout: feature flags or canary environments. For libraries, cut a pre-release tag used by a small percentage of services first.
- Monitor key metrics post-merge: error budgets, latency, and failure fingerprints of interest.
- Auto-revert if any regression-specific SLO triggers.
Example GitHub Actions steps for auto-revert:
```yaml
name: Auto-revert
on:
  workflow_run:
    workflows: ['Deploy']
    types: [completed]
jobs:
  revert_on_incident:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # revert needs the parent of HEAD
      - name: Configure bot
        run: |
          git config user.name 'bisection-bot'
          git config user.email 'bot@example.com'
      - name: Revert last commit
        run: |
          git revert --no-edit HEAD
          git push origin HEAD:main
```
Use the same policy gates to decide whether an auto-revert can proceed without humans (e.g., if the bad commit was a bot-generated one and the incident matches its fingerprint).
8) CI/CD Integration: End-to-End Flow
A minimal wiring for GitHub Actions might look like this:
```yaml
name: Bisection Bot
on:
  workflow_run:
    workflows: ['CI']
    types: [completed]
jobs:
  triage:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch artifacts
        uses: actions/download-artifact@v4
        with:
          name: ci-logs
          # Cross-workflow downloads need the triggering run's ID and a token
          run-id: ${{ github.event.workflow_run.id }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Run triage
        run: |
          python3 -m bot.triage --logs ci-logs --out triage.json
      - name: Persist triage record
        run: |
          cat triage.json
  reproduce:
    needs: triage
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build repro image
        run: |
          docker build -t repro:latest -f repro.Dockerfile .
      - name: Run repro
        run: |
          docker run --rm repro:latest /bin/bash -lc './repro.sh'
  bisect_and_fix:
    needs: reproduce
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # bisect needs full history
      - name: Run bisect
        run: |
          ./scripts/bisect.sh
      - name: Generate patch candidates
        run: |
          python3 -m bot.patch_synth --bisect-result bisect_result.txt --out candidates/
      - name: Validate in sandbox
        run: |
          for p in candidates/*.diff; do
            ./scripts/validate_patch.sh "$p" || true
          done
      - name: Submit PR or auto-merge
        run: |
          python3 -m bot.submit --policy .bisection-bot/policy.yaml --candidates candidates/
```
Adapt this skeleton to your orchestration system (GitLab CI, Jenkins, Buildkite) and artifact store (S3/GCS/Artifactory). Ensure sensitive tokens used by the bot can only open PRs or push to bot-namespaced branches.
9) Metrics That Matter
A bisection bot must prove its value with numbers. Instrument end-to-end:
- TTD: Time to detect (from commit to failure signal).
- TTR: Time to reproduce (from triage to deterministic repro).
- TTB: Time to bisect (from repro to culprit commit).
- TTFx: Time to first viable patch.
- PAA: Patch acceptance rate (auto-merged + human-approved).
- RRR: Re-revert rate (patches reverted post-merge).
- FP/FN rates in triage clustering and flake classification.
- Coverage delta and test strengthening ratio (tests added vs. removed/loosened).
- Cost per incident (compute minutes) and cache hit rates.
Set service-level objectives for the bot itself. Example initial targets:
- 80% of deterministic regressions get a minimal repro within 30 minutes.
- 70% of deterministic regressions localized to a single commit within 45 minutes.
- 40% of deterministic regressions receive a viable patch within 2 hours.
- < 5% re-revert rate for bot-merged patches over a trailing 30-day window.
Track cohort performance over time and feed lessons back into guardrails.
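The timing metrics fall straight out of pipeline event timestamps. A small sketch (event names are illustrative):

```python
# Derive TTD/TTR/TTB/TTFx from illustrative pipeline events.
from datetime import datetime

events = {
    'commit':         datetime(2024, 5, 1, 10, 0),
    'failure_signal': datetime(2024, 5, 1, 10, 12),
    'repro_ready':    datetime(2024, 5, 1, 10, 31),
    'culprit_found':  datetime(2024, 5, 1, 11, 5),
    'patch_ready':    datetime(2024, 5, 1, 11, 40),
}

def minutes(start: str, end: str) -> float:
    return (events[end] - events[start]).total_seconds() / 60

print(f"TTD={minutes('commit', 'failure_signal'):.0f}m "
      f"TTR={minutes('failure_signal', 'repro_ready'):.0f}m "
      f"TTB={minutes('repro_ready', 'culprit_found'):.0f}m "
      f"TTFx={minutes('culprit_found', 'patch_ready'):.0f}m")
```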
10) Rollback Strategies and Recovery Playbooks
Rollbacks must be mechanically simple and safe:
- Always keep revert scripts tested. A broken revert is worse than a broken patch.
- Prefer reverting a bot patch over stacking another fix when incidents fire.
- For cross-repo dependency updates, maintain a dependency lock snapshot for quick rollbacks.
- Coordinate rollbacks with feature flag states to avoid inconsistent behavior across services.
Example operational policy:
- If a bot patch correlates with a newly detected fingerprint within 30 minutes post-merge, auto-revert and page on-call with context.
- If SLOs degrade but fingerprints are unrelated, fall back to human triage; the bot can provide best-effort localization without auto-actions.
11) Cost and Throughput: Keep It Bounded
Left unchecked, automated debugging can consume a lot of compute:
- Prioritize by severity and recency; de-duplicate clustered failures.
- Use coarse-to-fine search: cheap static analysis first, then narrow repro, then expensive bisect only if needed.
- Cache build artifacts and repro environments aggressively.
- Enforce queue budgets and preemption: kill low-priority work if the main CI queue is saturated.
- Integrate with a work scheduler that supports priorities and concurrency limits.
A practical budget: keep median compute under 30 CPU-minutes per incident, 95th percentile under 200, by terminating dead-end paths early.
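The scheduling piece can be sketched as a severity-ordered queue that carries a per-incident compute budget (illustrative, not a production scheduler):

```python
# Toy priority queue for incident work: lower severity number = higher
# priority; each incident carries a CPU-minute budget enforced by workers.
import heapq

class IncidentQueue:
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within a severity

    def push(self, severity: int, incident_id: str, budget_cpu_min: int = 30):
        heapq.heappush(self._heap,
                       (severity, self._counter, incident_id, budget_cpu_min))
        self._counter += 1

    def pop(self) -> tuple[str, int]:
        severity, _, incident_id, budget = heapq.heappop(self._heap)
        return incident_id, budget

q = IncidentQueue()
q.push(severity=1, incident_id='fp-9f2ab4c1', budget_cpu_min=30)
q.push(severity=3, incident_id='fp-77aa01ce', budget_cpu_min=10)
incident, budget = q.pop()  # most severe incident first, with its budget
```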
12) Data Governance and Security
- Secret hygiene: LLM prompts and logs must be scrubbed; never include tokens or customer data.
- Supply chain: The bot uses pinned images and signed artifacts. Generate SBOMs for patches.
- Permissions: The bot account cannot write to main; it can only open PRs or push to a dedicated bot branch. Auto-merge occurs via a protected workflow token gated by policy.
- Auditability: Every decision emits a structured audit event with inputs, features, model version, and outputs. Store for 90 days.
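An audit event can be a single JSON document per decision. An illustrative schema:

```python
# Hypothetical structured audit event; emit one per bot decision and retain
# for 90 days in your audit store.
import datetime
import json

def emit_audit_event(decision: str, inputs: dict, features: list,
                     model_version: str, outputs: dict) -> str:
    event = {
        'ts': datetime.datetime.now(datetime.timezone.utc).isoformat(),
        'decision': decision,           # e.g., auto_merge, open_pr, auto_revert
        'inputs': inputs,               # fingerprint, commit range, repro hash
        'features': features,           # risk-model feature vector
        'model_version': model_version,
        'outputs': outputs,             # patch ID, risk score, gate results
    }
    return json.dumps(event)

print(emit_audit_event('auto_merge', {'fingerprint': '9f2ab4c1'},
                       [10, 0.25, 1], 'risk-lr-v3', {'risk': 0.15}))
```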
13) Failure Modes and Design Responses
Known hard cases:
- Flaky tests: Use retry with statistical thresholds; tag flaky fingerprints and deprioritize patching until stabilized.
- Non-deterministic infra: Network or time-based failures; repeatable repros require isolation or recorded I/O.
- Long commit ranges: Rebase and squash history can blur good/bad boundaries; consider binary search on build artifacts if commit-level bisect is impractical.
- Multi-repo changes: Use a monorepo or a cross-repo snapshot. Bisect on the composed dependency graph, not a single repo.
- Compiler and JIT bugs: Use alternative toolchains or lower optimization levels in repro to distinguish source vs. toolchain.
The system should fall back gracefully to providing high-quality localization and a PR with an investigative test, even if it cannot auto-fix.
14) Example: End-to-End Walkthrough
Scenario: A Python service begins returning 500s on a JSON parsing endpoint. CI reports a failing test tests/test_api.py::test_parse_payload.
- Triage: The bot clusters crashes to a new fingerprint: TypeError at parse_int (int() rejects an explicit base when given a non-string). The suspected scope is PR #4821, 19 files changed.
- Repro: The bot constructs a minimal Docker image with the test target and pinned dependencies. The failure reproduces deterministically.
- Bisect: git bisect identifies commit abc123 within the PR. Delta-bisect reduces the culprit to a hunk changing int(x) to int(x, 10) without guarding when x is already an int. The repro still fails on the minimal hunk set.
- Patch synthesis: A rule-based pattern suggests a type check before coercion. The bot proposes:
```diff
diff --git a/src/parse.py b/src/parse.py
index 1234..5678 100644
--- a/src/parse.py
+++ b/src/parse.py
@@ -1,2 +1,4 @@
 def parse_int(x):
-    return int(x, 10)
+    if isinstance(x, int):
+        return x
+    return int(str(x), 10)
```
The bot also adds a test:
```python
# tests/test_parse.py
from src.parse import parse_int

def test_parse_int_accepts_ints():
    assert parse_int(7) == 7
```
- Validation: The sandbox runs the failing test suite 5 times (all pass), static checks, and a micro perf test on parse_int (no regression).
- Policy: Changes limited to src/ and tests/, diff lines under 10, new test added. Risk score 0.15. Auto-merge allowed.
- Rollout: The patch is merged; canary environment shows the 500s fingerprint disappears. No SLO regression. Incident auto-resolved. Metrics updated with a successful, low-risk auto-fix.
15) How to Start: Minimum Viable Bisection Bot
Begin with the smallest useful loop:
- Build a deterministic repro harness for your top 20 flaky or failing tests.
- Wire git bisect to run the repro and output the culprit commit.
- Implement delta-debugging on diff hunks for large commits.
- Add a small set of rule-based fixes and require a new test.
- Validate inside a sandbox, never outside.
- Only propose PRs initially; auto-merge later after success metrics stabilize.
This sequence typically yields benefits within a few weeks without risky autonomy.
16) Research and Benchmarks Worth Knowing
- Delta debugging: A seminal technique for isolating failure-inducing inputs and changes. See Zeller, Simplifying and Isolating Failure-Inducing Input, 2002.
- C-Reduce and Test-Case Reduction: Aggressive reducers for C/C++ that inspire general reduction strategies.
- Automated program repair (APR): GenProg, Prophet, SPR, and TBar demonstrate fix patterns and probabilistic search. APR success is higher on small, localized edits.
- Production systems: Facebook Getafix (pattern mining) and SapFix (auto-fixes + human-in-the-loop) underline the value of pattern libraries and guardrails.
- Benchmarks: Defects4J and SWE-bench provide standardized bug-fixing corpora to evaluate patch quality.
- Record-and-replay: rr enables reproducible debugging for many C/C++ workloads.
Start with pattern-based fixes and only then layer model-driven synthesis. Build a local corpus of accepted patches to bootstrap learned patterns.
17) Opinionated Design Principles
- Repro first, model second: LLMs without deterministic repros are unreliable.
- Constrain the search space: Edit locality and minimal diffs produce safer patches.
- Prefer interpretability: Rule-based or mined patterns are easier to audit and reason about.
- Treat flakiness as technical debt: Fix it or quarantine it; do not let it poison your signals.
- Make everything auditable: Artifacts, prompts, features, and decisions.
- Earn autonomy: Start with PRs, move to auto-merge with strong guardrails only after your metrics justify it.
18) Common Pitfalls and Anti-Patterns
- Running the full test suite on every patch attempt: too slow. Targeted suites first.
- Allowing network egress in sandbox: opens nondeterminism and data leaks.
- Letting the bot change build or dependency files casually: a small code patch turning into a huge dependency churn.
- Not versioning the bot itself: keep bot configs, prompts, and models versioned and pinned.
- Ignoring license and security scans for bot patches: they are production changes; treat them as such.
19) Extending Beyond Unit Tests: Integration and Systems-Level Bugs
- Service-level reproducibility: Capture API requests and responses; use contract tests with recorded traces.
- Database state: Use ephemeral databases seeded with a snapshot or migrations plus a deterministic fixture loader.
- Distributed traces: Map the failing span path to code modules to prioritize hunk ranking.
- Performance regressions: Integrate statistical tests (e.g., Mann-Whitney U) on benchmark results to avoid overreacting to noise (see the sketch after this list).
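A sketch of that statistical gate, assuming SciPy and illustrative benchmark samples:

```python
# Gate on a one-sided Mann-Whitney U test: flag only when the candidate's
# latency distribution is significantly higher than baseline.
from scipy.stats import mannwhitneyu

baseline_ms = [11.9, 12.1, 12.0, 12.3, 11.8, 12.2, 12.0, 12.1]
candidate_ms = [12.8, 13.1, 12.9, 13.3, 12.7, 13.0, 13.2, 12.9]

stat, p_value = mannwhitneyu(baseline_ms, candidate_ms, alternative='less')
if p_value < 0.01:
    print(f'Likely regression (p={p_value:.4f}); gate the patch')
else:
    print(f'No significant shift (p={p_value:.4f})')
```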
Complex systems bugs benefit from the same principles: isolate and minimize, then bisect, then patch within strict bounds.
20) Closing: Shipping AI-Driven Patches Safely
Bisection bots are not science fiction. They are a careful assembly of established techniques: deterministic repro, delta debugging, commit bisection, conservative patch templates, and rigorous validation. Adding a code model can increase coverage, but only under tight constraints.
If you prioritize verifiability, build strong guardrails, and commit to measuring outcomes, you can safely automate a meaningful slice of your debugging and patching workload. Start small, demonstrate value, and expand autonomy as your metrics—and your team’s trust—grow.
Further Reading
- Andreas Zeller. Yesterday, my program worked. Today, it does not. Why? ESEC/FSE (ACM SIGSOFT), 1999.
- Andreas Zeller and Ralf Hildebrandt. Simplifying and Isolating Failure-Inducing Input. IEEE Transactions on Software Engineering, 2002.
- C-Reduce: https://embed.cs.utah.edu/creduce/
- rr (record and replay): https://rr-project.org/
- GenProg: https://dspace.mit.edu/handle/1721.1/56512
- Prophet: https://people.cs.umass.edu/~brun/pubs/pubs/Long15icse.pdf
- Facebook Getafix and SapFix: https://engineering.fb.com/2019/06/12/developer-tools/getafix/ and https://engineering.fb.com/2018/08/27/developer-tools/sapfix/
- Defects4J: https://github.com/rjust/defects4j
- SWE-bench: https://arxiv.org/abs/2310.06770
