From Flaky CI to Failing Test: How a Debug AI Can Auto-Generate Minimal Reproducers
Flaky CI is one of the most expensive forms of waste in modern software teams. It breaks trust in the pipeline, forces developers into manual archaeology across logs and artifacts, and stalls releases while everyone asks the same set of questions: did we break prod, is this a test order issue, is the network slow today, or did a dependency ship a bad patch?
The cost is real. Luo et al. found that flaky tests are pervasive and costly across large codebases, and that root causes vary widely across concurrency, network timing, and environment assumptions [Luo et al., FSE 2014]. Even when a test is genuinely broken, the delta between a CI failure and a minimal, deterministic reproducer is often dozens of minutes of detective work.
There is a better path. In this article, I lay out an opinionated blueprint for a debug AI that converts flaky CI signals into minimal, failing tests the AI can iterate on automatically. The core idea is simple:
- Ingest everything: CI logs, test reports, traces, artifacts, and git history.
- Use retrieval-augmented generation (RAG) to align an LLM with the exact failure context.
- Reconstruct and sandbox a hermetic environment so the AI can run the failure locally, repeatedly, and deterministically.
- Apply program analysis and delta debugging to shrink the failure surface area.
- Guard secrets and privacy by design, so logs and prompts never leak credentials.
By the end, you should have a mental model, concrete algorithms, and reference code that you can adapt to your stack, whether you run Python/pytest, Java/JUnit, Go, JS/Jest, or Rust.
Why start with a failing test, not a failing job?
CI jobs are noisy. They run many steps and many tests. Errors get interleaved, retried, and sometimes masked by test reruns. Reproducing the entire job locally is overkill and slow. What developers want is the smallest test or script that fails deterministically. That is the unit that both humans and AI can iterate on quickly.
Minimal reproducers have three characteristics:
- Specific: pinned to one test/function, one seed, one time.
- Deterministic: fails consistently (ideally with failure probability 1.0) in a hermetic environment.
- Minimal: removes unrelated code, flags, and inputs without losing the failure.
The goal of a debug AI pipeline is to go from the messy world of CI artifacts to that minimal entity. Let's assemble the building blocks.
The signals: what to ingest from CI
An effective pipeline starts with comprehensive ingestion. You cannot debug what you cannot see.
- CI logs and step metadata
- Raw stdout/stderr per step
- Exit codes and timing
- Environment variables captured at runtime (sanitized)
- Test runner command lines
- Container image digests or VM AMIs used for the job
- Test reports
- JUnit XML, Allure, or native formats
- Failure messages, stack traces, test names and class names
- Rerun results (e.g., Maven Surefire rerun, pytest-flakefinder)
- Traces and metrics
- OpenTelemetry spans for test start/stop, queries, HTTP calls, retries
- Links to logs via trace IDs
- Artifacts
- Core dumps, minidumps, crash reports
- Coverage reports
- Build manifests, lockfiles, dependency graphs
- Git and build context
- HEAD commit and merge base
- Diff hunks relevant to failing files
- Build cache keys, Bazel/Nix derivations
Whenever possible, record the run recipe:
- Docker image or microVM base
- Exact test command used (e.g., pytest -k test_login -q -s)
- Seeds, time offsets, test order, shuffle flags
- CPU/memory limits and OS details
This recipe is the skeleton of a reproducible environment. Everything else is used for narrowing and guidance.
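To make this concrete, here is a minimal sketch of a recorded recipe serialized next to the job artifacts; the field names are illustrative, not a standard schema:

```python
import json

# Illustrative run recipe captured at the end of a CI job (field names are assumptions).
recipe = {
    "image": "python:3.11@sha256:<digest>",             # pinned base image
    "cmd": ["pytest", "-k", "test_login", "-q", "-s"],  # exact test command
    "seed": 1234,                                       # RNG seed, if the runner exposes one
    "tz": "UTC",
    "test_order": "shuffle=off",
    "cpus": 2,
    "memory_mb": 4096,
    "os": "ubuntu-22.04",
}

with open("run_recipe.json", "w") as f:
    json.dump(recipe, f, indent=2)
```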
Architecture blueprint: the debug AI that builds reproducers
A pragmatic architecture for an automated reproducer generator looks like this:
- Ingestor
- Streams CI logs, test reports, artifacts, traces, and git metadata
- Normalizes into a common schema (e.g., JSON with fields: job_id, step, ts, level, message, test_case, stack, env)
- Normalizer & Fingerprinter (sketched after this overview)
- Extracts failure signatures: stack trace frames, error codes, assertion messages
- Maps failures to code locations (file:line, function) and libraries
- Retriever (RAG Store)
- Indexes code snippets, recent diffs, failing stack frames, infra docs, and run recipes
- Provides top-k chunks to prompt the LLM with contextually relevant material
- Planner (LLM Orchestrator)
- Given the failure signature and retrieved context, plans a step-by-step reproduction strategy: environment, test runner args, seeds, and mocks
- Produces an initial candidate test or shell script
- Sandbox Runner
- Spins up a hermetic runtime (container or microVM) with restricted network, pinned CPU, deterministic time/seed
- Executes the candidate reproducer repeatedly
- Captures results and artifacts
- Shrinker
- Uses delta debugging, AST-aware slicing, and property-based shrinking to reduce test inputs, flags, and fixtures while preserving the failure
- Secret Guardian
- Redacts sensitive data from all prompts, logs, and artifacts
- Enforces policy on where data can be stored or queried
- Evaluator & Feedback
- Scores reproducibility and minimality
- Feeds outcomes back into RAG store for future runs (e.g., this type of Jest timeout requires fake timers)
This loop continues until the reproducer is minimal and deterministic enough to hand off to a human or to let the AI propose a patch.
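As a minimal sketch of the Normalizer & Fingerprinter stage, the goal is to reduce a raw failure to a stable signature that can key the RAG store and later confirm that a candidate reproducer fails for the same reason. The CPython-style frame format and the hash length are assumptions:

```python
import hashlib
import re

def failure_signature(test_name: str, error_type: str, stack: str, top_n: int = 5) -> str:
    """Build a stable fingerprint from the error type and the top stack frames.

    Line numbers are stripped so small, unrelated edits do not change the signature.
    """
    frames = re.findall(r'File "([^"]+)", line \d+, in (\w+)', stack)  # CPython-style frames
    normalized = [f"{path}:{func}" for path, func in frames[:top_n]]
    payload = "|".join([test_name, error_type, *normalized])
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

# Two runs of the same flake map to the same signature even if line numbers drift slightly.
```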
From failure to plan: RAG prompting that actually helps
LLMs are not omniscient; they are pattern matchers. They perform best when grounded in precise, local context. RAG aligns the model toward the specific failure at hand.
Recommended RAG corpus:
- Code adjacency
- The failing file and neighbors in the module graph
- Recent diffs touching the failing code or test harness
- Test runner docs and recipes
- How to run a single test (pytest, JUnit, jest, go test, cargo test)
- Flags for seed, order, retries, and sharding
- Build system and infra docs
- Container images, caching, Bazel/Nix targets
- How to run database or service mocks locally
- Observability breadcrumbs
- Stack trace frames and corresponding code snippets
- Traces of failing spans (SQL, HTTP)
- Team-specific playbooks (if available and safe to index)
- Known flake patterns, e.g., "increasing the Jest timeout is an antipattern; use fake timers instead"
Chunking strategy:
- Prefer semantic chunks (functions/classes) over fixed tokens to preserve code boundaries.
- Annotate chunks with provenance: file path, commit hash, line ranges.
- Use hybrid retrieval: embedding similarity for messages + symbolic filters for language/test framework.
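To make the provenance and hybrid-retrieval points concrete, a chunk record can carry symbolic metadata that is filtered before embedding similarity ranks the survivors; the schema and filter below are illustrative, not any particular vector store's API:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Chunk:
    text: str               # one function or class, not a fixed token window
    path: str               # provenance: file path
    commit: str             # provenance: commit hash
    lines: Tuple[int, int]  # provenance: (start_line, end_line)
    language: str           # symbolic filter key, e.g., "python"
    framework: str          # symbolic filter key, e.g., "pytest"

def prefilter(chunks, language: str, framework: str):
    # Symbolic pre-filter; embedding similarity then ranks whatever survives.
    return [c for c in chunks if c.language == language and c.framework == framework]
```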
Prompt template (system message) example:
You are a debugging assistant. Given CI logs, traces, and code context, output a deterministic, minimal reproducer for the failure.
Constraints:
- Only use commands and code that can run inside the provided container image.
- Prefer running a single test with a fixed seed and time.
- Avoid changing application logic; focus on deterministic reproduction.
- Propose deltas to reduce scope (e.g., test selection, fixtures, flags).
- Do not include any secrets or tokens.
Output:
1) Proposed test command
2) Optional additional test code
3) Required environment and seed/time control
4) A list of assumptions
Retrieved context is inserted as a separate section, clearly delimited, so the model can cite relevant snippets without hallucinating.
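A sketch of that assembly step, reusing the Chunk record from the retrieval sketch above; the delimiters and header format are a convention, not a requirement of any particular model:

```python
def build_prompt(system_msg: str, failure_summary: str, chunks) -> str:
    # Each chunk is wrapped with its provenance so the model can cite it instead of guessing.
    blocks = [
        f"### {c.path}@{c.commit[:8]} L{c.lines[0]}-{c.lines[1]}\n{c.text}"
        for c in chunks
    ]
    return (
        f"{system_msg}\n\n"
        "=== FAILURE ===\n"
        f"{failure_summary}\n\n"
        "=== RETRIEVED CONTEXT (read-only, cite by header) ===\n"
        + "\n\n".join(blocks)
    )
```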
Sandboxing: runnable, safe, and fast
Reproduction requires execution, and execution requires isolation.
- Runtime isolation
- Containers: gVisor or Kata Containers enhance isolation over runc
- MicroVMs: Firecracker offers fast, lightweight VMs with strong isolation
- User namespaces and seccomp profiles to minimize syscall surface
- Hermetic inputs
- Pin base images and packages via digests (e.g., python:3.11@sha256:...)
- Use Nix/Bazel for reproducible toolchains where feasible
- Deterministic runtime
- CPU pinning to reduce scheduling variance (taskset/cgroups)
- Fixed locale, TZ, and LANG
- Controlled network (localhost only unless explicit mocks)
- Capture
- Record stdout/stderr, exit codes
- Store test run seeds, shuffled order, flaky probability distribution
A typical sandbox runner API might look like this (Python pseudo-implementation):
```python
from dataclasses import dataclass
import subprocess

@dataclass
class RunSpec:
    image: str
    cmd: list
    env: dict
    cpu_set: str = '0'
    network: str = 'none'  # or 'mock'
    mounts: list = None

@dataclass
class RunResult:
    exit_code: int
    stdout: str
    stderr: str
    artifacts: dict

class Sandbox:
    def run(self, spec: RunSpec) -> RunResult:
        # Example: run via 'docker run' with the gVisor runtime
        env_args = sum((["-e", f"{k}={v}"] for k, v in spec.env.items()), [])
        mount_args = sum((["-v", m] for m in (spec.mounts or [])), [])
        cmd = [
            "docker", "run", "--rm",
            "--runtime=runsc",              # gVisor
            "--cpuset-cpus", spec.cpu_set,
            "--network", spec.network,
            *env_args,
            *mount_args,
            spec.image,
            *spec.cmd,
        ]
        p = subprocess.run(cmd, capture_output=True, text=True)
        return RunResult(p.returncode, p.stdout, p.stderr, artifacts={})
```
Swap Docker for Firecracker or another executor if you need stronger isolation or resource control.
Determinism: freeze time, seed randomness, unflake order
Flakiness often comes from hidden nondeterminism: time, randomness, concurrency, network, and IO. The reproducer must parameterize and control these.
- Time control
- Inject a test clock abstraction (e.g., Java's Clock, Python's freezegun)
- Freeze time in tests: ensure all code under test consults the mock clock
- In Node, use Jest fake timers for timers and intervals
- Randomness
- Use fixed seeds; ensure both test and application RNGs are seeded
- For property-based frameworks (Hypothesis/QuickCheck), record seeds and strategies
- Concurrency and ordering
- Run tests serially: pytest -n 0, jest --runInBand, go test -p=1, cargo test -- --test-threads=1
- For Java, tools like NonDex explore nondeterministic iteration order to surface order-dependent flakiness
- Networking and IO
- Replace external calls with mocks/stubs or recorded fixtures
- Use ephemeral ports deterministically or bind to fixed ports in isolation
Framework-specific recipes:
- Python (pytest)
- Run a single test: pytest -k 'TestFoo and test_bar' -q -s
- Disable xdist/parallel: -n 0
- Fix seed: pytest-randomly (--randomly-seed=1234) or Hypothesis: --hypothesis-seed=1234
- Freeze time: freezegun
- Java (JUnit with Maven Surefire)
- Single test: mvn -q -Dtest=ClassName#method test
- Disable reruns in reproducer; capture surefire-reports
- NonDex to explore iteration order nondeterminism
- JavaScript (Jest)
- Single test by name: jest -t 'renders when user is admin' and --runInBand
- Use --ci and fake timers (modern) to control time
- Go
- Single test: go test ./pkg/foo -run TestBar -count=1 -shuffle=off
- Seed control: no built-in seed flag; register your own (e.g., -seed=1234) in TestMain, or use -shuffle=1234 to pin the shuffle seed
- Rust
- cargo test test_name -- --nocapture --test-threads=1
- Seed: pass via env var to your RNG initializer
The key is to encode these levers in the plan produced by the LLM, and enforce them in the sandbox runner.
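One way to encode those levers is a small plan object that the orchestrator fills in and the sandbox runner translates into a command line and environment. This sketch assumes pytest-randomly is installed; REPRO_FROZEN_TIME is a hypothetical project convention consumed by a freezegun fixture, not a pytest feature:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class DeterminismPlan:
    test_selector: str                 # e.g., "TestJob and test_job_finishes"
    seed: int = 1234
    serial: bool = True
    frozen_time: Optional[str] = "2024-01-01T00:00:00Z"

def pytest_command(plan: DeterminismPlan) -> Tuple[List[str], Dict[str, str]]:
    """Translate a plan into a pytest command and environment."""
    cmd = ["pytest", "-k", plan.test_selector, "-q", f"--randomly-seed={plan.seed}"]
    if plan.serial:
        cmd += ["-p", "no:xdist"]  # or "-n", "0" if xdist should stay loaded
    env = {"PYTHONHASHSEED": str(plan.seed), "TZ": "UTC", "LANG": "C.UTF-8"}
    if plan.frozen_time:
        env["REPRO_FROZEN_TIME"] = plan.frozen_time  # consumed by a time-freezing fixture
    return cmd, env
```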
From CI failure to initial reproducer
Let's walk through a concrete pipeline that turns a GitHub Actions failure into a first-pass reproducer.
- Parse the job
- Download job logs, JUnit XML, and artifacts
- Extract failing tests (name, class, file), with stack traces
- Identify the test runner command
- Identify environment
- Pull the container image digest from the job
- Snapshot relevant env vars (sanitized)
- Detect runtime flags: parallelism, seed, shuffle
- Draft a plan
- Use RAG to retrieve docs on how to run a single test for the framework
- Ask the LLM to propose a single-command reproduction plus any seed/time arguments
- Execute in sandbox
- Run 10 times and compute failure rate (see the estimation sketch after this list)
- If p(fail) ~ 1.0, proceed to shrinking
- If p(fail) < 0.2, classify as flaky; choose a stabilization strategy (e.g., mock time, retry until failure to gather more traces)
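A sketch of the repeat-and-classify step, reusing the Sandbox and RunSpec from the sandbox runner above; the thresholds mirror the ones just listed and should be tuned to your tolerance:

```python
def estimate_failure_rate(sandbox: "Sandbox", spec: "RunSpec", runs: int = 10) -> float:
    # Re-run the candidate reproducer and count non-zero exits as failures.
    fails = sum(1 for _ in range(runs) if sandbox.run(spec).exit_code != 0)
    return fails / runs

def classify(p_fail: float) -> str:
    if p_fail >= 0.9:  # "~1.0" in practice
        return "deterministic enough: proceed to shrinking"
    if p_fail < 0.2:
        return "flaky: stabilize first (freeze time, mock network, run serially)"
    return "intermittent: gather more runs and traces before shrinking"
```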
Sample GitHub Actions step to publish the recipe:
```yaml
- name: Publish run recipe
  if: failure()
  run: |
    echo "image=$(docker inspect --format='{{.Image}}' $GITHUB_JOB_CONTAINER_ID)" >> $GITHUB_OUTPUT
    echo "cmd=$GITHUB_STEP_SUMMARY" >> $GITHUB_OUTPUT
  env:
    GITHUB_JOB_CONTAINER_ID: ${{ job.container.id }}
```
In practice, you'll collect a richer recipe via a small wrapper around your test runner that emits JSON with the command, working directory, and flags.
Shrinking: from repro to minimal repro
Once you can reproduce the failure deterministically, the next step is to minimize it.
Techniques you should combine:
- Delta debugging (ddmin)
- Classic algorithm from Zeller & Hildebrandt to isolate failure-inducing deltas by systematic partitioning and testing subsets
- Apply to: test flags, environment variables, fixtures, input files, dependency versions
- AST-aware slicing
- Parse the test file into an AST; identify the minimal set of statements and imports needed to trigger the failure
- Use static analysis to remove dead code and unused fixtures
- Property-based shrinking
- If the test data is generated, reuse the framework's shrinking (e.g., Hypothesis shrinks counterexamples automatically)
- Order-sensitive reduction
- For inter-test order issues, build a minimal sequence of tests that triggers the failure; tools like iDFlakies and NonDex inform the search
Minimal ddmin core (pseudocode):
```python
def ddmin(config_items, test_fn):
    """Simplified ddmin: repeatedly drop chunks of config_items while the failure persists.

    config_items: list of toggles, flags, or inputs; test_fn returns 'FAIL' or 'PASS'.
    """
    n = 2
    current = list(config_items)
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        progressed = False
        for i in range(0, len(current), chunk):
            candidate = current[:i] + current[i + chunk:]  # drop one chunk, keep the rest
            if test_fn(candidate) == 'FAIL':
                current = candidate           # the failure survives without this chunk
                n = max(n - 1, 2)
                progressed = True
                break
        if not progressed:
            if n >= len(current):
                break                         # already at single-item granularity
            n = min(len(current), n * 2)      # refine the partition
    return current
```
Use the sandbox runner as the oracle for test_fn. The trick is defining config_items granularly: each flag, each test fixture, and each mock can be a toggleable item.
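For example, the items and the oracle might be wired like this, reusing the Sandbox and RunSpec from earlier; the item encoding, the IMAGE digest, and the test name are assumptions for illustration:

```python
# Each item is (kind, value); dropping an item removes that flag, env var, or fixture from the run.
config_items = [
    ("flag", "-s"),
    ("flag", "--maxfail=1"),
    ("env", ("PYTHONHASHSEED", "0")),
    ("env", ("TZ", "UTC")),
    ("fixture", "db_fixture"),  # applied by rewriting the test file before the run
]

def test_fn(items) -> str:
    cmd = ["pytest", "-k", "test_job_finishes", "-q"]
    env = {}
    for kind, value in items:
        if kind == "flag":
            cmd.append(value)
        elif kind == "env":
            key, val = value
            env[key] = val
        # "fixture" items would feed the AST shrinker described below
    result = sandbox.run(RunSpec(image=IMAGE, cmd=cmd, env=env))
    return "FAIL" if result.exit_code != 0 else "PASS"
```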
AST-based shrinking example for pytest:
```python
import ast

class TestShrinker(ast.NodeTransformer):
    """Keep only the failing test function(s); drop every other test_* definition."""

    def __init__(self, names_to_keep):
        self.keep = set(names_to_keep)

    def visit_FunctionDef(self, node):
        if node.name.startswith('test_') and node.name not in self.keep:
            return None  # remove unrelated tests
        return self.generic_visit(node)

    def visit_Import(self, node):
        # Remove unused imports in a second pass with name-usage analysis
        return node

def shrink_test_file(source: str, names_to_keep) -> str:
    # Parse the test file, keep only the failing test and required fixtures, re-emit source.
    tree = TestShrinker(names_to_keep).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # Python 3.9+
```
Iterate: after each AST shrink, run in the sandbox. If it still fails, keep the reduction.
Dealing with flakiness: probabilistic reproduction
Some failures are irreducibly flaky without deeper fixes. Treat them statistically.
- Failure probability estimation
- Run the reproducer N times; estimate p_hat = fails / N
- Report confidence intervals, e.g., a Wilson interval for the binomial proportion (sketched below)
- Prioritized stabilization
- If p_hat is low but the error signature suggests timeouts, attempt time freezing and network mocking first
- If the stack implicates concurrency, enforce single-threaded execution and fixed scheduling (e.g., Go: GOMAXPROCS=1)
- Sample-based shrinking
- Instead of requiring a strict FAIL, accept a reduction that keeps p_hat above a threshold
- This turns ddmin into a noisy search; use more runs per candidate subset
Empirically, this approach converges to usable reproducers even for notorious flakes.
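A sketch of the estimate with a Wilson score interval, which the noisy-ddmin acceptance rule above can consume directly:

```python
import math

def wilson_interval(fails: int, runs: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for the failure probability of a reproducer."""
    if runs == 0:
        return (0.0, 1.0)
    p_hat = fails / runs
    denom = 1 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Example: 7 failures in 10 runs -> roughly (0.40, 0.89); more runs tighten the interval.
```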
Git history as a debugging prior: bisect and blame
Git history is an underused prior for reproducers.
- Map failing frames to recently changed files: if a function appears in both the stack and the diff, escalate its importance
- Explore git bisect for failures that started after a merge: run the reproducer across commits to identify the first bad commit
- Use historical reproducers: if a similar failure happened before, retrieve its reproducer and adapt
Automating a lightweight bisect inside the sandbox:
```bash
set -euo pipefail
start_commit=$(git merge-base HEAD origin/main)   # assumed good
end_commit=$(git rev-parse HEAD)                  # known bad
# Narrow the range until only the first failing commit remains
while [ "$(git rev-list --count "$start_commit".."$end_commit")" -gt 1 ]; do
  mid=$(git rev-list --bisect "$start_commit".."$end_commit")
  git checkout --quiet "$mid"
  if ./run_repro.sh; then
    start_commit=$mid   # pass: failure not present; first bad commit is later
  else
    end_commit=$mid     # fail: failure present here or earlier
  fi
done
echo "First failing commit: $end_commit"
```
To keep this fast, restrict to the affected submodule and cache builds where possible.
Secret safety: never let credentials leak
If you index raw CI logs and artifacts, you risk capturing secrets. AI systems must be designed to minimize that risk.
- Ingestion-time scanning and redaction
- Apply high-precision detectors (entropy + patterns) like trufflehog or detect-secrets
- Taint-track known secret sources (env vars like AWS_* and GITHUB_TOKEN)
- Redact before storage and before RAG indexing
- Principle of least privilege
- Use short-lived OIDC-issued tokens in CI; never store long-lived tokens in logs
- Restrict the sandbox's network; default to no egress
- Prompt hygiene
- Prevent the LLM from seeing raw secrets; mask tokens as *** with reversible mapping only inside the sandbox if strictly necessary
- Ensure the LLM provider does not train on your prompts unless explicitly allowed under a secure data policy
- Data retention
- Define lifecycle policies; expire raw logs fast, keep only minimal, redacted reproducer artifacts
A simple redaction layer for environment variables:
```python
SENSITIVE_PREFIXES = [
    'AWS_', 'GITHUB_', 'GCLOUD_', 'AZURE_',
    'SLACK_', 'SENTRY_', 'DB_PASSWORD',
]

def sanitize_env(env: dict) -> dict:
    redacted = {}
    for k, v in env.items():
        if any(k.startswith(p) for p in SENSITIVE_PREFIXES):
            redacted[k] = '***REDACTED***'
        else:
            redacted[k] = v
    return redacted
```
Run this before serialization or prompt construction.
Example end-to-end: Python/pytest flaky timeout
Scenario: CI shows occasional TimeoutError: expected call not received in a test that depends on an async job completing.
- Ingestion finds:
- Failing test: tests/test_worker.py::TestJob::test_job_finishes
- Stack shows waiting on an asyncio task with a 3s timeout
- Logs show retries against an HTTP endpoint
- RAG retrieves:
- Pytest docs on single-test run
- freezegun usage and pytest-asyncio notes
- Team playbook: mock external HTTP in tests; do not rely on real network
- Plan:
- Command: pytest -k 'TestJob and test_job_finishes' -q -s --maxfail=1
- Env: PYTHONHASHSEED=0, disable parallel
- Time control: patch time.monotonic via a fixture
- Network: replace httpx.Client with a local stub
- Sandbox run:
- Failure probability goes from 0.3 in CI to 1.0 in sandbox after applying time freeze and network stub
- Shrink:
- Remove unrelated fixtures; minimal test imports only the job, the stub, and the time fixture
- Delta-debug flags; drop -s and maintain failure
- Output:
- A single test file snippet that fails deterministically, plus a one-line command to run it
Minimal test sketch:
```python
import asyncio

import pytest

from myapp.jobs import run_job

class StubClient:
    # Local stand-in for the real HTTP client; always reports success.
    def post(self, *args, **kwargs):
        return type('Resp', (), {'status_code': 200, 'json': lambda self: {'ok': True}})()

def test_job_finishes(monkeypatch, freezer):
    # 'freezer' is the pytest-freezegun fixture; monkeypatch swaps in the stub client.
    monkeypatch.setattr('myapp.http.client', StubClient())
    freezer.move_to('2024-01-01T00:00:00Z')
    with pytest.raises(asyncio.TimeoutError):
        asyncio.run(run_job(timeout=0.01))
```
Now both humans and AI can iterate. The AI can propose fixes: remove real network calls from the job or use proper cancellation semantics.
Example end-to-end: Java/JUnit order-dependent flake
Scenario: CI fails in ClassBTest#testX sporadically; stack suggests shared static state.
- Ingestion finds order-specific failure after ClassATest#testInit runs first
- RAG retrieves JUnit @FixMethodOrder notes, Maven Surefire flags, NonDex documentation
- Plan:
- Command: mvn -q -Dtest=ClassATest#testInit,ClassBTest#testX test to confirm order-dependent failure
- Suggest NonDex run to explore iteration order
- Sandbox run:
- Deterministically fails when A runs before B
- Shrink:
- Minimal reproducer becomes a two-test suite, with ClassA setting a static field
- Output: Two minimal test classes and a one-liner to run them
This level of specificity is what developers need and what AI can reason about.
Observability as a first-class input: OpenTelemetry in tests
If you instrument test execution with OpenTelemetry spans, you can correlate failing assertions with the exact HTTP calls, queries, and retries. This makes both RAG retrieval and shrinking better.
- Emit spans for test lifecycle: test start/stop with attributes for seed, shuffled order
- Emit spans for network calls and tag them with mocked=true or real=true
- Correlate with logs via trace_id/log correlation
The retriever then pulls the specific spans for the failure and feeds them to the LLM, which can recommend mocks or time control appropriately.
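As a sketch, a pytest autouse fixture can emit one lifecycle span per test with the seed and order attributes; it assumes the opentelemetry-api/sdk packages and an exporter are already configured, and the attribute names are illustrative:

```python
import pytest
from opentelemetry import trace

tracer = trace.get_tracer("test-lifecycle")

@pytest.fixture(autouse=True)
def test_span(request):
    # One span per test; the retriever and shrinker can later join on the trace_id.
    with tracer.start_as_current_span(
        request.node.nodeid,
        attributes={"test.seed": 1234, "test.order": "serial", "net.mocked": True},
    ):
        yield
    # Outcome attributes can be added from a pytest report hook; omitted for brevity.
```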
Evaluating success: metrics that matter
You cannot improve what you do not measure. Track:
- Reproduction rate
- Percentage of CI failures that yield a runnable reproducer automatically
- Determinism score
- Failure probability under N repeated runs
- Shrink ratio
- Lines of code and number of fixtures/flags removed from the original context
- Time to first failing run
- From CI signal arrival to the first deterministic failure locally
- Human effort saved
- Median developer time spent on failure triage before/after
- False positives
- Cases where the reproducer fails for the wrong reason; require signature matching (stack-frame matching, sketched below) to avoid this
Use these metrics to prioritize investments: if shrink ratio is low, invest in AST-aware reduction. If determinism is low, improve time/seed/network control.
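Signature matching can reuse the fingerprint from the normalization sketch earlier: a reproducer only counts as a success if its failure hashes to the same signature as the original CI failure.

```python
def same_failure(original_signature: str, repro_test: str, repro_error: str, repro_stack: str) -> bool:
    # Reuses failure_signature() from the fingerprinting sketch; mismatches count as false positives.
    return failure_signature(repro_test, repro_error, repro_stack) == original_signature
```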
Cost and performance: making it fast enough to matter
- Warm caches
- Maintain a pool of pre-warmed sandboxes with common images
- Cache build artifacts per commit and per module
- Incremental narrowing
- Run quick checks (single test) before expensive multi-test order explorations
- Parallel hypotheses
- Try multiple stabilization strategies concurrently (freeze time, mock network, enforce serial)
- Early exits
- Cap runs when confidence is sufficient; do not chase 0.999 failure probability if 0.99 is enough for practical debugging
Implementation checklist
- Data pipeline
- CI connectors (GitHub Actions, GitLab CI, Jenkins)
- Parsers for JUnit XML, Allure, raw logs
- Sanitization + secret redaction
- RAG backend
- Embedding index for code and docs
- Symbolic filters (language, framework, module)
- Prompt templates with constraints
- Sandbox
- Container/microVM runner with network policy and CPU pinning
- Seed/time control utilities per language
- Artifact capture and run metadata
- Shrinker
- ddmin over flags/env/fixtures
- AST-aware reducers per language
- Order-sensitive reduction utilities
- Governance
- Data retention policies
- Model privacy settings (no training on prompts)
- Audit logs of access to artifacts
Limitations and guardrails
- Heisenbugs may evade deterministic reproduction even with strong controls; record/replay systems can help but are costly
- Multi-service integration tests may require orchestrated mocks that the AI cannot infer automatically; human-provided recipes accelerate success
- Some frameworks lack good seed/time hooks; invest in adding testability seams to your codebase
- Secret redaction is probabilistic; defense in depth is required (redaction + least privilege + no egress)
Where this goes next
- Learned strategies
- Feed successful reproducer strategies back into RAG to improve future planning
- Program repair
- Once a minimal reproducer exists, AI can propose patches and validate them against the reproducer and related tests
- Cross-repo reasoning
- For monorepos and multi-repo systems, integrate build graphs to isolate only the impacted packages
- Deterministic build stacks
- Broader adoption of hermetic builds (Bazel/Nix) will reduce environmental flakiness across the board
References and further reading
- Andreas Zeller and Ralf Hildebrandt, Simplifying and isolating failure-inducing input with delta debugging (2002)
- Qi Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov, An Empirical Analysis of Flaky Tests (FSE 2014)
- Wing Lam et al., iDFlakies: A Framework for Detecting Order-Dependent Flaky Tests (2019)
- NonDex: Detecting and avoiding non-determinism in Java tests
- Hypothesis: property-based testing and shrinking
- OpenTelemetry
- Firecracker MicroVM
- gVisor
- Reproducible Builds
- Git bisect
- TruffleHog (secret scanning)
Conclusion
The shortest path from a flaky CI failure to a fix is a small, deterministic reproducer. With the right architecture, a debug AI can generate that reproducer automatically by combining RAG-grounded planning, hermetic sandboxing, and systematic shrinking, all while preserving privacy and security. The payoff is significant: fewer stalled releases, faster root cause analysis, and a repository of precise, executable knowledge about how your system fails.
None of this eliminates the need for good engineering practices; in fact, it amplifies them. Invest in deterministic hooks (seed/time), observability in tests, and hermetic builds. Then let the AI do the tedious work of reconstruction and minimization. Your developers will thank you, and your CI will finally be a trustworthy signal rather than a coin flip.
