Your Flaky Tests Are Lying: How Code Debugging AI Finds Nondeterminism and Environment Drift
Flaky tests are not just annoying; they are actively deceptive. They fail when nothing is broken and, once engineers learn to ignore them, mask real breakage when it happens. Worse, they induce fatigue—engineers stop believing failing tests, quarantine them, and let defects slip through. At scale, that erodes trust in your pipeline and subtly changes your engineering culture.
The good news: flakes are not magic. Almost all of them collapse to a small number of nondeterminism sources and environment drifts. If you can make them deterministic—once—then you can isolate the root cause, write a fix, and turn a perennial time sink into a one-time cleanup.
This article lays out a practical, technical blueprint for how a code-debugging AI can do that for you automatically:
- Fingerprint randomness and capture seeds.
- Mock time and I/O to remove nondeterminism at the system boundary.
- Diff and score environment drift across machines, containers, and CI runners.
- Mine CI history to cluster flaky signatures and rank likely root causes.
- Emit a deterministic, copy-paste repro recipe you can run locally or in CI.
You’ll see concrete examples across Python, Java, JavaScript/Node, Go, and systems-level hooks (LD_PRELOAD/eBPF), plus references to research and battle-tested industry practices. The goal is practical: turn flaky failures into deterministic reproductions before they poison your pipeline.
Flakiness: A Taxonomy That Maps to Fixes
What counts as "flaky"? A test that both passes and fails under the same code and inputs without any intended code change. Typical classes:
- Time-dependent: flakes triggered by wall-clock, DST, leap seconds, timer accuracy, or event-loop jitter.
- Randomness-dependent: RNG seeds, implicit PRNG usage, randomized iteration order.
- Concurrency/race: thread scheduling, IO buffering, async ordering, unsynchronized access.
- Order-dependent: hidden dependency between tests, mutable global state, leakage across test boundaries.
- IO/network nondeterminism: DNS, retries, timeouts, network partitions, rate limits, server-side A/B.
- Resource-dependent: low disk, ephemeral ports, FD limits, CPU contention, slow storage.
- Floating-point/precision: non-associativity on different CPUs or compiler flags; SIMD and fused-multiply-add.
- Environment drift: OS/kernel, CPU flags, locale/timezone, filesystem semantics (case sensitivity, mtime precision), container base image, library versions, feature flags.
Each class has a straightforward lever for determinization. Code-debugging AI exploits those levers systematically.
Why "Make It Deterministic" Is the Only Strategy That Scales
Manual flake triage doesn’t scale. Re-running tests 100x and hoping for failure is expensive and often inconclusive. Quarantining just hides the issue. The strategy that scales is to instrument the test and its environment so that all nondeterminism sources become deterministic. Once you can flip a flake into a deterministic failing repro, you can:
- Minimize the failing input via delta debugging.
- Localize root cause with spectrum-based or trace-based techniques.
- Propose a precise fix (and an assertion) to prevent regressions.
A code-debugging AI is well-suited to both orchestration and inference in this workflow: it can spin up controlled sandboxes, capture seeds and environment, search the flakiness space, and output a concise repro recipe.
Architecture of a Code-Debugging AI for Flakes
A practical system includes:
- Test orchestrator
  - Runs tests repeatedly under controlled perturbations (fixed seeds, frozen time, sandboxed IO).
  - Records provenance: commit SHA, container image digest, OS/kernel, CPU flags, env vars, locale/timezone, package lock data.
- Instrumentation agents
  - RNG hooks for common languages (Python/NumPy/Torch, Java Random, JS Math.random, Go math/rand, C++ <random>).
  - Time hooks (wall clock, monotonic, timers) and filesystem/network interposition (LD_PRELOAD/eBPF on Linux, DYLD on macOS).
  - Coverage and syscall tracing for spectrum-based diagnostics.
- Environment differ
  - Computes semantic diffs across machines/containers (e.g., glibc minor version, tzdata, CPU microcode, locale, Docker base image digest, Node/Python/Java runtime versions).
  - Weights diffs by suspiciousness learned from prior flakes.
- CI history miner
  - Clusters failure signatures (stack frames, exception types, stdout fragments, test names) across builds/jobs.
  - Computes flakiness metrics, change suspiciousness (Ochiai/Tarantula), and drift trends.
- Repro synthesizer
  - Generates a single-command reproduction: pinned container, env var overrides, RNG seeds, frozen time, dependency lockfile, test selection, and a minimal input case.
Now let’s drill into the technical levers.
Fingerprinting Randomness and Capturing Seeds
Randomness is the most tractable form of nondeterminism. If you record a seed and intercept RNG calls, you can replay the exact bitstream that produced the failure.
Practical RNG Sources to Hook
- Python: random, secrets, numpy.random, torch.manual_seed, os.urandom.
- Java: java.util.Random, ThreadLocalRandom, SplittableRandom, SecureRandom (less common for tests).
- JavaScript/Node: Math.random, crypto.randomBytes.
- Go: math/rand (the package-level source; seed it explicitly in tests), crypto/rand.Read.
- C++: std::mt19937 and friends; C rand; platform urandom.
- Property-based testing: Hypothesis (Python), QuickCheck/ScalaCheck, jqwik (JUnit 5), testing/quick (Go). These already expose seeds—log them.
Technique: Entropy Fingerprint
An entropy fingerprint is a compact summary of where and how much randomness a test consumed:
- Intercept RNG calls and hash the callsite stack trace to a stable ID.
- Record the count and sequence of calls per callsite.
- Log the global seed(s) and local seeds (e.g., np.random.RandomState or torch generators).
If a test fails intermittently, compare fingerprints across runs. A consistent seed + fingerprint that correlates with failures yields a deterministic repro.
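Here is a minimal Python sketch of the idea, covering only the `random.random` entry point; the module and helper names are illustrative, and a real agent would hook the other RNG entry points (`randint`, NumPy generators, etc.) the same way:

```python
# entropy_fingerprint.py — illustrative sketch; names are hypothetical
import hashlib
import random
import traceback
from collections import Counter

_calls = Counter()            # callsite ID -> number of RNG draws observed
_orig_random = random.random

def _callsite_id(depth=6):
    # Hash the last few user stack frames (file:line) into a stable ID.
    frames = traceback.extract_stack()[:-2][-depth:]
    key = ";".join(f"{f.filename}:{f.lineno}" for f in frames)
    return hashlib.sha1(key.encode()).hexdigest()[:12]

def _instrumented_random():
    _calls[_callsite_id()] += 1
    return _orig_random()

def install():
    # Route module-level random.random() calls through the counter.
    random.random = _instrumented_random

def fingerprint():
    # Compact summary: total draws plus per-callsite counts.
    return {"total": sum(_calls.values()), "callsites": dict(sorted(_calls.items()))}
```

Comparing fingerprints from a passing and a failing run shows which callsites drew a different amount of entropy; combined with the logged seed, that points the replay at a specific RNG consumer.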
Example: A Pytest Plugin to Capture and Replay Seeds
```python
# conftest.py
import os, time, json, random
import numpy as np

try:
    import torch
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False

_SEED_LOG_PATH = os.environ.get("SEED_LOG_PATH", ".seed-log.jsonl")

def _seed_all(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    if HAS_TORCH:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

def pytest_addoption(parser):
    parser.addoption("--seed", action="store", default=None, help="Global RNG seed")

def pytest_configure(config):
    seed = config.getoption("--seed")
    if seed is None:
        # Stable seed unless overridden; still allows CI to sweep
        seed = os.environ.get("PYTEST_SEED", str(int(time.time()))[:8])
    seed = int(seed)
    _seed_all(seed)
    os.environ["PYTEST_SEED"] = str(seed)
    with open(_SEED_LOG_PATH, "a") as f:
        f.write(json.dumps({"ts": time.time(), "seed": seed, "pid": os.getpid()}) + "\n")

# Optional: intercept os.urandom for tests via an autouse fixture
from contextlib import contextmanager

@contextmanager
def deterministic_urandom(seed: int):
    rng = np.random.RandomState(seed)
    orig_urandom = os.urandom
    def fake_urandom(n):
        return rng.bytes(n)
    os.urandom = fake_urandom
    try:
        yield
    finally:
        os.urandom = orig_urandom

import pytest

@pytest.fixture(autouse=True)
def _apply_determinism():
    with deterministic_urandom(int(os.environ["PYTEST_SEED"])):
        yield
```
When a flaky failure appears in CI, your code-debugging AI reads .seed-log.jsonl and re-runs with --seed=<value>. It also stores the entropy fingerprint of RNG callsites, enabling delta-debugging the random sequence (ddmin) to find the minimal subsequence or callsite responsible.
Example: Node Monkeypatch for Math.random
```js
// test/setup.js
const seedrandom = require('seedrandom');

const seed = process.env.JEST_SEED || `${Math.floor(Date.now() / 1000)}`;
const rng = seedrandom(seed);
const origRandom = Math.random;
Math.random = () => rng();
console.log(`[seed] JEST_SEED=${seed}`);

// For crypto.randomBytes, use a shim only in tests that opt in; avoid weakening security code.
```
JVM Agent for java.util.Random
In Java, use a Java agent (e.g., ByteBuddy) to intercept Random.next and ThreadLocalRandom methods. Your agent can initialize a seeded SplittableRandom per thread to preserve concurrency semantics while remaining reproducible.
The key idea: you don’t need to patch app code; the agent binds at runtime, only in tests.
Mocking Time and IO: Removing the Flakiest Boundary Conditions
Wall-clock time, timers, and IO are common flake triggers. You can neutralize them with systematic mocking or OS-level interposition.
Time: Freeze, Control, and Make Monotonic
Common problems:
- Tests assume now() advances by at least X ms; timer resolutions vary.
- DST or timezone changes influence date arithmetic.
- System time jumps (NTP adjustments) cause negative durations.
- Misuse of wall clock where monotonic time is required.
Solutions:
- Inject a clock interface in app code and use a test clock.
- Monkeypatch/fake timers in test harnesses.
- OS-level time interposition for legacy code under test.
Examples:
- Python: freezegun or time-machine (a minimal freezegun sketch follows this list), but prefer to inject clocks for core logic.
- JavaScript: jest.useFakeTimers() (modern fake timers are the default since Jest 27) or Sinon fake timers; advance time deterministically.
- Go: define an interface type Clock { Now() time.Time; Sleep(d time.Duration) } and pass it.
- JVM: java.time.Clock injection; in tests, use Clock.fixed or Clock.offset.
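To make the Python bullet concrete, here is a minimal sketch using freezegun; the `days_until` helper and the test itself are hypothetical:

```python
# test_deadlines.py — hypothetical example using freezegun
from datetime import datetime, timezone
from freezegun import freeze_time

def days_until(deadline: datetime) -> int:
    # Code under test that (problematically) reads the wall clock directly.
    return (deadline - datetime.now(timezone.utc)).days

@freeze_time("2024-03-10 01:30:00", tz_offset=0)
def test_days_until_is_stable_across_dst():
    deadline = datetime(2024, 3, 20, tzinfo=timezone.utc)
    # With time frozen, this assertion cannot flake on the DST transition.
    assert days_until(deadline) == 9
```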
Example: Go Clock Interface
```go
// clock/clock.go
package clock

import "time"

type Clock interface {
	Now() time.Time
	Sleep(d time.Duration)
}

type Real struct{}

func (Real) Now() time.Time        { return time.Now() }
func (Real) Sleep(d time.Duration) { time.Sleep(d) }

// testclock/testclock.go
package testclock

import "time"

type Fake struct{ now time.Time }

func New(t time.Time) *Fake           { return &Fake{now: t} }
func (c *Fake) Now() time.Time        { return c.now }
func (c *Fake) Sleep(d time.Duration) { c.now = c.now.Add(d) }
```
OS-level interposition
For code you can’t easily refactor, interpose syscalls:
- Linux: LD_PRELOAD to override clock_gettime/gettimeofday; or eBPF uprobes to hook calls.
- macOS: DYLD_INSERT_LIBRARIES.
This lets your AI freeze time to an epoch and advance deterministically during test replay. Similarly, you can intercept sleep to avoid real waiting.
Filesystem and Network: Deterministic IO
Filesystem sources of flake:
- Directory iteration order undefined; tests expect sorted results.
- mtime precision differences (1s vs ns), affecting cache invalidation.
- Case sensitivity differences (macOS default case-insensitive vs Linux).
Solutions:
- Wrap directory reads and sort deterministically in tests or via a shim (see the sketch after this list).
- Normalize path case if your target environment is case-insensitive.
- Use a deterministic temp directory naming scheme.
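A minimal Python sketch of the first two levers; the helper names are illustrative, not a standard API:

```python
# fs_determinism.py — illustrative shims for deterministic filesystem behavior
import os
from pathlib import Path

def listdir_sorted(path="."):
    # Directory iteration order is not guaranteed by the OS; sort it explicitly.
    return sorted(os.listdir(path))

def deterministic_tmpdir(base: Path, test_id: str) -> Path:
    # Derive the temp directory name from the test ID instead of a random suffix,
    # so repeated runs of the same test touch the same path.
    safe = test_id.replace("/", "_").replace(":", "_")
    d = base / f"tmp-{safe}"
    d.mkdir(parents=True, exist_ok=True)
    return d
```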
Network sources of flake:
- Timeouts and retries behaving differently under load.
- DNS TTL and resolver differences.
- External services with A/B experiments.
Solutions:
- Record/replay: capture HTTP requests/responses and replay locally (VCR for Python/Ruby, Polly.js for Node, WireMock for JVM).
- Service virtualization: run local mocks with deterministic responses.
- For e2e, set generous but deterministic budgets and disable jitter.
Example: Python VCR for Deterministic HTTP
```python
# tests/test_client.py
import vcr

my_vcr = vcr.VCR(
    cassette_library_dir='tests/cassettes',
    record_mode='once',
    filter_headers=['authorization'],
)

@my_vcr.use_cassette('get_user.yaml')
def test_get_user(api_client):
    user = api_client.get_user("alice")
    assert user["name"] == "alice"
```
When a flake occurs, the AI examines network calls; if not already recorded, it proposes adding VCR or WireMock harness around those interaction points and regenerates a deterministic cassette as part of the repro.
Environment Drift: Diff, Score, and Explain
Environment drift is a top cause of seemingly inexplicable flakes: "passes on my laptop, fails on CI." The fix is to capture a high-fidelity environment snapshot for every test run and diff them.
What to capture:
- OS: name, version, kernel, container runtime, cgroup version, SELinux/AppArmor.
- CPU: model, microcode, core count, flags (avx, fma, sse, neon), endianness.
- Timezone, locale, encoding; tzdata version; system clock sync status.
- Filesystem: case sensitivity, mtime precision, mount options (noatime), disk speed class.
- Runtime versions: Python/Node/Java/Go/Ruby; compiler flags; libc/glibc/musl version.
- Dynamic libraries and loaded modules; OpenSSL/crypto lib version.
- Package graph: lockfiles and resolved versions (pip freeze, npm ls, mvn dependency:tree, go mod graph, cargo tree).
- Env vars: set, unset, and values.
- Feature flags and service endpoints; secrets or tokens masked.
Your AI takes two or more runs (pass vs fail), computes a semantic diff, and then ranks features by suspiciousness. Ranking can start with heuristics and evolve toward a learned model from historical flakes.
Heuristics That Work Surprisingly Well
- Locale/timezone differences are highly suspicious for date or string collation tests.
- Different tzdata package versions can change DST rules around edge cases.
- Python version and PYTHONHASHSEED influence dict order and hash stability.
- Node/V8 minor versions can change timer behavior.
- glibc minor version differences can affect regex, locale collation, and fenv rounding.
- CPU flags changing can switch math kernels (e.g., MKL using AVX vs SSE) and alter floating-point rounding.
- Case-insensitive filesystem vs case-sensitive can silently change file path resolution.
Example: Minimal Env Snapshot Script (Linux)
```bash
#!/usr/bin/env bash
set -euo pipefail

OUT=${1:-env-snapshot.json}

jq -n \
  --arg os "$(uname -srvmo)" \
  --arg kernel "$(uname -r)" \
  --arg tz "$(cat /etc/timezone 2>/dev/null || echo ${TZ:-})" \
  --arg locale "$(locale | tr '\n' ';')" \
  --arg cpu "$(lscpu | tr '\n' ';')" \
  --arg python "$(python3 --version 2>&1 || true)" \
  --arg node "$(node --version 2>/dev/null || true)" \
  --arg java "$(java -version 2>&1 | tr '\n' ';' || true)" \
  --arg glibc "$(ldd --version 2>&1 | head -n1 || true)" \
  --arg openssl "$(openssl version 2>/dev/null || true)" \
  --arg docker "$(cat /etc/os-release 2>/dev/null | tr '\n' ';')" \
  --arg env "$(env | sort | tr '\n' ';')" \
  --arg pip "$(python3 -m pip freeze 2>/dev/null | tr '\n' ';')" \
  --arg npm "$(npm ls --depth=0 2>/dev/null | tr '\n' ';')" \
  '{os:$os, kernel:$kernel, tz:$tz, locale:$locale, cpu:$cpu, python:$python, node:$node, java:$java, glibc:$glibc, openssl:$openssl, os_release:$docker, env:$env, pip:$pip, npm:$npm}' \
  > "$OUT"
```
The AI consumes these snapshots and produces a diff like:
- TZ: UTC vs America/Los_Angeles (suspiciousness: high)
- PYTHONHASHSEED: 123 vs 0/undefined (high)
- glibc: 2.31 vs 2.37 (medium)
- tzdata: 2021a vs 2023c (high around DST tests)
- Locale: C vs en_US.UTF-8 (high for collation)
- CPU flags: avx512f present vs absent (medium for FP tests)
It then proposes a remediating repro container pinning these attributes.
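Here is a minimal sketch of that diff-and-rank step over two snapshots produced by the script above; the weights are hand-tuned placeholders for what would eventually be learned from labeled flakes:

```python
# env_diff.py — compare two env-snapshot.json files and rank differences
import json
import sys

# Heuristic suspiciousness weights per snapshot key (illustrative, not learned).
WEIGHTS = {"tz": 0.9, "locale": 0.9, "python": 0.7, "pip": 0.7, "env": 0.6,
           "node": 0.6, "glibc": 0.5, "cpu": 0.4, "kernel": 0.3}

def diff_snapshots(pass_path, fail_path):
    a = json.load(open(pass_path))
    b = json.load(open(fail_path))
    findings = []
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            findings.append((WEIGHTS.get(key, 0.2), key, a.get(key), b.get(key)))
    # Highest-weighted differences first: these are the first knobs to pin in a repro.
    return sorted(findings, reverse=True)

if __name__ == "__main__":
    for weight, key, passing, failing in diff_snapshots(sys.argv[1], sys.argv[2]):
        print(f"[{weight:.1f}] {key}: pass={passing!r} vs fail={failing!r}")
```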
Hermetic Builds Are Great—But You Still Need Diffs
Tools like Bazel, Nix/Guix, Pants, and Cargo/Go modules move you toward hermeticity. Do use them. But flakes still slip in: kernel, CPU flags, tzdata, locale, and runtime toggles aren’t always covered. Environment diffs remain essential.
Mining CI History: Cluster, Correlate, and Prioritize
Your CI contains a goldmine of signals that an AI can mine to isolate and prioritize flaky tests.
What to track per test per build:
- Pass/fail/skip; duration; retries; shard.
- Exception types, top frames; error messages hashed into signatures.
- stdout/stderr fragments; known patterns (ECONNRESET, ETIMEDOUT, flaky DNS).
- Flake rate (fail->pass without code change), failure streaks, and time-of-day effects.
- Code changes: touched files, commit authors, change metadata.
Techniques:
- Failure clustering: embed stack traces and messages, then cluster with locality-sensitive hashing to spot recurring signatures.
- Spectral debugging (Tarantula/Ochiai): rank changed files or functions by co-occurrence with failures.
- Delta debugging across CI variants: same test failing only on Windows runners? Only with Node 20? That signals the root cause class quickly.
- Control charts: alert on step changes in flake rate.
This context helps the AI decide where to start: freeze time, mock IO, or chase environment drift first. It also prevents whack-a-mole by recognizing the “same” flake across services or repositories.
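A minimal Python sketch of two of these pieces—signature hashing and Ochiai scoring—assuming per-build records of failure status and touched files (the data shapes are assumptions):

```python
# ci_mining.py — failure signatures and Ochiai suspiciousness (illustrative)
import hashlib
import math
import re
from collections import defaultdict

def failure_signature(exception_type: str, frames: list[str]) -> str:
    # Normalize volatile tokens (line numbers, addresses) so recurring failures hash together.
    normalized = [re.sub(r"(:\d+|0x[0-9a-f]+)", "", f) for f in frames[:5]]
    return hashlib.sha1("|".join([exception_type] + normalized).encode()).hexdigest()[:12]

def ochiai(builds):
    """builds: iterable of (failed: bool, touched_files: set[str])."""
    fail_with, pass_with, total_fail = defaultdict(int), defaultdict(int), 0
    for failed, files in builds:
        total_fail += failed
        for f in files:
            (fail_with if failed else pass_with)[f] += 1
    scores = {}
    for f in set(fail_with) | set(pass_with):
        ef, ep = fail_with[f], pass_with[f]
        denom = math.sqrt(total_fail * (ef + ep))
        scores[f] = ef / denom if denom else 0.0
    # Higher score = file co-occurs with failures more than with passes.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```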
Turning Flaky Failures into Deterministic Repros
A good debugging AI doesn’t just say “flake.” It emits a single, deterministic repro you can run locally. The repro bundles:
- Exact test identifier and command line.
- Pinned container image (digest, not tag) with installed packages.
- RNG seed(s) and time controls.
- Required env vars and locale/timezone.
- Network recording or mocks; in-memory fakes where feasible.
- Minimal input necessary to trigger the failure.
Example: Repro Recipe Emitted as a Script
```bash
# repro_flake.sh (generated)
set -euo pipefail

IMG="ghcr.io/acme/ci-py@sha256:9f0b..."   # exact digest
SEED=17000042
TZ=UTC
PYTHONHASHSEED=$SEED

# Run in a container to match CI
exec docker run --rm -t \
  -e TZ=$TZ -e PYTHONHASHSEED=$PYTHONHASHSEED \
  -e PYTEST_SEED=$SEED \
  -v "$PWD":/work -w /work \
  "$IMG" \
  bash -lc '
    pip install -r requirements.txt && \
    pytest -k "test_billing_rounding" \
      --maxfail=1 -q --seed '$SEED' \
      -m "not flaky_quarantine" \
      --timeout=60
  '
```
The script is your single source of truth for reproducing the failure. Inside the container, the installed pytest plugin seeds all RNGs, a time shim freezes the epoch, and VCR/WireMock replay network calls.
Concurrency: From Random Races to Deterministic Schedules
Concurrency flakes are harder, but not beyond reach:
- Introduce a deterministic or controllable task scheduler for tests, or run with thread sanitizers and record schedules.
- For JVM, use ConTest-like scheduling perturbations; for Go, GOMAXPROCS=1 and -race may change timings but also expose data races.
- Systematically perturb yields or sleep to trigger races repeatedly; once triggered, record the interleaving to replay.
When the AI detects race-like signatures (data race reports, inconsistent ordering in logs, use of unsynchronized structures), it switches tactics: run under schedule control, then replay that schedule in the repro.
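A minimal Python sketch of seeded schedule perturbation; it biases interleavings rather than fully controlling them, and the `chaos_point` hook and `SCHED_SEED` variable are illustrative names:

```python
# chaos_yield.py — seeded scheduling perturbation for tests (illustrative)
import os
import random
import time

# SCHED_SEED is a hypothetical env var for this sketch.
_rng = random.Random(int(os.environ.get("SCHED_SEED", "0")))

def chaos_point(max_delay_ms: float = 2.0):
    """Call at lock acquisitions, queue hand-offs, and other sync points.
    With a fixed SCHED_SEED, the same sequence of delays is injected on
    every run, so a race triggered under a seed keeps triggering under it."""
    delay = _rng.random() * max_delay_ms / 1000.0
    if delay > 0:
        time.sleep(delay)
```

Sweep SCHED_SEED in CI until a race fires, then pin the failing seed in the repro script.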
Order Dependence and Global State Leakage
Tests that pass individually but fail in a suite typically leak or depend on global state: environment vars, singletons, temp files, caches.
Detection and mitigation:
- Randomize test execution order; if failure correlates with certain predecessors, you’ve got an order dependency.
- Snapshot and diff global state before/after tests (env vars, files in temp dirs, singletons toString values) to identify leakers.
- Run the suspect test in a fresh process to isolate.
Your AI orchestrator can run a bisection over preceding tests (ddmin) to find the minimal prefix that causes failure, then emit a fix suggestion: reset global X in teardown, use tempdir fixture, or avoid singleton.
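A minimal pytest sketch of the snapshot-and-diff idea for environment variables; the same pattern extends to temp files and module-level singletons:

```python
# conftest.py (additional fixture) — detect env-var leakage between tests
import os
import pytest

@pytest.fixture(autouse=True)
def detect_env_leaks(request):
    before = dict(os.environ)
    yield
    after = dict(os.environ)
    added = set(after) - set(before)
    changed = {k for k in before if k in after and before[k] != after[k]}
    if added or changed:
        # Name the leaker so order-dependent failures become attributable.
        print(f"[leak] {request.node.nodeid} added={sorted(added)} changed={sorted(changed)}")
```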
Statistical Guardrails: When Is a Flake “Real”?
Not every intermittent failure is random; some are rare but deterministic edge cases that need coverage. The AI should use simple, transparent statistics to classify:
- If failures correlate strongly with certain seeds, timezones, or env diffs, treat as determinizable flake.
- If failure probability remains high under fixed seed/time and stable env, classify as bug.
- Use Fisher’s exact test or a Bayesian model to decide if observed passes/fails under conditions are unlikely by chance.
The goal isn’t p-value purity; it’s to quickly pick the right path: determinize or debug the code.
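Here is a minimal sketch of the Fisher's exact test check using SciPy, assuming you have pass/fail counts under two conditions (say, two timezones):

```python
# flake_stats.py — is failure rate associated with a condition? (illustrative)
from scipy.stats import fisher_exact

def condition_is_suspicious(fail_a, pass_a, fail_b, pass_b, alpha=0.01):
    """2x2 table: rows are conditions A/B, columns are fail/pass counts."""
    _, p_value = fisher_exact([[fail_a, pass_a], [fail_b, pass_b]])
    return p_value < alpha, p_value

# Example: 9 failures in 50 runs under condition A, 0 in 50 runs under condition B.
suspicious, p = condition_is_suspicious(9, 41, 0, 50)
print(f"suspicious={suspicious} p={p:.4f}")
```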
Automated Minimization: Shrinking the Failing Input
Once the failure is reproducible, the next win is to minimize: smaller repros are faster to fix.
- ddmin (delta debugging, Zeller) on seeds: cut the RNG sequence into chunks, test subsets, and keep the minimal failing subsequence.
- ddmin on input fixtures: remove fields/rows and see if failure persists.
- Slice traces: drop calls that don’t affect the outcome using dynamic slicing.
The AI integrates ddmin loops with the deterministic harness so you get a tiny, crisp failing case—even for property-based tests.
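Below is a minimal sketch of a simplified ddmin loop (complement reduction only) over a recorded sequence of RNG draws; the `fails()` oracle, which replays the test with only the given draws forced, is the piece your harness supplies:

```python
# ddmin.py — delta debugging over a recorded random sequence (illustrative)
def ddmin(sequence, fails, granularity=2):
    """Return a (locally) minimal failing subsequence. `fails(subseq)` replays
    the test with the given draws forced and returns True if it still fails."""
    while len(sequence) >= 2:
        chunk = max(1, len(sequence) // granularity)
        subsets = [sequence[i:i + chunk] for i in range(0, len(sequence), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Try dropping one chunk at a time and keep the complement if it still fails.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if complement and fails(complement):
                sequence, granularity = complement, max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(sequence):
                break
            granularity = min(granularity * 2, len(sequence))
    return sequence
```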
Implementation Blueprint: Building This in Your Stack
You don’t need a monolithic product. You can implement a pragmatic subset that yields big wins.
- Start with RNG sealing
  - Add a test harness plugin for each major language to set and log seeds.
  - Train engineers to print seeds in failure messages.
  - Encourage property-based testing with seed visibility.
- Add time and IO control
  - Adopt fake timers and freeze time in your test frameworks where feasible.
  - Add record/replay for network interactions touching external services.
- Capture environment snapshots in CI
  - Dump env, runtime versions, and package graphs on failures.
  - Store snapshots alongside artifacts and link from test reports.
- Mine CI history
  - Hash failure signatures and compute flake rates per test.
  - Alert when flake rate crosses thresholds; open automated tickets with repro scripts.
- Generate repro recipes
  - Use container digests and a deterministic script. Keep it copy-paste simple.
- Iterate with learned suspiciousness
  - Label resolved flakes with root cause categories. Use that to weight diffs and guide perturbations.
Opinionated Guidance: Practices That Pay Off
- Stop quarantining as a first response. Quarantine is a last resort with an expiry date.
- Make every test seed explicit. Even for tests that don’t use randomness—be consistent.
- Freeze time by default in unit tests. Opt into real time only where needed.
- Disallow direct network in unit tests. Use record/replay or fakes.
- Make locale and timezone explicit in CI. TZ=UTC and LC_ALL=C.UTF-8 are good defaults.
- Prefer monotonic clocks for durations. Only use wall clock for user-facing timestamps.
- Log and surface seeds and env snapshot links in every failure report.
What the Data Says
- Google reported that a small percentage of test runs are flaky, but a large fraction of tests exhibit flakiness at least once; the hidden cost is substantial in lost trust and engineer time. See "Flaky Tests at Google and How We Mitigate Them" on the Google Testing Blog.
- Academic studies categorize common flake causes: asynchrony, time, IO, randomness, and order dependence dominate. See works by Luo et al. (A Study of Flaky Tests), Micco (Google), and recent ICSME/FSE papers on flaky test taxonomy and mitigation.
- Spectrum-based fault localization (e.g., Ochiai) and delta debugging (Zeller) are effective ingredients for root cause isolation once reproducibility is achieved.
The punchline: determinism beats sampling. Instrument once, reap benefits across the entire pipeline.
Frequently Overlooked Edge Cases
- Python dict iteration order is insertion-ordered in CPython 3.7+, but set iteration order is not guaranteed; PYTHONHASHSEED randomizes string hashing for security and can change set and hash-dependent ordering across processes, so always set it for consistency.
- Go map iteration order is intentionally random; never rely on it in tests—sort keys before comparing.
- Java HashMap iteration order is unspecified; prefer LinkedHashMap for determinism in tests.
- Floating-point: enabling AVX/FMA can slightly alter rounding; if tests assert exact decimals, use decimal types or ulp-based comparisons and pin BLAS/MKL settings in CI (see the sketch after this list).
- Filesystem timestamps: different mtime precision can invalidate caches; store normalized timestamps in tests or use monotonic counters.
- DST transitions: run date tests at fixed UTC zones, and include DST edge cases explicitly.
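For the floating-point item above, a minimal sketch of tolerance-based and decimal-based assertions (the tests themselves are hypothetical):

```python
# float_asserts.py — avoid exact-equality asserts on floats (illustrative)
import math
from decimal import Decimal

def assert_close(actual: float, expected: float, ulps: int = 4):
    # Allow a few units in the last place; FMA/AVX kernels may differ by ~1 ulp.
    assert abs(actual - expected) <= ulps * math.ulp(expected), (actual, expected)

def test_invoice_total_is_exact_in_decimal():
    # Money math: use Decimal so the answer does not depend on binary rounding.
    assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

def test_accumulated_sum_is_close():
    assert_close(sum([0.1] * 10), 1.0)
```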
From Repro to Fix: How AI Closes the Loop
Once the AI has your deterministic repro, it can:
- Suggest minimal code changes: replace time.Now() with injected Clock; sort before asserting equality; add mutex; seed RNG; use retry/backoff with bounded jitter and fakes in tests.
- Generate a failing test that codifies the root cause so you never regress.
- Open a PR with the fix guarded by the deterministic test, pass CI, and reference the repro script.
Engineers remain in control; the AI accelerates the tedious parts: data collection, perturbation orchestration, and minimization.
ROI: What You Can Expect
Teams that adopt deterministic harnessing consistently report:
- Significant drop in flake-related reruns and quarantines.
- Faster triage: fail-to-repro time shrinks from hours to minutes.
- Increased trust in CI, allowing stricter gates without blocking the team.
Even a small pilot—seeding RNGs and freezing time—can cut the top 30% of flakes in weeks. Adding environment diffs and CI mining compounds the effect.
Checklist: Your First 30 Days
- Weeks 1–2
  - Implement seed capture in all test harnesses; print seeds on failure.
  - Freeze time in unit tests; adopt fake timers/framework support.
  - Disable real network in unit tests; add record/replay where needed.
- Weeks 3–4
  - Capture environment snapshots on failures; store with artifacts.
  - Add CI metrics for flake rate and failure clustering.
  - Generate reproducible repro scripts for top flaky tests.
- Ongoing
  - Train engineers to file tickets with seed + repro script.
  - Review diffs for environment drift periodically; pin where appropriate.
  - Gradually refactor code toward injected clocks and deterministic IO boundaries.
Conclusion: Your Tests Aren’t Your Enemy—Nondeterminism Is
Flaky tests are a symptom of uncontrolled entropy. You can’t eliminate every source, but you can control them enough to make flakes deterministic, reproducible, and fixable. A code-debugging AI amplifies your ability to do this at scale by fingerprinting randomness, freezing time, taming IO, diffing environments, and mining CI history.
Don’t accept flaky tests as a cost of doing business. The pipeline you save first is your own.
References and Further Reading
- Google Testing Blog: Flaky Tests at Google and How We Mitigate Them (John Micco and colleagues). https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
- A Study of Flaky Tests (Luo et al., 2014). ACM ESEC/FSE. https://dl.acm.org/doi/10.1145/2635868.2635920
- Understanding and Detecting Flaky Tests: A Taxonomy (various works at ICSME/FSE; see also: ICSME 2019 studies).
- Delta Debugging (Zeller). https://www.st.cs.uni-saarland.de/dd/
- Spectrum-Based Fault Localization (Tarantula, Ochiai). Overview: Wong et al., ACM Computing Surveys.
- Hypothesis (Python) seeds and reproducibility. https://hypothesis.readthedocs.io/
- WireMock (JVM) and VCR (Python/Ruby) for HTTP record/replay. http://wiremock.org/ and https://github.com/kevin1024/vcrpy
- Freezegun (Python) and Sinon/Jest fake timers (JS). https://github.com/spulec/freezegun and https://sinonjs.org/
- ReproZip for reproducible experiments (useful inspiration for environment capture). https://www.reprozip.org/
- Nix/NixOS for hermetic environments. https://nixos.org/
If you have specific flakes haunting your repo, start by seeding RNGs and freezing time. Your future debugging self will thank you.
