Code Debugging AI vs Race Conditions: How to Feed Happens‑Before Evidence, Not Hunches
Modern code debugging assistants are impressive at refactoring and finding obvious mistakes. But when the bug is a race, deadlock, or timing dependent heisenbug, a large language model that only sees source code and log snippets has almost no chance. Concurrency is about causality, not only code. You need to feed the AI evidence of what actually happened: a structured witness of the execution order, the lock and memory events, and the schedule that made the bug surface.
This article is a blueprint for doing that in practice. It shows how to capture and format the critical concurrency evidence — lock graphs, happens‑before relations via vector clocks or epochs, and reproducible traces — then wire it into your tests, CI, and IDE so your debugging AI can pinpoint races and deadlocks with causal proof. We will be opinionated: do not ask an AI to guess a scheduler interleaving from a stack trace. Give it the interleaving.
What follows is practical and grounded in decades of concurrency research: Lamport happens‑before, lockset analysis (Eraser), dynamic race detection via FastTrack, CHESS and DPOR for systematic schedules, record‑replay via rr, runtime detectors like ThreadSanitizer, and lock dependency validators like lockdep. We will show the minimal instrumentation you need, the JSON schemas that make the data usable, and how to integrate this into a developer workflow.
Executive summary
- LLMs cannot reliably infer concurrency causality from code alone. They need a causal trace: who‑happened‑before‑whom.
- Capture three things in your tests and CI:
- lock order graphs and wait‑for edges,
- memory events with vector clocks or epochs, and
- a reproducible schedule trace.
- Route that evidence to your code debugging AI in a compact schema. Ask it to explain the bug using causal proof, not intuition.
- Use existing tools where possible: ThreadSanitizer, Go race detector, Helgrind, lockdep, rr, Java Flight Recorder, JVMTI, Loom test harnesses. Fill gaps with small shims.
- Bake it into developer ergonomics: test annotations that auto‑collect traces, CI that fails on cycles or races with linked replay artifacts, and an IDE panel visualizing lock graphs and HB partial orders.
Why LLMs miss concurrency bugs
An LLM excels at static, local reasoning. Concurrency bugs are inherently global and temporal. Consider a classic data race: two threads write the same location without synchronization. Whether this race is harmful depends on actual interleavings and memory model semantics. Without an execution trace, an LLM is guessing.
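For concreteness, that classic race can be as small as this; a minimal Go sketch (names arbitrary) that go run -race flags at runtime, while nothing in the source text alone says which schedule actually occurred:

```go
package main

import "sync"

var hits int // shared counter, no synchronization

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			hits++ // read-modify-write from two goroutines: a data race
		}()
	}
	wg.Wait()
}
```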
Similarly for deadlocks: a cycle in lock acquisition order does not exist inside any single function; it emerges at runtime across threads. Heisenbugs are worst of all: they appear only under specific schedules, CPU counts, or instruction reorderings. If you show an LLM only the code, it may propose plausible fixes but will not know which interleaving actually occurred, what locks held where, or which happens‑before edges were missing.
Concurrency demands proof:
- A happens‑before partial order that demonstrates two events are concurrent (not ordered), so a data race is real.
- A lock dependency cycle that explains a deadlock precisely.
- A reproducible interleaving or schedule that any developer can step through to verify the diagnosis.
You will not get that from logs and hunches. You get it from instrumentation and runtime analysis.
The core evidence: lock graphs, happens‑before, reproducible traces
There are three complementary lenses on concurrency that, together, cover most real bugs.
- Lock graphs and wait‑for edges
- Track lock acquisition and release by thread or task.
- Build directed edges from held lock to acquired lock to capture ordering constraints.
- For deadlocks, detect cycles in the wait‑for graph in real time or at trace analysis time.
- For lock ordering hygiene, detect inconsistent orders across code paths.
- Happens‑before via vector clocks or epochs
- The happens‑before relation is the foundation: A happens‑before B if they are ordered by program order, synchronization, or transitive closure. Two events not ordered by HB are concurrent.
- Vector clocks (Fidge and Mattern) attach a logical timestamp vector to each thread; events update these vectors on synchronization to reflect causal dependencies.
- FastTrack (Flanagan and Freund) optimizes HB race detection by using epochs: most accesses are tracked with a single thread id and clock value, escalating to full vector clocks only when reads from multiple threads are concurrent.
- With HB or epoch metadata, you can provide causal proof of a race: both accesses are concurrent and at least one is a write. That is not a guess; it is a certificate of unsafety.
- Reproducible schedule traces
- Recording the interleaving that triggered the bug enables replay and exact diagnosis.
- Approaches:
- record and replay (rr for user space on Linux),
- deterministic scheduling for tests (like CHESS or Loom style controlled schedulers),
- dynamic partial order reduction to enumerate minimal schedules,
- capturing scheduler decisions and synchronization outcomes in a compact log.
- If you can reproduce the bug in CI and locally, you can verify the fix with confidence.
A practical evidence pipeline
Your debugging AI needs structured input, not a pile of raw logs. The pipeline is:
- Instrument concurrency primitives and memory accesses (prefer using existing runtimes when available).
- Emit structured events with thread id, operation, address or lock id, and a timestamp (logical, not wall clock).
- Maintain HB state (vector clocks or epochs) in the runtime during test execution.
- Detect race or deadlock conditions online, but also persist traces for offline analysis.
- Summarize findings into compact artifacts: lock graph cycles, minimal counterexample trace, and memory access pairs with HB justification.
- Feed artifacts to the LLM with the exact code spans and symbols mapped to source lines.
- Offer a replay or deterministic schedule to developers and CI.
Opinion: if you must choose only one piece to start, implement HB‑aware event capture and minimal counterexample traces. They pay off immediately.
Background, briefly: HB, lockset, and how detectors work
Three classic approaches, often combined in practice:
- Lockset (Eraser, Savage et al., 1997). Track the set of locks held during each access to a memory location. If the intersection across accesses becomes empty, warn of a potential race. Pros: simple. Cons: false positives for benign patterns, does not account for other synchronization mechanisms.
- Happens‑before. Use synchronization events (locks, unlocks, condition variables, atomics) to define a partial order of events. Two memory accesses are racing if they are not ordered by HB and at least one is a write. Pros: precise with fewer false positives. Cons: needs more metadata, may miss true races if synchronization is not instrumented.
- FastTrack (PLDI 2009). A near‑O(1) HB detector using epochs that tracks per variable the last write and a summary of reads. It escalates to vector clocks only when needed, drastically improving performance over naive vector clocks.
Real world: ThreadSanitizer uses an HB approach with vector clocks and other optimizations; the Go race detector is built on the TSan runtime; Helgrind (Valgrind) combines lock tracking with happens‑before analysis; Java offers JFR and agents; lockdep in Linux checks lock ordering rules and can flag cycles.
References:
- Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 1978.
- Flanagan and Freund. FastTrack: Efficient and precise dynamic race detection. PLDI 2009.
- Flanagan and Godefroid. Dynamic partial-order reduction for model checking software. POPL 2005.
- Musuvathi and Qadeer. Iterative context bounding for systematic testing of multithreaded programs. PLDI 2007 (CHESS).
- O'Callahan et al. rr: Lightweight recording and deterministic debugging. Mozilla Research.
Implementing capture: minimal, language by language
Start with what your language runtime already offers, then add shims.
C and C++
- Use ThreadSanitizer during tests to detect races and emit detailed reports. It captures stacks and operations; in CI, archive the sanitizer report as an artifact.
- For reproducible traces, use rr to record failing tests and enable exact replay with gdb. In CI, capture rr traces only on failure to control cost.
- If you maintain custom lock types, instrument them: assign each lock a unique id, log acquire and release with thread id and logical clock.
Minimal lock logging shim in C:
```c
// compile with -pthread, and consider wrapping pthread_mutex with these
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static __thread uint64_t logical_clock = 0; // per thread logical time

void log_event(const char* op, void* lock_addr) {
    logical_clock++;
    fprintf(stderr, "{op:'%s',tid:%lu,lock:'%p',t:%lu}\n",
            op, (unsigned long)pthread_self(), lock_addr,
            (unsigned long)logical_clock);
}

int logged_mutex_lock(pthread_mutex_t* m) {
    log_event("lock", m);
    return pthread_mutex_lock(m);
}

int logged_mutex_unlock(pthread_mutex_t* m) {
    log_event("unlock", m);
    return pthread_mutex_unlock(m);
}
```
Note this uses a simple per thread logical clock, not a vector clock. For HB analysis you need to propagate clock values on unlock and lock. If you can, prefer a tested tool like TSan which already embeds a HB detector.
Go
- The built in race detector is excellent for tests: go test -race. Archive the reports.
- For reproducible schedules, the testing runtime offers some hooks and you can introduce a scheduler shim for select tests by replacing sync primitives with deterministic versions in a build tag.
- For trace capture, Go has the runtime tracer (go test -trace) which records goroutines, network, syscalls, and scheduler events. You can post process this into HB insights.
Example: annotate a flaky test to run under tracing and with race detection.
```bash
GOFLAGS='-race' go test -run TestFlaky -count=100 -trace trace.out
```
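If you maintain Go lock wrappers of your own, the same idea as the C shim fits behind a build tag: swap in a traced mutex only for instrumented test builds. A minimal sketch, assuming JSON lines on stderr and a process wide logical counter are acceptable; the tracedsync package, event fields, and explicit site argument are illustrative, not a standard API.

```go
//go:build locktrace

package tracedsync

import (
	"encoding/json"
	"os"
	"sync"
	"sync/atomic"
)

var seq atomic.Uint64 // process-wide logical time; coarse, but enough to order events

type event struct {
	T    uint64 `json:"t"`
	Op   string `json:"op"`
	Lock string `json:"lock"`
	Site string `json:"site"`
}

// Mutex wraps sync.Mutex and emits lock/unlock events as JSON lines on stderr.
type Mutex struct {
	mu sync.Mutex
	ID string // stable lock id, e.g. "cache.flushMu"
}

func (m *Mutex) log(op, site string) {
	_ = json.NewEncoder(os.Stderr).Encode(event{T: seq.Add(1), Op: op, Lock: m.ID, Site: site})
}

func (m *Mutex) Lock(site string) {
	m.log("lock_attempt", site)
	m.mu.Lock()
	m.log("lock_acquired", site)
}

func (m *Mutex) Unlock(site string) {
	m.log("lock_release", site)
	m.mu.Unlock()
}
```

Go does not expose goroutine ids, so correlate events by lock id and site, or thread a task id through your own context.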
Java and JVM languages
- Java Flight Recorder (JFR) can capture thread states, lock contention, and events with low overhead. Persist the JFR file on failures.
- JVMTI agents can instrument monitor enter and exit and map threads and locks to identities.
- Tools like IBM ConTest, and CHESS‑inspired harnesses for the JVM, systematically explore schedules. For unit tests, you can use a deterministic executor for futures.
Minimal synchronized block tracing in Java using a wrapper:
```java
class TracedLock {
    private final Object m = new Object();
    private final String id;

    TracedLock(String id) { this.id = id; }

    void with(Runnable r) {
        Trace.log("lock_attempt", id, Thread.currentThread().getName());
        synchronized (m) {
            Trace.log("lock_acquired", id, Thread.currentThread().getName());
            try {
                r.run();
            } finally {
                Trace.log("lock_release", id, Thread.currentThread().getName());
            }
        }
    }
}
```
Rust
- Use loom to run concurrent code under a deterministic scheduler in tests. Loom explores different interleavings and can surface races and deadlocks in a reproducible way.
- For dynamic detection, ThreadSanitizer can be enabled on nightly toolchains via the compiler's sanitizer flags.
- Use tracing crates to emit structured events; attach causal ids via task contexts.
Example loom test skeleton:
```rust
use loom::sync::{Arc, Mutex};
use loom::thread;

#[test]
fn loom_finds_interleaving() {
    loom::model(|| {
        let v = Arc::new(Mutex::new(0));

        let v1 = v.clone();
        let t1 = thread::spawn(move || {
            let mut g = v1.lock().unwrap();
            *g += 1;
        });

        let v2 = v.clone();
        let t2 = thread::spawn(move || {
            let mut g = v2.lock().unwrap();
            *g += 1;
        });

        t1.join().unwrap();
        t2.join().unwrap();
        assert_eq!(*v.lock().unwrap(), 2);
    });
}
```
Loom will explore the interleavings; for more complex code, add yield points to widen the search.
How to represent the evidence so an AI can use it
Text blobs are not ideal. You want compact, structured artifacts that encode causality and map back to source. A simple schema approach:
- Event stream for concurrency primitives and memory:
```json
{
  "events": [
    {"t": 1, "tid": 17, "op": "spawn", "child": 19},
    {"t": 2, "tid": 17, "op": "lock", "lock": "L1", "site": "foo.cc:42"},
    {"t": 3, "tid": 19, "op": "load", "addr": "0x7f..", "site": "bar.cc:88", "epoch": "19:1"},
    {"t": 4, "tid": 17, "op": "store", "addr": "0x7f..", "site": "foo.cc:45", "epoch": "17:3"},
    {"t": 5, "tid": 19, "op": "lock", "lock": "L2", "site": "bar.cc:90"},
    {"t": 6, "tid": 17, "op": "lock", "lock": "L2", "site": "foo.cc:46"},
    {"t": 7, "tid": 19, "op": "lock_wait", "lock": "L1"},
    {"t": 8, "tid": 19, "op": "deadlock_cycle", "cycle": ["L1", "L2", "L1"], "threads": [17, 19]}
  ]
}
```
- HB metadata per event: epoch or vector clock before and after. Example epoch format tid:clock.
- Memory access log includes variable id and last observed writer epoch.
- Lock dependency graph: nodes are lock ids; edges are observed acquire while holding relationships.
- Minimal counterexample trace: a truncated sequence of events just long enough to trigger the bug, with thread switches called out.
- Source mapping: each event has file, line, and symbol. Include short code snippets around the site to help the AI ground explanations.
- Optional: environment digest (CPU count, seed, OS) to correlate flakiness with runtime.
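If your collection tooling is written in Go, the schema maps directly onto a pair of structs shared by the capture side and the prompt builder. A sketch under the assumptions above (epochs as tid:clock strings, sites as file:line); the field and type names are illustrative rather than a standard format.

```go
package evidence

// Event is one entry in the concurrency event stream sketched above.
type Event struct {
	T     uint64   `json:"t"`               // logical time within the trace
	TID   int      `json:"tid"`             // thread or goroutine id
	Op    string   `json:"op"`              // spawn, lock, unlock, load, store, lock_wait, ...
	Lock  string   `json:"lock,omitempty"`  // lock id, for lock operations
	Addr  string   `json:"addr,omitempty"`  // address or variable id, for memory operations
	Site  string   `json:"site,omitempty"`  // file:line of the operation
	Epoch string   `json:"epoch,omitempty"` // "tid:clock" at the time of the access
	Cycle []string `json:"cycle,omitempty"` // lock ids, for deadlock_cycle events
}

// Artifact bundles what the debugging assistant receives for one failure.
type Artifact struct {
	Events    []Event             `json:"events"`
	LockGraph map[string][]string `json:"lock_graph"` // lock id -> locks acquired while it is held
	Snippets  map[string]string   `json:"snippets"`   // site -> short source excerpt
	Env       map[string]string   `json:"env"`        // cpus, os, seed, ...
}
```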
Once you have this, the AI prompt can be directive: explain the deadlock by identifying the cycle and proposing a lock ordering policy or lock splitting. Explain the data race by naming the two accesses, showing HB non ordering, and suggesting synchronization. The key is that the AI is not guessing; it is interpreting an already computed causal structure.
Building HB state in a runtime: vector clocks and epochs
You do not need a full vector clock implementation to get mileage; a FastTrack style epoch detector is efficient and precise enough for many languages. The idea, in simplified form:
- Each thread maintains a clock C[tid]. On each thread event, increment C[tid].
- Each lock L stores a vector clock L.vc that represents the HB time at unlock. On lock acquire, merge L.vc into the thread clock.
- Each variable x stores:
- W, the write epoch (tid, clock) of the last write.
- R, either a read epoch or a read vector clock if there are multiple readers.
- On read of x by thread t with clock Ct:
- If W is not ordered before Ct (i.e., W.clock is not less than or equal to Ct[W.tid]), then the read races with the last write.
- Update R to record t as a reader at Ct[tid]. If R holds just one reader, store it as an epoch; otherwise, escalate to a vector.
- On write of x by thread t with clock Ct:
- If R has any reader not ordered before Ct, race.
- If W is not ordered before Ct, race.
- Update W to (tid, Ct[tid]) and reset R to empty.
Pseudo code for key steps:
```
// per thread event
C[tid] += 1

// lock acquire
C = join(C, L.vc)

// lock release
L.vc = C

// read(x) by t
if not hb(W, C): report_race(x, W, read_event)
R = add_reader(R, t, C)

// write(x) by t
if not hb(W, C): report_race(x, W, write_event)
if exists r in R where not hb(r, C): report_race(x, r, write_event)
W = epoch(t, C[t])
R = empty
```
The hb relation can be checked by comparing epochs or by checking dominance in vector clocks. In practice, use a well tested implementation; writing a correct, fast HB detector is subtle. But even a lightweight subset that only covers locks and condition variables can justify many diagnoses.
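For orientation, the join and comparison operations above are small enough to write out. A minimal vector clock sketch in Go, assuming integer thread ids; a production detector would use FastTrack style epochs and a tuned representation rather than a map per variable.

```go
package hb

// VC is a vector clock: VC[tid] is the last known clock of thread tid.
type VC map[int]uint64

// Tick advances the local component for thread tid (one program-order step).
func (c VC) Tick(tid int) { c[tid]++ }

// Join merges another clock into c (used on lock acquire: C = join(C, L.vc)).
func (c VC) Join(o VC) {
	for tid, t := range o {
		if t > c[tid] {
			c[tid] = t
		}
	}
}

// Copy snapshots c (used on lock release: L.vc = C).
func (c VC) Copy() VC {
	out := make(VC, len(c))
	for tid, t := range c {
		out[tid] = t
	}
	return out
}

// HappenedBefore reports whether an event recorded as epoch (tid, clock) is
// ordered before an observer with vector clock c, i.e. clock <= c[tid].
func HappenedBefore(tid int, clock uint64, c VC) bool {
	return clock <= c[tid]
}
```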
Lock graphs and deadlock detection
A lock graph captures ordering constraints: whenever a thread acquires B while holding A, record an edge A -> B. Deadlocks are cycles in the wait‑for graph, which can be approximated by analyzing the lock graph for cycles.
Online detection options:
- At each lock attempt, detect if acquiring would complete a cycle given current ownership states. If so, emit a deadlock warning with the cycle.
- Track wait edges: T waits for L, and L is held by U, hence T waits for U. Detect cycles in wait relationships.
Offline analysis options (a cycle-finding sketch follows this list):
- Build the lock graph from the trace and run cycle detection to produce minimal cycles and the threads involved.
- Correlate each edge with code sites; propose a global lock ordering or annotate code to follow a canonical order.
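The offline cycle check itself is little more than a depth first search over the edge set. A minimal sketch in Go, assuming the lock graph is a map from lock id to the locks observed being acquired while it is held; real tooling would also keep acquisition sites on each edge so a cycle maps back to source.

```go
package lockgraph

// FindCycle returns one lock-order cycle in graph (lock -> locks acquired
// while holding it), or nil if the graph is acyclic.
func FindCycle(graph map[string][]string) []string {
	const (
		unvisited = 0
		onStack   = 1
		done      = 2
	)
	state := map[string]int{}
	var stack []string
	var cycle []string

	var visit func(lock string) bool
	visit = func(lock string) bool {
		state[lock] = onStack
		stack = append(stack, lock)
		for _, next := range graph[lock] {
			switch state[next] {
			case onStack: // back edge: slice the stack from next onward and close the loop
				for i, l := range stack {
					if l == next {
						cycle = append(append([]string{}, stack[i:]...), next)
						return true
					}
				}
			case unvisited:
				if visit(next) {
					return true
				}
			}
		}
		stack = stack[:len(stack)-1]
		state[lock] = done
		return false
	}

	for lock := range graph {
		if state[lock] == unvisited && visit(lock) {
			return cycle
		}
	}
	return nil
}
```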
Showing the AI the cycle is crucial. Example artifact for a deadlock in two threads:
```json
{
  "deadlock": {
    "threads": [17, 19],
    "cycle": [
      {"tid": 17, "holds": "L1", "acquires": "L2", "site": "foo.cc:46"},
      {"tid": 19, "holds": "L2", "acquires": "L1", "site": "bar.cc:91"}
    ]
  }
}
```
This leaves little room for speculation. The AI can propose either enforcing an order L1 then L2 everywhere, or using trylocks with backoff, or restructuring the critical sections.
Reproducible traces: rr, deterministic schedulers, and DPOR
You will not fix flakiness by wishful thinking. You need to capture the interleaving that triggered the bug and make it replayable.
- rr (Linux user space). It records a process execution including all nondeterministic inputs and many scheduling decisions; replay produces the same behavior. Integrate rr into CI by recording failing test runs and attaching the trace. Developers can step with gdb in a faithful replay.
- Deterministic schedulers for tests.
- CHESS style: instrument synchronization and expose a scheduler that can control or search interleavings. Tests run repeatedly under the scheduler.
- Loom (Rust) and JCStress (JVM) provide mechanisms for exploring interleavings in a bounded but systematic way.
- Dynamic Partial Order Reduction (DPOR). Avoids exploring equivalent interleavings that differ only by independent operations. A DPOR harness can drastically reduce the number of schedules while still hitting bugs.
Minimal pattern for a deterministic test harness (the decision-recording piece is sketched after this list):
- Replace mutexes and condvars with versions that yield control to a central scheduler on each operation.
- The scheduler maintains a queue of runnable threads and decides which to run next, recording the choices.
- On failure, persist the list of scheduler choices. To reproduce, feed the same choice sequence back.
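The decision-recording piece is small; the real work is making your mutex, condvar, and channel shims call into it on every operation. A minimal sketch in Go, assuming the shims ask the scheduler which runnable task goes next; the package and field names are illustrative.

```go
package detsched

import "math/rand"

// Choice records which runnable task ran at a given step.
type Choice struct {
	Step int `json:"step"`
	Run  int `json:"run"`
}

// Scheduler picks the next runnable task, either randomly from a seed
// (record mode) or by following a previously recorded schedule (replay mode).
type Scheduler struct {
	rng    *rand.Rand
	replay []Choice
	step   int
	Log    []Choice // persisted on failure, fed back in as replay
}

func NewRecording(seed int64) *Scheduler  { return &Scheduler{rng: rand.New(rand.NewSource(seed))} }
func NewReplaying(log []Choice) *Scheduler { return &Scheduler{replay: log} }

// Next returns the index of the runnable task to run at this step and
// appends the decision to the log so the run can be reproduced later.
func (s *Scheduler) Next(runnable int) int {
	var pick int
	if s.replay != nil {
		pick = s.replay[s.step].Run // follow the recorded decision exactly
	} else {
		pick = s.rng.Intn(runnable)
	}
	s.Log = append(s.Log, Choice{Step: s.step + 1, Run: pick})
	s.step++
	return pick
}
```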
An example of a replayable decision log format:
```json
{
  "schedule": [
    {"step": 1, "run": 0},
    {"step": 2, "run": 2},
    {"step": 3, "run": 1}
  ],
  "seed": 12345,
  "env": {"cpus": 8, "os": "linux"}
}
```
This kind of artifact is perfect for an AI to explain a heisenbug: at step 2, thread 2 ran and observed a stale value because the write by thread 1 had not yet been ordered by a release operation.
Integrating into tests, CI, and IDEs
Ergonomics matter. Developers should not need to become concurrency scientists to benefit from this.
Tests
- Adopt a testing profile that runs with sanitizers and tracing enabled. For example, a nightly job that runs with TSan or Go race, and a small fraction of tests under deterministic schedulers or DPOR.
- Annotate flaky tests with a macro or attribute that automatically enables instrumentation and captures schedule traces on failure.
- Emit artifacts to a known location with consistent schemas.
CI
- Store sanitizer and trace artifacts for 30 days; link them from the CI failure UI.
- Automatically summarize detected issues: data races, deadlock cycles, and attach minimal counterexample traces.
- Provide a one click reproduction command, for example: rr replay artifacts/test123.rr or cargo test --replay schedule.json.
- Gate merges on any detected HB verified data race or lock cycle in changed code unless explicitly waived.
IDE
- Add a panel that visualizes the lock graph around the failure. Let devs click nodes to jump to acquisition sites.
- Show the minimal counterexample trace with thread switches highlighted. Play it like a timeline.
- Provide an AI explanation view that cites HB facts and points to exact lines.
- Offer quick fixes: insert missing synchronization, reorder locks, or split locks.
Example: diagnosing a real race with HB proof
Suppose a test under TSan flags a race on an integer counter increment. The event data includes two accesses:
- Write by thread 17 at foo.cc:45 inside a critical section on L1.
- Read by thread 19 at bar.cc:88 outside any lock, used to compute a threshold.
The HB metadata shows no HB path from the write to the read; the locks on the write side do not synchronize with the read. The AI can present this as:
- Two accesses to the same address, at foo.cc:45 and bar.cc:88.
- The write is not ordered before the read by happens‑before; hence they are concurrent under the memory model.
- Since at least one is a write, this is a data race.
- Proposed fixes:
- Protect the read at bar.cc:88 with the same lock L1.
- Or, convert the counter to an atomic with acquire on read and release on write, if lock contention is a concern.
Then it can propose a patch with either lock guard insertion or atomic swap, depending on the codebase style.
Example: deadlock via a lock order cycle
Trace indicates:
- Thread A acquires L1 then tries to acquire L2 in function f at foo.cc:46.
- Thread B acquires L2 then tries to acquire L1 in function g at bar.cc:91.
- Both are blocked. The lock graph shows edges L1 -> L2 and L2 -> L1 forming a cycle.
The AI can make a precise recommendation:
- Adopt a global lock order, for example sort by lock id or name and always acquire in ascending order.
- If that is impractical, refactor to reduce the span of critical sections or split locks so that both code paths avoid double acquisition.
- As a short term mitigation, switch one acquisition to a trylock with timeout and backoff.
Since the cycle is explicit, the fix is not speculative.
Heisenbug under load: replay with DPOR minimal trace
In a stress test, a flaky failure emerges once every few hours. Hooking the test to a DPOR scheduler produces a minimal interleaving that triggers the failure in 12 steps. The artifact includes the schedule, the threads at each step, and the key memory events.
Feeding this to the AI allows it to narrate the failure: at step 5, goroutine 3 reads a stale pointer because goroutine 1 published it without a release store; the pointer is then dereferenced, leading to a crash at step 8. The AI recommends making the publication atomic with release semantics and adding an acquire load on the consumer side, or protecting the pointer with a mutex.
Because you can replay the 12 step schedule locally, you can verify the fix and banish the flake.
Distributed contexts: vector clocks, tracing, and cross process causality
Concurrency is not only threads and locks; in distributed systems, requests cross processes, machines, and time zones. Happily, the theory is the same: vector clocks and HB still apply.
- Use tracing frameworks like OpenTelemetry to propagate a context through requests. That context can carry a causal id and logical clock information.
- At boundaries like RPC send and receive, record send and receive events and merge vector clocks.
- Across services, you can establish HB edges that explain why a later event was or was not ordered after an earlier write.
A minimal extension: augment trace spans with vector clock metadata. Even if you do not fully instrument memory accesses, you can correlate service level races, like two services updating a shared resource without coordination.
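One concrete way to start, without waiting on first class support in your tracing framework, is to carry the clock in a request header and merge it on receive. A minimal HTTP sketch in Go; the header name, encoding, and map representation are assumptions for illustration, not an OpenTelemetry feature.

```go
package causal

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// Header carrying the sender's vector clock, e.g. "3:17,9:4" for {3: 17, 9: 4}.
// The name and the encoding are illustrative, not a standard.
const clockHeader = "X-Vector-Clock"

// InjectClock ticks the local component and attaches the clock to an
// outgoing request (the send event).
func InjectClock(req *http.Request, c map[int]uint64, selfID int) {
	c[selfID]++
	parts := make([]string, 0, len(c))
	for tid, t := range c {
		parts = append(parts, fmt.Sprintf("%d:%d", tid, t))
	}
	req.Header.Set(clockHeader, strings.Join(parts, ","))
}

// ExtractAndJoin merges the sender's clock into the local one (the receive
// event), establishing a happens-before edge from the send to what follows.
func ExtractAndJoin(req *http.Request, c map[int]uint64) {
	for _, part := range strings.Split(req.Header.Get(clockHeader), ",") {
		kv := strings.SplitN(part, ":", 2)
		if len(kv) != 2 {
			continue
		}
		tid, err1 := strconv.Atoi(kv[0])
		t, err2 := strconv.ParseUint(kv[1], 10, 64)
		if err1 == nil && err2 == nil && t > c[tid] {
			c[tid] = t
		}
	}
}
```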
Performance, overhead, and sampling
Capturing HB and schedule traces has a cost. Do not turn it on for all builds or all tests all the time; be intentional.
- Use tiers:
- Always on low overhead lock logging in tests and dev builds.
- Nightly HB detection with sanitizers or agents.
- On demand rr recording on failure.
- DPOR exploration only for annotated tests or small critical modules.
- Sample long running tests: enable tracing for 1 percent of runs or for short windows.
- Strip symbols and compress artifacts, but keep a mapping from addresses to source lines.
- Provide a kill switch if overhead spikes.
The point is not to collect all data; it is to collect decisive data when you need it.
Memory model gotchas the AI should know (and you should feed)
- Atomics and fences. Many detectors understand pthread mutexes but not custom atomics. If your code uses acquire and release operations, make sure your HB capture recognizes them.
- Condition variables and channels. Signal and broadcast induce HB edges; capture them or you will flag false races.
- Async and futures. Logical tasks may execute on thread pools; track logical task ids and cross these with thread ids.
- Unsafe or foreign calls. A lock acquired in a C library should be modeled or at least treated conservatively.
If your capture runtime cannot understand a synchronization mechanism, add an adapter that emits the right events.
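For example, if your capture layer understands mutexes but not channels, a hand off over a channel will show up as a false race unless you record that the send happens before the matching receive. A minimal adapter sketch in Go, assuming an emit hook into your event log; the hook, the wrapper, and the single sync id per channel (a simplification for buffered channels) are illustrative.

```go
package adapter

// emit is whatever your capture layer exposes for appending a synchronization
// event to the trace (illustrative signature, not a real API).
var emit func(op, syncID string)

// tracedChan wraps a channel so the capture layer sees a release edge at send
// and an acquire edge at receive, giving the HB analysis the same information
// it would get from an unlock/lock pair on a shared mutex.
type tracedChan[T any] struct {
	ch chan T
	id string // stable id for this channel in the trace
}

func (c tracedChan[T]) Send(v T) {
	emit("release", c.id) // everything before the send is ordered before the matching receive
	c.ch <- v
}

func (c tracedChan[T]) Recv() T {
	v := <-c.ch
	emit("acquire", c.id)
	return v
}
```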
Prompts and responses: telling the AI how to use the evidence
Your LLM prompt should be explicit and evidence centric. A good pattern:
- Provide the minimal counterexample trace, the lock graph segment, and the HB explanation for the suspect accesses.
- Provide code spans for each site.
- Ask for a diagnosis that cites the causal proof and proposes patches that maintain invariants.
Example prompt fragment you can template in your tool:
You are a concurrency debugging assistant. Use the provided trace and HB facts. Do not speculate beyond the evidence.
- Lock cycle: L1 -> L2 -> L1 involving threads 17 and 19.
- Data race candidate: write at foo.cc:45 (tid 17) and read at bar.cc:88 (tid 19). HB analysis: no path orders write before read.
- Minimal schedule: [T17 lock L1, T19 read x, T17 write x, T17 lock L2, T19 lock L2 (blocks), T19 lock L1 (blocks)]
Tasks:
1) Explain the deadlock and propose a lock ordering or refactor.
2) Explain the data race and propose synchronization or atomics.
3) Provide code level changes with minimal disruption.
This keeps the AI inside the rails. You can augment with repository context so code edits are precise.
Putting it all together: a rollout plan
- Choose your base detectors and recorders.
- C and C++: ThreadSanitizer for tests, rr for replay, optional Helgrind. If kernel side, lockdep.
- Go: built in race detector, runtime tracer for schedules.
- JVM: JFR, JCStress for litmus tests, agent for lock events.
- Rust: loom for deterministic interleavings; tsan when possible.
- Add lightweight lock logging and a scheduler hook library for unit tests.
- Wrap mutexes and channels with logging macros behind a build flag.
- Provide a deterministic scheduler shim for a small subset of tests.
- Define your artifact schemas.
- Events: thread id, op, lock or address, site, and logical time.
- HB metadata: epochs per event.
- Lock graph cycles and minimal counterexample traces.
- Schedule logs for replay.
- Integrate into CI.
- Nightly sanitizer runs with artifact upload.
- On failure, run deterministic reproduction job and attach rr traces if available.
- Present AI diagnosis in the PR with links to replay and source lines.
- Upgrade the IDE.
- Extension or plugin that reads artifacts and draws lock graphs.
- One click reproduction from the IDE.
- AI panel that cites HB evidence.
- Enforce hygiene.
- Establish lock ordering rules and static checks.
- Convert shared flags and counters to atomics or protect them with locks.
- Add micro tests with deterministic schedulers for tricky code paths.
What not to do
- Do not ask an LLM to find a race by eyeballing code. The best it can do is guess patterns; without HB, it is speculation.
- Do not rely on logs without structured causality. Timestamps are not causality; they are unreliable across threads and cores.
- Do not treat sanitizer reports as build spam. Archive them and wire them into your review workflow.
Conclusion: stop guessing, start proving
Concurrency bugs are adversarial. They hide behind schedules and reorderings. You will not out guess them, and neither will a language model working blind. But if you feed your debugging AI the right evidence — lock graphs that show cycles, HB metadata that proves races, and reproducible traces that anyone can replay — it will become a powerful ally. It can explain the bug in causal terms and propose fixes grounded in the actual execution, not in folklore.
Start small: enable your language's race detector in tests, capture lock events into a simple JSON log, and persist rr traces on failures. Define a schema and teach your AI agent to read it. Add deterministic schedulers or loom style tests for the nastiest modules. Within a sprint or two, you will have transformed flakiness from an intermittent sinkhole into a series of concrete, explainable, and fixable issues.
Concurrency is about causality. Give the AI the causality. The rest is just code.
