Benchmarks Are Broken: A Real-Repo Framework for Evaluating Code Debugging AI
Benchmarks for code debugging AI look great in slide decks and preprints—but the numbers rarely survive contact with a production codebase. A model may appear to fix dozens of synthetic bugs in a curated suite but fail to land a single safe change in a large, real repository with opaque build steps, transitive dependencies, and flaky tests. If you are building or buying code-debugging AI, the discrepancy between lab-grade outcomes and CI-grade outcomes is a material risk.
This article proposes a real-repository evaluation framework—think CI harness, not leaderboard—that measures the things that matter in practice: fix rate, regression rate, latency, reproducibility, and operator workload. It pairs hermetic builds with trace instrumentation and flaky-test detection so you can compare models and agents using evidence that mirrors your production environment.
Opinions included. Hype excluded.
Why Most Benchmarks Mislead
Most public benchmarks for AI code debugging share three structural problems:
- Synthetic bugs and short repros
  - Tiny, single-file tasks with mocked dependencies or cached interpreter state.
  - Task design that rewards pattern recall (e.g., edit a known line) rather than end-to-end debugging under real toolchains.
- Incomplete correctness signals
  - A patch that makes a unit test green could silently break integration tests or non-obvious invariants.
  - Many benchmarks lack negative checks: they confirm the target test passes but rarely ensure all previously passing tests remain green.
- Uncontrolled execution environment
  - Benchmark harnesses often allow network during builds, reuse global caches, or leak environment-specific state.
  - Results are not reproducible across machines, time, or package-mirror drift.
In short, we have arcade games where production teams need crash tests. The alternative is not impossible; it just requires adopting habits from modern CI systems and SRE practice: hermetic builds, deterministic runners, traceable execution, and aggressive flake management.
Design Goals for a CI-Grade Debugging Benchmark
A credible evaluation framework for debugging models or agents should:
- Use real repositories with unmodified build and test instructions.
- Recreate failing states from real commit history or synthetic-but-realistic bug seeding.
- Pin compilers, toolchains, and dependency graphs to guarantee hermetic builds.
- Instrument the full agent loop—retrieval, planning, editing, building, testing—with traces and logs.
- Detect and quarantine flakiness so you do not attribute non-deterministic failures to the model.
- Report fix rate, regression rate, and latency with confidence intervals and failure modes.
- Run offline by default (no network) and limit resource usage per attempt.
- Expose artifacts, diffs, and execution plans for human auditability.
Below is an architecture and protocol that meets these goals and can be implemented incrementally.
Architecture Overview
The harness comprises the following components:
- Repository curator: selects targets, reproduces failing states by commit manipulations, and pins toolchains.
- Builder: runs in a hermetic container or VM with no network and reproducible dependency graphs.
- Orchestrator: coordinates model/agent, applies patches, enforces timeouts and resource quotas.
- Test runner: executes full suites (or selective subsets) and captures structured result data.
- Flake detector: repeats suspicious cases, models flakiness, and de-weights flaky signals.
- Telemetry layer: emits OpenTelemetry (OTel) traces for end-to-end observability.
- Metrics and reporter: computes fix/regression/latency metrics and outputs machine-readable results.
A typical run takes a repo at a failing state, invokes an agent with a fixed budget (time, tool calls), applies patch proposals, and evaluates correctness under hermetic build constraints. Every significant event is traced; every outcome is reproducible.
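To make the contract between these components concrete, here is a minimal sketch of the task and budget records the orchestrator might pass around. The field names (baseline_commit, failing_transform, max_tool_calls, and so on) are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budgets:
    # Per-attempt resource ceilings enforced by the orchestrator (illustrative values)
    max_wall_secs: int = 1800
    max_tool_calls: int = 8
    max_total_tokens: int = 2_000_000

@dataclass(frozen=True)
class Task:
    # One benchmark task: a repo pinned to a baseline plus a deterministic failing transform
    task_id: str
    repo_url: str
    baseline_commit: str
    failing_transform: str      # e.g., "revert:<fix-commit-sha>"
    image: str                  # digest of the hermetic runner image
    target_tests: tuple = ()    # tests expected to fail in the failing state
```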
Real Repositories and Failing States
Building a good task set begins with real repositories and realistic failure modes.
- Sources: popular open-source repos with mature CI (e.g., Python, Java, Go, Rust ecosystems) and internal monorepos if available.
- Languages: maintain diversity (Python, JS/TS, Java, Go, Rust, C/C++, C#) and build tools (pip/poetry, npm/yarn/pnpm, Maven/Gradle, go, cargo, Bazel/CMake).
- Failure induction strategies:
  - Revert known bug-fix commits (ground truth fixes exist). This is robust and highly realistic.
  - Cherry-pick a dependent feature without updating callers to create type or API breakages.
  - Seed minimal semantic edits that break tests beyond syntax (e.g., off-by-one, wrong default, null handling).
For each task:
- Baseline commit: a passing commit from project history.
- Failing state: apply a deterministic transform (e.g., git revert) to introduce a failure.
- Target criterion: return the repo to a state where all previously passing tests pass, within a fixed budget.
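A minimal sketch of how the curator might reconstruct and verify a failing state, assuming a git-based repo and a pytest test suite (the helper names are hypothetical):

```python
import subprocess

def sh(args, cwd):
    # Run a command in the repo working directory and return its exit code
    return subprocess.run(args, cwd=cwd, capture_output=True, text=True).returncode

def reconstruct_failing_state(workdir, baseline_commit, fix_commit, target_test):
    # Check out the baseline, confirm it is green, revert a known fix commit,
    # and confirm the target test now fails deterministically.
    assert sh(["git", "checkout", baseline_commit], workdir) == 0
    assert sh(["python", "-m", "pytest", "-q", target_test], workdir) == 0, "baseline not green"
    assert sh(["git", "revert", "--no-edit", fix_commit], workdir) == 0
    assert sh(["python", "-m", "pytest", "-q", target_test], workdir) != 0, "failure did not reproduce"
```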
Hermetic Builds: The Non-Negotiable
You cannot compare models fairly if the build graph changes between runs.
Key practices:
- Containerized or VM isolation with fixed base images.
- Compiler and toolchain pinning (e.g., Node LTS exact versions, Python minor version, Rust toolchain via rust-toolchain.toml).
- Lockfile enforcement:
  - Python: pip-tools or poetry lock; pip install --require-hashes; offline wheels from an internal mirror.
  - Java: Maven/Gradle offline mode with a pre-seeded local repository.
  - Node: npm ci or pnpm install --frozen-lockfile with a vendored cache.
  - Go: go mod vendor; build with -mod=vendor and GOPROXY=off.
  - Rust: cargo build/test with --locked and a vendored registry snapshot (cargo vendor).
- Network: deny by default. Whitelist only trace export and internal artifact store if needed.
- System invariants: clock, timezone, locale, CPU count, and kernel features fixed.
Example Dockerfile for a Python project:
```Dockerfile
FROM ubuntu:22.04

# Pin OS packages
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
      python3.11 python3.11-venv python3.11-dev build-essential git ca-certificates && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Create venv and install exact wheels from a local mirror (no internet)
ENV VIRTUAL_ENV=/opt/venv
RUN python3.11 -m venv ${VIRTUAL_ENV}
ENV PATH="${VIRTUAL_ENV}/bin:$PATH"

# Add project and a prebuilt wheels directory
WORKDIR /workspace
COPY . /workspace
COPY wheels/ /wheels/

# Install with hashes and offline
RUN pip install --no-index --find-links=/wheels -r requirements.txt --require-hashes

# Network is disabled at runtime (enforced by the orchestrator via seccomp/iptables)
```
For Bazel and Nix, hermeticity can be even stronger. Nix flakes or Bazel toolchains can reproduce the build and test graph byte-for-byte given a fixed input hash.
Orchestrator and Execution Flow
The orchestrator owns the lifecycle of a task, from checkout to verdict.
High-level steps:
- Provision runner: start a container or micro-VM (e.g., Firecracker) with pinned resources.
- Materialize repo: clone at the baseline commit, verify baseline tests pass, apply the failing transform, and verify the failure reproduces.
- Freeze environment: verify lockfiles and catalogs present; deny network.
- Instrument: initialize OpenTelemetry tracer.
- Agent loop: grant the model/agent capabilities (read code, run tests, build, propose diffs) with strict budgets.
- Patch evaluation: for each candidate patch, rebuild and run tests with flaky-test handling.
- Verdict: record fix rate, regression rate, latency, resource usage, and artifacts (diffs, logs, traces).
- Teardown: export artifacts and destroy runner.
Pseudocode sketch:
```python
async def run_task(task, agent, budgets):
    with provision_runner(task.image, cpu=4, mem_gb=8, net=False) as runner:
        tracer = init_tracer(task.id)
        repo = await materialize_repo(runner, task)
        assert await verify_repro(repo)

        ctx = AgentContext(repo=repo, budgets=budgets, tracer=tracer)
        start = monotonic()
        while ctx.within_budget():
            with tracer.start_as_current_span('agent_step') as span:
                plan = await agent.plan(ctx)
                result = await agent.execute(plan, ctx)
                span.set_attribute('step.type', plan.type)
                span.set_attribute('tokens.input', result.tokens_in)
                span.set_attribute('tokens.output', result.tokens_out)

                if result.patch:
                    verdict = await evaluate_patch(repo, result.patch, tracer)
                    if verdict.is_fix and not verdict.regressions:
                        latency = monotonic() - start
                        return Report.fix(task.id, verdict, latency, ctx.usage())

        return Report.fail(task.id, ctx.usage())
```
Telemetry: Tracing the Debugging Loop
OpenTelemetry (OTel) traces let you analyze where agents spend time and why runs fail. A typical trace includes spans for:
- Code retrieval: reading files, indexing, building context windows.
- Test execution: including sharding, retries, and flaky decisions.
- Build steps: compile, link, bundle; cache hit/miss.
- Diff synthesis: model prompts, tool-augmented planning, and edit application.
- Patch validation: unit tests, integration tests, static analysis, type checks, linters.
Example instrumentation in Python using opentelemetry-sdk:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer('debugging-harness')

with tracer.start_as_current_span('build') as span:
    rc, logs = run_build()
    span.set_attribute('build.rc', rc)
    span.set_attribute('build.cache_hit', bool(cache_hit(logs)))
    span.add_event('build.logs', {'size_bytes': len(logs)})
```
Export traces to Jaeger or OTLP for dashboards. Span attributes should include repo, commit, language, toolchain, model name, token usage, and attempt number for later analysis.
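One way to attach that run-level context is via OTel resource attributes, so every span inherits them. The bench.* attribute keys below are illustrative, not an established semantic convention, and the snippet assumes the opentelemetry-exporter-otlp package is installed.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Attach run-level context once as resource attributes; every span inherits them.
resource = Resource.create({
    "service.name": "debugging-harness",
    "bench.repo": "psf/requests",     # illustrative attribute keys and values
    "bench.commit": "abc1234",
    "bench.language": "python",
    "bench.model": "agent-x-32k",
    "bench.attempt": 1,
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```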
Flaky-Test Detection and Quarantine
Flakiness is a top source of evaluation noise. Without mitigation, you will misestimate both false positives and false negatives.
Recommended strategy:
- Establish a flake prior: before using a repo in the pool, run its test suite N times (e.g., N=10) to estimate a base flake rate per test. Record tests with a non-zero observed failure rate.
- During evaluation:
  - On failure, automatically rerun failing tests up to k times in an identical environment.
  - Model flakiness using a simple Beta-Binomial posterior: if a test fails f times out of r runs, compute posterior P(flake | f, r). If above a threshold (e.g., 0.8), label it flaky and exclude it from the pass/fail decision for that attempt; still record it.
  - Distinguish deterministic failures (fail every time) from sporadic ones.
- Post-run audits: maintain a flaky-test registry by repo and test identifier, decaying over time to capture recently introduced or fixed flakiness.
Minimal Python snippet:
```python
def beta_binomial_p_flaky(f, r, alpha=1, beta=1):
    # Proxy for P(flake): posterior mean (Beta(alpha, beta) prior) of the probability
    # that an initially failing test passes on a rerun, given f failures in r reruns.
    # Values above the quarantine threshold (e.g., 0.8) suggest the original failure
    # was flake rather than a genuine regression.
    return (r - f + alpha) / (r + alpha + beta)
```
Many ecosystems also provide first-class support: pytest-rerunfailures, Bazel flaky test annotations, or Gradle test retry plugin. The harness should integrate but never silently ignore flakiness; always surface it as a separate metric.
Metrics That Matter
Primary metrics:
- Fix rate: proportion of tasks where the agent returns the repo to a passing state under budget.
- Regression rate: proportion of tasks where the targeted failure is fixed but previously passing tests regress.
- Time-to-first-green (latency): wall-clock time from first agent action to the first successful, regression-free test run.
Secondary metrics:
- Attempts to fix: count of distinct patch applications before success.
- Build success rate: proportion of attempts that reach test execution (compilation or setup succeeded).
- Token and tool usage: input/output tokens, number of external tool invocations, and their durations.
- Edit distance: lines changed, files changed, and abstract syntax tree (AST) edit distance if available.
- Coverage delta: change in statement/branch coverage after patch application.
- Determinism: variance in outcomes across repeated runs of the same task with same budgets.
Reporting guidelines:
- Use bootstrap confidence intervals across tasks (e.g., 95% CI for fix rate); a minimal sketch follows this list.
- Report p50, p90, p95 latency, not just mean.
- Disaggregate by language, build system, repo size, and failure type.
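For the bootstrap CIs, a stdlib-only sketch over per-task 0/1 outcomes (e.g., fix or no fix) looks like this; the iteration count and seed are arbitrary choices:

```python
import random

def bootstrap_ci(outcomes, iters=10_000, alpha=0.05, seed=0):
    # Percentile bootstrap CI for a proportion (e.g., fix rate) over per-task 0/1 outcomes.
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(iters))
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return sum(outcomes) / n, (lo, hi)

# Example: point estimate and 95% CI for fix rate over 60 tasks
# fix_rate, ci = bootstrap_ci([1, 0, 1, 1, 0] * 12)
```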
Example JSON result for one task:
```json
{
  "task_id": "requests-4521-revert",
  "model": "agent-x-32k",
  "fix": true,
  "regression": false,
  "attempts": 2,
  "latency_sec": 418.5,
  "tokens_in": 1240123,
  "tokens_out": 210444,
  "edits": {"files": 2, "lines_added": 8, "lines_removed": 3},
  "flake": {
    "tests_quarantined": ["tests/test_sessions.py::test_redirect"],
    "p_flaky_max": 0.91
  }
}
```
Patch Evaluation Protocol
For each candidate patch produced by the agent:
- Apply diff to a clean working tree.
- Run linters and type checkers if they are part of the repo's normal CI.
- Build and run the full test suite, with test sharding if the suite is very large.
- Retry failed tests up to k times to model flakiness.
- Decide outcomes:
  - Target fixed: the original failing tests now pass (accounting for the flake model).
  - No regression: all previously passing tests still pass.
- If both hold, declare success for the task; record latency and exit early.
Where runtime budgets are tight, apply test impact analysis (TIA) to run a minimal subset first (e.g., affected tests near edited files), then confirm with a full suite once a candidate looks promising. This mimics how humans iterate locally before pushing to CI.
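A deliberately naive sketch of that first-pass selection, using filename matching as a stand-in for real dependency-graph-based TIA (the paths and conventions are assumptions):

```python
from pathlib import Path

def impacted_tests(edited_files, test_root="tests"):
    # Run tests whose filename mentions an edited module; fall back to the full suite.
    # A placeholder for real dependency-graph-based test impact analysis.
    modules = {Path(f).stem for f in edited_files if f.endswith(".py")}
    selected = [
        str(p) for p in Path(test_root).rglob("test_*.py")
        if any(m in p.stem for m in modules)
    ]
    return selected or [test_root]

# Example: impacted_tests(["src/requests/sessions.py"])
# -> likely ["tests/test_sessions.py"] in a conventionally laid-out repo
```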
Seed Tasks from Real CI
Relying solely on synthetic bug seeding is dangerous. A robust pool can be harvested from real CI history:
- For repos with long histories, search for commits that fix failing CI. Identify the preceding red commit and reconstruct the red-green pair.
- Use heuristics to detect likely bug fixes: commit messages with fix, regression, fallback; diffs touching code and tests; PR labels referencing bugs.
- For each red commit, treat returning to the green state as the task. The ground truth patch is the actual fix; the agent is not required to replicate it exactly, only to restore correctness.
This approach yields realistic failure diversity: test-timeouts, flaky patterns, integration test breakages, dependency pin issues, and environmental assumptions.
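A minimal sketch of the mining step, using only commit-message heuristics; a real harvester would also inspect CI status and the diff itself:

```python
import re
import subprocess

FIX_HINTS = re.compile(r"\b(fix(es|ed)?|regression|bug)\b", re.IGNORECASE)

def candidate_fix_commits(repo_dir, limit=500):
    # Each hit is a candidate green commit; its parent is the candidate red state.
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", f"-n{limit}", "--pretty=%H%x09%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        (sha, subject)
        for sha, _, subject in (line.partition("\t") for line in out.splitlines())
        if FIX_HINTS.search(subject)
    ]
```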
Example: End-to-End Flow on a Python Repo
Assume we choose a task from a popular library. We reconstruct a failing state by reverting a known fix commit that addressed a redirect handling bug.
- Baseline: commit A (green)
- Failing state: A + revert(commit F) introducing failing tests
Orchestrator steps:
- Start container with Python 3.11, offline wheels for dependencies.
- Verify baseline tests pass and failing state recreates the target failure deterministically.
- Invoke agent with budgets: 30 minutes wall-time, 8 build/test invocations max, 2 million tokens total.
- Agent reads failing test logs, searches code for redirect logic, edits a helper function, and runs pytest -k redirect.
- Partial green: target test passes locally; run full suite.
- One unrelated test fails intermittently. Flake detector reruns it 3 times; it passes twice. Mark as flaky and proceed.
- Final check: rerun suite once to ensure determinism; record latency and edits; export trace to Jaeger.
Outcome reported:
- fix: true
- regression: false
- latency: 7 minutes
- attempts: 2
- flake quarantined: 1 test
This mirrors a human workflow in CI and produces a defensible metric.
Implementation Details: Practical Tools
- Workflow engine: Temporal or Airflow for retries and DAGs.
- Sandbox: Docker for convenience, Firecracker or gVisor for stronger isolation.
- Artifact store: S3-compatible (MinIO) for logs, diffs, coverage, and wheels.
- Metrics: Prometheus for per-run counters; Grafana dashboards.
- Telemetry: OTel Collector to fan out to Jaeger and Prometheus.
- Data store: Postgres for task metadata and results.
Example GitHub Actions runner to launch the harness in self-hosted infra:
```yaml
name: Debugging-Benchmark

on:
  workflow_dispatch: {}
  schedule:
    - cron: '0 3 * * *'

jobs:
  run:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Build harness image
        run: |
          docker build -t harness:latest infra/harness
      - name: Run suite
        env:
          OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-gateway.internal:4317
        run: |
          docker run --rm \
            --cpus=4 --memory=8g --network=none \
            -e OTEL_EXPORTER_OTLP_ENDPOINT \
            -v $PWD/artifacts:/artifacts \
            harness:latest ./run_suite.py --pool configs/pool.yaml --budget 1800
```
For Nix users, define a flake that pins the entire harness and each repo's environment so the same inputs produce the same outputs across machines.
Regression Detection Beyond Tests
Tests are necessary but incomplete. Add additional gates:
- Static analysis and type checks: mypy, ESLint/TypeScript, golangci-lint, Clippy, SpotBugs.
- Binary compatibility checks for libraries with public APIs (Java: japicmp; Rust: semverver).
- Mutation testing on touched code to catch weak tests; run a small subset to keep budgets manageable.
- Coverage guardrails: disallow large negative changes in coverage near edited code; see the sketch below.
These checks reduce false positives where a patch makes tests pass for the wrong reasons.
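As one example of a coverage guardrail, the sketch below compares coverage.py JSON reports (coverage json) before and after a patch and flags edited files whose line coverage drops sharply; it assumes the report layout coverage.py currently emits, and the threshold is an arbitrary choice:

```python
import json

def coverage_regressions(before_json, after_json, edited_files, max_drop_pct=2.0):
    # Flag edited files whose line coverage drops by more than max_drop_pct points.
    def pct(report, path):
        entry = report["files"].get(path)
        return entry["summary"]["percent_covered"] if entry else None

    before, after = json.loads(before_json), json.loads(after_json)
    offenders = []
    for path in edited_files:
        b, a = pct(before, path), pct(after, path)
        if b is not None and a is not None and (b - a) > max_drop_pct:
            offenders.append({"file": path, "before": b, "after": a})
    return offenders
```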
Latency and Resource Accounting
Latency matters operationally; a fix that arrives 2 hours later may miss a deployment window. Measure:
- Time-to-first-green: wall time from agent start to first regression-free pass.
- Build/test time proportion: percentage of time spent executing external tools vs. model inference.
- Token budget usage: for LLM-based agents, record token usage and cost equivalents.
Report both aggregate distribution and per-repo differences (large monorepos will skew absolute latencies).
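A small sketch of how the reporter might split an attempt's wall time between external tools and model inference, assuming the telemetry layer exports flat event records with start and end timestamps (the event names and shape are assumptions):

```python
def latency_summary(events):
    # events: records like {"name": "build" | "test" | "model_call", "start": t0, "end": t1}
    start = min(e["start"] for e in events)
    end = max(e["end"] for e in events)
    wall_s = end - start
    tool_s = sum(e["end"] - e["start"] for e in events if e["name"] in {"build", "test"})
    return {
        "wall_s": wall_s,                          # time-to-first-green if the last event is the green run
        "tool_s": tool_s,                          # time spent in external builds/tests
        "inference_s": max(wall_s - tool_s, 0.0),  # everything else, mostly model inference
        "tool_fraction": tool_s / wall_s if wall_s else 0.0,
    }
```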
Fairness, Leakage, and Reproducibility
To avoid accidental leakage:
- Partition by time: only include repos and commits newer than the model's training cutoff for the main benchmark; maintain a second partition of older repos as an upper bound, since memorization can inflate results there.
- Holdout repos: keep a set of repos never seen during agent development to validate generalization.
- Freeze the pool: publish a hash of the task pool and runner image. All participants evaluate against the same bits.
For reproducibility:
- Publish the invocation manifest: model version, parameters, tool versions, container image digests, and task set hash.
- Store all diffs and logs; verify that rerunning the same manifest yields the same outcomes within the measured flake tolerance.
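One way to freeze and publish the pool identity is to content-address the task recipes together with the runner image digest; this sketch assumes YAML task recipes collected under a single directory:

```python
import hashlib
from pathlib import Path

def pool_hash(task_dir, runner_image_digest):
    # Content-address the frozen pool: every task recipe plus the runner image digest.
    h = hashlib.sha256()
    for path in sorted(Path(task_dir).rglob("*.yaml")):
        h.update(str(path.relative_to(task_dir)).encode())
        h.update(path.read_bytes())
    h.update(runner_image_digest.encode())
    return h.hexdigest()

# manifest = {"pool_sha256": pool_hash("tasks/", image_digest), "model": "agent-x-32k", ...}
```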
A Minimal Evaluation Harness: Reference Skeleton
Below is a condensed Python sketch of the patch evaluation loop:
```python
import json
import subprocess
from pathlib import Path

class Verdict:
    def __init__(self, is_fix, regressions, flake_info):
        self.is_fix = is_fix
        self.regressions = regressions
        self.flake_info = flake_info

def run_cmd(cmd, timeout):
    p = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return p.returncode, p.stdout + '\n' + p.stderr

def run_tests(reruns=2, timeout=1800):
    # Requires the pytest-json-report plugin for the structured report
    rc, _ = run_cmd(
        'pytest -q --disable-warnings --json-report --json-report-file=report.json',
        timeout,
    )
    report = json.loads(Path('report.json').read_text())
    failures = [t for t in report['tests'] if t['outcome'] == 'failed']
    flaky = []
    for test in failures:
        name = test['nodeid']
        passes = 0
        for _ in range(reruns):
            rc2, _ = run_cmd(f"pytest -q {name} --disable-warnings", timeout=600)
            passes += int(rc2 == 0)
        if passes > 0:  # intermittent: passed on at least one rerun
            flaky.append(name)
    # Green if nothing failed, or every failure was intermittent (quarantined as flaky)
    return rc == 0 or len(failures) == len(flaky), flaky

def evaluate_patch(apply_patch_fn):
    apply_patch_fn()
    build_rc, logs = run_cmd('python -m build', timeout=1200)
    if build_rc != 0:
        # A patch that breaks the build is a regression, not a fix
        return Verdict(False, True, [])
    ok, flaky = run_tests()
    # A full harness would also diff against the baseline pass set; simplified here:
    regressions = not ok
    return Verdict(ok and not regressions, regressions, flaky)
```
This is intentionally minimal; a production harness should add isolation, OTel, language-agnostic runners, and richer test adapters.
Scaling Up: Cost and Throughput
A serious evaluation might include hundreds of tasks across languages, with each attempt taking 5–30 minutes. Tips:
- Shard across many small runners instead of a few large ones; debugging agents are often IO-bound.
- Cache heavy artifacts (e.g., language runtimes) in base images to reduce cold-start time.
- Pre-warm dependency caches offline inside the image to keep network disabled during runs.
- Apply test sharding and TIA to reduce iteration time; always confirm with a full pass before declaring success.
- Enforce strict budgets per task and per agent to keep costs predictable.
Threats to Validity and How to Mitigate
- Hidden non-determinism: time-based tests, race conditions, or OS-specific behavior. Address by pinning system invariants and re-running suspect cases.
- Dataset drift: repos evolve; lock a task set snapshot for fair comparisons, and create new versions periodically.
- Overfitting to the harness: agents might learn to game signals (e.g., deleting tests). Guard with mutation checks, coverage diffs, and immutable test sets loaded from a separate path; see the sketch after this list.
- Security and secrets: scrub environment variables and block network; do not run unknown code on privileged hosts.
- Licensing: ensure redistribution rights for repo snapshots or provide reproducible recipes instead of artifacts.
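A simple guard against test tampering is to reject any patch whose diff touches protected test paths; this sketch assumes unified-diff output and hypothetical path conventions:

```python
def touches_protected_paths(diff_text, protected=("tests/", "conftest.py")):
    # Flag patches that edit or delete files under protected test paths,
    # a guard against agents "fixing" failures by weakening the tests themselves.
    touched = set()
    for line in diff_text.splitlines():
        if line.startswith(("--- a/", "+++ b/")):
            touched.add(line[6:])  # strip the "--- a/" / "+++ b/" prefix
    return sorted(p for p in touched if p.startswith(protected))
```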
Relationship to Existing Work
- SWE-bench and similar datasets demonstrated the value of real issues and multi-file reasoning. Our proposal adds CI-grade hermeticity, regression checks, traceability, and flake modeling.
- Reproducible builds research and systems (Nix, Bazel, Guix) inform the hermetic build design.
- SRE tracing and test flakiness literature motivate the telemetry and statistical handling.
The aim is not to dethrone existing leaderboards but to complement them with a harness that predicts operational value.
Roadmap: From Internal Harness to Community Standard
- v0: Private harness inside your org with 50–100 tasks, focused on your languages and build systems.
- v1: Public spec for task format, runner contract, and result schema. Publish a small open pool with fully reproducible recipes.
- v2: Cross-org interoperability with a shared OTel semantic convention for debugging spans.
- v3: Governance and submission rules to prevent leakage and gaming; periodic pool refresh with versioning.
If you contribute only one thing, contribute a task recipe that others can reproduce exactly. Provenance beats size.
Conclusion: Make Benchmarks Boring Again
Debugging AI will only be trustworthy when its evaluation is boring—no surprises, no unpinned dependencies, no mysterious green builds that turn red in CI. The path forward is clear:
- Real repos, not toy snippets.
- Hermetic builds, not hope-and-cache.
- Regression checks, not single-test illusions.
- Flake-aware verdicts, not roulette.
- Traces and artifacts, not anecdotes.
Adopt a CI-grade evaluation harness and your numbers will drop at first. That is good news. They will also start to mean something.
Call it a benchmark if you must. Better to call it what it is: a rehearsal for production.
