Make Code Debugging AI Prove It: A Proof‑of‑Fix Contract for CI Pipelines
AI can write patches. That doesn’t mean we should trust them. If you run a modern codebase with a nontrivial CI pipeline, you’ve already seen the dark side of automated fixes: patches that compile but don’t correct the bug, new flaky tests, silent performance regressions, and brittle workarounds disguised as "improvements." The fix velocity looks great—until your incident dashboard tells another story.
It’s time to move from faith to proof. This article proposes a Proof‑of‑Fix Contract: a policy enforced in CI that requires every “bug fix” patch—especially those generated or assisted by AI—to carry concrete evidence that it actually resolves the failure and doesn’t introduce regressions. No proof, no merge.
The contract is simple:
- The patch must ship a failing reproduction that fails on main (or the parent commit) and passes after the fix.
 - The patch must include a minimal, isolated test that encodes the bug and prevents recurrence.
 - The patch must provide trace/log evidence of the bug before and after the fix.
 - The patch must pass performance and safety budgets (latency, memory, SAST/DAST, secrets, dependency risk), with baselined comparisons.
 - All artifacts are captured in a manifest and automatically validated in CI to block hallucinated fixes.
 
This isn’t bureaucracy. It’s an engineering forcing function that shortens mean time to repair (MTTR), reduces reopens, and makes AI worth its hype.
Why blind trust in debugging AI fails
LLM-generated code is astonishingly productive, but it’s not systematically reliable. The failure modes are predictable:
- Hallucinated fixes: patches that change symptoms, not root causes.
 - Non-reproducible bugs: the failure only happens under a precise environment, seed, locale, or timing.
 - Garden-path tests: tests that pass for the wrong reasons (they mock out the wrong component, or assert a side effect instead of the behavior).
 - Silent regressions: a fix that trades correctness for performance or security.
 - Flaky confirmations: the “fix” seems to work but is based on nondeterministic behavior (threads, time, network).
 
Developers already know how to address these problems: capture a failing repro, write a minimal regression test, ensure it fails pre-fix, then passes post-fix, and confirm no regressions. The problem is how to apply this at scale, with AI in the loop. The answer: encode it as a contract and enforce it with CI.
The Proof‑of‑Fix Contract (PoF): what it requires
A Proof‑of‑Fix Contract is a structured, machine-validated set of artifacts that travel with any fixing patch. You can adopt this for every bug fix, or at minimum, for AI-assisted ones (label-based enforcement). A PoF should include:
- Failing reproduction
  - A deterministic reproduction script or test demonstrating the bug.
  - Environment capture: OS, CPU, dependencies, feature flags, container image digest.
  - Seeds and timestamps as necessary to remove flakiness.
  - Evidence that the repro fails at the parent commit (pre-patch) and passes at the patch commit.
- Minimal test
  - A minimal unit or integration test that isolates the bug.
  - Ideally, a small property or metamorphic test if the bug is boundary-related.
  - Code coverage deltas to ensure the new test actually exercises the changed code.
- Trace and log evidence
  - Relevant logs and/or traces (OpenTelemetry spans) showing the failing path.
  - Redacted as needed to avoid secrets and PII.
  - Before/after comparison: the error span disappears or changes state in a principled way.
- Performance and safety checks
  - Performance budgets: latency/throughput or micro-benchmarks around the changed code.
  - Memory checks if applicable.
  - Safety budgets: SAST (e.g., Semgrep), linting, secret scan, license/SBOM diff, dependency risk.
  - No new high/critical issues. Documented waivers must be explicit and time-bound.
- Attestation and policy compliance
  - A machine-readable manifest tying all artifacts together.
  - CI validation via policy-as-code (OPA/Conftest) to make proofs first-class citizens.
  - Optional build attestation (Sigstore Cosign) and provenance (SLSA) for supply-chain integrity.
 
If an AI agent generates a patch but fails to produce these artifacts, the CI pipeline rejects it. If a human developer writes a fix without proof, the CI pipeline rejects it. In practice, developers quickly adopt the habit of crafting small repros and minimal tests—because it’s the fastest way to get green.
A concrete schema: proof-of-fix.json
Define a PoF manifest at the root of the PR (or in a .pof/ directory). Here’s a pragmatic schema you can adopt today.
json{ "$schema": "https://example.com/schemas/proof-of-fix.schema.json", "fix_id": "POF-2025-00123", "title": "Fix off-by-one error when paginating orders", "type": "bugfix", "labels": ["ai-assisted", "p1"], "issue": { "tracker": "github", "id": "#4821", "url": "https://github.com/acme/shop/issues/4821" }, "reproduction": { "kind": "script", "entry": "repro/repro_pagination.sh", "args": ["--page=10", "--size=50"], "seed": 1337, "env": { "TZ": "UTC", "FEATURE_FLAGS": "orders-v2" }, "container": { "image": "ghcr.io/acme/shop-ci:1.24.3", "digest": "sha256:3e7a..." }, "expected_pre_fix": "exit_code!=0", "expected_post_fix": "exit_code==0" }, "tests": { "added": [ "tests/orders/test_pagination.py::test_last_page_inclusive" ], "coverage_target": { "changed_files": ">=0.8", "overall": ">=0.75" } }, "evidence": { "logs_before": "artifacts/logs/pagination_fail.log", "logs_after": "artifacts/logs/pagination_pass.log", "traces_before": "artifacts/traces/pagination_fail.json", "traces_after": "artifacts/traces/pagination_pass.json" }, "budgets": { "perf": { "benchmarks": [ { "name": "paginate_orders_bench", "path": "benchmarks/orders_bench.py::test_paginate_bench", "max_regression": "+5%" } ] }, "safety": { "sast": { "tool": "semgrep", "max_new_findings": { "HIGH": 0, "MEDIUM": 0 } }, "secrets_scan": true, "license_check": true, "sbom_diff": true } }, "attestation": { "signed_by": "ci-bot@acme.dev", "sigstore_bundle": "artifacts/attestations/pof.sig" } }
The manifest points to the actual artifacts. Your CI pipeline will run validators (a small sketch follows the list below) to ensure:
- The repro fails on the base branch and passes on the head branch.
 - The new test fails pre-fix and passes post-fix.
 - Coverage thresholds are met.
 - Traces/logs exist and show the error transitioned from failure to success.
 - Performance and safety budgets are respected (or waivers are present and approved).
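The artifact-existence checks can be a few dozen lines of Python. Here is a minimal sketch, assuming the field names from the manifest above; the script name scripts/validate_pof.py is hypothetical, and policy-level rules stay in OPA (shown later).

```python
# scripts/validate_pof.py -- illustrative sketch of a manifest/artifact validator
# (hypothetical script name; field names follow the proof-of-fix.json schema above)
import json
import pathlib
import sys

# Evidence entries whose referenced files must exist in the PR checkout
REQUIRED_ARTIFACTS = [
    ("evidence", "logs_before"),
    ("evidence", "logs_after"),
    ("evidence", "traces_before"),
    ("evidence", "traces_after"),
]


def main(manifest_path: str = "proof-of-fix.json") -> int:
    pof = json.loads(pathlib.Path(manifest_path).read_text())
    errors = []

    if not pof.get("reproduction", {}).get("entry"):
        errors.append("missing reproduction.entry")
    if not pof.get("tests", {}).get("added"):
        errors.append("tests.added is empty")

    # Every referenced artifact must actually be present, not just named
    for section, key in REQUIRED_ARTIFACTS:
        rel = pof.get(section, {}).get(key)
        if not rel or not pathlib.Path(rel).exists():
            errors.append(f"missing artifact: {section}.{key} -> {rel}")

    for e in errors:
        print("POF VALIDATION ERROR:", e)
    return 1 if errors else 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```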
 
An end-to-end example: fixing a timezone bug
Let’s walk through a specific bug and show how the contract works in practice. Scenario: a Python service incorrectly applies timezone offsets when parsing ISO8601 timestamps without explicit timezone information, causing off-by-hours errors in analytics.
Symptoms in production:
- Sporadic miscounts when aggregating events around DST transitions.
 - Error logs noting naive datetime arithmetic.
 
Step 1: Reproduction script
We write a deterministic repro and pin the environment.
```bash
#!/usr/bin/env bash
set -euo pipefail
export TZ=America/Los_Angeles

python - <<'PY'
import sys
from mysvc.timeutil import parse_iso8601

# Bug: naive ISO8601 timestamps are interpreted in local time; policy says UTC.
# 2024-11-03T01:30:00 is an ambiguous local time at the DST fall-back boundary.
raw = '2024-11-03T01:30:00'
expected_epoch = 1730597400  # 2024-11-03 01:30:00 UTC

parsed = parse_iso8601(raw)
print('parsed:', parsed, 'epoch:', int(parsed.timestamp()))

# Pre-fix: naive -> local time, so the epoch is off by hours -> exit 1
# Post-fix: naive -> UTC, epoch matches -> exit 0
sys.exit(0 if int(parsed.timestamp()) == expected_epoch else 1)
PY
```
The script pins a DST-affected timezone, calls the service's parser, and exits nonzero when the resulting UTC epoch is wrong, so it fails on the parent commit and passes once the fix lands. We codify the same policy as a minimal test next.
Step 2: Minimal test
We add a unit test that encodes the correct behavior: treat naive timestamps as UTC (your policy may vary, but it must be consistent).
```python
# tests/test_time_parsing.py
import time

import pytest

from mysvc.timeutil import parse_iso8601


@pytest.mark.parametrize('raw, ts', [
    ('2024-11-03T01:30:00', 1730597400),  # 2024-11-03 01:30:00 UTC -> epoch
])
def test_naive_treated_as_utc(monkeypatch, raw, ts):
    # Pin a DST-affected zone so a pre-fix "naive -> local" parse visibly diverges.
    monkeypatch.setenv('TZ', 'America/Los_Angeles')
    time.tzset()  # make the TZ change effective for this process (POSIX only)
    parsed = parse_iso8601(raw)
    assert parsed.tzinfo is not None and parsed.utcoffset().total_seconds() == 0
    assert int(parsed.timestamp()) == ts
```
We also add a micro-benchmark to protect performance if the fix adds parsing overhead.
```python
# benchmarks/test_time_bench.py
from mysvc.timeutil import parse_iso8601


def test_parse_benchmark(benchmark):
    # pytest-benchmark fixture: time repeated parses of 60 timestamps
    samples = [f"2024-01-01T00:00:{i:02d}" for i in range(60)]

    def run():
        for s in samples:
            parse_iso8601(s)

    benchmark(run)
```
Step 3: The fix
We implement a robust parser that defaults naive timestamps to UTC, emits a structured log when encountering naive values, and is explicit about timezone handling.
```python
# mysvc/timeutil.py
from __future__ import annotations

import datetime as dt
import logging

log = logging.getLogger(__name__)

ISO8601_FMT = '%Y-%m-%dT%H:%M:%S'


def parse_iso8601(raw: str) -> dt.datetime:
    """Parse ISO8601 string; treat naive as UTC; preserve explicit tz.

    Raises ValueError on invalid input.
    """
    # Try fromisoformat first (fast path)
    try:
        parsed = dt.datetime.fromisoformat(raw)
    except ValueError:
        # Fallback for variants like 'Z' or fractional seconds would go here
        # (dateutil or a custom parser); keeping stdlib-only for clarity.
        raise
    if parsed.tzinfo is None:
        # Contract: naive -> UTC
        log.warning('Naive timestamp encountered; defaulting to UTC', extra={'raw': raw})
        return parsed.replace(tzinfo=dt.timezone.utc)
    return parsed.astimezone(dt.timezone.utc)
```
Step 4: Trace/log evidence
Instrument the parsing path and capture traces using OpenTelemetry so you can store a before/after trace diff.
```python
# mysvc/timeutil.py (snippet showing OTel integration)
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def parse_iso8601(raw: str) -> dt.datetime:
    with tracer.start_as_current_span('parse_iso8601') as span:
        span.set_attribute('input.raw', raw)
        ...
        if parsed.tzinfo is None:
            span.add_event('naive_timestamp_defaulted_to_utc')
        ...
        return result
```
In the bug reproducer (pre-fix), the trace includes an error or missing event; post-fix, the event is present and no error is emitted. CI archives both traces and stores them under artifacts/traces/.
Step 5: Performance and safety budgets
We attach budgets to ensure we didn’t make things worse.
- Perf: parse micro-benchmark regression no more than +5% vs baseline.
 - Safety: no new high-severity SAST findings (e.g., insecure datetime parsing is unlikely to be flagged, but we protect against newly introduced issues elsewhere in the diff), no secrets, SBOM and license unchanged for this patch.
 
Step 6: The proof-of-fix manifest
We create proof-of-fix.json referencing the repro, tests, traces, and budgets as shown earlier. CI will use this to drive enforcement.
CI enforcement: GitHub Actions example
Here’s a minimal CI pipeline that enforces the contract. You can adapt the same principles to GitLab CI, Buildkite, CircleCI, or Jenkins.
```yaml
name: Proof of Fix

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  pof-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install tooling
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov pytest-benchmark opentelemetry-sdk semgrep
          curl -L -o conftest.tar.gz https://github.com/open-policy-agent/conftest/releases/download/v0.53.0/conftest_0.53.0_Linux_x86_64.tar.gz
          tar -xzf conftest.tar.gz && sudo mv conftest /usr/local/bin/
      - name: Validate PoF manifest shape
        run: |
          test -f proof-of-fix.json || (echo 'Missing proof-of-fix.json' && exit 1)
          conftest test --policy policies/ proof-of-fix.json
      - name: Run repro on base branch
        run: |
          # The manifest and repro ship with the PR; copy the repro aside before
          # checking out the base commit, where it does not exist yet.
          REPRO=$(jq -r '.reproduction.entry' proof-of-fix.json)
          cp "$REPRO" "$RUNNER_TEMP/repro.sh"
          BASE_SHA=$(git merge-base origin/${{ github.base_ref }} HEAD)
          echo "Base: $BASE_SHA"
          git checkout $BASE_SHA
          # Assert failure (exit code != 0) on the pre-fix code
          if bash "$RUNNER_TEMP/repro.sh"; then
            echo 'Repro did not fail on base' && exit 1
          fi
      - name: Run repro on head branch
        run: |
          git checkout $GITHUB_SHA
          bash $(jq -r '.reproduction.entry' proof-of-fix.json)
      - name: Run tests and coverage
        run: |
          pytest -q --cov=mysvc --cov-report=xml
      - name: Check coverage deltas
        run: |
          python scripts/check_coverage.py --target-changed 0.8 --target-overall 0.75
      - name: Run perf benchmarks
        run: |
          mkdir -p artifacts
          pytest -q benchmarks/test_time_bench.py --benchmark-json=artifacts/bench.json
          python scripts/check_benchmark_regression.py --max-regression 5% artifacts/bench.json
      - name: Run safety checks
        run: |
          semgrep --config p/ci --sarif --output semgrep.sarif || true
          python scripts/check_semgrep_budget.py --max-new-high 0 --max-new-medium 0 semgrep.sarif
          python scripts/scan_secrets.py
          python scripts/sbom_diff.py
      - name: Archive artifacts
        uses: actions/upload-artifact@v4
        with:
          name: pof-artifacts
          path: |
            artifacts/**
            proof-of-fix.json
```
This workflow enforces reproducibility, minimal test presence, performance budgets, and safety checks. You’ll notice small helper scripts (check_coverage.py, check_benchmark_regression.py, etc.). These should be part of your repo or a shared internal tool.
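As one example, here is a hedged sketch of what check_coverage.py could look like, assuming pytest-cov's Cobertura-style coverage.xml and an origin/main merge base. It checks per-file line rates for changed files, which is a simplification of true diff-aware line coverage.

```python
# scripts/check_coverage.py -- illustrative sketch, not a drop-in tool
import argparse
import subprocess
import sys
import xml.etree.ElementTree as ET


def changed_files(base_ref: str = "origin/main") -> set[str]:
    # Assumption: the base branch is origin/main and was fetched (fetch-depth: 0)
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip().endswith(".py")}


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--coverage-xml", default="coverage.xml")
    ap.add_argument("--target-changed", type=float, default=0.8)
    ap.add_argument("--target-overall", type=float, default=0.75)
    args = ap.parse_args()

    root = ET.parse(args.coverage_xml).getroot()
    overall = float(root.get("line-rate", 0.0))

    failures = []
    if overall < args.target_overall:
        failures.append(f"overall coverage {overall:.2f} < {args.target_overall}")

    # Cobertura XML: each <class> element carries a filename and a line-rate
    changed = changed_files()
    for cls in root.iter("class"):
        fname = cls.get("filename", "")
        if any(fname.endswith(c) or c.endswith(fname) for c in changed):
            rate = float(cls.get("line-rate", 0.0))
            if rate < args.target_changed:
                failures.append(f"{fname}: line-rate {rate:.2f} < {args.target_changed}")

    for f in failures:
        print("COVERAGE BUDGET VIOLATION:", f)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```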
Policy-as-code: block PRs without proof
Instead of hardcoding logic in shell, use Open Policy Agent (OPA) with Conftest to declare the rules.
Example Rego policy (policies/pof.rego):
```rego
package main

default allow = false

pof := input

# Require bugfix or AI-assisted changes to include a PoF
requires_pof {
	pof.type == "bugfix"
}

requires_pof {
	has_label("ai-assisted")
}

has_label(l) {
	some i
	pof.labels[i] == l
}

allow {
	requires_pof
	pof.reproduction.entry != null
	count(pof.tests.added) > 0
	pof.evidence.logs_before != null
	pof.evidence.logs_after != null
	count(pof.budgets.perf.benchmarks) > 0
	pof.budgets.safety.sast.max_new_findings.HIGH == 0
}

# Conftest blocks on deny rules, so surface the failure explicitly
deny[msg] {
	requires_pof
	not allow
	msg := "bugfix/ai-assisted PR is missing a compliant proof-of-fix manifest"
}
```
Now any PR without a compliant manifest fails the policy check.
Provenance and attestation: sign the proof
To prevent forgery or tampering, sign your PoF and critical artifacts. Sigstore’s Cosign can sign blobs and attach attestations to container images and releases.
- Sign the manifest and the trace/log archives.
 - Generate an attestation that includes the build info (builder, repo, commit, timestamp) and references the artifact digests.
 - Store signatures alongside artifacts, verify them in CI before trusting evidence.
 
With SLSA provenance, you can tie the patch, the build image, and the proof artifacts together, making the trail auditable.
Handling flakiness and non-determinism
Flaky tests are poison for a PoF system, so design for determinism; a small pytest sketch follows the list below.
- Seed everything: RNGs, fuzzers, property-based tests.
 - Freeze time: use time-freezing utilities or pass monotonic clock sources; avoid datetime.now() in tests.
 - Serialize concurrency: add synchronization hooks, or use deterministic schedulers in tests.
 - Retry policy: never mask flakiness; instead, detect and quarantine. If a repro is flaky, it is not a valid PoF. Require a stabilization step (e.g., isolating the race, raising logging verbosity, or adding a controlled delay in tests only).
 - Flaky label: allow a temporary label to bypass strict checks only with a sign-off from a designated owner and a timebox. The waiver itself is a policy artifact.
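To make the seeding and time-freezing items concrete, here is a minimal pytest sketch. The mysvc.timeutil.now clock hook is a hypothetical seam, not part of the earlier example; substitute whatever indirection your code uses for "now".

```python
# tests/conftest.py -- illustrative determinism helpers (hook names are hypothetical)
import datetime as dt
import random

import pytest


@pytest.fixture(autouse=True)
def deterministic_rng():
    # Seed the stdlib RNG for every test; add library-specific seeds as needed.
    random.seed(1337)
    yield


@pytest.fixture
def frozen_clock(monkeypatch):
    # Freeze "now" by patching the code under test's clock hook instead of
    # calling datetime.now() directly in production code.
    fixed = dt.datetime(2024, 11, 3, 1, 30, tzinfo=dt.timezone.utc)
    monkeypatch.setattr("mysvc.timeutil.now", lambda: fixed, raising=False)
    return fixed
```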
 
Non-functional changes and exceptions
Not every PR is a bug fix. For refactors, docs, or dependency bumps, you can:
- Route by label: pof-required for bugfix and ai-assisted; pof-optional for docs/refactor. Apply different policies for each.
 - For dependency upgrades labeled “security”, require SBOM diff and vulnerability delta checks, even if no bug repro exists.
 - For refactors, require a baseline test suite to remain green and coverage not to drop; no PoF repro required.
 
CI should make the path of least resistance the correct one: if a developer marks a PR as bugfix, the template should scaffold the repro, test, and manifest automatically. If it’s docs-only, the template omits PoF sections.
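The scaffolding itself can be a short script invoked by the PR template or a bot. This is an illustrative sketch (scripts/scaffold_pof.py is a hypothetical name) that writes a stub manifest matching the schema shown earlier.

```python
# scripts/scaffold_pof.py -- hypothetical scaffolder; writes a stub PoF manifest
import json
import pathlib
import sys

STUB = {
    "fix_id": "POF-TODO",
    "type": "bugfix",
    "labels": [],
    "reproduction": {
        "kind": "script",
        "entry": "repro/repro_TODO.sh",
        "expected_pre_fix": "exit_code!=0",
        "expected_post_fix": "exit_code==0",
    },
    "tests": {"added": []},
    "evidence": {"logs_before": None, "logs_after": None},
    "budgets": {
        "perf": {"benchmarks": []},
        "safety": {"sast": {"tool": "semgrep", "max_new_findings": {"HIGH": 0}}},
    },
}


def main(path: str = "proof-of-fix.json") -> int:
    target = pathlib.Path(path)
    if target.exists():
        print(f"{path} already exists; not overwriting")
        return 0
    target.write_text(json.dumps(STUB, indent=2) + "\n")
    # A PR bot would generate repro and test stubs the same way.
    print(f"wrote stub {path}; fill in the repro, tests, and evidence paths")
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```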
Encouraging minimal tests and constrained diffs
A PoF system is most powerful when the regression test is small and targeted.
 - Enforce diff locality: if the PR touches 20 files but the bug lives in one, ask for justification. Tools like Danger or custom bots can flag unusually broad diffs for bugfix PRs (see the sketch after this list).
 - Require failing-but-minimal tests: tests should import and call the smallest unit that reproduces the bug. Integration tests are fine, but use them when necessary, not by default.
 - Prefer metamorphic and property-based tests for boundary bugs: for example, “parsing then formatting yields the same canonical string,” or “aggregations are associative within tolerance.”
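A diff-locality check does not need a full bot. Here is a minimal sketch, assuming an origin/main base and a per-repo file-count threshold (both are assumptions to tune locally).

```python
# scripts/check_diff_locality.py -- sketch; flags unusually broad bugfix diffs
import subprocess
import sys

MAX_FILES_FOR_BUGFIX = 5  # assumption: tune per repository


def main(base_ref: str = "origin/main") -> int:
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    files = [f for f in out.splitlines() if f.strip()]
    if len(files) > MAX_FILES_FOR_BUGFIX:
        print(f"Bugfix PR touches {len(files)} files; please justify the breadth in the PR description.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```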
 
Property-based testing can dramatically increase confidence in AI-generated fixes by exploring edge cases the developer didn’t think of. Use Hypothesis (Python), quickcheck (Haskell/Rust variants), jqwik (Java), or fast-check (JS).
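For the timezone example above, a property-based test might look like this sketch, assuming Hypothesis is installed and the naive-means-UTC policy from earlier.

```python
# tests/test_time_parsing_properties.py -- sketch assuming Hypothesis is available
import datetime as dt

from hypothesis import given, strategies as st

from mysvc.timeutil import parse_iso8601


# Metamorphic property: formatting a UTC datetime and parsing it back is the identity.
@given(st.datetimes(timezones=st.just(dt.timezone.utc)))
def test_format_parse_roundtrip(d):
    assert parse_iso8601(d.isoformat()) == d


# Naive inputs: parsing must always yield an aware UTC value, whatever the wall time.
@given(st.datetimes())
def test_naive_always_becomes_utc(d):
    parsed = parse_iso8601(d.isoformat())
    assert parsed.utcoffset() == dt.timedelta(0)
```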
Telemetry as evidence: OpenTelemetry patterns
Traces are underused as regression evidence. PoF makes them central:
- Create a span around the failing function.
 - Attach attributes corresponding to inputs, code paths, and feature flags.
 - When a bug manifests, add an event or record_exception with a structured error.
 - After the fix, the span should not record_exception for the same scenario, and an event like bug_fixed_path should be present.
 - Export to JSON for CI; avoid dependence on a live collector.
 
Example: export spans with the console exporter, capture them under artifacts/traces/, and compare span names/events across before/after.
```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "mysvc"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```
In CI, capture the console-exported spans to a file and diff them.
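The diffing step can stay simple. Here is a sketch, assuming the captured spans have been normalized to one JSON object per line with name and events fields (the script name and input format are assumptions, not an OpenTelemetry feature).

```python
# scripts/diff_trace_events.py -- sketch; compares span events before/after a fix
import json
import sys


def load_events(path: str) -> set[tuple[str, str]]:
    """Return (span name, event name) pairs from a JSON-lines span dump."""
    pairs = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            span = json.loads(line)
            for event in span.get("events", []):
                pairs.add((span.get("name", ""), event.get("name", "")))
    return pairs


def main(before: str, after: str) -> int:
    before_events, after_events = load_events(before), load_events(after)
    # The PoF claim: the failure event disappears and the fixed-path event appears.
    print("events only before fix:", sorted(before_events - after_events))
    print("events only after fix:", sorted(after_events - before_events))
    if "exception" in {event for _, event in after_events}:
        print("ERROR: post-fix trace still records an exception")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```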
Performance budgets that matter
Performance regressions are easy to hide in “fixes.” Attach budgets that scale with the scope of change:
- Micro-bench nearest hot path: a direct benchmark of the changed function or method.
 - Macro-bench endpoint or job-level metrics if the change affects request/response flows.
 - Define baseline storage: persist benchmark summaries from main as JSON; compare percent deltas per test name.
 - Use P95/P99 thresholds for macro-benchmarks to avoid missing tail regressions.
 - For CPU, memory, and allocations: language-specific tools (go test -bench and -benchmem, pytest-benchmark with statistics, JMH for JVM).
 
A simple, effective rule: For bugfix PRs, any benchmark named *_bench must not regress by more than +5% unless a waiver is included with an owner and an expiry date.
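A hedged sketch of check_benchmark_regression.py under that rule, assuming pytest-benchmark's --benchmark-json output and a baseline summary exported from main (the baseline path is an assumption):

```python
# scripts/check_benchmark_regression.py -- sketch; compares pytest-benchmark output
# against a baseline exported from main (baseline path is an assumption)
import argparse
import json
import sys


def mean_by_name(path: str) -> dict[str, float]:
    with open(path) as f:
        data = json.load(f)
    # pytest-benchmark --benchmark-json output: a "benchmarks" list with per-test stats
    return {b["name"]: b["stats"]["mean"] for b in data["benchmarks"]}


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("current", help="pytest-benchmark JSON from this PR")
    ap.add_argument("--baseline", default="benchmarks/baseline.json")
    ap.add_argument("--max-regression", default="5%")
    args = ap.parse_args()
    limit = float(args.max_regression.rstrip("%")) / 100.0

    baseline, current = mean_by_name(args.baseline), mean_by_name(args.current)
    failures = []
    for name, cur in current.items():
        base = baseline.get(name)
        if base is None:
            continue  # new benchmark: nothing to compare against yet
        delta = (cur - base) / base
        if delta > limit:
            failures.append(f"{name}: {delta:+.1%} slower than baseline (limit {args.max_regression})")

    for f in failures:
        print("PERF BUDGET VIOLATION:", f)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```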
Safety budgets: keep your “fix” from being a backdoor
Bugfixes can smuggle new risks. Enforce:
- SAST: run Semgrep, ESLint, Bandit, go vet, FindSecBugs, etc., and ensure no new high/critical findings. Track diff of findings, not absolute counts.
 - Secrets scanning: block any new secrets or credentials; redaction in logs/gists.
 - SBOM and dependency deltas: generate SBOM (CycloneDX, SPDX), diff dependencies, run vulnerability scans (e.g., OSV, Snyk, Trivy). No new high/critical; if upgraded to fix a CVE, verify the CVE is no longer applicable.
 - Supply-chain checks: verify container base image digests and builders.
 
These checks are not unique to PoF, but by bundling them under a proof contract, you avoid the common trap of passing the unit tests while failing the overall change quality bar.
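Diffing findings rather than counting them is the key move. Here is a sketch of what check_semgrep_budget.py could look like, assuming a baseline SARIF report exported from main and that high-severity Semgrep findings surface as SARIF level "error" (hedge: verify the mapping for your Semgrep version).

```python
# scripts/check_semgrep_budget.py -- sketch; diffs SARIF findings against a baseline
# (the baseline path is an assumption; severities are read from the SARIF "level")
import argparse
import json
import pathlib
import sys


def findings(path: str) -> set[tuple[str, str, str]]:
    """Return (ruleId, level, file) triples from a SARIF report, if it exists."""
    if not pathlib.Path(path).exists():
        return set()
    sarif = json.loads(pathlib.Path(path).read_text())
    out = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = (result.get("locations") or [{}])[0].get("physicalLocation", {})
            uri = loc.get("artifactLocation", {}).get("uri", "")
            out.add((result.get("ruleId", ""), result.get("level", ""), uri))
    return out


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("current", help="SARIF report for this PR")
    ap.add_argument("--baseline", default="artifacts/semgrep-baseline.sarif")
    ap.add_argument("--max-new-high", type=int, default=0)
    ap.add_argument("--max-new-medium", type=int, default=0)
    args = ap.parse_args()

    new = findings(args.current) - findings(args.baseline)
    new_high = [f for f in new if f[1] == "error"]
    new_medium = [f for f in new if f[1] == "warning"]
    for rule, level, uri in sorted(new):
        print(f"new finding: {rule} ({level}) in {uri}")
    if len(new_high) > args.max_new_high or len(new_medium) > args.max_new_medium:
        print("SAFETY BUDGET VIOLATION: new findings exceed the allowed budget")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```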
Integrating with PR templates and bots
Make the process discoverable and low-friction:
- PR template sections: "Reproduction", "Minimal Test", "Evidence", "Budgets", "Manifest". The template can generate a stub proof-of-fix.json and sample scripts.
 - Bot assists: a bot (or your AI agent) can populate the manifest, spin up a sandbox container for the repro, export traces, and scaffold tests.
 - Labels control strictness: ai-assisted or bugfix automatically enables PoF enforcement. docs-only disables it.
 - Commit convention: require Conventional Commits style (fix: ...) to trigger PoF; chore: ... bypasses it.
 
Adopting PoF without ruining velocity: a path
- Week 1–2: Introduce the PR template and minimal schema. Start gathering repros and tests manually; do not block merges yet.
 - Week 3–4: Enforce manifest presence for ai-assisted PRs. Add coverage and SAST budgets. Still soft-fail on perf.
 - Week 5–6: Enforce repro pre/post checks and minimal test presence for all bugfix PRs. Add trace/log artifacts.
 - Week 7+: Enforce perf and SBOM budgets; introduce attestations; track metrics.
 
Measure:
- Fix Closure Reopen Rate (FCRR): percent of fixes reopened within 30 days; target a 50% reduction.
 - Proof Coverage: percent of bugfix PRs with a valid PoF; target >90%.
 - Mean Time to Diagnose (MTTD): time from issue open to reproducible failure captured; target aggressive cuts.
 - Flaky Rate: percent of PoFs rejected due to flakiness; drive to near zero via stabilization work.
 
A Go example: proving a data race fix
Suppose an AI proposes adding a mutex to fix a data race. Without proof, you may still have a race elsewhere.
Repro (uses -race and a deterministic seed):
```bash
#!/usr/bin/env bash
set -euo pipefail
export GORACE='halt_on_error=1'
go test ./pkg/cache -race -run TestConcurrentSetGet -count=1
```
Minimal test:
```go
// pkg/cache/cache_test.go
package cache

import (
	"fmt"
	"sync"
	"testing"
)

func TestConcurrentSetGet(t *testing.T) {
	cache := NewCache()
	const N = 1000
	var wg sync.WaitGroup
	for i := 0; i < N; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			cache.Set(fmt.Sprintf("k%d", i), i)
			_ = cache.Get(fmt.Sprintf("k%d", i))
		}(i)
	}
	wg.Wait()
}
```
Fix:
```go
// pkg/cache/cache.go
package cache

import "sync"

type Cache struct {
	mu sync.RWMutex
	m  map[string]any
}

func NewCache() *Cache {
	return &Cache{m: make(map[string]any)}
}

func (c *Cache) Set(k string, v any) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = v
}

func (c *Cache) Get(k string) any {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[k]
}
```
Budgets:
- Perf: Get and Set should not regress more than 10% in micro-bench; enforced via go test -bench with -benchmem.
 - Safety: go vet and staticcheck must show no new issues.
 
PoF ensures the race is actually gone (repro fails pre-fix, passes post-fix) and you haven’t introduced contention that tanks latency.
JavaScript example: blocking a hallucinated DOM fix
An LLM proposes: "Add setTimeout(0) before reading offsetWidth to avoid layout thrash." Without measurement, this is folklore.
- Repro: Jest + jsdom test that captures layout reads/writes order; failing assertion pre-fix.
 - Evidence: Performance.now() timestamps; a synthetic benchmark around the affected component.
 - Budget: no more than +2 ms added per render on average in the benchmark. Linting: eslint-plugin-react-hooks rules remain clean.
 
If the fix doesn’t reduce forced reflows or violates perf budget, CI blocks the PR.
Practical concerns and answers
- Isn’t this too heavy? For critical repos, it pays for itself. Automate the scaffolding and keep manifests terse.
 - What about emergency hotfixes? Allow an emergency label that bypasses some budgets, but require a follow-up PoF within 24–48 hours. Track SLA.
 - Won’t developers game the system? That’s why you validate pre/post behavior and coverage against changed lines, not just the presence of tests. Use diff-aware coverage tools.
 - How do we avoid sensitive data in logs/traces? Build redaction into logging and serializer layers. Reject artifacts that include secrets; require rotating keys if leakage is detected.
 - Can AI generate the PoF? Yes—if instrumented correctly. The AI agent should run the repro on base and head, collect artifacts, and fill the manifest. But the CI must verify independently.
 
Tooling suggestions
- Manifest and policy: Open Policy Agent (OPA) + Conftest.
 - Repro containers: Docker + pinned digests; devcontainers for local reproduction.
 - Tracing: OpenTelemetry SDKs; console/file exporters.
 - Benchmarks: pytest-benchmark, Go testing, JMH.
 - Safety: Semgrep, Bandit, ESLint, Trivy, OSV scanner, Syft/Grype for SBOM.
 - Attestations: Sigstore Cosign; SLSA provenance.
 - Bots: Danger, Probot, or internal bots to enforce labels and templates.
 
A 10-point checklist to operationalize PoF
- Add a PR template that prompts for repro, test, evidence, budgets.
 - Add proof-of-fix.json to the template with a minimal schema.
 - Add an OPA policy to require PoF for bugfix and ai-assisted labels.
 - Teach your CI to run repro on base and head; assert fail-passes.
 - Require at least one minimal regression test; diff-aware coverage must increase over changed lines.
 - Capture and archive traces/logs with simple exporters.
 - Add a micro-benchmark and a small perf budget by default.
 - Enforce SAST, secrets, SBOM diffs as budgets with waivers only by owners.
 - Sign artifacts and manifests with Cosign; verify in CI.
 - Track metrics (Proof Coverage, Reopen Rate, Flaky Rate) and publish weekly.
 
Counterarguments and why they don’t hold
- "We already have tests." Tests are necessary but not sufficient; they often don’t encode the specific failure. PoF makes the specific failure reproducible and persistent.
 - "AI patches save time; this adds latency." The time you save on the first pass is lost in reopens and incidents. With a scaffolded PoF, the incremental cost is minutes.
 - "Developers shouldn’t need to write a manifest." True, which is why templates and bots generate it. The point is machine-verifiable evidence, not human busywork.
 
Conclusion: no proof, no merge
AI will keep getting better at writing code. We should get better at demanding evidence that code actually fixes bugs. A Proof‑of‑Fix Contract moves bug fixes—human or AI—from hand-wavy confidence to concrete, repeatable proof.
Make it your CI’s job to ask the hard questions:
- Where’s the failing repro?
 - Where’s the minimal test that fails before and passes after?
 - Where’s the trace/log evidence?
 - Did we stay within performance and safety budgets?
 
If the PR can’t answer, it doesn’t merge. That’s not gatekeeping—it’s engineering.
Adopt PoF, and you’ll see fewer reopens, faster diagnosis, more trustworthy AI output, and a team culture that values truth over confidence. Proof scales. Let’s make our debugging AI prove it.
