Proof-Carrying Patches: Making Your code debug ai Prove Its Fixes Before You Merge
There is a widening gap between the rate at which AI can propose code changes and the rate at which engineering teams can review, validate, and safely merge them. Code agents can open hundreds of pull requests in minutes, but production-grade fixes still require reproducibility, test evidence, and supply-chain assurances. That mismatch invites brittle patches, subtle regressions, and audit nightmares.
We need a higher standard: proof-carrying patches.
Borrowing from proof-carrying code (George C. Necula, 1997), a proof-carrying patch is a change that arrives with machine-checkable evidence that it (a) reproduces the failure, (b) deterministically validates the fix, (c) minimizes behavioral surface area, and (d) declares verifiable provenance and security attestations. In this model, your code debug ai does not merely claim the fix works; it must prove it, in forms your CI can verify autonomously. Only then is auto-merge even on the table.
This article lays out a concrete, production-ready blueprint for such a pipeline: end-to-end artifacts, a standard manifest, deterministic replays, AST-aware diff minimization, property- and fuzz-based validation, and cryptographically verifiable supply-chain attestations. The result is an opinionated, auditable flow that turns AI patches from a trust liability into a durable asset.
Why Proof-Carrying Patches
AI-generated patches fail in predictable ways:
- They pass on one machine but fail elsewhere due to nondeterminism or undeclared environment assumptions.
- They introduce regressions far from the original bug due to broad edits or changes in shared utilities.
- They smuggle in new dependencies, licenses, or build steps without proper documentation.
- They cannot be audited because we lack a clear provenance chain: who or what produced the patch, from which inputs, with which tools.
A proof-carrying patch inverts the burden of proof. The patch must carry the evidence your system demands, in standardized, verifiable formats, before your policy engine even considers auto-merge. Maintainers can still review and veto, but routine cases become push-button and high-confidence.
Core Principles
- Reproducibility first: If it isn’t deterministic, it’s not evidence.
- Evidence over persuasion: Attach checks that CI can validate, not arguments in the PR description.
- Minimality: A smaller, intent-focused diff reduces blast radius and audit cost.
- Explicit provenance: Cryptographic attestations bind patch, tests, and environment to identity and time.
- Policy-driven automation: Merge gates should be machine-evaluable, transparent, and amendable.
Architecture Overview
The pipeline involves six roles/services:
- AI code agent (the code debug ai): Generates candidate fixes, failing tests, and the proof bundle.
- Hermetic build/test runner: Ensures determinism and captures record/replay artifacts.
- Minimizer/differ: Reduces the patch to the smallest safe change, ideally AST-aware.
- Attestation service: Produces, signs, and stores provenance attestations (SLSA, in-toto) with a transparency log.
- Policy engine: Evaluates the evidence against risk thresholds, coverage gates, static analysis, and security posture.
- VCS integration: Links artifacts to the pull request and enforces merge conditions.
A typical flow:
- The agent receives a bug report and repository pointer.
- It reproduces the failure in a hermetic environment and records deterministic traces.
- It proposes a patch and its failing test, then iteratively minimizes the diff while keeping the tests passing.
- It runs coverage, fuzzing (time-boxed), and static checks, packaging results.
- It signs and uploads attestations to a transparency log and posts a PR with a manifest referencing those artifacts.
- CI retrieves, verifies, and enforces policy; if all gates pass and risk is low, auto-merge triggers.
The Patch Proof Package (PPP)
Standardize the artifact. A PPP is a directory (or OCI artifact) with:
- manifest: Summary metadata; links/digests for all artifacts; agent identity; versions.
- test: The failing test(s) that reproduce the issue pre-patch; the test passes post-patch.
- replay: Deterministic record/replay traces or a deterministic rerun recipe with fixed seeds.
- patch: The minimal diff (text and AST diff), plus rule-based justification for each change hunk.
- validation: Coverage deltas, fuzzing results, symbolic or property proofs where applicable.
- sbom: Component inventory for any newly introduced or upgraded dependencies.
- provenance: SLSA provenance, in-toto attestations, DSSE envelopes, and transparency-log references.
- signatures: Sigstore/cosign signatures for artifacts and envelope digests.
A minimal manifest sketch (YAML to avoid quote noise):
```yaml
apiVersion: ppp.dev/v0
kind: PatchProofPackage
metadata:
  repo: git.example.com/org/project
  pr: 1234
  commit_base: a1b2c3d
  commit_patch: f9e8d7c
  created_at: 2026-01-22T12:34:56Z
  agent:
    name: code-debug-ai
    version: 0.7.3
    identity: spiffe://ci/agents/debug-ai
    model: 'glm-4.1-code'
artifacts:
  failing_tests:
    - path: tests/regressions/T1234_repro.py
      digest: sha256:...
  replay:
    - type: rr
      path: artifacts/rr/trace.tar.zst
      digest: sha256:...
    - type: env_recipe
      path: artifacts/env/recipe.nix
      digest: sha256:...
  patch:
    - path: patches/fix.diff
      digest: sha256:...
    - path: patches/fix.astdiff
      digest: sha256:...
  validation:
    coverage:
      report: artifacts/coverage/coverage.lcov
      baseline: artifacts/coverage/baseline.lcov
    fuzz:
      summary: artifacts/fuzz/summary.json
      corpus: artifacts/fuzz/corpus.zip
    static:
      codeql: artifacts/static/codeql.sarif
      semgrep: artifacts/static/semgrep.sarif
sbom:
  - path: sbom/spdx.json
    digest: sha256:...
provenance:
  slsa: attestations/slsa.dsse
  intoto: attestations/supply-chain.intoto.jsonl
  transparency_log:
    rekor_entry: https://rekor.tlog.dev/entries/sha256:...
signatures:
  - path: signatures/ppp.dsse.sig
    cert: signatures/ppp.sigstore.cert
    chain: signatures/fulcio.chain
policy:
  risk_score: 0.21
  requires:
    min_coverage_delta: +0.5%
    fuzz_minutes: 10
    zero_new_critical_findings: true
```
Everything in the manifest must be content-addressed (digests) and covered by attestations/signatures. The PPP itself can be published as an OCI artifact (via ORAS) to your registry and linked in the PR.
Implementing Each Evidence Class
1) Failing Tests First
The non-negotiable contract: the agent must produce a failing test that reproduces the bug on the base commit, and that test must pass on the patched commit. This can be:
- A new unit/integration test
- A property-based test that captures the violated invariant
- A regression snapshot with a minimal fixture
Example in Python using Hypothesis to lock down a serialization bug:
```python
# tests/regressions/test_lenient_json_roundtrip.py
import json

from hypothesis import given, strategies as st


@given(st.one_of(st.text(), st.integers(), st.booleans(), st.none()))
def test_roundtrip_preserves_value(x):
    # Reproducer: bug was that None became 'null' string
    payload = {'v': x}
    encoded = json.dumps(payload, separators=(',', ':'), ensure_ascii=False)
    decoded = json.loads(encoded)
    assert decoded == payload
```
The agent includes a minimal fixture and points to the issue reference in the test docstring. Two runs are mandatory in CI: test fails before patch, passes after.
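In CI, this fail-then-pass contract can be enforced by a small driver script. Below is a minimal sketch, assuming pytest and a plain git checkout; the script name and argument order are illustrative, not part of any PPP specification:

```python
# verify_repro.py -- minimal sketch of the fail-then-pass contract.
# Hypothetical driver: adapt the test command to your runner.
import subprocess
import sys

def run_test(commit: str, test_path: str) -> int:
    subprocess.run(["git", "checkout", "--quiet", commit], check=True)
    # pytest exits 0 on pass, non-zero on failure
    return subprocess.run(["pytest", "-x", "-q", test_path]).returncode

def main(base: str, patched: str, test_path: str) -> None:
    # The repro test must fail on the base commit...
    if run_test(base, test_path) == 0:
        sys.exit("repro test unexpectedly PASSED on base commit")
    # ...and pass on the patched commit.
    if run_test(patched, test_path) != 0:
        sys.exit("repro test FAILED on patched commit")
    print("fail-then-pass contract holds")

if __name__ == "__main__":
    main(*sys.argv[1:4])
```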
For systems where reproducing the failure requires external state (network, clock, locale), move that state into mocks or hermetic fixtures. Enforce a no-network rule to ensure the test is self-contained.
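At the language level, one lightweight way to enforce the no-network rule is a test fixture that refuses socket creation. A pytest sketch (the fixture and exception names are hypothetical; pair it with an OS-level block such as unshare -n in CI for defense in depth):

```python
# conftest.py -- sketch of a no-network guard for the repro suite.
import socket

import pytest

class NetworkBlockedError(RuntimeError):
    """Raised when a test tries to open a real socket."""

@pytest.fixture(autouse=True)
def block_network(monkeypatch):
    def guarded_socket(*args, **kwargs):
        raise NetworkBlockedError("test attempted a network connection")
    # Coarse by design: blocks all socket creation, including AF_UNIX.
    monkeypatch.setattr(socket, "socket", guarded_socket)
```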
2) Deterministic Replays
Determinism is the anchor that turns a passing test into proof. There are several approaches:
- rr (Linux) records user-space execution and enables bitwise-repeatable replays for C/C++/Rust binaries. Store rr traces as compressed artifacts and verify replay during CI.
- Hermetic builds: Use Nix, Bazel, or Guix to declare inputs and eliminate ambient dependencies. Pin toolchains and container base layers.
- Time and randomness control: Freeze time (e.g., via time namespace, LD_PRELOAD, or language-specific clock abstractions) and seed RNGs.
- Network and filesystem mocking: Block real network; mount read-only fixtures; ephemeral scratch directories.
A minimal harness for a Go service that pins time-related inputs and blocks the network:
```bash
# run_deterministic.sh
set -euo pipefail
export TZ=UTC
export SOURCE_DATE_EPOCH=1700000000
export GODEBUG='randautoseed=0'

# Block network using unshare (Linux), drop NET capability
unshare -n bash -c '
  go test ./... -run TestRepro -count=1 -failfast
'
```
For rr on a Rust binary:
```bash
# Build with deterministic flags
RUSTFLAGS='-C debuginfo=2 -C link-arg=-Wl,--build-id=sha1' cargo build --release
rr record ./target/release/mybin --repro-args

# Save the trace
rr pack
tar -C ~/.local/share/rr -caf rr_trace.tar.zst mybin-*
```
CI should replay the trace on the base and patched commits to validate equivalence of the deterministic envelope while observing the different outcomes at the failing assertion.
3) Minimal Diffs, Preferably AST-Aware
A minimal change surface reduces regression risk and eases review. Textual diff minimization (delta debugging) is good; AST-aware minimization is better.
Recommended tools and methods:
- GumTree or Difftastic for AST-aware diffs across languages.
- Delta debugging: iteratively remove hunks while tests pass.
- Heuristics: prefer local edits, avoid cross-cutting refactors, preserve API signatures, and ban opportunistic style changes.
Pseudo-process:
```text
1. Generate candidate patch P0.
2. While tests pass:
   - Try removing non-essential hunks.
   - Re-run failing test set and impacted tests.
   - Keep smallest Pk that preserves green.
3. Emit both textual diff and AST diff with justification notes per hunk.
```
The AST diff should label node types (e.g., IfStatement predicate narrowed) and reference static rules (e.g., side effects unchanged). This justifies minimality in a machine-readable way.
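A sketch of the minimization loop above, with patch application and test execution abstracted behind a caller-supplied callback (both helpers are hypothetical). Production minimizers such as ddmin partition hunks more cleverly than this greedy one-at-a-time pass:

```python
# minimize.py -- greedy hunk-removal sketch (textual delta debugging).
from typing import Callable, List

def minimize_hunks(hunks: List[str],
                   passes: Callable[[List[str]], bool]) -> List[str]:
    """Return a locally minimal subset of hunks that keeps tests green.

    `passes` applies the given hunks to the base commit and runs the
    failing test set plus impacted tests.
    """
    assert passes(hunks), "full candidate patch must be green first"
    keep = list(hunks)
    changed = True
    while changed:
        changed = False
        for h in list(keep):
            trial = [x for x in keep if x is not h]
            if passes(trial):  # hunk h was not essential
                keep = trial
                changed = True
    return keep
```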
4) Supply-Chain Attestations
To make the patch auditable and secure:
- SLSA provenance: Records who built what, from which sources, with which builder, producing which outputs. slsa.dev provides schemas.
- in-toto layout: Declares steps of the pipeline and materials/products per step, with cryptographic bindings.
- Sigstore (cosign, Fulcio, Rekor): Sign attestations and artifacts with short-lived certificates anchored to workload identity (e.g., GitHub OIDC), and publish to a transparency log.
- SBOM (SPDX or CycloneDX): Enumerate dependencies introduced or modified by the patch.
Example cosign attestation commands (conceptual):
```bash
# Sign the PPP as an OCI artifact
oras push ghcr.io/org/project/ppp:pr-1234 ./ppp/
cosign attest --predicate attestations/slsa.json --type slsaprovenance ghcr.io/org/project/ppp:pr-1234
cosign sign ghcr.io/org/project/ppp:pr-1234
```
CI verification must:
- Fetch transparency log entries from Rekor
- Validate Fulcio certs and OIDC identity claims (subject, audience, repo, workflow)
- Confirm subject digests match the PPP contents (a digest-check sketch follows this list)
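The digest check is easy to make concrete. A sketch, assuming the manifest layout shown earlier and PyYAML available in the CI image:

```python
# check_digests.py -- confirm manifest digests match PPP contents.
import hashlib
import pathlib
import sys

import yaml  # PyYAML

def sha256_of(path: pathlib.Path) -> str:
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()

def iter_entries(node):
    # Walk the manifest tree, yielding every {path, digest} pair.
    if isinstance(node, dict):
        if "path" in node and "digest" in node:
            yield node["path"], node["digest"]
        for value in node.values():
            yield from iter_entries(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_entries(value)

def main(ppp_dir: str) -> None:
    root = pathlib.Path(ppp_dir)
    manifest = yaml.safe_load((root / "manifest.yaml").read_text())
    for rel_path, digest in iter_entries(manifest):
        if sha256_of(root / rel_path) != digest:
            sys.exit(f"digest mismatch for {rel_path}")
    print("all manifest digests verified")

if __name__ == "__main__":
    main(sys.argv[1])
```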
Beyond Unit Tests: Multi-Angle Validation
AI fixes often meet the immediate assertion but miss system properties. The PPP should include:
- Coverage delta: The new failing test must be covered, and overall line/branch coverage should not drop. For riskier areas, require a small increase (a coverage-gate sketch follows this list).
- Fuzzing burst: Run 5–15 minutes of fuzzing on the changed code paths with a preserved corpus. AFL++/libFuzzer/Honggfuzz for native; Jazzer for JVM; Hypothesis for Python.
- Static analysis: CodeQL, Semgrep, or language-specific linters to ensure no new taint routes, dangerous sinks, or concurrency hazards.
- Optional formal checks: Use an SMT solver (e.g., via Liquid types, Dafny, or KLEE) or model checks (TLA+) for protocols and concurrency if the change touches those domains.
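For the coverage gate, the delta can be computed directly from lcov summary records (LH = lines hit, LF = lines found, per file). A minimal sketch:

```python
# coverage_gate.py -- line-coverage delta from two lcov reports.
import sys

def line_coverage(lcov_path: str) -> float:
    hit = found = 0
    with open(lcov_path) as f:
        for line in f:
            if line.startswith("LH:"):
                hit += int(line[3:])
            elif line.startswith("LF:"):
                found += int(line[3:])
    return 100.0 * hit / found if found else 0.0

if __name__ == "__main__":
    baseline, report = sys.argv[1], sys.argv[2]
    delta = line_coverage(report) - line_coverage(baseline)
    print(f"coverage delta: {delta:+.2f}%")
    if delta < 0:
        sys.exit("coverage decreased; PPP gate fails")
```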
Example Semgrep rule snippet that the PPP can include to prove no new risky pattern was introduced:
```yaml
rules:
  - id: no-unsafe-shell
    patterns:
      - pattern: os.system($CMD)
    message: Avoid os.system; use subprocess.run with shell=False
    severity: ERROR
    languages: [python]
```
A CodeQL query to ensure new string inputs do not flow into eval-like sinks could be included, with SARIF outputs attached to the PPP.
Policy Engine: Enforce, Don’t Suggest
Use OPA (Open Policy Agent) with Rego to write merge-gating policies that evaluate PPP manifests and attestations. Policies should be simple to audit and test.
A simplified Rego policy (illustrative) that enforces core PPP invariants:
```rego
package ppp.policy

default allow = false

ppp := input.ppp

# Ensure signatures exist and a transparency log entry is present
require_signatures {
    count(ppp.provenance.transparency_log) > 0
    count(ppp.signatures) > 0
}

# Basic risk gating
ok_risk {
    ppp.policy.risk_score <= 0.3
}

# Coverage must not decrease
coverage_ok {
    delta := input.metrics.coverage_delta
    delta >= 0
}

# No critical findings in static analysis
static_ok {
    input.metrics.static.critical_findings == 0
}

allow {
    require_signatures
    ok_risk
    coverage_ok
    static_ok
}
```
This policy would run in CI via conftest against a JSON produced by a verifier that combines the PPP manifest, computed metrics, and attestation statuses.
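That verifier can be a small script. A sketch that assembles the policy input, assuming the manifest layout above and PyYAML; the script name, arguments, and the attestations_verified field are illustrative:

```python
# build_policy_input.py -- combine manifest, metrics, and attestation
# status into the JSON document conftest evaluates.
import json
import sys

import yaml  # PyYAML

def main(manifest_path: str, metrics_path: str, attest_ok: str) -> None:
    with open(manifest_path) as m:
        manifest = yaml.safe_load(m)
    with open(metrics_path) as f:
        metrics = json.load(f)
    doc = {
        "ppp": manifest,
        "metrics": metrics,
        "attestations_verified": attest_ok == "true",
    }
    json.dump(doc, sys.stdout, indent=2)

if __name__ == "__main__":
    main(*sys.argv[1:4])
```

Writing the combined document to a file and passing that to conftest keeps the policy input a single auditable artifact.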
CI/CD Wiring: GitHub Actions Example
A GitHub Actions workflow that ties it together:
```yaml
name: proof-carrying-patch

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  verify-ppp:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # for Sigstore OIDC
      contents: read
      packages: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Download PPP artifact
        run: |
          # Agent posts a link or artifact name in PR body or labels
          oras pull ghcr.io/org/project/ppp:pr-${{ github.event.number }} -o ppp

      - name: Verify signatures and Rekor
        run: |
          cosign verify-attestation --type slsaprovenance ghcr.io/org/project/ppp:pr-${{ github.event.number }}
          cosign verify ghcr.io/org/project/ppp:pr-${{ github.event.number }}

      - name: Verify deterministic replay on base and patch
        run: |
          yq -r '.artifacts.replay[0].path' ppp/manifest.yaml
          # Script would:
          # 1) checkout base
          # 2) run rr replay and confirm the failing test fails
          # 3) checkout head
          # 4) rr replay or deterministic rerun and confirm passing

      - name: Run tests and compute metrics
        run: |
          ./ci/run_tests.sh --ppp ppp
          ./ci/compute_metrics.sh --ppp ppp > metrics.json

      - name: Policy check
        uses: open-policy-agent/conftest-action@v0
        with:
          policy: ./policy
          input: metrics.json

      - name: Auto-merge if allowed
        if: success()
        run: |
          gh pr merge ${{ github.event.number }} --merge --admin
```
For GitLab, a similar flow can be implemented with OIDC-based Sigstore integration and a project-level policy.
Data Model and Storage
- Store PPPs as OCI artifacts with content-addressable layers. This avoids bloating your git repository and leverages registry access control and caching.
- The manifest contains digests; verify them before any use.
- All DSSE envelopes and signatures are included in the PPP and discoverable via the transparency log.
- Link the PPP digest in the PR description and as a status check URL. Optionally mirror metadata in an internal catalog (e.g., an evidence ledger) for compliance.
Handling Nondeterminism and Flakes
Nondeterminism is unavoidable in some systems, but you can corral it:
- Time: Freeze via language adapters or syscall interposition. E.g., Python freezegun, Java Clock injection, Go time.Now wrapping (a Python sketch follows this list).
- Randomness: Inject deterministic seeds; verify seeds are read from env variables and not regenerated implicitly.
- Concurrency: Record scheduling with tools like rr or run with a determinizer (e.g., Loom for Java testing scenarios). For Go, reduce race windows with GOMAXPROCS=1 during tests and run race detector.
- Network: Replace live calls with vector clocks + golden pcap fixtures; use eBPF to block outbound connections in CI.
- Filesystem: Use container volumes with read-only mounts for fixtures and tmpfs for scratch.
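A sketch combining the time and randomness controls for a Python suite, assuming freezegun is installed; PPP_TEST_SEED is a hypothetical variable name that the replay recipe would pin:

```python
# determinism.py -- freeze time and pin randomness for a test session.
import os
import random

import pytest
from freezegun import freeze_time

@pytest.fixture(autouse=True)
def deterministic_env(monkeypatch):
    # Seed comes from the environment so the replay recipe controls it;
    # it must never be regenerated implicitly inside the tests.
    random.seed(int(os.environ.get("PPP_TEST_SEED", "1700000000")))
    monkeypatch.setenv("TZ", "UTC")
    with freeze_time("2026-01-22T12:34:56Z"):
        yield
```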
If a bug is inherently timing-sensitive, the PPP should carry a replay that captures the problematic interleaving and a model-check or stress-test harness that demonstrates the absence of that interleaving post-fix.
Advanced Proofs: When Tests Aren’t Enough
Some fixes touch safety-critical code paths: authorization, money movement, crypto, consensus. For these, elevate the proof burden:
- Design by contract: Add pre/post-conditions; use languages with contract support or add runtime assertions that become property tests.
- Symbolic execution: KLEE (C/C++), Symbiotic, angr (binary), or JS symbolic engines to prove path properties.
- Type-level proofs: LiquidHaskell, Rust typestates, or refined types to enforce invariants.
- Model checking: TLA+ or PlusCal specs for concurrency and distributed protocols; link the TLA+ model to the PPP and include TLC result traces.
A pragmatic stance: you rarely need full formal proofs for routine bug fixes. But for high-risk areas, make higher assurance mandatory in your Rego policy, keyed off file paths or labels.
Risk Scoring and Rollout Strategy
Auto-merge should be selective. Compute a risk score from the signals below; a weighted-sum sketch follows the list.
- Blast radius: number of files changed, fan-in/fan-out of impacted modules, change to public APIs.
- Test strength: coverage of changed lines, mutation score, fuzz depth.
- Complexity deltas: cyclomatic complexity shifts; cognitive complexity thresholds.
- Security posture: any code paths touching secrets, authn/authz, cryptography.
- Operational sensitivity: pages per KLOC metric for the module; SLO tie-ins.
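A weighted-sum sketch; the weights, normalizers, and the 20-file/10-point caps are illustrative and should be tuned per codebase:

```python
# risk_score.py -- illustrative weighted-sum risk score in [0, 1].
def risk_score(files_changed: int,
               changed_line_coverage: float,  # 0..1
               mutation_score: float,         # 0..1
               complexity_delta: int,
               touches_sensitive_path: bool) -> float:
    blast = min(files_changed / 20.0, 1.0)
    test_weakness = 1.0 - min(changed_line_coverage, mutation_score)
    complexity = min(max(complexity_delta, 0) / 10.0, 1.0)
    sensitive = 1.0 if touches_sensitive_path else 0.0
    score = (0.3 * blast + 0.3 * test_weakness
             + 0.2 * complexity + 0.2 * sensitive)
    return round(score, 2)

# Example: small, well-tested patch away from sensitive paths -> low risk
assert risk_score(3, 0.95, 0.8, 1, False) <= 0.3
```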
For low-risk patches that fully meet PPP policy, auto-merge with a short-lived feature flag or a staged rollout (canary percentage). For higher-risk patches, require human review even if the PPP passes; the PPP still accelerates and improves the review.
Concrete Example: Fixing a Dangling File Handle in Rust
Scenario: A service intermittently fails with Too many open files. The code debug ai proposes a fix.
- Failing test: Adds a test that opens and reads 10,000 temp files concurrently in a hermetic environment. Asserts no error and that open FD count stays below a threshold.
- Deterministic replay: rr trace captured on failure; replays to show leaked handles pre-patch.
- Minimal diff: Introduces Drop for a wrapper type that ensures file close, and replaces a map of raw File with this wrapper. AST diff shows only type name change and a small Drop impl.
- Validation: Coverage shows the new Drop path is covered; fuzzing on the parser that touches the files runs for 10 minutes without new crashes. CodeQL/semgrep confirm no new unsafe patterns.
- Provenance: SLSA provenance ties the build to builder identity; SBOM unchanged; cosign signatures published to Rekor.
- Policy: Risk score 0.18 (low). OPA allows auto-merge; rollout canary to 5% for 30 minutes, then full.
Tooling Landscape You Can Use Today
- Determinism and replay: rr, Bazel sandboxing, Nix/Guix, Reproducible Builds flags (SOURCE_DATE_EPOCH), Time namespaces.
- Diff minimization: Difftastic (AST diff), GumTree, Delta debugging frameworks; C-Reduce for C-family simplification of failing inputs.
- Testing: Hypothesis/QuickCheck/JQwik for property tests; libFuzzer/AFL++/Jazzer for fuzzing.
- Static analysis: CodeQL, Semgrep, Rust clippy, Go vet, Sonar.
- Attestations: SLSA generator, in-toto, Sigstore (cosign, Fulcio, Rekor), DSSE.
- Policy: OPA/Rego, Conftest.
- Artifact storage: OCI registries (GHCR, GCR, ACR), ORAS CLI for arbitrary artifacts.
Operational Concerns
- Performance: Keep the PPP lean. Store large replays and corpora as compressed artifacts with deduplication in your registry.
- Privacy and secrets: Scrub traces for secrets; never record live credentials or PII. Use synthetic fixtures and fake tokens.
- Longevity: Set retention for PPP artifacts and link them to releases. For regulated domains, preserve attestations as part of your Software Lifecycle Evidence store.
- Developer experience: Provide a local CLI to generate and validate PPPs so engineers can reproduce what CI does.
A Simple Local CLI
Sketch of a developer-friendly command set:
```bash
ppp init  # scaffolds manifest and directories
ppp repro -- test -k T1234
ppp patch --apply ./fix.diff
ppp minimize --strategy ast
ppp attest --slsa --intoto --sign
ppp verify --all
ppp push --oci ghcr.io/org/project/ppp:pr-1234
```
This CLI wraps rr, your test runner, coverage tools, and cosign, producing a manifest that your CI policy can trust.
What About Human Review?
Proof-carrying patches do not eliminate human review; they prioritize it. Routine fixes that pass the policy can merge with low latency. Higher-risk changes surface with rich, structured evidence that makes review faster and more substantive. Reviewers spend time on the hard parts, not on reproducing bugs or guessing at patch intent.
Limitations and Edge Cases
- Legacy code without testability seams: You may need to first invest in test harnesses and dependency seams to enable deterministic repros.
- Non-deterministic domains like distributed systems: Rely more on model checks, deterministic simulations, and record/replay of message schedules.
- Huge binaries and rr trace sizes: Use selective tracing (scope to the module) and time-boxed runs.
- Language/tooling gaps: Some ecosystems lack mature replay or AST diff tooling. Start with textual minimization plus strong tests.
Roadmap and Maturity Model
- Phase 1: Tests + basic attestations. Require failing test and signed SBOM; block network in CI.
- Phase 2: Deterministic replays and AST minimization; integrate coverage and fuzzing bursts.
- Phase 3: Formal properties for sensitive modules; risk-weighted policies; staged rollouts tied to PPP status.
- Phase 4: Organization-wide evidence ledger; dashboards for PPP metrics; automatic backports with inherited proofs.
Conclusion
A world where AI agents propose code is inevitable. A world where those changes merge without strong evidence should not be. Proof-carrying patches let organizations scale AI-driven maintenance without giving up on reproducibility, auditability, and security. The technical ingredients exist today: hermetic test runners, deterministic record/replay, AST-aware diffs, property testing, fuzzers, and supply-chain attestations via Sigstore, SLSA, and in-toto. What’s been missing is tying them together into a crisp policy that shifts the burden of proof onto the patch itself.
Make your code debug ai earn its merges. Demand failing tests, deterministic replays, minimal diffs, and cryptographically verifiable provenance. Then let your policy engine do what it does best: trust, but verify, and automate the rest.
