Refactor Your Codebase with LLMs in 2025: CI‑Gated Codemods, AST Guarantees, and Safe Multi‑Repo Rollouts
Large‑scale refactors are no longer weekend‑long heroics. In 2025, LLMs can draft codemods, map semantic changes across languages, and propose migration strategies across dozens of repos. But the promised velocity only materializes if you wrap models inside a rigorous engineering envelope: typed AST transformations, deterministic evals, CI‑gated rollouts, and automated mitigations when things go sideways.
This article lays out an end‑to‑end, production‑grade approach for letting LLMs rewrite thousands of files safely. It is opinionated: use AST‑first codemods, gate every step with CI and merge queues, and treat LLMs as spec generators whose output must pass the same compile, test, and observability standards as any human‑authored change.
We cover:
- How to generate codemods with LLMs and snap them to AST frameworks for safety
- Enforcing AST/compile guarantees and idempotence
- Building a gold‑set evaluation harness and measuring precision/recall of transformations
- CI gating, merge queues, and diff budgets
- Multi‑repo rollouts with canaries, bisection, and auto‑reverts
- Concrete examples across Python (LibCST), JS/TS (jscodeshift/ts‑morph), Java (OpenRewrite), Go (go/ast), and Rust (syn/quote)
If you want the TL;DR pattern: LLMs propose; codemod frameworks enforce; CI decides. Everything else is tooling.
Why 2025 Is Different
The ingredients for safe automated refactors matured:
- Structured generation: grammar‑constrained decoding, function‑calling, and toolformer‑style agents allow LLMs to produce typed recipes instead of free‑form text.
- Ubiquitous AST toolchains: LibCST, OpenRewrite, jscodeshift, ts‑morph, go/ast, Rust syn, and tree‑sitter provide concrete, lossless ASTs and printers with stable formatting.
- Proven rollout patterns: merge queues, canarying, staged rollouts, and automated reverts have gone mainstream in CI/CD.
- Evidence from industry: systems like Facebook Getafix and SapFix, OpenRewrite at scale in Java ecosystems, and Semgrep autofix/Comby show that static and structural transforms, when tested and gated, can be safe and fast.
The remaining delta is orchestration: make LLMs generate codemods and proofs, not just patches; then let CI be the arbiter of safety.
The Core Principle: LLMs Propose, AST Frameworks Enforce
Free‑form patches from an LLM are not shippable artifacts. Treat them as drafts for a typed codemod:
- Ask the LLM to produce a transformation spec: before/after examples, invariants, and constraints.
- Ask it to instantiate the spec into a codemod for your language’s AST framework.
- Compile and unit‑test the codemod itself.
- Run the codemod on a representative sample, then run AST/compile/test gates.
- Only batch when the codemod is proven deterministic, idempotent, and precise.
The key: never let unconstrained text changes land. Always traverse and edit the AST, preserving trivia (comments/formatting) and ensuring syntactic validity by construction.
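To make "LLMs propose" concrete, have the model fill a typed spec (via function calling or JSON-schema output) instead of emitting prose or raw patches. A minimal sketch of such a spec, with hypothetical field names you would adapt to your own tooling:

```python
# Hypothetical typed transformation spec the LLM must populate via structured output.
from dataclasses import dataclass, field


@dataclass
class ExamplePair:
    before: str  # source snippet prior to the transform
    after: str   # expected snippet after the transform


@dataclass
class TransformSpec:
    name: str                    # e.g. 'requests-to-httpx'
    language: str                # 'python', 'typescript', ...
    scope_globs: list[str]       # files the codemod is allowed to touch
    examples: list[ExamplePair]  # positive before/after pairs
    negative_examples: list[str] = field(default_factory=list)  # snippets that must NOT change
    invariants: list[str] = field(default_factory=list)         # e.g. 'preserve comments'
```

The spec, not the model's free text, is the artifact reviewers sign off on and the codemod's tests are derived from.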
Architecture: An End‑to‑End Refactor Pipeline
A production pipeline typically looks like this:
- Spec phase
  - Prompt the LLM for a migration plan and transformation spec with examples, negative examples, and invariants.
  - Generate codemod code and unit tests in the chosen AST framework.
- Build phase
  - Compile the codemod and run its unit tests.
  - Dry-run on a random sample of files; enforce AST, idempotence, and determinism checks.
- Eval phase
  - Run on a curated gold-set with expected outputs; compute precision/recall, compile pass rate, and diff budgets.
- CI gate
  - Create a PR with the transform applied to a limited scope; run full build/test/lint/analysis.
  - Require codeowner review; gate merging via a merge queue.
- Rollout phase
  - Stage across directories/services/repos with canaries.
  - Monitor build/test dashboards; auto-bisect failing shards and auto-revert if regression thresholds are crossed.
- Post-rollout
  - Freeze the codemod, archive metrics, and document patterns/traps discovered.
Guardrails That Actually Work
Adopt these non‑negotiables:
- AST‑only edits: No regex or string‑based search/replace for code structure changes.
- Determinism: The codemod produces the same output across runs and environments.
- Idempotence: Running the codemod twice yields the same tree as once.
- Scope controls: Target symbols, import paths, and file globs explicitly. Avoid wildcards.
- Diff budgets: Cap files changed and total LOC per PR (see the sketch after this list).
- Compile/test gates: Refuse to merge without a green build.
- Auto‑revert on failure: If a staged rollout breaks, revert and investigate; do not proceed.
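The scope and diff-budget guardrails can be enforced mechanically before any PR is opened. A minimal sketch, assuming a hypothetical list of proposed file diffs produced by a dry run:

```python
# Hypothetical pre-PR guardrail check: reject batches that exceed scope or diff budgets.
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class FileDiff:
    path: str
    lines_changed: int


def check_guardrails(diffs: list[FileDiff], allowed_globs: list[str],
                     max_files: int = 300, max_loc: int = 5000) -> list[str]:
    violations = []
    for d in diffs:
        if not any(fnmatch(d.path, glob) for glob in allowed_globs):
            violations.append(f'out of scope: {d.path}')
    if len(diffs) > max_files:
        violations.append(f'too many files: {len(diffs)} > {max_files}')
    total_loc = sum(d.lines_changed for d in diffs)
    if total_loc > max_loc:
        violations.append(f'diff budget exceeded: {total_loc} LOC > {max_loc}')
    return violations  # an empty list means the batch may proceed to CI
```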
From Prompt to Codemod: Concrete Examples
Below are end‑to‑end sketches for popular languages. The point is not the specifics but the pattern: capture intent in the prompt, produce an AST‑based transform, and wrap it in tests.
Python: LibCST migration from requests to httpx
Goal: replace `requests.get`/`requests.post` calls with `httpx.get`/`httpx.post`, adjust imports, and add a default timeout when none is specified. We constrain scope to files that import `requests`.
Prompt sketch (truncated):
- Describe the before/after examples, e.g. `requests.get(url)` -> `httpx.get(url, timeout=5)` if no timeout.
- Invariants: do not modify string literals; preserve comments; do not change semantics of keyword arguments; keep the `verify` argument as-is.
- Provide negative examples: don't touch variables named `requests` that shadow the package.
- Ask for a LibCST codemod plus unit tests.
Example LibCST codemod:
```python
# requests_to_httpx.py - LibCST codemod: requests.get/post -> httpx.get/post
import libcst as cst
import libcst.matchers as m


class RequestsToHttpx(cst.CSTTransformer):
    def leave_Import(self, original: cst.Import, updated: cst.Import) -> cst.Import:
        # Replace `import requests` with `import httpx` so the rewritten call sites resolve.
        # Other `requests.*` uses are out of scope for this sketch and would need handling.
        names = []
        changed = False
        for name in updated.names:
            if name.asname is None and m.matches(name, m.ImportAlias(name=m.Name('requests'))):
                names.append(cst.ImportAlias(name=cst.Name('httpx')))
                changed = True
            else:
                names.append(name)
        if changed:
            return updated.with_changes(names=tuple(names))
        return updated

    def leave_ImportFrom(self, original: cst.ImportFrom, updated: cst.ImportFrom) -> cst.ImportFrom:
        # Do not modify from-imports of requests to avoid surprises.
        return updated

    def _maybe_add_timeout(self, args: list[cst.Arg]) -> list[cst.Arg]:
        has_timeout = any(a.keyword is not None and a.keyword.value == 'timeout' for a in args)
        if not has_timeout:
            args.append(cst.Arg(keyword=cst.Name('timeout'), value=cst.Integer('5')))
        return args

    def leave_Call(self, original: cst.Call, updated: cst.Call) -> cst.Call:
        # Match requests.get(...) and requests.post(...).
        if m.matches(
            updated.func,
            m.Attribute(value=m.Name('requests'), attr=m.OneOf(m.Name('get'), m.Name('post'))),
        ):
            func = updated.func
            if isinstance(func, cst.Attribute) and isinstance(func.attr, cst.Name):
                new_func = cst.Attribute(value=cst.Name('httpx'), attr=cst.Name(func.attr.value))
                new_args = self._maybe_add_timeout(list(updated.args))
                return updated.with_changes(func=new_func, args=tuple(new_args))
        return updated


if __name__ == '__main__':
    import sys

    module = cst.parse_module(sys.stdin.read())
    print(module.visit(RequestsToHttpx()).code)
```
Unit test sketch (pytest, plain before/after assertions):
```python
from textwrap import dedent

import libcst as cst

from requests_to_httpx import RequestsToHttpx


def apply(src: str) -> str:
    return cst.parse_module(src).visit(RequestsToHttpx()).code


def test_adds_timeout_and_swaps_import():
    before = dedent('''\
        import requests

        def f(url):
            return requests.get(url)
        ''')
    after = dedent('''\
        import httpx

        def f(url):
            return httpx.get(url, timeout=5)
        ''')
    assert apply(before) == after


def test_preserves_existing_timeout():
    before = 'import requests\nrequests.post(u, timeout=10)\n'
    after = 'import httpx\nhttpx.post(u, timeout=10)\n'
    assert apply(before) == after
```
AST guarantees here: parsing and printing are mediated by LibCST, preserving formatting/comments and ensuring syntactic validity.
JavaScript/TypeScript: jscodeshift to rename a prop and fix imports
Goal: rename the component prop `onChange` to `onValueChange` for components imported from `ui-lib`, and fix call sites accordingly.
```javascript
// transform.js - jscodeshift codemod
export default function transformer(file, api) {
  const j = api.jscodeshift;
  const root = j(file.source);

  const isUiImport = (node) => node.source && node.source.value === 'ui-lib';

  // Collect the local names Widget is imported under from ui-lib (handles aliased imports).
  const widgetLocals = new Set();
  root.find(j.ImportDeclaration)
    .filter((p) => isUiImport(p.node))
    .forEach((p) => {
      p.node.specifiers.forEach((s) => {
        if (s.imported && s.imported.name === 'Widget') {
          widgetLocals.add(s.local ? s.local.name : 'Widget');
        }
      });
    });

  // Rename onChange -> onValueChange only on JSX elements bound to ui-lib's Widget.
  root.find(j.JSXOpeningElement)
    .filter((path) => widgetLocals.has(path.node.name.name))
    .forEach((path) => {
      path.node.attributes.forEach((attr) => {
        if (attr.type === 'JSXAttribute' && attr.name.name === 'onChange') {
          attr.name.name = 'onValueChange';
        }
      });
    });

  return root.toSource({ quote: 'single' });
}
```
Test with `jscodeshift -t transform.js src/**/*.tsx` and snapshot diff in CI. For TypeScript with types, consider `ts-morph` to resolve symbols and avoid renaming unrelated `onChange` props.
Java: OpenRewrite recipe to migrate logging API
OpenRewrite is a mature framework for Java refactors with recipe YAML, semantic types, and a rich ruleset.
```yaml
# rewrite.yml
type: specs.openrewrite.org/v1beta/recipe
name: com.acme.logging.MigrateLogger
displayName: Migrate to NewLogger
recipeList:
  - org.openrewrite.java.dependencies.AddDependency:
      groupId: com.acme
      artifactId: newlogger
      version: 1.x
  - org.openrewrite.java.search.FindMethods:
      methodPattern: com.old.Logger log(..)
  - org.openrewrite.java.ChangeType:
      oldFullyQualifiedTypeName: com.old.Logger
      newFullyQualifiedTypeName: com.acme.newlogger.Logger
  - org.openrewrite.java.ChangeMethodName:
      methodPattern: com.acme.newlogger.Logger log(..)
      newMethodName: info
```
Run it with the Gradle plugin or `rewrite-maven-plugin`, and gate on `mvn -Drewrite.activeRecipes=com.acme.logging.MigrateLogger rewrite:run` in CI.
Go: go/ast to thread context.Context through functions
A classic large refactor in Go: add a `context.Context` parameter to service methods and propagate it to call sites.
```go
// context_codemod.go
package main

import (
	"go/ast"
	"go/parser"
	"go/printer"
	"go/token"
	"log"
	"os"
)

// addCtxParam prepends a `ctx context.Context` parameter to a function declaration.
func addCtxParam(fd *ast.FuncDecl) {
	if fd.Type.Params == nil {
		return
	}
	ident := ast.NewIdent("ctx")
	sel := &ast.SelectorExpr{X: ast.NewIdent("context"), Sel: ast.NewIdent("Context")}
	field := &ast.Field{Names: []*ast.Ident{ident}, Type: sel}
	fd.Type.Params.List = append([]*ast.Field{field}, fd.Type.Params.List...)
}

func main() {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, os.Args[1], nil, parser.ParseComments)
	if err != nil {
		log.Fatal(err)
	}
	ast.Inspect(f, func(n ast.Node) bool {
		if fd, ok := n.(*ast.FuncDecl); ok {
			addCtxParam(fd)
		}
		return true
	})
	printer.Fprint(os.Stdout, fset, f)
}
```
In real usage, you also update import sets, fix call sites, and ensure the transform is idempotent. Combine with `go/types` for symbol resolution.
Rust: syn/quote to rename a crate and update paths
```rust
// rename_crate.rs - renames path segments; the Cargo.toml migration might be handled separately.
// Assumes syn with the "full" and "visit-mut" features, plus prettyplease.
use syn::visit_mut::VisitMut;
use syn::{File, Ident, PathSegment};

struct RenameCrate;

impl VisitMut for RenameCrate {
    fn visit_path_segment_mut(&mut self, i: &mut PathSegment) {
        if i.ident == "oldcrate" {
            i.ident = Ident::new("newcrate", i.ident.span());
        }
        syn::visit_mut::visit_path_segment_mut(self, i);
    }
}

fn main() {
    let src = std::fs::read_to_string("src/lib.rs").unwrap();
    let mut file: File = syn::parse_file(&src).unwrap();
    RenameCrate.visit_file_mut(&mut file);
    // prettyplease reprints the edited tree with stable formatting.
    println!("{}", prettyplease::unparse(&file));
}
```
Again, real-world usage integrates with `cargo check` in CI and enforces idempotence.
AST and Compile Guarantees
A credible safety story requires multiple independent guards.
- Parse‑transform‑print roundtrip: parse to a concrete AST, edit via visitors, and print using a stable printer that preserves formatting/comments. Tools: LibCST (Python), OpenRewrite (Java), jscodeshift/ts‑morph (JS/TS), tree‑sitter for many languages, go/ast (Go), syn (Rust).
- AST diff localized to intended nodes: compute a pre/post AST diff and assert that changes occur only under matched patterns (e.g., calls to `requests.get`). Fail fast if edits leak elsewhere.
- Compile gates: the full workspace compiles. For monorepos, use Bazel/Pants/Buck2 to isolate targets; for polyrepos, use each repo's native build tool.
- Test gates: unit and integration suites. For large refactors, enforce a smoke subset plus a rotating shard to control CI time.
- Lint/static analysis gates: run linters, formatters, and static analyzers (ErrorProne, SpotBugs, mypy, ESLint/TS, Clippy).
- Idempotence: re‑apply the codemod to its own output and compare AST equality. Fail if not equal.
- Determinism: run on a representative shard twice in a clean environment; diff must be empty.
Implementation tip: treat these gates as tests for the codemod itself; build a small harness that runs them offline before opening any PRs.
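A minimal sketch of such a harness for the LibCST example above, covering the idempotence and determinism gates; `requests_to_httpx` is the module from that example:

```python
# Offline gate harness for a LibCST codemod: idempotence and determinism checks.
import libcst as cst

from requests_to_httpx import RequestsToHttpx


def apply_once(source: str) -> str:
    return cst.parse_module(source).visit(RequestsToHttpx()).code


def check_idempotent(source: str) -> bool:
    # Re-applying the codemod to its own output must be a no-op (compared as parsed trees).
    once = apply_once(source)
    twice = apply_once(once)
    return cst.parse_module(once).deep_equals(cst.parse_module(twice))


def check_deterministic(source: str, runs: int = 3) -> bool:
    # Repeated runs over the same input must produce byte-identical output.
    return len({apply_once(source) for _ in range(runs)}) == 1


def run_gates(paths: list[str]) -> list[str]:
    failures = []
    for path in paths:
        src = open(path).read()
        if not check_idempotent(src):
            failures.append(f'not idempotent: {path}')
        if not check_deterministic(src):
            failures.append(f'non-deterministic: {path}')
    return failures
```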
Gold‑Set Evaluations That Mean Something
Your codemod is only as good as its eval. Build a gold‑set that captures real‑world edge cases:
- Sources
  - Public examples from docs and migration guides
  - Randomly sampled files from your repos containing the target pattern
  - Adversarial cases (shadowed names, dynamic imports, macro-like uses)
- Labels
  - Expected outputs for each input (reviewed by an owner)
  - Tags for difficulty: trivial, aliasing, nested, mixed languages
- Metrics
  - Precision: fraction of transformed files that match the expected output
  - Recall: fraction of gold examples correctly transformed
  - Compile pass rate: percent of transformed targets that still compile
  - Test pass rate on a smoke suite
  - Average LOC delta and max diff size
The bar to promote from experiment to rollout should be explicit. For example: precision >= 0.98, compile pass >= 0.995, zero non‑localized AST deviations, and idempotence proven.
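A minimal sketch of the metric computation, assuming the gold set is a list of (input, expected output) pairs and `apply_codemod` is whatever callable wraps your transform; the compile check here is Python-specific and only approximate:

```python
# Gold-set evaluation sketch: precision, recall, and compile pass rate.
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldExample:
    source: str
    expected: str  # reviewed expected output; equal to source when no change should happen


def evaluate(gold: list[GoldExample], apply_codemod: Callable[[str], str]) -> dict:
    transformed = correct_changes = should_change = recalled = compiles = 0
    for ex in gold:
        out = apply_codemod(ex.source)
        if out != ex.source:                 # the codemod changed this file
            transformed += 1
            if out == ex.expected:
                correct_changes += 1
        if ex.expected != ex.source:         # this file was supposed to change
            should_change += 1
            if out == ex.expected:
                recalled += 1
        try:
            compile(out, '<codemod-output>', 'exec')  # syntax-level compile check (Python only)
            compiles += 1
        except SyntaxError:
            pass
    return {
        'precision': correct_changes / transformed if transformed else 1.0,
        'recall': recalled / should_change if should_change else 1.0,
        'compile_pass_rate': compiles / len(gold) if gold else 1.0,
    }
```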
CI Gating and Merge Queues
Once the codemod passes offline eval, move to a CI‑gated rollout. Use a merge queue to serialize landings on main and avoid cross‑PR flakiness.
Recommended guardrails:
- Pre-submit checks
  - Run the codemod on a limited batch (e.g., 50 files) and open a PR.
  - Full build/test/analysis; compute metrics and attach a report artifact.
  - Require codeowner approval for the first few PRs.
- Merge queue
  - Use a merge queue (GitHub Merge Queue, Mergify, bors-ng, Gerrit submit requirements) so each batch merges only after rebasing and passing CI in the queue state.
  - Enable auto-revert on post-merge failures.
- Diff budgets
  - Cap files per PR (e.g., 100–300). For monorepos with fast builds, you can go higher.
  - Split by package/module boundaries to aid bisection.
Example GitHub Actions workflow (sketch):
```yaml
name: codemod-ci
on:
  pull_request:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup
        run: |
          pip install libcst pytest
      - name: Run codemod smoke
        run: |
          python codemod.py --sample 50 --apply --report report.json
      - name: Lint & Build
        run: |
          # Replace with your stack's build commands; do not mask failures.
          ./gradlew build
          npm ci && npm run build
      - name: Tests
        run: |
          ./gradlew test
          npm test -- --maxWorkers=50%
      - name: Verify idempotence
        run: |
          git diff --name-only > changed.txt
          git diff > first_pass.patch
          python codemod.py --files @changed.txt --apply
          git diff > second_pass.patch
          diff first_pass.patch second_pass.patch  # the second run must be a no-op
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: codemod-report
          path: report.json
```
Note: you would tailor the build/test steps to your stack. The critical checks are idempotence and full CI gates.
Staged Multi‑Repo Rollouts
Many organizations run polyrepos or hybrid monorepos. Multi‑repo refactors require orchestration.
Core tactics:
- Dependency graph first: build or ingest a graph of repos/modules and their dependency edges. Tools: a manifest service, Bazel query, or a simple YAML map.
- Expand/contract migrations: when altering an API, add the new method alongside the old (expand), migrate call sites, then remove the old (contract). This allows safe partial deployments.
- Compatibility windows: publish a version matrix that tells which library versions are compatible across services.
- Staging by risk: start with low‑traffic or internal services, then medium, then high‑traffic/front‑door.
- Canary cohorts: pick representative repos across languages/frameworks to validate cross‑cutting issues early.
- Auto‑bisection: if stage n fails, bisect the batch to isolate the offending repo or package.
- Auto‑reverts: if a merged batch causes a regression, revert the batch PRs en masse; do not hand‑revert one by one.
A simple orchestrator loop (pseudocode):
```
plan = stage_plan(graph, stages=[canary, low, medium, high])
for stage in plan:
    batches = shard(stage.targets, max_files=2000, max_repos=10)
    for batch in batches:
        prs = apply_codemod(batch)
        results = queue_and_wait(prs)
        if results.failure_rate > 0:
            bad = bisect(batch)
            revert(bad)
            abort_stage_and_report()
            break
```
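The `bisect` step above can be a plain divide-and-conquer over the batch. A minimal sketch, assuming a callback that applies the codemod to a subset of targets on a scratch branch and reports whether CI came back green:

```python
# Batch bisection sketch: isolate the offending targets in a failing batch.
from typing import Callable, Sequence


def bisect_failing_targets(targets: Sequence[str],
                           apply_and_test: Callable[[Sequence[str]], bool]) -> list[str]:
    if len(targets) == 1:
        return list(targets)  # single offending target isolated
    mid = len(targets) // 2
    left, right = targets[:mid], targets[mid:]
    offenders: list[str] = []
    if not apply_and_test(left):
        offenders += bisect_failing_targets(left, apply_and_test)
    if not apply_and_test(right):
        offenders += bisect_failing_targets(right, apply_and_test)
    return offenders
```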
Make the orchestrator idempotent and restartable. Persist state (what’s been transformed and merged) and metrics to a durable store.
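A minimal sketch of that persisted state, using a local JSON file as a stand-in for a durable store; the field names are illustrative:

```python
# Illustrative rollout-state persistence so the orchestrator can resume after a crash.
import json
from pathlib import Path

STATE_FILE = Path('rollout_state.json')


def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {'completed_batches': [], 'reverted_batches': [], 'metrics': {}}


def record_batch(state: dict, batch_id: str, merged: bool, metrics: dict) -> None:
    key = 'completed_batches' if merged else 'reverted_batches'
    if batch_id not in state[key]:
        state[key].append(batch_id)
    state['metrics'][batch_id] = metrics
    STATE_FILE.write_text(json.dumps(state, indent=2))


def already_done(state: dict, batch_id: str) -> bool:
    # Skip batches that were already merged or deliberately reverted.
    return batch_id in state['completed_batches'] or batch_id in state['reverted_batches']
```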
Observability and Runtime Signals
Treat refactors like deployments: watch telemetry.
- CI dashboards: compile/test pass rates by stage; time to green; queue length.
- Diff health: average LOC, outliers, churn hotspots.
- Runtime metrics: if the refactor changes behavior (e.g., I/O libraries), monitor p50/p99 latencies, error rates, and resource usage on canaries.
- Rollback SLOs: define and measure time‑to‑revert if a stage degrades SLOs.
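When the refactor can change runtime behavior, the revert decision should be mechanical rather than an ad-hoc judgment call. A minimal sketch comparing canary metrics against a baseline; the metric names and thresholds are illustrative:

```python
# Illustrative canary check: decide whether to continue the rollout or trigger a revert.
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests
    p99_latency_ms: float


def should_revert(baseline: Metrics, canary: Metrics,
                  max_error_delta: float = 0.005,
                  max_latency_ratio: float = 1.2) -> bool:
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return True
    if baseline.p99_latency_ms > 0 and \
            canary.p99_latency_ms / baseline.p99_latency_ms > max_latency_ratio:
        return True
    return False
```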
Policy and Governance
Set clear rules that align engineering, security, and compliance:
- Code ownership: refactor PRs ping relevant OWNERS; override rules only with documented approvals.
- Security scans: run SAST/DAST and license scanning on refactor PRs, especially when adding dependencies.
- Audit: keep machine‑readable logs of prompts, codemod versions, eval metrics, and PR links for postmortems.
- Model governance: document which model and version was used; sanitize prompts and inputs to avoid leaking secrets; respect data handling policies.
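The audit and model-governance items above are easiest to satisfy with one machine-readable record per batch. A minimal sketch; the fields are illustrative, and hashing the prompt avoids storing raw text that might contain secrets:

```python
# Illustrative audit record: which codemod ran, with which model, and how it scored.
import hashlib
import json
from datetime import datetime, timezone


def audit_record(prompt: str, model: str, codemod_version: str,
                 eval_metrics: dict, pr_urls: list[str]) -> str:
    record = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'model': model,                       # e.g. provider/model@version
        'prompt_sha256': hashlib.sha256(prompt.encode()).hexdigest(),
        'codemod_version': codemod_version,
        'eval_metrics': eval_metrics,
        'pr_urls': pr_urls,
    }
    return json.dumps(record, sort_keys=True)
```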
What To Automate vs. What To Review
Automate:
- Generating and compiling codemods
- Gold‑set eval and reporting
- CI checks, idempotence, and merge queue operations
- Batch planning, canary selection, bisection, and reverts
Human review:
- The transformation spec and first PRs for a new pattern
- Risk assessment for API migrations and behavior‑changing refactors
- Post‑incident reviews and codemod improvements
This blend preserves velocity without sacrificing judgment.
Anti‑Patterns to Avoid
- Regex‑only transforms for structural changes: fragile and non‑portable.
- One‑shot mega‑PRs: impossible to review and risky to revert.
- Skipping idempotence/determinism checks: will burn you in multi‑repo runs.
- No gold‑set: you will overfit to easy cases and break in the wild.
- Allowing model‑authored free‑form patches to merge: AST or bust.
Checklists
Codemod readiness checklist:
- Transformation spec with examples and invariants
- Codemod compiles and has unit tests
- Gold‑set precision >= 0.98 and recall >= 0.95 (tune for your risk)
- Full compile/test green on a sample PR
- Idempotence and determinism proven
- Diff budgets defined
Rollout checklist:
- Dependency graph and stage plan
- Canary set defined and owners notified
- Merge queue enabled and auto‑revert wired
- Observability dashboards prepared
- Rollback SLO agreed and tested
Prompt Patterns That Help
Structure your prompts to elicit safe, testable outputs:
- Ask for a transformation spec with: scope, before/after pairs, invariants, and negative cases.
- Demand code in a specific AST framework and unit tests.
- Require an idempotence property and a quick self‑check harness.
- Prohibit editing of non‑code assets unless explicitly whitelisted.
- Request a rollback recipe (how to undo the transform) to aid reverts.
Example meta‑prompt snippet:
You are generating a safe codemod for Python using LibCST. Output:
1) Transformation spec with scope and invariants.
2) LibCST transformer implementing the spec.
3) Pytest tests showing before/after including negative cases.
Constraints: preserve comments/formatting; do not change string literals; limit matches to imports from 'requests'. Make the transform idempotent.
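The "quick self-check harness" mentioned above can reject a generated codemod before a human ever reads it: write the model's output to disk, verify it imports, and run its own tests. A minimal sketch; the file names are illustrative, and in practice you would execute untrusted generated code in a sandbox:

```python
# Acceptance check for a generated codemod: it must import cleanly and pass its own tests.
import importlib.util
import subprocess
import sys
from pathlib import Path


def accept_generated_codemod(codemod_src: str, test_src: str, workdir: Path) -> bool:
    workdir.mkdir(parents=True, exist_ok=True)
    codemod_path = workdir / 'requests_to_httpx.py'       # module name the generated tests import
    test_path = workdir / 'test_requests_to_httpx.py'
    codemod_path.write_text(codemod_src)
    test_path.write_text(test_src)

    # 1) The codemod module must import without errors.
    spec = importlib.util.spec_from_file_location('requests_to_httpx', codemod_path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception:
        return False

    # 2) Its generated tests must pass; run pytest in a subprocess to isolate state.
    result = subprocess.run([sys.executable, '-m', 'pytest', '-q', str(test_path)],
                            cwd=workdir, capture_output=True)
    return result.returncode == 0
```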
Putting It All Together: A Reference Pipeline
Let’s summarize an end‑to‑end run for a cross‑repo refactor:
- Propose
  - Engineer writes a short brief: migrate `Foo.do(x)` to `Foo.run(x)` in languages A/B.
  - LLM generates specs and codemods per language with unit tests.
- Prove offline
  - Compile codemods and run their unit tests.
  - Build a gold-set across repos; compute precision/recall and idempotence.
- CI pilot
  - Apply to a small batch; open PR; run full CI; require owner review.
  - Merge via queue; monitor.
- Stage rollout
  - Orchestrator applies to canaries; then progressively to low/medium/high risk cohorts.
  - Each batch merges via queue; failures trigger bisection and auto-revert.
- Finalize
  - Remove compatibility shims (contract) if using expand/contract.
  - Archive metrics and add patterns to your internal cookbook.
Tooling Menu (Opinionated)
- Python: LibCST for safe, lossless edits; Bowler for quick renames; Ruff for lint; pytest for tests.
- JS/TS: jscodeshift for tree transforms; ts‑morph when types are needed; ESLint with autofix; Vitest/Jest.
- Java: OpenRewrite for recipes; ErrorProne for static checks; Gradle/Maven plugins for CI.
- Go: go/ast with go/types; staticcheck; `go test ./...` in CI.
- Rust: syn/quote with prettyplease; Clippy; `cargo check`/`cargo test` in CI.
- Polyglot: tree‑sitter for parsing and spot‑checks; Comby for SSR‑style text when structure is simple and safe.
- Orchestration: GitHub Actions + Merge Queue, Buildkite, or Gerrit; Mergify/bors for queues; Bazel/Pants/Buck2 to speed monorepo builds.
Frequently Overlooked Details
- Formatting drift: ensure your printer or a linter/formatter runs post‑transform to minimize diff noise.
- Performance: run codemods in parallel but shard to avoid CI resource starvation; cache builds (ccache/remote cache).
- Partial failures: design the codemod to skip unsupported cases with explicit markers (e.g., TODO comments) instead of producing bad output; a skip-report sketch follows this list.
- Comment contracts: if your company uses special annotations in comments, ensure the codemod preserves them verbatim.
- Binary files and generated sources: exclude them aggressively.
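For the partial-failures point above, a codemod can record the cases it declines to transform instead of emitting a best-guess rewrite. A minimal sketch of such a skip report; the field names are illustrative:

```python
# Illustrative skip report: cases the codemod declined to transform, for human follow-up.
import json
from dataclasses import asdict, dataclass


@dataclass
class SkippedCase:
    path: str
    line: int
    reason: str  # e.g. 'dynamic attribute access; cannot prove the call target'


class SkipLog:
    def __init__(self) -> None:
        self.cases: list[SkippedCase] = []

    def skip(self, path: str, line: int, reason: str) -> None:
        self.cases.append(SkippedCase(path, line, reason))

    def write(self, out_path: str) -> None:
        with open(out_path, 'w') as f:
            json.dump([asdict(c) for c in self.cases], f, indent=2)
```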
Closing Thoughts
In 2025, LLMs make the hard part of large refactors—designing the transformation and stitching language‑specific details—astonishingly fast. But reliability comes from the boring parts: AST discipline, evals that measure what matters, and CI mechanics that tame blast radius. Treat the model as a codemod author whose work must pass the same compile‑and‑test standards as any engineer’s. Put the merge queue in charge. Stage your rollouts. And always keep a revert handy.
Follow these patterns and you can safely let an LLM touch thousands of files—while sleeping well the night after merge.