Refactor Your Codebase with LLMs in 2025: CI‑Gated Codemods, AST Guarantees, and Safe Multi‑Repo Rollouts
Large‑scale refactors are no longer weekend‑long heroics. In 2025, LLMs can draft codemods, map semantic changes across languages, and propose migration strategies across dozens of repos. But the promised velocity only materializes if you wrap models inside a rigorous engineering envelope: typed AST transformations, deterministic evals, CI‑gated rollouts, and automated mitigations when things go sideways.
This article lays out an end‑to‑end, production‑grade approach for letting LLMs rewrite thousands of files safely. It is opinionated: use AST‑first codemods, gate every step with CI and merge queues, and treat LLMs as spec generators whose output must pass the same compile, test, and observability standards as any human‑authored change.
We cover:
- How to generate codemods with LLMs and snap them to AST frameworks for safety
- Enforcing AST/compile guarantees and idempotence
- Building a gold‑set evaluation harness and measuring precision/recall of transformations
- CI gating, merge queues, and diff budgets
- Multi‑repo rollouts with canaries, bisection, and auto‑reverts
- Concrete examples across Python (LibCST), JS/TS (jscodeshift/ts‑morph), Java (OpenRewrite), Go (go/ast), and Rust (syn/quote)
If you want the TL;DR pattern: LLMs propose; codemod frameworks enforce; CI decides. Everything else is tooling.
Why 2025 Is Different
The ingredients for safe automated refactors matured:
- Structured generation: grammar‑constrained decoding, function‑calling, and toolformer‑style agents allow LLMs to produce typed recipes instead of free‑form text.
- Ubiquitous AST toolchains: LibCST, OpenRewrite, jscodeshift, ts‑morph, go/ast, Rust syn, and tree‑sitter provide concrete, lossless ASTs and printers with stable formatting.
- Proven rollout patterns: merge queues, canarying, staged rollouts, and automated reverts have gone mainstream in CI/CD.
- Evidence from industry: systems like Facebook Getafix and SapFix, OpenRewrite at scale in Java ecosystems, and Semgrep autofix/Comby show that static and structural transforms, when tested and gated, can be safe and fast.
The remaining delta is orchestration: make LLMs generate codemods and proofs, not just patches; then let CI be the arbiter of safety.
The Core Principle: LLMs Propose, AST Frameworks Enforce
Free‑form patches from an LLM are not shippable artifacts. Treat them as drafts for a typed codemod:
- Ask the LLM to produce a transformation spec: before/after examples, invariants, and constraints.
- Ask it to instantiate the spec into a codemod for your language’s AST framework.
- Compile and unit‑test the codemod itself.
- Run the codemod on a representative sample, then run AST/compile/test gates.
- Only batch when the codemod is proven deterministic, idempotent, and precise.
The key: never let unconstrained text changes land. Always traverse and edit the AST, preserving trivia (comments/formatting) and ensuring syntactic validity by construction.
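To make "LLMs propose" concrete, have the model fill a typed spec (via function calling or JSON-schema output) instead of emitting prose or raw patches. A minimal sketch of such a spec, with hypothetical field names you would adapt to your own tooling:

```python
# Hypothetical typed transformation spec the LLM must populate via structured output.
from dataclasses import dataclass, field


@dataclass
class ExamplePair:
    before: str  # source snippet prior to the transform
    after: str   # expected snippet after the transform


@dataclass
class TransformSpec:
    name: str                    # e.g. 'requests-to-httpx'
    language: str                # 'python', 'typescript', ...
    scope_globs: list[str]       # files the codemod is allowed to touch
    examples: list[ExamplePair]  # positive before/after pairs
    negative_examples: list[str] = field(default_factory=list)  # snippets that must NOT change
    invariants: list[str] = field(default_factory=list)         # e.g. 'preserve comments'
```

The spec, not the model's free text, is the artifact reviewers sign off on and the codemod's tests are derived from.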
Architecture: An End‑to‑End Refactor Pipeline
A production pipeline typically looks like this:
- Spec phase
  - Prompt the LLM for a migration plan and transformation spec with examples, negative examples, and invariants.
  - Generate codemod code and unit tests in the chosen AST framework.
- Build phase
  - Compile the codemod and run its unit tests.
  - Dry-run on a random sample of files; enforce AST, idempotence, and determinism checks.
- Eval phase
  - Run on a curated gold-set with expected outputs; compute precision/recall, compile pass rate, and diff budgets.
- CI gate
  - Create a PR with the transform applied to a limited scope; run full build/test/lint/analysis.
  - Require codeowner review; gate merging via a merge queue.
- Rollout phase
  - Stage across directories/services/repos with canaries.
  - Monitor build/test dashboards; auto-bisect failing shards and auto-revert if regression thresholds are crossed.
- Post-rollout
  - Freeze the codemod, archive metrics, and document patterns/traps discovered.
Guardrails That Actually Work
Adopt these non‑negotiables:
- AST‑only edits: No regex or string‑based search/replace for code structure changes.
- Determinism: The codemod produces the same output across runs and environments.
- Idempotence: Running the codemod twice yields the same tree as once.
- Scope controls: Target symbols, import paths, and file globs explicitly. Avoid wildcards.
- Diff budgets: Cap files changed and total LOC per PR (see the sketch after this list).
- Compile/test gates: Refuse to merge without a green build.
- Auto‑revert on failure: If a staged rollout breaks, revert and investigate; do not proceed.
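The scope and diff-budget guardrails can be enforced mechanically before any PR is opened. A minimal sketch, assuming a hypothetical list of proposed file diffs produced by a dry run:

```python
# Hypothetical pre-PR guardrail check: reject batches that exceed scope or diff budgets.
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class FileDiff:
    path: str
    lines_changed: int


def check_guardrails(diffs: list[FileDiff], allowed_globs: list[str],
                     max_files: int = 300, max_loc: int = 5000) -> list[str]:
    violations = []
    for d in diffs:
        if not any(fnmatch(d.path, glob) for glob in allowed_globs):
            violations.append(f'out of scope: {d.path}')
    if len(diffs) > max_files:
        violations.append(f'too many files: {len(diffs)} > {max_files}')
    total_loc = sum(d.lines_changed for d in diffs)
    if total_loc > max_loc:
        violations.append(f'diff budget exceeded: {total_loc} LOC > {max_loc}')
    return violations  # an empty list means the batch may proceed to CI
```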
From Prompt to Codemod: Concrete Examples
Below are end‑to‑end sketches for popular languages. The point is not the specifics but the pattern: capture intent in the prompt, produce an AST‑based transform, and wrap it in tests.
Python: LibCST migration from requests to httpx
Goal: replace `requests.get`/`requests.post` calls with `httpx.get`/`httpx.post`, adjust imports, and add a default timeout when none is specified. We constrain scope to files that import `requests`.
Prompt sketch (truncated):
- Describe the before/after examples, e.g. `requests.get(url)` -> `httpx.get(url, timeout=5)` if no timeout.
- Invariants: do not modify string literals; preserve comments; do not change semantics of keyword arguments; keep the `verify` argument as-is.
- Provide negative examples: don't touch variables named `requests` that shadow the package.
- Ask for a LibCST codemod plus unit tests.
Example LibCST codemod:
```python
# requests_to_httpx.py - LibCST codemod: requests.get/post -> httpx.get/post
import libcst as cst
import libcst.matchers as m


class RequestsToHttpx(cst.CSTTransformer):
    def leave_Import(self, original: cst.Import, updated: cst.Import) -> cst.Import:
        # Replace `import requests` with `import httpx` so the rewritten call sites resolve.
        # Other `requests.*` uses are out of scope for this sketch and would need handling.
        names = []
        changed = False
        for name in updated.names:
            if name.asname is None and m.matches(name, m.ImportAlias(name=m.Name('requests'))):
                names.append(cst.ImportAlias(name=cst.Name('httpx')))
                changed = True
            else:
                names.append(name)
        if changed:
            return updated.with_changes(names=tuple(names))
        return updated

    def leave_ImportFrom(self, original: cst.ImportFrom, updated: cst.ImportFrom) -> cst.ImportFrom:
        # Do not modify from-imports of requests to avoid surprises.
        return updated

    def _maybe_add_timeout(self, args: list[cst.Arg]) -> list[cst.Arg]:
        has_timeout = any(a.keyword is not None and a.keyword.value == 'timeout' for a in args)
        if not has_timeout:
            args.append(cst.Arg(keyword=cst.Name('timeout'), value=cst.Integer('5')))
        return args

    def leave_Call(self, original: cst.Call, updated: cst.Call) -> cst.Call:
        # Match requests.get(...) and requests.post(...).
        if m.matches(
            updated.func,
            m.Attribute(value=m.Name('requests'), attr=m.OneOf(m.Name('get'), m.Name('post'))),
        ):
            func = updated.func
            if isinstance(func, cst.Attribute) and isinstance(func.attr, cst.Name):
                new_func = cst.Attribute(value=cst.Name('httpx'), attr=cst.Name(func.attr.value))
                new_args = self._maybe_add_timeout(list(updated.args))
                return updated.with_changes(func=new_func, args=tuple(new_args))
        return updated


if __name__ == '__main__':
    import sys

    module = cst.parse_module(sys.stdin.read())
    print(module.visit(RequestsToHttpx()).code)
```
Unit test sketch (pytest, plain before/after assertions):
```python
from textwrap import dedent

import libcst as cst

from requests_to_httpx import RequestsToHttpx


def apply(src: str) -> str:
    return cst.parse_module(src).visit(RequestsToHttpx()).code


def test_adds_timeout_and_swaps_import():
    before = dedent('''\
        import requests

        def f(url):
            return requests.get(url)
        ''')
    after = dedent('''\
        import httpx

        def f(url):
            return httpx.get(url, timeout=5)
        ''')
    assert apply(before) == after


def test_preserves_existing_timeout():
    before = 'import requests\nrequests.post(u, timeout=10)\n'
    after = 'import httpx\nhttpx.post(u, timeout=10)\n'
    assert apply(before) == after
```
AST guarantees here: parsing and printing are mediated by LibCST, preserving formatting/comments and ensuring syntactic validity.
JavaScript/TypeScript: jscodeshift to rename a prop and fix imports
Goal: rename the component prop `onChange` to `onValueChange` for components imported from `ui-lib`, and fix call sites accordingly.
```javascript
// transform.js - jscodeshift codemod
export default function transformer(file, api) {
  const j = api.jscodeshift;
  const root = j(file.source);

  const isUiImport = (node) => node.source && node.source.value === 'ui-lib';

  // Collect the local names Widget is imported under from ui-lib (handles aliased imports).
  const widgetLocals = new Set();
  root.find(j.ImportDeclaration)
    .filter((p) => isUiImport(p.node))
    .forEach((p) => {
      p.node.specifiers.forEach((s) => {
        if (s.imported && s.imported.name === 'Widget') {
          widgetLocals.add(s.local ? s.local.name : 'Widget');
        }
      });
    });

  // Rename onChange -> onValueChange only on JSX elements bound to ui-lib's Widget.
  root.find(j.JSXOpeningElement)
    .filter((path) => widgetLocals.has(path.node.name.name))
    .forEach((path) => {
      path.node.attributes.forEach((attr) => {
        if (attr.type === 'JSXAttribute' && attr.name.name === 'onChange') {
          attr.name.name = 'onValueChange';
        }
      });
    });

  return root.toSource({ quote: 'single' });
}
```
Test with `jscodeshift -t transform.js src/**/*.tsx` and snapshot diff in CI. For TypeScript with types, consider `ts-morph` to resolve symbols and avoid renaming unrelated `onChange` props.
Java: OpenRewrite recipe to migrate logging API
OpenRewrite is a mature framework for Java refactors with recipe YAML, semantic types, and a rich ruleset.
```yaml
# rewrite.yml
type: specs.openrewrite.org/v1beta/recipe
name: com.acme.logging.MigrateLogger
displayName: Migrate to NewLogger
recipeList:
  - org.openrewrite.java.dependencies.AddDependency:
      groupId: com.acme
      artifactId: newlogger
      version: 1.x
  - org.openrewrite.java.search.FindMethods:
      methodPattern: com.old.Logger log(..)
  - org.openrewrite.java.ChangeType:
      oldFullyQualifiedTypeName: com.old.Logger
      newFullyQualifiedTypeName: com.acme.newlogger.Logger
  - org.openrewrite.java.ChangeMethodName:
      methodPattern: com.acme.newlogger.Logger log(..)
      newMethodName: info
```
Run it with the Gradle plugin or `rewrite-maven-plugin`, and gate on `mvn -Drewrite.activeRecipes=com.acme.logging.MigrateLogger rewrite:run` in CI.
Go: go/ast to thread context.Context through functions
A classic large refactor in Go: add a `context.Context` parameter to service methods and propagate it to call sites.
```go
// context_codemod.go
package main

import (
	"go/ast"
	"go/parser"
	"go/printer"
	"go/token"
	"log"
	"os"
)

// addCtxParam prepends a `ctx context.Context` parameter to a function declaration.
func addCtxParam(fd *ast.FuncDecl) {
	if fd.Type.Params == nil {
		return
	}
	ident := ast.NewIdent("ctx")
	sel := &ast.SelectorExpr{X: ast.NewIdent("context"), Sel: ast.NewIdent("Context")}
	field := &ast.Field{Names: []*ast.Ident{ident}, Type: sel}
	fd.Type.Params.List = append([]*ast.Field{field}, fd.Type.Params.List...)
}

func main() {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, os.Args[1], nil, parser.ParseComments)
	if err != nil {
		log.Fatal(err)
	}
	ast.Inspect(f, func(n ast.Node) bool {
		if fd, ok := n.(*ast.FuncDecl); ok {
			addCtxParam(fd)
		}
		return true
	})
	printer.Fprint(os.Stdout, fset, f)
}
```
In real usage, you also update import sets, fix call sites, and ensure the transform is idempotent. Combine with `go/types` for symbol resolution.
Rust: syn/quote to rename a crate and update paths
```rust
// rename_crate.rs - renames path segments; the Cargo.toml migration might be handled separately.
// Assumes syn with the "full" and "visit-mut" features, plus prettyplease.
use syn::visit_mut::VisitMut;
use syn::{File, Ident, PathSegment};

struct RenameCrate;

impl VisitMut for RenameCrate {
    fn visit_path_segment_mut(&mut self, i: &mut PathSegment) {
        if i.ident == "oldcrate" {
            i.ident = Ident::new("newcrate", i.ident.span());
        }
        syn::visit_mut::visit_path_segment_mut(self, i);
    }
}

fn main() {
    let src = std::fs::read_to_string("src/lib.rs").unwrap();
    let mut file: File = syn::parse_file(&src).unwrap();
    RenameCrate.visit_file_mut(&mut file);
    // prettyplease reprints the edited tree with stable formatting.
    println!("{}", prettyplease::unparse(&file));
}
```
Again, real-world usage integrates with `cargo check` in CI and enforces idempotence.
AST and Compile Guarantees
A credible safety story requires multiple independent guards.
- Parse‑transform‑print roundtrip: parse to a concrete AST, edit via visitors, and print using a stable printer that preserves formatting/comments. Tools: LibCST (Python), OpenRewrite (Java), jscodeshift/ts‑morph (JS/TS), tree‑sitter for many languages, go/ast (Go), syn (Rust).
- AST diff localized to intended nodes: compute a pre/post AST diff and assert that changes occur only under matched patterns (e.g., calls to `requests.get`). Fail fast if edits leak elsewhere.
- Compile gates: the full workspace compiles. For monorepos, use Bazel/Pants/Buck2 to isolate targets; for polyrepos, use each repo's native build tool.
- Test gates: unit and integration suites. For large refactors, enforce a smoke subset plus a rotating shard to control CI time.
- Lint/static analysis gates: run linters, formatters, and static analyzers (ErrorProne, SpotBugs, mypy, ESLint/TS, Clippy).
- Idempotence: re‑apply the codemod to its own output and compare AST equality. Fail if not equal.
- Determinism: run on a representative shard twice in a clean environment; diff must be empty.
Implementation tip: treat these gates as tests for the codemod itself; build a small harness that runs them offline before opening any PRs.
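A minimal sketch of such a harness for the LibCST example above, covering the idempotence and determinism gates; `requests_to_httpx` is the module from that example:

```python
# Offline gate harness for a LibCST codemod: idempotence and determinism checks.
import libcst as cst

from requests_to_httpx import RequestsToHttpx


def apply_once(source: str) -> str:
    return cst.parse_module(source).visit(RequestsToHttpx()).code


def check_idempotent(source: str) -> bool:
    # Re-applying the codemod to its own output must be a no-op (compared as parsed trees).
    once = apply_once(source)
    twice = apply_once(once)
    return cst.parse_module(once).deep_equals(cst.parse_module(twice))


def check_deterministic(source: str, runs: int = 3) -> bool:
    # Repeated runs over the same input must produce byte-identical output.
    return len({apply_once(source) for _ in range(runs)}) == 1


def run_gates(paths: list[str]) -> list[str]:
    failures = []
    for path in paths:
        src = open(path).read()
        if not check_idempotent(src):
            failures.append(f'not idempotent: {path}')
        if not check_deterministic(src):
            failures.append(f'non-deterministic: {path}')
    return failures
```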
Gold‑Set Evaluations That Mean Something
Your codemod is only as good as its eval. Build a gold‑set that captures real‑world edge cases:
- Sources
  - Public examples from docs and migration guides
  - Randomly sampled files from your repos containing the target pattern
  - Adversarial cases (shadowed names, dynamic imports, macro-like uses)
- Labels
  - Expected outputs for each input (reviewed by an owner)
  - Tags for difficulty: trivial, aliasing, nested, mixed languages
- Metrics
  - Precision: fraction of transformed files that match the expected output
  - Recall: fraction of gold examples correctly transformed
  - Compile pass rate: percent of transformed targets that still compile
  - Test pass rate on a smoke suite
  - Average LOC delta and max diff size
The bar to promote from experiment to rollout should be explicit. For example: precision >= 0.98, compile pass >= 0.995, zero non‑localized AST deviations, and idempotence proven.
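A minimal sketch of the metric computation, assuming the gold set is a list of (input, expected output) pairs and `apply_codemod` is whatever callable wraps your transform; the compile check here is Python-specific and only approximate:

```python
# Gold-set evaluation sketch: precision, recall, and compile pass rate.
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldExample:
    source: str
    expected: str  # reviewed expected output; equal to source when no change should happen


def evaluate(gold: list[GoldExample], apply_codemod: Callable[[str], str]) -> dict:
    transformed = correct_changes = should_change = recalled = compiles = 0
    for ex in gold:
        out = apply_codemod(ex.source)
        if out != ex.source:                 # the codemod changed this file
            transformed += 1
            if out == ex.expected:
                correct_changes += 1
        if ex.expected != ex.source:         # this file was supposed to change
            should_change += 1
            if out == ex.expected:
                recalled += 1
        try:
            compile(out, '<codemod-output>', 'exec')  # syntax-level compile check (Python only)
            compiles += 1
        except SyntaxError:
            pass
    return {
        'precision': correct_changes / transformed if transformed else 1.0,
        'recall': recalled / should_change if should_change else 1.0,
        'compile_pass_rate': compiles / len(gold) if gold else 1.0,
    }
```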
CI Gating and Merge Queues
Once the codemod passes offline eval, move to a CI‑gated rollout. Use a merge queue to serialize landings on main and avoid cross‑PR flakiness.
Recommended guardrails:
- Pre-submit checks
  - Run the codemod on a limited batch (e.g., 50 files) and open a PR.
  - Full build/test/analysis; compute metrics and attach a report artifact.
  - Require codeowner approval for the first few PRs.
- Merge queue
  - Use a merge queue (GitHub Merge Queue, Mergify, bors-ng, Gerrit submit requirements) so each batch merges only after rebasing and passing CI in the queue state.
  - Enable auto-revert on post-merge failures.
- Diff budgets
  - Cap files per PR (e.g., 100–300). For monorepos with fast builds, you can go higher.
  - Split by package/module boundaries to aid bisection.
Example GitHub Actions workflow (sketch):
```yaml
name: codemod-ci
on:
  pull_request:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup
        run: |
          pip install libcst pytest
      - name: Run codemod smoke
        run: |
          python codemod.py --sample 50 --apply --report report.json
      - name: Lint & Build
        run: |
          # Replace with your stack's build commands; do not mask failures.
          ./gradlew build
          npm ci && npm run build
      - name: Tests
        run: |
          ./gradlew test
          npm test -- --maxWorkers=50%
      - name: Verify idempotence
        run: |
          git diff --name-only > changed.txt
          git diff > first_pass.patch
          python codemod.py --files @changed.txt --apply
          git diff > second_pass.patch
          diff first_pass.patch second_pass.patch  # the second run must be a no-op
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: codemod-report
          path: report.json
```
Note: you would tailor the build/test steps to your stack. The critical checks are idempotence and full CI gates.
Staged Multi‑Repo Rollouts
Many organizations run polyrepos or hybrid monorepos. Multi‑repo refactors require orchestration.
Core tactics:
- Dependency graph first: build or ingest a graph of repos/modules and their dependency edges. Tools: a manifest service, Bazel query, or a simple YAML map.
- Expand/contract migrations: when altering an API, add the new method alongside the old (expand), migrate call sites, then remove the old (contract). This allows safe partial deployments.
- Compatibility windows: publish a version matrix that tells which library versions are compatible across services.
- Staging by risk: start with low‑traffic or internal services, then medium, then high‑traffic/front‑door.
- Canary cohorts: pick representative repos across languages/frameworks to validate cross‑cutting issues early.
- Auto‑bisection: if stage n fails, bisect the batch to isolate the offending repo or package.
- Auto‑reverts: if a merged batch causes a regression, revert the batch PRs en masse; do not hand‑revert one by one.
A simple orchestrator loop (pseudocode):
```
plan = stage_plan(graph, stages=[canary, low, medium, high])
for stage in plan:
    batches = shard(stage.targets, max_files=2000, max_repos=10)
    for batch in batches:
        prs = apply_codemod(batch)
        results = queue_and_wait(prs)
        if results.failure_rate > 0:
            bad = bisect(batch)
            revert(bad)
            abort_stage_and_report()
            break
```
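The `bisect` step above can be a plain divide-and-conquer over the batch. A minimal sketch, assuming a callback that applies the codemod to a subset of targets on a scratch branch and reports whether CI came back green:

```python
# Batch bisection sketch: isolate the offending targets in a failing batch.
from typing import Callable, Sequence


def bisect_failing_targets(targets: Sequence[str],
                           apply_and_test: Callable[[Sequence[str]], bool]) -> list[str]:
    if len(targets) == 1:
        return list(targets)  # single offending target isolated
    mid = len(targets) // 2
    left, right = targets[:mid], targets[mid:]
    offenders: list[str] = []
    if not apply_and_test(left):
        offenders += bisect_failing_targets(left, apply_and_test)
    if not apply_and_test(right):
        offenders += bisect_failing_targets(right, apply_and_test)
    return offenders
```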
Make the orchestrator idempotent and restartable. Persist state (what’s been transformed and merged) and metrics to a durable store.
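A minimal sketch of that persisted state, using a local JSON file as a stand-in for a durable store; the field names are illustrative:

```python
# Illustrative rollout-state persistence so the orchestrator can resume after a crash.
import json
from pathlib import Path

STATE_FILE = Path('rollout_state.json')


def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {'completed_batches': [], 'reverted_batches': [], 'metrics': {}}


def record_batch(state: dict, batch_id: str, merged: bool, metrics: dict) -> None:
    key = 'completed_batches' if merged else 'reverted_batches'
    if batch_id not in state[key]:
        state[key].append(batch_id)
    state['metrics'][batch_id] = metrics
    STATE_FILE.write_text(json.dumps(state, indent=2))


def already_done(state: dict, batch_id: str) -> bool:
    # Skip batches that were already merged or deliberately reverted.
    return batch_id in state['completed_batches'] or batch_id in state['reverted_batches']
```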
Observability and Runtime Signals
Treat refactors like deployments: watch telemetry.
- CI dashboards: compile/test pass rates by stage; time to green; queue length.
- Diff health: average LOC, outliers, churn hotspots.
- Runtime metrics: if the refactor changes behavior (e.g., I/O libraries), monitor p50/p99 latencies, error rates, and resource usage on canaries.
- Rollback SLOs: define and measure time‑to‑revert if a stage degrades SLOs.
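When the refactor can change runtime behavior, the revert decision should be mechanical rather than an ad-hoc judgment call. A minimal sketch comparing canary metrics against a baseline; the metric names and thresholds are illustrative:

```python
# Illustrative canary check: decide whether to continue the rollout or trigger a revert.
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests
    p99_latency_ms: float


def should_revert(baseline: Metrics, canary: Metrics,
                  max_error_delta: float = 0.005,
                  max_latency_ratio: float = 1.2) -> bool:
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return True
    if baseline.p99_latency_ms > 0 and \
            canary.p99_latency_ms / baseline.p99_latency_ms > max_latency_ratio:
        return True
    return False
```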
Policy and Governance
Set clear rules that align engineering, security, and compliance:
- Code ownership: refactor PRs ping relevant OWNERS; override rules only with documented approvals.
- Security scans: run SAST/DAST and license scanning on refactor PRs, especially when adding dependencies.
- Audit: keep machine‑readable logs of prompts, codemod versions, eval metrics, and PR links for postmortems.
- Model governance: document which model and version was used; sanitize prompts and inputs to avoid leaking secrets; respect data handling policies.
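The audit and model-governance items above are easiest to satisfy with one machine-readable record per batch. A minimal sketch; the fields are illustrative, and hashing the prompt avoids storing raw text that might contain secrets:

```python
# Illustrative audit record: which codemod ran, with which model, and how it scored.
import hashlib
import json
from datetime import datetime, timezone


def audit_record(prompt: str, model: str, codemod_version: str,
                 eval_metrics: dict, pr_urls: list[str]) -> str:
    record = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'model': model,                       # e.g. provider/model@version
        'prompt_sha256': hashlib.sha256(prompt.encode()).hexdigest(),
        'codemod_version': codemod_version,
        'eval_metrics': eval_metrics,
        'pr_urls': pr_urls,
    }
    return json.dumps(record, sort_keys=True)
```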
What To Automate vs. What To Review
Automate:
- Generating and compiling codemods
- Gold‑set eval and reporting
- CI checks, idempotence, and merge queue operations
- Batch planning, canary selection, bisection, and reverts
Human review:
- The transformation spec and first PRs for a new pattern
- Risk assessment for API migrations and behavior‑changing refactors
- Post‑incident reviews and codemod improvements
This blend preserves velocity without sacrificing judgment.
Anti‑Patterns to Avoid
- Regex‑only transforms for structural changes: fragile and non‑portable.
- One‑shot mega‑PRs: impossible to review and risky to revert.
- Skipping idempotence/determinism checks: will burn you in multi‑repo runs.
- No gold‑set: you will overfit to easy cases and break in the wild.
- Allowing model‑authored free‑form patches to merge: AST or bust.
Checklists
Codemod readiness checklist:
- Transformation spec with examples and invariants
- Codemod compiles and has unit tests
- Gold‑set precision >= 0.98 and recall >= 0.95 (tune for your risk)
- Full compile/test green on a sample PR
- Idempotence and determinism proven
- Diff budgets defined
Rollout checklist:
- Dependency graph and stage plan
- Canary set defined and owners notified
- Merge queue enabled and auto‑revert wired
- Observability dashboards prepared
- Rollback SLO agreed and tested
Prompt Patterns That Help
Structure your prompts to elicit safe, testable outputs:
- Ask for a transformation spec with: scope, before/after pairs, invariants, and negative cases.
- Demand code in a specific AST framework and unit tests.
- Require an idempotence property and a quick self‑check harness.
- Prohibit editing of non‑code assets unless explicitly whitelisted.
- Request a rollback recipe (how to undo the transform) to aid reverts.
Example meta‑prompt snippet:
You are generating a safe codemod for Python using LibCST. Output:
1) Transformation spec with scope and invariants.
2) LibCST transformer implementing the spec.
3) Pytest tests showing before/after including negative cases.
Constraints: preserve comments/formatting; do not change string literals; limit matches to imports from 'requests'. Make the transform idempotent.
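The "quick self-check harness" mentioned above can reject a generated codemod before a human ever reads it: write the model's output to disk, verify it imports, and run its own tests. A minimal sketch; the file names are illustrative, and in practice you would execute untrusted generated code in a sandbox:

```python
# Acceptance check for a generated codemod: it must import cleanly and pass its own tests.
import importlib.util
import subprocess
import sys
from pathlib import Path


def accept_generated_codemod(codemod_src: str, test_src: str, workdir: Path) -> bool:
    workdir.mkdir(parents=True, exist_ok=True)
    codemod_path = workdir / 'requests_to_httpx.py'       # module name the generated tests import
    test_path = workdir / 'test_requests_to_httpx.py'
    codemod_path.write_text(codemod_src)
    test_path.write_text(test_src)

    # 1) The codemod module must import without errors.
    spec = importlib.util.spec_from_file_location('requests_to_httpx', codemod_path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception:
        return False

    # 2) Its generated tests must pass; run pytest in a subprocess to isolate state.
    result = subprocess.run([sys.executable, '-m', 'pytest', '-q', str(test_path)],
                            cwd=workdir, capture_output=True)
    return result.returncode == 0
```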
Putting It All Together: A Reference Pipeline
Let’s summarize an end‑to‑end run for a cross‑repo refactor:
- Propose
  - Engineer writes a short brief: migrate `Foo.do(x)` to `Foo.run(x)` in languages A/B.
  - LLM generates specs and codemods per language with unit tests.
- Prove offline
  - Compile codemods and run their unit tests.
  - Build a gold-set across repos; compute precision/recall and idempotence.
- CI pilot
  - Apply to a small batch; open PR; run full CI; require owner review.
  - Merge via queue; monitor.
- Stage rollout
  - Orchestrator applies to canaries; then progressively to low/medium/high risk cohorts.
  - Each batch merges via queue; failures trigger bisection and auto-revert.
- Finalize
  - Remove compatibility shims (contract) if using expand/contract.
  - Archive metrics and add patterns to your internal cookbook.
Tooling Menu (Opinionated)
- Python: LibCST for safe, lossless edits; Bowler for quick renames; Ruff for lint; pytest for tests.
- JS/TS: jscodeshift for tree transforms; ts‑morph when types are needed; ESLint with autofix; Vitest/Jest.
- Java: OpenRewrite for recipes; ErrorProne for static checks; Gradle/Maven plugins for CI.
- Go: go/ast with go/types; staticcheck; `go test ./...` in CI.
- Rust: syn/quote with prettyplease; Clippy; `cargo check`/`cargo test` in CI.
- Polyglot: tree‑sitter for parsing and spot‑checks; Comby for SSR‑style text when structure is simple and safe.
- Orchestration: GitHub Actions + Merge Queue, Buildkite, or Gerrit; Mergify/bors for queues; Bazel/Pants/Buck2 to speed monorepo builds.
Frequently Overlooked Details
- Formatting drift: ensure your printer or a linter/formatter runs post‑transform to minimize diff noise.
- Performance: run codemods in parallel but shard to avoid CI resource starvation; cache builds (ccache/remote cache).
- Partial failures: design the codemod to skip unsupported cases with explicit markers (e.g., TODO comments) instead of producing bad output; a skip-report sketch follows this list.
- Comment contracts: if your company uses special annotations in comments, ensure the codemod preserves them verbatim.
- Binary files and generated sources: exclude them aggressively.
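For the partial-failures point above, a codemod can record the cases it declines to transform instead of emitting a best-guess rewrite. A minimal sketch of such a skip report; the field names are illustrative:

```python
# Illustrative skip report: cases the codemod declined to transform, for human follow-up.
import json
from dataclasses import asdict, dataclass


@dataclass
class SkippedCase:
    path: str
    line: int
    reason: str  # e.g. 'dynamic attribute access; cannot prove the call target'


class SkipLog:
    def __init__(self) -> None:
        self.cases: list[SkippedCase] = []

    def skip(self, path: str, line: int, reason: str) -> None:
        self.cases.append(SkippedCase(path, line, reason))

    def write(self, out_path: str) -> None:
        with open(out_path, 'w') as f:
            json.dump([asdict(c) for c in self.cases], f, indent=2)
```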
Closing Thoughts
In 2025, LLMs make the hard part of large refactors—designing the transformation and stitching language‑specific details—astonishingly fast. But reliability comes from the boring parts: AST discipline, evals that measure what matters, and CI mechanics that tame blast radius. Treat the model as a codemod author whose work must pass the same compile‑and‑test standards as any engineer’s. Put the merge queue in charge. Stage your rollouts. And always keep a revert handy.
Follow these patterns and you can safely let an LLM touch thousands of files—while sleeping well the night after merge.