Goodbye Cold Starts in 2025: Snapshotting Serverless Runtimes with CRaC, Lambda SnapStart, and Firecracker
Serverless cold starts have been the long-running tax we all pay for developer convenience. In 2025, that tax is negotiable. Snapshotting—freezing a fully initialized runtime and resuming it in milliseconds—has moved from research and niche Linux hacks into mainstream cloud architectures. If you’re building on the JVM or on AWS Lambda (Java), you can largely erase cold starts while keeping the benefits of managed compute.
This article unpacks three major approaches to snapshotting serverless runtimes:
- CRaC (Coordinated Restore at Checkpoint) for the Java ecosystem, and its cousin CRIU for general Linux processes
- AWS Lambda SnapStart (powered by Firecracker microVM snapshots) for near-instant Java cold starts
- Firecracker itself—the microVM technology underpinning the next wave of serverless isolation and speed
We’ll go deep on hooks, secrets hygiene, DB and SDK reconnects, reproducible builds, observability, and CI. Along the way, you’ll get practical code snippets and operational guardrails.
TL;DR
- Snapshotting eliminates most cold start latency by restoring from a pre-initialized runtime image.
- CRaC coordinates JVM checkpoint/restore so your app can close and re-open external resources safely.
- AWS Lambda SnapStart uses Firecracker snapshots under the hood to restore Java runtimes quickly, with runtime hooks for cleanup and re-init.
- Firecracker is the isolation and snapshot engine making fast restore feasible at scale.
- To succeed: implement before-checkpoint/after-restore hooks, refresh secrets and database connections, make builds reproducible, and update your observability and CI to understand snapshot lifecycles.
Why snapshotting fixes cold starts
Cold starts are the combined penalty of runtime boot, framework/classpath load, dependency initialization, native library warmup, JIT warmup, and client/connection setup (DB, caches, SDKs). If we can do all that upfront, once, and capture the fully initialized process image, subsequent executions can skip the work and start near-instantly.
At a low level, snapshotting captures:
- Process memory (heap, stacks, code cache, JIT state)
- File descriptors and sockets (with caveats)
- Threads and registers
- Certain kernel and namespace state (depending on layer)
Restoring a snapshot aims to make the process think it was never stopped. But the outside world moved on: clocks ticked, network connections went stale, secrets rotated, IPs changed. That is why coordination hooks are essential.
Snapshotting options in 2025
1) CRaC for Java (with CRIU for Linux)
CRaC—Coordinated Restore at Checkpoint—is an OpenJDK project that addresses the hardest part of checkpoint/restore for application runtimes: coordination. The JVM and your application get explicit hooks for:
- Before checkpoint: quiesce, flush, close fragile external resources (DB sockets, TLS sessions), reset randomness if needed, scrub secrets you don’t want persisted
- After restore: rebuild connection pools, refresh secrets, resume schedulers, and validate environment assumptions
CRaC typically relies on CRIU (Checkpoint/Restore In Userspace) on Linux to capture and restore process state. The difference from naively using CRIU is that CRaC surfaces lifecycle hooks all the way to your application, making restoration safe and predictable for managed runtimes like the JVM.
Support status: CRaC is available in specialized JDK builds (e.g., Azul Zulu CRaC builds, other vendors). As of 2025, you’ll still be selecting a CRaC-enabled distribution. Framework support (Spring, Micronaut, Quarkus) is advancing quickly; many now offer CRaC-friendly integrations or community starters.
What it’s good for:
- Java services where you control the runtime and container
- Knative/OpenFaaS-style serverless or traditional container orchestration
- Reducing cold starts from multiple seconds to tens of milliseconds
What to watch out for:
- OS/kernel/CRIU compatibility variations
- Coordinating hooks correctly (especially connection pools and secrets)
- Determinism and reproducibility of the image
CRaC hooks in practice (Spring Boot example)
You implement the org.crac.Resource interface to participate in checkpoint and restore.
```java
import org.crac.Core;
import org.crac.Resource;
import org.crac.Context;
import org.springframework.stereotype.Component;
import javax.annotation.PostConstruct;
import javax.sql.DataSource;
import com.zaxxer.hikari.HikariDataSource;

@Component
public class CracResource implements Resource {

    private final HikariDataSource dataSource;
    private volatile boolean restored;

    public CracResource(DataSource ds) {
        this.dataSource = (HikariDataSource) ds;
        Core.getGlobalContext().register(this);
    }

    @PostConstruct
    public void init() {
        // Force early init of hot paths (JIT-friendly) and caches
        // e.g., warm key serializers, load config, prime classpaths
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Close fragile external resources; snapshots shouldn't contain stale sockets
        try {
            dataSource.close();
        } catch (Exception e) {
            // log and proceed
        }
        // Scrub secrets in memory you don't want in the checkpoint
        // e.g., zeroize byte[] holding decrypted creds
        // Reset randomness seeds if you had pre-generated IDs
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        // Recreate connection pool
        HikariDataSource newDs = new HikariDataSource();
        newDs.setJdbcUrl(System.getenv("JDBC_URL"));
        newDs.setUsername(System.getenv("DB_USER"));
        newDs.setPassword(fetchFreshDbPassword());
        // Swap references via DI or a holder; avoid races
        // Optionally, warm up prepared statements
        restored = true;
    }

    private String fetchFreshDbPassword() {
        // Fetch from vault/SM/KMS on restore so it's current
        return System.getenv("DB_PASSWORD");
    }
}
```
You’ll typically pair this with framework-specific adapters that gracefully swap data sources and clients post-restore.
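One common adapter is a holder that the rest of the app injects once, with the restore hook swapping the pool underneath it. Here is a minimal sketch, assuming Spring's DelegatingDataSource and HikariCP; the SwappableDataSource class and its swap() method are illustrative names, not a framework API:

```java
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.DelegatingDataSource;
import com.zaxxer.hikari.HikariDataSource;

// Hypothetical holder bean: consumers keep a stable DataSource reference while
// the afterRestore hook replaces the underlying pool.
public class SwappableDataSource extends DelegatingDataSource {

    public SwappableDataSource(DataSource initial) {
        super(initial);
    }

    // Called from afterRestore with a freshly built pool
    public synchronized void swap(HikariDataSource freshPool) {
        DataSource old = getTargetDataSource();
        setTargetDataSource(freshPool);
        if (old instanceof HikariDataSource) {
            // Should already have been closed in beforeCheckpoint; close defensively
            ((HikariDataSource) old).close();
        }
    }
}
```

Recent framework releases increasingly handle this swap for you through their own CRaC integrations, so check what your version provides before rolling your own.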
Using CRIU directly (containerized service)
If you’re not on Java or want to experiment at the container/process level, CRIU provides a CLI and action hooks, and tools like runc and Podman integrate with it.
```bash
# Pre-warm the service
./my-service --config config.yaml &
SRV_PID=$!

# Ensure readiness
curl -f http://localhost:8080/healthz

# Checkpoint into an image directory
sudo criu dump -t "$SRV_PID" \
  --images-dir ./snap \
  --leave-running \
  --tcp-established \
  --shell-job

# Later: restore into a new process
sudo criu restore --images-dir ./snap
```
Caveat: network sockets in CRIU are tricky. In practice you’ll design the service to close external sockets before checkpoint and reopen after restore. With CRaC, the JVM coordinates that for you.
2) AWS Lambda SnapStart (Java)
AWS Lambda SnapStart, introduced for Java, reduces cold start latency by creating a snapshot of your function’s runtime after initialization. On each new execution environment, Lambda restores from that snapshot instead of running your init code. Under the hood, Lambda uses Firecracker microVM snapshots for isolation and speed.
Key concepts:
- Init phase: your static initializers and constructor/DI wiring run once, when you publish a function version, rather than on every scale-out.
- Snapshot: Lambda freezes the microVM after init.
- Restore: Cold start becomes “restore,” typically far faster than a full init.
- Runtime hooks: you can register callbacks for BeforeCheckpoint and AfterRestore to clear and rebuild external resources.
Supported runtimes: Java 11, 17, and 21 on Lambda’s managed Java runtimes, with SnapStart more recently extended beyond Java. Check AWS docs for current coverage.
What it’s good for:
- Serverless APIs and event handlers in Java with heavy frameworks (Spring Boot, Micronaut, Quarkus)
- Dramatically reducing p50/p90 cold starts without changing your deployment model
What to watch out for:
- Anything initialized during init becomes part of the snapshot, including contents of /tmp and static singletons
- Secrets and network connections must be refreshed after restore
- Randomness and unique IDs seeded during init can collide across restores if not reset
SnapStart runtime hooks
AWS exposes runtime hooks to coordinate resource teardown and rebuild. They are based on the open-source CRaC API: you implement org.crac.Resource and register it with the global context, typically from the handler’s constructor or a static initializer.
```java
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class Handler implements RequestHandler<Request, Response>, Resource {

    public Handler() {
        // Register this handler so Lambda runs the hooks around snapshot and restore
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        // Close DB pools and HTTP clients; clear caches you don't want persisted
        ConnectionManager.closeAll();
        SecretsCache.clearSensitiveBytes();
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        // Rebuild clients and connections, refresh secrets, reseed randomness
        Secrets.initFresh();
        ConnectionManager.init();
        IdGenerator.reseed();
    }

    @Override
    public Response handleRequest(Request input, com.amazonaws.services.lambda.runtime.Context ctx) {
        // Fast path thanks to SnapStart
        return Service.process(input);
    }
}
```
Notes:
- Avoid pre-generating request IDs or random seeds during init; reseed in AfterRestore (see the sketch after these notes).
- Refresh AWS SDK clients only if they hold state you care about (e.g., HTTP keep-alive pools, TLS sessions). Creating a new client is cheap and safe.
- If you wrote files to /tmp during init, they will be present after restore. Do not store request-correlated or unique tokens there during init.
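As a concrete take on the reseeding note, here is a minimal sketch of a restore-aware ID helper; IdGenerator is a hypothetical application class, not an AWS API:

```java
import java.security.SecureRandom;

// Hypothetical helper: route all randomness through one place so a single
// reseed() call in the after-restore hook prevents token collisions across restores.
public final class IdGenerator {

    private static volatile SecureRandom random = new SecureRandom();

    private IdGenerator() {}

    // Call from afterRestore: a new SecureRandom pulls fresh OS entropy instead of
    // reusing seed state captured in the snapshot
    public static void reseed() {
        random = new SecureRandom();
    }

    public static String nextToken() {
        byte[] buf = new byte[16];
        random.nextBytes(buf);
        StringBuilder hex = new StringBuilder(buf.length * 2);
        for (byte b : buf) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```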
Latency improvements you can expect
Real-world reports and AWS benchmarks commonly show 50–90% reductions in cold-start latency for Java functions. Heavy Spring Boot handlers that previously cold-started in the multi-second range often drop into sub-300ms territory, with warm paths indistinguishable from pre-SnapStart warm invokes. Your mileage varies with framework weight and I/O.
3) Firecracker microVM snapshots
Firecracker is the microVM technology built by AWS for Lambda and Fargate. Compared to containers alone, microVMs provide stronger isolation with near-container performance. Crucially, Firecracker supports fast snapshot/restore:
- Captures guest memory, vCPU register state, and device model state
- Restores in milliseconds using demand paging and copy-on-write
- Enables providers to spin up many identical sandboxes quickly
You rarely use Firecracker directly in serverless apps, but it matters because:
- Lambda SnapStart piggybacks on Firecracker snapshots
- Emerging serverless platforms and edge runtimes adopt similar snapshotting models
- Understanding its guarantees (e.g., time and device state) informs how you design restore-safe code
What it’s good for:
- Provider-side isolation and scale-out
- Enabling snapshot-based cold start elimination across multi-tenant environments
What to watch out for:
- Kernel and CPU feature compatibility for self-managed deployments
- Snapshot images must match the VM configuration; reproducibility matters
Designing your app for snapshot/restore
Snapshotting is not zero-effort. You must take control of init and restore lifecycles, particularly around secrets, randomness, time, and I/O.
Init-time responsibilities
- Front-load heavy work: class loading, dependency injection, JIT warming (hot methods), loading native libraries, priming caches (see the warm-up sketch after this list).
- Avoid creating externally unique identifiers during init.
- Decide what must not be persisted in the snapshot (secrets in byte arrays, session tokens, ephemeral keys) and scrub them before checkpoint.
- Ensure file paths, env vars, and config are deterministic and pinned.
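A minimal sketch of an init-time warm-up routine, as referenced in the first bullet; OrderRequest and PriceCalculator are hypothetical application types:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public final class Warmup {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    private Warmup() {}

    // Run once during init, before the checkpoint is taken
    public static void run() throws Exception {
        // Force class loading and serializer construction for hot request/response types
        MAPPER.writeValueAsString(new OrderRequest());
        MAPPER.readValue("{}", OrderRequest.class);

        // Exercise hot code paths so tiered compilation has something to work with
        for (int i = 0; i < 10_000; i++) {
            PriceCalculator.quote(i);
        }
    }
}
```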
Before checkpoint (CRaC: beforeCheckpoint, SnapStart: BeforeCheckpoint)
- Close DB pools and TCP sockets. A checkpoint should not include live network connections.
- Flush metrics and logs, stop schedulers and timers.
- Zeroize sensitive in-memory artifacts you don’t want encrypted into the snapshot.
- Persist safe caches to local disk only if you understand restore semantics.
After restore (CRaC: afterRestore, SnapStart: AfterRestore)
- Recreate connection pools (HikariCP, R2DBC, Redis clients, HTTP clients).
- Reseed randomness for UUIDs and nonces. Prefer OS entropy or a DRBG seeded after restore.
- Refresh secrets from your vault (AWS Secrets Manager, HashiCorp Vault). If you rely on env vars for secrets, confirm rotation strategy.
- Restart schedulers with a known state, accounting for elapsed time.
- Validate environment assumptions: region, AZ, network, account/role.
Java example: Coordinating HikariCP and AWS SDK
```java
import com.zaxxer.hikari.HikariDataSource;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;
import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public final class AfterRestoreActions implements Resource {

    private static final AfterRestoreActions INSTANCE = new AfterRestoreActions();
    private static volatile HikariDataSource ds;
    private static volatile S3Client s3;

    public static void init() {
        // Register once so the hooks run around every checkpoint/restore
        Core.getGlobalContext().register(INSTANCE);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        if (ds != null) { try { ds.close(); } catch (Exception ignored) {} }
        if (s3 != null) { try { s3.close(); } catch (Exception ignored) {} }
        SecretsCache.clear();
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        String pwd = Secrets.fetchDbPassword();
        HikariDataSource newDs = new HikariDataSource();
        newDs.setJdbcUrl(System.getenv("JDBC_URL"));
        newDs.setUsername(System.getenv("DB_USER"));
        newDs.setPassword(pwd);
        ds = newDs;
        s3 = S3Client.builder()
                .httpClientBuilder(UrlConnectionHttpClient.builder())
                .region(Region.of(System.getenv("AWS_REGION")))
                .build();
        Ids.reseed(SecureRandomFactory.fresh());
    }
}
```
Secrets management and DB reconnects
Snapshotting changes the failure modes of secrets and connections. Assume any secret loaded during init may be persisted in the snapshot (encrypted at rest by the platform). You still need freshness and revocation.
Guidelines:
- Never embed short-lived tokens (OIDC, OAuth, STS) into the snapshot. Fetch them after restore.
- Prefer client credentials or database passwords fetched after restore. For Lambda SnapStart, use AfterRestore to get a fresh Secrets Manager value (see the sketch after this list).
- For RDBMS, always close the pool before checkpoint and rebuild after restore; do not rely on TCP keep-alives.
- For caches (Redis/Memcached), rebuild clients after restore; pipeline warmups if necessary.
- For TLS, renegotiate sessions after restore.
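As referenced above, here is a minimal sketch of fetching a fresh database password in the after-restore path, assuming the AWS SDK for Java v2 and a hypothetical DB_SECRET_ID environment variable:

```java
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;

public final class Secrets {

    private Secrets() {}

    // Call from the afterRestore hook so the value reflects any rotation
    // that happened while the snapshot sat on the shelf.
    public static String fetchDbPassword() {
        try (SecretsManagerClient client = SecretsManagerClient.create()) {
            GetSecretValueRequest request = GetSecretValueRequest.builder()
                    .secretId(System.getenv("DB_SECRET_ID"))
                    .build();
            return client.getSecretValue(request).secretString();
        }
    }
}
```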
What about /tmp and local caches?
- In Lambda SnapStart, /tmp content created during init is included in the snapshot and restored. Treat it as a prewarmed, read-only cache, not as a store for secrets or unique tokens.
- For CRaC on containers, your filesystem snapshot depends on CRIU and container runtime; revalidate path assumptions.
Reproducible builds for snapshot integrity
A snapshot is only safe to restore if the environment that consumes it is semantically identical to the one that produced it. Drift yields crashes or subtle bugs. Make your builds reproducible:
- Pin base images by digest, not tag (e.g., FROM gcr.io/distroless/java17-debian12@sha256:...).
- Vendor native libraries or use deterministic package managers (apk --no-cache with pinned versions).
- Make your classpath deterministic. Avoid relying on file system enumeration order.
- Use reproducible JARs (strip timestamps, order entries) via Maven’s reproducible-builds support (pin project.build.outputTimestamp), Gradle’s reproducible archives, or Bazel.
- Consider Nix/Guix or Bazel rules_java to lock transitive dependencies and toolchains.
- Record runtime fingerprints: kernel version, glibc/musl versions, CPU flags. With SnapStart, AWS handles the kernel/VM; with CRaC/CRIU, you own it.
- Generate SBOMs (CycloneDX/SPDX) and sign artifacts (SLSA, Sigstore). You want to know exactly what went into the snapshot.
Testing determinism:
- Build the same commit twice on clean workers; compare image digests and artifact checksums.
- Run smoke tests against both; if behavior diverges, you have non-determinism to fix.
Observability: measuring and debugging snapshots
Snapshotting alters your service lifecycle. Update your telemetry accordingly.
What to capture:
- Lifecycle events: init, before-checkpoint, after-restore, first-request-after-restore. Emit structured logs.
- Metrics: time spent in restore vs. init, number of restores, request latencies immediately after restore vs sustained warm load.
- Cold/warm attribution: with SnapStart, “cold” becomes “restored.” Track a boolean attribute (restored=true) on the first N requests after restore.
- Resource reinit failures: DB reconnect attempts, secret fetch failures, and backoff behavior.
Lambda specifics:
- CloudWatch and the Lambda console expose InitDuration and RestoreDuration (for SnapStart-enabled functions). Alert if RestoreDuration drifts up.
- Record a custom metric for after-restore time-to-first-successful-DB-connection.
- X-Ray/OTel: start a new segment/subsegment around AfterRestore activities to see their impact.
Common pitfalls:
- Duplicated IDs: pre-generated UUIDs in the snapshot repeat across every restore. Reseed RNGs after restore.
- Metrics counters: if your counter state is snapshotted, you might observe resets or jumps. Prefer cumulative counters in a collector outside the function, or reset counters after restore with an explicit label.
- Clocks: wall-clock and monotonic time advance between snapshot and restore. Avoid snapshotting scheduled tasks mid-flight; reschedule after restore.
Minimal example with OpenTelemetry in Java:
```java
private static final Meter meter = GlobalOpenTelemetry.getMeter("app");
private static final LongCounter restores = meter.counterBuilder("restores").build();

// Keep a strong reference so the hook object is not garbage collected
private static final Resource RESTORE_HOOK = new Resource() {
    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        // Nothing to tear down for this metric
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        restores.add(1);
        logger.info("after_restore");
    }
};

static {
    Core.getGlobalContext().register(RESTORE_HOOK);
}
```
CI pipelines and tests for snapshot readiness
You can and should test snapshot lifecycles in CI.
For CRaC-based services
- Use a CRaC-enabled JDK in CI (e.g., Azul Zulu CRaC builds).
- Start the service, run readiness checks, then trigger a checkpoint via jcmd or programmatic API.
- Kill external dependencies (e.g., drop DB connections) to simulate stale state.
- Restore, assert that afterRestore hooks reconnect and pass health checks.
Example CI step (pseudo-Bash):
```bash
# Build image with CRaC JDK
docker build -t app-crac:ci .

# Run dependencies
docker compose up -d postgres redis

# Start app
CID=$(docker run -d --cap-add=CHECKPOINT_RESTORE --name app app-crac:ci)

# Wait for healthz
until docker exec "$CID" curl -sf http://localhost:8080/healthz; do sleep 1; done

# Trigger checkpoint via an admin endpoint
docker exec "$CID" curl -sf -X POST http://localhost:8080/admin/checkpoint

# Simulate network change: restart postgres
docker compose restart postgres

# Restore (implementation-specific: some frameworks auto-restore on next start)
# Alternatively, run `criu restore` inside the container or restart the app process under CRaC

# Validate afterRestore succeeded
docker exec "$CID" curl -sf http://localhost:8080/healthz || exit 1
```
For Java, you can expose a small admin endpoint that calls org.crac.Core.checkpointRestore() to initiate the checkpoint in a controlled state, as in the sketch below.
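A minimal sketch of such an endpoint, assuming Spring MVC and a framework CRaC integration that closes listener sockets in its own beforeCheckpoint phase; the /admin/checkpoint path matches the CI script above:

```java
import org.crac.Core;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CheckpointController {

    // Protect this endpoint; anyone who can call it can checkpoint the process.
    @PostMapping("/admin/checkpoint")
    public ResponseEntity<String> checkpoint() throws Exception {
        // Blocks while the checkpoint is taken; execution resumes here after restore
        Core.checkpointRestore();
        return ResponseEntity.ok("restored");
    }
}
```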
For Lambda SnapStart
- Unit-test your hook logic (extract before/after logic into plain classes).
- In an AWS dev account, deploy with SnapStart enabled and run integration tests.
- Force restores by updating function versions or scaling concurrently (e.g., invoke with many parallel requests to cause new environments).
- Assert that the first request after restore succeeds and that secrets are fresh (e.g., rotate a secret, then trigger another restore).
You can also build a local “simulation mode” that calls the same teardown/reinit methods your hooks use, letting you run them in regular unit tests without Lambda.
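For example, a plain JUnit 5 test can drive the same lifecycle methods directly; AfterRestoreActions is the class sketched earlier, and Database.ping() is a hypothetical health probe:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class RestoreLifecycleTest {

    @Test
    void appIsUsableAfterSimulatedCheckpointRestore() throws Exception {
        AfterRestoreActions hooks = new AfterRestoreActions();

        // Drive the same methods the platform hooks call, with no Lambda or CRIU in the loop
        hooks.beforeCheckpoint(null);
        hooks.afterRestore(null);

        // Assert the app is usable again: fresh pool, fresh secrets, reseeded IDs
        assertTrue(Database.ping(), "DB pool should reconnect after restore");
    }
}
```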
For CRIU-level experiments
- Use runc/podman checkpoint/restore features.
- Add golden tests: checkpoint after warmup, restore, then run a suite to verify functionality and latencies.
- Verify that no outbound sockets exist at checkpoint time (lsof -i) to avoid fragile restores.
Performance engineering tips
- Focus on the init barrier: aggressively front-load heavy work during init so the snapshot is worth more.
- Warm the JIT: invoke hot code paths during init to populate code cache and tiered compilation.
- Precompute serializers and reflection metadata (Jackson/Moshi/Serde). Consider ahead-of-time reflection configs.
- Keep snapshots small: remove unnecessary caches and large temporary buffers before checkpoint to reduce restore time.
- Measure with production-ish traffic and data; snapshot benefits correlate with how much you moved into init.
Security considerations
Snapshotting changes the threat model subtly:
- Secrets in memory are encrypted within provider-managed snapshots (e.g., SnapStart) but still present after restore. Treat them as long-lived unless you clear before checkpoint and refresh after restore.
- Zeroization: for highly sensitive materials (API keys, decrypted certificates), store them in byte arrays that you can zero before checkpoint (see the sketch after this list).
- Unique tokens: never pre-generate request IDs or nonces during init. Regenerate after restore.
- Supply chain: reproducible builds and SBOMs matter more, since you’re codifying the entire runtime image.
- Isolation: Firecracker offers strong isolation; if self-managing snapshots, ensure minimal privileges and hardened kernels.
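A minimal sketch of the zeroization pattern mentioned above; SensitiveMaterial is a hypothetical wrapper, not a library class:

```java
import java.util.Arrays;

// Hypothetical holder for decrypted key material that must not survive a checkpoint.
public final class SensitiveMaterial implements AutoCloseable {

    private final byte[] secret;

    public SensitiveMaterial(byte[] secret) {
        this.secret = secret.clone();
    }

    public byte[] view() {
        return secret; // callers must not copy into long-lived structures
    }

    // Call from the beforeCheckpoint hook
    @Override
    public void close() {
        Arrays.fill(secret, (byte) 0);
    }
}
```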
When to choose which approach
- Use AWS Lambda SnapStart if:
  - You’re on Lambda with a supported Java runtime and want the simplest path to erase cold starts.
  - You can express your teardown and reinit via runtime hooks.
- Use CRaC if:
  - You run Java services on Kubernetes/Knative or your own orchestrator.
  - You need language-level coordination beyond what CRIU provides.
- Use CRIU directly if:
  - You’re not on the JVM and want to experiment with container-level snapshots.
  - You’re comfortable designing your app to close/reopen external resources around the checkpoint.
- Firecracker as a direct tool:
  - Typically provider territory. If you’re building a serverless platform, invest here; otherwise, rely on managed offerings.
Opinion: 2025 is the year we stop accepting cold starts
The industry finally has pragmatic tools to beat cold starts without heroic micro-optimizations or rewriting to a different language/runtime. For JVM shops, CRaC and SnapStart are mature enough to standardize. For platform teams, Firecracker snapshots set the direction of travel. The remaining work is engineering discipline: lifecycle hooks, secret hygiene, reproducible builds, and updated observability.
If you adopt snapshotting thoughtfully, you’ll convert cold starts from a chronic pain into a rounding error.
Practical checklist
- Initialization
  - Front-load: classpath, DI, serializers, JIT hot paths
  - Avoid generating unique IDs/randoms during init
- Before checkpoint
  - Close DB/HTTP/TLS connections
  - Stop schedulers/timers
  - Zeroize secrets and large buffers
- After restore
  - Rebuild pools and clients
  - Refresh secrets from a trusted source
  - Reseed RNGs and regen ephemeral keys
  - Reschedule tasks and validate env
- Build
  - Pin base images, use reproducible JARs
  - Produce SBOM, sign artifacts
  - Record runtime fingerprints (kernel, libc)
- Observability
  - Emit lifecycle logs and metrics
  - Track restore counts and latency
  - Tag first requests after restore
- CI
  - Simulate checkpoint/restore in tests
  - Validate DB reconnects and secret rotation post-restore
References and further reading
- Project CRaC (OpenJDK): https://openjdk.org/projects/crac/
- CRIU (Checkpoint/Restore In Userspace): https://criu.org/
- AWS Lambda SnapStart for Java: https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html
- Lambda runtime hooks (Java): see SnapStart docs and aws-lambda-java-libs
- Firecracker microVM: https://firecracker-microvm.github.io/
- Spring Boot and CRaC community guides (search "Spring CRaC")
- Reproducible Builds: https://reproducible-builds.org/
- SLSA framework: https://slsa.dev/
Closing
Snapshotting turns cold starts into a solved problem—if you treat initialization and restoration as first-class concerns. Whether you pick CRaC in your Kubernetes cluster or SnapStart on Lambda, the pattern is the same: orchestrate teardown and reinit, verify with tests, lock down your builds, and update telemetry. The payoff is real: fast, consistent startup times without giving up the ergonomics of serverless and managed runtimes.