Goodbye Service API Keys: SPIFFE/SPIRE Workload Identity and Zero‑Trust mTLS Across Kubernetes and Multi‑Cloud in 2025
If you’re still rolling API keys across fleets, you’re burning time and carrying risk. The path forward in 2025 is clear: short‑lived, verifiable workload identities and mutual TLS. The SPIFFE standard and its reference implementation SPIRE have matured into production‑grade building blocks for secure service‑to‑service authentication and authorization across Kubernetes, VMs, and multi‑cloud.
This article is an opinionated, hands‑on blueprint for moving from shared secrets to SPIFFE/SPIRE. We’ll cover:
- What breaks with API keys, and why SPIFFE is different
- SVID issuance (X.509 and JWT), node and workload attestation
- Envoy SDS integration for automatic mTLS
- IAM federation (AWS, GCP, Azure) via OIDC
- OPA policy enforcement using SPIFFE IDs
- Certificate rotation and CA lifecycle
- Phased rollout plan
- Performance tradeoffs and capacity planning
- Common pitfalls and how to avoid them
The tone is practical and precise: enough detail to implement, avoiding dogma. Let’s get to work.
Why API Keys Don’t Survive Contact With Reality
API keys are shared secrets. They leak, get copied into logs and tickets, and live long past their intended scope. Rotating them across hundreds of services is error‑prone. They don’t bind to a runtime identity, so you can’t say which workload instance used the key.
Zero‑trust architectures require three things API keys can’t offer:
- Verifiable identity: A cryptographic claim bound to a process and host, not just a string.
- Contextual authorization: Policies that reference who/what the peer is (workload identity), not just possession of a shared token.
- Automated rotation: Short‑lived credentials replaced without human toil.
SPIFFE (Secure Production Identity Framework For Everyone) defines a standard identity format and how workloads receive credentials. SPIRE (SPIFFE Runtime Environment) is the CNCF‑graduated implementation that issues and rotates those credentials, backed by attestations about where workloads actually run.
The net result: You replace brittle API keys with SPIFFE Verifiable Identity Documents (SVIDs) that are short‑lived, automatically rotated, and strongly tied to a workload. Mutual TLS then becomes the default transport security, carrying identity in every connection.
Core Concepts in 90 Seconds
- SPIFFE ID: A URI like `spiffe://prod.example.internal/ns/payments/sa/checkout`. It names a workload within a trust domain.
- Trust Domain: A namespace of identities, e.g., `spiffe://prod.example.internal`. You typically use one per environment (dev, staging, prod).
- SVID: The credential asserting a SPIFFE ID. Comes as X.509 (for mTLS) and/or JWT (for OIDC/federation).
- Attestation: Proving where the workload is running. SPIRE supports Kubernetes (PSAT), AWS/GCP/Azure instance metadata, and more.
- SPIRE Server: Issues SVIDs and manages registration entries (which workload gets which identity and under what conditions).
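- SPIRE Agent: Runs on each node, attests itself to the SPIRE Server (node attestation), attests local workloads against selectors, and delivers server-signed SVIDs to matching workloads over the Workload API.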
- Federation: Connecting trust domains across clusters/clouds so identities from one are trusted by another.
In Kubernetes, SPIRE authenticates nodes via k8s PSAT and workloads via selectors like namespace, service account, and pod labels.
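To make the ID format concrete, here is a minimal sketch in Go that splits a SPIFFE ID into trust domain and path using only the standard library. The function name `parseSPIFFEID` is hypothetical and the checks are a simplified subset of the SPIFFE ID rules; in real services, prefer the `spiffeid` package from go-spiffe.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// parseSPIFFEID splits a SPIFFE ID into trust domain and path.
// Simplified checks only; this is not a full implementation of the spec.
func parseSPIFFEID(raw string) (trustDomain, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "spiffe" {
		return "", "", errors.New("scheme must be spiffe")
	}
	if u.Host == "" || u.User != nil || u.Port() != "" {
		return "", "", errors.New("trust domain must be a bare host name")
	}
	if u.RawQuery != "" || u.Fragment != "" {
		return "", "", errors.New("query and fragment are not allowed")
	}
	if strings.HasSuffix(u.Path, "/") {
		return "", "", errors.New("path must not end with a slash")
	}
	return u.Host, u.Path, nil
}

func main() {
	td, path, err := parseSPIFFEID("spiffe://prod.example.internal/ns/payments/sa/checkout")
	fmt.Println(td, path, err)
}
```

Policy code usually compares the whole ID string, but splitting out the trust domain is handy when a rule applies to every caller from one environment.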
Architecture: Kubernetes and Multi‑Cloud
A pragmatic layout that scales:
- One SPIRE Server per cluster (HA via StatefulSet) or per region; each with a distinct trust domain (e.g., `prod-us-east.example.internal` and `prod-eu-west.example.internal`).
- SPIRE Agents as DaemonSets on each node. Agents handle workload SVID issuance and the Envoy SDS gRPC endpoint.
- An upstream CA per environment (Vault, AWS PCA, or GCP CAS) for central root management, or a self‑contained SPIRE CA if you prefer local roots.
- Federation across trust domains where services call cross‑cluster (bundle endpoints with mTLS).
- Envoy sidecars or node proxies consume SVIDs via SDS for automatic mTLS.
- Optional SPIRE OIDC Discovery Provider for IAM federation to AWS/GCP/Azure.
This topology prevents single global blast radius, keeps failure domains local, and avoids tying the entire company to a single trust root.
Step 1: Trust Domain and Naming Strategy
Be clear and consistent; you will live with these for years.
- Use environment-scoped trust domains, not a single global one. Example: `spiffe://prod.example.internal`, `spiffe://stage.example.internal`.
- Encode k8s context in SPIFFE IDs: `spiffe://prod.example.internal/ns/{namespace}/sa/{service_account}` plus optional granularity like `svc/{name}` if several workloads share a service account and need separate identities.
- Avoid embedding ephemeral data (pod UID) in the SPIFFE ID itself; selectors handle that.
Example canonical SPIFFE ID:
spiffe://prod.example.internal/ns/payments/sa/checkout-api
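A naming convention is easier to keep if one helper builds every ID. The sketch below assumes the `ns/{namespace}/sa/{service_account}` layout above; the helper name `workloadID` and the simplified name pattern are illustrative assumptions, not part of SPIRE.

```go
package main

import (
	"fmt"
	"regexp"
)

// nameSegment is a simplified pattern for Kubernetes namespace and
// service account names (lowercase alphanumerics, '-' and '.').
var nameSegment = regexp.MustCompile(`^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$`)

// workloadID builds the canonical SPIFFE ID for a namespace and service account.
func workloadID(trustDomain, namespace, serviceAccount string) (string, error) {
	for _, s := range []string{namespace, serviceAccount} {
		if !nameSegment.MatchString(s) {
			return "", fmt.Errorf("invalid path segment %q", s)
		}
	}
	return fmt.Sprintf("spiffe://%s/ns/%s/sa/%s", trustDomain, namespace, serviceAccount), nil
}

func main() {
	id, err := workloadID("prod.example.internal", "payments", "checkout-api")
	fmt.Println(id, err)
}
```

Rejecting slashes and other unexpected characters in the inputs keeps a mistyped or hostile name from producing an ID that points somewhere else in the path hierarchy.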
Step 2: Deploy SPIRE Server (HA) and Agent
For production, back SPIRE Server with a replicated datastore (e.g., Postgres). SQLite is fine for labs.
Example SPIRE Server config (HCL‑like) using Vault as upstream CA and Kubernetes PSAT node attestation:
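```hcl
# Start the server with -expandEnv so ${...} references are read from the environment.
server {
  bind_address = "0.0.0.0"
  bind_port    = "8081"
  trust_domain = "prod.example.internal"
  data_dir     = "/run/spire/server/data"

  # Federation bundle endpoint for other trust domains
  federation {
    bundle_endpoint {
      address = "0.0.0.0"
      port    = 8443
      # ACME needs a name a public CA can validate; a private .internal name will not pass.
      acme {
        directory_url = "https://acme-v02.api.letsencrypt.org/directory"
        domain_name   = "bundle.prod.example.internal"
        email         = "secops@example.internal"
        tos_accepted  = true
      }
    }
  }
}

plugins {
  DataStore "sql" {
    plugin_data {
      database_type     = "postgres"
      # sslmode=disable is for illustration only; use verify-full in production
      connection_string = "host=postgres.spire.svc port=5432 dbname=spire user=spire password=${POSTGRES_PASSWORD} sslmode=disable"
    }
  }

  # The server needs a KeyManager; the keys path here is an example
  KeyManager "disk" {
    plugin_data {
      keys_path = "/run/spire/server/data/keys.json"
    }
  }

  UpstreamAuthority "vault" {
    plugin_data {
      vault_addr      = "https://vault.vault.svc:8200"
      pki_mount_point = "pki_int"
      token_auth {
        token = "${VAULT_TOKEN}"
      }
      # TTLs are set per SVID entry as well; this controls CA chain lifetime
    }
  }

  NodeAttestor "k8s_psat" {
    plugin_data {
      clusters = {
        "prod-us-east-1" = {
          # Format is "namespace:service-account"
          service_account_allow_list = ["spire:spire-agent"]
          audience                   = ["spire-server"]
        }
      }
    }
  }
}

telemetry {
  Prometheus {
    port = 9988 # example port
  }
}
```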
SPIRE Agent config (Kubernetes PSAT, plus SDS for Envoy):
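```hcl
agent {
  data_dir       = "/run/spire/agent"
  trust_domain   = "prod.example.internal"
  server_address = "spire-server.spire.svc"
  server_port    = "8081"
  # Workload API socket; Envoy uses the same socket for SDS
  socket_path    = "/run/spire/sockets/agent.sock"
  # Bootstrap trust for the first connection to the server (example path)
  trust_bundle_path = "/run/spire/bundle/bundle.crt"

  sds {
    default_svid_name   = "default"
    default_bundle_name = "spiffe://prod.example.internal"
  }
}

plugins {
  # Node attestation: the agent proves it's running in the trusted cluster
  NodeAttestor "k8s_psat" {
    plugin_data {
      cluster = "prod-us-east-1"
    }
  }

  # Keep workload private keys in memory only
  KeyManager "memory" {
    plugin_data {}
  }

  WorkloadAttestor "k8s" {
    plugin_data {
      skip_kubelet_verification = false
    }
  }
}

telemetry {
  Prometheus {
    port = 9988 # example port
  }
}
```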
Kubernetes manifests: run SPIRE Server as a StatefulSet with PodDisruptionBudgets and SPIRE Agent as a DaemonSet mounting the projected service account token (PSAT).
Step 3: Registration Entries and SVID Issuance
Workloads get SVIDs when their pod selectors match a registration entry. You can scope identities to namespace+service account, and optionally labels for fine‑grained splits.
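Example: give the `checkout-api` service an SVID with 30-minute TTL.

```bash
# -parentID assumes a node alias entry for this cluster's agents already exists
spire-server entry create \
  -parentID spiffe://prod.example.internal/ns/spire/sa/spire-agent \
  -spiffeID spiffe://prod.example.internal/ns/payments/sa/checkout-api \
  -selector k8s:ns:payments \
  -selector k8s:sa:checkout-api \
  -selector k8s:pod-label:app:checkout \
  -x509SVIDTTL 1800 \
  -jwtSVIDTTL 600
```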
- x509 SVID TTL 1800s (30 min) keeps compromise window tight, rotates frequently.
- JWT SVID TTL 600s (10 min) is good for OIDC federation to clouds.
The SPIRE Agent generates the key pair locally, has the SPIRE Server sign the X.509 SVID, caches it, and rotates it automatically before expiry. Developers never need to touch certificates.
Step 4: Envoy SDS Integration for Automatic mTLS
Envoy can request certificates and trust bundles over SDS from SPIRE Agent via a Unix domain socket. This is the cleanest approach—no files, no sidecars doing file rotations.
Envoy bootstrap snippet:
```yaml
static_resources:
  clusters:
    # SPIRE Agent SDS endpoint over the Unix socket (gRPC requires HTTP/2)
    - name: sds-grpc
      type: STATIC
      connect_timeout: 1s
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: sds-grpc
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    pipe:
                      path: /run/spire/sockets/agent.sock
    # Upstream service cluster, with upstream TLS for egress to orders-api
    - name: orders-api
      type: STRICT_DNS
      connect_timeout: 1s
      load_assignment:
        cluster_name: orders-api
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: orders-api.default.svc.cluster.local, port_value: 8443 }
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            tls_certificate_sds_secret_configs:
              - name: spiffe://prod.example.internal/ns/payments/sa/checkout-api
                sds_config:
                  resource_api_version: V3
                  api_config_source:
                    api_type: GRPC
                    transport_api_version: V3
                    grpc_services:
                      - envoy_grpc: { cluster_name: sds-grpc }
            validation_context_sds_secret_config:
              name: spiffe://prod.example.internal
              sds_config:
                resource_api_version: V3
                api_config_source:
                  api_type: GRPC
                  transport_api_version: V3
                  grpc_services:
                    - envoy_grpc: { cluster_name: sds-grpc }
  listeners:
    - name: inbound-listener
      address: { socket_address: { address: 0.0.0.0, port_value: 8443 } }
      filter_chains:
        - filters: [ ... http connection manager ... ]
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              require_client_certificate: true
              common_tls_context:
                tls_certificate_sds_secret_configs:
                  - name: spiffe://prod.example.internal/ns/payments/sa/checkout-api
                    sds_config:
                      resource_api_version: V3
                      api_config_source:
                        api_type: GRPC
                        transport_api_version: V3
                        grpc_services:
                          - envoy_grpc: { cluster_name: sds-grpc }
                validation_context_sds_secret_config:
                  name: spiffe://prod.example.internal
                  sds_config:
                    resource_api_version: V3
                    api_config_source:
                      api_type: GRPC
                      transport_api_version: V3
                      grpc_services:
                        - envoy_grpc: { cluster_name: sds-grpc }
```
- `tls_certificate_sds_secret_configs` retrieves the workload’s X.509 SVID.
- `validation_context_sds_secret_config` retrieves the trust bundle (the CA certificates of the named trust domain; bundles of federated domains have to be requested as well before cross-domain peers validate).
- `require_client_certificate: true` enforces client mTLS on inbound.
Envoy exposes peer identity to upstream via dynamic metadata and request headers (e.g., XFCC), which we’ll use with OPA later.
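If an application behind Envoy reads the caller identity from XFCC, it has to parse the header with care, because values can be quoted and can contain separators. The following is a minimal sketch in Go; the function names are hypothetical, the sample `Hash` value is made up, and it assumes the nearest proxy's element is the last one in the header. Only trust XFCC when your own proxy sets or sanitizes it.

```go
package main

import (
	"fmt"
	"strings"
)

// splitUnquoted splits s on sep, ignoring separators inside double quotes.
func splitUnquoted(s string, sep byte) []string {
	var parts []string
	inQuotes := false
	start := 0
	for i := 0; i < len(s); i++ {
		switch {
		case s[i] == '\\' && inQuotes && i+1 < len(s):
			i++ // skip the escaped character
		case s[i] == '"':
			inQuotes = !inQuotes
		case s[i] == sep && !inQuotes:
			parts = append(parts, s[start:i])
			start = i + 1
		}
	}
	return append(parts, s[start:])
}

// peerURIs returns the URI values from the last XFCC element.
func peerURIs(xfcc string) []string {
	elements := splitUnquoted(xfcc, ',')
	last := elements[len(elements)-1]
	var uris []string
	for _, pair := range splitUnquoted(last, ';') {
		key, value, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if ok && strings.EqualFold(key, "URI") {
			uris = append(uris, strings.Trim(value, `"`))
		}
	}
	return uris
}

func main() {
	h := `By=spiffe://prod.example.internal/ns/orders/sa/orders-api;Hash=abc123;` +
		`Subject="CN=checkout,O=Example";URI=spiffe://prod.example.internal/ns/payments/sa/checkout-api`
	fmt.Println(peerURIs(h))
}
```

The quoted `Subject` in the sample contains a comma, which is why a plain `strings.Split` on `,` would misread the header.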
Step 5: Cross‑Cluster and Multi‑Cloud mTLS
When services in different trust domains must talk (prod‑US East to prod‑EU West), enable SPIFFE federation. Each SPIRE Server hosts a bundle endpoint (HTTPS with Web PKI cert). Servers are configured to fetch and trust each other’s bundles.
Server federation snippet:
```hcl
# Goes inside the server { } block
federation {
  bundle_endpoint {
    address = "0.0.0.0"
    port    = 8443
    acme { ... }
  }

  federates_with "prod-eu-west.example.internal" {
    bundle_endpoint_url = "https://bundle.prod-eu-west.example.internal:8443"
    # https_web trusts the endpoint through Web PKI; https_spiffe pins the endpoint's SPIFFE ID instead
    bundle_endpoint_profile "https_web" {}
  }
}
```
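Once the servers are federated, add `-federatesWith spiffe://prod-eu-west.example.internal` to the registration entries that need cross-domain calls, and have Envoy’s validation context request the federated bundle alongside the local one. Policies can then authorize cross-domain calls using the caller’s SPIFFE ID, e.g., allowing `spiffe://prod-eu-west.example.internal/ns/inventory/sa/indexer` to call the US orders API.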
Step 6: IAM Federation (AWS, GCP, Azure) via SPIRE OIDC
Eventually, workloads need cloud APIs. Ditch long‑lived cloud access keys. Use JWT‑SVIDs and standard OIDC flows:
- Run SPIRE OIDC Discovery Provider (ODP). It exposes `/.well-known/openid-configuration` and a JWKS endpoint using SPIRE as issuer.
- Configure a custom OIDC provider in AWS IAM, GCP Workload Identity Federation, or Microsoft Entra ID to trust your ODP issuer.
- Write IAM conditions matching SPIFFE IDs in the `sub` or custom claim.
ODP example:
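```hcl
# oidc-discovery-provider.conf
log_level = "INFO"

# Host names the provider answers for; must match the issuer host
domains = ["oidc.prod.example.internal"]

# Plain HTTP listener; terminate TLS at an ingress/controller in front
insecure_addr = ":8444"

# Read the JWT signing keys from the local SPIRE Agent
workload_api {
  socket_path  = "/run/spire/sockets/agent.sock"
  trust_domain = "prod.example.internal"
}

# The issuer claim is set on the SPIRE Server (jwt_issuer = "https://oidc.prod.example.internal"
# in the server { } block). The audience ("sts.amazonaws.com", etc.) is chosen by the
# workload when it requests a JWT-SVID, not configured here.
```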
AWS example: create an IAM OIDC provider using the issuer and JWKS URL from ODP. Then a role trust policy like:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.prod.example.internal"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.prod.example.internal:aud": "sts.amazonaws.com",
          "oidc.prod.example.internal:sub": "spiffe://prod.example.internal/ns/payments/sa/checkout-api"
        }
      }
    }
  ]
}
```
The workload asks SPIRE Agent for a JWT‑SVID and exchanges it using STS AssumeRoleWithWebIdentity. No static cloud keys on disk.
GCP and Azure offer similar federation. Use audience scoping and SPIFFE IDs in subject conditions to maintain least privilege.
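When a federation attempt is rejected, the first thing to check is what the token actually claims. The sketch below decodes the `sub` and `aud` claims of a JWT without verifying the signature, so it is suitable for debugging only and never for authentication. The names `unverifiedClaims` and `sampleToken` are hypothetical, and the sample token is unsigned and hand-built for the demo.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

type claims struct {
	Sub string          `json:"sub"`
	Aud json.RawMessage `json:"aud"` // may be a string or an array of strings
	Exp int64           `json:"exp"`
}

// audiences normalizes the aud claim to a slice.
func (c claims) audiences() []string {
	var many []string
	if json.Unmarshal(c.Aud, &many) == nil {
		return many
	}
	var one string
	if json.Unmarshal(c.Aud, &one) == nil {
		return []string{one}
	}
	return nil
}

// unverifiedClaims decodes the payload WITHOUT checking the signature.
func unverifiedClaims(token string) (claims, error) {
	var c claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("expected three dot-separated segments")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return c, err
	}
	err = json.Unmarshal(payload, &c)
	return c, err
}

// sampleToken builds an unsigned, illustrative token for the demo.
func sampleToken() string {
	payload := `{"sub":"spiffe://prod.example.internal/ns/payments/sa/checkout-api",` +
		`"aud":["sts.amazonaws.com"],"exp":1767225600}`
	return "e30." + base64.RawURLEncoding.EncodeToString([]byte(payload)) + ".sig"
}

func main() {
	c, err := unverifiedClaims(sampleToken())
	fmt.Println(c.Sub, c.audiences(), err)
}
```

Compare the printed `sub` and `aud` with the conditions in the role trust policy; a mismatch in either is a common cause of a rejected exchange.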
Step 7: Authorization with OPA (Rego) and Envoy ext_authz
mTLS authenticates the peer. You still need a policy engine to decide who can call what. OPA is well‑suited here. Use Envoy’s ext_authz to call OPA with mTLS metadata.
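Envoy’s ext_authz filter passes the peer identity to the authorizer: on an mTLS connection, the check request’s `attributes.source.principal` carries the URI SAN of the client certificate, which is the caller’s SPIFFE ID. Alternatively, forward the SPIFFE ID from the peer cert via `x-forwarded-client-cert` (XFCC). Reading the principal from the check request is the cleaner approach, because it comes from the verified certificate rather than a header.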
Envoy filter (excerpt):
```yaml
http_filters:
  - name: envoy.filters.http.ext_authz
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
      transport_api_version: V3
      with_request_body: { max_request_bytes: 8192, allow_partial_message: true }
      # Fail closed if OPA is unreachable
      failure_mode_allow: false
      grpc_service:
        envoy_grpc: { cluster_name: opa }
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```
OPA running as a sidecar or Deployment listens on gRPC and evaluates Rego. Example policy using the SPIFFE ID and HTTP method/path:
```rego
package envoy.authz

import rego.v1

default allow := false

# Envoy sets source.principal to the URI SAN of the peer certificate (the SPIFFE ID)
spiffe_id := input.attributes.source.principal

method := upper(input.attributes.request.http.method)

# parsed_path is the request path split on "/", e.g. ["v1", "orders"]
path := input.parsed_path

# Payments checkout-api may call orders create endpoint
allow if {
	spiffe_id == "spiffe://prod.example.internal/ns/payments/sa/checkout-api"
	method == "POST"
	path == ["v1", "orders"]
}

# Read endpoints allowed for inventory service across federated trust domain
allow if {
	spiffe_id == "spiffe://prod-eu-west.example.internal/ns/inventory/sa/indexer"
	method == "GET"
	count(path) > 2
	array.slice(path, 0, 2) == ["v1", "orders"]
}
```
Return decisions to Envoy; deny with appropriate 403 if policy fails.
Key point: authorization references an immutable identity (SPIFFE ID), not ephemeral network addresses or API keys.
Step 8: Certificate and CA Rotation
Short‑lived SVIDs are only half the story. You must plan for CA lifecycle and root rotation without downtime.
- SVID rotation: Set X.509 SVID TTLs to 10–60 minutes for high-value services. SPIRE Agent rotates proactively, by default once roughly half of the SVID lifetime has elapsed. Envoy reloads via SDS without connection drops; new handshakes use fresh certs.
- CA rotation: Use SPIRE’s bundle endpoint to distribute new roots and overlap old/new during the rollout. Maintain at least one full TTL where both old and new CAs validate SVIDs.
- Upstream CA rotation: If using Vault/AWS PCA/GCP CAS, script and simulate root rollover well in advance; SPIRE picks up new intermediates via the UpstreamAuthority plugin. Monitor bundle freshness.
- Backdating and clock skew: Configure 1–5 minutes of NotBefore skew to tolerate small time drift. Enforce NTP across nodes.
Checklist:
- Run synthetic traffic during CA rotations.
- Track SVID expiration histograms and bundle freshness via Prometheus.
- Ensure long‑lived HTTP/2 gRPC connections are recycled periodically so new certs are used.
Step 9: Phased Rollout Plan (Battle‑Tested)
Start with minimal blast radius and iterate.
Phase 0 – Prereqs
- Assign trust domains and DNS names for ODP and bundle endpoints.
- Deploy SPIRE Server/Agent in one non‑critical cluster; integrate with upstream CA.
- Enable telemetry and logging sinks.
Phase 1 – Intra‑namespace mTLS
- Pick 2–3 services within a namespace. Issue identities and wire Envoy SDS.
- Run inbound Envoy in permissive mode if needed (accept plain + mTLS) while clients migrate.
- Verify identity propagation and OPA policies.
Phase 2 – Namespace‑wide
- Register entries for all service accounts in the namespace.
- Enforce require_client_certificate on inbound.
- Stop requiring API keys; keep them as a fallback for a week with dual-auth logging, then remove the checks.
Phase 3 – Cross‑namespace and mesh integration
- Expand across the cluster. If you have a service mesh (Istio, Consul, Linkerd), choose: either delegate mTLS to mesh CA via SPIRE as external CA, or keep Envoy SDS directly from SPIRE. Avoid dual CAs.
Phase 4 – Federation and Multi‑Cloud
- Enable SPIRE federation across clusters and clouds.
- Replace cloud access keys with OIDC federation via ODP.
Phase 5 – Decommission legacy secrets
- Rotate out API keys. Lock down endpoints to mTLS‑only. Remove key material from vaults, repos, and CI.
Performance Tradeoffs and Capacity Planning
The common fear is that mTLS and frequent rotation will add unacceptable latency and CPU load. In practice, when tuned, the overhead is modest.
Data points and rules of thumb (as of 2025, modern x86/arm64 CPUs):
- TLS 1.3 ECDSA P‑256 handshake adds roughly 0.3–1.5 ms CPU time on each side under no contention; network latency dominates at micro‑service hop scales.
- Session resumption and HTTP/2 connection pooling amortize handshake cost; steady‑state connections rarely re‑handshake.
- SVID TTL of 10–30 minutes with proactive rotation changes certs but does not force connection drops; Envoy hot‑swaps certs. Long‑lived connections should be gently recycled to pick up new certs within a grace window.
- SPIRE Agent CPU is typically low: tens of millicores per node under hundreds of workloads, spikes during attestations/rotations. Profile with Prometheus.
Tuning tips:
- Prefer ECDSA P‑256 certs over RSA for lower CPU and smaller cert sizes.
- Cap max concurrent TLS handshakes per proxy to avoid thundering herds during pod churn.
- Use Envoy circuit breakers and connection pool limits to smooth spikes.
- Keep JWT‑SVID TTLs small (5–10 min) but cache cloud STS tokens for their allowed TTL to reduce STS churn.
Benchmark your actual traffic patterns. Run chaos drills rotating SVID TTLs down to 2–5 minutes temporarily and watch headroom.
Developer Experience: Using SPIFFE in Application Code (When Needed)
The ideal is to keep identity at the transport layer via Envoy. Sometimes you do need programmatic access (e.g., calling a SaaS with mTLS or fetching JWT‑SVID for OIDC):
Go example (using go‑spiffe):
```go
import (
	"context"
	"crypto/tls"
	"net/http"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func mtlsClient() (*http.Client, error) {
	ctx := context.Background()

	// Connects to the Workload API socket named by SPIFFE_ENDPOINT_SOCKET.
	// The source keeps SVIDs and bundles fresh; call source.Close() on shutdown.
	source, err := workloadapi.NewX509Source(ctx)
	if err != nil {
		return nil, err
	}

	// Restrict the peer to the expected SPIFFE ID
	expected, err := spiffeid.FromString("spiffe://prod.example.internal/ns/orders/sa/api")
	if err != nil {
		return nil, err
	}

	// The source supplies both our SVID and the bundle used to verify the peer
	tlsConfig := tlsconfig.MTLSClientConfig(source, source, tlsconfig.AuthorizeID(expected))
	tlsConfig.MinVersion = tls.VersionTLS13

	tr := &http.Transport{TLSClientConfig: tlsConfig}
	return &http.Client{Transport: tr}, nil
}
```
JWT‑SVID for OIDC (Go):
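```go
import (
	"context"

	"github.com/spiffe/go-spiffe/v2/svid/jwtsvid"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func getJwtSvid(aud string) (string, error) {
	ctx := context.Background()

	// Connects to the Workload API socket named by SPIFFE_ENDPOINT_SOCKET
	c, err := workloadapi.New(ctx)
	if err != nil {
		return "", err
	}
	defer c.Close()

	// The audience must match what the relying party expects (e.g., the cloud STS)
	svid, err := c.FetchJWTSVID(ctx, jwtsvid.Params{Audience: aud})
	if err != nil {
		return "", err
	}
	return svid.Marshal(), nil // compact JWT
}
```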
Most teams should avoid embedding SPIFFE logic in every service. Keep it in a shared client library or stick with Envoy.
Pitfalls and How to Avoid Them
- Overly broad registration entries: Don’t give `spiffe://prod/.../ns/*` to an entire cluster. Scope to namespace and service account, add labels where practical.
- Clock skew: Short TTLs and JWT validation fail if clocks drift. Enforce NTP, allow small NotBefore backdating.
- Permissive forever: Use permissive inbound TLS only during migration. Set deadlines to enforce mTLS or it will drag on.
- Mesh double‑CA: If you run Istio or another mesh, choose a single CA source. Use SPIRE as external CA to mesh or bypass mesh mTLS entirely. Two CAs equals confusion.
- SDS miswiring: Point Envoy SDS to the SPIRE Agent socket, not to xDS. Watch logs: 404 on SDS name usually means the secret name doesn’t match your SPIFFE ID or bundle name.
- Long‑lived connections: If you never recycle connections, new certs won’t be used. Configure max connection age in gRPC/Envoy.
- CA rotation gaps: Never remove old root until all workloads have rotated at least once with the new chain.
- Selector drift: Changing k8s labels without updating SPIRE entries will cause identity loss. Treat labels as API; gate via PRs and validations.
- Secure storage: Do not write SVID keypairs to persistent volumes. Keep them in memory via SDS.
- Audit trails: Without central logs, you lose visibility. Emit authentication and authorization decision logs from Envoy/OPA with SPIFFE ID and peer SANs.
Observability: What to Watch
- spire_server: issued SVID count, attestation successes/failures, bundle freshness age.
- spire_agent: SVID rotation count and errors, SDS push metrics.
- Envoy: TLS handshake counts, connection pool sizes, mTLS success vs. failures, XFCC headers.
- OPA: decision latency and deny rates; policy coverage.
- Synthetic canary traffic across trust domains to catch federation regressions.
Alerting suggestions:
- SVIDs well past their expected rotation point (for example, less than a quarter of the TTL remaining) with no replacement issued
- Federation bundle stale beyond 2x refresh interval
- Spike in mTLS handshake failures or authorization denials
Cost and Operational Considerations
- HA SPIRE Server per cluster adds some infra, but not large: typically 2–3 vCPUs and a small Postgres instance can serve thousands of workloads.
- Network egress for ODP and federation endpoints is negligible; cache JWKS in cloud providers where possible.
- Developer impact is small if you centralize Envoy configuration templates and registration entries via GitOps.
Security Hardening Checklist
- Restrict SPIRE Server admin APIs to a private network segment; enforce mTLS on its gRPC.
- Lock SPIRE Agent socket permissions to Envoy and the app user where necessary.
- Use Pod Security Admission to restrict which workloads can mount the agent socket (PodSecurityPolicy was removed in Kubernetes 1.25).
- Keep distinct trust domains per environment; avoid trusting dev in prod.
- SVID TTLs: 10–60 minutes for high‑value, 2–6 hours for low‑risk batch jobs; JWTs 5–15 minutes.
- Prefer ECDSA P‑256; ensure FIPS compliance if required by using suitable crypto providers.
Migration Pattern: From API Keys to SPIFFE
- Dual‑auth window: Require either valid API key or mTLS with a valid SPIFFE ID. Log which path each request used.
- Instrumentation shows readiness: once 99% of calls use mTLS, flip to mTLS‑only.
- External clients: For third‑party integrations, either issue them SPIFFE identities via a gateway or terminate mTLS at an edge that maps to client credentials. Do not tunnel API keys over mTLS internally.
- Cleanup: Revoke and delete API keys; scrub them from repos and CI secrets; add detectors to prevent reintroduction.
A Note on Service Meshes in 2025
Istio, Linkerd, and Consul all support external CAs or SPIRE integration. If you already run a mesh, plugging SPIRE in as the CA gives you SPIFFE‑compliant identities and lifecycle automation, while keeping mesh features (traffic shaping, telemetry). If you don’t need advanced L7 features, Envoy + SPIRE alone is simpler. The mistake is mixing both CAs; pick one root of trust.
Troubleshooting Playbook
- 403 from Envoy ext_authz: inspect OPA decision logs; check that the peer SPIFFE ID matches policy. Confirm Envoy sees the SPIFFE ID in peer SANs.
- TLS handshake fails: verify the validation context bundle includes the caller’s trust domain; confirm federation is configured and bundles are fresh.
- Workload lacks SVID: run `spire-agent api fetch x509 -socketPath /run/spire/sockets/agent.sock` from inside the pod (the agent socket must be mounted there) to see which SVIDs, if any, the workload receives; confirm registration entry selectors match labels, namespace, and service account.
- SDS secret not found: confirm SDS secret names match. For bundles, use the trust domain ID (e.g., `spiffe://prod.example.internal`); for SVIDs, use the SPIFFE ID string.
- OIDC federation errors: check audience and issuer URLs; confirm clock sync; verify provider trusts your issuer and JWKS.
Example: End‑to‑End Minimal Blueprint
- Deploy SPIRE Server and Agent in the payments cluster with trust domain `prod.example.internal`.
- Create entries for two services:

```bash
# -parentID assumes a node alias entry for this cluster's agents already exists

# checkout-api
spire-server entry create \
  -parentID spiffe://prod.example.internal/ns/spire/sa/spire-agent \
  -spiffeID spiffe://prod.example.internal/ns/payments/sa/checkout-api \
  -selector k8s:ns:payments -selector k8s:sa:checkout-api -selector k8s:pod-label:app:checkout \
  -x509SVIDTTL 1800 -jwtSVIDTTL 600

# orders-api
spire-server entry create \
  -parentID spiffe://prod.example.internal/ns/spire/sa/spire-agent \
  -spiffeID spiffe://prod.example.internal/ns/orders/sa/orders-api \
  -selector k8s:ns:orders -selector k8s:sa:orders-api -selector k8s:pod-label:app:orders \
  -x509SVIDTTL 1800 -jwtSVIDTTL 600
```

- Sidecar Envoy for both services with SDS to SPIRE Agent.
- Inbound Envoy requires client cert; OPA policy allows `checkout-api` to call `/v1/orders` POST.
- Enable ODP and set up AWS IAM role for `checkout-api` with `sub = spiffe://prod.example.internal/ns/payments/sa/checkout-api`.
- Remove legacy API key validation from orders once traffic proves reliance on mTLS + policy.
Opinionated Recommendations (So You Don’t Have To Learn the Hard Way)
- Don’t use a single global trust domain. Per‑environment domains make rollbacks and rotations survivable.
- Keep SVID TTLs short enough to matter (<= 60 minutes for hot paths). Longer TTLs are basically static certs.
- Push identity to the transport layer. Avoid per‑service ad‑hoc JWT validation unless it’s for external calls.
- Invest early in observability. You’ll need it during the first few rotations and federations.
- Treat k8s labels as identity inputs; create a contract around them with CI checks.
Conclusion
By 2025, keeping API keys is a choice, not a constraint. SPIFFE/SPIRE turns identity from a static secret into a verifiable, short‑lived property of workloads—automated, observable, and portable across Kubernetes and clouds. With Envoy SDS, OIDC federation, and OPA policy, you can implement end‑to‑end zero‑trust mTLS that scales with your architecture and reduces risk.
Start small, measure, and iterate. In a quarter, you can be done rotating API keys forever—and your future incident retros will thank you.
Further resources:
- SPIFFE: https://spiffe.io
- SPIRE: https://spiffe.io/spire/
- go‑spiffe: https://github.com/spiffe/go-spiffe
- Envoy SDS: https://www.envoyproxy.io/docs/envoy/latest/configuration/security/secret#sds-secrets
- OPA/Envoy: https://www.openpolicyagent.org/docs/latest/envoy-introduction/
- AWS OIDC federation: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html
- GCP Workload Identity Federation: https://cloud.google.com/iam/docs/workload-identity-federation
- Azure federated credentials: https://learn.microsoft.com/azure/active-directory/develop/workload-identity-federation