Sandboxing Production: How to Give a Debug AI Live Kubernetes Access Safely
A wave of platform teams is experimenting with AI copilots that can triage incidents, query observability systems, and even poke around in containers to validate hypotheses. The upside is clear: faster mean time to diagnosis (MTTD) and less ops toil. The risk is just as clear: if a model holds production credentials, a prompt injection or misfired tool call could make a bad day catastrophic.
This article gives you a concrete, production-grade blueprint for granting a debug AI controlled, auditable access to Kubernetes logs, exec, and metrics—via MCP/tooling—backed by hardened RBAC, admission control, API rate limiting, session recording, and a human-guarded break-glass path. The goal is surgical, explainable access that improves incident response without materially increasing outage risk.
We’ll cover the threat model, a reference architecture, YAML you can adapt, policy snippets, and practical trade-offs on EKS/GKE/AKS.
TL;DR
- Do not hand a model a kubeconfig. Put a tightly-scoped gateway between the AI and the cluster.
- Grant read-mostly access: logs, events, metrics, and a gated exec path. Deny all writes by default.
- Enforce policy with RBAC plus a ValidatingAdmissionWebhook/OPA Gatekeeper to constrain exec to labeled namespaces/pods and identities.
- Record everything: Kubernetes audit logs at RequestResponse for exec/logs, and a custom exec proxy that captures I/O.
- Rate limit with API Priority and Fairness; use time-bound tokens and per-session quotas.
- For break-glass writes, require human approval to mint a short-lived, limited-scope credential. Log and expire it automatically.
Why give a model live access at all?
The diagnostic loop often needs ground truth:
- Logs for correlated errors and backtraces
- Events for scheduling issues and restarts
- Metrics for saturation, queuing, and spikes
- Exec to run targeted, read-only commands like `cat /proc/meminfo`, `ss -tan`, or `curl localhost:8080/healthz`
If you can wire a model to these signals—and constrain it so it cannot mutate workloads—you can dramatically reduce time to resolution without paging humans for every probe. But guardrails are non-negotiable.
Threat model and design principles
Threats to mitigate:
- Prompt injection causing destructive tool calls (e.g., `kubectl delete pod`)
- Credential theft or lateral movement from the AI runtime
- Accidental high-cost queries (cardinality explosions on metrics, giant log streams)
- Sensitive-data exfiltration from logs/exec output
- Abuse from compromised identities or noisy automation during incidents
Design principles:
- Least privilege: deny writes by default; permit minimal read actions only
- Time scoping: short-lived tokens; ephemeral break-glass permissions
- Human-in-the-loop for any escalation; no unattended write access
- Full auditability: record requests, tool decisions, exec I/O
- Policy enforcement in multiple layers (RBAC + admission + network + APF)
- Blast-radius control: namespace-level scoping and label-based allowlists
Reference architecture
Components:
- AI/agent: the model or orchestration that plans and calls tools
- MCP server (Model Context Protocol) or equivalent tool gateway: the only bridge between AI and cluster; enforces policy and recording
- Kubernetes API and cluster: production target(s)
- Policy engine: RBAC, admission control (Gatekeeper/OPA), API Priority and Fairness (APF)
- Observability backends: Prometheus, Cloud Logging, etc.
- Storage: immutable logs for audits (e.g., S3/GCS with object lock)
Flow:
- AI asks the MCP server to fetch logs, run a read-only exec, or query metrics.
- MCP server authenticates with a short-lived SA token, applies rate limits and authorization checks, and calls the Kubernetes API.
- ValidatingAdmissionWebhook ensures only permitted targets and subresources are accessed by the AI identity.
- Kubernetes audit logs capture all requests; the MCP server additionally records exec I/O.
- If the AI requests break-glass operations (e.g., eviction), it must submit a justification. A human approval workflow mints an ephemeral credential and binds a limited ClusterRole for a short TTL. Every action is logged.
Access patterns and their risks
- Logs (pods/log): low risk if redacted; unbounded volume can be expensive and leak secrets
- Events: low risk; useful for scheduling/CrashLoopBackOff
- Metrics: low to medium risk; cardinality and expensive queries can cause load; keep read-only tokens
- Exec (pods/exec): medium risk; must be read-only shell usage; restrict commands and targets; record session
- Port-forward: higher risk, opens new channels; avoid or guard heavily
We’ll implement logs/events/metrics and a gated exec; block port-forward entirely.
Kubernetes identity, RBAC, and scoping
We’ll create a dedicated ServiceAccount in an ops namespace and bind it to a narrow ClusterRole. Use OIDC or native ServiceAccount tokens with a short TTL via the TokenRequest API.
Example YAML:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ai-ops
  labels:
    pod-security.kubernetes.io/enforce: baseline
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: debug-ai
  namespace: ai-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: debug-ai-readonly
rules:
  # Read pod metadata and status across selected namespaces
  - apiGroups: [""]
    resources: ["pods", "pods/status", "namespaces", "nodes", "events"]
    verbs: ["get", "list", "watch"]
  # Read pod logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  # Allow creating exec sessions (subresource) only;
  # actual constraints enforced by the admission webhook
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  # Port-forward is disallowed by omission
  # Read deployments/replicas for context
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: debug-ai-metrics-readonly
rules:
  # Access metrics APIs via aggregated servers (customize for your setup)
  - apiGroups: ["metrics.k8s.io"]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
  # If using Prometheus, prefer a read-only HTTP token outside K8s RBAC
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: debug-ai-readonly-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: debug-ai-readonly
subjects:
  - kind: ServiceAccount
    name: debug-ai
    namespace: ai-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: debug-ai-metrics-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: debug-ai-metrics-readonly
subjects:
  - kind: ServiceAccount
    name: debug-ai
    namespace: ai-ops
```
Important notes:
- RBAC cannot restrict by label or command for subresources like exec. Use admission policies for fine-grained constraints.
- Keep the ClusterRole minimal. If your AI never needs node reads, drop them.
Admission control for fine-grained guardrails
Use OPA Gatekeeper or a custom ValidatingAdmissionWebhook to restrict exec/logs by namespace labels, pod annotations, and identity. Example: allow exec only to pods in namespaces labeled debug-ai-exec=enabled and deny shell-like commands unless they match an allowlist.
Gatekeeper takes a ConstraintTemplate with Rego and a Constraint resource:
```yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sallowedexec
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedExec
      validation:
        openAPIV3Schema:
          properties:
            allowedServiceAccounts:
              type: array
              items: { type: string }
            allowedCommands:
              type: array
              items: { type: string }
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedexec

        allowed_sas := {sa | sa := input.parameters.allowedServiceAccounts[_]}
        allowed_cmds := {c | c := input.parameters.allowedCommands[_]}

        # Match pods/exec requests (the API server sends these as CONNECT operations)
        is_exec {
          input.review.operation == "CONNECT"
          input.review.resource.resource == "pods"
          input.review.subResource == "exec"
        }

        # Namespace must be opted in with the debug-ai-exec label
        # (requires Gatekeeper sync of Namespaces into data.inventory)
        ns_ok {
          ns := data.inventory.cluster["v1"]["Namespace"][input.review.namespace]
          ns.metadata.labels["debug-ai-exec"] == "enabled"
        }

        sa_ok {
          allowed_sas[input.review.userInfo.username]
        }

        # Every requested command must be on the allowlist; for exec the
        # command appears in the PodExecOptions object of the review
        cmds_ok {
          cmds := {c | c := input.review.object.command[_]}
          count(cmds) > 0
          count(cmds - allowed_cmds) == 0
        }

        violation[{"msg": msg}] {
          is_exec
          not ns_ok
          msg := sprintf("namespace %v is not enabled for debug-ai exec", [input.review.namespace])
        }

        violation[{"msg": msg}] {
          is_exec
          not sa_ok
          msg := sprintf("identity %v may not exec", [input.review.userInfo.username])
        }

        violation[{"msg": msg}] {
          is_exec
          not cmds_ok
          msg := "exec command is not on the allowlist"
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedExec
metadata:
  name: debug-ai-exec-policy
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    allowedServiceAccounts:
      - "system:serviceaccount:ai-ops:debug-ai"
    allowedCommands:
      - "/bin/cat"
      - "/bin/ls"
      - "/usr/bin/curl"
      - "/bin/echo"
      - "/usr/bin/ss"
```
Caveats:
- Exec admission requests arrive as CONNECT operations carrying a PodExecOptions object; depending on version, policy code may need to inspect the request URI instead of the review object. Validate against your API server version and Gatekeeper’s input structure.
- Gatekeeper’s webhook intercepts CREATE and UPDATE operations by default, so you must extend its ValidatingWebhookConfiguration to cover CONNECT on pods/exec.
- If you need deeper inspection or to block patterns, consider a custom webhook; a minimal sketch appears below.
Also consider a separate policy to deny pods/portforward for the AI identity entirely.
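If you take the custom-webhook route, the handler itself is small. Here is a minimal sketch, assuming the API server is configured (via a ValidatingWebhookConfiguration) to send CONNECT admission reviews for pods/exec to this endpoint; the allowlist, the SA username, and the cert paths are illustrative:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative allowlist; in practice load this from config or OPA.
var allowedCommands = map[string]bool{
	"/bin/cat": true, "/bin/ls": true, "/usr/bin/curl": true, "/usr/bin/ss": true,
}

func validateExec(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed admission review", http.StatusBadRequest)
		return
	}
	req := review.Request
	allowed, reason := true, ""

	// Only police exec subresource requests from the AI identity.
	if req.SubResource == "exec" && req.UserInfo.Username == "system:serviceaccount:ai-ops:debug-ai" {
		var opts corev1.PodExecOptions
		if err := json.Unmarshal(req.Object.Raw, &opts); err != nil {
			allowed, reason = false, "could not decode PodExecOptions"
		} else if len(opts.Command) == 0 {
			allowed, reason = false, "empty command"
		} else if !allowedCommands[opts.Command[0]] {
			allowed, reason = false, "command not on allowlist: "+opts.Command[0]
		}
	}

	review.Response = &admissionv1.AdmissionResponse{
		UID:     req.UID,
		Allowed: allowed,
		Result:  &metav1.Status{Message: reason},
	}
	review.Request = nil // the response does not need to echo the request
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate-exec", validateExec)
	// TLS is mandatory for admission webhooks; cert paths are illustrative.
	log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}
```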
Kubernetes audit logging at RequestResponse
Enable audit logging to capture every logs/exec invocation with parameters. Configure the audit policy to record full request/response bodies for relevant subresources.
Example audit policy:
```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Rules are evaluated in order and the first match wins,
  # so the specific rules must precede the catch-all.
  # Capture logs and exec calls fully
  - level: RequestResponse
    verbs: ["get", "create"]
    resources:
      - group: ""
        resources: ["pods/log", "pods/exec", "pods/attach"]
  # Capture subject access reviews and token requests
  - level: Request
    resources:
      - group: "authorization.k8s.io"
        resources: ["subjectaccessreviews"]
      - group: ""
        resources: ["serviceaccounts", "serviceaccounts/token"]
  # Default minimal logging for everything else
  - level: Metadata
```
Notes:
- On EKS, enable audit and control plane logs to CloudWatch; on GKE, enable Audit Logs to Cloud Logging; on AKS, use Diagnostic Settings.
- Store audit logs with immutability (e.g., S3 Object Lock, WORM retention) for forensics.
API Priority and Fairness (APF) rate limiting
Throttle the AI identity to protect the API server during incidents.
```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: ai-low
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 20
    limitResponse:
      type: Queued
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 8
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: ai-debug
spec:
  priorityLevelConfiguration:
    name: ai-low
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 9000
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: debug-ai
            namespace: ai-ops
      resourceRules:
        - verbs: ["get", "list", "watch", "create"]
          apiGroups: ["", "apps", "metrics.k8s.io"]
          # pods/nodes under metrics.k8s.io are covered by the apiGroups entry
          resources: ["pods", "pods/log", "pods/exec", "events", "deployments", "nodes"]
```
Tune concurrency and queue lengths to your scale. Consider cluster-wide APF baselines so the AI cannot starve control-plane traffic.
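APF protects the API server, but per-session quotas belong in the gateway itself so a runaway session is throttled before its requests ever leave the MCP process. A sketch using `golang.org/x/time/rate`; the per-session rates are illustrative:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// sessionLimiter hands out one token-bucket limiter per AI session so a
// single noisy session cannot consume the whole gateway budget.
type sessionLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func newSessionLimiter(rps float64, burst int) *sessionLimiter {
	return &sessionLimiter{
		limiters: make(map[string]*rate.Limiter),
		rps:      rate.Limit(rps),
		burst:    burst,
	}
}

func (s *sessionLimiter) allow(sessionID string) bool {
	s.mu.Lock()
	l, ok := s.limiters[sessionID]
	if !ok {
		l = rate.NewLimiter(s.rps, s.burst)
		s.limiters[sessionID] = l
	}
	s.mu.Unlock()
	return l.Allow()
}

func main() {
	lim := newSessionLimiter(2, 5) // 2 req/s steady state, burst of 5 (illustrative)
	for i := 0; i < 8; i++ {
		fmt.Println("request", i, "allowed:", lim.allow("session-123"))
	}
}
```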
Session recording for exec
Kubernetes audit logs record the exec request and command, but not I/O streams. For robust forensics, route exec through a proxy that records stdin/stdout/stderr and metadata.
Approaches:
- Use an exec gateway written in Go that terminates the exec request, tees the streams to object storage, and forwards to the API server using client-go’s `remotecommand` package.
- Use a commercial platform like Teleport’s Kubernetes Access, which natively records exec sessions and supports approvals and RBAC.
Minimal Go proxy sketch:
```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

func execHandler(client kubernetes.Interface, cfg *rest.Config) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Parse params: namespace, pod, container, command[] from the query.
		ns := r.URL.Query().Get("namespace")
		pod := r.URL.Query().Get("pod")
		container := r.URL.Query().Get("container")
		cmd := r.URL.Query()["command"]

		// Policy: check allowlist for cmd, namespace labels, identity, etc.
		if !policyAllows(r.Context(), ns, pod, container, cmd) {
			http.Error(w, "policy denied", http.StatusForbidden)
			return
		}

		// Build the upstream exec request against the API server.
		req := client.CoreV1().RESTClient().Post().
			Resource("pods").
			Name(pod).
			Namespace(ns).
			SubResource("exec")
		req.VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   cmd,
			Stdin:     false,
			Stdout:    true,
			Stderr:    true,
			TTY:       false,
		}, scheme.ParameterCodec)

		executor, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// Recording setup: tee both streams to per-session files.
		sessionID := fmt.Sprintf("%d-%s-%s", time.Now().UnixNano(), ns, pod)
		outFile, err := os.Create("/records/" + sessionID + "-stdout.log")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer outFile.Close()
		errFile, err := os.Create("/records/" + sessionID + "-stderr.log")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer errFile.Close()

		// Stream exec output to both the caller and the recording files.
		err = executor.StreamWithContext(r.Context(), remotecommand.StreamOptions{
			Stdout: io.MultiWriter(outFile, w),
			Stderr: io.MultiWriter(errFile, w),
			Tty:    false,
		})
		if err != nil {
			log.Printf("exec error: %v", err)
		}
	}
}

func policyAllows(ctx context.Context, ns, pod, container string, cmd []string) bool {
	// TODO: call OPA or check config; ensure read-only commands only.
	return true
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/exec", execHandler(client, cfg))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
This proxy becomes the only path the AI can use for exec. Keep exec-capable credentials exclusive to the proxy, and, if the AI runtime runs in-cluster, add a network policy that blocks its direct egress to the API server.
MCP tool definitions: logs, exec, metrics
Define strict tools with schemas to limit what the AI can ask for.
Example MCP tools (conceptual):
json[ { "name": "k8s_get_logs", "description": "Fetch recent logs for a pod container.", "input_schema": { "type": "object", "properties": { "namespace": {"type": "string"}, "pod": {"type": "string"}, "container": {"type": "string"}, "since_seconds": {"type": "integer", "minimum": 1, "maximum": 3600}, "tail_lines": {"type": "integer", "minimum": 1, "maximum": 2000} }, "required": ["namespace", "pod"] } }, { "name": "k8s_exec_readonly", "description": "Run an allowlisted, read-only command in a pod container.", "input_schema": { "type": "object", "properties": { "namespace": {"type": "string"}, "pod": {"type": "string"}, "container": {"type": "string"}, "command": {"type": "array", "items": {"type": "string"}, "minItems": 1, "maxItems": 6} }, "required": ["namespace", "pod", "command"] } }, { "name": "prom_query", "description": "Run a read-only PromQL query against the metrics API.", "input_schema": { "type": "object", "properties": { "query": {"type": "string", "maxLength": 1000}, "range_seconds": {"type": "integer", "minimum": 60, "maximum": 3600}, "step_seconds": {"type": "integer", "minimum": 5, "maximum": 60} }, "required": ["query"] } } ]
Server-side enforcement is critical: even if the schema is tight, never trust the model; validate every field against policy.
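For example, here is a sketch of server-side validation for the `k8s_get_logs` input, re-checking every field the schema already constrains and adding a namespace allowlist the schema cannot express (limits mirror the schema above; the allowlist is illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// LogsRequest mirrors the k8s_get_logs tool input.
type LogsRequest struct {
	Namespace    string `json:"namespace"`
	Pod          string `json:"pod"`
	Container    string `json:"container,omitempty"`
	SinceSeconds int    `json:"since_seconds,omitempty"`
	TailLines    int    `json:"tail_lines,omitempty"`
}

var dns1123 = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validate re-applies the schema limits server-side and adds checks the
// schema cannot express (here, a namespace allowlist).
func (r *LogsRequest) validate(allowedNamespaces map[string]bool) error {
	if !dns1123.MatchString(r.Namespace) || !dns1123.MatchString(r.Pod) {
		return errors.New("namespace and pod must be valid DNS-1123 names")
	}
	if !allowedNamespaces[r.Namespace] {
		return fmt.Errorf("namespace %q is not enabled for debug-ai access", r.Namespace)
	}
	if r.SinceSeconds < 0 || r.SinceSeconds > 3600 {
		return errors.New("since_seconds must be in [0, 3600]")
	}
	if r.TailLines < 0 || r.TailLines > 2000 {
		return errors.New("tail_lines must be in [0, 2000]")
	}
	return nil
}

func main() {
	req := LogsRequest{Namespace: "prod", Pod: "api-7d9f", TailLines: 100000}
	err := req.validate(map[string]bool{"prod": true})
	fmt.Println(err) // tail_lines must be in [0, 2000]
}
```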
Implementing logs and events with guardrails
Use the Kubernetes API for logs. The MCP server should:
- Enforce `sinceSeconds` and `tailLines` maximums
- Support `previous` to fetch logs from the last crashed container
- Redact likely secrets with a configurable regex and entropy detector
Example code notes:
- Use `clientset.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{Container: ..., SinceSeconds: ..., TailLines: ...}).Stream(ctx)` (see the sketch below)
- Stream to the response and tee to immutable storage for audit if volume is small; otherwise log metadata and a hash, not full logs
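A sketch of the guarded log fetch, assuming Go 1.21+ and caps matching the tool schema (the 4 MiB response cap is illustrative):

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/ptr"
)

// fetchLogs clamps the caller's options to hard caps before streaming.
func fetchLogs(ctx context.Context, cs kubernetes.Interface, ns, pod, container string,
	sinceSeconds, tailLines int64, w io.Writer) error {
	opts := &corev1.PodLogOptions{
		Container:    container,
		SinceSeconds: ptr.To(min(sinceSeconds, 3600)),
		TailLines:    ptr.To(min(tailLines, 2000)),
	}
	stream, err := cs.CoreV1().Pods(ns).GetLogs(pod, opts).Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()
	// Cap the response size as well; 4 MiB is an illustrative limit.
	_, err = io.Copy(w, io.LimitReader(stream, 4<<20))
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = fetchLogs(context.Background(), cs, "prod", "api-7d9f", "app", 600, 500, os.Stdout)
}
```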
Events are low-risk but high-value for diagnosis: always allow list/watch for the target namespaces.
Metrics: prefer a read-only Prometheus token
Avoid hitting the apiserver for metrics beyond the metrics.k8s.io API. Use Prometheus’ HTTP API with a read-only, low-scope token.
- Put Prometheus behind an internal gateway enforcing rate limits
- Enforce max range and step; reject queries with label value wildcards that explode cardinality
- Predefine safe queries for common SLOs and resource saturation (CPU throttling, restarts, 5xx rates)
Example reverse proxy policies (pseudo):
- Reject queries where estimated series > 50k
- Cap duration to 30 minutes and step >= 10s
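A sketch of that gate in Go using the upstream PromQL parser package; the caps mirror the pseudo-policies above, and the wildcard check is illustrative rather than exhaustive:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/prometheus/promql/parser"
)

// gate rejects queries that exceed range/step caps or use wildcard regex
// matchers that tend to explode cardinality.
func gate(query string, rng, step time.Duration) error {
	if rng > 30*time.Minute {
		return fmt.Errorf("range %s exceeds 30m cap", rng)
	}
	if step < 10*time.Second {
		return fmt.Errorf("step %s below 10s floor", step)
	}
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return fmt.Errorf("unparseable query: %w", err)
	}
	var bad error
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if vs, ok := node.(*parser.VectorSelector); ok {
			for _, m := range vs.LabelMatchers {
				if m.Type.String() == "=~" && (m.Value == ".*" || m.Value == ".+") {
					bad = fmt.Errorf("wildcard matcher on %q rejected", m.Name)
				}
			}
		}
		return nil
	})
	return bad
}

func main() {
	err := gate(`sum(rate(http_requests_total{service=~".*"}[5m]))`, 15*time.Minute, 15*time.Second)
	fmt.Println(err) // wildcard matcher on "service" rejected
}
```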
Secret handling and data minimization
- Do not grant access to Secrets, ConfigMaps, or volumes
- Add a redaction filter in logs/exec output: mask JWTs, OAuth tokens, AWS keys, and high-entropy strings
- Provide a “show sensitive” toggle for human operators only; never show raw to the model
Sample redaction regex set:
- `AKIA[0-9A-Z]{16}` (AWS Access Key ID)
- `(?i)(secret|password|token)` followed by value-like patterns
- Bearer tokens: `(?i)authorization:\s*bearer\s+[a-z0-9-_\.]+`
Complement regex with entropy-based detection to reduce false negatives.
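A combined sketch along these lines; the patterns follow the set above, and the entropy cutoff and minimum token length are illustrative values you should tune against your own log corpus:

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

var patterns = []*regexp.Regexp{
	regexp.MustCompile(`AKIA[0-9A-Z]{16}`),                           // AWS Access Key IDs
	regexp.MustCompile(`(?i)authorization:\s*bearer\s+[a-z0-9._-]+`), // bearer tokens
	regexp.MustCompile(`(?i)(secret|password|token)["']?\s*[:=]\s*\S+`),
}

// entropy returns the Shannon entropy of s in bits per character.
func entropy(s string) float64 {
	counts := map[rune]float64{}
	var n float64
	for _, r := range s {
		counts[r]++
		n++
	}
	var h float64
	for _, c := range counts {
		p := c / n
		h -= p * math.Log2(p)
	}
	return h
}

// Redact masks known secret patterns, then masks long high-entropy tokens
// the regexes missed. Note it normalizes whitespace between fields.
func Redact(line string) string {
	for _, p := range patterns {
		line = p.ReplaceAllString(line, "[REDACTED]")
	}
	fields := strings.Fields(line)
	for i, f := range fields {
		if len(f) >= 20 && entropy(f) > 4.5 { // illustrative threshold
			fields[i] = "[REDACTED:entropy]"
		}
	}
	return strings.Join(fields, " ")
}

func main() {
	fmt.Println(Redact("aws key AKIAABCDEFGHIJKLMNOP in env"))
}
```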
Token lifetimes and credential hygiene
- Use the ServiceAccount TokenRequest API to mint short-lived tokens per session (e.g., 10 minutes); do not store tokens in long-lived config
- Rotate the AI SA token and revoke on anomalies
- Restrict the MCP server’s network egress to only the Kubernetes API and Prometheus; block cloud metadata endpoints
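Per-session minting through the TokenRequest API is a few lines with client-go. A sketch; the audience and TTL are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"log"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/ptr"
)

// mintSessionToken requests a 10-minute token for the debug-ai SA,
// one per AI session, instead of any long-lived kubeconfig.
func mintSessionToken(ctx context.Context, cs kubernetes.Interface) (string, error) {
	tr := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{
			Audiences:         []string{"https://kubernetes.default.svc"},
			ExpirationSeconds: ptr.To[int64](600), // 10 minutes
		},
	}
	resp, err := cs.CoreV1().ServiceAccounts("ai-ops").
		CreateToken(ctx, "debug-ai", tr, metav1.CreateOptions{})
	if err != nil {
		return "", err
	}
	return resp.Status.Token, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	tok, err := mintSessionToken(context.Background(), cs)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("session token minted; expires in 10m; length:", len(tok))
}
```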
Break-glass: human-approved, ephemeral, minimal writes
Sometimes diagnostics require a nudge: evict a wedged pod, cordon a node, or restart a deployment. Treat this as an exception with explicit human approval.
Pattern:
- AI proposes an action with justification and expected blast radius
- Human reviews in chat/console and approves
- Controller mints a short-lived token bound to a dedicated ClusterRole and SA via a RoleBinding with TTL
- MCP server swaps to the break-glass credential for a single action; session is fully recorded
Minimal write ClusterRole (example):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: debug-ai-breakglass
rules:
  # Allow evicting pods only (graceful removal via the eviction API;
  # the eviction subresource lives on pods in the core group)
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  # Allow rollout restarts on specific deployments via patch
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
    resourceNames: ["my-critical-deployment"]
  # Node cordon/uncordon (optional)
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]
```
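For reference, the one eviction this role permits looks like this through client-go, a sketch assuming the MCP server has already swapped to the short-lived break-glass credential (pod and namespace names are illustrative):

```go
package main

import (
	"context"
	"log"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes cfg carries the short-lived break-glass token,
	// not the everyday read-only identity.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Graceful eviction honors PodDisruptionBudgets, unlike a raw delete.
	evict := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-abc", Namespace: "prod"},
	}
	if err := cs.CoreV1().Pods("prod").EvictV1(context.Background(), evict); err != nil {
		log.Fatalf("eviction denied or failed: %v", err)
	}
	log.Println("pod-abc evicted; the binding expires with its TTL")
}
```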
Ephemeral binding via a controller that watches a CRD:
```yaml
apiVersion: ops.example.com/v1
kind: BreakGlassRequest
metadata:
  name: bgr-2025-01-incident-123
spec:
  serviceAccountRef:
    name: debug-ai
    namespace: ai-ops
  roleRef: debug-ai-breakglass
  ttlSeconds: 900
  justification: "Evict pod pod-abc in ns prod to unblock stuck volume mount"
  approvers:
    - alice@example.com
    - oncall-ops@example.com
status:
  state: Pending
```
The controller:
- Requires one or more approver signatures (via OIDC or a signed comment in chat)
- Creates a temporary RoleBinding or ClusterRoleBinding
- Mints a short-lived TokenRequest for the SA
- Deletes the binding on TTL expiry
All actions are posted to an audit channel and appended to your SIEM.
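A sketch of the controller’s grant-and-expire core, assuming the request above has already been approved; a production controller would persist state and reconcile rather than rely on an in-process timer:

```go
package main

import (
	"context"
	"log"
	"time"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// grantBreakGlass creates a ClusterRoleBinding for the debug-ai SA and
// removes it after ttl. Deletion of the binding is the revocation.
func grantBreakGlass(ctx context.Context, cs kubernetes.Interface, ttl time.Duration) error {
	crb := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "debug-ai-breakglass-ephemeral",
			Labels: map[string]string{"ops.example.com/break-glass": "true"},
		},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "debug-ai-breakglass",
		},
		Subjects: []rbacv1.Subject{
			{Kind: "ServiceAccount", Name: "debug-ai", Namespace: "ai-ops"},
		},
	}
	if _, err := cs.RbacV1().ClusterRoleBindings().Create(ctx, crb, metav1.CreateOptions{}); err != nil {
		return err
	}
	time.AfterFunc(ttl, func() {
		err := cs.RbacV1().ClusterRoleBindings().Delete(context.Background(),
			crb.Name, metav1.DeleteOptions{})
		if err != nil {
			log.Printf("break-glass cleanup failed, page a human: %v", err)
		}
	})
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := grantBreakGlass(context.Background(), cs, 15*time.Minute); err != nil {
		log.Fatal(err)
	}
	select {} // keep the process alive so the expiry timer can fire
}
```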
Multi-cluster and multi-tenant considerations
- Run one MCP gateway per cluster to isolate blast radius; aggregate results to the AI runtime
- Namespace scoping: let the AI access only namespaces labeled `debug-ai-access=enabled`
- Separate credentials per environment (prod vs. staging) with distinct policies and APF settings
Platform-specific notes
- EKS: enable control plane logs (audit, authenticator) to CloudWatch; consider AWS WAF on Prometheus gateway; use IAM Roles for Service Accounts (IRSA) to limit AWS API exposure from MCP pods
- GKE: use Workload Identity; enable Audit Logs; consider Binary Authorization if you ship the proxy into prod
- AKS: use Managed Identity; enable diagnostic settings for audit; verify APF version and stability
Operational playbook
- Pre-incident: test in staging; run policy unit tests (OPA test fixtures) and chaos drills for rate-limiting and audit coverage
- Incident: AI operates read-only by default; proposes break-glass with a diff of intended changes
- Post-incident: review recorded sessions, adjust allowlists, update runbooks and predefined queries
Example end-to-end flow
- Pager triggers on 5xx spike.
- AI uses `prom_query` with a safe query, `sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)`, capped at a 30m range.
- AI calls `k8s_get_logs` on the top offender pods with `since_seconds=600` and `tail_lines=500`.
- It detects OOMKills via `events` and `LastTerminationState` in pod status.
- It proposes `k8s_exec_readonly` commands `/bin/cat /proc/meminfo` and `/usr/bin/ss -tan`. Gatekeeper permits; the exec proxy records outputs.
- AI concludes a memory leak in revision X and proposes a controlled rollout restart. A human approves break-glass for a single deployment patch. The controller mints a 10-minute token; the MCP server executes the patch and logs everything.
Testing and verification
- Policy tests: OPA unit tests for your Gatekeeper templates; simulate requests with various namespaces and commands
- Load tests: hammer `pods/log` and `/metrics` with APF in place; verify saturation behavior
- Redaction tests: seed logs with faux secrets; assert masks are applied and no plaintext leaks to the AI
- Audit completeness: verify that a sample exec appears in audit logs with RequestResponse and in session storage with matching IDs
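A redaction test sketch, assuming the `Redact` helper from the earlier sketch lives in the same package:

```go
package main

import (
	"strings"
	"testing"
)

// TestRedactMasksSeededSecrets seeds log lines with faux credentials and
// asserts that nothing secret-shaped survives redaction.
func TestRedactMasksSeededSecrets(t *testing.T) {
	cases := []string{
		"env dump: AWS_ACCESS_KEY_ID=AKIAABCDEFGHIJKLMNOP",
		"request header authorization: bearer eyJhbGciOiJIUzI1NiJ9.payload.sig",
		"config password: hunter2-with-extra-entropy-0x9f3a",
	}
	for _, in := range cases {
		out := Redact(in)
		if strings.Contains(out, "AKIA") || strings.Contains(out, "eyJ") {
			t.Errorf("secret leaked through redaction: %q -> %q", in, out)
		}
	}
}
```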
Common pitfalls and how to avoid them
- Assuming RBAC can filter exec by command: it cannot. Use admission policies and a proxy.
- Leaving port-forward enabled: models can unintentionally open tunnels; block subresource entirely.
- Unbounded logs: enforce `sinceSeconds` and `tailLines`; cut off after N MB per call.
- Token sprawl: use TokenRequest per session; expire tokens aggressively; avoid static kubeconfigs.
- Over-permissive Prometheus: run all queries through a gateway with per-query limits and response size caps.
Security and compliance mapping
- Least privilege and separation of duties: dedicated SA, minimal ClusterRole, human-approved escalation
- Auditability: Kubernetes audit logs + exec I/O recording + SIEM integration
- Data minimization: redaction and deny-listing sensitive resources
- Change control: break-glass with approvals and ephemeral bindings
These controls align with SOC 2 CC6/CC7 and ISO 27001 A.9/A.12.
Alternatives and extensions
- Mirror-only mode: ship logs and metrics to a read-only shadow plane and disallow live exec; safer but slower diagnosis
- Out-of-band command runners: run predefined diagnostics as Jobs with controlled inputs instead of arbitrary exec
- Structured diagnostics: encode SRE runbooks as parameterized tools to reduce free-form shell usage
Conclusion
You don’t need to choose between “no prod access” and “here’s a kubeconfig.” A layered, policy-enforced gateway gives a debugging AI the signals it needs—logs, events, metrics, and tightly-scoped exec—without handing it the power to disrupt production.
The core of the blueprint:
- A minimal RBAC identity scoped for reads, not writes
- Admission policies that constrain exec to safe namespaces and allowlisted commands
- A recording exec proxy and RequestResponse audit logging
- Rate limits via APF and short-lived tokens
- A human-reviewed, ephemeral break-glass path with limited write verbs
Start in staging, prove the guardrails, then gradually enable in production namespaces with clear labels and observability. Your SREs get faster triage and deeper context, and your risk posture remains sane.
References
- Kubernetes audit logging: https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/
- API Priority and Fairness: https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
- OPA Gatekeeper: https://github.com/open-policy-agent/gatekeeper
- client-go exec: https://pkg.go.dev/k8s.io/client-go/tools/remotecommand
- Teleport Kubernetes Access (session recording): https://goteleport.com/docs/kubernetes-access/
- TokenRequest API: https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/
