Sandboxing Production: How to Give a Debug AI Live Kubernetes Access Safely
A wave of platform teams is experimenting with AI copilots that can triage incidents, query observability systems, and even poke around in containers to validate hypotheses. The upside is clear: faster mean time to diagnosis (MTTD) and less ops toil. The risk is just as clear: if a model holds production credentials, a prompt injection or misfired tool call could make a bad day catastrophic.
This article gives you a concrete, production-grade blueprint for granting a debug AI controlled, auditable access to Kubernetes logs, exec, and metrics—via MCP/tooling—backed by hardened RBAC, admission control, API rate limiting, session recording, and a human-guarded break-glass path. The goal is surgical, explainable access that improves incident response without materially increasing outage risk.
We’ll cover the threat model, a reference architecture, YAML you can adapt, policy snippets, and practical trade-offs on EKS/GKE/AKS.
TL;DR
- Do not hand a model a kubeconfig. Put a tightly-scoped gateway between the AI and the cluster.
- Grant read-mostly access: logs, events, metrics, and a gated exec path. Deny all writes by default.
- Enforce policy with RBAC plus a ValidatingAdmissionWebhook/OPA Gatekeeper to constrain exec to labeled namespaces/pods and identities.
- Record everything: Kubernetes audit logs at RequestResponse for exec/logs, and a custom exec proxy that captures I/O.
- Rate limit with API Priority and Fairness; use time-bound tokens and per-session quotas.
- For break-glass writes, require human approval to mint a short-lived, limited-scope credential. Log and expire it automatically.
Why give a model live access at all?
The diagnostic loop often needs ground truth:
- Logs for correlated errors and backtraces
- Events for scheduling issues and restarts
- Metrics for saturation, queuing, and spikes
- Exec to run targeted, read-only commands like `cat /proc/meminfo`, `ss -tan`, or `curl localhost:8080/healthz`
If you can wire a model to these signals—and constrain it so it cannot mutate workloads—you can dramatically reduce time to resolution without paging humans for every probe. But guardrails are non-negotiable.
Threat model and design principles
Threats to mitigate:
- Prompt injection causing destructive tool calls (e.g., `kubectl delete pod`)
- Credential theft or lateral movement from the AI runtime
- Accidental high-cost queries (cardinality explosions on metrics, giant log streams)
- Sensitive-data exfiltration from logs/exec output
- Abuse from compromised identities or noisy automation during incidents
Design principles:
- Least privilege: deny writes by default; permit minimal read actions only
- Time scoping: short-lived tokens; ephemeral break-glass permissions
- Human-in-the-loop for any escalation; no unattended write access
- Full auditability: record requests, tool decisions, exec I/O
- Policy enforcement in multiple layers (RBAC + admission + network + APF)
- Blast-radius control: namespace-level scoping and label-based allowlists
Reference architecture
Components:
- AI/agent: the model or orchestration that plans and calls tools
- MCP server (Model Context Protocol) or equivalent tool gateway: the only bridge between AI and cluster; enforces policy and recording
- Kubernetes API and cluster: production target(s)
- Policy engine: RBAC, admission control (Gatekeeper/OPA), API Priority and Fairness (APF)
- Observability backends: Prometheus, Cloud Logging, etc.
- Storage: immutable logs for audits (e.g., S3/GCS with object lock)
Flow:
- AI asks the MCP server to fetch logs, run a read-only exec, or query metrics.
- MCP server authenticates with a short-lived SA token, applies rate limits and authorization checks, and calls the Kubernetes API.
- ValidatingAdmissionWebhook ensures only permitted targets and subresources are accessed by the AI identity.
- Kubernetes audit logs capture all requests; the MCP server additionally records exec I/O.
- If the AI requests break-glass operations (e.g., eviction), it must submit a justification. A human approval workflow mints an ephemeral credential and binds a limited ClusterRole for a short TTL. Every action is logged.
Access patterns and their risks
- Logs (pods/log): low risk if redacted; unbounded volume can be expensive and leak secrets
- Events: low risk; useful for scheduling/CrashLoopBackOff
- Metrics: low to medium risk; cardinality and expensive queries can cause load; keep read-only tokens
- Exec (pods/exec): medium risk; must be read-only shell usage; restrict commands and targets; record session
- Port-forward: higher risk, opens new channels; avoid or guard heavily
We’ll implement logs/events/metrics and a gated exec; block port-forward entirely.
Kubernetes identity, RBAC, and scoping
We’ll create a dedicated ServiceAccount in an ops namespace and bind it to a narrow ClusterRole. Use OIDC or native ServiceAccount tokens with a short TTL via the TokenRequest API.
Example YAML:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ai-ops
  labels:
    pod-security.kubernetes.io/enforce: baseline
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: debug-ai
  namespace: ai-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: debug-ai-readonly
rules:
  # Read pod metadata and status across selected namespaces
  - apiGroups: [""]
    resources: ["pods", "pods/status", "namespaces", "nodes", "events"]
    verbs: ["get", "list", "watch"]
  # Read pod logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  # Allow creating exec sessions (subresource) only;
  # actual constraints enforced by the admission webhook
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  # Port-forward is disallowed by omission
  # Read deployments/replicas for context
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: debug-ai-metrics-readonly
rules:
  # Access metrics APIs via aggregated servers (customize for your setup)
  - apiGroups: ["metrics.k8s.io"]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
  # If using Prometheus, prefer a read-only HTTP token outside K8s RBAC
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: debug-ai-readonly-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: debug-ai-readonly
subjects:
  - kind: ServiceAccount
    name: debug-ai
    namespace: ai-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: debug-ai-metrics-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: debug-ai-metrics-readonly
subjects:
  - kind: ServiceAccount
    name: debug-ai
    namespace: ai-ops
```
Important notes:
- RBAC cannot restrict by label or command for subresources like exec. Use admission policies for fine-grained constraints.
- Keep the ClusterRole minimal. If your AI never needs node reads, drop them.
Admission control for fine-grained guardrails
Use OPA Gatekeeper or a custom ValidatingAdmissionWebhook to restrict exec/logs by namespace labels, pod annotations, and identity. Example: allow exec only to pods in namespaces labeled debug-ai-exec=enabled and deny shell-like commands unless they match an allowlist.
Gatekeeper takes a ConstraintTemplate with Rego and a Constraint resource:
```yaml
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sallowedexec
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedExec
      validation:
        openAPIV3Schema:
          properties:
            allowedServiceAccounts:
              type: array
              items: { type: string }
            allowedCommands:
              type: array
              items: { type: string }
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedexec

        allowed_sas := {sa | sa := input.parameters.allowedServiceAccounts[_]}
        allowed_cmds := {c | c := input.parameters.allowedCommands[_]}

        # Match pods/exec requests (the API server sends these as CONNECT operations)
        is_exec {
          input.review.operation == "CONNECT"
          input.review.resource.resource == "pods"
          input.review.subResource == "exec"
        }

        # Namespace must be opted in with the debug-ai-exec label
        # (requires Gatekeeper sync of Namespaces into data.inventory)
        ns_ok {
          ns := data.inventory.cluster["v1"]["Namespace"][input.review.namespace]
          ns.metadata.labels["debug-ai-exec"] == "enabled"
        }

        sa_ok {
          allowed_sas[input.review.userInfo.username]
        }

        # Every requested command must be on the allowlist; for exec the
        # command appears in the PodExecOptions object of the review
        cmds_ok {
          cmds := {c | c := input.review.object.command[_]}
          count(cmds) > 0
          count(cmds - allowed_cmds) == 0
        }

        violation[{"msg": msg}] {
          is_exec
          not ns_ok
          msg := sprintf("namespace %v is not enabled for debug-ai exec", [input.review.namespace])
        }

        violation[{"msg": msg}] {
          is_exec
          not sa_ok
          msg := sprintf("identity %v may not exec", [input.review.userInfo.username])
        }

        violation[{"msg": msg}] {
          is_exec
          not cmds_ok
          msg := "exec command is not on the allowlist"
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedExec
metadata:
  name: debug-ai-exec-policy
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    allowedServiceAccounts:
      - "system:serviceaccount:ai-ops:debug-ai"
    allowedCommands:
      - "/bin/cat"
      - "/bin/ls"
      - "/usr/bin/curl"
      - "/bin/echo"
      - "/usr/bin/ss"
```
Caveats:
- Exec admission requests arrive as CONNECT operations carrying a PodExecOptions object; depending on version, policy code may need to inspect the request URI instead of the review object. Validate against your API server version and Gatekeeper’s input structure.
- Gatekeeper’s webhook intercepts CREATE and UPDATE operations by default, so you must extend its ValidatingWebhookConfiguration to cover CONNECT on pods/exec.
- If you need deeper inspection or to block patterns, consider a custom webhook; a minimal sketch appears below.
Also consider a separate policy to deny pods/portforward for the AI identity entirely.
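If you take the custom-webhook route, the handler itself is small. Here is a minimal sketch, assuming the API server is configured (via a ValidatingWebhookConfiguration) to send CONNECT admission reviews for pods/exec to this endpoint; the allowlist, the SA username, and the cert paths are illustrative:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative allowlist; in practice load this from config or OPA.
var allowedCommands = map[string]bool{
	"/bin/cat": true, "/bin/ls": true, "/usr/bin/curl": true, "/usr/bin/ss": true,
}

func validateExec(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed admission review", http.StatusBadRequest)
		return
	}
	req := review.Request
	allowed, reason := true, ""

	// Only police exec subresource requests from the AI identity.
	if req.SubResource == "exec" && req.UserInfo.Username == "system:serviceaccount:ai-ops:debug-ai" {
		var opts corev1.PodExecOptions
		if err := json.Unmarshal(req.Object.Raw, &opts); err != nil {
			allowed, reason = false, "could not decode PodExecOptions"
		} else if len(opts.Command) == 0 {
			allowed, reason = false, "empty command"
		} else if !allowedCommands[opts.Command[0]] {
			allowed, reason = false, "command not on allowlist: "+opts.Command[0]
		}
	}

	review.Response = &admissionv1.AdmissionResponse{
		UID:     req.UID,
		Allowed: allowed,
		Result:  &metav1.Status{Message: reason},
	}
	review.Request = nil // the response does not need to echo the request
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate-exec", validateExec)
	// TLS is mandatory for admission webhooks; cert paths are illustrative.
	log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}
```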
Kubernetes audit logging at RequestResponse
Enable audit logging to capture every logs/exec invocation with parameters. Configure the audit policy to record full request/response bodies for relevant subresources.
Example audit policy:
```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Rules are evaluated in order and the first match wins,
  # so the specific rules must precede the catch-all.
  # Capture logs and exec calls fully
  - level: RequestResponse
    verbs: ["get", "create"]
    resources:
      - group: ""
        resources: ["pods/log", "pods/exec", "pods/attach"]
  # Capture subject access reviews and token requests
  - level: Request
    resources:
      - group: "authorization.k8s.io"
        resources: ["subjectaccessreviews"]
      - group: ""
        resources: ["serviceaccounts", "serviceaccounts/token"]
  # Default minimal logging for everything else
  - level: Metadata
```
Notes:
- On EKS, enable audit and control plane logs to CloudWatch; on GKE, enable Audit Logs to Cloud Logging; on AKS, use Diagnostic Settings.
- Store audit logs with immutability (e.g., S3 Object Lock, WORM retention) for forensics.
API Priority and Fairness (APF) rate limiting
Throttle the AI identity to protect the API server during incidents.
```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: ai-low
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 20
    limitResponse:
      type: Queued
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 8
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: ai-debug
spec:
  priorityLevelConfiguration:
    name: ai-low
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 9000
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: debug-ai
            namespace: ai-ops
      resourceRules:
        - verbs: ["get", "list", "watch", "create"]
          apiGroups: ["", "apps", "metrics.k8s.io"]
          # pods/nodes under metrics.k8s.io are covered by the apiGroups entry
          resources: ["pods", "pods/log", "pods/exec", "events", "deployments", "nodes"]
```
Tune concurrency and queue lengths to your scale. Consider cluster-wide APF baselines so the AI cannot starve control-plane traffic.
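APF protects the API server, but per-session quotas belong in the gateway itself so a runaway session is throttled before its requests ever leave the MCP process. A sketch using `golang.org/x/time/rate`; the per-session rates are illustrative:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// sessionLimiter hands out one token-bucket limiter per AI session so a
// single noisy session cannot consume the whole gateway budget.
type sessionLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func newSessionLimiter(rps float64, burst int) *sessionLimiter {
	return &sessionLimiter{
		limiters: make(map[string]*rate.Limiter),
		rps:      rate.Limit(rps),
		burst:    burst,
	}
}

func (s *sessionLimiter) allow(sessionID string) bool {
	s.mu.Lock()
	l, ok := s.limiters[sessionID]
	if !ok {
		l = rate.NewLimiter(s.rps, s.burst)
		s.limiters[sessionID] = l
	}
	s.mu.Unlock()
	return l.Allow()
}

func main() {
	lim := newSessionLimiter(2, 5) // 2 req/s steady state, burst of 5 (illustrative)
	for i := 0; i < 8; i++ {
		fmt.Println("request", i, "allowed:", lim.allow("session-123"))
	}
}
```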
Session recording for exec
Kubernetes audit logs record the exec request and command, but not I/O streams. For robust forensics, route exec through a proxy that records stdin/stdout/stderr and metadata.
Approaches:
- Use an exec gateway written in Go that terminates the exec request, tees the streams to object storage, and forwards to the API server using client-go’s `remotecommand` package.
- Use a commercial platform like Teleport’s Kubernetes Access, which natively records exec sessions and supports approvals and RBAC.
Minimal Go proxy sketch:
```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

func execHandler(client kubernetes.Interface, cfg *rest.Config) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Parse params: namespace, pod, container, command[] from the query.
		ns := r.URL.Query().Get("namespace")
		pod := r.URL.Query().Get("pod")
		container := r.URL.Query().Get("container")
		cmd := r.URL.Query()["command"]

		// Policy: check allowlist for cmd, namespace labels, identity, etc.
		if !policyAllows(r.Context(), ns, pod, container, cmd) {
			http.Error(w, "policy denied", http.StatusForbidden)
			return
		}

		// Build the upstream exec request against the API server.
		req := client.CoreV1().RESTClient().Post().
			Resource("pods").
			Name(pod).
			Namespace(ns).
			SubResource("exec")
		req.VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   cmd,
			Stdin:     false,
			Stdout:    true,
			Stderr:    true,
			TTY:       false,
		}, scheme.ParameterCodec)

		executor, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// Recording setup: tee both streams to per-session files.
		sessionID := fmt.Sprintf("%d-%s-%s", time.Now().UnixNano(), ns, pod)
		outFile, err := os.Create("/records/" + sessionID + "-stdout.log")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer outFile.Close()
		errFile, err := os.Create("/records/" + sessionID + "-stderr.log")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer errFile.Close()

		// Stream exec output to both the caller and the recording files.
		err = executor.StreamWithContext(r.Context(), remotecommand.StreamOptions{
			Stdout: io.MultiWriter(outFile, w),
			Stderr: io.MultiWriter(errFile, w),
			Tty:    false,
		})
		if err != nil {
			log.Printf("exec error: %v", err)
		}
	}
}

func policyAllows(ctx context.Context, ns, pod, container string, cmd []string) bool {
	// TODO: call OPA or check config; ensure read-only commands only.
	return true
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/exec", execHandler(client, cfg))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
This proxy becomes the only path the AI can use for exec. Keep exec-capable credentials exclusive to the proxy, and, if the AI runtime runs in-cluster, add a network policy that blocks its direct egress to the API server.
MCP tool definitions: logs, exec, metrics
Define strict tools with schemas to limit what the AI can ask for.
Example MCP tools (conceptual):
json[ { "name": "k8s_get_logs", "description": "Fetch recent logs for a pod container.", "input_schema": { "type": "object", "properties": { "namespace": {"type": "string"}, "pod": {"type": "string"}, "container": {"type": "string"}, "since_seconds": {"type": "integer", "minimum": 1, "maximum": 3600}, "tail_lines": {"type": "integer", "minimum": 1, "maximum": 2000} }, "required": ["namespace", "pod"] } }, { "name": "k8s_exec_readonly", "description": "Run an allowlisted, read-only command in a pod container.", "input_schema": { "type": "object", "properties": { "namespace": {"type": "string"}, "pod": {"type": "string"}, "container": {"type": "string"}, "command": {"type": "array", "items": {"type": "string"}, "minItems": 1, "maxItems": 6} }, "required": ["namespace", "pod", "command"] } }, { "name": "prom_query", "description": "Run a read-only PromQL query against the metrics API.", "input_schema": { "type": "object", "properties": { "query": {"type": "string", "maxLength": 1000}, "range_seconds": {"type": "integer", "minimum": 60, "maximum": 3600}, "step_seconds": {"type": "integer", "minimum": 5, "maximum": 60} }, "required": ["query"] } } ]
Server-side enforcement is critical: even if the schema is tight, never trust the model; validate every field against policy.
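For example, here is a sketch of server-side validation for the `k8s_get_logs` input, re-checking every field the schema already constrains and adding a namespace allowlist the schema cannot express (limits mirror the schema above; the allowlist is illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// LogsRequest mirrors the k8s_get_logs tool input.
type LogsRequest struct {
	Namespace    string `json:"namespace"`
	Pod          string `json:"pod"`
	Container    string `json:"container,omitempty"`
	SinceSeconds int    `json:"since_seconds,omitempty"`
	TailLines    int    `json:"tail_lines,omitempty"`
}

var dns1123 = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validate re-applies the schema limits server-side and adds checks the
// schema cannot express (here, a namespace allowlist).
func (r *LogsRequest) validate(allowedNamespaces map[string]bool) error {
	if !dns1123.MatchString(r.Namespace) || !dns1123.MatchString(r.Pod) {
		return errors.New("namespace and pod must be valid DNS-1123 names")
	}
	if !allowedNamespaces[r.Namespace] {
		return fmt.Errorf("namespace %q is not enabled for debug-ai access", r.Namespace)
	}
	if r.SinceSeconds < 0 || r.SinceSeconds > 3600 {
		return errors.New("since_seconds must be in [0, 3600]")
	}
	if r.TailLines < 0 || r.TailLines > 2000 {
		return errors.New("tail_lines must be in [0, 2000]")
	}
	return nil
}

func main() {
	req := LogsRequest{Namespace: "prod", Pod: "api-7d9f", TailLines: 100000}
	err := req.validate(map[string]bool{"prod": true})
	fmt.Println(err) // tail_lines must be in [0, 2000]
}
```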
Implementing logs and events with guardrails
Use the Kubernetes API for logs. The MCP server should:
- Enforce `sinceSeconds` and `tailLines` maximums
- Support `previous` to fetch logs from the last crashed container
- Redact likely secrets with a configurable regex and entropy detector
Example code notes:
- Use `clientset.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{Container: ..., SinceSeconds: ..., TailLines: ...}).Stream(ctx)` (see the sketch below)
- Stream to the response and tee to immutable storage for audit if volume is small; otherwise log metadata and a hash, not full logs
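A sketch of the guarded log fetch, assuming Go 1.21+ and caps matching the tool schema (the 4 MiB response cap is illustrative):

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/ptr"
)

// fetchLogs clamps the caller's options to hard caps before streaming.
func fetchLogs(ctx context.Context, cs kubernetes.Interface, ns, pod, container string,
	sinceSeconds, tailLines int64, w io.Writer) error {
	opts := &corev1.PodLogOptions{
		Container:    container,
		SinceSeconds: ptr.To(min(sinceSeconds, 3600)),
		TailLines:    ptr.To(min(tailLines, 2000)),
	}
	stream, err := cs.CoreV1().Pods(ns).GetLogs(pod, opts).Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()
	// Cap the response size as well; 4 MiB is an illustrative limit.
	_, err = io.Copy(w, io.LimitReader(stream, 4<<20))
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = fetchLogs(context.Background(), cs, "prod", "api-7d9f", "app", 600, 500, os.Stdout)
}
```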
Events are low-risk but high-value for diagnosis: always allow list/watch for the target namespaces.
Metrics: prefer a read-only Prometheus token
Avoid hitting the apiserver for metrics beyond the metrics.k8s.io API. Use Prometheus’ HTTP API with a read-only, low-scope token.
- Put Prometheus behind an internal gateway enforcing rate limits
- Enforce max range and step; reject queries with label value wildcards that explode cardinality
- Predefine safe queries for common SLOs and resource saturation (CPU throttling, restarts, 5xx rates)
Example reverse proxy policies (pseudo):
- Reject queries where estimated series > 50k
- Cap duration to 30 minutes and step >= 10s
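A sketch of that gate in Go using the upstream PromQL parser package; the caps mirror the pseudo-policies above, and the wildcard check is illustrative rather than exhaustive:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/prometheus/promql/parser"
)

// gate rejects queries that exceed range/step caps or use wildcard regex
// matchers that tend to explode cardinality.
func gate(query string, rng, step time.Duration) error {
	if rng > 30*time.Minute {
		return fmt.Errorf("range %s exceeds 30m cap", rng)
	}
	if step < 10*time.Second {
		return fmt.Errorf("step %s below 10s floor", step)
	}
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return fmt.Errorf("unparseable query: %w", err)
	}
	var bad error
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if vs, ok := node.(*parser.VectorSelector); ok {
			for _, m := range vs.LabelMatchers {
				if m.Type.String() == "=~" && (m.Value == ".*" || m.Value == ".+") {
					bad = fmt.Errorf("wildcard matcher on %q rejected", m.Name)
				}
			}
		}
		return nil
	})
	return bad
}

func main() {
	err := gate(`sum(rate(http_requests_total{service=~".*"}[5m]))`, 15*time.Minute, 15*time.Second)
	fmt.Println(err) // wildcard matcher on "service" rejected
}
```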
Secret handling and data minimization
- Do not grant access to Secrets, ConfigMaps, or volumes
- Add a redaction filter in logs/exec output: mask JWTs, OAuth tokens, AWS keys, and high-entropy strings
- Provide a “show sensitive” toggle for human operators only; never show raw to the model
Sample redaction regex set:
- `AKIA[0-9A-Z]{16}` (AWS Access Key ID)
- `(?i)(secret|password|token)` followed by value-like patterns
- Bearer tokens: `(?i)authorization:\s*bearer\s+[a-z0-9-_\.]+`
Complement regex with entropy-based detection to reduce false negatives.
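A combined sketch along these lines; the patterns follow the set above, and the entropy cutoff and minimum token length are illustrative values you should tune against your own log corpus:

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

var patterns = []*regexp.Regexp{
	regexp.MustCompile(`AKIA[0-9A-Z]{16}`),                           // AWS Access Key IDs
	regexp.MustCompile(`(?i)authorization:\s*bearer\s+[a-z0-9._-]+`), // bearer tokens
	regexp.MustCompile(`(?i)(secret|password|token)["']?\s*[:=]\s*\S+`),
}

// entropy returns the Shannon entropy of s in bits per character.
func entropy(s string) float64 {
	counts := map[rune]float64{}
	var n float64
	for _, r := range s {
		counts[r]++
		n++
	}
	var h float64
	for _, c := range counts {
		p := c / n
		h -= p * math.Log2(p)
	}
	return h
}

// Redact masks known secret patterns, then masks long high-entropy tokens
// the regexes missed. Note it normalizes whitespace between fields.
func Redact(line string) string {
	for _, p := range patterns {
		line = p.ReplaceAllString(line, "[REDACTED]")
	}
	fields := strings.Fields(line)
	for i, f := range fields {
		if len(f) >= 20 && entropy(f) > 4.5 { // illustrative threshold
			fields[i] = "[REDACTED:entropy]"
		}
	}
	return strings.Join(fields, " ")
}

func main() {
	fmt.Println(Redact("aws key AKIAABCDEFGHIJKLMNOP in env"))
}
```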
Token lifetimes and credential hygiene
- Use the ServiceAccount TokenRequest API to mint short-lived tokens per session (e.g., 10 minutes); do not store tokens in long-lived config
- Rotate the AI SA token and revoke on anomalies
- Restrict the MCP server’s network egress to only the Kubernetes API and Prometheus; block cloud metadata endpoints
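Per-session minting through the TokenRequest API is a few lines with client-go. A sketch; the audience and TTL are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"log"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/ptr"
)

// mintSessionToken requests a 10-minute token for the debug-ai SA,
// one per AI session, instead of any long-lived kubeconfig.
func mintSessionToken(ctx context.Context, cs kubernetes.Interface) (string, error) {
	tr := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{
			Audiences:         []string{"https://kubernetes.default.svc"},
			ExpirationSeconds: ptr.To[int64](600), // 10 minutes
		},
	}
	resp, err := cs.CoreV1().ServiceAccounts("ai-ops").
		CreateToken(ctx, "debug-ai", tr, metav1.CreateOptions{})
	if err != nil {
		return "", err
	}
	return resp.Status.Token, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	tok, err := mintSessionToken(context.Background(), cs)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("session token minted; expires in 10m; length:", len(tok))
}
```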
Break-glass: human-approved, ephemeral, minimal writes
Sometimes diagnostics require a nudge: evict a wedged pod, cordon a node, or restart a deployment. Treat this as an exception with explicit human approval.
Pattern:
- AI proposes an action with justification and expected blast radius
- Human reviews in chat/console and approves
- Controller mints a short-lived token bound to a dedicated ClusterRole and SA via a RoleBinding with TTL
- MCP server swaps to the break-glass credential for a single action; session is fully recorded
Minimal write ClusterRole (example):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: debug-ai-breakglass
rules:
  # Allow evicting pods only (graceful removal via the eviction API;
  # the eviction subresource lives on pods in the core group)
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  # Allow rollout restarts on specific deployments via patch
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
    resourceNames: ["my-critical-deployment"]
  # Node cordon/uncordon (optional)
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]
```
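For reference, the one eviction this role permits looks like this through client-go, a sketch assuming the MCP server has already swapped to the short-lived break-glass credential (pod and namespace names are illustrative):

```go
package main

import (
	"context"
	"log"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes cfg carries the short-lived break-glass token,
	// not the everyday read-only identity.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Graceful eviction honors PodDisruptionBudgets, unlike a raw delete.
	evict := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-abc", Namespace: "prod"},
	}
	if err := cs.CoreV1().Pods("prod").EvictV1(context.Background(), evict); err != nil {
		log.Fatalf("eviction denied or failed: %v", err)
	}
	log.Println("pod-abc evicted; the binding expires with its TTL")
}
```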
Ephemeral binding via a controller that watches a CRD:
```yaml
apiVersion: ops.example.com/v1
kind: BreakGlassRequest
metadata:
  name: bgr-2025-01-incident-123
spec:
  serviceAccountRef:
    name: debug-ai
    namespace: ai-ops
  roleRef: debug-ai-breakglass
  ttlSeconds: 900
  justification: "Evict pod pod-abc in ns prod to unblock stuck volume mount"
  approvers:
    - alice@example.com
    - oncall-ops@example.com
status:
  state: Pending
```
The controller:
- Requires one or more approver signatures (via OIDC or a signed comment in chat)
- Creates a temporary RoleBinding or ClusterRoleBinding
- Mints a short-lived TokenRequest for the SA
- Deletes the binding on TTL expiry
All actions are posted to an audit channel and appended to your SIEM.
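A sketch of the controller’s grant-and-expire core, assuming the request above has already been approved; a production controller would persist state and reconcile rather than rely on an in-process timer:

```go
package main

import (
	"context"
	"log"
	"time"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// grantBreakGlass creates a ClusterRoleBinding for the debug-ai SA and
// removes it after ttl. Deletion of the binding is the revocation.
func grantBreakGlass(ctx context.Context, cs kubernetes.Interface, ttl time.Duration) error {
	crb := &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "debug-ai-breakglass-ephemeral",
			Labels: map[string]string{"ops.example.com/break-glass": "true"},
		},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "ClusterRole",
			Name:     "debug-ai-breakglass",
		},
		Subjects: []rbacv1.Subject{
			{Kind: "ServiceAccount", Name: "debug-ai", Namespace: "ai-ops"},
		},
	}
	if _, err := cs.RbacV1().ClusterRoleBindings().Create(ctx, crb, metav1.CreateOptions{}); err != nil {
		return err
	}
	time.AfterFunc(ttl, func() {
		err := cs.RbacV1().ClusterRoleBindings().Delete(context.Background(),
			crb.Name, metav1.DeleteOptions{})
		if err != nil {
			log.Printf("break-glass cleanup failed, page a human: %v", err)
		}
	})
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := grantBreakGlass(context.Background(), cs, 15*time.Minute); err != nil {
		log.Fatal(err)
	}
	select {} // keep the process alive so the expiry timer can fire
}
```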
Multi-cluster and multi-tenant considerations
- Run one MCP gateway per cluster to isolate blast radius; aggregate results to the AI runtime
- Namespace scoping: let the AI access only namespaces labeled `debug-ai-access=enabled`
- Separate credentials per environment (prod vs. staging) with distinct policies and APF settings
Platform-specific notes
- EKS: enable control plane logs (audit, authenticator) to CloudWatch; consider AWS WAF on Prometheus gateway; use IAM Roles for Service Accounts (IRSA) to limit AWS API exposure from MCP pods
- GKE: use Workload Identity; enable Audit Logs; consider Binary Authorization if you ship the proxy into prod
- AKS: use Managed Identity; enable diagnostic settings for audit; verify APF version and stability
Operational playbook
- Pre-incident: test in staging; run policy unit tests (OPA test fixtures) and chaos drills for rate-limiting and audit coverage
- Incident: AI operates read-only by default; proposes break-glass with a diff of intended changes
- Post-incident: review recorded sessions, adjust allowlists, update runbooks and predefined queries
Example end-to-end flow
- Pager triggers on 5xx spike.
- AI uses `prom_query` with a safe query, `sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)`, capped at a 30m range.
- AI calls `k8s_get_logs` on the top offender pods with `since_seconds=600` and `tail_lines=500`.
- It detects OOMKills via `events` and `LastTerminationState` in pod status.
- It proposes `k8s_exec_readonly` commands `/bin/cat /proc/meminfo` and `/usr/bin/ss -tan`. Gatekeeper permits; the exec proxy records outputs.
- AI concludes a memory leak in revision X and proposes a controlled rollout restart. A human approves break-glass for a single deployment patch. The controller mints a 10-minute token; the MCP server executes the patch and logs everything.
Testing and verification
- Policy tests: OPA unit tests for your Gatekeeper templates; simulate requests with various namespaces and commands
- Load tests: hammer `pods/log` and `/metrics` with APF in place; verify saturation behavior
- Redaction tests: seed logs with faux secrets; assert masks are applied and no plaintext leaks to the AI
- Audit completeness: verify that a sample exec appears in audit logs with RequestResponse and in session storage with matching IDs
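A redaction test sketch, assuming the `Redact` helper from the earlier sketch lives in the same package:

```go
package main

import (
	"strings"
	"testing"
)

// TestRedactMasksSeededSecrets seeds log lines with faux credentials and
// asserts that nothing secret-shaped survives redaction.
func TestRedactMasksSeededSecrets(t *testing.T) {
	cases := []string{
		"env dump: AWS_ACCESS_KEY_ID=AKIAABCDEFGHIJKLMNOP",
		"request header authorization: bearer eyJhbGciOiJIUzI1NiJ9.payload.sig",
		"config password: hunter2-with-extra-entropy-0x9f3a",
	}
	for _, in := range cases {
		out := Redact(in)
		if strings.Contains(out, "AKIA") || strings.Contains(out, "eyJ") {
			t.Errorf("secret leaked through redaction: %q -> %q", in, out)
		}
	}
}
```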
Common pitfalls and how to avoid them
- Assuming RBAC can filter exec by command: it cannot. Use admission policies and a proxy.
- Leaving port-forward enabled: models can unintentionally open tunnels; block subresource entirely.
- Unbounded logs: enforce `sinceSeconds` and `tailLines`; cut off after N MB per call.
- Token sprawl: use TokenRequest per session; expire tokens aggressively; avoid static kubeconfigs.
- Over-permissive Prometheus: run all queries through a gateway with per-query limits and response size caps.
Security and compliance mapping
- Least privilege and separation of duties: dedicated SA, minimal ClusterRole, human-approved escalation
- Auditability: Kubernetes audit logs + exec I/O recording + SIEM integration
- Data minimization: redaction and deny-listing sensitive resources
- Change control: break-glass with approvals and ephemeral bindings
These controls align with SOC 2 CC6/CC7 and ISO 27001 A.9/A.12.
Alternatives and extensions
- Mirror-only mode: ship logs and metrics to a read-only shadow plane and disallow live exec; safer but slower diagnosis
- Out-of-band command runners: run predefined diagnostics as Jobs with controlled inputs instead of arbitrary exec
- Structured diagnostics: encode SRE runbooks as parameterized tools to reduce free-form shell usage
Conclusion
You don’t need to choose between “no prod access” and “here’s a kubeconfig.” A layered, policy-enforced gateway gives a debugging AI the signals it needs—logs, events, metrics, and tightly-scoped exec—without handing it the power to disrupt production.
The core of the blueprint:
- A minimal RBAC identity scoped for reads, not writes
- Admission policies that constrain exec to safe namespaces and allowlisted commands
- A recording exec proxy and RequestResponse audit logging
- Rate limits via APF and short-lived tokens
- A human-reviewed, ephemeral break-glass path with limited write verbs
Start in staging, prove the guardrails, then gradually enable in production namespaces with clear labels and observability. Your SREs get faster triage and deeper context, and your risk posture remains sane.
References
- Kubernetes audit logging: https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/
- API Priority and Fairness: https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
- OPA Gatekeeper: https://github.com/open-policy-agent/gatekeeper
- client-go exec: https://pkg.go.dev/k8s.io/client-go/tools/remotecommand
- Teleport Kubernetes Access (session recording): https://goteleport.com/docs/kubernetes-access/
- TokenRequest API: https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/
