Telemetry as Prompt: Designing Runtime Signals for Debug AI
If you want an AI to debug your software, stop thinking of telemetry as something you read after the fact. Treat it as the runtime prompt that guides the AI. Your logs, traces, and metrics can be engineered to be crisp, lossless, and safe inputs that let an automated debugger find, reproduce, and explain failures quickly.
This article is a practical blueprint for designing observability that doubles as AI-friendly prompts. It covers structured logging, trace context, redaction pipelines, retrieval patterns, and feedback loops that make debugging AI accurate, fast, and safe. The audience is technical, so we will lean into implementation detail, trade-offs, and patterns observed in production systems.
The core idea: telemetry is a data product that feeds a reasoning engine
Debugging AI, whether an LLM or a specialized reasoning model, is constrained by context windows, latency budgets, and safety. It needs the right information at the right time, not a firehose. The most reliable way to provide that is to design your telemetry as a data product with explicit schemas, semantics, and privacy guarantees. That same product becomes the model's prompt or retrieval substrate.
My take:
- Prefer structured events over free-text logs.
- Propagate trace context everywhere, end to end.
- Minimize and redact at the source; do not rely on downstream scrubbing.
- Establish quality contracts on telemetry fields just like API contracts.
- Make the prompt assembly deterministic and reproducible from stored telemetry.
- Measure AI effectiveness the same way you measure SRE: with SLOs.
Why debugging AI needs engineered signals
Most failure reports are vague: a stack trace, a 500 in a dashboard, an alert about error rate. Human engineers interpolate context from memory, code reading, and log spelunking. An AI doesn't have your tacit knowledge. It needs structured context with:
- Causality: which request or span triggered what.
- Time ordering: event sequences and durations.
- Identity (safe form): hashed user/session IDs, feature flags, deploy versions.
- Environment: commit SHA, config fingerprints, experiments, regional topology.
- Constraints: budgets, rate limits, retries, backoffs, circuit states.
Your telemetry either supplies that or forces the model to guess. Guessing is where hallucinations and expensive back-and-forths come from.
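For concreteness, here is a minimal sketch of a single structured event that carries that context. The field names and values are illustrative, not a required schema:

```python
# Illustrative only: one event carrying causality, identity (safe form),
# environment, and constraint fields in a stable, machine-readable shape.
event = {
    "ts": "2025-12-31T12:00:05Z",
    "level": "error",
    "service": "checkout-api",
    "version": "1.42.0",                               # deploy version
    "commit": "a1b2c3d",
    "env": "prod",
    "region": "us-east-1",
    "trace_id": "0af7651916cd43dd8448eb211c80319c",    # causality
    "span_id": "00f067aa0ba902b7",
    "user_hash": "b8f3c2a1d4e5f607",                   # salted hash, not raw ID
    "feature_flags": ["new-retry-policy"],
    "route": "/v1/pay",
    "error_code": "PAYMENT_TIMEOUT",
    "retry_count": 3,                                  # constraints that mattered
    "timeout_ms": 2000,
}
```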
Taxonomy of runtime signals for prompts
- Logs: Verbose event detail. Opinionated guidance: emit JSON-structured events with bounded keys and field allowlists.
- Traces: Causal graph of spans and timings across services. Use W3C Trace Context (traceparent) and OpenTelemetry for standardization.
- Metrics: Aggregates for gating and budgets: error rates, p95 latency, retry counts, saturation (SRE golden signals).
- Exceptions: Stack traces, error codes, error kinds, with stable error fingerprints.
- Config and environment: Feature flags, version, commit, rollout waves, region/AZ, runtime limits.
- Artifacts: Build logs, test results, core dumps, coverage, crash reproductions.
These artifacts can be chunked and retrieved by trace ID or incident, forming the prompt context.
From telemetry to prompt: the assembly pipeline
A robust pipeline has five stages:
- Ingest
  - Structured events flow through collectors (e.g., OpenTelemetry, Fluent Bit, Vector) into a broker or lake.
  - Ensure end-to-end propagation of trace and baggage fields.
- Sanitize and redact
  - Apply allowlisted fields and typed validation.
  - Redact secrets and PII deterministically.
- Correlate
  - Join logs, spans, metrics, and exceptions by trace_id and span_id.
  - Attach deploy and config metadata by time and environment.
- Summarize and prioritize
  - Use domain heuristics to score events by diagnostic value (exceptions > warnings > info near failure time).
  - Trim or compress with structure-aware reducers.
- Prompt build and retrieval
  - For online use, assemble a bounded prompt for the model.
  - For offline analysis, index chunks with embeddings keyed by trace/incident identity.
The key is that every stage after ingest is a deterministic function of stored telemetry: given an incident ID, you can reconstruct exactly what the AI saw.
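As a minimal sketch of that property, assume hypothetical storage reads (fetch_summary, fetch_events, fetch_metrics) keyed by incident ID; because they are pure lookups and the assembly sorts and truncates deterministically, reruns produce byte-identical prompts:

```python
import json

def build_prompt(incident_id: str, max_events: int = 200) -> str:
    # fetch_* are hypothetical reads from your telemetry store; the point is
    # that they are pure lookups by ID, with no wall-clock or random inputs.
    summary = fetch_summary(incident_id)   # correlated trace summary
    events = fetch_events(incident_id)     # already redacted at the source
    metrics = fetch_metrics(incident_id)   # aggregates at incident time

    # Deterministic ordering and truncation so reruns are byte-identical.
    events = sorted(events, key=lambda e: (e["ts"], e.get("span_id", "")))[:max_events]
    context = {"incident": summary, "events": events, "metrics": metrics}
    return json.dumps(context, sort_keys=True, separators=(",", ":"))
```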
Instrumentation patterns that work
1) Structured logs with a stable schema
- Use JSON Lines or protobuf-encoded events, not free-form text.
- Keep field names short, semantic, and stable.
- Include trace_id, span_id, service, version, env, region, user_hash, session_id, route, feature_flags, error_code, and safe_message.
- Use explicit levels: debug, info, warn, error.
Example: Python with structlog and OpenTelemetry context propagation:
```python
import os
import hashlib

import structlog
from opentelemetry import trace

logger = structlog.get_logger()
SALT = os.environ.get('USER_HASH_SALT', 'dev-salt')

def user_hash(user_id: str) -> str:
    return hashlib.sha256((SALT + '|' + user_id).encode()).hexdigest()[:16]

def log_event(level: str, msg: str, **fields):
    span_ctx = trace.get_current_span().get_span_context()
    base = {
        # Emit IDs as hex strings so they match traceparent headers and trace UIs.
        'trace_id': format(span_ctx.trace_id, '032x') if span_ctx.is_valid else None,
        'span_id': format(span_ctx.span_id, '016x') if span_ctx.is_valid else None,
        'service': 'checkout-api',
        'version': os.environ.get('SERVICE_VERSION', 'dev'),
        'env': os.environ.get('DEPLOY_ENV', 'dev'),
    }
    # structlog takes the message as the positional event argument.
    getattr(logger, level)(msg, **{**base, **fields})

# Usage
log_event('info', 'authorize payment',
          user_hash=user_hash('42'), route='/v1/pay',
          gateway='stripe', amount_cents=1299, currency='USD')
```
Go with zerolog and W3C trace context via OpenTelemetry:
```go
logger := zerolog.New(os.Stdout).With().Str("service", "checkout-api").Timestamp().Logger()

tr := otel.Tracer("checkout-api")
ctx, span := tr.Start(ctx, "charge")
defer span.End()

// Comma-ok assertion so a missing request ID doesn't panic.
reqID, _ := ctx.Value("request-id").(string)
traceID := span.SpanContext().TraceID().String()
spanID := span.SpanContext().SpanID().String()

logger.Info().
	Str("trace_id", traceID).
	Str("span_id", spanID).
	Str("route", "/v1/pay").
	Str("request_id", reqID).
	Int("amount_cents", 1299).
	Msg("authorize payment")
```
Opinion: prefer allowlisting fields per event type via typed builders instead of free-form key dumps. That keeps the prompt small and stable.
2) Trace context everywhere
- Adopt W3C Trace Context: the 'traceparent' header and optional 'tracestate'.
- Propagate context across HTTP, gRPC, queues, and serverless invocations.
- Store trace_id and span_id on every event. If you have only one thing, have this.
Example HTTP propagation (Node + Express + OpenTelemetry):
```js
const { context, trace, propagation } = require('@opentelemetry/api')

app.use((req, res, next) => {
  const extracted = propagation.extract(context.active(), req.headers)
  const span = trace.getTracer('gateway').startSpan('incoming', undefined, extracted)
  // End the span when the response finishes so durations are recorded.
  res.on('finish', () => span.end())
  context.with(trace.setSpan(extracted, span), () => next())
})
```
With proper propagation, your AI can reconstruct causal order: which DB call preceded which cache miss, across services.
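Queues and other async hops need the same treatment. A minimal sketch with the OpenTelemetry Python propagation API, injecting context into message headers on produce and extracting it on consume; the producer, message, and handle objects are placeholders for your messaging client and handler:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout-api")

def publish(producer, topic: str, payload: bytes):
    carrier: dict[str, str] = {}
    inject(carrier)  # writes traceparent (and tracestate) into the dict
    headers = [(k, v.encode()) for k, v in carrier.items()]
    producer.send(topic, value=payload, headers=headers)  # placeholder producer

def consume(message):
    ctx = extract({k: v.decode() for k, v in (message.headers or [])})
    with tracer.start_as_current_span("process message", context=ctx):
        handle(message.value)  # downstream logs now carry the same trace_id
```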
3) Redaction at the source
Data minimization should happen before logs leave the process. Relying on a downstream redactor is brittle and increases blast radius.
Core patterns:
- Allowlist over blocklist: emit only fields you explicitly allow.
- Deterministic hashing for identifiers via salted hash or keyed HMAC.
- Secret scanning for common patterns (AWS keys, OAuth tokens) with immediate drop or replace.
- PII tagging in code: annotate fields with sensitivity; loggers enforce policy.
Example: a redacting structlog processor in Python:
```python
import re

import structlog

SENSITIVE_KEYS = {'password', 'token', 'authorization', 'secret'}
TOKEN_RE = re.compile(r'(?:Bearer\s+)?([A-Za-z0-9-_]{20,})')
REDACT = '[REDACTED]'

def redact_event(_, __, event_dict):
    # Key-based redaction: known-sensitive keys are always replaced.
    for k in list(event_dict.keys()):
        if k.lower() in SENSITIVE_KEYS:
            event_dict[k] = REDACT
    # Value-based scanning: token-like strings are replaced wherever they appear.
    for k, v in event_dict.items():
        if isinstance(v, str):
            event_dict[k] = TOKEN_RE.sub(REDACT, v)
    return event_dict

structlog.configure(processors=[redact_event, structlog.processors.JSONRenderer()])
```
For user identifiers, prefer salted hashes or keyed HMACs so you can correlate across services without retaining raw PII:
```python
import hmac
import hashlib

KEY = b'service-key'

def pseudonymize(identifier: str) -> str:
    return hmac.new(KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]
```
Legal and compliance side note: document categories, purpose, retention, and purge procedures (DSAR readiness). Build a purge path that removes raw and derived artifacts, including embeddings and caches.
4) Error fingerprints and event contracts
A stable error_code and fingerprint are gold for debugging AI. Do not rely solely on stack traces, which are noisy across versions.
- Emit error_code from a constrained enum.
- Build a fingerprint from (error_code, normalized message, top frame file+line).
- Use this fingerprint as a retrieval key to past incidents and fixes.
Example fingerprint in Python:
```python
import re
import hashlib

WS = re.compile(r'\s+')

def normalize_msg(msg: str) -> str:
    msg = WS.sub(' ', msg.strip())
    return msg[:256].lower()

def error_fingerprint(error_code: str, msg: str, file: str, line: int) -> str:
    base = f'{error_code}|{normalize_msg(msg)}|{file}:{line}'
    return hashlib.sha256(base.encode()).hexdigest()[:16]
```
5) Metrics as gates and budgets
Use aggregates to control when and how much context to collect and send to the model.
- Trigger AI triage only when error_rate for a service exceeds baseline plus threshold.
- Cap prompt size and model selection based on severity: small model at warn; larger model at incident.
- Attach p95 latency, saturation, and retry counts to the prompt as summary features.
Example gating logic:
```python
# error_rate, baseline, and latest_deploy_age_minutes stand in for your metrics client.
if error_rate('checkout-api') > baseline('checkout-api') * 3 and latest_deploy_age_minutes() < 30:
    triage_level = 'high'
else:
    triage_level = 'normal'
```
6) Prompt assembly patterns
Principles:
- Deterministic: given a trace_id, the same prompt is built every time.
- Structured: use JSON sections so the model can parse reliably; then render to natural language if needed.
- Budgeted: cap tokens; prefer highest-signal events.
- Safe: assert schema and scrub before sending.
Example prompt builder (conceptual):
```json
{
  'system': 'You are an automated debugging assistant. Explain likely root cause, impacted components, and safe next steps. Operate only on provided context.',
  'context': {
    'incident': {
      'id': 'INC-2025-01-1337',
      'service': 'checkout-api',
      'env': 'prod',
      'deploy': {'version': '1.42.0', 'commit': 'a1b2c3d', 'age_min': 12}
    },
    'trace': {
      'trace_id': '0af7651916cd43dd8448eb211c80319c',
      'spans': [
        {'span_id': '1', 'name': 'HTTP POST /v1/pay', 'dur_ms': 120, 'status': 'error',
         'attrs': {'route': '/v1/pay', 'user_hash': 'b8f3...'}},
        {'span_id': '2', 'name': 'db.insert charge', 'dur_ms': 90, 'status': 'ok'},
        {'span_id': '3', 'name': 'gateway.authorize', 'dur_ms': 20, 'status': 'error',
         'attrs': {'error_code': 'GATEWAY_TIMEOUT'}}
      ]
    },
    'events': [
      {'ts': '2025-12-31T12:00:05Z', 'level': 'error', 'error_code': 'PAYMENT_TIMEOUT',
       'fingerprint': 'f1a2', 'message': 'gateway timeout at 20s backoff'},
      {'ts': '2025-12-31T12:00:05Z', 'level': 'warn', 'message': 'retry exhausted', 'retry_count': 3}
    ],
    'metrics': {'error_rate_5m': 6.2, 'baseline_error_rate': 0.8, 'p95_latency_ms': 980}
  },
  'question': 'What is the most likely root cause and the next diagnostic step?'
}
```
Note: although presented with single quotes for brevity here, use canonical JSON when sending to the model and validate against a schema.
7) Retrieval: hybrid keys plus embeddings
For offline or retrospective debugging, index events by:
- Exact keys: trace_id, fingerprint, error_code, service, version.
- Time windows: incident time ± N minutes.
- Embeddings: description of error and stack trace to find similar past incidents.
Always prefer exact-key retrieval first; use embeddings to rank within the candidate set. This keeps retrieval precise and cheap.
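A minimal sketch of that ordering, assuming a hypothetical index object with exact-key filters, a stored embedding per incident, and an embed() function for the query text:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def find_similar_incidents(error_code, fingerprint, service, description, index, k=5):
    # 1) Exact keys first: cheap and precise candidate set.
    candidates = index.query(error_code=error_code, service=service)  # hypothetical API
    exact = [c for c in candidates if c["fingerprint"] == fingerprint]
    if exact:
        return exact[:k]

    # 2) Embeddings only rank within the keyed candidate set, never replace it.
    query_vec = embed(description)  # hypothetical embedding call
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c["embedding"]),
                    reverse=True)
    return ranked[:k]
```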
8) Streaming trace summarization
Summarize traces in near real time into incident cards the AI can consume.
Design:
- A streaming job consumes traces and logs from a broker (e.g., Kafka, Pub/Sub).
- It builds per-trace state machines, emits a compact summary when terminal states occur (error, timeout, circuit open), and indexes the summary.
- The AI subscribes to summary topics rather than raw logs.
Pseudo-code for a summarizer:
```python
class TraceSummarizer:
    def __init__(self):
        self.state = {}

    def on_event(self, e):
        st = self.state.setdefault(e.trace_id, {'spans': [], 'errors': 0})
        if e.type == 'span_end':
            st['spans'].append(e)
        if e.level == 'error':
            st['errors'] += 1
        if e.terminal:
            self.emit_summary(e.trace_id, st)
            del self.state[e.trace_id]

    def emit_summary(self, trace_id, st):
        # Publish a compact incident card; the transport is left abstract here.
        summary = {
            'trace_id': trace_id,
            'span_count': len(st['spans']),
            'error_count': st['errors'],
        }
        publish('trace-summaries', summary)  # placeholder for your producer
```
Guardrails and safety for prompt ingestion
Debugging prompts are built from user input, configurations, and third-party payloads. Treat them as untrusted and protect the model from prompt injection and data leaks.
- Strict schema validation: reject or sanitize non-conforming events.
- Flatten complex structures and cap recursion depth.
- Drop or redact suspicious fields that look like instructions.
- Isolate untrusted text in quoted blocks and instruct the model to treat it as data, not instruction.
- Use output schemas and strict JSON output from the model with automatic validation and retry-on-fail using constrained decoding if available.
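Before the model-call wrapper below, the input side deserves its own guard. A minimal sketch of the first two points, assuming pydantic v2; the event fields and the delimiter convention for untrusted text are illustrative, not a standard:

```python
from pydantic import BaseModel, ValidationError

class LogEvent(BaseModel):
    ts: str
    level: str
    error_code: str | None = None
    message: str

def sanitize_for_prompt(raw: dict) -> str | None:
    try:
        event = LogEvent.model_validate(raw)   # reject non-conforming events
    except ValidationError:
        return None
    # Quote free text so the model is instructed to treat it as data, not instruction.
    safe_message = event.message.replace("```", "'''")[:512]
    return (f"[event ts={event.ts} level={event.level} error_code={event.error_code}]\n"
            f"<<<UNTRUSTED_LOG_TEXT\n{safe_message}\nUNTRUSTED_LOG_TEXT>>>")
```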
Minimal wrapper example for a model call:
```python
from pydantic import BaseModel, ValidationError

class Diagnosis(BaseModel):
    root_cause: str
    confidence: float
    impacted_components: list[str]
    next_steps: list[str]

# llm and fallback are stand-ins for your provider client and retry strategy.
resp = llm.generate(prompt, response_format=Diagnosis)  # if your provider supports structured output
try:
    diagnosis = Diagnosis.model_validate_json(resp)
except ValidationError:
    # Retry with a smaller prompt or stricter instructions.
    diagnosis = fallback(prompt)
```
Compression and token budgeting without losing meaning
- Prefer structural compression over text summarization: drop low-value fields, bucket timestamps, and limit stack frames to top N.
- Use deterministic sampling: keep all error-level logs and spans within ±5 seconds of the first error; sample 1-in-10 info logs elsewhere.
- Canonicalize and deduplicate repeat messages by fingerprint.
- For large payloads, include checksums and references instead of raw data; let the AI request specific artifacts via tools only when needed.
Example reducer for logs around an error window:
```python
def reduce_events(events, error_ts, window=5):
    keep = []
    for e in events:
        if e.level in ('error', 'warn'):
            keep.append(e)
        elif abs(e.ts - error_ts) <= window:
            keep.append(e)
    # Dedupe by fingerprint
    seen = set()
    out = []
    for e in keep:
        fp = getattr(e, 'fingerprint', None)
        if fp and fp in seen:
            continue
        if fp:
            seen.add(fp)
        out.append(e)
    return out[:200]  # hard cap
```
Feedback loops: closing the learning cycle
You don't want an AI that merely explains failures; you want one that gets better at your system over time. Build a feedback loop:
- Collect ground truth from postmortems, incident tickets, and merged fixes.
- Label incident summaries with root cause categories and effective remediation steps.
- Evaluate the AI with held-out incidents: measure diagnostic accuracy, time-to-signal, and false positive rate.
- Use offline fine-tuning or retrieval augmentation of successful past diagnoses.
Metrics to track:
- Mean time to plausible root cause (MTPRC).
- Precision/recall on root cause categories.
- Prompt token cost per incident and time-to-first-answer.
- Human accept rate (engineer marks the suggestion as helpful or not).
Guard against overfitting: keep a rotating set of recent incidents as a blind evaluation suite.
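A minimal sketch of such an offline evaluation, assuming each labeled incident records a ground-truth root-cause category, the AI's predicted category, whether the on-call engineer accepted the suggestion, and the prompt token cost:

```python
from collections import Counter

def evaluate(incidents: list[dict]) -> dict:
    """incidents: dicts with 'true_category', 'predicted_category',
    'accepted' (bool), and 'tokens' (prompt cost)."""
    predicted = Counter()
    correct = Counter()
    accepted = 0
    tokens = 0
    for inc in incidents:
        predicted[inc["predicted_category"]] += 1
        if inc["predicted_category"] == inc["true_category"]:
            correct[inc["predicted_category"]] += 1
        accepted += int(inc["accepted"])
        tokens += inc["tokens"]
    return {
        "precision_by_category": {cat: correct[cat] / n for cat, n in predicted.items()},
        "accept_rate": accepted / len(incidents),
        "avg_tokens_per_incident": tokens / len(incidents),
    }
```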
A concrete end-to-end example
Scenario: After a canary deploy of checkout-api 1.42.0, error rate spikes. Users report timeouts. The AI triage system kicks in.
- Instrumentation in place
  - Every service propagates traceparent. Logs are structured with trace_id and span_id.
  - Redaction is enforced at the source; user_id is hashed; secrets are replaced.
  - Error codes and fingerprints exist for common failures.
- Ingestion
  - Spans and logs stream to Kafka; a summarizer correlates by trace_id.
- Gate and assemble prompt
  - Metrics show error_rate_5m at 6.2% vs a 0.8% baseline. Deploy age is 12 minutes. Severity is set to high; the large model is allowed.
  - The prompt builder collects the top 50 errored traces with their summaries, deduplicated by fingerprint.
- Model input (excerpt)
System: You are an automated debugging assistant. Explain likely root cause and next steps. Use only provided context.
Context:
- Incident: INC-2025-01-1337, service checkout-api, env prod, deploy 1.42.0 (commit a1b2c3d), age 12m.
- Metrics: error_rate_5m=6.2% (baseline 0.8%), p95=980ms, retries elevated.
- Trace 0af765...: HTTP POST /v1/pay -> db.insert ok (90ms) -> gateway.authorize error (GATEWAY_TIMEOUT) after 3 retries (exp backoff), circuit open.
- Config diff: gateway timeout decreased from 5s to 2s in this deploy; retry policy unchanged.
- Errors (deduped): PAYMENT_TIMEOUT (fingerprint f1a2), RETRY_EXHAUSTED (fingerprint c9d1).
- Health: downstream gateway latency p95 increased in same time window in us-east-1 only.
Question: What is the most likely root cause and the next diagnostic step?
- Model output (schematized)
  - Root cause: misaligned retry-timeout configuration after deploy 1.42.0. The timeout was reduced to 2s while downstream p95 exceeded 2s during a regional blip, causing timeouts, exhausted retries, and an open circuit.
  - Impacted components: checkout-api, payment-gateway client lib.
  - Next steps: roll back the timeout to 5s, or reduce retries to 1 with a longer timeout; verify the gateway SLO and adjust budgets; alert when timeout × retries exceeds a threshold; add a canary check for gateway p95 > timeout.
  - Confidence: 0.86
- Human verification
  - On-call confirms: a new config template defaulted the timeout to 2s. Rollback resolves the errors.
  - Postmortem actions: add a CI policy that blocks timeouts below the gateway's p95 SLO; add a canary test; add a unit test asserting retries × timeout stays under the user-facing SLA.
This closed loop relies on telemetry that spelled out the config diff, trace ordering, and clear error codes—data designed for both humans and machines.
Anti-patterns that sabotage debugging AI
- Free-text logs with varying fields and no backbone keys.
- Missing trace context across service boundaries.
- Logging raw user PII or secrets and relying on a downstream redactor.
- Excessive sampling that drops error-level events.
- Huge prompts with unbounded context, leading to cost and latency blowups.
- Model given direct access to production data without guardrails or schema validation.
- Embeddings-only retrieval without exact-key correlation (high recall, low precision).
Practical blueprint to implement
- Baseline observability
  - Adopt OpenTelemetry for traces and metrics; enable W3C Trace Context.
  - Replace ad-hoc logs with structured logs and a field allowlist.
  - Add error_code and fingerprint emission in exception handlers.
- Privacy and safety
  - Implement source redaction and pseudonymization.
  - Add schema validators for emitted events; reject on violation.
  - Document retention and purge; automate purges across data stores and derived indices.
- Summarization and retrieval
  - Build a streaming summarizer keyed by trace_id; generate per-trace incident cards.
  - Index by exact keys plus embeddings for similarity.
- Prompt assembly
  - Define a deterministic prompt schema; cap tokens; prefer high-signal events.
  - Use model output schemas and validator loops.
- Evals and feedback
  - Construct an incident dataset from past outages; label root causes and fixes.
  - Measure accuracy, time-to-signal, and cost; iterate on instrumentation quality.
- Production hardening
  - Multi-model strategy: small model for triage; escalate to larger model on trigger.
  - Caching: reuse diagnoses for identical fingerprints within a time window (a minimal sketch follows this list).
  - Rate limiting and budgets based on severity and on-call schedules.
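For the caching item above, a minimal sketch of reusing a diagnosis for identical fingerprints within a time window. This is an in-memory TTL cache for illustration; a production system would more likely use a shared store:

```python
import time

class DiagnosisCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, fingerprint: str):
        hit = self._entries.get(fingerprint)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]   # reuse the earlier diagnosis
        return None

    def put(self, fingerprint: str, diagnosis: dict):
        self._entries[fingerprint] = (time.time(), diagnosis)

# Usage: check the cache before invoking the model
# diagnosis = cache.get(fingerprint) or run_triage(prompt)   # run_triage is hypothetical
```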
References and further reading
- Google SRE Book: The Four Golden Signals (latency, traffic, errors, saturation) — foundational for choosing metrics.
- OpenTelemetry: https://opentelemetry.io — vendor-neutral standard for traces, metrics, and logs.
- W3C Trace Context: https://www.w3.org/TR/trace-context — spec for traceparent and tracestate headers.
- Martin Fowler, Structured Logging — rationale and patterns for logs as data.
- OpenTracing to OpenTelemetry migration guides — practical tips for propagation.
- OWASP Top 10 and Secrets Management Cheat Sheet — for redaction and secret handling.
- Differential Privacy (Dwork et al.) and Google RAPPOR — background on privacy-preserving telemetry.
Closing opinion
Treat observability as a first-class input to a reasoning system. If you design logs, traces, and metrics with AI consumption in mind—stable schemas, trace context, redaction, and feedback loops—you will get an automated debugger that is not only accurate but fast and safe. The payoff is compounding: each incident improves your telemetry and your model, and each improvement reduces time-to-diagnosis for the next incident. That is how telemetry becomes a competitive advantage, not just a dashboard.
