Debugging AI Needs LLM Breakpoints: Observable, Reproducible, CI-Ready
Modern AI software is bottlenecked by invisible failures. LLMs silently hallucinate, retrieval layers drift, and multi-step tool calls fail in ways that are hard to reproduce. By the time an error reaches a developer, the context has changed, the model has been updated, and the logs read like a chat transcript rather than an engineering trace.
We need a primitive that does for AI what breakpoints did for imperative code: pause at the right place, capture state, and turn a flaky occurrence into a durable, verifiable case. This article proposes LLM breakpoints: a design pattern and a minimal spec that binds every high-level model interaction to stack traces, OpenTelemetry spans, and invariants. The result is observability you can search, reproducibility you can replay, and CI-ready tests you can trust.
The intended audience is technical teams building products with LLMs: platform engineers, ML engineers, developer tool builders, and researchers who care about reliability and velocity.
TL;DR
- Add an LLM breakpoint at every model or agent call. Treat it like a unit-testable boundary.
- Record stack traces, OpenTelemetry context, prompts, responses, seeds, sampler settings, model version, tool calls, and costs.
- Define invariants per breakpoint. Fail spans early with precise reasons and remediation.
- Convert breakpoints into minimal reproducible examples and auto-generate CI tests.
- Use content-addressable storage, prompt hashes, and response fingerprints for deduplication.
- Make traces portable across providers via a thin adapter and a stable breakpoint schema.
Why LLM Breakpoints?
The typical LLM debug loop looks like this:
- A user reports that the AI assistant gave an off-by-one command, or inserted incorrect code.
- You check the server logs: a long prompt, a verbose response, no clear failure marker.
- You re-run locally and cannot reproduce the failure because the model rolled forward, context windows differ, or sampling diverged.
- You add logging everywhere, but the volume and lack of structure make it hard to assert anything.
Classic breakpoints gave us a way to stop execution and inspect state. We need the same for LLMs, but focused on constraints unique to generative systems:
- Non-determinism from stochastic sampling and vendor changes.
- Contextual variance from retrieval, time, and dynamic tools.
- Multi-agent and multi-step flows where the root cause might be three hops away.
- Weak contracts between natural language inputs and expected outputs.
An LLM breakpoint formalizes the state of an AI interaction at the exact moment of a call and ties it into your observability and testing fabric.
What Is an LLM Breakpoint?
An LLM breakpoint is a structured checkpoint at the boundary of an LLM call or agent step. It captures enough state to replay the call deterministically where possible, evaluate invariants, and integrate with distributed tracing.
Conceptually, a breakpoint is to an LLM call what a unit test case is to a function call, except it is created at runtime with real production inputs and includes provenance.
Design Goals
- Observability
  - Rich spans with consistent attributes and events.
  - Searchable by prompt hash, trace id, user id, feature flag, and git revision.
- Reproducibility
  - Capture seeds, sampler parameters, model versions, tool I/O, retrieved documents, and environment.
  - Content-addressable artifacts for prompts and attachments.
- CI readiness
  - Serialize breakpoints into fixtures.
  - Auto-generate tests and re-run them in pipelines.
- Portability
  - Provider-agnostic schema and adapter layer.
- Privacy by design
  - PII redaction, encryption at rest, selective sampling.
- Performance and cost awareness
  - Token counts, latency, and spend captured as first-class metrics.
Anatomy of an LLM Breakpoint
A minimal schema for an LLM breakpoint could look like this, expressed in YAML for readability:
```yaml
version: 1
breakpoint_id: 01HXYV3VKF5QCPAFQY
trace_id: 0af7651916cd43dd8448eb211c80319c
span_id: b7ad6b7169203331
parent_span_id: null
created_at: 2025-01-17T15:03:12.415Z
service_name: devtools-api
environment: production
region: us-east-1
git:
  commit: 9a7c1f2
  branch: main
  dirty: false
code:
  module: api.handlers.code_review
  function: review_patch
  line_no: 241
  stack:
    - file: api/handlers/code_review.py
      function: review_patch
      line: 241
    - file: api/router.py
      function: handle
      line: 88
provider:
  name: openai
  model: gpt-4o
  api_base: https://api.openai.com/v1
  api_version: 2024-12-01
  tokenizer: cl100k_base
sampling:
  seed: 3456789
  temperature: 0.2
  top_p: 1.0
  presence_penalty: 0.0
  frequency_penalty: 0.0
  max_tokens: 1024
prompt:
  system: |
    You are a senior code reviewer. Prefer minimal diffs.
  messages:
    - role: user
      content: |
        Review this diff and propose a fix for the null pointer:
        --- a/app.py
        +++ b/app.py
        @@ ...
        - res = handler(x)
        + if x is not None:
        +   res = handler(x)
context:
  attachments:
    - type: file
      uri: blob://sha256/ab12...
  retrieved_docs:
    - uri: blob://sha256/9cde...
      chunk_id: 42
      top_k_rank: 1
  tool_state:
    vars:
      repo: org/project
      pr: 123
response:
  latency_ms: 482
  finish_reason: stop
  token_usage:
    input: 692
    output: 154
    total: 846
  content:
    - type: text
      text: |
        The null pointer arises when handler is called with None...
  tool_calls: []
  fingerprint: sha256:5b77...
metrics:
  cost_usd: 0.0171
  cache_hit: false
invariants:
  - id: must-produce-patch
    level: error
    type: regex
    pattern: '^\+\s|^-\s'
    on: response.text
    message: response must include a unified diff
  - id: no-secrets
    level: error
    type: pii_scan
    on: response.text
ci:
  replayable: true
  test_name: test_review_patch_null_pointer
  skip_on:
    - provider.name == azure-openai and provider.model startswith gpt-35
security:
  pii: redacted
  encrypted_fields:
    - prompt.messages
    - response.content
```
Key points:
- code.stack and git fields bind a prompt to the exact source location and revision.
- provider and sampling sections capture everything required to reduce nondeterminism.
- context retains artifacts used during the call such as retrieved docs and attachments.
- response contains both the raw content and structured usage stats.
- invariants define assertions to automatically validate model behavior at the breakpoint.
- metrics record cost and latency for SLOs.
- ci describes how the breakpoint becomes a test case.
- security controls how sensitive fields are handled.
Binding Prompts to Stack Traces
Most teams log the prompt and response but not the call site. You want the inverse emphasis: start from the code boundary, then capture the prompt.
A lightweight Python wrapper can do this:
```python
# pip install opentelemetry-api opentelemetry-sdk openai
import time
import inspect
import hashlib
import json
import os
from dataclasses import dataclass, asdict
from contextlib import contextmanager

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from openai import OpenAI

tracer = trace.get_tracer('ai.debugger')


@dataclass
class Breakpoint:
    version: int
    service_name: str
    environment: str
    git_commit: str
    module: str
    function: str
    line_no: int
    stack: list
    provider: dict
    sampling: dict
    prompt: dict
    context: dict
    response: dict | None
    metrics: dict | None
    invariants: list
    ci: dict
    security: dict

    def hash_prompt(self) -> str:
        payload = json.dumps(self.prompt, sort_keys=True, separators=(',', ':')).encode('utf-8')
        return hashlib.sha256(payload).hexdigest()


def current_git_commit() -> str:
    return os.getenv('GIT_COMMIT', 'unknown')


@contextmanager
def llm_breakpoint(provider_name: str, model: str, sampling: dict, prompt: dict,
                   context: dict, invariants: list):
    # Skip contextlib internals so the recorded frame is the caller's call site.
    frame = inspect.currentframe().f_back
    while frame and 'contextlib' in frame.f_code.co_filename:
        frame = frame.f_back
    stack = [
        {'file': f.filename, 'function': f.function, 'line': f.lineno}
        for f in inspect.getouterframes(frame)
    ]
    bp = Breakpoint(
        version=1,
        service_name=os.getenv('SERVICE_NAME', 'devtools-api'),
        environment=os.getenv('ENV', 'dev'),
        git_commit=current_git_commit(),
        module=frame.f_code.co_filename,
        function=frame.f_code.co_name,
        line_no=frame.f_lineno,
        stack=stack,
        provider={'name': provider_name, 'model': model},
        sampling=sampling,
        prompt=prompt,
        context=context,
        response=None,
        metrics=None,
        invariants=invariants,
        ci={'replayable': True},
        security={'pii': 'redacted'},
    )
    with tracer.start_as_current_span('llm.call') as span:
        span.set_attribute('ai.provider', provider_name)
        span.set_attribute('ai.model', model)
        span.set_attribute('ai.prompt.hash', bp.hash_prompt())
        span.set_attribute('code.filepath', bp.module)
        span.set_attribute('code.function', bp.function)
        span.set_attribute('code.lineno', bp.line_no)
        start = time.perf_counter()
        try:
            yield bp, span
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
        finally:
            elapsed = (time.perf_counter() - start) * 1000
            if bp.metrics is None:
                bp.metrics = {}
            bp.metrics['latency_ms'] = elapsed
            span.set_attribute('ai.latency_ms', elapsed)
            # Serialize to your store: blob, database, or file
            path = f'.llm_breakpoints/{bp.hash_prompt()[:8]}-{int(time.time())}.jsonl'
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, 'a', encoding='utf-8') as f:
                f.write(json.dumps(asdict(bp)) + '\n')
```
Use the context manager to wrap a provider call:
```python
client = OpenAI()


def review_patch(diff: str) -> str:
    prompt = {
        'system': 'You are a senior code reviewer. Prefer minimal diffs.',
        'messages': [{'role': 'user', 'content': f'Review this diff and propose a fix:\n{diff}'}],
    }
    sampling = {'seed': 42, 'temperature': 0.2, 'top_p': 1.0, 'max_tokens': 512}
    invariants = [
        {'id': 'has-text', 'level': 'error', 'type': 'non_empty', 'on': 'response.text'},
        {'id': 'no-secrets', 'level': 'error', 'type': 'pii_scan', 'on': 'response.text'},
    ]
    with llm_breakpoint('openai', 'gpt-4o', sampling, prompt,
                        context={}, invariants=invariants) as (bp, span):
        # Provider call
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=bp.provider['model'],
            messages=[{'role': 'system', 'content': prompt['system']}] + prompt['messages'],
            temperature=bp.sampling['temperature'],
            top_p=bp.sampling['top_p'],
            max_tokens=bp.sampling['max_tokens'],
            seed=bp.sampling['seed'],
        )
        elapsed = (time.perf_counter() - start) * 1000
        text = resp.choices[0].message.content
        usage = getattr(resp, 'usage', None)
        bp.response = {
            'content': [{'type': 'text', 'text': text}],
            'finish_reason': resp.choices[0].finish_reason,
            'token_usage': {
                'input': usage.prompt_tokens if usage else None,
                'output': usage.completion_tokens if usage else None,
            },
        }
        bp.metrics = {'latency_ms': elapsed, 'cost_usd': estimate_cost(usage)}
        span.set_attribute('ai.tokens.input', usage.prompt_tokens if usage else 0)
        span.set_attribute('ai.tokens.output', usage.completion_tokens if usage else 0)
        assert_invariants(bp)
        return text


def estimate_cost(usage):
    if not usage:
        return 0.0
    # Simple placeholder cost model
    return (usage.prompt_tokens + usage.completion_tokens) * 0.000002


def assert_invariants(bp: Breakpoint) -> None:
    text = ''.join([c['text'] for c in bp.response['content'] if c['type'] == 'text'])
    for inv in bp.invariants:
        if inv['type'] == 'non_empty' and not text.strip():
            raise AssertionError(f"invariant {inv['id']} failed: response empty")
        if inv['type'] == 'pii_scan' and contains_pii(text):
            raise AssertionError(f"invariant {inv['id']} failed: PII leak")


def contains_pii(text: str) -> bool:
    # Heuristic detector placeholder
    return 'ssn' in text.lower() or 'password' in text.lower()
```
Note: single quotes are used in this example to make embedding into JSON easier. Real code can use your preferred style.
OpenTelemetry Integration
Treat each LLM call as a span with structured attributes, events, and links. Suggested attributes:
- ai.provider, ai.model
- ai.prompt.hash
- ai.tokens.input, ai.tokens.output
- ai.latency_ms, ai.cost_usd
- ai.seed, ai.temperature, ai.top_p
- code.filepath, code.function, code.lineno
- user.id, session.id when applicable
Useful events:
- ai.prompt.created with a truncated prompt preview and content hash
- ai.response.received with fingerprint and finish reason
- ai.invariant.failed with id and message
Link spans across retries and fallbacks by adding span links to previous attempts. This allows you to visualize decision trees in your trace UI. If you dispatch a tool call, start a nested span tool.call with tool name, input hash, output hash, and tokenized cost attribution for the step.
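For concreteness, here is a minimal Python sketch of the events and retry links described above. The `ai.*` attribute and event names follow this section's suggestions rather than an established OTel semantic convention, and `call_fn` is a stand-in for your provider call.

```python
# Sketch: record prompt/response events and link a retry span to the previous
# attempt. Event and attribute names are illustrative, not a standard.
from opentelemetry import trace
from opentelemetry.trace import Link

tracer = trace.get_tracer('ai.debugger')


def call_with_retry_links(call_fn, prompt_hash: str, max_attempts: int = 2):
    previous_ctx = None
    for attempt in range(1, max_attempts + 1):
        links = [Link(previous_ctx)] if previous_ctx else []
        with tracer.start_as_current_span('llm.call', links=links) as span:
            span.set_attribute('ai.prompt.hash', prompt_hash)
            span.add_event('ai.prompt.created', {'ai.prompt.hash': prompt_hash})
            try:
                response_text = call_fn()
                span.add_event('ai.response.received', {'ai.response.chars': len(response_text)})
                return response_text
            except AssertionError as e:
                span.add_event('ai.invariant.failed', {'message': str(e)})
                previous_ctx = span.get_span_context()  # link the next attempt back here
                if attempt == max_attempts:
                    raise
```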
Invariants at the Boundary
LLM outputs are soft; invariants give them a shape. Examples:
- Structural invariants
  - Must parse as JSON for a specific schema.
  - Must include a patch header if the task is code editing.
- Semantic invariants
  - Keywords must appear or must not appear.
  - Round-trip property: if we ask the model to summarize and then expand, original key facts must be preserved.
- Safety and policy invariants
  - No PII, no secrets, no links to disallowed domains.
- Cost and latency invariants
  - Latency under a threshold, token budget not exceeded.
Implementation approaches:
- JSON schema for parseable outputs, with an optional repair step using a function-call style schema.
- Pydantic models for parsed outputs in Python.
- Property-based testing for round-trip invariants, inspired by QuickCheck and Hypothesis.
Example: assert a JSON response conforming to a schema and repair once before failing.
```python
from pydantic import BaseModel, ValidationError


class Issue(BaseModel):
    file: str
    line: int
    severity: str
    message: str


class ReviewResult(BaseModel):
    issues: list[Issue]


def parse_or_repair(text: str) -> ReviewResult:
    try:
        return ReviewResult.model_validate_json(text)
    except ValidationError:
        # One-shot repair prompt pattern
        repaired = client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[
                {'role': 'system', 'content': 'Output valid JSON only, conforming to ReviewResult schema.'},
                {'role': 'user', 'content': text},
            ],
            temperature=0.0,
        ).choices[0].message.content
        return ReviewResult.model_validate_json(repaired)
```
Reproducibility: Seeds, Sampling, and State Capture
Reproducibility in LLMs is best-effort, not perfect. Vendors can change logits, models can roll, and stochasticity can sneak in via tools and retrieval. Still, you can get far with these practices:
- Capture seed, temperature, top_p, presence and frequency penalties, and max tokens.
- Pin provider model versions when possible; track api version and model snapshot if offered.
- Persist retrieved documents, including their pre-tokenized form if you rely on chunking.
- Content-addressable storage for prompts and attachments using SHA-256 hashes.
- Record and, if feasible, snapshot external tool outputs and features that influenced the prompt.
- For local inference (vLLM, llama.cpp), enforce deterministic kernels and identical tokenizer versions.
When full determinism is impossible, capture enough context to compare behavior regression-style. You can assert that a response stays within a semantic equivalence class even if bytes differ.
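A content-addressable store can be as simple as writing each artifact under its SHA-256 digest. The sketch below assumes a local `.llm_blobs/` directory; the `blob://sha256/...` URI scheme matches the one used in the breakpoint schema above.

```python
# Sketch: content-addressable storage for prompts, attachments, and retrieved
# chunks. The .llm_blobs directory is an illustrative choice.
import hashlib
import os

BLOB_DIR = '.llm_blobs'


def put_blob(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(BLOB_DIR, digest)
    if not os.path.exists(path):  # identical content is stored exactly once
        os.makedirs(BLOB_DIR, exist_ok=True)
        with open(path, 'wb') as f:
            f.write(data)
    return f'blob://sha256/{digest}'


def get_blob(uri: str) -> bytes:
    digest = uri.rsplit('/', 1)[-1]
    with open(os.path.join(BLOB_DIR, digest), 'rb') as f:
        return f.read()
```

Because the URI is derived from the content itself, a replay can verify it is seeing exactly the bytes the original call saw.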
From Breakpoints to Minimal Reproductions
Every failed invariant should produce a minimal reproduction artifact:
- A single file fixture containing the breakpoint.
- A script that replays the call locally or in a container aligned to the captured provider and api version.
- A smoke test that asserts the invariant and logs the diff between actual and expected structure.
Deduplicate failures by prompt hash and response fingerprint. Link all occurrences across services via trace id.
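Deduplication can key on the pair of prompt hash and response fingerprint. A minimal sketch, assuming breakpoints are the JSONL records written earlier and that the fingerprint field follows the schema above:

```python
# Sketch: group failing breakpoints by (prompt hash, response fingerprint) so
# repeated occurrences collapse into one issue with many linked trace ids.
import hashlib
import json
from collections import defaultdict


def failure_key(bp: dict) -> tuple[str, str]:
    prompt_hash = hashlib.sha256(
        json.dumps(bp['prompt'], sort_keys=True).encode('utf-8')
    ).hexdigest()
    response_fp = bp.get('response', {}).get('fingerprint', 'missing')
    return prompt_hash, response_fp


def dedupe(breakpoints: list[dict]) -> dict:
    groups = defaultdict(list)
    for bp in breakpoints:
        groups[failure_key(bp)].append(bp.get('trace_id'))
    return groups  # one entry per unique failure, with all linked trace ids
```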
CI: Turn Breakpoints Into Tests
Integrate with your test runner to auto-generate cases.
Example pytest harness that collects breakpoint files and runs invariant checks:
```python
# conftest.py
import glob
import json
import os

import pytest

BP_DIR = os.getenv('BP_DIR', '.llm_breakpoints')


def iter_breakpoints():
    for path in glob.glob(f'{BP_DIR}/*.jsonl'):
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                yield json.loads(line)


def idfn(bp):
    return f"{bp['provider']['name']}-{bp['provider']['model']}-{bp['code']['function']}-{bp['created_at']}"


@pytest.mark.parametrize('bp', list(iter_breakpoints()), ids=idfn)
def test_breakpoint(bp):
    # Re-run if replayable and not marked to skip
    if not bp.get('ci', {}).get('replayable', True):
        pytest.skip('not replayable')
    if should_skip(bp):
        pytest.skip('provider skip rule')
    actual = replay(bp)
    run_invariants(bp, actual)


def should_skip(bp):
    name = bp['provider']['name']
    model = bp['provider']['model']
    return name == 'azure-openai' and model.startswith('gpt-35')
```
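The harness above leaves `replay` and `run_invariants` undefined. A hedged sketch of what they might look like, reusing the captured provider, sampling, and invariant fields; the OpenAI-only branch and the two invariant types are assumptions that you would extend for your own providers and invariant vocabulary:

```python
# Sketch of the replay/run_invariants helpers assumed by the harness above.
import re

from openai import OpenAI


def replay(bp: dict) -> str:
    if bp['provider']['name'] != 'openai':
        raise NotImplementedError(f"no replay adapter for {bp['provider']['name']}")
    client = OpenAI()
    messages = [{'role': 'system', 'content': bp['prompt']['system']}] + bp['prompt']['messages']
    resp = client.chat.completions.create(
        model=bp['provider']['model'],
        messages=messages,
        temperature=bp['sampling'].get('temperature', 0.0),
        top_p=bp['sampling'].get('top_p', 1.0),
        seed=bp['sampling'].get('seed'),
        max_tokens=bp['sampling'].get('max_tokens'),
    )
    return resp.choices[0].message.content


def run_invariants(bp: dict, actual: str) -> None:
    for inv in bp.get('invariants', []):
        if inv['type'] == 'non_empty':
            assert actual.strip(), f"invariant {inv['id']} failed: empty response"
        elif inv['type'] == 'regex':
            assert re.search(inv['pattern'], actual, re.MULTILINE), \
                f"invariant {inv['id']} failed: {inv.get('message', 'pattern not found')}"
```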
Sample GitHub Actions workflow:
```yaml
name: LLM Breakpoint Replay
on:
  pull_request:
  schedule:
    - cron: '0 * * * *'
jobs:
  replay:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pytest -q --maxfail=1 --disable-warnings
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          BP_DIR: .llm_breakpoints
```
Create gates for flaky detection by re-running each failing breakpoint K times and counting instability. Report regressions when failure rate increases beyond a threshold.
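One way to implement that gate is a thin loop over the replay helper sketched earlier; the K of 5 and the 0.8 regression threshold below are arbitrary examples, not recommendations.

```python
# Sketch: re-run a failing breakpoint K times and classify it as stable, flaky,
# or a hard regression based on the observed failure rate.
def flake_report(bp: dict, k: int = 5, regression_threshold: float = 0.8) -> dict:
    failures = 0
    for _ in range(k):
        try:
            run_invariants(bp, replay(bp))
        except AssertionError:
            failures += 1
    rate = failures / k
    verdict = 'regression' if rate >= regression_threshold else 'flaky' if rate > 0 else 'stable'
    return {
        'breakpoint_id': bp.get('breakpoint_id'),
        'failure_rate': rate,
        'verdict': verdict,
    }
```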
Cross-Provider Portability
A thin adapter isolates provider-specific request and response shapes behind a common interface. Ideas:
- Use a neutral message object with role, content, and optional tool invocation metadata.
- Normalize token usage fields to input, output, and total tokens.
- Translate tool call formats to a single internal spec, capturing function name, arguments, and return.
- Map errors to standardized categories: rate_limit, quota, invalid_request, server_error, network, timeout.
This allows you to replay the same breakpoint across OpenAI, Anthropic, local llama.cpp, or Azure variants, then compare costs and quality.
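In Python, the adapter layer can be as thin as a protocol plus per-provider normalizers. The class and field names below are illustrative, not an existing library API:

```python
# Sketch: a provider-neutral chat interface. Each concrete adapter normalizes
# its vendor's request/response shapes and error types into these structures.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class NeutralMessage:
    role: str                 # system | user | assistant | tool
    content: str
    tool_call: dict | None = None


@dataclass
class NeutralResponse:
    text: str
    input_tokens: int
    output_tokens: int
    finish_reason: str
    error_category: str | None = None  # rate_limit, quota, invalid_request, ...


class ChatAdapter(Protocol):
    name: str
    model: str

    def chat(self, messages: list[NeutralMessage], sampling: dict) -> NeutralResponse: ...
```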
Tool and Agent Steps Are Also Breakpoints
When a model calls a function or tool, that boundary deserves its own breakpoint. Capture the following:
- Tool name and version.
- Input arguments hash and canonicalized JSON.
- Tool output and its provenance.
- Latency and downstream costs if the tool triggers another model.
Folding tool steps into the trace creates a cascade you can debug visually and in CI. For agent frameworks, each planner and executor step is a span with its own invariants.
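A tool boundary can reuse the same span pattern as the model call. A sketch, where the `tool.*` attribute names are illustrative rather than a standard convention:

```python
# Sketch: wrap a tool invocation in its own child span with canonicalized
# input and output hashes for correlation and deduplication.
import hashlib
import json

from opentelemetry import trace

tracer = trace.get_tracer('ai.debugger')


def traced_tool_call(tool_name: str, tool_version: str, fn, **kwargs):
    args_canonical = json.dumps(kwargs, sort_keys=True, separators=(',', ':'))
    with tracer.start_as_current_span('tool.call') as span:
        span.set_attribute('tool.name', tool_name)
        span.set_attribute('tool.version', tool_version)
        span.set_attribute('tool.input.hash', hashlib.sha256(args_canonical.encode()).hexdigest())
        result = fn(**kwargs)
        span.set_attribute('tool.output.hash', hashlib.sha256(repr(result).encode()).hexdigest())
        return result
```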
Data Hygiene: Redaction and Encryption
Since prompts and responses may contain sensitive data, build hygiene into the breakpoint pipeline:
- Redact known PII using pattern-based and ML detectors before persistence.
- Mark fields as encrypted at rest; use envelope encryption with KMS integrated keys.
- Keep a field-level allowlist for what can be exported to CI versus retained only in a secure store.
- Honor data retention policies and DSAR workflows by indexing breakpoints by user id.
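The pattern-based redaction pass can be a small pre-persistence filter. The regexes below are deliberately simple placeholders; a real pipeline would layer an ML detector and a field-level allowlist behind them.

```python
# Sketch: pattern-based PII redaction applied before a breakpoint is persisted.
# Patterns are placeholders, not a complete detector.
import re

REDACTIONS = [
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[REDACTED_SSN]'),
    (re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.]+\b'), '[REDACTED_EMAIL]'),
    (re.compile(r'(?i)(api[_-]?key|password)\s*[:=]\s*\S+'), r'\1=[REDACTED]'),
]


def redact(text: str) -> str:
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```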
Developer Workflow: Break on Predicate
Breakpoints become most useful when surfaced to developers naturally:
- VS Code extension that shows recent breakpoints in the Problems panel, with links to the exact source location and a one-click replay.
- A TUI that streams breakpoints and lets you filter by service, provider, invariant id, or model name.
- Predicate-based breakpoints: break when latency exceeds 2 s, token usage exceeds 4k, or a specific invariant fails.
- Time-travel prompts: open a diff between the captured prompt and the current code path prompt.
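Predicate-based breakpoints in particular are easy to prototype as small functions evaluated over captured records. The thresholds mirror the examples above; the `failed_invariants` field is a hypothetical bit of bookkeeping added when an invariant fails at runtime.

```python
# Sketch: predicate-based "break" conditions over captured breakpoints.
# A matching record is surfaced to the developer (editor panel, TUI, alert).
PREDICATES = {
    'slow': lambda bp: bp.get('metrics', {}).get('latency_ms', 0) > 2000,
    'token_heavy': lambda bp: bp.get('response', {}).get('token_usage', {}).get('total', 0) > 4000,
    # failed_invariants is an assumed field listing invariants that failed at runtime
    'invariant_failed': lambda bp: any(
        inv['id'] == 'must-produce-patch' for inv in bp.get('failed_invariants', [])
    ),
}


def should_surface(bp: dict) -> list[str]:
    return [name for name, pred in PREDICATES.items() if pred(bp)]
```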
Cost and Latency as First-Class Metrics
Attach token counts and cost estimates to every span. You can then:
- Create SLOs for 95th percentile latency and token budgets per feature.
- Implement circuit breakers that fall back to cheaper models when budgets are exhausted.
- Attribute spend by team, endpoint, and invariant bucket.
Use standard OTel metrics where possible:
- ai.tokens.input counter
- ai.tokens.output counter
- ai.cost.usd counter
- ai.latency.ms histogram
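Defining these instruments with the OpenTelemetry metrics API takes a few lines. The instrument names follow the list above, not an official semantic convention, and the record function assumes the breakpoint schema from earlier.

```python
# Sketch: counters and a histogram for token, cost, and latency metrics.
from opentelemetry import metrics

meter = metrics.get_meter('ai.debugger')

tokens_in = meter.create_counter('ai.tokens.input', unit='{token}')
tokens_out = meter.create_counter('ai.tokens.output', unit='{token}')
cost_usd = meter.create_counter('ai.cost.usd', unit='usd')
latency_ms = meter.create_histogram('ai.latency.ms', unit='ms')


def record_call_metrics(bp: dict) -> None:
    attrs = {'ai.provider': bp['provider']['name'], 'ai.model': bp['provider']['model']}
    usage = bp.get('response', {}).get('token_usage', {})
    tokens_in.add(usage.get('input', 0) or 0, attrs)
    tokens_out.add(usage.get('output', 0) or 0, attrs)
    cost_usd.add(bp.get('metrics', {}).get('cost_usd', 0.0), attrs)
    latency_ms.record(bp.get('metrics', {}).get('latency_ms', 0.0), attrs)
```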
Prompts as Code: Versioning and Migrations
Prompts age. Treat them like code:
- Store prompts as text files with unit tests and typed templates.
- Record prompt hash in every breakpoint and display a prompt diff when behavior changes.
- Migrate prompts with codemods when you change few-shot examples or system directives.
- Consider lightweight CRDTs or a registry for prompts shared across services.
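Recording the prompt hash and surfacing a diff when behavior changes might look like the sketch below; the `prompts/` directory layout and file name are assumed conventions.

```python
# Sketch: treat prompts as versioned files, hash them, and show a diff when
# the prompt behind a breakpoint no longer matches the file in the repo.
import difflib
import hashlib
from pathlib import Path


def prompt_hash(text: str) -> str:
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def diff_against_repo(captured_prompt: str,
                      prompt_file: str = 'prompts/code_review.system.txt') -> str:
    current = Path(prompt_file).read_text(encoding='utf-8')
    if prompt_hash(current) == prompt_hash(captured_prompt):
        return ''  # prompt unchanged since the breakpoint was captured
    return '\n'.join(difflib.unified_diff(
        captured_prompt.splitlines(), current.splitlines(),
        fromfile='captured', tofile='repo', lineterm='',
    ))
```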
Coverage for AI: Breakpoint Coverage
Define coverage metrics for AI behavior:
- Endpoint coverage: percent of LLM boundaries instrumented with breakpoints.
- Invariant coverage: percent of breakpoints with at least one invariant.
- Replay coverage: percent of breakpoints replayed in CI within 24 hours.
- Stability rate: fraction of replays that produce invariant-satisfying outputs.
Use these alongside code coverage to assess reliability.
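These ratios fall out directly from the breakpoint store. A sketch over the JSONL records written earlier; the `replayed_at` and `passed` fields are assumed bookkeeping added by the CI replay job, and the boundary counts would come from your instrumentation inventory.

```python
# Sketch: compute breakpoint coverage metrics over stored breakpoint records.
def coverage_report(breakpoints: list[dict],
                    instrumented_boundaries: int,
                    total_boundaries: int) -> dict:
    total = len(breakpoints)
    with_invariants = sum(1 for bp in breakpoints if bp.get('invariants'))
    replayed = [bp for bp in breakpoints if bp.get('replayed_at')]   # assumed field
    stable = sum(1 for bp in replayed if bp.get('passed'))           # assumed field
    return {
        'endpoint_coverage': instrumented_boundaries / total_boundaries if total_boundaries else 0.0,
        'invariant_coverage': with_invariants / total if total else 0.0,
        'replay_coverage': len(replayed) / total if total else 0.0,
        'stability_rate': stable / len(replayed) if replayed else 0.0,
    }
```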
Example: TypeScript Adapter With OpenTelemetry
A Node snippet that wraps a provider and emits OTel spans:
```ts
import { context, trace, SpanStatusCode } from '@opentelemetry/api'
import crypto from 'node:crypto'

export type Msg = { role: 'system' | 'user' | 'assistant'; content: string }

export interface Provider {
  name: string
  model: string
  chat(
    messages: Msg[],
    opts: { temperature: number; top_p: number; seed?: number; max_tokens?: number }
  ): Promise<{ text: string; usage?: { input: number; output: number } }>
}

export async function withLLMBreakpoint(
  provider: Provider,
  messages: Msg[],
  opts: { temperature: number; top_p: number; seed?: number; max_tokens?: number },
  invariants: ((text: string) => void)[]
): Promise<string> {
  const tracer = trace.getTracer('ai.debugger')
  const promptHash = crypto.createHash('sha256').update(JSON.stringify(messages)).digest('hex')
  const span = tracer.startSpan('llm.call', {
    attributes: {
      'ai.provider': provider.name,
      'ai.model': provider.model,
      'ai.prompt.hash': promptHash,
      'ai.temperature': opts.temperature,
      'ai.top_p': opts.top_p,
      'ai.seed': opts.seed ?? -1,
    },
  })
  return await context.with(trace.setSpan(context.active(), span), async () => {
    try {
      const res = await provider.chat(messages, opts)
      span.setAttribute('ai.tokens.input', res.usage?.input ?? 0)
      span.setAttribute('ai.tokens.output', res.usage?.output ?? 0)
      invariants.forEach(fn => fn(res.text))
      return res.text
    } catch (err: any) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err?.message ?? err) })
      throw err
    } finally {
      span.end()
    }
  })
}
```
This adapter can serialize a minimal breakpoint object alongside the span and store it for replays.
Limitations and Edge Cases
- Streaming
  - Capture streamed tokens with chunked events or assemble final text for invariants. Consider tradeoffs between storage and insight.
- Tool concurrency
  - Agent frameworks may dispatch tools in parallel. Ensure each tool call has a child span and robust correlation ids.
- Retrieval volatility
  - Indexes change over time. Snapshot index versions or store retrieved chunks inline to keep replays meaningful.
- Provider drift
  - Vendors change model weights. Capture model snapshot identifiers when available and annotate traces with roll events.
- Seeds are not magic
  - Some providers do not expose full determinism. When determinism fails, compare semantics rather than exact bytes.
- Prompt privacy
  - Even redacted, some prompts are sensitive. Provide kill switches and sampling ratios for data capture.
Toward a Standard
Adopt or propose semantic conventions for OTel attributes such as ai.provider, ai.model, ai.tokens.*, ai.cost.usd, and ai.prompt.hash. Where possible, re-use existing db and messaging conventions for tool calls.
A small, provider-agnostic JSON schema for breakpoints would unlock:
- Cross-vendor replay tools.
- Shared dashboards and triage automation.
- Reproducibility archives for incident response.
Candidates to watch and integrate with:
- OpenTelemetry community for AI semantic conventions.
- Agent frameworks such as LangChain, LlamaIndex, and DSPy for breakpoint hooks.
- Local inference engines such as vLLM and llama.cpp for deterministic kernels and tokenizer pinning.
- Property-based testing libraries such as Hypothesis and fast-check for invariants.
Putting It All Together: A Day in the Life
- You ship a new feature that uses an agent to generate code migrations.
- A user hits a path where the agent drops a critical step.
- The request emits an LLM breakpoint at the agent planner and executor. The planner invariant "must produce a plan with at least 3 steps" fails.
- The failure appears in your trace dashboard with stack location, prompt hash, and invariant id.
- A minimal reproduction file lands in your breakpoint store. CI picks it up and reproduces the failure on main.
- You open the breakpoint in VS Code, replay locally with the captured seed and sampling, and see the faulty plan.
- You refine the system prompt and add a constraint. The replay passes. You commit the prompt change with a unit test referencing the breakpoint id.
- CI replays the original breakpoint and a held-out set from the past week. All green. You deploy.
This is the missing developer loop for AI: from bug report to reproducible failure to verified fix.
Opinionated Guidance
- Treat every LLM call like an external API boundary. You would never call a payment API without recording request id, version, cost, and response codes. Do the same here.
- Prefer a small number of strong invariants over a long tail of weak ones. Start with parseability, no PII, and task-specific structural checks.
- Fail fast in dev and staging. In production, fail open with a low-risk fallback but still record a failing breakpoint.
- Do not couple breakpoints to any single framework. They should be as portable as JSON logs.
- Make it default-on in new services. Instrumentation after the fact is always harder.
References and Further Reading
- OpenTelemetry specification: opentelemetry.io
- DSPy for declarative LLM systems: arxiv.org/abs/2310.03714
- Hypothesis property-based testing: hypothesis.works
- QuickCheck: Koen Claessen and John Hughes, ICFP 2000
- ReAct prompting: arxiv.org/abs/2210.03629
- vLLM and tokenizer determinism: vllm.ai
Conclusion
LLM breakpoints promote AI development from art to engineering practice. By binding prompts to stack traces, traces to invariants, and invariants to CI, you create a virtuous cycle: failures turn into knowledge, knowledge turns into tests, and tests turn into stability. You will ship faster not by asking models to be perfect, but by building systems that can observe, reproduce, and fix their imperfections.
It is time to standardize this primitive, build ergonomic tooling around it, and make debugging AI as natural as debugging code. Add your first breakpoint today and do not ship another opaque prompt again.
