Stop Prompt-Glut: Tooling Up Debugging AI with MCP to Run Real Debuggers
Generative models got very good at talking about code. That does not make them good at debugging code.
Most teams experimenting with AI-assisted debugging start by throwing more words at the problem: longer logs, bigger contexts, more stack traces, multiple copies of the same failing output, maybe even the entire test suite as a paste. The model reads it all, predicts a likely-looking patch, and then everyone prays it compiles.
Sometimes it works. Too often, it hallucinates a function that never existed, forgets a build flag that changes behavior, or misses that the bug only reproduces under a specific allocator. We have doubled down on the prompt instead of the ground truth.
There is a cleaner way: stop prompt-glut and give the model actual tools. Use the Model Context Protocol (MCP) to wire your AI into real debuggers, static analyzers, and build systems. That yields:
- Reproducible traces: preserved test commands, tool versions, environment, and inputs
- Fewer hallucinations: answers grounded in tool outputs and runtime state
- CI-safe fixes: patches validated by deterministic builds, analyzers, and tests
This article is an opinionated blueprint for building a debugging AI that runs gdb, clang-tidy, and your build tool via MCP. It includes architecture, code snippets, session design patterns, evaluation metrics, and a step-by-step adoption plan.
The core problem: prompts are not evidence
Debugging is evidence-driven. You form hypotheses, instrument code, inspect memory, and run the executable under controlled conditions. LLM prompts, in contrast, are narratives. Without access to the running program, the model cannot verify a hypothesis. Even high-quality logs are lossy summaries; they omit state you did not think to print.
Symptoms of prompt-first debugging:
- Patch proposals that compile but do not fix the failing test
- Fixes that hide the symptom rather than address root cause
- Explanations that confidently cite nonexistent symbols or lines
- High variance: seemingly identical prompts produce different answers
In other words, the model is acting like a very smart rubber duck. We can do better by giving it the same power tools a human uses: debugger, analyzer, and build system.
What MCP brings to tool-using models
MCP, the Model Context Protocol, standardizes how models discover and call external tools. In practice it means:
- A process that exposes a set of tools with structured schemas
- A transport (commonly stdio or HTTP) speaking JSON-RPC with typed requests and results
- Resource providers and file systems the model can browse or fetch from
- A handshake for tool discovery, capabilities, and versioning
This is not the first tool-calling idea. LLM function calling, LSP servers, and various SDKs exist. MCP’s contributions for debugging are:
- A language-agnostic, vendor-agnostic way to expose multiple tools as one coherent surface
- Uniform, typed tool definitions and errors, making it safe to automate in CI
- A clean separation between model policy (reasoning) and tool execution (authority)
Instead of copy-pasting a backtrace into the prompt, the model calls a backtrace tool. Instead of hand-waving about whether a symbol exists, it reads symbols from the binary. The protocol turns debugging from text prediction into an experiment with instrumentation.
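Concretely, each tool invocation is a small JSON-RPC exchange rather than pasted text. The sketch below shows roughly what a tools/call request and result look like for the gdb.backtrace tool used later in this article; exact envelope fields vary by protocol version and SDK, so treat the shapes as illustrative.

```python
# Roughly what one MCP tool call looks like on the wire (JSON-RPC 2.0).
# The 'tools/call' method and the content/isError result fields follow the MCP
# spec; the gdb.backtrace tool and its arguments are this article's example.
import json

request = {
    'jsonrpc': '2.0',
    'id': 42,
    'method': 'tools/call',
    'params': {
        'name': 'gdb.backtrace',            # a tool the server advertised via tools/list
        'arguments': {'session_id': 's1'},  # validated against the tool's input schema
    },
}

result = {
    'jsonrpc': '2.0',
    'id': 42,
    'result': {
        'content': [{'type': 'text', 'text': '#0 Buffer::append(...) at buffer.cc:89'}],
        'isError': False,
    },
}

print(json.dumps(request, indent=2))
```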
Architecture: an AI debugger that actually debugs
At a high level, the system has these components:
- MCP client in the model runtime: the AI can discover and call tools
- MCP servers that expose:
  - Debugger actions: run program, set breakpoints, backtrace, print variables
  - Build actions: configure, build targets, run tests, regenerate compile_commands
  - Analysis actions: run clang-tidy, cppcheck, mypy, bandit, semgrep, etc.
  - Repo actions: checkout commit, open branch, apply patch, create PR
- Sandboxed execution environment: containerized toolchain and dependencies
- Provenance and logging: every tool call recorded with inputs, outputs, timestamps, tool versions, and environment
If your repo already uses Bazel or hermetic containers, you are halfway there. If not, you can still get most benefits by capturing a faithful session manifest and ensuring determinism in the test harness.
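For the provenance piece, a thin middleware that wraps every tool handler is usually enough. Here is a minimal sketch; the session.log path, field names, and error policy are illustrative.

```python
# file: provenance.py -- minimal sketch of tool-call provenance logging.
# Every call is appended to a JSON-lines session log: inputs, outputs, timing,
# and the tool version, so a reviewer can replay the session afterward.
import json
import time
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path('artifacts/session.log')

def record_tool_call(tool_name, tool_version, arguments, handler):
    '''Run a tool handler and append a structured record of the call.'''
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    started = datetime.now(timezone.utc).isoformat()
    t0 = time.monotonic()
    try:
        result, error = handler(**arguments), None
    except Exception as exc:  # no silent retries: failures are recorded and surfaced
        result, error = None, repr(exc)
    entry = {
        'tool': tool_name,
        'tool_version': tool_version,
        'arguments': arguments,
        'result': result,
        'error': error,
        'started_at': started,
        'duration_s': round(time.monotonic() - t0, 3),
    }
    with LOG_PATH.open('a') as f:
        f.write(json.dumps(entry, default=str) + '\n')
    if error:
        raise RuntimeError(f'{tool_name} failed: {error}')
    return result
```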
Why gdb beats a longer prompt
Here is the qualitative difference:
- Prompt-glut: Please fix intermittent segfault. It happens more with large inputs.
- With gdb via MCP: Run binary under gdb with ASAN, set breakpoint on SIGSEGV, print backtrace, print suspicious pointer values, inspect heap metadata, step until write occurs.
The second requires more wiring, but the output is decisive. The model does not need to invent; it observes and reasons.
Minimal MCP server that wraps gdb via the MI interface
GDB has a machine interface (MI) built for tooling; in Python, the pygdbmi library is a convenient bridge. Here is a minimal MCP server exposing a few gdb tools.
```python
# file: mcp_gdb_server.py
from pygdbmi.gdbcontroller import GdbController

# A trivial MCP-like skeleton. In production, use an MCP SDK to register tools
# and serve over stdio or websockets with JSON-RPC. This code illustrates tool shapes.


class GDBSession:
    def __init__(self, exe_path, args=None):
        self.exe_path = exe_path
        self.args = args or []
        self.gdb = None

    async def start(self):
        # --args must precede the executable and its arguments
        self.gdb = GdbController(
            command=['gdb', '--interpreter=mi2', '--args', self.exe_path, *self.args]
        )
        # helpful defaults: no paging, pretty printing
        self._mi('-gdb-set pagination off')
        self._mi('-enable-pretty-printing')
        return True

    def _mi(self, cmd):
        return self.gdb.write(cmd)

    def run(self):
        return self._mi('-exec-run')

    def backtrace(self):
        return self._mi('-stack-list-frames')

    def break_at(self, location):
        return self._mi(f'-break-insert {location}')

    def continue_run(self):
        return self._mi('-exec-continue')

    def print_var(self, expr):
        return self._mi(f'-data-evaluate-expression {expr}')

    def next(self):
        return self._mi('-exec-next')

    def step(self):
        return self._mi('-exec-step')

    def quit(self):
        if self.gdb:
            self.gdb.exit()


# MCP-style tool handlers (pseudo). Replace with real MCP registration.
_sessions = {}


async def tool_gdb_open(session_id, exe_path, args=None):
    if session_id in _sessions:
        return {'ok': False, 'error': 'session already exists'}
    s = GDBSession(exe_path, args)
    await s.start()
    _sessions[session_id] = s
    return {'ok': True}


async def tool_gdb_backtrace(session_id):
    s = _sessions.get(session_id)
    if not s:
        return {'ok': False, 'error': 'no such session'}
    return {'ok': True, 'frames': s.backtrace()}


async def tool_gdb_print(session_id, expr):
    s = _sessions.get(session_id)
    if not s:
        return {'ok': False, 'error': 'no such session'}
    return {'ok': True, 'value': s.print_var(expr)}


async def tool_gdb_run(session_id):
    s = _sessions.get(session_id)
    if not s:
        return {'ok': False, 'error': 'no such session'}
    return {'ok': True, 'events': s.run()}


async def tool_gdb_quit(session_id):
    s = _sessions.pop(session_id, None)
    if s:
        s.quit()
    return {'ok': True}
```
In a production-grade MCP server, you would:
- Register these as typed tools with JSON schemas
- Stream long outputs as chunks, with size limits
- Normalize MI results into cleaner structured outputs
- Add timeouts and OS-level sandboxing
- Record a session manifest with the binary SHA256, gdb version, and environment
The point is not a perfect wrapper; it is to move from text-only to authority-bearing tools the model can call.
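As a concrete example of the timeout and output-cap points above, a hardening wrapper for subprocess-backed tools can be as small as the sketch below; the limits are placeholders to tune per tool.

```python
# A sketch of per-call hardening for subprocess-backed tools: a hard timeout
# and an output cap so runaway logs never flood the model's context.
import subprocess

MAX_OUTPUT_BYTES = 200_000  # placeholder cap; tune per tool

def run_capped(cmd, timeout_s=120, env=None):
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s, env=env)
    except subprocess.TimeoutExpired:
        return {'ok': False, 'error': f'timed out after {timeout_s}s'}
    return {
        'ok': proc.returncode == 0,
        'exit_code': proc.returncode,
        # keep only the tail; attach the full log as an MCP resource instead
        'stdout': proc.stdout[-MAX_OUTPUT_BYTES:].decode('utf-8', 'replace'),
        'stderr': proc.stderr[-MAX_OUTPUT_BYTES:].decode('utf-8', 'replace'),
    }
```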
A structured session manifest for reproducibility
Reproducibility is the currency of CI-safe fixes. Before any debugging, the AI should capture a manifest that can re-create the environment. This can be a resource exposed by the MCP server or a file checked into the PR. Example shape:
```json
{
  "session_id": "debug-2025-01-27T12:03Z-a1b2c3",
  "repo": {
    "url": "git@github.com:org/project.git",
    "commit": "8d4c2d9",
    "dirty": false
  },
  "build": {
    "system": "bazel",
    "target": "//app:server",
    "flags": ["--config=asan"],
    "cache": "remote",
    "toolchain": { "cc": "clang 18.1.2", "ld": "lld 18.1.2" }
  },
  "runtime": {
    "container": "ghcr.io/org/ci:2025-01-15@sha256:abc...",
    "env": {
      "ASAN_OPTIONS": "halt_on_error=1:allocator_may_return_null=1",
      "UBSAN_OPTIONS": "abort_on_error=1"
    },
    "args": ["--config", "stage"],
    "input_seed": 1337
  },
  "tools": {
    "gdb": "14.2",
    "clang_tidy": "18.1.2",
    "cppcheck": "2.14",
    "mcp_server": "org.project.mcp.gdb@0.2.1"
  }
}
```
This manifest is not just nice-to-have. It is the difference between a one-off lucky fix and a repeatable pipeline that any reviewer can reproduce locally or in CI.
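A sketch of how the MCP server can assemble parts of this manifest automatically at session start; the commands are standard git and toolchain version queries, while the output path and covered fields are illustrative.

```python
# Minimal sketch: gather a few manifest fields at session start.
# Extend with build flags, container digest, and environment as needed.
import json
import subprocess

def _out(cmd):
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

def capture_manifest(session_id, path='manifest.json'):
    manifest = {
        'session_id': session_id,
        'repo': {
            'commit': _out(['git', 'rev-parse', '--short', 'HEAD']),
            'dirty': bool(_out(['git', 'status', '--porcelain'])),
        },
        'tools': {
            'gdb': _out(['gdb', '--version']).splitlines()[0],
            'clang_tidy': _out(['clang-tidy', '--version']),
        },
    }
    with open(path, 'w') as f:
        json.dump(manifest, f, indent=2)
    return manifest
```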
Wiring in static analyzers via MCP
Runtime debugging is only half the story. Static analyzers catch classes of bugs without running the program. A model that can invoke them intelligently will propose smaller, safer patches and avoid regressions.
Tools worth exposing:
- clang-tidy with compile_commands.json
- clang static analyzer or scan-build
- cppcheck for C/C++
- semgrep for idiomatic rules across languages
- mypy, ruff, bandit for Python
- golangci-lint for Go
- cargo clippy for Rust
A minimal MCP tool wrapper for clang-tidy might look like:
```python
# file: mcp_clang_tidy.py
import subprocess
from pathlib import Path


async def tool_clang_tidy(paths, checks=None, fix=False):
    cmd = ['clang-tidy']
    if checks:
        cmd.append(f'-checks={checks}')
    if fix:
        cmd += ['-fix', '-format-style=file']
    # clang-tidy discovers translation units via compile_commands.json
    if not Path('compile_commands.json').exists():
        return {'ok': False, 'error': 'missing compile_commands.json'}
    cmd += paths
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        'ok': proc.returncode in (0, 1),  # nonzero can just mean findings, depending on flags
        'exit_code': proc.returncode,
        'stdout': proc.stdout[-200000:],
        'stderr': proc.stderr[-200000:],
    }
```
The AI should use analyzers both before and after a proposed fix, record the diffs in findings, and attach results to PRs. Over time, you can teach the AI a house style: prefer refactors suggested by analyzers, avoid changes that introduce new warnings, and gate merging on a clean analysis run.
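One way to operationalize the before/after comparison is to normalize findings into (file, check) pairs and diff the sets. The sketch below parses clang-tidy's default warning lines; adjust the pattern for other analyzers or output formats.

```python
# A sketch of gating on newly introduced findings. Parses clang-tidy's default
# 'file:line:col: warning: message [check-name]' lines.
import re

FINDING_RE = re.compile(
    r'^(?P<file>\S+?):(?P<line>\d+):\d+: warning: .* \[(?P<check>[\w.,-]+)\]$'
)

def parse_findings(clang_tidy_stdout):
    findings = set()
    for line in clang_tidy_stdout.splitlines():
        m = FINDING_RE.match(line.strip())
        if m:
            # ignore line numbers: they shift when the patch moves code
            findings.add((m['file'], m['check']))
    return findings

def new_findings(before_stdout, after_stdout):
    '''Findings present after the patch that were absent before it.'''
    return parse_findings(after_stdout) - parse_findings(before_stdout)
```

Run tool_clang_tidy before and after repo.apply_patch, and block the PR when new_findings returns anything.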
Build-system integration that does not leak state
A debugging AI must be able to build your project like a developer does, but within a sandbox that yields deterministic results. Two practical patterns:
- Bazel or Buck: hermetic builds, content-addressable caching, reproducible test runs
- CMake + Ninja + containerized toolchain: pin compiler versions and dependencies, generate compile_commands.json
Expose build tools via MCP:
- build.configure: configure the build, e.g., cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
- build.target: build a specific target or run unit tests
- build.generate_compile_commands: ensure analyzers have correct flags
- build.test: run a test filter, collect JUnit XML, coverage data
Be strict: impose timeouts and output truncation, and make the model pass environment overrides explicitly (e.g., ASAN_OPTIONS) through tool parameters. Surprise inherited environment variables are a common source of nondeterminism.
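Here is a minimal sketch of a build.test tool for a Bazel workspace that follows that rule: environment overrides travel as explicit --test_env flags, so nothing is inherited silently. The target and filter semantics mirror the schema shown later; adapt the flags and paths to your own configuration.

```python
# file: mcp_build.py -- sketch of a build.test tool for a Bazel workspace.
# Environment overrides are passed explicitly via --test_env, never inherited.
import subprocess

async def tool_build_test(target, filter=None, config=None, env_overrides=None,
                          timeout_s=900):
    cmd = ['bazel', 'test', target, '--test_output=errors']
    if config:
        cmd.append(f'--config={config}')      # e.g. 'asan'
    if filter:
        cmd.append(f'--test_filter={filter}')
    for key, value in (env_overrides or {}).items():
        cmd.append(f'--test_env={key}={value}')
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {'ok': False, 'error': f'timed out after {timeout_s}s'}
    return {
        'ok': proc.returncode == 0,
        'exit_code': proc.returncode,
        'stdout': proc.stdout[-200_000:],
        'stderr': proc.stderr[-200_000:],
        # Bazel writes JUnit-style results under bazel-testlogs/<target>/test.xml
        'junit_xml_path': f"bazel-testlogs/{target.lstrip('/').replace(':', '/')}/test.xml",
    }
```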
A debugging session, end-to-end
Below is a condensed transcript of an AI debugging a crashing HTTP server using MCP tools. The session demonstrates how grounding replaces speculation.
- Reproduce and capture manifest
  - Tool call build.target with flags [--config=asan], target //app:server
  - Tool call build.test with filter ServerCrashTest
  - Tool result: test fails with SIGSEGV under ASAN in connection.cc:214
  - Tool call resources.write to persist manifest.json
- Start gdb and run the test binary
  - Tool call gdb.open with session_id s1, exe_path bazel-bin/app/server_tests, args ['--gtest_filter', 'ServerCrashTest']
  - Tool call gdb.run s1
  - Tool result: program receives SIGSEGV at connection.cc:214
- Inspect state
  - Tool call gdb.backtrace s1
  - Tool result: frames show Connection::on_read -> HTTPParser::parse -> Buffer::append
  - Tool call gdb.print s1 expr 'this->inbuf.size'
  - Tool result: value 18446744073709551615 (size_t wraparound)
  - Tool call gdb.print s1 expr 'len'
  - Tool result: 16384
- Hypothesis
  - The input buffer size has wrapped around size_t, likely from an unchecked subtraction that underflows. Suspect the arithmetic in Buffer::reserve or Buffer::append.
- Set breakpoint and step
  - Tool call gdb.break_at s1 location 'buffer.cc:89'
  - Tool call gdb.continue s1
  - Tool result: hit breakpoint at buffer.cc:89 in Buffer::reserve
  - Tool call gdb.print s1 expr 'needed'
  - Tool result: -1
  - Tool call gdb.print s1 expr 'capacity'
  - Tool result: 0
- Root cause
  - needed is computed as desired - size with desired < size; the negative intermediate converts to a huge size_t, so reserve requests a wrapped-around allocation and the subsequent write runs out of bounds.
- Propose fix
  - Add a saturating check: if desired <= size, return early; otherwise compute the growth with checked arithmetic.
- Apply and validate
  - Tool call repo.apply_patch with a patch for buffer.cc
  - Tool call build.target for //app:server with ASAN
  - Tool call build.test for ServerCrashTest
  - Tool result: test passes; no ASAN errors
  - Tool call clang_tidy on the modified files
  - Tool result: clean
  - Tool call repo.create_pr with title 'Fix size_t wraparound in Buffer::reserve'
This session has two crucial properties: every claim the AI made is backed by a tool output, and anyone can replay the session by using the manifest and the recorded tool calls.
Policies that reduce hallucination rate
Even with tools, policies matter. These are simple guardrails that move the AI from story-telling to engineering:
- Tool-first, not prompt-first: when a claim can be verified by a tool, call the tool
- Evidence tagging: every assertion is labeled as observed, inferred from observation, or assumed
- State before patch: capture a backtrace and relevant variable dumps before proposing any fix
- No silent retries: all failed tool calls are recorded and explained
- Deterministic reruns: identical tool calls with same inputs must yield same outputs; otherwise the model questions the environment
- Summarize by reference: include file:line anchors and snippet ranges rather than pasting whole files into context
These policies are mechanical to implement in the MCP middleware. They make reviewer trust easier to earn.
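As one example, evidence tagging can be enforced with a tiny data structure in the middleware rather than relying on the model's goodwill; the labels and fields below are illustrative.

```python
# A sketch of mechanical evidence tagging: every claim the agent emits must be
# labeled and, if 'observed', must cite recorded tool-call ids from the session log.
from dataclasses import dataclass, field

ALLOWED_LABELS = {'observed', 'inferred', 'assumed'}

@dataclass
class Claim:
    text: str                 # e.g. 'inbuf.size wrapped around to SIZE_MAX'
    label: str                # observed | inferred | assumed
    evidence: list = field(default_factory=list)  # tool-call ids backing the claim

    def validate(self):
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f'unknown evidence label: {self.label}')
        if self.label == 'observed' and not self.evidence:
            raise ValueError('observed claims must cite at least one tool call')
```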
Evaluation: measure fixes, not vibes
You cannot manage what you do not measure. For debugging AIs, prefer task-based metrics over token-level metrics.
Tasks and datasets:
- Defects4J: real Java bugs with tests; expose the Maven or Gradle build as an MCP server plus a Java debugger (jdb or a DAP adapter)
- BugsInPy: real Python faults; combine pytest, pdb, mypy, and bandit
- SWE-bench and related variants: repository-scale software engineering tasks; helpful to measure end-to-end patch acceptance under CI
Metrics:
- Reproduction rate: fraction of reported failures the system can reproduce in its sandbox
- Fix acceptance rate: patches that pass tests and survive code review
- Regression rate: fixes that introduce new test failures or analyzer findings
- Iterations to fix: average tool-call rounds from start to green tests
- Hallucination rate: human-rated rate of claims not supported by tool outputs
While reported numbers vary across papers and model families, a consistent finding is that tool grounding and execution feedback loops materially improve end-to-end fix rates versus prompt-only approaches. Your goal is not to chase a single leaderboard but to continuously reduce triage time and increase the share of PRs that are green on first CI run.
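To make a few of these metrics concrete, here is a small sketch that rolls recorded sessions up into rates and averages; the field names (reproduced, fixed, tool_calls) are assumptions about your session log schema.

```python
# Sketch: summarize recorded debugging sessions into the metrics above.
def summarize_sessions(sessions):
    n = len(sessions)
    reproduced = [s for s in sessions if s.get('reproduced')]
    fixed = [s for s in sessions if s.get('fixed')]
    return {
        'reproduction_rate': len(reproduced) / n if n else 0.0,
        'fix_rate': len(fixed) / n if n else 0.0,
        'mean_iterations_to_fix': (
            sum(len(s.get('tool_calls', [])) for s in fixed) / len(fixed)
            if fixed else None
        ),
    }
```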
Safety and security: least privilege by default
You are giving an autonomous agent power to run code on source trees. That power needs tight controls:
- Sandbox execution in containers with read-only root, writable workspace, no network unless needed
- Explicit allowlists: expose only specific commands as MCP tools; do not give a raw shell
- Timeouts and memory limits per tool call; kill runaway processes
- Logging and audit: record inputs, outputs, resource usage, exit codes
- Secrets hygiene: pass credentials only to tools that need them, and scope them to read-only operations when possible
- Content filters: redact PII and secrets from logs attached to PRs
MCP’s typed tools make allowlists enforceable: you decide exactly what arguments a tool takes and validate them.
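A sketch of what least privilege looks like at the process level when Docker is the sandbox; the image name, mounts, and limits are placeholders for your own CI setup.

```python
# Sketch: run a tool inside a locked-down container. The flags shown are
# standard docker run options; tune limits and mounts for your environment.
import subprocess

def run_sandboxed(tool_cmd, image='ghcr.io/org/ci:2025-01-15', workspace='/tmp/ws'):
    docker_cmd = [
        'docker', 'run', '--rm',
        '--network=none',            # no network unless a tool explicitly needs it
        '--read-only',               # read-only root filesystem
        '--tmpfs', '/tmp',           # scratch space
        '-v', f'{workspace}:/workspace:rw',
        '--workdir', '/workspace',
        '--memory=4g', '--cpus=2', '--pids-limit=256',
        image,
    ] + list(tool_cmd)
    return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=600)
```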
Bridging the debugging ecosystem: DAP, LSP, MCP
You do not need to reinvent everything. Useful bridges:
- DAP (Debug Adapter Protocol): many debuggers implement DAP servers (e.g., vscode-cpptools, debugpy). An MCP server can wrap DAP calls into simplified tools like backtrace, variables, continue, step
- LSP (Language Server Protocol): for code navigation, references, and refactors; an MCP tool can expose find-references or rename via LSP
- VCS platforms: MCP tools for repo.apply_patch, repo.create_branch, repo.open_pr
This keeps your AI compatible with existing developer tooling while giving it a stable, typed surface to act on.
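To give a flavor of the DAP bridge, the sketch below shows the translation an MCP backtrace tool performs: it frames a DAP stackTrace request using the protocol's Content-Length convention. The initialize/launch handshake and response handling are omitted.

```python
# Sketch: frame a DAP 'stackTrace' request as an MCP backtrace tool would.
# DAP messages are JSON bodies preceded by a Content-Length header.
import json

def dap_frame(message: dict) -> bytes:
    body = json.dumps(message).encode('utf-8')
    return f'Content-Length: {len(body)}\r\n\r\n'.encode('ascii') + body

def backtrace_request(seq: int, thread_id: int) -> bytes:
    return dap_frame({
        'seq': seq,
        'type': 'request',
        'command': 'stackTrace',
        'arguments': {'threadId': thread_id},
    })

# The MCP server writes this to the adapter's stdin or socket, reads the
# matching response, and returns a simplified {'frames': [...]} result.
```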
Implementation blueprint: from zero to useful in two weeks
Here is a pragmatic rollout plan.
Week 1: local prototyping
- Containerize your build: pin compiler versions, ensure tests run deterministically
- Generate compile_commands.json and verify clang-tidy runs cleanly on main
- Implement a minimal MCP server for build.* and analyzer.* tools
- Expose resources for the workspace files and the manifest
- Add recording of tool calls and outputs to a session.log file
Week 2: debugger integration and CI loop
- Add gdb or lldb tools via pygdbmi or a DAP bridge
- Add timeouts, output caps, and streaming for long logs
- Implement repo.* tools: apply_patch, create_branch, open_pr
- Build a thin policy layer: tool-first, evidence tagging, deterministic reruns
- Create a GitHub Actions or GitLab CI job that runs the AI on incoming failing test reports
- Gate PRs created by the AI behind human review, attach the session manifest and summarized evidence
Optional enhancements:
- Static checker policies: run analyzers pre- and post-patch; block new critical findings
- Fuzz reproduction tool: expose a fuzz.run tool and a seed recorder for tricky bugs
- Coverage-aware prioritization: prefer fixes that increase failing-test coverage of changed lines
Example MCP tool schemas (pseudo)
Define tools with explicit inputs and structured outputs. Below is a sketch of three tool definitions.
```json
{
  "tools": [
    {
      "name": "gdb.open",
      "description": "Start a gdb session for an executable",
      "input_schema": {
        "type": "object",
        "properties": {
          "session_id": { "type": "string" },
          "exe_path": { "type": "string" },
          "args": { "type": "array", "items": { "type": "string" } },
          "env": { "type": "object", "additionalProperties": { "type": "string" } }
        },
        "required": ["session_id", "exe_path"]
      },
      "output_schema": {
        "type": "object",
        "properties": { "ok": { "type": "boolean" }, "error": { "type": "string" } }
      }
    },
    {
      "name": "gdb.backtrace",
      "input_schema": {
        "type": "object",
        "properties": { "session_id": { "type": "string" } },
        "required": ["session_id"]
      },
      "output_schema": {
        "type": "object",
        "properties": { "ok": { "type": "boolean" }, "frames": { "type": "array" } }
      }
    },
    {
      "name": "build.test",
      "input_schema": {
        "type": "object",
        "properties": {
          "target": { "type": "string" },
          "filter": { "type": "string" }
        },
        "required": ["target"]
      },
      "output_schema": {
        "type": "object",
        "properties": {
          "ok": { "type": "boolean" },
          "exit_code": { "type": "integer" },
          "stdout": { "type": "string" },
          "stderr": { "type": "string" },
          "junit_xml_path": { "type": "string" }
        }
      }
    }
  ]
}
```
These schemas are contract tests. Your MCP middleware should validate inputs against them and reject ambiguous or dangerous requests before any subprocess is spawned.
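A sketch of that validation step using the jsonschema package; the tool dict shape matches the schemas above.

```python
# Sketch: validate tool arguments against the registered input schema before
# any subprocess is spawned. Uses the 'jsonschema' package.
from jsonschema import ValidationError, validate

def validate_tool_call(tool, arguments):
    '''Raise if arguments do not satisfy the tool's declared input_schema.'''
    try:
        validate(instance=arguments, schema=tool['input_schema'])
    except ValidationError as exc:
        raise ValueError(f"rejected call to {tool['name']}: {exc.message}") from exc

# Example: this raises, because gdb.open requires 'exe_path'.
# validate_tool_call(gdb_open_tool, {'session_id': 's1'})
```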
A CI workflow that keeps humans in the loop
Make the AI a disciplined colleague, not an all-powerful committer. A typical GitHub Actions workflow might look like:
```yaml
name: ai-debugger

on:
  workflow_dispatch:
  issue_comment:
    types: [created]
  pull_request:
    types: [labeled]

jobs:
  run-ai-debugger:
    if: contains(github.event.pull_request.labels.*.name, 'ai-debug') || startsWith(github.event.comment.body, '/ai-debug')
    runs-on: ubuntu-22.04
    container: ghcr.io/org/ci:2025-01-15
    steps:
      - uses: actions/checkout@v4
      - name: Prepare build cache
        uses: actions/cache@v4
        with:
          path: ~/.cache/bazel
          key: bazel-${{ hashFiles('**/*') }}
      - name: Start MCP servers
        run: |
          nohup python mcp_gdb_server.py &
          nohup python mcp_clang_tidy.py &
      - name: Run AI agent
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python run_agent.py --plan reproduce,debug,fix,validate --out artifacts/
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: ai-debug-session
          path: artifacts/
```
Key properties:
- The workflow is opt-in via a label or slash command
- Everything runs in a pinned container image
- Artifacts include session.log, manifest.json, diffs, analyzer results, and a summary.md human reviewers can read in minutes
Good prompts still matter, but they are different
A tool-using model still needs planning and reflection. Useful internal instructions for the agent:
- Plan: break down the debugging task into reproduce, inspect, hypothesize, validate
- Observe: prefer short, targeted tool queries over one massive invocation
- Summarize: keep a running notebook with hypotheses and supporting evidence, referencing file:line, frame numbers, and variable names
- Validate: before making a patch, specify a falsifiable test or property that would fail if the hypothesis is wrong
- Minimize: propose the smallest patch that resolves the failure and does not introduce new analyzer warnings
These are not overwrought system prompts; they are operational checklists that keep the agent honest and effective.
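As a rough illustration, that checklist collapses into a small loop shape; llm_plan_step and call_tool are placeholders for your model runtime and MCP client.

```python
# Toy skeleton of the checklist as an agent loop: plan a step, observe via a
# tool, record the evidence in a notebook, and stop when the plan declares done.
def run_session(task, llm_plan_step, call_tool, max_rounds=25):
    notebook = {'task': task, 'hypotheses': [], 'evidence': []}
    for _ in range(max_rounds):
        step = llm_plan_step(notebook)   # e.g. {'tool': 'gdb.print', 'arguments': {...}, 'why': ...}
        if step.get('done'):
            return notebook
        result = call_tool(step['tool'], step['arguments'])
        notebook['evidence'].append({
            'tool': step['tool'],
            'arguments': step['arguments'],
            'result': result,
            'why': step.get('why'),      # ties the observation back to a hypothesis
        })
    raise RuntimeError('tool-call budget exhausted without a validated fix')
```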
Common pitfalls and how to avoid them
- Partial symbol info: debugging optimized binaries without debuginfo wastes time; ensure -g or split DWARF is present in the targets the agent uses
- Fighting the build: non-hermetic builds, nondeterministic tests, or flaky network calls will make the agent chase ghosts; use hermetic containers and network sandboxes
- Tool output overload: treat tool logs as data, not as prompt dumps; summarize to structured evidence and attach full logs as resources
- Over-broad analyzers: too many warnings drown signal; tune clang-tidy checks, use baselines, and gate only on newly introduced issues for AI patches
- Silent environment drift: pin tool versions in the container and surface them in the manifest; fail early if versions drift
Where this approach aligns with the literature
The broader research on tool-using and verification-aware LLMs points in the same direction:
- Models improve reliability when they can call external tools and ground their outputs in observable state
- Planning-execution-verification loops outperform one-shot answers for complex tasks
- Program synthesis and repair systems that compile, run tests, and iterate consistently beat prompt-only baselines
Across datasets like Defects4J, BugsInPy, and repository-scale suites, studies report material gains when agents compile, execute tests, and use analyzers versus purely predicting patches from textual artifacts. While exact percentages vary by setting and model, the directional conclusion is robust: ground truth tools reduce hallucinations and increase fix acceptance.
The big picture: from narrative to experiment
The most important shift is cultural. Treat the AI like a junior engineer who is excellent at pattern recognition but must learn to prove claims. That requires turning debugging into an experiment:
- Hypothesis: the crash is caused by underflow in size computation
- Instrumentation: breakpoints, variable inspection, analyzers focusing on signed-to-unsigned conversions
- Experiment: run with ASAN, capture backtrace, inspect variable ranges under different inputs
- Conclusion: code path leads to size_t wraparound; patch to guard and adjust arithmetic
- Verification: failing test passes, no new analyzer issues, coverage covers changed lines
When you operationalize this with MCP, your debugging sessions stop being long prompts and start being scientific notebooks with runnable cells.
Opinionated takeaways
- Stop pasting megabytes of logs into prompts. Tool calls are cheaper and more reliable than tokens spent on context.
- Do not let the model guess at runtime state. A single gdb print beats a paragraph of speculation.
- Reproducibility is a first-class feature. Ship a manifest with every AI-generated PR.
- Make analyzers friends, not gatekeepers. Use them to guide patches and catch regressions early.
- Keep humans in the loop. The AI should present small, well-evidenced diffs that are easy to review.
Appendix: a tiny end-to-end example repository layout
A sample repository with MCP debugging support might look like:
```
.
├── .github/workflows/ai-debugger.yml
├── containers/
│   └── ci.Dockerfile
├── mcp/
│   ├── gdb_server.py
│   ├── clang_tidy_server.py
│   └── repo_server.py
├── tools/
│   └── summarize.py
├── app/
│   ├── buffer.cc
│   ├── buffer.h
│   └── ...
├── CMakeLists.txt
├── compile_commands.json   # generated
└── README.md
```
README.md should include:
- How to start MCP servers locally
- How to run the agent in dry-run mode
- How to replay a session from a manifest
Closing
Debugging is not a writing task. It is an investigative process grounded in the actual behavior of a program. The Model Context Protocol makes it straightforward to give your AI the same instruments a human engineer uses. Once the agent can run gdb, invoke analyzers, and build deterministically, you will find that the conversation shifts from persuasive essays to concise, verifiable evidence — and patches that turn CI from red to green on the first try.
Stop prompt-glut. Wire the debugger. Use MCP to make your AI a trustworthy debugging partner.
