Why production debugging is different
Debugging in development happens in a controlled environment: unminified code, fast reloads, stable reproductions, and a single user (you). Production is the opposite:
- Code is optimized (minified, tree-shaken, code-split), so stack traces can be opaque.
- Issues are intermittent (race conditions, device-specific, network-specific).
- You have limited visibility (you can’t attach a debugger to a user’s browser or serverless runtime in most cases).
- You must prioritize safety (avoid logging PII, avoid adding heavy instrumentation, avoid changes that risk downtime).
This article provides a structured approach to production debugging for JavaScript applications—frontend (browser), Node.js backends, and serverless—covering instrumentation, source maps, observability tools, reliable reproduction strategies, and best practices that scale from junior dev to senior engineer.
1) Start with a decision tree: crash, error rate spike, or “weird behavior”
Before touching code, classify the incident. Your approach depends on what you’re seeing:
A. Hard crash / unhandled exception
Typical signals:
- Browser: window.onerror and unhandledrejection events, or a Sentry “unhandled exception” issue.
- Node: process exits, container restarts, UnhandledPromiseRejectionWarning (older Node), fatal OOM.
First actions:
- Identify the top stack trace and the release version.
- Verify source map availability for that release.
- Confirm if it’s a new regression (compare to previous deploy).
B. Error rate spike (non-fatal)
Examples:
- API requests failing with 500/502.
- Client-side requests failing due to CORS, auth expiry, network errors.
First actions:
- Correlate by time, region, browser, device, endpoint.
- Check dependencies (DB, cache, third-party API) and recent configuration changes.
- Look for saturation signals: p95 latency, CPU, memory, connection pools.
C. Weird behavior (no obvious error)
Examples:
- UI stuck in loading state.
- Data inconsistencies.
- Payment flows failing only for some users.
First actions:
- Add or inspect structured logs and business-level events.
- Use session replay (if available) or synthetic reproduction.
- Validate feature flag states and rollout percentages.
2) Instrumentation beats guessing: logs, metrics, traces
In production, “debugging” often means asking the system questions through telemetry. The core pillars are:
- Logs: discrete events with context.
- Metrics: aggregated measurements over time.
- Traces: end-to-end request flows across services.
Structured logging (do this, not printf debugging)
Use JSON logs with consistent fields. In Node.js, pino is fast and widely used.
```js
// logger.js
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: 'billing-api',
  },
  redact: {
    paths: ['req.headers.authorization', 'user.ssn', 'card.number'],
    remove: true,
  },
});
```
In an Express handler:
```js
app.get('/invoice/:id', async (req, res) => {
  const requestId = req.headers['x-request-id'] ?? crypto.randomUUID();
  res.setHeader('x-request-id', requestId);
  req.log = logger.child({ requestId, route: 'GET /invoice/:id' });

  try {
    req.log.info({ invoiceId: req.params.id }, 'Fetching invoice');
    const invoice = await invoices.get(req.params.id);
    res.json(invoice);
  } catch (err) {
    req.log.error({ err, invoiceId: req.params.id }, 'Failed to fetch invoice');
    res.status(500).json({ error: 'internal_error', requestId });
  }
});
```
Best practices:
- Always include requestId / traceId, userId (if safe), route, and relevant domain IDs.
- Prefer event-style messages (“PaymentAuthorizationFailed”) over vague text.
- Redact secrets and PII at the logger level.
Metrics: detect trends and regressions quickly
If you’re using Prometheus/OpenTelemetry, capture:
- Request counts, error counts, latency histograms.
- Queue depth, DB connection pool usage.
- Node.js event loop lag and memory.
Example (using prom-client):
```js
import client from 'prom-client';

const httpDuration = new client.Histogram({
  name: 'http_server_duration_ms',
  help: 'HTTP request duration in ms',
  labelNames: ['route', 'method', 'status'],
  buckets: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});

app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    httpDuration
      .labels(req.route?.path ?? 'unknown', req.method, String(res.statusCode))
      .observe(performance.now() - start);
  });
  next();
});
```
Traces: find where time and failures occur
Distributed tracing (OpenTelemetry + Jaeger/Tempo/Datadog) helps answer:
- Which downstream call is slow?
- Did we retry?
- Where did the error originate?
Even if you can’t fully adopt tracing everywhere, start by propagating a traceparent header and logging trace IDs.
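If you roll this by hand before adopting an OpenTelemetry SDK, a minimal sketch might look like the following; the handleRequest wrapper, the incomingHeaders shape, and the downstream URL are placeholders, and the logger is the pino logger from the earlier example.

```ts
// Minimal sketch (not a full OpenTelemetry setup): reuse an incoming W3C
// traceparent header or mint one, log the trace ID, and forward the header on
// outgoing calls. Format: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
import { randomBytes } from 'node:crypto';
import { logger } from './logger.js';

export function getOrCreateTraceparent(incoming?: string) {
  const m = incoming?.match(/^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/);
  if (m) return { traceparent: incoming as string, traceId: m[1] };

  const traceId = randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex');
  return { traceparent: `00-${traceId}-${spanId}-01`, traceId };
}

// Usage sketch inside a request handler: log the trace ID and forward the header
// so the downstream service can log the same one.
export async function handleRequest(incomingHeaders: Record<string, string | undefined>) {
  const { traceparent, traceId } = getOrCreateTraceparent(incomingHeaders['traceparent']);
  logger.info({ traceId }, 'handling request');
  await fetch('https://downstream.internal/items', { headers: { traceparent } });
}
```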
3) Source maps: the difference between “minified noise” and actionable stacks
Frontend source maps (Webpack/Vite/Rollup)
Production errors often arrive as:
```
TypeError: Cannot read properties of undefined (reading 'foo')
  at e (app.3f2c1a.js:1:18233)
```
With source maps uploaded to an error tracker, you get:
```
TypeError: Cannot read properties of undefined (reading 'foo')
  at renderInvoiceRow (src/components/InvoiceTable.tsx:87:13)
```
Key practices:
- Generate hidden-source-map (Webpack) or the equivalent to avoid exposing maps publicly.
- Upload maps to Sentry/Datadog/New Relic and do not serve them from your CDN.
- Include release identifiers (commit SHA, version).
Webpack example:
```js
// webpack.config.js
module.exports = {
  devtool: 'hidden-source-map',
};
```
Vite example:
```ts
// vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  build: {
    sourcemap: true, // pair with a private upload to your error tracker
  },
});
```
Node.js source maps
If you ship TypeScript or bundled server code, enable source maps in Node:
- Node 12+ supports source maps with --enable-source-maps.
Example:
```bash
node --enable-source-maps dist/server.js
```
On many platforms you can set NODE_OPTIONS=--enable-source-maps instead.
4) Error tracking: pick a system and configure it properly
Tool comparison (practical view)
- Sentry: excellent stacktrace mapping, breadcrumbs, releases, performance monitoring, session replay (optional). Strong ecosystem.
- Datadog: strong all-in-one platform (APM + logs + metrics + RUM). Great correlation across telemetry.
- New Relic: mature APM and browser monitoring.
- Rollbar / Bugsnag: solid error tracking; varying depth for performance tooling.
What matters most is not the brand—it’s whether you:
- Upload source maps.
- Tag releases.
- Capture user context safely.
- Capture breadcrumbs and network spans.
Frontend: capturing exceptions and context
Example with Sentry (browser):
```ts
import * as Sentry from '@sentry/react';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  release: process.env.APP_VERSION,
  environment: process.env.NODE_ENV,
  integrations: [Sentry.browserTracingIntegration()],
  tracesSampleRate: 0.1,
  beforeSend(event) {
    // remove sensitive fields if any were added accidentally
    return event;
  }
});

export function setUserContext(user: { id: string; plan: string }) {
  Sentry.setUser({ id: user.id });
  Sentry.setTag('plan', user.plan);
}
```
Backend: ensure errors are captured before process exit
In Node, fatal errors can kill the process before flushing telemetry. Ensure graceful shutdown and flush.
```js
process.on('uncaughtException', async (err) => {
  logger.fatal({ err }, 'uncaughtException');
  // flush telemetry here if needed
  process.exit(1);
});

process.on('unhandledRejection', async (reason) => {
  logger.fatal({ reason }, 'unhandledRejection');
  process.exit(1);
});
```
In many cases you should not keep running after an uncaught exception—crash fast and rely on orchestration.
5) Reproduction strategies that work in production reality
A. Reproduce with the same inputs
- Capture the request payload (redacted), headers, and feature flag states.
- Record the specific user journey or API sequence.
- Store a “debug bundle” for failing requests: requestId + trace + key IDs.
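What goes into such a bundle depends on your domain; a minimal sketch, with illustrative field names and a naive redaction helper, might look like this:

```ts
// Hypothetical "debug bundle": the minimum context needed to replay a failing
// request later. Field names are illustrative; adapt the sensitive-key list.
const SENSITIVE_KEYS = new Set(['authorization', 'password', 'ssn', 'cardNumber']);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) =>
        SENSITIVE_KEYS.has(k) ? [k, '[REDACTED]'] : [k, redact(v)]
      )
    );
  }
  return value;
}

interface DebugBundle {
  requestId: string;
  traceId?: string;
  route: string;
  featureFlags: Record<string, boolean>;
  payload: unknown; // request body after redaction
  capturedAt: string;
}

export function buildDebugBundle(
  req: { id: string; traceId?: string; route: string; body: unknown },
  flags: Record<string, boolean>
): DebugBundle {
  return {
    requestId: req.id,
    traceId: req.traceId,
    route: req.route,
    featureFlags: flags,
    payload: redact(req.body),
    capturedAt: new Date().toISOString(),
  };
}
```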
B. Reproduce with the same environment
Common mismatches:
- Different Node version.
- Different timezone/locale.
- Different CDN caching behavior.
- Different third-party API behavior.
Use containerized repro:
```bash
docker run --rm -it node:20 bash
# inside the container:
node -e "console.log(Intl.DateTimeFormat().resolvedOptions())"
```
C. Reproduce timing bugs with stress and determinism
Race conditions often disappear under a debugger. Use:
- Artificial latency (Chrome DevTools throttling, toxiproxy, tc).
- High-concurrency load (k6, autocannon).
- Deterministic seeds for randomization.
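For the deterministic-seed point, a minimal sketch using the well-known mulberry32 PRNG could look like the following; the REPRO_SEED variable name is an assumption.

```ts
// Minimal sketch: a seedable PRNG (mulberry32) so a "random" code path can be
// replayed deterministically in a repro build. Not a cryptographic RNG.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// The same seed reproduces the same sequence of "random" choices.
const rand = mulberry32(Number(process.env.REPRO_SEED ?? 42));
const jitterMs = Math.floor(rand() * 100); // e.g. deterministic retry jitter
```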
Example with autocannon:
```bash
npx autocannon -c 50 -d 30 "https://api.example.com/search?q=test"
```
D. Production-like data without violating privacy
- Use anonymized snapshots.
- Use synthetic data generators.
- Re-run only the failing query pattern rather than exporting raw rows.
6) Debugging frontend production issues
Common class: “works on my machine” UI bugs
These often stem from:
- Browser differences (Safari quirks, old Chromium).
- Hydration mismatches (SSR/React).
- Caching/service worker issues.
- Locale and timezone parsing.
Debugging checklist
- Check browser + OS distribution in your error tracker.
- Inspect breadcrumbs: navigation, click events, XHR/fetch.
- Confirm release version and whether the user’s assets updated.
- Look for service worker cache staleness.
Service worker cache gotchas
A buggy service worker can keep users stuck on old bundles.
Mitigations:
- Version your cache keys.
- Implement skipWaiting plus a prompt asking users to refresh (carefully).
- Add observability: log the service worker version and activation.
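A minimal sketch of the last two points, assuming a hand-rolled service worker where SW_VERSION is injected at build time; the telemetry endpoint path is also an assumption.

```ts
// Service worker sketch: version the cache key, clean up old caches on
// activation, and report which version activated.
declare const SW_VERSION: string; // injected at build time (e.g. a define/replace plugin)

const CACHE_NAME = `app-cache-${SW_VERSION}`;

self.addEventListener('activate', (event: any) => {
  event.waitUntil(
    (async () => {
      // drop caches that belong to older service worker versions
      const keys = await caches.keys();
      await Promise.all(keys.filter((k) => k !== CACHE_NAME).map((k) => caches.delete(k)));

      // report the activated version (endpoint path is illustrative)
      await fetch('/telemetry/sw-activated', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ version: SW_VERSION }),
      });
    })()
  );
});
```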
Capturing network failures and server responses
Wrap fetch to add correlation IDs and log failures.
```ts
export async function apiFetch(input: RequestInfo, init: RequestInit = {}) {
  const requestId = crypto.randomUUID();
  const headers = new Headers(init.headers);
  headers.set('x-request-id', requestId);

  const res = await fetch(input, { ...init, headers });

  if (!res.ok) {
    const body = await res.clone().text().catch(() => '');
    // Send to error tracker with minimal safe context
    throw new Error(
      `HTTP ${res.status} for ${String(input)} requestId=${requestId} body=${body.slice(0, 200)}`
    );
  }
  return res;
}
```
Be careful: don’t attach full bodies if they can include secrets.
Debugging sourcemap “mismatch”
If your tracker shows wrong file/line:
- Ensure the release matches the uploaded maps.
- Ensure your bundler isn’t rewriting paths unexpectedly.
- Ensure your CDN isn’t serving old assets for a new release.
A practical safeguard is to include the commit SHA in the asset filename and disable aggressive caching for the HTML entry point.
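A minimal sketch of that safeguard, assuming Vite and a COMMIT_SHA variable provided by CI:

```ts
// vite.config.ts sketch: embed the commit SHA in asset filenames so a stale CDN
// entry can never be mistaken for the new release. COMMIT_SHA is an assumed CI
// environment variable.
import { defineConfig } from 'vite';

const sha = (process.env.COMMIT_SHA ?? 'dev').slice(0, 8);

export default defineConfig({
  build: {
    sourcemap: true,
    rollupOptions: {
      output: {
        entryFileNames: `assets/[name].${sha}.[hash].js`,
        chunkFileNames: `assets/[name].${sha}.[hash].js`,
        assetFileNames: `assets/[name].${sha}.[hash][extname]`,
      },
    },
  },
});
```

Pair this with serving the HTML entry point using Cache-Control: no-cache (or a very short max-age) so browsers pick up the new filenames promptly.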
7) Debugging Node.js production issues
Memory leaks and OOMs
Symptoms:
- RSS steadily increases.
- GC pauses increase.
- Process killed by the platform.
Quick triage
- Log memory periodically:
```js
setInterval(() => {
  const m = process.memoryUsage();
  logger.info({ rss: m.rss, heapUsed: m.heapUsed, heapTotal: m.heapTotal }, 'mem');
}, 30000);
```
- Enable heap snapshots on demand (advanced): run Node with the inspector or use heapdump in controlled environments.
Common leak sources
- Unbounded caches (Map without eviction).
- Retained closures (event listeners never removed).
- Large arrays accumulating per request.
- Logging buffers or in-memory batching.
Event loop lag and “it’s slow but CPU isn’t high”
Use event loop delay monitoring:
```js
import { monitorEventLoopDelay } from 'node:perf_hooks';

const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();

setInterval(() => {
  logger.info({ p99: h.percentile(99), mean: h.mean }, 'event_loop_delay');
  h.reset();
}, 10000);
```
If p99 lag spikes:
- Look for synchronous CPU-heavy work (JSON stringify of huge objects, crypto, compression).
- Offload CPU tasks to worker threads or separate services.
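A minimal sketch of offloading with node:worker_threads follows; the file names and message shape are assumptions, and it handles one in-flight task at a time for simplicity.

```ts
// worker.ts -- runs in the worker thread and does the CPU-heavy part
import { parentPort } from 'node:worker_threads';
import { gzipSync } from 'node:zlib';

parentPort!.on('message', (payload: string) => {
  const compressed = gzipSync(Buffer.from(payload)); // expensive, synchronous work
  parentPort!.postMessage(compressed.length);
});
```

```ts
// main.ts -- dispatches work to the worker instead of blocking the event loop
import { Worker } from 'node:worker_threads';

const worker = new Worker(new URL('./worker.js', import.meta.url));

// Simplification: assumes one task in flight at a time.
export function compressInWorker(payload: string): Promise<number> {
  return new Promise((resolve, reject) => {
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.postMessage(payload);
  });
}
```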
Debugging production-only crashes in Node
If you have native modules or segmentation faults:
- Ensure consistent libc and Node versions.
- Consider running with --report-on-fatalerror and capturing Node diagnostic reports.
```bash
NODE_OPTIONS="--report-on-fatalerror --report-directory=/tmp/node-reports" node server.js
```
Node reports can include stack traces and heap stats.
8) Serverless and edge runtimes: special constraints
Serverless (AWS Lambda, Cloudflare Workers, Vercel functions) changes debugging:
- Short-lived environments.
- Cold starts.
- Limited filesystem access.
Best practices:
- Always log a correlation ID.
- Prefer structured logs; scraping raw text is painful.
- Use vendor-native tracing (AWS X-Ray) or OpenTelemetry where supported.
- Be mindful of sampling and cost.
For Lambda, ensure you capture and log errors with enough context but no PII.
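A minimal sketch of a handler wrapper with that goal, assuming API Gateway events and the types from @types/aws-lambda; the logged field names are illustrative.

```ts
// Log a correlation ID and safe error context, then rethrow so Lambda still
// records the invocation as failed.
import type { APIGatewayProxyEvent, APIGatewayProxyResult, Context } from 'aws-lambda';

type Handler = (event: APIGatewayProxyEvent, ctx: Context) => Promise<APIGatewayProxyResult>;

export function withErrorLogging(handler: Handler): Handler {
  return async (event, ctx) => {
    const requestId = event.headers?.['x-request-id'] ?? ctx.awsRequestId;
    try {
      return await handler(event, ctx);
    } catch (err) {
      // structured, PII-free context: IDs and route info only, never raw bodies
      console.error(JSON.stringify({
        level: 'error',
        msg: 'handler_failed',
        requestId,
        awsRequestId: ctx.awsRequestId,
        path: event.path,
        method: event.httpMethod,
        errName: (err as Error).name,
        errMessage: (err as Error).message,
      }));
      throw err; // let Lambda mark the invocation as failed
    }
  };
}
```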
9) Debugging with feature flags and safe mitigations
Feature flags as a debugging tool
Feature flags aren’t just for product experiments—they’re operational controls:
- Turn off a broken code path.
- Reduce concurrency.
- Switch to a fallback provider.
Best practices:
- Flags should be fast to evaluate and safe by default.
- Keep a runbook: which flag mitigates which failure.
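A minimal sketch of an operational kill switch built on these principles; the flag name, flag-service endpoint, and refresh interval are assumptions.

```ts
// Kill-switch sketch: evaluation is a cheap in-memory read, refreshed in the
// background, and the flag defaults to the safe path if the flag service is down.
const flags = new Map<string, boolean>([['use-new-billing-path', false]]); // safe defaults

async function refreshFlags(): Promise<void> {
  try {
    const res = await fetch('https://flags.internal/api/flags'); // hypothetical endpoint
    const latest: Record<string, boolean> = await res.json();
    for (const [name, value] of Object.entries(latest)) flags.set(name, value);
  } catch {
    // keep last known values; never fail the request path because of the flag service
  }
}
setInterval(refreshFlags, 10_000);

export function isEnabled(name: string): boolean {
  return flags.get(name) ?? false; // unknown flags are off by default
}

// In the code path under suspicion:
if (isEnabled('use-new-billing-path')) {
  // new behavior
} else {
  // known-good fallback that the runbook points to
}
```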
Canary releases and progressive delivery
To reduce blast radius:
- Deploy to 1% of traffic.
- Watch key SLO metrics and error budgets.
- Promote gradually.
This turns production debugging into controlled experimentation: isolate the release that introduced the issue.
10) A systematic workflow: from alert to fix
Step 1: Confirm and scope
- Is this real or noise?
- Which endpoints/pages are affected?
- Which users/regions/browsers?
Step 2: Correlate across telemetry
- Logs: find representative requestIds.
- Traces: identify the slow span or failing dependency.
- Metrics: confirm spike timing and saturation.
Step 3: Hypothesize minimal causes
Prefer a small number of strong hypotheses over many weak ones. Examples:
- “A new deploy changed serialization; now the downstream rejects payloads.”
- “A cache key changed; cache misses are hammering the DB.”
Step 4: Reproduce or simulate
- Re-run failing requests in a staging env with production-like inputs.
- Add controlled latency and concurrency.
Step 5: Mitigate first, then fix
Mitigation options:
- Rollback.
- Disable feature flag.
- Rate limit.
- Circuit breaker to a dependency.
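A minimal circuit breaker sketch for the last option; the thresholds, the fetchRecommendations client, and the empty-list fallback are assumptions.

```ts
// After too many consecutive failures, fail fast for a cool-down period instead
// of piling slow calls onto a struggling dependency.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    const open =
      this.failures >= this.maxFailures && Date.now() - this.openedAt < this.cooldownMs;
    if (open) return fallback(); // fail fast while the circuit is open

    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openedAt = Date.now();
        return fallback();
      }
      throw err;
    }
  }
}

// Usage sketch: wrap the flaky dependency and degrade gracefully when open.
const recommendationsBreaker = new CircuitBreaker();

export async function getRecommendations(userId: string): Promise<string[]> {
  return recommendationsBreaker.call(
    () => fetchRecommendations(userId), // hypothetical dependency client
    () => []                            // degraded fallback: empty list
  );
}

// Hypothetical downstream client, stubbed so the sketch is self-contained.
async function fetchRecommendations(userId: string): Promise<string[]> {
  const res = await fetch(`https://recs.internal/users/${userId}`);
  if (!res.ok) throw new Error(`recs failed: ${res.status}`);
  return res.json();
}
```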
Step 6: Patch with tests and guards
- Add regression tests.
- Add runtime assertions where appropriate.
- Add telemetry to detect recurrence.
Step 7: Post-incident improvement
- Update runbooks.
- Add dashboards.
- Add alerts based on symptoms you missed.
11) Practical examples of production bugs and how to debug them
Example 1: “Cannot read properties of undefined” only in production
Symptom: Client error rate spikes after a deploy.
Likely causes:
- A backend response field became optional.
- A code-splitting boundary changed; a module loads later.
- A feature flag enabled a new path that lacks checks.
Debug steps:
- Use error tracker to identify the component and release.
- Inspect the event’s breadcrumbs for the last API call.
- Find server logs for that requestId to see the response shape.
Fix pattern: validate API response, add defensive code.
```ts
type Invoice = { id: string; total?: number };

function formatTotal(inv: Invoice) {
  // guard against a missing total
  const total = inv.total ?? 0;
  return total.toFixed(2);
}
```
Also add a contract test between frontend and backend (OpenAPI or Pact).
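A complementary runtime guard can catch shape drift before it reaches a component; this sketch assumes zod is in your stack, and the endpoint path is illustrative.

```ts
// Validate the payload against a schema and surface a contract violation
// explicitly instead of crashing deep inside rendering code.
import { z } from 'zod';

const InvoiceSchema = z.object({
  id: z.string(),
  total: z.number().optional(),
});

export async function fetchInvoice(id: string) {
  const res = await fetch(`/api/invoices/${id}`);
  const parsed = InvoiceSchema.safeParse(await res.json());
  if (!parsed.success) {
    // report the mismatch (replace console.error with your error tracker call)
    console.error('invoice_schema_mismatch', parsed.error.issues);
    throw new Error('Invoice response did not match the expected shape');
  }
  return parsed.data; // typed as { id: string; total?: number }
}
```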
Example 2: Elevated p95 latency after enabling compression
Symptom: CPU rises and p95 latency doubles.
Root cause: synchronous compression on large responses.
Debug steps:
- Compare traces before/after; look for time spent in response writing.
- Sample a CPU profile (in a controlled replica).
Fix:
- Compress only above a threshold.
- Avoid compressing already-compressed content.
- Consider CDN compression.
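A minimal sketch of the first two fixes with the express compression middleware; the threshold and the content-type list are assumptions to tune for your traffic.

```ts
// Only compress responses above a size threshold and skip content types that
// are typically already compressed (the default filter also checks compressibility).
import express from 'express';
import compression from 'compression';

const app = express();

const ALREADY_COMPRESSED = /image\/|video\/|application\/zip|application\/gzip/;

app.use(compression({
  threshold: 1024, // skip small responses; compressing them costs more than it saves
  filter: (req, res) => {
    const type = String(res.getHeader('Content-Type') ?? '');
    if (ALREADY_COMPRESSED.test(type)) return false; // don't recompress media/archives
    return compression.filter(req, res); // fall back to the default heuristics
  },
}));
```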
Example 3: Memory leak due to unbounded Map
Symptom: RSS climbs until OOM.
Root cause: caching per-user data without TTL.
Fix: use an LRU cache.
```js
import LRU from 'lru-cache';

const userCache = new LRU({
  max: 10_000,
  ttl: 1000 * 60 * 10, // 10 minutes
});
```
12) Debugging techniques you should know
Log sampling and dynamic log levels
When an incident occurs, you may need more detail temporarily.
- Support runtime log level changes (if your platform allows).
- Sample verbose logs only for failing requests.
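A minimal sketch of both ideas with pino and Express; the route path, the x-admin-token check, and the requireAdmin guard are assumptions.

```ts
import express from 'express';
import pino from 'pino';

const app = express();
const logger = pino({ level: process.env.LOG_LEVEL ?? 'info' });

// Hypothetical auth guard -- replace with your real admin authentication.
const requireAdmin: express.RequestHandler = (req, res, next) => {
  if (req.headers['x-admin-token'] !== process.env.ADMIN_TOKEN) {
    res.status(403).end();
    return;
  }
  next();
};

// 1) Runtime log level changes: pino lets you set the level on a live logger.
app.post('/internal/log-level/:level', requireAdmin, (req, res) => {
  logger.level = req.params.level; // e.g. 'debug' during an incident
  res.json({ level: logger.level });
});

// 2) Verbose detail only for failing requests: collect cheap context up front,
//    emit it at error level only when the request actually fails.
app.use((req, res, next) => {
  const detail = { url: req.originalUrl, query: req.query };
  res.on('finish', () => {
    if (res.statusCode >= 500) {
      logger.error({ statusCode: res.statusCode, detail }, 'request_failed_with_detail');
    }
  });
  next();
});
```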
“Debug endpoints” and why they’re risky
Exposing /debug endpoints in production can help, but they are security liabilities.
If you must:
- Require strong authentication.
- Restrict by IP.
- Avoid exposing secrets.
- Prefer internal admin networks.
Assertions and invariant checks
For critical invariants, fail fast with clear errors:
```js
function invariant(cond, msg) {
  if (!cond) throw new Error(`Invariant failed: ${msg}`);
}

invariant(typeof payload.userId === 'string', 'payload.userId must be string');
```
In production, ensure these errors are tracked and actionable.
13) Best practices that prevent future debugging pain
Build for debuggability
- Consistent correlation IDs across all services.
- Source maps uploaded and release-tagged.
- Structured logs and domain event naming.
- Dashboards aligned with user journeys (signup, checkout, search).
Guardrails in CI/CD
- Automated rollback hooks.
- Canary and progressive rollout.
- Contract tests for APIs.
- Bundle size checks and source map checks.
Security and privacy
- Redact PII by default.
- Restrict access to logs and replays.
- Define data retention and deletion policies.
14) A minimal “production debugging kit” checklist
If you’re building from scratch, aim for:
- Error tracking with releases + sourcemaps.
- Structured logs with requestId/traceId.
- Metrics for latency, errors, saturation.
- Tracing at least for critical request paths.
- Feature flags for rapid mitigation.
- Runbooks and dashboards for common incidents.
Closing thoughts
Production debugging is less about clever breakpoints and more about disciplined observability, safe experimentation, and fast mitigation. Invest in source maps, structured telemetry, and correlation IDs, and you’ll turn “mysterious production bug” into a tractable engineering problem with a repeatable playbook.
