Why production debugging is different
Debugging in development happens in a controlled environment: unminified code, fast reloads, stable reproductions, and a single user (you). Production is the opposite:
- Code is optimized (minified, tree-shaken, code-split), so stack traces can be opaque.
- Issues are intermittent (race conditions, device-specific, network-specific).
- You have limited visibility (you can’t attach a debugger to a user’s browser or serverless runtime in most cases).
- You must prioritize safety (avoid logging PII, avoid adding heavy instrumentation, avoid changes that risk downtime).
This article provides a structured approach to production debugging for JavaScript applications—frontend (browser), Node.js backends, and serverless—covering instrumentation, source maps, observability tools, reliable reproduction strategies, and best practices that scale from junior dev to senior engineer.
1) Start with a decision tree: crash, error rate spike, or “weird behavior”
Before touching code, classify the incident. Your approach depends on what you’re seeing:
A. Hard crash / unhandled exception
Typical signals:
- Browser: window.onerror and unhandledrejection events, or a Sentry “unhandled exception” issue.
- Node: process exits, container restarts, UnhandledPromiseRejectionWarning (older Node), fatal OOM.
First actions:
- Identify the top stack trace and the release version.
- Verify source map availability for that release.
- Confirm if it’s a new regression (compare to previous deploy).
B. Error rate spike (non-fatal)
Examples:
- API requests failing with 500/502.
- Client-side requests failing due to CORS, auth expiry, network errors.
First actions:
- Correlate by time, region, browser, device, endpoint.
- Check dependencies (DB, cache, third-party API) and recent configuration changes.
- Look for saturation signals: p95 latency, CPU, memory, connection pools.
C. Weird behavior (no obvious error)
Examples:
- UI stuck in loading state.
- Data inconsistencies.
- Payment flows failing only for some users.
First actions:
- Add or inspect structured logs and business-level events.
- Use session replay (if available) or synthetic reproduction.
- Validate feature flag states and rollout percentages.
2) Instrumentation beats guessing: logs, metrics, traces
In production, “debugging” often means asking the system questions through telemetry. The core pillars are:
- Logs: discrete events with context.
- Metrics: aggregated measurements over time.
- Traces: end-to-end request flows across services.
Structured logging (do this, not printf debugging)
Use JSON logs with consistent fields. In Node.js, pino is fast and widely used.
```js
// logger.js
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: 'billing-api',
  },
  redact: {
    paths: ['req.headers.authorization', 'user.ssn', 'card.number'],
    remove: true,
  },
});
```
In an Express handler:
```js
app.get('/invoice/:id', async (req, res) => {
  const requestId = req.headers['x-request-id'] ?? crypto.randomUUID();
  res.setHeader('x-request-id', requestId);
  req.log = logger.child({ requestId, route: 'GET /invoice/:id' });

  try {
    req.log.info({ invoiceId: req.params.id }, 'Fetching invoice');
    const invoice = await invoices.get(req.params.id);
    res.json(invoice);
  } catch (err) {
    req.log.error({ err, invoiceId: req.params.id }, 'Failed to fetch invoice');
    res.status(500).json({ error: 'internal_error', requestId });
  }
});
```
Best practices:
- Always include requestId / traceId, userId (if safe), route, and relevant domain IDs.
- Prefer event-style messages (“PaymentAuthorizationFailed”) over vague text.
- Redact secrets and PII at the logger level.
Metrics: detect trends and regressions quickly
If you’re using Prometheus/OpenTelemetry, capture:
- Request counts, error counts, latency histograms.
- Queue depth, DB connection pool usage.
- Node.js event loop lag and memory.
Example (using prom-client):
```js
import client from 'prom-client';

const httpDuration = new client.Histogram({
  name: 'http_server_duration_ms',
  help: 'HTTP request duration in ms',
  labelNames: ['route', 'method', 'status'],
  buckets: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
});

app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    httpDuration
      .labels(req.route?.path ?? 'unknown', req.method, String(res.statusCode))
      .observe(performance.now() - start);
  });
  next();
});
```
Traces: find where time and failures occur
Distributed tracing (OpenTelemetry + Jaeger/Tempo/Datadog) helps answer:
- Which downstream call is slow?
- Did we retry?
- Where did the error originate?
Even if you can’t fully adopt tracing everywhere, start by propagating a traceparent header and logging trace IDs.
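If you roll this by hand before adopting an OpenTelemetry SDK, a minimal sketch might look like the following; the handleRequest wrapper, the incomingHeaders shape, and the downstream URL are placeholders, and the logger is the pino logger from the earlier example.

```ts
// Minimal sketch (not a full OpenTelemetry setup): reuse an incoming W3C
// traceparent header or mint one, log the trace ID, and forward the header on
// outgoing calls. Format: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
import { randomBytes } from 'node:crypto';
import { logger } from './logger.js';

export function getOrCreateTraceparent(incoming?: string) {
  const m = incoming?.match(/^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/);
  if (m) return { traceparent: incoming as string, traceId: m[1] };

  const traceId = randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex');
  return { traceparent: `00-${traceId}-${spanId}-01`, traceId };
}

// Usage sketch inside a request handler: log the trace ID and forward the header
// so the downstream service can log the same one.
export async function handleRequest(incomingHeaders: Record<string, string | undefined>) {
  const { traceparent, traceId } = getOrCreateTraceparent(incomingHeaders['traceparent']);
  logger.info({ traceId }, 'handling request');
  await fetch('https://downstream.internal/items', { headers: { traceparent } });
}
```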
3) Source maps: the difference between “minified noise” and actionable stacks
Frontend source maps (Webpack/Vite/Rollup)
Production errors often arrive as:
```
TypeError: Cannot read properties of undefined (reading 'foo')
  at e (app.3f2c1a.js:1:18233)
```
With source maps uploaded to an error tracker, you get:
```
TypeError: Cannot read properties of undefined (reading 'foo')
  at renderInvoiceRow (src/components/InvoiceTable.tsx:87:13)
```
Key practices:
- Generate hidden-source-map (Webpack) or the equivalent to avoid exposing maps publicly.
- Upload maps to Sentry/Datadog/New Relic and do not serve them from your CDN.
- Include release identifiers (commit SHA, version).
Webpack example:
```js
// webpack.config.js
module.exports = {
  devtool: 'hidden-source-map',
};
```
Vite example:
```ts
// vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  build: {
    sourcemap: true, // pair with a private upload to your error tracker
  },
});
```
Node.js source maps
If you ship TypeScript or bundled server code, enable source maps in Node:
- Node 12+ supports source maps with --enable-source-maps.
Example:
```bash
node --enable-source-maps dist/server.js
```
On many platforms you can set NODE_OPTIONS=--enable-source-maps instead.
4) Error tracking: pick a system and configure it properly
Tool comparison (practical view)
- Sentry: excellent stacktrace mapping, breadcrumbs, releases, performance monitoring, session replay (optional). Strong ecosystem.
- Datadog: strong all-in-one platform (APM + logs + metrics + RUM). Great correlation across telemetry.
- New Relic: mature APM and browser monitoring.
- Rollbar / Bugsnag: solid error tracking; varying depth for performance tooling.
What matters most is not the brand—it’s whether you:
- Upload source maps.
- Tag releases.
- Capture user context safely.
- Capture breadcrumbs and network spans.
Frontend: capturing exceptions and context
Example with Sentry (browser):
```ts
import * as Sentry from '@sentry/react';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  release: process.env.APP_VERSION,
  environment: process.env.NODE_ENV,
  integrations: [Sentry.browserTracingIntegration()],
  tracesSampleRate: 0.1,
  beforeSend(event) {
    // remove sensitive fields if any were added accidentally
    return event;
  }
});

export function setUserContext(user: { id: string; plan: string }) {
  Sentry.setUser({ id: user.id });
  Sentry.setTag('plan', user.plan);
}
```
Backend: ensure errors are captured before process exit
In Node, fatal errors can kill the process before flushing telemetry. Ensure graceful shutdown and flush.
```js
process.on('uncaughtException', async (err) => {
  logger.fatal({ err }, 'uncaughtException');
  // flush telemetry here if needed
  process.exit(1);
});

process.on('unhandledRejection', async (reason) => {
  logger.fatal({ reason }, 'unhandledRejection');
  process.exit(1);
});
```
In many cases you should not keep running after an uncaught exception—crash fast and rely on orchestration.
5) Reproduction strategies that work in production reality
A. Reproduce with the same inputs
- Capture the request payload (redacted), headers, and feature flag states.
- Record the specific user journey or API sequence.
- Store a “debug bundle” for failing requests: requestId + trace + key IDs.
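What goes into such a bundle depends on your domain; a minimal sketch, with illustrative field names and a naive redaction helper, might look like this:

```ts
// Hypothetical "debug bundle": the minimum context needed to replay a failing
// request later. Field names are illustrative; adapt the sensitive-key list.
const SENSITIVE_KEYS = new Set(['authorization', 'password', 'ssn', 'cardNumber']);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) =>
        SENSITIVE_KEYS.has(k) ? [k, '[REDACTED]'] : [k, redact(v)]
      )
    );
  }
  return value;
}

interface DebugBundle {
  requestId: string;
  traceId?: string;
  route: string;
  featureFlags: Record<string, boolean>;
  payload: unknown; // request body after redaction
  capturedAt: string;
}

export function buildDebugBundle(
  req: { id: string; traceId?: string; route: string; body: unknown },
  flags: Record<string, boolean>
): DebugBundle {
  return {
    requestId: req.id,
    traceId: req.traceId,
    route: req.route,
    featureFlags: flags,
    payload: redact(req.body),
    capturedAt: new Date().toISOString(),
  };
}
```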
B. Reproduce with the same environment
Common mismatches:
- Different Node version.
- Different timezone/locale.
- Different CDN caching behavior.
- Different third-party API behavior.
Use containerized repro:
```bash
docker run --rm -it node:20 bash
# inside the container:
node -e "console.log(Intl.DateTimeFormat().resolvedOptions())"
```
C. Reproduce timing bugs with stress and determinism
Race conditions often disappear under a debugger. Use:
- Artificial latency (Chrome DevTools throttling, toxiproxy, tc).
- High-concurrency load (k6, autocannon).
- Deterministic seeds for randomization.
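For the deterministic-seed point, a minimal sketch using the well-known mulberry32 PRNG could look like the following; the REPRO_SEED variable name is an assumption.

```ts
// Minimal sketch: a seedable PRNG (mulberry32) so a "random" code path can be
// replayed deterministically in a repro build. Not a cryptographic RNG.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// The same seed reproduces the same sequence of "random" choices.
const rand = mulberry32(Number(process.env.REPRO_SEED ?? 42));
const jitterMs = Math.floor(rand() * 100); // e.g. deterministic retry jitter
```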
Example with autocannon:
```bash
npx autocannon -c 50 -d 30 "https://api.example.com/search?q=test"
```
D. Production-like data without violating privacy
- Use anonymized snapshots.
- Use synthetic data generators.
- Re-run only the failing query pattern rather than exporting raw rows.
6) Debugging frontend production issues
Common class: “works on my machine” UI bugs
These often stem from:
- Browser differences (Safari quirks, old Chromium).
- Hydration mismatches (SSR/React).
- Caching/service worker issues.
- Locale and timezone parsing.
Debugging checklist
- Check browser + OS distribution in your error tracker.
- Inspect breadcrumbs: navigation, click events, XHR/fetch.
- Confirm release version and whether the user’s assets updated.
- Look for service worker cache staleness.
Service worker cache gotchas
A buggy service worker can keep users stuck on old bundles.
Mitigations:
- Version your cache keys.
- Implement skipWaiting plus a prompt asking users to refresh (carefully).
- Add observability: log the service worker version and activation.
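A minimal sketch of the last two points, assuming a hand-rolled service worker where SW_VERSION is injected at build time; the telemetry endpoint path is also an assumption.

```ts
// Service worker sketch: version the cache key, clean up old caches on
// activation, and report which version activated.
declare const SW_VERSION: string; // injected at build time (e.g. a define/replace plugin)

const CACHE_NAME = `app-cache-${SW_VERSION}`;

self.addEventListener('activate', (event: any) => {
  event.waitUntil(
    (async () => {
      // drop caches that belong to older service worker versions
      const keys = await caches.keys();
      await Promise.all(keys.filter((k) => k !== CACHE_NAME).map((k) => caches.delete(k)));

      // report the activated version (endpoint path is illustrative)
      await fetch('/telemetry/sw-activated', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ version: SW_VERSION }),
      });
    })()
  );
});
```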
Capturing network failures and server responses
Wrap fetch to add correlation IDs and log failures.
```ts
export async function apiFetch(input: RequestInfo, init: RequestInit = {}) {
  const requestId = crypto.randomUUID();
  const headers = new Headers(init.headers);
  headers.set('x-request-id', requestId);

  const res = await fetch(input, { ...init, headers });

  if (!res.ok) {
    const body = await res.clone().text().catch(() => '');
    // Send to error tracker with minimal safe context
    throw new Error(
      `HTTP ${res.status} for ${String(input)} requestId=${requestId} body=${body.slice(0, 200)}`
    );
  }
  return res;
}
```
Be careful: don’t attach full bodies if they can include secrets.
Debugging sourcemap “mismatch”
If your tracker shows wrong file/line:
- Ensure the release matches the uploaded maps.
- Ensure your bundler isn’t rewriting paths unexpectedly.
- Ensure your CDN isn’t serving old assets for a new release.
A practical safeguard is to include the commit SHA in the asset filename and disable aggressive caching for the HTML entry point.
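A minimal sketch of that safeguard, assuming Vite and a COMMIT_SHA variable provided by CI:

```ts
// vite.config.ts sketch: embed the commit SHA in asset filenames so a stale CDN
// entry can never be mistaken for the new release. COMMIT_SHA is an assumed CI
// environment variable.
import { defineConfig } from 'vite';

const sha = (process.env.COMMIT_SHA ?? 'dev').slice(0, 8);

export default defineConfig({
  build: {
    sourcemap: true,
    rollupOptions: {
      output: {
        entryFileNames: `assets/[name].${sha}.[hash].js`,
        chunkFileNames: `assets/[name].${sha}.[hash].js`,
        assetFileNames: `assets/[name].${sha}.[hash][extname]`,
      },
    },
  },
});
```

Pair this with serving the HTML entry point using Cache-Control: no-cache (or a very short max-age) so browsers pick up the new filenames promptly.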
7) Debugging Node.js production issues
Memory leaks and OOMs
Symptoms:
- RSS steadily increases.
- GC pauses increase.
- Process killed by the platform.
Quick triage
- Log memory periodically:
```js
setInterval(() => {
  const m = process.memoryUsage();
  logger.info({ rss: m.rss, heapUsed: m.heapUsed, heapTotal: m.heapTotal }, 'mem');
}, 30000);
```
- Enable heap snapshots on demand (advanced): run Node with the inspector or use heapdump in controlled environments.
Common leak sources
- Unbounded caches (Map without eviction).
- Retained closures (event listeners never removed).
- Large arrays accumulating per request.
- Logging buffers or in-memory batching.
Event loop lag and “it’s slow but CPU isn’t high”
Use event loop delay monitoring:
```js
import { monitorEventLoopDelay } from 'node:perf_hooks';

const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();

setInterval(() => {
  logger.info({ p99: h.percentile(99), mean: h.mean }, 'event_loop_delay');
  h.reset();
}, 10000);
```
If p99 lag spikes:
- Look for synchronous CPU-heavy work (JSON stringify of huge objects, crypto, compression).
- Offload CPU tasks to worker threads or separate services.
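A minimal sketch of offloading with node:worker_threads follows; the file names and message shape are assumptions, and it handles one in-flight task at a time for simplicity.

```ts
// worker.ts -- runs in the worker thread and does the CPU-heavy part
import { parentPort } from 'node:worker_threads';
import { gzipSync } from 'node:zlib';

parentPort!.on('message', (payload: string) => {
  const compressed = gzipSync(Buffer.from(payload)); // expensive, synchronous work
  parentPort!.postMessage(compressed.length);
});
```

```ts
// main.ts -- dispatches work to the worker instead of blocking the event loop
import { Worker } from 'node:worker_threads';

const worker = new Worker(new URL('./worker.js', import.meta.url));

// Simplification: assumes one task in flight at a time.
export function compressInWorker(payload: string): Promise<number> {
  return new Promise((resolve, reject) => {
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.postMessage(payload);
  });
}
```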
Debugging production-only crashes in Node
If you have native modules or segmentation faults:
- Ensure consistent libc and Node versions.
- Consider running with --report-on-fatalerror and capturing Node diagnostic reports.
```bash
NODE_OPTIONS="--report-on-fatalerror --report-directory=/tmp/node-reports" node server.js
```
Node reports can include stack traces and heap stats.
8) Serverless and edge runtimes: special constraints
Serverless (AWS Lambda, Cloudflare Workers, Vercel functions) changes debugging:
- Short-lived environments.
- Cold starts.
- Limited filesystem access.
Best practices:
- Always log a correlation ID.
- Prefer structured logs; scraping raw text is painful.
- Use vendor-native tracing (AWS X-Ray) or OpenTelemetry where supported.
- Be mindful of sampling and cost.
For Lambda, ensure you capture and log errors with enough context but no PII.
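A minimal sketch of a handler wrapper with that goal, assuming API Gateway events and the types from @types/aws-lambda; the logged field names are illustrative.

```ts
// Log a correlation ID and safe error context, then rethrow so Lambda still
// records the invocation as failed.
import type { APIGatewayProxyEvent, APIGatewayProxyResult, Context } from 'aws-lambda';

type Handler = (event: APIGatewayProxyEvent, ctx: Context) => Promise<APIGatewayProxyResult>;

export function withErrorLogging(handler: Handler): Handler {
  return async (event, ctx) => {
    const requestId = event.headers?.['x-request-id'] ?? ctx.awsRequestId;
    try {
      return await handler(event, ctx);
    } catch (err) {
      // structured, PII-free context: IDs and route info only, never raw bodies
      console.error(JSON.stringify({
        level: 'error',
        msg: 'handler_failed',
        requestId,
        awsRequestId: ctx.awsRequestId,
        path: event.path,
        method: event.httpMethod,
        errName: (err as Error).name,
        errMessage: (err as Error).message,
      }));
      throw err; // let Lambda mark the invocation as failed
    }
  };
}
```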
9) Debugging with feature flags and safe mitigations
Feature flags as a debugging tool
Feature flags aren’t just for product experiments—they’re operational controls:
- Turn off a broken code path.
- Reduce concurrency.
- Switch to a fallback provider.
Best practices:
- Flags should be fast to evaluate and safe by default.
- Keep a runbook: which flag mitigates which failure.
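A minimal sketch of an operational kill switch built on these principles; the flag name, flag-service endpoint, and refresh interval are assumptions.

```ts
// Kill-switch sketch: evaluation is a cheap in-memory read, refreshed in the
// background, and the flag defaults to the safe path if the flag service is down.
const flags = new Map<string, boolean>([['use-new-billing-path', false]]); // safe defaults

async function refreshFlags(): Promise<void> {
  try {
    const res = await fetch('https://flags.internal/api/flags'); // hypothetical endpoint
    const latest: Record<string, boolean> = await res.json();
    for (const [name, value] of Object.entries(latest)) flags.set(name, value);
  } catch {
    // keep last known values; never fail the request path because of the flag service
  }
}
setInterval(refreshFlags, 10_000);

export function isEnabled(name: string): boolean {
  return flags.get(name) ?? false; // unknown flags are off by default
}

// In the code path under suspicion:
if (isEnabled('use-new-billing-path')) {
  // new behavior
} else {
  // known-good fallback that the runbook points to
}
```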
Canary releases and progressive delivery
To reduce blast radius:
- Deploy to 1% of traffic.
- Watch key SLO metrics and error budgets.
- Promote gradually.
This turns production debugging into controlled experimentation: isolate the release that introduced the issue.
10) A systematic workflow: from alert to fix
Step 1: Confirm and scope
- Is this real or noise?
- Which endpoints/pages are affected?
- Which users/regions/browsers?
Step 2: Correlate across telemetry
- Logs: find representative requestIds.
- Traces: identify the slow span or failing dependency.
- Metrics: confirm spike timing and saturation.
Step 3: Hypothesize minimal causes
Prefer a small number of strong hypotheses over many weak ones. Examples:
- “A new deploy changed serialization; now the downstream rejects payloads.”
- “A cache key changed; cache misses are hammering the DB.”
Step 4: Reproduce or simulate
- Re-run failing requests in a staging env with production-like inputs.
- Add controlled latency and concurrency.
Step 5: Mitigate first, then fix
Mitigation options:
- Rollback.
- Disable feature flag.
- Rate limit.
- Circuit breaker to a dependency.
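A minimal circuit breaker sketch for the last option; the thresholds, the fetchRecommendations client, and the empty-list fallback are assumptions.

```ts
// After too many consecutive failures, fail fast for a cool-down period instead
// of piling slow calls onto a struggling dependency.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    const open =
      this.failures >= this.maxFailures && Date.now() - this.openedAt < this.cooldownMs;
    if (open) return fallback(); // fail fast while the circuit is open

    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openedAt = Date.now();
        return fallback();
      }
      throw err;
    }
  }
}

// Usage sketch: wrap the flaky dependency and degrade gracefully when open.
const recommendationsBreaker = new CircuitBreaker();

export async function getRecommendations(userId: string): Promise<string[]> {
  return recommendationsBreaker.call(
    () => fetchRecommendations(userId), // hypothetical dependency client
    () => []                            // degraded fallback: empty list
  );
}

// Hypothetical downstream client, stubbed so the sketch is self-contained.
async function fetchRecommendations(userId: string): Promise<string[]> {
  const res = await fetch(`https://recs.internal/users/${userId}`);
  if (!res.ok) throw new Error(`recs failed: ${res.status}`);
  return res.json();
}
```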
Step 6: Patch with tests and guards
- Add regression tests.
- Add runtime assertions where appropriate.
- Add telemetry to detect recurrence.
Step 7: Post-incident improvement
- Update runbooks.
- Add dashboards.
- Add alerts based on symptoms you missed.
11) Practical examples of production bugs and how to debug them
Example 1: “Cannot read properties of undefined” only in production
Symptom: Client error rate spikes after a deploy.
Likely causes:
- A backend response field became optional.
- A code-splitting boundary changed; a module loads later.
- A feature flag enabled a new path that lacks checks.
Debug steps:
- Use error tracker to identify the component and release.
- Inspect the event’s breadcrumbs for the last API call.
- Find server logs for that requestId to see the response shape.
Fix pattern: validate API response, add defensive code.
```ts
type Invoice = { id: string; total?: number };

function formatTotal(inv: Invoice) {
  // guard against a missing total
  const total = inv.total ?? 0;
  return total.toFixed(2);
}
```
Also add a contract test between frontend and backend (OpenAPI or Pact).
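A complementary runtime guard can catch shape drift before it reaches a component; this sketch assumes zod is in your stack, and the endpoint path is illustrative.

```ts
// Validate the payload against a schema and surface a contract violation
// explicitly instead of crashing deep inside rendering code.
import { z } from 'zod';

const InvoiceSchema = z.object({
  id: z.string(),
  total: z.number().optional(),
});

export async function fetchInvoice(id: string) {
  const res = await fetch(`/api/invoices/${id}`);
  const parsed = InvoiceSchema.safeParse(await res.json());
  if (!parsed.success) {
    // report the mismatch (replace console.error with your error tracker call)
    console.error('invoice_schema_mismatch', parsed.error.issues);
    throw new Error('Invoice response did not match the expected shape');
  }
  return parsed.data; // typed as { id: string; total?: number }
}
```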
Example 2: Elevated p95 latency after enabling compression
Symptom: CPU rises and p95 latency doubles.
Root cause: synchronous compression on large responses.
Debug steps:
- Compare traces before/after; look for time spent in response writing.
- Sample a CPU profile (in a controlled replica).
Fix:
- Compress only above a threshold.
- Avoid compressing already-compressed content.
- Consider CDN compression.
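A minimal sketch of the first two fixes with the express compression middleware; the threshold and the content-type list are assumptions to tune for your traffic.

```ts
// Only compress responses above a size threshold and skip content types that
// are typically already compressed (the default filter also checks compressibility).
import express from 'express';
import compression from 'compression';

const app = express();

const ALREADY_COMPRESSED = /image\/|video\/|application\/zip|application\/gzip/;

app.use(compression({
  threshold: 1024, // skip small responses; compressing them costs more than it saves
  filter: (req, res) => {
    const type = String(res.getHeader('Content-Type') ?? '');
    if (ALREADY_COMPRESSED.test(type)) return false; // don't recompress media/archives
    return compression.filter(req, res); // fall back to the default heuristics
  },
}));
```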
Example 3: Memory leak due to unbounded Map
Symptom: RSS climbs until OOM.
Root cause: caching per-user data without TTL.
Fix: use an LRU cache.
```js
import LRU from 'lru-cache';

const userCache = new LRU({
  max: 10_000,
  ttl: 1000 * 60 * 10, // 10 minutes
});
```
12) Debugging techniques you should know
Log sampling and dynamic log levels
When an incident occurs, you may need more detail temporarily.
- Support runtime log level changes (if your platform allows).
- Sample verbose logs only for failing requests.
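A minimal sketch of both ideas with pino and Express; the route path, the x-admin-token check, and the requireAdmin guard are assumptions.

```ts
import express from 'express';
import pino from 'pino';

const app = express();
const logger = pino({ level: process.env.LOG_LEVEL ?? 'info' });

// Hypothetical auth guard -- replace with your real admin authentication.
const requireAdmin: express.RequestHandler = (req, res, next) => {
  if (req.headers['x-admin-token'] !== process.env.ADMIN_TOKEN) {
    res.status(403).end();
    return;
  }
  next();
};

// 1) Runtime log level changes: pino lets you set the level on a live logger.
app.post('/internal/log-level/:level', requireAdmin, (req, res) => {
  logger.level = req.params.level; // e.g. 'debug' during an incident
  res.json({ level: logger.level });
});

// 2) Verbose detail only for failing requests: collect cheap context up front,
//    emit it at error level only when the request actually fails.
app.use((req, res, next) => {
  const detail = { url: req.originalUrl, query: req.query };
  res.on('finish', () => {
    if (res.statusCode >= 500) {
      logger.error({ statusCode: res.statusCode, detail }, 'request_failed_with_detail');
    }
  });
  next();
});
```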
“Debug endpoints” and why they’re risky
Exposing /debug endpoints in production can help, but they are security liabilities.
If you must:
- Require strong authentication.
- Restrict by IP.
- Avoid exposing secrets.
- Prefer internal admin networks.
Assertions and invariant checks
For critical invariants, fail fast with clear errors:
```js
function invariant(cond, msg) {
  if (!cond) throw new Error(`Invariant failed: ${msg}`);
}

invariant(typeof payload.userId === 'string', 'payload.userId must be string');
```
In production, ensure these errors are tracked and actionable.
13) Best practices that prevent future debugging pain
Build for debuggability
- Consistent correlation IDs across all services.
- Source maps uploaded and release-tagged.
- Structured logs and domain event naming.
- Dashboards aligned with user journeys (signup, checkout, search).
Guardrails in CI/CD
- Automated rollback hooks.
- Canary and progressive rollout.
- Contract tests for APIs.
- Bundle size checks and source map checks.
Security and privacy
- Redact PII by default.
- Restrict access to logs and replays.
- Define data retention and deletion policies.
14) A minimal “production debugging kit” checklist
If you’re building from scratch, aim for:
- Error tracking with releases + sourcemaps.
- Structured logs with requestId/traceId.
- Metrics for latency, errors, saturation.
- Tracing at least for critical request paths.
- Feature flags for rapid mitigation.
- Runbooks and dashboards for common incidents.
Closing thoughts
Production debugging is less about clever breakpoints and more about disciplined observability, safe experimentation, and fast mitigation. Invest in source maps, structured telemetry, and correlation IDs, and you’ll turn “mysterious production bug” into a tractable engineering problem with a repeatable playbook.
