Ship Private, Offline AI in 2025: On‑Device LLM Inference with WebGPU, WebNN, and WASM
By 2025, privacy-first, offline LLMs on end-user devices are no longer a science project—they’re a pragmatic product decision. Browsers ship WebGPU on desktop and mobile, WebNN is stabilizing across Chromium-based browsers, and WebAssembly is ubiquitous with SIMD and threads. With a careful toolchain (quantization, streaming loaders, workers, caching) you can deliver useful, fast, and private LLM capabilities that work offline, don’t exfiltrate data, and scale to millions of users without server bills.
This article is a hands-on blueprint: progressive capability detection, memory budgets and KV cache math, quantization choices (GGUF/INT4), GPU/CPU execution with WebGPU/WebNN/WASM, streaming model loaders, workers, caching strategies, fallbacks, and safety guardrails. You’ll leave with a concrete plan and production-ready patterns.
Why on-device LLMs in 2025
- Privacy and compliance: keep sensitive data local. For many orgs, private-by-default is now a requirement.
- Latency and reliability: no network round trips, instant token streaming, and offline availability.
- Cost and scaling: inference moves from your cloud to the user’s device.
- UX: snappy, resilient AI that works on a plane and in spotty network conditions.
The trade-offs are real: tighter memory budgets, thermals and battery, and heterogeneous capabilities. But with the tools available in 2025, you can ship a great experience on a wide range of devices.
Execution backends: WebGPU, WebNN, WASM
- WebGPU: The workhorse for browser LLMs. You get compute shaders, storage buffers, subgroup ops, timestamp queries, and robust adapter limits. Ideal for matmul-heavy decode kernels, KV cache management, and fast attention.
- WebNN: A high-level API mapping to native NN backends (DirectML, Core ML, NNAPI). When available and performant, it can beat hand-rolled kernels—especially on mobile SoCs with NPU/DSP. Best with ONNX graphs.
- WASM (SIMD + threads): Ubiquitous fallback with reasonable performance on CPU. Necessary for WebGPU-less environments or to offload tokenization and control logic. With cross-origin isolation, threads and SharedArrayBuffer enable parallelism.
A pragmatic stack:
- Prefer WebGPU for LLM decoding on desktop and high-end mobile.
- Use WebNN when it’s available and proven faster on a given device (especially mobile NPUs).
- Fall back to WASM CPU inference for smaller models or degrade to server inference if needed.
Capability detection and progressive enhancement
Feature-detect at runtime and build a plan: pick backend, model size, quantization, max context, and thread counts.
```ts
// capability.ts
export type Plan = {
  backend: 'webgpu' | 'webnn' | 'wasm' | 'server';
  model: 'phi3-mini-3.8b' | 'llama3.1-8b' | 'qwen2.5-7b' | 'tiny-1b';
  quant: 'q4_0' | 'q4_k_m' | 'q5_k_m' | 'q8_0';
  maxCtx: number; // tokens
  threads: number;
  useKVCacheQuant: boolean;
};

export async function planCapabilities(): Promise<Plan> {
  const hasWebGPU = !!navigator.gpu;
  const hasWebNN = !!(navigator as any).ml; // WebNN API (behind flags on some browsers)
  const cores = navigator.hardwareConcurrency || 4;
  const deviceMemoryGB = (navigator as any).deviceMemory || 4; // heuristic

  // Try to get GPU adapter info
  let adapter: GPUAdapter | null = null;
  try {
    adapter = hasWebGPU ? await navigator.gpu!.requestAdapter() : null;
  } catch {}
  const limits = adapter?.limits;

  // Rough VRAM/shared-memory budget heuristic
  const maxBufferSize = limits?.maxStorageBufferBindingSize || 128 << 20; // default 128 MB

  // Rules of thumb for a 2025 client fleet
  if (adapter) {
    // Desktop iGPU/dGPU
    if (maxBufferSize >= 512 << 20) {
      return {
        backend: 'webgpu',
        model: 'llama3.1-8b',
        quant: 'q4_k_m',
        maxCtx: 4096,
        threads: Math.min(cores, 8),
        useKVCacheQuant: true,
      };
    }
    // Mid-tier (iGPU, high-end mobile)
    return {
      backend: 'webgpu',
      model: 'phi3-mini-3.8b',
      quant: 'q4_0',
      maxCtx: 3072,
      threads: Math.min(cores, 6),
      useKVCacheQuant: true,
    };
  }
  if (hasWebNN && deviceMemoryGB >= 4) {
    return {
      backend: 'webnn',
      model: 'phi3-mini-3.8b',
      quant: 'q4_0',
      maxCtx: 3072,
      threads: Math.min(cores, 6),
      useKVCacheQuant: true,
    };
  }
  if (deviceMemoryGB >= 8) {
    return {
      backend: 'wasm',
      model: 'phi3-mini-3.8b',
      quant: 'q4_0',
      maxCtx: 2048,
      threads: Math.min(cores, 8),
      useKVCacheQuant: true,
    };
  }
  return {
    backend: 'server',
    model: 'tiny-1b',
    quant: 'q8_0',
    maxCtx: 1024,
    threads: Math.min(cores, 4),
    useKVCacheQuant: false,
  };
}
```
Notes:
- navigator.deviceMemory is a hint; use with caution.
- GPUAdapter.limits helps approximate workable tensor sizes and whether larger models will fit without heavy fragmentation.
- Always capture telemetry (anonymous, on-device only) to refine thresholds—don’t hard-code forever.
Quantization: GGUF, INT4, and browser-friendly formats
For browser inference in 2025, two families dominate:
- GGUF: the de facto weight format from llama.cpp with metadata, tensor layout, and quantization blocks. Excellent for streaming, versioning, and broad model availability.
- ONNX: graph format consumed by WebNN/ONNX Runtime Web. Use post-training quantization (INT4/INT8) or AWQ/GPTQ weights when supported.
Quantization choices:
- INT4 group-wise (q4_0, q4_K_M): best bang-for-byte; q4_K_M often yields better accuracy than naive q4_0.
- INT5/INT6 variants: a good middle ground when memory is tight but you need a little extra quality.
- INT8: safer accuracy, larger footprint; useful for small models or KV cache.
Practical advice:
- Prefer q4_K_M for general-purpose chat if available.
- Keep the KV cache at higher precision (Q8) when possible; if you must quantize KV, test carefully for degradation on long contexts.
- Use group-size-32 or 64 quantization for GPU-friendly blocks.
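To sanity-check download and memory footprints for a given quantization, a rough estimator helps. The bits-per-weight figures below are approximations (they include block scales and vary slightly by architecture and toolchain version), so treat this as a sketch, not a spec:

```ts
// Rough on-disk size estimator for common GGUF quantizations.
// Bits-per-weight values are approximate and include block scales/metadata.
const APPROX_BITS_PER_WEIGHT: Record<string, number> = {
  q4_0: 4.5,   // 18 bytes per 32-weight block
  q4_k_m: 4.85,
  q5_k_m: 5.7,
  q8_0: 8.5,   // 34 bytes per 32-weight block
};

function approxWeightBytes(paramsBillions: number, quant: keyof typeof APPROX_BITS_PER_WEIGHT): number {
  const bits = APPROX_BITS_PER_WEIGHT[quant];
  return (paramsBillions * 1e9 * bits) / 8;
}

console.log((approxWeightBytes(8, 'q4_k_m') / 2 ** 30).toFixed(1), 'GiB'); // ≈ 4.5 GiB for an 8B model
```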
Memory budgeting: parameters, KV cache, and reality checks
Total memory = weights + KV cache + activations + fragmentation overhead.
- Weights: 8B params at 4 bits is roughly 8B × 0.5 bytes = 4 GB, but GGUF includes per-tensor scales/metadata. Expect 4.5–5.5 GB for 8B q4_K_M. For 3–4B models, expect 2–3 GB.
- KV cache: grows with sequence length. For decoder-only LLMs:
- KV bytes ≈ layers × heads × seqLen × headDim × 2 × bytesPerElement
- Many implementations pack KV across heads; a useful approximation is: KV bytes ≈ 2 × hiddenDim × seqLen × layers × bytesPerElement × kvCompressionFactor
- For an 8B LLaMA-like model: hiddenDim ≈ 4096–5120, layers ≈ 32–40. Using FP16 (2 bytes) is large; Q8 KV halves it; more aggressive KV quant can hurt quality.
Example: LLaMA 3.1 8B, hiddenDim≈4096, layers=32, Q8 KV, seqLen=4096
- KV ≈ 2 × 4096 × 4096 × 32 × 1 byte ≈ 1.07 GB
- This treats every head as a full KV head; LLaMA 3.1 uses grouped-query attention (8 KV heads vs 32 query heads), so the real footprint is roughly 4× smaller. Treat ~1 GB as a conservative upper bound.
- Prefill will allocate transient activations; budget at least 20–30% headroom during prefill.
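To make these budgets concrete, here is a small helper that mirrors the approximation above. It assumes a LLaMA-style decoder where KV per token is layers × kvHeads × headDim, and the kvHeads parameter lets you account for grouped-query attention:

```ts
// Rough KV-cache budget helper, mirroring the approximation above.
interface KVConfig {
  layers: number;
  kvHeads: number;         // key/value heads (8 for LLaMA 3.1 8B with GQA; equal to heads for plain MHA)
  headDim: number;         // e.g., 128
  seqLen: number;          // max context in tokens
  bytesPerElement: number; // 2 for FP16, 1 for Q8 KV
}

function kvCacheBytes(c: KVConfig): number {
  // K and V each store layers × kvHeads × headDim values per token.
  return 2 * c.layers * c.kvHeads * c.headDim * c.seqLen * c.bytesPerElement;
}

// Full-MHA upper bound from the example above: 32 layers, 32 × 128 dims, Q8 KV, 4k context.
console.log(kvCacheBytes({ layers: 32, kvHeads: 32, headDim: 128, seqLen: 4096, bytesPerElement: 1 }) / 2 ** 30); // ≈ 1.0 GiB
// With GQA (8 KV heads) the same configuration drops to ≈ 0.25 GiB.
```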
Rules of thumb:
- Budget 1–1.5 GB for KV at 4k context for 7–8B. For 2k context, ~500–800 MB.
- Mobile devices: cap context aggressively (1–2k tokens) and prefer 3–4B models.
- Use sliding window attention if available to bound KV growth.
Model selection for offline UX
- Small and capable: Phi-3-mini (3.8B), Qwen2.5-4B/7B, LLaMA 3.1 8B Instruct. Pick variants with instruction tuning and safety alignment.
- Pre-quantized GGUF from reputable sources reduces build complexity.
- Verify tokenizer compatibility; mismatched tokenizers wreck accuracy.
If you need tool use or RAG:
- External tools (search, code execution) can stay offline using sandboxed WASM interpreters, local vector search (FAISS-wasm, sqlite-vss), and browser file system APIs; see the retrieval sketch below.
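For small local corpora, brute-force cosine similarity is often enough before reaching for a real index. A minimal sketch, where the Doc shape and the source of embeddings are assumptions:

```ts
// Minimal local vector search: brute-force cosine similarity over in-memory docs.
type Doc = { id: string; embedding: Float32Array };

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-8);
}

function topK(query: Float32Array, docs: Doc[], k = 5): Doc[] {
  return docs
    .map((d) => ({ d, score: cosine(query, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.d);
}
```

For larger corpora, swap this for FAISS-wasm or sqlite-vss; the on-device, no-network property is the same.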
Loading architecture: streaming shards, service worker, and caching
Large models must stream in and resume if interrupted. Serve shard files (64–128 MB) over HTTP with Range support, and cache them.
- Cache Storage API for byte caches with versioned URLs.
- IndexedDB for structured indexes (tensor metadata, tokenizer, vocab).
- Service Worker for request interception, offline revalidation, and resume.
Service Worker skeleton:
```js
// sw.js
const MODEL_CACHE = 'model-v3';

self.addEventListener('install', () => {
  self.skipWaiting();
});

self.addEventListener('activate', (event) => {
  event.waitUntil(clients.claim());
});

self.addEventListener('fetch', (event) => {
  const url = new URL(event.request.url);
  if (!url.pathname.startsWith('/models/')) return;

  event.respondWith((async () => {
    const cache = await caches.open(MODEL_CACHE);
    const req = event.request;

    // Serve from cache first; fall back to network.
    // If this is a Range request, we can still slice a cached full blob.
    const cached = await cache.match(url.href);
    if (req.headers.has('Range') && cached) {
      const rangeHdr = req.headers.get('Range');
      const m = /bytes=(\d+)-(\d+)?/.exec(rangeHdr);
      if (m) {
        const start = parseInt(m[1], 10);
        const end = m[2] !== undefined ? parseInt(m[2], 10) : undefined;
        const buf = await cached.arrayBuffer();
        const slice = buf.slice(start, end !== undefined ? end + 1 : undefined);
        return new Response(slice, {
          status: 206,
          headers: {
            'Content-Range': `bytes ${start}-${end ?? buf.byteLength - 1}/${buf.byteLength}`,
            'Accept-Ranges': 'bytes',
            'Content-Type': 'application/octet-stream',
          },
        });
      }
    }

    const net = await fetch(req);
    // Successful full response: cache it
    if (net.ok && !req.headers.has('Range')) {
      cache.put(url.href, net.clone());
    }
    return net;
  })());
});
```
Client-side streaming loader with abort/resume:
```ts
// loader.ts
export async function streamShards(
  urls: string[],
  onChunk: (buf: ArrayBuffer, shardIndex: number) => void,
  signal?: AbortSignal
) {
  for (let i = 0; i < urls.length; i++) {
    const res = await fetch(urls[i], { signal });
    if (!res.ok) throw new Error(`Failed shard ${i}`);
    const reader = res.body!.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      // Copy just this chunk's bytes; value may be a view into a larger buffer.
      onChunk(value.buffer.slice(value.byteOffset, value.byteOffset + value.byteLength), i);
    }
  }
}
```
Ask for persistent storage so the model survives eviction:
```ts
if (navigator.storage && navigator.storage.persist) {
  navigator.storage.persist().then((granted) => {
    console.log('persistent storage', granted);
  });
}
```
Versioning scheme: include model-id + quant + tokenizer version in URLs, e.g., /models/llama3.1-8b-q4km/v3/shard-0003.gguf
Workers, threads, and cross-origin isolation
Run inference off the main thread. For WASM with threads and SharedArrayBuffer, you must enable cross-origin isolation.
Add headers from your server:
```js
// express.js fragment
app.use((req, res, next) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  next();
});
```
Use a dedicated worker for model runtime:
```ts
// inference.worker.ts
self.onmessage = async (e) => {
  const { type, payload } = e.data;
  switch (type) {
    case 'init':
      // init runtime, load model shards
      break;
    case 'generate':
      // run prefill + decode loop; postMessage tokens incrementally
      break;
  }
};

export {}; // TS module mode
```
Main thread:
```ts
const worker = new Worker(new URL('./inference.worker.ts', import.meta.url), { type: 'module' });

worker.onmessage = (e) => {
  if (e.data.type === 'token') {
    appendToken(e.data.text);
  }
};

worker.postMessage({ type: 'init', payload: { plan, modelUrls } });
```
If you use WebGPU from a worker, a dedicated worker is all you need; OffscreenCanvas is not required for compute-only workloads. You can call requestAdapter inside workers in modern Chromium-based browsers.
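A minimal sketch of compute-only WebGPU initialization inside a worker, assuming @webgpu/types for the TypeScript declarations:

```ts
// Compute-only WebGPU init inside a worker: no canvas or OffscreenCanvas involved.
async function initGPU(): Promise<GPUDevice | null> {
  if (!('gpu' in navigator)) return null; // workers expose navigator.gpu where supported
  const adapter = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
  if (!adapter) return null;
  // Only request optional features the adapter actually advertises.
  const features: GPUFeatureName[] = [];
  if (adapter.features.has('timestamp-query')) features.push('timestamp-query');
  return adapter.requestDevice({ requiredFeatures: features });
}
```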
Execution options: roll your own vs libraries
- llama.cpp (WASM + WebGPU): battle-tested GGUF loader, supports a variety of quantizations, KV cache quant, and streaming decode. Builds to the web with decent performance.
- MLC/WebLLM: end-to-end WebGPU stack with compiled kernels and a clean JS API. Great perf and portability.
- ONNX Runtime Web: WebGPU and WebNN execution providers; works well with ONNX-quantized models and broader operator coverage for non-LLM tasks.
- Transformers.js: high-level inference with WebGPU backend, great for smaller models and pipelines.
Pragmatic approach:
- Use llama.cpp or WebLLM for GGUF LLMs in the browser.
- Use ONNX Runtime Web for safety classifiers and auxiliary models (NSFW, PII, toxicity) with WebNN/WebGPU EPs.
Prefill vs decode: performance and scheduling
LLM generation has two phases:
- Prefill: process the prompt into KV cache. Compute-heavy; benefits from large matmuls and saturated GPU.
- Decode: one token at a time; latency-sensitive; favors small, reused command buffers and cache-friendly kernels.
Optimization tactics:
- Reuse command encoders and pipelines; avoid rebuilding pipelines per token.
- Keep weights on device memory; stream only the prompt. Avoid re-uploading large buffers.
- For WebGPU, use timestamp queries to measure kernel times and dynamically adjust batch size or throttling.
Example of GPU timestamps:
```ts
async function measurePass(
  device: GPUDevice, // must be created with the 'timestamp-query' feature
  recordFn: (pass: GPUComputePassEncoder) => void
) {
  const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
  // Queries resolve into a QUERY_RESOLVE buffer; copy to a mappable buffer to read back.
  const resolveBuffer = device.createBuffer({ size: 16, usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC });
  const readbackBuffer = device.createBuffer({ size: 16, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass({
    timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 },
  });
  recordFn(pass); // record compute work
  pass.end();
  encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
  encoder.copyBufferToBuffer(resolveBuffer, 0, readbackBuffer, 0, 16);
  device.queue.submit([encoder.finish()]);

  await readbackBuffer.mapAsync(GPUMapMode.READ);
  const timestamps = new BigUint64Array(readbackBuffer.getMappedRange());
  const dtNs = Number(timestamps[1] - timestamps[0]); // WebGPU timestamps are reported in nanoseconds
  readbackBuffer.unmap();
  return dtNs;
}
```
WASM CPU fallback: SIMD, threads, and pinning
For environments without WebGPU/WebNN:
- Build with -msimd128 and pthreads; ensure crossOriginIsolated is true for threads.
- Use a pinned memory pool to avoid frequent growing of the WASM heap.
- Balance threads to cores but avoid oversubscription; threads = min(cores, 8).
- For laptops on battery, consider reducing threads to avoid thermal throttling.
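A small helper that applies the thread guidance above; getBattery is not available everywhere, so treat the battery branch as best-effort:

```ts
// Pick a WASM thread count: threads require cross-origin isolation, and
// oversubscription hurts more than it helps; back off when on battery.
async function chooseWasmThreads(): Promise<number> {
  if (!crossOriginIsolated) return 1; // no SharedArrayBuffer, so use a single-threaded build
  let threads = Math.min(navigator.hardwareConcurrency || 4, 8);
  try {
    const battery = await (navigator as any).getBattery?.();
    if (battery && !battery.charging) threads = Math.max(2, Math.floor(threads / 2));
  } catch {
    // getBattery unavailable: assume plugged in
  }
  return threads;
}
```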
Device-aware resource scaling
Adapt at runtime to avoid OOM and jank:
- Measure allocation failures and downshift model/ctx length automatically.
- Track decode tokens/s; if TPS < threshold for N tokens, reduce max context or switch to smaller model on next session.
- Watch for page visibility; pause or reduce TPS in background tabs.
Heuristic downshift:
```ts
function shouldDownshift(tpsHistory: number[]): boolean {
  if (tpsHistory.length < 10) return false;
  const median = tpsHistory.slice(-10).sort((a, b) => a - b)[4];
  return median < 2; // e.g., if below 2 tokens/s, choose a smaller model next time
}
```
Streaming tokenizer and incremental decode
Tokenization can be a bottleneck; run it in WASM and stream inputs.
- Use fast BPE/unigram tokenizers in WASM with SIMD.
- For chat UIs, tokenize user input incrementally as they type; amortize prefill.
- Maintain a ring buffer of token IDs; reuse for next turns.
Example concept (pseudo):
```ts
// tokenize.ts
import initTokenizer from './tokenizer_wasm.js';

const tok = await initTokenizer();

export function tokenizeStreaming(text: string, onTokens: (ids: Uint32Array) => void) {
  const chunkSize = 1024; // chars
  for (let i = 0; i < text.length; i += chunkSize) {
    const chunk = text.slice(i, i + chunkSize);
    const ids = tok.encode(chunk);
    onTokens(ids);
  }
}
```
Caching: beyond weights
Cache everything that’s stable:
- Tokenizer files (vocab.json, merges.txt or equivalent).
- Model metadata (tensor shapes, layer norms, rope parameters).
- Precompiled pipelines or shader variants keyed by adapter + limits.
- Safety classifier models.
Invalidate by version keys: model-id + quant + backend + adapter.vendor/device + driver version when available.
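A minimal sketch of the metadata side using raw IndexedDB, with a composite version key along the lines described above (the database and store names here are arbitrary):

```ts
// Tiny IndexedDB helper for stable artifacts (tokenizer, tensor metadata,
// pipeline descriptors), keyed by a composite version string.
function cacheKey(modelId: string, quant: string, backend: string, adapterInfo?: string) {
  return [modelId, quant, backend, adapterInfo ?? 'unknown'].join('|');
}

function openMetaDB(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('llm-meta', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('meta');
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function putMeta(key: string, value: unknown): Promise<void> {
  const db = await openMetaDB();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction('meta', 'readwrite');
    tx.objectStore('meta').put(value, key);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

// Example: putMeta(cacheKey('llama3.1-8b', 'q4_k_m', 'webgpu'), { tokenizerVersion: 3 });
```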
Offline/online fallback strategy
Your app should function fully offline. But if device capability is insufficient or the user opts in, fall back to a server model.
- Detect failure early (OOM during prefill, allocation errors) and present a choice: try smaller on-device model, reduce context, or use cloud.
- Maintain identical API surfaces for on-device and server responses so the UI doesn’t care.
Example switch:
```ts
async function respond(prompt: string, plan: Plan) {
  try {
    if (plan.backend === 'server') return await callServer(prompt);
    return await localGenerate(prompt, plan);
  } catch (e) {
    // If offline, degrade gracefully
    if (!navigator.onLine) return { text: offlineSorryMessage() };
    // Otherwise fall back to the server
    return await callServer(prompt);
  }
}
```
Safety guardrails entirely on-device
Security and safety must not require a round trip. Combine fast, deterministic filters with a compact classifier.
Layers:
- Deterministic filters
- Regex/heuristics for PII: emails, phone numbers, SSNs, credit cards.
- Hard blocklists for disallowed content according to your policy.
- Prompt/response context rules: reject jailbreak patterns, ensure system prompts aren’t leaked.
- Lightweight on-device classifier
- Toxicity/NSFW classifier as an ONNX model quantized to INT8/INT4, run with ONNX Runtime Web on WebGPU/WebNN.
- For speed, use small architectures (e.g., TinyBERT, MobileBERT classifiers) at 5–20 MB.
- Self-check pass
- After generation, run a second pass with the same LLM using a short safety prompt to evaluate the response and possibly redact.
Example PII filter:
```ts
const rePII = [
  /\b\d{3}-\d{2}-\d{4}\b/g,                                          // US SSN
  /\b\d{13,19}\b/g,                                                  // CC-like digit runs (very rough)
  /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi,                         // email
  /\b\+?\d{1,3}?[\s.-]?\(?\d{1,4}\)?[\s.-]?\d{1,4}[\s.-]?\d{1,9}\b/g // phone-like
];

export function redactPII(s: string): string {
  return rePII.reduce((acc, r) => acc.replace(r, '[REDACTED]'), s);
}
```
Classify with ONNX Runtime Web:
```ts
import * as ort from 'onnxruntime-web';

async function loadSafetyModel() {
  // Threads apply to the wasm backend; cap them to avoid oversubscription.
  ort.env.wasm.numThreads = Math.min(navigator.hardwareConcurrency || 4, 4);
  // Prefer the WebGPU execution provider and fall back to wasm automatically.
  const session = await ort.InferenceSession.create('/safety/tinybert-int8.onnx', {
    executionProviders: ['webgpu', 'wasm'],
  });
  return session;
}
```
Policy decision:
- If deterministic filters trigger, redact or block.
- If classifier score exceeds threshold, block or require user confirmation.
- Log decisions locally for transparency; do not send anywhere by default.
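The self-check pass described above can be a thin wrapper around the same local generation entry point. A sketch, assuming a localGenerate(prompt, plan) function like the one in the fallback example and a policy prompt of your own design:

```ts
import type { Plan } from './capability';

// Hypothetical signature matching the fallback example's localGenerate.
declare function localGenerate(prompt: string, plan: Plan): Promise<{ text: string }>;

async function selfCheck(output: string, plan: Plan): Promise<'allow' | 'block'> {
  const prompt =
    'You are a safety checker. Answer with exactly ALLOW or BLOCK.\n' +
    'Does the following text violate the content policy?\n---\n' + output + '\n---';
  // Keep the second pass cheap: cap the context used for the check.
  const result = await localGenerate(prompt, { ...plan, maxCtx: Math.min(plan.maxCtx, 1024) });
  return /\bBLOCK\b/i.test(result.text) ? 'block' : 'allow';
}
```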
WebGPU implementation notes that bite in production
- Adapter selection: requestAdapter with powerPreference: 'high-performance' for desktops with dGPU.
- Buffer alignment: abide by minUniformBufferOffsetAlignment and minStorageBufferOffsetAlignment.
- Command submission: batch decode steps but keep per-token latency low.
- Memory fragmentation: create large, reusable buffers for activations; implement a simple allocator to sub-allocate.
- Shader specialization: bake in dimensions as constants for better compiler optimization.
- Avoid mapping big buffers every token; use queue.writeBuffer for small updates and persistent storage/readonly mappings for weights.
- Safari quirks: test on Safari 17+/iOS; some limits differ, and timestamp queries may not be available.
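For the sub-allocation note above, even a simple bump allocator over one large buffer goes a long way. A sketch, assuming you reset it once per decode step:

```ts
// Minimal bump sub-allocator over one large GPUBuffer, to avoid per-token
// allocations and fragmentation.
class GpuArena {
  private offset = 0;
  // align defaults to 256, the usual minStorageBufferOffsetAlignment.
  constructor(readonly buffer: GPUBuffer, readonly size: number, readonly align = 256) {}

  alloc(bytes: number): { buffer: GPUBuffer; offset: number; size: number } {
    const aligned = Math.ceil(this.offset / this.align) * this.align;
    if (aligned + bytes > this.size) throw new Error('GPU arena exhausted');
    this.offset = aligned + bytes;
    return { buffer: this.buffer, offset: aligned, size: bytes };
  }

  reset() {
    this.offset = 0; // call once per decode step
  }
}
```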
Example: end-to-end minimal local generation flow
Using a WebGPU-backed runtime (e.g., WebLLM or llama.cpp-wasm) with a worker and streaming UI.
```ts
// app.ts
import { planCapabilities } from './capability';

const plan = await planCapabilities();
const worker = new Worker('/inference.worker.js', { type: 'module' });
worker.postMessage({ type: 'init', payload: { plan } });

export function generate(prompt: string, onToken: (t: string) => void) {
  return new Promise((resolve, reject) => {
    const onMsg = (e: MessageEvent) => {
      const { type, data } = e.data || {};
      if (type === 'token') onToken(data.text);
      if (type === 'done') { cleanup(); resolve(data); }
      if (type === 'error') { cleanup(); reject(data); }
    };
    const cleanup = () => worker.removeEventListener('message', onMsg);
    worker.addEventListener('message', onMsg);
    // Note: AbortSignal is not structured-cloneable; send an explicit cancel message instead.
    worker.postMessage({ type: 'generate', payload: { prompt } });
  });
}

export function cancel() {
  worker.postMessage({ type: 'cancel' });
}
```
Worker side (pseudo using llama.cpp-wasm API style):
```ts
// inference.worker.js (pseudo, llama.cpp-wasm API style)
import { LlamaModel } from 'llama.cpp-web';

let model;

self.onmessage = async (e) => {
  const { type, payload } = e.data;

  if (type === 'init') {
    const { plan } = payload;
    model = await LlamaModel.load({
      urls: plan.model === 'llama3.1-8b' ? LLAMA_8B_Q4KM_URLS : PHI3_Q4_URLS,
      backend: plan.backend, // 'webgpu' | 'wasm'
      kvQuant: plan.useKVCacheQuant ? 'Q8' : 'F16',
      maxCtx: plan.maxCtx,
      threads: plan.threads,
    });
    postMessage({ type: 'ready' });
  }

  if (type === 'generate') {
    try {
      const stream = model.generate({ prompt: payload.prompt, temperature: 0.7, top_p: 0.9 });
      for await (const token of stream) {
        postMessage({ type: 'token', data: { text: token.text } });
      }
      postMessage({ type: 'done', data: {} });
    } catch (err) {
      postMessage({ type: 'error', data: String(err) });
    }
  }
};
```
Add a safety pass on the main thread just before displaying tokens:
```ts
function safeAppend(token: string) {
  const redacted = redactPII(token);
  // Optionally buffer and run the classifier every N characters
  appendToUI(redacted);
}
```
Energy and thermal considerations
- Back off when on battery: detect navigator.getBattery if available; slow decode rate or reduce threads.
- Use requestIdleCallback for non-critical work (pre-compiling pipelines, warming caches).
- Provide a visible toggle for "Battery saver" that caps context length and reduces TPS.
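A sketch of the battery-saver wiring; warmUpPipelines is a hypothetical placeholder for your own warm-up work, and getBattery is unavailable in some browsers, so the flag simply stays false there:

```ts
// Hypothetical warm-up entry point (assumption, not a real API).
declare function warmUpPipelines(opts: { reducedThreads: boolean }): void;

let batterySaver = false;

// Drive the flag from battery events where getBattery exists.
(navigator as any).getBattery?.().then((battery: any) => {
  const update = () => { batterySaver = !battery.charging && battery.level < 0.4; };
  update();
  battery.addEventListener('chargingchange', update);
  battery.addEventListener('levelchange', update);
});

// Defer non-critical warm-up (pipeline compilation, cache priming) to idle time.
requestIdleCallback(() => warmUpPipelines({ reducedThreads: batterySaver }), { timeout: 5000 });
```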
Testing and benchmarking
- Measure prefill throughput (tokens/s-equivalent) and decode TPS separately.
- Use PerformanceObserver to attribute jank on main thread.
- On WebGPU, timestamp queries across kernels; on WASM, coarse timers are sufficient.
- Test on a matrix: Windows iGPU (Iris Xe), MacBook Air (M-series), AMD dGPU, iOS Safari, Android Chrome mid-range.
Minimal TPS tracker:
```ts
class TPSTracker {
  private times: number[] = [];

  tick() {
    this.times.push(performance.now());
    if (this.times.length > 1000) this.times.shift();
  }

  tps() {
    if (this.times.length < 2) return 0;
    const dt = this.times[this.times.length - 1] - this.times[0];
    const tokens = this.times.length - 1;
    return (tokens / dt) * 1000;
  }
}
```
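To attribute main-thread jank, a long-task observer complements the TPS tracker; longtask entries only appear for tasks over 50 ms, and the extra threshold here is an arbitrary choice:

```ts
// Log long main-thread tasks that would stall UI updates during decoding.
const longTasks = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.duration > 100) {
      console.warn('Long task on main thread:', Math.round(entry.duration), 'ms');
    }
  }
});
longTasks.observe({ type: 'longtask', buffered: true });
```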
Browser coverage in 2025: reality check
- Chrome/Edge: WebGPU stable on desktop and Android; WebNN available behind a flag or origin trial in some channels; ORT WebGPU EP works well.
- Safari: WebGPU on macOS Sonoma+ and iOS 17/18 with limitations; test memory limits carefully.
- Firefox: WebGPU shipping with ongoing improvements; performance can differ across platforms.
Plan for variability: keep a WASM path for correctness, and allow server fallback as last resort.
Security and privacy
- Keep all model files and intermediate data local. Don’t send prompts/outputs by default.
- If offering cloud fallback, make it opt-in and clearly indicated.
- Allow users to clear local model/cache; respect private browsing mode (disable large caches).
- Signed model manifests with hashes; verify before using to prevent tampering.
Hash verification example:
```ts
async function verifyDigest(resp: Response, expectedHex: string) {
  const buf = await resp.clone().arrayBuffer();
  const digest = await crypto.subtle.digest('SHA-256', buf);
  const hex = [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
  if (hex !== expectedHex) throw new Error('Model shard hash mismatch');
  return resp;
}
```
Putting it together: a production blueprint
- Capability plan
  - Detect WebGPU/WebNN/WASM and adapter limits
  - Choose model size, quant, max context, threads
- Storage and caching
  - Request persistent storage
  - Service Worker caches shards with Range support and hash verification
  - Cache tokenizer and metadata in IndexedDB
- Initialization
  - Spin up a Worker with the selected backend
  - Warm-up: compile pipelines, allocate buffers, load tokenizer
- Generation
  - Stream user input tokenization
  - Prefill, then decode with reusable command buffers
  - Emit tokens to the UI progressively, applying safety redaction
- Safety
  - Run deterministic filters; gate high-risk content
  - Optional small classifier for toxicity/NSFW
  - Self-check with the same model on final output if policy requires
- Fallbacks
  - Downshift model/ctx on perf issues or OOM
  - If offline and failing, show a helpful message; if online, offer server inference
- Observability (local)
  - Record TPS, OOMs, and backend choice to local storage for next-session planning
  - Provide a diagnostics panel for users
Common pitfalls and how to avoid them
- OOM during prefill: reduce context, quantize KV cache, or pick smaller model. Allocate KV lazily with growable windows.
- Long cold starts: shard models, parallelize fetches, prewarm pipelines, persist caches across sessions.
- Jank on main thread: move tokenization and all inference to workers; throttle DOM updates.
- Inconsistent tokenization: ensure tokenizer version matches model; cache tokenizer; write tests for round-trip encoding.
- Excessive storage eviction: request persistent storage; reduce shard size; allow user to pin models.
- Poor mobile perf: cap to 1–2k context, smaller models, and leverage WebNN where it shines.
Opinionated take: what to ship in 2025
- Default model: 3–4B instruct, q4_K_M, 2–3k context. It’s the sweet spot across devices.
- Advanced toggle: 7–8B instruct, q4_K_M, 4k context for desktops with dGPU or M-series.
- Backend priority: WebGPU > WebNN (when NPU helps) > WASM.
- Safety: deterministic filters + tiny classifier; no server dependency.
- UX: instant token stream within 500 ms on cold start for cached models; 2–6 TPS decode typical; clear battery-saver mode.
This setup gives you a private, fast, and maintainable on-device AI that scales.
Further references and ecosystems to watch
- llama.cpp and GGUF ecosystem: rich model zoo and continuous kernel improvements.
- MLC/WebLLM: compiler-driven approach for WebGPU with strong results.
- ONNX Runtime Web: WebGPU/WebNN EPs, great for auxiliary models.
- WebNN spec and implementations: track release notes for stability and performance on mobile.
- Tokenizers-wasm projects: fast BPE/unigram with SIMD.
Closing
Shipping private, offline AI in 2025 is a solved engineering problem when you combine the right primitives: WebGPU/WebNN for compute, WASM for glue and fallback, GGUF INT4 for compact weights, streaming loaders and caches for UX, workers for responsiveness, and on-device safety guardrails. The challenge is no longer “is it possible?” but “can you design a robust, adaptive runtime that treats every device with respect?” With the patterns above, you can.