Ship Private, Offline AI in 2025: On‑Device LLM Inference with WebGPU, WebNN, and WASM
By 2025, privacy-first, offline LLMs on end-user devices are no longer a science project—they’re a pragmatic product decision. Browsers ship WebGPU on desktop and mobile, WebNN is stabilizing across Chromium-based browsers, and WebAssembly is ubiquitous with SIMD and threads. With a careful toolchain (quantization, streaming loaders, workers, caching) you can deliver useful, fast, and private LLM capabilities that work offline, don’t exfiltrate data, and scale to millions of users without server bills.
This article is a hands-on blueprint: progressive capability detection, memory budgets and KV cache math, quantization choices (GGUF/INT4), GPU/CPU execution with WebGPU/WebNN/WASM, streaming model loaders, workers, caching strategies, fallbacks, and safety guardrails. You’ll leave with a concrete plan and production-ready patterns.
Why on-device LLMs in 2025
- Privacy and compliance: keep sensitive data local. For many orgs, private-by-default is now a requirement.
- Latency and reliability: no network round trips, instant token streaming, and offline availability.
- Cost and scaling: inference moves from your cloud to the user’s device.
- UX: snappy, resilient AI that works on a plane and in spotty network conditions.
The trade-offs are real: tighter memory budgets, thermals and battery, and heterogeneous capabilities. But with the tools available in 2025, you can ship a great experience on a wide range of devices.
Execution backends: WebGPU, WebNN, WASM
- WebGPU: The workhorse for browser LLMs. You get compute shaders, storage buffers, subgroup ops, timestamp queries, and robust adapter limits. Ideal for matmul-heavy decode kernels, KV cache management, and fast attention.
- WebNN: A high-level API mapping to native NN backends (DirectML, Core ML, NNAPI). When available and performant, it can beat hand-rolled kernels—especially on mobile SoCs with NPU/DSP. Best with ONNX graphs.
- WASM (SIMD + threads): Ubiquitous fallback with reasonable performance on CPU. Necessary for WebGPU-less environments or to offload tokenization and control logic. With cross-origin isolation, threads and SharedArrayBuffer enable parallelism.
A pragmatic stack:
- Prefer WebGPU for LLM decoding on desktop and high-end mobile.
- Use WebNN when it’s available and proven faster on a given device (especially mobile NPUs).
- Fall back to WASM CPU inference for smaller models or degrade to server inference if needed.
Capability detection and progressive enhancement
Feature-detect at runtime and build a plan: pick backend, model size, quantization, max context, and thread counts.
```ts
// capability.ts
export type Plan = {
  backend: 'webgpu' | 'webnn' | 'wasm' | 'server';
  model: 'phi3-mini-3.8b' | 'llama3.1-8b' | 'qwen2.5-7b' | 'tiny-1b';
  quant: 'q4_0' | 'q4_k_m' | 'q5_k_m' | 'q8_0';
  maxCtx: number; // tokens
  threads: number;
  useKVCacheQuant: boolean;
};

export async function planCapabilities(): Promise<Plan> {
  const hasWebGPU = !!navigator.gpu;
  const hasWebNN = !!(navigator as any).ml; // WebNN API (behind flags on some browsers)
  const cores = navigator.hardwareConcurrency || 4;
  const deviceMemoryGB = (navigator as any).deviceMemory || 4; // heuristic

  // Try to get GPU adapter info
  let adapter: GPUAdapter | null = null;
  try {
    adapter = hasWebGPU ? await navigator.gpu!.requestAdapter() : null;
  } catch {}
  const limits = adapter?.limits;

  // Rough VRAM/shared-memory budget heuristic
  const maxBufferSize = limits?.maxStorageBufferBindingSize || 128 << 20; // default 128 MB

  // Rules of thumb for a 2025 client fleet
  if (adapter) {
    // Desktop iGPU/dGPU
    if (maxBufferSize >= 512 << 20) {
      return {
        backend: 'webgpu',
        model: 'llama3.1-8b',
        quant: 'q4_k_m',
        maxCtx: 4096,
        threads: Math.min(cores, 8),
        useKVCacheQuant: true,
      };
    }
    // Mid-tier (iGPU, high-end mobile)
    return {
      backend: 'webgpu',
      model: 'phi3-mini-3.8b',
      quant: 'q4_0',
      maxCtx: 3072,
      threads: Math.min(cores, 6),
      useKVCacheQuant: true,
    };
  }
  if (hasWebNN && deviceMemoryGB >= 4) {
    return {
      backend: 'webnn',
      model: 'phi3-mini-3.8b',
      quant: 'q4_0',
      maxCtx: 3072,
      threads: Math.min(cores, 6),
      useKVCacheQuant: true,
    };
  }
  if (deviceMemoryGB >= 8) {
    return {
      backend: 'wasm',
      model: 'phi3-mini-3.8b',
      quant: 'q4_0',
      maxCtx: 2048,
      threads: Math.min(cores, 8),
      useKVCacheQuant: true,
    };
  }
  return {
    backend: 'server',
    model: 'tiny-1b',
    quant: 'q8_0',
    maxCtx: 1024,
    threads: Math.min(cores, 4),
    useKVCacheQuant: false,
  };
}
```
Notes:
- navigator.deviceMemory is a hint; use with caution.
- GPUAdapter.limits helps approximate workable tensor sizes and whether larger models will fit without heavy fragmentation.
- Always capture telemetry (anonymous, on-device only) to refine thresholds—don’t hard-code forever.
Quantization: GGUF, INT4, and browser-friendly formats
For browser inference in 2025, two families dominate:
- GGUF: the de facto weight format from llama.cpp with metadata, tensor layout, and quantization blocks. Excellent for streaming, versioning, and broad model availability.
- ONNX: graph format consumed by WebNN/ONNX Runtime Web. Use post-training quantization (INT4/INT8) or AWQ/GPTQ weights when supported.
Quantization choices:
- INT4 group-wise (q4_0, q4_K_M): best bang-for-byte; q4_K_M often yields better accuracy than naive q4_0.
- INT5/INT6 variants: a good middle ground when memory is tight but you need a little extra quality.
- INT8: safer accuracy, larger footprint; useful for small models or KV cache.
Practical advice:
- Prefer q4_K_M for general-purpose chat if available.
- Keep the KV cache at higher precision (Q8) when possible; if you must quantize KV, test carefully for degradation on long contexts.
- Use group-size-32 or 64 quantization for GPU-friendly blocks.
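To sanity-check download and memory footprints for a given quantization, a rough estimator helps. The bits-per-weight figures below are approximations (they include block scales and vary slightly by architecture and toolchain version), so treat this as a sketch, not a spec:

```ts
// Rough on-disk size estimator for common GGUF quantizations.
// Bits-per-weight values are approximate and include block scales/metadata.
const APPROX_BITS_PER_WEIGHT: Record<string, number> = {
  q4_0: 4.5,   // 18 bytes per 32-weight block
  q4_k_m: 4.85,
  q5_k_m: 5.7,
  q8_0: 8.5,   // 34 bytes per 32-weight block
};

function approxWeightBytes(paramsBillions: number, quant: keyof typeof APPROX_BITS_PER_WEIGHT): number {
  const bits = APPROX_BITS_PER_WEIGHT[quant];
  return (paramsBillions * 1e9 * bits) / 8;
}

console.log((approxWeightBytes(8, 'q4_k_m') / 2 ** 30).toFixed(1), 'GiB'); // ≈ 4.5 GiB for an 8B model
```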
Memory budgeting: parameters, KV cache, and reality checks
Total memory = weights + KV cache + activations + fragmentation overhead.
- Weights: 8B params at 4 bits is roughly 8B × 0.5 bytes = 4 GB, but GGUF includes per-tensor scales/metadata. Expect 4.5–5.5 GB for 8B q4_K_M. For 3–4B models, expect 2–3 GB.
- KV cache: grows with sequence length. For decoder-only LLMs:
- KV bytes ≈ layers × heads × seqLen × headDim × 2 × bytesPerElement
- Many implementations pack KV across heads; a useful approximation is: KV bytes ≈ 2 × hiddenDim × seqLen × layers × bytesPerElement × kvCompressionFactor
- For an 8B LLaMA-like model: hiddenDim ≈ 4096–5120, layers ≈ 32–40. Using FP16 (2 bytes) is large; Q8 KV halves it; more aggressive KV quant can hurt quality.
Example: LLaMA 3.1 8B, hiddenDim≈4096, layers=32, Q8 KV, seqLen=4096
- KV ≈ 2 × 4096 × 4096 × 32 × 1 byte ≈ 1.07 GB
- This treats every head as a full KV head; LLaMA 3.1 uses grouped-query attention (8 KV heads vs 32 query heads), so the real footprint is roughly 4× smaller. Treat ~1 GB as a conservative upper bound.
- Prefill will allocate transient activations; budget at least 20–30% headroom during prefill.
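To make these budgets concrete, here is a small helper that mirrors the approximation above. It assumes a LLaMA-style decoder where KV per token is layers × kvHeads × headDim, and the kvHeads parameter lets you account for grouped-query attention:

```ts
// Rough KV-cache budget helper, mirroring the approximation above.
interface KVConfig {
  layers: number;
  kvHeads: number;         // key/value heads (8 for LLaMA 3.1 8B with GQA; equal to heads for plain MHA)
  headDim: number;         // e.g., 128
  seqLen: number;          // max context in tokens
  bytesPerElement: number; // 2 for FP16, 1 for Q8 KV
}

function kvCacheBytes(c: KVConfig): number {
  // K and V each store layers × kvHeads × headDim values per token.
  return 2 * c.layers * c.kvHeads * c.headDim * c.seqLen * c.bytesPerElement;
}

// Full-MHA upper bound from the example above: 32 layers, 32 × 128 dims, Q8 KV, 4k context.
console.log(kvCacheBytes({ layers: 32, kvHeads: 32, headDim: 128, seqLen: 4096, bytesPerElement: 1 }) / 2 ** 30); // ≈ 1.0 GiB
// With GQA (8 KV heads) the same configuration drops to ≈ 0.25 GiB.
```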
Rules of thumb:
- Budget 1–1.5 GB for KV at 4k context for 7–8B. For 2k context, ~500–800 MB.
- Mobile devices: cap context aggressively (1–2k tokens) and prefer 3–4B models.
- Use sliding window attention if available to bound KV growth.
Model selection for offline UX
- Small and capable: Phi-3-mini (3.8B), Qwen2.5-4B/7B, LLaMA 3.1 8B Instruct. Pick variants with instruction tuning and safety alignment.
- Pre-quantized GGUF from reputable sources reduces build complexity.
- Verify tokenizer compatibility; mismatched tokenizers wreck accuracy.
If you need tool use or RAG:
- External tools (search, code execution) can stay offline using sandboxed WASM interpreters, local vector search (FAISS-wasm, sqlite-vss), and browser file system APIs; see the retrieval sketch below.
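For small local corpora, brute-force cosine similarity is often enough before reaching for a real index. A minimal sketch, where the Doc shape and the source of embeddings are assumptions:

```ts
// Minimal local vector search: brute-force cosine similarity over in-memory docs.
type Doc = { id: string; embedding: Float32Array };

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-8);
}

function topK(query: Float32Array, docs: Doc[], k = 5): Doc[] {
  return docs
    .map((d) => ({ d, score: cosine(query, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.d);
}
```

For larger corpora, swap this for FAISS-wasm or sqlite-vss; the on-device, no-network property is the same.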
Loading architecture: streaming shards, service worker, and caching
Large models must stream in and resume if interrupted. Serve shard files (64–128 MB) over HTTP with Range support, and cache them.
- Cache Storage API for byte caches with versioned URLs.
- IndexedDB for structured indexes (tensor metadata, tokenizer, vocab).
- Service Worker for request interception, offline revalidation, and resume.
Service Worker skeleton:
```js
// sw.js
const MODEL_CACHE = 'model-v3';

self.addEventListener('install', () => {
  self.skipWaiting();
});

self.addEventListener('activate', (event) => {
  event.waitUntil(clients.claim());
});

self.addEventListener('fetch', (event) => {
  const url = new URL(event.request.url);
  if (!url.pathname.startsWith('/models/')) return;

  event.respondWith((async () => {
    const cache = await caches.open(MODEL_CACHE);
    const req = event.request;

    // Serve from cache first; fall back to network.
    // If this is a Range request, we can still slice a cached full blob.
    const cached = await cache.match(url.href);
    if (req.headers.has('Range') && cached) {
      const rangeHdr = req.headers.get('Range');
      const m = /bytes=(\d+)-(\d+)?/.exec(rangeHdr);
      if (m) {
        const start = parseInt(m[1], 10);
        const end = m[2] !== undefined ? parseInt(m[2], 10) : undefined;
        const buf = await cached.arrayBuffer();
        const slice = buf.slice(start, end !== undefined ? end + 1 : undefined);
        return new Response(slice, {
          status: 206,
          headers: {
            'Content-Range': `bytes ${start}-${end ?? buf.byteLength - 1}/${buf.byteLength}`,
            'Accept-Ranges': 'bytes',
            'Content-Type': 'application/octet-stream',
          },
        });
      }
    }

    const net = await fetch(req);
    // Successful full response: cache it
    if (net.ok && !req.headers.has('Range')) {
      cache.put(url.href, net.clone());
    }
    return net;
  })());
});
```
Client-side streaming loader with abort/resume:
```ts
// loader.ts
export async function streamShards(
  urls: string[],
  onChunk: (buf: ArrayBuffer, shardIndex: number) => void,
  signal?: AbortSignal
) {
  for (let i = 0; i < urls.length; i++) {
    const res = await fetch(urls[i], { signal });
    if (!res.ok) throw new Error(`Failed shard ${i}`);
    const reader = res.body!.getReader();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      // Copy just this chunk's bytes; value may be a view into a larger buffer.
      onChunk(value.buffer.slice(value.byteOffset, value.byteOffset + value.byteLength), i);
    }
  }
}
```
Ask for persistent storage so the model survives eviction:
```ts
if (navigator.storage && navigator.storage.persist) {
  navigator.storage.persist().then((granted) => {
    console.log('persistent storage', granted);
  });
}
```
Versioning scheme: include model-id + quant + tokenizer version in URLs, e.g., /models/llama3.1-8b-q4km/v3/shard-0003.gguf
Workers, threads, and cross-origin isolation
Run inference off the main thread. For WASM with threads and SharedArrayBuffer, you must enable cross-origin isolation.
Add headers from your server:
```js
// express.js fragment
app.use((req, res, next) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  next();
});
```
Use a dedicated worker for model runtime:
```ts
// inference.worker.ts
self.onmessage = async (e) => {
  const { type, payload } = e.data;
  switch (type) {
    case 'init':
      // init runtime, load model shards
      break;
    case 'generate':
      // run prefill + decode loop; postMessage tokens incrementally
      break;
  }
};

export {}; // TS module mode
```
Main thread:
```ts
const worker = new Worker(new URL('./inference.worker.ts', import.meta.url), { type: 'module' });

worker.onmessage = (e) => {
  if (e.data.type === 'token') {
    appendToken(e.data.text);
  }
};

worker.postMessage({ type: 'init', payload: { plan, modelUrls } });
```
If you use WebGPU from a worker, a dedicated worker is all you need; OffscreenCanvas is not required for compute-only workloads. You can call requestAdapter inside workers in modern Chromium-based browsers.
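A minimal sketch of compute-only WebGPU initialization inside a worker, assuming @webgpu/types for the TypeScript declarations:

```ts
// Compute-only WebGPU init inside a worker: no canvas or OffscreenCanvas involved.
async function initGPU(): Promise<GPUDevice | null> {
  if (!('gpu' in navigator)) return null; // workers expose navigator.gpu where supported
  const adapter = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
  if (!adapter) return null;
  // Only request optional features the adapter actually advertises.
  const features: GPUFeatureName[] = [];
  if (adapter.features.has('timestamp-query')) features.push('timestamp-query');
  return adapter.requestDevice({ requiredFeatures: features });
}
```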
Execution options: roll your own vs libraries
- llama.cpp (WASM + WebGPU): battle-tested GGUF loader, supports a variety of quantizations, KV cache quant, and streaming decode. Builds to the web with decent performance.
- MLC/WebLLM: end-to-end WebGPU stack with compiled kernels and a clean JS API. Great perf and portability.
- ONNX Runtime Web: WebGPU and WebNN execution providers; works well with ONNX-quantized models and broader operator coverage for non-LLM tasks.
- Transformers.js: high-level inference with WebGPU backend, great for smaller models and pipelines.
Pragmatic approach:
- Use llama.cpp or WebLLM for GGUF LLMs in the browser.
- Use ONNX Runtime Web for safety classifiers and auxiliary models (NSFW, PII, toxicity) with WebNN/WebGPU EPs.
Prefill vs decode: performance and scheduling
LLM generation has two phases:
- Prefill: process the prompt into KV cache. Compute-heavy; benefits from large matmuls and saturated GPU.
- Decode: one token at a time; latency-sensitive; favors small, reused command buffers and cache-friendly kernels.
Optimization tactics:
- Reuse command encoders and pipelines; avoid rebuilding pipelines per token.
- Keep weights on device memory; stream only the prompt. Avoid re-uploading large buffers.
- For WebGPU, use timestamp queries to measure kernel times and dynamically adjust batch size or throttling.
Example of GPU timestamps:
```ts
async function measurePass(
  device: GPUDevice, // must be created with the 'timestamp-query' feature
  recordFn: (pass: GPUComputePassEncoder) => void
) {
  const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
  // Queries resolve into a QUERY_RESOLVE buffer; copy to a mappable buffer to read back.
  const resolveBuffer = device.createBuffer({ size: 16, usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC });
  const readbackBuffer = device.createBuffer({ size: 16, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass({
    timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 },
  });
  recordFn(pass); // record compute work
  pass.end();
  encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
  encoder.copyBufferToBuffer(resolveBuffer, 0, readbackBuffer, 0, 16);
  device.queue.submit([encoder.finish()]);

  await readbackBuffer.mapAsync(GPUMapMode.READ);
  const timestamps = new BigUint64Array(readbackBuffer.getMappedRange());
  const dtNs = Number(timestamps[1] - timestamps[0]); // WebGPU timestamps are reported in nanoseconds
  readbackBuffer.unmap();
  return dtNs;
}
```
WASM CPU fallback: SIMD, threads, and pinning
For environments without WebGPU/WebNN:
- Build with -msimd128 and pthreads; ensure crossOriginIsolated is true for threads.
- Use a pinned memory pool to avoid frequent growing of the WASM heap.
- Balance threads to cores but avoid oversubscription; threads = min(cores, 8).
- For laptops on battery, consider reducing threads to avoid thermal throttling.
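A small helper that applies the thread guidance above; getBattery is not available everywhere, so treat the battery branch as best-effort:

```ts
// Pick a WASM thread count: threads require cross-origin isolation, and
// oversubscription hurts more than it helps; back off when on battery.
async function chooseWasmThreads(): Promise<number> {
  if (!crossOriginIsolated) return 1; // no SharedArrayBuffer, so use a single-threaded build
  let threads = Math.min(navigator.hardwareConcurrency || 4, 8);
  try {
    const battery = await (navigator as any).getBattery?.();
    if (battery && !battery.charging) threads = Math.max(2, Math.floor(threads / 2));
  } catch {
    // getBattery unavailable: assume plugged in
  }
  return threads;
}
```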
Device-aware resource scaling
Adapt at runtime to avoid OOM and jank:
- Measure allocation failures and downshift model/ctx length automatically.
- Track decode tokens/s; if TPS < threshold for N tokens, reduce max context or switch to smaller model on next session.
- Watch for page visibility; pause or reduce TPS in background tabs.
Heuristic downshift:
```ts
function shouldDownshift(tpsHistory: number[]): boolean {
  if (tpsHistory.length < 10) return false;
  const median = tpsHistory.slice(-10).sort((a, b) => a - b)[4];
  return median < 2; // e.g., if below 2 tokens/s, choose a smaller model next time
}
```
Streaming tokenizer and incremental decode
Tokenization can be a bottleneck; run it in WASM and stream inputs.
- Use fast BPE/unigram tokenizers in WASM with SIMD.
- For chat UIs, tokenize user input incrementally as they type; amortize prefill.
- Maintain a ring buffer of token IDs; reuse for next turns.
Example concept (pseudo):
```ts
// tokenize.ts
import initTokenizer from './tokenizer_wasm.js';

const tok = await initTokenizer();

export function tokenizeStreaming(text: string, onTokens: (ids: Uint32Array) => void) {
  const chunkSize = 1024; // chars
  for (let i = 0; i < text.length; i += chunkSize) {
    const chunk = text.slice(i, i + chunkSize);
    const ids = tok.encode(chunk);
    onTokens(ids);
  }
}
```
Caching: beyond weights
Cache everything that’s stable:
- Tokenizer files (vocab.json, merges.txt or equivalent).
- Model metadata (tensor shapes, layer norms, rope parameters).
- Precompiled pipelines or shader variants keyed by adapter + limits.
- Safety classifier models.
Invalidate by version keys: model-id + quant + backend + adapter.vendor/device + driver version when available.
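A minimal sketch of the metadata side using raw IndexedDB, with a composite version key along the lines described above (the database and store names here are arbitrary):

```ts
// Tiny IndexedDB helper for stable artifacts (tokenizer, tensor metadata,
// pipeline descriptors), keyed by a composite version string.
function cacheKey(modelId: string, quant: string, backend: string, adapterInfo?: string) {
  return [modelId, quant, backend, adapterInfo ?? 'unknown'].join('|');
}

function openMetaDB(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('llm-meta', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('meta');
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

async function putMeta(key: string, value: unknown): Promise<void> {
  const db = await openMetaDB();
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction('meta', 'readwrite');
    tx.objectStore('meta').put(value, key);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

// Example: putMeta(cacheKey('llama3.1-8b', 'q4_k_m', 'webgpu'), { tokenizerVersion: 3 });
```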
Offline/online fallback strategy
Your app should function fully offline. But if device capability is insufficient or the user opts in, fall back to a server model.
- Detect failure early (OOM during prefill, allocation errors) and present a choice: try smaller on-device model, reduce context, or use cloud.
- Maintain identical API surfaces for on-device and server responses so the UI doesn’t care.
Example switch:
```ts
async function respond(prompt: string, plan: Plan) {
  try {
    if (plan.backend === 'server') return await callServer(prompt);
    return await localGenerate(prompt, plan);
  } catch (e) {
    // If offline, degrade gracefully
    if (!navigator.onLine) return { text: offlineSorryMessage() };
    // Otherwise fall back to the server
    return await callServer(prompt);
  }
}
```
Safety guardrails entirely on-device
Security and safety must not require a round trip. Combine fast, deterministic filters with a compact classifier.
Layers:
- Deterministic filters
- Regex/heuristics for PII: emails, phone numbers, SSNs, credit cards.
- Hard blocklists for disallowed content according to your policy.
- Prompt/response context rules: reject jailbreak patterns, ensure system prompts aren’t leaked.
- Lightweight on-device classifier
- Toxicity/NSFW classifier as an ONNX model quantized to INT8/INT4, run with ONNX Runtime Web on WebGPU/WebNN.
- For speed, use small architectures (e.g., TinyBERT, MobileBERT classifiers) at 5–20 MB.
- Self-check pass
- After generation, run a second pass with the same LLM using a short safety prompt to evaluate the response and possibly redact.
Example PII filter:
```ts
const rePII = [
  /\b\d{3}-\d{2}-\d{4}\b/g,                                          // US SSN
  /\b\d{13,19}\b/g,                                                  // CC-like digit runs (very rough)
  /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi,                         // email
  /\b\+?\d{1,3}?[\s.-]?\(?\d{1,4}\)?[\s.-]?\d{1,4}[\s.-]?\d{1,9}\b/g // phone-like
];

export function redactPII(s: string): string {
  return rePII.reduce((acc, r) => acc.replace(r, '[REDACTED]'), s);
}
```
Classify with ONNX Runtime Web:
```ts
import * as ort from 'onnxruntime-web';

async function loadSafetyModel() {
  // Threads apply to the wasm backend; cap them to avoid oversubscription.
  ort.env.wasm.numThreads = Math.min(navigator.hardwareConcurrency || 4, 4);
  // Prefer the WebGPU execution provider and fall back to wasm automatically.
  const session = await ort.InferenceSession.create('/safety/tinybert-int8.onnx', {
    executionProviders: ['webgpu', 'wasm'],
  });
  return session;
}
```
Policy decision:
- If deterministic filters trigger, redact or block.
- If classifier score exceeds threshold, block or require user confirmation.
- Log decisions locally for transparency; do not send anywhere by default.
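The self-check pass described above can be a thin wrapper around the same local generation entry point. A sketch, assuming a localGenerate(prompt, plan) function like the one in the fallback example and a policy prompt of your own design:

```ts
import type { Plan } from './capability';

// Hypothetical signature matching the fallback example's localGenerate.
declare function localGenerate(prompt: string, plan: Plan): Promise<{ text: string }>;

async function selfCheck(output: string, plan: Plan): Promise<'allow' | 'block'> {
  const prompt =
    'You are a safety checker. Answer with exactly ALLOW or BLOCK.\n' +
    'Does the following text violate the content policy?\n---\n' + output + '\n---';
  // Keep the second pass cheap: cap the context used for the check.
  const result = await localGenerate(prompt, { ...plan, maxCtx: Math.min(plan.maxCtx, 1024) });
  return /\bBLOCK\b/i.test(result.text) ? 'block' : 'allow';
}
```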
WebGPU implementation notes that bite in production
- Adapter selection: requestAdapter with powerPreference: 'high-performance' for desktops with dGPU.
- Buffer alignment: abide by minUniformBufferOffsetAlignment and minStorageBufferOffsetAlignment.
- Command submission: batch decode steps but keep per-token latency low.
- Memory fragmentation: create large, reusable buffers for activations; implement a simple allocator to sub-allocate.
- Shader specialization: bake in dimensions as constants for better compiler optimization.
- Avoid mapping big buffers every token; use queue.writeBuffer for small updates and persistent storage/readonly mappings for weights.
- Safari quirks: test on Safari 17+/iOS; some limits differ, and timestamp queries may not be available.
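For the sub-allocation note above, even a simple bump allocator over one large buffer goes a long way. A sketch, assuming you reset it once per decode step:

```ts
// Minimal bump sub-allocator over one large GPUBuffer, to avoid per-token
// allocations and fragmentation.
class GpuArena {
  private offset = 0;
  // align defaults to 256, the usual minStorageBufferOffsetAlignment.
  constructor(readonly buffer: GPUBuffer, readonly size: number, readonly align = 256) {}

  alloc(bytes: number): { buffer: GPUBuffer; offset: number; size: number } {
    const aligned = Math.ceil(this.offset / this.align) * this.align;
    if (aligned + bytes > this.size) throw new Error('GPU arena exhausted');
    this.offset = aligned + bytes;
    return { buffer: this.buffer, offset: aligned, size: bytes };
  }

  reset() {
    this.offset = 0; // call once per decode step
  }
}
```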
Example: end-to-end minimal local generation flow
Using a WebGPU-backed runtime (e.g., WebLLM or llama.cpp-wasm) with a worker and streaming UI.
```ts
// app.ts
import { planCapabilities } from './capability';

const plan = await planCapabilities();
const worker = new Worker('/inference.worker.js', { type: 'module' });
worker.postMessage({ type: 'init', payload: { plan } });

export function generate(prompt: string, onToken: (t: string) => void) {
  return new Promise((resolve, reject) => {
    const onMsg = (e: MessageEvent) => {
      const { type, data } = e.data || {};
      if (type === 'token') onToken(data.text);
      if (type === 'done') { cleanup(); resolve(data); }
      if (type === 'error') { cleanup(); reject(data); }
    };
    const cleanup = () => worker.removeEventListener('message', onMsg);
    worker.addEventListener('message', onMsg);
    // Note: AbortSignal is not structured-cloneable; send an explicit cancel message instead.
    worker.postMessage({ type: 'generate', payload: { prompt } });
  });
}

export function cancel() {
  worker.postMessage({ type: 'cancel' });
}
```
Worker side (pseudo using llama.cpp-wasm API style):
```ts
// inference.worker.js (pseudo, llama.cpp-wasm API style)
import { LlamaModel } from 'llama.cpp-web';

let model;

self.onmessage = async (e) => {
  const { type, payload } = e.data;

  if (type === 'init') {
    const { plan } = payload;
    model = await LlamaModel.load({
      urls: plan.model === 'llama3.1-8b' ? LLAMA_8B_Q4KM_URLS : PHI3_Q4_URLS,
      backend: plan.backend, // 'webgpu' | 'wasm'
      kvQuant: plan.useKVCacheQuant ? 'Q8' : 'F16',
      maxCtx: plan.maxCtx,
      threads: plan.threads,
    });
    postMessage({ type: 'ready' });
  }

  if (type === 'generate') {
    try {
      const stream = model.generate({ prompt: payload.prompt, temperature: 0.7, top_p: 0.9 });
      for await (const token of stream) {
        postMessage({ type: 'token', data: { text: token.text } });
      }
      postMessage({ type: 'done', data: {} });
    } catch (err) {
      postMessage({ type: 'error', data: String(err) });
    }
  }
};
```
Add a safety pass on the main thread just before displaying tokens:
```ts
function safeAppend(token: string) {
  const redacted = redactPII(token);
  // Optionally buffer and run the classifier every N characters
  appendToUI(redacted);
}
```
Energy and thermal considerations
- Back off when on battery: detect navigator.getBattery if available; slow decode rate or reduce threads.
- Use requestIdleCallback for non-critical work (pre-compiling pipelines, warming caches).
- Provide a visible toggle for "Battery saver" that caps context length and reduces TPS.
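A sketch of the battery-saver wiring; warmUpPipelines is a hypothetical placeholder for your own warm-up work, and getBattery is unavailable in some browsers, so the flag simply stays false there:

```ts
// Hypothetical warm-up entry point (assumption, not a real API).
declare function warmUpPipelines(opts: { reducedThreads: boolean }): void;

let batterySaver = false;

// Drive the flag from battery events where getBattery exists.
(navigator as any).getBattery?.().then((battery: any) => {
  const update = () => { batterySaver = !battery.charging && battery.level < 0.4; };
  update();
  battery.addEventListener('chargingchange', update);
  battery.addEventListener('levelchange', update);
});

// Defer non-critical warm-up (pipeline compilation, cache priming) to idle time.
requestIdleCallback(() => warmUpPipelines({ reducedThreads: batterySaver }), { timeout: 5000 });
```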
Testing and benchmarking
- Measure prefill throughput (tokens/s-equivalent) and decode TPS separately.
- Use PerformanceObserver to attribute jank on main thread.
- On WebGPU, timestamp queries across kernels; on WASM, coarse timers are sufficient.
- Test on a matrix: Windows iGPU (Iris Xe), MacBook Air (M-series), AMD dGPU, iOS Safari, Android Chrome mid-range.
Minimal TPS tracker:
```ts
class TPSTracker {
  private times: number[] = [];

  tick() {
    this.times.push(performance.now());
    if (this.times.length > 1000) this.times.shift();
  }

  tps() {
    if (this.times.length < 2) return 0;
    const dt = this.times[this.times.length - 1] - this.times[0];
    const tokens = this.times.length - 1;
    return (tokens / dt) * 1000;
  }
}
```
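To attribute main-thread jank, a long-task observer complements the TPS tracker; longtask entries only appear for tasks over 50 ms, and the extra threshold here is an arbitrary choice:

```ts
// Log long main-thread tasks that would stall UI updates during decoding.
const longTasks = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.duration > 100) {
      console.warn('Long task on main thread:', Math.round(entry.duration), 'ms');
    }
  }
});
longTasks.observe({ type: 'longtask', buffered: true });
```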
Browser coverage in 2025: reality check
- Chrome/Edge: WebGPU stable on desktop and Android; WebNN available behind a flag or origin trial in some channels; ORT WebGPU EP works well.
- Safari: WebGPU on macOS Sonoma+ and iOS 17/18 with limitations; test memory limits carefully.
- Firefox: WebGPU shipping with ongoing improvements; performance can differ across platforms.
Plan for variability: keep a WASM path for correctness, and allow server fallback as last resort.
Security and privacy
- Keep all model files and intermediate data local. Don’t send prompts/outputs by default.
- If offering cloud fallback, make it opt-in and clearly indicated.
- Allow users to clear local model/cache; respect private browsing mode (disable large caches).
- Signed model manifests with hashes; verify before using to prevent tampering.
Hash verification example:
```ts
async function verifyDigest(resp: Response, expectedHex: string) {
  const buf = await resp.clone().arrayBuffer();
  const digest = await crypto.subtle.digest('SHA-256', buf);
  const hex = [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, '0')).join('');
  if (hex !== expectedHex) throw new Error('Model shard hash mismatch');
  return resp;
}
```
Putting it together: a production blueprint
- Capability plan
  - Detect WebGPU/WebNN/WASM and adapter limits
  - Choose model size, quant, max context, threads
- Storage and caching
  - Request persistent storage
  - Service Worker caches shards with Range support and hash verification
  - Cache tokenizer and metadata in IndexedDB
- Initialization
  - Spin up a Worker with the selected backend
  - Warm-up: compile pipelines, allocate buffers, load tokenizer
- Generation
  - Stream user input tokenization
  - Prefill, then decode with reusable command buffers
  - Emit tokens to the UI progressively, applying safety redaction
- Safety
  - Run deterministic filters; gate high-risk content
  - Optional small classifier for toxicity/NSFW
  - Self-check with the same model on final output if policy requires
- Fallbacks
  - Downshift model/ctx on perf issues or OOM
  - If offline and failing, show a helpful message; if online, offer server inference
- Observability (local)
  - Record TPS, OOMs, and backend choice to local storage for next-session planning
  - Provide a diagnostics panel for users
Common pitfalls and how to avoid them
- OOM during prefill: reduce context, quantize KV cache, or pick smaller model. Allocate KV lazily with growable windows.
- Long cold starts: shard models, parallelize fetches, prewarm pipelines, persist caches across sessions.
- Jank on main thread: move tokenization and all inference to workers; throttle DOM updates.
- Inconsistent tokenization: ensure tokenizer version matches model; cache tokenizer; write tests for round-trip encoding.
- Excessive storage eviction: request persistent storage; reduce shard size; allow user to pin models.
- Poor mobile perf: cap to 1–2k context, smaller models, and leverage WebNN where it shines.
Opinionated take: what to ship in 2025
- Default model: 3–4B instruct, q4_K_M, 2–3k context. It’s the sweet spot across devices.
- Advanced toggle: 7–8B instruct, q4_K_M, 4k context for desktops with dGPU or M-series.
- Backend priority: WebGPU > WebNN (when NPU helps) > WASM.
- Safety: deterministic filters + tiny classifier; no server dependency.
- UX: instant token stream within 500 ms on cold start for cached models; 2–6 TPS decode typical; clear battery-saver mode.
This setup gives you a private, fast, and maintainable on-device AI that scales.
Further references and ecosystems to watch
- llama.cpp and GGUF ecosystem: rich model zoo and continuous kernel improvements.
- MLC/WebLLM: compiler-driven approach for WebGPU with strong results.
- ONNX Runtime Web: WebGPU/WebNN EPs, great for auxiliary models.
- WebNN spec and implementations: track release notes for stability and performance on mobile.
- Tokenizers-wasm projects: fast BPE/unigram with SIMD.
Closing
Shipping private, offline AI in 2025 is a solved engineering problem when you combine the right primitives: WebGPU/WebNN for compute, WASM for glue and fallback, GGUF INT4 for compact weights, streaming loaders and caches for UX, workers for responsiveness, and on-device safety guardrails. The challenge is no longer “is it possible?” but “can you design a robust, adaptive runtime that treats every device with respect?” With the patterns above, you can.