PWA + Local AI: Shipping an Offline Assistant for Android and iOS with TypeScript

2026-02-24

Build a TypeScript PWA that runs a constrained LLM offline on Pixel and iPhone — handling downloads, storage, WebGPU acceleration, and cross-platform fallbacks.

Why shipping a true offline assistant on mobile still hurts — and how to fix it with TypeScript

You want an assistant that works when the network is gone, your data stays private on-device, and responses arrive fast without a server bill. But building that as a PWA that runs a constrained LLM locally on both Pixel and iPhone hits hard real-world problems: model size and storage, browser GPU support differences, service worker and storage quirks, battery and thermal limits, and reliable cross-platform fallbacks.

This guide (2026 edition) walks you through a production-minded path: a TypeScript-first Progressive Web App that downloads a quantized, constrained LLM, stores it efficiently, and runs inference with GPU acceleration when available — with robust fallbacks for Android and iOS quirks. Expect runnable patterns, code snippets, and ops-level decisions you can use in a real product.

Overview: The architecture in one picture

At a high level the PWA local-AI assistant has three layers:

  • App shell / PWA: manifest, service worker, UI, offline caching.
  • Model storage & lifecycle: progressive download, verification, persistent storage (IndexedDB / File System Access / persisted storage).
  • Execution engine: WebGPU or WebNN-backed runtime (ONNX Runtime Web, TensorFlow.js fallback, WASM fallback) running inside a Web Worker for responsiveness.

What changed by 2026 to make this practical:

  • WebGPU matured in Chrome (Android) and is widely available on Pixel devices' Chrome builds; Safari has shipped more WebGPU and WebNN support across recent iOS updates, but behavior still varies — feature-detect at runtime.
  • ONNX Runtime Web and WASM-based runtimes gained quantized operators and WebGPU backends in late 2025; they are stable choices for browser ML in 2026.
  • Local-AI mobile browsers (Puma and others) raised user expectations: privacy-first local inference is now a user-visible differentiator.

Constraints and realistic goals

You will not run a 70B model on a phone. Aim for constrained models optimized for mobile: 2.7B, 4B, or 7B quantized variants, or specially-built small assistants (LLM distilled into 100–400MB quantized binaries). These are viable offline targets in 2026 for many flagship and mid-range phones.

Constraints to respect:

  • Storage: model blobs are large; plan progressive download and verify checksums.
  • Memory & thermal: inference must be batched and rate-limited; long contexts will be expensive.
  • Browser and OS storage policies: iOS may evict data when low on disk; request persistent storage where possible.

Recommended path in 2026 for best cross-platform coverage:

  1. Primary: ONNX Runtime Web with WebGPU backend — best performance on Chrome Android / Pixel where WebGPU is solid.
  2. Alternative: WebNN-backed runtime or TF.js with WebNN in Safari (where available) to utilize Metal / Apple Neural Engine improvements.
  3. Fallback: WASM SIMD-optimized runtime (ORT WASM or ggml-wasm) — slower but works everywhere.

Feature-detection and backend selection (TypeScript)

Always feature-detect and pick the best available backend at runtime. Example TypeScript routine below shows how to detect WebGPU, WebNN, and fall back to WASM.

// backend-detect.ts
export async function chooseBackend(): Promise<'webgpu' | 'webnn' | 'wasm'> {
  // WebGPU: the property can exist without a usable adapter, so request one.
  if (typeof navigator !== 'undefined' && 'gpu' in navigator) {
    const adapter = await (navigator as any).gpu.requestAdapter();
    if (adapter) return 'webgpu';
  }

  // WebNN is exposed as navigator.ml (availability still varies by browser and flags)
  if (typeof navigator !== 'undefined' && 'ml' in navigator) return 'webnn';

  return 'wasm';
}

Progressive model delivery and storage

Downloading 200+MB models in one go is fragile on mobile. Use chunked downloads, write directly into IndexedDB as you stream, and keep a checksum to verify integrity. Also use the File System Access API on Chrome where available for faster writes (and for developers, the native file picker helps debug model files).

Key APIs and techniques:

  • IndexedDB streaming writes: store binary shards as Blob or ArrayBuffer items.
  • Background download strategies: where Background Fetch exists (still experimental), it helps; otherwise implement in-app chunked downloader with resume support.
  • Request persistent quota: navigator.storage.persist() to reduce risk of eviction (note: behavior differs on iOS).
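The last two bullets combine naturally into a single pre-download check. A minimal sketch, assuming an illustrative model size and a 10% headroom factor (neither is a browser requirement):

```typescript
// storage-check.ts: pre-download storage check. The default size and the
// 10% headroom factor are illustrative assumptions, not browser requirements.
const DEFAULT_MODEL_BYTES = 250 * 1024 * 1024; // ~250 MB

// Pure helper: is there room for the download, with headroom so the OS is
// less likely to evict the model later?
export function hasRoomFor(bytesNeeded: number, usage: number, quota: number): boolean {
  return quota - usage >= bytesNeeded * 1.1;
}

export async function ensureStorage(bytesNeeded = DEFAULT_MODEL_BYTES): Promise<boolean> {
  if (typeof navigator === 'undefined' || !navigator.storage) return false;
  // Best-effort: browsers may deny persistence silently, and iOS behavior differs.
  const persisted = await navigator.storage.persist();
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  return persisted && hasRoomFor(bytesNeeded, usage, quota);
}
```

Run `ensureStorage()` before showing the download-consent dialog, so the user sees a clear "not enough space" message instead of a failed download.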

TypeScript: chunked download + IndexedDB save (simplified)

// downloader.ts (uses idb-keyval or small wrapper)
import { openDB } from 'idb';

async function getDB() {
  return openDB('local-ai-db', 1, {
    upgrade(db) { db.createObjectStore('models'); }
  });
}

export async function downloadModelShard(url: string, key: string, onProgress?: (p: number) => void) {
  const resp = await fetch(url);
  if (!resp.ok) throw new Error(`Download failed: ${resp.status}`);
  if (!resp.body) throw new Error('Streaming fetch not supported');

  const db = await getDB();
  const reader = resp.body.getReader();
  const chunks: Uint8Array[] = [];
  let received = 0;
  const contentLength = Number(resp.headers.get('Content-Length')) || 0;

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    received += value.length;
    if (onProgress && contentLength) onProgress(received / contentLength);
  }

  // Concatenate into one buffer. (This holds the whole shard in memory;
  // for very large shards, write sub-chunks to IndexedDB as they arrive.)
  const total = new Uint8Array(chunks.reduce((s, c) => s + c.length, 0));
  let offset = 0;
  for (const c of chunks) { total.set(c, offset); offset += c.length; }

  await db.put('models', total.buffer, key);
}

Model formats and conversion

Deliver models in formats your runtime supports: ONNX for ONNX Runtime Web, or a WASM/ggml blob for wasm runtimes. Offer quantized variants (int8/4/2) to reduce size. Provide a small index file with metadata: model size, ops used, quantization bits, and recommended runtime.
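There is no standard shape for that index file; one illustrative TypeScript layout (all field names are assumptions), plus a variant picker that matches the detected backend:

```typescript
// model-manifest.ts: one possible shape for the metadata index file.
// All field names here are illustrative assumptions, not a standard format.
export interface ModelManifest {
  name: string;
  sizeBytes: number;
  quantBits: 2 | 4 | 8;
  format: 'onnx' | 'ggml';
  recommendedBackend: 'webgpu' | 'webnn' | 'wasm';
  shards: { url: string; sha256: string }[];
}

// Prefer the smallest variant recommended for the detected backend;
// otherwise fall back to the smallest variant overall.
export function pickVariant(
  variants: ModelManifest[],
  backend: ModelManifest['recommendedBackend']
): ModelManifest | undefined {
  const bySize = [...variants].sort((a, b) => a.sizeBytes - b.sizeBytes);
  return bySize.find((v) => v.recommendedBackend === backend) ?? bySize[0];
}
```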

Running inference: worker + WebGPU + streaming responses

Do inference inside a WebWorker to keep the UI responsive. Use the chosen runtime to initialize device and allocate buffers. Stream tokens back to the main thread so the user sees progressive replies.

Worker bootstrap (TypeScript sketch)

// ai-worker.ts
import { chooseBackend } from './backend-detect';

self.onmessage = async (e) => {
  const { type, payload } = e.data;
  if (type === 'init') {
    const backend = await chooseBackend();
    // initialize runtime (ORT Web, TF.js, or WASM) based on backend
    // load model from IndexedDB, allocate buffers
    postMessage({ type: 'ready', backend });
  }

  if (type === 'generate') {
    const { prompt } = payload;
    // run model.generate(...) or step inference loop
    // For each new token: postMessage({ type: 'token', token })
    // At end: postMessage({ type: 'done' });
  }
};
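On the main thread, the worker's messages need to be folded into progressively rendered text. A small sketch of that consumer side, mirroring the 'token'/'done' message names used above (the class name and callback shape are illustrative):

```typescript
// assistant-client.ts: main-thread consumer for the worker protocol above.
export class AssistantClient {
  private buffer = '';

  constructor(private onUpdate: (text: string) => void) {}

  // Feed one worker message; returns true once the reply is complete.
  handleMessage(msg: { type: string; token?: string }): boolean {
    if (msg.type === 'token' && msg.token) {
      this.buffer += msg.token;
      this.onUpdate(this.buffer); // progressive render of the partial reply
    }
    return msg.type === 'done';
  }

  reset(): void {
    this.buffer = '';
  }
}

// Browser wiring (sketch):
//   const worker = new Worker(new URL('./ai-worker.ts', import.meta.url), { type: 'module' });
//   const client = new AssistantClient((text) => { outputEl.textContent = text; });
//   worker.onmessage = (e) => { if (client.handleMessage(e.data)) client.reset(); };
//   worker.postMessage({ type: 'generate', payload: { prompt } });
```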

Cross-platform caveats and remedies

iOS quirks (Safari)

  • Service worker lifecycle and background execution are limited. Don’t rely on background fetch for model downloads — provide in-app download UX with progress and resume.
  • Storage can be reclaimed by the OS; call navigator.storage.persist() and request the user to allow model storage where possible.
  • SharedArrayBuffer and some cross-origin policies may be restricted; prefer message-passing between worker and main thread.

Android / Pixel notes

  • Pixel devices with recent Chrome builds provide robust WebGPU backed by Vulkan; expect best performance there.
  • File System Access API is available on Chrome/Android; allow optional export/import of model files to/from device storage for advanced users.

Performance strategies

  • Quantization: use int8 / int4 models to shrink memory and compute.
  • Shorter contexts: keep a sliding window for context to reduce memory growth.
  • Token streaming: produce tokens incrementally and render them to the UI to improve perceived latency.
  • Throttle inference: pace generation steps and allow idle gaps to avoid thermal throttling.
  • Batch I/O: reuse buffers and minimize reallocations when possible.
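The sliding-window idea from the list above can be made concrete. A minimal sketch; the `Turn` shape and the token budget are illustrative, and real token counts come from your tokenizer:

```typescript
// context-window.ts: sliding-window sketch over the chat history.
export interface Turn {
  role: 'user' | 'assistant';
  tokens: number; // token count of this turn's text
  text: string;
}

// Walk backwards from the newest turn, keeping turns until the budget is
// spent. The newest turn is always kept, even if it alone exceeds the budget.
export function slideWindow(history: Turn[], maxTokens: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const turn = history[i];
    if (used + turn.tokens > maxTokens && kept.length > 0) break;
    kept.unshift(turn);
    used += turn.tokens;
  }
  return kept;
}
```

Keeping the newest turn unconditionally is a deliberate choice: dropping the user's latest message would make the model answer a question it never saw.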

Security and privacy best practices

  • All model files and user data should remain in client-only storage by default.
  • Use HTTPS and integrity checks for model bits; verify checksums before running to avoid tampered models.
  • Provide explicit UI to opt-in to model downloads and to clear local models/data.

UX patterns for offline assistants

A great UX acknowledges constraints and guides users: show required disk space before download, show expected RAM usage, offer low-power or low-memory mode, and give transparent toggles to delete models or switch to cloud inference when available.

Example: progressive-download UX flow

  1. App scaffold loads instantly via service worker cached shell.
  2. On first use, app checks available storage and asks permission to download the 'Local Assistant' (~X MB).
  3. Download runs in-app, shows progress, and verifies checksum.
  4. After install, assistant runs in offline mode; user can toggle 'Cloud mode' to use larger remote models.

Concrete example: wiring ONNX Runtime Web with WebGPU (high-level)

ONNX Runtime Web provides an API to choose the WebGPU backend. The snippet below is a simplified view showing initialization and running a single inference — in production you'd implement token-by-token loops and memory management.

// ort-init.ts (conceptual)
import * as ort from 'onnxruntime-web';

export async function initORT(modelArrayBuffer: ArrayBuffer) {
  // Choose webgpu if available, fall back to wasm
  const backend = (typeof navigator !== 'undefined' && 'gpu' in navigator) ? 'webgpu' : 'wasm';
  // If the .wasm binaries are not co-located with your bundle, point the
  // runtime at them (path is illustrative):
  // ort.env.wasm.wasmPaths = '/ort/';

  const sessionOptions: ort.InferenceSession.SessionOptions = {
    executionProviders: backend === 'webgpu' ? ['webgpu'] : ['wasm']
  };

  const session = await ort.InferenceSession.create(modelArrayBuffer, sessionOptions);
  return session;
}

export async function runOnce(session: ort.InferenceSession, inputs: Record<string, ort.Tensor>) {
  const output = await session.run(inputs);
  return output;
}
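A token-by-token loop on top of runOnce might look like the sketch below. The `runStep` callback stands in for the model-specific tensor plumbing (building the input-ids tensor, slicing the final logits row out of the output), and greedy argmax sampling is used for simplicity; both are assumptions, not the only way to wire ORT Web:

```typescript
// generate-loop.ts: greedy decoding sketch. runStep hides the model-specific
// tensor plumbing so the loop itself stays runtime-agnostic.

// Pure helper: index of the largest logit.
export function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

export async function* generate(
  runStep: (ids: number[]) => Promise<Float32Array>, // e.g. wraps runOnce
  promptIds: number[],
  maxNewTokens = 64,
  eosId = 0
): AsyncGenerator<number> {
  const ids = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = await runStep(ids); // logits for the last position only
    const next = argmax(logits);
    if (next === eosId) return;        // end-of-sequence token: stop
    ids.push(next);
    yield next;                        // stream to the UI (postMessage in a worker)
  }
}
```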

Testing, monitoring, and analytics (privacy-first)

Measure cold-start times, token latency, memory use, crash rates, and storage errors. Keep telemetry opt-in and anonymized. Instrument both Android and iOS browsers; device-specific edge cases (like iOS eviction) generally show up in long-tail telemetry.

Case study: shipping a 4B quantized assistant to Pixel & iPhone (brief)

We shipped a 4B quantized assistant as a PWA beta to internal testers on Pixel 8/9 and iPhone 14/15 in late 2025. Lessons learned:

  • Progressive download + user consent solved flaky network issues during install.
  • WebGPU on Pixel reduced token latency by ~3x vs WASM; on iPhone WebNN delivered decent speed but required specific op fallbacks.
  • Telemetry showed iOS devices occasionally lost model files during low-disk scenarios — navigator.storage.persist() helped, but the UX should detect missing files and offer a quick re-download.

Developer checklist: what to implement before shipping

  • Manifest and service worker (PWA installable, offline app shell).
  • Backend selection with safe fallbacks and instrumentation.
  • Chunked model downloader with resume & checksum.
  • Persistent storage request and eviction handling UI.
  • WebWorker-based inference with streaming tokens to UI.
  • Battery and thermal mitigation (throttle inference, low-power modes).
  • Clear privacy policy and a manual model/data deletion UI.

Advanced tips (for peak performance)

  • Precompile kernels for WebGPU on first run to reduce warmup jitter.
  • Use quantized operator kernels optimized by the runtime (ORT Web or vendor-specific).
  • Test with real workloads: real chat histories, not synthetic inputs.
  • Consider microservices: keep a tiny cloud fallback for long-tail completions or heavy contexts.

Troubleshooting matrix (quick)

  • No WebGPU? Fall back to WASM, ensure SIMD/threads are enabled in your bundler.
  • Download stops on iOS? Use smaller shards and resume logic; show user a retry button.
  • Model verification fails? Keep a manifest server and allow the client to redownload faulty shards.

Actionable takeaways

  • Feature-detect early and pick WebGPU / WebNN / WASM in that order.
  • Progressive, resumable downloads are mandatory for large models on mobile.
  • Store models carefully (IndexedDB, File System Access where available) and call navigator.storage.persist().
  • Run inference in a worker and stream tokens for better UX.
  • Quantize aggressively to fit mobile RAM and improve latency.

Further reading & tools (2026)

  • ONNX Runtime Web docs (WebGPU backend details, late-2025 releases)
  • WebGPU specification and browser compatibility pages (check current status)
  • WebNN API drafts and their implementations across vendors
  • ggml/llama.cpp wasm ports and community quantization tools

Shipping a local assistant in 2026 is achievable — but it’s a systems problem as much as ML. Ship small, test wide, and design for graceful degradation.

Next steps (practical)

Start with a minimal PWA + worker template and a 100–300MB quantized model. Benchmark on a Pixel and an iPhone as early as possible. Use ONNX Runtime Web for a quick WebGPU-backed path and keep a WASM fallback for compatibility.

Call to action

Ready to build? Grab the starter repo (TypeScript PWA + worker + model downloader) and run it on your Pixel or iPhone today. Share your benchmarks and device quirks back to the community — every real-world report helps make local AI on mobile better for everyone.
