Build a Local LLM-Powered Browser Feature with TypeScript (no server required)
#local-ai #webassembly #pwa

2026-02-22
12 min read

Build a privacy-first on-device browser AI using TypeScript, WebAssembly, and WebGPU. Step-by-step PWA guide to run local LLMs in mobile browsers.

Why your mobile web app needs a local LLM in 2026

Developers building mobile browser experiences today wrestle with three recurring pain points: privacy (user data leaving the device), latency (network round-trips for inference), and reliability (offline or poor connectivity). If you’re a TypeScript engineer shipping PWAs and mobile web features, the solution many teams are adopting in 2025–2026 is to run LLM inference directly in the browser using WebAssembly and WebGPU. This article is a hands-on, step-by-step guide to implementing a privacy-first, on-device AI assistant in a mobile browser, no server required, inspired by the local-AI approach popularized by browsers like Puma.

The state of on-device browser AI in 2026

By early 2026, WebGPU and WASM support has matured across major mobile browsers, enabling feasible, accelerated on-device ML. Open-source runtimes (WASM builds of ggml/llama.cpp-family projects, tokenizer WASM modules) and quantized models have drastically reduced memory and compute requirements. Browser vendors have relaxed policies and standardized APIs around compute and storage, and PWAs are the de-facto delivery method for installing local AI features on phones.

Why this matters for TypeScript devs

  • Ship features that keep user data on-device to meet privacy regulations and user expectations.
  • Remove network dependencies and reduce latency for conversational UX.
  • Reuse your TypeScript expertise to build the UI, glue code, and tooling that orchestrates WASM/WebGPU runtimes.

Architecture overview: components you’ll build

High-level components for a local LLM-powered browser feature:

  1. UI written in TypeScript/React (or vanilla TS) that streams tokens to the user.
  2. WASM-based LLM runtime (e.g., a wasm-compiled ggml/llama-like runtime) that performs inference.
  3. WebGPU compute paths to accelerate matrix ops when available.
  4. Tokenizer (WASM or JS) to map text to tokens and back.
  5. Storage layer using IndexedDB (or Cache API) to store model shards, tokenizer files, and user preferences.
  6. PWA and service worker for installability and offline operation.

Step 0 — Project scaffold (TypeScript + Vite)

Use Vite for fast iteration and small bundles. Below is a minimal package.json and TypeScript configuration tuned for WASM and WebGPU development.

{
  "name": "local-llm-pwa",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "idb": "^7.0.0"
  },
  "devDependencies": {
    "vite": "^5.0.0",
    "typescript": "^5.5.0",
    "vite-plugin-pwa": "^0.15.0"
  }
}
// tsconfig.json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "bundler",
    "lib": ["ES2022", "DOM", "DOM.Iterable"],
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "resolveJsonModule": true,
    "types": []
  },
  "include": ["src"]
}

Step 1 — Loading WASM LLM runtimes in TypeScript

You’ll typically have a WASM module that implements a runtime API. We recommend lazy-loading the runtime and storing the model blobs in IndexedDB so the first load is quick and subsequent runs are offline.

// src/wasm/loader.ts
export async function loadRuntime(wasmUrl: string) {
  const resp = await fetch(wasmUrl);
  const bytes = await resp.arrayBuffer();
  // instantiateStreaming would be ideal, but fetch+instantiate is more flexible
  const { instance } = await WebAssembly.instantiate(bytes, {
    env: {
      // provide any imports your runtime expects
      memory: new WebAssembly.Memory({ initial: 256 }),
      abort: () => { throw new Error('wasm abort'); }
    }
  } as any);
  return instance.exports as any;
}
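
The loader above refetches the binary on every launch. A minimal sketch of fronting that fetch with the Cache API so repeat launches work offline (the cache name is an arbitrary choice); pass the returned bytes to WebAssembly.instantiate exactly as in loadRuntime:

// src/wasm/cachedLoader.ts (sketch): serve the runtime binary from the Cache API when available
export async function fetchWasmCached(wasmUrl: string): Promise<ArrayBuffer> {
  const cache = await caches.open('llm-runtime-v1');
  let resp = await cache.match(wasmUrl);
  if (!resp) {
    resp = await fetch(wasmUrl);
    if (!resp.ok) throw new Error(`wasm fetch failed: ${resp.status}`);
    await cache.put(wasmUrl, resp.clone());
  }
  return resp.arrayBuffer();
}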

Type declarations for WASM

Tell TypeScript about WASM imports to avoid type errors.

// src/types/wasm.d.ts
declare module "*.wasm" {
  const value: any;
  export default value;
}

Step 2 — WebGPU initialization and a sample compute pass

WebGPU is the performance layer. The first task is to request an adapter and device and compile a compute shader for an operation like fused matrix multiply. Below is a minimal initialization snippet in TypeScript.

// src/webgpu/init.ts
export async function initWebGPU() {
  if (!('gpu' in navigator)) throw new Error('WebGPU not supported');
  const adapter = await (navigator as any).gpu.requestAdapter();
  if (!adapter) throw new Error('No WebGPU adapter available');
  const device = await adapter.requestDevice();
  return { adapter, device };
}

// minimal compute example: adds two float32 arrays on GPU
export async function vectorAdd(device: GPUDevice, a: Float32Array, b: Float32Array) {
  const n = a.length;
  const gpuBufferA = device.createBuffer({
    size: a.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
  });
  const gpuBufferB = device.createBuffer({
    size: b.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
  });
  const gpuBufferOut = device.createBuffer({
    size: a.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
  });
  device.queue.writeBuffer(gpuBufferA, 0, a.buffer);
  device.queue.writeBuffer(gpuBufferB, 0, b.buffer);

  // WGSL shader: out[i] = a[i] + b[i]; guard against threads past the end of the array
  const shaderCode = `
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> out : array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&out)) {
    out[i] = a[i] + b[i];
  }
}`;

  const module = device.createShaderModule({ code: shaderCode });
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module, entryPoint: 'main' }
  });

  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: gpuBufferA } },
      { binding: 1, resource: { buffer: gpuBufferB } },
      { binding: 2, resource: { buffer: gpuBufferOut } }
    ]
  });

  const cmd = device.createCommandEncoder();
  const pass = cmd.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(n / 64));
  pass.end();
  device.queue.submit([cmd.finish()]);

  // read back
  const readBuffer = device.createBuffer({
    size: a.byteLength,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
  });
  const copyCmd = device.createCommandEncoder();
  copyCmd.copyBufferToBuffer(gpuBufferOut, 0, readBuffer, 0, a.byteLength);
  device.queue.submit([copyCmd.finish()]);
  await readBuffer.mapAsync(GPUMapMode.READ);
  const copyArray = new Float32Array(readBuffer.getMappedRange().slice(0));
  readBuffer.unmap();
  return copyArray;
}

Use this pattern to implement kernel primitives (GEMM, softmax) in WebGPU and call them from your WASM runtime when a JS-native path is beneficial. Many WASM LLM runtimes expose hooks to provide custom GPU kernels.
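
A quick way to sanity-check the GPU path before wiring kernels into the runtime (the values are arbitrary, and the snippet assumes it runs from a module in src/ so top-level await is available):

// smoke test: should log [11, 22, 33]
import { initWebGPU, vectorAdd } from './webgpu/init';

const { device } = await initWebGPU();
const sum = await vectorAdd(device, new Float32Array([1, 2, 3]), new Float32Array([10, 20, 30]));
console.log(Array.from(sum));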

Step 3 — Tokenizers: run locally with WASM or JS

Tokenizers are small and fast; you can run them entirely in JS or use a WASM tokenizer for better perf. Use streaming tokenization for long context windows and avoid loading the full tokenizer until needed.

// simple wrapper: tokenizer.wasm provides `encode` and `decode` exports
// (the `?url` suffix makes Vite hand us the asset URL instead of trying to inline the module)
import tokenizerWasmUrl from '../models/tokenizer.wasm?url';
import { loadRuntime } from './wasm/loader';

export async function loadTokenizer() {
  const runtime = await loadRuntime(tokenizerWasmUrl);
  return {
    encode: (text: string): number[] => {
      // in a real build you must copy the string into WASM memory and read back the token ids
      return runtime.encode(text);
    },
    decode: (tokens: number[]): string => runtime.decode(tokens)
  };
}
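
A round-trip sanity check, assuming the wrapped exports behave as above:

const tokenizer = await loadTokenizer();
const ids = tokenizer.encode('Hello, on-device AI!');
console.log(ids, tokenizer.decode(ids)); // token ids, then the reconstructed text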

Step 4 — Model storage: download, shard, and persist (IndexedDB + Cache)

Model files will often be too large for a single fetch. Split models into shards and write them to IndexedDB using an appendable scheme. Use minimal metadata so users can choose models and sizes.

// src/storage/modelStore.ts
import { openDB } from 'idb';
const DB_NAME = 'local-llm';
const DB_VERSION = 1;

export async function openModelDB() {
  return openDB(DB_NAME, DB_VERSION, {
    upgrade(db) {
      db.createObjectStore('shards');
      db.createObjectStore('meta');
    }
  });
}

export async function storeShard(key: string, chunk: ArrayBuffer) {
  const db = await openModelDB();
  await db.put('shards', chunk, key);
}

export async function loadShard(key: string) {
  const db = await openModelDB();
  return db.get('shards', key);
}
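
As a sketch of the appendable scheme (the 8 MB shard size and the key naming are illustrative assumptions), stream the model with the Fetch API and persist fixed-size chunks as they arrive:

// src/storage/downloadModel.ts (sketch): stream a model file into IndexedDB shards
import { storeShard } from './modelStore';

const SHARD_BYTES = 8 * 1024 * 1024; // illustrative shard size; tune for your models

export async function downloadModel(modelId: string, url: string) {
  const resp = await fetch(url);
  if (!resp.ok || !resp.body) throw new Error(`model fetch failed: ${resp.status}`);
  const reader = resp.body.getReader();
  let pending: Uint8Array[] = [];
  let pendingBytes = 0;
  let shardIndex = 0;

  const flush = async () => {
    // concatenate buffered chunks into one shard and persist it
    const backing = new ArrayBuffer(pendingBytes);
    const shard = new Uint8Array(backing);
    let offset = 0;
    for (const chunk of pending) { shard.set(chunk, offset); offset += chunk.byteLength; }
    await storeShard(`${modelId}/shard-${shardIndex++}`, backing);
    pending = [];
    pendingBytes = 0;
  };

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    pending.push(value);
    pendingBytes += value.byteLength;
    if (pendingBytes >= SHARD_BYTES) await flush();
  }
  if (pendingBytes > 0) await flush();
  return shardIndex; // shard count; persist it in the 'meta' store so loading knows when to stop
}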

Encryption at rest (optional, privacy-first)

Use the Web Crypto API to encrypt model shards so that files on the device are protected if the device is compromised. Derive a key from a user passphrase or integrate platform keystore when available.

// derive an AES-GCM key from a user passphrase; generate the salt once with
// crypto.getRandomValues and persist it (e.g., in the 'meta' store) so the key is reproducible
async function deriveKey(passphrase: string, salt: Uint8Array) {
  const pwUtf8 = new TextEncoder().encode(passphrase);
  const pwKey = await crypto.subtle.importKey('raw', pwUtf8, 'PBKDF2', false, ['deriveKey']);
  return crypto.subtle.deriveKey(
    { name: 'PBKDF2', salt, iterations: 100000, hash: 'SHA-256' },
    pwKey,
    { name: 'AES-GCM', length: 256 },
    false,
    ['encrypt', 'decrypt']
  );
}
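
With the derived key in hand, a minimal encrypt/decrypt sketch for shards, assuming AES-GCM with a fresh 12-byte IV per shard stored alongside the ciphertext:

// encrypt a shard before persisting it; keep the IV next to the ciphertext (it is not secret)
export async function encryptShard(key: CryptoKey, shard: ArrayBuffer) {
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const ciphertext = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, shard);
  return { iv, ciphertext };
}

export async function decryptShard(key: CryptoKey, iv: Uint8Array, ciphertext: ArrayBuffer) {
  return crypto.subtle.decrypt({ name: 'AES-GCM', iv }, key, ciphertext);
}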

Step 5 — Running inference and streaming tokens

The runtime will expose a predict or step API. For good UX, stream tokens as they are produced and update the UI incrementally.

// pseudocode orchestrator: step the runtime and stream tokens to the UI
async function generatePrompt(prompt: string, runtime: any, tokenizer: any, onToken: (t: string) => void) {
  const tokens = tokenizer.encode(prompt);
  // push prompt tokens into the runtime's input buffer
  runtime.push_tokens(tokens);
  // run in small steps so tokens can be streamed as they are produced
  while (!runtime.isDone()) {
    const tokenId = await runtime.step(); // synchronous or async depending on runtime
    const text = tokenizer.decode([tokenId]);
    onToken(text);
    // yield to the event loop so the browser can paint the new token before the next step
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}
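
On the main thread even a yielding loop can make scrolling feel janky during long generations. A minimal sketch of moving generation into a module Web Worker (the file name, message shape, and the initRuntimeAndTokenizer helper are illustrative assumptions; it presumes the WASM runtime and tokenizer can be initialized inside a worker):

// --- src/workers/llm.worker.ts (sketch) ---
const ctx = self as any; // DedicatedWorkerGlobalScope; the cast keeps the sketch lib-agnostic

ctx.onmessage = async (e: MessageEvent<{ prompt: string }>) => {
  const { runtime, tokenizer } = await initRuntimeAndTokenizer(); // hypothetical: wraps the loaders above
  runtime.push_tokens(tokenizer.encode(e.data.prompt));
  while (!runtime.isDone()) {
    ctx.postMessage({ token: tokenizer.decode([runtime.step()]) });
  }
  ctx.postMessage({ done: true });
};

// --- main thread (excerpt): spawn the worker and append streamed tokens to the chat view ---
const worker = new Worker(new URL('./workers/llm.worker.ts', import.meta.url), { type: 'module' });
worker.onmessage = (e: MessageEvent<{ token?: string; done?: boolean }>) => {
  if (e.data.token) document.querySelector('#output')!.append(e.data.token);
};
worker.postMessage({ prompt: 'Summarize this page in two sentences.' });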

Step 6 — PWA, service worker, and installability

Make your local AI a PWA so users can install it, which provides a persistent environment and gives you control over update semantics. Use the Vite PWA plugin to generate a manifest and a service worker that caches shell files and optionally model metadata (NOT full shards — store those in IndexedDB).

// vite.config.ts (excerpt)
import { defineConfig } from 'vite';
import { VitePWA } from 'vite-plugin-pwa';

export default defineConfig({
  plugins: [
    VitePWA({
      registerType: 'autoUpdate',
      includeAssets: ['favicon.svg'],
      manifest: {
        name: 'Local LLM Assistant',
        short_name: 'LocalAI',
        start_url: '/',
        display: 'standalone'
      }
    })
  ]
});
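
vite-plugin-pwa also exposes a virtual:pwa-register module for wiring up the generated service worker; a minimal registration sketch (the offline-ready message is a placeholder for your own UI, and the module's types come from 'vite-plugin-pwa/client'):

// src/pwa.ts (sketch): register the service worker generated by vite-plugin-pwa
import { registerSW } from 'virtual:pwa-register';

registerSW({
  immediate: true,
  onOfflineReady() {
    // placeholder: surface this in your own UI instead of the console
    console.info('App shell cached: the assistant UI now works offline.');
  }
});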

Tooling & editor integrations (practical)

TypeScript and editor tooling smooth the developer experience. A few practical tips:

  • VS Code: add workspace tasks for building and running Vite, and configure a launch profile for remote debugging on a mobile browser via port forwarding.
  • Types: add d.ts shims for any WASM exports to expose safe typed signatures and avoid implicit any.
  • Linting and tests: add unit tests for tokenizers and run end-to-end tests in a headless browser (Playwright) to validate the PWA install flow and offline behavior.

For example, a typed shim for the runtime session keeps the orchestration code honest:

// src/types/llm.d.ts
export interface LLMSession {
  push_tokens(tokens: number[]): void;
  step(): number; // returns token id
  isDone(): boolean;
}

Performance techniques for mobile inference

  • Quantized models: use 4-bit or 8-bit quantized models to reduce memory and allow larger context windows.
  • Kernel offload: implement or reuse WebGPU kernels for GEMM, RMSNorm, and softmax to get near-native performance on modern phone SoCs.
  • Streaming & chunked decoding: run inference in micro-batches and stream tokens to the UI to keep the app responsive.
  • Model hubs: by late 2025–2026, multiple community hubs host browser-ready quantized models and tokenizer packs — curate small, privacy-safe models for mobile.

Debugging tips for complex type and runtime issues

You’ll encounter three categories of bugs: TypeScript typing mismatches with WASM exports, WebGPU validation errors, and resource throttling on mobile. Practical debugging steps:

  1. Expose descriptive errors from the WASM runtime (return codes translated into JS exceptions).
  2. Use WebGPU error scopes and the uncapturederror event during development to get human-readable messages from shader and pipeline issues (see the sketch below).
  3. Throttle model size and thread usage to avoid background-killing on iOS/Android; monitor memory footprint with browser devtools.
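
A small development-only sketch for surfacing WebGPU errors, using error scopes and the uncapturederror event (both part of the WebGPU API):

// src/webgpu/debug.ts (sketch): make validation failures visible during development
export function watchUncapturedErrors(device: GPUDevice) {
  device.addEventListener('uncapturederror', (event) => {
    console.error('[webgpu]', (event as GPUUncapturedErrorEvent).error.message);
  });
}

export async function withValidationScope<T>(device: GPUDevice, label: string, fn: () => T): Promise<T> {
  device.pushErrorScope('validation');
  const result = fn();
  const error = await device.popErrorScope();
  if (error) console.error(`[webgpu] ${label}:`, error.message);
  return result;
}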

Security and privacy checklist

  • Never send user inputs to remote servers by default; require explicit opt-in for cloud fallback.
  • Use encryption-at-rest for model shards if you store sensitive data locally, and explain the UX for passphrases/keys.
  • Limit permissions: only request the minimum (e.g., microphone for voice input) and explain why in the install flow.
  • Provide a data purge action so users can remove models and caches quickly (see the sketch below).

"Running inference in the browser makes strong privacy promises possible — but only if you design the storage and permission flows with transparency."

Real-world checklist: deploying a minimal demo

To get from zero to a running demo on your phone:

  1. Scaffold the Vite TypeScript app and configure VitePWA.
  2. Bundle or host the tokenizer.wasm and runtime.wasm; implement the loader and DB storage code above.
  3. Implement a simple UI that streams token text as it arrives.
  4. Test on Android Chrome (WebGPU is stable) and iOS Safari (use latest WebKit builds; enable WebGPU if necessary in settings as of early 2026).

Limitations and fallbacks

On-device LLMs still have trade-offs: model capacity is smaller than cloud-hosted giants, and inference takes more local CPU/GPU time. Offer an explicit opt-in cloud fallback for heavy tasks, but keep defaults privacy-first. Measure battery and CPU usage during development — on mobile devices, even well-optimized WebGPU workloads can warm the device.

Future-proofing: what to watch in 2026+

  • WASI & improved WASM SIMD/WebGPU interop will reduce overheads and make heavier models viable in the browser.
  • Standardized tokenizers and model metadata formats for browser delivery will simplify model installation.
  • Platform-level support (secure enclaves accessible via web APIs) may offer better key storage for encrypted shards.

Actionable takeaways

  • Start small: ship a tiny quantized model and a streaming UI to validate UX and resource usage.
  • Use IndexedDB for storage, WebGPU for compute, and a WASM runtime for inference orchestration.
  • Keep the default behavior privacy-first: local-only inference, transparent permissions, and user-controlled model storage.
  • Invest in TypeScript types for WASM exports and editor tooling to speed development and debugging.

Where to go next

Implement the examples in this guide as modular pieces: loader, webgpu kernels, tokenizer, storage layer, and the PWA shell. Break down integration into small PRs and measure performance at each step. If you want a reference implementation, search for open-source WASM LLM runtimes and browser-native tokenizer projects from late 2025 and early 2026 — they’ll accelerate your build dramatically.

Final thoughts & call-to-action

Building a privacy-first, on-device AI assistant in the mobile browser is now realistic for TypeScript teams. The combination of WASM runtimes, WebGPU acceleration, and PWA installability unlocks low-latency, private conversational features that ship with your app. Start with a small quantized model, wire up token streaming, and iterate. If you’d like, clone a starter template, run the demo on your phone, and share performance numbers — the community is iterating fast and your feedback helps shape the next wave of local LLM tooling.

Ready to build? Clone the starter, run npm run dev, install the PWA on your phone, and test an offline chat. If you want a curated checklist or a reference repository with the TypeScript + Vite + WASM + WebGPU integration, sign up for updates or contribute to the open-source starter linked from this article.
