Building Siri-like assistants with TypeScript: voice UI, LLM glue, and privacy
2026-02-02
11 min read

Architect a privacy-first, TypeScript-based voice assistant using typed intent schemas, cloud LLMs such as Gemini or local edge models, and on-device STT/TTS.

Build a Siri-like Assistant in TypeScript (2026): Voice UI, LLM NLU, and Privacy-First Edge Strategies

You want a voice assistant that understands natural language, runs reliably with low latency, and keeps user data private — but you also need strong TypeScript types, predictable intents, and an architecture that can run partly or fully on the edge. This guide shows how to architect and implement that assistant in 2026 using practical TypeScript patterns, voice I/O options, typed intent schemas, LLM glue (including modern Gemini-based setups), and real privacy constraints.

Executive summary (most important first)

  • Architecture: capture audio → STT → typed NLU → action dispatcher → TTS.
  • Type safety: use TypeScript union types + runtime validators (zod/io-ts) to convert LLM outputs into safe intents.
  • LLM glue: combine local edge models and cloud LLMs (Gemini or open models) with robust prompt templates and response schemas.
  • Privacy & offline: adopt local-first STT/TTS and on-device LLMs for sensitive tasks, encrypt when cloud is required, and minimize PII retention.
  • 2026 trends: mainstream Gemini partnerships (e.g., platform embeds), efficient edge hardware (Pi 5 + AI HAT), and lightweight GGML models enable hybrid local/cloud assistants.

Why build this in TypeScript in 2026?

TypeScript remains the best engineering choice for cross-platform system glue: it runs in browsers, desktops (Electron), Node servers, and can compile to edge runtimes. In 2026 we see two important trends that make a TypeScript-first assistant practical:

  • Cloud LLMs (including Google Gemini partnerships) provide high-accuracy NLU and reasoning APIs for complex tasks.
  • Edge compute has matured — Raspberry Pi 5 + AI HATs and GGML-backed LLMs let you do private STT/LLM inference locally at reasonable cost; see real-world edge templates and micro-edge VPS patterns for latency-sensitive deployment.

High-level assistant architecture

Keep the assistant modular and typed. The typical pipeline is:

  1. Audio capture: microphone access, VAD (voice activity detection), optional wake word.
  2. Speech-to-text (STT): local (Whisper/Coqui/Vosk) or cloud (Google, OpenAI), returning transcripts and confidence.
  3. NLU (LLM): typed intent extraction using a validation schema derived from TypeScript types.
  4. Action dispatcher: safely map intents to platform actions with permission checks.
  5. Text-to-speech (TTS): local (Coqui, Edge TTS) or cloud with privacy controls.
  6. Telemetry & privacy: minimize logs, encrypt, and allow local-only mode.

Component responsibilities

  • Capture: ensure low-latency buffering and handle device permissions.
  • STT: return timestamps, confidence scores, and PII markers if possible.
  • NLU: hand off a clean transcript to an LLM using a strict output schema to avoid hallucinations.
  • Dispatcher: call system APIs only after type-checked intent validation and user consent for sensitive actions.
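
A typed sketch of these stage contracts follows; the interface names and the './intents' module path are placeholders, not a published API.

// Illustrative pipeline contracts; adapt names and shapes to your own packages.
import type { Intent } from './intents' // the zod-derived union defined in the next section

export interface Transcript {
  text: string
  confidence: number   // 0..1, as reported by the STT engine
  startMs?: number
  endMs?: number
}

export interface SttEngine {
  transcribe(audio: ArrayBuffer): Promise<Transcript>
}

export interface NluEngine {
  parse(transcript: string): Promise<Intent | null>
}

export interface ActionDispatcher {
  dispatch(intent: Intent): Promise<{ reply: string }>
}

export interface TtsEngine {
  speak(text: string): Promise<void>
}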

Designing a typed intent schema

The single most important engineering step to make LLMs safe and reliable is to require a machine-readable intent schema. In TypeScript, describe intents as discriminated unions, then create a runtime validator with zod. This gives both developer ergonomics and runtime guarantees.

Example intent types (TypeScript)

import { z } from 'zod'

// Discriminated union of intents, defined with zod so runtime validation matches the TypeScript types
const PlayMusic = z.object({
  type: z.literal('play_music'),
  query: z.string(),
  shuffle: z.boolean().optional(),
})

const SetTimer = z.object({
  type: z.literal('set_timer'),
  seconds: z.number(),
  label: z.string().optional(),
})

const SendMessage = z.object({
  type: z.literal('send_message'),
  recipient: z.string(),
  message: z.string(),
})

export const IntentSchema = z.discriminatedUnion('type', [PlayMusic, SetTimer, SendMessage])
export type Intent = z.infer<typeof IntentSchema>

Now require LLM outputs to conform to this schema.

LLM NLU: prompt + schema enforcement

When you call a cloud LLM (Gemini or similar), use a prompt that includes the JSON schema and a strict response format instruction. Then run runtime validation on the returned JSON and handle errors (retry with clarification or fallback to deterministic parser).

// callLLM is your LLM adapter (see the adapter pattern below). Note that zod
// schemas don't serialize usefully via toString(), so spell out the expected
// shapes explicitly in the prompt.
const INTENT_SHAPES = `One of:
{ "type": "play_music", "query": string, "shuffle"?: boolean }
{ "type": "set_timer", "seconds": number, "label"?: string }
{ "type": "send_message", "recipient": string, "message": string }`

async function parseIntentWithLLM(transcript: string): Promise<Intent | null> {
  const prompt = `You are an intent extractor. Given the user transcript, produce exactly one JSON object (no prose) matching one of these shapes:\n${INTENT_SHAPES}\nTranscript: ${transcript}`
  const response = await callLLM(prompt)
  try {
    const parsed = JSON.parse(response)
    return IntentSchema.parse(parsed)
  } catch (e) {
    console.warn('LLM parse/validate failed', e)
    return null
  }
}
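
To make validation failures recoverable, one option is a small retry-then-fallback wrapper (a sketch; parseIntentDeterministic is a hypothetical deterministic extractor, sketched later in the hallucination-handling section):

// Retry the LLM a couple of times, then hand off to a deterministic parser.
async function parseIntentRobust(transcript: string, attempts = 2): Promise<Intent | null> {
  for (let i = 0; i < attempts; i++) {
    const intent = await parseIntentWithLLM(transcript)
    if (intent) return intent
  }
  // Last resort: regex/FSM extraction for the parameters you cannot afford to lose.
  return parseIntentDeterministic(transcript)
}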

Connecting to Gemini and other LLMs (2026)

By 2026, major device-makers increasingly embed or partner with large foundation models like Gemini. For assistant NLU you should design your system to be LLM-agnostic: implement a small adapter layer that can route to Gemini, an OpenAI-compatible endpoint, or a local GGML model.

Adapter pattern (TypeScript)

export type LLMResult = { text: string }

export interface LLMClient { generate(prompt: string): Promise<LLMResult> }

// adapter for cloud (Gemini/OpenAI-like)
export class CloudLLM implements LLMClient {
  constructor(private apiKey: string, private endpoint: string) {}
  async generate(prompt: string) {
    const res = await fetch(this.endpoint, {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${this.apiKey}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt })
    })
    const json = await res.json()
    return { text: json.output_text ?? json.choices?.[0]?.text ?? '' } // normalize across provider response shapes
  }
}

// adapter for local GGML-backed LLM over a local HTTP bridge
export class LocalLLM implements LLMClient {
  constructor(private url = 'http://127.0.0.1:8080/generate') {}
  async generate(prompt: string) {
    const res = await fetch(this.url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt })
    })
    const json = await res.json()
    return { text: json.text }
  }
}

This lets you switch to on-device inference (low-latency, private) when available, and fall back to cloud LLMs for heavy tasks. For real-world hybrid patterns see case studies like startups using hybrid cloud strategies and micro-edge deployment notes.
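
A minimal routing sketch over the adapters above (the sensitive flag is an assumption about how your pipeline tags requests; the policy is illustrative):

// Privacy/latency router: sensitive prompts never leave the device; other
// prompts prefer the local model and fall back to cloud on failure.
export function makeRoutedLLM(local: LLMClient, cloud: LLMClient) {
  return {
    async generate(prompt: string, opts: { sensitive?: boolean } = {}): Promise<LLMResult> {
      if (opts.sensitive) return local.generate(prompt)
      try {
        return await local.generate(prompt)
      } catch {
        return cloud.generate(prompt)
      }
    },
  }
}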

Voice I/O: practical choices

Choose STT and TTS based on privacy, latency, and device constraints.

Speech-to-text options

  • Browser: Web Speech API — easiest but limited privacy; use secure contexts and consider a local STT if privacy-sensitive.
  • Cloud: Google Speech-to-Text, OpenAI Whisper endpoints, or Gemini voice STT (where available) — high quality, higher privacy cost.
  • Local: Whisper.cpp/Whisper.cpp-Web, Vosk, or other GGML-compiled models on Pi 5 + AI HAT for offline STT. In 2025–26 the Pi AI HAT+ families shipped optimized runtimes that meaningfully reduced latency; see edge field kits for hands-on deployment patterns (Edge Field Kit).

Text-to-speech options

  • Browser: Web Speech Synthesis (SpeechSynthesisUtterance).
  • Cloud: Gemini/Google TTS or OpenAI TTS — great voices, but send text over network.
  • Local: Coqui TTS, Edge-friendly vocoders, or hardware-accelerated TTS on AI HATs for offline playback; audio field kits and portable creator gear reviews are useful when selecting output hardware (portable audio kits).

Browser example: capture → STT (Whisper) → NLU → TTS

async function runFlow() {
  // 1) capture audio (simplified demo: record a fixed-length clip)
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  const mediaRecorder = new MediaRecorder(stream)
  const chunks: BlobPart[] = []
  mediaRecorder.ondataavailable = (e) => chunks.push(e.data)
  const stopped = new Promise<void>((res) => { mediaRecorder.onstop = () => res() })
  mediaRecorder.start()

  // stop after 2 seconds for demo
  await new Promise((res) => setTimeout(res, 2000))
  mediaRecorder.stop()
  await stopped // wait for the final dataavailable/stop events before reading chunks
  const audioBlob = new Blob(chunks, { type: mediaRecorder.mimeType })

  // 2) upload to Whisper/cloud STT
  const sttRes = await fetch('/api/stt', { method: 'POST', body: audioBlob })
  const transcript = await sttRes.text()

  // 3) parse intent
  const intent = await parseIntentWithLLM(transcript)
  if (!intent) { speak('Sorry, I did not understand. Can you rephrase?'); return }

  // 4) dispatch
  const outcome = await dispatchIntent(intent)

  // 5) TTS
  speak(outcome.reply)
}

function speak(text: string) {
  const u = new SpeechSynthesisUtterance(text)
  speechSynthesis.speak(u)
}
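
The '/api/stt' endpoint in the flow above is whatever STT bridge you run. Below is a minimal Node sketch that buffers the upload and forwards it to a local whisper.cpp-style HTTP server; the URL, port, and response shape are assumptions about your own deployment.

// Buffer the uploaded audio and forward it to a local STT bridge.
import { createServer } from 'node:http'

const LOCAL_STT_URL = 'http://127.0.0.1:8090/transcribe' // hypothetical local bridge

createServer(async (req, res) => {
  if (req.method === 'POST' && req.url === '/api/stt') {
    const chunks: Buffer[] = []
    for await (const chunk of req) chunks.push(chunk as Buffer)

    const sttRes = await fetch(LOCAL_STT_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/octet-stream' },
      body: Buffer.concat(chunks),
    })
    const { text } = (await sttRes.json()) as { text: string }

    res.writeHead(200, { 'Content-Type': 'text/plain' })
    res.end(text)
    return
  }
  res.writeHead(404).end()
}).listen(3000)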

Making the LLM-to-action glue robust

LLMs may infer actions not specified by the schema. Never let an unchecked LLM string invoke a privileged API. Use an explicit action registry and map validated intents to whitelisted functions. For example, sending SMS requires an explicit user confirmation step.

async function dispatchIntent(intent: Intent) {
  switch (intent.type) {
    case 'play_music':
      return await playMusic(intent.query, !!intent.shuffle)
    case 'set_timer':
      return await setTimer(intent.seconds, intent.label)
    case 'send_message': {
      // require confirmation for sensitive actions
      const ok = await confirmIfSensitive('send_message', intent)
      if (!ok) return { reply: 'Cancelled.' }
      await sendMessage(intent.recipient, intent.message)
      return { reply: 'Message sent.' }
    }
    default:
      return { reply: 'I cannot do that yet.' }
  }
}
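
confirmIfSensitive can be as small as a whitelist of sensitive intent types plus a confirmation hook; askUser below is a placeholder for your platform's confirmation surface (a TTS prompt with a yes/no voice turn, or a UI dialog).

// Sensitive actions never run without explicit confirmation.
const SENSITIVE_ACTIONS = new Set<Intent['type']>(['send_message'])

async function confirmIfSensitive(action: Intent['type'], intent: Intent): Promise<boolean> {
  if (!SENSITIVE_ACTIONS.has(action)) return true
  return askUser(`Should I ${describeIntent(intent)}?`)
}

function describeIntent(intent: Intent): string {
  switch (intent.type) {
    case 'send_message': return `send "${intent.message}" to ${intent.recipient}`
    case 'set_timer': return `set a timer for ${intent.seconds} seconds`
    case 'play_music': return `play ${intent.query}`
  }
}

declare function askUser(question: string): Promise<boolean> // platform-specific prompt + response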

Privacy-first patterns and offline constraints

Privacy is the top differentiator for an on-device assistant versus cloud-only services. Key patterns:

  • Local-first mode: prefer local STT/TTS and local LLMs for PII tasks (messages, payments, health queries). Only fall back to cloud with explicit user consent — hybrid deployment case studies like Bitbox.Cloud show how startups route heavy tasks to cloud while preserving privacy for core flows.
  • Ephemeral transcripts: store transcripts in-memory; if persisted, encrypt with a device key.
  • Minimize outbound data: redact or hash PII before sending to cloud (a minimal redaction sketch follows this list); use field-level encryption for sensitive parameters.
  • Consent & audit: keep a user-visible audit log of actions that required cloud processing.
  • Hardware anchors: use TPM/secure enclave (where available) for key storage and attestation, especially when you allow cloud token refreshes.
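
A minimal redaction pass before any cloud call might look like the sketch below; the patterns are illustrative, not an exhaustive PII detector, and should be combined with local-only routing for genuinely sensitive flows.

// Mask obvious PII patterns before a prompt leaves the device.
const REDACTIONS: Array<[RegExp, string]> = [
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, '<email>'],   // email addresses
  [/\b\d[\d\s().-]{7,}\d\b/g, '<phone>'],        // phone-number-like digit runs
  [/\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g, '<date>'],  // simple numeric dates
]

export function redactForCloud(text: string): string {
  return REDACTIONS.reduce((acc, [pattern, token]) => acc.replace(pattern, token), text)
}

// Example: redactForCloud('Text Ana at 415 555 0100') returns 'Text Ana at <phone>'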

Edge deployment: Pi 5 + AI HAT example

In late 2025/early 2026, new AI HATs for Raspberry Pi 5 made local generative inference feasible. For a privacy-first assistant, deploy:

  • Local STT model (Whisper.cpp optimized) running on the HAT.
  • GGML-backed small LLM for intent extraction and quick tasks.
  • Cloud-only heavy LLM calls for complex tasks with opt-in telemetry.

Deploying on-device drastically reduces latency and keeps sensitive transcripts off the wire. Use the cloud only for large-context reasoning or retrieval-augmented responses.

Handling hallucinations and validation failures

LLMs hallucinate. Your system must detect and mitigate:

  • Schema validation failures: ask for user clarification or re-send to a different model with a stricter prompt.
  • Action confirmation: for destructive/sensitive actions, require a confirmation step (voice or UI tap).
  • Deterministic fallback parsers: implement regex/fsm extractors for critical parameters like phone numbers or timeouts; see the sketch after this list.
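
As an example of that deterministic fallback, a timer-only extractor can be pure regex (a sketch covering a single intent; add extractors per critical parameter type):

// Recover simple timer requests ("set a timer for 5 minutes") without an LLM.
export function parseIntentDeterministic(transcript: string): Intent | null {
  const m = transcript.match(/timer\s+(?:for\s+)?(\d+)\s*(second|minute|hour)s?/i)
  if (!m) return null
  const value = parseInt(m[1], 10)
  const unit = m[2].toLowerCase()
  const seconds = unit === 'hour' ? value * 3600 : unit === 'minute' ? value * 60 : value
  return { type: 'set_timer', seconds }
}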

Testing, observability, and metrics

Track these metrics to iterate:

  • NLU intent accuracy (validated by zod): percent of LLM outputs that pass the schema.
  • Latency breakdown: STT time, LLM time, dispatch time, TTS time — instrumented using micro-edge patterns from micro-edge VPS guides; a minimal timing helper is sketched after this list.
  • Privacy surface: percentage of requests handled locally vs cloud.
  • User confirmation rates for sensitive actions.
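
A lightweight way to capture the latency breakdown without a full metrics stack (a sketch using performance.now(), available in both browsers and recent Node versions):

// Wrap each pipeline stage and record its duration in milliseconds.
type StageTimings = Record<string, number>

export async function timed<T>(timings: StageTimings, stage: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now()
  try {
    return await fn()
  } finally {
    timings[stage] = Math.round(performance.now() - start)
  }
}

// Usage (illustrative):
//   const timings: StageTimings = {}
//   const transcript = await timed(timings, 'stt', () => transcribe(audioBlob))
//   const intent = await timed(timings, 'nlu', () => parseIntentWithLLM(transcript))
//   console.log(timings) // e.g. { stt: 420, nlu: 180 }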

Developer workflow and TypeScript tips

  • Strict typing: enable strict in tsconfig. Use zod for runtime validation so TypeScript types match runtime checks.
  • Monorepo layout: split into packages such as assistant-core, stt-adapters, llm-adapters, and device-drivers. Consider modular publishing workflows and templates-as-code for packaging shared artifacts.
  • Testing: use end-to-end tests with recorded audio fixtures and mocked LLM clients to simulate misbehavior; a mock client sketch follows this list. Developer toolkits and extensions can speed debug cycles — see tool roundups like Top 8 Browser Extensions for Fast Research.
  • CI: run type-check + linter + zod schema tests to prevent schema drift; add observability tests inspired by observability-first practices.
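
A scripted mock of the LLMClient adapter is a few lines and lets tests exercise both the happy path and schema-validation failures (a sketch; pair it with recorded audio fixtures):

// Returns canned responses in order, including deliberately malformed ones.
export class MockLLM implements LLMClient {
  private calls = 0
  constructor(private responses: string[]) {}

  async generate(_prompt: string): Promise<LLMResult> {
    const text = this.responses[Math.min(this.calls, this.responses.length - 1)]
    this.calls++
    return { text }
  }
}

// Example: first response is broken JSON, second is a valid intent.
// const llm = new MockLLM(['not json at all', '{"type":"set_timer","seconds":60}'])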

Real-world considerations & trade-offs

Decision matrix:

  • Cloud LLM only — Pros: best accuracy, frequent model updates. Cons: privacy, cost, network dependency.
  • Local LLM only — Pros: private, low-latency. Cons: limited reasoning capability, larger device footprint.
  • Hybrid — Practical sweet spot in 2026: local LLM for NLU & short contexts, cloud LLM for complex multi-turn reasoning and long-context retrieval. Model orchestration (routing models by privacy and latency) is becoming common in production systems and borrows patterns from model orchestration playbooks.

Leverage new capabilities that emerged in late 2025–early 2026:

  • Model orchestration: route requests across models based on cost, latency, and privacy tags.
  • RAG with local indices: keep private documents indexed locally for retrieval-augmented generation without exposing them to the cloud.
  • Edge acceleration: use hardware-accelerated inference on AI HATs and offload heavy TTS vocoding to specialized silicon — see hardware reviews like the SkyPort Mini and creator handhelds for deployment trade-offs.
  • Platform LLM partnerships: several platforms now expose optimized LLM pathways (Gemini integrations in device ecosystems) — design your adapter layer to take advantage of these where allowed.

Case study: A minimal privacy-first assistant on Pi 5

Architectural choices we used in a 2025 pilot:

  • Hardware: Raspberry Pi 5 + AI HAT for quantized GGML LLM and STT — practical field kits are summarized in edge deployment writeups like the Edge Field Kit.
  • Stack: Node 20 runtime, TypeScript, zod validators, local LLM server (llama.cpp fork), Coqui TTS for vocal output.
  • Behavior: All STT and NLU run locally. Cloud calls only for calendar lookup (optional) and complex summarization with explicit user opt-in.
  • Outcome: ~90% intent accuracy across 12 common intents, median latency under 300 ms for local intents, and a clear opt-in pathway for cloud-backed tasks.

Actionable takeaways

  • Define typed intent schemas up-front and use runtime validators — this prevents hallucination-driven actions.
  • Implement an LLM adapter layer so you can change model providers without refactoring business logic; look at hybrid cloud case studies like Bitbox.Cloud for architectural inspiration.
  • Prefer local-first STT/TTS and LLMs for PII-sensitive tasks; fall back to cloud with explicit consent.
  • Whitelist actions and require confirmations for destructive operations.
  • Measure privacy surface and latency and optimize placement of components (edge vs cloud) based on those metrics.

Further reading & resources (2026)

  • Trends: device-maker LLM partnerships (e.g., Gemini integrations) that gained traction in 2024–2026.
  • Edge hardware: Raspberry Pi 5 AI HAT ecosystems from late 2025 that enable local GGML workloads; hardware and creator device reviews such as the Orion Handheld X provide device-level tradeoffs.
  • Open-source stacks: whisper.cpp / llama.cpp / GGML and Coqui for building local STT/TTS pipelines.

Conclusion & next steps

By combining TypeScript's strong types with runtime validators, a modular LLM adapter layer, and a privacy-first approach to STT/TTS and model placement, you can build a Siri-like assistant that is robust, auditable, and user-trustworthy. In 2026 the best practice is hybrid: use local inference for immediate, sensitive interactions and cloud LLMs for heavy lifting with explicit consent.

Get started: scaffold a TypeScript monorepo with the intent schema above, plug in a local GGML LLM and Whisper runtime, and iterate on intent coverage with zod validation. Keep privacy as the default mode and measure the trade-offs. For deployment, consult micro-edge VPS guidance (micro-edge instances) and observability patterns (observability-first).

Call to action

Ready to prototype? Clone a starter repo (TypeScript + zod + LLM adapter) and run a minimal Pi 5 local setup — then share your results in the TypeScript Assistant community. If you want starter code or a troubleshooting checklist, let me know which platform (browser, Node, or Pi) you're targeting and I’ll provide a tailored scaffold. For dev tooling and quick research, check curated tool roundups like Top 8 Browser Extensions for Fast Research.
