Edge AI with TypeScript: architecture patterns for small devices and Raspberry Pi HATs
Architectural patterns for TypeScript on edge devices: offline inference, batching, model caching, and typed SDK wrappers for Raspberry Pi HATs.
The pain: shipping robust Edge AI on tiny, constrained devices
You know the pain: a team ships a TypeScript service to run on a Raspberry Pi 5 or a small MCU HAT, and everything looks fine in the lab, until real traffic, battery limits, or a stuttering inference engine exposes flaky memory, long-tail latency, and inconsistent device drivers. This article lays out architecture patterns and pragmatic trade-offs for 2026: offline inference, batching, model caching, and typed SDK wrappers that help teams ship reliable Edge AI with TypeScript.
Why this matters in 2026
By late 2025 and early 2026, the edge AI landscape changed fast: Raspberry Pi 5 and new AI HATs (for example, the AI HAT+ 2) made affordable local inference practical. At the same time, compact runtimes (WASM-based engines and gguf/llama.cpp-style C runtimes) and model quantization matured. The result: more compute at the edge, but also greater expectations for robustness, observability, and typed developer ergonomics.
ZDNET and other outlets noted the new Pi AI HATs in 2025 as a turning point for on-device generative AI — but hardware alone doesn't solve architecture and reliability problems.
High-level patterns — the quick map
Edge AI apps typically combine several patterns. Pick the right combination depending on your constraints:
- Offline inference with model caching — keep model files and runtime local; fall back to cloud only if needed.
- Batching and micro-batching — accumulate requests to improve accelerator utilization without blowing latency SLOs.
- Typed HAT SDK wrappers — wrap hardware access in a well-typed API to catch issues at compile time and document capabilities.
- Adaptive model selection — switch models by resource availability (memory, battery, CPU/GPU availability).
- Cache eviction strategies (LRU, frequency-based, pinned models) — conserve limited storage and RAM.
Trade-offs: latency vs throughput vs resource usage
Every edge deployment is a balancing act:
- Throughput: Large batch sizes increase throughput but raise tail latency and require more memory.
- Latency: Interactive applications (robotics, voice assistants) need low latency; use smaller batches or per-request inference.
- Energy: Larger accelerators or continuous inference drain battery-operated devices quickly.
- Predictability: Deterministic memory usage and handling of degraded conditions is crucial for production-grade devices.
Pattern 1 — Model caching and eviction strategies
Store model artifacts locally in compressed or quantized formats to avoid the network and speed startup. But local storage is finite: use a smart cache with predictable eviction.
Key components for a production cache
- Index file with metadata (model name, identifier, version, size, quantization, timestamp).
- LRU or LFU eviction with explicit pinning for critical models.
- Checkpointing for partial loads (memory-map quantized files to avoid full decompression).
- Atomic swap when updating models to avoid partially-written artifacts.
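The last item is easy to get right with a write-then-rename: a minimal sketch using Node's fs/promises, assuming the artifact has already been downloaded and verified (paths are illustrative).
  import { writeFile, rename } from 'node:fs/promises'
  // write the new artifact to a temporary path, then rename it into place;
  // rename() is atomic on the same filesystem, so readers never observe a half-written model
  async function installModel(artifact: Uint8Array, finalPath: string) {
    const tmpPath = `${finalPath}.tmp-${process.pid}`
    await writeFile(tmpPath, artifact)
    await rename(tmpPath, finalPath)
  }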
TypeScript implementation idea: small LRU cache for models
The following TypeScript shows a compact LRU cache for model metadata. The goal: pin models you need, evict older ones, and keep file I/O outside the hot path.
type ModelMeta = {
  id: string
  path: string
  sizeBytes: number
  pinned?: boolean
  lastUsed: number
}
class ModelCache {
  private map = new Map<string, ModelMeta>()
  private capacityBytes: number
  private usedBytes = 0
  constructor(capacityBytes: number) {
    this.capacityBytes = capacityBytes
  }
  touch(id: string) {
    const meta = this.map.get(id)
    if (!meta) return
    meta.lastUsed = Date.now()
    // re-insert to move the entry to the back (most recently used)
    this.map.delete(id)
    this.map.set(id, meta)
  }
  add(meta: ModelMeta) {
    if (this.map.has(meta.id)) {
      this.touch(meta.id)
      return
    }
    // evict non-pinned entries until the new model fits
    while (this.usedBytes + meta.sizeBytes > this.capacityBytes) {
      if (!this.evictOne()) break
    }
    this.map.set(meta.id, { ...meta, lastUsed: Date.now() })
    this.usedBytes += meta.sizeBytes
  }
  private evictOne(): boolean {
    // Map preserves insertion order, so the first non-pinned entry is the least recently used
    for (const [id, meta] of this.map) {
      if (meta.pinned) continue
      this.map.delete(id)
      this.usedBytes -= meta.sizeBytes
      // delete the file asynchronously outside the hot path: fs.unlink(meta.path)
      return true
    }
    return false
  }
}
Trade-offs: this simple LRU doesn't account for load costs (network latency to re-download) or model loading time. In many systems, a cost-aware eviction (weight by reload time) performs better. For a deeper look at storage and privacy-friendly analytics at the edge, see Edge Storage for Small SaaS in 2026.
Pattern 2 — Offline inference with cloud fallback
Keep inference local whenever possible. Network connectivity can be intermittent, and device owners expect privacy and low latency. But local compute can fail (memory pressure, overheating). Implement graceful fallback to the cloud only when safe and cost-effective.
Architectural flow
- Try local inference with preferred model.
- If local runtime fails or yields unacceptable latency, queue request for cloud inference (optionally after local pre-processing such as feature extraction or compression).
- Log telemetry about fallback frequency for future model selection and provisioning.
Practical tips:
- Measure both success rate and energy cost of local inference.
- Compress inputs for cloud fallbacks to reduce egress cost.
- Implement an exponential backoff for cloud calls to avoid network storms on reconnect. If you want patterns for shipping offline-first apps and fallback behavior, the offline-first field service playbook contains useful operational parallels.
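A minimal sketch of the fallback flow, assuming hypothetical runLocal and callCloud functions; the retry count and backoff values are illustrative, not prescriptive.
  async function inferWithFallback<T, R>(
    input: T,
    runLocal: (input: T) => Promise<R>,
    callCloud: (input: T) => Promise<R>,
    maxRetries = 3
  ): Promise<R> {
    try {
      return await runLocal(input) // prefer local inference for privacy and latency
    } catch {
      // local runtime failed (memory pressure, timeout, thermal throttling): fall back to the cloud
      for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
          return await callCloud(input)
        } catch {
          // exponential backoff with jitter to avoid network storms on reconnect
          const delayMs = 2 ** attempt * 500 + Math.random() * 200
          await new Promise(resolve => setTimeout(resolve, delayMs))
        }
      }
      throw new Error('local and cloud inference both failed')
    }
  }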
Pattern 3 — Batching and micro-batching
Accelerators (NPU, GPU) are most efficient with batches. But large batches increase latency. Micro-batching groups small requests into short windows (for example, 20-100 ms) to get the majority of throughput gains without harming latency-sensitive workflows.
Type-safe batching queue example
Use generics to keep request/response types consistent across the queue and model runtime.
type InferenceReq<T, R> = {
  id: string
  payload: T
  resolve: (res: R) => void
  reject: (err: Error) => void
}
class Batcher<T, R> {
  private queue: InferenceReq<T, R>[] = []
  private windowMs: number
  private maxBatch: number
  private timer?: NodeJS.Timeout
  private runBatch: (batch: T[]) => Promise<R[]>
  constructor(windowMs: number, maxBatch: number, runBatch: (batch: T[]) => Promise<R[]>) {
    this.windowMs = windowMs
    this.maxBatch = maxBatch
    this.runBatch = runBatch
  }
  enqueue(payload: T): Promise<R> {
    return new Promise<R>((resolve, reject) => {
      this.queue.push({ id: String(Math.random()), payload, resolve, reject })
      // open a batching window on the first request
      if (!this.timer) {
        this.timer = setTimeout(() => this.flush(), this.windowMs)
      }
      // flush early once the batch is full
      if (this.queue.length >= this.maxBatch) this.flush()
    })
  }
  private async flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = undefined }
    const batch = this.queue.splice(0, this.maxBatch)
    if (batch.length === 0) return
    try {
      const results = await this.runBatch(batch.map(r => r.payload))
      for (let i = 0; i < batch.length; i++) batch[i].resolve(results[i])
    } catch (err: any) {
      for (const r of batch) r.reject(err)
    }
  }
}
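A usage sketch, assuming a hypothetical runtime binding that accepts a batch of Float32Array inputs and returns one label per input:
  // hypothetical accelerator binding; replace with your HAT's actual API
  declare const runtime: { run: (batch: Float32Array[]) => Promise<string[]> }
  // 30 ms window, up to 8 requests per batch; runBatch is where the accelerator call happens
  const batcher = new Batcher<Float32Array, string>(30, 8, batch => runtime.run(batch))
  async function classifyFrame(frame: Float32Array): Promise<string> {
    // callers see a simple per-request promise while the batcher amortizes accelerator overhead
    return batcher.enqueue(frame)
  }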
Trade-offs: tune windowMs and maxBatch by measuring end-to-end latency and device throughput. For voice or robotics, set tiny windows; for telemetry post-processing, use larger windows. For live, low-latency testbeds and tunnel setups you can validate batching behavior on, see this hosted tunnels & low-latency testbeds review.
Pattern 4 — Typed HAT SDK wrappers
HATs vary widely: camera modules, audio DSP HATs, NPUs, or custom sensors. A typed SDK wrapper gives you compile-time safety across hardware capabilities and reduces runtime errors you can't easily debug on headless devices.
Design goals for a HAT SDK
- Capability discovery: runtime introspection of features (e.g., supported precisions, memory, version).
- Typed commands: request/response types with generics so function signatures convey contract.
- Error modeling: typed error unions so callers can handle common failure modes.
Example: typed wrapper for a generic AI HAT
// capabilities
type Precision = 'fp32' | 'fp16' | 'int8'
type HatCapabilities = {
  name: string
  maxRamBytes: number
  supportedPrecisions: Precision[]
  modelFormats: ('gguf' | 'tflite' | 'onnx')[]
}
// typed errors
type HatError =
  | { type: 'NotConnected' }
  | { type: 'OutOfMemory'; requiredBytes: number }
  | { type: 'UnsupportedFormat'; format: string }
// typed command for inference
type InferenceOptions = { modelId: string; precision?: Precision }
interface IHatClient<TInput, TOutput> {
  getCapabilities(): Promise<HatCapabilities>
  loadModel(modelPath: string): Promise<void | HatError>
  infer(input: TInput, opts?: InferenceOptions): Promise<TOutput | HatError>
}
// usage
async function classify(client: IHatClient<Uint8Array, Float32Array>, image: Uint8Array) {
  const caps = await client.getCapabilities()
  if (!caps.supportedPrecisions.includes('int8')) console.warn('int8 not supported')
  const res = await client.infer(image, { modelId: 'mobilenet-int8' })
  if ('type' in res) {
    // handle HatError
  } else {
    return res
  }
}
Why generics help: the infer signature ties TInput and TOutput to a concrete client, so a preprocessor that produces the wrong input shape, or a caller that mishandles the output, fails at compile time rather than at runtime on a headless device.
Runtime choices and packaging
In 2026 you can choose among several runtimes for TypeScript on the edge:
- Node.js — battle-tested with rich native addon ecosystem; suitable when you can accept a larger runtime.
- Deno — simpler deployment model with single-binary scripting and better security defaults.
- WASM runtimes (Wasmtime, Wasmer) — run inference runtimes compiled to WASM for isolation and smaller footprints.
Packaging tips:
- Bundle TypeScript to a single artifact using esbuild or swc for fast startup (see the esbuild sketch after this list).
- Use prebuilt native bindings for NPUs when possible; build on a matching architecture (aarch64) to avoid runtime surprises. For guidance on sustainable procurement and device choices that affect security and build reproducibility, see refurbished devices & procurement.
- Consider shipping a minimal OS image with required drivers pinned to known-good versions.
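As a sketch of the bundling step with esbuild's JS API; the entry point, Node target, and output path are assumptions to adapt for your project.
  import { build } from 'esbuild'
  // bundle to a single file targeting the Node version shipped on the device image
  await build({
    entryPoints: ['src/main.ts'],
    bundle: true,
    platform: 'node',
    target: 'node20',
    outfile: 'dist/edge-agent.js',
    external: ['*.node'], // keep prebuilt native NPU bindings out of the bundle
  })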
Observability and reliability
Edge devices have less visibility than cloud services. Instrument key signals and keep telemetry minimal but actionable.
- Local logs with a circular buffer to avoid filling storage; push only high-priority telemetry to the cloud (a minimal sketch follows this list).
- Health endpoints for connected management services that verify model presence, memory usage, and temperature.
- Feature flags to toggle batching or different models remotely when devices misbehave after a rollout.
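The circular log buffer can be a small in-memory ring that is dumped only when a high-priority event warrants cloud telemetry; a minimal sketch (the capacity is an arbitrary assumption).
  class RingLog {
    private entries: string[] = []
    constructor(private capacity = 500) {}
    log(line: string) {
      this.entries.push(`${new Date().toISOString()} ${line}`)
      // drop the oldest entry once the ring is full so local storage never grows unbounded
      if (this.entries.length > this.capacity) this.entries.shift()
    }
    // snapshot the buffer only when a high-priority event needs to push telemetry to the cloud
    snapshot(): string[] {
      return [...this.entries]
    }
  }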
Security considerations
Local models and inference inputs are sensitive. Protect them with the same rigor as cloud counterparts.
- Encrypt model files at rest when devices are at risk of physical compromise.
- Sign model artifacts and verify signatures before loading to prevent malicious models; tie this into an audit-ready pipeline for provenance and verification (a verification sketch follows this list).
- Limit network egress and use mutual TLS for cloud fallback to prevent data leakage.
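Verification before loading can use Node's built-in crypto; a sketch assuming Ed25519 signatures and a public key provisioned onto the device at imaging time (paths and key handling are illustrative).
  import { readFile } from 'node:fs/promises'
  import { createPublicKey, verify } from 'node:crypto'
  async function verifyModel(modelPath: string, sigPath: string, publicKeyPem: string) {
    const [artifact, signature] = await Promise.all([readFile(modelPath), readFile(sigPath)])
    const key = createPublicKey(publicKeyPem)
    // for Ed25519 the algorithm argument is null; the scheme is implied by the key type
    const ok = verify(null, artifact, key, signature)
    if (!ok) throw new Error(`signature check failed for ${modelPath}; refusing to load`)
  }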
Case study: a conversational assistant on Pi HAT+ 2
Scenario: a lightweight conversational assistant runs on a Raspberry Pi 5 with an AI HAT+ 2. Goal: local answers for privacy-sensitive queries, cloud fallback for long-tail requests.
Architecture summary
- Local model: a quantized 4-bit GGUF model for short prompts, cached and pinned for instant access.
- Batching: 50 ms micro-batches for speech-to-text snippets to keep audio responsiveness.
- Fallback: cloud LLM for complex contexts; local pre-processing extracts embeddings and obfuscates PII before send.
- Typed SDK: HAT wrapper exposes audio capture, audio preproc, and inference with typed request/response shapes.
Lessons learned
- Pinning a small, high-quality model reduced cold-start latency by 4x versus pulling on-demand.
- Micro-batching tuned at 30-60ms hit a good balance between latency and throughput on the AI HAT NPU.
- Typed wrappers prevented mismatches between the audio preprocessor and the model input shape that previously caused silent runtime failures.
Advanced TypeScript patterns for maintainability
Leverage TypeScript's advanced types to make your Edge AI code safe and readable:
- Discriminated unions for modeling HatError and capability variants.
- Generics to keep inference pipelines type-safe end-to-end.
- Mapped types to adapt shape transformations for different model inputs/outputs.
- Conditional types for building typed wrappers that return different response shapes depending on options.
Example: conditional response types
type Precision = 'fp32' | 'fp16' | 'int8'
type InferenceResult<P extends Precision> = P extends 'int8'
  ? { predictions: number[]; scale: number }
  : { predictions: number[]; rawBytes?: Uint8Array }
async function inferWithPrecision<P extends Precision>(prec: P): Promise<InferenceResult<P>> {
  // runtime uses prec to pick an engine; the TS type ensures the caller handles the returned shape
  if (prec === 'int8') return { predictions: [0.1, 0.9], scale: 0.5 } as any
  return { predictions: [0.1, 0.9], rawBytes: new Uint8Array([]) } as any
}
This pattern forces consumers to account for the different metadata each precision returns, avoiding runtime surprises.
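A usage sketch showing how the conditional type propagates to call sites:
  async function example() {
    const quantized = await inferWithPrecision('int8')
    console.log(quantized.scale) // OK: int8 results carry a dequantization scale
    const full = await inferWithPrecision('fp16')
    console.log(full.rawBytes?.byteLength) // OK: non-int8 results may carry raw bytes
    // console.log(full.scale) // compile error: 'scale' does not exist on this shape
  }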
Testing and local simulation
Device tests are expensive. Use a layered testing strategy:
- Unit tests for SDK wrappers and caches with simulated failures (see the fake-client sketch after this list).
- Emulated runtimes for fast validation of batching logic and offline fallback flows.
- Hardware-in-the-loop tests for final verification of model loading, power consumption, and thermal behavior. For hosted infrastructure and testbeds to validate end-to-end low-latency behavior, consider services reviewed in the hosted tunnels & low-latency testbeds roundup.
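For the unit-test layer, a fake client implementing the IHatClient interface from Pattern 4 can inject failures deterministically; a minimal sketch (the failure schedule and capability values are assumptions).
  // a fake HAT client that fails every nth inference, useful for exercising fallback and retry paths
  class FakeHatClient implements IHatClient<Uint8Array, Float32Array> {
    private calls = 0
    constructor(private failEvery = 3) {}
    async getCapabilities(): Promise<HatCapabilities> {
      return { name: 'fake-hat', maxRamBytes: 512 * 1024 * 1024, supportedPrecisions: ['int8'], modelFormats: ['gguf'] }
    }
    async loadModel(_modelPath: string): Promise<void | HatError> {
      return
    }
    async infer(_input: Uint8Array): Promise<Float32Array | HatError> {
      this.calls++
      if (this.calls % this.failEvery === 0) return { type: 'OutOfMemory', requiredBytes: 1024 }
      return new Float32Array([0.1, 0.9])
    }
  }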
2026 trend predictions and operational guidance
Expect these trends to shape your architecture choices in 2026:
- More capable NPUs on small boards — inference will move even closer to devices; plan for occasional model updates and signature verification.
- Standardized WASM-based ML runtimes — they will simplify cross-platform deployment and increase portability between Pi variants.
- Hybrid cloud-edge orchestration — orchestration layers will let you dynamically move workloads between local HATs and cloud pools based on telemetry.
Operationally, invest in tooling to push signed model updates, monitor fallback rates, and remotely toggle batching strategies. These small investments pay off when hundreds of devices are in the field.
Actionable checklist before shipping
- Implement a model cache with pinning and eviction.
- Tune batching window and batch size with real traffic on identical hardware.
- Wrap all HAT interactions in a typed SDK with discriminated errors.
- Provide a cloud fallback path with privacy-preserving preprocessing and telemetry.
- Sign and verify model artifacts; encrypt at rest where needed. See audit-ready text pipelines for guidance on signing/verification policies.
- Automate tests: unit, emulated, and hardware-in-the-loop.
Final takeaways
Edge AI in 2026 is practical and powerful, but hardware advances do not eliminate engineering complexity. The winning architectures are those that explicitly address: model locality, resource-aware batching, safe model caching, and typed hardware APIs. Use TypeScript's advanced types and generics to codify contracts between components so runtime errors become compile-time errors.
Call to action
If you're building Edge AI with TypeScript, start by implementing a typed HAT SDK and a small model cache. Want a starter kit? Check our GitHub repo for a reference TypeScript HAT SDK, caching library, and batching primitives that run on Raspberry Pi 5 and common AI HATs. Join the discussion, file issues, and contribute patterns that work for your devices. For storage and edge deployment patterns, see Edge Storage for Small SaaS in 2026. If you need a quick example starter kit and packaging tips to show in your portfolio, see this guide on showcasing micro apps.
Related Reading
- Run Local LLMs on a Raspberry Pi 5: Building a Pocket Inference Node
- Edge Storage for Small SaaS in 2026: Choosing CDNs, Local Testbeds & Privacy-Friendly Analytics
- FlowWeave 2.1 — A Designer-First Automation Orchestrator for 2026
- Building an Offline‑First Field Service App with Power Apps in 2026
- Audit-ready text pipelines: Provenance, Normalization and LLM Workflows for 2026
- No-telemetry Linux hosts for wallet infra: performance and privacy tradeoffs