
Running Node + TypeScript on Raspberry Pi 5 with the new AI HAT+ 2: a hands-on guide

typescript

2026-01-24


Step-by-step guide to run Node + TypeScript on Raspberry Pi 5 with AI HAT+ 2 — setup, local vs remote inference, and performance tuning.

Get a high-performance Pi + AI HAT+ 2 devbox running Node + TypeScript — fast

If you're maintaining backend services or building edge AI prototypes, getting a Raspberry Pi 5 to run Node + TypeScript with an AI HAT+ 2 for local inference is one of the fastest ways to move from proof-of-concept to deployable, privacy-preserving systems. This guide walks you through a reproducible setup in 2026: hardware hookup, OS and runtime configuration, a TypeScript/Node app that chooses local inference or offload, and concrete performance tuning tips.

Why this matters in 2026

Edge AI matured rapidly between 2023 and 2026: optimized quantized models, pervasive NN accelerators on tiny boards, and tooling that integrates native inference into JavaScript runtimes. For teams, that means you can run meaningful LLM-style inference near users for lower latency, better privacy, and lower bandwidth costs. The Raspberry Pi 5 + AI HAT+ 2 combo is now a widely available, cost-effective edge platform; this article shows how to make it robust for TypeScript-driven production workflows.

What you'll get

A checklist to prepare Pi 5 and AI HAT+ 2
Step-by-step OS and Node/TypeScript install (arm64 optimized)
A sample TypeScript app that runs local inference or offloads to a model endpoint
Concrete performance tuning and build/tooling advice

Prerequisites

Raspberry Pi 5 running a 64-bit OS, with a power supply rated for the Pi plus your AI HAT+ 2
AI HAT+ 2 board and adapter cable (follow manufacturer packing list)
16GB+ microSD or NVMe storage (for models use NVMe or USB SSD)
Another computer for flashing the OS and SSH access

Quick architecture overview

The AI HAT+ 2 typically exposes a hardware accelerator (NPU/TPU-like) via a high-speed interface. Your app will either:

Run local inference using an on-device runtime (ONNX/TF Lite/ggml/llama.cpp node bindings) that uses the HAT's accelerator, or
Offload heavy work to a cloud or LAN model endpoint when model size or latency constraints require it.

Step 1 — Prepare the OS

1.1 Flash a 64-bit OS

Use a current 64-bit Raspberry Pi OS or Ubuntu Server ARM64 build (2026 images). The 64-bit kernel matters: inference runtimes and native Node binaries perform significantly better on arm64.

From your workstation:

# Example using Raspberry Pi Imager or curl+dd
curl -L -o ubuntu-raspi.img.xz https://cdimage.ubuntu.com/ubuntu-server/24.04/...-pi.img.xz
xz -d ubuntu-raspi.img.xz
sudo dd if=ubuntu-raspi.img of=/dev/sdX bs=4M status=progress conv=fsync

1.2 First boot & base packages

# After boot & SSH in
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git curl htop unzip i2c-tools pciutils lm-sensors
# Optional: install zram-tools for swap
sudo apt install -y zram-tools

Enable SSH, configure locale/timezone, and set a strong password. If your AI HAT+ 2 requires I2C/SPI, run:

sudo raspi-config nonint do_i2c 0  # if using raspi-config compatible OS
# Or edit /boot/firmware/config.txt according to vendor docs

Step 2 — Attach the AI HAT+ 2 and verify

Follow the manufacturer's mechanical instructions. After attaching power and cabling, verify that the kernel sees the device:

# For USB-attached HAT
lsusb
# For PCIe-attached or M.2 HAT
lspci -v
# Generic check
dmesg | tail -n 50

If the device requires a firmware blob, copy it to /lib/firmware per vendor instructions and reboot. Use the vendor's diagnostics to confirm the board is healthy.
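
The same check can be scripted so the app fails early when the accelerator is missing. This is a minimal sketch, assuming the HAT enumerates on PCIe and that its `lspci` line contains a recognizable vendor keyword; the function name `findPciDevices`, the keyword `ExampleVendor`, and the sample output are placeholders, not taken from vendor documentation.

```typescript
// Pure helper: returns the lspci lines that mention the keyword (case-insensitive).
function findPciDevices(lspciOutput: string, keyword: string): string[] {
  const needle = keyword.toLowerCase();
  return lspciOutput
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0 && line.toLowerCase().includes(needle));
}

// Hypothetical sample output; real output depends on your board and firmware.
const sample = [
  '0000:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries Device 2712',
  '0000:01:00.0 Co-processor: ExampleVendor Example NPU (rev 01)',
].join('\n');

console.log(findPciDevices(sample, 'ExampleVendor'));
```

On the Pi you would pass in the real output, for example from `execFileSync('lspci', [], { encoding: 'utf8' })` in `node:child_process`, and refuse to start local inference when the result is empty.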

Step 3 — Install Node.js and TypeScript toolchain

Recommendation (2026): Use an active LTS release, Node 22 or newer. Node 20 reaches end-of-life in April 2026, and current LTS lines have wide prebuilt arm64 support for native modules.

# Install Node via NodeSource or nvm
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
# Verify
node -v && npm -v
# Install pnpm for fast installs (optional; a system-wide Node install needs sudo for global packages)
sudo npm i -g pnpm
# Project-level dev tools (run inside your project directory)
pnpm add -D typescript ts-node esbuild nodemon
npx tsc --version

tsconfig: tuned for fast iteration

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "CommonJS",
    "moduleResolution": "node",
    "outDir": "dist",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "sourceMap": true
  },
  "include": ["src"]
}

Use esbuild/tsup for production bundles to avoid heavy runtime transpilation on the Pi.

Step 4 — Choose a runtime for local inference

By 2026 the ecosystem has matured: popular options include onnxruntime-node, native bindings to llama.cpp or ggml, and vendor-provided SDKs that expose the HAT+ 2 accelerator. Choose based on model format and vendor support:

ONNX: broad tooling, use onnxruntime-node (arm64 native build)
ggml/llama.cpp: excellent for quantized LLMs (4-bit/8-bit) on CPU or small NPUs
Vendor SDK: fastest path to hardware acceleration on the AI HAT+ 2
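
The choice above can be made explicit in code by routing on the model file's format. This is a sketch under assumptions: the function name `pickRuntime` is illustrative, `.gguf` is assumed to be the format you use with llama.cpp, and `.npu` is a made-up placeholder for whatever compiled format your vendor SDK expects.

```typescript
type Runtime = 'onnxruntime-node' | 'llama.cpp' | 'vendor-sdk';

// Maps a model file extension to a runtime. vendorExtensions is supplied by the
// caller because the vendor format is board-specific (hypothetical here).
function pickRuntime(modelPath: string, vendorExtensions: string[] = []): Runtime {
  const base = modelPath.slice(modelPath.lastIndexOf('/') + 1);
  const dot = base.lastIndexOf('.');
  const ext = dot > 0 ? base.slice(dot).toLowerCase() : '';
  if (vendorExtensions.includes(ext)) return 'vendor-sdk';
  if (ext === '.onnx') return 'onnxruntime-node';
  if (ext === '.gguf') return 'llama.cpp';
  throw new Error(`No runtime configured for model format "${ext}"`);
}

console.log(pickRuntime('/models/small_quant.onnx'));
console.log(pickRuntime('/models/tiny.Q4.gguf'));
console.log(pickRuntime('/models/detector.npu', ['.npu']));
```

Throwing on an unknown format is a deliberate choice in this sketch: a misnamed model file surfaces at startup instead of silently falling through to the wrong runtime.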

Install onnxruntime-node (example)

pnpm add onnxruntime-node@latest
# If no prebuilt binary is available for your architecture, build from source per the ONNX Runtime docs

Step 5 — Example TypeScript app: local vs offload

We'll create a minimal Node + TypeScript app that:

Loads a small quantized model and runs local inference when available
Falls back to an offload endpoint when the model is too large or local resources are constrained

Project structure

my-edge-app/
├─ package.json
├─ tsconfig.json
└─ src/
   ├─ index.ts
   └─ inference/
      ├─ local.ts
      └─ remote.ts

package.json (scripts)

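{
  "name": "pi-ai-edge",
  "version": "0.1.0",
  "type": "commonjs",
  "scripts": {
    "dev": "nodemon --watch src --ext ts --exec 'ts-node src/index.ts'",
    "build": "esbuild src/index.ts --bundle --platform=node --external:onnxruntime-node --outfile=dist/index.js",
    "start": "node dist/index.js"
  },
  "dependencies": {
    "node-fetch": "^2.7.0",
    "onnxruntime-node": "^1.14.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0",
    "@types/node-fetch": "^2.6.0",
    "typescript": "^5.5.0",
    "ts-node": "^10.9.1",
    "esbuild": "^0.18.0",
    "nodemon": "^2.0.22"
  }
}

Three details matter here. node-fetch is pinned to 2.x because 3.x is ESM-only and cannot be loaded with require() from a CommonJS project. onnxruntime-node is marked external because esbuild cannot bundle its native .node addon, so it must stay in node_modules at runtime. The --ext ts flag makes nodemon restart on TypeScript changes when the command is given through --exec.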

src/inference/local.ts (simplified)

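import * as ort from 'onnxruntime-node';

// Runs one inference pass on the CPU execution provider.
// For repeated calls, create the session once and reuse it; loading the model is the slow part.
export async function runLocalInference(modelPath: string, inputTensor: Float32Array): Promise<Float32Array> {
  const session = await ort.InferenceSession.create(modelPath, { executionProviders: ['cpu'] });
  const tensor = new ort.Tensor('float32', inputTensor, [1, inputTensor.length]);
  // Use the model's own tensor names instead of assuming 'input' and 'output'.
  const feeds = { [session.inputNames[0]]: tensor };
  const results = await session.run(feeds);
  return results[session.outputNames[0]].data as Float32Array;
}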
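
Creating a session on every call repeats the model load each time. One way to avoid that is a small cache keyed by model path. The sketch below is generic so it runs without onnxruntime installed; `createSessionCache` is an illustrative name, and the stand-in loader is hypothetical. In the app, the loader would be a call to `ort.InferenceSession.create`.

```typescript
// Caches one in-flight or completed load per model path so repeated calls share it.
function createSessionCache<T>(load: (modelPath: string) => Promise<T>) {
  const cache = new Map<string, Promise<T>>();
  return function getSession(modelPath: string): Promise<T> {
    let pending = cache.get(modelPath);
    if (!pending) {
      pending = load(modelPath).catch(err => {
        cache.delete(modelPath); // do not cache failures, so a later call can retry
        throw err;
      });
      cache.set(modelPath, pending);
    }
    return pending;
  };
}

// Demo with a stand-in loader (hypothetical); it only counts how often it runs.
let loads = 0;
const getSession = createSessionCache(async (p: string) => ({ path: p, id: ++loads }));
const first = getSession('/models/small_quant.onnx');
const second = getSession('/models/small_quant.onnx');
console.log(first === second, loads);
```

Caching the promise instead of the resolved session means two requests that arrive during the initial load still trigger only one load.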

src/inference/remote.ts (example offload)

import fetch from 'node-fetch';

export async function runRemoteInference(endpoint: string, prompt: string) {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  });
  if (!res.ok) throw new Error(`Remote inference failed: ${res.statusText}`);
  return await res.json();
}
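
A remote endpoint on a flaky network can hang indefinitely, which is worse than failing. The following sketch adds a timeout with `AbortController` (a global in current Node releases). The name `runRemoteWithTimeout` and the `FetchLike` type are illustrative; the fetch function is injected so the logic can be exercised offline with a fake.

```typescript
// Minimal shape of the fetch function this sketch relies on, so a fake can be injected.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Aborts the request if the endpoint has not fully answered within timeoutMs.
async function runRemoteWithTimeout(
  endpoint: string,
  prompt: string,
  timeoutMs: number,
  fetchImpl: FetchLike,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`Remote inference failed with HTTP ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer); // always release the timer, whether the call succeeded or not
  }
}
```

In the app you would pass the fetch implementation you already use as `fetchImpl`. The timeout value is workload-specific; measure typical endpoint latency before picking one.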

src/index.ts (decision logic)

import { runLocalInference } from './inference/local';
import { runRemoteInference } from './inference/remote';
import fs from 'fs';
import os from 'os'; // needed for os.freemem() in the resource check below

const MODEL_PATH = '/models/small_quant.onnx';
const REMOTE_ENDPOINT = process.env.REMOTE_ENDPOINT || 'https://api.my-model-host.local/infer';

async function main() {
  const prompt = 'Summarize the following text...';

  // Simple resource check: if model file exists and free memory > threshold, run local
  let canRunLocal = false;
  try {
    const stats = fs.statSync(MODEL_PATH);
    const freeMem = os.freemem();
    canRunLocal = stats.size < 200 * 1024 * 1024 && freeMem > 300 * 1024 * 1024; // example thresholds
  } catch (e) {
    canRunLocal = false;
  }

  if (canRunLocal) {
    console.log('Running local inference');
    const input = new Float32Array([/* encoded prompt */]);
    const out = await runLocalInference(MODEL_PATH, input);
    console.log('Local out', out.slice(0, 10));
  } else {
    console.log('Offloading to remote endpoint');
    const res = await runRemoteInference(REMOTE_ENDPOINT, prompt);
    console.log('Remote result:', res);
  }
}

main().catch(err => { console.error(err); process.exit(1); });
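
The resource check is easier to test and tune when it is pulled out of `main` into a pure function. This is one possible shape, reusing the example thresholds from the code above; the names `chooseBackend` and `DEFAULT_LIMITS` are illustrative.

```typescript
type Backend = 'local' | 'remote';

interface LocalLimits {
  maxModelBytes: number; // largest model we are willing to load on-device
  minFreeBytes: number;  // free memory required before attempting local inference
}

const MiB = 1024 * 1024;
const DEFAULT_LIMITS: LocalLimits = { maxModelBytes: 200 * MiB, minFreeBytes: 300 * MiB };

// modelBytes is null when the model file does not exist on this device.
function chooseBackend(
  modelBytes: number | null,
  freeBytes: number,
  limits: LocalLimits = DEFAULT_LIMITS,
): Backend {
  if (modelBytes === null) return 'remote';
  const fits = modelBytes < limits.maxModelBytes && freeBytes > limits.minFreeBytes;
  return fits ? 'local' : 'remote';
}

console.log(chooseBackend(120 * MiB, 900 * MiB));
console.log(chooseBackend(120 * MiB, 100 * MiB));
console.log(chooseBackend(null, 900 * MiB));
```

In `main`, the inputs would come from `fs.statSync(MODEL_PATH).size` (or `null` when the call throws) and `os.freemem()`.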

Step 6 — Model deployment and storage

Store models on an attached NVMe SSD or a fast USB SSD. MicroSD cards are fine for the OS and small artifacts, but loading large model files from them is slow, and any swap placed on microSD will thrash under memory pressure.

Use compressed, quantized models (4-bit/8-bit) to reduce memory use and improve speed.
Memory-map files when supported (some runtimes support memory-mapped ONNX).
Use incremental downloads or chunked model loading for very large models.

Performance tuning — actionable tips

Here are concrete knobs that matter on Pi 5 + AI HAT+ 2 in 2026:

1. Monitor and tune CPU/GPU/NPU usage

Use top/htop for CPU load and the vendor SDK tools for accelerator utilization.
Set OMP_NUM_THREADS or env vars your runtime uses. For small NPUs, fewer threads often wins:

export OMP_NUM_THREADS=4
export GOMP_CPU_AFFINITY="0-3"
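
If you launch the inference process from Node, the same values can be derived from the core count instead of being hard-coded. This is a sketch with illustrative names (`pickThreadCount`, `threadEnv`); reserving one core for the event loop and the OS is an assumption to validate with your own benchmarks, not a measured optimum.

```typescript
// Leaves `reservedCores` free for the Node event loop and the OS; never returns less than 1.
function pickThreadCount(logicalCores: number, reservedCores = 1): number {
  if (!Number.isInteger(logicalCores) || logicalCores < 1) {
    throw new RangeError(`logicalCores must be a positive integer, got ${logicalCores}`);
  }
  return Math.max(1, logicalCores - reservedCores);
}

// Builds the environment entries from the shell example for a given thread count.
function threadEnv(threads: number): { OMP_NUM_THREADS: string; GOMP_CPU_AFFINITY: string } {
  return {
    OMP_NUM_THREADS: String(threads),
    GOMP_CPU_AFFINITY: threads > 1 ? `0-${threads - 1}` : '0',
  };
}

// The Pi 5 has four cores, so this prints settings for three worker threads.
console.log(threadEnv(pickThreadCount(4)));
```

These variables are generally read when the native runtime initializes, so pass them in the environment of the process you start (for example the `env` option of `child_process.spawn`, or a systemd unit) instead of changing them after the runtime has loaded. In practice `logicalCores` would come from `os.cpus().length`.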

2. Use quantized models

Quantize from FP16 to 8-bit or 4-bit where accuracy allows. In 2025–2026, quantization toolchains (like Hugging Face Optimum, ONNX Runtime quantization, and the llama.cpp/ggml tools) can deliver substantial speedups on small accelerators. Test accuracy vs latency tradeoffs.
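
To see why bit width matters on a memory-constrained board, it helps to estimate the weight footprint directly: parameters times bits per weight, divided by eight. The sketch below does only that arithmetic; `estimateWeightBytes` is an illustrative name, and the one-billion-parameter model is a made-up example. Real files are somewhat larger because of quantization scales and metadata, and inference also needs memory for activations.

```typescript
// Approximate size of the weights alone, in bytes.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return Math.ceil((parameterCount * bitsPerWeight) / 8);
}

const GiB = 1024 ** 3;
const params = 1_000_000_000; // hypothetical 1B-parameter model

for (const bits of [16, 8, 4]) {
  const gib = estimateWeightBytes(params, bits) / GiB;
  console.log(`${bits}-bit: ${gib.toFixed(2)} GiB`);
}
```

For this example the estimate drops from roughly 1.86 GiB at FP16 to about 0.93 GiB at 8-bit and 0.47 GiB at 4-bit, which is often the difference between fitting in RAM and swapping.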

3. Memory and swap

Enable zram for fast compressed swap (less wear, faster than microSD swap)
Ensure you're not swapping big tensors to slow microSD — use SSD for models

# Example zram setup (Debian/Ubuntu family)
sudo apt install zram-tools
sudo systemctl enable --now zramswap.service
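
To confirm that swap is actually staying quiet under load, the app can read `/proc/meminfo`, the standard Linux interface for memory counters. This is a minimal parser sketch with illustrative names (`parseMeminfo`, `memorySummary`); the sample text is invented for the demo.

```typescript
// Parses /proc/meminfo text into bytes per field. Lines without a "kB" unit
// (for example HugePages_Total) are plain counts and are returned unchanged.
function parseMeminfo(text: string): Record<string, number> {
  const fields: Record<string, number> = {};
  for (const line of text.split('\n')) {
    const match = /^([^:]+):\s+(\d+)(\s+kB)?\s*$/.exec(line);
    if (!match) continue;
    const value = Number(match[2]);
    fields[match[1]] = match[3] ? value * 1024 : value;
  }
  return fields;
}

// Reduces the parsed fields to the two numbers the tips above care about.
function memorySummary(text: string): { availableBytes: number; swapUsedBytes: number } {
  const f = parseMeminfo(text);
  return {
    availableBytes: f['MemAvailable'] ?? 0,
    swapUsedBytes: (f['SwapTotal'] ?? 0) - (f['SwapFree'] ?? 0),
  };
}

// Invented sample; on the Pi, read the real file with fs.readFileSync('/proc/meminfo', 'utf8').
const sampleMeminfo = [
  'MemTotal:        8245760 kB',
  'MemAvailable:    5120000 kB',
  'SwapTotal:       2097152 kB',
  'SwapFree:        2000000 kB',
  'HugePages_Total:       0',
].join('\n');

console.log(memorySummary(sampleMeminfo));
```

`MemAvailable` is usually a better signal than raw free memory for deciding whether a model will fit, because it accounts for page cache the kernel can reclaim.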

4. Thermal and power management

AI workloads heat the Pi. Use a case with active cooling and set conservative CPU governors if thermal throttling occurs. Monitor with:

vcgencmd measure_temp  # Raspberry Pi OS
sensors
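
For in-app monitoring, both temperature sources are easy to parse. The sketch below handles the `vcgencmd measure_temp` output format and the millidegree value Linux exposes under `/sys/class/thermal/thermal_zone0/temp`. The function names and the 80 °C default limit are illustrative assumptions; check your board's documented throttling thresholds before relying on a specific number.

```typescript
// Parses output such as "temp=48.3'C" from `vcgencmd measure_temp`.
function parseVcgencmdTemp(output: string): number | null {
  const match = /temp=([\d.]+)'C/.exec(output);
  return match ? Number(match[1]) : null;
}

// Parses the sysfs thermal zone value, which is reported in millidegrees Celsius.
function parseSysfsTemp(raw: string): number | null {
  const trimmed = raw.trim();
  if (trimmed === '') return null;
  const milli = Number(trimmed);
  return Number.isFinite(milli) ? milli / 1000 : null;
}

// limitC is an assumed default for this sketch, not a vendor-specified value.
function isTooHot(celsius: number, limitC = 80): boolean {
  return celsius >= limitC;
}

console.log(parseVcgencmdTemp("temp=48.3'C\n"), parseSysfsTemp('51000\n'), isTooHot(51));
```

Returning `null` for unparseable input lets the caller distinguish "sensor unavailable" from "temperature is fine", which matters on OS images that do not ship `vcgencmd`.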

5. Pick the right build toolchain

Bundle server code with esbuild or swc. Avoid heavyweight TypeScript compilation on-device during production.
Use prebuilt native modules for arm64 or cross-compile them on faster machines and copy artifacts to the Pi.

6. Smart offloading strategies

Rather than a binary local-or-remote decision, consider hybrid strategies:

Run lightweight candidate generation locally, offload final scoring to larger remote model.
Cache recent inputs and responses to reduce repeated offloads.
Use model distillation or cascaded models: tiny local model for most inputs, bigger remote model for edge cases.
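
The caching and cascade ideas combine naturally: answer from cache when possible, otherwise try the local model and escalate only when it is unsure. The sketch below is one way to wire that up. All names are illustrative, and it assumes the local model can report some confidence score, which not every runtime provides.

```typescript
// Small LRU cache built on Map insertion order (the oldest entry comes first).
class ResponseCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    this.entries.delete(key); // re-insert to mark the entry as recently used
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

interface Draft { text: string; confidence: number }

// Cascade: cache first, then the local model, then the remote model for low-confidence drafts.
async function answer(
  prompt: string,
  local: (p: string) => Promise<Draft>,
  remote: (p: string) => Promise<string>,
  cache: ResponseCache<string>,
  minConfidence = 0.8, // assumed cutoff; tune against your own accuracy data
): Promise<string> {
  const cached = cache.get(prompt);
  if (cached !== undefined) return cached;
  const draft = await local(prompt);
  const result = draft.confidence >= minConfidence ? draft.text : await remote(prompt);
  cache.set(prompt, result);
  return result;
}
```

Keying the cache on the raw prompt is the simplest choice; if prompts vary only in whitespace or casing, normalizing the key first will raise the hit rate.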

Editor & Dev tooling integration

For developer productivity on the Pi:

Use VS Code Remote - SSH or code-server on the Pi. That removes the need to edit files directly on device.
Set up a devcontainer on your workstation that matches the Pi's arm64 environment for consistent builds.
Use esbuild or swc for fast local iteration and build caching.

Example: attach VS Code to the Pi and run the dev script. Use nodemon + ts-node for quick feedback, but build production bundles for deployment.

Debugging common problems

Device not found: check dmesg, lsusb, lspci. Confirm firmware and power supply.
onnxruntime fails to load binary: you likely need an arm64 build. Rebuild from source or install a package version that ships a prebuilt linux-arm64 binary.
Out of memory: reduce batch size, use quantized model, enable zram, move to SSD.
Thermal throttling: add active cooling, switch to a more conservative CPU governor, or lower the OMP thread count.

Keep experiments reproducible: version your model artifacts, record environment variables (OMP_NUM_THREADS, LD_PRELOAD for vendor libs), and keep benchmark scripts.
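
One lightweight way to do this is to write a small run record next to each benchmark result. The sketch below hashes the model bytes and captures selected environment variables; `sha256Hex` and `snapshotEnv` are illustrative names, and the `'abc'` input is a stand-in for real model bytes.

```typescript
import { createHash } from 'node:crypto';

// Hex SHA-256 of a model artifact's bytes, usable as a stable version identifier.
function sha256Hex(data: Uint8Array | string): string {
  return createHash('sha256').update(data).digest('hex');
}

// Picks the listed variables from an environment object; unset ones are recorded as null.
function snapshotEnv(
  env: Record<string, string | undefined>,
  keys: string[],
): Record<string, string | null> {
  const snapshot: Record<string, string | null> = {};
  for (const key of [...keys].sort()) snapshot[key] = env[key] ?? null;
  return snapshot;
}

const record = {
  modelSha256: sha256Hex('abc'), // stand-in bytes; hash the real model file in practice
  env: snapshotEnv({ OMP_NUM_THREADS: '4' }, ['OMP_NUM_THREADS', 'LD_PRELOAD']),
};
console.log(JSON.stringify(record, null, 2));
```

In a real run you would pass `process.env` and the model file's contents. For multi-gigabyte models, feed the hash from a read stream instead of loading the whole file into memory.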

Real-world checklist before production

Model is quantized and validated against accuracy targets
Storage is on SSD and models are memory-mapped where supported
Native modules are prebuilt for arm64 and bundled
Startup scripts set environment variables and pin CPU affinities
Monitoring reports latency, utilization, temperature, and swap usage
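
The monitoring item is most useful when it drives an action. The sketch below flags a restart only after several consecutive breaches, so a single latency spike does not bounce the service. The metric names, the limits, and the three-sample rule are all assumptions chosen for illustration.

```typescript
interface Metrics {
  latencyMs: number;
  temperatureC: number;
  swapUsedBytes: number;
}

// Returns the names of the metrics that exceed their limit.
function findBreaches(metrics: Metrics, limits: Metrics): (keyof Metrics)[] {
  const keys: (keyof Metrics)[] = ['latencyMs', 'temperatureC', 'swapUsedBytes'];
  return keys.filter(key => metrics[key] > limits[key]);
}

// Returns a function that reports true once `maxConsecutive` samples in a row breach a limit.
function createWatchdog(limits: Metrics, maxConsecutive: number) {
  let streak = 0;
  return (metrics: Metrics): boolean => {
    streak = findBreaches(metrics, limits).length > 0 ? streak + 1 : 0;
    return streak >= maxConsecutive;
  };
}

// Hypothetical limits; derive real ones from your own baseline measurements.
const limits: Metrics = { latencyMs: 500, temperatureC: 80, swapUsedBytes: 256 * 1024 * 1024 };
const shouldRestart = createWatchdog(limits, 3);
console.log(findBreaches({ latencyMs: 120, temperatureC: 84, swapUsedBytes: 0 }, limits));
```

Call `shouldRestart` with each new sample from your metrics loop and hand the decision to your process supervisor (for example, exit and let systemd restart the service).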

Future-proofing: trends to watch (2026)

Hardware-accelerated runtimes will dominate: expect more vendor SDKs exposing NN APIs directly to Node in 2026.
Model distillation and on-device personalization make the Pi a viable endpoint for user-specific models.
WebNN and unified accelerator APIs are becoming standard, easing cross-runtime development.
Quantized, privacy-preserving tiny LLMs will be first-class citizens in edge deployments.

Actionable takeaways

Always run arm64 OS and Node for best performance on Pi 5.
Prefer quantized models and SSD storage when doing on-device LLM-like work.
Build native modules off-device; bundle with esbuild for production.
Implement a hybrid local/offload strategy to manage latency and cost.
Monitor thermal, memory, and accelerator utilization — automate restarts when thresholds are breached.

Wrap-up and next steps

The Raspberry Pi 5 paired with AI HAT+ 2 is a practical, cost-effective edge AI platform in 2026. By using an arm64 OS, prebuilt native binaries, model quantization, and a hybrid local/offload architecture, you can run meaningful inference from a TypeScript/Node app with acceptable latency and robust developer workflows. Start small: validate with a tiny quantized model, measure, then scale up your model complexity and offload strategies as required.

Try the sample project on your Pi 5 this week: flash an arm64 image, attach the AI HAT+ 2, and run the example. If you run into a specific bottleneck (build, memory, or accelerator access), save the logs and share them with your team or community for faster diagnosis. Want a ready-to-run template or a devcontainer tuned for Pi 5 + AI HAT+ 2? Download the companion repo linked from our TypeScript website and get a working prototype in under an hour.
