~/darrells.ai
← back to work
PrototypeEdge AI · 2026 · Builder

LLM on a Raspberry Pi

A full ChatGPT-style assistant running entirely on a Raspberry Pi you can hold in your hand — a small LLM on a Hailo NPU at around 15 tokens a second, with nothing ever leaving the device.

Edge AILLMHardwareLocal-first
Pi 5 + Hailo
Runs on
~15 tok/s
Throughput
0
Cloud calls

the problem

Single-board computers scratch the same itch as trading infrastructure: squeezing every last ounce of performance out of modest hardware. I wanted to see whether I could run a genuinely useful LLM, fully on-device, on a Raspberry Pi you can hold in one hand.

The same itch as my day job

Electronics, microcontrollers, and single-board computers might look a world away from electronic trading systems — but they exercise the same muscles. Both are about wringing every megabyte and clock cycle out of hardware that doesn't have much to give. It's the discipline of my day job, in miniature, and a genuinely fun way to keep it sharp.

This build pairs a Raspberry Pi 5 with a Hailo NPU — a neural-processing HAT that bolts a bit of AI horsepower onto the board — to run a small LLM entirely on-device, behind a web experience that feels like ChatGPT. The hardware was the easy part; fitting a capable model into that tight memory-and-compute envelope was the real puzzle.

The result genuinely surprised me: around 15 tokens a second, with plenty of intelligence to hold in your hand — no cloud, no account, nothing leaving the device — fronted by a clean Open WebUI with full chat-history recall. I open-sourced the whole thing, so if you've got a Pi 5 and the Hailo HAT, it's close to a one-command setup.

It looks like Ollama, but it isn't

The Hailo stack ships an API-compatible server — but there's no real Ollama CLI, and the models are pre-compiled HEF binaries, not GGUF files. Half the project is a setup guide that documents the actual working steps and the gotchas (“dpkg doesn't download,” “there is no ollama command”) where the official docs leave you stranded.

The result runs llama3.2 and a handful of 1.5B models entirely on the Hailo NPU, fronted by Open WebUI in Docker and managed by a single health-polling control script with systemd auto-start. A private ChatGPT-style assistant on a board that fits in your palm.

Open WebUI running llama3.2 on the Pi — fully local

built with

Raspberry Pi 5Hailo NPUHailoRTOpen WebUIDocker