Prototype · Computer Vision · 2025 · Builder

Vision Pipeline

A self-hosted, real-time vision foundation for robotics: one camera feed fanned out over a Redis bus so any number of models can watch at once. Today, a YOLO detector drawing live boxes and a Moondream VLM you can chat with; tomorrow, vision-language-action models driving real actuators.

Computer Vision · VLM · Robotics · Real-time · Edge

Mac version (GitHub) ↗ · Jetson version (GitHub) ↗

Architecture: Redis perception bus
Live demo: YOLO + Moondream
Runs on: Mac + Jetson

the problem

I wanted a perception foundation I could eventually point at robotics: one self-hosted pipeline where a single camera feeds many models at once — today to understand a scene, one day to act on it.

A foundation for robots

This is the quiet start of something bigger. A camera — the Mac's, or a Raspberry Pi camera on a Jetson — captures a continuous stream of frames, drops them to disk, and publishes their locations onto a Redis message bus. From there, any number of models can subscribe to the same feed at once: object detectors, classifiers, vision-language models, eventually vision-language-action models — each adding its own read of the world to a shared perception layer.
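The fan-out described above can be sketched as a small message format plus a publish call. This is a minimal illustration, not the project's actual code: the channel name `frames`, the JSON field names, and the helper functions are all assumptions made for the example. The `client` argument is anything with a `publish(channel, message)` method, which is the shape of redis-py's `Redis.publish`.

```python
import json
import time
from pathlib import Path

# Hypothetical channel name; the real pipeline may use a different one.
FRAME_CHANNEL = "frames"


def encode_frame_message(frame_path, frame_id, captured_at=None):
    """Build the JSON payload that announces one frame saved on disk."""
    return json.dumps({
        "frame_id": frame_id,
        "path": str(frame_path),
        "captured_at": time.time() if captured_at is None else captured_at,
    })


def decode_frame_message(raw):
    """Parse a payload back into (frame_id, path, captured_at).

    Redis delivers message data as bytes by default, so accept both forms.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    msg = json.loads(raw)
    return msg["frame_id"], Path(msg["path"]), msg["captured_at"]


def publish_frame(client, frame_path, frame_id):
    """Publish a frame's location; only the path travels over the bus."""
    return client.publish(FRAME_CHANNEL, encode_frame_message(frame_path, frame_id))
```

With redis-py, a model process would typically call `client.pubsub()`, `subscribe(FRAME_CHANNEL)` on the result, and then iterate `listen()`, passing each message's `data` to `decode_frame_message`. Because the payload carries a path rather than pixels, each subscriber loads the image itself and adding another model does not change the publisher.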

Today the demo pairs two of them: a YOLO model drawing real-time bounding boxes around what it recognizes, and a Moondream VLM you can chat with — ask it a question and it describes what it sees, in context. Running the whole thing on self-hosted hardware holds tiny hints of wonder and magic — and the door it opens is the one I'm really after: user-directed but self-actuated robotics, where a vision-language-action model emits JSON that an actuator control system can act on.
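To make the "model emits JSON, controller acts on it" idea concrete, here is a hedged sketch of the receiving side. Nothing here comes from the project: the command schema (`action`, `speed`), the allowed action names, and the speed limit are invented for illustration. The point is only that model output gets validated and clamped before it reaches hardware.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: these action names and limits are illustrative only.
ALLOWED_ACTIONS = {"move", "stop", "grip"}
MAX_SPEED = 1.0


@dataclass(frozen=True)
class ActuatorCommand:
    action: str
    speed: float


def parse_command(raw: str) -> ActuatorCommand:
    """Validate model-emitted JSON before anything reaches an actuator."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("command must be a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    speed = data.get("speed", 0.0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(speed, bool) or not isinstance(speed, (int, float)):
        raise ValueError("speed must be a number")

    # Clamp rather than trust the model's number.
    return ActuatorCommand(action, max(0.0, min(MAX_SPEED, float(speed))))
```

For example, `parse_command('{"action": "move", "speed": 3}')` yields a `move` command with speed clamped to `1.0`, while malformed or unknown output raises `ValueError` instead of moving anything.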

Making it fast, and portable

Real-time vision needs the GPU, but Docker on a Mac runs containers inside a Linux VM, which has no access to Metal. So the models run natively for Metal/MPS acceleration while Redis, the API, and the React frontend stay containerized: a hybrid that keeps the heavy compute fast and the infrastructure tidy. A hands-free continuous-narration mode loops a pinned question and reads out fresh scene descriptions on an adjustable cycle.
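The continuous-narration mode can be pictured as a simple loop. This is a sketch under assumptions, not the project's implementation: `ask` stands in for whatever function sends the pinned question to the VLM and returns its answer, and the `narrate` name and its parameters are hypothetical. Speaking the result is left to the caller.

```python
import time
from typing import Callable, Iterator


def narrate(
    ask: Callable[[str], str],
    question: str,
    interval_s: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Ask the same pinned question repeatedly, pausing between answers.

    `ask` is a placeholder for the call into the vision-language model.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for i in range(cycles):
        yield ask(question)
        # Pause between cycles, but not after the final one.
        if i < cycles - 1:
            sleep(interval_s)
```

A caller would iterate `narrate(ask, "What do you see?", interval_s=5.0, cycles=10)` and hand each description to a text-to-speech step. Changing `interval_s` is what makes the cycle adjustable.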

The same design re-targets cleanly to NVIDIA's Jetson Orin Nano with a Raspberry Pi camera, so the perception layer isn't bound to one machine. Both versions are open source.
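One way the same model code could run on both machines is to choose the PyTorch device at startup. The helper below is an assumption about how that might look, not code from either repository: `pick_device` is a hypothetical name, and it falls back to the CPU when no accelerator (or no PyTorch install) is found.

```python
def pick_device() -> str:
    """Return the best available PyTorch device name for this machine.

    Prefers CUDA (e.g. a Jetson), then Apple's MPS backend (a Mac running
    natively), then the CPU.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"
```

A model would then be moved with `model.to(pick_device())`, so the hardware difference between the Mac and the Jetson stays in one place.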

built with

Python · PyTorch · YOLO11 · Moondream2 · Redis · FastAPI · React · Jetson