Devy
An open-source DevOps & SRE co-pilot: one capable agent behind a centralized service, with pluggable “eyes and hands” across your hosts, containers, runbooks, and observability stack. It correlates signals, cites your own docs, and investigates incidents — with a human in the loop for anything consequential.
- Apache-2.0
- License
- 115
- Tests
- 1, by design
- Agents
- human-gated
- Actions on prod
the problem
The hardest AI system I'd built was a DevOps co-pilot for a production trading platform — and its lessons were locked behind an NDA. So I rebuilt the framework from scratch, in the open, for anyone who wants to point a capable, safe-by-default agent at their own infrastructure.
What I set out to build
Devy is the open-source rebuild of the most substantial AI system I'd made: a DevOps and SRE co-pilot I designed and shipped for a demanding production trading platform. The specifics of that system stay private — but its hardest-won lessons, the architecture and the safety posture and the dozens of decisions that each took me a very long time the first time around, are exactly what's reusable. So I rebuilt the framework from scratch, in my own time, and open-sourced it under Apache-2.0.
The mission never changed from the original: use LLMs to take the toil out of incident response, diagnostics, and operational triage — and make the outputs more consistent than what tired humans produce under pressure. Devy is the bootstrap I wish I'd had. Point it at your hosts, containers, runbooks, and observability backends, and it helps you see what's happening, correlate signals, and reason about where a fault most likely lives — grounded in real data, not guesses.
The design I threw away
I started, more than a year before this rebuild, with a confident and elaborate plan: a society of six role-scoped autonomous agents — infrastructure, CI/CD, performance, observability, connectivity, security — each its own MCP server, coordinated by a central “MCP Hub” with an intricate push-pull notification protocol. It was internally coherent. Building the real thing taught me it was mostly machinery the problem didn't need.
The multi-agent design produced the exact failure modes its complexity was meant to prevent. Agents re-explained context to each other and burned tokens, second-guessed one another, and — worst — failed in correlated ways: when the underlying model misjudged something, every agent built on it misjudged the same way, so the redundancy bought far less safety than it appeared to. And a wrong answer could originate anywhere in a chain of handoffs.
What I shipped instead was the opposite: one very capable agent behind a centralized service, with thin, interchangeable surfaces on top. The “Hub” dissolved — with one brain, there was nothing left to coordinate — and MCP went back to being what it actually is: a clean way to expose tools, not an orchestration fabric.
The thesis: a co-pilot, never an autopilot
The original design imagined agents running continuously and acting on their own. What proved genuinely valuable was the opposite — an agent that's always available and deeply context-aware, but invoked by a human: surfacing, correlating, and explaining, with recommendations, while a person stays in the loop for anything consequential. Devy never acts blindly on production.
That wasn't only a reliability choice; it's what made adoption possible at all. A safe-by-default co-pilot is what keeps a security team comfortable pointing it at real systems. The clearest expression of that is the host boundary: Devy inspects live hosts and containers through a declarative, profile-gated allow-list with no shell access — which is precisely what makes aiming it at a production box adoptable.
How it works
At the center is a containerized service I call the LLM-PROXY: it owns all the reasoning, tools, memory, and tracing, and exposes one HTTP/SSE API that every surface — a terminal-themed web chat, a native Go `ask` command, a one-shot endpoint — consumes as a dumb client. The harness loop is mine, owned directly rather than inherited from a heavyweight framework: assemble context, call the model, dispatch tools, repeat.
Two pieces do the heavy lifting. A tools-router solves the “too many tools for one context” problem not by splitting the agent but by splitting the catalog: tools are registered with metadata, and the agent discovers the relevant ones on demand via a single find_tools call — full reach, small working context. And a knowledge layer ingests your runbooks and postmortems into a hybrid index (vector + full-text, reciprocal-rank-fused) that the agent searches and cites, so answers are anchored to your own documentation.
- ▸One agent, one brain; thin web / TUI / HTTP surfaces over a single API
- ▸On-demand tool discovery via find_tools — broad capability, small context
- ▸Hybrid retrieval (vector + full-text) with metadata-rich citations
- ▸Provider-agnostic via LiteLLM — users pick a tier (fast/balanced/deep), not a model
- ▸Two-channel memory: a lossless transcript plus a compact, token-triggered summary
The capability I didn't design for
The single most valuable thing Devy does wasn't on the original roadmap at all: incident root-cause analysis as a correlated event timeline. Because one agent holds broad knowledge of the platform and can pull from live logs, container status, and your docs at once, it can stitch a single, time-ordered narrative across all of them — finding in seconds the needle a human would need hours and a dozen open tabs to assemble.
It ships with a live demo: spin up a container that deliberately crash-loops, and ask Devy to investigate. It pulls the live logs and status through the host boundary, cross-references the matching runbook and past postmortems from the knowledge base, builds a correlated timeline, and returns a ranked root cause with evidence — distinguishing the out-of-memory symptom from the connection-pool-exhaustion cause. That capability wasn't a component I built; it emerged once the brain, the knowledge, and the tools all lived in one place.
Open-sourced, so the next person skips the year I spent
Everything here is the framework and the patterns, not the trading-platform specifics — the reusable foundation, with the dozens of decisions that cost me months already made. It's Apache-2.0, runs end-to-end from a single `docker compose up`, and is covered by 115 tests. The full design lineage — every pivot, and why — is preserved in the repo's JOURNEY doc for anyone setting out where I started.
I built Devy because I'm fully employed and not looking for work, but I do want others to get real value from it. If you're an enterprise leader who'd like help standing it up against your own infrastructure, I'm glad to talk — the contact form is the way to reach me.
built with