
226 GitHub stars in 3 days. 427 upvotes. 143 comments. That's the thread: u/russellenvy on r/ClaudeAI, "I replaced chaotic solo Claude coding with a simple 3-agent team (Architect + Builder + Reviewer)."
I read it at 11pm on a Wednesday and then read all 143 comments because the thread was doing something unusual. It wasn't the usual AI hype loop. The upvotes were real signal — people who'd tried solo Claude sessions long enough to hit the same wall. The comments split almost evenly between two questions the repo didn't answer.
What russellenvy Got Right
The core insight is worth quoting directly: "The solution isn't a better prompt. It's a process."
That's true, and it's the thing most people spend months figuring out on their own. When you hit a complex feature and the single Claude session starts losing track of what it decided three turns ago, the instinct is to write a longer system prompt. Or switch models. Or add more context. None of that addresses the real problem, which is that you're asking one thing to simultaneously hold the architecture, do the implementation, and review its own work. Those are not the same cognitive task. The conflict is structural.
The three-role split addresses this directly. Architect doesn't write code — it produces a spec. Builder consumes the spec and writes code without second-guessing the design. Reviewer reads the code against the spec without being emotionally attached to either. Each role starts with a clean context window, no accumulated bias from the previous turn.
The second quote in the thread is also right: "Three is not arbitrary — it is the minimum for meaningful review and the maximum before coordination overhead eats the gain." I've run 250+ missions with nanika, my own orchestrator, and the empirical shape of this claim holds up. Two roles and you collapse back into the self-review bias problem. Four or more and you spend more context on coordination than on work.
Three isn't magic. It's the minimum viable separation of concerns for a software task.

The Two Anxieties
143 comments and the thread splits almost cleanly into two categories.
The first: how do they pass state? How does the Builder get the Architect's spec? What format? What if the spec is ambiguous — does the Builder ask, or guess, or fail? What if the Architect produces a design that's incompatible with something the Builder discovered mid-implementation? The thread has good workarounds but no answer. People are passing context as markdown via clipboard, or piping files between sessions manually, or just copy-pasting. The repo doesn't ship a state-passing mechanism. It ships a pattern.
The second: what about tasks that don't fit three roles? A backend feature might need Architect + Builder + Reviewer. But a research task needs Researcher + Synthesizer. A content pipeline needs Scout + Writer + Editor. People in the thread are forking the pattern, naming new agents, wondering if they should add a QA agent or a DevOps agent. The meta-question underneath all of this: how do I know what roles the task actually needs, and who decides?
Both anxieties point at the same gap. The 3-agent pattern is a team. What's missing is a coach — something that reads the task, assigns the right roles, passes output forward, and decides when to retry instead of continuing.
The Coordinator Problem
The obvious response is: add a 4th agent as a coordinator. Call it a PM agent, a director agent, an orchestrator agent. Have it decompose the task, assign the other agents their roles, collect their outputs, and decide what to do with failures.
I built this. It was one of the first things I tried. The coordinator was an LLM with a system prompt about task management. It ran the other three agents as tools or subprocesses. It decomposed tasks into subtasks and passed them along.
It was worse than the three-agent pattern. Not catastrophically worse — but it introduced a failure mode the original pattern didn't have. When the coordinator made a wrong decomposition decision, the error propagated silently through all three downstream agents. The architect wrote a spec for the wrong thing. The builder implemented it. The reviewer approved it. The whole run produced confident, polished output for a task nobody asked for.
A human coordinator can catch that kind of error because they understand the original intent. An LLM coordinator hallucinates intent from context. The more context it has, the better — but you've just put all your eggs back in the single-context-window basket, one level up.
The coordinator can't be an LLM making routing decisions. It has to be deterministic.
What 250 Missions Taught
The nanika orchestrator is a Go binary. Not an agent, not an LLM, not a PM. A binary that reads a mission file, parses phase definitions and dependency edges, and dispatches work in topological order. It caps concurrent phases at three. When a phase completes, it injects its full output into the context of every downstream phase that depends on it. When a phase fails, it retries up to three times — and if all retries fail, it skips the transitive dependents and continues with independent branches.
The coordinator doesn't hallucinate because it doesn't reason. It just reads the dependency graph and follows it.
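Stripped of detail, that loop is small. Here's a minimal sketch in Go; everything in it (`Phase`, `runPhase`, `dispatch`) is illustrative scaffolding I'm inventing for the example, not nanika's actual code, and the stub runner always succeeds so the retry path is only shown in shape:

```go
package main

import "fmt"

// Phase is an illustrative stand-in for a parsed mission phase.
type Phase struct {
	Name string
	Deps []string
}

const maxRetries = 3

// runPhase stands in for invoking an agent on a phase's prompt.
func runPhase(p Phase) (string, error) {
	return "output of " + p.Name, nil
}

// dispatch runs phases in dependency order: a phase starts only once
// every phase it depends on has completed. Each phase gets up to
// maxRetries attempts; a phase that never succeeds is left out of the
// result, which strands its dependents (they are never "ready").
func dispatch(phases []Phase) map[string]string {
	done := map[string]string{}
	for {
		progressed := false
		for _, p := range phases {
			if _, ok := done[p.Name]; ok {
				continue
			}
			ready := true
			for _, d := range p.Deps {
				if _, ok := done[d]; !ok {
					ready = false
				}
			}
			if !ready {
				continue
			}
			for i := 0; i < maxRetries; i++ {
				if out, err := runPhase(p); err == nil {
					done[p.Name] = out
					break
				}
			}
			progressed = true
		}
		if !progressed {
			return done
		}
	}
}

func main() {
	phases := []Phase{
		{Name: "architect"},
		{Name: "builder", Deps: []string{"architect"}},
		{Name: "reviewer", Deps: []string{"builder"}},
	}
	fmt.Println(len(dispatch(phases)), "phases complete")
}
```

Note there is no branch anywhere that asks a model what to do next; the only inputs to the control flow are the dependency edges and the success/failure of each attempt.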
The three-agent pattern maps directly onto this model. Architect = Phase 1. Builder = Phase 2, DEPENDS: architect. Reviewer = Phase 3, DEPENDS: builder. But the model doesn't stop at three. A content pipeline looks like: Scout → Researcher → Writer → Editor. A feature build looks like: Researcher + Architect (parallel) → Builder → Reviewer. The team size and structure follow the task's actual dependency graph, not a fixed headcount.
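For concreteness, a mission for the three-role mapping might be written like this. The syntax below is hypothetical: I'm extrapolating from the PHASE/DEPENDS vocabulary above, and the actual format is documented in the repo.

```text
MISSION: implement the new config system

PHASE architect
  Produce a written spec: interfaces, file layout, edge cases.

PHASE builder
  DEPENDS: architect
  Implement the spec exactly; do not revisit design decisions.

PHASE reviewer
  DEPENDS: builder
  Check the implementation against the spec; list every deviation.
```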
State passing — the first thread anxiety — resolves into a string concatenation problem:
```go
priorContext := strings.Join(priorParts, "\n\n---\n\n")
```

Phase 2 doesn't negotiate with Phase 1. It receives Phase 1's complete output injected into its context bundle under "Context from Prior Work." No messages. No contracts. No shared memory. The prior phase's full text becomes the next phase's briefing document. This is intentionally dumb. The builder gets the architect's full spec, not a summarized version the coordinator decided was relevant.
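The whole injection step fits in one small function. This is a sketch under my own naming, not nanika's code; only the divider and the "Context from Prior Work" header come from the description above:

```go
package main

import (
	"fmt"
	"strings"
)

// buildContext concatenates the full output of every completed upstream
// phase, separated by a divider, under the briefing header. No summarizing,
// no filtering: downstream phases get everything verbatim.
func buildContext(priorParts []string) string {
	if len(priorParts) == 0 {
		return "" // first phase in the graph gets no prior context
	}
	return "Context from Prior Work:\n\n" +
		strings.Join(priorParts, "\n\n---\n\n")
}

func main() {
	architectSpec := "Spec: config loader with env-var overrides."
	fmt.Println(buildContext([]string{architectSpec}))
}
```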
The second anxiety — more than 3 roles — stops being a problem when the team structure is derived from the task rather than fixed in advance. On average across 250 missions, the orchestrator runs 4 phases. Some missions have 2. The longest I've run had 15. The pattern isn't "always 3 agents." It's "as many phases as the task's dependency graph needs, run in parallel where possible, handed off deterministically."
What the coordinator gets right that an LLM coordinator can't: it computes which phases to skip when something fails. If the architect phase fails after three retries, the builder and reviewer phases are skipped automatically — not because a PM agent decided to skip them, but because the dependency graph makes it structurally impossible for them to run on missing input. Failure handling is a property of the graph, not a judgment call.
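The skip computation is a plain graph traversal. A sketch, assuming a simple map-based dependency graph (the helper name and data shapes are mine, not nanika's):

```go
package main

import "fmt"

// transitiveDependents returns every phase that can no longer run once
// `failed` has exhausted its retries: its direct dependents, their
// dependents, and so on, computed as a fixpoint over the graph.
// deps maps each phase to the phases it depends on.
func transitiveDependents(failed string, deps map[string][]string) map[string]bool {
	skip := map[string]bool{failed: true}
	for changed := true; changed; {
		changed = false
		for phase, ds := range deps {
			if skip[phase] {
				continue
			}
			for _, d := range ds {
				if skip[d] {
					skip[phase] = true
					changed = true
					break
				}
			}
		}
	}
	delete(skip, failed) // report only the downstream phases
	return skip
}

func main() {
	deps := map[string][]string{
		"builder":  {"architect"},
		"reviewer": {"builder"},
		"docs":     {}, // independent branch: keeps running
	}
	fmt.Println(transitiveDependents("architect", deps))
}
```

When the architect fails, builder and reviewer land in the skip set and docs does not. No judgment is involved at any point; the set falls out of the edges.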

The Pattern Is the Easy Part
226 stars says the pattern clicked for a lot of people at the same moment. That makes sense. Three roles, clean separation, no shared state — it's a good mental model and it genuinely works for the tasks russellenvy demoed.
The gap the thread's anxieties point at is real, though. When you want to run this on a complex feature that takes an hour, not a function that takes a minute, the manual version starts to creak. You're the coordinator. You're the one deciding whether the architect's spec is good enough to hand to the builder. You're copy-pasting context between sessions. You're deciding whether to retry the builder when it produces something weird, or hand its output to the reviewer anyway.
That's where the 250-mission number comes from. I didn't want to be the coordinator. I wanted to type `orchestrator run "implement the new config system"` and come back when it was done. The three-agent pattern was the right model. The binary coordinator was the missing piece.
What's still not solved — and what the thread doesn't even ask about — is quality evaluation. The reviewer phase in nanika produces output, but the orchestrator doesn't evaluate whether that output is correct. It checks that output is non-empty and doesn't start with "Error:". That's a format gate, not a quality gate. The reviewer catches what the reviewer catches. A failed review that produces 300 words of confident-sounding feedback passes the gate.
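To show how thin that gate is, here's the check in full. The function name is mine; the two conditions are the ones described above:

```go
package main

import (
	"fmt"
	"strings"
)

// passesFormatGate reproduces the format gate described above: output
// must be non-empty and must not start with "Error:". It says nothing
// about whether the content is actually correct.
func passesFormatGate(output string) bool {
	return output != "" && !strings.HasPrefix(output, "Error:")
}

func main() {
	fmt.Println(passesFormatGate(""))                        // empty output fails
	fmt.Println(passesFormatGate("Error: context overflow")) // flagged failure fails
	fmt.Println(passesFormatGate("LGTM, ship it"))           // confident nonsense passes
}
```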
I haven't found a good answer to this. An LLM quality evaluator has the same problem as an LLM coordinator — it's as fallible as the agent it's evaluating, and its failures are silent. A deterministic quality gate would need to understand the task's success criteria at the phase level, which puts you back in the territory of task-specific configuration for every mission.
russellenvy's post ends with the pattern working. The thread ends with people building versions of what I built. The question nobody's asking out loud yet: when the reviewer is wrong, what catches it?
nanika is open source. The orchestrator binary and mission file format are at github.com/joeyhipolito/nanika.