
I typed orchestrator run "audit the orchestrator codebase and write a technical article" at 11 PM on a Saturday. Three agents spun up in parallel — a researcher reading Go source files, an architect analyzing package boundaries, a storyteller drafting prose. Each wrote to its own directory. None of them knew the others existed. Twelve minutes later, the orchestrator merged their outputs into a single workspace and printed a summary.
That coordination is 4,570 lines of Go across 7 packages. No framework. No agent library. Just a CLI that reads markdown files and spawns Claude Code sessions.
The Six-Sentence Version
Here is the system in six sentences. Each maps to a specific code path.
- Decomposition. You give the orchestrator a task in natural language; it classifies whether that task is simple or complex, and if complex, asks an LLM (Sonnet) to break it into a DAG of phases with explicit dependencies — falling back to keyword-based decomposition if the LLM call fails.
- Persona matching. For each phase, the system selects a specialist persona by asking Haiku to pick the best match from markdown files loaded at startup from ~/via/personas/, falling back to keyword scoring against each persona's "When to Use" section if the LLM is unavailable.
- Worker spawning. Each phase becomes a worker: a directory on disk containing a single generated CLAUDE.md file that bundles the persona prompt, the phase objective, available tools, prior phase outputs, and relevant learnings retrieved from a SQLite database.
- Parallel execution. The engine dispatches phases whose dependencies are satisfied, runs them concurrently (default max 3), and feeds each completed phase's output forward to its dependents — with up to 3 attempts per phase on failure and exponential backoff between retries.
- Quality gates. After each worker completes, the engine runs a two-tier gate check (existence: non-empty output; format: output doesn't appear to be just an error message) and records pass/fail as a warning, continuing execution either way (fail-forward).
- Learning capture. Workers mark discoveries with structured markers (LEARNING:, GOTCHA:, PATTERN:, DECISION:, FINDING:) in their output; the system captures these into a SQLite database with FTS5 full-text indexing and optional vector embeddings, deduplicating via cosine similarity (>0.85 = duplicate), so future missions can retrieve relevant past lessons through hybrid search (0.3×FTS5 + 0.7×semantic, weighted by recency).
That is the whole system. The rest of this article walks through each stage with the specific functions and design decisions behind them.
Breaking the Task Apart
When a task arrives, the first question is whether it needs decomposition at all. The router — a lightweight classifier in internal/router/ — checks keyword signals to decide if the task is simple enough for a single agent or complex enough to split into phases.
Complex tasks go to decompose.Decompose(), which calls Claude Sonnet to produce a plan: a list of typed phases with explicit dependency edges. If the LLM call fails — rate limit, network error, malformed response — the system falls back to keywordDecompose(), which uses pattern matching to build a simpler plan. That fallback has never produced a plan as good as the LLM version, but it means a flaky API call does not block the entire pipeline.
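The try-LLM-then-fall-back shape is simple enough to sketch. This is not the actual `decompose` package code — the `Phase` struct and the toy keyword rules are assumptions — but it shows why a flaky API call cannot block the pipeline: any error from the LLM path drops straight into the deterministic one.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Phase is a hypothetical sketch of one node in the decomposition DAG.
type Phase struct {
	ID        string
	Objective string
	DependsOn []string
}

// decomposeLLM stands in for the Sonnet call; here it always fails so
// the fallback path is exercised.
func decomposeLLM(task string) ([]Phase, error) {
	return nil, errors.New("rate limited")
}

// keywordDecompose is a toy version of the keyword fallback: a research
// phase, plus a dependent write phase if trigger words appear.
func keywordDecompose(task string) []Phase {
	phases := []Phase{{ID: "research", Objective: "gather context for: " + task}}
	if strings.Contains(task, "article") || strings.Contains(task, "write") {
		phases = append(phases, Phase{
			ID: "write", Objective: "draft prose", DependsOn: []string{"research"},
		})
	}
	return phases
}

// Decompose tries the LLM first and falls back on any error.
func Decompose(task string) []Phase {
	if phases, err := decomposeLLM(task); err == nil {
		return phases
	}
	return keywordDecompose(task)
}

func main() {
	for _, p := range Decompose("audit the codebase and write a technical article") {
		fmt.Printf("%s -> %v\n", p.ID, p.DependsOn)
	}
}
```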
The decomposer also routes each phase to one of three model tiers: Think (Opus) for architecture decisions, Work (Sonnet) for implementation, Quick (Haiku) for classification and matching. A research phase and an implementation phase in the same mission run on different models at different price points.
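The tier routing amounts to a small mapping. A minimal sketch, assuming phase-type labels that the article does not specify:

```go
package main

import "fmt"

// tierFor sketches the three-tier routing: Think (Opus) for architecture,
// Work (Sonnet) for implementation, Quick (Haiku) for classification and
// matching. The phase-type strings here are assumptions for illustration.
func tierFor(phaseType string) string {
	switch phaseType {
	case "architecture", "design":
		return "opus" // Think tier
	case "implementation", "writing":
		return "sonnet" // Work tier
	case "classification", "matching":
		return "haiku" // Quick tier
	default:
		return "sonnet" // assumed default: the Work tier
	}
}

func main() {
	fmt.Println(tierFor("design"), tierFor("implementation"), tierFor("matching"))
	// → opus sonnet haiku
}
```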
Matching Personas from Markdown Files
Every specialist — researcher, architect, storyteller, security auditor — lives as a markdown file in ~/via/personas/. The loader at persona/personas.go reads these files at startup, parses their "When to Use" and "When NOT to Use" sections, and builds a catalog. Adding a new persona means creating one .md file. Zero Go code changes.
Matching uses persona.Match(), which asks Haiku to pick the best persona for a given phase description. If the LLM is unavailable, keywordMatch() scores each persona by counting word overlaps against trigger phrases — crude but deterministic.
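The deterministic fallback can be sketched in a few lines. This is an illustrative version, not the real `keywordMatch()`: score each persona by word overlap with its trigger phrases and take the highest scorer.

```go
package main

import (
	"fmt"
	"strings"
)

// Persona is a hypothetical sketch of one loaded markdown persona.
type Persona struct {
	Name     string
	Triggers []string // words pulled from the "When to Use" section
}

// keywordMatch counts overlaps between the phase description and each
// persona's trigger words — crude but deterministic, as the article says.
func keywordMatch(phase string, personas []Persona) Persona {
	words := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(phase)) {
		words[w] = true
	}
	best, bestScore := personas[0], -1
	for _, p := range personas {
		score := 0
		for _, t := range p.Triggers {
			if words[strings.ToLower(t)] {
				score++
			}
		}
		if score > bestScore {
			best, bestScore = p, score
		}
	}
	return best
}

func main() {
	personas := []Persona{
		{Name: "researcher", Triggers: []string{"read", "audit", "investigate"}},
		{Name: "storyteller", Triggers: []string{"write", "article", "prose"}},
	}
	fmt.Println(keywordMatch("write a technical article", personas).Name)
	// → storyteller
}
```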
This replaced a 23-subdirectory agent template system in v1 that required touching seven packages and fourteen files to add a single role. The entire v2 persona package is 323 lines.
Folders Are Workers
Each phase becomes a physical directory on disk. Inside that directory, worker.BuildCLAUDEmd() generates a single file — CLAUDE.md — that contains everything the agent needs: the persona's full prompt, the phase objective, available tool definitions discovered via worker.LoadSkills(), outputs from completed dependency phases, and relevant learnings pulled from the database.
Think of it like handing someone a sealed envelope with their role, assignment, tools, notes from earlier colleagues, and lessons from previous projects. The agent opens the envelope, does the work, and writes output.md to the same directory.
Agents never share memory, never call each other, and never coordinate in real time. The orchestrator is the only process that reads from all directories and decides what to pass forward. A crashed agent leaves its directory intact for retry. Independence is not just cleaner architecture — it is operational resilience.
Running Phases in Parallel
The engine at internal/engine/ manages execution as a dependency-aware dispatch loop — not batch-and-wait. It maintains a semaphore (make(chan struct{}, 3) by default) that caps concurrent agents at 3. On each tick, it scans for phases whose dependencies have all completed, dispatches them, and waits for any running phase to finish before scanning again. If phases A, B, and C have no dependencies, all three start immediately. When A finishes and unlocks phase D, D starts without waiting for B or C.
When a phase fails, the engine retries up to 3 times with exponential backoff. If all attempts fail, the engine skips any phases that depend on the failed one — but other independent branches continue. For creative work like content pipelines, a failed research phase does not kill the entire mission. Writing and illustration phases that do not depend on it still produce partial output.
Two Gates, Not Three
After each worker completes, the engine runs a quality gate defined in engine/gate.go. The gate has two checks: existence (the output is non-empty) and format (the output does not consist solely of error messages). A phase that produces only "Error: rate limit exceeded" fails. A phase that produces 200 words with an error buried in the middle passes.
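The two checks fit in one small function. The error-prefix heuristic here is an assumption about how the format check might work, not the real `engine/gate.go` logic, but it reproduces the behavior described: an output that is only an error message fails, while prose with an error buried in it passes.

```go
package main

import (
	"fmt"
	"strings"
)

// gateCheck runs the two-tier gate: existence (non-empty) and format
// (output is not solely error messages). Failures are recorded as
// warnings, not hard stops — fail-forward.
func gateCheck(output string) (pass bool, reason string) {
	trimmed := strings.TrimSpace(output)
	if trimmed == "" {
		return false, "existence: empty output"
	}
	allErrors := true // assumed heuristic: every non-blank line is an error line
	for _, line := range strings.Split(trimmed, "\n") {
		l := strings.ToLower(strings.TrimSpace(line))
		if l != "" && !strings.HasPrefix(l, "error:") && !strings.HasPrefix(l, "fatal:") {
			allErrors = false
			break
		}
	}
	if allErrors {
		return false, "format: output is only error messages"
	}
	return true, ""
}

func main() {
	fmt.Println(gateCheck("Error: rate limit exceeded"))
	fmt.Println(gateCheck("Analysis of 7 packages...\nError: one call failed\nmore prose"))
}
```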
I originally planned three tiers — existence, format, and a test tier that could run shell commands to validate output. The test tier does not exist. The two-tier gate turns out to be sufficient because the real quality signal comes from downstream phases. If a research phase produces bad results, the writing phase that consumes it produces a bad article — and that is visible in the final output without an automated test.

What the System Remembers
Workers can mark discoveries in their output with five structured markers: LEARNING:, FINDING:, GOTCHA:, PATTERN:, and DECISION:. The learning system at internal/learning/ parses these from completed worker outputs and stores them in a SQLite database with FTS5 full-text indexing.
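Marker capture is essentially a line-anchored regex over worker output. A toy sketch of the parsing step (the real `internal/learning/` parser may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// markerRe matches the five structured prefixes at the start of a line.
var markerRe = regexp.MustCompile(`(?m)^(LEARNING|FINDING|GOTCHA|PATTERN|DECISION):\s*(.+)$`)

// extractLearnings scans a worker's output and groups captured entries
// by marker type, ready to be inserted into the SQLite store.
func extractLearnings(output string) map[string][]string {
	found := map[string][]string{}
	for _, m := range markerRe.FindAllStringSubmatch(output, -1) {
		found[m[1]] = append(found[m[1]], m[2])
	}
	return found
}

func main() {
	out := "Audited 7 packages.\n" +
		"GOTCHA: FTS5 must be enabled at build time.\n" +
		"PATTERN: folders as workers keeps agents independent."
	fmt.Println(extractLearnings(out))
}
```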
When a new worker spawns, BuildCLAUDEmd() retrieves relevant past learnings via hybrid search — 30% full-text relevance, 70% semantic similarity, weighted by recency (full weight under 30 days, decaying to 0.4 after 180). At insert time, cosine similarity above 0.85 triggers dedup: the existing entry's seen_count increments instead of creating a duplicate.
The mechanism works. Whether it helps is a different question.
Honest Limitations
Every function I described above is implemented and running. But I have no evidence that learnings actually improve mission outcomes.
The used_count column exists in the database schema but is never incremented when learnings are retrieved. I cannot distinguish a learning that influenced a worker's output from one that was injected and ignored. The hybrid search returns results — but whether those results make the 50th mission measurably better than the 5th is an open question I have not instrumented.
The retry system has the same blind spot. Each retry re-executes the identical configuration. No failure context — no "you failed last time because of X, try Y instead." A retry that does not know why it failed is just the same coin flip run three times.
Persona matching accuracy is also unmeasured. Haiku picks a persona, the phase runs, output appears. I have no data on whether LLM-selected personas produce better outputs than keyword-selected ones.
These are not theoretical gaps. They are the difference between a system that coordinates agents and a system that coordinates agents well. The fixes are small — increment used_count at retrieval, inject failure summaries into retry prompts, log persona selection comparisons. The measurement discipline is the hard part.
Next article in this series: what 703 insights and 360 errors in the learnings database actually say about how AI agents fail.