
541 hours. Twenty calendar days. That's 27 hours per day — more than the day contains.
The math doesn't work because I was running Claude Code sessions in parallel. While one agent was auditing 39 blog articles, another was implementing a Go CLI. While the orchestrator was executing a 30-phase infrastructure mission, a third session was writing an ADR for the DNS cache I'd decided to build next. The 541 hours isn't wall-clock time. It's AI compute time, stacked — 34 parallel overlap events across 45 sessions, 15% of all messages delivered while at least one other session was already running.
What I was actually building: 11 plugins for a personal AI operating system I call Via. Go CLIs, content pipelines, image registries, an orchestration system that launches and manages AI missions across 5 domains. +32,774 lines of code across 284 files. 26 commits. All of it written, debugged, and shipped in 20 days by one person and a lot of API calls.
82% Success Rate, and What It Hides
The insights report put the headline number at 82%. Out of 61 sessions with measurable outcomes, 50 fully or mostly achieved the stated goal.
That's a good number. I expected it to be worse.
But the methodology is softer than it sounds. "User satisfaction" in the dataset was inferred by a model analyzing turn-level response patterns — corrections, explicit approval, task completion signals. The model is making educated guesses about whether I was satisfied in a given turn. It's probably right most of the time. The 12 sessions classified as "unclear outcome" were excluded from the denominator entirely. Including them shifts the number downward: counting all 12 as failures would put the floor at 50 of 73, about 68%. The truth is somewhere between 68% and 82%, and I don't know where.
The 18% is where it gets interesting.
Seven sessions hit context exhaustion — not gradual degradation, just zero output. The worst: I asked Claude to visually analyze 176 video frames. Loading all 176 consumed the entire context window before Claude produced a single line of analysis. The session completed. The deliverable was nothing. A separate session, 136 frames, same result — a prerequisite phase was missing, Claude tried to compensate by loading everything at once, and hit the limit before writing a word.
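The guardrail this failure implies is a pre-load budget check: estimate what the inputs will cost before loading any of them, and refuse or chunk when the total won't fit. A minimal Go sketch with a crude bytes-per-token heuristic and hypothetical sizes — nothing here is the actual orchestrator's code:

```go
package main

import "fmt"

// estimateTokens is a rough heuristic (~4 bytes per token for text).
// The real tokenizer differs, but a cheap upper bound is enough to
// refuse work that cannot possibly fit.
func estimateTokens(sizeBytes int) int {
	return sizeBytes / 4
}

// planLoad reports how many inputs fit in the remaining context budget
// and what a full load would cost — so loading stops before the window
// is exhausted instead of after.
func planLoad(inputSizes []int, budgetTokens int) (fit int, needed int) {
	used := 0
	for _, s := range inputSizes {
		needed += estimateTokens(s)
	}
	for _, s := range inputSizes {
		t := estimateTokens(s)
		if used+t > budgetTokens {
			break
		}
		used += t
		fit++
	}
	return fit, needed
}

func main() {
	// 176 frames at a hypothetical ~50KB each against a 150K-token budget.
	// The numbers are made up; the shape of the failure is the same.
	frames := make([]int, 176)
	for i := range frames {
		frames[i] = 50_000
	}
	fit, needed := planLoad(frames, 150_000)
	fmt.Printf("fits: %d of %d frames; full load needs %d tokens\n",
		fit, len(frames), needed)
}
```

A check like this turns "session completed, deliverable was nothing" into an up-front refusal you can act on.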
Then there's the mission that ran for five hours generating 1.7GB of garbage output before I killed it. Multi-phase orchestrated missions amplify bugs silently: a defect in phase 3 feeds corrupted output into phase 4, phase 5, phase 6 — with no visibility until the end. By the time the session finishes, the damage is already ten phases deep.
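The fix this failure points at is validation between phases: check each phase's output before the next phase consumes it, so a defect surfaces at phase 3 instead of ten phases and 1.7GB later. A minimal sketch in Go — `Phase`, `runMission`, and the validators are hypothetical names, not the real orchestrator's API:

```go
package main

import (
	"errors"
	"fmt"
)

// Phase transforms the previous phase's output; Validate rejects
// corrupt output before it propagates downstream.
type Phase struct {
	Name     string
	Run      func(input string) (string, error)
	Validate func(output string) error
}

// runMission executes phases in order, failing fast on the first
// invalid output instead of letting later phases amplify the damage.
func runMission(phases []Phase, input string) (string, error) {
	out := input
	for _, p := range phases {
		var err error
		out, err = p.Run(out)
		if err != nil {
			return "", fmt.Errorf("phase %s failed: %w", p.Name, err)
		}
		if err := p.Validate(out); err != nil {
			return "", fmt.Errorf("phase %s produced invalid output: %w", p.Name, err)
		}
	}
	return out, nil
}

func main() {
	phases := []Phase{
		{
			Name:     "extract",
			Run:      func(in string) (string, error) { return in + ":extracted", nil },
			Validate: func(out string) error { return nil },
		},
		{
			Name: "transform",
			// Simulate the silent defect: the phase "succeeds" but emits garbage.
			Run: func(in string) (string, error) { return "", nil },
			Validate: func(out string) error {
				if out == "" {
					return errors.New("empty output")
				}
				return nil
			},
		},
		{
			Name:     "load",
			Run:      func(in string) (string, error) { return in + ":loaded", nil },
			Validate: func(out string) error { return nil },
		},
	}
	if _, err := runMission(phases, "input"); err != nil {
		fmt.Println(err) // fails at phase 2, not ten phases later
	}
}
```

Even a trivial non-emptiness check per phase would have killed that mission in minutes instead of five hours.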
And 80 instances of wrong approach across 158 sessions. Claude defaulting to generic patterns instead of the conventions I'd spent weeks encoding. YAML metadata blobs instead of SKILL.md files on disk. Inline Python scripts instead of jq. Mission phases over-specified when the entire point of the orchestrator is that it decides phase granularity. The corrections happened, the work eventually landed, but each redirection cost turns — and turns cost time.
The 82% success rate is real. It's also the number you get when you exclude unclear sessions and don't interrogate what "mostly achieved" means for a 30-phase mission that completed phases 1 through 9 before timing out and dying.
The Pattern That Changed How I Work

Forty-four sessions before I recognized it.
I'd been starting implementation sessions directly — describe the feature, let Claude write the code. The success rate was decent but the wrong-approach corrections clustered here. Claude would choose a reasonable approach that wasn't my approach. I'd redirect. We'd lose half a session to the correction loop.
Then I started requiring an architecture decision record before any implementation began.
ADR sessions — sessions whose only goal was "explore the codebase and produce a design document" — had zero failures in the data. Not a low failure rate. Zero not-achieved outcomes across every ADR session I ran.
The mechanism is obvious once you see it: writing an ADR requires Claude to read the actual codebase before designing anything. It can't assume my conventions; it has to find them. When Claude reads my existing Go service files before writing new ones, it conforms to the patterns it finds. When it skips straight to implementation, it defaults to whatever patterns its training data said were correct.
The DNS cache that delivered 202 passing tests on the first implementation pass: ADR first. The auth middleware shipped in a single session with zero wrong-approach corrections: ADR first. The iMessage plugin I shelved after two failed debug sessions and still haven't returned to: straight to implementation, no design document.
I was paying a drift tax every time I skipped the architecture step. The ADR isn't bureaucracy — it's Claude reading the codebase I'd already built before writing code that has to live there.
What I Was Actually Doing

The median time between my turns was 108 seconds. Ninety-four gaps lasted between 5 and 15 minutes. Forty gaps stretched past 15 minutes.
I was not watching Claude work.
The framing most people use for AI coding tools is "pair programmer" — something you steer in real time, accepting and rejecting suggestions as they appear. That's not what this data shows. The operational pattern was closer to delegation: describe the goal, set Claude running, do something else, come back and evaluate the output.
The trust model reflects this. I rejected 25 tool calls out of 1,715 Bash executions — a 1.5% rejection rate. I wasn't hovering over an approval gate for every file write.
But the corrections, when they came, were decisive. I deleted an entire frontend UI when the output wasn't what I wanted — not "can you revise section three" but the whole thing, gone, start over. I shelved the iMessage plugin after two fix attempts didn't resolve the self-messaging loop; I've kept it shelved. When the orchestrator started over-specifying phases, I redirected once, clearly, and moved on. No iterating past a clear dead end.
High-trust delegation with sharp corrections. It's not pair programming. It's closer to working with a contractor who is genuinely skilled but doesn't know your codebase — and who will, without the right guardrails, build the thing correctly according to generic standards that don't match what you've already built.
The Move Most People Haven't Found

103 Task subagent calls across 73 sessions. The report flagged this as an advanced usage pattern most Claude Code users never discover.
The 39-article audit: three parallel Task agents, each handling roughly 13 articles, running simultaneously. Synthesis complete in 7 minutes. One session.
The 242 Kurzgesagt thumbnails: parallel agents making external Gemini Vision API calls for each thumbnail, building a complete registry in a single session. The session that tried to process 176 video frames by loading them directly into Claude's context window produced zero output. The thumbnail session succeeded because it offloaded the heavy work to external APIs instead of trying to hold everything in one window.
The architectural principle the data keeps reinforcing: large-volume work goes through parallel agents with bounded scope, not through a single context window. Each agent handles a fixed subset of inputs and returns a summary. The orchestrating session synthesizes summaries — not raw content. This is not a clever optimization. It's the difference between a session that delivers 242 complete entries and a session that silently exhausts its context window before producing anything.
The context window is not just a long memory. It's a finite resource that fails without warning and takes the entire session with it.
Honest Limitations
The satisfaction percentages in this analysis are model-inferred. The 93% positive satisfaction figure comes from a model classifying my turn-level responses — not from me rating each interaction. It's likely accurate in aggregate. It could be systematically wrong in ways I can't detect without reviewing 805 messages individually, which I haven't done.
The 82% outcome rate excludes the 12 sessions the classifier couldn't confidently assess. I don't know if those sessions skew better or worse than the ones it could measure.
And the number I can't stop thinking about: I've offloaded 541 hours of engineering work to a system whose conventions I've been carefully teaching for three weeks. The CLAUDE.md files, the SKILL.md format, the orchestrator delegation model — the reason the success rate is 82% and not 50% is that I spent significant time encoding my conventions into files Claude reads before every session.
That encoding is load-bearing infrastructure now. If Claude's training updates in a way that overrides what I've written, the success rate shifts. If the API changes behavior in a way I don't notice, I might not catch it until I've run 10 sessions on wrong assumptions. The system works because I built the scaffolding that makes it work — and the scaffolding requires ongoing maintenance I'm only starting to account for.
The 26 commits are real. The +32,774 lines shipped. The infrastructure runs. I'm not walking back any of it.
I just don't know yet whether "high-trust delegation with sharp corrections" is a sustainable operating model or whether I'm three months from discovering what the ceiling actually looks like.
Twenty days isn't enough data to answer that question. I'll have more in twenty days.