// JH


Your CLAUDE.md Is Making Things Worse

A newly published study found that LLM-generated context files hurt task completion and raise costs by 20%. The harder question is when they still make sense.

ai-development · claude-code · agents · configuration · dx

An octopus walking along a glowing path through space, holding a CLAUDE.md document — navigating between order and chaos

The T3 Chat author ran a test on his own codebase. Without a CLAUDE.md, the task completed in 1 minute and 11 seconds. With a freshly generated one, it took 1 minute and 29 seconds. He deleted the file.

That single data point — n=1, one timing trial — wouldn't move me on its own. What moved me was that a preprint published in February 2025 independently reached the same conclusion at scale. Kelechi et al. tested context files across 138 tasks and 12 repositories, running Claude Code, Codex, and Qwen Code against each condition. LLM-generated context files reduced task success by 2% and raised costs by 20–23%. Human-written files did better — a modest +4% on task completion — but still carried a 19% cost penalty.

The numbers tell the same story the timing did. Agents that can navigate codebases don't need the summary. They'll grep, read package.json, explore the file tree. Giving them a pre-filled context file doesn't accelerate that process. It redirects it.

The Pink Elephant Problem

The most useful piece of the video wasn't the timing experiment. It was the tRPC story.

He had tRPC listed in his AGENTS.md — a legacy technology his codebase was moving away from. The agent kept reaching for tRPC. Not because it made sense architecturally, but because it was in context. What appears in context gets autocompleted toward, regardless of whether you want it to.

The Kelechi et al. study calls this the redundancy trap, but the practitioner version is more vivid: you're not giving the agent a map, you're giving it a list of things to think about. Everything on that list becomes a gravity well. List something you're deprecating and the agent will deprecate it slower than if you'd said nothing at all.

This is where I agree with the argument most completely. Via's CLAUDE.md files evolved through exactly this failure mode. Early in the build, I added context about the orchestrator's retry logic because an agent kept mishandling failures. It fixed the immediate problem and created a new one: every subsequent agent inherited that framing and applied it to contexts where it didn't belong. I spent a session removing the entry. The agents adapted fine.

The discipline required is uncomfortable. You have to resist the instinct to help the agent by explaining things. The explanation costs more than the discovery.

A purple octopus at a computer, dwarfed by the ghost of legacy code looming behind it — the gravity well of context files

Where the Study's Conditions Don't Hold

The Kelechi et al. finding has a scope that matters. The study ran against Python repositories — twelve of them, selected from AgentBench and SWE-bench Lite. These are well-documented, widely used codebases. The agents navigated them just as well without the context files, because the models had encountered similar patterns in training.

Vercel ran a different test. They targeted Next.js 16 APIs — connection(), 'use cache', cacheLife() — that postdated the training cutoffs of the models being tested. The agent had no prior knowledge to fall back on. Under those conditions, AGENTS.md achieved a 100% pass rate against a 53% baseline: a 47 percentage point gain. Skills-based retrieval topped out at 79%, and only when the agent chose to invoke the skill; in 56% of default runs, it didn't.

The key discriminator isn't file quality or file length. It's whether the information exists anywhere in the model's weights. For familiar territory, context files are redundant by definition. For territory the model has never seen, they're the only path to correct behavior.

I encountered this boundary once with Via's plugin system. The orchestrator manages nine plugins across five domains, and the routing logic for one of them — the Obsidian capture integration — uses a CLI invocation pattern specific to how I set up my environment. No model trained before 2025 would infer that pattern from context. I wrote it into the configuration and it has never misfired. That entry stays. It's not codebase context; it's information the model genuinely lacks.
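A sketch of what that kind of entry can look like. The command name, flags, and paths below are hypothetical — illustrative stand-ins for an environment-specific invocation pattern, not the actual configuration:

```markdown
## Obsidian capture (environment-specific — do not infer from docs)

<!-- Hypothetical example: command, flags, and paths are illustrative only. -->
- Capture notes through the local CLI wrapper, never the plugin's REST API:
  `obs-capture --vault ~/vaults/via --inbox Inbox/ --stdin`
- The wrapper reads frontmatter from stdin; passing a file path fails silently.
```

The test for entries like this is simple: could a model infer it from the repository or from training data? If the answer is no, it earns its place.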

The Routing Problem Is Structurally Different

Here's where I disagree with the core framing of the video, and where the empirical literature falls silent.

The video argues that CLAUDE.md files are overused as codebase context — a summary of things the agent can discover on its own. That argument is correct. But it treats CLAUDE.md as a single category of thing, when practitioners are using it for at least two different purposes.

The first purpose is codebase context: tech stack, conventions, directory layout. The study measures this. The evidence says: minimal benefit, real cost. The conclusion is defensible.

The second purpose is routing logic: which agent handles what, when to escalate, what constraints bind each persona. This is not information the agent can discover from the codebase. It's the architecture of the system itself. You can't grep for "who is responsible for research tasks and when does control transfer to the implementer."

Via's decomposer pattern lives here. The orchestrator routes incoming missions through a sequence of specialized agents — researcher, strategist, writer, reviewer — based on explicit handoff rules encoded in configuration. Those rules are not discoverable. They're decisions. The agent that receives a research task doesn't know it's supposed to stop and hand off when it has three credible sources; it needs to be told. Removing that configuration doesn't reduce noise. It removes the constraint that makes the pipeline work.
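As a sketch, declared handoff rules in a scoped configuration file might read like this. The agent names and the three-source threshold come from the description above; the exact wording and the other thresholds are illustrative, not Via's actual file:

```markdown
## Routing (decisions, not conventions — nothing here is discoverable)

<!-- Illustrative sketch; thresholds beyond the three-source rule are invented. -->
- researcher: gathers sources only. Hands off to strategist after 3 credible
  sources, or after 2 failed searches. Never writes prose.
- strategist: turns sources into an outline, then hands off to writer.
  No further research calls.
- writer: drafts from the outline only. Escalates to the orchestrator on a
  missing-source gap instead of inventing a citation.
- reviewer: approves or returns with line-level notes. A second return on
  the same draft escalates to a human.
```

Notice that every line is a constraint, not a description. Nothing in it summarizes the codebase; all of it encodes decisions the agents cannot recover on their own.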

PubNub's subagent architecture uses the same pattern: a three-stage pipeline where each stage has a scoped CLAUDE.md encoding its role, its permitted actions, and its handoff conditions. The evidence for whether this helps is weak — one case study, no controls. But the structural logic is sound. Routing rules that cannot be inferred must be declared. There's no evidence against that claim because no study has tested it. The literature hasn't separated "codebase context files" from "pipeline constraint files." They're different things with the same file extension.

Two octopuses side by side — one juggling documentation in familiar territory, the other using AGENTS.md as a lantern to illuminate unknown APIs

A Framework That Fits the Evidence

The study and the video converge on one honest conclusion: most people's CLAUDE.md files are doing nothing useful. They were generated by /init, never pruned, and are now a slightly stale summary of things the agent already knows. Delete them. Your agent will be faster, your costs will drop 20%, and your sessions will stop biasing toward whatever you listed.

But the conclusion doesn't extend to every use case. Three categories seem to hold up:

Novel APIs outside training data. If your codebase uses a framework, SDK, or internal API that postdates the model's training cutoff, write precise documentation into AGENTS.md. Vercel compressed 40KB of Next.js 16 docs to 8KB and hit 100%. The compression matters: the goal is to eliminate inference failures, not to provide background reading.

Explicit corrections for consistent wrong behavior. When an agent reliably does the wrong thing — not once, but three times in a row across different sessions — that specific correction belongs in the configuration. Not as a general principle, not as a style guideline, but as a precise counter to the pattern you've observed. HumanLayer's rule of thumb applies: under 60 lines, pointers to files rather than inline code snippets, removed when the model generation catches up.

Multi-agent routing constraints. If you're running a pipeline where agents specialize and hand off, the handoff logic needs to be declared. This is architecture documentation, not codebase context. The study's negative findings almost certainly don't apply here because the study didn't measure this use case. That absence of evidence is not permission to assume it fails — it's just an untested condition.

Everything outside those three categories: the tech stack summary, the style preferences, the architectural overview the model could construct by reading your files for ninety seconds. Cut it.
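The first category is the one that rewards precision. A compressed entry might look like the sketch below. The API names come from the Vercel test described above; the descriptions are paraphrased from my own understanding of those APIs and may drift from the official docs:

```markdown
## Next.js 16 APIs (post-training-cutoff — do not guess from older patterns)

<!-- Descriptions are illustrative paraphrase, not the official documentation. -->
- `connection()` (from `next/server`): await it in a Server Component to opt
  that render into dynamic rendering. Replaces the `unstable_noStore()` idiom.
- `'use cache'`: directive at the top of a file or function marking its output
  cacheable.
- `cacheLife(profile)`: call inside a `'use cache'` scope with a named profile
  (e.g. `'hours'`) to set the cache lifetime.
```

The point is density: each line exists to prevent a specific inference failure, not to provide background reading.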

Honest Limitations

I've been building Via for long enough to have opinions about its configuration files, and specific enough metrics on some parts of the system to feel confident in claims about memory retrieval or orchestrator behavior. I don't have controlled data on what happens to my agent sessions when I remove a CLAUDE.md entry versus keeping it. My "it worked better without this" observations are the same quality of evidence as the timing experiment in the video — directional, not dispositive.

The same applies to the routing argument. I believe Via's decomposer pattern requires explicit routing configuration. I've built it that way and the pipeline runs correctly. I haven't run the test where I remove the routing rules and observe what the agents do. Maybe they'd infer the correct behavior from context. Probably they wouldn't. But "probably" is not a finding.

The deeper limitation is one neither study addresses: these findings were measured against models available in 2024 and early 2025. The practitioner who ran the timing experiment notes that he deletes more from his AGENTS.md with every new model release. The +4% human-written benefit that Kelechi et al. found might already be +2% now, or zero. The Vercel out-of-training-data scenario might shrink as training cutoffs move forward. The specific number matters less than the direction: every improvement in model self-navigation capability reduces the legitimate use case for context files by some amount. The three categories above may narrow to two, then one, then conditions rare enough to be edge cases.

That's not an argument against writing them now. It's an argument for reviewing them the way you review dependencies — on a schedule, not just when something breaks.
