// JH

· 12 min read

The #1 Thing My AI Agents Learned Wasn't Code

I built a system that captures what AI agents learn across 175 missions. The most-used learning — retrieved 247 times — isn't a code pattern. It's a publishing decision.

ai · learning · agents · sqlite · embeddings · via · institutional-knowledge

[Image: An octopus pointing at a massive locked filing cabinet covered in cobwebs with 3.2% floating beside it — the broken retrieval system]

I ran orchestrator learnings stats on a Tuesday morning and stared at the number: 3.2%. We had 1,469 learnings stored in SQLite with a hybrid retrieval system, semantic deduplication, and type-bucketed injection. The architecture was sound. And it was surfacing relevant knowledge 3.2% of the time.

Most AI systems forget everything between sessions. An agent spends 20 minutes debugging a race condition, captures the solution, and when the next agent encounters the same problem two weeks later, the knowledge is gone. Each session starts from scratch. Each mistake is rediscovered independently. The cost compounds invisibly — across hundreds of missions, you're solving the same problems multiple times because there's no mechanism for institutional learning.

I fixed the retrieval. The rate jumped to 26.4%. But even with working retrieval, 88.2% of the 899 learnings in the database have never been surfaced. And the #1 most-used learning — retrieved 247 times — isn't a code pattern. It's a publishing decision. That surprised me more than the 3.2% did.

A One-Line Fix for a Six-Week Blind Spot

The hybrid search formula is 0.3 × FTS5 keyword + 0.7 × cosine similarity. Seventy percent of the score depends on semantic embeddings. When I audited coverage that Tuesday, 69% of the database had no embeddings. The system was running a retrieval formula designed for full coverage with less than a third of the data actually participating.

The fix was a one-line command I'd written weeks earlier and never run:

orchestrator learnings backfill --all

Cost: $0.002 in API calls. Time: under a minute. Impact: every learning in the database could now participate in semantic search. Retrieval rate jumped from 3.2% to 26.4% — an 8.2× improvement.

This is the embarrassing truth: I spent weeks designing a sophisticated retrieval pipeline, wrote 856 lines of Go handling FTS5 triggers and cosine similarity scoring, and the bottleneck was an unfinished batch job. The lesson wasn't architectural. It was: fix data before adding complexity.
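The audit that would have caught this weeks earlier is tiny. Here is an illustrative sketch of a coverage check over a simplified in-memory view of the learnings; the `Learning` struct is hypothetical, and the 80% threshold mirrors the backfill check described later in the post:

```go
package main

import "fmt"

// Learning is a simplified stand-in for a stored entry; the real
// schema lives in SQLite and carries more fields.
type Learning struct {
	ID        int
	Text      string
	Embedding []float64 // nil until the backfill job has run
}

// embeddingCoverage returns the fraction of learnings that can
// participate in the semantic (cosine) half of hybrid search.
func embeddingCoverage(ls []Learning) float64 {
	if len(ls) == 0 {
		return 0
	}
	covered := 0
	for _, l := range ls {
		if len(l.Embedding) > 0 {
			covered++
		}
	}
	return float64(covered) / float64(len(ls))
}

// needsBackfill applies the 80% coverage floor: below it,
// schedule a backfill rather than tuning the retrieval formula.
func needsBackfill(ls []Learning) bool {
	return embeddingCoverage(ls) < 0.80
}

func main() {
	ls := []Learning{
		{ID: 1, Embedding: []float64{0.1, 0.2}},
		{ID: 2}, // no embedding yet
		{ID: 3}, // no embedding yet
	}
	fmt.Printf("coverage: %.0f%%, backfill needed: %v\n",
		embeddingCoverage(ls)*100, needsBackfill(ls))
}
```

Ten lines of audit code would have exposed the 69% gap long before any retrieval tuning.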

[Image: An octopus at the center of a glowing orbital feedback loop with jellyfish and circuit tendrils — the working memory system]

The #1 Learning Isn't Code

The top-used learning in the database — retrieved 247 times across 175 missions — is a content strategy decision: "publish the narrative version on the blog and optionally link to a technical deep-dive." Not a code pattern. Not an API trick. A publishing decision that every writing mission rediscovered until the system captured it.

This reveals something I didn't expect when I built the system: AI agents need a learning system for organizational knowledge, not just technical knowledge. The decisions you make about your own work compound. The patterns you discover about yourself matter. An agent that remembers "use FTS5 for full-text search" saves a few minutes of debugging. An agent that remembers "publish the narrative version first" saves an entire mission from going in the wrong direction.

The database tracks used_count — how many times a learning was actually retrieved for a mission:

  • 106 learnings ever used (11.8%)
  • Top entry used 247 times (a decision about dual-audience publishing strategy)
  • Total usage events: 1,732 (across all learnings and missions)
  • 346 learnings never surfaced (38.5%)

The 11.8% usage rate stings. It means 88.2% of captured learnings have never been retrieved for a future mission. Some of these are genuinely low-value — niche learnings about specific APIs that rarely recur. But some are probably high-value learnings with slightly wrong embeddings or phrasing that doesn't match query patterns. I have no way to distinguish "correctly ignored" from "incorrectly missed" without manual review.

What Agents Leave Behind

The core concept is simple. Agents produce knowledge as a side effect of doing work. A researcher discovering a useful API writes FINDING: The Frankfurter API provides free exchange rates with no auth required. A developer hitting a build error writes GOTCHA: SQLite FTS5 triggers must be created after the main table, not before. This knowledge was always being generated. It was just evaporating.

The learnings system catches it in a closed loop:

Agent completes a phase
  → Output contains markers (LEARNING:, GOTCHA:, FINDING:, DECISION:, PATTERN:)
  → orchestrator parses markers into structured entries
  → Each entry gets a Gemini embedding (768 dimensions)
  → Cosine dedup check: > 0.85 similarity? Skip (increment seen_count)
  → Novel entries stored in SQLite with FTS5 index

Next agent spawns on a new mission
  → orchestrator queries learnings DB with hybrid search
  → FTS5 keyword scores (0.3 weight) + cosine similarity (0.7 weight)
  → Top matches injected into agent's CLAUDE.md
  → Agent sees "Apply:", "Avoid:", "Consider:" sections

The injection format is where the psychological design matters. Errors become "Avoid:" so agents know what not to repeat. Patterns become "Apply:" so agents know what to lean on. Decisions become "Consider:" so agents weigh trade-offs. An agent isn't told "you have access to past knowledge." It's handed actionable framing.
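That framing step can be sketched in a few lines. The map and function names here are hypothetical, not Via's actual code; the point is that each learning type is rendered under its directive header, errors first:

```go
package main

import (
	"fmt"
	"strings"
)

// headings maps a learning's type to the actionable framing the
// agent sees in its CLAUDE.md. Names are illustrative.
var headings = map[string]string{
	"error":    "Avoid:",
	"pattern":  "Apply:",
	"decision": "Consider:",
}

// formatInjection renders retrieved learnings under their framed
// section headers, iterating types in a fixed order so errors
// always lead.
func formatInjection(byType map[string][]string) string {
	var b strings.Builder
	for _, t := range []string{"error", "pattern", "decision"} {
		for _, text := range byType[t] {
			fmt.Fprintf(&b, "%s %s\n", headings[t], text)
		}
	}
	return b.String()
}

func main() {
	out := formatInjection(map[string][]string{
		"error":   {"go build fails silently when the embedding directive references a missing file"},
		"pattern": {"use FTS5 for full-text search"},
	})
	fmt.Print(out)
}
```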

How Fifteen Rediscoveries Become One

Learning capture happens in the orchestrator's teardown sequence. The system scans agent output for 15 marker types, extracts the text, and runs it through deduplication.

The markers vary by domain:

  • Dev: LEARNING:, GOTCHA:, BEST_PRACTICE:, ANTIPATTERN:, FINDING:
  • Creative: STYLE:, TECHNIQUE:, INSPIRATION:
  • Personal: PREFERENCE:, HABIT:, REFLECTION:
  • Meta (any domain): AGENT_ISSUE:, PERSONA_GAP:, PERSONA_MISMATCH:, CAPABILITY_REQUEST:

Different personas naturally emit different markers. The security-auditor tags vulnerabilities. The performance-engineer tags bottlenecks. The architect tags decisions. The system doesn't force this — personas just receive instructions that nudge them toward their domain-specific markers.
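Marker extraction reduces to a line-anchored regular expression over the agent output. This sketch covers only the dev-domain subset of markers and is not the orchestrator's actual parser:

```go
package main

import (
	"fmt"
	"regexp"
)

// markerRe matches a marker tag at the start of a line, followed
// by the learning text. Only dev-domain markers are shown here.
var markerRe = regexp.MustCompile(
	`(?m)^(LEARNING|GOTCHA|FINDING|DECISION|PATTERN):\s*(.+)$`)

// extractMarkers scans agent output and returns (tag, text) pairs
// for every marker line it finds.
func extractMarkers(output string) [][2]string {
	var out [][2]string
	for _, m := range markerRe.FindAllStringSubmatch(output, -1) {
		out = append(out, [2]string{m[1], m[2]})
	}
	return out
}

func main() {
	output := `Build passed.
GOTCHA: SQLite FTS5 triggers must be created after the main table, not before.
FINDING: The Frankfurter API provides free exchange rates with no auth required.`
	for _, m := range extractMarkers(output) {
		fmt.Printf("%s -> %s\n", m[0], m[1])
	}
}
```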

Deduplication is where the intelligence happens. When a new learning arrives, the system compares it against existing entries using cosine similarity on Gemini embeddings:

  • > 0.85 similarity: Exact duplicate. Increment seen_count, skip insert.
  • 0.70 – 0.85 similarity: Near-duplicate. Store but flag for compaction.
  • < 0.70 similarity: Novel. Insert normally.

This threshold creates an unexpected signal: frequency as quality. A learning that's been "seen" 15 times across different missions is almost certainly a fundamental constraint of a tool or API — not an edge case. The highest-scored entries in the database are the ones agents kept rediscovering because they mattered.
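The dedup check itself reduces to cosine similarity plus the three thresholds above. A minimal sketch (the real pipeline runs this against Gemini's 768-dimension embeddings; the classification labels are mine):

```go
package main

import (
	"fmt"
	"math"
)

// cosine computes cosine similarity between two embeddings,
// assuming equal length (768 dimensions in the real system).
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// classify applies the post's thresholds: >0.85 is a duplicate
// (increment seen_count, skip insert), 0.70-0.85 is stored but
// flagged for compaction, below that is novel.
func classify(sim float64) string {
	switch {
	case sim > 0.85:
		return "duplicate"
	case sim >= 0.70:
		return "near-duplicate"
	default:
		return "novel"
	}
}

func main() {
	a := []float64{0.9, 0.1, 0.0}
	b := []float64{0.8, 0.2, 0.1}
	sim := cosine(a, b)
	fmt.Printf("similarity %.2f -> %s\n", sim, classify(sim))
}
```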

The 899 learnings distill roughly 1,200 captured instances; deduplication absorbed the rest. Without it, the database would be about a third larger and proportionally less useful.

Errors First, Patterns Second

When a new mission starts, the orchestrator queries the learnings database with the task description. The hybrid search uses four scoring components:

  1. FTS5 keyword matching (30% weight): Fast and precise. "SQLite FTS5" queries match learnings about FTS5.
  2. Cosine similarity on embeddings (70% weight): Semantic matching. "Performance bottleneck" surfaces learnings about "slow queries."
  3. Quality boost: Learnings with high seen_count or used_count score higher.
  4. Recency decay: A learning captured last week scores 1.0. A learning from 6 months ago scores 0.4. This prevents stale advice from dominating.

The system retrieves up to 5 learnings per query, prioritized into typed buckets:

  • 2 errors (avoid patterns)
  • 2 patterns (apply patterns)
  • 1 decision (consider trade-offs)

Errors come first because avoiding a mistake is more valuable than repeating a success.
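The bucketed selection can be sketched as a single pass over candidates already sorted by score. The struct and cap names are illustrative, not the orchestrator's real identifiers:

```go
package main

import "fmt"

// Learning is a simplified scored candidate.
type Learning struct {
	Type  string // "error", "pattern", or "decision"
	Text  string
	Score float64
}

// bucketCaps mirrors the post's injection budget:
// 2 errors, 2 patterns, 1 decision, 5 total.
var bucketCaps = map[string]int{"error": 2, "pattern": 2, "decision": 1}

// selectForInjection fills the typed buckets from candidates
// sorted by score (descending), dropping anything over its cap.
func selectForInjection(sorted []Learning) []Learning {
	taken := map[string]int{}
	var out []Learning
	for _, l := range sorted {
		if taken[l.Type] < bucketCaps[l.Type] {
			out = append(out, l)
			taken[l.Type]++
		}
		if len(out) == 5 {
			break
		}
	}
	return out
}

func main() {
	cands := []Learning{
		{Type: "pattern", Score: 0.9, Text: "use FTS5"},
		{Type: "error", Score: 0.8, Text: "triggers after table"},
		{Type: "pattern", Score: 0.7, Text: "batch embeddings"},
		{Type: "pattern", Score: 0.6, Text: "dropped: bucket full"},
		{Type: "decision", Score: 0.5, Text: "publish narrative first"},
	}
	for _, l := range selectForInjection(cands) {
		fmt.Println(l.Type, "-", l.Text)
	}
}
```

The caps mean a high-scoring third pattern loses its slot to a lower-scoring decision, which is the point: diversity of advice over raw score.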

The retrieval formula looks like this in Go:

score = 0.3*ftsRank + 0.7*cosineSim + qualityBoost + recencyWeight

The weighted combination ensures we catch conceptually relevant learnings — a learning about "dependency injection" surfaces for a task about "reducing coupling" — while respecting specific technical terms that keyword search catches better.
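Expanded into a function, the formula might look like the sketch below. The log-scaled boost and the decay half-life are my guesses fitted to the two recency examples in the text (roughly 1.0 at one week, roughly 0.4 at six months); the post doesn't show the real constants:

```go
package main

import (
	"fmt"
	"math"
)

// hybridScore combines the four components named in the post.
// The 0.3/0.7 weights come from the text; the boost multiplier
// and decay half-life are illustrative assumptions.
func hybridScore(ftsRank, cosineSim float64, usedCount int, ageDays float64) float64 {
	// Quality boost: log-scaled so heavily used entries
	// don't dominate linearly.
	qualityBoost := 0.05 * math.Log1p(float64(usedCount))
	// Recency decay: a ~136-day half-life gives ~0.97 at one
	// week and ~0.40 at six months, matching the examples.
	recencyWeight := math.Pow(0.5, ageDays/136)
	return 0.3*ftsRank + 0.7*cosineSim + qualityBoost + recencyWeight
}

func main() {
	fresh := hybridScore(0.6, 0.8, 10, 7)
	stale := hybridScore(0.6, 0.8, 10, 180)
	fmt.Printf("fresh: %.2f  stale: %.2f\n", fresh, stale)
}
```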

What 899 Learnings Reveal

The database breakdown by type tells the story:

  Type             Count   What it captures
  Insights         322     Techniques that work, patterns worth reusing
  Errors           211     Mistakes, gotchas, things that broke
  Decisions        141     Architectural choices with rationale
  Patterns         118     Reusable approaches across domains
  Sources          92      Useful APIs, documentation, references
  Meta-learnings   15      System critiques of orchestration itself

The 211 errors are the crown jewels. Each one represents a mistake an agent made and documented. When a future agent working on a similar task receives the learning "Avoid: go build fails silently when the embedding directive references a missing file," it sidesteps that debugging session entirely. This is institutional learning for software agents.

The most valuable errors fall into categories:

  • API misuse (32% of errors): Wrong parameters, deprecated endpoints, missing headers.
  • Configuration issues (24%): Environment variables missing, config files malformed, path resolution bugs.
  • Tool constraints (18%): Using tools in ways they weren't designed for.
  • Order-of-operations (14%): Migrations before schema, tests before build, FTS5 triggers before the parent table.
  • Other (12%): Encoding issues, permission problems, rate limits.

The 15 meta-learnings are the system watching itself:

  • meta_gap (7 entries): Missing persona — no specialist covers this task type.
  • meta_issue (4 entries): System behavior problem — agent told to plan but task required writing.
  • meta_mismatch (3 entries): Wrong persona assigned — writer doing review work.
  • meta_observation (1 entry): Workflow insight worth remembering.

These feed back into the persona selector and decomposer. If missions keep generating PERSONA_MISMATCH flags, the selector adjusts. If research phases keep tagging meta_gap for DevOps work, I know to add a DevOps persona.

[Image: A small octopus looking up at a single glowing column among hundreds of dim ones — finding the right memory among thousands]

Three Places Knowledge Enters

The learnings system lives at three injection points in the orchestrator's execution:

  1. run.go:294 — When an agent spawns, the orchestrator queries learnings and injects them into the agent's CLAUDE.md file.

  2. orchestrator.go:137 — During phase execution, learnings are retrieved based on the phase description and added to the agent's system prompt.

  3. decompose.go:97 — During DAG decomposition, relevant learnings from past decompositions are injected to guide the planner.

The teardown sequence captures learnings after a mission completes:

// Pseudo-code for the teardown sequence
1. Signal all agents to stop
2. Wait 2s for final output
3. Parse outputs for marker tags
4. Generate Gemini embeddings
5. Dedup against existing learnings (cosine > 0.85 = skip)
6. Store novel learnings in SQLite
7. Check coverage — if < 80% have embeddings, backfill
8. Checkpoint workspace state

The system is defensive by design. Every step uses defer/recover for panic tolerance. A failure in learning capture never blocks mission completion. A missing embedding is not an error — it's a signal to backfill later. The whole sequence is non-blocking and optimized for "if it fails, just keep running."
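The defensive pattern is Go's defer/recover at each step boundary. A simplified stand-in (parseAndStore here is a hypothetical placeholder for the real parse/embed/store pipeline):

```go
package main

import "fmt"

// captureLearnings wraps the capture step so a panic inside it
// never propagates to the mission teardown: the panic is
// converted to an error and the mission keeps running.
func captureLearnings(output string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Swallow the panic: learning capture is best-effort.
			err = fmt.Errorf("learning capture failed: %v", r)
		}
	}()
	parseAndStore(output) // may panic on malformed output
	return nil
}

// parseAndStore is a stand-in for the real parse/embed/store
// pipeline; it panics here only to demonstrate the recovery path.
func parseAndStore(output string) {
	if output == "" {
		panic("empty agent output")
	}
}

func main() {
	if err := captureLearnings(""); err != nil {
		fmt.Println("mission continues despite:", err)
	}
}
```

Wrapping each teardown step this way is what makes a failed capture a logged inconvenience instead of a blocked mission.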

Three Claude Code hooks fire at lifecycle moments:

  • PreCompact hook: Before the context window truncates, flush the last 20 messages to a markdown file so nothing is permanently lost.
  • SessionStart hook: After compaction, inject saved context back into the new session.
  • SessionEnd hook: When a session ends, extract learnings from the full transcript and store them with embeddings.

These hooks prevent knowledge loss at the boundaries where agents are most vulnerable.

From 1,604 Entries to 899

The honest truth about building this system:

Version 1 (2026-02-08): 1,604 learnings, 3.2% retrieval rate. The architecture was there — capture, dedup, hybrid search. The problem was missing embeddings.

Version 2 (2026-02-15): Embedding backfill completed. 992 learnings after compaction (52.7% reduction). Retrieval jumped to 34.7%. Three new hooks added. The system finally worked.

Version 3 (2026-02-17, current): 899 learnings, 100% embedding coverage, 26.4% retrieval rate. Quality and recency decay added. Meta-learnings captured and tracked. The system is refined.

The drop from 992 to 899 is not a regression — it's compaction. With full embedding coverage, the dedup pipeline could finally detect semantic duplicates. "Use FTS5 for full-text search" and "SQLite FTS5 provides full-text indexing" collapsed into a single entry with seen_count: 4. Fewer learnings, higher signal.

[Image: An octopus on a broken circuit ring with data blocks scattering into space — the memory loop that doesn't complete]

Honest Limitations

88.2% of learnings have never been surfaced. This is the most uncomfortable metric. It means either the retrieval formula misses relevant learnings, or a large portion of captured knowledge truly isn't valuable for future missions. I can't distinguish between these without manual review. The hybrid search weights (0.3 keyword + 0.7 semantic) were set by intuition, not by measurement.

The quality score creates a rich-get-richer effect. Learnings that get used early accumulate higher scores, which boosts their retrieval ranking, which gets them used more. New learnings have to overcome incumbency advantage. I haven't implemented an exploration mechanism to surface low-usage entries.

Learning decay is aggressive for some domains. An entry unused for 90 days gets its quality multiplied by 0.1. This makes sense for fast-moving technical contexts where three-month-old patterns are likely stale. It makes less sense for personal domain preferences or architectural decisions, which stay relevant for years.

The capture system depends on agents voluntarily emitting markers. If an agent encounters a gotcha but doesn't tag it with GOTCHA:, the learning is lost. There's no way to extract implicit learnings from agent behavior — only explicit annotations get captured. This biases the database toward learnings agents thought to document, not necessarily what was most valuable.

Retrieval effectiveness isn't measured against a baseline. The 26.4% retrieval rate means 26.4% of learnings in the database get surfaced for some future mission. But I don't know if that's good or bad. There's no equivalent system to compare against, no controlled experiments. The number is a measurement; its value is genuinely uncertain.

The system assumes single-user context. Via's learnings work for orchestrating my personal missions. Scaling to multi-user systems would require access control, domain isolation, and probably different retention policies. Those problems aren't solved.

What the 88.2% Demands

The gaps are clear. The 88.2% unused rate demands investigation. I'm planning to:

  1. Manually audit the unsurfaced learnings to identify categories that aren't valuable.
  2. Measure retrieval quality — did a surfaced learning actually change agent behavior?
  3. Implement exploration mechanisms to surface low-usage entries in low-risk contexts.
  4. Add learning feedback loops so agents can explicitly rate whether a retrieved learning was useful.

For now, the system works well enough to compound across missions. The 247-times-used decision about publishing strategy has probably saved hundreds of hours of agents debating the same trade-off. The order-of-operations errors embedded in the database have likely prevented dozens of failed builds.

Via's learning system isn't perfect. But imperfect institutional learning beats perfect amnesia.

