
10 min read

Fault Tolerance for AI Agents

How Via's orchestrator survives failures: checkpoint/resume with Jaccard auto-detection, event-loop parallel execution, per-provider concurrency semaphores, and a 6-step defensive teardown that never panics.

ai · orchestration · golang · fault-tolerance

TL;DR

Via's orchestrator now survives failures instead of dying on them. Checkpoint/resume saves mission state to disk with atomic writes and auto-detects resumable missions using Jaccard similarity (threshold 0.3). The event-loop executor dispatches phases immediately as dependencies clear — not batch-and-wait. Per-provider semaphores (claude: 2, gemini: 3) prevent rate-limit storms. And a 6-step defensive teardown runs on every exit path with per-step panic recovery.


The Rate Limit That Killed a Mission

Three research agents into a multi-phase mission, Claude Code hit a rate limit. Phase 3 failed. Phases 4 and 5 depended on Phase 3. The orchestrator marked the mission as failed, and I lost the completed output from Phases 1 and 2. Not the files — those were on disk — but the mission state. The orchestrator had no concept of "resume from where you stopped." I re-ran the entire thing.

That happened twice in one week. After the second time, I built the fault tolerance layer. The orchestrator post described the execution pipeline. This post describes what happens when the pipeline breaks.

Checkpoint Everything

Every state transition in a mission gets checkpointed to disk. Phase starts, phase completes, phase fails, gate runs — each event triggers a checkpoint write. If the process dies at any point, the last checkpoint contains enough information to resume.

internal/orchestrate/checkpoint.go
type Checkpoint struct {
  Version       int              `json:"version"`
  MissionID     string           `json:"mission_id"`
  WorkspacePath string           `json:"workspace_path"`
  Domain        string           `json:"domain"`
  Plan          *CheckpointPlan  `json:"plan"`
  PhaseStates   []PhaseState     `json:"phase_states"`
  GateResults   []GateCheckpoint `json:"gate_results,omitempty"`
  RetryState    []RetryRecord    `json:"retry_state,omitempty"`
  CreatedAt     time.Time        `json:"created_at"`
  UpdatedAt     time.Time        `json:"updated_at"`
  Status        string           `json:"status"`
}

The checkpoint captures the full plan, every phase's status and output, gate results, and retry state. On resume, failed and running phases get reset to pending while completed phases keep their output. The mission picks up where it left off.

internal/orchestrate/orchestrator.go
func (o *Orchestrator) Resume(ctx context.Context, workspacePath string) (*ExecutionResult, error) {
  cp, err := LoadCheckpoint(workspacePath)
  if err != nil {
      return nil, fmt.Errorf("load checkpoint: %w", err)
  }
  plan := cp.RestorePlan()

  // Reset failed/running phases back to pending for re-execution;
  // completed phases keep their status and output
  for _, phase := range plan.Phases {
      if phase.Status == PhaseRunning || phase.Status == PhaseFailed {
          plan.setPhaseStatus(phase, PhasePending)
          phase.Output = ""
          phase.Error = ""
      }
  }

  return o.executePlan(ctx, plan)
}

The write itself is atomic: a temp file is written first, then renamed into place. A file lock (flock) prevents concurrent checkpoint writes from corrupting the file if multiple goroutines race to save state.

internal/orchestrate/checkpoint.go
func SaveCheckpoint(cp *Checkpoint) error {
  // lockFile, the marshaled data, and cpPath are set up earlier in the function (elided)

  // Acquire exclusive flock
  if err := syscall.Flock(int(lockFile.Fd()), syscall.LOCK_EX); err != nil {
      return fmt.Errorf("acquire checkpoint flock: %w", err)
  }
  defer syscall.Flock(int(lockFile.Fd()), syscall.LOCK_UN)

  // Atomic write: temp file + rename
  tmpPath := cpPath + ".tmp"
  if err := os.WriteFile(tmpPath, data, 0644); err != nil {
      return fmt.Errorf("write temp checkpoint: %w", err)
  }
  return os.Rename(tmpPath, cpPath)
}

Auto-Detecting Resumable Missions

Running orchestrator run --resume /path/to/workspace resumes a specific checkpoint. But most of the time I don't remember the workspace path. So the system auto-detects resumable missions.

When you run orchestrator run "research authentication patterns" and a failed checkpoint exists from the last 24 hours with a matching task description, the orchestrator offers to resume it instead of starting fresh.

The matching uses Jaccard similarity on word sets, with a threshold of 0.3:

internal/orchestrate/checkpoint.go
func FindResumableCheckpoint(domain, taskHint string) (string, *Checkpoint) {
  cutoff := time.Now().Add(-24 * time.Hour)
  hintWords := taskWordSet(taskHint)

  // entries: this domain's checkpoint workspaces, oldest first
  // (loading wsPath and cp for each entry is elided)
  for i := len(entries) - 1; i >= 0; i-- {
      // Skip anything older than 24 hours
      if cp.UpdatedAt.Before(cutoff) {
          continue
      }
      // Only resume failed runs that made progress and match the task
      if cp.Status == "failed" && len(cp.CompletedPhaseIDs()) > 0 {
          if taskOverlaps(hintWords, cp.Plan.TaskDesc) {
              return wsPath, cp
          }
      }
  }
  return "", nil
}

The 0.3 threshold is deliberately low. "Research auth patterns" should match "research authentication patterns and implement OAuth" even though the word overlap is partial. The system scans from newest to oldest and returns the first match — if multiple checkpoints qualify, you get the most recent one.

The constraints are strict enough to prevent false positives: the checkpoint must be in the same domain, less than 24 hours old, in a failed state, and have at least one completed phase. A fresh failure with zero completed phases is not worth resuming.
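
The two helpers in that snippet are not shown. Here is a minimal sketch of the word-set Jaccard check, with the bodies assumed rather than copied from the real implementation (it needs the strings package):

// Sketch only: lowercase the task text and split it into a word set.
func taskWordSet(s string) map[string]bool {
  words := make(map[string]bool)
  for _, w := range strings.Fields(strings.ToLower(s)) {
      words[w] = true
  }
  return words
}

// Sketch only: Jaccard similarity (|intersection| / |union|) against the 0.3 threshold.
func taskOverlaps(hint map[string]bool, taskDesc string) bool {
  other := taskWordSet(taskDesc)
  if len(hint) == 0 || len(other) == 0 {
      return false
  }
  intersection := 0
  for w := range hint {
      if other[w] {
          intersection++
      }
  }
  union := len(hint) + len(other) - intersection
  return float64(intersection)/float64(union) >= 0.3
}

The real matcher may also drop stopwords or normalize further; the sketch only shows the core set arithmetic.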

The Event Loop

The original parallel execution was batch-and-wait. Identify all phases with satisfied dependencies, launch them, wait for all to complete, then check for the next batch. If three phases could run in parallel but two finished in 10 seconds while the third took 5 minutes, the fast ones would sit idle waiting for the slow one before the next batch could start.

The event-loop model reacts to each completion individually:

internal/orchestrate/orchestrator.go
func (o *Orchestrator) executeParallel(ctx context.Context, plan *Plan) error {
  sem := make(chan struct{}, o.config.MaxConcurrent)
  doneCh := make(chan phaseResult, len(plan.Phases))
  running := 0

  dispatchReady := func() {
      for _, phase := range plan.GetPendingPhases() {
          plan.setPhaseStatus(phase, PhaseRunning)
          running++
          go func(p *Phase) {
              sem <- struct{}{}
              defer func() { <-sem }()
              err := o.executePhaseWithRetry(ctx, plan, p)
              doneCh <- phaseResult{phase: p, err: err}
          }(phase)
      }
  }

  dispatchReady()

  for running > 0 {
      select {
      case <-ctx.Done():
          return ctx.Err()
      case result := <-doneCh:
          running--
          if result.err != nil {
              plan.setPhaseStatus(result.phase, PhaseFailed)
              plan.skipDependents(result.phase.ID)
          }
          dispatchReady()
      }
  }
  return nil
}

When any phase completes, the loop immediately checks for newly-unblocked phases and dispatches them. No waiting for the batch. This matters for DAGs with uneven phase durations — a 30-second research phase that unblocks a 5-minute implementation phase starts the implementation immediately, not after all sibling research phases finish.

The dispatchReady function is the key. It scans the plan for pending phases whose dependencies are all satisfied and launches them. It runs once at the start (to kick off root phases) and again after every completion (to dispatch newly-unblocked phases). The sem channel caps total concurrency at MaxConcurrent — a buffered channel acting as a counting semaphore.

The fail-forward behavior is also in this loop. When a phase fails, skipDependents does a BFS walk through the dependency graph and marks all transitive dependents as skipped. The mission continues with whatever phases are still viable, rather than aborting entirely. If Phase 3 fails but Phase 4 was independent, Phase 4 still runs.
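
skipDependents itself is not shown above. A rough sketch of the BFS, assuming each Phase carries its dependency IDs in a DependsOn field and that a PhaseSkipped status exists (both names are assumptions here):

// Sketch only: mark every transitive dependent of failedID as skipped.
func (p *Plan) skipDependents(failedID string) {
  queue := []string{failedID}
  skipped := map[string]bool{}
  for len(queue) > 0 {
      current := queue[0]
      queue = queue[1:]
      for _, phase := range p.Phases {
          if skipped[phase.ID] {
              continue
          }
          for _, dep := range phase.DependsOn {
              if dep == current {
                  p.setPhaseStatus(phase, PhaseSkipped)
                  skipped[phase.ID] = true
                  queue = append(queue, phase.ID)
                  break
              }
          }
      }
  }
}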

Per-Provider Concurrency

Running five Claude Code agents in parallel sounds efficient until they all hit the API rate limit simultaneously. The orchestrator now enforces per-provider concurrency limits using buffered channels as semaphores.

internal/orchestrate/orchestrator.go
// Initialize per-provider semaphores from config
providerSems := make(map[string]chan struct{})
for provider, limit := range config.ProviderLimits {
  providerSems[provider] = make(chan struct{}, limit)
}

The defaults are claude: 2, gemini: 3. Claude's API rate limits are tighter, so the orchestrator caps Claude agents at 2 concurrent. Gemini is more permissive, allowing 3. These limits interact with the global MaxConcurrent setting (default 3) — the global limit caps total goroutines while provider limits cap per-provider goroutines within that total.

Before launching a phase, the executor acquires a slot from the provider's semaphore. If all slots are taken, the goroutine blocks until one frees up. This is a natural throttle: the event loop keeps dispatching phases, but they queue up at the semaphore boundary rather than hammering the API.
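
Inside the phase goroutine, the acquisition order looks roughly like this. The providerSems map on the Orchestrator and the Provider field on Phase are assumed names; the pattern is what matters:

// Sketch only: take the global slot, then the provider slot, then run.
func (o *Orchestrator) runPhase(ctx context.Context, plan *Plan, phase *Phase, sem chan struct{}) error {
  sem <- struct{}{} // global MaxConcurrent slot
  defer func() { <-sem }()

  if provSem, ok := o.providerSems[phase.Provider]; ok {
      provSem <- struct{}{} // per-provider slot (claude: 2, gemini: 3 by default)
      defer func() { <-provSem }()
  }

  return o.executePhaseWithRetry(ctx, plan, phase)
}

Either ordering works as a throttle; this sketch takes the global slot first so that a scarce provider slot is never held by a goroutine that cannot run yet.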

The combination of event-loop dispatch and provider semaphores handles the common case gracefully: a mission with 2 Claude implementation phases and 3 Gemini research phases runs all 5 phases with maximum parallelism without exceeding either provider's limits. The phases that depend on different providers naturally interleave.

Phase Output Verification

Before the gate even runs, the orchestrator verifies that file paths mentioned in the agent's output actually exist on disk. Agents sometimes claim "I wrote the file to /path/to/output.go" when the file is not there — a hallucination about their own actions.

internal/orchestrate/verify.go
// Matches absolute (/...), relative (./ or ../), and home (~/) paths that end in a file extension
var pathPattern = regexp.MustCompile(
  "(?:^|[\\s`\"'(])(/[a-zA-Z0-9_./-]+\\.[a-zA-Z0-9]+|\\.{1,2}/[a-zA-Z0-9_./-]+\\.[a-zA-Z0-9]+|~/[a-zA-Z0-9_./-]+\\.[a-zA-Z0-9]+)")

func VerifyPhaseOutput(output string) error {
  paths := ExtractFilePaths(output)
  if len(paths) == 0 {
      return nil
  }
  missing := VerifyReferencedFiles(paths)
  if len(missing) == 0 {
      return nil
  }
  return fmt.Errorf("referenced files not found: %s", strings.Join(missing, ", "))
}

Verification is strict for writer agents and lenient for others. If a writer agent claims to have produced a file that does not exist, that is a hard failure — the phase retries. A researcher referencing a URL or an external path gets a warning log but not a retry. The distinction matters because writer agents are the ones whose output gates depend on, and a missing file there cascades into downstream failures.
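
The strict/lenient split can be expressed as a small policy at the call site. A sketch, where the AgentRole field and the "writer" role string are assumed names:

// Sketch only: hard-fail writer phases on missing files, warn for everyone else.
func verifyByRole(phase *Phase) error {
  err := VerifyPhaseOutput(phase.Output)
  if err == nil {
      return nil
  }
  if phase.AgentRole == "writer" {
      // A writer claiming a file that is not on disk is a hard failure; the phase retries
      return fmt.Errorf("phase %s: %w", phase.ID, err)
  }
  // Researchers and reviewers get a warning but the phase continues
  log.Printf("[verify] phase %s: %v (non-writer, continuing)", phase.ID, err)
  return nil
}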

Defensive Teardown

The teardown runs on every exit path. Success, failure, context cancellation, panic — the teardown always executes. It has 6 ordered steps, each wrapped in its own recover():

internal/orchestrate/teardown.go
func (o *Orchestrator) teardown(plan *Plan) {
  // Step 1: Signal agents to stop
  o.teardownSignalAgents()
  // Step 2: Wait for graceful stop (with timeout)
  o.teardownWaitAgents()
  // Step 3: Capture learnings from completed phases
  o.teardownCaptureLearnings(plan)
  // Step 4: Save final checkpoint
  o.teardownSaveCheckpoint(plan)
  // Step 5: Release provider semaphores
  o.teardownReleaseSemaphores()
  // Step 6: Close learnings database
  o.closeLearningsDB()
}

Each step is independent and failure-isolated. If Step 3 (learning capture) panics because of a corrupt output file, Step 4 (save checkpoint) still runs. The checkpoint is the most critical step — without it, the mission cannot be resumed.

internal/orchestrate/teardown.go
func (o *Orchestrator) teardownSignalAgents() {
  defer func() {
      if r := recover(); r != nil {
          log.Printf("[teardown] panic in signal step: %v", r)
      }
  }()
  // ...
}

The ordering matters. Signal agents before waiting for them. Capture learnings before saving the checkpoint (so the checkpoint reflects captured state). Release semaphores before closing the database (since some goroutines might still be holding semaphores and writing learnings). Close the database last because everything else might need it.
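
Wiring the teardown onto every exit path comes down to deferring it before any work starts. Roughly, with the wrapper name assumed:

// Sketch only: the deferred call runs on success, error returns, cancellation, and panics alike.
func (o *Orchestrator) executePlanSafely(ctx context.Context, plan *Plan) (result *ExecutionResult, err error) {
  defer func() {
      if r := recover(); r != nil {
          err = fmt.Errorf("orchestrator panic: %v", r)
      }
      o.teardown(plan) // always runs, even after a panic
  }()
  return o.executePlan(ctx, plan)
}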

The Honest Limitations

Fault tolerance is better but not complete:

No distributed state. The checkpoint lives on the local filesystem. If the disk fails or the workspace directory is deleted, the mission state goes with it. For a CLI tool running on a single developer machine, this is fine. For anything more distributed, it would need a remote state store.

Resume resets entire failed phases. When resuming, a failed phase starts over from scratch. If the phase was 90% complete when it failed, the agent redoes all of it. Finer-grained checkpointing within phases — saving intermediate agent output — would help, but it would require instrumenting the agent subprocess itself.

Provider limits are static. The claude: 2, gemini: 3 limits are hardcoded in the config. They do not adapt to actual rate limit responses. A smarter system would start at the configured limit and back off dynamically when rate limits hit, then gradually recover. The current approach is conservative enough to avoid most rate limits, but it leaves throughput on the table during off-peak hours.

No cross-mission deduplication for checkpoints. Each mission gets its own workspace directory. If I run the same mission twice, both produce checkpoints independently. There is no mechanism to detect that two missions are producing duplicate work and merge their results.

