
On March 5, 2026, Amazon's checkout pipeline went dark for six hours. The root cause was traced to AI-generated code that had passed standard review and shipped to production. Kiro, Amazon's internal AI development tool, had accelerated the write side of the pipeline. The review side was unchanged. Four days later, Anthropic announced Claude Code Review: parallel AI agents that examine pull requests for bugs and security vulnerabilities, positioned explicitly as a solution to the code volume that tools like Kiro are generating.
Two organizations, one structural problem, two opposite responses. Amazon reached for policy. Anthropic reached for more AI. Neither response is wrong. But neither answers the question that actually matters: when the same generation of models writes the code and reviews it, do they share the same blind spots?
That question has no published answer. The field has been moving too fast to ask it.
The Bottleneck Nobody Planned For
Before AI coding tools became standard, code review was annoying but manageable. A developer opened a PR, a teammate reviewed it, things moved. The constraint was writing speed, not review speed. The two scaled together because they were both human.
That coupling broke in 2025. Code volume scaled with AI. Review throughput did not.
CodeRabbit analyzed 470 repositories through December 2025 and found that PRs per author increased 20% year over year while incidents per PR increased 23.5%. Both numbers moved in the same direction. A team generating more code faster was also generating more failure faster. CodeRabbit produced this data and has a product to sell, so independent replication matters, but the directional claim is consistent with what Amazon discovered in production and with informal observation from anyone running AI coding tools at scale.
The same analysis found that AI-assisted code generates 1.7 times more issues per PR than human-written code.
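Compounding the two CodeRabbit figures shows why the per-author numbers matter more than either figure alone: PRs per author and incidents per PR multiply. A quick sketch using the article's numbers:

```python
# Compound CodeRabbit's two year-over-year figures: incidents per
# author grow by the product of both effects, not their sum.
pr_growth = 0.20         # PRs per author, year over year
incident_growth = 0.235  # incidents per PR, year over year

incidents_per_author_growth = (1 + pr_growth) * (1 + incident_growth) - 1
print(f"incidents per author: +{incidents_per_author_growth:.1%}")  # +48.2%
```

Incidents per author up nearly half in a year, from two numbers that each look moderate in isolation.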
On February 19, I wrote that code review was already broken economically: the human time cost of reviewing noise had killed a $100,000 curl program. What has happened in the four weeks since suggests the economics were a symptom. The structural problem is velocity asymmetry: one side of the review equation scaled, and the other did not.
One Answer Was Policy
Amazon's response was not technical. It was procedural. After the March 5 checkout outage and a separate 13-hour AWS China incident also traced to Kiro-generated code, Amazon issued an internal mandate: AI-assisted code now requires senior engineer sign-off before shipping.
The mandate is honest about what it is. It inserts a human gate back into a pipeline that had effectively removed one. The senior engineer becomes the last line of review before production, personally accountable for code they may not have written and, in many cases, may not fully understand.
Amazon's VP of Developer Relations, Jeff Barr, defended the policy by framing AI as a passive tool. The framing routes accountability downward, to the engineer who approved the PR, rather than upward to the system that generated the code or the organization that removed the review gate that would have caught it. I covered that accountability structure in detail in February. The pattern here is the procedural consequence of it.
The policy is understandable as crisis management. As a long-term architecture, it replicates the bottleneck. Senior engineers are finite. Code velocity is not. A mandate that puts the fastest-growing part of the pipeline behind the slowest human resource in the organization is not a scalable answer. It is a pause.
The Other Answer Was More AI
Anthropic's March 9 launch is the technical response to the same problem. Multiple agents review a PR in parallel: one reads the diff, others examine broader codebase context, and a synthesis agent produces a consolidated report. Average review time is approximately 20 minutes. Pricing is token-billed, typically $15 to $25 per review for large PRs, and the product is available in research preview for Team and Enterprise customers only.
The internal numbers Anthropic published are striking: 84% of PRs with 1,000 or more lines of code received at least one finding, averaging 7.5 issues identified per PR. Fewer than 1% of findings were marked incorrect by developers.
Those numbers are good. They are also self-reported and unblinded. The sub-1% incorrect rate is measured by developer feedback, which means it captures the issues developers noticed were wrong, not the issues the model missed that nobody flagged. A reviewer that misses 40% of bugs in a plausible way will score nearly perfect on developer-marked accuracy because the misses never surface. The denominator is invisible.
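The asymmetry is easy to see with toy numbers (all assumed here, not Anthropic's): developer feedback can only grade findings that were surfaced, so a "marked incorrect" rate is a precision proxy, and recall never enters it.

```python
# Toy numbers (assumed for illustration): 100 real bugs in a PR set.
true_bugs = 100
surfaced = 60       # findings the reviewer actually raised
wrong = 1           # surfaced findings developers marked incorrect
missed = true_bugs - surfaced  # never surfaced, so never graded

marked_incorrect = wrong / (surfaced + wrong)  # what the metric measures
recall = surfaced / true_bugs                  # what it cannot see

print(f"marked incorrect: {marked_incorrect:.1%}")  # 1.6%
print(f"recall:           {recall:.0%}")            # 60%
```

A sub-2% error rate and a 40% miss rate coexist without contradiction, because the 40 missed bugs never appear in the feedback loop.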
The launch materials include two specific cases where Claude Code Review caught real issues: an auth-breaking one-liner that would have invalidated all active sessions and a ZFS encryption key bug. These are plausible catches. They are also Anthropic's chosen examples from their own beta.
The Benchmark That Actually Measures This
On March 10, researchers from DeepMind, Anthropic, and Meta released Code Review Bench, the first large-scale open-source evaluation framework for AI code review tools. The dataset spans 50 real pull requests from Sentry, Grafana, Cal.com, Discourse, and Keycloak, with human-curated gold-standard comments tagged by severity. A second evaluation layer tracks what developers actually do with AI review comments across 5,035 real PRs, measuring online recall: whether a comment leads to action.
Eleven tools were evaluated, among them Claude Code, CodeRabbit, GitHub Copilot, Qodo, Graphite, and Greptile.
The headline finding is not about any single tool. No tool found more than 63% of known issues.
Best-in-class AI code review misses at least 37% of the bugs that independent human reviewers identified in the same PRs. Graphite achieved the highest precision but the lowest recall. CodeRabbit achieved a 0.54 online recall rate across the 5,035-PR evaluation set: its comments led to action on roughly half of the issues developers considered actionable.
Results vary by tool and metric, but the floor is consistent: AI review catches most things and misses a meaningful, non-random share of the rest.
That is useful information. It is not a condemnation. Human reviewers also miss things, and the comparison that matters is net bug escape rate against what review would look like at AI generation throughput with only human reviewers available. On that comparison, AI review almost certainly improves outcomes on volume. Code Review Bench establishes per-tool recall against known issues. What it does not establish is whether the reviewer and the generator cluster their misses in the same places.

The Blind Spot Nobody Has Mapped
When an organization uses one AI system to generate code and a closely related AI system to review it, the relevant question is not how often the reviewer misses issues. It is which issues the reviewer misses that the generator also missed.
If a model has a systematic blind spot toward a class of bug, say a specific pattern of race condition in concurrent code, or a category of input validation failure, or an architectural assumption that produces subtle state corruption at scale, then a reviewer trained on the same data and the same underlying architecture is likely to share that blind spot. The reviewer will catch the bugs it is trained to catch and miss the ones the generator was never trained to avoid. The review pass is not independent. It is a second opinion from a system with the same priors.
This is not a theoretical concern. Code Review Bench demonstrates that even the best tools miss more than a third of known issues. The distribution of those misses is not random. Models trained on similar corpora with similar objectives will cluster their misses similarly. An organization deploying AI generation and AI review from the same provider is running a pipeline where the two sides may be optimizing against the same failure surface, which means certain failure modes pass both gates consistently.
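One minimal way to state the concern, with every parameter assumed for illustration: the 37% figure is a marginal miss rate over arbitrary bugs, but a generation-plus-review pipeline cares about the reviewer's miss rate conditional on the generator having produced the bug. If some fraction of generated bugs falls in a blind-spot class both systems inherit, the conditional rate is strictly higher than the marginal one.

```python
# Toy model, all parameters assumed for illustration.
MARGINAL_MISS = 0.37  # reviewer miss rate on arbitrary bugs (benchmark floor)

def pipeline_miss(shared_blind_spot: float) -> float:
    """Reviewer miss rate on generator-produced bugs, assuming a
    `shared_blind_spot` fraction of them sit in a class the reviewer
    (sharing the generator's priors) never flags, while the rest are
    reviewed at the marginal rate."""
    return shared_blind_spot * 1.0 + (1 - shared_blind_spot) * MARGINAL_MISS

print(f"independent reviewer:   {pipeline_miss(0.0):.0%}")  # 37%
print(f"20% shared blind spots: {pipeline_miss(0.2):.0%}")  # 50%
```

The headline benchmark score would look identical in both cases, because benchmarks sample bugs broadly; only the bugs this generator actually produces expose the correlation.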
Amazon understood something like this when it mandated senior human approval. A senior engineer's review is structurally independent in a way that same-provider AI review is not. The senior engineer brings a different training distribution, a different failure model, and accountability that concentrates rather than diffuses. The question is whether independence, not capability, is the variable that actually determines whether review catches what the generator missed.
No published study has tested this directly. Code Review Bench is a strong step: independent researchers, real production PRs, multi-tool evaluation, two layers of measurement. It does not measure generator-reviewer error correlation. That measurement is the missing layer, and the field is deploying generation-plus-review pipelines without it.

The Mirror Check
There is a version of AI code review that works as advertised. It catches the bugs human reviewers miss because they are fatigued, or because the PR is 1,400 lines and nobody reads PRs that long carefully, or because the issue is in a module the reviewer does not own. Anthropic's auth-breaking one-liner is real. The 7.5 issues per large PR is probably real. The throughput advantage over human review alone is real.
There is also a version where AI code review functions primarily as organizational comfort: it produces reports, flags things, reduces the guilt of skipping dedicated review. If the system generating the code and the system reviewing it share training distribution and share blind spots, the review is not independent. It is a mirror check. The mirror confirms what you already look like. It cannot show you what you cannot see.
The field has spent 2025 measuring how fast AI can write code. It has spent early 2026 measuring how accurately AI can review code. The measurement nobody has built yet is what happens to the bugs that fall between those two systems, the ones the generator produces in the exact register the reviewer was trained to trust.
Amazon's senior approval mandate is blunt and unscalable. Anthropic's parallel agent review is sophisticated and self-validated. Code Review Bench is independent and rigorous within its scope. All three are responses to the speed problem.
None of them answers whether the feedback loop has a hole at the bottom.