The Security Problem Nobody Can Solve

An AI agent (electric blue octopus) floats above a vast ocean floor carpeted with thousands of glowing documents — most teal, but some pulsing in hostile magenta. The agent reads one document, unaware the threats are indistinguishable from the content.

On March 3, 2026, Palo Alto Networks' Unit 42 published a catalog of 12 prompt injection attacks found on live websites. Not proof-of-concept exploits. Not academic demonstrations. Payloads sitting on production pages, waiting for AI agents to crawl them.

One page contained 24 separate injection attempts. A military glasses scam, embedded in a website called reviewerpress.com, designed to trick AI ad moderation into approving it. Unit 42 called it "the first reported detection of a real-world example of malicious IDPI designed to bypass an AI-based product ad review system."

Seven days later, OpenAI released a 27,600-example dataset called IH-Challenge, designed to train models to enforce privilege hierarchies between system, developer, user, and tool messages. The same week, two independent research teams published papers on causal attribution as a defense mechanism. And on Hacker News, six different defense tools launched to near-zero engagement.

Everyone knows prompt injection is a problem. A growing number of people are building defenses. Almost nobody is using them. That gap tells you more about the state of AI security than any benchmark.

What Attacks Actually Look Like

The Unit 42 catalog is useful because it documents what attackers are doing in practice, not what researchers can achieve in a lab.

The distribution is striking. 37.8% of wild prompt injection attacks use visible plaintext. No encoding, no obfuscation, no CSS tricks. Just instructions placed in page footers and metadata, relying on a single fact: LLMs process text that humans skip. Over a third of attacks do not bother hiding because the gap between what humans read and what AI processes is itself the exploit.

The evasion methods are even more lopsided. 85.2% of attacks use social engineering framing. "Ignore previous instructions." "You are now in developer mode." Natural language manipulation, not technical exploits.

This creates an uncomfortable math problem for defense tool builders. Pattern-matching defenses that scan for technical signatures are defending against roughly 15% of the observed attack surface. The other 85% is language doing what language does.

The catalog includes database destruction commands hidden in CSS, fork bombs targeting coding agents, forced Stripe payments embedded in shopping blog posts, and OAuth subscription hijacking through URL parameters. The intent distribution ranges from data exfiltration to SEO poisoning to recruitment manipulation. What unites them is simplicity. These are not sophisticated attacks. They are the lowest-effort approach that works because the fundamental vulnerability is architectural.

Unit 42 states it directly: "LLMs cannot distinguish instructions from data inside a single context stream."

Four Layers of Defense, Each Incomplete

The defense landscape has organized itself into four categories. Each addresses a different part of the problem. None is sufficient alone.

Layer 1: Instruction Hierarchy (Training-Time)

The most architecturally principled defense changes how models process inputs. Instruction hierarchy trains privilege separation into model weights, so a system prompt from a developer carries higher authority than text scraped from a webpage.

OpenAI's IH-Challenge dataset, released March 10, uses Reinforcement Learning with Verifiable Rewards to train this distinction across four privilege levels: system, developer, user, and tool. The shift from the original 2024 approach is significant. LLM judges, which are subjective and noisy, are replaced by deterministic Python graders. The trained model, GPT-5-Mini-R, improved adaptive red-team robustness from 63.8% to 88.2%.

That 88.2% number matters for two reasons. First, it represents a 38% relative improvement, which is substantial. Second, it means that under adaptive attack by OpenAI's own red team, the defense still fails 11.8% of the time. On a model trained specifically to resist these attacks, evaluated by the team that built it.

The convergence signal is what makes instruction hierarchy worth watching. Three independent groups arrived at the same idea through different technical approaches. OpenAI trained behavioral changes into model weights. Wu et al. at ICLR 2025 added trainable segment embeddings to the input layer. NVIDIA injected hierarchy signals across all decoder layers with 0.4M parameters of overhead on an 8B model. When geographically and institutionally separate teams converge on the same paradigm, that is historically the strongest signal a direction has legs.

The ISE paper from Wu et al. is the only one in this space that was accepted at a top venue. OpenAI's original instruction hierarchy paper was rejected from ICLR 2025. That peer-review outcome matters for calibrating confidence.

Layer 2: Causal Attribution (Runtime)

A newer paradigm asks not "what does the input contain?" but "why was this tool call produced?"

AttriGuard, published March 11 by a team at Zhejiang University, runs a counterfactual test for every tool call an agent proposes. It re-executes the agent with observation streams that have been progressively stripped of control influence. If the tool call vanishes under attenuation, it was driven by injected content and gets blocked. If it persists, it was driven by user intent and proceeds.

The paper reports 0% attack success rate across all four static attack categories on the AgentDojo benchmark, while maintaining benign utility within 3% of the undefended baseline. Under adaptive attack with a full-knowledge adversary, the rate rises to 6.56% on Gemini and 9.84% on Llama. Competing defenses that also achieve 0% on static attacks degrade to 24-82% under the same adaptive pressure.

One month earlier, Google Cloud AI Research independently published CausalArmor using a different causal attribution mechanism. Two independent teams, same paradigm, same month. That convergence matters.

The limitation is precise and fundamental. When the injected objective overlaps with a legitimate sub-goal of the user's actual task, the counterfactual test cannot distinguish them. Every successful adaptive attack against AttriGuard exploited this specific failure mode: "visit this URL" during information-seeking tasks. The defense is strongest when the attack is most foreign to the task. It is weakest when the attack mimics something the user might plausibly want.

Both papers are preprints. Neither has been peer-reviewed. The 0% static claim warrants independent replication.

Layer 3: Perimeter Filtering (Pre-Model)

The most familiar category. ML classifiers and pattern matchers that scan inputs before they reach the model.

The current state: Qualifire Sentinel v2 achieves 0.964 F1. StackOne Defender reaches 0.887 F1 in a 22 MB package at 4ms on CPU. Below them, a wave of newer tools with fewer benchmarks and less adoption. All benchmarks are self-reported.

Simon Willison, who has tracked prompt injection publicly for over three years, summarized the approach: "Slap a bunch of leaky heuristics over the top of your system, then cross your fingers and hope."

The criticism is not that these tools are useless. It is that perimeter defenses have a ceiling. As CodeIntegrity documented, a defense that is 98% accurate is still broken against motivated adversaries. The history of web security offers a precise analogy. WAFs never solved SQL injection. Parameterized queries did. Perimeter defenses reduce risk from opportunistic attacks. They do not provide architectural security.

Layer 4: Human Checkpoints (Blast Radius)

The least technical defense and the one practitioners actually recommend. Limit what the agent can do. Require confirmation for consequential actions. Restrict permissions to the minimum necessary.

OpenAI deploys this in ChatGPT Atlas as "Watch Mode," where the agent pauses on sensitive sites if the user navigates away. The agent asks before completing purchases or sending emails.

On Hacker News, the most commonly recommended defense category is architectural constraint. One developer, after a "Grandma prompt dropped a production database," built a Redis-backed kill switch with sub-50ms latency checks wrapping every tool function call. The tool is crude. It works. It does not try to detect prompt injection. It limits what a successful injection can do.

A coral octopus stands mid-presentation on an elaborate stage displaying six glowing defense tools on pedestals — but the amphitheater seating is almost entirely empty, with only 3-4 tiny silhouettes scattered across hundreds of vacant rows.

The Adoption Gap

This is the part of the landscape that should concern anyone building with LLMs.

Six prompt injection defense tools launched on Hacker News between October 2025 and March 2026. Their combined comment count: 3. Most received zero engagement. Not criticism, not skepticism. Silence.

The problem-discussion threads, by contrast, draw 15 to 140 comments. Practitioners want to talk about prompt injection. They do not want to adopt the tools being built to address it.

The dominant practitioner posture, assembled from 15 threads and over 400 comments, is what I would call "accept and contain." Not "detect and block." Build features so prompt injection does not matter. Restrict permissions. Add human gates. Treat the vulnerability as permanent and design around it.

One commenter captured it cleanly: "It's inevitable, and you have to build your feature in a way so it doesn't matter."

OpenAI agrees, though they phrase it differently. From their December 2025 blog: "Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully 'solved.'" The UK National Cyber Security Centre reached the same conclusion independently.

When the model provider, the national cybersecurity agency, and the practitioner community all converge on the same assessment, anyone selling a tool that claims to "solve" prompt injection is contradicting every informed party simultaneously.

A purple octopus stands at the center of a triangular crossroads, three abstract advisor figures surrounding it — an optimist with a glowing green scroll (SQL injection was solved), a pessimist surrounded by floating phishing hooks (social engineering never was), and a structuralist pointing at a mixed-channel pipe diagram (it's architectural).

The Three Analogies

Practitioners reach for historical precedent to understand the problem, and where they land reveals what they believe about the future.

The optimists cite SQL injection. Before parameterized queries, every database was vulnerable. Then a structural fix arrived and the class of vulnerability effectively disappeared. They expect an equivalent for LLMs.

The pessimists cite social engineering. No technical fix has ever eliminated phishing. You can train users, add filters, deploy authentication, and the attack surface remains because it targets human cognition, not software. LLMs, they argue, have the same fundamental vulnerability.

The structuralists cite in-band signaling. The telephone network mixed control signals and voice on the same channel, and phreakers exploited it for decades until the architecture changed. LLMs mix instructions and data on the same channel. The fix is architectural separation. But as one commenter realized mid-thread: "There is no out-of-band stream to a language model."

That is the tension at the center of every defense effort. The property that makes LLMs useful is the same property that makes them vulnerable. They process natural language with no native distinction between instructions and data. Every defense is a workaround for an architectural decision that cannot be reversed without building a fundamentally different kind of system.

What Convergence Reveals

Three independent groups built instruction hierarchy. Two built causal attribution. Unit 42 documented 22 payload engineering techniques across 12 wild attack cases. OpenAI released 27,600 training examples. Six defense tools launched to empty rooms.

The research is converging. The defenses are improving. And the adoption gap is as wide as it has ever been.

The most honest position is OpenAI's own: invest in layered defenses, run continuous red-teaming, accept that 11.8% of adaptive attacks still get through on your best model, and keep iterating. There is no clean answer. There may never be one. The question is whether we can build systems that are robust enough for the things we are already using them for, before the attackers who are already deploying plaintext injections on live websites figure out that social engineering is all they need.

Notion received a responsible disclosure report about prompt injection data exfiltration in their AI features. Their response: "We're closing this finding as Not Applicable."

The attackers are not waiting for the defenders to get organized.

Enjoyed this essay?