Against Agentic Everything
Air Canada put an AI chatbot in front of its customers to handle inquiries. In 2022, when Jake Moffatt asked about bereavement fares after his grandmother died, the chatbot told him he could book at full price and apply for a retroactive discount within 90 days. No such policy existed. Air Canada refused the refund and argued the chatbot was a separate legal entity responsible for its own actions. A tribunal disagreed.
The chatbot did exactly what it was designed to do: answer questions autonomously, without human oversight, at scale. Nobody had designed for what happens when it answers confidently and wrong.
Andrej Karpathy put it directly: “It will take about a decade to work through all of those issues.” Not a quarter. Not a roadmap item. A founding member of OpenAI saying the technology his industry is selling does not work yet.
The problems are compounding. The reliability math does not work. The security architecture is fundamentally vulnerable. The data infrastructure does not exist. And much of what the industry is selling is fraud.
The math
Researchers from UC Berkeley and collaborating institutions analyzed over 1,600 execution traces across five multi-agent frameworks and catalogued 14 distinct failure modes. In some frameworks, failure rates exceeded 75%.
That number is not surprising once you understand compounding. At a 20% error rate per action, a five-step workflow drops to about 33% success. Even at 99% per-step reliability, which nobody has demonstrated in production, you only get 82% success over 20 steps. The researchers found most failures cluster into three categories: system design flaws, inter-agent misalignment, and failures in task verification. Error cascades connect all three: one agent makes a small mistake, the next accepts it as ground truth, and by the fourth step the output is confidently wrong.
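The arithmetic is easy to check. A minimal sketch in plain Python, assuming nothing about any particular agent framework, makes the compounding explicit:

```python
# End-to-end success of a workflow where every step must succeed:
# P(workflow) = p ** n, with p = per-step success rate, n = number of steps.

def workflow_success(per_step_success: float, steps: int) -> float:
    return per_step_success ** steps

print(workflow_success(0.80, 5))    # ~0.328: a 20% per-step error rate leaves ~33% over 5 steps
print(workflow_success(0.99, 20))   # ~0.818: even 99% per step gives ~82% over 20 steps
print(workflow_success(0.999, 20))  # ~0.980: the kind of per-step reliability long workflows need
```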
Multi-step production workflows need reliability that current agents cannot provide. The math does not work.
The security problem
Simon Willison identified “the lethal trifecta” for AI agents: access to private data, exposure to untrusted content, and the ability to communicate externally. Combine all three and an attacker can trick your agent into exfiltrating your data.
The core problem is that LLMs believe anything you tell them. They cannot distinguish instructions from content. Willison has documented this vulnerability in ChatGPT, Amazon Q, GitHub Copilot, Microsoft Copilot, Slack, and Claude. Almost all of those were patched by locking down specific exfiltration channels, but the underlying vulnerability is architectural. That is the question every agentic deployment has to answer: whether to let a model that accepts anything presented to it as true go out and act on your behalf.
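A minimal sketch of how the trifecta turns into a deployment check. The capability names are mine, not Willison's; the point is that the check is on the combination, not on any single capability:

```python
from dataclasses import dataclass

@dataclass
class AgentCapabilities:
    reads_private_data: bool          # e.g. email, internal docs, customer records
    ingests_untrusted_content: bool   # e.g. web pages, inbound email, support tickets
    can_communicate_externally: bool  # e.g. HTTP requests, outbound email, webhooks

def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """All three together let injected instructions exfiltrate private data."""
    return (caps.reads_private_data
            and caps.ingests_untrusted_content
            and caps.can_communicate_externally)

support_bot = AgentCapabilities(True, True, True)
if has_lethal_trifecta(support_bot):
    raise RuntimeError("Remove at least one capability before deploying this agent.")
```

Dropping any one of the three closes the exfiltration path; which one you drop is the design decision.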
The data problem
The disconnect is between what agents promise and what existing data infrastructure can support. Joe Reis has observed that very few people raise their hands when he asks audiences whether they would put an LLM on top of their existing data.
Before you deploy an agent, you need to know what your data means. Data contracts are not optional infrastructure. They are the prerequisite for anything autonomous. Most corporate datasets are what Reis calls “utter hellscapes”: poorly named columns, janky data models, timestamps without timezones because “everyone knows it’s Eastern.”
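As a sketch of what "knowing what your data means" looks like in practice, here is a minimal contract that rejects exactly the defects Reis describes. It uses pydantic (v2 API) purely as an illustration, and the field names are hypothetical:

```python
from datetime import datetime, timezone
from pydantic import BaseModel, field_validator

class OrderEvent(BaseModel):
    """Contract for one row an agent is allowed to act on."""
    order_id: str
    amount_cents: int     # explicit unit, not a float vaguely named "amt"
    occurred_at: datetime # must carry a timezone; "everyone knows it's Eastern" is rejected

    @field_validator("occurred_at")
    @classmethod
    def require_timezone(cls, value: datetime) -> datetime:
        if value.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")
        return value.astimezone(timezone.utc)
```

An agent that only ever sees rows passing a contract like this is a very different risk profile from one pointed at the raw warehouse.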
The data readiness problem has not been solved. It has been papered over with demos.
The fraud problem
The SEC charged Presto Automation with claiming its AI drive-thru “eliminated the need for human order taking.” Reality: over 70% of orders required human agents in the Philippines and India.
The SEC and DOJ charged Albert Saniger of Nate Inc. with fraud. He raised $42 million claiming AI automation. Actual automation rate: zero percent. Hundreds of contractors in call centers in the Philippines and Romania manually completed purchases.
Gartner estimates only 130 of thousands of vendors claiming “agentic AI” are real. The rest are agent-washing: rebranding existing chatbots and assistants without substantial agentic capabilities.
What works
The pattern in every success is the same: narrow, internal, supervised.
Klarna deployed an AI assistant for customer service in early 2024. Within a month it handled 2.3 million conversations, two-thirds of all customer chats, with resolution times dropping from 11 minutes to under 2. Then they pushed further. They cut staff by 40% and let AI handle increasingly complex queries. Quality dropped. Complaints rose. By 2025, CEO Sebastian Siemiatkowski acknowledged they had gone too far and began rehiring human agents.
The initial deployment worked because it was narrow: routine inquiries with clear escalation paths. The expansion failed because it crossed into territory requiring judgment, empathy, and context that LLMs cannot reliably provide. The line between the two is where every agentic deployment succeeds or fails.
The winning implementations are hybrid: LLM reasoning combined with deterministic guardrails and human-in-the-loop oversight. The failures are broad, external, and autonomous. Autonomy is a design choice, not a goal.
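What "deterministic guardrails plus human-in-the-loop" means in code is roughly the following sketch. The threshold and the hooks (draft_with_llm, ask_human, execute) are placeholders of my own, not any particular framework:

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUND_LIMIT_CENTS = 5_000  # deterministic ceiling: the agent never exceeds this on its own

@dataclass
class RefundProposal:
    amount_cents: int
    cites_policy: Optional[str]  # which written policy the draft relies on, if any

def handle_refund(request: str,
                  draft_with_llm: Callable[[str], RefundProposal],
                  ask_human: Callable[[str, RefundProposal], str],
                  execute: Callable[[RefundProposal], str]) -> str:
    proposal = draft_with_llm(request)        # the LLM drafts an action; it executes nothing

    # Deterministic guardrails run before anything irreversible happens.
    if proposal.cites_policy is None:         # no grounded policy reference -> escalate
        return ask_human(request, proposal)
    if proposal.amount_cents > REFUND_LIMIT_CENTS:
        return ask_human(request, proposal)   # above the ceiling -> a person decides

    return execute(proposal)                  # narrow, bounded, loggable action
```

The Air Canada failure and the Klarna overreach both happened on the path this sketch forbids: the model's confident draft going straight to the customer.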
What I am still figuring out
The decade timeline feels right, but the path is unclear. Current agents fail on multi-step tasks, and progress against those failures is not predictable. Some capabilities are advancing (coding, research). Others are stuck (reliable tool use, long-horizon planning).
The question I cannot answer: is this a capabilities problem that more compute will solve, or an architectural problem that requires a different approach? Karpathy thinks capabilities. Yann LeCun thinks architecture, predicting the current LLM paradigm has a shelf life of three to five years. If LeCun is right, the current wave of agent investments is building on the wrong foundation.
Either way, the companies deploying agents today are running the experiment with their own money. Some will learn. Most will join the pilot graveyard.
The problem is not AI. The problem is “agentic everything”: the assumption that autonomy is progress, that removing humans is the objective, that governance can come later.
Governance cannot come later. The companies shipping agents without controls are building the next incident. The companies that build data contracts, audit logs, and kill switches first are building something that might actually work.
Karpathy thinks it will take a decade. I think he is right.