Wendy’s launched their AI drive-thru in a single Columbus, Ohio restaurant in June 2023. They spent a year in that one location before expanding to four stores. Another year to reach 36. They had people camped in the dining room for thousands of hours, talking to crews and customers. By 2025, they were planning 500 locations.

McDonald’s took a different path. They rolled out the system built with IBM to over 100 restaurants in two years. Viral videos showed the system adding bacon to ice cream and ordering nine sweet teas instead of one. In June 2024, they shut it down entirely.

Same technology category. Same industry. Same use case. Opposite outcomes.

The pattern in every AI success is the same: narrow scope, human oversight, automatic rollback, and the patience to learn in production rather than the lab.

Shadow deployment

Uber runs shadow testing on 75% of critical online use cases, with plans to reach 100%. New models process real traffic alongside production models, but only the current model’s predictions reach users. Auto-rollback reverts to the last known good version if error rates, latency, or CPU utilization breach thresholds.
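The shadow pattern is simple to sketch. In this minimal version (class and field names are hypothetical, not Uber's actual stack), both models see the same real request, but only the production model's answer reaches the user; the candidate's answer and timing are only logged for offline comparison.

```python
import time

class ShadowRouter:
    """Serve the production model; run the candidate in shadow and only log it."""

    def __init__(self, prod_model, shadow_model, log):
        self.prod = prod_model
        self.shadow = shadow_model
        self.log = log  # callable that records shadow results for offline comparison

    def predict(self, features):
        # Production path: this result is what the user actually sees.
        start = time.perf_counter()
        prod_out = self.prod(features)
        prod_ms = (time.perf_counter() - start) * 1000

        # Shadow path: same real traffic, but the output never reaches the user.
        try:
            start = time.perf_counter()
            shadow_out = self.shadow(features)
            shadow_ms = (time.perf_counter() - start) * 1000
            self.log({"prod": prod_out, "shadow": shadow_out,
                      "prod_ms": prod_ms, "shadow_ms": shadow_ms,
                      "agree": prod_out == shadow_out})
        except Exception as exc:
            # A shadow failure is data, not an outage.
            self.log({"shadow_error": repr(exc)})

        return prod_out
```

The try/except is the point: a crashing candidate becomes a log entry to study, not an incident.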

Netflix compares 1,000+ metrics between baseline and canary code, generating a confidence score that indicates how likely the canary is to succeed in production. Their deployment strategy: 1% of traffic, then 5%, then 25%, over 14 hours. Before they built this infrastructure, going from project start to deployment took four months. Now the median is seven days.
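The staged ramp above can be sketched in a few lines. This is an illustrative reduction, not Netflix's Kayenta system: the confidence score here is just the fraction of metrics where the canary stays within tolerance of the baseline, and the thresholds are made up.

```python
STAGES = [0.01, 0.05, 0.25]  # the 1% -> 5% -> 25% ramp described above

def confidence_score(baseline, canary, tolerance=0.05):
    """Fraction of metrics where the canary stays within tolerance of baseline."""
    ok = sum(
        1 for name, base in baseline.items()
        if base and abs(canary[name] - base) / abs(base) <= tolerance
    )
    return ok / len(baseline)

def run_canary(collect_metrics, route_fraction, threshold=0.95):
    """Ramp traffic stage by stage; halt the rollout on a low confidence score."""
    score = 0.0
    for fraction in STAGES:
        route_fraction(fraction)              # shift this share of traffic to the canary
        baseline, canary = collect_metrics()  # e.g. error rate, p99 latency, CPU
        score = confidence_score(baseline, canary)
        if score < threshold:
            route_fraction(0.0)               # roll back: all traffic to baseline
            return False, score
    return True, score
```

The shape matters more than the numbers: traffic only increases while the evidence says the canary behaves like the baseline, and a bad stage sends everything back automatically.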

Stripe’s fraud detection assesses over 1,000 characteristics per transaction and makes a decision in under 100 milliseconds. Out of billions of legitimate payments, Radar incorrectly blocks just 0.1%. Their feature freshness runs at 150ms p99, nearly real-time response to changing fraud patterns.

These companies did not build better models. They built better deployment infrastructure. The model is not the product. The system that validates, monitors, and rolls back the model is the product.

Narrow scope

The AI implementations that work are not general-purpose. They are narrow, bounded, and boring.

H&M’s customer support chatbot handles order tracking, return policies, and sizing assistance: the repetitive queries that overwhelmed their human team. Response time dropped from minutes to seconds. Operational costs fell by an estimated 30%. Available 24/7 in over 30 languages.

Mavi, a Turkish retailer, boosted revenue 9.6% with AI-powered inventory optimization across 439 stores. Amazon and Walmart report significant improvements from predictive inventory systems, though the specific figures are difficult to verify independently.

IT ticketing systems see 40–60% ticket deflection within 90 days and 25–40% faster resolution for escalated cases. The key insight: 60–70% of inbound volume is repetitive, low-complexity work that does not require human judgment. Those are the ideal candidates for automation. Not the complex cases, not the edge cases. The boring repetitive volume.
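That split, deflect the repetitive volume and escalate everything else, is a routing decision. A toy sketch (the intents and keyword matching are stand-ins for whatever classifier a real system uses):

```python
# Keyword matching is a stand-in for a real intent classifier.
REPETITIVE_INTENTS = {
    "password reset": "Send the self-service reset link.",
    "vpn setup": "Point to the VPN setup guide.",
    "unlock account": "Trigger the automated unlock flow.",
}

def route_ticket(text):
    """Deflect repetitive, low-complexity tickets; escalate everything else."""
    lowered = text.lower()
    for intent, action in REPETITIVE_INTENTS.items():
        if intent in lowered:
            return {"handled_by": "bot", "action": action}
    # Novel or complex: a human gets it, with the bot's read attached as context.
    return {"handled_by": "human", "action": "escalate with conversation context"}
```

Note the default: anything the system does not recognize goes to a person. The automation only claims the tickets it was scoped for.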

The scope determines the outcome. Broad, autonomous, general-purpose deployments fail. Narrow, supervised, specific deployments work.

Human in the loop

A Harvard and BCG study on consultants using AI found a 40% productivity increase when AI was used within its capability boundary. When used outside that boundary, performance dropped by 19 percentage points.

The researchers identified two working styles: “centaurs” who divided tasks between AI and themselves, and “cyborgs” who fully integrated their workflow with AI. Both patterns succeeded when the human understood where the AI’s competence ended.

The lesson is not that AI needs babysitting. The lesson is that humans and AI have complementary failure modes. AI fails on edge cases, novel situations, and context that was not in the training data. Humans fail on attention, consistency, and volume. The combination catches what either would miss alone.
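One common way to operationalize that boundary is a confidence gate, a minimal sketch (the threshold and field names are illustrative, not from the study):

```python
def decide(model_output, confidence, boundary=0.9):
    """Let the model act only inside its capability boundary; otherwise defer."""
    if confidence >= boundary:
        return {"decision": model_output, "by": "ai"}
    # Outside the boundary the human decides; the model's guess is only a hint.
    return {"decision": None, "by": "human", "hint": model_output}
```

The design choice is in the low-confidence branch: the model's output is demoted to context rather than discarded, which is roughly the centaur division of labor.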

Automatic rollback

The companies that avoid disaster are the ones that built the kill switch before they built the model.

Uber’s auto-rollback fires on three signals: error rates, latency, or resource utilization breaching thresholds. No human approval required. The system reverts to the last known good version automatically.
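The trigger is mechanical enough to sketch. The threshold values here are invented for illustration; the structure (any breach fires the rollback, no approval step) is the pattern described above.

```python
# Illustrative thresholds; real values come from the service's SLOs.
THRESHOLDS = {"error_rate": 0.02, "p99_latency_ms": 500.0, "cpu_utilization": 0.85}

def check_and_rollback(metrics, rollback):
    """Fire the rollback on any breached threshold; no human approval in the loop."""
    breached = [name for name, limit in THRESHOLDS.items()
                if metrics.get(name, 0.0) > limit]
    if breached:
        rollback()  # revert to the last known good version
    return breached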

Industrial AI implementations now require mapped failure modes, tabletop rollback tests within 90 days, and emergency stops independent of the AI. The kill switch is not optional. It is the prerequisite.

A rollback is not a dashboard someone checks. It is a trigger that fires before the damage compounds.

What the 6% do differently

McKinsey found that 88% of organizations now use AI in at least one business function. Only 6% qualify as “high performers” achieving significant profit impact. The gap is not technology.

High performers are three times more likely to redesign workflows rather than just adding AI to existing processes. 55% fundamentally redesigned individual workflows, compared to 20% of other companies.

The MIT NANDA research found that purchasing AI tools from specialized vendors succeeds 67% of the time, while internal builds succeed only about a third of the time. The successful companies treated vendors like business service providers rather than software suppliers: deep customization, outcome-based evaluation, co-evolutionary development.

A Capital One survey of nearly 4,000 business leaders found that 73% identified data quality and completeness as a top barrier to AI success, ranking it alongside data security and above model accuracy, computing costs, and talent shortages.

BCG’s 10-20-70 principle summarizes it: 10% algorithms, 20% data and technology, 70% people, processes, and cultural transformation. A RAND report found that AI projects fail at twice the rate of other IT projects, with root causes spanning miscommunication, data issues, infrastructure problems, and unrealistic expectations.

What I am still figuring out

The Wendy’s-vs-McDonald’s comparison is clean, but the sample size is small. Two companies is not a pattern. The rollout speed seems to matter: Wendy’s spent a year in one location, McDonald’s scaled to 100. But there could be confounding factors: different vendor partnerships, different menu complexity, different customer bases.

The 67% vendor success rate versus 33% internal build rate is striking, but the causation is unclear. Do vendors succeed because they have better technology, because they bring operational discipline, or because buying a product is a forcing function that prevents scope creep?

The narrow-scope pattern is consistent, but the boundary is not obvious. How narrow is narrow enough? Customer service chatbots succeed. General-purpose assistants fail. Somewhere between is a line, and I do not have a principled way to find it.


Wendy’s and McDonald’s used the same technology for the same use case in the same industry. Wendy’s spent a year building the ability to learn in production. McDonald’s spent two years building the ability to deploy at scale.

Deployment without learning scales the mistakes. Learning without deployment wastes the insight. Wendy’s got the order right.