Perplexity spent 164% of their revenue on AWS, Anthropic, and OpenAI bills in 2024. They were not training models. They were running searches. The cost structure made profitability mathematically impossible at their current scale.

OpenAI projected $5 billion in losses for 2024 on $3.7 billion in revenue. The companies selling AI infrastructure are hemorrhaging money faster than the companies buying it.

Token costs dropped 1,000x in three years. Spending increased anyway.

The paradox

This is the Jevons paradox in real time: when a resource becomes more efficient to use, total consumption increases rather than decreases. Steam engines got more efficient. Britain burned more coal, not less.

LLM inference costs dropped dramatically from GPT-3 era pricing to current levels. Per-token costs for frontier models fell by an order of magnitude or more year over year. Companies responded by spending $37 billion on generative AI in 2025, up from $11.5 billion in 2024.

The invoice tripled because the invoice changed form.

Where the cost moved

GPT-4’s training cost was around $100 million. OpenAI spent roughly $2 billion on inference compute in 2024, twenty times the original training cost. Training happens once. Inference costs compound with every query, forever.

But inference is not where organizations are bleeding. The real multipliers are invisible:

A significant fraction of tokens in production LLM deployments are redundant. Anthropic’s prompt caching documentation shows that cached prefixes can reduce costs by up to 90%, implying most of what gets sent is repeated context. System prompts get retransmitted with every call because the API is stateless. Attention compute scales quadratically with context length: doubling the context quadruples the compute behind every forward pass. Organizations routinely discover their API costs are multiples of what they budgeted because inefficient context management compounds with scale.
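The quadratic term is easy to underestimate. A toy sketch of how self-attention compute grows with context length (the head count and dimensions are hypothetical, and real architectures add MLP and KV-cache terms on top):

```python
def attention_flops(seq_len: int, head_dim: int = 128, n_heads: int = 32) -> int:
    # Self-attention builds a seq_len x seq_len score matrix per head;
    # QK^T and the attention-weighted value sum are each
    # on the order of seq_len^2 * head_dim multiply-adds.
    return 2 * n_heads * head_dim * seq_len ** 2

print(attention_flops(8_192) / attention_flops(4_096))   # 4.0:  2x context, 4x compute
print(attention_flops(16_384) / attention_flops(4_096))  # 16.0: 4x context, 16x compute
```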

Meanwhile, data scientists spend 50-80% of their time on data preparation before any model sees the data. Infrastructure consumes roughly half of total AI spend. Compliance and governance are line items that did not exist two years ago.

The token got cheaper. Everything around the token got expensive.

The architectural tax

The decisions made during early pilots create exponential multipliers at scale.

84% of enterprises report AI costs cutting gross margins by more than 6%, with over a quarter seeing hits of 16% or more. Only 48% of AI projects make it into production, and 85% of companies miss their AI cost forecasts by more than 10%. Gartner warned that CIOs who do not understand GenAI cost scaling could make 500%-1,000% errors in budget calculations.

The pattern is consistent: pilot with synthetic data, prove the model works, scale to production, discover the architecture does not. The cost structure that worked for 100 queries per day collapses at 100,000.

Stateless API calls mean system prompts get retransmitted thousands of times per day. No caching layer. No shared context. Every request starts from scratch. A pilot does not notice this. Production does.

RAG pipelines retrieve the same documents repeatedly. A Stanford study found that legal RAG tools hallucinate more than 17% of the time despite vendor claims, and retrieval costs are often modeled in isolation while integration costs are ignored. The vector database is cheap. The data pipeline feeding it is not.

Multi-agent systems use approximately 15x more tokens than single-agent interactions. Agents coordinate by passing context back and forth, and each handoff repeats information the previous agent already processed. The number of agents scales linearly. The token cost scales quadratically.
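That quadratic growth falls out of a simple model. The sketch below uses toy token counts (not the 15x figure, which is an empirical average) and assumes each agent in a linear pipeline re-reads the original context plus every prior agent's output:

```python
def pipeline_tokens(n_agents: int, context: int = 2_000,
                    output_per_agent: int = 500) -> int:
    """Input tokens consumed by a linear agent pipeline where agent i
    reads the original context plus the i previous agents' outputs."""
    return sum(context + i * output_per_agent for i in range(n_agents))

print(pipeline_tokens(1))   # 2000 tokens: one agent, one read of the context
print(pipeline_tokens(10))  # 42500 tokens: ten agents, over 20x the single-agent cost
```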

The three types of waste

Repeated context. System prompts, few-shot examples, and boilerplate instructions get sent with every request. Anthropic’s prompt caching can reduce costs by 90% for these patterns, but only if the architecture is designed to use it. Retrofitting caching into a stateless system requires rewriting the entire call pattern.
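A toy cost model makes the stakes concrete. Every number here is hypothetical: a $3-per-million-token input price, an assumed 90% discount on cache reads, an 8,000-token shared prefix, 200 unique tokens per call:

```python
def monthly_input_cost(calls: int, prefix_tokens: int, unique_tokens: int,
                       price_per_mtok: float = 3.00, cached: bool = False,
                       cache_read_discount: float = 0.10) -> float:
    """Toy model: every call sends a shared prefix (system prompt,
    few-shot examples) plus a small unique suffix. With caching, the
    prefix is billed at a discounted cache-read rate."""
    prefix_rate = price_per_mtok * (cache_read_discount if cached else 1.0)
    prefix_cost = calls * prefix_tokens / 1e6 * prefix_rate
    unique_cost = calls * unique_tokens / 1e6 * price_per_mtok
    return prefix_cost + unique_cost

# 1M calls per month, 8k-token shared prefix, 200 unique tokens per call
without = monthly_input_cost(1_000_000, 8_000, 200)
with_cache = monthly_input_cost(1_000_000, 8_000, 200, cached=True)
print(f"${without:,.0f} -> ${with_cache:,.0f}")  # $24,600 -> $3,000
```

The savings land near 90% precisely because the prefix dominates the payload, which is the pattern the caching documentation describes.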

Irrelevant retrieval. RAG systems retrieve top-K results by similarity. Similarity is not relevance. You pay to process documents that share keywords with the query but do not answer it. The embedding model does not know which chunks matter until after you have already paid to retrieve them.

Redundant reasoning. The LLM re-derives the same conclusions repeatedly because nothing persists between calls. A customer support agent answers the same question fifty times per day. Each time, the model processes the full context, reasons through the answer, and generates the response. Caching the answer would cost nearly nothing. Re-computing it costs 50x.
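A minimal sketch of answer-level caching, using exact string matching as a stand-in for the semantic matching a real system would need (`fake_llm` is a placeholder for the expensive model call):

```python
import hashlib

_answer_cache: dict[str, str] = {}

def normalize(question: str) -> str:
    # Naive normalization; a production system would use embedding
    # similarity rather than exact string matching.
    return " ".join(question.lower().split())

def cached_answer(question: str, generate) -> tuple[str, bool]:
    """Return (answer, cache_hit). `generate` is the expensive LLM call."""
    key = hashlib.sha256(normalize(question).encode()).hexdigest()
    if key in _answer_cache:
        return _answer_cache[key], True
    answer = _answer_cache[key] = generate(question)
    return answer, False

calls = 0
def fake_llm(q: str) -> str:
    global calls
    calls += 1
    return f"answer to: {q}"

for _ in range(50):
    cached_answer("How do I reset my password?", fake_llm)
print(calls)  # 1 -- the other 49 requests hit the cache
```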

What actually works

Prompt caching. Anthropic’s implementation can reduce costs by 90% for repeated context. But the benefit only materializes if your system is architected to reuse prompts. The savings are not automatic. They require deliberate design.

Structured outputs. JSON mode, function calling, and structured generation reduce token waste from retry loops. When the model outputs malformed JSON, you pay to re-generate. Structured outputs eliminate the failure mode. The token savings are small per request but compound over millions of calls.
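A sketch of the failure mode: a parse-or-retry loop where every malformed response is a full generation you paid for and threw away (`model_call` is a stand-in for the LLM):

```python
import json

def call_with_retries(model_call, prompt: str, max_retries: int = 3):
    """Parse-or-retry loop. Returns (parsed_json, wasted_tokens);
    every malformed response is billed, then discarded."""
    wasted_tokens = 0
    for _ in range(max_retries):
        text, tokens = model_call(prompt)
        try:
            return json.loads(text), wasted_tokens
        except json.JSONDecodeError:
            wasted_tokens += tokens  # paid for, thrown away
    raise RuntimeError("model never produced valid JSON")

# Simulated model: first response is malformed, second is valid.
responses = iter([('{"status": incomplete', 400),
                  ('{"status": "ok"}', 400)])
result, wasted = call_with_retries(lambda p: next(responses), "...")
print(result, wasted)  # {'status': 'ok'} 400
```

Structured generation constrains decoding so the malformed branch, and its wasted tokens, never happens.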

Stateful context. Maintaining state between turns eliminates repeated transmission of conversation history. A stateless system sends the full conversation with every message. A stateful system sends only the delta. The difference at scale is not marginal. It is the difference between profitability and burning 164% of revenue.
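The arithmetic is straightforward. With hypothetical numbers (100 turns, 300 tokens per message), resending history grows with the square of conversation length while sending deltas grows linearly:

```python
def stateless_tokens(turns: int, tokens_per_message: int) -> int:
    # Every turn resends the entire conversation so far:
    # turn 1 sends 1 message, turn 2 sends 2, ... turn N sends N.
    return sum(turn * tokens_per_message for turn in range(1, turns + 1))

def stateful_tokens(turns: int, tokens_per_message: int) -> int:
    # The server keeps history; each turn sends only the new message.
    return turns * tokens_per_message

print(stateless_tokens(100, 300))  # 1515000 input tokens
print(stateful_tokens(100, 300))   # 30000 input tokens, ~50x less
```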

Selective retrieval. Hybrid search (BM25 + vector embeddings + reranking) costs more per query but retrieves fewer irrelevant documents. You pay more for retrieval and less for processing. The net is often cheaper because you are not paying the LLM to ignore junk.
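A minimal sketch of the hybrid idea, with term overlap standing in for BM25 and toy two-dimensional embeddings standing in for a real vector index:

```python
import math

def keyword_score(query: str, doc: str) -> float:
    # Stand-in for BM25: fraction of query terms present in the doc.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_rank(query: str, query_vec: list[float], docs, k: int = 2,
                alpha: float = 0.5) -> list[str]:
    """Blend lexical and vector scores, keep only top-k for the LLM.
    `docs` is a list of (text, embedding) pairs; embeddings are toy."""
    scored = [(alpha * keyword_score(query, text)
               + (1 - alpha) * cosine(query_vec, vec), text)
              for text, vec in docs]
    return [text for _, text in sorted(scored, reverse=True)[:k]]

docs = [("reset your password in settings", [1.0, 0.0]),
        ("password policy history", [0.2, 0.9]),
        ("billing faq", [0.0, 1.0])]
# The password-reset doc ranks first; "billing faq" never reaches the LLM.
print(hybrid_rank("how to reset password", [0.9, 0.1], docs))
```

The extra scoring work is cheap; the documents it filters out are tokens the LLM never has to process.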

The real cost

The token tax is not the API bill. The token tax is the architectural decisions you made when tokens were cheap and mistakes were free.

You built a stateless system because stateless is simple. You skipped caching because the pilot worked without it. You designed agents to over-communicate because coordination failures were worse than redundant context. These were rational choices when the cost was $50/month.

They are not rational at $500,000/month.

The companies that survive are not the ones with the best models. They are the ones that designed for cost from the beginning. Netflix, Stripe, and Uber deploy models constantly because their infrastructure was built to make deployment cheap. Shadow deployments, automated rollback, and guardrails are not features. They are cost controls dressed up as reliability engineering.

The invoice arrived in 2025. It showed up as missed margin targets, budget overruns, and AI initiatives that worked technically but failed economically.

What I am still figuring out

Whether reasoning models change the entire cost calculus. The three types of waste I identified (repeated context, irrelevant retrieval, redundant reasoning) assume the expensive tokens are the wasteful ones. Reasoning models like o3 and DeepSeek R1 generate thousands of internal chain-of-thought tokens that are neither repeated, irrelevant, nor redundant. They are novel reasoning, and they are expensive. The architectural solutions I described (caching, stateful context, selective retrieval) do not help with them. If reasoning models become the default, the token tax may shift from an architecture problem to a fundamental cost-of-intelligence problem with a different set of levers.


Tokens got cheap. Systems got expensive.

The cost moved from training to inference, from models to context, from prompts to architecture. The companies bleeding money are not the ones using AI poorly. They are the ones who designed for capability and discovered too late that capability is not the constraint.

The constraint is cost per query at scale, and the architectural decisions you make in the pilot determine whether that cost is sustainable or catastrophic.