Your Evals Won't Save You
In 2024, 42% of companies abandoned most of their AI initiatives before reaching production. Up from 17% the year before. Gartner predicts 30% of GenAI projects will be abandoned after proof of concept by the end of 2025.
These projects did not fail because their evals were wrong. They failed because evals are not the problem.
The eval-industrial complex
The industry is over-invested in pre-production evaluation. Evals are demo-able (you can show a chart going up), fundable (VCs understand “we improved accuracy by 15%”), and feel like engineering. They are a proxy for progress when you do not know if the product works yet.
But Gartner attributes the failures to “poor data quality, inadequate risk controls, escalating costs, unclear business value.” Not “our benchmark scores were too low.”
The eval-industrial complex exists because evals are easy to build and hard to question. The actual hard problems are messy and cross-functional. Evals are contained and measurable. So we measure what is easy to measure and hope it correlates with what matters. The same trap repeats in production: dashboards show green while agents burn money in conversations with themselves.
The journey most teams take
It starts with benchmarks. An LLM scores 87% on HumanEval. The team ships. In production, accuracy drops to 30% on real codebases with cross-file dependencies. Worse: benchmark contamination is rampant. Models have seen the test data. You were measuring memorization, not capability.
So the team builds custom evals. Reference-based metrics like ROUGE and BERTScore. But Eugene Yan found these do not discriminate well enough to set production thresholds. The similarity distributions of positive and negative instances are too close. Generated outputs often surpass reference quality anyway.
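To make the threshold problem concrete, here is a rough way to check whether a similarity metric actually separates good outputs from bad ones. This is a sketch: difflib's SequenceMatcher stands in for ROUGE or BERTScore, and the labeled examples are invented.

```python
from difflib import SequenceMatcher

def similarity(reference: str, candidate: str) -> float:
    # Lexical overlap, 0.0 to 1.0 -- a crude stand-in for a reference-based metric.
    return SequenceMatcher(None, reference, candidate).ratio()

# (judged good?, reference, candidate) -- hypothetical human labels
labeled = [
    (True,  "Refund issued to the original payment method.",
            "Your refund has been sent to the card you paid with."),
    (False, "Refund issued to the original payment method.",
            "Refunds are issued to the original payment method, per policy."),
]

good = [similarity(ref, cand) for ok, ref, cand in labeled if ok]
bad = [similarity(ref, cand) for ok, ref, cand in labeled if not ok]

# A usable threshold exists only if every good score sits above every bad one.
print(f"good min={min(good):.2f}, bad max={max(bad):.2f}")
print("separable:", min(good) > max(bad))
```

In this invented example the failure parrots the reference and scores higher than the good paraphrase, which is exactly why a threshold on the metric will not hold.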
So the team builds an eval framework. Infrastructure, pipelines, dashboards. But Hamel Husain advises starting simpler: “Spend 30 minutes manually reviewing 20-50 LLM outputs.” If you are building infrastructure before you have looked at your data, you are optimizing for the wrong thing.
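If you want to start that way, the tooling can be as small as this: sample logged outputs into a spreadsheet and read them. A sketch, assuming a traces.jsonl log with "input" and "output" fields.

```python
import csv
import json
import random

# Assumed log format: one JSON object per line with "input" and "output" fields.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

sample = random.sample(traces, k=min(30, len(traces)))

# Blank "pass" and "notes" columns to fill in by hand.
with open("review_me.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "pass", "notes"])
    writer.writeheader()
    for t in sample:
        writer.writerow({"input": t["input"], "output": t["output"], "pass": "", "notes": ""})
```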
The pattern: teams keep investing in better pre-production measurement. The gap is not measurement. The gap is what happens after you ship.
What actually matters
Structured outputs. OpenAI’s schema enforcement went from 40% compliance with prompting alone to 100% with strict mode. This is not glamorous, but it is where production systems actually break.
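For reference, here is roughly what strict mode looks like through the OpenAI SDK. The schema and prompt are invented, and details may differ by SDK version.

```python
from openai import OpenAI

client = OpenAI()

# Strict mode requires every property in "required" and "additionalProperties": false;
# in exchange, the output is guaranteed to match the schema.
schema = {
    "name": "support_ticket",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "other"]},
            "summary": {"type": "string"},
        },
        "required": ["category", "summary"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Classify this ticket: 'I was charged twice.'"}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(resp.choices[0].message.content)  # valid JSON matching the schema
```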
Synthetic data for bootstrapping. It solves the cold start problem: no production traffic yet, no labeled dataset. You can generate thousands of test cases in minutes. Once you understand your domain, curated test cases beat synthetic volume.
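A sketch of the bootstrap step, assuming an OpenAI-style client. The prompt, model, and output shape are illustrative, not a recipe.

```python
import json
from openai import OpenAI

client = OpenAI()

def synth_cases(domain: str, n: int = 20) -> list[dict]:
    # Ask a model to invent plausible inputs plus what a good answer should do.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n} realistic user requests for a {domain} assistant, "
                "including messy edge cases. Return JSON: "
                '{"cases": [{"input": ..., "expected_behavior": ...}]}'
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["cases"]

cases = synth_cases("invoice-processing")  # a starting eval set, not a final one
```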
Binary labels. Both Yan and Husain recommend pass/fail over Likert scales. The difference between a “3” and a “4” is subjective. Pass/fail forces clearer thinking.
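A minimal pass/fail judge might look like this. The criteria, prompt, and model are assumptions; the point is the binary return value.

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, criteria: str) -> bool:
    # Force a binary verdict; the one-sentence reason is kept for error analysis.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Criteria: {criteria}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
                "Does the answer meet the criteria? Reply PASS or FAIL on the first "
                "line, then one sentence explaining why."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```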
Production feedback loops. This is where the real gap is. Not better evals. Better learning.
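A feedback loop can start as two functions: log every interaction with whatever signal you have, and harvest the failures into your eval set. The field names and signal values here are assumptions.

```python
import json
import time

def log_interaction(path: str, user_input: str, output: str, signal: str | None) -> None:
    # Signals can be cheap: thumbs, user edits, retries, abandoned sessions.
    record = {"ts": time.time(), "input": user_input, "output": output, "signal": signal}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def harvest_failures(path: str) -> list[dict]:
    # Yesterday's failures become tomorrow's eval cases.
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if r["signal"] in ("thumbs_down", "user_edited")]
```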
The system is the eval
Adaptive clinical trials do not test all treatments upfront, pick a winner, then ship. They use multi-armed bandits to allocate more patients to treatments that are working. You learn as you treat. The FDA approved this approach because it is more ethical and more effective than static trial design.
The same principle applies to LLM systems. The question is not “did this pass our evals?” The question is “can this system learn from production?”
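A sketch of what that allocation looks like for an LLM system: Thompson sampling over two prompt variants, assuming a binary reward such as a thumbs up. The variant names and the reward source are made up.

```python
import random

# Beta(successes + 1, failures + 1) for each variant, starting from a flat prior.
variants = {"prompt_a": [1, 1], "prompt_b": [1, 1]}

def pick_variant() -> str:
    # Sample a plausible success rate per variant and serve the best draw.
    draws = {name: random.betavariate(a, b) for name, (a, b) in variants.items()}
    return max(draws, key=draws.get)

def record_outcome(name: str, success: bool) -> None:
    # Traffic shifts toward whatever is actually working in production.
    a, b = variants[name]
    variants[name] = [a + int(success), b + int(not success)]
```

Over time the better variant absorbs most of the traffic, while the weaker one still gets enough exploration to surface a change in conditions.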
Netflix does not pre-eval all possible recommendations. The system learns from what you watch. The eval is the production behavior.
| Pre-production evals | Production learning |
|---|---|
| Static measurement | Dynamic adaptation |
| Certainty before shipping | Learning from traffic |
| Investment: better evals | Investment: better feedback loops |
The teams that win will not be the ones with the best pre-production evals. They will be the ones whose systems learn fastest from real users.
What I am still figuring out
Whether the adaptive clinical trial analogy holds for LLM systems. Clinical trials have well-defined outcomes (patient survives, tumor shrinks). LLM quality is multidimensional and context-dependent. The multi-armed bandit framing assumes you can define a reward signal. For many LLM applications, the reward signal is itself the hard problem. This is not a minor caveat. If you cannot define what “better” means in production, production learning does not work either. The recommendation to invest in feedback loops over evals assumes the feedback loop has something to measure. When it does not, pre-production evals may be the only option, even if they are insufficient.
Invest more in evaluatable systems than in evaluations. A system that learns from feedback will outperform a system that passed all your evals but cannot adapt.