AI Benchmarks Are Measuring the Wrong Thing
A radiology AI scored better than human experts on diagnostic accuracy. It also slowed down the hospital. Both facts are true. That contradiction is the problem.
FDA-approved models demonstrated they could read medical scans faster and more accurately than radiologists in controlled benchmarks. Then those same models hit real hospital workflows and introduced delays, because hospital-specific reporting standards and regulatory requirements don't exist in benchmark environments.
What Benchmarks Actually Test
Current AI benchmarks evaluate models in isolation. One model, one task, one score. Real deployments involve multi-person teams, organizational workflows, and the kind of coordination overhead that no leaderboard captures.
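Here is a minimal sketch of what "one model, one task, one score" looks like in code. The model, the cases, and the metric are all illustrative stand-ins, not taken from any real benchmark harness.

```python
from typing import Callable, Sequence

def benchmark_score(model: Callable[[str], str],
                    cases: Sequence[tuple[str, str]]) -> float:
    """Classic isolated evaluation: one model, one task, one number."""
    correct = sum(model(x) == y for x, y in cases)
    return correct / len(cases)

def toy_model(scan_text: str) -> str:
    # Stand-in for a real diagnostic model; purely illustrative.
    return "abnormal" if "mass" in scan_text else "normal"

toy_cases = [("mass in left lobe", "abnormal"), ("clear scan", "normal")]
print(benchmark_score(toy_model, toy_cases))  # 1.0 -- and that's all a leaderboard sees
# Not captured: hand-off time to the care team, report-format rework,
# sign-off loops required by hospital policy, or how the score holds up over months.
```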
The gap between benchmark performance and real-world outcomes isn't a bug in specific deployments. It's structural.
Four Years of Watching It Play Out
Angela Aristidou, a researcher affiliated with UCL, Stanford Digital Economy Lab, and Stanford HAI, has been studying real-world AI deployment since 2022. Her work spans organizations in the UK, US, and Asia across health, humanitarian, nonprofit, and higher-education sectors.
One UK hospital system ran an evaluation from 2021 to 2024. Early focus: diagnostic accuracy. By the end, the evaluation had expanded to measure coordination quality and deliberation quality across teams using AI versus teams that weren't. The definition of "doing well" changed as the deployment matured.
A separate 18-month evaluation in the humanitarian sector tracked something called error detectability: how easily human teams could identify and correct AI mistakes within real workflows. Not whether the AI was right. Whether humans could tell when it was wrong.
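The article doesn't give a formula for error detectability, but one plausible way to operationalize it is the share of AI errors that the human team actually catches within the workflow. The sketch below is that assumption, not the study's published metric; the field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Case:
    ai_correct: bool        # was the AI output actually right?
    flagged_by_team: bool   # did a human in the workflow flag it for correction?

def error_detectability(cases: list[Case]) -> float:
    """Hypothetical metric: fraction of AI errors the human team caught."""
    errors = [c for c in cases if not c.ai_correct]
    if not errors:
        return 1.0  # nothing to detect
    caught = sum(c.flagged_by_team for c in errors)
    return caught / len(errors)

# A model can be highly accurate and still score poorly here if its rare
# mistakes are exactly the kind reviewers tend to wave through.
sample = [Case(True, False), Case(False, True), Case(False, False)]
print(error_detectability(sample))  # 0.5
```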
That's a meaningfully different question.
The Proposed Fix
Aristidou's proposed framework, HAIC (Human-AI, Context-Specific Evaluation), shifts benchmarking along four axes:
- Individual performance to team performance
- One-off testing to long-term testing
- Correctness and speed to organizational outcomes
- Isolated outputs to system-level effects
None of these shifts are small. Longitudinal team-level evaluation costs more, takes longer, and produces results that are harder to compare across systems. Those are reasons why it hasn't been the standard, not reasons why the current standard is adequate.
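One way to picture the four shifts is in what a single evaluation record would have to hold. The dataclasses below are a hypothetical sketch along those axes, not an artifact of Aristidou's framework; every field name is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class IsolatedBenchmarkResult:
    """What leaderboards record today."""
    model_id: str
    task: str
    score: float  # correctness/speed at a single point in time

@dataclass
class TeamDeploymentEvaluation:
    """Hypothetical record shaped by the four HAIC-style shifts."""
    model_id: str
    team_id: str                    # individual -> team performance
    observation_window_months: int  # one-off -> long-term testing
    org_outcomes: dict[str, float] = field(default_factory=dict)
    #   e.g. {"report_turnaround_hours": 6.5, "rework_rate": 0.08}
    system_effects: dict[str, float] = field(default_factory=dict)
    #   e.g. {"coordination_quality": 0.7, "error_detectability": 0.62}
```

Records like the second one are slower to collect and harder to compare across systems, which is exactly the cost described above.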
Why This Matters Now
Benchmark scores drive procurement decisions, regulatory approvals, and deployment confidence. If those scores measure something other than what actually happens in production, the gap between claimed capability and real-world outcome will keep widening.
The hospital that saw delays didn't get a bad model. It got a model that was evaluated on the wrong question.
Source: MIT Technology Review