AI Benchmarks Are Measuring the Wrong Thing
A radiology AI scored better than human experts on diagnostic accuracy. It also slowed down the hospital. Both facts are true. That contradiction is the problem.
FDA-approved models demonstrated they could read medical scans faster and more accurately than radiologists in controlled benchmarks. Then those same models hit real hospital workflows and introduced delays, because hospital-specific reporting standards and regulatory requirements don't exist in benchmark environments.
What Benchmarks Actually Test
Current AI benchmarks evaluate models in isolation. One model, one task, one score. Real deployments involve multi-person teams, organizational workflows, and the kind of coordination overhead that no leaderboard captures.
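Here is a minimal sketch of what "one model, one task, one score" looks like in code. The model, the cases, and the metric are all illustrative stand-ins, not taken from any real benchmark harness.

```python
from typing import Callable, Sequence

def benchmark_score(model: Callable[[str], str],
                    cases: Sequence[tuple[str, str]]) -> float:
    """Classic isolated evaluation: one model, one task, one number."""
    correct = sum(model(x) == y for x, y in cases)
    return correct / len(cases)

def toy_model(scan_text: str) -> str:
    # Stand-in for a real diagnostic model; purely illustrative.
    return "abnormal" if "mass" in scan_text else "normal"

toy_cases = [("mass in left lobe", "abnormal"), ("clear scan", "normal")]
print(benchmark_score(toy_model, toy_cases))  # 1.0 -- and that's all a leaderboard sees
# Not captured: hand-off time to the care team, report-format rework,
# sign-off loops required by hospital policy, or how the score holds up over months.
```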
The gap between benchmark performance and real-world outcomes isn't a bug in specific deployments. It's structural.
Four Years of Watching It Play Out
Angela Aristidou, a researcher affiliated with UCL, Stanford Digital Economy Lab, and Stanford HAI, has been studying real-world AI deployment since 2022. Her work spans organizations in the UK, US, and Asia across health, humanitarian, nonprofit, and higher-education sectors.
One UK hospital system ran an evaluation from 2021 to 2024. Early focus: diagnostic accuracy. By the end, the evaluation had expanded to measure coordination quality and deliberation quality across teams using AI versus teams that weren't. The definition of "doing well" changed as the deployment matured.
A separate 18-month evaluation in the humanitarian sector tracked something called error detectability: how easily human teams could identify and correct AI mistakes within real workflows. Not whether the AI was right. Whether humans could tell when it was wrong.
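The article doesn't give a formula for error detectability, but one plausible way to operationalize it is the share of AI errors that the human team actually catches within the workflow. The sketch below is that assumption, not the study's published metric; the field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Case:
    ai_correct: bool        # was the AI output actually right?
    flagged_by_team: bool   # did a human in the workflow flag it for correction?

def error_detectability(cases: list[Case]) -> float:
    """Hypothetical metric: fraction of AI errors the human team caught."""
    errors = [c for c in cases if not c.ai_correct]
    if not errors:
        return 1.0  # nothing to detect
    caught = sum(c.flagged_by_team for c in errors)
    return caught / len(errors)

# A model can be highly accurate and still score poorly here if its rare
# mistakes are exactly the kind reviewers tend to wave through.
sample = [Case(True, False), Case(False, True), Case(False, False)]
print(error_detectability(sample))  # 0.5
```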
That's a meaningfully different question.
The Proposed Fix
Aristidou's proposed framework, HAIC (Human-AI, Context-Specific Evaluation), shifts benchmarking along four axes:
- Individual performance to team performance
- One-off testing to long-term testing
- Correctness and speed to organizational outcomes
- Isolated outputs to system-level effects
None of these shifts are small. Longitudinal team-level evaluation costs more, takes longer, and produces results that are harder to compare across systems. Those are reasons why it hasn't been the standard, not reasons why the current standard is adequate.
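One way to picture the four shifts is in what a single evaluation record would have to hold. The dataclasses below are a hypothetical sketch along those axes, not an artifact of Aristidou's framework; every field name is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class IsolatedBenchmarkResult:
    """What leaderboards record today."""
    model_id: str
    task: str
    score: float  # correctness/speed at a single point in time

@dataclass
class TeamDeploymentEvaluation:
    """Hypothetical record shaped by the four HAIC-style shifts."""
    model_id: str
    team_id: str                    # individual -> team performance
    observation_window_months: int  # one-off -> long-term testing
    org_outcomes: dict[str, float] = field(default_factory=dict)
    #   e.g. {"report_turnaround_hours": 6.5, "rework_rate": 0.08}
    system_effects: dict[str, float] = field(default_factory=dict)
    #   e.g. {"coordination_quality": 0.7, "error_detectability": 0.62}
```

Records like the second one are slower to collect and harder to compare across systems, which is exactly the cost described above.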
Why This Matters Now
Benchmark scores drive procurement decisions, regulatory approvals, and deployment confidence. If those scores measure something other than what actually happens in production, the gap between claimed capability and real-world outcome will keep widening.
The hospital that saw delays didn't get a bad model. It got a model that was evaluated on the wrong question.
Source: MIT Technology Review