Harvard Study: o1 Outdiagnoses ER Doctors at Triage

May 2, 2026

A study published in Science found that OpenAI's o1 model reached the correct or near-correct diagnosis in 67% of emergency room triage cases. The two human internal medicine attending physicians in the study hit 55% and 50%, respectively.

The Setup

Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center ran 76 real ER cases through OpenAI o1 and GPT-4o, then compared the results against human physicians. Both AI models and the human doctors worked from the same raw electronic medical record data, with no preprocessing or summarization. Whatever the physician had at diagnosis time, the model had.

Arjun Manrai, who leads an AI lab at Harvard Medical School, was the study's lead author.

Where the Gap Was Largest

The performance difference between o1 and human physicians was most pronounced at initial triage. That's the point in the ER workflow where the least patient information is available.

This is counterintuitive if you expected LLMs to struggle with ambiguity. Apparently, when information is thin and pattern-matching matters most, o1 has an edge. Whether that reflects genuine diagnostic reasoning or something more statistical is not clear from the study.

o1 also outperformed GPT-4o across multiple diagnostic touchpoints.

The Limitations They Noted

The study did not claim LLMs are ready for deployment. Current foundation models show more limited performance when reasoning over non-text inputs. Medical diagnosis involves imaging, lab values in context, physical examination findings, and real-time patient interaction. The study worked from text-form EMR data.

The authors specifically called for prospective trials to evaluate LLMs in actual patient care settings. A retrospective benchmark on 76 cases is a signal, not a conclusion.

What to Make of This

The result is meaningful. 67% versus 50-55% on the same case set, using the same raw data, is not a marginal difference. The triage finding is the most interesting part: if AI has the sharpest edge when information is scarcest, that has direct implications for how it might be used in clinical workflows.

The prospective trial question is the one that matters. Retrospective cases don't capture ordering decisions, patient communication, or the downstream cost of a wrong early guess. Those are real-world variables this study was not designed to test.

Source: Techcrunch

Harvard Study: o1 Outdiagnoses ER Doctors at Triage

The Setup

Where the Gap Was Largest

The Limitations They Noted

What to Make of This

Related Articles