OpenAI o1 Outperforms Physicians on Clinical Reasoning Tasks

A landmark study in Science found that OpenAI’s o1 series outperformed human physicians at multiple clinical reasoning tasks, but that doesn’t mean it’s time to hang up the scrubs just yet.

Researchers at Harvard and Beth Israel Deaconess Medical Center designed the study to evaluate whether LLMs are ready to do what physicians do on a daily basis: review messy patient charts and use that data to determine diagnosis and next steps.

They evaluated o1 on clinical cases ranging from patient vignettes to second opinions on 76 real-world ED assessments, which included all the noise and incomplete information that clinicians routinely encounter in the EHR.

The refreshingly well-designed study also included a blinded evaluation that compared o1 against two attending physicians at BIDMC and against GPT-4.

o1 came to play. On clinical vignettes evaluating management reasoning, o1-preview scored a median of 86%. Not too shabby.

It outperformed GPT-4, humans with GPT-4, and humans with conventional resources like UpToDate – all of which scored below 45%.

The ED cases were even more impressive. o1 offered second opinions about the diagnosis at three points along the patient’s ED journey:

At triage, when information in the record was most limited, o1 gave an exact or very close diagnosis in 67% of cases. The two physicians hit 55% and 50%.

o1 still outperformed the physicians when given all the data collected by the end of the ED encounter.

It was only when the physicians were given the most information possible to inform their diagnosis – at the time the patient would have been admitted to the hospital – that the scores finally converged.

The cherry on top? Physician raters couldn’t tell whether the differentials came from o1 or a human. One rater couldn’t tell in 83.6% of cases, the other in 94.4%.

The authors were quick to mention that these results don’t mean AI is ready to replace human physicians. They mean it’s time for rigorous research into how AI can augment care teams, serve as a second opinion, and become a safety layer for clinicians.

The Takeaway

o1 outperforming a couple of internists at triage isn’t quite Deep Blue beating Garry Kasparov at chess, but it’s a step in that direction – especially considering OpenAI’s performance jump in just the last week (let alone since o1 launched in 2024).
