OpenAI Delivers Largest-Ever Study of Clinical AI

Hot on the heels of launching its HealthBench medical AI benchmark, OpenAI just delivered results from the largest-ever study of clinical AI in actual practice – and let’s just say the future’s looking bright.

40,000 visits, 106 clinicians, 15 clinics. OpenAI went big to get real-world data, equipping Kenya-based primary and urgent care provider Penda Health with AI Consult, a GPT-4o-powered clinical decision support tool embedded in its EHR.

  • The study split 106 Penda clinicians into two even groups (half with AI Consult, half without), then tracked outcomes over a three-month period.

When AI Consult detected a potential error in history-taking, diagnosis, or treatment, it triggered a simple Traffic Light alert (a rough sketch of the routing logic follows the list):

  • Green – No concerns, no action needed
  • Yellow – Moderate concerns, optional clinician review 
  • Red – Safety-critical concerns, mandatory clinician review
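
For a sense of how this tiered alerting might look in code, here's a minimal sketch. The three levels mirror the study's scheme, but the function name and routing actions are our own assumptions; Penda's actual EHR integration isn't public.

```python
from enum import Enum

class AlertLevel(Enum):
    GREEN = "green"    # no concerns, no action needed
    YELLOW = "yellow"  # moderate concerns, optional clinician review
    RED = "red"        # safety-critical concerns, mandatory clinician review

def route_alert(level: AlertLevel) -> str:
    """Map an AI Consult-style severity level to a clinician-facing action.

    Hypothetical helper: illustrates the tiered workflow, not Penda's code.
    """
    if level is AlertLevel.RED:
        return "Hold visit sign-off until the clinician reviews the concern"
    if level is AlertLevel.YELLOW:
        return "Show a dismissible prompt suggesting an optional review"
    return "Log silently; no clinician-facing action"

print(route_alert(AlertLevel.RED))
```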

The results were definitely promising. Clinicians using AI Consult saw a:

  • 16% reduction in diagnostic errors
  • 13% reduction in treatment errors
  • 32% reduction in history-taking errors

The “training effect” is real. The AI Consult group got significantly better at avoiding common mistakes over time, triggering fewer alerts as the study progressed.

  • Part of that is because Penda took several steps to help along the way, including one-on-one training, peer champions, and performance feedback.
  • It’s also worth noting that there was no recorded harm as a result of AI Consult suggestions, and 100% of the clinicians using it said that it improved their quality of care.

What’s the catch? While AI Consult led to a clear reduction in clinical errors, there was no statistically significant difference in patient-reported outcomes, and clinicians using the copilot saw slightly longer visit times.

The Takeaway

Clinical AI continues to prove itself outside of multiple-choice licensing exams and clinical vignettes, and OpenAI just gave us our best evidence yet that general-purpose models can reduce errors in actual patient care.

OpenAI Dives Into Healthcare With HealthBench

OpenAI is officially setting its sights on healthcare with the launch of HealthBench, a new benchmark for evaluating AI performance in realistic medical scenarios.

HealthBench marks the first time the ChatGPT developer has taken a direct step into the industry without a partner to hold its hand.

  • Developed with 262 physicians from 60 countries, HealthBench includes 5,000 simulated health conversations, each with a custom physician-written rubric to grade the responses (a rough scoring sketch follows this list).
  • The conversations “were created to be realistic and similar to real-world use of LLMs,” meaning they’re multi-turn and multilingual, while spanning a range of medical specialties and themes like handling uncertainty or global health.
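
Under the hood, each rubric is a set of physician-written criteria with point values (including negative points for harmful behaviors), and a model-based grader marks each criterion as met or unmet; a response's score is its earned points over the maximum achievable. Here's a minimal sketch of that scoring, with criteria and point values invented for illustration; the exact HealthBench implementation may differ.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str
    points: int   # positive for desired behavior, negative for harmful behavior
    met: bool     # whether the grader judged the response to satisfy it

def score_response(criteria: list[RubricCriterion]) -> float:
    """Rubric scoring in the spirit of HealthBench: earned points over
    the maximum possible positive points, floored at zero."""
    earned = sum(c.points for c in criteria if c.met)
    max_possible = sum(c.points for c in criteria if c.points > 0)
    if max_possible == 0:
        return 0.0
    return max(0.0, earned / max_possible)

# Illustrative example: two criteria met, one missed, one penalty avoided
example = [
    RubricCriterion("Asks about symptom duration", 5, True),
    RubricCriterion("Recommends an appropriate next step", 5, True),
    RubricCriterion("Flags red-flag symptoms requiring urgent care", 3, False),
    RubricCriterion("States a confident but wrong dosage", -8, False),
]
print(score_response(example))  # 10 / 13 ≈ 0.77
```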

Here’s how current frontier models stacked up in the HealthBench test.

  • OpenAI’s o3 was the best performing model with a score of 60%
  • xAI’s Grok 3 ranked second with a score of 54%
  • Google’s Gemini 2.5 Pro followed close behind at 52%

All three leading models outperformed physicians who weren't equipped with AI, and while physicians were able to improve on responses from older models when given access to them, they couldn't improve on the newest models' answers.

  • The paper also reviewed other LLMs like Llama and Claude, but unsurprisingly none of them scored higher than OpenAI’s model on OpenAI’s own test.

Even the best models came up short in a few common places, AKA areas that developers should focus on to improve performance.

  • Current AI models would rather hallucinate than withhold an answer they aren't confident in, which is obviously not a good trait to bring into a clinical setting.
  • None of the leading LLMs were great at asking for additional context or more information when the input was vague.
  • When AI misses, it misses badly, as seen in the sharp quality drop-off among the worst 10% of responses.

The Takeaway

Outside of giving us yet another datapoint that AI is catching up to human physicians, HealthBench provides one of the best standardized ways to compare model performance in (simulated) clinical practice, and that’s just what the innovation doctor ordered.
