Hot on the heels of launching its HealthBench medical AI benchmark, OpenAI just delivered results from the largest-ever study of clinical AI in actual practice – and let’s just say the future’s looking bright.
40,000 visits, 106 clinicians, 15 clinics. OpenAI went big to get real-world data, equipping Kenya-based primary and urgent care provider Penda Health with AI Consult, a GPT-4o-powered clinical decision support tool built into its EHR.
- The study split 106 Penda clinicians into two even groups (half with AI Consult, half without), then tracked outcomes over a three-month period.
When AI Consult detected a potential error in history-taking, diagnosis, or treatment, it triggered a simple Traffic Light alert (sketched below):
- Green – No concerns, no action needed
- Yellow – Moderate concerns, optional clinician review
- Red – Safety-critical concerns, mandatory clinician review
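For a rough sense of how a tiered alert like this can slot into a clinical workflow, here's a minimal sketch. The green/yellow/red tiers mirror the study's description, but the `route_alert` function and the specific EHR actions are illustrative assumptions, not Penda's actual implementation.

```python
# Hypothetical sketch: mapping a copilot's traffic-light alert level to an
# EHR workflow action. Names and actions are assumptions for illustration.
from enum import Enum

class AlertLevel(Enum):
    GREEN = "green"    # no concerns
    YELLOW = "yellow"  # moderate concerns
    RED = "red"        # safety-critical concerns

def route_alert(level: AlertLevel) -> str:
    """Return the workflow action for a given alert level."""
    if level is AlertLevel.RED:
        return "Require clinician review before the visit can be signed off."
    if level is AlertLevel.YELLOW:
        return "Show an optional, dismissible suggestion for review."
    return "Stay silent; no action needed."

# Example: a red alert on a proposed treatment forces a review step.
print(route_alert(AlertLevel.RED))
```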
The results were definitely promising. Clinicians using AI Consult saw a:
- 16% reduction in diagnostic errors
- 13% reduction in treatment errors
- 32% reduction in history-taking errors
The “training effect” is real. The AI Consult group got significantly better at avoiding common mistakes over time, triggering fewer alerts as the study progressed.
- Part of that is because Penda actively supported clinicians along the way with one-on-one training, peer champions, and performance feedback.
- It’s also worth noting that there was no recorded harm as a result of AI Consult suggestions, and 100% of the clinicians using it said that it improved their quality of care.
What’s the catch? While AI Consult led to a clear reduction in clinical errors, there was no statistically significant difference in patient-reported outcomes, and clinicians using the copilot saw slightly longer visit times.
The Takeaway
Clinical AI continues to prove itself outside of multiple-choice licensing exams and clinical vignettes, and OpenAI just gave us our best evidence yet that general-purpose models can reduce errors in actual patient care.