OpenAI Delivers Largest-Ever Study of Clinical AI

Hot on the heels of launching its HealthBench medical AI benchmark, OpenAI just delivered results from the largest-ever study of clinical AI in actual practice – and let’s just say the future’s looking bright.

40,000 visits, 106 clinicians, 15 clinics. OpenAI went big to get real-world data, equipping Kenya-based primary and urgent care provider Penda Health with AI Consult (GPT-4o) clinical decision support within its EHR.

  • The study split 106 Penda clinicians into two even groups (half with AI Consult, half without), then tracked outcomes over a three-month period.

When AI Consult detected a potential error in history, diagnosis, or treatment, it triggered a simple Traffic Light alert.

  • Green – No concerns, no action needed
  • Yellow – Moderate concerns, optional clinician review 
  • Red – Safety-critical concerns, mandatory clinician review
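For readers who think in code, the three alert tiers above can be sketched in a few lines of Python. This is only an illustration of the tiering described here, not Penda's or OpenAI's implementation: the severity labels and the `classify` and `review_required` helpers are hypothetical, and the article doesn't say how AI Consult actually scores a concern.

```python
from enum import Enum


class Alert(Enum):
    GREEN = "green"    # no concerns, no action needed
    YELLOW = "yellow"  # moderate concerns, optional clinician review
    RED = "red"        # safety-critical concerns, mandatory clinician review


# Hypothetical severity labels, assumed for illustration only.
SEVERITY_TO_ALERT = {
    "none": Alert.GREEN,
    "moderate": Alert.YELLOW,
    "safety_critical": Alert.RED,
}


def classify(severity: str) -> Alert:
    """Map an assumed severity label to a traffic-light alert."""
    return SEVERITY_TO_ALERT[severity]


def review_required(alert: Alert) -> bool:
    """Only a red alert forces a clinician review; yellow stays optional."""
    return alert is Alert.RED
```

Under these assumptions, `review_required(classify("safety_critical"))` is true, while a yellow alert leaves the call to the clinician.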

The results were definitely promising. Clinicians using AI Consult saw a:

  • 16% reduction in diagnostic errors
  • 13% reduction in treatment errors
  • 32% reduction in history-taking errors

The “training effect” is real. The AI Consult group got significantly better at avoiding common mistakes over time, triggering fewer alerts as the study progressed.

  • Part of that is because Penda took several steps to help along the way, including one-on-one training, peer champions, and performance feedback.
  • It’s also worth noting that there was no recorded harm as a result of AI Consult suggestions, and 100% of the clinicians using it said that it improved their quality of care.

What’s the catch? While AI Consult led to a clear reduction in clinical errors, there was no statistically significant difference in patient-reported outcomes, and clinicians using the copilot saw slightly longer visit times.

The Takeaway

Clinical AI continues to prove itself outside of multiple-choice licensing exams and clinical vignettes, and OpenAI just gave us our best evidence yet that general-purpose models can reduce errors in actual patient care.
