AI is getting pretty darn good at patient diagnosis challenges… but don’t bother asking it to show its work.
A new study in npj Digital Medicine pitted GPT-4V against human physicians on 207 image challenges designed to test the reader’s ability to diagnose a patient based on a series of pictures and some basic clinical background info.
- Researchers at the NIH and Weill Cornell Medicine then asked GPT-4V to provide step-by-step reasoning for how it chose each answer.
- Nine physicians then tackled the same questions in both a closed-book (no outside help) and open-book format (could use outside materials and online resources).
How’d they stack up?
- GPT-4V and the physicians both scored high marks for accurate diagnoses (81.6% vs. 77.8%), a difference that wasn't statistically significant.
- GPT-4V bested the physicians on the closed-book test, selecting more correct diagnoses.
- Physicians bounced back to beat GPT-4V on the open-book test, particularly on the most difficult questions.
- GPT-4V also performed well in cases where physicians answered incorrectly, maintaining over 78% accuracy.
Good job, AI, but there's a catch. The rationales GPT-4V provided were riddled with mistakes, even when its final answers were correct, with error rates as high as 27% for image comprehension.
The Takeaway
There could easily come a day when clinical AI surpasses human physicians on the diagnosis front, but that day isn't here quite yet. Real care delivery also doesn't bless physicians with a set of multiple-choice options, and hallucinating the rationale behind a diagnosis doesn't cut it with actual patients.