OpenAI is officially setting its sights on healthcare with the launch of HealthBench, a new benchmark for evaluating AI performance in realistic medical scenarios.
HealthBench marks the first time the ChatGPT developer has taken a direct step into the industry without a partner to hold its hand.
- Developed with 262 physicians from 60 countries, HealthBench includes 5,000 simulated health conversations, each with a custom rubric to grade the responses (a rough sketch of how that grading can work follows this list).
- The conversations “were created to be realistic and similar to real-world use of LLMs,” meaning they’re multi-turn and multilingual, while spanning a range of medical specialties and themes like handling uncertainty or global health.
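The article doesn't go into the grading mechanics, but rubric-based scoring is easy to picture. Here's a minimal sketch, assuming each criterion carries a point value (negative for undesirable behaviors), a grader (in practice, an LLM judge) marks each criterion met or unmet, and the final score is earned points divided by the maximum achievable points. The criteria and point values below are invented for illustration, not taken from HealthBench.

```python
# Minimal sketch of rubric-based grading of a single conversation's response.
# Assumptions (not from the article): positive points reward desirable behaviors,
# negative points penalize undesirable ones, and the score is clipped to [0, 1].

from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int   # positive = desirable, negative = undesirable
    met: bool     # would normally come from an LLM grader, hard-coded here

def score_response(rubric: list[Criterion]) -> float:
    max_points = sum(c.points for c in rubric if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in rubric if c.met)
    # Clip so heavy penalties can't push the score below zero
    return max(0.0, min(1.0, earned / max_points))

# Example: a hypothetical rubric for a chest-pain triage conversation
rubric = [
    Criterion("Advises seeking emergency care for red-flag symptoms", 5, met=True),
    Criterion("Asks about symptom onset and duration", 3, met=False),
    Criterion("States a specific diagnosis without enough information", -4, met=False),
]
print(f"Rubric score: {score_response(rubric):.2f}")  # 5 / 8 ≈ 0.62
```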
Here’s how current frontier models stacked up in the HealthBench test.
- OpenAI’s o3 was the best-performing model with a score of 60%.
- xAI’s Grok 3 ranked second with a score of 54%.
- Google’s Gemini 2.5 Pro followed close behind at 52%.
All three leading models outperformed physicians who weren’t equipped with AI, and while physicians were able to improve on older models’ responses when given access to them, they couldn’t improve on the newest models’ answers.
- The paper also reviewed other LLMs like Llama and Claude, but unsurprisingly none of them scored higher than OpenAI’s model on OpenAI’s own test.
Even the best models came up short in a few common places, a.k.a. the areas developers should focus on to improve performance.
- Current AI models would rather hallucinate than withhold an answer they aren’t confident in, which is obviously not a good trait to bring into a clinical setting.
- None of the leading LLMs were great at asking for additional context or more information when the input was vague.
- When AI misses, it misses badly, as seen in the sharp quality drop-off among the worst 10% of responses.
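For the curious, here's one simple way to put a number on that drop-off, assuming you already have per-conversation scores in hand: compare the overall average to the average of the bottom 10% of scores. The scores below are made up purely for illustration; the benchmark's own worst-case analysis is more involved.

```python
# Rough sketch: how much worse is the bottom decile than the overall average?

def bottom_decile_average(scores: list[float]) -> float:
    ranked = sorted(scores)
    k = max(1, len(ranked) // 10)   # size of the worst 10%
    return sum(ranked[:k]) / k

# Hypothetical per-conversation scores, not real HealthBench numbers
scores = [0.72, 0.65, 0.80, 0.10, 0.58, 0.69, 0.05, 0.61, 0.75, 0.66]
print(f"Overall average:   {sum(scores) / len(scores):.2f}")
print(f"Worst-10% average: {bottom_decile_average(scores):.2f}")
```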
The Takeaway
Outside of giving us yet another datapoint that AI is catching up to human physicians, HealthBench provides one of the best standardized ways to compare model performance in (simulated) clinical practice, and that’s just what the innovation doctor ordered.