We might have just gotten our spiciest study of the year after Nature published results suggesting that general-purpose LLMs outperform fine-tuned healthcare models straight out of the box.
It was a battle of the bots. Researchers pitted OpenEvidence (OE) and UpToDate (UTD) Expert AI against three general-purpose frontier models that anyone with a web browser can pull up in two seconds: GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6.
The models were tested across three domains:
- medical knowledge (MedQA)
- expert clinician alignment (HealthBench)
- 100 real physician queries (RCQ) scored by 12 blinded clinicians
It was a clean sweep. The general-purpose LLMs outperformed the specialized models on all three evals, and by a healthy margin. This chart gets the point across.
- On MedQA, Gemini led the pack with 97.4% accuracy (vs. 89.6% for OE and 88.4% for UTD). Fun fact: the frontier models are trained on these exact questions (and answers).
- On HealthBench, GPT-5.2 dominated with a score of 88%. It’s almost like OpenAI invented the benchmark.
- The RCQs were probably the most clinically meaningful component, and all three frontier models took the podium here as well. It was a bit odd that the researchers didn’t share the exact questions, and OE thought so too.
OpenEvidence hit back hard and fast. It quickly took to socials to let the world know that the study was not only poorly designed and biased, but that the authors had reached out for API access to build a competing product. Request denied.
- Beyond pointing out the data contamination issue with MedQA, OE also critiqued HealthBench for scoring responses based on subjective stylistic choices. It gave an example where OE scored 20% “worse” because it didn’t use a specific email header.
- The cherry on top was OE revealing that the real-world clinician queries were only added after peer reviewers flagged the study for having weak evidence. Big if true.
Obligatory disclaimer: the models were tested back in February, and the performance gap is most likely even wider today.
- That said, OE and UTD didn’t become this successful by being better AI developers than OpenAI and Anthropic. They did it by curating sources for verifiable evidence, wrapping them in an interface that docs love, and earning the trust of clinicians.
The Takeaway
Frontier models might eventually eat the world, and they’ll probably get pretty dang good at answering clinical questions in the process. In the meantime, benchmarks still aren’t medicine, and this study still isn’t the final word on clinical AI.

