Better reasoning apparently doesn’t prevent LLMs from spewing out false facts.
Independent testing from AI firm Vectara showed that the latest advanced reasoning models from OpenAI and DeepSeek hallucinate even more than previous models.
- OpenAI’s o3 reasoning model scored a 6.8% hallucination rate on Vectara’s test, which asks the AI to summarize various news articles.
- DeepSeek’s R1 fared even worse, with a 14.3% hallucination rate – an especially poor showing considering that the older, non-reasoning DeepSeek-V2.5 clocked in at 2.4%.
- On OpenAI’s more difficult SimpleQA benchmark, o3 and o4-mini hallucinated 51% to 79% of the time, versus just 37% for OpenAI’s non-reasoning GPT-4.5 model.
OpenAI positions o3 as its most powerful model because it’s a “reasoning” model that takes more time to “think” and work out its answers step by step.
- This process produces better answers for many use cases, but reasoning models can also hallucinate at each step of their “thinking,” giving them more chances to slip up along the way (a rough illustration of that compounding effect is below).
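To see why longer chains of reasoning create more opportunities to go wrong, here’s a back-of-the-envelope sketch. The 2% per-step error rate and the step counts are illustrative assumptions, not figures from Vectara or OpenAI:

```python
# Illustrative only (not Vectara's or OpenAI's methodology): if each reasoning
# step independently has a small chance of going wrong, the chance that at
# least one error creeps into the final answer grows quickly with chain length.
per_step_error = 0.02  # assumed 2% chance that any single step introduces an error

for steps in (1, 5, 10, 20):
    chance_of_any_error = 1 - (1 - per_step_error) ** steps
    print(f"{steps:>2} steps -> {chance_of_any_error:.0%} chance of at least one error")
```

Real models don’t fail independently at each step, so treat this as intuition for the “more steps, more chances to slip” point rather than a model of actual hallucination rates.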
The Takeaway
Even though the general-purpose models studied weren’t fine-tuned for healthcare, the results raise concerns about their safety in clinical settings, especially given how many physicians report using them in day-to-day practice.
We’re testing a new format today – let us know if you prefer two shorter Top Stories or one longer Top Story with this quick survey!