Artificial Intelligence

More Reasoning, More Hallucinations for LLMs

Better reasoning apparently doesn’t prevent LLMs from spewing out false facts.  

Independent testing from AI firm Vectara showed that the latest advanced reasoning models from OpenAI and DeepSeek hallucinate even more than previous models.

  • OpenAI’s o3 reasoning model scored a 6.8% hallucination rate on Vectara’s test, which asks the AI to summarize various news articles (a simplified version of that scoring is sketched after this list).
  • DeepSeek’s R1 fared even worse with a 14.3% hallucination rate, an especially poor performance considering that its older non-reasoning DeepSeek-V2.5 model clocked in at 2.4%.
  • On OpenAI’s more difficult SimpleQA test, o3 and o4-mini hallucinated between 51% and 79% of the time, versus just 37% for its non-reasoning GPT-4.5 model.
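
For context on what a number like 6.8% means, here’s a minimal sketch of how such a rate could be computed once each summary has been labeled as faithful (or not) to its source article. The labeling is the hard part (Vectara uses its own evaluation model for that step, and its exact pipeline may differ), so the function and numbers below are illustrative only.

```python
# Minimal sketch: a hallucination rate is just the share of generated
# summaries judged to contain claims the source article doesn't support.
# Assumes the per-summary judgments have already been made.

def hallucination_rate(is_hallucinated: list[bool]) -> float:
    """is_hallucinated[i] is True if summary i contains unsupported claims."""
    if not is_hallucinated:
        raise ValueError("need at least one labeled summary")
    return sum(is_hallucinated) / len(is_hallucinated)

# Hypothetical example: 7 flagged summaries out of 100 evaluated.
print(f"{hallucination_rate([True] * 7 + [False] * 93):.1%}")  # 7.0%
```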

OpenAI positions o3 as its most powerful model because it’s a “reasoning” model that takes more time to “think” and work out its answers step-by-step.

  • This process produces better answers for many use cases, but these reasoning models can also hallucinate at each step of their “thinking,” giving them even more chances to produce incorrect responses (a quick illustration follows below).
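
To see why more “thinking” can mean more hallucinations overall, here’s a toy illustration under a simplifying assumption: each reasoning step independently has a small chance of introducing an unsupported claim, so the chance that at least one step slips grows with the length of the chain. The numbers are purely hypothetical and aren’t drawn from the benchmarks above.

```python
# Toy model: if each reasoning step independently introduces an unsupported
# claim with probability p, a k-step chain contains at least one
# hallucination with probability 1 - (1 - p)**k.

def chain_hallucination_prob(p_per_step: float, steps: int) -> float:
    return 1 - (1 - p_per_step) ** steps

for steps in (1, 5, 20):
    print(f"{steps:>2} steps -> {chain_hallucination_prob(0.02, steps):.1%}")
# Output: 2.0%, 9.6%, and 33.2% respectively.
```

Real reasoning chains aren’t this simple (steps aren’t independent, and later steps can catch earlier mistakes), but the compounding effect is the basic intuition behind the bullet above.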

The Takeaway

Even though the general-purpose models studied weren’t fine-tuned for healthcare, the results raise concerns about their safety in clinical settings – especially given how many physicians report using them in day-to-day practice.

We’re testing a new format today – let us know if you prefer two shorter Top Stories or one longer Top Story with this quick survey!
