OpenAI Dives Into Healthcare With HealthBench

OpenAI is officially setting its sights on healthcare with the launch of HealthBench, a new benchmark for evaluating AI performance in realistic medical scenarios.

HealthBench marks the first time the ChatGPT developer has taken a direct step into the industry without a partner to hold its hand.

  • Developed with 262 physicians from 60 countries, HealthBench includes 5,000 simulated health conversations, each with a custom rubric to grade the responses (one is sketched below).
  • The conversations “were created to be realistic and similar to real-world use of LLMs,” meaning they’re multi-turn and multilingual, while spanning a range of medical specialties and themes like handling uncertainty or global health.

Here’s how current frontier models stacked up in the HealthBench test.

  • OpenAI’s o3 was the best performing model with a score of 60%
  • xAI’s Grok 3 ranked second with a score of 54%
  • Google’s Gemini 2.5 Pro followed close behind at 52%

All three leading models outperformed physicians who weren’t equipped with AI, although physicians outperformed the newer models when they had access to the AI output.

  • The paper also reviewed other LLMs like Llama and Claude, but unsurprisingly none of them scored higher than OpenAI’s model on OpenAI’s own test.

Even the best models came up short in a few common places, AKA areas that developers should focus on to improve performance.

  • Current AI models would rather hallucinate than withhold an answer they aren’t confident in, obviously not a good trait to bring into a clinical setting.
  • None of the leading LLMs were great at asking for additional context or more information when the input was vague.
  • When AI misses, it misses badly, as seen in the sharp quality drop-off in the worst 10% of responses.

The Takeaway

Outside of giving us yet another datapoint that AI is catching up to human physicians, HealthBench provides one of the best standardized ways to compare model performance in (simulated) clinical practice, and that’s just what the innovation doctor ordered.

More Reasoning, More Hallucinations for LLMs

Better reasoning apparently doesn’t prevent LLMs from spewing out false facts.  

Independent testing from AI firm Vectara showed that the latest advanced reasoning models from OpenAI and DeepSeek hallucinate even more than previous models.

  • OpenAI’s o3 reasoning model scored a 6.8% hallucination rate on Vectara’s test, which asks the AI to summarize various news articles.
  • DeepSeek’s R1 fared even worse with a 14.3% hallucination rate, an especially poor performance considering that its older non-reasoning DeepSeek-V2.5 model clocked in at 2.4%.
  • On OpenAI’s more difficult SimpleQA test, o3 and o4-mini hallucinated between 51% and 79% of the time, versus just 37% for its non-reasoning GPT-4.5 model.

OpenAI positions o3 as its most powerful model because it’s a “reasoning” model that takes more time to “think” and work out its answers step-by-step.

  • This process produces better answers for many use cases, but these reasoning models can also hallucinate at each step of their “thinking,” giving them even more chances for incorrect responses (a compounding effect sketched below).
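A toy model makes the intuition concrete: if each step of a chain independently has even a small chance of introducing an error, the probability that at least one step goes wrong climbs quickly with chain length (the 3% per-step rate below is purely illustrative, not a measured figure):

```python
# Toy model: chance a reasoning chain contains at least one bad step,
# assuming each step errs independently at rate p (3% is illustrative).
def chain_error_prob(p: float, n_steps: int) -> float:
    return 1 - (1 - p) ** n_steps

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps -> {chain_error_prob(0.03, n):.1%} chance of at least one error")
# 1 -> 3.0%, 5 -> 14.1%, 10 -> 26.3%, 20 -> 45.6%
```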

The Takeaway

Even though the general purpose models studied weren’t fine-tuned for healthcare, the results raise concerns about their safety in clinical settings – especially given how many physicians report using them in day-to-day practice.


AI Can Help Doctors Change Their Minds

A recent study out of Stanford explored whether doctors would revise their medical decisions in light of new AI-generated information, finding that docs are more than willing to change their minds despite being just as vulnerable to cognitive biases as the rest of us.

Here’s the setup, as published in Nature Communications Medicine:

  • 50 physicians were randomized to watch a short video of either a white male or black female patient describing their chest pain with an identical script.
  • The physicians made triage, diagnosis, and treatment decisions using any non-AI resource.
  • The physicians were then given access to GPT-4 (which they were told was an AI system that had not yet been validated) and allowed to change their decisions.

The initial scores left plenty of room for improvement.

  • The docs achieved just 47% accuracy in the white male patient group.
  • The docs achieved a slightly better 63% accuracy in the black female patient group.

The physicians were surprisingly willing to change their minds based on the AI advice.

  • Accuracy scores with AI improved from 47% to 65% in the white male group.
  • Accuracy scores with AI improved from 63% to 80% in the black female group.

Not only were the physicians open to modifying their decisions with AI input, but doing so made them more accurate without introducing or exacerbating demographic biases.

  • Both groups showed nearly identical improvements (18 and 17 percentage points), suggesting that AI can augment physician decision-making while maintaining equitable care.
  • It’s worth noting that the docs used AI as more than a search engine, asking it to bring in new evidence, compare treatments, and even challenge their own beliefs.

The Takeaway

Although having the doctors go first means that AI didn’t save them any time in this study – and actually increased time per patient – it showed that flipping the paradigm from “doctors checking AI’s work” to “AI helping doctors check their own work” has the potential to improve clinical decisions without amplifying biases.

The Healthcare AI Adoption Index

Bessemer Venture Partners’ market reports are always some of the best in the business, but its recent Healthcare AI Adoption Index might just be its finest work yet.

The Healthcare AI Adoption Index is based on survey data from 400+ execs across Payors, Providers, and Pharma – breaking down how buyers are approaching GenAI applications, what jobs-to-be-done they’re prioritizing, and where their projects sit on the adoption curve.

Here’s a look at what they found:

  • AI is high on the agenda across the board, with AI budgets outpacing IT spend in each of the three segments. Over half (54%) are seeing ROI within the first 12 months.
  • Only a third of AI pilots end up reaching production, held back by everything from security and data readiness to integration costs and limited in-house expertise.
  • Despite all the trendsetters we cover on a weekly basis, only 15% of active AI projects are being driven by startups. The rest are being built internally or led by the usual suspects like major EHRs and Big Tech.
  • That said, 48% of executives say they prefer working with startups over incumbents, and Bessemer encourages founders to co-develop solutions with their customers and lean in on partnerships that provide access to distribution, proprietary datasets, and credibility.

The highlight of the report was Bessemer’s analysis of the 59 jobs-to-be-done as potential use cases for AI. 

  • Of the 22 jobs-to-be-done for Payors (claims, network, member, pricing), 19 jobs for Pharma (preclinical, clinical, marketing, sales), and 18 jobs for Providers (care delivery, RCM) – 45% are still in the ideation or proof of concept phase.
  • Providers are ahead in POC experimentation, while most Payor and Pharma use cases remain in the ideation phase. Here’s a beautiful look at where different use cases stand.

Bessemer topped off its analysis with the debut of its AI Dx Index, which factors in market size, urgency, and current adoption to help startups map and prioritize AI use cases. One of the best graphics so far this year.

The Takeaway

Healthcare’s AI-powered paradigm shift is kicking into overdrive, and Bessemer just delivered one of the most comprehensive views of where the puck is going that we’ve seen to date.

K Health’s AI Clinical Recommendations Rival Doctors in Real-World Setting

Real-world comparisons of AI recommendations and doctors’ clinical decisions have been few and far between, but a new study in the Annals of Internal Medicine gave us a great look at how performance stacks up with actual patients.

The early verdict? AI came out on top, but that doesn’t mean doctors should pack their bags quite yet.

Researchers from Cedars-Sinai and Tel Aviv University compared recommendations made by K Health’s AI Physician Mode to the final decisions made by physicians for 461 virtual urgent care visits. Here’s what they found:

  • In 68% of cases, the AI and physician recommendations were rated as equal
  • AI was rated better in 21% of cases, versus just 11% for physicians
  • AI recommendations were rated “optimal” in 77% of cases, versus 67% for physicians

Although AI takes the cake with the top line numbers, unpacking the data reveals some not-too-surprising strengths and weaknesses. AI was primarily rated better when physicians:

  • Missed important lab tests (22.8%)
  • Didn’t follow clinical guidelines (16.3%)
  • Failed to refer patients to specialists or the ED if needed (15.2%)
  • Overlooked risk factors and red flags (4.4%)

Physicians beat out AI when the human elements of care delivery came into play, such as adapting to new information or making nuanced decisions. Physicians were rated better when:

  • AI made unnecessary ED referrals (8.0%)
  • There was evolving or inconsistent information during consultations (6.2%)
  • They made necessary referrals that the AI missed (5.9%)
  • They correctly adjusted diagnoses based on visual examinations (4.4%)

While the study focused on the exact types of common conditions that AI excels at diagnosing (respiratory, urinary, vaginal, eye, and dental), it’s still impressive to see the outperformance in the messy trenches of a real clinical setting – a far cry from the static medical exams that have been the go-to for similar evaluations. 

The Takeaway

For AI to truly transform healthcare, it’ll need to do a lot more than automate administrative work and back office operations. This study demonstrates AI’s potential to enhance decision-making in actual medical practice, and points toward a future where delivering high-quality patient care becomes genuinely scalable.

PHTI Delivers Mixed Reviews on Ambient Scribes

The Peterson Health Technology Institute’s latest technology review is here, and it had a decidedly mixed report card for the ambient AI scribes sweeping across the industry. 

PHTI’s total count of ambient scribe vendors stands at over 60, but the bulk of its report focuses on the early experiences and lessons learned from the top 10 scribes across leading health systems.

According to PHTI’s conversations with health system execs, the primary driver of ambient scribe adoption has been addressing clinician burnout – and AI’s promise is clear on that front.

  • Mass General Brigham reported a 40% reduction in burnout during a six-week pilot.
  • MultiCare reported a 63% reduction in burnout and a 64% improvement in work-life balance.
  • Another study from the Permanente Medical Group found that 81% of patients felt their physician spent less time looking at their computer when using an ambient scribe.

Despite these drastic improvements, PHTI concludes that the financial returns and efficiency of ambient scribes remain unclear.

  • On one hand, enhanced documentation quality “could lead to higher reimbursements, potentially offsetting expenses.”
  • On the other hand, the cumulative costs “may be greater than any savings achieved through improved efficiency, reduced administrative burden, or reduced clinician attrition.”

It’s a bold conclusion considering the cost of losing a single provider, let alone the downstream effects of a burned-out workforce.

PHTI’s advice to health systems? Define the outcomes you’re looking for and then measure ambient AI’s performance and financial impacts against those goals. Bit of a no-brainer, but sound advice nonetheless. 

The Takeaway

Ambient scribes are seeing the fastest adoption of any recent healthcare technology that wasn’t accompanied by a regulatory mandate, and that’s mostly because of magic that’s hard to capture in a spreadsheet. That said, health systems will eventually need to justify these solutions beyond their impact on the clinical experience, and PHTI’s report brings a solid framework and standardized methodologies for bridging that gap.

AI Misses the Mark on Detecting Critical Conditions

Most health systems have already begun turning to AI to predict whether patient health conditions will deteriorate, but a new study in Nature Communications Medicine suggests that current models aren’t cut out for the task.

Virginia Tech researchers looked at several popular machine learning models cited in medical literature for predicting patient deterioration, then fed them datasets about the health of patients in ICUs or with cancer.

  • They then created synthesized test cases that altered patient metrics from the initial dataset, checking whether the models’ predicted health issues and risk scores responded the way medical knowledge says they should (see the sketch below).
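The method resembles what software engineers would call perturbation testing: nudge an input toward a clinically dangerous value and verify that the model’s output moves in the right direction. Here’s a bare-bones sketch of the idea, using a synthetic stand-in model and hypothetical features rather than the study’s actual setup:

```python
# Bare-bones perturbation test for a deterioration model. The synthetic
# data, feature names, and fitted model are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # columns: heart_rate_z, spo2_z, lactate_z
y = ((X @ np.array([0.8, -1.2, 1.5]) + rng.normal(size=500)) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def risk(patient: np.ndarray) -> float:
    return model.predict_proba(patient.reshape(1, -1))[0, 1]

baseline = np.zeros(3)          # unremarkable vitals
worsened = baseline.copy()
worsened[1] -= 3.0              # oxygen saturation drops sharply

# Medical knowledge says risk must rise; a model that fails this check
# can't be trusted to flag a deteriorating patient.
assert risk(worsened) > risk(baseline), "model missed an obvious deterioration"
print(f"risk: {risk(baseline):.2f} -> {risk(worsened):.2f}")
```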

AI missed the mark. For in-hospital mortality prediction, the models tested using the synthesized cases failed to recognize a staggering 66% of relevant patient injuries.

  • In some instances, the models failed to generate adequate mortality risk scores for every single test case.
  • That’s clearly not great news, especially considering that algorithms that can’t recognize critical patient conditions can’t alert doctors when urgent action is needed.

The study authors point out that it’s extremely important for technology being used in patient care decisions to incorporate medical knowledge, and that “purely data-driven training alone is not sufficient.”

  • Not only did the study unearth “alarming deficiencies” in models being used for in-hospital mortality predictions, but it also turned up similar concerns with models predicting the prognosis of breast and lung cancer over five-year periods.
  • The authors conclude that a significant gap exists between raw data and the complexities of medical reality, so models trained solely on patient data are “grossly insufficient and have many dangerous blind spots.”

The Takeaway

The promise of AI remains just as immense as ever, but studies like this provide constant reminders that we need a diligent approach to adoption – not just for the technology itself but for the lives of the patients it touches. Ensuring that medical knowledge gets incorporated into clinical AI models also seems like a theme that we’re about to start hearing more often.

Stress Testing Ambient AI Scribes

Providers are lining up to see if ambient AI can live up to its promise of decreasing burnout while improving the patient experience… and researchers are starting to wonder the same thing.

A new study in JAMA Network Open investigated whether ambient AI scribes actually decrease clinical note burden, following 46 clinicians at the University of Pennsylvania Health System as they used Nuance’s DAX Copilot AI ambient scribe from July to August 2024.

  • Researchers combined EHR data with a clinician survey to determine both quantitatively and qualitatively whether ambient scribes actually make a positive impact.

Here’s what they found. Over the course of the study, ambient scribe use was associated with:

  • 20.4% less time in notes per appointment (from 10.3 to 8.2 minutes)
  • 9.3% greater same-day appointment closure (from 66.2% to 72.4%)
  • 30.0% less after-hours work time per workday (from 50.6 to 35.4 minutes)
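Worth noting that these are relative changes rather than percentage-point differences, which you can reproduce from the raw figures (the same-day closure gain works out to roughly 9.4% from the rounded inputs):

```python
# The reported percentages are relative changes, reproducible from the raw figures.
def rel_change(before: float, after: float) -> float:
    return (after - before) / before

print(f"time in notes:    {rel_change(10.3, 8.2):+.1%}")   # -20.4%
print(f"same-day closure: {rel_change(66.2, 72.4):+.1%}")  # +9.4% on rounded inputs
print(f"after-hours work: {rel_change(50.6, 35.4):+.1%}")  # -30.0%
```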

It’s tough to argue with the data. Ambient scribing definitely moves the needle on several important metrics, and even the less clear-cut stats still had a positive spin to them.

  • Note length was 20.6% greater with scribing (from 203k to 244k characters/wk)
  • However, the percentage of documentation that was typed by clinicians was 29.6% lower compared to baseline (from 11.2% to 7.9%)

The qualitative feedback told a different story. Even though clinicians reported feeling more engaged during patient conversations, “the need for substantial editing and proofreading of the AI-generated notes, which sometimes offset the time saved” was a recurring theme in the open-ended comments.

Ambient AI received a net promoter score of 0 on a scale of -100 to 100, meaning the clinicians were exactly as likely to recommend it as not.

  • 13 clinicians would recommend ambient AI to others, 13 wouldn’t recommend it, and 11 didn’t feel strongly either way.
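For reference, NPS is simply the share of promoters minus the share of detractors (passives only pad the denominator), which is how a 13–13–11 split nets out to exactly zero:

```python
# Net promoter score: % promoters minus % detractors, on a -100..100 scale.
def nps(promoters: int, passives: int, detractors: int) -> float:
    total = promoters + passives + detractors
    return 100 * (promoters - detractors) / total

print(nps(13, 11, 13))  # 0.0
```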

The mixed reviews could mean that the ambient scribe performed better or worse for different users, but it could also mean that some clinicians were more diligent about checking the output.

The Takeaway

The evidence in favor of ambient AI scribes continues to pile up – even if the pajama-time reductions in this study didn’t live up to the promise on the box. Big technology shifts also come with adjustment periods, and this invited commentary did a great job highlighting the “real risk of automation bias” that comes with ambient AI, as well as the liability risk of missing its errors.

AI Enthusiasm Heats Up With Doctors

The unstoppable march of AI only seems to be gaining momentum, with an American Medical Association survey noting greater enthusiasm – and less apprehension – among physicians. 

The AMA’s Augmented Intelligence Research survey of 1,183 physicians found that the share whose enthusiasm for health AI outweighs their concerns rose to 35% in 2024, up from 30% in 2023.

  • The lion’s share of doctors recognize AI’s benefits, with 68% reporting at least some advantage in patient care (up from 63% in 2023).
  • In both years, about 40% of doctors were equally excited and concerned about health AI, with almost no change between surveys.

The positive sentiment could stem from more physicians using the tech in practice, with AI use nearly doubling from 38% in 2023 to 66% in 2024.

  • The most common uses now include medical research, clinical documentation, and drafting care plans or discharge summaries.

The dramatic drop in non-users (62% to 33%) over the course of a year is impressive for any new health tech, but doctors in the latest survey called out several needs that have to be addressed for adoption to continue.

  • 88% wanted a designated feedback channel
  • 87% wanted data privacy assurances
  • 84% wanted EHR integration

While physicians are still concerned about AI’s potential to compromise data privacy or offer incorrect recommendations (and the liability risks that come with them), they’re also optimistic about its ability to put a dent in burnout.

  • The biggest area of opportunity for AI according to 57% of physicians was “addressing administrative burden through automation,” reclaiming the top spot it reached in 2023.
  • That said, nearly half of physicians (47%) ranked increased AI oversight as the number one regulatory action needed to increase trust in AI enough to drive further adoption.

The Takeaway

It’s encouraging to see the shifting sentiment around health AI, especially as more doctors embrace its potential to cut down on burnout. Although the survey pinpoints better oversight as the key to maximizing trust, AI innovation is moving so quickly that it wouldn’t be surprising if not-too-distant breakthroughs were magical enough to inspire more confidence on their own.

First Snapshot of AI Oversight at U.S. Hospitals

A beautiful paper in Health Affairs brought us the first snapshot of AI oversight at U.S. hospitals, as well as a glimpse of the blind spots that are already adding up.

Data from 2,425 hospitals that participated in the 2023 AHA Annual Survey shed light on the differences in AI adoption and evaluation capacity at hospitals on both sides of a growing divide.

Two-thirds of hospitals reported using AI predictive models, a figure that’s likely only gone up over the last year. These models were most commonly used to:

  • predict inpatient health trajectories (92%)
  • identify high-risk outpatients (79%)
  • facilitate scheduling (51%)
  • perform a long tail of various administrative tasks

Bias blindness ran rampant. Although 61% of the AI-user hospitals evaluated accuracy using data from their own system (local evaluation), only 44% performed similar evaluations for bias.

  • Those are some concerningly low percentages, considering that models trained on external datasets might not be effective in different settings, and since AI bias is a surefire way to exacerbate health inequities.
  • Hospitals that developed their own models, had high operating margins, and belonged to a health system were all more likely to conduct local evaluations. 
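A local bias evaluation doesn’t have to be exotic; at its simplest, it means re-checking a model’s discrimination on your own patients, broken out by subgroup. Here’s a bare-bones sketch on synthetic data (the groups, scores, and outcomes are all hypothetical):

```python
# Bare-bones local bias check: compare the model's AUC across subgroups
# using the hospital's own data. Everything here is synthetic/hypothetical.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=400),
    "outcome": rng.integers(0, 2, size=400),
})
# Simulate a model whose scores track outcomes well for group A, poorly for B.
noise = np.where(df["group"] == "A", 0.2, 1.5)
df["risk_score"] = df["outcome"] + rng.normal(0, noise)

for name, g in df.groupby("group"):
    print(name, f"AUC = {roc_auc_score(g['outcome'], g['risk_score']):.2f}")
# A large gap between subgroups is exactly the kind of red flag that
# only shows up when a hospital evaluates on its own population.
```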

There’s a digital divide between hospitals with the resources to build models tailored to their own patients and those getting these solutions “off the shelf,” which increases the risk that the models were trained on data from patients who look very different from their own.

  • Only 54% of the AI hospitals designed their own models, while a larger share took the path of least resistance with algorithms supplied by their EHR developer (79%).
  • Combine that with the fact that most hospitals aren’t conducting local evaluations of bias, and there’s a major lack of systematic protection preventing these models from underrepresenting certain patients or adding unfair barriers to care.

The authors conclude that policymakers should “ensure the use of accurate and unbiased AI for patients regardless of where they receive care… including interventions designed to connect underresourced hospitals to evaluative capacity.”

The Takeaway

Without the local evaluation of AI models, there’s a glaring blind spot in the oversight of algorithmic bias, and this study gives compelling evidence that more needs to be done to fill that void.
