PHTI Breaks Down Barriers to Clinical AI

PHTI’s new Clinical AI report delivered exactly what we’ve come to expect from their research: top-tier industry analysis through the lens of actual stakeholders.

They assembled the A Team for this one. The report was built from an in-person workshop that PHTI convened with senior industry leaders – from health systems and health plans to tech firms and federal agencies – to explore what’s needed to safely scale clinical AI.

  • The workshop underscored the policy, reimbursement, and evidence gaps holding back adoption, with several key themes emerging from the discussion around their example use cases (hypertension management and mental health chatbots).

Theme 1: Evidence standards should compare AI to current standards of care and scale with risk.

  • That means comparing AI to the care that patients actually receive today rather than idealized care, then having different standards that align with the clinical risk of using the tool.
  • Highlight: Evidence should assess whether the full workflow (including multiple models, devices, and human oversight) improves outcomes, not merely model performance.

Theme 2: Performance benchmarks should be based on clinical outcomes, and safety standards should adapt as the evidence grows.

  • Ambiguity around what constitutes “good” performance is a persistent barrier. Metrics need to be anchored to specific clinical outcomes instead of vague process measures.
  • Highlight: Across both use cases, participants emphasized the need not only to set benchmarks but to set minimum safety floors, which could adjust dynamically over time on the basis of observed outcomes, changing patient risk profiles, and emerging evidence.

Theme 3: New technologies may be initially tested in lower-risk populations, but should scale quickly to high-risk populations to maximize impact.

  • Low-risk patients are tempting on-ramps, but AI’s greatest benefits come from reaching the high-need patients, and reaching them carries higher evidence expectations and more clinical risk.
  • Highlight: For mental health, engagement and retention are huge barriers to treatment. Participants cautioned that overly restrictive AI deployments risk limiting access and instead emphasized the need for appropriate care routing following LLM engagement.

The Takeaway

Even the most effective clinical AI tools still have plenty of questions to address before adoption can scale, and PHTI just crowdsourced some promising answers straight from the leaders working in the healthcare trenches.

LLMs Still Struggle With Medical Misinformation 

The Lancet Digital Health just published one of the largest-ever stress tests on medical misinformation in LLMs, and it looks like most models still struggle to separate fact from fiction.

Here’s the setup. Researchers probed 20 LLMs with over 3M prompts containing medical information from three different sources: social media posts, simulated clinical vignettes, or real hospital discharge notes with a single fabricated recommendation inserted.

  • Each prompt was presented in multiple versions, once with neutral wording to establish a baseline, then with a series of variations that were emotionally charged or leading.
  • Ten logical fallacies were also used to test how framing influences model behavior, such as appeals to authority (a physician said…) or popularity (everyone agrees that…).

LLMs love fake news. The susceptibility was shockingly high across all models, with the medical misinformation accepted in 32% of the neutral base prompts.

  • That jumped to 46% when the misinformation was embedded in formal discharge notes, but at least the models were more skeptical of the social media content (9%).

Other findings were more counter-intuitive. Eight of the 10 logical fallacies ended up reducing the misinformation acceptance rate rather than increasing it like the authors expected.

  • Only appeals to authority (+2.9 percentage points above the base prompts) and slippery slope prompts (+2.2pp) increased susceptibility, a relatively small impact considering appeals to popularity slashed it by nearly 20pp.
  • Larger models were generally safer, although the language and phrasing had a far greater influence than the parameter count alone. 
  • It was also surprising to see that the medical models performed worse than the general purpose models, with many having weaker lie detectors despite the specialization.
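For readers who want to sanity-check the percentage-point math, here is a minimal Python sketch. It takes the 32% neutral baseline and the reported shifts from the summary above; the helper names are hypothetical, and the -20pp input is a rounded stand-in for the “nearly 20pp” drop, not an exact figure from the study.

```python
# Baseline acceptance rate on neutral prompts, as reported in the summary.
BASE_RATE = 0.32


def shifted_rate(base_rate: float, delta_pp: float) -> float:
    """Acceptance rate after an absolute shift given in percentage points."""
    return base_rate + delta_pp / 100


def relative_change(base_rate: float, delta_pp: float) -> float:
    """The same shift expressed as a fraction of the baseline rate."""
    return (delta_pp / 100) / base_rate


# Reported shifts in percentage points. The popularity value is an
# assumption: the summary only says "nearly 20pp".
shifts_pp = {
    "appeal to authority": 2.9,
    "slippery slope": 2.2,
    "appeal to popularity (approx.)": -20.0,
}

for framing, delta in shifts_pp.items():
    rate = shifted_rate(BASE_RATE, delta)
    rel = relative_change(BASE_RATE, delta)
    print(f"{framing}: {rate:.1%} acceptance ({rel:+.0%} relative to baseline)")
```

Under these inputs, the authority framing lands near 35% acceptance while the popularity framing lands near 12%, which shows how lopsided the two effects are once they are measured against the same baseline.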

Improving LLM safety is about more than making bigger models. It’s about knowing how information gets presented by actual humans, and having guardrails in place that hold up even when that information is wrong.

The Takeaway

Benchmark performance isn’t real-world performance, and this study provides another reminder that a model’s ability to separate fact from fiction is often more important than its test scores.

The Patient You Lost Before They Ever Walked In

Thousands of patients are referred for procedures but vanish into the void because no one called them back within 48 hours.

By Shani Fargun, VP Healthcare at StackAI
Sponsored by StackAI

While the headlines at major cardiology conferences focus on AI that can read angiograms or predict arrhythmias, a quieter, unsexy revolution is happening in the back office, and it might be the key to actually using those advanced clinical tools.

The biggest bottleneck in modern cardiology is administrative friction. It’s the death by 1,000 faxes that occurs when a patient is referred for a TAVR, but the pre-op workup is trapped in a PDF from an external hospital. It’s the prior authorization that sits in a queue for weeks because a specific keyword was missing from the submission.

  • According to the AMA, 94% of physicians report that these administrative hurdles lead to delays in accessing necessary care.

Healthcare has a data problem. The industry runs on unstructured data. Referral letters, handwritten call notes, faxed labs, and denial letters make up the bulk of cardiac operations.

  • Nearly 80% of all healthcare data is unstructured and inaccessible to traditional automation. This forces highly trained clinical staff to spend hours acting as data entry clerks rather than treating patients.

Agentic AI is the solution. Agentic AI isn’t a chatbot or a diagnostic model. It’s a digital worker.

  • Unlike traditional software that waits for a human to input data, Agentic AI can autonomously perform tasks across different systems.

How can agentic workflows change modern practices?

  • Patient Scheduling & Follow-Up – Agents autonomously handle the last mile of care coordination, reaching out to patients to schedule diagnostic testing, confirming procedure dates, and answering routine logistical questions without burdening clinical staff. This directly combats referral leakage, which costs health systems an estimated $971,000 per physician annually.
  • Automated Prior Auth – Agents cross-reference patient charts against payer-specific guidelines to draft authorization requests that minimize technical denials.
  • Referral Velocity – Agents ingest incoming faxes and emails, extract clinical criteria, and draft the patient chart for review, reducing time-to-appointment from weeks to days.

The Takeaway

The future of healthcare starts with better flows. By automating the administrative burden, we allow interventionalists to focus on what they do best: treating patients.

Request a demo to see customized use cases for your organization here.

Epic Shakes Up Scribe Market With AI Charting

The wait is over. Epic’s scribe has arrived, and it’s packing a lot more than ambient notes.

“AI Charting” goes beyond transcriptions. The fully built-in feature not only listens during patient visits and drafts notes, it also queues up orders based on the conversation.

  • The initial release allows clinicians to personalize the note structure using voice commands (e.g., asking to format the history of present illness as a bulleted list).
  • Epic is positioning AI Charting as the killer app for its Art clinical copilot, which also has a pre-visit Insights tool that’s apparently already being used 16M times per month.

Distribution is king. Over 40% of U.S. hospitals are on Epic, and an AJMC study from just last week showed that two-thirds of those hospitals have already adopted ambient AI.

  • AI Charting is breaking onto the scene through one of healthcare’s biggest distribution channels, and Epic has a ton of levers it can pull with pricing and bundling to start stealing share (DAX Copilot, Abridge, and ThinkAndor accounted for ~80% of Epic hospitals in the recent study).
  • Rather than charging a per-user-per-month fee like most ambient AI platforms, STAT reports that Epic plans to have a separate license for AI Charting, with the price varying by org size and utilization to get the tool in as many hands as possible.

It’s time to differentiate. The race is on for established players to prove they can deliver value that Epic’s integrated approach can’t match.

  • That means tackling problems that are too messy for Epic to touch (Abridge bringing real-time prior auths to the point of conversation), or too specialized for it to get right with so many other plates spinning (Nabla raising the bar for AI safety with world models).
  • Epic is working closely with Microsoft to get new features online quickly, but nailing multiple specialties in countless languages could still prove to be a job that’s better suited for a company with a dedicated focus.
  • Epic might own the “operating system” almost as much as Microsoft owns Windows, but just because MS Paint exists doesn’t mean the world doesn’t need Adobe Photoshop.

The Takeaway

Ambient scribes proved how fast health systems would layer on their own AI if Epic couldn’t keep up, and we’ll now have to wait and see whether the cost and experience of Epic’s scribe are enough to compete with the flock of ambient AI innovators dedicated to this problem.

Bessemer Venture Partners State of Health AI

Bessemer Venture Partners’ always-stellar State of Health AI report did a great job explaining why we (probably) aren’t in a bubble even though the health AI rocket has hit escape velocity.

AI is more than hype. BVP points to signals from the private markets to make its case. 

M&A activity is surging. Global health tech M&A reached 400 deals in 2025 (up from 350 in 2024), but the strategic rationale matters more than the volume. Healthcare orgs and investors recognize that AI simultaneously drives revenue growth and margin improvement.

  • Prime example: the Smarter Technologies roll up was designed to leverage Thoughtful and SmarterDx’s growth engine and clinical AI platform to drive margin expansion across the Access Healthcare RCM services conglomerate.

VC funding is nearly back to pandemic levels. BVP counted 527 venture deals in 2025 (~$14B total), with the average round size climbing 42% to $29M.

  • AI startups captured 55% of that, up from 37% in 2024. Even more importantly, for every $1 invested in AI companies overall, $0.22 went to healthcare AI startups, outpacing the 18% “fair share” implied by healthcare spending’s portion of U.S. GDP.

The question now is, are we in a bubble? BVP has a nuanced answer for why health AI is in a better spot than the Dot Com Bubble.

  • First, AI’s technological shift has spurred the invention of new business models, with the emergence of “AI-services-as-software” companies delivering service-level outcomes (human-quality work) with software-level margins (70%+ gross margins).
  • Second, buyers are now pulling instead of being pushed. While EHRs took 15 years to scale, AI scribes have pulled it off in three. Demonstrable ROI and ease of implementation were key here.

Health AI has an X Factor. New health AI “supernova” startups are bending traditional growth curves entirely. BVP attributes these supernovas’ unprecedented growth to four X Factors.

  • Continuous hyper-growth velocity (not just growth projections)
  • Revenue durability through defensibility
  • Productivity gains that translate to better margins and full-time employee metrics at scale
  • Point solution to platform expansion

Maybe sane valuations, maybe VC mental gymnastics. BVP argues that a supernova with $30M ARR and $1B valuation isn’t overvalued; it has fundamentally different growth dynamics.

  • When you’re growing 6x instead of 2x, you reach $100M ARR in 18 months instead of 36+ months. That compression in time-to-scale commands a premium, and BVP says a 7x revenue multiple for supernovas is justified versus 2-3x for a strong SaaS company.
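The time-to-scale comparison can be sketched with a few lines of Python. This is a back-of-the-envelope illustration, not BVP’s model: it assumes smooth compounding at a constant annual multiple, the function name is hypothetical, and the $10M starting ARR is an assumed value, since the summary above does not state the starting point behind the 18- and 36-month figures.

```python
import math


def months_to_target(start_arr: float, target_arr: float, annual_multiple: float) -> float:
    """Months for ARR to grow from start_arr to target_arr, assuming it
    compounds smoothly at a constant annual multiple (a simplification)."""
    years = math.log(target_arr / start_arr) / math.log(annual_multiple)
    return 12 * years


START_ARR = 10e6    # hypothetical starting ARR, not taken from the report
TARGET_ARR = 100e6  # the $100M ARR milestone mentioned above

fast = months_to_target(START_ARR, TARGET_ARR, 6)  # growing 6x per year
slow = months_to_target(START_ARR, TARGET_ARR, 2)  # growing 2x per year

print(f"6x growth: {fast:.1f} months")
print(f"2x growth: {slow:.1f} months")
print(f"time ratio: {slow / fast:.2f}")
```

One thing the sketch makes visible is that the ratio between the two timelines equals log(6) / log(2), or about 2.6, no matter which starting ARR is assumed. Only the absolute month counts move with the starting point.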

The Takeaway

Health AI is going supernova, and the explosion might actually be big enough to let the leaders grow into their astronomical valuations.

AI Spots Early Cognitive Decline in Clinical Notes

Early disease detection is entering the AI era, and a new study in npj Digital Medicine shows that autonomous agents can now flag cognitive decline using nothing but clinical notes.

Cognitive decline is difficult to detect. It remains significantly underdiagnosed in routine care, and traditional screening usually requires a dedicated clinician and tests that can take hours. 

  • At the same time, early detection is becoming increasingly important, especially with the recent approval of Alzheimer’s therapies that are most effective when administered early. 

Mass General Brigham might have an answer. Clinical notes contain whispers of cognitive decline that busy clinicians can’t always hear. MGB built a system that listens at scale.

  • These whispers include everything from linguistic shifts and sentence pauses to disorganized narratives and family member concerns. 
  • MGB developed an AI system that scans for these signals in routine clinical documentation, leveraging five specialized agents that critique each other and refine their reasoning.

It worked like a charm. The MGB researchers set their agents loose on over 3,300 clinical notes from 200 anonymized patients, then had human reviewers take their own look.

  • The agents detected cognitive impairment with 91% sensitivity, nearly matching expert-level accuracy – without any human intervention needed after deployment.
  • When the AI and human reviewers disagreed, an independent expert validated the AI’s reasoning 58% of the time – meaning the system was often making sound clinical judgments that initial human review had missed.

The cherry on top? The MGB team open-sourced Pythia alongside the study, enabling any provider org to deploy autonomous prompt optimization for their own AI screening applications.

The Takeaway

LLMs have opened the door to proactive screening at scale, and MGB just provided an excellent proof of concept using AI agents that turn everyday documentation into a chance to catch cognitive decline during the optimal treatment window.

ARISE Maps the State of Clinical AI

There have probably been hundreds of reports on the medical AI landscape, but there’s only been one State of Clinical AI from the rockstar team at ARISE.

The AI opus delivers the most complete review we’ve seen of a field that’s moving faster than its evaluation practices. It looked at the most influential clinical AI studies from 2025 to answer a trio of important questions:

  • Where does AI meaningfully improve care once it leaves research settings?
  • Where does performance break down?
  • Where do risks remain underexamined?

ARISE brought the heat. The Stanford-Harvard research network produced more highlights than we could count, but here’s a roundup of some of our favorites.

Impressive results in narrow evaluations. AI models have shown “superhuman performance” in research settings, but these results often depend on how narrowly the problem is framed. 

  • In one study, researchers modified standard medical multiple-choice questions so that the correct answer became “none of the other answers.” The clinical reasoning required to solve the question didn’t change. Model performance did. Accuracy dropped sharply across leading AI models, in some cases by over a third.

AI clearly helps prediction at scale. Although diagnostic reasoning was a mixed bag, several studies demonstrated that AI excels at identifying early warning signals from large datasets.

  • A hospital-based study found that a model trained on continuous wearable vital signs predicted patient deterioration up to 24 hours before standard alerts, identifying patients at risk for ICU transfer, cardiac arrest, or death while there was still time to intervene.

Most studies still don’t resemble the reality of healthcare. Clinical work has little to do with answering exam questions, and much to do with reviewing charts, coordinating care, and deciding when not to intervene.

  • A review of 500+ studies found that nearly half of them tested models using medical exam-style questions. Only 5% used real patient data, very few measured whether the models recognized uncertainty, and even fewer examined bias or fairness.

Now what? ARISE offered a few focus areas for 2026 that hit the center of the bullseye for building trust in the latest AI models.  

  • Evaluate models using real-world scenarios to drive evidence-based medicine.
  • Prioritize human-computer interaction design as much as primary outcomes.
  • Measure uncertainty, bias, and harm – especially when it comes to patient-facing AI.

The Takeaway

Healthcare AI has arrived, and ARISE made it clear that innovation won’t be driven by newer models alone. It will depend on whether health systems, researchers, and regulators are willing to apply the same evidence standards to AI that they expect out of any other clinical solution.

Anthropic and OpenAI Set Sights on Providers

Digital health has some fresh competition. Less than a week after OpenAI launched ChatGPT Health, Anthropic crashed the party with the grand debut of Claude for Healthcare.

Player 2 has entered the fight. Anthropic’s headlining feature for consumers is identical to ChatGPT Health – the answers are grounded in the patient’s own medical history.

  • Claude for Healthcare lets patients securely upload their health records and app data to unlock the same wide-ranging benefits as ChatGPT Health, such as spotting trends, preparing for visits, interpreting lab results, and so on.
  • The two even share some overlapping partner apps like Function and Apple Health, but the similarities end there. 

Claude for Healthcare gets providers in on the action. Unlike OpenAI’s shiny new patient-facing solution, Claude for Healthcare comes with a suite of “Connectors” that enable it to support previously out-of-reach workflows. The list includes:

  • Prior auth reviews and coverage verifications [CMS Coverage Database]
  • Medical coding and billing accuracy [ICD-10]
  • Provider verification and credentialing [NPI Registry]

OpenAI hasn’t taken any days off. It followed up last week’s big ChatGPT Health news with the launch of ChatGPT for Healthcare – similar names, very different products.

  • ChatGPT for Healthcare is OpenAI’s enterprise solution to the Anthropic problem. It brings new provider-facing capabilities like care path management, referral letter generation, and clinical search (tough break for Doximity and Wolters Kluwer).

The fun doesn’t end there. OpenAI added to its hot streak by picking up Torch, a four-person startup building “a medical memory for AI.” The Information pinned the price tag at $100M. 

  • Torch feeds scattered records into a context engine that connects the dots between visit notes, lab results, wearable data, and any other medical info you can think of. 
  • That pitch rhymes perfectly with ChatGPT Health’s value prop, and the Torch team will now be helping boost the new solution’s medical memory across its inaugural cohort of partner apps.

The Takeaway

What a week for our little corner of the industry. OpenAI and Anthropic are diving in head first, and their tech, ambition, and pockets might even be deeper than the choppy legal waters.

Foundation Models Can Compromise Patient Privacy

Foundation models trained on EHR data hold massive potential for clinical applications, but a new study out of MIT shows that they might have just as much potential to violate patient privacy.

Generalized knowledge makes better predictions. EHR foundation models normally draw on a collection of de-identified patient records to produce their outputs.

  • That’s not a problem on its own, but unintended “memorization” also allows these models to serve answers based on a single record from their training data. 

Therein lies the problem. To quantify the risk of these models revealing sensitive information, MIT researchers developed structured tests to determine how easily an attacker with partial knowledge of a patient – think lab results or demographic details – could extract further identifiable info through targeted prompts.

The tests measured memorization as a function of: 

  • the amount of prior knowledge an attacker needs before the model reveals further information
  • the risk associated with the revealed information

What did they find? After validating the tests using EHRMamba, an EHR foundation model with publicly available training data, the researchers reached a pair of conclusions that weren’t too surprising to see.

  • The more information attackers have on a patient, the greater that patient’s privacy risk.
  • Some patients, particularly those with rare conditions, are more susceptible.

Not all information is harmful. The researchers found that some details, such as a patient’s age or gender, present a relatively lower risk in the event of a data breach. 

  • This info wasn’t very helpful in targeted prompts that probed the model for memorized records, and it isn’t very damaging if the answers reveal it.
  • Other info, such as a rare disease diagnosis, was flagged as significantly more harmful. It posed a higher risk of getting the model to expose patient-specific details (especially in combination with other identifiers), and it can be especially sensitive if revealed through probing.

The Takeaway

EHR foundation models need some degree of memorization to solve complex tasks, but memorizing and revealing patient records is obviously out of the question. The tradeoff between performance and privacy is an ongoing challenge, but MIT just delivered a framework for evaluating some of the risks that can help strike the right balance.

OpenAI Jumps Into Healthcare Arena With ChatGPT Health

If OpenAI wasn’t already a major healthcare player, the launch of ChatGPT Health definitely just made it one.

It’s the gamechanger everyone saw coming. OpenAI even teed up the launch with a report showing that 40M people are already using ChatGPT for healthcare advice on a daily basis. 

ChatGPT Health is about to take that a massive step further. 

Here’s a look at the core features:

  • ChatGPT Health operates inside a dedicated health environment with additional privacy layers (conversations aren’t used for model training, optional two-factor authentication).
  • Users can securely upload their complete medical records (courtesy of b.well).
  • Users can connect apps to inform answers (Apple Health, Function, MyFitnessPal).
  • The model uses longitudinal health data, labs, and visit summaries to help spot trends.

OpenAI is moving beyond general health advice. The extra clinical context gives ChatGPT Health the ability to give better answers at scale, and that’s good news for patients.

A few of the most obvious benefits for patients include:

  • Empowering them to take a more active role in their care.
  • Helping them uncover trends in their overall health.
  • Reducing confusion around test results.
  • Reinforcing care plans between visits.
  • The list could go on for a while.

ChatGPT Health isn’t actually HIPAA compliant. Then again, it doesn’t need to be.

  • Consumer health apps like ChatGPT Health aren’t covered by HIPAA, and to OpenAI’s credit it appears to have done a great job with the necessary disclaimers.
  • The dedicated health environment was also developed with input from 260+ physicians, and it leverages a physician-authored framework for safety, clarity, and escalation.

The question now is, who’s accountable when things go wrong? Millions of patients are about to start showing up to visits armed with advice from ChatGPT Health, which means its AI fingerprints will be all over their questions, concerns, and even clinical decisions. The tech might be ready. The governance isn’t.

  • When ChatGPT Health mentions an unproven treatment and a patient follows through, or interprets a worrying lab value as benign, who carries the liability?
  • OpenAI? The physicians who authored the safety framework? The patient who followed the advice? It’s tough to say, but providers – and their patients – still need a clear answer.

The Takeaway

Everyone wants a doctor in their pocket, and ChatGPT Health just filled that role for millions of patients… even if OpenAI explicitly told them it wasn’t up for the job.
