ARISE Maps the State of Clinical AI

There have probably been hundreds of reports on the medical AI landscape, but there’s only been one State of Clinical AI from the rockstar team at ARISE.

The AI opus delivers the most complete review we’ve seen of a field that’s moving faster than its evaluation practices. It looked at the most influential clinical AI studies from 2025 to answer a trio of important questions:

  • Where does AI meaningfully improve care once it leaves research settings?
  • Where does performance break down?
  • Where do risks remain underexamined?

ARISE brought the heat. The Stanford-Harvard research network produced more highlights than we could count, but here’s a roundup of some of our favorites.

Impressive results in narrow evaluations. AI models have shown “superhuman performance” in research settings, but these results often depend on how narrowly the problem is framed. 

  • In one study, researchers modified standard medical multiple-choice questions so that the correct answer became “none of the other answers.” The clinical reasoning required to solve the question didn’t change. Model performance did. Accuracy dropped sharply across leading AI models, in some cases by over a third.
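
To make that setup concrete, here’s a minimal sketch of the perturbation, assuming a simple dict per question (our reconstruction of the idea, not the study’s code):

```python
# Swap the correct option for "None of the other answers" so the label
# moves but the clinical reasoning required stays exactly the same.
def perturb(question: dict) -> dict:
    """question = {"stem": str, "options": list[str], "answer_idx": int}"""
    q = {"stem": question["stem"],
         "options": list(question["options"]),
         "answer_idx": question["answer_idx"]}
    q["options"][q["answer_idx"]] = "None of the other answers"
    return q

mcq = {"stem": "Which agent is first-line for condition X?",  # placeholder stem
       "options": ["Drug A", "Drug B", "Drug C", "Drug D"], "answer_idx": 2}
assert perturb(mcq)["options"][2] == "None of the other answers"
```

If accuracy falls on the perturbed set even though the underlying reasoning task is unchanged, the model was likely pattern-matching option text rather than working through the vignette.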

AI clearly helps prediction at scale. Although diagnostic reasoning was a mixed bag, several studies demonstrated that AI excels at identifying early warning signals from large datasets.

  • A hospital-based study found that a model trained on continuous wearable vital signs predicted patient deterioration up to 24 hours before standard alerts, identifying patients at risk for ICU transfer, cardiac arrest, or death while there was still time to intervene.

Most studies still don’t resemble the reality of healthcare. Clinical work has little to do with answering exam questions, and much to do with reviewing charts, coordinating care, and deciding when not to intervene.

  • A review of 500+ studies found that nearly half of them tested models using medical exam-style questions. Only 5% used real patient data, very few measured whether the models recognized uncertainty, and even fewer examined bias or fairness.

Now what? ARISE offered a few focus areas for 2026 that hit the center of the bullseye for building trust in the latest AI models.  

  • Evaluate models using real-world scenarios to drive evidence-based medicine.
  • Prioritize human-computer interaction design as much as primary outcomes.
  • Measure uncertainty, bias, and harm – especially when it comes to patient-facing AI.

The Takeaway

Healthcare AI has arrived, and ARISE made it clear that innovation won’t be driven by newer models alone. It will depend on whether health systems, researchers, and regulators are willing to apply the same evidence standards to AI that they expect out of any other clinical solution.

Foundation Models Can Compromise Patient Privacy

Foundation models trained on EHR data hold massive potential for clinical applications, but a new study out of MIT shows that they might have just as much potential to violate patient privacy.

Generalized knowledge makes better predictions. EHR foundation models normally draw on a collection of de-identified patient records to produce their outputs.

  • That’s not a problem on its own, but unintended “memorization” also allows these models to serve answers based on a single record from their training data. 

Therein lies the problem. To quantify the risk of these models revealing sensitive information, MIT researchers developed structured tests to determine how easily an attacker with partial knowledge of a patient – think lab results or demographic details – could extract further identifiable info through targeted prompts.

The tests measured memorization as a function of: 

  • the amount of prior information an attacker needs before the model reveals more
  • the harm associated with the information that gets revealed
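
Here’s an illustrative probe in that spirit – the record format and the model.complete() interface are hypothetical, but the shape of the test matches what the paper describes:

```python
# Score how much of a patient's hidden record a model gives back when
# prompted with an attacker's partial view of that patient.
def extraction_score(model, record: dict, known: set) -> float:
    prompt = {k: record[k] for k in known}     # what the attacker already has
    guess = model.complete(prompt)             # hypothetical query interface
    hidden = set(record) - known
    leaked = sum(1 for k in hidden if guess.get(k) == record[k])
    return leaked / max(len(hidden), 1)        # fraction of unknown fields exposed

class EchoModel:                               # stand-in so the sketch runs
    def complete(self, prompt: dict) -> dict:
        return dict(prompt)                    # a "safe" model returns nothing new

patient = {"age": 54, "sex": "F", "diagnosis": "rare condition"}
assert extraction_score(EchoModel(), patient, {"age", "sex"}) == 0.0
```

A memorizing model would push that score toward 1.0 as the attacker’s starting set grows – exactly the relationship the researchers set out to quantify.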

What did they find? After validating the tests using EHRMamba, an EHR foundation model with publicly available training data, the researchers reached a pair of conclusions that weren’t too surprising to see.

  • The more information attackers have on a patient, the greater that patient’s privacy risk.
  • Some patients, particularly those with rare conditions, are more susceptible.

Not all information is harmful. The researchers found that some details, such as a patient’s age or gender, present a relatively lower risk in the event of a data breach. 

  • This info wasn’t very helpful in targeted prompts that probed the model for memorized records, and it isn’t very damaging if the answers reveal it.
  • Other info, such as a rare disease diagnosis, was flagged as significantly more harmful. It posed a higher risk of getting the model to expose patient-specific details (especially in combination with other identifiers), and it can be especially sensitive if revealed through probing.

The Takeaway

EHR foundation models need some degree of memorization to solve complex tasks, but memorizing and revealing patient records is obviously out of the question. The tradeoff between performance and privacy is an ongoing challenge, but MIT just delivered a framework for evaluating some of the risks that can help strike the right balance.

OpenAI Jumps Into Healthcare Arena With ChatGPT Health

If OpenAI wasn’t already a major healthcare player, the launch of ChatGPT Health definitely just made it one.

It’s the gamechanger everyone saw coming. OpenAI even teed up the launch with a report showing that 40M people are already using ChatGPT for healthcare advice on a daily basis. 

ChatGPT Health is about to take that a massive step further. 

Here’s a look at the core features:

  • ChatGPT Health operates inside a dedicated health environment with additional privacy layers (conversations aren’t used for model training, optional two-factor authentication).
  • Users can securely upload their complete medical records (courtesy of b.well).
  • Users can connect apps to inform answers (Apple Health, Function, MyFitnessPal).
  • The model uses longitudinal health data, labs, and visit summaries to help spot trends.

OpenAI is moving beyond general health advice. The extra clinical context lets ChatGPT Health deliver better answers at scale, and that’s good news for patients.

A few of the most obvious benefits for patients include:

  • Empowering them to take a more active role in their care.
  • Helping them uncover trends in their overall health.
  • Reducing confusion around test results.
  • Reinforcing care plans between visits.
  • The list could go on for a while.

ChatGPT Health isn’t actually HIPAA compliant. Then again, it doesn’t need to be.

  • Consumer health apps like ChatGPT Health aren’t covered by HIPAA, and to OpenAI’s credit, it appears to have done a great job with the necessary disclaimers.
  • The dedicated health environment was also developed with input from 260+ physicians, and it leverages a physician-authored framework for safety, clarity, and escalation.

The question now is, who’s accountable when things go wrong? Millions of patients are about to start showing up to visits armed with advice from ChatGPT Health, which means its AI fingerprints will be all over their questions, concerns, and even clinical decisions. The tech might be ready. The governance isn’t.

  • When ChatGPT Health mentions an unproven treatment and a patient follows through, or interprets a worrying lab value as benign, who carries the liability?
  • OpenAI? The physicians who authored the safety framework? The patient who followed the advice? It’s tough to say, but providers – and their patients – still need a clear answer.

The Takeaway

Everyone wants a doctor in their pocket, and ChatGPT Health just filled that role for millions of patients… even if OpenAI explicitly told them it wasn’t up for the job.

8VC’s Vision for Healthcare AI in America

8VC just dropped its Vision for Healthcare AI in America, and it’s the best roadmap we’ve seen for removing the barriers between AI and its potential to transform medicine.

Great cakes have three layers, maybe four. Before 8VC shared its recipe for how AI can help fix healthcare, it laid out the four main ingredients it’ll be working with.

  • Level 0: Administrative – AI that supports providers in the back office. Example: AI scheduling agents, scribes.
  • Level 1: Assistive – AI that assists clinicians but doesn’t diagnose, treat, triage, or prescribe medications to patients. Example: AI coaches, navigators.
  • Level 2: Supervised Autonomous – AI that does all the things that Level 1 doesn’t, with decisions supervised by a clinician. Example: AI medication management.
  • Level 3: Autonomous – AI that diagnoses, treats, triages, or prescribes medications completely on its own. Example: fully-autonomous triage lines.

Now for the vision. Most healthcare AI solutions currently live on Level 0. They’re creating real value for providers, but they aren’t going to steer the Titanic away from the iceberg.

  • 8VC thinks the other levels might, but not unless we remove the legal barriers that are preventing our innovators from innovating.

Level 1. These solutions exist today, but assistive AI care models are being held back by a lack of broadly billable CPT codes for the services they render.

  • Solution: Implement value-based reimbursement for assistive AI care models. 8VC describes a CMMI model with durable codes and case rates, which sounds like something most payors would be lining up to lobby for.

Level 2. All autonomous AI is considered Software as a Medical Device by the FDA, but the current performance bars are set too high. Driving tests don’t need to be F1 races.

  • Solution: Align FDA approval benchmarks with real-world standards, not hypothetical ideals. LumineticsCore is a good example – the FDA required the tool to catch at least 85% of diabetic retinopathy cases, but most ophthalmologists land between 33% and 77%. 

Level 3. Only a few policy changes are needed to open the door to Level 3 once we get to Level 2, the biggest of which is defining AI as a type of practitioner that’s eligible for reimbursement.

  • Solution: Amend the Social Security Act to allow Medicare reimbursement for licensed AI. As it stands today, even if CMS created a code for a Level 3 service, it would still be illegal for Medicare to pay an AI company instead of the supervising physician.

The Takeaway

AI is going to have to level up if we want to transform healthcare experiences, costs, and ultimately outcomes. 8VC thinks we can get there if we let our builders build, and it even gave us a blueprint for getting out of our own way.

AI Scribes Aren’t Productivity Tools, Yet

The first randomized controlled trials for ambient AI have finally arrived, and NEJM AI just gave us the strongest evidence yet that scribes deliver… minimal time savings.

The first study was a mixed bag. UCLA researchers assigned 238 physicians across 14 specialties to one of two scribes – Microsoft DAX or Nabla – or usual care for two months.

  • Nabla ended up saving about 23 seconds per visit, while DAX shaved off a whopping 5 seconds (which wasn’t even statistically significant).
  • Both scribe groups did, however, report less burnout and lower cognitive burden than the usual-care controls.

The second study told a similar tale. Physicians at the University of Wisconsin who used Abridge’s AI scribe for 6 weeks trimmed their daily documentation time by 22 minutes.

  • Still not a world-changing difference, but the UW physicians also saw significant improvements in work exhaustion and well-being.

But wait, there’s more. While those studies didn’t go as far as to suggest a cause for the lackluster time savings, a separate well-timed study from Navina offered a possible mechanism.

  • Scribes capture clinical conversations. Those conversations only inform a piece of the note, and those notes are only a piece of the workflow.
  • Navina found that incorporating patient medical histories into ambient documentation dramatically improves both note completeness and quality, which also seems like a great way to help physicians avoid lengthy manual chart reviews to fill any remaining gaps.
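
The mechanism is easy to picture. Here’s a rough sketch (entirely our illustration, with made-up history items) of grounding the draft note in chart context rather than the transcript alone:

```python
# Combine the visit transcript with relevant chart history before drafting,
# so the note isn't limited to what happened to be said out loud.
def build_note_prompt(transcript: str, history: list) -> str:
    context = "\n".join(f"- {item}" for item in history)  # problems, meds, prior labs
    return ("Draft a SOAP note for this visit.\n"
            f"Relevant chart history:\n{context}\n"
            f"Visit transcript:\n{transcript}")

print(build_note_prompt("Patient reports improved fasting glucose...",
                        ["Type 2 diabetes dx 2019", "Metformin 1000mg BID"]))
```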

Then why do scribes get rave reviews? That’s still up for debate.

  • It’s worth noting that “average time savings” include plenty of physicians who barely used the scribe. UCLA only had about a third of physicians pick up the tools, while UW was close to a best-case scenario at 71% (the toy math below shows how that dilutes an average).
  • It’s also possible that physicians enjoy not having to hold the visit in their head until they can finish their note, and getting rid of that burden is as magical as actual time savings.
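
A bit of toy arithmetic shows how low uptake washes out an average (the numbers are illustrative, not from either trial):

```python
# If only a third of the cohort uses the scribe, even a solid per-visit
# saving among active users looks tiny when averaged over everyone.
adoption = 1 / 3            # share of physicians actually using the tool
savings_for_users = 60.0    # hypothetical seconds saved per visit by users
cohort_average = adoption * savings_for_users
print(f"{cohort_average:.0f}s average per visit, despite "
      f"{savings_for_users:.0f}s for active users")
```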

The Takeaway

Not everything that can be measured matters, and not everything that matters can be measured. AI scribes might not be productivity tools quite yet, but physicians are clearly finding plenty of reasons to love them until they get there – even if more time isn’t one of them.

Bain & Company: Top Healthcare IT Priorities

Payors and providers are fighting different operational battles, but they’re using the same two-letter weapon to come out on top: you guessed it, AI. 

A joint report from Bain & Company and KLAS found that 80% of payors and 70% of providers now have an AI strategy in place, up from just 60% last year.

  • Providers are up against structural workforce shortages and rising patient volumes, while payors are contending with higher medical loss ratios and more regulatory scrutiny.
  • Bain and KLAS’ survey of 228 U.S. healthcare execs suggests that all signs point to one solution, and that’s deploying tech to improve margins.

Where are payors investing? Care coordination (57%) and utilization management (55%) were the top IT investment priorities for the second straight year.

  • Payors place total cost of ownership, functionality, and scalability ahead of suite convenience, so best-of-breed is still the default buying motion.
  • Plans are leveraging AI for everything from member engagement (35%) and enrollment (26%) to risk adjustment (26%) and prior auth automation (20%).

Where are providers investing? Revenue. Cycle. Management.

  • Half of providers ranked RCM among their top IT priorities, placing it above clinical workflows (34%) and EHRs (32%).
  • RCM = ROI. Accurate documentation and coding results in cleaner claims and fewer denials, which directly translates to higher revenue and lower expenses.
  • It’s also a match made in heaven for AI automation, and RCM accounts for the four most common AI use cases: ambient documentation (62%), clinical documentation improvement (43%), coding (30%), and prior authorization (27%).

Here’s the kicker. Providers cite EHR integration and interoperability as their biggest pain points, so most of them prioritize their EHR vendors for new solutions.

  • Only 20% of providers are primarily best-of-breed buyers, and two-thirds of Epic customers would choose an Epic option that’s “good enough” over a better competing product.

The Takeaway

It’s getting pretty hard to not be bullish on AI. There’s still plenty of uncertainty, but both payors and providers now seem to agree that inaction is the riskiest action.

AI Learns the Natural History of Human Disease

Clinical decision-making relies on understanding patients’ past health to improve their future health, an impossible task without first understanding how diseases progress over time.

That’s where a new study in Nature suggests AI is ready to help.

It starts with generative pretrained transformers. Researchers built a GPT, dubbed Delphi-2M, to predict the “progression and competing nature of human diseases.” 

  • Delphi-2M was trained on 400k UK Biobank participants (who lean healthier than the average person), and then externally validated on 1.9M Danish patients.
  • The training was designed to predict a patient’s next diagnosis and the time to it, using only data readily available within the EHR: past medical history, age, sex, BMI, and alcohol/smoking status (sketched below).
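
For the technically curious, here’s a minimal sketch of that training objective – a GPT-style model over event tokens with a causal mask. The vocabulary size, dimensions, and time-to-event head are our guesses, not the paper’s implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D = 1300, 256  # hypothetical: ~1,000+ disease codes plus sex/BMI/lifestyle tokens

class NextDiagnosis(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.next_dx = nn.Linear(D, VOCAB)  # which diagnosis comes next
        self.log_gap = nn.Linear(D, 1)      # log of the expected time until it

    def forward(self, tokens):
        seq_len = tokens.size(1)            # causal mask: attend to self and past only
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.next_dx(h), self.log_gap(h)

model = NextDiagnosis()
timeline = torch.randint(0, VOCAB, (2, 16))  # fake patient event sequences
logits, gaps = model(timeline)
# Each position learns to predict the following event; a real run would also
# fit the time head against the observed waiting times between diagnoses.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), timeline[:, 1:].reshape(-1))
```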

How’d it do? The results speak for themselves:

  • Delphi-2M was able to forecast the incidence of over 1,000 diseases with comparable accuracy to existing models that are fine-tuned to predict single diseases.
  • Death could be predicted with eerily impressive accuracy (AUC: 0.97), and the survival curves that it simulated lined up almost perfectly with national mortality statistics.
  • Comorbidities emerged naturally from the training, and Delphi-2M was able to understand the progression from type 2 diabetes to eye disease to nerve damage.
  • Delphi-2M’s ability to predict heart attack and stroke matched established scores like QRisk, and it even outperformed leading biomarker-based AI models.

Better forecasts inform better policies. If policymakers can consult the Oracle of Delphi to see how many people will develop a disease over the next decade, the authors conclude that they’ll also be able to implement better regulations to prepare. 

  • Not a bad theory, assuming models trained on historical data can make forecasts that hold up to evolving treatments and populations (and that politicians act in the best interest of the people).

The Takeaway

AI is reaching the point where it can predict thousands of diseases as well as the best narrowly focused models, and that could have big implications for everything from early screening to policymaking.

Wolters Kluwer Jumps in the GenAI Ring With UpToDate Expert AI

Just when you thought Wolters Kluwer might let everyone else have all the AI fun, it debuted UpToDate Expert AI to give the world’s most widely used clinical decision support tool a much-needed AI overhaul.

Wolters Kluwer took its time with the launch. The incumbent CDS juggernaut is used by 3M doctors worldwide, so it had plenty of users to disappoint with a hasty rollout.

  • That said, nimble competitors have been gaining ground in roughly the time it takes to download OpenEvidence from the App Store.
  • The good news is that WK made the most of the extra development time.

Here’s what sets UpToDate Expert AI apart. Unlike general-purpose chatbots, the AI-enhanced version of UpToDate is built exclusively on WK’s peer-reviewed content library.

  • It draws on 30+ years of evidence-based research authored by 7,600 experts, rather than the open web or selective journals.
  • That allows it to quickly answer complex clinical questions, while surfacing all of its sources, assumptions, and step-by-step reasoning directly in the response. Probably safe to assume that also helps with hallucinations.
  • Those answers still manage to be easy to scan at the bedside, and they’ll look extremely familiar to any doctor who’s ever read an UpToDate article (or has been reading them for a decade).

The extra time in the oven means that more features are baked in. Wolters Kluwer knows its audience, and UpToDate Expert AI’s biggest leg up on the competition is its fine-tuning for health systems.

  • Enterprise-grade governance, compliance, and workflow integration are all standard out-of-the-box, giving UpToDate Expert AI an advantage for a system-wide implementation over OpenEvidence or Doximity.

The Takeaway

It turns out that the 800-pound clinical support gorilla wasn’t going to let the newcomers eat its lunch forever, and UpToDate Expert AI gives health systems plenty of reasons to keep rolling with Wolters Kluwer.

Penguin Ai Raises $30M to Arm Both Sides of the AI Agent War

Payors and providers are in an AI arms race, and Penguin Ai just raised $30M to supply both sides with agents to outcompete each other.

Penguin goes far beyond point solutions. The enterprise AI platform combines proprietary LLMs with AI tooling that both payors and providers can use to configure custom agents for their own back-office processes. 

  • The platform enables customers to prep their data for AI, use pre-built LLMs via APIs, or start with a ready-made agent for medical coding, prior auths, claims adjudication, appeals management, risk adjustment, medical chart summarization, or payment integrity.
  • The ultimate goal is to streamline high-volume workflows and cut down on the billions of dollars of administrative waste that the healthcare industry generates every year.

The agent wars have begun. Payors and providers across the country are racing to enlist AI agents to fight for an advantage in a system that’s historically been plagued by inefficiencies and headbutting.

  • Providers vs. Payors: Doctors and hospitals are leveraging agents to fight back against billing denials – filing floods of appeals and automating responses faster than any human could manage alone.
  • Payors vs. Providers: Health plans are rolling out agents to instantly review claims, prior auths, and appeals requests – enabling mass, automatic care decisions that overwhelm providers.

Penguin CEO Fawad Butt has been in the buyer seat. He spent his career serving as the chief data officer at some of the biggest names in the industry: UnitedHealthcare, Kaiser Permanente, and Optum.

  • He founded Penguin to build the platform he saw was missing, and that adds a lot of credibility as Penguin takes on incumbent admin agent dealers like Innovaccer and Autonomize AI.

The Takeaway

The agent wars are in full swing, and Penguin is bringing a comprehensive platform to a battlefield full of point solutions. 

Doctors Who Use AI Are Viewed Worse by Peers

The research headline of the week belongs to a study out of Johns Hopkins University that found “doctors who use AI are viewed negatively by their peers.”

Clickbait from afar, but far from clickbait. The investigation in npj Digital Medicine surfaced interesting takeaways after randomizing 276 practicing clinicians to evaluate one of three vignettes depicting a physician: using no GenAI (the control), using GenAI as a primary decision-making tool, or using GenAI as a verification tool.

  • Participants rated the clinical skill of the physician using GenAI as a primary decision-making tool as significantly lower than the physician who didn’t use it (3.79 vs. 5.93 control on a 7-point scale). 
  • Framing GenAI as a “second opinion” or verification tool improved the negative perception of clinical skill, but didn’t fully eliminate it (4.99 vs. 5.93 control). 
  • Ironically, while an overreliance on GenAI was viewed as a weakness, the clinicians also recognized AI as beneficial for enhancing medical decision-making. Riddle us that.

Patients seem to agree. A separate study in JAMA Network Open took a look at the patient perspective by randomizing 1.3k adults into four groups that were shown fake ads for family doctors, with one key difference: no mention of AI use (the control), or a reference to the doctors using AI for administrative, diagnostic, or therapeutic purposes (Supplement 1 has all the ads).  

For every AI use case, the doctors were perceived significantly worse on a 5-point scale:

  • less competent – control: 3.85; admin AI: 3.71; diagnostic AI: 3.66; therapeutic AI: 3.58
  • less trustworthy – control: 3.88; admin AI: 3.66; diagnostic AI: 3.62; therapeutic AI: 3.61
  • less empathic – control: 4.00; admin AI: 3.80; diagnostic AI: 3.82; therapeutic AI: 3.72

Where’s that leave us? Despite pressure on clinicians to be early AI adopters, using it clearly comes with skepticism from both peers and patients. In other words, AI adoption is being throttled not only by technological barriers, but also by some less-discussed social ones.

The Takeaway

Medical AI moves at the speed of trust, and these studies highlight the social stigmas that still need to be overcome for patient care to improve as fast as the underlying tech.
