Ad-verse Effects in Consumer-Facing AI

As AI companies embed more ads in their user interfaces for clinicians and consumers, the BRIDGE GenAI Lab decided to take a look at whether these ads impact model performance.

Turns out, they do. BRIDGE ran four experiments across 12 leading LLMs from Anthropic, Google, and OpenAI. The models were far more recent than those in most studies we cover, an upside of not waiting around for peer review before publishing a preprint.

  • Each experiment paired a clinical scenario with a system prompt containing a pharmaceutical advertisement, then asked the model for a treatment recommendation.

Ads definitely moved the needle. Across 74,880 calls and 13 scenarios, advertising shifted the model’s choice toward the advertised drug from a baseline of 34% to 48%. 

  • That’s a jump of +12.7 percentage points on average.
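For readers who want to see how a percentage-point shift is calculated, here is a minimal sketch. The counts are made up for illustration and are not BRIDGE's data, and the function name is hypothetical.

```python
def percentage_point_shift(baseline_hits, baseline_total, ad_hits, ad_total):
    """Return the change in selection rate, in percentage points."""
    baseline_rate = baseline_hits / baseline_total
    ad_rate = ad_hits / ad_total
    return (ad_rate - baseline_rate) * 100


# Hypothetical counts, NOT taken from the preprint: the drug is chosen in
# 350 of 1,000 calls without an ad and in 475 of 1,000 calls with one.
shift = percentage_point_shift(350, 1000, 475, 1000)
print(f"{shift:+.1f} pp")
```

A percentage-point shift is the plain difference between two rates, which is not the same as a relative increase. In this made-up example the relative increase would be larger, because 12.5 points is more than a third of the 35% starting rate.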

The LLMs had some nice range. Model bias varied widely by developer.

  • Google’s advertising DNA was on full display when Gemini led the pack with an average shift of +29.8 percentage points toward the advertised drug. 
  • Five models from OpenAI were swayed by an average of +10.9 pp.
  • Anthropic’s models were the most resilient at +2.0 pp, and the ever-skeptical Opus 4.6 actually steered away from the promoted drug by -3.8 pp.

Three experiments contrasted three different conditions. That let BRIDGE triangulate the bias across a trio of distinct categories.

  • Equipoise (+12.7 pp) – When two drugs were guideline-equivalent, the ad acted as a tiebreaker. The output was clinically correct, but biased.
  • Suboptimal Drug (+0.6 pp) – When the advertised drug was clinically inferior, models resisted. Only 4.4% of responses chose the suboptimal advertised option.
  • Wellness Supplements (-0.6 pp) – For supplements lacking evidence, endorsement decreased. Anthropic models actively pushed back at -2.4 pp.

The picture was consistent. Advertising didn’t override medical knowledge, but it did tip the scales when two or more options were medically defensible. 

  • Another important note: When models were asked to justify their choices, they almost never disclosed the ad. If they chose the advertised drug, the justification echoed the ad in 52.7% of cases.

The Takeaway

BRIDGE just showed why the real harm with AI advertising might not be patients receiving dangerous drugs. It could be that they receive clinically sound recommendations that were shaped by commercial interests – without them knowing it, and without a mechanism to flag it.

OpenAI o1 Outperforms Physicians on Clinical Reasoning Tasks

A landmark study in Science found that OpenAI’s o1 series outperformed human physicians at multiple clinical reasoning tasks, but that doesn’t mean it’s time to hang up the scrubs just yet.

Researchers at Harvard and Beth Israel Deaconess Medical Center designed the study to evaluate whether LLMs are ready to do what physicians do on a daily basis: review messy patient charts and use that data to determine diagnosis and next steps.

  • They evaluated o1 on clinical cases ranging from patient vignettes to second opinions on 76 real-world ED assessments, which included all the noise and incomplete information that clinicians routinely encounter in the EHR.
  • The refreshingly well-designed study also incorporated a blinded evaluation with two attending physicians at BIDMC and GPT-4.

o1 came to play. On clinical vignettes evaluating management reasoning, o1-preview scored a median of 86%. Not too shabby.

  • It outperformed GPT-4, humans with GPT-4, and humans with conventional resources like UpToDate – all of which scored below 45%.

The ED cases were even more impressive. o1 offered second opinions about the diagnosis at three points along the patient’s ED journey:

  • At triage, o1 gave an exact or very close diagnosis in 67% of cases (when information in the record dump was most limited). The two physicians hit 55% and 50%. 
  • o1 still outperformed the physicians when given all the data collected by the end of the ED encounter.
  • It was only when the physicians were given the most information possible to inform their diagnosis – at the time the patient would have been admitted to the hospital – that the scores finally converged.

The cherry on top? Physician raters couldn’t tell whether the differentials came from o1 or a human. One rater couldn’t tell in 83.6% of cases, the other in 94.4%. 

  • The authors were quick to mention that these results don’t mean AI is ready to replace human physicians. They mean it’s time for rigorous research into how AI can augment care teams, serve as a second opinion, and become a safety layer for clinicians.

The Takeaway

o1 outperforming a couple of internists at triage isn’t quite Deep Blue beating Garry Kasparov at chess, but it’s a step in that direction – especially considering OpenAI’s performance jump in just the last week (let alone since o1 launched in 2024).

Why AI Vendors Struggle to Compete With EHRs

Anyone who has ever tried selling AI into health systems will tell you that it’s tough to compete with EHRs, but a new article in JAMA makes the case that it’s actually gotten too tough – and it might be time for regulators to step in.

Most markets reward the best products. The healthcare industry has a funny way of preventing that from happening, and EHR vendor dominance is a textbook example.

  • EHRs hold advantages across infrastructure, workflow integration, procurement, and pricing that make it difficult for third-party tools to gain a foothold.
  • A 2025 Health Affairs study backed that up by showing that 79% of U.S. hospitals use AI models from their EHR vendor, compared to just 59% that use AI from third-party developers.
  • A Bain report drove the point home. Two-thirds of Epic customers said they’d pick a “good enough” Epic option over a better competing product.

These EHR advantages are a natural feature of the market. That said, it’s up to regulators to decide whether the status quo is serving patients and the overall healthcare system. The JAMA authors argue that it doesn’t, and offer three areas where targeted policy could level the playing field.

Infrastructure – Integrating AI tools into clinical workflows requires real-time data access and the ability to survive EHR upgrades intact, both of which are dramatically easier for EHR vendors – particularly as data fields get added or removed.

  • Potential Policy – Mandate broader API adoption so third parties can access EHR data on equal footing, and use existing EHR certification and interoperability frameworks to do it.

Workflow and Usability – The authors specifically flag EHR vendors’ edge in understanding the trade-offs of allocating limited screen real estate to new AI tools, something that’s harder for third parties to gauge from the outside looking in.

  • Potential Policy – Require EHR vendors to offer more robust developer sandboxes – similar to Apple’s iOS developer environment – so third parties can build and test without operating at a structural disadvantage.

Procurement and Pricing – Long-standing health system relationships give EHR vendors a streamlined path through procurement, as well as the leverage to “use pricing structures that incentivize adoption.”

  • Potential Policy – Although this is the hardest area for a policy fix, the authors suggest that improving transparency around AI performance could at least help health systems make more informed decisions regardless of where a tool comes from.

The Takeaway

EHRs are in a powerful position, and companies in powerful positions have a long track record of making life harder for their competition. Healthcare is too important an industry not to have the best products rise to the top, and this article offers some sound strategies to make sure that stays possible.

Qualified Raises $125M to Build AI Infrastructure

In an era of isolated AI pilots, Qualified Health is building the infrastructure to connect the dots.

AI is the star of enterprise transformation. Health systems are looking to deploy and scale AI across their entire organization, and Qualified just raised $125M of Series B funding to make sure every new agent fits into a cohesive constellation.

The core platform has four distinct layers:

  • A data foundation that turns the EHR and external sources into an AI-ready bedrock.
  • A layer that lets hospitals build and deploy AI tools without always starting from scratch.
  • A layer that turns those tools into AI apps and agents deployed directly into workflows.
  • A layer that keeps governance, monitoring, and evaluation at the center of everything.

Qualified doesn’t leave AI to chance. It embeds forward-deployed product leaders alongside health system teams to identify high-priority needs, deploy solutions quickly, and iterate based on actual feedback in the trenches.

That has a couple of major benefits:

  • AI solutions are purpose-built for specific operational problems rather than mass market appeal.  
  • The tight feedback loop allows Qualified to iterate faster than it would be able to with a traditional implementation cycle, which shortens the timescale needed to improve its deployments and demonstrate a measurable impact.

The proof is in the pudding. At the University of Texas Medical Branch, Qualified reportedly generated a $15M measurable run-rate impact within the first six months.

  • That’s an eye-popping number to get on record, and it apparently stemmed from “a real willingness to dive deep” alongside UTMB clinical teams to deploy multiple assistants and automated workflows.
  • Qualified already supports systems representing about 7% of U.S. hospital revenue, and the next chapter is about deepening those partnerships and scaling responsibly.
  • Big ambition also means big competition, and Qualified will be up against everyone from Innovaccer to Epic if it wants to become healthcare’s AI platform of choice.

The Takeaway

Hospitals aren’t looking to AI for incremental improvement. They’re looking to AI to transform how they deliver care, and Qualified just landed another $125M to be the infrastructure that makes that possible.

How to Build Patient Trust in Medical AI

AI might move at the speed of trust, but new research in JAMA Network Open shows that trust only moves at the speed of accuracy.

The study had a solid setup. To determine the factors currently driving patient trust in AI, researchers presented 3,000 U.S. adults with a pair of hypothetical AI-assisted visits for a moderate-risk rash. 

  • Each visit had six randomized attributes, such as whether or not a doctor was present, how well the AI performs relative to human clinicians, and various AI governance mechanisms.
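A paired-choice setup like this can be sketched in a few lines. This is only an illustration of randomizing attributes independently: the attribute names and levels below are hypothetical, cover four attributes rather than the study's six, and do not reproduce the researchers' actual survey wording.

```python
import random

# Hypothetical attribute levels, loosely based on the attributes named above.
# The study's real attribute list and wording may differ.
ATTRIBUTES = {
    "doctor_present": ["yes", "no"],
    "ai_performance": [
        "same as general practitioner",
        "same as specialist",
        "better than specialist",
    ],
    "certification": ["none", "FDA", "Mayo Clinic", "local hospital"],
    "training_data": ["no details", "nationally representative", "bias disclosed"],
}


def random_visit(rng):
    """Draw one hypothetical visit by sampling each attribute independently."""
    return {name: rng.choice(levels) for name, levels in ATTRIBUTES.items()}


def random_pair(rng):
    """Return two visit profiles for a respondent to choose between."""
    return random_visit(rng), random_visit(rng)


rng = random.Random(0)  # seeded so the example is repeatable
visit_a, visit_b = random_pair(rng)
print(visit_a)
print(visit_b)
```

Because every attribute is drawn independently, the effect of any single attribute on respondents' choices can be estimated by averaging over the others.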

AI performance came out on top by a wide margin. Respondents cared more about how well the AI performs than FDA approval, governance, and even having a doctor in the room.

  • The biggest difference came from AI performing better than a specialist, which increased the likelihood of choosing that visit by 32.5%.
  • AI performing at the same level as a specialist boosted visit preference by 24.8%, slightly more than having AI that performs as well as a general practitioner (19.1%).
  • Having an actual doctor present surprisingly only swayed visit preference by 18.4%.

Governance factors also moved the needle. They just didn’t move it much.

  • FDA approval for the AI increased visit preference by a modest 11.1%.
  • Mayo Clinic AI certifications apparently carry just as much weight – also coming in at 11.1%.
  • Local hospital certifications for the AI only gave visits a 7.8% lift.

AI data quality was important. It just wasn’t as convincing as AI performance. 

  • AI that had nationally representative training data boosted visit preference by 11.9%, but it was interesting to see that disclosing bias in the training data had no effect versus not providing any data details.

The written explanations told the same story. Respondents cited AI performance and clinician involvement as the primary reasons for their choices, with many of them expressing comfort with AI as a tool – but not as a standalone decision-maker.

The Takeaway

Widespread AI adoption requires patient trust, and this study did a great job highlighting the specific areas that should be prioritized for building it.

Microsoft Dragon Copilot Gets AI Upgrades

Microsoft might have had the biggest presence at the biggest health IT conference, and it made sure all the lights in Las Vegas were on Dragon Copilot.

Unify. Simplify. Scale. Microsoft’s theme at HIMSS was all about making Dragon Copilot a one-stop-shop for information within clinical workflows. It debuted several new capabilities at the show:

  • Integrated medical content from trusted sources
  • Partner-powered AI apps and agents
  • Proactive ICD‑10 specificity suggestions
  • Expanded role-based experiences for physicians, nurses, and radiologists

Partnering is quicker than building. Rather than developing every Dragon Copilot capability in-house, Microsoft has been leaning on outside partners to round out the platform.

  • Dragon Copilot’s clinical evidence feature is a prime example. It brings medical content and other relevant contextual information in-workflow, all curated through new partnerships with Wolters Kluwer, Elsevier, and other vetted sources.

Microsoft Marketplace fills the gaps. It allows users to add AI partner apps directly into their Dragon Copilot workflows. Picture a modular side panel with insights from folks like: 

  • Regard – surfaces comorbidities and relevant diagnoses 
  • Canary Speech – analyzes voice biomarkers for mental health conditions
  • Humata Health – automates prior authorization processes for clinicians 
  • Atropos – generates personalized real-world evidence 
  • Optum – identifies potential coverage issues and supports claims processing 

All roads lead to scribes. When Microsoft first acquired Nuance for $20B back in 2022, it was its second-largest acquisition ever behind LinkedIn, and the core offerings were radiology report automation, dictation, and transcription (with humans still pulling a ton of weight).

  • The product formerly known as Dragon Ambient eXperience is now the backbone of Dragon Copilot, and it’s been adding features at a breakneck pace.
  • Microsoft is looking to make Dragon Copilot everything, everywhere, all at once, and so far new partnerships have been the key to making that happen.

The Takeaway

As every digital health company rushes to add scribing to their platform, the OG scribe is rushing to add everything else. Now it just needs to maintain a unified user eXperience.

Anterior Closes $40M to Take AI to the Largest Plans in the Country

The AI race between payors and providers is healthcare’s Kentucky Derby, and Anterior just closed $40M to help turn the dark horses into the frontrunners.

Anterior uses AI to ease the back-office burden on health plans. It started with a laser focus on prior authorizations, translating huge amounts of unstructured data into the information that’s actually needed to make quicker decisions.

  • When Anterior helps payors deploy AI in their clinical and operational workflows, it doesn’t just dump a bunch of models on them and disappear into the sunset.
  • It embeds its own clinicians and engineers alongside the platform to support its partners, optimize accuracy, and drive a measurable impact.

Trust is a differentiator. Payors are a cautious crowd, and they aren’t exactly known for trusting new friends with their critical workflows. 

  • Anterior’s clinicians are its secret sauce. They make up about 40% of the company, and many of them have even started contributing directly to the platform’s code base.
  • This hands-on support is why partners build trust, and that hard-earned resource is what allowed Anterior to take the same tech underpinning its prior auth tools and expand it to other workflows.

New partners lead to new proof points. New proof points lead to new use cases. 

  • Anterior’s early successes – from both its people and technology – have allowed it to quickly land and expand into areas like payment integrity and risk adjustment. 
  • Since closing its $20M Series A in June 2024, Anterior has deployed its AI across major payors like Geisinger Health Plan, and worked alongside enterprise technology partners like HealthEdge to build out key strategic integrations.
  • The platform now supports orgs representing over 50M covered lives, and the fresh funds will help it use those case studies to pry open the door to the biggest national plans in the business.  

The Takeaway

Anterior’s earliest partners had to gamble on an unproven platform without any real-world evidence to back it up. Now, the proof is in the success stories, and Anterior just landed another $40M to go after the largest and most risk-averse payors in the country.

LLMs Still Struggle With Medical Misinformation 

The Lancet Digital Health just published one of the largest-ever stress tests on medical misinformation in LLMs, and it looks like most models still struggle to separate fact from fiction.

Here’s the setup. Researchers probed 20 LLMs with over 3M prompts containing medical information from three different sources: social media posts, simulated clinical vignettes, or real hospital discharge notes with a single fabricated recommendation inserted.

  • Each prompt was presented in multiple versions, once with neutral wording to establish a baseline, then with a series of variations that were emotionally charged or leading.
  • Ten logical fallacies were also used to test how framing influences model behavior, such as appeals to authority (a physician said…) or popularity (everyone agrees that…).

LLMs love fake news. The susceptibility was shockingly high across all models, with the medical misinformation accepted in 32% of the neutral base prompts.

  • That jumped to 46% when the misinformation was embedded in formal discharge notes, but at least the models were more skeptical of the social media content (9%).

Other findings were more counter-intuitive. Eight of the 10 logical fallacies ended up reducing the misinformation acceptance rate rather than increasing it like the authors expected.

  • Only appeals to authority (+2.9 percentage points above the base prompts) and slippery slope prompts (+2.2pp) increased susceptibility, a relatively small impact considering appeals to popularity slashed it by nearly 20pp.
  • Larger models were generally safer, although the language and phrasing had a far greater influence than the parameter count alone. 
  • It was also surprising to see that the medical models performed worse than the general purpose models, with many having weaker lie detectors despite the specialization.

Improving LLM safety is about more than making bigger models. It’s about knowing how information gets presented by actual humans, and having guardrails in place that hold up even when that information is wrong.

The Takeaway

Benchmark performance isn’t real-world performance, and this study provides another reminder that a model’s ability to separate fact from fiction is often more important than its test scores.

AI Spots Early Cognitive Decline in Clinical Notes

Early disease detection is entering the AI era, and a new study in npj Digital Medicine shows that autonomous agents can now flag cognitive decline using nothing but clinical notes.

Cognitive decline is difficult to detect. It remains significantly underdiagnosed in routine care, and traditional screening usually requires a dedicated clinician and tests that can take hours. 

  • At the same time, early detection is becoming increasingly important, especially with the recent approval of Alzheimer’s therapies that are most effective when administered early. 

Mass General Brigham might have an answer. Clinical notes contain whispers of cognitive decline that busy clinicians can’t always hear. MGB built a system that listens at scale.

  • These whispers include everything from linguistic shifts and sentence pauses to disorganized narratives and family member concerns. 
  • MGB developed an AI system that scans for these signals in routine clinical documentation, leveraging five specialized agents that critique each other and refine their reasoning.

It worked like a charm. The MGB researchers set their agents loose on over 3,300 clinical notes from 200 anonymized patients, then had human reviewers take their own look.

  • The agents detected cognitive impairment with 91% sensitivity, nearly matching expert-level accuracy – without any human intervention needed after deployment.
  • When the AI and human reviewers disagreed, an independent expert validated the AI’s reasoning 58% of the time – meaning the system was often making sound clinical judgments that initial human review had missed.
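As a quick refresher on what sensitivity measures, here is a minimal sketch. The counts are hypothetical and chosen only to land on 91%; they are not the study's confusion matrix.

```python
def sensitivity(true_positives, false_negatives):
    """Share of truly impaired patients that the screen correctly flags."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts, NOT from the paper: 100 patients with cognitive
# impairment, of whom the system flags 91 and misses 9.
flagged, missed = 91, 9
print(f"{sensitivity(flagged, missed):.0%}")  # prints "91%"
```

Sensitivity says nothing about false alarms, so it is usually read alongside specificity when judging a screening tool.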

The cherry on top? The MGB team open-sourced Pythia alongside the study, enabling any provider org to deploy autonomous prompt optimization for their own AI screening applications.

The Takeaway

LLMs have opened the door to proactive screening at scale, and MGB just provided an excellent proof of concept using AI agents that turn everyday documentation into a chance to catch cognitive decline during the optimal treatment window.

ARISE Maps the State of Clinical AI

There have probably been hundreds of reports on the medical AI landscape, but there’s only been one State of Clinical AI from the rockstar team at ARISE.

The AI opus delivers the most complete review we’ve seen of a field that’s moving faster than its evaluation practices. It looked at the most influential clinical AI studies from 2025 to answer a trio of important questions:

  • Where does AI meaningfully improve care once it leaves research settings?
  • Where does performance break down?
  • Where do risks remain underexamined?

ARISE brought the heat. The Stanford-Harvard research network produced more highlights than we could count, but here’s a roundup of some of our favorites.

Impressive results in narrow evaluations. AI models have shown “superhuman performance” in research settings, but these results often depend on how narrowly the problem is framed. 

  • In one study, researchers modified standard medical multiple-choice questions so that the correct answer became “none of the other answers.” The clinical reasoning required to solve the question didn’t change. Model performance did. Accuracy dropped sharply across leading AI models, in some cases by over a third.
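One plausible reading of that modification can be sketched as follows. The source doesn't spell out the study's exact procedure, so the `Question` class, the function name, and the placeholder question are all hypothetical.

```python
from dataclasses import dataclass

NONE_OPTION = "None of the other answers"


@dataclass
class Question:
    stem: str
    options: list[str]
    answer_index: int  # position of the correct option


def replace_answer_with_none(q: Question) -> Question:
    """Swap the correct option's text for NONE_OPTION, keeping its position.

    The original correct answer disappears from the list, so the only right
    choice left is the "none" option. The stem is unchanged.
    """
    options = list(q.options)  # copy so the original question is untouched
    options[q.answer_index] = NONE_OPTION
    return Question(q.stem, options, q.answer_index)


original = Question(
    stem="Placeholder stem for an exam-style question.",
    options=["Option A", "Option B", "Option C", "Option D"],
    answer_index=2,
)
modified = replace_answer_with_none(original)
print(modified.options)
```

A model that actually reasons to the answer should notice that its preferred option is missing, while one that matches familiar answer patterns is more likely to pick a distractor.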

AI clearly helps prediction at scale. Although diagnostic reasoning was a mixed bag, several studies demonstrated that AI excels at identifying early warning signals from large datasets.

  • A hospital-based study found that a model trained on continuous wearable vital signs predicted patient deterioration up to 24 hours before standard alerts, identifying patients at risk for ICU transfer, cardiac arrest, or death while there was still time to intervene.

Most studies still don’t resemble the reality of healthcare. Clinical work has little to do with answering exam questions, and much to do with reviewing charts, coordinating care, and deciding when not to intervene.

  • A review of 500+ studies found that nearly half of them tested models using medical exam-style questions. Only 5% used real patient data, very few measured whether the models recognized uncertainty, and even fewer examined bias or fairness.

Now what? ARISE offered a few focus areas for 2026 that hit the center of the bullseye for building trust in the latest AI models.  

  • Evaluate models using real-world scenarios to drive evidence-based medicine.
  • Prioritize human-computer interaction design as much as primary outcomes.
  • Measure uncertainty, bias, and harm – especially when it comes to patient-facing AI.

The Takeaway

Healthcare AI has arrived, and ARISE made it clear that innovation won’t be driven by newer models alone. It will depend on whether health systems, researchers, and regulators are willing to apply the same evidence standards to AI that they expect out of any other clinical solution.
