Hidden Flaws Behind High Accuracy of Clinical AI

AI is getting pretty darn good at patient diagnosis challenges… but don’t bother asking it to show its work.

A new study in npj Digital Medicine pitted GPT-4V against human physicians on 207 image challenges designed to test the reader’s ability to diagnose a patient based on a series of pictures and some basic clinical background info.

  • Researchers at the NIH and Weill Cornell Medicine then asked GPT-4V to provide step-by-step reasoning for how it chose the answer.
  • Nine physicians then tackled the same questions in both a closed-book (no outside help) and open-book format (could use outside materials and online resources).

How’d they stack up?

  • GPT-4V and the physicians both scored high marks for accurate diagnoses (81.6% vs. 77.8%), with a statistically insignificant difference in performance. 
  • GPT-4V bested the physicians on the closed-book test, selecting more correct diagnoses.
  • Physicians bounced back to beat GPT-4V on the open-book test, particularly on the most difficult questions.
  • GPT-4V also performed well in cases where physicians answered incorrectly, maintaining over 78% accuracy.

Good job AI, but there’s a catch. The rationales that GPT-4V provided were riddled with mistakes – even if the final answer was correct – with error rates as high as 27% for image comprehension.

The Takeaway

There could easily come a day when clinical AI surpasses human physicians on the diagnosis front, but that day isn’t here quite yet. Real care delivery also doesn’t bless physicians with a set of multiple choice options, and hallucinating the rationale behind diagnoses doesn’t cut it with actual patients.

Mayo Clinic Tops Hospital AI Readiness Index

The ambient temperature is rising, and CB Insights just launched its Hospital AI Readiness Index to determine which health systems are most prepared for the shift.

The index is based on an analysis of top private-sector systems in the U.S. (by hospital count), ranked by how prepared they are to adapt to a rapidly evolving AI landscape across two key pillars: 

  • Innovation – measures a system’s track record of developing or acquiring novel AI capabilities, and also considers the presence of an AI-dedicated research center
  • Execution – measures a system’s ability to bring AI into clinical practice, and also considers internal AI deployments across business and back-office functions

Without further ado, here’s CB Insights’ first list of AI-ready systems:

Mayo Clinic topped the innovation charts by leading all systems in terms of raw AI investment count (including participation in big rounds from Abridge and Cerebras Systems), while also filing 50+ AI patents in areas like cardiovascular health and oncology.

  • Intermountain ranked second due in part to the AI focus of its venture arm, which invested in Gyant prior to the engagement platform getting scooped up by Fabric.
  • Cleveland Clinic rounded out the top three with a high volume of AI partnerships, including work with PathAI to enhance translational research using pathology algorithms.

High execution scores were driven by AI business relationships and product launches, such as Mayo Clinic’s teaming up with Techcyte to help providers use AI to improve lab testing.

  • Another standout on this front was Banner Health, which is working with Regard to cut down on administrative burdens by automating tasks like notetaking and chart reviews.
  • Johns Hopkins also received high marks after partnering with Healthy.io to offer digital wound care services to patients.

The Takeaway

It’s tough not to love a good stack-ranking of health systems, and this is the best we’ve come across for AI readiness (and potential AI partners). Hats off to the 25 systems that made CB Insights’ inaugural list!

Augmedix Takes Hit As Ambient AI Heats Up

Augmedix just reported Q1 results that managed to cut its share price in half, an interesting turn of events given the company’s role as the bellwether for the white-hot ambient AI space.

There’s plenty to unpack when the only publicly traded medical scribe company takes a hit like that despite beating expectations for both EPS and revenue, which jumped 40% to $13.5M.

The simple explanation? Competition. Augmedix saw “a slow-down in purchasing commitments” as providers evaluate competing offerings, prompting it to cut its full-year revenue forecast to between $52M and $55M (down from $60M to $62M).

  • During the investor call, Augmedix said that 42 companies currently offer GenAI medical documentation solutions, leading to a ton of noise and just as many pilots.
  • Although the increased demand from health systems is promising for the overall sector, it doesn’t exactly translate to success for established players when nimble startups like Nabla, Abridge, and Suki start swarming in on the action.

Augmedix is shaping its strategy around a product portfolio that lets providers choose the right tool for their needs, expanding beyond Augmedix Live (human scribes, high cost) with Augmedix Go (GenAI scribe, low cost) and Augmedix Go Assist (GenAI + human review, medium cost).

  • The push into GenAI has apparently been a double-edged sword. Augmedix reported that strong uptake for its new AI products might result in slower revenue growth as customers transition away from its high-margin Live solution.
  • New products tailored to specific settings will be another focus, as seen with the recent debut of Augmedix Go ED following a pilot-turned-implementation at HCA Healthcare. As scribing tech becomes commoditized, expect to see more players differentiate on setting / specialty.

The Takeaway

If there’s one lesson to learn from Augmedix’s first quarter, it’s that business is booming in the ambient AI space, but that doesn’t benefit incumbent leaders when it also attracts hungry competitors looking to feast on the same momentum.

K Health Introduces First-of-its-Kind AI Knowledge Agent

Clinical AI is stepping up to the big leagues, and K Health is the team that’s taking it there.

In an exclusive interview with Digital Health Wire, K Health CEO Allon Bloch took the lid off his company’s new AI Knowledge Agent, a first-of-its-kind GenAI system purpose-built for the clinical setting.

On the surface the AI Knowledge Agent looks and feels like a familiar medical chatbot, with a simple search bar interface for the user to ask natural language questions about their health. It isn’t until you see the responses that you realize you’re looking at something entirely unique.

The AI Knowledge Agent is about as far away from a rules-based chatbot as you can get. The agent is composed of an array of large language models enhanced by K Health’s own algorithms, carrying several major differentiators from today’s standard AI applications:

  • It incorporates the patient’s medical history grounded by their EHR to provide highly tailored responses, enabling a level of personalization that’s impossible to match for standalone models (e.g. a diabetic and a heart failure patient will see different answers to the same question, using their own history, potential adverse drug interactions, etc.).
  • It will be embedded into health systems to serve as a digital front door that intelligently routes patients to the right place to resolve their needs, reaching everything from primary care and specialists to labs and tests within the same interface.
  • It’s optimized for accuracy by using curated high-quality health sources, then leverages multiple specialized agents to verify the answer matches the sources and the EHR data is appropriate. It will even tell you that it doesn’t know the answer rather than hallucinate.

In head-to-head testing against top tier foundation models, K Health’s multi-agent approach led to answers for sample medical questions that were 9% more comprehensive (included clinically crucial statements from the “gold standard” answer) and had 36% fewer hallucinations than its closest benchmark, GPT-4. 

  • Strong results, especially considering that the AI Knowledge Agent shines brightest in real-world situations where it can personalize its answers using EHR context.

For possibly the first time ever, GenAI has reached the point where it can support actual clinical journeys, delivering answers personalized to the patient’s medical history while connecting them directly to required care. The era of Googling symptoms then calling your doctor feels like it’s finally coming to an end.

The Takeaway

We’re very much in the opening act of clinical AI, and understandably cautious providers are only just beginning to test the waters. That said, it’s easy to imagine that we’ll one day look back at launches like K Health’s AI Knowledge Agent as key moments for building trust and confidence in the AI systems that reshaped care delivery.

GenAI Still Working Toward Prime Time With Patients

When it rains it pours for AI research, and a trio of studies published just last week suggest that many new generative AI tools might not be ready for prime time with patients.

The research that grabbed the most headlines came out of UCSD, finding that GenAI-drafted replies to patient messages led to more compassionate responses, but didn’t cut down on overall messaging time.

  • Although GenAI reduced the time physicians spent writing replies by 6%, that was more than offset by a 22% increase in read time, and average reply lengths also grew by 18%.
  • Some of the physicians were also put off by the “overly nice” tone of the GenAI message drafts, and recommended that future research look into “how much empathy is too much empathy” from the patient perspective.
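
Whether the writing savings are outweighed depends on how long physicians spent reading versus writing in the first place. A rough sketch of that arithmetic follows. The 6% and 22% figures are the ones reported above, while the baseline seconds and the function name are hypothetical placeholders for illustration, not values from the study.

```python
def net_time_change(write_s: float, read_s: float,
                    write_change: float = -0.06,
                    read_change: float = 0.22) -> float:
    """Net change in seconds per message (positive means slower overall)."""
    return write_s * write_change + read_s * read_change


# Hypothetical baseline: 60 s writing and 30 s reading per message.
delta = net_time_change(60.0, 30.0)

# The extra read time outweighs the writing savings whenever baseline
# read time exceeds this fraction of baseline write time.
break_even_ratio = 0.06 / 0.22

print(f"Net change per message: {delta:+.1f} s")
print(f"Break-even read/write ratio: {break_even_ratio:.2f}")
```

Under these made-up baselines, the added reading time more than cancels the writing savings, and the same holds for any baseline where reading takes more than roughly a quarter as long as writing.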

Another study in Lancet Digital Health showed that GPT-4 can effectively generate replies to health questions from cancer patients… as well as replies that might kill them.

  • Mass General Brigham researchers had six radiation oncologists review GPT-4’s responses to simulated questions from cancer patients for 100 scenarios, finding that 58% of its replies were acceptable to send to patients without any editing, 7% could lead to severe harm, and one was potentially lethal.
  • The verdict? Generative AI has the potential to reduce workloads, but it’s still essential to “keep doctors in the loop.”

A team at Mount Sinai took a different path to a similar conclusion, finding that four popular GenAI models have a long way to go until they’re better than humans at matching medical issues to the correct diagnostic codes.

  • After the researchers had GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b analyze and code 27,000 unique diagnoses, GPT-4 came out on top in terms of exact matches, achieving an uninspiring accuracy of 49.8%.

The Takeaway

While it isn’t exactly earth-shattering news that GenAI still has room to improve, the underlying theme with each of these studies is more that its impact is far from black and white. GenAI is rarely completely right or completely wrong, and although there’s no doubt we’ll get to the point where it’s working its magic without as many tradeoffs, this research confirms that we’re definitely not there yet.

Scaling Adoption of Medical AI

Medical AI is on the brink of improving outcomes for countless patients, prompting a trio of all-star researchers to pen an NEJM AI article tackling what might be its biggest obstacle: real-world adoption.

What drives real-world adoption? Those who have been around the block as many times as Dr. Michael Abramoff, Dr. Tinglong Dai, and Dr. James Zou are all too familiar with the answer… Reimbursement makes the world go ‘round.

To help medical AI developers get their tools in front of the patients who need them, the authors explore the pros and cons of current paths to reimbursement, while offering novel frameworks that could lead to better financial sustainability.

Traditional Fee-for-Service treats medical AI similarly to how new drugs or medical devices are reimbursed, and is a viable path for AI that can clear the hurdle of demonstrating improvements to clinical outcomes, health equity, clinician productivity, and cost-effectiveness (e.g. AI for diabetic eye exams).

  • Meeting these criteria is a prerequisite for adopting AI in healthcare, yet even among the 692 FDA-authorized AI systems, few have been able to pass the test. The approach carries substantial risk in terms of time and resources for AI developers.
  • Despite those limitations, FFS might be appropriate for AI because health systems are adept at assessing the financial impact of new technologies under it, and reimbursement through a CPT code provides hard-to-match financial sustainability.

Value-based care frameworks provide reimbursement on the basis of patient- or population-related metrics (MIPS, HEDIS, full capitation), and obtaining authorization for medical AI to “count” toward closing care gaps for MIPS and HEDIS has been shown to be considerably more straightforward than attaining a CPT code.

  • That said, if a given measure is not met (e.g. 80% of the population must receive an annual diabetic eye examination), the financial benefit of closing even three quarters of that care gap is typically zero, potentially disincentivizing AI adoption.
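
The all-or-nothing nature of a threshold measure can be made concrete with a small sketch. The 80% threshold comes from the example above; the starting rate, the bonus amount, and the function name are hypothetical, and actual MIPS and HEDIS scoring is more granular than this simplified model.

```python
def threshold_bonus(rate_pct: float, threshold_pct: float = 80.0,
                    bonus: float = 100_000.0) -> float:
    """Pay the full (hypothetical) bonus only if the measure is met."""
    return bonus if rate_pct >= threshold_pct else 0.0


baseline_pct = 40.0               # hypothetical share already receiving exams
gap_pct = 80.0 - baseline_pct     # care gap left to close

# Closing three quarters of the gap still falls short of the threshold.
partial_pct = baseline_pct + 0.75 * gap_pct
full_pct = baseline_pct + gap_pct

print(partial_pct, threshold_bonus(partial_pct))
print(full_pct, threshold_bonus(full_pct))
```

In this simplified model, an AI tool that lifts the exam rate most of the way to the threshold earns the provider the same payment as doing nothing, which is the disincentive the authors describe.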

Given the limitations of existing pathways, the authors offer a potential new approach that’s derived from the Medicare Part B model, which reimburses drugs administered in an outpatient setting based on a “cost plus” markup.

  • Here, providers could acquire the rights to use AI, then get reimbursed based on the average cost of the service plus a specified margin, contingent upon CMS coverage of a particular CPT code.
  • This model essentially splits revenue between AI creators and users, and would alleviate some of the tensions of both FFS and VBC models.
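
A minimal sketch of the cost-plus arithmetic described above. The function name, the $40 average cost, and the 6% margin are hypothetical placeholders, since no margin for AI services is specified here. The split shown is one reading of how revenue might divide between creator and user, not a detail confirmed by the authors.

```python
def cost_plus_reimbursement(average_cost: float, margin: float) -> float:
    """Reimbursement = average cost of the service plus a specified margin."""
    return average_cost * (1.0 + margin)


average_cost = 40.0   # hypothetical average cost per AI-assisted service
margin = 0.06         # hypothetical margin, for illustration only

payment = cost_plus_reimbursement(average_cost, margin)

# Assumed split: the AI creator receives the acquisition cost,
# and the provider keeps the margin.
creator_share = average_cost
provider_share = payment - creator_share

print(f"Payment: ${payment:.2f}")
print(f"Creator: ${creator_share:.2f}, provider: ${provider_share:.2f}")
```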

The Takeaway

Without sustainable reimbursement, widespread medical AI adoption won’t be possible. Although the quest continues for a silver bullet (even the authors’ revenue-sharing model still carries the risk of overutilization and requires the creation of new CPT codes), exploring novel approaches is essential given the challenges of achieving reimbursement through existing FFS and VBC pathways.

Hippocratic Raises $53M, Showcases AI Staff Marketplace

Less than a year after emerging from stealth with little more than a vague mission to transform healthcare through the power of “safety-focused” generative AI, Hippocratic AI took the stage at NVIDIA’s GTC conference to announce the close of its $53M Series A round.

When Hippocratic debuted last May, it hadn’t yet decided on its first use case, despite closing more seed funding than any company we’d ever covered ($50M). 

  • In July, it raised an additional $15M and partnered with 10 healthcare providers for model evaluation, including Cincinnati Children’s, HonorHealth, and SonderMind.
  • Now, Hippocratic has found its use case, as well as a $500M valuation.

The first product Hippocratic is rolling out for phase three safety testing: a staffing marketplace where healthcare orgs can “hire” generative AI agents that complete low-risk, non-diagnostic, patient-facing tasks.

  • Initial roles for the AI agents include chronic care management, post-discharge follow-up for specific conditions (congestive heart failure, kidney disease), as well as SDOH surveys, health risk assessments, and pre-operative outreach.
  • The agents won’t be allowed to speak with patients unsupervised until phase three testing is completed, which involves its 40+ partners and 5,500 licensed clinicians interfacing with the agents as if they were patients.

Shareholder hero and NVIDIA CEO Jensen Huang demonstrated a care manager agent named “Diana” on stage at GTC, and the video of the demo will tell you everything you need to know about the look and feel of Hippocratic’s first product.

  • NVIDIA also announced that it’ll be working alongside Hippocratic to develop “super-low-latency conversational interactions,” which will reportedly cost about $9/hr to run on the company’s hardware.

Critics are hard to avoid when you hit a half-billion valuation before launching a product and start throwing around terms like “Health General Intelligence (HGI).” Most critics seem concerned about medical device classifications, venture capital mania, and some past lawsuits, but time will tell which critiques (if any) can stop Hippocratic’s momentum.

The Takeaway

Despite all the talk about whether a general purpose healthcare AI is possible or safe, Hippocratic has an elite roster of VCs and provider partners that are willing to help it find out. Hippocratic definitely has the talking points nailed, but now we’ll have to wait and see whether it has the operational chops to back them up.

Ambience Healthcare Locks In $70M for AI Operating System

Ambience Healthcare just closed $70M in Series B funding to cut away at burnout-inducing manual workflows using the latest advances in generative AI.

Ambience’s carving knife isn’t an AI scribe, a coding solution, or a referral tool, but an “AI operating system” that promises to be all those things at once.

That operating system consists of a holistic suite of genAI applications catering to an impressively broad set of use cases. Each app is customized for dozens of specific specialties, care models, and reimbursement frameworks:

  • AutoScribe: AI medical scribe that works across all specialties
  • AutoRefer: AI referral letter support for both PCPs and specialists
  • AutoAVS: After-visit summary tool that generates custom educational content
  • AutoCDI: Point-of-care clinical documentation integrity assistant that analyzes notes and EHR context to ensure ICD-10 codes, CPT codes, and documentation are aligned

Ambience has kept tight-lipped about both its customer count and LLM provider, but we do know that it has:

  • $100M in total funding since launching in 2020
  • Marquee customers like UCSF, Memorial Hermann, and John Muir Health
  • Investments from Silicon Valley heavyweights like Kleiner Perkins, a16z, and OpenAI (probably a decent hint toward the unrevealed LLM partner)

The newly-raised capital will accelerate Ambience’s product roadmap and allow it to build dedicated support teams for its health system partners.

  • The first product up on that roadmap is AutoPrep, an intelligent pre-charting solution that equips clinicians with suggestions for the visit agenda and potential conditions to screen for.

Ambience’s operating system strategy not only gives it a huge total addressable market, but also positions it apart from well-established competition like Nuance and Augmedix, as well as a hungry pack of genAI up-and-comers such as Nabla and Abridge.

  • A continuously learning OS with “a single shared brain” sounds like a versatile way to break down silos, but the flip side of that coin is that providers looking for an answer to a specific problem might be tempted to go with a more specialized solution.

The Takeaway

Driving adoption of any software is hard. Crafting a beautiful user experience is hard. Tailoring a continuously learning AI operating system to every medical specialty sounds extremely hard. At the end of the day, Ambience’s approach is about as ambitious as it gets, but it carries massive advantages if it can execute.

How Health Systems Are Approaching AI

The New England Journal of Medicine’s just-released NEJM AI publication is off to the races, with its February issue including a stellar breakdown of how academic medical centers are managing the influx of predictive models and AI tools.

Researchers identified three governance phenotypes for managing the AI deluge:

  • Well-Defined Governance – health systems have explicit, comprehensive procedures for the evaluation of AI and predictive models.
  • Emerging Governance – systems are in the process of adapting previously established approaches for things like EHRs to govern AI.
  • Interpersonal Governance – a small team or single person is tasked with making decisions about model implementation without consistent evaluation requirements. 

Regardless of the phenotype, interviews with AI leadership at 13 academic medical centers revealed that chaotic implementations are hard to avoid, partly due to external factors like vague regulatory standards.

  • Most AI decision makers were aware of how the FDA regulates software, but believed those rules were “broad and loose,” and many thought they only applied to EHRs and third party vendors rather than health systems.

AI governance teams report that providers adhere better to new solutions when implementations prioritize limiting clicks. Effective governance of prediction models requires a broader approach, yet streamlining workflows is still a primary consideration for most implementations. That could spell trouble down the road given predictive models’ impact on patient care, health equity, and care quality.

The Takeaway

Even well-equipped academic medical centers are struggling to effectively identify and mitigate the countless potential pitfalls that come along with predictive AI implementation. Existing AI governance structures within healthcare orgs all seem to be in need of additional guidance, and more guardrails from both the industry and regulators might help turn AI ambitions into AI-improved outcomes.

AI Therapists in VR Help With Provider Shortage

New research in npj Digital Medicine suggests that virtual reality might be part of the answer to the nation’s mental health provider shortage, as long as patients don’t mind if their therapist is an AI avatar.

The small study had 14 participants with moderate anxiety or depression undergo immersive therapy sessions led by a trained digital avatar developed by Cedars-Sinai.

Known as XAIA, or the eXtended-Reality Artificially Intelligent Ally, the program provides a way for patients to self-administer conversational therapy in relaxing virtual reality environments, such as a creek-side meadow or a sunny beach retreat.

  • It’s unclear what the therapist avatar actually looks like, but we’re going to choose to believe it looks like this picture of XAIA in the diagram of the conversation logic.

Throughout the 30-minute therapy sessions, with topics ranging from loneliness and family problems to financial distress, XAIA successfully applied key psychotherapeutic techniques:

  • Observations that reflect an understanding of the user’s issues; Ex. “Your worries about your health and your living situation are clearly causing you a lot of distress” 
  • Normalizing feelings; Ex. “It’s understandable to feel like a failure when life gets rough”
  • Showing empathy; Ex. “It must be a challenging time for you”
  • Validation and praise where appropriate; Ex. “Your practice of not taking these relationships for granted is truly commendable”

Participants frequently responded to XAIA as if it were a human therapist, sharing raw emotions like, “I feel like I’m a failure. The only thing I have to look forward to – I know you’re going to laugh – is to go see Taylor Swift … I’m sorry I’m crying.”

  • XAIA’s response: “Firstly, there’s no need to apologize for crying. These feelings are valid.”

Most participants described XAIA as approachable, empathetic, and intelligent, but it’s worth noting that a few mentioned they would still prefer a human therapist if given the choice.

The Takeaway

Although this wasn’t exactly the largest study we’ve ever covered, the results provide early evidence that a combination of VR and AI therapy could be part of the solution to balancing behavioral health’s supply and demand equation. Over half of people facing mental health disorders aren’t getting the treatment they need, and if XAIA isn’t already a clearly better alternative than no treatment at all, new advances will only make the AI+VR path more promising going forward.
