Patients Ready For GenAI, But Not For Everything

Bain & Company’s US Frontline of Consumer Healthcare Survey turned up the surprising result that patients are more comfortable with generative AI “analyzing their radiology scan and making a diagnosis than answering the phone at their doctor’s office.”

That’s quite the headline, but the authors were quick to point out that it’s probably less of a measure of confidence in GenAI’s medical expertise than a sign that patients aren’t yet comfortable interacting with the technology directly.

Here’s the breakdown of patient comfort with different GenAI use cases:

While it does appear that patients are more prepared to have GenAI supporting their doctor than engaging with it themselves, it’s just as notable that less than half reported feeling comfortable with even a single GenAI application in healthcare.

  • No “comfortable” response was above 37%, and after adding in the “neutral” votes, there was still only one application that broke 50%: note taking during appointments.
  • The fact that only 19% felt comfortable with GenAI answering calls for providers or payors could also just be a sign that patients would far rather talk to a human in either situation, regardless of the tech’s capabilities.

The next chart looks at GenAI perceptions among healthcare workers: 

Physicians and administrators are feeling a similar mix of excitement and apprehension, sharing a generally positive view of GenAI’s potential to alleviate admin burdens and clinician workloads, as well as a concern that it could undermine the patient-provider relationship.

  • Worries over new technology threatening the patient-provider relationship aren’t new, and we just witnessed them play out at an accelerated pace with telehealth.
  • Despite initial fears, the value of the relationship prevailed, which Bain backed up with the fact that 61% of patients who use telehealth only do so with their own provider.

Whether you’re measuring by patient or provider comfort, GenAI’s progress will be closely tied to trust in the technology on an application-by-application basis. Trust takes time to build and first impressions are key, so this survey underscores the importance of nailing the user experience early on.

The Takeaway
The story of generative AI in healthcare is just getting started, and as we saw with telehealth, the first few pages could take some serious willpower to get through. New technologies mean new workflows, revenue models, and countless other barriers to overcome, but trust will only keep building every step of the way. Plus, the next chapter looks pretty dang good.

Hidden Flaws Behind High Accuracy of Clinical AI

AI is getting pretty darn good at patient diagnosis challenges… but don’t bother asking it to show its work.

A new study in npj Digital Medicine pitted GPT-4V against human physicians on 207 image challenges designed to test the reader’s ability to diagnose a patient based on a series of pictures and some basic clinical background info.

  • Researchers at the NIH and Weill Cornell Medicine then asked GPT-4V to provide step-by-step reasoning for how it chose the answer.
  • Nine physicians then tackled the same questions in both a closed-book (no outside help) and open-book format (could use outside materials and online resources).

How’d they stack up?

  • GPT-4V and the physicians both scored high marks for accurate diagnoses (81.6% vs. 77.8%), with a statistically insignificant difference in performance. 
  • GPT-4V bested the physicians on the closed-book test, selecting more correct diagnoses.
  • Physicians bounced back to beat GPT-4V on the open-book test, particularly on the most difficult questions.
  • GPT-4V also performed well in cases where physicians answered incorrectly, maintaining over 78% accuracy.

Good job AI, but there’s a catch. The rationales that GPT-4V provided were riddled with mistakes – even if the final answer was correct – with error rates as high as 27% for image comprehension.

The Takeaway

There could easily come a day when clinical AI surpasses human physicians on the diagnosis front, but that day isn’t here quite yet. Real care delivery also doesn’t bless physicians with a set of multiple choice options, and hallucinating the rationale behind diagnoses doesn’t cut it with actual patients.

GenAI Still Working Toward Prime Time With Patients

When it rains it pours for AI research, and a trio of studies published just last week suggests that many new generative AI tools might not be ready for prime time with patients.

The research that grabbed the most headlines came out of UCSD, finding that GenAI-drafted replies to patient messages led to more compassionate responses, but didn’t cut down on overall messaging time.

  • Although GenAI reduced the time physicians spent writing replies by 6%, that was more than offset by a 22% increase in read time, while also increasing average reply lengths by 18% (the rough math is sketched after this list).
  • Some of the physicians were also put off by the “overly nice” tone of the GenAI message drafts, and recommended that future research look into “how much empathy is too much empathy” from the patient perspective.
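
To see how a 6% savings on writing can be “more than offset” by a 22% increase in reading, here’s a quick back-of-the-envelope sketch in Python. The baseline times are invented for illustration; the study only reports the percentage changes.

```python
# Hypothetical baselines -- the study reports only percentage changes,
# not absolute times, so these numbers are illustrative assumptions.
baseline_read_sec = 30.0   # assumed time spent reading a message + draft
baseline_write_sec = 90.0  # assumed time spent writing a reply from scratch

with_ai_read_sec = baseline_read_sec * 1.22    # 22% more time reading
with_ai_write_sec = baseline_write_sec * 0.94  # 6% less time writing

delta = (with_ai_read_sec + with_ai_write_sec) - (baseline_read_sec + baseline_write_sec)
print(f"Net change per message: {delta:+.1f} seconds")
# -> Net change per message: +1.2 seconds
# The 6% write saving (-5.4s) is slightly outweighed by the 22% read
# increase (+6.6s), which is how "faster drafting" can fail to save time.
```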

Another study in Lancet Digital Health showed that GPT-4 can effectively generate replies to health questions from cancer patients… as well as replies that might kill them.

  • Mass General Brigham researchers had six radiation oncologists review GPT-4’s responses to simulated questions from cancer patients for 100 scenarios, finding that 58% of its replies were acceptable to send to patients without any editing, 7% could lead to severe harm, and one was potentially lethal.
  • The verdict? Generative AI has the potential to reduce workloads, but it’s still essential to “keep doctors in the loop.”

A team at Mount Sinai took a different path to a similar conclusion, finding that four popular GenAI models have a long way to go until they’re better than humans at matching medical issues to the correct diagnostic codes.

  • After having GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b analyze and code 27,000 unique diagnoses, GPT-4 came out on top in terms of exact matches, achieving an uninspiring accuracy of 49.8%.

The Takeaway

While it isn’t exactly earth-shattering news that GenAI still has room to improve, the underlying theme with each of these studies is more that its impact is far from black and white. GenAI is rarely completely right or completely wrong, and although there’s no doubt we’ll get to the point where it’s working its magic without as many tradeoffs, this research confirms that we’re definitely not there yet.

Scaling Adoption of Medical AI

Medical AI is on the brink of improving outcomes for countless patients, prompting a trio of all-star researchers to pen an NEJM AI article tackling what might be its biggest obstacle: real-world adoption.

What drives real-world adoption? Those who have been around the block as many times as Dr. Michael Abramoff, Dr. Tinglong Dai, and Dr. James Zou are all too familiar with the answer… Reimbursement makes the world go ‘round.

To help medical AI developers get their tools in front of the patients who need them, the authors explore the pros and cons of current paths to reimbursement, while offering novel frameworks that could lead to better financial sustainability.

Traditional fee-for-service (FFS) reimburses medical AI much like new drugs or medical devices, and it’s a viable path for AI that can clear the hurdle of demonstrating improvements to clinical outcomes, health equity, clinician productivity, and cost-effectiveness (e.g. AI for diabetic eye exams).

  • Meeting these criteria is a prerequisite for adopting AI in healthcare, yet even among the 692 FDA-authorized AI systems, few have been able to pass the test. The approach carries substantial risk in terms of time and resources for AI developers.
  • Despite those limitations, FFS might be appropriate for AI because health systems are adept at assessing the financial impact of new technologies under it, and reimbursement through a CPT code provides hard-to-match financial sustainability.

Value-based care frameworks provide reimbursement on the basis of patient- or population-related metrics (MIPS, HEDIS, full capitation), and obtaining authorization for medical AI to “count” toward closing care gaps for MIPS and HEDIS has been shown to be considerably more straightforward than attaining a CPT code.

  • That said, if a given measure is not met (e.g. 80% of the population must receive an annual diabetic eye examination), the financial benefit of closing even three quarters of that care gap is typically zero, potentially disincentivizing AI adoption.
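
As a toy illustration of that cliff, the payoff is a step function: anything short of the threshold pays nothing. The 80% threshold comes from the article’s example; the bonus amount below is made up.

```python
def vbc_payment(coverage_rate: float, bonus: float = 100_000.0,
                threshold: float = 0.80) -> float:
    """Quality bonus for a screening measure under a hypothetical
    value-based contract: all-or-nothing at the threshold."""
    return bonus if coverage_rate >= threshold else 0.0

print(vbc_payment(0.60))  # 0.0 -- before deploying AI
print(vbc_payment(0.75))  # 0.0 -- AI closed most of the gap, still zero
print(vbc_payment(0.80))  # 100000.0 -- only crossing the line gets paid
```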

Given the limitations of existing pathways, the authors offer a potential new approach that’s derived from the Medicare Part B model, which reimburses drugs administered in an outpatient setting based on a “cost plus” markup.

  • Here, providers could acquire the rights to use AI, then get reimbursed based on the average cost of the service plus a specified margin, contingent upon CMS coverage of a particular CPT code.
  • This model essentially splits revenue between AI creators and users, and would alleviate some of the tensions of both FFS and VBC models.
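
Here’s a minimal sketch of how the cost-plus math could work, loosely patterned on Part B’s “average sales price plus 6%” convention. The per-use cost and the margin are illustrative assumptions, not figures from the article.

```python
avg_cost_per_use = 40.00  # assumed average cost the provider pays the AI creator
margin = 0.06             # assumed markup, mirroring Part B's ASP + 6% convention

reimbursement = avg_cost_per_use * (1 + margin)

print(f"CMS reimburses the provider:    ${reimbursement:.2f}")                    # $42.40
print(f"AI creator's share (the cost):  ${avg_cost_per_use:.2f}")                 # $40.00
print(f"Provider's share (the 'plus'):  ${reimbursement - avg_cost_per_use:.2f}") # $2.40
```

The “plus” gives providers a financial reason to adopt, while the cost component flows back to the AI creator, which is the revenue split the bullet above describes.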

The Takeaway

Without sustainable reimbursement, widespread medical AI adoption won’t be possible. Although the quest continues for a silver bullet (even the authors’ revenue-sharing model still carries the risk of overutilization and requires the creation of new CPT codes), exploring novel approaches is essential given the challenges of achieving reimbursement through existing FFS and VBC pathways.

How Health Systems Are Approaching AI

The New England Journal of Medicine’s just-released NEJM AI publication is off to the races, with its February issue including a stellar breakdown of how academic medical centers are managing the influx of predictive models and AI tools.

Researchers identified three governance phenotypes for managing the AI deluge:

  • Well-Defined Governance – health systems have explicit, comprehensive procedures for the evaluation of AI and predictive models.
  • Emerging Governance – systems are in the process of adapting previously established approaches for things like EHRs to govern AI.
  • Interpersonal Governance – a small team or single person is tasked with making decisions about model implementation without consistent evaluation requirements. 

Regardless of the phenotype, interviews with AI leadership at 13 academic medical centers revealed that chaotic implementations are hard to avoid, partly due to external factors like vague regulatory standards.

  • Most AI decision makers were aware of how the FDA regulates software, but believed those rules were “broad and loose,” and many thought they only applied to EHRs and third-party vendors rather than health systems.

AI governance teams report better clinician adherence to new solutions that minimize clicks, so streamlining workflows remains a primary consideration for most implementations. Effective governance of predictive models requires a broader approach, however, and that narrow focus on workflow could lead to trouble down the road given these models’ impact on patient care, health equity, and care quality.

The Takeaway

Even well-equipped academic medical centers are struggling to effectively identify and mitigate the countless potential pitfalls that come along with predictive AI implementation. Existing AI governance structures within healthcare orgs all seem to be in need of additional guidance, and more guardrails from both the industry and regulators might help turn AI ambitions into AI-improved outcomes.

AI Therapists in VR Help With Provider Shortage

New research in npj Digital Medicine suggests that virtual reality might be part of the answer to the nation’s mental health provider shortage, as long as patients don’t mind if their therapist is an AI avatar.

The small study had 14 participants with moderate anxiety or depression undergo immersive therapy sessions led by a trained digital avatar developed by Cedars-Sinai.

Known as XAIA, or the eXtended-Reality Artificially Intelligent Ally, the program provides a way for patients to self-administer conversational therapy in relaxing virtual reality environments, such as a creek-side meadow or a sunny beach retreat.

  • It’s unclear what the therapist avatar actually looks like, but we’re going to choose to believe it looks like the picture of XAIA in the study’s conversation-logic diagram.

Throughout the 30-minute therapy sessions, with topics ranging from loneliness and family problems to financial distress, XAIA successfully applied key psychotherapeutic techniques:

  • Observations that reflect an understanding of the user’s issues; Ex. “Your worries about your health and your living situation are clearly causing you a lot of distress” 
  • Normalizing feelings; Ex. “It’s understandable to feel like a failure when life gets rough”
  • Showing empathy; Ex. “It must be a challenging time for you”
  • Validation and praise where appropriate; Ex. “Your practice of not taking these relationships for granted is truly commendable”

Participants frequently responded to XAIA as if it were a human therapist, sharing raw emotions like, “I feel like I’m a failure. The only thing I have to look forward to – I know you’re going to laugh – is to go see Taylor Swift … I’m sorry I’m crying.”

  • XAIA’s response: “Firstly, there’s no need to apologize for crying. These feelings are valid.”

Most participants described XAIA as approachable, empathetic, and intelligent, but it’s worth noting that a few mentioned they would still prefer a human therapist if given the choice.

The Takeaway

Although this wasn’t exactly the largest study we’ve ever covered, the results provide early evidence that a combination of VR and AI therapy could be part of the solution to balancing behavioral health’s supply and demand equation. Over half of people facing mental health disorders aren’t getting the treatment they need, and even if XAIA isn’t already a clearly better alternative to no treatment at all, new advances will only make the AI+VR path more promising going forward.

GPT-4 Capable of Diagnosing Complex Cases

The New England Journal of Medicine is adding to its library of top tier publications with the launch of a new journal focused on artificial intelligence – NEJM AI – and it’s gearing up for the January debut with a sneak peek at a few early-release articles.

Use of GPT-4 to Diagnose Complex Clinical Cases was a standout study from the preview, finding that GPT-4 correctly diagnosed over half of complex clinical cases.

Researchers asked GPT-4 to provide a diagnosis for 38 clinical case challenges that each included a medical history along with six multiple choice options. The most common diagnoses included 15 cases related to infectious disease (39.5%), five cases in endocrinology (13.1%), and four cases in rheumatology (10.5%).

  • GPT-4 was given the plain unedited text from each case, and solved each one five times to evaluate reproducibility.
  • Those answers were compared to over 248k answers from online medical-journal readers, which were used to simulate 10k complete sets of human answers.
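
The article doesn’t spell out the resampling procedure here, but a plausible sketch is to draw one reader’s answer per case and repeat that 10,000 times to build a distribution of complete 38-case scores. Everything below, including the fabricated answer pools, is an assumption for illustration.

```python
import random

N_CASES = 38
N_SIMULATIONS = 10_000

# Stand-in for the real data: per-case pools of reader answers recorded as
# correct/incorrect, fabricated here with a ~36% correct rate.
reader_answers = {
    case: [random.random() < 0.36 for _ in range(6_500)]
    for case in range(N_CASES)
}

# Each simulated "reader" answers every case by sampling one real response.
sim_scores = [
    sum(random.choice(reader_answers[case]) for case in range(N_CASES))
    for _ in range(N_SIMULATIONS)
]

gpt4_avg = 21.8  # GPT-4's average correct diagnoses across its five runs
share_beaten = sum(score < gpt4_avg for score in sim_scores) / N_SIMULATIONS
print(f"Simulated readers GPT-4 outscores: {share_beaten:.2%}")
```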

GPT-4 correctly diagnosed an average of 21.8 cases (57%), while the medical-journal readers correctly diagnosed an average of 13.7 cases (36%). Not too shabby considering the LLM could only leverage the case text and not the included graphics.

  • Based on the simulation, GPT-4 also performed better than 99.98% of all medical-journal readers, with high reproducibility across all five tests (lowest score was 55.3%).

A couple caveats to consider are that medical-journal readers aren’t licensed physicians, and that real-world medicine doesn’t provide convenient multiple choice options. That said, a separate study found that GPT-4 performed well even without answer options (44% accuracy), and these models will only grow more precise as multimodal data gets incorporated.

The Takeaway

The race to bring AI to healthcare is on, and it’s generating a stampede of new research investigating the boundaries of the tech’s potential. As the hype of the first lap starts to give way to more measured progress, NEJM AI will most likely be one of the best places to keep up with the latest advances.

AI Executive Order, the Full Breakdown

The White House’s long-awaited executive order on “Safe, Secure, and Trustworthy” artificial intelligence is finally here, and it left little room to miss its underlying message: the laissez-faire era of AI regulation is over.

Among the 100+ pages of actions guiding the direction of responsible AI development, President Biden laid out several initiatives poised to make an immediate impact within healthcare, including…

  • Calling on HHS to create an AI task force within six months to assess new models before they go to market and oversee their performance once they do
  • Requiring that task force to build a regulatory structure that can “maintain appropriate levels of quality” in AI used for care delivery, research, and drug development
  • That structure will require healthcare AI developers to share their safety testing outcomes with the government
  • Balancing the added regulation by ramping up grantmaking for AI development in areas such as personalized immune-response treatments, burnout, and improving data quality
  • Standing up AI.gov to serve as the go-to resource for federal AI standards and hiring, a decent signal that there’ll be actual follow-through to cultivate public sector AI talent

The FDA has already approved upwards of 520 AI algorithms, and has done well with predictive models that take in data and propose probable outcomes. 

  • However, generative AI products that respond to human queries require “a vastly different paradigm” to regulate, and FDA Digital Health Director Troy Tazbaz believes any new structure will involve ongoing audits to ensure continuous safety.

There’s already been tons of great post-game analysis on these developments, with the general consensus looking like cautious optimism.

  • While some appreciate the order’s whole-of-government approach to AI, others worry that “excessive preemptive regulation” could slow AI’s progress and delay its benefits.
  • Others are skeptical that the directives will be carried out at all, given the difficulty of hiring enough AI experts in government and passing the needed legislation.

The Takeaway

President Biden’s executive order aims to thread the needle between providing protection and encouraging innovation, but time will tell whether it’ll deliver on some much-needed guardrails. Although AI is a lightning-quick industry that doesn’t exactly lend itself to the type of centralized long-term planning envisioned in the executive order, more structure should be an improvement over regulatory uncertainty.

Abridge Lands $30M As AI Race Heats Up

Momentum makes magic, and few startups have more of it than AI medical scribe Abridge after landing $30M in Series B funding from Spark Capital and high-profile strategics like CVS Health, Kaiser Permanente, and Mayo Clinic.

Abridge’s generative AI platform converts patient-provider conversations into structured note drafts in real-time, slashing hours from administrative burdens by generating summaries that rarely require further input (clinicians edit less than 9%).

The Series B is one of this year’s largest raises for pure play healthcare AI, an area that now accounts for about a quarter of all capital flowing into health IT.

One of the reasons why investors are taking such a keen interest in Abridge is its partnership hot streak, which includes Epic bringing them on as the first startup in its new Partners and Pals program – a move that will make Abridge available directly within Epic’s EHR.

  • It also probably doesn’t hurt that Abridge isn’t shy about sharing its performance data and machine learning research, giving it one of the deepest publication libraries of any company we’ve ever covered.
  • On top of that, Abridge has been racking up a lengthy list of deployments at health systems such as UPMC, Emory Healthcare, and University of Kansas Health System.

The competition is fierce in the AI scribe arena, which is packed with hungry startups like Suki and Nabla, as well as a thousand-pound gorilla named Nuance Communications. 

  • Half a million doctors use Nuance’s DAX dictation software, with “thousands” more already up-and-running on its new fully-automated DAX Copilot.

Some key differentiators give Abridge and its user base of 5,000 clinicians a solid shot at closing the distance, including “linkages” that map everything in the note to its source in both the transcript and audio (Nuance provides the transcript but not the recording). The sketch after the next bullet shows the idea.

  • Abridge also developed its own ASR stack (automatic speech recognition), enabling it to do things like account for new medication names and excel at multilingual documentation, meaning it can generate an English note from a Spanish conversation.
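
Abridge hasn’t published a schema for those linkages, but conceptually each one is just a pointer from a phrase in the note draft back to its supporting evidence. A hypothetical Python sketch, where every field name is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Linkage:
    """Hypothetical record tying one note phrase to its evidence."""
    note_text: str                    # phrase in the generated note draft
    transcript_span: tuple[int, int]  # character offsets into the transcript
    audio_start_sec: float            # where the supporting speech begins
    audio_end_sec: float              # ...and ends in the visit recording

link = Linkage(
    note_text="Patient reports intermittent chest pain for two weeks.",
    transcript_span=(1042, 1137),
    audio_start_sec=312.4,
    audio_end_sec=324.9,
)
# A reviewing clinician could click the sentence, jump to the matching
# transcript span, and replay that slice of audio to verify the claim.
```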

The Takeaway

Abridge is emerging as a standout in the clinical documentation race, with DNA that’s as healthcare-native as it is AI-native. The executive team is lined with practicing physicians and machine learning experts, giving Abridge an advantageous understanding of not only the technology, but also the hurdles it will take for that technology to take hold in healthcare.

Study: AI is in the Eye of the Beholder

At a time when new healthcare AI solutions are getting unveiled every week, a study in Nature Machine Intelligence found that the way people are introduced to these models can have a major effect on their perceived effectiveness.

Researchers from MIT and ASU had 310 participants interact with a conversational AI mental health companion for 30 minutes before reviewing their experience and determining whether they would recommend it to a friend.

Participants were divided into three groups, which were each given a different priming statement about the AI’s motives:

  • No motives: A neutral view of the AI as a tool
  • Caring motives: A positive view where the AI cares about the user’s well-being
  • Manipulative motives: A negative view where the AI has malicious intentions

The results revealed that priming statements certainly influence user perceptions, and in two of the three groups, a majority of participants reported experiences in line with their expectations.

  • 88% of the “caring” group and 79% of the “no motive” group believed the AI was empathetic or neutral – despite the fact that they were engaging with identical agents.
  • Only 44% of the “manipulative” group agreed with the primer. As the authors put it, “If you tell someone to be suspicious of something, then they might just be more suspicious in general.”
  • As might be expected, participants who believed the model was caring also gave it higher effectiveness scores and were more likely to recommend it to a friend. That’s obviously relevant for those developing similar mental health chatbots, but a key insight for presenting any AI agent to new users.

An interesting feedback loop was also found between the priming and the conversation’s tone. People who believed the AI was caring tended to interact with it in a more positive way, making the agent’s responses drift positively over time. The opposite was true for those who believed it was manipulative. 

The Takeaway

The placebo effect is a well-documented cornerstone of medical literature, but this might be the first study to bridge the phenomenon from sugar pill to AI chatbot. Although AI is often thought of as primarily an engineering problem, this research does a great job highlighting how human factors and the power of belief play a huge role in the perceived effectiveness of the technology.
