In case last week’s AI drama wasn’t hot enough, a pair of new studies in Nature cranked up the heat by finding that AI agents beat physicians on ER and care management tasks – just not real ones.
“Towards autonomous medical artificial intelligence agents.” The first study took a look at MIRA, an AI agent developed in Germany that operates inside a sandboxed EHR environment.
- Using 574 real emergency department cases, researchers had MIRA chat with a separate patient agent and execute entire care workflows, such as investigating diagnoses, ordering labs, and triaging for hospital admission.
The headline: MIRA significantly outperformed four board-certified physicians. The agent had higher overall diagnostic accuracy (87.8% vs. 78.1%), was better at ordering correct procedures like laparoscopic appendectomy (53.5% vs. 38.3%), and had 35% better guideline alignment.
The reality: ER doc Graham Walker, MD, put it perfectly on LinkedIn: “There is no way in hell that humans mismanaged almost 30% of appendicitis cases, the most common ‘surgical emergency’ that we’ve all seen hundreds of in our career.”
- It turns out the EHR sandbox needed 21 keystrokes to get the order right, and the physicians failed unless they explicitly searched for and entered “laparoscopic appendectomy.” AI is built for that; humans, not so much.
“Towards conversational AI for disease management.” The second study explored whether Google’s AMIE agent could expand from pure diagnostics to longitudinal care management.
- The blinded study pitted AMIE against 21 primary care physicians on 100 multi-visit cases, with the agent pulling live guidelines and drug references to produce structured management plans.
The headline: AMIE’s care plans were better than the PCPs’ across the board. The agent notched higher marks on management reasoning, precision of investigations, and guideline alignment.
The reality: AMIE operated in a world without prior auths, without formulary restrictions, and without social needs that patients didn’t want to bring up. The authors didn’t pretend otherwise.
The Takeaway
This might sound familiar, but these studies show that MIRA and AMIE performed well in ideal scenarios, not in the messy trenches of real-world medicine. That said, the results aren’t important because AI beat a benchmark; they’re important because AI took another big step toward “delivering actions” instead of just “delivering answers.”

