Foundation models trained on EHR data hold massive potential for clinical applications, but a new study out of MIT shows that they might pose just as much risk to patient privacy.
Generalized knowledge makes better predictions. EHR foundation models normally draw on patterns across a collection of de-identified patient records to produce their outputs.
- That’s not a problem on its own, but unintended “memorization” also allows these models to serve answers based on a single record from their training data.
Therein lies the problem. To quantify the risk of these models revealing sensitive information, MIT researchers developed structured tests to determine how easily an attacker with partial knowledge of a patient – think lab results or demographic details – could extract further identifiable info through targeted prompts.
The tests measured memorization as a function of:
- how much information an attacker needs before the model reveals more
- the risk associated with the revealed information
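To make the idea concrete, here's a toy sketch (not the MIT framework itself, and the records, field names, and scoring are all invented for illustration) of how extraction risk can grow with attacker knowledge: a "memorizing" model is simulated as a lookup over training records, and an attacker who knows k fields of a target checks whether that partial knowledge pins down a single record, exposing everything else in it.

```python
# Toy illustration of memorization-based extraction risk.
# NOT the MIT test suite: the model here is simulated as a plain lookup
# over its training records, and all data below is fabricated.

FIELDS = ["age", "gender", "diagnosis", "lab_result"]

TRAINING_RECORDS = [
    {"age": 54, "gender": "F", "diagnosis": "hypertension", "lab_result": "A1c 6.1"},
    {"age": 54, "gender": "M", "diagnosis": "hypertension", "lab_result": "A1c 7.0"},
    # A patient with a rare condition is easier to single out:
    {"age": 61, "gender": "F", "diagnosis": "fabry_disease", "lab_result": "GLA low"},
]

def probe(model_records, known):
    """Return the training records consistent with the attacker's partial knowledge."""
    return [r for r in model_records if all(r[f] == v for f, v in known.items())]

def extraction_risk(model_records, target, k):
    """Risk that knowing the first k fields of `target` exposes the full record.

    1.0 means the partial knowledge matches exactly one memorized record,
    so everything else in that record leaks; lower values mean the target
    still hides among multiple consistent records.
    """
    known = {f: target[f] for f in FIELDS[:k]}
    matches = probe(model_records, known)
    return 1.0 / len(matches) if matches else 0.0
```

Two of the study's directional findings fall out of even this toy setup: knowing more fields about the common-condition patient raises the risk (`extraction_risk(TRAINING_RECORDS, TRAINING_RECORDS[0], 1)` is 0.5, but with two known fields it becomes 1.0), while the rare-condition patient is fully identified from a single field.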
What did they find? After validating the tests using EHRMamba, an EHR foundation model with publicly available training data, the researchers reached a pair of conclusions that weren’t too surprising to see.
- The more information attackers have on a patient, the greater that patient’s privacy risk.
- Some patients, particularly those with rare conditions, are more susceptible.
Not all information is harmful. The researchers found that some details, such as a patient’s age or gender, present a relatively lower risk in the event of a data breach.
- This info wasn’t very helpful in targeted prompts that probed the model for memorized records, and it isn’t very damaging if the answers reveal it.
- Other info, such as a rare disease diagnosis, was flagged as significantly more harmful. It posed a higher risk of getting the model to expose patient-specific details (especially in combination with other identifiers), and it can be especially sensitive if revealed through probing.
The Takeaway
EHR foundation models need some degree of memorization to solve complex tasks, but memorizing and revealing patient records is obviously out of the question. The tradeoff between performance and privacy is an ongoing challenge, but MIT just delivered a framework for evaluating those risks that can help developers strike the right balance.
