Is AI Coming to a Clinic Near You?

[Image: HAL 9000]

Ayers et al. recently published a provocative article likely to trigger reflexive efforts to use chatbots to answer patient queries. I’ll discuss such usage in a moment. First, it is important to show that the methods of Ayers et al. were poorly designed and biased. I will use an analogy from audio to illustrate the point.

Good evidence going back decades shows that when people compare two audio devices in a so-called A-B experiment, the loudness of the devices must be exquisitely closely matched, to within less than 0.5 dB, a difference well below the threshold of audibility (Aczel et al.). If the levels are not matched, listeners favor the louder device. More generally, when one compares any A to any B, it is important to match the intensities of A and B.

Look now at the abstract from Ayers et al.: the physician responses vs the chatbot responses averaged “(52 [17-62] words vs 211 [168-245] words; t = 25.4; P < .001)”. Full stop. This was an inadequately controlled experiment, because the intensity of A (the length of the responses) was never matched to that of B; on the contrary, the lengths differed by a wide and highly significant margin. Such science should go straight to the round file as invalid.
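To put “matched” in statistical terms: the authors would have to demonstrate equivalence of the response lengths within a stated tolerance, not merely report that the lengths differ. The sketch below, in Python with made-up word counts loosely inspired by the reported averages (not the study’s data), contrasts the difference test the authors report with the kind of equivalence test (TOST) a length-matched design would need to pass:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical word counts, loosely inspired by the reported averages (52 vs 211 words);
# these are NOT the study's data.
physician_words = rng.normal(52, 30, size=195).clip(min=5)
chatbot_words = rng.normal(211, 60, size=195).clip(min=5)

# The test Ayers et al. report only shows that the lengths differ.
t, p = stats.ttest_ind(chatbot_words, physician_words)
print(f"difference test: t = {t:.1f}, P = {p:.2g}")

# Showing the lengths were *matched* would instead require an equivalence test
# (two one-sided tests, TOST) against a tolerance, say +/- 10 words.
tol = 10.0
p_low = stats.ttest_ind(chatbot_words, physician_words - tol, alternative="greater").pvalue
p_high = stats.ttest_ind(chatbot_words, physician_words + tol, alternative="less").pvalue
print(f"equivalence (TOST) P = {max(p_low, p_high):.2g}")  # near 1 here: not matched
```

With averages of roughly 52 and 211 words, any such equivalence test fails decisively, which is the point in statistical terms.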

If one wants to compare two written responses for quality and empathy, it is necessary to match the length of the responses. It is not legitimate to allow one author to say more than another. This is why contestants in a debate get the same amount of time to speak!

It would not have been difficult to constrain the bot responses to a matching length. Yet the authors do not even mention the unmatched response lengths in their “Limitations” comments! Were they and the journal’s reviewers unaware of their bias? At least the authors acknowledged that physicians responding to questions in a Reddit forum might not put forth their best (or even usual) effort. The only take-homes from this illegitimate project are that 1) doctors responding on Reddit are terse and non-empathic, and 2) ChatGPT can create responses that are acceptable to clinician reviewers. We already knew both points.
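Constraining length would indeed have been straightforward. Here is a minimal sketch, assuming the OpenAI Python client; the model name, prompt wording, and 52-word target are illustrative choices of mine, not the study’s protocol:

```python
# Sketch of length-constrained generation; not the Ayers et al. protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def length_matched_reply(question: str, target_words: int = 52) -> str:
    """Ask for a reply near the physicians' average length, then hard-truncate."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": f"Answer the patient's question in at most {target_words} words."},
            {"role": "user", "content": question},
        ],
        max_tokens=target_words * 2,  # rough budget; English runs ~1.3 tokens per word
    )
    words = response.choices[0].message.content.split()
    return " ".join(words[:target_words])
```

A stricter design would also have the raters score only length-matched pairs, so that verbosity cannot masquerade as quality or empathy.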

The deductions about how AI could be used as a physician-supervised “Answer Bot” reflect prior bias: the authors found what they set out to show, but were not sufficiently skeptical to control their methods adequately. JAMA Internal Medicine’s editors erred in publishing this study, but at least they felt compelled to run a sane, cautionary commentary about the coming role of bots (Li et al.).

Now let’s turn to where the use of AI in medicine is headed.

Like it or not, formal and informal tests keep finding that chatbots are, well…at the least, “OK.” Doctors everywhere know that there aren’t enough clinicians to respond to all of the patient queries. It is thus increasingly easy to imagine that, if not now then likely soon, bots will be able to deliver high-quality content that matches or surpasses what physicians deliver. Indeed, as big-data analytics improve, it strains credulity to think that physicians will remain better than bots at estimating the odds of a patient’s diagnosis or outcome. It is not at all hard to imagine doctors in the not-distant future working with the guidance of a computer, somewhat like Dr. “Bones” McCoy in the original Star Trek series.

So how might bots do at health-promoting services, in areas where the expertise isn’t really even “medical”? One need not have spent years studying anatomy, pathology, physical examination skills, diagnostic techniques, suturing, setting fractures, and so on in order to help patients eat better, play more, avoid substance abuse, get better sleep, socialize, and regulate their response to stress. Is it possible to deliver health-promoting interventions through bots?

Large language model bots almost certainly can be trained to give good advice and phrase it empathically. Indeed, one thing that strikes me about motivational interviewing is that it is a way of sounding like you care even if you don’t. If we can train people to emulate empathy, we can train the bots too. Can a machine evoke sufficient self-caring within a patient to inspire activation and engagement in behavior change? Not so far, but that seems destined to change, because the key element is the transfer of thought, not the method of delivery. If books can be motivating, then bots can be motivating too. Large language models have yet to truly master empathy, but even though the study by Ayers et al. was poorly done, it suggests it may not be long before bots are providing counseling.

Whether we who cherish the human connection aspect of medicine like it or not, evidence increasingly shows that an AI chatbot can be trained to emulate pretty much anyone, including a caring provider. How will people respond?


References 

Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. Published online April 28, 2023. doi:10.1001/jamainternmed.2023.1838

Aczel P, Barton P, Bech S, et al. AES20-1996 (s2008): AES recommended practice for professional audio — Subjective evaluation of loudspeakers. Audio Engineering Society, Inc., 2008. https://www.aes.org/publications/standards/search.cfm?docID=24 (accessed May 4, 2023).

Li R, Kumar A, Chen JH. How chatbots and large language model artificial intelligence systems will reshape modern medicine: fountain of creativity or Pandora’s box? JAMA Intern Med. Published online April 28, 2023. doi:10.1001/jamainternmed.2023.1835