
Using Language Models as a Diagnostic Aid Doesn’t Improve Physician Accuracy


A study found that using large language models as diagnostic aids did not significantly improve physicians’ clinical reasoning or diagnostic accuracy compared with conventional resources alone.

A new study found that physicians using large language models as a diagnostic aid did not significantly improve their clinical reasoning compared with physicians relying solely on conventional resources.1 Although the large language model itself outperformed physician participants on complex cases, it ultimately did not enhance the diagnostic accuracy of physicians who consulted it.

“The [large language model] alone outperformed physicians even when the [large language model] was available to them, indicating that further development in human-computer interactions is needed to realize the potential of AI in clinical decision support systems,” wrote investigators, led by Ethan Goh, MD, MBBS, MS, from the Stanford Center for Biomedical Informatics Research at Stanford University.

Strategies explored to improve diagnostic reasoning include educational, reflective, and team-based practices, as well as clinical decision support tools. Yet the impact of these strategies has been limited.

Large language models have shown promise on multiple-choice and open-ended medical reasoning examinations. However, it was unknown whether AI improves physician diagnostic reasoning. Investigators sought to evaluate the effect of large language models on physicians’ diagnostic reasoning compared with the use of conventional resources alone.

The team conducted a single-blind randomized clinical trial from November 29 to December 29, 2023, with participants, in person and remote, across multiple academic medical institutions. Participants were physicians trained in family medicine, internal medicine, or emergency medicine.

Participants were randomized to either the language model in addition to conventional diagnostic resources or conventional resources only. They were allowed 60 minutes to review up to 6 clinical vignettes. The analysis stratified participants by career stage.

The primary outcome was performance on a standardized evaluation rubric assessing 3 areas of diagnostic performance: diagnostic accuracy, how well participants justified the diagnosis by correctly identifying relevant factors that supported or argued against it, and how appropriately they recommended next steps for diagnostic testing or assessment. Secondary outcomes included time spent per case in seconds and final diagnosis accuracy.

The study included 50 physicians: 26 attendings and 24 residents. The median time in practice was 3 years (IQR, 2 to 8).

Participants had a median diagnostic reasoning score per case of 76% (IQR, 66% to 87%) in the language model group and 74% (IQR, 63% to 83%) in the conventional resources group, an adjusted difference of 2 percentage points (95% confidence interval [CI], -4 to 8 percentage points; P = .60). This suggests the language model group scored slightly higher than the conventional resources group, but the difference was not statistically significant.

“Results of this study should not be interpreted to indicate that LLMs should be used for diagnosis autonomously without physician oversight,” investigators wrote. “The clinical case vignettes were curated and summarized by human clinicians, a pragmatic and common approach to isolate the diagnostic reasoning process, but this does not capture competence in many other areas important to clinical reasoning, including patient interviewing and data collection.”

Moreover, the median time spent per case was 519 seconds in the language model group, compared with a median of 565 seconds in the conventional resources group, an adjusted difference of -82 seconds (95% CI, -195 to 31; P = .20). Additionally, the large language model alone scored 16 percentage points higher than the conventional resources group (95% CI, 2 to 30 percentage points; P = .03).

“The field of AI is expanding rapidly and impacting our lives inside and outside of medicine. It is important that we study these tools and understand how we best use them to improve the care we provide as well as the experience of providing it,” said Andrew Olson, MD, a professor at the U of M Medical School and hospitalist with M Health Fairview, in a statement.2 “This study suggests that there are opportunities for further improvement in physician-AI collaboration in clinical practice.”

References

  1. Goh E, Gallo R, Hom J, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open. 2024;7(10):e2440969. doi:10.1001/jamanetworkopen.2024.40969
  2. AI In Healthcare: New Research Shows Promise and Limitations of Physicians Working With GPT-4 For Decision Making. EurekAlert! October 28, 2024. https://www.eurekalert.org/news-releases/1062915. Accessed October 29, 2024.
