
Using LLMs with decision support could improve diagnoses, MGB shows

While a diagnostic decision support engine outperformed ChatGPT and Gemini at ascertaining disease in a new Mass General Brigham study, generative artificial intelligence models also performed well, suggesting potential synergistic benefits.
By Andrea Fox, Senior Editor

Mass General Brigham researchers see value in a hybrid approach that makes use of generative artificial intelligence to diagnose patients. 

Comparing two large language models (LLMs) – OpenAI's GPT-4 and Google's Gemini 1.5 – with its homegrown diagnostic decision support system, DXplain, MGB scientists found that the DDSS outperformed the LLMs in accurately diagnosing patient cases – but that the two types of AI could augment one another to better inform treatment. 

WHY IT MATTERS

DXplain was first developed in Boston back in 1984 as a standalone platform and has since evolved into a web-based application and cloud-based differential diagnosis engine. It currently relies on 2,680 disease profiles, more than 6,100 clinical findings and hundreds of thousands of data points that generate and rank potential diagnoses. 

For their comparison, researchers from MGB's Mass General Hospital Laboratory of Computer Science prepared a collection of 36 diverse clinical cases based on actual patients from three academic medical centers.

"A user can enter clinical findings and the DDSS will generate a rank-ordered list of diagnoses that explain the findings," the researchers explained in their report, published this past Thursday in JAMA Network

LLMs, meanwhile, have been shown to perform as well as physicians on certain types of board examinations and have had success analyzing case descriptions and generating accurate diagnoses.

"These results are noteworthy, as generative AI was not designed for clinical reasoning but generates human-like text responses to any question using enormous datasets gathered from the internet," they said. 

However, "amid all the interest in LLMs, it is easy to forget that the first AI systems used successfully in medicine were expert systems."

The researchers chose ChatGPT and Gemini because they performed best in previous studies published in the New England Journal of Medicine and JAMA.

"The DDSS has been shown to improve the accuracy of medical residents’ diagnostic abilities, shorten the length of stay of medical inpatients with complex conditions and reveal findings with high predictive value for critical diseases that could allow for their earlier detection."

For the year-long study, three physicians manually assessed the cases, identifying all clinical findings as well as the subsets deemed relevant for making the diagnoses, mapped to the DDSS’ controlled vocabulary. They returned two sets of marked-up copies: one identifying all clinical findings, the other all relevant positive and negative findings for establishing the diagnoses.

Researchers explained in the report that they chose two versions of case entry for the study's DDSS evaluation because using all clinical findings is how "a future automated electronic health record-integrated approach would likely be implemented," while using only relevant findings is how the system is currently used.

Two other physicians without access to each case's diagnosis entered data from these cases into the DDSS, LLM1 (ChatGPT) and LLM2 (Gemini) to run the AI versus AI comparisons.

Unlike the generative AI systems, MGB's DDSS engine requires users to enter findings using a controlled vocabulary from its dictionary, and it relies on keyword matching and other lexical techniques. For the purposes of the study, investigators extracted individual findings from each case and then mapped them to the DDSS’ clinical vocabulary.
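The report does not include the mapping code itself, but the kind of lexical matching it describes (pairing a free-text finding with a controlled-vocabulary term via keyword overlap) can be sketched roughly as follows. The vocabulary, finding strings and scoring below are illustrative assumptions, not DXplain's actual dictionary or matching logic.

```python
# Illustrative sketch only: maps free-text clinical findings to a small,
# made-up controlled vocabulary using simple keyword overlap, loosely
# mirroring the kind of lexical matching described in the article.

CONTROLLED_VOCABULARY = {
    "fever": {"fever", "febrile", "pyrexia"},
    "shortness of breath": {"dyspnea", "shortness", "breath", "breathless"},
    "elevated white blood cell count": {"leukocytosis", "wbc", "elevated", "white"},
}

def map_finding(free_text):
    """Return the controlled-vocabulary term whose keywords best overlap the text."""
    tokens = set(free_text.lower().replace(",", " ").split())
    best_term, best_score = None, 0
    for term, keywords in CONTROLLED_VOCABULARY.items():
        score = len(tokens & keywords)
        if score > best_score:
            best_term, best_score = term, score
    return best_term

if __name__ == "__main__":
    case_findings = ["Patient is febrile", "reports shortness of breath", "WBC elevated"]
    for finding in case_findings:
        print(f"{finding!r} -> {map_finding(finding)}")
```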

They compared both sets of the DDSS’ top 25 diagnoses with the 25 diagnoses generated by each LLM for each of the 36 cases. 
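The headline accuracy figures that follow reduce to a simple calculation: for each case, check whether the reference diagnosis appears in a system's ranked list, then average across the 36 cases. Here is a minimal sketch of that scoring, with invented case data and naive exact-string matching standing in for the physician adjudication the study actually used.

```python
# Illustrative sketch: compute how often each system's ranked list contains
# the reference diagnosis. Case data is invented; the study itself relied on
# physician judgment rather than exact string matching to score the lists.

cases = [
    {"diagnosis": "sarcoidosis",
     "ddss_top25": ["tuberculosis", "sarcoidosis", "lymphoma"],
     "llm_top25": ["lymphoma", "histoplasmosis"]},
    {"diagnosis": "addison disease",
     "ddss_top25": ["hypothyroidism", "addison disease"],
     "llm_top25": ["addison disease", "anemia"]},
]

def hit_rate(cases, list_key):
    """Fraction of cases where the reference diagnosis appears in the given list."""
    hits = sum(1 for c in cases if c["diagnosis"] in c[list_key])
    return hits / len(cases)

print(f"DDSS listed the diagnosis in {hit_rate(cases, 'ddss_top25'):.0%} of cases")
print(f"LLM listed the diagnosis in {hit_rate(cases, 'llm_top25'):.0%} of cases")
```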

For the mark-ups with all findings but no laboratory test results, the DDSS listed the correct diagnosis more often (56%) than ChatGPT (42%) and Gemini (39%), though the researchers said the difference was not statistically significant.

However, when laboratory test results were included in the case reports, all three systems fared better at listing the correct diagnosis: the DDSS 72% of the time, ChatGPT 64% and Gemini 58%.

"The LLMs performed remarkably well considering they were not designed for the medical domain," although they do not explain their reasoning – LLMs' foundational black box behavior challenge, researchers said.

The medical DDSS performed better when data entry captured all laboratory test results, and it is inherently designed to explain its conclusions.

"Hence, integration with the clinical workflow where all data are available should allow for improved performance when compared with the current method of clinician case entry of selected findings ex post facto," researchers said.

Interestingly, the DDSS listed the correct case diagnosis more than half the time when an LLM did not include it (in 58% of the cases ChatGPT missed and 64% of those Gemini missed), while each LLM listed the case diagnosis in 44% of the cases the DDSS missed.

Thus, the investigators envision pairing DXplain with an LLM as the optimal way forward, as it would improve both systems' clinical efficacy.

"For example, querying the LLMs to support their reasoning for including the correct diagnoses that the DDSS missed could help the developers correct any knowledge base errors," they said. "Conversely, asking an LLM to consider a diagnosis that the DDSS listed that the LLM did not list might allow the LLM to reconsider its differential diagnosis."

THE LARGER TREND

A previous study by MGB researchers, conducted at the health system's Innovation in Operations Research Center, put ChatGPT to the test working through an entire clinical encounter with a patient: recommending a diagnostic workup, deciding on a course of action and making a final diagnosis.

The LLM's performance was steady across care modalities, but it struggled with differential diagnoses.

That's "the meat and potatoes of medicine," said Dr. Marc Succi, associate chair of innovation and commercialization and executive director of its MESH Incubator's Innovation in Operations Research Group, in a statement at the time. 

"That is important because it tells us where physicians are truly experts and adding the most value – in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed."

With trust a critical question for AI-supported decision-making, healthcare is likely to have many support systems "running concurrently" for the foreseeable future, as Dr. Blackford Middleton – a renowned health informaticist and clinical advisor with more than 40 years of experience with clinical decision support – explained recently in a HIMSSCast interview with Healthcare IT News.

ON THE RECORD

"A hybrid approach that combines the parsing and expository linguistic capabilities of LLMs with the deterministic and explanatory capabilities of traditional DDSSs may produce synergistic benefits," said MGB researchers about their new study.

Andrea Fox is senior editor of Healthcare IT News.
Email: afox@himss.org

Healthcare IT News is a HIMSS Media publication.