- Mass General Brigham research shows that publicly available AI chatbots are getting better at diagnostic accuracy when presented with comprehensive clinical information, but still underperform at differential diagnoses when information is lacking
- Researchers developed a new measure, called PrIME-LLM, for benchmarking the clinical competence of different AI models
- Study reinforces necessity of “human in the loop” physician involvement for medical decision-making

Despite the increasing use of artificial intelligence (AI) in health care, a new study led by Mass General Brigham researchers from the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities.
By asking 21 different large language models (LLMs) to play doctor in a series of clinical scenarios, the researchers showed that LLMs often fail at navigating diagnostic workups and coming up with a testable list of potential or “differential” diagnoses. Though all tested LLMs arrived at a correct final diagnosis more than 90% of the time when provided with all pertinent information in a patient case, they consistently performed poorly at the earlier, reasoning-driven steps of the diagnostic process, according to the results published in JAMA Network Open.
“Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment,” said corresponding author Marc Succi, MD, executive director of the MESH Incubator at Mass General Brigham. “Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available – which is not always the case.”
This new research is a follow-up to previous work led by Succi’s MESH group, in which researchers evaluated ChatGPT 3.5’s ability to accurately diagnose a series of clinical vignettes.
In the new study, the researchers developed a novel and more holistic measure of LLMs that looked beyond accuracy, called PrIME-LLM, which evaluates a model’s competency across different stages of clinical reasoning—coming up with potential diagnoses, conducting appropriate tests, arriving at a final diagnosis, and managing treatment. When models perform well in one area but poorly in another, this imbalance is reflected in the PrIME-LLM score, as opposed to averaging competency across tasks, which may mask areas of weakness, according to the researchers.
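The article does not reproduce the actual PrIME-LLM formula, but the idea of a score that reflects imbalance rather than averaging it away can be sketched. The following hypothetical example (the stage names, numbers, and the geometric-mean aggregation are illustrative assumptions, not the study's method) shows how a plain average can mask a weak stage while an imbalance-sensitive aggregate cannot:

```python
# Hypothetical illustration only: the study's actual PrIME-LLM formula is not
# given in this article. This sketch contrasts a plain arithmetic average,
# which can mask a weak stage, with an imbalance-sensitive aggregate (here,
# a geometric mean), which drops sharply when any single stage score is low.
from statistics import geometric_mean

# Assumed per-stage scores (0-1) for the four clinical-reasoning stages
# described in the article: differential diagnosis, testing, final
# diagnosis, and management.
balanced = {"differential": 0.75, "testing": 0.75, "final": 0.75, "management": 0.75}
lopsided = {"differential": 0.30, "testing": 0.80, "final": 0.95, "management": 0.95}

def arithmetic(scores):
    """Plain average: treats a strong final-diagnosis score as offsetting
    a weak differential-diagnosis score."""
    return sum(scores.values()) / len(scores)

def imbalance_sensitive(scores):
    """Geometric mean: any weak stage pulls the whole score down."""
    return geometric_mean(scores.values())

# Both models have the same average (0.75), but the lopsided model's
# weak differential-diagnosis stage lowers its imbalance-sensitive score.
print(f"balanced: avg={arithmetic(balanced):.2f} geo={imbalance_sensitive(balanced):.2f}")
print(f"lopsided: avg={arithmetic(lopsided):.2f} geo={imbalance_sensitive(lopsided):.2f}")
```

Under these made-up numbers, both models average 0.75, but the geometric mean of the lopsided model falls to roughly 0.68, exposing the weakness that the average hides.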
The study compared 21 general-purpose LLMs, including the latest models of ChatGPT, DeepSeek, Claude, Gemini, and Grok at the time of submission. The researchers tested the models’ ability to work through 29 published clinical cases. To simulate the way that clinical cases unfold, the researchers gradually fed the models information, beginning with basics like a patient’s age, gender and symptoms before adding physical examination findings and laboratory results. The LLMs’ performance at each stage was assessed by medical student evaluators, and these evaluations were used to calculate the models’ overall PrIME-LLM scores.
In line with their previous study, the researchers found that the LLMs were good at producing accurate final diagnoses. However, all of the models failed to produce an appropriate differential diagnosis more than 80% of the time. In the real world, a differential diagnosis is critical, but in this study the models were given additional information so that they could proceed to the next stage of the clinical workup even if they failed at the differential diagnosis step.
“By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor,” said Arya Rao, lead author, MESH researcher, and MD-PhD student at Harvard Medical School. “These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.”
Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text. More recently released models generally outperformed older models, showing that LLMs are improving incrementally. The models’ PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5.
According to Succi, PrIME-LLM represents a standardized way to evaluate AI’s clinical competency that could be used by AI developers and hospital leaders to benchmark new technologies as they are released.
“We want to help separate the hype from the reality of these tools as they apply to health care,” he said. “Our results reinforce that large language models in healthcare continue to require a ‘human in the loop’ and very close oversight.”
