Sequential Diagnosis with Language Models

AI-generated keywords: Artificial Intelligence

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • AI holds great promise for expanding access to expert medical knowledge and reasoning.
  • Most evaluations of language models rely on static vignettes and multiple-choice questions, which fail to reflect the complexity and nuance of evidence-based medicine in real-world settings.
  • The authors introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters.
  • The MAI Diagnostic Orchestrator (MAI-DxO) is a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses, and strategically selects high-value, cost-effective tests.
  • When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy, four times the 20% average of generalist physicians.
  • MAI-DxO reduces diagnostic costs by 20% compared to physicians and by 70% compared to off-the-shelf o3.
  • When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. The performance gains generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families.
  • The research highlights how AI systems can enhance diagnostic precision while improving cost-effectiveness in clinical care settings.
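
The headline figures above can be related with a little arithmetic. The sketch below is illustrative only: it normalises the physician cost to 1.0 (no absolute costs appear in the metadata) and assumes that both cost reductions refer to the same MAI-DxO configuration, which the metadata does not state explicitly. All variable names are hypothetical.

```python
# Accuracy figures quoted in the abstract, in percent.
mai_dxo_accuracy_pct = 80
physician_accuracy_pct = 20
accuracy_ratio = mai_dxo_accuracy_pct / physician_accuracy_pct  # "four times higher"

# Costs are normalised: the physician baseline is an arbitrary 1.0 (assumption).
physician_cost = 1.0
mai_dxo_cost = physician_cost * (1 - 0.20)   # 20% cheaper than physicians
# If MAI-DxO is 70% cheaper than off-the-shelf o3, then mai = o3 * (1 - 0.70).
o3_cost = mai_dxo_cost / (1 - 0.70)

print(f"accuracy ratio: {accuracy_ratio:.1f}x")
print(f"MAI-DxO cost relative to physicians: {mai_dxo_cost:.2f}")
print(f"implied o3 cost relative to physicians: {o3_cost:.2f}")
```

Under these assumptions, off-the-shelf o3 would cost roughly 2.7 times the physician baseline. That is an inference from the two quoted percentages, not a figure reported in the metadata.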

Authors: Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, Eric Horvitz

23 pages, 10 figures

Abstract: Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.
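
The stepwise encounter described in the abstract can be sketched as a loop between a diagnostic agent and a gatekeeper that reveals a finding only when it is explicitly requested, while tracking cumulative cost. This is a minimal sketch and not the authors' implementation: the class and function names, the toy policy, the placeholder findings, and the prices are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gatekeeper:
    """Holds hidden case findings; reveals one only when asked for it by name."""
    findings: dict[str, str]           # hidden case details (placeholder data)
    prices: dict[str, float]           # hypothetical cost per requested item
    spent: float = 0.0
    log: list[str] = field(default_factory=list)

    def request(self, item: str) -> str:
        # Every request is logged and charged, even if nothing is on file.
        self.log.append(item)
        self.spent += self.prices.get(item, 0.0)
        return self.findings.get(item, "not available")


def run_encounter(gatekeeper, policy, abstract, max_steps=10):
    """Iterate until the policy commits to a diagnosis or the step budget runs out."""
    known = {"abstract": abstract}
    for _ in range(max_steps):
        action, value = policy(known)
        if action == "diagnose":
            return value, gatekeeper.spent
        known[value] = gatekeeper.request(value)
    return None, gatekeeper.spent


def toy_policy(known):
    """Stand-in for a physician or model: orders two tests, then diagnoses."""
    for test in ("test_a", "test_b"):
        if test not in known:
            return "request", test
    return "diagnose", "condition_x"


gatekeeper = Gatekeeper(
    findings={"test_a": "normal", "test_b": "abnormal"},
    prices={"test_a": 100.0, "test_b": 250.0},
)
diagnosis, cost = run_encounter(gatekeeper, toy_policy, "short case abstract")
print(diagnosis, cost)
```

Scoring an agent on both the returned diagnosis and the accumulated cost mirrors the two evaluation axes the abstract names. How the real benchmark prices visits and tests, or judges a diagnosis as correct, is not specified in the metadata.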

Submitted to arXiv on 27 Jun. 2025

Results of the summarizing process for the arXiv paper: 2506.22405v2

This paper's license does not allow us to build upon its content, so this summary was generated from the paper's metadata rather than the full article.

Artificial intelligence (AI) holds great promise for expanding access to expert medical knowledge and reasoning. Most evaluations of language models, however, rely on static vignettes and multiple-choice questions that fail to capture the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapt their questions and tests to new information, and weigh the evolving evidence before committing to a final diagnosis. To bridge this gap, the authors introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. In this benchmark, a physician or AI begins with a brief case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is evaluated not only on diagnostic accuracy but also on the cost of the physician visits and tests performed. The authors also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses, and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy, four times the 20% average of generalist physicians. It also reduces diagnostic costs by 20% compared to physicians and by 70% compared to off-the-shelf o3.
When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. The research highlights how AI systems, when guided to think iteratively and act judiciously, can advance both diagnostic precision and cost-effectiveness in clinical care.
Created on 23 Jul. 2025
