Sequential Diagnosis with Language Models

AI-generated keywords: Artificial Intelligence

AI-generated Key Points

⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

AI has the potential to revolutionize access to expert medical knowledge and reasoning in healthcare.
Traditional evaluations of language models often fall short in capturing the complexity and nuance of evidence-based medicine in real-world scenarios.
Researchers have introduced the Sequential Diagnosis Benchmark to bridge the gap between AI capabilities and clinical reality by transforming diagnostically challenging cases into stepwise diagnostic encounters.
The MAI Diagnostic Orchestrator (MAI-DxO) adds sophistication to the diagnostic process by simulating a panel of physicians, suggesting likely differential diagnoses, and strategically selecting high-value, cost-effective tests.
MAI-DxO achieves 80% diagnostic accuracy when paired with OpenAI's o3 model, four times the 20% average of generalist physicians.
MAI-DxO reduces diagnostic costs by 20% compared to physicians and by 70% compared to off-the-shelf o3.
When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy, and its performance gains generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families.
The research highlights how AI systems can enhance diagnostic precision while improving cost-effectiveness in clinical care settings.

Authors: Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, Eric Horvitz

arXiv: 2506.22405v2 - DOI (cs.CL)

23 pages, 10 figures

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they've just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy--four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.

Submitted to arXiv on 27 Jun. 2025

Results of the summarizing process for the arXiv paper: 2506.22405v2

Comprehensive Summary

Artificial intelligence (AI) could broaden access to expert medical knowledge and reasoning. Most evaluations of language models, however, rely on static vignettes and multiple-choice questions that fail to capture the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapt each question and test to what they have just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this process, the researchers introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. In each encounter, a physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed by diagnostic accuracy and by the cost of the physician visits and tests performed.

The researchers also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses, and strategically selects high-value, cost-effective tests. Paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy, four times the 20% average of generalist physicians. It also reduces diagnostic costs by 20% compared to physicians and by 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These gains generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. The work highlights how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.

Layman's Summary

1. AI, which stands for artificial intelligence, could give more people access to expert medical knowledge and reasoning.
2. The MAI Diagnostic Orchestrator (MAI-DxO) is a system that guides language models through the steps of diagnosing an illness.
3. MAI-DxO works by simulating a group of doctors, suggesting possible diagnoses, and choosing tests that are worth their cost.
4. Paired with OpenAI's o3 model, MAI-DxO reached the correct diagnosis in 80% of difficult cases, compared with an average of 20% for generalist doctors.
5. Using AI like MAI-DxO could make diagnosing illnesses more precise and more cost-effective in clinical care.

Definitions

- Artificial Intelligence (AI): Technology that allows machines to learn from experience and perform tasks that typically require human intelligence.
- Diagnosis: Identifying a disease or illness based on its symptoms and test results.
- Cost-effective: Providing good value for the amount of money spent; efficient in terms of costs.
- Precision: The quality of being accurate and exact in performing a task or measurement.

Blog article

Introduction

Artificial intelligence (AI) is making significant strides in many industries, and healthcare is no exception. It could broaden access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions, which fail to capture the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapt their questions and tests to new information, and weigh the evolving evidence before committing to a final diagnosis. Static evaluations do not test this iterative process. To bridge the gap, researchers have introduced the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging cases from the New England Journal of Medicine clinicopathological conference (NEJM-CPC) into stepwise diagnostic encounters.

The Sequential Diagnosis Benchmark

The Sequential Diagnosis Benchmark is designed to mimic the diagnostic process physicians follow in practice. In each encounter, a physician or AI begins with a brief case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly asked. This setup mirrors how physicians gather information by questioning patients and ordering tests. Performance is evaluated on diagnostic accuracy and on the cost of the physician visits and tests performed.
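The encounter loop described above can be sketched in a few lines of Python. This is only an illustration built from the description in the abstract: the class names, the `toy_policy` function, the query names, and every price are hypothetical, and the real benchmark uses a language model as the gatekeeper rather than a dictionary lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Gatekeeper:
    """Holds the full case file and reveals a finding only when asked for it."""
    findings: dict[str, str]   # hypothetical: query name -> finding text
    prices: dict[str, float]   # hypothetical: query name -> cost in dollars

    def ask(self, query: str) -> tuple[str, float]:
        # In this sketch, unknown queries reveal nothing and cost nothing.
        result = self.findings.get(query, "not available")
        return result, self.prices.get(query, 0.0)


@dataclass
class Encounter:
    abstract: str
    gatekeeper: Gatekeeper
    total_cost: float = 0.0
    revealed: dict[str, str] = field(default_factory=dict)

    def request(self, query: str) -> str:
        # Every request is charged, so careless querying raises the final cost.
        result, cost = self.gatekeeper.ask(query)
        self.total_cost += cost
        self.revealed[query] = result
        return result


def run_encounter(encounter, policy, max_steps=10):
    """policy(abstract, revealed) returns ("ask", query) or ("diagnose", label)."""
    for _ in range(max_steps):
        action, value = policy(encounter.abstract, encounter.revealed)
        if action == "diagnose":
            return value, encounter.total_cost
        encounter.request(value)
    return None, encounter.total_cost  # ran out of steps without committing


def toy_policy(abstract, revealed):
    """A fixed two-step policy, standing in for a physician or an AI agent."""
    for query in ("history", "chest_xray"):
        if query not in revealed:
            return "ask", query
    if "consolidation" in revealed["chest_xray"]:
        return "diagnose", "pneumonia"
    return "diagnose", "unknown"
```

Each encounter yields both a diagnosis and a running cost, which is what lets the benchmark score accuracy and cost together rather than accuracy alone.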

The MAI Diagnostic Orchestrator (MAI-DxO)

The researchers also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses, and strategically selects high-value, cost-effective tests. Paired with OpenAI's o3 model, MAI-DxO achieves 80% diagnostic accuracy, four times the 20% average of generalist physicians. It also reduces diagnostic costs by 20% compared to physicians and by 70% compared to off-the-shelf o3.
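To make "high-value, cost-effective tests" concrete, here is a minimal sketch of one possible selection rule: rank candidate tests by expected value per dollar and pick greedily within a budget. This is not the method used by MAI-DxO, which the metadata does not describe; the `CandidateTest` type, the `select_tests` function, the value scores, and the prices are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTest:
    name: str
    cost: float            # hypothetical price in dollars, must be positive
    expected_value: float  # hypothetical score in [0, 1]: how much the test
                           # is expected to narrow the differential diagnosis


def select_tests(candidates, budget):
    """Greedy heuristic: take tests in order of value per dollar while they fit the budget."""
    for test in candidates:
        if test.cost <= 0:
            raise ValueError(f"cost must be positive: {test.name}")
    ranked = sorted(candidates, key=lambda t: t.expected_value / t.cost, reverse=True)
    chosen, spent = [], 0.0
    for test in ranked:
        if spent + test.cost <= budget:
            chosen.append(test)
            spent += test.cost
    return chosen, spent


# Hypothetical candidates; the numbers are illustrative only.
candidates = [
    CandidateTest("ct_scan", cost=1000.0, expected_value=0.6),
    CandidateTest("cbc", cost=50.0, expected_value=0.3),
    CandidateTest("blood_culture", cost=100.0, expected_value=0.5),
]
chosen, spent = select_tests(candidates, budget=200.0)
```

With a 200-dollar budget, the two inexpensive tests are chosen and the CT scan is skipped, even though the scan has the highest expected value on its own. A greedy ratio rule is simple but not optimal in general, and a real orchestrator would also update its value estimates after each result comes back.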

Implications for Healthcare Diagnostics

The results suggest that AI systems guided to think iteratively and act judiciously can advance both diagnostic precision and cost-effectiveness in clinical care. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. Because the orchestrator is model-agnostic, its performance gains are not tied to a single model: they generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families.

Conclusion

In conclusion, the paper highlights the potential impact of AI on healthcare diagnostics. On the Sequential Diagnosis Benchmark, MAI-DxO improves diagnostic accuracy over generalist physicians and lowers diagnostic cost relative to both physicians and off-the-shelf o3. These results suggest that AI systems that reason stepwise and weigh the cost of each test could improve diagnostic precision and cost-effectiveness in clinical care.

Created on 23 Jul. 2025

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.