Microsoft Reveals Its Path To Medical Superintelligence
A deep dive into Microsoft’s new medical benchmark called ‘SDBench’ and a multi-agentic system called ‘MAI-DxO’ to explore what they mean for the future of clinicians and medical AI.
The way LLMs are currently tested on diagnosing medical conditions is deeply flawed.
LLM evaluations on medical datasets usually involve answering well-structured questions that bundle patient history, examination findings, and investigation results to select a diagnosis from a given set of options.
Although LLMs show very impressive diagnostic accuracy in these evaluations, this one-turn, multiple-choice, quiz-like approach is far from how clinicians diagnose a patient.
Clinicians take a step-by-step, sequential approach, where they first come up with a hypothesis and then iteratively question and test it using investigations to narrow down their options, reaching a final diagnosis.
Alongside this, they must also keep the consultation and investigation costs, as well as the patient’s clinical status, in mind.
Microsoft researchers have set out to close this gap with a new medical evaluation framework called the Sequential Diagnosis Benchmark (SDBench).
SDBench consists of hundreds of real-world case records, published weekly in the New England Journal of Medicine, converted into stepwise diagnostic encounters and made interactive using an LLM.
Alongside the benchmark, they have introduced a multi-agent framework called the MAI Diagnostic Orchestrator (MAI-DxO), which achieves 80% diagnostic accuracy on SDBench, four times the 20% average of generalist physicians.
MAI-DxO reaches this accuracy at a diagnostic cost 20% lower than that of physicians and 70% lower than that of OpenAI’s o3.

Here is a story where we discuss SDBench and MAI-DxO in depth and explore what they mean for the future of clinicians and medical AI.
Let’s begin!
What’s So Good About ‘SDBench’?
Consider a popular medical benchmark, such as MedQA, which is simply a compilation of multiple-choice questions from the United States Medical Licensing Examination (USMLE).
SDBench is very different from this.
It is a compilation of 304 consecutive cases (published between 2017 and 2025) from the New England Journal of Medicine’s (NEJM) clinicopathological conference case records. Each case is converted into an interactive simulation for sequential diagnosis, in which an LLM (termed the Gatekeeper agent) holds the full case file and reveals information only when asked for it.
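To make the idea of an interactive, sequential encounter concrete, here is a minimal toy sketch in Python. It is not Microsoft's implementation: the `Gatekeeper` class, its method names, the case contents, and the test price are all invented for illustration, and a plain dictionary lookup stands in for the LLM. It only shows the shape of the interaction, where the diagnostician starts from a short vignette, has to ask for each further finding, and accumulates cost for every test ordered.

```python
from dataclasses import dataclass, field

NOT_AVAILABLE = "Not available in the case record."


@dataclass
class Gatekeeper:
    """Toy stand-in for an LLM gatekeeper: reveals findings only on request."""
    initial_vignette: str
    findings: dict[str, str]   # hidden findings, keyed by question or test name
    test_costs: dict[str, int]  # hypothetical price per test, in USD
    final_diagnosis: str
    spent: int = 0
    log: list[str] = field(default_factory=list)

    def ask(self, question: str) -> str:
        """Answer a history or examination question (free in this toy)."""
        self.log.append(f"ask: {question}")
        return self.findings.get(question, NOT_AVAILABLE)

    def order_test(self, test: str) -> str:
        """Return a test result and add its price to the running total."""
        self.log.append(f"test: {test}")
        self.spent += self.test_costs.get(test, 0)
        return self.findings.get(test, NOT_AVAILABLE)

    def submit(self, diagnosis: str) -> bool:
        """Exact string match; a simplification of real answer grading."""
        return diagnosis.strip().lower() == self.final_diagnosis.lower()


# An invented case, purely for illustration.
case = Gatekeeper(
    initial_vignette="A 30-year-old presents with fever and a sore throat.",
    findings={
        "travel history": "No recent travel.",
        "throat swab": "Positive for group A streptococcus.",
    },
    test_costs={"throat swab": 40},
    final_diagnosis="Streptococcal pharyngitis",
)

print(case.initial_vignette)
print(case.ask("travel history"))
print(case.order_test("throat swab"))
print(case.submit("streptococcal pharyngitis"), case.spent)
```

Unlike a multiple-choice question, nothing beyond the opening vignette is handed over up front, so both the choice of what to ask and the money spent along the way become part of what is being evaluated.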