Microsoft AI CEO Mustafa Suleyman: Our AI Doctor Outperforms Human Diagnosticians
Channel: Alex Kantrowitz
Published at: 2025-07-04
YouTube video id: jhAqPFo6yyI
Source: https://www.youtube.com/watch?v=jhAqPFo6yyI
AI's application in medicine is one of the most fascinating uses of the technology, both because of what can go right and what can go wrong. So today I'm thrilled to bring in Mustafa Suleyman, the CEO of Microsoft AI, to talk about the application of the technology in medicine and a new development that Microsoft has on that front. Mustafa, great to see you. Thanks for coming back on the show. Yeah, it's a pleasure. Great to see you too. Thanks for having me. So, I was reading your blog post, and we're going to talk a little bit about this new development you have as far as putting AI to use in medicine, but one stat struck me before we get to the actual work itself, and that is that 50 million queries come in across Copilot and your other AI properties that are trying to solve medical use cases. Is that a good thing? I mean, I think it's incredible, because even with just search engines we were already making access to information super cheap and concise. And now with Copilot, the answers are much more conversational. You can tune them so they suit your specific level of knowledge and expertise. And as a result, more and more people are asking Copilot and Bing health-related questions. So yeah, you're right, we have 50 million queries a day that are health-related, and they can range from a cancer issue someone's dealing with, to a death in the family, to a mental health issue, to just having a skin rash. So the variety is huge. But obviously we've got a really important objective here, which is to try and improve the quality of our consumer health products. And so this is 50 million queries every day. Do the queries that come into chatbots look any different than search? I mean, asking a search product what's going on with you is not anything new, but I imagine with a chatbot you might be able to go one level deeper.
So do you see anything different in the queries? Copilot's answers tend to be more succinct and more responsive to the style and tone of the individual person asking the question, and that tends to encourage people to ask a second, follow-up question. So it turns it into more of a dialogue, or a consultation, like the one you might end up having with your doctor. So they are quite different to a normal search query. And so, speaking of dialogues, let's talk about this new development that Microsoft is announcing today, which is that you've created, and let me know if I get this right, a diagnostician bot that effectively will be able to dialogue with a patient's case file and then make a diagnosis. So it's actually two bots, and it's a system. It's not just Microsoft's bots; it can be any bot, where one bot basically acts as a gatekeeper to all the patient's medical information, and the other one is basically acting as the diagnostician or the physician that goes in and asks questions about that history. And you found some pretty incredible results when it comes to the effectiveness of this system at diagnosing correctly. Yeah, it's a great summary. That's exactly right. We essentially wanted to simulate what it would be like for an AI to act as a diagnostician: to ask the patient a series of questions to draw out their case history, go through a whole bunch of tests that they may have had, pathology and radiology, and then iteratively examine the information it's getting in order to improve the accuracy and reliability of its prediction about what the diagnosis actually is. And we actually use the New England Journal of Medicine case histories, hundreds of these past cases. One of these cases comes out every single week, and it's like an ultimate crossword for doctors. They obviously don't see the answer until the following week.
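The two-bot setup Suleyman describes, a gatekeeper holding the case file and a diagnostician interrogating it, could be sketched roughly like this. This is a toy illustration only: the class names, the case data, and the hard-coded decision rule are all invented for the example, whereas the real system would use LLMs at each step.

```python
# Hypothetical sketch of the gatekeeper/diagnostician pair described above.
# The gatekeeper reveals only what is explicitly asked for; the diagnostician
# asks questions in sequence and then commits to a diagnosis.

class Gatekeeper:
    """Holds the patient record; answers only the specific question asked."""
    def __init__(self, case_file: dict):
        self._case_file = case_file

    def answer(self, question: str) -> str:
        # A real system would use an LLM to answer free-form questions;
        # here we just look up a field by name.
        return self._case_file.get(question, "not available")

class Diagnostician:
    """Queries the gatekeeper step by step, then commits to a diagnosis."""
    def __init__(self, questions: list[str]):
        self.questions = questions
        self.evidence: dict[str, str] = {}

    def run(self, gatekeeper: Gatekeeper) -> dict:
        for q in self.questions:
            self.evidence[q] = gatekeeper.answer(q)
        # Toy decision rule standing in for the model's reasoning step.
        dx = ("iron-deficiency anemia"
              if "fatigue" in self.evidence.get("symptoms", "")
              else "unknown")
        return {"diagnosis": dx, "evidence": self.evidence}

case = {"symptoms": "fatigue, pallor", "labs": "low ferritin"}
result = Diagnostician(["symptoms", "labs"]).run(Gatekeeper(case))
print(result["diagnosis"])  # -> iron-deficiency anemia
```

The key property this sketch captures is that the diagnostician never sees the whole file at once; every piece of evidence has to be asked for, which is what makes the interaction a sequential dialogue rather than a one-shot prediction.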
And it's a big guessing game to go back through five to seven pages of very detailed history and then try and figure out what the diagnosis actually turns out to be. Okay. And so what happens is these two bots work in conjunction to figure out what the diagnosis is. Why use a system like this? I mean, I thought one of the benefits of generative AI is that it can take in a lot of information and then come to these answers, sometimes in one shot. So what is the benefit of having this dialogue between two bots? So the big breakthrough of the last six months or so in AI is these thinking or reasoning models, which can query other agents or find other information sources at inference time to improve the quality of the response. Rather than just giving the first best answer, the model instead goes and consults a range of different sources, and that improves the quality of the information it finally gets to. So we see that this orchestrator, which under the hood uses four different models from the major providers, can actually improve the accuracy of each of the individual models, and collectively all of them together, by a very significant degree, about 10% or so. So it's a big step forward, and I think that as the AI models get commoditized, really all the value will be added in that final layer of orchestration and product integration, and that's what we're seeing with this diagnostic orchestrator. So a 10% increase in accurately diagnosing on top of the standard LLM. Yeah. And in fact we actually benchmarked that against human performance. So we had a whole bunch of expert physicians play this simulated diagnosis game, and they on average get about one in five right, so about 20%, whereas our orchestrator gets about 85% accuracy. So it's four times more accurate, which, in my career, I've never seen such a big gulf between human-level performance and an AI system's performance.
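There is a simple statistical intuition behind why orchestrating several models can beat any single one. This little calculation is not the paper's method, just a standard probability illustration: if several models err independently, a majority vote over them is right more often than any individual.

```python
# Toy illustration: if each of n independent models is right with probability
# p, the probability that a majority of them is right follows the binomial
# distribution. Not the actual orchestration mechanism, just the intuition.
from math import comb

def majority_vote_accuracy(p_single: float, n_models: int) -> float:
    """P(majority of n independent models is correct), for odd n."""
    k_needed = n_models // 2 + 1  # smallest majority
    return sum(
        comb(n_models, k) * p_single**k * (1 - p_single)**(n_models - k)
        for k in range(k_needed, n_models + 1)
    )

# Five independent models, each right 70% of the time:
print(round(majority_vote_accuracy(0.70, 5), 3))  # -> 0.837
```

Real models' errors are correlated, so the actual gain is smaller than this idealized bound, which is roughly consistent with the ~10% improvement Suleyman cites; the orchestrator also does more than vote, since the agents exchange reasoning rather than just answers.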
Many years ago, I worked on lots of diagnoses for radiology, head and neck cancer, and mammography, and the goal was just to take a single radiology exam and predict, yes or no, does it have cancer. That was the most we could do, whereas now it is not just producing a binary class output; it's actually producing a very detailed diagnosis, and doing that sequentially, through this interactive dialogue mechanism, and that massively improves the accuracy. Okay, so it can do 80%-accurate diagnoses, which sounds incredible, and I have to pressure-test this a little bit, because what if you have the same thing happen in medicine as is happening with beginner-level code, where basically there are people learning to code using these copilots, but then when something breaks it becomes harder for them to figure out what's going on? So if you're a doctor relying on something amazing, 80% accuracy, but you sort of outsource some of your thinking to these bots, is that a problem down the line? Yeah. So this isn't just giving a black-box answer.
That's why the sequential-diagnosis part is so important, because you can watch the AI in real time: ask questions of the case history, get an answer, shape a new question, get an answer, then ask for a different type of testing, get those results, interpret them, then give an answer. So the dialogic nature means that a human doctor can follow along and actually learn in a very transparent way. It's almost like having an interpretability mechanism inside the black box of the LLM, because you can see its thinking process in real time. And in fact you don't just see the chain of thought, which is the inner monologue; we've actually created five different types of agent which all have a debate, and we call this chain of debate. They negotiate with one another, they try to prioritize certain different aspects like cost or efficiency, and it's the coordination of those different skill sets among the doctor agents that actually makes this so effective. But I want to ask again, because even if a doctor can watch this take place, it effectively changes their role in diagnosis. Let's say this becomes something that doctors use: it turns their role from something that's active, where they're really thinking it through, to more of a passive, okay, I'm watching the bots go through it. And I do wonder if there is some benefit in having the doctor actually do that themselves, because it helps the brain work in ways that it doesn't when it's just watching bots have a conversation. Yeah, I mean, I think that's totally true. I just still think this is going to be an amazing education tool for doctors to actually learn about the breadth of cases they never would have encountered.
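The "chain of debate" idea, several role-specialized agents proposing answers and then negotiating with cost and efficiency in mind, could be sketched as follows. The role names, confidence numbers, and scoring rule here are all invented for illustration; in the real orchestrator each role is an LLM prompt and the negotiation happens in natural language, not by a hand-written formula.

```python
# Hedged sketch of a chain-of-debate round: each agent proposes a diagnosis
# with a confidence and an implied test cost; a moderator step pools the
# confidences per diagnosis and applies a cost penalty before picking one.
from dataclasses import dataclass

@dataclass
class Proposal:
    agent: str        # illustrative role name, e.g. "challenger"
    diagnosis: str
    confidence: float  # agent's self-reported confidence, 0..1
    test_cost: float   # cost of the workup the agent would order

def debate(proposals: list[Proposal], cost_weight: float = 0.001) -> str:
    """Pool confidence per diagnosis, penalize expensive workups, pick max."""
    scores: dict[str, float] = {}
    for p in proposals:
        scores[p.diagnosis] = (scores.get(p.diagnosis, 0.0)
                               + p.confidence - cost_weight * p.test_cost)
    return max(scores, key=scores.get)

round1 = [
    Proposal("hypothesis-generator", "lymphoma", 0.6, 1200),
    Proposal("challenger", "tuberculosis", 0.5, 300),
    Proposal("cost-steward", "tuberculosis", 0.4, 300),
]
print(debate(round1))  # -> tuberculosis
```

Even in this toy form it shows the mechanism's point: the winning answer is the one that survives pooled scrutiny across roles, not whichever single agent was most confident.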
For example, we actually ran the DxO orchestrator last week on the most recent case study in the New England Journal of Medicine, and it correctly diagnosed a case that had only ever been seen 1,500 times in all of medical literature. It was such an obscure long-tail disease. So very few doctors are ever going to get the chance to see that. And so the ability to accurately and reliably detect these kinds of conditions in the wild, in production, I think will massively outweigh the risk of doctors not being able to sort of exercise in the way that you describe. I think the tools just change how you work, and obviously everyone will sort of have to adapt to that over time, but the utility is just so unquestionably beneficial that I think it makes it worthwhile. Now, is it able to do that because the cases are potentially in the training data? And even if they are, does it really matter? I mean, if it is able to diagnose these rare conditions, should we really mind if it's seen them before in the training data? Well, part of the reason why we partnered with the New England Journal of Medicine is because each week they put out a brand new case which has never even been digitized. So there's no question that it's not in the training data. This case, for example, from last week, there's absolutely no way it's in the training data, because it's literally just been published. And we think that's the case going back for all of the previous cases, too. So I don't think there's any chance of that. This really is doing a kind of abstraction of judgment. It's not just reproducing training data; it is actually doing some kind of inference, or thinking, based on the knowledge that it does already have.
Now I have to ask you, you mentioned that basically what happens in this orchestration is that you're taking a traditional model and turning it into some form of specialized reasoning model. There was an interesting point in the paper that said the improvements over reasoning models like o3 weren't as big as they were elsewhere. So, is it possible that the reasoning models that are state-of-the-art today will effectively learn how to do stuff like this, and you won't need this type of specialized sequencing in order to be able to make these diagnoses as accurately as we're seeing here? I think that the real value here, long term, is actually in how you orchestrate a variety of different models with different types of expertise. So each one of these five agents has been prompted and designed to have a different type of expertise, and then they jointly negotiate and reason collectively. It may be the case that they all get subsumed into a single model in the future. I don't know. Right now, it doesn't look like that. Right now, orchestrators are the thing that is able to drive much bigger gains. The other thing that we see, for example, is that it's able to optimize for cost as well, and reduce the cost by avoiding unnecessary tests versus the humans. So that's a function of cost being factored into the orchestrator at inference time, which wouldn't be something you could reconcile inside a single model in pre-training or post-training. Yeah, we shouldn't skip over the cost thing. I mean, obviously in medicine, cost is a factor you can't ignore. You could order every single test and probably do better diagnosing people, but it's just not a reality today.
And it is interesting to watch the bot work through which tests to order and then actually come in at a lower cost than typical doctors, and come out with a better diagnosis. So, how does it work? Well, more tests also make people feel anxious. They waste time, especially if you have a condition that could be treated another way. So it's not just cost; it's actually the patient experience that gets optimized for as well. And so how does it decide which tests to order and how to optimize cost? Is that another feature within the model? Well, I mean, you can think of an unnecessary test as a kind of error, a human error. So really what the model is trying to do is to get to the best diagnosis with the minimum number of tests. And obviously the model has a much broader range in terms of awareness of which test results tend to correlate with which particular diagnostic outcomes. And so, given that it's seen so many more cases than any given human, it's obviously showing it can do a better job of judging, given the case history it already knows about a patient, what the minimum number of necessary tests is to get the next piece of information and continue making the diagnosis more accurate. Could I tell you something else that surprised me? It seemed like the bot struggled with more common types of diagnoses but was really good at the complex ones. I'll just read it straight from your paper, and I called it a bot, but it's really the orchestrator: although it excels at tackling the most complex diagnostic challenges, further testing is needed to assess its performance on more common, everyday presentations. Do you think it's just, like, waiting to diagnose that rare case, so it skips over the fact that it could just be a stomach ache?
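The "best diagnosis with the minimum number of tests" objective can be illustrated with a simple greedy policy: at each step, order the affordable test with the best expected information gain per dollar, and stop once confidence crosses a threshold. The test names, numbers, and stopping rule below are made up for the sketch; the real orchestrator's policy comes from prompted LLM reasoning, not a hand-written heuristic like this.

```python
# Illustrative greedy sketch of cost-aware test selection.
# candidates maps test name -> (expected_info_gain, cost); both are invented.

def choose_tests(candidates: dict[str, tuple[float, float]],
                 budget: float, stop_conf: float = 0.9) -> list[str]:
    """Return the ordered list of tests a greedy cost-aware policy picks."""
    ordered: list[str] = []
    confidence, spent = 0.5, 0.0  # start from an uninformed prior
    remaining = dict(candidates)
    while remaining and confidence < stop_conf:
        # Only consider tests that fit in the remaining budget.
        affordable = {t: v for t, v in remaining.items()
                      if spent + v[1] <= budget}
        if not affordable:
            break
        # Best expected information per unit cost.
        best = max(affordable, key=lambda t: affordable[t][0] / affordable[t][1])
        gain, cost = remaining.pop(best)
        ordered.append(best)
        spent += cost
        confidence = min(1.0, confidence + gain)  # toy confidence update
    return ordered

tests = {"CBC": (0.15, 30), "MRI": (0.25, 1200), "biopsy": (0.40, 800)}
print(choose_tests(tests, budget=1000))  # -> ['CBC', 'biopsy']
```

Note how the expensive MRI never gets ordered: cheap, informative tests first, and nothing more once the budget or the confidence threshold says stop, which is the trade-off being described.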
All we're saying there is that we haven't applied it to your everyday sort of GP or primary care physician experience, where you have a skin rash or a pain in your knee. This does tend to be the longer tail of complex cases, but it goes without saying that there is less of that information in the training data, and we know that if there is more training data, the models do better. So clearly, by virtue of the fact that there are more cancers, more diabetes, more weight-loss questions, more knee pain than any of these long-tail conditions, the model is almost certainly going to do better in those primary-care-type environments than it's doing on the long tail. But because we didn't develop the research with those environments, we will of course do that as a next step; that's why we added that qualifier, right? Okay, that explains a lot. And let's talk a little bit about the role of the doctor moving forward. I mean, we touched on it previously, but the release that you put out in conjunction with the paper does touch on what the role of the doctor will be. It says doctors' clinical roles are much broader than simply making a diagnosis. They need to navigate ambiguity and build trust with patients and their families in a way that AI isn't set up to do. Can I just take the other side of this? I mean, maybe AI is set up to do this. I think that if you're talking with a bot every day, you might trust it more than a doctor you see once a year, or even a new specialist. So is AI poised to take on some of that work as well? It's possible that it will do some of that work. Certainly, I hope one day that it will be good enough to do that kind of work.
But nothing is going to replace the human-to-human connection that you get in the physical, real world at a moment of heightened anxiety and fear, when you're facing one of the biggest challenges of your life and you have a massive diagnosis ahead of you, or when you just need everyday, regular treatment and care. So that's, I think, going to continue to be the role of a doctor, and hopefully they get to spend more time face to face with patients and less time reading through the long tail of the history and so on. So in the future doctors become something of auditors of the output of these AI bots. They use some of their training to either confirm it or look at a different possibility, and then mostly what they're doing is effectively becoming shepherds, shepherding patients through their, for lack of a better phrase, care journey. I think there's still going to be a tremendous amount of judgment required of expert human doctors, both as part of the diagnosis and then, secondly, in making the judgment about what works for the patient: helping a patient decide, what journey do I want to take given that I now know I've got this diagnosis, what treatments do I want and when, and what are the trade-offs there? So that is going to require a tremendous amount of judgment, and it's not just about the human-to-human connection and being on your feet. It's also thinking in a deeply empathetic way alongside a patient that's received a diagnosis, to plan their treatment course. Okay. And what other professions could you see this type of system being applied to? Well, I mean, the basic method of these orchestrators is that they tune different AIs to play very specific roles and then have the AIs negotiate with one another, debate, and discuss.
That setting obviously applies to a lot of different environments, be it in business or even in government in the future. And so I think if this finding holds and applies to other domains, it will be very promising, because it's also how we collectively, as a human species, work, right? We generally consult very widely when we make decisions. Often there's even consensus-building before actually coming to a final conclusion. So it has a lot of parallels to the human world. And lastly, this isn't being rolled out broadly in a hospital setting yet, so everyone who's panicking at this point can relax. But is that the ultimate goal? Like, where do you see this? Is it an education tool? Or does this actually become integrated into medical centers and hospitals in the coming years? I mean, obviously at the moment this is just early research, and we're figuring out how best to deploy it. But the fact that we're able to get a 4x improvement on human performance across the board on diagnosis, with significantly reduced cost, in super-fast time: to me that feels like steps towards a true medical superintelligence, and we would want to try and make that available as widely as possible, as quickly as possible, including for our 50 million daily health queries. So that's going to be our ambition: get it in front of consumers as fast as possible, in the safest way possible. Okay. I'm definitely one of those 50 million asking these medical questions, and so the better it gets, the happier I'm going to be. Although, with the caveat that when things get really hairy, I'm going to go see a human. Sounds like a good idea. All right, Mustafa, great to see you as always. Thanks for coming on. Thanks, man. Appreciate you. Take it easy.