Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit
Channel: aiDotEngineer
Published at: 2025-07-31
YouTube video id: 1v9zBiZKlIY
Source: https://www.youtube.com/watch?v=1v9zBiZKlIY
[Music] I'm going to jump right into it. I'm going to be talking about voice AI's interruption problem. After that, I'll talk about how we currently handle interruptions and turn taking in voice AI, what we can learn from the study of human conversation about how humans handle turn taking, and then some of the really neat and interesting new approaches out there for handling turn taking and interruptions in voice AI agents.

Interruptions are the biggest problem in voice AI agents right now. When you're talking to ChatGPT Advanced Voice Mode and it interrupts you, it's annoying. But when a patient is talking to a voice AI dental assistant and it interrupts the patient, the patient hangs up and the dentist stops paying the voice AI developer. This is our collective problem, one we all have to solve.

And the problem is that turn taking is just hard. Let me first define what turn taking is. Turn taking is the unspoken system we have for who controls the floor between speakers during a conversation, and fundamentally it's hard because turn taking happens really fast in human conversation, and there's no one size that fits all. There's a really cool study I pulled this data from, where they looked at how long it took a listener to start responding after the speaker finished speaking, across different cultures. We can see that the Danes take a relatively long time to start speaking after the other speaker finishes, but the Japanese do it almost instantaneously. So there are differences across cultures; that's part of what makes it hard. There are also differences across individuals. I'm one of those people who takes a long time to respond. Even before I got into voice AI, people would sometimes comment, "Are you gonna respond?" I'm like, "Yeah, yeah, thinking about it." And even though I'm one individual, there's a lot of variability in how quickly I respond. If you make me angry, I'm probably going to respond quicker. So it's just a hard problem.

On the next slide, for people who are not very familiar with how voice AI agent pipelines work, I want to provide a simplified overview of how we currently handle turn taking and interruptions in voice AI agents. The user starts speaking; that's the speech input, and those audio chunks are passed to a speech-to-text model, which transcribes the audio into a transcription. The next step is something called a VAD that determines whether or not the user has finished speaking; I'm going to go more into that in a moment. If the user has finished speaking, the transcript is passed to an LLM and the LLM outputs its chat completion. That chat completion is streamed out, and that stream is passed to a text-to-speech model where it's converted into audio, and that audio, which is now the audio of the voice AI agent, is passed back to the user.

Let's dig more into the voice activity detection system. It's a system with primarily two parts. The first is a machine learning model, a neural network, that detects whether or not somebody is speaking: speech or not speech. I shouldn't call it simple, because it's a really neat model, but it's ultimately just looking at speech or not speech. The second part is a silence algorithm. The silence algorithm says, okay, if the person hasn't spoken for more than half a second, they're done speaking and it's time for the agent to start speaking. In most production voice AI systems, we're using something like that sort of VAD. And that's changing; we're building all sorts of new, interesting things that I'll cover later in the presentation.
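To make that cascaded pipeline and its silence-based endpointing concrete, here is a minimal sketch. The `stt`, `vad`, `llm`, and `tts` objects and their method names are placeholders assumed for illustration, not any particular framework's API:

```python
import time

SILENCE_THRESHOLD_S = 0.5  # "hasn't spoken for more than half a second -> done speaking"

def run_turn(mic_chunks, stt, vad, llm, tts):
    """One turn of a cascaded voice agent: speech-to-text -> VAD endpointing -> LLM -> text-to-speech."""
    transcript = ""
    last_speech = time.monotonic()

    for chunk in mic_chunks:                       # streaming microphone audio
        transcript += stt.transcribe(chunk)        # incremental transcription
        if vad.is_speech(chunk):                   # neural net: speech vs. not speech
            last_speech = time.monotonic()
        elif time.monotonic() - last_speech > SILENCE_THRESHOLD_S:
            break                                  # the silence algorithm ends the user's turn

    reply = llm.complete(transcript)               # chat completion on the finished transcript
    return tts.synthesize(reply)                   # agent audio streamed back to the user
```

Everything after the `break` only runs once the silence algorithm has declared the turn over, which is exactly why a long thinking pause looks like an end of turn to this kind of system.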
In the next part of my presentation, I want to dig into what we can learn from linguistics and academic research about how turn taking works in human conversations. One line I really liked from a paper I read was that turn taking in human conversation is a psycholinguistic puzzle: we respond in about 200 milliseconds, but the process of finding the words, generating speech, and articulating it takes about 600 milliseconds. So how can the listener start returning an answer so quickly, when it takes much longer to actually generate the speech? The answer is that there has to be prediction going on. The listener is predicting when the end of turn is going to occur, and they start to generate speech before that end of turn.

And what are the primary inputs to that prediction? The most important one is semantics: the content of what the person is saying. Other inputs into this prediction algorithm in our heads, when we're trying to predict when the speaker is going to finish speaking, are the syntax (the structure of the sentence), the prosody (the expressiveness and the tone), and also visual cues.

Like most things in how the human mind works, we don't actually know for sure; it's complicated. But one of the generally accepted models is what I'm going to walk through now: how turn taking works in the human mind, broken up into three stages. The first stage is semantic prediction. What the listener is doing, and you'll notice this as you speak to other people at the conference if you pay attention to your own thought process, is constantly inferring the intended message of the person that's speaking to you. Before they finish speaking, you're figuring out: wait, what are they trying to say? And then you're using that prediction of what they're trying to say to predict when the end of utterance will occur. And you're not doing this just once. You're doing it multiple times, again and again, constantly updating the prediction as the speaker keeps going. That's the first stage. Then, in the next stage, once you have a general idea and your prediction of when the end of utterance will occur starts coming true, you start refining that endpoint prediction based on both the semantics and the syntax. And then, as the speaker gets really close to the end of turn, the listener finalizes the prediction using prosody: the tone and other acoustic features. So it's three steps: a semantic prediction, a refinement, and a finalization.
One of the things I want to point out is that the human mind is full duplex: we're both processing input and starting to generate output at the same time. I think that's really nicely described in a figure from this paper. On the x-axis we have time, where zero milliseconds is the end of the speaker's turn, and the different blocks represent the mental processes happening inside the mind of the listener. You can see that well before the end of the turn there's a whole comprehension track going on, where the listener is inferring the intended message of the speaker and making predictions about when they're going to finish speaking. And at the same time, there's also a production, or generation, track where the listener is starting to produce what they're going to say. We're going to talk more about full duplex models in silico, rather than in human minds, in a little bit.

Oh, Jordan Deersley just texted me. [Laughter] He texted "boo." How does he know to boo me? He's not even here. Okay, it's all good. We'll keep rolling.

So let's contrast this really interesting, complex process going on in the human mind with current voice AI systems. You'll see it's just so much simpler: it's just speech or not speech. It's looking backwards; it's not making a prediction. It's done in serial; nothing's happening in parallel. And that's part of the problem, right, of why these interruptions are happening.

So I'm going to talk through three types of models and the approaches people are using with each. The prevailing way of building voice AI agents is the cascading system of models we talked about earlier: speech-to-text, VAD, LLM, TTS. The new approaches to better handling interruptions there augment the VAD with models that look at the semantics, syntax, or prosody. I want to jump into an example. I really have too much content for my allotted time that I'm looking at down here, but maybe I can just take some time from Jordan.

Let me give an example of one of these semantic-type models used to augment the VAD. I'm going to talk about our model at LiveKit. It's a text-based semantic model. What we're doing is taking the last four turns of the conversation as input: the voice AI agent's turn, then the user's turn, then the voice AI agent's turn, and then the user's current turn. Those are the inputs to a transformer model, and because it's an LLM, the token we're predicting is the end-of-utterance token. If, based on the context and the semantics of that input, the model says the end of turn hasn't happened yet, then we extend the silence-algorithm part of the VAD and say: don't trigger the end of turn, wait longer. So they work in concert. That's generally the idea of how it works.
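As a rough sketch of that augmentation pattern (my own illustration of the idea as described, not LiveKit's actual model or code; the threshold and timeout values are assumptions):

```python
EOU_THRESHOLD = 0.85   # assumed confidence required to treat the user's turn as finished
MAX_SILENCE_S = 3.0    # assumed longer wait when the model thinks the thought is unfinished

def silence_timeout(chat_context, eou_model, base_timeout_s=0.5):
    """Stretch the VAD's silence window based on a semantic end-of-utterance model.

    chat_context: the last four turns (agent, user, agent, current user turn) as text.
    eou_model:    a transformer returning P(end of utterance | context).
    """
    p_done = eou_model.predict(chat_context)   # probability the user has finished speaking
    if p_done >= EOU_THRESHOLD:
        return base_timeout_s                  # semantics agree: keep the normal half-second wait
    # Semantics say the thought is unfinished (e.g. it ends with "and then I..."),
    # so tell the silence algorithm to wait longer before handing the floor to the agent.
    return MAX_SILENCE_S
```

In this framing, the semantic model never ends the turn by itself; it just tells the silence algorithm how patient to be.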
I'm going to walk through a quick demo of how this works in action. In the first part of this demo, Shane is going to talk to a voice agent that's just using the traditional VAD I was discussing earlier, the one that only looks at speech or not speech. In the second half of the demo, he's using our semantic end-of-utterance model.

[Demo audio]
Shane: I needed to build a demo, but I wasn't... 100% sure what it was that I should... I had to build a demo for.
Agent: What was the...
Shane: A LiveKit, uh, turn detection... a turn detection demo where one of the agents, like...
Agent: Got it. What challenges did you face while building that demo?
Shane: ...kept on interrupting me. Being interrupted by the agent constantly. That was the worst part, for sure.
Agent: I understand. How did you overcome that challenge during the demo?
Shane: Well, hopefully we're going to overcome the challenge by using the new turn detection model that LiveKit's offering. So, let's try that one now instead. Hey, can you interview me about a time when I had to build a demo?
Agent: Absolutely. Can you share your experience with building a demo?
Shane: Yeah, definitely. So, I needed to build a demo, but I wasn't... I wasn't 100% sure what it was that I should build. And then I was thinking probably the best way to show that would be side by side.

Yeah, thank you. I'll let Shane know that you all applauded his demo; he'll appreciate that. You can really see it's a night-and-day difference when you augment the VAD with models that look at the semantics, syntax, and prosody, which is a good segue to my next slide.

There's another approach people are taking to augmenting the VAD: not just taking the semantic, text-based input, but also looking at the audio signal and trying to infer things from the acoustic features of the dialogue. The basic idea is that the input is audio tokens and the output is the probability that the user has finished speaking. Quinn and the Daily team have built their smart-turn model, which is also open weight, and it's this neat combination of a transformer that also looks at acoustic characteristics. And one of the new things that has just emerged: Assembly AI dropped their new streaming speech-to-text service earlier this week, and their model is really neat in that it takes audio in and emits both the transcript and a likelihood that the speaker has finished speaking. So it's one model doing both of these things at the same time, and it's looking at both the acoustic features and the semantic features. (QAI also has one that they recently released that's pretty neat.) One thing I want to note about this, though, is that if you're using your speech-to-text's built-in end-of-utterance model, it's only seeing half the context. It's only seeing what the user is saying, not what the agent is saying, so it doesn't quite have the full picture of the context. But it works remarkably well. All these approaches work remarkably well and are a major step forward from the more traditional VADs, and they're definitely something that, if you're building a voice AI agent after this conference, you should go implement. They're pretty easy to implement on the different platforms, too.
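As an illustration of how an agent might consume a stream like that, here is a hypothetical sketch; the event shape and the `agent` methods are assumptions for illustration, not AssemblyAI's or any vendor's actual API:

```python
from dataclasses import dataclass

@dataclass
class SttEvent:
    """Hypothetical event from a streaming STT service that also scores end of turn."""
    transcript: str          # words recognized so far in the user's current turn
    end_of_turn_prob: float  # estimate that the speaker has finished, from acoustics + text

def on_stt_event(event: SttEvent, agent, threshold: float = 0.8) -> None:
    """Commit the user's turn when the combined acoustic/semantic score is confident enough."""
    agent.update_transcript(event.transcript)
    if event.end_of_turn_prob >= threshold:
        agent.commit_user_turn()   # hand the finished transcript to the LLM
    # Otherwise keep listening. Note the caveat from the talk: this kind of model only
    # sees the user's side of the conversation, not the agent's turns, so it has half the context.
```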
Okay. We often talk in voice AI about how speech models are going to save us, how speech-to-speech models, audio in and audio out, are going to save us. But actually, if you look at how these models work, like OpenAI's Realtime API, they're still using a VAD on the internals. So they're still just looking at speech or not speech. Or you can opt to turn on what they call semantic VAD, which is kind of a paradox and not the best term, but it means turning on a semantic model that augments the VAD. So, to answer the title of my talk, why ChatGPT Advanced Voice Mode keeps interrupting you: it's because it thinks you're done speaking based on how long it's been since you last said a word, or based on what you've said previously, and that's just not quite cutting it when those interruptions happen. I want to bring up, too, that this is an ongoing problem; across all the different approaches, nothing has perfected it yet. Ours at LiveKit doesn't either. And although we power the transport, the audio-layer transport, for Advanced Voice Mode, OpenAI is not using our end-of-utterance model.

The next topic I want to cover, and I'm running out of time here, is full duplex models. These are really neat. A full duplex model is more like a human mind: it's processing input and generating speech at the same time. As far as I know, there aren't really any commercial applications of these yet, but fundamentally they're intuitive talkers; they're trained on the raw audio data. The analogy I like to use is computer vision. In the early days of computer vision, we were hand-writing algorithms to try to recognize a stop sign based on its color, the number of sides, and so on, and it just didn't work very well. But when we started giving the raw image data to the neural network and let the neural network figure it out, all of a sudden it just started working. What we learned from computer vision really helped us emerge from the AI winter; it was a major seed for where we are now with AI. It's a similar analogy with full duplex models: we're handing them the raw audio data and letting them figure out how turn taking works, rather than trying to hand-write all the rules. The downside of these models is that they're really optimized for being good at turn taking, but they're kind of dumb LLMs. They're small models, they're not trained on a lot of data, and they can't do instruction following very well.

Just to give you a more specific sense of how these models work, let's talk about the Moshi model. What really made it concrete for me is the idea that it's always listening to input and always generating output; even when it's not its turn to speak, it's emitting natural silence. It's basically emitting silence that you can't hear, but it's still always emitting. So it's always doing both, just like a human. Sync LLM, which is Meta AI's full duplex experimental mode that you can access inside the app, is also a full duplex model. Something neat I want to bring up about Sync LLM is that in the internals of the model, they're forecasting what the user is saying about five tokens, or 200 milliseconds, ahead, which is closer to what humans are doing, except we're forecasting over a much longer time frame.
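A conceptual sketch of that always-listening, always-emitting loop (illustrative only; `model.step` and its state are assumptions for the sake of the example, not Moshi's or Sync LLM's actual interface):

```python
def full_duplex_loop(model, mic_frames):
    """Conceptual full-duplex loop: one model both listens and speaks on every audio frame."""
    state = model.initial_state()
    for user_frame in mic_frames:                   # always processing input...
        agent_frame, state = model.step(user_frame, state)
        # ...and always producing output. When it isn't the agent's turn,
        # agent_frame is just "natural silence" rather than no frame at all.
        yield agent_frame
```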
And then lastly, my predictions for the future of how we'll solve this problem. I think full duplex models are neat, but I don't think they're going to solve the problem. For real production, commercial use cases of voice AI, we need more control, including more control over how the agent says things like brand names. Instead, I think what's going to happen is we're going to get smarter and smarter VAD augmentations and faster and faster models in the cascade pipeline, and we're just going to have more budget to work with to do a good job on this sort of thing. The reason I think that's true is that computers don't do math the same way humans do; they don't have the same conceptual way of thinking about it. LLMs think differently than us. And similarly, I wouldn't expect voice AI to use the same mechanisms as the human mind to generate speech and to talk. Thank you all for your attention. This was fun. Really appreciate it.

We do have some time, so I don't know if you want to take Q&A. I would love to. We could do that. And I could start with the first question: the demo you showed, there wasn't any response at the end, right? I cut off the demo. It's actually a two-minute demo and I only had 18 minutes to speak, so I truncated it on both sides. Because I was like, okay, maybe you just turned everything off and it was an impressive demo: no interruptions, no speaking. Yeah. Do you have the end of the demo? Do you want to show it? No, no worries. It's more of the same idea, but what you could see is that Shane was taking his time talking, really pausing and thinking, and it wasn't interrupting him, and then eventually it would find his end of turn based on the context. Cool. Can we find it on your Twitter? Yeah, it's on our LiveKit Twitter. Awesome. So we can look that up on the LiveKit Twitter. Awesome. Yeah, we can take some questions. I saw you had one.

Hi. How important are visual cues for turn detection in the human context? And is there any development to replicate that in the voice AI context as well? Yeah, it's a really neat question: how important are visual cues, and are people working on integrating them into the turn-taking intelligence for avatars and real-time experiences? Despite the fact that we are visual animals, and visual is the most visceral input for us, visual cues are actually pretty low down the stack of predictors for when the end of turn will be. It really is semantics. That's one of the main messages I want to convey to people from this talk: the content of what people are saying is the main thing we use to predict when they're going to finish speaking. The visual cues and these other signals are ancillary to it. I'm sure somebody's working on building something really cool that's multimodal and uses visual cues to infer the end of turn; I just haven't seen it yet and can't keep up with all the AI stuff on the internet.

What is the average cost for a voice-generated call, and what is the effect when you try to keep regenerating the response? So the question is: what's the average cost for a voice AI call, and what is the cost when you keep trying to regenerate the response?
I would first say, regarding what it costs to keep regenerating the response: the thing that's most expensive in the pipeline tends to be the text-to-speech. There are all these optimizations you can do in the cascade, and if you end up hitting the LLM multiple times within a turn, it's not all that costly. There are some really neat calculators online; I'm personally not a voice AI agent builder and those unit economics don't directly affect me, so I don't have the numbers off the top of my head, but there are some really nice calculators. It's going to depend on how long the conversation is and that sort of thing.

Yeah, thanks for the demo. That was great. The question is about the new model you just showed, which blew us away. One: why is ChatGPT not using that model to improve their stuff? Two: is it available for us to use now if I were to build a voice bot on LiveKit? And three: during your development, did you also do some kind of benchmarking with users to see if it's, say, 50% better or something like that? Yes. So the first question is why OpenAI isn't using our end-of-utterance model. I don't know why they're not; I think that's maybe above my pay grade at this company I joined four weeks ago. The second question is whether our end-of-utterance model is available for use. It's really easy to follow our quick start on our website and build a voice AI agent that you can talk to, and it's just one more line in the pipeline that you build; you add one more line and you get to use our end-of-utterance model. It's open weight, you don't have to pay for it, it's just baked in, and our docs show you how; I think by default it's in there. And the third question... remind me of the third question, I'm sorry. Ah yes, benchmarking. We have benchmarks where we have our test data set, and the numbers of course look great. But I was on a long call with our machine learning team this morning where we spent a lot of time talking about how to get a good data set for benchmarking this, and it's just a really tough problem. I feel like the industry as a whole doesn't have a good benchmark around turn taking, and that's something I'm sure will eventually emerge.

Okay, we'll do one last question, I think. Yes, you in the back. Not great to be sitting in the back if you want to ask questions, but thank you, Tom, for the demo and the presentation. I've got a question related to back channels. How did you tackle the back-channel challenge in the turn-taking detection problem? First, for a natural conversational AI, the back channel is something that cannot be ignored, and sometimes it causes trouble for the voice agent in detecting whether it's the endpoint. And second, a typical back channel like "yeah" or "yes" can occur as a back channel, or it can be the start of an actual turn where that speaker goes on to respond in the following period. So how the back channel is handled is pretty important in this field.
So the question was not about the case where the AI is interrupting the human, but the case where the human is accidentally interrupting the AI. I didn't really cover that in my talk; I was mostly focusing on the AI interrupting the human. Our approach there is simple: we're just using the Silero, or normal, VAD approach of, if the person has been speaking for more than X milliseconds, assume it's not a back channel and that they're actually trying to interrupt the voice AI. But one of the things we want to build is another machine learning model that can recognize the difference between a back channel and someone trying to interrupt the voice AI. One quick note on the full duplex models: the Meta AI one can natively back channel, because it learned from the raw audio data. So when you're talking to it, it'll go "mhm," "uh-huh," which is just so neat. But yeah, the back-channeling thing is a tough problem.

Awesome. Yeah, if you have more questions, you can find Tom. Please give another warm round of applause for Tom. Thank you all. [Music]