Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit

Channel: aiDotEngineer

Published at: 2025-07-31

YouTube video id: 1v9zBiZKlIY

Source: https://www.youtube.com/watch?v=1v9zBiZKlIY

I'm going to jump right into it. I'll be talking about voice AI's interruption problem, then about how we currently handle interruptions and turn taking in voice AI, then what we can learn from the study of human conversation about how humans handle turn taking, and then about some of the really neat new approaches out there for handling turn taking and interruptions in voice AI agents.

Interruptions are the biggest problem in voice AI agents right now. When you're talking to ChatGPT Advanced Voice Mode and it interrupts you, it's annoying. But when a patient is talking to a voice AI dental assistant and it interrupts the patient, the patient hangs up and the dentist stops paying the voice AI developer. This is our collective problem, one we all have to solve.
And the problem is that turn taking is just hard. Let me first define what turn taking is: it's the unspoken system we have for deciding who controls the floor between speakers during a conversation. It's fundamentally hard because turn taking happens really fast in human conversation, and there's no one size that fits all. There's a really cool study I pulled this data from, where they looked at how long it took a listener to start responding after the speaker finished speaking, across different cultures. You can see that the Danes take a relatively long time to start speaking after the other speaker finishes, while the Japanese do it almost instantaneously. So there are differences across cultures; that's part of what makes it hard. There are also differences across individuals. I'm one of those people who takes a long time to respond. Even before I got into voice AI, people would sometimes comment, "Are you gonna respond?" And I'm like, "Yeah, yeah, thinking about it." And even though I'm one individual, there's a lot of variability in how quickly I respond: if you make me angry, I'm probably going to respond quicker.
So it's just a hard problem. In the next slide, for people who are not very familiar with how voice AI agent pipelines work, I'm going to give a simplified overview of how we currently handle turn taking and interruptions in voice AI agents. The user starts speaking; that's the speech input, and those audio chunks are passed to a speech-to-text model, which transcribes the audio into a transcript. The next step is something called a VAD, a voice activity detector, which determines whether or not the user has finished speaking; I'll go more into that in a moment. If the user has finished speaking, the transcript is passed to an LLM, and the LLM produces its chat completion. That completion is streamed out to a text-to-speech model, where it's converted into audio, and that audio, which is now the voice of the AI agent, is passed back to the user.
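To make that flow concrete, here is a minimal sketch of the cascaded loop in Python. The four component classes are hypothetical stubs standing in for real streaming STT, VAD, LLM, and TTS services; only the order of the stages described above is the point.

```python
import asyncio

# Hypothetical stand-ins for real streaming services; only the control flow matters here.
class SpeechToText:
    async def transcribe(self, chunk: bytes) -> str:
        return chunk.decode() + " "          # stub: pretend each chunk is its own transcript

class VoiceActivityDetector:
    def user_finished(self, chunk: bytes) -> bool:
        return chunk == b""                  # stub: an empty (silent) chunk ends the turn

class LLM:
    async def complete(self, transcript: str) -> str:
        return f"(reply to: {transcript.strip()})"   # stub chat completion

class TextToSpeech:
    async def synthesize(self, text: str) -> bytes:
        return text.encode()                 # stub: "audio" bytes for the agent's reply

async def agent_turn_loop(audio_chunks):
    stt, vad, llm, tts = SpeechToText(), VoiceActivityDetector(), LLM(), TextToSpeech()
    transcript = ""
    for chunk in audio_chunks:                        # 1. user speech arrives as audio chunks
        transcript += await stt.transcribe(chunk)     # 2. speech-to-text transcribes each chunk
        if vad.user_finished(chunk):                  # 3. VAD decides the user's turn is over
            reply = await llm.complete(transcript)    # 4. the transcript goes to the LLM
            agent_audio = await tts.synthesize(reply) # 5. the completion becomes agent audio
            print("agent says:", agent_audio.decode())
            transcript = ""                           # reset for the next user turn

asyncio.run(agent_turn_loop([b"book me", b"a cleaning", b""]))
```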
Let's dig more into the voice activity detection system. It's a system with two main parts. The first is a machine learning model, a neural network, that detects whether or not somebody is speaking. I shouldn't call it simple, because it's a really neat model, but ultimately it's just classifying speech versus not speech. The second part is a silence algorithm, which says: if the person hasn't spoken for more than, say, half a second, they're done speaking and it's time for the agent to start speaking. Most production voice AI systems today use that sort of VAD. That's changing, though; we're building all sorts of new and interesting things that I'll cover later in the presentation.
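Here is a minimal sketch of that second part, the silence algorithm, assuming a `speech_prob` callable stands in for the neural VAD and frames arrive every 30 ms. The half-second threshold matches the example above; the exact numbers are assumptions that production systems tune.

```python
SILENCE_THRESHOLD_S = 0.5   # "hasn't spoken for more than half a second" -> end of turn
FRAME_DURATION_S = 0.03     # assumed 30 ms audio frames
SPEECH_PROB_CUTOFF = 0.5    # VAD scores above this count as speech

def end_of_turn_flags(frames, speech_prob):
    """Yield True for each frame once the silence algorithm declares end of turn.

    `speech_prob(frame)` stands in for the neural VAD: the probability that the
    frame contains speech. The silence timer resets whenever speech is detected.
    """
    trailing_silence = 0.0
    for frame in frames:
        if speech_prob(frame) >= SPEECH_PROB_CUTOFF:
            trailing_silence = 0.0                 # speech resets the timer
        else:
            trailing_silence += FRAME_DURATION_S   # accumulate trailing silence
        yield trailing_silence >= SILENCE_THRESHOLD_S

# Usage sketch: any pause longer than ~0.5 s hands the floor to the agent,
# even mid-sentence, which is exactly the interruption problem.
```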
Next, I want to dig into what we can learn from linguistics and academic research about how turn taking works in human conversations. One line I read in a paper, and really liked, is that turn taking in human conversation is a psycholinguistic puzzle: we respond in about 200 milliseconds, but the process of finding the words, generating speech, and articulating it takes about 600 milliseconds. So how can the listener start returning an answer so quickly when it takes much longer to actually generate the speech? The answer is that there has to be prediction going on. The listener is predicting when the end of turn is going to occur, and they start to generate speech before that end of turn.
And what are the primary inputs to that prediction? The most important one is semantics: the content of what the person is saying. Other inputs into this prediction algorithm in our heads, when we're trying to predict when the speaker will finish speaking, are the syntax (the structure of the sentence), the prosody (the expressiveness, the tone), and also visual cues.

Like most things about how the human mind works, we don't really know for sure; it's complicated. But one of the generally accepted models is what I'm going to walk through now: how turn taking works in the human mind, broken up into three stages.
The first stage is semantic prediction. What the listener is doing, and you'll notice this as you speak to other people at the conference if you pay attention to your own thought process, is constantly inferring the intended message of the person speaking to you. Before they finish speaking, you're figuring out, wait, what are they trying to say? You then use that prediction of what they're trying to say to predict when the end of the utterance will occur. And you're not doing this just once; you do it again and again, constantly updating the prediction as the speaker keeps going. That's the first stage. In the next stage, once you have a general idea and your prediction of when the end of utterance will occur starts coming true, you refine that endpoint prediction based on both the semantics and the syntax. And then, as the speaker gets really close to the end of the turn, the listener finalizes the prediction using prosody: the tone and other acoustic features. So it's three steps: a semantic prediction, a refinement, and a finalization.

One thing I want to point out is that the human mind is full duplex: we're processing input and starting to generate output at the same time. That's nicely illustrated in a figure from this paper. On the x-axis is time, where zero milliseconds is the end of the speaker's turn, and the different blocks represent the mental processes happening inside the mind of the listener. You can see that well before the end of the turn there's a whole comprehension track going on, where the listener is inferring the intended message of the speaker and predicting when they're going to finish speaking. And at the same time there's a production, or generation, track, where the listener is starting to produce what they're going to say.
We're going to talk more about full duplex models in computers, in silico rather than in human minds, a little later in the presentation. Oh, Jordan Deersley just texted me.
[Laughter]
He texted "boo." How does he know to boo me? He's not even here. Okay, it's all good, we'll keep rolling.

So let's contrast this really interesting, complex process going on in the human mind with current voice AI systems. You'll see it's so much simpler: it's just speech or not speech. It's looking backwards; it's not making a prediction. It's done in serial; nothing happens in parallel. So it's much simpler, and that's part of the problem, part of why these interruptions are happening.
I'm going to talk through three types of models and the approaches people are using with each. The prevailing way to build voice AI agents is the cascading system of models we talked about earlier: speech-to-text, VAD, LLM, TTS. The new approaches to better handling interruptions there augment the VAD with models that look at the semantics, syntax, or prosody. I want to jump into an example of that. I really have too much content for my allotted time, which I'm watching down here, but maybe I can just take some time from Jordan.
some time from Jordan. Um, so, uh, let
me let me, um, give an example for one
of these semantic type models that is
used to augment VAD. I'm going to talk
about our model at LiveKit. It's a
textbased semantic model. So, what we're
doing is we're taking the last four
turns of the conversation as input. So,
that means that it's the uh, the voice
AI agent's turn, then the user's turn,
then the voice AI agents turn, and then
the user's current turn. Those are the
inputs into to a transformer model and
what we're the token that we're
predicting um because this is an LLM
we're predicting the end of utterance
token and if that end of utterance token
based on this the content right based on
the the the context and the semantics of
that that input um if the end of
utterance token is saying that the end
of turn it hasn't happened yet then we
don't we extend the silence algorithm
part of the that and say don't trigger
the end of turn wait longer. Um, so they
work in concert. That's generally the
idea of how it works. I'm going to walk
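Here is a minimal sketch of that "work in concert" idea, not LiveKit's actual implementation: `eou_probability` is a placeholder for the transformer scoring the last four turns, and the thresholds are assumptions.

```python
BASE_SILENCE_S = 0.5   # how long the VAD normally waits before ending the turn
MAX_SILENCE_S = 3.0    # how long to wait when the utterance looks unfinished
EOU_CUTOFF = 0.8       # assumed threshold on the end-of-utterance probability

def silence_timeout(last_four_turns, eou_probability):
    """Pick the VAD's silence timeout from the semantic end-of-utterance score.

    `last_four_turns` is the context described above (agent, user, agent, and
    the user's current turn). `eou_probability` stands in for the transformer
    that predicts the end-of-utterance token from that context.
    """
    p_done = eou_probability(last_four_turns)
    if p_done >= EOU_CUTOFF:
        return BASE_SILENCE_S   # content looks complete: end the turn promptly
    # Content looks incomplete (e.g. "my phone number is..."): tell the silence
    # algorithm not to trigger the end of turn yet and wait longer.
    return MAX_SILENCE_S

# Usage sketch:
#   end_of_turn = trailing_silence >= silence_timeout(context, eou_model.predict)
```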
Now I'm going to walk through a quick demo of how this works in action.

[Demo audio] "...needed to build a demo, but I wasn't..."

In the first part of the demo, Shane talks to a voice agent that uses just the traditional VAD I was discussing earlier, the one that only looks at speech or not speech. In the second half of the demo, he's using our semantic end-of-utterance model.
Shane: ...100% sure what it was that I should... what I had to build a demo for.
Agent: What was the...
Shane: A LiveKit... a turn detection... a turn detection demo where one of the agents, like...
Agent: Got it. What challenges did you face while building that demo?
Shane: ...kept on interrupting me. Being interrupted by the agent constantly, that was the worst part for sure.
Agent: I understand. How did you overcome that challenge during the demo?
Shane: Well, hopefully we're going to overcome the challenge by using the new turn detection model that LiveKit's offering. So, let's try that one now instead. Hey, can you interview me about a time when I had to build a demo?
Agent: Absolutely. Can you share your experience with building a demo?
Shane: Yeah, definitely. So, I needed to build a demo, but I wasn't... I wasn't 100% sure what it was that I should build. And then I... I was thinking probably the best way to show that would be... would be side by side.
Yeah, thank you. I'll let Shane know that you all applauded his demo; he'll appreciate that. You can really see it's a night-and-day difference when you augment the VAD with models that look at the semantics, syntax, and prosody, which is a good segue to my next slide. There's another approach people are taking to augmenting the VAD: not just taking the semantic, text-based input, but also looking at the audio signal and trying to infer things from the acoustic features of the dialogue.
The basic idea is that the input is audio tokens and the output is the probability that the user has finished speaking. Quinn and the Daily team have built their open-weight Smart Turn model, which is a neat combination of a transformer that also looks at acoustic characteristics. And one of the new things that has just emerged: Assembly AI dropped their new streaming speech-to-text service earlier this week. Their model is really neat in that it takes audio in and emits both the transcript and a likelihood that the speaker has finished speaking, so it's one model doing both of those things at the same time, looking at both the acoustic features and the semantic features. QAI also has one that they recently released that's pretty neat. One thing I want to note, though, is that if you're using your speech-to-text's built-in end-of-utterance model, it's only seeing half the context: it's only seeing what the user is saying, not what the agent is saying, so it doesn't quite have the full picture. But it works remarkably well. All these approaches work remarkably well and are a major step forward from the more traditional VADs, and they're definitely something you should go implement if you're building a voice AI agent after this conference. They're pretty easy to implement on the different platforms too.
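As a rough illustration of that combined-signal pattern, here is a sketch of a consumer for a streaming speech-to-text that also reports an end-of-turn likelihood. The event shape and field names are hypothetical, not Assembly AI's (or anyone's) actual schema, and as noted above it only sees the user's side of the conversation.

```python
from dataclasses import dataclass

@dataclass
class SttEvent:
    """Hypothetical streaming STT event: a transcript fragment plus the model's
    estimate that the speaker has finished. Field names are illustrative only."""
    text: str
    end_of_turn_prob: float

END_OF_TURN_CUTOFF = 0.85   # assumed threshold

def user_turns(events):
    """Accumulate transcript fragments until the STT's own end-of-turn score fires,
    then hand the completed user turn to the next stage (the LLM)."""
    turn = []
    for event in events:
        turn.append(event.text)
        if event.end_of_turn_prob >= END_OF_TURN_CUTOFF:
            yield " ".join(turn)
            turn = []

# Usage sketch
stream = [SttEvent("I need to", 0.10), SttEvent("reschedule my cleaning", 0.93)]
print(list(user_turns(stream)))   # -> ['I need to reschedule my cleaning']
```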
Okay. We often say in voice AI that speech-to-speech models, audio in and audio out, are going to save us. But if you look at how these models actually work, like OpenAI's Realtime API, they're still using a VAD internally, so they're still just looking at speech or not speech. Or you can opt to turn on what they call semantic VAD, which is kind of a paradox and not the best term, but it means turning on a semantic model that augments the VAD. So to answer the title of my talk, why ChatGPT Advanced Voice Mode keeps interrupting you: it's because it thinks you're done speaking based on how long it's been since you last said a word, or based on what you've said previously, and that's just not quite cutting it when those interruptions happen. It's a problem that is not totally solved; I want to bring that up too. This is an ongoing problem, and none of the different approaches has perfected it yet. Our end-of-utterance model at LiveKit doesn't either, and although we power the audio transport layer for Advanced Voice Mode, OpenAI is not using our end-of-utterance model.
The next topic I want to cover, and I'm running out of time here, is full duplex models. These are really neat. A full duplex model is more like a human mind: it's processing input and generating speech at the same time. As far as I know, there aren't really any commercial applications of these yet, but fundamentally they're intuitive talkers, trained on raw audio data. The analogy I like to use is computer vision. In the early days of computer vision, we hand-wrote algorithms to try to recognize a stop sign based on its color, the number of sides, and so on, and it just didn't work very well. But when we started giving the raw image data to a neural network and letting the network figure it out, all of a sudden it just started working. What we learned from computer vision really helped us emerge from the AI winter; it was a major seed for where we are now with AI. It's a similar story with full duplex models: we hand them the raw audio data and let them figure out how turn taking works, rather than trying to hand-write all the rules. The downside of these models is that they're optimized to be really good at turn taking but they're kind of dumb LLMs: they're small models, they're not trained on a lot of data, and they can't do instruction following very well.
To give you a more specific sense of how these models work, let's talk about the Moshi model. What really made it concrete for me is the idea that it's always listening to input and always generating output, and even when it's not its turn to speak, it's emitting natural silence: silence you can't hear, but it's always emitting it. So it's always doing both at once, just like a human. Sync LLM, which is Meta AI's full duplex experimental mode that you can access inside their app, is a similar full duplex model. Something neat about Sync LLM is that in the internals of the model they're actually forecasting what the user is saying about five tokens, or 200 milliseconds, ahead, which is closer to what humans are doing, except we forecast over a much longer time frame.
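Here is a conceptual sketch of that always-listening, always-emitting loop, built around a toy model; it isn't Moshi's or Sync LLM's real interface, just the shape of the idea.

```python
SILENCE_TOKEN = "<silence>"   # stands in for the inaudible "natural silence" output

class ToyFullDuplexModel:
    """Trivial stand-in: it stays silent until it has heard a question mark."""
    def __init__(self):
        self.heard_question = False

    def update(self, incoming_token: str) -> None:
        if incoming_token.endswith("?"):
            self.heard_question = True

    def generate(self) -> str:
        return "Sure!" if self.heard_question else SILENCE_TOKEN

def full_duplex_tick(model, incoming_token):
    """One tick of a full duplex loop: the model always consumes input and always
    produces output, even when that output is silence nobody can hear."""
    model.update(incoming_token)   # keep listening on every tick
    out = model.generate()         # keep generating on every tick
    return None if out == SILENCE_TOKEN else out

model = ToyFullDuplexModel()
for token in ["can", "you", "help", "me?"]:
    audible = full_duplex_tick(model, token)
    print(token, "->", audible or "(silence)")
```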
And then lastly, my predictions for the future of how we'll solve this problem. I think full duplex models are neat, but I don't think they're going to solve it. For real production, commercial use cases of voice AI, we need more control, including control over how the agent says things like brand names. Instead, what I think will happen is that we'll get smarter and smarter VAD augmentations and faster and faster models in the cascade pipeline, so we'll have more budget to work with to do a good job on this sort of thing. The reason I think that's true is that computers don't do math the same way humans do; they don't have the same conceptual way of thinking about it. LLMs think differently than us. Similarly, I wouldn't expect voice AI to use the same mechanisms as the human mind to generate speech and to talk. Thank you all for your attention. This was fun. Really appreciate it.
We do have some time, so I don't know if you want to take Q&A. I would love to. We could do that, and I could start with the first question.

So, the demo you showed, there wasn't any response at the end, right? I cut off the demo. It's actually a two-minute demo and I only have 18 minutes to speak, so I truncated it on both sides. Because I was like, okay, maybe you just turned everything off and it was an impressive demo: no interruptions, no speaking. Yeah. Do you have the end of the demo? Do you want to show it? No, no worries. It's more of the same idea. But what you could see is that Shane was taking his time talking, really pausing and thinking, and it wasn't interrupting him, and then it would eventually find his end of turn based on the context. Cool. Can we find it on your Twitter? Yeah, it's on our LiveKit Twitter. Awesome. So, we can look that up on the LiveKit Twitter. Awesome. Yeah, we can take some questions. I saw you had one.
Hi. How important are visual cues for turn detection in the human context? And is there any development to replicate that in the voice AI context as well?

Yeah, it's a really neat question: how important are visual cues, and are people working on integrating them into turn-taking intelligence for avatars and real-time experiences? Despite the fact that we are visual animals, and visual is probably our most visceral input, visual cues are actually pretty low down the stack of predictors for when the end of turn will be. It really is semantics. That's one of the main messages I want to convey from this talk: the content of what people are saying is the main thing we use to predict when they're going to finish speaking. The visual cues and the other signals are ancillary to it. I'm sure somebody is working on something really cool that's multimodal and uses visual cues to infer the end of turn; I just haven't seen it yet, and I can't keep up with all the AI stuff on the internet. Yes.
What is the average cost for a voice AI generated call? And what is the effect when you keep regenerating the response?

So the question is: what's the average cost of a voice AI call, and what does it cost when you keep regenerating the response? On the second part, the way I want to answer it is that the most expensive thing in the pipeline actually tends to be the text-to-speech. There are all these optimizations you can do in the cascade, and if you end up hitting the LLM multiple times within a turn, it's not all that costly. There are some really neat calculators online; I'm personally not a voice AI agent builder and those unit economics don't directly affect me, so I don't have the numbers off the top of my head, but there are some really nice calculators. It's going to depend on how long the conversation is and that sort of thing.
Yeah, thanks for the demo. That was great. The question is about the new model you just showed, which blew us away. One: why is ChatGPT not using that model to improve their stuff? Two: is it available for us to use now if I were to build a voice bot on LiveKit? And three, during your development of that model, the demo is great, but do you also do some kind of benchmarking with users to see if it's, you know, 50% better or something like that?
Yes. So the first question is why OpenAI isn't using our end-of-utterance model. I don't know why they're not; that's maybe above my pay grade at this company I joined four weeks ago.
The second question is whether our end-of-utterance model is available for use. It is: it's really easy to follow our quickstart on our website and build a voice AI agent you can talk to, and it's just one more line in the pipeline you build. You add that one line and you get our end-of-utterance model. It's open weight, you don't have to pay for it, it's just baked in.
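For what it's worth, here is roughly what that one extra line looks like in a LiveKit Agents pipeline. This is from memory, so treat the exact plugin modules and class names (especially the turn-detector import) as assumptions and check the current LiveKit quickstart rather than copying it verbatim.

```python
# Rough sketch of a LiveKit Agents session with the end-of-utterance turn detector
# enabled. Plugin and class names are from memory and may differ from the current docs.
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, cartesia, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel  # assumed path

session = AgentSession(
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=cartesia.TTS(),
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),  # the "one more line" that adds the EOU model
)
```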
Our docs walk you through it; I think it's in there by default. And the third question... remind me of the third question? I'm sorry.
Ah yes, benchmarking.
So we have benchmarks, where we have our own test data set, and of course the numbers look great. But I was on a long call with our machine learning team this morning where we spent a lot of time talking about how to get a good data set for benchmarking this, and it's just a really tough problem. I feel like the industry as a whole doesn't have a good benchmark around turn taking, and it's something I'm sure will eventually emerge.
Okay, we'll do one last question, I think. So, yes, you in the back. Not great to be sitting in the back if you want to ask questions, but go ahead.

Thank you, Tom, for the demo and the presentation. I have a question related to back channels: how do you tackle the back-channel challenge in the turn-taking detection problem? First, for a natural conversational AI, the back channel can't be ignored, and sometimes it causes trouble for the voice agent in detecting whether it's the endpoint. And second, a typical back channel like "yeah" or "yes" can occur as a back channel, or it can be the start of an actual turn from whoever should respond next. So how the back channel is handled seems pretty important in this field.
So the question was not about the case where the AI interrupts the human, but the case where the human accidentally interrupts the AI. I didn't really cover that in my talk; I was mostly focused on the AI interrupting the human. Our approach there is simple: we just use the normal VAD approach of, if the person has been speaking for more than X milliseconds, assume it's not a back channel and that they're actually trying to interrupt the voice AI. But one of the things we want to build is another machine learning model that can recognize the difference between a back channel and someone trying to interrupt the voice AI. One quick note on the full duplex models: the Meta AI one can natively back channel, because it learned from the raw audio data, so when you're talking to it, it'll go "mhm," "uh-huh," which is just so neat. Yeah, it's a tough problem, the back-channeling thing.
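A minimal sketch of that duration heuristic, with an assumed cutoff value:

```python
BACKCHANNEL_MAX_S = 0.7   # assumed cutoff; "yeah" / "mm-hmm" style bursts fit under it

def is_real_interruption(user_speech_duration_s: float, agent_is_speaking: bool) -> bool:
    """While the agent is talking, treat short user speech as a back channel and
    only speech longer than the cutoff as a genuine attempt to interrupt."""
    return agent_is_speaking and user_speech_duration_s > BACKCHANNEL_MAX_S
```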
Awesome. Yeah, if you have more questions, you can find Tom. Please give another warm round of applause for Tom. Thank you all.