Building Effective Voice Agents — Toki Sherbakov + Anoop Kotha, OpenAI

Channel: aiDotEngineer

Published at: 2025-07-20

YouTube video id: -OXiljTJxQU

Source: https://www.youtube.com/watch?v=-OXiljTJxQU

Hello everyone. Thanks for coming to our talk. Just to briefly introduce myself: I'm Toki, I lead the solution architecture team in the US for OpenAI, and I have Anoop here as well from the solution architecture team. We'll talk about building practical audio agents. Before I jump in, a quick raise of hands: who has either built an audio or voice agent of some kind, or is thinking about building one?

Okay, so about half the room. Good, it's always good to know what people's interests are. So we'll talk a bit about some of the things we've learned, where audio models are today, and then some of the patterns and best practices.
To take a quick step back: GenAI applications to date, as many of you know, have been very text-based. These are models that deeply understand and generate text with human-like fluency, and there are a lot of production applications out there built on text that continue to scale. But we've really started to enter a more multimodal era. We now have models that understand images extremely well. There are models that generate images really well, if you were part of the image gen Studio Ghibli craze, or Sora, which generates videos. And now also audio. We've had models like transcription models or text-to-speech models for some time, but this emerging speech-to-speech capability is getting a lot better, and I think we'll now see much more of an explosion of use cases around these modalities of images, video, and audio. Today we're really going to focus on the audio modality.

One important thing we're starting to see is that audio is getting to the point where it's becoming much more usable at scale. Again, we've had the transcription and TTS side, but especially that speech-to-speech, low-latency experience.
The models are getting, I would say, good enough, and are at a tipping point where you can start to meaningfully build high-quality production applications at scale. Before, the models were slow, robotic, and brittle; now they're a lot faster, expressive, and accurate. I'll show an example. The first one shows what the models were like not even six months ago. It's a very painful example, so I hope you feel a bit of the pain in this.
>> Hello.
>> Hello. Please give me a topic for your haiku.
>> Uh...
Well, you can see the initial delay. That's actually, honestly...
>> Could you write me a haiku about whispers?
>> Silent air, faint words drift through darkened rooms. Secrets linger near. Dust slides through my hands. Sifted grains forget the weight. What is left behind?

So, I hope you felt the pain a little bit. These models, and this is not even six months ago, were incredibly slow. They're very robotic in how they sound, and they're quite brittle. In that example, we actually tried to interrupt the model and tell it, no, don't write a poem about whispers, write one about stars. It wrote one about sand; I'm not sure where it got that from. So that's actually where things were. Now, fast forward to today: here's how the models behave for a very similar use case.
>> Hello.
>> Share with me a topic and I shall weave a haiku for you, rich with emotion and imagery. What shall it be?
>> Can you make a haiku about whispers?
>> Whispers...
>> Actually, could you make a haiku about stars?
>> Stars, whispers of light adorning the velvet night, eternal secrets.
>> So, you can see it's much faster. It's incredibly expressive and emotional. You can steer it: you can get it excited, you can slow it down. And it's accurate. We interrupted it, we had it change course and do something else, and it responded well. So the models really are, we believe, at that good-enough tipping point where you can start to build much more reliable experiences around them.

In terms of actually building on top of the audio models, there are two emerging architectures that we see: one that's been around for a bit, and a newer one, which I'll talk about. The existing one is what we call the chained approach. You have audio come in, you transcribe it, you have an intelligence layer with an LLM, and then finally text-to-speech on the output. But again, this is stitching together three models, and there are problems with that. It's slow; it takes longer to actually generate your output audio. It's also lossy: you lose some of the semantic understanding of the conversation.
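As a rough illustration, a chained pipeline can be sketched as three sequential calls with the openai-python SDK; the model names and file handling here are illustrative assumptions, not a prescribed setup:

```python
# A minimal sketch of the chained architecture: transcribe -> reason -> speak.
# Model names and file names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

# 1) Speech-to-text: transcribe the incoming audio.
with open("user_turn.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    ).text

# 2) Intelligence layer: an LLM decides what to say.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript}],
).choices[0].message.content

# 3) Text-to-speech: synthesize the reply.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.write_to_file("agent_turn.mp3")  # non-streaming write, for simplicity
```

Three sequential model calls is exactly where the latency and lossiness come from: each stage waits on the previous one, and the LLM only ever sees the words, never the audio itself.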
So the emerging pattern is this speech-to-speech architecture, which collapses those three models into a single model, which for us is the Realtime API. It does all of those layers: it does the transcription, it does the intelligence layer, and finally it does the speech as the output. This simplifies the architecture quite dramatically and really reduces the time it takes to produce output, which is great for low-latency experiences. And it doesn't have that lossiness problem; you're actually able to maintain semantic understanding across a conversation. So this is really the emerging pattern we see.
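For comparison, here is a minimal speech-to-speech sketch, assuming the beta Realtime support in the openai-python SDK; the model snapshot name and session options are illustrative:

```python
# A minimal sketch of the speech-to-speech architecture: one model, one
# connection. Assumes openai-python's beta Realtime client; the model
# snapshot name is an illustrative assumption.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def main() -> None:
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as conn:
        # A single session handles listening, reasoning, and speaking.
        await conn.session.update(session={"modalities": ["audio", "text"]})
        await conn.conversation.item.create(
            item={
                "type": "message",
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Write a haiku about stars."}
                ],
            }
        )
        await conn.response.create()
        async for event in conn:  # stream audio/text events as they arrive
            if event.type == "response.done":
                break

asyncio.run(main())
```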
When building on top of these architectures, there are some key trade-offs and considerations to take into account, and we think they fall across five main areas: latency, cost, accuracy and intelligence, user experience, and integrations and tooling. Depending on what you're building, your trade-offs will be different. For example, if you're building a consumer-facing application, you care a lot about user experience: these are usually low-latency experiences that a lot of end users are interacting with directly, so user experience matters a lot, and latency matters a lot. It can't be slow; it needs to be really fast and responsive. But things like cost probably don't matter as much, and accuracy matters less; it's more about expressiveness and quick answers. And for integrations and tooling, you don't really need to integrate with SIP or Twilio for a broader experience; it's a pretty simple interaction. So the Realtime API works really well for a consumer-facing application, given those trade-offs.
Now, if you go to customer service as another example, the trade-offs are different. Here, accuracy and intelligence probably matter the most. You can't get an order number wrong; if the agent goes and deletes an order instead of updating it, that's a big deal. Integrations and tooling start to matter a lot more too, because now you have to integrate with your various internal systems, and with SIP and phone providers if needed. Those are probably the most important things. User experience still matters, but you'll probably take a small hit on it to make sure the agent is more accurate. Latency still matters, but not as much, because accuracy and intelligence matter more. And cost matters less, since you're already saving a large cost with this anyway. So those are the considerations to take into account for customer service. The Realtime API might make sense where latency matters most, but if you care about determinism, accuracy, and intelligence, then the chained architecture can make a lot of sense here too. It's just important to consider what your trade-offs are depending on what you're trying to build. Hopefully that paints a bit of a picture of what to consider. I'll hand it off to Anoop to talk about specific patterns we've seen when working with customers building on this.
>> Cool, awesome. Thanks, Toki. So, just taking a step back: I think we've all probably defined "agent" a bunch of times, but here's a canonical definition to hopefully keep everyone grounded for this session. Think of an agent as a model, a set of instructions (your prompts), the tools you give the model, and the runtime: the guardrails and your execution environment. Generally, when you're building voice agents, those are all the same things you should be worrying about, but there are a few other things that are a bit different, which we're going to focus on today. The main ones: your system design is probably a bit different than for a traditional LLM agent, it's a bit different on the prompting side and the customization of the voice, the tools you're using are probably a bit different, and there are also evals and guardrails you might have to think about. These are the things we'll focus on that are distinct when you're building voice agents specifically, compared to traditional agents.
Cool. So if you've built a text-based agent, let's use customer service as a canonical example. You have some user question, and you have some sort of triage agent: a small model that routes to other models that do the harder tasks. So you might have an o3 do the refund, or maybe an o4-mini do some of the cancellation. That's typically how you build a normal agent today. For voice, one pattern we're starting to see more and more of is delegation through tools. You have the Realtime API be your frontline agent, so it does a lot of the normal interaction patterns you're used to: it can respond to the easy questions that you want your model to answer and that you're confident in it answering. But there are also other questions that might be a bit harder, and for those you want your model to do a tool call to other agents that are run by smarter models under the hood. We'll show a quick video of how this works.
>> I've authenticated you now. What item would you like to return?
>> Yeah, I want to return the snowboard I bought.
>> Got it. I'm reviewing the policy and will get back to you with next steps shortly.

So, here's the scenario where we're actually using a much smaller model, o4-mini, and delegating the policy check to it.

>> It looks like the return you are making is within the 30-day return policy. We will process the return and send you a confirmation email with next steps shortly.
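Concretely, the delegation behind that demo can be as simple as a tool handler the frontline realtime agent calls while it speaks a holding line. A minimal sketch, assuming the Responses API; the function name, prompt, and model choice are illustrative:

```python
# Hypothetical tool handler: the frontline realtime agent says
# "I'm reviewing the policy..." and calls this tool, which delegates
# the actual policy reasoning to a smaller reasoning model.
from openai import OpenAI

client = OpenAI()

def check_return_policy(item: str, purchase_date: str) -> str:
    """Delegate the return-policy decision to o4-mini and return its answer."""
    result = client.responses.create(
        model="o4-mini",  # illustrative: the smarter model behind the tool
        input=(
            f"A customer wants to return: {item}, purchased on {purchase_date}. "
            "Our policy allows returns within 30 days. Is this return allowed? "
            "Answer in one or two sentences."
        ),
    )
    return result.output_text
```

Because the smarter model is slower, having the voice agent acknowledge the request first keeps the conversation from going silent while the tool call runs.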
>> So, that's one pattern we've seen on the architecture side. Another point I touched on is the way you customize your brand and what you're building, and the prompts you use are a huge part of that. The way to think about this: when you're prompting models in the text-based world, you can only control the instructions you give the model, what you want it to do. But in voice-based applications, you can also control the expressiveness of the voice, how you want the model to speak or sound. So here's an example prompt where you control the demeanor, the tone, the level of enthusiasm: all those things that don't really get captured in a text agent, but that you have much more control over in a voice-based agent.
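As a purely illustrative example of that kind of voice-specific prompting (the wording here is ours, not the slide's):

```python
# Illustrative voice-agent instructions: note how much of this is about
# delivery (demeanor, tone, pacing) rather than task logic.
VOICE_INSTRUCTIONS = """
Personality: warm, competent customer-service agent for an outdoor-gear store.
Demeanor: patient and reassuring; never rushed.
Tone: conversational and friendly, not overly formal.
Level of enthusiasm: moderate; engaged but calm.
Pacing: medium speed; slow down when reading back order numbers.
"""
```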
Here's just a fun website you can use to play around with this, openai.fm, where you can try different voices and click through sample prompts we've created. It's a great place to get started actually playing around with the expressiveness of voices, and to see how much fluctuation in performance you can get just from the prompts you use.
Another side of this: when you're prompting models in the text-based approach, you'll probably have a lot of few-shot examples, like "I want you to do these few steps, then these next few steps," walking through a typical conversation flow. You can mimic the same thing in voice. In this picture there's an id, say for a greeting state, a description of it, and then the instructions for the model, and you just walk through those states. This is a sample way to think through how you should prompt these models, especially for more complex tasks that you want the Realtime API to handle; a sketch of the structure follows below.
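A sketch of that structure, where the field names mirror the slide's id/description/instructions pattern and the content is illustrative:

```python
# Illustrative conversation-state prompt for a more complex flow: each state
# has an id, a description, instructions, and the transitions out of it.
CONVERSATION_STATES = [
    {
        "id": "1_greeting",
        "description": "Greet the caller and set expectations.",
        "instructions": [
            "Greet the caller warmly.",
            "Explain that you need to verify their identity before helping.",
        ],
        "transitions": [
            {"next_step": "2_verify_identity", "condition": "After the greeting"}
        ],
    },
    {
        "id": "2_verify_identity",
        "description": "Collect and confirm the caller's account details.",
        "instructions": [
            "Ask for the caller's name and order number.",
            "Repeat the order number back digit by digit to confirm it.",
        ],
        "transitions": [
            {"next_step": "3_handle_request", "condition": "Once verified"}
        ],
    },
]
```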
The next big bucket of things that are important, or maybe a bit different, for the voice agents you're building is tools. Tool use in general is very important because you want your model to connect to the other systems that matter for you. The first step is honestly to start simple: you don't need to connect 10 or 20 tools to a given model. Keep the number of tools pretty limited for a given agent. That goes back to the first diagram we showed: you want to delegate to different models or agents for different tasks. So don't give a given agent a ton of tools to start with; slowly add more tools as needed. Next, handoffs. If you've ever used the Agents SDK, this is the concept of passing from one agent to another, and you can manage context between handoffs as well. The thing to remember is that when you move from one agent to another, you want to keep the context the same between them. A good pattern is to summarize the conversation state and pass it to the other agent, so you don't lose any of the context and it's not lossy; there's a small sketch of this below. A final one is delegation: the concept we showed earlier of using o3-mini or o4-mini to do the tool call. That's natively built into the Agents SDK if you'd like to try it out.
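A minimal handoff sketch, assuming the openai-agents Python package; the agent names, models, and the summarize-before-handoff instruction are illustrative:

```python
# A minimal sketch of handoffs with the OpenAI Agents SDK (openai-agents).
# Summarizing conversation state before the handoff keeps context from
# getting lossy when the next agent takes over.
from agents import Agent, Runner

refund_agent = Agent(
    name="Refund agent",
    instructions="Process refunds using the conversation summary you are given.",
    model="o4-mini",  # illustrative model choice
)

triage_agent = Agent(
    name="Triage agent",
    instructions=(
        "Answer simple questions yourself. For refunds, first summarize the "
        "conversation so far, then hand off to the refund agent."
    ),
    handoffs=[refund_agent],
)

result = Runner.run_sync(triage_agent, "I want a refund for my snowboard.")
print(result.final_output)
```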
>> Evals. I think this is a pretty big part of the voice experience, and we can break it down into four key buckets. The first and most obvious thing: always start with observability. You're not going to get very far if you don't know what your data looks like, so make sure you have traces of what's going on with your agents, the audio and everything, so you can actually look at it and take action. The next thing: when you actually start doing evals, the most important step is having humans label this data, and then iterating on prompts from there. That may sound too easy, or like not the most sophisticated approach, but it's the most effective one we've seen when working with customers; it lets you go a lot faster. From there, the next step is transcription-based evals. This is where you have your traditional LLM-as-a-judge evals, or you test your function-calling performance against certain business criteria with a rubric in place. Then there's the next bucket, where a lot of people ask us: how do we actually test the audio that's generated? For the most part, there isn't a lot that audio evals get you that text evals don't, but what they do get you is understanding the tone, the pacing, and other things that are harder to capture through text alone. This is where you can use a completions model or some other audio model, for example GPT-4o audio, to understand the tone, the pacing, and whatever other intonations you care about for your business. And the final one we've started to see a bit more of is synthetic conversations. For example, you have two real-time clients going back and forth with each other: one is your Realtime API agent, and the other is a set of customer personas you've defined. You can simulate many, many of those conversations, extract evals from the transcriptions or audio, and get a much better understanding of what you've built.
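For the audio-eval bucket, here is a sketch of grading tone and pacing with an audio-capable chat model; the model name, rubric, and file name are illustrative assumptions:

```python
# A sketch of an audio eval: feed a recorded agent reply to an audio-capable
# model and have it grade tone and pacing against a simple rubric.
import base64
from openai import OpenAI

client = OpenAI()

with open("agent_reply.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

grade = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # illustrative audio-capable model
    modalities=["text"],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Rate this customer-service reply from 1-5 on tone "
                        "(calm, friendly) and pacing (not rushed). "
                        "Respond as JSON with keys 'tone' and 'pacing'."
                    ),
                },
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)
print(grade.choices[0].message.content)
```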
Cool. And the final thing here is guardrails. Especially because the Realtime API is pretty fast, you want guardrails that give you the safety needed to actually have confidence in the solution you're rolling out to users. The main things here: run them async, generally, because the generated text from the Realtime API comes back much faster than the spoken audio you get from the model, so you have room to play with latency. You also have control over whatever debounce period you set. In the code snippet on the right-hand side of the screen, you'll see that right now it runs the guardrails every 100 characters, but you can control that if you want to.
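The snippet from the slide isn't reproduced here, but the idea can be sketched generically: buffer the model's streamed text and kick off an asynchronous check every 100 characters. The moderation call is our assumption of one possible check; the helper names are illustrative:

```python
# A generic sketch of async guardrails with a character-count debounce.
# Text arrives faster than the audio plays, so checks can run off the hot
# path and still interrupt playback in time if something is flagged.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
DEBOUNCE_CHARS = 100  # mirrors the "every 100 characters" from the talk

async def run_guardrail(text: str) -> None:
    """Moderation check run concurrently with audio playback."""
    result = await client.moderations.create(
        model="omni-moderation-latest", input=text
    )
    if result.results[0].flagged:
        # In a real agent you would cut off audio playback here.
        print("Guardrail tripped; interrupting response.")

async def on_text_delta(buffer: list[str], delta: str) -> None:
    """Called for each text delta streamed back from the realtime session."""
    buffer.append(delta)
    if sum(len(chunk) for chunk in buffer) >= DEBOUNCE_CHARS:
        pending = "".join(buffer)
        buffer.clear()
        # Fire-and-forget so the check never blocks the audio stream.
        asyncio.create_task(run_guardrail(pending))
```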
Awesome. And then maybe just some learnings we've seen from folks in the field. Lemonade, the AI-driven insurance company: one of the things we noticed that made them successful is that they focused really early on evals, guardrails, and feedback mechanisms for their team. Even when that wasn't scalable, they recognized it, and it actually made them move a lot faster in the end. And then for Tinder, they realized that customization and brand realism were pretty important when they were launching their RZ chat experience, so that's what they focused on, and it created a much better experience for their users.
And now I'll hand it off to Toki.

>> Yeah, just to wrap us up. As I said in the beginning, we're really entering this multimodal era: away from, or rather in addition to, text-based models, we now have video and image models, and now audio. This real-time speech-to-speech technology is emerging, and we actually had a somewhat low-key update yesterday to the Realtime API: we released a new snapshot that's actually a lot better. So I really feel like now is the time to build; you have a first mover's advantage if you build now, and the technology is getting to the point where it's good enough to build scalable production applications. We're really excited to see what you all build. Please test it out. We'll be around after the talk if you have any questions, but thanks for your time.