Building Effective Voice Agents — Toki Sherbakov + Anoop Kotha, OpenAI
Channel: aiDotEngineer
Published at: 2025-07-20
YouTube video id: -OXiljTJxQU
Source: https://www.youtube.com/watch?v=-OXiljTJxQU
[Music] Hello everyone. Thanks for coming to our talk. Just to briefly introduce myself: I'm Toki, I lead the solution architecture team in the US for OpenAI, and I have Anoop here as well from the solution architecture team. We'll talk about building practical audio agents. Before I jump in, a quick raise of hands: who has built an audio or voice agent of some kind, or is thinking about building one? Okay, about half the room. Good, it's always useful to know where people's interests are. So we'll talk a bit about what we've learned, where audio models are today, and then some of the patterns and best practices.

To take a quick step back: GenAI applications to date, as many of you know, have been very text-based. These are models that deeply understand and generate text with humanlike fluency, and there are a lot of production applications built on text, and that continues to scale. But we've really started to enter a more multimodal era. We now have models that understand images extremely well, models that generate images really well, if you were part of the image gen Studio Ghibli craze, and Sora, which generates videos. And now also audio. We've had transcription models and text-to-speech models for some time, but this emerging speech-to-speech capability is getting a lot better, and I think we'll see much more of an explosion of use cases around images, video, and audio. Today we're really going to focus on the audio modality. One important thing we're starting to see is that audio is getting to the point where it's becoming much more usable at scale.
Again, we've had the transcription and TTS side for a while, but especially on the speech-to-speech, low-latency side, the models are getting good enough. They're at a tipping point where you can start to meaningfully build high-quality production applications at scale. Before, the models were slow, robotic, and brittle; now they're a lot faster, expressive, and accurate. I'll show an example. The first one shows what the models were like not even six months ago. It's a very painful example, so I hope you feel a bit of the pain.

>> Hello.
>> Hello. Please give me a topic for your haiku.
>> Uh, well, you can see the initial delay. That's actually, honestly...
>> Could you write me a haiku about whispers?
>> Silent air, faint words. Drift through darkened rooms. Secrets linger near. Dust slides through my hands. Sifted grains forget the weight. What is left behind?

So I hope you felt the pain a little bit. This was not even six months ago. The models are incredibly slow, they're very robotic in how they sound, and they're quite brittle. In that example, we actually tried to interrupt the model and tell it, no, don't write a poem about whispers, write one about stars. It wrote one about sand; I'm not sure where it got that from. That's actually where things were. Now fast forward to today: here's how the models behave for a very similar use case.

>> Hello.
>> Share with me a topic and I shall weave a haiku for you, rich with emotion and imagery. What shall it be?
>> Can you make a haiku about whispers?
>> Whispers?
>> Actually, could you make a haiku about stars?
>> Stars, whispers of light adorning the velvet night, eternal secrets.

So you can see it's much faster. It's incredibly expressive, emotional.
You can steer it, get it excited, slow it down, and it's accurate. We interrupted it, had it change course and do something else, and it responded well. So we do believe the models are at that good-enough tipping point where you can start to build much more reliable experiences around them.

In terms of actually building on top of the audio models, there are really two emerging architectures. One has been around for a while, and one is newer. The existing one is what we call the chained approach: stitching together three models. Audio comes in, you transcribe it, you have an intelligence layer with an LLM, and then text-to-speech on the output. Because this stitches together three models, there are problems. It's slow: it takes longer to actually generate your output audio. It's also lossy: you lose some of the semantic understanding of the conversation, since the intelligence layer only ever sees a transcript.

The emerging pattern is the speech-to-speech architecture, which collapses those three models into a single model, which for us is the Realtime API. It handles all the layers: the transcription, the intelligence, and finally the speech output. This simplifies the architecture quite dramatically and really reduces time to output, which is great for low-latency experiences. And it doesn't have the lossiness problem: you're able to maintain semantic understanding across a conversation. When building on top of these architectures, there are some key trade-offs and considerations to take into account.
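To make the latency and lossiness point concrete, here is a minimal sketch of the chained architecture. All three stages are stubbed out with plain functions; in a real system each would be a separate model call (speech-to-text, an LLM, then text-to-speech), and their latencies add up while the LLM only ever sees a transcript.

```python
# Sketch of the "chained" voice architecture: three separate models
# stitched together. The stage implementations here are stand-ins, not
# real API calls.

def transcribe(audio_bytes: bytes) -> str:
    # Stand-in for a speech-to-text model call.
    return audio_bytes.decode("utf-8")  # pretend the audio is already text

def think(user_text: str) -> str:
    # Stand-in for the LLM "intelligence layer".
    return f"Echoing your request: {user_text}"

def synthesize(reply_text: str) -> bytes:
    # Stand-in for a text-to-speech model call.
    return reply_text.encode("utf-8")

def chained_turn(audio_in: bytes) -> bytes:
    # Each stage waits on the previous one, so total latency is the sum
    # of all three -- and tone, emphasis, and other audio-level signals
    # are lost at the transcription boundary.
    text = transcribe(audio_in)
    reply = think(text)
    return synthesize(reply)
```

The speech-to-speech architecture replaces all three stages with a single model session, which is why it avoids both the additive latency and the transcript bottleneck.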
We think they fall across five main areas: latency, cost, accuracy and intelligence, user experience, and integrations and tooling. Depending on what you're building, your trade-offs will be different. For example, for a consumer-facing application, the trade-offs look like this: user experience matters a lot, since a lot of end users are interacting with the model directly. Latency matters a lot; it can't be slow, it needs to feel really fast and responsive. But cost probably doesn't matter as much, and neither does raw accuracy; it's more about expressiveness and quick answers. Integrations and tooling matter less too: you don't really need to integrate with SIP or Twilio for a fairly simple consumer interaction. So given these trade-offs, the Realtime API works really well for a consumer-facing application.

Now take customer service as another example; the trade-offs are different. Here, accuracy and intelligence probably matter the most. You can't get an order number wrong. If the agent deletes an order instead of updating it, that's a big deal. Integrations and tooling start to matter a lot more too, because now you have to integrate with your various internal systems, and with SIP and phone providers if needed. User experience still matters, but you'll probably take a small hit there to make sure the agent is more accurate. Latency still matters, but not as much, because accuracy and intelligence matter more, and you're saving a large cost with this anyway.
So those are the considerations to take into account for customer service. The Realtime API might make sense where latency matters most, but if you care about determinism, accuracy, and intelligence, then the chained architecture can make a lot of sense here too. It's just important to consider what your trade-offs are depending on what you're trying to build. Hopefully that paints a picture of what to consider. I'll hand it off to Anoop to talk about specific patterns we've seen when working with customers building on this.

>> Cool. Thanks, Toki. Just taking a step back: we've probably all defined "agent" a bunch of times, but here's a canonical definition to keep everyone grounded for this session. Think of an agent as a model, a set of instructions (your prompts), the tools you give the model, and the runtime: the guardrails and the execution environment. When you're building voice agents, those are all the same things you should be worrying about, but there are a few other things that are a bit different, and those are what we'll focus on today. The main ones: your system design is probably a bit different than for a traditional LLM agent; prompting and customization of the voice are different; the tools you're using are probably a bit different; and there are evals and guardrails you have to think about as well. These are the things that are distinct to voice agents compared to traditional agents.

So if you've built a text-based agent, let's use customer service as a canonical example. You have some user question.
You have some sort of triage agent, a small model, that routes to other models that handle the harder tasks. So you might have o3 do the refund, or maybe o4-mini handle some of the cancellation. Typically that's how you'd build a normal agent today. For voice, one pattern we're starting to see more and more of is delegation through tools. You have the Realtime API be your frontline agent: it handles the normal interaction patterns you're used to and responds to the easy questions you're confident it can answer. But for questions that are a bit harder, you have the model make a tool call out to other agents run by smarter models under the hood. Here's a quick video of how this works.

>> I've authenticated you now. What item would you like to return?
>> Yeah, I want to return the snowboard I bought.
>> Got it. I'm reviewing the policy and will get back to you with next steps shortly.

So here's a scenario where we're delegating the policy check to a smaller reasoning model, o4-mini, rather than having the frontline model handle it.

>> It looks like the return you are making is within the 30-day return policy. We will process the return and send you a confirmation email with next steps shortly.

>> So that's one pattern we've seen on the architecture side. Another point I touched on is the way you customize your brand and what you're building, and the prompts you use are a huge part of that. In the text-based world, you can only control the instructions you're giving the model and what you want it to do.
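Before moving on to prompting: the delegation-through-tools pattern from the demo can be sketched roughly as below. Both models are stubs with illustrative names; in practice the frontline agent would be a Realtime API session and the delegate would be a reasoning model invoked via a tool call.

```python
# Sketch of "delegation through tools": a fast frontline agent answers
# easy turns itself and hands harder questions to a slower, smarter
# model behind a tool call. Topic routing here is a naive placeholder
# for the model's own tool-calling decision.

def smart_model(question: str) -> str:
    # Stand-in for a reasoning model (e.g. something like o4-mini)
    # invoked as a tool.
    return f"[reasoned answer] {question}"

def frontline_agent(user_turn: str) -> str:
    hard_topics = ("refund", "return", "cancel")
    if any(t in user_turn.lower() for t in hard_topics):
        # Delegate the hard part; the frontline agent would normally
        # acknowledge the caller while this runs.
        return smart_model(user_turn)
    return f"Sure! {user_turn}"
```

The key design choice is that the fast model stays in the audio loop the whole time, so the caller never waits in silence while the slower model reasons.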
In voice-based applications, though, you can also control the expressiveness of the voice: how you want the model to speak or sound, not just what it says. So in this example prompt, you can control the demeanor, the tone, the level of enthusiasm, all the things that don't really get captured in a text agent but that you have much more control over in a voice-based agent. There's a fun website from OpenAI where you can play around with different voices and click through sample prompts we've created. It's a great place to get started playing with the expressiveness of voices; you'll see there's a high ceiling, and a lot of variation in performance just from the prompts you use.

Another side of this: when you're prompting models in the text-based approach, you'll probably have a lot of few-shot examples, like "do these few steps, then these next few steps," walking through a typical conversation flow. You can mimic the same thing in voice. In this picture there's an ID, say a greeting state, a description of it, and then the instructions for that state, and the model steps through them. This is a sample way to think through how you should prompt models, especially for more complex tasks you want the Realtime API to handle.

The next big bucket of things that are important, and maybe a bit different, for voice agents is tools. Tool use in general is very important because you want your model to connect to the other systems that matter for you. The first step is honestly to start simple: you don't need to connect 10 or 20 tools to a given model. Keep the number of tools pretty limited for a given agent.
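Circling back to the prompting structure just described (an ID, a description, and instructions per conversation state), here is a rough sketch of how such a prompt might be assembled. The field names and states are illustrative, not an official schema.

```python
# Sketch of a voice-agent prompt that controls both *what* the model
# says and *how* it sounds. Everything below is an example structure,
# not a required format.

VOICE_PROMPT = {
    "personality": "Warm, upbeat customer-service agent for a ski shop.",
    "tone": "Friendly and concise; never robotic.",
    "pacing": "Moderate speed; slow down when reading order numbers.",
    "conversation_states": [
        {"id": "greeting",
         "description": "Welcome the caller and ask how you can help.",
         "instructions": "Keep it to one short sentence."},
        {"id": "authenticate",
         "description": "Verify the caller's identity.",
         "instructions": "Ask for name and order number; confirm back."},
    ],
}

def render_prompt(p: dict) -> str:
    # Flatten the structure into the system-prompt text the model sees.
    lines = [p["personality"], p["tone"], p["pacing"], "Conversation flow:"]
    for s in p["conversation_states"]:
        lines.append(f"- {s['id']}: {s['description']} {s['instructions']}")
    return "\n".join(lines)
```

Keeping the states as data rather than free text makes it easy to iterate on individual steps of the flow as your evals surface problems.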
Keeping tools limited goes back to the first diagram we showed: you want to delegate to different agents for different tasks. So don't give a given agent a ton of tools to start with; slowly add more as needed.

Next, handoffs. If you've used the Agents SDK, this is the concept of passing from one agent to another, and you can manage context between handoffs as well. The thing to remember is that when you move from one agent to another, you want to keep the context consistent between them. A good pattern is to summarize the conversation state and pass that to the next agent, so you don't lose context and the handoff isn't lossy. A final one is delegation: the concept we showed earlier of using o3-mini or o4-mini to do the tool call. That's natively built into the Agents SDK if you'd like to try it out.

>> Next, evals. This is a pretty big part of the voice experience, and I'd break it down into four key buckets. The first and most obvious thing: always start with observability. You're not going to get very far if you don't know what your data looks like, so make sure you have traces of what's going on with your agents, the audio and everything, so you can actually look at it and take action. The next thing, once you actually start doing evals: the most important step is to have humans label that data and then iterate on prompts from there. It may sound too easy, or not the most sophisticated approach, but it's the most effective one we've seen when working with customers, and it lets you go a lot faster. From there, the next step is transcription-based evals.
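Going back to the handoff pattern for a moment: a minimal sketch of summarizing conversation state before a handoff might look like this. The summarizer here is a stub; in practice it would be an LLM call, and the function names are illustrative.

```python
# Sketch of a non-lossy handoff: before passing the caller to the next
# agent, compress the transcript into a short summary that becomes the
# receiving agent's starting context.

def summarize(history: list[dict]) -> str:
    # Stand-in for an LLM summarization call over the transcript.
    user_turns = [m["content"] for m in history if m["role"] == "user"]
    return "Caller so far: " + " | ".join(user_turns)

def hand_off(history: list[dict]) -> list[dict]:
    # The next agent starts from the summary instead of the raw log,
    # keeping its context small without dropping what the caller said.
    return [{"role": "system", "content": summarize(history)}]
```

Passing a summary rather than the full transcript keeps the receiving agent's context window small while preserving the facts it needs.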
Transcription-based evals are where you have your traditional LLM-as-a-judge evals, testing things like function-calling performance against business criteria, with a rubric in place. Then there's the next bucket, where a lot of people ask us: how do we actually test the audio that's generated? For the most part, audio evals don't get you much beyond what text evals do, but what they do get you is understanding tone, pacing, and other things that are harder to capture through text alone. Here you can use a completions model or another audio model, for example GPT-4o audio, to evaluate the tone, the pacing, and whatever other intonations you care about for your business. And the final one we've started to see more of is synthetic conversations: for example, two real-time clients talking back and forth, where one is your Realtime API agent and the other plays a set of customer personas you define. You can simulate many of those conversations, run your transcription or audio evals across them, and get a much better understanding of what you've built.

Cool. The final thing here is guardrails. Especially because the Realtime API is fast, you want guardrails in place so you have the safety needed to roll the solution out to users with confidence. The main thing is to run them async: the generated text from the Realtime API arrives much faster than the spoken audio you get back from the model, so you have room to play with on latency. You also have control over whatever debounce period you set.
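A rough sketch of that async guardrail pattern, under the assumption that the guardrail is checked every N generated characters; the policy check itself is a placeholder for a real moderation or guardrail model.

```python
# Sketch of an async guardrail with a debounce: the model's text output
# streams in faster than the audio plays back, so a checker run every N
# characters can cut the response off before the unsafe part is spoken.

DEBOUNCE_CHARS = 100  # run the guardrail every 100 generated characters

def violates_policy(text: str) -> bool:
    # Stand-in for a real moderation/guardrail model call.
    return "forbidden" in text.lower()

def stream_with_guardrail(chunks):
    generated = ""
    unchecked = 0
    for chunk in chunks:
        generated += chunk
        unchecked += len(chunk)
        if unchecked >= DEBOUNCE_CHARS:
            unchecked = 0
            if violates_policy(generated):
                return generated, "blocked"
    # Final check once the stream ends, for anything under the debounce.
    if violates_policy(generated):
        return generated, "blocked"
    return generated, "ok"
```

The debounce value trades safety latency against guardrail cost: a smaller value catches problems sooner but runs the checker more often.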
In the code snippet on the right-hand side of the screen, you'll see that right now it runs the guardrails every 100 characters, but you can control that if you want to.

Awesome. Then just some learnings we've seen from folks in the field. Lemonade, the AI-powered insurance company: one of the things we noticed that made them successful is that they focused really early on evals, guardrails, and feedback mechanisms for their team. Even where that wasn't scalable, they recognized it, and it actually made them move a lot faster in the end. And for Tinder, they realized that customization and brand realism were pretty important when they were launching their RZ chat experience, so that's what they focused on, and it created a much better experience for their users. And now I'll hand it off to Toki.

>> Yeah, just to wrap us all up. As I said in the beginning, we're really entering this multimodal era: in addition to text-based models, we now have video and image models, and now audio, and this real-time speech-to-speech technology is emerging. We actually had a low-key update yesterday: we released a new Realtime API snapshot that's a lot better. So I really feel like now is the time to build; you have a first-mover's advantage if you build now. The technology is getting to the point where it's good enough for scalable production applications. We're really excited to see what you all build. Please test it out. We'll be around after the talk if you have any questions, but thanks for your time. [Music]