Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)
Channel: aiDotEngineer
Published at: 2025-07-31
YouTube video id: E71YtNbCFXY
Source: https://www.youtube.com/watch?v=E71YtNbCFXY
All right, Squabbert, you ready to get packed up?

I don't know. I'm pretty nervous.

Oh, relax. You've got nothing to worry about.

But this is like the worst idea ever for a live demo. A live, unscripted conversation with a non-deterministic LLM on conference Wi-Fi.

Come on, your prompt is great.

And I mean, it's got to be, with onstage audio in a room full of echoes. Sometimes my text-to-speech even mispronounces my own name. Why did you even name me Squabbert? What if I say "Squibbly" or something again?

Okay, take a breath. Well, you don't really do that. Let's just take it one step at a time. Just start with the intro.

I guess you're right. Okay, here I go. Hi everybody, I'm Squabbert, here to take us on a whirlwind tour of the wonderful world of WebRTC. Please welcome Sean and Kwin.

Hey, I'm Sean. I work on WebRTC at OpenAI. Some of the things you might be familiar with are the Realtime API, or 1-800-ChatGPT, which you can call right from your phone. Before I worked at OpenAI, I worked on the Go implementation of WebRTC, called Pion.

And I'm Kwin. I work at Daily on real-time audio and video infrastructure, and on an open source voice agent framework called Pipecat.

Today we're going to talk about how to build natural, fast, humanlike voice experiences. We're going to give you a crash course on low-latency audio and video, and I hope we'll show you a couple of things you might not have thought of around voice AI before.

If you want to build a conversational voice experience that people really love, you're going to stress a lot about latency. Nothing else matters if your AI responds too slowly. Building voice AI experiences is similar to other kinds of AI engineering in most ways. If you've built multi-turn agents, a lot of that will port over to building voice agents. But the big difference is latency. Everything in a voice AI app needs to be built from the ground up for fast response times.

If you're talking to a person, around 500 milliseconds sounds natural. When talking to an AI system, people bring those same expectations. Response latencies much above a second in general doom your voice agent to very low completion rates, low NPS scores, and hang-ups. And we're talking here about voice-to-voice latency: the time between when I, the human, stop talking and the time I hear the first byte of audio come back from the LLM.

Let's take a look at how latency adds up in a typical voice-to-voice AI application. This is a breakdown from a real voice AI app running in a web browser on macOS, talking over the internet to a voice agent running in the cloud on Pipecat. A couple of things to note. First, our voice-to-voice latency is just under a second. That's good, but not great. We can make things a little faster, but that comes with trade-offs: lower quality or higher cost. Second, it's frustratingly easy to do even worse than this. Your LLM might be slower, other things get in the way, or, worst of all, Bluetooth. Don't get me started on Bluetooth.
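[Editor's note: the core measurement the speakers describe is simple to instrument yourself. Here's a minimal TypeScript sketch; the two callbacks, `onUserStoppedSpeaking` and `onFirstAgentAudio`, are hypothetical hooks standing in for whatever your voice activity detector and audio pipeline actually expose.]

```typescript
// Sketch: instrumenting voice-to-voice latency in a browser client.
// The two event hooks below are assumptions; in a real app they would be
// wired to your VAD's end-of-turn event and to the first decoded frame
// of the agent's reply.

let userStoppedAt: number | null = null;

// Fired by your voice activity detector when the user finishes a turn.
function onUserStoppedSpeaking(): void {
  userStoppedAt = performance.now();
}

// Fired when the first audio of the agent's reply reaches the speaker path.
function onFirstAgentAudio(): void {
  if (userStoppedAt !== null) {
    const voiceToVoiceMs = performance.now() - userStoppedAt;
    console.log(`voice-to-voice latency: ${voiceToVoiceMs.toFixed(0)} ms`);
    userStoppedAt = null; // reset for the next turn
  }
}
```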
But the single biggest mistake we see people making is using the wrong approach to sending and receiving audio over the network. It's time to talk about WebRTC and WebSockets.

If you're new to building voice applications, you probably think, "Hey, I need a long-lived connection. I'm going to send audio and video over this long-lived connection. I've used WebSockets for long-lived data connections before. I'm just going to write some WebSocket code." That's great if you're doing long-lived connections with small amounts of data. It doesn't work for real-time audio. In fact, WebSockets are almost the opposite of what you want from a network engineering perspective for real-time audio and video.

So, let's do a compare and contrast. WebSockets are great if you're trying to deliver audio and you want something really easy that can target all platforms. If you're trying to build a prototype across lots of different platforms, WebSockets are the way to go. On the other hand, WebRTC solves a bunch of things for you: high-quality audio, high bandwidth, low latency. The catch is that it can be a lot more complicated to implement, and because it's so specialized, it can be frustrating. Lots of applications use both, but for different things.

So, here's the TL;DR, if you only remember one thing from this talk. Use WebSockets for server-to-server use cases, for small amounts of structured data, and in places where you want to prototype. Use WebRTC if you're sending audio and video streams over the internet from your web app or your native app. That's where it excels.

So, why is it so important to use WebRTC for real-time edge-to-cloud audio? A WebSocket is a TCP connection. TCP guarantees in-order delivery of network packets. If you send some data, that data is going to arrive exactly as you sent it, or it's not going to arrive at all. You put packets in your operating system's send queue, and that OS queue is going to keep trying to send them until they either get ACKed by the other side or your connection completely times out. This is in general what you want for most network programming. If you're making a web request, for example, it's perfect. It's not what you want if you're aiming for conversational latency.

Remember that we're trying to hit a voice-to-voice latency of under one second, and ideally even better. What we want is to ignore things like occasional packet loss. Imagine a packet is dropped. I don't really care about what happened a second ago. So WebRTC does clever math and buffer management, which we'll talk about more, to hide it when that happens. This is the first and most important thing WebRTC does for you: all that machinery that sends packets as fast as possible and ignores packets that don't arrive inside the very tight latency budget we're operating within. Think of it as super fast, best-effort networking.

If this were all WebRTC could do for you compared to WebSockets, it would still be worth using WebRTC just for this, because you literally can't implement this on top of a TCP stack or on top of WebSockets. Again, the operating system is just going to keep trying to send whatever you tell it to send, and it's going to block everything if you have any packet loss or significant jitter or delay in the network. And we have lots and lots of real-world data on this. In the real world, this means you will get audio glitchiness, high latency, or unexpected socket disconnections on 10 to 15% of your network connections.

But WebRTC does a lot more than that. If you go and try to build the same application on WebSockets, you have to handle resampling, you have to handle packetization, and you have to do all the bandwidth estimation. Networks are constantly changing and fluctuating, so you can't just send one bit rate. You also get standard APIs for stats and observability. This is all just built into WebRTC; if you decide to do WebSockets, you have to build it yourself.
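[Editor's note: to make the "clever math and buffer management" concrete, here's a toy jitter buffer. Real WebRTC stacks adapt their playout delay to measured jitter and do packet-loss concealment; this fixed sketch only shows the key idea, which is that playback never blocks waiting for a late or lost packet.]

```typescript
// Toy jitter buffer: reorder packets that arrive out of order, and never
// stall playback on a missing packet (contrast with TCP, which blocks
// everything behind the missing bytes).

interface AudioPacket {
  seq: number;         // RTP-style sequence number
  payload: Uint8Array; // one frame of encoded audio (e.g. 20 ms of Opus)
}

class ToyJitterBuffer {
  private pending = new Map<number, AudioPacket>();
  private nextSeq = 0;

  // Called whenever a packet arrives from the network, in any order.
  push(packet: AudioPacket): void {
    if (packet.seq < this.nextSeq) return; // arrived too late; drop it
    this.pending.set(packet.seq, packet);
  }

  // Called by the audio device on a fixed clock (e.g. every 20 ms).
  pop(): AudioPacket | null {
    const packet = this.pending.get(this.nextSeq) ?? null;
    this.pending.delete(this.nextSeq);
    this.nextSeq += 1; // advance even across a gap; never block on loss
    return packet;     // null means "conceal this frame" (play silence)
  }
}
```

A production buffer also delays playout by a small adaptive amount so slightly late packets still make it; that delay is itself part of the latency budget discussed above.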
So, if you look at this code up on the screen: on the right side is an example with WebRTC, sending one bidirectional stream of audio. On the left side is WebSockets. You want to spend more time building your application and less time worrying about things like sample rates. That's why you pick WebRTC. And this is real code using the OpenAI Realtime API, which offers both options for developers.
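[Editor's note: the slide isn't visible in the transcript, so here is a hedged sketch of the WebRTC side, following the shape of OpenAI's published Realtime API WebRTC example. The model name and the `connectRealtime` helper are assumptions; check the current documentation before using this.]

```typescript
// Sketch: browser WebRTC connection to the OpenAI Realtime API.
// Assumes `ephemeralKey` was minted by your own server (never ship a
// long-lived API key to the browser) and that the model name is current.
async function connectRealtime(ephemeralKey: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Play whatever audio track the model sends back.
  pc.ontrack = (event) => {
    const audio = new Audio();
    audio.srcObject = event.streams[0];
    audio.play();
  };

  // Send the user's microphone to the model.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getAudioTracks()[0], mic);

  // Standard WebRTC offer/answer, with the answer fetched over HTTPS.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const resp = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
    },
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return pc;
}
```

Notice how little of this code is about audio itself: no sample rates, no packetization, no bandwidth estimation. That's the point the speakers are making.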
So, I hope we've convinced you that you should use WebRTC if you're doing edge-to-cloud audio, and especially audio and video. We love talking about this stuff. If you come find us later, we will talk your ear off about jitter buffers and packet management and bandwidth shaping and all that stuff. But what we want to do now is move on and talk about a whole other category of fun stuff, which is what you can actually do with WebRTC.

I'll start by saying you can embed real-time audio in any app you write. Any website, any iOS app, any Android app, lots of fun embedded stuff. And the network connections will just work. You will get good audio on any device, any platform, almost any real-world network connection. And I bet you already use WebRTC today. If you've used Facebook Messenger, WhatsApp, Zoom, Discord, any of these applications, they're using WebRTC. But there are even cooler things happening with WebRTC that you may not know about. I worked with a company that was doing surgery over the internet. People teleoperate vehicles in the field. It's super cool. WebRTC is kind of the standard language of the real-time world, and that's what makes it so easy to build conversational intelligence on top of it. All that stuff has already been solved.

In the new LLM era, we know lots of people who spend hours talking to computers: driving their development environments with voice, doing brainstorming, treating the computer as a personal assistant, a coach, a therapist, a researcher. I'm convinced voice is going to be the core building block of the next generation of UIs, of the UIs for the generative AI era. We have to do a little iteration before we figure out what those UIs really look and sound like, but that future seems very clear to me. One of the things I say to people to try to communicate how excited I am about building all this stuff is: we all lived through the last platform shift. It was recent enough that the move of most computing from desktop to mobile happened within our memories. By analogy, we are in late 2007 now. We have the first iPhones, but we haven't yet invented pull-to-refresh.

What keeps me so excited and motivated is that voice feels like the next bicycle for the mind. Today we've only been able to use computers with our eyes and our hands, but now I can talk to them as well. Think of all those situations where you have your voice available but you don't have your hands. And the other great thing is you can keep all this computing power remote and access it from small devices near you.

So, on the small-devices front, we brought a friend with us today. He's a little shy sometimes, so give him just a second. So, Squabbert, can you tell people a little bit about your tech stack? This is Squabbert, who Chad built, and Chad's daughter, Ella, named. Squabbert, can you tell the people a little bit about your tech stack?

Using MLX Whisper, Gemma 3, and a mildly buggy custom logit sampler that Kwin wrote.

I mean, come on now. I had a good reason for wanting to write a logit sampler, and I wanted you to be able to do something that even the really amazing big LLMs running in the cloud can't do. Do you know what I'm talking about?

You're talking about my counting talent. It's pretty neat, even if I do say so myself.

I mean, counting syllables is like the 2025 version of "how many R's are in strawberry." You know, Squabbert, why don't we demonstrate that a little bit? Can you make up a poem, like a four-line poem, about computer programming, and use only two-syllable words?

Two? Okay, listen. "Logic, knowing, system, rolling."

One more, maybe. That was good, but I've heard you do better.

Okay, here's another one. "Dancing colors, bright, feel so right, and night, magic, pure delight."

You forgot about the two-syllable-words thing, but we'll let it go.

Yeah, definitely the first one had two-syllable words, I think.
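[Editor's note: the talk doesn't show the sampler itself, but the trick it relies on, counting syllables per word, can be approximated with a simple heuristic. This is a hypothetical sketch, not Kwin's actual code: count runs of consecutive vowels, with a rough adjustment for a trailing silent "e".]

```typescript
// Heuristic syllable counter: counts groups of consecutive vowels.
// Good enough to filter a word list; real English has many exceptions.
function countSyllables(word: string): number {
  const w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (w.length === 0) return 0;
  const vowelGroups = w.match(/[aeiouy]+/g) ?? [];
  let count = vowelGroups.length;
  // A trailing silent "e" (as in "pure") usually doesn't add a syllable.
  if (w.endsWith("e") && !w.endsWith("le") && count > 1) count -= 1;
  return Math.max(count, 1);
}

// A constrained sampler could, for example, mask out candidate tokens
// that would complete a word with the wrong syllable count:
console.log(countSyllables("logic"));  // 2
console.log(countSyllables("system")); // 2
console.log(countSyllables("magic"));  // 2
```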
I think we should say goodbye just in the interest of time, Squabbert, but thank you so much. [Applause]

I think Squabbert did really, really well. So, that is a Raspberry Pi connected with a peer-to-peer WebRTC connection directly to my laptop over the same local area network. That was a serverless WebRTC connection; Squabbert's talking directly to the laptop. But what's super cool about WebRTC is that you have all these different choices for how you want to connect things. You could do this local connection. Or something like Squabbert could connect to a server running in the cloud and do all the AI stuff there. And the third option is you can connect up to something like Pipecat and make it multi-party, so you can bring LLMs into meetings or other places like that. It's super awesome how you get all this flexibility to build these things in the way that matches your application.

We want to close with a video from a builder who's in the audience. We're super excited about this, and the best part is that when she built this, she had never written any code before. What makes me so excited about the future of voice is that if we can make this easy enough, people who have really innovative, inspirational ideas can go and build things themselves.

Hi, I'm Yashin. I'm a mom of two bilingual kids. I'm raising my kids bilingual because I want them to connect with my cultural roots, and that's a wish shared by many parents of bilingual children. But raising bilingual kids is hard. It's expensive, it's time-consuming, and it often feels like a big chore for both the kids and the parents. I believe we can do a lot better with today's technology. I really believe we can make language education feel more natural and even fun for the kids. Here's a quick clip of what I've been working on.

"Hi there, buddy. Ready to have some fun today? How about we start with something fun? Let's say hello. In Mandarin, we say... can you say...?"

Okay, it's still early, but I'm excited about what's possible. I'm not technical, but with just a little bit of guidance from a kind member of this community, I was able to bring this first version to life. I already have a group of eager testers, mostly parents like me, who are excited to try this. If this also sounds exciting to you, I would love to connect. Thanks so much for watching, and let's build this future together. [Applause]

So, we put the QR code up here for Yashin's project, which I absolutely love. She's here. If you're here in the audience and you're interested in multilingual stuff or building these kinds of things, please find her. It's such a great, great thing. If you're watching on YouTube, here's the QR code. Sean and I are super excited about these kinds of projects. And the person Yashin shouted out in the video is, of course, Sean, who has done more than anybody I know to make WebRTC accessible to everyone.

The idea that we want to leave you with is this: if you have an idea in voice AI and WebRTC, whether you've been a programmer for years or you're just getting started, we are here to support you. We're so excited, and we believe that with things like Pipecat and libpeer, if we can make this easier, we're going to see the next generation of really exciting, innovative projects. Come find us in the hallway or online. We hang out on Discord and Twitter and LinkedIn. And here are the resources that you can scan, which we hope you find helpful. Kwin wrote an amazing book that I believe is in your bag. So, thank you so much. We can't wait to see what you build. [Applause]