Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)

Channel: aiDotEngineer

Published at: 2025-07-31

YouTube video id: E71YtNbCFXY

Source: https://www.youtube.com/watch?v=E71YtNbCFXY

[Music]
All right, Squabbert, you ready to get
packed up? I don't know. I'm pretty
nervous. Oh, relax. You got nothing to
worry about. But this is like the worst
idea ever for a live demo. A live
unscripted conversation with a
non-deterministic LLM on conference
Wi-Fi. Come on, your prompt is great.
And I mean, it's got to be with onstage
audio in a room full of echoes.
Sometimes my text to speech even
mispronounces my own name. Why did you even name me Squabbert? What if I say
Squibbly or something again? Okay, take
a breath. Well, you don't really do
that. Let's just take it one step at a
time. Just start with the intro. I guess
you're right.
Okay, here I go. Hi everybody, I'm
Squabbert here to take us on a whirlwind
tour of the wonderful world of WebRTC.
Please welcome Sean and Kwindla. Hey, I'm Sean. I work on WebRTC at OpenAI. Some of the things you might be familiar with are the Realtime API or 1-800-CHATGPT. You can call it right from your phone. Before I worked at OpenAI, I worked on the Go implementation of WebRTC called Pion. And I'm Kwindla. I work at Daily on real-time audio and video infrastructure and on an open source voice agent framework called Pipecat. Today we're going to talk about
how to build natural, fast, humanlike voice experiences. We're going to give you a crash course on low-latency audio and video, and I hope we'll show you a couple of things you might not have thought of around voice AI before. If you want to build a conversational voice experience that people really love, you're going to stress a lot about latency. Nothing else matters if your AI responds too slowly.
Building voice AI experiences is similar
to other kinds of AI engineering in most
ways. If you've built multi-turn agents,
a lot of that will port over to building
voice agents.
But the big difference is latency. Everything in a voice AI app needs to be built from the ground up for fast response times. If you're talking to a person, around 500 milliseconds sounds natural. When talking to an AI system, people bring those same expectations. Response latencies much above a second generally doom your voice agent to low completion rates, low NPS scores, and hang-ups. And we're talking here about voice-to-voice latency. That's the time between when I, the human, stop talking and the time I hear the first byte of audio come back from the LLM.
Let's take a look at how latency adds up in a typical voice-to-voice AI application. This is a breakdown from a real voice AI app running in a web browser on macOS, talking over the internet to a voice agent running in the cloud on Pipecat.
A couple things to note. First, our voice-to-voice latency is just under a second. That's good, but not great. We can make things a little faster, but that comes with trade-offs in quality or cost. Second, it's frustratingly easy to do even worse than this. Your LLM might be slower. Other things get in the way, or worst of all, Bluetooth. Don't get me started on Bluetooth.
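To make the arithmetic concrete, here's a rough sketch of how a voice-to-voice budget adds up to just under a second. The stage names and numbers below are illustrative placeholders, not the exact figures from the slide.

```python
# Illustrative voice-to-voice latency budget. Every stage name and
# number here is a rough placeholder, not data from the talk's slide.
budget_ms = {
    "mic capture + OS audio input": 40,
    "client-side turn detection (VAD)": 200,
    "network (client -> cloud)": 40,
    "speech-to-text / audio encoding": 150,
    "LLM time-to-first-token": 300,
    "text-to-speech time-to-first-byte": 150,
    "network (cloud -> client)": 40,
    "client playout buffer": 40,
}

total = sum(budget_ms.values())
print(f"estimated voice-to-voice latency: {total} ms")
for stage, ms in budget_ms.items():
    print(f"  {stage:>35}: {ms:>4} ms")
```

With these placeholder numbers the total comes out to 960 ms, which matches the "just under a second, good but not great" framing: shaving any single stage buys you relatively little, so you end up trading quality or cost for speed across several of them.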
But the single biggest mistake we see
people making is using the wrong
approach to sending and receiving audio
over the network. It's time to talk
about WebRTC and websockets.
If you're new to building voice applications, you probably think, "Hey, I need a long-lived connection. I'm going to send audio and video over this long-lived connection. I've used websockets for long-lived data connections before. I'm just going to write some websockets code."
That's great if you're sending small amounts of data over a long-lived connection. It doesn't work for real-time audio. In fact, websockets are almost the opposite of what you want from a network engineering perspective for real-time audio and video.
So, let's do a compare and contrast on
this.
Websockets are great if you're trying to deliver audio and you want something really easy that can target all platforms. If you're trying to build a prototype across lots of different platforms, websockets are the way to go. On the other hand, WebRTC solves a bunch of things for you, giving you high-quality audio, high bandwidth, and low latency. But the catch is it can be a lot more complicated to implement, and because it's so specialized, it can be frustrating. Lots of applications use both, but for different things.
So, here's the TL;DR if you only remember one thing from this talk. Use websockets for server-to-server use cases, for small amounts of structured data, and for prototyping. Use WebRTC if you're sending audio and video streams over the internet from your web app or your native app. That's where it excels. So, why is it so important to use WebRTC for real-time edge-to-cloud audio? A websocket is a TCP connection.
TCP guarantees in-order delivery of network packets. If you send some data, that data is going to arrive exactly as you sent it, or it's not going to arrive at all. You put packets in your operating system's send queue, and that queue is going to keep trying to send them until they either get ACKed by the other side or your connection completely times out. And this is in general what you want for most network programming. If you're making a web request, for example, this is perfect. It's not what you want if you're aiming for conversational latency.
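A toy simulation makes the difference visible. Everything below is a hypothetical sketch, not real networking code: packets carry 20 ms of audio each, packet 2 is lost, and its retransmit arrives 250 ms late. On the TCP side, one late packet stalls everything behind it; a best-effort receiver just skips it.

```python
# Toy model of TCP in-order delivery vs. best-effort delivery when
# one packet is delayed by a retransmit. All numbers are made up.

def tcp_playout(arrivals):
    """TCP delivers strictly in order: every packet behind a late one
    is held back until the late packet arrives (head-of-line blocking).
    Returns {seq: time in ms when the app actually gets the packet}."""
    delivered = {}
    blocked_until = 0
    for seq in sorted(arrivals):
        blocked_until = max(blocked_until, arrivals[seq])
        delivered[seq] = blocked_until
    return delivered

def best_effort_playout(arrivals, budget_ms):
    """UDP/RTP-style: play each packet if it makes its own deadline
    (send time + latency budget); otherwise just skip it."""
    return {
        seq: t for seq, t in arrivals.items()
        if t <= seq * 20 + budget_ms
    }

arrivals = {0: 20, 1: 40, 2: 290, 3: 80, 4: 100}  # seq -> arrival ms

tcp = tcp_playout(arrivals)              # packets 3 and 4 stall until 290 ms
udp = best_effort_playout(arrivals, 60)  # only packet 2 is skipped
```

Note that packets 3 and 4 arrived at 80 and 100 ms, well within budget, yet under TCP semantics the application can't touch them until the retransmit of packet 2 lands at 290 ms. That's the head-of-line blocking you can't opt out of on a websocket.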
Remember that we're trying to hit a voice-to-voice latency of under one second, and ideally even better. What we want is to ignore things like the occasional lost packet. Imagine a packet is dropped. I don't really care about what happened a second ago. So WebRTC does clever math and buffer management, which we're going to talk about more, to hide it when that happens. This is the first and most important thing WebRTC does for you. It's all the machinery that sends packets as fast as possible and ignores packets that don't arrive inside the very tight latency budget we're operating within. Think of it as super fast, best-effort networking.
If this were all WebRTC could do for you compared to websockets, it would still be worth using WebRTC just for this, because you literally can't implement this on top of a TCP stack or on top of websockets. Again, the operating system is just going to keep trying to send whatever you tell it to send. It's going to block everything if you have any packet loss or significant jitter or delay in the network. And we have lots and lots of real-world data on this. In the real world, this means you will get audio glitchiness, high latency, or unexpected socket disconnections in 10 to 15% of your network connections.
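The "clever math and buffer management" mentioned above is, at its core, a jitter buffer: hold packets for a short fixed delay so they can play out at a steady cadence, and drop anything that misses its slot. The sketch below is deliberately simplified and all names and numbers are ours; a real WebRTC jitter buffer adapts its depth continuously based on observed network conditions.

```python
# Minimal fixed-depth jitter-buffer sketch (real ones are adaptive).
FRAME_MS = 20         # one audio frame per packet
JITTER_DEPTH_MS = 60  # fixed playout delay added to absorb jitter

def playout_schedule(arrivals):
    """arrivals: list of (seq, arrival_ms), packets sent every FRAME_MS.
    A packet plays at seq * FRAME_MS + JITTER_DEPTH_MS; anything that
    arrives later than that is concealed/skipped rather than stalling
    playback. Returns (played_seqs, dropped_seqs)."""
    played, dropped = [], []
    for seq, arrival_ms in arrivals:
        playout_ms = seq * FRAME_MS + JITTER_DEPTH_MS
        if arrival_ms <= playout_ms:
            played.append(seq)
        else:
            dropped.append(seq)  # too late: skip it, don't wait
    return played, dropped

# Jittery arrivals; packet 3 is very late and gets dropped.
arrivals = [(0, 25), (1, 38), (2, 71), (3, 250), (4, 95)]
played, dropped = playout_schedule(arrivals)
```

The design trade-off is visible in `JITTER_DEPTH_MS`: a deeper buffer rides out more jitter but adds directly to your voice-to-voice latency, which is exactly why WebRTC implementations tune it dynamically instead of picking one number.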
But WebRTC does a lot more than that. If you go and try to build the same application on websockets, you have to handle resampling, you have to handle packetization, and you have to do all that bandwidth estimation. Networks are constantly changing and fluctuating, so you can't just send one bit rate. And you also get standard APIs for stats and observability. This is all just built into WebRTC, but if you decided to use websockets, you'd have to build it yourself. So, if you look at this code up on the screen, on the right side is an example with WebRTC sending one bidirectional stream of audio. On the left side is websockets. You want to spend more time building your application and less time worrying about things like sample rates. That's why you pick WebRTC. And this is real code using the OpenAI Realtime API, which offers both options for developers.
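To give a feel for the sample-rate bookkeeping WebRTC hides, here's the standard PCM arithmetic for slicing audio into 20 ms packets. The rates are typical examples (44.1 kHz mic capture, 48 kHz on the wire), assumptions on our part rather than values from any particular API.

```python
# The kind of bookkeeping WebRTC handles for you: resampling ratios
# and slicing raw PCM into 20 ms packets. Plain PCM arithmetic;
# the specific rates are illustrative assumptions.
SAMPLE_RATE_IN = 44_100   # e.g. what the mic hands you
SAMPLE_RATE_NET = 48_000  # e.g. what the codec/RTP path expects
FRAME_MS = 20
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM

# Resampling: every second of mic audio must be converted between rates.
resample_ratio = SAMPLE_RATE_NET / SAMPLE_RATE_IN  # ~1.088

# Packetization: how big is each 20 ms packet, and how many per second?
samples_per_frame = SAMPLE_RATE_NET * FRAME_MS // 1000   # 960 samples
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE   # 1920 bytes
one_second = SAMPLE_RATE_NET * BYTES_PER_SAMPLE
packets_per_second = one_second // bytes_per_frame       # 50 packets
```

Fifty packets a second, every second, per direction, with the sample rate silently mismatched between your mic and the network path: that's the "worrying about sample rates" you'd be signing up for with raw websockets.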
So, I hope we've convinced you that you should use WebRTC if you're doing edge-to-cloud audio, and especially audio and video. We love talking about this stuff. If you come find us later, we will talk your ear off about jitter buffers and packet management and bandwidth shaping and all that stuff. But what we want to do now is move on and talk about a whole other category of fun stuff, which is what you can actually do with WebRTC. I'll start by saying you can embed real-time audio in any app you write. Any website, any iOS app, any Android app, lots of fun embedded stuff. And the network connections will just work. You will get good audio on any device, any platform, almost any real-world network connection. And I bet you already use WebRTC today. If you've used Facebook Messenger, WhatsApp, Zoom, Discord, any of these applications, they're using WebRTC. But you might not know there are even more cool things happening with WebRTC.
I worked with a company that was doing surgery over the internet. People teleoperate vehicles in the field. It's super cool. WebRTC is kind of the standard language of the real-time world, and that's why it's so easy for us to go build conversational intelligence on top of it. All this stuff's already been solved. I mean, in the new LLM era, we know lots of people who spend hours talking to computers, driving their development environments with voice, doing brainstorming, treating the computer as a personal assistant, a coach, a therapist, a researcher. I'm convinced voice is going to be the core building block of the next generation of UIs, of the UIs for the generative AI era. We have to do a little iteration before we figure out what those UIs really look and sound like, but that future seems very clear to me. One of the things I say to people to try to communicate how excited I am about building all this stuff is: we all lived through the last platform shift. It was recent enough that the move of most computing from desktop to mobile happened within our memories. By analogy, we are in late 2007 now. We have the first iPhones, but we haven't yet invented pull-to-refresh.
What keeps me so excited and motivated is that voice feels like the next bicycle for the mind. Today we've only been able to use computers with our eyes and our hands, but now I can talk to them as well. Think of all those situations where you have your voice available but you don't have your hands. And the other great thing is you can keep all this computing power remote, access it, and have small devices near you.
So, on the small devices front, we
brought a friend with us today. Uh, he's
a little shy sometimes, so give him just
a second.
So, uh, Squabbert, can you tell people a
little bit about your tech stack?
This is Squabbert, who Chad built, and
Chad's daughter, Ella, named Squabbert.
Squabbert,
can you tell the people a little bit
about your tech stack?
I'm using MLX, Whisper, Gemma 3, and a mildly buggy custom logits sampler that Kwindla wrote. I mean, come on now. I had a good reason that I wanted to write a logits sampler: I wanted you to be able to do something that even the really amazing big LLMs running in the cloud can't do. Do you know what I'm talking about?
You're talking about my counting talent. It's pretty neat, even if I do say so myself. I mean, counting syllables is like the 2025 version of how many Rs are in strawberry. And you know, Squabbert, why don't we demonstrate that a little bit? Can you make up a poem, like a four-line poem, about computer programming, and use only two-syllable words?
Two. Okay, listen. Logic knowing system rolling.
One more maybe. Uh, that was good, but I've heard you do better.
Okay, here's another one. Dancing colors bright, feel so right, and night magic, pure delight.
You forgot about the two-syllable words thing, but we'll let it go. Yeah, definitely the first one had two-syllable words. I think we should say goodbye just in the interest of time. Squabbert, thank you so much.
[Applause]
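As an aside, the "custom logits sampler" Squabbert mentioned is a form of constrained decoding: before sampling, mask out the logits of tokens that violate the constraint so they can never be chosen. The sketch below is our own generic illustration with a crude vowel-group syllable counter, not the actual sampler from the demo.

```python
import math
import re

# Generic constrained-decoding sketch: zero out (set to -inf) the
# logits of tokens that fail a predicate, then sample as usual.
# This is an illustration of the technique, not the demo's code.

def mask_logits(logits, allowed):
    """logits: {token: score}. Keep only tokens where allowed(token)."""
    return {
        tok: (score if allowed(tok) else -math.inf)
        for tok, score in logits.items()
    }

def greedy_pick(logits):
    """Stand-in for sampling: take the highest-scoring token."""
    return max(logits, key=logits.get)

# Crude syllable counter: count groups of consecutive vowels.
def two_syllables(word):
    return len(re.findall(r"[aeiouy]+", word)) == 2

# Toy vocabulary with made-up logits.
logits = {"logic": 2.0, "stream": 3.0, "buffer": 2.5,
          "net": 2.8, "packet": 1.0}

masked = mask_logits(logits, two_syllables)
print(greedy_pick(masked))  # one-syllable "stream" and "net" are masked
```

Here the unconstrained greedy choice would be "stream" (score 3.0), but after masking, the sampler can only pick among the two-syllable candidates, which is exactly the kind of hard guarantee that prompting a cloud LLM can't give you.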
I think Squabbert did really well. So that is a Raspberry Pi connected with a peer-to-peer WebRTC connection directly to my laptop over the same local area network. So that was a serverless WebRTC connection. Squabbert's talking directly to the laptop. But what's super cool about WebRTC is you have all these different choices for how you want to connect things. You could do this local connection, or something like Squabbert could connect to a server running in the cloud and do all the AI stuff there. And the third option is you can connect up to something like Pipecat and make it multi-party, so you can bring LLMs into meetings or other places like that. It's super awesome how you get all this flexibility to build these things in the way that matches your application.
So we want to close with a video from a builder who's in the audience. We're super excited, and the best part about this is that when she built this, she had never written any code before. What makes me so excited about the future of voice is that if we can make this easy enough, people who have really innovative, inspirational ideas can go and do stuff themselves.
Hi, I'm Yashin. I'm a mom of two bilingual kids. I'm raising my kids bilingual because I want them to connect with my cultural roots, and that's a wish shared by many parents of bilingual children. But raising bilingual kids is hard. It's expensive. It's time-consuming. And it often feels like a big chore for both the kids and the parents. I believe we can do a lot better with today's technology. I really believe we can make language education feel more natural and even fun for the kids. Here's a quick clip of what I've been working on.
Hi there buddy. Ready to have some fun
today?
How about we start with something fun?
Let's say hello. In Mandarin, we say,
can you say,
Okay, it's still early, but I'm excited
about what's possible. I'm not
technical, but with just a little bit of
guidance from a kind member of this
community, I was able to bring this
first version to life.
I already have a group of eager testers,
mostly parents like me, who are excited
to try this. If this also sounds
exciting to you, I would love to
connect. Thanks so much for watching and
let's build this future together.
[Applause]
So, we put the QR code up here for Yashin's project, which I absolutely love. She's here. If you're in the audience and you're interested in multilingual stuff or building these kinds of things, please find her. It's such a great thing. If you're watching on YouTube, here's the QR code. Sean and I are super excited about these kinds of projects. And the person Yashin shouted out in the video is of course Sean, who has done more than anybody I know to make WebRTC accessible to everyone. The idea that
we want to leave you with is: if you have an idea in voice AI and WebRTC, whether you've been a programmer for years or you're just getting started, we are here to support you. We're so excited, and we believe that with things like Pipecat and libpeer, if we can make this easier, we're going to see the next generation of really exciting, innovative projects. Come find us in the hallway or online. We hang out on Discord and Twitter and LinkedIn. And here are the resources that you can scan, which hopefully you find helpful. Kwindla wrote an amazing book that I believe is in your bag. Thank you so much. We can't wait to see what you build.
[Applause]