Pipecat Cloud: Enterprise Voice Agents Built On Open Source - Kwindla Hultman Kramer, Daily

Channel: aiDotEngineer

Published at: 2025-07-31

YouTube video id: IA4lZjh9sTs

Source: https://www.youtube.com/watch?v=IA4lZjh9sTs

Hi everybody. My name is Kwin. I am a
co-founder of a company called Daily.
Daily's other founder, Nina, is in the back there.
I'm stepping in for my colleague
Mark, who couldn't make it today, so
we're going to do this fast and very
we're going to do this fast and very
informally, but I think that's a good
way to do it at an engineering
conference. I don't have as much code to
show as the last awesome presentation,
but I'll try to show a little bit. We're
going to talk about building voice
agents today. Uh I work on an open
source vendor neutral project called
Pipecat. Um and a lot of other people at
Daily do too because Voice AI is growing
fast and is super interesting and is a
good fit for what we do as a company. We
started in 2016. We are global
infrastructure for real-time audio,
video, and now AI, for developers. Pipecat
sits somewhere higher up in the stack
than our traditional infrastructure
business. So we'll talk a little bit
about how you can build reliable,
performant voice AI agents completely
using open source software. We also
recently launched a layer just on top of
our traditional infrastructure designed
for hosting voice AI agents. We'll talk
just a little bit about that. Um so
we've been doing this a long time. We
care a lot about the developer
experience for very fast, very
responsive real-time audio and video. We
have a long list of engineering firsts
we're proud of, but that's not why
you're here today. Uh, happy to talk
about that later though. Um, if we step
back and orient a little bit, what are
you doing when you build a voice agent?
I tend to orient people with
three things they have to think about.
You've got to write the code. You have
to deploy that code somewhere. And then
you have to connect users, over the
network or over a telephone connection, to
that agent. A few things here. User
expectations are high. Voice AI is new,
but it's growing fast, I think, because
with the best technologies that are just
now becoming available we're able to meet
those expectations. Users expect the AI to
understand what they're saying; to feel
smart, conversational, and human; to be
connected to knowledge bases; to have
actual access to useful information for
whatever it is doing for that user; and
to sound natural. There's definitely
an uncanny valley problem that generative
AI fell into for a very long time.
Now we're on the other side
of that for voice AI, which is really
exciting. The agents have to respond
fast. It varies by language, by culture,
and by individual, but roughly speaking
humans expect about a 500 millisecond
response time in natural human
conversation. If you don't get close to
that in your voice AI interface, you are
probably going to lose most of your
normal users. So we tell people to
target 800 millisecond voice-to-voice
response times. That's not easy to do
with today's technology, but it is
definitely possible. And build your UIs
thoughtfully, with the understanding that
humans expect fast responses.
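To make that 800 millisecond target concrete, here is a rough, illustrative budget for a cascaded (speech-to-text, LLM, text-to-speech) agent. The numbers are not from the talk; they are hypothetical placeholders just to show where the time typically goes.

```python
# Illustrative voice-to-voice latency budget (hypothetical numbers, not measurements).
budget_ms = {
    "audio in + network to agent": 50,         # client -> transport, jitter buffer
    "speech-to-text (final transcript)": 250,  # includes endpointing / turn detection
    "LLM time-to-first-token": 300,
    "text-to-speech time-to-first-byte": 150,
    "audio out + network to client": 50,
}
print(sum(budget_ms.values()), "ms total")     # ~800 ms if every stage hits its budget
```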
The other thing that's hard, a little bit
like fast response times, is knowing when
to respond. Humans are good, but not
perfect, at knowing when somebody we're
talking to is done talking and when we
should start talking. Voice AI agents
are not as good at that yet, but they're
getting better. So we'll talk a tiny bit
about that. So why do developers use a
framework like Pipecat instead of writing
all the code themselves? Well, a little
bit of it is all those hard things on
the previous slide that you probably
don't want to write the code for
yourself if you're mostly thinking about
your business logic, your user
experience, and connecting to all of your
systems. You want to use battle-tested
implementations of things like turn
detection, interruption handling,
context management, calling out to other
tools, function calling in an
asynchronous environment, all that
stuff. Developers tend to use
frameworks these days for lots of the
agentic things they do, and in voice
AI, I think, it's even more important to
sit on top of really well-tested
infrastructure and code components than
even in other domains. Pipecat
appeals to developers because it's 100%
open source and completely vendor
neutral. You can use it with lots of
different providers at every single
layer of the stack that Pipecat enables.
For example, there's native telephony
support in Pipecat, so you can use
Pipecat with lots of different telephony
providers in a plug-and-play manner. You
can use Twilio, for example, which a lot
of developers know. If you're in a
geography like India where Twilio
doesn't have phone numbers, you can use
Plivo. A bunch of other telephony providers
are supported; a rough sketch of the plug-and-play idea follows.
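As a hedged sketch of what that plug-and-play swap can look like: the only provider-specific piece is the frame serializer attached to a websocket transport, so the rest of the pipeline never knows which carrier the call came from. The class names, import paths, and constructor arguments here are assumptions based on recent Pipecat releases and may differ in the version you use.

```python
# Sketch only: serializer/transport names are assumptions; check your Pipecat version.
from pipecat.serializers.plivo import PlivoFrameSerializer
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.transports.network.fastapi_websocket import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)

def make_telephony_transport(websocket, provider: str, stream_id: str):
    # Only the serializer that speaks the provider's media-stream protocol changes;
    # the agent pipeline built around this transport stays identical.
    serializer = (
        TwilioFrameSerializer(stream_sid=stream_id)
        if provider == "twilio"
        else PlivoFrameSerializer(stream_id=stream_id)
    )
    return FastAPIWebsocketTransport(
        websocket=websocket,
        params=FastAPIWebsocketParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            serializer=serializer,
        ),
    )
```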
There's also a native audio smart turn
model that's completely open source in
Pipecat. The community has gotten large
enough that there's cutting-edge ML
research, at least in the small-model
domain, coming out of this open source
community, which is really fun. Pipecat
Cloud, I think, is a really nice advantage
for the Pipecat ecosystem. It's the first
open source voice AI cloud built from the
ground up to host code that you write, but
that is designed for the problems of voice
AI. And Pipecat supports a lot of models
and services; the count is something like
60 plus. All the things you would want to
use in a voice AI agent are probably in
the Pipecat main branch, so you probably
don't have to write code to get started,
though the appeal is that you can write
lots and lots of code if you want to. So
there's no ceiling.
I'll talk a little bit about what the
architecture looks like. And we probably
won't have time to talk about client
SDKs because most of you in this room
are probably building for telephony use
cases. But there's a really rich and
growing set of JavaScript, React, iOS,
Android client side components and SDKs
that people in the pipecat community are
using to build multimodal applications
that run in the web and on native mobile
platforms. Um, we talked about this so I
will actually just skip this slide. Uh,
I hope we'll have time for Q&A. That's
the most fun part. Um, here's the other
piece that often helps orient people.
This is what a Pipecat agent looks like.
So, you're building a pipeline of
programmable media-handling elements.
These are all written in Python,
although a lot of the performance-sensitive
ones bottom out in some kind of C code,
which is pretty common in real-time media
handling. You probably don't have to worry
about that level, though. You're probably
just thinking in Python.
Pipecat pipelines can be really simple.
They can have just a couple of elements,
maybe just three: something for the
network, something that's doing some
processing, and something that's sending
stuff back out to the network. Or they can
be quite complicated. We see enterprise
voice agents often become quite
complicated, because they're doing
complicated things and connecting out to
a large variety of existing legacy
systems. So, as an example of a little
bit of that span: the left two
screenshots here are from the Pipecat
docs about how you work with the OpenAI
audio-centric models in Pipecat.
OpenAI gives you a couple of different
shapes of models and APIs that you can
use. One is chaining together
transcription, a large language model
operating in text mode, and voice output;
a minimal sketch of that shape is below.
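Here is a minimal sketch of that chained (STT, LLM, TTS) shape as a Pipecat pipeline. Treat the service classes, constructor arguments, and import paths as assumptions: they follow the pattern of Pipecat's example bots, but exact names vary between releases and providers, and the room URL and keys are placeholders.

```python
# Minimal cascaded voice agent sketch; import paths and classes are assumptions,
# check the Pipecat examples that match your installed version.
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def main():
    transport = DailyTransport(
        room_url="https://example.daily.co/room",  # hypothetical room URL
        token=None,
        bot_name="Voice Agent",
        params=DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )
    stt = DeepgramSTTService(api_key="...")                   # speech -> text
    llm = OpenAILLMService(api_key="...", model="gpt-4o")     # text -> text
    tts = CartesiaTTSService(api_key="...", voice_id="...")   # text -> speech

    pipeline = Pipeline([
        transport.input(),   # audio frames in from the network
        stt,
        llm,                 # a real agent would also wrap this with a context aggregator
        tts,
        transport.output(),  # audio frames back out
    ])
    await PipelineRunner().run(PipelineTask(pipeline))


if __name__ == "__main__":
    asyncio.run(main())
```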
The other is using their new, and in
some ways experimental, speech-to-speech
models, which are also really awesome and
promising. You can do either of
those approaches in Pipecat just by
changing probably three or four lines of
code; a sketch of that swap follows. On
the right is the core chunk of a few
hundred lines of Python code, and a flow
diagram, for a more complicated pipeline.
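A hedged sketch of that three-or-four-line swap, reusing the transport from the previous sketch: the separate STT, LLM, and TTS services are replaced by a single speech-to-speech service, and the rest of the pipeline stays the same. The class name is an assumption based on Pipecat's OpenAI Realtime integration and may differ in your release.

```python
# Sketch of swapping the cascaded services for one speech-to-speech service.
# OpenAIRealtimeBetaLLMService is an assumption; check your Pipecat version's
# OpenAI Realtime / speech-to-speech service name.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.openai_realtime_beta import OpenAIRealtimeBetaLLMService

llm = OpenAIRealtimeBetaLLMService(api_key="...")  # audio in, audio out

pipeline = Pipeline([
    transport.input(),   # same transport as before
    llm,                 # replaces stt, llm, and tts from the chained version
    transport.output(),
])
```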
This is one of my favorite starter-kit
examples for Pipecat. It uses two
instances of the Gemini Multimodal Live
API in audio-native mode: one drives the
conversational flow, and the other is
another participant in the conversation
that plays a game with the user. So
there's an LLM-as-a-judge pattern here,
but in the context of a game, and you're
moving the audio frames through both
pipelines selectively, depending on the
results of the real-time inference.
That's a pattern we also see in
enterprise use cases, but it's fun to
clone this one, run it, and play the game.
Um, we listed some of the services here.
Uh, we can talk a lot more if, uh, you
want to in the Q&A about sort of what we
see people actually using in production
most often in terms of models and
services, uh, in enterprise voice AI.
So, that's a very quick rundown of the
Pipecat framework, which is how you write
the code. Now, how do you deploy it? And
why am I talking to you about Pipecat
Cloud today? There are a bunch of
hard things about Voice AI that are
unique to these use cases. These are
long running sessions. Uh they have to
use network protocols that are designed
for low latency. Um things like
autoscaling
are not available out of the box for
these workloads the way they are for
something like HTTP workloads. So I was
actually quite resistant for a long time
to building anything commercial around
Pipecat at Daily, because we do the
low-level infrastructure, and we already
have things we do that serve the Pipecat
community. But it got to the point where
a very large percentage of the questions
in the Pipecat Discord were about how to
deploy and scale. I initially felt like
that was a solved-enough problem, because
what we do at the infrastructure level
helps you in one way, and what a lot of
our friends and customers do much higher
up in the stack, with platforms that wrap
the whole voice AI problem set in very
easy to use dashboards, tools, and GUIs,
are also really good solutions. But what
we came to realize is that there was a
middle of the stack that people were
asking about a lot in the open source
community, and it boiled down to "how do
I do my Kubernetes?" So people would ask
questions in the Pipecat Discord about
deployment and scaling, and we would say,
"Oh, well, if you really want to run this
stuff yourself on your own
infrastructure, here are the five things
you do in Kubernetes." And people would
say some version of "Kuber-what?"
We didn't have a good answer to that, so
we thought we'd come up with one: a very
thin layer on top of our existing global,
media-oriented, real-time infrastructure.
This is not a very good marketing
tagline, but I think of it as a very thin
wrapper around Docker and Kubernetes,
optimized for voice AI.
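As a hedged illustration of what "a thin wrapper around Docker and Kubernetes" can mean in practice: with Pipecat Cloud you package your agent as a container image and expose an entry point that the platform calls once per session. The import, class name, and argument fields below are assumptions drawn from the Pipecat Cloud starter project and may not match the current release exactly; run_my_pipeline is a hypothetical helper standing in for your own pipeline code.

```python
# Hypothetical Pipecat Cloud entry-point sketch; the pipecatcloud import and the
# exact session-argument fields are assumptions based on the starter docs.
from pipecatcloud.agent import DailySessionArguments


async def bot(args: DailySessionArguments):
    # The platform starts one instance of this function per session, passing the
    # room and token it provisioned; you build and run your Pipecat pipeline here.
    await run_my_pipeline(room_url=args.room_url, token=args.token)
```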
So, what are the things we're trying to
solve for? Fast start times are very
important. If somebody calls your voice
agent and they hear ringing, they want to
hear that voice agent pick up the phone
and say hello pretty fast.
Almost no matter what you do in AI, you
care about cold start times, but it's
even more important when the user is
initiating some action and expects to
hear audio back. Cold starts are hard; if
you've built GenAI infrastructure, you
know that. We try to solve the cold start
problem for voice AI, and I'm happy to
talk about cold starts in great detail,
because it's something I've been thinking
about a lot over the last few months.
Autoscaling is a little bit related to
cold starts. You want your resources to
expand as your traffic pattern expands.
The alternative is that you know exactly
what your traffic pattern is and you just
deploy a bunch of resources, but that
doesn't work for most workloads; most
people have time-dependent or completely
unpredictable workloads, so you need to
scale up and scale down.
Real time is different from non-real
time, and by non-real time I mean
everything that's not conversational
latency of a few hundred milliseconds or
less. If you are making an HTTP request,
you want it to be fast, but you don't
really care whether your P95 is 1,500
milliseconds or 2,000 milliseconds. In a
voice AI conversation, in most cases you
care a lot if your P95 goes above 800,
900, or 1,000 milliseconds for the entire
voice-to-voice response chain, and all
the individual inference calls you make
as part of that have to be much faster
than that, by definition. So the whole
networking stack, from the client to
wherever your Pipecat code is running,
and inside that Kubernetes cluster, has
to be optimized for real time.
You probably need global deployment, too.
You probably have GDPR, data residency,
or other kinds of data privacy
requirements, or you just need global
deployment because you want these servers
close to users, which helps with latency.
And all of these things have to be
delivered at reasonable cost. So we try
to take these things off of your plate
and help you build quickly and get to
market with your voice agents.
A couple of other things are worth
flagging here. We've done a lot of work
on turn detection, which is one of the
top three problems most people in voice
AI are thinking about how to make better
in 2025. Check out the open source smart
turn model that's part of the Pipecat
ecosystem if you're interested in that.
The open source smart turn model is built
into Pipecat Cloud and runs for free; our
friends at Fal host it. You've probably
heard of Fal if you're doing GenAI stuff:
very fast, very good GPU-optimized
inference.
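A hedged sketch of wiring the open source smart turn model into an agent: recent Pipecat releases expose it as a turn analyzer you attach to the transport, alongside a VAD. The analyzer class names, module paths, and parameter names here are assumptions and may differ in the version you run.

```python
# Sketch only: analyzer class names and params are assumptions; check your
# Pipecat release for the smart-turn integration it ships.
from pipecat.audio.turn.smart_turn.fal_smart_turn import FalSmartTurnAnalyzer
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url="https://example.daily.co/room",  # hypothetical
    token=None,
    bot_name="Voice Agent",
    params=DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),                    # "is anyone speaking right now?"
        turn_analyzer=FalSmartTurnAnalyzer(api_key="..."),   # "are they done speaking?"
    ),
)
```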
Then there's ambient noise and background
voices. One problem with voice AI is that
even though transcription models today
are very resilient to all kinds of noisy
environments, the LLMs themselves are
not. So if you are trying to do
transcription, figure out when people are
talking, figure out when to fire
inference down the chain, and ask your
LLMs to do something, then background
noise that sounds a little bit like
speech will trigger lots of interruptions
that you don't mean to happen and will
inject lots of spurious pseudo-speech
into your transcripts. And that's true
even for speech-to-speech models today;
they're not very resilient to background
noise. The best solution to background
noise today is a commercial model from a
really great small company called Krisp.
The Krisp model is only available with a
sort of big chunk of commercial
licensing, but you can use Krisp for free
inside Pipecat Cloud if you run on
Pipecat Cloud. You can also use Krisp in
your own Pipecat pipelines, with your own
license, if you run Krisp somewhere else.
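A hedged sketch of where noise suppression plugs in: Pipecat exposes input-audio filters on the transport, so a Krisp-based filter can sit there and the STT and turn-taking logic only ever see cleaned-up audio. The KrispFilter class and its configuration are assumptions; with your own license it points at a locally installed Krisp model, and on Pipecat Cloud the equivalent is enabled for you.

```python
# Sketch only: KrispFilter and its arguments are assumptions; check your
# Pipecat release and the Krisp SDK setup it expects.
from pipecat.audio.filters.krisp_filter import KrispFilter
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url="https://example.daily.co/room",  # hypothetical
    token=None,
    bot_name="Voice Agent",
    params=DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        audio_in_filter=KrispFilter(),  # suppress background noise/voices before STT
    ),
)
```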
Finally, agents are nondeterministic, as
we all know. There's a whole evals and PM
track here, and in every other track we
talk about this problem. We've got some
nice low-level building blocks for
logging and observability natively in
Pipecat, exposed through Pipecat Cloud
and through a bunch of partners we work
with on that. I'm happy to introduce you
to the great teams we work with at
various companies that are building
observability stuff.
That is my speedrun. I came in 20
seconds under the 15 minutes, but
because we are the last talk in this
block, if people want to do Q&A, totally
happy to.
Thanks. Uh, wonderful. Actually, I have
two very quick questions. One is, we're
based out of Sydney, Australia, and the
800 millisecond target is hard to hit
from there with OpenAI.
Do you have any alternatives for that?
Have you looked at other alternatives
for people outside the States? Yes,
that's a great question; I'll repeat it.
The question is: what do you do if you're
in a geography that is a long way from
your inference servers? In the case of
this particular question, you're serving
users in Australia, you're using OpenAI,
and OpenAI only has inference servers in
the US, and you don't want to make extra
round trips to the US. So there's a
couple of answers to that.
One is, if you make one long haul to the
US for all the audio, at the beginning of
the chain and at the end of the chain,
that is much better than making three
inference round trips for transcription,
OpenAI, and voice generation. So that's
one tool: we often tell people to just
deploy close to the inference servers
rather than close to the users, and
optimize for having one long trip and
then a bunch of very fast short trips.
That's good but not great. The other
option is to run open weights models
locally in Australia, which you can
definitely do.
It's a longer conversation about which
use cases you can serve with, say, the
best open weights models versus, you
know, GPT-4o and Gemini 2.0 Flash level
models. But there are definitely some
voice AI workloads now that you can
reliably run on the Gemma, Qwen 3, or
Llama 4 models. Okay.
Second question, maybe just related to
that: let's say we host models in
Australia itself. What's the
interconnectivity over the network from
your cloud? Do you go to, like, the
internet exchange locally out there?
Yes. So we
have endpoints all over the world; in our
world we call them points of presence. So
we have an edge server close to the user,
we'll terminate the WebRTC or the
telephony connection there, and then
we'll route over our own private AWS or
OCI backbones to wherever you need to
route to. If you're hosting in Australia,
you should be able to just hit our
endpoints and then do your hosting in
Australia. We also have some regional
availability of Pipecat Cloud now, and we
will launch a bunch more regional
availability of Pipecat Cloud over the
next quarter, so I hope we actually have
Pipecat Cloud in Australia soon. Although
you can also obviously self-host in
Australia and still use either Pipecat
itself or Pipecat plus Daily in other
ways. Thank you.
Yeah. Oh, sorry. Yeah. So, thanks for the
talk. Thank you. So, there are models
like Moshi, I don't know if you... Yeah,
yeah, I love Moshi. They basically claim
that turn detection is no longer needed
because they handle it inherently. Right,
so there's an open weights model called
Moshi, by a French lab called Kyutai.
Moshi is a sort of next-generation
research model where the architecture is
constant bidirectional streaming. So, you're always
streaming tokens in and the model is
always streaming tokens out. In a
conversational voice situation, which
Moshi was designed for, most of the
tokens streaming out are silence tokens
of some kind. And when they're not
silence tokens, it's because the model
decided it was going to do whatever the
model's trained to do, which is really
cool because that can mean not just that
the model does natural turn taking, but
also that the model can do things like
back-channeling. So the model can do the
things humans do, the things its data set
has audio for: when you're talking, I can
say, "uh, yeah, yeah," and it's not
actually a new inference call, it's just
streaming. That paper, the Kyutai Moshi
architecture paper, was my very favorite
ML research paper from last year. Now,
that model itself is not
usable in production for a bunch of
reasons, including that it is too small
a language model to be useful for
basically any real world use case. Um,
I have more to say about that, but I'm
super excited about that architecture. I
just think we're a couple of years away
from that architecture being actually
usable and trained as a production model.
There are speech-to-speech models from
the large labs that are closer to being
able to be used in production. Now, they
are not streaming-architecture models,
but they are native audio speech-to-speech
models, which have a bunch of
advantages, including really great
multilingual support. So like mixed
language stuff is great from those
models, and, in theory, latency
reductions. So OpenAI has a real-time
model, GPT-4o audio preview, that sits
behind their Realtime API; it's a good
model. Gemini 2.0 Flash is available in
an audio-to-audio mode, and they have
preview releases of 2.5 Flash. These
models are now good enough that you can
use them for use cases where you are more
concerned about the naturalness of the
human conversation than you are about
reliable instruction following and
function calling. They are less reliable
in audio mode than the state-of-the-art
models operating in text mode.
So what we generally see is that for a
small subset of voice AI use cases today,
the ones that are really about
conversational dynamics, narrative, and
storytelling, those models are starting
to get adopted. For the majority of
enterprise voice AI use cases, where you
really need the best possible instruction
following and function calling, those
models are not yet the right choice, but
they are getting better every release,
and all of us expect the world to move to
speech-to-speech models being the default
for like 95% of voice AI sometime in the
next two years. The question is when, in
your use case, a particular model
architecture will cross that threshold in
your evals.
Sorry, what about Sesame? Would you put
Sesame in the same bucket as Gemini and
OpenAI? Sesame is closer to Moshi. So,
there's another open weights, or partly
open weights, really interesting model
called Sesame. It's a little like Moshi;
in fact, it uses the Moshi neural
encoder. Yeah, it uses Mimi. So, Sesame
has not yet been fully released; there
isn't a full Sesame release. Also, I
think Sesame is smaller than what you
would probably need to use for most
enterprise use cases today, although the
lab training Sesame has bigger versions
coming, I think. There's also a
speech-to-speech model called Ultravox
which is really good; it's trained on a
Llama 3 70B backbone, and that team
supports the model and has a production
voice AI API. That model is worth trying
if you are really interested in
speech-to-speech models. If Llama 3 70B
can do what you want, I think Ultravox is
a good choice; if Llama 3 70B isn't quite
there for your use case, probably not,
but, you know, watch the next release of
Ultravox. So speech-to-speech is
definitely the future. I generally tell
people to experiment with it, but don't
necessarily start by assuming you're
going to use it for your enterprise use
case today.
Given your vendor neutrality, can you
speak to the strengths and weaknesses of
using the leading-edge multimodal input
models like OpenAI and Gemini? When
should I choose OpenAI and when should I
choose Gemini? So, my opinion is that
GPT-4o in text mode and Gemini 2.0 Flash
in text mode are roughly equivalent
models for the use
cases that I test every day. So I would
make the decision this way: if you can,
build a Pipecat pipeline, then just swap
the two models and run your evals; a
sketch of that swap is below.
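A hedged sketch of that kind of A/B swap: keep the rest of the pipeline fixed and change only the LLM service between eval runs. The service class names are assumptions based on Pipecat's OpenAI and Google integrations.

```python
# Sketch: swap only the LLM service between eval runs; everything else stays fixed.
# Service class names are assumptions; check your Pipecat version.
from pipecat.services.google import GoogleLLMService
from pipecat.services.openai import OpenAILLMService

USE_GEMINI = True  # flip per eval run

llm = (
    GoogleLLMService(api_key="...", model="gemini-2.0-flash")
    if USE_GEMINI
    else OpenAILLMService(api_key="...", model="gpt-4o")
)
# ...then build the same Pipeline([...]) around `llm` and run your eval suite.
```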
They're both really good models. One of
the advantages of Gemini is that it's
extremely aggressively priced; a
30-minute conversation on Gemini is
probably 10 times cheaper than a
30-minute conversation on GPT-4o.
Um, you know, that may or may not stay
true as they both change their prices.
But something we definitely hear a lot
from customers today is that they like
the pricing of Gemini. The other
interesting thing about Gemini is that it
operates in native audio input mode very
well. So you can use Gemini in native
audio input mode and then text output
mode in a pipeline, and that has
advantages for some use cases and some
languages; again, you can test that on
your evals. A sketch of that shape is
below.
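A hedged sketch of that shape, reusing a transport like the ones above: Gemini takes audio in natively and produces text, and a separate TTS service voices it. The Gemini Live service class and the way text-only output is configured are assumptions based on Pipecat's Gemini Multimodal Live integration; check the current docs for exact parameters.

```python
# Sketch: Gemini consumes audio natively and emits text; a TTS service voices it.
# Class names and the text-output configuration are assumptions.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.gemini_multimodal_live import GeminiMultimodalLiveLLMService

gemini = GeminiMultimodalLiveLLMService(
    api_key="...",
    # configure the service for text output (see its input params in your version)
)
tts = CartesiaTTSService(api_key="...", voice_id="...")

pipeline = Pipeline([
    transport.input(),  # raw audio goes straight to Gemini, no separate STT stage
    gemini,             # audio in -> text out
    tts,                # text -> speech
    transport.output(),
])
```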
OpenAI also has native audio support in
some of their newer models, but I think
they're just a little bit behind the
Gemini models in that regard. Do we have
time for one more, or are we done? One
more and then we're done. Yeah. What are
the general
advantages of speech-to-speech versus
going speech to text, doing something,
and then text to speech? So, what are the
general advantages of speech-to-speech
instead of speech to text, text
inference, and text to speech out? It's a
super
interesting question, and I have a
practical answer and a philosophical
answer. I'll keep them both short. The
practical answer is that you lose
information when you transcribe.
And so if there's information in the
audio that you would lose in the
transcription step, and that information
is useful for your use case, then a
speech-to-speech model is great. For
example, things like mixed language are
very hard for small transcription models;
you're almost always losing a bunch more
information in a mixed-language
transcription than you are in an
optimized, monolingual transcription. So
why not go to the big LLM that has all
this language knowledge and can do a
better job on the
multilingual input. The other advantage
is that you potentially have lower
latency: if you've trained an end-to-end
model for speech-to-speech, and it's all
one model, and you're not chaining
together inference calls, you can
probably get lower latency.
In practice, whether that's true today
depends more on the APIs and the
inference stack than it does on the model
architecture. But I think we're all
moving towards assuming that we just want
to do one inference call for the bulk of
things, and then we might use other
little models on the side for subsets.
The philosophical answer, though, is that
those advantages are probably outweighed
by the challenges to today's LLM
architectures when you have big context,
and audio tokens take up a lot of context
tokens. So when you're operating in audio
mode, you're expanding the context
massively relative to operating in text
mode, and that tends to degrade the
performance of the model. I think, a
little bit
relatedly, nobody has as much audio data
as they have text data for training. So
even though a big model is doing a bunch
of transfer learning, when you give it a
bunch of audio and it is in theory sort
of mapping all that audio to the same
latent space as its text reasoning, in
practice, it's definitely not doing that
exactly. It's doing something like that,
but not that. And so because we don't
have as much audio data, you see a lot
of issues with audio-to-audio models,
like the model will sometimes just respond in
a totally different language. And that's
cool, but it's never what you want in
the enterprise, you know, voice AI use
case. And the best guess for why that's
happening is that the input lands in the
right part of the latent space from one
projection, but from some other
projection it's in a totally different
part of the latent space when you gave it
audio instead of text, even though if you
transcribed that audio, the text would be
exactly the same. So, you know, latent
spaces are big, and to actually find our
way through them in post-training you
really have to have a lot of data, and
nobody has enough audio data yet. But the
big labs are going to fix that, because
audio matters and multi-turn
conversations matter.