Full Workshop: Realtime Voice AI — Mark Backman, Daily

Channel: aiDotEngineer

Published at: 2025-08-03

YouTube video id: nxuTVd7v7dg

Source: https://www.youtube.com/watch?v=nxuTVd7v7dg

I'm Mark with Daily. This is Aleix. And then we have a few other Daily folks: uh, Quinn, Nina, Varun, and
then I'm not sure where he went, but
Philip from right there from the Google
team, Google DeepMind team. So, this
session we're going to spend just a few
minutes getting everyone started. The
idea here is going to be a hands-on
workshop where all the folks I just
called out are going to be available to
help out. We'll walk you through a quick
start to get you up and running and then
the idea is to build something. So,
build a voice bot in the next 78 minutes
and 12 seconds or whatever time we have
left. Um, the one I guess there's one uh
consideration is the Wi-Fi. If you don't
have good Wi-Fi, you might want to try
to tether. I was able to tether and it
worked fairly well, but the conference
Wi-Fi was a little shaky. This is real
time, so you will be streaming data. It
does require a viable connection and not
just sending, you know, a few bits over.
So, just a heads up if you hit that as a
as a snag. So, I guess before I get
started, who here knows about Pipecat or
has built anything with voice AI?
Okay,
a smaller audience. Has anyone built any
real-time applications with LLMs or AI?
Maybe slightly bigger. Okay, great. So, Pipecat, um, is an open source repo. It's a Python framework for building voice and multimodal AI agents, and it's built by the team at Daily. Um, but it's an open source project that anyone can contribute to. It's been around for I
don't know just over a year now. And
yeah, like I would say officially
Pipecat was March 2024 something like
that. So 13 months. There we go. So just
a quick uh walk through maybe just to
kind of ground everyone in the thinking
around voice AI. Uh these slides weren't
built for this talk but I'm going to use
them. So the you know voice AI or
real-time applications are tough because
there's just you know we as humans
communicate all the time with each
other. Thousands tens of thousands of
years of evolution baked into our
brains. So it's pretty tough to make a
machine you know work on the same level.
So we have great expectations as the user. So, you know, you need a good listener, something that is smart and conversational. You need to be connected to data stores. It has to sound normal or natural. Think back to what voice bots sounded like even two or three years ago, and what many of them still sound like if you call them on the phone. It needs to sound natural. And actually, kudos to the Google team: the latest Gemini Live native audio dialog is quite good in that regard.
it has to be fast. So the whole end to
end communication needs to happen and
roughly you know kind of the benchmark
is around 800 milliseconds. Um you could
strive for more. I think we see maybe on
the human level it might be 500
milliseconds or somewhere on that uh
order. So it is pretty fast. So there's
a lot to kind of to get all the way
there. And this is something that we at Daily, everyone building Pipecat, have been working very hard on: getting all the way to meeting all these expectations. So just to ground you in some of this, since we're going to be working in Pipecat: Pipecat has a pipeline. I don't know, maybe, Aleix, do you want to talk a little bit about the origin of that quickly? Sure. Sure. Uh,
you can think about it as a multimedia pipeline, and you would think, what is a multimedia pipeline? Basically, just think about boxes that receive input, and input could be audio or video, and then those boxes will stream that same data, or modified data, or new data, to the following elements or processors; in Pipecat we call them processors. So in Pipecat you would have a pipeline where you have a transport, which is the input of your data. For example, when you're talking, you could be talking in a meeting, so that would be the audio of the user. Then you would have another box following that, which is the speech-to-text service. The speech-to-text service would grab audio from the user and transcribe it. Then you get text; that will be the following data that goes through the pipeline. And then the next one will be the LLM. So now the LLM has what the user has said, and it generates output, whatever the LLM generates, which would be tokens. Then those tokens are converted by text-to-speech, and the text-to-speech outputs audio, and the audio goes back to the transport so you can hear what the LLM has said. Today, uh,
what we're going to do today with Gemini Live, a lot of those boxes go away, because the LLM will do a bunch of these things. It will do the transcription, it will do the LLM, and it will do the text-to-speech in one of these boxes. But you might still require, for example, if you want to save the audio, record the audio into a file, a bunch of utilities to do that. And Pipecat has all that built for you. Um, so basically that's it. I mean, a lot of this is really just orchestration. So if you think about what Pipecat offers, it's orchestration. It also offers a lot of abstractions for a lot of common utilities, like Aleix had said: recording transcript outputs, artifacts you might want to produce, or even ways that you might manipulate the information in the pipeline itself. So
this image here, which is what Aleix actually just talked through, is what you would call, I guess, a cascaded model, where you have this flow-through of information. You can build with Google and many different services in this way. In the last year there's been an emergence of speech-to-speech models that now take audio in natively and output audio natively. Those models also allow for audio in and then, optionally, text and/or audio out. So you can, for example, take a raw microphone input, or, you know, audio input, and the model runs all of its logic, and you can opt to have it output text if you want to, say, parse the text output before speaking. So there are a few different demos we'll look at that offer that, and in Pipecat we show all the ways to do things, because that's what we offer, or at least what Pipecat offers, as a value proposition.
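To make the cascaded flow concrete, here is a minimal sketch of that kind of pipeline in Pipecat code. The service choices and module paths are illustrative assumptions (they vary across Pipecat versions), and `transport`, `context_aggregator`, and `llm` are assumed to be created elsewhere, as in the walkthrough later in this session.

```python
# Sketch of a cascaded voice pipeline; names and paths are assumptions.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.cartesia.tts import CartesiaTTSService

stt = DeepgramSTTService(api_key="...")                  # audio in -> text out
tts = CartesiaTTSService(api_key="...", voice_id="...")  # text in -> audio out

pipeline = Pipeline([
    transport.input(),               # user audio in (e.g. from a Daily room)
    stt,                             # speech-to-text
    context_aggregator.user(),       # fold the user's turn into the context
    llm,                             # any text-based LLM service
    tts,                             # text-to-speech
    transport.output(),              # bot audio back out
    context_aggregator.assistant(),  # fold the bot's turn into the context
])
```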
Yeah, just one thing, I don't think you'll mention it in the slides, but all these boxes, you can plug and play the service you want in Pipecat. So the speech-to-text, it could be, I don't know, Deepgram for example; the LLM could be Google or OpenAI or whatever. You can just plug and play any service you want, right? Yeah. The
modularity I guess is the other big
strength. So there's no, you know, you
can change out a service without
changing out your underlying application
code, which makes it easy. And we see with this a lot of companies that are building for voice AI might have maybe even a more complex thing: a pipeline here runs straight down, but you can actually have split branches, where you might have one leg that's running some logic and the other running different logic. We call that a parallel pipeline. So if you wanted to have, say, a failover, if vendor A goes down you can move to vendor B dynamically, even within the same conversation. That's something that Pipecat affords as well. And that can allow you to transfer context over. So, a lot of really cool stuff.
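A rough sketch of what a split branch looks like; the ParallelPipeline import path is an assumption based on current Pipecat conventions, and the idea of gating a fallback branch with your own logic is an illustration, not workshop code.

```python
# Sketch only: two LLM branches running inside one pipeline. ParallelPipeline
# takes one processor list per branch; how you gate/fail over between branches
# is up to your own logic and is assumed here.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.parallel_pipeline import ParallelPipeline

pipeline = Pipeline([
    transport.input(),
    stt,
    ParallelPipeline(
        [primary_llm],    # branch A: the vendor you normally use
        [fallback_llm],   # branch B: a second vendor you can shift traffic to
    ),
    tts,
    transport.output(),
])
```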
Uh, the goal again today being to get
you familiar with just building a voice
agent and building one to get started.
So, one of the cool things, um, like Aleix had pointed out, is that with the cascaded models there's a lot of complexity, but with a speech-to-speech model things get, you know, dramatically simplified. Your code may have looked something like this; this is an old example with a ton of orchestration in the pipeline. But with a speech-to-speech model, you may be able to simplify it down to this. Then you have to remember you actually still need orchestration around it. So, it does get simpler to some degree. Um, so it's more about the services you interface with. Uh, I
think with that why don't we transition
now, because I'm realizing we have only about maybe 70, 75 minutes left, to looking at the actual activity for today. All right. So there's a public repo, I don't know how big or small this is, but it's under daily-co. So on GitHub it's daily-co/gemini-pipecat-workshop.
We'll give everyone a chance to make
sure internet's working. And can
everyone see the the repo? Yeah, it's
really tiny. Yeah. Okay. I should have
it in big text somewhere.
Not really.
There we go.
All right,
let's take a look at this. So, what I want to do... so Aleix and I spent a little bit of time writing up this repo. This is meant to
be just a jumping off point. I'm going
to get you oriented and then I want to
look through one of the bot files which
is kind of the main pipecat code with
you and then we'll break and make this
an interactive session where we can
answer a bunch of questions. So in the
repo um you could either start doing it
now or maybe take a pause but this will
give you the steps to walk through
getting the quick start running. Before
we do that I want to take a moment here.
Let's see how the Wi-Fi is doing. That's
not a good sign.
It's down. Uh
well, hey, you know, it is tough to do
Wi-Fi for this many people. Instead of real time, it's going to be real slow. Real slow. Yeah. Real slow voice communication. All right.
Yeah. Yes. Yeah. There will be some.
Okay. So, in here is this Gemini bot.py file. This is all Python; everything will be in Python. There will be some client code options, which we'll look at in a second. Um, just to orient you, I'll jump right into the meat of the pipeline. We have this main function that runs your bot. Everything is going to run encapsulated within an aiohttp session; we're going to pass that session around. That's more just the mechanics of things in our pipeline.
Let's just jump to the simple part here. We'll have Daily as the transport; Daily is a WebRTC provider, and we also build Pipecat. There will be context aggregation. So one important note: when you speak, every turn of the bot is like a discrete point in time, and this is maybe less so the case for a speech-to-speech model, but basic LLMs get discrete inputs. Everything is like a REST API call, so you're going to get a snapshot of the conversation. The context aggregator is going to collect all the bits of the conversation, both from the user and the assistant, and put that into a form the LLMs can handle. In this case it's more for function calling and kind of logistics and management. Gemini is amazing because it offers a lot of this for you. But if you were to build with, say, Gemini not Live, just the text-based LLM, you'd have to have this context aggregation. That will then go to your LLM, which is going to be Gemini Multimodal Live, and then it's going to be output through Daily again on that side of the transport.
So you have Daily; we configure our service, which takes a number of arguments: you set up a room with a token, give it a name, and then give it some properties. There are docs, which are linked in the quick start. There's also a GeminiMultimodalLiveLLMService, which is a Pipecat class that is a wrapper around the Gemini Live API. So this, again, you just initialize and run
with the LLM. You see, we do a few
special things. We're going to define
tools. This one has just two really
basic kind of canned functions.
Fortunately, we're not calling out to
the internet because it's not working
very well. Um, our connection is not
working well. So, this one just has the dummy fetch-weather and restaurant-recommendation functions. These are two handlers that, when your function is called, just return this canned result information.
Uh, so we have the actual functions themselves that are defined in this function schema. You can use just native function definitions in whatever LLM format, but we also created this function schema, which is a universal schema that lets you define functions and move between any LLM without having to transform your LLM calls from OpenAI to Anthropic to Gemini to Bedrock, or Groq; they're all a little bit different, they all have slightly different formats. So this is more of a universal transform for that.
And then they're collected and translated into the native format in this tools schema. So we'll pass the tools to the Gemini service, and that's how it gets access to use and run those tools. There's also a prompt above, which I think in this simple example just says, hey, you're a chatbot, you have these tools available, and that's that.
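Here is roughly what those pieces look like. The module paths and the handler signature are assumptions based on current Pipecat docs, not the exact workshop code, and the weather result is canned, just like in the repo.

```python
# Sketch: a provider-agnostic function definition plus a canned handler.
# Import paths and the handler signature are assumptions for your version.
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema

weather_function = FunctionSchema(
    name="get_weather",
    description="Get the current weather for a location",
    properties={"location": {"type": "string", "description": "City name"}},
    required=["location"],
)

tools = ToolsSchema(standard_tools=[weather_function])

async def fetch_weather(params):
    # Dummy result instead of a real API call (the conference Wi-Fi is shaky).
    await params.result_callback({"conditions": "sunny", "temperature": "22C"})

# The tools schema is passed to the LLM service; the handler is registered by name:
# llm = GeminiMultimodalLiveLLMService(api_key=..., tools=tools, ...)
# llm.register_function("get_weather", fetch_weather)
```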
We're also setting up our context aggregation, where, for better or worse, we use OpenAI as kind of the default, the lingua franca for context. So everything gets folded back into the OpenAI format at a certain level. And we define the pipeline. The pipeline is, again, like Aleix had said, just a list, or I guess a tuple, of all of your different services that are running in the pipeline
you can write your own. So if you want
to instead make the LLM output text and
you want to either extract information
from that, maybe you have it in code
some XML or some type of information,
you can actually extract it, store it,
maybe your application does something
with it. Um, or maybe you want to inject
a texttospech type of uh frame. You can
actually do that by separating the LLM
the that audio output from audio output
to just be text. And then we you would
add like a texttospech service here or
you could write your own processor which
maybe not within the next hour. And then
lastly, all of these, or not all but many of them, have events they emit. The transport emits events for when the client connects and disconnects. So in this case we use this line here to actually inject a context frame into Gemini to kick off the conversation. When your client application connects, it's going to queue a frame. Again, frames are kind of the base format for information; think of a frame as an object for your pipeline. You're going to queue one of those frames, and what this function does is just grab the latest context. So when we set this one above, I think it just says hello; that's going to basically push that into the pipeline, and it will make its way to Gemini to initialize the conversation.
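That kickoff looks roughly like the following; the event names and helper calls follow the usual Pipecat pattern and are assumptions here rather than the exact lines from the workshop file.

```python
# Sketch: when the client joins, queue the current context so the bot speaks
# first. The event name varies by transport (e.g. "on_client_connected", or
# "on_first_participant_joined" for Daily); names here are assumptions.
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Push an LLM context frame (e.g. a "Say hello" message) into the
    # pipeline; it makes its way to Gemini and starts the first turn.
    await task.queue_frames([context_aggregator.user().get_context_frame()])

@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
    await task.cancel()
```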
And the rest of this you could think of
as boilerplate to run it. You create a
runner which then is the thing that
actually runs your task. And the task is
what runs the pipeline. So maybe beyond
today, but just know that that's
something that's required to run your
code. Okay, I'm going to pause here just
for any questions because I do want to
get to developing soon.
We provide both. That's a whole topic
that is a talk in itself. Um, the short
answer is if you're building a client
server app, you should with like strong
emphasis use WebRTC. It has a whole
bunch of properties that are relevant
like error correction, better audio
quality, etc., etc. If you're building
server to server, so one of the options
today is to build a phone chatbot, you
can use and it's probably the best
option to use websockets. So, you can
bring your own transport. In Pipecat there's actually a FastAPI version of that, a server that you can use to exchange messages over a websocket. So yeah, it's really up to you. So I guess
maybe the the takeaway is if you're
building client server, you really want
WebRTC, you could technically use
websockets, but you'll hit like a long
tail of of errors when you get to
production. And then server to server,
totally fine. You're going to be fine
with websockets.
Could anyone download the repo by any chance? Did anyone not? One person downloaded the repo. Two persons. Three, four.
Wi-Fi is dead. Wi-Fi is dead. Can anyone
Can you tether? Is that an option? Do
folks have tether hotspots on your phone
or that's even shaky. Oh jeez. Okay.
I can dance. Could do some I don't know.
You know, I'm not really, you know, it's
not my thing, but I could try.
True. Yeah, I guess I could just make
this uh code walk through. Um or we
could write it. What? Yeah.
Yeah, it Gemini does have its own VAD.
Yeah,
Gemini does have its own VAD. In fact,
if you're using a speech to text service
that likely also brings its own VAD to
the to the equation. What we found is
that so maybe to use some of our extra
time because we're having internet
issues. Um the VAD serves a really
important purpose of detecting when a
user starts speaking. So in the whole
kind of life cycle of a turn that user
speaking kind of ushers in the user's
turn for the conversation. So pipecat
will emit a user started speaking frame
and that will also push through an
interruption. So the user will interrupt
like anything that's talking. So if the
bot was speaking or whatever uh it
basically clears the way because the
user has expressed they want to speak.
The idea with the VAD is that we want it
to be extremely accurate and extremely
fast. So running something on device, we recommend Silero, which is an open source option. It works incredibly well. I don't know what the inference time is like; I don't know, milliseconds, extremely fast. So you're going to get an event back fast. In fact, you have the ability to tune how long to hear human speech before that event gets emitted, and the defaults are pretty good in Pipecat. There are maybe scenarios
where you want to change that. But the
VAD is a really important uh
consideration. It's extremely low CPU
consumption. Quinn has a great uh
spreadsheet breaking down the full cost analysis of an agent, and really
the CPU is going to be extremely low.
Interestingly, the TTS tokens, or characters, are the most expensive by far. So when you think about it, running that local VAD gives you superior performance, and there's not much of a cost hit. I mean, it's maybe a fraction of 1% to run a local VAD. But yeah, you have all sorts of choices; we find that to work really well.
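Tuning how long the bot waits is just a parameter on the VAD analyzer. A minimal sketch, assuming the current Silero analyzer API; parameter names may differ slightly by Pipecat version.

```python
# Sketch: Silero VAD with explicit timing knobs. Param names are assumptions.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams

vad = SileroVADAnalyzer(
    params=VADParams(
        confidence=0.7,   # how confident the model must be that it hears speech
        start_secs=0.2,   # speech needed before "user started speaking" fires
        stop_secs=0.8,    # silence needed before the user's turn is considered done
    )
)
# The analyzer is then handed to the transport, e.g. DailyParams(vad_analyzer=vad).
```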
Uh, it's integrated with phone carriers, too. There are a ton; I found out I hadn't worked much with phone. Varun's been our phone expert. Varun is like a figure in the WebRTC community, he's an author of many things, but he's also a phone expert, which I don't know if that crosses over too much; maybe that happened before. Um,
phones are super complicated in how you actually make calls. Pipecat supports all of them. So, maybe a very quick list. You can make a websocket connection with a phone provider like Twilio or Telnyx or Plivo or Exotel and exchange media streams. That's a way to have just a native websocket connection from Pipecat to Twilio: you'd call Twilio, it's going to open the websocket, and there'll be a handshake to get connected. You can also use PSTN, the public switched telephone network, which lets you dial in, and that's going to be kind of a different mechanism. There's also SIP, which is its own separate thing again, and all of the telephony providers support this as well. With SIP, you would call, say, Twilio, and you would call into a SIP provider like Daily, which offers SIP-enabled rooms, and you then have the ability to bring the two together via that SIP connection. The nice thing about SIP is that you have superior call control. It is slightly more
complicated whereas that websocket
connection is instantaneous. Your bot
needs to be up and running. So there's a
whole we're not going to talk about it
today, but like cold starts for agents,
they need to start immediately. So if
you don't have resources provisioned,
you don't want your users waiting like
20 seconds while the bot comes online.
So a longish medium-ish answer for a
very complicated question. Maybe you
didn't know that.
Yeah. Yeah. Yeah.
Yeah. Yeah. Answer that. Yeah. Yeah.
Cartesia?
Yeah. Can you... I didn't hear it. Well, I think you were asking how Pipecat compares to Cartesia. Is that it? Yeah, Cartesia is... well, I think they're going to do more stuff, but as of today it's a text-to-speech service. You can clone your voice, and they provide a real-time API: via websockets you pass the text and they reply with audio, basically with audio frames. And then Pipecat integrates with Cartesia as any other text-to-speech service, like ElevenLabs or anything else. Think about Pipecat as just a framework for developers where they can plug and play the service they want. So they can take Cartesia, they can put Cartesia in, they can take it out and put ElevenLabs in. Oops, even closer. Okay, now I can hear myself. Um, yeah, you can plug and play any service you want. You can change the LLM, you can use Llama, you can use Anthropic; you can use Cartesia or ElevenLabs; then for speech-to-text you can use Deepgram; or you can use one box like Gemini Live, which has all that built in for you. Is that clear? Yeah. Okay.
About what sir?
I'm not sure if I understood. Are you asking how to ensure that the LLM says the right thing? Yeah. Well, Pipecat doesn't have control over that. It's up to you to define the prompt, or define how the LLM will reply. If you want, you'll be able to write your own processors, those little boxes we call them, and check what the LLM has said, for example, before it's spoken, like putting some kind of real-time eval in to make sure the LLM behaved; you could do that. Yeah, all of that pipeline is very flexible. So you can put whatever you want there, in parallel or not in parallel.
For example, Mark was saying about
parallel pipelines. Like if you have
video and audio at the same time. Uh
with Gemini Live, you can do everything
in one box. But let's say you don't have
Gemini Live, you want to use other uh
other services that one does video and
that the other one does audio. You can
have a parallel pipeline which you know
it's like a tree, right? You have your
transport input and then if it's audio,
it goes this way. If it's video, it goes
that that way. And you can um you can do
things dynamically like like that. Yeah.
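A hedged sketch of one of those "little boxes": a custom processor that inspects LLM text before it reaches the TTS. The base-class and frame names follow the documented Pipecat FrameProcessor pattern and are assumptions here.

```python
# Sketch: a pass-through processor that watches text frames, e.g. for a
# lightweight real-time check before the text is spoken. Names follow the
# usual Pipecat pattern and are assumptions.
from pipecat.frames.frames import TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class SimpleGuardrail(FrameProcessor):
    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame) and "forbidden topic" in frame.text.lower():
            # Drop (or rewrite) the frame instead of passing it downstream.
            return
        await self.push_frame(frame, direction)

# It would then sit in the pipeline between the LLM and the TTS service.
```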
This next question. Yeah.
Sure.
Okay. So the question was: are guardrails required, how do people use them, and how does that apply to speech-to-speech models? The answer is they're not required. There is a challenge here, and actually I talked about this on one of the first slides: latency is absolutely critical. So what you want to avoid are unnecessary turns. Obviously LLMs are amazing language processors, so if you had all the time in the world you could do hallucination checking against the LLM. There are other strategies to
handle this. One of the big things is
that we see because of the aggregate
nature of the context, it grows over the
course of the conversation. You actually
you can find better accuracy with the responses if you have more control over how you prompt the LLM. So this is a whole topic and talk in itself. What we found is there are two ways to handle this, well, at least two. One: a lot of conversations are going to be task oriented. So let's say something
simple like a restaurant reservation
bot. It may have to take your name, get
your time, log the time to a database.
You can chunk that out, even that small
conversation into just discrete tasks.
And LLMs are really good at uh following
the most recent input. So if you kind of
feed it task by task, that helps. Also,
if you control the context window, like
the size, it can really be beneficial to
kind of manage that really judiciously.
So you could either reset um one example
might be let's say you're building a
patient intake bot at a doctor's office.
The very first thing it may do is verify the date of birth, which serves no utility beyond just that very first checkpoint. So you may actually remove that from the context, completely get it out of there, because otherwise it's just cruft that hangs on, and instead you kind of reset and then maybe roll through the tasks. You could
also for really really long
conversations summarize the the context.
So you may want to do an out-of-band LLM call. This is something Quinn and I actually talked about internally: we're going to see more and more of this mixture of LLMs, where even in the context of real time you may have an out-of-band REST call to a text-based LLM just to do a summary and then return it back, so that you can kind of compress that context window.
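A sketch of what that out-of-band summarization could look like, using the google-genai SDK as the text LLM; the model name, message shape, and the "keep the last few turns" policy are illustrative assumptions.

```python
# Sketch: compress a long conversation by summarizing older turns with a
# separate (non-realtime) Gemini call, keeping only the summary plus the most
# recent turns. Model name and message format are assumptions.
from google import genai

client = genai.Client(api_key="YOUR_GOOGLE_API_KEY")

def summarize_context(messages: list[dict], keep_last: int = 4) -> list[dict]:
    older, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Summarize this conversation in a few sentences:\n{transcript}",
    )
    summary = {"role": "user", "content": f"Conversation so far: {response.text}"}
    return [summary, *recent]
```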
And just to give a call out to Google, the Live API, so maybe transitioning there, they offer context management through a bunch of different strategies, like a rolling or sliding window; I think they offer token caps for that. So you can have some control, or if you want you can output text and then do whatever you want with it. They also take text input. So there's a lot of flexibility with speech-to-speech models. They do pose some other development challenges, but they offer tremendous benefits in terms of features. All right, I'm gonna hold questions just for a sec
because I've heard the Wi-Fi is back. Can folks try downloading the repo again? There's a Slack channel. Yes, Quinn, can you maybe come to the mic, I can't remember what you told me. Workshop voice gemini pipecat, with dashes in between, on the AI Engineer Slack. Okay, does everyone know where that is? I know this is like day one, hour three. workshop-voice-gemini-pipecat is the channel name. So if you go to the AI Engineer Slack and search for workshop, it should come up
as
you raise your
Can people join like do they have do you
guys have access to the Slack AI
engineer Slack?
Oh yeah, there's like Yeah.
There should be the join a Slack group.
All right. I'll take that one, then I'll walk over. If I could, let's just try this: maybe listen and also try to get the repo. I'd like to walk through the quick start and then we can look at some examples.
You know, it's something Quinn has actually done a ton of work with; I have not personally. I mean there's
obviously massive latency benefits
because you cut out network round trips
all over the place. So you save a ton
and depending on where you are in the
world that can save a lot if you're
US-based. You know your network latency
is going to be relatively low. But a lot
of the developers in the community are
in Europe and a lot of these AI services
are relatively new with you know data
centers only in the US. So there are different challenges when doing that. When running it, there are great hosting providers like Modal that offer really good options for running your own local LLM, where you're buying, or I guess leasing, the GPU time, instead of running everything on a GPU yourself, which would probably be cost prohibitive because of the way the processes run. But that is also a talk in itself. Um, next
question.
Sure.
Oh yeah, definitely.
Well, the large context windows
definitely cause LLMs to process slower.
Um, yeah, I'm sorry. The question was
around state management with um with
LLMs and whether it's better to I guess
chunk or have more kind of deterministic
input versus just a large context where
you just dump everything in. This is
actually an extension of what the other
gentleman was asking. Um the idea being
that, uh, and actually at Daily we built the chat widget that's on the homepage of the voice AI World's Fair page; I personally built that. What was interesting there, and Aleix and I were just talking about this, is that function calls in the context of real time are still slow, unfortunately, like too slow. Actually, I'll give more props to Gemini for being maybe one of the fastest. If you run a basic local demo, one of these demos, and you ask it what the weather is, it will return back with a time to first byte of, I don't know, less than 500 milliseconds, whereas other vendors, not trying to throw OpenAI under the bus... But it has
gotten slower; you might see upwards of 1.5 to two seconds of waiting just to get that first token back. And we dug into this, actually, just this morning. The issue is when
you get the normal streamed response for
the conversation you can start playing
that audio out once you get the first
sentence. The issue being when you get a
tool, you need the entire JSON response
before you can actually do anything with
it. So that's slow. That's one part of it. Separately, chunking the prompts is absolutely the way to go. In building that World's Fair bot, I was kind of balancing between the two worlds, because you can talk to it and ask it about speakers for the sessions. Route one would have been: let's use a RAG approach and put all of the speaker JSON in something that could be a tool that could be accessed. I tried that, and unfortunately it's just a little too slow. It's a giant context.
It takes a while to come back. What's
interesting is, if you instead move that all directly into the context with Gemini Live, it's a little bit variable, but under good conditions you'll get a response back at that roughly 800 millisecond latency. So it actually has
access to the full context. The one tradeoff, though, and it's what this gentleman over here was asking about, is accuracy. It's going to get confused, especially when you get a JSON with a lot of speakers, and this is all LLMs, it's not specifically Gemini: with that type of data, even structured data, it becomes very hard to discern what's what when a lot of it looks the same. So a lot of this is emerging, like other things in AI, trying to just do it as fast as possible with
voice. Before I take one more question,
I want to check in: have folks had any luck with the repo? Got a thumbs up.
All right, I'm going to pause on
questions for the time being. We can
take them at the tail end. I do want to
try to go through the quick start if we
could
voice.
Okay, great. So, I would recommend maybe we just take a few minutes of independently getting set up. If you all go to the readme in the Gemini Pipecat workshop repo, and if you'd roll through the first few steps,
maybe get a I don't know, a hand up. You
could just flash it up real quick so I
could see when people start to get
through it. You don't have to hold it
up. If you don't mind if you brought a
device. Okay, we've got one,
two.
Ah, so for this workshop, I have leaked
my key, a key from one of my accounts,
which I'll cycle after this. So, you
don't need to sign up for a daily
account. You do need to sign up for a
Gemini account. I don't have a key I can
just give out. So, in the example env file, you'll see there's already a Daily key which you can use. Do people know how to sign up for Gemini? I have it in the readme. It's through AI Studio. Is that a good spot to do it? Okay. So,
All right, I'll take a question while
we're waiting.
We
So, in terms of Pipecat, that's a good question. In terms of the Pipecat interface, you interface with the Pipecat class. There's a service class within Pipecat, so how you write your application code is going to be uniform across all of the services. The individual providers haven't really, unlike the text-based LLMs, they haven't really settled on a standard. So Pipecat handles all that translation on your behalf, and I think there are other frameworks that do similar things. The idea is to provide a uniform, simple interface, so that, and this is part of the modularity, if you wanted to, you could swap this bot out for OpenAI Realtime, or, you know, a text-based model with a TTS and STT paired with it. That's kind of the whole idea. There is maybe
a little bit in terms of the one thing LLM providers, maybe, I don't know if anybody here can nudge anyone: the system instruction or system prompt is not uniformly dealt with. I don't know if that's well understood, but OpenAI has this system-role message that you can inject anywhere at any time, which is really fantastic, but Anthropic and Google require a named system instruction that's special, one-time, set at constructor time. So there are some differences. As best we can, we unify.
All right. Any
how are folks doing on getting quick
start going? See a hand. Is that a
question or Okay, I'll take maybe it's
related to the quick start.
Oh yeah, great topic.
Yeah. So there's a question about noisy environments, which is like voice AI's kryptonite. So with the VAD, no, not at the moment. But again, this is where Pipecat is the assembler of all things; it's what you want to plug in, and you can run it separately from the VAD. We found Krisp, which is a partner of ours; they have fantastic noise cancellation. You can run something outside of the loop that would actually clean up, in Pipecat it would be in the transport itself, so in that audio input it would take the audio input and remove any ambient noise. So like chip bags opening or dogs barking, but maybe more impressively, human background noise. It will remove that from the feed. So you could be in this conference and it picks up the primary speaker for the device incredibly well. At the moment, they're the only ones that I'm aware of that know how to do that and do it that well. I mean, it's phenomenal. But you're right, the VAD, I mean, it was one of the biggest problems that we saw until we found Krisp, and they're fantastic. K-R-I-S-P. So, big props to the Krisp team.
Well, we use all of them. Pipecat's open source. I think the options are Gemini Multimodal Live, OpenAI Realtime, and then AWS just launched a new one called Nova Sonic. So we have those three within Pipecat.
They're all very much on par, actually; well, they're all very similar. I don't want to nitpick on all of the vendors here, but they each have strengths and weaknesses because it's still an emerging field. But latency-wise, that is not an issue. Latency is fantastic for all the providers. Yeah.
All right.
Okay. Maybe Aleix will walk through a quick start. He's going to do some live coding here, and maybe this will help to understand how this all works, and then perhaps I'll stop chatting and you could grab me if you have questions and just do some heads-down working time. Yeah, I cannot type and... teamwork. Teamwork. Yeah. Uh, I'll try from the very... from nothing, from scratch. So this is a Python project. Um, actually let me try again. Yeah, this is a Python project. The first thing we'll do is create an environment, like a virtual environment, that is called... I like to call it env; that's how you create a virtual environment in Python. A virtual environment will have... what's that? Oh, it's hard to see. Um, how do I do that?
I only have one thing.
Maybe
the only issue is I only have one thing
but
better. Okay. Yay.
Okay. So, we'll start with... um, yes, it's Emacs. You can judge me. I'm not going to change; I have a few years of coding, so that's what it is. So I'm going to create a requirements file. Sorry, the first thing I did was the virtual environment; we did this. Then we're going to load it. Now we can start installing packages in this environment, Python packages. So the one we have to have is pipecat, of course. So I'm just creating a new file which has pipecat-ai, and pipecat has a few options. For this example we're going to use Daily, we're going to use Google, and we're going to use Silero; Silero is the VAD that we've been talking about. The other thing we will do is load the environment; there's a package called python-dotenv, just to load the environment variables that we will have in here.
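For reference, the two small files from this step and the next look roughly like this; the extras names match Pipecat's published optional dependencies, and the environment variable names and values are placeholder assumptions.

```
# requirements.txt - Pipecat with the Daily, Google, and Silero extras,
# plus python-dotenv for loading the .env file.
pipecat-ai[daily,google,silero]
python-dotenv

# .env - placeholder keys (variable names assumed; use your own key from
# AI Studio and the Daily key from the workshop repo).
GOOGLE_API_KEY=your-gemini-api-key
DAILY_API_KEY=the-key-from-the-workshop-repo
```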
Okay. Um, that's it. One file. Now I'm going to create the environment variables file. I'm going to just copy it from the quick start. It's just, oops, it's just .env, and we can take a look at it. These are the API keys; we'll get rid of them after this workshop. But yeah, you just need two keys: a Google API key, just go to AI Studio and create your own one, and a Daily API key. This one you can copy; it's a bit long, but, I don't know... Yeah, it's in the project. Yeah. Yeah. If you are able to go to the repo on GitHub, you can just copy it from there. All right. Next, we're going to create a Python file. Let's call it bot.py.
Again, empty file. First thing we'll do is the typical, I think it's called, if __name__ equals "__main__" block. No help from an LLM. And then we do asyncio.run, a little bit of help, asyncio.run, and we're going to run a main function, and then let's write the main function.
All right. So now we start. The first thing, we said, is we're going to need a transport. We're going to use a Daily transport. We're going to say I'm going to speak to the bot through the Daily transport; that's what I'm going to create. It's a transport for incoming audio from me talking, and then a transport for outputting what the LLM will say. So I'm just going to create a transport, and this is a DailyTransport, and it has a few arguments. It has a room URL; I'm just going to create mine for now. This is, again, a public room that I will have to delete. I don't need a token, and params.
Okay. And then the Daily params are like, what do we want this transport to do? So, we're going to say that we want audio input enabled, which means get audio from the transport, and we also want to send audio to the transport. Okay. All right. What else? It's complaining about something. Oh, yeah. We also need a VAD analyzer, which is going to be the Silero VAD analyzer. So the transport is going to be able to use this VAD analyzer to detect if the user has spoken or not. What's that? Oh yeah. And this is going to be AI Engineer.
There we go. Okay. Now we have the
transport. Now we're going to create the LLM. In this case it's Google, uh, Gemini Live. So I don't need to create a speech-to-text or text-to-speech; we just create the Gemini Live service, and for that I'll need to copy it because I don't know it from memory, but, oops, that's in this file, the Gemini bot. I just want to copy these lines here. There we go. All right. So this is my LLM, and again it uses this GeminiMultimodalLiveLLMService. I'm just going to add my import. Okay. And now it needs a couple of things: the system instruction, which is like what the agent is going to do, and some tools. The tools I'm going to skip for now. So let's do the system instruction; again, I'm going to copy it from somewhere. There it is. This is like the prompt, like the main prompt. Well, yeah.
All right. So, the system instruction is: you're a helpful assistant who can answer questions and use tools. For now, we're not going to use any tools. You know what? Let me get rid of the tools, just copy them here for later; I'm gonna comment this out. And you are just the helpful assistant. Okay, so that's all right. Um, so no complaints here.
All right. And now we just create a pipeline. I'm going to avoid storing the context because I don't think we need it for now. And this is the pipeline. The pipeline just receives a list of processors, or elements, and the first one is transport.input(), that's the input transport, how we get audio from the Daily room in this case; then the LLM; and then transport.output(). All right. Now, this just defines a pipeline, and a pipeline is also another processor, so you could build a pipeline of pipelines of pipelines of pipelines. So you can plug and play the way you like it. So how do you run a pipeline? You need a task, what we call a pipeline task, that receives a pipeline. And the pipeline task also has some params, which are called pipeline params, and we're going to say, oops, that we allow interruptions. And I think that's enough.
And how do you run a task? You can create more than one pipeline task if you wanted. In this case, we just have one; usually you just have one. You're going to create a runner, and guess what, it's called a pipeline runner. It's PipelineRunner, and then we just do await runner.run(task), and some completions please, PipelineRunner. All right, and I think that's it. We'll try it. Oh, os, import os. I think there are no more warnings. Oh, and I need to load the environment variables, which is this line here, load_dotenv; just copy it from another file.
Okay.
And where do we get load_dotenv? This is just a function that imports the environment variables.
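Putting the whole live-coded file together, a sketch of the finished bot.py looks something like this; import paths and parameter names follow current Pipecat conventions and may differ slightly from the version used on stage, and the room URL is a placeholder.

```python
# Sketch of the bot assembled during the live-coding session. Module paths
# and parameter names are assumptions based on current Pipecat conventions.
import asyncio
import os

from dotenv import load_dotenv

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

load_dotenv()  # pulls GOOGLE_API_KEY / DAILY_API_KEY from .env


async def main():
    # Transport: how audio gets in and out (here, a Daily WebRTC room).
    transport = DailyTransport(
        "https://your-domain.daily.co/ai-engineer",  # placeholder room URL
        None,                                        # no token for a public room
        "AI Engineer Bot",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # Speech-to-speech LLM: Gemini Live handles STT, LLM, and TTS in one box.
    llm = GeminiMultimodalLiveLLMService(
        api_key=os.getenv("GOOGLE_API_KEY"),
        system_instruction="You are a helpful assistant who can answer questions.",
    )

    pipeline = Pipeline([
        transport.input(),   # user audio from the Daily room
        llm,                 # Gemini Multimodal Live
        transport.output(),  # bot audio back to the room
    ])

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
```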
All right. And yeah, let's try... just going to open the, oops, just going to open the terminal here, and I'm just going to run it. I think we called it bot.py. Uh, no module. Oh, maybe I need to install the requirements; I forgot this step. There it is. Okay. So, at the beginning I wrote that requirements.txt file which had just a few requirements, but I forgot to install them.
In the meantime, I'm just going to go to
the daily room that I just pointed the
bot to
uh
video.
Okay.
So that's uh right now it's just me in
that in that room. So and now we just
have to wait for this to to finish and
hopefully the bot will join the room and
we'll be able to talk to it. Hopefully.
Yeah.
It is, yeah; this is because we're using the Daily transport, and the Daily transport just connects to a Daily room. But you could have a websocket transport and then use Twilio with a phone number, with Twilio connected to that. If we have time, we can even try that. Um, so, I think, the small WebRTC transport, do you want to talk about that? Oh, yeah. I'll wait.
So in Pipecat we have also added, based off of the aiortc Python package, which is a WebRTC package in Python, a new transport called the small WebRTC transport. It is peer-to-peer WebRTC communication that's free, so it's separate from any vendor. Though the one downside is that it requires a TURN server, which you bring your own. We weren't prepared for that for the conference, and also just the conference Wi-Fi makes that a little challenging. But normally, if you're running any of the, we call them foundational examples in Pipecat, think of them as the essential examples that show how to do very specific functions, there are probably about a hundred of them in Pipecat, one by one they show you how to, like, record, or add an STT, or push frames, or show images, or sync images and sound. Those all use the peer-to-peer WebRTC transport. So we would have loved to have used that, and you wouldn't need a key, but unfortunately firewall rules have trumped that.
All right. Uh so I'm just running the
bot and see how it fails because it has
to fail the first time.
Yeah, this is the wait here, because there's a bunch of Python packages and Python just decides to take some time to load them. But you see how easy it was to write an agent, like a voice agent, with Gemini Live, if it worked. It's just a few lines of code that we wrote in, I don't know how long it took me, but maybe like five, ten minutes. Um, yeah. Are there any questions on the example, or... what's that? It worked? It worked. The bot works. Okay. All righty.
All right. Nice.
Yeah.
Uh, we have, I don't know the actual number of customers, but I mean Pipecat probably serves hundreds of thousands of calls a day. I don't know, a lot. Quinn, you probably have a better idea. Oh yeah, Pipecat is used by some very large companies in production, and people are contributing to it from Nvidia, AWS, OpenAI, Google, lots of big companies.
Yeah, there's one thing we didn't mention about Pipecat, which is that what you see now on the screen runs on the server side, but we do have client SDKs for Android, iOS, JavaScript, and React. And I think that's about it; but there's even a C++ client if you want. So yeah, that's the server side, but you can plug in your client and connect to the agent on your phone. WebRTC connections? That depends on the transport you use. But yeah, you could have your client connect to a Daily room, or we support LiveKit as well, but to a Daily room. We like Daily because we work at Daily, but you can connect to a Daily room and then the bot, or the agent, would connect to the Daily room as well. And then that's the transport, the WebRTC transport. Yeah,
I think there were questions there. I
can see
Uh say it again.
For the previous version, um, I just hacked together a thing that I call release evals, which is a bot talking to a bot. What it does is, I put this bot up and it joins a Daily room, and then I have an eval bot, and the eval bot has a prompt which is: ask a simple addition. Okay. And then that eval bot is going to connect to the room and ask, what is 2 plus 2, and then the other bot is going to reply, 2 plus 2 is 4, and the eval bot, the LLM, checks if the answer of the user is correct, and the user in this case is another LLM. So it verifies it end to end. The good thing is, we used to run more than 100 examples every release just to make sure they work, and I just got tired of it because it's very painful and very slow. So we have these release evals that are going to test each service; we test Gemini Live, we test Cartesia, Deepgram, all the services, end to end, and then the bots basically talk to each other with voices. That's the nice thing. So yeah, okay. Oh, is this on? Maybe real quick, just to show, for those that didn't see it. Hey, can you tell me a
Why don't scientists trust atoms?
Because they make up everything,
which
if you build Pipecat, I've heard that joke probably, I mean, not like 5,000 times. You can even try to seed it with something different, but it's still... I will try to come up with a new joke. Yeah. Why do you always say the same joke?
That's an interesting question. Why do
you think people keep telling the same
jokes?
I don't know. You tell me.
From my search, there appear to be
several reasons why people repeat jokes.
Enjoyment and reinforcement. People
repeat things they find funny because
they want to reexperience the good
feeling associated with laughter. It
feels good. And repeating it is a way to
try and recreate that sensation. Memory
aid. Okay. Okay, that's enough.
Is there anything else I can help you
with today? No, thank you. That that was
it. Thank you so much. Well, just to
show there was a question about
interruptions. We could just have it
like my favorite is to ask it to tell
you like a really long story and then
interrupt it. So, can you tell me a
really long really long story?
Okay, I can do that. Here's a story and
feel free to interrupt whenever you
like. Once upon a time. Okay. Actually,
tell me that new joke. Quit following the clues. Hey, tell me that new joke. Why don't scientists trust atoms? Because they make up everything. Okay, here's one.
That was the same joke.
All right. All right. How was it?
Oh.
So I think the questions asked during
this workshop could map out like years
of work. So this is like another one of
those fantastic cutting edge things. So
again back to like human evolution. We
all know and when we talk actually it's
even hard for humans to talk to not
speak over each other. So the way that
it works mechanically is, when the user stops speaking, the VAD has a timeout. You tell it and program it: wait, let's say, one second, 0.8 seconds, half a second, whatever feels natural, and you're trying to balance low-latency response with giving the user enough time to speak. It's a really hard thing, and one of the biggest complaints is that agents will speak over the human. So, let's say you're building an interview bot, like you're using Tavus, one of their digital twins, you want to have a real likeness and you want to speak
to it. you may take time to think
because sometimes you have to take time
to think and that's a really difficult
thing for bots to do because again it's
driven by like a simple stop speaking
algorithm. So this is a new uh I guess
it's like an emerging uh field with
models which is looking at semantic uh
end of turn. So driven off of things
like um
like speech filler words, pauses, uh
intonation, so things in the audio realm
and also things in the textbased realm.
So just looking at context. So we've actually started, we're one of many that are doing this, I think, but we launched a model; if you look at it on GitHub, it's under smart-turn. It's a native audio-in classifier that runs an inference on the input audio, and it simply outputs either complete or incomplete. And the way Pipecat uses this is that if you get an incomplete response, we can dynamically adjust the VAD timeout. So we can tell the Pipecat bot, okay, he or she is not done speaking, let's give three seconds to complete the thought, and if it's not done then the bot will actually respond. So you can create a little bit of dynamic interaction there. That's one of the first things. I'm sure the Google team is working on similar things. I know OpenAI is, and all the STT vendors are also looking at their own things. So, I'd say right now it is very
much an unsolved problem, but I would
imagine given how fast things are going
in the next 12 months, we'll have great
solutions that will make it even more
natural to talk to a bot.
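For context, wiring a turn classifier in looks roughly like pairing it with the VAD on the transport. The analyzer class name and the turn_analyzer parameter below are assumptions about how Pipecat exposes smart-turn, so treat this purely as a sketch and check the pipecat-ai/smart-turn repo for the real wiring.

```python
# Sketch only: pairing the Silero VAD with a semantic end-of-turn analyzer.
# Class names and the turn_analyzer parameter are assumptions.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.turn.smart_turn.local_smart_turn import LocalSmartTurnAnalyzer
from pipecat.transports.services.daily import DailyParams

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    vad_analyzer=SileroVADAnalyzer(),        # fast "is someone speaking" signal
    turn_analyzer=LocalSmartTurnAnalyzer(),  # complete / incomplete classification
)
```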
It's a good question.
Any more questions? Any questions?
like there.
Oh yeah, yeah, yeah. One, uh, this is actually, well, this is specific to Pipecat, but also, like, Gemini Live will output audio and text, and other speech-to-speech LLMs do this. What Pipecat offers, in terms of its orchestration role again, when you get a... actually, it's going to be specific to the TTS provider. There are great TTS providers that do word and timestamp synchronization, so they'll give pairs, they call them alignment pairs. So if you're using a Cartesia or an ElevenLabs or Rime, they all output these pairs. One of the really cool things with Pipecat is that the TTS services output not only the audio stream but also the text stream. So they'll output text frames, TTS text frames we call them in Pipecat.
And in terms of how the client software works, there is an observer role, a process that can watch things that happen in the pipeline and emit events. So we've instrumented that for the clients, so that whenever those text frames move through the transport you can get synchronized word and audio output. So in your client, if you wanted to have word-by-word output synchronized to the audio, you can do that with Pipecat, and it's as simple as adding an event listener; I think you listen to something like bot TTS text, and it will give you the synchronized output.
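On the server side, those frames are ordinary Pipecat frames, so a tiny pass-through processor can watch them. A sketch, assuming the TTSTextFrame name from Pipecat's frame definitions:

```python
# Sketch: log TTS text frames as they flow toward the transport; this is the
# same signal the client-side "bot TTS text" event is built on. Frame and
# base-class names are assumptions based on Pipecat's frame definitions.
from pipecat.frames.frames import TTSTextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TTSTextLogger(FrameProcessor):
    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TTSTextFrame):
            print(f"bot is about to say: {frame.text}")
        await self.push_frame(frame, direction)

# Place it in the pipeline between the TTS service and transport.output().
```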
Fully offline? Um, well, they're all doable. There are great models. I think it really depends on what your bot needs to accomplish. A lot of the state-of-the-art models that do all the best and smartest things are going to be run on-prem or in the cloud. But a lot of bots do simple jobs; if you wanted to build, like, a restaurant reservation one like I referenced earlier, it's a very simple job. You could probably run it with some version of Llama running locally. There are great local models; again, Quinn has been experimenting with a lot of great local models. Like, uh, Whisper has challenges, you know, it has some challenges as an open source model for STT, but there are good and emerging TTS services. So, you know, things are only as good as the input, and we've actually seen this with some of the speech-to-speech models, that sometimes they mistranscribe. So you really need... I mean, every part is critical, but if you can't transcribe the speech really well, nothing really matters. It has to understand you. And having disfluencies or hallucinated responses or even just inaccurate responses kind of breaks everything down. So things mostly start at the STT. So maybe that's the hardest. I don't know if there are a lot of good open source options for that right now. I don't know. We're not doing anything in the STT world. No, no, no. That's a whole different ballgame.
Good question though. Just made me realize we could have used local models and avoided this. We could have. Yeah. Well, we're partnering with the Google team. Yeah. Yeah. As a... Yeah, exactly. Yeah.
Has anyone looked at any of the sample
projects and had questions? There's a
lot of interesting things there. If any
of this interests you, we do have a Discord; you're welcome to get on it. You can find us at pipecat.ai and find our Discord there. You can ask
questions. Um, there's some really cool
stuff with Gemini that can be done
there. Uh, in particular in the Pipecat
repo, we built uh, I don't know if you
know the game catchphrase where you
describe a word and something, you know,
guesses it. We built a version of that.
We had to brand it something else called
word wrangler and you as the human, you
describe a word and then you have the AI
agent try to answer it. So, we built a
client server version of that which I
linked to in the repo. And then we have
one that's a phone based one that's I
think particularly sophisticated and
interesting. So you might think like how
the hell would I build this with a
speech-to-speech model? We actually use two Gemini agents in the same call, and we use a parallel pipeline where one agent is the host, giving out the questions to the human user, and the other is the guesser, and we limit the audio flow so that the guesser, the AI player, can only hear the user. So there's a bunch of really interesting stuff there, getting majorly into the weeds of some of the powers of Pipecat, but it also speaks to the strength of having native audio input being really helpful. So I'd recommend checking those out. Really cool, easy demos to run. One's Twilio; the other is, again, client server. I think it's like a React Next.js project.
What's that?
Word wrangler.
Yeah, I mean we could run the word
wrangler client app. It's actually just
on the web
test.
Welcome to Word Wrangler. I I'll try to
guess the words you describe. Remember,
don't say any part of the word itself.
Ready? Let's go.
Uh I'm going to skip to something
easier. Okay. This is something you take
pictures with. It's on your phone.
Is it camera? All
right. This is a field uh related to the
study of languages, I think.
Is it linguistics? All right. This is a
game of the yellow ball you play with
rackets. Hit the ball over the net.
Is it tennis?
All right. This is a a round dessert
with chocolate chips sometimes and other
fun goodies.
Is it cookie? It's really good even when
I'm bad at giving answers. So, pretty
cool. This is with built with Gemini
live. But again, just an example of
things you can build with uh
with voice AI. So cool, unique
interactions.
All right, I think that's about it.
Thanks everybody.