Full Workshop: Realtime Voice AI — Mark Backman, Daily
Channel: aiDotEngineer
Published at: 2025-08-03
YouTube video id: nxuTVd7v7dg
Source: https://www.youtube.com/watch?v=nxuTVd7v7dg
I'm Mark with Daily. This is Aleix. And then we have a few other Daily folks: Quinn, Nina, Varun, and then, I'm not sure where he went, but Philip right there from the Google DeepMind team. So this session, we're going to spend just a few minutes getting everyone started. The idea here is a hands-on workshop where all the folks I just called out are going to be available to help out. We'll walk you through a quick start to get you up and running, and then the idea is to build something — build a voice bot in the next 78 minutes and 12 seconds, or whatever time we have left. The one consideration is the Wi-Fi. If you don't have good Wi-Fi, you might want to try to tether. I was able to tether and it worked fairly well, but the conference Wi-Fi was a little shaky. This is real time, so you will be streaming data. It does require a viable connection, not just sending a few bits over. So just a heads up if you hit that as a snag. Before I get started: who here knows about Pipecat or has built anything with voice AI? Okay, a smaller audience. Has anyone built any real-time applications with LLMs or AI? Maybe slightly bigger. Okay, great. So Pipecat is an open source repo. It's a Python framework for building voice and multimodal AI agents, and it's built by the team at Daily, but it's an open source project that anyone can contribute to. It's been around for just over a year now — officially, Pipecat launched around March 2024, so about 13 months. There we go. So, a quick walkthrough just to ground everyone in the thinking around voice AI. These slides weren't built for this talk, but I'm going to use them. Voice AI and real-time applications are tough because we as humans communicate with each other all the time — thousands, tens of thousands of years of evolution baked into our brains. So it's pretty tough to make a machine work on the same level, and as users we have great expectations of it. You need a good listener, something that is smart and conversational. You need to be connected to data stores. It has to sound normal, natural — think back to what voice bots sounded like even two or three years ago, and what many of them still sound like if you call them on the phone. It needs to sound natural, and actually, kudos to the Google team: the latest Gemini Live native audio dialog is quite good in that regard. It has to be fast. The whole end-to-end exchange needs to happen in roughly 800 milliseconds — that's kind of the benchmark. You could strive for better; on the human level it might be 500 milliseconds or somewhere on that order. So it is pretty fast, and there's a lot to get all the way there. This is something everyone at Daily building Pipecat has been working very, very hard on: meeting all these expectations. So, to ground you in some of this, since we're going to be working in Pipecat: Pipecat has a pipeline. Aleix, maybe you want to talk a little bit about the origin of that quickly? Sure. You can think about it as a multimedia pipeline, and you would ask, what is a multimedia pipeline?
It's basically boxes that receive input — input could be audio or video — and those boxes stream the same data, or modified data, or new data to the following elements, or processors; in Pipecat we call them processors. So in Pipecat you would have a pipeline where you have a transport, which is the input of your data. For example, when you're talking, you could be talking in a meeting, so that would be the audio of the user. Then you would have another box following that, which is the speech-to-text service. The speech-to-text service grabs audio from the user and transcribes it; then you get text, which is the next data that goes through the pipeline. Then the next one is the LLM. So now the LLM has what the user has said, and it generates output — whatever the LLM produces, which would be tokens. Those tokens are then converted by text-to-speech, the text-to-speech outputs audio, and the audio goes back to the transport so you can hear what the LLM has said. Today, with Gemini Live, a lot of those boxes go away, because the LLM will do a bunch of these things: it will do transcription, it will do the LLM, and it will do the text-to-speech in one of these boxes. But you might still require, for example if you want to save the audio or record it into a file, a bunch of utilities to do that. And Pipecat has all that built for you. So basically, that's it. A lot of this is really orchestration. If you think about what Pipecat offers, it's orchestration. It also offers abstractions for a lot of common utilities, like Aleix said: recording, transcript outputs, artifacts you might want to produce, or ways you might manipulate the information in the pipeline itself. So this image here, which is what Aleix just talked through, is what you would call a cascaded model, where you have this flow-through of information. You can build with Google and many different services in this way. In the last year there's been an emergence of speech-to-speech models that take audio in natively and output audio natively. Those models also allow for audio in and then optionally text and/or audio out. So you can, for example, take a raw microphone input, the model runs all of its logic, and you can opt to have it output text if you want to, say, parse the text output before speaking. There are a few different demos we'll look at that offer that, and in Pipecat we show all the ways to do things, because that's what Pipecat offers as a value proposition. One thing I don't think you'll mention in the slides: all these boxes, you can plug and play whichever service you want in Pipecat. The speech-to-text could be Deepgram, for example; the LLM could be Google or OpenAI or whatever. You can just plug and play any service you want, right? Yeah, the modularity is the other big strength. You can change out a service without changing your underlying application code, which makes it easy.
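To make the cascaded pipeline concrete, here's a minimal sketch of what that chain of processors looks like in Pipecat. The service names, constructor arguments, and import paths are assumptions based on recent Pipecat releases and the vendors mentioned in the talk (Daily, Deepgram, OpenAI, Cartesia), so check them against your installed version:

```python
import os

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

# Transport: how audio gets in and out (here, a Daily WebRTC room).
transport = DailyTransport(
    os.getenv("DAILY_ROOM_URL"), None, "Cascaded bot",
    DailyParams(audio_in_enabled=True, audio_out_enabled=True,
                vad_analyzer=SileroVADAnalyzer()),
)

# The individual "boxes": speech-to-text, LLM, text-to-speech.
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o")
tts = CartesiaTTSService(api_key=os.getenv("CARTESIA_API_KEY"), voice_id="YOUR_VOICE_ID")

# Context aggregation collects user and assistant turns into LLM messages.
context = OpenAILLMContext([{"role": "system", "content": "You are a helpful voice assistant."}])
context_aggregator = llm.create_context_aggregator(context)

# Frames stream from one processor to the next, top to bottom.
pipeline = Pipeline([
    transport.input(),               # audio in from the user
    stt,                             # audio -> transcription
    context_aggregator.user(),       # add the user's turn to the context
    llm,                             # context -> response tokens
    tts,                             # text -> audio
    transport.output(),              # audio out to the user
    context_aggregator.assistant(),  # add the bot's turn to the context
])
```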
And we see a lot of companies building for voice AI might have an even more complex setup. The pipeline here runs straight down, but you can actually have split branches, where one leg runs some logic and another runs something different — we call that a parallel pipeline. So if you want a failover, if vendor A goes down, you can move to vendor B dynamically, even within the same conversation. That's something Pipecat affords as well, and it can allow you to transfer context over. So, a lot of really cool stuff. The goal again today is to get you familiar with building a voice agent, and to build one to get started. One of the cool things, like Aleix pointed out, is that with the cascaded models there's a lot of complexity, but with a speech-to-speech model things get dramatically simplified. Your code may have looked something like this — this is an old example with a ton of orchestration in the pipeline — but with a speech-to-speech model you may be able to simplify it down to this. But then you have to remember you still need orchestration around it. So it does get simpler to some degree; it's more about the services you interface with. I think with that, why don't we transition now, because I'm realizing we have only about 70 or 75 minutes left, to the actual activity for today. All right. So there's a public repo — I don't know how big or small this looks on screen — and it's under daily-co. So on GitHub: daily-co/gemini-pipecat-workshop. We'll give everyone a chance to make sure the internet's working. Can everyone see the repo? Yeah, it's really tiny. I should have it in big text somewhere. There we go. All right, let's take a look at this. Aleix and I spent a little bit of time writing up this repo. It's meant to be a jumping-off point. I'm going to get you oriented, then I want to look through one of the bot files — the main Pipecat code — with you, and then we'll break and make this an interactive session where we can answer a bunch of questions. So in the repo, you could either start doing it now or take a pause, but this will give you the steps to walk through getting the quick start running. Before we do that, I want to take a moment here — let's see how the Wi-Fi is doing. That's not a good sign. It's down. Well, hey, it is tough to do Wi-Fi for this many people. Instead of real time, it's going to be real slow. Real slow voice communication. All right. So, here is the gemini_bot.py file. This is all Python — everything will be in Python. There will be some client code options, which we'll look at in a second. Just to orient you, I'll jump right into the meat of the pipeline. We have this main function that runs your bot. Everything runs encapsulated within an aiohttp session; we're going to pass that session around. That's more just the mechanics of things in our pipeline. Let's jump to the simple part here: we'll have Daily as the transport — Daily, as we are a WebRTC provider as well as building Pipecat. And there will be context aggregation.
One important note: every turn of the bot is a discrete point in time — this is maybe less so for a speech-to-speech model, but basic LLMs get discrete inputs; everything is like a REST API call. So the LLM gets a snapshot of the conversation. The context aggregator collects all the bits of the conversation, both from the user and the assistant, and puts them into a form the LLM can handle. In this case it's more for function calling and general management. Gemini is amazing because it offers a lot of this for you, but if you were to build with, say, Gemini — not Live, just the text-based LLM — you'd have to have this context aggregation, which then goes to your LLM, which here is Gemini Multimodal Live, and then the output goes back through Daily on the other side of the transport. So you have Daily; we configure our service, which takes a number of arguments: you set up a room with a token, give it a name, and then set some properties. There are docs, which are linked in the quick start. There's also a GeminiMultimodalLiveLLMService, a Pipecat class that wraps the Gemini Live API. Again, you just initialize this and run with the LLM. You see we do a few special things: we define tools. This one has just two really basic, canned functions. Fortunately we're not calling out to the internet, because it's not working very well. So this one just has a dummy fetch-weather function and one that gives you a restaurant recommendation. These are two handlers that, when your function is called, just return this result information. Then we have the actual functions themselves defined in this function schema. You can use native function definitions in whatever LLM format you like, but we also created this FunctionSchema, a universal schema that lets you define functions and move between any LLM without having to transform your definitions from OpenAI to Anthropic to Gemini to Bedrock — or Groq — because they all have slightly different formats. So this is a universal transform for that, and the definitions are collected and translated into the native format in this tools schema. We pass the tools to the Gemini service, and that's how it gets access to run those tools. There's also a prompt above, which in this simple example just says: hey, you're a chatbot, you have these tools available, and that's that. We're also setting up our context aggregation, which — for better or worse — uses OpenAI as the default, the lingua franca for context. So everything gets folded back into the OpenAI format at a certain level. And we define the pipeline. The pipeline is, again like Aleix said, just a list — or I guess a tuple — of all of your different services running in the pipeline, and you can write your own. So if you instead want to make the LLM output text and extract information from that — maybe it encodes some XML or some other structure — you can extract it, store it, and maybe your application does something with it. Or maybe you want to inject a text-to-speech type of frame yourself.
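Stepping back to the tool definitions for a second, here's roughly what the provider-agnostic function schema and the two canned handlers described above look like. This is a sketch: the import paths and especially the handler signature differ across Pipecat versions (newer releases pass a single params object instead of positional arguments), so treat the details as assumptions.

```python
import os

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService

# Provider-agnostic definitions; Pipecat translates these into the native
# tool format for Gemini, OpenAI, Anthropic, Bedrock, etc.
weather_function = FunctionSchema(
    name="get_weather",
    description="Get the current weather for a location.",
    properties={"location": {"type": "string", "description": "City name"}},
    required=["location"],
)
restaurant_function = FunctionSchema(
    name="get_restaurant_recommendation",
    description="Recommend a restaurant in a location.",
    properties={"location": {"type": "string", "description": "City name"}},
    required=["location"],
)
tools = ToolsSchema(standard_tools=[weather_function, restaurant_function])

# Canned handlers that return dummy results -- no network calls needed.
async def fetch_weather(function_name, tool_call_id, args, llm, context, result_callback):
    await result_callback({"conditions": "sunny", "temperature": "72F"})

async def fetch_restaurant(function_name, tool_call_id, args, llm, context, result_callback):
    await result_callback({"name": "The Golden Fork", "rating": 4.7})

llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="You are a helpful chatbot who can answer questions and use tools.",
    tools=tools,
)
llm.register_function("get_weather", fetch_weather)
llm.register_function("get_restaurant_recommendation", fetch_restaurant)
```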
To inject speech yourself, you'd separate the LLM output so it's just text rather than audio, and then you'd add a text-to-speech service here, or write your own processor — though maybe not within the next hour. And then lastly, many of these services have events they emit. The transport emits handlers for when the client connects and disconnects. In this case, we use this line here to inject a context frame into Gemini to kick off the conversation. So when your client application connects, it's going to queue a frame — frames being the base format for information; think of a frame as an object for your pipeline. You're going to queue one of those frames, and what this function does is grab the latest context. The one we set above, I think, just says hello; that gets pushed into the pipeline, and it makes its way to Gemini to initialize the conversation. The rest of this you can think of as boilerplate to run it: you create a runner, which is the thing that actually runs your task, and the task is what runs the pipeline. Maybe beyond today's scope, but just know that's something required to run your code.
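For reference, the connect/disconnect handlers just described look roughly like this. It's a sketch that assumes the transport, task, and context aggregator from the walkthrough are in scope, and the exact event name depends on the transport (for the Daily transport it may be "on_first_participant_joined" rather than "on_client_connected"):

```python
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Push the latest context (e.g. a "say hello" message) into the pipeline
    # so it reaches Gemini and the bot speaks first.
    await task.queue_frames([context_aggregator.user().get_context_frame()])


@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
    # Tear the pipeline down when the user leaves.
    await task.cancel()
```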
Okay, I'm going to pause here just for any questions, because I do want to get to developing soon. [Question: websockets or WebRTC?] We provide both. That's a whole topic that's a talk in itself. The short answer is: if you're building a client-server app, you should — with strong emphasis — use WebRTC. It has a whole bunch of properties that are relevant, like error correction, better audio quality, and so on. If you're building server to server — one of the options today is to build a phone chatbot — you can use websockets, and it's probably the best option. You can also bring your own transport; in Pipecat there's a FastAPI websocket version, a server you can use to exchange messages over a websocket. So it's really up to you. The takeaway: if you're building client-server, you really want WebRTC — you could technically use websockets, but you'll hit a long tail of errors when you get to production. Server to server, you're going to be fine with websockets. Could anyone download the repo, by any chance? Did anyone not? One person downloaded the repo. Three, four. Wi-Fi is dead. Can you tether? Is that an option? Do folks have hotspots on your phones? Even that's shaky. Oh jeez. Okay. I can dance — it's not really my thing, but I could try. I guess I could just make this a code walkthrough, or we could write it live. [Question about voice activity detection.] Yeah, Gemini does have its own VAD. In fact, if you're using a speech-to-text service, that likely also brings its own VAD to the equation. Maybe to use some of our extra time while we're having internet issues: the VAD serves the really important purpose of detecting when a user starts speaking. In the whole life cycle of a turn, the user speaking ushers in the user's turn in the conversation. So Pipecat will emit a user-started-speaking frame, and that will also push through an interruption — the user will interrupt anything that's talking. If the bot was speaking, it basically clears the way, because the user has expressed they want to speak. The idea with the VAD is that we want it to be extremely accurate and extremely fast. So we run something on-device; we recommend Silero, which is an open source option. It works incredibly well — I don't know the inference time exactly, but it's extremely fast, so you're going to get an event back quickly. In fact, you have the ability to tune how long it listens for human speech before that event gets emitted. The defaults in Pipecat are pretty good; there are maybe scenarios where you want to change that. But the VAD is a really important consideration. It's extremely low CPU consumption — Quinn has a great spreadsheet breaking down the full cost analysis of an agent, and really, the CPU cost is going to be extremely low. Interestingly, the TTS tokens, or characters, are the most expensive part by far. So when you think about it, running that local VAD gives you superior performance and there's not much of a cost hit — maybe a fraction of 1% to run a local VAD. You have all sorts of choices, but we find that to work really well.
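If you do want to tune the VAD, the knobs live where the analyzer is constructed. A sketch, assuming parameter names from recent Pipecat versions:

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams

params = DailyParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    vad_analyzer=SileroVADAnalyzer(
        params=VADParams(
            confidence=0.7,   # how confident the model must be that this is speech
            start_secs=0.2,   # speech required before "user started speaking" fires
            stop_secs=0.8,    # silence required before "user stopped speaking" fires
            min_volume=0.6,   # ignore very quiet audio
        )
    ),
)
```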
[Question about phone integration.] It's integrated with phone carriers, too. There are a ton of options — I haven't worked much with phone; Varun's been our phone expert. Varun is a figure in the WebRTC community, an author of many things, and he's also a phone expert. Phones are super complicated in how you actually make calls, and Pipecat supports all of the approaches. A very quick list: you can make a websocket connection with a phone provider like Twilio or Telnyx or Plivo or Exotel and exchange media streams. That's a way to have a native websocket connection from Pipecat to Twilio: you'd call Twilio, it emits the websocket, and there's a handshake to get connected. You can also use PSTN — the public switched telephone network — which lets you dial in, and that's a different mechanism. There's also SIP, which is its own separate thing again, and all of the telephony providers support it as well. With SIP, you would call, say, Twilio, and it would call into a SIP provider like Daily, which offers SIP-enabled rooms, and you then bring the two together via that SIP connection. The nice thing about SIP is you get superior call control; it is slightly more complicated, whereas that websocket connection is instantaneous. Either way, your bot needs to be up and running — so there's a whole topic we're not going to cover today, cold starts for agents. They need to start immediately; if you don't have resources provisioned, you don't want your users waiting 20 seconds while the bot comes online. So a longish, medium-ish answer for a very complicated question. I think the next question was how Pipecat compares to Cartesia. Cartesia — well, I think they're going to do more stuff, but as of today it's a text-to-speech service. You can clone your voice, and they provide a real-time API: via websockets you pass text and they reply with audio, basically audio frames. Pipecat integrates with Cartesia like any other text-to-speech service, like ElevenLabs or anything else. Think about Pipecat as a framework for developers where they can plug and play the services they want: they can put Cartesia in, take Cartesia out, put ElevenLabs in. Oops — even closer — okay, now I can hear myself. You can plug and play any service you want: you can change the LLM, you can use Llama, you can use Anthropic; for text-to-speech you can use Cartesia or ElevenLabs; for speech-to-text you can use Deepgram; or you can use one box like Gemini Live, which has all of that built in for you. Is that clear? [Next question.] About what, sir? I'm not sure I understood — are you asking how to ensure the LLM says the right thing, or doesn't say the wrong thing? Well, Pipecat doesn't have control over that; it's up to you to define the prompt, or define how the LLM will reply. If you want, you can write your own processors — those little boxes we talked about — and check what the LLM has said, for example, before it's spoken; put in some kind of real-time eval to make sure the LLM behaved. You could do that. That pipeline is very flexible, so you can put whatever you want there, in parallel or not in parallel. For example, Mark was talking about parallel pipelines: if you have video and audio at the same time — with Gemini Live you can do everything in one box, but say you don't have Gemini Live and you want one service that does video and another that does audio — you can have a parallel pipeline, which is like a tree: you have your transport input, and if it's audio it goes this way, if it's video it goes that way. And you can do things dynamically like that.
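Here's a sketch of one of those "little boxes": a custom processor that inspects (and here rewrites) the LLM's text output before it reaches text-to-speech. The blocked-phrase check is just an illustrative stand-in for a real-time eval, and the FrameProcessor API shown is based on recent Pipecat versions:

```python
from pipecat.frames.frames import Frame, TextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class SimpleGuardrail(FrameProcessor):
    BLOCKED = ("medical advice", "legal advice")

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        # Rewrite LLM text frames that trip the (toy) check; pass everything
        # else through untouched.
        if isinstance(frame, TextFrame) and any(b in frame.text.lower() for b in self.BLOCKED):
            frame.text = "I'm sorry, I can't help with that."
        await self.push_frame(frame, direction)


# It slots into a cascaded pipeline like any other processor, e.g. between the
# LLM and the TTS service:
#   Pipeline([transport.input(), stt, context_aggregator.user(), llm,
#             SimpleGuardrail(), tts, transport.output(), context_aggregator.assistant()])
```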
Next question. Sure. Okay, so the question was: are guardrails required, how do people use them, and how does that apply to speech-to-speech models? The answer is they're not required. There is a challenge here, and I talked about this on one of the first slides: latency is absolutely critical. What you want to avoid are unnecessary turns. Obviously, LLMs are amazing language processors, so if you had all the time in the world you could do hallucination checking against the LLM. There are other strategies to handle this. One of the big things we see is that, because of the aggregate nature of the context, it grows over the course of the conversation. You can actually get better accuracy in the responses if you have more control over how you prompt the LLM. This is a whole topic and talk in itself. What we've found is there are at least two ways to handle it. One: a lot of conversations are task-oriented. Say something simple like a restaurant reservation bot — it may have to take your name, get your time, log the time to a database. You can chunk that out, even that small conversation, into discrete tasks. And LLMs are really good at following the most recent input, so if you feed it task by task, that helps. Also, controlling the context window — the size — and managing it judiciously can really be beneficial. One example: say you're building a patient intake bot at a doctor's office. The very first thing it may do is verify the date of birth, which serves no utility beyond that very first checkpoint. So you may remove that from the context — completely get it out of there — because otherwise it's just cruft that hangs around; instead you reset and roll through the tasks. For really, really long conversations you could also summarize the context. You may want to do an out-of-band LLM call. This is something Quinn and I talked about internally — we're going to see more and more of this mixture of LLMs, where even in a real-time context you may make an out-of-band REST call to a text-based LLM just to do a summary and return it, so you can compress that context window. And just to give a call-out to Google: the Live API offers context management through a bunch of different strategies, like a rolling or sliding window, and I think they offer token caps for that. So you can have some control, or if you want, you can output text and then do whatever you want with it. They also take text input. So there's a lot of flexibility with speech-to-speech models; they do pose some other development challenges, but they offer tremendous benefits in terms of the features they offer.
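To illustrate the context-management ideas above — dropping turns that no longer serve a purpose and summarizing long conversations out of band — here's a plain-Python sketch. Everything in it is hypothetical scaffolding: the "ephemeral" tag and the summarize_out_of_band() helper are stand-ins for whatever marking scheme and text-LLM call you'd actually use.

```python
MAX_RECENT_TURNS = 20

async def compact_context(messages: list[dict]) -> list[dict]:
    """Return a smaller message list to keep real-time latency down."""
    system, turns = messages[0], messages[1:]

    # 1. Drop turns flagged as no longer useful (e.g. the date-of-birth
    #    verification once it has passed). "ephemeral" is a made-up tag.
    turns = [m for m in turns if not m.get("ephemeral")]

    # 2. For very long conversations, summarize the older turns with an
    #    out-of-band call to a regular text LLM and keep only recent turns.
    if len(turns) > MAX_RECENT_TURNS:
        summary = await summarize_out_of_band(turns[:-MAX_RECENT_TURNS])  # hypothetical helper
        turns = [{"role": "assistant", "content": f"Summary of the conversation so far: {summary}"}] \
                + turns[-MAX_RECENT_TURNS:]

    return [system] + turns
```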
All right, I'm going to hold questions just for a sec, because I've heard the Wi-Fi is back. Can folks try downloading the repo again? There's also a Slack channel — Quinn, can you come to the mic? It's workshop-voice-gemini-pipecat, with dashes in between, on the AI Engineer Slack. Does everyone know where that is? I know this is like day one, hour three. workshop-voice-gemini-pipecat is the channel name, so if you go to the AI Engineer Slack and search for "workshop", it should come up. Can people join — do you all have access to the AI Engineer Slack? Yeah, there should be a link to join the Slack group. All right, I'll take that question, then I'll walk over — maybe listen and also try to get the repo; I'd like to walk through the quick start and then look at some examples. [Question about running models locally.] That's something Quinn has done a ton of work with; I haven't personally. There are obviously massive latency benefits, because you cut out network round trips all over the place. So you save a ton, and depending on where you are in the world that can save a lot. If you're US-based, your network latency is going to be relatively low, but a lot of the developers in the community are in Europe, and a lot of these AI services are relatively new, with data centers only in the US. There are different challenges when doing that, though. There are great hosting providers, like Modal, that offer really good options for running your own local LLM, where you're buying — or I guess leasing — GPU time instead of running everything on your own GPU, which would probably be cost prohibitive because of the way the processes run. But that's also a talk in itself. Next question. Sure. Oh yeah, definitely — large context windows definitely cause LLMs to process more slowly. Sorry — the question was around state management with LLMs, and whether it's better to chunk and have more deterministic input, versus just a large context where you dump everything in. This is actually an extension of what the other gentleman was asking. At Daily we built the chat widget that's on the homepage of the Voice AI World's Fair page — I personally built that. What was interesting there — and Aleix and I were just talking about this — is that function calls in the context of real time are still slow. Too slow, unfortunately. Actually, I'll give more props to Gemini for being maybe one of the fastest: if you run a basic local demo like one of these and you ask it what the weather is, it will return with a time to first byte in, I don't know, less than 500 milliseconds, whereas with other vendors — not trying to throw OpenAI under the bus, but it has gotten slower — you might see upwards of 1.5 to two seconds of waiting just to get that first token back. We dug into this just this morning, actually. The issue is that with a normal streamed response for the conversation, you can start playing the audio out once you get the first sentence; but when you get a tool call, you need the entire JSON response before you can do anything with it. So that's slow. That's one part of it. Separately, chunking the prompts is absolutely the way to go. In building that World's Fair bot, I was balancing between two worlds, because you can talk to it and ask it about speakers for the sessions. Route one would have been a RAG-style approach: put all of the speaker JSON behind a tool that could be accessed. I tried that, and unfortunately it's just a little too slow — it's a giant context and it takes a while to come back. What's interesting is that if you instead move all of that directly into the context, with Gemini Live it's a little variable, but under good conditions you'll get a response back at around that 800-millisecond latency, and it has access to the full context. The one tradeoff, though, is what this gentleman over here was asking about: accuracy. It's going to get confused, especially when you give it JSON with a lot of speakers — and this is all LLMs, not specifically Gemini — with that type of even structured data, it becomes very hard to discern what's what when a lot of it looks the same. So a lot of this is emerging, like other things in AI: trying to do it as fast as possible with voice. Before I take one more question, I want to check in: have folks had any luck with the repo? Got a thumbs up. All right, I'm going to pause on questions for the time being; we can take them at the tail end. I do want to try to go through the quick start. Okay, great.
So, I would recommend we just take a few minutes of independent setup. Go to the README in the gemini-pipecat-workshop repo, and roll through the first few steps — maybe put a hand up, just flash it real quick, so I can see when people start to get through it. You don't have to hold it up. Okay, we've got one, two. Ah — so for this workshop I have leaked a key from one of my accounts, which I'll cycle after this. So you don't need to sign up for a Daily account. You do need to sign up for a Gemini key; I don't have one I can just give out. In the env.example you'll see there's already a Daily key which you can use. Do people know how to sign up for Gemini? I have it in the README — it's through AI Studio. Is that a good spot to do it? Okay. All right, I'll take a question while we're waiting. That's a good question. In terms of the Pipecat interface, you interface with the Pipecat class — there's a service class — and within Pipecat, how you write your application code is uniform across all of the services. The individual providers, unlike with text-based LLMs, haven't really settled on a standard, so Pipecat handles all that translation on your behalf. I think there are other frameworks that do similar things. The idea is to provide a uniform, simple interface, so that — and this is part of the modularity — if you wanted to, you could swap this bot out for OpenAI Realtime, or a text-based model with TTS and STT paired with it. That's the whole idea. There is maybe a little bit of a difference in how LLM providers handle things — maybe somebody here can nudge them — the system instruction, or system prompt, is not uniformly dealt with. I don't know if that's well understood, but OpenAI has a system role that you can inject anywhere at any time, which is really fantastic, whereas Anthropic and Google require a named system instruction that's a special, one-time, constructor-time instruction. So there are some differences. As best we can, we unify.
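For anyone who hasn't hit that difference before, here's a small sketch of the two shapes. The Gemini constructor mirrors the Pipecat wrapper used in this workshop, and the exact argument names are assumptions:

```python
import os

from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService

# OpenAI-style: "system" role messages can appear anywhere in the message
# list, including mid-conversation.
openai_style_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi there!"},
    {"role": "system", "content": "From now on, answer in one short sentence."},  # injected later
]

# Anthropic / Google style: a single named system instruction supplied once,
# at construction time, separate from the message list.
llm = GeminiMultimodalLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="You are a helpful assistant.",
)
```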
All right — how are folks doing on getting the quick start going? I see a hand. Is that a question? Okay, I'll take it; maybe it's related to the quick start. Oh yeah, great topic. So there's a question about noisy environments, which is like voice AI's kryptonite. With the VAD alone? No, not at the moment. But again, this is where Pipecat is the assembler of all things — you choose what to plug in, and you can run something separate from the VAD. We found Krisp, which is a partner of ours; they have fantastic noise cancellation. You can run it outside of the loop — in Pipecat it sits in the transport itself — so on the audio input it takes the incoming audio and removes any ambient noise: chip bags opening, dogs barking, but maybe more impressively, human background noise. It removes that from the feed, so you could be at this conference and it picks up the primary speaker for the device incredibly well. At the moment they're the only ones I'm aware of who do that and do it that well, and it's phenomenal. But you're right, the VAD — noise was one of the biggest problems we saw until we found Krisp, and they're fantastic. K-R-I-S-P. Big props to the Krisp team. [Question about which speech-to-speech models we support.] Well, we use all of them — Pipecat's open source. I think the options are Gemini Multimodal Live, OpenAI Realtime, and then AWS just launched a new one called Nova Sonic. So we have those three within Pipecat. They're all very similar, actually. I don't want to nitpick on the vendors here — they each have strengths and weaknesses, because it's still an emerging field — but latency-wise, that is not an issue; latency is fantastic for all the providers. All right. Okay, maybe Aleix will walk through the quick start. He's going to do some live coding here, and maybe this will help you understand how this all works, and then perhaps I'll stop chatting and you can grab me if you have questions and just do some heads-down working time. Yeah, I cannot type and talk at the same time — teamwork. I'll try from the very beginning, from nothing, from scratch. So this is a Python project. The first thing we'll do is create a virtual environment — I like to call it env; that's how you create a virtual environment in Python. A virtual environment will have — what's that? Oh, it's hard to see. How do I fix that? Okay. Yay. Okay. So we'll start with — yes, it's Emacs, you can judge me, I'm not going to change after all these years of coding. So I'm going to create a requirements file. Sorry — first thing, I created the virtual environment; we did this. Then we activate it. Now we can start installing packages into this environment, Python packages. The one we need is Pipecat, of course. So I'm creating a new file with pipecat-ai, and Pipecat has a few extras: for this example we're going to use daily, we're going to use google, and we're going to use silero — right, Silero is the VAD we've been talking about. The other thing we'll do is load the environment — there's a package called python-dotenv, just to load the environment variables we'll have in here. Okay. That's it, one file. Now I'm going to create the environment variables file; I'm just going to copy it from the quick start. It's just a .env, and we can take a look at it. These are the API keys — we'll get rid of them after this workshop. You just need two keys: a Google API key — just go to AI Studio and create your own — and a Daily API key. This one we already have; this one you can copy. It's a bit long, but it's in the project, so if you're able to get to the repo on GitHub, you can just copy it from there. All right. Next, we're going to create a Python file; let's call it bot.py. Again, an empty file. First thing we'll do is the typical if __name__ == "__main__" block — no help from an LLM — and then we call asyncio.run.
asyncio.run — a little help from the completions — asyncio.run(main()), and we're going to run a main function. Then let's write the main function. All right, so now we start. The first thing we said we need is a transport. The transport here is the Daily transport — I'm going to speak to the bot through the Daily transport; that's what I'm going to create. We want a transport for incoming audio from me talking, and a transport for outputting what the LLM will say. So I'm just going to create a transport, and this is a DailyTransport, and it has a few arguments. It has a room URL — I'm just going to use mine for now; this is, again, a public room that I will have to delete. I don't need a token. And params: the DailyParams describe what we want this transport to do. We're going to say that we want audio input enabled, which means get audio from the transport, and we also want to send audio out to the transport. Okay. What else? It's complaining about something. Oh yeah, we also need a VAD analyzer, which is going to be a SileroVADAnalyzer. So the transport will use this VAD analyzer to detect whether the user has spoken or not. What's that? Oh yeah, and the name is going to be "AI Engineer". There we go. Okay, now we have the transport. Now we're going to create the LLM — in this case it's Google Gemini Live. So I don't need to create a speech-to-text or text-to-speech; we just create the Gemini Live service, and for that I'll copy it, because I don't know it from memory — it's in this gemini_bot.py file; I just want to copy these lines here. There we go. So this is my LLM, and again it uses this GeminiMultimodalLiveLLMService. I'm just going to add my import. Okay. Now it needs a couple of things: the system instruction, which is what the agent is going to do, and some tools. The tools I'm going to skip for now. So let's do the system instruction — again, I'm going to copy it from somewhere. There it is. This is like the main prompt. The system instruction is: you're a helpful assistant who can answer questions and use tools. For now, we're not going to use any tools — you know what, let me get rid of the tools; I'll just copy them here for later and comment this out — and you are just a helpful assistant. Okay. No complaints here. All right. And now we just create the pipeline. I'm going to skip storing the context, because I don't think we need it for now. And this is the pipeline: the pipeline just receives a list of processors, or elements, and the first one is transport.input() — that's the input transport, how we get audio from the Daily room in this case — then the LLM, and then transport.output(). All right. Now, this just defines a pipeline — and a pipeline is itself also a processor, so you could build a pipeline of pipelines of pipelines; you can plug and play it however you like. So how do you run a pipeline? You need a task — what we call a PipelineTask — that receives a pipeline. The pipeline task also has some params, which are called PipelineParams, and we're going to say — oops — that we allow interruptions. And I think that's enough.
And how do you run a task? You can create more than one pipeline task if you want; in this case we just have one — usually you just have one. You're going to create a runner, and guess what, it's called PipelineRunner. So it's a PipelineRunner, and then we just do await runner.run(task) — some completions, please. All right, and I think that's it. Let's try it. Oh — os; import os. I think there are no more warnings. Oh, and I need to load the environment variables, which is this line here, load_dotenv() — I'll just copy it from another file. Okay. And where do we get load_dotenv? It's just a function that imports the environment variables. All right, let's try it — I'm just going to open the terminal here and run it. I think we called it bot.py. Uh, no module found — oh, maybe I need to install the requirements; I forgot this step. There it is. So at the beginning I wrote that requirements.txt file, which had just a few requirements, but I forgot to install them. In the meantime, I'm going to go to the Daily room that I just pointed the bot to. Okay — right now it's just me in that room. Now we just have to wait for this to finish, and hopefully the bot will join the room and we'll be able to talk to it. Hopefully. Yeah — this is because we're using the Daily transport, and the Daily transport just connects to a Daily room. But you could have a websocket transport and then use Twilio with a phone number, with Twilio connected to that. If we have time, we can even try that. So — the small WebRTC transport, do you want to talk about it? Oh yeah. In Pipecat we've also added, based on the aiortc Python package, which is a WebRTC package for Python, a new transport called the small WebRTC transport. It's peer-to-peer WebRTC communication that's free — it's separate from any vendor. The one downside is that it requires a TURN server, which you bring yourself. We weren't prepared for that for the conference, and the conference Wi-Fi makes that a little challenging. But normally, if you're running any of what we call the foundational examples in Pipecat — think of them as the essential examples that show how to do very specific things; there are probably about a hundred of them, and one by one they show you how to record, or add an STT, or push frames, or show images, or sync images and sound — those all use the peer-to-peer WebRTC transport. We would have loved to use that — you wouldn't need a key — but unfortunately firewall rules have trumped us. All right. So I'm just running the bot; let's see how it fails, because it has to fail the first time. Yeah, this is the slow part, because there are a bunch of Python packages and Python just takes a while to load them. But you see how easy it was to write an agent — a voice agent — with Gemini Live, if it works. It's just a few lines of code that we wrote in, I don't know how long it took me, maybe five or ten minutes. Are there any questions on the example? What's that? It worked — the bot works. Okay. All righty. Nice.
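Putting the live-coded pieces together, the whole bot looks roughly like this. It's a sketch: import paths and parameter names vary between Pipecat versions, the room URL is a placeholder, and the requirements/.env contents are reconstructed from the walkthrough.

```python
# requirements.txt:  pipecat-ai[daily,google,silero]   python-dotenv
# .env:              GOOGLE_API_KEY=...   DAILY_API_KEY=...
import asyncio
import os

from dotenv import load_dotenv

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

load_dotenv()


async def main():
    # Transport: audio in/out through a Daily (WebRTC) room, with a local
    # Silero VAD to detect when the user starts and stops speaking.
    transport = DailyTransport(
        "https://YOUR-DOMAIN.daily.co/ai-engineer",  # placeholder room URL
        None,                                        # no token for a public room
        "AI Engineer",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # One box instead of STT + LLM + TTS: Gemini Live takes audio in and
    # produces audio out.
    llm = GeminiMultimodalLiveLLMService(
        api_key=os.getenv("GOOGLE_API_KEY"),
        system_instruction="You are a helpful assistant who can answer questions.",
    )

    pipeline = Pipeline([
        transport.input(),   # audio from the user
        llm,                 # Gemini Multimodal Live
        transport.output(),  # audio back to the user
    ])

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
```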
[Question about production usage.] I don't know the actual number of customers, but Pipecat probably serves hundreds of thousands of calls a day — Quinn, you probably have a better idea. Oh yeah, Pipecat is used by some very large companies in production, and people are contributing to it from Nvidia, AWS, OpenAI, Google — lots of big companies. Yeah, and one thing we didn't mention about Pipecat is that what you see on screen now runs on the server side, but we do have client SDKs for Android, iOS, JavaScript, and React — and someone even contributed a C++ client if you want it. So that's the server side, but you can plug in your client and connect to the agent from your phone. WebRTC connections? That depends on the transport you use, but yes: your client can connect to a Daily room — we support LiveKit as well, but we like Daily because we work at Daily — and then the bot, the agent, connects to the same Daily room. That's the transport, the WebRTC transport. Yeah, I think there were questions over there. Say it again? For the previous release, I hacked together a thing I call release evals, which is a bot talking to a bot. I put this bot up so it joins a Daily room, and then I have an eval bot, and the eval bot has a prompt: ask a simple addition. That eval bot connects to the room and asks "what is 2 plus 2?", and the other bot replies "2 plus 2 is 4", and the eval bot's LLM checks whether the answer from the user — the user in this case being another LLM — is correct. So it verifies things end to end. The good thing is we used to run more than 100 examples by hand every release just to make sure they work, and I got tired of it because it's very painful and very slow. So we have these release evals that test each service — we test Gemini Live, we test Cartesia, Deepgram, all the services — end to end, and the bots basically talk to each other with voice. That's the nice thing. Okay — oh, is this on? Maybe real quick, just to show it for those that didn't see it. "Hey, can you tell me a joke?" "Why don't scientists trust atoms? Because they make up everything." If you build with Pipecat, I've heard that joke probably, I mean, not like 5,000 times. You can even try to seed it with something different, but it's still the same one. "I will try to come up with a new joke." "Why do you always say the same joke?" "That's an interesting question. Why do you think people keep telling the same jokes?" "I don't know. You tell me." "From my search, there appear to be several reasons why people repeat jokes. Enjoyment and reinforcement: people repeat things they find funny because they want to re-experience the good feeling associated with laughter. It feels good, and repeating it is a way to try and recreate that sensation. Memory aid —" Okay, okay, that's enough. "Is there anything else I can help you with today?" No, thank you, that was it. Thank you so much. Well, just to show — there was a question about interruptions. My favorite is to ask it to tell you a really long story and then interrupt it. So: can you tell me a really, really long story?
"Okay, I can do that. Here's a story, and feel free to interrupt whenever you like. Once upon a time —" Okay, actually, tell me that new joke. "— quit following the clues —" Hey, tell me that new joke. "You'd like to know why scientists don't trust atoms? Because they make up everything. Okay, here's one —" That was the same joke. All right. How was it? So, I think the questions asked during this workshop could map out years of work, and this is another one of those fantastic cutting-edge areas. Again, back to human evolution: when we talk, it's hard even for humans not to speak over each other. The way it works mechanically is that when the user stops speaking, the VAD has a timeout — you program it: wait, say, one second, 0.8 seconds, half a second, whatever feels natural — and you're trying to balance low-latency response with giving the user enough time to speak. It's a really hard problem, and it's one of the biggest complaints: agents will speak over the human. Say you're building an interview bot — you're using, say, Tavus, one of their digital twins — you want a real likeness and you want to speak to it; you may take time to think, because sometimes you have to take time to think, and that's a really difficult thing for bots to handle, because again, it's driven by a simple stopped-speaking algorithm. So this is an emerging field with models that look at semantic end of turn — driven off of things like speech filler words, pauses, intonation, so signals in the audio realm, and also signals in the text realm, just looking at context. We've actually started — we're one of many doing this, I think — we launched a model; if you look on GitHub, it's under smart-turn. It's a native audio-in classifier that runs inference on the input audio and simply outputs either complete or incomplete. The way Pipecat uses this is that if you get an incomplete response, we can dynamically adjust the VAD timeout. So we can tell the Pipecat bot: okay, he or she is not done speaking, let's give three seconds to complete the thought, and if it's not done by then, the bot will respond. So you can create a bit of dynamic interaction there. That's one of the first steps. I'm sure the Google team is working on similar things; I know OpenAI is, and all the STT vendors are also looking at their own approaches. So I'd say right now it is very much an unsolved problem, but I would imagine, given how fast things are going, in the next 12 months we'll have great solutions that make it even more natural to talk to a bot. It's a good question. Any more questions? Over there — oh yeah. [Question about synchronizing text with audio.] This is specific to Pipecat, but also, Gemini Live will output audio and text, and other speech-to-speech LLMs do this too. In terms of Pipecat's orchestration role — actually, this is specific to the TTS provider. There are great TTS providers that do word and timestamp synchronization, so they'll give you pairs; they call them alignment pairs. If you're using Cartesia or ElevenLabs or Rime, they all output these pairs. One of the really cool things with Pipecat is that the TTS services output not only the audio stream but also the text stream.
So they'll output text frames — TTS text frames, we call them in Pipecat. And in terms of how the client software works, there's an observer role: a process that can watch things happening in the pipeline and emit events. We've instrumented that for the clients, so that whenever those text frames move through the transport, you can get synchronized word and audio output. So in your client, if you want word-by-word output synchronized to the audio, you can do that with Pipecat, and it's as simple as adding an event — I think you listen to something like bot TTS text, or on-bot-TTS-text, and it will give you the synchronized output. [Question about running fully offline.] Well, it's all doable — there are great models. I think it really depends on what your bot needs to accomplish. A lot of the state-of-the-art models that do the best and smartest things are going to run on-prem or in the cloud. But a lot of bots do simple jobs — if you wanted to build a restaurant reservation bot like I referenced earlier, that's a very simple job; you could probably run it with some version of Llama running locally. There are great local options — again, Quinn has been experimenting with a lot of great local models. Whisper has some challenges as an open source model for STT, but there are good and emerging TTS options. Things are only as good as the input, and we've actually seen this with some of the speech-to-speech models: sometimes they mistranscribe. Every part is critical, but if you can't transcribe the speech really well, nothing else matters — it has to understand you. Disfluencies, hallucinated transcriptions, or even just inaccurate responses kind of break everything down. So things mostly start at the STT; maybe that's the hardest part. I don't know if there are a lot of good open source options for that right now — we're not doing anything in the STT world ourselves. No, no — that's a whole different ballgame. Good question, though. It just made me realize we could have used local models and avoided this Wi-Fi problem. We could have. Yeah — well, we're partnering with the Google team. Yeah, exactly. Has anyone looked at any of the sample projects and had questions? There's a lot of interesting stuff there. If any of this has interested you, we do have a Discord — you're welcome to get on it. You can find us at pipecat.ai and find our Discord there; you can ask questions. There's some really cool stuff with Gemini that can be done there. In particular, in the Pipecat repo we built — I don't know if you know the game Catchphrase, where you describe a word and someone guesses it — a version of that. We had to brand it something else, Word Wrangler: you as the human describe a word, and the AI agent tries to guess it. We built a client-server version of that, which I linked to in the repo, and then we have a phone-based one that I think is particularly sophisticated and interesting. You might think: how the hell would I build this with a speech-to-speech model?
We actually use two Gemini agents in the same call, and we use a parallel pipeline where one agent is the host giving out the questions to the human user, the other is the guesser, and we limit the audio flow so that the guesser — the AI player — can only hear the user. So there's a bunch of really interesting stuff there, getting majorly into the weeds of some of the powers of Pipecat, but it also speaks to the strength of having native audio input — it's really, really helpful. So I'd recommend checking those out; they're really cool, easy demos to run. One's Twilio, the other is, again, client-server — I think it's a React/Next.js project. What's that? Word Wrangler. Yeah, we could run the Word Wrangler client app — it's actually just on the web. Let's test it. "Welcome to Word Wrangler. I'll try to guess the words you describe. Remember, don't say any part of the word itself. Ready? Let's go." I'm going to skip to something easier. Okay — this is something you take pictures with; it's on your phone. "Is it camera?" All right — this is a field related to the study of languages, I think. "Is it linguistics?" All right — this is a game with a yellow ball you play with rackets; hit the ball over the net. "Is it tennis?" All right — this is a round dessert with chocolate chips sometimes and other fun goodies. "Is it cookie?" It's really good, even when I'm bad at giving clues. So, pretty cool. This is built with Gemini Live — but again, just an example of the things you can build with voice AI: cool, unique interactions. All right, I think that's about it. Thanks, everybody.