Pipecat Cloud: Enterprise Voice Agents Built On Open Source - Kwindla Hultman Kramer, Daily
Channel: aiDotEngineer
Published at: 2025-07-31
YouTube video id: IA4lZjh9sTs
Source: https://www.youtube.com/watch?v=IA4lZjh9sTs
Hi everybody. My name is Kwin. I'm a co-founder of a company called Daily. Daily's other founder, Nina, is in the back there. I'm stepping in for my colleague Mark, who couldn't make it today, so we're going to do this fast and very informally, but I think that's a good way to do it at an engineering conference. I don't have as much code to show as the last awesome presentation, but I'll try to show a little bit. We're going to talk about building voice agents today. I work on an open source, vendor-neutral project called Pipecat, and a lot of other people at Daily do too, because voice AI is growing fast, is super interesting, and is a good fit for what we do as a company. We started in 2016. We are global infrastructure for real-time audio, video, and now AI, for developers. Pipecat sits somewhat higher up in the stack than our traditional infrastructure business. So we'll talk a little bit about how you can build reliable, performant voice AI agents entirely with open source software. We also recently launched a layer on top of our traditional infrastructure designed for hosting voice AI agents; we'll talk a little about that too. We've been doing this a long time. We care a lot about the developer experience for very fast, very responsive real-time audio and video. We have a long list of engineering firsts we're proud of, but that's not why you're here today; happy to talk about that later, though. If we step back and orient a little, what are you doing when you build a voice agent? I tend to orient people around three things they have to think about: you've got to write the code, you have to deploy that code somewhere, and then you have to connect users over the network or over a telephone connection to that agent. A few things here. User expectations are high.
Voice AI is new, but it's growing fast, I think, because with the best technologies that are just now becoming available, we're able to meet user expectations. Users expect the AI to understand what they're saying; to feel smart, conversational, and human; to be connected to knowledge bases; to have actual access to information useful for whatever they're doing for that user; and to sound natural. There's definitely an uncanny valley problem that generative AI fell into for a very long time. Now we're on the other side of that for voice AI, which is really exciting. The agents also have to respond fast. It varies by language, by culture, and by individual, but roughly speaking, humans expect a 500-millisecond response time in natural human conversation. If you don't do that in your voice AI interface, you are probably going to lose most of your normal users. So we tell people to target 800-millisecond voice-to-voice response times. That's not easy to do with today's technology, but it is definitely possible. And build your UIs thoughtfully, understanding that humans expect fast responses. The other hard thing, a little bit like fast response times, is knowing when to respond. Humans are good, but not perfect, at knowing when somebody we're talking to is done talking and when we should start talking. Voice AI agents are not as good at that yet, but they're getting better, so we'll talk a tiny bit about that. So why do developers use a framework like Pipecat instead of writing all the code themselves? Well, part of it is all those hard things on the previous slide that you probably don't want to write the code for yourself if you're mostly thinking about your business logic, your user experience, and connecting to all of your systems.
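To make the 800-millisecond voice-to-voice target concrete, here is a rough latency budget for a cascaded agent. The per-stage numbers are illustrative assumptions, not figures from the talk; real budgets vary by provider and geography.

```python
# Illustrative voice-to-voice latency budget for a cascaded voice agent.
# All per-stage numbers are rough assumptions for illustration only.
budget_ms = {
    "endpointing / turn detection": 200,  # deciding the user is done talking
    "speech-to-text finalization": 100,   # final transcript chunk
    "LLM time-to-first-token": 300,       # first token of the response
    "text-to-speech first audio": 120,    # first audio chunk from TTS
    "network round trips": 80,            # client <-> server transport
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:32s} {ms:4d} ms")
print(f"{'total':32s} {total:4d} ms")  # lands exactly at the 800 ms target
```

The point of writing the budget down is that every stage competes for the same 800 milliseconds, so shaving any one stage (or overlapping stages via streaming) directly buys headroom for the others.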
You want to use battle-tested implementations of things like turn detection, interruption handling, context management, calling out to other tools, and function calling in an asynchronous environment. Developers tend to use frameworks these days for lots of the agentic things they do, and in voice AI, I think, it's even more important to sit on top of really well-tested infrastructure and code components than in other domains. Pipecat appeals to developers because it's 100% open source and completely vendor neutral. You can use it with lots of different providers at every single layer of the stack that Pipecat enables. For example, there's native telephony support in Pipecat, so you can use Pipecat with lots of different telephony providers in a plug-and-play manner. You can use Twilio, for example, which a lot of developers know. If you're in a geography like India where Twilio doesn't have phone numbers, you can use Plivo. A bunch of other telephony providers are supported too. There's also a native audio smart turn model that's completely open source in Pipecat. The community has gotten large enough that there's cutting-edge ML research, at least in the small-model domain, coming out of this open source community, which is really fun. Pipecat Cloud, I think, is a really nice advantage for the Pipecat ecosystem. It's the first open source voice AI cloud built from the ground up to host code that you write, but designed for the problems of voice AI. And Pipecat supports a lot of models and services; the count is something like sixty-plus. All the things you would want to use in a voice AI agent are probably in the Pipecat main branch. So you probably don't have to write code to get started, though the appeal is that you can write lots and lots of code if you want to; there's no ceiling. I'll talk a little bit about what the architecture looks like.
And we probably won't have time to talk about client SDKs, because most of you in this room are probably building for telephony use cases. But there's a really rich and growing set of JavaScript, React, iOS, and Android client-side components and SDKs that people in the Pipecat community are using to build multimodal applications that run in the web and on native mobile platforms. We talked about this, so I'll actually just skip this slide. I hope we'll have time for Q&A; that's the most fun part. Here's the other piece that often helps orient people: this is what a Pipecat agent looks like. You're building a pipeline of programmable media-handling elements. These are all written in Python, although lots of the performance-sensitive ones bottom out in some kind of C code, which is pretty common in real-time media handling. You probably don't have to worry about that level, though; you're probably just thinking in Python. Pipecat pipelines can be really simple. They can have just a couple, maybe just three, elements: something for the network, something that's doing some processing, and something that's sending stuff back out over the network. Or they can be quite complicated, and we see enterprise voice agents often become quite complicated, because they're doing complicated things and connecting out to a large variety of existing legacy systems. An example of a little bit of that span: the left two screenshots here are from the Pipecat docs about how you work with the OpenAI audio-centric models in Pipecat. OpenAI gives you a couple of different shapes of models and APIs that you can use. One is chaining together transcription, a large language model operating in text mode, and voice output. The other is using their new and in some ways experimental speech-to-speech models, which are also really awesome and promising. You can do either of those approaches in Pipecat by changing probably three or four lines of code.
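The three-element pipeline shape described above can be sketched in plain Python. To be clear, this is not the real Pipecat API; all class and method names here are invented for illustration. It just shows the pattern: frames flow in from a transport, through processing elements, and back out.

```python
# Toy sketch of the "pipeline of media-handling elements" pattern.
# NOT the real Pipecat API; every name here is illustrative.

class Processor:
    def process(self, frame: dict) -> dict:
        return frame

class TransportIn(Processor):
    """Pretend network input: tags incoming audio frames."""
    def process(self, frame):
        return {**frame, "source": "network"}

class EchoLLM(Processor):
    """Stand-in for the STT -> LLM -> TTS middle: turns audio into a reply."""
    def process(self, frame):
        return {"reply": f"heard {frame['audio']}", "source": frame["source"]}

class TransportOut(Processor):
    """Pretend network output: marks the frame as sent."""
    def process(self, frame):
        return {**frame, "sent": True}

class Pipeline:
    """Pushes each frame through the processors in order."""
    def __init__(self, processors):
        self.processors = processors

    def run(self, frame):
        for p in self.processors:
            frame = p.process(frame)
        return frame

pipeline = Pipeline([TransportIn(), EchoLLM(), TransportOut()])
out = pipeline.run({"audio": "hello"})
print(out)  # {'reply': 'heard hello', 'source': 'network', 'sent': True}
```

Real Pipecat pipelines are asynchronous and stream many frame types (audio, transcripts, LLM tokens, control frames), but the composition idea is the same: swap one element for another without touching the rest of the pipeline.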
On the right is the core chunk, a few hundred lines of Python, and a flow diagram for a more complicated pipeline. This is one of my favorite starter-kit examples for Pipecat. It uses two instances of the Gemini Multimodal Live API in audio-native mode: one is the conversational flow, and the other is another participant in the conversation that plays a game with the user. So there's sort of an LLM-as-a-judge pattern here, but in the context of a game, and you're moving the audio frames around through both pipelines selectively, depending on the results of the real-time inference. That's a pattern we also see in enterprise use cases, but it's fun to clone this, run it, and play the game. We listed some of the services here; we can talk a lot more in the Q&A, if you want, about what we see people actually using in production most often in terms of models and services in enterprise voice AI. So that's a very quick rundown of the Pipecat framework, which is how you write the code. Now, how do you deploy it, and why am I talking to you about Pipecat Cloud today? There are a bunch of hard things about voice AI that are unique to these use cases. These are long-running sessions. They have to use network protocols that are designed for low latency. Things like autoscaling are not available out of the box for these workloads the way they are for something like HTTP workloads. So I was actually quite resistant for a long time to building anything commercial around Pipecat at Daily, because we do the low-level infrastructure; we already have things that we do that serve the Pipecat community. But it got to the point where a very large percentage of the questions in the Pipecat Discord were about how to deploy and scale. I initially felt like that was a solved-enough problem, because what we do at the infrastructure level helps you in one way.
What a lot of our friends and customers do much higher up in the stack, with platforms that wrap the whole voice AI problem set in very easy-to-use dashboards, tools, and GUIs, are also really good solutions. But what we came to realize is that there was a middle of the stack that people were asking about a lot in the open source community, and it boiled down to: how do I do my Kubernetes? People would ask questions in the Pipecat Discord about deployment and scaling, and we would say, "Oh, well, if you really want to run this stuff yourself on your own infrastructure, here are the five things you do in Kubernetes." And people would say some version of "Kuber-what?" We didn't have a good answer to that. So we thought we'd come up with one, which is a very thin layer on top of our existing global, media-oriented, real-time infrastructure. This is not a very good marketing tagline, but I think of it as a very thin wrapper around Docker and Kubernetes, optimized for voice AI. So what are the things we're trying to solve for? Fast start times are very important. If somebody calls your voice agent and they hear ringing, they want to hear that voice agent pick up the phone and say hello pretty fast. Almost no matter what you do in AI, you care about cold start times, but it's even more important when the user is initiating some action and expects to hear audio back. Cold starts are hard; if you've built GenAI infrastructure, you know that. We try to solve the cold start problem for voice AI. Happy to talk about cold starts in great detail, because it's something I've been thinking a lot about over the last few months. Autoscaling is a little bit related to cold starts. You want your resources to expand as your traffic pattern expands. The alternative is that you know exactly what your traffic pattern is and you just deploy a bunch of resources.
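One common way to attack cold starts is to keep a pool of pre-warmed agent instances and hand one out the moment a call arrives, refilling the pool off the call path. This is a general pattern, sketched here with invented names; the talk doesn't describe Pipecat Cloud's internals, so don't read this as its actual implementation.

```python
# Sketch of a warm-pool pattern for agent cold starts.
# A general technique, not a description of Pipecat Cloud internals.
from collections import deque

class WarmPool:
    def __init__(self, target_size: int):
        self.target_size = target_size
        self.pool = deque()
        self.booted = 0
        self.refill()

    def boot_instance(self) -> str:
        """Stand-in for the slow step: pulling an image, starting a container."""
        self.booted += 1
        return f"agent-{self.booted}"

    def refill(self):
        # Top the pool back up; in a real system this runs in the background,
        # off the latency-critical call path.
        while len(self.pool) < self.target_size:
            self.pool.append(self.boot_instance())

    def acquire(self) -> str:
        # On an incoming call: hand out a warm instance instantly if we have
        # one; only fall back to a cold boot when the pool is empty.
        instance = self.pool.popleft() if self.pool else self.boot_instance()
        self.refill()
        return instance

pool = WarmPool(target_size=2)
first_call = pool.acquire()  # served from the warm pool; no boot on call path
print(first_call)            # agent-1
```

The tradeoff is exactly the autoscaling tension the talk describes: a bigger warm pool means faster pickup but more idle resources, so the pool size itself usually has to track the traffic pattern.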
That doesn't work for most workloads; most people have time-dependent or completely unpredictable workloads, so you need to scale up and scale down. Real time is different from non-real time, and by non-real time I mean everything that's not at conversational latency, a few hundred milliseconds or less. If you're making an HTTP request, you want it to be fast, but in most cases you don't really care whether your P95 is 1,500 milliseconds or 2,000 milliseconds. In a voice AI conversation, you care a lot if your P95 goes above 800, 900, 1,000 milliseconds for the entire voice-to-voice response chain, and all the little inference calls you make as part of that have to be much faster than that, by definition. So the whole networking stack, from the client to wherever your Pipecat code is running, and inside that Kubernetes cluster, has to be optimized for real time. You probably also need global deployment: you probably have GDPR, data residency, or other data privacy requirements, or you just want these servers close to users because that helps with latency. And all of this has to be delivered at reasonable cost. So we try to take these things off your plate and help you build quickly and get to market with your voice agents. A couple of other things are worth flagging here. We've done a lot of work on turn detection, which is one of the top three problems most people in voice AI are thinking about how to make better in 2025. Check out the open source smart turn model that's part of the Pipecat ecosystem if you're interested in that. It's built into Pipecat Cloud and runs for free; our friends at fal host it. You've probably heard of fal if you're doing GenAI stuff: very fast, very good GPU-optimized inference. Then there's ambient noise and background voices.
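The P95 framing above is easy to compute from measured turn latencies. The sample values below are made up for illustration; the nearest-rank percentile used here is one simple convention (libraries like NumPy offer interpolated variants).

```python
# Computing a P95 voice-to-voice latency from a list of measured turns.
# The sample latencies below are made up for illustration.

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value >= pct% of the samples."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceil via floor-div trick
    return ordered[int(rank) - 1]

voice_to_voice_ms = [620, 700, 750, 780, 640, 810, 690, 730, 950, 660,
                     720, 680, 760, 640, 700, 790, 670, 710, 740, 650]

p95 = percentile(voice_to_voice_ms, 95)
print(f"P95 = {p95} ms")  # P95 = 810 ms
print("within budget" if p95 <= 1000 else "over budget")
```

Note how one slow outlier (the 950 ms turn) barely moves the mean but is exactly what a P95 target is designed to surface; in conversation, users feel the worst turns, not the average one.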
So one problem with voice AI is that even though transcription models today are very resilient to all kinds of noisy environments, the LLMs themselves are not. If you're trying to do transcription, figure out when people are talking, figure out when to fire inference down the chain, and ask your LLMs to do something, then background noise that sounds a little bit like speech will trigger lots of interruptions you don't mean to happen and will inject lots of spurious pseudo-speech into your transcripts. And that's true even for speech-to-speech models today; they're not very resilient to background noise. The best solution to background noise today is a commercial model from a really great small company called Krisp. The Krisp model is normally only available with a big chunk of commercial licensing, but you can use Krisp for free inside Pipecat Cloud if you run on Pipecat Cloud. You can also use Krisp in your own Pipecat pipelines with your own license if you run it somewhere else. Finally, agents are nondeterministic, as we all know. There's a whole evals track here, and in every other track we talk about this problem too. We've got some nice low-level building blocks for logging and observability natively in Pipecat, exposed through Pipecat Cloud and a bunch of partners we work with on that. I'm happy to introduce you to the great teams at various companies that are building observability stuff. That's my speedrun; I came in 20 seconds under the 15 minutes, but because we're the last talk in this block, if people want to do Q&A, I'm totally happy to. Thanks.

Audience: Wonderful. Actually, I have two very quick questions. One is: we're based out of Sydney, Australia, so that 800-millisecond target is hard for us. Do you have any alternatives for that? Have you looked at alternatives for people outside the States?

Kwin: Yes, that's a great question. I'll repeat the question.
The question is: what if you're in a geography that's a long way from your inference servers? In the case of this particular question, you're serving users in Australia, you're using OpenAI, OpenAI only has inference servers in the US, and you don't want to make extra round trips to the US. There are a couple of answers. One is that if you make one long haul to the US for all the audio, at the beginning of the chain and at the end of the chain, that's much better than making three inference round trips for transcription, OpenAI, and voice generation. So that's one tool: we often tell people to deploy close to the inference servers rather than close to the users, and optimize for having one long trip and then a bunch of very fast short trips. That's good but not great. The other option is to run open-weights models locally in Australia, which you can definitely do. It's a longer conversation about which use cases you can serve with, say, the best open-weights models versus the GPT-4o and Gemini 2.0 Flash level models, but there are definitely some voice AI workloads now that you can reliably run on the Gemma, Qwen 3, or Llama 4 models.

Audience: Okay, second question, maybe related to that. Let's say we host models in Australia itself. What's the interconnectivity over the network from your cloud? Do you go to, like, the internet exchange locally out there?

Kwin: Yes. We have endpoints all over the world; in our world we call them points of presence. We have the edge server close to the user, we terminate the WebRTC or the telephony connection there, and then we route over our own private AWS or OCI backbones to wherever you need to route to. If you're hosting in Australia, you should be able to just hit our endpoints and then route to your hosting in Australia. We also have some regional availability of Pipecat Cloud now.
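The "one long haul versus three round trips" reasoning above is easy to put into numbers. The RTT values here are illustrative assumptions, not measurements from the talk.

```python
# Back-of-the-envelope comparison of two deployment strategies for a
# cascaded STT -> LLM -> TTS agent. RTT numbers are illustrative assumptions.

LONG_HAUL_RTT_MS = 200  # e.g. Australia <-> US round trip
LOCAL_RTT_MS = 10       # agent deployed next to the inference servers

# Agent near the user: every inference hop crosses the ocean.
agent_near_user = 3 * LONG_HAUL_RTT_MS  # STT + LLM + TTS round trips

# Agent near the inference servers: one long haul for the audio,
# then three fast local hops.
agent_near_inference = LONG_HAUL_RTT_MS + 3 * LOCAL_RTT_MS

print(f"agent near user:      {agent_near_user} ms of network time")   # 600 ms
print(f"agent near inference: {agent_near_inference} ms of network time")  # 230 ms
```

Under these assumptions, co-locating the agent with the inference servers cuts the network contribution from 600 ms to 230 ms per turn, which is most of the difference between blowing and meeting an 800 ms voice-to-voice budget.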
We will launch a bunch more regional availability of Pipecat Cloud over the next quarter, so I hope we actually have Pipecat Cloud in Australia soon. Although you can also obviously self-host in Australia and still use either Pipecat itself or Pipecat plus Daily in other ways.

Audience: Thanks for the talk. There are models like Moshi that basically claim turn detection is no longer needed because of the architecture.

Kwin: Yeah, I love Moshi. Moshi is an open-weights model from a French lab called Kyutai. It's a next-generation research model where the architecture is constant bidirectional streaming: you're always streaming tokens in, and the model is always streaming tokens out. In a conversational voice situation, which Moshi was designed for, most of the tokens streaming out are silence tokens of some kind. When they're not silence tokens, it's because the model decided to do whatever it's trained to do, which is really cool, because that can mean not just that the model does natural turn-taking, but also that it can do things like backchanneling. The model can do the things humans do that its dataset has audio for: while you're talking, I can say "uh, yeah, yeah," and it's not actually a new inference call, it's just streaming. That paper, the Kyutai Moshi architecture paper, was my very favorite ML research paper from last year. Now, that model itself is not usable in production for a bunch of reasons, including that it's too small a language model to be useful for basically any real-world use case. I have more to say about that, and I'm super excited about the architecture, but I think we're a couple of years away from it being actually usable and trained as a production model.
There are speech-to-speech models from the large labs that are closer to being usable in production. Now, they are not streaming-architecture models, but they are native-audio speech-to-speech models, which have a bunch of advantages, including really great multilingual support (mixed-language stuff is great from those models) and, in theory, latency reductions. OpenAI has a real-time model, GPT-4o audio preview, that sits behind their Realtime API; it's a good model. Gemini 2.0 Flash is available in an audio-to-audio mode, and they have preview releases of 2.5 Flash. These models are now good enough that you can use them for use cases where you're more concerned about the naturalness of the human conversation than about reliable instruction following and function calling. They are less reliable in audio mode than the SOTA models operating in text mode. So what we generally see is that a small subset of voice AI use cases today, the ones really about conversational dynamics, narrative, and storytelling, are starting to adopt those models. For the majority of enterprise voice AI use cases, where you really need the best possible instruction following and function calling, those models are not yet the right choice, but they're getting better every release, and all of us expect the world to move to speech-to-speech models being the default for something like 95% of voice AI sometime in the next two years. The question is: when, in your use case, will a particular model architecture cross that threshold in your evals?

Audience: Sorry, what about Sesame? Would you put Sesame in the same bucket as Gemini and OpenAI?

Kwin: Sesame is closer to Moshi. There's another open-weights, or partly open-weights, really interesting model called Sesame. It's a little like Moshi; in fact, it uses the Moshi neural codec, Mimi.
Sesame has not yet been fully released; there isn't a full Sesame release. Also, I think Sesame is smaller than what you'd probably need for most enterprise use cases today, although the lab training Sesame, I think, has bigger versions coming. There's also a speech-to-speech model called Ultravox, which is really good, trained on the Llama 3 70B backbone, and that team supports the model and has a production voice AI API. That model is worth trying if you're really interested in speech-to-speech models. If Llama 3 70B can do what you want, I think Ultravox is a good choice; if Llama 3 70B isn't quite there for your use case, probably not, but, you know, there's always the next Ultravox release. So speech-to-speech is definitely the future. I generally tell people to experiment with it, but don't necessarily start out assuming you're going to use it for your enterprise use case today.

Audience: Given your vendor neutrality, can you speak to the strengths and weaknesses of the leading-edge multimodal-input models like OpenAI and Gemini? When should I choose OpenAI, and when should I choose Gemini?

Kwin: My opinion is that GPT-4o in text mode and Gemini 2.0 Flash in text mode are roughly equivalent models for the use cases I test every day. So I would make the decision, if you can, by building a Pipecat pipeline, swapping the two models, and running your evals, because they're both really good models. One of the advantages of Gemini is that it's extremely aggressively priced: a 30-minute conversation on Gemini is probably ten times cheaper than a 30-minute conversation on GPT-4o. That may or may not stay true as they both change their prices, but it's definitely something we hear a lot from customers today; they like the pricing of Gemini. The other interesting thing about Gemini is that it operates in native audio input mode very well.
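The "swap the two models and run your evals" advice can be sketched as a tiny A/B harness. Everything here is illustrative: the model functions are stand-ins for two swappable LLM services behind one interface, and the exact-match scoring rule is a deliberately crude example of an eval metric.

```python
# Toy A/B harness for comparing two interchangeable model backends on the
# same eval set. Model functions and the scoring rule are illustrative.

def run_eval(model_fn, cases):
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    hits = sum(1 for prompt, expected in cases if model_fn(prompt) == expected)
    return hits / len(cases)

# Stand-ins for two swappable model services behind one interface.
def model_a(prompt):
    return prompt.upper()

def model_b(prompt):
    # Deliberately flawed stand-in: fails on very short prompts.
    return prompt.upper() if len(prompt) > 3 else prompt

cases = [("hello", "HELLO"), ("hi", "HI"), ("thanks", "THANKS")]
score_a = run_eval(model_a, cases)  # 1.0
score_b = run_eval(model_b, cases)
print("model A:", score_a)
print("model B:", score_b)
```

The design point is the one the talk makes: because both backends sit behind the same interface, the eval set and the harness stay fixed while only the model swaps, so score differences are attributable to the model rather than to the pipeline.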
So you can use Gemini native audio input mode with text output mode in a pipeline, and that has advantages for some use cases and some languages; again, you can test that in your evals. OpenAI also has native audio support in some of their newer models, but I think they're just a little bit behind the Gemini models in that regard. Time for one more, or are we done? One more, and then we're done.

Audience: What are the general advantages of speech-to-speech versus going speech-to-text, doing something, and then text-to-speech?

Kwin: So, what are the general advantages of speech-to-speech instead of text inference in the middle? It's a super interesting question, and I have a practical answer and a philosophical answer; I'll keep them both short. The practical answer is that you lose information when you transcribe. If there's information in the audio that you would lose in the transcription step, and that information is useful for your use case, then a speech-to-speech model is great. For example, things like mixed language are very hard for small transcription models; you're almost always losing a lot more information in a mixed-language transcription than in an optimized monolingual transcription. So why not go to the big LLM that has all this language knowledge and can do a better job on the multilingual input? The other advantage is potentially lower latency: if you've trained an end-to-end model for speech-to-speech, it's all one model, and you're not chaining together inference calls, so you can probably get lower latency. In practice, whether that's true today depends more on the APIs and the inference stack than on the model architecture.
But I think we're all moving towards assuming that we just want to do one inference call for the bulk of things, and then maybe use other little models on the side for subsets. The philosophical answer, though, is that those advantages are probably outweighed by the challenges big contexts pose to today's LLM architectures, and audio tokens take up a lot of context. So when you're operating in audio mode, you're expanding the context massively relative to operating in text mode, and that tends to degrade the performance of the model. A little bit relatedly, nobody has as much audio data as they have text data for training. So even though a big model is doing a bunch of transfer learning, when you give it a bunch of audio and it is, in theory, mapping all that audio to the same latent space as its text reasoning, in practice it's definitely not doing exactly that; it's doing something like that, but not that. And because we don't have as much audio data, you see a lot of issues with audio-to-audio models, like the model sometimes just responding in a totally different language. That's cool, but it's never what you want in an enterprise voice AI use case. The best guess for why that happens is that the input lands in the right part of the latent space from one projection, but from some other projection it's in a totally different part of the latent space when you gave it audio instead of text, even though if you transcribed that audio it would be exactly the same as the text. Latent spaces are big, and to actually find your way through them in post-training you really have to have a lot of data, and nobody has enough audio data yet. But the big labs are going to fix that, because audio matters and multi-turn conversations matter.