Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber
Channel: aiDotEngineer
Published at: 2025-07-31
YouTube video id: rD23-VZZHOo
Source: https://www.youtube.com/watch?v=rD23-VZZHOo
I'm Neil. That's Jack in the front there, but it's just me up here. We're going to talk about our experience hosting Orpheus inference for our real-time stack.

I'm the CTO at a company called Gabber, a small startup, but I've spent a lot of my career doing real-time media: sending audio and video around the internet. I started at a company called Bebo, which was ultimately acquired by Amazon. There we built a game streaming app, kind of like OBS; I built a lot of the streaming infrastructure, plus an ML pipeline that watched people play video games — it would watch people play Fortnite and put cool effects on the screen when they got a kill or a victory. So I spent a lot of time in the GStreamer trenches, with WebRTC and RTMP and all that. Then I took a detour and worked at Uber for a couple of years, left that, and did a multiplayer gaming startup with my brother Jack, basically trying to bring AAA-style multiplayer to web games, with voice too, so a lot of real-time media and real-time simulation work there. We didn't do a super good job there and shut that company down. We had been using LiveKit — I wrote a LiveKit SDK — and that segued into me working at LiveKit, which I think a lot of people in this room have heard of. The second half of my time at LiveKit was spent on the LiveKit Agents platform, which was kind of born out of LiveKit's involvement with ChatGPT's voice mode. I wrote the first line of code on that and worked on it, then left LiveKit and did another startup with my brother: Gabber. That's what we're doing now.

Gabber is real-time infra for, basically, AI personas. We have the core building blocks — voice, memory, video inputs coming soon, tool calling, the usual suspects — but our focus is really on consumer apps. We see the replacing-humans use cases pretty often — call centers, customer support, AI SDRs, that kind of stuff — but our interest is really in the consumer space. We think these real-time, synchronous AI experiences are going to be as ubiquitous as websites and apps in the next two to five years. That's our focus, and it's how we try to differentiate in terms of the opinions baked into our product, SDKs, and APIs.

Here are some of the use cases we're seeing — also kind of the usual suspects. AI girlfriends was the first one; I'll get to why. Others are AI NPCs, AI therapists, AI personal trainers, AI toys for kids — I think you saw that one a couple of sessions ago. We're seeing a lot of different use cases, and I saw the same thing at LiveKit, and it got me really excited about this stuff. But AI girlfriends was the first one, mainly because everything is so expensive. Some of these voice platforms run upwards of $5 an hour end to end, and that doesn't really work for 90% of consumer apps.
For AI girlfriends it works because the users are paying — usually through a credit system where you buy credits and the app consumes them — so they're more comfortable with that kind of spend. But most consumer use cases need something pretty close to free. We knew that, and at the time we weren't hosting any voice models ourselves, but we knew the only way to execute on our vision of putting these experiences everywhere was to start bringing more things in house and running on our own GPUs.

At the time there weren't a lot of good open-source voice models. There were good ones for asynchronous use cases — generating voice slower than real time — but there weren't any really good real-time streaming ones until Orpheus. Orpheus was the first really good one that was kind of ready to go. So when Orpheus came out, we said, "Okay, this is our time to shine." We immediately put it on an H100, hosted it, went viral with Jack's tweet, and got a ton of top-of-funnel. That was the starting point: for our company, there's a before Orpheus and an after Orpheus.

A little background on what Orpheus is. It's a voice model, but it started as a Llama 3B. It was pre-trained on about 100,000 hours of voice data, plus text data to make sure it kept its understanding of language, and it was trained to output audio tokens called SNAC tokens. SNAC is another open-source project, an audio codec. Orpheus is trained to output the 24 kHz variant of SNAC tokens, which are then decoded into 24 kHz audio. An important thing to note here is that it's about 85 SNAC tokens per second of audio. So wherever you host Orpheus, it has to sustain a throughput of about 85 — really you want 90 to 100 tokens per second — to keep up with real time. Otherwise you get gaps in the audio and it sounds bad.

The other thing that was important to us, because we're going after consumer use cases, was cloning. Our clones need to be emotive and high fidelity, and one-shot cloning doesn't work that well. That's especially true for Orpheus because it only had 100,000 hours of pre-training data, whereas I think the zero-shot emergent behavior shows up at more like a million-plus hours. And we're scrappy — I think you can tell by our design here — so we weren't going to fill that gap. We went with low-rank fine-tunes (LoRAs) for our clones.

Here's an example of a low-rank fine-tune. We have some better ones — this isn't the best example — but those are customers' voices, so I didn't want to put them in the deck. We cloned Jack's voice just yesterday, using rank 16, alpha 32, on basically all the projection layers. Here's the source audio: "We forgot to pick up our child from school. Oh, the school called me in the middle of a meeting. Oops." That's the source, and then here's the result of the fine-tune. Let me manage expectations: this was pretty bad data, about 10 minutes of it — you really want something like 30 minutes — so I had to overfit it.
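For the curious, here's a minimal sketch of what that kind of low-rank fine-tune config might look like with Hugging Face PEFT. The talk didn't show training code, so the library choice, the Llama-style module names, and the model id are assumptions on my part, not Gabber's actual setup:

```python
# Hypothetical sketch: a rank-16 / alpha-32 LoRA over all projection layers,
# in the spirit of the voice clone described above. PEFT and the module
# names are assumptions; the base model id is assumed, not confirmed.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-ft")  # assumed id

lora_config = LoraConfig(
    r=16,                # "rank 16"
    lora_alpha=32,       # "alpha 32"
    target_modules=[     # "basically all the projections" on a Llama backbone
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# The training loop itself — encoding ~10-30 minutes of audio into SNAC
# tokens and running a handful of epochs — is omitted here.
```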
I trained for about five epochs, so it's pretty overfit, but you'll see it still sounds okay: "Hey, how are you? I'm kind of sick." This is a longer generation; let's see if it sounds okay. Yeah, so it's not bad. I'm the older brother, so I've known his voice most of my life, and it's a little jarring to me, but the cool thing is it's trained to do these tokens, which is important for consumer, and it's pretty emotive: when it said "I'm kind of sick," it sounded pretty sad. So it picks up on the language cues as well.

The other thing that's really important, obviously for all voice use cases and not just consumer, is latency. There are four things that really affect latency: time to first token, tokens per second (I'll get into why that matters later), network latency, and what we're calling head-of-line silence — which we found was the biggest cause of latency. This is somewhat specific to the Orpheus model; it's not going to be true for all models. Head-of-line silence is basically that somewhere in the fine-tune of Orpheus, the data had a lot of silence at the beginning — they hired voice actors, recorded scripts, and fine-tuned the model from those recordings. The default Orpheus voice, or one of the ones that came with it, called Tara, has about 600 milliseconds of silence at the beginning. They probably had good reasons for adding silence at the beginning, but that's a lot.

We're running on L40S machines as of now, which can do about 100 tokens per second, so those 600 milliseconds of silence take almost half a second of compute to generate. We do filter the silence out — we're not just playing that audio back to the user — but because it takes a while to generate those tokens, we're adding roughly half a second of latency on what is basically wasted compute. Even filtering out the silence, you only save about 10%, because at that throughput you're just barely faster than real time.

Again, we're scrappy, so we're running on L40S. But what we found interesting is that we could actually just fine-tune the silence away. Here's an example of a clone we did, a LoRA fine-tune of a customer's voice, and the head-of-line latency is basically 100 milliseconds at P50. So much better — roughly half a second back, basically for free.

That matters because these real-time applications have a latency budget. The way they work is: the human talks, and at some point you decide whether the human is done talking. Those endpointing models aren't perfect, so you typically add a snooze period at the end, but during that snooze period you can still do work. So we kick off the LLM, and the way we have our Orpheus stack set up, we start generating audio after two sentences — or sooner if the LLM is done, but typically two sentences — which gives it enough context to capture the emotion. All that to say: if we generate the first audio packet within that snooze period, we're in the money on our latency budget. Now, these endpointing models are going to get better.
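To make the head-of-line-silence math concrete, here's a back-of-the-envelope sketch of that latency budget. The throughput and silence numbers are the ones from the talk; the decomposition into these particular terms is my own framing, not Gabber's internal model:

```python
# Back-of-the-envelope latency budget, using the numbers from the talk.
# The breakdown into these terms is an illustrative assumption.

SNAC_TOKENS_PER_AUDIO_SECOND = 85      # ~85 SNAC tokens per second of audio
GENERATION_TOKENS_PER_SECOND = 100     # rough L40S throughput

def time_to_first_audio(leading_silence_s: float,
                        first_chunk_audio_s: float = 0.3,
                        time_to_first_token_s: float = 0.1) -> float:
    """Rough wall-clock time until the first *audible* packet is ready.

    Leading silence still has to be generated token by token before it can
    be filtered out, so it eats the budget even though it's never played.
    """
    tokens_needed = (leading_silence_s + first_chunk_audio_s) * SNAC_TOKENS_PER_AUDIO_SECOND
    return time_to_first_token_s + tokens_needed / GENERATION_TOKENS_PER_SECOND

# Stock "Tara" voice: ~600 ms of head-of-line silence.
print(round(time_to_first_audio(0.6), 2))   # ~0.87 s before any audible audio
# Fine-tuned clone with the silence trained away: ~100 ms.
print(round(time_to_first_audio(0.1), 2))   # ~0.44 s — roughly half a second saved
```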
That snooze period is going to come down — half a second to a second is probably the sweet spot. One and a half seconds is kind of the threshold: anything above that sounds pretty bad, and anything at or below it is acceptable. So that half second mattered a lot, because it gives the LLM more time to produce tokens, and since we let customers bring their own LLMs, that part is somewhat out of our control.

The next big category is infrastructure. Again, we're scrappy, so we really needed something robust and not too complicated. We needed batch inference, obviously, to save money — multiple generations running concurrently in the same batch on the same GPU — and we needed multiple LoRAs running in the same batch on the same GPU. We also wanted one load balancer in front of everything; we're spinning up multiple models for different languages, so we wanted the whole thing to be a black box that just works.

vLLM to the rescue: it supports all of those things. vLLM can do batch inference with LoRAs, which is really awesome. Unfortunately, the FP16 model was slower than real time on L40S — it worked on H100, but on L40S it was slower than real time. Again, vLLM to the rescue: it supports FP8 dynamic quantization, which requires basically zero work. It does all the scaling automatically, so you don't have to build calibration data into your own quant; it just works, and it's amazing. That brought us up to 105 tokens a second on the non-fine-tuned voices and 95 tokens a second on the LoRA voices with a batch of 10, which puts us well in the money in terms of margins.

The other part of infrastructure is, of course, load balancing. LoRAs, depending on your hyperparameters, are between 100 and 200 megabytes, so you want to make sure a request lands on a server that already has that LoRA in memory — that's where sticky sessions come in, and it keeps latency low. We also wanted to support streaming input, mainly because the LLM often isn't done by the time you want to start producing audio, as well as arbitrarily long generation for things like storytelling. That's another reason this load-balancing problem is interesting: you want to end up on the same GPU across the whole session.

So we went with a pretty much by-the-book consistent hash ring. If you've seen hash rings before, this isn't that interesting, but basically the way it works is you hash each server multiple times — these are called virtual nodes — so it's distributed around the ring, and when a generation starts you hash its key with the same hashing algorithm and pick the nearest server on the ring. The reason this is the right choice is that you can remove a server and it doesn't rebalance everything; only a few migrations are needed.
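Going back to the vLLM setup for a moment, here's roughly what serving the Orpheus base model with FP8 dynamic quantization and batched LoRA adapters can look like. This is a generic sketch of vLLM's Python API, not Gabber's actual deployment (they run a streaming stack in front of the GPUs); the model id and adapter paths are placeholders:

```python
# Sketch of vLLM with FP8 dynamic quantization plus multiple LoRA adapters
# batched on one GPU. Model id and LoRA paths are placeholders, and the
# real-time system would use a streaming server rather than offline generate().
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="canopylabs/orpheus-3b-0.1-ft",  # assumed model id
    quantization="fp8",                    # FP8 dynamic quantization, no calibration step
    enable_lora=True,
    max_loras=10,                          # several voice clones resident per batch
    max_lora_rank=16,
)

params = SamplingParams(temperature=0.6, max_tokens=2048)

# Each request can carry its own LoRA; vLLM batches them together on the GPU.
jack_voice = LoRARequest("jack-clone", 1, "/loras/jack-clone")
outputs = llm.generate(
    ["<prompt the Orpheus stack builds from the LLM's text>"],
    params,
    lora_request=jack_voice,
)
```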
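And here's a minimal sketch of a by-the-book consistent hash ring with virtual nodes, as described above. The hash function, virtual-node count, and key format are illustrative choices, not the exact implementation from the talk:

```python
# Minimal consistent hash ring with virtual nodes. Hash function, vnode
# count, and key format are illustrative assumptions.
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, server)
        for server in servers:
            self.add_server(server)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server: str):
        # Hash each server many times ("virtual nodes") so load spreads evenly.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server: str):
        # Only keys that mapped to this server's virtual nodes move elsewhere.
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect_left(self.ring, (h, ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["gpu-0", "gpu-1", "gpu-2"])
# Every request for one session/clone hashes to the same GPU ("sticky"),
# and removing a GPU only migrates the keys that lived on it.
print(ring.lookup("session:abc123/lora:jack-clone"))
```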
The other nice thing about the hash-ring strategy is that if a clone gets very popular, it's pretty easy to handle: the more popular a LoRA is, the more servers you add it to, and you can scale that up and down very elegantly without a ton of engineering work.

At a high level it looks something like this: we have our WebRTC backend that terminates the client connections, then we use WebSockets to our GPUs, and the GPUs talk to Redis. Redis is not the best choice, but if we scale beyond the point where Redis works for this kind of thing, we can solve that with piles of money, I guess. The way it works is: you start a session, the WebRTC backend connects to any GPU, that GPU asks Redis which GPU the request is supposed to be on, and then it just proxies it over another TCP connection to the correct GPU. That's fine because these GPUs are in the same data center on private networking — low latency, so TCP within the same network is totally fine.

So that's pretty much it. The conclusion here is: we're pretty scrappy, and we were able to host voice models on GPUs and handle that infrastructure, so you can too. The open source is there, and I think it's going to unlock a ton of cool use cases.

Shout-outs: shout out to swyx. He's a supporter of ours, and obviously he put this on — or half of it, I guess. swyx is awesome; we love him. Canopy Labs, who created Orpheus — I haven't met them, but would love to if they're here. And free open-source software in general: Orpheus is built on Llama and SNAC, so this whole ecosystem is greater than the sum of its parts. LiveKit — we're LiveKit alums, so love those guys, and our WebRTC infra is built on them. And vLLM, a notable open-source project. And yeah, that's it.