Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber

Channel: aiDotEngineer

Published at: 2025-07-31

YouTube video id: rD23-VZZHOo

Source: https://www.youtube.com/watch?v=rD23-VZZHOo

I'm Neil. That's Jack in the front there, but it's just me up here. So yeah, we're really just going to talk about our experience hosting Orpheus inference for our real-time stack.
So, I'm Neil, the CTO at a company called Gabber, a small startup. I've spent a lot of my career doing real-time media stuff: sending audio and video around the internet. I started at a company called Bibo, which was ultimately acquired by Amazon. There we built a game streaming app, kind of like OBS. I built a lot of the streaming infrastructure, plus an ML pipeline that watched people play video games. It would watch people play Fortnite and put cool effects on the screen when they got a kill or a victory or something. So I spent a lot of time in the GStreamer trenches, and with WebRTC and RTMP and all that stuff. Then I took a detour and worked at Uber for a couple of years, left that, and did a multiplayer gaming startup with my brother Jack here, basically trying to bring AAA-style multiplayer to web games, with voice and everything, so a lot of real-time media and real-time simulation work there. We didn't do a super good job with that one and shut the company down. We had been using LiveKit, and I'd made a LiveKit SDK, which segued into me working at LiveKit. I think a lot of people in this room have probably heard of LiveKit. The second half of my time there was spent building the LiveKit Agents platform, which was kind of born out of LiveKit's involvement with ChatGPT Voice. I wrote the first line of code on that and worked on it. Then I left LiveKit and did another startup with my brother: Gabber.
That's what we're doing now. Gabber is real-time infrastructure for, basically, real-time AI personas. We have some core building blocks: voice, memory, video inputs coming soon, tool calling, kind of the usual suspects, I guess. But our focus is really on consumer apps. We see the replacing-a-human use cases pretty often, the call center use cases, customer support, AI SDRs, that kind of stuff, but our interest is really in the consumer space. We think these kinds of real-time, synchronous AI experiences are going to be as ubiquitous as websites and apps in the next two to five years. So that's our focus, and that's how we try to differentiate: in the opinions baked into our product and our SDKs and APIs.
Here are some of the use cases we're seeing. These are also kind of the usual suspects. AI girlfriends was the first one, and I'll get to why. Other ones are AI NPCs, AI therapists, AI personal trainers, AI toys for kids (I think you saw that a couple of sessions ago). We're seeing a lot of different use cases, and I saw this at LiveKit too, which got me really excited about this stuff. But AI girlfriends was the first one, mainly because everything is so expensive. Some of these voice platforms are, end to end, upwards of $5 an hour, and that doesn't really work for 90% of consumer apps. For AI girlfriends it works because the users are paying, usually through a credit system: you buy credits and the app uses them up, so those users are more comfortable with that kind of spend. But most consumer use cases need something pretty close to free.
So we knew that, and at the time we were not hosting any voice models ourselves. But we knew we had to. We knew that the only way to execute on our vision of putting these experiences everywhere was to start bringing more things in house and running them on our own GPUs. At the time there weren't a lot of good open-source voice models. There were plenty of good ones for asynchronous use cases, generating voice slower than real time, but there weren't any really good real-time streaming ones until Orpheus. Orpheus was the first really good one that was kind of ready to go. So Orpheus came out and we thought, okay, this is our time to shine. We immediately put it on an H100, hosted it, went viral with Jack's tweet, and got a ton of top of funnel. That was kind of the starting point. For our company, there's before Orpheus and after Orpheus; the company kind of changed.
A little background on what Orpheus is. It's a voice model, but it started as a Llama 3B. It was pre-trained on about 100,000 hours of voice data, plus text data, to make sure it kept its understanding of language, and it was trained to output audio tokens. They're called SNAC tokens; SNAC is another open-source project, an audio codec. Orpheus is trained to output the 24 kHz flavor of SNAC tokens, which are then decoded into 24 kilohertz audio. The important thing to note here is that it's about 85 SNAC tokens for one second of audio. So wherever you're hosting Orpheus, it has to sustain a throughput of about 85, and really you want more like 90 to 100 tokens per second, to keep up with real time. Otherwise you obviously get gaps in the audio and it sounds bad.
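To put numbers on that: at roughly 85 SNAC tokens per second of audio, generation throughput has to stay above that rate or playback starves. A quick back-of-the-envelope sketch (the two rates come from the talk; the rest is just arithmetic):

```python
# Rough real-time budget for Orpheus/SNAC generation.
SNAC_TOKENS_PER_SEC_AUDIO = 85   # ~85 SNAC tokens decode to 1 second of 24 kHz audio
GEN_TOKENS_PER_SEC = 100         # roughly what one GPU sustains per stream

# Real-time factor: > 1.0 means audio is generated faster than it plays back.
rtf = GEN_TOKENS_PER_SEC / SNAC_TOKENS_PER_SEC_AUDIO
print(f"real-time factor: {rtf:.2f}x")   # ~1.18x, i.e. only ~18% headroom

def wall_clock_seconds(audio_seconds: float) -> float:
    """Time spent generating `audio_seconds` of speech; gaps appear if rtf < 1."""
    return audio_seconds * SNAC_TOKENS_PER_SEC_AUDIO / GEN_TOKENS_PER_SEC
```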
The other thing that was important to us, because we're going after consumer use cases, was cloning. We need our clones to be emotive and high fidelity, and one-shot cloning doesn't work that well. That's especially true for Orpheus, because it only had 100,000 hours of pre-training data, whereas I think the zero-shot emergent behavior comes in at a million-plus hours. And we're scrappy (I think you can tell by our design here), so we weren't going to fill that gap. So we went with low-rank fine-tunes for our clones.
So here's an example of a low-rank fine-tune. We have some better ones, but those are customers', so I didn't want to put them in the deck. We cloned Jack's voice just yesterday, with rank 16, alpha 32, on basically all the projections. Here's the source audio. Let's see. "We forgot to pick up our child from school." "Oh. The school called me in the middle of a meeting. Oops." So that's the source, and here's the result of the fine-tune. Let me manage expectations: this was pretty bad data, about 10 minutes of it, when you really want 30 minutes or so, so I had to overfit. I trained for about five epochs, and it's pretty overfit, but you'll hear it still sounds okay. "Hey, how are you? I'm kind of sick." This is a longer generation; let's see if it sounds okay. Yeah, it's not bad. I'm the older brother, so I've known his voice most of my life, and it's a little jarring to me. But the cool thing is that it's still trained to do these tokens, which is important for consumer, and it's pretty emotive. When it said "I'm kind of sick," it sounded pretty sad. So it picks up on the language cues as well.
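For reference, a LoRA configuration like the one described (rank 16, alpha 32, basically all the projection matrices) would look roughly like this with Hugging Face PEFT. The model id and the rest of the training setup are illustrative, not our exact pipeline:

```python
# Sketch of the LoRA setup described above: rank 16, alpha 32, all Llama projections.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-ft")  # id may differ

lora_cfg = LoraConfig(
    r=16,                 # low rank keeps each adapter small (~100-200 MB)
    lora_alpha=32,
    target_modules=[      # "basically all the projections"
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# ...then fine-tune on text/SNAC-token pairs from ~30 minutes of the target voice.
```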
The other thing that's really important, obviously for all voice use cases, not just consumer, is latency. There are four things that really affect latency. Time to first token is one of them. Tokens per second is another; I'll get into why later. Network latency is another. But the biggest cause of latency we found was what we're calling head-of-line silence. This is somewhat specific to the Orpheus model; it isn't going to be true for all models. Head-of-line silence is basically that, somewhere in the fine-tuning of Orpheus, the data had a lot of silence at the beginning, because it came from voice actors they hired, and they fine-tuned the model from those recorded scripts.
So this is the default Orpheus voice, or one of the ones that ships with it, called Tara, and it has 600 milliseconds of silence at the beginning. They probably had good reasons for leaving silence at the start, but that's a lot, right? We're running on L40S machines as of now, and they can do about 100 tokens a second, so generating that 600 milliseconds of silence costs almost half a second. Even though we filter out the silence, so we're not just playing that audio back to the user, it still takes a while to generate those tokens, so we're adding basically half a second of latency in wasted compute. And even filtering the silence out, you only save about 10%, because you're just barely faster than real time. Again, we're scrappy, so we're running on L40S. But what we found interesting is that we could just fine-tune the silence away. This is an example of a clone we did, a LoRA fine-tune of a customer's voice, and the latency is basically 100 milliseconds at P50. So much better: half a second back, basically for free.
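To make the cost concrete, here's a small sketch. The numbers come from the talk; the trimming function is a simplified illustration, not our actual pipeline. The point is that the silent tokens still have to be generated, so trimming on the playback side only recovers the thin real-time margin:

```python
import numpy as np

TOKENS_PER_SEC_AUDIO = 85     # SNAC tokens per second of audio
GEN_TOKENS_PER_SEC = 100      # L40S throughput per stream (from the talk)

# 600 ms of leading silence still has to be generated before real speech appears:
silence_tokens = 0.6 * TOKENS_PER_SEC_AUDIO               # ~51 tokens
wasted_wall_clock = silence_tokens / GEN_TOKENS_PER_SEC   # ~0.5 s of added latency

def trim_leading_silence(pcm: np.ndarray, sample_rate: int = 24_000,
                         frame_ms: int = 20, threshold: float = 1e-3) -> np.ndarray:
    """Drop near-silent frames from the front of a decoded audio buffer.

    This only hides the silence from the listener; the ~0.5 s spent generating
    those tokens is already gone, which is why fine-tuning the silence out of
    the model itself is the real win."""
    frame = int(sample_rate * frame_ms / 1000)
    for start in range(0, len(pcm), frame):
        if np.abs(pcm[start:start + frame]).mean() > threshold:
            return pcm[start:]
    return pcm[:0]
```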
And that matters because these real-time applications have a latency budget. The way they work is the human talks, and at some point you decide: is the human done talking? Those endpointing models aren't perfect, so you typically add a snooze period at the end. But during that snooze period you can still do work. So what we do is kick off the LLM. The way we have our Orpheus stack set up, we start generating audio after two sentences, or earlier if the LLM finishes first, but typically two sentences, which gives the voice model enough context to capture the emotion. All that to say: if we generate the first audio packet within that snooze period, then we're in the money on latency, inside our latency budget. Now, these endpointing models are going to get better, so that snooze period is going to come down. Half a second to a second is probably the sweet spot, and one and a half seconds is roughly the threshold: anything above that sounds pretty bad, and anything at or below it is acceptable. So yeah, that half a second mattered a lot, because it gives the LLM more time to create tokens, and since we let customers bring their own LLMs, that part is somewhat out of our control.
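Here's a rough sketch of that overlap. The llm and tts clients and their methods are hypothetical stand-ins, but the shape is the same: the LLM starts during the snooze period, and TTS kicks off once about two sentences have accumulated, so the first audio packet ideally lands inside the snooze window.

```python
import re

async def handle_turn(user_transcript: str, llm, tts) -> None:
    """Hypothetical orchestration: llm.stream() yields text chunks as they are
    generated; tts.speak() streams text into the voice model."""
    pending: list[str] = []
    sentence_count, started = 0, False
    async for chunk in llm.stream(user_transcript):   # kicked off during the snooze period
        pending.append(chunk)
        sentence_count += len(re.findall(r"[.!?]", chunk))
        if not started and sentence_count < 2:
            continue                                   # buffer until ~two sentences of context
        started = True
        await tts.speak("".join(pending))              # start (and keep) streaming audio
        pending.clear()
    if pending:                                        # short replies: flush whatever is left
        await tts.speak("".join(pending))
```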
The next big category is infrastructure. Again, we're scrappy, so we really needed something robust and not too complicated. And we needed batch inference, obviously, to save money: we need to run multiple generations in the same batch on the same GPU concurrently, and we also needed multiple LoRAs running in the same batch on the same GPU. And we wanted one load balancer in front of everything. We're spinning up multiple different models for different languages, so we wanted all of this to just be a black box that sort of works. vLLM to the rescue: it supports all of those things. vLLM can do batch inference with LoRAs, which is really awesome. Unfortunately, the FP16 model was slower than real time on L40S. It worked on an H100, but on L40S it was slower than real time. But again, vLLM to the rescue: it supports FP8 dynamic quantization, which requires basically zero work. It just works automatically; it does all the scaling for you, so you don't have to bake calibration data into your own quant. It's amazing. That brought us up to 105 tokens a second on the non-fine-tuned voices and 95 tokens a second on the LoRA voices with a batch of 10, which puts us well in the money in terms of margins and things like that. So that's nice.
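A minimal sketch of that vLLM setup, using the offline API for brevity. The model id, adapter path, and sampling parameters are placeholders, and the production setup runs as a streaming server rather than one batch call:

```python
# Sketch: LoRA-aware batched inference plus FP8 dynamic quantization in vLLM.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="canopylabs/orpheus-3b-0.1-ft",   # id may differ
    quantization="fp8",                     # online FP8 quantization, no calibration pass
    enable_lora=True,
    max_loras=10,                           # several voice-clone adapters resident per GPU
    max_lora_rank=16,
)

params = SamplingParams(temperature=0.6, max_tokens=1024)

# One request with one clone's adapter; when run as a server, concurrent requests
# carrying different LoRARequests get scheduled into the same batch.
outputs = llm.generate(
    ["<prompt for Jack's voice clone>"],
    params,
    lora_request=LoRARequest("jack_voice", 1, "/loras/jack_voice"),
)
```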
The other part of infrastructure is, of course, load balancing. Depending on your hyperparameters, LoRAs are between 100 and 200 megabytes, so you want to make sure a request ends up on a server that already has that LoRA in memory; that's where sticky sessions come in, and you want to keep latency low. We also wanted to support streaming input, mainly because the LLM often isn't done by the time you want to start producing audio, and we wanted to support arbitrarily long generation, for things like storytelling. That's another reason the load-balancing problem is interesting: you want to make sure a session stays on the same GPU for its whole lifetime. So we went with a pretty much by-the-book consistent hash ring setup. If you've seen hash rings before, this isn't that interesting, but basically the way it works is you hash each server multiple times, into what are called virtual nodes, so it's distributed around the hash ring. Then when a generation starts, you hash its key with the same hashing algorithm, pick the nearest server on the ring, and it just works. The reason this is the standard choice is that you can remove a server and it doesn't rebalance everything; only a few sessions need to migrate. The other nice thing about this strategy is that if a clone gets very popular, it's pretty easy to handle: the more popular a LoRA is, the more servers you append it to, so you can scale it up and down elegantly without a ton of engineering work.
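A minimal by-the-book version of that ring, with virtual nodes, looks something like this (server names are made up):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: each server is hashed onto the ring many times
    ("virtual nodes"), and a session key maps to the nearest server clockwise."""

    def __init__(self, servers: list[str], virtual_nodes: int = 100):
        self._ring: list[tuple[int, str]] = []
        for server in servers:
            for i in range(virtual_nodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def get(self, key: str) -> str:
        """Pick the first server at or after the key's position on the ring."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

# Usage: pin a whole session (and its LoRA) to one GPU for its lifetime.
ring = HashRing(["gpu-0:9000", "gpu-1:9000", "gpu-2:9000"])
server = ring.get("session:abc123")
```

Removing a server only remaps the keys that pointed at its virtual nodes, and replicating a popular clone is just hashing the same LoRA onto more servers.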
So at a high level it looks something like this. We have our WebRTC backend that terminates the client connections, then we use websockets to our GPUs, and the GPUs talk to Redis. Redis isn't the best choice, but if we ever scale beyond what Redis can handle for this, we can solve that with piles of money, I guess. The way it works is: you start a session, the WebRTC backend just connects to any GPU, it asks Redis which GPU this request is supposed to be on, and then it just proxies it over another TCP connection to the correct GPU. That's fine because these GPUs are in the same data center on private networking, so low-latency TCP is totally fine within the same network.
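Roughly, the routing step looks like this; the key naming and helper function are made up for illustration, not our actual code:

```python
# Sketch of the session -> GPU routing described above.
import redis

r = redis.Redis(host="redis.internal", port=6379)

def resolve_gpu(session_id: str, ring) -> str:
    """Return the GPU host a session should live on, creating the mapping once."""
    key = f"session_gpu:{session_id}"
    host = r.get(key)
    if host is None:
        host = ring.get(session_id)    # consistent hash ring from the earlier sketch
        r.set(key, host, ex=3600)      # expire stale sessions after an hour
        return host
    return host.decode()
```

Whichever GPU (or backend node) receives the connection can then open a second TCP/websocket connection to that host and proxy the stream.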
So that's pretty much it. The conclusion here is: we're pretty scrappy, and we were able to host voice models on GPUs and handle that infrastructure, so you can too. The open source is there, and I think it's going to unlock a ton of cool use cases. Shout-outs: shout out to swyx, he's a supporter of ours, and obviously put this event on, or half of it, I guess. swyx is awesome; we love him. Canopy Labs, who created Orpheus; I haven't met them, but would love to if they're here. And then just free open-source software in general: Canopy Labs built on Llama and SNAC, so this whole ecosystem is greater than the sum of its parts. And LiveKit; we're LiveKit alums, so we love those guys, and our WebRTC infra is built on them. And then vLLM, another notable open-source project. And yeah, that's it.