Dream Machine: Scaling to 1m users in 4 days — Keegan McCallum, Luma AI

Channel: aiDotEngineer

Published at: 2025-07-19

YouTube video id: EY4O9M6AsWI

Source: https://www.youtube.com/watch?v=EY4O9M6AsWI

So, it's 9:00 a.m. on June 11th, 2024, and we send out the announcement and hold our breath, waiting to see the user signups pour in. We're expecting significant traffic for the launch of Dream Machine, Luma's first video model, and we were woefully unprepared for what came next. We'd allocated about 500 H100 GPUs. We thought that was a lot at the time. It wasn't.
Over the next hour, we saw request after request pour in and a giant queue of requests start to pile up. Luckily, we had a contingency plan for this, which was to take every GPU we had access to, from every provider, and start manually running SSH commands against them to spin up workers pulling work from a global queue. We were able to get to about 5,000 H100s over the next six hours, and our queue of almost 100,000 requests finally started to drain around 2 p.m. that day.
And that is when Amit, our CEO and self-appointed chaos monkey, decided to tweet that we had scaled up by 10x: come on in, it'll be faster now, and we'll see how it goes. Our queues were down to about 300 at this point, and you can see me here freaking out a little bit, because about 10 minutes after the tweet goes out, they start to go up again, even though we've scaled up by 10 times. So now we're at 350, then 400, and still climbing.
As this is going on, we're taking over the entire training cluster, the only GPUs we hadn't used yet, which is another roughly 4,000 H100 GPUs. And it barely made a dent; the queues were still going up. You'll see the KEKW emoji here, which has become a cultural staple at Luma over time, and it was very much the mood at the time: what am I supposed to do with this?
So, I'm here to talk to you today a little bit about Luma: who we are, what we do, and how we managed to scale our infrastructure up to a million users in four days, which was pretty nuts. For some context, ChatGPT hit a million users in five days. We processed about half a million videos over the course of those 12 hours. So we're going to talk a little bit about what we learned scaling up and how we did it.

But first, a very quick intro to Luma if you're not familiar with us. We're not just a video model company. We're a foundation model lab, and we're aiming to build general multimodal intelligence that can generate, understand, and operate in the physical world just like a human can. To give you some sense of what our models can do and where we're at today, this is a demo video for a feature we dropped yesterday: Modify Video. All of these videos were initially taken on iPhones and uploaded to our platform with a text prompt, and you can turn that robot into anything you want.
And for the AI engineers out in the audience: shout out to Kron, who manages our public API. You can integrate this functionality into your applications using it very easily. You don't need to do any kind of crazy prompt engineering; we've taken care of that for you. Just send us raw user prompts and generative media, and we'll send you back images and videos that meet your users' needs. So if you're interested in that, definitely hit me or Kron up. He's done a wonderful job building out our SDK and our whole DevRel side of things. That's my little plug for our API in case anyone's interested. So, back to infrastructure.
This is what our serving stack basically looked like when we launched: a bunch of tightly coupled containers working together. One of the benefits of this is that we could launch them on raw machines with zero other dependencies, so that worked out well for us at launch, but there were some challenges scaling it up. Like any good engineer, I didn't want to reinvent the wheel initially, so we reached for Triton Inference Server, which is a classic general-purpose model serving server, but there were some issues with it. The setup was brittle: if Triton went down, the CPU processes didn't necessarily know that it had gone down, so they'd keep pulling jobs and the jobs would fail. It was annoying.
Worse, though, with these video models you need to run inference on multiple GPUs, and in a lot of cases on multiple nodes, to get to the latency you need, and Triton is just not built for that. We also now run our inference on multiple different chipsets, and since Nvidia is the company that builds Triton, it doesn't have great support for things like AMD or Groq. And finally, the biggest hurdle was that it was really difficult for the researchers to develop against. It had a whole bunch of different idioms, a whole bunch of incantations you needed to make it work well, and the overall setup just felt very janky.

So what we ended up doing was rearchitecting to address some of these things and building our own serving stack on top of vanilla PyTorch for all the GPU work. That worked out really well, because most of the vendors building these different chipsets make sure that PyTorch is fully supported. It's this great substrate to build on top of: if you stick to very vanilla PyTorch, you can typically make your model run anywhere. You may need to optimize certain operations, depending on your model, to make things fast, but it's relatively easy to get started.
In terms of the decoupled architecture you see here, having the CPU workers decoupled is quite useful and important, because you can use them to queue up work and pull in inputs. When you're dealing with videos, images, and other multimedia inputs on top of just text, you want those already in the cluster, ready for the GPUs to pull, so you're not blocking the GPUs at all. And with this architecture you can actually run the GPUs anywhere, as long as they can connect to Redis and our distributed storage (you can see SeaweedFS there). You can have a GPU in any random provider, on any random VM, use Tailscale to connect to the rest of this architecture, and then scale up without having to do all that other crazy parallel SSH stuff. So if you want to run compute on your training cluster now, you don't need to provision special machines or anything; you just run a command.
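To make that concrete, here is a minimal sketch of what a pull-based GPU worker in this style could look like, assuming a Redis list as the global queue. The queue keys, job fields, and paths are hypothetical, not Luma's actual schema.

```python
# Minimal sketch of a pull-based GPU worker (hypothetical names; Luma's actual
# job schema, queue keys, and storage layout are not public).
import json

import redis   # redis-py
import torch

r = redis.Redis(host="queue.internal", port=6379)  # reachable e.g. over Tailscale


def load_model():
    # Plain PyTorch so the same worker can run on NVIDIA, AMD, and other chipsets.
    model = torch.jit.load("/models/dream_machine/active/model.pt")
    return model.to("cuda").eval()


def run_worker():
    model = load_model()
    while True:
        # Block until a job that the CPU workers have already staged is available.
        _, payload = r.blpop("jobs:ready")              # hypothetical queue key
        job = json.loads(payload)
        inputs = torch.load(job["staged_input_path"])   # pre-fetched by a CPU worker
        with torch.inference_mode():
            output = model(inputs.to("cuda"))
        torch.save(output.cpu(), job["output_path"])    # e.g. a SeaweedFS-backed mount
        r.lpush("jobs:done", job["id"])


if __name__ == "__main__":
    run_worker()
```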
There were a few more challenges we hit after we got through the initial hurdles of this infrastructure. One of the big ones was back pressure. Because this is decoupled, and because there are multiple clusters pulling work from the same global queue, you can get into a situation where too many CPU workers pull work into one cluster, and those jobs sit there waiting to be processed by the GPUs on that cluster when they could have been processed somewhere else. So we came up with a dispatch limitation system: once a job has been pulled into a cluster and is waiting to be picked up by a GPU, it sits in a distinct state, and you can put a limit on how many jobs are in that state per cluster to avoid this issue. Priorities and our fair scheduling woes, which I'll talk about more in depth, were another one.
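A minimal sketch of that dispatch-limit idea on the CPU-worker side, assuming the same Redis setup as above; the key names and the cap are assumptions, not Luma's actual implementation.

```python
# Sketch of a per-cluster dispatch limit (hypothetical keys and limits).
import redis

r = redis.Redis(host="queue.internal", port=6379)

CLUSTER = "cluster-a"
MAX_STAGED = 64   # assumed cap on jobs staged locally but not yet on a GPU


def maybe_pull_job():
    staged_key = f"staged:{CLUSTER}"
    # Only take more work if this cluster's staged backlog is under the cap,
    # so excess jobs stay in the global queue where any cluster can grab them.
    if r.llen(staged_key) >= MAX_STAGED:
        return None
    # Atomically move one job from the global queue into this cluster's staging list.
    return r.rpoplpush("jobs:global", staged_key)
```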
We also have multiple tiers of users, and deciding whose jobs get processed first is a constant optimization challenge. Then there's handling different models. These video models are big; they're typically made up of 10 to 20 different sub-models, so you're pulling in a lot of weights and spending a lot of time compiling things. Traditional autoscaling is super wasteful here: you're wasting 10 to 20 minutes of GPU time just warming things up. And finally, handling bursts. We basically built this system to handle bursts, where we can scale up automatically onto our training cluster, which our researchers hate me for. It makes them very upset, but it allows us to keep up with the demand from our users. That decoupled architecture, plus a very simple scheduler that runs on top of Slurm and can just launch these PyTorch workers, handles increasing the total pool of GPUs when we need it and scaling down when we don't.
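A sketch of what that kind of burst-onto-Slurm logic could look like; the threshold, worker count, and script name are assumptions for illustration only.

```python
# Sketch of bursting inference workers onto a training cluster via Slurm
# (hypothetical script names and thresholds).
import subprocess

import redis

r = redis.Redis(host="queue.internal", port=6379)

QUEUE_KEY = "jobs:global"
BURST_THRESHOLD = 1000    # assumed queue depth that triggers a burst
WORKERS_PER_BURST = 8     # assumed number of extra GPU workers to launch


def maybe_burst():
    depth = r.llen(QUEUE_KEY)
    if depth > BURST_THRESHOLD:
        for _ in range(WORKERS_PER_BURST):
            # Each Slurm job just starts one of the plain PyTorch workers.
            subprocess.run(
                ["sbatch", "--gres=gpu:8", "run_inference_worker.sh"],
                check=True,
            )
```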
So this system is a pull-based system, and there are queues that submit to queues that pull from queues, and as Sorish and Vas, both here from Luma, can attest, managing these queues is the least enjoyable part of working at Luma; it's everybody's least favorite task. When we launched Ray 2, which is a much bigger, more resource-intensive model, we began to have to deal with the fact that we couldn't just keep scaling up more and more compute. It just wasn't economical or feasible. So we had to deal with limited resources, and when you have a pull-based scheduler with a bunch of queues like this, the concept of work starvation comes into play. We've got API-tier jobs, which we try to process very quickly; we've got enterprise; we've got Unlimited, Plus, Lite, and Free. So we've got all these different tiers with different priorities for who gets to go first. If you just naively process them in priority order, there may be enough enterprise or API jobs that the Lite jobs never get processed. We were having people wait for seven, eight, nine hours, and they were very unhappy. Also, shout out to Sorish: after we designed this system, he implemented it in a couple of hours on a weekend. This SLO-based system, which I'll go into, allows us to more fairly schedule work across the limited resources that we have.
So, how the system works: when you think about the concept of work starvation, one of the typical approaches you can use to manage it is aging. The idea is that the longer something waits in the queue, the higher its priority goes. But what's the actual function that controls that aging mechanism? We had the insight that this is a product problem, and it really comes down to how long, worst case, you are okay with each tier waiting. So we have service level objectives that the product team defines, and they control this aging behavior. How that works is: for an API job, we may not want it to wait in a queue for more than a couple of minutes, but for a Lite job, maybe we're okay with it waiting 10 minutes. You can configure these and then set a threshold; once the threshold gets hit, say 50% of that worst-case timing, that job gets pulled to the front of the queue. Initially that worked all right, but then you can hit another case of work starvation: if you have a bunch of jobs that are all potentially breaching their SLO, a job with a long SLO, since it's been waiting in the queue for a longer time, will actually starve out the resources of, say, an API job that you don't want to wait as long. So how we handled that was by ranking the jobs by the percentage of their SLO that they were at. If an API job has been waiting for a minute, that gets treated the same as a Lite job that's been waiting for 10 minutes. That works out really nicely in practice and results in intuitive, fair scheduling behaviors.
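A minimal sketch of that SLO-percentage ranking; the per-tier SLO values are assumptions chosen so the API-at-one-minute versus Lite-at-ten-minutes example lines up, not Luma's real product numbers.

```python
# Sketch of SLO-based fair scheduling: rank waiting jobs by the fraction of
# their tier's SLO they have already consumed (illustrative values only).
import time
from dataclasses import dataclass

# Worst-case queue wait each tier is allowed, in seconds (assumed values).
SLO_SECONDS = {"api": 120, "enterprise": 300, "unlimited": 600, "lite": 1200}


@dataclass
class Job:
    id: str
    tier: str
    enqueued_at: float


def slo_fraction(job: Job, now: float) -> float:
    # Fraction of the tier's SLO this job has already spent waiting.
    waited = now - job.enqueued_at
    return waited / SLO_SECONDS[job.tier]


def next_job(queue: list[Job]) -> Job:
    # Pick the job closest to (or furthest past) its SLO, so an API job waiting
    # 1 minute (60/120) ranks the same as a Lite job waiting 10 minutes (600/1200).
    now = time.time()
    return max(queue, key=lambda j: slo_fraction(j, now))
```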
The last piece I want to talk about is how we actually manage all these models. This was something where I was glad we reached for Triton initially: it has this nice concept of a model repo, which we've really leaned into. Every model has a folder in object storage somewhere, with a bunch of subfolders for different versions. As you develop the code, fine-tune these models, and deploy new checkpoints, you create a bunch of these immutable versions, and then a simple YAML file in the root of the model folder lets you define which one is active. That's really nice because it's reproducible: in these versions you store the full Python environment, so all the dependencies needed to run the model, plus the checkpoints. This system is pretty nice because if you ever need to roll back, or you want to run things in all these disparate environments, you can make sure you're running the exact same Python environment and the exact same checkpoint that you were before. And very rarely are there issues that you can't solve by just rolling back.
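Here's a minimal sketch of what such a versioned repo layout and active-version file could look like; the folder names, YAML fields, and the assumption that the repo is synced or mounted locally are all hypothetical.

```python
# Sketch of a versioned model repo and its active-version pointer
# (hypothetical layout, not Luma's actual schema):
#
#   models/dream_machine/
#     active.yaml            <- e.g. {"active_version": "v0042"}
#     v0041/                 <- immutable: checkpoint + pinned Python env
#     v0042/
#       checkpoint.pt
#       environment.lock
import yaml  # PyYAML


def resolve_active_version(repo_root: str) -> str:
    # Read the root YAML file and return the path of the currently active version.
    with open(f"{repo_root}/active.yaml") as f:
        config = yaml.safe_load(f)
    return f"{repo_root}/{config['active_version']}"
```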
So that's been quite useful for us. And we've actually built an automated rollout system on top of this, too: when we update those YAML files, the workers just switch versions on the fly without restarting. That helps us roll out new model changes to the whole fleet of thousands of H100 and AMD GPUs all at once, which makes managing this much more sane than the early days of parallel SSH.
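And a sketch of what that kind of on-the-fly version switch could look like on the worker side, polling the active-version file and reloading when it changes; the polling interval, file names, and loading details are assumptions for illustration.

```python
# Sketch of a worker-side hot swap: poll the active-version file and reload
# the model when it changes (hypothetical layout and interval).
import time

import torch
import yaml


def active_version(repo_root: str) -> str:
    with open(f"{repo_root}/active.yaml") as f:
        return f"{repo_root}/{yaml.safe_load(f)['active_version']}"


def serve_with_hot_swap(repo_root: str, poll_seconds: int = 30):
    current = active_version(repo_root)
    model = torch.jit.load(f"{current}/checkpoint.pt").to("cuda").eval()
    while True:
        latest = active_version(repo_root)
        if latest != current:
            # Swap to the new immutable version without restarting the worker.
            model = torch.jit.load(f"{latest}/checkpoint.pt").to("cuda").eval()
            current = latest
        # ... job processing against `model` would be interleaved here ...
        time.sleep(poll_seconds)
```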
And yeah, if any of this seemed interesting to you or up your alley, we're hiring. We're actively looking for cracked engineers, researchers, and AI enthusiasts in general, so please hit us up. I just want to thank everybody for taking the time today, and to thank the team at Luma, because the work that everyone there does is incredible.

We've got about three minutes for questions. So let's take one or two and see where we get. Any questions?
Yeah. So the kind of cheat code here is that the chip providers we typically partner with really closely are always making sure that PyTorch, at least at a certain version, works for their chipset, so they're doing a lot of that work for us. Then typically what will happen is we'll try to run the models on whatever the new chipset is. Usually it'll work, just a bit more slowly. So we've actually got a team of about 10 people, which we call our Accel team, that is optimizing the low-level operations within PyTorch, using things like Triton, and working with the chip providers to make sure that things are actually fast. So with PyTorch you'll typically be able to run things anywhere, but a lot of the time they won't be fast, and then we just work closely with the chipset providers to optimize the model over time.
Yeah, we work really closely with Nvidia and AMD, and we are exploring some other providers, but nothing really deep yet. Amazon has its own chips, Groq has some chips. We just announced a big partnership with Humain, who works really closely with Groq, so we're exploring some of these other chipsets through some of those partnerships.
Cool. Any other questions?
Yeah.
We do not, well, I don't think so. We work with cloud providers that basically give us, at the very least, a working Kubernetes cluster, which is nice. We're not provisioning the nodes ourselves or anything, but I'd actually have to check with the individual providers whether those are VMs or actual bare-metal machines. I know that when we work with Amazon they're not, but some of the other ones might be.
Yeah.
Not yet. The way the general application works does involve video and image QA models in a sense. Essentially, the way the actual application works is that we have an agent that you're interacting with, so when you upload a video or an image, it's being captioned by VLMs to enhance the overall prompt. But no true VQA stuff quite yet, though there may or may not be some coming.