Dream Machine: Scaling to 1m users in 4 days — Keegan McCallum, Luma AI
Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: EY4O9M6AsWI
Source: https://www.youtube.com/watch?v=EY4O9M6AsWI
So, it's 9:00 a.m. on June 11th, 2024. We send out the announcement and hold our breath, waiting to see the user signups pour in. We were expecting significant traffic for the launch of Dream Machine, Luma's first video model, and we were woefully unprepared for what came next. We'd allocated about 500 H100 GPUs, which we thought was a lot at the time. It wasn't. Over the next hour we saw request after request pour in, and a giant queue of requests started to pile up. Luckily, we had a contingency plan, which was to take every GPU we had access to, from every provider, and start manually running SSH commands against them to spin up workers pulling from a global queue. We got to about 5,000 H100s over the next six hours, and our queue of almost 100,000 jobs finally started to drain around 2 p.m. that day. And that is when Amit, our CEO and self-appointed chaos monkey, decided to tweet that we had scaled up by 10x: come on in, it'll be faster now, and we'll see how it goes. Our queues were down to about 300 at this point, and you can see me here freaking out a little, because about 10 minutes after the tweet goes out, the queues start climbing again, even though we'd scaled up tenfold. Now we're at 350, then 400, and still climbing. As this is going on, we're taking over the entire training cluster, the only GPUs we hadn't used yet, another roughly 4,000 H100s, and it barely made a dent. The queues were still going up. You'll see the KEKW emote here, which has become a cultural staple at Luma over time, and it was very much the mood at the time: what am I supposed to do with this?
So, I'm here today to talk a little bit about Luma: who we are, what we do, and how we managed to scale our infrastructure up to a million users in four days, which was pretty nuts. For some context, ChatGPT hit a million users in five days. And we processed about half a million videos over the course of those 12 hours. So we're going to talk about what we learned scaling up and how we did it. But first, a very quick intro to Luma if you're not familiar with us. We're not just a video model company. We're a foundation model lab, and we're aiming to build general multimodal intelligence that can generate, understand, and operate in the physical world just like a human can. To give you some sense of what our models can do and where we're at today, this is a feature we dropped yesterday, with a demo video for it: Modify Video. All of these videos were initially taken on iPhones and uploaded to our platform with a text prompt, and you can turn that robot into anything you want. And for the AI engineers in the audience: shout out to Kron, who manages our public API; you can integrate this functionality into your applications very easily using it. You don't need to do any crazy prompt engineering; we've taken care of that for you. Just send us raw user prompts, and we'll send you back generated images and videos that meet your users' needs. So if you're interested in that, definitely hit me or Kron up. He's done a wonderful job building out our SDK and our whole DevRel side of things. That's my little plug for our API, in case anyone's interested. So, back to infrastructure. This is what our serving stack basically looked like when we launched: a bunch of tightly coupled containers working together.
One of the benefits of this is that we could launch these on raw machines with zero other dependencies, so that worked out well for us at launch, but there were some challenges scaling it up. Like any good engineer, I didn't want to reinvent the wheel initially, so we reached for Triton Inference Server, which is a classic general-purpose model serving server, but there were some issues with it. The setup was brittle: if Triton went down, the CPU processes didn't necessarily know it had gone down, so you'd keep pulling jobs and they'd fail. It was annoying. Worse, with these video models you need to run on multiple GPUs, and in a lot of cases on multiple nodes, to get to the latency you need, and Triton's just not built for that. Also, we now run our inference on multiple different chipsets. NVIDIA is the company that builds Triton, and they don't have great support for things like AMD or Groq. And finally, the biggest hurdle was that this was really difficult for the researchers to develop against. It had a whole bunch of different idioms, a whole bunch of incantations you needed to make it work well, and the overall setup just felt very janky. So what we ended up doing was rearchitecting to address these things and building our own serving stack on top of vanilla PyTorch for all the GPU work. That worked out really well, because most of the vendors building these different chipsets make sure that PyTorch is fully supported. It's this great substrate to build on top of: if you stick to very vanilla PyTorch, you can typically make your model run anywhere. You may need to optimize certain operations depending on your model to make things fast, but it's relatively easy to get started.
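The pull-based, decoupled pattern described here can be sketched in a few lines. This is a minimal illustration under assumptions, not Luma's actual code: an in-memory `queue.Queue` stands in for the global Redis-backed queue, and `run_model` is a placeholder for the PyTorch inference call.

```python
import queue
import threading

# In-memory stand-in for the global work queue (Redis in the real system).
GLOBAL_QUEUE = queue.Queue()

def run_model(job):
    # Placeholder for the vanilla-PyTorch inference call.
    return f"result-for-{job}"

def gpu_worker(results, stop):
    # Each worker pulls jobs whenever it is free; no central dispatcher
    # needs to know which machines exist or where they run.
    while not stop.is_set():
        try:
            job = GLOBAL_QUEUE.get(timeout=0.1)
        except queue.Empty:
            continue
        results.append(run_model(job))
        GLOBAL_QUEUE.task_done()

results, stop = [], threading.Event()
workers = [threading.Thread(target=gpu_worker, args=(results, stop)) for _ in range(4)]
for w in workers:
    w.start()
for j in range(10):
    GLOBAL_QUEUE.put(j)
GLOBAL_QUEUE.join()   # wait until every job has been processed
stop.set()
for w in workers:
    w.join()
print(len(results))   # all 10 jobs processed
```

Because workers pull rather than being pushed to, adding capacity is just starting more worker processes anywhere that can reach the queue.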
In terms of the decoupled architecture you see here, having the CPU workers decoupled is quite useful and important, because you can use them to queue up work. When you're dealing with videos, images, and multimedia inputs on top of just text, you want those to already be in the cluster, ready for the GPUs to pull, so you're not blocking the GPUs at all. Also, with this architecture you can actually run the GPUs anywhere, as long as you can connect to Redis and our distributed storage; you can see SeaweedFS there. You can have a GPU at any random provider, on any random VM, use Tailscale to connect to the rest of this architecture, and then scale up without doing all that crazy parallel-SSH stuff. So if you want to run compute on your training cluster, you don't need to provision special machines or anything; you just run a command. There were a few more challenges we hit after we got through the initial hurdles with this infrastructure. One of the big ones was back pressure. Because this is decoupled, and because there are multiple clusters pulling work from the same global queue, you can get into a situation where too many CPU workers pull work into one cluster, and jobs sit there waiting to be processed by the GPUs on that cluster when they could have been processed somewhere else. So we came up with a dispatch limitation system: if a job has been pulled into a cluster and is waiting to be picked up by a GPU, you can put a limit on how many jobs are allowed in that state, so you avoid this issue. I'll talk more in depth about priorities and our fair-scheduling SLOs; that was another one.
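The dispatch limitation described here amounts to a per-cluster cap on jobs that have been pulled in but not yet picked up by a GPU. The sketch below is an assumption-laden illustration: the class name, the cap value, and the plain-list queue are all invented for demonstration.

```python
class ClusterDispatcher:
    """Per-cluster back-pressure: cap jobs in the 'dispatched, awaiting GPU' state."""

    def __init__(self, max_pending):
        self.max_pending = max_pending  # max jobs allowed to wait on this cluster
        self.pending = 0

    def try_pull(self, global_queue):
        # A CPU worker only pulls from the global queue if this cluster has
        # headroom; otherwise the job stays global so another cluster can take it.
        if self.pending >= self.max_pending or not global_queue:
            return None
        self.pending += 1
        return global_queue.pop(0)

    def gpu_picked_up(self):
        # A GPU started the job, freeing a pending slot on this cluster.
        self.pending -= 1

jobs = list(range(6))
fast = ClusterDispatcher(max_pending=2)
slow = ClusterDispatcher(max_pending=2)
pulled_fast = [fast.try_pull(jobs) for _ in range(4)]
print(pulled_fast)  # only two pulls succeed; the rest stay in the global queue
```

The effect is that a cluster whose GPUs are busy stops draining the global queue, leaving work available for clusters with idle capacity.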
We have multiple tiers of users, and deciding whose jobs get processed first is a constant optimization challenge. There's also handling different models. These video models are big; they're typically made up of 10 to 20 different sub-models. So you're pulling in a lot of weights and spending a lot of time compiling these things, which means traditional autoscaling is super wasteful: you're burning 10 to 20 minutes of GPU time just warming things up. And finally, handling bursts. We basically built this system so that we could scale up automatically onto our training cluster, which our researchers hate me for. It makes them very upset, but it allows us to keep up with demand from our users. That decoupled architecture, plus a very simple scheduler that runs on top of Slurm and can just run these PyTorch workers, handles increasing the total pool of GPUs when we need it and scaling down when we don't. This system is pull-based, with queues that submit to queues that pull from queues, and, as Sorish and Vas here from Luma can both attest, managing these queues is everybody's least favorite part of working at Luma. When we launched Ray2, a much bigger, more resource-intensive model, we began to have to deal with the fact that we couldn't just keep scaling up more and more compute; it just wasn't economical or feasible. So we had limited resources, and when you have a pull-based scheduler with a bunch of queues like this, there's a concept of work starvation that comes into play. Essentially, we've got API-tier jobs, which we try to process very quickly, and then Enterprise, Unlimited, Plus, Lite, and Free.
So we've got all these different tiers with different priorities for who gets to go first. If you just naively process them in priority order, there may be enough Enterprise or API jobs that the Lite jobs never get processed. We were having people wait seven, eight, nine hours, and they were very unhappy. Also, shout out to Sorish: after we designed this system in a couple of hours, he implemented it on a weekend. It's an SLO-based system, which I'll go into, that allows us to more fairly schedule work across the limited resources we have. Here's how the system works. When you think about work starvation, one of the typical approaches to manage it is aging: the longer something waits in the queue, the higher its priority goes. But what's the actual function that controls that aging mechanism? We had the insight that this is a product problem, and it really comes down to: how long, worst case, are you okay with different tiers waiting? So we have service level objectives that the product team defines, and they control this aging behavior. For example, we may not want an API job to wait in a queue more than a couple of minutes, but for a Lite job, maybe we're okay with it waiting 10 minutes. You configure these and then set a threshold; once the threshold gets hit, say 50% of that worst-case time, the job gets pulled to the front of the queue. Initially that worked all right, but then you can hit another case of work starvation: if you have a bunch of jobs that are all potentially breaching their SLOs, a job with a long SLO, since it's been waiting in the queue longer, can actually starve out the resources of, say, an API job that you don't want to wait as long.
So we handled that by ranking jobs by the percentage of their SLO they've used up. An API job that's been waiting for one minute gets treated the same as a Lite job that's been waiting for 10 minutes. That works out really nicely in practice and results in intuitive, fair scheduling behavior. The last piece I want to talk about is how we actually manage all these models. This is something where I was glad we reached for Triton initially: it has this nice concept of a model repo that we've really leaned into. Every model has a folder in object storage with a bunch of subfolders for different versions. As you develop the code, fine-tune the models, and deploy new checkpoints, you create a bunch of these immutable versions, and then a simple YAML file in the root of the model folder defines which one is active. That's really nice because it's reproducible: in each version you store the full Python environment, so all the dependencies needed to run the model, plus the checkpoints. If you ever need to roll back, or you want to run things in all these disparate environments, you can make sure you're running the exact same Python environment and the exact same checkpoint you were before. Very rarely are there issues you can't solve by just rolling back, so that's been quite useful for us. We've also built an automated rollout system on top of this: when we update those YAML files, the workers just switch versions on the fly without restarting, which helps us roll out new model changes to the whole fleet of thousands of H100 and AMD GPUs all at once.
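The rank-by-SLO-fraction idea can be sketched directly: order queued jobs by elapsed wait divided by the tier's SLO. The tier names and SLO values below are illustrative assumptions, not Luma's actual configuration.

```python
# Per-tier worst-case wait targets in seconds (illustrative values).
SLO = {"api": 120, "enterprise": 300, "plus": 480, "lite": 600}

def slo_fraction(job, now):
    # An API job waiting 60s (0.5 of its 120s SLO) ranks the same as a
    # Lite job waiting 300s (0.5 of its 600s SLO).
    return (now - job["enqueued_at"]) / SLO[job["tier"]]

def next_job(queued, now):
    # The job that has consumed the largest fraction of its SLO goes first,
    # regardless of raw tier priority.
    return max(queued, key=lambda j: slo_fraction(j, now))

jobs = [
    {"id": "a", "tier": "api",  "enqueued_at": 0},
    {"id": "b", "tier": "lite", "enqueued_at": -300},  # already waited 300s at t=0
]
print(next_job(jobs, now=60)["id"])   # → b (0.6 of SLO vs. 0.5 for the API job)
```

Note how the ranking flips as time passes: by `now=120` the API job has used 100% of its SLO versus 70% for the Lite job, so it jumps ahead, which is exactly the intuitive behavior described in the talk.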
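The model-repo layout described here (immutable version subfolders plus a small file naming the active one) can be sketched as follows. The file name `active.yaml`, its one-line format, and the helper below are assumptions for illustration; the real repo lives in object storage rather than on local disk.

```python
from pathlib import Path
import tempfile

def read_active_version(model_root: Path) -> Path:
    # Parse a minimal "active: v2" line without needing a YAML library.
    text = (model_root / "active.yaml").read_text()
    version = text.split("active:", 1)[1].strip()
    return model_root / version

# Build a toy repo: one model folder with two immutable version dirs.
root = Path(tempfile.mkdtemp()) / "ray2"
for v in ("v1", "v2"):
    (root / v).mkdir(parents=True)   # would hold env spec + checkpoints
(root / "active.yaml").write_text("active: v2\n")

print(read_active_version(root).name)  # → v2
```

Rolling back, or rolling out a new checkpoint fleet-wide, is then just rewriting `active.yaml`; workers watching that file can switch versions without restarting.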
That makes managing all of this much more sane than the early days of parallel SSH. And if any of this seemed interesting or up your alley, we're hiring; we're actively looking for cracked engineers, researchers, and AI enthusiasts in general, so please hit us up. I just want to thank everybody for taking the time today, and thank the team at Luma, because the work everyone there does is incredible. We've got about three minutes for questions, so let's take one or two and see where we get. Yeah. So, the cheat code here is that the chip providers, who we typically partner with really closely, are always making sure that PyTorch, at least at a certain version, works on their chipset, so they're doing a lot of that work for us. Then typically we'll try to run the models on whatever the new chipset is. Usually it'll work, just a bit more slowly. So we've got a team of about 10 engineers we call our Excel team, who optimize the low-level operations within PyTorch using things like Triton and work with the chip providers to make sure things are actually fast. With PyTorch you'll typically be able to run things anywhere, but a lot of the time they won't be fast, and then we work closely with the chipset providers to optimize the model over time. Yeah. We work really closely with NVIDIA and AMD, and we're exploring some other providers, but nothing really deep yet: Amazon's got their own chips, Groq has some chips. We just announced a big partnership with Humain, who works really closely with Groq, so we're exploring some of these other chipsets through those partnerships. Cool, any other questions? Yeah. We do not, well, I don't think so.
But yeah, we work with cloud providers that basically give us, at the very least, a working Kubernetes cluster, which is nice. We're not provisioning the nodes ourselves or anything, but I'd actually have to check with the individual providers whether those are VMs or actual bare-metal machines. I know when we work with Amazon they're not, but some of the others might be. Yeah, not yet. The way the general application works does involve video and image QA models: essentially, the actual application has an agent that you're interacting with, so when you upload a video or an image, it's captioned by VLMs to enhance the overall prompt. But no true VQA stuff quite yet, though there may or may not be some coming.