Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA
Channel: aiDotEngineer
Published at: 2025-08-01
YouTube video id: Y2qc0UhDSnc
Source: https://www.youtube.com/watch?v=Y2qc0UhDSnc
Hey there everyone. I'm Kyle Kranen, and today I'll be talking about how to bend the inference Pareto frontier to your advantage. Really, the thing that enables success is pairing a good model with a good system that takes into account the actual constraints of your deployment; that's key to the success of both the deployment and the application backed by it. So who am I, and why am I talking about this? As I said, my name is Kyle Kranen, and I work at NVIDIA. Previously at NVIDIA, I was leading and GMing the largest inference deployment at the company, with a quarterly cloud bill in the multiple tens of millions of dollars. Now I'm an architect and lead for a project we just released in open source called NVIDIA Dynamo, which aims to enable data-center-scale inference: manipulating your deployment, and with it the Pareto frontier, to achieve better SLAs, or lower costs for your existing SLAs, with techniques like disaggregation and others I'll cover later in the talk. The Dynamo meetup is linked right here; you can learn more about Dynamo there, and I'll show the link again at the end of the talk. The three things I like to think about when deciding whether something can actually be deployed and used are really simple. Quality: whether your application and the system around your model can complete tasks with some required level of accuracy or quality. Latency: whether the task can be completed in a fast enough envelope, either for the user to be happy or to meet safety guarantees, as in robotics. And cost: can the LLM complete the task cheaply enough per request for you to meet whatever margin requirements you have for your application? One of the ways we generally compare these three is through a Pareto frontier. The frontier I'm showing here is two-dimensional; it's really hard to plot things in 3D on a 2D slide, so I'm only showing two dimensions. What this looks like is an edge representing the best, upper-rightmost points you can achieve for a specific pair of attributes. In this case we have TPS per GPU, which is effectively a cost metric (how many tokens can you serve per GPU per second), and user TPS, which is a responsiveness metric. So this is latency versus cost. And for a given application, you really only want one point on the Pareto frontier: what is your operating latency, what is the operating quality you need, and how can you minimize cost for that point? This depends heavily on the application, so one of the most important things to think about when you're trying to break the Pareto frontier is the application itself. For example, take personalized cancer cures, a topic that comes up a lot in the context of generative AI. In that situation, latency and cost are pretty much no object: you could spend millions of dollars proving out a single cure, and if it works, the return on investment is so high that it doesn't really matter.
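To make the frontier concrete, here's a minimal sketch (not from the talk) of how you might compute the Pareto-optimal set from a benchmark sweep and then pick an operating point; the Config fields, the example numbers, and the SLA threshold are hypothetical stand-ins for the TPS-per-GPU and user-TPS axes on the slide.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str           # e.g. "TP4, FP8, batch 64" (illustrative)
    tps_per_gpu: float  # throughput: tokens/sec/GPU (cost proxy)
    user_tps: float     # responsiveness: tokens/sec/user (latency proxy)

def pareto_frontier(points: list[Config]) -> list[Config]:
    """Keep only configs not dominated on both axes (up and to the right)."""
    frontier = []
    for p in points:
        dominated = any(
            q.tps_per_gpu >= p.tps_per_gpu
            and q.user_tps >= p.user_tps
            and (q.tps_per_gpu > p.tps_per_gpu or q.user_tps > p.user_tps)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda c: c.user_tps)

def pick_operating_point(frontier: list[Config], min_user_tps: float) -> Config:
    """Cheapest (highest TPS/GPU) config that still meets the latency SLA."""
    feasible = [c for c in frontier if c.user_tps >= min_user_tps]
    return max(feasible, key=lambda c: c.tps_per_gpu)

sweep = [
    Config("agg TP2", tps_per_gpu=1200, user_tps=30),
    Config("disagg 2P/6D", tps_per_gpu=2400, user_tps=30),
    Config("agg TP8", tps_per_gpu=400, user_tps=120),
]
print(pick_operating_point(pareto_frontier(sweep), min_user_tps=25).name)
```

This is the whole game in miniature: sweep configurations, find the frontier, then choose the single point that minimizes cost at your application's required latency and quality.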
To take a different example, tab completion, like what you see in popular IDEs such as Cursor, is very dependent on snappiness. The user expects that when they press tab, they'll see a recommendation for the next line or the next set of tokens very quickly. And to take another code example, asynchronous coding agents, things like Cursor's agent mode and other applications where the agent works alongside the user rather than with the user in the loop, there's not as much concern for latency, but there is a concern for both quality and cost. So it really comes down to what the user expects from the application: do they expect it to be fast, do they expect it to be slow, are they in the loop with it? Now, there's a set of fairly well-known techniques that all support manipulating the Pareto frontier. For example, quantization improves your latency, and it also decreases your cost because you can run higher batch sizes. Retrieval-augmented generation generally slows your application down, making it higher latency, and increases cost, but it also increases quality. Reasoning is similar: you produce more tokens to think. And changing the model config lets you do any of these: if you change how the model is parallelized, you can significantly change its speed and cost characteristics, and in theory quality as well, for example with context parallelism over long contexts. The thing I want to impart before we jump into the more advanced techniques is that these techniques compound. For example, if you have an initial application with some required performance, you can stack retrieval-augmented generation on it to increase quality at the expense of latency, and then stack quantization of the model on top of that to win the latency back. The point I'm trying to make is that you have a toolbox with a large set of tools you can use together. The tools are not independent, and they can be combined in sometimes non-obvious ways to break your Pareto frontier, or squeeze it in different directions, to support your application. Beyond those techniques, there are three things that I tend to think drive how you can modify the Pareto frontier going forward: scale, structure, and dynamism. One of the things that's really relevant in the realm of scale is disaggregation. For those who aren't aware, KV caching is a technique where you take the key and value vectors associated with each token and cache them, so that during autoregressive generation you don't have to recompute the key and value vectors for the entire sequence up to that point; you just compute the new ones and append them to the KV cache. What this actually means is that we effectively have two phases of generation: one in which you run prefill, filling up your KV cache, and one in which you decode, generating new KV cache entries along with new tokens and producing output.
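To show why those two phases exist, here's a minimal single-head attention sketch with a KV cache; the random vectors stand in for learned projections, and the shapes are illustrative, not any particular model's.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query token over the cache."""
    scores = K @ q / np.sqrt(q.shape[-1])  # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # (d,)

class KVCache:
    """Grows by one (k, v) pair per token instead of recomputing
    K and V for the whole sequence at every decode step."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

d = 64
cache = KVCache(d)

# Prefill phase: compute K/V for the entire prompt at once (compute-bound).
for k, v in zip(np.random.randn(128, d), np.random.randn(128, d)):
    cache.append(k, v)

# Decode phase: each step appends one K/V pair and attends over the whole
# cache (memory-bandwidth-bound: the full cache is read every step).
for _ in range(8):
    q, k, v = (np.random.randn(d) for _ in range(3))
    cache.append(k, v)
    out = attend(q, cache.K, cache.V)
```

The asymmetry in the two loops, one big parallel pass versus many small cache-reading steps, is exactly what disaggregation exploits.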
Now, disaggregation as a technique takes these two phases, which typically run on the same set of GPUs, and splits them across different workers and sets of GPUs, and this provides a couple of key benefits. The really big one is that you can now take two phases with very different needs (prefill is very compute-bound, while decode, depending on the application and model, can be very memory-bound) and do granular load matching between them. To use DeepSeek as an example, compute saturates relatively early, so you might use relatively few GPUs for your prefill instances, running a lower batch size, while running a much larger batch size on many more GPUs for your decode instances. That split, and the heterogeneity between the two, lets you extract far more performance. The other benefit concerns scheduling: with in-flight batching, you have many tokens arriving at the same time in different phases of generation, and if a request doing prefill and a request doing decode land on the same machine, you get scheduling conflicts; the scheduler has to decide how to handle the new tokens. There are techniques to manage this, like in-flight batching and chunked prefill with piggybacking, but there's generally a cost to that mutual scheduling, so splitting the phases makes scheduling simpler. Now, there's an asterisk to this, but first let me go over the performance numbers quickly, using Llama 70B as an example. On the y-axis we have tokens per second per GPU; on the x-axis, tokens per second per user. Up and to the right is better. If we choose one operating point in latency, then by disaggregating on the same number of GPUs, 16 H100s total, we can achieve up to two times the tokens per second per GPU at that fixed latency, which means you're now paying two times less for your application. There are some constraints, though. The use case really does dictate how much disaggregation helps. For example, low-input-length use cases see little to no speedup, because you don't actually have as much of a scheduling problem: they're very prefill-light, and you're basically just doing decode the entire time. And per the graph, disaggregation is usually most useful in the middle. In very high-latency, high-throughput scenarios (the top left of the graph) and very low-latency, low-throughput scenarios (the bottom right), aggregated serving tends to reconverge with disaggregated and can produce a bit more performance. That said, for a lot of interactive applications, disaggregation makes the most sense, because users tend to care about speeds in the realm of 20 to 200 tokens per second. The other caveat is that configuration really matters: since you're separating the two phases into prefill and decode, the balance between the number of prefill workers and decode workers dictates performance.
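As a rough illustration of the split (not Dynamo's actual API), here's a toy async frontend; PrefillWorker, DecodeWorker, and the KV handoff are placeholder stubs, and a real system would transfer KV blocks over NVLink/RDMA rather than returning them from a coroutine.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class PrefillWorker:
    async def prefill(self, req: Request):
        # Placeholder: in reality this fills HBM with KV blocks.
        return f"<kv for {len(req.prompt)} prompt chars>"

class DecodeWorker:
    async def decode(self, req: Request, kv_cache) -> str:
        return "generated text"  # placeholder autoregressive loop

class DisaggregatedFrontend:
    """Routes each request through a compute-bound prefill pool,
    then hands its KV cache to a memory-bound decode pool."""
    def __init__(self, n_prefill=2, n_decode=6):
        # Few prefill workers, many decode workers: granular load
        # matching between phases with very different bottlenecks.
        self.prefill_pool = [PrefillWorker() for _ in range(n_prefill)]
        self.decode_pool = [DecodeWorker() for _ in range(n_decode)]
        self._rr = 0

    async def handle(self, req: Request) -> str:
        p = self.prefill_pool[self._rr % len(self.prefill_pool)]
        d = self.decode_pool[self._rr % len(self.decode_pool)]
        self._rr += 1
        kv = await p.prefill(req)       # phase 1: fill the KV cache
        return await d.decode(req, kv)  # phase 2: autoregressive decode

print(asyncio.run(DisaggregatedFrontend().handle(Request("hi", 32))))
```

The 2:6 pool ratio here is arbitrary; as the talk notes next, getting that ratio right is exactly the hard part.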
So, for example, if you have too many decode workers, they're going to be starved for work. And if you have too many prefill workers, they'll generate work faster than the decode workers can absorb it, and the decode workers will get pushed down by ever-increasing load and queue depth. The other thing is that modifying this balance is expensive and hard, because the right ratio of prefill to decode depends on the parallel configs of each, so it's a really wide configuration space. One other thing we talk about with respect to scale is routing. We've talked about how this KV is important. One of the things we have to do for prefill-decode disaggregation is actually transfer the KV between machines, and in some sense there's an affinity for certain machines to do certain work, since the KV cache of previous requests is stored on those GPUs, or offloaded to host memory or external storage, during the course of inference. In the naive case, you route pretty much randomly: you're not biasing toward anything, you're just sampling. Alternatively, with a purely KV-based router that optimizes only for KV match, you may end up biasing toward machines with so much KV reload pressure that they can't actually handle the request, and you end up with queuing. In the smart case, you want to minimize a cost function that combines both: maximize the prefix match you can get from work already done on that node, while accounting for the load that already exists on it. And as you scale out, as you get more and more GPUs in a deployment, you end up with more and more of the KV space represented locally on those machines. Because of that, a larger and larger deployment gets an asymptotically increasing KV cache hit rate, which means you're doing less and less prefill work over time. So routing, to give it a report card, improves your speed and cost, and has no real effect on quality, because it's doing the same work it would normally do. Now we come to structure. Structure is really important because a lot of the workloads you've probably seen at the AI Engineer World's Fair, agents for example, impose structure on the workload: they have moderately predictable usage patterns across concurrent requests. An example here is inference-time scaling. This is a cool graph I'll go over really quickly. We have three models here: in green an 8B model, in yellow a 49B model, and in red a 235B model. We find that with inference-time scaling, that is, re-querying the model and prompting it to reconsider its results or reason more about them, we can produce better and better results.
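Here's a hedged sketch of that smart-routing cost function; the block-hash prefix matching, the load metric, and the alpha trade-off are illustrative choices, not any specific router's implementation.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    cached_blocks: set[str]  # hashes of KV blocks resident on this worker
    active_load: int         # e.g. tokens currently scheduled

def prefix_match(worker: Worker, request_blocks: list[str]) -> int:
    """Length of the request's longest cached prefix, in blocks."""
    n = 0
    for block in request_blocks:
        if block not in worker.cached_blocks:
            break
        n += 1
    return n

def route(workers: list[Worker], request_blocks: list[str],
          alpha: float = 0.01) -> Worker:
    """Balance prefix reuse against existing load.
    alpha=0 degenerates to the pure-KV router (piles work onto hot
    workers); ignoring the match term degenerates to load-only/random."""
    def cost(w: Worker) -> float:
        return alpha * w.active_load - prefix_match(w, request_blocks)
    return min(workers, key=cost)

workers = [
    Worker("w1", {"b0", "b1", "b2"}, active_load=900),  # best match, hot
    Worker("w2", {"b0"}, active_load=100),              # partial match
    Worker("w3", set(), active_load=10),                # idle, no match
]
print(route(workers, ["b0", "b1", "b2", "b3"]).name)  # -> w2
```

Note how w1, despite the best prefix match, loses to w2 once load is priced in; that's the queuing failure mode of a pure-KV router avoided.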
And you see this really interesting trend: with about three or four rounds of re-querying, the 8B model is basically on par in quality with the 49B model, and the 49B is almost on par in quality with the 235B model. And note that the cost of querying the 8B model, even querying it multiple times, is actually lower than querying the larger model. In this sense, inference-time scaling can be thought of as increasing quality at the cost of speed, and at the cost of cost, because you're re-querying. But alternatively, if you hold quality fixed, you can get lower latency and lower cost by using a smaller model and re-querying it multiple times. And the structure we infer from that re-querying lets us do better scheduling. In this graph we have a series of curves representing the runtime of a given reasoning example from the Natural Plan dataset against concurrency, that is, how many concurrent instances you can run at once, sampled across the concurrency range. We see that implementing disaggregation gives a small benefit here, mostly because this dataset is short-ISL, long-OSL (short input sequence length, long output sequence length), so you don't get a whole lot of benefit from disaggregation in this case. But removing a round trip, by making the re-queries come from the router instead of from the user or client on the outside, really does decrease latency. And on top of that, making the router and the LLM scheduler aware that you're doing repeat work, that you're re-querying, gives a further benefit: that's the gap from the green line to the red line. Which is to say: across a wide variety of models, if we assume quality is fixed, we can use inference-time scaling and some smart techniques to significantly decrease latency and increase throughput while maintaining the same quality. One last thing, and I'm going to go through this really quickly because I'm getting low on time: manipulating the K and V values themselves. We've talked about how there's work done in prefill that we don't want to lose; we do routing to make sure we don't lose that KV. Now suppose we have a workflow where we know the runtimes of things. For example, if we make a tool call, and we know the tool call takes a moderately deterministic amount of time, we end up with KV eviction: if the tool call takes 30 seconds, the KV is going to be swept out of HBM, and you won't be able to use it in the future because it's no longer cached. But if we know it's going to be used again, why not just offload it? Basically, inference-time scaling gives you structure to manipulate your KV, and tool calling gives you structure to manipulate your KV. So instead of doing another prefill the second time, you might, for example, do prefill once, do the LLM call and decode once, move the KV to host memory, and then, at the time you expect the tool to complete, move it right back into GPU memory so it's ready for the next LLM call, which will include the added context from the tool.
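A sketch of that offload-and-prefetch dance, under stated assumptions: the gpu_cache/host_cache dicts and KVBlock are hypothetical handles, and a real runtime would move KV blocks between HBM and pinned host memory, not Python objects. The tool delay is shortened from the talk's 30-second example so the demo runs quickly.

```python
import asyncio

class KVBlock: ...          # placeholder for a session's KV pages

gpu_cache: dict[str, KVBlock] = {}   # scarce HBM
host_cache: dict[str, KVBlock] = {}  # plentiful system memory

async def call_tool(args) -> str:
    await asyncio.sleep(3)  # tool with a roughly known runtime
    return "tool result"

async def agent_step(session: str, expected_tool_secs: float = 3.0) -> str:
    # 1. Prefill + decode: this session's KV now lives in HBM.
    gpu_cache[session] = KVBlock()

    # 2. The tool will take a known-ish time; rather than letting the
    #    scheduler evict the KV, proactively offload it to host memory.
    host_cache[session] = gpu_cache.pop(session)
    tool_task = asyncio.create_task(call_tool(None))

    # 3. Shortly before the tool is expected to finish, prefetch the KV
    #    back into HBM so the next LLM call skips a full re-prefill.
    await asyncio.sleep(max(expected_tool_secs - 0.5, 0))
    gpu_cache[session] = host_cache.pop(session)

    # 4. The next decode appends the tool result to the restored context.
    return await tool_task

print(asyncio.run(agent_step("sess-1")))
```

The key design point is that the workload's structure (a tool call with a predictable duration) tells you when the KV will be needed again, turning eviction from something that happens to you into something you schedule.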
So KV manipulation, again, increases your speed and decreases your cost while keeping quality constant, not improving it, since the model's work is unchanged. The last thing we have to talk about is dynamism. Worker specialization is really important. As I said, disaggregation behaves differently at different input sequence lengths and output sequence lengths, so you actually want a mix of aggregated and disaggregated workers based on where you are in the ISL/OSL histogram. At lower input sequence lengths and higher output sequence lengths, you might want aggregated workers with higher tensor parallelism. In the middle of the input-sequence range, you may want disaggregated workers. And in the long-context regime, you may want disaggregated workers with context parallelism. Again, this differs model to model, and this is just an illustrative graph, but generally, if you specialize workers, you can increase your speed and decrease your cost while keeping quality the same, because again, you're not touching what the model is executing; you're not touching the math it's doing. One last thing about dynamism: load balancing is quite important. As I mentioned earlier, watching the ratio of prefill to decode workers is key to determining whether your disaggregated deployment will be successful. Say you initially build your configuration off a histogram, for example apps A and B with their particular input and output sequence lengths. You can end up in a scenario where a change in the user distribution causes significant issues with your deployment: if input sequence lengths grow a bit, you create more demand for prefill workers than for decode workers. So your balance will change over time, and this has been shown empirically by a wide variety of people who publish data. You actually have to autoscale across these two instance types in real time to account for changes in the usage distribution of your platform. So dynamic load balancing increases your speed and keeps your cost low, but mostly it's just essential to ensuring that disaggregation actually works to its maximum potential. Okay, last things. Here's the Dynamo repo, on github.com. We also have a Dynamo meetup being hosted tomorrow, Thursday, from 5 to 8:00 p.m. here in San Francisco. Please come; we'll be talking a lot more about how we actually implement these things at the event.