Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA

Channel: aiDotEngineer

Published at: 2025-08-01

YouTube video id: Y2qc0UhDSnc

Source: https://www.youtube.com/watch?v=Y2qc0UhDSnc

Hey there, everyone. I'm Kyle Kranen, and today I'll be talking about how to bend the inference Pareto frontier to your advantage. Really, the thing that enables success is that a good model, plus a good system that takes into account the actual constraints of your deployment, is key to the success of both the deployment and the application backed by it. So who am I, and why am I talking about this? As I said, my name is Kyle Kranen, and I work at NVIDIA. Previously at NVIDIA, I was leading and GMing the largest inference deployment at the company, with a cloud bill of multiple tens of millions of dollars per quarter. Now I'm an architect and lead for a project we just released in open source called NVIDIA Dynamo, which aims to enable data-center-scale inference: manipulating your deployment, and manipulating the Pareto frontier, to achieve better SLAs or lower costs for your existing SLAs, with techniques like disaggregation and others I'll cover later in the talk. The Dynamo meetup is linked right here; you can learn more about Dynamo there, and I'll show the link again at the end of the talk.

So the three things I like to think about when deciding whether something can actually be deployed and used are really simple. Quality: whether your application, and the system around your model, can complete tasks with some required level of accuracy or quality. Latency: whether the task can be completed in a fast enough envelope, either for the user to be happy or to meet safety guarantees, as in robotics. And cost: can the LLM complete the task cheaply enough per request for you to meet whatever margin requirements your application has?
One of the ways we generally compare these three is through a Pareto frontier. The frontier I'm showing here is two-dimensional; it's really hard to plot things in 3D on a 2D slide, so I'm going to show just two dimensions. What this looks like is an edge that represents the best, the topmost-and-rightmost, points we can achieve for a specific pair of attributes. In this case we have TPS per GPU, which is effectively a cost metric: how many tokens each GPU can serve per second. And we have user TPS, which is a responsiveness metric. So this is latency versus cost. For a given application, you really only want one point on the Pareto frontier: what is your operating latency, what is the operating quality you need, and how can you minimize cost for that point?
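To make choosing a single operating point concrete, here's a minimal sketch in Python: filter measured frontier points by a latency SLA, then maximize throughput per GPU. The config names and numbers are made up for illustration, not real benchmark data.

```python
# Hypothetical measured points on a latency/cost Pareto frontier.
configs = [
    # (name, user_tps, tps_per_gpu) -- responsiveness vs. cost efficiency
    ("tp8-agg",     15, 450),
    ("tp4-agg",     30, 380),
    ("disagg-2p6d", 45, 300),
    ("tp1-agg",     80, 120),
]

def pick_operating_point(configs, min_user_tps):
    """Among configs meeting the latency SLA, maximize throughput per GPU."""
    feasible = [c for c in configs if c[1] >= min_user_tps]
    if not feasible:
        raise ValueError("no config meets the SLA; relax latency or stack more techniques")
    return max(feasible, key=lambda c: c[2])

print(pick_operating_point(configs, min_user_tps=40))  # -> ('disagg-2p6d', 45, 300)
```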
Now, this really depends on the application. One of the most important things you do when you're thinking about breaking the Pareto frontier is to think about your application. For example, take personalized cancer cures, a topic that's talked about a lot in the context of generative AI. In that situation, latency and cost are pretty much no object: you could spend millions of dollars proving out a single cure, and if it works, the return on investment is so high that the cost doesn't really matter. To take a different example, tab completion, like what you see in popular IDEs such as Cursor, is very, very dependent on snappiness. The user expects that when they press tab, they'll see a recommendation for the next line or the next set of tokens very quickly. And to take another code example, async code work, things like Cursor's agent mode and other applications where the agent works alongside the user, doesn't have as much of a latency consideration, but there is a concern for both quality and cost. How this breaks down depends on what the user expects from the application: do they expect it to be fast? Do they expect it to be slow? Are they in the loop with this application?
Now, there's a set of pretty commonly known techniques that all support manipulating the frontier. For example, quantization speeds up your latency and also decreases your cost, because you can run higher batch sizes. Retrieval-augmented generation generally slows down your application, making latency and cost higher, but it also increases quality. Reasoning is similar: you produce more tokens to think. And changing the model config lets you move along any of these axes: if you change how the model is parallelized, you can significantly change its speed and cost characteristics, and theoretically quality as well, if you're using approximate forms of context parallelism. The thing I want to impart on you before we jump into the more advanced techniques is that these techniques can be compounded. For example, if you have an initial application with some required performance, you can stack retrieval-augmented generation on it to increase quality at the expense of latency, and you can also stack quantization on top of that to speed the latency back up. The point I'm trying to make here is that you really have a toolbox, a large set of tools you can use together. The tools are not independent, and they can be combined in sometimes non-obvious ways to break your Pareto frontier or squeeze it in different directions to support your application.
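To illustrate the compounding, here's a toy sketch that models each technique as a multiplicative effect on the three axes. The multipliers are invented for illustration, not measurements of any real system.

```python
# Hypothetical baseline profile and per-technique multipliers.
BASE = {"quality": 0.70, "latency_s": 1.0, "cost": 1.0}

TECHNIQUES = {
    "rag":          {"quality": 1.15, "latency_s": 1.4, "cost": 1.3},
    "quantization": {"quality": 0.99, "latency_s": 0.6, "cost": 0.5},
}

def apply(profile, *names):
    """Stack techniques by multiplying their (hypothetical) effects."""
    out = dict(profile)
    for name in names:
        for axis, factor in TECHNIQUES[name].items():
            out[axis] *= factor
    return out

# RAG raises quality but hurts latency and cost; quantization claws both back.
print(apply(BASE, "rag", "quantization"))
```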
So there are three things, beyond those techniques, that I tend to think drive how you can modify the Pareto frontier going forward. Those three are scale, structure, and dynamism.
One of the things that's really relevant in the realm of scale is disaggregation. For those who aren't aware, KV caching is a technique where you take the key and value vectors associated with each token and cache them, so that during autoregressive generation you don't have to regenerate the key and value vectors for the entire sequence up to that point; you only compute the new ones and append them to the KV cache. What this actually means is that we effectively have two phases of generation: one in which you're generating the prefill, filling up your KV cache, and one in which you're generating new KV cache entries along with new tokens and producing output.
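Here's a toy sketch of those two phases, assuming a single attention head and ignoring everything else a real engine does (batching, layers, the attention computation itself):

```python
# Toy sketch of prefill vs. decode with a KV cache; shapes are simplified.
import numpy as np

d = 64                                       # per-head hidden size
Wk, Wv = np.random.randn(d, d), np.random.randn(d, d)

def prefill(prompt_embeds):
    """Phase 1: one pass over the whole prompt fills the KV cache."""
    K = prompt_embeds @ Wk                   # (prompt_len, d)
    V = prompt_embeds @ Wv
    return K, V

def decode_step(K, V, new_token_embed):
    """Phase 2: compute K/V for ONE new token and append it,
    instead of recomputing K/V for the whole sequence."""
    K = np.vstack([K, new_token_embed @ Wk])
    V = np.vstack([V, new_token_embed @ Wv])
    return K, V

K, V = prefill(np.random.randn(128, d))      # compute-bound: one big matmul
for _ in range(32):                          # memory-bound: re-reads all of K/V
    K, V = decode_step(K, V, np.random.randn(d))
```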
Now, disaggregation as a technique takes these two phases, which were typically run on the same set of GPUs, and splits them across multiple different workers and sets of GPUs, and this provides a couple of key benefits that we'll go into right now. The big one is that you can now take two phases with very different needs (prefill is very compute-bound, while decode, depending on the application and model, can be very memory-bound) and do granular load matching between them. What this means is that compute saturates relatively early; to use DeepSeek as an example, you may use relatively few GPUs for your prefill instances and have them handle a lower batch size, while handling a much larger batch size with many more GPUs on your decode instances. This split, and the heterogeneity between the two, actually lets you get far more performance.
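A rough back-of-envelope sketch of why the two phases differ; the numbers are illustrative, not any specific model's:

```python
# Arithmetic intensity of prefill vs. decode (back-of-envelope).
params = 70e9          # weight count (hypothetical 70B model)
bytes_per_param = 2    # fp16
prompt_len = 4096

# Prefill: ~2*params FLOPs per token, amortizing one weight read over
# thousands of tokens -> high arithmetic intensity (compute-bound).
prefill_flops_per_byte = (2 * params * prompt_len) / (params * bytes_per_param)

# Decode: the same weight read serves a single new token -> low arithmetic
# intensity (memory-bandwidth-bound).
decode_flops_per_byte = (2 * params * 1) / (params * bytes_per_param)

print(f"prefill: ~{prefill_flops_per_byte:.0f} FLOPs/byte")  # ~4096
print(f"decode:  ~{decode_flops_per_byte:.0f} FLOPs/byte")   # ~1
```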
The other benefit concerns scheduling. One of the problems with inflight batching is that you have many tokens arriving at the same time that are in different phases of generation: if you have a request doing prefill and a request doing decode on the same machine, you get scheduling conflicts, and the scheduler basically has to decide whether or not to take on new tokens. There are techniques to handle this, like inflight batching and chunked-prefill piggybacking, but there is generally a cost to that mutual scheduling, so splitting the phases makes the scheduling simpler.
Now, there's an asterisk to this, but first let me quickly go over the performance numbers. I'm going to use Llama 70B as an example. Here, on the y-axis, we have tokens per second per GPU; on the x-axis, tokens per second per user. Up and to the right is better. If we choose one operating point for latency, then by disaggregating on the same number of GPUs, 16 H100s total, we can achieve up to two times the tokens per second per GPU at that fixed latency, which means you're now paying half as much for your application.
There are some constraints, though. The use case really does dictate performance for disaggregation. For example, low-input-length use cases see little to no speedup, because you don't actually have as much of the scheduling problem: they're very prefill-light, and you're basically just doing decode the entire time. And, per the graph, disaggregation is usually most useful in the middle of the curve. In very high-latency, high-throughput scenarios (the top left of the graph) and low-latency, low-throughput scenarios (the bottom right), aggregated serving tends to converge back with disaggregated and can even produce a bit more performance. That said, for a lot of interactive applications, disaggregation makes the most sense, because users tend to read at, and care about, speeds in the realm of 20 to 200 tokens per second.
The other caveat is that configuration is really important. Since you're separating the two phases into prefill and decode, the balance between the number of prefill workers and decode workers dictates performance. For example, if you have too many decode workers, they'll be starved for work. And if you have too many prefill workers, they'll generate work faster than the decode workers can absorb it, and the decode workers will be pushed down by an ever-increasing amount of load and queue depth. Modifying this balance is also expensive and hard, because the right prefill-to-decode split depends on the parallel configs of each side, so it's a really wide configuration space.
One other thing we talk about with respect to scale is routing. We've talked about how the KV cache is important. One of the things we have to do for prefill-decode disaggregation is actually transfer the KV between machines. And in some sense there's an affinity for certain machines to do certain work, since the KV cache of previous requests is stored on those GPUs, or offloaded to host memory or external storage, during the course of inference. Now, I actually labeled this slide wrong: this is not the smart router. In the naive case, you route pretty much exclusively randomly; you're not biasing toward anything, just sampling uniformly. Alternatively, take this example of routing to worker three, and note the slide is actually inverted: you could be optimizing purely for KV match. If you do a purely KV-based router, you may end up biasing toward machines that have a strong KV hit but too much load, machines that can't actually absorb the request, and you end up with queuing. In the smart case, you want to optimize a cost function that combines both: maximize the prefix match you can get from the work that's already been done on a node, while accounting for the load that already exists on that node.
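Here's a minimal sketch of that kind of routing objective. The scoring function, weight, and worker fields are hypothetical, not Dynamo's actual router API:

```python
# Smart routing sketch: trade off prefix-cache reuse against current load.
def route(workers, request_tokens, load_weight=2.0):
    """Pick the worker maximizing (prefix match) - (load penalty)."""
    def prefix_match(cached, tokens):
        n = 0
        for a, b in zip(cached, tokens):
            if a != b:
                break
            n += 1
        return n

    def score(w):
        reused = prefix_match(w["cached_prefix"], request_tokens)
        return reused - load_weight * w["queued_tokens"]

    return max(workers, key=score)

workers = [
    {"id": 1, "cached_prefix": [1, 2, 3, 4], "queued_tokens": 0},
    {"id": 2, "cached_prefix": [1, 2, 3, 4, 5, 6], "queued_tokens": 5},
]
# Worker 2 has the longer prefix hit, but its queue outweighs the reuse here.
print(route(workers, [1, 2, 3, 4, 5, 6, 7])["id"])  # -> 1
```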
And as you scale out, as you get more and more GPUs in a deployment, you end up with more and more of the KV space represented locally on those machines. Because of that, a larger and larger deployment gives you an asymptotically increasing KV cache hit rate, which means you're doing less and less prefill work over time.
So routing, to give it a report card, improves your speed and your cost, and has no real effect on quality, because it's doing the same work the model would normally do.
Now we talk about structure. Structure is really important because of workloads you've probably seen at the AI Engineer World's Fair, like agents. Agents impose a structure on the workload, in that they have moderately predictable usage patterns across concurrent requests. An example here is inference-time scaling. This is a cool graph I'll go over quickly. We have three models: in green, an 8B model; in yellow, a 49B model; and in red, a 235B model. We find that with inference-time scaling, that is, re-querying the model and prompting it to reconsider its results or reason more about them, we can produce better and better results. And you see this really interesting trend: with about three or four rounds of re-querying, the 8B model is basically on par in quality with the 49B model, and the 49B is almost on par in quality with the 235B model. And note that the cost of querying that 8B model, even querying it multiple times, is actually lower than querying the larger model. In this sense, inference-time scaling can be thought of as increasing quality at the cost of speed, and at the cost of cost, because you're re-querying.
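One flavor of inference-time scaling is best-of-N sampling. Here's a minimal sketch, where `call_model` and `score` are hypothetical stand-ins for a real LLM client and verifier:

```python
# Best-of-N sketch: spend more tokens to buy quality from a smaller model.
def call_model(prompt: str, attempt: int) -> str:
    return f"candidate answer {attempt}"   # placeholder for a real API call

def score(answer: str) -> float:
    return float(len(answer))              # placeholder for a real verifier

def best_of_n(prompt: str, n: int = 4) -> str:
    """Query the (small) model n times and keep the best-scoring answer."""
    candidates = [call_model(prompt, i) for i in range(n)]
    return max(candidates, key=score)

print(best_of_n("Plan a 3-city itinerary", n=4))
```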
But alternatively, if you keep quality fixed, you can get lower latency and lower cost by using a smaller model and re-querying it multiple times. And the structure we infer from that re-querying allows us to do better scheduling. In this graph we have a series of curves that represent the runtime of a given reasoning example from the Natural Plan dataset against concurrency; it's a graph of how many concurrent instances you can run at once, sampled across the concurrency range. We see that implementing disaggregation gives a small benefit; mostly because this dataset is short-ISL, long-OSL (short input sequence length, long output sequence length), you don't get a whole ton of benefit from disaggregation in this case. But removing a round trip, by making the re-queries come from the router instead of from the user or client on the outside, lets you really cut the round-trip latency. And on top of that, making the router and the LLM scheduler aware that you're doing repeat work, that you're re-querying, gives you a further benefit: that's the gap from the red line to the green line. That is to say, across a wide variety of models, if we assume quality is fixed, we can use inference-time scaling and some smart techniques to significantly decrease latency and increase throughput while maintaining the same quality.
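Here's a sketch of what moving the re-query loop into the router might look like; the `Router`/`Worker` classes and their methods are hypothetical stubs, not Dynamo's interface:

```python
# Router-side refinement: each round skips a client round trip, and pinning
# the request to one worker keeps the growing prefix a KV cache hit.
import asyncio

class Worker:
    async def generate(self, prompt: str) -> str:
        return f"[answer to {len(prompt)} prompt chars]"  # stand-in for the LLM

class Router:
    def pick_worker(self, prompt: str) -> Worker:
        return Worker()  # a real router would score KV affinity and load

async def serve_with_refinement(router: Router, prompt: str, rounds: int = 3) -> str:
    worker = router.pick_worker(prompt)      # keep KV affinity across rounds
    answer = await worker.generate(prompt)
    for _ in range(rounds - 1):
        # The follow-up shares the growing prefix, so the scheduler can count
        # on a KV cache hit instead of a fresh prefill.
        prompt += answer + "\nReconsider and improve your answer."
        answer = await worker.generate(prompt)
    return answer

print(asyncio.run(serve_with_refinement(Router(), "Plan a trip to 5 cities.")))
```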
One last thing, and I'll go through this quickly because I'm getting low on time: manipulating the K and V values themselves. We've talked about how there's this work we do in prefill that we don't want to lose; we do routing to make sure we don't lose that KV. And if we have a workflow where we know the runtimes of things, for example a tool call that takes a moderately deterministic amount of time, we basically end up with KV eviction: if the tool call takes 30 seconds, the KV will be swept out of HBM, and you won't be able to use it afterward because it's no longer cached. But if we know it's going to be used again, why not just offload it? Inference-time scaling gives you structure to manipulate your KV; tool calling gives you structure to manipulate your KV. So instead of doing another prefill the second time, you might, for example, do the prefill once, do the LLM call and the decode once, move the KV to host memory, and then, around the time you expect the tool to complete, move it right back into GPU memory so it's ready for the next LLM call that will include the added context from the tool.
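Here's a sketch of that offload/prefetch pattern around a tool call. The engine methods (`generate`, `offload_kv`, `prefetch_kv`) are hypothetical names for the kind of KV-movement operations described, not a documented API:

```python
# Structure-aware KV offload: park the cache in host memory while a tool
# runs, then prefetch it back before the follow-up call needs it.
import asyncio

async def agent_step(engine, request_id, prompt, tool, tool_seconds=30.0):
    answer = await engine.generate(request_id, prompt)   # prefill + decode

    # The tool runs for ~tool_seconds; rather than letting this request's KV
    # get evicted from HBM, park it in host memory for the duration...
    await engine.offload_kv(request_id, target="host")
    tool_task = asyncio.create_task(tool(answer))

    # ...and pull it back just before the tool is expected to finish, so the
    # next call reuses the cached prefix instead of re-running prefill.
    await asyncio.sleep(max(tool_seconds - 1.0, 0.0))
    await engine.prefetch_kv(request_id, target="gpu")

    tool_output = await tool_task
    return await engine.generate(request_id, prompt + answer + tool_output)
```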
So KV manipulation, again, increases your speed and decreases your cost while keeping quality constant.
The last thing we have to talk about here is dynamism. Worker specialization is really important. As I said, since disaggregation behaves differently at different input sequence lengths (ISL) and output sequence lengths (OSL), you actually want a mix of aggregated and disaggregated workers based on where you are in the ISL/OSL histogram. At lower input sequence lengths and higher output sequence lengths, you might want aggregated serving with higher tensor parallelism. In the middle of the input sequence range, you may want disaggregated. And in the long-context regime, you may want disaggregated with context parallelism. Again, this differs model to model, and this is just an illustrative graph, but generally, if you specialize workers, you can increase your speed and decrease your cost while keeping quality the same, because you're not actually touching what the model is executing; you're not touching the math it's doing.
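As a sketch, worker specialization can be as simple as a lookup from the request's ISL/OSL into a pool; the cutoffs and pool names below are illustrative, not tuned values:

```python
# Route requests to specialized worker pools by sequence-length regime.
def pick_pool(isl: int, osl: int) -> str:
    if isl < 512 and osl > isl:
        return "aggregated-high-tp"               # prefill-light, decode-heavy
    if isl > 32_000:
        return "disaggregated-context-parallel"   # long-context regime
    return "disaggregated"                        # the broad middle of the histogram

print(pick_pool(isl=256, osl=2048))    # aggregated-high-tp
print(pick_pool(isl=8192, osl=512))    # disaggregated
print(pick_pool(isl=64000, osl=1024))  # disaggregated-context-parallel
```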
One more thing about dynamism: load balance is quite important. As I mentioned earlier, watching the ratio of prefill (P) to decode (D) workers is key to determining whether your disaggregated deployment will be successful. For example, suppose you create your initial configuration based on a histogram, say apps A and B with their respective input and output sequence length distributions. You can end up in a scenario where a change in the user distribution causes significant issues for your deployment: if input sequence lengths grow by a bit more than output sequence lengths, you may create more demand for prefill workers than for decode workers. So your balance will change over time (this has been shown empirically by a wide variety of people publishing data), and you actually have to autoscale across the two instance types in real time to account for changes in your platform's usage distribution. So dynamic load balancing increases your speed and keeps your cost low, but mostly it's just essential to ensuring that disaggregation actually works to its maximum potential.
Okay, last things.
Here is the Dynamo repo; it's right here, on github.com. We also have a Dynamo meetup being hosted tomorrow, Thursday, from 5 to 8:00 p.m. here in San Francisco. Please come; we'll be talking a lot more about how we actually implement these things at the event.