What every AI engineer needs to know about GPUs — Charles Frye, Modal
Channel: aiDotEngineer
Published at: 2025-07-20
YouTube video id: y-UGrYbJsJk
Source: https://www.youtube.com/watch?v=y-UGrYbJsJk
So what I wanted to talk about today is what every AI engineer needs to know about GPUs. So far, over the last couple of years, most of the things people have built as AI applications, the people who are AI engineers, have been built on top of model APIs: the OpenAI API, the Anthropic API, the DeepSeek API, with an application on top of that. That goes back to the original diagram swyx put out, the "Rise of the AI Engineer" thing. Having that API boundary is pretty important: you can't really build a complex system if everybody has to know how every piece works, and know all of it in detail, with no boundaries or decomposition; you'll just collapse in complexity if you do that. So I started off by trying to answer the question of why every AI engineer needs to know about GPUs. Here's our famous diagram: the AI engineer sits on the right of the API boundary, constrained by the needs of users rather than by what's possible with research or what infrastructure is capable of providing. The way I think about this distinction is that it's similar to how very few developers need to actually write a database. Almost no one writes a database, except maybe in their undergrad classes. Even very few developers run a database: a lot of them will use either a fully managed service or a hosted service like RDS on Amazon. But almost all developers, despite the fact that they aren't database engineers, are users of databases, and they need to know how to write good queries.
They need to know how to hold the tool, which buttons to press on the side. There's a famous educational resource that I really love about databases called "Use the Index, Luke," which is basically about how to write SQL queries that don't suck. The whole point is: there is a thing called an index, and there are a couple of data structures that support it. It talks about things like B-trees and log-structured merge trees. The intent isn't that you can then leave and go invert a binary tree on a whiteboard so you can get a FAANG job; the point is to teach you what you need to know so that you can write queries properly, queries that use the index rather than ones that don't. Primary and secondary indices, all these things: knowing them well enough to use them is an easier prospect than knowing them well enough to build them or innovate on them. And I think we're reaching this point now with language models, where you'll have more ability to integrate tightly and run your own language models, and so more need to "use the index." Or, if you want the one-sentence summary of this talk, it's: use the tensor cores, Luke. There's basically one part of an NVIDIA GPU, with an equivalent in other GPUs, that is fast and good and keeps getting better, and it's the tensor core. It does matrix-matrix multiplication, and you should make sure you're using it and not not using it, just like an index on a database. So yeah, I kind of made this point earlier about open-weights models, and the open-source software to run them, like Dynamo, getting better very quickly. So it finally makes sense to self-host.
I'm not going to belabor this point, because I'm giving another talk at 12:45 presenting some benchmarking results we got from running vLLM, SGLang, and TensorRT-LLM on ten to twelve different models across ten to twelve different workloads, to show what's economical and what's not. Okay. So that's the why: a slight change or adjustment in what I think AI engineers should focus on and know about. So now, what is it that you need to know about this hardware, in detail? The primary thing is that GPUs embrace high bandwidth, not low latency. That's the key feature of this hardware. It's similar with TPUs, but it distinguishes them from pretty much every other piece of hardware that you're used to programming. And then, in detail, they optimize for math bandwidth over memory bandwidth: computing on things is where they have the highest throughput. So you want to align yourself not with latency but with throughput. Within throughput, you want to focus on computational operations. And within computational operations, what you want to focus on, if you want to actually use the whole GPU you paid for, is low-precision matrix-matrix multiplications. Sorry, that wasn't a stutter: matrix-matrix multiplications, not just matrix-vector. Okay, so the first point, latency versus bandwidth. I regret to inform you that the scaling down of latency in computing systems died during the Bush administration, and it's not coming back. See a talk later today for an alternative perspective, but yeah, GPUs embrace bandwidth scaling. A little more detail on that. This is a computer, or a piece of a computer, in case you haven't looked inside one in a while. This is a logic gate from the Zuse Z1, a computer built in Germany in the 1930s, arguably the first digital computer: digital, but not electronic; it's mechanical.
It has all these actuator plates in it that implement logical operations. What you see there on the left is the logical operation AND: if two plates are pushed down, both of them present, then when another plate pushes forward, it pushes the final plate forward. That's the logical operation AND. And the thing that drives that pushing is a clock, a literal clock. I guess now everybody has Apple Watches, but there was a time when you would have a physical clock for that sort of thing. So the clock drives these systems and causes them to compute their logical operations: every time the clock ticks, you get a new operation. We've changed computers a little bit since then, in that we use different physics to drive them, but it's still the same basic abstract system: a motive force that happens on a clock cycle and leads to calculations. The cool thing about that is, if you just make the clock faster, literally nobody has to think about anything and the computer gets better. This was the primary driver of computers getting better in the '90s. No recompiling, no rewriting your software; everything just got better because the clock started going twice as fast, and time is so virtual inside computers that the program couldn't possibly know the difference. That was really great during that mid-to-late-'90s period, and then it fell off a cliff in the early 2000s. This has impacted a lot of computing over the last two decades, and its effects are actually still being felt: this loss of the ability to avoid thinking about performance is slowly and inevitably changing pretty much everything in software.
All kinds of things you've seen around concurrency: GIL-free Python, multiprocessing, async coroutines. There are a couple of detailed things to dive into here, and I want to make sure I leave enough time for the GPU stuff, but there are basically two notions of how to make things faster without a faster clock. One is parallelism: when you have a clock cycle, do two things instead of one. Sounds like a good idea. The other is concurrency, which is a little bit trickier. It's like this: a clock cycle hits and you start running a calculation, and maybe that calculation takes five clock cycles to finish. Instead of waiting for those clock cycles to finish, you try to do five other things with the next couple of clock cycles. It makes your programs really ugly, because you have to write async/await everywhere; and if you're writing Rust, it's a world of Pin. But it helps you keep these super-high-bandwidth pipelines busy. So concurrency and parallelism are two strategies to maximize bandwidth, and GPUs adopt them from the hardware level all the way up to the programming level, to take bandwidth further than CPUs can. So, GPUs take parallelism further than CPUs. I'm comparing an AMD Epyc CPU and an NVIDIA H100 SXM GPU here. The figures of merit are the number of parallel threads that can operate and the wattage at which they operate. An AMD Epyc CPU can do two threads per core at about one watt per thread. That's not bad, but an H100 can do over 16,000 parallel threads at about five centiwatts per thread, which is a pretty amazing, very big difference. And parallel means that literally every clock cycle, all 16,000 threads of execution make progress at the exact same time. So what about concurrency? It may look like CPUs have an advantage here, because the number of effectively concurrent threads is unbounded.
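The concurrency idea above, overlapping waits instead of serializing them, can be sketched in a few lines of Python. This illustrates the concept only; nothing here is GPU-specific:

```python
# Concurrency sketch: five 0.1 s "calculations" overlap instead of running
# back to back, so total wall time is ~0.1 s, not ~0.5 s. One thread, no
# parallelism: we just stop idling while each operation completes.
import asyncio
import time

async def calculation(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for an operation that takes many cycles
    return i * i

async def main() -> list[int]:
    # Launch all five tasks, then await them together.
    return await asyncio.gather(*(calculation(i) for i in range(5)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

A GPU's warp scheduler does the analogous switch in hardware, every clock cycle, which is why its effective concurrency is so much cheaper than an OS context switch.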
You can just make a thread in Linux like it's free; the government doesn't want you to know this. But there's a limit on an H100, so it looks like: oh wow, only about 250,000 threads, what am I supposed to do with that? The difference here is context-switching speed: how quickly can you go from executing one thing to another? If our purpose is to take advantage of every clock cycle, and it takes us a thousand clock cycles, about a microsecond, to context switch, then our concurrency is actually pretty tightly bounded, because we can't be doing just one thing for a whole thousand clock cycles. But in GPUs, context switching happens literally every clock cycle, down in the warp scheduler inside the hardware. If you have to think about it that hard, you're probably having a bad time; normally it's just making everything run faster. There isn't really a name for the phenomenon that's driving all of this work, but David Patterson, who came up with RISC machines and worked on TPUs, wrote it down, so I call it Patterson's law: latency lags bandwidth. Why are we doing all these things, rewriting our programs, rethinking them, to take advantage of increasing bandwidth? Why is bandwidth scaling replacing latency scaling? Because if you look across a variety of different subsystems of computers — networks, memory, disks — the bandwidth improvement over time is the square of the latency improvement. It's one of those Moore's-law-style charts where you're looking at trends in performance over time, and for every 10x that we improve latency, we get a 100x improvement in bandwidth. There are some arguments in the article about where this comes from. Basically, with latency, you run into the laws of physics.
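That "square" relationship is easy to state as a rule of thumb. This is just a restatement of the trend with illustrative numbers, not measured data from Patterson's article:

```python
# Rule-of-thumb form of "latency lags bandwidth": over the same period and
# subsystem, bandwidth improves roughly as the square of the latency
# improvement. Illustrative only; not Patterson's measured data points.
def implied_bandwidth_gain(latency_gain: float) -> float:
    """Bandwidth improvement implied by a given latency improvement."""
    return latency_gain ** 2

print(implied_bandwidth_gain(10))  # a 10x latency gain implies ~100x bandwidth
```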
With bandwidth, you just run into how many things you can do at the same time, and you can always take the same physics and spread it out more easily than you can come up with new physics. You cannot bribe the laws of physics, as Scotty says in Star Trek. That's one of the limits on network latency: we already send packets at about 70% of the speed of light, so we can't make them 10x faster. All right, so that's bandwidth: GPUs embrace it. Maybe the big takeaway from Patterson's law is that bandwidth has won out over and over again, so maybe bet on the bandwidth-oriented hardware. I don't know if the person who's going to be talking about LPUs or Etched is here, but we should fight about this later. So, what kind of bandwidth, though? Arithmetic bandwidth over memory bandwidth. Not moving bytes around: GPUs do have high-bandwidth memory, the fanciest, finest Hynix HBM, but where they really excel is doing calculations on that memory. The takeaway here is that n-squared algorithms are usually bad, but if it's n-squared operations for n memory loads, it actually works out pretty nicely. It's almost like Bill Dally and the others were thinking of this when they built the chip, I don't know. Arithmetic intensity, or math intensity, is the term for this. If you look here at the spec sheet, the things highlighted in purple, denominated in teraFLOPS, go up into the thousands, while the memory bandwidth at the bottom is in the single digits of terabytes per second. And that has not changed with Blackwell. It's only gotten worse, or better, I don't know: the ratio has gone up.
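To make the "n-squared operations for n memory loads" point concrete, here's a back-of-the-envelope arithmetic-intensity calculation. This is my own sketch, assuming FP16 operands and counting only operand reads and result writes:

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte moved),
# assuming FP16 elements (2 bytes each) and counting only operand traffic.
BYTES_PER_ELEMENT = 2

def matmul_intensity(n: int) -> float:
    """n x n matrix times n x n matrix."""
    flops = 2 * n**3                             # n*n outputs, length-n dot product each
    bytes_moved = 3 * n**2 * BYTES_PER_ELEMENT   # read A and B, write C
    return flops / bytes_moved

def matvec_intensity(n: int) -> float:
    """n x n matrix times length-n vector."""
    flops = 2 * n**2                                   # n outputs, length-n dot product each
    bytes_moved = (n**2 + 2 * n) * BYTES_PER_ELEMENT   # read W and x, write y
    return flops / bytes_moved

# Matrix-matrix intensity grows linearly with n: per element loaded, you get
# ~n FLOPs. Matrix-vector intensity is stuck near one FLOP per byte, so the
# tensor cores starve waiting on memory.
print(matmul_intensity(4096), matvec_intensity(4096))
```

With teraFLOPS in the thousands and memory bandwidth in single-digit TB/s, a kernel needs hundreds of FLOPs per byte to keep the chip busy, which only the matrix-matrix case approaches.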
So LLM inference works pretty nicely during prompt processing. For an 8-billion-parameter model with FP8 quantization, you move 8 gigabytes from memory into the registers for calculation, and then you do about 60 billion floating-point operations, so it doesn't scale too badly with the sequence. But then, when you decode, you need to move those 8 billion parameters again, from the GPU's memory into the place where the compute happens. It's a von Neumann architecture: you can't keep things in place; you compute on data that's been moved to the compute units. So LLM inference works great during prompt processing, not so much during decoding. One way to get around this is to just do more stuff while you're decoding. One example is to take a small model, say 8 billion parameters, and run it on the order of a thousand times on the same prompt: now you load the weights only once, but you generate many outputs. So there's an inherent advantage there for small models: they're more sympathetic to the hardware. And you can actually match quality if you do things right, if you have a good verifier — in this case, does the output pass a Python test — that allows you to pick the best of your many outcomes. You can use Llama 3.1 8B to match GPT-4o with about 100 generations. The figure on the left is a reproduction: I read a research paper, sat down, spent a day coding, and got the same result on different data with a different model. So this is legit, this is real science; it's a real phenomenon, and it fits with the hardware.
So lastly: we want to do throughput-oriented, large-scale activity; we want to do it with computation and mathematics, not memory movement; and the specific thing we want to do is low-precision matrix-matrix multiplication. The takeaway here is that some surprising things turn out to be approximately free. I don't have time to go into the details, but it turns out Kyle Kranen, also on the Dynamo team, came to the exact same conclusion; we were talking and comparing notes last night. Check out his talk in the afternoon of the infrastructure track if you want less hand-waving and more charts. Things like multi-token prediction and multiple samples per query suddenly become basically free. And the reason is that the latest GPUs, NVIDIA's and others', have this giant chunk in them, the tensor core, that's specialized for low-precision matrix-matrix multiplication. All the numbers in purple here, the really big ones, are tensor core output, not CUDA core output. And tensor cores do exactly one thing: floating-point matrix-matrix multiplication. It's a bit of a tough world to live in as a programmer, discovering there's essentially one operation you're allowed to work with, but you just get more creative: you can do a Fourier transform with this thing if you want. The generation phase of language models is very heavy on matrix-vector operations if you just write it out naively. So the things that are basically free are the things that can upgrade you to a matrix-matrix operation.
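Here's what that "upgrade" looks like in NumPy shapes, a sketch of the idea rather than real tensor core code: batching B sequences turns a per-token matrix-vector product into one matrix-matrix product that reuses a single weight load.

```python
# GEMV vs GEMM, with NumPy standing in for the GPU. Decoding one token for
# one sequence is a matrix-vector product; batching B sequences reuses the
# same weights in one matrix-matrix product. Shapes are illustrative.
import numpy as np

d, B = 1024, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)  # one layer's weights

# One sequence: activations are a vector, so every decoded token pays a
# full read of W for a matrix-vector product (low arithmetic intensity).
x = np.ones(d, dtype=np.float32)
y = W @ x                    # shape (d,)

# B sequences: stack the activations, and one read of W now feeds a
# matrix-matrix product doing B times the math (tensor core territory).
X = np.ones((B, d), dtype=np.float32)
Y = X @ W.T                  # shape (B, d)

# Each batched row matches the corresponding single-sequence result.
assert Y.shape == (B, d)
assert np.allclose(Y[0], y, rtol=1e-3, atol=1e-2)
```

Multi-token prediction and multiple samples per query are both ways of manufacturing that batch dimension, which is why they come out nearly free on this hardware.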
There are some microbenchmarks from the ThunderKittens people at Hazy Research showing, basically, that if you give a tensor core a matrix and a mostly empty matrix with one column filled, you get about one nth of the performance, and as you add more stuff, performance scales up to match. So this is another phenomenon that pushes you in the direction of generating multiple samples, or generating multiple tokens, as DeepSeek does with multi-token prediction, and as I think the Llama 4 models do as well. So as an AI engineer, the things you should be looking at are: okay, maybe I can get away with running a smaller model that fits on a GPU under my desk, and then scale it out to get sufficient quality to satisfy users. There's a bunch of research on this stuff from around the release of ChatGPT, when there was still a thriving academic field on top of language models. People have kind of forgotten about it a bit, but I think the open models are now good enough that this is back to being a good idea. Cool, I think I only have about ten seconds left. So I'll just say: if you want to learn more, I wrote this GPU Glossary, at modal.com/gpu-glossary. It's an attempt at "CUDA docs for humans," explaining the whole software and hardware stack in one place with lots of links, so that when you're reading about a warp scheduler and you've forgotten what a streaming multiprocessor architecture is, or how that's related to the NVIDIA CUDA compiler driver, all those things are one click away. And if you want to run models on this hardware, there's no better place than the platform I work on: Modal, serverless GPUs and more.
We basically ripped out and rewrote the whole container stack to make serverless Python work well for data-intensive and compute-intensive workloads like language model inference. So you should definitely check it out. Come find us at the expo hall and we'll talk your ear off about it. All right, thank you very much.