What every AI engineer needs to know about GPUs — Charles Frye, Modal
Channel: aiDotEngineer
Published at: 2025-07-20
YouTube video id: y-UGrYbJsJk
Source: https://www.youtube.com/watch?v=y-UGrYbJsJk
So what I wanted to talk about today is what every AI engineer needs to know about GPUs. So far, over the last couple of years, most of the things people have built as AI applications, the people who are AI engineers, have been built on top of model APIs: the OpenAI API, the Anthropic API, the DeepSeek API, with an application on top of that. That goes back to the original diagram swyx put out, the "Rise of the AI Engineer" thing. Having that API boundary is pretty important: you can't really build a complex system if everybody has to know how every piece works, and know all of it in detail, with no boundaries or decomposition; you'll just collapse in complexity if you do that. So I started off by trying to answer the question of why every AI engineer needs to know about GPUs. Here's our famous diagram: the AI engineer sits on the right of the API boundary, constrained by the needs of users rather than by what's possible with research or what infrastructure is capable of providing. The way I think about this distinction is that it's similar to how very few developers need to actually write a database. Almost no one writes a database, except maybe in their undergrad classes. Even very few developers run a database: a lot of them will use either a fully managed service or a hosted service like RDS on Amazon. But almost all developers, despite the fact that they aren't database engineers, are users of databases, and they need to know how to write good queries.
They need to know how to hold the tool, which buttons to press on the side. There's a famous educational resource that I really love about databases called "Use the Index, Luke," which is basically about how to write SQL queries that don't suck. The whole point is: there is a thing called an index, and there are a couple of data structures that support it. It talks about things like B-trees and log-structured merge trees. The intent isn't that you can then leave and go invert a binary tree on a whiteboard so you can get a FAANG job; the point is to teach you what you need to know so that you can write queries properly, queries that use the index rather than ones that don't. Primary and secondary indices, all these things: knowing them well enough to use them is an easier prospect than knowing them well enough to build them or innovate on them. And I think we're reaching this point now with language models, where you'll have more ability to integrate tightly and run your own language models, and so more need to "use the index." Or, if you want the one-sentence summary of this talk, it's: use the tensor cores, Luke. There's basically one part of an NVIDIA GPU, with an equivalent in other GPUs, that is fast and good and keeps getting better, and it's the tensor core. It does matrix-matrix multiplication, and you should make sure you're using it and not not using it, just like an index on a database. So yeah, I kind of made this point earlier about open-weights models, and the open-source software to run them, like Dynamo, getting better very quickly. So it finally makes sense to self-host.
I'm not going to belabor this point, because I'm giving another talk at 12:45 presenting some benchmarking results we got from running vLLM, SGLang, and TensorRT-LLM on ten to twelve different models across ten to twelve different workloads, to show what's economical and what's not. Okay. So that's the why: a slight change or adjustment in what I think AI engineers should focus on and know about. So now, what is it that you need to know about this hardware, in detail? The primary thing is that GPUs embrace high bandwidth, not low latency. That's the key feature of this hardware. It's similar with TPUs, but it distinguishes them from pretty much every other piece of hardware that you're used to programming. And then, in detail, they optimize for math bandwidth over memory bandwidth: computing on things is where they have the highest throughput. So you want to align yourself not with latency but with throughput. Within throughput, you want to focus on computational operations. And within computational operations, what you want to focus on, if you want to actually use the whole GPU you paid for, is low-precision matrix-matrix multiplications. Sorry, that wasn't a stutter: matrix-matrix multiplications, not just matrix-vector. Okay, so the first point, latency versus bandwidth. I regret to inform you that the scaling down of latency in computing systems died during the Bush administration, and it's not coming back. See a talk later today for an alternative perspective, but yeah, GPUs embrace bandwidth scaling. A little more detail on that. This is a computer, or a piece of a computer, in case you haven't looked inside one in a while. This is a logic gate from the Zuse Z1, a computer built in Germany in the 1930s, arguably the first digital computer: digital, but not electronic; it's mechanical.
It has all these actuator plates in it that implement logical operations. What you see there on the left is the logical operation AND: if two plates are pushed down, both of them present, then when another plate pushes forward, it pushes the final plate forward. That's the logical operation AND. And the thing that drives that pushing is a clock, a literal clock. I guess now everybody has Apple Watches, but there was a time when you would have a physical clock for that sort of thing. So the clock drives these systems and causes them to compute their logical operations: every time the clock ticks, you get a new operation. We've changed computers a little bit since then, in that we use different physics to drive them, but it's still the same basic abstract system: a motive force that happens on a clock cycle and leads to calculations. The cool thing about that is, if you just make the clock faster, literally nobody has to think about anything and the computer gets better. This was the primary driver of computers getting better in the '90s. No recompiling, no rewriting your software; everything just got better because the clock started going twice as fast, and time is so virtual inside computers that the program couldn't possibly know the difference. That was really great during that mid-to-late-'90s period, and then it fell off a cliff in the early 2000s. This has impacted a lot of computing over the last two decades, and its effects are actually still being felt: this loss of the ability to avoid thinking about performance is slowly and inevitably changing pretty much everything in software.
All kinds of things you've seen around concurrency: GIL-free Python, multiprocessing, async coroutines. There are a couple of detailed things to dive into here, and I want to make sure I leave enough time for the GPU stuff, but there are basically two notions of how to make things faster without a faster clock. One is parallelism: when you have a clock cycle, do two things instead of one. Sounds like a good idea. The other is concurrency, which is a little bit trickier. It's like this: a clock cycle hits and you start running a calculation, and maybe that calculation takes five clock cycles to finish. Instead of waiting for those clock cycles to finish, you try to do five other things with the next couple of clock cycles. It makes your programs really ugly, because you have to write async/await everywhere; and if you're writing Rust, it's a world of Pin. But it helps you keep these super-high-bandwidth pipelines busy. So concurrency and parallelism are two strategies to maximize bandwidth, and GPUs adopt them from the hardware level all the way up to the programming level, to take bandwidth further than CPUs can. So, GPUs take parallelism further than CPUs. I'm comparing an AMD Epyc CPU and an NVIDIA H100 SXM GPU here. The figures of merit are the number of parallel threads that can operate and the wattage at which they operate. An AMD Epyc CPU can do two threads per core at about one watt per thread. That's not bad, but an H100 can do over 16,000 parallel threads at about five centiwatts per thread, which is a pretty amazing, very big difference. And parallel means that literally every clock cycle, all 16,000 threads of execution make progress at the exact same time. So what about concurrency? It may look like CPUs have an advantage here, because the number of effectively concurrent threads is unbounded.
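The concurrency idea above, overlapping waits instead of serializing them, can be sketched in a few lines of Python. This illustrates the concept only; nothing here is GPU-specific:

```python
# Concurrency sketch: five 0.1 s "calculations" overlap instead of running
# back to back, so total wall time is ~0.1 s, not ~0.5 s. One thread, no
# parallelism: we just stop idling while each operation completes.
import asyncio
import time

async def calculation(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for an operation that takes many cycles
    return i * i

async def main() -> list[int]:
    # Launch all five tasks, then await them together.
    return await asyncio.gather(*(calculation(i) for i in range(5)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

A GPU's warp scheduler does the analogous switch in hardware, every clock cycle, which is why its effective concurrency is so much cheaper than an OS context switch.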
You can just make a thread in Linux like it's free; the government doesn't want you to know this. But there's a limit on an H100, so it looks like: oh wow, only about 250,000 threads, what am I supposed to do with that? The difference here is context-switching speed: how quickly can you go from executing one thing to another? If our purpose is to take advantage of every clock cycle, and it takes us a thousand clock cycles, about a microsecond, to context switch, then our concurrency is actually pretty tightly bounded, because we can't be doing just one thing for a whole thousand clock cycles. But in GPUs, context switching happens literally every clock cycle, down in the warp scheduler inside the hardware. If you have to think about it that hard, you're probably having a bad time; normally it's just making everything run faster. There isn't really a name for the phenomenon that's driving all of this work, but David Patterson, who came up with RISC machines and worked on TPUs, wrote it down, so I call it Patterson's law: latency lags bandwidth. Why are we doing all these things, rewriting our programs, rethinking them, to take advantage of increasing bandwidth? Why is bandwidth scaling replacing latency scaling? Because if you look across a variety of different subsystems of computers — networks, memory, disks — the bandwidth improvement over time is the square of the latency improvement. It's one of those Moore's-law-style charts where you're looking at trends in performance over time, and for every 10x that we improve latency, we get a 100x improvement in bandwidth. There are some arguments in the article about where this comes from. Basically, with latency, you run into the laws of physics.
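That "square" relationship is easy to state as a rule of thumb. This is just a restatement of the trend with illustrative numbers, not measured data from Patterson's article:

```python
# Rule-of-thumb form of "latency lags bandwidth": over the same period and
# subsystem, bandwidth improves roughly as the square of the latency
# improvement. Illustrative only; not Patterson's measured data points.
def implied_bandwidth_gain(latency_gain: float) -> float:
    """Bandwidth improvement implied by a given latency improvement."""
    return latency_gain ** 2

print(implied_bandwidth_gain(10))  # a 10x latency gain implies ~100x bandwidth
```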
With bandwidth, you just run into how many things you can do at the same time, and you can always take the same physics and spread it out more easily than you can come up with new physics. You cannot bribe the laws of physics, as Scotty says in Star Trek. That's one of the limits on network latency: we already send packets at about 70% of the speed of light, so we can't make them 10x faster. All right, so that's bandwidth: GPUs embrace it. Maybe the big takeaway from Patterson's law is that bandwidth has won out over and over again, so maybe bet on the bandwidth-oriented hardware. I don't know if the person who's going to be talking about LPUs or Etched is here, but we should fight about this later. So, what kind of bandwidth, though? Arithmetic bandwidth over memory bandwidth. Not moving bytes around: GPUs do have high-bandwidth memory, the fanciest, finest Hynix HBM, but where they really excel is doing calculations on that memory. The takeaway here is that n-squared algorithms are usually bad, but if it's n-squared operations for n memory loads, it actually works out pretty nicely. It's almost like Bill Dally and the others were thinking of this when they built the chip, I don't know. Arithmetic intensity, or math intensity, is the term for this. If you look here at the spec sheet, the things highlighted in purple, denominated in teraFLOPS, go up into the thousands, while the memory bandwidth at the bottom is in the single digits of terabytes per second. And that has not changed with Blackwell. It's only gotten worse, or better, I don't know: the ratio has gone up.
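To make the "n-squared operations for n memory loads" point concrete, here's a back-of-the-envelope arithmetic-intensity calculation. This is my own sketch, assuming FP16 operands and counting only operand reads and result writes:

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte moved),
# assuming FP16 elements (2 bytes each) and counting only operand traffic.
BYTES_PER_ELEMENT = 2

def matmul_intensity(n: int) -> float:
    """n x n matrix times n x n matrix."""
    flops = 2 * n**3                             # n*n outputs, length-n dot product each
    bytes_moved = 3 * n**2 * BYTES_PER_ELEMENT   # read A and B, write C
    return flops / bytes_moved

def matvec_intensity(n: int) -> float:
    """n x n matrix times length-n vector."""
    flops = 2 * n**2                                   # n outputs, length-n dot product each
    bytes_moved = (n**2 + 2 * n) * BYTES_PER_ELEMENT   # read W and x, write y
    return flops / bytes_moved

# Matrix-matrix intensity grows linearly with n: per element loaded, you get
# ~n FLOPs. Matrix-vector intensity is stuck near one FLOP per byte, so the
# tensor cores starve waiting on memory.
print(matmul_intensity(4096), matvec_intensity(4096))
```

With teraFLOPS in the thousands and memory bandwidth in single-digit TB/s, a kernel needs hundreds of FLOPs per byte to keep the chip busy, which only the matrix-matrix case approaches.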
So LLM inference works pretty nicely during prompt processing. For an 8-billion-parameter model with FP8 quantization, you move 8 gigabytes from memory into the registers for calculation, and then you do about 60 billion floating-point operations, so it doesn't scale too badly with the sequence. But then, when you decode, you need to move those 8 billion parameters again, from the GPU's memory into the place where the compute happens. It's a von Neumann architecture: you can't keep things in place; you compute on data that's been moved to the compute units. So LLM inference works great during prompt processing, not so much during decoding. One way to get around this is to just do more stuff while you're decoding. One example is to take a small model, say 8 billion parameters, and run it on the order of a thousand times on the same prompt: now you load the weights only once, but you generate many outputs. So there's an inherent advantage there for small models: they're more sympathetic to the hardware. And you can actually match quality if you do things right, if you have a good verifier — in this case, does the output pass a Python test — that allows you to pick the best of your many outcomes. You can use Llama 3.1 8B to match GPT-4o with about 100 generations. The figure on the left is a reproduction: I read a research paper, sat down, spent a day coding, and got the same result on different data with a different model. So this is legit, this is real science; it's a real phenomenon, and it fits with the hardware.
So lastly: we want to do throughput-oriented, large-scale activity; we want to do it with computation and mathematics, not memory movement; and the specific thing we want to do is low-precision matrix-matrix multiplication. The takeaway here is that some surprising things turn out to be approximately free. I don't have time to go into the details, but it turns out Kyle Kranen, also on the Dynamo team, came to the exact same conclusion; we were talking and comparing notes last night. Check out his talk in the afternoon of the infrastructure track if you want less hand-waving and more charts. Things like multi-token prediction and multiple samples per query suddenly become basically free. And the reason is that the latest GPUs, NVIDIA's and others', have this giant chunk in them, the tensor core, that's specialized for low-precision matrix-matrix multiplication. All the numbers in purple here, the really big ones, are tensor core output, not CUDA core output. And tensor cores do exactly one thing: floating-point matrix-matrix multiplication. It's a bit of a tough world to live in as a programmer, discovering there's essentially one operation you're allowed to work with, but you just get more creative: you can do a Fourier transform with this thing if you want. The generation phase of language models is very heavy on matrix-vector operations if you just write it out naively. So the things that are basically free are the things that can upgrade you to a matrix-matrix operation.
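Here's what that "upgrade" looks like in NumPy shapes, a sketch of the idea rather than real tensor core code: batching B sequences turns a per-token matrix-vector product into one matrix-matrix product that reuses a single weight load.

```python
# GEMV vs GEMM, with NumPy standing in for the GPU. Decoding one token for
# one sequence is a matrix-vector product; batching B sequences reuses the
# same weights in one matrix-matrix product. Shapes are illustrative.
import numpy as np

d, B = 1024, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)  # one layer's weights

# One sequence: activations are a vector, so every decoded token pays a
# full read of W for a matrix-vector product (low arithmetic intensity).
x = np.ones(d, dtype=np.float32)
y = W @ x                    # shape (d,)

# B sequences: stack the activations, and one read of W now feeds a
# matrix-matrix product doing B times the math (tensor core territory).
X = np.ones((B, d), dtype=np.float32)
Y = X @ W.T                  # shape (B, d)

# Each batched row matches the corresponding single-sequence result.
assert Y.shape == (B, d)
assert np.allclose(Y[0], y, rtol=1e-3, atol=1e-2)
```

Multi-token prediction and multiple samples per query are both ways of manufacturing that batch dimension, which is why they come out nearly free on this hardware.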
There are some microbenchmarks from the ThunderKittens people at Hazy Research showing, basically, that if you give a tensor core a matrix and a mostly empty matrix with one column filled, you get about one nth of the performance, and as you add more stuff, performance scales up to match. So this is another phenomenon that pushes you in the direction of generating multiple samples, or generating multiple tokens, as DeepSeek does with multi-token prediction, and as I think the Llama 4 models do as well. So as an AI engineer, the things you should be looking at are: okay, maybe I can get away with running a smaller model that fits on a GPU under my desk, and then scale it out to get sufficient quality to satisfy users. There's a bunch of research on this stuff from around the release of ChatGPT, when there was still a thriving academic field on top of language models. People have kind of forgotten about it a bit, but I think the open models are now good enough that this is back to being a good idea. Cool, I think I only have about ten seconds left. So I'll just say: if you want to learn more, I wrote this GPU Glossary, at modal.com/gpu-glossary. It's an attempt at "CUDA docs for humans," explaining the whole software and hardware stack in one place with lots of links, so that when you're reading about a warp scheduler and you've forgotten what a streaming multiprocessor architecture is, or how that's related to the NVIDIA CUDA compiler driver, all those things are one click away. And if you want to run models on this hardware, there's no better place than the platform I work on: Modal, serverless GPUs and more.
We basically ripped out and rewrote the whole container stack to make serverless Python work well for data-intensive and compute-intensive workloads like language model inference. So you should definitely check it out. Come find us at the expo hall and we'll talk your ear off about it. All right, thank you very much.