Continuous Profiling for GPUs — Matthias Loibl, Polar Signals
Channel: aiDotEngineer
Published at: 2025-07-22
YouTube video id: wt8gzWR6auQ
Source: https://www.youtube.com/watch?v=wt8gzWR6auQ
Great — thank you for coming. I'm going to talk about maximizing GPU efficiency with continuous profiling for GPUs. So, what is profiling? Profiling is pretty much as old as programming — it was first done in the 1970s, when some IBM folks were trying to figure out what was happening on their computers. So it's been around basically forever in computer science. And what do we profile? Basically anything that can inform our view of the system: memory, CPU, and GPU time spent, the usage of individual instructions, and the frequency and duration of function calls. There are a lot of different approaches to profiling, but generally speaking it's super important to performance engineering. Why would we do this? Obviously to improve performance, and then also to save money: if we improve our software by 10%, we might be able to turn off 10% of our servers and save a bunch of money, right? And there are two different kinds of profiling that we typically see these days. One is tracing profiling, where you record each and every event, all the time, constantly. That's great for getting the best possible view of the system, but it's pretty high cost and generates a lot of data, so it's hard to do continuously. And that is why we do sampled profiling: we sample for a certain duration, say 10 seconds, and only sample 100 times per second, or 20 times per second, and so on — you can tweak how often you want to sample.
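To make the idea of sampled profiling concrete, here is a minimal toy sampler in pure Python. It is an illustrative sketch only — Polar Signals' agent samples in the kernel via eBPF, not with signals — and all names here are made up. It uses `ITIMER_PROF` to interrupt the program roughly 100 times per second of consumed CPU time and records which call stack was running at each tick:

```python
import signal
import time
from collections import Counter

# Folded stack -> number of times we saw it on-CPU (hypothetical sketch).
stack_counts = Counter()

def sample(signum, frame):
    # Walk the interrupted frame upward to reconstruct the call stack.
    stack = []
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    stack_counts[";".join(reversed(stack))] += 1

def busy_work(deadline):
    x = 0
    while time.monotonic() < deadline:
        x += 1  # CPU-bound loop, so the profiling timer keeps firing
    return x

signal.signal(signal.SIGPROF, sample)
# Fire SIGPROF every 10 ms of consumed CPU time (~100 Hz sampling).
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)
busy_work(time.monotonic() + 0.3)
signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling

print(sum(stack_counts.values()), "samples collected")
```

Note that the sampler never sees most loop iterations — exactly the trade-off described above: you miss individual events, but frequently executed stacks dominate the counts.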
100 times per second isn't that much for a CPU, and that's why you get less than a percent of overhead on the CPU, and only about four megabytes of overhead for the memory profiling. You will most definitely miss things, but if it's always on, you will eventually see most of the relevant things. One stack that executed once isn't relevant to us anyway — we want to see the big picture. So this is basically what we're doing: we see the stacks on the left-hand side executing — these are the functions calling each other — and 20 or 100 times per second we take note of exactly which stack is on the CPU, or which stack is allocating, and so on. And that allows us to do it always on, in production. Your machine is not the production environment, so it is pretty important to be able to do this in production, actually see what's happening out there in the real world, and do it with low overhead. We're actually using Linux eBPF, and because we're using something the kernel is doing, we don't even have to change any of your applications. You start one thing and it will start profiling all of your applications — you don't really have to instrument anything. Quickly about me: I'm Matthias Loibl, flew in from Berlin, Germany. I'm the director of Polar Signals Cloud, and I'm also a maintainer of Prometheus, Prometheus Operator, and Parca — the open-source version of everything I'm talking about today — and some other projects. So, we are basically here for GPUs, right? Earlier this year, after working on CPU and memory profiling for the last three or four years, we started a preview of GPU profiling. I'm going to talk about that today, and why we think it's pretty great.
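The aggregated stack samples described above are what a flame graph is built from. Here is a hypothetical sketch (not Parca's actual data model) of turning folded stack counts into a flame-graph tree, where each node's value is the number of samples in which that frame appeared:

```python
# Hypothetical sketch: build a flame-graph tree from folded stacks.
# Keys are "parent;child;..." strings, values are sample counts.
def build_flame_tree(folded_counts):
    root = {"name": "root", "value": 0, "children": {}}
    for stack, count in folded_counts.items():
        root["value"] += count
        node = root
        for frame in stack.split(";"):
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}}
            )
            child["value"] += count
            node = child
    return root

# Made-up samples: 80 in main->load_data, 20 in main->gpu_launch.
tree = build_flame_tree({"main;load_data": 80, "main;gpu_launch": 20})
print(tree["children"]["main"]["value"])  # 100 samples under main
```

The width of a frame in the rendered flame graph is proportional to its node's value — which is why the always-on, sampled approach still surfaces the big picture even though individual events are missed.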
As you can see in this screenshot, we're talking to NVIDIA NVML to get these metrics out of your GPU. The blue line on top is the overall utilization of the node, and the orange line is one particular process on that GPU — over here we can see its process ID. So we see individual processes, but we also see the overall node's utilization. Further down are the memory utilization, the clock speed, and so on. That informs where we want to look at the performance of our system, right? Sometimes we can see the utilization drop, and that might be something we want to investigate to really make sure we are using our GPUs to the fullest. Just a couple more metrics we're collecting: there's the power utilization — the dashed line is the power limit — and then the temperature. Temperature sometimes is important because if you're always at, say, 80 degrees Celsius, you're eventually going to get throttled by the GPU quite significantly. And then PCIe throughput is interesting: are you bound by the data you're transferring between CPU and GPU? Just to repeat: the negative values are receiving, versus the positive ones are sending — say, 10 megabytes per second through PCIe. We can then use all of those metrics to correlate from the GPU metrics to the CPU profiles that we're storing. So we are collecting those CPU stacks using eBPF, like we have done for the last three or four years, and we want to see what is happening on the CPU. In this case we might want to look at a particular stack on CPU 0, right before the end, because there was some activity, for example. We can drag and select a particular time range, and then we are presented with a flame chart. And in the flame chart we can see what the CPU is doing while the GPU is not fully utilized.
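The "utilization drop" workflow above amounts to scanning a metric time series for dips and then pulling the CPU profiles for those windows. Here is a small sketch of the dip-detection half; the function name and the sample data are made up, and real values would come from NVML (e.g. via the `nvidia-ml-py` bindings) rather than a hardcoded list:

```python
# Hypothetical sketch: find time ranges where GPU utilization (percent,
# sampled once per second) falls below a threshold. These are the
# windows worth correlating against the stored CPU profiles.
def find_utilization_dips(samples, threshold=50):
    dips, start = [], None
    for t, util in samples:
        if util < threshold and start is None:
            start = t                      # dip begins
        elif util >= threshold and start is not None:
            dips.append((start, t))        # dip ends
            start = None
    if start is not None:                  # dip runs to end of series
        dips.append((start, samples[-1][0]))
    return dips

# Made-up NVML-style samples: (timestamp_s, utilization_pct).
samples = [(0, 95), (1, 92), (2, 30), (3, 20), (4, 88), (5, 90)]
print(find_utilization_dips(samples))  # [(2, 4)]
```

Selecting the returned range in the UI then corresponds to the drag-and-select step that brings up the flame chart for that window.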
So in this case we can see that Python is eventually calling the CUDA functions further down. But oftentimes you will see that the CPU is pretty busy trying to load data — and not keeping the GPU busy. If you're using Python, we can see it. If you're using Rust to integrate with CUDA, for example, that also works: any compiled language is going to show up in those stack traces, and even some of the interpreted languages — Ruby, Python, the JVM, and so on. So while the focus at this conference is GPUs, it really works with any language and any application. Web servers, databases, and vector databases, for example, are also interested in improving their performance. Now, something super exciting that we first introduced this morning — so this is super fresh, hot off the press — is GPU time profiling. As you heard, with those profiles we look at how much time is spent in individual functions on the CPU, but we are more interested in the GPU time spent by these functions. Here's a small example with CUDA functions. Basically, we tell the Linux kernel that whenever a CUDA kernel is launched from the CPU, it should tell us the start time of that kernel, and then eventually the time when that kernel terminates — and then we know how much time that particular kernel spent on the GPU. That's super interesting, obviously, because now we can actually see how much GPU time these individual functions are taking. And here's a bit more of a real-world example.
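The bookkeeping behind that launch/terminate pairing can be sketched as follows. This is an illustrative model only — the actual agent does this via eBPF hooks in the kernel — and the event tuples, kernel names, and correlation ids here are all made up:

```python
from collections import defaultdict

# Hypothetical sketch: pair kernel-launch and kernel-completion
# timestamps by correlation id, then sum GPU time per kernel name.
def gpu_time_per_kernel(events):
    starts = {}                    # corr_id -> (kernel name, launch ts)
    totals = defaultdict(float)    # kernel name -> total GPU seconds
    for kind, corr_id, name, ts in events:
        if kind == "launch":
            starts[corr_id] = (name, ts)
        elif kind == "complete" and corr_id in starts:
            name, start_ts = starts.pop(corr_id)
            totals[name] += ts - start_ts
    return dict(totals)

# Made-up event stream: kernels can overlap, so pairing is by id.
events = [
    ("launch",   1, "matmul_kernel",  0.000),
    ("launch",   2, "softmax_kernel", 0.001),
    ("complete", 1, "matmul_kernel",  0.010),
    ("complete", 2, "softmax_kernel", 0.004),
    ("launch",   3, "matmul_kernel",  0.012),
    ("complete", 3, "matmul_kernel",  0.020),
]
print(gpu_time_per_kernel(events))
```

Attributing these per-kernel durations back to the CPU stack that launched each kernel is what produces the flame chart described next, where the leaf frame's width is GPU time rather than CPU samples.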
So at the top — on the right-hand side — we can see the main function in Python, calling down into libcuda down here, and the width of these stacks is the actual time these functions took on the GPU. So this is showing CPU and GPU: the stack on the CPU goes down to here, and the leaf of each stack is the function that was taking time on the GPU. Okay — what are the colors? The colors are different binaries running on your machine. That's why blue up here, for example, is Python, and then there's some CUDA down here. Great question. How do you get started? Because we run on Linux using eBPF, you have a binary that you can run — using systemd; Docker works as well — but we also have a DaemonSet for Kubernetes. You deploy that: you get the manifest YAML and give it a token. Some of our customers are already using it for CPU and memory profiling, and they're starting to also integrate their platforms with our GPU profiling — Turbopuffer especially are interested in improving the performance of their vector engine. And that's really it. Please visit our booth — the first 10 people to sign up for a consultation get two hours for free, if you want, and we can also do discounts for seed and Series A startups. That's really it. Thank you so much.