Continuous Profiling for GPUs — Matthias Loibl, Polar Signals
Channel: aiDotEngineer
Published at: 2025-07-22
YouTube video id: wt8gzWR6auQ
Source: https://www.youtube.com/watch?v=wt8gzWR6auQ
Great — thank you for coming. I'm going to talk about maximizing GPU efficiency with continuous profiling for GPUs. So, what is profiling? Profiling is pretty much as old as programming — it was first done in the 1970s, when some IBM folks were trying to figure out what was happening on their computers. So it's been around basically forever in computer science. And what do we profile? Basically anything that can inform our view of the system: memory, CPU, and GPU time spent, the usage of individual instructions, and the frequency and duration of function calls. There are a lot of different approaches to profiling, but generally speaking it's super important to performance engineering. Why would we do this? Obviously to improve performance, and then also to save money: if we improve our software by 10%, we might be able to turn off 10% of our servers and save a bunch of money, right? And there are two different kinds of profiling that we typically see these days. One is tracing profiling, where you record each and every event, all the time, constantly. That's great for getting the best possible view of the system, but it's pretty high cost and generates a lot of data, so it's hard to do continuously. And that is why we do sampled profiling: we sample for a certain duration, say 10 seconds, and only sample 100 times per second, or 20 times per second, and so on — you can tweak how often you want to sample.
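To make the idea of sampled profiling concrete, here is a minimal toy sampler in pure Python. It is an illustrative sketch only — Polar Signals' agent samples in the kernel via eBPF, not with signals — and all names here are made up. It uses `ITIMER_PROF` to interrupt the program roughly 100 times per second of consumed CPU time and records which call stack was running at each tick:

```python
import signal
import time
from collections import Counter

# Folded stack -> number of times we saw it on-CPU (hypothetical sketch).
stack_counts = Counter()

def sample(signum, frame):
    # Walk the interrupted frame upward to reconstruct the call stack.
    stack = []
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    stack_counts[";".join(reversed(stack))] += 1

def busy_work(deadline):
    x = 0
    while time.monotonic() < deadline:
        x += 1  # CPU-bound loop, so the profiling timer keeps firing
    return x

signal.signal(signal.SIGPROF, sample)
# Fire SIGPROF every 10 ms of consumed CPU time (~100 Hz sampling).
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)
busy_work(time.monotonic() + 0.3)
signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling

print(sum(stack_counts.values()), "samples collected")
```

Note that the sampler never sees most loop iterations — exactly the trade-off described above: you miss individual events, but frequently executed stacks dominate the counts.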
100 times per second isn't that much for a CPU, and that's why you get less than a percent of overhead on the CPU, and only about four megabytes of overhead for the memory profiling. You will most definitely miss things, but if it's always on, you will eventually see most of the relevant things. One stack that executed once isn't relevant to us anyway — we want to see the big picture. So this is basically what we're doing: we see the stacks on the left-hand side executing — these are the functions calling each other — and 20 or 100 times per second we take note of exactly which stack is on the CPU, or which stack is allocating, and so on. And that allows us to do it always on, in production. Your machine is not the production environment, so it is pretty important to be able to do this in production, actually see what's happening out there in the real world, and do it with low overhead. We're actually using Linux eBPF, and because we're using something the kernel is doing, we don't even have to change any of your applications. You start one thing and it will start profiling all of your applications — you don't really have to instrument anything. Quickly about me: I'm Matthias Loibl, flew in from Berlin, Germany. I'm the director of Polar Signals Cloud, and I'm also a maintainer of Prometheus, Prometheus Operator, and Parca — the open-source version of everything I'm talking about today — and some other projects. So, we are basically here for GPUs, right? Earlier this year, after working on CPU and memory profiling for the last three or four years, we started a preview of GPU profiling. I'm going to talk about that today, and why we think it's pretty great.
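The aggregated stack samples described above are what a flame graph is built from. Here is a hypothetical sketch (not Parca's actual data model) of turning folded stack counts into a flame-graph tree, where each node's value is the number of samples in which that frame appeared:

```python
# Hypothetical sketch: build a flame-graph tree from folded stacks.
# Keys are "parent;child;..." strings, values are sample counts.
def build_flame_tree(folded_counts):
    root = {"name": "root", "value": 0, "children": {}}
    for stack, count in folded_counts.items():
        root["value"] += count
        node = root
        for frame in stack.split(";"):
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}}
            )
            child["value"] += count
            node = child
    return root

# Made-up samples: 80 in main->load_data, 20 in main->gpu_launch.
tree = build_flame_tree({"main;load_data": 80, "main;gpu_launch": 20})
print(tree["children"]["main"]["value"])  # 100 samples under main
```

The width of a frame in the rendered flame graph is proportional to its node's value — which is why the always-on, sampled approach still surfaces the big picture even though individual events are missed.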
As you can see in this screenshot, we're talking to NVIDIA NVML to get these metrics out of your GPU. The blue line on top is the overall utilization of the node, and the orange line is one particular process on that GPU — over here we can see its process ID. So we see individual processes, but we also see the overall node's utilization. Further down are the memory utilization, the clock speed, and so on. That informs where we want to look at the performance of our system, right? Sometimes we can see the utilization drop, and that might be something we want to investigate to really make sure we are using our GPUs to the fullest. Just a couple more metrics we're collecting: there's the power utilization — the dashed line is the power limit — and then the temperature. Temperature sometimes is important because if you're always at, say, 80 degrees Celsius, you're eventually going to get throttled by the GPU quite significantly. And then PCIe throughput is interesting: are you bound by the data you're transferring between CPU and GPU? Just to repeat: the negative values are receiving, versus the positive ones are sending — say, 10 megabytes per second through PCIe. We can then use all of those metrics to correlate from the GPU metrics to the CPU profiles that we're storing. So we are collecting those CPU stacks using eBPF, like we have done for the last three or four years, and we want to see what is happening on the CPU. In this case we might want to look at a particular stack on CPU 0, right before the end, because there was some activity, for example. We can drag and select a particular time range, and then we are presented with a flame chart. And in the flame chart we can see what the CPU is doing while the GPU is not fully utilized.
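The "utilization drop" workflow above amounts to scanning a metric time series for dips and then pulling the CPU profiles for those windows. Here is a small sketch of the dip-detection half; the function name and the sample data are made up, and real values would come from NVML (e.g. via the `nvidia-ml-py` bindings) rather than a hardcoded list:

```python
# Hypothetical sketch: find time ranges where GPU utilization (percent,
# sampled once per second) falls below a threshold. These are the
# windows worth correlating against the stored CPU profiles.
def find_utilization_dips(samples, threshold=50):
    dips, start = [], None
    for t, util in samples:
        if util < threshold and start is None:
            start = t                      # dip begins
        elif util >= threshold and start is not None:
            dips.append((start, t))        # dip ends
            start = None
    if start is not None:                  # dip runs to end of series
        dips.append((start, samples[-1][0]))
    return dips

# Made-up NVML-style samples: (timestamp_s, utilization_pct).
samples = [(0, 95), (1, 92), (2, 30), (3, 20), (4, 88), (5, 90)]
print(find_utilization_dips(samples))  # [(2, 4)]
```

Selecting the returned range in the UI then corresponds to the drag-and-select step that brings up the flame chart for that window.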
So in this case we can see that Python is eventually calling the CUDA functions further down. But oftentimes you will see that the CPU is pretty busy trying to load data — and not keeping the GPU busy. If you're using Python, we can see it. If you're using Rust to integrate with CUDA, for example, that also works: any compiled language is going to show up in those stack traces, and even some of the interpreted languages — Ruby, Python, the JVM, and so on. So while the focus at this conference is GPUs, it really works with any language and any application. Web servers, databases, and vector databases, for example, are also interested in improving their performance. Now, something super exciting that we first introduced this morning — so this is super fresh, hot off the press — is GPU time profiling. As you heard, with those profiles we look at how much time is spent in individual functions on the CPU, but we are more interested in the GPU time spent by these functions. Here's a small example with CUDA functions. Basically, we tell the Linux kernel that whenever a CUDA kernel is launched from the CPU, it should tell us the start time of that kernel, and then eventually the time when that kernel terminates — and then we know how much time that particular kernel spent on the GPU. That's super interesting, obviously, because now we can actually see how much GPU time these individual functions are taking. And here's a bit more of a real-world example.
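The bookkeeping behind that launch/terminate pairing can be sketched as follows. This is an illustrative model only — the actual agent does this via eBPF hooks in the kernel — and the event tuples, kernel names, and correlation ids here are all made up:

```python
from collections import defaultdict

# Hypothetical sketch: pair kernel-launch and kernel-completion
# timestamps by correlation id, then sum GPU time per kernel name.
def gpu_time_per_kernel(events):
    starts = {}                    # corr_id -> (kernel name, launch ts)
    totals = defaultdict(float)    # kernel name -> total GPU seconds
    for kind, corr_id, name, ts in events:
        if kind == "launch":
            starts[corr_id] = (name, ts)
        elif kind == "complete" and corr_id in starts:
            name, start_ts = starts.pop(corr_id)
            totals[name] += ts - start_ts
    return dict(totals)

# Made-up event stream: kernels can overlap, so pairing is by id.
events = [
    ("launch",   1, "matmul_kernel",  0.000),
    ("launch",   2, "softmax_kernel", 0.001),
    ("complete", 1, "matmul_kernel",  0.010),
    ("complete", 2, "softmax_kernel", 0.004),
    ("launch",   3, "matmul_kernel",  0.012),
    ("complete", 3, "matmul_kernel",  0.020),
]
print(gpu_time_per_kernel(events))
```

Attributing these per-kernel durations back to the CPU stack that launched each kernel is what produces the flame chart described next, where the leaf frame's width is GPU time rather than CPU samples.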
So at the top — on the right-hand side — we can see the main function in Python, calling down into libcuda down here, and the width of these stacks is the actual time these functions took on the GPU. So this is showing CPU and GPU: the stack on the CPU goes down to here, and the leaf of each stack is the function that was taking time on the GPU. Okay — what are the colors? The colors are different binaries running on your machine. That's why blue up here, for example, is Python, and then there's some CUDA down here. Great question. How do you get started? Because we run on Linux using eBPF, you have a binary that you can run — using systemd; Docker works as well — but we also have a DaemonSet for Kubernetes. You deploy that: you get the manifest YAML and give it a token. Some of our customers are already using it for CPU and memory profiling, and they're starting to also integrate their platforms with our GPU profiling — Turbopuffer especially are interested in improving the performance of their vector engine. And that's really it. Please visit our booth — the first 10 people to sign up for a consultation get two hours for free, if you want, and we can also do discounts for seed and Series A startups. That's really it. Thank you so much.