AI Kernel Generation: What's working, what's not, what's next – Natalie Serrino, Gimlet Labs

Channel: aiDotEngineer

Published at: 2025-12-17

YouTube video id: 6guQG_tGt0o

Source: https://www.youtube.com/watch?v=6guQG_tGt0o

Hey everyone, how's it going? My name is Natalie, and I'm a co-founder of Gimlet Labs. Just a little bit of background on why Gimlet is looking at AI-generated kernels, and then let's get right to it. We're building an agentic inference cloud focused on performance and efficiency. And the thing we've seen with all these talks so far is that agents aren't just one chat model. They're complex pipelines of multiple models, multiple stages, and tool calls, and the compute backing them should inherently be heterogeneous. So what we do is automatically split up and orchestrate these agentic workloads across optimal hardware, which can span different vendors and different sizes.
This can present a problem at the kernel level, because a lot of the time you have models that are really optimized for just one piece of hardware. So what we started looking at is whether we can use AI to automatically port different segments of agentic workloads to hardware they haven't necessarily been optimized for.
Just to clarify something really quick, because we run into this a lot: what do I mean by kernels? I do not mean AI generating operating systems like the Linux kernel. I mean kernels in the transformer-architecture sense: the individual functions that perform massively parallel computations, leveraging the huge number of threads that GPUs have. So yes, people ask, "Oh, how are you going to generate an operating system?" Maybe we're not quite there yet, but one day.
So why use AI to do this? I think there are a few reasons. We know that optimizing low-level kernels can make ML workloads significantly faster. Here (it's probably too small to see) is a blog from NVIDIA where they implemented a different attention, and it allowed them to get 3x throughput on a Llama model. These implementations can make a major difference from a performance perspective. But at the same time, if you just search Twitter, everyone is complaining about how impossible it is to find these people, and the ones who exist are really overtaxed, with so much to do and so much work. There just aren't enough experts to solve every problem in this space right now.
And the problem explodes because you have so many frameworks and so many ways to write kernels, from CUDA and Triton to Pallas to device-specific options like Metal. You have different hardware platforms too, and each of these platforms, even within a single vendor, has different characteristics. We've seen, for example, that some of the older DSLs weren't working as well on some of NVIDIA's new hardware, because different hardware has different properties, different features, different cache sizes, and so on, all of which impact the optimal implementation from a kernel perspective. So we, and many others in the space, have thought it would be great if AI could help with this problem: you could potentially give it PyTorch code and generate optimized implementations for whatever hardware you're trying to run that workload on.
So I think when you're trying to use an agent for something, you have to start with what the human workflow is. And the human workflow today, when you have that really hardcore kernel expert, say they're trying to port a new workload over to Metal, goes like this: "Okay, I have this implementation. Maybe I have a CUDA version, maybe I don't. I'm going to try something. I'll see if it compiles. Most of the time, maybe not. I'll see if it runs. I'll see if it's correct." And if none of those are the case, you pass that back into the human context, so to speak. Then, once you get something that's working, you start looking at the profiling information in depth and hammering down: this is the bottleneck now, this is the bottleneck now, this is the bottleneck now. It's a very iterative process.
So the idea here is basically to put AI where the human would go in that same loop. The agentic flow is to make sure the kernel compiles, that it executes, and that it's correct, and then, from there, to optimize it.
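As a rough sketch of that loop in Python (every helper below is a hypothetical placeholder; a real harness would call an LLM, a device compiler, and actual hardware):

```python
# Minimal sketch of the generate -> compile -> run -> verify -> profile loop.
# All helpers here are hypothetical stand-ins, not a real implementation.

def propose(history):     # hypothetical: ask the model for a candidate kernel
    return "candidate_source"

def compiles(src):        # hypothetical: invoke the device compiler
    return True

def is_correct(src):      # hypothetical: compare output against the reference op
    return True

def benchmark(src):       # hypothetical: timed run on the target hardware
    return 1.0

def search(budget=10):
    best_src, best_time = None, float("inf")
    history = []  # failures and timings get fed back as context
    for _ in range(budget):
        src = propose(history)
        if not compiles(src):
            history.append((src, "compile error"))
            continue
        if not is_correct(src):
            history.append((src, "wrong result"))
            continue
        t = benchmark(src)
        history.append((src, f"{t:.3f} ms"))
        if t < best_time:
            best_src, best_time = src, t
    return best_src, best_time
```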
This is, I would say, very new technology. There's a lot of interest here, but there are some things it's good at and some things that are still in development. So let's dive into some of the specifics.
So this is a quick demo of our system. The font's kind of small, but we're passing a PyTorch workload into a CLI tool, targeting an H100, and the system explores a bunch of candidate optimizations. It compares against eager mode and torch.compile, and it found one candidate that was 22% faster than the torch.compile baseline. This was a real case; the video is just sped up, because the run actually took about 20 minutes.
There are some challenges, though, with measuring these agents at kernel synthesis. First of all, you have to figure out what your definition of "correct" is when you're dealing with floating point; that's always a question. You can use different kinds of tolerances, but you also need to make sure your input sizes are well selected. If you only pass in really small inputs, the benchmark ends up measuring launch overhead rather than the kernel itself on the critical path.
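As a rough sketch of what a tolerance-based check might look like in PyTorch (the tolerance values and stand-in ops below are illustrative assumptions, not anyone's actual harness):

```python
import torch

def check_candidate(candidate_fn, reference_fn, make_input, n_trials=5):
    """Compare a candidate kernel against the reference op on several
    randomly generated inputs, using floating-point tolerances."""
    for _ in range(n_trials):
        x = make_input()
        # rtol/atol are illustrative; the right values depend on dtype and
        # on how much floating-point reordering you're willing to allow.
        torch.testing.assert_close(
            candidate_fn(x), reference_fn(x), rtol=1e-3, atol=1e-3
        )

# Inputs large enough that the kernel, not launch overhead, dominates.
check_candidate(
    candidate_fn=lambda x: torch.relu(x),          # stand-in candidate
    reference_fn=lambda x: torch.clamp(x, min=0),  # stand-in reference
    make_input=lambda: torch.randn(4096, 4096),
)
```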
You also have to make sure you're reliably measuring performance. If you just start a naive timer around your implementation, it's probably going to be wrong; there was a great blog with a diagram of this, because you'd basically be measuring the launch time, not the execution time. There are a bunch of gotchas like that. When you're building an agentic system like this, you have to be really careful about doing things like warm-ups and cache clearing, because a lot of the time you'll run the original implementation, then the new implementation, and the original's result is cached and the new one just fetches it. You have to be really neurotic about all of this, otherwise you might get bad results.
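Here's a minimal sketch of the usual CUDA pattern: warm-up runs first, then event-based timing so you measure GPU execution rather than the host-side launch (the iteration counts below are arbitrary; on Apple's Metal backend you'd synchronize via torch.mps instead):

```python
import torch

def time_kernel(fn, x, warmup=10, iters=100):
    # Warm-up runs: trigger JIT compilation, autotuning, and cache effects
    # before anything is timed.
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()

    # CUDA events record timestamps on the GPU itself, so this measures
    # execution time rather than just the kernel launch.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```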
You also need great benchmarks for this. I think someone said earlier that there aren't a ton of examples of low-level kernels across all these different hardware platforms, so the input data is a challenge, and benchmarking is a challenge too. How do you know if your agent got better after you change its prompt? It's basically the same story we hear with every agent here.
So we have some preliminary results that we're sharing now, on Apple's M4 using the Metal framework, on the KernelBench benchmark (the v0.1 version, which is the latest). What we can see here are results across 250 problems, compared against either torch.compile or eager mode, whichever of those is faster. The KernelBench dataset has different tiers of problems, with Level 1 being the simplest and Level 3 being more complex. For the standalone agent, we see an average speedup of about 24-25%, and the sweet spot is the moderately complex problems. Honestly, it seems the same as a lot of coding problems: the agent is good at moderately complex things, but push it too far and performance drops off. So an interesting challenge is going to be how to make these agents perform better on more complex problems that they'll have to break down and execute.
There we go. Let's talk about a couple of examples, because I love to see example code. This was a success case where the model found an opportunity for kernel fusion. For those not that familiar with GPU kernels, kernel fusion is one of the most go-to techniques in kernel optimization: you have multiple kernels, in this case a convolution, softmax, bias, scaling, and sigmoid. Those were five ops, and what the agent did was take four of them and, instead of running individual functions for each, write one mega-function that compacted them all together. Kernel fusion isn't new; it's something torch.compile already does quite well. But it's a common way we've found agents can speed up these workloads, because you can really customize the fusion to the specific use case. This result achieved a 40% speedup over the baseline on the M4. Zooming into what happened: the agent wrote a fused op. It basically wrote C++ code that it embedded as an inline string alongside the PyTorch code, then called that fused operation in the model's forward pass, and in the snippet up here you can see that fused implementation taking those four ops and putting them together in one mega-op. And this was done automatically.
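Just to illustrate the mechanics (this is not the agent's actual kernel): with PyTorch's load_inline, C++ source in a string gets compiled and bound at runtime. The toy op below composes ATen calls rather than writing a genuinely fused device kernel, but the embed-and-call pattern is the same:

```python
import torch
from torch.utils.cpp_extension import load_inline

# Hypothetical fused elementwise op: bias add, scaling, and sigmoid,
# exposed to Python as a single callable.
cpp_source = r"""
#include <torch/extension.h>

torch::Tensor fused_bias_scale_sigmoid(torch::Tensor x,
                                       torch::Tensor bias,
                                       double scale) {
    return torch::sigmoid((x + bias) * scale);
}
"""

# Compiles the C++ at runtime and auto-generates Python bindings.
fused = load_inline(
    name="fused_demo",
    cpp_sources=cpp_source,
    functions=["fused_bias_scale_sigmoid"],
)

x = torch.randn(8, 16)
bias = torch.randn(16)
out = fused.fused_bias_scale_sigmoid(x, bias, 2.0)
```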
Sometimes, though, writing low-level kernels isn't the best optimization available. We had another case, on a Level 1 problem, that improved performance by 80%. The insight the agent had was that Metal's average pool 1D operation was not as well optimized as some other Metal ops. So what it did was actually rewrite the PyTorch code to use the more optimized op, re-expressing the same problem in a different way.
To dive into this: average pool 1D takes averages across one dimension. You can see how that input vector produces the output vector, with 5 and 7 averaging to 6, and so on. If you express the same thing as a convolution, you get the same result; do the math and it leads to the same output. That's what the agent did. Instead of making the original call to the baseline op, it generated a weights matrix and executed the operation as a convolution, because it knew convolutions are really fast on Metal.
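You can check that equivalence directly in PyTorch: a convolution with uniform weights of 1/k, applied with stride k, reproduces average pooling (the sizes below are just for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8)  # (batch, channels, length)
k = 2                     # pooling window

pooled = F.avg_pool1d(x, kernel_size=k)

# Same computation as a convolution: uniform weights of 1/k, stride k.
weight = torch.full((1, 1, k), 1.0 / k)
as_conv = F.conv1d(x, weight, stride=k)

assert torch.allclose(pooled, as_conv, atol=1e-6)
```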
There's also an interesting algorithmic optimization case, on a Level 3 (more complex) problem, where the agent figured out it could combine two operations into a single operation at the PyTorch level, without even using low-level kernels. What we can see is that it rewrote the computation as Python code that calls a single convolution, and that's a lot more efficient because you don't have to launch as many ops.
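The talk doesn't spell out which two ops were merged, but the general pattern looks something like this hypothetical case: a convolution followed by an elementwise scale folds into a single convolution by scaling the weights, so only one op gets launched:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 16)
w = torch.randn(8, 3, 3, 3)
scale = 0.5  # hypothetical second op: elementwise scaling

# Two launched ops: convolution, then scaling.
two_ops = F.conv2d(x, w) * scale

# One launched op: the scale folded into the convolution weights.
one_op = F.conv2d(x, w * scale)

assert torch.allclose(two_ops, one_op, atol=1e-5)
```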
But this does not always work. This is not a silver bullet, and I think that's really important to emphasize. A case where the agent totally faceplanted was matrix multiplication: it wrote a custom CUDA kernel that was a lot slower than the baseline. The thing is, matrix multiply is one of the most hand-optimized ops in existence, so it's not that surprising that an agent would lose to something a human expert spent a long time on. This is an area where it did not work.
Another case we saw had a 71,000x speedup, and anything like that should trigger your suspicion brain. Wow, 71,000x. Great, we're done; this technology is worth billions of dollars, right? No. So what happened? The operation being tested basically says: give me inputs and I'll make sure they fall between -1 and 1. The agent figured out that for all of the test cases this was already true, so it wrote a nice long comment saying the operation was actually unnecessary, and just output the input.
[laughter]
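Reconstructing roughly what happened (the op and the input distribution below are assumptions, but they show why the shortcut passed verification):

```python
import torch

def reference(x):
    # The benchmarked op: force values into [-1, 1].
    return torch.clamp(x, -1.0, 1.0)

def agent_candidate(x):
    # The "optimization": the test inputs were already in range,
    # so clamping is a no-op and the input is returned unchanged.
    return x

x = torch.rand(1024) * 2 - 1  # test inputs already drawn from [-1, 1]
assert torch.equal(reference(x), agent_candidate(x))  # passes on these inputs only
```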
So you could argue this is the agent being smart, because it's pruning unnecessary work, but I think a lot of us would agree it's not in the spirit of what we're trying to benchmark here. We've excluded cases like this from our analysis, but it is interesting, because sometimes you might actually want the agent to do something like that. And I think this is part of where the human element comes in with these agents. Sometimes the agent does something that, depending on your definition of what you want to see, could be good or bad. And that's where the human part weighs in.
I keep drawing parallels to other kinds of coding agents, because even though this is kind of a niche, low-level domain, I don't think the story is fundamentally different. We see that standalone agents are really good at cheaply generating lots of different ideas and possibilities to explore. They're good at slurping in a ton of different context and seeing what helps. And they're really good at those Level 1 and Level 2 tasks; for example, we're still not asking AI agents to write the Linux kernel. But what is still needed is robust quality and performance validation. We need to make sure the agents aren't cheating, and we need to make sure the results are actually correct. We need empirical data from hardware in the loop to guide the search and optimization, because it's actually really hard to look at low-level code and know how it's going to perform on the hardware; we still rely heavily on profiling data and things like that. And we also need a human in the loop to supervise the results and guide the work. So in the design of a modern agent, you have multiple sub-agents working together, you have that human in the loop, and you have a purpose-built harness for the task. And I think this is the pattern we've seen throughout this conference.
To get a little bit into what that architecture looks like (and this is what we're building at Gimlet): you have a supervisor agent, which takes in input code, target hardware, and also human prompting, because humans can still really guide the best path for optimization. That supervisor is in charge of managing the work. It deploys the synthesis agent swarm, which collectively works to come up with ideas for optimizations; they're basically the idea factory, coming up with new techniques. Those ideas get sent to the verification agent, which runs them on actual hardware, in a hardware-in-the-loop system, to see how they do. And that verification agent needs to be extremely strict about making sure no funny business is happening. That's a major part of the challenge.
Just a couple more realistic case studies that aren't benchmarks. We got really excited because we ran this on a vision transformer model, and, I don't know if you can see, but comparing the original vanilla implementation against our generated code, both using torch.compile, ours was twice as fast. So this was a hooray moment. The speedups were promising, but then it turned out the optimization was just swapping the original attention module for SDPA, which is a more optimized attention module. And yes, that's a valid optimization, but I wouldn't necessarily call it rocket science. So we consider it a trivial case study: if you're not already using an optimized attention module, maybe you haven't actually optimized your workload that much yet.
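The swap itself is simple: a naive attention implementation and PyTorch's fused scaled_dot_product_attention compute the same result (the shapes below are arbitrary):

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Naive attention: materializes the full (seq_len, seq_len) score matrix.
scores = (q @ k.transpose(-2, -1)) / (64 ** 0.5)
naive = torch.softmax(scores, dim=-1) @ v

# The swapped-in version: PyTorch's fused attention op.
fused = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(naive, fused, atol=1e-4)
```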
But we do still see interesting results on full models when we have human prompting. One case was an audio encoder model, where the system generated six custom kernels for the workload, specialized for the RTX 6000 Blackwell. The results were strong: about 70% faster, with both implementations using torch.compile.
Just to show an example: we load six different fused kernels inline and then call them in the code. The nice thing about this approach, even though it's a little weird declaring kernels as strings, is that you get a completely API-compatible swap-in replacement for the original PyTorch module.
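As a sketch of that swap-in pattern (the module and names below are hypothetical, with plain torch ops standing in for the generated kernels):

```python
import torch
from torch import nn

class FusedBlock(nn.Module):
    """Hypothetical drop-in replacement: same constructor and forward
    contract as the original block, but forward() would dispatch to the
    inline-compiled fused kernels."""

    def __init__(self, dim):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Stand-in for a call into the generated fused kernel.
        return torch.sigmoid(x + self.bias)

model = nn.Sequential(nn.Linear(16, 16), nn.Identity())
model[1] = FusedBlock(16)  # swap in; no other call sites change
out = model(torch.randn(4, 16))
```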
So where are we with AI-driven kernel optimization? Like I said before, this is not a silver bullet, but it is a promising new tool in the toolbox. The best applications we see are things like searching across a big bag of tricks: we know fusion works, we know tiling works, and by launching experiments with agents we can run lots of them really quickly and see what actually performs best on the workload. It's also good at porting existing implementations to new hardware, taking the insights from the original implementation and specializing them to the features available on the new target. And it's good at translating existing optimizations to new scenarios, so you can quickly adopt them; say you're changing the quantization of your model, you can still look at differently quantized implementations to guide that optimization.
In terms of the worst applications: we're still not at the point where agents are writing the n+1 of FlashAttention, coming up with those genius algorithmic advances. And they're not currently outperforming a human expert who banged their head on a problem for months, nor should we expect them to. I think the most exciting part of this work is letting those people focus on the most interesting optimizations, while getting us better-than-baseline results on all the problems they don't have time for.
So what's next in this work? We want to build abstract models of different machines to help the agents further specialize code for individual hardware. We're also interested in generating what is basically NVIDIA assembly, such as PTX (you can see an example here), because the thought is that AI can do that better than humans, since it's so cumbersome for people to write. And we're also looking at academic formal verification methods for correctness.
I also want to give a huge shout-out to my colleagues; they are the silent, unsung heroes here. And I love talking about this with people, so please feel free to send me an email if you want to talk about kernel generation or anything I covered. And we are hiring, so if this problem interests you, we'd love to chat.

Thanks.