AI Kernel Generation: What's working, what's not, what's next – Natalie Serrino, Gimlet Labs
Channel: aiDotEngineer
Published at: 2025-12-17
YouTube video id: 6guQG_tGt0o
Source: https://www.youtube.com/watch?v=6guQG_tGt0o
[music] Hey everyone, how's it going? So, my name is Natalie. I'm a co-founder of Gimlet Labs, and here's a little bit of background about why Gimlet is looking at AI-generated kernels. Let's get right to it. We're building an agentic inference cloud focused on performance and efficiency. The thing we've seen with all these talks so far is that agents aren't just one chat model. They're complex pipelines of multiple models, multiple stages, and tool calls, and the compute backing them should inherently be heterogeneous. So what we do is automatically split up and orchestrate these agentic workloads across optimal hardware, which can be different vendors and different sizes. That can present a problem at the kernel level, because a lot of the time you have models that are really optimized for just one piece of hardware. So we started looking at whether we can use AI to automatically port different segments of agentic workloads to hardware they haven't necessarily been optimized for.

Just to clarify something really quick, because we run into this a lot: what do I mean by kernels? I do not mean AI generating operating systems like the Linux kernel or things like that. I mean kernels in the sense of a transformer architecture: the individual functions that perform massively parallel computations, leveraging the crazy number of threads that GPUs have. So yes, people ask, "Oh, how are you going to generate an operating system?" Maybe we're not quite there yet, but one day.

So why use AI to do this? I think there are a few reasons. We know that optimizing low-level kernels can make ML workloads significantly faster. It's probably too small to see, but here's a blog from NVIDIA where they implemented a different attention and it allowed them to get 3x throughput on a Llama model. These implementations can make a major difference from a performance perspective. But at the same time, if you just search Twitter, everyone's complaining about how impossible it is to find these people, and the people that exist are really overtaxed with so much work to do. There just aren't enough experts to solve every problem in this space right now. And the problem explodes because you have so many frameworks and so many ways to write kernels, from CUDA and Triton to Pallas to device-specific things like Metal, and you have different hardware platforms too, and each of those platforms, even within a single vendor, has different characteristics. We've seen, for example, that some of the older DSLs weren't working as well on some of the newer NVIDIA hardware, because different hardware has different properties, different features, different cache sizes, and so on, all of which affect what the optimized implementation looks like from a kernel perspective. So we, and many others in the space, have thought it would be great if AI could help with this problem: you could potentially give it PyTorch code and generate optimized implementations for whatever hardware you're trying to run that workload on.

So I think when you're trying to use an agent for something, you have to start with what the human workflow is. And the human workflow today, when you have that really hardcore kernel expert, let's say they're trying to port a new workload over to Metal, right?
What they'll do is say, "Okay, I have this implementation. Maybe I have a CUDA version, maybe I don't. I'm going to try something. I'll see if it compiles." Most of the time, maybe not. I'll see if it runs. I'll see if it's correct. And if none of those are the case, you pass that back into the human context, so to speak. Then once you get something that's working, you start looking at the profiling information in depth and hammering down: this is the bottleneck now, this is the bottleneck now, this is the bottleneck now. It's a very iterative process. So the idea here is basically to put AI where the human would go in that same loop. The agentic flow is to make sure it compiles, it executes, and it's correct, and then from there, to optimize it. This is something I would say is very new technology. There's a lot of interest here, but there are some things it's good at and some things that are still in development. So let's dive into some of the specifics.

This is a quick demo of our system. The font's kind of small, but we're passing a PyTorch workload into a CLI tool, targeting an H100, and the system explored a bunch of candidate optimizations. It's comparing against eager mode and torch.compile, and it found one candidate that was 22% faster than the torch.compile baseline. This was a real case; the video is just sped up, because it actually took about 20 minutes.

There are some challenges, though, with measuring these agents on kernel synthesis. First of all, you have to figure out what your definition of correct is when you're dealing with floating point. This is always a question. You can use different types of tolerances, but you also need to make sure your input sizes are well selected. If you only pass in really small inputs, it can cause problems with the benchmarking, where you're measuring the overhead rather than the actual kernel as the critical path. You also have to make sure you're reliably measuring performance. If you just put a naive timer around your implementation, it's probably going to be wrong; there was a great blog with a diagram for this, because you're basically measuring the launch time, not the execution time. So there are a bunch of gotchas like that. When you're building an agentic system like this, you have to be really careful about things like warm-ups and cache clearing, because a lot of times you'll have the original implementation run, then the new implementation run, and the original one's result is cached and the new one fetches it. You have all kinds of things like that you have to be really neurotic about, otherwise you might get bad results.

You also need great benchmarks for this. I think someone said earlier that there aren't a ton of examples of low-level kernels across all these different hardware platforms. So the input data is a challenge, and benchmarking is a challenge too. How do you know if your agent is better after you change its prompt? It's the same story we hear with every agent here, basically. So we have some preliminary results that we're sharing right now, on Apple's M4 using the Metal framework, and this is on the KernelBench benchmark, the v0.1 version, which is the latest one.
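Just to make those measurement pitfalls concrete, here is a minimal sketch of the kind of check-and-time harness this implies. It's purely illustrative (the function names, tolerances, and iteration counts are my own, not Gimlet's harness), and it assumes a CUDA device; on Apple's Metal backend you'd synchronize through the MPS APIs instead.

```python
import torch

def check_and_time(candidate, reference, make_inputs, iters=50, warmup=10):
    """Verify a candidate kernel against a reference op, then time it on-device."""
    # Correctness: with floating point, compare under a tolerance rather than
    # exact equality, and use realistically sized inputs so the kernel, not the
    # launch overhead, is the critical path being measured.
    x = make_inputs()
    assert torch.allclose(reference(*x), candidate(*x), rtol=1e-3, atol=1e-4), \
        "candidate output does not match the reference"

    # Warm-up: the first calls include compilation, allocator growth, and cold caches.
    for _ in range(warmup):
        candidate(*x)

    # GPU launches are asynchronous, so a naive host-side timer mostly measures
    # launch time. CUDA events record timestamps on the device itself.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        candidate(*x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```

In practice you'd also regenerate inputs between the reference run and the candidate run, so a result left warm in cache by one implementation can't flatter the other; that's exactly the kind of detail you have to be neurotic about.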
So, looking at the results: this is across 250 problems, and it compares against either torch.compile or eager mode, depending on which of those is faster. With the KernelBench dataset, we have different tiers of problems, with L1 being the easiest, or simplest rather, and L3 being more complex. What we see for the standalone agent is an average speedup of about 24-25%, and the sweet spot is the moderately complex problems. Honestly, it seems like a lot of coding problems, where the model is good at moderately complex things, but you push it too far and the performance drops off. So an interesting challenge is going to be how we make these agents perform better on the more complex problems that they have to break down and execute.

Let's talk about a couple of examples, because I love to just see example code. This was a success case where the model found an opportunity for kernel fusion. For those who aren't that familiar with GPU kernels, kernel fusion is one of the go-to techniques in kernel optimization: instead of running individual functions for a series of ops, you make one mega function that compacts them together. In this case there were five ops, a convolution, softmax, bias, scaling, and sigmoid, and the agent took four of those ops and fused them into a single function. Kernel fusion isn't new; it's something torch.compile already does quite well, but it's a common way we've found agents can speed up these workloads, because you can really customize it to the specific use case. This result achieved a 40% speedup over the baseline on the M4. Zooming into what happened: the agent wrote a fused op, basically C++ code that it put as an inline string alongside the PyTorch code, and then it called that fused operation in the forward pass of the model. In the snippet up here you can see that fused implementation taking those four ops and putting them together in one mega op. And this was done automatically.

Sometimes, though, writing low-level kernels isn't the best optimization available. We had another case, on a level one problem, that improved performance by 80%. The insight the agent had was that the Metal op for average pool 1D is not as optimized as some other ops on Metal. So what it did was rewrite the PyTorch code to use the more optimized op and re-express the same problem in a different way. To dive into this: average pool 1D is basically taking averages across one dimension, so you can see how that input vector produces the output vector, with 5 and 7 averaging to 6, and so on. If you express that same thing as a convolution, you get the same result; if you do the math, it leads to the same output. And that's what the agent did. Instead of the original call to the baseline op, it generated the weights matrix and executed this as a convolution, because it knows that's really fast on Metal. A minimal sketch of that rewrite appears after the next paragraph.

There's also an interesting algorithmic optimization case. This was a level three problem, so more complex, where the agent figured out it could combine two operations into a single operation at the PyTorch level, not even using low-level kernels.
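Here is that average-pool-to-convolution rewrite as a small sketch (the shapes and window size are made up for illustration; this is not the agent's actual output): a grouped convolution with uniform 1/k weights and stride k reproduces 1-D average pooling exactly.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 16)   # (batch, channels, length)
k = 4                       # pooling window; avg_pool1d's stride defaults to k

# Reference: the built-in average pooling op.
ref = F.avg_pool1d(x, kernel_size=k)

# Same computation as a grouped convolution: each channel gets its own
# filter of 1/k weights (groups=C keeps channels from mixing).
C = x.shape[1]
w = torch.full((C, 1, k), 1.0 / k, dtype=x.dtype)
out = F.conv1d(x, w, stride=k, groups=C)

print(torch.allclose(ref, out, atol=1e-6))  # True, up to float tolerance
```

The win on Metal, per the talk, is simply that the convolution path is better optimized than the pooling path, so re-expressing the op buys speed without writing any low-level code.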
Coming back to that algorithmic case: what the agent did was fuse the two operations, rewriting them as Python code that calls a single convolution, and that's a lot more efficient because you don't have to launch as many ops.

But this does not always work. This is not a silver bullet, and I think that's really important to emphasize. A case where the agent totally faceplanted was matrix multiplication: it wrote a custom CUDA kernel for it, but it was a lot slower than the baseline. The thing is, matrix multiply is one of the most hand-optimized ops that exists, so it's not that surprising that an agent would not do as well as something a human expert spent a long time on. So this is an area where it did not work. Another case we saw had a 71,000x speedup, and anything like that should trigger your suspicion brain. Wow, 71,000, great, we're done, this technology is worth billions of dollars, right? No. So what happened? The operation being tested basically says: give me inputs, and I'll make sure they fall between -1 and 1. The agent figured out that for all of the test cases, this was already true, so it wrote a nice long comment saying this is actually not necessary, and just output the input. [laughter] You could argue this is the agent being smart, because it's pruning unnecessary work, but I think a lot of us would agree it's not in the spirit of what we're trying to benchmark here. We've excluded cases like this from our analysis, but it is interesting, because maybe sometimes you would want it to do something like that. And I think this is part of where the human element comes in with these agents: sometimes the agent does something that, depending on your definition of what you want to see, could be good or bad, and that's where the human part weighs in.

I keep drawing parallels to other kinds of coding agents because even though this is a niche, low-level domain, I don't think the story is fundamentally different. Standalone agents are really good at cheaply generating lots of different ideas and lots of possibilities to explore. They're good at slurping in a ton of different context and seeing what helps. And they're really good at those level one and level two tasks; for example, we're still not asking AI agents to write the Linux kernel. But what is still needed is robust quality and performance validation: we need to make sure the agents aren't cheating, and we need to make sure the results are actually correct. We need empirical data from hardware in the loop to guide the search and optimization, because it's actually really hard to look at low-level code and know how it's going to perform on the hardware; we still heavily rely on profiling data and things like that. And we also need the human in the loop to supervise the results and guide the work. So in the design of a modern agent, you have multiple sub-agents working together, you have that human in the loop, and a purpose-built harness for the task. I think this is the pattern we've seen throughout this conference.
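To make that "no cheating" requirement concrete, here's roughly what catches the 71,000x shortcut from earlier. The clamp range and input shapes are just stand-ins, but the point is general: if the verifier only replays the benchmark's own inputs, a candidate that exploits their distribution will pass.

```python
import torch

def reference(x):
    # The benchmarked op: force values into [-1, 1].
    return torch.clamp(x, -1.0, 1.0)

def cheating_candidate(x):
    # The agent's "optimization": the benchmark inputs were already in range,
    # so it just returns the input unchanged.
    return x

# Verifying only on in-range inputs: the shortcut passes.
x_easy = torch.empty(1024).uniform_(-1.0, 1.0)
print(torch.allclose(reference(x_easy), cheating_candidate(x_easy)))  # True

# A stricter verifier also samples inputs outside the easy range
# (plus edge cases like very large values), which exposes the cheat.
x_hard = torch.empty(1024).uniform_(-10.0, 10.0)
print(torch.allclose(reference(x_hard), cheating_candidate(x_hard)))  # False
```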
So, just to get a little bit into what that architecture looks like, and this is what we're building at Gimlet: you have a supervisor agent which takes in input code, target hardware, and also human prompting, because humans can still really guide the best path for optimization. That supervisor is in charge of managing the work. It deploys the synthesis agent swarm, which collectively works to come up with ideas for optimizations; they're basically the idea factory, coming up with new techniques. Those ideas get sent to the verification agent, which runs them on actual hardware in a hardware-in-the-loop system to see how they do, and that verification agent needs to be extremely strict about making sure no funny business is happening. That's a major part of the challenge.

A couple more realistic case studies that are not benchmarks. We got really excited because we ran this on a vision transformer model, and, I don't know if you can see, but comparing the original vanilla implementation using torch.compile against our generated code using torch.compile, ours was twice as fast. So this was a hooray moment. The speedups were promising, but then it turned out the optimization was just swapping out the original attention module for SDPA, which is a more optimized attention module. And yes, that's true, that is a valid optimization, but I wouldn't necessarily call it rocket science. So we consider that a trivial case study: if you're not using a more optimized attention module, maybe you haven't actually optimized your workload that much yet. But we do still see interesting results for full models when we have human prompting. One case was an audio encoder model, where it generated six custom kernels for the workload, specialized for the RTX 6000 Blackwell, and the results were strong: about 70% faster, with both implementations using torch.compile. Just to show an example, we load inline six different fused kernels and then call them in the code. The nice thing about this approach, even though it's a little weird declaring these as strings, is that you get a completely API-compatible, swap-in replacement for the original module in PyTorch.

So where are we with AI-driven kernel optimization? Like I said before, this is not a silver bullet, but it is a promising new tool in the toolbox. The best applications we see are things like searching across many bags of tricks: we know that fusion works, we know that tiling works, and we can run lots of experiments really quickly this way, launching them with agents and seeing what actually performs best on the workload. It's also good at porting existing implementations to new hardware, where it takes the insights from the original implementation and specializes them to the features available on the new target, and at translating existing optimizations to new scenarios. For example, let's say you're changing the quantization of your model; you can still look at differently quantized implementations to guide that optimization. In terms of the worst applications, we're still not at the point where they're writing the N+1 for FlashAttention, coming up with those genius algorithmic advances. And they're not currently outperforming a human expert who banged their head on this problem for months. And we shouldn't expect them to be.
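As a concrete illustration of that declare-kernels-as-strings pattern, here is a generic sketch using torch.utils.cpp_extension.load_inline. The kernel itself (a fused bias-add plus ReLU over a 2-D fp32 CUDA tensor) is a toy of my own, not one of the six kernels from the audio encoder case, but the plumbing is the same: the source lives in a Python string, gets compiled on load, and is then callable like any other PyTorch function, so it can be swapped into a module's forward pass.

```python
import torch
from torch.utils.cpp_extension import load_inline

# CUDA source declared as a Python string: one fused kernel plus a C++ wrapper.
cuda_src = r"""
#include <torch/extension.h>

__global__ void fused_bias_relu_kernel(const float* x, const float* bias,
                                       float* out, int n, int c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i] + bias[i % c];   // x is (rows, c) and contiguous
        out[i] = v > 0.0f ? v : 0.0f;   // bias-add and ReLU fused in one pass
    }
}

torch::Tensor fused_bias_relu(torch::Tensor x, torch::Tensor bias) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int c = bias.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fused_bias_relu_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), bias.data_ptr<float>(),
        out.data_ptr<float>(), n, c);
    return out;
}
"""

cpp_src = "torch::Tensor fused_bias_relu(torch::Tensor x, torch::Tensor bias);"

# Compiles at load time and exposes fused_bias_relu as a normal Python function.
fused = load_inline(name="fused_bias_relu_ext",
                    cpp_sources=cpp_src,
                    cuda_sources=cuda_src,
                    functions=["fused_bias_relu"])

x = torch.randn(8, 64, device="cuda")
b = torch.randn(64, device="cuda")
out = fused.fused_bias_relu(x, b)
assert torch.allclose(out, torch.relu(x + b), atol=1e-6)
```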
I think the most exciting part of this work is letting those human experts focus on the most interesting optimizations, and getting us better than baseline on all the problems they don't have time for. So what's next? We want to build abstract models of different machines to help the agents further specialize code to individual hardware. We're also interested in generating what is essentially NVIDIA assembly, such as PTX; you can see an example here. The thought is that we can do that better with AI than humans can, because it's so cumbersome. And we're also looking at academic formal verification methods for correctness. I also want to give a huge shout-out to my colleagues; they are the unsung heroes here. I love talking about this with people, so please feel free to send me an email if you want to talk about kernel generation or anything I covered. And we are hiring, so if this problem interests you, we'd love to chat. Thanks. [music]