Compilers in the Age of LLMs — Yusuf Olokoba, Muna
Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: q2nHsJVy4FE
Source: https://www.youtube.com/watch?v=q2nHsJVy4FE
If you're an AI engineer right now, your day-to-day probably looks something like this. You've got an OpenAI client in your codebase. You've got a few Hugging Face tabs open. You've got three different repos with the word "playground" in them. And you've got at least one agentic workflow that's really just stringing together a bunch of HTTP calls.

Right now everyone is talking about voice agents and MCP, and these are pretty cool technologies, but when you peel back the hype a little bit, what I hear when I talk to a lot of engineering teams is that they're usually grappling with much more fundamental and boring problems: how do I use more models in more places without having to rebuild or extend my infrastructure every single time? Say you want to try out a new open-source model that just dropped on Hugging Face today. That usually means you have to write a Dockerfile, spin up a Docker container, and then get that running on infrastructure that you own or rent from a third-party provider. And if you're wiring this into an AI agent, that's another tool you have to put into the context and perhaps expose through something like MCP. A lot of this is complexity that creeps in and only grows the more time you spend.

What developers actually want is something way simpler: just give me an OpenAI-style client that works. Let me point it to any model at all. It doesn't matter if it's running locally or remotely, if it's llama.cpp or TensorRT. I just want something that works with minimal code changes.

In this talk, I'll walk you through how we decided to build a compiler for Python that lets developers write simple, plain Python code and then converts it into a tiny, self-contained binary that can run anywhere at all: in the cloud, on Apple silicon, or anything in between. I'll also show you how we use LLMs within that compiler pipeline: a few things we tried, what worked, what didn't, and how we fenced them in with verification and LLM-powered testing. And I'll show how this infrastructure gives us the ability to not just run any AI model at all, but to run it in so many more places beyond just the server side.

Before we get our hands dirty with an example, I wanted to provide some motivation for why we thought building a Python compiler was the best way to solve AI deployment in the long run. First, we needed an extremely simple and standardized way for developers to bring their AI models, whether ones they've built internally or models they found open source on Hugging Face or GitHub, and get something they could execute very easily in their codebase. When a new OpenAI model comes out, for example, all you have to do is change the model argument to point at the new model that OpenAI just dropped. We wanted to recreate something that tracked this experience as closely as possible. Conceptually, that would have to be something that ingested Python inference code and spat out some other thing that knew how to get executed in our users' execution environments.
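To make the experience being tracked concrete, here is roughly what it looks like with the official openai Python client; the model name below is an illustrative placeholder, not one referenced in the talk:

```python
from openai import OpenAI

client = OpenAI()

# Trying a newly released model is a one-word change: only the `model`
# argument moves, the surrounding application code stays exactly the same.
response = client.embeddings.create(
    model="text-embedding-3-small",  # swap in a newer model name here when one ships
    input=["What is hybrid inference?"],
)
print(len(response.data[0].embedding))
```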
Second, we wanted to prepare for what we strongly believe to be the future of AI deployment: hybrid inference. We expect that in the future we will see smaller models running much closer to users, either locally on their devices or in edge locations, working in tandem with cloud AI models that are much larger and have greater reasoning ability, and we expect this is going to be how a lot of people consume AI in their day-to-day lives. That means developers have to move away from the cage of Python code and Docker containers toward something that is lower level, closer to the hardware, and a lot more responsive.

So, let's get our hands dirty. This is a Python function that runs Google's EmbeddingGemma, a 270-million-parameter model. It's a very simple text embedding model that takes in a list of sentences, just plain text, and runs a model that generates an embedding vector, or a list of embedding vectors: an embedding matrix. You would typically use models like this in text search, in retrieval-augmented generation, and in other frameworks where you need to retrieve documents or subsections of documents. This model from Google is small enough, at only 270 million parameters, that not only can it run very easily on GPUs in the cloud, it can also run very quickly on consumer hardware. Today we will figure out how to take this Python function that runs the embedding model, generate equivalent C++ and Rust code that is much lower level and able to run anywhere at all, then compile a binary that contains this model and all the dependencies it needs, and finally consume this model using the familiar OpenAI client.embeddings.create experience.

The very first step is taking our function and generating a graph representation that describes everything that happens within that function. We call this tracing. Our first prototypes of a symbolic tracing solution were actually built off of PyTorch 2, which introduced torch.compile along with Torch FX for this purpose. The way Torch FX works is that it takes Python source code, runs it with fake inputs that don't allocate any memory, and gives you a description, a graph, of everything that happened within that function. We actually tried to use this, but we faced two major issues that caused us to build our own tracing infrastructure. The first was that PyTorch's tracer is very focused on PyTorch code only, and in order to trace arbitrary code, which will usually rely on things like NumPy operations or OpenCV or something else, we would have had to figure out a way to add support for those data types into PyTorch. The second reason we didn't stick with PyTorch is that for the tracer to work, it has to run on fake inputs. Creating a fake tensor is trivial: you give it the same description and don't allocate any data. But it's a lot harder to create a fake image, a fake dictionary, or a fake whatever-type we might encounter in the wild. So we simply decided we were going to build something in-house.

Our first attempt was actually using LLMs to generate traces, because LLMs have had the capability of structured outputs for quite some time now. This is where you give an LLM a prompt and some data, whether it be an image, text, or audio, and ask it to respond with a specific schema you have given the model. This actually turned out to work pretty well.
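The talk doesn't show the actual prompt or schema used, but a minimal sketch of that structured-outputs tracing approach might look like this. The IRNode/Trace schema, the prompt, and the llm_trace helper are hypothetical names, and the OpenAI SDK's parse helper is used here just as one way to constrain the response to a schema:

```python
import inspect

from openai import OpenAI
from pydantic import BaseModel


# Hypothetical trace schema: a flat list of IR nodes, one per operation.
class IRNode(BaseModel):
    name: str          # name of the value this node produces
    op: str            # e.g. "input", "call", "binary_add", "return"
    inputs: list[str]  # names of the values this node consumes


class Trace(BaseModel):
    nodes: list[IRNode]


client = OpenAI()


def llm_trace(fn) -> Trace:
    """Ask an LLM to emit a dataflow graph of everything the function does,
    constrained to the Trace schema via structured outputs."""
    source = inspect.getsource(fn)
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "You are a symbolic tracer. "
             "Return the dataflow graph of the given Python function."},
            {"role": "user", "content": source},
        ],
        response_format=Trace,
    )
    return completion.choices[0].message.parsed
```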
In our own testing, this LLM-based tracing approach had almost a 100% accuracy rate. The only limitation was that it simply took way too much time. So eventually we decided we were just going to do it old school: we would build a tracer by first analyzing the code, looking at the AST, the abstract syntax tree, of the Python code, and then using a bunch of internal heuristics to build our own internal representation, or IR, of the user's function.

For the function we've written up, the IR is actually incredibly simple. I'm not going to show you the entire thing, just the parts that are relevant. As you can see, there are input nodes for the actual inputs to the function, so that's the list of strings. There's a function call node calling out to the tokenizer, another calling out to the model, and then we return those outputs so the user can get their embedding vectors.

Now that we have a high-level intermediate representation of our Python function, the next step is to figure out how to translate that into lower-level C++ or Rust code. But before jumping into that, I wanted to talk about one major difference between Python and C++ or other lower-level languages that we will run into and have to solve. Python is a very dynamic language: one variable x could be assigned to an integer and then immediately after assigned to, say, a string. There is full dynamism and anything goes. Whereas in lower-level languages like C++ and Rust, if you declare a variable you must give it a type, and that type can never change. This gives us quite a challenge, because we need to figure out how to attach or constrain the types in the code we will be generating from our high-level Python code.

So let's look at the first line of our function, the very first node, if you will, of our IR. As you can see, prompts is a list that is being generated by a comprehension statement, and we're effectively just adding a prefix to every sentence that has been passed in by the user. Let's focus on the addition operation happening within that comprehension. We know that every item in text is a string, because we have annotated our function as such: the input text is a list of strings. And we also know that the task prefix map just contains a bunch of strings; each prefix is itself a string. So the question becomes: how do we figure out the C++ type of the output of that operation? This is where the compiler comes in, specifically a technique we call type propagation. Here we have one string, the prefix, and another string, the actual input text that was provided to the function, and we know there is some addition operation happening between the two. So we can generate a C++ function that takes in two strings and performs Python's operator.add operation. The output of that generated C++ function, as you can see here, is just a string, and that's how we know that the output of this addition operation must itself be a string.
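As a minimal sketch of that idea (not Muna's actual implementation), you can picture a table of elementary operations keyed by operand types, where each entry records the native result type of the generated C++ implementation, plus a pass that walks the IR and stamps a type on every intermediate value. The node shape, operation names, and C++ type names here are assumptions:

```python
from dataclasses import dataclass


@dataclass
class IRNode:
    name: str          # the value this node produces, e.g. "prompts"
    op: str            # e.g. "input", "operator.add", "tokenizer.__call__"
    inputs: list[str]  # names of the values this node consumes


# Hypothetical: (op, operand types) -> native result type, mirroring the
# signature of the generated C++/Rust implementation of each elementary op.
OP_RESULT_TYPES: dict[tuple[str, tuple[str, ...]], str] = {
    ("operator.add", ("std::string", "std::string")): "std::string",
    ("operator.add", ("int64_t", "int64_t")): "int64_t",
    ("tokenizer.__call__", ("std::vector<std::string>",)): "Tensor<int64_t>",
    ("model.__call__", ("Tensor<int64_t>",)): "Tensor<float>",
}


def propagate_types(nodes: list[IRNode], input_types: dict[str, str]) -> dict[str, str]:
    """Walk the IR in topological order, assigning a native type to every value.

    `input_types` comes straight from the Python function's signature,
    e.g. {"texts": "std::vector<std::string>"}; everything else is inferred.
    """
    types = dict(input_types)
    for node in nodes:
        if node.op in ("input", "return"):
            continue
        operand_types = tuple(types[name] for name in node.inputs)
        types[node.name] = OP_RESULT_TYPES[(node.op, operand_types)]
    return types
```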
So, zooming out: we've been able to take the input type information from just the signature of our Python function, along with the native type information of this global constant task prefix map, and use that to propagate a type to the output of the concatenation of the two. We now know that if I concatenate one prefix with one input string, the result itself is a string. We can then do this propagation for every intermediate variable, every operation, within our original Python function, and that's how we flow type information through.

At this point you might be wondering: if this propagation requires manually implementing each operation in C++ or Rust, wouldn't we have to literally rewrite every unique function call or operation we ever encounter in Python? And you would be 100% correct; that is in fact what we would have to do. But that is now tractable, an easier problem to solve, for two reasons. The first is that all the variety you'll ever see in source code in the wild does not come from a giant volume of distinct operations. The variety comes from the fact that you can combine and permute operations in so many different ways, and each of those permutations is what forms a unique Python function. So we really only need to cover that base set of elementary functions, and we can stack or combine them in C++ the same way we do in Python. You might say to that: wait, that elementary set of functions is still pretty large. And you would be 100% right. We need to cover everything from adding two things, to subtracting them, to exponentiation, to operations in native libraries like NumPy or PyTorch, so that's a perfectly valid point. The reason it's tractable now is that we no longer have to sit down and write the equivalent native code ourselves: we can simply have LLMs generate all the code we need to translate a function from Python into C++ and Rust. This gives us the ability to mass-produce a lot of the operations we would otherwise have had to rewrite manually in native code.

Now that we've been able to propagate type information through our Python IR graph, we have all we need to generate actual C++ code that is correct and will compile. Here's what it looks like side by side. Just walking through it, you can see where we're doing that list comprehension to add the prefixes to each string, where we're running the tokenizer to turn the input text into token IDs, and where we're running the model and returning the output embedding vectors, the embedding matrix. Because we now have C++ source code, we can compile it to run natively on any device or platform we would ever want to run on, simply because every piece of technology you've ever touched has a C or C++ compiler. This is what gives us the ability to take high-level Python code and convert it into a form that is self-contained and can run anywhere at all. So let's go ahead and do that.
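The talk doesn't show the actual build step, but in spirit it is handing the generated C++ to a system compiler and asking for a shared library. This sketch uses a plain c++ invocation with illustrative flags and file names; the real pipeline presumably also bundles the model weights and runtime dependencies:

```python
import subprocess
import sys


def compile_shared_library(cpp_path: str, output_path: str) -> str:
    """Compile generated C++ into a shared object that any process can load via FFI."""
    subprocess.run(
        [
            "c++", "-std=c++17", "-O2",  # any C/C++ toolchain on the target platform works
            "-shared", "-fPIC",          # emit a dynamic library rather than an executable
            cpp_path,
            "-o", output_path,
        ],
        check=True,
    )
    return output_path


if __name__ == "__main__":
    # Usage (file names illustrative): python compile.py embed_gemma.cpp libembed_gemma.so
    compile_shared_library(sys.argv[1], sys.argv[2])
```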
What we end up with on the other end is a dynamic library, a shared object if you will, that we can load into a process and execute like any other code.

Now comes the fun part: let's figure out how to actually invoke our compiled embedding model from any language on any device. We're going to go with JavaScript running on Node.js for this example. The very first step is figuring out how to call into our compiled library from JavaScript in Node.js. We can use FFI for this purpose. This is where you're able to effectively design bindings and declare: I'm loading this native library, which has been compiled for my system and my architecture; it has a function with this name (in our case we already have the function name), and that native function has this signature. So we write a bunch of scaffolding code. We figured out a way to standardize this across different compiled functions to make it very easy for ourselves, but it's pretty open-ended. Once you do, you can point Node.js, or your JavaScript application, to the location of that compiled library, load it in, and simply invoke it like anything else. When we do, guess what: we get our embedding matrix right there.

For the final piece of the puzzle, let's take it back to the top and figure out how to expose our compiled embedding model through our OpenAI-style client. We're going to create a class, just call it Client. Within it, we'll create a nested class called Embeddings, and within that, a create function mirroring the official OpenAI client's embeddings.create path. Within that function, when the user passes in the model name, all we do is go from the name of that model to a path to the compiled binary we just created from our C++ code generation. With the rest of the FFI we just implemented, we now have a way of taking the model, resolving it to a path to the library, loading that library in, and executing it to get our embedding matrix. The final step is to massage the outputs so they look just like the outputs the official OpenAI client gives you. And with this entire system in place, we have just recreated the official OpenAI client, but given it access to any open-source model that we can get into a Python function.
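The talk wires this up in JavaScript on Node.js; as an analogous sketch in Python, here is roughly what the recreated client could look like using ctypes. The model registry, the exported symbol name embed, its C signature, and the 768-dimensional output are all assumptions made for illustration:

```python
import ctypes
from pathlib import Path

# Hypothetical registry mapping OpenAI-style model names to the compiled
# shared libraries produced by the pipeline above. Paths are illustrative.
MODEL_REGISTRY = {"embedding-gemma": Path("models/libembed_gemma.so")}
EMBEDDING_DIM = 768  # assumed output dimension of the compiled model


class _Embeddings:
    def create(self, model: str, input: list[str]) -> dict:
        """Mirror of the official client's embeddings.create(...) path."""
        # 1. Resolve the model name to the compiled library and load it.
        lib = ctypes.CDLL(str(MODEL_REGISTRY[model]))
        # 2. Bind the exported function. The symbol name and C signature,
        #    void embed(const char** texts, int count, float* out),
        #    are assumptions for this sketch.
        lib.embed.restype = None
        lib.embed.argtypes = [
            ctypes.POINTER(ctypes.c_char_p),
            ctypes.c_int,
            ctypes.POINTER(ctypes.c_float),
        ]
        texts = (ctypes.c_char_p * len(input))(*[s.encode("utf-8") for s in input])
        out = (ctypes.c_float * (len(input) * EMBEDDING_DIM))()
        lib.embed(texts, len(input), out)
        # 3. Massage the raw embedding matrix into an OpenAI-shaped response.
        rows = [list(out[i * EMBEDDING_DIM:(i + 1) * EMBEDDING_DIM])
                for i in range(len(input))]
        return {
            "object": "list",
            "model": model,
            "data": [{"object": "embedding", "index": i, "embedding": row}
                     for i, row in enumerate(rows)],
        }


class Client:
    def __init__(self) -> None:
        self.embeddings = _Embeddings()


client = Client()
response = client.embeddings.create(model="embedding-gemma", input=["The quick brown fox"])
```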