Compilers in the Age of LLMs — Yusuf Olokoba, Muna

Channel: aiDotEngineer

Published at: 2025-11-24

YouTube video id: q2nHsJVy4FE

Source: https://www.youtube.com/watch?v=q2nHsJVy4FE

If you're an AI engineer right now, your day-to-day probably looks something like this. You've got an OpenAI client in your codebase. You've got a few Hugging Face tabs open. You've got three different repos with the word "playground" in them. And you've got at least one agentic workflow that's really just stringing together a bunch of HTTP calls. Right now, everyone is talking about voice agents and MCP, and these are pretty cool technologies, but when you peel back the hype a little bit, what I hear when I talk to a lot of engineering teams is that they're usually grappling with much more fundamental, boring problems: how do I use more models in more places without having to rebuild or extend my infrastructure every single time?
Say you want to try out a new open-source model that just dropped on Hugging Face today. That usually means you have to write a Dockerfile, spin up a Docker container, and then get that running on infrastructure that you own or rent from a third-party provider. And if you're wiring this into an AI agent, well, that's another tool that you have to put into the context and perhaps expose through something like an MCP server. A lot of this is complexity that creeps in and only grows the more time you spend. What developers actually want is something way simpler: just give me an OpenAI-style client that just works. Let me point it at any model at all. It doesn't matter if it's running locally or remotely, on llama.cpp or TensorRT. I just want something that works with minimal code changes. In this talk, I'll walk you through how we decided to build a compiler for Python that lets developers write simple, plain Python code and convert it into a tiny self-contained binary that can then run anywhere at all. It could be the cloud, it could be Apple silicon, it could be anything in between. Further, I'll show you how we use LLMs within that compiler pipeline: a few things we tried, what worked, what didn't, how we fenced them in with verification and LLM-powered testing, and how this infrastructure gives us the ability to run not just any AI model at all, but to run it in so many more places beyond just the server side.
So before we start getting our hands
dirty with an example, I wanted to
provide some motivation on why we
thought building a Python compiler was
the best way to solve AI deployment in
the long run.
First, we needed an extremely simple and standardized way for developers to bring their AI models, whether models they've built internally or ones they found open source on Hugging Face or GitHub, and get something they could execute very easily in their codebase. When a new OpenAI model comes out, for example, all you have to do is change the model argument, pointing it to the new model that OpenAI just dropped. We wanted to recreate something that tracked this experience as closely as possible. Conceptually, this would have to look like something that ingested Python inference code and then spat out some other thing that knew how to get executed in our users' execution environments. Second, we wanted to prepare for what we strongly believe to be the future of AI deployment: hybrid inference. We expect that in the future we will see smaller models, typically much closer to users, either locally on their devices or in edge locations, working in tandem with cloud AI models that are much larger and have greater reasoning abilities, and we expect that this is going to be how a lot of people consume AI in their day-to-day lives. As such, this means that developers have to move away from the cage of Python code and Docker containers into something that is a lot more low-level, closer to the hardware, and a lot more responsive.
So, let's get our hands dirty. This is a Python function that runs Google's EmbeddingGemma 270-million-parameter model. It's a very simple text embedding model that takes in a list of sentences, just plain text, and then runs a model that generates an embedding vector, or a list of embedding vectors, an embedding matrix. You would typically use models like this in text search, in retrieval-augmented generation, and in other frameworks where you need to retrieve documents or subsections of documents. This model from Google is small enough, at only 270 million parameters, that not only can it run very easily on GPUs in the cloud, it can also run very quickly on consumer hardware. And today we will be figuring out how to take this Python function that runs the embedding model, generate equivalent C++ and Rust code that is much lower level and is able to run anywhere at all, and then compile a binary that contains this model and all the dependencies it needs. Finally, we will consume this model using the familiar OpenAI client.embeddings.create experience.
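To make this concrete, here is a rough sketch of what such a function might look like using the Hugging Face transformers API. The model id, the task prefix map, and the mean pooling are illustrative assumptions, not the speaker's exact code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical task prefixes prepended to each sentence before embedding.
TASK_PREFIX_MAP = {"query": "task: search result | query: "}

MODEL_ID = "google/embeddinggemma-300m"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def embed(text: list[str], task: str = "query") -> torch.Tensor:
    # Prepend the task prefix to every input sentence.
    prompts = [TASK_PREFIX_MAP[task] + sentence for sentence in text]
    # Tokenize the prompts into input ids.
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    # Run the model and mean-pool token embeddings into one vector per sentence.
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # embedding matrix: (len(text), dim)
```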
The very first step is taking our
function and generating a graph
representation that describes everything
that happens within that function. We
call this tracing.
Initially, our first prototype of a symbolic tracing solution was actually built on PyTorch 2, which introduced torch.compile along with Torch FX for this purpose. The way Torch FX works is that it takes in Python source code, runs it with fake inputs that don't allocate any memory, and then gives you a description, a graph, of everything that happened within that function.
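For reference, Torch FX symbolic tracing looks roughly like this; it is a standard PyTorch API, shown here only to illustrate the kind of graph it produces.

```python
import torch
from torch import fx

def f(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) + 1.0

# symbolic_trace runs f with proxy inputs that record every operation
# instead of computing on real data, and returns a GraphModule.
traced = fx.symbolic_trace(f)
print(traced.graph)  # placeholder -> call_function(relu) -> call_function(add) -> output
```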
We actually tried to use this, but we faced two major issues that caused us to build our own tracing infrastructure. The first was that PyTorch's tracer is very focused on PyTorch code only. So in order to trace arbitrary code, and your functions will usually rely on things like NumPy operations or OpenCV or something else, we would have had to figure out a way to add support for those data types in PyTorch.
The second reason we didn't stick with PyTorch was that in order for the tracer to work, it has to be run on fake inputs. Creating a fake tensor is trivial: you just give it the same description and don't allocate any data. But it's a lot harder to create a fake image, or a fake dictionary, or a fake anything-else we might encounter in the wild. So we decided we were going to build something in house.
Our first attempt was actually using LLMs to generate traces, because LLMs have had the capability of structured outputs for quite some time now. This is where you give an LLM a prompt and some data, whether it be an image, text, or audio, and ask it to respond with a specific schema that you have given to the model. This actually turned out to work pretty well; it had almost a 100% accuracy rate in our own testing. The only limitation was that it simply took way too much time.
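For reference, the structured-output approach looks roughly like this: hand the model a schema and get back an instance of it. This is a hedged sketch assuming the OpenAI Python SDK's parse helper; the IR schema here is a hypothetical simplification.

```python
from openai import OpenAI
from pydantic import BaseModel

class IRNode(BaseModel):
    op: str            # e.g. "input", "call", "output"
    target: str        # e.g. "tokenizer", "model", "operator.add"
    args: list[str]    # names of the values this node consumes

class Trace(BaseModel):
    nodes: list[IRNode]

source_code = "def embed(text: list[str]): ..."  # the user's Python function, as text

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",  # any structured-output-capable model
    messages=[
        {"role": "system", "content": "Trace this Python function into IR nodes."},
        {"role": "user", "content": source_code},
    ],
    response_format=Trace,
)
trace = completion.choices[0].message.parsed  # a Trace instance (or None on refusal)
```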
So eventually we decided to just do it old school. We built a tracer by first analyzing the code, looking at the AST, the abstract syntax tree of the Python code, and then using a bunch of internal heuristics to build our own intermediate representation, or IR, of the user's function. For the function we've written up, the IR is actually incredibly simple. I'm not going to show you the entire thing, but I'll show you the parts that are relevant. As you can see, there are input nodes for the actual inputs to the function, so that's the list of strings. There's a function call out to the tokenizer, another one out to the model, and then we return those outputs so that the user can get their embedding vectors.
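A toy version of that idea, using nothing but Python's built-in ast module, might look like this. Muna's real heuristics are far more involved; this only records inputs, call assignments, and the return.

```python
import ast

def build_ir(source: str) -> list[dict]:
    tree = ast.parse(source)
    fn = tree.body[0]  # assume a single top-level function definition
    ir = [{"op": "input", "name": arg.arg} for arg in fn.args.args]
    for stmt in fn.body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
            ir.append({
                "op": "call",
                "target": ast.unparse(stmt.value.func),
                "args": [ast.unparse(a) for a in stmt.value.args],
                "out": ast.unparse(stmt.targets[0]),
            })
        elif isinstance(stmt, ast.Return):
            ir.append({"op": "output", "value": ast.unparse(stmt.value)})
    return ir
```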
Now that we have a high-level intermediate representation of our Python function, the next step is to figure out how to translate it into lower-level C++ or Rust code. But before jumping into that, I wanted to talk about one major difference between Python as a language and C++ or other lower-level languages that we will run into and have to solve. Python is a very dynamic language: one variable x could be assigned to an integer and then, immediately after, assigned to, say, a string. There is full dynamism; anything goes. Whereas in lower-level languages like C++ and Rust, if you declare a variable, you must give it a type, and that type can never change. This gives us quite a challenge, because we need to figure out how to attach, or constrain, the types in the code that we will be generating from our high-level Python code.
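A two-line example of that dynamism:

```python
x = 1          # x is bound to an int...
x = "hello"    # ...and then rebound to a str; Python doesn't mind.
# The equivalent rebinding in C++ or Rust is a compile-time type error, so the
# generated code has to pin down exactly one type for every value.
```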
So let's look at the first line of our function, the very first node, if you will, of our IR. As you can see, prompts is a list that is being generated by a comprehension statement, and we're effectively just adding a prefix to every sentence that has been passed in by the user. Let's focus on that addition operation happening within the comprehension. We know that every item in text is a string, because we have annotated our function as such: the input text is a list of strings. And we also know that the task prefix map just contains a bunch of strings; each prefix is itself a string. So the question becomes: how do we figure out the C++ type of the output of that operation? This is where the compiler comes in, specifically a technique we call type propagation. Here we take one string, the prefix, and the other string, the actual input text that was provided to the function, and we know that there is some addition operation happening between these two. So we can simply write, or generate, a C++ function that takes in two strings and performs Python's operator.add operation.
The output of the function we generate in C++, as you can see here, is just a string, and that's how we know that whatever the output of this addition operation is, it must itself be a string. So, zooming out, we've been able to take the input type information from just the signature of our Python function, along with the C++ type information, the native type information, of this global constant task prefix map, and use that to propagate a type onto the output of the concatenation of these two things. We now know that if I concatenate one prefix with one input string, the result itself is a string. We can then do this propagation for every intermediate variable, every operation, within our original Python function, and that's how we flow type information through.
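A toy rendition of that pass, assuming IR nodes shaped like the earlier sketch and a hypothetical OP_SIGNATURES registry of native implementations:

```python
# Each entry records the C++ return type of a native implementation we have
# (or an LLM has) written for one operation and one combination of input types.
OP_SIGNATURES = {
    ("operator.add", ("std::string", "std::string")): "std::string",
    ("tokenizer", ("std::vector<std::string>",)): "std::vector<std::vector<int>>",
}

def propagate_types(ir_nodes: list[dict], input_types: dict[str, str]) -> dict[str, str]:
    # input_types comes straight from the Python signature's annotations.
    types = dict(input_types)
    for node in ir_nodes:
        if node["op"] == "call":
            arg_types = tuple(types[arg] for arg in node["args"])
            # The node's output type is whatever the native function returns.
            types[node["out"]] = OP_SIGNATURES[(node["target"], arg_types)]
    return types
```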
At this point you might be wondering: if your compiler is doing this propagation thing, which requires manually implementing each operation in C++ or Rust code, you would have to literally rewrite this for every unique function call or operation you ever encounter in Python. And you would be 100% correct; that is in fact what we have to do. But it is now tractable, an easier problem to solve, for two reasons. The first reason is that all the variety you'll ever see in source code in the wild is not because there is such a giant volume of these operations. The volume comes from the fact that you can combine operations in so many different ways, permute them in so many different ways, and each of those permutations is what forms a unique Python function or piece of Python code. So we really only need to cover that base set of elementary functions, and we can stack or combine them in different ways in C++ the same way we do in Python. You might say to that: wait, that elementary set of functions is still pretty large. And you would be 100% right. We need to cover everything from adding two things to subtracting them to exponentiation, plus stuff in native libraries like NumPy operations or PyTorch operations. So you have a perfectly valid point. The only reason it's tractable now is that we don't have to sit down and write the equivalent native code that does the same thing as Python anymore; we can simply have LLMs generate all the code we need to translate a function from Python into C++ and Rust. This gives us the ability to basically mass-produce a lot of the operations that we would otherwise have had to manually rewrite ourselves in native code.
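In spirit, that generation step might look something like this. It's a hedged sketch assuming the OpenAI Python SDK; the prompt and model are illustrative, and as mentioned, every generated operation is fenced in with verification and LLM-powered testing before it is relied on.

```python
from openai import OpenAI

client = OpenAI()

def generate_native_op(python_op: str, input_types: list[str]) -> str:
    """Ask an LLM for a C++ implementation of one elementary Python operation."""
    prompt = (
        f"Write a single C++17 function implementing Python's `{python_op}` "
        f"for inputs of types {input_types}. Return only the code."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    cpp_source = response.choices[0].message.content
    # In practice: compile this, run LLM-generated tests against the reference
    # Python behavior, and only then add it to the operation registry.
    return cpp_source
```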
Now that we've been able to propagate type information through our Python IR graph, we have all we need to generate actual C++ code that is correct and will compile. Here's what it looks like side by side. Walking through it, you can see where we're doing that list comprehension to add the prefixes to each string, where we're running the tokenizer to turn the input text into IDs, and where we're running the model and returning the output embedding vectors, the embedding matrix. At this point, because we have C++ source code, we can compile it to run natively on any device or platform we would ever want to run on, simply because every piece of technology you've ever touched has a C or C++ compiler. This is what gives us the ability to take high-level Python code and convert it into a form that is self-contained and can run anywhere at all. So let's go ahead and do that. What we end up with on the other end is a dynamic library, a shared object if you like, that we can then load into a process and execute like any other code.
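Purely for illustration, producing that shared object from the generated C++ could be as simple as invoking the system compiler; the file names and flags here are assumptions, not Muna's actual toolchain.

```python
import subprocess

# Compile the generated C++ into a position-independent shared object.
subprocess.run(
    ["c++", "-std=c++17", "-O2", "-shared", "-fPIC",
     "embed_gemma.cpp", "-o", "embed_gemma.so"],
    check=True,
)
```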
Now comes the fun part. Let's figure out how to actually invoke our compiled embedding model from any language on any device. We're going to go with JavaScript running on Node.js for this example. The very first step is figuring out how to call into our compiled library from JavaScript in Node.js. We can use FFI for this purpose. This is where you effectively define bindings and declare: I'm loading this native library, which has been compiled for my system and my architecture; it has a function with this name, and that native function has this signature. So we write a bunch of scaffolding code; we figured out a way to standardize this across different compiled functions to make it easy for ourselves, but it's pretty open-ended. Once you do, you can basically point Node.js, or your JavaScript application, to the location of that compiled library, load it in, and simply invoke it like anything else. When we do, guess what: we get our embedding matrix right there.
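The demo does this from Node.js, but the same binding sketched with Python's ctypes shows the shape of it: a library path, an exported function name, and a declared signature. All three names here are illustrative assumptions about what the compiler emits.

```python
import ctypes

lib = ctypes.CDLL("./embed_gemma.so")  # the compiled shared object

# Declare the native signature so the FFI layer can marshal arguments:
# (array of C strings, count, output float buffer) -> status code.
lib.embed.argtypes = [ctypes.POINTER(ctypes.c_char_p), ctypes.c_int,
                      ctypes.POINTER(ctypes.c_float)]
lib.embed.restype = ctypes.c_int

def embed(text: list[str], dim: int = 768) -> list[list[float]]:
    strings = (ctypes.c_char_p * len(text))(*[s.encode("utf-8") for s in text])
    out = (ctypes.c_float * (len(text) * dim))()
    lib.embed(strings, len(text), out)  # invoke the compiled embedding model
    return [list(out[i * dim:(i + 1) * dim]) for i in range(len(text))]
```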
For the final piece of the puzzle, let's take it back to the top and figure out how to expose our compiled embedding model through our OpenAI-style client. What we're going to do is create a class, call it Client. Within it, we'll create a nested class called Embeddings, and within that we will create a create function, mirroring the official OpenAI client.embeddings.create path. Within that function, when the user passes in the model name, all we do is go from the name of that model to a path to the compiled binary we just created from our C++ code generation. With that, plus all the FFI we just implemented, we now have a way of taking the model, resolving it to a path to the library, loading that library in, and simply executing it to get out our embedding matrix.
The final step is to simply massage the outputs so that they look just like the outputs the official OpenAI client gives you. And with this entire system in place, we have just recreated the official OpenAI client, but given it access to any open-source model that we can get into a Python function.
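Put together, a Python rendition of that wrapper might look like this; the talk's demo is in JavaScript, and the model registry, library path, and embedding dimension below are hypothetical placeholders.

```python
MODEL_REGISTRY = {"google/embeddinggemma-300m": "./embed_gemma.so"}  # name -> compiled binary

def run_compiled_embedding(library_path: str, texts: list[str]) -> list[list[float]]:
    # In the real system this loads the library over FFI and invokes the compiled
    # model, as in the ctypes sketch above; stubbed with zeros here for brevity.
    return [[0.0] * 768 for _ in texts]

class Embeddings:
    def create(self, model: str, input: list[str]) -> dict:
        library_path = MODEL_REGISTRY[model]  # resolve model name to the compiled library
        vectors = run_compiled_embedding(library_path, input)
        # Massage the result into the official client's response shape.
        return {
            "object": "list",
            "model": model,
            "data": [{"object": "embedding", "index": i, "embedding": v}
                     for i, v in enumerate(vectors)],
        }

class Client:
    def __init__(self) -> None:
        self.embeddings = Embeddings()

client = Client()
response = client.embeddings.create(
    model="google/embeddinggemma-300m",
    input=["What is hybrid inference?"],
)
```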