Code World Model: Building World Models for Computation – Jacob Kahn, FAIR Meta

Channel: aiDotEngineer

Published at: 2025-12-17

YouTube video id: sYgE4ppDFOQ

Source: https://www.youtube.com/watch?v=sYgE4ppDFOQ

Great to be here, everyone. I'm Jacob Kahn, a researcher at FAIR at Meta. I'm going to talk today about the Code World Model, which I'll abbreviate as CWM, and what it means to build world models for computation. This is work done by an incredible team at FAIR that extends all over the world, and I'm very grateful to be collaborating with them.
So what's our goal with CWM? Our primary goal is to build models that reason, plan, and make decisions. We start with code because it's an interesting sandbox in which to think about reasoning: it's constrained, and there are certain rules. Our goal is to predict future observations given past observations and actions. That's, in some sense, what it means to build a world model. We want to do this because we can learn good representations of things if we learn some sort of mapping between observations and the future. Eventually that leads us to planning and reasoning: we can consider different actions and see whether we like the results for the decisions we make. I think there's a bit of a false dichotomy right now between world models and large language models. World models are just a parameterization of a problem, as I'll discuss. LLMs are a way to view and use that parameterization, and I'll dive into more of what that means in a bit.
So, one of the fundamental questions
we're asking with CWM is what does it
mean to model code? Is code literally
the syntax in your editor or is it
something else?
And if you think about it, all a model operating on code sees is syntax, right? We tokenize the input, it goes into the model, and we predict more code as the output. That's the starting and ending point for an analysis of a program with a token-based autoregressive model: just the syntax. But what if we instead modeled execution more explicitly? What if we created a natural-language, systematic description of programs, so that neural models could ingest a more structured representation of what it means to execute code, and then autoregressively emit that representation too?
So that's one of our goals for CWM. We want to predict program execution because we believe it might lead us to better modeling of code: writing code, analyzing code, and beyond. What we're going to do, implicitly, is predict a transition function over program states as we go about executing.
So this is what execution tracing might look like in action. We have a program that counts the number of R's in "strawberry". At each step we'll have a frame separator that denotes distinct lines of execution, and we'll explicitly record local variables; we could also introduce information about memory in the trace. This delineates, line by line, what's happening as our program executes, and it's something we can essentially feed to a model, because each line of the execution trace maps to a corresponding line in the program.
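The exact trace format CWM ingests is specified in the tech report; as a rough, hypothetical sketch of the idea, here's how you might emit a line-by-line trace of that strawberry program in Python with `sys.settrace`, printing a frame separator and the local variables as each line executes (the `<frame>` delimiter is invented for illustration):

```python
import sys

def count_rs(s):
    count = 0
    for ch in s:
        if ch == "r":
            count += 1
    return count

def tracer(frame, event, arg):
    # Emit one "frame" per executed line: the line number plus locals.
    # The <frame> delimiter here is hypothetical, not CWM's actual format.
    if event == "line" and frame.f_code.co_name == "count_rs":
        print(f"<frame> line {frame.f_lineno}: {frame.f_locals}")
    return tracer

sys.settrace(tracer)
print("result:", count_rs("strawberry"))  # "strawberry" has 3 r's
sys.settrace(None)
```

Each printed frame maps one-to-one to a source line, which is what makes the trace something a model can learn from alongside the program itself.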
We don't have to stop at functions. We could think about entire repository-level execution traces, distributed-system-level execution traces, or modeling execution for code-contest solutions and other programs with high complexity. We could also, as I said, transition to natural-language tracing, and we'll see what that means in a moment.
But what does it actually look like to model that transition function at a high level as we start to parameterize the problem? Well, we have a program, or we have data; that's some state. We have an action, executing the next line, and that results in the next state. So both the program's execution and the model's decision-making in an agentic sense can be modeled as a transition function.
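As a minimal, hypothetical sketch of that parameterization in Python (the `State` fields and function signatures here are my own, not CWM's):

```python
from dataclasses import dataclass

@dataclass
class State:
    line: int      # which line executes next
    locals: dict   # visible local variables

def step_interpreter(state: State, action: str) -> State:
    """Ground truth: actually execute the action (e.g. the next line)."""
    ...

def step_world_model(state: State, action: str) -> State:
    """Learned approximation: predict the next state without executing.
    CWM realizes this as next-token prediction over serialized traces."""
    ...
```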
So where are we? In this broader approach, world modeling in an agentic reasoning setting, we have a problem. We have a model that thinks about the problem. It takes an action in the world, and we get some feedback. Maybe we fail, we think again, and we iteratively continue this process with feedback from the environment. In the case of code, that environment is just execution, right? But with a world model, maybe we can actually simulate: we can imagine that action and get feedback in our imagined environment. We could generate execution traces about a program without executing it. That gives us the ability to be far more efficient in how we structure our agentic execution; we don't have to interact with the real world unless we're ready to.
So let's couple this with autoregressive large language models. We have a state of a program, we have an action, maybe the next line, and then we get to a new state; we take another action, and so on. With the execution-tracing format I mentioned, we can turn this into almost a chain of thought that a model can interpret: a model can learn to predict the next state of an execution trace. So an LLM can autoregressively generate, token by token, the state-and-action-to-state function, with program executions as the starting point.
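Concretely, a trace can be flattened into one token stream that the LLM continues step by step; here's a toy, hypothetical serialization (the delimiters are invented for illustration):

```python
def serialize_trace(steps):
    # steps: list of (state, action) pairs. The flat string becomes the
    # model's chain of thought: given the prefix, it learns to predict
    # the next <state> ... block.
    parts = []
    for state, action in steps:
        parts.append(f"<state> {state} <action> {action}")
    return " ".join(parts)

print(serialize_trace([
    ({"line": 2, "count": 0}, "execute line 2"),
    ({"line": 3, "count": 0, "ch": "s"}, "execute line 3"),
]))
```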
Okay, let's talk about data for a second. For CWM, we gathered a huge amount of GitHub data. We take GitHub events, and as I said, we're interested in modeling things at the repo level and at the systems level if we can; we want execution traces that go outside the scope of simple programs. So we'll take a bunch of PRs, mutate those PRs, predict changes, and eventually build a raw PR dataset. We can then run tests or CI on those GitHub repos, and when we know they're passing, generate execution traces from that repo-level data if we want.
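As a rough sketch of the "only trace known-good states" idea (pytest here is just one stand-in for a repository's real CI):

```python
import subprocess

def tests_pass(repo_dir: str) -> bool:
    # Keep a repo snapshot for trace generation only if its test suite
    # passes, so execution traces come from known-good program states.
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
        timeout=600,
    )
    return result.returncode == 0
```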
So here we are at the artifact: the code world model itself. I'll talk a bit about what we did with it, how we trained it, and then what we can do with some of these interesting execution-trace capabilities. But first: it's a 32-billion-parameter dense transformer. This is a model for research, and it's not some huge model you can't play with; you can play with it right now. It has a nice long context length for reasoning tasks, and we train it end to end: we do all the pre-training and post-training processes ourselves. We pre-train on a few trillion tokens, mid-train on more domain-specific data, do some long-context mid-training, fine-tune further on instruction-following and reasoning tokens, and then do a joint RL and agentic reasoning setup.
So let's parameterize the problem even more broadly with CWM. We have a prompt. We have an agent. We do some reasoning. We take an action: we can use a tool, or we can emit text, which is code that goes into the environment. We take a step, and from that environment we get a few things back: tokens, rewards, log probabilities, maybe compiler output. So with CWM, we're also taking a big step back in how we interact with the environment. CWM is a very bash-oriented model. It has fewer tools than other models do, and it has to learn to use the terminal quite well to solve a lot of the tasks we give it.
And this starts with SWE-RL. With SWE-RL, we take a GitHub issue and feed it to the agent, starting with that repository-level dataset from before, and we just use bash, right? We learn commands in bash, and that lets us mutate our environment, the state of files. We can maybe use an edit tool eventually, or create content and then submit things. But ultimately, we're trying to put the model in an environment that's very similar to what an engineer would be in, and learn end to end in a bash-based setting.
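To make that bash-oriented loop concrete, here's a minimal hypothetical sketch; `model.generate` stands in for whatever sampling API you use, and the "submit" stop condition is invented:

```python
import subprocess

def run_bash(command: str) -> str:
    # The environment step: the agent interacts chiefly through the terminal.
    result = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr

def agent_loop(model, issue: str, max_steps: int = 20) -> str:
    history = issue
    for _ in range(max_steps):
        action = model.generate(history)   # hypothetical model API
        if action.strip() == "submit":     # invented stop condition
            break
        history += f"\n$ {action}\n{run_bash(action)}"
    return history
```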
Okay. So we can bootstrap this setup further. We can do some SFT before RL, and we can find failure modes for the model. We can rejection sample: we can take a bunch of agentic reasoning traces on code tasks that failed and feed those back into the model. In this example, we have a thinking trace where we're reasoning about instantiation logic for some code. The model can look for that code, say by calling grep explicitly. And this is something we did with CWM, again with fewer tools and a larger emphasis on bash as a starting point.
Let's talk about post-training for a moment. We want to scale post-training quite a bit. That's the trend we see, and we're getting excellent returns from a reasoning perspective when we post-train. Because CWM is a small model, part of solving this was an opportunity to really scale up how we do post-training, and in particular to improve the throughput of the system. We're doing an asynchronous RL-based setup. We have samplers. We have an environment where we can execute in the terminal and get output. We have a bunch of reasoning trajectories we output. We have a trainer where we compute gradients and score trajectories. We have a source of truth for the model, and then that loop repeats.
So what's the challenge here? We have this loop, right? Samplers predicting trajectories, trajectories being scored, execution happening in the environment. As we're doing this, we're going to update the model, and eventually we have a producer-consumer pipeline problem: samplers are producing lots of trajectories that are consumed by the trainers, and we need to synchronize weights. We solve this in CWM with a very asynchronous design. The trainer sends model checkpoints to samplers very eagerly. Trajectories are sampled and sent back to trainers very eagerly. But in particular, we have queues. We'll have many models queued up to be input into the sampling system, and many trajectories queued up to be scored and then applied, via gradients, to the model being trained. This setup stays relatively on-policy even though it's highly asynchronous, and we're not really waiting on much; we're able to achieve very strong throughput because of the asynchronicity.
One interesting feature of this, which is increasingly common, is that we're actually updating models mid-trajectory. I have a model we're sampling from: it's interacting with the environment, generating data, executing bash commands, executing code, getting output. And I might update that model while it's interacting with the environment. So mid-trajectory, I could swap out the model for a new checkpoint, and the trajectory will change a little bit. Theoretically that trajectory is a bit off-policy, but the guarantees we have with this system are still quite strong: because of the throughput and the amount of data we see, we can take that risk and update the model on the fly. This gives us a system with very few bottlenecks overall, because we're queuing models and queuing trajectories; we don't have to wait for anything to finish.
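Here's a toy sketch of that queued producer-consumer shape in Python. The real system is distributed; `rollout` and `update` are placeholders, and a checkpoint is modeled as an integer version:

```python
import queue
import threading
import time

model_q = queue.Queue(maxsize=4)   # trainer -> samplers: fresh checkpoints
traj_q = queue.Queue(maxsize=64)   # samplers -> trainer: trajectories

def rollout(version):
    # Placeholder for real environment interaction (bash, code execution).
    return {"version": version, "reward": 1.0}

def update(version, batch):
    # Placeholder for scoring trajectories and taking a gradient step.
    return version + 1

def sampler():
    version = model_q.get()
    while True:
        try:
            # Eagerly adopt the newest checkpoint, even mid-trajectory,
            # without ever blocking generation on the trainer.
            version = model_q.get_nowait()
        except queue.Empty:
            pass
        traj_q.put(rollout(version))

def trainer():
    version = 0
    model_q.put(version)
    while True:
        batch = [traj_q.get() for _ in range(8)]
        version = update(version, batch)
        model_q.put(version)       # publish new weights immediately

threading.Thread(target=sampler, daemon=True).start()
threading.Thread(target=trainer, daemon=True).start()
time.sleep(1)                      # let the loop spin briefly
```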
Okay, so overall we post-train on a relatively small number of steps at pretty large scale, processing some 200-odd billion tokens. And this scale works really well. It produces a strong open model. It's a pretty small model, it punches above its weight, it's very versatile, and it uses tools in bash very well.
But what can we actually do with this model? What can we do with a model that understands program execution traces, that has a good grasp of how a program will run and can predict the future state of a program? CWM traces code really well; we know that, because we've shown it execution traces. I can give it a function, and it will trace that function line by line with very high accuracy. It can show me the values of local variables at certain points, again with a lot of precision, and this gives us some pretty interesting capabilities.
I can think about a neural debugger on top of a model. Traditionally, I have a piece of code and I don't know what I want to write, so I put in some question marks. Historically, I might prompt a model with natural language: I want to set the variables left and right to something in particular, but I don't know what. I'd need to fully specify the ambiguity I'm experiencing about how to complete my program. With CWM, I can express those things very naturally, inline with code. I can express the shape of the program I want in code, and the model will fill in the rest. It fills in the rest by understanding that the user wrote a for loop here, a condition here, and left a variable unassigned: if the model simulates the execution of that loop, it can better understand what the user is really after. So a neural debugger is something that helps you compose with code side by side; it's not just generating code. It lets you express the semantics of code very loosely but also very precisely: if I want a certain structure, I can ensure the model understands that structure and can implicitly trace the execution.
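Binary search is my own illustrative choice here, but it shows the shape of the interaction: the user writes the skeleton and leaves holes, and a model that can simulate the loop infers what the holes must be.

```python
def binary_search(arr, target):
    # A user might leave these two assignments as holes ("??") and let
    # the model fill them in by simulating the loop below.
    left, right = 0, len(arr) - 1    # inferred: the full index range
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1           # inferred: keep the right half
        else:
            right = mid - 1          # inferred: keep the left half
    return -1

assert binary_search([1, 3, 5, 7, 9], 7) == 3
```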
This will make theoreticians bristle, but we can also think about some really ambitious things in computer science. The halting problem, we know, is this very fundamental problem: we don't know whether a program is going to halt, to stop executing, to terminate. In particular, this is tough because, in general, to know whether a program halts we'd have to simulate its entire execution, which, if it didn't halt, would take forever. So the halting problem is, in some sense, a difficult problem to simulate or decide. The question we can ask with CWM is: can we approximate some of these things? Can we concretely reason about program execution dynamics in this sense? Can I say, here's a program, does it halt? Maybe the model, by simulating execution, can recognize really high-level patterns.
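As a toy example of the kind of pattern this enables (my own example, not from the talk's slides): a model that simulates execution can separate these two programs without running either one to completion. Note that actually calling `loops_forever` would hang.

```python
def halts():
    n = 10
    while n > 0:
        n -= 1   # n strictly decreases: the loop terminates

def loops_forever():
    n = 10
    while n > 0:
        n += 1   # n strictly increases: the loop never exits
```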
In the same way, the model can recognize high-level patterns in broader systems. I could use this to debug a huge distributed system where executing code is very expensive, or even an expensive function on a single machine. The ability to have an implicit world model internally, where I'm simulating what's happening with a piece of code or a broader system, gives me the ability to reason about it without executing otherwise expensive things.
So we can make some progress on the halting problem by building a model that simulates execution, and from there we can approximate what it means to solve otherwise impossible problems in computer science. This is pretty interesting.
With that, I want to encourage everyone to go build on CWM. This talk does halt; this talk does terminate. The model is available on Hugging Face, and we have some code on GitHub that will help you get started with inference in a fashion where you can twiddle bits a bit more. We also have a technical report where we really try to be as open as possible with all of these details around training. The post-training setup I mentioned is explained in even more excruciating detail, as is some of the data we use for execution tracing and some of what we imagine a model with these capabilities could be used for. Thanks for your time. Have fun.