Code World Model: Building World Models for Computation – Jacob Kahn, FAIR Meta
Channel: aiDotEngineer
Published at: 2025-12-17
YouTube video id: sYgE4ppDFOQ
Source: https://www.youtube.com/watch?v=sYgE4ppDFOQ
Great to be here, everyone. I'm Jacob Kahn, a researcher at FAIR at Meta. I'm going to talk today about the Code World Model, which I'll abbreviate as CWM, and what it means to build world models for computation. This is work done by an incredible team at FAIR that extends all over the world, and I'm very grateful to be collaborating with them.

So what's our goal with CWM? Our primary goal is to build models that reason, plan, and make decisions. We start with code because it's an interesting sandbox in which to think about reasoning: it's constrained, and there are definite rules. Our goal is to predict future observations given past observations and actions; that's what it means, in some sense, to build a world model. We want to do this because learning a mapping from observations to the future forces good representations, and eventually that leads to planning and reasoning: we can consider different actions and see whether we like the results before making decisions.

I think there's a bit of a false dichotomy right now between world models and large language models. World models are just a parameterization of a problem, as I'll discuss; LLMs are a way to view and use that parameterization, and I'll dive into what that means in a bit.

So one of the fundamental questions we're asking with CWM is: what does it mean to model code? Is code literally the syntax in your editor, or is it something else? If you think about it, all that a model operating on code sees is syntax. We tokenize the input, it goes into the model, and we predict more code as the output. That's the starting and ending point for an analysis of a program with a token-based autoregressive model: just the syntax. But what if we instead modeled execution more explicitly?
What if we created a natural-language, systematic description of programs, so that neural models could ingest a more structured representation of what it means to execute code, and then emit that representation autoregressively too? That's one of our goals for CWM. We want to predict program execution because we believe it might lead us to better modeling of code: writing it, analyzing it, and beyond. Implicitly, we're going to predict a transition function over program states as execution proceeds.

Here's what execution tracing might look like in action. We have a program that counts the number of 'r's in "strawberry". At each step we have a frame separator denoting distinct lines of execution, and we explicitly record local variables; we could also introduce information about memory into the trace. The trace delineates, line by line, what's happening as the program executes, and because each line of the execution trace maps to a corresponding line in the program, it's something we can feed directly to a model.

We don't have to stop at functions. We could think about entire repository-level execution traces, distributed-system-level execution traces, or traces for code-contest solutions and other high-complexity programs. We could also transition into natural-language tracing, and we'll see what that means in a moment.

What does it actually look like to model that transition function at a high level, as we start to parameterize the problem? We have a program and its data, which together form a state. We have an action, executing the next line, which results in the next state. So both program execution and the model's decision-making in an agentic sense can be modeled as a transition function. So where are we?
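To make the tracing idea concrete, here is a small sketch of how a line-by-line execution trace, with local variables recorded at each frame, could be produced for the strawberry example. This uses Python's standard `sys.settrace` hook purely as an illustration; it is not CWM's actual trace format.

```python
import sys

def count_rs(word):
    count = 0
    for ch in word:
        if ch == "r":
            count += 1
    return count

trace = []

def tracer(frame, event, arg):
    # Record one frame per executed line: the line number plus a snapshot
    # of the local variables at that point in execution.
    if event == "line" and frame.f_code.co_name == "count_rs":
        trace.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

sys.settrace(tracer)
result = count_rs("strawberry")
sys.settrace(None)

# Each trace entry maps a source line to the program state on reaching it,
# which is exactly the kind of structured sequence a model could ingest.
for lineno, local_vars in trace:
    print(lineno, local_vars)

print("result:", result)  # strawberry has 3 'r's
```

Serialized as text, a trace like this becomes a token sequence whose next frame a model can learn to predict.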
This broader approach is world modeling in an agentic reasoning setting. We have a problem; a model thinks about it, takes an action in the world, and gets some feedback. Maybe we fail; we think again; and we iteratively continue with feedback from the environment. In the case of code, that environment is just execution. But with a world model, we can actually simulate: we can imagine an action and get feedback in an imagined environment. We could generate execution traces for a program without executing it. That lets us be far more efficient in how we structure agentic execution, because we don't have to interact with the real world until we're ready to.

Let's couple this with autoregressive large language models. We have a state of a program, an action (say, the next line), then a new state, then another action, and so on. With the execution-tracing format I mentioned, this becomes almost a chain of thought that a model can interpret: a model can learn to predict the next state of an execution trace. So an LLM can autoregressively generate, token by token, the state-action-state transition function, with program executions as the starting point.

Okay, let's talk about data for CWM. We gathered a huge amount of GitHub data. We take GitHub events because, as I said, we're interested in modeling things at the repo level, and at the systems level if we can; we want execution traces to go beyond the scope of simple programs. So we take a bunch of PRs, mutate those PRs, predict changes, and eventually build a raw PR dataset. When we know tests or CI pass on those GitHub repos, we can run them and generate execution traces from that repo-level data.
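The "imagine the action before taking it" idea can be sketched as follows. The `ToyWorldModel` here is a stand-in lookup table invented for illustration, with strings as states and actions; a learned world model would predict the next state from parameters rather than a table.

```python
# Hedged sketch: using a world model to "imagine" each candidate action's
# outcome before touching the real environment. ToyWorldModel and the
# state/action encoding are invented for illustration.

class ToyWorldModel:
    """Stand-in for a learned model: a lookup table of state transitions."""
    def __init__(self, transitions):
        self.transitions = transitions

    def predict(self, state, action):
        # Imagined rollout: return the predicted next state, no real execution.
        return self.transitions[(state, action)]

def choose_action(world_model, state, candidate_actions, score):
    """Simulate each candidate action in imagination; pick the best outcome."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        imagined_next = world_model.predict(state, action)
        s = score(imagined_next)
        if s > best_score:
            best_action, best_score = action, s
    return best_action

# Toy example: from state "s0", action "b" leads to the higher-scoring state.
wm = ToyWorldModel({("s0", "a"): "s1", ("s0", "b"): "s2"})
scores = {"s1": 0.2, "s2": 0.9}
best = choose_action(wm, "s0", ["a", "b"], scores.get)
print(best)  # -> b
```

The point of the structure is that `world_model.predict` is cheap relative to real execution, so many candidate actions can be evaluated before committing to one.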
So here we are at the artifact: the code world model itself. I'll talk a bit about what we did with it, how we trained it, and what we can do with some of these interesting execution-trace capabilities. First, it's a 32-billion-parameter dense transformer. This is a model for research, but it's not so huge that you can't play with it; you can play with it right now. It has a nice long context length for reasoning tasks, and we train it end to end: we do all the pre-training and post-training ourselves. We pre-train on a few trillion tokens, mid-train on more domain-specific data, do some long-context mid-training, fine-tune further on instruction-following and reasoning tokens, and then run a joint RL and agentic reasoning setup.

Let's parameterize the problem even more broadly with CWM. We have a prompt and an agent. We do some reasoning and take an action: we can use a tool, or emit text, which is code that goes into the environment. We take a step, and from that environment we get a few things back: tokens, rewards, log probabilities, maybe compiler output.

With CWM, we're also taking a big step back in how we interact with the environment. CWM is a very bash-oriented model. It has fewer tools than other models do, and it has to learn to use the terminal well to solve many of the tasks we give it. This starts with SWE-RL. With SWE-RL, we take a GitHub issue and feed it to the agent, starting from that repository-level dataset from before, and we just use bash: we learn commands in bash, which lets us mutate the environment and the state of files. We can eventually use an edit tool, or create content, and then submit. Ultimately, we're trying to put the model in an environment very similar to the one an engineer works in, and learn end to end in a bash-based setting.
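A single step of a bash-based environment like this can be sketched in a few lines. The real training environment is far richer; here the "agent" is just a fixed list of commands, and `env_step` is a hypothetical helper name, not part of any CWM release.

```python
import subprocess
import tempfile

# Minimal sketch of one step of a terminal-centric agent environment:
# run a bash command, return the observation the agent would see.

def env_step(command, cwd):
    """Execute one bash command in the working directory `cwd`."""
    proc = subprocess.run(
        ["bash", "-c", command],
        cwd=cwd, capture_output=True, text=True, timeout=30,
    )
    return {"stdout": proc.stdout, "stderr": proc.stderr,
            "exit_code": proc.returncode}

workdir = tempfile.mkdtemp()
# Toy rollout: the "agent" mutates the file system, then inspects its work.
for cmd in ["echo 'hello world' > note.txt", "cat note.txt"]:
    obs = env_step(cmd, workdir)
    print(cmd, "->", repr(obs["stdout"]), "exit:", obs["exit_code"])
```

With only this interface, listing, searching, and editing files all have to be expressed as shell commands, which is the sense in which the model must learn the terminal well.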
We can bootstrap this setup further. We can do some SFT before RL and find failure modes for the model, and we can rejection sample: we take a bunch of agentic reasoning traces on code tasks that failed and feed those back into the model. In the example here, we have a thinking trace where the model is reasoning about instantiation logic for some code. It can look for that code by calling an explicit grep. Again, this is something we did with CWM with fewer tools and a larger emphasis on bash as a starting point.

Let's talk about post-training for a moment. We want to scale post-training quite a bit; that's the trend we see, and we're getting excellent returns from a reasoning perspective when we post-train. Because CWM is a small model, part of solving this is an opportunity to really scale up how we do post-training, in particular to improve the throughput of the system, and we do that with an asynchronous RL setup. We have samplers; an environment where we can execute in the terminal and get output; a bunch of reasoning trajectories we output; a trainer where we compute gradients and score trajectories; and a source of truth for the model. Then that loop repeats.

So what's the challenge? We have samplers predicting trajectories, we're scoring trajectories, we're executing in the environment, and as we're doing all this we're going to update the model. Eventually we have a producer/consumer pipeline problem: samplers produce lots of trajectories that are consumed by the trainers, and we need to synchronize weights. We solve this in CWM with a very asynchronous design. The trainer eagerly sends model checkpoints to samplers; trajectories are sampled and eagerly sent back to trainers; and in particular, we have queues.
We'll have many model checkpoints queued up as input to the sampling system, and many trajectories queued up to be scored and then incorporated, via gradients, into the trained model. This setup stays relatively on-policy even though it's highly asynchronous, and because we're not really waiting on much of anything, we achieve very strong throughput.

One interesting feature of this, which is increasingly common, is that we actually update models mid-trajectory. I have a model we're sampling from; it's interacting with the environment, generating data, executing bash commands and code, getting output; and I might update that model while it's interacting with the environment. Mid-trajectory, I can swap in a new checkpoint, and the trajectory will change a little bit. Theoretically that trajectory is a bit off-policy, but because of the throughput and the amount of data we see, the guarantees of this system remain quite strong, and we can take that risk of updating the model on the fly. This gives us a system with very few bottlenecks overall: we're queuing models, we're queuing trajectories, and we don't have to wait for anything to finish.

Overall, we post-train for a relatively small number of steps at pretty large scale, processing about 200-some billion tokens. And this scale works really well. It produces a strong open model that's pretty small, punches above its weight, is quite versatile, and uses tools in bash very well. But what can we actually do with this model?
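The producer/consumer structure behind this asynchronous loop can be sketched with two queues. Real CWM training is distributed across many machines; this single-process toy uses integers as checkpoint versions and strings as trajectories, purely to show the shape of the pipeline.

```python
import queue
import threading

# Sketch of the asynchronous RL loop: samplers produce trajectories into one
# queue; a trainer consumes them, computes updates, and publishes fresh model
# versions into another. Neither side blocks waiting for the other to finish.

model_q = queue.Queue()       # trainer -> samplers: latest checkpoints
trajectory_q = queue.Queue()  # samplers -> trainer: rollouts to score

def sampler(n_rollouts):
    version = 0
    for i in range(n_rollouts):
        # Eagerly pick up a newer checkpoint if one is queued; in the real
        # system this can even happen mid-trajectory.
        try:
            version = model_q.get_nowait()
        except queue.Empty:
            pass
        trajectory_q.put((version, f"trajectory-{i}"))

def trainer(n_updates):
    for step in range(1, n_updates + 1):
        version, traj = trajectory_q.get()  # consume, score, compute gradients
        model_q.put(step)                   # publish the updated model

t = threading.Thread(target=sampler, args=(8,))
u = threading.Thread(target=trainer, args=(8,))
t.start(); u.start(); t.join(); u.join()
print("all trajectories consumed:", trajectory_q.empty())
```

Because a trajectory carries the version it was sampled under, the trainer can reason about how off-policy each rollout is, which is what makes the mid-trajectory swaps tolerable.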
What can we do with a model that understands program execution traces, that has a good grasp of how a program will run and can predict its future state? CWM traces code really well; we've shown it execution traces, and I can give it a function and it will trace that function line by line with very high accuracy. It can show me the values of local variables at certain points, again with a lot of precision. That opens up some pretty interesting capabilities.

I can think about a neural debugger on top of the model. Traditionally, I have a piece of code and I don't know exactly what I want to write, so I put in some question marks. Historically, I might prompt a model with natural language: I want to set the variables left and right to something in particular, but I don't know what, and I need to very fully specify the ambiguity I'm experiencing about how to complete my program. With CWM, I can express those things very naturally, inline with the code. I can express the shape of the program I want in code, and the model fills in the rest. It does so by understanding that the user wrote a for loop here, a condition here, and left a variable unassigned; if the model simulates the execution of that loop, it can better understand what the user is really after. So a neural debugger is something that helps you compose with code, side by side; it's not just generating code, and it lets you express the semantics of code very loosely but also very precisely. If I want a certain structure in a piece of code, I can ensure the model understands that structure and can implicitly trace its execution.

This will make theoreticians bristle, but I can also think about some really ambitious things in computer science.
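The hole-filling interaction might look something like the following. The `??` hole syntax is invented here for illustration, not CWM's actual prompt format, and the completion shown is simply a correct reference a trace-aware model could plausibly produce.

```python
# Hypothetical "neural debugger" interaction: the user writes the shape of the
# program and leaves holes (`??`) instead of describing intent in prose.
# The `??` syntax below is invented for illustration only.

prompt = """
def binary_search(xs, target):
    left, right = ??, ??
    while left < right:
        mid = (left + right) // 2
        if xs[mid] < target:
            left = ??
        else:
            right = ??
    return left
"""

# A completion a trace-aware model might produce: by simulating the loop on
# concrete inputs, it can infer bounds and updates that keep the search
# invariant (answer always lies in [left, right)) intact.
def binary_search(xs, target):
    left, right = 0, len(xs)
    while left < right:
        mid = (left + right) // 2
        if xs[mid] < target:
            left = mid + 1
        else:
            right = mid
    return left

print(binary_search([1, 3, 5, 7, 9], 7))  # -> 3
```

The appeal of the format is that the loop and comparison already pin down most of the intent; only the holes are ambiguous, and simulated execution resolves them.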
The halting problem, we know, is a fundamental problem: we can't decide in general whether a program is going to halt, that is, stop executing, terminate. It's tough because, to know whether a program halts, we would have to simulate its entire execution, which, if it didn't halt, would take forever. So the halting problem is, in some sense, a difficult problem to simulate or decide. The question we can ask with CWM is: can I approximate some of these things? Can I concretely reason about program execution dynamics in this sense? Can I say, here's a program, does it halt? Maybe the model, by simulating execution, can pick up on really high-level patterns.

In the same way, the model can understand high-level patterns in broader systems. I could use this to debug a huge distributed system where executing code is very expensive, or even an expensive function on a single machine. The ability to have an implicit, internal world model that simulates what's happening in a piece of code or a broader system gives me the ability to reason about it without executing otherwise expensive things. So we can make some progress on the halting problem by building a model that simulates execution, and from there we can simulate and approximate what it means to solve otherwise impossible problems in computer science.

So this is pretty interesting. With that, I want to encourage everyone to go build on CWM. This talk does halt; this talk does terminate. The model is available on Hugging Face. We have some code on GitHub that will help you get started with inference in a fashion where you can twiddle bits a bit more. We also have a technical report where we really try to be as open as possible about all of these details around training.
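One concrete way to "approximate" halting, which captures the spirit of the bounded reasoning described above, is to simulate execution under a step budget: the answer is exact when the program halts within the budget, and a guess otherwise. This toy checker, invented for illustration, uses Python's `sys.settrace` to count executed lines.

```python
import sys

# Bounded halting check: undecidable in general, but a step-budgeted
# simulation gives a useful approximation for many concrete programs.

class StepBudgetExceeded(Exception):
    pass

def probably_halts(fn, max_steps=10_000):
    """Return True if fn() finishes within max_steps traced lines,
    False if the budget runs out (i.e. 'probably does not halt')."""
    steps = 0
    def tracer(frame, event, arg):
        nonlocal steps
        if event == "line":
            steps += 1
            if steps > max_steps:
                raise StepBudgetExceeded
        return tracer
    sys.settrace(tracer)
    try:
        fn()
        return True
    except StepBudgetExceeded:
        return False
    finally:
        sys.settrace(None)

def halts():
    total = 0
    for i in range(10):
        total += i

def loops_forever():
    while True:
        pass

print(probably_halts(halts))          # -> True
print(probably_halts(loops_forever))  # -> False
```

A neural world model goes further than this mechanical budget: rather than stepping blindly, it can recognize the high-level pattern (an unconditioned loop, a decreasing counter) and predict the outcome without running anything at all.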
The post-training setup I mentioned is explained there in even more excruciating detail, along with some of the data we used for execution training and some of what we imagine a model with these capabilities could be used for. Thanks for your time. Have fun.