AI Engineer World’s Fair 2025 - Reasoning + RL
Channel: aiDotEngineer
Published at: 2025-06-06
YouTube video id: -9E9_21tx04
Source: https://www.youtube.com/watch?v=-9E9_21tx04
A few of them are very important; all of them have different implementation details, but in general the idea is this: you have a bunch of tasks, versions of your problem, which are essentially prompts. You have rollouts, which are completions, potentially involving many steps of interaction, but one sequence of stuff happening. And then you have evaluations, potentially interleaved throughout or at the end of the sequence, and what you're estimating is the advantage. The advantage here captures the idea that sometimes your model will do better than other times. These LLMs are non-deterministic: you have temperature above zero, and different things happen on different rolls of the dice. And this forking process of saying, okay, this time it did better than that time, why was it different? RL is really about saying: this is the actual thing that changed that resulted in the reward being better, the eval being better; this is the token at which I went down the good path versus the bad path. Whether you're doing PPO or GRPO, this is the mechanism by which you get the signal: you have some things that sometimes went better, sometimes went worse, and now you can very surgically have the model learn to do more of the good stuff without changing too much overall. I think this is also maybe a reason why DPO didn't pan out the way people hoped; people were hoping DPO would really work well. In my view, DPO does not necessarily have this fine-grained advantage estimate. It's not really clear, just from a full good completion and a full bad completion, where you're really getting the signal about these complex branching processes. PPO has this, but it's also very expensive. GRPO has taken a lot of people by storm as a very nice middle ground: it's more computationally efficient, it's simple to implement, but it still has this forking process that comes just from sampling. There are also just too many papers. A lot of people see a new paper every day and think, do I have to read this one? I feel that too. It's difficult to know up front which of these are going to be important and which are just going to be noise, especially because lots of them have very sensationalist titles, like "Qwen doesn't work" or "everything only works with Qwen," which is kind of true, but there's also more to the story than that. And there are different implementation details, like "if you change the loss function like this in this experiment, then it works." For most people, it's best to set this aside and not get too caught up in the details of individual experiments and individual papers, and instead think more holistically: what is the process of reinforcement learning doing? Which implementation details am I willing to leave to other people to figure out and eventually come to me with software that has the knobs set correctly? And which pieces are actually important for solving the problems I care about? For a lot of people, the things that are going to be really interesting are the things relating to actual software, to actual problems they want to solve in the world. And agents, I think, are the instantiation of that where this makes sense.
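To make that advantage idea concrete, here is a minimal sketch of a GRPO-style group-relative advantage, using only the standard library; this is an illustration of the forking-from-sampling idea, not any particular trainer's implementation:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style signal: sample several rollouts of the same prompt, score each one,
    and compare every rollout's reward to the group average."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    if std_r == 0:
        return [0.0 for _ in rewards]   # every rollout scored the same: no learning signal
    return [(r - mean_r) / std_r for r in rewards]

# Four rollouts of one prompt, two solved the task and two did not:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # positive for the good ones, negative for the bad
```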
And the thing that makes an agent an agent is tools: the ability to interact with an environment, with a system. A lot of people at the conference are very excited about MCP. MCP is just tools. MCP is about giving your LLM the ability to interact with stuff, to go solve problems that involve changing files, making requests, editing code, running code. And so these are the papers that I get excited about, because there are parts of the puzzle that are not fully solved yet, like what's the right way to do all of this. There are still some open questions, but I think those are getting refined; we're starting to see more and more. But a lot of the code and tools we have out in the wild for doing this, if you want to go play around with RL, most codebases are set up for either code and math tasks or things that are quite similar to that. This is kind of my fault. I had a snippet go viral that was "here's how you do RL on GSM8K," which is a fairly easy math dataset, and I've seen a lot of people stick with this as "oh, we're going to do RL on math." Part of it is that math is easy to evaluate, and evals are hard; there's a whole track going on in parallel to this about how to build a good eval. So a lot of researchers gravitate toward things that look like the benchmarks and are also really easy to eval, because there's a very clear signal: this thing is right, this thing is wrong, good, we're doing RL. But real-world tasks are messier than that. We are not going to get great software systems just by hill-climbing on whatever question-answer benchmark is popular today. What we're going to have to do is start thinking about the actual systems at hand and the challenges that emerge when we're trying to design these rewards. And reward hacking is a real thing. I think this is one of the lessons here: RL works, but it's not always going to work; there are things that can go wrong. And to me, reward hacking is really a message about the difficulty of building good evals. What you really want with an eval is for it to be easier for your model to do the task than to hack the eval. You want to build a reward signal that actually captures what you care about, where gaming it is more difficult than not gaming it. If the model can learn to do the task directly, just by doing what you want it to do in the spirit of the task, then that is what will happen; it will follow the path of least resistance. Models just want to learn, but they want to learn to do better on reward signals, and so your reward signals have to point in the direction of the thing you actually care about. Otherwise, models will find cheats. Thinking about these things in combination points a little bit toward a direction that I think is going to be very promising, and there are some very early signs that this actually can work. When R1 came out, I was speculating about what's next, what the things are that will unlock this sort of technique being used more generally. And people talk a lot about generator-verifier gaps:
what's the difference between solving a problem and checking whether you have a solution? A lot of problems are much easier to check than to solve. But this isn't a binary thing; it's a spectrum of how difficult it is to verify a thing. There are some signs that you can do evaluations on more ambiguous tasks by breaking them down into smaller pieces and by using LLMs as subroutines in your evaluations, like LLM-as-judge on steroids, where maybe you want to actually train a specialized LLM that is really good at doing these fine-grained evaluations. I like using the term rubric as a conceptual umbrella around reward models, reward functions, and LLM-as-judge setups: the criteria on which you are evaluating a thing. There's a cool paper from DeepSeek that I found very exciting when it came out a couple of months ago, about how to train reward models that generate these rubrics on the fly. There was a paper very recently that does this for creative writing, and it found that yes, you actually can train reward models that will come up with nuanced, fine-grained evaluation criteria for a task on the fly, given the actual problem, and this results in a fine-grained score that allows you to actually do RL and keep getting better. This is an area that I'm really excited to keep watching. But also: multi-turn. Multi-turn is probably where we're headed. We want to do agentic search. We want to do tool calls, software, games, long-horizon planning, computer use, memory; scaling tool calls lets you solve harder problems. So how do we actually do this? What's the way to go about building multi-agent or multi-turn agentic systems that we can use RL with? I think the conceptual pieces here are: environments are basically harnesses, rewards are basically evals, tasks are just prompts, and your policy, in the RL sense, hopefully should be as simple as an LLM API. The programming interface that makes sense for a lot of people is an API where you're writing code as if it's just a normal agent in a loop, but then this is a thing you can use to go do RL. And so that's what I've been building over the past couple of months. I maintain a repo called verifiers. It's finally on pip, out in the world; you can just install it, but it's been a long time coming. What it really is is a toolkit of these pieces, to make it so that building an agent you can actually train with RL feels just like building an agent. The interaction protocol here is quite simple; this is the entire rollout function on the left, what happens in the code when you're running an agent to do RL: you set up some initial state, have a while loop for "is it done yet?", and if it's not done, do a turn. And the thing you're passing here is a client object that's just an OpenAI-compatible API. I think this is the kind of interface you really want if you want people to be able to go from their agent applications to something that's trainable, something they can use with RL. It's been a lot of fun thinking about what the abstractions are, what the pieces are. So there are things like parsers and rubrics that I think are nice building blocks that you sometimes want to use.
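A minimal sketch of the kind of rollout loop being described, written against an OpenAI-compatible client; the environment methods here (reset, is_completed, env_response, score) are hypothetical stand-ins, not the actual verifiers API:

```python
from openai import OpenAI

def rollout(env, client: OpenAI, model: str, prompt: str, max_turns: int = 10):
    """One multi-turn episode: the policy is just chat completions behind an
    OpenAI-compatible API, and the environment decides when the task is done."""
    messages = env.reset(prompt)                        # hypothetical: initial state as chat messages
    turns_used = 0
    while turns_used < max_turns and not env.is_completed(messages):   # "is it done yet?"
        reply = client.chat.completions.create(model=model, messages=messages)
        messages.append({"role": "assistant", "content": reply.choices[0].message.content})
        messages.append(env.env_response(messages))     # hypothetical: tool output / next observation
        turns_used += 1
    # Rubric-style reward: did we solve it, plus a small bonus for using fewer turns.
    reward = env.score(messages) + 0.1 * (max_turns - turns_used)
    return messages, reward
```

Because the policy is just an OpenAI-compatible client, the same loop can be pointed at a hosted API for debugging the eval, or at a trainer's inference server when you actually do RL.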
Parsers and rubrics are optional; you don't have to use them if you don't want to, but I tried to make it fun and user-friendly. The other day I thought, let's train a Wordle agent. It was a fun little toy problem: it's not that hard a game for us as humans, but it's actually kind of tricky to get your code into this shape where you have a multi-turn interaction protocol that you can actually do learning with. Now it's much easier; the code to do these things is quite simple, and the reward functions can be relatively simple for this sort of setup: you want to reward it for eventually solving the thing, but also give it more reward for doing it in fewer turns. And this is a 7B model that works reasonably well. One of the reasons it works, which I'll talk about in a second, is SFT warm-up as a way of lowering the barrier to entry. The code as it is is very much set up so that your environments for RL are also just synthetic data loops or evals, where you can plug in Claude or DeepSeek or OpenAI and test. So you don't have to do RL to debug; you can debug with an API, in terms of seeing: is this a good eval, is this a good reward? Once you're comfortable with it, you can use whatever API you like that you're allowed to use, make synthetic data, do some SFT on it, and now you can start doing RL. This helps a lot with small models. There are a lot of efficiency challenges that I've been hard at work trying to solve, in terms of having all of your computation be utilized effectively, having everything be fully async so you don't have to worry about batching, and having your trainer and your inference run at the same time so you can be a little bit off-policy. There's a lot of engineering there, and I'm hoping that if you want to worry about it, great: dig into it, fork the repo, mess with things; if you don't want to, you shouldn't have to. The idea here is that this should become something that more people are trying out, more people are having fun with, exploring and getting a feel for it. Because if this is the future of building better agent models for your applications, now's a good time to start. This stuff is set up so you can do a lot of interesting research on a couple of GPUs; the barrier to entry is much lower now than it used to be. I have a lot of fun doing this on a couple of GPUs. We sell GPUs, by the way. Thanks, everybody. I don't think we have time for questions, but yeah. All right. Thank you. Thank you so much, Will. Next up on the stage, we're going to have Greg Kamradt. He is one of the co-founders and president of the ARC Prize Foundation. ARC is fantastic; they've been creating some of the highest-signal evaluations. As we're on our path toward AGI, they're basically helping us measure where we're going. I'm very excited; I think we might even get a sneak preview today of the next version of what they're doing. So anyway, welcome to the stage, Greg Kamradt. Thank [Applause] you. Beautiful. Today we are going to talk about why AI benchmarking is about to get a lot more fun. But before we do, we need to go over some cool demos here.
So, I love the Claude Plays Pokémon demo. There's something really special about seeing this little embodied agent make its own decisions and go play Pokémon, the game from our childhood. Now, OpenAI got in on the same game; I thought this was awesome. And then just the other day, Gemini beat Pokémon. I like seeing the little robots with AGI on top of their heads. That must mean we're already there, right? If we have these agents playing Pokémon and already beating it, that means the game's over, right? We're all done. Well, not quite. Because with Claude Plays Pokémon, we saw that it would get stuck in the same place for three days. It would need interventions. It would hallucinate different actions. And not only that, there was a ton of Pokémon training data within the model itself. So although this is a really cool example of an agent exploring a world, there are a lot of things we can go and improve on. So, as Kyle was saying, my name is Greg Kamradt, president of ARC Prize. We are a nonprofit with the mission to be a north-star guide toward open AGI. We were founded last year by François Chollet and Mike Knoop, and just last December we were invited by OpenAI to join them on their livestream to co-announce the o3 preview results on ARC-AGI. Now, there are a lot of AI benchmarking companies out there, but we take a very opinionated approach to how we should do this. And our opinion is that the best target we should be aiming for is actually humans. The reason we think that is that humans are the one proof point of general intelligence that we know about. And if we use humans as the target, that does two things for us, because what we do is come up with problems that are feasible for humans but hard for AI. Number one, it creates a gap, and when you have that gap, you can start to measure: how many problems can we come up with that humans can still do but AI can't? Number two, it guides research: you can quantify that class of problems and then tell researchers, hey, there's something really interesting going on on this side of the problem, something we need to go check out. All right. So if we're going to measure artificial general intelligence against humans, we need to actually define what general intelligence is. And there are two definitions that I love to quote. The first one is from John McCarthy, who said that AI is the science and engineering of making machines do tasks, and this is the important part, that they have never seen beforehand and have not prepared for beforehand. This is very important because if you've seen a class of problem beforehand, if it's already in your training data, then you're simply repeating memorization; you're not actually learning anything new on the fly. Right? The second person I like to quote on this is actually François himself, and he put it very eloquently in just three words: he calls intelligence skill-acquisition efficiency. And this is really beautiful, because skill acquisition asks, can you learn new things, and not only that, but how efficiently can you learn those new things? And spoiler: humans are extremely efficient at learning new things. So François proposed this definition in his 2019 paper, On the Measure of Intelligence. But he went further than that. He didn't just define it.
He actually proposed a benchmark to see whether a human or an AI can learn something new and then go repeat what it learned. And this is where the ARC-AGI version 1 benchmark came from. Over here on the left-hand side is the learn-the-skill portion; this is what we call the training portion, and what we show you is a transformation from an input grid to an output grid. The goal for the human or the AI is to look at it and say, "hmm, what's going on here?" Then on the right, we ask you to demonstrate that skill. So it's a little mini skill you learn on the left, and we ask you to demonstrate it on the right. If you can successfully do it, and this is what it looks like, it's just a grid editor, then yes, you've learned what the transformation is and you've actually applied it, so you're showing a nonzero level of generalization as you go through this. Our benchmark ARC-AGI-2, the most recent one, has over a thousand tasks in it, and the important part is that each one of these tasks is novel and unique. What I mean by that is that the skill required for one of them is never one we ask you to apply to another task. This is very important because we're not testing whether you can just repeat a skill you've already learned; we want to test all the little mini skills you can pick up over time and see if you can actually demonstrate them. And if we're going to claim that humans can actually do this, we need first-party data. So our group, as a nonprofit, went down to San Diego and tested over 400 people. We rented a bunch of computers and did this in person to preserve data privacy, and we made sure that every single task included in ARC-AGI was solvable by people. So this isn't just an aim; we're actually doing the work to back it up. But if we think about it, there's actually quite a bit of human-like intelligence that's out of scope for what we call a single-turn type of benchmark. With ARC-AGI, you have all the information you need presented right at test time; you don't need to do any exploring, and it's all single-turn. I would argue that if you are going to measure human-like intelligence, it needs to be interactive by design. You need to be able to test the ability of an agent, whether biological or artificial, to explore an open world, understand what goals it needs to pursue, and ultimately look at the rewards and go from there. This is very much in line with what Rich Sutton just published in the paper Welcome to the Era of Experience. He argues that if we want agents that will be readily adaptable to the human world, they need to engage with the open world, collect observational data, and be able to take that data to build a world model, make their own rules, and really understand what it is; otherwise you're just going to hit the human-data ceiling going forward. If we're going to build this, we're going to need a new type of benchmark that gets out of the single-turn realm. And this is where interactive reasoning benchmarks come in. An interactive reasoning benchmark is a benchmark where you have a controlled environment,
defined rules, and possibly sparse rewards, where an agent needs to navigate, understand what is going on, explore, and complete the objective. Now, there's an open question: if our aim is interactive reasoning benchmarks, what is the medium in which we're actually going to execute them? It turns out that games are quite suitable for interactive reasoning benchmarks. The reason is that games are a very unique intersection of complex rules and defined scope, and you have a lot of flexibility in creating these types of environments that you can then put different artificial or biological systems into. Now, I know what you may be asking: wait, Greg, didn't we already do games? Didn't we do this 10 years ago? We already went through the Atari phase. Well, yes, we did. But there were actually a huge number of issues during that era, not even just starting with all the dense rewards that come with the Atari games. There was a ton of irregular reporting, so everybody would report their own performance on different scales and it was tough to compare models. There was no hidden test set. And one of my biggest gripes with the Atari phase was that all the developers already knew what the Atari games were, so they were able to inject their own developer intelligence into the models themselves, and then the intelligence behind the performance is being borrowed from the developer, not actually coming from the model itself. So if we were able to create a benchmark that overcame these shortcomings, then we'd be able to make a capabilities assertion about the model that beat it that we've never been able to make before. To put it another way, a bit more visually: we know that AI can beat one game. This is proved; AI can beat chess, AI can beat Go, we've seen this many, many times. And we know that AI can beat 50 known games with unlimited compute and unlimited training data; we've seen this with Agent57 and MuZero. But the assertion we want to make is: what if AI beat a hundred games that the system has never seen before, and that the developer has never seen before either? If we were able to successfully put AI to this test, then we could make a capabilities assertion about that AI that we can't currently make. And I'm excited to say that that's exactly what ARC is going to go build. This is going to be our version 3 benchmark; today is a sneak preview of what that's going to look like, and this is going to be the first interactive reasoning benchmark from ARC. I want to jump into three reasons why it's very unique. The first one is that, much like our current benchmark, we're going to have a public training set and a public evaluation set. The public training set, call it on the order of 40 different novel games, will be where the developer and the AI can understand the interface and get a sense of what's going on. But all performance reporting will happen on the private evaluation set. This is very important because on this private evaluation set there's no internet access allowed; no data is getting out about it.
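To give a flavor of what a turn-based, sparse-reward benchmark environment of this kind might look like as an interface, here is a purely illustrative sketch; none of this is ARC-AGI-3's actual API, and the names and the efficiency score are made up:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    observation: list   # e.g. the grid the agent currently sees
    reward: float       # sparse: usually 0.0, non-zero only at real milestones
    done: bool          # has the objective been completed?

class TurnBasedGame:
    """A controlled environment with defined rules and no instructions: the agent
    has to act, observe, and infer both the rules and the goal on its own."""

    def reset(self) -> list:
        """Start a new episode and return the first observation."""
        raise NotImplementedError

    def step(self, action: int) -> StepResult:
        """Apply one turn-based action and return what the agent sees next."""
        raise NotImplementedError

def skill_acquisition_efficiency(agent_actions: int, human_baseline_actions: int) -> float:
    """One plausible way to score efficiency: above 1.0 means the agent needed
    fewer actions than the human baseline to finish the game."""
    return human_baseline_actions / max(agent_actions, 1)
```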
The scores that come out of the private evaluation set will have been achieved by an AI that has never seen these games before, and neither has the developer, so we can authoritatively say that the AI has generalized to these open domains. Now, the second important point about ARC-AGI-3 is that it's going to force understanding through exploration. One of my other gripes with current game benchmarks is that you give a lot of instruction to the AI itself: hey, you're in a racing game, or hey, you're in an FPS, go control the mouse and do all these things. We're going to drop AI and humans into this world, and they won't know what's going on until they start exploring. Even as I look at this screenshot, which is actually one of our first games, we call it Locksmith, we give all of our games a cool little name like that, I don't know what's going on, right? But I start to explore and I start to understand: oh, there are certain things I need to pick up. There may be walls. There may be goals and objectives. I'm not sure what those goals and objectives are when I first start, but that's the point. So not only are we going to ask humans to explore and make up their own rules to understand how to play the game, we're going to require the same thing of AI as well. And that's something we're not currently seeing from the reasoning models we have. Now, the third key point is that we're only going to require core knowledge priors. This is something we carry over from ARC-AGI 1 and 2. What this means is that you'll notice the ARC tasks have no language: there's no text involved, there are no symbols, and we're not asking you any trivia. Some other benchmarks try to make the hardest problems possible; they go hire the best people in the world, and I call those PhD++ problems, right? And that's great, but AI is already superhuman, way smarter than me, in a lot of different domains. We take the alternative approach, which is: let's look more at the floor, at the reliability side. Let's take anything outside of core knowledge and strip it away. So the core knowledge priors, the four of them, are things that humans are either born with or hardwired to acquire right after birth. The first is basic math, meaning counting up to 10. The second is basic geometry, understanding different shapes and topology. The third is agentness, understanding theory of mind, that there are other types of agents out there in the world and that they're interacting. And the fourth is objectness. As we create our benchmark, these are the four principles we go after when we try to test the abstraction-and-reasoning piece. Now, I was reading the recent Dwarkesh essay, and he put it really well in one of his paragraphs. He was talking about one of the reasons humans are great, and he says it's their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task. We don't yet have this type of environment that can test this, from a benchmark perspective, for AI. And this is exactly what ARC-AGI-3 is going to build. So before we wrap up here, I want to talk about how we're going to evaluate AI, because it's like: okay, cool, they go play the game.
What does it mean? How do you know if it's doing well or not? We're going to bring it back to François's definition, skill-acquisition efficiency, and we're going to use humans, which again are our only proof point of general intelligence, as the baseline. So we're going to test hundreds of humans on these exact ARC tasks, and we're going to measure how long it takes them and how many actions it takes them to complete the game. Then we'll have a human baseline, and we'll be able to measure AI in the same exact way. Can the AI explore the environment, intuit things about it, create its own goals, and complete the objectives faster than humans? If it cannot, I would go as far as to assert that we do not yet have AGI. And as long as we can come up with problems that humans can still do but machines cannot, I would again assert that we do not have AGI. So we're going to be looking at skill-acquisition efficiency as our main output metric. Today we're giving a sneak peek at what this looks like; this is World's Fair. Next month in San Francisco, we're going to give a sandbox preview. We're going to release five games. We know better than to wait until the end; we're going to make contact with reality. We're going to put out these five games, and we're actually going to host a mini agent competition too, because we want to see the best possible agent people can build, and we'll put up a little prize money. And then we're looking forward to launching about 120 games; that's the goal, by Q1 of 2026. Now, that may not sound like many games, and you might think it's not many data points, but the richness of each one of these games goes really, really deep. There are multiple levels; it goes deep with each one of them. And it's quite the operational challenge to make all of these, which is a whole other side of the benchmarking process that I'm happy to talk about later. If this mission resonates with you, again, ARC Prize is a nonprofit, and one of the best ways to get involved is through a direct tax-deductible donation. If anybody in the room knows any philanthropic donors, whether LPs or individuals, I'd absolutely love to talk to them. We're also looking for adversarial testers: we want to pressure-test ARC-AGI-3 as best we can. So if anybody is interested in participating in the agent competition, whether online or offline, let me know; happy to chat. And then, kind of a cool story: we originally started with Unity to try to make these games, and we quickly found out that Unity was way overkill for what we needed if you're just doing 2D, 64-by-64 games. So we're actually making a very lightweight Python engine ourselves. If there are any game developers out there, anybody who wants to get involved with this and knows Python well, we're looking for game developers and game designers as well. That is all we have today. Thank you very much. Kyle, do we have... Yeah, I think we have time in this case for a couple of questions, if anyone wants to come up. There are microphones, one, two, three of them. Maybe a couple of questions; I'm going to kick that off. Yeah. Yes. A question for you.
I know it's famously very hard to make estimates about timelines, but if you had to guess, how long do you think this new version of the benchmark will take before it gets saturated? Well, the way I think about that is: I'm counting in years, not decades. We'll put it that way. Okay. Interesting. All right, we'll take one at each mic; looks like it's well distributed, so starting over here. Sure. Hi. You mentioned efficiency as part of the requirements, so I'm wondering, for the benchmarks, whether you're considering things like wattage or time or other ways of using that as one of the criteria. Yeah, I love that question, and I would have put it in if I had more time, but I'm very opinionated about efficiency for measuring AI systems. If I could have two denominators for intelligence on the output, number one would be energy, because you know how much energy the human brain takes, and that's our proof point of general intelligence; you can take how many calories the human brain uses. So I would love to use energy. The number two denominator is the amount of training data that you need. Neither of which is very accessible for closed models these days. So we use proxies, and the proxy is cost. But with interactive evals like this, you get another proxy, which is action count: how many actions does it take you to actually do it? We're not going to have a wall clock within these games; it's going to be turn-based, so we won't have a wall clock. Awesome. All right, question two, and then one, two, three; please keep them both very short. Yeah, very quick question: could you define more what you mean by objectness? Yes, that one's actually quite simple. It's just understanding, when you look out into the world, that there's a mass of things that may act together. The crude way to put it: you have one pixel, but it's surrounded by a whole bunch of other pixels and they all move together, and you understand all those pixels as one, acting as one body rather than individually. Evolutionarily, it's "that's a tree over there; all of this is part of the same tree," that kind of thing. Final question, and I'll keep this one super short. How do you distinguish, in the games you're developing, between tasks that humans cannot do and that an AGI also cannot do? What is the north star there? It's a good question. Tasks that humans cannot do are a bit out of scope for our thesis on how we want to drive towards AGI, so I would say that's not really the aim we're looking for. That's a whole different conversation around superintelligence, and maybe that's for another time. Thank you. All right, thank you everyone, and let's hear it for Greg. [Applause] Awesome. Okay. So, for our next speaker, it is my pleasure to bring to the stage Aakanksha Chowdhery. While she gets up and gets plugged in, I'll give a bit of an introduction. She is working at Reflection AI, and she's going to talk to us about using reinforcement learning for autonomous coding. This is a huge area; this is what all the big labs are doing. I'm very excited to hear more about how she's getting it to work. This is the perfect person to talk to us about this.
Aakanksha was the first author on the PaLM paper, which was an absolute breakthrough LLM produced by Google a few years ago. She was a lead researcher on Gemini early on, and now of course she's at Reflection AI, where she spends all her time thinking about this. So welcome to the stage. [Applause] If the presentation works, that's the AGI-complete problem, right? Okay. HDMI. Okay. I'm trying to mirror it. And that's the excitement here: you can see it, but I can't see it, so that's what I'm fixing. Let me solve it one more time. Good. Good to go? Yes. Awesome. Okay. Hi everyone. I'm Aakanksha Chowdhery. I was at Google for more than six years, where I led the research for PaLM and was a lead researcher on Gemini. These days I'm working on pushing the frontier for autonomous coding with reinforcement learning. So, just to recap the arc of how we have progressed in large language models, and why autonomous coding, and why now. I think everyone here remembers that in 2020 there was a breakthrough paper that came out about scaling laws for large language models. The 30-second recap: the main thing it said was that there's a power-law relationship between the test loss of large language models and the compute, data, and parameters you put in. So if you use more compute, more data, and more parameters in your machine learning model, which is a transformer model, you will get more performant models, and they will not be performant just in the domain in which you train them; they will actually generalize to many other domains, and that generalization was pretty much a feature in this particular case. So as the large language models got bigger, we saw continuous improvement across benchmarks, to the point that they're starting to get saturated now. The other interesting thing was that we saw emergent behavior, where capabilities were emerging in large language models that were not present in smaller models. And this is a classic slide that I show for the work we did on PaLM. Typically, when you go about trying to solve math problems and you give the model some examples, on the left you have a math problem about tennis balls, and then you give a second problem, and the model output looks wrong. But what PaLM and the subsequent set of papers showed was that if you ask the model to output its reasoning chains, which has become a very common concept now, but remember this was 2021, four years ago, then the answer actually is correct. So basically, by getting the model to output a chain of thought, or reasoning chains, the model's performance improves, and this capability emerged particularly in large language models. These are all the models: LaMDA and PaLM were the state-of-the-art models about three years ago, and what I'm showing on the x-axis is the increasing number of parameters. PaLM was scaled all the way up to 540 billion parameters. No one actually publishes the number of parameters these days, so you have to live with the graphs from three years ago or the open-source stuff that's coming out with DeepSeek and Qwen models.
What the y-axis shows is that the solve rate on middle-school math word problems was increasing with the number of parameters in the models, and it was increasing mainly when you were prompting the models and asking them to show a chain of thought. This led to all kinds of prompting techniques where you ask the model to think step by step, you even go and bribe the model, you ask the model nicely or not. So this was all kinds of fun stuff. And the thing that really stood out from that generation of models was that this capability was not just limited to math problems; it generalized across a whole bunch of domains, anywhere from question answering in other languages to puzzle problems to multitask natural-language-understanding problems. What this led to next was that, now that these models could reason, we could get them to follow instructions. So the first set of applications that became possible with these large language models were chatbot applications. Everyone remembers that ChatGPT, and now Gemini and various other chatbots, have become extremely popular; all of us use them all the time. But what made them really possible was that when you give instructions to the model to go do something, it's actually able to do it. And the way it learns that is based on reinforcement learning. The reinforcement learning data we're giving to the model in this particular case is data based on human feedback. You're basically saying: here is a set of questions, and if I were to give each one to a human along with two answers, which one would the human prefer? If you have enough of this data and you train your model, you end up with better performance, because you taught the model which responses to prefer. And this doesn't only work in chatbot applications; it also works in code. On the bottom right I'm showing that even if you do this for applications in code, you start to see some performance improvements. Now, of course, last year there was a whole bunch of debate about whether we are hitting a wall in terms of the performance of large language models, whether pre-training is no longer giving gains; all of these questions were on the horizon. So what is next? One of the key things to remember in all of this is that when you go and pre-train these models, you end up spending a lot of money on training, possibly tens of millions of dollars, whereas when you do inference on the models, it's extremely cheap. These numbers are not endorsed by any of the companies I worked at; they are public numbers from public sources. So, going back to the main point I want to make: training is extremely costly. If you constantly try to scale up the model size and it's not giving performance gains, then can we get performance gains at inference time instead, because inference calls are so cheap? And a key idea that was extremely useful here was that you could get the models to generate multiple responses and then do majority voting.
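Sketching that majority-voting (self-consistency) idea in code, assuming an OpenAI-compatible client and a made-up `extract_answer` convention for pulling the final answer out of each sample:

```python
from collections import Counter
from openai import OpenAI

def extract_answer(text: str) -> str:
    """Hypothetical convention: whatever follows the last 'Answer:' is the final answer."""
    return text.rsplit("Answer:", 1)[-1].strip()

def self_consistency(client: OpenAI, model: str, problem: str, k: int = 8) -> str:
    """Sample k independent chains of thought and return the most common final answer."""
    completions = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": problem + "\nThink step by step, then end with 'Answer: ...'"}],
        n=k,                # k independent samples of the same prompt
        temperature=0.8,    # diversity across samples is the whole point
    )
    answers = [extract_answer(c.message.content) for c in completions.choices]
    return Counter(answers).most_common(1)[0][0]
```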
In the example on the slide, the prompt itself doesn't quite make sense, but you've given a mathematical problem to a large language model and you're asking it to generate three answers independently, and then you do some voting on top of those answers: if two answers match, that's a majority vote. Or, if I were to ask a question in this room and all of you said yes, that is a majority vote. Similarly, in large language models, if you can get the model to generate many, many samples and then get many of those answers to agree, this notion of majority voting, or self-consistency, had shown gains. So this kind of scaling of compute at inference time was clearly one avenue to push on. Another avenue that emerged and showed substantial value was that you could sequentially revise your previous response. As humans, oftentimes we write a first answer, then go evaluate our answer, notice there's some mistake here, it doesn't quite match, and then go fix it. So can we get LLMs to do the same kind of revision, looking at the previous set of revisions? That was the second avenue: having longer chains of thought and getting the model to improve consistently at inference time based on them. These kinds of techniques, where you can verify the correct answer, as in math, or in programming where you have unit tests, showed very clear gains. What I'm showing you here is an example from one of my colleagues' work at Stanford, which is a publicly published paper: on the y-axis we have pass@k, or coverage score, and on the x-axis we have the number of samples. As you draw more samples along the x-axis, your coverage improves; with an open-source DeepSeek model, just by taking more samples, you get a very high score on SWE-bench Verified compared even to the state of the art back at the end of 2024. Of course, now all of these scores have been pushed up and we are roughly somewhere around 80% already. But what we want to take away here is the fact that these lines of work showed that inference-time compute predictably gives us gains, especially in domains where we can verify. If we know how to verify the answers, then we actually know how to translate that into intelligence. And going back to my talk title, coding is one of those domains where we do have the capability to verify, and that gives us a tremendous advantage in terms of building superintelligence on top of autonomous coding. Of course, now you ask the question: what does automated verification mean here? For inference-time scaling to work, you need some way to say this output is correct. In math, this is a very simple example: if you were given a mathematical equation to solve, and you do the same calculation on a calculator, you can verify that the solution is correct. And similarly, in math you have formal proofs, so you can actually verify that things are correct.
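For code, the analogous check is executing the candidate against unit tests. A toy illustration (my sketch only; the task format here is made up):

```python
def passes_unit_tests(candidate_code: str, tests, fn_name: str) -> bool:
    """Run a generated solution and accept it only if it executes and every test passes."""
    namespace = {}
    try:
        exec(candidate_code, namespace)     # does it even run? (the "compiler as verifier" part)
        fn = namespace[fn_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False                        # crashes, missing function, or wrong answers: reject

# A model-written solution for "add two numbers":
candidate = "def add(a, b):\n    return a + b\n"
print(passes_unit_tests(candidate, tests=[((1, 2), 3), ((-1, 1), 0)], fn_name="add"))  # True
```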
In coding, you have unit tests and compilers: you can actually generate the code and then run it, using the compiler, or a framework like PyTorch, as a verifier. In fact, in domains where you don't have this kind of verification, there's a large gap: if you generate a lot of solutions and then do majority voting, you actually don't get as many gains. So what this roughly meant was: okay, inference-time scaling will work in scenarios where I have automated verification. But that doesn't quite give you real-world impact, and the reason is shown in this graph: typically, if you do majority voting, and this is across multiple different models on GSM8K, which is middle-school math problems, and another math benchmark, and you sort problems by correct fraction, you have to sample a lot, because the correct generations can be very rare. Who has time to sample 10,000 times to get a correct solution? You would be sitting there waiting, just searching for the correct answer, unless you can actually figure out where the correct generation is. So scaling inference-time compute with just majority voting or longer reasoning chains is great in the sense that there is some correct solution somewhere in there, but it doesn't work well across the board. So what will get these models to learn to generate correctly during training? Well, in the chatbot application scenario we saw that RL with human feedback did work. So can we apply the same principle here and get the model to generate correctly where we can automatically verify the outputs? Our belief at Reflection is that the next frontier for scaling is reinforcement learning, and we already have proof points from some of the frontier labs as well. As David Silver and Richard Sutton published recently, they agree, or rather, they are the pioneers of reinforcement learning: they say we are entering the era of experience. Starting from AlphaGo and AlphaZero you had an era of simulation; the next era of large language models was where you scaled up with RL using human data; but the next era, from this year on, is really the era of experience, which will lead us to superintelligence. So reinforcement learning will be a fundamental component in building superintelligent systems, especially in areas where we have automated verification. And some proof points for why this makes sense: in math, over several papers, and these are results from o1, we have already seen that if you give the model more test-time compute, which is the same as inference-time scaling, accuracy goes up, and if you repeat this process with reinforcement learning, then scaling up training-time compute also improves accuracy on a challenging math benchmark called AIME. Most of these benchmarks saturate within a year, as you have probably learned by now, so this benchmark is already saturated. So now that I've hopefully convinced you that reinforcement learning, and scaling reinforcement learning, is the next frontier, you might ask: okay, so why isn't everyone doing it? What's so challenging about it?
Having built large language models before, I can say that a big part of building these systems ends up being the machine learning plus systems stack, which is very challenging. Here is an example of why scaling up reinforcement learning is hard. If you are trying to do reinforcement learning with PPO, which is one of the algorithms used for RL with human feedback, you have to keep four copies of different models. So if you imagine a really large model, and then you have to keep four copies, you have to arrange them somewhere on GPUs in your large cluster. You can have some fun figuring out the exact layout, and it's a fun and interesting problem, but it's a hard problem in the sense that getting maximum utilization out of these systems, arranging them the right way, just building that system, is extremely hard. DeepSeek actually showed, with DeepSeek-Math, that GRPO gets rid of the value model and only has three copies of the model, but that's still a very challenging problem. So scaling up RL is even more challenging than scaling up LLMs, because you have multiple copies of the model and you have a training loop and an inference loop. And then on the reinforcement learning side, you also suffer a lot from reward hacking if the thing deciding what counts as a correct answer is a neural reward model. But, as we discussed before, in autonomous coding applications you have the ability to verify your output, which roughly means you can decide whether an answer is correct or not. That's how SWE-bench Verified scores work today: you have execution feedback, you have unit tests. All of these possibilities, and this is an ongoing list, mean that you can design better reward functions. Okay, so this means that autonomous coding is a great domain for scaling up RL. Then the question becomes: how does this have real-world impact? In software engineering applications, generation of code is only one part of the system. If you look at end-to-end workflows for software engineering, there are many more parts to the system. How do you scale up your system to generalize across all of those domains? That's the problem we are trying to solve at Reflection. Our mission is to build superintelligence, and we are starting with autonomous coding as the root-node problem for this mission. We have a team of about 35 pioneers behind various legendary works in LLMs and reinforcement learning. So if you're excited about this mission, you can reach out to one of us; my email is my last name at reflection.ai, and we would love to work with you. And with that, I can take questions. All right, same protocol as last time: if you have a question, please come up to one of these three microphones we have distributed throughout. We can probably take one or two questions, so if you want to ask something, feel free. I guess I'll do the first one while people are coming up. I'm curious: it seems like the foundation labs have tried to build one model and deploy it across everything.
Do you have an opinion, with the work you're doing right now, on whether that's the right approach, or do you think there will be more specialization on different languages or even individual codebases? Or do you feel like the best approach is just to have one model trained across the greatest diversity of tasks possible? I think the answer to your question is that building coding agents does require multiple capabilities, and however you get there, you will definitely need multiple LLM calls; whether that's one model or multiple models is the secret sauce right now for most people. Fair enough. All right, please. Hi. I'm wondering about the slide with the chart of the era of simulation, the era of something, and the era of experience. They had put in AlphaGo, and the previous one where they played StarCraft or something. Those all used MCTS, which, maybe it's my unfamiliarity with them, but that's also data simulation. And we're using synthetic data for the era of experience as well. So why is that called simulation, and why is what we're doing right now not called simulation? What's the overlap between simulation and experience? How do you think about that? That's a question you could put to them, but going back to the point, I think the better way to answer it is roughly what Greg covered in the last talk, where his comment was that in gaming you can envision what scenarios might happen next, and you're basically using that to build your reinforcement learning: you're doing rollouts and building based on that. In the real world, in most scenarios, you have an imperfect rollout; you don't have full knowledge of how the system might work. Simulation is possible in certain domains where you do build a world model, which is closer to robotics and all the work happening in the physical AI space, but in the real-world applications we're targeting, you will have imperfect information. So you have to actually experience the real world, and you have to collect some data, and that data is not going to be in any way complete, nor will it completely cover the exponential search space that could exist. Awesome. Thank you so much, Aakanksha. We'll have to stop it there, but I assume you'll be around afterwards to answer questions? Awesome. If you have more questions, please do. Next, we're going to welcome to the stage Ryan Marten. Ryan is one of the founding engineers at Bespoke Labs. Sounds like he's got a friend. Anyway, we are very excited. Ryan is going to be talking about a project they worked on, building a reasoning model called OpenThinker, which is actually able to outperform the distilled versions of R1. So he's going to talk through what they built there and what we can learn from it. So, welcome. Thank you. Thank you. So yeah, I'm Ryan, a founding engineer at Bespoke Labs, and today I'm going to talk to you about Open Thoughts, which is our project to create the best open-source reasoning datasets. I'll be switching tack a little bit from our earlier discussions on reasoning and RL and focusing on the reasoning part, and you'll see why. So, just so we're on the same page: we've talked a lot about reasoning, but what's actually going on here?
I like this graph from Jason, which shows the incredible progress that's happened in the last several months, where models are getting much, much better on certain benchmarks. If you look at that, this is reasoning, this is test-time scaling. I think everyone here is quite familiar with this, and it seems that certain tasks like AIME, which is competitive math problems, really respond when models are able to think step by step and do these long chains of thought. So let's go back to DeepSeek R1. DeepSeek R1 was really impressive to a lot of people for a lot of reasons, and RL was a big part of that. But I was also particularly interested because DeepSeek R1, at the end of the day, is an SFT model. The final weights they released are actually DeepSeek V3 base fine-tuned on 800K SFT examples, 600K of which are reasoning. Of course, you can see here that RL was a big part of it: RL was used heavily to create the model which generated this data. But at the end it was SFT, and a little bit of RL for alignment. So this was really interesting and surprising. The other thing that was really interesting and surprising to us was the small reasoning models that DeepSeek released, which were incredibly strong. And this, for us, was a huge motivation to try to do this ourselves. Why is that interesting? Because if we go back to here, no additional detail was really given on these datasets. So if you want to create strong reasoning models, we now sort of have the training recipe, but we don't have the data recipe. That's the missing link. Okay. I also want to include a slide on why it is interesting to train your own reasoning models. I'm partially taking this from Amir's talk yesterday on open source and enterprise, which I really liked: the main points are performance, privacy, speed and cost, and then ownership and destiny. Reasoning is a great tool for solving a problem, and you shouldn't limit yourself in your toolbox if you're trying to solve a specific domain task. As we talked about before, RL is a great tool in this toolbox for tackling reasoning tasks, but we're going to see here that SFT is, as Nathan put it this morning, extremely easy and extremely effective. Okay, great. Now, the missing link: how do we actually solve for this reasoning data recipe? There are all these questions we had when we started. How much data do you really need? What data creation steps are necessary? What are the optimal choices for each step in that data creation pipeline? And then, how do you even go about figuring all of this out? This is the meat of the Open Thoughts project. So today we're excited to announce Open Thoughts 3, which is hot off the presses, it just came out two hours ago, and is the latest and greatest version of our reasoning datasets. And thank you. This is now the state-of-the-art reasoning dataset recipe. You can see here these graphs are showing accuracy on three reasoning benchmarks: AIME, which is competitive math; LiveCodeBench, which is competitive code; and GPQA Diamond, which is science questions. On the y-axis you see accuracy going up; on the x-axis you see the data scale going up. We heard before that scaling is difficult, particularly with RL. The good news is that for SFT, scaling is quite a bit easier. You can see here we compare to other open reasoning datasets.
So, Nemotron Nano: Nvidia released this great model, Nemotron Nano, an 8B model, and they also released the dataset it was trained on. So we compared directly by training on the same base model with our dataset, which is our dataset recipe, versus the Nemotron Nano data, which is the Nvidia recipe, and you can see there's a significant gap. We shifted this scaling curve upwards. Great. So yeah, this is the state-of-the-art 7B open-data reasoning model. You can see we've measured across the domains of interest, so science, code, and math, and then a couple of held-out benchmarks. Our original goal was to reproduce, to find the missing link for, the DeepSeek distill models. And you can see here we've crushed that goal. We're significantly outperforming the DeepSeek-R1-Distill-Qwen-7B model, which we started off trying to reproduce. And compared to the Nemotron Nano model, which is trained on a different base model, we are also outperforming on some benchmarks and similarly competitive on others. So okay, let's actually talk about how we achieved this. This is the interesting part for you. Going back to the scaling graph, you can see once again on the x-axis we're scaling dataset size. This is a huge lever to increase accuracy, and the thing is, it gets more and more expensive, exponentially more expensive, as you keep going. And then vertically you can see that we've shifted the scaling curve up. This is what I was talking about before: improving the dataset recipe. Given a fixed dataset recipe, you can always scale it larger and you can always get higher performance. But if you want to push your performance to the absolute maximum, the real question is how do I create the best dataset, and therefore what is the best recipe for the dataset? Okay, so enough teasing. Let's go into the meat of it. This is how we approached the problem. We broke the dataset pipeline down into: sourcing questions; mixing different sources of questions; filtering those questions, filtering out the highest quality questions; generating answers with a teacher model, so that's distillation; and then filtering out bad answers. And lastly, at the end of this entire experimentation, we looked at which teacher models are best and which one we should select. Through this entire pipeline, we've come down to this final dataset recipe. Now, this was a ton of work. This is a screenshot of our Hugging Face page. You can see we created over 5,000 datasets and almost 3,000 models. For this project, it was only around a thousand experiments, but it gives you an idea of how rigorously we looked at the different decisions in each step of the pipeline. I also think this is interesting because it peels back the curtain a little bit on maybe what the frontier labs are doing: finding signal at the smallest scale possible, trying out as many things as possible, empirically choosing the best, and then scaling.
And often, when you scale, you see that what was best at the small scale doesn't actually work, but if you're lucky and you've done good science, then your YOLO run will be the best possible. Right, okay. So these are the key learnings we had from our dataset recipe, and this is what you can take away. The first thing is pretty surprising: sampling multiple answers, so multiple reasoning traces per question in your dataset, works really, really well. The performance does not go down at a fixed scale. If you take a fixed scale of, say, 30k examples, then taking 30k questions and sampling once per question performs pretty similarly to taking 1/16th of the questions, so 30k over 16, and sampling each of those 16 times, which is quite cool. This is really useful because it allows you to scale by 16x, which is more than an order of magnitude, and if you remember the graph from before, that corresponds to a pretty large increase in accuracy. The other surprising thing we found was that a better model, in terms of its own performance on evaluation benchmarks, is not necessarily a better teacher model. A good way to think about this is a brilliant researcher who's maybe a terrible lecturer, right? We found specifically that QwQ-32B was a stronger teacher model than DeepSeek R1, so we switched to that in our recipe, even though previously everyone had been using R1. We also found that the sources of data with synthetic questions were actually quite good. Some of the top sources we selected were entirely synthetic and better than sources that, say, scraped from forums or had humans manually write things. And this is also really good news, because synthetic question generation is scalable. So once again we go back to the x-axis and we can push even further, which is an accuracy boost. Question filtering also works well here. We filtered questions by asking a language model how difficult each question is and then taking only the hardest questions. We also had a language model try to answer each question and looked at the length of the answer. These are proxies for the same thing: you can imagine that if a problem is a lot harder, a language model will think more and produce more text, so its answer will be longer. These approaches worked better than embeddings-based approaches or fastText classifiers, which is interesting insofar as those approaches were typical for pre-training. So it seems that data filtering for post-training is quite different from pre-training. Okay. Some things that didn't work were also quite interesting. Through our experiments, we saw that choosing a smaller number of high-quality sources was much better than trying to optimize for diversity by going for a larger number of sources. That's very counterintuitive, right? You'd think, okay, I'm always going to go for higher diversity, but that's not what we saw.
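As a rough illustration of those two recipe choices (filtering to the hardest questions with an LLM difficulty rating or an answer-length proxy, and sampling many teacher traces per question), here is a minimal sketch. The helpers `ask_llm` and `ask_teacher` are hypothetical stand-ins for calls to a grader model and a teacher model; the actual prompts, thresholds, and scales in the Open Thoughts pipeline differ.

```python
def difficulty(question: str, ask_llm) -> int:
    """Proxy #1: ask a grader model to rate difficulty on a 1-10 scale."""
    reply = ask_llm(
        "Rate the difficulty of this question from 1 to 10. "
        "Answer with a single integer.\n\n" + question
    )
    return int(reply.strip())

def answer_length(question: str, ask_llm) -> int:
    """Proxy #2: length of an attempted answer (longer usually means harder)."""
    return len(ask_llm(question))

def build_sft_dataset(questions, ask_llm, ask_teacher,
                      num_unique=30_000 // 16, samples_per_question=16):
    """Keep only the hardest questions, then distill many traces per question."""
    hardest = sorted(questions, key=lambda q: difficulty(q, ask_llm),
                     reverse=True)[:num_unique]
    return [
        {"question": q, "answer": ask_teacher(q)}  # answer includes the reasoning trace
        for q in hardest
        for _ in range(samples_per_question)
    ]  # ~30k examples built from ~1.9k unique questions
```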
The last interesting thing is that people talk a lot about verification, which is obviously very important for RL, and we actually saw that for SFT and distillation, filtering based on the answer, or verifying the answer, didn't really seem to help at all. That's quite surprising. And I think there's some good research in the literature about why that might be: if you have the hardest problems, it can still be helpful to keep an example in, even with an incorrect answer, and to see how the teacher model attempts it. It's not just the final output that matters. Okay, great. So those are all the learnings we had from Open Thoughts 3, which we're super excited to share. But now you're probably thinking: okay, they've done a thousand experiments. I don't want to do a thousand experiments, and I still want to create reasoning models. How do I adapt this if I want to create specialized reasoning models? The first thing I would say is to be aware that, depending on your domain, these exact choices might be a little bit different. I would suggest: start with our recipe, and then iterate on it. If you have the capacity and compute, try a couple of different choices for each step in the pipeline. A good example of this is that we studied each step of the pipeline separately by domain, so distinctly for code, science, and math. And we saw, for example, in the question filtering I talked about before, that using difficulty labels worked well for code questions, but for math and science it was response length. If you think about that for a second, it makes sense, because response lengths for coding questions are very different. For AIME math, the answer is literally just a number between zero and a thousand, so the answer isn't a large portion of the length, but you can imagine very simple coding questions where the answer is still a lot of lines of code. So that's one thing to be aware of. The other thing, which I talked about previously, is synthetic question generation, because it works so well. If you're in a specialized domain and you don't have a lot of data for your particular problem, then go ahead and transform your existing data into questions, expand it, throw those in as in-context examples, and generate more data. We built an open-source library for this called Curator, and you can try that out. And then lastly, I feel like everyone says this, but it can't be said enough: evaluation is paramount. If you don't know how well your models are doing or improving, then you cannot make good, principled decisions about your dataset recipe. We spent a lot of time on this. We also have an open-source library on GitHub called Evalchemy, which takes care of this and also takes care of the sharding and parallelism. And the key thing here is for very small evaluation sets: if you only have a handful of questions, you should run your model on those evaluation sets many times and average. Going back to AIME competitive math questions, there are only 30 per year. So for our evaluations, we gave the model those 30 questions 10 times and then averaged to get the final signal to determine which data strategies were working better than others, because otherwise there's too much noise.
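A minimal sketch of that averaging, assuming hypothetical `model_answer` and `is_correct` helpers for generation and answer checking:

```python
def averaged_accuracy(questions, model_answer, is_correct, n_runs=10):
    """Run a small benchmark (e.g. 30 AIME problems) many times and average,
    so run-to-run sampling noise doesn't swamp the signal."""
    runs = []
    for _ in range(n_runs):
        correct = sum(is_correct(q, model_answer(q)) for q in questions)
        runs.append(correct / len(questions))
    return sum(runs) / n_runs
```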
Okay, this is also very interesting, surprising, and promising if you're specializing: it seems you can actually surpass the teacher in some domains with distillation. This is super cool. Usually you think only RL can push the frontier, and distillation is just about catching up to the teacher. But no, that's not the case. We have an example in our paper where we looked at the legal reasoning domain: the problem of classifying Supreme Court decisions. What we did was take 2k unique questions and sample five answers per question, and we did do verification here, which did matter, so we threw away any answers that were incorrect. And when you fine-tune the 7B model on that, it surpasses R1, which is a very strong reasoning model and also a very huge reasoning model. So this is very exciting. I think there's a lot more research and also application to be done here. Okay, cool. So, everything's open. It's Open Thoughts, and Open Thoughts means open. Go out and build. We have our detailed paper, which is just out this morning. We've got the weights and the dataset. We have a ton of repos: code for data generation, for evaluation, and for synthetic data. So check those out. This is the team. It was a huge group of people and a lot of work over many months. I think we're all very proud of what we did, but there are lots of people to recognize here. If you scan that QR code, it goes to the tweet, and everything about the Open Thoughts project is linked from there. Yeah. Thank you. All right. Thank you so much, Ryan. That was fascinating. Looks like we already have at least one question lined up. Again, we have time for maybe a couple of questions, so if you have questions, please line up and we'll do it. Actually, before we get to those questions, I will say, as people are leaving: we are going to be back here at 2 o'clock. We've got an excellent afternoon planned on this track. We've got Nathan Lambert, and we've got Christian Szegedy, who's a co-founder of xAI. It's going to be a really great track at 2 o'clock back in this room. Also, one more thing: if you do have questions for any of the speakers from this morning, hopefully they're going to be able to stick around. Don't let them go to lunch. They're sitting up here at the front, so swarm them as soon as we're done. But for now, let's get a couple of questions. Go ahead. Yes, over there. Thank you. Great talk. So, two questions. One is: if you're just using SFT on this data, what's the difference between this and regular SFT? This is just regular SFT. Yeah. Oh, okay. So then how is regular SFT able to make the models think longer? Because I thought for the reasoning models they have this thinking block and they think for, you know, minutes and hours. Exactly. So how does SFT make it think for hours? So you're doing supervised fine-tuning on the questions, and the answers also contain the thinking. The model learns to use its context window and produce these long thinking traces. People call this SFT imitation, but it can learn this format in the same way. Yeah. Thanks. All right, we'll take one from this side. Great presentation, Ryan. One question: why do you think a smaller model like QwQ-32B was a better teacher than DeepSeek R1? What was your insight in figuring out why a good professor makes a bad lecturer? Yeah, that's a great question. I think this is something we need to investigate more, but when you look at charts of the length of the reasoning traces, you can see the distributions are different.
So it might be the case that you're using more of your context window, more tokens, more steps. It also might be the case that you just get a better-formatted response, better output. This is another great open research question. Interesting. I'll also say on this point, we also tried Claude as a teacher, which is a very strong model, and it was just a terrible teacher. So yeah, it's interesting what actually makes a good teacher. Yeah. All right, we'll take one more very brief question from this side, and then for those of you still waiting on questions, after we've closed this up, it's swarming time. Sorry. Great talk, Ryan. We're doing a similar kind of thing, but I had a question. Do you have any kind of pattern map for the reasoning chain of thought, so that when things don't work, you can tell at what level in the eval things are not working or it's not reasoning correctly? Is there a pattern map or something in your open-source repo? Sorry, I didn't catch that. Is there a... So if there are five steps of reasoning to reach a final conclusion, at what step does the reasoning go awry? Yeah, this is a great question. We don't do this fine-grained analysis, but there is a ton in the literature about this, where there's a sort of critical step where it gets things wrong. We did the simplest thing possible, right? You could also go in and try to do more complicated things at evaluation time, where you're doing interventions to detect steps that have gone awry and change them, or you can do this when you're creating the dataset. So you could potentially rewrite things, but everything that we tried in terms of messing with the reasoning trace wasn't helpful. So yeah, I think there's still more to explore there. This is really just the start of everything in reasoning. Awesome. Thank you everyone for your questions. And one more time, thank you to Ryan for everything you shared. All right. And again, we'll be back in this room at two o'clock sharp. We have three more presentations. Looking forward to seeing you then. Testing. Testing. All right. Hey everyone, glad you're all here. This is the reasoning and reinforcement learning track on the afternoon of the last day of the AI Engineer World's Fair. Glad you're all here, glad you're sharing it with us. To let you know what the schedule's going to be: I'm going to be talking right now, and I'll give myself an introduction in a moment. Directly after this, we're going to have Nathan Lambert from AI2 coming up. Very excited to see that. And then Christian Szegedy, formerly of xAI, one of the co-founders there, will be speaking last. So hopefully you're all able to stay for those sessions as well. I think they'll be excellent. To introduce myself, my name is Kyle Corbitt. I am one of the co-founders of a company called OpenPipe. What we do is reinforcement learning to make agents more reliable. We work with large companies, usually Fortune 500s, things like that, and help them make their agents more reliable so they can deploy them in production. If that describes any of you and you want to talk afterwards, I'm happy to chat later.
But today, what I'm going to talk about is a very specific case study that we did. With this case study, I'm going to talk about lessons learned very concretely: what did and didn't work, and how we were able to build an agent that worked well with reinforcement learning. Everything I'm talking about in this presentation is an open-source codebase that we built. We wanted to share these learnings, and I'll share that link with you at the end as well, for those of you who want to replicate what we did. So what is the project we're going to be talking about? It's a project called ART-E. It is a natural-language assistant that helps you answer questions from your email inbox. I'll give you an example of what we're talking about. Let's say you want to ask, in this case our example question is, when is Sherry's move to Portland targeted for? You would ask this question to the assistant, and it then goes and searches your inbox. It's got several tools: it has a search tool, it has a read-email tool, and then it can actually answer the final question. You can kind of see, if you look here, what's going on behind the scenes. This is important so you get a sense of how this agent works, and as we talk through how we built it and how we made it work, hopefully that helps keep the conversation grounded in a specific task. So anyway, you see the agent: it's searching for certain keywords, it gets those messages back, it's reading one of them, and it's answering the question. That's what it does. Okay. So, once we decided this is the task we're trying to solve, why would you use reinforcement learning for this specifically? And the answer is, to start with, you shouldn't. In fact, to start off with, we did not. The first version of this agent, once we decided we wanted to build this, didn't use any reinforcement learning at all. We built it purely on prompted models. And this is the first lesson from this talk that I want to share: I would generally always recommend starting by getting the best performance you can with a prompted model before going to any training, including reinforcement learning. There are a few reasons to do that, three specifically. The first one is just working out the bugs in your environment. Maybe your tools aren't implemented properly, maybe they don't have access to the data you think they do. We find this happens a lot, and it's a lot less frustrating to debug that separately from debugging your training loops. So you want to make sure that you can get at least some kind of performance before you start training. Second, you may find, as you try to improve performance with these prompted models, that you can get it working really well, and that's great: that means you don't need to train anything, and that saves you a lot of time. And there's a third reason as well, which is that once you've gone to that effort and done your best to get the best-quality prompted baselines you possibly can, then if you find those baselines are not able to get you where you need to go, and you're able to surpass them with reinforcement learning, it feels great.
You get to glow and be like, "Yes, I was able to beat the frontier models on my task." I highly recommend it. It feels good. You can post on X about it. There are nice graphs and stuff. So this is what it looks like when everything goes right. This is an example of a training run for the ART-E model that I'm going to be talking about. You can see there are lines for each of the prompted-model baselines that we've got: o3, o4-mini, and then Gemini and GPT-4.1. Those have a certain level of performance. And then you can see this moving line, which is the model that we trained. It actually starts out significantly worse than the other models, because we started from Qwen 2.5, the 14-billion-parameter one. It's a relatively small, relatively weak model, so it was doing much worse than these initially. But as training progresses, initially it's maybe learning the right way to do tool calls, there's a very sharp bump as it figures out the basic stuff, and then a more gradual climb, until eventually it's able to significantly outperform any of the prompted models on this task. In the ideal case, when everything works, this is what you're looking for. This is what you're hoping to achieve. This is another view of the same data we were just looking at. I wanted to highlight it this way because it's important to realize: on the last graph it looked like the lines sort of asymptote out pretty close together. That's because they're getting near 100%. But you can see, for example, our best prompted model here, o3, is at 90% accuracy, and with our RL model we're able to get up to 96%. One way to think about that is that 60% of the errors that o3 was making are actually solved by our model, which is quite a lot. We find that can be very important for the user experience of someone using one of these: if you're getting half as many errors, that can make the product much stronger. So this is where we got to on accuracy. There are a couple of other metrics that we find are often very important, and the trade-off between them is very task dependent, but they matter in many cases. Cost, obviously, is a big one. For this email agentic harness, we benchmarked the cost on o3, o4-mini, and our model. If you wanted to do a thousand searches using o3, that's going to cost $55, which is a lot. I think for most use cases that would probably be cost-prohibitive just from a unit-economics point of view. On o4-mini, we're down to $8, but that's still quite expensive. And then we drop another order of magnitude by moving to this smaller Qwen 2.5 14B. Again, this is just driven by it being a much smaller model, so it's much cheaper to run, but we're still able to get very good performance because we've specialized it on our task. Beyond cost and accuracy, the third metric that often comes up is latency.
Particularly if you're doing anything with voice, but really if there's any real-time human interaction with the task, latency is going to matter a lot. And we were able to get significantly better latency on this task. There are a number of ways, which I'll go into in more detail later, that we were able to achieve this. One was just, again, moving to a smaller model: there's less loading from memory, fewer matrix multiplies, and you're able to get tokens out faster. We were also able to train this model to take fewer turns going back and forth with the database, the actual list of emails. We trained it to be more efficient with its queries, and I'll go into that in a moment, and that leads to lower latency. There's actually a third thing, which we didn't apply here but can help a lot, which is speculative decoding. That's something you can do on large or small models, but it generally works better on smaller, task-specific models because you get higher acceptance rates on your speculator. Basically, there are lots of reasons why smaller models work better. Okay, so then the next question, for those of you who haven't done this yet, is: what is the effort required to actually achieve these results? If you'd asked me this question a year ago, I would have said, hey, you should really only be doing this if you're a big company willing to put months of work into a project. I think that's changing. I honestly do. In this case, this training run cost us about $80 in GPU time. It did take about a week of engineering time to build, and the caveat is that was with an engineer who is familiar with this domain and had quite a lot of experience with machine learning and RL. But I actually expect that as we figure out the right patterns here collectively as an industry, this will keep dropping, and I expect that the payback period to get a return on investment from these specialized models is going to continue falling as well. Part of the reason I wanted to give this talk is to distribute the knowledge we learned and hopefully move faster towards that world where this is just a thing everyone knows how to do, and it's very easy and very fast. So that's what we'll be talking about for the rest of the time: more of the lessons we learned. Okay. So when you are using RL to train an agent, or really using RL for anything else, I find that consistently, across the different problems we look at, there are two hard problems that come up every single time. The first is figuring out a realistic environment. If you're training an agent, you need to be training it with realistic data, realistic inputs and outputs, the tools available, everything like that, matching how it's going to be used in production, because if you don't, then it's going to be optimizing for the wrong thing and you won't get the results you want when you deploy it. And the second thing, which is sometimes hard, and this one is a little bit task dependent, is getting the right reward function.
The reward function just means you have to be able to know, when your agent has gone through and, in this case, given an answer about my email, whether it did a good job or a bad job. That's the reward function: it's how you decide if it's good or bad. Depending on the domain, sometimes that's really easy. I don't know if Nathan's here, he's going to be talking next, but he and his team put together this thing called RLVR, and in some verifiable domains it's actually very easy to compute a reward. Often, though, not all domains are like that, and it can be kind of hard, so it's somewhat task dependent. I'm going to go through how we solved these problems specifically with ART-E. Okay. First one: a realistic environment. For our ART-E task, what is the environment this agent's going to be operating in? Well, it needs these tools available. It needs to be able to go and query an email inbox. It needs to be able to get emails back that look realistic. The inbox should be large, because that's what most email inboxes are like. The emails in it should be diverse, and they have to look like real emails. This could be kind of hard, because you can't just go ask a thousand people to give you their personal emails to train on. Luckily, in this case, we were able to solve this with the help of a company that has contributed a lot to the open data ecosystem. It's quite an iconic company; perhaps I would call it a historic company. I'm of course talking about Enron. I'm hearing some laughter. So anyway, Enron was a financialized energy company in the 90s and 2000s that committed massive fraud and ended up getting shut down by the Department of Justice. As part of the court cases they went through, a dump of about 500,000 of their emails was released to the public as part of the discovery process. That's great for things like this, and that's what we used as our environment for the email inboxes. All right, so now we've got realistic email inboxes with tens of thousands of emails that are real back-and-forth emails. Now we have to design our reward function. As our agent goes along, as we ask it questions and it gives us answers, we have to know whether the answer is correct or not, so we can reward it when it gets the answer right and it can learn to do that better. There are different ways to do this, and this part is very task dependent. The way we went about it in this case was to basically turn it into more of a verifiable problem. We took our email inbox and sort of inverted the problem: we grabbed batches of 20 emails at a time from the inbox and gave them to Gemini 2.5 Pro and said, "Hey, given this set of emails, give us a few questions that a user might realistically ask whose answers are found in these emails." So Gemini generated the questions, it generated the answers, and of course the source emails they came from. There were some extra steps on top of that, because a lot of the questions it came up with looked a little bit unrealistic, and I'll come back to that filtering in a second.
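A rough sketch of that question-generation step, assuming a hypothetical `call_gemini` helper that wraps the actual API; the prompt and JSON format here are illustrative, not the ones used in the project:

```python
import json

def generate_qa(emails, call_gemini, batch_size=20):
    """Invert the problem: turn batches of real emails into (question, answer,
    source) triples that a user might plausibly ask about their inbox."""
    qa_pairs = []
    for i in range(0, len(emails), batch_size):
        batch = emails[i:i + batch_size]
        prompt = (
            "Given the following emails, write a few questions a user might "
            "realistically ask about their inbox. For each one, give the answer "
            "and the ids of the source emails, as a JSON list of objects with "
            '"question", "answer", and "source_ids" fields.\n\n'
            + "\n\n".join(f"[{e['id']}] {e['subject']}: {e['body']}" for e in batch)
        )
        qa_pairs.extend(json.loads(call_gemini(prompt)))
    return qa_pairs
```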
We had a separate filtering step where we said, okay, let's find the subset of these that actually look like questions I might ask, and we ended up with a list of a few thousand questions along with their verified answers. At that point it becomes much more of a verified thing: the reward function becomes much easier, because we know what the correct answer should be. So the way we can tell if our agent did a good job is we give it the question, we let it go and search the email inbox and try to find the right emails, and eventually it comes back with an answer. Then we can just use an LLM as judge, a very simple one, and say: hey, here's the question, here's the golden answer that we believe is right, here's the answer we got from our model. Is it right or not? We did have to do a little bit of iteration there, making sure the judge was well calibrated on what counts as correct, but by and large this worked pretty well and made this more of a verified task. So that's how we solved the reward function problem: by turning this into something where we had more of a golden dataset. Okay. So once you've solved those problems, once you have your environment and you have your reward function defined, then basically you just have to run a loop over and over again where your agent goes through and tries to solve the problem, you figure out if it did well or badly, you reward it if it's good and penalize it if it's bad, and that's it. You do this over and over again, and hopefully, if you've got everything set up right, it learns what good looks like, it learns what bad looks like, and it starts doing it right. And again, this is the curve we saw earlier, where you can see it starts getting better over time.
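A minimal sketch of that judge-style reward, plus a couple of illustrative extra-credit terms of the kind discussed next. `call_judge` is a hypothetical helper around whatever judge model you use, and the weights are made up for illustration, not the values used in ART-E:

```python
def reward(question, golden_answer, rollout, call_judge,
           turn_bonus=0.05, hallucination_penalty=0.5):
    """Score one rollout: primary credit for matching the golden answer,
    small extra credit for efficiency, and a penalty for confident wrong answers."""
    verdict = call_judge(
        f"Question: {question}\n"
        f"Reference answer: {golden_answer}\n"
        f"Model answer: {rollout['answer']}\n"
        "Does the model answer match the reference? Reply YES or NO."
    )
    correct = verdict.strip().upper().startswith("YES")

    score = 1.0 if correct else 0.0
    # Extra credit: fewer tool-calling turns means lower cost and latency.
    score += turn_bonus * max(0, 6 - rollout["num_turns"])
    # Penalize a confident wrong answer more than an honest "I don't know".
    if not correct and not rollout["declined_to_answer"]:
        score -= hallucination_penalty
    return score
```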
Okay, a few other interesting learnings from this project. One thing we found is that you can actually throw a lot of stuff into your reward function beyond just the primary thing you're trying to solve for. We ended up with something like eight different little things we gave extra credit for, and I'm going to share two of them here. The first one is that we had it optimize for the number of turns: how many times it had to go back and forth and query the email inbox before it came up with the right answer. The most important thing, of course, is getting the answer right, but between two answers that both get it right, we would rather it took fewer turns back and forth, because that's fewer tokens, lower latency, lower cost; it's just a more efficient agent. You can see on this first graph that early on, while it was getting its feet wet and figuring out what worked, it spiked up to over six turns on average; it would go back and forth a bunch of times with the email inbox trying to find the right thing. But then, once it figured out how to use the tools efficiently, how to construct keywords and find the right email, it got very efficient, in fact better than any of our prompted models on this metric of using fewer turns. And again, this was just because we gave it a little bit of extra credit, a very small amount relative to the reward for getting it right, for using fewer turns, and it was able to optimize against that. Another extra reward term we gave it was to discourage it from hallucinating answers. Obviously the best thing is to get the right answer, but if you can't find the right answer, it's much better to say "hey, I don't know" than to make up an answer in a situation like this. So we basically penalized it: if the reward model said it got the answer wrong and it had tried to give an answer anyway, that was a much lower reward than if it just said, "hey, I don't know, I can't solve this problem." And as you can see, that worked quite well. Compared to any of the prompted models, including o3, we ended up with a significantly lower hallucination rate, because that was part of our reward function. Again, these are things that are just sort of extra credit, but we found you can throw in a bunch of these and it can jointly optimize all of them at the same time, which is super powerful. Okay, I want to talk a little bit about reward hacking. It's something that comes up a lot when you're trying to do this, and it's kind of a fun thing to talk about. This is an iconic video some of you might have seen, released by OpenAI almost a decade ago at this point: they had an environment where they were trying to get a boat to complete a race, and instead of learning to complete the race, it learned that if it just went in a little circle that wasn't even part of the racetrack, it could rack up a bunch of points. So it just started doing that over and over again instead of actually racing. This is something that comes up a lot if you're doing reinforcement learning, and it's basically the difference between what you actually want the model to do and what you can measure, meaning what you're actually rewarding it for. And almost always, if you let one of these run long enough, it will figure out some way to exploit your measure and get a really high reward without actually solving the problem. You need to watch for that. So I'm going to give a couple of examples. This is a graph from another project, actually, not this one. An engineer on our team was working on this game called NYT Connections; some of you might know it: you get 16 words and you have to put them into four groups of four. It's quite a challenging game, especially for these language models, because it requires a lot of world knowledge and lateral thinking.
Anyway, they were trying to train this model to do it, and it wasn't figuring it out, it wasn't figuring it out, and then, boom, you can see here, around step 40, it just takes off: okay, we figured out how to solve this. And this engineer on our team (he's here at the conference, he's great, you should talk to him after) was like, "Hey, we solved it, we got NYT Connections." And the graph looks good, so let's look at what it's actually doing. What it was actually doing is it had figured out there was a bug in how we wrote the verification: if it just put every single word in every single category, it got a perfect score, because we weren't verifying that there were in fact only four words in each category. Here's another example; this is a fun one. I was training a model to produce really good titles for Hacker News, titles that would get a post upvoted. I had a reward model I had trained on existing Hacker News articles and how many upvotes they got, and I was trying to train this model to produce new titles. It was working really well for a while. You can see, and subjectively as well when I looked at a bunch of the generated titles, that for the first thousand steps or so it was actually learning things where I was like, okay, as someone who spends way too much time on Hacker News, yeah, that does look like a good title, you're doing a good job. And then you can see, around step 1200 here, it just jumps a bunch, right? Okay, it clearly figured something out; I don't know what it figured out, but we should look at that. And it turns out what the model had figured out was that it could completely ignore the content of the post and generate the same title for every single one, and that would maximize its score. So it generated the title "Google lays off 80% of workforce" for literally every single article, and the reward model was like, yes, that is going to get upvoted on Hacker News for sure, which, to be fair, it probably would. So anyway, on how we solved this: what we found is that it's really important to watch out for this, and solving it typically involves modifying your reward function in some way to penalize things like that. In the second example I talked about, it was actually quite an easy fix once we identified it, which was just to add an extra LLM-as-judge that looked at the title, looked at the content, and said, hey, is there anything in the title that's not supported by the content? We added that on, and it worked great. The important thing here is you want to be looking at your rollouts, not just blindly trusting the reward function, and figuring out what's actually happening.
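A minimal sketch of that fix, assuming a hypothetical `call_judge` helper and an already-trained `upvote_score` predictor; the prompt wording is illustrative:

```python
def title_reward(post_text, title, upvote_score, call_judge):
    """Gate the learned upvote reward behind a groundedness check, so a
    catchy but fabricated title gets no credit."""
    verdict = call_judge(
        f"Post:\n{post_text}\n\nProposed title: {title}\n"
        "Is every claim in the title supported by the post? Reply YES or NO."
    )
    if not verdict.strip().upper().startswith("YES"):
        return 0.0  # reward hack caught: title not grounded in the content
    return upvote_score(title)  # otherwise use the learned upvote predictor
```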
Anyway, that's it. I'm almost out of time, so I'm going to stop. A couple of QR codes for you. Everything in this presentation, plus a much longer write-up I have of this whole project, including the code, the artifacts, and the datasets along the way, you can check out there. One more thing: we have an open-source project for training reinforcement learning models, and we have a Discord that's open. You can go there if you're interested in this kind of thing. We're all in there, we answer questions, and there are lots of people from the community trying to do these things. So if you're interested in building things with this, feel free to join, and I'm happy to chat there. And yes, thank you everyone, I appreciate your time. All right. And since I am also, in my separate hat, the moderator for this track, it is my great pleasure to welcome to the stage Nathan Lambert. Everyone give a round of applause. I'm sure many of you know who Nathan is; he's probably the strongest voice in the open ecosystem talking about post-training generally, and also reinforcement learning. He's got a great newsletter, which he also puts out as a podcast, and lots of content on X. He also runs a bunch of projects at AI2, which builds probably the best fully open models. So anyway, let's hear from Nathan. Okay, thanks for the intro, fellow Seattleite; I'm sure we'll talk more. I really came to this thinking about trying to reflect, six months into this reinforcement learning with verifiable rewards era, post-o1, post-DeepSeek. I think a lot of this stuff is somewhat boring because everybody has a reasoning model. We all know the basics: you can scale RL at training time and the numbers will go up, and that's deeply correlated with being able to then do this inference-time scaling. But really, in AI right now, there are a lot of people who are up to speed, and the crucial question is where things are going to go, and how do you skate to where the puck is going. So a lot of this talk is me trying to process where this is going, besides getting high benchmark scores using 10,000 tokens per answer: what do we need to do to actually train these models, and what are the things that OpenAI and others are probably already doing, though it's increasingly hard to get that signal out of them. So if we look at this, reasoning is also really unlocking new language model applications. This is the same search query; as an RL researcher I need to find this all the time: I forget that it's called CoastRunners and I Google "overoptimization" twenty times to find it. But I tried asking o3, and it literally gave me the download link directly, so I didn't even have to do anything. That's a very unusual use case to just pop out of this reasoning training, where math and code were the real things to start with. And o3 is great; it's the model I use the most for finding information. This is the signal I have that a lot of new, interesting things are coming down the pipe. It's starting to unlock a lot of new language model applications, and I use some of these. This is a screenshot of Deep Research. It's great. You can use it in really creative ways, like prompting it to look at your website and find typos, or to look only at the material on your website, things like that. It's actually more steerable than you may expect. Then there's Claude Code, which I describe as: the vibes are very good. It's fun.
I'm not a serious software engineer, so I don't use it on hard things, but I use it for fun things, because I can put the company API key in and just mess around, like helping me build the website for this book that I wrote online. And then there are the really serious things, like Codex and these fully autonomous agents that are starting to come. If you play with it, it's obvious that the form factor is going to be able to work. I'm sure there are people getting a lot of value out of it right now. For ML tasks, there are no GPUs in it right now, and if you're dealing with open models, they just added internet access, so previously it wasn't able to go back and forth and look at Hugging Face configs or something, all these headaches you don't want to deal with. But within six months, all of these things are going to be stuff you should be using on a day-to-day basis. And this is all downstream of this step change in performance from reasoning models. Then there's this other plot that's been talked about. When I look at this: through 2024, if we look at something like GPT-4o, things were really saturating. And then there's this new set of models with o1, which really helped push out the frontier in time horizon. The y-axis is roughly how long a task the models can complete, measured in time, which is kind of a weird way to measure it because things will get faster, but it's going to keep going, and this reasoning model technique is what was unlocked in order to figure out how to push the limits. When you look at things like this, it's not that we're just on a path determined by AI and more gains are going to come; it's really that we have to think about what the models need to be able to do in order to keep pushing out these frontiers. There's a lot of human effort that goes into continuing the trends of AI progress. The gains aren't free. And I'm thinking that a lot of planning, and thinking about training in a bit of a different way beyond just reasoning skills, is going to be what helps push this and enables these language-modeling applications and products that are in the early stages to really shine. So this is the core question I'm thinking about: what do I have to do to come up with a research plan to train reasoning models that can work autonomously and really have meaningful ideas of what planning would be? I came up with a taxonomy that has a few different traits, as I call them, within it. The first one is skills, which we've pretty much already done. Skills are getting really good at math and code; inference-time scaling was useful for getting there, but they become more researchy over time. For products, I think calibration is going to be crucial: these models overthink like crazy, so they need to have some calibration of how many output tokens are used relative to the difficulty of the problem. And this becomes more important when we're spending more on each task that we're planning.
And then the last two are subsets of planning that I'm thinking about, and I'm happy to take feedback on this taxonomy. Strategy is just going in the right direction and knowing different things that you can try, because it's really hard for these language models to change course: they can backtrack a little bit, but restarting their plan is hard. And then, as tasks become very hard, we need abstraction, which is the model choosing on its own how to break down a problem into different things it can do. Right now humans would often do this, but if we want language models to do very hard things, they have to make a plan with subtasks that are actually tractable, or call in a bigger model to do that for them. These are things the models aren't going to do natively. Natively, they're doing something like math problem solving, which doesn't have a clear abstraction of "this task I can do, and this one needs an additional tool," and so on. So this is a new thing that we're going to have to add. To summarize: we have skills, we have calibration (I'll highlight some of it), but planning is a new frontier where people are talking about it, and we really need to think about how we will actually put it into the models. To just put this up on the slide, what we call reinforcement learning with verifiable rewards looks very simple. A lot of RL on language models, especially before you get into the multi-turn setting, has been: you take prompts, the agent creates a completion for each prompt, then you score the completions, and with those scored completions you can update the weights of the model. It's been single-turn, it's been very simple. I'll have to update this diagram for multi-turn and tools, which makes it a little more complex, but the core of it is just a language model generating completions and getting feedback on them (a minimal sketch of this loop follows below). And it's good to just take time to look at these skills. These are a collection of evals, and we can look at where GPT-4o was; these were the hardest evals that existed and were called the frontier of AI. And if we look at the o1 improvements and then the o3 improvements in quick succession, these are really incredible eval gains that are mostly just from adding this new type of training. The core of my argument is that we need to do something similar if we want planning to work. I would say that a lot of planning tasks today look like Humanity's Last Exam and AIME did just before this reasoning skill was added, and we need to figure out what other types of things these models are going to be able to do. So this list of reasoning abilities, these low-level skills, is going to continue to grow. I think the most recent one, if you look at recent DeepSeek models or recent Qwen models, is really tool use being added in, and that's going to build more models like o3. Using o3 just feels very different because it is this combination of tool use with reasoning, and it's obviously good at math and code. But we're going to keep getting more of these low-level skills that we expect from reasoning training as we figure out what is useful.
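For reference, the single-turn loop described a moment ago (prompts in, completions out, scores used to update the weights) can be sketched schematically. `policy`, `verifier`, and `rl_update` are hypothetical stand-ins for the model, the verifiable reward check, and the optimizer step (for example a GRPO- or PPO-style update); this is an illustration of the shape of RLVR, not any particular implementation:

```python
def rlvr_step(prompts, policy, verifier, rl_update, samples_per_prompt=8):
    """One schematic RLVR iteration: sample, score, update."""
    batch = []
    for prompt in prompts:
        completions = [policy.generate(prompt) for _ in range(samples_per_prompt)]
        rewards = [verifier(prompt, c) for c in completions]
        # Group-relative advantage (GRPO-style): reward minus the group mean,
        # so completions that did better than their siblings get reinforced.
        mean_r = sum(rewards) / len(rewards)
        advantages = [r - mean_r for r in rewards]
        batch.append((prompt, completions, advantages))
    rl_update(policy, batch)  # one gradient update of the policy weights
```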
So I think an abstraction for the kind of agenticness on top of tool use is going to be very nice, but it's hard to measure; people mostly say that Claude is the best at it, but it's not yet well established how we measure or communicate it across different models. And then this is where I get into the fun, interesting things. I think it's hard for us because calibration is currently passed to the user: we have all sorts of things like model selectors if you're a ChatGPT user, Claude has reasoning on and off with extended thinking, Gemini has something similar, and there are these reasoning-effort selectors in the API. This is really rough on the user side of things, and making it so the model knows this will make it much easier to find the right model for the job, and the amount you overspend on tokens for no reason will go down a lot. It's kind of obvious to want it, and it becomes a bigger problem the longer we don't have it. Some examples from when overthinking was first identified as a problem: on the left half of this, you can ask a language model "what is 2 plus 3," and you can see these reasoning models use hundreds to a thousand tokens for something that could realistically be one token of output. On the right is a comparison of sequence lengths from a standard, non-RL-trained instruction model versus the QwQ thinking model, and you really can see this 10 to 100x increase in token spend when you switch to a reasoning model. If you do that in a way that is wasteful, it's going to load your infrastructure and cost, and as a user I don't want to wait minutes for an easy question, and I don't want to have to switch models or providers to deal with that. One of the things that comes once we have this calibration is this strategy idea. On the right, I went to the Epoch AI website, took one of their example questions from FrontierMath, and asked: does this new DeepSeek-R1-0528 model do any semblance of planning when it starts? You ask it a math problem and it just goes, okay, the first thing I'm going to do is construct a polynomial. It just dives right in and doesn't do anything like sketching the problem before it thinks. And this is probably going to output 10 to 40,000 tokens, and if it needs to do another 10x on top of that, then if that's all in the wrong direction, that's multiple dollars of spend and a lot of latency that's just totally useless. Most of these applications are set up to expect a latency between 1 and 30 minutes, so there is a timeout they are fighting. Either going in the wrong direction, or thinking way too hard about a sub-problem, is going to make the user leave. Right now, these models do very little planning on their own, but if we look at these applications, they're very likely prompted to plan, which is the beginning of Deep Research and Claude Code, and we have to make that model-native rather than something we do manually. And then once we look at this plan, there are all these implementation details across something like Deep Research or Codex, like how to manage memory: Claude Code compresses its memory when it fills up its context window, and we don't know if that's the optimal way for every application. We want to avoid repeating the same mistakes.
Greg was talking about playing Pokémon earlier, which is a great example of that. We want to have tractable parts. We want to offload thinking if there's a really challenging part; I'll talk about parallel compute a little later as a way to boost through harder things. And really we want language models to call multiple other models in parallel. Right now people are spinning up tmux sessions and launching Claude Code in ten windows to do this themselves, but there's no reason a language model can't do that; it just needs to know the right way to approach it. And as I started with this idea that we need to make an effort to add new capabilities into language models: when I think about this story of Q* that became Strawberry that became o1, the reason it was in the news for so long and was such a big deal is that it was a major effort for OpenAI, spending something like 12 to 18 months building the initial reasoning traces that they could then train an initial model on that had some of these behaviors. It took a lot of human data to get things like backtracking and verification to be reliable in their models. We need to go through a similar arc with planning. But with planning, the kinds of outputs we're going to train on are much more intuitive than something like reasoning. If I were to ask you to sit down and write a 10,000-token reasoning trace with backtracking, you can't really do that, but a lot of expert people can write a five-to-ten-step plan that is very good, or check the work of Gemini or OpenAI when asked to write an initial plan. So I'm a lot more optimistic about being able to hill-climb on this. Then it goes through the same path: once you have initial data, you can do some SFT, and then the hard question is whether RL on even bigger tasks can reinforce these planning styles. On the right, I added a hypothetical: we already have thinking tokens before answer tokens, and there's no reason we can't apply more structure to our models to make them plan out their answer before they think. To give a bit more depth on this idea of skill versus planning, if we go back to this example, I would say that o3 is extremely skilled at search. Being able to find a piece of niche information that researchers in a field know of but can't quite remember the exact search words for is an incredible skill. But when you try to put this into something like Deep Research, the lack of planning means that sometimes you get a masterpiece and sometimes you get a dud. As these models get better at planning, they'll be more thorough and reliable in getting the kind of coverage that you want. It's crazy that we have models that can do this search, but if you ask one to recommend some sort of electronics purchase or something, it's really hard to trust, because it doesn't know how to pull in the right information or how hard it should try to get all that coverage. So, to summarize, these are the four things I presented. You can obviously add more to these; you could add a mix of strategy and abstraction.
You could also call a lot of what I was describing context management in many ways, but really you just want categories like this so you can break down the training problem and think about data acquisition or new algorithmic methods for each of these tasks. I mentioned parallel compute because I think it's an interesting one: if you use o1 Pro, it has been one of the best and most robust models for quite some time, and I've been very excited for o3 Pro. But it doesn't solve problems in the same way as traditional inference-time scaling. Inference-time scaling made a bunch of things that didn't work go from zero to one, whereas parallel compute really makes things more robust; it just makes them nicer. It seems like this kind of RL training encourages exploration, and then if you apply more compute in parallel, it feels more like exploiting and getting a really well-crafted answer. So there are times when you want that, but it doesn't solve every problem. And to transition into the end of this talk: there have been a lot of talks today about the things you can do with RL, and there's obviously a lot of talk on the ground about what is called continual learning, whether we'll just continually use very long-horizon RL tasks to update a model and diminish the need for pre-training, and there are a lot of data points suggesting we're closer to that in many ways. I think continual learning has a big algorithmic bottleneck, but just scaling up RL further is very tractable and something that is happening. So if people ask me what I'm working on at AI2 and what I'm thinking about, this is my rough summary of what a research plan looks like for training a reasoning model, without all the between-the-lines details. Step one: get a lot of questions that have verified answers across a wide variety of domains. Most of these will be math and code, because that's what's out there. Step two: if you look at all these recipe papers, they have a step where they filter the questions based on difficulty with respect to your base model. If a question is solved zero out of 100 times by your base model, or 100 out of 100, you don't want questions like that, because you're not only wasting compute, you're also making the gradients in your RL updates noisier. And once you've done that, you just want a stable RL run that goes through all these questions and keeps the numbers going up. That's the core of it: stable infrastructure and data. Then you can tap into all the research papers that tell you to do things like overlong filtering, different clipping, or resetting the reference model, and that'll give you a few percentage points on top, but really it's just data and stable infrastructure. And this leads to the provocation: what if we rename post-training as training? If OpenAI o1 used something like 1% of compute on post-training relative to pre-training, they've already said that o3 increased that by 10x. So if the number started at 1%, you're very quickly getting to what you might see as parity in compute, in terms of GPU hours, between pre-training and post-training, which, if you took anybody back to a year before o1, would seem pretty unfathomable.
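A minimal sketch of that difficulty-filtering step, just to show the shape of it. The pass-rate thresholds, the number of samples, and the exact-match verifier are placeholders; real recipes plug in their own verifiers and tune these numbers.

```python
def verify(completion: str, answer: str) -> bool:
    # Placeholder verifier: exact match on the last non-empty line of the completion.
    # Real recipes use math answer checkers, unit tests, and so on.
    lines = [line for line in completion.strip().splitlines() if line.strip()]
    return bool(lines) and lines[-1].strip() == answer.strip()

def pass_rate(question: str, answer: str, sample_fn, n: int = 16) -> float:
    """Fraction of n sampled base-model completions that the verifier accepts."""
    return sum(verify(sample_fn(question), answer) for _ in range(n)) / n

def filter_by_difficulty(dataset, sample_fn, low: float = 0.05, high: float = 0.95):
    """Drop questions the base model always or never solves; at pass rates of 0 or 1
    every rollout gets the same reward, so there is no advantage signal to learn from."""
    kept = []
    for question, answer in dataset:
        rate = pass_rate(question, answer, sample_fn)
        if low <= rate <= high:
            kept.append((question, answer))
    return kept
```

Here sample_fn stands in for one sampled completion from the base model at the training temperature.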
One of the fun data points for this pre-training versus post-training balance is the DeepSeek V3 paper, where you can watch DeepSeek's transition into becoming more serious about post-training. In the original DeepSeek V3 paper, they used 0.18% of compute on post-training in GPU hours, and they said their pre-training took about two months. There was a deleted tweet from one of their RL researchers saying the R1 training took a few weeks. So if you make a few very strong, probably not completely accurate assumptions, like that the RL ran on the same whole cluster, that would already be 10 to 20% of their compute. There are DeepSeek-specific caveats, like their pre-training efficiency probably being way better than their RL code, but scaling RL is a very real thing if you look at frontier labs and the types of tasks people want to solve with these long-term plans. So it's good to embrace what you think these models will be able to do, and let them break down tasks on their own and solve some of them. So, thanks for having me, and let me know what you think. [Applause] Thank you. All right. So for our final talk of the track for the day, we have the great pleasure of hearing from Chris Szegedy. Chris was a researcher at Google for a long time, became a co-founder of xAI, where he was for a couple of years, and is now working on a new startup. He's going to be talking to us today about the path towards verified superintelligence, which I'm very curious about myself. Unfortunately, Chris did have a last-minute conflict, so he's not able to be here in person; he will be joining on Zoom. We should see him coming up in just a moment and be able to see his talk. So let's welcome Chris. [Applause] I see people in the AV corner looking very... okay, we're good. I feel like we've all been here with Zoom before, so we'll give Chris a moment. "Sorry, I was just getting permission in Zoom; I didn't have permission to share my screen." We can all empathize; I think this has happened to everyone on some meeting. Okay. Yes, we can see it. Sorry about that. So first let me give some background. Basically, when I started doing this around 2011, nobody knew whether it was even a good research topic. Then around 2015 we learned it was what people were most excited about; it was special in some sense, but it was taken to the next level by accessible data, and that's roughly where we were until about two years ago. Then I think the big idea behind the recent wave is that you have to create this information, the chain of reasoning, and learn from it with reinforcement learning; you reinforce that idea. So what are the lessons we can draw from all of this? The first is that deep learning works at all. The second is that deep learning scales. Now the question is what the next big trend is that we can expect to happen, and that's what I would like to talk about. But in order to guess it, let's look at what the limitations of each of these trends were. First, you need labeled data, so we needed a lot of humans to label data points, and that took real effort. So, what are the bottlenecks to AI self-improvement?
So, I mean, it's the human-generated labels, the human-generated data, and now we have human-generated tasks, human supervision, and even human-generated verifiable problems. In order to scale up reinforcement learning you need to generate a lot of environments. So basically what we can say is that the bottleneck to AI self-improvement is the human: every aspect of the learning process that relies on human input is prone to be problematic. Either it adds friction, or it is a limited source of data, because human data is getting less and less available. So what we really have to figure out is how we create open-ended self-improvement. That is a hard question, and in the previous talk Nathan was also pointing out that this is the future. But the real question is how we really scale it up, so that it becomes a combination of large language model prediction and really superhuman reasoning in almost every aspect. So how do we get there? What is the way to reach self-improvement? We basically have to remove these bottlenecks. Now that we have run out of all this generated data from the internet, we are running into the problem of having only a very limited set of human-generated environments to train on, and we have already run out of those too. People hire a lot of people, and it takes a lot of money and effort, to create these environments and tasks that the AI can learn from. So in short, the AI needs compute, it wants agency, it needs challenges, and it needs feedback, basically a verification signal that gives a reward to the agent. These are the most important parts of making open-ended self-improvement work. From these basic principles we can make some claims about how to solve these bottlenecks. The first bottleneck, the one almost everybody is concerned about nowadays, is having good environments in which the AI can run safely. We don't want the AI to just go into some environment, take full agency, and corrupt your computer or your cloud environment. So there is a tension between agency and safety: if the AI gets full agency it can be dangerous, but that is also what lets it create things itself. We basically want an environment in which the agent can be absolutely free, where it can have root access or even web access, and we also want the agent to be able to backtrack its decisions, even if it installed the wrong package for a task, or created the wrong program, or deleted an important tool from the toolchain. In that case the agent should be able to decide: okay, I want to go back and try a different path. This is not just important at inference time, when we execute the agent; it's even more important for reinforcement learning, where you are running the agent on a lot of problems. So getting a sandboxing environment that allows you to do a lot of branching and undo, basically like a version control system for machine snapshots, is an absolute must for sophisticated agents. The other important bottleneck we have seen is the lack of good verifiable problems. I think the next generation of AI will need to be able to create its own curriculum of problems.
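For illustration, here is a rough sketch of the kind of snapshot, branch, and rollback interface being described. The class and method names are hypothetical (this is not the Morph Cloud API), and a real system would snapshot a whole VM or filesystem rather than a Python dict.

```python
import copy
import uuid

class Sandbox:
    """Toy stand-in for a branchable, roll-back-able execution environment."""

    def __init__(self, state=None):
        self.state = state if state is not None else {}  # stand-in for a VM / filesystem image
        self.snapshots = {}

    def snapshot(self) -> str:
        """Save the current state and return an id the agent can branch from or roll back to."""
        sid = uuid.uuid4().hex[:8]
        self.snapshots[sid] = copy.deepcopy(self.state)
        return sid

    def branch(self, sid: str) -> "Sandbox":
        """Fork a fresh sandbox from a snapshot, so many rollouts can explore in parallel."""
        return Sandbox(copy.deepcopy(self.snapshots[sid]))

    def rollback(self, sid: str) -> None:
        """Undo everything since the snapshot, e.g. after installing the wrong package."""
        self.state = copy.deepcopy(self.snapshots[sid])
```

An agent would call snapshot() before a risky step and rollback() if that path turns out to be bad; during RL training the same forking is needed thousands of times over.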
On the curriculum point: you don't want the agent to be given a fixed set of problems to train on. We should just give it some high-level task, say, learn to be a good Rust programmer, and then it should go to the internet, find the important material, and improve its own skills without a set curriculum. And we want the agent to have this ongoing, continuous learning, as Nathan also mentioned. I think that's very important, and that's the future of AI: being able to learn while it's actually executing useful tasks as well. I think that will be very important in the next few years. But most importantly, and that's why I put such an emphasis on verification and verified AI, we need very strong verifiers and validators. What do I mean by that? A verifier is an agent that verifies the correctness of a solution. A validator, which could be the same agent, validates that the problem formulation was interpreted correctly, that the human input was correctly understood. The first one can be done with an external verifier; you don't even need a model for that. It can be an agent that just uses some kind of verifier that, for example, runs the code or calls a proof checker, doing some formal verification and so on. But the validator should be model-based, because it checks the alignment between the human and the machine, and that's very important. And then we think the next generation of AI will be able to produce safe and independently verifiable artifacts. We want code that actually comes with its own guarantees of correctness. Today, computer chips are already verified extensively, and we don't do that with software, but I think in the next five years AI-driven cyber attacks will be so severe that nobody will want to run software that is not fully formally verified, at least for some purposes. And while the AI produces verifiable artifacts, so it shouldn't just produce code, it should produce code plus a proof of correctness of that code, at the same time it will have to improve its own alignment: it should get better and better at respecting and guessing the human's intent as well. That's what I believe this self-supervised reinforcement learning is, or what I call verified superintelligence. They sound different, but they are actually the same thing. I think this is possible. We can get to self-supervised reinforcement learning, and it will be as big a leap as going from supervised learning on labeled data to training on the whole internet. This will be, in my opinion, the next big step: an agent that can create its own problems, act as its own verifier, and bootstrap itself from all of the internet, or whatever part of the internet you want, autonomously finding its own problems. It creates its own problems and solves them, so we don't really need to give it any more problems. So how do we get to this infinite supply of problems? That's the hard question, and what we think is the key to it is verification.
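To make the verifier versus validator split concrete, here is a minimal sketch. The verifier below is model-free (it just runs the candidate code against tests), while the validator has to call a judge model, since it checks whether the human's intent was interpreted correctly. The function names, prompt wording, and the llm_judge callable are assumptions for illustration.

```python
import subprocess
import tempfile
import textwrap

def verify_python_solution(code: str, test_code: str) -> bool:
    """Model-free verifier: run the candidate code plus its tests in a subprocess.
    A proof checker or fuzzer could play the same role for other kinds of artifacts."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + textwrap.dedent(test_code))
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=60)
    return result.returncode == 0

def validate_interpretation(task: str, solution_summary: str, llm_judge) -> bool:
    """Model-based validator: ask a judge model whether the solution addresses what the
    task actually asked for (alignment with intent, not just mechanical correctness)."""
    prompt = (
        f"Task: {task}\n"
        f"Solution summary: {solution_summary}\n"
        "Does the solution address what was asked? Answer yes or no."
    )
    return llm_judge(prompt).strip().lower().startswith("yes")
```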
So basically what I'm suggesting is that we mine the internet for actual problems but turn them into formally verifiable artifacts, formally verified meaning that the correctness of the artifacts can be checked. For code or mathematical problems you can use a theorem prover, for example, to check the logical correctness of things, which is definitely hard but not impossible, and it's getting more and more possible. At the same time, the system uses these signals to create rewards for alignment. So when it gets an artifact from the internet, for example some code or a problem, and then tries to code it up and reach some solution, it can ask: does this solution really align with what was meant here? I think this can be bootstrapped with various reward signals coming from cycle-consistency ideas. These can, in my opinion, also be trained and enforced. So correctness and alignment have this interplay between them, because if the output is not correct then it cannot be properly aligned. And what I really believe is that we can do most of the correctness checking model-free, so it doesn't necessarily require a model that checks the output; it's more like a well-defined verifiable artifact, like in an NP problem where you always have a verifier. Alignment must be model-based, unfortunately, but I think the interplay between the two allows us to improve on both fronts. And I think verification should always be agentic. We want to move away from having a model trained to verify something; the verifier should also have tool access and be able to go into the environment where the artifact was produced, for example a software system, and test that software system in all kinds of scenarios. So this is a very rough sketch of how I believe this roadmap works out: we have a generator agent and a verifier/validator agent. The agents can share the same model, but they are all connected to something like the Morph Cloud that we are working on. This cloud allows you to branch a lot of sandboxes, try various versions, roll back, and so on. The generator provides solutions to the verifier, and the verifier and validator give back reward signals to train the generator; they basically give each other reward signals. The potential tasks are taken from the web, and they should also be agentically generated or crawled by the agent itself, not selected or created by some human; they should be created automatically. And then the important part is that we want the agent to be able to produce artifacts that are verified completely independently. You don't want to trust the AI, you want to verify it. So these are the main properties that I really want: we have first-class verification in this system, and we have first-class alignment in this system. These are grounding principles of how to build the system, and they should be provided from the ground up; they should be goals of the agent that we are reinforcing the whole time. And we want verifiable artifacts, so that we can trust the output without any AI being involved in the checking. These proofs might be created by a lot of AI computation, but they should just be verifiable, and then this will give us an infinite supply of hard problems.
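Putting the pieces together, a very rough skeleton of the generator, verifier, and validator loop just described might look like the following. Every name here is hypothetical; this is a sketch of the idea, not a description of an existing system.

```python
def self_improvement_step(generator, verifier, validator, mine_tasks, new_sandbox, update):
    """One round of the loop: mine or generate tasks, let the generator attempt them in
    branchable sandboxes, score the attempts, and feed the rewards back into training."""
    results = []
    for task in mine_tasks():                        # tasks crawled or generated agentically
        sandbox = new_sandbox()                      # fresh branchable environment per rollout
        solution = generator(task, sandbox)          # possibly many tool-using steps
        correct = verifier(task, solution, sandbox)  # e.g. run tests or call a proof checker
        aligned = validator(task, solution)          # model-based check of intent
        results.append((task, solution, float(correct and aligned)))
    update(generator, results)                       # RL update, e.g. a GRPO-style step
    return results
```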
So what are the application domains of this verified superintelligence? I think it will be mandatory for AI software engineers. If you make one bug in 100 lines of code, that's roughly okay today; you still get some value out of your model. But if you want an agent where the model can work for hours on a complicated software project completely autonomously, then one bug every 100 lines is impossible, and you will never get anything done. So I think we need ongoing verification all the time during the process. Another important problem is cybersecurity: we want our systems to be hacking-proof. We don't want external AIs to hack into our systems, which will happen more and more in the next few years. Also engineering, since engineering requires a lot of mathematics and reasoning, and the sciences of course require this too. We believe that mathematics will also be properly verified in the next few years, mostly by AI agents. So the goal here is verified superintelligence: we want a trustlessly aligned AI. This means we don't just want to trust the AI, we want to actually check it, and this check should be possible. So that's... okay, sorry. Yeah, that concludes my talk, and I'm happy to take questions. Thank you so much, Chris. We appreciate you taking the time. All right, well, this concludes our track; this concludes the breakout sessions for today. I don't think we'll be able to do questions because we are a little bit over time, unfortunately. At five o'clock, or sorry, at four o'clock there are going to be keynotes on the main stage, so look forward to seeing all of you there, and I hope you have a good rest of your conference.