A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai
Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: jQcsVk0KWiQ
Source: https://www.youtube.com/watch?v=jQcsVk0KWiQ
I really came to this trying to reflect on where we are six months into reinforcement learning with verifiable rewards, post-o1 and post-DeepSeek. I think a lot of this stuff is somewhat boring now because everybody has a reasoning model. We all know the basics: you can scale RL at training time and the numbers will go up, and that's deeply correlated with being able to then do inference-time scaling. A lot of people in AI right now are up to speed. The crucial question is where things are going to go, and how do you skate to where the puck is going? So a lot of this talk is really me trying to process where this is going besides getting high benchmark scores using 10,000 tokens per answer, what we need to do to actually train these models, and what the things are that OpenAI etc. are probably already doing, though it's increasingly hard to get that signal out of them.
Reasoning is also unlocking really new language model applications. This is a search query that I, as an RL researcher, need to run all the time. I forget that it's called CoastRunners, and you Google "overoptimization" 20 times to find it. But I tried asking o3 and it literally gave me the download link directly, so I didn't even have to do anything. That's a very unusual use case to just pop out of reasoning training, where math and code were the real things to start with. o3 is great. It's the model I use the most for finding information, and this is really the signal I have that a lot of new, interesting things are coming down the pipe. It's starting to unlock a lot of new language model applications, and I use some of these. So this is a screenshot of Deep Research. It's great.
You can use it in really creative ways, like prompting it to look at your website and find typos, or to look only at the material on your website, and things like this. It's actually more steerable than you may expect. Then there's Claude Code, which I describe as: the vibes are just very good. It's fun. I'm not a serious software engineer, so I don't use it on hard things, but I use it for fun things, because I can put the company API key in and just mess around, like having it help me build the website for this book that I wrote online. And then there are the really serious things, like Codex and these fully autonomous agents that are starting to come. If you play with it, it's obvious that the form factor is going to work. I'm sure there are people getting a lot of value out of it right now. For ML tasks, there are no GPUs in it right now, and if you're dealing with open models, they only just added internet access, so before that it couldn't go back and forth and look at Hugging Face configs or all these headaches that you don't want to deal with. But in six months, all of these things are going to be stuff you should be using on a day-to-day basis. And this is all downstream of this step change in performance from reasoning models.
And then this is another plot that's been talked about. When I look at this, through 2024, if we look at GPT-4o, things were really saturating then, and then there's this new set of models starting with o1, which really helped push out the frontier in time horizon.
So the y-axis is how long a task the models can complete is, roughly, measured in time, which is kind of a weird way to measure it because things will get faster, but it's going to keep going. Reasoning models are the technique that was unlocked in order to push these limits. And when you look at things like this, it's not that we're just on a path determined by AI and more gains are going to come. We really have to think about what the models need to be able to do in order to keep pushing out these frontiers. There's a lot of human effort that goes into continuing the trends of AI progress. Gains aren't free. I think that a lot of planning, and thinking about training in a somewhat different way beyond just reasoning skills, is going to be what helps push this and enables these language model applications and products that are in the early stages to really shine.
So the core question I'm thinking about is: what do I have to do to come up with a research plan to train reasoning models that can work autonomously and really have meaningful ideas of what planning would be? I came up with a taxonomy that has a few different things that I call traits within it. The first one is skills, which we've pretty much already done. Skills are things like getting really good at math and code. Inference-time scaling was useful for getting there, but they become more researchy over time. For products, I think calibration is going to be crucial, because these models overthink like crazy. They need to have some calibration of how many output tokens are used relative to the difficulty of the problem. And this will become more important when we're spending more on each task that we're planning.
The last two are subsets of planning that I'm thinking about, and I'm happy to take feedback on this taxonomy. Strategy is just going in the right direction and knowing the different things you can try, because it's really hard for these language models to change course: they can backtrack a little bit, but restarting their plan is hard. And then as tasks become very hard, we need abstraction, which is the model choosing on its own how to break down a problem into different pieces that it can do on its own. Right now humans would often do this, but if we want language models to do very hard things, they have to make a plan with subtasks that are actually tractable, or call in a bigger model to do that for them. These are things the models aren't going to do natively. Natively they're doing things like math problem solving, which doesn't have a clear abstraction of "this task I can do, and this one with this additional tool," and so on. So this is a new thing that we're going to have to add. To summarize: we have skills, we have research on calibration, and I'll highlight some of it, but planning is a new frontier where people are talking about it and we really need to think about how we will actually put this into the models.
Just to put this up on the slide, what we call reinforcement learning with verifiable rewards looks very simple. A lot of RL on language models, especially before you get into the multi-turn setting, has been: you take prompts, the agent creates a completion for each prompt, you score the completions, and with those scored completions you update the weights of the model. It's been single-turn. It's been very simple. I'll have to update this diagram for multi-turn and tools, which makes it a little more complex. But the core of it is just that a language model generates completions and gets feedback on them.
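A minimal sketch of that single-turn loop, under heavy assumptions: the "language model" here is just a table of logits over three candidate answers per prompt, the verifier is an exact-match check, and the update is plain REINFORCE with a mean-reward baseline. The names `PROMPTS`, `CANDIDATES`, `verify`, and `train` are hypothetical and only illustrate the generate, score, update cycle; they are not taken from the talk or from any lab's training stack.

```python
import math
import random

# Toy "verifiable" dataset: each prompt has exactly one checkable answer.
PROMPTS = {"2+3": "5", "4*2": "8"}
CANDIDATES = ["5", "6", "8"]  # the only completions this toy policy can emit


def verify(prompt: str, completion: str) -> float:
    """Verifiable reward: 1.0 if the completion matches the known answer."""
    return 1.0 if completion == PROMPTS[prompt] else 0.0


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=200, samples_per_prompt=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    # The "weights": one logit per (prompt, candidate) pair.
    logits = {p: [0.0] * len(CANDIDATES) for p in PROMPTS}
    for _ in range(steps):
        for prompt in PROMPTS:
            probs = softmax(logits[prompt])
            # 1) The policy generates several completions for the prompt.
            idxs = rng.choices(range(len(CANDIDATES)), weights=probs,
                               k=samples_per_prompt)
            # 2) Each completion is scored by the verifier.
            rewards = [verify(prompt, CANDIDATES[i]) for i in idxs]
            baseline = sum(rewards) / len(rewards)
            # 3) REINFORCE update: grad of log pi(a) w.r.t. logits is
            #    onehot(a) - probs, weighted by (reward - baseline).
            for i, r in zip(idxs, rewards):
                advantage = r - baseline
                for j in range(len(CANDIDATES)):
                    grad = (1.0 if j == i else 0.0) - probs[j]
                    logits[prompt][j] += lr * advantage * grad / samples_per_prompt
    return {p: softmax(l) for p, l in logits.items()}


policy = train()
```

After training, `policy["2+3"]` puts most of its probability on "5". A real setup replaces the logit table with a language model and the exact-match check with a math or code verifier, but the three numbered steps stay the same.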
It's good to take time to look at these skills. This is a collection of evals, and we can look at where GPT-4o was. These were the hardest evals that existed and were called the frontier of AI. If we look at the o1 improvements and the o3 improvements in quick succession, these are really incredible eval gains that come mostly just from adding this new type of training. The core of this argument is that we need to do something similar if we want planning to work. I would say that a lot of the planning tasks look mostly like Humanity's Last Exam and AIME just after adding this reasoning skill, and we need to figure out what other types of things these models are going to be able to do.
So this list of reasoning abilities, these low-level skills, is going to continue to grow. The most recent one, if you look at recent DeepSeek models or recent Qwen models, is really tool use being added in, and that's going to build more models like o3. Using o3 just feels very different because it is this combination of tool use with reasoning, and it's obviously good at math and code. I think we're going to keep getting more of these low-level skills that we expect from reasoning training as we figure out what is useful. An abstraction for the agenticness on top of tool use is going to be very nice, but it's hard to measure. People mostly say that Claude is the best at that, but it's not yet well established how we measure it or communicate it across different models. And then this is where we get into the fun, interesting things.
I think this is hard for us because calibration is passed to the user. We have all sorts of things like model selectors if you're a ChatGPT user, Claude has reasoning on/off with extended thinking, Gemini has something similar, and there are reasoning-effort selectors in the API. This is really rough on the user side, and making it so the model knows this will make it easier to find the right model for the job, and tokens overspent for no reason will go down a lot. It's kind of obvious to want it, and it becomes a bigger problem the longer we don't have it.
Here are some examples from when overthinking was identified as a problem. On the left, you can ask a language model "what is 2 plus 3" and see these reasoning models use hundreds to a thousand tokens for something that could realistically be one token of output. On the right is a comparison of sequence lengths from a standard, non-RL-trained instruction model versus the QwQ thinking model. You really can gain 10 to 100x in token spend when you shift to a reasoning model. If you do that in a way that is wasteful, it's going to really load your infrastructure and cost. And as a user, I don't want to wait minutes for an easy question, and I don't want to have to switch models or providers to deal with that.
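To make the overthinking cost concrete, here is a small back-of-the-envelope sketch. The price of 10 USD per million output tokens and the volume of one million requests are assumptions chosen for round numbers, not figures from the talk; the 1-token and 1,000-token answers mirror the "2 plus 3" example.

```python
def output_cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of generated tokens at a flat per-token price."""
    return output_tokens * usd_per_million_tokens / 1_000_000


def overthinking_ratio(tokens_used: int, tokens_needed: int) -> float:
    """How many times more tokens were spent than the answer required."""
    return tokens_used / tokens_needed


# Assumed values (hypothetical price and request volume).
PRICE = 10.0          # USD per million output tokens
REQUESTS = 1_000_000  # easy questions such as "what is 2 plus 3"

direct_cost = output_cost_usd(REQUESTS * 1, PRICE)         # one-token answers
reasoning_cost = output_cost_usd(REQUESTS * 1_000, PRICE)  # 1,000-token answers
ratio = overthinking_ratio(1_000, 1)
```

Under these assumptions the same workload costs 10 USD with direct answers and 10,000 USD with uncalibrated reasoning, which is the kind of gap a model that matches effort to difficulty would close.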
So one of the things that comes once we start to have this calibration is this strategy idea. On the right, I went to, I think, Epoch AI's website, took one of their example questions from FrontierMath, and asked: does this new DeepSeek R1-0528 model do any semblance of planning when it starts? You ask it a math problem and it just says, "Okay, the first thing I'm going to do is construct a polynomial." It goes right in, and it doesn't do anything like trying to sketch the problem before it thinks. This is probably going to output 10,000 to 40,000 tokens, and if it's going to need another 10x on top of that, and that's all in the wrong direction, that's multiple dollars of spend and a lot of latency that's just totally useless. Most of these applications are set up to expect a latency between 1 and 30 minutes, so there is a timeout they are fighting. Either going in the wrong direction or thinking way too hard about a subproblem is going to make the user leave. Right now, as I said, these models do very little planning on their own, but as we look at these applications, they're very likely prompted to plan at the beginning, as in Deep Research and Claude Code, and we have to make that model-native rather than something that we do manually.
Once we look at this plan, there are all these implementation details across something like Deep Research or Codex, such as: how do I manage the memory? Claude Code compresses its memory when it fills up its context window. We don't know if that's the optimal way for every application. We want to avoid repeating the same mistakes. Greg was talking about playing Pokémon earlier, which is a great example of that. We want to have tractable parts. We want to offload thinking if we have a really challenging part.
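One way to picture the memory-management question is a compaction step that fires when the history exceeds a budget. This is only an illustrative sketch, not how Claude Code or any other product actually does it: tokens are approximated by whitespace-separated words, and `truncate_summary` is a hypothetical stand-in for a summary that a model would write.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token proxy: whitespace-separated words."""
    return len(text.split())


def truncate_summary(messages: List[str]) -> str:
    """Stand-in for a model-written summary: first three words of each message."""
    return "SUMMARY: " + " | ".join(" ".join(m.split()[:3]) for m in messages)


def compact(history: List[str], budget: int, keep_recent: int = 2,
            summarize: Callable[[List[str]], str] = truncate_summary) -> List[str]:
    """Replace older messages with one summary once the budget is exceeded."""
    total = sum(count_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    return [summarize(old)] + recent


history = [
    "tried route A and it failed badly",
    "tried route B and got stuck again",
    "now trying route C",
    "route C works so far",
]
compacted = compact(history, budget=15)
```

Here `compacted` keeps the two most recent messages verbatim and folds the earlier failures into a single summary line, so the record of what already went wrong survives in shortened form. Whether that is the right trade-off for a given application is exactly the open question raised above.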
I'll talk about parallel compute a little bit later as a way to boost through harder things. Really, we want language models to call multiple other models in parallel. Right now people are spinning up tmux and launching Claude Code in 10 windows to do this themselves, but there's no reason a language model can't do that. It just needs to know the right way to approach it.
I started with this idea that we need to make an effort to add new capabilities into language models. Think about the story of Q*, which became Strawberry, which became o1. The reason it was in the news for so long and was such a big deal is that it was a major effort for OpenAI, spending something like 12 to 18 months building the initial reasoning traces that they could then train an initial model on that has some of these behaviors. It took a lot of human data to get things like backtracking and verification to be reliable in their models. We need to go through a similar arc with planning. But with planning, the kinds of outputs that we're going to train on are much more intuitive than something like reasoning. If I were to ask you to sit down and write a 10,000-token reasoning trace with backtracking, you can't really do it, but a lot of experts can write a five- to ten-step plan that is very good, or check the work of Gemini or OpenAI when it's asked to write an initial plan. So I'm a lot more optimistic about being able to hill-climb on this. Then it goes through the same path: once you have initial data, you can do some SFT, and the hard question is whether RL on even bigger tasks can reinforce these planning styles. On the right, I added a hypothetical: we already have thinking tokens before answer tokens, and there's no reason we can't apply more structure to our models to really make them plan out their answer before they think.
To give a bit more depth on this idea of skill versus planning, if we go back to this example, I would say that o3 is extremely skilled at search. Being able to find a piece of niche information that researchers in a field know of, but can't quite remember the exact search words for, is an incredible skill. But when you try to put this into something like Deep Research, the lack of planning means that sometimes you get a masterpiece and sometimes you get a dud. As these models get better at planning, they'll be more thorough and reliable in getting the kind of coverage that you want. It's crazy that we have models that can do this search, but if you ask one to recommend some sort of electronics purchase, it's really hard to trust, because it doesn't know how to pull in the right information or how hard it should try to get all that coverage.
To summarize, these are the four things that I presented. You can obviously add more to these. You could call it a mix of strategy and abstraction, and you could call what I was describing context management in many ways. But really you just want to have things like this so that you can break down the training problem and think about data acquisition or new algorithmic methods for each of these tasks.
I mentioned parallel compute because I think it's an interesting one. If you use o1 pro, it has been one of the best and most robust models for quite some time, and I've been very excited for o3 pro. But it doesn't solve problems in the same way as traditional inference-time scaling, where inference-time scaling made a bunch of things that didn't work go from zero to one. Parallel compute really makes things more robust. It just makes them nicer.
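One common form of parallel compute is to sample several independent attempts and take a majority vote over the final answers. The sketch below assumes that form purely for illustration; how o1 pro or o3 pro actually aggregate parallel attempts is not public and is not described in the talk. `noisy_solver` is a hypothetical stand-in for a model call that is right most of the time.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the attempts."""
    return Counter(answers).most_common(1)[0][0]


def solve_in_parallel(solver: Callable[[str, int], str], question: str,
                      n: int) -> Tuple[str, List[str]]:
    """Run n independent attempts concurrently and aggregate by vote."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(lambda i: solver(question, i), range(n)))
    return majority_vote(answers), answers


def noisy_solver(question: str, attempt_id: int) -> str:
    """Hypothetical model stub: every fourth attempt returns a wrong answer."""
    return "6" if attempt_id % 4 == 3 else "5"


final_answer, all_answers = solve_in_parallel(noisy_solver, "what is 2 plus 3", n=8)
```

With eight attempts the stub produces six correct answers and two wrong ones, so the vote returns "5". This matches the point above: voting cannot rescue a question the solver never gets right, but it smooths over occasional failures.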
It seems like this kind of RL training is something that can encourage exploration, and then applying more compute in parallel feels like exploiting and getting a really well-crafted answer. There's a time when you want that, but it doesn't solve every problem.
To transition into the end of this talk: there have been a lot of talks today about the things you can do with RL, and there's obviously a lot of talk on the ground about what is called continual learning, and whether we're just going to continually use very long-horizon RL tasks to update a model and diminish the need for pre-training. There are a lot of data points showing that we're closer to that in many ways. I think continual learning has a big algorithmic bottleneck, whereas just scaling up RL further is very tractable and something that is happening.
So if people ask me what I'm working on at AI2 and what I'm thinking about, this is my rough summary of what I think a research plan looks like to train a reasoning model, without all the between-the-lines details. Step one is you get a lot of questions that have verified answers across a wide variety of domains. Most of these will be math and code, because that's what is out there. Step two: if you look at all these recipe papers, they have a step where they filter the questions based on difficulty with respect to your base model. If a question is solved zero out of 100 times by your base model, or 100 out of 100, you don't want questions like that, because you're not only wasting compute, you're also messing up the gradients in your RL updates and making them a bit noisier. Once you do that, you just want to make a stable RL run that will go through all these questions and have the numbers keep going up. That's the core of it: really stable infrastructure and data.
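The difficulty filter in step two can be sketched in a few lines. This assumes each question has a list of 0/1 outcomes from sampling the base model (four attempts here instead of 100, to keep the example short). The `group_advantages` helper assumes a group-mean baseline, which is one common choice and not something the talk specifies; it shows why an all-correct or all-wrong question contributes no learning signal under that choice.

```python
from typing import Dict, List


def pass_rate(results: List[int]) -> float:
    """Fraction of base-model attempts that were verified correct."""
    return sum(results) / len(results)


def filter_by_difficulty(attempts: Dict[str, List[int]],
                         low: float = 0.0, high: float = 1.0) -> List[str]:
    """Keep questions whose pass rate is strictly between low and high."""
    return [q for q, r in attempts.items() if low < pass_rate(r) < high]


def group_advantages(rewards: List[int]) -> List[float]:
    """Reward minus the group mean (assumed baseline, for illustration)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Hypothetical base-model results: 1 = verified correct, 0 = incorrect.
attempts = {
    "q_easy": [1, 1, 1, 1],    # always solved: nothing to learn
    "q_hard": [0, 0, 0, 0],    # never solved: no positive example to reinforce
    "q_useful": [1, 0, 0, 1],  # sometimes solved: informative
}
kept = filter_by_difficulty(attempts)
```

Only `q_useful` survives the filter. For the other two, `group_advantages` returns all zeros, so sampling them spends compute without moving the weights in this simplified picture.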
Then you can tap into all these research papers that tell you to use methods like overlong filtering, different clipping, or resetting the reference model. That'll give you a few percentage points on top, where really it's just data and stable infrastructure.
This leads to the provocation: what if we rename post-training as training? If for OpenAI's o1, post-training was something like 1% of compute relative to pre-training, they've already said that o3 increased it by 10x. So if the numbers started at 1%, you're very quickly getting to what you may see as parity in compute, in terms of GPU hours, between pre-training and post-training, which would have seemed pretty unfathomable if you took anybody back a year, before o1. One of the fun data points for this is the DeepSeek V3 paper, where you can watch DeepSeek's transition into becoming more serious about post-training. In the original DeepSeek V3 paper, they used 0.18% of compute on post-training in GPU hours, and they said their pre-training takes about two months. There was a deleted tweet from one of their RL researchers that said the R1 training took a few weeks. So if you make a few very strong, probably not completely accurate assumptions, such as that RL was on the same whole cluster, that would already be 10 to 20% of their compute. There are specific caveats for DeepSeek, like their pre-training efficiency probably being way better than their RL code, but scaling RL is a very real thing if you look at frontier labs and at the types of tasks that people want to solve with these long-term plans. So it's good to embrace what you think these models will be able to do: break down tasks on their own and solve some of them. Thanks for having me, and let me know what you think.
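The compute-share estimates in the closing section can be reproduced with simple arithmetic. The sketch below rests on stated assumptions: the GPU-hour breakdown is the one reported in the DeepSeek-V3 technical report (2,664K pre-training, 119K context extension, 5K post-training), "about two months" is taken as roughly 8.7 weeks, and "a few weeks" of RL is tried at 1 and 2 weeks on the same cluster, which the talk itself flags as a strong and probably inaccurate assumption.

```python
def share_of_total(part: float, rest: float) -> float:
    """Fraction of total compute taken by `part`, given the remaining compute."""
    return part / (part + rest)


# DeepSeek-V3 reported GPU hours, in thousands (from the technical report).
PRETRAIN_K, CONTEXT_EXT_K, POSTTRAIN_K = 2664, 119, 5
v3_post_share = share_of_total(POSTTRAIN_K, PRETRAIN_K + CONTEXT_EXT_K)

# Assumed durations on the same cluster (hypothetical values for illustration).
PRETRAIN_WEEKS = 8.7  # "about two months"
r1_share_low = share_of_total(1.0, PRETRAIN_WEEKS)   # RL took 1 week
r1_share_high = share_of_total(2.0, PRETRAIN_WEEKS)  # RL took 2 weeks

# The o1-to-o3 step: 1% of pre-training compute scaled by 10x.
o3_relative = 0.01 * 10
```

Under these assumptions `v3_post_share` is about 0.18%, the R1 estimates land at roughly 10% and 19% of total compute, and the o3 figure comes out at 10% of pre-training compute, consistent with the ranges quoted in the talk.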