A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai

Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: jQcsVk0KWiQ
Source: https://www.youtube.com/watch?v=jQcsVk0KWiQ
I really came to this trying to
reflect on six months into
this reinforcement learning with
verifiable rewards, post-o1 and
post-DeepSeek. I think that a lot of this
stuff is somewhat boring because
everybody has a reasoning model. We
all know the basics of you can scale RL
at training time and the numbers will go
up and that's deeply correlated with
being able to then do this inference
time scaling. But really, in AI right
now a lot of people
are up to speed. But the crucial
question is like where are things going
to go and how do you skate where the
puck is going? So, a lot of this talk is
really me trying to process
where this is going besides getting high
benchmark scores by using 10,000
tokens per answer, what we
need to do to actually train these
models, and what the things are that
OpenAI etc. are probably already doing, but
it's increasingly hard to get that
signal out of them. So, if we look at
this, reasoning is really also
unlocking really new language model
applications. This is the same
search query, which is something that I, as an RL
researcher, need to find all the
time. I forget that it's called
CoastRunners, and you Google
"overoptimization" 20 times to find it.
But I tried asking o3 and it
literally gave me the download link
directly. So I didn't even have to do
anything. And that's a very unusual use
case to just pop out of this reasoning
training, where math and code was the
real thing to start with. And o3 is
great. It's the model that I use the
most for finding information and this
just really is the signal that I have
that a lot of new interesting things are
coming down the pipe. I would say
it's starting to unlock a lot of new
language model applications, and I use
some of these. So this is a screenshot
of deep research. It's great. You can
use it in really creative ways like uh
prompt it to look at your website and
find typos or look at only the material
on your website and things like this.
It's actually more steerable than
you may expect. Then there's Claude Code, which I
describe as just: the vibes are very
good. It's fun. I'm not a serious
software engineer, so I don't use it on
hard things, but I use it for fun things
because I can put the company API
key in and just kind of mess around, like
having it help me build the website for this
book that I wrote online. And then
there's the really serious things, which
are like Codex and these fully
autonomous agents that are starting to
come. If you play with it, it's obvious
that the form factor is going to be able
to work. I'm sure there are people that
are getting a lot of value out of it
right now. For ML tasks,
there's no GPUs in it right now.
And if you are dealing with open models,
they only just added internet access, so
it wasn't going to be able to go
back and forth and look at Hugging
Face configs or something and all these
headaches that you don't want to deal
with. But in six months, all of
these things are going to be stuff you
should be using on a day-to-day basis.
And this is all downstream of this kind of
step change in performance from
reasoning models.
And then this is another
plot that's been talked about. When I
look at this, through 2024, if
we look at GPT-4o, things
really were saturating, and then
there's this new set of models starting with o1, which
really helped push out the frontier in
time horizon. The y-axis is
how long a task the models can roughly
complete, measured in time, which is
kind of a weird way to measure it
because things will get faster, but
it's going to keep going. And
reasoning is the technique that
was kind of unlocked in order to figure
out how to push the limits. When you
look at things like this, it's not
just that we're on a path determined
by AI and more gains are going to come.
It's really that we have to think about
what the models need to be able to do in
order to keep pushing out these
frontiers. So there's a lot of human
effort that goes into continuing the
trends of AI progress.
Gains aren't free. And I'm thinking that
a lot of planning and kind of thinking
about training in a bit of a different
way beyond just reasoning skills is
going to be what helps push this and
enable these uh language modeling
applications and products that are kind
of in the early stages to really shine.
So this is a core question that I'm
thinking about: what do I have to
do to come up with a research plan to
train reasoning models that can work
autonomously and really have
meaningful ideas for what planning would
be? So I kind of came up with a taxonomy
that has a few different what I call
traits within it. The first one is
skills, which we've pretty much already
done. Skills are like getting really
good at math and code. Inference-time
scaling was useful for getting there, but
they kind of become more researchy over
time. I think for products, calibration
is going to be crucial, which is like
these models overthink like crazy. So,
they need to be able to kind of have
some calibration to how many output
tokens are used relative to the
difficulty of the problem. And this will
kind of become more important when we're
spending more on each task that we're
planning. And then the last two are
subsets of planning that I'm thinking
about, and I'm happy to take feedback on this
taxonomy. The first is strategy, which is
just going in the right direction and
knowing different things that you can
try, because it's really hard for these
language models to really change course:
they can backtrack a little bit,
but restarting their plan is hard. And
then as tasks become very hard, we need
to do abstraction which is like the
model has to choose on its own how to
break down a problem into different
things that it can do on its own. I
think right now humans would often do
this but if we want language models to
do very hard things they have to make a
plan that has subtasks that are actually
tractable or calls in a bigger model to
do that for it. But these are things
that the models aren't going to do
natively. Natively, they're
doing things like math problem solving,
which doesn't have a clear abstraction
like "this task I can do, and this one with this
additional tool" and all these things. So
this is a new thing that we're
going to have to add. So to kind of
summarize: we have skills, we
have research on calibration (I'll
highlight some of it), but planning
is a new frontier where people are
talking about it and we really need to
think about how we will actually
put this into the models.
So to just put this up on the slide,
what we call reinforcement learning with
verifiable rewards looks very simple. I
think a lot of RL on language models,
especially before you get into this
multi-turn setting, has been: you take
prompts, the agent creates a completion
to the prompt, then you score the
completions, and with those scored
completions you can update the weights
of the model. It's been single turn.
It's been very simple. I'll have to
update this diagram for multi-turn and
tools, and that makes it a little bit more
complex. But the core of it is just a
language model generates completions and
gets feedback on it.
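The single-turn loop described here (take a prompt, sample completions, score them against a known answer, update the policy) can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the "policy" is a softmax over a fixed list of candidate answers rather than a language model, the update is plain REINFORCE with a mean-reward baseline, and the names `verify`, `rlvr_step`, and `train` are invented for this sketch. It is not the training code used by AI2 or any lab.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verify(completion, reference):
    """Verifiable reward: 1.0 if the completion matches the known answer."""
    return 1.0 if completion == reference else 0.0


def rlvr_step(logits, candidates, reference, rng, num_samples=8, lr=0.5):
    """One single-turn update: sample completions, score them, update the policy."""
    probs = softmax(logits)
    samples = rng.choices(range(len(candidates)), weights=probs, k=num_samples)
    rewards = [verify(candidates[i], reference) for i in samples]
    baseline = sum(rewards) / len(rewards)  # mean reward as a simple baseline
    new_logits = list(logits)
    for i, reward in zip(samples, rewards):
        advantage = reward - baseline
        # REINFORCE: gradient of log softmax w.r.t. logits is one_hot(i) - probs
        for j in range(len(new_logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            new_logits[j] += lr * advantage * grad / num_samples
    return new_logits


def train(candidates, reference, steps=200, seed=0):
    """Run repeated updates on one prompt and return the final answer distribution."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        logits = rlvr_step(logits, candidates, reference, rng)
    return softmax(logits)
```

Calling `train(["4", "5", "6"], "5")` shifts probability toward the verified answer. Note that a batch where every sample gets the same reward produces zero advantage and therefore no update.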
And it's good to just take time to look
at these skills. These are a collection
of evals, and we can look at where
GPT-4o was. These were the hardest
evals that have existed and were
called the frontier of AI. And if
we look at the o1 improvements and the
o3 improvements in quick
succession, these are really incredible
eval gains that are mostly just from
adding this new type of training in. And
the core of this argument is that we
need to do something similar if we want
planning to work. So I would say that a
lot of the planning tasks look mostly
like Humanity's Last Exam and AIME
just after adding this reasoning skill,
and we need to figure out what other
types of things these models are going
to be able to do.
So this list of reasoning
abilities, these kind of
low-level skills, is going to continue to
go up. I think the most recent one, if
you look at recent DeepSeek models or
recent Qwen models, is really this tool
use being added in, and I think that's going to
build more models like o3. Using o3
just feels very different because it is
this kind of combination of tool use
with reasoning, and it's obviously good
at math and code. But I think these kind
of low-level skills that we expect from
reasoning training, we're going to
keep getting more of them as we figure
out what is useful. I think an
abstraction for the kind of agenticness
on top of tool use is going to be very
nice, but it's hard to measure.
People mostly say that Claude is the
best at that, but it's not yet
established how we measure it or
communicate it across different models.
And then this is where we get into the
fun, interesting things. I think it's
hard for us because calibration is
passed to the user: we have all
sorts of things like model selectors if
you're a ChatGPT user, Claude has
reasoning on/off with this extended
thinking, Gemini has something
similar, and there's these reasoning
effort selectors in the API. This is
really rough on the user side of things,
and making it so the model knows this
will just make it easier
to find the right model for the job, and
tokens overspent
for no reason will go down
a lot. It's kind of obvious to want it,
and it becomes a bigger
problem the longer we don't have it.
Here are some examples from when overthinking was
kind of identified as a problem. On
the left half of this, you can
ask a language model "what is 2 plus
3" and see these reasoning models
use hundreds to a thousand tokens for
something that could realistically be
one token as an output. And on
the right is a comparison of
sequence lengths from a standard
non-RL-trained instruction model versus
the QwQ thinking model. You really
can gain this 10 to 100x in token
spend when you shift to a reasoning
model. And if you do that in a way that
is wasteful, it's just going to really
load your infrastructure and cost. And
as a user, I don't want to wait minutes
for an easy question and I don't want to
have to switch models or providers to
deal with that.
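To make the 10 to 100x token-spend point concrete, here is a small Python sketch that compares per-prompt output lengths between a baseline model and a reasoning model. The function names, the token counts in the usage note, and the price are all hypothetical values chosen for illustration; they are not measurements from the slide.

```python
def token_overhead(baseline_tokens, reasoning_tokens):
    """Per-prompt ratio of reasoning-model output tokens to baseline output tokens."""
    if len(baseline_tokens) != len(reasoning_tokens):
        raise ValueError("need one token count per prompt for each model")
    return [r / b for b, r in zip(baseline_tokens, reasoning_tokens)]


def extra_cost(baseline_tokens, reasoning_tokens, price_per_million_tokens):
    """Additional spend caused by the longer outputs, at an assumed flat price."""
    extra = sum(reasoning_tokens) - sum(baseline_tokens)
    return extra * price_per_million_tokens / 1_000_000
```

With made-up counts of `[5, 200]` tokens for the baseline and `[500, 4000]` for the reasoning model, `token_overhead` gives ratios of 100 and 20. The easy prompt is where the overhead is worst, which is the calibration problem in miniature.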
So I think one of the things that comes once
we start to have this calibration
is this kind of strategy idea. On the
right, I went to, I think it's the
Epoch AI website, took one of their
example questions from FrontierMath, and
asked: does this new DeepSeek R1-0528
model do any semblance of
planning when it starts? You ask it a
math problem and it just goes, "Okay, the first
thing I'm going to do is, I need to
construct a polynomial." It just
goes right in and it doesn't do anything
like trying to sketch the problem before
it thinks. This is going to probably
output 10,000 to 40,000 tokens, and if it's
going to need to do another 10x on top of that,
then if that's all in the wrong
direction, that's multiple dollars of
spend and a lot of latency that's just
totally useless. Most of these
applications are set up to expect a
latency between 1 and 30 minutes, so
there is just a timeout
they are fighting. Either going in
the wrong direction or just thinking way
too hard about a subproblem is
going to make it so the user leaves.
Right now, as I said, these models do
very little planning on their own, but as
we look at these applications, they're
very likely prompted to plan, which is
like the beginning of deep research and
Claude Code, and we kind of have to make
it so that is model native rather than
something that we do manually.
And then once we look at this plan,
there's all these implementation details
across something like deep research or
Codex, such as how do I manage the
memory? Claude Code compresses
its memory when it fills up its context
window. We don't know if that's the
optimal way for every application. We
want to avoid repeating the same
mistakes. Greg was talking
about playing Pokémon earlier, which
is a great example of that. We want to
have tractable parts. We want to offload
thinking if we have a really challenging
part. So I'll talk about parallel
compute a little bit later as a way to
kind of boost through harder things. And
really we want language models to call
multiple other models in parallel.
Right now people are spinning up tmux
and launching Claude Code in 10 windows
to do this themselves, but there's no
reason a language model can't
do that. It just needs to know the right
way to approach it.
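The "10 windows" pattern amounts to fanning independent subtasks out to workers and collecting the results. A minimal sketch using Python's standard `concurrent.futures` is below. `call_model` is a hypothetical placeholder for whatever request would reach a worker model or agent; no real model API is assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def call_model(subtask):
    """Hypothetical stand-in for a request to a worker model or agent."""
    return f"result for: {subtask}"


def run_in_parallel(subtasks, worker=call_model, max_workers=10):
    """Fan subtasks out to workers concurrently; results keep the input order."""
    if not subtasks:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, subtasks))
```

The hard part the talk points at is not this plumbing but having the orchestrating model decide on its own which subtasks are worth splitting off.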
And
I started with this idea that
we need to
make an effort to add new capabilities
into language models. When I
think about this story of Q*
that became Strawberry that became o1,
the reason that it was in the news for
so long and was such a big deal is that
it was a major effort for OpenAI,
spending like 12 to 18 months building
these initial reasoning traces that they
could then train an initial model on
that has some of these behaviors. So it
took a lot of human data to get things
like backtracking and verification to be
reliable in their models. And we need to
go through a similar arc with planning.
But with planning, the kind of outputs
that we're going to train on are
much more intuitive than something like
reasoning. I think if I were to ask you
to sit down and write a 10,000-token
reasoning trace with backtracking,
you can't really do this, but a lot
of expert people can write a five- to
10-step plan that is very good, or check
the work of Gemini or OpenAI when asked
to write an initial plan. So I'm a
lot more optimistic on being able to
hill climb on this. And then it goes
through the same path where once you
have initial data, you can do some SFT.
And then the hard question is whether RL
on even bigger tasks can reinforce
these planning styles.
On the right, I added kind of a
hypothetical, which is like we already
have thinking tokens before answer
tokens, and there's no reason we can't
apply more structure to our models to
just really make them plan out their
answer before they think.
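One way to picture this hypothetical is an output format with an explicit plan section ahead of the thinking and answer sections. The sketch below parses such a completion and rejects outputs where the sections are missing or out of order. The `<plan>`, `<think>`, and `<answer>` tag names are assumptions made for this example; the talk does not specify a format, and no existing model is claimed to use this one.

```python
import re

# Assumed section order for the hypothetical format: plan first, then think, then answer.
SECTION_ORDER = ("plan", "think", "answer")


def parse_structured_output(text):
    """Split a completion into plan / think / answer sections, enforcing their order."""
    sections = {}
    position = 0
    for name in SECTION_ORDER:
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
        match = pattern.search(text, position)
        if match is None:
            raise ValueError(f"missing or out-of-order <{name}> section")
        sections[name] = match.group(1).strip()
        position = match.end()
    return sections
```

A check like this could serve as a format reward or as a filter on SFT data, though whether that actually teaches useful planning is exactly the open question raised here.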
So, to give a bit more depth on this
idea of skill versus planning, if we go
back to this example, I would say that
o3 is extremely skilled at search.
Being able to find a piece of niche
information that researchers in a field
know of but can't quite remember the
exact search words for is an incredible
skill. But when you try to put this into
something like deep research,
this lack of planning makes it so
that sometimes you get a masterpiece and
sometimes you get a dud. As these
models get better at planning, they'll
just be more thorough and reliable in
getting the kind of coverage that you
want. It's crazy that
we have models that can do this search,
but if you ask one to recommend some
sort of electronics purchase or
something, it's really hard to trust
because it doesn't know how to pull
in the right information or how hard it
should try to do all that coverage.
So, to kind of summarize, these are the
four things that I presented. I think
you can obviously add more to these. You
could call a mix of strategy and
abstraction, like what I was describing,
context management in many ways.
But really you just want to have things
like this so that you can break down the
training problem and think about data
acquisition or new algorithmic methods
for each of these tasks.
And I mentioned parallel compute because
I think this is an interesting one.
If you use o1 pro, it's
been one of the best models and the most
robust models for quite some time, and
I've been very excited for o3 pro. But it
doesn't solve problems in the same way
as traditional inference-time
scaling, where inference-time scaling
just made a bunch of things that didn't
work go from zero to one. This
parallel compute really just makes
things more robust. It makes them
nicer. It seems like this kind of RL
training is something that can encourage
exploration, and then if you apply more
compute in parallel, it feels like something
kind of exploiting and getting a really
well-crafted answer. So there's a time
when you want that, but it doesn't solve
every problem.
And to kind of transition into the end
of this talk: there's been a
lot of talks today about the things
that you can do with RL, and there's
obviously a lot of talk on the ground about
what is called continual learning, and
whether we're just going to continually use very
long-horizon RL tasks to update a model
and diminish the need for pre-training.
There are a lot of data points showing that
we're closer to that in many ways. I
think continual learning has a big
algorithmic bottleneck, but just
scaling up RL further is very
tractable and something that is
happening. So if people were to ask me
what I'm working on at AI2 and what I'm
thinking about, this is my rough
summary of what I think a research
plan looks like to train a reasoning
model, without all the in-between-the-lines
details. Step one is you
just get a lot of questions that have
verified answers across a wide variety
of domains. Most of these will be
math and code because that's
what is out there. And then two, if you
look at all these recipe papers, they
have a step where they filter the
questions based on their difficulty
with respect to your base model. If a
question is solved zero out of 100 times
by your base model, or 100 out of 100,
you don't want questions that look like
that, because you're not only
wasting compute, you're also messing up
the gradients in your RL updates and making
them a bit noisier. Once you do
that, you just want to make a stable RL
run that'll go through all these
questions and have the numbers keep
going up. The core of it is
really stable infrastructure and data.
And then you can tap into all these
research papers that tell you to do
methods like overlong filtering or
different clipping or resetting the
reference model. That'll give you a
few percentage points on top, whereas
really it's just data and stable
infrastructure.
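The difficulty filter in step two is simple to sketch. Assuming you have already sampled the base model several times per question and recorded whether each attempt was verified correct, you keep only the questions it solves sometimes but not always. The function names and the dictionary layout are assumptions for this illustration, not taken from any particular recipe paper.

```python
def pass_rate(outcomes):
    """Fraction of sampled attempts that were verified correct."""
    return sum(outcomes) / len(outcomes)


def filter_by_difficulty(results, low=0.0, high=1.0):
    """Keep questions whose base-model pass rate is strictly between low and high.

    `results` maps each question to a list of booleans, one per sampled attempt.
    """
    kept = []
    for question, outcomes in results.items():
        if not outcomes:
            continue  # no samples, so difficulty is unknown; skip it
        if low < pass_rate(outcomes) < high:
            kept.append(question)
    return kept
```

One reason the 0-of-100 and 100-of-100 questions are unhelpful: in methods that compare each sample's reward to the average of its group, a group with identical rewards carries no learning signal, so the compute spent sampling it is wasted. Tighter bounds such as `low=0.1, high=0.9` are a judgment call.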
And this kind of leads to the
provocation, which is: what if we
rename post-training as training? If
for OpenAI's o1 something like 1% of compute was
post-training relative to pre-training,
they've already said that o3
increased it by 10x. So if the
numbers started at 1%, you're very
quickly getting to what you may see
as parity in compute, in terms of GPU
hours, between pre-training and
post-training, which, if you were to take
anybody back a year ago, before o1, would
seem pretty unfathomable.
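The arithmetic behind this provocation is easy to check. The sketch below converts a post-training to pre-training compute ratio into a share of total compute. The 1% starting point is the talk's own hypothetical ("if the numbers started at 1%"), not a reported figure, and the function name is invented for this example.

```python
def post_training_share(post_to_pre_ratio):
    """Fraction of total training compute spent on post-training.

    `post_to_pre_ratio` is post-training compute divided by pre-training compute.
    """
    return post_to_pre_ratio / (1.0 + post_to_pre_ratio)


# Starting from the hypothetical 1% ratio and scaling by 10x twice.
for ratio in (0.01, 0.10, 1.00):
    print(f"post/pre = {ratio:.2f} -> {post_training_share(ratio):.1%} of total compute")
```

Under that assumption, one 10x step takes post-training from roughly 1% to roughly 9% of total compute, and a second 10x step reaches parity at 50%.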
And one of the fun data points for this
is the DeepSeek V3 paper, where you
kind of watch DeepSeek's transition
into becoming more serious about
post-training. In the original DeepSeek
V3 paper, they use 0.18% of
compute on post-training in GPU hours,
and they said their pre-training takes
about two months. And there was a deleted
tweet from one of their RL researchers
that said the R1 training took a few
weeks. So if you make a few very strong,
probably not completely accurate
assumptions, such as that RL was on the same
whole cluster, that would already be 10
to 20% of their compute. There are
specific things for DeepSeek, like,
"Oh, their pre-training efficiency is
probably way better than their RL code,"
and things like this, but scaling RL is
a very real thing if you look at
frontier labs and
you look at the types of tasks that
people want to solve with these
long-term plans. So it's good to kind
of embrace what you think these models
will be able to do: break
down tasks on their own and solve some
of them. So, thanks for having me and
let me know what you think.