RL for Autonomous Coding — Aakanksha Chowdhery, Reflection.ai
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: QluDzKVfp6A
Source: https://www.youtube.com/watch?v=QluDzKVfp6A
Hi everyone, I'm Aakanksha Chowdhery. I was at Google for more than six years, where I led the research for PaLM and was a lead researcher on Gemini. These days I'm working on pushing the frontier of autonomous coding with reinforcement learning. Let me recap the arc of how we have progressed in large language models, and why autonomous coding, and why now. In 2020 there was a breakthrough paper on scaling laws for large language models. The 30-second recap is that there is a power-law relationship between the test loss of a large language model and the resources you put in: if you use more compute, more data, and more parameters in your transformer model, you get a more performant model. And it is not performant only in the domain you trained on; it generalizes to many other domains, and that generalization was very much a feature here. As large language models got bigger, we saw continuous improvement across benchmarks, to the point that they are starting to get saturated now. The other interesting finding was emergent behavior: capabilities appeared in large language models that were not present in smaller models. This is a classic slide I show for the work we did on PaLM. Typically, when you go about solving math problems, you give the model some examples; on the left there is a math problem about tennis balls, and when you give the model a second problem, its output looks wrong.
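As an aside, the power-law relationship from the 2020 scaling-laws paper can be sketched in a few lines. This is a minimal illustration using the published parameter-count fit from Kaplan et al. (2020); the constants are approximate fitted values, not something from this talk.

```python
# Sketch of the Kaplan et al. (2020) parameter scaling law:
#   L(N) = (N_c / N) ** alpha_N
# where N is the (non-embedding) parameter count and L is test loss in
# nats/token. ALPHA_N and N_C are the published fits; treat as approximate.

ALPHA_N = 0.076   # fitted exponent for the parameter-count law
N_C = 8.8e13      # fitted constant (parameters)

def predicted_loss(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# Loss falls smoothly as parameter count grows, 1B up to PaLM-scale 540B.
for n in (1e9, 1e10, 1e11, 5.4e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The power law predicts smooth, monotone improvement with scale; it does not by itself predict the emergent capabilities discussed next.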
But what PaLM and the subsequent papers showed was that if you ask the model to output its reasoning chain, which has become a very common concept now but this was 2021, four years ago, then the answer actually comes out correct. By getting the model to output its chain of thought, or reasoning chain, the model's performance improves, and this capability emerged specifically in large language models. LaMDA and PaLM were the state-of-the-art models about three years ago, and on the x-axis I'm showing increasing numbers of parameters; PaLM was scaled all the way up to 540 billion parameters. No one actually publishes parameter counts these days, so you have to live with graphs from three years ago or the open-source releases such as the DeepSeek and Qwen models. The y-axis shows that the solve rate on middle-school math word problems increased with the number of parameters, and it increased mainly when you prompted the models to show a chain of thought. This led to all kinds of prompting techniques: you ask the model to think step by step, you even bribe the model, you ask it nicely or not. So this was all kinds of fun stuff. The thing that really stood out from that generation of models was that this capability was not limited to math problems; it generalized across a whole range of domains, from question answering in other languages to puzzle problems to multitask natural-language-understanding problems. And this led to the next step: now that these models could reason, we could get them to follow instructions, so the first applications that became possible were chatbots.
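The prompting style being described can be sketched concretely. Below is a minimal few-shot chain-of-thought prompt in the style of the original work, using the tennis-ball problem mentioned above as the worked example; `build_prompt` is an illustrative helper, not any specific system's API.

```python
# Sketch of few-shot chain-of-thought prompting: the prompt includes one
# worked example whose answer spells out intermediate reasoning steps, and
# the model is expected to imitate that format on the new question.

COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

def build_prompt(question: str) -> str:
    # The demonstration induces the model to emit its own reasoning chain
    # before the final answer, which is what improves the solve rate.
    return COT_PROMPT.format(question=question)

print(build_prompt(
    "The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?"
))
```

The key point is that only the prompt changes; the same model that answered wrongly with a bare question-answer format answers correctly when shown how to reason out loud.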
Everyone remembers that ChatGPT, and now Gemini and various other chatbots, became extremely popular; all of us use them all the time. What made them possible was that when you give the model instructions to go do something, it is actually able to do it, and the way it learns that is reinforcement learning. The reinforcement learning data we give the model in this case is based on human feedback: here is a question and two candidate answers, and which one would a human prefer? If you have enough of this data and train your model on it, you end up with better performance, because you taught the model which responses to prefer. And this doesn't only work in chatbot applications; it also works for code. On the bottom right I'm showing that if you do this for coding applications, you start to see performance improvements. Now, last year there was a whole debate about whether we are hitting a wall in the performance of large language models, whether pre-training is no longer giving gains. So what is next? One key thing to remember is that pre-training these models costs a lot of money, potentially tens of millions of dollars, while inference on them is extremely cheap. These numbers are not endorsed by any of the companies I've worked at; they come from public sources. The main point I want to make is that training is extremely costly.
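The preference data described above is typically turned into a training signal with a pairwise loss: a reward model scores the preferred and dispreferred answers, and the loss pushes the preferred score higher. This is a minimal numeric sketch; the scalar scores are placeholders for what a real reward model would produce.

```python
import math

# Sketch of the pairwise (Bradley-Terry style) preference loss commonly used
# to train reward models for RLHF: given scalar scores for the human-preferred
# ("chosen") and dispreferred ("rejected") responses, minimize
#   -log sigmoid(r_chosen - r_rejected).

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks when the model scores the human-preferred answer higher,
# and grows when the preference is violated.
print(preference_loss(2.0, 0.5))   # preference respected: small loss
print(preference_loss(0.5, 2.0))   # preference violated: large loss
```

Training on many such pairs is what teaches the model which responses humans prefer; the policy is then optimized against the learned reward.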
So if constantly scaling up the model size is not giving performance gains, can we get gains at inference time instead, since inference calls are so cheap? A key idea here was to have the model generate multiple responses and then take a majority vote. In the example above, you give a math problem to the large language model, ask it to generate three answers independently, and then vote over those answers; if two answers match, that's the majority vote. It's like asking a question in this room: if all of you say yes, that is a majority vote. If you can get the model to generate many, many samples and get many of those answers to agree, this notion of majority voting, or self-consistency, showed clear gains. So scaling compute at inference time was clearly one avenue to push on. Another avenue that emerged and showed substantial value was sequentially revising your previous response. As humans, we often write a first answer, then evaluate it, notice a mistake, and go fix it. Can we get LLMs to do the same kind of revision, looking at their previous attempts? That was the second avenue: longer chains of thought, with the model improving consistently at inference time. These techniques worked especially well where you can verify the correct answer, as in math, or in programming where you have unit tests; there they showed very clear gains.
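The majority-voting idea is simple enough to sketch directly. Here `answers` stands in for the final answers extracted from several independent model samples; the voting step itself is just a frequency count.

```python
from collections import Counter

# Sketch of self-consistency / majority voting: sample several reasoning
# chains independently, extract each chain's final answer, and return the
# answer that occurs most often.

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer among the sampled generations."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Three independent samples for the same math problem; two of three agree,
# so "11" wins the vote even though one sample was wrong.
samples = ["11", "14", "11"]
print(majority_vote(samples))
```

The intuition is that independent reasoning paths that arrive at the same answer are more likely to be correct than any single sampled path.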
What I'm showing here is an example from one of my colleagues' work at Stanford, a published paper. On the y-axis we have pass@k, or coverage, and on the x-axis the number of samples. As you draw more samples, accuracy improves with an open-source DeepSeek model: just by taking more samples, you get a very high score on SWE-bench Verified, even compared to the state of the art at the end of 2024. Of course, all of these scores have since been pushed up, and we are already roughly around 80%. The takeaway is that these lines of work showed that inference-time compute predictably gives us gains, especially in domains where we can verify. If we know how to verify the answers, then we know how to translate that into intelligence. And going back to my talk title, coding is one of those domains where we do have the capability to verify, and that gives us a tremendous advantage in building superintelligence on top of autonomous coding. Of course, now you ask: what does automated verification mean here? For inference-time scaling to work, you need some way to say that an output is correct. Math gives a very simple example: if the input is a mathematical equation to solve, you can do the same calculation on a calculator and verify that the solution is correct. Similarly, math has formal proofs, so you can verify that things are correct.
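The pass@k metric on that plot has a standard unbiased estimator (introduced in the Codex paper by Chen et al., 2021): draw n samples, count the c that pass verification, and compute the probability that at least one of k randomly chosen samples is correct. A minimal sketch:

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., 2021): with n samples of which c
# pass the tests, the chance that a random subset of k samples contains at
# least one passing sample is 1 - C(n-c, k) / C(n, k).

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k failures exist, so any k-subset must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Coverage climbs steeply with k even at a fixed 5% per-sample success rate,
# which is exactly the "more samples -> higher coverage" curve on the plot.
for k in (1, 10, 100):
    print(f"pass@{k} with n=200, c=10: {pass_at_k(200, 10, k):.3f}")
```

Note that pass@k measures whether a correct solution exists among the samples; picking it out still requires a verifier, which is the point made next.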
In coding you have unit tests and compilers: you can actually execute the generated code and use the compiler or interpreter as a verifier. In domains where you don't have this kind of verification, there is a large gap: if you generate a lot of solutions and then do majority voting, you don't get nearly as much gain. So roughly, inference-time scaling works in scenarios with automated verification. But that doesn't quite get you to real-world impact, and the reason is shown in this graph. Across multiple different models on GSM8K, which is grade-school math word problems, and another math benchmark, if you sort problems by the fraction of correct generations, you have to sample a lot, because correct generations can be very rare. Who has time to sample 10,000 times to find one correct solution? You would be sitting there waiting, unless you can actually figure out which generation is the correct one. So scaling inference-time compute with just majority voting or longer reasoning chains is great in the sense that there are correct solutions somewhere in there, but it doesn't work well across the board. So what will get these models to learn to generate correctly during training? Well, in the chatbot scenario we saw that RL from human feedback worked. Can we apply the same principle here and get the model to generate correctly in domains where we can automatically verify the outputs? Our belief at Reflection is that the next frontier for scaling is reinforcement learning, and we already have proof points from some of the frontier labs. As David Silver and Richard Sutton published recently, they agree, or rather, they are the pioneers of reinforcement learning.
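Using execution as a verifier, as described above, can be sketched as a binary reward function: run the candidate solution together with its unit tests in a subprocess and reward it only if everything passes. This is an illustrative sketch, not any production system; the file handling and timeout are assumptions.

```python
import os
import subprocess
import sys
import tempfile

# Sketch of unit-test-based verification for generated code: concatenate the
# candidate solution with its tests, run the result in a fresh interpreter,
# and emit a binary reward based on the exit code.

def verify(candidate_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Return 1.0 if all tests pass, 0.0 otherwise."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # a hung or too-slow candidate counts as a failure
    finally:
        os.unlink(path)

solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verify(solution, tests))
```

A real setup would also sandbox execution and capture richer feedback (which test failed, stderr), but the essential property is the same: the reward is computed automatically rather than by a learned judge.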
They say we are entering the era of experience. Starting from AlphaGo and AlphaZero, there was an era of simulation; the large-language-model era that followed scaled up RL using human data; but the next era, starting this year, is really the era of experience, and that is what will lead us to superintelligence. So reinforcement learning will be a fundamental component in building superintelligent systems, especially in areas where we have automated verification. One proof point for why this makes sense comes from math. These results are from o1, but across several papers we have seen that if you give the model more test-time compute (test-time compute is the same thing as inference-time scaling) on the x-axis, accuracy goes up on the y-axis. And if you repeat this process with reinforcement learning, then scaling training-time compute on the x-axis also improves accuracy on the y-axis on a challenging math benchmark called AIME. Most of these benchmarks saturate within a year, as you have probably learned by now, and this one is already saturated. So now that I've hopefully convinced you that scaling reinforcement learning is the next frontier, you might ask: why isn't everyone doing it? What's so challenging about it? Having built large language models before, I can tell you that a big part of building these systems is that the machine learning plus systems stack is itself very challenging. Here is an example of why scaling up reinforcement learning is hard: if you are doing reinforcement learning with PPO, which is one of the algorithms used for RL with human feedback, you have to keep four copies of different models.
Imagine a really large model; now you have to keep four copies and arrange them on the GPUs of your large cluster. You can have some fun figuring out the exact layout, and it is a fun and interesting problem, but it's hard in the sense that building a system that keeps those machines maximally utilized, with everything arranged the right way, is extremely difficult. DeepSeek showed with DeepSeekMath that GRPO gets rid of the value model, leaving only three copies of the model, but that is still very challenging. So scaling up RL is even harder than scaling up LLMs, because you have multiple copies of the model and both a training loop and an inference loop. And on the reinforcement learning side, you also suffer a lot from reward hacking if the thing deciding what counts as a correct answer is a neural reward model. But as we discussed, in autonomous coding applications you have the ability to verify your output, which roughly means you can decide whether an answer is correct; that is how SWE-bench Verified scores work today. You have execution feedback, you have unit tests, and this is an ongoing list. All of these possibilities mean you can design better reward functions. So autonomous coding is a great domain for scaling up RL. Then the question becomes: how does this have real-world impact? In software engineering applications, generating code is only one part of the system; if you look at end-to-end software engineering workflows, there are many more parts. How do you scale your system to generalize across all of them? That's the problem we are trying to solve at Reflection.
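The reason GRPO can drop the value model is its group-relative advantage: instead of a learned critic estimating a baseline, it samples a group of responses per prompt and normalizes each response's reward against the group's mean and standard deviation. A minimal numeric sketch (the rewards are placeholder outputs from an automated verifier):

```python
import statistics

# Sketch of the group-relative advantage used by GRPO (DeepSeekMath):
# for one prompt, sample a group of G responses, score each with the reward
# function, then standardize within the group:
#   A_i = (r_i - mean(r)) / (std(r) + eps)
# The group statistics replace the learned value model PPO would need,
# which is why GRPO keeps three model copies instead of four.

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt; the verifier passed two of them.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))
```

Responses that beat their group's average get a positive advantage and are reinforced; below-average responses are pushed down, all without a critic network.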
Our mission is to build superintelligence, and we are starting with autonomous coding as the root-node problem for that mission. We have a team of about 35 people who have pioneered various legendary works in LLMs and reinforcement learning. If you're excited about this mission, you can reach out to one of us; my email is my last name at reflection.ai, and we would love to work with you. With that, I can take questions. [Applause]

All right, same protocol as last time: if you have a question, please come up to one of the three microphones we have distributed throughout. We can probably take one or two questions, so if you want to ask something, feel free. I'll do the first one while people are coming up. It seems like the foundation-model labs have tried to build one model and deploy it across everything. From the work you're doing right now, do you think that's the right approach, or do you expect more specialization across different languages, or even individual codebases? Or is the best approach just one model trained across the greatest diversity of tasks possible?

I'll answer it this way: building coding agents does require multiple capabilities, and getting there will definitely need multiple LLM calls. Whether that's one model or multiple models is the secret sauce right now for most people.

Fair enough. All right, please.

Hi. I'm wondering about the slide with the chart of the era of simulation and the era of experience. It included AlphaGo, and the earlier systems that played StarCraft. They all used MCTS, which, maybe this is my unfamiliarity, but that is also simulation. So we're using synthetic data in the era of experience as well.
So why is that called simulation, and why is what we're doing right now not called simulation? What is the overlap between simulation and experience? How do you think about that?

I could ask Dave that question, you know. But I think the better way to answer it is roughly what Greg covered in the last talk: in gaming, you can envision what scenarios might happen next, and you use that to do rollouts and build your reinforcement learning on top of them. In most real-world scenarios, you have an imperfect rollout; you don't have full knowledge of how the system might work. Simulation is possible in certain domains where you can build a world model, which is closer to robotics and all the work happening in the physical-AI space. But in the real-world applications we are targeting, things will be imperfect, so you have to actually experience the real world and collect data, and that data will not be complete in any way, nor will it exhaustively cover the exponential search space that could exist.