RL for Autonomous Coding — Aakanksha Chowdhery, Reflection.ai
Channel: aiDotEngineer
Published at: 2025-07-16
YouTube video id: QluDzKVfp6A
Source: https://www.youtube.com/watch?v=QluDzKVfp6A
Hi everyone, I'm Aakanksha Chowdhery. I was at Google for more than six years, where I led the research for PaLM and was a lead researcher on Gemini. These days I'm working on pushing the frontier of autonomous coding with reinforcement learning. Let me recap the arc of how we have progressed in large language models, and why autonomous coding, and why now. In 2020 there was a breakthrough paper on scaling laws for large language models. The 30-second recap is that there is a power-law relationship between the test loss of a large language model and the resources you put in: if you use more compute, more data, and more parameters in your transformer model, you get a more performant model. And it is not performant only in the domain you trained on; it generalizes to many other domains, and that generalization was very much a feature here. As large language models got bigger, we saw continuous improvement across benchmarks, to the point that they are starting to get saturated now. The other interesting finding was emergent behavior: capabilities appeared in large language models that were not present in smaller models. This is a classic slide I show for the work we did on PaLM. Typically, when you go about solving math problems, you give the model some examples; on the left there is a math problem about tennis balls, and when you give the model a second problem, its output looks wrong.
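As an aside, the power-law relationship from the 2020 scaling-laws paper can be sketched in a few lines. This is a minimal illustration using the published parameter-count fit from Kaplan et al. (2020); the constants are approximate fitted values, not something from this talk.

```python
# Sketch of the Kaplan et al. (2020) parameter scaling law:
#   L(N) = (N_c / N) ** alpha_N
# where N is the (non-embedding) parameter count and L is test loss in
# nats/token. ALPHA_N and N_C are the published fits; treat as approximate.

ALPHA_N = 0.076   # fitted exponent for the parameter-count law
N_C = 8.8e13      # fitted constant (parameters)

def predicted_loss(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# Loss falls smoothly as parameter count grows, 1B up to PaLM-scale 540B.
for n in (1e9, 1e10, 1e11, 5.4e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The power law predicts smooth, monotone improvement with scale; it does not by itself predict the emergent capabilities discussed next.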
But what PaLM and the subsequent papers showed was that if you ask the model to output its reasoning chain, which has become a very common concept now but this was 2021, four years ago, then the answer actually comes out correct. By getting the model to output its chain of thought, or reasoning chain, the model's performance improves, and this capability emerged specifically in large language models. LaMDA and PaLM were the state-of-the-art models about three years ago, and on the x-axis I'm showing increasing numbers of parameters; PaLM was scaled all the way up to 540 billion parameters. No one actually publishes parameter counts these days, so you have to live with graphs from three years ago or the open-source releases such as the DeepSeek and Qwen models. The y-axis shows that the solve rate on middle-school math word problems increased with the number of parameters, and it increased mainly when you prompted the models to show a chain of thought. This led to all kinds of prompting techniques: you ask the model to think step by step, you even bribe the model, you ask it nicely or not. So this was all kinds of fun stuff. The thing that really stood out from that generation of models was that this capability was not limited to math problems; it generalized across a whole range of domains, from question answering in other languages to puzzle problems to multitask natural-language-understanding problems. And this led to the next step: now that these models could reason, we could get them to follow instructions, so the first applications that became possible were chatbots.
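The prompting style being described can be sketched concretely. Below is a minimal few-shot chain-of-thought prompt in the style of the original work, using the tennis-ball problem mentioned above as the worked example; `build_prompt` is an illustrative helper, not any specific system's API.

```python
# Sketch of few-shot chain-of-thought prompting: the prompt includes one
# worked example whose answer spells out intermediate reasoning steps, and
# the model is expected to imitate that format on the new question.

COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

def build_prompt(question: str) -> str:
    # The demonstration induces the model to emit its own reasoning chain
    # before the final answer, which is what improves the solve rate.
    return COT_PROMPT.format(question=question)

print(build_prompt(
    "The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?"
))
```

The key point is that only the prompt changes; the same model that answered wrongly with a bare question-answer format answers correctly when shown how to reason out loud.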
Everyone remembers that ChatGPT, and now Gemini and various other chatbots, became extremely popular; all of us use them all the time. What made them possible was that when you give the model instructions to go do something, it is actually able to do it, and the way it learns that is reinforcement learning. The reinforcement learning data we give the model in this case is based on human feedback: here is a question and two candidate answers, and which one would a human prefer? If you have enough of this data and train your model on it, you end up with better performance, because you taught the model which responses to prefer. And this doesn't only work in chatbot applications; it also works for code. On the bottom right I'm showing that if you do this for coding applications, you start to see performance improvements. Now, last year there was a whole debate about whether we are hitting a wall in the performance of large language models, whether pre-training is no longer giving gains. So what is next? One key thing to remember is that pre-training these models costs a lot of money, potentially tens of millions of dollars, while inference on them is extremely cheap. These numbers are not endorsed by any of the companies I've worked at; they come from public sources. The main point I want to make is that training is extremely costly.
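The preference data described above is typically turned into a training signal with a pairwise loss: a reward model scores the preferred and dispreferred answers, and the loss pushes the preferred score higher. This is a minimal numeric sketch; the scalar scores are placeholders for what a real reward model would produce.

```python
import math

# Sketch of the pairwise (Bradley-Terry style) preference loss commonly used
# to train reward models for RLHF: given scalar scores for the human-preferred
# ("chosen") and dispreferred ("rejected") responses, minimize
#   -log sigmoid(r_chosen - r_rejected).

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks when the model scores the human-preferred answer higher,
# and grows when the preference is violated.
print(preference_loss(2.0, 0.5))   # preference respected: small loss
print(preference_loss(0.5, 2.0))   # preference violated: large loss
```

Training on many such pairs is what teaches the model which responses humans prefer; the policy is then optimized against the learned reward.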
So if constantly scaling up the model size is not giving performance gains, can we get gains at inference time instead, since inference calls are so cheap? A key idea here was to have the model generate multiple responses and then take a majority vote. In the example above, you give a math problem to the large language model, ask it to generate three answers independently, and then vote over those answers; if two answers match, that's the majority vote. It's like asking a question in this room: if all of you say yes, that is a majority vote. If you can get the model to generate many, many samples and get many of those answers to agree, this notion of majority voting, or self-consistency, showed clear gains. So scaling compute at inference time was clearly one avenue to push on. Another avenue that emerged and showed substantial value was sequentially revising your previous response. As humans, we often write a first answer, then evaluate it, notice a mistake, and go fix it. Can we get LLMs to do the same kind of revision, looking at their previous attempts? That was the second avenue: longer chains of thought, with the model improving consistently at inference time. These techniques worked especially well where you can verify the correct answer, as in math, or in programming where you have unit tests; there they showed very clear gains.
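The majority-voting idea is simple enough to sketch directly. Here `answers` stands in for the final answers extracted from several independent model samples; the voting step itself is just a frequency count.

```python
from collections import Counter

# Sketch of self-consistency / majority voting: sample several reasoning
# chains independently, extract each chain's final answer, and return the
# answer that occurs most often.

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer among the sampled generations."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Three independent samples for the same math problem; two of three agree,
# so "11" wins the vote even though one sample was wrong.
samples = ["11", "14", "11"]
print(majority_vote(samples))
```

The intuition is that independent reasoning paths that arrive at the same answer are more likely to be correct than any single sampled path.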
What I'm showing here is an example from one of my colleagues' work at Stanford, a published paper. On the y-axis we have pass@k, or coverage, and on the x-axis the number of samples. As you draw more samples, accuracy improves with an open-source DeepSeek model: just by taking more samples, you get a very high score on SWE-bench Verified, even compared to the state of the art at the end of 2024. Of course, all of these scores have since been pushed up, and we are already roughly around 80%. The takeaway is that these lines of work showed that inference-time compute predictably gives us gains, especially in domains where we can verify. If we know how to verify the answers, then we know how to translate that into intelligence. And going back to my talk title, coding is one of those domains where we do have the capability to verify, and that gives us a tremendous advantage in building superintelligence on top of autonomous coding. Of course, now you ask: what does automated verification mean here? For inference-time scaling to work, you need some way to say that an output is correct. Math gives a very simple example: if the input is a mathematical equation to solve, you can do the same calculation on a calculator and verify that the solution is correct. Similarly, math has formal proofs, so you can verify that things are correct.
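The pass@k metric on that plot has a standard unbiased estimator (introduced in the Codex paper by Chen et al., 2021): draw n samples, count the c that pass verification, and compute the probability that at least one of k randomly chosen samples is correct. A minimal sketch:

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., 2021): with n samples of which c
# pass the tests, the chance that a random subset of k samples contains at
# least one passing sample is 1 - C(n-c, k) / C(n, k).

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k failures exist, so any k-subset must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Coverage climbs steeply with k even at a fixed 5% per-sample success rate,
# which is exactly the "more samples -> higher coverage" curve on the plot.
for k in (1, 10, 100):
    print(f"pass@{k} with n=200, c=10: {pass_at_k(200, 10, k):.3f}")
```

Note that pass@k measures whether a correct solution exists among the samples; picking it out still requires a verifier, which is the point made next.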
In coding you have unit tests and compilers: you can actually execute the generated code and use the compiler or interpreter as a verifier. In domains where you don't have this kind of verification, there is a large gap: if you generate a lot of solutions and then do majority voting, you don't get nearly as much gain. So roughly, inference-time scaling works in scenarios with automated verification. But that doesn't quite get you to real-world impact, and the reason is shown in this graph. Across multiple different models on GSM8K, which is grade-school math word problems, and another math benchmark, if you sort problems by the fraction of correct generations, you have to sample a lot, because correct generations can be very rare. Who has time to sample 10,000 times to find one correct solution? You would be sitting there waiting, unless you can actually figure out which generation is the correct one. So scaling inference-time compute with just majority voting or longer reasoning chains is great in the sense that there are correct solutions somewhere in there, but it doesn't work well across the board. So what will get these models to learn to generate correctly during training? Well, in the chatbot scenario we saw that RL from human feedback worked. Can we apply the same principle here and get the model to generate correctly in domains where we can automatically verify the outputs? Our belief at Reflection is that the next frontier for scaling is reinforcement learning, and we already have proof points from some of the frontier labs. As David Silver and Richard Sutton published recently, they agree, or rather, they are the pioneers of reinforcement learning.
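Using execution as a verifier, as described above, can be sketched as a binary reward function: run the candidate solution together with its unit tests in a subprocess and reward it only if everything passes. This is an illustrative sketch, not any production system; the file handling and timeout are assumptions.

```python
import os
import subprocess
import sys
import tempfile

# Sketch of unit-test-based verification for generated code: concatenate the
# candidate solution with its tests, run the result in a fresh interpreter,
# and emit a binary reward based on the exit code.

def verify(candidate_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Return 1.0 if all tests pass, 0.0 otherwise."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # a hung or too-slow candidate counts as a failure
    finally:
        os.unlink(path)

solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verify(solution, tests))
```

A real setup would also sandbox execution and capture richer feedback (which test failed, stderr), but the essential property is the same: the reward is computed automatically rather than by a learned judge.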
They say we are entering the era of experience. Starting from AlphaGo and AlphaZero, there was an era of simulation; the large-language-model era that followed scaled up RL using human data; but the next era, starting this year, is really the era of experience, and that is what will lead us to superintelligence. So reinforcement learning will be a fundamental component in building superintelligent systems, especially in areas where we have automated verification. One proof point for why this makes sense comes from math. These results are from o1, but across several papers we have seen that if you give the model more test-time compute (test-time compute is the same thing as inference-time scaling) on the x-axis, accuracy goes up on the y-axis. And if you repeat this process with reinforcement learning, then scaling training-time compute on the x-axis also improves accuracy on the y-axis on a challenging math benchmark called AIME. Most of these benchmarks saturate within a year, as you have probably learned by now, and this one is already saturated. So now that I've hopefully convinced you that scaling reinforcement learning is the next frontier, you might ask: why isn't everyone doing it? What's so challenging about it? Having built large language models before, I can tell you that a big part of building these systems is that the machine learning plus systems stack is itself very challenging. Here is an example of why scaling up reinforcement learning is hard: if you are doing reinforcement learning with PPO, which is one of the algorithms used for RL with human feedback, you have to keep four copies of different models.
Imagine a really large model; now you have to keep four copies and arrange them on the GPUs of your large cluster. You can have some fun figuring out the exact layout, and it is a fun and interesting problem, but it's hard in the sense that building a system that keeps those machines maximally utilized, with everything arranged the right way, is extremely difficult. DeepSeek showed with DeepSeekMath that GRPO gets rid of the value model, leaving only three copies of the model, but that is still very challenging. So scaling up RL is even harder than scaling up LLMs, because you have multiple copies of the model and both a training loop and an inference loop. And on the reinforcement learning side, you also suffer a lot from reward hacking if the thing deciding what counts as a correct answer is a neural reward model. But as we discussed, in autonomous coding applications you have the ability to verify your output, which roughly means you can decide whether an answer is correct; that is how SWE-bench Verified scores work today. You have execution feedback, you have unit tests, and this is an ongoing list. All of these possibilities mean you can design better reward functions. So autonomous coding is a great domain for scaling up RL. Then the question becomes: how does this have real-world impact? In software engineering applications, generating code is only one part of the system; if you look at end-to-end software engineering workflows, there are many more parts. How do you scale your system to generalize across all of them? That's the problem we are trying to solve at Reflection.
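The reason GRPO can drop the value model is its group-relative advantage: instead of a learned critic estimating a baseline, it samples a group of responses per prompt and normalizes each response's reward against the group's mean and standard deviation. A minimal numeric sketch (the rewards are placeholder outputs from an automated verifier):

```python
import statistics

# Sketch of the group-relative advantage used by GRPO (DeepSeekMath):
# for one prompt, sample a group of G responses, score each with the reward
# function, then standardize within the group:
#   A_i = (r_i - mean(r)) / (std(r) + eps)
# The group statistics replace the learned value model PPO would need,
# which is why GRPO keeps three model copies instead of four.

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for one prompt; the verifier passed two of them.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))
```

Responses that beat their group's average get a positive advantage and are reinforced; below-average responses are pushed down, all without a critic network.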
Our mission is to build superintelligence, and we are starting with autonomous coding as the root-node problem for that mission. We have a team of about 35 people who have pioneered various legendary works in LLMs and reinforcement learning. If you're excited about this mission, you can reach out to one of us; my email is my last name at reflection.ai, and we would love to work with you. With that, I can take questions. [Applause]

All right, same protocol as last time: if you have a question, please come up to one of the three microphones we have distributed throughout. We can probably take one or two questions, so if you want to ask something, feel free. I'll do the first one while people are coming up. It seems like the foundation-model labs have tried to build one model and deploy it across everything. From the work you're doing right now, do you think that's the right approach, or do you expect more specialization across different languages, or even individual codebases? Or is the best approach just one model trained across the greatest diversity of tasks possible?

I'll answer it this way: building coding agents does require multiple capabilities, and getting there will definitely need multiple LLM calls. Whether that's one model or multiple models is the secret sauce right now for most people.

Fair enough. All right, please.

Hi. I'm wondering about the slide with the chart of the era of simulation and the era of experience. It included AlphaGo, and the earlier systems that played StarCraft. They all used MCTS, which, maybe this is my unfamiliarity, but that is also simulation. So we're using synthetic data in the era of experience as well.
So why is that called simulation, and why is what we're doing right now not called simulation? What is the overlap between simulation and experience? How do you think about that?

I could ask Dave that question, you know. But I think the better way to answer it is roughly what Greg covered in the last talk: in gaming, you can envision what scenarios might happen next, and you use that to do rollouts and build your reinforcement learning on top of them. In most real-world scenarios, you have an imperfect rollout; you don't have full knowledge of how the system might work. Simulation is possible in certain domains where you can build a world model, which is closer to robotics and all the work happening in the physical-AI space. But in the real-world applications we are targeting, things will be imperfect, so you have to actually experience the real world and collect data, and that data will not be complete in any way, nor will it exhaustively cover the exponential search space that could exist.