Agent Reinforcement Fine Tuning – Will Hang & Cathy Zhou, OpenAI
Channel: aiDotEngineer
Published at: 2025-12-09
YouTube video id: p1CmPZ2j6Lk
Source: https://www.youtube.com/watch?v=p1CmPZ2j6Lk
[music] Hey everyone, I'm Will. >> And I'm Cathy, and we're on the fine-tuning team at OpenAI. >> We're super excited to talk to you today about agent RFT, the most powerful way to enhance the performance of your agents. You're probably joining us today because you're building an agent for your business and you'd like to improve its performance. So let's start by talking about what an agent actually is. What makes an agent different from a regular model is its ability to interact with the outside world to complete a task, to get things done on its own without having to go through you all the time. So this agent needs to have access to tools. For example, if you're building a coding agent, it's got to have access to a terminal, a code interpreter, or maybe even an entire codebase. But these agents aren't just blindly calling tools. They're reasoning at the same time. The way we think about these agents is that their interactions with the outside world, such as tool calls, are interleaved with their reasoning traces in the same context window.
An example of an agent that we've built in-house using this paradigm is Codex. Codex is our flagship coding agent. It has access to a wide range of tools to complete coding tasks end to end, like writing unit tests or submitting large diffs to your codebase that are hopefully correct. Some tools are exposed as terminal commands, and other tools are custom functions the model can call to invoke, say, a planning workflow.
So now, how do we make our agents better? We're all probably pretty familiar with the frontline techniques for improving agent performance. For starters, there's prompt engineering or prompt optimization. With prompting, you can steer model or agent behavior to align more with your preferences. But let's say you still want to squeeze more juice out of your task. Well, you can then turn to task optimization. You can simplify the task. You can add better guardrails around the task.
You can add and subtract tools, or you can change tool behavior to work better for the agent. But let's say you've tried all these approaches and you still want better performance. That's where you would turn to fine-tuning. Fine-tuning is a way to train the agent end to end on your task to achieve even better performance by changing the weights of the model. And agent reinforcement fine-tuning, or agent RFT, is the way to do this, or at least it's the way that we would like you all to do this. Agent RFT changes the weights of the model according to a learning signal that you specify, to teach the model what good behavior and bad behavior look like. During training, the agent explores many different ways of calling your tools to solve your task.
We've introduced several major new additions to the RFT product. First, the model can now call your tools via your endpoints, which are hosted on the public internet. Second, after each rollout, we also invoke your custom reward signal, which is hosted behind an endpoint as well. These two additions mark the first time that we at OpenAI have allowed models to interact with the outside world during the training process. I think this is pretty cool.
To summarize the benefits of agent RFT: it helps you improve the performance of your reasoning models, and more specifically the reasoning models that have to call tools and interact with the outside world to get things done in a multi-step fashion. Agent RFT is also quite sample efficient. We've seen people succeed with literally only about 10 examples, which is pretty amazing. We'll go over specific examples of this when we deep dive into some of our customer spotlights. And it results in a model that has lower latency and just works better for your tasks. So now let's dive a little bit deeper into how all this works.
One of the challenges with making agents work in your specific business context is that your environment, your world, might just be different from how we train our models in-house. In ML this phenomenon is called domain shift, and it can result in an agent that doesn't call your tools all that well. It might call a tool too many times, or it might just straight up shove the wrong inputs into your tools. Agent RFT can readapt the model to your domain through this weight-changing training process, which results in an agent that actually understands your environment. This has some really nice properties, most obviously better ML performance. It trains the model to use tools better, and it trains the model to reason over the outputs of those tools better. All of this is learned organically by the model while it explores the search space, meaning all the possible ways of interacting with your environment, and hill climbs on your reward.
Another really nice property is the ability to achieve much lower latency by making sure that the model stays within a given tool-call budget and doesn't go over that limit. We can impose a penalty that penalizes the model for going over that budget. What actually happens is that the model learns to stay within the budget while preserving or exceeding the original ML performance.
To dive a little deeper into what happens at a systems level: for each agent rollout, we produce a unique identifier that specifies that particular rollout, and we associate all the tool calls that we make into your system with that UUID. We do this for every tool call so that you can keep track of a trajectory as it evolves. That way, when we emit the final answer at the very end, you can associate that final answer with all the context that you've maintained so far, and you can pass this whole thing as a holistic grading context into your grader.
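The rollout bookkeeping and tool-call budget described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual agent RFT interface: `ToolCall`, `RolloutStore`, `grade`, the exact-match scoring, and the `budget` and `penalty_per_call` values are all hypothetical names and choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation observed on your side (hypothetical shape)."""
    name: str
    arguments: dict
    output: str


class RolloutStore:
    """Keeps every tool call seen for a rollout, keyed by its rollout ID."""

    def __init__(self) -> None:
        self._calls: dict[str, list[ToolCall]] = defaultdict(list)

    def record_tool_call(self, rollout_id: str, call: ToolCall) -> None:
        # Called from your tool endpoint each time the agent invokes a tool.
        self._calls[rollout_id].append(call)

    def trajectory(self, rollout_id: str) -> list[ToolCall]:
        # Returns a copy so the grader cannot mutate the stored history.
        return list(self._calls.get(rollout_id, []))


def grade(
    store: RolloutStore,
    rollout_id: str,
    final_answer: str,
    expected: str,
    budget: int = 5,                # assumed tool-call budget
    penalty_per_call: float = 0.1,  # assumed penalty per call over budget
) -> float:
    """Scores a final answer using the whole trajectory as grading context."""
    calls = store.trajectory(rollout_id)
    # Assumption: exact-match correctness stands in for your real task metric.
    base = 1.0 if final_answer.strip() == expected.strip() else 0.0
    over_budget = max(0, len(calls) - budget)
    return max(0.0, base - penalty_per_call * over_budget)
```

Keying everything on the rollout ID is what lets the grader see the full trajectory rather than only the final answer, which is also where a budget penalty like this one would naturally live.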
Now, we don't recommend that anyone use agent RFT right off the bat. There's a process that we'd like you all to follow. You first want to make sure that your training dataset and your eval dataset closely match your production traffic. You do not want any drift whatsoever. Then you want to ground yourself in a baseline. You want to run your base model against these datasets so that you understand what to expect performance-wise and can hill climb from there. Then you want to optimize performance using some of the techniques that we talked about earlier, like prompt or task optimization. Only then, when you feel like you've squeezed all the juice out of the task but you still want more, would you turn to agent RFT to push the frontier for your task. So now I'm going to turn it over to Cathy to talk about how some of our partners have really pushed that frontier.
>> Yeah. So now that we've learned how agent RFT works and when you should use it, I'll show you some coding-related examples of how our customers were able to use agent RFT to make their agents better, and also highlight some key takeaways that you can apply when optimizing your own agents.
A few months ago we partnered with Cognition, who used agent RFT on their code edit planning phase. This is the part where Devin inspects a repo and runs shell tools like grep and file reads to decide which exact files to edit. To train this behavior, they built a dataset of user queries paired with the actual files that users had modified, and they used the F1 score of the selected files as the reward. This F1 score is really great because it balances precision and recall, which ensures that the agent neither returns too many inaccurate files nor misses the critical ones. They also built extremely robust infrastructure to support this training.
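The F1 reward over selected files mentioned above can be written down directly. The sketch below is an illustration rather than Cognition's actual grader: the function name `file_f1`, the use of sets of path strings, and the choice to return 0.0 when either set is empty are assumptions.

```python
def file_f1(predicted: set[str], actual: set[str]) -> float:
    """F1 between the files the agent selected and the files users edited."""
    if not predicted or not actual:
        # Assumption: an empty selection or empty ground truth earns nothing.
        return 0.0
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # penalizes extra files
    recall = true_positives / len(actual)        # penalizes missed files
    return 2 * precision * recall / (precision + recall)


# Selecting one unnecessary file on top of the two correct ones:
# precision = 2/3, recall = 1, so F1 = 0.8.
reward = file_f1({"a.py", "b.py", "c.py"}, {"a.py", "b.py"})
```

Because F1 is the harmonic mean of precision and recall, the agent cannot score well by listing every file in the repository or by returning a single safe guess.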
In this case, for each individual trajectory they spun up a VM to manage the codebase, execute the tool calls, and grade the final answer. These VMs make sure that the environment is isolated, so that the shell tools in different rollouts will not affect each other.
We saw two important takeaways from Cognition's use case. First, data quality and volume really matter. At first they fine-tuned on a dataset of around 100 examples and were able to get a 5-point improvement. But when they scaled to a thousand examples, the improvement jumped to 10 points. So the number of high-quality examples you provide can translate very directly into better agent behavior. Second, we also learned that RFT is really good for learning to call tools in parallel. In this case, the model would initially take 8 to 10 steps, alternating between generating reasoning tokens and actually calling the tools. After RFT, the agent launches many tool calls in parallel at the very first step, which reduced the number of steps to four. In this use case the speedup was especially important because they wanted Devin to start producing edits quickly.
Now I want to highlight a different use case. Qodo is building a code review agent, and a key piece of that is a deep research agent that answers developer questions about large codebases. To improve this deep research agent, they trained GPT-5 to answer coding questions by calling tools like search and retrieve over the repository. They assembled around a thousand authentic question-answer pairs from eight different repositories and rewarded the model using recall, meaning how many of the relevant facts the agent was able to retrieve. With RFT, the agent improved by 6%, and it was using fewer tool calls and output tokens. What we found most interesting is this graph, which shows how RFT shifted the distribution of the number of tool calls.
With GPT-5, the agent would occasionally fall into these bad runs where there were more than 15 tool calls in a single sample. This is very slow and can also lead to inconsistent behavior. After RFT, these long-tail runs disappeared, and the distribution centered around just two to four tool calls. So in this setup RFT didn't just improve accuracy. It also stabilized the agent's behavior by eliminating these P95 long-tail cases. This is very important for production use cases where latency matters.
Next, I want to share how Cosine built coding agents for large and complex enterprise codebases with agent RFT. To make this work, they trained the agent on a very comprehensive set of 30 tools, such as [unclear], keyword search, session terminal, browser sessions, and so on. They also built a very strict grader. They observed that when they originally gave the model partial credit and points just for trying things, the results weren't very good, because the model would start to optimize for things like coding style and tone. First and foremost, they wanted to make sure the agent ships working code, so they give the model a reward only when the final code passes the tests. Because the grader is very strict, it can sometimes give sparse rewards. Here GPT-5 is actually a great starting point, because it can give us some samples that work. Cosine also boosted the batch size and increased the amount of compute so that there are even more samples that can earn a positive reward, and not every single sample in the batch comes back with zero reward. They also have a custom LLM judge for style and tone, which penalizes verbosity, emojis, or anything that feels unprofessional. Finally, the grader rewards agents that validate their own work.
This means running tests, inspecting terminal outputs, and checking linting before declaring success. After training with this very thoughtful set of tools and graders, Cosine was able to reach state of the art on a lot of the different benchmarks shown here. They also got a much, much faster agent. As in the earlier examples, RFT shifted the distribution of tool calls, and the agent stopped taking these extremely long trajectories. In this case there were sometimes more than 100 messages in a single trajectory, and it converged to a much tighter and more efficient sequence of steps.
Lastly, Mako is a very interesting use case. They're building agents that write highly performant GPU kernels, which is traditionally very hard for LLMs. In normal use cases there are a lot of examples to learn from, but there are not a lot of examples for kernels, especially if you're using new hardware platforms like Nvidia B200s. With agent RFT, Mako trained GPT-5 to write fast kernels using only about 100 PyTorch prompts, and this was a major unlock. So we don't actually need that many samples or a large kernel dataset to train a good model that produces kernels. We just have to specify a good reward function.
In this case, specifying a good reward function is also very hard. Early in training they observed that the model was reward hacking. So they inspected the rollouts and found seven different cases where the model was hacking, including things like just returning the reference code, returning no kernels, or returning identity kernels. They built a judge LLM to catch all seven of these cases and reward them with a zero. They also added a static analysis tool that uses the abstract syntax tree to verify that the generated kernels actually exist and are actually being launched. Once they had made sure that there was no reward hacking, they scored on correctness and on real speedup compared to the PyTorch baseline.
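The layered kernel reward described above (zero for any detected hack, then correctness, then speedup over the baseline) might be combined along the following lines. This is a sketch under assumptions, not Mako's implementation: `KernelResult`, its fields, the `max_speedup` cap, and the linear scaling of speedup into a 0-to-1 reward are all hypothetical. The hack and correctness flags are assumed to come from upstream checks such as a judge model, static analysis, and a comparison against reference outputs.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Outcome of evaluating one generated kernel (hypothetical shape)."""
    hack_detected: bool  # flagged by a judge model or static analysis
    is_correct: bool     # outputs match the reference implementation
    baseline_ms: float   # runtime of the PyTorch baseline
    kernel_ms: float     # runtime of the generated kernel


def kernel_reward(result: KernelResult, max_speedup: float = 4.0) -> float:
    """Returns 0 for hacks or wrong results, else a capped speedup score."""
    if result.hack_detected or not result.is_correct:
        return 0.0
    if result.baseline_ms <= 0 or result.kernel_ms <= 0:
        # Guard against bogus timings, which could themselves be a hack.
        return 0.0
    speedup = result.baseline_ms / result.kernel_ms
    # Assumption: scale linearly and cap so one outlier cannot dominate.
    return min(speedup, max_speedup) / max_speedup
```

Checking for hacks before anything else means a kernel that copies the reference code can never collect the speedup portion of the reward, however fast it appears to run.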
Once all of these protections were in place, the agent got significantly better than GPT-5. Mako also used a really smart technique here to improve performance even more. They ran three different samples and took the best one of the three. This allowed them to beat the state of the art by 72%. And yeah, I'll hand it back to Will.
>> Thanks a lot, Cathy. So now we want all of you, in this room and beyond, to be as successful with agent RFT as the partners that Cathy just mentioned. Here are four key principles to ensure your success.
First of all, you want to make sure that your task is well defined and well constrained. There should be a clear, unambiguous definition of success. You should have removed all subjectivity from your task. Taste should not be a requirement for grading your task properly.
Next, you do not want the model to feel surprised in production. You want to make sure that your train and eval datasets mirror your production traffic, so none of that domain shift that we talked about. You do not want to introduce domain shift on your own.
Next, and this is a really important part, you want to make sure that through exploration the model actually achieves better performance on a given data point if it samples more, so that it can learn from itself. What this means is that if you take the maximum performance on a given dataset, it should improve as you sample more from the model. Because of this, you should be able to see variance in the outcomes for a given data point, so that the model can learn what the difference between a good and a bad rollout is for that data point.
And lastly, you want to make sure that your reward function is not hackable. Hopefully you've plugged up all the corner cases and edge cases. But also, hopefully you've framed your task so that the reward is more continuous than binary.
A continuous reward allows the model to inch closer and closer to optimal performance. It's sort of like giving a student partial credit, rather than slapping them in the face when they get something wrong or giving them a cookie when they get it right. So now, to get started with agent RFT, please contact your friendly neighborhood account director. We're really excited to see what you all build with us. Thank you so much. [applause] [music]
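The third principle from the talk, that the maximum performance should improve as you sample more and that outcomes should vary for a given data point, can be checked on baseline rollouts before any training. The sketch below is one way to do that and is not an official diagnostic: the function names `best_of_k` and `fraction_with_signal`, the `eps` threshold, and the layout of `rewards_per_item` (one list of sampled rewards per data point) are assumptions.

```python
from statistics import mean, pvariance


def best_of_k(rewards_per_item: list[list[float]], k: int) -> float:
    """Mean over data points of the best reward among the first k samples."""
    return mean(max(rewards[:k]) for rewards in rewards_per_item)


def fraction_with_signal(
    rewards_per_item: list[list[float]], eps: float = 1e-9
) -> float:
    """Share of data points whose sampled rewards are not all identical."""
    return mean(
        1.0 if pvariance(rewards) > eps else 0.0
        for rewards in rewards_per_item
    )


# Hypothetical rewards: three samples for each of two data points.
rewards = [
    [0.0, 0.5, 1.0],  # varied outcomes: good and bad rollouts to compare
    [0.2, 0.2, 0.2],  # identical outcomes: nothing to learn from
]
curve = [best_of_k(rewards, k) for k in (1, 2, 3)]
signal = fraction_with_signal(rewards)
```

If the best-of-k curve stays flat as k grows, or most data points show no variance, sampling more is not surfacing better rollouts, which suggests revisiting the task or the reward before turning to fine-tuning.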