Agent Reinforcement Fine Tuning – Will Hang & Cathy Zhou, OpenAI

Channel: aiDotEngineer

Published at: 2025-12-09

YouTube video id: p1CmPZ2j6Lk

Source: https://www.youtube.com/watch?v=p1CmPZ2j6Lk

[music]
Hey everyone, I'm Will
>> and I'm Cathy, and we're on the fine-tuning team at OpenAI
>> and we're super excited to talk to you today about agent RFT, the most powerful
way to enhance the performance of your
agents. So, you're probably joining us
today because you're building an agent
for your business and you'd like to
improve its performance. So, let's first
start by talking about what an agent
actually is. What makes an agent
different from a regular model is its
ability to interact with the outside
world to complete a task to get things
done on its own without having to go
through you all the time. So, this agent
needs to have access to tools. For
example, if you're building a coding
agent, it's got to have access to a
terminal, a code interpreter, or maybe
even an entire codebase.
But these agents aren't just blindly
calling tools. They're reasoning at the
same time. The way that we think about
these agents is that their interactions
with the outside world, such as tool
calls, are interleaved with their reasoning
traces in the same context window. So,
an example of an agent that we've built
in-house using this paradigm is Codex. Codex is our flagship coding agent. It has
access to a wide range of tools to
complete coding tasks end to end like
writing unit tests or submitting large
diffs to your codebase that are
hopefully correct. Um some tools are
exposed as terminal commands and other
tools are custom functions a model can
call to invoke, say, a planning workflow.
So now how do we make our agents better?
We're all probably pretty familiar with
the frontline techniques to improve the
performance of agents. For example, for
starters, prompt engineering or prompt
optimization. Through prompting, you can steer model or agent behavior to align more
with your preferences. But let's say you
still want to squeeze more juice out of
your task. Well, you can then turn to
task optimization. You can simplify the
task. You can add better guard rails
around the task. You can add and
subtract tools. Or you can change tool
behavior to work better for the agent.
But let's say you still want to squeeze
even more juice out of that task. You've
tried all these approaches and you still
want better performance. So that's where
you would turn to fine-tuning.
Fine-tuning is a way to train the agent end to end on your task to achieve
even better performance by changing the
weights of the model. And agent
reinforcement fine-tuning, or agent RFT, is
the way to do this or it's the way that
we would like you all to do this. Um,
agent RFT changes the weights of the
model according to a learning signal
that you specify to teach the model what
good behavior and what bad behavior
looks like. And during training, the
agent will explore many different ways
of calling your tools to solve your
task. So, we've introduced several major
new additions to the RFT product. Um,
first off, the model can now call your
tools via your endpoints that are hosted on the public internet. Um, and after each rollout, we'll also invoke your
custom reward signal that's hosted via
an endpoint. So, these two additions
actually mark the first time that we at OpenAI have allowed models to
interact with the outside world during
the training process. So, I think this
is pretty cool. To summarize the
benefits of agent RFT, it helps you
improve the performance of your
reasoning models, but more specifically
the reasoning models that have to call
tools and interact with the outside
world to get things done in a multi-step
fashion. Agent RFT is also quite sample
efficient. We've seen people get success
from literally only using like 10
examples, which is pretty amazing. We'll
go over specific examples of this when
we deep dive into some of our customer
spotlights. And it results in a model
that has lower latency and just works
better for your tasks.
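To make the endpoint setup concrete, here is a minimal sketch of what a self-hosted tool endpoint might look like. The field names (rollout_id, tool_name, arguments, output) and the example tool are placeholders for illustration, not the actual agent RFT request schema.

```python
# Minimal sketch of a tool endpoint the model could call during training.
# Field names here are hypothetical; check the agent RFT docs for the real schema.
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_tool(tool_name: str, arguments: dict) -> str:
    """Dispatch to your real tools (search, code execution, etc.)."""
    if tool_name == "search_repo":  # hypothetical example tool
        return f"results for {arguments.get('query', '')}"
    return f"unknown tool: {tool_name}"

@app.route("/tool", methods=["POST"])
def tool_call():
    payload = request.get_json()
    rollout_id = payload["rollout_id"]  # unique id for this training rollout
    output = run_tool(payload["tool_name"], payload.get("arguments", {}))
    # Return the tool output so the model can keep reasoning over it.
    return jsonify({"rollout_id": rollout_id, "output": output})

if __name__ == "__main__":
    app.run(port=8000)
```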
So now let's dive a little bit deeper
into how all this works. One of the
challenges with making agents work with
your specific business context is that
your environment, your world might just
be different from how we train our
models in house. So this phenomenon in
ML is called domain shift. And it can
result in an agent that doesn't quite
call your tools that well. It might
call a tool too many times or might just
straight up shove wrong inputs into your
tools. Agent RFT can readapt the model
to your domain through this weight
changing training process that results
in an agent that actually understands
your environment. And this has some
really nice properties: obviously, better
ML performance. It trains the model to
use tools better and it trains the model
to reason over the outputs of those
tools better. All this is learned
organically by the model while it
explores the search space, all the
possible ways of interacting with your
environment and hill climbing on your
reward. Another really nice property
that results from this is the ability to
achieve much lower latencies by making
sure that the model stays within a given
tool-call budget and doesn't go over
that limit. So we can actually impose
this penalty that you know penalizes the
model for going over that budget. What
actually happens is the model learns to
stay within that budget while preserving
or exceeding the original ML
performance.
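As a rough illustration of the budget idea (the exact penalty agent RFT applies isn't specified here), a reward shaped to discourage exceeding a tool-call budget could look like this, with task_score coming from your grader and the penalty weight chosen by you:

```python
def shaped_reward(task_score: float, n_tool_calls: int,
                  budget: int = 8, penalty_per_call: float = 0.05) -> float:
    """Penalize rollouts that exceed the tool-call budget.

    task_score is the base reward in [0, 1] from your grader; the shaping
    here is hypothetical and just shows the idea of a budget penalty.
    """
    overage = max(0, n_tool_calls - budget)
    return max(0.0, task_score - penalty_per_call * overage)

# Example: a correct rollout (score 1.0) that used 12 calls against a budget
# of 8 gets 1.0 - 0.05 * 4 = 0.8, nudging the model toward fewer calls.
```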
So, to dive a little bit deeper into what happens at a systems level: for each agent rollout, we'll produce a unique identifier that specifies that particular rollout, and we will associate all the tool calls that we make into your system with that UUID. We do this for every tool call so that you can keep track of a trajectory as it evolves, so that when we emit that final answer at the very end, you can associate that final answer with all the context you've maintained so far, and you can just pass this whole thing as a holistic grading context into
your grader.
Now, we don't recommend that anyone just use agent RFT right off the bat. Uh, there's a process
that we'd like you all to follow. You
first want to make sure that your
training data set and your eval data set
closely match your production traffic.
You do not want any drift whatsoever.
Then you want to ground yourself in a
baseline. You want to run your base
model against these data sets so that
you kind of understand what to expect
performance-wise so that you can then
hill climb from there. And then you want
to optimize performance using some of
the techniques that we talked about
prior like prompt or task optimization.
And only then when you still feel like
you squeezed all the juice out of the
task, but you still want more
juice, you would turn to agent RFT to
push the frontier for your task. So now
I'm going to turn it over to Cathy to
talk about how some of our partners have
really pushed that frontier.
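Here is a minimal sketch of the bookkeeping just described: keying every tool call by the rollout's UUID so that the final answer can be graded with the whole trajectory as context. The in-memory store, field names, and grading rule are all illustrative assumptions, not the actual agent RFT contract.

```python
from collections import defaultdict

# rollout_id -> list of (tool_name, arguments, output) observed so far
trajectories: dict[str, list] = defaultdict(list)

def on_tool_call(rollout_id: str, tool_name: str, arguments: dict, output: str) -> None:
    """Record each tool call under the rollout's UUID as the trajectory evolves."""
    trajectories[rollout_id].append((tool_name, arguments, output))

def grade(rollout_id: str, final_answer: str) -> float:
    """Score the final answer using the full trajectory as holistic grading context."""
    context = trajectories.pop(rollout_id, [])
    # Hypothetical rule: reward correct answers that stayed under 6 tool calls.
    correct = "expected substring" in final_answer  # replace with your own check
    efficient = len(context) <= 6
    return 1.0 if (correct and efficient) else (0.5 if correct else 0.0)
```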
>> Yeah. So now that we learned how agent
RFT works and when you should use
it, I'll show you some coding related
examples of how our customers were able
to use agent RFT to make their agents
better and also highlight some key
takeaways that you can apply when
optimizing your own agents. So a few
months ago we partnered with Cognition
who used agent RFT on their code edit
planning phase. This is the part where
Devin inspects a repo and runs shell tools like grep and file reads to
decide which exact files to edit. To
train this behavior, they built a data set of user queries paired with the actual files that users had modified, and they used the F1 score of the selected
files as the reward. This F1 score is
really great because it balances precision and recall. So
this ensures that the agent doesn't
return too many inaccurate files or
miss the critical ones. They also built extremely robust infrastructure to
support this training. So in this case
for each individual trajectory they spun
up a VM to manage the codebase to
execute the tool calls and grade the
final answer. These VMs make sure that
the environment is isolated so that the
shell tools will not affect each other
in different rollouts.
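As a rough illustration (not Cognition's actual grader), an F1-based reward over the selected files might look like this:

```python
def f1_reward(predicted_files: set[str], gold_files: set[str]) -> float:
    """F1 of predicted vs. actually-modified files: balances precision and recall."""
    if not predicted_files or not gold_files:
        return 0.0
    true_positives = len(predicted_files & gold_files)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_files)
    recall = true_positives / len(gold_files)
    return 2 * precision * recall / (precision + recall)

# Returning many irrelevant files hurts precision; missing critical files
# hurts recall, so the agent is pushed to do well on both.
```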
We saw two important takeaways from
Cognition's use case. First, data
quality and volume really matter.
So, at first they fine-tuned on a data
set of around 100 examples and were able
to get a five-point improvement. But when
they scaled to a thousand examples, the
improvement jumped to 10 points. So the
number of high-quality examples you provide can very directly translate to
better agent behavior. Second, we also
learned that RFT is really good for
learning to call tools in parallel. So
in this case, the model would initially
take 8 to 10 steps alternating between
generating reasoning tokens and
actually calling the tools. After RFT,
the agent launches many tool calls in
parallel at the very first step. So
this was able to reduce that number down
to four. And in this use case, the speed
up was especially important because they
wanted Devin to start producing edits
quickly.
And now I want to highlight a different
use case. Qodo is building a code
review agent and a key piece of that is
a deep research agent that answers
developer questions on large code bases.
To improve this deep research agent,
they trained GPT-5 to answer coding
questions by calling tools like search
and retrieve over the repository. They
assembled around a thousand authentic
question-answer pairs from eight
different uh repositories and rewarded
the model using the recall of how many
relevant facts the agent was able to
retrieve.
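A fact-recall reward in that spirit could be shaped like the sketch below. How the facts are actually matched (exact strings, an LLM judge, etc.) isn't specified in the talk, so this uses a naive substring check purely as a placeholder:

```python
def recall_reward(answer: str, relevant_facts: list[str]) -> float:
    """Fraction of the relevant facts that show up in the agent's answer.

    Naive substring matching is a placeholder; a real grader would likely
    judge fact coverage more robustly.
    """
    if not relevant_facts:
        return 0.0
    covered = sum(1 for fact in relevant_facts if fact.lower() in answer.lower())
    return covered / len(relevant_facts)
```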
[clears throat]
With RFT, the agent improved by 6% and
it was using fewer tool calls and output
tokens. And what we found most
interesting is this graph, which shows
how RFT shifted the distribution of the
number of tool calls. So with GPT-5, the
agent would occasionally fall into these
bad runs where there were more than 15
tool calls in a single sample. This is
very slow and also can lead to some
inconsistent behaviors. So after RFT
these very long-tail tool calls um disappeared, and the distribution centered around just two to four tool calls. In this setup, RFT didn't just
improve uh accuracy. It also stabilized
the agent's behavior by eliminating these P95 long-tail cases. And this is very
important for production use cases where
your latency will matter.
Next, I want to share how Cosine builds
coding agents for large and complex
enterprise codebases with agent RFT. To make this work,
they trained the agent on a very comprehensive set of 30 tools, such as
file reads, keyword search, a session terminal,
browser sessions, etc. And they also
built a very strict grader. They observed that, originally, when they were providing the model with partial credit and points for just trying things out, it didn't get really good results, because the model would start to optimize for coding style and tone. So first, they wanted to really make sure the agent ships working code, and based on that, they give the model a reward only when the final code passes the tests. And because
the grader is very strict, it can sometimes give sparse rewards. In that case, GPT-5 is actually very helpful, because it can still give us some samples that work. So Cosine also boosted the batch size and increased the amount of compute so that there are even more samples that can give positive rewards, so it's not the case that every single sample in the batch gives us zero reward; some of them do have correct code. Um, they also have a custom LLM
that would judge the style and tone. So, it will penalize verbosity, emojis,
or anything that feels unprofessional.
Finally, the grader will reward the
agents that validate their own work. So,
this means running tests, inspecting
terminal outputs, and also checking
linting before calling out a success.
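Putting those pieces together, a grader in the spirit of Cosine's setup might be shaped like this sketch, where run_tests, style_penalty_from_llm, and validated_own_work are hypothetical helpers standing in for their test harness, LLM style judge, and self-validation check:

```python
def strict_code_grade(final_code: str, transcript: str,
                      run_tests, style_penalty_from_llm, validated_own_work) -> float:
    """Sketch of a strict grader: no reward unless the tests pass, then deduct
    for unprofessional style and add a small bonus for validating your own work."""
    if not run_tests(final_code):  # strict gate: only reward working code
        return 0.0
    score = 1.0
    score -= style_penalty_from_llm(final_code)  # verbosity, emojis, tone
    if validated_own_work(transcript):  # ran tests / checked lint before claiming success
        score += 0.2
    return max(0.0, min(score, 1.2))
```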
And after training with this very
thoughtful set of tools and graders,
Cosine was able to reach the
state-of-the-art on a lot of different
benchmarks over here. And they also got
a much much faster agent. So like in
earlier examples, RFT shifted this
distribution of tool calls and the agent
stopped taking these extremely long
trajectories. In this case, there were
sometimes more than 100 messages in a
single trajectory and it converged to a
much tighter and more efficient sequence
of steps.
Lastly, Mako is a very interesting use case. They're building agents that write highly performant GPU kernels, which is traditionally very hard for LLMs, because unlike normal use cases where there are a lot more examples, there are not many examples of kernels, especially if you're using new hardware platforms like NVIDIA B200s. With agent RFT, Mako trained GPT-5 to write fast kernels using only about 100 PyTorch prompts, and this was a major unlock. So
we don't actually need that many samples
in the kernel data set in order to train a
good model that produces kernels and we
just have to specify a good reward
function. In this case specifying a good
reward function is also very hard. Early
in training they observed that the model
was reward hacking. So what they did was
that they inspected the rollouts and
they found seven different cases where
the model was hacking, and these included
things like just uh returning the
reference code or returning no kernels
or identity kernels, and they built a judge LLM to catch all of these seven
cases and reward them with a zero. They
also added a static analysis tool using an abstract syntax tree to verify that the
generated kernels actually exist and
they're actually being launched. So
after they made sure that there was
no reward hacking, they also scored on
correctness and real speed up compared
to the PyTorch baseline.
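Roughly, such a reward has the shape below; judge_flags_hack, kernel_is_launched, is_correct, and speedup_vs_pytorch are hypothetical helpers standing in for their LLM judge, AST check, correctness test, and benchmark:

```python
def kernel_reward(kernel_src: str, judge_flags_hack, kernel_is_launched,
                  is_correct, speedup_vs_pytorch) -> float:
    """Sketch of an anti-reward-hacking kernel grader.

    Zero out known hacks (returning the reference code, empty or identity
    kernels), require that a kernel actually exists and is launched, then
    score correctness and real speedup against the PyTorch baseline.
    """
    if judge_flags_hack(kernel_src) or not kernel_is_launched(kernel_src):
        return 0.0
    if not is_correct(kernel_src):
        return 0.0
    s = speedup_vs_pytorch(kernel_src)  # e.g. 2.0 means twice as fast as baseline
    return min(1.0, s / (s + 1.0))      # 2x -> ~0.67, 4x -> 0.8, saturating at 1.0
```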
Once all of these protections were in
place, the agent got significantly
better than GPT-5.
And uh Mako also used a really smart
technique here to improve the
performance even more. They ran three
different samples and they took the best
one out of the three. This allowed them
to beat the state-of-the-art by 72%.
And yeah, I'll hand it back to Will.
>> Thanks a lot, Cathy. So, uh, now we want
all of you, all of you in this room and
beyond to be as successful as the
partners that Cathy just mentioned with agent RFT. So, here are four key
principles to ensure your success. First
of all, you want to make sure that your
task is well defined, well constrained.
There should be a clear, unambiguous
definition of success. You should have
removed all subjectivity from your
task. Taste should not be a requirement
to grade your task properly. Next, you
do not want the model to feel surprised
in production. You want to make sure
that your train and eval data sets
mirror your production traffic. So,
none of that domain shift that we talked
about. You do not want to introduce that
domain shift on your own. Um, next, and
this is a really important part, you
want to make sure that through
exploration, the model actually achieves
better performance on a given data point
if it samples more so that it can learn
from itself. So what this means is if
you take the maximum performance on a
given data set, that should improve as
you sample more from the model. So
because of this, you should be able to
see these variances for a given
data point. So the model can learn from
itself, learn what the difference
between a good and a bad rollout is for
a given data point.
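One cheap way to check this property before training is to sample the base model several times per data point and verify that best-of-k performance climbs with k. A minimal sketch, assuming you already have per-item lists of sampled rollout scores from your grader:

```python
import random

def best_of_k_curve(samples_per_item: list[list[float]], ks=(1, 2, 4, 8)) -> dict:
    """Estimate mean best-of-k performance from per-item sampled scores.

    If the curve is flat, extra sampling isn't finding better rollouts and RFT
    has little signal to learn from; if it climbs, exploration is paying off.
    """
    curve = {}
    for k in ks:
        best_scores = []
        for scores in samples_per_item:
            picks = random.sample(scores, min(k, len(scores)))
            best_scores.append(max(picks))
        curve[k] = sum(best_scores) / len(best_scores)
    return curve

# Example: two data points, eight sampled rollout scores each.
print(best_of_k_curve([[0, 0, 1, 0, 0, 1, 0, 0],
                       [0.2, 0.4, 0.1, 0.5, 0.3, 0.2, 0.6, 0.4]]))
```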
And uh lastly, you want to make sure
that your reward function is not
hackable. Hopefully you've plugged up
all the corner cases, all the edge
cases. Um but also hopefully you've
framed your task so that the reward is
more continuous than binary. The
continuous reward actually allows the
model to kind of inch up closer and
closer to optimal performance. Sort of
like giving a student partial credit, um, rather than, you know, slapping them in the face or giving them a cookie when they get stuff wrong or get stuff right. So now, in order to get
started with agent RFT, please contact
your friendly neighborhood account
director and we're really excited to see
what you all build with us. Thank you so
much. [applause]
[music]
[music]