Let LLMs Wander: Engineering RL Environments — Stefano Fiorucci

Channel: aiDotEngineer

Published at: 2026-04-08

YouTube video id: 71V3fTaUp2Q

Source: https://www.youtube.com/watch?v=71V3fTaUp2Q

Hello everybody and welcome to Let LLMs Wander: Engineering Reinforcement Learning Environments.
A few words about me. I am Stefano Fiorucci, an AI and software engineer. By
day, I work on AI orchestration at
Deepset, where I develop Haystack, an
open-source LLM framework. By night, I
love tinkering with small language
models, fine-tuning, and reinforcement
learning.
Today, I'm going to talk about
reinforcement learning environments for
language models evaluation and training.
This has been a hot topic over the past
year, and I find it fascinating for
several reasons. These environments let
models learn by interacting, exploring,
and improving from feedback. They are
natural gyms for LLM agents that can use
tools, run code, and solve multi-step
tasks. In addition, startups building RL
environments are getting major funding
and working directly with big AI labs.
Recent technical reports by DeepSeek and
MiniMax showed that they are effectively
using thousands of reinforcement
learning environments to improve model
performance on challenging tasks and
scale intelligence.
Don't worry if you know nothing about RL
environments. I'll cover that soon.
Here is the agenda for the talk. We'll
first review classic reinforcement
learning concepts and see how they map
to the language models domain. We'll
then introduce Verifiers, an open-source
library to build environments as
software artifacts, and explore some
common patterns to implement them.
Finally, I'll walk you through an
experiment where we take a small model
that can barely play tic-tac-toe against
a random player and transform it into a
master using a reinforcement learning
environment. Let's start.
First, a quick refresher on
reinforcement learning.
In reinforcement learning, there are two
main characters, the agent and the
environment.
The environment is the world the agent
interacts with. At each step, the agent
sees the current state of the world and
takes an action. The state of the
environment then changes in response to
that action. The agent also receives a
reward from the environment, a number
indicating how good or bad the state is.
The agent's goal is to maximize its
cumulative reward over time, and to do
this, it has to balance exploration,
so trying new actions to discover better
strategies, and exploitation, using
actions known to work. By interacting
with the environment, the agent learns
from experience and improves its
behavior.
A trajectory, or rollout, is the
sequence of states, actions, and rewards
that the agent goes through while
interacting with the environment. It's a
record of the experience.
In this presentation, I'll use
trajectory to mean a complete episode,
like one entire game.
Let's take a look at LLM training. A language model is a statistical model that, given some text (the prompt), returns a text completion.
The standard training recipe is divided
into three phases.
First, pre-training on a massive amount
of internet text. Here, the model learns
to create text completions. The base
model is knowledgeable, but can't follow
instructions and is hardly usable in
applications.
During supervised fine-tuning on
conversational examples, the model is
trained to follow instructions and can
learn new tasks.
In the third step, reinforcement
learning is often used with techniques
like proximal policy optimization to
align the model with human preferences.
It's worth showing an example of
supervised fine-tuning data, as later
we'll frequently compare supervised fine-tuning with newer reinforcement learning approaches.
As you can see, we have pairs of prompt
and responses, and during this phase,
the model learns by statistical
imitation.
It's essentially trying to mimic the
examples provided.
You might remember Ilya Sutskever's talk at NeurIPS 2024. He pointed out that the LLM training paradigm we just saw has started showing its limits.
In particular, pre-training no longer
seems to be enough to keep improving
model quality at the same rate. We
needed a new way to scale.
Then, OpenAI published its O1 model
series.
In their release blog post, they
mentioned reinforcement learning
training to make models use chain of
thought effectively.
They also underlined that the performance of O1 consistently improved with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
Unfortunately, they did not share many
details on how this model was actually
trained.
The release of DeepSeek R1 shed some
light on how you can possibly achieve
those results. First, they recognized
that reasoning and chain of thought can
improve the performance of models, but
teaching these behaviors to models using
supervised fine-tuning requires curated
data that is too expensive to produce at
scale.
They used reinforcement learning with
verifiable rewards, which we'll see in a
moment,
and DeepSeek also used GRPO, a new
reinforcement learning algorithm that
offers a simple, lighter setup compared
to techniques like PPO.
So, what is reinforcement learning with
verifiable rewards?
In this paradigm, the model is asked a question and generates both a reasoning trace and an answer.
The answer is then checked against the
known correct answer.
The reward is used for reinforcement
learning training.
The underlying idea is more general: any task where the outcome can be verified automatically, like a correct answer, a won game, or a successful tool call, can provide a training signal.
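To make the verifiable-reward idea concrete, here is a minimal sketch of such a check in Python; the <answer> tag convention and the function name are just illustrative, not tied to any particular library.

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the known correct answer, else 0.0.

    Assumes (as an illustrative convention) that the model wraps its answer
    in <answer>...</answer> tags after its reasoning trace.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```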
And this is fundamentally different from
supervised fine-tuning. In SFT, the
model learns from curated examples, and
its completions tend to stay close to
the distribution of those examples.
In reinforcement learning with
verifiable rewards, the model explores
different trajectories from its
pre-training and learns to favor the
ones that maximize rewards.
And this is exciting because the model
is no longer limited by the quality of
human examples. Through trial and error,
it can discover more efficient reasoning
strategies.
We can finally map the classic
reinforcement learning concepts to LLMs.
The language model acts as the agent.
The environment for any task
includes data,
harnesses, and scoring rules, everything
needed to check and possibly train the
model on the task.
From a software perspective, this marks
a shift from supervised fine-tuning to
reinforcement learning with verifiable
rewards. While SFT mainly relies on
conversational datasets, this new
paradigm usually requires an
environment, a dynamic system that the
model can interact with.
The definition of the agent is also
expanding. Language models can now be
given tools, from a weather API to a
terminal, and this makes environments
for training and evaluation more complex
and critical.
To make this more concrete, consider
teaching a model to play tic-tac-toe.
The agent is the language model. Its
action is generating a text response
with the specific move.
The environment acts as the game engine.
It handles prompting the model,
tracking the board status, generating
the opponent's move, and deciding when
the game is over.
The reward is the signal from the
environment, for example, +1 for a win
and zero for a loss. This reward guides
the model to find winning strategies
through trial and error.
This setup allows the agent to discover
strategies that maximize its score
without needing pre-existing human
examples.
You have probably understood that I am
enthusiastic about this topic, but let's
also use Andrej Karpathy's words to
describe the environments.
They give the LLM an opportunity to
actually interact, take actions, see
outcomes.
This means you can hope to do a lot
better than statistical expert
imitation.
Now, let's see how to build these
environments.
To build environments as software artifacts, we can use Verifiers, an open-source library by Prime Intellect.
Verifiers provides modular components to
create reinforcement learning
environments for LLM agents. These can
be used for both evaluation and
training.
Environments are Python packages that
can be easily installed and distributed.
The library provides base classes for several setups: single-turn environments with just one interaction between the model and the environment, multi-turn environments, tool environments where the model is equipped with tools, and several others.
It also includes abstractions for
parsing model responses and defining
reward functions.
Verifiers abstracts model serving. It expects an OpenAI-compatible API endpoint, so you can plug in OpenAI, OpenRouter, or local models via vLLM. It handles the model interactions and parallel trajectories, so you can focus on the environment logic.
For training, Verifiers comes with its own trainer and integrates with other frameworks such as prime-rl, Tinker, and SkyRL.
In short, Verifiers lets us focus on the task and the rewards rather than the infrastructure.
Let's start with a single-turn environment. Reverse-text is a simple environment to evaluate or train language models on their ability to reverse a string of text.
What's going on here?
load_environment is the entry point for every Verifiers environment. It contains all the setup logic.
First, a dataset is loaded and mapped.
The default dataset contains 1,000 text
paragraphs stored in the prompt column.
During the mapping step, we transform
the original dataset. Question is the
text paragraph, while answer is the
reversed text.
Next, an XMLParser is initialized. It extracts the text inside the reversed_text tags, as specified in the system prompt.
We then define the reward function. This
compares the model's output to the
ground truth and returns a longest
common subsequence ratio.
Finally, we bundle this into a rubric, a collection of weighted rewards, and initialize the single-turn environment.
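Putting the pieces together, load_environment for reverse-text could look roughly like the sketch below. The dataset id is a placeholder, the exact Verifiers signatures (XMLParser, Rubric, SingleTurnEnv) may differ slightly across releases, and difflib's ratio stands in for the longest-common-subsequence reward.

```python
import verifiers as vf
from datasets import load_dataset
from difflib import SequenceMatcher

SYSTEM_PROMPT = "Reverse the given text. Put the result inside <reversed_text> tags."

def load_environment(**kwargs):
    # Placeholder dataset id; the real environment ships a default dataset
    # with 1,000 paragraphs stored in a "prompt" column.
    dataset = load_dataset("some-org/paragraphs", split="train")
    dataset = dataset.map(
        lambda x: {"question": x["prompt"], "answer": x["prompt"][::-1]}
    )

    # Extracts the text inside <reversed_text> tags from the model's response.
    parser = vf.XMLParser(["reversed_text"], answer_field="reversed_text")

    def lcs_reward(completion, answer, **kwargs) -> float:
        # Similarity between the parsed output and the ground truth; difflib's
        # ratio stands in for the longest-common-subsequence ratio from the talk.
        parsed = parser.parse_answer(completion) or ""
        return SequenceMatcher(None, parsed, answer).ratio()

    rubric = vf.Rubric(funcs=[lcs_reward], weights=[1.0])
    return vf.SingleTurnEnv(
        dataset=dataset, system_prompt=SYSTEM_PROMPT, parser=parser, rubric=rubric
    )
```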
But, how does this come to life? Let me
show you an evaluation run.
Here is what happens under the hood.
load_environment is invoked.
Five examples are taken from the
dataset, and each example is used three
times for three rollouts. Each rollout
gets the same question, but may produce
different completions due to model
randomness.
We have 15 rollouts in total.
For each rollout, a conversation is
prepared with the system prompt and the
question. The conversation is sent to
the model.
The model generates a response. The
response is parsed, extracting the
answer. The reward is computed, and
results are saved. At the end of the
evaluation,
you get summary statistics
and additional info about reward
distribution.
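In Python, that evaluation run (5 examples, 3 rollouts each, 15 rollouts total) might be launched roughly like this; I'm assuming an evaluate-style API plus placeholder environment and model names, so treat the exact call as illustrative.

```python
from openai import OpenAI
import verifiers as vf

# Any OpenAI-compatible endpoint works: OpenAI, OpenRouter, or a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

env = vf.load_environment("reverse-text")  # illustrative environment id
results = env.evaluate(
    client=client,
    model="my-small-model",       # placeholder model name
    num_examples=5,               # 5 questions from the dataset
    rollouts_per_example=3,       # 3 completions per question -> 15 rollouts
)
print(results)                    # completions, parsed answers, and per-rollout rewards
```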
Training follows the same core mechanism
with the additional step
of updating model parameters. We'll look at that in more detail later.
Let's look at a different example from Verifiers: the double-check environment.
Here, the model answers a math question,
and the environment then asks, "Are you
sure?"
This is a multi-turn environment,
similar in spirit to Tic-Tac-Toe, in
which each trajectory involves multiple
interactions between the model and the
environment.
Let's look at what MultiTurnEnv introduces.
State makes its first appearance here.
It's a dictionary that tracks
information during a rollout. We can set
the initial state through the setup_state method (not used in this example).
env_response: instead of the interaction ending after one turn, the environment can reply with a list of messages. Here, it just says, "Are you sure?"
In more complex environments, the
response can be dynamically generated
based on the state.
The vf.stop decorator marks a method as a stopping condition. This method runs at every turn of the agent-environment interaction. Once it returns True, the rollout terminates.
Under the hood, verifiers runs a loop in
which the model and environment take
turns
exchanging messages, updating shared
state, until a stopping condition is met
and a full trajectory can be evaluated.
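A condensed sketch of the double-check environment, following the talk's description: setup_state and env_response mirror the hooks shown on the slides, while the stopping condition is expressed here as an is_completed-style method instead of the decorator, since exact names and signatures vary across Verifiers versions.

```python
import verifiers as vf

class DoubleCheckEnv(vf.MultiTurnEnv):
    async def setup_state(self, state, **kwargs):
        # Per-rollout state; here we only track whether we asked the follow-up.
        state["asked_confirmation"] = False
        return state

    async def env_response(self, messages, state, **kwargs):
        # The environment replies with a list of messages; here it is always the
        # same, but it could be generated dynamically from the state.
        state["asked_confirmation"] = True
        return [{"role": "user", "content": "Are you sure?"}], state

    async def is_completed(self, messages, state, **kwargs):
        # Stop once the model has answered the confirmation question.
        return state["asked_confirmation"] and messages[-1]["role"] == "assistant"
```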
Another interesting type of environment
is the tool environment. All environment
types in Verifiers are built on MultiTurnEnv, which implements the core single-agent rollout loop. ToolEnv adds tool calling to this foundation. As
you can see, tools are defined as Python
functions.
During rollouts, the model can call
tools, receive results, and continue
reasoning until it produces a response
without tool calls.
Each turn
consists of a model's response followed
by the environment's tool execution.
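As a minimal illustration of the idea, here is a toy weather tool wired into a ToolEnv; the dataset and rubric are assumed to exist, and max_turns is an assumed parameter name.

```python
import verifiers as vf

def get_weather(city: str) -> str:
    """Toy tool: return a canned weather report for a city (placeholder logic)."""
    return f"The weather in {city} is sunny, 22°C."

def load_environment(dataset, rubric, **kwargs):
    # Tools are plain Python functions; ToolEnv exposes them to the model and
    # executes the calls it makes, turn after turn, until the model answers
    # without calling any tool.
    return vf.ToolEnv(dataset=dataset, tools=[get_weather], rubric=rubric, max_turns=5)
```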
For a more realistic example, I
recommend checking out the Wiki search
environment.
Beyond the fundamental environments I just showed, Verifiers provides more abstractions for building environments.
The MCP environment automatically connects to Model Context Protocol servers to expose their tools.
StatefulToolEnv is for tools that need per-rollout persistent state, like a database connection or a session ID.
There is also a class implementing
recursive language models,
a novel idea you might have heard of.
It's an inference strategy where
language models can decompose and
recursively interact with input context
of unbounded length through REPL
environments.
Verifiers also plays well with others,
integrating with several third-party
environment libraries.
Verifiers is tightly integrated with the
environments hub, a community space for
sharing these RL environments. Verifiers
and the environments hub are different
faces of the same intent.
They aim to fight environment
fragmentation. Too often, environments
are locked into specific training stacks, making them difficult to reuse.
And as a market for closed-source
environments emerges, these open
initiatives ensure we have a robust
alternative. We don't want open-source models to lag behind just because they lack the right playgrounds to train in.
Plus, beyond the serious side, it's just
fun to explore the hub and see what
people are building.
Now, let's move on to Tic-Tac-Toe.
We use verifiers to create a Tic-Tac-Toe
environment for training and evaluating
language models on this game.
Now, why Tic-Tac-Toe?
It's a simple game, but requires
multi-turn interaction, and capturing
its dynamics with a static dataset is
challenging.
Despite its small state space and
deterministic solution, small language
models often struggle with it.
Let's see if reinforcement learning can help bridge that gap.
It's best to start with a simple
version, run evaluation to verify it
works, and then iterate.
To start, we make a few assumptions.
The model always plays as X
and goes first.
It must output a number between 0 and 8
inside move tags.
And the opponent just plays randomly.
In load_environment, we create a dataset containing the initial user message that starts each game.
For each rollout, setup_state populates
the state dictionary with information
used and updated during the game, such
as the board and the winner.
env_response contains the core game logic. It parses the model's last move, checks if it's valid (invalid moves are an immediate loss for now), and applies it to the board. Then, it applies a random opponent move and checks for a win or a draw. If the game isn't finished, it returns a user message with the current board state and asks for the next move.
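In code, this first version of env_response might look roughly like the sketch below; check_winner and render_board are hypothetical helpers, and the MultiTurnEnv method signature is an assumption based on the talk.

```python
import random
import verifiers as vf

class TicTacToeEnv(vf.MultiTurnEnv):
    async def env_response(self, messages, state, **kwargs):
        board = state["board"]                       # 9 cells: "X", "O", or ""
        move = self.parser.parse_answer(messages)    # content of the <move> tags

        # First version: an invalid or occupied move is an immediate loss.
        is_valid = (
            move is not None
            and move.strip().isdigit()
            and int(move) <= 8
            and board[int(move)] == ""
        )
        if not is_valid:
            state["winner"] = "O"
            return [{"role": "user", "content": "Invalid move. You lose."}], state

        board[int(move)] = "X"
        state["winner"] = check_winner(board)        # "X", "O", "draw", or None
        if state["winner"] is None:
            free = [i for i, cell in enumerate(board) if cell == ""]
            board[random.choice(free)] = "O"         # random opponent move
            state["winner"] = check_winner(board)

        prompt = f"Board:\n{render_board(board)}\nYour move (0-8) inside <move> tags."
        return [{"role": "user", "content": prompt}], state
```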
We use two reward functions: a winner reward function with weight 1.0, and a format reward function with weight 0.2, which rewards the model for respecting the XML format.
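A sketch of how those two reward functions and their weights could be bundled into a rubric; the keyword arguments the reward functions receive (state, completion, parser) are assumptions about what Verifiers passes during a rollout.

```python
import verifiers as vf

def winner_reward_func(state, **kwargs) -> float:
    # +1 for a win, 0 otherwise (in this first version, losses and draws score 0).
    return 1.0 if state.get("winner") == "X" else 0.0

def format_reward_func(completion, parser, **kwargs) -> float:
    # 1.0 when the model's message respects the <move> XML format.
    return 1.0 if parser.parse_answer(completion) is not None else 0.0

rubric = vf.Rubric(
    funcs=[winner_reward_func, format_reward_func],
    weights=[1.0, 0.2],
)
```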
We can now make our environment more
flexible, realistic, and suitable for
both evaluation and training. I made some of these improvements gradually while evaluating and training. First, we want the model to sometimes play first and sometimes play second.
Let's now address opponent's skill.
Always playing against a random opponent
isn't realistic. So, we introduce an
optimal opponent using the minimax
algorithm.
Against this opponent, a draw is the
best achievable outcome.
However, for training, we want the
opponent's skill to be controllable. If
the opponent is too perfect too early,
the model might never see a win and fail
to learn.
We can do so by introducing a
probability for the opponent to choose a
random move instead of the optimal one.
In load_environment, we introduce min_random_move_prob and max_random_move_prob, each varying from 0 to 1.
If we set both to 0, all games will be
against an optimal opponent. If we set
both to 1, all games will be against a
random opponent.
Using these parameters allows us to
control the opponent's skill across all
games.
For different rollouts originating from
the same data set example, the opponent
will always have the same probability of
choosing random moves, ensuring fair
comparison.
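One way to implement this per-example difficulty assignment is sketched below; this is a hypothetical helper, since the talk describes the behavior rather than the exact code.

```python
import random

def assign_opponent_difficulty(dataset, min_random_move_prob, max_random_move_prob, seed=42):
    """Attach a fixed random-move probability to every dataset example.

    All rollouts starting from the same example share this value, so reward
    differences inside a group come from the model's play, not from drawing
    a different opponent.
    """
    rng = random.Random(seed)
    return dataset.map(
        lambda example: {
            "random_move_prob": rng.uniform(min_random_move_prob, max_random_move_prob)
        }
    )
```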
Now, about reasoning.
It's common to ask models to produce a
thinking trace before the final answer.
It can improve performance at inference time, but it's also instrumental in making models better during training.
We define a new format reward function
using a regular expression to also check
the presence of think tags.
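One way to write such a check: the tag names follow the talk, while the exact regex and the simplified function signature are mine.

```python
import re

# Require a reasoning trace in <think> tags followed by a digit 0-8 in <move> tags.
THINK_MOVE_PATTERN = re.compile(
    r"<think>.+?</think>\s*<move>\s*[0-8]\s*</move>", re.DOTALL
)

def format_reward_func(text: str, **kwargs) -> float:
    return 1.0 if THINK_MOVE_PATTERN.search(text) else 0.0
```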
Let's now cover invalid moves. While
experimenting with small open models, I
observed that their output format was sometimes incorrect; other times, the chosen cell was already occupied.
Ending the game immediately is harsh. It
might stop smaller models from getting a
useful learning signal.
Instead, we now let the game continue
and apply a flat minus 0.1 penalty,
capping the turns at eight.
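A sketch of the corresponding penalty reward, reading the flat -0.1 as applied per invalid move recorded in the rollout state; the state key is hypothetical.

```python
def invalid_move_penalty_func(state, **kwargs) -> float:
    # -0.1 for each invalid move recorded during the game; the environment also
    # caps the game at eight model turns so a confused model can't loop forever.
    return -0.1 * state.get("invalid_move_count", 0)
```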
Let's discuss reducing noise in
group-based reinforcement learning.
In GRPO
learning, we compare several rollouts
from the same starting point to see
which ones to reinforce based on
rewards. For this to work, differences
in rewards
should come from how the model plays,
not from environment randomness. And how
can we reduce noise
in this setup?
First, we set an example seed for each
example in the data set to select the
starting player. Then, for each turn, we
derive a specific turn seed based on the
example seed and board state. This
guarantees that if two rollouts reach
the same board position, the opponent
will always respond the same way.
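An illustrative way to derive such a turn seed; the talk describes the idea, not this exact hashing scheme.

```python
import hashlib
import random

def opponent_rng(example_seed: int, board: list[str]) -> random.Random:
    """Derive a deterministic RNG for the opponent's move from the example seed
    and the current board, so identical positions always get the same reply."""
    key = f"{example_seed}:{''.join(cell or '.' for cell in board)}"
    digest = hashlib.sha256(key.encode()).hexdigest()
    return random.Random(int(digest, 16))
```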
Last point, reducing noise across
batches.
For our training, the batch size is the number of games taken into consideration before the model's weights are updated. In our setup, the opponent's skill varies across the dataset according to min_random_move_prob and max_random_move_prob.
If we train with a small batch size and
the random move probability is not
fixed, we might sample a batch in which
many opponents are hard or many are
easy. This causes the average reward to
fluctuate a lot, making training
unstable. To fight this, I added
stratified sampling.
This forces every batch to contain a
perfectly balanced mix of opponent
difficulty spanning the chosen range.
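Here is a sketch of the stratified sampling idea, spreading difficulties evenly across a batch and then shuffling them; this is not the repository's implementation.

```python
import numpy as np

def stratified_difficulties(batch_size, min_prob, max_prob, rng):
    """Spread opponent random-move probabilities evenly across a batch, then
    shuffle, so every batch contains a balanced mix of easy and hard opponents."""
    levels = np.linspace(min_prob, max_prob, num=batch_size)
    rng.shuffle(levels)
    return levels

# Example: a batch of 256 games spanning 20% to 70% random-move probability.
probs = stratified_difficulties(256, 0.2, 0.7, np.random.default_rng(0))
```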
I know this slide is dense, but you can
find all code and more details in the
GitHub repository.
Time to evaluate existing models. We
choose GPT-5 mini and LFM-2 by Liquid
AI, a small fast open model.
Using Verifiers, evaluating models just
requires a few commands.
Allowing for some statistical
variability, GPT-5 mini is excellent at
following format and is a good
Tic-Tac-Toe player, but not perfect.
The small open model by Liquid AI
struggles to follow format and to make
valid moves.
It's a weak Tic-Tac-Toe player,
sometimes winning against a random
opponent, but rarely surviving against
an optimal one.
There is a significant gap.
We decide to train LFM-2 for a few reasons: it's a good model for its size, and it's an instruct model, ideal for transforming into a reasoning model.
How can we improve it?
We saw that this model struggles to
follow format and often provides invalid
moves. We can use supervised fine-tuning
for a warm-up phase, where we teach the
model the format and valid moves syntax.
We can then use reinforcement learning
to build deeper capabilities.
The first step is generating synthetic
data for supervised fine-tuning.
Once you have a good environment,
generating data requires a single
command.
Here, we use GPT-5 mini since it
followed format perfectly. And we don't
need many examples. We generate 200 and
filter out losing games to avoid baking
in suboptimal strategies.
With this synthetic data at hand, we can
easily spin up a supervised fine-tuning
run using Prime RL.
In this example, I am using a 96 GB GPU, but you can use a smaller one.
Training requires only a few minutes.
Time to evaluate our fine-tuned model.
Compared to the original model, it
learned format almost perfectly and
reduced the number of invalid moves.
It also improved the game performance,
but there is still significant work to
do.
Before jumping into RL training, let's
do a quick recap of group relative
policy optimization applied to
Tic-Tac-Toe.
Starting from the same initial board, the model plays several games via LLM sampling (the rollouts).
Each rollout is evaluated using deterministic reward functions: in our case, win, format, and invalid-move rewards.
An average score is calculated across
the group of rollouts,
and each rollout is then compared against this average (the advantage computation).
The model is updated to favor
trajectories that did better than the
group baseline.
We'll use CISPO, which is an improvement
over GRPO.
For reinforcement learning training, I used the simple RL trainer that ships with Verifiers. Here, we use one GPU for inference and one GPU for training.
Let's comment on some parameters in the training configuration.
In this training run, the random move probability ranges from 20% to 70%: no purely random players and no optimal players. It's a good playground to get signal and learn both attack and defense.
The num_groups parameter is used to set
up stratified sampling.
When it comes to trainer arguments, we
want our model to learn stably while fully utilizing our GPUs
without crashing.
For tips on how to use the GPU without
going out of memory, I recommend
checking out the GitHub repo.
Here, I want to stress that reinforcement learning training is sensitive to hyperparameters and can be unstable.
I learned the hard way that batch size
is a key parameter. In this environment, I observed unstable training and model collapse when experimenting with values lower than 256.
The explanation is intuitive. Batch size
is the number of games used to update
the model's weights. If this number is low, the model learns to play from a very small number of matches and opponent types at once, which likely leads to suboptimal strategies.
Let's take a look at training plots.
Winning reward function and the total
reward constantly improved.
Format reward function was already near
perfect and did not change
significantly.
Invalid move penalty function started
well and converged to zero towards the
end of training. It suggests a good
training run, but let's run proper
evaluation.
Impressive.
Thanks to reinforcement learning, our
model has become a very competent
tic-tac-toe player. It dominates random
players and draws 85% of the time
against an optimal opponent. Invalid
moves have dropped to near zero.
These results are already satisfying, and one could say there is not much more we can learn from this example.
I am a perfectionist, and I'd like a perfect AI player.
Inspecting the rollouts, I found some
recurrent failure modes. In particular,
our model sometimes falls into fork traps. This example shows the end of a game. Our model is not playing ideally here, but it had already lost by allowing the opponent to create two winning paths.
Is it possible to use our RL environment
to push our model further toward
perfection?
Let's try.
We have to make some changes.
In this run, I used bigger GPUs just to experiment quickly; it's probably not required.
First, let's discuss opponent skill.
We increase opponent skill by setting
the probability of a random move to
range from 0% to 25%.
I also tried making the model play against perfect opponents only, but it didn't work. The model became overly defensive and failed to exploit errors when tested against random players.
But how do we make our model explore beyond its learned strategies? I ran several experiments where the model failed to improve and to forget its suboptimal strategies. We want the model to experiment with new approaches.
And temperature is the right parameter
to tweak.
But this is a bit risky. If the
temperature is too high, the model can
start generating gibberish.
Let's train.
Things get really interesting. There was
a significant initial drop in winning
reward function and total reward.
I interpret this as an exploratory phase
where the model tried new and random
strategies, which underperformed at
first, but over time it recovered and
improved to new highs.
Also, format reward function and invalid
move penalty function had an initial
drop, but overall always stayed around
their maximum values. Let's move to
proper evaluation.
Oh,
we finally got the tic-tac-toe master.
But
why not play a game against it? Okay, let's deliberately make an imperfect move.
Mhm, we now need to block
the model.
Let's block it again.
Oh, no, we lost.
It could be interesting now to compare
our model performance with GPT-5 Mini,
the teacher model we used to generate
our synthetic data.
Mhm, against a random opponent,
performance is very similar.
Let's see against an optimal opponent.
Oh, in this case, our model is superior.
This is a very nice achievement.
To get these results, I went through
several failed experiments, and I'd like
to share the findings with you.
First of all, batch size. If this value
is large, yes, your model apparently
learns slowly, but in exchange, you get
stable training.
If the batch size is small and your
environment produces diverse matches and opponent skills, the model will learn
from a small number of games at once.
This can reinforce suboptimal
strategies, and you may observe unstable
training or model collapse.
Second lesson. Watch for hidden biases
in environments. Let me explain.
In a previous experiment, I used a
different minimax algorithm for the
optimal opponent.
I thought this was an implementation detail and let the code handle it.
I got great benchmark results, but then
playing against the model, I realized
that it was clueless.
Looking more closely at the minimax implementation, it had a bias: if multiple moves had the same optimal score, the first free position was always selected. I was basically training my model against one specific type of optimal player, and over many games the model simply memorized it.
Now, what about model choice? You can
start from a model which is already
trained for reasoning, but these tend to output long thinking traces. If you have limited GPU resources and time, you may end up truncating most of the longer completions at the beginning of training to fit the length limits. This means
wasted budget and also the risk of
damaging the model's intelligence.
It might make more sense to start from an instruct model and transform it into a reasoning model for your task.
Another point about models, it could be
hard to push very small models to
competency.
This, of course, depends on the task.
What I recommend: evaluate the base model in your environment, look at a few completions, and choose a model that shows promising behaviors, even if the numbers are not satisfying yet.
In general, it is always a good idea to
inspect some rollouts to see the model
evolve. Also, after training, do not
stop at programmatic evaluation. Try the
model in the real task.
The final recommendation: it's natural to watch your logs and plots when you start training, to catch out-of-memory errors or instability early. It's difficult for me, too, but once training begins well, I suggest you stop staring at the plots for a while. Reinforcement learning is slow, and it takes time to see progress. If you continually monitor it, you risk the temptation to stop and tweak something prematurely, while a slowly progressing run can turn out surprisingly well given enough time. So, start training and go for a walk.
During this presentation, we mapped classic reinforcement learning concepts to the language models domain. Then, I introduced Verifiers, an open-source library to build environments as software artifacts. Finally, I walked you through my experiments, where I took a small model and turned it into a tic-tac-toe master using supervised fine-tuning and reinforcement learning with verifiable rewards.
We did not just show the model how to
play. We gave it a space to play and
guided it through rewards.
Nowadays, reinforcement learning
complements supervised fine-tuning in
language models post-training.
You can do this at home, too.
If you can define a clear reward signal,
you can build an environment and train a
small, specialized model to beat a large
closed model on a specific task at a
fraction of the cost.
I want to leave you with a few ideas and
resources on this topic. Everything I shared today can be found in my free LLM RL environments course, where I go deeper into the details. Take a look and give it a star.
To figure out what others are building,
explore the environments hub.
And what to build next?
Something I'm very excited about is this: train a small language model on two or three tools you often use, and try to outperform a large model on that specific task.
I recommend the Wiki search example in
the Prime RL repository as a starting
point.
Thank you.