Let LLMs Wander: Engineering RL Environments — Stefano Fiorucci
Channel: aiDotEngineer
Published at: 2026-04-08
YouTube video id: 71V3fTaUp2Q
Source: https://www.youtube.com/watch?v=71V3fTaUp2Q
Hello everybody and welcome to Let LLMs Wander: Engineering Reinforcement Learning Environments. A few words about me. I am Stefano Fiorucci, an AI and software engineer. By day, I work on AI orchestration at Deepset, where I develop Haystack, an open-source LLM framework. By night, I love tinkering with small language models, fine-tuning, and reinforcement learning.

Today, I'm going to talk about reinforcement learning environments for language model evaluation and training. This has been a hot topic over the past year, and I find it fascinating for several reasons. These environments let models learn by interacting, exploring, and improving from feedback. They are natural gyms for LLM agents that can use tools, run code, and solve multi-step tasks. In addition, startups building RL environments are getting major funding and working directly with big AI labs. Recent technical reports by DeepSeek and MiniMax showed that they are effectively using thousands of reinforcement learning environments to improve model performance on challenging tasks and scale intelligence. Don't worry if you know nothing about RL environments; I'll cover that soon.

Here is the agenda for the talk. We'll first review classic reinforcement learning concepts and see how they map to the language model domain. We'll then introduce Verifiers, an open-source library to build environments as software artifacts, and explore some common patterns to implement them. Finally, I'll walk you through an experiment where we take a small model that can barely play tic-tac-toe against a random player and transform it into a master using a reinforcement learning environment. Let's start.

First, a quick refresher on reinforcement learning. In reinforcement learning, there are two main characters: the agent and the environment. The environment is the world the agent interacts with. At each step, the agent sees the current state of the world and takes an action. The state of the environment then changes in response to that action. The agent also receives a reward from the environment, a number indicating how good or bad the state is. The agent's goal is to maximize its cumulative reward over time, and to do this, it has to balance exploration, trying new actions to discover better strategies, and exploitation, using actions known to work. By interacting with the environment, the agent learns from experience and improves its behavior. A trajectory, or rollout, is the sequence of states, actions, and rewards that the agent goes through while interacting with the environment. It's a record of the experience. In this presentation, I'll use trajectory to mean a complete episode, like one entire game.

Let's take a look at LLM training. A language model is a statistical model that, given some text, the prompt, returns a text completion. The standard training recipe is divided into three phases. First, pre-training on a massive amount of internet text. Here, the model learns to create text completions. The base model is knowledgeable, but can't follow instructions and is hardly usable in applications. During supervised fine-tuning on conversational examples, the model is trained to follow instructions and can learn new tasks. In the third step, reinforcement learning is often used, with techniques like proximal policy optimization, to align the model with human preferences.

It's worth showing an example of supervised fine-tuning data, as later we'll frequently compare supervised fine-tuning with the new reinforcement learning approaches. As you can see, we have pairs of prompts and responses, and during this phase, the model learns by statistical imitation. It's essentially trying to mimic the examples provided.
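As a rough stand-in for the slide (a made-up illustration, not the actual data shown in the talk), supervised fine-tuning examples are simply prompt-response pairs in chat format:

```python
# Made-up illustration of supervised fine-tuning data: prompt-response pairs
# in chat format. The model is trained to imitate the assistant responses.
sft_examples = [
    {
        "messages": [
            {"role": "user", "content": "Translate 'buongiorno' to English."},
            {"role": "assistant", "content": "'Buongiorno' means 'good morning'."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "What is 17 + 25?"},
            {"role": "assistant", "content": "17 + 25 = 42."},
        ]
    },
]
```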
You might remember Ilya Sutskever's talk at NeurIPS 2024. He pointed out that the LLM training paradigm we just saw has started showing its limits. In particular, pre-training no longer seems to be enough to keep improving model quality at the same rate. We needed a new way to scale.

Then, OpenAI published its o1 model series. In their release blog post, they mentioned reinforcement learning training to make models use chain of thought effectively. They also underlined that the performance of o1 consistently improved with more reinforcement learning, train-time compute, and with more time spent thinking, test-time compute. Unfortunately, they did not share many details on how this model was actually trained.

The release of DeepSeek R1 shed some light on how you can possibly achieve those results. First, they recognized that reasoning and chain of thought can improve the performance of models, but teaching these behaviors using supervised fine-tuning requires curated data that is too expensive to produce at scale. They used reinforcement learning with verifiable rewards, which we'll see in a moment, and DeepSeek also used GRPO, a new reinforcement learning algorithm that offers a simpler, lighter setup compared to techniques like PPO.

So, what is reinforcement learning with verifiable rewards? In this paradigm, the model is asked a question and generates both a reasoning trace and an answer. The answer is then checked against the known correct answer, and the resulting reward is used for reinforcement learning training. The underlying idea is more general: any task where the outcome can be verified automatically, like a correct answer, a won game, or a successful tool call, can serve as a training signal. And this is fundamentally different from supervised fine-tuning. In SFT, the model learns from curated examples, and its completions tend to stay close to the distribution of those examples. In reinforcement learning with verifiable rewards, the model starts from its pre-trained behavior, explores different trajectories, and learns to favor the ones that maximize rewards. And this is exciting because the model is no longer limited by the quality of human examples. Through trial and error, it can discover more efficient reasoning strategies.
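To make the idea concrete, here is a minimal sketch of a verifiable reward (my own illustration, not code from the talk): parse the final answer out of the completion, compare it with the ground truth, and return a score that RL can reinforce. The `<answer>` tag convention is just an assumption for this example.

```python
import re

# Minimal illustration of a verifiable reward: the completion is parsed for a
# final answer and compared with the known ground truth.
def correctness_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output earns no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

# correctness_reward("<think>...</think><answer>42</answer>", "42") -> 1.0
```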
We can finally map the classic reinforcement learning concepts to LLMs. The language model acts as the agent. The environment for any task includes data, harnesses, and scoring rules, everything needed to check and possibly train the model on the task. From a software perspective, this marks a shift from supervised fine-tuning to reinforcement learning with verifiable rewards. While SFT mainly relies on conversational datasets, this new paradigm usually requires an environment, a dynamic system that the model can interact with. The definition of the agent is also expanding. Language models can now be given tools, from a weather API to a terminal, and this makes environments for training and evaluation more complex and critical.

To make this more concrete, consider teaching a model to play tic-tac-toe. The agent is the language model. Its action is generating a text response with a specific move. The environment acts as the game engine. It handles prompting the model, tracking the board status, generating the opponent's move, and deciding when the game is over. The reward is the signal from the environment, for example, +1 for a win and zero for a loss. This reward guides the model to find winning strategies through trial and error. This setup allows the agent to discover strategies that maximize its score without needing pre-existing human examples. You have probably understood that I am enthusiastic about this topic, but let's also use Andrej Karpathy's words to describe these environments: they give the LLM an opportunity to actually interact, take actions, and see outcomes. This means you can hope to do a lot better than statistical expert imitation.

Now, let's see how to build these environments. To build environments as software artifacts, we can use Verifiers, an open-source library by Prime Intellect. Verifiers provides modular components to create reinforcement learning environments for LLM agents. These can be used for both evaluation and training. Environments are Python packages that can be easily installed and distributed. The library provides base classes for several setups: single-turn environments, with just one interaction between the model and the environment; multi-turn environments; tool environments, where the model is equipped with tools; and several others. It also includes abstractions for parsing model responses and defining reward functions. Verifiers abstracts model serving. It expects an OpenAI-compatible API endpoint, so you can plug in OpenAI, OpenRouter, or local models via vLLM. It handles the model interaction and parallel trajectories, so you can focus on the environment logic. For training, Verifiers comes with its own trainer and integrates with other frameworks such as Prime RL, Tinker, and SkyRL. In short, Verifiers lets us focus on the task and the rewards rather than the infrastructure.

Let's start with a single-turn environment. Reverse text is a simple environment to evaluate or train language models on their ability to reverse a string of text. What's going on here? Load environment is the entry point for every Verifiers environment. It contains all the setup logic. First, a dataset is loaded and mapped. The default dataset contains 1,000 text paragraphs stored in the prompt column. During the mapping step, we transform the original dataset: question is the text paragraph, while answer is the reversed text. Next, an XML parser is initialized. It extracts the text inside the reversed text tags, as specified in the system prompt. We then define the reward function. This compares the model's output to the ground truth and returns a longest common subsequence ratio. Finally, we bundle this into a rubric, a collection of weighted reward functions, and initialize the single-turn environment.
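Pulling those pieces together, here is roughly what such a load environment function can look like. This is a sketch based on the description above: the exact Verifiers signatures (SingleTurnEnv, XMLParser, Rubric, reward-function arguments) may differ between library versions, the dataset id is hypothetical, and difflib's ratio is used as a stand-in for the longest common subsequence ratio.

```python
# Rough sketch of the reverse-text setup described above; exact Verifiers
# signatures may differ by version, and the dataset id is hypothetical.
import difflib
import verifiers as vf
from datasets import load_dataset

SYSTEM_PROMPT = "Reverse the given text. Put the result inside <reversed_text> tags."

def load_environment(**kwargs):
    raw = load_dataset("my-org/paragraphs", split="train")  # hypothetical dataset id
    dataset = raw.map(lambda x: {"question": x["prompt"], "answer": x["prompt"][::-1]})
    parser = vf.XMLParser(["reversed_text"], answer_field="reversed_text")

    def lcs_reward(completion, answer, **kwargs) -> float:
        # Compare the parsed output to the ground truth; difflib's ratio is a
        # stand-in for the longest-common-subsequence ratio used in the talk.
        predicted = parser.parse_answer(completion) or ""
        return difflib.SequenceMatcher(None, predicted, answer).ratio()

    rubric = vf.Rubric(funcs=[lcs_reward], weights=[1.0])
    return vf.SingleTurnEnv(
        dataset=dataset, system_prompt=SYSTEM_PROMPT, parser=parser, rubric=rubric
    )
```

Once the environment is packaged, the evaluation run described next boils down to a single CLI call, something along the lines of `vf-eval reverse-text -m <model-name> -n 5 -r 3`; check the environment id and flags against your Verifiers version.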
But how does this come to life? Let me show you an evaluation run. Here is what happens under the hood. Load environment is invoked. Five examples are taken from the dataset, and each example is used three times, for three rollouts. Each rollout gets the same question, but may produce different completions due to model randomness. We have 15 rollouts in total. For each rollout, a conversation is prepared with the system prompt and the question. The conversation is sent to the model. The model generates a response. The response is parsed, extracting the answer. The reward is computed, and the results are saved. At the end of the evaluation, you get summary statistics and additional info about the reward distribution. Training follows the same core mechanism, with the additional step of updating the model parameters. We'll look at that in more detail later.

Let's look at a different example from Verifiers, the double-check environment. Here, the model answers a math question, and the environment then asks, "Are you sure?" This is a multi-turn environment, similar in spirit to tic-tac-toe, in which each trajectory involves multiple interactions between the model and the environment. Let's look at what the multi-turn environment introduces. State makes its first appearance here. It's a dictionary that tracks information during a rollout. We can set the initial state through the setup state method, not used in this example. Env response: instead of the interaction ending after one turn, the environment can reply with a list of messages. Here, it just says, "Are you sure?" In more complex environments, the response can be dynamically generated based on the state. The stop decorator marks a method as a stopping condition. This method runs at every turn of the agent-environment interaction. Once it returns true, the rollout terminates. Under the hood, Verifiers runs a loop in which the model and environment take turns exchanging messages, updating shared state, until a stopping condition is met and a full trajectory can be evaluated.

Another interesting type of environment is the tool environment. All environment types in Verifiers are built on the multi-turn environment, which implements the core single-agent rollout loop. The tool environment adds tool calling to this foundation. As you can see, tools are defined as Python functions. During rollouts, the model can call tools, receive results, and continue reasoning until it produces a response without tool calls. Each turn consists of a model response followed by the environment's tool execution. For a more realistic example, I recommend checking out the Wiki search environment.

Beyond the fundamental environments I just showed, Verifiers provides more abstractions to build environments. The MCP environment automatically connects to Model Context Protocol servers to expose their tools. The stateful tool environment is for tools that need per-rollout persistent state, like a database connection or a session ID. There is also a class implementing recursive language models, a novel idea you might have heard of: an inference strategy where language models can decompose and recursively interact with input contexts of unbounded length through REPL environments. Verifiers also plays well with others, integrating with several third-party environment libraries.

Verifiers is tightly integrated with the Environments Hub, a community space for sharing these RL environments. Verifiers and the Environments Hub are different faces of the same intent: they aim to fight environment fragmentation. Too often, environments are locked into a specific training stack, making them difficult to reuse. And as a market for closed-source environments emerges, these open initiatives ensure we have a robust alternative. We don't want open-source models to lag behind just because they lack the right playgrounds to train in. Plus, beyond the serious side, it's just fun to explore the hub and see what people are building.
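Before moving on to tic-tac-toe, here is a rough, illustrative paraphrase of the multi-turn rollout loop that all of these environments share. It is not the actual Verifiers implementation; call_model, env_response, and should_stop are placeholders for the pieces described above.

```python
# Illustrative paraphrase of the multi-turn rollout loop, not the actual
# Verifiers implementation. call_model, env_response, and should_stop are
# placeholders for the pieces described above.
def run_rollout(prompt_messages, call_model, env_response, should_stop, max_turns=10):
    messages = list(prompt_messages)  # conversation so far
    state = {}                        # shared per-rollout state (board, winner, ...)
    for _ in range(max_turns):
        assistant_msg = call_model(messages)             # the model takes its turn
        messages.append(assistant_msg)
        if should_stop(messages, state):                 # stopping condition, e.g. game over
            break
        env_msgs, state = env_response(messages, state)  # environment replies, updates state
        messages.extend(env_msgs)
    return messages, state  # the full trajectory, ready to be scored by the rubric
```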
Now, let's move on to tic-tac-toe. We use Verifiers to create a tic-tac-toe environment for training and evaluating language models on this game. Now, why tic-tac-toe? It's a simple game, but it requires multi-turn interaction, and capturing its dynamics with a static dataset is challenging. Despite its small state space and deterministic solution, small language models often struggle with it. Let's see if reinforcement learning can help bridge that gap.

It's best to start with a simple version, run an evaluation to verify it works, and then iterate. To start, we make a few assumptions. The model always plays as X and goes first. It must output a number between 0 and 8 inside move tags. And the opponent just plays randomly. In load environment, we create a dataset containing the initial user message that starts each game. For each rollout, setup state populates the state dictionary with information used and updated during the game, such as the board and the winner. Env response contains the core game logic. It parses the model's last move, checks if it's valid (invalid moves are an immediate loss for now), and applies it to the board. Then it applies a random opponent move and checks for a win or draw. If the game isn't finished, it returns a user message with the current board state and asks for the next move. We use two reward functions: a winner reward function with weight 1, and a format reward function with weight 0.2, which rewards the model for respecting the XML format.

We can now make our environment more flexible, realistic, and suitable for both evaluation and training. I made some of these improvements gradually while evaluating and training. First, we want the model to sometimes play first and sometimes play second. Let's now address the opponent's skill. Always playing against a random opponent isn't realistic, so we introduce an optimal opponent using the minimax algorithm. Against this opponent, a draw is the best achievable outcome. However, for training, we want the opponent's skill to be controllable. If the opponent is too perfect too early, the model might never see a win and fail to learn. We can do so by introducing a probability for the opponent to choose a random move instead of the optimal one. In load environment, we introduce min random move prob and max random move prob, varying from 0 to 1. If we set both to 0, all games will be against an optimal opponent. If we set both to 1, all games will be against a random opponent. Using these parameters allows us to control the opponent's skill across all games. For different rollouts originating from the same dataset example, the opponent will always have the same probability of choosing random moves, ensuring a fair comparison.

Now, about reasoning. It's common to ask models to produce a thinking trace before the final answer. It can improve performance at inference time, but it's also instrumental in making models better during training. We define a new format reward function using a regular expression to also check the presence of think tags.

Let's now cover invalid moves. While experimenting with small open models, I observed that sometimes their output format was incorrect, and other times the chosen cell was already occupied. Ending the game immediately is harsh: it might stop smaller models from getting a useful learning signal. Instead, we now let the game continue and apply a flat minus 0.1 penalty, capping the turns at eight.

Let's discuss reducing noise in group-based reinforcement learning. In GRPO learning, we compare several rollouts from the same starting point to see which ones to reinforce based on rewards. For this to work, differences in rewards should come from how the model plays, not from environment randomness. And how can we reduce noise in this setup? First, we set an example seed for each example in the dataset to select the starting player. Then, for each turn, we derive a specific turn seed based on the example seed and the board state. This guarantees that if two rollouts reach the same board position, the opponent will always respond the same way.
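Before the last noise-reduction point, here is a small sketch of how the opponent-skill and seeding pieces can fit together. This is my own illustration, with a hypothetical minimax_move helper and names that differ from the actual repository.

```python
import hashlib
import random

# Illustrative sketches of the opponent pieces just described; a hypothetical
# minimax_move helper is assumed, so this is not the exact code from the talk.

def turn_rng(example_seed: int, board: list[str]) -> random.Random:
    # Derive the opponent's randomness from the example seed plus the board,
    # so two rollouts reaching the same position get the same opponent reply.
    key = f"{example_seed}:{''.join(board)}".encode()
    return random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))

def opponent_move(board: list[str], random_move_prob: float,
                  rng: random.Random, minimax_move) -> int:
    # Controllable skill: with probability random_move_prob play a random legal
    # move, otherwise play the minimax-optimal one.
    free_cells = [i for i, cell in enumerate(board) if cell == " "]
    if rng.random() < random_move_prob:
        return rng.choice(free_cells)
    return minimax_move(board)
```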
The last point is reducing noise across batches. For our training, the batch size is the number of games taken into consideration before the model's weights are updated. In our setup, the opponent's skill varies across the dataset according to min random move prob and max random move prob. If we train with a small batch size and the random move probability is not fixed, we might sample a batch in which many opponents are hard or many are easy. This causes the average reward to fluctuate a lot, making training unstable. To fight this, I added stratified sampling. This forces every batch to contain a perfectly balanced mix of opponent difficulties spanning the chosen range. I know this slide is dense, but you can find all the code and more details in the GitHub repository.

Time to evaluate existing models. We choose GPT-5 mini and LFM-2 by Liquid AI, a small, fast open model. Using Verifiers, evaluating models just requires a few commands. Allowing for some statistical variability, GPT-5 mini is excellent at following the format and is a good tic-tac-toe player, but not a perfect one. The small open model by Liquid AI struggles to follow the format and to make valid moves. It's a weak tic-tac-toe player, sometimes winning against a random opponent, but rarely surviving against an optimal one. There is a significant gap. We decide to train LFM-2 for a few reasons: it's a good model for its size, and it's an instruct model, ideal for transforming into a reasoning model. How can we improve it? We saw that this model struggles to follow the format and often provides invalid moves. We can use supervised fine-tuning as a warm-up phase, where we teach the model the format and the valid-move syntax. We can then use reinforcement learning to build deeper capabilities.

The first step is generating synthetic data for supervised fine-tuning. Once you have a good environment, generating data requires a single command. Here, we use GPT-5 mini since it followed the format perfectly. And we don't need many examples. We generate 200 and filter out losing games to avoid baking in suboptimal strategies. With this synthetic data at hand, we can easily spin up a supervised fine-tuning run using Prime RL. In this example, I am using a 96 GB GPU, but you can use a smaller one. Training requires only a few minutes. Time to evaluate our fine-tuned model. Compared to the original model, it learned the format almost perfectly and reduced the number of invalid moves. It also improved its game performance, but there is still significant work to do.

Before jumping into RL training, let's do a quick recap of group relative policy optimization applied to tic-tac-toe. Rollout generation: starting from the same initial board, the model plays several games via LLM sampling. Reward evaluation: each rollout is evaluated using deterministic reward functions, in our case the win, format, and invalid-move rewards. Advantage computation: an average score is calculated across the group of rollouts, and each rollout is then compared against this average. The model is updated to favor trajectories that did better than the group baseline. We'll use CISPO, which is an improvement over GRPO.
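As a small numeric illustration of the group-baseline idea (my own sketch, not the exact GRPO/CISPO math used by the trainer), each rollout's advantage is simply its reward minus the group mean:

```python
# Illustration of the group-baseline idea described above, not the exact
# GRPO/CISPO implementation used by the trainer.
def group_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)   # average score of the group
    return [r - baseline for r in rewards]   # positive -> reinforce, negative -> discourage

# Example: four rollouts from the same starting board.
# group_advantages([1.2, 0.2, 1.2, -0.1]) -> [0.575, -0.425, 0.575, -0.725]
```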
For reinforcement learning training, I used the Verifiers RL trainer. Here, we use one GPU for inference and one GPU for training. Let's comment on some parameters in the training configuration. In this training run, the random move probability ranges from 20% to 70%: no purely random players and no optimal players. It's a good playground to get signal and learn both attack and defense. The num_groups parameter is used to set up stratified sampling. When it comes to trainer arguments, we want our model to learn stably while fully utilizing our GPUs without crashing. For tips on how to use the GPU without going out of memory, I recommend checking out the GitHub repo. Here, I want to stress that reinforcement learning training is sensitive to hyperparameters and can be unstable. I learned the hard way that batch size is a key parameter. In this environment, I observed unstable training and model collapse when experimenting with values lower than 256. The explanation is intuitive. Batch size is the number of games used to update the model's weights. If this number is low, the model is learning to play from a very small number of matches and opponent types at once, and this likely leads to suboptimal strategies.

Let's take a look at the training plots. The winning reward function and the total reward constantly improved. The format reward function was already near perfect and did not change significantly. The invalid-move penalty function started well and converged to zero towards the end of training. This suggests a good training run, but let's run a proper evaluation. Impressive. Thanks to reinforcement learning, our model has become a very competent tic-tac-toe player. It dominates random players and draws 85% of the time against an optimal opponent. Invalid moves have dropped to near zero.

These results are already satisfying, and one could say there is not much more we can learn from this example. But I am a perfectionist, and I'd like a perfect AI player. Inspecting the rollouts, I found some recurrent failure modes. In particular, our model sometimes falls into fork traps. This example shows the end of a game. Our model is not playing ideally, but it had already lost the game by allowing the opponent to create two winning paths. Is it possible to use our RL environment to push our model further toward perfection? Let's try.

We have to make some changes. In this run, I used bigger GPUs just to experiment quickly; it's probably not required. First, let's discuss opponent skill. We increase the opponent's skill by setting the probability of a random move to range from 0% to 25%. I also tried making the model play against perfect opponents only, but it didn't work: the model became overly defensive and failed to exploit errors when tested against random players. But how do we make our model explore beyond its learned strategies? I made several experiments where the model failed to improve and to forget its suboptimal strategies. We want the model to experiment with new approaches, and temperature is the right parameter to tweak. But this is a bit risky: if the temperature is too high, the model can start generating gibberish.
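To make that knob concrete, exploration here simply means raising the sampling temperature used when generating rollouts. The values below are purely illustrative, not the ones used in this training run, and the exact config keys depend on the trainer.

```python
# Purely illustrative sampling settings for rollout generation; the values and
# config keys are not from the talk and depend on the trainer you use.
sampling_args = {
    "temperature": 1.1,   # higher temperature -> more exploration of new strategies
    "top_p": 0.95,
    "max_tokens": 1024,   # room for the <think> trace plus the <move> tag
}
```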
Let's train. Things get really interesting. There was a significant initial drop in the winning reward function and the total reward. I interpret this as an exploratory phase where the model tried new and random strategies, which underperformed at first, but over time it recovered and improved to new highs. The format reward function and the invalid-move penalty function also had an initial drop, but overall they always stayed around their maximum values.

Let's move to a proper evaluation. Oh, we finally got the tic-tac-toe master. But why not play a game against it? Okay, let's make a not-perfect move. Mhm, we now need to block the model. Let's block it again. Oh, no, we lost. It could be interesting now to compare our model's performance with GPT-5 mini, the teacher model we used to generate our synthetic data. Mhm, against a random opponent, performance is very similar. Let's see against an optimal opponent. Oh, in this case, our model is superior. This is a very nice achievement.

To get these results, I went through several failed experiments, and I'd like to share the findings with you. First of all, batch size. If this value is large, yes, your model apparently learns slowly, but in exchange you get stable training. If the batch size is small and your environment produces diverse matches and opponent skills, the model will learn from a small number of games at once. This can reinforce suboptimal strategies, and you may observe unstable training or model collapse.

Second lesson: watch for hidden biases in environments. Let me explain. In a previous experiment, I used a different minimax algorithm for the optimal opponent. I thought this was an implementation detail and let the code handle it. I got great benchmark results, but then, playing against the model, I realized that it was clueless. Looking more closely at minimax, it had a bias: if multiple moves had the same optimal score, the first free position was always selected. I was basically training my model against one specific type of optimal player. Over many games, the model simply memorized it.
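A simple way to remove that particular bias, sketched here as my own illustration (minimax_score is a hypothetical helper returning the game-theoretic value of a move), is to break ties randomly among equally scored moves:

```python
import random

# Sketch of removing the tie-breaking bias described above: instead of always
# taking the first best-scoring cell, pick randomly among all best-scoring ones.
# minimax_score is a hypothetical helper returning the value of playing `move`.
def optimal_move(board: list[str], player: str, rng: random.Random, minimax_score) -> int:
    free_cells = [i for i, cell in enumerate(board) if cell == " "]
    scores = {move: minimax_score(board, player, move) for move in free_cells}
    best = max(scores.values())
    best_moves = [move for move, score in scores.items() if score == best]
    return rng.choice(best_moves)  # random tie-breaking avoids a memorizable pattern
```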
Now, what about model choice? You can start from a model that is already trained for reasoning, but such models tend to output long thinking traces. If you have limited GPU resources and time, you may end up truncating most of the longer completions at the beginning of training to fit short length limits. This means wasted budget and also the risk of damaging the model's intelligence. It might make more sense to start from an instruct model and transform it into a reasoning model for your task. Another point about models: it can be hard to push very small models to competency. This, of course, depends on the task. What I recommend: evaluate the base model in your environment, look at a few completions, and choose a model that shows promising behaviors, even if the numbers are not satisfying yet. In general, it is always a good idea to inspect some rollouts to see the model evolve. Also, after training, do not stop at programmatic evaluation: try the model on the real task.

The final recommendation: it is natural to watch your logs and plots when you start training, to catch out-of-memory errors or instability early. It's difficult for me, too, but once training begins well, I suggest you stop staring at the plots for a while. Reinforcement learning is slow and takes time to show progress. If you continually monitor it, you risk the temptation to stop it and tweak something prematurely, while a slowly progressing run can turn out surprisingly well given enough time. So, start training and go for a walk.

During this presentation, we mapped the classic reinforcement learning concepts to the language models domain. Then, I introduced Verifiers, an open-source library to build environments as software artifacts. Finally, I walked you through my experiments where I took a small model and turned it into a tic-tac-toe master using supervised fine-tuning and reinforcement learning with verifiable rewards. We did not just show the model how to play; we gave it a space to play and guided it through rewards. Nowadays, reinforcement learning complements supervised fine-tuning in language model post-training. You can do this at home, too. If you can define a clear reward signal, you can build an environment and train a small, specialized model to beat a large closed model on a specific task at a fraction of the cost.

I want to leave you with a few ideas and resources on this topic. Everything I shared today can be found in my free course on LLM RL environments, where I go deeper into the details. Take a look and give it a star. To figure out what others are building, explore the Environments Hub. And what to build next? Something I'm very excited about is this: train a small language model on two or three tools you often use and try to outperform a large model on that specific task. I recommend the Wiki search example in the Prime RL repository as a starting point. Thank you.