RL Environments at Scale – Will Brown, Prime Intellect
Channel: aiDotEngineer
Published at: 2025-12-09
YouTube video id: _IzZWeuTx7I
Source: https://www.youtube.com/watch?v=_IzZWeuTx7I
Today we're talking about RL environments and how to scale them. But the title is a little bit of a red herring. We'll talk a bit about the engineering pieces, like running these with thousands of parallel rollouts and sandboxes on hundreds of GPUs, but I'm mostly going to focus on a different notion of scale. There are a number of different ways we talk about scaling in the context of AI and research. We know about scaling laws: pour in more data, compute, and parameters, or more inference-time compute, and models get smarter and more performant. But there's also a fuzzier side of scaling, sometimes referred to as unhobbling, or algorithmic tricks, or talent. Where does this come from? It's not just pouring in resources; it's something more intangible, harder to put a finger on. Really, it comes from a community of people, a company, an organization, universities, the world, the internet, talking about ideas and sharing them, working on different applications, having those applications inspire ideas, using these ideas as test beds for different techniques, and building on top of them so that other people in the future don't have to reinvent the wheel, can build on what was done by those before them, do more effective research, and accelerate the pace of innovation. So why do we have this talent bottleneck? There's a big issue we hear all about, with AI labs trying to find more talent; salaries are going through the roof and everyone wants to hire the best and brightest AI researchers. But one approach besides just paying the most is to increase the pool. So how do we increase the pool of AI researchers? How do we make doing AI research more accessible? That brings me to who we are at Prime Intellect.
If you haven't heard of us, we are a bunch of things. We're a research lab. We're a compute provider. We're a platform company. And we are an open-source ecosystem. We do a lot of things, and they all fit together in a way that I'm going to try to explain in this talk. We see these as different pieces of how we can build a business around doing exactly this: increasing the accessibility of AI research, and making research more of a toolkit available to people and organizations around the world without needing to be inside a large lab, spend crazy amounts on massive clusters, or go do a PhD. We think there are versions of doing AI research that really should be part of the bread-and-butter workflows of AI engineers around the world as we build applications and try to improve our systems and models and products. One thing people are kind of iffy about in AI is whether open-source models are going to work, and in my mind, that's not quite the right analogy to draw. When we compare AI to traditional software, there are lots of great examples of open-source software ecosystems that have thrived, things like Linux and Node and Apache. But in my mind, the analogy in AI is not models as fixed checkpoints; it's research as a practice and research as a set of ideas. That's more intangible, but there are a lot of parallels between the goals and best practices of growing a research ecosystem and growing a software ecosystem: you want to compound abstractions and best practices, have better tooling and iteration efficiency, and have these gains over time allow more advanced, powerful, complex things to be built, by decreasing barriers to entry for any given application and making all of it more accessible.
One term we use to describe some of what we're building at Prime Intellect is the open superintelligence stack. Partly because it's a fun acronym, but also because I like the idea of a stack of all the pieces of the puzzle you need to build the engine to go do research. There are a lot of layers to it. You need compute, you need orchestration, you need libraries for training and evaluation, and you need platforms to support things like code execution, eval inference, and fine-tuning, and we're doing all of these things. But really, the goal is to give more people in the world the tools to go train models. I'll explain why in a bit, but there are a lot of reasons why the best products are not going to be the ones that just take the thing out of the box of an API and put a thin wrapper around it. There are ways you can improve around APIs, but I think in many cases people are realizing that for winning products, whether it's part of the model, part of the stack, part of the product, or the whole thing, the ability to do research, and at least the option of deciding where in your product you might want to customize or improve a model, gives you a lot more flexibility to make the best user experience. We've heard the phrase in the past that the model is the product. I think we're now starting to see this change a little: for a lot of winning applications, the product is the model. The two notable examples of this that I'm a big fan of and heavy user of are Cursor's new Composer model and OpenAI's Codex.
These are both good examples where the product very directly is the model: the model was trained to be the model for that product, and the experience of using the model is the experience of using the product. The way this is done is by taking a harness that represents the product and training the model in that harness, in essentially an environment, an RL environment. Environments really are just a harness with a collection of tasks and rewards. But they have many other parallels throughout the ecosystem. Environments are not just for RL. Environments are essentially the same thing as evals. Environments can also be engines for synthetic data, which you can then use for SFT or distillation. You can do RL in them directly. And the agents we're actually deploying and monitoring out in the world, those are environments too. The product of these things, the tasks, the harness, and the rewards, whether that's an offline dataset or the stream of user tasks coming into a product, is an environment. As an abstraction, I think this is a very useful way of framing what it might look like for research to become a practice adopted more broadly beyond large AI labs. I also think environments are a really accessible entry point. I like the analogy of environments as the web apps of AI research. What I mean is that they're very simple and self-contained. They start simple, but they can also get quite complex, elaborate enough to represent the full complexity of a large product.
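The definition above, an environment as a harness plus a collection of tasks plus rewards, doubling as an eval, can be sketched in a few lines of plain Python. This is an illustrative toy, not the verifiers API; all names here are mine.

```python
from dataclasses import dataclass
from typing import Callable

# Toy sketch (NOT the verifiers API): an environment bundles a harness,
# a collection of tasks, and a reward function.

@dataclass
class Task:
    prompt: str
    answer: str

def exact_match(completion: str, task: Task) -> float:
    # Simplest possible reward: exact match against the reference answer.
    return 1.0 if completion.strip() == task.answer else 0.0

@dataclass
class Environment:
    tasks: list            # the dataset of tasks
    harness: Callable      # model/agent under test: prompt -> completion
    reward: Callable       # (completion, task) -> float

    def evaluate(self) -> float:
        """Run the harness over all tasks and average the rewards.
        The same object can serve as an eval, a synthetic-data engine,
        or an RL environment -- only what you do with rollouts changes."""
        scores = [self.reward(self.harness(t.prompt), t) for t in self.tasks]
        return sum(scores) / len(scores)

# Usage with a stub "model" that always answers "4":
env = Environment(
    tasks=[Task("2+2=", "4"), Task("3+3=", "6")],
    harness=lambda prompt: "4",
    reward=exact_match,
)
print(env.evaluate())  # -> 0.5: the stub gets one of the two tasks right
```

The point of the sketch is that nothing RL-specific appears: the same tasks-plus-rewards object scores any harness you plug in.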
They're also pedagogical in nature: you start simple, and as you build complexity you start bumping into walls where you have to learn new concepts, understand more about scaling and the systems side, understand more about the hyperparameters and the algorithms. They open a door where, by playing around with them, you can enter a world of research without needing to build a whole training infrastructure from scratch. They also require experimentation. I think the key differentiation between an agent harness and an agent environment is that the environment forces you to have your tasks and your rewards predefined so you can do this experimentation. It's a proper eval. What this means is that you can't just vibe-check it. You can't just build it, test it out a bit, and say, "Hey, it's good, we're going to ship it." It forces you to say, "Okay, let's think about this a little more scientifically. Let's do some experiments. Let's try different models, try different hyperparameters." And it gets you to the point where you can start doing more advanced research in terms of RL training or distillation or fine-tuning. To really facilitate this, we wanted to make the environment as an entry point much more accessible. A few months back, we launched what we call the Environments Hub, an open-source community platform for creating, discovering, and sharing RL environments and evals. So far, we've had a lot of fun seeing everyone build here. We've had hundreds of builders and environments come in, creating their own ideas or re-implementing papers. There are a bunch of examples I could show you, but really it's just a bunch of people who wanted to do research and found this an entry point to start digging a little deeper.
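The point about predefined tasks and rewards replacing vibe checks can be made concrete with a toy experiment loop: every candidate is scored on the same fixed task set, so comparisons are reproducible measurements. The "models" below are stubs standing in for real systems, and all names are made up.

```python
# Toy experiment loop: because tasks and rewards are fixed up front,
# comparing candidates is a measurement, not a vibe check.
# The "models" are stubs standing in for real systems.
tasks = [("2+2=", "4"), ("10*3=", "30"), ("7-5=", "2")]

def score(model, tasks):
    """Fraction of tasks where the model's answer matches the reference."""
    return sum(model(p).strip() == a for p, a in tasks) / len(tasks)

candidates = {
    "always-four": lambda p: "4",                       # trivial baseline
    "arithmetic":  lambda p: str(eval(p.rstrip("="))),  # actually computes
}

results = {name: round(score(m, tasks), 2) for name, m in candidates.items()}
print(results)  # -> {'always-four': 0.33, 'arithmetic': 1.0}
```

Swapping in a different model, prompt, or hyperparameter setting is just another entry in `candidates`, which is the shape of experiment the talk says environments force on you.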
Whether that was investigating some benchmark and figuring out how to re-implement it or modify it to be appropriate for an RL context, with new data or new examples, or whether it was some game they'd been thinking about, or some other task. Having this abstraction for encapsulating the thing you want a model to do lets you start experimenting with ways of improving it without needing to have the answers. People talk a lot about how fine-tuning never really took off in the SFT regime, and I think a big part of that is that getting the data, the actual solutions, was really hard. Having labeled examples of exactly what you want the model to do is a very difficult thing to ask someone to create. But if you can just think about the settings the model might be in, without having the answers up front, and if you can measure the answers, then you can start creating data on the fly. That engine is really what the environment is about unlocking. Actually, nine months ago I was right here in this room. I had just released a library called verifiers, which I'm still working on today. It's come a long way, but it's a toolkit for building these things, and it's been a lot of fun this past year playing with it and extending it to support more features and kinds of environments. The idea with verifiers is to give people a toolkit, essentially a bunch of components you can mix and match and compose, for everything from simple evals or QA or games, to tool use, sandboxes, agent frameworks, CLI coding agents, or math problems. There are all sorts of things you might want models or agents to do, and it's a toolkit for building environments that are then ready to be trained automatically with reinforcement learning.
In thinking about the design, it's been a lot of fun and also a big challenge to figure out how to make a toolkit for this stuff that actually covers all the bases. There are a lot of different approaches I've seen people take, and I think they all make sense depending on what you want to work on. We took a very general approach, where we said: we are not going to know all the answers right away. There are going to be lots of special cases, hierarchies of complexity, and patterns, and we really wanted to prioritize extensibility. So we think about these things hierarchically. Say you want to do a coding-agent environment for client bench: it's an instance of the Harbor framework, which is an example of a CLI agent, which is a multi-turn environment, which is an environment. Similarly for TextArena and Wordle, or for search with MCP, or for giving a model a Python REPL in a sandbox. Thinking of these things hierarchically lets us determine what the foundational pieces are, what is generic across all environments, and how you build up the stack toward applications. As one example that I'll walk through end to end: we call this one wiki-search, and it's basically a simple search setting where we give an agent the ability to call some tools to search over Wikipedia pages and find answers. Here is its Environments Hub page. The Environments Hub is a full-stack code-management package registry: every environment is a Python project where you can have dependencies and versions, and upload your evals and so on. But the environments are very simple. They start simple.
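The hierarchy described here, generic environment down to multi-turn environment, down to CLI agent, down to a specific benchmark instance, can be sketched as a small class tree. This is a hypothetical structure in the spirit of the talk, not the actual verifiers class names.

```python
# Hypothetical class tree (NOT the actual verifiers class names):
# generic pieces live at the base; applications subclass downward.

class Environment:
    """Base: knows how to roll out an agent on a task."""
    def rollout(self, agent, task):
        raise NotImplementedError

class MultiTurnEnv(Environment):
    """Adds a conversation loop: agent acts, env responds, until done."""
    max_turns = 10

    def rollout(self, agent, task):
        messages = [{"role": "user", "content": task}]
        for _ in range(self.max_turns):
            messages.append({"role": "assistant", "content": agent(messages)})
            if self.is_done(messages):
                break
            messages.append({"role": "user", "content": self.env_response(messages)})
        return messages

    def is_done(self, messages) -> bool:
        return True  # subclasses override (e.g. check for a final answer)

    def env_response(self, messages) -> str:
        return ""    # subclasses override (e.g. tool output, game state)

class CLIAgentEnv(MultiTurnEnv):
    """Would override env_response to execute commands in a sandbox."""

# Usage with a stub agent that answers immediately:
env = MultiTurnEnv()
transcript = env.rollout(lambda msgs: "done", "solve the task")
print(len(transcript))  # -> 2: one user turn, one assistant turn
```

Each subclass only overrides what is specific to its layer, which is one way to read the talk's claim that hierarchy reveals what is generic across all environments.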
They can get really complicated, but this one's pretty simple: we define our tools as async Python functions, we have our dataset, and we have what we call a rubric. A rubric is the abstraction for managing the different pieces of your reward, where you can compose different things. You can also have metrics that carry zero weight in the reward but are there for observability into what's going on. The other piece of doing training is a config. The config here is for our prime-rl trainer, our large-scale training stack, which has been the culmination of the best practices from the research literature for large-scale asynchronous RL training. The config files are intended to expose the pieces people need to think about, in ways that start to get you into the algorithm, but they're also designed to be pretty high-level, pretty self-contained, with defaults we think will be sensible for a lot of people. Running this is just a command line where you specify the environment; if it's on the Environments Hub, it'll automatically be installed and your training run will start, and then, if you're lucky, you can watch your reward curve shoot right up. Sometimes it doesn't go this nicely, but the process is iterating on your environment, on your rewards and your data and your tasks, to understand what makes the task actually tractable in practice. How do you tune the parameters? How do you look at your data? How do you define your rewards? If you do this right, you can get really good improvements, especially from really small models, but also from much larger models. In this wiki-search example, we started with a Qwen3 4B model at about 55%, and after training it was at 89%, on par with much larger models like GPT-4.1 as well as reasoning models like GPT-5 mini.
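A hedged sketch of the pieces just described: async tool functions, and a rubric composing weighted reward functions alongside zero-weight metrics kept purely for observability. All names and signatures here are illustrative, not the actual verifiers or prime-rl APIs.

```python
import asyncio

# Illustrative sketch of the wiki-search pieces: an async tool function,
# plus a "rubric" that composes weighted reward functions. NOT the
# actual verifiers/prime-rl API; every name here is made up.

async def search_wiki(query: str) -> str:
    """Tool: would search Wikipedia page titles; stubbed out here."""
    return f"results for: {query}"

def correct_answer(completion: str, answer: str) -> float:
    return 1.0 if answer.lower() in completion.lower() else 0.0

def used_tools(tool_calls: int) -> float:
    return float(tool_calls > 0)  # metric only: weighted 0.0 below

class Rubric:
    """Weighted sum of reward functions; each picks the kwargs it needs."""
    def __init__(self, funcs, weights):
        self.funcs, self.weights = funcs, weights

    def score(self, **kw) -> float:
        total = 0.0
        for f, w in zip(self.funcs, self.weights):
            names = f.__code__.co_varnames[: f.__code__.co_argcount]
            total += w * f(**{n: kw[n] for n in names})
        return total

# Correctness carries the reward; tool use is tracked but not rewarded.
rubric = Rubric(funcs=[correct_answer, used_tools], weights=[1.0, 0.0])
print(asyncio.run(search_wiki("capital of France")))
print(rubric.score(completion="The answer is Paris.", answer="Paris", tool_calls=2))
# -> 1.0
```

The zero-weight entry shows the observability idea from the talk: the function still runs and can be logged per rollout, but it contributes nothing to the training signal.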
I think this practice of taking small models and making them much better is a big win for a lot of applications, where either you want a really fast model, or a really cheap model, or a really powerful model because the best models out there just aren't quite good enough. These are all the different things you can do with model customization. And the practice of creating environments isn't only for customization, but it gives you the option. If you need to do evals anyway, it's useful to think of them as environments, because the environment opens a lot of doors, whether that's prompt tuning, or model selection, or just getting a better sense of how your system could work at scale with many, many users in parallel. It's a design process that forces you to pin down: what is the thing I care about? What is my agent? What is my product? What is my harness? What am I optimizing for? To fully stress-test this, we've been training a large model, which will be out in the world quite soon, called INTELLECT-3, with our full prime-rl stack. This has been us validating the efficiency and performance at a very large scale: a 100B-plus model trained on 500 GPUs, where we've done the end-to-end post-training of SFT and RL (prime-rl also supports SFT if people want that). But it's also been about understanding all the best practices. We love reading papers, and we try out all the tricks, see which ones work and which ones don't, and then distill that into a library, prime-rl, that can be consumed by the end user without them needing to do all the implementation themselves. Being open is very important to us: prime-rl is on GitHub, you can go find it, and verifiers is on GitHub if you want to check it out.
For us, this is really about opening the door for more people to start learning about these things and incorporating them into their workflows for optimizing their models and their products. And the best way we see to do this is by growing a community. It's been really important to get good feedback loops from the people building with this: understanding what they want, what's going well, what's painful, and addressing those problems. We've run a number of community programs, from sponsoring small tasks, to a research residency program with grad students around the world, to collecting a smaller subset of the Environments Hub submissions that we review manually. The prime-environments repo is where we do this directly, where we offer to look over someone's work. We've had hundreds of these come in, and there will be hundreds more. It's been a great learning process, because it's forced us to fix a lot of things: we understand the rough edges, we understand what we need to add. And we're distilling all of these learnings into our upcoming platform product, which we're calling Lab. The idea of Lab is to give people an interface, a platform where they can browse environments, run their evals, do their inference, do their fine-tuning, and have research be more accessible in a way it hasn't been historically, because a lot of people find infrastructure very painful. They find dealing with torch versions painful, flash-attention and vLLM and getting all these things to work. We are happy to do that, but we understand that a lot of people may not want to.
The idea is that if you want to go read the code, you can go read the code, but you don't have to run it; we can run it for you. This has been our version, which will be out in the world in the near future, of letting people really focus on the environment, because the entry point to Lab will be the environment. If you want to do synthetic data and SFT, you build an environment. If you want to do your evals, you build an environment. If you want to do RL, you build an environment. And I think building an environment is the kind of thing a lot more people are going to want to do as we see where models are headed. In some cases, that will mean using fine-tuning services from the labs, because they're going to offer this, because people want it. In some cases, it will be: we really care about the smallest model we can run on-prem at the lowest latency, and we're going to optimize for our one thing. Or it could just be research for the sake of research, advancing our collective understanding of how this stuff all works. And that's really our goal: a world where there's going to be a lot of AI, and where we can all talk about it, understand it, look at it, poke at it, tweak it, and have a better sense of what we're actually building. Because a lot of the time it feels like the model is a black box, and digging into the research, going under the hood, changing things, and breaking things tells you a lot about how these models work. It tells you a lot about where they came from, where they could be going, where they might be headed, and how to prepare for that future. Thanks.