Hard Won Lessons from Building Effective AI Coding Agents – Nik Pash, Cline
Channel: aiDotEngineer
Published at: 2025-12-12
YouTube video id: I8fs4omN1no
Source: https://www.youtube.com/watch?v=I8fs4omN1no
Wow, it's wild to be on the same stage as so many people I've drawn inspiration from. Let's dive into it. My name is Nik. I'm the head of AI at Cline, and today I'm going to share some lessons we learned along the way.
So let's start with the bitter truth. For years we compensated for weak models by building clever scaffolds around them. All kinds of clever ideas, like RAG indexing systems, search trees, and tool-calling scaffolds, were invented to cope with weaker models. And frontier models simply bulldoze those abstractions. Now you don't really need your scaffolding anymore. Your scaffolding just gets in the way of these models. The question really isn't how fancy your agent stack is. Increasingly, it's how strong the model driving it is. And the lesson here is relentless.
A perfect example of what I'm talking about is Gemini 3.0, released this week. It immediately dominated the Terminal-Bench leaderboards with no agentic harness supporting it at all. In this chart, you can see that Gemini 3.0 on Terminus scored better than the vast majority of model-agent combinations in the world, all out of the box. And what's remarkable is that Terminus is designed to be an unopinionated, generic, stripped-down harness. It has no graph search, no RAG, no indexing. It's just: here's a terminal, go figure it out. And it crushes. The whole point of Terminus is that it has no clever tool calling and no context engineering features.
So the takeaway here is that capability beats scaffolding. If you get out of the model's way, it will perform just fine. So really what I'm driving at, and the key takeaway from this whole talk, is: if you're building agents, just relax. Cool it with all your clever engineering tricks. Stop overthinking it. That's it. That's the lesson. And another point on this, kind of an aside: I don't know about you guys, but we're all on Twitter.
I'm on Twitter, and at this point, I just think talking about these clever little context tricks and hacks is a little played out. At this point, I'm straight up tired of seeing some of this stuff. And I get it. It's free engagement, and we all indulge in it a little bit. But personally, I think there's not really much signal there. So, if you want the full playbook for building an effective coding agent, the playbook's right here. It's up on the screen. There was some novelty in talking about it months ago, but at this point, in my opinion, it's been done to death.
We're model agnostic at Cline. We support all the models. Every two weeks there's a new big model release going out, and we've basically come down to the same playbook for supporting each model as it comes out. So I'm sure everyone here knows how to tune an agent from Sonnet 4 to Sonnet 4.5, from Gemini 2.5 to Gemini 3, and from GPT-5 to GPT-5.1. I feel like this entire conversation is a little played out. So I'm not really even going to cover this in depth, because the tweaks here are trivial and the gains are marginal.
What I really want to talk about is something that's not given a lot of attention, and it's the real bottleneck. The real bottleneck is that you can build the cleanest agent in the world, but that doesn't improve model capability by even 1%. Models only get better when labs train on something hard. It's benchmarks, not agent cleverness, not all your clever engineering techniques, not your clever RAG pipelines, that determine what frontier models learn to do next. And models didn't magically get better at tool use. They got better because people built RL environments that forced them to practice certain actions.
Handling failure modes and retrying, for example. Agents improve only when the model learns inside the right environment. Every jump in reasoning we've seen came from a benchmark. Every jump in agent reliability came from an RL environment. So the real questions become: What is a benchmark? How do you turn real-world agent coding data into an RL environment? What makes a good verifier? How do you detect real difficulty? And how do you train these models to work on the problems that we actually care about as engineers? These are the questions that matter for the next frontier.
So what is a benchmark? Put simply, a benchmark is an environment. In our case, it's a Docker container where you let the agent run wild. It's a starting state, which is the snapshot of the code when you started working on a real-world coding task, as well as a starting prompt. And the last thing is a verifier at the end that checks whether an end state is correct or acceptable.
So how are RL environments different? Well, here's the thing. They're not really different at all. You might notice this chart is basically the same thing as the previous slide. The only real distinction here is how the reward is used. Benchmarks measure models. RL environments improve models. The score doesn't just stop at a leaderboard where you publish the results. The score is actually used to update the weights of the policy model.
So how do you transform real-world coding data into useful RL environments for training? At Cline, we created a system called the RL environments factory. We're looking for a better name there, but that's what we've got so far. The first phase in this pipeline is that you get sub-agents and have them qualify tasks. These sub-agents work in parallel to decide whether or not given tasks are suitable to be turned into RL environments for the purpose of training.
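To make the benchmark versus RL environment distinction concrete, here is a minimal Python sketch. It is not Cline's implementation: the names `Task`, `run_episode`, `benchmark`, and `collect_rewards` are hypothetical, and a plain string stands in for the end state that a real setup would read out of a Docker container. It only shows that the same task and verifier serve both purposes, and that the difference is where the score goes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One task: a starting state, a starting prompt, and a verifier (hypothetical shape)."""
    start_commit: str                # snapshot of the code when the work began
    prompt: str                      # the user's starting prompt
    verifier: Callable[[str], bool]  # checks whether an end state is acceptable


def run_episode(task: Task, agent: Callable[[str, str], str]) -> float:
    """Let the agent work from the starting state, then score the end state as 0.0 or 1.0."""
    end_state = agent(task.start_commit, task.prompt)
    return 1.0 if task.verifier(end_state) else 0.0


def benchmark(tasks: List[Task], agent: Callable[[str, str], str]) -> float:
    """Benchmark use: the scores are aggregated and reported, nothing else."""
    scores = [run_episode(task, agent) for task in tasks]
    return sum(scores) / len(scores)


def collect_rewards(tasks: List[Task], agent: Callable[[str, str], str]) -> List[float]:
    """RL use: the same scores, kept per episode so a trainer can use them as rewards."""
    return [run_episode(task, agent) for task in tasks]
```

In this sketch `benchmark` returns a single pass rate for a leaderboard, while `collect_rewards` returns the per-episode values that a training loop (not shown, and outside the scope of this sketch) would use to update the policy.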
And the qualification process goes as follows. You start with origins. You have to validate: does the repository actually exist? Is the starting commit accessible? Is it open source? Then there's the journey, where you look at the starting prompt and the follow-on prompts that the user might have sent to the agent. You have to try to understand what the user was actually trying to accomplish, what the spirit of their task was. And lastly, there's the outcome. Can we find the actual commits or PRs that fixed the problem in real life? Did they actually commit the solution to their problem later on in the timeline?
We're also actively looking for easy disqualifiers as part of this. Things like vibe-coded slop. We don't need another benchmark that tests for "build the Next.js app from scratch." We're looking to disqualify trivial tasks that are too easy and tasks that have no reliable start or end states.
And lastly, what makes a good RL environment good? How do we actually make an RL environment, and what makes a good test or verifier? Phase two of this pipeline is building the actual RL environment. You start out with archaeology, where you reconstruct both states locally. You pull down the code, see if you can implement it yourself, reconstruct it, build it, and verify that the bug the user was referencing and the solution actually exist. You document every obstacle and dependency. You containerize it with Docker, removing Git, obviously, so agents can't reward hack. And lastly, you define the verifier at the end.
This is where it gets into a little bit of the art of building a good verifier. I want to talk about this because the analogy that I typically give is a tea kettle. Let's say the user's goal is "I want to boil water." A really good example of a verifier to test whether or not the water is boiling is a little whistle attachment that goes inside your tea kettle.
And the whistle is an example of a pure outcome-driven verifier, where the water either reached the boiling point or it didn't. Either it's whistling or it's not. The kettle doesn't care how you achieved it, whether you used a gas stove, an electric induction stove, or a campfire. It just signals the result.
In the process of doing this, all these weird bad tests can emerge. The sub-agent might have noticed that in the ground-truth solution, in a previous run, the burner was set to high, so maybe we should be checking for that. But we all know that water can boil at a low setting on the burner. Or: was it on the front-left burner? Have 5 minutes elapsed? All kinds of weird bad tests. The key point here is: don't overprescribe based on the ground truth. Test for the spirit of the task. Test for the outcome of the task.
The outcome at the end of all this is a containerized benchmark or environment for that task. Agent work is recorded, so you can see the traces, the trajectory that the agent took to complete the task, and you can reliably score it and verify it. And it's fully portable. You can run it on any device.
So the path to automation that we've been undertaking as part of this is: can we fully automate the process of converting real-world coding data into RL environments for the purpose of training models? This work largely started out manual. The first RL environment took about 16 hours of my time. What used to take 16 hours now takes less than 20 minutes per task. And we're building towards a fully automated RL environment factory, where the bottleneck shifts from engineering to collecting high-quality tasks.
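The tea kettle analogy can be written down as a toy Python sketch. Everything here is illustrative: `KettleRun`, its fields, and both verifier functions are hypothetical names, and the 100 °C threshold assumes sea-level pressure. The sketch contrasts an outcome-only check (the whistle) with a check that copies incidental details from one ground-truth run.

```python
from dataclasses import dataclass

BOILING_POINT_C = 100.0  # assumption: sea-level pressure


@dataclass
class KettleRun:
    """End state of one attempt at boiling water (hypothetical fields)."""
    water_temp_c: float
    burner_setting: str
    minutes_elapsed: float


def outcome_verifier(run: KettleRun) -> bool:
    """The whistle: passes whenever the water reached boiling, however it got there."""
    return run.water_temp_c >= BOILING_POINT_C


def overprescribed_verifier(run: KettleRun) -> bool:
    """A bad test: also demands details that merely happened to hold in the ground-truth run."""
    return (
        run.water_temp_c >= BOILING_POINT_C
        and run.burner_setting == "high"
        and run.minutes_elapsed <= 5
    )


# Three illustrative end states.
ground_truth = KettleRun(water_temp_c=100.0, burner_setting="high", minutes_elapsed=5.0)
slow_but_valid = KettleRun(water_temp_c=100.0, burner_setting="low", minutes_elapsed=12.0)
not_boiling = KettleRun(water_temp_c=80.0, burner_setting="high", minutes_elapsed=5.0)
```

Both verifiers accept `ground_truth` and reject `not_boiling`, but only the outcome verifier accepts `slow_but_valid`. The overprescribed one rejects a correct solution simply because it took a different path from the reference run.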
And an interesting point here, the natural endpoint of all this, and this is a question for everyone in the audience: what if we built RL environments to test how well agents can actually make RL environments? Kind of like a meta benchmark. What would hill climbing on that look like? You can start imagining that as models get really, really good at making their own RL environments to train on, based on real-world user data, you kind of complete that loop. Something to think about.
So, okay. This next part is the truth nuke. Also known as TRO. An unspoken fact is that we're not alone at Cline in building this kind of system. Every major agent lab captures this data. They all do some version of this behind the scenes, but no one really talks about it. And I don't even need to name them. If you know, you know. And realistically, you all know. These same companies cite internal benchmarks to justify legacy systems that they spent months maintaining. But curiously, you'll never be able to study or inspect them, because they don't publish them openly. This data is so valuable, yet no one shares it. It's the only thing that actually moves the needle.
And here's the heart of my argument. By standing between real-world engineers working on real-world tasks and the models, agent labs have a unique role in history. We can build better prompts. We can build better tools. But none of that improves the underlying models. We possess the single richest data set of real engineering work anywhere in the world. Models don't improve without this data, and keeping it closed is slowing down frontier research.
So today we're announcing cline-bench. This is our attempt to finally create a benchmark that isn't cosplay engineering. It's not "write me a server that generates Fibonacci sequences."
This is real software development, captured and packaged into standardized RL and eval environments. This is the benchmark that we always wanted someone else to build. No one did. So we're doing it, and anyone can participate.
So here's how it works. The whole thing is open source. There's no secret sauce, no locked-away data sets. You can openly run it yourself and inspect it to see how it works. Anyone can use these environments for SFT, RL, eval, whatever. The point is to give the entire ecosystem a real substrate to measure and improve models on, not just LeetCode puzzles.
This only works if the community contributes. And the good news is you don't actually need to do anything special. Just work on your open source project with the Cline provider turned on and opt into the cline-bench initiative. If a frontier model gets stuck and you step in to fix it, that's actually an ideal candidate task for a benchmark. And that's it. Just use the Cline provider, see where the model struggles, and we'll pick it up and introduce it into this open-source benchmark. So, cline-bench will always remain free, fully open source, and freely accessible. Thank you all. If you want to contribute, [music]