Agents are Robots Too: What Self-Driving Taught Me About Building Agents — Jesse Hu, Abundant
Channel: aiDotEngineer
Published at: 2025-11-24
YouTube video id: qqXdLf3wy1E
Source: https://www.youtube.com/watch?v=qqXdLf3wy1E
All right. So this is my talk, Agents are Robots Too. I've given different variants of this talk in person at different events, but this is the first one I've done for coding agents.

To kick things off, a little bit about me. I've been a lifelong ML engineer. I worked at places like YouTube and Google, where I worked on the two-tower embedding model as well as some early work on BERT and mixture of experts. I worked on ML and robotics at Waymo, where a lot of my focus was on the data side as well as reward modeling and evaluation. Most recently I've been working on a company called Abundant, where we apply a lot of the same concepts to datasets for foundation model labs and their training of agentic coding models. None of this covers any inside information about Waymo; instead we'll cover general topics that carry over from self-driving and robotics into digital agents.

So I'll kick things off by talking about some of the parallels. One of the main ones is the 1% versus 99% problem: you think the model is doing most of the work, but when you get into real-world applications, the model is only doing 1% of the work and the other 99% goes into everything else. In robotics you have the hardware, sensors, and actuators; you have integration and deployment; and you have a whole offline stack that does simulation, training, and more. Agents have all of this too.

If we look at the two stacks side by side: in robotics you have hardware, actuators, and the fleet. In agents you also have a body of sorts. Robotics is very obviously embodied, because you go from a brain to a physical body. In agents you go from a model to the body of a digital robot, and that body includes tools. So now we have APIs and MCPs, as well as more advanced embodiment in the terminal, the browser, and the VM. You're starting to see the robot's hands, arms, and legs, up through even more advanced things like an entire OS and persistent file systems.

The offline stack transfers over too. We're not finished when we have the model: we have to continuously retrain, we have to monitor these things, and we have to build human feedback loops and all the other tooling that supports developing the agent at all. And that's the first learning I want to share: in self-driving, people would often say that the winning team isn't the one with the best model and the best online stack, but the one with the best offline stack, because that's what lets developers move faster and ship far more reliably.

Moving on, there's a concept from robotics I want to share: open loop versus closed loop. Very simply, it's the difference between blindly taking an action, versus moving an actuator or a motor and then getting feedback about what actually happened in the real world, so you can close the loop on that action. For example, if I turn the wheel left, I want to measure how much my car actually turned, so I can recalibrate and make sure I'm turning exactly the amount I intended, because these systems aren't perfect. In the same way, we're starting to see open-loop things in agents that need to be closed.
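To make that concrete, here's a minimal closed-loop control sketch in Python. Everything here is invented for illustration: measure_heading() and apply_steering() are hypothetical stand-ins for a real sensor and actuator, and the 80%-effective actuator is made up. The point is the shape of the loop: command, measure, correct.

```python
heading = 0.0  # simulated "real world" heading the sensor reads back

def measure_heading() -> float:
    return heading  # hypothetical sensor: what actually happened

def apply_steering(delta_deg: float) -> None:
    global heading
    heading += 0.8 * delta_deg  # hypothetical actuator, only ~80% effective

def closed_loop_turn(target_deg: float, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        error = target_deg - measure_heading()  # feedback from the world
        if abs(error) < 0.1:                    # close enough: stop correcting
            break
        apply_steering(0.5 * error)             # proportional correction

closed_loop_turn(90.0)
print(f"final heading: {measure_heading():.2f} deg")  # converges near 90
```

An open-loop version would just call apply_steering once and hope; the feedback step is what absorbs the actuator's imperfection.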
For example, if I run a bash command that kicks off an open-ended process, sometimes I can't observe the outputs, at least not in real time. I can't tell whether the command completed, and I can't exit early if I need to. That's an example of something that needs to be made more closed loop (a sketch of this pattern appears at the end of this section).

Another, more nuanced point is that we are implicitly discretizing in time. What do I mean by that? In robotics there are explicit design choices to make about the input space and the action space. In the input space you have different modalities: vision, LiDAR, radar, all these different inputs that you can combine in different ways to build a sense of the world. You can also discretize the world in different ways. You can sample every second, you can sample only when data is pushed to you, or you can sample at, say, 50 Hz, fifty times per second, which means you keep updating your state of the world, keep replanning, and react very quickly.

In agents, we've made this choice implicitly. An agent usually has a conversation: we wait to take our turn, we execute a tool, we wait for the entire response. We don't do the thing that's natural in robotics, where you keep sampling from the world and interacting in real time. This is an implicit design decision with pros and cons. The pro is that turns are easy to reason about: a conversation, an input, and an output per turn. The downside is that we can't act in real time. We can't immediately respond to a pop-up, and we can't immediately interact with a long-running process. These are the implications of the design decisions we make.

More on those input and action spaces. For inputs, we've handcrafted a bunch of tools and a bunch of ways to stream from tools and from the user, but there are other options out there. One example I want to highlight is the Terminus agent from Terminal-Bench. It's unique in that it uses a tmux stream, so you can do character-by-character input and output if you want: you can send Ctrl-C, or issue various window commands. That's a much more flexible way of interacting with the action space than we traditionally consider when designing agents.

There are other ways to define an action space in robotics, too. You can plan purely in x-y: move up one block, then over by two. You can do that coarsely or in continuous space, in 2D or in 3D, in acceleration instead of just position, or in velocities. In agents we should think about this as well, even if it's less directly relevant. You don't have to limit yourself to MCPs and tool calls. As with Terminus, you can interact with the computer at the character level. You can even go as far as the Dreamer paper, where you interact with the computer purely through mouse clicks and keystrokes at 20 frames per second.
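Here's the closed-loop version of the bash example from earlier, sketched with Python's standard subprocess module. The command and the stop condition are illustrative; the point is streaming the output as it arrives and reserving the right to exit early.

```python
import subprocess
import time

# A long-running, open-ended process (illustrative command).
proc = subprocess.Popen(
    ["bash", "-c", "for i in $(seq 1 100); do echo step $i; sleep 0.1; done"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)

deadline = time.monotonic() + 2.0   # a time budget instead of waiting forever
for line in proc.stdout:            # observe output as it arrives
    print("observed:", line.rstrip())
    if "step 10" in line or time.monotonic() > deadline:
        proc.terminate()            # close the loop: exit early on a condition
        break
proc.wait()
```

The open-loop equivalent, subprocess.run(...) with a blocking wait, gives you nothing until the process finishes, which for an open-ended process may be never.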
So the question is: what trade-offs are we making, and which implicit or explicit design decisions either enable us to do more or limit what we can do with our agent?

The next thing I want to talk about is the move from stateless to stateful processes. If you think about driving in a video game, you can spawn from nothing. You don't have to worry about where you came from or where you go after the session ends; you only have to worry about what you do during the session. That's obviously not true in the real world. In the real world you have a real car. It has mass, it takes up space, so you do have to worry about where it ends up and about how it got into the scene. Everything is moving, and there are implications to how fast you're moving and how fast everyone else is moving.

Similarly, we're going from stateless agents to more stateful ones. Before, we just spun up a session and got an artifact out of it. Now we have VMs, stateful both in terms of what's running and in terms of the persistent file store. So when we spin up an agent, we have to consider the entire space it's running in: what Slack messages are currently in flight, what the state of the world is, what all the things are that it has to interact with. And not only do we have to deal with that online, we have to think about how it changes evaluation and simulation. This is one of the more interesting shifts happening in the agent space right now.

One of the more nuanced issues, familiar to people working on modeling and training, is the DAgger, or out-of-distribution, problem. Just as in robotics, we have the option of training agent models with imitation learning, which is similar to SFT on human demonstrations, or with RL, where RL can happen in simulation or in other ways. One known issue with imitation is that as soon as you drift a little out of distribution, or off-policy relative to the human examples, you end up very far out of distribution. You can see this in browser agents: when a pop-up appears that never occurred in training, because humans handle pop-ups naturally without thinking about them, the agent gets badly confused. This problem of cascading errors has been studied in robotics for quite a while.

The general theme is that actions have consequences. We're not just dealing with classification models, or with prediction models over sequences. We're dealing with a new paradigm in which you predict, you act, you deal with the consequences of that action, and then you re-evaluate everything you've done before. That's really tough, because actions have consequences in a very messy real world. And that complexity is exactly where simulation comes into play: you capture the messiness of the real world in your starting state, and then you can play the world forward, not just along the single path that actually happened, but along all the paths your agent might take as it changes. We call that playing out counterfactuals.
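A toy sketch of that counterfactual playback, with loudly stated assumptions: the logged snapshot, the world dynamics, and both policies are invented for illustration. The idea is simply to reset to the same real starting state and roll several agent variants forward from it.

```python
import copy

def simulate(snapshot: dict, policy, horizon: int = 5) -> float:
    """Roll one policy forward from a logged snapshot; returns a toy score."""
    world = copy.deepcopy(snapshot)        # never mutate the logged state
    score = 0.0
    for _ in range(horizon):
        action = policy(world)             # the agent variant under test
        world["pos"] += action             # invented world dynamics
        if world["pos"] == world["goal"]:  # invented reward: sitting on the goal
            score += 1.0
    return score

logged_state = {"pos": 0, "goal": 3}       # stand-in for a production log snapshot

policies = {                               # hypothetical agent variants
    "cautious": lambda w: 1,
    "greedy": lambda w: min(2, w["goal"] - w["pos"]),
}
for name, policy in policies.items():      # same start, many counterfactual paths
    print(name, "->", simulate(logged_state, policy))
```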
The other thing to be aware of, and this is classic reinforcement learning and robotics, is the concept of an MDP: an agent takes in a state and a reward, and then takes actions on an environment, a world. It's just a formalism for how to conceptualize the agent loop, and these are useful primitives to have on hand so you can describe and communicate what's going on (a minimal sketch of this loop appears at the end of this section). The reason this matters is that we're moving from plain chat models to agent models that take action.

For context, a lot of self-driving initially seemed really fast but made slow progress, for much the same reasons. From roughly 2017 to 2020, everyone in the space was focused on perception models, thinking that all you really needed was to take the state of the world, draw boxes around things, and then drive around the boxes easily. It turns out that assumption wasn't necessarily true: there's a lot of hidden complexity in building action models rather than just predictive models. Similarly, language models can understand basically everything about the world that comes in via text, and they can generate really long, sophisticated reasoning traces. But when you take those sophisticated plans and chains of tool calls and run them in the real world, things go wrong all the time. Tool calls fail and the agent fails to make progress; the agent fails to correct its own mistakes. That loop is the deceptively tricky part of going from prediction to action, and it's where the bulk of the work has been in self-driving and where the bulk of the work will continue to be in agents.

I also want to point out that in both cases, self-driving within robotics and code within digital agents, we're actually very lucky. Why? You can see self-driving working really well in production today, in limited cases, while the rest of robotics is still limited to demos. That's because the car is a machine with predefined human controls, refined over the last few decades, with electronic controls and built-in telemetry. You already have a predefined interface for taking actions and a predefined interface for collecting data, which makes it really convenient to operate through code and really convenient to do machine learning on. Code is similar: a predefined interface with predefined actions and predefined telemetry, which makes it a much easier task than some of these other knowledge-work tasks that require the full desktop and are less easy to codify. So when we explore new domains, that's one of the things to consider: is there already a predefined human interface that makes both of those things easy?
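Here's the MDP loop from the start of this section as a minimal, Gym-style sketch. The environment and policy are toys; the shape is what matters: observe the state, act, and take the consequences back as feedback.

```python
class ToyEnv:
    """Invented environment with an MDP interface: reset() and step(action)."""

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        self.state += action                      # the world reacts to the action
        reward = 1.0 if self.state == 5 else 0.0  # sparse reward at the goal
        done = self.state >= 5
        return self.state, reward, done

def policy(state: int) -> int:
    return 1                                      # toy policy: always step right

env = ToyEnv()
state, done, ret = env.reset(), False, 0.0
while not done:
    action = policy(state)                        # act on the current state
    state, reward, done = env.step(action)        # consequences come back as feedback
    ret += reward
print("return:", ret)                             # 1.0: reached the goal once
```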
Finally, I want to talk about something we face day to day: the hill-climbing process. If you're not familiar with hill climbing, it's the iterative process of building and improving a complex system, such as an LLM or an agent, when you don't always make forward progress.

Before, when we were working on full-stack web applications or simpler systems, you'd implement a feature and could pretty much guarantee that feature would land in prod. Nowadays you have a somewhat nebulous metric you're trying to hit, and the only way to hit it is by guessing and checking. You have a metric, like a benchmark; you make a guess, run an experiment, and hope the number goes up. Sometimes it goes down, but as long as it keeps going up over time, you can eventually reach your goal. That's the concept of hill climbing.

The self-driving way of doing it is a little more sophisticated. You start with learning, then go through simulation. Simulation helps you deploy with confidence, and it also helps the learning itself. Then once you deploy, you get logs from the real world that feed back into your simulation engine. That's really important, because you want to ground your simulation in something real. So you get a full loop, and the logs become a much more important part of the process than they are today. You can get far more insight than a single number: 70% on a benchmark tells you a little, but if you break the results down into categories, different cities, different failure modes, and start triaging the individual failures, you learn much more about how and where to improve your system (see the sketch after this section). That's a lot of what we've built our tooling and processes around to help our customers with their hill climbing.

Finally, we're only part of the way there. At least, that's what a metric like the remote labor benchmark shows, and I'd compare this to where self-driving was at the beginning: we have really great demos and really great predictive models, but we're nowhere near end-to-end work completion. A lot of that comes back to what I raised before: actions have consequences, and the real world is complex.

To recap, we've covered the parallels between robotics and agents: closed-loop systems and closed-loop feedback, how we discretize time, how we pick action and input spaces, how we go from stateless to stateful, how we're going from predictive models to action models, how simulation is used in deployment and in training, and how infrastructure matters to the entire development process.

If you've gotten this far, congratulations: you've become a master of this new topic we're calling "agentics", because why not? Robotics sounds cool, so why not make agent development just as cool? I think it takes these core concepts and abstractions to take this from something we hack on to something with dedicated, real science behind it, a genuine practice. If any of these concepts are useful to you, most of them are pretty easy to read up on: open-loop and closed-loop control, MDPs, fully versus partially observable environments, DAgger. Offline RL is a really cool topic featured in more recent robotics work. And the intro reinforcement learning book is all great.
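Before wrapping up, here's the failure-triage sketch promised earlier: a headline pass rate hides structure, so slice the eval results by failure category. The records and category names are made up for illustration.

```python
from collections import Counter

# Invented eval records: task id, pass/fail, and a triaged failure category.
results = [
    {"task": "t1", "passed": True, "category": None},
    {"task": "t2", "passed": False, "category": "tool_call_failed"},
    {"task": "t3", "passed": False, "category": "bad_recovery"},
    {"task": "t4", "passed": False, "category": "tool_call_failed"},
]

pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"headline: {pass_rate:.0%} pass")          # the number alone says little

failures = Counter(r["category"] for r in results if not r["passed"])
for category, count in failures.most_common():    # biggest buckets first
    print(f"{category}: {count} failure(s)")      # where to focus next
```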
You'll probably pick up those classic concepts natively, because the problems are really obvious and easier to understand in agent space. And finally, you can read up on a lot of the recent robotics literature as well, since much of the field is converging; you can start straight from the papers.

Just as a recap: agents are robots too. They act in the real world, they make mistakes, and they have to recover, and all of these little things really matter. Thanks. Feel free to get in touch; my email is jesse@abundant.ai. Send me any thoughts or feedback. Thanks.