Software Development Agents: What Works and What Doesn't - Robert Brennan, OpenHands
Channel: aiDotEngineer
Published at: 2025-07-25
YouTube video id: o_hhkJtlbSs
Source: https://www.youtube.com/watch?v=o_hhkJtlbSs
Today I'm going to talk about coding agents and how to use them effectively. If you're anything like me, you've found a lot of things that work really well and a lot of things that don't work very well.

A little bit about me: my name is Robert Brennan. I've been building open-source development tools for over a decade now, and my team and I have created an open-source software development agent called OpenHands, formerly known as OpenDevin.

To state the obvious, in 2025 software development is changing. Our jobs are very different now than they were two years ago, and they're going to be very different two years from now. The thing I want to convince you of is that coding is going away: we're going to be spending a lot less time actually writing code. But that doesn't mean that software engineering is going away. We're paid not to type on our keyboards, but to think critically about the problems in front of us. If we do AI-driven development correctly, it'll mean we spend less time leaning forward and squinting into our IDEs and more time sitting back in our chairs and thinking: what does the user actually want here? What are we actually trying to build? What problems are we trying to solve as an organization? How can we architect this in a way that sets us up for the future? The AI is very good at that inner loop of development: write code, run the code, write code, run the code. It's not very good at those big-picture tasks that have to empathize with the end user and take business-level objectives into account. And that's where we come in as software engineers.

So let's talk a little bit about what a coding agent actually is. I think this word "agent" gets thrown around a lot these days.
The meaning has started to drift over time, but at the core of it is this concept of agency: the idea of taking action out in the real world. And these are the main tools of a software engineer's job, right? We have a code editor to modify and navigate our codebase, a terminal to run the code we're writing, and a web browser to look up documentation and maybe copy and paste some code from Stack Overflow. These are the core tools of the job, and these are the tools we give to our agents to let them run their whole development loop.

I also want to contrast coding agents with some of the more tactical codegen tools out there. We started a couple of years ago with things like GitHub Copilot's autocomplete feature, where, literally wherever your cursor is pointed in the codebase, it's just filling out two or three more lines of code. Over time, things have gotten more and more agentic, more and more asynchronous. We got AI-powered IDEs that can take a few steps at a time without a developer interfering. And now you've got tools like Devin and OpenHands where you're really giving an agent one or two sentences describing what you want it to do; it goes off and works for 5, 10, 15 minutes on its own, and then comes back to you with a solution. This is a much more powerful way of working. You can get a lot done: you can send off multiple agents at once, and you can focus on communicating with your co-workers, or goofing off on Reddit, while these agents are working for you. It's a very different way of working, but a much more powerful one.

So I want to talk a little bit about how these agents work under the hood.
I feel like once you understand what's happening under the surface, it really helps you build an intuition for how to use agents effectively. At its core, an agent is a loop between a large language model and the external world. The large language model serves as the brain, and we repeatedly take actions in the external world, get some kind of feedback from the world, and pass that back into the LLM. Basically, at every step of this loop, we're asking the LLM: what's the next thing you want to do to get one step closer to your goal? It might say: OK, I want to read this file, I want to make this edit, I want to run this command, I want to look at this web page. We go out and take that action in the real world, get some kind of output, whether it's the contents of a web page or the output of a command, and then stick that back into the LLM for the next turn of the loop.

Let me talk a little bit about the core tools at the agent's disposal. The first one, again, is a code editor. You might think this is really simple; it actually turns out to be a fairly interesting problem. The naive solution would be to just give the old file to the LLM and have it output the entire new file. That's not a very efficient way to work, though. If you've got thousands of lines of code and you want to change just one line, you're going to waste a lot of tokens printing out all the lines that are staying the same. So most contemporary agents use a find-and-replace-type editor or a diff-based editor to allow the LLM to make tactical edits inside the file.
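The loop and the find-and-replace editor described above can be sketched in a few lines of Python. This is an illustration of the general pattern, not OpenHands' actual implementation; the `call_llm` callable and the action dictionary format are hypothetical stand-ins for a real model API and tool schema.

```python
def find_and_replace(path, old, new):
    """Tactical edit tool: swap one unique snippet instead of having
    the LLM reprint the whole file (saves tokens on large files)."""
    text = open(path).read()
    assert text.count(old) == 1, "snippet must appear exactly once"
    open(path, "w").write(text.replace(old, new))
    return f"Edited {path}"

def run_agent(goal, call_llm, max_steps=20):
    """Drive the LLM <-> world loop until the model says it is done."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_llm(history)          # LLM picks the next step
        if action["tool"] == "finish":
            return action["result"]
        elif action["tool"] == "edit":
            observation = find_and_replace(
                action["path"], action["old"], action["new"]
            )
        else:
            observation = f"unknown tool: {action['tool']}"
        # Feed the result back in for the next turn of the loop.
        history.append({"role": "tool", "content": observation})
    return None                             # step budget exhausted
```

A real agent would dispatch on many more tools (terminal, browser) and use a structured tool-calling API, but the shape, propose an action, execute it, append the observation, is the same.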
A lot of times they'll also provide an abstract syntax tree or some other way for the agent to navigate the codebase more effectively.

Next up is the terminal. Again, you would think text in, text out should be pretty simple, but a lot of questions pop up here. What do you do when there's a long-running command with no standard output for a long time? Do you kill it? Do you let the LLM wait? What happens if you want to run multiple commands in parallel, or run commands in the background? Maybe you want to start a server and then run curl against that server. Lots of really interesting problems crop up when you have an agent interacting with the terminal.

And then probably the most complicated tool is the web browser. Again, there's a naive solution here where the agent just gives you a URL and you give it back a bunch of HTML. That's very expensive, because there's a bunch of cruft inside that HTML that the LLM doesn't really need to see. We've had a lot of luck passing it accessibility trees, or converting the page to Markdown and passing that to the LLM, or allowing the LLM to scroll through the web page if there's a ton of content there. And if you start to add interaction, things get even more complicated. You can let the LLM write JavaScript against the page, or, and we've had a lot of luck with this, give it a screenshot of the page with labeled nodes so it can say what it wants to click on. This is an area of active research: we just had a contribution about a month ago that doubled our accuracy on web browsing. I would say this is definitely a space to watch.

And then I also want to talk about sandboxing.
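The long-running-command problem above is usually handled by bounding both wall time and output size before anything reaches the model. A minimal sketch, again an illustration of the pattern rather than OpenHands' actual terminal tool:

```python
import subprocess

def run_command(cmd, timeout=30, max_output=2000):
    """Run a shell command on the agent's behalf, bounding both how
    long it may run and how many characters of output the LLM sees."""
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Policy choice: kill and report, rather than letting the
        # agent's turn hang forever on a silent long-running command.
        return f"[command killed after {timeout}s with no result]"
    out = proc.stdout + proc.stderr
    if len(out) > max_output:
        # Truncate so one chatty command can't blow the context window.
        out = out[:max_output] + "\n...[output truncated]"
    return f"[exit code {proc.returncode}]\n{out}"
```

Real implementations go further, e.g. keeping a persistent shell session so `cd` and environment variables survive between steps, and supporting background processes so the agent can start a server and then curl it.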
Sandboxing is really important for agents, because if they're going to run autonomously for several minutes on their own, without you watching everything they're doing, you want to make sure they're not doing anything dangerous. So all of our agents run inside a Docker container by default. They're totally separated from your workstation, so there's no chance of one running rm -rf on your home directory. Increasingly, though, we're giving agents access to third-party APIs: you might give one a GitHub token or access to your AWS account. It's super important to make sure those credentials are tightly scoped and that you're following the principle of least privilege as you grant agents access to do these things.

All right, I want to move into some best practices. My biggest advice for folks who are just getting started is to start small. The best tasks are things that can be completed pretty quickly, a single commit, with a clear definition of done: you want the agent to be able to verify that the tests are passing, so it must have done it correctly, or that the merge conflicts have been resolved, and so on. And they should be tasks that are easy for you as an engineer to verify were done completely and correctly. I like to tell people to start with small chores. Very frequently you might have a pull request where one test is failing, or there are some lint errors, or there are merge conflicts: bits of toil that you don't really like doing as a developer. Those are great tasks to just shove off to the AI. They tend to be very rote, and the AI does them very well. But as your intuition grows, as you get used to working with an agent, you'll find that you can give it bigger and bigger tasks, and you'll understand how to communicate with the agent effectively.
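The container isolation idea above can be sketched as `docker run` command construction. The flags are real Docker CLI options; the image name, mount path, and the strict no-network policy are illustrative choices, not what OpenHands actually ships.

```python
def sandbox_command(image, project_dir, agent_cmd):
    """Build a `docker run` invocation that isolates an agent's work:
    only the project directory is mounted, resources are capped, and
    the container is discarded afterwards."""
    return [
        "docker", "run",
        "--rm",                                 # discard container on exit
        "--network", "none",                    # no network access at all
        "--memory", "2g",                       # cap resource usage
        "-v", f"{project_dir}:/workspace:rw",   # only the project is visible
        "-w", "/workspace",
        image,
        "sh", "-c", agent_cmd,
    ]
```

In practice an agent sandbox usually does need outbound network access, to install packages and call APIs, so `--network none` is stricter than most real setups; the essential point is that the host filesystem and host credentials stay out of reach, and anything the agent does need (like a scoped GitHub token) is injected deliberately.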
And I would say, for me, for my co-founders, and for our biggest power users: about 90% of my code now goes through the agent, and it's only maybe 10% of the time that I have to drop back into my IDE and get my hands dirty in the codebase again.

Being very clear with the agent about what you want is super important. I specifically like to say: you need to tell it not just what you want, but how you want it done. Mention specific frameworks you want it to use. If you want it to do a test-driven development strategy, tell it that. Mention any specific files or function names it can go for. This not only helps it be more accurate about what exactly you want the output to be, it also makes it go faster: it doesn't have to spend as long exploring the codebase if you tell it exactly which file to edit. This can save you a bunch of time and energy, and it can save a lot of tokens, a lot of actual inference cost.

I also like to remind folks that in an AI-driven development world, code is cheap. You can throw code away. You can experiment and prototype. If I have an idea on my walk to work, I'll just tell OpenHands with my voice to do X, Y, and Z, and when I get to work I'll have a PR waiting for me. 50% of the time I'll just throw it away; it didn't really work. 50% of the time it looks great, I just merge it, and it's awesome. It's really fun to be able to rapidly prototype using AI-driven development.

And I would also say: if you try to work with the agent on a particular task and it gets it wrong, maybe it's close, and you can just keep iterating within the same conversation, since it has already built up some context.
If it's way off, though, just throw away that work. Start fresh with a new prompt based on what you learned from the last one. It's a new sort of muscle memory you have to develop, just throwing things away. Sometimes it's hard to throw away tens of thousands of lines of generated code, because you're used to that being a very expensive bunch of code. These days it's very easy to just start from scratch.

This is probably the most important bit of advice I can give folks: you need to review the code that the AI writes. I've seen more than one organization run into trouble thinking they could just vibe code their way to a production application, automatically merging everything that came out of the AI. If you don't review anything, you'll find that your codebase just grows and grows with tech debt. You'll find duplicate code everywhere. Things get out of hand very quickly. So make sure you're reviewing the code it outputs, and make sure you're pulling the code and running it on your workstation, or running it inside an ephemeral environment, just to make sure the agent has actually solved the problem you asked it to solve. I like to say: trust but verify. As you work with agents over time, you'll build an intuition for what they do well and what they don't do well, and you can generally trust them to operate the same way today that they did yesterday. But you really do need a human in the loop.

One of our big learnings with OpenHands: in the early days, if you opened a pull request with OpenHands, that pull request would show up as owned by OpenHands, with the little hands logo next to it. And that caused two problems.
One, it meant that the human who had triggered that pull request could then approve it and basically bypass our whole code review system: you didn't need a second human in the loop before merging. And two, oftentimes those pull requests would just languish. Nobody would really take ownership of them. If there was a failing unit test, nobody was jumping in to make sure the test passed. They would just sit there and not get merged, or if they did get merged and something went wrong, and the code didn't actually work, we didn't really know who to go to; there was nobody we could hold accountable for that breakage. So now, if you open a pull request with OpenHands, your face is on that pull request. You're responsible for getting it merged. You're responsible for any breakage it might cause down the line.

Cool. I want to close by going through a handful of use cases. This is always a tricky topic, because agents are great generalists: they can hypothetically do anything, as long as you break things down into bite-sized steps they can take on. But in the spirit of starting small, I think there are a bunch of use cases that are really great day-one use cases for agents.

My favorite is resolving merge conflicts. This is the biggest chore in my job. OpenHands itself is a very fast-moving codebase; I'd say there's probably no PR I make that gets away with zero merge conflicts. And I love just being able to jump in and say: @OpenHands, fix the merge conflicts on this PR. It's such a rote task. It's usually very obvious what changed before, what changed in this PR, and what the intention behind those changes is, and OpenHands knocks this out 99% of the time.

Addressing PR feedback is also a favorite.
This one's great because somebody else has already taken the time to clearly articulate what they want changed, and all you have to do is say: @OpenHands, do what that guy said. And again, as you can see in this example, OpenHands did exactly what this person wanted. I don't know React super well, and our front-end engineer said do X, Y, and Z, mentioning a whole bunch of buzzwords that I don't know. OpenHands knew all of it and was able to address his feedback exactly how he wanted.

Fixing quick little bugs. You can see in this example, we had an input that was a text input and should have been a number input. If I wasn't lazy, I could have dug through my codebase and found the right file. But it was really easy to just quickly, I think I did this one directly from inside Slack, add @OpenHands: fix this thing we were just talking about. I didn't even have to fire up my IDE. It's just a really fun way to work.

Infrastructure changes I really like. Usually these involve looking up some really esoteric syntax inside the Terraform docs or something like that. OpenHands, and the underlying LLMs, tend to just know the right Terraform syntax, and if not, they can look up the documentation using the browser. So this stuff is really great. Sometimes we'll just get an out-of-memory exception in Slack and immediately say: OK, OpenHands, increase the memory.

Database migrations are another great one. This is one where I find I often leave best practices behind: I won't put indexes on the right things, I won't set up foreign keys the right way. The LLM tends to be really great about following all the best practices around database migrations. So again, it's kind of a rote task for developers, it's not very fun, and the LLM's great at it.
Fixing failing tests, like on a PR: if you've already got the code 90% of the way there and there's just a unit test failing because there was a breaking API change, it's very easy to call in an agent to just clean up the failing tests.

Expanding test coverage is another one I love, because it's a very safe task, right? As long as the tests are passing, it's generally safe to just merge. So if you notice a spot in your codebase where you're like, "Hey, we have really low coverage here," just ask your agent to expand your test coverage in that area of the codebase. It's a great quick win to make your codebase a little bit safer.

Then, everybody's favorite: building apps from scratch. I would say, if you're shipping production code, again, don't just vibe code your way to a production application. But we're finding increasingly, internally at our company, that a lot of times there's a little internal app we want to build. For instance, we built a way to debug OpenHands trajectories, debug OpenHands sessions: a whole web application. Since it's just an internal application, we can vibe code it a little bit. We don't really need to review every line of code; it's not facing end users. This has been a really fun thing for our business, to be able to turn out these really quick applications just to serve our own internal needs. So yeah, greenfield is a great use case for agents.

That's all I've got. I would love to have you all join the OpenHands community. You can find us on GitHub at All-Hands-AI/OpenHands. Join us on Slack or Discord. We'd love to build with you.