Software Development Agents: What Works and What Doesn't - Robert Brennan, OpenHands
Channel: aiDotEngineer
Published at: 2025-07-25
YouTube video id: o_hhkJtlbSs
Source: https://www.youtube.com/watch?v=o_hhkJtlbSs
Today I'm going to talk about coding agents and how to use them effectively. If you're anything like me, you've found a lot of things that work really well and a lot of things that don't work very well.

A little bit about me: my name is Robert Brennan. I've been building open-source development tools for over a decade now, and my team and I have created an open-source software development agent called OpenHands, formerly known as OpenDevin.

To state the obvious, in 2025 software development is changing. Our jobs are very different now than they were two years ago, and they're going to be very different two years from now. The thing I want to convince you of is that coding is going away: we're going to be spending a lot less time actually writing code. But that doesn't mean that software engineering is going away. We're paid not to type on our keyboards, but to think critically about the problems in front of us. If we do AI-driven development correctly, it'll mean we spend less time leaning forward and squinting into our IDEs and more time sitting back in our chairs and thinking: what does the user actually want here? What are we actually trying to build? What problems are we trying to solve as an organization? How can we architect this in a way that sets us up for the future? The AI is very good at that inner loop of development: write code, run the code, write code, run the code. It's not very good at those big-picture tasks that have to empathize with the end user and take business-level objectives into account. And that's where we come in as software engineers.

So let's talk a little bit about what a coding agent actually is. I think this word "agent" gets thrown around a lot these days.
The meaning has started to drift over time, but at the core of it is this concept of agency: the idea of taking action out in the real world. And these are the main tools of a software engineer's job, right? We have a code editor to modify and navigate our codebase, a terminal to run the code we're writing, and a web browser to look up documentation and maybe copy and paste some code from Stack Overflow. These are the core tools of the job, and these are the tools we give to our agents to let them run their whole development loop.

I also want to contrast coding agents with some of the more tactical codegen tools out there. We started a couple of years ago with things like GitHub Copilot's autocomplete feature, where, literally wherever your cursor is pointed in the codebase, it's just filling out two or three more lines of code. Over time, things have gotten more and more agentic, more and more asynchronous. We got AI-powered IDEs that can take a few steps at a time without a developer interfering. And now you've got tools like Devin and OpenHands where you're really giving an agent one or two sentences describing what you want it to do; it goes off and works for 5, 10, 15 minutes on its own, and then comes back to you with a solution. This is a much more powerful way of working. You can get a lot done: you can send off multiple agents at once, and you can focus on communicating with your co-workers, or goofing off on Reddit, while these agents are working for you. It's a very different way of working, but a much more powerful one.

So I want to talk a little bit about how these agents work under the hood.
I feel like once you understand what's happening under the surface, it really helps you build an intuition for how to use agents effectively. At its core, an agent is a loop between a large language model and the external world. The large language model serves as the brain, and we repeatedly take actions in the external world, get some kind of feedback from the world, and pass that back into the LLM. Basically, at every step of this loop, we're asking the LLM: what's the next thing you want to do to get one step closer to your goal? It might say: OK, I want to read this file, I want to make this edit, I want to run this command, I want to look at this web page. We go out and take that action in the real world, get some kind of output, whether it's the contents of a web page or the output of a command, and then stick that back into the LLM for the next turn of the loop.

Let me talk a little bit about the core tools at the agent's disposal. The first one, again, is a code editor. You might think this is really simple; it actually turns out to be a fairly interesting problem. The naive solution would be to just give the old file to the LLM and have it output the entire new file. That's not a very efficient way to work, though. If you've got thousands of lines of code and you want to change just one line, you're going to waste a lot of tokens printing out all the lines that are staying the same. So most contemporary agents use a find-and-replace-type editor or a diff-based editor to allow the LLM to make tactical edits inside the file.
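The loop and the find-and-replace editor described above can be sketched in a few lines of Python. This is an illustration of the general pattern, not OpenHands' actual implementation; the `call_llm` callable and the action dictionary format are hypothetical stand-ins for a real model API and tool schema.

```python
def find_and_replace(path, old, new):
    """Tactical edit tool: swap one unique snippet instead of having
    the LLM reprint the whole file (saves tokens on large files)."""
    text = open(path).read()
    assert text.count(old) == 1, "snippet must appear exactly once"
    open(path, "w").write(text.replace(old, new))
    return f"Edited {path}"

def run_agent(goal, call_llm, max_steps=20):
    """Drive the LLM <-> world loop until the model says it is done."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_llm(history)          # LLM picks the next step
        if action["tool"] == "finish":
            return action["result"]
        elif action["tool"] == "edit":
            observation = find_and_replace(
                action["path"], action["old"], action["new"]
            )
        else:
            observation = f"unknown tool: {action['tool']}"
        # Feed the result back in for the next turn of the loop.
        history.append({"role": "tool", "content": observation})
    return None                             # step budget exhausted
```

A real agent would dispatch on many more tools (terminal, browser) and use a structured tool-calling API, but the shape, propose an action, execute it, append the observation, is the same.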
A lot of times they'll also provide an abstract syntax tree or some other way for the agent to navigate the codebase more effectively.

Next up is the terminal. Again, you would think text in, text out should be pretty simple, but a lot of questions pop up here. What do you do when there's a long-running command with no standard output for a long time? Do you kill it? Do you let the LLM wait? What happens if you want to run multiple commands in parallel, or run commands in the background? Maybe you want to start a server and then run curl against that server. Lots of really interesting problems crop up when you have an agent interacting with the terminal.

And then probably the most complicated tool is the web browser. Again, there's a naive solution here where the agent just gives you a URL and you give it back a bunch of HTML. That's very expensive, because there's a bunch of cruft inside that HTML that the LLM doesn't really need to see. We've had a lot of luck passing it accessibility trees, or converting the page to Markdown and passing that to the LLM, or allowing the LLM to scroll through the web page if there's a ton of content there. And if you start to add interaction, things get even more complicated. You can let the LLM write JavaScript against the page, or, and we've had a lot of luck with this, give it a screenshot of the page with labeled nodes so it can say what it wants to click on. This is an area of active research: we just had a contribution about a month ago that doubled our accuracy on web browsing. I would say this is definitely a space to watch.

And then I also want to talk about sandboxing.
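The long-running-command problem above is usually handled by bounding both wall time and output size before anything reaches the model. A minimal sketch, again an illustration of the pattern rather than OpenHands' actual terminal tool:

```python
import subprocess

def run_command(cmd, timeout=30, max_output=2000):
    """Run a shell command on the agent's behalf, bounding both how
    long it may run and how many characters of output the LLM sees."""
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Policy choice: kill and report, rather than letting the
        # agent's turn hang forever on a silent long-running command.
        return f"[command killed after {timeout}s with no result]"
    out = proc.stdout + proc.stderr
    if len(out) > max_output:
        # Truncate so one chatty command can't blow the context window.
        out = out[:max_output] + "\n...[output truncated]"
    return f"[exit code {proc.returncode}]\n{out}"
```

Real implementations go further, e.g. keeping a persistent shell session so `cd` and environment variables survive between steps, and supporting background processes so the agent can start a server and then curl it.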
Sandboxing is really important for agents, because if they're going to run autonomously for several minutes on their own, without you watching everything they're doing, you want to make sure they're not doing anything dangerous. So all of our agents run inside a Docker container by default. They're totally separated from your workstation, so there's no chance of one running rm -rf on your home directory. Increasingly, though, we're giving agents access to third-party APIs: you might give one a GitHub token or access to your AWS account. It's super important to make sure those credentials are tightly scoped and that you're following the principle of least privilege as you grant agents access to do these things.

All right, I want to move into some best practices. My biggest advice for folks who are just getting started is to start small. The best tasks are things that can be completed pretty quickly, a single commit, with a clear definition of done: you want the agent to be able to verify that the tests are passing, so it must have done it correctly, or that the merge conflicts have been resolved, and so on. And they should be tasks that are easy for you as an engineer to verify were done completely and correctly. I like to tell people to start with small chores. Very frequently you might have a pull request where one test is failing, or there are some lint errors, or there are merge conflicts: bits of toil that you don't really like doing as a developer. Those are great tasks to just shove off to the AI. They tend to be very rote, and the AI does them very well. But as your intuition grows, as you get used to working with an agent, you'll find that you can give it bigger and bigger tasks, and you'll understand how to communicate with the agent effectively.
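The container isolation idea above can be sketched as `docker run` command construction. The flags are real Docker CLI options; the image name, mount path, and the strict no-network policy are illustrative choices, not what OpenHands actually ships.

```python
def sandbox_command(image, project_dir, agent_cmd):
    """Build a `docker run` invocation that isolates an agent's work:
    only the project directory is mounted, resources are capped, and
    the container is discarded afterwards."""
    return [
        "docker", "run",
        "--rm",                                 # discard container on exit
        "--network", "none",                    # no network access at all
        "--memory", "2g",                       # cap resource usage
        "-v", f"{project_dir}:/workspace:rw",   # only the project is visible
        "-w", "/workspace",
        image,
        "sh", "-c", agent_cmd,
    ]
```

In practice an agent sandbox usually does need outbound network access, to install packages and call APIs, so `--network none` is stricter than most real setups; the essential point is that the host filesystem and host credentials stay out of reach, and anything the agent does need (like a scoped GitHub token) is injected deliberately.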
And I would say, for me, for my co-founders, and for our biggest power users: about 90% of my code now goes through the agent, and it's only maybe 10% of the time that I have to drop back into my IDE and get my hands dirty in the codebase again.

Being very clear with the agent about what you want is super important. I specifically like to say: you need to tell it not just what you want, but how you want it done. Mention specific frameworks you want it to use. If you want it to do a test-driven development strategy, tell it that. Mention any specific files or function names it can go for. This not only helps it be more accurate about what exactly you want the output to be, it also makes it go faster: it doesn't have to spend as long exploring the codebase if you tell it exactly which file to edit. This can save you a bunch of time and energy, and it can save a lot of tokens, a lot of actual inference cost.

I also like to remind folks that in an AI-driven development world, code is cheap. You can throw code away. You can experiment and prototype. If I have an idea on my walk to work, I'll just tell OpenHands with my voice to do X, Y, and Z, and when I get to work I'll have a PR waiting for me. 50% of the time I'll just throw it away; it didn't really work. 50% of the time it looks great, I just merge it, and it's awesome. It's really fun to be able to rapidly prototype using AI-driven development.

And I would also say: if you try to work with the agent on a particular task and it gets it wrong, maybe it's close, and you can just keep iterating within the same conversation, since it has already built up some context.
If it's way off, though, just throw away that work. Start fresh with a new prompt based on what you learned from the last one. It's a new sort of muscle memory you have to develop, just throwing things away. Sometimes it's hard to throw away tens of thousands of lines of generated code, because you're used to that being a very expensive bunch of code. These days it's very easy to just start from scratch.

This is probably the most important bit of advice I can give folks: you need to review the code that the AI writes. I've seen more than one organization run into trouble thinking they could just vibe code their way to a production application, automatically merging everything that came out of the AI. If you don't review anything, you'll find that your codebase just grows and grows with tech debt. You'll find duplicate code everywhere. Things get out of hand very quickly. So make sure you're reviewing the code it outputs, and make sure you're pulling the code and running it on your workstation, or running it inside an ephemeral environment, just to make sure the agent has actually solved the problem you asked it to solve. I like to say: trust but verify. As you work with agents over time, you'll build an intuition for what they do well and what they don't do well, and you can generally trust them to operate the same way today that they did yesterday. But you really do need a human in the loop.

One of our big learnings with OpenHands: in the early days, if you opened a pull request with OpenHands, that pull request would show up as owned by OpenHands, with the little hands logo next to it. And that caused two problems.
One, it meant that the human who had triggered that pull request could then approve it and basically bypass our whole code review system: you didn't need a second human in the loop before merging. And two, oftentimes those pull requests would just languish. Nobody would really take ownership of them. If there was a failing unit test, nobody was jumping in to make sure the test passed. They would just sit there and not get merged, or if they did get merged and something went wrong, and the code didn't actually work, we didn't really know who to go to; there was nobody we could hold accountable for that breakage. So now, if you open a pull request with OpenHands, your face is on that pull request. You're responsible for getting it merged. You're responsible for any breakage it might cause down the line.

Cool. I want to close by going through a handful of use cases. This is always a tricky topic, because agents are great generalists: they can hypothetically do anything, as long as you break things down into bite-sized steps they can take on. But in the spirit of starting small, I think there are a bunch of use cases that are really great day-one use cases for agents.

My favorite is resolving merge conflicts. This is the biggest chore in my job. OpenHands itself is a very fast-moving codebase; I'd say there's probably no PR I make that gets away with zero merge conflicts. And I love just being able to jump in and say: @OpenHands, fix the merge conflicts on this PR. It's such a rote task. It's usually very obvious what changed before, what changed in this PR, and what the intention behind those changes is, and OpenHands knocks this out 99% of the time.

Addressing PR feedback is also a favorite.
This one's great because somebody else has already taken the time to clearly articulate what they want changed, and all you have to do is say: @OpenHands, do what that guy said. And again, as you can see in this example, OpenHands did exactly what this person wanted. I don't know React super well, and our front-end engineer said do X, Y, and Z, mentioning a whole bunch of buzzwords that I don't know. OpenHands knew all of it and was able to address his feedback exactly how he wanted.

Fixing quick little bugs. You can see in this example, we had an input that was a text input and should have been a number input. If I wasn't lazy, I could have dug through my codebase and found the right file. But it was really easy to just quickly, I think I did this one directly from inside Slack, add @OpenHands: fix this thing we were just talking about. I didn't even have to fire up my IDE. It's just a really fun way to work.

Infrastructure changes I really like. Usually these involve looking up some really esoteric syntax inside the Terraform docs or something like that. OpenHands, and the underlying LLMs, tend to just know the right Terraform syntax, and if not, they can look up the documentation using the browser. So this stuff is really great. Sometimes we'll just get an out-of-memory exception in Slack and immediately say: OK, OpenHands, increase the memory.

Database migrations are another great one. This is one where I find I often leave best practices behind: I won't put indexes on the right things, I won't set up foreign keys the right way. The LLM tends to be really great about following all the best practices around database migrations. So again, it's kind of a rote task for developers, it's not very fun, and the LLM's great at it.
Fixing failing tests, like on a PR: if you've already got the code 90% of the way there and there's just a unit test failing because there was a breaking API change, it's very easy to call in an agent to just clean up the failing tests.

Expanding test coverage is another one I love, because it's a very safe task, right? As long as the tests are passing, it's generally safe to just merge. So if you notice a spot in your codebase where you're like, "Hey, we have really low coverage here," just ask your agent to expand your test coverage in that area of the codebase. It's a great quick win to make your codebase a little bit safer.

Then, everybody's favorite: building apps from scratch. I would say, if you're shipping production code, again, don't just vibe code your way to a production application. But we're finding increasingly, internally at our company, that a lot of times there's a little internal app we want to build. For instance, we built a way to debug OpenHands trajectories, debug OpenHands sessions: a whole web application. Since it's just an internal application, we can vibe code it a little bit. We don't really need to review every line of code; it's not facing end users. This has been a really fun thing for our business, to be able to turn out these really quick applications just to serve our own internal needs. So yeah, greenfield is a great use case for agents.

That's all I've got. I would love to have you all join the OpenHands community. You can find us on GitHub at All-Hands-AI/OpenHands. Join us on Slack or Discord. We'd love to build with you.