Automating Large Scale Refactors with Parallel Agents - Robert Brennan, OpenHands
Channel: aiDotEngineer
Published at: 2026-01-08
YouTube video id: rcsliSIy_YU
Source: https://www.youtube.com/watch?v=rcsliSIy_YU
All right, thank you all for joining for Automating Massive Refactors with Parallel Agents. I'm super excited to talk to you all today about what we're doing with OpenHands to automate large-scale chunks of software engineering work. There's a lot of toil related to tech debt, code maintenance, and code modernization. These tasks are super automatable: you can throw agents at them, but they tend to be way too big for a single one-shot. So it involves a lot of what we call agent orchestration. We're going to talk about how we do that with OpenHands, and also just more generically.

A little bit about me: my name is Robert Brennan. I'm the co-founder and CEO at OpenHands. My background is in dev tooling; I've been working in open source dev tools for over a decade now, and in natural language processing for about the same amount of time. I've been really excited over the last few years to see those two fields suddenly converge as LLMs have gotten really good at writing code, and I'm super excited to be working in this space.

OpenHands is an MIT-licensed coding agent. OpenHands started as OpenDevin about a year and a half ago, when Devin first launched their demo video of a fully autonomous software engineering agent. My co-founders and I saw that and got super excited about what was possible and what the future of software engineering might look like, but we realized that that shouldn't happen in a black box. If our jobs are going to change, we want that change to be driven by the software development community; we want to have a say in that change. So we started OpenDevin as a way to give the community a way to help drive what the future of software engineering might look like in an AI-powered world.

Hopefully it's not controversial for me to say that software development is changing. I know my workflow has changed a great deal in the last year and a half. I would say now pretty much every line of code that I write goes through an agent. Rather than opening up my IDE and typing out lines of code, I'm asking an agent to do the work for me. I'm still doing a lot of critical thinking, and a lot of the mentality of the job hasn't changed, but what the actual work looks like has changed quite a bit.

What I want to convince you all of is that it's still changing. We're still in the first innings of this change. We haven't yet realized all the impact that large language models have already brought to the job, and they're going to continue to bring more as they improve. I would say even if you froze large language models today and they didn't get any better, you would still see the job of software engineering change drastically over the next two to three years as we figure out ways to operationalize the technology. There are still a lot of psychological and organizational hurdles to adopting large language models within software engineering, and we're seeing those hurdles disappear as time goes on.

A brief history of how we got here. Everything started with what I call context-unaware code snippets. It turned out some of the first large language models were very good at writing chunks of code, especially things they'd seen over and over again.
So you could ask it to write bubble sort. You could ask it for small algorithms, how to access a SQL database, things like that. It was able to generate little bits of code, and it seemed to understand the logic a bit. But this was totally context-unaware: it was just dropping code you had asked for into a chat window. It had no idea what project you were working on or what the context was.

Shortly thereafter we got context-aware code generation. GitHub Copilot as autocomplete was probably the best example here. It was actually in your IDE; it could see where you were typing and the code you were working on, and it could generate code specific to your codebase, code that referenced your local variable names or the table names in your database. That was a huge improvement for our productivity. Instead of copy-pasting back and forth between the ChatGPT window and your IDE, all of a sudden the little robot gets its eyes: it can see inside your codebase and generate relevant code for it.

Then I think the giant leap happened in early 2024, with the launch of Devin and, the next day, the launch of OpenDevin, now OpenHands. This is where we first started to see autonomous coding agents. This is when AI started not just writing code but running the code it wrote. It could Google an error message that came out, find a Stack Overflow article, apply that to the code, add some debug statements, run it, and see what happens: basically automating the entire inner loop of development. This was a huge step function forward. You can see the little robot gets arms in this picture. This was a huge jump, at least in my own productivity: being able to write a couple sentences of English, give it to an agent, and let it churn through the task until it's got something that's actually working, running, with tests passing.

And now what we're seeing is parallel agents, what we're calling agent orchestration. Folks are figuring out how to get multiple agents working in parallel, sometimes talking to each other, sometimes spinning up new agents under the hood: agents creating agents. This is, I would say, the bleeding edge of what's possible. People are just starting to experiment with this and just starting to see success with it at scale, but there are some really good tasks that are very amenable to this sort of workflow, and it has the potential to automate away the huge mountain of tech debt that sits under every contemporary software company.

A little bit about the market landscape here. Again, you can see that same evolution from left to right, where we started with plugins like GitHub Copilot inside our existing IDEs, and then got these AI-empowered IDEs, IDEs with AI tacked onto them. I would say your median developer is adopting local agents now. They may be running Claude Code locally for one or two things, maybe some ad hoc tasks. Your early adopters, though, are starting to look at cloud-based agents: agents that get their own sandbox running in the cloud.
This allows those early adopters to run as many agents as they want in parallel, and it allows them to run those agents much more autonomously than if they were running on their local laptop. If it's running on your local laptop, there's nothing stopping the agent from doing rm -rf /, trying to delete everything in your home directory, or installing some weird software. Whereas if it's got its own containerized environment somewhere in the cloud, you can run it a little more safely, knowing the worst it can do is ruin its own environment, and you don't have to sit there babysitting it and hitting the Y key every time it wants to run a command. So those cloud-based environments are much more scalable and a bit more secure.

Then at the far right, what we're just seeing the top 1% of early adopters start to experiment with is orchestration: the idea that you not only have these agents running in the cloud, but you have them talking to each other. You're coordinating those agents on a larger task; maybe those agents are spinning up sub-agents within the cloud that have their own sandbox environments. There's some really cool stuff happening there.

With OpenHands, we generally started with cloud agents. We've since leaned back a little and built a local CLI, similar to Claude Code, in order to meet developers where they are today. These types of experiences are much more comfortable for developers: we'd been using autocomplete for decades, and it just got a million times better with GitHub Copilot. The experiences on the right side of this landscape are very foreign to developers. It feels very strange to hand off a task to an agent, or a fleet of agents, and let them do the work for you. For me, at least, going from writing code myself to handing that work to agents feels like the jump I made when I went from being an IC to being a manager. It's a very different way of working, and one developers have been slow to adopt. But again, the top 1% or so of engineers we've seen adopt the stuff on the right side of this landscape have been able to get massive lifts in productivity and tackle huge backlogs of tech debt that other teams just weren't getting to.

Some examples of where you would want to use orchestration rather than a single agent: typically these are tasks that are very repeatable and very automatable. Some examples are basic code maintenance tasks. In every codebase there's a certain amount of work to do just to keep the lights on: keeping dependencies up to date, making sure any vulnerabilities get resolved. We have one client, for instance, that is using OpenHands to remediate CVEs throughout their entire codebase. They have tens of thousands of developers and thousands and thousands of repositories. Basically, every time a new vulnerability gets announced in an open source project, they have to go through their entire codebase, figure out which of their repos are vulnerable, and submit a pull request to each affected repo to actually resolve the CVE: update whatever dependency, fix breaking API changes. They have seen a 30x improvement in time-to-resolution for these CVEs by doing orchestration at scale. They basically have a setup now where every time a CVE gets announced and a new vulnerability comes in, they kick off an OpenHands session to scan a repo for that vulnerability, make any code changes that are necessary, and open up a pull request. All the downstream team has to do is validate the changes and click merge.
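As a rough sketch of what that outer loop looks like, here's a minimal illustration in Python. Everything in it (the Advisory shape, the repo index, launch_remediation_session) is a hypothetical stand-in; the transcript doesn't specify the client's actual vulnerability feed or agent-launching API.

```python
"""Sketch of the CVE-remediation outer loop described above.
All names here are illustrative stand-ins, not real OpenHands APIs."""
from dataclasses import dataclass

@dataclass
class Advisory:
    id: str       # e.g. "CVE-2024-0001"
    package: str  # the affected open source package

def repos_using(package: str, repo_index: dict[str, set[str]]) -> list[str]:
    """Return repos whose dependency set contains the affected package."""
    return [repo for repo, deps in repo_index.items() if package in deps]

def launch_remediation_session(repo: str, task: str) -> None:
    """Placeholder: in practice this would start an agent session."""
    print(f"[{repo}] launching agent: {task}")

def remediate(advisories: list[Advisory], repo_index: dict[str, set[str]]) -> None:
    # One agent session per (advisory, affected repo): scan, bump the
    # dependency, fix breaking API changes, open a PR for the owning team.
    for adv in advisories:
        for repo in repos_using(adv.package, repo_index):
            launch_remediation_session(
                repo,
                f"{adv.id} affects {adv.package}. Upgrade to a fixed version, "
                "fix any breaking API changes, and open a pull request.",
            )

if __name__ == "__main__":
    remediate([Advisory("CVE-2024-0001", "log4j")],
              {"payments-service": {"log4j", "guava"}, "web-ui": {"react"}})
```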
You can also do this for things like automating documentation and release notes. There's also a bunch of modernization challenges that companies face. For instance, you might want to add type annotations to your Python 3 codebase. You might want to split a Java monolith into microservices. These are the sorts of tasks that will still take a lot of thought from an engineer; you can't just one-shot it and say "refactor my monolith into microservices." But a lot of it is still very rote work: you're still mostly copying and pasting code around. So if you thoughtfully orchestrate agents together, they can do this.

There's also a lot of migration work, like migrating from old versions of Java to new versions of Java. We're working with one client to migrate a bunch of Spark 2 jobs to Spark 3. We've used OpenHands to migrate our entire front end from Redux to Zustand. So you can do these very large migrations. Again, it's a lot of rote work, but it still takes real thinking from a human about how to orchestrate the agents. And there's a lot of tech debt work, like detecting unused code and getting rid of it. We have one client who's using our SDK to scan their Datadog logs, and every time there's a new error pattern, go into the codebase, add error handling, and fix whatever problem is cropping up. So: lots of things that are a little too big for a single agent to one-shot, but that are super automatable and are good tasks to handle with agents, as long as you're thoughtful about orchestrating them.

A bit about why these aren't one-shottable tasks. Some of the reasons are technological problems; some are more like human, psychological problems. On the technology side, you have a limited amount of context you can give the agent. So for extremely long-running tasks, or tasks that span a very large codebase, you usually don't have enough context window; you're going to have to compact that context to the point where the agent might get lost. We've all seen the laziness problem: I've tried to launch some of these types of tasks, and the agent will say, "Okay, I migrated three of your 100 services. I need to hire a team of six people to do the rest." The agents often lack domain knowledge within your codebase; they don't have the same intuition for the problem that you do. And errors compound when you go on these really long trajectories with an agent. A tiny error at the beginning compounds over time: the agent will repeat that error over and over again for every single step it takes in its task.

Then on the human side, we have intuition for the problem that we can't convey. Say you want to break your monolith into microservices. You probably have a mental model of how that's going to work. If you just tell the agent "break the monolith into microservices," it's just going to take a shot in the dark
based on patterns it's seen in the past, without any real understanding of your codebase. We also have difficulty decomposing tasks for agents and understanding what an agent can actually get done in one shot. You also need intermediate review, an intermediate check-in from the human, as the agent is doing its work; we'll talk a little about what that loop looks like later, but again, it's not something where you can just tell an agent what to do and expect the final result to come in. You have to approve things as the agent goes along. And then there's not having a true definition of done: if you don't really know what finished looks like for this project, it's hard to tell the agent.

On these types of orchestration approaches, I want to make it super clear that we don't expect every developer to be doing agent orchestration. We think most developers are going to use a single agent locally for the sort of ad hoc tasks that are common for engineers: building new features, fixing a bug, things like that. I think running Claude Code locally, in a familiar environment alongside an IDE, is probably going to be a common workflow for at least the next couple of years. What we're seeing is that a small percentage of engineers, early adopters who are really excited about agents, are finding ways to orchestrate agents to tackle huge mountains of tech debt at scale, and they get a much bigger lift in productivity for that smaller, select set of tasks. You're not going to see a 3,000% lift in productivity for all of software engineering; you're probably going to get more of that 20% lift that everybody's been reporting. But for some select tasks, like CVE remediation or codebase modernization, you can get a massive lift. You can do engineering-years of work in a couple of weeks.

I want to talk a little about what these workflows look like in practice. This loop probably looks pretty familiar if you're used to working with local agents; it's a very typical loop that looks a lot like the inner loop of development for non-AI coding as well. Basically, you give the agent a prompt, and it does some work in the background. Maybe you babysit it and watch everything it's doing, hitting the Y key every time it wants to run a command. Then the agent finishes and you look at the output. You see whether the tests are passing and whether this actually satisfies what you asked for, and then maybe you prompt the agent again to get it a little closer to the answer. Or maybe you're satisfied with the result, so you commit and push.

For bigger orchestrated tasks, this becomes a little more complicated. Basically, you, maybe hand-in-hand with Claude, want to decompose your task into a series of tasks that can be executed individually by agents. Then you'll send off an agent for each one of those individual tasks. And finally, at the end, you, maybe with the help of an agent, are going to need to pull the output from all those individual agents together into a single change and merge that into your codebase. Very importantly, there's still a lot of human-in-the-loop here: you need to review not just the final collated result but the intermediate outputs from each agent.
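Here's a minimal sketch of that decompose/dispatch/collate loop, using a thread pool as a stand-in for running one agent per subtask. The decompose and run_agent functions are illustrative placeholders, not OpenHands APIs.

```python
"""Sketch of the decompose -> dispatch -> collate loop described above.
run_agent() is a placeholder for starting a real agent session."""
from concurrent.futures import ThreadPoolExecutor, as_completed

def decompose(big_task: str) -> list[str]:
    # In practice you (perhaps with an agent's help) break the task into
    # subtasks a single agent can one-shot; hardcoded here for brevity.
    return [f"{big_task}: migrate module {m}" for m in ("auth", "billing", "ui")]

def run_agent(subtask: str) -> str:
    # Placeholder: dispatch one agent and return its output (e.g. a PR link).
    return f"PR for '{subtask}'"

def orchestrate(big_task: str, max_parallel: int = 3) -> list[str]:
    subtasks = decompose(big_task)
    results = []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(run_agent, t): t for t in subtasks}
        for fut in as_completed(futures):
            output = fut.result()
            # Human-in-the-loop: review each intermediate output,
            # not just the final collated change.
            print(f"review needed: {output}")
            results.append(output)
    return results  # collate into a single change and merge

if __name__ == "__main__":
    orchestrate("Redux -> Zustand migration")
```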
I like to tell folks the goal is not to automate this process 100%; it's something like 90% automation, and that's still an order-of-magnitude productivity lift. I think this is really tricky to get right. This is where a lot of the thought comes into the process: how am I going to break the task down so that I can verify each individual step, and so that I can automate the whole process without just ending up with a vibe-coded mess?

This is a typical git workflow that I like to use for tasks like this. Typically we'll start a new branch on the repository. We might add some high-level context to that branch using an agents.md or, in OpenHands, the concept of a micro agent: just a markdown file explaining, here's what we're doing, so the agent knows, okay, we're migrating from Redux to Zustand, or we're going to migrate these Spark 2 jobs to Spark 3. You might want to put some kind of scaffolding in place; I'll talk more about examples of scaffolding later. Then you're going to create a bunch of agents based on that first branch. The idea is that they're going to submit their work into that branch, so it accumulates the work as we go along, and eventually, once we get to the end, we can rip out the scaffolding and merge that branch into main.

Now, if you're just getting started with this, I would suggest limiting yourself to about three to five concurrent agents; I find that with more than that, your brain starts to break. But for folks who have really adopted orchestration at scale, we see them running hundreds, even thousands, of agents concurrently. Usually one human is not on the hook to review every single one; maybe those agents are sending out pull requests to individual teams, things like that. So you can scale up very aggressively once you get a feel for how all this works and you have a good way of getting human input into the loop.

I'm going to kick it over to my coworker Calvin here. He's going to talk about a very large-scale migration, basically eliminating code smells from the OpenHands codebase, that he did using our refactor tooling built on the SDK.

OpenHands excels at solving scoped tasks. Give it a focused problem, something like "fix my failing CI" or "add and debug this endpoint," and it delivers. But like all agents, it can stumble when the scope grows too large. Let's say I want to refactor an entire codebase: maybe enforce strict typing, update a core dependency, or even migrate from one framework to another. These are not small tasks. They're sprawling, interconnected changes that can touch hundreds of files. To battle problems at this scale, we're using the OpenHands agent SDK to build tools designed specifically to orchestrate collaboration between humans and multiple agents.

As an example, let's work to eliminate code smells from the OpenHands repo. Here's the repository structure. Just the core agent definition has about 380 files spanning 60,000 lines of code. That says a lot about the volume of the code, but not much about the structure. So let's use our new tools to visualize the dependency graph of this chunk of the repository. Here, each node represents a file, and the edges show dependencies: who imports whom. As we keep zooming out, it becomes clear that this tangled web is why refactoring at scale is hard.
To make this manageable, we need to break the graph up into human-sized chunks: think PR-sized batches that an agent can handle and a human can understand. There are many ways to batch, based on what's important to you. Graph-theoretic algorithms give strong guarantees about the structure of the edges between the induced batches, but for our purposes, we can simply use the existing directory structure to make sure that semantically related files land in the same batch. Navigating back to the dependency graph, we can see that the colors of the nodes are no longer randomly distributed; instead, they correspond to the batch each file belongs to. Zooming out and zooming back in, we easily find clusters of adjacent nodes that are all the same color, which indicates that an agent working on that batch will be touching all of those files together.

Of course, this graph is still large and incredibly tangled. To construct a simpler view, we'll build a new graph where the nodes are batches and the edges between those nodes are the dependencies inherited from the files within each batch. This view is much simpler; we can see the entire structure on screen at once. And this is something we can work with. Using this graph, we can identify batches that have no dependencies and inspect the files they contain. This batch, for example, looks like it's a single init file, probably empty. Let's check. Now, this is a tool intended for human-AI collaboration, so once we know this file is empty, we might decide it's better to move it elsewhere. Or maybe we're okay keeping it inside this batch, and all we want to do is add a note to ourselves so we know its contents.

Of course, when refactoring code, it's important to consider the complexity of what you're moving. This batch is trivial; let's find one that's a little more complex. Here's a batch with four files that all do real work, and the complexity measures reflect this. These are useful for indicating to a human that we should be more careful here.

Before fixing anything, you need to identify what's wrong in the first place. Enter the verifier. There are several different ways of defining the verifier, based on what you care about. You can make it programmatic, so it calls a bash command; this is useful if your verification is running unit tests, a linter, or a type checker. Instead, though, because I'm interested in code smells, I'm going to use a language model that looks at the code and tries to identify problematic patterns based on a set of rules I provided. Now, let's go back to our first batch and actually put this verifier to use. Remember, this batch is trivial, and fortunately the verifier recognizes it as such. It comes back with a nice little report of what it looked for, finding nothing, and the status of this batch turns to completed: green. Good. This change in status is also reflected in the batch graph. Navigating back and toggling the color display, we can see that we have exactly one node out of many completed, and the rest are still to be handled. But this already gives us a really good sense of the work we've done and how it fits into the bigger picture.

So now our strategy for ensuring there are no code smells in the entirety of our repository is straightforward: we just have to ensure that every single node on this batch graph turns green. Let's go back to our batches and continue verifying until we run across a failure.
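To make the batching idea concrete, here's a small sketch of directory-based batching plus a programmatic verifier. It uses ruff as a stand-in linter; the demo's actual tooling (including its LLM-based verifier) is more involved than this.

```python
"""Sketch of directory-based batching plus a programmatic verifier,
as described above. The names and choice of linter are illustrative."""
import subprocess
from collections import defaultdict
from pathlib import PurePosixPath

def batch_by_directory(files: list[str]) -> dict[str, list[str]]:
    # Group semantically related files into PR-sized batches by directory.
    batches: dict[str, list[str]] = defaultdict(list)
    for f in files:
        batches[str(PurePosixPath(f).parent)].append(f)
    return dict(batches)

def programmatic_verifier(batch: list[str]) -> bool:
    # A "programmatic" verifier just shells out: unit tests, a linter, a
    # type checker. (An LLM-based verifier would instead read the code
    # against a set of code-smell rules and write a report.)
    result = subprocess.run(["ruff", "check", *batch], capture_output=True)
    return result.returncode == 0  # green means the batch is completed

if __name__ == "__main__":
    batches = batch_by_directory(
        ["core/agent.py", "core/llm.py", "tools/bash.py", "tools/editor.py"])
    for name, files in batches.items():
        print(name, "OK" if programmatic_verifier(files) else "NEEDS FIX")
```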
We'll keep going in dependency order, making sure we pick nodes that don't depend on any batches we have yet to analyze. This next batch is about as simple as the first, but because its init file is a little more complex, the report that gets generated is a little more verbose. Continuing down the list, we come across the batch we identified earlier, with some chunky files of relatively high code complexity. And this batch happens to give us our first failure: notice that the status turns red instead of green. Now, this batch has more files than the ones we've seen so far, so the verification report is proportionally longer. Looking through it, we see that it lists, file by file, the code smells it identified, and one file is particularly egregious with its violations. We'll have to come back to that. If we zoom all the way back out to the batch graph and look at the status indicators, we'll see the two green nodes representing the batches we've already successfully verified, and we'll also see the red node representing the batch that just failed verification.

Now, our stated goal is to turn this entire graph green, and this red node presents a bit of an issue. To convert it into a green node, we need to address the problems the verifier found using the next step of the pipeline: the fixer. Just like the verifier, the fixer can be defined in a number of different ways. A programmatic fixer can run a bash command, or you can feed the entire batch into a language model and hope it addresses the issues in a single step. But by far the most powerful fixer we have uses the OpenHands agent SDK to make a clean copy of the code and set loose an agent that has access to all sorts of tools, so it can run tests, examine the code, look at documentation, and do whatever it needs to address the issues. So let's go back to the failing batch, run the fixer, and see what happens.

Now, this part of the demo is sped up considerably. Because we're exploring these batches in dependency order, while we're waiting we can continue down the list, running our verifiers and spinning up new instances of the OpenHands agent with the SDK, until we come across a node that's blocked because one of its upstream dependencies is still incomplete. When the fixer is done, the status of the batch is reset; we'll need to rerun verification later to make sure the batch passes. Looking at the report the fixer returned, there's not much information, just the title of the PR. We've set this up so that every fixer produces a nice, tidy pull request ready for human approval: just because the refactor is automated doesn't mean it shouldn't be reviewed. And here's the generated PR. The agent does an excellent job of summarizing the code smells it identified and the changes it made to address them, as well as any other changes it had to make along the way. It also leaves helpful notes for the reviewer, and some notes for anybody working on this part of the code in the future. When we look at the content of the change, we see it's very tidy. All the changes are tightly focused on addressing the code smells we specified earlier, and we've only modified a couple hundred lines of code, the bulk of which is simply refactoring a nested block into its own function. Not all PRs will have scope this small, but our batching strategy and narrow instructions ensure that the scope of the changes is well considered. This helps improve agent performance, and it also makes review easy.
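The verifier-then-fixer loop Calvin describes could be skeletonized roughly like this; both functions are placeholders for the real LLM verifier and the SDK-driven fix agent, not actual APIs from the demo.

```python
"""Sketch of the verifier -> fixer loop from the demo. llm_verifier() and
run_fix_agent() are placeholders, not real OpenHands refactor-tool APIs."""

def llm_verifier(batch: list[str], rules: str) -> list[str]:
    # Placeholder: ask a language model to flag code smells, file by file,
    # against the provided rules. An empty list means the batch is clean.
    return []

def run_fix_agent(batch: list[str], findings: list[str]) -> str:
    # Placeholder: an agent with terminal/editor tools works on a clean
    # copy of the code and opens a tightly scoped, reviewable pull request.
    return f"PR fixing {len(findings)} findings across {len(batch)} files"

def process_batch(batch: list[str], rules: str) -> None:
    findings = llm_verifier(batch, rules)
    if findings:                          # red node: verification failed
        pr = run_fix_agent(batch, findings)
        print(f"awaiting human review: {pr}")  # re-verify after merge
    else:
        print("batch verified clean")          # green node

if __name__ == "__main__":
    process_batch(["core/agent.py", "core/llm.py"],
                  rules="no deeply nested blocks; no duplicated logic")
```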
The full process for removing code smells from the entire codebase now becomes clear: use the verifier to identify problems, use the fixer to spin up agents that address those problems, review and merge those PRs, unblock new fixes, and repeat until the entire graph turns green. We've already used this tool to make some pretty significant changes to the codebase, including typing and improving tests, and we could not have done it without the OpenHands agent SDK powering everything under the hood.

All right, so that's the OpenHands refactor tooling, powered by our OpenHands agent SDK. We're going to walk through, a little later in the workshop, how to build something a little simpler but very similar, where we get parallel agents working together to fix issues that were discovered by an initial agent. First I want to talk a little about strategy, both for decomposing tasks and for sharing context between agents; these are both really big, important parts of agent orchestration.

For effective task decomposition, you're really looking to break your very big problem down into tasks that a single agent can solve, that a single agent can one-shot: something that can fit in a single commit, a single pull request. That's super important, because you don't want to be constantly iterating with each of the sub-agents. You want a pretty good guarantee that each one is going to one-shot its task, so you can rubber-stamp it and get it merged into your ongoing branch. You also want to look for things that can be parallelized; this is a huge way to increase the speed of the task. If you're just executing a bunch of agents serially, you might as well have a single agent moving through the task serially. The more you can parallelize, the more agents you get working at once, and the faster you can move through the task and iterate. You want things you can verify as correct very easily and quickly. Ideally you'll have something where you can just look at the CI/CD status and have good confidence that if everything's green, you're good; maybe you'll need to click through the application itself or run a command yourself to verify that things look right. But you want to be able to understand very quickly whether an agent has done the work you asked of it or not. And you want clear dependencies and ordering between tasks. You'll notice these criteria are pretty similar to how you might break down work for an engineering team: you need tasks that are separable, tasks that different people on your team can execute in parallel before you collect the results together. You want to know that once task A is done, that unlocks tasks B, C, and D, and once those are done, we can do E. So it's very similar to breaking down work for a team of engineers.

There are a few different strategies for breaking down a very large refactor like the one we just saw Calvin do. The simplest one is to go piece by piece. You might iterate through every file in the repository, every directory, maybe every function or class. This is a fairly straightforward way to do things, and it works well if the pieces can be executed without depending on one another too much. A good example might be adding type annotations throughout your Python codebase. Then, at the very end, once you've migrated every single file, you can collect all those results into a single PR.

A slightly more sophisticated approach is to create a dependency tree. The idea here is to add some ordering to that piece-by-piece approach: as we saw Calvin do, you start with the leaf nodes in your dependency graph. Maybe your utility files get migrated over first, and then anything that depends on those has the initial fixes in place, so the dependents can start working through their part of the process. You basically work your way back up to whatever the entry point of the application is. This is often a better way to proceed; it's a more principled approach to ordering the tasks, as sketched below.
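Python's standard library can express that ordering directly. Here's a sketch using graphlib with a made-up batch graph; the batch names are illustrative.

```python
"""Sketch of dependency-ordered dispatch: process leaf batches first and
work back toward the application entry point, as described above."""
from graphlib import TopologicalSorter

def dependency_order(deps: dict[str, set[str]]) -> list[str]:
    # deps maps each batch to the batches it depends on; TopologicalSorter
    # yields leaves (no unmet dependencies) before the things that use them.
    return list(TopologicalSorter(deps).static_order())

if __name__ == "__main__":
    deps = {
        "utils": set(),           # leaf: migrate these first
        "models": {"utils"},
        "api": {"utils", "models"},
        "main": {"api"},          # entry point: migrate last
    }
    for batch in dependency_order(deps):
        print("dispatch agent for batch:", batch)
```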
Another strategy is to create some kind of scaffolding that allows you to live in both the pre-migration and post-migration worlds. We did this, for example, when migrating our React state management system. We had an agent set up some scaffolding that allowed us to work with both Redux and Zustand at the same time. It was pretty ugly, not something you would actually want to keep, but it allowed us to test the application as each individual component got migrated from the old state management system to the new one. We sent off parallel agents, one for each component, and at the very end, once everything was using Zustand, we were able to rip out all the scaffolding so there was no more mention of Redux and everything was working. Having that scaffolding in place let us validate, as each agent finished its work on one component, that the application still worked and that component still worked. We didn't have to do everything all at once, and we got human feedback on the agents' work along the way.

Next I want to talk a bit about context sharing. As you go through a big, large-scale project like this, you're going to learn things. You'll figure out, okay, my original mental model wasn't actually complete; I didn't understand the problem correctly. Your agents might run into this too: you might have a fleet of ten agents running that are all hitting the exact same problem, and you want to share the solution to that problem so they don't all get stuck. There's a bunch of different strategies for doing this context sharing between agents.

One strategy, the most naive thing you can do, is to share everything: every agent sees every other agent's context. This is not great. It's basically the same thing as having a single agent working iteratively through the task, and you're going to blow through your context window really quickly if you do something like this. So this is not going to help.

A better approach is to have the human manually enter information into the agents. If you have a chat window with each agent, you can just paste in something like "hey, use library 1.2.3 instead of 1.2.2." The human can also modify an agents.md or a micro agent to pass messages to the agents. But this involves manual human effort and a lot more babysitting of the agents, so it's not super scalable.

You can also have the agents share context with each other through a file like agents.md, and allow the agents to modify that file themselves; maybe they send a pull request into the file as they learn new things. The downside here is that agents will sometimes try to learn unimportant things, and they can get aggressive about pushing information into this file, so some kind of human review seems to help.

And last, this is probably the most leading-edge idea here: you can give each agent a tool that allows it to send messages to other agents. It could be a broadcast message that goes out to all the other agents, or it could be a point-to-point conversation. This is super fun to experiment with, and we're doing a lot of experimentation with it now with our SDK, but it's tricky to get right. Once you get agents talking to each other, you're increasing the level of non-determinism in the system, and things can get a little strange. I have an example here on the right, from a report where they had two agents just talk to each other: they entered a loop of wishing each other Zen perfection.
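Here's a toy sketch of those last two sharing mechanisms: a shared notes channel (like an agents.md the agents can append to) and broadcast messaging between agents. This is illustrative only; it is not the OpenHands SDK's actual messaging API.

```python
"""Toy sketch of agent-to-agent context sharing. Both mechanisms below
are illustrative stand-ins, not real OpenHands SDK APIs."""
import queue
import threading

class SharedContext:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.notes: list[str] = []                  # the "agents.md" channel
        self.inboxes: dict[str, queue.Queue] = {}   # per-agent mailboxes

    def register(self, agent_id: str) -> None:
        self.inboxes[agent_id] = queue.Queue()

    def append_note(self, note: str) -> None:
        # Agents record durable lessons here. A human should review these,
        # since agents can get aggressive about "learning" trivia.
        with self._lock:
            self.notes.append(note)

    def broadcast(self, sender: str, msg: str) -> None:
        # Point-to-point would target one inbox; broadcast hits all others.
        for agent_id, inbox in self.inboxes.items():
            if agent_id != sender:
                inbox.put(f"{sender}: {msg}")

if __name__ == "__main__":
    ctx = SharedContext()
    for a in ("agent-1", "agent-2", "agent-3"):
        ctx.register(a)
    ctx.broadcast("agent-1", "use library 1.2.3 instead of 1.2.2")
    print(ctx.inboxes["agent-2"].get())
```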
Cool. Now I want to work through an exercise, and I would love it if you all want to follow along. You can access this presentation, for copy-pasting purposes, at dub.sh/openhands-workshop. We'll work through some coding exercises with the OpenHands SDK, specifically to do CVE remediation at scale. We're going to write a script that takes in a GitHub repository, scans it for open source vulnerabilities, for CVEs, and then sets up a parallel agent for every single vulnerability we find, to solve it and open up a pull request. So: dub.sh/openhands-workshop. Let me know if anybody can't access it.

>> Is it going to be the slideshow?

>> It should be the slideshow. There will be copy-pasteable prompts and links and things like that around slide 29.

>> Got it.

>> We'll get there. In terms of how this process is going to work: basically, we're going to start with one agent that runs a CVE scan on the repository; it's going to scan for vulnerabilities. What's nice about using an agent for this is that it can look at the repository and decide how to scan for vulnerabilities. Am I going to use Trivy to scan a Docker image? Am I going to run npm audit on a package.json? It can detect the programming language to figure out how to scan for CVEs here. Then, once we have our list of vulnerabilities, we're going to run a separate agent for each individual vulnerability. Each of these agents is going to research whether or not it's solvable, update the relevant dependency, fix any breaking API changes throughout the codebase, and open up a pull request. What's nice about this is that we can merge those individual PRs once they're ready.

>> Can you show the link again?

>> Yeah. What's nice about running the solving in parallel is that we get a bunch of different PRs, so we can merge them as they're ready. If one agent gets stuck, if one of the vulnerabilities isn't solvable, all the other ones are still going to work. Maybe we get to 90% or 95% solved; we don't have to get to 100% for this to have value.

Here's some quick pseudo code of what this is going to look like. This is an example, using the OpenHands SDK, of how to create an agent. You can see we create a large language model, then pass that LLM to an agent object along with some tools: a terminal, a file editor, a task tracker for planning. We give it a workspace and then we just tell it to run. This is a pretty naive hello-world example; we'll see how it gets a little more complicated as we progress through this task. Then, once that first agent is done, we're going to iterate through all of the vulnerabilities it found, and for each one we'll send off a new agent, asking it to solve that particular CVE.
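The slide code itself isn't captured in the transcript, so here's a reconstruction of that pseudo code from the description above. The import path, class names, and parameters are assumptions based on the talk, not verified SDK signatures.

```python
"""Reconstruction of the slide's pseudo code. Import paths, class names,
and parameters are assumptions based on the talk's description."""
from openhands.sdk import LLM, Agent, Conversation  # assumed imports

# Create a large language model...
llm = LLM(model="openhands/claude-sonnet-4", api_key="...")

# ...and pass it to an agent object along with some tools: a terminal,
# a file editor, and a task tracker for planning.
agent = Agent(llm=llm, tools=["terminal", "file_editor", "task_tracker"])

# Give the agent a workspace, then just tell it to run.
conversation = Conversation(agent=agent, workspace="./my-repo")
conversation.send_message("Scan this repository for vulnerable dependencies.")
conversation.run()
```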
All right. To get started here, I'd say create a new GitHub repository; we'll save our work there. You're also going to need both a GitHub token and an LLM API key. If you sign up for OpenHands at app.all-hands.dev, you can get $10 of free LLM credits there. If you're already an existing user, let me know and I can bump up your existing credits for the purposes of this exercise.

Then we're going to start an agent server. This is basically a Docker container that's going to house all the work our agents are doing. This is a great way to run agents securely and more scalably: instead of running the agents on our local machine to solve all these CVEs, we're going to run them inside a container. Hypothetically, if we were doing thousands of CVEs, we could run this in a Kubernetes cluster, so we'd have as many workstations as we want for our agents. But for the purposes of this exercise, we'll just run one Docker container as a home for our agents.

Then we can create an agents.md or an OpenHands micro agent to start working through this task. I'm going to be using the OpenHands CLI as we go here; you're welcome to check out the OpenHands CLI, or you can use Cursor or Claude Code or whatever you're used to using as we vibe-code our way through a CVE remediation process with OpenHands. I'm going to give it a couple minutes while I walk through creating my GitHub repo, getting my GitHub token, and so on. If you have any trouble, feel free to raise your hand and I'll come around and help you get set up.

>> You said app.all-hands.dev?

>> Yeah. So, I've got my new GitHub repo here, and I'm going to add a quick OpenHands micro agent describing the process for remediating CVEs with agents, noting that the relevant docs for the OpenHands SDK are at the SDK documentation site. That gives the agent a little bit of context, similar to an agents.md. Next we need auth. To get a token, and I'm not actually going to keep the one I show here, you go to GitHub settings, your profile, then Developer settings, Personal access tokens. I like to do classic tokens. Give it a name, and the repo scope is really all you'll need; that way we can open up pull requests to solve the CVEs involved.

>> You did a classic token, not the new kind?

>> I haven't gotten used to the fine-grained ones; you're welcome to use them.

>> I haven't gotten to them either.

>> So, what permissions do we need?

>> Just the repo permission. Also, make sure you sign up for app.all-hands.dev.
You go to API keys under your profile here, and you can get your OpenHands API key, your LLM key. I won't show mine, but this will allow you to use our LLM proxy. Last, I'm going to start up the agent server here; you'll probably want to copy-paste this command out of the presentation. I've got my repo cloned. If you do want to work with the OpenHands CLI, it's a tool install away. I'm going to start up the OpenHands CLI; again, you can use Claude Code, Cursor, or whatever else if you want. Do you folks need a little more time with the setup, getting keys and tokens set up?

So I'm going to start with this first prompt. Basically, we're going to point our agent at the OpenHands SDK, point it at the documentation, and just ask it to check that our LLM API key is working, that it can actually do an LLM completion. This will be a very basic hello world, just to get started. I'm telling it I'm using the OpenHands key that I generated at app.all-hands.dev, so I'm telling it to use this OpenHands-hosted Sonnet 4 model. You can replace this with an Anthropic model if you want to use a regular Anthropic API key, and you may need to set the model string a bit differently depending on whether you're using OpenAI or something else; everything goes through LiteLLM, so you can look at the LiteLLM docs to figure out which model string to plug in. But I'm just going to copy-paste this as is.

>> Sorry, what's the step for agents.md, or the one for OpenHands?

>> I would say just create a file: either agents.md, if you're working with a tool that's compatible with that, or for OpenHands we have what's called a micro agent. By convention, .openhands/microagents/repo.md is the description of the repository you're in. I just gave it a couple links to the SDK documentation and the repository for the SDK, so it has access to the API docs there. This is an optional step, but it makes things a little easier.

All right, it thinks it's got something good. Let's see what's going on. The Python CVE solver needs environment variables; I'm setting mine here, making sure I don't check those in. One more time. Got a small error; looks like the agent didn't quite get the API right. Let's paste the error back in and see what happens. Let's try again.

>> uv tool install breaks for me.

>> What version of uv are you on?

>> I'm on 0.9.6.

>> What error are you getting?

>> I don't know why: "No executables are provided by package openhands. Removing tool. Error: failed to install entry points." I'm newish to the Python world, so I assumed I was doing something silly.

>> You could try updating to a newer version, which is what I'm on.

>> Okay, yeah, I'll try.

>> Another question: I see you running this through the CLI, but I was able to run it on app.all-hands.dev, and it submitted a PR. Looks good.

>> Awesome.

>> So why are you doing it through the CLI?

>> Really just for presentation: being able to run the script and show it working locally is a little better as a hands-on demo. Normally I actually prefer to work through the web UI, and then have the agent push and I pull locally if I really want to work locally, but that felt like extra steps for presenting purposes. Feel free to use the web UI or the CLI tool. Looks like I have an API key issue here... there we go, a 200.

>> What's that? Should we get a 200?

>> Yeah, you should get something like this; I just got it finally, and the other one says hello. Has anybody else managed to get the connection working?

>> I think so. I've created the file.

>> Nice. Just a quick view of what this looks like for the first prompt: basically, you can see we create an LLM, tell it what model and what API key we want to use, and then just send a quick message to make sure it's actually working.
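Reconstructing prompt one's output from that recap; again, the LLM class name and the completion() call are assumptions based on the talk, not verified SDK API.

```python
"""Prompt one, reconstructed: a hello-world check that the LLM key works.
The class name, parameters, and completion() call are assumptions."""
import os
from openhands.sdk import LLM  # assumed import

llm = LLM(
    model="openhands/claude-sonnet-4",  # or an Anthropic/OpenAI model; see
    api_key=os.environ["LLM_API_KEY"],  # the LiteLLM docs for model strings
)
response = llm.completion(messages=[{"role": "user", "content": "Say hello"}])
print(response)  # a successful response here means the key is working
```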
All right, now I'm going to move on to prompt two. Here we're going to actually start to do some work. We're going to tell the agent we're working with that we want to use the SDK to create a new agent that takes in a GitHub repository. It's going to connect to a remote workspace running at localhost:8000; again, that's the Docker start command from before, so if you haven't already run that, now's a good time to get Docker running the agent server. It's going to clone our repository into that Docker container, we're going to create an agent that works inside that container, and we're going to tell that agent to scan this repository for vulnerabilities.

>> With the OpenHands CLI, is there a way to interrupt and get it to stop?

>> Hit Ctrl-P, or pause, yeah.

>> And then can I insert my corrections?

>> Yeah, then you can type it a message, or just type continue.

>> I got the CLI to install, but I had to add -ai. It seems on PyPI there's a -ai version, but then the docs say otherwise.

>> I think the -ai one is deprecated, but it is a usable CLI. Did you get the -ai one to work? Because as soon as I tried to run it, it crashed.

>> Oops. It installed; I was so happy.

>> Yeah, it installed and then it didn't work.

>> There's a deprecation warning when I check the version.

>> There's also an executable binary you can download on our releases page; that might be more straightforward. You can also run it in a Docker container, and if you check the CLI docs, I think there's a uv run option as well. Try uv run with the regular version, not the -ai one.

>> Okay, thank you.

Okay, supposedly we have an agent working here. Let's see. I'm going to run it with a repo that should have a few CVEs in it, and we'll see if it finds any vulnerabilities. By default, OpenHands will visualize the output here, so we can see the agent working even with the SDK, pretty similar to what we saw in the CLI. You can see its task list. It's cloning the repository. It doesn't have Trivy itself, so it's installing Trivy. It's basically doing what we would expect an agent to do: we've given it a task, and it's working through it. So it's running Trivy now.

Let me show a bit of what this generated code looks like. You can see we instantiated our LLM in the first step. Now we're actually passing that LLM to an agent, and we're also giving it a terminal tool and a file editor tool. We're creating a remote workspace that connects to our Docker container so the agent can start working in its own environment. We create what's called a conversation, which is basically one chunk of context that the agent is going to manage as it goes about its work. We pass it a task with some clear instructions for what it's supposed to do, and then send that task off.
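Reconstructed from that description, the scanner step might look roughly like this. RemoteWorkspace, its import path, the tool names, and the Conversation interface are all assumptions based on the talk, not verified SDK signatures.

```python
"""Prompt two, reconstructed: an agent working inside the Docker agent
server at localhost:8000. Names and signatures are assumptions."""
from openhands.sdk import LLM, Agent, Conversation   # assumed imports
from openhands.sdk.workspace import RemoteWorkspace  # assumed import

llm = LLM(model="openhands/claude-sonnet-4", api_key="...")
agent = Agent(llm=llm, tools=["terminal", "file_editor"])

# Connect to the agent-server container started earlier, so the agent
# works in its own sandbox instead of on this laptop.
workspace = RemoteWorkspace(host="http://localhost:8000")

# A conversation is one chunk of context the agent manages for one task.
conversation = Conversation(agent=agent, workspace=workspace)
conversation.send_message(
    "Clone the target GitHub repository, pick an appropriate scanner "
    "(e.g. trivy for a Docker image, npm audit for a package.json), "
    "scan for CVEs, and save the findings to vulnerabilities.json."
)
conversation.run()
```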
Looks like that initial scanner agent is almost done. And it ran just fine; we got these results. I'll keep plugging along here. We've got an agent that's scanning for vulnerabilities, so the next thing I'm going to ask for is to reach into the environment and get the vulnerability list out of it. The idea is we're going to have the agent save the vulnerabilities to a JSON file; then, on that workspace object for the Docker container, we can run execute_command to get those vulnerabilities back out. We also have options for manipulating files within the workspace. For now, we're just going to iterate over the vulnerabilities.json file and print it out, just so we can see we were able to reach into this workspace and get some information back out.

All right, supposedly good to go; let's see what happens. We got some vulnerability results and the agent's finished. Let's see if our script can get the results back out. Hmm, an error. One more time.

>> What is the observation event?

>> At every step there's an action and then an observation. It might be "run this command," and then an observation comes back with the output of that command. It's basically the entire trajectory the agent takes, a stream of events, and there are two kinds of events: actions and observations. Whenever we make a call to the LLM, it comes back with an action to take, basically a tool call, and then the observation is the result of that tool call. If anyone's stuck on anything, I'm happy to come around; feel free to raise a hand. This is prompt number three.

>> Nice.

>> Yeah, it looks like it's printing the CVE list. That looks good.

>> Do you create a specific sub-agent for each script we're running? Why are you overwriting the same file again and again?

>> The process we're going through here with the five prompts is really to demonstrate what it would feel like to actually build with our SDK. This is not exactly the way I would work if I were actively working on the problem; I could have just given you this whole fully packaged codebase pre-built.

>> Eventually we get a very large script, right? We should break it into several separate files or sections.

>> Yeah, there are definitely better ways to organize this code than one single script; this is just easier for demo purposes. I do have a demo repo, I think it's called openhands-cve-demo, that uses separate classes, with a CVE agent abstraction that's a little more organized than just this one script. We're still parsing the JSON here.
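To round out the exercise, here's a reconstruction of the extraction and fan-out step described above, continuing the previous sketch (so it assumes the workspace, agent, and Conversation objects from there). execute_command and the result shape are assumptions from the talk's description, not verified SDK API.

```python
"""Prompt three onward, reconstructed: pull the scanner's JSON back out of
the workspace, then fan out one agent per vulnerability. execute_command()
and the result fields are assumptions based on the talk."""
import json

# Reach into the Docker workspace and read the scanner agent's output back.
result = workspace.execute_command("cat vulnerabilities.json")
vulnerabilities = json.loads(result.stdout)

for vuln in vulnerabilities:
    # One fresh agent conversation per CVE. In the talk these run in
    # parallel, so individual PRs can be merged as each becomes ready.
    solver = Conversation(agent=agent, workspace=workspace)
    solver.send_message(
        f"Research {vuln['id']}. If it's solvable, update the affected "
        "dependency, fix any breaking API changes, and open a pull request."
    )
    solver.run()
```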