Devin 2.0 and the Future of SWE - Scott Wu, Cognition

Channel: aiDotEngineer

Published at: 2025-07-25

YouTube video id: MI83buT_23o

Source: https://www.youtube.com/watch?v=MI83buT_23o

Yeah, well, thank you guys so much for having me. It's exciting to be back. I was last here at AI Engineer one year ago, and it's kind of crazy. I've been telling swyx that we need to have these conferences way more often if they're going to be about AI software engineering, probably every two months or so with the pace that everything's moving. But it's going to be fun to talk a little bit about what we've seen in the space and what we've learned over the last 12 to 18 months building Devin.
I want to start this off with Moore's law for AI agents. You can think of the capability or capacity of an AI in terms of how much work it can do uninterrupted before you have to come in and intervene or steer it. With GPT-3, for example, if you asked it to do something, it could probably get through a few words or so before it would say something that was probably not the right thing to say. GPT-3.5 was better, and GPT-4 was better. People talk about these lengths of tasks, and what you see in general is that the doubling time is about every seven months, which is already pretty crazy. But in code it's actually even faster: it's every 70 days, which is two to three months. So if you look at software engineering tasks, starting from the simplest single functions or single lines and going all the way up, we're doing tasks now that take hours of human time, and an AI agent is able to just do all of that. And if you think about doubling every 70 days, doubling every two to three months means you get four to six doublings every year, which means the amount of work an AI agent can do in code grows somewhere between 16x and 64x per year, at least for the last couple of years that we've seen. It's kind of crazy to think about, but that sounds about right for what we've seen.
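As a back-of-the-envelope check of that arithmetic (my own illustration, not anything from Cognition's tooling): a doubling period between two and three months gives four to six doublings per year, which is exactly the 16x to 64x range.

```python
# Back-of-envelope check of the doubling arithmetic from the talk.
# A capability that doubles every `doubling_days` days grows by
# 2 ** (365 / doubling_days) over a year.
def yearly_multiplier(doubling_days: float) -> float:
    return 2 ** (365 / doubling_days)

# Four doublings/year (a ~3-month doubling period) -> 16x
print(yearly_multiplier(365 / 4))   # 16.0
# Six doublings/year (a ~2-month doubling period) -> 64x
print(yearly_multiplier(365 / 6))   # ~64.0
# The quoted 70-day doubling sits in between: ~5.2 doublings, ~37x per year
print(yearly_multiplier(70))
```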
Eighteen months ago, I would say the only product experience that had PMF in code was tab completion. It was just: here's what I have so far, predict the next line for me. That was kind of all you could really do in a way that actually worked. And we've gone from that, obviously, to a full AI engineer that goes and does all these tasks for you and implements a ton of these things. People ask all the time: what is the future interface, or what is the right way to do this, or what are the most important capabilities to solve for? And funnily enough, the answer to all of these questions changes every two or three months. Every time you get to the next tier, the bottleneck you're running into, the most important capability, and the right way you should be interfacing with it all change. So I wanted to talk a bit about some of the tiers for us over the last year or so.

Over the course of that time, when we got started at the end of 2023, agents were not even a concept. Now everyone is talking about coding agents, people are doing more and more, and it's very cool to see. And each of these has been almost a discrete tier for us.

Right around a year ago, when we were doing the last AI Engineer talk, the biggest use case we saw getting broad adoption was what I'll call repetitive migrations. I'm talking JavaScript to TypeScript, or upgrading your Angular version from this one to that one, or going from this Java version to that Java version. For those kinds of tasks, what you typically see is that you have some massive codebase you want to apply the whole migration to, and you have to go file by file and do every single one. Usually the set of steps is pretty clear: if you go to the Angular website, for example, it'll tell you, all right, here's what you have to do, this, this, this, and this, and you want to go execute each of those steps. It's not so routine that a classical deterministic program solves it, but there's a clear set of steps, and if you can follow those steps very well, you can do the task.

This was the thing for us, because that was all you could really trust agents to do at the time. You could do harder things once in a while, and you could do some really cool stuff occasionally, but as far as something consistent enough to do over and over and over, these repetitive migrations that you would be doing across, say, 10,000 files were in many ways the easiest thing. Which was cool, actually, because it was also kind of the most annoying thing for humans to do. And I think that's generally been the trend: AI has always done the more boilerplate tasks, the more tedious and repetitive stuff, and we get to do the more fun, creative stuff. Obviously, as time has gone on, it's taken on more and more of that boilerplate. For a problem like this one, a lot of what you need is for Devin to be able to go and execute a set of steps reliably, so I would say the big capabilities problem to solve was mostly instruction following.
So we built a system called playbooks, where you could outline a very clear set of steps, have Devin follow each of those step by step, and do exactly what's said. Now, if you think about it, a lot of software engineering obviously does not fall under the category of literally following ten steps and doing exactly what's said, but migration does, and it allowed us to actually go do these. This was, I would say, the first big use case of Devin that really came up. One of the other big systems that got built around that time, which we've since rebuilt many times, is knowledge, or memory. If you're doing the same task over and over and over again, the human will often have feedback: hey, by the way, you have to remember to do X, or you need to do Y every time you see this. So you need an ability to maintain and understand those learnings and use them to improve the agent on every future run. Those were the big problems of the time, and that was summer of last year.
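To make those two primitives concrete, here's a minimal sketch, purely my own illustration (none of these names come from Devin's actual implementation): a playbook is an explicit ordered list of steps the agent follows literally, and a knowledge store accumulates human feedback that gets folded into every future run.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Playbook:
    """An explicit, ordered list of steps to execute literally."""
    steps: list[str]

@dataclass
class Knowledge:
    """Learnings from past runs, prepended as context on future runs."""
    notes: list[str] = field(default_factory=list)

    def learn(self, note: str) -> None:
        self.notes.append(note)

def run_playbook(playbook: Playbook, knowledge: Knowledge,
                 apply_step: Callable[[str, list[str]], str]) -> list[str]:
    # Every run starts with the accumulated learnings as context,
    # then executes the playbook strictly in order.
    context = list(knowledge.notes)
    return [apply_step(step, context) for step in playbook.steps]
```

In a real agent, `apply_step` would be a model call with tool access; here it's just a parameter so the control flow stays visible.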
Around the end of summer or fall, the big thing that started coming up was that, as these systems got more and more capable, instead of just the most routine migrations you could do these still pretty isolated but somewhat broader general bugs or features, where you can actually just tell it what you want and have it do it. So for example: hey Devin, in this repo-select dropdown, can you please list the currently selected ones at the top? Having the checkboxes scattered throughout just doesn't really work. And Devin will just go and do that. If you think about it, it's something like the level of task that you would give an intern.
There are a few particular things you have to solve for with this. First of all, these changes are usually pretty isolated and contained, so it's one, maybe two files that you really have to look at and change. But you do at least still need to be able to set up the repo and work with it: you want to be able to run lint, run CI, and all of these other things, so you at least have the basic checks of whether things work. One of the big things we built around then was the ability to set up your repository ahead of time and build a snapshot that you could start from, reload, and roll back, along with all of these kinds of primitives: a clean remote VM that could run all these things, your CI, your linter, and so on. And that's when we really started to see broader value. Migrations are one particular thing, and for that particular thing we were showing a ton of value; now, with these bug fixes and similar tasks, you would be able to just generally get value from Devin, as almost a junior buddy of yours.
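A hypothetical sketch of those snapshot primitives, with a simple dictionary standing in for the VM's state (this isn't Devin's actual machinery, just the shape of the idea):

```python
import copy

class DevBox:
    """Toy stand-in for a remote VM with snapshot/restore primitives."""
    def __init__(self):
        self.state: dict = {}       # stands in for filesystem + environment
        self._snapshots: dict = {}

    def snapshot(self, name: str) -> None:
        # Capture the fully set-up state (repo cloned, deps installed)
        self._snapshots[name] = copy.deepcopy(self.state)

    def restore(self, name: str) -> None:
        # Roll back to a known-good starting point for the next task
        self.state = copy.deepcopy(self._snapshots[name])
```

Each new task can then start from a `clean` snapshot rather than paying the setup cost again, and a failed attempt can be rolled back instead of contaminating the next run.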
And then in the fall, things really moved toward much broader bugs and requests. Here you're jumping another order of magnitude: most changes don't just contain themselves to one file. Often you have to go look at what's going on, diagnose things, figure out what's happening, work across files, and make the right changes. These changes are often hundreds of lines if it's something like: hey, I've got this bug, let's figure out what's going on and solve it.
There are a lot of things here that started to make sense and started to be important, but one in particular I'll point out: there's a lot you can do by not just looking at the code as text, but thinking of it as a whole hierarchy. Understanding call hierarchies and running a language server is a big deal. You have git commit history, which you can look at and which informs how these different files relate to one another. You obviously have your linter and things like that. And you're able to reference things across files. So one of the big problems here, I think, was working with that context and getting to the point where it could make changes across several files, be consistent across those changes, and understand things across the codebase. This was really the point where you could just tag it on an issue and have it build the thing for you. Slack was a huge part of the workflow then, and it made sense, because it's where you discuss your issues and where you set these things up. You would tag Devin in Slack and say, "Hey, by the way, we've got this bug, please take a look," or, "Could you please go build this thing?" This was an especially fun part for us because it's right around when we went GA, and a lot of that was because it got to the point where you truly could just get set up with Devin, ask it a lot of these broad tasks, and have it do them.
A lot of the work we did was around giving Devin better and better understanding of the codebase. If you think about it from the human lens, it's the same way: on your first day on the job, being super fresh in the codebase, it's tough to know exactly what you're supposed to do. A lot of those details are things you understand over time, a representation of the codebase that you build over time. Devin had to do the same thing and had to understand: how do I plan this task out before I solve it? How do I find all the files that need to be changed? How do I go from there and make that diff?
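As one concrete illustration of treating code as structure rather than text (my own toy example, not Devin's internals), you can extract a call hierarchy from a file with Python's `ast` module and follow who-calls-whom instead of grepping raw lines:

```python
import ast

def call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function name to the names it calls directly."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            calls = set()
            for sub in ast.walk(node):
                # Only simple `name(...)` calls; attribute calls like
                # `obj.method()` are skipped to keep the sketch small.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls.add(sub.func.id)
            graph[node.name] = calls
    return graph
```

A real language server gives you far more (cross-file references, types, rename-safety), but even this tiny graph shows why structure beats raw text for multi-file edits.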
And around the spring of this year (again, every gap is like two or three months), we got to an interesting point: once you start getting to harder and harder tasks, you as the human don't necessarily know everything you want done at the time you're giving the task. If you're saying, hey, I'd like to improve the architecture of this, or, this function is slow, let's profile it and see what needs to be done, or, hey, we really should handle this error case better, but let's look at all the possibilities and see what the right logic should be in each of them, then this whole idea of taking a two-line or three-line prompt and having that turn straight into a Devin task was not sufficient. You wanted to really be able to work with Devin and specify a lot more. Around this time, along with this better codebase intelligence, we had a few different things come up, and so we released DeepWiki, for example. The whole idea of DeepWiki, funnily enough, was that Devin had its own internal representation of the codebase, and it turns out that for humans it was great to look at that too, to understand what was going on or to ask questions quickly about the codebase. Closely related to that was search, which is the ability to just ask questions about a codebase and understand some piece of it. And the workflow that really started to come up was this more iterative one, where the first thing you do is ask a few questions. You'd basically have a more L2 experience where you can go explore the codebase with your agent, figure out what has to be done in the task, and then set your agent off to go do it, because for these more complex tasks you kind of needed that.
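That explore-then-dispatch loop could be sketched roughly like this (a hypothetical shape, with `ask` and `dispatch` standing in for whatever question-answering and task-launching interface the agent exposes):

```python
from typing import Callable

def plan_then_dispatch(task: str,
                       questions: list[str],
                       ask: Callable[[str], str],
                       dispatch: Callable[[str], str]) -> str:
    # Phase 1: explore the codebase with the agent before committing to a task.
    findings = [f"Q: {q}\nA: {ask(q)}" for q in questions]
    # Phase 2: fold what you learned into a fuller spec, then hand it off.
    spec = task + "\n\nContext from exploration:\n" + "\n".join(findings)
    return dispatch(spec)
```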
That was, I would say, a big paradigm shift for us. It's also what came along with Devin 2.0, for example, and the in-IDE experience: often you want points where you closely monitor Devin for 10 or 20 percent of the task, and then have it work on its own for the other 80 or 90 percent.
And then lastly, most recently in June, which is now, it's really the ability to just truly kill your backlog: hand it a ton of tasks and have it do all of them at once. If you think about this, in many ways it's almost a culmination of many of the things that had to be done in the past. You have to work with all these systems and integrate into all of them; certainly you want to be able to work with Linear or Jira or systems like that. You have to be able to scope out a task, to understand what's meant and what's going on. You have to decide when to go to the human for approval or questions. You have to work across several different files. Often you even have to figure out which repo is the right one to make the change in, if your org has multiple repos, or which part of the codebase is the right part to change. And to get to the point where you can do this more autonomously, first of all, you have to have a really great sense of confidence. Rather than just going off and doing things immediately, you have to be able to say: okay, I'm quite sure this is the task, and I'm going to go execute it now; versus: I don't understand what's going on, human, please give me help. But the other piece is that this is, I think, the era where testing, and asynchronous testing, gets really, really important. If you want something to deliver entire PRs for your tasks, especially larger ones, you want to know that it can test itself, and often the agent actually needs this iterative loop to be able to do that. It needs to run all the code locally, it needs to know what to test, it needs to know what to look for. In many ways, this self-testing is just a much higher-context problem to solve for.
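Putting those two behaviors together, a confidence gate plus an iterative test loop might look roughly like this (a sketch under my own assumptions; the threshold, callbacks, and names are all invented for illustration, not Devin's actual logic):

```python
from typing import Callable, Optional

def handle_task(task: str,
                estimate_confidence: Callable[[str], float],
                execute: Callable[[str, Optional[str]], str],
                run_tests: Callable[[str], tuple[bool, str]],
                ask_human: Callable[[str], str],
                threshold: float = 0.8,
                max_iters: int = 5) -> str:
    # Confidence gate: only proceed autonomously when quite sure of the task.
    if estimate_confidence(task) < threshold:
        return ask_human(task)              # "I don't understand; please help."
    patch = execute(task, None)
    # Iterative test loop: run tests, feed failures back, retry until green.
    for _ in range(max_iters):
        ok, feedback = run_tests(patch)
        if ok:
            return patch                    # tests pass: ready to deliver a PR
        patch = execute(task, feedback)     # iterate on the test output
    return ask_human(task)                  # out of budget: escalate
```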
And that brings us to now. It's a pretty fun time, because what we're thinking about now is: instead of doing just one task, how do we think about tackling an entire project? And after we do a project, what goes after that? One point I'd make here is that we talk about all these 2x's that happen every couple of months, and from a kind of cosmic perspective all the 2x's look alike, but in practice every 2x is actually a different one. When we were just doing tab completion, single-line completion, it really was just a text problem: take the single file so far and predict what the next line is. Over the last year or year and a half, we've had to think about so much more. How do you work with the human in Linear or Slack? How do you take in feedback or steering? How do you help the human plan out all these things? And moreover, there's a ton of tooling and capabilities work that has to be done: how does Devin test on its own, how does it make a lot of these longer-term decisions on its own, how does it debug its own outputs or run the right shell commands to figure out what the feedback is and go from there? It's super exciting now that there are a lot more coding agents in the space; it's very fun to see. And I think we're going to see another 16 to 64x over the next 12 months as well, so yeah, super excited.
Awesome. Well, that's all. Thank you
guys so much for having me.