The 3 Pillars of Autonomy – Michele Catasta, Replit

Channel: aiDotEngineer

Published at: 2025-12-22

YouTube video id: MLhAA9yguwM

Source: https://www.youtube.com/watch?v=MLhAA9yguwM

So at Replit, we're building a coding agent for nontechnical users. It's a very peculiar challenge, I would say, compared to what many people in this room are working on. What I'm going to talk about today is why autonomy has become kind of the north star that we keep chasing, ever since we launched the very first version of Replit Agent in September last year.
Let's start from this very interesting plot, in case my clicker works, which it now does. I'm sure you have all seen it: the semi-async valley plot published by swyx a few weeks ago, which clarified the landscape a bit for all of us agent builders. On one hand, you have the low-latency interactions that really allow you to stay in the loop, so you can do deep work and focus on the coding task at hand, but you need to be an expert: you need to know exactly what to ask the model for, and you need to understand quickly whether you want to accept the changes or not. Then, for several months, many of us, including Replit, kind of lived in this valley, where the agent wasn't autonomous enough to really delegate a task and come back to see it accomplished, but at the same time it ran long enough not to keep you in the zone, not to keep you in the loop. Luckily, over time we managed to go all the way to the right, and now we have agents that run for several hours in a row. What I'm going to argue today, and I hope swyx doesn't stop inviting me to this event, is that there is an additional dimension, a third dimension, to this plot that hasn't been covered here: namely, how do we build autonomous agents for nontechnical users?
So what I'm going to argue today is that there are two types of autonomy. One of them is more supervised: think of the Tesla FSD example. When you sit in a Tesla, you're still expected to have a driving license. You're going to be sitting in front of the steering wheel; perhaps 99% of the time you're not going to use it, but you're there to take care of the long-tail events. Similarly, a lot of the coding agents we have today require you to be technically savvy in order to use them correctly. We at Replit, and other companies at this point, are focusing on the Waymo experience for autonomous coding agents: you're expected to sit in the back, you don't even have access to the steering wheel, and I basically expect you not to need any driving license. Why is this important? Because we want to empower every knowledge worker to create software, and I can't expect knowledge workers to know what kind of technical decisions an agent should be making. We should offload that level of complexity away from them completely.
Of course, it took a while to get here. I'm sure what I'm showing you here is something that all of you are very familiar with. It took several years to go from less-than-a-minute feedback loops under constant supervision, where we talked about completions and assistants; these are the areas where AI-powered IDEs really pioneered this type of user interaction. Then we slowly climbed through higher levels of autonomy. We had the first version of agents based on ReAct, where we concocted autonomy with a very simple paradigm on top of LLMs. Then, luckily, AI providers understood that tool calling was extremely important and poured a lot of effort into it, so we built the next version of agents with native tool calling. And then I would say there is a third generation of agents, which I call autonomous, and that's when we started to break the barrier of, say, one hour of autonomy: the agent being capable of running on long-horizon tasks and remaining coherent. It happens to be the case that those are also the versions of Replit Agent that we launched over the last year. V3 is the one we launched a couple of months ago, and it showcases exactly those properties. So the question for today is: can we actually build fully autonomous agents, and how do we get there?
So I'm going to try to redefine autonomy today. I think that oftentimes we conflate autonomy with something that runs for a long time and where, as a user, you lose control. In reality, the autonomy I want to give to agents can be very specifically scoped. What I mean by that is that, especially with Replit Agent 3, what we accomplished is making sure that our agent makes all the technical decisions. Of course, that can lead to very long gaps between user interactions, and in that case the agent again runs for several hours. But this happens if and only if the scope of the task you're giving to the agent is really broad. It turns out that in reality you can have an agent that is really autonomous and still fast, as long as you give it a very narrow scope for the task at hand. What we accomplish this way is that the user still maintains control over the aspects they care about, and a user cares about what they're building. Especially our users, knowledge workers: they don't care about how something has been built; they just want to see their goals accomplished. So autonomy should not be conflated with long run times. And similarly, it shouldn't become a vanity metric. A lot of us are talking about it as a badge of honor, and it's definitely been exciting to see in the last few months that many of us broke the barrier of running for several hours in a row. But in terms of how to build agents that are going to be more powerful and more steerable in the future, we kind of have to change the target, the metric that we keep in mind.
So think about it this way. Tasks have a natural level of complexity, and what we care about is that they have a minimum, irreducible amount of work that they express. What agents do is go through this loop of planning, implementing, and testing, and to make that happen and work correctly, you want the work to happen over a long, coherent trajectory. Our goal is to maximize the irreducible runtime of the agent: by that I mean a span of time where the user doesn't have to make any technical decisions and the agent can accomplish the task in full autonomy. This is especially important for us because I can't trust our users to make technical decisions, so they need a proper technical collaborator by their side. I want to abstract away as much complexity as possible from the process of software creation. And last but not least, I want users to feel in control of what they're creating, without stifling their creativity because they also have to think about the technical decisions the agent is making.
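To make the loop concrete, here is a minimal sketch in Python of the plan-implement-test cycle described above. The helper functions (plan_next_step, implement, run_tests, needs_user_decision) are hypothetical placeholders rather than Replit's actual implementation; the point is only that the agent surfaces to the user as rarely as possible.

```python
# Minimal sketch of the plan -> implement -> test loop described above.
# All helpers are hypothetical stand-ins for an agent framework of your choice.

from dataclasses import dataclass, field

@dataclass
class TaskState:
    goal: str
    history: list = field(default_factory=list)
    done: bool = False

def plan_next_step(state): ...        # decide the next unit of work (or None)
def implement(step): ...              # produce a code change for that step
def run_tests(change): ...            # verify locally; return a list of failures
def needs_user_decision(failures): ...  # true only if the user must weigh in

def autonomous_loop(state: TaskState) -> TaskState:
    """Run until the task's irreducible work is done, without asking the
    user to make technical decisions along the way."""
    while True:
        step = plan_next_step(state)                  # plan
        if step is None:                              # nothing left to do
            state.done = True
            return state
        change = implement(step)                      # implement
        failures = run_tests(change)                  # test / verify locally
        state.history.append((step, failures))
        if failures and needs_user_decision(failures):
            return state                              # surface only when unavoidable
```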
So now, what are the pillars of autonomy? How are we making this happen? I would say there are three pillars that are extremely important to think about. The first one is, of course, the capabilities of frontier models: the baseline IQ that we inject into the main agentic loop. I'm going to leave this as an exercise to the reader and to other people in the room; I'm really glad a lot of you are building amazing models that we use all the time at Replit. So that is pillar number one. The second pillar is verification. It's very important that we test for local correctness of our agent at every step it takes, and the reason is fairly intuitive: if you are building on very shaky foundations, eventually the castle will topple down. So we brought verification into the loop to make sure that, in a sense, you are adding nines of reliability and reining in the compounding errors that an agent will unavoidably make if you don't put any controls on it. And last but not least, you heard it on stage earlier, and I'm sure you're going to keep hearing it for the entire day, or the entire duration of the conference: the importance of context management. On one hand, you want an agent that is capable of being globally coherent, so it's aligned with the intent and the expectations of the user. But at the same time, it also has to be capable of managing both the high-level goal and the single task the agent is working on. I think we made amazing progress in the last months on context management, but I'm also excited to see where we're going as a field.
Let's start with the first pillar that we actively work on at Replit, which is verification.
So why do we focus on this? Over the last year we realized something that I think every one of you has experienced: without testing, agents build a lot of painted doors. In our case the painted doors are very visible because we create a lot of web applications. You end up trying to click on a button and the handler is not hooked up, or some of the data being shown is actually mock data and isn't coming from a database. In general this phenomenon spans every type of component you're building, be it front end or back end: a lot of components are actually not fully fleshed out by the agent. We ran some evaluations internally and found that more than 30% of individual features happen to be broken the first time they are built by the agent. That also means that almost every application has at least one broken feature or painted door. And they're hard to find: users are not going to spend time testing every single button and every single field. This is also probably one of the reasons why a lot of our users, especially the nontechnical ones, still can't trust coding agents very much; they are shocked when they find that there's a painted door out there. So, how do we solve this problem?
Fundamentally, an agent must gather all the feedback it needs from its environment. That's easier said than done. Again, nontechnical users not only cannot make technical decisions, they also cannot provide the technical feedback an agent requires to make progress. At most, what they can do is basic quality-assurance testing: they can literally go around the UI, click, and interact with the application. I'm sure you have tried it in your life; this is extremely tedious to do and it leads to a very bad user experience. Even though we relied on that with our first release of the agent last year, we quickly understood that users don't want to spend time doing testing. So we had to find a completely orthogonal solution, which is autonomous testing, and it solves several different issues. The first one is that it breaks the feedback bottleneck: even when we asked the user for feedback, we weren't given enough of it, and now we don't have to wait for human feedback anymore; we have a way to elicit as much information as possible from the app autonomously. We also want to prevent the accumulation of small errors, as I was saying before: we don't want compounding errors while the agent is building. And last but not least, we have to overcome the laziness of frontier models: we need to verify that whenever a model tells us a task has been completed, that is actually the truth and the result is not being hallucinated.
There is a wide spectrum of code verification you can accomplish. I think we all started from the very left: basic static code analysis with LSPs. We have been executing the code ever since we had LLMs that were capable of debugging, and then we slowly started to move toward the right. Generating unit tests and running them has a limitation: it's limited to functional correctness, and unit testing is, by definition, not very powerful for proper integration testing. We also started to do API testing, but that is limited to API code: you can test the endpoints of an application, but you can't really test how a web app functions and looks. For this reason, in the last few months we and other companies have been putting a lot of effort into creating autonomous testing based on the browser, in case the app being built is a web application. There are two main categories here. One is computer use: it's a one-to-one mapping with the user interface, so the model is directly interacting with the application. It requires screenshots, and it tends to be fairly expensive and fairly slow; I'm sure you have tested it yourself. A good middle ground is browser use, where we abstract the user interface: the model can then interact with the browser and the web application, and it relies on accessing the DOM through abstractions.
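As an illustration of the browser-use style, here is roughly what exposing a few generic browser tools to a model can look like. The tool names and schemas below are made up for illustration; they are not taken from any specific library or from Replit's stack.

```python
# Illustrative sketch of "browser use": the model is handed a small set of
# generic, DOM-level tools instead of raw screenshots. Names are hypothetical.

browser_tools = [
    {
        "name": "open_tab",
        "description": "Open a URL in a new browser tab.",
        "input_schema": {"type": "object",
                         "properties": {"url": {"type": "string"}},
                         "required": ["url"]},
    },
    {
        "name": "click",
        "description": "Click the element identified by a DOM selector.",
        "input_schema": {"type": "object",
                         "properties": {"selector": {"type": "string"}},
                         "required": ["selector"]},
    },
    {
        "name": "fill",
        "description": "Type text into a form field.",
        "input_schema": {"type": "object",
                         "properties": {"selector": {"type": "string"},
                                        "text": {"type": "string"}},
                         "required": ["selector", "text"]},
    },
]
```

As discussed next, the limitation of a fixed tool set like this is the long tail of interactions that doesn't map cleanly onto a small number of generic tools.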
So how do we make this work? We generate applications that are amenable to testing, and we sort of merge everything together from the previous slides I showed you. We allow our testing agent to interact with an application and gather screenshots in case nothing else has worked, so we have a fallback to computer use. But the vast majority of the time, what we do is have programmatic interactions with the application: we interact with the database, we read the logs, we make API calls, we literally click on the app and get back all the information we need. By putting all of this together, we collect enough feedback to allow our agent both to make progress and to fix all the painted doors it encounters.
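A minimal sketch of what gathering that programmatic feedback can look like, assuming a local web app with a SQLite database and a log file; the endpoint, table names, and paths are hypothetical, not Replit's actual tooling.

```python
# Sketch: pull feedback from the running app instead of asking the user.
# base_url, db_path, and log_path are placeholders for the app being built.

import json
import sqlite3
import urllib.request

def collect_feedback(base_url: str = "http://localhost:3000",
                     db_path: str = "app.db",
                     log_path: str = "server.log") -> dict:
    feedback = {}

    # 1. Hit an API endpoint and keep the raw response for the agent to inspect.
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=5) as resp:
            feedback["api_health"] = {"status": resp.status,
                                      "body": resp.read().decode()[:500]}
    except Exception as exc:
        feedback["api_health"] = {"error": repr(exc)}

    # 2. Check that data actually landed in the database (no mock data).
    with sqlite3.connect(db_path) as conn:
        feedback["row_counts"] = {
            table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            for (table,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table'")
        }

    # 3. Surface recent server logs so compounding errors are caught early.
    with open(log_path) as f:
        feedback["recent_logs"] = f.readlines()[-50:]

    return feedback

if __name__ == "__main__":
    print(json.dumps(collect_feedback(), indent=2, default=str))
```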
Just a short technical deep dive on how we accomplish this. I'm sure you have seen a lot of the tool-based browser use; there are amazing libraries out there, and the first one that comes to my mind is Stagehand. The idea is that you have an agent with a few very generic tools exposed: the agent can create a new tab, click, fill forms, and so on. The limitation is that it's difficult to enumerate all the different types of interactions you could be having with a browser. The problem of testing is very similar to the Tesla analogy I was making before: maybe this cardinality of available tools is enough for 99% of interaction types, but there is always a long tail of idiosyncratic interactions that a user makes with a web application which are hard to map onto these different tool calls. So what we do at Replit is directly write Playwright code. Playwright code is, first of all, very manageable for LLMs; LLMs are kind of amazing at writing Playwright, and that's been our experience since we started to work on this project. It is also very powerful and expressive, so in a sense it's a superset of what you can express with the tool-based testing on the left. And last but not least, there is beauty in creating Playwright code because you can reuse those tests: the moment you write a test as a script, you can rerun it as many times as you want. So the moment you create a test, you're also creating a regression test suite that you can keep running in the future. All these tricks I just explained helped us create something that is roughly an order of magnitude cheaper and faster compared to computer use. And we'll come back later to how important latency is.
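Here is a sketch of the kind of Playwright script an agent might generate and then keep around as a regression test; the URL, selectors, and expected text are hypothetical placeholders. Because it is plain code, the same script can be rerun on every future change.

```python
# Sketch of an agent-generated Playwright check for a single feature.
# URL, selectors, and data are illustrative, not from a real Replit app.

from playwright.sync_api import sync_playwright, expect

def test_create_invoice():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/invoices")

        # Exercise the feature exactly as a user would.
        page.click("text=New invoice")
        page.fill("#customer", "Acme Corp")
        page.fill("#amount", "1200")
        page.click("button:has-text('Save')")

        # Verify real data rendered by the app, not a painted door.
        expect(page.locator(".invoice-row").last).to_contain_text("Acme Corp")

        browser.close()

if __name__ == "__main__":
    test_create_invoice()
```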
The second pillar that I wanted to talk about today is, of course, context management. I'm going to go very fast here because I think you're going to hear a lot of talks about it today. The high-level message is that long-context models are not needed to work on coherent, long trajectories. From experience, we found that most tasks, even the more ambitious ones, can be accomplished within 200,000 tokens. So we're still not in a world where working with models that have 10 million or 100 million token context windows is necessary to run autonomous agents. We accomplish this by learning how to do context management correctly. First of all, there are several different ways to maintain state that don't imply stuffing all the state into your context window. You can do that, for example, by using the codebase itself to maintain state: you can write documentation while the agent is creating new code. You can also take the plan description and all the different task lists the agent is working on and persist them on the file system, so even there you have a lot of ways to offload your memories. And last but not least, and this is something Anthropic has been really evangelizing, you can even dump your memories directly to the file system and then make sure your agent decides when to bring them back the moment they become relevant to your work. For this reason we have been seeing a lot of announcements in the last couple of months. Just to pick this one from Anthropic: with Claude Sonnet 4.5 they have been able to run focused tasks for more than 30 hours in a row. We have seen similar results from OpenAI on math problems. So I think we kind of broke the barrier of running for a long time while staying coherent on a task.
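A minimal sketch of offloading state to the file system rather than the context window, in the spirit described above; the file names and layout are illustrative only.

```python
# Sketch: persist plan, task list, and memories to disk so they survive
# context compression and are re-injected only when relevant.

import json
from pathlib import Path

WORKSPACE = Path(".agent")
WORKSPACE.mkdir(exist_ok=True)

def save_plan(tasks: list[dict]) -> None:
    """Persist the task list so it survives context compression."""
    (WORKSPACE / "plan.json").write_text(json.dumps(tasks, indent=2))

def append_memory(note: str) -> None:
    """Dump a memory to disk; the agent reads it back only when relevant."""
    with open(WORKSPACE / "memories.md", "a") as f:
        f.write(f"- {note}\n")

def load_relevant_memories(keyword: str) -> list[str]:
    """Pull back only the memories that mention the current topic."""
    path = WORKSPACE / "memories.md"
    if not path.exists():
        return []
    return [line for line in path.read_text().splitlines() if keyword in line]

# Usage: persist the plan up front, jot memories as work progresses,
# and re-inject only what the current step actually needs.
save_plan([{"id": 1, "title": "Add login page", "status": "in_progress"}])
append_memory("login: users table uses email as the primary key")
print(load_relevant_memories("login"))
```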
I would say the key ingredient to make this happen has been how good models, and we as agent builders, have become at sub-agent orchestration. Sub-agents are invoked from the core loop, and each one starts from a blank slate, a completely fresh context. You, as the agent builder, decide what subset of the context to inject when the sub-agent starts. It's a concept very similar to something everyone who has been writing software over the last decades knows: separation of concerns. You decide what your sub-agent is going to work on, you give it the least possible amount of context, you allow it to run to completion, you only take the output, the results, you inject them back into the main loop, and you keep running this way. Of course, it significantly improves the number of memories per compression. I brought this plot directly from Replit Agent running in production, from the moment we turned on our new sub-agent orchestrator. On the y-axis you can see the number of memories per compression: we went from roughly 35 to 45-50 recently. That's a big improvement in how often we recompress our context, just because we can offload a lot of the context pollution by using sub-agents.
I'm going to give an example of where this made the difference for us. What I'm showing here is more of a cost optimization in a sense, since you're compressing less; you also get separation of concerns, which definitely makes your agent a bit smarter. In the case of testing, working with sub-agents was almost mandatory for us. We started to work on automated testing even before we were very advanced in terms of sub-agent orchestration, and what we found, as I was saying before, is that it makes things easier: better cost, less pollution. But when you allow the main loop not only to create code but also to perform browser actions, and you put the observations from those browser actions back into the main loop, you tend to confuse the agentic loop very much, because at that point there is a lot going on in terms of the actions your main loop is looking at. So in order to make this work, not only did we have to build all the Playwright framework I was showing you before, we also had to move our entire architecture onto sub-agents. At this point you can see very clearly why there is a separation of concerns here: you've got the main agent loop running, at a certain point we decide it's time to verify whether the output of the agent has been correct, we make this happen entirely within a sub-agent, then we discard the context window of that sub-agent, return only the last observation to the agent loop, and keep running that way. So if you're having issues today making your sub-agents work correctly, this is one of the reasons you want to take a look at.
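A sketch of the testing sub-agent pattern just described: a fresh, minimal context, a run to completion, and only a compact last observation returned to the main loop. The run_llm callable, the prompts, and the FINAL marker are hypothetical placeholders for whatever model interface and conventions you use.

```python
# Sketch of a testing sub-agent: fresh context in, compact summary out.

from typing import Callable

def run_testing_subagent(run_llm: Callable[[list[dict]], str],
                         feature_description: str,
                         app_url: str) -> str:
    # Fresh, minimal context: no main-loop history, just what the tester needs.
    messages = [
        {"role": "system", "content": "You are a QA sub-agent. Test the feature "
                                      "end to end and report pass/fail with details."},
        {"role": "user", "content": f"App: {app_url}\nFeature: {feature_description}"},
    ]
    transcript = []
    for _ in range(20):                       # run the sub-agent to completion
        reply = run_llm(messages)
        transcript.append(reply)
        messages.append({"role": "assistant", "content": reply})
        if "FINAL:" in reply:                 # sub-agent signals it is done
            break
    # Throw away the sub-agent's context; return only the last observation.
    return transcript[-1] if transcript else "FINAL: no result"

# In the main loop, only this short summary is injected back, e.g.:
# summary = run_testing_subagent(run_llm, "invoice creation", "http://localhost:3000")
# main_loop_messages.append({"role": "user", "content": f"Test result: {summary}"})
```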
So I think we covered, at a high level, how to create more and more powerful autonomous agents over time, and I only see us as a field becoming even more proficient at this in the next months. There is one additional ingredient, though, that is going to make the difference, and it's parallelism. I will argue that parallelism is important not because it's going to make agents more powerful per se, but rather because it's going to make the user experience more exciting. Of course it is great to have an agent that is capable of running autonomously for a long time, but at the same time it comes at the price of making the user experience less thrilling. You are not in the zone anymore. What you do is write a very long prompt, it's translated into a task list, then you go to have lunch with your colleagues, and then you come back and hope that the agent is done. That is not the kind of experience that most productive people want to have: you want to see as much work done as possible in the shortest span of time.
So what we have done as a field at this point is create parallel agents. It's a very common trade-off which, by the way, doesn't only apply to agents; it applies to computing in general. With parallel agents, you trade extra compute in exchange for time. Why is there this trade-off? First of all, when you're running agents in parallel, you're gathering the same context in multiple context windows: every single parallel agent you run probably shares, say, 80% of the context across the board, so you are putting in more compute simply because you're running those agents in parallel. There is also another cost that is kind of intangible for a lot of you in this room, because I'm sure you're all expert software developers: what do you do with the output of multiple parallel agents at the end? Oftentimes you need to resolve merge conflicts. As a reminder, my users don't even know the concept of a merge conflict; it's something we have to figure out on our own. So the current way in which we think of parallel agents in the space doesn't really apply to Replit. At the same time, I still very much want to accomplish this. There are so many interesting features you can enable with parallelism, aside from the fact that you can get more work done. At times you want testing to run in parallel with the agent that creates code; testing, no matter how much we optimize it, is still very slow, and if an agent is only spending time on testing, users are not going to engage with your application anymore. At the same time, it's also great to have an asynchronous process running while your agent is running, because you can inject useful information back into the main core loop. And last but not least, there is a very common technique that we know boosts performance if you have enough budget to do so: you should be sampling multiple trajectories at the same time. So a lot of perks come with parallel agents. But the way in which we implement them today, which I would basically call "user as the orchestrator", is that the parallel tasks you want to run are determined by you, the user, and each task is dispatched in its own thread. So there's a bit of a manual process: even the task decomposition, in a sense, is happening in your mind while you're thinking about which agents to run, and the moment you get back all the results, you need to go through the problem of merge conflicts, which oftentimes is not trivial at all, no matter how many amazing tools are out there.
So what we're working on today, for our next version of the agent, is having the core loop as the orchestrator. The key difference here is that the subtasks we're going to work on are not determined by the user; they're determined by the core loop, and the parallelism is basically decided on the fly. The agent does the task decomposition on behalf of the user, and this comes with a couple of advantages. First of all, there's no cognitive burden on the user to figure out how they should decompose the task. At the same time, there are also ways in which you can create tasks that mitigate the problem of merge conflicts. I'm not claiming that we're going to mitigate it 100%; there are plenty of corner cases in which merge conflicts will still be a problem, but there are a lot of techniques known in software engineering to make sure multiple sub-agents don't step on each other's toes. So the core loop as an orchestrator is going to be our main bet for the next few months.
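A minimal sketch of the "core loop as the orchestrator" idea: the agent decomposes the goal itself and dispatches the sub-tasks in parallel. decompose_task and run_subagent are hypothetical placeholders, not Replit's actual orchestrator.

```python
# Sketch: the core loop, not the user, decides the decomposition and fans out.

from concurrent.futures import ThreadPoolExecutor

def decompose_task(goal: str) -> list[str]:
    """Ask the core loop's model to split the goal into independent sub-tasks
    that are unlikely to touch the same files (mitigating merge conflicts)."""
    ...

def run_subagent(subtask: str) -> str:
    """Run one sub-agent to completion in its own fresh context and thread."""
    ...

def orchestrate(goal: str) -> list[str]:
    subtasks = decompose_task(goal) or []
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_subagent, subtasks))
    # The core loop inspects the results and reconciles any overlap itself,
    # so the user never has to think about merge conflicts.
    return results
```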
And in case you're passionate about these topics, I'm always hiring at Replit. Thank you.