Harness Engineering: How to Build Software When Humans Steer, Agents Execute — Ryan Lopopolo, OpenAI

Channel: aiDotEngineer

Published at: 2026-04-17

YouTube video id: am_oeAoUhew

Source: https://www.youtube.com/watch?v=am_oeAoUhew

Our next speaker is here to speak about harness engineering: how to build software when humans steer and agents execute. Please join me in welcoming to the stage, member of technical staff at OpenAI, Ryan Lopopolo.
[applause]
[cheering]
Good morning, London.
[applause]
I'm super excited to be here today. I'm
Ryan Lopopolo, and for the last nine months
I have had the privilege of building
software exclusively with agents. Uh I
am a token billionaire and I believe
that in order for us to get into our AGI
future, we want everybody to be token
billionaires to use the models to do the
full job. And what that means is to lean
into the idea that the models are
capable of being a full software
engineer. And I've lived that experience
by banning my team from even touching
their editors, to have to work through
the models in order to get the job done.
And today I'm going to talk to you a
little bit about what it means to lean
into that and operationalize the way you
work, the code spaces you live in, and
the processes on your teams in order to
get the agents to do the full job.
I believe I'm preaching to the choir
here when I say that the way we build
software has changed. In the last six
months, we have seen coding agents take
over the world and capability has
continually advanced at a super fast
pace to have these models and the
harnesses within which they live take
more complex actions, do more
complicated work with higher reliability
over longer time horizons.
And the place we've gotten to here is
that implementation is no longer the
scarce resource of what it means to do
the job of software engineering. Code is
free. We have an abundance of code to
solve the problems that we come across
in our day-to-day as we run our teams,
build software, and solve user problems.
Hiring the hands on the keyboards as
part of our teams is only constrained by
GPU capacity and token budgets. And each
engineer today in this room has access
to five, 50, or 5,000 engineers' worth of capacity 24/7, every day of the year. The only thing our roles require is figuring out how to
productively deploy these resources into
our code and into our teams to make use
of this new capacity.
And in this world, skill sets are
shifting more towards systems thinking,
system design, and delegation in order
to make use of this abundant capacity to
produce code to solve problems. And
there are three reasons that this
happened. All of which happened in late
2025.
For me, the magic moment was GPT 5.2, which when it came out was able to do the full job of a software engineer. The models at this point are good enough that they're isomorphic to you and me in terms of the ability to produce high-quality code that solves real user problems in real codebases.
Code is free. And I know this is maybe a
scary thing to hear because code carries
maintenance burden, but it's free to
produce, free to refactor, and it is not
a thing to get hung up on anymore.
We think of code as burden because it's a synchronous attention drain on
the human engineers on our team. But the
models are incredibly patient. They are
infinitely parallel. So the ability to
produce, maintain, refactor, and delete
code is no longer a forcing function on
figuring out how to allocate resources
on your engineering teams. So the sort of AGI-pilled view here is to believe that the models are capable of producing every line of code we could ever possibly need, figuring out when to delete it, and figuring out when to refactor it or make it more reliable. And it's your role as software engineers to unblock your team of agents, and the humans driving those agents, so they can drive long-horizon work and do the full job.
The idea here is that every one of you
is a staff engineer. You have as many
team members as you can possibly drive
concurrently and have tokens to support
and you need to look one day, one week,
six months into the future to figure out
what structures you need to put in place
to productively harness this infinite
capacity to produce code.
The scarce resources in this world that
we see today are three things: human time, human and model attention, and model context window. And in a world where human time and attention are scarce, the role is to think about where
that time is going, figure out ways to
productively automate it and move that
synchronous human time into higher
leverage activities.
In a world where human time is scarce
and human time is required to produce
code, we have a stack rank. Things are
either P0s or P2s; the P3s will
never get done. However, in a world
where code is free and infinitely
abundant, all those P3s get kicked off
immediately, maybe 4x in parallel. We
pick one that solves the problem and in
it goes.
I've had the privilege of building a ton
of agents internally at OpenAI to
improve the productivity of my
co-workers. And when code is free, all
these internal tools can have good
localization and internationalization
from day one. I can make tools that my
colleagues in London, Dublin, Paris,
Brussels, Zurich, and Munich are able to
experience in their native languages
without really having to trade against any of my other teams' capacity in order to make high-quality tools.
We should be working with the assumption
that the best parts of software
engineering that we all know, live, and
breathe are available in any product
that we could ever build all the time.
Humans no longer need to concern themselves with implementation. The important thing is not the code but the prompt and the guardrails that got you there. This is why leaving breadcrumbs matters: documentation, ADRs, persona-oriented documentation around what a good job looks like, all the historical logs of tickets and code reviews. This is the process that got you and your teams to the code and products you have today, and it is what needs to happen to get your agents there
as well. Your job is to build systems,
software and structures that enable your
team to be successful. And to do that,
we need to make them legible to those
agents that are driving the
implementation. That means structuring
them in a way that's native to the
agents, writing them in a way that respects context, which is the other scarce resource here, and
figuring out ways to make the tokens
that are required to do the job easy to
predict. That means making things the
same as much as possible so we can limit
the amount of attention the model needs
to activate in order to do the job.
Large scale refactoring in this world is
free. So making things the same is
something that you are all able to do.
There's never going to be a migration that hangs open for six months because you can't get the last parts of the codebase done, now that you can just fire off 15 agents to drive that work to completion. This is what it means to
have a migration, right? We can finish
them now. Come on. That's good. That's
good. Clap.
[applause]
There's sort of this like meta
epistemological question here about like
what it means to do a good job and doing
a good job as a software engineer is
hard. It requires years of being in the industry to fully internalize what it means to write high-quality, maintainable, reliable code that our teammates are able to build on top of, code that is going to accrue leverage to the codebase. To do a single patch well probably requires 500 little decisions along the way about the underspecified non-functional requirements that go into producing good code. The agents, the
models during their training have seen
trillions of lines of code that make
every possible choice of those
non-functional requirements that you
could ever imagine. So, it's our job to
specify those non-functional
requirements to write them down in a way
that the agents can see this is what it
is to do a good acceptable job that's
going to produce a merged patch. And if
the agents aren't doing that, it's our
job to figure out ways to refine and
restrict their output such that the code
they write is acceptable. You can just simply say, do not produce slop. Don't accept slop, and you won't get slop in your codebase. But doing that requires taking short-term velocity hits in order to back up, or double-click into a task, to figure out what it is the agents are struggling with in your environment.
Put the guardrails in place so they stop
making those mistakes
and then figure out ways to step back
and spend your time on higher leverage
activities once you solve some of the
blockers in the short term.
When I think about empowering my team in
this way, everyone is an expert in what
it is they bring. I have a diverse full
stack team with experts in front-end architecture, backend scalability, and being product-minded. And each one of those
different personas fleshes out the skill
set of my team by bringing a different
understanding, a different set of solves
for those non-functional requirements.
Getting teammates to write those down
actually means that every engineer
driving agents gets the best of every
single person on my team. I don't need
to block on low signal code review in
order to learn what it means to write a
good QA plan. To have one engineer on my
team document that in a durable way
means every agent trajectory is going to
get a good QA plan. And we can do this
once, in a high-leverage way that we're able
to stack on top of.
So how can we get the agents to do a
good job? What are some of the tools and
techniques we have in order to
essentially prompt inject our agents and
continually remind them of what it means
to make those specific choices that we
expect around those non-functional requirements? And there's a bunch of ways we can do this. We can write good AGENTS.md files. However, auto-compaction is a thing that has continued to improve. GPT 5.4 and Codex are fantastic at auto-compaction; I essentially never have to type /new anymore. I've got some pictures on my Twitter of me strapping my laptop into the back of my car so I can keep running inference while I'm commuting to and from work. And in
this world, you have to kind of build
for that expectation that context will
get paged out over time. We need to be
continually refreshing context as the
agent goes about doing a task. And the
ways we can do that are by having
reviewer agents look at the code along
the way through the lens of what it
means to be successful. Right? We have
security and reliability review agents
in our codebase that are continually
running as part of every push and CI
that look at that documentation and the proposed patch and ask simple things like: are there timeouts and retries on this bit of network code? Does the code that has been introduced have a secure interface that is impossible to misuse?
I'm sure everyone here has been paged at
some point for network code that failed
in production causing an outage that
could have been remediated by a retry
and a timeout. And I know I'm guilty of putting that retry and timeout in, merging the bug fix, and otherwise ignoring it. I am not a reliable reviewer or author of code with respect to this non-functional requirement. However, taking the time to write some docs and write a lint that is bespoke to my codebase, one that looks at every time I call fetch to make sure that there's a retry and a timeout wrapped around it, means I've durably solved this problem. And I'm able to do it because I lean on the axiom that code is free, that the agents are able to do a good job, and that I can completely migrate the codebase to solve this problem durably once and for all. And in order to kind
of operate in this way, we need to step
back and look at the durable classes of
failures that the agents and the humans
in the codebase are making time after
time. Figure out why we're spending time
on it. Devise a solution to
systematically eliminate this class of
misbehavior and then continue to
observe, refine, and make additional
choices on those non-functional
requirements.
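As a concrete illustration, here is a minimal sketch of what that bespoke lint could look like as a custom ESLint rule. The wrapper name fetchWithRetry, the package @acme/net, and the docs path are hypothetical stand-ins, and the error message doubles as a remediation prompt for whichever agent or human hits it:

```ts
// eslint-rules/no-bare-fetch.ts
// Flags direct fetch() calls and points the author (human or agent) at the
// shared wrapper that adds a timeout and retry policy.
// All names below (fetchWithRetry, @acme/net, docs path) are hypothetical.
import type { Rule } from "eslint";

const rule: Rule.RuleModule = {
  meta: {
    type: "problem",
    docs: { description: "Require network calls to go through fetchWithRetry" },
    schema: [],
    messages: {
      bareFetch:
        "Do not call fetch() directly. Use fetchWithRetry() from @acme/net so this " +
        "request gets a timeout and retry policy. See docs/reliability/network.md.",
    },
  },
  create(context) {
    return {
      CallExpression(node) {
        // Report plain `fetch(...)` calls; the wrapper itself lives in an
        // allow-listed package and is linted separately.
        if (node.callee.type === "Identifier" && node.callee.name === "fetch") {
          context.report({ node, messageId: "bareFetch" });
        }
      },
    };
  },
};

export default rule;
```

The point is not the rule itself but that the failure message carries the remediation, so the class of mistake gets eliminated durably rather than re-litigated in review.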
One really neat trick I use here is that
you can write tests about the source
code as well that are separate from
lints. Right? If we know that context is
limited, we can write a test that asserts files are no longer than 350 lines. We're adapting our codebase to the harness and to the models, doing a little bit of engineering to be context-efficient and squeeze more juice out of the model capability that we have today.
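A minimal sketch of such a structural test, using Node's built-in test runner; the src/ root and the 350-line budget are the only assumptions here, and the failure message tells the agent (or human) what to do next:

```ts
// tests/file-length.test.ts
// Structural test: keep source files small enough to be context-efficient.
import { test } from "node:test";
import assert from "node:assert/strict";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const MAX_LINES = 350; // assumed budget from the talk

function listSourceFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) return listSourceFiles(full);
    return /\.(ts|tsx)$/.test(entry.name) ? [full] : [];
  });
}

test(`source files stay under ${MAX_LINES} lines`, () => {
  const offenders = listSourceFiles("src")
    .map((file) => ({ file, lines: readFileSync(file, "utf8").split("\n").length }))
    .filter(({ lines }) => lines > MAX_LINES);

  assert.deepEqual(
    offenders,
    [],
    "These files are too long for the context budget; split them into smaller modules " +
      `before merging:\n${offenders.map((o) => `  ${o.file} (${o.lines} lines)`).join("\n")}`
  );
});
```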
The other things we can think about are
providing good error messages that give
actual remediation steps to the model
and to humans for how to proceed next.
It's not enough to say we've got a lint failure because we're awaiting in a loop, or that we have an unknown at this deep part of the codebase, and why is the model writing a function called isRecord? What we need to do is provide a prompt, via a lint or a test failure, that says: no, no, no, you shouldn't have an unknown here at all, because we parse, don't validate, at the edge, and you certainly have a type here, which was derived from zod, load-bearing infrastructure for our AI future. You can just prompt things: everything I've talked about here today is a prompt. You can do this without touching the model weights at all.
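To make that error message concrete, here is a small, hedged example of the "parse, don't validate at the edge" pattern it is pointing at; the schema, endpoint, and type names are invented for illustration:

```ts
// The boundary parses untrusted input with zod exactly once; everything past
// this point works with a real type instead of `unknown`. Names are hypothetical.
import { z } from "zod";

const UserSchema = z.object({
  id: z.string(),
  email: z.string().email(),
  createdAt: z.coerce.date(),
});
type User = z.infer<typeof UserSchema>;

export async function loadUser(id: string): Promise<User> {
  // In the real codebase this call would go through the retry/timeout wrapper.
  const res = await fetch(`/api/users/${id}`);
  // parse() throws on bad data, so callers never see `unknown` or hand-roll
  // their own isRecord() checks deeper in the stack.
  return UserSchema.parse(await res.json());
}
```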
Kind of a funny digression here is it
seems like each advancement we've had in
the complexity of the way we write code
to interact with these models comes from
both increasing capability in the models
and increasingly
niche ways for injecting prompts into those models. Prompts, I'm sure you're aware, are prompts; prompts power prompts. Rules files: prompts. Skills: prompts. These lint error messages that I am talking about: prompts. Review agents that inject comments onto the PR, comments that we require the agent to address before it is able to propose it for merge: prompts. You're going to find lots of ways to insert prompts into your code, and one way you can do that is by embedding agent SDKs into your tests that are going to review the codebase for acceptability using prompts that get embedded into the code. And if I find myself spending a
ton of time writing prompts, we can
actually shell out to the agent for that
as well. Uh I've pointed Codex at all of the prompting cookbooks we have on the OpenAI developer guide and told it to synthesize a skill out of them for how to write prompts, which means when I
find a need to write prompts in order to
improve my agent performance locally in
the code, I use the skill to write
prompts that I wrote with the agent
looking at the prompts to write the
prompts. [laughter]
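As a rough sketch of embedding a model call in a test, here is one way it could look with the OpenAI Node SDK; the model name, the docs path, the diff command, and the OK/issue-list convention are placeholder choices for illustration, not the actual internal setup:

```ts
// tests/review-agent.test.ts
// Ask a model to review the staged diff against the written-down
// non-functional requirements, and fail the test on any P2-or-above finding.
import { test } from "node:test";
import assert from "node:assert/strict";
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";
import OpenAI from "openai";

test("diff meets the documented non-functional requirements", async () => {
  const diff = execSync("git diff origin/main...HEAD", { encoding: "utf8" });
  const requirements = readFileSync("docs/non-functional-requirements.md", "utf8"); // hypothetical path

  const client = new OpenAI();
  const response = await client.chat.completions.create({
    model: "gpt-5.2", // placeholder; use whatever model your harness runs
    messages: [
      {
        role: "system",
        content:
          "You are a reliability reviewer. Given requirements and a diff, reply with " +
          "'OK' or list P2-or-above issues with file and line references.",
      },
      { role: "user", content: `${requirements}\n\n---\n\n${diff}` },
    ],
  });

  const verdict = response.choices[0]?.message?.content ?? "";
  assert.ok(verdict.trim().startsWith("OK"), `Review agent flagged issues:\n${verdict}`);
});
```

The prompt lives in the repository next to the code it guards, so it gets refreshed into context every time the check runs.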
All the leverage that you're encoding into your repository, your team, and the agents in this way stacks incredibly well. To kind of pull back to this idea
that a single product-minded engineer on
my team was able to give us a big lift.
They know what it means to write a good
QA plan. To write a good QA plan though,
you have to document all the features
that you have, the critical user
journeys, and how users engage with your
applications, web apps, APIs, and
services.
Once you write those down on how to
write a good QA plan with the
expectation that all user-facing work has
a QA plan, now a review agent is able to
assert expectations around what it means
to prove that you have effectively
written the feature. A QA plan indicates
what media should be attached to the PR
for the humans and agents to know that
you've done a good job, which has the
consequence of me trusting the output more, needing to shoulder-surf the agent less, and removing myself from the loop even more to delegate more and more of the work to agents. And all of this is just making sure the agents have the tools and tokens and context to do the full job, so I can remove myself from the role of synchronous driver. The
models crave tokens. We can
operationalize our codebase to give them
tokens to drive them forward using sub
agents and all these other techniques to
refine the agent output.
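One hedged way to make the "every user-facing PR carries a QA plan with evidence" expectation executable is a small CI check against the PR body; the heading convention, environment variables, and evidence heuristic below are assumptions, not a description of the real pipeline:

```ts
// scripts/check-qa-plan.ts
// CI gate: user-facing PRs must include a QA plan section with attached evidence.
// The '## QA Plan' heading, PR_NUMBER variable, and media heuristic are hypothetical.
import { Octokit } from "@octokit/rest";

async function main() {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  const [owner, repo] = process.env.GITHUB_REPOSITORY!.split("/");
  const pull_number = Number(process.env.PR_NUMBER);

  const { data: pr } = await octokit.rest.pulls.get({ owner, repo, pull_number });
  const body = pr.body ?? "";

  const hasQaPlan = /##\s*QA Plan/i.test(body);
  // Screenshots and recordings typically show up as attachment or media links in the body.
  const hasEvidence = /user-attachments|\.png|\.gif|\.mp4/i.test(body);

  if (!hasQaPlan || !hasEvidence) {
    console.error(
      "This PR looks user-facing but is missing a '## QA Plan' section with attached " +
        "screenshots or recordings. See docs/qa-plans.md for what a good plan looks like."
    );
    process.exit(1);
  }
}

main();
```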
I'm excited to let you all know today in
the way you all do that you can just go
build things. Do not hesitate to remove
yourselves from the loop by getting the
agents to do the full job because they
can. Thank you.
>> Very excited to bring on our guest.
We've got Ryan Lopopolo today. He just gave the keynote. Um, very exciting speaker. The man is full-send on harness engineering at OpenAI. So, uh, a little bit of background. We did a Latent Space episode with him. We shipped it the other day. The story is, he wrote this great article called Harness Engineering and we're like, "Wow, this is pure gold." We have him on the podcast. He's a token
billionaire spending over a billion
output tokens a day. That's like over
$1,000. So, you know, man is really
living it. Uh, we want to keep this
exciting. Ask good questions, ask
interesting stuff, ask things that
people can learn from. But, you know,
let's welcome Ryan onto the stage.
[applause]
>> Hi folks, how's it going? Excited to be
here. Uh, London has been fantastic and
uh, excited to kind of walk through what
it is, uh, that we do and how we work
here.
>> I think you got to come on. This camera
is just here. So,
>> I got blinded by the QR code. So, we're
[laughter]
>> Okay. So, background. We have about an
hour. Um, scan this QR code. You should
get Slido. Slido will let you ask
questions. If you see interesting stuff,
you can thumbs them up and we'll try to
get through them. Unfortunately, the
first one I can't super do, but let's
just kick it off. Ryan, can you show us
your actual working setup with
[laughter] no laptop? Um,
>> Uh, yeah. Uh, here: beach, [laughter] margarita, Linear, right?
>> Oh, wow.
Um, I'll say watch the podcast we put
out. We go through some of the work, but
if you want to talk about it, I guess
without actually showing us what's your
what's your workflow like? What's your
setup? How do you how do you approach a
task?
>> Sure. So, uh, the way me and my team
work is to start with tickets, right? We
have chunks of work that we want to do,
features we want to add to our apps,
reliability work that we want to do. uh
we give that ticket to an agent along
with a couple of skills that enable it
to manipulate our app. Uh we want the
entry point to the development process
to be Codex, not an environment which we build around it. So we kind of do things outside-in, right, like Codex is the entry point the same way you would be, and we give it tools, we give it instructions on how to cook. So rather than creating a shell that our app and Codex get spawned into, we have a skill that teaches Codex how to launch the app, that teaches Codex how to spin up that local observability stack to
give it logging and telemetry. We give
it a skill that enables it to uh boot up
Chrome DevTools and attach to the
application with a, you know, local CLI that will connect via some daemon that we have. So the whole way we have set up the repository and all of the local dev tools is for Codex to invoke them
first. Um that means we have kind of
like a bunch of little mini harnesses
within the codebase that make it really
easy for us to slot in additional guard
rails. Uh you know a big package of
custom ESLint rules which get wired into
every PNPM package in the workspace. We
have another sort of local dev harness
that allows us to add sort of like
higher level wholesome tests that assert
the structure of the code itself rather
than like either the syntax or the
behavior of the code. Things like you
know package privacy dependency edges
between different layers of our stack.
these sorts of things. Uh, making sure that, you know, across multiple files zod schemas are deduplicated, that there's a
single canonical implementation of like
our async helpers. Uh these sorts of
things because you know the way we have
seen the agents work is to sometimes
optimize for local coherence of a
package rather than using like our
shared utilities and things like that.
So having observed that behavior, we
kind of have built a bunch of little
pseudo-linter source-code verification
things that shake out some of that bad
behavior so the humans don't get
distracted paying attention to that in
reviews, stuff like that. But uh the
setup optimizes for the agent to do the
job and for the humans to not have to
keep track of the high churn in the
codebase. Um we kind of centralize our
leverage around five to 10 skills. uh we
don't go super wide on skills preferring
to make the existing skills better
because at least I find that the
infrastructure within the repository all
the local developer tools change super
frequently uh and I don't really have
the bandwidth to keep track of this. So
we hide all that complexity beneath the
skills that the human has to invoke and
let the agent just kind of figure it
out. One one kind of neat thing here is
um when we moved from using uh Chrome DevTools protocol directly to having this like daemon thing, like I didn't know that had happened for like three weeks.
Uh it was like totally fine because
Codex was able to do the thing uh you
know with the documentation and things
that we had in place
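For a flavor of what those "wholesome" structural tests can look like, here is a sketch of one that walks a pnpm workspace and asserts dependency edges between layers; the package-name prefixes and directory layout are invented for illustration:

```ts
// tests/dependency-edges.test.ts
// Structural test: UI packages may not depend directly on the database layer.
// Package prefixes and the packages/ layout are hypothetical.
import { test } from "node:test";
import assert from "node:assert/strict";
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const FORBIDDEN: Record<string, RegExp> = {
  // package-name prefix -> dependencies it must never take
  "@app/ui-": /^@app\/db-/,
};

test("layer boundaries hold across the workspace", () => {
  const violations: string[] = [];
  const dirs = readdirSync("packages", { withFileTypes: true }).filter((e) => e.isDirectory());
  for (const dir of dirs) {
    const pkg = JSON.parse(readFileSync(join("packages", dir.name, "package.json"), "utf8"));
    const deps = Object.keys({ ...pkg.dependencies, ...pkg.devDependencies });
    for (const [prefix, forbidden] of Object.entries(FORBIDDEN)) {
      if (pkg.name?.startsWith(prefix)) {
        for (const dep of deps) {
          if (forbidden.test(dep)) {
            violations.push(`${pkg.name} -> ${dep}: UI must go through the service layer, not the db layer`);
          }
        }
      }
    }
  }
  assert.deepEqual(violations, []);
});
```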
>> And part of this you can get in more detail in your article. So, some background: you wrote a great piece called Harness Engineering. There's a whole section in
there on how you thought about skills,
thousands of skills versus simplifying
it to just quite a few. But okay, uh,
continuing on, how do you stop yourself
from over-engineering harnesses? And a
little bit of a similar followup is, do
you often build small tools for
yourself, if ever? Uh, do you do you
build custom tools?
>> Yeah. So, I think this is kind of
gesturing in the direction of the bitter
lesson here, right? Which is: how do I make sure the work that I do isn't completely obsoleted by an increase in model capability? And the way I have thought about that is doing sort of the bare minimum amount of context management to kind of pull in requirements for the agent to do an acceptable job over the course of its work. And context is a thing that I don't think will ever be obsoleted, right? Like, the models must be told the requirements of the task, which guardrails to pay attention to, these sorts of things. So a good harness
is really operationalized around giving
the model text at the right time so it
can look at the work it has done and the
information around what a good job looks
like and you know fundamentally the
models are trained to follow
instructions. All the harness should do
is surface instructions to the model at
the right time. So we do want to
minimize that too, right? You don't want
to frontload all those instructions
because then you kind of like overwhelm
the agent, but all of these sorts of requirements around what a good job looks like do need to be paid attention to over the
entire course of a PR, right? So
figuring out ways to either defer or
just in time surface those instructions
is kind of what uh a good harness should
do, right?
If you know that uh you want your React
components, right, to be decomposed so
that they make good snapshot tests for
individual more stateless pieces, right?
You don't need to load that up front.
Instead, you should kind of let the
agent cook and prototype and experiment
with the UI you want to build and then
at lint or test time say, "Okay, you've
done the work. In order to finish it,
you have to break this apart so that
your components are small and as
stateless as possible and have local
dependencies on hooks instead of prop
drilling, or whatever it is that you want the code to look like." And then the agent will say, "Oh, this is a new instruction for me. Let me take the patch as written, modify it to make sure that it adheres to the instructions." And
then up it goes to GitHub. And this sort
of thing is not going to be obsoleted by
increases in model capability. It's
really just about getting that right
text, that right context to the agent at
the right time.
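A sketch of how those deferred guardrails might get wired in per package with an ESLint flat config; the shared plugin, the custom rule names, and the thresholds are hypothetical:

```ts
// eslint.config.mjs (one per package, generated to be identical everywhere)
// The rule names under "acme/" and the plugin package are invented for illustration.
import acme from "@acme/eslint-plugin";

export default [
  {
    files: ["src/**/*.{ts,tsx}"],
    plugins: { acme },
    rules: {
      // Built-in rule: nudges components toward small, decomposable pieces.
      "max-lines-per-function": ["error", { max: 80 }],
      // Custom rules whose messages carry the just-in-time instructions.
      "acme/no-bare-fetch": "error",
      "acme/no-prop-drilling": "warn",
    },
  },
];
```

Because every package carries the same config, the instruction surfaces at lint time wherever the agent happens to be working, rather than being frontloaded into the prompt.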
>> Can we talk about an example of a good
harness? So, a lot of people are asking about the Codex model, the Codex harness. How does that compare to other harnesses? So, Claude Code, OpenCode. Uh, how do you guys take these decisions into play? You don't work directly on Codex, but if there's stuff you can speak to about the Codex harness, what do you guys see as you architect it out?
>> Yeah. So one thing that I think is super
powerful is this notion that the labs
are not just post-training the models
but post-training the models in the
context of the harness in which they are primarily deployed, right? Like, the apply-patch tool or the specific quoting semantics of how to invoke the bash tool are in the loop for the post-training process for the harnesses from the labs, which means there is leverage to be had by depending on these sort of first-party harnesses directly. At least this is what I believe. Uh, and as such, kind of being
able to direct through them via things
like the SDK or manipulating the Codex
app server directly means you kind of
get to ride the wave of all that
leverage in post- training. Instead,
focus on the parts that you care about,
which is like what correct code looks
like. Um, I kind of have high confidence
that things like Claude Code and Codex will continue to get better. Uh, that is the responsibility of the teams
working on these coding agents. So in my role, where I don't really want to focus on the coding harness at all, the job is figuring out ways to plug into them in ways that kind of steer the agent. That
means my job can sort of like move up to
thinking about differences in model
behavior between releases rather than
deeply understanding the nuts and bolts
of the harness. Instead, I can think
about what it means to, you know, drive the behavior that I want based on the observed behavior rather than the inner mechanics of the thing.
>> It's a
perfect follow-up to the next question,
which is, uh, do you have any
recommendations for collaboration
platform? So, when you're in the
software development life cycle, is
there any platform that you use for
agents, engineers, developers all to
collaborate on working on anything? Any
tips, any tools?
Yeah. So
in this world it has largely been just
markdown files in the repository and
GitHub that have been the primary sort
of hub and spoke sort of thing. If you
think about collaborating on a document
like you open Google Docs, you write
something, you ask for feedback, people
comment, you apply suggestions, these
sorts of things. This is kind of like a
little clean room environment just for
this work artifact that you're
producing. like a PR kind of has a
similar purpose. So we kind of treat
that as a big hub and spoke broadcast
domain where all of the agents and
humans collaborate together. Uh and
because we optimize for throughput, we
don't block on any sort of like
contribution to that like folks can
either review or not. Agents can either
review or not. The implementation agent
can acknowledge, defer or reject any
feedback that it gets. uh really
allowing each participant in the
production of diffs to kind of make
their own judgments around what it means
to deliver, receive, respond to
feedback. Uh and this has a nice
property of like not putting the model
in a box in a bunch of places. We want
them to use their good reasoning sort of
thing. So being super prescriptive
around like every bit of feedback must
be addressed can kind of have this like
catastrophic failure mode of your coding
agent being bullied by all of the
reviewers when really we want to bias
toward code being accepted, not perfect,
not drowning in minutia and these sorts
of things.
>> How should people get started with using
coding agents? People that have been doing a lot of manually written code, how do they start to transition? What should they offload? How do they kind of get over that barrier of, okay, I'm still checking every PR, I'm copy-pasting from Codex? How should the average engineer
start to use these tools?
>> I think there's two ways to approach
this problem. One is to
start using the coding agents to improve
your confidence in the code itself as it
is written today. Right? I think we
would all agree that like more tests is
probably a good thing, right? to assert
that our programs are well specified and
behave correctly as our users interact
with them is a good thing. Uh and the
agents are super good at looking at the
existing code with some context around
how it is meant to be used and writing
tests that assert that behavior. So kind
of using this to improve your confidence
in the quality of the code will also
increase the agents ability to
successfully navigate it which means you
don't have to worry as much around doing
super detailed review of the agent
output. The other way to think about
this is to look at how you are spending
your time. Is it you know staring at
your editor writing code? Is it waiting
for tests to run? Is it waiting for
human review feedback? Is CI slow and you're waiting on that? Maybe you have a ton of flaky tests. Use the agents to incrementally automate the parts where you are spending your time, because ultimately the high-leverage parts of our jobs are to define the work that must be done, prioritize and schedule that work, and then effectively empower folks on our team to do that work. Uh, and the more we can delegate and move into sort of this sequencing and orchestration role, even if you just think about it like managing your teams, right, the more parallel and the deeper individual executions of those delegations we're able to do. Right? If I put primitives in place that make it super easy to spin up ways to respond to events on my Kafka queue, right, I don't really need to be in the weeds with every engineer making sure they implement a consumer correctly. Right? And these same sort of building-block-style techniques apply really well to the agents, and they stack really well too.
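To make "primitives in place" concrete, here is a hedged sketch of the kind of shared building block he is gesturing at, using kafkajs; the module name, environment variables, and defaults are invented:

```ts
// @acme/events/consume.ts -- hypothetical shared building block
// The one blessed way to respond to events on the Kafka queue.
import { Kafka, type EachMessagePayload } from "kafkajs";

const serviceName = process.env.SERVICE_NAME ?? "app";

const kafka = new Kafka({
  clientId: serviceName,
  brokers: (process.env.KAFKA_BROKERS ?? "localhost:9092").split(","),
});

export async function onEvent(
  topic: string,
  handler: (payload: EachMessagePayload) => Promise<void>,
): Promise<void> {
  const consumer = kafka.consumer({ groupId: `${serviceName}-${topic}` });
  await consumer.connect();
  await consumer.subscribe({ topic, fromBeginning: false });
  // Retries, dead-lettering, and instrumentation get added here once,
  // so every engineer and every agent inherits them for free.
  await consumer.run({ eachMessage: handler });
}
```

A feature team, or an agent, then just calls onEvent("some-topic", handler) and picks up whatever reliability behavior gets encoded in this one place.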
>> A fun one. How do you work with agents
in your car? [laughter]
>> Um so I have not used the new uh voice
mode that launched in CarPlay uh
recently. Uh not ready for that. But uh
usually what I'll do is kick off uh a
task uh right before I leave the office.
uh tether my laptop to my phone, buckle
it into the back seat, and kind of let
it cook in the 30 minutes it takes me to
get home. Uh most of the time with the
skills we invoke that tell the agent,
you know, you're operating on a task,
you go until the tests are green. Uh you
know, I don't have to reach back there
and poke yes, continue onto the thing.
Uh and I'm basically able to more fully
saturate, you know, my day with token
consumption. Um, the dream here is that
I actually have 50 agents running 24/7
and I don't have to interact with them
at all. Uh, and the way to do that is to
define the work well, figure out ways
for it to automatically be scheduled and
remove myself from having to click the
button. Right? Every time I have to type
continue to the agent is like a failure
of the harness to provide enough context
around what it means to continue to
completion.
>> Wow, good statement at the end.
[laughter] every time you have to
interact with the agent is a failure.
Okay, so the following question kind of
scales this out, right? As your org
knowledge map scales, what practical
steps do you have to like enable
progressive disclosure? So as you have a
larger and larger codebase, as you have
more people, how do you scale your
agents to work better with this?
>> Yeah. So
when I sort of initially started this
project that I was working on, blank
repository, create Electron app, right? You know, Vite, single package, all this sort of stuff. And eventually ended up
with a mess, right? Because there's no
package privacy that allows me to
enforce invariants around what APIs are
public versus which ones are not. The
agent didn't have like concrete hooks in
the file system to determine which
domains were separate from the other
ones. So we ended up going full 10,000-engineer-organization heavy on the architecture: 750 packages in the PNPM workspace, isolated by business-logic domain or layer of the stack, individual small util packages that encapsulate reusable functionality, that we lint on being used, that we can encode leverage in. And I do think that in this world, even if you don't actually have microservices, structuring your repositories in ways that let you scope the directory subtree you are looking in, to be able to do most of the change, helps. Uh, and you know, code in the file system is also text, which means it's effectively prompts that you're giving to your coding agent. Uh, so making the
code as much the same as possible kind
of makes it so that regardless of where
in the repository your agent is looking,
it develops a ton of transferable
context, right? Like you should have one
way to like do a bounded concurrency
helper. You should have one way to
construct a observable and instrumented
side effectful command. You should have
one OM, right? Like you should have one
programming language. You should have
one way of writing CI scripts. you
should have one way of adding additional
lint rules, these sorts of things
because it means that like the tokens
that you want the model to produce are
easier to predict and more consistently
predicted regardless of where it looks.
Um, so I would say figure out ways to
structure the code so it is local to a
subree in the repository for most of the
ways you would interact with that system
and then figure out a way to use these
agents to completely migrate the
codebase to be the same. you know,
empower someone on your team to be a
dictator to say this is the way it must
be done, right? Or, you know, figure
that out together
and, you know, write it down, evolve the
code so that it reflects that reality,
these sorts of things.
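Since "one way to do a bounded concurrency helper" is the canonical example of a utility to deduplicate, here is a plain sketch of what that single shared helper could look like:

```ts
// @acme/async/map-with-concurrency.ts -- hypothetical canonical bounded-concurrency helper
export async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // `limit` workers each pull the next index until the input is exhausted.
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  });

  await Promise.all(workers);
  return results;
}
```

With exactly one of these in the workspace (and a lint nudging people toward it), the tokens the model needs to produce around concurrent work look the same everywhere.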
>> We've got a few questions on code
review. How do you approach code review
now that you have such high velocity?
Uh, do you just not read the code? Do
you just trust the test coverage? Uh, how do you write good tests? How do you offload that stigma of, like, you know, you have a mental blocker: I need to manually check everything before I merge a PR.
>> So that same sort of idea where you have
to look at where you're spending your
time and figure out ways to spend less
of it. Uh, you know, when we started,
right, the first thing to do was figure
out how to get the agent reliably
producing code that we would accept. And
a big challenge we ran into is with each
engineer producing three to five PRs per
day, even on a team of three, merge
conflicts were super miserable, right?
Because these PRs tended to be pretty
big. We were working on the same parts
of the codebase. So that's where we
moved in two directions. One was to like
tree out the code a bit more to minimize
these merge conflicts, but also minimize
the amount of time PRs were open so that
we were uh reducing the likelihood of a
merge conflict actually occurring. And
the reason PRs were staying open so long
was because we needed code review uh
because humans were being the blocker in
this scenario. So in order to do
that piece automatically, I essentially
asked every engineer on the team to take
one day a week, Fridays, we called it
garbage collection day, where our entire
job was to take every bit of slop we had
observed over the course of the week
that was making a PR difficult to merge
and figure out ways to categorically
eliminate it from ever happening in the
first place, which is where we kind of
started closing this loop: the feedback that humans were giving on the PR indicates some context failure on behalf of the agent, so we get that into the repository and then figure out ways to automatically prompt-inject the agent so that it would self-heal when it produced this bad behavior. And this is kind of how you go from synchronous human time spent giving feedback as code review comments, to documentation in the repository, to automatically surfacing this
documentation either via a failing test
or a reviewer agent who is primed to
review the code as written in the
context of these docs. But all of that
happens by putting those docs in a
single place that all these processes
are able to attach to. Um, you know, we
kind of asked folks to basically bucket
the types of review feedback they were
giving into like um like the persona
they were operating as like front-end
architect, you know, reliability
engineer, scalability sort of thing. And
then basically for each of those
personas, we spun up a review agent that
gets triggered on every push that says,
is this code good? Surface any P2s or above that would block this PR from merging, based on this documentation that says what good looks like. Uh and
with that and just continuously
appending to these files, we started to
see slop reduce reduce reduce.
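A stripped-down sketch of what one of those per-persona review agents could look like as a push-triggered script; the persona doc paths, the P2-or-above convention, and the model name are placeholders, and the real setup presumably posts comments back to the PR rather than just failing CI:

```ts
// scripts/persona-review.ts
// Run every persona's review prompt over the diff; exit nonzero if any P2+ issue surfaces.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";
import OpenAI from "openai";

// Hypothetical persona docs maintained by the engineers who give that kind of feedback.
const PERSONAS = [
  "docs/personas/frontend-architect.md",
  "docs/personas/reliability-engineer.md",
  "docs/personas/product-quality.md",
];

async function main() {
  const diff = execSync("git diff origin/main...HEAD", { encoding: "utf8" });
  const client = new OpenAI();
  let blocking = false;

  for (const personaPath of PERSONAS) {
    const persona = readFileSync(personaPath, "utf8");
    const res = await client.chat.completions.create({
      model: "gpt-5.2", // placeholder
      messages: [
        {
          role: "system",
          content: `${persona}\n\nReview the diff. Report only P2-or-above issues, or reply NONE.`,
        },
        { role: "user", content: diff },
      ],
    });
    const findings = res.choices[0]?.message?.content?.trim() ?? "NONE";
    if (findings !== "NONE") {
      blocking = true;
      console.error(`\n[${personaPath}]\n${findings}`);
    }
  }

  process.exit(blocking ? 1 : 0);
}

main();
```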
>> People have questions about your billion
tokens. Where do you think those are
split up? So how much of it is on code
review? Where where is the majority of
that usage coming from? And a followup
for people that are just getting
started. Say they've jumped in and done a $200 pro plan, right? If you had to cut your usage by a fifth, how should people maximize their usage? Right? You run into usage limits. Um, you know, you don't want to just copy-paste a million lines of code every six hours with no prompt cache hit. But how should we think about that?
>> Yeah. So I would say it's probably a third, a third, a third between planning, ticket curation, and documentation; implementation; and stuff that runs in CI.
>> Do you use plan mode?
>> Uh, I've used exec plans, which was kind of like an early version of this that we published, which is sort of like a proto-skill that says this is how you should structure a plan with milestones and acceptance criteria. Um, I haven't really used plan mode as part of the harness at all. My sort of
expectation here is that I should be
able to drop a ticket in and have it do
the job anyway without diverting through
a plan. Uh because most of the time I'm
never going to read it anyway.
[laughter]
Uh so I find that if you do use a plan
and you approve it without reading it at
all, you're actually encoding a bunch of
instructions that you don't necessarily
want followed. Uh so if you are going to
use plans, my recommendation is to push
those up as single PRs with just the
plan, where you actually have a human review every line of it and block on human approval before they get merged and then kicked off. Uh, because you're effectively potentially wasting your time on a rollout with instructions that are bad. Uh so you want to
kind of like minimize the time that
happens. But I do think that uh kind of
getting tokens to be spent in CI is a
necessary part here because writing code
no longer is the hard part. Like getting
code accepted and advancing the code and
product forward is like what it takes to
make that written code be valuable. And, you know, you have all heard the aphorism that senior engineers give good code reviews; we expect our senior engineer agents to do the same.
>> [laughter]
>> Uh someone asked is code a disposable
build artifact?
>> Yes.
>> Yes.
>> Uh I think we touch on this with uh symfony, which is sort of this agent orchestrator that we released. This idea that, you know, we can publish a library
that's actually a super well-defined
spec that the code is a compiled
artifact of. And I think like using LLM
as fuzzy compiler is like an interesting
mental model to have, right? Like all of
the context that we're putting in the
codebase for harness engineering is
effectively like constraints and
optimization passes on which code is
acceptable to build in the first place.
Uh and this is pretty similar to like
the static analysis and optimization
passes that something like LLVM would do
in the process of compiling Rust code.
Uh, and sort of swapping out one model for another is sort of like changing your code generation backend from, you know, LLVM to Cranelift in the Rust compiler, and you would expect that all of the rules around what acceptable Rust code looks like produce valid, sound machine code out the back, even if the generation process is different and you end up with different x86 instructions. So same sort
of mindset for LLMs swapping out
different models sort of thing. We want
the structure around the code to
basically limit
how it is written to things that would
be acceptable to us.
>> At a high level, can you give us a
picture of what future you're building
for? Does context still matter? How do
people do engineering, harness
engineering, context engineering? What
does the future look like?
>> Sort of the future that I want to build toward here is where I'm able to take a token budget and a quarter, a half, or a year's worth of work, take the human input to rank what is most important, success metrics, reliability metrics, give it to the machines, and have them continually work and advance my product forward, uh, without, you know, my hands explicitly on the wheel at all.
As we have gone through very early prototyping to internal alpha, internal beta, external alpha, I kind of have felt that new parts of the software engineering process have started from zero and we've had to build up capability, kind of like those pentagonal personality charts, right, where I spike in this direction and maybe I'm weak over here. And, you know, when we got to deployed software for the first time, right? The agents' ability to do QA smoke
testing on our built artifacts before
they're promoted to distribution was
weak. We hadn't invested any time in
this. There were no docs. There were no
tools that the agents could use to like
download the built artifact, launch it,
poke around to make sure that our like
most critical user journeys were well
validated and tested. So, because I
don't want to be touching the computer,
we needed to figure out like ways for
the agents to build themselves tools to
do that part. Uh, and
there's a whole universe of software
engineering outside of writing code,
right? Like I am triaging user feedback.
I'm triaging pages. I am making sure
that we don't have any PII leaking in
the logs in production. I'm making sure
that the Twitter vibes are good and people are enjoying my software, that our user operations staff are supported with well-written runbooks that allow them to triage and mitigate high-volume user issues, and then moving that into the code itself so those issues don't happen in the first place. And as I no longer have to produce code, my mind can shift to these other, higher-level or more squishy activities. But the agents are good enough to do these things too, and figuring out how to write down the processes and the acceptance criteria becomes the sort of meta-programming part of the job of using these agents.
>> That's a great way to end it. What an
exciting future. Give it up for Ryan
guys.
>> Thank you folks. [applause]
>> [music]