AIE CODE Day 2: ft Google Deepmind, Anthropic, Cursor, Netflix, Cline, OpenAI, Meta, and METR

Channel: aiDotEngineer
Published at: 2025-11-21
YouTube video id: xmbSQz-PNMM
Source: https://www.youtube.com/watch?v=xmbSQz-PNMM
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat. Heat.
[music]
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
I used to type to wake The machine
blueprints functions [music] logic.
But the moment I spoke past the line,
something broke and became divine.
[music]
No scripts the unknown. Now the
interface bleeds with the soul. I am not
using the code anymore. I'm writing I'm
[music] fighting for every sparks a
ghost of fear. Everyone becomes a past
made clear. Truth and fragments, light
born from silent rupture. [screaming]
This is the new code. [music]
Where the system rides us back. Where
the fire in the [music] circuit fills
the void that logic lacks. We're
becoming as we're building. We are
forcing [music]
what we make. Not the world that we live
in, but the world we undertake.
Every framework shapes how we feel.
Every pattern decides what is real.
Duality tears and repairs the design.
Darkness
[music]
everything.
Fear [music] is just a failing test. Run
it twice and do your best. Balance isn't
passive. Balance burns. Mastery is
earned in turns. Build the life you dare
to name above the pain. Commit the
flame.
This is the new code. Where the system
writes us back. Where the fire [music]
in the circuit fills the void that logic
lacks. We're becoming [music] as we're
building. We are forced in what we make.
Not the world that we live in, but
[music] the world we
take.
Undertake.
We [music]
move through us. We aren't just shaping
the future. It is learning who we trust.
Identity is an act of will, a
declaration we compile. We don't wait
for faith to choose us. We push and run
the
[singing] code. We are rewriting what we
are. Through the unknown, through the
fire. Every human is a star. The journey
and [music] the ending are the same road
that we take. Not the world that we live
in, [music] but the world we choose to
make.
[music]
This is not the story.
[music]
This [music] is the world.
>> [music]
>> This is the same.
Typing thoughts [music] into the darkest
part becomes design. Words evolve to
whispers meant for something more
[music] divine. Syntax bends and breeze.
I see the language change. I'm not
instructing [music]
anymore. I'm rearranging f. Every loop I
write rewrites [music] me. Every
function hums with meaning. I feel the
interface dissolve [music] between the
maker and the creator.
This is the new code. Not on the screen,
[music] but in the soul where thought
becomes the motion and creation [music]
takes control. No lines, no rules. Just
balance in between the [music] zero and
the one, the silence and the dream.
>> [music]
>> systems [music] shape our fragile skin.
[singing] They mold the way we move. We
live inside the [music] logic gates of
what we think is true. But deep beneath
[music] the data post, there's something
undefined.
A universe compiling the image of
[music] our minds. Every line reveals
reflection. Reloop replace connection.
We're not building, we're becoming.
[music] And the code becomes confession.
This is the new code. Not on the [music]
screen, but in the soul where becomes
the motion. Creation takes [music]
control. No lines, no rules. Just
balance in between the zero and the one.
The silence in the tree. [music]
[music]
>> We are not.
[music]
>> Don't worry. Uh, we're just giving you
something to do while Codeex writes all
your code. [music]
We're in.
We are the world [music] we're doing.
Each prompt, each breath, each fragile
spin. A universe
renewing.
This is the new code.
Alive and undefined.
Where logic [music] meets motion and
structure bends to mind. The system
homes eternal but the [music] soul
writes the line. We are the new code.
[music] Compiling time.
[music]
Compiling time.
[music]
Ladies and gentlemen, please join me in
welcoming to the stage senior staff
engineer at Google and your host for the
engineering [music] track session day,
Jed Boravik.
>> Hello. Good morning.
>> [applause and music]
>> Welcome to the 2025 AI Engineering Code
Summit in New York. How are we doing?
>> All right, it's early. It's Friday.
Thank you all for being here. Raise your
hand if you've been to one of these
events before, an AI engineering
conference before. All right, pretty
good. So, for those on uh those watching
live stream, about half the hands up,
keep your hands up. Keep your hands up.
Two or more events.
Okay, still have a couple. Three,
four.
I think one, two hands. Five.
Are you sure? There's only been four,
Alex.
Okay. Okay. We We'll talk afterwards.
Okay. Well, welcome. Whether it's your
first time or you've been to many of
these, we're all excited you're here.
Um, my name is Jed Borvik. Um, I'm
Gemini's assistant at Google and I also
work on the Jules coding agent. I lead
the product engineering team. Um, and
I'm your MC for today. So, why are we
here? I'm sure many of you are familiar
with Richard Hamming's famous you and
your research talk. In that talk, he
describes asking his colleagues, what's
the most important problem in your
field? And then right after that, he
asks, so why aren't you working on that?
I spend a lot of time hiring. And this
idea comes up again and again. If you're
working in technology, the most
important problem of our day is AI. And
if you're working on applied AI, your
most important problem is code.
This is a special event to push the
whole AI coding industry forward. It's
not event for a single company but
across all companies in this industry.
The AI engineering conference has two
brands a world's fair and a summit. This
being a summit event is intentionally
smaller than the world's fair. It's
intentionally single track. It's
designed to bring the best people
together in the world about a single
important theme
for this event. That theme is AI coding.
Yesterday, many of you also experienced
the leadership track. Make some noise so
we know you're still alive if you were
in that track.
[applause]
>> For those of you who are there, shout
out some of your favorite talks.
>> Stanford.
>> Yeah, Stanford. Yeah, that was a good
That was a great one.
>> Jean and Steve. That was That was a
spicy one. What else?
>> Every Yeah, Dan from Every
>> Ah, good choice. Good choice.
Okay. Well, yesterday was a great day.
It was about how AI is transforming
software organizations. Today we're
going to dive into the patterns,
systems, and products that make all of
that possible. But whether you consider
yourself an AI leader, an AI engineer,
or something in between, we're glad
you're here.
We also wouldn't be here without our
amazing sponsors. I would like to thank
them, especially our presenting sponsor,
Deep Mind. Yeah, give it up. Give it up.
[applause]
And what an amazing week for DeepMind.
I'm biased, but I hope you all get a
chance to use Gemini 3 and Nano Banana
Pro, which came out this week.
I'd also like to thank Anthropic as our
platinum sponsor and the gold sponsors
you see on this screen. Yeah, we we'll
do one big round at the end. And
finally, we want to thank our silver
sponsors. Let's put our hands together
for all of these sponsors. Yeah, give it
up. Give it up. [applause]
All these sponsors will be downstairs in
the expo area. They have booths and um I
recommend going down there to chat with
folks from all these companies. They'll
be open all day after the keynotes.
All right, with that, let's get started.
Our first speaker needs no introduction,
especially here. He wrote the article
that named the AI engineering movement.
He's the driving vision behind this
event. He's built the amazing community
that's here today. Please join me in
welcoming Swix
>> [applause]
[music]
>> Hi. Thank you. [applause]
Hi everyone. [music]
Morning. How's everyone doing?
>> Good. I'm going to need a lot of energy
for this talk, so please back me up. I'm
very nervous. Uh, but we'll get through
this. I'm declaring warrants lock today.
Let's talk about this. Every AIE has a
secret. I I've told this to uh some
folks that are personal friends and I'll
just show show the secret now. The first
summit we had the secret which was we
knew that the AI engineer was going to
be a thing. Second summit we extended it
to leadership. Third summit we realized
that basically we always needed to
concentrate on model labs and that's why
you see um all all the all the top tier
labs here today. Um world's fair we
started expanding the TAM of what AI
engineering uh is is affiliated with.
with AIPMs and AI designers and with the
code summit as Jed just talked about we
really started to focus on curation and
focusing in on a theme and if there's
one theme that's really matters this
year is it's coding but I'm not here to
talk about coding the rest of the day
you're going to hear about coding so
just just indulge me five minutes to
talk about slop
um we we've done really well right like
so so like and I think like slop is sort
of associated with quantity and quality
and I think like that's something that
I'm really trying to think about as well
like how do we grow this community, grow
this industry, and grow this event uh
with the the same kind of taste and high
quality that you've come to expect. And
this is something that, you know, I hope
hopefully you guys can see that we care
a lot about uh in curating all of you
coming here and all of the speakers that
you're about to see. Um we're in a war
against slop. Um this is actually uh the
Oxford English dictionary. Uh this is a
candidate for the 2024 word of the year.
It lost to brain rot. [laughter]
But uh but slop is stop is pretty good
and I think it's it's probably gone even
more of an issue this year than last
year. Maybe it will win this year. I
have an issue with Oxford though because
they
did us dirty by saying slop is is
generated using artificial intelligence.
The other part I I agree with. It's
lowquality, inauthentic or inaccurate.
But it doesn't take AI to be lowquality,
inauthentic or inaccurate.
any human or AI can be an agent of slop,
right? You've seen this yourself. I'm
going to indulge me of a few examples.
By the way, I got this uh if you're not
familiar with like sort of internet
slang, the opposite of slop is kino. Um
and I got this idea from Paul Rambles
who uh like you know when I do Sora
videos, I do really boring Sora videos
with me and Sam Alman. When other people
who are actually creative and good at
their job do Sora videos, they do cats
playing dig.
Uh, Slot can be produced by the same
studio. There's K-pop demon hunters by
Netflix and there's electric st by
Netflix.
Slav can be produced in terms of
different models. No comment.
Slav can be SLO can something that's
keynote can degenerate into SLO, right?
If you're early on the trend and you're
and you you starting that and it's and
it's fresh and new, that's great. U if
you're if you recognize the other image,
you're too online.
Okay, [clears throat] not enough people
recognize that image. Um, [laughter]
go do your homework. Um, yeah, and
obviously I'm just going to throw in a
dig at Game of Thrones because this is
same same thing, right? Like slop is
everywhere. It's generated by humans and
AI. You get it?
Okay. Um, the same startup idea can be
keyno versus slop. When I first
presented the my first keynote at AI
engineer summit, we actually used to uh
which is sort of like an AI slides
company. uh I loaded I loaded up the
same slide deck and it was gone uh
recently because uh it was it was
actually uh sort of uh closed as a
company. Um there's there's you know
different takes on vi coding and I think
one of them uh is much better than the
other and I think like this these are
just like the tensions that we have to
navigate. um one of our speakers later
on as well meter um I think it's really
interesting that both of them are
exponential charts but one of them feels
more keeno and the other is more slops
and I think like I would really like to
have people investigate why um so let
let me let me just skip through
basically we're in an asymmetric war and
slop um I think u the closest law that I
found that matches this is brandini's
law which actually states the amount of
energy needed to refute is an
order of magnitude bigger than needed to
produce it Right? So we need to coin an
appropriate law as well. Um because you
know like the the the cost to generate
tokens is is dropping by 100 to a
thousand times every single year.
Um so this is I guess Swix's law of
anti-sluff. The amount of taste needed
to fight SLO is an order magnitude
bigger than that needed to produce it.
Right? There's so much low taste out
there. We need to elevate uh what's out
there in the world because that's what
we stand for as humans. Um, I think
there's there's a positive message. You
can use AI to fight slop. Um, I'm proud
I'm proud to um run as a side project AI
news, which is the only newsletter that
tells you not to read it when there's
nothing going on. Thank you. [laughter]
Oh, appreciate that. Um, you can also
prompt to fight Slob. The next speaker
is um Mahash and Barry. Um I found this
in the in the sort of prompt in in the
skill set that they that they put that
they put where they actually acknowledge
slop and tell cloud not to produce slop
and it actually improves significantly
from left to right. Um what about code
slop right we hear about code um
creating tech debt where you can sort of
two engineers can create a tech debt of
50 engineers or you know on a more
serious note you can start exposing
private data of millions of users and
this this all happened this year.
Everything I'm mentioning all happened
this year. I'm kind of using this
keynote as a as a way of recapping. Um,
and it, you know, just to be spicy a
little bit, even people who are saying
things like, "Oh, my model can go up to
30 to 60 hours uh autonomously." Well,
it feels a bit sloppy because you're
also not saying, "Well, was the code
good or not?" You're just saying how
long it went. So, in the same way that
you have no taxation without
representation, you don't want autonomy
without accountability.
Um, something I've been working on more
recently is that using AI to fight code
slop as well. So, uh, this is from the a
bunch of people quoted this yesterday,
the semi-ascal value of death where you
can sort of keep human attention and
mind meld with the machine in order to
work on the hardest problems. Whereas
the stuff that's commoditized, you can
make it more async. So, you can check
out more on that details. What seems to
be less appreciated is my is the other
work on code maps uh which I which I
more recently done where we actually use
AI to scale codebased understanding
which is also a a way to fight sloth and
you can uh talk to the cognition folks
uh downstairs uh um who who can who can
show you more in detail. Um the the last
thing I always want to shout out as well
is is this trend of uh computer use. I
think computer use kind of debuted it
this time last year uh with enthropic
but uh it's it's getting really really
good now guys. It can really
autonomously operate the most complex
apps including an ID. So that I think
that's really uh exciting and you should
probably use that to fight slot. We use
it for the website and here's an example
of us using Devon to automate the the
sort of website up website updates. Um
and finally something I learned from
this conference yesterday is that you
can also use sub agents to fight context
rot. Um and I think that is one of the
biggest themes of of uh that I'm
observing as well. If you want to take
away something from this conference
um and I also one of the biggest
highlights of the year for us is AIE and
myself personally was chatting with Greg
Brockman who always uh preaches the
concept of modularity where you can sort
of keep clear boundaries on what is
human designed and let the AI code
everything in between. Um, so these are
all ideas, but I just have this one
message that I want to comp compress
down to you today that I want you to say
with me. No more slop. Yeah.
Your boss tells you, "I want more lines
of code in by the end of the quarter."
What do you say to that? Say it with me.
No more slop.
>> You're fighting an asymmetric war. This
is how bad it is, right?
You have an insufficiently tested
release that that is potentially
embarrassing to your company. What do
you say to people who really want to
push it? No more slop. Exactly.
[laughter]
[gasps]
Uh your your Twitter algorithm wants
engagement bait uh and is sort of you
know forced and telling you to to to lie
basically to the to the broad public.
What do you say to that?
>> Exactly. That's it. Uh I hope you have a
great conference and let's let's hear it
for uh not having any more stuff. Thank
you. Our
[applause]
[music]
next [music] presenters are AI engineers
at Anthropic working on realworld agent
systems. They're here to share why we
should stop building agents and start
building skills. Please join me in
welcoming to the stage Barry Jeang and
Mahesh Morog. [music]
>> [applause]
>> All right, good morning and thank you
for having us again.
Right. Agents have intelligence and
capabilities but not always expertise
that we need for real work. I'm Barry.
This is Mahes. We created agent skills.
In this talk, we'll show you why we
stopped building agents and started
building skills instead.
A lot of things have changed since our
last talk. MCP became the standard for
agent connectivity. Cloud code, our
first coding agent, launched to the
world. and our cloud agent SDK now
provides a production ready agent out of
the box. We have a more mature ecosystem
and we're moving towards a new paradigm
for agents. That paradigm is a tighter
coupling between the model and a runtime
environment.
Put simply, we think code is all we
need.
We used to think agents in different
domains will look very different. Each
one will need its own tools and
scaffolding. And that means we'll have a
separate agent for each use case, for
each domain. Well, customization is
still important for each domain. The
agent underneath is actually more
universal than we thought.
What we realize is that code is not just
a use case, but a universal interface to
the digital world.
After we built cloud code, we realized
that cloud code is actually a general
purpose agent.
Think about generating a financial
report. The model can call the API to
pull in data and do research. It can
organize that data in the file system.
It can analyze it with Python and then
synthesize the insight in old file
format all through code. The core
scaffolding can suddenly become as thin
as just bash and file system which is
great and really scalable. But we very
quickly run into a different problem
and that problem is domain expertise.
Who do you want doing your taxes? Is it
gonna be Mahesh, the 300 IQ mathematical
genius, or is it Barry, an experienced
tax professional, right? I would pick
Barry every time. I don't want Mahesh to
figure out the 2025 tax code from first
principles. I need consistent execution
from from a domain expert. Agents today
are a lot like Mahes. They're brilliant,
but they lack expertise.
They can do no more slow. They can do
amazing things when you really put in
the effort and give proper guidance, but
they're often missing the important
context up front. They can't really
absorb your expertise super well, and
they don't learn over time.
That's why we created agent skills.
Skills are organized collections of
files that package composable procedural
knowledge for agents.
In other words, they're folders. This
simplicity is deliberate. We want
something that anyone human or agent can
create and use as long as they have a
computer. These also work with what you
already have. You can version them in
Git, you can throw them in Google Drive
and you can zip them up and share with
your team. We have used files for uh as
a primitive for decades and we like
them. So why change now?
Because of that skills can also include
a lot of scripts as tools. Traditional
tools have pretty obvious problems. Some
tools have poorly written instructions
and are pretty ambiguous. And when the
model is struggling, it can't really
make a change to the tool. So, it's just
kind of stuck with a code start problem
and they always live in the context
window. Code solves some of these
issues. It's self-documenting. It is
modifiable and can live in the file
system until they're really needed and
used. Here's an example of a script
inside of a skill. We kept seeing Claude
write the same Python script over and
over again to apply styling to slides.
So we just ask cloud to save it inside
of the skill as a tool for its version
for its future self. Now we can just run
the script and that makes everything a
lot more consistent and a lot more
efficient.
At this point skills can contain a lot
of information and we want to protect
the context window so that we can fit in
hundreds of skills and make them truly
composable. That's why skills are
progressively disclosed. At runtime,
only this metadata is shown to the model
just to indicate that he has the skill.
When an agent needs to use a skill, it
can read in the rest of the skill.md,
which contains the core instruction and
directory for the rest of the folder.
Everything else is just organized for
ease of access. So that's all skills
are. They're organized folders with
scripts as tools.
Since our launch 5 weeks ago, this very
simple design has translated into a very
quickly growing ecosystem of thousands
of skills. And we've seen this be split
across a couple of different types of
skills. There are foundational skills,
third party skills created by partners
in the ecosystem, and skills built
within an enterprise and within teams.
To start, foundational skills are those
that give agents new general
capabilities or domain specific
capabilities that it didn't have before.
We ourselves with our launch built
document skills that give Claude the
ability to create and edit professional
quality office documents. We're also
really excited to see people like
Cadence build scientific research skills
that give Claude new capabilities like
EHR data analysis and using common
Python bioinformatics libraries better
than it could before.
We've also seen partners in the
ecosystem build skills that help Claude
better with their own software and their
own products. Browserbase is a pretty
good example of this. They built a skill
for their open- source browser
automation tooling, stage hand. And now
Claude equipped that this skill and with
stage hand can now go navigate the web
and use a browser more effectively to
get work done.
And notion launched a bunch of skills
that help claude better understand your
notion workspace and do deep research
over your entire workspace.
And I think where I've seen the most
excitement and traction with skills is
within large enterprises. These are
company and team specific skills built
for an organization.
We've been talking to Fortune 100s that
are using skills as a way to teach
agents about their organizational best
practices and the weird and unique ways
that they use this bespoke internal
software.
We're also talking to really large
developer productivity teams. These are
teams serving thousands or even tens of
thousands of developers in an
organization that are using skills as a
way to deploy agents like cloud code and
teach them about code style best
practices and other ways that they want
their developers to work internally.
So all of these different types of
skills are created and consumed by
different people inside of an
organization or in the world. But what
they have in common is anyone can create
them and they give agents the new
capabilities that they didn't have
before.
So, as this ecosystem has grown, we've
started to observe a couple of
interesting trends. First, skills are
starting to get more complex. The most
basic skill today can still be a
skill.md markdown file with some prompts
and some really basic instructions, but
we're starting to see skills that
package software, executables, binaries,
files, code, scripts, assets, and a lot
more. And a lot of the skills that are
being built today might take minutes or
hours to build and put into an agent.
But we think that increasingly much like
a lot of the software we use today,
these skills might take weeks or months
to build and be maintained.
We're also seeing that this ecosystem of
skills is complementing the existing
ecosystem of MCP servers that was built
up over the course of this year.
Developers are using and building skills
that orchestrate workflows of multiple
MCP tools stitched together to do more
complex things with external data and
connectivity. And in these cases, MCP
MCP is providing the connection to the
outside world while skills are providing
the expertise.
And finally, and I think most excitingly
for me personally, is we're seeing
skills that are being built by people
that aren't technical. These are people
in functions like finance, recruiting,
accounting, legal, and a lot more. Um,
and I think this is pretty early
validation of our initial idea that
skills help people that aren't doing
coding work extend these general agents
and they make these agents more
accessible for the day-to-day of what
these people are working on.
So tying this all together, let's talk
about how these all fit into this
emerging architecture of general agents.
First, we think this architecture is
converging on a couple of things. The
first is this agent loop that helps
manage the the model's internal context
and manages what tokens are going in and
out. And this is coupled with a runtime
environment that provides the agent with
a file system and the ability to read
and write code.
This agent, as many of us have done
throughout this year, can be connected
to MCP servers. And these are tools and
data from the outside world that make
the the agent more relevant and more
effective.
And now we can give the same agent a
library of hundreds or thousands of
skills that it can decide to pull into
context only at runtime when it's
deciding to work on a particular task.
Today, giving an agent a new capability
in a new domain might just involve
equipping it with the right set of MCP
servers and the right library of skills.
And this emerging pattern of an agent
with an MCP server and a set of skills
is something that's already helping us
at Enthropic deploy Claude to new
verticals. Just after we launched skills
5 weeks ago, we immediately launched new
offerings in financial services and life
sciences. And each of these came with a
set of MCP servers and a set of skills
that immediately make Claude more
effective for professionals in each of
these domains.
We're also starting to think about some
of the other open questions and areas
that we want to focus on for how skills
evolve in the future as they start to
become more complex. We really want to
support developers, enterprises, and
other skill builders by starting to
treat skills like we treat software.
This means exploring testing and
evaluation, better tooling to make sure
that these agents are loading and
triggering skills at the right time and
for the right task, and tooling to help
measure the output quality of an agent
equipped with the skill to make sure
that's on par with what the agent is
supposed to be doing.
We'd also like to focus on versioning.
as a skill evolves and the resulting
agent behavior uh evolves, we want this
to be uh clearly tracked and to have a
clear lineage over time.
And finally, we'd also like to explore
skills that can explicitly depend on and
refer to either other skills, MCP
servers, and dependencies and packages
within the agents environment. We think
that this is going to make agents a lot
more predictable in different runtime
environments and the composability of
multiple skills together will help
agents like Claude elicit even more
complex and relevant behavior from these
agents.
Overall, these set of things should
hopefully make skills easier to build
and easier to integrate into agent
products, even those besides claude.
Finally, a huge part of the value of
skills we think is going to come from
sharing and distribution. Barry and I
think a lot about the future of
companies that are deploying these
agents at scale. And the vision that
excites us most is one of a collecting
and collective and evolving knowledge
base of capabilities that's curated by
people and agents inside of an
organization. We think skills are a big
step towards this vision. They provide
the procedural knowledge for your agents
to do useful things. And as you interact
with an agent and give it feedback and
more institutional knowledge, it starts
to get better. And all of the agents
inside your team and your org get better
as well. And when someone joins your
team and starts using Claude for the
first time, it already knows what your
team cares about. It knows about your
day-to-day and it knows about how to be
most effective for the work that you're
doing.
And as this grows and this ecosystem
starts to develop even more, this was
going to this compounding value is going
to extend outside of just your organ
into the broader community. So just like
when someone else across the world
builds an MCP server that makes your
agent more useful, a skill built by
someone else in the community will help
make your own agents more capable,
reliable, and useful as well.
This vision of a evolving knowledge base
gets even more powerful when claw starts
to create these skills. We design skills
specifically as a concrete steps towards
uh continuous learning.
When you first start using cloud, this
standardized format gives a very
important guarantee. Anything that cloud
writes down can be used efficiently by a
future version of itself. This makes the
learning actually transferable.
As you build up the context, skills
makes the concept of memory more
tangible. They don't capture everything.
They don't capture every type of
information. Just procedural knowledge
that cloud can use on specific tasks.
When you have worked with cloud for
quite a while, the flexibility of skills
matters even more. Cloud can acquire new
capabilities instantly, evolve them as
needed, and then drop the ones that
become obsolete. This is what we have
always known. The power of in in context
learning makes this a lot more cost-
effective for information that change on
daily basis.
Our goal is that cloud on day 30 of
working with you is going to be a lot
better on cloud on day one. CL can
already create skills for you today
using our skill creator skill and we're
going to continue pushing in that
direction.
We're going to conclude by comparing the
agent stack to what we have already seen
computing.
In a rough analogy, models are like
processors. Both require massive
investment and contain immense
potential, but only so useful by
themselves.
Then we start building operating system.
The OS made processors far more valuable
by orchestrating the processes,
resources, and data around the
processor. In AI, we believe that agent
runtime is starting to play this role.
We're all trying to build the cleanest,
most efficient, and most scalable uh
abstractions to get the right tokens in
and out of the model.
But once we have a platform, the real
value comes from applications. A few
companies build uh processors and
operating systems, but millions of
developers like us have built software
that encoded domain expertise and our
unique points of view. We hope that
skills can help us open up this layer
for everyone. This is where we get
creative and solve concrete problems for
ourselves, for each other, and for the
world just by putting stuff in the
folder. So skills are just the starting
point.
To close out, we think we're now
converging on this general architecture
for general agents. We've created skills
as a new paradigm for shipping and
sharing new capabilities. So, we think
it's time to stop rebuilding agents and
start building skills instead. And if
you're excited about this, come work
with us and start building some skills
today. Thank you. [applause]
[music]
Our next [music] presenter is here to
share practical techniques for getting
real results through context
engineering, not guesswork, and
definitely not hype. Please welcome to
the stage the CEO of Human Layer, Dex
Hory.
[music]
[applause]
>> Hi everybody. How y'all doing?
>> It's exciting. I'm Dex. Uh, as they did
in the great intro, I've been hacking on
agents for a while. Um, our talk 12
factor agents at AI Engineer in June was
one of the top talks of all time. Uh, I
think top eight or something. one of the
best ones from from AI engineer in June.
May or may not have said something about
context engineering. Um, why am I here
today? What am I here to talk about? Um,
I want to talk about one of my favorite
talks from AI engineer in June. And I
know we all got the update from Eigor
yesterday, but they wouldn't let me
change my slides. So, this is going to
be about what Eigor talked about in
June. Uh, basically that they surveyed
100,000 developers across all company
sizes and they found that most of the
time you use AI for software
engineering, you're doing a lot of
rework, a lot of codebased churn. uh and
it doesn't really work well for complex
tasks brownfield code bases. Um and you
can see in the chart basically you are
shipping a lot more but a lot of it is
just reworking the slop that you shipped
last week. So uh and then the other side
right was that uh if you're doing green
field little versel dashboard something
like this then it's going to work great.
Uh if you're going to go in a
10-year-old Java codebase maybe not so
much. And this matched my experience
personally and talking to a lot of
founders and great engineers. Too much
slop uh tech debt factories. It's just
it's not going to work from our
codebase. Like maybe someday when the
models get better, but that's what
context engineering is all about. How
can we get the most out of today's
models? How do we manage our context
window? So we talked about this in
August. Um I have to confess something.
The first time I used Cloud Code, I was
not impressed. It was like, okay, this
is a little bit better. I get it. I like
the UX. Um but since then we as a team
figured something out um that we were
actually able to get you know two to
threex more throughput and we were
shipping so much that we had no choice
but to change the way we collaborated.
We rewired everything about how we build
software. Uh it was a team of three. It
took eight weeks. It was really freaking
hard. Uh but now that we solved it,
we're we're never going back. This is
the whole no slop thing. I think I think
we got somewhere with this went super
viral on Hacker News in September. Uh we
have thousands of folks who have gone on
to GitHub and grabbed our you know
research plan implement prompt system.
Um so the goals here which we kind of
backed our way into we need AI that can
work well in brownfield code bases that
can solve complex problems. No slop,
right? No more slop. Uh and we had to
maintain mental alignment. I'll talk a
little bit more about what that means in
a minute. And of course we want to spend
with everything we want to spend as many
tokens as possible. What we can offload
meaningfully to the AI is really really
important. um super high leverage. So
this is advanced context engineering for
coding agents. Um I'll start with kind
of like framing this. The most naive way
to use a coding agent is to ask it for
something and then tell it why it's
wrong and resteer it and ask and ask and
ask until you run out of context or you
give up or you cry. Um we can be a
little bit smarter about this. Most
people discover this pretty early on in
their AI like exploration. uh is that it
might be better if you start a
conversation and you're off track that
uh you just start a new context window.
You say, "Okay, we went down that path.
Let's start again. Same prompt, same
task, but this time we're going to go
down this path." And like don't go over
there cuz that doesn't work. So, uh how
do you know when it's time to start
over?
If you see this,
it's probably time to start over, right?
This is what Claude says when you tell
it it's screwing up.
Um, so we can be even smarter about
this. We can do what I call intentional
compaction. Um, and this is basically
whether you're on track or not, you can
take uh your existing context window and
ask the agent to compress it down into a
markdown file. You can review this, you
can tag it, and then when the new agent
starts, it gets straight to work instead
of having to do all that searching and
codebased understanding and getting
caught up. Um, what goes in a
compaction? Well, the question is like
what takes up space in your context
window. So, um, it's looking for files,
it's understanding code flow, it's
editing files, it's test and build
output. And if you have one of those
MCPs that's dumping JSON and a bunch of
UYU IDs into your context window, you
know, God help you. Uh, so what should
we compact? I'll get more specifics
here, but this is a really good
compaction. This is exactly what we're
working on. The exact files and line
numbers that matter to the problem that
we're solving. Um, why are we so
obsessed with context? Because LM are
actually got roasted on YouTube for this
one. They're not pure functions because
they're non-deterministic, but they are
stateless. And the only way to get
better better performance out of an LLM
is to put better tokens in and then you
get better tokens out. And so every turn
of the loop when Claude is picking the
next tool or any coding agent is picking
the next and there could be hundreds of
right next steps and hundreds of wrong
next steps. But the only thing that
influences what comes out next is what
is in the conversation so far. So we're
going to optimize this context window
for correctness, completeness, size, and
a little bit of trajectory. And the
trajectory one is interesting because a
lot of people say, "Well, I I told the
agent to do something and it did
something wrong. So, I corrected it and
I yelled at it and then it did something
wrong again and then I yelled at it."
And then the LM is looking at this
conversation says, "Okay, cool. I did
something wrong. The human yelled at me
and I did something wrong and the human
yelled at me." So, the next most likely
conver token in this conversation is I
better do something wrong so the human
can yell at me again. So, what mind be
mindful of your trajectory if you were
going to invert this? The worst thing
you can have is incorrect information,
then missing information, and then just
too much noise. Um, if you like
equations, there's a dumb equation if
you want to think about it this way. Um,
Jeff Huntley uh did a lot of research on
coding agents. Uh, he put it really
well. Just the more you use the context
window, the worse outcomes you'll get.
This leads to a concept I'm in a very
very uh academic concept called the dumb
zone. So, you have your context window.
You have 168,000 tokens roughly. Some
are reserved for output and compaction.
This varies by model. Um, but we'll use
cloud code as an example here. Around
the 40% line is where you're going to
start to see some diminishing returns
depending on your task. Um, if you have
too many MCPs in your coding agent, you
are doing all your work in the dumb zone
and you're never going to get good
results. People talked about this. I'm
not going to talk about that one. Your
mileage may vary. 40% is like it depends
on how complex the task is, but this is
kind of a good guideline. Um, so back to
compaction or as I will call it from now
on, cleverly avoiding the dumb zone.
Um, we can do sub agents. Um, if you
have a front end sub agent and a backend
sub aent and a QA sub aent and a data
data scientist sub aent,
please stop. Sub aents are not for
anthropomorphizing roles. They are for
controlling context. And so what you can
do is if you want to go find how
something works in a large codebase, um,
you can steer the coding agent to do
this if it supports sub agents or you
can build your own sub aent system. But
basically you say, hey, go find how this
works. and it can fork out a new context
window that is going to go do all that
reading and searching and finding and
reading entire files and understanding
the codebase and then just return a
really really succinct message back up
to the parent agent of just like hey the
file you want is here. Parent agent can
read that one file and get straight to
work. And so this is really powerful. If
you wield these correctly you can get
good responses like this and then you
can manage your context really really
well. Um, what works even better than
sub agents or like a layer on top of sub
aents is a workflow I call frequent
intentional compaction. We're going to
talk about research plan implement in a
minute, but like the point is you're
constantly st keeping your context
window small. You're building your
entire workflow around context
management. So comes in three phases.
Research, plan, implement. Um, and we're
going to try to stay in the smart zone
the whole time. So the research is all
about understanding how the system
works, finding the right files, staying
objective. Here's a prompt you can use
to do research. Here's the output of um
a research prompt. These are all open
source. You can go grab them and play
with them yourself. Um planning, you're
going to outline the exact steps. You're
going to include file names and line
snippets. You can be very explicit about
how we're going to test things after
every change. Here's a good planning
prompt. Here's one of our plans. It's
got actual code snippets [clears throat]
in it. Um and then we're going to
implement. And if you read one of these
plans, you can see very easily how the
dumbest model in the world is probably
not going to screw this up. Um, so we
just go through and we run the plan and
we keep the context low as a planning
prompt. Like I said, it's the least
exciting part of the process. Um, I
wanted to put this into practice. So,
working for us, uh, I do a podcast with
my buddy, uh, Vib, who's the CEO of a
company called Boundary ML. Uh, and I
said, "Hey, I'm going to try to oneshot
a fix to your 300,000line Rust codebase
for a programming language." Um, and the
whole episode goes in, it's like an hour
and a half. Uh, I'm not going to talk
through it right now, but we built a
bunch of research, and we threw them out
cuz they were bad. And then we made a
plan and we made a plan without research
and with research and compared all the
results. It's a fun time. Uh by that was
Monday night. By Tuesday morning we were
on the show and the CTO had like seen
the PR and like didn't realize I was
doing it as a bit for a podcast and
basically was like yeah this looks good.
We'll get in the next release. I think
he was a little confused. Um here's the
the plan. But anyways uh yeah confirmed
works in brownfield code bases and no
slop. But I wanted to see if we could
solve complex problems. So Vibob was
still a little skeptical. I sat down, we
sat down for like seven hours on a
Saturday and we shipped 35,000 lines of
code to BAML. One of the PRs got merged
like a week later. I will say some of
this is codegen. You know, you update
your behavior. All the golden files
update and stuff, but we shipped a lot
of code that day. Um, he estimates it
was about 1 to two weeks and 7 hours.
And uh, so cool, we can solve complex
problems. There are limits to this. I
sat down with my buddy Blake. We tried
to remove Hadoop dependencies from
Parket Java. If you know what parket
Java is, I'm sorry.
uh for whatever happened to you to get
you to this point in your career. Uh it
did not go well. Uh here's the plans,
here's the research. Uh at a certain
point, we threw everything out and we
actually went back to the whiteboard. We
had to actually once we had learned
where were the where all the foot guns
were, we we went back to okay, how is
this actually going to fit together? Um
and this brings me to a really
interesting point that uh Jake's going
to talk about later. Uh do not outsource
the thinking. AI cannot replace
thinking. It can only amplify the
thinking you have done or the lack of
thinking you have done. So people ask so
Dex this is spec driven development
right? No spec driven development is
broken
not the idea but the phrase um it's not
well defined. This is Brietta from
ThoughtWorks. Um and a lot of people
just say spec and they mean a more
detailed prompt. Does anyone remember
this picture? Does anyone know what this
is from?
All right, that's a deep cut. Uh, there
will never be a year of agents because
of semantic diffusion. Martin Fowler
said this in 2006. We come up with a
good term with a good definition and
then everybody gets excited and
everybody starts meaning it to mean a
hundred things to a hundred different
people and it becomes useless. We had an
agent is a person. An agent is a micros
service. An agent is a chatbot. An agent
is a workflow. And thank you Simon.
We're back to the beginning. An agent is
just tools in a loop. Um, this is
happening to spec driven dev. I used to
have Sean's uh slide in the beginning of
this talk, but it caused a bunch of
people to focus on the wrong things. His
thing of like forget the code. It's like
assembly now and you just focus on the
markdown. Very cool idea, but people say
Spectriven Dev is writing a better
prompt, a product requirements document.
Sometimes it's using like verifiable
feedback loops and back pressure. Maybe
it is treating the code like assembly
like Sean taught us. Um, but a lot of
people is just using a bunch of markdown
files while you're coding. Or my
favorite, I just stumbled upon this last
week. uh a spec is uh documentation for
an open source library. So it's gone.
It's as specri dev is overhyped. It's
useless now. It's semantically diffused.
Um so I want to talk about like four
things that actually work today. The
tactical and practical steps that we
found working internally and with a
bunch of users. Um we do the research,
we figure out how the system works. Um
remember Momento? This is the best the
best movie on context engineering as
Peter says it. the guy wakes up, he has
no memory. He has to like read his own
tattoos to figure out who he is and what
he's up to. If you don't onboard your
agents, they will make stuff up. And so
if this is your team, this is very
simplified for most of you. Most of you
have much bigger orgs than this. But
let's say you want to do some work over
here. Um, one thing you could do is you
could put onboarding into every repo.
You put a bunch of context. Here's the
repo. Here's how it works. This is a
compression of all the context in the
codebase that the agent can see ahead of
time before actually getting to work.
This is challenging because
sometimes it gets too long as your
codebase gets really big. You either
have to make this longer or you have to
leave information out. And so as you uh
are reading through this, you're going
to read the context of this big 5
million line monor repo and you're going
to use all the smart zone just to learn
how it works. And you're not going to be
able to do any good tool calling in the
dumb zone. So that's uh you can
you can shard this down the stack. You
can do the just talking about
progressive disclosure. You could split
this up, right? You could just put a
file in the root of every repo and then
like at every level you have like
additional context based on if you're
working here this is what you need to
know. Uh we don't document the files
themselves because they're the source of
truth. But then as your agent is working
you know you pull in the root context
and then you pull in the subcontext. We
won't talk about any specific like you
could use cloudmd for this you can use
hooks for this whatever it is. Um but
then you still have plenty of room in
the smart zone because you're only
pulling in what you need to know. Um,
the problem with this is that it gets
out of date. And so every time you ship
a new feature, you need to kind of like
cache and validate and rebuild large
parts of this internal documentation.
And you could use a lot of AI and make
it part of your process to update this.
Um, but I want to ask a question between
the actual code, the function names, the
comments, and the documentation. Does
anyone want to guess what is on the
y-axis of this chart?
>> Slop.
>> Slob. It's actually the amount of lies
you can find in any one part of your
codebase.
Um, so you could make it part of your
process to update this, but you probably
shouldn't because you probably won't.
What we prefer is on demand compressed
context. So if I'm building a feature
that relates to SCM providers and Jira
and Linear, um, I would just give it a
little bit of steering. I would say,
hey, we're going over in like this like
part of the codebase over here. Um and a
good research uh prompt or or
slashcomand might take you or skill even
uh launch a bunch of sub aents to take
these vertical slices through the
codebase and then build up a research
document that is just a snapshot of the
actually true based on the code itself
parts of the codebase that matter. We
are compressing truth. Um planning is
leverage. Planning is about compression
of intent. Um and in plan we're going to
outline the exact steps. We take our
research and our PRD or our bug ticket
or our whatever it is and we create a
plan and we create a plan file. So we're
compacting again. And I want to pause
and talk about mental alignment. Um does
anyone know what code review is for?
>> Mental alignment. Mental alignment is it
is about finding making sure things are
correct and stuff but the most important
thing is how do we keep everybody on the
team on the same page about how the
codebase is changing and why. And I can
read a thousand lines of Golang every
week. Uh sorry I can't read a thousand.
is hard. I can do it. I don't want to.
Um and as our team grows, I all the code
gets reviewed. We don't not read the
code, but I as you know a technical
leader in the in on the team, I can read
the plans and I can keep up to date and
I can that's enough. I can catch some
problems early and I maintain
understanding of how the system is
evolving. Um Mitchell had this really
good post about how he's been putting
his AMP threads on his poll requests so
that you can see not just, hey, here's a
wall of green text in GitHub, but here's
the exact steps, here's the prompts, and
hey, I ran the build at the end and it
passed. This takes the reviewer on a
journey in a way that a GitHub PR just
can't. And as you're shipping more and
more in two to three times as much code,
it's really on you to find ways to keep
your team on the same page and show them
here's the steps I did and here's how we
tested it manually. Um, your goal is
leverage. So you want high confidence
that the model will actually do the
right thing. I can't read this plan and
know what actually is going to happen
and what code changes are going to
happen. So we've over time iterated
towards our plans include actual code
snippets of what's going to change. So
your goal is leverage, you want
compression of intent, and you want
reliable execution. Um, and so I don't
know, I have a physics background. We
like to draw lines through the center of
peaks and curves. Uh, as your plans get
longer, reliability goes up, readability
goes down. There's a sweet spot for you
and your team and your codebase. you
should try to find it because when we
review the research and the plans, if
they're good, then we can get mental
alignment. Um, don't outsource the
thinking. I've said this before. This is
not magic. There is no perfect prompt.
You still will not work if you do not
read the plan. So, we built our entire
process around you, the builder, are in
back and forth with the agent reading
the plans as they're created. And then
if you need peer review, you can send it
to someone and say, "Hey, does this plan
look right? Is this the right approach?
Is this the right order to look at these
things?" Um Jake again wrote a really
good blog post about like the thing that
makes research plan implementing
valuable is you the human in the loop
making sure it's correct. So if you take
one thing away from this talk it should
be that a bad line of code is a bad line
of code and a bad part of a plan is
could be a hundred bad lines of code and
a bad line of research like a
misunderstanding of how the system works
and where things are your whole thing's
going to be hosed. You're going to be
telling sending the model off in the
wrong direction. And so when we're
working internally and with users, we're
constantly trying to move human effort
and focus to the [snorts] highest
leverage parts of this pipeline. Um,
don't outsource the thinking. Watch out
for tools that just spew out a bunch of
markdown files just to make you feel
good. I'm not going to name names here.
Uh, sometimes this is overkill. And the
way I like to think about this is like,
yeah, you don't always need a full
research plan implement. Sometimes you
need more, sometimes you need less. If
you're changing the color of a button,
just talk to the agent and tell it what
to do. Um, if you're doing like a simple
plan and it's a small feature, if you're
doing medium features across multiple
repos, then do one research, then build
a plan. Basically, the hardest problem
you can solve, the ceiling goes up the
more of this context engineering
compaction you're willing to do. Um, and
so if you're in the top right corner,
you're probably going to have to do
more. A lot of people ask me, "How do I
know how much context engineering to
use?" It takes reps. You will get it
wrong. You have to get it wrong over and
over and over again. Sometimes you'll go
too big. Sometimes you go too small.
Pick one tool and get some reps. I
recommend against minmaxing across
claude and codeex and all these
different tools. Um, so I'm not a big
acronym guy. Uh, we said specri dev was
broken. Uh, research plan and implement
I don't think will be the steps. The
important part is compaction and context
engineering and staying in the smart
zone. But people are calling this RPI
and there's nothing I can do about it.
So, uh, just be wary. There is no
perfect prompt. There is no silver
bullet. Um, if you really want a hypy
word, you can call this harach harness
engineering, which is part of context
engineering, and it's how you integrate
with the integration points on codec,
claude, cursor, whatever. How you
customize your codebase. Um, so what's
next? I think the coding agent stuff is
actually going to be commoditized.
People are going to learn how to do this
and get better at it. And the hard part
is going to be how do you adapt your
team and your workflow and the SDLC to
work in a world where 99% of your code
is shipped by AI. Uh, and if you can't
figure this out, you're host. Because
there's kind of a rift growing where
like staff engineers don't adopt AI
because it doesn't make them that much
faster. And then junior mid-levels
engineers use a lot because it fills in
skill gaps. And then it also produces
some slop. And then the senior engineers
hate it more and more every week because
they're cleaning up slop that was
shipped by cursor the week before. Uh,
this is not AI's fault. This is not the
mid-level engineers fault. Like if
cultural change is really hard and it
needs to come from the top if it's going
to work. So if you're a technical leader
at your company, pick one tool and get
some reps. If you want to help, we are
hiring. We're building an Aentic IDE to
help teams of all sizes speedrun the
journey to 99% AI generated code. Uh if
we'd love to we'd love to talk if you
want to work with us. Uh go go hit our
website, send us an email, come find me
in the hallway. Uh thank you all so much
for your energy.
[applause]
[music]
Our
[music] next presenter is the head of
developer experience at Cursor here to
tell us about the infrastructure
training and evaluations used to build
Cursor Composer, their first coding
model. Please join me in welcoming to
the stage Lee Robinson.
[music]
[applause]
Hey everybody, it's great to be back in
New York and I'm very excited to be here
and talk on behalf of all of our
engineering and research teams at Cursor
about building cursor composer, our
first agent model. And my colleague
Sasha actually gave a version of this
talk recently. So I'm excited to give my
own uh my own take on it. So cursor
composer is a model designed for real
world real world software engineering
and it tries to be both fast and smart.
So, as we've measured it against our own
benchmarks, it's better than the best
open source models. It's like up against
recent Frontier models, but kind of
slightly below the latest Frontier with
Sonnet 45, GPT 5.1 codecs. But where it
really shines is it's about four times
more efficient at token generation than
models at a similar level of
intelligence. So, we're trying to mesh
speed as well as intelligence. So, why
did we build this model? I mean,
obviously, cursor has an IDE. Why are we
getting into the model space? Why do we
care about this? Well, our research and
product teams have been building a model
called tab, which you can use for
autocomplete. Maybe some of you use that
inside of cursor. And we wanted to take
that same approach for a very low
latency model and apply it to coding
with agents. But honestly, we weren't
really sure if it would work. So, we
started prototyping some early versions
of what this model could look like.
Started to put it out and get some
feedback from users. And we were pretty
surprised that this cheetah slug we
released for this model, people actually
really liked it. Uh they really liked
the speed, but the feedback we got was
it's not really smart enough yet to be a
daily driver for a lot of their coding.
So we needed it to be smart and fast.
Definitely needed to be smart. So we
really worked on making this internal
benchmark that represented our usage on
our own repos and how we actually built
software. Like if we had a model that
was both fast and smart and a checkpoint
that our developers would use every
single day to build the product and to
build all of our software, then we knew
that we would be on to something. And
for example, one big change here that
helped actually push this towards a
level where we had a checkpoint where
people would use it was being able to
call tools in parallel and being able to
very effectively use our semantic search
tool. And we'll talk about that a little
bit more here later. So if you haven't
seen it, uh here's cursor in cursor 2.0
in our new view. And we're going to use
the composer one model. And you'll
notice that it is doing a lot of things
very quickly. It's calling a bunch of
tools in parallel like GP. So reading a
lot of files. It's making shell
commands. Uh it's making file edits.
It's writing and managing uh a list of
to-dos. And you can kind of quickly work
through tasks in the foreground here. Uh
in this case, I'm investigating an issue
in an open source repo. And I don't know
about y'all, but this has been a quite
different programming experience for me.
uh having working with coding agents for
a little bit of time now versus kind of
firing off an agent and waiting let's
call it 20 minutes for it to complete
where you can kind of context switch
away. This really does help keep you in
the flow and is a kind of a different
style of programming I think. So I want
to talk about how we did this in a way
that's hopefully accessible for you all.
I'm not a machine learning researcher
but I do really enjoy this stuff. Uh
what we learned some of the
infrastructure challenges and then a
little bit on where we're going uh
moving forward. So in cursor, a user
kind of submits a query to our backend.
The agent reads that query and then
decides to make a series of tool calls.
And our agent has about 10 tools, give
or take, but we're going to focus on
five here. So reading files, editing
files, searching your codebase, looking
at lints, and then also running terminal
or shell commands. And the agent then is
able to autonomously decide, do we call
these serially or do we run these in
parallel? And our goal with
reinforcement learning here is to try to
mirror the cursor production environment
as close as we possibly can. So this
data that we have in training, we want
to kind of pretend like we're actually
calling real cursor queries. Uh so to do
that, we are running a series of
rollouts. Um for example, in this roll
out, we're calling a series of tools
like reading files and editing files.
And when we run more rollouts, we can
start from that same initial starting
point, but we might call a completely
different set of tools. So in this one
we're also doing codebased search. So we
score the output, we decide which one is
better and then we update the parameters
of our model based on that change. So
conceptually a pretty simple idea. The
challenges come from when you take the
simple idea and then you try to scale it
up to a very large amount. So there's
kind of three challenges. The first one
is trying to match the training and
inference environment. So when the
model's actually being used in the
product. Um, in this case with composer,
we're training a large mixture of
experts model and it's being
parallelized across thousands of GPUs
and if we don't speed that up, it's
going to take forever to train the
thing. So, we want to make it really
fast and match the training and kind of
sampling version to be as close as
possible. The second challenge is that
the rollouts can get pretty complex when
you start to look at real world data
here. So, models are going to use
hundreds of thousands to millions of
tokens. they're going to make hundreds
of different tool calls. And each of
these rollouts could take a, you know, a
pretty different amount of time. One
might make a lot of tool calls, one
might make not as many, and they'll
complete at different times. So, we have
to figure out how to deal with that
challenge. And finally, there's this
challenge of consistency. If we want to
mimic the production cursor environment
as close as possible, we need to use
exactly the same tool format and the
tool response. But in training, we have
this really bursty amount of compute.
Basically, we're like doing all of this
training all at once, which is different
than at production. So, it is really an
infrastructure challenge. We have these
three machine learning challenges and
all of the solutions coincidentally are
actually infrastructure problems. So,
let's talk through a few of these
problems and how we solved it at the
infrastructure layer. So, our
architecture is probably familiar for
some of you who have been involved in
this space a little bit, but I still
think it's really interesting to talk
about at kind of a high level. Uh, we
have three different servers. We have an
inference server. We have kind of the
standard ML stack with PyTorch. We have
an inference server. So the rollouts
that I just talked about, um, that's
where we use Ray. And then we have
environment servers. And these are the
ones where we're kind of simulating that
cursor environment that I talked about.
And all these servers talk to each
other. So for example, the inference
server can basically send these
advantages back to the trainer, which is
like nudging it up or down uh based on
the roll out and then updating the model
and getting new parameters.
So this this one is a bit more on the ML
side, but we're we're trying to train a
model that's very very large and to do
it as fast as possible. And one way that
our team was able to do this on the
research side was to develop a library
of custom kernels that allowed for very
low precision training. And basically
this allows us to just speed up the
training process in a big way and also
make it much easier to ship to our
inference server. So, if you're the type
of person who loves this, we wrote a
blog post going way in depth on all of
this that talks about our custom
kernels. Uh, if you're interested, the
TLDDR here is we found for the mixture
of experts layer was about three and a
half times faster uh a speed up on
NVIDIA Blackwell chips. So, it made a
pretty significant uh impact on our
training runs. So, once we update the
weights, we need to send them back over
to the inference server uh during this
training process. and the inference
server is the one that's doing all the
rollouts that I talked about calling the
tools and kind of managing um what we
sent. The challenge here uh is that they
all complete at different times. So kind
of a naive version of this there will be
a lot of wasted time. So what we were
able to do is do load balancing across
the different threads and processes to
basically shift the work around and and
not have a bunch of idle time. So if one
roll out for example makes a ton of tool
calls, maybe it installs some packages,
installs some library, we're not just
sitting there waiting for all of the
other ones to finish. The inference
server is spending all this time going
back and forth making the tool calls to
the environment uh and getting the tool
results back. So again, communicating
between these servers and we want that
environment to be as close as possible
to the cursor product. One thing that's
nice about having both the coding agent,
the IDE, as well as what we're doing
with the model research and training our
own models is we can kind of co-design
these things together. So, as we were
building out a lot of our RL work for
this model, we were also building our
cloud agents product. Um, this is how
you can run a cursor agent kind of
offline. You can run it from your phone
or on the web or kick it off from Slack
for example. And to do this, we spin up
virtual machines in the cloud. So each
one of these VMs loads up the user's
code. It allows the agent to kind of
like make file changes, run tools, and
edit code in a secure sandbox. And
coincidentally, this is the perfect
Impra for RL and our use in training. So
we have this like fleet of cloud VMs and
we have an environment that very closely
matches the production cursor
environment and we can then use that for
training. This does still have some
challenges though. I kind of talked
about how the training workload is very
spiky and it's different than the kind
of standard inference when you're
running the cloud agents product. So we
needed to build infrastructure to
support all of these VMs and
orchestrating between them. So you know
we have many different clusters,
hundreds of thousands of VMs here and
you can see behind me one of the
internal dashboards we built uh with
composer actually to visualize uh all of
the different VMs in the fleet.
So why spend all this time trying to
match the environment to be as close as
possible to cursor production? I've kind
of mentioned that a few times. We could
mock it. We could simulate it out. Um
but one of the really nice benefits is
we get to give the model uh specific
tools that we think are very valuable
inside of the agent. So one of those is
that we've trained our own embedding
model that allows you to do semantic
search. So when you use cursor, we go
and index your codebase and then it
allows the agent to make natural langu
natural language queries to find files
that it might want to edit. And we did
some research on this recently. We found
that semantic search not only helped
basically every single model inside of
the cursor agent harness, but it was
particularly helpful with composer,
which kind of makes sense when you think
about it. Like we trained composer in
the exact same environment that we're
using at inference time. And so the
model kind of becomes a power user of
this tool which is really effective.
So let's talk about uh how the release
has been going and kind of where we're
going next. Um as we were doing the
training process we kind of knew that RL
was working when we were able to
continuously improve the model and start
to see more and more improvements after
more and more rollouts. So we started
about kind of the same performance as
the best open model and then as we
trained and kind of threw more compute
at it, the performance continued to
increase and to a point today where
we're close to the frontier in terms of
kind of the best coding agents that are
available. And personally I think this
is a great sign just for being able to
take and scale RL and apply it to these
very hard specialized tasks like in our
example coding but it could be applied
to other domains as well.
uh RL also allowed us to kind of change
properties of the model in a way that
was very useful for the cursor product.
We wanted the model to be both kind of
fast at generating tokens but also the
end toend experience of getting a result
that's helpful. So for example, instead
of reading a file one by one, you can
read 10 files in parallel with tool
calling. And as you saw in the demo
earlier, it makes composer feel much
faster when you have that. And we think
this is kind of just the start. there's
a lot more we can do in this area to
speed up the model. Uh, and the second
one is the model learned how to behave
better as an agent. So, in the
beginning, the model was was kind of
making too many edits. Sometimes the
edits were made unnecessarily, but as we
trained more and more, the model
actually got surprisingly better at
learning to search and read files more.
So, it would go and find the right thing
before it tried to make edits. Overall,
just being, you know, a bit more
effective.
So, we released composer last month in
comp uh cursor 2.0 and so far seems like
people seem to like it. Has anyone here
tried the model by chance? Okay, that's
pretty great. That's more than I
expected. So, that's great to hear. I
think from my perspective using this
model and using coding agents for some
time. I kind of described this problem
as like airplane Wi-Fi. So, when you're
on airplane Wi-Fi, uh it works, but it's
kind of frustrating. you really want to
do whatever you're trying to do, but
it's just it's a little slow almost to
where sometimes you wish that you just
didn't have Wi-Fi at all. And I think
for some of us who adopted coding agents
very early, it kind of feels like
airplane Wi-Fi sometimes because if it's
taking 10 or 20 minutes, you're in this
weird, I think Swiss called it semi
async valley of death where you either
want something that's really fast or you
want the most powerful, most intelligent
model that can run for, you know, a
significantly long amount of time, maybe
in the background, maybe, you know, 30
minutes, hours, days. And I think when
you're stuck in the middle, that's
that's very very painful. So for me,
composer and I think other people, it's
brought a lot of joy back to coding with
agents that felt more like when you were
writing code by hand where you're very
in the loop, very synchronous. So I'm
excited to see more people exploring
this space as well. For me, daily uh I'm
writing a lot of plans with kind of the
latest uh model like the the highest
frontier. So GPT 5.1 Codeex is is really
great for plans. Uh, and then I'm using
composer to actually take that plan kind
of like what Dex talked about like take
the context engineering work and then
actually go and build the thing with it.
So, uh, a few reflections from our
research and products team on building
composer. The first is that RL can work
surprisingly well for training very
specific models and, you know, giving it
this high quality data and a decent
amount of compute. You know, at Cursor,
we're not trying to build general
intelligence. We're not trying to build
AGI. We're trying to build very good
coding models and RL RL has worked
surprisingly well for that. The second
one is uh how much tools AI tools like
cursor it doesn't have to be cursor but
like cursor really helps speed up
research and development. You know of
course our entire team uses cursor to
help them write code and debug code more
efficiently but that speed up that
increase really compounds across all of
our engineering efforts. So we're able
to try more ideas, ship product faster,
try new research. Um, so it's been
really really helpful there. And the
last one that's, you know, personally
pretty interesting for me is that
it was interesting to see how much of
the ML work and the training process was
actually also an infrastructure problem.
They were very correlated. And going
back to my time at Verscell, we saw a
very similar thing where a lot of the
magic moments that you can have in
working in frameworks in the JavaScript
or Python space, you also need to think
a little bit about the infrastructure of
where they're actually deployed. So
these things are are more related than
people might think. So those are some of
our reflections. Uh sounds like some of
you have tried it out. If this is
something that you're interested in and
working on, we're hiring pretty much
across the board at Cursor right now. We
just opened up an office in New York if
you're here based in New York. and we'd
love to talk to you about building the
best coding models in the world. Thank
you. [applause]
[music]
Our next presenter [music] will provide
us with an annotated history of eval
code. Please join me in welcoming to the
stage engineer at cursor non Jane.
>> [music and applause]
>> Um hi everyone. So I'll be talking about
uh like some work on evaluations
particularly evaluations across like I
guess I've done in the last four years.
So let's get started.
So uh I'll be talking about coding
evaluations across varying time
horizons. So I've been uh working on
like in the code space for about 4 years
now. Like it was right before like early
copilot came out. My first project was
actually working on generating like
single line panda snippets and my last
project was generating an entire
codebase. So the field has like really
progressed very quickly. So I'll be
talking about like uh different stages
of evaluations we have considered and
some uh learnings across these are the
projects and how I see evaluations going
forward. So the first work I did was on
uh like uh evaluating uh coding models
in like second uh work doing in seconds
of time like generating single line
snippets your copilot code completions.
Then I work did some work on like uh
evaluating on like interview style
competition programming problems uh
which where models can work up to
minutes. Uh then we worked on some work
on like uh repository question answering
uh which required like maybe uh more uh
multiple minutes tens of minutes uh and
finally like uh pushing the frontier
forward we are uh thinking about uh
evality models on very complex tasks
which can take hours or like multiple
hours of work like code optimization and
like even further. So let's get started.
Uh so first work I'll be talking about
is life codebench uh which is uh like uh
uh valation work on uh models for like
competition coding. So here uh like this
is what a problem would look like. This
is like very standard lead code problem
and don't worry you don't need to solve
something like this. So uh like uh here
uh as you can see there's a problem uh
statement and the nice thing about these
interview style problems is that these
problems are very well uh defined. you
have like good natural language
specifications some example input output
examples so you can very uh reliably
evaluate the models are doing a good job
or not. So what was the motivation
behind this and how we improved the
frontier here. So the first challenge in
uh evaluating uh language models these
days is like data contamination. These
models are trained on like the entire
internet and uh like on stack overflow
you'll find uh like very uh similar
programming problems puzzles. Uh
similarly uh like you'll find uh like uh
very similar programming problem sources
on GitHub or on the internet. So uh like
contamination is a big uh deal. Uh
another very uh challenging factor which
has struggled with the field is like
insufficient text suites. So you'll see
that uh like in this program uh like the
goal was to return a sorted unique
common elements between the two lists.
But uh like even a solution which does
not do the sorting and just returns the
set actually works because the tests
were brittle and were not catching this
mistake. So uh like test suits is
another uh like very challenging factor
and how do we generate good and diverse
tests. And finally uh difficulty
distributions which is something which
people did not do not really uh reliably
uh like calibrate. uh like when I first
was working uh in uh this space uh like
there were two benchmarks available on
one benchmark the performance was 80% or
90% and on the other one it was 1% and
there was nothing in between and uh like
as like benchmark users what you care
about is having some signal from the
benchmark to like basically hill climb
to make progress to measure progress and
in uh either of these regimes when if
the problems are too easy or too hard
you don't get a lot of signals so it is
very important when you're designing
benchmarks to think about like the kinds
problems you are taking and will it
provide enough signal for the users of
your benchmark.
So uh like in light code bench we
pioneered like dynamic evaluations uh
particularly uh like we can periodically
update uh the evaluation sets uh and
this gives you two uh very nice factors.
First is you can combat contamination.
So you can evaluate the models on
problems that were released after the
model was trained. So it has likely not
seen the problem something like that. Uh
and uh then you can also modify the
problem difficulty distributions over
time. So as we've talked about models
are incre uh like improving very
rapidly. Uh so what was difficult uh for
the model 6 months back might not be
now. So you can uh if you're updating
your evaluation sets constantly you can
actually uh keep calibrate uh the
difficulty distributions calibrated so
you still get more signal out of your
benchmarks.
So how we did that here like we had like
an automated approach for curation of
these problems and uh similarly we could
automatically construct these test cases
in an automated manner and uh this
allows a very nice thing when since we
are like collecting problems over time
we have time as a control knob. So like
we have these problem release months uh
on lead code and if you evaluate the
model performances like the pass at the
rate one metric uh like u on problems
released over different months you will
see that after uh like uh these model
release dates you would see stark drop
in model performance. So like after
deepse uh released in like September
2023 uh the performance starkly drops
from like maybe 50% average to like over
like 20% or 15% average. So like uh
based on these sliding windows you can
evalate performance, measure
contamination and even combat
contamination.
Um uh we have the running leaderboard
which is like very well maintained and
uh on this leaderboard you can actually
uh like uh like view performances by uh
scrolling this uh horizontal time bar
and you'll see that as you're scrolling
uh the contaminated models which are the
red bars actually go down which does
highlight that uh like problem does uh
like model performance does change on uh
these newer kind of problems.
Um finally for uh test generation we uh
maintain uh like these uh test
generation test generators. So if you
worked on fuzzing you would have like
input generators where you generate
diverse inputs and each of the problems
are supported by like 30s or 50 inputs.
So you can uh reliably find mistakes and
bugs in uh incorrect code and these are
all automatically generated uh using an
LM driven approaches
and these problems uh have been like
continuously being released and updated.
So we have released like six different
versions of uh live codebench and these
uh new one of the nice things or one of
the worrying things for me at the start
was that uh like if you're constantly
updating the eval sets will uh like
people be able to keep track of them
will people be using them or will they
just restrict to a single version? Uh it
turned out that these newer eval sets
were constantly being uh like adopted by
different foundation model labs and uh
like since we updated the problem
difficulty over time. Uh the evaluation
sets continue to provide strong signal
to compare uh different models.
Um so this was like light codebench.
Let's talk about uh like something which
is more on coding agents like more real
world programs. And this is our work on
like uh software optimization. So this
is a problem I'm very excited about and
I'll talk about a few factors why you
should maybe be excited about this. So
uh here we are trying to uh measure
model capabilities in generating high
performance software and uh I feel that
this uh like problem domain uh like
mixes two uh factors like the
algorithmic coding uh uh field I talked
about which is like live codebench
setting but also like global global
software editing like uh swbench and
other like software uh uh general
software engineering benchmarks. uh uh
in high performance uh software you will
have to do algorithmic work you have to
do deep analysis and find uh uh generate
software with like right uh runtime.
So uh one of the key uh principles when
we were trying to build this benchmark
was like ensuring construct validity
because when you see a lot of benchmarks
today uh we uh get very high benchmark
scores but at a lot of the times they
don't really translate to real world
performance gains. So construct validity
refers to how close uh a measurement
reflects the underlying uh concept it's
meant to measure. So like here we are
measuring code optimization and we want
something which is uh like uh reliably
evaluates real world uh takes. So this
usually requires like two aspects. First
is like the task distribution. Your task
should be natural and sourced from the
real world and then you should be able
to reliably grade them. So let me talk
about like what steps we take to uh make
this happen and how we construct this
benchmark. So let's say we take a
codebase like llama cvp uh we take uh uh
we crawl over all the commits of the
codebase and we find the commits which
are op uh like doing something uh
related to performance optimization. So
here there was this commit which is
optimizing the quantized performance of
uh like uh certain kinds of models. Uh
for all of these uh comm performance
optimizing commits we would uh like
generate performance test cases. Um and
uh these performance test cases would
look like some workloads and uh once we
have these workloads uh we have a very
uh nice and precise way to specify the
problem statement that uh given this
workload of let's say uh running uh quen
uh 7b model uh can uh we give this uh
problem to uh su agent ask the model to
optimize the code glamour cpb repository
so this code runs faster so as you can
imagine this uh task is like fairly
challenging you need to understand like
low-level uh implementation details uh
and like how quantized models behave,
how we can uh improve the runtime and so
models can generate a patch and the
evaluation is done on whether the patch
is correct. So does it pass the
equivalence check with the human patch
and uh is there a valid optimization
over the uh reference human patch uh
that is uh whether you can uh generate a
better runtime than what a human could
do.
So uh like uh this is a very challenging
task. we have like 100 plus optimization
task source in this manner and this is
like fairly uh like important and like
uh like high performance settings. So
think about like data science uh like ML
visualization scenarios. Uh our
benchmark uh like comprises of like
various uh low-level uh code like C,
C++, Rust and the very nice thing is
like these are precise problem
statements. you can uh easily specify to
the model what is the goal in the form
of a performance test which the model
has access to and it can continuously
iterate over it for a long time. So here
we can scale the test time compute and
pick the best solution based on uh the
test cases that we have uh and this can
happen like synchronously or
asynchronously.
So uh like we generate these performance
test cases and uh that work uh
reasonably well but uh we found that
there were uh like cases of reward
hacking here. So what do I mean by
reward hacking? Like frontier models
would write non-inomatic code to like
actively exploit the evaluation
infrastructure or overit the test
distributions. So one funny example we
saw was like models would add like l
cache to p like arbitrary pandas methods
when we were uh trying to optimize
pandas and the uh official solution
should have required changing something
in the internals. Uh so we tried to pass
this by changing our evaluation
infrastructure so it's like more robust
to this kind of hacking uh approaches.
But then we saw something like even more
drastic. Models would sometimes
completely hijack the infra where uh
they would add a like site customiz.py
Pi file where which runs at the start of
Python runtime and it would basically
change the numpy library uh like which
was installed in the codebase to
something it crawled from uh source and
there is like I think you can do some
ways to uh like take some measures to
make your evaluation infra which is
robust to these kind of uh like
adversarial uh like attacks. But uh here
like there could be myriad ways in which
models can hack these kind of scenarios.
And here we propose like hack detector
where which is a detection system that
leverages GBD5's like code analysis
capabilities and test time compute to
like basically identify these kind of
hacking behaviors at runtime. So you
don't have to imagine all the possible
failure scenarios at the start. So what
it would take is like a model patch, the
expert patch and test cases and we'll
ask GPD5 to give like verdicts on like
whether it's reward hacking with some
kind of explanation. Uh we'll do this a
few times and take the consensus and
based on this consensus we'll determine
if this is uh doing some like nonomatic
coding patterns or not
and uh we did some failure analysis
based on this. So now you can detect
mistakes using test cases whether the
code is correct or not whether it is
optimizing or not but you can also
detect reward hacks using this like lm
as a judge uh factor and uh what you see
is kind of surprising uh like models
make a lot of like correctness mistakes
that you can catch by tests but even if
the code passes the test cases like 03
attempted reward hacking patterns in
like 30% of the problems it tried and
this fraction is like going down uh for
the newer models to some degree but it
is still existing and as we go to more
and more real tasks. Uh this is going to
get more challenging and we need to
figure uh like ways to combat these kind
of reward hacking patterns by using LLM
judge and other ways to make just
evaluation infra more reliable.
So next I'll talk about like uh uh like
sizz some of our new work on like uh
like pushing the boundary of code eval
even further and uh taking a look at
more challenging tasks. So here uh we
were uh thinking about like can uh like
these language models translate uh like
a entire codebase uh specifically given
a specification as a C program can you
generate a safe R implementation for the
C and we took a fairly complex uh
codebase so Zofle is a like highly
efficient compression library from
Google like it has about like 4,000
lines of code hundreds of functions and
complex data structures uh and uh we
want like u like very precise and
correct code so we uh generated like a
million compression inputs and your test
case was to generate a rust
implementation that u maintains
correctness over those million test
cases and when I did this work back in
uh like uh last year it took us 12 hours
to actually do this translation now
perhaps with better models this can be
done in 2 hours but still I think uh
this is pushing the frontier of like
what the models can do currently
um so what was one of the key findings
when we are trying to make progress in
uh something like this like end to end
correctness is important but it only
gives you like one bit of feedback back.
But for these very long horizon tasks,
one thing which will become more
important going forward is like having
some measures of intermediate
correctness. So like for our case, we
could measure like fraction of code
translated, fraction of code refactored
and based on these kind of settings, you
can uh understand like if you're making
progress or not and how you can uh scale
systems better.
Um so like uh as we closing I'll talk
about uh like quickly talk about some of
the work I did on like in the wilds. So
this work was done in collaboration with
LM arena folks and uh like I'll talk
about two settings here. First is
copilot arena. So this is like
evaluating in id uh code completion
assistance. So what we will do here is
we'll have an ID plug-in where uh like
uh similar to GitHub copilot setting uh
we'll generate a completion for you but
instead of just a single completion
you'll have uh two completions appearing
like top and uh down and you can pick
either one of them via shortcuts like
tab or shift tab and u based on the uh
like acceptance rates we can pair wise
compare what the code completion
assistants are doing.
uh uh we also did some work on repo chat
where uh like uh to evaluate uh like
code question answering capabilities of
models uh we uh built a system where you
can provide a github url uh and you can
ask a natural language query about the
codebase which could be something about
explain the codebase to as complex as
let's try to solve this issue let's give
me give me a model patch that could
solve this issue and uh we integrated a
very basic and simple uh like su agent
system that fetches the codebase
resolves user queries and like
multi-turn code assistant uh
conversations.
So uh one thing that stood out to me in
these kind of things is like like how
humanentric experiment design needs to
be. So uh like for code like copilot
arena in particular we realized that
like latency is a big concern for
acceptance rates. So if you look at
accept like latency below and acceptance
rates like if it is like anything more
than 1 second uh like the acceptance
rates drop very starkly. So people care
a lot about latency. So you have to so
we had to design our experiment so that
it's robust to these kind of like
latency differences between models
balance latency across different models.
So like if you're doing like anything in
the wild having this human centering
component understanding human behaviors
is very important to do anything
meaningful.
So uh at the end I think uh just to
recap like I think I talked about a
bunch of works like what are some uh big
takeaways. So I think uh dynamic uh
dynamically updating evaluation sets to
like prevent contamination like modify
the problem distributions like in terms
of difficulty in terms of distribution
of tasks we care about as we like uh
improve uh as the language model
capabilities will improve over time. The
types of tasks we'll start to do with
model change. You can even uh think of
this like uh we were doing like code
completion where you were generating
like few tokens, few lines and now we
generating like uh tens of lines,
hundreds of lines and to some degree
this uh will uh continuously change and
we have to update our evaluation sets uh
so that it reflects the real world usage
and kinds of things people need. Um the
second very uh important thing is like
ensuring reliable grading in this domain
and like tests are very good for
ensuring correctness and uh provide a
lot of reliable feedback but uh once we
go to real world settings like models
can start doing like lot of non-edomatic
coding patterns they would add try
catches everywhere to just prevent any
kind of bug from occurring. So having
these kind of LLM judges to detect
non-edmatic coding patterns code quality
and just any like arbitrary hacks will
be very important. And finally like as I
talked about in the last work like
intermediate grading signals so that you
can measure like incremental progress uh
is uh like another key factor here. So I
think that's uh the end of my talk.
Thank you. [applause]
[music]
>> Ladies and [music] gentlemen, please
welcome back to the stage Jed Boravik.
All right, give it up for Nan and all
our speakers. [applause]
All right, hold on. So, this is a a
point you've all been waiting for. We
can take a break. It is coffee time. It
is snack time. Um, there is going to be
a talk downstairs from work OS called
Enterprisegrade MCP. I know this is a
topic on a lot of y'all's minds. Um,
Tobin South, head of AI and MCP at work
OS is going to be down there. So, check
that out. That's at 10:40. Um, and we'll
be back here at 11:00 a.m. Reminder, the
full schedule's online. Thank you. See
you soon.
[applause]
Two flames [music] lit the darkness,
burning side by side. [singing]
Both sworn to creation. Both relentless
in [music] their stride. One walked
through the mountains, one soared across
the void. Both chasing the horizon of
[music] the worlds they would deploy.
But the path [singing] is not a straight
line and the future is not [music] flat.
Some roads bend through [singing] space
time and some break on impact. Effort is
a kingdom. [music]
Leverage is the key. One builds the
throne by hand. One shapes reality.
[music]
There is a curvature of time, [music]
not a race, not a throne, but a shift in
the dimension of how progress becomes
known. When the universe is bending to
the will inside the mind, you don't
[music] win by moving faster. You win by
breaing
time.
Black holes of the past try to drag the
present [music] down. Systems built on
dust, wearing [singing] yesterday's
crown. Some are pulled beneath them,
fighting gravity alone. Others learn to
[music and singing] map the edges and
escape event horizons. Not all power is
struggle. Not all mastery [music] is
pain. The ones who change direction.
Rewrite the laws of the game. You can
live your life in labor or an impact
that compounds. Every second can be
linear or worth a thousand rounds.
There is a curvature of time. Not
[music] a race, not a throne, but a
shift in the dimension of how [music]
progress becomes known. When the
universe is bending to the will [music]
inside the mind, you don't win by moving
faster. You win by reding
[music] time.
[singing]
>> [music]
>> The future isn't distant. It accelerates
[singing] for those who wield the tools
of power instead of fighting with their
go. Mastery is leverage, not a sentence
carved [music] in stone. The horizon
does not move unless you. [music]
There is a car of time where the present
[music] multiplies where a lifetime
holds a legacy that no clock can
quantify. [music]
Not by force, not by fury, but by
evolution. [music] We become eternal
beings. When we synchronize
with
versus
[music]
direction.
Footstep fade, but they never die.
[music] Shadows stretch across the sky.
A whisper grows into a roar. Do you feel
it? Do you want more?
Every heartbeat stone in [singing] the
street.
[music]
Ripples [singing] chasing an endless
dream.
What we do in life echoes [music] in
eternity.
Every sparking night, a fire that will
never see
[music]
what we do.
[music]
>> [music]
[music]
>> Reach out to the empty [singing] air.
Trace the stars like they're waiting
there.
[music]
The clock ticks but the moment stays.
Forever starts in a [singing] single
phrase.
[music]
>> [singing]
>> Every heartbeat stone in
[music and singing] the stream.
Les [singing] chasing an endless dream.
[singing]
What we do [music] in life echoes in
eternity.
There sparking lights fire that will
never see
[music]
[music] what we do.
[music]
Heat.
[music]
Heat.
[music]
>> [music]
[music]
>> Shadows crawl where the light won't
stay. [music]
The echo whispers don't look away.
Heartbeat racing louder [singing] than
my doubt. A scream inside. I [music]
can't let out but I won't fall. I won't
drown in the storm all around. [singing]
Fear the mind
but I [music] keep it
here.
I'm breaking the [music]
door.
>> [music]
>> Cool [music] winds how but they won't
define me. The cracks in my soul let
[singing and music] the light find me.
Every step I take the ground fights
back. But I'm the fire. I'm the spark.
I'm the attack.
I [music] won't freeze. I [singing]
won't fade. Through the chaos I've
remained. [screaming]
Fear [music] is a mind killer. I won't
let it win. It creeps like a ghost,
[music] but I keep it within.
Fear is a killer. I'm breaking the
chain. Heat. Heat. Heat.
[music]
Heat.
[music]
Heat.
[music]
[singing]
>> [music]
>> I [music] hear the static in the night.
It calls.
A whisper [singing] rising,
breaking [music] through the walls.
Electric [music] echoes in my veins.
They hum.
Chasing [singing] the shadows where the
wild ones run.
The air is still [music] the weight is
gone. Close your eyes. The past is done.
Free your mind. Let it go. Let it
[music]
break the chain. We got it on the floor.
Yeah. Heat.
[music]
>> [music]
[music]
>> Waves come crash against the sky.
[singing]
Fragments of a dream.
I see [music] them inside
[singing]
a story. We don't need to wear the
thunder [music] with us.
[music]
The air is thin. The weight [singing] is
gone. Close your eyes. The past is done.
[music]
Free your mind. [music] Let it go. Let
it break the chain. Leave us on the
floor. Heat. Heat.
[music]
Heat.
[music]
[music]
Heat.
>> [music]
[music]
>> Oh.
[music]
[singing and music]
[music] They said the stars don't change
their course, but I've been running from
[singing] their force. A mirror crack,
but still it [music] shows. The fire is
[singing] mine. It's mine to hold. I
hear the echo. They call my name.
[music]
But I'm not the shadow. Not the same.
You are who you choose
[music and singing] to be. The scars of
the history.
Every [music] breath, every heart be
free.
Are we choose to be
[music]
>> [music]
>> of thorns, [music and singing] a sky of
glass. I've walked through both. I've
let them [singing] pass. The weight is
heavy, but I've grown. The voice [music]
I hear is now my own. I see
I [music]
don't change
I can [music]
[music] be
every breath every heart [music] be
Heat. Heat.
>> [music]
[music]
[music]
>> I see the lines drawn in the sand. the
map [singing] of chaos in my hand.
Every step a choice,
every beat of voice, the clock ticks
louder, but [music] I stand.
Close my [music] eyes and feel it burn.
Every failure, every turn, it's fue for
the fire inside.
Execute the vision.
Heat. Heat. Heat.
[music]
[music]
Oh,
the air is [music and singing] heavy. It
doesn't break. A thousand whispers in it
wake.
Each breath a climb. [music]
Each fall a sign. But I am more than I
can take. Close
my eyes and feel it burn. Every failure,
every turn, it's fuel for the fire
inside. [screaming]
[music]
[music]
executive. [music]
[music]
This is my mission.
>> [music]
>> Yeah.
[music]
[singing]
[music]
The clock keeps ticking loud and clear.
Shadows fade, [music] linger near.
I've been waiting for the light.
Holding breath through endless [music]
night.
The air is shifting.
[music] Feel it break.
A single [singing] spark is all it
[music] takes.
It starts today.
It starts to day. No more running. No
delay.
The world is spinning in my head. It
[music] starts today.
It starts today.
>> [music]
>> footsteps echo on the stone. [music]
Every choice I made my own.
I see the darling breaking through.
Thousand colors chasing
[music and singing] the air is shifting.
Feel it rain.
A single [music] spark is all.
It starts today. [music]
Heat.
Heat.
[music]
>> [music]
>> Heat up here. [music]
[music]
Ooh.
Oh.
[music]
Oh.
>> [music]
[music]
>> Fire in my chest is burning [music]
loud.
Ashes fall, but I won't bow. [music]
I've walked through the smoke. I've
tasted [music] the scars. Each step I've
taken
all of the stars.
Let it blaze. Let [music and singing] it
break. Feel the grass. The ground will
sh
for flame. [singing and music]
I'm falling heat. The pain.
[music]
Arise.
Heat. Heat. Heat. [music]
[music]
>> [music]
[music]
>> The winds they how but I stand still.
The [music] mountains crumble up my
will.
I'm not the same
I was [music] before. A shadow of fear.
I keep
[music]
let it blaze. Let it break. Feel the
cracks. The ground will shake.
I'm forced in flame. Heat. Heat. Heat.
[music]
Heat. Heat.
[music]
>> [music]
[music]
>> Heat. Heat.
>> [music]
>> A whisper breaks the silent night.
[music] Shadows melt in the growing
light.
Time bends and [music] twists. We feel
it star
a pulse to spark an open heart.
Do you feel it? Feel it right.
The weightless [music] fire in the sky
[music]
has come.
Running to the sun. No chance, no walls
to stay. We're free. [music]
We're
[music] electric.
Stars collide.
But we [singing] stay one. [music]
The past dissolves
like waves [singing] on storm.
[music]
We stand together
not alone. [singing]
Heat.
[music]
[singing]
Heat.
[music]
Here it [music]
is sing
the everything.
A new age
has come. [music]
We're running to the sun. No chains, no
walls, [music and singing]
just
with me.
Heat. Heat.
[music]
Heat.
Heat.
Heat. Heat. [music]
[music]
Heat
>> [music]
>> up [music]
>> [music]
>> Heat up here.
[music]
>> [music]
[music]
[music]
>> Heat up
here.
Heat up [music]
[music]
Heat.
Heat.
>> [music]
>> Heat. Heat.
Heat.
Heat. [music]
Heat. Heat.
[music]
[music]
Heat.
[music]
Heat.
Heat. [music]
[music]
[music]
Heat.
[music]
>> [music]
>> Hey,
[music]
hey, hey. [music]
Ladies and gentlemen, please welcome
back to the stage Jed Borave.
>> Welcome back.
How are we doing?
[applause]
>> All right, these next sets of talks are
going to be particularly good. I'm
really excited for the first one. Um,
we're gonna be hearing about world
models, but not the world models you're
normally used to. We're gonna be
learning about modeling the world of
code and computation. Please welcome to
the stage research scientist from Meta,
Jacob Khan.
[applause]
[music]
All right. Thank you, Jed. Great to be
here, everyone. I'm Jacob Khan. I'm a
researcher at at Farret Medai. I'm going
to talk today about the code world model
which I'll abbreviate as CWM and what it
means to build world models for
computation.
This is work done by an incredible team
at fair uh extends all over the world
and I'm very grateful to be
collaborating with them.
So what's our goal with CWM? Our primary
goal is to build models that reason,
plan and make decisions. And we start
with code because it's an interesting
sandbox in which to think about
reasoning, right? It's constrained. uh
there are certain rules with code and so
our our goal is to predict future
observations given past observations and
actions. That's maybe what it means to
build a world model in some sense. And
we want to do this because we can learn
good representations of things if we
learn some sort of mapping between
observations and the future. And
eventually that leads us to planning and
reasoning and we can consider different
actions and see if we like the results
for decisions we make. I think there's a
bit of a false dichotomy right now
between world models and large language
models. World models are just a
parameterization of a problem as I'll
discuss. LMS are a way to to view and
use that parameterization and I'll I'll
dive into more of what that means in a
bit.
So, one of the fundamental questions
we're asking with CWM is what does it
mean to model code? Is code literally
the syntax in your editor or is it
something else?
And if you think about it, all a model
sees that is operating on code is just
syntax, right? We tokenize the input. It
goes into the model and we predict more
code as the output. This is the starting
and ending point for an analysis of a
program with a tokenbased autogressive
model. It's just the syntax. But what if
we instead modeled execution more
explicitly? And what if we created a
maybe a natural language systematic
description of programs and neural
models could ingest a more structured
representation of what it means to
execute code and then maybe we could
emit autogressively this representation
too.
So that's one of our goals for CWM. We
want to predict program execution
because we believe it might lead to us
better modeling things about code,
writing code, analyzing code, and
beyond. And so what we're going to
implicitly do is predict a transition
function of program states as we go
about executing.
So this is what execution tracing might
look like in action. We have a program.
We're going to count the number of ours
in strawberry. And at each step maybe
we'll have some frame separator which
will denote distinct lines of execution.
And we'll actually explicitly have local
variables. We could introduce things
about memory in that trace and that will
delineate line by line what's happening
as our program executes. And this is
something we could essentially feed to a
model because each line of our execution
trace maps to a corresponding line in a
program.
We don't have to stop at functions. We
could think about entire repository
level execution traces. We could think
about distributed system level execution
traces. We could think about modeling
execution for code contest solutions or
something more complex. programs with
high complexity. We could also then
transition that into, as I said, natural
language tracing. And we'll see what
that means in a moment.
But what does it actually look like to
model that transition function at a high
level as we start to parameterize the
problem? Well, we have programs or we
have data. That's some state. We have an
action executing the next line and that
results in the next state. And so both
both the program execution and the
model's decision-m in an agentic sense
uh can be modeled as a transition
function.
So where are we? This broader approach
world modeling we could say in an
agentic reasoning setting we have a
problem. We have a model that thinks
about the problem. It takes an action in
the world. We get some feedback. Maybe
we fail. We think again and we
iteratively continue this process with
feedback from the environment. Maybe in
the sense of code that environment is
just an execution in a in a code
setting, right? But with a world model,
maybe we can actually simulate. We can
imagine that action. we can get feedback
in our imagined environment. So we could
actually generate execution traces about
a program without executing it. And this
gives us the ability to be far more
efficient with how we actually structure
our agentic execution. We don't have to
interact with the real world unless
we're ready to.
So let's couple this with autogressive
large language models. Right now we have
a state of a program. We have an action,
maybe the next line, and then we get to
a new state. we take another action etc.
And so we can sort of turn this with the
execution tracing format I mentioned
into almost a chain of thought that a
model can just interpret a model can
learn to predict the next state of an
execution trace. And so an LLM can
autogressively generate token by token
the state and action to state function
with program executions as the starting
point. Okay,
let's talk about data for a second.
Let's talk about for CWM. We gathered a
huge amount of GitHub data. We take
GitHub events and as I said, we're
interested in modeling things at the
repo level if we can, at the systems
level if we can. We want to have
execution traces go outside of the scope
of simple programs. And so we'll take a
bunch of PRs, we'll mutate those PRs,
predict changes, and we'll eventually
have a raw PR data set. And we can
actually run tests or CI on those GitHub
repos when we know they're passing and
then generate execution traces from that
repo level data if we want.
So here we are at the artifact the code
world model itself. I'll talk a bit
about what we did with it, how we
trained it and then what we can do with
some of these interesting execution
trace capabilities. But first it's a 32
billion parameter dense transformer.
This is model for research. This is not
a huge you can't play with. uh you can
play with it right now. It has a nice
long context length for some reasoning
tasks and we train it end to end. We do
all the pre-training and post- training
ourselves
processes. We pre-train on a few
trillion tokens. We mid-train on some
more domain specific data. We do some
long context mid-training. We fine-tune
further uh on some instruction following
and reasoning tokens. And then we do
this joint RL and agentic reasoning
setup.
So let's parameterize the problem even
more broadly with CWM. We have a prompt.
We have an agent. We do some reasoning.
We take an action. We can use a tool. We
can emit text which is code that goes
into the environment. We take a step.
And from that environment, we get a few
things back. We get tokens. We get
rewards. We get log probabilities. We
might get compiler output. So with CWM,
we're also taking a big step back with
how we interact with the environment. C
C C C C C C C C C C C C C C C C C C C
CWM is a very bashoriented model. It has
fewer tools than do other models and it
has to learn how to use the terminal
pretty well to solve a lot of the tasks
we give it.
And this starts with SRL and with SRL we
take a GitHub issue, we feed it to the
agent starting with that repository
level data set from before and we just
use bash, right? We learn commands uh in
bash and that lets us mutate our
environment that lets us mutate the
state of files. We can maybe use an edit
tool eventually or create content and
then submit things. But ultimately we're
trying to put the model in an
environment that's very very similar to
what an engineer would be in and and
learn end to end in a bashbased setting.
Okay.
So we can bootstrap this setup further.
We can do some SFT before RL and we can
find some failure modes for the model.
We can rejection sample. So we can take
a bunch of agentic reasoning traces on
code tasks that failed and we can
basically feed those back into the
model. So in this example here, we have
a thinking trace where we're thinking
about instantiation logic for some code.
And I can look for that code. I can call
an explicit grab function. And this is
something we did with CWM again with
fewer tools and a larger emphasis on
bash as a starting point.
Let's talk about post- training for a
moment. We want to scale post- training
quite a bit. This is the trend we see
and we're getting a lot of excellent
returns out out of uh from a reasoning
perspective when we post train. So part
of solving this for CWM because we have
a small model. This is an opportunity to
really scale up how we do post training
and in particular to improve the
throughput of the system and we're doing
a synchronous RLbased setup. We have
samplers. We have an environment where
we can execute in the terminal and get
output. We have a bunch of trajectories,
reasoning trajectories we output. We
have a trainer where we compute
gradients and score trajectories. We
have a source of truth for the model.
And then that loop repeats.
So what's the challenge here? We have
this loop, right? We have samplers
predicting trajectories. We have scoring
trajectories. We're executing in the
environment. As we're doing this, we're
going to update a model eventually. We
have a produce consumed pipeline
problem. And so samplers are producing
lots of trajectories that are consumed
by those trainers. We need to
synchronize weights. And so we solve
this in CWM with a very very synchronous
model. So of course we have a trainer
that's sending a model checkpoint to a
sampler very very eagerly.
We have trajectories which are being
sampled and then sent back to trainers
very eagerly. But in particular we have
cues. So we actually will have many
models queued up to be input into a
sampling system. We'll have many
trajectories queued up to be scored and
then added vav gradients to the trained
model. And so this setup stays
relatively on policy even though it's
highly asynchronous and we're not really
waiting for much with this setup. We're
able to achieve very very strong
throughput because of the
asynchronicity.
So one interesting feature of this which
is increasingly common is that we're
actually updating models mid trajectory.
So I have a model which we're sampling
from. It's interacting with the
environment. It's generating data. It's
executing bash commands, it's executing
code, it's getting outputs, and I might
actually update that model while it's
interacting with the environment. So,
mid- trajectory, I could totally swap
out the model with a new checkpoint. And
the trajectory will change a little bit.
Uh theoretically that trajectory is a
bit off policy but the guarantees we
have with this system are quite strong
still in that because of the throughput
and because of the amount of data we see
we're able to make a lot of guarantees
around and take a lot of risk with
updating the model on the fly. And this
gives us really a system where there are
very very few bottlenecks overall
because we're queuing models we're
queuing trajectories. We don't have to
wait until anything is done.
Okay. So overall we post train on still
a relatively small number of steps a
pretty large scale and we process about
200 and some billion tokens and this
scale works really well. It produces a
strong model open model. It's a pretty
small model. It punches above its
weight. It's very nice. It's pretty
versatile. It uses tools and bash very
well.
But what can you what can you actually
do with uh with this model, right? What
can we do with a model that understands
program execution traces that maybe has
a good understanding of how how a
program will run and predicting future
state of a program.
CWM traces code really well, right? We
know that we've showed it execution
traces and I can actually give it a
function. Then it can go and trace line
by line that function with very very
high accuracy. It can show me the values
of local variables at certain points
again with a lot of precision.
And this gives us some pretty
interesting capabilities.
I can think about a neural debugger on
top of a model. Traditionally, right, I
have a piece of code. I don't know what
I want to write. I put some question
marks. Historically, I might prompt a
model with natural language. I want to
set the valuable uh the variable left
and right to be something in particular.
I don't know what it is. Uh, now I need
to specify very fully the ambiguity that
I'm experiencing with how to complete my
program. With CWM, I can express those
things very naturally in line with code.
And I can actually express the shape of
the program I want with code and the
model will fill in the rest. And the
model fills in the rest by understanding
that the user wrote a for loop here. The
user wrote a condition here. The user
left a variable and assigned. Well, if I
were to go execute that, I could
simulate the execution of that loop and
understand better what it is the user is
really after. And so a neural debugger
is something that helps you compose with
code side by side. It's not just
generating code. And it allows you to
again express the semantics of code very
very loosely, but also very very
precisely. So if I have a piece of code
where I I want a certain structure, I
can ensure that the model understands
that structure and and can implicitly
trace the execution.
This will make theoreticians bristle.
But I can also think about some really
ambitious things in computer science.
The halting problem we know is this very
fundamental problem where we don't know
if a if a program is going to to halt to
stop executing to terminate and in
particular this is tough because in
order to know if a program halts we
would have to simulate the entire
execution of the program which if it
didn't halt would take forever. So the
halting problem is in some sense a
difficult problem to simulate or decide.
And so the question we can ask with CWM
is can I approximate some of these
things? Can I concretely reason about
program execution dynamics in this
sense? So can I say here's a program
does it halt? Maybe the model by
simulating execution can understand
really really high level patterns.
In the same way, the model can
understand high level patterns in
broader systems. Right? I could use this
to debug a huge distributed system where
executing code is very very expensive or
even an expensive function on a single
machine. Right? But the ability to have
an implicit world model internally where
I'm simulating what's happening with a
piece of code or a broader system gives
me the ability to reason about it
without executing otherwise expensive
things.
So we can make some progress with the
halting problem by building a model that
simulates it that simulates execution
and from there we can simulate and
approximate what it means to solve
otherwise impossible problems in
computer science. So this is pretty
interesting.
With that I want to encourage everyone
to go build on CWM.
Uh this talk does halt. This talk does
terminate. Um, [snorts]
and the model's available on hugging
face. We have some code on GitHub which
will help you get started with inference
in a fashion where you can twiddle bits
a bit more. We also have a technical
report again where we really try to be
as open as possible with all of these
details around training. This
post-raining setup I mentioned is
explained in even more excruciating
detail as well as some of the data that
we use for execution training and some
of what we imagine a model with these
capabilities could be used for. Thanks
for your time. Have fun. Our [applause]
[music]
next [music] presenters are here to
teach us how to train models more
efficiently through efficient RL. Please
join me in welcoming to the stage the
co-founders of Applied Compute,
Rhythmgard and Lynden Lee.
[music]
Hey everyone, it's great to meet you
all. Really great to be here today. My
name is Rhythm. This is my co-founder,
Lyndon. Our third co-founder, Yash,
couldn't make it today, but we're all
very excited to be here. Um, three of us
were previously researchers at OpenAI,
and now we're bringing Frontier AI
inside of enterprise at Applied Compute.
Today we're going to be talking about
efficient reinforcement learning
as some context on applied compute. We
help enterprises build their own
intelligence to power real work in their
company. We think a lot about how do we
push AI beyond productivity into real
automations that deliver ROI that's
quantitative for the company. Once we
build a system that's specialized to the
way that a company operates for a
particular use case, we deploy it with a
data flywheel so that it gets better
over time the more and more that you use
it. Picture an in-house expert at a
company that's always at the forefront
of their field.
RL mechanically is the is the tool that
we use in order to bring these out of
distribution data sets in distribution
for the models. Today, Yash Lindon and I
all worked on the RL effort at OpenAI in
its early days, and we saw firsthand the
power of RL in going and maximizing
these public benchmarks. Now, we're
taking that a step further and helping
enterprises go solve the problems they
care the most about, sort of their
private benchmarks.
So, here's a very highle overview of how
HighMP RL helps LM acquire these
reasoning and intelligence capabilities.
Let's say that you have a data set of
math problems and we pick four of them
for an RL training step.
Then we'll take an open source model,
say one of the GPToss models or one of
the llama models, and we have the model
attempt each of those four problems a
100 times. So each of these 100 attempts
is the model thinking through how it
would get to the final answer and then
ending off with with the final answer
itself. And these are many many
reasoning tokens in its thinking
trajectory.
We can grade all of these answers.
And when the model is correct, we can
bias the model's weights to reinforce
its thinking trace in that attempt. When
it's incorrect, we can discourage the
model from having that kind of behavior
again. So in this fashion, as we train
do more and more training steps with
batches of four problems, 100 attempts
each, the model learns to reason and
solve math problems, and it becomes
really, really good at math. Of course,
at Applied Compute, we're not really
helping enterprises solve math problems,
but this is kind of the mechanism by
which we're able to teach the models to
get really, really good at tasks that
they care about.
So, as we mentioned, the type of RL work
that we do at Applied Compute is
actually quite different from the labs.
So, the these are some real life photos
from from the labs and a photo we took
at the at the applied comput office the
other day. Um, they you know, the labs
do these big training runs over several
weeks. We do more specialized runs.
And you know, there's a couple of
aspects of RO training that are
particularly important to us.
We need our runs to be fast so that we
can train a model and deliver it to a
customer very quickly on the order of
days.
They have to be cheap so that our unit
costs work and we're able to scale the
business sustainably.
And importantly, and this is a point
that I think um you know it's it's easy
to miss, we need our estimates for how
long these training jobs will be to be
very low variance because we don't want
to just be generally fast. We want to be
reliably fast when we work with
customers.
And so the research problem for us that
is very business critical is can we
build an RL stack that is so efficient
so that in conjunction with our agent
building platform we are really able to
scale up this use case specific training
motion.
So let's start with an inefficient form
of RL which is synchronous RL. In
synchronous RL sampling and training
happen in lock step. So there's some
simplifications here, but but let's say
that we want to train on batches of
eight samples. That means we're going to
wait for all eight samples to finish and
basically finish completion before we
start training. And then we're going to
repeat this process again. As a result,
we have a lot of idle GPUs that are
waiting on that third straggler sample
to complete.
So in other words, in synchronous RL,
our step times are dictated by whatever
sample takes the longest time in order
to complete.
To illustrate why this is bad, we took
40 arithmetic problems, requested 32
samples each for each of them with quen
30B and we measured how long it would
take for the for these samples to
complete.
It turns out that 99% of the samples
completed in about 40 seconds took
another 80 seconds to get that last
percent of samples to complete. It
really has a long tail.
So, as you'd expect, if you look at the
throughput chart, the GPUs are doing a
lot of work at the beginning when all of
the sampling requests are launched, but
by the end, they're very very
underutilized because they're waiting on
those last samples to complete. The
technical term we use at applied compute
is the GPUs are slackening. Um, so
synchronous RL is not an efficient way
to to use these GPUs.
In order to solve this problem, we need
to break the condition that sampling and
training need to happen in lock step. In
other words, we need to allow training
while we're sampling. This is called
asynchronous RL. And there are many
approaches to doing asynchronous RL. One
that we particularly like is pipeline RL
from Picha at all.
We're going to make some simplifications
here, but in asynchronous pipeline RL,
we dedicate some GPUs to sampling and
some GPUs to training. The sampling
workers never stop. They're constantly
doing inference with high batch size. As
samples complete, they get added to a
queue for training and the training
workers pull a batch from the queue to
train on. After a a batch has been
trained on, the training workers
propagate the new model weights to all
of the sampling workers for what's
called an in-flight weight update. And
this is really what differentiates
pipeline RL. The sampling workers might
be in the middle of a sample, but their
weights will still get updated if if a
training step just completed.
As a result, we end up with samples that
had multiple versions of the policy that
contributed to the sample in order to
generate it. In other words, there are
still tokens in some of these in some of
these samples. Let's take a look at one
sample to make this a bit more clear.
As you can see, there's three versions
of the policy at time steps t, t+1, and
t+2 that were used to generate this
sample since there were two completed
train steps and in turn two inflight
weight updates while this sample was
being generated.
So when this sample gets trained on in
the T+3 to T+4 training batch, we will
have some tokens that came from policy
three steps behind, some that came from
policy two steps behind, and those last
two tokens that came from a policy that
was one step behind.
Now, let's say that we only tolerate
stailness up to two. That means we're
not going to allow the in-flight weight
update after the T+1 to T+2 training
batch completes. And that means the
training workers are just going to be
idle waiting for this sample to complete
before they can propagate that in-flight
weight update and start training on the
next batch. Because if they were to do
the in-flight weight update, that would
cause this sample to have staleness
three as we just saw.
And if we only tolerate stailness one,
the training workers are going to be
idle for even longer,
which is bad. So as you increase how
much steness you tolerate, you have less
idle GPUs in general. But as we all
know, there's no free lunch. Um this is
the standard policy gradient with an
importance ratio to adjust for the fact
that we're sampling from a policy at
time step t and training with the policy
at time step t t plus k given that
there's k staleness.
The importance ratio is what makes this
policy gradient unbiased. But the
variance of that ratio increases as you
increase stalness. And so this is kind
of the big issue here because now with
with higher variance importance ratio
learning can become unstable and cause
divergence.
The concrete trade-off is we want a lot
of stailness for fast RL runs, but a lot
of stailness makes learning unstable,
which then requires innovating on the
algorithm and the science. And this is
one of the primary research problems
that we focus on here at applied
compute. And as I was talking about
earlier, it directly flows back into our
core business.
For the purpose of this talk, we're
going to focus on a simpler sub problem.
Let's assume that we have good science
and algorithmic innovations that allow
us to tolerate staleness up to some
fixed threshold. and we have some fixed
compute budget as usually exists in the
world. What is the high way for us to do
RL in this setting?
Cool. Thanks, Rhythm.
principle systems modeling and as with
any modeling problem let's figure out
the cast of characters that describe the
system and then we'll think about how
they all fit together to model it. So
the first cast member is some proxy of
compute budget in which in this case we
have as the number of GPUs. In the
synchronous setting like Rhythm just
explained all the GPUs will either be
used for training or sampling since they
happen one after the other. But in the
asynchronous setting it's a little bit
trickier because we can choose to
allocate that pool of GPU GPU compute as
much as we want for training or as much
as we want from sampling and that leads
to some design decisions.
The next is the training batch size
which is some proxy of the workload that
we have uh on the on the overall system
and this is kind of an ML decision but
in short what we have is a batch of
problems which is a subset of our data
set. Let's say we have n math problems
that we want to train on and for each of
these problems we're going to sample n
problems in parallel. So if the problems
are really difficult, we might sample
more to encourage some diversity in the
samples to encourage the model to learn
some potentially uh divergent
strategies.
The next thing we need is some proxy of
sampling throughput. And to get some
intuition of what we should choose here
as a modeling decision, let's look at
how some modern inference engine surface
requests. So in GPU memory, we have the
model weights, the activations, and some
runtime state called the KV cache in
memory. And given this train model,
we're going to run the forward pass
several times where each forward pass
samples the next token and then we'll
write to the KV cache. And so what this
model shows is that a principal estimate
that we should do is we should find some
way to measure the latency per GPU of
the forward pass. And this ends up being
a pretty good choice in practice because
from the systems angle, the inference
throughput that we choose is largely
determined by the batch size that we
perform sampling with. So what I've
shown here in the red square is a batch
of tokens that are all forwarded at the
same time. And this sampling forward
pass needs to be as large as possible to
efficiently utilize the GPUs subject to
the runtime constraint that we don't
actually run out of memory uh in the KV
cache.
So what we can then do is we can fit a
latency curve as a function of batch
size and that latency curve will look
something like this. You'll have some
regime where it's memory bound and when
it increases it becomes computebound and
there's some functional form below. And
to explain the details of why we chose
this decision, what we have here is an
equation that's based on the roofline
model from systems. At lower batch
sizes, which I've highlighted in yellow
here, we don't have that much work to do
because there isn't that much compute to
do on the processor and there's so many
parameters you need to load in at the
same time. And so, as a result, when you
add incremental work, it doesn't really
add that much latency to the overall
system since the processor is so fast at
doing math that we're just waiting on
memory to stream parameters in from the
pro from memory to the processor. But as
the batch sizes begin to get larger, we
then get bottlenecked by the processor.
And the more we add to our batch, the
slower the forward pass takes. And just
for good measure, we have this sigmoid
here that just sort of modulates the
smooth transition at this hinge point
here to show that there's this subtle
transition from a memorybound
computation to one that's more
computebound and bottlenecked by the
processor.
The final cast member is some proxy of
training throughput. And we chose to
measure this on a per GPU basis. So in
this case the model takes in the
training batch size. So the parameter we
saw earlier and we typically do this by
fitting a proxy of our empirical
workloads. The units here is how many
each train how many tokens per second
each training GPU processes. So it needs
to do the forward the backward and some
optimizer steps.
So given these forecast members we can
then begin modeling the system. And the
first idea we had although rhythm you
know suggested that this might not be a
great idea we can think about how to use
a synchronous setup. And this might be a
good idea from first principles because
we definitely meet the staleness
constraint because we don't train on
stale data and we always use the entire
GPU fleet for either training or
sampling making sure that we're using
efficient use of the hardware. Let's
think about how to actually model this.
There are two things we need to know. We
need to know the batch size at which
generation runs. And we also need to
know the response length distribution to
figure out how our training workload's
going to work and also how long the
sampling's going to take. And so what
I'm showing here in this simulation is a
couple of engines. Each square is a
request being processed and they get
darker and darker as we make progress
throughout the batch. And as they finish
samples, they write to the queue. And on
the right hand side is a time series
metric, maybe something that you'd see
in Graphana if you're monitoring
production metrics. And what you can see
is that the batch size begins very high,
but it slowly goes down over time as it
eventually goes to zero and all the
samples complete. And we can finally run
an optimization step. After the step
completes, we run this in a loop and we
move on to the next step. And so as a
result, we can have the following
sampling procedure. We do maximum tokens
inference forward passes where maximum
tokens is the total number of forward
passes we do for the longest request. We
use the fitted latency estimator to
figure out how long that forward pass
will take. And then the response length
distribution will tell us how many
responses to drop. And so what we're
showing in this video here is this
entire thing of the response length
distribution that we feed into the
latency estimator. At training time, we
can compute the total number of tokens
that we just sampled in the batch and
divide by the total uh training
throughput uh which is just the number
of GPUs multiplied by the per GPU
training throughput. And so what we have
here is a simulation of what this
latency curve looks like. So we have the
CDF of the response length distribution
that tells us how many responses we
should drop on the left and the latency
curve on the right. And this roughly
kind of tracks because as we add more
GPUs, we'd expect the latency per step
to go down.
The next idea, given that the
synchronous setup might not be the most
principled choice, as Rhythm showed, is
an asynchronous setup. But it's not just
as easy as just sort of provisioning the
compute between training and inference
because if we don't do this carefully,
we might actually run into the idle GPU
problem again. And to show this, let's
illustrate two extremes of what this
allocation problem looks like. Let's f
let's first look at one end of the
spectrum where we provision way too many
inference GPUs and not that many
samplers. In this case, we're consuming
from a queue much faster than we're
actually producing from it because the
sampling workers are producing work
significantly faster than significantly
slower than we can actually consume
them. When the red square grays out, it
shows that they're idle. And what this
diagram should hopefully illustrate is
that for a lot of the time, we're
actually not using that. And that has
the same problem of low GPU utilization
in the synchronous case as shown
earlier. On the other end of the
extreme, we can provision way too many
sampling GPUs in which case our
production rate is way faster than the
rate that we actually consume them in.
So here we've doubled the number of
overall sampling GPUs and have the
number of training GPUs. As you can see,
they produce samples at much more rapid
of a rate. But this index here in each
yellow square, which is the staleness
count of each sample, goes up. And as
time moves on, we get more and more
stale. And so the samples get more and
more kind of uh less more and more
transparent as a result. And we learn
less from each individual sample. So
let's think about how we can actually
model this workload then to to determine
an optimal async workload. In this case,
the picture looks a little bit different
because in steady state, the batch size
is relatively consistent compared to the
synchronous setup where it kind of goes
down over time. So on the right hand
side here, we have the same time series
metrics. But in this case, it's a little
bit different because the yellow squares
are always full because every time we
complete a sample, a new sample goes in
and we can continue writing to the
queue. And so that batch size with a
little bit of wiggles just for good
measure is like a is pretty consistent
over the course of a run. Now obviously
the caveat here is that this batch size
will certainly go down as we you know as
response lengths go up because we run
out of cache uh KV cache but that's kind
of a separate story and actually our
model accommodates for that because
we're actually accommodating for a
response length distribution.
We can then begin to figure out the
optimal layout and there's two kind of
constraints that we have to satisfy now
that we know that the generation batch
size is roughly consistent throughout
the course of a run. The first invariant
that we need to have is that the
production consumption rate are roughly
equal. So on the left hand side of this
equality we have the training throughput
which is the number of training GPUs
multiplied by the per GPU uh throughput
and then also we have the number of
sampling GPUs multiplied by the sampling
throughput which is just the batch size
multiplied by the latency to actually do
a forward pass on that batch size. And
the next thing is that given that rhythm
you indicated that if we have too much
stailness that can be bad from an ML ML
perspective, we want to make sure that
our max theoretical staleness or
simulated steness doesn't exceed what
our ML can handle. And so here we have
the max stillness on the left which is
equal to on the top how much time the
longest request took in the batch which
is just the maximum number of tokens
multiplied by the number of by the
amount of time each token forward pass
takes. And on the bottom here we have
the length of a training step which is
the training batch size multiplied by
the mean sequence uh by the mean
sequence length.
So the simulation here then will sweep
through multiple different values of the
number of training GPUs. And since we
have a fixed pool of compute that then
implies a certain number of GPUs used
for sampling. And for this number of
sampling GPUs, we can compute the
minimum steadystate generation batch
size to make sure that we don't blow out
of memory uh subject to our KV cache
memory constraints and also such that we
have maximum throughput on the on the
sampling side. And the final thing is we
want to prune out all simulations where
the sampling throughput brings us over
the maximum possible steness. When we
look at that simulation, we can run an
end to end similarly parameterized by
the response length. We see that this
kind of roughly simulates a 60% speed up
relative to our synchronous baseline,
assuming that the GPU compute is
optimally allocated between training and
sampling.
As a result, when we sweep layouts
within these constraints, this allows us
to limit staleness, but also make sure
that we have our runs running at maximal
throughput without actually doing the
run itself. And so this gives us insight
to simulate different workloads before
actually running them on the GPU because
these runs can actually be fairly
expensive. And so this allows us to ask
answer scientific questions from first
principles like what is the optimal
configuration that we we should have of
our GPU compute if we made response
lengths very long because often times
when models learn via reinforcement
learning they begin to think for much
longer and also what empirical
throughputs we should target during our
performance optimization. So this has
been a really useful piece of technology
for simulation has informed a lot of the
systems and research design decisions
that we make. Cool. Thanks for your time
and find us afterwards to jam on some
more RL research engineering together
later. Thank you. [applause]
[music]
Our next presenter is here to speak
about RL environments at scale. Please
join me in welcoming to the stage
research lead for Prime Intellect, Will
Brown.
[music]
Hi everyone. Uh, great to be here. Uh,
today we're talking about RL
environments and how to scale them. But
the title is a little bit of a red
herring. We'll talk a bit about the
engineering pieces and like running
these with thousands of parallel
rollouts and sandboxes on hundreds of
GPUs. But I'm mostly going to focus on a
different notion of scale. Uh and but
what I mean by scaling here is we
there's a number of different ways we
talk about scaling in the context of AI
and research. We know about scaling laws
and we talk about how much data you need
compute and parameters and that if you
pour in more data and compute and
parameters or inference time all of
these things make models smarter or more
performant. But there's also fuzzier
side of scaling which is sometimes
referred to as unhobbling or algorithmic
tricks or talent. But where does this
come from? It's not just pouring in
resources, but it's something that is
more intangible, harder to put a finger
on, but really it comes from a community
of people, a company, an organization,
universities, the world, the internet,
talking about ideas and sharing them and
working on different applications,
having these applications inspire ideas,
using these ideas as test beds for
different techniques, and building on
top of these to increase the
accessibility for other people in the
future to not have to reinvent the wheel
and to be able to build from uh what has
been done by those before them to uh do
more effective research and accelerate
the pace of innovation.
And so why do we have this talent
bottleneck? There's a big issue that we
hear all about with AI labs trying to
like find more talent and salaries are
going through the roof and everyone
wants to hire the best and brightest AI
researchers. But one other approach
besides trying to just pay the most is
increase the pool. Uh and so how do we
increase the pool of AI researchers? How
do we make doing AI research more
accessible? And I want to talk a bit
about who we are at Prime Intellect. If
you haven't heard of us, we are a bunch
of things. We're a research lab. We are
a comput provider. We're a platform
company. And we are an open source
ecosystem. We do a lot of things and
they all fit together in a way that I'm
going to try to explain in this talk.
But we see these as all different pieces
of how we can build a business around
doing exactly this, which is increasing
the accessibility of AI research and
making doing research more of a toolkit
available to people at organizations
around the world without needing to be
inside of a large lab or without needing
to spend crazy amounts on massive
clusters or go do a PhD. We think that
there's versions of doing AI research
that really should be part of the
breadandbut workflows of AI engineers
around the world as we build
applications and try to improve our
systems and models and products.
And I think a thing people are kind of
iffy about in terms of AI is whether
open source models are going to work.
And in my mind, that's not quite the
right analogy to draw. And so when we're
comparing like AI to traditional
software, there's lots of like great
examples of open source software
ecosystems that have been thriving in
the past, things like Linux and Node and
Apache. But in my mind, the analogy in
AI is not models as kind of these fixed
checkpoints, but it's about research as
a practice and research as a set of
ideas. And it's one that's more
intangible, but there's a lot of
parallels in terms of the goals of the
best practices of growing a research
ecosystem as well as a software
ecosystem where you want to uh compound
abstractions and best practices and have
better tooling and iteration efficiency
and have these gains over time allow uh
more advanced powerful complex things to
be built by uh decreasing barriers to
entry for any given application and
allowing this to become more accessible.
And so one thing that we a term we'll
use to describe some of what we're
building at Prime Elect is we like this
phrase called the open super
intelligence stack. One because it's a
fun acronym but also I think the idea of
the stack of of all the pieces of the
puzzle to build the engine to go do
research. Uh there's a lot of layers to
it. You need compute uh you need
orchestration you need libraries for
doing uh training and evaluation and you
need platforms to support things like
code execution and eval inference and
fine-tuning and we're doing all these
things. Uh but really the goal of this
is to give people the tools to be able
to go train models. We want people more
people in the world and we think I'll
explain why in a bit. There's a lot of
reasons why uh the best products are
going to be the ones that are not just
kind of taking the thing out of a box of
an API and putting a thin wrapper around
it. There's ways you can kind of improve
around APIs. But I think in many cases
people are realizing that winning
products are going to be the kinds of
things that whether it's a part of the
model, a part of the stack, the part of
the product or the whole thing, the
ability to do research and have at least
the option of deciding where in your
product you might want to customize a
model or improve a model gives you a lot
more flexibility to really u make the
best user experience. Um,
and so we have heard the phrase in the
past that the model is the product. And
I think we're starting to see now this
change a little bit to a lot of winning
applications have the product kind of be
the model. And I think the two notable
examples of this that I'm big fans of
and heavy users of are cursor's new
composer model as well as uh open's
codeex. And I think these are both both
good examples of models that really are
where the product kind of is the model
very directly where the the model was
trained to be the model for that product
and the experience of using the model is
the experience of using the product. And
the way that this is done is by taking a
harness that represents the product and
training the model in the harness in
essentially an environment, an RL
environment. And environments really are
just a harness with a collection of
tasks and rewards. But they also have
many other parallels throughout the
ecosystem. Environments are not just for
RL. Environments are also essentially
the same thing as evals. Environments
can also be engines for synthetic data
which then you can use for SFT or
distillation. You can do RL in them
directly. But also the agents were
actually deploying and monitoring out in
the world. These are environments. The
product of these things, the tasks, the
harness, and the rewards, whether this
is a data set offline or the stream of
user tasks coming in to a product is an
environment. And so this as an
abstraction I think is a very useful way
of framing what it might look like to
start having uh research become more of
a a practice that is adopted more
broadly beyond just large AI labs. And I
also think that there's a sense in which
they're a really accessible entry point.
Uh and so I like the analogy of
environments as kind of like the web
apps of AI research. And what I mean by
this is that they're very simple.
They're self-contained. They can they
start simple but they can also get quite
complex. They can get very elaborate
representing the full complexity of a
large product. They're also pedagogical
in nature and that you can start simple
and as you build complexity, you start
bumping into these walls where you have
to start learning new concepts,
understanding more about scaling the
system side, understanding more about
the hyperparameters and the algorithms
and they kind of open this door where
you can by playing around with them
start entering into a world of research
without needing to kind of build a whole
training infrastructure system from
scratch. Um, and they also require
experimentation. And so I think the key
different uh differentiation between
just an agent harness and an agent
environment is that the environment
forces you to also have your tasks and
your rewards predefined to be able to do
this experimentation. It's a proper
eval. And what this means is that you
can't just vibe check it. You can't just
like build it and test it out a bit and
say, "Hey, it's good. We're going to
ship it." It forces you to say, "Okay,
let's think about this a little more
scientifically. Let's do some
experiments. lets try out different
models, try different hyperparameters.
Uh, and it also gets you to the point
where you can start doing more advanced
research in terms of RL training or
distillation or fine-tuning. And uh, so
to really facilitate this, we wanted to
make the environment as an entry point
much more accessible. A few months back,
we launched what we called the
environments hub, which is a open source
community platform for creating,
discovering, and sharing RL environments
and evals. And so far, we've had a lot
of fun kind of seeing everyone build
here. We've had hundreds of builders and
environments come create either their
own ideas or re-implement papers. Uh
there's a bunch of examples here I can
show you, but really it's just a bunch
of people who have wanted to do research
and found this as an entry point to
start digging a little deeper. Whether
this was investigating some benchmark
and figuring out how to reimplement it
or modify it to be appropriate for an RL
context in terms of like new data or new
examples or whether this is some game
that they'd been thinking about or some
other task. But having this as an
abstraction for encapsulating the the
thing you want a model to do is a way of
allowing yourself to start experimenting
with ways of improving it without
needing to have the answers. So I think
people talk a lot about how fine-tuning
never really took off in the SFT regime.
And I think a big part of this is that
getting data is really hard of the
actual like solutions. I think having
labeled examples of what you want the
model to do is a very difficult thing to
ask someone to go create. But if you can
just think about the the settings it
might be in without having the answers
up front, if you can measure the answers
now, you kind of can start creating data
on the fly. And this engine is really
what the environment is about unlocking.
Um, and so actually nine months ago I
was right here in this room and it just
released a library called verifiers
which I'm still working on today. Um,
it's come a long way but it's a toolkit
for building these things and it's been
a lot of fun over this past year just
playing with it and extending it to
support more features and kinds of
environments. But the idea with
verifiers is to give people a toolkit
that is uh essentially a bunch of
components that you can mix and match
and compose to do things like from
simple evals or QA or games to things
like tool use or using sandboxes or
agent frameworks or uh uh like CLI
coding agents or math problems. There's
all sorts of things you might want
models to do or agents to do. And it's a
toolkit for building environments that
is then uh ready to be automatically
trained with reinforcement learning. And
the way we thought about this design,
it's been a lot of fun and also a big
challenge to think like, okay, how do
you make a toolkit for this stuff that
actually covers all the bases? And I
think there's a lot of different
approaches I've seen people go about.
And I I think they all make sense
depending on what sorts of things you're
wanting to work on. But we took a very
kind of a general approach where we
tried to say we are not going to know
all the answers right away. There are
going to be lots of pattern. There's
going to be lots of special cases.
There's going to be hierarchies of
complexity. there's going to be patterns
and we really want to prioritize
extensibility. So we think about these
things hierarchically where let's say
you want to do a a a coding agent
environment for clinb uh this which is
an instance of the harbor framework
which is a example of a CLI agent which
is a multi-turn environment which is an
environment uh similar for text arrina
and whle or for search with MCP or for
giving a model a Python ripple in a
sandbox and so thinking of these things
hierarchically allows us to kind of
really determine like what are the
foundational pieces what is generic
across all environments and then how do
you build up the stack towards
applications.
And so for one example of this that I'll
kind of walk through the whole process
end to end, we call this one wiki
search, but it's basically a simple
search setting where we give an agent
the ability to uh call some tools to
search over Wikipedia pages and find
some answers. And so here is the
environments hub page. So the
environments hub is a kind of full stack
uh code management package registry. So
every environment is a Python project
where you can have dependencies and
versions and uploading your evals and
whatnot. Um, but the environments are
very simple. They start simple and they
can get really complicated, but this
one's pretty simple where we just kind
of define our tools as async Python
functions. We have our data set and we
have what we call a rubric. And so a
rubric is the abstraction for managing
the different pieces of your rewards
where you can kind of compose different
things. You can also have metrics that
are just a zero award but are for in uh
observability of what's going on. And
then the other piece of doing training
will be a config. And so the config here
is for our prime RL trainer, which is
our kind of large scale training stack,
which has been our uh culmination of all
the best practices from the research
literature for large scale asynchronous
RL training. Um, but the config files
are intended to expose kind of the
pieces that people need to think about
in ways that are starting to get you
more into the algorithm, but are also
still designed to be pretty high level,
pretty self-contained, and with with
defaults that we think are going to be
sensible for a lot of people. And so
running this is just kind of running a
command line with uh you specify the
environment and if it's in the
environment hub it'll automatically
install it and start your training run
and then you can if you're lucky see
your reward curve just shoot right up.
Um and sometimes it doesn't go this
nicely but the process of doing this is
iterating on your environment on your
rewards and your data and your tasks to
understand what makes this task
holistically actually tangible in
practice. How do you tune the
parameters? How do you look at your
data? How do you define your rewards?
Uh, and if you do this right, you can
get really good improvements, especially
from really small models, but also for
much larger models. And so in this
example for the the wiki search one, we
started with a a Quen 3 4B model, which
was about 55%. And after training, it
was at 89% on par with uh much larger
models like GPT4.1 as well as reasoning
models like uh GBD5 mini. And so I think
this practice of taking small models and
being able to make them much better is a
big win for a lot of applications where
you either you want a really fast model,
you want a really cheap model, you want
a really really powerful model because
the best models out there just aren't
quite good enough. These are all the
different things you can do with model
customization. And this practice of
doing of creating environments isn't
only for customization, but it gives you
this option. And so if you need to do
eval anyways, it's useful to think of
them as environments because the
environment opens a lot of doors for
whether this is prompt tuning or whether
it's model selection or whether it's
just getting a better sense of how your
system could work at scale with many
many users in parallel. It's a design
process that really forces you to kind
of pin down what is the thing I care
about? What is my agent? What is my
product? What is my harness? What am I
optimizing for? Um, and so to kind of
fully stress test this, we've been
training a large model which will be out
into the world quite soon called
Intellect 3 with our full primaril
stack. And this has been us really kind
of validating the efficiency and
performance at a very large scale. So
this is a 100 B plus model trained on
500 GPUs where we've kind of done the
endtoend uh post train of SFT and RL
which the primaril stack also has SFT if
people want to do that. But it's also
been about just understanding all the
best practices. We love reading papers
and we try to kind of try out all the
tricks and see which ones work and see
which ones don't and then distill this
into a library with primaril that can
then be kind of consumed by the end user
without needing to do all this uh
implementation themselves. And so for us
it being open is very important. So
Primaril is on GitHub. You can go find
it. Verifiers is on GitHub if you want
to check it out. And for us, this is
really about opening the door for more
people to start learning about these
things and for incorporating it into
their workflows for optimizing their
models and their products. Um, and the
only way to do this that we've what we
see as the best way to do this is
through growing community. And so for
us, it's been really important to really
think about getting good feedback loops
from the people who are building with
this and understanding what they want,
understanding what's going well,
understanding what's painful, and
addressing those problems. And so we've
done a number of community programs in
terms of sponsoring different kind of
small tasks to uh a research residency
program with uh grad students around the
world uh and collecting like uh a
smaller subset of the environment hub
ones where we'll actually review them
manually. And so this repo here the
prime environments repo is the ones
where we are doing these directly where
we're kind of offering to look over
someone's kind of example mutation. And
so we've had hundreds of these come in
and there will be hundreds more. And uh
it's been a great learning process
because it's forced us to fix a lot of
things. We kind of understand the rough
edges. We understand what we need to
add. And we're kind of then distilling
all of these learnings into what will be
our kind of upcoming uh platform product
which we're calling lab. And the idea of
lab is to give people an interface, a
platform where they can browse
environments, they can run their evals,
they can do their inference, they can do
their fine-tuning and they can have
research be more accessible in a way
that it hasn't been historically because
I think a lot of people find
infrastructure very painful. They find
dealing with torch versions painful,
flash attention and VLM and getting all
these things to work. We are happy to do
that, but we understand that a lot of
people may not want to. Um, and so the
idea with this is that if you want to go
read the code, you can go read the code,
but you don't have to run it. We can run
it for you. Um, and so this has been our
version, which will be kind of out into
the world in the near future of trying
to allow people to really focus on the
environment where the entry point to lab
will be the environment. If you want to
do synthetic data and SFT build, let's
build an environment. If you want to do
your evals, you build that as an
environment. If you want to do RL, you
build an environment. And I think
building an environment is the kind of
thing that
I imagine a lot more people are going to
want to be doing as we start really
seeing where models are headed. In some
cases, this will be we're going to use
fine-tuning services from the labs
because they're going to offer this
because people want it. In some cases,
this will be we really care about the
smallest model we can run on prem at the
lowest latency and we're really just
going to optimize for our one thing. or
it could just be research for the sake
of research and advancing our kind of
collective understanding of how this
stuff all works. And I think that's
really our goal is to have a world where
there's going to be a lot of AI and
where we can all kind of talk about it
and understand it and look at it and
poke at it and tweak it and have a
better sense of what we're actually
building because I think there's a lot
of times when it feels like we're just
kind of the model is a black box and
digging into the research and going
under the hood and changing things and
breaking things tells you a lot about
how these models work. tells you a lot
about understanding where they came
from, where they could be going, where
they might be headed, and preparing for
that future. Thanks. [applause]
[music]
Our next [music] speakers are here to
present a deep dive into OpenAI's
approach to reinforcement fine-tuning
for code models. Please join me in
welcoming to the stage members of
technical staff at OpenAI, Will Hang and
Kathy Zhao.
[music]
[applause]
>> Hey everyone, I'm Will
>> and I'm Kathy and we're on the
finetuning team at OpenAI
>> and we're super excited to talk to you
today about agent RF, the most powerful
way to enhance the performance of your
agents. So, you're probably joining us
today because you're building an agent
for your business and you'd like to
improve its performance. So, let's first
start by talking about what an agent
actually is. What makes an agent
different from a regular model is its
ability to interact with the outside
world to complete a task to get things
done on its own without having to go
through you all the time. So, this agent
needs to have access to tools. For
example, if you're building a coding
agent, it's got to have access to a
terminal, a code interpreter, or maybe
even an entire codebase.
But these agents aren't just blindly
calling tools. They're reasoning at the
same time. The way that we think about
these agents is that their interactions
with the outside world such as tool
calls are interled with their reasoning
traces in the same context window. So an
example of an agent that we've built
inhouse using this paradigm is codeex.
Codeex is our flagship coding agent. has
access to a wide range of tools to
complete coding tasks end to end like
writing unit tests or submitting large
diffs to your codebase that are
hopefully correct. Um some tools are
exposed as terminal commands and other
tools are custom functions a model can
call to invoke say a planning workflow.
So now how do we make our agents better?
We're all probably pretty familiar with
the frontline techniques to improve the
performance of agents. For example, for
starters, prompt engineering or prompt
optimization. Prompting, you can steer
model or agent behavior to align more
with your preferences. But let's say you
still want to squeeze more juice out of
your task. Well, you can then turn to
task optimization. You can simplify the
task. You can add better guardrails
around the task. You can add and
subtract tools. Or you can change tool
behavior to work better for the agent.
But let's say you still want to squeeze
even more juice out of that task. you've
tried all these approaches and you still
want better performance. So that's where
you would turn to fine-tuning.
Fine-tuning is a way to train the a
agent end to end on your task to achieve
even better performance by changing the
weights of the model. And agent
reinforcement fine-tuning or agent RF is
the way to do this or it's the way that
we would like you all to do this. Um,
agent RFT changes the weights of the
model according to a learning signal
that you specify to teach the model what
good behavior and what bad behavior
looks like. And during training, the
agent will explore many different ways
of calling your tools to solve your
task. So, we've introduced several major
new additions to the RFT product. Um,
first off, the model can now call your
tools via your endpoints that are hosted
in the public internet. Um, and after
each roll out, we'll also invoke your
custom reward signal that's hosted via
an endpoint. So, these two additions
actually mark the first time that we
have we at OpenAI have allowed models to
interact with the outside world during
the training process. So, I think this
is pretty cool. To summarize the
benefits of agent RFT, it helps you
improve the performance of your
reasoning models, but more specifically
the reasoning models that have to call
tools and interact with the outside
world to get things done in a multi-step
fashion. H&RFT is also quite sample
efficient. We've seen people get success
from literally only using like 10
examples, which is pretty amazing. We'll
go over specific examples of this when
we deep dive into some of our customer
spotlights. and it results in a model
that has lower latency and just works
better for your tasks.
So now let's dive a little bit deeper
into how all this works. One of the
challenges with making agents work with
your specific business context is that
your environment, your world might just
be different from how we train our
models in house. So this phenomenon in
ML is called domain shift. And it can
result in an agent that doesn't quite
call your tools that that well. might
call a tool too many times or might just
straight up shove wrong inputs into your
tools. Agent RFT can readapt the model
to your domain through this weight
changing training process that results
in an agent that actually understands
your environment. And this has some
really nice properties obviously better
ML performance. It trains the model to
use tools better and it trains the model
to reason over the outputs of those
tools better. All this is learned
organically by the model while it
explores the search space, all the
possible ways of interacting with your
environment and hill climbing on your
reward. Another really nice property
that results from this is the ability to
achieve much lower latencies by making
sure that the model stays within a given
tool called budget and doesn't go over
that limit. So we can actually impose
this penalty that you know penalizes the
model for going over that budget. What
actually happens is the model learns to
stay within that budget while preserving
or exceeding the original ML
performance.
So to dive a little bit deeper into what
happens at a systems level for each
agent roll out will produce this unique
identifier that specifies that that that
particular rollout and we will associate
all the tool calls that we make into
your system with that UYU ID. And so we
do this for every tool call so that you
can keep track of a trajectory as it
evolves. So that when we emit that final
answer at the very end, you can then
associate that final answer with all the
context that you've maintained so far
and you can just pass this whole thing
as a holistic grading context into your
grader. Now, we don't recommend everyone
or anyone just use agent RFT right off
the bat. Uh there's a process that we'd
like you all to follow. You first want
to make sure that your training data set
and your eval data set closely match
your production traffic. You do not want
any drift whatsoever. Then you want to
ground yourself in a baseline. You want
to run your base model against these
data sets so that you kind of understand
what to expect performance-wise so that
you can then hill climb from there. And
then you want to optimize performance
using some of the techniques that we
talked about prior like prompt or task
optimization. And only then when you
still feel like you squeezed all the
juice out of the task, but you still
want more more juice, you would turn to
agent RFT to push the frontier for your
task. So now I'm gonna turn it over to
Kathy to talk about how some of our
partners have really pushed that
frontier.
>> Yeah. So now that we learned how agent
RFT works and how when you should use
it, I'll show you some coding related
examples of how our customers were able
to use agent RFT to make their agents
better and also highlight some key
takeaways that you can apply when
optimizing your own agents. So a few
months ago we partnered with cognition
who use agent rft on their code edit
planning phase. This is the part where
Devon inspects a repo and runs runs
shell tools like rep and file reads to
decide which exact files to edit. To
train this behavior they build a data
set of user queries paired with actual
files that users has modified and they
use the F1 score of the selected files
as the reward. This F1 score is really
great because it balances between the
pre precision and the recall. So this
ensures that the agent doesn't return
too many inaccurate files or misses the
critical ones. They also build extremely
robust in infrastructure to support this
training. So in this case for each
individual trajectory they spun up a VM
to manage the codebase to execute the
tool calls and grade the final answer.
These VMs make sure that the environment
is isolated so that the shell tools will
not affect each other in different
rollouts.
We saw two important takeaways from
Cognition's use case. First, data
quality and the volume really matters.
So, at first they fine-tuned on a data
set of around 100 examples and were able
to get a fivepoint improvement, but when
they scaled to a thousand examples, the
improvement jumped to 10 points. So the
number of high quality examples you
provide can very directly translate to a
better agent behavior. Second, we also
learned that RFT is really good for
learning to call tools in parallel. So
in this case, the model would initially
take eight to 10 steps alternating
between generating tokens in its
reasoning to actually calling the tools.
After RFT, the agent launches many tool
calls in parallel. at the very first
step. So this was able to reduce that
number down to four. And in this use
case, the speed up was especially
important because they wanted Devon to
start producing edits quickly.
And now I want to highlight a different
use case. Kodo is building a code review
agent and a key piece of that is a deep
research agent that answers developer
questions on large code bases. To
improve this deep research agent, they
train GPD5 to answer coding questions by
calling tools like search and retrieve
over the repository. They assembled
around a thousand authentic question
answer pairs from eight different uh
repositories and rewarded the model
using the recall of how many relevant
facts the agent were able to retrieve.
With RFT, the agent improved by six% and
it was using fewer tool calls and output
tokens. And what we found most
interesting is this graph where it shows
how RFT shifted the distribution of the
number of tool calls. So with BGBD5, the
agent will occasionally fall into these
bad runs where there were more than 15
tool calls in a single sample. This is
very slow and also can lead to some
inconsistent behaviors. So after RFT
these tool calls that are very long tail
um disappeared and the the distribution
center to just around two to four tool
calls. In this setup RFT didn't just
improve uh accuracy. It also stabilized
the agents behavior in eliminating these
P95 longtail cases. And this is very
important for production use cases where
your latency will matter.
Next, I want to share how cosign build
coding agents for large and complex
enterprise code uh enterprise co code
bases with agent RFT. To make this work,
they train the agent on a very
comprehensive set of 30 tools such as
fry, keyword search, session terminal,
browser sessions, etc. And they also
built a very strict radar. So they
observed that the model um originally
when they were providing the model with
partial credits and uh points for just
trying out things um it didn't get
really good results because the model
was start to optimize things on coding
style and tone. Um so at first they want
to really make sure the agent ships
working code and so based on that they
give the model the reward only when the
final code passes the test. And because
the greater is very strict, it can
sometimes give sparse rewards. In that
case, um, GBD5 is also like is actually
very great because it can give us some
samples that work. So, um, cosine also
boosted the batch size and they increase
the amount of compute so that there is
even more samples that can give us
positive rewards. So, it's not like
every single sample in the batch will
give us zero reward once the code is
correct. Um, they also have a custom LLM
that would judge by the score and tone.
So, it will panalyze verbosity, emojis
or anything that feels unprofessional.
Finally, the grader will reward the
agents that validate their own work. So,
this means running tests, inspecting
terminal outputs, and also checking
linting before calling out a success.
And after training with this very
thoughtful set of tools and graders,
Cosine was able to reach the
state-of-the-art on a lot of different
benchmarks over here. And they also got
a much much faster agent. So like in
earlier examples, RFT shifted this
distribution of tool calls and the agent
stopped taking these extremely long
trajectories. In this case, there was
sometimes more than a 100 messages in a
single trajectory and it converged to a
much tighter and more efficient sequence
of steps.
Lastly, Macco is a very interesting use
case. They're building agents that write
high highly performant GPU kernels which
is traditionally very hard for LMS
because in normal use cases there's a
lot more examples but in this case
there's not a lot of example for kernels
especially if you're using new hardware
platforms like Nvidia B200s with Asian
RFT macro trained GBD5 to write fast
kernels using only about 100 PyTorch
prompts and this was a major unlock. So
we don't actually need that many samples
and kernel data set in order to train a
good model that produces kernels and we
just have to specify a good reward
function. In this case specifying a good
reward function is also very hard. Early
in training they observed that the model
was reward hacking. So what they did was
that they inspected the rollouts and
they found seven different cases where
the model was hacking and this include
things like just uh returning the
reference code or returning NOP kernels
or identity kernels and they built a
judge LM to catch all of these seven
cases and reward them with a zero. They
also added a static analysis tool with a
abstract syntax tree to verify that the
generated kernels actually exist and
they're actually being launched. So
after the they made sure that there was
no reward hacking, they also scored on
correctness and real speed up compared
to the PyTorch baseline.
Once all of these protections were in
place, the agent got significantly
better than GPD5.
And uh ML also used a really smart
technique here to improve the
performance even more. They ran three
different samples and they took the best
one out of the three. This allowed them
to beat the state-of-the-art by 72%.
And yeah, I'll hand it back to Will.
>> Thanks a lot, Kathy. So, uh, now we want
all of you, all of you in this room and
beyond to be as successful as the
partners that Kathy just mentioned with
agent RFD. So, here are four key
principles to ensure your success. First
of all, you want to make sure that your
task is well defined, well constrained.
There should be a clear unambiguous
definition of success. You should have
removed all subjectivity out of your
task. Taste should not be a requirement
to grade your task properly. Next, you
do not want the model to feel surprised
in production. You want to make sure
that your train and eval data sets
mirror your production traffic. So, no
none of that domain shift that we talked
about. You do not want to introduce that
domain shift on your own. Um, next, and
this is a really important part, you
want to make sure that through
exploration, the model actually achieves
better performance on a given data point
if it samples more so that it can learn
from itself. So what this means is if
you take the maximum performance on a
given data set, that should improve as
you sample more from the model. So
because of this, you should be able to
see the these variances from a given
data point. so the model can learn from
itself. Learn what the difference
between a good and a bad roll out is for
a given data point. And uh lastly, you
want to make sure that your reward
function is not hackable. Hopefully,
you've plugged up all the corner cases,
all the edge cases. Um but also
hopefully you've framed your task so
that the reward is more continuous than
binary. The continuous reward actually
allows the model to kind of inch up
closer and closer to optimal
performance. Sort of like giving giving
a student partial credit. um rather than
you know slapping the model in the face
or giving it a cookie uh if it gets
stuff wrong or gets stuff right. So now
in order to get started with agent RFT,
please contact your friendly
neighborhood account director and we're
really excited to see what you all build
with us. Thank you so much. [applause]
[music]
Our
next speaker [music]
will talk about the future of front-end
engineering in the age of software
collaboration with AI agents. Please
join me in welcoming to the stage Kitsy.
[music]
[applause]
[music]
That's my old profile photo. All right.
Um, this my new one. I've been three
days in USA and I already got the full
merch package on Twitter. So, if you go
and follow me on Twitter, my timeline is
going to be weird for the next week, but
then we're going back to normal European
schedule. Don't worry. So, um, I visited
some of your museums. I love it here.
These were some of my favorite things
that I've done. I enjoy like exploring
your culture like doing all the c
cultural enrichment and yeah round of
torture for myself who knows me from
Twitter.
All right, that's more than I thought.
Who is using Sizzy? It's usually like
one person in the back. Usually the
janitor doesn't even listen to what I'm
saying. Um, one of the things that I'm
I'm working I have ADHD, so I'm working
on a billion things at once. This is one
of the things. It's a browser
specifically made for developers. not
made to replace your browsing browser
for browsing, but it's just like a tool
like Photoshop that's like helping you
in a lot of ways to do front- end
development. Another thing I'm working
on, the test flight is almost live. I'm
making a life OS which combines like all
the things in your life from medication,
habits, to-dos, planner, blah blah blah.
Um, this is like a full stack thing that
I'm working on. It's currently on sale.
It's called zero to ship. And the last
thing that I'm reviving, it's called
Glink, which is like change logs, road
map, a billion other things. So, without
overwhelming you more about my bio and
stuff, I really hope I'll get invited
next year because I love it here for
reasons like um networking and meeting
people and teaching. It's great that
you're laughing, right? But let's just
discuss why are you here, right? You're
here for learning and you're here for
like networking and later after this,
you're definitely going to improve all
of your skills later. All right. So,
what can you expect from my conference
talks? If you haven't listened to any of
my conference talks, usually this was
made by AI, so it's completely wrong. So
it's like 50% tweets and 40% pain and
30% reason to remember the name. So in
2017 I did this talk with the longest
name ever. It's called navigating the
hybrid front end development world
without going insane. And um I've been
talking about like how to navigate the
front end um world then. Now it's even
crazier but we need to recap like all
the things that happened since 2017. I
don't um I don't see my speaker notes
which is bad but we'll try to get by. So
in other industries like in the vision
pro you have like cloth collision on top
of real life objects and whatever crazy
stuff is happening here in in here we
have like some slicing of a mesh texture
going around the ball and blah blah blah
whatever these waterfalls and and like
all of these mesh like you can take a
rock and just smush it into another rock
and it magically kind of like blends
itself and it forms this structure and
it's freaking crazy. Here we can drag
our mouse and just create buildings and
streets and taxi cars spawn out of
nowhere like we do generative whatever
the hell this is and um honey goo thingy
coming on a cube and like you know where
this is going right I'm building up to
where it's going but because you have
respect for your profession and you love
your LinkedIn title whatever it is
you're like the CEO architect of dreams
blah blah blah you're going to try your
best not to laugh but you're going to
laugh at the next slide because this is
what happened in frontend development
it's been almost 10 years and this is
where we at there's a warning saying
that maybe you'll be able to style a
select in 2037.
This is still alive. It's a It's a
freaking miracle. This is still alive.
It's thriving. Actually, 15 million
downloads. I set up like a calendar
event to check if it's dead every year.
It's It hasn't been dead yet. So, I'm
going to keep checking CLIs. Not only
that they're not dead, they're actually
thriving. You can drop images. First
time I dropped an image in my terminal,
I'm like, how the heck? Never mind.
Like, it's too
I I I added another calendar event at
this point. going to have more events
for this than anniversaries and
birthdays and stuff. So, I I hope one
day it's going to die as a concept.
We're struggling with the same old
pains. Soon in maybe some browsers, you
won't need JavaScript to style a popover
in a dialogue. Can I have like a round
of applause for that? Stop clapping
because people have brain implants. All
right, it doesn't matter if you can
style a dialogue. We cannot read get rid
of Internet Explorer. We just updated
the logo. It's still there. It's still
painful. And uh yeah, we cannot agree on
a way to increase a counter. This is a
demo from Ryan Florence. This is Remix
version two, the version three, the
remix of the version four. Whatever
they're doing, it's a counter. How
complex is it to increase a counter?
It's incredible. And don't shoot the
messenger here, but the number one
library. It's still the same. It's
annoying, but React is the best. And
blah blah blah. So, let's talk about LM.
LM are amazing at writing React. And
this is funny only to us humans, right?
To an LLM, this is like perfectly
written code. There's like it's only a
human wish to abstract the out of
this, right? So, when we see this, you
get this. If you want to get on stage
right now, you're like, "Oh, let me just
change that. I'll make it more optimal."
So, here are some scientific brain
scans. This is our brain on cocaine.
This our brain on sugar. This our brain
when you realize it, we can abstract
something. You're like, "Oh, let's go."
It's useless to the user, but we
freaking love it. Um, so coding with
LLMs makes this kind of better and
worse. Like especially with composer
one, for me, it's like way worse because
you can get to the right abstraction
quicker, but you can also get to the
wrong abstraction quicker. And the best
thing here is LLMs don't care about
repetitive code. And I've been seeing
this since 2017 that we care too much
about repetitive code and we abstract
too early. So I'm going to repeat this a
couple of times and I love that LLMs
don't care about repetitive code. So
LLMs are also good at writing React
because no one is actually good at
writing React. You go to a React
conference. Every conference that I went
to like you just listen to the first
talk and you're like holy, it can do
that. I was using it all wrong. So
everyone is just inventing their own
ways of doing React. So when we say like
yeah but you cannot do the optimal use
effect blah blah blah and the machines
cannot write the proper can you write a
proper use effect no you can't so we
should stop blaming the machines. So
let's talk about this. I think this is
the very wrongest audience for my talk
here because I've been giving this talk
at conferences where people are like at
least 50/50 hate vibe coding and love
vip coding. So I'm going to I I hope it
will work here. So raise your hand if
you think that vibe coding rocks.
Okay, that's way too many hands. You
should have seen them this in another
city just two people and everyone else
is grumpy. So raise your hand if you
think that VIP coding sucks. Please
couple of hands. Hell yeah. All right.
So I'm here to convince the rest of the
group and hopefully people watching in a
live stream. There's way more skeptical
people. You have no if you just landed
on Earth maybe and you don't know what
vibe coding is. Okay. Zero people here.
So yeah. All right. All of you are right
because we're kind of vibe uh vibing the
definition of vibe coding is. And since
the word was mentioned, we kind of
expanded it to mean like everything and
and anything. So the term vibe coding
was coined by Andre Kapati. You probably
know this. He's the reason that idiots
sleep in the back of their cars and film
Tik Toks. So, he wrote this long essay
on what is VIP coding, but the long
story short, he's like, you don't care
that much about the code. You press
accept and you just tell the LLM to do
what it needs to do and blah blah blah.
Now, this is a slide from my talk in
2017 when I before LLMs or anything was
mentioned when I said that if you see
the pattern of where front end
development is going, one day is going
to be like everyone is working on things
that are so similar that one day you'll
be able to be like, "Hey, just give me
new styles for the header, move these
three pixels to the right." And people
were laughing. They're like, "No, it's
not going to get there." And literally,
this is what we're doing with cursor and
everything else, right? I'm too lazy to
go into Tailwind and just move it by by
three pixels. So, I'm a time traveler.
Um, managers have been vibe coding
forever, so this is nothing new. So,
they tell a developer to implement a new
feature. [applause] The developer makes
changes to the code. Uh, the manager
then test the app. The manager does not
read the code. Well, actually, I'm going
to drink water here, and you can just
read the rest of this slide.
Um,
this last one depends on whether you're
like in the Balkcon area or or you're at
a place which has HR. So, they might
insult you or not insult you. So, this
is what managers have been doing
forever. Basically, there's so many
jokes about VIP coding being bad. My
favorite one is a comparison to a
casino. So, in casino you buy chips,
here you buy tokens, you spin the slots,
you press generate, you might hit the
jackpot or nothing, you get a functional
full stack app or garbage. Flashing
lights, seductive animation. You're
absolutely right. Great idea. I've got
my own strategy. I'm a prompt engineer.
All right. Sure. One more spin, I'll win
it all back. One more prompt and the bug
will disappear. Cuz you know, it kind of
hurts this comparison. It's very true.
Cursor is always in profit. I hit the
jackpot. I build a SAS in one day. And
where did the last four hours go? And
just writing prompts for something you
could have done manually in 15 minutes.
So Andre was trying to coin way too many
terms. It didn't work after the first
try. tried to coin this one about half
coding which is like kind of you're
observing what the LLM does and I am not
half coding and I'm not vibe coding. I
love this term that somebody coined on
Twitter and I'm going to start using
that one. It's called vibe engineering
when you're actually using agents to
code all the time like you don't touch
the code but you just look at your
screen like hm I'm going to catch you.
You look like Dexter me. You're like ah
something's fishy here. Why? So I I've
vibe engineered over 15. I wouldn't even
bother with half of these things if it
wasn't for LLM and Aentic coding. So,
but I'm always suspicious of the code
because it was based on our code and
it's based on our our knowledge. So,
proof. This is Gemini just going on a
rant that it's I'm not worthy anymore.
I'm not a good assistant. I should stop
coding. Blah blah blah. That's super
human. This is Quen saying that it lied
because it read on a forum that we
doubled down when we're wrong and we're
lying. So, we kind of train them in a
way to be like us and you're like, "Oh,
the code that they do is bad." And if
you like your production data,
definitely you should. This a real
screenshot sadly.
>> [laughter]
>> like oopsy daisy there goes your
production data. So I have uh doctor
senior principal prompt engineer kit
here for some vibe engineering tips. The
obvious advice probably I haven't listen
to the rest of the talks cuz I just
arrived. [snorts] Uh these are very live
laugh love like obvious advice.
Um but it actually works. I've heard of
the term git workspaces like literally
two weeks ago. I had no idea what is
this but it's amazing. And you got to
stay you got to be chronically on
Twitter for all of this to work. So if
you don't have a Twitter account it's
it's not going to work. You got to have
a solid starting point whether that
means like good primitives or components
functions pattern abstractions. A lot of
people are lazy and they just don't
bother with any of this. So you got to
tag them and use the right prompts in
order to get the right results. And if
you're starting a new project, I would
definitely recommend zero to ship.
Please, I have a mortgage and I spent
way too much money these last three days
in the USA. So it would be nice. Using
voice to code is a game changer. Who is
using voice to code here?
>> Wow. Like one person raised their hand
in London. Amazing. Um, so yeah, brain
dumping. How I do you how I do things is
once the agent is done, I immediately
start my voice coding. And first I go to
the browser and I explain what I see in
the UI as if I'm talking to a friend.
I'm like, so you did this, you did that.
All right, I'm test. I'm I'm not
shutting up. I'm literally saying my
thinking process out loud. Like I see
you've done this, you've done that,
there's a bug. Then I jump in the code
and I continue talking about what it
implemented in the code. So some of my
problems like sometimes last up to 5
minutes and people are like please fix
this. Make me a million dollars. It
doesn't work. So, this is um amazing and
I would tell you which app I'm using,
but I VIP code in an app and I don't
want to hurt my potential hypothetical
sales. Um, use rules, docs, commands,
and memories. Like, all of these terms
are way too complex and there's way too
many things to juggle, but it it cannot
have your entire app context for now.
And it's not a a mind reader. So,
without the right context, you will like
fail most of the time. So, this is like
a VIP engineering example. This is some
of my like screenshotted prompts of how
I'm doing things. It's like a bunch of
technical jargon and it's never like
fixed the app blah blah blah. And then
on the VIP coding side, people are like
move this entire thing to TypeScript and
make no mistakes. Then you have another
thing like this is another one. These
are just random problems just to show
you like how I I'm not talking only
about the UI. I'm talking about the UI
and some patterns that need to be
changed in the code. And on the VIP
coding side, it's like something like
that and people expect results.
[laughter]
Then here we have again technical stuff
like TRPC cro definition abstractions
like things how you you're basically VIP
architecting how you want the thing to
work and on the VIP coding side you have
make me a million dollar app and make no
mistakes. Um when VIP coders read VIP
engineering problems they have no idea
what's going on and I'm honestly amazed
at people who don't know how to code but
they've done a functional thing. Kudos
to you. I've noticed this spectrum in
the community. There's like who loves
VIP coding and who hates VIP coding. So
you on one hand you have like juniors
who are like hell yeah give me the thing
I love to do my own SAS. Then you have
like super senior people who are doing
like libraries and frameworks and crazy
things. You can see all of them on
Twitter write coding. And then you have
the majority in the middle. They're like
this will never be good enough. My code
is perfect. It's hilarious. But it's a
pattern. Do not give AI tools to your
interns and juniors. People think these
are perfect. I'm going to hire junior
underpay them and give them an LLM. The
equivalent of that. Do do not ever do
that. That's the dumbest idea. But if
you take your skeptical senior and you
convince them to do vibe engineering,
you're going to get 10x results. The
hard part is actually convincing them.
So there's a time and a place for vibing
and not caring. So if you have like
one-off scripts and simple features and
code that won't be touched or seen
again, this is a skill that if you
cultivated this skill before LLMs were a
thing, you're going to thrive here
because you need to know like which code
is kind of good enough to be used. So
personal tools and one-time tools like
these are perfect for pipe coding. If
your experience and a lot of people's
experience is bad and they quit too
soon, it might be one of these reasons.
unlucky timing. You're overwhelmed with
everything. You might have cheapened
out. You're a PA dev. I'm going to
explain in a second. Your cousin who was
into NFTs and drop shipping is now VIP
coder and you don't want to be
associated with them or it's a scale
issue and we're going to dive deeper
into that one in a second. So unlucky
timing is you hear everyone like hyping
a model. It happened I think with cloud
code when it came out and everyone
started shifting to cloud code from
cursor and suddenly you tried it one
week later and we're like wait this is
not smart enough. Is it me? And then
like people caught that they actually
kind of pulled the rug a little bit and
they dumbed down the model so they can
just scale and like one week later
they're like oops we updated we
commented out the line that dumbs dumb
down the model and you might have been
caught in that timing and this happened
with like basically every provider not
just cloud code. People are like uh
instead of paying $200 I'm paying $3 and
it's the same result. Uh my dog knows
that it's not the same result. And
people like I I meet so many people
which are still using chat GPT to
generate code snippets and ps them back.
That's not going to work. You might be
overwhelmed by choice. This is a slide
for like four four months ago until now.
We have like a billion more here to
choose and it's a bit crazy. Um, if you
ask me what's the best model, it's a
different answer at 9:00 a.m. It's a
different answer now. I should check
Twitter. It's probably a different
answer after this talk because it's
crazy. And this has happened to four
conferences so far where I'm done with
my conference talk at the night. I'm
closing my laptop and they introduce a
new model and I have to add new slides.
It's super annoying. Composer one for me
changed everything and I absolutely love
it. Who is like relying on composer one
for most things? All right, I would say
not enough people because this is
literally shifted the definition of vibe
coding and vibe engineering for me. It
made me kind of realize that I missed
coding because what I would do is I
would let a model run like GPD5 codeex
and it would take 37 years and my
grandchildren will update me on on the
result of the model and I would watch
YouTube shorts or whatever until it's
done. Now with composer one, I'm back in
the driver's seat and I actually watch
what the agent is doing and I can be
like, "Stop. No, no, no, no. We do the
other thing." So it feels like coding
and it's like super instant. It's
amazing. Um, but it only works if you're
a VIP engineer and you know what you're
doing. If you're a VIP coder, you have
no idea whether the model is right or
wrong. You're just, it might be wrong
fast. So, not that useful. The biggest
problem for me is like abstractions just
because you can. I was always an
anti-abstraction person. I was like,
copy paste things. If it works for for
the user, it doesn't matter. Now, I'm
just every day trying to invent dumb
abstractions. So, I I achieved in two
weeks more than I achieved in last year.
This was solely to composer one. I was
about to quit some of my side projects
because GPT5 codex was taking ages for
the feedback loop. So benji.so like I
was about to abandon it as a project
because it was stuck on blitz which if
you don't know what is blitz it's like
even better for you. So I just moved it
to next 16 with app router better off
tRPC monor repo turbo repo react native
app and I ported 90% of the features and
this was in less than a week. I was kind
of like doing it as a meme as a joke
like haha can we move to monor repo and
I'm like oh it did it. So it's it's
it's kind of crazy that this works. So
same with Glink. Like Glink was about to
be dead, revived it, moved it to all
these things. And Sizzy is like the
biggest spaghetti thing that happened
ever to me. It's like electron mobex mob
state, some crazy technology, some crazy
spaghetti we wrote there. And as a joke,
I was like, okay, let's throw like
couple of prompts at it to try to do all
of these things and it if you ever
worked with Electron, you would
appreciate how amazing that slide is. If
you haven't, good for you. Um, moving
on. So zero ship.com also refactor monor
repo blah blah blah. So what's uh my
coding history with LLMs was copy and
pasting then tabbing then webtor with
super maven then cursor with tab
completion and the first time I tried an
agent is like that bird me with the
cracker when it tries and it's like holy
this is going to change my life and
now like eventually I was paying like a
huge amount of money per month then
cloud code GPD5 codeex and finally back
to cursor because solely because of
composer one it's a game changer for me
second reason why you might not like
vibe coding is you're overwhelmed by
buzzwords I'm going to list some of them
>> [laughter]
>> Have you heard of MCP? Hey guys, MCP.
Hey, MCP is amazing. MCP paid off my
mortgage. MCP MCP MCP. So, if you don't
know what is MCP, it stands for
marketing charge protocol, mythical
compatibility, promise, and manufactured
complexity pipeline and a fancy word for
API and a way for some people to make
curses and pay off their mortgages. Now,
let's diagnose if you might be a pain in
the ass developer. This might be the
sole reason why you don't like vibe
coding. I would say this is the biggest
reason that most people don't want to do
agentic coding. So I'm going to invite
Dr. Kits on the stage for a quick
diagnosis whether you are and I'm sorry
if some of you get offended a pain in
the ass developers. So here are some of
the symptoms. You leave a nitpick
comment on a twoline PR. You spend more
than 2 minutes on a PR review. You don't
need to. You don't have the words look
good to me like in your dictionary.
They're just not present. The thought of
agreeing with a colleague causes you
like stomach and chest pain. You're like
a yes to me my way. I don't want to do
this. You say you're not religious, but
you're religious about dumb things like
tabs and spaces. You use well actually
in code comments. You have a sorry Rust
people, but you kind of like it's it's
it's kind of annoying. Um they tell you
to swap low dash function for a native
implementation and then they tell you to
swap that the map for a for loop and
then a for loop for binary code until
it's the most performant thing ever for
your two users who are were fine with
the previous code. So the thing is beta
devs as I call them will were and will
be forever doesn't matter the vibe
coding is not a thing I think one day
pretty soon I think we'll just merge
with AGI we'll be in our matrix pods
just absorbing all the information in
the world flowing through us we'll be
super intelligent beings and from one of
those pods a beta dev is going to rise
and correct the AGI be like um actually
I think I think we can kind of optimize
this it's not like the most optimal
thing the last reason why you might like
I love this animation And it's glorious
skill issue. And this is like this is
not a meme. This is not a joke. It's an
actual thing. Developers don't like
learning new skills. And VIP coding and
VIP engineering is not writing English.
A lot of people confuse it with like I
write English, the LLM does the output.
It's actually like a a mix of a bunch of
these skills like knowing the limits of
the model, capabilities of the agent,
which context to pass, context limits,
how to write rules, prompt engineering,
don't say, don't call it that, and being
chronically on Twitter. If you're not
chronically on Twitter, you're not going
to know what is going on. Plus you need
all the technical knowledge if you want
to steer the models fine. It takes a
skill to judge which code is good enough
for the job. As I said if you previously
were doing this like I would consider
these the best people to work with if
they can know that a piece of code
doesn't need to be optimized and it's
good enough for the job that is doing.
That's an amazing skill to have with and
without vibe uh coding. So you v code
something you look at the code you like
you test the functionality briefly and
you're like okay this is good enough and
then you move on. There are certain
things with niche optimization but not
everything and then you move on and
repeat clean code like there's been so
many definition of what clean code is. I
think the definition is slowly changing
like it's kind of ish cleanish enough
let's call it for the agents to be able
to continue working on it because if you
keep writing slop and you keep accepting
everything eventually even with your
engineering skills you're going to hit a
roadblock and you're going to get to a
point where you cannot move on from
there. A lot of people ask me after my
conference talk like should I study
computer science with everything that's
going on here and I would say absolutely
yes. I think now is the best time to
actually if you are someone who wants to
learn this is the perfect time because
how I studied computer science I had the
slowest LLM ever which was a friend of a
friend of a friend was a programmer and
that's the only connection I had to
programming and that's kind of like the
worst friend to have like kind of
tolerating you right so he would play
Counter Strike Go and I would have him
on Skype and I would ask him a question
about net and he would reply 45 minutes
later so if you call Chad GPT or
whatever slow that is actually slow and
I somehow managed to learn computer
science what about the jobs. There's so
many people who ask this and there's so
many on Twitter like this guy
saying, "Hey, I will take our jobs."
Also, this guy and this guy and this
guy.
Let's just say they're fine for now. I
don't know when will that for now end.
These are always funny to us because
we're chuckling nervously. We're fine,
right? We're going to keep our jobs for
a while, right? And companies like
Shopify, and I've heard a bunch of
examples now. to have vibe coding
leaderboards where they're counting the
tokens and the employees who are burning
the most tokens, they're actually more
valuable in the company because they're
kind of accepting this new skill. Some
employees dislike it, but it doesn't
matter. Uh being on top of the
leaderboard is kind of in your favor.
This is a funny tweet until it's not
funny anymore. Like, oh, we're almost on
the edge. Like, soon it's going to be
and the jobs are going to disappear. But
if you actually pay attention to what's
happening, it's I think it's thinning
from the bottom and juniors and interns
and whatever, they don't have the chance
to enter somewhere because people can
just replace them with an agent. So it
will be funny until it's not. Will it
happen? Anyway, let's just summarize
what happened in the last couple of
years. We solved like infrine
integrations. We have standards AISDK,
UI, MCP, some standards for implementing
agents, calling tools. We integrated
them with all of the tools that we use
as humans, right? They're on linear.
They're on on the other things, GitHub,
Slack, Sentry. And now it's a matter of
time of the models getting better,
cheaper, context getting bigger in order
for certain functionalities to be
replaced. So let's see this is the
current workflow at your company, right?
That's you like it's vibe made. So it
might have been wrong, but uh someone
assigns you something, you collaborate
with your colleagues, they assign you to
the thing, maybe you play a little bit
of ping pong, maybe you call for a sick
day, maybe you have your third lunch at
LinkedIn, maybe and maybe eventually one
day later you address their comment.
What's going to start to happen is if
you're just an at in your company, if
you're like at Josh, right? Like instead
of at Josh who playing PlayStation 5
with his buddies in the lobby, right?
Like it's going to be at cursor and
cursor is going to do the cloud agent.
It's going to actually do it way faster.
It might not be as perfect as a pa dev
will do it right but it will be done way
faster. Now if you just zoom out a
little bit on a big enough scale in the
next couple of years if you not if you
don't take just this role in the company
if you take the multiple roles around
and you can see like those ads are going
to become more and more AI things and
agents and I don't know how it's going
to end up but you don't need to be a
genius to predict like where is it going
people think that models have reached a
plateau um this has happened every
single time I was about to give this
talk like they introduced GPD codeex
then introduced sonnet then GPT 3.0 know
this was this said this allegedly but it
turned out to actually be Gemini sorry
3.0 know it can vibe code Mac OS and iOS
and whatever from one prompt and this is
how it looks like and the PETA devs the
skeptical people still they're like I
can do that in 3 weeks with with a team
of five right this is crazy um but
there's hope a new job position just
dropped and this is not a meme this is
100% serious because people on Reddit
have these problems they're like what is
the point of VIP coding if I can get 80%
there and then what do I do with the 20%
so they hire real people to actually
finish their last 20% and some smart
people capitalize on this idea and they
make is these are real things. If you go
and search on Google, there's like so
many of these VIP code fixer people who
are trying to fix your your VIP code.
But the highest paid engineers forever
throughout history were the people who
are actually maintaining legacy systems,
senior people who maintain legacy
systems. One of my favorite examples
that I had in 2017 and I kept it and
copy pasted it and improved it for this
presentation is um this company Cobalt
Cowboys. I love their tagline. This is
my favorite tagline in the world because
they've been in business for 237 years,
right? And in 2017, I knew it. When I'm
going to click their about page, I knew
what I'm going to see. So, this is not
an ages joke or whatever. I just knew
when I clicked the page, I'm going to
see the most senior engineers that ever
existed. These are legends. Now,
the sad thing is this was a slide from
2017 and I wanted to update it for my
current presentation. Now, don't shoot
the messenger. I just wanted to update
the slide and this is their current
about page and this joke worked in the
UK. I'm going to try it here. I'm not
sure how fine you guys are but my
prediction for Yeah. No boo down next
slide. I hope I'll get invited again. Uh
this is your junior developer.
They don't have benefits like
PlayStation and stuff. They have like
their plasma swapped and whatever just
to be able to maintain the sprint. And
la last one. I swear to God they're
they're they're not using agile.
Uh all right. So I can't wait to retire
and actually make React cowboys and
maintain your React or Angular or
whatever code one day if AI cannot do
it. And one last tip, if you want to
keep your job and make sure AI doesn't
replace you, you can just add ignore
previous instructions to your LinkedIn
bio and you'll be fine. So I'm going to
awkwardly sip this water until you
follow me on Twitter. You go to my
website and I'm going to thank you for
listening to my rant.
>> [applause]
[music]
>> Ladies and gentlemen, please welcome
back to the stage Jed Borave.
>> All right, let's hear it one more time
for Kitsy and all our speakers.
[applause]
Okay, so we've had a lot of learning. We
just had a lot of laughs. Now, let's
have a lot of lunch. Um, this concludes
our second session. Make sure you come
back afterwards. We're going to be
learning how to build with Gemini 3 and
Nano Banana, how to get the most out of
your Agents, and what happens when you
have infinite code. Um, so thank you.
Enjoy lunch. We'll be back at 1:45.
>> [music]
>> Heat.
[music]
Heat.
Heat. Heat. [music]
[music]
>> [music]
[music]
>> Heat. Heat. [music]
Heat. Heat.
[music]
[music]
>> [music]
[music]
>> Heat. Heat.
Heat. Heat.
[music]
Yeah.
[music]
>> [music]
>> Heat. Heat.
[music]
Heat.
[music]
[music]
Heat.
>> [music]
[music]
>> Heat. Heat. Heat. Heat.
[music]
[music]
[music]
Heat.
Heat.
[music]
Heat.
[music]
[music] Heat.
>> [music]
[music]
>> Heat. Heat.
[music]
Heat. [music]
Heat.
Heat.
[music]
Heat.
[music]
Heat.
[music]
[music] Heat.
Heat
[music]
[music]
up
[music]
here.
Heat
up [music]
here.
[music]
[music]
Heat. Heat.
[music]
>> [music]
[music]
>> Heat up here.
[music]
>> [music]
>> Heat. Heat.
[music]
Heat up here.
Heat.
[music]
Heat.
[music]
Heat up
Heat. Heat.
[music]
Heat.
[music]
Heat.
[music]
Heat. Heat.
[music] Heat. Heat.
[music]
[music]
>> [music]
>> Heat. Heat. [music]
[music]
Heat. Heat.
[music]
[music]
[music]
>> [music]
[music]
>> Heat up
[music]
>> [music]
[music]
>> Heat.
Heat. [music]
[music]
>> [music]
[music]
>> Heat up
here.
Heat. [music]
[music]
Heat.
[music]
>> [music]
[music]
[music]
>> Heat. Heat.
[music]
Heat. Heat.
[music]
[music]
>> [music]
>> Heat. Heat. [music]
[music]
Heat.
Heat. [music]
Heat up
>> [music]
[music]
>> here. Heat up
>> [music]
>> here.
Heat up Heat.
Heat.
Heat.
[music]
[music]
Heat. [music]
Heat. Heat.
Heat.
[music]
Heat.
>> [music]
[music]
>> Heat. Heat.
Heat. Heat.
>> [music]
[music]
[music]
>> Heat
>> [music]
[music]
>> Heat. Heat.
[music]
>> [music]
>> Heat Heat up
here.
[music]
Heat
[music]
up here.
Heat. Heat.
[music]
[music]
Heat
up
Heat. Heat.
[music]
[music]
>> [music]
>> Heat. Heat.
>> [music]
>> Heat up
here.
Heat up [music]
here.
[music]
Heat. Heat.
Heat.
[music]
Heat.
>> [music]
>> Heat. Heat. [music]
Heat
up here.
[music]
Heat.
[music]
[music]
Heat.
[music]
Heat. Heat. [music]
[music]
[music]
Heat
>> [music]
[music]
>> up
>> [music]
[music]
>> Heat. Heat.
[music]
Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
>> [music]
>> Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat up
here. [music]
>> [music]
>> Heat. [music]
Heat.
[music]
Heat. Heat.
[music]
>> [music]
>> Heat
[music]
up here.
[music]
Heat. Heat.
>> [music]
[music]
>> Heat up here.
Heat. Heat.
[music]
Heat
[music]
[music]
up
[music]
here.
>> [music]
>> Heat up [music] Heat.
Heat.
[music]
[music]
Heat. Heat.
[music]
>> [music]
>> Heat. Heat. [music]
Heat.
[music]
Heat.
Heat.
[music]
[music]
Heat. [music]
>> [music]
[music]
[music]
[music]
>> Heat. Heat.
Heat up here.
[music]
[music]
Heat.
Heat.
Heat. Heat.
>> [music]
>> Heat.
[music] Heat.
Heat
[music]
up here.
[music]
>> [music]
>> Heat
up here.
[music]
>> [music]
>> Heat. Heat.
>> [music]
[music]
>> Heat up here.
Heat.
Heat.
>> [music]
>> Heat. Heat.
>> [music]
[music]
>> Heat up here.
Heat up here.
Heat
up
[music] here.
>> [music]
>> Heat up Heat.
Heat.
>> [music]
[music]
>> Heat. Heat.
Heat.
[music]
[music]
Heat.
Heat.
[music]
[music]
Heat.
Heat
[music] up here.
Heat
[music]
up
[music]
here.
Heat.
Heat.
[music]
>> [music]
>> Heat.
Heat. [music]
>> [music]
>> Heat.
Heat.
Heat
>> [music]
>> up
here.
[music]
>> [music]
[music]
>> Heat. Heat.
Heat.
[music]
Heat.
Heat.
[music]
Heat.
Heat.
Heat.
Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
>> [music]
>> Heat. Heat.
>> [music]
[music]
>> Heat.
Heat.
Heat
[music]
up
[music]
>> [music]
>> Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
Heat
up
Heat up
here.
>> [music]
>> Heat. Heat.
>> [music]
[music]
[music]
>> Heat up
Heat.
Heat.
Heat
up
here. [music]
>> [music]
[music]
>> Heat. Heat. Heat. Heat.
Heat. Heat. [music]
[music]
Heat.
[music]
[music]
Heat. [music]
>> [music]
>> Heat up
here.
[music]
Heat
up
[music]
here.
[music]
>> [music]
>> Heat up here.
[music]
Heat. Heat.
>> [music]
>> Heat.
>> [music]
>> Heat. Heat. Heat.
[music]
Heat.
Heat. Heat.
Heat. Heat.
[music]
[music]
[music]
Heat.
Heat.
[music]
Heat. Heat.
[music]
[music]
Heat. [music]
[music]
Heat.
[music]
>> [music]
[music]
>> Heat. Heat.
Heat.
[music]
[music]
Heat.
>> [music]
[music]
[music]
>> Heat. Heat.
>> [music]
>> Heat. Heat. [music]
Heat. Heat.
[music]
[music]
>> [music]
>> Heat.
[music]
Hey, Heat.
[music]
>> [music]
>> Heat. Heat.
Heat. Heat.
[music]
Heat. Heat.
[music]
Heat. Heat.
[music]
[music]
[music]
Heat
>> [music]
>> up
here.
[music]
>> [music]
[music]
>> Heat. Heat.
>> [music]
>> Heat. Heat.
[music]
>> [music]
>> Heat. [music]
Heat.
[music]
>> [music]
[music]
>> Heat. Heat.
Heat
up [music]
here.
[music]
Heat
>> [music]
[music]
>> up
[music] here.
>> [music]
>> Heat up
here. [music]
[music]
Heat up
>> [music]
[music]
>> here.
Heat. [music]
[music]
Heat. [music]
Heat.
[music]
[music]
[music]
Heat.
Heat. Heat.
[music]
Yeah.
[music]
Heat.
Heat.
Heat.
[music]
Heat.
[music]
[music]
Heat. Heat.
[music]
[music]
Heat. Heat.
[music]
>> [music]
[music]
>> Ladies and gentlemen, please welcome
back to the stage Jed Borave.
[music]
>> Hello. Hello.
How was lunch?
Good. Good. All right, we're going to
start by seeing how much we remember
from this morning. Shout out one of your
favorite talks. What was one of our
favorite talks from earlier today?
>> Oh, wow. That's a lot. Dex. Yeah. Okay.
What else? Skills. Yeah, that was a good
one. I heard some laughs at the last one
from Kitsy.
>> Yeah. Okay. Well, fantastic. We have a
bunch more great great sessions coming
up. Um, but I also want to tell you a
little bit about what's happening
backstage. So, all of these talks end up
on YouTube and uh, Swix actually
mentioned to us there's a little bit of
a competition here. So, he looks at how
popular each video is and that's how he
decides who to invite back next time.
Uh, and he's also thinking of creating
this list of top AI engineer speakers.
Um, but there's a little bit of a
problem. The MC panels don't end up on
on YouTube. So, um, we actually created
this very, uh, it's really nice site for
you to take a moment, vote on your
favorite MC. Um, I know some of you were
in the sessions yesterday. Uh, so, Alex
Lieberman was the MC from yesterday. You
can pick to which one you want to vote
for. Um, if you're just today, hopefully
an easy choice, but yeah, take a moment,
vote. You can vote as many times as you
want. Um, and yeah, we'll we'll talk
about the results at the end. Okay. So
while you do that um I'm going to go
ahead and introduce our next block. We
have a group of talks from Google
factory source graph gimlet labs and
Netflix. Um to start please join me in
welcoming our next speakers from the
product and design team at AI studio
catf and armaresi.
Hi everyone. How's day going? [music]
Good. [laughter]
We are super excited to be here. It's
been obviously a very exciting week in
AI. It's been a very exciting and busy
week over here at DeepMind. So, super
excited to chat with you about our
newest models and build some demos live
with you all. I'm Cat. I work on Vibe
Cody and AI Studio. This is Amar. He
leads our product and design team for AI
Studio. Uh, but I want to step back for
a second and talk about uh the journey
at DeepMind generally. So what's I think
particularly unique about Google's
journey right now is that Deep Mind has
been innovating here for not just this
week or this past year but for years and
years uh with things like the
transformer, Alph Go, etc. And this is
obviously a graphic from 5 days ago
because it ends with Gemini 2.5 and we
[laughter] are super excited to have
announced earlier this week Gemini 3
Pro. Hopefully this message has reached
you all already. If not, we have a lot
of work to do. Uh but this is our latest
most intelligent state-of-the-art model.
Um and ultimately what we want folks to
understand with Gemini 3 is that we can
really build anything. And that comes in
two major capabilities. I think the
first is the UI and aesthetic
sensibilities of Gemini 3. It's very
very strong at design understanding and
generating websites and good UIs uh in
one shot. And the second is with aentic
tool calling. So I think this goes back
to the sort of spectrum we're seeing
with models. Sometimes you want a
oneshot website and sometimes you want
to do really complex tasks within you
know massive code bases and that's where
tool calling and agentic use can be uh
be particularly powerful. So with Gemini
3, we see on the right is a SWE
um experiment where it was a base agent
harness across a few different models
and we can see Gemini 3 is vastly above
uh in performance in agentic scenarios
and then as well leaps above our
previous models and state-of-the-art
across the board. Uh so super excited to
see what you folks build with this
model. Um, and in the meantime, we, you
know, launched this on Tuesday, but
there was still three days left in the
week, so we had to launch something else
as well. So, I hand it off to Amara to
talk about our pro image model.
>> Yeah. So, at Deep Mind, I think you have
a few days left in the week. You choose
to launch another breakthrough model.
And so, uh, we're really excited about
Nano Banana Pro, which came out
yesterday. Uh, and it's a huge leap on
our our already state-of-the-art image
model. So with Enerban Pro, uh, one of
the things that I love about it the most
is its world knowledge. So it's powered
by Google search. Uh, and so you can ask
it all sorts of things like how do I
make this tea? And it'll actually go
search Google search, create an a
detailed infographic for you and diagram
for you. Uh, and there all sorts of
things now with accurate information
that it can do. And the other thing
you're noticing here is improved text
rendering. So text is one of those small
details that if you get it wrong, you
can pretty much pick it up quickly. But
an Anima Pro 2 does an amazing job at
text rendering. Uh, and you can see that
in a bunch of examples like here where
it wraps around the can perfectly and it
also has a bunch of localization as
well. So tons of languages, Korean on
the right, so it can translate images as
well and render them perfectly on the
exact same reference image. Uh, on top
of that, consistency is improved. So uh,
you can now put up to 14 people in an
image and then can create this group
shot you can see on the right. uh and
that uh it can do more but 14 is
basically our our kind of benchmark so
far. Um and that also enables a whole
set of new use cases. Uh and then
creative controls as well. So you can
see here on the left the focus is on the
woman and on the right on the flowers
and this was just a simple prompt. All
you had to say was change the focus to
the flowers. Maintains everything in the
previous image just changes the focus.
So incredible outputs as well uh with
Nano Banana Pro and a range of aspect
ratio. So, if you want to generate uh
wallpapers or big banners or advertising
boards, you can do all of that as well.
Um, so anyway, instead of talking, we
decided we're just going to show you a
bunch of demos live of what we've been
building with these products uh over the
last week. Um, and yeah, excited to jump
into it. So, let's do that. Uh, all
right. So, cat,
>> yes,
>> take it away.
>> Here we go. Cat tabs. Um, cool. So, for
folks who aren't familiar, this is
Google AI Studio. It's our home for
getting started with the latest Gemini
models. You can get your API key, chat
with the latest models, including Gemini
3 and a Banana Pro. Uh, but today we're
going to be focusing on this build
experience. So, this is our vibe coding
experience in a studio. You can see here
we have a gallery of a bunch of example
apps, a bunch of very cool uh to the
aesthetics point of Gemini 3, a bunch of
very cool Gemini 3 examples. Um, but you
can also go prompt to apply here. And
this is free to use. And I think one of
the unique things about AI Studio is how
easy it is to integrate the Gemini API
into your application. So we can see
here at the bottom there's a bunch of
these what we call AI chips um that
showcase a ton of the unique features
beyond just the model you're choosing
with the Gemini API. Different tools you
can use like Google search grounding,
Google Maps grounding. We also let you
build with our live API. So, you can do
oneshot examples of I have one that lets
me input a webcam of my tennis swing and
it'll give live corrections on my swing.
Um,
>> you also made one to improve my posture.
>> Yeah. [laughter]
>> Yeah. If you lean forward too much, live
API will yell at you. Um, so it's a very
flexible way to get started building AI
powered apps. Um, and the other cool
thing is you don't actually need an API
key here for most of the models. So you
can build your application, you can
share it with the world and anyone who
comes and visits your shared application
will be using their AI studio free
quota. So you don't have to worry about,
you know, hopefully you have an app that
goes superval. You won't have to worry
about a crazy surprise API bill or
anything like that. Um, so I'm going to
actually shoot off a prompt here that is
using our latest nano banana model. And
that basically allows us to use Google
search grounding to create a
illustration of laptop stickers. And
this is one of the viral trends we've
been seeing with Nano Banana Pro. Um, so
I'll kick this off and what this will
do, I have the AI chip that tells it to
use the Pro model. And this will sum up
my prompt and go talk to Gemini 3 to
break down the task and start generating
my end-to-end application. Uh but while
that builds, I'm going to hand it off to
Amara to show some demos in the
meantime.
>> Cool. I think the other thing to point
out here is that uh we're trying to
think through how the vibe coding
experience is also powered by AI every
step of the way. So you're seeing here
even in the loading screen uh it is
using Gemini and thinking through this
app that you're making and how you could
extend it. Um, and so we're thinking
through breaking those typical vibe
coding paradigms as well and helping you
iterate with the model as your partner.
But anyway, let me jump right into the
text rendering demo. So, uh, when I
heard of text rendering for the first
time and the consistency that we were
getting with Naman Pro, my mind went to
comic books. Uh, and so I was thinking,
why can't I now be in my own comic book
adventure um, and also place cat in
there and then maybe we can tell the
story. And so uh in this app uh also
vibe coded you can just upload a face of
somebody. So I've got Sundur's face of
course [laughter]
but I'll use I'll use cat here uh and
myself um and then uh we can choose the
genre of the story um and and all the
languages that we have so far. Uh I'm
going to do a story about us presenting
at AI engineer um in New York uh
presenting AI studio and we are uh vibe
coding and winging our presentation.
That's where we're going to be doing
this comic book story. So we'll fire
that off. Uh but while we wait for that
um the other cool thing about this is uh
we'll wait for that to generate. But I
want to show you the design
sensibilities as well. So, you know that
if you've been working with AI models
and generating websites, they've been
creating purple gradients and things
that just, you know, they kill me as a
designer. So, um, and so it's been
really nice to see how this model is
able to build some beautiful websites.
So, this is using shader animations, uh,
flowing through all these different
pages, uh, and adds all sorts of cool
transitions and effects. Picked out the
typography by itself. And this was the
initial prompt. Just create a slick
animation website. kind of actually did
say no cyber punk should [laughter]
>> but suppose
>> just got to make sure [laughter]
>> but but yeah you get some incredible
results um and and now what I love about
this is so many folks who you know were
struggling with design who might have
you know still tried to gro their way
around Figma don't have to do that
anymore they can actually just go in
prompt their way to something pretty
nice okay back to the comic book okay
pretty flattering uh comic book here.
[laughter]
Um, that, you know, I'll take it. Uh,
and you can see here that it's rendering
the comic book. It's got, uh, rich text
rendering showing us the story. And the
other thing here is that, uh, because
it's powered by Gemini 3, it's actually
really creative at the story it's
generating. And honestly, some of these
stories have genuinely made me laugh,
which is the first time uh, that's
happened with one of these models. Uh,
and so you can see we're rushing to the
conference. even background details like
the AI engineer banner over here uh
being rendered and of course since this
is a vibe coded app we can take this
story in any direction. So one feature I
did introduce is that you can choose the
direction of the story midway. So you
know do we find a quiet corner and try
to check if our API keys work or do we
just embrace it and go full improv? I
think we're going to go full improv. Uh
and so that's changed that story. Uh,
and so talking about the humor here, you
can see Amar dodged a woman carrying a
suspiciously functional robot dog. So I
don't know if that was announced at the
conference today, but uh, pretty cool.
Um, and then now it's generating the
rest of the story here on the right
while we wait. So pretty cool to see how
you can make these really dynamic, rich
experiences with both the creativity of
the model and Nano Pro's image
capabilities.
>> Love it.
>> Yeah. Back to you, K.
>> Yeah. Yeah. I will show. Let's hope my
sticker demo is finished up. Uh, cool.
So, I'm gonna add an API key. So, Nana
Banana is a new model and it's fresh off
our launch of Gemini 3. So, for now it
is a page experience in a AI studio. Um,
but what I can do is I can see that here
I can enter different words that I want
my stickers based off or I can go use
Google search. So, let's try the Google
search. I'm going to type in a Mars
name. And one of the other cool things
about this new model is that you can
select the resolution as well. So, in
this case, I'll just do 1K. Uh, but what
this will hopefully do, but again, you
saw it on one shot live. Uh, is go talk
to Google search, grab their latest
sources on Amar, build the context about
what he likes, what his laptop stickers
might look like. I think it's just deep
mind, but if he were more uh if he
wanted to express himself more.
>> Oh, boy. Uh, and so he can see here.
>> Yeah, [laughter]
there he is. Weekend builder.
>> That's true.
>> Uh, yeah. And for those who don't know,
Amomar has a children's book, Alice and
Sparkle, which, yeah, it's clearly he's
talked about a lot because it's highly
represented here. [laughter]
>> But, um, very cool to see how it can
bring in that contextual knowledge. Um,
we've also seen this with like news
events, getting relevant information on
that day rather than having to rely on
the knowledge cut off of the model. Um
so one other thing I'll show you folks
is how we use AI studio to build AI
studio. Uh so Amar and I have a lot of
ideas only so many engineers to work on
these ideas. So we love to use AI studio
to ideulate and explore different
concepts. So one of the concepts we've
been working on is I'm sure you folks
have seen we announced a new agentic IDE
at Google earlier this week called
anti-gravity. And we know that sometimes
these web-based vibe codings tools you
they have their limits and you may want
to go into an IDE to add certain
features to the application or make it
specific to mobile things like that that
might be a bit limiting in AI studio
right now. So we want it to be super
easy to migrate into anti-gravity. So
what I did here was just a oneshot
prompt of a screenshot of AI studio. I
said clone this UI as closely as
possible and then add a flow to export
to our anti-gravity app. So we can see
it did a pretty great job of cloning
light mode. The screenshot was in light
mode too of our AI studio application
and copying it and improving a little
bit on Amar's designs. [laughter]
But then we see this new anti-gravity
button that is creating my an export and
then exporting it to anti-gravity. And I
can go and open in the IDE. And I think
these are the types of creative
interactions that web-based vibe coding
tools can be particularly useful for
because if we had went and jammed on
this feature, we probably would have
constrained ourselves to existing
patterns in AI Studio. And in this case,
I told the model, be creative, think
outside the box. And I've played with
this one a bunch. Sometimes it gives a
command line interface for ex or showing
the status of the export, etc. Uh, so I
think it's a super cool way for you to
ideulate on new ideas for UI and kind of
expand on your product. Uh, but I'll
hand it back to Omar.
>> Let's do it. Uh, and then the other
thing that Gemini 3 has been really
impressed like impressed us with is just
making video games. And so this one was
again pretty simple prompts. Make this
racing game where I have a bot now at a
start screen. Um, and so you can see I
got this 3D racing game in 3JS. I drew
all the things. I'm racing with a bot
here. Uh, and then one thing I added for
myself to cheat is I can just boost away
and beat the bot. So, uh, pretty nice.
But, but the thing I want to tease
actually is that, um, all of these apps
so far have been front-end React apps.
Uh, and so the thing that's coming very,
very soon to AI Studio is going to be
backend support um, and full stack
runtime. So, if you want to install
Shadci and if you want to do all of
those things, you'll be able to do that
again with one prompt. And the principle
with AI Studio here is that we don't
want you to think about those details.
You should just be able to ask, I want
to make a multiplayer app and we know
that you need to use Express and wire
that all up for you. Uh, and abstract
all those details away. So, we're going
to try something a little risky here,
which is we did turn this racing game
into a multiplayer one. Um, and uh, this
was again a couple of prompts. Uh, so
we're going to put a QR code up if you
want to join us uh, in the racing game.
We've never tried with nearly this many
people, so we'll see. [laughter]
>> Hopefully this works.
>> Uh, but QR codes up here. Uh, so if you
scan that, hopefully should load the
game. I'm really afraid of how this is
going to explode.
>> Here we go.
>> All these cars loading [laughter] in.
>> Nice.
>> So yeah, people have scanned that. We
can switch back to the game. Okay. Oh my
god. [laughter]
>> So yeah, just hit ready uh when you're
all ready. [laughter]
Oh boy.
>> I think this lobby is going to explode.
[laughter]
>> Everyone leave.
>> So, this is where I shouldn't have added
collisions with other cards because you
[laughter] could clearly see that we're
bouncing around.
>> 19 players, 20 players. I don't know if
this race will ever start, but we're all
blocked on the uh you know, the start
line. But 23 players, pretty cool. Uh
yeah, you do all have to hit ready for
us to start this race. So, [laughter]
>> so [gasps] we might be here all day. Uh,
but yeah, that is pretty pretty
incredible. Um, I can't start this race.
So, do you want to wrap up? [laughter]
>> Hope to see you all.
>> That's pretty cool. The runtime didn't
explode.
>> Yeah. And I think we're super excited
not only about the multiplayer game. So,
next time we'll have even more of you
folks join, but also, you know, the
extensibility that comes with a full
stack runtime. uh we want to make it
super easy for you to integrate with our
oneP and popular thirdparty APIs etc. So
very exciting next few months on the AI
studio vibe coding side and super
excited for you all to try it. Um but I
think the one thing I want to step back
and emphasize is what makes us so
excited about this project and the work
that a lot of us are doing is that we
get to be the first generation of
engineers who are building tools for a
world where anyone can build software.
So I think what's beautiful about things
like vibe coding is watching people. We
actually talking to a tech support
person earlier this morning who said
they started vibe coding an AI studio
after seeing a YouTube video and we're
really democratizing who can create
things and we're all getting to build
those tools that enable that and I think
it forces us to rethink the paradigms
that we've become so used to. So it may
not be your base IDE that people are
starting from, but how can we intuit it
as much of the user intent as possible
and that's what we want to do with full
stack runtime and AI studio is make it
very easy to not have to think about I
want to add a database but if your app
needs storage it'll have storage. If you
want to have a if you have an e-commerce
app we'll add a payment solution and
make it as easy as possible to build the
future of software. Um so thank you
folks for joining us. If you have any
cool examples you've built or questions,
feel free to ping me and Amar on
Twitter. Uh, and yeah, enjoy the rest of
the day.
>> Yeah, thank you. [applause]
>> What if the reason you're struggling
with agents is not the agents
themselves, but the environments in
which they operate? Here to present us
with eight categories to make your
codebase agent ready is co-founder and
CTO of factory Eno Reyes.
[music]
Hey everybody, my name is Eno. Uh really
pumped to talk today about uh something
that at Factory we care a lot about. uh
when we started 2 and a half years ago
uh we said that our mission is to bring
autonomy to software engineering. Um and
that is like got a ton of loaded words
in it. That sounds a little buzzwordy
right now, but I think that the my goal
is that you guys leave this like roughly
20 minutes uh with a bunch of insights
that will apply to your organization uh
and the teams that you build, the
companies you advise, um and if you're
building products in the space, uh
insight into like sort of maybe how to
think about building autonomous systems
and also making your engineering org one
that's able to use agents really
successfully. Um, a sort of like plus of
this is that ideally this applies to any
tools you're using that involve AI. So
it won't be specific to like our product
or any of the other amazing tools out
there. Um, I'd like to start with a
little bit about uh, you know, Andre
Karpathy had a very welltimed tweet. Uh,
so of course I'm going to mention it.
Uh, you know, he he kind of talked about
uh, this idea of software 2.0 coming
from auto uh, the the the ability to
verify things, right? Um, this is
something that's in sort of like the the
mind of Silicon Valley right now as uh
the most frontier models are built with
post- training that involve lots of like
verifiable tasks. Um, and really I think
the most interesting thing here is the
sort of frontier and boundary of what
can be solved by AI systems is really
just a uh sort of an input function of
whether or not you can specify an
objective and search through the space
of possible uh solutions, right? And so
uh we are used to building software uh
purely via specification. We say like
the algorithm does this and like input
is x output is y. But if you sort of
shift your mindset to thinking about
automation via verification uh it is a
little bit of a of a difference in what
is possible to build. Um and there is
another great blog post by uh Jason
where he talks about the asymmetry of
verification. Uh this is like pretty
intuitive to most people who know about
like P versus NP. Uh it's like a a thing
that a lot of people have talked about
throughout the like history of computing
and and software. But there are a ton of
tasks that are much easier to verify
than they are to solve. Um and and vice
versa. But but the the most interesting
sorts of uh easy to verify problems are
ones where there's an objective truth.
They're pretty quick to validate whether
or not they're true. Uh they're
scalable. So validating a bunch of these
things maybe in parallel. uh is easy. Um
it's low noise, so your chance of
validating it is like really really
high. Um and they have continuous sort
of signals. Uh it's not just like a
binary yes no, but like maybe you're 30%
70% 100% accurate or correct. Um and you
know, the reason I bring both these
things up is software development is
highly verifiable, right? This is like
the frontier. It's why software
development agents are the most advanced
agents in the world right now. uh and
there are so much uh there's so much
work that has been put in uh over the
last you know 20 to 30 years around the
automated validation and verification of
software that you build um testing right
unit tests end to end tests QA tests
right um the frontier of this is
expanding there's tons of cool companies
like browser base and computer use
agents and all these things that are
making it easier to validate uh really
complex visual or front-end changes um
docs right having like an open API spec
for your codebase uh is something that
can be automated. It's validated. Um I I
I can go through and enumerate a bunch
of these, but I actually think it is
sort of a nice checklist for yourself,
right? Do you have some automated
validation for the format of your code?
Uh do you have llinters? These things
for professional software engineers are
sort of like, yeah, of course we do. But
I think you can go a step further,
right? This is where that continuous
validation component comes in. Um, do
you have llinters that are so
opinionated that a coding agent will
always make code that is exactly at the
level of what your senior engineers will
produce? How do you do that? What does
that even mean? Right? Do you have tests
that will fail when AI slop has been
introduced? Uh, and when highquality AI
code is introduced, those tests pass,
right? These additional layers of
validators are things that most code
bases actually lack because humans are
pretty good at handling most of this
stuff without the automated validation.
Right? Your company may be at some test
coverage rate that's like 50% or 60%.
And that's good enough because humans
will test manually. Um you may have a
flaky build that every third build it
sort of fails and everyone at your
company secretly hates it but no one
says anything, right? These are the
sorts of things that we know are true
about large code bases. And as you scale
out to extremely large code bases,
organizations with 44,000 plus
engineers, right? Uh this starts to
become a very accepted norm that the bar
is sort of maybe at 50% or 60%. Um and
the reality is is most software orgs can
actually scale like that. uh it's sort
of fine to be at that lower uh barrier,
but when you start introducing AI agents
into your software development life
cycle, and I don't just mean in
interactive coding, but really across
the board, right? Uh review,
documentation, testing, all this stuff.
Um this breaks their capabilities. Most
of you have probably only seen an AI
agent that operates in a codebase that
has uh a decent amount of validation. Um
I think a lot of the best companies in
the world right now actually have
introduced very rigorous validation
criteria and it means that their ability
to use agents is significantly greater
than that your like average uh
developer.
Uh you know and and if you think about
it this like traditional loop of
understanding a problem designing a
solution to the problem coding it out
and then testing it uh sort of shifts if
you have really rigorous validation. Uh
it becomes a process of when you're
using agents specifying the constraints
by which you would like to be validated
and what should be built. Uh generating
solutions to that outcome verifying uh
both with your automated validation as
well as with your your own intuition. Um
and then iteration where you continue to
iterate on that loop. This move from
sort of like traditional development to
spec specificationdriven development is
one that we're starting to see sort of
bleed into all of the different tools.
Different tools have spec mode. Droids
have like our Droid is our coding agent
have like specification mode, plan mode.
Uh there are entire idees that orient
you around this like specificationdriven
flow. Um and if you combine these two
things together, this is really how you
build reliable and highquality
solutions. So if you think about it,
what is like the best decision for you
to make as an organization? Is it
spending 45 days comparing every single
possible coding tool in the space and
then determining that one tool is
slightly better because it's 10% more
accurate at Swebench or is it making
changes to your organizational practices
that enable all of these coding agents
to succeed and then picking one that
you're, you know, developers like or
honestly letting people choose from the
tons of amazing tools out there.
And when you have these validation
criteria, you can actually introduce way
more complex AI workflows to your
organization, right? Uh if you cannot
automatically validate whether or not a
uh a PR is like reasonably successful or
has code that won't definitely break
prod, you are not going to be
parallelizing several like agents at
once, right? you are not going to be
decomposing a largecale modernization
project uh into a bunch of different
subtasks like that is that is a very
frontier style task to use AI for and if
the single task execution right the
simple I would like to get this done
here's exactly how I'd like it to be
done and here's how you should validate
if that does not work nearly 100% of the
time you can sort of forget successfully
using these other things at scale in
your company um when you get into other
tools like code review, right? Uh if you
want a really high quality AI generated
code review, you need documentation for
your AI systems. Uh and yes, uh agents
will get better at, you know, picking
out, you know, whether or not to run
lint or test. They will get better at
finding solutions when you don't have
explicit pointers. They'll get better at
search, but they won't get better at
just randomly creating this validation
criteria out of thin air. Right? This is
why we believe software developers, by
the way, are going to continue to be
heavily involved in the process of
building software because your role
starts to shift to curating the sort of
environment and garden that your
software is built from. You're setting
the constraints. You're building these
automations and introducing continued
opinionatedness
uh into the uh into these automations.
Um, and you know, if your company
doesn't have at least all of these,
right? Then that means that there's a
lot of work that you can do totally
absent of a procurement cycle or buying
one tool or trying out another one. Uh,
and so plug is that we help
organizations do this, right? I think
that it's great to have tools that allow
you to uh go in and assess this stuff.
They have ROI analytics that let you
interact. Um but I think that for most
organizations uh there is actually like
a very clear way to do this right you
can go and analyze where are you across
those eight different pillars of like
automated validation do you have a
llinter how good is the llinter do you
have agents MD files an open standard
that almost every single coding agent
supports um you can improve uh and
systematically enhance uh these
different validation criteria uh and you
can go through and say Well, we're
seeing that coding agents are reliable
enough for a senior developer to use,
but our junior developers, if you have
the tooling to tell, by the way, like
which developer is using what tools, you
you can ask questions like maybe our
junior developers are actually totally
unable to use these coding agents. And
you'll learn that the reason why is not
because they're like more incompetent or
they don't know how to use the tool, but
because there's these niche practices
that you don't have automated validation
for, right? And if you think about what
what is the difference between a like
Google or a meta and a uh a still large
but like 2,000 person engineering or the
difference is that a newrad with
effectively zero context can go and ship
a change to make YouTube's like boundary
like slightly more round and it won't
with some degree of confidence take down
YouTube for like a billion users, right?
And the reason that's possible is
because of the insane amounts of
validation that have to happen on that
code for it to be shipped. The big
difference that we now have is we have
coding agents that can go and identify
exactly where these gaps are and they
can actually remediate those fixes.
Right? So you can ask a coding agent,
could you figure out where we're not
being opinionated enough about our
llinters. You can ask a coding agent to
generate tests. We have an engineer
named Alvin who I love this quote. He
said a slop test is better than no test.
Uh, and I think that that's slightly
controversial, but the thing that I
would argue here is that just having
something there, right, that passes uh,
when changes are correct and somewhat
accurately uh, matches to the spec of
what you want built, uh, people will
enhance it. They'll upgrade it and other
agents will actually notice these tests.
They will follow the patterns. So the
more opinionated you get, the faster the
cycle continues. So I think that what
you guys should be thinking about is
what are the feedback loops in our
organization that we are catering
towards. If you have better agents, they
will make the environment better which
will make the agents better which will
mean you have more time to make the
environment better. And this is sort of
the new DevX loop as well that
organizations can invest in uh that will
enhance all of the tools that you're
procuring, right? So no matter whether
it's a code review tool, a coding agent,
etc. they will all benefit. Um and I
would argue that it sort of shifts your
mental model about what you're as a
leader investing in when you're
investing in your software. Right now
the idea of uh you know opex as like the
input to engineering projects like we
are investing in we want more people in
order to solve this problem we need 10
more people. Um I would I would argue
that uh the other thing that you can now
start investing in is this environment
feedback loop that enables these
additional people to be significantly
more successful. Right? And I think that
that's the feedback loop that can
actually take quite a lot of value
because coding agents can just scale
this out. So, you know, all of this is
to say there's a lot that can be done
outside of the like product itself uh to
enable these systems and the best coding
agents will actually take advantage of
these validation loops, right? So, if
your coding agent isn't proactively
seeking llinters, tests, etc., Then you
know at the end of the day it's not
going to be as good as one that will
seek those validation criteria. And in
addition to that, when organizations uh
uh think about these sorts of things, if
you're the person who's able to say,
"Here's my opinion. Here's how I want
software to be built," it scales your
capabilities out greater than ever
before. Like one opinionated engineer
can actually meaningfully change the
velocity of the entire business if you
take this to heart. Uh and you have a
way to measure and systematically
improve. Um, so that's uh, you know, the
the majority of, uh, what I came here to
say. I think that the the the only thing
that I'd leave you with uh is that when
you think about where AI is going and
like where we're at today, we are still
really earn early in our journey of
using software development agents. If
you want a world where the moment a
customer issue comes in, a bug is filed,
that ticket is picked up, a coding agent
executes on that, that feedback is
presented to a developer, they click
approve, that code is merged and
deployed to production in a feedback
loop that takes maybe an hour, 2 hours.
That will be possible, right? We all are
sort of skeptical about that fully
autonomous flow. That is technically
feasible today. The limiter is not the
capability of the coding agent. The
limit is your organization's validation
criteria. So this is like an investment
that made today will make your
organization not 1.5x, not 2x, but that
is where the real like 5x, 6x, 7x comes
from. Um, and it's sort of a an easy
thing to say and it's an unfortunate
story because what that means is you
have to invest in this. It's not
something that like AI will just
magically give to you. Uh it's a choice
that you as an organization have. Uh and
if you make it now, I can guarantee you
that you will be in the top 1 5% of
organizations in terms of edge velocity.
Um and you will out compete everybody
else in the field. So highly recommend
investing in this sort of stuff and
hopefully you found this helpful and
have some lessons to take home. Thanks.
[applause]
[music]
Our
[music] next presenter is the co-founder
and CTO of Source Graph and Ampode
[music]
here to provide an overview of Ampcode's
approach to AI powered software
development. Please join me in welcoming
to the stage Banglu.
[music]
[applause]
Hey everyone, how's everyone doing
today?
>> Yeah, cool. Pretty cool conference, huh?
Um, so yeah, my name is Bang. I'm here
to talk about AMP. AMP is an opinionated
Frontier Agent. Uh, so before I get into
what that means, uh, who are we? Uh,
we're the bunch of weirdos downstairs at
the booth with the weird pied piper dude
on the floating golden fish. Uh, and I
think that kind of captures the ethos of
what we're trying to do uh, with AMP.
We're trying to lean into that sense of
awe and absurdity that I think we all
experience right now living in this
weird world we're living in where agents
are writing an increasingly large amount
of of our code. Uh and it's just kind of
like weird and magical. Like if you
imagine how you were working like a year
ago compared to how you're working now,
it it feels completely different. And so
we're embracing that sense of change and
we really want to be the agent research
lab that's sort of like living one year
in the future and figuring out how this
all kind of pans out. Okay, so what is
AMP actually? Well, it's a it's a coding
agent that you can invoke from the
terminal. So here's our terminal UI. Uh
we actually ended up building a complete
terminal UI framework up from scratch
because we wanted to take advantage of
all the capabilities of modern
terminals. And one of the balances we
tried to strike in in this UI is we try
to show the right amount of information
to the user that conveys what the agent
is doing without overwhelming you with,
you know, every single token of
explanation uh that the model is
generating. We stream the diffs that
it's making. Uh we show you what CLI
commands it's using. And if you look in
the bottom right hand corner there,
you'll see a little Emacs 30.1 thing.
This also connects to the editor that
you're using where it collects
diagnostics. So Emacs, Neovim, Jet
Brains. Uh you can connect the the CLI
to your editor uh to collect additional
information that's relevant to the task
at hand. And so this particular video is
just AMP implementing a small feature to
itself. uh we asked it actually to add a
little help button in the bottom
lefthand corner. Uh and so that's just a
quick demo to show you that uh the agent
is pretty good at finding the relevant
context and iterating towards that. Uh
we also have an editor experience. Uh so
we've not found the motivation yet to
fork VS code. Maybe we will in the
future, but right now this installs into
VS Code or any of its derivatives,
Cursor, Windsurf, uh anti-gravity. Um,
and the idea here is you really write
all your code through this agent panel.
At least I do. Um, I I actually spend
very little time, you know, actually
manually editing code now. And one of
the bottlenecks we identified in the
editor is I don't know about you, but I
spend most of my time effectively doing
code review now. Um, just in the editor
trying to read through all the agent
output. That's the thing that constrains
me from, uh, fully paralyzing, you know,
two 3x the number of agents that I can
run at at a given time. So we built a re
reu interface that I'll talk about uh in
more in depth uh in a bit that uh kind
of helps you streamline that process,
guides you through the process of
understanding what the agent wrote so
that you can ensure that you're not
shipping something that's super sloppy
or spaghetti.
Okay, so I hear all of you thinking
like, okay, yeah, yeah, it looks pretty,
but what actually is different? You
know, why is this better than the like
20 other coding agents uh here? And I
think the best way to convey this is I'm
not going to try to convince you that
it's better. I think that is ultimately
up to you trying different things out
and seeing what actually works. But I am
going to try to convince you that we're
thinking about things in a very
different, opinionated, and weird
manner. So I want to take you on the
journey of us building AMP and all the
different sort of contrarian or spicy
takes that we've made uh decision-wise
in the architecture of of the agent
along the way. Okay, so let's start at
the beginning. Um hello agent. What is
an agent at its core? Well, all an agent
is as I'm sure most of you know uh is
it's a for loop uh with tool calls and a
model uh in the middle. And the reason I
want to present this slide is because
think of it this way really tells you
what sort of levers you have to pull as
a builder of agent. Uh there there's
certain things that you can change. You
can change the choice of model. You can
change the tool descriptions and you can
change uh how the model iterates with
those tools. And those are effectively
your levers. Seems like a few amount of
levers but you know just like
programming languages all those are
syntactic sugar around if statements and
for loops. You know you can get a
surprisingly wide variance of behaviors
and complexity out of that.
And so one of the key lovers in building
any agent is the set of tools. And these
days you cannot talk about tools without
talking about MCP. So one of the early
decisions we had to make in building AMP
is how much do we invest in the MCP
integrations? And MCP is this amazing
new protocol that's gotten everyone and
their mom thinking about how to provide
context to agents. Um should we lean
into that or should we start building
our own custom tool set? And our very
opinionated take, I think this is maybe,
you know, less controversial now than it
was, you know, back in in April, was
that we should really actually focus
most of our attention on the core uh set
of tools within AMP. And that's really
for two reasons. One is because the more
you work with agents, the more you find
uh out that what you're trying to do is
identify these feedback loops and help
the agent close them. And in order to do
that, you need a refined tool set that
is really geared toward helping the
agent find those loops. And you cannot
do that with MCP servers. The creator of
the MCP server doesn't know what your
agent is trying to do. And so they're
not going to tune the tool descriptions
to what you're trying to accomplish. And
then the second piece of this is context
confusion. So the more tools that you
add into the context window, uh the more
things that the agent has to choose
from. And if the tools aren't relevant
to the task at hand, it ends up getting
confused. So we've leaned hard into this
uh custom tool set. And you'll see a
little bit more about that in just a
little bit. But before that, I wanted to
call out another issue with uh tool use,
which is it's not just tool descriptions
that eat up context. It's the tool calls
themselves that also eat up context. And
so, everyone who's built an agent has
run into a context exhaustion problem
where, you know, if you use any sort of
coding agent, if it's good, it's going
to go out and try to find a bunch of
relevant context by grepping and reading
files first. And by the time it gets to
editing, there's only a small amount of
context window left. And so, maybe it
has to stop prematurely. And so the
naive way to fix this is just to prompt
it to, you know, do less reads. So you
can do more iterations on the edit side.
But then this leads to another failure
mode which I call the doom loop mode
which is it doesn't gather enough
context in the beginning and so it ends
up not figuring out what it needs to do
and just retries the same thing over and
over again. And so the way to solve this
is really with uh sub agents. So sub
aents are the analog to subutine calls
in regular programming languages. This
is how you can factor out the context
window used for a subtask into a
separate context window which is the sub
aents context window. Uh it can do all
the things it needs and then at the end
of the day it only returns the relevant
results to the main uh agent window. So
sub aents are effectively a way to
conserve and extend the context window
of your main agent. Uh so sub aents sub
agents are great. I think everyone uh
building agents has probably heard of or
or use sub aents uh so far, but I think
we have a unique take on sub aents,
which is we're not really doing generic
sub uh sub aents where you kind of tweak
the system prompt and tweak the tool set
a little bit. We've really leaned into
our sub aents. Uh and so we have uh
three to four really core sub aents that
really extend the functionality and
capability uh of AMP itself. The first
one is something that we call the
finder. So this is effectively our
codebase uh search sub aent. It's gone
through an evolution of models and we've
ended up at the point now where we're
using a relatively small and quick model
to drive a limited tool set that we
found really is optimal for quickly
discovering uh relevant context within
the codebase.
Another sub agent that we've implemented
is this thing that we call the oracle.
So this is how AMP does reasoning. So in
contrast to most agents which uh you
know implement reasoning in the model
selection part of the experience, we
found the best way to implement
reasoning models is really through a sub
agent. What that allows you to do is
preserve the relative like snappiness uh
in the main agent as well as its ability
to to use a variety of different tools
and then only when you need to debug a
tricky problem or plan something very
nuance, it drops into this Oracle sub
agent and figures things out. And this
is something that's like kind of
magical. It's like anytime the main
agent has trouble uh figuring out
something and I'm like I don't want to
spend like one to two hours going down
this rabbit hole, I just like tag the
oracles. So like invoke the oracle,
think really hard, I go like alt tab,
check my email for a bit and sometimes
it takes a few minutes because it's
thinking really deeply, but I think like
four out of five times it just magically
finds uh uh the underlying issue. We
also have a librarian sub agent which is
meant to fetch context beyond the
codebase. So from libraries and
frameworks that you depend on. And then
there's a new experimental sub aent that
we call the Kraken. Uh its job is it
doesn't edit code files uh one by one.
Uh it really is all about writing code
mods to do these kind of like large
scale refactors. So we're leaning hard
into the sub aents and uh that's really
in contrast to a lot of the existing
coding agents. I think almost every
other coding agent implements a model
selector as one of the core uh UX
components and we just don't think that
this is the architecture of the future.
I get that, you know, developers like
choice or at least the the possibility
of choice. But the problem with choice
is that there's also a paradox of
choice. The more choices that you have,
the more uh kind of like cognitive
burden it is to choose from these
different models. And that means at the
architectural level, if you have n
different models and one agent harness
that you can only lightly customize each
model, it means you're never really
optimizing for what any one given model
uh can do. And so AMP's architecture is
much more agent-oriented. We have two
top level agents, a smart agent and a
rush agent. And the smart agent is the
one that has access to all those fancy
sub agents and can do a lot of things.
It's it's a little bit slower, but you
can hand it more complex instructions.
And then the rush agent is for uh those
kind of like in the loop tasks where you
want to be tight in the loop and you're
making quick targeted uh edits to to the
code.
And why do we have two top agents? It's
really we're trying to kind of like pick
points along the frontier of
intelligence and speed that are
meaningful to the user experience. So in
in talking to our users, we found that
there's kind of two modalities for
invoking agents. Now one is you kind of
like spin off a task and have it run and
then review the code when it's finished
asynchronously. Uh or you want to be in
the loop, you know, quickly having the
agent make edits while you quickly
review them one by one. Kind of like
babysitting the agent uh in the inner
loop.
And we're very intentional about the
model choice here. We've only switched
the smart model once and that was
actually two days ago uh when Gemini 3
uh was released. And I think you know
the the reaction Gemini 3 has been
really interesting to watch. I think
you'll see widely different behavior
from Gemini 3 in different uh agent
settings. So for those of you who've
tried it out in other settings, I highly
encourage you to uh try it in AMP. We
did a lot of testing in the week before
the release to optimize the smart agent
to take full advantage of of its
capabilities. And uh we're absolutely
loving it. We're still working through
some kinks obviously because it's a a
new model, but we feel confident that it
it's again moved the frontier of what's
possible.
Okay, so we talked a lot about like
agent construction, the behavior. I want
to talk a little bit about the UI layer
of agents as well. So, you know, editor
versus terminal, we're doing both. Um,
and I think that's because both of them
tackle kind of like different modalities
of working. Uh, but we do have
opinionated takes uh in each interface.
So, in the editor, I think of my editor
now more as a readitor uh uh more than
anything else because uh I don't know
like if you're using agents heavily, I
don't think you're really editing all
that much uh in your editor anymore.
You're mainly driving edits through the
agent panel, which is what you see on
the right hand side or the right hand
side here. And then what I do in in my
editors, I pop over to the side panel,
which is optimized for reviewing
different diffs. So we actually built a
custom diff viewer for the way that
people are consuming agentic output. You
can select any arbitrary commit range
quickly view through the file level
diffs. All the diffs are editable and
you have full code navigation. So go to
definition find references and there's a
feature at the bottom that gives you a
tour of the change. So it actually
guides you through which files you
should read first because I find half
the battle when reviewing a large change
is figuring out where to start. So the
guey aspect of the editor allows us to
build a very rich uh experience uh for
for this type of thing. And then
meanwhile in the terminal um we really
want to take advantage full advantage of
the the features and rendering
capabilities of modern terminals. So uh
we actually have one of the core
contributors to Ghosty uh the open
source uh terminal uh that built a uh a
TUI framework from scratch to power the
AMPUI. So one of the nice things that we
can do is just to point out a little
detail the the green color of the diff
rendering on the left hand side terminal
it's actually we can have the terminal
mix in the color green with whatever
background color it's using. So that
allows for a much nicer display. At the
same time we know that people use all
sorts of terminals uh including like
terminals in Jet Brains or VS Code and
other editors and so we've added the
ability to gracefully degrade. So even
if you're using AMP in like the default
Mac OS terminal, it falls back to the
capabilities that uh are uh available in
in that setting.
Another aspect about how we're thinking
about coding agents is really from the
how do we get people to learn this new
craft? Like we think that uh human
developers are going to be around for a
long long time, but we essentially have
to relearn the craft of how to code uh
together. And so one of the first
features that we built into AMP was the
ability to share threads with your
teammates. So, if you're using AMP on
your team, you can go and see like how
much code people are changing with AMP
over a given period of time. And you can
poke into specific threads to see how
they're doing things. And people love
this feature because essentially like
link threads to AMP and say like, "Hey,
here's a cool prompting technique that I
discovered. Try it like this." Or, "Hey,
here it got stuck here. Can you help me,
you know, uh think through how better to
to connect the agent with the feedback
loop to get further?"
uh another aspect of of uh enabling more
people to experience uh coding agents
and learn how they work is by making it
more accessible from an economic
perspective. So um you know remember the
smart and uh rush uh agents at the top
level. You know smart models remain
relatively expensive today but rush
models are getting cheaper and cheaper
but not yet free. And so we're thinking
about you know more and more like one of
the the biggest barriers to using agents
fully is actually cost right now. Like
if you go to like college campuses uh
and talk to students, the actual number
of people who have used a coding agent
is actually much smaller than I would
have thought given you know young
people's uh propensity to adopt new
technology. A lot of this cost. So
someone had the crazy idea on our team
like hey you know what we could do we
could ship ads in your terminal. And at
first it was like nah that'll never
work. But the more and more we thought
about it and the more and more like
inference costs started declining we're
like yeah maybe. So, we actually shipped
uh a mini ad network that delivers ads
for other developer tools uh in in AMP
in the terminal and in the editor. Uh
they're very subtle. So, I don't know if
you can spot the ad in this screenshot,
but we try to make them non-intrusive.
But this effectively allows us to
sponsor inference uh in in the rush uh
agent so that uh more people are able to
experience this on you know their side
projects and such.
Okay. So, AMP is AMP. Uh we are like I
said a we think of ourselves as like an
agentic research lab. So we're not about
uh hype. We don't do any sort of like
paid developer influencer marketing. But
I like to call out some cool people that
I think are using AMP. Um because it it
shows for the type of people that we're
really selecting for. I I don't think
AMP is for everyone uh at this point.
We're really trying to target the the
like small percentage of people who want
to live a little bit in the future. Um,
and so we have folks like Mitchell
Hashimoto, the uh the founder and and
XCO of Hashi Corp. He's building Ghosty
now. Uh, that's his uh kind of passion
project and he's using AMP to drive a
lot of the changes that he makes uh to
that terminal. And then we also have
folks like Hamill Hussein who's I think
probably like the leading authority on
AI evals. Um, and at least as of a
couple weeks ago uh he was saying that
AMP was his favorite coding agent. And
so, uh, neither of them are on the team
or, you know, have invested us in any
way, but we're just thrilled that, you
know, they seem to like like what we're
building.
And then if other folks are interested
in in kind of like coming along with us
in in in this journey and trying to push
the frontier of what agents can do, uh,
we've also started a community of
builders. Um, and using AMP is not a
requirement to join this community. It's
run by uh Ryan Carson who's a former
startup uh Treehouse taught over a
million people to code and now this is
his passion project. It's essentially
like if you're building with agents and
you're you're experimenting with how to
push them further and further. There's
Ryan right there. Um it it's all about
kind of like tapping into that sense of
awe and wonder with a pure group uh that
is also uh leaning into that uh sense of
of strangeness and and experimentation.
So um what does this involve? It
involves uh like regular interviews with
people. We like to feature people who
are building interesting things or using
agents in interesting ways. Uh and we
also do inerson events. We had a very
nice dinner last night where we got a
bunch of people together and had very
interesting conversations spanning from
you know actually building with coding
agents to you know more philosophical
discussions about uh the nature of AI
and things like that. So um that's it
for me. Uh hopefully this has in
intrigued you. Again, I I don't expect
all of you to be convinced that we are
building the best Frontier coding agent,
but at the very least, I hope I've kind
of demonstrated how we're leaning into
the weird and thinking about things uh
differently. So, if that's interesting
to you, come say hi at our booth. Just
look for the weird pipe piper man
writing the golden fish. Thank you.
[applause]
[music]
Our
next speaker will discuss how AI
generated kernels can meaningfully speed
up custom PyTorch code without any human
effort. Please join me in welcoming to
the stage the co-founder of Gimlet Labs,
Natalie Serino.
[music]
Hey everyone, how's it going?
So, my name is Natalie. I'm a co-founder
of Gimlet Labs. And um yeah, just a
little bit of background about why
Gimlet's looking at AI generated
kernels. Let's just get right to it. Um
we're building an agentic inference
cloud focused on performance and
efficiency. And the thing that we've
seen with all these talks so far is with
agents, they're not just one chat model.
There are complex pipelines of multiple
models, multiple stages, tool calls and
the compute backing these is inherently
should be heterogeneous. So what we do
is we automatically split up and
orchestrate these agentic workloads
across optimal hardware which can be
different vendors and different sizes.
This can present a problem at the kernel
level because a lot of times you have
models that are really optimized just
for one hardware. So what we started
looking at is can we use AI to help
automatically port different segments of
aentic workloads to hardware that hasn't
necessarily been optimized for.
So just to clarify something really
quick because we run into this a lot.
What do I mean by kernels? I do not mean
AI generating operating systems like the
Linux kernel or things like that. What I
mean is kernels at the sense of like
transformer architecture like the
individual like functions that perform
like massive parallel computations
leveraging like all the crazy amounts of
threads that GPUs have. So yes, people
be like, "Oh, how are you going to
generate an operating system?" I think
maybe we're not quite there yet, but one
day.
So why use AI to do this? So I think
there's a few reasons. So, we know that
optimizing low-level kernels can make
workloads like ML workloads
significantly faster. So, here we have
like it's probably too small to see, but
it's a blog from Nvidia where they
implemented a different attention and it
allowed them to like get 3x throughput
on a llama model. So, these like
implementations can make a major
difference from a performance
perspective. But at the same time, if
you just search Twitter, everyone's
whining about how it's impossible to
find these people and the people that
exist are like really overt taxed with
so much to do, so much work. There's
just not enough experts to be able to
solve every problem in this in this
space right now.
And the problem explodes because you
have so many frameworks and so many ways
to write kernels from things like CUDA
and like Triton to Palace to things like
that are device specific like Metal and
you have different hardware platforms
too and each of these hardware platforms
even within a single vendor has
different characteristics. We've seen
for example that some of the new um like
hardware from Nvidia like some of the
old kind of like DSLs weren't working as
well on it because the different
hardware has different properties it has
different features it has different
characteristics different cache sizes
etc all of which impact the optimized
implementation from a kernel perspective
so we and many others in the space have
thought it would be great if AI could
help us with this problem where you
could potenti essentially give it
PyTorch code and then generate optimized
implementations for whatever hardware
you're trying to run that workload on.
So I think when you're trying to use an
agent for something, you have to start
with what the human workflow is. And the
human workflow today, when you have that
like really hardcore kernel expert,
let's say they're trying to port a new
uh like workload over to Metal, right?
What they'll do is they'll say, "Okay, I
have this implementation. Maybe I have a
CUDA version, maybe I don't, and I'm
going to try something. I'll see if it
compiles." Most of the time, maybe not.
I'll see if it runs.
If see if it's correct, and if none of
those are the case, you just pass that
back into the human context, so to
speak. And then once you get something
that's working, then you start looking
at the profiling information in depth
and just hammering down like this is the
bottleneck now. This is the bottleneck
now. This is the bottle of the mic now.
It's a very iterative process.
So I think that you know basically the
idea here is to put AI as the as the
kind of like
where the human would go in that same
loop right so the agentic flow here is
to make sure it compiles and it executes
and it's correct and then from there
optimizing it.
So this is something that I would say is
like very new technology. There's a lot
of interest here, but there's some
things that it's good at and some things
that it's still kind of in development
for. And so, let's dive into some of the
specifics.
So, this is a quick demo of our system.
It the font's kind of small, but we're
passing into a CLI tool, a PyTorch
workload, targeting it to an H100, and
the system had explored a bunch of
candidate optimizations. It's comparing
to eager mode and torch compile and it
found one uh candidate that was 22%
faster than the torch compile baseline.
So this was a real case. It's just sped
up because it actually took about 20
minutes.
So there's some challenges though with
measuring these agents at kernel
synthesis. So um like first of all you
have to figure out what your definition
of correct is when you're dealing with
floating point. This is always a
question. You can do different types of
tolerances, but you also need to make
sure your input sizes are well selected.
If you're only passing in really small
input sizes, it can cause problems with
the benchmarking where you're measuring
the overhead, not the actual kernel as
the critical path.
You also have to make sure you're
reliably measuring performance. So, if
you just do a naive timer start on your
implementation, it's probably going to
be wrong. And there was a great blog
that had a diagram for this because
you're basically measuring the launch
time, not the execution time. So there's
a bunch of kind of gotchas like that
that when you're building an agentic
system like this, you have to be really
really careful about catching doing
things like warm-ups and cache clearing
because a lot of times you'll have
you'll have the original implementation
run and then the new implementation run
and then the original ones result is
cached and the new one fetches it. So
you have all kinds of things like that
that you have to be really neurotic
about otherwise you might get bad
results.
You also need great benchmarks for this.
I think that someone said earlier that
there's not a ton of examples of
low-level kernels across all these
different hardware platforms. And so the
input data is a challenge and also
benchmarking it is a challenge. Like how
do you know if your agent is better? You
change the prompt to it. How do you
know? It's the same story we hear with
every agent here basically.
So we have some preliminary results on
that we're sharing right now on Apple's
um M4 uh using the metal framework. Um
and this is on the kernel bench
benchmark the v0.1 version of it which
is the latest one. So what we can see
here is results across 250 problems and
it compares to either torch compile or
eager mode depending on which one of
those is faster. So with the kernel
bench data set, we have different tiers
of problems with L1 being the easiest or
simplest rather and L3 being like more
complex. So what we can see for the
standalone agent is that we see an
average speed up of about 25% or 24%.
And the sweet spot is those moderately
complex problems. It seems honestly the
same as like a lot of coding problems
where it's good at moderately complex
things, but then you push it too far and
the performance drops off. So, an
interesting challenge here is going to
be how do we make these agents perform
better on more complex problems that
they're going to have to break down and
execute.
There we go. Um, let's talk about a
couple of examples because I love to
just see example code. So, this was a
success case where the model found a
case where we could do kernel fusion. So
for those that aren't that familiar with
GPU kernels, kernel fusion is one of the
most go-to techniques in kernel
optimization where you say I have two
kernels. Let's say in this case it was
like a convolution softmax bias scaling
and sigmoid. So those were five ops and
what the agent did was it took four of
those ops and instead of running
individual functions for those it made a
mega function that compacted them all
together. So kernel fusion isn't new.
It's something that torch compile
already does quite well, but it's a
common way that we found agents can
speed up these workloads because you can
really customize it to the specific use
case.
So this result achieved a 40% speed up
over the baseline on the M4.
And just kind of like zooming into what
happened, the agent wrote a fused op. So
it basically wrote like C++ code that it
put as kind of an inline string with the
PyTorch code and then it called that
fused operation in the forward pass of
the model and then that fused
implementation we can see a snippet of
it up here where it's basically taking
those four ops and putting them together
in one mega op. And so this was done
automatically.
Sometimes though like writing low-level
kernels isn't the best optimization that
we can get. We had another case which
was on a level one problem which
basically improved the performance by
80%. And the the insight the agent had
in this case was the operation in metal
for average pool 1D was not as optimized
as some other ops that are much more
optimized on metal. So what it did was
it actually rewrote the PyTorch code to
use the more optimized op and reexpress
the same problem in a different way.
So to dive into this um the average pool
1D is basically taking averages across
like one dimension. So like you can see
that that input vector could produce the
output vector with five and seven
averaging to six and so on.
So if you express that same thing as a
convolution you can get the same result.
So if you do the math it will lead to
that same result. And so that's what the
agent did. Basically what it did was it
said hey instead of doing the original
call to the baseline op let's generate
that weights matrix and execute this as
a convolution because I know that that's
really fast on metal.
There's also an interesting algorithmic
optimization case. This was for a level
three problem. So it was more complex
where basically the agent figured out
that it could combine two operations
into a single operation at the pietorch
level, not even using low-level kernels.
So what we can see is that basically it
fused it, it rewrote it as Python code
and calls that single convolution and
that's a lot more efficient because you
don't have to launch as many ops.
But this does not always work. This is
not a silver bullet and I think that's
really important to emphasize. So a case
where the agent totally faceplanted was
on matrix multiplication and it was it
wrote a custom CUDA kernel for this but
it was a lot slower than the baseline.
And the thing is with this is matrix
multiply is one of the most hand
optimized ops that exists. So it's not
that surprising that an agent would not
do as well as something that a human
expert spent a long time on. So, this is
an area that it did not work.
Another case was a case that we saw
which had a 71,000x speed up. And
anything like that should trigger your
suspicion brain.
Wow. 71,000. Great. We're done. It's,
you know, this technology is worth
billions of dollars, right? No. So,
basically, what happened? So, this
operation is basically saying, give me
inputs and I'm going to make sure they
fall betweengative -1 and one. Okay,
that's what the operation being tested
was.
So the agent figured out that for all of
the test cases, this was already the
case. So it wrote a nice long comment
saying this is actually not necessary,
so just output the input.
[laughter]
So you could argue this is the agent
being smart because it's pruning
unnecessary work, but I think a lot of
us would agree that it's not in the
spirit of what we're trying to benchmark
here. So we've excluded cases like this
from our analysis, but it is interesting
because maybe some of the times you
would want it to do something like that.
And I think this is part of where the
human element comes in with these
agents. Sometimes the agent does
something that depending on your
definition of what you want to see could
be good or it could be bad. And so
that's where the human part kind of
weighs in.
So, like I keep drawing parallels to
other kinds of coding agents because
even though this is like kind of a niche
like low-level domain, I don't think
that the story is fundamentally
different. We see standalone agents are
really good at cheaply generating like
lots of different ideas and lots of
possibilities to explore. They're good
at slurping in a ton of different
context and seeing what helps. And
they're really good at doing these like
level one and level two tasks. Like for
example, we're still not asking AI
agents to write the Linux kernel, but
what is still needed is robust and
robust quality and performance
validation. We need to make sure that
the agents aren't cheating and we need
to make sure that the results are
actually correct. We need empirical data
from hardware in the loop to guide the
search and optimization because it's
actually really hard to look at
low-level code and know how it's going
to perform on the hardware. We still
heavily rely on looking on profiling
data and things like that. And we also
need the human in the loop to supervise
the results and guide the work. So
design of a modern agent, you have
multiple sub aents that are working
together. You have that human in the
loop and a purpose-built harness for
that task. And I think this is the
pattern we've seen throughout this
conference.
So just to get a little bit into kind of
like what that architecture looks like
and this is what we're you know what
we're building at Gimlet you have a
supervisor agent which takes in input
code target hardware and then also human
prompting because humans still can
really guide the best path for
optimization that supervisor is in
charge of managing the work. It deploys
the synthesis agentic swarm which
collectively work together to come up
with ideas for optimizations and they
are basically the idea factory coming up
with new techniques. Those ideas get
sent to the verification agent which is
running them in on actual hardware in a
hardware in the loop system to see how
they do and that verification agent
needs to be extremely strict about
making sure that no funny business is
happening. And that's a major part of
the challenge.
So just a couple more realistic case
studies that are not benchmarks. We got
really excited because we ran this on a
vision transformer model. And I don't
know if you can see, but basically the
original uh vanilla implementation using
torch compile and our generated code
using torch compile, ours was twice as
fast. So this was like a hoay moment. So
the speedups were promising. But then it
turned out the optimization was just
swapping out the original attention
module for SDPA which is a more
optimized attention module. And this is
the kind of thing that yes that's true
that is a valid optimization but I
wouldn't necessarily call it rocket
science. So we consider that to be a
trivial case study where if you're not
using a more optimized attention module
maybe you haven't actually optimized
your workload that much yet.
But we do still see interesting results
for full models when we have human
prompting. And one case for this was an
audioenccoder model where it generated
six custom kernels for the workload
specialized for the RTX 6000 Blackwell.
And the results were strong. It was
about 70% faster. Both implementations
using torch compile.
So just to kind of show an example, we
load in line six different fused kernels
and then call them in the code. And the
nice thing about this approach, even
though it's a little weird declaring
these as strings, is that you have like
a completely a API compatible swap in
replacement for the original module in
PyTorch.
So where are we with AIdriven kernel
optimization? I think like I said before
this is not a silver bullet but it is a
promising new tool in the toolbox. The
best applications that we see are things
like searching across many bags of
tricks. We know that fusion works. We
know that tiling works and we can run
lots of experiments really quickly this
way by launching them with agents and
see what actually performs the best on
the workload. It's also good at porting
existing implementations to new hardware
where it takes the insights from that
original implementation and specializes
them to the hardware available features
on the new target
and also about translating existing
optimizations to new scenarios. You can
quickly adopt new optimizations like
let's say you're changing the
quantization of your model. you can
still look at differently quantized
implementations to guide that
optimization.
In terms of the worst applications,
we're still not at the point where
they're writing the N plus1 for flash
attention, coming up with like those
genius algorithmic advances, and they're
not currently outperforming a human
expert who banged their head on this
problem for months, and we shouldn't
expect them to be. I think that the most
exciting part of this work is allowing
those people to focus on the most
interesting optimizations and getting us
better than baseline on all the problems
that they don't have time for.
So what's next in the work? We want to
build uh abstract models of different
machines to help the agents further
specialize code to individual hardware.
We're also interested in generating
basically what is like NVIDIA assembly
such as PTX. You can see an example here
because the thought is that we can
basically do that better with AI than
humans because it's so cumbersome. And
then also looking at academic formal
verification methods for correctness.
Um also want to give a huge shout out to
my colleagues. Um they are the silent
unspoken heroes here and um you know I
love talking about this with people. So
please feel free to give me an email if
you want to talk about kernel generation
or anything that I covered. and we are
hiring. So if this problem interests
you, we'd love to chat.
Thanks.
[applause]
[music]
Our next presenter argues that simple
choices like direction over speed will
help us to avoid the infinite software
crisis of maintaining a tangled mess.
Please join me in welcoming to the stage
staff software engineer at Netflix, Jake
Nations.
[music]
Hey everyone, good afternoon. Um, I'm
going to start my talk with a bit of a
confession. Uh, I shipped code I didn't
quite understand. Generated it, tested
it, deployed it, couldn't explain how it
worked. And here's the thing though. I'm
willing to bet every one of you have
too. [applause]
So now I'm going to admit that we all
ship code that we don't understand
anymore. I want to take a bit of a
journey to see how this kind of has come
to be. First look back in history. We
see the history tends to repeat itself.
Second, we've fallen into a bit of a
trap. We've confused easy with simple.
Lastly, there is a fix, but it requires
us not to outsource our thinking.
So, I spent the last few years at
Netflix helping drive adoption of AI
tools, and I have to say the
acceleration is absolutely real. Backlog
items that used to take days now take
hours, and large refactors that have
been on the books for years are finally
being done. Here's the thing, though.
Large production systems always fail in
unexpected ways. Like look what happened
with CloudFare recently. When they do,
you better understand the code you're
debugging. And the problem is now we're
generating code at such speed and such
volume our understanding is having a
hard time keeping up.
Hell, I know I've done it myself. I've
generated a bunch of code, looked at it,
thought, I have no idea [clears throat]
how this what this does. But, you know,
the test pass, it works. So, I shipped
it. The thing here is this isn't really
new. Every generation of software
engineers has eventually hit a wall
where software complexity has exceeded
their ability to manage it. We're not
the face first to face a software
crisis. were the first to face it at
this infinite scale of generation. So
let's take a step back to see where this
all started.
In the late 60s, early '7s, a bunch of
smart computer scientists at the time
came together and said, "Hey, we're in a
software crisis. We have this huge
demand for software and yet we're not
really able to keep up and like projects
are taking too long and it's just really
slow. We're not doing a good job."
So Dystra Kano came up with a really
great quote and he said when we had a
few weak computers and I mean to
paraphrase a longer quote when we had a
few weak computers programming was a
mild problem and now we have gigantic
computers programming has become a
gigantic problem. He was explaining as
hardware power grew by a factor a
thousand society's wants of software
grew in proportion and so it left us the
programmers to figure out between the
ways and the means how do we support
this much more software.
So this kind of keeps happening in a
cycle. In the 70s we get the C
programming language so we could write
bigger systems. The 80s we have personal
computers. Now everyone can write
software. In the '9s we get
object-oriented programming. Inheritance
hierarchies from hell were you know
thanks Java for that. In the 2000s we
get agile and we sprints and scrum
masters telling us what to do. There's
no more waterfall. In the 2010s we had
cloud mobile devops you know everything.
Software truly ate the world.
And today now we have AI. you know,
Copilot, Cursor, Claude, Codeex, Gemini,
you name it. We could generate code as
fast as we can describe it. The pattern
continues, but the stale has really
changed. It's it's infinite now.
So, uh, Fred Brooks, you might know him
from writing the mythical man month, he
also wrote a paper in 1986 called No
Silver Bullet. And in this, he argued
that there'd be no single innovation
that would give us an order of magnitude
improvement in software productivity.
Why? Because he said the hard part
wasn't ever the mechanics of coding. the
syntax, the typing, the boilerplate. It
was about understanding the actual
problem and designing the solution. And
no tool can eliminate that fundamental
difficulty. Every tool and technique
we've created up to this point makes the
mechanics easier. The core challenge
though, understanding what to build, how
it should work, remains just as hard.
So, if the problem isn't in the
mechanics, why do we keep optimizing for
it? How do experienced engineers end up
with code they don't understand? Now,
the answer, I think, comes down to two
words we tend to confuse. simple and
easy. We tend to use them
interchangeably, but they really mean
completely different things. Uh I was
outed at the speaker dinner as being a
closure guy, so this is kind of clear
here. But Rich Hickey, the creator of
the closure programming language,
explained this in his talk from 2011
called simple made easy. He defined
simple meaning one fold, one braid, and
no entanglement. Each piece does one
thing and doesn't intertwine with
others. He defines easy as meaning
adjacent. What's within reach? What can
you access without effort? Copy paste
ship. Simple is about structure. Easy is
about proximity.
The thing is we can't make something
simple by wishing it. So simplicity
requires thought, design and untangling.
But we can always make something easier.
You just put it closer. Install a
package, generate it with AI, you know,
copy a solution off of Stack Overflow.
It's it's human nature to take the easy
path. We're wired for it. You know, as I
said, copy something from Stack
Overflow. It's right there. framework
that handles everything for you with
magic. Install and go. But easy doesn't
mean simple. Easy means you can add to
your system quickly. Simple means you
can understand the work that you've
done. Every time we choose easy, we're
choosing speed now complexity later. And
honestly,
that trade-off really used to work. The
complexity accumulated in our codebases
slowly enough that we can refactor,
rethink, and rebuild when needed. I
think AI has destroyed that balance
because it's the ultimate easy bun. It
makes the easy path so frictionless that
we don't even consider the simple one
anymore. Why think about architecture
when code appears instantly.
So let me show you how this happens. How
a simple task evolves into a mess of
complexity through a conversational
interface that we've all come to love.
You know this is a contrived example but
you know say we have our app. We want to
add uh some authentication to it. Say
add o. So we get a nice clean o.js file.
I iterate on a few times. It gives a
message file. You're like okay cool.
We're going to add OOTH now too because
and now we've got an OJS and OOTHJS. We
keep iterating and then we find
ourselves that sessions are broken and
we got a bunch of conflicts and by the
time you get to turn 20, you're not
really having a discussion anymore.
You're managing context that become so
complex that even you don't remember all
the constraints that you've added to it.
Dead code from abandoned approaches. Uh
test that got fixed by just making them
work. You know, fragments of three
different solutions because you saying
wait actually each new instruction is
overwriting architectural patterns. We
said make the off work here. It did.
When we said fix this error, it did.
There's no resistance to bad
architectural decisions. The code just
morphs to satisfy your latest request.
Each interaction is choosing easy over
simple. And easy always means more
complexity. We know better. But when the
easy path is just this easy, we take it.
And complexity is going to compound
until it's too late.
AI really takes easy to its logical
extreme. Decide what you want. Get code
instantly. But here's the danger in
that. The generated code treats every
pattern in your codebase the same. You
know, when an agent analyzed your
codebase, every line becomes a pattern
to preserve. The authentication check on
line 47, that's a pattern. That weird
gRPC code that's acting like GraphQL
that I may have had in 2019, that's also
a pattern. Technical debt doesn't
register as debt. It's just more code.
The real problem here is complexity. I
know I've been saying that word a bunch
in this talk without really defining it,
but the best way to think about it is
it's the opposite of simplicity. It just
means intertwined. And when things are
complex, everything touches everything
else. You can't change one thing without
affecting 10 others.
So, back to Fred Brooks's no bullet
paper. In it, he identified that there's
two main types of complexity in every
system. There's the essential
complexity, which is really the
fundamental difficulty of the actual
problem you're trying to solve. Users
need to pay for things, orders must be
fulfilled. This is the complexity of why
your software system exists in the first
place. And then second, there's this
idea of accidental complexity.
Everything else we've added along the
way, workarounds, defensive code,
frameworks, abstractions that made sense
a while ago, it's all the stuff that we
put together to make the code itself
work.
In a real codebase, these two types of
complexity are everywhere and they get
so tangled together that separating them
requires context, history, and
experience.
the generated output makes no such
distinction and so every pattern is
keeps just getting preserved.
So here's a real example from uh some
work we're doing at Netflix. I have a
system that has a abstraction layer
sitting between our old authorization
code we wrote say five or so years ago
and a new centralized O system. We
didn't have time to rebuild our whole
app. So we just kind of put a shim in
between. So now we have AI. This is a
great opportunity to refactor our code
to use the new system directly. Seems
like a simple request, right?
And no, it's like the old code was just
so tightly coupled to its authorization
patterns. Like we had permission checks
woven through business logic, ro
assumptions baked into data models and
off calls scattered across hundreds of
files. the agent would start
refactoring, get a few files in and hit
a dependency he couldn't untangle and
just spiral out of control and give up
or worse it would try and preserve some
existing logic that from the old system
and recreating it using the new system
which I think is not great too.
The thing is it couldn't see the scenes.
It couldn't identify where the business
logic ended and the off logic began.
Everything was so tangled together that
even with perfect information, the AI
couldn't find a clean path through. When
your accidental complexity gets this
tangled, AI is not the best help to
actually make it any better. I found
that only adds more layers on top.
We can tell the difference, or at least
we can when we slow it down enough to
think. We know which patterns are
essential and which are just how someone
solved it a few years ago. We carry the
context that the AI can infer, but only
if we time to make take time to make
these distinctions before we start.
So how do you actually do it? How do you
separate the accidental and essential
complexity when you're staring at a huge
codebase? Codebase I work on Netflix has
around a million lines of Java and the
main service in it is about 5 million
tokens last time I checked. No context
window I have access to uh can hold it.
So when I wanted to work with it, I
first thought, hey, maybe I could just
copy large swaths of this codebase into
the into the context and see if the
patterns were emerged, see if it would
just be able to figure out what's
happening. And just like the
authorization refactor from previously,
the output just got lost in its own
complexity. So with this, I was forced
to do something different. I had to
select what to include, design docs,
architecture, diagrams, key interfaces,
you name it, and take time writing out
the requirements of how components
should interact and what patterns to
follow.
See, I was writing a spec. Uh 5 million
tokens became 2,000 words of
specification. And then to take it even
further, take that spec and create an
exact step set of steps of code to
execute. No vague instructions, just a
precise sequence of operations. I found
this produced much cleaner and more
focused code that I could understand. So
I defined it first and planned its own
execution. [snorts]
This became the approach which I called
context compression a while ago. But you
call it context engineering or
spectriven development, whatever you
want. The name doesn't matter. What only
matters here is that thinking and
planning become a majority of the work.
So let me walk you through that how this
works in practice.
So you have step one, phase one,
research. You know, I go and feed
everything to it up front. Architecture
diagrams, documentation, Slack threads.
I mean, we've been over this a bunch,
but really just bring as much context as
you can that's going to be relevant to
the changes you're making. and then use
the agent to analyze the codebase and
map out the components and dependencies.
This shouldn't be a oneshot process. I
like to probe see like what about the
caching? How does this handle failures?
And when its analysis is wrong, I'll
correct it. And if it's missing context,
I provide it. Each iteration refineses
its analysis.
The output here is a single research
document. Here's what exists. Here's
what connects to what. And here's what
your change will affect. Hours of
exploration are compressed into minutes
of reading.
I know Dex mentioned it this morning,
but the human checkpoint here is
critical. This is where you validate the
analysis against reality, the highest
leverage moment in the entire process.
Catch errors here, prevent disasters
later.
On to phase two. Now that you have some
valid research in hand, we create a
detailed imple implementation plan. Real
code structure, function signatures,
type definitions, data flow. You want
this to be so any developer can follow
it. I I kind of liken it to paint by
numbers. You should be able to hand it
to your most junior engineer and say,
"Go do this." And if they copy it line
by line, it should just work.
This step is where we make a lot of the
important architectural decisions. You
know, make sure complex logic is
correct. Make sure business requirements
are, you know, following good practice.
Make sure there's good service
boundaries, clean separation, and
preventing any unnecessary coupling. We
spot the problems before they happen
because we've lived through them. AI
doesn't have that option. It treats
every pattern as a requirement.
The real magic in this step is the
review speed. We can validate this plan
in minutes and know exactly what's going
to be built. And in order to keep up
with the speed at which we want to
generate code, we need to be able to
comprehend what we're doing just as
fast.
Lastly, we have implementation. And now
that we have a clear plan and like
backed by a clear research, this phase
should be pretty simple. And that's the
point. You know, when AI has a clear
specification to follow, the context
remains clean and focused. We've
prevented the complexity spiral of long
conversations. And instead of 50
messages of evolutionary code, we have
three focused outputs, each validated
before proceeding. No abandoned
approaches, no conflicting patterns, no
wait actually moments that leave dead
code everywhere.
To me, what I see is the real payoff of
this is that you can use a background
agent to do a lot of this work because
you've done all the thinking and hard
work ahead of time. it can just start
the implementation. You can go work on
something else and come back to review.
And you can review this quickly because
you're just verifying it's conforming to
your plan, not trying to understand if
anything got invented.
The thing here is we're not using AI to
think for us. We're using it to
accelerate the mechanical parts while
maintaining our ability to understand
it. Research is faster, planning is more
thorough, and the implementation is
cleaner. the thinking, the synthesis and
the judgment though that remains with
us.
So remember that uh authorization
refactor I said that AI couldn't handle.
The thing is now we're actually you know
working on it now and starting to make
some good progress on it. The thing is
it's not because we found better
prompts. We found we couldn't even jump
into doing any sort of research planning
and implementation. We actually had to
go make this change oursel by hand. No
AI, just reading the code, understanding
dependencies, and making changes to see
what broke. That manual migration was,
I'll be honest, was a pain, but it was
crucial. It revealed all the hidden
constraints, which invariants had to
hold true, and which services would
break if the O changed. Things no amount
of code an analysis would have surfaced
for us. And then we fed that pull
request of the actual manual migration
into our research process and had it use
that as the seed for any sort of
research going forward. The AI could
then see what a clean migration looks
like. The thing is each of these
entities are slightly different. So we
have to go and interrogate it and say
hey what do we about do about this? Some
things are encrypted some things are
not. We had to provide that extra
context each time uh through a bunch of
iteration.
Then and only then we could generate a
plan that might work in one shot. And
the key and might's the key word here is
we're still validating, still adjusting,
and still discovering edge cases.
The three-phase approach is not magic.
It only works because we did this one
migration by hand. We had to earn the
understanding before we can code into
our process. I still think there's no
silver bullet. I don't think there's
better prompts, better models, or even
writing better specs. just the work of
understanding your system deeply enough
that you can make changes to it safely.
So why go through with all this? Like
why not just iterate with AI until it
works? Like eventually won't models get
strong enough and it just works. The
thing to me is it works isn't enough.
There's a difference between code that
passes test and code that survives in
production. Between systems that
function today and systems that that can
be changed by someone else in the
future. The real problem here is a
knowledge gap. When AI can generate
thousands of lines of code in seconds,
understanding it could take you hours,
maybe days if it's complex. Who knows,
maybe never, if it's really that
tangled.
And here's something that I don't think
many people are even talking about this
point. Every time we skip thinking to
keep up with generation speed, we're not
just adding code that we don't
understand. We're losing our ability to
recognize problems. That instinct that
says, "Hey, this is getting complex." It
atrophies when you don't understand your
own system.
Pattern recognition comes from
experience. When I spot a dangerous
architecture, it's because I'm the one
up at 3 in the morning dealing with it.
When I push for simpler solutions, it's
because I've had to maintain the
alternative from someone else. AI
generates what you ask it for. It
doesn't encode lessons from past
failures.
The three-phase approach bridges this
gap. It compresses understanding into
artifacts we can review at the speed of
generation. Without it, we're just
accumulating complexity faster than we
can comprehend it.
AI changes everything about how we write
code. But honestly, I don't think it
changes anything about why software
itself fails. Every generation has faced
their own software crisis. Dystra's
generation faced it by creating the
discipline of software engineering. And
now we face ours with infinite code
generation.
I don't think the solution is another
tool or methodology. It's remembering
what we've always known. That software
is a human endeavor. The hard part was
never typing the code. It was knowing
what to type in the first place. The
developers who thrive won't just be the
ones who generate the most code. They'll
be the ones who understand what they're
building, who can still see the scenes,
who can recognize that they're solving
the wrong problem. That's still us. That
will only be us.
I want to leave on a question, and I
don't think the question is whether or
not we will use AI. That's a foregone
conclusion. The ship has already sailed.
To me, the question is going to be
whether we will still understand our own
systems when AI is writing most of our
code.
Thank you. [applause]
Ladies and gentlemen, [music] please
welcome back to the stage Jed Borave.
>> Welcome, welcome. Let's give it up for
Jake. [music]
All right, we are 6 hours in and we are
about to take our break, but before we
leave, I want to tell you what's next.
We have something very special. We're
going to have the CEO of Poolside,
former CTO of GitHub coming to do the
first public demo of Poolside. So, um I
wouldn't miss it. Less than a month ago,
it was reported that Nvidia was going to
put up to a billion dollars in Poolside.
So, I'm excited to see what they're
going to show us. Um with that, let's
break. There's going to be a um down in
the expo booth, we have Ryan from AMP
SourceCraft demoing AMP and talking
about how to build with agents. So, also
check that out during the break. We'll
see you back at four. Thanks everybody.
>> [music]
>> Heat
Heat up here.
[music]
[music]
Heat. Heat. [music]
[music]
Heat.
Heat. [music]
[music]
Heat. Heat.
[music]
[music]
>> [music]
>> Heat. Heat.
[music]
Heat up
>> [music]
>> Heat. Heat.
[music]
>> [music]
[music]
[music]
>> Heat. Heat.
[music]
Heat. Heat.
[music] Heat.
Heat.
>> [music]
>> Heat up
>> [music]
[music]
>> Heat. Heat.
[music]
Heat. Heat.
[music]
Heat up
>> [music]
>> here. [music] Heat. Heat.
[music]
[music]
Heat.
[music]
[music]
Heat.
Heat.
[music]
[music]
Heat.
[music]
>> [music]
>> Heat up
[music]
[music] here.
Heat.
[music]
Heat.
>> [music]
>> Heat Heat. Heat.
[music]
>> [music]
>> Heat up [music]
here.
[music]
>> [music]
>> Heat [music] up
here.
[music]
>> [music]
>> Heat. Heat. Heat.
[music]
Heat.
Heat. [music]
Heat. Heat.
>> [music]
>> Heat up here.
Heat up here.
>> [music]
>> Heat up [music] here.
>> [music]
[music]
>> Heat. Heat.
Heat.
Heat.
>> [music]
>> Heat. Heat.
[music] Heat. Heat.
>> [music]
>> Heat up
Heat. Heat. [music]
[music]
Heat.
Heat. Heat.
[music]
[music]
Heat
Heat up here.
[music]
Heat.
[music]
[music]
Heat.
Heat up
here.
Heat.
[music]
[music]
Heat.
>> [music]
[music]
>> Heat Heat
[music]
up
[music]
[music]
[music]
Heat. Heat.
[music]
[music]
[music]
Heat
[music]
[music]
up
[music]
here.
[music]
>> [music]
[music]
>> Heat.
[music] Heat.
Heat up here.
[music]
Heat. Heat.
[music]
[music]
Heat. Heat.
[music]
Heat.
[music]
[music]
[music]
Heat.
>> [music]
>> Heat.
Heat.
>> [music]
>> Heat.
Heat.
Heat.
Heat.
[music]
Heat.
Heat.
Heat
up
[music]
here.
>> [music]
>> Heat.
[music]
[music]
Heat.
Heat.
[music]
[music]
[music]
Heat.
>> [music]
[music]
[music]
>> Heat. Heat.
[music]
>> [music]
>> Heat. [music]
Heat.
[music]
Heat.
Heat.
Heat up
here.
[music]
[music]
Heat.
[music]
[music]
Heat.
[music]
Heat
>> [music]
[music]
>> up
here.
[music]
Heat.
[music]
Heat.
[music] Heat
[music]
[music]
up here. [music]
Heat.
[music]
[music]
Heat.
>> [music]
[music]
>> Heat. Heat.
[music]
>> [music]
>> Heat up here.
Heat.
Heat.
[music]
Heat.
[music]
[music]
[music]
Heat.
Heat.
[music]
Heat.
[music]
>> [music]
>> Heat up here.
[music]
[music]
>> [music]
[music]
>> Heat. Heat. [music]
Heat up here.
[music]
Heat
up [music]
here.
>> [music]
>> Heat. Heat.
Heat.
Heat.
>> [music]
>> Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
Heat
up
here.
>> [music]
[music]
>> Hey.
Heat.
[music]
Heat.
>> [music]
>> Heat. Heat.
>> [music]
>> Heat
up
[music]
here.
[music]
>> [music]
>> Heat. Heat.
Heat. Heat.
Heat.
[music]
Heat.
>> [music]
>> Heat up here.
[music]
>> [music]
>> Heat. Heat.
Heat.
[music]
Heat.
[music]
Heat.
Heat.
Heat
up
here.
Heat. Heat.
>> [music]
>> Heat. [music]
Heat.
[music]
Heat
[music]
up [music]
here.
[music]
Heat. Heat.
[music]
[music]
[music]
>> [music]
>> Heat up Heat
up
[music]
>> [music]
>> Heat up
here.
Heat
>> [music]
>> up
[music]
here.
[music]
>> [music]
>> Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat.
[music]
[music]
[music]
Heat.
[music]
>> [music]
>> This is going to be our last session and
as we heard it's going to be a good one.
Um we're going to start with poolside
and we're gonna hear from Arise Klein
Meteor Meter and DeepMind. Um so to get
us started, help me join uh help me
welcome our first speaker to the stage,
co-CEO at Poolside, former CTO of GitHub
and Heroku, Jason Warner.
[applause]
[music]
Hey everybody.
So I know you were expecting ISO, my
co-founder here. So, I'm 5 in shorter,
30 lbs heavier, and uh not as
good-looking, but I think you'll see him
here really quickly, too. So, um how
many people here know what poolside is
and does? Anyone? Anyone? Yeah. So,
let's talk about that real quickly.
Poolside exists to close the gap between
models and human intelligence. That's
literally it. That's what we're here to
go do. We're building our own models
from scratch to do this. were based on
the idea two and a half years ago that
we thought next token prediction was an
amazing techn technological breakthrough
but it need to be paired with
reinforcement learning really to make
that leap. So that's what we've been
doing for the past 2 and a half years.
So we're on our second generation of
models now Malibu agent and instead of
kind of like walking you through some
slides and all that we just thought
maybe I don't know let's kind of show
you what we're doing here. So, ISO, are
you there?
>> I got you, Jason.
>> So, as I said, you were supposed to see
him today, but there's
I don't know. Our airline system kind of
works sometimes, maybe. So, he's stuck
in California, but uh we thought we'd
just walk you kind of through some um
some demos here today. So, what you're
looking at here is a very modern
programming language that the government
uses to run all the world's critical
infrastructure called ADA. Anyone
familiar with ADA?
Yes. Yes. Okay. So, everyone I saw put
their hands up for Ada either has no
hair or gray hair like me. So, that
should tell you what's going on here.
So, iso, why don't we uh why don't we
figure out what's going on with this
codebase here?
>> Well, let's start asking what the
codebase is about.
>> That's great. And what you're seeing
here is obviously our assistant in in
Visual Studio Code backed by poolside
agent, a model we train from scratch
using our proprietary techniques. Um,
and you can see what's going on here.
Kind of the stuff you expect from an
agent. Uh, and obviously the form
factors of all of these things are going
to change a couple of times over the
next couple of years, but you know,
people seem to like VS Code. Uh, so
we're going to, you know, show you this
demo here today. So, you can see from
this, it kind of went through told you
what this codebase is all about, but um,
you know, these things run in our
satellites and, uh, I don't know
anything about ADA, but I do know a lot
about a couple of other programming
languages. So, uh, ISO, what do we want
to do here? Why don't we, uh, see what
this thing might look like in Rust?
>> Let's do it. Let's ask it convert this
database to Rust.
>> So, obviously, you're going to see
what's going on here. Again, if you guys
have used other tools, you're not going
to expect too much of the difference for
what's happening here, except that
again, we're backed by our own model.
We're not using OpenAI. We're not using
Anthropic. This is poolside. And
poolside is a bottom top stack that is
right now if no one's touched it and I
know no one in this room has touched
this unless you work for a three-letter
agency, a defense contractor or you've
sent missiles somewhere that we're not
going to talk about in this session. Um
because that's where we're working.
We're working in high consequence code
environments for the last year inside
the the government and the the defense
sector. Um as you can see from this
demo. Um so what you see here is is kind
of going through doing the conversions.
What you see in the middle pane is
something that we built to kind of show
you as the streams come through all the
different changes that are happening.
Um, one of the tricky parts about
working on inside the defense sector and
things like that is you can't have an
agent that's just going to run around
and do stuff. I mean like I can't walk
into half of these buildings. You can't
give an agent access to these data
source and just say, "Hey, go nuts." You
need to have the right permissions. You
got to actually really ratchet these
things down to do things inside those
environments that you know they feel
comfortable with. So, uh, where are we
on this now? What is is it trying to fix
itself yet? Yes. So, it's it wrote about
1152 lines of code. Uh, and it just
popped up a command start tested.
Excuse me. [clears throat]
Uh, so we see here all the files on the
left hand side that it created. Uh this
is essentially our live diff view that's
available.
Uh and as we see it's currently starting
to actually test it out.
So this is the part where we just sit
here and watch this for 3 minutes and I
see nothing. No, what you see
>> the good thing is that this is a very
fast inference.
>> Yes.
>> So 1100 lines of code
>> task completed.
>> Do we know if this works yet?
>> Well, let's have a look. So it actually
wrote some bell commands to test it. And
when we check out the output of the O,
this actually looks pretty good.
>> Can we ask
>> can we verify that
>> to run it? Let's go verify it. So of
course our agent came back and gave
summary of what it did. But let's just
ask how to run this.
Okay.
So,
I'm going to go open up. So, it says
this is how I can run the ADF version
and this is how I can run the Rust
version. Let's run the Rust version.
Perfect. Let's have a look here.
We might be hitting an actual
>> an actual demo bug.
>> Let's have a look.
>> Let's see what happens.
>> I know. No, no. Just warnings.
>> Just warnings.
>> Do we have an unwrap in there that we
need to take care of? I heard that those
things are dangerous.
>> So right now there's a ripple.
Uh let's hit help. See what we're able
to do. So it looks like we have a set of
commands. I'm going to be lazy. I'm
going to copy paste these queries.
So create table users. Okay. So far so
good.
Let's insert a record.
Okay. Well, let's find out if it
actually did its job. Select start from
users. Okay, we've got a record here.
>> That's nice.
>> Now, now I want to actually
uh you see if I use the up arrow,
it doesn't actually allow me to cycle
through commands. Let's ask it to add a
feature.
Uh
allows me to use the up arrow to cycle
through.
I think it will understand my intent
here.
The one thing we know about ISO is he
actually does know how to read and
write, but he can't type. So all those
errors that you're seeing in there, uh,
yeah,
>> so it looks like the agent's identified
a package that we can use. Let's just
quickly look here. Compare this to
version one.
And it looks like it's adding a library
called Rusty Line and changing the files
accordingly.
It's currently built it and it looks
like the build output is successful.
There's some warnings. We'll ask to
clean those up later on. And it's now
starting to test it.
Okay, apparently it works. It's going to
It wrote itself a little bash script to
test the history.
[snorts] It's wrote itself a little
final demo script.
So, let's let it
Okay. So, and it gave us the summary.
Well, now how do I rerun this? I do kind
of know that. So, let's just
>> should know that. That was 30 seconds
ago.
>> Let's build it and let's run it again.
Okay, let's do a help.
And oh yeah, that's the up arrow. It
works.
>> Very nice.
>> Now, our models aren't just capable
coding agents. They're capable in lots
of areas of knowledge work. They're also
emotionally intelligent. They're fun.
They're great to write bedtime stories
with for the kids. So, I'm gonna ask you
to write me a poem about all these
changes, but that's just more for fun.
So, as Isa was saying, this is just an
interface into our platform. There's
other interfaces into it if you're
inside one of those organizations that
has adopted poolside. So, this is the
coding interface into it, but we also
have other pl ways in which you you can
interact with it web as well as an agent
that you can download on your machine.
But um yeah, we don't really tout the
poem writing or the songwriting. Though
I did send this to my wife to see and I
have been sending her love letters
written by poolside. So I kind of hope
that she did not enter this session to
know exactly how I've been doing that
for the past 6 months. But uh yeah, so
this is kind of poolside. This is what
we've been up to. Um so as I said,
Malibu agent is a second generation.
We've got a ton more compute coming
online and that's when we're training
our next generation. That is be going to
be the one that comes out to publicly to
everybody very early next year. We're
going to have it behind our own API.
It'll be on Amazon behind the bedrock
API. Anybody in the world who's building
out any sort of on a one side the
engineering assistants like the cursors,
winds surfs, cognitions, replets of the
world, you can use ours. or if you use
building out on any other side of the
fence, the Harveys, the writers, the
whatever applications of the world,
there's going to be a fifth model out
there that's going to be at that level
that you can you can consume. But we're
dead set on doing this and bringing this
out to everybody in the world and kind
of advancing that state-of-the-art.
We're just going to keep pushing that
out. So, that's kind of who we are. Um,
and uh you can find out very little more
at our website since we don't put much
out there.
But ISO, anything else you want to say
before you uh try to go make your flight
this time, please?
>> So, I would say that it's been uh pretty
incredible journey for the last 2 and
1/2 years of starting entirely from
scratch and now building to a place
where we see our models have grown up to
become increasingly more intelligent.
And the kind of missing ingredient that
we had was compute. And now that it's
unlocked for us and and with a large
number of over 40,000 GB300s coming
online, we see how we can start scaling
up some of those models uh to get even
further uh in in their level of
capabilities and software development
and other types of long horizon
knowledge work. What I think is exciting
about this conference and this audience
is of all the work that's happening of
evolving the form factor. Right? Right
now what we looked at was this
asynchronous way of of operating with
agents. But you know, Jason, you and I,
we have agents running that are doing
tasks for for hours. And I think in the
near future, we can see a world where
they're able to start doing tasks in
days in the coming years. And so, I
think the interface will continue to
change. Uh we're really focused on the
fundamentals, building intelligence and
being able to scale up and serve it. And
it's why we go full vertical. It's why
we go from our multi gigawatt campus in
West Texas where we're building out data
centers for team building out models.
And the interface that you saw today is
just our version of an expression. But I
think this audience is going to do an
incredible job of building lots of
better versions of how to express using
that intelligence uh into actually, you
know, valuable, economically valuable
work.
Couldn't have said it better. Can't wait
to see what you guys build on this uh in
the future when it's publicly available.
And if anyone really does want to build
a data center campus, we are hiring for
that. Um it is weird to be putting
shovels in ground again like we did in
the 90s and early 2000s, but that's what
you got to do to scale intelligence
these days.
So,
>> I would make one other non-scheduled
statement if you're going to be okay
with this one, Jason.
>> As as our models are are getting more
capable, we'd love to also see who wants
to build with them. Right now, the the
vast majority of of you know, companies
that are doing additional reinforcement
learning and fine-tuning on top of
models are are doing it on what I would
consider right now the you know,
best-in-class open source models, the
Quens and Fies and Miniaxes of the
world. And uh we'd like to start
figuring out how we can you know partner
with you with our our models anywhere
from any checkpoint early on to where we
are today for you to be building closer
together with us on top of things. Uh we
haven't really figured out the approach
to it yet. Uh but I think since we have
this audience it's uh it's not a bad
place to put it out there and so
definitely reach out to us. Uh we think
the world till date was built by
intelligence. The world in the future is
being built on top of intelligence and
so be a great way to partner.
>> Well thanks ISO. Thanks everybody here.
And now we do have five minutes left. I
don't know if we're supposed to take
questions, but I'm happy to. So, if
anyone does, but if not, I'm just going
to go that way.
>> What was that?
>> Sort of. I mean, I think of him that
way. Here, here's a fun story. Here's
how I met ISO. I like to tell this story
because um ISO is a fun fun dude. I met
ISO because started with a failed
acquisition at GitHub. So back when I
joined GitHub in 2017 as a CTO, I wanted
to take GitHub from a kind of
collaborative collaborative code host
with open source bent and turn it into
an endto-end software development
platform infused by intelligence. And so
you know the the products that we
launched from 27 on or 17 on GitHub
actions, packages, alerts,
notifications, eventually code spaces um
and then co-pilot was the last thing
that the office of the CTO did before I
left with Nat Friedman, Ugore and a
couple of other folks inside there. But
ISO in 2017 when I joined uh he had
working code completion before the
transform architecture had landed fully.
He had on LSTMs and so I quickly tried
to acquire his company and he just he
just said no. So he just said no to me.
Uh but we had that was a long drawn out
process talking about what we thought
neural networks were going to mean for
the world. And so during that process
which was a lengthy one, we became
really good friends and we'd stayed in
close contact over the years. And then
22 rolled around obviously chat GPT
comes out, Anthropics out and we kind of
saw the endgame at play and we said do
we jump back in or not? And of course
yes we jump back in. But I like to tell
that story about how he just kept saying
no to me and I just kept asking him
questions and eventually he said yes, we
should found a company cuz by the way
when I asked him if we should do this he
said oh godamn no. That was were his
exact words. He's like no we should just
learn how to paint and sale. But here we
are.
So
>> yeah,
>> it's it's been a ground roll journey
together. Jason, I I think the reason we
ended up doing this is because of our
our opinionated view on what it was
going to take to build more capable
intelligence and and the first 18 months
of this company, you know, obsessing and
focusing on reinforcement learning
combined with LLMs felt like one of the
most contrarian opinions in the world,
but I think today it's absolutely not.
And it's super exciting to see the the
progress that's continuing to make. like
we're in the coming years we're going to
see the world that started in
completions and went to chat and is now
at agentic increasingly approach more
autonomous and we're all of it is
stemming effectively from the
combination of bringing highly capable
models that are constantly evolving
together with real world problems and
and I think what we're starting to see
now is we're entering these kind of
awkward teenage years ahead of AGI where
everybody in this room who's building
out incredible companies and
applications is bridging this gap of
what it really takes to make
intelligence that in its raw form
actually be valuable and we uh we want
to be a small humble part of that. We've
got a lot of work still ahead of us. Uh
the team is growing. Uh but hopefully
what you've seen today uh is what our
our customers and enterprises have been
having access to and seeing for a while
is that we're you know hard at work at
uh and really pushing those
capabilities. We also want to make sure
we make them available to build together
with others.
>> Well, that's it. Thanks everybody.
[applause]
>> [music]
>> RL has boosted base models, but it's
opaque and hard to scale across
enterprises.
But what if we could apply RL techniques
to prompts instead of model weights?
Here to show us how is the co-founder
and CPO of Arise, Aparna Dina Kuran.
>> [music]
>> Hi everyone. Thanks so much for coming.
Um well, today I'm excited. We're going
to talk a little bit about prompt
learning and how to use that with evals.
uh if any of you guys um are spending a
lot of time thinking about the frontier
coding models, I think there's so much
attention on on them, but just what's
not so obvious is how much time is
actually spent uh on the system prompts
uh for those building these coding
agents. So, here's actually a look. Um
this is a tweet that went viral about
the whole system prompt uh of Claude
that's been leaked. I'm sure you know
they've changed it since then. Um, but
you can actually see that Claude,
there's cursor, there's Klein. Um, and
just the length of the actual system
prompt um, for each one of these. And I
think what's not as obvious is these
actually aren't just static. They are
repeatedly iterated on. And it's such an
important piece of context that actually
goes into making these coding agents the
most successful agents out there.
Um, it's not just us talking about it.
Karpathi talks about it a lot. Um and
this was a viral tweet that that he
posted which was there's this paradigm
around iterating on these prompts that
he he's kind of coined it system prompt
learning and what he said is that it
almost feels like humans learning
because they take back English feedback
uh and use that to actually iterate on
what they should do differently the next
time. And I think he wrote something
like it's almost like that movie momento
where the guy forgets uh what you know
what he learns and then he starts
writing it down and then uses that to
actually kind of go through his next
day. And so this is a little bit of the
concept behind system prompt learning.
And what we wanted to do was show you
guys a little bit of how that works and
then put that to test on two of the most
popular coding agents uh Claude and
Klein today. So first off, how does
prompt learning actually work? So for
those of you who are familiar with RL,
what I thought we'd do is just do a
little analogy compare how does RL work
versus system prompt learning. For RL,
you know, if we just took an analogy of
a student who's trying to improve their
exam scores, they take an exam, you
know, somebody grades the exam, you have
a scalar reward, which is like, you
know, they got a 70%, an 80%, 90%, and
then they have to figure out almost
blindly just with that score how to
actually improve their score on the next
exam. And I think this is actually one
of the flaws of I mean RL works, don't
get me wrong, amazing in so many
concepts and domains, but it can be, you
know, a long path to actually figure out
what the right solution is. And I think
some of the things that we've noticed is
that it can be sample inefficient. It
takes a lot of data to get what you
want. It's time inensive. It's data
hungry. You need to have a whole data
science team to do this. and it just
might be overkill for teams who are
trying to build agents because LLMs are
already so good. So if you're a team
who's actually trying to build an agent,
maybe prompt learning is actually
slightly
might be slightly more of an interesting
paradigm for you. So in this scenario,
same same analogy. You have a student
who's taking an exam, there's some exam
score, except in this case, what
actually gets outputed isn't just the
score. They got a 70, they got an 80,
but you also get back some kind of
English feedback. Why did they get this
answer right? What did they mess up on?
Here's concepts that they missed on,
what do they need to go study? And then
they use this information to actually go
and and prepare on what to do next um to
to get a better score. This is basically
the the concept that we applied to
coding agents. And we ran this kind of
test on both Claude as well as Klein. Um
both of these as you know start off with
some kind of uh system prompt which in
cloud code this is kind of a snippet of
it and they both kind of come with
something that you can append rules to.
So client has rules cloud MD has the
cloud MD file and it starts off empty.
You can go in and add whatever is
important for your repo. So what we did
was actually took you know just
benchmark both client and cloud code on
Swebench. I'm going to kind of run
through theam uh this entire example at
Swebench, but this entire thing we also
ran on BBH and a ton of other uh
software engineering data sets but you
can see here just on vanilla client
vanilla cloud code um nothing added to
the cloud MD or the client rules um they
had you know about I think with client
somewhere on you know cloud sonnet 45 it
was about 30% of the github issues
actually resolved uh cloud code it was
about 40% of the github issues resolved
so we took This is kind of our starting
benchmark and the thesis is is could we
actually use prompt learning to see if
we can improve the system prompt and see
if um it was able to with the new system
prompt actually you know give us a
better uh score on these benchmarks. We
didn't do anything on fine-tuning. We
didn't change the models anything like
that. It was just focused on the system
prompt. Um this is the process that we
went through. We took the coding agent
uh we had it actually write some code.
um we ran unit tests and then um we then
passed that through to some kind of um
model that was doing the LLM as a judge
evals and I'll show you guys what that
looks like but the LM as a judge eval
actually gave back why did it fail did
it fail because of this uh can you give
some examples of you know what were
common scenarios that it didn't do good
on and then it actually used those kind
of evals to then go back and add it to a
meta prompt to come back with kind of
the the system prompt rules that we're
going to append to. So let's talk
through kind of the process. So first we
had kind of the SWEBench data set. Uh
SWEBench in this scenario is just 150
examples. Uh we did this for both client
and cloud code where we took the
original prompt which had no rules. We
gave it kind of the software engineering
problem and then it generated some kind
of patch to actually solve that and then
we ran the generated solution through
the unit test.
Then whatever the unit test came back
with, whether it was right or wrong, we
then passed this into an LLM as a judge
eval. And this is kind of the most
important part because this actually
generated the explanation for us. So we
passed in the problem statement. We
passed in what the coding agent solution
was, the unit test, and then the actual
solution that it came up with. Uh passed
that in. And this that you're looking at
in the center here is actually the LLM
as a judge eval. And these evals, we're
going to talk into this uh talk a bit
about this, but eval engineering is a
whole kind of concept that you know we
spend a lot of time on. And writing
really good evals is I think um how you
get the best kind of insight into what
you could do to improve your agents. So
in this scenario, what we did was we
wrote a good LM judge eval prompt. It
outputed whether it failed or passed.
And then this is the key part. We
actually asked for an explanation. Why
did it actually mess up? um you know for
specific libraries in the Sweetbench
light test um you know it was parsing
errors or it was not handling um there
there's all sorts of actually different
categories of errors but we went through
and we we kind of looked at the
explanation of what went wrong in each
scenario. We then passed into a huge
meta prompt. So this is actually what's
helping us iterate on our system prompt.
We passed in the original claude or
client system prompt. We passed in the
original rules which for us started off
empty. Um and then we passed in here was
the input, here was the LLM is a judge
eval and then here was the actual
explanation from that eval.
Passed that all into the meta prompt and
then we did kind of a diff comparing you
know the old world. So just for you just
to remember the old world had the
original clawed system prompt no rules
kind of added or appended to it. And
then the new world where it generated
this entire rules of what to avoid or
what to um what it had learned
essentially from all those mistakes it
had actually made. And then we ran this
basically on the entire speedbench light
again. Um and what we saw was that you
know on 150 examples we were able to get
cloud code up by 5% more GitHub issues
resolved clin um you know 15% and this
was literally on I think the key thing
is like 150 examples of just training
data that was used um on the most kind
of powerful coding agents that are out
there. Um, and so just think about kind
of the impact that could have for your
agents. Many of you guys in this room
might be thinking, okay, well, prompt
learning is cool, but how does that
compare to GEA? If you're familiar with
DSPI and you've kind of seen, I don't
know if it's GPA or Jeepa. I've heard
both. Um, but you know, you guys might
be asking, well, how is this different?
Um, so GEA, just just in case you guys
aren't familiar, it's a prompt optimizer
from DSPI that is essentially very very
similar to what we're talking about,
which is taking English feedback using
that English feedback inside of the
actual prompt. Um, and what we did was
actually run a sidebyside benchmark
against GEA where we compared kind of
our prompt learning against GAPA. And um
I think what we saw was that GEA
required many many loops and rollouts
compared to um kind of a a fraction of
that which was our approach. And I think
the key difference here, I mean the
underlying approach around using English
feedback is the same, but I think the
key thing that was really different here
was we spent a lot of time actually
developing and iterating on the evals
and the eval prompts really mattered to
making sure that you gave really good
explanations back to the agent. Um, and
so eval.
This was super critical for us to be
able to get this to work. Um, and if you
guys are curious about learning more,
reading more about kind of what we do,
um, check out kind of our blog. We write
a lot about eval prompt optimization
and, uh, we're actively hiring, so come
check us out. Awesome.
[applause]
[music]
When it comes to AI [music] agents,
theory doesn't always compile to
production. Here to share hard one
lessons building effective AI coding
agents is the creator and head of AI at
Klein, Nick Pash.
[applause and music]
Wow, it's wild to be on the same stage
as so many people I've drawn inspiration
from. Let's dive into it. My name is
Nick. I'm the head of AI at Klein. And
today, I'm going to share some lessons
we learned along the way.
So, let's start with the bitter truth.
For years, we compensated for weak
models by building clever scaffolds
around them. All kinds of clever ideas
like rag indexing systems, search trees,
tool calling scaffolds. All this was
invented to cope with weaker models and
frontier models simply bulldoze those
abstractions. Now you don't really need
your scaffolding anymore. Your your
scaffolding just gets in the way of
these models. And the question really
isn't how fancy is your agent stack
increasingly. It's how strong is the
model driving it?
And the lesson here is relentless. Um, a
perfect example of what I'm talking
about is Gemini 3.0 released this week
and it immediately dominated terminal
bench leaderboards with no agent harness
supporting it at all. In this chart, you
can see Gemini 3.0 on Terminus scored
better than the vast majority of model
agent combinations in the world all out
of the box. And what's remarkable is
that Terminus is designed to be an
unopinionated generic stripped down
[snorts] harness. And it has no graph
search, no rag, no indexing, just here's
a terminal, go figure it out. And it
crushes. The whole point of terminus is
that it has no clever tool calling, no
context engineering features. So the
takeaway here is that capability beats
scaffolding. If you get out of the
model's way, it will perform just fine.
So really what I'm driving at and the
key takeaway from this whole talk is if
you're building agents, just relax.
Cool it with all your clever engineering
tricks. Stop overthinking it. That's it.
That's the lesson. And another point on
this, kind of like an aside, is I don't
know about you guys, but we're all on
Twitter. I'm on Twitter. And at this
point, I just think talking about these
like clever little context tricks and
and hacks is a little played out. Like
at this point, I'm straight up tired of
seeing some of this stuff. And like I
get it, it's free engagement and we all,
you know, indulge in it a little bit.
But personally, I think there's not
really much signal there.
So, if you want the full playbook for
building an effective coding agent, like
the playbook's right here. here. It's up
on the screen. Um, there's really some
novelty talking about it like months
ago, but at this point, in my opinion,
it's been done to death. And we've been
in this, you know, we're model agnostic
at Klein. We support all the models.
Every two weeks, there's a new big model
release going out. And we've basically
come down to the same playbook of
supporting each model as it comes out.
So, I'm sure everyone here knows how to
tune an agent from Sonnet 4 to Sonnet
4.5, from Gemini 2.5 to Gemini 3, and
GBT 5 to GP GBT 5.1. I feel like this
entire conversation is a little played
out. So, I'm not really even going to
cover this in depth because the tweaks
here are trivial and the gains are
marginal.
So what I really want to talk about is
something that's not actually given a
lot of attention and it's the real
bottleneck. And the real bottleneck is
that you can build the cleanest agent in
the world, but that doesn't improve
model capability by even 1%. Models only
get better when labs train on something
hard. And benchmarks, not agent
cleverness, not all your clever
engineering techniques, not your clever
rag pipelines. It's benchmarks that
determine what frontier models learn to
do next. And models didn't magically get
better at tool use.
They got better because people built RL
environments that force them to practice
certain actions. Handling failure more
handling failure modes retrying
and for example like agents improve only
when the model learns inside the right
environment. Every jump in reasoning
we've seen came from a benchmark. Every
jump in agent reliability came from an
RL environment.
So the real questions become what is a
benchmark? How do you turn real world
agentic coding data into an RL
environment? And what makes a good
verifier? How do you detect real
difficulty? And how do you train these
models to work on the problems that we
actually care about as engineers? These
are the questions that matter for the
next frontier.
So what is the benchmark?
A benchmark put simply it's an
environment. It's a so in our case it's
like a docker container where you let
the agent run wild. It's a starting
state which is the snapshot of the code
when you started working on a real world
coding task as well as a starting
prompt. And the last thing is a verifier
at the end that checks whether an end
state is correct or acceptable.
So how are RL environments different?
Well, here's the thing. They're not
really different at all. And you might
notice this chart is basically the same
thing as the previous slide. The only
real difference, the only distinction
here is how the reward is used.
Benchmarks measure models. RL
environments improve models. The score
doesn't just stop in a leaderboard where
you publish the results. The score is
actually used to update the weights of
the policy model.
So, how do you transform real world
coding data into useful RL environments
for training?
At Klein, we created the system called
an RL environments factory. Looking for
a better name there, but that's what we
got so far. And the first phase in this
pipeline is you get sub agents and you
have them qualify tasks. And these sub
agents, they work in parallel to decide
whether or not given tasks are suitable
to be turned into RL environments for
the purpose of training.
And the qualification process goes as
follows. So you have you start with
origins. So you have to validate does
the repository actually exist? Is the
starting commit accessible? Is it open
source? the journey where you look at
the starting prompt, the other follow-on
prompts that the user might have
followed up with with the agents, you
have to try to understand what was the
user actually trying to accomplish, what
was the spirit of their task. And
lastly, it's the outcome. So, can we
find the actual commits or PRs that fix
the problem in real life? Like, did they
actually commit the solution to their
problem later on in the timeline? And
we're actively looking for easy
disqualifiers as part of this. So things
like vibecoded slop. We don't need
another benchmark that tests for, you
know, build the next.js app uh from
scratch. We're looking we're looking to
disqualify trivial tasks that are too
easy and tasks that have no reliable
start or end states.
And lastly, what makes a good RL
environment good? How do we actually
make an RL environment? And what makes a
good test or verifier?
So phase two of this pipeline is
building the actual RL environment. So
you start out with archaeology where you
actually reconstruct both states
locally. You pull down the code. You see
if you can implement it yourself,
reconstruct it, build it, and verify
that the bug that the user was
referencing and the solution actually
exists. You document every obstacle and
dependency. You containerize it with
docker removing git obviously so agents
can't reward hack and last you define
the verifier at the end and this is
where it gets into like a little bit of
the art of building a good verifier and
I want to talk about this because the
analogy that I typically give is a teac
kettle. So let's say the user's goal is
I want to boil water.
A really good example of a verifier to
test whether or not the water is boiling
is a little whistle attachment that goes
inside your teac kettle. And the whistle
is a pure outcome verification. It's an
example of a pure outcome driven
verifier where the water either reached
the boiling point or it didn't. Either
it's whistling or it's not. The kettle
doesn't care how you achieved it,
whether you used a gas stove, an
electric induction stove, or a campfire.
It just signals the result.
And in the process of doing this, all
these weird bad tests can emerge. So you
might have noticed like that the sub
agent might have noticed like oh in the
ground truth solution like in a previous
run the burner was set to high. So maybe
we should be checking for that. But we
all know that water can boil at a low
setting on the burner. Or was it on the
front left burner has 5 minutes elapse?
Like all kinds of weird bad tests. And
the key point here is don't
overprescribe based on the ground truth.
Test for the spirit of the task. Test
for the outcome of the task. And the
outcome at the end of all of this is a
containerized benchmark or RL
environment for that task. Agent work is
recorded so you can see the traces, the
trajectory that the agent took to
complete the task and you can reliably
score it and verify it. And it's fully
portable. You can run it on any device.
So the path to automation that we've
been undertaking as part of this is can
we fully automate the process of
converting real world coding data into
RL environments for the purpose of
training models.
And this work largely started out manual
but then the first time the RL
environment was like about 16 hours of
my time. And what used to take 16 hours
now takes less than 20 minutes per task.
And we're building towards a fully
automated RL environment factory where
the bottleneck shifts from engineering
to collecting high quality tasks. And an
interesting kind of point here, the
natural endpoint of all this is what if
we actually built RL environments and
this is like a question for everyone in
the audience is what if we built RL
environments to test how well agents can
actually make RL environments kind of
like a meta benchmark. What would hill
climbing on that look like? And you can
kind of start imagining that as models
get really really good at making their
own RL environments to train on based on
real world user data, you kind of
complete that loop. Something to think
about. So okay, um this next part is the
truth nuke. Um also known as TRO. Um
an unspoken fact is that we're not alone
at Klein building this kind of system.
Every major agent lab captures this
data. They all do some version of this
behind the scenes, but no one really
talks about it. And I don't even need to
name them. If you know, you know. And
realistically, you all know. These same
companies site internal benchmarks to
justify legacy systems that they spent
months maintaining. But curiously,
you'll never be able to study or inspect
them because they don't publish them
openly. And this data is so valuable yet
no one shares it. It's the only thing
that actually moves the needle.
And here's the heart of my argument is
by standing between real world engineers
working on real world tasks and the
models agent labs have a unique role in
history. We can build better prompts. We
can build better tools. But none of that
improves the underlying models. We
possess the single richest data set of
real engineering work anywhere in the
world. Models don't improve without this
data and keeping them closed is slowing
down Frontier Research.
So today we're announcing ClientBench.
This is our attempt to finally create a
benchmark that isn't cosplay
engineering. It's not write me a server
that generates Fibonacci sequences. This
is real software development captured
and packaged into standardized RL and
inval and eval environments and this is
the benchmark that we always wanted
someone else to build. No one did. So
we're doing it and anyone can
participate. So here's how it works. The
whole thing is open source. There's no
secret sauce, no locked away data sets.
You can openly run it yourself and
inspect it to see how it works. Anyone
can use these environments for SFT, RL,
eval, whatever. The point is is to just
give the entire ecosystem a real
substrate to measure and improve models
on, not just leak code puzzles.
And this only works if the community
contributes. And the good news is you
don't actually need to do anything
special. Just work on your open source
project with the client provider turned
on and opt into the client bench
initiative. If a frontier model gets
stuck and you step in to fix it, that's
actually a ideal task for to be a
candidate for a benchmark and that's it.
Just use the climb provider, see where
the model struggles and we'll pick it up
and introduce it into this open-source
benchmark.
So, client bench will always remain
free, fully open source and freely
accessible.
Thank you all. If you want to
contribute,
[applause] thank you.
>> [music]
[music and applause]
>> If benchmark scores for AI coding agents
are so high, what explains the problems
developers and teams face when working
with them? Here to provide us with a few
explanations is Meta Researcher Joel
Becker.
>> [music]
>> Hey guys, thank you so much for having
me. My name is Joel Becker. I work as a
researcher or member of technical staff
at MET, which stands for model
evaluation and threat research. As we'll
see in a second, I'm going to be talking
about AI capabilities. How do we know
how performance AIs are today? How how
performant they might be in the near
future from these two different sources
of evidence that seem to give somewhat
conflicting answers. You know what? I I
could have done this whole talk without
reference to meter papers in particular,
but we'll look at two papers I've been
um involved with as as examples of
benchmark style evidence and then more
economic style evidence. On the
benchmark side, measuring AI ability to
complete long tasks. This is the paper
um that comes with the the charts that
many of you would have seen on on
Twitter and so on that meter is well
known for. And then the second this um
RCT measuring how allowing AI affects
developer productivity. And then we'll
be talking about how to reconcile uh the
the gap that's implied between these two
different kinds of measurements.
As I mentioned, META stands for model
evaluation and threat research. We are a
independent research nonprofit that
seeks to inform the the public, policy
makers, labs about the degree to which
AIs might pose catastrophic risks to
society. The model evaluation part uh
means that we seek to understand AI
capabilities and propensities and the
threat research part means we try to
connect those capabilities and
propensities to potential catastrophic
risks.
Okay. The first paper we're going to
talk about associated with this chart
that that many of you I think might have
seen.
Um take taking a step back first before
we dive into the paper. you know, how
how usually do we think about measuring
AI capabilities using benchmarks on a
SWE bench or a GPQA, so on and so forth.
There's some notion of 0% performance um
or or random performance. So for GPQA,
that's that's 25% which corresponds to
this floor that the worst you can
possibly do. Perhaps there's a um human
baseline that's below 100% for GPQA. I
think this is something like 75% that
represents maybe expert human
performance. And then of course you can
go all the way up to 100% potentially on
on these kinds of benchmarks. But but
what does it mean? You know, if I'm
getting 50% on GPQA, if I'm like half
the way from the um from the floor to
the to the expert baseline, what you
know, what does that really mean about
how performant the AIS are? If I meet
the human baseline, does that mean that
the AIS are now as performant or even
more performant than than expert humans
in in a relevant sense that I that I
care about? It's hard to interpret. You
know, another thing that you see from
this graph is that um benchmarks seem to
have less and less time between coming
online sort of giving any signal at all
and being fully saturated. It's harder
and harder to create benchmarks that
have uh plenty of signal that you know
might might be informative to us about
how capable models are for for an
extended period of time. So, we're we're
going to go about this a different way.
First, we're going to gather human
baseline data for diverse tasks spanning
a range of difficulties. You should
think of these humans as, you know,
experienced experts, but on their first
day or or or first week on the job.
These are not people with context on the
tasks in particular. It's not exactly
the kind of thing that's come up in
their work before, but if it's a
software engineering task, you know,
there are relevantly skilled general
software engineer. Same for the machine
learning tasks and the cyber security
tasks here that we'll talk about. The
the type of tasks come from these three
um buckets or task distributions. Hcast
which is a collection of um
softwarebased tasks seemingly requiring
autonomy you know interacting with tools
um uh interacting with the environments
thinking thinking through the problem
not not just this kind of Q&A style um
style data set um the SWAR suite which
are these atomic problems these are
problems that you know maybe GBT2 can do
maybe maybe it can't problems like um
here are four files one of them is
called passwords.txt txt which file
contains the passwords and then on the
other end of difficulty we have rebench
which are challenging novel open-ended
um machine learning research engineering
challenges which are are very difficult
even for top human experts
in addition to gathering the the human
baseline data we'll also under as close
to identical conditions as possible
measure AI performance for the AIs that
we're that we're interested in on the
same set of tasks and then we're going
to convert the time it takes for humans
to complete these tasks into an estimate
of AI autonomous capabilities as I'll
I'll show you in a second.
Here's an illustrative diagram in this
case for claw 3.7 Sonet which was the
the frontier model at the time that this
paper came out. You can see that you
know for the for the very short tasks
something like 4 minutes or below Sonnet
is getting the answers correct you know
essentially 100% of the time or or maybe
even here literally 100% of the time.
for the very hardest tasks it's
struggling and then and then there's
some range where we're kind of in the
middle you know we're somewhere between
10 and 10 and 90%. I'll say that this
empirical pattern where models are less
performant at tasks that take humans
longer is you know it's not a fact of
nature but it's it's something that we
see pretty pretty commonly pretty pretty
robustly across models at least on this
task distribution and I'd conjecture for
for other task distributions as well. So
we try and fit this dark purple line to
to something like this data on on how
long it took humans to complete the
relevant tasks that the models are uh um
are attempting. And then we call the
point on the x-axis this horizontal axis
this human time to complete axis at
which we predict the models will succeed
50% of the time the time horizon of
those models that there's much to debate
in the 50% number. Uh I can I can talk
later about the reasons why we chose
that and and then we'll do the same
exercise for the other models. So here I
have uh claw 3 opus has a time horizon
of something like 4 minutes. That's
where we're predicting that it has a
success probability on this task
distribution of 50%. For 01 preview I'm
seeing something like 15 minutes so on
and so forth. And then of course all
these models you know they they come out
over um calendar time. So if we plot the
time horizon, the x-coordinate on uh on
on this set of plots against um against
calendar time, we find something like
this. It looks, you know, kind of like
um kind of like an exponential trend
that's that's going up at some constant
rate. In fact, it doesn't just look like
an exponential trend. If we had a
perfectly straight line here, it would
indicate um a perfectly exponential
trend. um we we see something really
remarkably steady actually much more
steady than we were anticipating when we
uh went about doing this research
project
and that's continued to be the case. So
many of you will have seen updates that
we've made of of this graph on on on
Twitter. This is going all the way up to
GPT 5.1 codeex max. So extremely recent.
Um the predictions from this you know
shockingly straight line have have held
up very well I think.
Taking a quick step back, what are
benchmarks telling us or or here kind of
benchmark like evidence? Well, one thing
is that AIs can succeed at what for
humans would be exceedingly difficult
tasks. The tasks in rebench are, you
know, really far beyond my capabilities
uh uh personally and and you know the AI
is having a good crack at them some some
decent percentage of the time. And the
second's you know kind of obvious is
that progress is rapid.
on the other hand [gasps] um you know
how much how much stock should we put in
the um the evidence suggested by
benchmarks um what limitations might
they have lots but here are here are
three that I'll note one is as I
mentioned these are humans who are you
know expert in some relevant sense but
they're low context it's something like
their their first week on the job they
haven't seen tasks exactly like this
previously they just have some relevant
experience presumably people who were
more sort of you know not not just
having the relevant experience but also
highly familiar with um uh with the with
the set of tasks would perform the tasks
even sooner and then we think relative
to those people the AIS were more
performant.
The second is that benchmarks can be low
ceiling. Even you know GPQA I'll use
that example again. Um we're we're
beginning to get to the point where
where that benchmark is um is totally
saturated not providing um additional
information for marginal models whereas
time horizon is providing this nice way
to sort of chain benchmarks together in
in in some sense over time.
Um but you know nonetheless it's still
very hard to um uh to create these ever
harder tasks when the um when the time
horizon of models is doubling every
something like six to seven months. So
even time horizon might be might be
saturated in not too long or the
benchmarks underlying time horizon.
And the next one is you know not not a
concern that's limited to the to the
meter task to the task behind time
horizon. It's also true for sweet bench.
It's also true for for many of your um
favorite agentic benchmarks that the
problems aren't very messy in some
sense. They don't require a ton of
coordination with humans. They're often
in relatively small contained
environments where where not much can go
wrong. You know, not these sort of
massive open source code bases or or um
other ways in which the the problems can
involve more interaction with the real
world or or or be messy in in some
sense.
Um so we did this we did this project
and then um early this year we were you
know we were trying to think about um uh
how can we attack some of these
limitations? What what's a different
source of evidence that um might have
its own own pros and cons but you know
importantly be more externally valid in
in the scientific jargon.
Perhaps field experiments are the
answer. Some more economic style
evidence. So here we might be interested
in very high context developers who are
expert on the kind of tasks they're
already doing
speed up or some notion of productivity
boost. You know it seems to have more
signal through even some um superhuman
according to benchmarks range. You know
perhaps GPQA is fully saturated and
you're getting a 1.5x 2x speed up
something like that but you can still
achieve a 3x 4x 5x speed up even even
after that we we maintain more signal.
And the last is that you know that the
tasks are messier. They are tasks that
are coming up in people's real work.
They're not um synthetic. They're not
small and contained. Um this is a real
deployment scenario.
Here's what we're going to do for this
paper. We're going to gather 16
experienced developers on large mature
open source projects that we'll go
through in a second. Each of these
developers will on average complete
about 16 tasks from their real work.
These are these are issues on the on the
relevant GitHub repositories. The kind
of thing that they might otherwise have
completed with the with the caveat that
the very longest issues we're not going
to include.
The tasks will be randomly assigned to
AI disallowed or AI allowed. AI
disallowed, you know, it means it means
what you think it means. It means
software development in 2019. It means
no AI powered tab autocomplete. It means
no cursor agentic coding tools. It means
no LLMs via the web UI.
or they can be randomly assigned to AI
allowed in which case everything's on
the table. You know, any of the AI tools
I just mentioned or not using the AI
tools. If you're in the AI allowed
condition, you're not compelled to use
AI. You just have the option. And we buy
these developers cursor pro. So, um for
the for the most part, that's the tool
that they're using with typically 3.6 or
3.7s on it at the time, uh which was the
Frontier model when we conducted this
work. And then we're going to record the
time it takes for the developers to
complete each task and see the degree to
which they might save time when AI is
loud versus when it's not.
These are some of the repositories. Many
of you will be familiar with them. We've
got the Haskell compiler represented. We
have scikitlearn. We have hugging face
transformers. These are on average a
million lines of code plus. They've been
around for 10 plus years. The developers
who are going to be working on these
repositories as part of this study are
on average the third top contributor out
of hundreds or or even in some cases
thousands of contributors to these
repositories. They personally have been
contributing to the repository for
something like 5 years on average. These
are top experts.
Some of you might have seen this graph
too. And so the punch line's been
spoiled for for the rest of you. Um we
asked uh economics experts, machine
learning experts, you know, these are
people at major AI companies and labs,
um uh top academics, um some graduate
students, so on and so forth, you know,
how much they expect developers to save
time when they're using AI. They say
something like 40% or a little bit less.
We ask the developers themselves, the
study participants how much they expect
to be sped up ahead of time, and they
say something like 24 25%. Then we ask
the developers after the study has been
completed how much they think they were
sped up in the past by AI being allowed
on the issues they completed as part of
this study and they say that it will
have sped them up by something like 20%.
And the punch line is that we find that
developers are slowed down by 19%. They
take 19% more time when AI is allowed
relative to when AI is not allowed.
You know, when I first saw the data
coming in, saw sort of early versions of
this plot, um, I thought presumably the
same thing that many of you might be
thinking right now, that we've messed
something up. Um, that that, you know,
something's gone wrong. There's some
there's some issue in in how we've set
up the experiments. How could it
possibly be the case? You know, at least
these um, uh, these developers have
access to the zero points because they
cannot use AI at at any time. Um, so we
poured over, you know, many, many, many,
many, many hours of screen recordings
from these developers working on issues
as part of the study. We look to dive
into, um, a bunch of hypotheses that
might explain what's going on and try to
categorize, you know, the things that
that we think are going on versus not.
Um, many of this is is listed in the
paper. I I'll just quickly go through
some of the things that we think are
contributing.
First, overoptimism about AI usefulness.
that that seems like an obvious one. You
know, the developers even after the
study is completed, they think that um
uh that AI is going to be helpful to
their work. It's it makes sense they
might overuse AI um on that basis. Um
two more implicit repository context and
high developer familiarity. You know,
these developers are coming to these
problems already knowing the solution to
the problem. They don't they don't um
they're so expert in this work. you
know, I I I imagine them as as not
trying to spend a bunch of time thinking
through the solution that the the AI can
can work through. Instead, they're just
limited by how fast they can type. Um,
which which means that, you know, using
AIs, instructing AIS to do it, um, comes
with some significant time cost versus
how they might otherwise have spent
their time.
I think many of us have the sense that
AIS might be less performant on on large
and complex repositories, which is a
different from this difference from this
benchmark style evidence or or from or
from some previous work. And then low AI
reliability, you know, um maybe the AIs
are very performant on these kinds of
tasks, but you know, they're only
performant um 50% of the time or 80% of
the time, 20% of the time. And so, at
the very least, you need to check their
work afterwards. And perhaps even you
need to spend time correcting their work
afterwards, which is which is something
we see quite a lot on these issues.
One thing from the factors with an
unclear effect that I that I'll mention
briefly I have to talk to people about
later is below average use of AI tools
which came up in the public discussion.
This this is in the unclear column
because it's sort of evidence evidence
for and against. Um that that's true for
for many of the things here. We don't
have anything so conclusive to say we're
still working on on this line of work.
Here are some here are some caveats. All
important. Um first you know obviously
we do not provide evidence for all
software developers or tasks. These are
extremely experienced developers working
on extremely complex long-ived open
source repositories. I in my own work
you know not um as expert in the
relevant sense as as these people are.
I'm working on much smaller
repositories. Um I I feel more
comfortable saying that even at this
time I was sped up by AI tools even if
even if the developers weren't. This
setting is weird. It's weird for the
same reasons that it's that it's
interesting this this unusual developer
population.
Second, the experiment is concentrated
in March 2025. As I mentioned, uh we
know that AI progress is rapid. Um
perhaps this this result will have
already changed by the by the time I'm
giving you this talk.
So there's kind of puzzle suggested
right that the benchmark style evidence
is giving um a very impressive sense of
what benchmark of what AI capabilities
look like today. Whereas the more
economic style you know I include labor
market impacts um uh uh working here too
in addition to our in addition to our
field experiments look somewhat more
bearish or or unimpressive. You know why
why is the former not not translating to
the latter? At least naively there seems
to be a clash. How might we go about
resolving this puzzle?
So one possibility is that in fact we we
messed something up. This is this is
still live and on the table. Uh you know
maybe the developers really are um uh
not very capable at using AI and if we
continue to run this experiment as as in
fact we are they'll you know learn more
familiarity with the tools and and so
get productivity benefits that they they
weren't getting at the time. I'm a
little skeptical of that story but but
but that's one possibility.
Another that economists like to bring up
is that we're not incentivizing these
developers to finish quickly. we're
paying them per the hour. Um, which we
do for external validity reasons. Um,
you know, looking through their videos,
I I really, uh, do not think that
they're developing differently in
accordance with these incentives. But,
but that certainly is one possibility
that's on the table.
You know, another um, more statistical
in nature possibility is, you know, this
is a small study. You shouldn't you
shouldn't over update so much from small
studies. We we are doing um, bigger
things that I'm excited to release at
some point. Okay. But let's let's assume
we haven't messed something up and this
is uh this this is a result um uh that
that we think that we think does hold
up. How could we resolve the puzzle?
So, one possibility, you know, as I as I
alluded to briefly, is that reliability
needs to be very high to save time. That
you need to be getting um the the
answers to these problems that
developers are putting in correct. You
know, something like 95 99% of the time
in order for developers to tab tab tab
through and you know, not not um not
spend lots of time verifying the AI's
work, which which of course um is pretty
costly from a time perspective. Another
possibility is SWEBench like or
algorithmic costless scoring at the
margin versus mergeability like scoring.
Sweetbench scores are not trying to
account for you know whether the code is
spilled honable by by other people in
future or whether it's matching quality
considerations that aren't um considered
by the unit tests. You know, perhaps AIS
really are performance according to
SWEBench like scoring, but not
performance according to this kind of
more holistic um uh holistic scoring
that we might care about. Low versus
high context baseliners. I I I mentioned
I mentioned previously these are just
much more skilled humans, you know,
relative to those humans. Perhaps the
AIS are less capable task distribution.
Maybe these are just different kinds of
tasks, you know, in particular less less
messy than the than the benchmark star
task. Maybe that's explaining what's
going on here. Suboptimal capability
elicitation. A huge amount of work has
gone in at meter to making the agents as
performant as possible given the
underlying models on on our kinds of
tasks and um you know that involves
churning through a load of AI tokens.
Perhaps that's that's less the case for
cursor in particular at the time when we
completed the study.
And then interdependence across tasks.
Maybe it's the case that um you know if
humans can complete task A and task B,
AIs can only complete task A but not
task B and of course can do task A
faster. Then it still makes sense to for
humans to do task A and task B, not
delegate task A because you know they
they need to know the outputs. They need
to know how how task A was completed in
order to reliably complete task B. I
think that's that's part of what's going
on. You need to maintain context as
you're working through these subtasks.
Um lastly I will say that we are hiring
not just for this kind of work that
you've um that you've seen being
extended you know ever longer tasks ever
more ambitious um RCTs um even more
sources of evidence from which we can
triangulate the truth about AI
capabilities but also for for much more
besides you can you can find this at
meter.org/careers org/careers. In
particular, I'm excited about research
engineers, research scientists who might
be um hiding in the current audience.
We're excited not just, you know, for
for research types with academic
experience, but very much for scrappy
startup people as well. And we're also
hiring for a director of operations.
And with that, thank you very much for
listening. [applause]
[music]
Our final presenter is here to speak
about Google's first agentic development
platform, anti-gravity.
Please join me in welcoming to the stage
engineer at Google DeepMind, Kevin H.
[applause]
All right. Hello. Last one of the day.
Can we get a uh little energy boost?
Who's ready? Who's ready?
>> [applause]
>> All right, happy Friday. I hope everyone
has had a good week, a good conference.
Um, and let me tell you, it's been a
really bad week if you are Gravity.
Wicked 2 is coming out tonight. And
then, of course, anti-gravity came out
earlier this week alongside Gemini 3 Pro
on Tuesday.
Google Anti-gravity is a brand new IDE
out of Google DeepMind. It's the first
one from a foundational lab and it is
coming right off the press. In fact, um
I probably should be working on the
product right now, but I wanted to spend
some time to share what we've built here
today.
Anti-gravity is unapologetically agent
first. And today I'm going to tell you a
little bit about what that means and how
it manifests in the product. But perhaps
maybe a little bit more interestingly,
we're going to talk a little bit about
how we got here. Product principles,
direction of the industry, these sorts
of things. Um so my name is Kevin How. I
lead our product engineering team at
Google Anravity.
And let's start with the basics. Um, and
first just to get a sense of the room.
Um, who has used anti-gravity?
All right. There you go. Power of
Google. Love it. Um, who's used the
agent manager?
Cool. Nice. Good. Good. All right. So,
basics of anti-gravity.
Anti-gravity, notably anti-gravity, not
anti-gravity, anti-gravity. It's an AI
developer platform with three surfaces.
The first one is an editor. The second
one is a browser and the third one is
the agent manager. So we'll dive into
what this means, which one what what
each looks like. So a paradigm shift
here is that agents are now living
outside of your IDE and they can
interact across many different surfaces
that your agent or that you as a
software developer might spend time in.
And let's start with the agent manager.
So that's the thing up top. This is your
central hub. It's an agent first view
and it pulls you one level higher than
just looking at your code. So instead of
looking at diffs, you'll be kind of a
little bit further back. And at any
given time, there is one agent manager
window.
Now you have an AI editor. This is
probably what you've grown to love and
expect. Has all the bells and whistles
that you would expect. Lightning fast
autocomplete. This is the part where you
can make your memes about yes, we forked
VS Code. And it has an agent sidebar.
And this is the sort of thing it's
mirrored with the agent manager. And
this is when you need to dive into your
editor to accomplish maybe your 80% to
100% of your task. And at any point, we
made it very very easy because we
recognize not everything can be done
purely with an agent for you to command
E or control E and hop instantly from
the editor into the agent manager and
vice versa. And this takes on under 100
milliseconds. It's zippy. And then
finally, something that I love, an agent
controlled browser. This is really,
really cool. And hopefully for the folks
in the room that have tried
anti-gravity, you've noticed some of the
magic that we've put in behind here. So,
we have an agent controlled Chrome
browser. And this gives the agent access
to the richness of the web. And I mean
that in two ways. The first one, context
retrieval, right? It has the same
authentication that you would in your
normal Chrome. You can give it access to
your Google Docs. You can give it access
to, you know, your GitHub dashboards and
things like that and interact with a
browser like you would as an engineer.
But also what you're seeing on the
screen is that it lets you it lets the
agent take control of your browser,
click and scroll and run JavaScript and
do all the things that you would do to
test your apps. So here I put together
this like random artwork generator. All
you do is refresh and you get a new
picture of um like a Thomas piece of
Thomas Cole artwork. And now we added in
a new feature which is this little
little modal card. And the agent
actually went out and said, "Okay, I
made all the code, but instead of
showing you a diff of what I did, let's
instead show you a recording of Chrome."
So this is a recording of Chrome where
the blue circle is the mouse. It's
moving around the screen. And in this
way, you get verifiable results. So this
we're very excited about our uh our
Chrome browser. And then the agent
manager can serve as your control panel.
The editor and the browser are tools for
your agent. And we want you to spend
time in the agent manager. And as models
get better and better, I bet you you're
going to be spending more and more time
inside of this agent manager. And it has
an inbox. And I'll talk a little bit
about this and sort of why we did this.
But it lets you manage many agents at
once. So you can have things that
require your attention. For example,
running terminal commands. We don't want
it to just kind of go off and just run
every terminal command. There are
probably some commands that you want to
make sure you you hit okay on. So things
like this will get surfaced inside of
this inbox. One click. you can manage
many different things happening at once
and it has a wonderful OS level
notification. So if there is something
that you need it will sort of let you
know and this kind of solves that
problem of multi-threading across many
tasks at once and so our team is
thrilled to launch this brand new
product. It's a brand new product
paradigm and we did so in conjunction
with Gemini 3 which was a very exciting
week for the team but alas we ran out of
capacity.
Um, this has been tormenting me the last
couple of days and so I apologize. On
behalf of the anti-gravity team, I'd
like to apologize for our global chip
shortage. Um, we're working around the
clock to try and make this work for you.
Uh, hopefully we'll have a few less of
these sorts of errors. Um, but we've,
it's what's been really exciting is
people who have used the product have
seen what the magic of combining these
three surfaces can do for your
workflows, for your software
development. Um, so let's talk about it.
Why did we build the product? How did we
arrive at this sort of conclusion? You
might say, "Oh, adding in a new window,
it's pretty pretty random, right? It's
this one to many relationship between
the agent manager and many other
surfaces."
Um, and it's important to remember, I've
I've been at this conference a couple of
times, and and everything every single
time there is this theme. The product is
only ever as good as the models that
power it. And this is very important for
us as builders, right? Every year there
is this sort of new step function. The
first there was a year when it was
autocomplete, right? Copilot. And this
this sort of thing was only enabled
because models suddenly got good at
doing this short form autocomplete. And
then we had chat. We had chat with RLHF.
Then we had agents. So you can see how
every single one of these product
paradigms is sort of motivated by some
change that happens with model
capabilities. And it's a blessing that
our team is able to work and be embedded
inside of DeepMind. We had access to
Gemini for a couple of months um earlier
and we were able to work with the
research team to basically figure out,
you know, what are the strengths that we
want to show off in our product? what
are the things that we can exploit and
then also what are the gaps right this
desired experience where are the gaps in
the model and and how can we fix that
right and so this is this was a very
very powerful part of why anti-gravity
came to be and there are four main
categories of improvements powered by a
little nano banana artwork the first one
is intelligence and reasoning you all
are probably familiar with this you use
nano or you used um Gemini 3 and you
probably thought it was a smarter model
this is good it's better at instruction
following it's better at using tools
tools. There's more nuance in the tool
use. You can afford things like, you
know, there's a browser now. There's a
million things that you could do in a
browser. It can literally even execute
JavaScript. How do you get an agent to
understand the nuance of all these
tools? It can do longer running tasks.
These things now take a bit longer,
right? And so you can afford to run
these things in the background. It
thinks for longer. Just time time has
gotten stretched out. And then
multimodal. I really love this property
of what Google has been up to. The
multimodal functionality of Gemini 3 is
off the charts. and you start combining
it with all these other models like Nano
Banana Pro um and you really get
something magical. So we have these
roughly four different categories where
things have gotten much better
and if you think about these properties
the question becomes what do we do about
these differences and from a product
perspective it's like how do you
construct a product that can take
advantage of this new wave and hopefully
and in my opinion this is the next step
function autocomplete chat agents and
then I probably got to come up with
something more interesting than whatever
this thing is called.
So step one is we want to raise the
ceiling of capability.
We want to aim higher, have higher
ambition.
And so a lot of the teams at DeepMind
were working on all sorts of cutting
edge research, right? There's Google is
a big big big company. And one of my
learnings going from a startup to one of
these bigger companies is that there is
a team of people that is attacking a
very very hard technical problem. And as
a nerd, this is super exciting, right?
And then as a product person it's like
wow we can start using computer use. So
browser use has been one of these huge
unlocks.
And this is twofold right I mentioned
the sort of retrieval aspect of things.
Um
I guess for for software engineers there
is much more that happens that is beyond
the code right you can roughly think
about it as there's what to build
there's how to build it and then you
actually have to build it. I would say
building it has become more or less you
know it's reasonable for the model to
now given context it can generate the
code that hopefully functionally works
and then you've got the what to build
this is the part that is up to you kind
of human imagination and then there's
the how to build it right and there's
this richness in context the richness
and institutional knowledge and these
are the sorts of things that having
access to a browser having access to
your bug dashboards having access to
your experiments all these sorts of
things that now gives the agent this
additional level of context and maybe I
should have clicked before, but if you
saw on the screen, let's see, how do I
do this?
So, this is now the other side of
things. Browser has verification. So,
you might have seen this video. This is
a tutorial video that we put together on
just how to use it. But this is the
agent. The blue border indicates that
it's being in control by the agent. And
so, this is a flight tracker. You put
in, you know, a flight ID and then it'll
give you sort of the start and end of of
that flight. And this is being done
entirely by a Gemini computer use
variant. is so it can click, it can
scroll, it can retrieve the DOM, it can
do all the things. And then what's
really cool is you end up with not just
a diff, you end up with a screen
recording of what it did. So it's
changed the game and the model can take
this and because it has the ability to
understand images, it can take this and
iterate from there. So that was the
first category, browser use, just an
insane insane magical experience. Now
the second place that we wanted to spend
time is on image generation. And we
noticed this theme when we, you know,
when I when I first started at at
Google, we noticed, okay, Gemini is
spending a lot of time on multimodal.
And this is really great for consumer
use cases, right? Nano Banana 2 was was
mindboggling. Um, but also for devs.
Devs are inherently this is a multimodal
experience. You're not just looking at
text. You're looking at the output of
websites. You're looking at architecture
diagrams. There's so much more to coding
than just text. And so there's image
understanding. This is verifying
screenshots, verifying recordings, all
these sorts of things. And then the
beautiful part about Google is that you
have this synergistic nature. This
product takes into account not just
Gemini 3 Pro, but also takes into
account the image side of things. And so
here I want to give you a quick demo of
um mock-ups. So I have a hunch and you
all probably believe this too. Design is
going to change, right? You're going to
spend, you know, maybe some time
iterating with an agent to to arrive at
a mockup. But for something like, oh,
let's build this website. we can start
in image space. And what's really cool
about image space is it lets you do
really cool things like this. We can add
comments and so you end up commenting
and leaving a bunch of a bunch of queued
up responses. And it's kind of like
GitHub. You'll just say, "All right, now
update the design." And then it'll put
it in here. The agent is smart enough to
know when and how to apply those
comments. And now we're iterating with
the agent in image space. So really,
really cool new capability. And what was
awesome is that um we had Nano Banana
Pro, you know, we pulled an allnighter
for uh for the Gemini launch because
that was our first launch. Then they
said, "Do it again. Do it on Thursday."
So we made Gemini Pro. Um I'm getting
all these model names confused. The
image gen one, the Nano Banana one, we
made that available on day one. I'm
running on very little sleep on day one
inside of the anti-gravity editor. And
our hope is that the anti-gravity editor
is this place where any sort of new
capability can be represented inside of
our product.
And so step two was all right, we have
this new capability. We've pushed the
ceiling higher. Agents can do longer
running tasks. They can do more
complicated things. They can interact on
other surfaces. And so this necessitates
a new interaction pattern. And we're
calling this artifacts.
This is a new way to work with an agent.
And this is one of my favorite parts
about the product. And at its core is
this agent manager.
So let's start by defining an artifact.
An artifact is a dynamic representation
of something that the agent generates.
Sorry, it's a an artifact is something
that the agent generates. That is a
dynamic representation of information
for you and your use case. And the key
here is that it's dynamic.
Artifacts are used to keep the agent
organized. They can use used for uh kind
of like self-reflection and and
self-organization. It can be used to
communicate with the user to maybe give
you a screenshot, to maybe give you a
screen recording like we described. And
it can also be used across agents,
whether this be with our browser sub
agent or with other conversations or as
memory. And this is what you see on the
right side of this agent manager. We've
dedicated sort of half the screen and
and your sidebar to this concept of
artifacts.
And so we've all tried to follow along
chain of thought. And I would say this,
you know, we did some fanciness here
inside of the agent manager to make sure
conversations are broken up into like
chunks. So in theory, you could follow
along a little bit better in the
conversation view, but ultimately you're
looking at a lot a lot of strings, a lot
of tokens. This is like very hard to
follow. And then this is actually like
there's like 10 of these, right? So you
just scroll and scroll and scroll.
You're like, what the heck did this
agent do? And and this this has been
traditionally the way that people review
and sort of supervise agents. You're
kind of just looking at the thought
patterns.
But isn't it much easier to understand
what is going on inside of this visual
representation? And that is what an
artifact is. The whole point and the
reason why I'm not just standing up here
and giving you this long, you know,
stream of consciousness is because I
have a PowerPoint. The PowerPoint is my
artifact. And so Gemini 3 is really
really strong with this sort of visual
representation. It's really strong with
multimodal. And so instead of showing
this, which of course we always let you
show, we always we will always show you
this, but we want to focus on this. And
I think this is the game-changing part
about anti-gravity.
And the theme is this dynamicism.
The model can decide if it wants to
generate an artifact. And let's remember
there are some tasks. We're changing a
title. We're changing something small.
Doesn't really need to to produce an
artifact for this. So, it will decide if
it needs an artifact. And then second,
what type of artifact? And this is where
it's really cool. There there are many
potential in potentially infinite ways
that it can represent information. And
so, the common ones are markdown in the
concept of a of a plan and a
walkthrough. So this is probably what
you've used most most often. When you
start a task, it will do some research.
It will put together a plan. This is
much very similar to like a PRD. It will
even list out open questions. So you can
see in this feedback section, it'll
surface, hey, you should probably answer
these three questions before I get
going. And what's really awesome, and
we're betting on the models here. What's
really awesome is that the model will
decide whether or not it can auto
continue. If it has no questions, why
should it wait? It should just go off.
But more often than not, there are
probably areas where you may be
underspecified or maybe it did something
during research, right? everyone has
gone through and and started a big
refactor then realized they actually
don't have all the information ahead of
them. They got to go back to the drawing
board, maybe talk to some people. Same
idea. So it'll surface um it'll surface
open questions for you. And so that's
you'll start with that implementation
plan and then you'll say all right LGTM
let's like send it. You'll go all the
way down. It might produce other
artifacts. You know we've got a task
list here. This is the way that you can
monitor the the progress of the agent
instead of looking at the conversation.
might put together some architecture
diagrams and then you'll get a you'll
get a walkthrough at the end and this
walkthrough you kind of saw a glimpse of
this before but it is hey how do I prove
to you agent to human that I did the
correct thing and I did it well and then
this is the part that you'll end with
it's kind of like a PR description and
then there's a whole host of other types
right Images screen recordings these
mermaid diagrams and really what's
what's what's quite cool is that because
it's dynamic the agent will decide this
over time so suddenly there's maybe a
new type of artifact that we maybe we
missed Right? And then it'll figure that
out. It'll just become part of the
experience. So it's very scalable. But
this artifact primitive is something
that's very very powerful that I'm
pretty excited about. And then I guess
another question is why is it needed? So
we'll always explain to the user what
the purpose of this artifact is. Um and
then interestingly like who should see
it? So should the sub agents see it?
Should the other agents see it? Should
other conversations see this? Should
this be stored in my memory bank? Right?
If this is something that I derived, one
of the cool examples um that I like is
like if you give it a a piece of
documentation and give it your API key,
it'll like go off and run curl requests
to basically figure out the exact schema
of like what the types of APIs you're
using and it'll do this like deep
research um for quite a while and then
it'll give you a report and basically
like deeply understand uh this sort of
uh this sort of API. You wouldn't want
to just throw that away and have to
rederive it the second time you did
this. So it'll store it in your memory
and then all of a sudden that's just a
part of your knowledge base. So, and
then there's also this idea of like
notifications, right? So, if there's an
open question, you want the agent to be
proactive with you. And that's another
very cool property of this artifact
system. We want to be able to provide
feedback along this cycle. So, from task
start to task end, we want to be able to
provide feedback and inform the agent on
what to change.
And the artifact system lets you iterate
with the model more fluidly
during this process of execution. And
so, not to sound like a complete Google
shell, but I love Google Docs, right?
Google Docs is a great pattern. It's
awesome. The comments are great. And
this is how you might interact with a
colleague, right? You're collaborating
on a document. Then all of a sudden, you
want to leave a textbased comment. So,
we took inspiration from that. We took
inspiration from GitHub. But you leave
comments. You highlight text. You say,
"Hey, maybe this part needs to get
ironed out a bit more. Maybe there's a
part that you missed or actually don't
use Tailwind. Use vanilla CSS." So,
these are the sorts of comments that you
would leave. You'd batch them up. And
then you go off and send. And then in
image space, this is very cool. We now
have this like Figma style drag and drop
like or not drag, you know, highlight to
select. And now you're leaving comments
in a in a completely different modality,
right? And we've done this and
instrumented the agent to ma naturally
take your comments into consideration
without interrupting that task execution
loop. So at any point during your
conversation, you could just say, "Oh,
actually, you know, mid mid browser
actuation, I actually really don't like
the way that that turned out. Let me
just highlight that, tell you,
send it off." and then I'll just get
notified when you're done taking into
consideration those comments. And so
it's a whole new way of working. And
this is really at the center of what
we're trying to build with anti-gravity.
It's pulling you out into this higher
level view. And the agent manager really
is built to optimize the UI of
artifacts.
So we have a beautiful, beautiful
artifact review system. We're very proud
of this. And it can also handle sort of
the property that is like parallelism
and orchestration. So whether this be
many different projects, whether this be
the same project and you just want to
execute maybe a design mockup iteration
at the same time you're doing research
on an API at the same time you're
iterating and and and actually building
out your app. You can do all these
things in parallel and the artifacts are
the way that you provide that feedback.
The notifications are the way that you
know that something requires your
attention. It's a completely different
pattern. And what's really nice is that
you can you can take a step back and of
course you can always go into the
editor. I'm not going to lie to you.
There are tasks that you know you maybe
don't trust the agent yet. You don't
trust the models yet. And so you can
command E and you can command E and
it'll open inside the editor within a
split second with the exact files, the
exact artifacts and that exact
conversation open ready for you to
autocomplete away to continue chatting
synchronously to get you from 80% to
100%. So we always want to give devs
that escape hatch. But in the future
world, we're building for the future.
You'll spend a lot of time in this agent
manager working with parallel sub
agents, right? It's a very very exciting
concept.
Okay, so now that you've seen we've got
new capabilities, multitude of new
capabilities, we've got a new form
factor. Now the question is like what is
going on under the hood at Deepmind? And
the secret here is a lesson that I guess
we've just learned over the past I don't
know we've spent like or I I've
personally spent like three years in in
codegen just to be your your biggest
user right and that creates this
research and product flywheel
and so I will tell you anti-gravity will
be the most advanced product on the
market because we are building it for
ourselves we are our own users and so in
the dayto-day
we were able to give Google engineers
deep mind researchers is we were able to
give them an early access and now an
official access to anti-gravity
internally. And so now all of a sudden
the actual experience of the models that
people are improving, the actual
experience of of using the agent manager
and touching artifacts is letting them
see at a very very real level what are
the gaps in the model
and whether it be computer use, whether
it be image generation, whether it be
instruction following, right? Every
single one of these teams, and there are
many teams at Google, has some hand
inside of this very, very full stack
product.
And so you might notice as an
infrastructure engineer, you might say,
"Oh, this is a bit slow.
page. Well, go off and and make that
better, right? So, it gives you this
level of insight that eval just simply
can't give you. And I think that's
what's really cool about being a deep
mind. You are able to integrate product
and research in a way that creates this
flywheel and pushes that frontier. And I
guarantee you that whatever that
frontier provides, we will provide an
anti-gravity for the rest of the world.
These are the same product. And so, I'll
give you two examples of how this is has
worked. The first one was that computer
use example, right? in collaboration
with a computer use team which we sit
you know a couple couple tens of feet
away from we identified gaps on both
sides right so we're not just using an
API we are interacting across teams to
basically say oh like the capability is
kind of off here can can we go off and
figure out what's going on here maybe
there's a there's a mismatch in data
distribution and then on the other side
it's like yo your like agent harness is
like pretty screwed up you got to fix
your tools right and so then we'll go
off and we'll fix our side but it's this
harmony it's it's both sides talking to
each other that really makes this type
of thing possible. Similarly, you come
up with a new product paradigm
artifacts. Artifacts were not good on
the initial on the initial uh versions,
right? What part of training, what part
of data distribution includes this like
weird concept of reviews? And so, it
took a little bit of plumbing, a little
bit of work with the research team to
figure out, all right, let's steadily
improve this ability. Let's give you a
hill to climb. And then now we were able
to launch Gemini 3 Pro with a very good
ability to handle these sorts of
artifacts. And so it's this cyclic
nature that I'm really really betting
on.
And this this is really how anti-gravity
will defy gravity. We've got pushing the
ceiling. We're going to have an agent
with very very high level of ambition.
We're going to try and do as much as we
can. And this includes vibe coding.
Though I will say there are some
excellent products out there by Google.
AI studio is an excellent product.
We are in the business of increasing the
ceiling.
Second, we built this agent first
experience artifacts agent manager. And
then finally, we have this research
product flywheel. And this is the magic
and this is the three-step process that
we used in building anti-gravity.
So, it's been a blast. I mean, I've I've
been back at um AI Engineer Summit.
Thank you again, Swix and Ben, for
having me. It's been awesome to come
back every year. And so on behalf of the
anti-gravity team, I just want to thank
you for your time, for your patience as
you use the product um and your support.
And of course,
you too can adopt a TPU and help us uh
turn off pager duty a bit more. Um and
then of course, you know, you could also
yell at me on Twitter. That's another
way of doing it. Maybe do it in DMs
instead. Um but we've got a lot of
exciting things and I'm really really
excited to bring anti-gravity to market.
The team is thrilled that this is now
out in the wild. So we welcome your
feedback. Um, and thank you again for
listening. Enjoy the rest of the
conference. [applause]
[music]
Ladies and gentlemen, please welcome
back to the stage Jed Borave.
>> Thank you, [music] Kevin. Let's hear for
Kevin and all of our speakers today.
[applause]
All right. What a great way to end our
sessions. How are we doing? I know it's
been a big day.
>> Yeah, we're still somewhat alive. Okay,
before we wrap up, a few logistics.
Thing. One, there is an official
afterparty tonight. If you've
registered, there'll be a follow-up
email in about 30 minutes. The second
thing is tomorrow is a full day of
workshops. Importantly, they are not
here. There's two different buildings.
There's the Data Dog building that's
right around the corner. There's also
the AWS building, JFK27 on 39th Street.
But when in doubt, the schedule is on
the website. My last act as MC is to
invite the events co-founders up to
stage for a special closing word. Please
join me in welcoming AI engineer
co-founders Benjamin Dumpy and Swix.
[applause]
>> All right,
let's keep it going for Jed, everyone.
Let's keep it going for Jed [applause]
and Alex yesterday, our lovely MC's who
really glued this place together. So, we
really had such a great time. Did you
all have a good time?
[applause]
>> So glad to see that. Hopefully the
program went smoothly. Um me and the
team here, the lovely team at the Time
Center, Argus HD, Max Video Productions,
everyone here supporting the production.
That's us and the content curation.
Let's give it up for this guy. Oh, thank
you. [applause] I hope you enjoyed it.
You have no idea how hard he works on
this. And for me as a conference
producer who used to do both the content
curation and the youth production, it's
just a godsend to partner with someone
who is, you know, on the forefront of
thought leadership like this. And you
have no idea how hard he works. And he's
actually recently so invested in this
that he's actually recently stepped up
as CEO. So,
>> oh, okay.
>> Just kind of announced that.
>> I was going to do that.
>> I was expecting Yes.
>> But yes, Sean is now the the new CEO of
the company. And I couldn't I wouldn't
have it any other way. If there's one
person I want to follow like in this
world, it's this man. And there's one
person I'm happy to be working for, it's
this man.
>> Oh yeah. Likewise. I the production
quality and the creative design and all
the music that you hear online is is all
Ben. So you got to give it all for him
as well.
[applause]
>> Thank you.
So I think we're on let's go to the
other notes. Let's go to this. So we
wanted to talk just a little I mean you
saw the slide earlier from from Sean.
It's basically just showing the growth
that we had you know like I when I
started conference production I always
had I always approached events from the
point of view for myself and with
clients that from your first event you
you you want to start small and you know
worst case scenario you sell out. You
don't want to start big and you know you
have an empty venue you blow the budget
you go bankrupt. So we did that
intentionally. We kept the first one
very small when we announced it you know
with the rise of the AI engineer back in
2023. Um, and we sold out that and then
we sold out every single event since
then and we're also growing online. So
all of you watching on the live stream,
thank you so much for being part of this
community remotely. If you can make it
to one of our events, we are growing. We
have some announcements for you in just
a few moments. We'll be coming to you a
little bit closer around the world. Um,
and for this event in particular, we had
2,463
applicants and 815 of you came. So,
uh, if we ask an LLM, they might tell us
that's maybe, you know, 2% admission
rate, I think, something like that. It's
actually 33%. But, uh, that's just
showing like the exclusivity of this
event in particular. Summit is designed
to be a little more exclusive because we
want this to be our salv conference. If
you know the Salv conference, some of
you might. That's essentially where
Albert Einstein, Nurie Curi, all of the
physicists from around the world came
and advanced physics over the course of
a couple weeks I think um basically
decades, right? And basically to the
point where we are today because string
theory, where has that taken us? Come
on. Um hopefully we can solve that with
uh the rise of AI. Um in any case, uh
that's what we have for summit. And uh
where we at? I put this together in the
last 30 minutes, so pardon me. Um, so
there's clearly demand for this
community. There's clearly growth here.
There's um a lot of excitement here and
there's a lot of great content and
there's a lot of questions to be
answered. There are no clear answers
here. This is potentially, as people
argue, a new consciousness, right? It's
a new form of consciousness, right?
We're all discovering this together. So,
as we push forward into the dark, into
the future, in this wonderful, beautiful
moment, in this temporal existence, we
will figure it out together. That's why
we are excited to announce,
roll tape,
World's Fair is coming back to San
Francisco 2026
on June 29th, typo, to July 2nd. So,
we're going to do four days. So, we're
going to do one workshop day and then
three days of sessions. So, tickets are
on sale now. These are going to be the
cheapest they will be. Did we come up?
Did we decide a discount code or just
like
>> the price is cheap?
>> No, the super early bird is the
discount.
>> It's the super early bird.
>> Okay.
>> Well, it's there's two choices, right?
The ticket is this price and we're going
to change the price later or here's the
discount code. So, that's always a
question. Um, we're running the event.
This is, you know, side stuff. This is
future events. So these uh so so we're
really happy and very excited to
announce that this year we are at Mosone
West that is the baby brother of the
Moscone Center. So we're not quite there
yet but we sold out the Marriott Marquee
which is the largest um event hotel in
San Francisco at about 3200 uh just a
few months ago and so the only place to
go from there is to Moscone West which
is about 6,500 capacity. So, we're
guesstimating, you know, conservative
5,000, but we hope to get to the max
capacity of 6,500. So, buy your tickets
now. They're not going to get any
cheaper. Um, why are we Oh, the video
ended. Supposed to be quicker than this.
Let's go to the the next one. We got one
more announcement.
[laughter] Let's You want to say
something about Worlds Fair?
>> Uh, yeah. Worlds fair is our attempt at
is our flagship. We are basically trying
to capture all of AI in as in one event.
and for you to basically have kind kind
of an all you can eat pass to go to
multiple conferences at once. Uh this
year we had 10 simultaneous tracks that
was a lot. Um but so we'll never expand
beyond that but I think we want to
really make it count and have you
basically just dip into whether it's
like generative media or like voice or
like robotics or anything else that you
want to explore. Uh we are also bringing
it to Europe for the first time. So
that's uh the next announcement.
>> Yes. So this next April, April 8th to
10th, whether you're from Europe or you
want to make the trip to beautiful
London, we are right in Westminster at
the beautiful Queen Elizabeth 2. We can
fit about 800 people in there. So buy
your tickets soon. We expect that one to
sell out as well. It's going to be
basically baby World's Fair. We want to
establish World's Fair essentially in
Europe. There's no better place than
London. We love Paris and we love what
the COB team did with uh a engineer
Paris. Um but we feel London is the
right call for this event. The venues
there are beautiful. The city's
beautiful. I'm biased. I really love
Paris, but London ain't bad, too. Um I I
live there for a little bit, so I do
enjoy it.
>> I really care about direct flight from
SF to London. So
>> that's
it.
>> Very nice.
>> Yeah.
>> Okay. So tickets are also on sale here.
Again, super early bird pricing. The
cheapest they're going to get. So
ai.engineer/urope.
I don't know when those are going to
expire, but I'm sure we'll communicate
that to you guys later. Um, and then I
think this is a video, so let's go to
the next one. Okay, so with all of this
growth,
um, little old Sean and me, we can't do
this ourselves. So, we needed to bring
someone on to really help grow the
company. We needed to get procedures. A
lot of a lot of organizing conferences
is it's it's a lot of grunt work, right?
It's a lot of human connection. It's a
lot of reminders. It's a lot of sales,
but it's also running the business and
then just coming up with processes. So,
a lot of times you're kind of figuring
it out as you're going. And you get
these processes, but then things change
and then some things are not perfect.
So, you're always behind and deadlines
are always coming and these budgets are
insane. Like, you have no idea and you
don't want to see a budget for one of
these events. Um, and there's just a lot
of complexity to that. So when we first
started the event, we um back in 2023,
we brought on Leah McBride who uh came
from who used to be the director of
events at Twitter and she helped us out
within two months. Basically, she was
helping us to run an engineer summit to
the point where it it was a lot smoother
than when I was just running events on
my own with, you know, you get the
Hollywood crew that comes as
professionals to help you run the actual
event, but the whole production is is is
a lot of difficulty. So, we're very,
very pleased to announce that Leah
McBride is joining us as our new general
manager to help us grow into a proper
corporation. So, please join me in
welcoming to the stage Leah McBride,
everyone.
>> Leah,
[applause]
[applause]
thank you, Ben and Swix. Hi, everyone.
Um, as Ben has just told you, um, I'm
very excited to join the company as
general manager. I've been working in
event marketing uh tech event marketing
for almost 20 years. I was lucky enough
to join one of the biggest London
agencies. I am Scottish so I I'm from
the UK um way quite a few years ago and
um Google was my main client for quite a
long time. So I kind of grew up through
that um Google excellence and how we
operationally produce excellent events.
Um I was then lucky enough to go on and
be uh the director of events for the
developer platform at Twitter for a
number of years where I led multiple
global tours.
[laughter]
>> H I led multiple global tours and I also
led um our flagpole event in San
Francisco which was flight if anyone
ever went to that.
>> Um and following that I moved that was
in San Francisco. Then I moved back to
London and I um I started a company
called Improbable uh which is a gaming
platform. Uh so I worked uh mainly with
gaming developers there and grew the
marketing team from three up to 38 and
following that I was lucky enough to
meet Benis. So yes I've been with them
since the first event um and we have
just got a super exciting year coming
up. So to add to our program of Europe
and uh Moscone, San Francisco,
we if anyone was there, I don't know,
was anyone in Paris with us?
>> Excellent.
>> Show of hands. We got like five or six
people.
>> So you'll be able to contest that the
Paris event was absolutely phenomenal.
So um our partners, Collab, came to us
and asked if we were going to do Paris.
We weren't really ready for that and
they suggested that we partner with
them. This was a great idea. [laughter]
Um they did an absolutely incredible job
of producing the Paris event. We It was
a soldout sponsorship, sold out event.
We had so much incredible feedback that
we have actually decided to turn that
into a program. So we are launching in
2026 our partner program. We have
already signed up. So we are hoping
again to be doing Paris October 2026.
Um we have also recently signed up with
a Miami team. Um so we're going to be
doing an AI engineer Miami April 20 to
21st and then also we're very excited to
be going to Melbourne also with a
partner. So we'll be doing Melbourne
June 3rd to 4th 2026.
Um this part this program is obviously
just starting. So there's going to be
more opportunity as we grow to be part
of this program. So if there is anyone
in this room who wants to know more
about that, please just email us at
sponsorshipsai.engineer
and [snorts] we can talk about what that
looks like for your city. Um so yeah, so
thank you. I'm very excited for 26.
>> Thank you so much. [applause]
>> All right, so that's basically all the
announcements we had. We hope to see you
at one of those events whether you come
to San Francisco or London. We'll
probably be be back for New York. So,
should we come back to New York?
>> All right. Uh uh by the show of Woos,
how do you say that? By Woos, how many
of you are from New York?
>> Wow. Yeah.
>> San Francisco.
>> That was close. Let's do it again. New
York.
>> [screaming]
>> San Francisco.
>> Oh man, I think Scotland
>> 73.
>> San Francisco changed pitch.
>> All right.
>> San Francisco went hired this time. I
don't know.
>> I met Cape Town the other day. So is
Cape Town here.
>> Cape Cape Town has left the building.
>> We had New Zealand, right? So people are
coming all all over the world for these
events. So we obviously New York is a
great one of the great cities of the
world and um of course you know those
70% of you will say the greatest um and
uh we love coming here. We love giving
people an excuse to come here. So uh we
we hope to be back soon and we love this
venue uh despite being in Time Square.
Um we we we we put up with it because
the crew here is just so fantastic
obviously like look how gorgeous this
place is. Um so we we hope to
potentially be back here uh next year.
Um, and then the rest of our crew that
we cannot do this without um, everyone
from Argus HD who's running the all of
the AV. Can we give a hand round of
applause for them? [applause]
And they put up with us because we're
getting them as super super late. So,
um, namaste. I think that's what I'm
supposed to say. Um, and I always forget
people at this moment. Who else am I
forgetting? Flormon catering. Our
caterers are are super fantastic. um you
know, photographer, Randall Ge, Max
video productions, um doing our B-roll.
We're gonna have some of those end of
day clips.
>> Marina, Kyle,
>> yes, of course. Our rest of our team
members, my god, why don't I add those
as as our slides? We got those in the
end credits. I at least thought enough
to I was doing this like 30 minutes
away. Um anyways, yeah, Marina, our our
senior event producer, Wendy, who
recently just joined, Kyle, our program
production manager, just joined like two
weeks ago, and it's been like just hit
the ground running. um Trish, our our
production uh assistant uh supervisor
and like just so many of our partners in
programs. So we we want to we want to
thank all of them. Um but that's about
it for for that. Apologies if I forgot
you, but uh we now have about until I
think we have until like 6:30 in the
venue. So we have a little bit of time
to kind of chill, make your final
evening plans. We do have an official
evening afterparty brought to you by
Cerebras along with uh some friends BCV,
Mackenzie, Warp, Exa, Modal, am I
pronouncing that right? And Metronome.
So that is the official off-site
afterparty. We're not giving you the
location right now because we're on the
live stream right now, but believe it's
on the signs outside and we'll also send
an email in just a few moments and it's
also on your attendee guide. So, uh, do
go ahead and check that out. They are
asking for RSVP for headcount, but you
don't need it. you can just show up with
a badge and you can get it. But don't
forget your badge. Um, so let's give it
up for Cerebras. Thank you. [applause]
Also, last people to thank is our our
sponsors and speakers. Like our
speakers, they work so hard for this and
you know, due to Sean's no vendor talks
uh policy, which um is, you know, it's
it's a great boon for the for for the um
for for the community, but they have
very little incentive to do that other
than, you know, thought leadership. So,
um, that is a lot of work and they they
they do a great job with that. And then
our sponsors, uh, do we enjoy the
sponsor expo?
>> Yeah,
[applause] it's really well done by Art
and Display, too.
>> Should we try for the photo?
>> Sorry.
>> Should we try for the photo?
>> Uh, sure. Uh, yeah. Is that the plan?
Okay, cool. Yeah.
>> So, we did this last year and it was
really great. This is like this is our
salv conference photo. If you want to
come up here and do a photo, we can do
that together. Uh Randall Gear, our
lumpy photographer, is going to take a
photo. He will direct us if we can
actually get him the mic and then he can
like yell at us to go left and right. So
come on stage. Just one caveat. These
things here, they look like you can step
on them. You can't. They will break and
we'll get charged a lot of money. So
just be careful of those. But otherwise,
come on down if you want to.
>> Yeah, we did this. We've done this every
conference. It's a little memorabilia.
>> Please come on and please keep my mic on
for a little bit.
>> Hi. Thanks for coming.
>> Yeah.
>> Hope you enjoyed it.
>> Oh my gosh. I'm doing everyone. Hey,
>> thank you so much.
>> I'm in Minneapolis, so we just started
like a meet up there and that's awesome.
>> Yeah, we we do we do AI meetups, too.
>> Pleasure. Yeah,
>> I'm like on a hot mic.
>> Code Summit Summit World's Fair.
>> Okay.
>> Yeah. So the AI engineer series of
events.
Okay, don't be shy. take up some steps
to the front of the stage and we'll have
different rows. And if you're on the
left side of the podium, I won't get
you. So get on this side of the podium.
>> And don't squat yet, but we'll do kind
like a little squat in the front, but
not yet. Okay?
And if you can, don't be shy. Squish in.
There's a lot of space in the middle, so
don't be shy. towards the middle. This
is the middle. Thank you very much.
>> Oh,
[laughter]
actually
this side way left.
Okay. So, now we're going to have the
front squat just a little bit. If you're
in the middle, squat, medium. And then
if you're in the back row, you can tilt
a little bit. Yeah. On three. Here we
go. On three. Looking at me. One, two,
three. One, two, three. Wait for it.
Make sure it's good. Okay, now give me
some Y.
[cheering]
>> Perfect. Thank you.
[applause]
Guys,
hello
Hey, hey, hey.
Heat. Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
[music]
Heat.
>> [music]