AIE CODE 2025: AI Leadership ft Anthropic, OpenAI, McKinsey, Bloomberg, Google Deepmind, and Tenex

Channel: aiDotEngineer
Published at: 2025-11-20
YouTube video id: cMSprbJ95jg
Source: https://www.youtube.com/watch?v=cMSprbJ95jg
[music]
This is insane.
Typing thoughts into [music] the darkest
spark becomes design. Words evolve
[music] to whispers meant for something
more divine. Syntax bends and breeze. I
see the language change. I'm not
instructing anymore. I'm rearranging f.
Every loop I write [singing] rewrites
me. Every function hums with meaning. I
feel the interface dissolve [music]
between the maker and the
new code. Not on the screen but in the
soul where [music] thought becomes the
motion and creation takes control. No
lines no rules just balance [music] in
between the zero and the one. The
silence and the dream.
>> [music]
>> systems shape our fragile skin. They
mold [singing] the way we move. We live
inside the logic gates [music] of what
we think is true. But deep beneath the
data post, [music] there's something
undefined.
A [singing] universe compiling the image
of our [music] minds. Every line reveals
reflection. Every loop replace [music]
connection. We're not building, we're
becoming. And the code becomes
confession.
This is the [music] new code. Not on the
screen, but in the soul where thought
becomes the motion. [music] Creation
takes control. No lines, no rules, just
balance in between the [music] zero and
the one. The silence and the dream.
[music]
[music]
[music]
We are not just
Don't worry. [music] Uh, we're just
giving you something to do while Codeex
writes all your code.
We are the world.
[music] Each prompt, each breath, each
fragile spin, a universe [music]
renewing.
This is the new code.
Alive and [music] undefined.
Where logic meets emotion and structure
bends to mind. [music] The systems
eternal but the soul writes the line. We
are the new code.
[music] Compiling time.
[music]
on fire inside.
[music]
[music]
[applause]
Ladies and gentlemen, please join me in
welcoming to the stage the co-founder
[music]
of Morning Brew and the managing partner
of 10X, your host for the leadership
[music] track session day, Alex
Lieberman.
Keep it going. Let's get a quick read of
the room. If you are coming from right
here in the Big Apple from New York,
make some noise.
Okay, now I have to say it. I assume
this is the biggest group. San
Francisco.
>> Wow, that is surprising. Uh, Austin.
>> Okay, we got Austin. Who thinks they
came from the furthest place and is in
the room today?
>> Where? Where?
>> Ecuador. Can anyone beat Ecuador?
[applause]
>> New Zealand.
>> I don't think anyone's going to beat New
Zealand. There we go. Well, first of
all, uh, I am so excited to welcome you
all to the AI Engineer Code Summit 2025.
Uh, I'm Alex Lieberman, co-founder of
Morning Brew and your MC for the day.
Um, now you may be wondering, why is a
newsletter guy hosting an AI engineer
conference? It's a great question. Well,
after I left my role at Morning Brew, I
asked myself one simple question, and it
was, what space do I want to spend my
time in for the next 20 years where I
can build something consequential and
spend my time with some of the smartest
people I've ever met? And the answer
became obvious. I wanted to be as close
to the frontier of AI as humanly
possible, which is why I co-founded
10x.co, CO, which is an AI
transformation firm helping mid-market
and enterprise companies learn how to
use AI within their business. And I
spend basically all of my time now with
AI engineers like yourselves. I'm the
only non-technical person in the
business, and I wouldn't have it any
other way.
So, as you know, this year has been a
banner year for the industry. And I
would think of today as both a look back
on where we've been as well as a
tactical view of where we are headed in
companies small and large, old and new.
We're going to hear from the labs. We'll
hear from Unicorn AI startups. We'll
hear from academics, big-time management
consultants, and Fortune50 brands. But
before we do that, we have to give the
brands that made this day possible their
flowers. So, let's go into it. Let's
give it up for Google DeepMind, today's
presenting sponsor. [applause]
Love it. Keep it going for Anthropic,
the platinum sponsor for the day.
[applause]
And then one more round of applause for
all of the gold and silver sponsors who
you can meet in the expo downstairs
throughout the day. One more. Let's keep
it going. [applause]
Are you guys ready to do the damn thing?
>> Let's do it. To kick things off, let's
give a huge welcome to head of
engineering of the Claude developer
platform, Caitlyn Les. Let's welcome
Kaitlin. [applause]
Good morning. Um, so first let's give a
huge thank you to Swix and the whole AI
engineer organizing team for bringing us
together. [applause]
I'm Caitlyn and I lead the claw
developer platform team at Anthropic.
Um, so let's start with a show of hands.
Who here is integrated against an LLM
API to build agents?
Okay, I'm talking to the right people.
Love it. Um, so today I want to share
how we're evolving our platform to help
you build really powerful agentic
systems using claude.
So we love working with developers who
do what we call raising the ceiling of
intelligence. They're always trying to
be on the frontier. They're always
trying to get the best out of our models
and build the most high performing
systems. Um, and so I want to walk you
through how we're building a platform
that helps you get the best out of
cloud. Um, and I'm gonna do that using a
product that you hopefully have all
heard of before. Um, it's an agenda
coding product. We love it a lot and
it's called Claude Code.
So when we think about maximizing
performance um, from our models, we
think about building a platform that
helps you do three things. Um, so first,
the platform helps you harness Claude's
capabilities. We're training Claude to
get good at a lot of stuff, and we need
to give you the tools in our API to use
the things that Claude is actually
getting good at. Next, we help you
manage Claude's context window. Keeping
the right context in the window at any
given time is really, really critical to
getting the best outcomes from Claude.
And third, we're really excited about
this lately. We think you should just
give Claude a computer and let it do its
thing. So I'll talk about how we're
we're evolving the platform to give you
the infrastructure and otherwise that
you need to actually let Claude do that.
So starting with harnessing Claude's
capabilities. Um so we're getting Claude
really good at a bunch of stuff and here
are the ways that we expose that to you
um in our API as ideally customizable
features. So here's a first example um
relatively basic. Claude got good at
thinking um and Claude's performance on
various tasks. um scales with the amount
of time you give it to reason through
those problems. Um and so uh we expose
this to you as an API feature that you
can decide do you want Claude to think
longer for something more complex or do
you want Claude to just give you a quick
answer. Um we also expose this with a
budget. Um so you can tell Claude how
many tokens to essentially spend on
thinking. Um and so for cloud code um
pretty good example. Obviously, you're
often debugging pretty complex systems
with cloud code, or sometimes you just
want a quick um answer to the thing
you're trying to do. And so, um claude
code takes advantage of this feature in
our API to decide whether or not to have
Claude think longer.
Another basic example is tool use.
Claude has gotten really good at
reliably calling tools. Um, so we expose
this in our API with both our own
built-in tools like our web search tool,
um, as well as the ability to create
your own custom tools. You just define a
name, a description, and an input
schema. Um, and Claude is pretty good at
reliably knowing when to actually go um,
and call those tools and pass the right
arguments. So, this is relevant for
cloud code. Cloud code has many many
many tools and it's calling them all the
time to do things like read files,
search for files, write to files um and
do stuff like rerun tests and otherwise.
So the next way we're evolving the
platform to help you ma maximize
intelligence from claude um is helping
you manage cla's context window. Getting
the right context at the right time in
the window is one of the most important
things that you can do to maximize
performance.
But context management is really complex
to get right. Um especially for a coding
agent like claude code. You've got your
technical designs, you've got your
entire codebase, um you've got
instructions, you've got tool calls. All
these things might be in the window at
any given time. And so how do you make
sure the right set of those things are
in the window. Um so getting that
context right and keeping it optimized
over time is something that we've
thought a lot about.
So let's start with MCP model context
protocol. We introduced this a year ago
and it's been really cool to see the
community swarm around adopting um MCP
as a standardized way for agents to
interact with external systems. Um and
so for cloud code, you might imagine
GitHub or Sentry, there are plenty of
places kind of outside of the agents
context where there might be additional
information or tools or otherwise that
you want your agent to be able to
interact with or the cloud code agent to
be able to interact with. Um, and so
this will obviously get you much better
performance than an agent that only sees
the things that are in its window as a
result of your prompting.
Uh, so the next thing is memory. So if
you can use tools like MCP to get
context into your window, we introduced
a memory tool to help you actually keep
context outside of the window that
Claude knows how to pull back into the
window only when it actually needs it.
Um, and so we introduced the first
iteration of our memory tool as
essentially a clientside file system. So
you control your data, but claude is
good at knowing, oh, this is like a good
thing that I should store away for
later. And then uh it knows when to pull
that context back in. So for cloud code,
you could imagine um your patterns for
your codebase or maybe your preferences
for your git workflows. These are all
things that claude can store away in
memory and pull back in only when
they're actually relevant.
And so the third thing is context
editing. If memory helps you keep stuff
outside the window and pull it back in
when it makes sense, context editing
helps you clear stuff out that's not
relevant right now and shouldn't be in
the window. Um, so our first iteration
of our context editing is just clearing
out old tool results. Um, and we did
this because tool results can actually
just be really large and take up a lot
of space in the window. And we found
that tool results from past calls are
not necessarily super relevant to help
claude get good responses later on in a
session. And so you can think about for
cla code is calling hundreds of tools.
Um those files that it read otherwise
all these things are taking up space
within the window. Um so they take
advantage of um context management to
clear those things out of the window.
And so um we found that if we combined
our memory tool with context editing, we
saw a 39% bump in performance over over
the benchmark on our own internal evals.
Um which was really really huge. And so
it just kind of shows you the importance
of keeping things in the window that are
only relevant at any given time. And
we're expanding on this by giving you
larger context windows. So for some of
our models, you can have a million token
context window. combining that larger
window with the tools to actually edit
what's in your window maximizes your
performance. Um, and over time we're
teaching Claude to get better and better
at actually understanding what's in its
context window. So maybe it has a lot of
room to run, maybe it's almost out of
space. Um, and Claude will respond
accordingly depending on how much time
uh or how much room it has left in the
window.
So here's the third thing. Um, we think
you should give Claude a computer and
just let it do its thing. We're really
excited about this one um because
there's a lot of discourse right now
around agent harnesses. Um you know, how
much scaffolding should you have? How
opinionated should it be? Should it be
heavy? Should it be light? Um and I
think at the end of the day, Claude has
access to writing code. And if Claude
has access to running that same code, it
can accomplish anything. You can get
really great professional outputs for
the things that you're doing just by
giving Claude runway to go and do that.
But the challenge for letting you do
that is actually the infrastructure as
well as stuff like expertise like how do
you give claude access to things that um
when it's using a computer it will get
you better results.
So a fun story is we recently launched
cloud code on web and mobile. Um and
this was a fun project for our team
because we had a lot of problems to
solve. When you're running cloud code
locally cloud code is essentially using
your machine as its computer. But if
you're starting a session on the web or
on mobile and then you're walking away,
what's happening? Like where is that
where is um cloud code running? Where is
it doing its work? Um and so we had some
hard problems to solve. We needed a
secure environment for claude to be able
to write and run code that's not
necessarily like approved code by you.
Um we needed to solve or container
orchestration at scale. Um and we needed
session persistence. um because uh we
launched this and many of you were
excited about it and started many many
sessions and walked away and we had to
make sure that um all of these things
were ready to go when you came back and
um wanted to see the results of what
Claude did.
So one key primitive in this is our code
execution tool. Um so we released our
code execution tool in the API um which
allows claw to run write code and run
that code in a secure sandboxed
environment. Um, so our platform handles
containers, it handles security, and you
don't have to think about these things
because they're running on our servers.
Um, so you can imagine deciding that um,
you you want Claude to write some code
and you want Claude to go and be able to
run that code. And for cloud code,
there's plenty of examples here. Um,
like make an animation more sparkly that
uh, you want Claude to actually be able
to run that code. Um, so we really think
the future of agents is letting the
model work pretty autonomously within a
sandbox environment and we're giving you
the infrastructure to be able to do
that.
And this gets really powerful once you
think about giving the model actual
domain expertise in the things that
you're trying to do. So we recently
released agent skills which you can use
in combination with our code execution
tool. Skills are basically just folders
of scripts, instructions, and resources
that Claude has access to and can decide
to run within its sandbox environment.
Um, it decides to do that based on the
request that you gave it as well as the
description of a skill. Um, and Claude
is really good at knowing like this is
the right time to pull this skill into
context and go ahead and use it. And you
can combine skills with tools like MCP.
So MCP gives you access to tools and
access to context. Um, and then skills
give you the expertise to actually make
use of those tools and make use of that
context. Um, and so for cloud code, a
good example is web design. Maybe
whenever you launch a new product or a
new feature, um, you build landing
pages. And when you build those landing
pages, you want them to follow your
design system and you want them to
follow the patterns that you've set out.
Um, and so Claude will know, okay, I'm
being told to build a landing page. this
is a good time to pull in the web design
skill um and use the right patterns and
and design system for that landing page.
Uh tomorrow Barry and Mahes from our
team are giving a talk on skills.
They'll go much deeper and I definitely
recommend checking that out.
So these are the ways that we're
evolving our platform um to help you
take advantage of everything that Claude
can do to get the absolute best
performance for the things that you're
building. First, harnessing Claude's
capabilities. So, as our research team
trains Claude, we give you the API
features to take advantage of those
things. Next, managing Claude's context.
It's really, really important to keep
your context window clean with the right
context at the right time. And third,
giving Cloud a computer and just letting
it do its thing.
So, we're going to keep evolving our
platform. Um, as Claude gets better and
has more capabilities and gets better at
the capabilities it already has, we'll
continue to evolve the API around that
so that you can stay on the frontier and
take advantage of the best that Claude
has to offer. Um, second, as uh, memory
and context evolve, we're going to up
the ante on the tools that we give you
in order to let Claude decide what to
pull in, what to store away for later,
and what to clean out of the context
window. And third, we're really going to
keep leaning into agent infrastructure.
Some of the biggest problems with the
idea of just let Claude have a computer
and do its thing are those problems that
I talked about around orchestration,
secure environments, and sandboxing. And
so we're going to keep working um to
make sure that those are um ready for
you to take advantage of.
Um and I'm hiring. We're hiring at
Anthropic. We're really growing our
team. Um, and so if you're someone who
loves um building delightful developer
products um and if you're excited about
what we're doing with Claude, we would
love to work with you across end product
design um Devril lots of functions. So
please reach out to us
and thank you. [applause]
[music]
Our next [music] presenter is the
president and head of AI at Replet. He's
here to speak about building the future
of coding. Please join me in welcoming
to the stage Mikuel Katasta.
[music]
All right, good morning everyone. So at
Raplet, we're building a coding agent
for nontechnical users. It's a very
peculiar challenge I would say compared
to many people in this room. And what
I'm going to talk about today is why
autonomy has become kind of the
northstar that we keep chasing you know
since we launched the very first version
of rapid agent September last year.
Let's start from this very interesting
plot in case my clicker worked which now
does. Um I'm sure you all have seen it.
you know the semiac value that published
by Wix a few weeks ago and it kind of
clarify a bit the landscape you know for
all of us uh agent builders on one hand
you have the low latency interactions
that really allow you to stay in the
loop you know so you can do deep work
and focus really on the on the coding
task at hand but you need to be an
expert you need to know exactly what to
the model for and you need to understand
quickly if you want to accept the
changes or not then for several months
many of us including rapid We kind of
live in this I think value that where
the agent wasn't autonomous enough to
really delegate a task and come back and
see it accomplished but at the same time
it run long enough not to keep in the
zone not to keep in the loop. Luckily
over time we managed to go all the way
on the right and now we have agents that
runs for several hours in a row. What
I'm going to be arguing with today and
hope is not going to stop inviting me to
this event is the fact that there is an
additional dimension like a third
dimension to this plot that you know it
hasn't been covered here and namely the
fact is how do we build autonomous
agents for nontechnical users.
So what I'm going to be arguing today is
that there are two types of autonomy.
One of it is more supervised. So think
of the you know Tesla FSD example. When
you sit in a Tesla, you're still
expected to have a driving license.
You're going to be sitting in front of
the steering wheel. Perhaps 99% of the
time, you're not going to use it, but
you're there in order to take care of
the longtail events. And similarly, a
lot of the coding agents that we have
today require you to be technically
savvy in order to use them correctly.
We at Replet and uh other companies at
this point are focusing on kind of the
whimo experience for autonomous coding
agents. So you're expected to sit in the
back. You don't even have access to the
steering wheel. And I expect you
basically not to need any driving
license. Uh why is this important?
Because we want to empower every
knowledge worker to create software. And
I can't expect knowledge workers to know
what kind of technical decisions an
agent should be making. We should
offload completely the level of
complexity away from them.
Of course, it took a while to get here.
So I'm I'm sure what I'm showing you
here is something that all of you are
very familiar with. It took several
years to go from I know maybe less than
a minute feedback loop constant
supervision and talking about
completions and talking about
assistance. These are areas where the AI
power is and really been pioneering this
this type of user interaction. Then we
slowly climbed through you know higher
levels of autonomy. So we had the first
version of the agents based on on react.
So we concocted autonomy with a very
simple paradigm on top of LMS. Then
likely AI providers understood that tool
calling was extremely important poured a
lot of effort on that. So we built the
next version of agents with native tool
calling. And then I would say there is a
third generation of agents which I call
autonomous and that's when we started to
break the barrier of say one hour of
autonomy. Basically the the agent being
capable of running on long horizon tasks
and remaining coherent. It happens to be
the case that those are also the
versions of rapid agent that we launched
over the last year. So the B3 is the one
that we launched a couple of months ago
and it has exactly showcases those
properties. So the question for today is
can we actually build fully autonomous
agents and how do we get there?
So I'm going to try to redefine the
definition of autonomy today. I think
that often times we conflate autonomy
with a concept of something in the lungs
for a for a lot of time and usually as a
user you lose control. In reality what
the autonomy that I want to give to
agents can be very specifically scoped
and what I mean by that is especially
with rapid agent 3 what we accomplish is
we we make sure that our agent takes
holy technical decisions. Of course,
that could lead to very long gap between
the different user interactions and in
case the agent again runs for several
hours. But this happens if and only if
the scope of the task you're giving to
the agent is really broad. And it turns
out that in reality you can have an
agent that is really autonomous and is
still fast as long as you give it a very
narrow scope for the task, you know, at
hand. So what we can accomplish in this
way is that the user still maintains
control on the aspects that they care
about and a user cares about what
they're building. Especially again our
users, knowledge workers, they don't
care about how something has been built.
They just want to see their goals to be
accomplished. So autonomy should not be
basically conflated with long run times.
And similarly, it shouldn't become a
dity metric. you know, a lot of us are
talking about it as a as a badge of
honor. And it's definitely been exciting
to see in the last few months that, you
know, many of us broke the the barrier
of running several hours in a row, but I
think in terms of how to build agents
that are going to be more powerful and
more suitable in the future, we kind of
have to change a bit uh the the target
the metric that we do that we keep in
mind. So, think about it in this way.
Tasks have a natural level of complexity
and basically what we care about is that
they have a minimum irreducible amount
of work that they express. What agents
do is that they always go through this
loop of planning, implementing and
testing. And of course to make this
happen and to make it work correctly,
you want this work to be happening over
a long twing trajectory. So our goal is
to maximize the reducible runtime of the
agent. By reducible I mean having a span
of time where the user doesn't have to
make any technical decisions and the
agent can accomplish the task again in
full autonomy. This is especially
important for us because I can't trust
our users to make technical decisions.
So they they need a proper technical
collaborator by their side. I want to
abstract away as much complexity as
possible from the process of software
creation. And last but not least, I want
the users to feel in control of what
they're creating without stifling their
creativity because they have also to
think about the technical decision that
the agent is making.
So now what are the pillars of autonomy?
How are we making this happen? I would
say there are three pillars that are
extremely important to think about. The
first one is of course the capabilities
of frontier models like the baseline IQ
that we inject in the main agentic loop.
I'm going to leave this as an exercise
to the reader and to other people in the
room. I'm really glad a lot of you are
building amazing models that you know we
use all the time at Rabbit. So this is
the pillar number one. The second pillar
is verification. It's very important
that we test for local correctness of
our agent at every step that it takes.
And the reason is fairly intuitive. If
you are building on very shaky
foundations, eventually the castle will
topple down. So we brought verification
in the loop to make sure that in a sense
you are having you know nines of
reliability where in the compounding
errors that an agent will make
unavoidably if you know you don't put
any control on it and last but not least
you heard it on stage even earlier I'm
sure you're going to be hearing this you
know the entire day or the entire
duration of the conference uh the
importance of context management so one
end you want to have an agent that is
capable of being globally coherent so
it's aligned with the intent of the user
the expectation of the user But at the
same time it is also to be capable of
managing both the high level goal and
the single task that the agent is
working on. I think we made amazing
progress in the last months on context
management but I'm also excited to see
you know where we're going as a field.
Let's start from the first pillar that
we work actively at rapid which is
verification.
So why do we focus on this? Over the you
know last year we realized something
that I think each one of you has
experienced. So without testing agents
build a lot of painted doors. In our
case the painted doors are very visible
because we create a lot of web
applications. So you end up basically
trying to click on a button and the
handler is not looked up or some of the
data that we're showing is actually mock
data and it's not coming it's not coming
from a database. But in general this
phenomenon spans you know across every
type of component you're building being
it front end or back end. A lot of
components are actually not fully
fleshed uh by the agent. So we run some
evaluations internally. We found out
that more than 30% of the individual
features happen to be broken. Know the
first stand that are cooked by the
agent. And that also means that almost
every applications have at least one
broken feature or painted door. They're
hard to find. The reason is users are
not going to spend time testing every
single button, every single field. And
this is also probably one of the reasons
why a lot of our users, especially the
nontechnical ones, still can't trust
coding agents very much. They are
shocked when they find that there is a
painted door out there. So, how do we
solve this problem?
Fundamentally, we need an agent must
gather all the feedback that they need
from their environment, right? It's
easier said than done. Um again
nontechnical users not only cannot make
technical decisions but also they cannot
provide the technical feedback that you
know an agent is required to make
progress and most what they can do is
basic you know quality assurance
testing. They can literally go around
the UI click interact with the
application.
I'm I'm sure you have tried it in your
life. This is extremely tedious to do
and it leads to a very bad user
experience. And even though we relied on
that with our first release of the agent
last year, quickly we found out that
users don't want to spend time doing
testing. So we had to find a complete,
you know, orthogonal solution to that
which is autonomous testing and it
solves several different issues. The
first one is it breaks the feedback
bottleneck. Even if again we ask
feedback to the user, we were not given
enough of that. Now we don't have to
wait anymore for human feedback. We have
a way to elicit as much information as
possible from the app autonomously. We
also want to prevent the accumulation of
small errors. What I was saying before,
we don't want to have compounding errors
while the agent is building. And last
but not least, we have to overcome the
laziness of frontier models. So we need
to verify that whenever a model tells us
that a task has been completed, there is
actually the truth and that result is
not being elucinated.
There is a wide spectrum of code
verification that you know you you can
accomplish. I think we all started from
the very left. You know you have basic
study code analysis with LSPs. We have
been executing the code since we had
basically LMS that were capable of
debugging and then we slowly started to
move towards the right. So generating
unit tests and running them it has a
limitation. It's limited only to
functional correctness. Uh unit testing
is not very powerful to do like proper
integration testing by definition. We
started also to do now API testing but
it's only limited to API code. So you
can test endpoint of an applications.
You can't really test how a web app
functions and looks like. And for this
reason in the last few months has and
other companies are pouring a lot of
effort in really creating autonomous
testing based on the browser you know in
case the app that we're building is a
web application. There are two main
categories here. One is computer use.
It's a onetoone mapping with user
interface. So the model is directly
interacting with the application. It
requires screenshots. It tends to be
fairly expensive and fairly slow. I'm
sure you you tested it yourself. A good
way in the middle is browser use where
we simulate the user interface. You can
then interact with the browser and with
the web application and it relies on
basically accessing the DOM through
abstractions.
So how do we how do we make this work in
Revit? Um what we do is that we generate
applications that are amenable to
testing and we sort of merge everything
together from the previous slides that I
showed you. So we allow the our testing
agent to interact with an application
and gather screenshots in case nothing
has worked. So we have a full back to
computer use. But the vast majority of
times what we do is that we have
programmatic interactions with the
applications. So we interact with the
database, we read the logs, we do API
calls, we literally click on the app and
get back all the information that we
need. And by putting all of this
together, we collect enough feedback
that allows our agent both to make
progress and also to fix all the painted
doors that it encounters.
Just a know short technical deep dive on
how we accomplish this. I'm sure you
have seen a lot of the toolbased uh
browser use. There are amazing libraries
out there. First one comes to man stan
and the idea is that you have an agent
that has a few very generic tools
exposed. So know the agent can create a
new tab, can click, can fill forms etc
etc. The limitation here is that it's
difficult to enumerate all the different
type of interactions you could be having
with a browser. The problem of testing
is very similar to the Tesla analogy I
was making before. Maybe this cardality
of tools available is enough for 99% of
the interaction types. But then there is
always a long tail of idiosyncratic
interactions that a user makes with the
with a web application that are hard to
map into these tool these different tool
calls. So what we do uh in our case at
rapid is we directly write playwright
code and playwright code is first of all
very manable for LLMs. LM are kind of
amazing at writing playright. You know
this is the experience that we had uh
since we started to work on this project
is also very powerful and expressive. So
in a sense it's a super set of what you
can express uh on the compared to the
left on the tools uh testing. And last
but not least, there is beauty in
creating playright code because you can
reuse those tests. If the moment you
write a test in script, then you can
rerun it as many times as you want. So
in a sense, the moment you created a
test, you're also creating a regression
test suite that you can keep running in
the future. And all these kind of uh
tricks that I explained to you right
now, they helped us to create something
that is roughly a order of magnitude
cheaper and faster compared to computer
use. And we'll go back later on how
important latency is.
The second thing that the second pillar
that I wanted to talk about today of
course is context management. And I'm
going to go very fast here because I
think you're going to be hearing a lot
of talks today about it. The the high
level message here is that long context
models are not needed to work on quer
and long trajectories. Uh from
experience we found that most of the
tasks even the more ambitious one can be
accomplished within the 200,000 tokens.
So we're still not in a world where
working with models that have 10 million
or 100 million uh context windows is
necessary to actually run autonomous
agents. And we accomplish this by means
of learning how to do context management
correctly. So first of all, there are
several different ways to maintain state
which don't imply chucking all the state
into your context window. You can do
that for example by using the codebase
itself to maintain state. So you can
write documentation while the agent is
creating new code. You can also include
the plan description and all the
different task list that the agent is
working on. You can persist them on the
file system. So even there like have a
lot of ways to offload your memories.
And last but not least and this is
something I think you know Antropic has
been uh really evangelizing about um you
can even dump directly your memories in
the file system and then making sure
that your agent decides when to write
them back the moment they become
relevant to your work. So for this
reason we have been seeing a lot of
announcements in the last couple of
months. I just picked this one from
Entropic and with CL at 4.7 so I wish
4.5 uh that have been able to run uh
focus task for more than 30 hours in a
row. We have seen similar results from
OpenAI on the math problems. So I think
we we kind of broke the barrier of
running for long and you know being able
to have quant tasks.
I would say the key ingredient to make
this happen has been how good models
hand as agent builders have become in
doing sub agent orchestration. Subages
basically work by means of they're
invoked in the core loop. So it's a
completely it's starting from a blank
slate uh from a completely fresh
context. You as an agent builder decide
what subset of the context to inject
when this sub agent starts. And it's a
concept that is very similar I think to
everyone who's been writing software you
know in the last decades is separation
of concerns. So you decide what your
subject is going to be working on. You
give it the least possible amount of
context. You allow it to run to
completion. You only get the output the
results. You inject them back into the
main loop and you keep running in this
way. Of course it significantly improves
the number of memories per compression.
I just brought this plot from directly
from rapid agent running in production.
The moment we kicked in our new
subvision orchestrator on the ax on the
y-axis you can see the number of
memories per compression. So we went
from roughly 35 to 4550 recently. So big
improvement in terms of how often we are
recompressing our context just because
we can offload a lot of the context
pollution by means of using sub aents.
I'm going to give an example where this
made the difference for us. You know
what I'm showing you here is more kind
of a cost optimization in a sense like
you're compressing less. You als have
have separation of concerns which
definitely make your agent be smarter.
In the case of testing,
working with sub engine was almost
mandatory for us. And basically we
started to work on automated testing
even before we were very advanced in
terms of subent orchestration. And what
we found out is of course again as I was
saying before it makes things easier,
better cost, less pollution. But when
you allow the main loop not only to
create code but also to do browser opt
browser actions to put back the
observation of your browser actions into
the main loop you tend to confuse the
the hent loop very much because at this
point there is a lot of heterogenity in
terms of the action that your main loop
is looking at. So in order to make this
work not only we have to build all the
playright framework that I was showing
to you before but we also have to move
our entire architecture into sub agents.
So at this point you can see very
clearly why there is a separation of
concern here. Get the main agent loop
running. We decide at a certain point
that it's time to verify if the output
of the agent has been correct. We make
this happen all within a sub agent. Then
we scratch the context window of that
sub agent. We just return back the last
observation to the agent loop and then
we keep running in that way. So if
you're having issues today making your
sub agents uh work correctly, this is
one of the reasons why that you want to
take a look at.
So I think we covered the high level of
how to create more and more powerful uh
autonomous agents over time and I only
see us as a field becoming even more
proficient than that in the next months.
There is one additional ingredient
though that is going to make the
difference and it's parallelism. And I
will argue that parallelism is important
not because it's going to make agents
more powerful per se, but rather because
it's going to make the user experience
more exciting. So of course it is great
to have an agent that is capable of
running autonomously for long, but at
the same time it comes with the price of
making the user experience less
thrilling. You are not in the zone
anymore. What you do is that you write a
very long prompt. It's translated into a
task list. Uh and then you go to have
lunch with your colleagues and then you
come back and you hope that the agent is
done. That is not the kind of experience
that most of the productive people want
to have in life. You know, you want to
see as much work as done as possible in
the shortest span of time.
So what we do as a as a field at this
point has been to create parallel
agents. It's a very common trade-off
which by the way doesn't only apply to
agents. it it applies to computing in
general and for parallel agents what you
do is that you you trade off basically
extra compute in exchange for time. Why
there is this trade-off? So first of all
when you're running agents in parallel
you're gathering the same context in
multiple context windows. So every
single parallel agent that you would be
running probably shares say 80% of the
context across the board. So of course
you are just putting more computed work
because you're running those agents in
parallel. There's also another cost that
is kind of intangible for a lot of you
here in the room because I'm sure you're
all expert software developers, but what
do you do with the output of multiple
par agents at the end? Often times, you
need to resolve merge conflicts. So, as
a reminder, my users don't even know
what's the concept of merge conflicts.
It's something that I have to figure out
on our own. So, the current way in which
we think of parallel agents in in the
space doesn't really apply to Rapid. Now
at the same time I still want to very
much to accomplish this. There are so
many interesting features that you can
enable with parallelism. Aside from the
fact that you can get more work done. U
at times you want to you want testing to
be running in parallel with the agent
that creates code. Testing no matter how
much we optimize it is still very slow.
If an agent is only spending time on
testing users are not going to be
engaging with your application anymore.
Um, at the same time, it's also great to
have a synchronous process running while
your agent is running because you can
inject useful information back into the
main core loop. And last but not least
is a very common technique that we know
boost performance if you have enough
budget to do so. You should be sampling
multiple trajectories at the same time.
So a lot of perks are coming with
parallel agents. But u the way in which
we implement them today which I
basically call user has an orchestrator
is the fact that tasks the parallel task
that you want to run are determined by
you by the user and each task is
dispatched in its own thread. So there
is a bit of manual process even the task
de composition in a sense is happening
in your mind while you're thinking about
which agents you want to run and then
the moment you get back all the results
you need to go through the problem of
merge conflicts and often times this is
not trivial at all no matter how many
amazing tools are out there. So what
we're working on today for our next
version of the agent is having the core
loop as the orchestrator. So the key
difference here is the fact that the the
subtask that we're going to be working
on are not determined by the user but
they are determined by the corion loop
and the parallelism is basically
deciding on the fly. The agent does the
task decomposition on behalf of the user
and this comes with a couple of
advantages. First of all again there's
no cognitive burden to for the user to
understand how they should be
decomposing the task. At the same time
also there are ways in which you can
create tasks that sort of mitigate the
problem of merge conflicts. I'm not
claiming that we're going to be able to
mitigate it 100%. There are so many
corner cases in which merge conflict
will still represent a problem but there
are a lot of different techniques known
in software engineering to make sure
that you can try to have multiple subage
and not stepping on each other toes. So
the core loop as an orchestrator is
going to be the our main bet for the
next few months.
And in case you're passionate about
these topics,
[music] I'm always hiring at Rapid.
Thank you. [applause]
From transforming support tickets into
merge requests to helping teams ship
fixes faster than ever, our next
presenter has been at the center of
Zapier's AI agent journey. Please
welcome engineering leader Lisa Orur.
[music]
[applause]
Hello.
I'm so excited to tell you about how at
Zapier we are empowering our support
team to ship code. Before I tell you
about that, has anybody here visited the
Grand Canyon?
It's a good amount. Anybody rafted
through the Grand Canyon?
I see one person. I just got off an
18-day trip rafting through the Grand
Canyon over 200 miles. It was
incredible. No internet, no cell
service. The moment I got off, I found
out I was giving this talk. I didn't
think about uh work at all on the river,
but once I got off, I started thinking
about the parallels between the Grand
Canyon and Zapier. And we have one thing
in common and that is erosion.
Now natural erosion happens over
millions of years with wind, water and
time. It creates the beautiful canyon
that we experience and it's never
stopping, always continuing. At Zapier,
we have over 8,000 integrations built on
thirdparty APIs and they are constantly
changing, which I'm now thinking of as
app erosion.
We've been around for 14 years. Some of
our apps are that old. API changes and
deprecations impact us and create
reliability issues. Again, it never
stops.
So, I like to think of our apps as like
layers in the Grand Canyon and they need
constant attention.
So, if we were to create our own Zapier
Canyon and our apps would be at the
walls, here's our support team flowing
down the middle watching out for app
erosion. And we have a backlog crisis.
Tickets were coming in faster than we
could handle them.
Creates integration reliability issues,
poor customer experience, even churn. So
to solve for app erosion, we kicked off
two parallel experiments. The first was
moving support from just triaging to
also fixing these bugs. It's experiment
number one. Experiment number two, we
were asking, can AI help solve app
erosion faster?
So let's jump into experiment one. This
get kicked off two years ago, but had to
start with the why. We needed to get
that buy in to empower our support team
to ship code.
So apparosion is one of the major
sources of bugs coming through to from
support to engineering. So there's a big
need support is eager [laughter] for
this experience to a lot of them want to
go into engineering eventually and
unofficially many support members were
already helping to maintain our apps.
This moves us into how we started this
out. Put on some guard rails. We started
with just four target apps to uh focus
our fixes on. Engineering was set to
review any merge requests coming from
support and we kept the focus on app
fixes.
So jumping into experiment 2, this is
what I've been leading for the last
couple of years. How can we use codegen
to help solve for app erosion? And so
fortuitously, the name of this project
is Scout, which ties in so well to the
Grand Canyon experience that I've just
been through.
As any good product manager, we start
with discovery. We did some dog fooding,
so I shipped some app fixes. Uh we
shadowed engineers and support team
members as they were going through the
app fix process. We designed out uh what
are the pain points experienced along
the way, what are the phases of the work
and how much time is spent.
One big discovery we had is how much
time is spent gathering the context
going to the third party AP API docs
even crawling the internet looking for
information about a bug that's emerging
maybe somebody else has already
discovered and solved for it outside of
Zapier. internal context, logs, all of
this is a lot of context to go and
search for as a human uh and a lot to
gro and work through. This is something
we knew we needed to solve for.
where we started with all this great uh
opportunities and pain points is we
started building APIs that we believed
would solve for these individual um pain
points and some of these APIs are using
LLMs to you know for our diagnosis tool
gathering all that context on behalf of
the uh support person engineer and
curating that context building a
diagnosis that's [clears throat] using
an LLM. And then some aren't like we
have a unit test uh or unit test
generator is, but the um test case
finder is simply using a search query to
look for the right test cases to pull in
for your unit test. We built a bunch of
APIs. We had a bunch of great ideas. So
there was a lot for us to test with, but
we ran into some challenges in this
first phase. We had APIs but they were
not embedded into our engineers process.
So our tool I just said they don't like
to go to so many web pages to find all
their context. They would love all this
information to come to them. And yet our
web interface where we've we've created
a playground we call autocode internally
where you can come and play around with
our APIs. And our ask to the teams was
come try out our APIs and give us
feedback.
Now this is just one more window to go
to. So we didn't get a lot of
engagement. Also because we had shipped
so many uh APIs our team was spread
pretty thin. Cursor launched at the same
time which has gotten great adoption at
Zapier. We're all huge fans of cursor.
But from our side, it made some of our
tools no longer necessary.
But there was one major win in this
phase, which is one of our APIs became a
support darling. It's diagnosis. That
number one pain point of need to go out
and find all of your context, curate it
for yourself so you can start solving
the problem. We were doing that on uh
the support team's behalf with the
diagnosis API
and support loved it enough that they
decided to embed it into their process.
They asked us to build a zap year
integration on our autocode APIs so they
could embed it into their zap that
creates the jur ticket from the support
issue and now diagnosis is included.
So embedding tools is the key to usage
as we find out. So how can we embed more
of our tools? Well, then MCP spins up
and that solves our problem.
We can now embed these API tools into
our engineers workflow. Specifically,
our engineers are pulling in these MCP
tools as they're using cursor.
Our builders using Scout MCP tools are
leaving the IDE less, spending more time
in one window.
Still coming into challenges. One of our
uh our our key tool diagnosis
uh is so valuable to pull all that
context and to provide a recommendation,
but it takes a long time to run. Now, we
might run down that runtime. However, as
you're working synchronously on a ticket
in your ID, this was frustrating. We
also weren't keeping up with the
customization needs. Not only did MCP
launch and we started leveraging it, Zap
Your MCP launched too. And some of our
tools, if we weren't keeping up with the
customization needs, our engineers
internally looked to Zap Your MCP, which
is great. We're all on the same team
solving the same problem, but some of
our tools had a dead end. Also adoption
was scattered. We had a whole suite of
tools and we thought there was value in
each of them as it solves for different
problems across the different stages.
Not every engineer was using our tools
and if they were using tools, they're
only using a few of them. So we have
tool usage. We're happy about that. But
we were under the hypothesis that true
value is going to come from tying these
tools together.
So what if we owned orchestration of
these tools rather than saying here's a
suite of tools you use them as you wish
what if we combined them and created an
agent to orchestrate this. So this we
are calling scout agent. We take that
diagnosis run that against a ticket uh
use that information to actually spin up
a codegen tool which will then produce a
merge request using all the right
context.
So who would benefit the most from
orchestration? There are several
integration teams at Zapier who are
solving for these app fixes of various
levels of complexity and there's the
support team. So when we're saying who
should be our first customer scout
agent, we're thinking it should probably
be the the team fielding small bugs that
are emergent and coming hot off the
queue which is the support team. And now
our two experiments merge
and we have scout agent. We are building
for the support team.
And this is the flow of how it works.
Support is submitting an issue to scout
agent. We first categorize the issue. We
next assess its fixability.
Not every issue that comes from support
can be fixed. If we thinks it's fixable,
we'll move on to generating a merge
request. At that point, the support
team, this is the first time they're
picking up the ticket. It already has a
merge request attached to it. They'll
review and test. If it's not satisfying
what they believe is the actual solution
or the the what what the solution should
be to best address the customer's need,
they will make a request for an
adjustment that can happen right in
GitLab, which is where we do our work
and Scout will do another pass and
hopefully at that point we've gotten it
right and support can submit that MR for
review from engineering.
How we are running Scout, it's all
kicked off by a Zap. This is a picture
of one of our zaps. There are many zaps
that's run this whole process and it
embeds right into our support team's
zaps. We do a ton of dog fooding at
Zapier.
We first run diagnosis and post that
result to the Jira ticket saying what
the categorization is if we believe it's
fixable. And then if we do believe it's
fixable, we then are kicking off a
GitLab CI/CD pipeline.
And we run three phases in that
pipeline. plan, execute and validate to
generate this merge request. The tools
used in this pipeline is Scout MCP. So
all those APIs we invested in a year ago
now are really coming together and we're
orchestrating it uh within the GitLab
pipeline and we're also leveraging
cursor SDK.
Once the merge request has been
completed, we attach it to Jira and
support picks it up.
The latest addition to this is doing a
rapid iteration once a um uh once a
ticket has been posted with the merge
request and support team is looking at
it and they say you know it needs some
tweaks to save them more time so they
don't have to go pull that down to their
ID do the fixes and push it back up.
they can simply chat with the uh scout
agent in gitlab that'll kick off another
uh pipeline which does that phase with
that new feedback and posts the new
merge request
on our side we want to make sure scout
agent is working so we ask three
questions categorization right is was it
actually fixable uh and was the code fix
accurate so far we have two evals 70 to
75% accuracy for categorization and
fixibility. As we get more feedback and
process more tickets, those become our
test cases and we can move forward
improving scout agent over time. So what
has been scout agents impact on app
erosion?
40% of supports support teams app fixes
are being generated by scout. So we're
doing more of the work on behalf of the
support team.
This is resulting and for some of our
support team it's doubling their
velocity from one to two tickets per
week which already is amazing. That's
going from a support team that wasn't
shipping any fixes well unofficially
they were sometimes to now shipping one
to two per week per person to now
shipping three to four with the help of
Scout.
Another uh process improvement, Scout
puts potentially fixable tickets right
there in the triage flow. takes away a
lot of the friction of looking for
something to grab from the backlog.
It's not just the support who's
benefiting, it's also engineering.
Engineering manager said, uh, it's a
great example of when it works. This
tool allows us to stay focused on the
more complex stuff.
And if you take away anything from this
talk, I hope it is that there is a
really powerful magic between support
and empowering them with codegen and
allowing them to ship fixes because they
have three superpowers. The first they
are the closest to customer pain which
mean they're closest to the context that
really matters for figuring out what's
the problem and how to solve it. They're
also troubleshooting in real time. These
tickets aren't stale. the context is
fresh, the logs aren't missing. You put
this ticket into engineering backlog
months later, you might not get access
to those logs anymore. And then three,
they're best at validation.
You've again you put the same ticket
into an engineering backlog. The
solution an engineer might come up with
may change the behavior and that might
be good for some customers but might not
necessarily be best for that one
customer who wrote in about the problem.
And one other major benefit of this is
our support team members who have been
part of this experiment are now
engineers.
I want to say thank you to the amazing
team who's helped build this process or
built all the tools and the scout agent.
Andy is actually here in the audience.
So shout out to Andy. If you want to
talk about any of the technical bits,
he's here. And I want to impress upon
you two things. or hiring, but mostly if
you haven't rafted through the Grand
Canyon, please consider it. It's
life-changing and you should go with
ORS. Thank you very much.
[applause]
[music]
Our next presenters believe that [music]
2026
is the year the IDE died. Please join me
in welcoming to the stage engineering
leader at Source Graph and AMP, Steve
YG, and author and researcher at IT
Revolution, Jean Kim.
[music]
Hey everybody. Um, really happy to be
here. I'm going to be talking the first
half. Co-author here, Jean Kim, is going
to talk second half. All right. Looking
forward to it. Cheers. All right. Right.
Today I'm going to Well, we're going to
talk real fast. This time is going to go
down fast. Uh I'm going to talk to you
about what tools look like next year.
Last year I was talking to you all about
chat and everybody ignored me and now
everybody's using chat this year and
it's like we're gonna we're going to fix
that right now. All right. So, here's
what it's looked like. I'm going to tell
you right now, everyone's in love with
Cloud Code. There's probably 40
competitors out there. Cloud Code ain't
it.
completions wasn't it. I love cloud
code. I use it 14 hours a day. I mean,
come on. But it ain't it. Developers
aren't adopting it. I'm gonna talk about
why in this talk. I'm going to talk
about what you can do about it and what
what to look forward to. But the reason
is they're too hard. Okay. Uh cognitive
overhead. Uh they lie, cheat, and steal.
Gene and I talk a lot about this in our
book, all the different ways that they
can lie, cheat, and steal. And uh most
devs just don't like this.
I have come to understand that claude
code is very much like a drill or a saw,
an electric one, right? How much damage
can you do as an untrained person with a
drill, right? Or a saw. Yeah. How much
damage can you do as an untrained
engineer with clawed code? It's real
similar. Yeah. You can cut your foot
off,
but you can also be really, really
skilled with it and do really precision
work, right? like a craftsman. The
problem is software is infinitely large.
Our ambition is infinitely large. And so
the analogy that I want to share with
you is next year will be the year from
moving from saws and drills to CNC
machines. A CNC machine, you strap a
drill on and you give it coordinates and
it moves it around and you're very
precise, right? We've been doing this
for centuries and we're not going to
stop this year.
One thing I hear people say is, "Well,
the models are plateaued." This is real
common. Your engineers are probably
saying this, okay, even if they
plateaued, we have still discovered
steam and electricity and it's going to
take us a little time to harness it. But
it's strictly an engineering problem at
this point. All code within a year, year
and a half will be written by giant
grinding machines overseen by engineers
who no longer actually look at the code
directly anymore.
Weird new world. That is where we are
going. Oh my gosh. Yep. This this slide.
So Gene and I talked to Andrew Glover
who I don't know is he here from OpenAI
and he said that they have this
incredible dichotomy unfolding at OpenAI
where you know some percentage of their
engineers are using codecs and then some
other percentage a larger percentage are
not using codecs and the difference in
productivity is so staggering that
they're having now alarms going off at
performance review time because how do
you compare these these two engineers
who are the same level same title same
everything and one of them is 10 times
as productive as the other one by any
measure.
And the answer is they're freaking out.
They may have to fire 50% of their
engineers. And this is unfolding at
other companies, too.
Who is refusing it? It's the senior and
staff engineers. How many minutes are we
at?
>> Eight [clears throat] minutes.
>> We're perfect. This is just like what
happened to the Swiss mechanical watch
industry over a couple of Well, it was
built up for a couple of centuries and
then courts killed it, you know, within
a couple of years. And what happened was
the craftsmen were doing the same thing
our staff engineers are doing today. No,
this is cheap.
That's word for word, right? That's what
they say.
All right. I didn't know where to put
this slide. This is this is Claude's
view of what next year looks like. And I
I was just like, what do you think it's
going to look like? And it actually does
kind of look like this. Most of the
words will be spelled correctly in in
next year. But this is a lot prettier
than cloud code.
Yeah, this is what it has to look like.
Some form of a UI, not an IDE. This is
the new IDE. Okay. And people are
building it. In fact, I think the
company that's the furthest along in
this is Replet, who just talked to you.
I think it's amazing what they're doing.
It's absolutely bravo, right? We should
not be all chasing tail lights and
building command line interfaces
anymore. All right. and and more
importantly, Cloud Code and all of its,
you know, competitors, they're all doing
it wrong because they're building the
world's biggest ant. Okay, this is my my
buddy Brennan Hopper at Commonwealth
Bank of Australia, right? He's like,
"Nature builds ant swarms and Claude
Code built this huge muscular ant that's
just going to bite you in half and take
all your resources, right? I mean, it's
a serious problem, right? If I say
please analyze this codebase, I, you
know, go to the expensive model." If I
say, "Is my git ignore file still
there?" I've also gone to the expensive
model, right? Everything that you say
goes to the expensive model. So, what's
going to happen? Whoa. What happened? Oh
gosh,
my slides are all messed up now.
Can you guys see them?
>> No.
>> Oh, this always happens to me, man.
There's something going on. All right.
So, I thought of a really cool analogy
called the diver the diver metaphor,
which is your context window is like an
oxygen tank. Okay. This is why these
things are fundamentally wrong because
you're sending a diver down into your
codebase underwater to swim around and
take care of stuff for you. One diver
and we're like, we're going to give him
a bigger tank. One million tokens. He's
still going to run out of oxygen. Like
you don't, right? You should send a
product manager diver down first
and then a coding diver, right? And then
a review diver and a test diver and a
get merge diver, etc. Right? Nobody's
doing this. Everyone's building a bigger
diver. I don't know my slides are all
messed up. My my my talk is almost done.
But um what we do is as engineers task
decomposition,
successive refinement, components, black
boxes. This is how it's going to be
built in the future. And it's going to
be built with lots and lots of agents,
not just one agent.
All right. Until then, I think we're out
of time, but so until then, learn cloud
code. Give up your IDE. Swix told me he
wants some hot take, so I'll give you
one. If you're using an IDE starting on,
I'll give you till January 1st.
You're a bad engineer.
There's your hot take. All right, folks.
[applause]
All right, cheers. Well, that that was
actually my talk. Um [clears throat]
uh uh learn coding agents and oh yeah,
then there's this guy. Speaking of bad
engineers, so this is this is Jordan
Huard uh who uh who's at Nvidia and he
tweeted or LinkedIn a really nice post
on how to get the most out of agents and
this guy responded with this, right?
This is everyone in your or this is 60%
of your org right here. This guy's not
an outlier. Okay, the backlash is very
real against this. Yeah. And this is
going to be a problem I'm not going to
I'm not going to share with you. I don't
have time to share how to fix it, but
it's something you should be aware of.
And anyway, I'm going to turn it over to
my co-author, Jean. We had a lot to talk
about. He's got a lot to go. So, let's
turn it over to Jean.
>> Yeah. Thank you, Steve.
>> Hi, buddy. [applause]
>> Yeah. By the way, um I have let me start
off by introducing myself and then I'm
going to share a little bit about like
what it's been working like uh what's
been like working with Steve on the vibe
coding book. Uh and so just a little bit
about myself. I've had the privilege of
studying high performing technology
organizations for 26 years. And that was
a journey that started when I was a
technical founder uh of a company called
Tripwire. I was there for 13 years. But
our mission was really to understand
these amazing high performing technology
organizations. They had the best project
due date performance and development,
the best operational reliability and
stability and also the best posture of
compliance uh security and compliance.
So we wanted to understand how did those
amazing organizations make their good to
great transformation. So we got
understand how to how other
organizations replicate those amazing
outcomes. And so you can imagine in that
26 year journey there are many
surprises. Among the biggest surprise
was how it took me into the middle of
the DevOps movement which is so uh
amazing because it reshaped technology
organizations. you know, it changed how
test and operations worked, information
security. Um, and I thought that would
be the most exciting adventure I'd be on
in my career until I met Steve Yaggi in
person. And so, I've admired his work
for over 11 years. And so, some of you
may have read this memo of Jeff Bezos's
most audacious memo of how in early
2000s they transformed from a gigantic
monolith that coupled 3,500 engineers
together, so none of them had
independent action. And uh he talked
about how all teams must henceforth
communicate and coordinate only through
APIs. No back doors allowed. Right? Uh
anyone who doesn't do this will be
fired. Thank you and have a nice day.
And the amazing person who chronicled
says number seven is obviously a joke
because Bezos doesn't care whether you
have a good day or not. And this is
actually enforced by Amazon CIO then
Rick Del. And so it turns out this memo
that I've been quoting for 11 years uh
was written by Steve Yaggi uh which was
meant to be a private uh memo on Google+
which was made public which landed him
on the front page of the Wall Street
Journal. Um and so I finally met him in
uh June and it turns out that we had
many things in common uh but one of them
was this uh love of AI and this sense
that AI was going to shape coding from
underneath us. And so one of our beliefs
is that uh the AI will reshape
technology organizations you know maybe
even a 100 times larger than what agile
cloud CI/CD and mobile did you know 10
years ago. Um and that these technology
breakthroughs not just reshape
organizations but they reshape the
entire economy. The entire economy
rearranges itself to take advantages of
these you know wild new better ways of
uh uh producing things. and and uh so
over the last year and a half we've had
a chance to look at these case studies I
think give us a glimpse of what these uh
what the shape of technology
organizations look like and so I'm going
to uh share with that what we've learned
but here's maybe a hint so some of you
may know the work of Aiden Cochroft he
was a cloud architect at Netflix right
he was what who drove uh the uh entire
Netflix infrastructure from a data
center uh back in 2009 to running
entirely in the AWS cloud and so he
wrote uh some months ago in 2011 some
people got very upset in uh
infrastructure and operations because
they called it noopops, right? And
everyone laughed back then, but he said,
"Oh, don't, you know, it's happening
again. This time it might be called no
dev, right?" Not so funny now, right?
So, it's it's interesting, right?
Because we heard this amazing
presentation from Zapier about like how
support ships and turns out designers
are shipping, UX is shipping, right?
Anyone who's been frustrated by
developers uh who, you know, say get in
line and you have to wait quarters or
years or maybe never, right, are now
suddenly in a position where you can
actually vibe code your own features
into production, right? And that
reshapes technology organizations and
reshapes you know potentially the entire
economy. And so uh Steve and I we've had
the privilege of watching what happens
you know when we change uh you know the
way we uh deploy right it wasn't so long
ago and 10 years ago uh I wrote a book
called the Phoenix project where it was
all about the catastrophic deployment.
Would you believe uh that it was you
know 10 years ago 15 years ago most
organizations shipped once a year right
and so I got to work on a project called
the state of DevOps research. It was a
cross population study that spanned
36,000 respondents uh from 2013 to 2019.
And what we found uh this was Dr. Nicole
Forsgrren and Judge Humble. Um and what
we found was that these high performers
ship multiple times a day, right? They
can ship in one hour or less. And you
know back in 2009, people thought, oh my
gosh, multiple deployments per day,
right? That's reckless and
irresponsible, maybe even immoral,
right? What sort of maniac would deploy
multiple times a day, right? And yet
it's very common place these days. In
fact, if you want to have great
reliability profiles, you want to have
short meantime prepare, you have to do
smaller deployments more frequently. And
I think we're now seeing these kind of
case studies that show that this better
way of coding, right, where you don't
type in code by hand might be, you know,
just a vastly better way uh to create
value. And so our definition of vibe
coding that we put into the uh vibe
coding book was that it's basically
anything where you don't type in code by
hand. And so for some of those of you
who don't understand that, that's like
sort of a uh typing an ID hunched over,
right? And you're actually moving your
fingers, right? That's sort of like how
some people go into a dark room to
develop photographs, right? Believe it
or not, some people still do that. Um
and and what I that's a great definition
that we uh loved until uh Dar Amade uh
CEO and co-founder of um Anthropic, he
gave us an even better definition,
right? The B coding is really the
iterative conversation uh that results
in AI writing your code. And he said
it's on one hand a beautiful term
because it evokes this different way of
coding but he said it's also somewhat
misleading because it sounds jokey right
uh but he said you know at anthropic
there's no other game in town right and
I just thought that was just a beautiful
way to evoke you know how important uh
vibe coding is uh this is Dr. Eric Meyer
um you he's probably considered one of
the greatest programming language
designers of all time. Uh he was part of
Visual Basic, C link, Haskell. He
created the hack programming language uh
that migrated millions of lines of code
at meta you know within a year uh
bringing static type checking to a bunch
of PHP programmers and he said we are
probably going to be the last generation
of developers uh to write code by hand.
So let's have fun doing it. Um, so one
of the things that uh when uh Steve and
I started working on the book last
November was uh watching him spend
hundreds of dollars a day on coding
agents uh and just seemed so strange
right um you know and so he's maxing out
not just over you know the uh the
monthly subscriptions right but he's
actually you know going way above and
beyond that and yet uh you know things
that we're hearing now is that as an
engineer part of my job is that I need
to be spending as much on tokens per day
as my salary right so you know that
think about like $500 to $1,000 a day,
right? Because this is the mechanical
advantage, the cognitive advantage that
these tools are giving us, right? And as
an engineer, right? I'm going to
challenge myself, you know, to get that
kind of value to deliver value to people
who matter. Um, and so in the book, we
talk about, you know, why people would
do this, right? And [snorts] the, uh,
acronym we came up with FAFO, right? Uh,
the most obvious one is F for faster,
right? Yeah, that's obviously true, but
I think it's the most superficial and um
part of why we do this because uh the
second one is it lets us do more
ambitious things, right? Uh the
impossible becomes possible. Uh so
that's one end of the spectrum. On the
other end of the spectrum, you know, the
uh the tedious and small tasks become
free. One of the things I uh the uh
interview of the cloud code team that I
just loved was uh I think it was
Katherine she said um uh one of the
things we've noticed is that you know
when customer issues come up uh instead
of putting them on a jur backlog and you
know arguing it about it in the grooming
sessions and so forth right we just fix
it on the spot right and ship to
production or whatever um you know
within 30 minutes right and so yes it
gets recorded but you know that whole
sort of coordination cost you know just
disappears right so again the impossible
becomes possible right and uh the
annoying things just become free. The
second a is uh um you know the ability
to do things alone or more autonomously,
right? And so um you know there's really
two coordination costs are being
alleviated here. One is, you know, if
you ever have to wait for a developer or
a team of developers, you know, to do
what you need to do, right? You have to
communicate and coordinate and
synchronize and prioritize and cajul and
escalate, you know, do all sorts of
things to get them to care about the
problem just as much as you do, right?
And, you know, now, you know, with these
amazing new miraculous technologies, you
can do them by yourself, right? So,
that's one coordination uh tax. The
other one is even if you get someone to
uh care about a problem as much as you,
uh they can't read your mind, right?
Right? And what we're finding is that
these LLM are just amazing
intermediation vehicles, right? Um, you
know, just through an LLM, you can
coordinate with other functional
specialties, right? Through a markdown
file, right? And that's not the end,
right? But it's just this amazing way uh
to have these high band coordination so
that you can essentially read each
other's minds, you know, because shared
outcomes require shared goals and shared
understanding. The second F is fun,
right? Is that Steve says, vibe coding
is addictive. This is so true. true. I
mean, I cannot I think what I love about
the book is that it's a story about two
guys who both thought their best days of
coding were behind them, right? And
found that, you know, it's entirely the
opposite. Um, and I've had so much fun
and uh, you know, I'm having to force
myself to go to sleep at night because
otherwise I'd be up till 2 or 3 in the
morning every night. Uh, and you know,
so it's not all great, but it certainly
beats be boring or tedious or, you know,
horrible. And then optionality. You
know, one of the things that uh I love
about Swix is that he has a shared love
of creating option value. And he told us
last night that option value is also
important for poker players, right?
Because you never want to paint yourself
in a corner. So option value is um one
of the biggest creators of economic
value, right? Modularity, the reason why
it's so powerful is because it creates
option value. Uh and so just the fact
that you can have so many more swings at
bat, you can do so many more parallel
experiments, right? This is what vibe
coding allows. So this is gives us
confidence that you know this is not
just uh this is a very powerful tool. Um
here's the quote from Andy Glover that
Steve Yaggi said is that you know as um
for people who have this aha moment and
in position uh you know I think the
instinct is how do we elevate everyone's
productivity to be as productive as you
are now being um you know that since
you've had your aha moment. So uh let me
share with you maybe some of our top
kind of uh exciting case studies that
kind of give us a hint of the future. So
uh I've run into this conference called
the enterprise technology leadership
summit for uh 11 years now and Swix we
had uh the honor of having Swix there
talking about the rise of the AI
engineer just this amazing
prognostication. Uh this year we had a
series of amazing uh case studies. One
was uh Bruno Pasos. He spoke this year
uh last year at this conference and he
presented on uh their in their evolving
experiment to elevate developer
productivity across 3,000 developers. Um
and this is at Booking.com the world's
largest travel agency and uh they're
finding that they're getting double
digit increase in productivity right uh
mergers are going in quicker peer review
times are uh smaller and and so forth
right and so that's just we feel like
that's a incomplete view of uh what
people are achieving uh this is Shri
Balakrishnan uh he was head of product
and technology at uh travelopia uh so
they're a 1.5 billion a year uh travel
company and one of the things that uh he
said is that uh you know they were able
to replace a legacy application uh in
six weeks with a pair of uh with a very
small team. In fact, one of his
conclusions is that before we would need
a team of eight people to do something
meaningful, right? Six developers, a UX
person and a product owner. And he said
maybe these days it might be two, a
developer and you know a a domain
expert. In other words, as Kent Beck
said, a person with a problem and a
person who can solve it, right? Maybe
maybe a pair of those teams, right? And
so that's going to reshape I think you
know how they can go further and faster.
Uh so again maybe just a hint of what
teams will look like in the future. This
is the one that excites me most. This is
Dr. Topra Pal. Uh he helped drive the
DevOps movement at Capital One. Um and
he's now at uh uh Fidelity. And so um
among other things he owns an
application uh that is the application
you go to asks which applications you
know the 25,000 applications there have
log 4j right and uh it's his team and
he's had this vision of what this
application should look like uh but
every time he asked like can can we
build it his team would say it would
take about five months right and we'd
hire need to hire a a front-end person
and he got so frustrated that he spent
five days just vibe coding it by himself
right uh you know directly accessing
read only the neofor 4J uh database um
and put it into production, right? And
so I think we're seeing a world where um
you know leaders even leaders with their
own teams are frustrated saying hey I
can do this uh can I do this better
myself not better just can I prove that
it can be done and uh by the way what
happened afterwards um he was looking
around who can help me maintain my
application in production and all the
senior engineers like not me. So enter
uh Swathy the most junior engineer on
the team uh who is helping maintain his
application and probably outarning you
know everybody in the organization
uh and interestingly uh he he's also
getting more headcount because the
number of consumers of this application
just increased by 10fold right so who
saw that coming right um so uh here's
John Rouser he's senior director of
engineering at Cisco security and he
convinces SVP to um require 100 of the
top leaders inside of Cisco security to
vibe code one feature into production in
a quarter that ended last month, right?
And so um you know we're actually
getting a chance to be able to survey
those people, right? Who finished? Uh
you know uh how many completed, didn't
complete, partially completed, etc. And
of those who completed, right, what was
what aha moment did they have? Uh as a
leader, what's the magnitude and
direction of what they want to do? And
so we're going to go in and study that.
And I just I my prediction is that we're
going to see parts of that organization
get reshaped as leaders realize kind of
what's possible. Everything from
strategy to processes and so forth. And
so let me just share with you one um you
know thing that really excites me which
is uh I got a chance to uh get back into
the state of DevOps research the Dora
study with uh u the Google cloud team
and one of the things that didn't make
into the report that I just found really
exciting was around this. It was like
what how would much do people trust AI?
And we're using a very strange
definition of trust which is to what
degree can I predict how the other party
will act and react, right? Because the
more you trust the other party, right,
you can give them bigger requests, you
can use fewer words, you have less need
for feedback, right? It's the whole
notion of finger spits and fuel, right?
Like how many of the 10,000 hours that
requires to be good at anything have you
used to get good at AI? And one of the
stunning findings was that it's this
line. So on the x- axis is how long have
you been using AI tools? Y is how much
do you trust it, right? And the longer
you use AI, right, the more you trust
it, right? So every every person who
says I tried it and it's terrible at
coding, right? On what basis did they
make that conclusion after maybe using
for an hour or two? And what this shows
us is that uh you know it requires
practice, right? And this is probably a
teachable skill. Um so length of time on
the x-axis is a very incomplete
expression, right? It's like frequency
and intensity and how many hours but
it's the signal there. So it just shows
that uh you know part of your job is to
help other people have the aha moment
and then help them you practice right so
they get very very good at it so they
can use every one of these amazing
technologies to achieve their goals. So
uh I'll leave you with one last kind of
vision. Steve and I we did a vibe coding
workshop for leaders um back six weeks
ago. And what was amazing to me was in
the three hours we had a 100% completion
rate. Everyone built something. You
know, they built a data visualization
tool. In fact, uh one person uh built a
an iOS app and another person actually
got it into the review queue in the
Apple iOS app store, right? Which is
which is absolutely astonishing. Uh and
here's a guy named Roger Safner. He
said, "I used to be a C MVP way back in
the day. I haven't coded in 15 years."
uh and he's showing off an app that
helped him automate the process of
getting checked in to Southwest Airlines
until the bot detection tools come off.
But look at look at the expression on
his face. And so I think uh what we're
seeing is like what happens when support
ships right and support codes and ships
when leaders code and ship. There's no
doubt in my mind that this will reshape
uh technology organizations. If you're
one of those, Stephen and I want to talk
to you, right? Because you are on the
frontier of something really really
important. I'll share with you a couple
quotes. Here's from a technology leader.
When I told my team that I wrote an app
that, you know, an AI wrote 60,000 lines
of code and I haven't looked at any of
it, they all looked at me as if they
wished I were dead.
Um, we've uh we've had these stupid
problems in legacy applications. I've
been there for over a decade. We got a
group of senior engineers together. We
used AI to generate a fix and we
submitted PR and the team accepted it.
Right? Unlike the time when they said it
was AI generated and they rejected it as
AI slop, right? So this is maybe
happening in your organizations. Um our
code velocity is so high. Uh we've
concluded that we can only have one
engineer per repo, right? Because of
merge conflicts, right? This is we
haven't figured out the coordination
cost uh mechanism yet. And so like all
these were some of the lessons that went
into the vibe coding book. Thank you for
everyone who were at the signing
yesterday. And uh if you're interested
in any of the talks we referenced and
excerpts of our book in uh and basically
uh all the links that uh are in this
presentation, just send an email to real
gene cams.com subject line vibe and
you'll get an automated response in a
minute or two. So with that, Steve and I
thank you for time and we were around
all week.
>> Thanks all. [applause]
[music]
[music] Ladies and gentlemen, please
welcome back to the stage, Alex
Lieberman.
[music]
>> Let's give it up again for Steven Jean
and also the rest of the speakers from
the morning session. Whether you are
watching in person or on YouTube or on
the AIE site, you've been breaking a
mental sweat. So, we are going to take a
30 minute break. Get some grub, get some
coffee, recharge, and we will see you
back here at 11. Thanks everyone.
Appreciate it.
>> [applause]
>> Two flames lit the darkness, burning
side by [music and singing] side. Both
sworn to creation. Both relentless in
their stride. One walked through the
mountains, [music] one soared across the
void. Both chasing the horizon of the
worlds they would deploy. [music]
But the path is not a straight line. And
the future is not flat. Some roads bend
through [music] spaceime and some break
[singing] on impact. Effort is a
kingdom. [music]
Leverage is the key. One builds the
throne by hand. One shapes [music]
reality.
There is a curvature of time, not
[music] a race, not a throne, but a
shift in the dimension of how progress
becomes known. When [music] the universe
is bending to the will inside the mind,
you don't win by moving faster. You win
by [music]
breaing
time.
Black holes of the past try to drag the
present [music] down. Systems built on
dust, [singing] wearing yesterday's
[music] crown. Some are pulled beneath
them, [singing] fighting gravity alone.
Others learn to map the edges and escape
events horizons. Not all power [music]
is struggle. [singing] Not all mastery
is pain. The ones who change direction
rewrite the laws of the game. You can
lose your life in [music and singing]
labor or an impact that compounds. Every
second can be linear or worth a thousand
rounds.
There is a curvature of time.
>> [music]
>> Not a race, not a throne, but a shift in
the dimension of how [music] progress
becomes known. When the universe is
bending to the will inside the mind, you
don't win by moving [music] faster. You
win by rediding
[music] time. [singing]
The future isn't [music] distant. It
accelerates [singing] for those who
wield the tools of power instead of
fighting with their go. Mastery is
leverage, [music] not a sentence carved
in stone. The horizon does not move
[singing] unless you. [music]
There is a curvature of time [music]
where the present multiplies where a
lifetime holds a legacy that no clock
[music] can quantify. Not by force, not
by fury, but by evolution.
We become eternal beings. When [music]
we synchronize
with
Progress is speed.
[music]
Progress is direction.
[music]
>> [music]
>> Footstep fade, but they never die.
Shadows stretch across the sky.
A whisper grows into a [singing] roar.
Do you [music] feel it? Do you want
more?
Every heartbeat stone [music] in the
street.
Ripples [music and singing]
chasing an endless dream.
What we do in life [music] echoes in
eternity.
Every spark ignit [music]
[music]
[music]
>> [music]
>> Reach out to the [singing] empty air.
Trace the stars like they're waiting
there.
[music]
The clock ticks but the moment stays
forever starts in a single [singing]
prayer.
Heat.
[music]
Heat.
>> [music]
>> Every heartbeat [music]
stone [singing] in the stream
ripples [music and singing] chasing an
endless dream.
What we do in life echoes in eternity.
There is night a fire that will never
see [music]
[music] what we do.
[music]
Heat
up [music]
[music]
[music]
crawl where the light won't stay.
The echo whispers don't [music] look
away.
Heartbeat racing louder than my doubt.
Scream inside. I can't let
[music and singing] out. But I won't
fall. I won't drown in the storm all
around.
Heat. Heat. Heat. [music]
[music]
There is my
breaking [music] the chain.
[music]
>> [music]
>> Cold winds how but they won't define me.
The cracks in my soul let the light find
[music] me. Every step I take the ground
fights back. But I'm the fire on the
spark. I'm the attack. [music]
I won't freeze. I won't fade. Through
the chaos I've remain,
[music]
I won't let it win. It creeps like a
ghost, but I keep it within.
Fear is a killer. I'm breaking [music]
the chain. No for the dark.
[music]
[music]
>> [music]
[music]
[singing]
[music]
>> I hear the static in the [singing]
night. It calls.
A whisper [music] rising,
breaking through the walls.
[music]
Electric echoes in my veins they home.
Chasing the shadows where [music] the
wild ones run.
The air is thin. The weight is
[music and singing] gone. Close your
eyes. The past is done.
Free your mind.
[music]
Break the chain on the floor.
[music]
[music]
>> [music]
[music]
>> Waves come crash against the sky.
presence of a dream. [music]
I see them inside
the [music] story. We don't need a
weather
with a speed.
[music]
[music]
The air is thin. The weight is gone.
Close your eyes. The [singing] past is
done.
>> [singing]
>> Free your mind. [music] Let it go. Let
it break the chain. Leave it on the
floor. Heat. Heat. Heat.
[music]
[music]
Heat
>> [music]
[music]
>> up
here.
[music]
>> [music]
[music]
[music]
[music]
>> They said The stars don't change
[singing] their course, but I've been
running from their force. A mirror
crack, but still it [music and singing]
shows. The fire is mine is mine to hold.
I hear the echo and call [music] my
name.
But I'm not the shadow. [music]
Not the same.
You are who you choose to be. The stars
[music]
of the history.
Every breath, every heart be free.
[music]
Are we choose to be
[music]
[music]
>> [music]
>> Road [music] of thorns, a sky of glass.
I've walked through both. I've let them
[singing] pass. The weight [music] is
heavy, but I've grown. The voice I hear
is now mine.
I see [music] the light
[music]
change.
I can say
[music]
you are
[music] history.
Every breath, every heart be
[music]
I see the lines [music and singing]
drawn in the sand. A map of chaos in my
hand.
Every step, a [music] choice,
every beat of voice. The clock ticks
louder. But I stand. [music]
Close my eyes [singing] and feel it
burn. Every [music] failure, every turn.
It's fue for the fire inside. [music]
Execute the vision.
Vision. Heat. Heat. Heat.
[music]
Heat. Heat.
>> [singing]
>> The air is heavy, it doesn't break. A
thousand whispers in it wake.
Each [music] breath a climb,
each fall a sign, but I am more than I
can take. [music]
Close my eyes and feel it burn. Every
failure, every turn is fueled for the
fire inside.
[music]
>> [music]
>> executes. [music]
[music]
This is
Yeah.
[music]
[music]
The clock keeps ticking loud and clear.
Shadows fade, [music] but linger near.
I've been waiting for the light.
[music]
Holding breath through endless night.
[music] The air is shifting.
Feel it break.
A single spark is all it [singing]
takes.
It starts today.
It starts today.
No more [music]
running. No delay.
The world is spinning
my hands. It [music] starts today. It
starts today.
Footsteps echo on the stone.
Every choice I made my own.
>> [music]
>> I see the dawn breaking through
a thousand colors chasing. [music]
The air is shifting. [music]
Feel it rain.
A single spark is all.
It starts to take. Heat. Heat. Heat.
Yeah.
[music]
Heat.
[music]
Heat. [music]
[music]
Heat.
Yeah.
[music]
[music]
>> [music]
>> Fire in my chest is burning loud.
Ashes fall, but I won't bow. [music]
I've walked [singing] through the smoke.
I've tasted the scars. Each step I've
taken little
stars. [music]
Let it blaze. Let it break. Feel the
ground.
I'm forced in [singing] flame. [music]
I'm falling heat.
The pain
[music]
again
from Heat. Heat. Heat.
[music]
>> [music]
[music]
>> The winds they how but I stand still.
[music]
The mountains crumble up my will.
I'm not the same
I was before. A shadow of fear. I keep
let it blaze. [music and singing] Let it
break. Feel the cracks. The ground will
shake.
I'm forged in flame. [music] Heat. Heat.
Heat.
[music]
Heat. Heat. Heat.
[music]
>> [music]
[music]
>> Heat up
[music]
here.
>> [music]
>> A whisper breaks the silent night.
Shadows melt in the growing light.
Time bends and twists. We feel it start
a pulse [music]
to spark [singing] an open heart.
Do you feel it? Feel it right.
The wayless
fire in the sky
[music]
has come.
Run into the sun. No share.
We're free.
[music]
We're
electric.
[music]
Stars collide.
But we stay one.
The past dissolves [music]
like waves on storm.
We stand together,
not alone. [singing]
>> [singing]
>> Heat. Heat.
Here [music]
it is sing
[music]
the everything.
A new age
has come.
We're running to the [music] sun. No
chains, no walls, just there with me.
[music]
Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat. Heat. [music]
Heat
[music]
[music]
>> [music]
>> Heat up
[music] here.
[music]
Heat up here.
[music]
Heat up
>> [music]
>> here.
>> [music]
>> Heat up [music]
Heat
>> [music]
>> up
here.
Heat.
[music]
[music]
Heat.
[music]
>> [music]
[music]
[music]
>> Heat. Heat.
Heat. Heat.
[music]
[music]
[music]
Heat.
[music]
[music]
Hey, Heat. [music] Heat. Heat.
[music]
[music]
>> [music]
[music]
>> Heat. Heat. Heat.
[music]
[music]
Heat.
>> [music]
[music]
>> Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
[music]
Heat. Heat. N.
[music]
Heat. Heat.
[music]
[music]
[music]
Ladies and gentlemen, please welcome
back to the stage, Alex Lieberman.
Let's uh keep it going for the morning
speakers. [music] Amazing job from
everyone who spoke earlier. I asked
before who thought they came from the
furthest place on Earth to to watch this
in person. And where's New Zealand
again? I don't know. New Zealand. There
we go.
>> From Bulgaria.
>> Bulgaria. Still, I think closer than New
Zealand, but still very far.
>> Australia via New Zealand.
>> Australia via New Zealand. We just got
someone to one up New Zealand. I have
another quick question since we just
came back from a coffee break. Also, if
you're watching live on YouTube, you can
comment. Who thinks they're the most
caffeinated right now? Who thinks
they're the most caffeinated in the
room? How many cups of coffee? I'm four
right now. Anyone beat four? Oh, we got
four. We got a five, maybe. Wow.
Impressive. Well, we are back for an
incredible next block of sessions. We're
going to be covering everything from
future proofing uh coding agents to
moving away from agile, how to quantify
AI ROI in software engineering, the
state of AI code quality, hype versus
reality and Miniax M2. But I am so
excited to kick off this next block of
talks with OpenAI. Please welcome to the
stage Bill Chen and Brian Fioa from the
applied AI team at OpenAI. Let's hear it
for them. [applause]
[music]
[music]
Hello everyone. Um, today we'll be
talking about how to build coding
agents.
And uh, I'm Bill. I work on the applied
AI startups team at OpenAI.
And I'm Brian. I work with Bill on the
OpenAI startups team.
>> And we specifically, uh, focus on, uh,
building coding agents here at OpenAI.
Um, yeah. So, why are we talk giving
this talk? Why why are we, you know, uh,
talking about coding agents? Well, it's
really quite interesting because it's
been booming for the the the past year
actually. It's just if you think about
it, it's not that much time ago, like
only been a year or so. The ground keeps
shifting really under the u harness on
on the coding agents. But if you think
about it, it's really like why it's
interesting is because it's really a
signal on how close we are to AGI.
Software engineering can be set as a
universal medium for problem solving.
But because the ground is shifting so
fast, uh, we ha kept having to rebuild
the agent on top of the model whenever a
model is released. And today we're going
to talk a little bit about how we might
be able to get around that.
So here's what we're going to go over
today. We'll start with the anatomy of a
coding agent, especially going into the
details of models and harnesses and how
they work together. We'll share some
lessons that we learned from putting
them together ourselves. And we're
specifically going to talk about codeex
here, which is our own coding agent.
We'll talk a little bit about emerging
patterns that we're seeing from all of
you for using agents like codeex in your
own products. And lastly, we'll talk a
little bit about what to expect from
codeex in the future so that you can
build along with us if you want to.
To start, let's talk a little bit about
what makes a coding agent an agent as a
whole. Um, it really is quite simple. I
think, you know, people kind of over
complicate things a little bit these
days. It's made out of three parts. It's
a user interface. It has a model. It's a
harness, right? Uh, the interface quite
self-explanatory. It could be a computer
uh like a CLI tool or it could be a uh
integrated developer environment. could
be also cloud or background agent. Um
models also very quite self-explanatory
are you know the things like the latest
and greatest the GPD 5.1 codeex uh max
that we just released yesterday uh or
the GPD 5.1 series of models or other uh
models from other providers as well. And
the harness uh is a little bit more of
an interesting part. This is the part
that directly interacts with the model.
Uh in the most reductive way, you can
sort of think of it as a collection of
prompts and tools combined in a core
agent loop which provides input and
outputs uh from a model. Uh the last
part will be our focus for today.
As touched on a bit earlier, coding is
one of the most active frontiers in
applied AI and uh how models are
constantly getting released and we're
not making the problem uh easier for
everybody
is that people have to constantly adapt
uh the agents to the new models.
So, um, Bill's done a great job of
giving us an overview of coding agents,
what they're made up of. So, let's zoom
in a little bit on the harness. Um,
turns out that's a little bit tricky.
So, what is a harness? A harness is
really the interface layer to the model.
It's the surface area the model uses to
talk to users and the code and perform
actions with tools. It's made up of all
of the pieces that the model needs to
work over many turns, call tools, and
and really write code for you and
interpret what the user is actually
asking. [snorts] Um, for some, the
harness might actually be the special
sauce of the product. But as we're going
to go into a little bit more, it's
really challenging work to build a good
harness. And we'll talk about how we did
that.
So let's see what are some of these
challenges. Um just to name a few, AV is
one. Um your [laughter]
um your brand new innovative custom tool
that you're giving to your agent might
not actually be something the model is
using is used to using. It may not have
ever seen that tool before in trading.
And even if it is, you need to spend
time tuning your prompt to that
particular model and the habits that it
comes with.
And new models are coming out all the
time. What about latency? Like does the
model take a while to think about
certain things? Which things do you
prompt it not to? How do you expose the
UX of what a thinking model is doing
while it's thinking? Is it communicating
with you while it's thinking or do you
have to summarize it? Managing the
context window and compaction can be
really challenging. We just launched
Codeex Max that does that out of the box
for you. you don't have to worry about
compaction and context window
management. It's really hard to do. Um
and so if you were to do it yourself,
have fun. Um and then also like the APIs
keep changing, right? So we have
completions, we have responses, we have
whatever else is coming in the future.
What does the model know how to use and
get to get the most intelligence out of
the box?
And so
this is the interesting part. Fitting a
model into a harness takes a lot of
prompting.
It turns out that how the model is
trained has side effects.
I like to think about it this way.
Intelligence plus habit. Intelligence.
What is the model good at? What
languages does it know really well? What
is what is its capabilities in terms of
like how well it can write code in
certain frameworks? And then what habits
did it learn to to use to solve those
problems? We've trained our models to
have habits of like planning a solution,
looking around, gathering context, and
and thinking about a problem before
diving in and writing code, and then
testing its work at the end.
Developing a feel for these habits is
how you become a good prompt engineer.
If you don't instruct the model in ways
that it's familiar with, you can have
problems. We saw this when we launched
GPD5. A lot of people who weren't used
to using our models encoding tried to
take prompts that existed for other
models and put them into their harness
and have GPD5 follow those instructions.
And it turned out that we taught our
model to do some of the things that the
other models didn't really do out of the
box. And so when they were prompting
them to look really hard at the context
and like examine every single file
before making a a code edit, our model
was being very kind of thorough about
that and it was taking a really long
time and they weren't seeing the best
performance. And so we figured out that
if you let the model just do the
behaviors that it's used to and don't
overprompt it, it'll actually perform
really better. We found out by asking I
was literally like, "Hey, like I like
the solution, but it took you a long
time to get there. What can I do
differently in your instructions to help
you get there faster next time?" And
literally it said, "Uh, you're telling
me to go look at everything and I don't
really need to. So that's what's taking
forever."
And so you can actually see the
advantages of building both the model
and the harness together because you
just like know all of that while you're
building it. And that's why codeex is
both a model and a harness combined.
So let's dig deeper into codeex and what
it can actually do.
So we built codeex to be an agent for
everywhere that you code. It's a VS code
plugin. It's a CLI. You can call it in
the cloud from the VS Code plugin or
from chat GBT from your phone. Um, and
it's very basic. You can use it to turn
your specs into runnable code starting
from a prompt. Um, having a plan. It
navigates your repo to edit files. It
runs commands, executes tasks, and you
can call it from Slack or you can have
it review PRs and GitHub. So, all of the
things that you would expect.
And that means that the that codec um
the harness of codeex needs to be able
to do a lot of really complex things. Uh
when I talked to a member of the codeex
team about this slide and what should be
on it, he was like it's way harder than
you think. [laughter]
You have to manage parallel tool calls
like thread merging and all of the
things involved in that. Think about all
of the security considerations you have
with sandboxing, prompt forwarding,
permissions, uh, port management. Um,
compaction is a whole thing. Um, and
doing that well is really complex. When
do you trigger compaction? When do you
reingject? How do you worry about uh
cache optimization during that MCP,
right? Like all of the thing the uh
plumbing you have to build for MCP
support into the harness. Uh, and then
not even mentioning images and what's
the resolution that you need to compress
them to to send them to the model. All
this all of this is like work that you
have to do if you're going to build this
from scratch and keep it updated as new
features come online.
So since we've bundled all of these
features together for you in an agent
that can safely write its own tools to
solve new problems that it encounters.
Oops.
Uh we actually have here uh a computer
use agent for the terminal.
Wow, that sounds quite a bit powerful
than just plain old coding agent,
doesn't it? Um but just think about it
again. Well, before browser and graphic
user interface was a thing, wasn't that
how we always operate a computer?
they're writing code and chaining them
together in a command line interface. So
that means if you can express your tasks
in command line as well as files tasks
codeex will be able to know what to do.
Um the example is I like to use codeex
to organize a lot of the photos from my
desktop into a folder and that's a very
simple use case but what it can also do
is it can analyze huge amounts of CSV
files inside of a folder uh doing data
analysis it does not have to be a coding
task and if it can be accomplished by
running tools from command line you can
use codeex.
So now that we see codeex is such a cool
harness, um I want to also share a
little bit about how you can use it to
build your own agents and what you can
do is you can use codeex the agent
inside of your own agent.
Um how does that work? Well, if you want
to build uh a coding uh the next coding
startup, we don't really have all the
answers, but we do have a few patterns
uh that we thought uh might help you
having worked with some of the top
coding customers uh like cursor and VS
code. Uh one of those patterns is uh
harness becoming the new abstraction
layer. The benefits of this is quite
obvious. Um, you no longer have to care
about prioritize uh optimizing the
prompt and tools with every mo model
upgrade.
>> But um does that mean you're just
building a wrapper?
>> Well, I disagree with that take.
[snorts] I disagree with disagreeing
with my colleague here. Um just like how
building rappers on top of models I
think is really reductive on uh on the
whole value prop of the infrastructure
layer. Sorry, I used to be a VC.
[laughter]
>> Focusing most of your efforts on
differentiating your product is what
this pattern allows you to do. And
that's where most of the value lies.
Exactly. Okay. So, let's look at some of
these patterns that we've seen and
actually have helped our customers build
um along with them. Codeex is an SDK. It
can be called through a TypeScript
library. You can call it
programmatically in a Python exec.
There's a GitHub action that you can
plug into to have it merge merge
conflicts on PRs that everybody hates
doing. Then uh you can also add it to
the agents SDK and give it MCP
connectors back to your product. So now
you have an agent. I like to say we
started with chat bots that you can talk
to. Then we gave the chat bots tools to
use. And then now you can give uh a tool
to your chatbot that can make other
tools that it doesn't have. And so now
you can actually build out enterprise
software that does it that writes its
own plug-in connectors to the API level
for each customer on the spot. That's
something that a professional services
team used to have to do. Um, so you have
fully customizable software that can now
talk back to itself. Um, I made a conbon
board for Devday that can actually fix
its own bugs. Um, it's pretty fun. And
then lastly, um, you can actually do
something like what Zed has done. They
have just decided to wrap codeex inside
of a layer and give it an interface to
the IDE for talking back and forth for
the user and making code edits. And now
they don't actually have to do all the
work of staying on top of all of the
things that we're good at doing and they
can focus on building like the best code
editor.
Uh so our top coding partners like
GitHub has used this uh to great effect
and well uh we've created an SDK uh for
it that they used to directly integrate
uh with codeex. You can also use the SDK
to uh control codeex as part of your
CI/CD pipeline as well as use it as an
agent that directly interacts with your
own agent as well. Uh if you really want
to customize the agent layer, you can do
it too. As an example of this, we worked
with closely with the cursor team to get
the best performance out of the codecs.
The model, not the agent. We're bad at
naming things. The model is different
from the agent. They did so by aligning
their tools to be in distribution with
how the model is trained. And they did
so by aligning uh their harness with our
open- source uh implementation of codeex
CLI. All of this is publicly available.
Uh you can fork the repo, you can use
our source code, you can use it. Uh go
nuts.
So what does the future hold for codeex?
It hasn't even been out for a year. Um
and especially with the lo la la la la
la la la la la la la la la la la la la
la la la la la la la la la la la la la
la la la la la la la la la launch of
codeex match yesterday like things are
really changing fast. Uh it's the
fastest growing model in usage now
serving dozens of trillions of tokens
per week which has actually doubled
since dev day.
It's always good to build where the
models are going. It's safe to assume
that the models will get better. They'll
be able to get to work on much longer
horizon tasks unsupervised.
New models will raise the trust ceiling.
I trust these models now to do some way
harder work than I would have six months
ago. And that's going to keep
increasing. The future is about
sprawling code bases and non-standard
libraries and knowing how to work in
closed source environments, matching
existing templates and practices
and the models uh and and and so you can
imagine that the SDK will evolve to
better support these model capabilities,
letting the model learn as it goes and
not repeat mistakes and generally
provide more surface area for an agent
that writes code and uses a terminal to
solve whatever problems it encounters.
counters and you can use that in your
products via the SDK.
So, what have we learned? Harnesses are
really complicated and take a lot of
work to maintain, especially with all
the new models coming out. So, we've
built one for you inside of Codeex that
you can use off the shelf or look at the
source if you want to and you can use it
to build new things outside of coding
and let us do all of the work making
sure that you have the most capable
computer agent.
And we're really excited to see what you
craft.
Our [applause]
[music]
next presenters believe that most
enterprises are failing to unlock real
value from AI because the systems in
which they operate are [music] stuck in
the past. Here to share how agents are
reshaping software delivery are McKenzie
partners Martin Harrison and Natasha
Mania.
>> [music]
>> All right, good morning. Hello everyone.
It's really great to be here. Uh so I'm
Martin and I'm here with my colleague
Natasha. Uh we're from a part of
McKenzie you may may not be as familiar
with. We have a practice called software
X and we work with uh mostly enterprise
clients on how to build better software
products which has messed mostly using
AI uh in the in the past couple of
years.
uh and so what their talk is about today
is really more focused on the people and
the operating model aspects of
leveraging AI for software development
and and that we believe that that has
changed quite significantly and and
that's what we're excited to talk to you
about.
If I take a quick step back uh in in
time and we just uh you know think
through some of these the major
technology breakthroughs that we've seen
in the last few decades uh they tend to
always come with a paradigm shift in
also how we develop software and so I
still recall uh almost 20 years ago now
I started working as a software engineer
an entry- level developer um in a tech
company and the company I was working
for was just switching to to agile we
were using kan boards we were doing uh
standups and and other ceremonies.
This was a big change. It was a massive
change for the for the company. And now
with everything that is happening
happening in AI, we're at the precipice
of another such paradigm shift.
And
um
if we think about some of the um some of
the things that are happening um with AI
and software development that we've seen
at this um at this conference, there's
no doubt that this is a new paradigm
that is about us. And so we'll talk
about two things. uh we'll first touch a
little bit about how do you go from
these things that we're seeing at
individual productivity to scaling that
to the whole team and what that what
type of changes we think that implies
and then we'll talk a little bit uh
about how do you scale that across uh a
whole organization and to really get get
value
um
if if you sort of I I'm talking to an
audience here which is using AI agents
all the time and I if I If I asked you
about some examples, I'm sure you could
rattle off, you know, 10 different ones
where you would say, "Look, there was
this thing that I used to do. It it used
to take uh maybe even days and and and
hours that are now taking only minutes,
right? There's no shortage of those
those stories and you can go over to the
expo and and talk to any of the
companies there about all these all
these great use cases. It really shows
that these tools work and they can be
really impactful." And so yet despite
seeing you know some of these uh
improvement uh improvements
uh we've done some research to gauge you
know where are our clients at the
moment. We we recently surveyed about
300 uh companies uh mostly enterprises
around what are they seeing in terms of
productivity improvements. So you have
this and then they would say uh on
average we're often seeing only 5 10 15%
improvements overall as as a company. So
we're in a place where there's a bit of
a disconnect between this this big
potential uh around AI as uh from the
reality.
And so we we think that um there is this
gap because as we've started
implementing AI whether it's um you know
coding assistance or whether it's now
using you know you just heard about uh
you know how open AAI is using agents
and more complex uh workflows. What has
started to emerge is a is a set of
bottlenecks uh that that were not
necessarily there before. Like for for
example, as we now start moving much
faster in certain in certain aspects of
the work, uh we haven't really changed
how we collaborate among people and and
team members that's not quite keeping
up.
We started generating way more more
code, but we're it's still being
reviewed in a in a pretty manual way in
in many companies. Then we also have
this this theme which was recently
highlighted in in even a research report
from from Carnegie Melon uh about how
all the new code that is being generated
is also amplifying uh the generation of
tech debt in some in some cases and
actually generating complexity. And so
there are these bottlenecks. They're not
impossible to overcome but this is what
we believe is limiting uh many companies
from seeing the the the real value that
that they should be seeing.
Let me talk about maybe just a couple of
examples to to make that uh come to life
a little bit more. One of the things
that we see as a big rate limiter at the
moment is around how work is allocated.
And so what what we've learned over the
last couple of years is that the impact
from AI and agents is highly uneven.
There are some tasks which where it
works amazingly well today and you see
uh huge improvements and there are
others where it it's not as effective
and so you have that variability. You
also have variability among people. Some
have have uh lots of experience now
using these tools and and know how to
pick that up and others uh are less
experienced right now and so what that
means for for team leaders for
engineering managers and so on is it's
very highly non-trivial to know how to
allocate work and resources in in a good
way and this is creating a lot of
inefficiencies.
Another example uh is is around how work
is being reviewed. So agents are often
giving given pretty uh fuzzy uh you know
stories that are written in pros with
pretty fussy acceptance criteria uh
which which means that the code that
comes back is not always what it was
intended to be and and for many
companies the only mechanism to control
that is is often manual review. So
you've automated some things but we've
generated more manual review. So these
are some of the some of the examples of
uh these bottlenecks that we that we see
coming up.
And as mentioned, what what has that has
resulted in so far is that most most
large companies today uh are are stuck a
little bit in in a world of relatively
marginal gains. Uh they're working in
ways that was developed with constraints
that we had in the past paradigm of
human development. So you have you you
know if you go out to most companies you
see 8 to 10 person teams you see working
in two week sprints you have all these
these elements that were largely parts
of like an of an agile operating uh
model and that is and that is uh putting
in some some limits to what they can
see. Over the past year, we've been
working with lots of clients to to sort
of break that model a bit uh and develop
new ways of of working in smaller teams,
in new roles, uh in with shorter cycles.
And when you do that, we see really
great performance improvements. And
that's what gives you gives us this uh
path to where we see things are going to
improve.
So we realized that rewiring the PDLC is
not just a one-sizefits-all solution.
For example, different types of
engineering functions across the
enterprise along the product life cycle
may require different operating models
based on how humans and agents best
collaborate. So if we take the example
of modernizing legacy code bases, this
task requires a high context of
potentially the entire codebase but also
has clearly well- definfined outputs. So
an example operating model could look
like a factory of agents where humans
provide an initial spec and final review
with minimal intervention.
For new features for green field and
brownfield projects, the operating model
may look like an iterative loop because
they may benefit from the
non-deterministic outputs and increased
variation where agents act as
co-creators um providing more options to
facilitate faster feedback loops.
So, as we mentioned, we did a survey
among 300 enterprises globally to
understand what sets these top
performers apart. We found that they are
seven times more likely to have AI
native workflows which meant scaling
over four use cases across the software
development life cycle rather than just
having point solutions for just code
review or for just code dov. They were
also six times more likely to have AI
native roles which meant having smaller
pods with different skill sets and new
roles.
To enable these shifts, these
organizations were investing in
continuous and hands-on upskilling,
impact measurement, and also incentive
structures to incentivize developers and
PMs to adopt AI.
This led to five to six times increase
in time to market and delivery speed as
well as higher quality and more
consistent artifacts.
So when we talk about AI native
workflows we mean that these enterprises
are moving away from quarterly planning
to continuous planning and also um the
unit of work is moving from storydriven
to spec driven development so that these
PMs are iterating on these specs with
agents rather than iterating on long
PRDs.
On the talent side, AI native roles
essentially means that we're moving away
from the two pizza structure to one
pizza pods of three to five individuals.
Instead of having separate QA frontend
and backend engineers, there are more
consolidated roles where product
builders are managing and orchestrating
agents with full stack fluency and also
a better understanding of the full
architecture of their codebase. PMS are
starting to create direct um prototypes
in code rather than iterating on these
long PRDs.
And one example um that we've described
in our article, we've studied some AI
native startups and realized that
they've actually implemented all of
these shifts to accelerate their
outcomes. And in our article, we've
described how cursor actually operates
internally.
But if you're a large enterprise
predicated on the agile model, what are
some steps you can take? So in in a
recent client study with a leading
international bank, we tested some team
level interventions to address the
bottlenecks previously mentioned before
mainly around the sequencing of steps
within the agile ceremony and how uh to
define the roles of agents and humans
within the sprint cycle. So let's walk
through some examples.
First, team leads would assign sprint
stories using agents based on the data
of the team velocity and delivery
history. And then they would create
co-create multiple prototypes and
iterate with agents on the acceptance
criteria around security and
observability needs to have more
consistent artifacts across teams. This
prevents downstream rework that was
mentioned before so that developers
don't have to constantly be iterating
with the agents during during the code
process. The squads were also
reorganized by workflow. So there would
be one which would be focused on um
small bug fixes and another focused on
green field development. In the
background agents would be used to look
and impact uh look at um the potential
cross repository impacts um to prevent
debugging time for developers.
And another example is that instead of
for reducing the collaboration overhead
and meetings that happen within the
sprint cycle, um instead of waiting for
data scientists input, PMS would
directly be observing the real-time
customer feedback to rep prioritize
these features
and this would lead to an acceleration
in the backlog within the same amount of
time.
So we studied the um impact of these
interventions and found high promising
results. For example, not just the
increase in agent consumption by over 60
times, but there was also an increase in
the delivery speed that was tied
directly to the business priorities for
this bank. There was a 51% increase in
code mergers, but also a decrease in um
an increase in efficiency.
The other aspect of this is is uh around
the different roles and and and the
talent model. And so one of the biggest
differentiators that we saw as mentioned
was around but you have actually changed
the roles that uh that are involved in
software development. And so you know
what what you all are seeing is that
engineers are moving away from execution
and and just simply writing code to
being more of orchestrators and and
thinking through more how to divide up
work to agents. for example. And we also
heard some examples of how the role of
the product manager is changing. And so
while this this may sound, you know,
pretty straightforward to many of you
here who are who are working with these
tools like day-to-day that you have to
change what you do, the reality is that
about 70% of the companies that we that
we survey have have not changed their
roles at all. Right? And so you have
this background expectation that people
are going to do things differently but
the the role is still defined in the
same way and it's the same understanding
uh as it was you know a couple of years
ago.
Um but we are starting to see you know
some companies changing this. So this is
another example from a from another
recent nent client. They were set up in
a in a way that is, you know, pretty
common for for u many companies and a
kind of typical two pizza uh team model
with with the types of roles that you
would be familiar with. Um the we ran a
bunch of experiments and front runners
and and tested new models that were had
much smaller pods uh that had uh new
roles which consolidated some of the
tasks that were previously done with
different roles.
And and so by doing that we could we
could create basically more pods or more
teams uh with with the same number of
people uh but retaining the expectation
that each pod is is uh is um performing
at about the same level as as they were
before.
And so so we also see really uh really
positive results from that uh with with
uh maintaining and even improving in
some is the quality of the code that was
generated. In particular there was a
there was a high speed up in in terms of
uh the output from from the different
teams and you can see some of the
metrics uh here.
Let's shift gears a little bit and and
and go from talking about just the team
level. So how does this now scale uh
across a big organization?
The reality is that many many companies
don't just have like one or two of these
these teams but often hundreds of teams
even and thousands or even tens of
thousands of people who are working in
this way. And uh this is where one of
the biggest differences that we that we
saw between those that are stuck a bit
in the um in in getting only 10% or so
change improvements from those who are
seeing outsized improvements is around
how you manage that how you manage that
change and change management I get is
like one of these is a little bit of an
often catch or elusive term for uh for a
lot of different things but but I think
in some ways it's not a bad way to think
about I I usually say that the change
management is about getting a lot of
like small things right. And so the crux
to like actually scaling this is often
about getting 20, 30 or even more things
right at the same time that involve the
way you communicate uh what this means,
the way you incentivize people, uh the
way you upskill them, and it all has to
come together.
Um and when it when it's not, we we we
see what happens. And so this is an
example from from another tech company
that we worked with um where initially
we're rolling out new AI tools for them
that that hit different parts of the
product development life cycle. Um we we
rolled we rolled out the tools there was
some usage but often it dropped off. It
was either not used or it was um it was
sort of um used in very suboptimal ways.
So that's the sort of jagged part that
you're seeing on the on the left hand
side here. despite kind of adding more
users uh the overall impact did not
change at all. So we had to do a quite a
reset and and um start over effectively
reset the expectations. What should what
what does this mean if you're a
developer dayto-day what does it mean
for a PM? Uh we had much more hands-on
upskilling. There was could bring your
own code. there were, you know, coaches
available, especially those first like
few sprints before you get make this a
habit and work it into the way that you
develop software dayto-day. It's a very
critical time and that's when when this
matters a lot. Um, and having a bit of a
a measurement system as well, so you
know what's changing and and you're able
to to see what's uh what's what what's
improving.
Another example just to put this alive a
little bit as mentioned like this is
about getting a lot of things um right
and it's each one of these individually
may not seem like it's the biggest deal
uh but put together they really make a
make a huge difference like this is for
this is some of the top uh interventions
that another client had to go through
for them it really helped having you
know setting up code labs for example
really you know instituting a new set of
certifications that help motivate and
and drive people to to change what they
do day dayto-day. And these these things
really added up to uh the change they
needed.
>> But building a robust measurement system
that prioritizes outcomes and not just
adoption is important not just to
monitor progress but also pinpoint
issues and course correct quickly. So
one surprising result from the survey
was that these enterprises that were
bottom performers were not even
measuring speed and only 10% were
measuring productivity.
But our goal is to make our clients top
performing organizations. So we've
worked with them to create a holistic
measurement system that captures impact
all the way down to inputs. So for
inputs this would include the investment
into coding tools and other AI tools but
also the time and resources in
upskilling and change management. These
inputs would lead to direct outputs but
a lot of organizations are just focusing
on how the increased breath and depth of
adoption with of AI tools is leading to
increased velocity and capac capacity
increase. However, it's also important
to understand how developers have uh
different NPS scores and if they're
enjoying their craft more um rather than
feeling more frustrated. And it's also
important to understand whether the code
is becoming more secure and have has
better quality but also more resilient.
And one proxy for resiliency that we
used for our client was the meantime to
resolve priority bugs.
Now if we look at economic outcomes
which is priority for um the seauite
executives they look into what is the
time to revenue target. What is the
increased price differential for higher
quality features or expanding the number
of customers to meet the feature demand
and also what is the cost reduction per
pod for reduced human labor. In
aggregate, having these larger economic
outcomes can also lead um to for
organizations to understand how there is
an increased reinvestment in green field
and brownfield development. But as these
tools evolve, the proxies for these
metrics will also evolve. But hopefully
this provides a MECI framework as an
initial starting point.
So what's next? The future of course is
difficult to predict, let alone in the
next 5 years. But we hope that with our
vision of a new software development
model, even as agents increase in their
intelligence and humans become more
fluent in AI, that this model still
stands. So hopefully this model that
includes um shorter sprints, smaller
teams, but large u smaller but larger
number of teams will set enterprises up
for success in the long term.
>> So just leave you with some some key
takeaways. um start now. I would say to
to our our clients, this is a human
change and it takes some times and it's
a big change and and it's going to be a
journey and so I think um this is
something that everyone needs to go on.
I think it's also important to figure
out which model works for you and set a
really bold ambition and with that say
thank you so much for listening to us
and and uh we have an article here if
you're more interested in in the
research that we've conducted. Thank you
so much for having us. Our [applause]
next presenter is a researcher at
Stanford who studies how AI impacts over
100,000 developers in the real world.
Please welcome Jaor Dennis Blanch.
So companies spend millions on AI tools
for software engineering. But do we
actually know how well these tools work
in the enterprise or are these tools
just all hype? to answer this and for
the past two years we've been
researching the impact of AI on software
engineering productivity and our
research is time series because we look
at get historical data meaning we can go
back in time and it's also
cross-sectional because we cut across
companies and the way we use to measure
most of the of the impact is by a
machine learning model that replicates a
panel of human experts. The way this
works is that imagine you have a
software engineer who writes a code
commit and this code commit would be
evaluated by multiple panels or of 10
and 15 independent experts who would
evaluate that code commit across
implementation time maintainability and
complexity and then produce an output
evaluation. So we took the labels of
these panels across you know millions of
of kind of evaluations and then trained
a model to replicate this panel of
experts meaning that we can deploy this
at scale and if there's ever any doubts
around the models output you can always
kind of assemble your own panel and see
that it correlates pretty well with
reality.
Today we'll talk about four things.
We'll start off with looking at some of
the things that are driving AI
productivity gains in software. Then
we'll look at a AI practices benchmark
that we developed. We'll then look at
how we propose to measure AI return on
investment in software engineering. And
lastly, we'll finish things off with a
case study.
So here we took 46 teams that were using
AI and we matched them with 46 similar
teams that were not using AI and we
measured their net productivity gains
from AI quarterly. And the shaded area
is the middle 50% of the data. And the
dark blue line is the median which as of
July of this year stands at about 10%
for this cohort.
I'd like to direct your attention to the
fact that the discrepancy between the
top performers and the bottom ones is
increasing. There's a widening gap. And
so if we very unscientifically and very
illustratively project this forward, we
might get something like this, right?
where uh you can have these top
performers being part of this the rich
gets richer effect where they these
successful early AI adopters might
compound their gains while these
strugglers could fall further behind. At
some point this is going to converge and
this is very directional. But my point
here is that if you're a leader in a
company, you definitely need to know in
which cohort you are right now so that
you can course correct and without
measuring the impact of AI on your
engineers, you're not going to be able
to do this.
So we started investigating what are
some of the factors that drive these top
teams to perform better and the first
thing we looked at is AI usage or
basically token spent. In this graph you
have the same kind of on the vertical
axis the productivity increase and then
on the horizontal one you have the token
usage per engineer per month on a
logarithmic scale. And what you can see
is that the correlation is quite loose
20 or so linearly. And there is a bit of
a death valley effect around the 10
million uh token mark whereby teams that
were using that amount of tokens seem to
be doing worse than teams that were
using a bit less tokens. It's very
directional but interesting.
Nevertheless,
the conclusion here might be that AI
usage quality matters more than AI usage
value.
We dug deeper and we said well does the
environment in which the engineers work
impact the productivity from AI and we
came up with an environment cleaniness
index index it's quite experimental it's
a composite score that looks at tests
looks at uh types documentation and at
modularity and at code quality and that
index is on the bottom axis here from 0
to one and then on the vertical axis
once again you have the kind of
productivity lift relative to teams not
using AI
And so what you can see is that there's
a point40 R squar meaning a pretty
decent correlation around environment
cleanliness and gains from uh AI or
productivity gains from using AI. And so
the takeaway here is to invest in
codebased hygiene to unlock these AI
productivity gains.
We dug deeper to illustrate this
concept. And here we have on this graph
on the vertical axis the percentage of
tasks that might uh be able to be
completed by AI based on three colors.
And so green means that AI can do most
of the work for that task in that
sprint. Yellow means that AI can help
someone and red uh means that AI is not
very useful. And this is quite
illustrative but it it conveys the
point. And so then any code base at any
point in time sits on a vertical line
across this graphic. And what you can
see is that clean code amplifies AI
gains.
Secondly is that you need to manage your
codebase entropy, right? Your codebase
tech debt because if you just use AI
unchecked, this is going to accelerate
this entropy which is going to push and
degrade your cleanliness to the left
kind of right and then you as as a human
need to push on the other side to kind
of improve or maintain that cleanliness
to keep reaping the benefits from AI.
Thirdly is that it's important that
engineers need to know when to use AI
and when not to use AI. And what happens
when they don't is this kind of line on
the left whereby you have AI AI outputs
that are rejected or need heavy
rewriting which then leads to engineers
losing trust in AI saying okay this just
doesn't work. I'm not going to use it.
Which then further collapses your AI
gains.
Now we said can we find out whether we
can look not only at usage but at how
are these companies and these engineers
using AI and we came up with an AI
engineering practices benchmark. The way
this works is that we can scan your
codebase and detect these AI
fingerprints or artifacts basically
traces of how your team is using AI.
It's quite directional at this point but
evolving. And we can quantify this based
on the percentage of your active
engineering work that uses each AI
pattern. And then we can repeat this
monthly using git history. And the way
this works is more or less you have kind
of a few levels. And level zero might be
how humans are just not using AI and
write all of the code. Level one is kind
of like personal use where engineers are
not sharing prompts across the team or
not versioning them. Level two is team
use whereby teams are are sharing these
kind of prompts and rules. And then
level three is even more sophisticated.
It's where AI autonomously does specific
tasks maybe not the entire workflow. And
level four is you know agentic
orchestration which is where AI just
runs the entire process. And so this is
going to be an open- source tool which
you can leverage if you sign up on the
sweeper research portal.
We applied this benchmark to one of the
companies in our research data set and
we saw this. This company had two
business units with equal access to AI
tools, right? Same licenses, same spend,
same tools, same everything. But the
adoption rate and the usage rate was
very different by business unit. On the
left, the first business unit, you can
as you can see in the area in the blue,
seemed to be using AI a lot more for
almost 40% of their work. whereas on the
on the uh right the second business unit
seem to struggle behind a bit more. And
so the takeaway here is that access to
AI and even AI usage doesn't mean or
doesn't guarantee that that AI is going
to be used in the same way across a
company.
As a leader, you really want to be
understanding not just whether they're
using but also how your engineers are
using AI.
Great. Now let's dive into how do we
actually measure AI return on investment
in software engineering.
Oh uh there we go. Okay. So here ideally
we would be measuring this based on
business outcomes right I give my AI
engineer my engineers AI and then I make
more money more revenue net revenue
retention whatever business KPI you want
to track. The problem is that there's
too much noise between the treatment
right giving AI and the result which is
the business outcome. And on top of this
there's confounding variables such as
your sales execution, the macro
environment, your product strategy and
therefore although that would be ideal
unfortunately uh I think we need to find
alternative paths and the most logical
one is to simply look at the engineering
outcomes because there is a clear signal
right but here we need to go beyond
measuring AI usage into measuring
engineering outcomes. There's a few
caveats and this topic is quite heavily
discussed and so I want to mention some
of them.
The first one is that this is assuming
that our product function can properly
direct that increased capacity into
something that generates value. And if
they aren't directing that, then it's a
product problem, which although sits
quite close to engineering, it's
slightly different, right? The second
caveat is that this assumes that
engineering is a meaningful bottleneck
for value which frankly it typically is
and that you can guard against good
hards law by using a balanced set of
metrics and also by having a good
company culture that doesn't weaponize
these metrics.
And thirdly is that AI is still very new
and measuring proxy metrics is still
better than not measuring. There's going
to be winners and losers in this AI
race. And progress is better than
perfection here. And so metrics don't
need to be flawless to be useful is what
I want to illustrate.
So then um here we have uh two parts
which you need to do to get the ROI from
AI, right? You kind of need to measure
usage and then you need to measure
engineering outcomes. And so let's start
with usage.
There's really two buckets for
enterprises. There's kind of more in a
research environment, but to make it
simple, there's access based and there's
usage based. Accessbased is basically
looking at when did people get access to
the tool. And here we have you can kind
of do a pilot group, give that group AI
and then compare it to a similar group
without AI or you can measure the same
team across time. The problem is that
access based is noisy and the gold
standard is really usage based which uh
uses telemetry from APIs from these
coding assistants right to uh give you
the right data to know who's using AI
and and where and the caveat here is
that the vendor API is different
unfortunately tools like GitHub copilot
aggregate the data and other tools like
cursor give you more granular data
the big takeaway is that you can measure
impact of um retroactively by using git
history. And so you don't need to set up
an experiment now and wait 6 months. You
can actually if you've already adopted
AI, you can go back in time and and and
do this. It's quite easy.
Now we've seen usage. Let's look into
how do we actually measure engineering
outcomes? What are some of the metrics
we propose?
Here we have um our framework which we
proposed which is using a primary metric
and a guardrail metric. And so here um
the primary metric is engineering
output. It's not lines of code. It's not
PR counts and it's not DORA. And it's
basically based on this machine learning
model that replicates the panel of
experts, right? And the second set of
metrics are the guard ones which you
want to maintain at a healthy level but
you don't want to maximize. It doesn't
make sense to maximize them truly. And
so then there's three categories within
the guardrail ones. Rework and
refactoring, quality, tech and risk, and
then people and devops. The third bucket
is important to highlight that these are
not productivity metrics. They're
useful, but you cannot just kind of use
them like maximize them to maximize
developer productivity. They kind of
fall off at some point. And so the goal
here might be to keep your guardrail
metrics healthy while increasing the
primary metric to whatever degree
possible.
Now, let's dive into a case study. Here
we worked with
a company that uh large enterprise. We
took a team of uh 350 people under a
vice president and we measured pull
requests. The reason we did this is to
illustrate that you cannot measure pull
requests to understand whether AI is
helping you. And so here this team
adopted um AI in May of this year and we
measured the four months before four
months after. We saw a 14% increase.
Great. That's fantastic. But what about
reviewer burden? What about code
quality? So we measured code quality.
And here what we saw is um I mean
firstly actually code quality think of
it as maintainability scale from 0 to
10. And uh there's kind of these bands.
Uh it uses our our methodology. You can
read it online. But basically what you
see is that in the preAI period their
code quality was quite stable and
consistent. And once they adopted AI,
two things happened. Code quality
decreased and then code quality became
more erratic.
Next, we took a look at our metric,
which is engineering output. It's not
lines of code. And here for every month,
you see the sigma, the sum of the output
delivered for that month broken down
into four buckets. Rework and
refactoring. So rework is when you're
changing or editing code that was it's
still kind of fresh, so it's recent.
refactoring is when you're changing code
that's a bit older and uh what uh then
like added and removed it's pretty
self-explanatory and then also you can
see these kind of benchmarks so we can
benchmark this company against similar
companies in their industry and here AI
usage had two effects firstly is that
rework went up by 2.5 times which is
really bad and effective output which is
kind of like a proxy for productivity or
so didn't really change
and so then what's the conclusion here
let's do a recap app. So we saw that PRs
went up by 14%. But this is inconclusive
because more PRs doesn't mean better. We
saw that code quality decreased by 9%
which is problematic. We saw that
effective output didn't increase
meaningfully. And then we saw that
rework increased by a lot. And so then
the question here is what is the ROI of
this AI adoption? Right? It might be
negative. And what I want to point out
here is that had this company not
measured this more thoroughly and simply
measured PR counts, they would have
thought, hey, we're doing great. We
increased our productivity by 14%. Let's
run the numbers. That's how many million
lots of millions of dollars. And does
this offset the AI licenses? Sure thing
it does, right? The other thing is that
I don't think this company should
abandon AI. They should simply use this
data to understand what they're doing
wrong. How can they improve? Because AI
is here to stay. It's a tool that's
going to transform how engineers are are
working, right? and you can just um kind
of like abandon it or so.
Great. So, this concludes our insights
for today. If you've enjoyed this uh
talk and you would like similar insights
for your company, I invite you to
participate in our research. Everything
you've seen today can uh be accessed
through kind of participating in our
research, some of them through live
dashboards in our research portal. And
especially I'd like to invite companies
that have access to Cursor Enterprise to
participate because we have a high need
for this so we can publish papers around
the granularity of using AI um in
software engineering. You can sign up at
software engineering
productivity.stanford.edu.
Thank you so much. [applause]
[music] Our
[music] next speaker will separate hype
from reality on AI code quality using
realworld data to show when AI generated
code can be trusted in production.
Please welcome CEO of Kodto, Edidomar
Freiedman.
[music]
It will grow. It will grow one or two
more months. I'm really excited being
here. So many so much pragmatic and
insight and suggestions. I was sitting
there uh just just before. So I'm Edomar
Freiedman, the CEO and co-founder of
Kodto. Codto stands for quality of
development and I'm going to share uh
our reports and other companies reports
about state of AI code quality. uh you
know trying to uh talk about the hype
versus reality which was uh like one of
the uh points that were discussed here
quite a lot which is awesome. So in the
last three weeks, four weeks, we saw
like three outages in the clouds
unfortunately, right? And these are
coming from companies that really care
about moving fast, right? They're
they're they're saying themselves that
they're using AI to generate code 10%,
30%, 50%, at the same time, they care
about quality. So how did that happen?
And is it is it related? I don't know.
But let's have some I'm going to share
some guess. So by the way 60% of
developers say that the like quarter of
their code is either generated by AI or
in in like uh uh shaped by I and 15% say
that even more than 80 80% of their code
uh is basically generated or or shaped
by AI. Now people are using AI to do
vibe coding but actually they're even
doing it for vibe checking vibe
reviewing. This is the command of cloud.
This is the prompt for the command of
claude code for security review. It was
hyped like two months ago. Do you know
what I'm talking about now? It says
there, I don't know if you see it. Uh
you are a senior security engineer.
Good. And then like somewhere there uh
down the line it says please exclude
denial of service. Don't don't uh catch
denial of service issues. Maybe that's
part of the part of the reason like
we're we're having uh cloud outages.
probably not just that, but you get the
point. Like we need to be rigorous about
how we deal with quality. It's not just
like vibe quality or or so like we're
doing vibe coding sometimes. Uh let's go
to another example. Okay, cursor I guess
like or or pilot most of you use rules,
right? We're going to talk about it. You
invest in code generation. After a
[snorts] while, you understand if you
invest, you'll get more out of it. And
uh we we asked like a bunch of of
developers and I'm asking you as well
think think for a second for all the
developers there in the audience like
when you write cursor rules or copilot
rules etc. Do you feel they're
completely followed or it's like mostly
followed? Do you know how much they're
followed and what extent are they
followed? It's rigorously like how
technical deep they're they're being
followed. So the what we get back like
the answer from what you see here on the
screen is mostly like B, C, and D. They
are followed but they're not completely
followed. Okay. So that means like we
are generating code trying to push it to
the standards but it's not necessarily
still like getting to the quality we
wanted. I'm going to share a bit more
statistics and and information and some
insight from three reports. One done by
Codo, another by done by Sonar, another
by far and all of them are are focused
on code code quality review etc. The
sample size is thousands of developers
in some cases even more millions of pull
requests and and a billion of of lines
lines of code that were uh uh being
checked. Like for example, if you think
about uh Sonar, this is a company. Yeah.
A bit like coming from pre-AII, but they
see code at scale and you they're doing
like a lot of uh checks in code that are
not necessarily AI focused, but are
necessary in order to check uh your your
software from all possible direction.
And that's why their scaling and the
scale of the code that they're seeing is
is immense. Okay. So for example, we
took information from from their report
and eventually my purpose here is to
break down the different dimension of
what uh code quality means and give you
some share some stats and and insights.
I want to start with the end. Okay, this
is the takeaway I want you all all like
to take from from the next 13 minutes
that I have. We started with code
generation. We like out of the box use
it autocomplete etc. and you invest in
it and you can get more out of it. But
there's the glass ceiling for how much
productivity you can get from code
generation. And then we move to the
agent code generation, right? Let's call
it gen 2.0. And that's a higher glass
ceiling. It could do much more
productivity and especially if you
invest in it, for example, rules, etc.
Then with AI breaking outside of the
IDE, we can start using AI also for code
for agentic quality workflows. It could
be inside the ID, but the the truth is
that if you think about all the
workflows you have in your organization,
especially if you're more than 100
developers or so, you probably have a
lot of workflows that you are related to
quality that you need to auto automate.
And that's where you start like breaking
through the glass ceiling of
productivity. if you invest in it. And
finally, I I claim that you need those
agentic workflows. Keep learning. And we
might touch a little bit of that like
later later on. Okay? Like because
quality is something dynamic. So you'll
only finally break break the glass
ceiling if if you really have those
quality workflows and rules and standard
being dynamic. And then then you will
see the promised 2x let alone the 10x
that you were promised the hyped and and
you you heard from McKenzie and from
Stanford you're not getting that. I
don't need to tell you the 2x 10x for
the entire software development uh life
cycle. So a bit about more about the
market adoption. Uh one of the report
says that 82% of adoption already for AI
dev tools are being used daily or
weekly. uh some people at 60 60% 59
report that they're using more than
three and 20% saying that they're using
more than five code generation tools. If
you think about it for a second uh don't
only take like cursor compilot codeex
cloud code etc. Sorry if I'm insulting
anyone in the that I forgot their tool
but there's also the lovable etc. They
also generate code and by the way you're
going to get to 10 I'm count on me
you're going to get to 10 tools in two
three years that generate code for you
okay come to talk to me about later I'll
try to convince you and and the thing is
that it's coming from bottom up like 50%
of the usage is coming from less than 10
teams that are less than 10 developers
but it is propagating also to the
enterprise again I'm sure you know I
mean talk propagating to the enterprise
at scale like not just like five
developers in the last year we're seeing
like more and more enterprise using co
code generation. Uh so if like an
average with within reports we saw 82 to
92% using weekly to a monthly code
generation tools and in some cases maybe
extreme maybe not we're going to talk
about it. We saw 3x productivity boost
in writing code. Okay, but that doesn't
mean that if you have uh 3x productivity
in writing code that you actually
guarantee any quality like I presented
before. So actually 67% of the developer
that we as asked have serious quality
concerns about all the AI generated all
the generated code uh uh code generated
by AI or influenced by AI and they're
claiming that they're missing the
framework how to deal with quality how
to measure quality. It's a big question.
What is quality? I'm going to talk about
it in the next few slides. Okay, think
about it for a second before I break
break it down. What what is quality? Um
so what we're actually saying that the
crisis with V right coding uh viable
coding we're seeing it shifting and
evolving is that you're getting like
more task being done like 20 some report
20% more task you know velocity and like
97 more% or so of PR being opened and
eventually it takes more time to review
PR like 90% more time to review PR and
by the way like there's a lot of
statistics about AI generating wring
code at least there's not less amount of
bugs per line of code I'm not claiming
that there are more but even if there's
not less bugs per line of code you have
much more bugs because there are much
more PRs much more code being generated
etc right so that that's a problem for
the reviewer so it's somebody's surprise
it takes more time to review these
especially in the age of agents right
when five minutes calling to cloud code
I have 1,000 line of code after 5
minutes once upon a time it took me like
hours to write 10 proper per lines of
code. Right now, let's zoom out for a
second. Code generation is magnificent.
Okay? Like it it's a gamecher when
you're talking about green field. You
saw people talk about it a few slides a
few minutes before me. Uh it it
revolutionized how we do p proof of
concept uh project etc. But when you're
dealing with heavyduty software then you
you like it or not we are dealing with a
lot of things when uh when you serve
millions of clients you have financial
transactions when you're doing
transportation you're dealing with code
integrity if you like code governance uh
review standards testing relability etc.
That's what we need to uh uh to deal
with. Now let's break that under the
surface part of the glacier into two
dimensions. This is one dimension you
can look on the quality issues in
throughout the software development life
cycle like planning and then development
writing code review code review is a bit
of a process but like what you're like
checking quality that's part of the
process of code review testing which is
another part of of quality and and
deployment and I know I didn't cover the
entire like uh software development life
cycle but just to give you an example
and each one of them like possess like
introduce new problems that are coming
because you're using more and more AI
generated code. Um now another dimension
to look at it is actually code level
problems and process level problems.
Okay, I'm not I'm not opening the you
know list of functional just opening the
list of non-functional. You're talking
about security inefficiency that are not
necessarily uh functional. Use I will
show you some statistics about that. And
then process level is for example
learning. Hey if you will have a a a bad
outage because of AI generated code who
is responsible is it the AI or or the
team that own that okay like you need to
learn and own the code eventually that's
a process that needs to be done
verification porting guard rails
standards uh etc. So, so all of those
issues when they are introduced to
thousands of developer that we asked
them do you think like actually AI
helped to reduce with those problems or
or actually made more like more
challenging 42 people reported that they
spend 42 more of the development time on
solving issues on fixing bugs etc. and
and they saw 35 uh% project delays.
We're talking about we're talking about
maybe games they're talking about like
delays. Okay, there's some bias. We told
them we talked about problem with
quality and what's the impact etc. Um
but that's what they they they present
uh to when they they answer uh when when
they're talking about like when you're
mass using AI code AI generated code and
we see reports uh some of the reports
talking about 3x more security inc
incidents by the way it makes sense you
remember we had a slide saying 3x more
writing code so 3x more security
incidents like the same amount of line
of code the same amount of uh uh
problems the correlation so what to do
with that like I talked about problems
and problems and problems okay help help
me deal with it like let's let's spend a
few minutes on on that. So one one
suspect of course is testing and
actually really interesting we asked a
couple of question about testing and one
really relevant saying of that people
said that when they heavily use AI to on
testing use AI to do testing they
actually double their trust and the AI
generated code okay that's one thing the
ne next suspect to help us with the
quality is code review what really
interesting about code review that it's
a process that helps almost with all the
process level and the code level like
issues. For example, you can set your AI
code review tool to tell you block this
PR if it doesn't cover certain level of
test coverage. So through the PR, you
take care of the testing process
problem. Okay. So code like code review
with AI is actually one of one of the
major things you you you can do and
people that are developers that are
using AI code review tool they're saying
that they're saying they're seeing
double the quality gain and they're
saying that actually it's it helps them
to uh uh improve improve 47% in
productivity of writing code. Okay. Now
a bit statistics from our own uh AI code
review tool. We scan a million of PRs a
month and we took one mill million of
those PRs and we noticed that 17%
include like high severity issues. By
the way, we're now analyzing uh before
and after using AI. I don't have that
statistics yet, but we are noticing
since we're starting uh most of the
companies we serve, they use AI
generated code. So that's why uh I don't
have before. We need to go scan
backwards. Uh and that's like a really
big a big number. Another thing I want
to talk to you like about uh when you're
trying to improve on quality is is the
foundation of having the right context
that is brought to the uh code
generation tool that is brought to the
AI code review tool. Better context
better quality across the board wherever
you're using AI. Uh so when we asked
developers when when you h when you
don't trust AI generated code like you
remember like 67% sa like a really
worried about that they said 80 80% of
the time they don't trust the context
that the LLM have okay and and and uh
when we asked developers what would you
like to be improved in your AI generated
code in your AI code review tool they
said the number one was context it was
number one was 33% they can choose among
many things to to improve. So context is
extremely important. I can tell you that
as codto one of our technology moes uh
is is around context and when you
connect our context engine we're seeing
it as the number one tool that is being
used like 60% of code generator or code
review tools 60% of their calls to an
MCP would be to a context MCP. Okay. And
just to tell you the context doesn't
necessarily need to include only your
code. It could also include context to
your standards, your best practices.
We're seeing in our AI code review that
8% of the context usage is actually from
files that are related to standards and
and best practices etc. Okay, I have to
CEO of Kodo like marketing will be mad
on me if I don't brag a little bit.
Right? So this is uh kind of like our
market of our context engine being
presented by Jensen and GTC keynote and
he notice he didn't talk about our co
code review capabilities about our
testing capabilities he talked about our
context engine that Nvidia checked
because there's a realization that AI
quality AI generated whatever review
testing will come from bringing the
right context. So invest in that you
need to to build your context buy a
solution and invest in it. build your
solution uh etc. And the context needs
to include code uh uh versioning PR
history uh organization logs etc. That's
where all the context sits. It's not
just in the last branch of your
codebase. Okay. So I'm I'm zooming out
starting to talk about like
recommendations and uh and like uh
takeaways. So what what what's next? So
automated uh quality gateways invest in
that. People talked throughout the
morning about parallel agents. You know
what I'm talking about like background
agents. You can use a lot of those like
tools and capabilities to build build
your quality gates. Uh use intelligent
code review testing and you need a li
living and breathing like documentation
and and what documentation means is is a
story by itself. Uh I'm not going to
double click on it. And and this is how
I present for 3 years now and I think
I'm going to go all the way until age of
60 with this slide of how I think the
future of software development looks
like. Okay. So basically you have your
specification and you have your code
right and you have multiple agents
parallel agents that are helping you to
improve your spec write your spec
improve your code transfer transfer from
your spec to your to your code uh make
tests which are executable specs right
uh and and then you're going to have
your context engine the software
development database and you will build
your tools especially MCPs around
quality and verification and you'll Make
sure you have environments, stable,
secured sandboxes where those agents can
run and and run validation and quality
uh workflows. So don't don't forget like
the path forward is quality is your
competitive edge over your uh
competition. AI is a tool. It's not it's
not a solution. Okay? And don't like
only think about code generation as the
only thing. Look on the entire SDLC or
product development life cycle. I saw
one of the uh people talked um speakers
and it iterate with everything we talked
about today. I have uh I want to tell
you that you will gain value from it.
We're seeing in the reports people
seeing like security availability being
reduced faster code review you we just
got a hit on that because of AI
generated code and test coverage in a
month can can triple depends on on the
project etc. with with the last minute I
want to show you like a really small
piece of what you can do with codo. uh
you can go into codto and define your
own rule for example almost the same
rule you'll put on cursor of I don't
like nested ifs if this is a problem
that you have but then codto will look
on your context build the good example
the bad example and then start giving
like building a workflow that is
specifically to catch that issue and
give you statistics over time when it's
being accepted and when not so you can
adjust that rule and really know and
have visibility to to your standards.
Okay. So when a PR is written with a few
ifs and else although it was written
with cursor copilot that had a rule do
not do nested ifs etc. then eventually
when you open a PR you will get uh codo
uh uh catching that and giving a
suggestion according to the good and the
bad example. COD will also make a graph,
give you a CLI checks like check each
one of the rules and eventually tell you
the nested if and then will record and
learn what you did or did not do with
that suggestion in order to adapt the
standard and of the of the quality. Um
there will also automated like
suggestion. You don't need to write your
own. It learns your your your standards
and quality and offer that to you. And
that's it. I'm I'm really really excited
about like breaking the glass ceiling,
okay, with what we did with code
generation and then a jet to code
generation. Now we're turning into the
era of putting AI into work and through
the entire SDLC. The most important part
is related to quality. You would need to
invest in that. It's not out of the box.
Okay. And then you would see eventually
the promised tox that that that probably
promised to the CEO or something like
that once they give you the budget for
for the relevant tools. Thank you so
much. [applause]
[music]
Our next speaker is introducing
Miniaax's latest model and how it powers
nextg experiences for code generation.
Please welcome to the stage senior
researcher at Miniaax, Olive Song.
[music]
Hi. Hi everyone. Um I'm Olive. It's my
great honor here today to present on our
new model Mini Max M2. Um, I actually
lived in New York City for six years, so
it feels great to come back. Um, but
with a different role. Um, I currently
study reinforcement learning and model
evaluation at Miniax. Um, let me just
get a quick sense of the room. Who here
has heard or have tried of Miniax
before? Oh, a couple of there. Yeah, not
everybody, but I guess Yeah, but here's
the value, right, of me standing here
today. Um so we are a global company
that works on both foundation models and
applications. We develop multi modality
models including text um vision language
models our video generation model hyo
and speech generation music generation
stuff and we also have um many
applications including agents and stuff
um inhouse. So that that's the specific
thing that's different from the other
labs for other companies. So we both
develop foundation models um and
applications. So we have research and
developers sitting uh sitting side by
side working on things. Um so our
difference would be that we have
firsthand experience from our um
in-house developers into developing
models that developers would really need
in the community. And here I want to
introduce our Miniax M2 um which is an
openweight model very small with only 10
billion active parameters um that was
designed specifically for coding
workplace agentic tasks. It's very
costefficient.
Um let me just go over the benchmark
performance because people care about
it. So uh we rank very top in both um
intelligence benchmarks and also agent
benchmarks. Uh we I think we're on the
top of the open source models. But then
numbers don't tell everything because
sometimes you get those super high
number models you plug into them um into
your environment and they suck, right?
So we really care about the dynamics in
the community and in our first week we
had the most downloads
and also we climbed up to top three
token usage on open router. So we're
very glad that people in the community
are really loving our model um into
their development cycle.
So today what I want to share is how we
actually shape these men model
characteristics that made M2 so good in
your coding experience. And I'm gonna
present to you um the training be behind
it that supports each one of them from
coding experience to long horizon state
tracking tasks um to robust
generalization to different scaffolds to
multi- aent uh scalability.
So first let's talk about code
experience which we sc uh which we
supported with um scaled environments
and scaled experts.
So um developers need a model that can
actually work in the language they use
and across the workflow that they deal
with every day. So which means that we
need to utilize the real data from from
the internet and then um scale the
number of environments so that the model
when during training for example during
reinforcement learning it can actually
um react to the uh environment. it can
actually target verifiable coding goals
and to learn from it. So that's why we
scaled both the number uh of
environments and also our um
infrastructure so that we can perform
those training very efficiently.
So um with data construction and
reinforcement learning we were able to
train the model so that it's very strong
um it's full stack multilingual
and what I want to mention here is that
besides scaling environment that
everybody talks about we actually scale
something called expert developers um as
reward models. So, as I mentioned
before, uh we have a ton of um super
expert developers inhouse that could
give us feedback to our model's
performance. So, they participated
closely into the model development and
training cycle, including problem
definition, for example, um bugs, bug
fixing, for example, um repo refactoring
and stuff like that. And also they
identify the model behaviors that
developers enjoy and they identify
what's reliable and uh what developers
would trust and they give precise reward
and evaluation to the model's behaviors
to the final um deliverables so that um
it is a model that developers really
want to work with and that can adds
efficiency to the developers.
So with that we were able to lead in
many um languages in real use.
And the second characteristic that
Miniax M2 has is it it performs good in
those long horizon tasks. Uh those lawn
tasks that require interacting with
complex environments that requiring um
using multiple tools with reasoning.
And we supported that with the interled
thinking pattern um and reinforcement
learning.
So what is interled thinking? Um so with
a normal reasoning model that can use
tools, it it normally works like this.
You have the tools information given to
it. You have the system prompts. Um you
have user prompts and then the model
would think and then it calls tools. It
can be a couple of tools at the same
time. And then they get the tool
response from the environment and then
it performs a final thinking and deliver
a final content. But but here's the
truth, right? In real world, the
environments are often noisy and
dynamic. You can't really perform this
one test just by once. You can get um
tool errors for example. You can get um
unexpected results from the environment
and stuff like that. So um what we did
is that we imagine how humans interact
with the world. We we we look at
something we get feedbacks and then we
think about it. We think if the feedback
is good or not and then we make other
actions, make other decisions. And
that's why we did the same thing with
our M2 model. So if we look at this um
chart over a diagram on the right. So
instead of just stopping um after one
round of tool calling, it actually
thinks again and reacts to the uh reacts
to the environments to see if the
information is enough for it to uh get
what it wants. So basically we call the
interle thinking or people call it
interle thinking because it interle
thinking with tool calling. um a couple
of time it can be you know uh tens to a
hundred um turns of tool calling within
just one user interaction term
so it helps um adaptation to environment
noise for example uh just like what I
mentioned the environment is it's it's
not stable all the time and then
something is suboptimal and then it can
choose to use other tools or do other
decisions it can focus on long horizon
has um can automate your workflow um
using for example Gmails, notions, um
terminal all at the same time. You just
need to maybe make one model call
without minim with minimal um human
intervention. It can do it all by
itself. And and here's a cool
illustration on the right because it's
New York City. I feel the vibe of you
know trading and marketing. Um so you
can see that there was some um there was
some perturbations in the stock market
uh I think last week and then our model
was able to keep it stable. So just like
I said there's like environment noise
there's no new information there's like
yeah news it looks like there there's
like other trading policies and stuff
like that but our model was able to uh
to perform pretty stably in these kind
of environments.
And the third characteristic is our
robust um generalization to many agent
scaffolds which was supported by our
perturbations in the data pipeline.
So we want our agent to generalize but
what is agent generalization?
At first we thought it was just tool
scaling. We train the model with enough
tools various tools kind of new tools.
we invent tools um and then it will just
perform good on unseen tools. Well, that
was kind of the truth. It worked at
first. Uh but then we soon realized that
if we perturb the environment a little
bit, for example, we change another
agent scaffold, then it doesn't
generalize. So what is agent
generalization?
Well, we conclude that um it's
adaptation to perturbations across the
model's entire uh operational space.
If we uh think back what's the model's
um operational space that we talked
about it can be tool information it can
be system prompts it can be user prompts
they can all all be different they can
be the chat template they can be the
environment they can be the tool
response. So what we did is that we
designed and maintained perturbation
pipelines of our data so that um our
model can actually gen generalized to a
lot of agent scaffolds
and the fourth characteristic that I
want to mention is the multi- aent
scalability
um which is very possible with M2
because it's very small and cost
effective.
I have a couple of videos here. Um, this
is M2 powered by our own Miniax agent uh
app. We actually have the QR code
downside. So, if you want it, you can
just scan and try it. So, it's like an
agent app we we developed. And here we
can see different copies of M2, right?
It can do research. um it can write the
write the research results and analyze
it and put it in a re report. It can put
it in some kind of front-end
illustration and they can work in
parallel. So because it is so small um
and so cost effective, it can really um
support those long run agentic tasks and
tasks that maybe um require some kind of
parallelism.
So what's next right for Miniax M2 from
what I've introduced we gathered
environments um algorithms data expert
values model architecture inference
evaluation all these stuff to build a
model um that was you know fast that was
uh intelligent that could use tools that
generalizes
what's next
for um M2.1 1 and M3 were in the future.
We thinks of better coding, maybe memory
work, context management, proactive AI
for workplace, vertical experts, and
because we have those great audio
generation, video generation models,
maybe we can integrate them. But all our
mission is that we're committed to bring
all these resources, whatever is on the
screen, and maybe more. Yeah. and values
and put them all together to develop
models for uh the community to use. So
um we really need feedback from the
community if possible because we want to
build this together and you know this is
kind of a race that everyone needs to
participate and then um we com we are
committed to share it with the
community. Yeah.
And that's all the insight for today.
Um, we really hope again we really hope
you to try the model because it's pretty
good. And then we can contact contact us
up there. You can try the models by
scanning the QR code. Yeah, basically
that's it. Thank you all for listening.
[applause]
[music]
Ladies [music] and gentlemen, please
welcome back to the stage, Alex
Lieberman.
Let's give it up again for Olive [music]
and all the other speakers from the
morning. [applause]
It is time for lunch. Very exciting. Uh,
one thing I want to say before we head
out for lunch and we're it's going to be
downstairs in the expo. check out all
the boos, talk to people, have food is,
you know, my own experience with going
to conferences is even though I come up
talk on stage a lot, I find it very
difficult to engage in conversation with
people when there's like these little
small group settings. I don't know like
can I go and chat with people? Can I
not? This is a kind of awkward. I give
you all permission to butt into
conversations, introduce yourself. Ben
and Swix have done an incredible job of
cultivating such a high quality
community here. And the most value you
will get is not just from these
incredible presentations. It's from
meeting other folks in the crowd. So
please, you have my permission butt into
conversations. Introduce yourself. Share
what you've learned with folks. And if
you need any sort of uh ice breakers to
get the conversation going, I have two
for you. One is just go into a group and
share your hottest take on uh the state
of AI today. It's a great way to get off
to a good start with someone. The
second, a little less intense, is is a
hot dog a sandwich? Is cereal uh in milk
a soup? That is how you're going to
start the conversations with folks.
Everyone enjoy lunch. We'll see you back
in an hour and uh thanks so much for
your time.
Heat. Heat.
[music]
Heat.
[music]
[music] Heat.
Heat. Heat.
[music]
[music]
>> [music]
[music]
>> Heat. Heat.
[music]
>> [music]
>> Heat.
Heat.
>> [music]
[music]
>> Heat.
Heat.
[music]
>> [music]
>> Heat up
here.
[music]
Heat [music]
up here.
[music]
>> [music]
[music]
>> Heat up here.
>> [music]
[music]
>> Heat up here.
[music]
[music]
Heat up
here.
Heat [music]
up here.
Yeah.
Heat.
[music]
[music]
>> [music]
>> Heat. Heat.
Heat. Heat.
[music]
Heat.
[music]
[music] Heat.
[music]
Heat.
Heat. [music]
Heat. Heat. [music]
[music]
[music]
[music]
>> [music]
[music]
[music]
>> Heat up
[music]
>> [music]
>> Heat.
[music]
Heat.
Heat. [music]
[music]
Heat.
[music]
>> [music]
>> Heat
up
here. [music]
Heat.
[music]
Heat. [music]
[music]
>> [music]
[music]
>> Heat up here.
[music]
>> [music]
[music]
>> Heat. Heat.
Heat
[music]
[music]
up
[music]
[music]
here.
>> [music]
>> Heat. Heat. [music]
Heat. Heat. [music]
[music]
Heat. Heat.
[music]
>> [music]
>> Heat. [music]
Heat.
Heat up here.
>> [music]
>> Heat. Heat.
[music]
>> [music]
>> Heat up here.
>> [music]
>> Heat.
Heat.
[music]
>> [music]
>> Heat. Heat. [music]
>> [music]
[music]
>> Heat. Heat.
Heat. Heat.
[music]
Heat. Heat.
[music]
Heat up here. [music]
[music]
[music]
>> [music]
>> Heat. Heat.
Heat [music]
up
here.
[music]
Heat [music]
up
here.
[music]
>> [music]
[music]
>> Heat. Heat. Heat.
[music]
Heat
up
[music]
here. Heat. Heat.
[music]
Heat up
here.
[music]
Heat
up
here.
>> [music]
[music]
>> Heat up
here.
Heat. Heat.
[music]
>> [music]
>> Heat up here.
Heat.
[music]
[music]
Heat.
[music]
Heat.
[music]
Heat.
[music]
Heat.
[music]
Heat.
[music]
[music]
>> [music]
[music]
[music]
>> Heat. [music]
[music]
[music]
Heat.
>> [music]
>> Heat. Heat.
[music]
Heat.
[music]
>> [music]
>> Heat up [music]
here.
[music]
Heat.
Heat.
Heat
[music]
up here.
Heat
up [music] here.
Heat. Heat.
[music]
>> [music]
>> Heat up [music]
here. [music]
[music]
>> [music]
>> Heat. Heat. [music]
Heat up here.
[music]
[music]
Heat.
[music]
[music]
Heat.
Heat. Heat.
[music]
[music]
>> [music]
>> Heat. Heat.
[music]
Heat. [music]
[music]
[music]
Heat.
[music]
>> [music]
[music]
>> Heat.
>> [music]
[music]
>> Heat. Heat.
Heat
up here.
Heat
[music]
up [music]
here.
>> [music]
>> Heat. Heat.
>> [music]
>> Heat up
Heat [music] up
here.
Heat
[music]
up here.
[music]
[music]
Heat up
>> [music]
>> here.
Heat.
Heat.
Heat
up
here.
Heat.
Heat.
Heat. Heat. [music]
Heat
up [music] here.
[music]
Heat
[music]
[music] up
>> [music]
>> Heat. Heat.
Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
[music]
Heat. Heat.
[music]
[music]
Heat
up
[music] Heat.
[music]
Heat.
Heat.
Heat. [music]
[music]
>> [music]
>> Heat. Heat.
>> [music]
>> Heat up
here. [music]
>> [music]
>> Heat up
here. [music]
>> [music]
>> Heat up [music]
here.
[music]
Heat. Heat.
[music]
Heat.
[music]
[music]
[music]
Heat.
>> [music]
>> Heat. Heat.
Heat. Heat.
[music]
[music]
>> [music]
>> Heat. Heat.
Heat. Heat.
>> [music]
>> Heat. Heat.
Heat
>> [music]
>> Heat. Heat.
>> [music]
>> Heat. Heat. [music]
Heat
[music]
up here.
Heat
up Heat.
Heat.
Heat.
[music]
[music]
Heat.
Heat up here.
Heat
Heat up here.
[music]
[music]
>> [music]
>> Heat [music] up Heat.
Heat. [music]
>> [music]
>> Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
[music]
>> [music]
>> Heat up here.
[music]
Heat [music]
[music]
up here.
>> [music]
>> Heat up here.
[music]
Heat
up
>> [music]
>> Heat. Heat.
>> [music]
[music]
>> Heat
Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat.
Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat. Heat.
[music]
Heat. Heat.
[music]
[music]
>> [music]
[music]
>> Heat. Heat.
Heat up
here.
[music]
>> [music]
>> Heat. Heat.
[music]
Heat.
Heat.
[music]
>> [music]
>> Heat. Heat. [music]
Heat. Heat.
[music]
[music]
[music]
>> [music]
>> Heat. Heat.
>> [music]
[music]
>> Heat. Heat. Heat. Heat. N.
[music]
[music]
Heat. Heat.
[music]
[music]
Heat up here.
[music] Heat
>> [music]
>> up
here. [music]
>> [music]
>> Heat up here.
>> [music]
[music]
[music]
[music]
>> Heat. Heat
[music]
up
here.
[music]
>> [music]
[music]
>> Heat up here.
[music]
>> [music]
[music]
>> Heat. Heat.
[music]
>> [music]
>> Heat
up
[music]
here.
Heat.
[music]
Heat.
[music]
Heat. Heat.
[music]
[music]
Heat
[music]
up
here. Heat. Heat.
[music]
[music]
Heat
up
here.
>> [music]
[music]
[music]
>> Heat. Heat.
Heat.
[music]
Heat. [music]
[music]
>> [music]
[music]
>> Heat up here.
[music]
Heat. Heat. [music]
>> [music]
[music]
[music]
[music]
[music]
>> Heat. Heat.
[music]
>> [music]
>> Heat up here.
Heat up
here.
Heat
[music]
up
[music]
here.
[music]
>> [music]
>> Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat.
[music]
[music]
Heat.
>> [music]
>> Heat. Heat. Heat. [music]
[music]
[music]
>> [music]
>> Heat
up here.
[music]
Heat.
[music]
Heat. [music]
Heat [music] up here.
[music]
Heat
up here. [music]
[music]
>> [music]
>> Heat up
here.
[music]
[music]
>> [music]
>> Heat up
>> [music]
[music]
>> Heat up here.
>> [music]
>> How's everyone doing? Good lunch.
[music]
excited for the afternoon sessions. Out
of curiosity, did anyone have the hot
dog conversation? Does anyone think Who
thinks that a hot dog's a sandwich?
We got one. We got two. Uh, anyone think
a hot dog isn't a sandwich? Most of the
crowd. That is that is usually the
consensus. Uh, one other question. Who
thinks that they have the hottest take
on the state of AI or AI engineering
right now in the room? Anyone think they
have the hottest take? Well, I I'll I'll
give you uh a tea up for later. My
co-founder Arman is speaking around
four, and I would say he has one of the
hotter takes I've seen, which is he
thinks all engineers should be paid like
salespeople based on output. That is
going to attract a lot of debate and I
give you full permission to debate him
after his talk. Well, are you guys ready
to jump into the next group of sessions?
>> Let's do it. We will be diving into
proactive agents from Google Labs,
building Gen Bi at a Fortune 100
business, deploying AI within
Bloomberg's engineering org, lessons
learned building an AI browser, and
developer experience in the age of AI
coding agents. With that, please join me
in welcoming our next speaker, Kath
Corvec, director of product at Google
Labs. Let's give it to her.
>> [music]
>> Hey everybody.
I'm so excited to be here. I love New
York and I love meeting everybody here.
And I am Kath Corbec. I'm from Google
Labs and I work on this little team
called ADA. And I'm going to be talking
about some of the stuff that we've been
doing on this project called Jewels. So,
a few months ago in my household, our
dishwasher broke. And while it was being
repaired, my husband decided that he was
going to do all the dishes. And so, he
told me he was going to do this. But
every single night, I found myself
reminding him to do the dishes. And you
can imagine that got old pretty fast.
And I realized that even though I wasn't
physically washing the dishes, I was
still carrying this mental load. And I
know a lot of you can probably relate to
this. I was keeping track of whether or
not that task was done. following up,
making sure that things kept moving. And
I realized in that moment that that's
exactly where we are with asynchronous
agents today. They can handle some of
the work, but we're still the ones as
developers carrying that mental load and
monitoring them. So, here's the truth.
Humans, we are serial processors, not
parallel ones. We can juggle multiple
goals, but we execute them in sequence,
not all at once. When you manually kick
off a task in jewels, you're usually
waiting to be able to move on. And it's
that pause, it's that gap in attention
where we really lose momentum. And this
is actually backed up by science where
uh humans actually think we think we're
multitaskers, but we're actually
executing many tasks very rapidly. But
switching between these tasks comes with
a huge cost. It can cost up to 40% of
your productive time. So that's like
half a day lost to switching contexts
and reloading. So if humans are uniters,
what's the solution here with agents? So
for async agents, in order in order for
them to succeed, developers can't be
expected to babysit them.
We've all seen that post on Twitter of
16 different cloud code tasks running in
parallel on 16 different terminals on
three different huge browsers or huge
monitors. And when I first saw this, I
thought, god forbid that is the DevX of
the future. I want to I don't want to
manage work. I don't want to manage my
agents. I want to be a coder. I want to
build. And so we need to think we need
uh uh collaborators in our system that
we can trust. agents that really
understand context, can anticipate our
needs, and they know really when to step
in. And then uh I think finally, we're
reaching that point with models where
they're getting better and better at
executing end to end as long as they
understand what our goals are clearly.
And that's where trust really becomes
this unlock where you can trust the
system to know what's missing, to fill
in the gaps, and to really keep progress
moving forward while you manage on
something else where where while you
focus on what matters most. And
essentially, we want jewels to do the
dishes without being asked.
So most AI developer tools today are
fundamentally reactive. you open up your
CLI or your ID and you ask the agent to
do something and it responds or it waits
for you to start typing and then it
autocompletes a suggestion. And there's
a benefit to this model. It's very
efficient. It only uses compute when you
explicitly ask for it. But the real
question I'm asking myself is, is this
how I want to manage AI? And if you
think about in the future, imagine a
world where compute is not a limiting
factor anymore. Instead of a single
reactive assistant for instructions, you
could have dozens of small proactive
agents working with you in parallel,
quietly looking for patterns, noticing
friction, and taking on the boring tasks
that you don't want to do before you
even ask.
It can do things like fixing
authentication bugs that you've been
avoiding. uh updating configs, flagging
potential order uh errors, preparing m
uh migrations and all of this can happen
in the background triggered off of
things in my natural workflow. So I
really think there are four essential
ingredients that make up proactive
systems today. There's observation. The
agent has to really continually
understand what is happening and of what
your code changes are, what your
patterns are, what your workflow is,
etc. to get context about your entire
project. And then there's
personalization. And this one's
difficult. It has to learn how you work,
what you care about, what you tend to
ignore, what your preferences are, the
code that you absolutely don't want to
ever touch. And then it has to be timely
as well. If it comes in too soon, it's
going to interrupt you. And if it's too
late, then the moment is lost. And it
also has to work seamlessly across your
workflow. It has to insert itself into
spaces where you naturally work already
in your terminal, in your repository, in
your IDE, not forcing you to go
somewhere else to some application
that's secret or that you forgot about.
So, bringing all these tools together,
you can imagine, is not trivial.
>> So, I was running this presentation. Um,
and uh, you you want to be able to ask
your agent to understand your workflow
and anticipate your needs and then
intervene at exactly the right moment
without breaking your workflow.
And that's when it really starts to feel
like magic. The interesting thing is pro
these proactive systems, they're all
around us today. One of my favorite
examples is Google Nest where you put it
in your house, you install it, and then
you configure it and then it starts to
learn your habits as you leave the
house, as you come back, uh, as you go
to sleep, as you wake up in the morning.
And then pretty soon, you don't have to
think about climate control in your
house anymore because it's learned what
your habits are. Another one is your own
body. your heart rate elevates as you go
for a run or start to work out or it
anticipates that you're about to fall
and so it reacts before you consciously
think I'm going to put my hand out. So
when you look at it like that
proactivity is actually not that
proactivity for AI is actually not that
futuristic. It's very familiar and it is
very human and that's exactly the point.
What we're building is tools that behave
more like a good collaborator and less
like command line utilities. So we're
already doing this in this tool called
jewels which is this uh proactive
asynchronous autonomous coding agent
from Google labs. And we're doing this
in kind of three levels of of uh
proactivity. Level one is where a
collaboration really starts to emerge.
And this is how Jules works today where
it can detect things like missing tests,
unused dependencies, unsafe patterns,
and then it starts to automatically fix
those things as it's doing other other
tasks that you've asked it to do. This
is sort of like this attentive sue chef
in your workflow where it's keeping the
kitchen clean, the knives sharp, the
kitchen uh stocked so that you can focus
on what comes next. And that's the
beginning of proactive software. At
level two, the agent becomes more
contextually aware of the entire
project. It observes how you work, the
code you write. If you're a back-end
engineer, maybe you need help with
React. If you're a designer, maybe it
wants you to may maybe it'll help uh uh
write the database schema. And then it
learns what your frameworks are and what
your deployment style is, etc. And this
is the kitchen manager. This is the
person in your workflow keeping the
rhythm and anticipating what you need
next. And then comes level three. And
this is what we're working on pretty
hard right now going into December. And
I'll show you a little bit of what we're
what we're going to be shipping in
December in a minute. But level three is
where things start to converge around
that context. It's where the agent
starts to understand not just context,
but also consequence. How these choices
are actually affecting the users of your
products, the performance, and the
outcomes. And at that level, we have
this thing jewels. We also have an agent
called Stitch, which is a design agent.
and another one we're building called
insights which is a data agent and
they're all coming together to build
this collective intelligence across your
application. Jules can see what's
breaking in the software. Stitch
understands how users are interacting
with it and insights connects behaviors
from real world signals like analytics,
telemetry and conversion rates. And then
together they can propose improvements
across boundaries of how the system all
works together. doing things like
performance fixes to improve UX and then
design changes to prevent regressions
and then all of that is organized based
on live data. So the trick here is that
the human stays firmly in the loop.
You're observing what the agents are
doing. You're refining when you when
they when you need to intervene and then
you're redirecting it when it has when
it has been misdirected. So level three
isn't really about autonomy anymore.
It's actually about alignment to your
project. A a agents and humans
collaborating together across the full
life cycle of your project.
So right now Jules is focused on this
code awareness piece. It understands the
environment, the frameworks and the
project structures and we're moving
towards more of that system awareness.
So things that we're introducing in
Jules now, we've added something called
memory which I'm sure a lot of you are
familiar with. It's the ability for
Jules to write its own memories and you
can edit them and interact with them. It
can edit them and it understands that
and builds this memory and context and
knowledge of of your project as you work
with it. We've added a critic agent
which works adversarially with Jules to
make sure that the code is is high
quality but then also does a full code
review. And then we've added
verification where Jules will write a
playwright script, take a screenshot and
then put that back into the trajectory
for you to validate. And then we're also
doing things like adding uh a to-do bot
that will look through your code and
look through your repository and pick up
on anything that where you've said this
is a to-do I want to get to in the
future and it will start to proactively
work on those things with that context.
We're also adding in things like best
practices where Jules will understand
best practices and start to suggest
those and also environment setup. We
have an environment agent that we use
internally for running EV valves and
we're extending that externally to
better understand how environment how
your environments work and and set those
up for you. And then we also are adding
something called a just in time context.
It's like a jewels cheat sheet where if
it's doing something very specific it
can and get stuck it can just
immediately look at that cheat sheet
instead of reaching out to you. So, this
is all moving Jules very close to being
that proactive teammate, not just this
reactive assistant. Okay, so this
morning I was talking to my team back in
San Francisco and I was thinking, okay,
I'm going to do a live demo, but the
live demo gods did not align with me
this morning. We still have CLS that are
being pushed to staging right now. So,
I'm going to walk you through a little
bit of this. And if you know Jed, he's
going to, I think, be talking tomorrow.
We're gonna um affectionately try to fix
Jed's code here. Um, so this is a view
of of proactivity and this is this is
Jules where you prompt it and the first
thing you that you do when you configure
and enable proactivity is Jules will
index your entire uh codebase. It'll
index your directory and start looking
for things that it can do and then it'll
that'll show up on the screen. So right
here we're looking at a little bit more
in this um in this repository ADK Python
and uh and it's indexed the repository
and it's found a bunch of to-dos. It's
found a bunch of best practices that it
can update and it's giving me some
signal about what it's finding. And so
you can see the signal is high
confidence, medium confidence, and low.
And so it's actually telling me what it
thinks it can achieve based on what's in
my code and what it wants to do. And
that's so it has high confidence in
green, medium and purple, low and yellow
way down at the bottom. Um, and so I can
go through this and I can manually click
these and say I want to start these. And
so I don't have to think about the
prompt. I don't have to look at the
code. I don't I I can do kind of less
cognitive load here. We're working on
something to just start these
automatically. And so that's coming in
the future. But I can also delete these.
I can say, "Hey, this one isn't isn't
for me. Isn't good." And so once it gets
started on a task, I can kind of drill
into it and see a little bit more. I can
peek into the code that it is suggesting
uh that uh it's suggesting it work on. I
can find the location of that code. And
it also gives me some rationale about
why it wants to work on that code, why
what it's doing, etc. And so it's giving
me a lot more context and helping me
trust that it knows what to do here.
Okay. So that's proactivity that's
coming in December and hopefully we'll
be able to give that to everybody here.
We're very excited about it. And I want
to tell you a little story about uh
something my husband and I were working
on just to kind of set set wrap things
up. We uh tinker a bunch with hardware
and we live on this slow street in the
middle of San Francisco and hate ashbury
district. So on Halloween, we get a lot
of people walking by our house. And so
we were trying to take advantage of that
with our Halloween decorations. And so
we built this six-foot animatronic head
that sits in the front of our house.
It's this old Victorian house. And he
sculpted it out of foam, epoxy, and
fiberglass. And then I our our kids also
called this lovingly the bald head. And
it's based off of, if you ever see saw
Peewee Herman from the 80s, it's based
off of the Peewee Herman Peewee's Big
Adventures head. Um, so while my husband
was doing this, I was spending my time
working with Jules on updating the
firmware, controlling the stepper
motors, working on the um on the LEDs
and the sensors. And for me, that's the
fun part for me is like really getting
creative with what the LEDs are doing.
So I wanted to focus on that, the LED
animations. But I ended up spending most
of my time actually fixing bugs and
swapping libraries and doing things like
that. So what I would do is I would
prompt Jules, I'd wait 10 minutes and
then I would repeat. And I found that
process very very tedious. And what I
wanted was actually Jules to do the
research. I wanted it to handle the the
ugly parts where it was researching how
to fix a bug, uh doing the debugging
itself. And I wanted it to do this so
that I could focus on the creative
parts. I wanted the eyes to move and
like follow people as they walk down the
street and like have lasers coming out
of its eyes and stuff like I mentioned
it was Halloween. It was very scary. Uh
and and this but but I couldn't really
do as much of that and I ended up
actually not shipping as much as I
wanted to with this animatronic bald
head. And so it's that gap that we
actually want to close. It's the space
between with jewels. the space between
that tool friction and creative freedom
that we're trying to unlock with these
kinds of proactive agents.
So, what I really want you guys to take
away from it, I give this advice to the
the folks on on the Duels team a lot, is
that the product we build today actually
won't be the project the products that
we have in the future. And I think a lot
of us know that. But in reality, I want
everybody in this room and everyone
building working with AI to be able to
take those big steps. I think the
patterns that we rely on today, Git, uh
your your idees, even the code, how we
think about the code itself might not
exist a year from now, might not exist
six months from now. And that's the
exciting part for me. It's sort of we
get to invent the future right now. we
get to describe and decide how software
is made and built. Uh kind of all the
people in this room. So my my challenge
to you is to not be afraid to question
the old ways of how you're building
software because really the future is
coming faster than any of us know. It's
probably already here and the cool thing
is we get to build it together. Thank
you. [applause]
[music] Our
next talk is a case study from the
enterprise on incremental rollout of AI.
Here to provide us with a blueprint for
making AI transformation fundable,
governable and real inside large risk
averse organizations is engineering
leader at Northwestern Mutual, ASAP
board.
>> [music]
[applause]
>> Doesn't this look like something's going
to drop from the ceiling? Like a ground
zero type thing? [snorts] Be honest.
Like, who has the buzzer that if I'm I
really suck, they press it and
everything falls down through the trap
door? No.
>> Be careful.
>> Yeah. Okay. Who was it? Okay. you tell
me if I'm doing okay or if I should take
a couple steps back. Right. So, hi
everyone. I'm Assaf. Um, and I'm here to
talk about Genbi. And kind of first
disclaimer, this presentation was not
created with Gen AI. Um, to be honest, I
actually started doing it uh with uh
GPT03 back in August. Uh, [snorts] and
then I did kind of a first draft and
then a couple of weeks back I wanted to
come in and refresh it before the
conference and then GPT5 took over,
completely messed up my slide, so I
ended up doing it manually kind of
oldfashioned. So if I'm missing like an
M dash somewhere in the middle, let me
know after. Okay. [snorts] Uh, so first
of all, a bit of housekeeping. What's
GenBI? So it's a fusion of Gen AI and
BI. It's basically an agent that helps
people answer business questions with
data like a a business intelligence
person would do in real life. Uh the
reason that we're pursuing GenBI is
really because of the data
democratization that you can bring,
right? So having access to data at your
fingertips without having to be reliant
on a BI team that helps you find a
report, figure out what it means, uh
understand your world before they can
even give you any kind of input. Uh so
that's GenBBI. Uh, a bit about
Northwestern Mutual. That's where I
work. So, we're a financial services,
life insurance, and wealth management.
Been around for 160 years. Uh, [snorts]
some very impressive numbers there. But
first of all, I want to say why is
Northwestern Mutual a great place to do
Gen AI. We got a lot of data, we got a
lot of money, we got a lot of use cases,
and we got access to some of the best
talent uh, anyone can dream of. Really
truly humbled by the people that I get
to work with.
Um, but on the flip side, why is it hard
to do Gen AI at Northwestern Mutual?
Because it is a very riskaverse company,
right? If you think about it, our main
motto is generational responsibility. I
call it don't f up. Uh, because
what we end up selling to people is a
decadesl long commitment, right? you buy
life insurance now,
uh, if you stay with us until it comes
to term, so to speak, that can be 20,
40, 80 years down the line, depending on
when you buy it and how long you get to
live. And so stability is something
that's very important for us because
it's important for our clients. So, how
do we balance stability with innovation?
That's what I want to talk about today.
Um, and really the four main challenges
that we had when we even came up with
the idea kind of a pie in the sky Genbi
concept. Uh, [snorts] first of all, no
one's done it before, right? Truly, no
one's done Genbi in this fashion in the
past. Uh, secondly, and this was really
a preference for us, we wanted to use
actual data that's messy because we knew
that those were that's where the real
challenges are going to be, right?
understanding actual messy data for
160-y old company and how can we perform
well within that ecosystem. Um the third
was kind of a blind trust bias. So um
the bias the trust that you had to build
was both with the users but also with
the leadership of the company, right?
How can we bring accurate information,
accurate answers to people when uh all
of these things that we know about and
everyone's talked about is is just out
there, right? No one's blind to the
trust barriers. No one's blind to the
accuracy barriers. So, how do we
convince that this is actually something
that we can trust in the company? And
lastly,
um but really firstly, when we go to
approach this from an enterprise
perspective, budget impact, right? How
do we convince someone in a leadership
uh organization where risk averse is
ingrained in the DNA to even invest in
something like this that no one's done
before? We don't really know how we
would do it. Uh we're not even sure how
it would look like when it comes to
turn.
Uh so I'll start kind of one by one. Uh
and first of all really talk about why
we chose to use actual data uh and not
synthesized data or cleanse data. Uh
[snorts] so really it's about making
sure that we understand the actual
complexities that we will have to face
when we eventually want to go to
production right we know that you know
building uh PC's and demos is so easy
but the gap from PC to production is so
broad uh especially in this gen AI space
especially because we don't know upfront
how to design the system what we would
expect it to behave like so making sure
that we operate with real data just gave
us that extra confidence that when
something works in the it's very likely
to also work in reality. Uh but also and
maybe not uh in the least less important
is that we got to work with actual
people who work with the data day in and
day out and that gave us two things.
Okay, first of all subject matter
expertise which are super critical for
us to be able to validate that the
system is actually working gave us a lot
of real life examples of what people are
actually asking in a corporate and what
people have answered to them. So
basically the eval right and all the
testing and stuff. Uh but at the end of
the day it also brought the business to
be a part of the research project itself
and they became kind of bought into the
idea as part of the process. So we
didn't just test something in the lab
and then had to convince someone to go
ahead and use it. The end users were
part of the research process itself. And
so when eventually it matured enough so
we can take some of that to production,
they were already there and they
actually were pulling that. They told us
we want to take this, how can we wrap
it? How can we package it uh quickly
enough so we can put it into practice.
Uh and the next part was really about
building trust. Uh so this is about
building trust first of all with our
management team. All right. Now, I don't
know about you, but last time that I got
a million dollar to do a research
project that I wanted in a pie sky idea.
I woke up from the dream and I realized
that this is not how things work in
reality. You don't just get a million
dollars and go ahead and try something
out. Uh you had to show that you know
what you're doing. And part of what we
did, it's kind of listed out here, but
obviously, you know, we did all the
regular stuff, right? We worked in a
sandbox environment. We made sure that
we're not using actual client data. We
made sure to put in all the security
risk aside.
But uh one of the first approaches that
we said we're going to take is we're not
just going to build a tool that's going
to be uh released to everyone, right? We
understood very quickly that um how
people interact with the tool, their
ability to verify that what they're
getting is right and also give us
feedback changes dramatically depending
on their expertise and understanding of
the data. So we took that crawl, walk,
run approach that basically said we're
first going to release it to actual BI
experts, right? people that would be
able to do it on their own and know what
good looks like when they get it. And
we're just going to expedite the process
for them kind of like a GitHub co-pilot.
The next phase would be to bring it to
business managers and again people who
are closer to the BI team, but when they
see a mistake, they can pretty much
figure out that what they're seeing is
wrong because they're used to seeing
that on day-to-day basis. um and they
will might be less sensitive to these
types of mistakes and be more inclined
to give us that feedback instead of
just, you know, dumping it aside and
never using it again. Giving this type
of tool to executives in the company, I
don't even know when we're going to get
there, right? Like an executive, they
want clear, concise answers that they
know they can trust. We're definitely
not there yet. I think that's the vision
uh at some point in time, but the system
is not accurate enough for us to get
there. Maybe it never will be.
Um, [snorts]
another way that we another liver that
we kind of used to build inherent trust
in the system is that we said, well, in
the get-go, we're not going to even try
to build SQLs, right? This is very
complex. This is very hard even for a
person. So, we said step number one,
let's just bring information that is
already in the ecosystem that's already
verified, right? We have a lot of uh
certified reports and dashboards. Um and
actually in the conversations we had
with some of the BI teams that we worked
with, they told us guys like 80% of the
work that we do is basically sending
people to the right report and helping
them figure out how to use it. So the
report is already there. Um and that
again built some inherent trust into how
we architected the system because we
said we're not going to make up
information. we're just going to deliver
you the same asset that you would have
gotten anyway just in a much faster much
more interactive way. Uh and that was
the alignment of expectations that we
did very upfront with the uh users and
also with the management team.
Now [clears throat]
the biggest um
process or kind of the most important
approach that we took when uh
approaching our leadership team and
convincing them that we want to do this
was to create a very gradual incremental
process that gave them a lot of
visibility and control. [snorts] Uh and
it was very important for us to build
incremental deliveries throughout that
process so that uh not only did they
have the the visibility into what are we
funding now, what do we get out of it,
they actually had business deliverables
they could realize potential from
throughout the process and at any point
in time they could pull the plug right
and say okay like it's not working well
or we got enough out of it or you know
the next phase is so you know unknown
and long that we don't want further
invest in it. And this is how we
basically broke it down. So phase one
was just pure research, right? We kind
of did the shift from natural language
to SQL. We figured out how to write
responses. We figure out how to
understand questions if coming in. Just
kind of setting the stage. Phase
[snorts] two was about really
understanding, okay, so what does good
metadata and good context look like in
the perspective of a BI agent, right? It
looks very different if you're just
chatting with something or if you're
trying to do a rag with you know
unstructured data like documents and uh
business knowledge and stuff like that.
And this phase on its own already had uh
impact on the business because when we
define what good metadata looks like for
an an LLM
uh we could immediately apply that also
to just the ecosystem of data users
across the enterprise. Um, and by
understanding how to extract LM from the
information, we could also how to
extract metadata. Sorry, here's where
the [snorts] trap door comes into play,
right? Um, we could also project that on
how or what good metadata looks like for
humans interacting with the data. We
have another initiative around semantic
layer going on which tries to model
exactly that and this provided a very
valuable input to that initiative as
well. But the immediate next step was
basically just doing this kind of uh
multicontext semantic search, right?
People coming in asking different
questions and having the system figure
out what's the right context, what's the
right information we need in uh bring
them. And this is something that could
already be packaged as its own product
and delivered uh and basically just do
kind of a data finder and data owner
finder which is something that could
take anywhere between two to maybe four
weeks in an enterprise like Northwestern
Mutual just finding what data exists and
who owns it so I can start talk uh the
conversation with them.
Um and the next layer was really about
pulling in information and trying to do
some light pivoting around the data. Um
each one of these steps as you can see
also created an input to the to the
following step so that the research
itself was kind of self u
self-propelling and there were
incremental outcomes coming out of each
one of these phases. Uh the next one is
more kind of setting it up for
enterprise level usage. So understanding
roles of in uh of different users coming
in what they may be asking about what
type of access we want to give them etc
and eventually and this is still some
ways to go ahead uh building kind of a
fullyfledged NBI agent which doesn't
only quote information from existing
reports but I can actually run SQL
queries on its own uh pull in more data
do more sophisticated joints between
different data so it can answer more
complex questions so that's the road map
right that's kind the high level plan.
Now, why did that work? Well, kind of
quickly summarize them. We talked about
uh so we get value uh early and we get
value often. Each one of this was a six
week sprint at the end of which we had
had a very tangible deliverable coming
back to the business that we could
decide to productize. Uh and at any
point in time, we could decide how we
want to move forward. There was
transparent progress. There was
incremental business value. Uh each one
of these steps allowed us to learn
something that helped feed the next
step.
And maybe the most important part and
that's the bottom line here and that's
the part that executives really look at.
How do we control the risk in continuing
to invest in this type of research
project and this is really about
eliminating things like sun cost bias,
right? We already paid you know you know
whatever a million dollar. Let's just
get through the project see what we get
at the end. This eliminates the uh uh
fear of of competitors coming in and
maybe we don't need to continue
investing in this right so everyone in
the industry is researching GenBI and
there are solutions like data bricks
genie that are coming up and they're
getting better and better maybe at some
point in time it's better for us as an
organization to actually adopt data
bricks genie but at that point again
first it's much easier for us to pull
the plug in the funding but we already
have a good understanding of what good
looks like we have benchmarks that we
used for ourselves when testing our own
system that we can test a third party
solution with. And we know what to
expect, right? We know what works, we
know what doesn't. We know what a kind
of fluffy demo from a vendor would look
like. And we know where to drill in to
ask the tough questions.
So let's see kind of what it looks like
under the hood and how we productize
different elements uh of this
architecture. Uh and maybe kind of very
quickly, why can't we just do it with uh
chat GPT? So you know [snorts] just
dumping a schema into chachpp doesn't
work. Usually schemas are very messy.
It's not uh easy to understand the
context and the meaning of things. Uh
[snorts] and eventually governance is
super important. So there was a lot of
governance built into the architecture
that was very hard to apply on chpd from
the outside but even solutions like you
know data bricks genius third party much
harder to govern from the outside than
from the inside. But still TBD.
Uh so the stack kind of looks like this.
Uh we have a data and metadata layer
that we produced. We have four different
agents that are running across the
pipeline. A metadata agent that
understands the context. A rag agent
that finds the different reports. An SQL
agent that can pull more data if we need
that. And then eventually what we call a
BI agent that takes all that information
and delivers an answer to the question
that was asked. On top of that, we slap
governance and trust and orchestration
and eventually some kind of a contextual
UI. Um and this is how the flow goes. So
when a business question comes in, we uh
push it into the orchestrator and
basically decides how to facilitate the
process. The first thing that we do is
understanding the context. So that's
where that metadata agent comes in.
Works with the catalog, works with all
the documentation that we have across
the system to understand what we're
being asked about and what's the
relevant information to share. Then we
go to the rag agent which tries to find
an existing report again out of a list
of certified reports that we know are
allowed for people to use and people
have spent a lot of time fine-tuning
them and making them as accurate as
possible.
If we can't find the report or if it's
not exactly what we need to um to use,
that's where we go to the SQL agent that
basically tries to create a more um
exact query or a more elaborate query.
And even if the report that we have is
not usable as is, it gives us that
initial seed of a query that we can then
expand on rather than having to build
one from scratch. So it's kind of like a
fewot uh example, but in this case the
example that we gave is very very close
to the actual result that we're
expecting to get.
We then execute it against the database
pull and push it into the BI agent which
gen with which gen uh translate that to
a business answer and not just dumping
data back on the user and this is what
goes into the final answer. Now there's
obviously some kind of a loop that says
if I'm in the same conversation I'm
probably talking about the same data so
we don't have to talk about this or do
this again and again. Now
each one of these three components, each
one of these three agents can be
packaged as its own product and
delivered to production with a very
tangible and actual impact on business
metrics. Okay. And that's the kind of
beauty of this uh approach that after we
productize each one of these, we could
have basically said stop or let's move
forward.
uh and just some giving bottom line
numbers around some of these. So just
the rag agent that pulls the right
report uh allowed us to take about 20%
of the overall capacity of the BI team
that basically said uh all we do is just
share the right report with the right
person. So we were able to automate
around 80% out of those uh 20% and we're
talking about a team of 10 people. So
roughly two people full-time job all
they do is find the right report and
send it to the right person.
uh the metadata understandings that we
got from learning how to interact with
the data through an LLM allowed us to
run AB test in a in the semantic layer
project that we did and that allowed us
to prove back again to the senior
leadership in the company that there is
value and tangible value measurable
value in enriching metadata. And we did
that basically by running uh a a battery
of questions um against a database that
had good metadata and one that didn't
have good metadata. And we show how much
better an LLM performs when having the
right metadata in place. So basically
proving the value of something that can
be very fluffy like hey let's bring in
more documentation into the code. Uh
right now we're experimenting with the
data pivoting bot. Uh so once you have a
dashboard or a report be able to change
the time horizon some of the views some
of the segmentations and the groupings
of the data again kind of real time
without having a person do that for uh a
business stakeholder and some of the
next steps is really evaluating the
tools that are out there for uh Genbi
like data bricks genie for example and
we're going to go into a much more
rigorous process of enriching our
catalog with metadata and documentation
and that's also going to come out of a
lot of the learnings that we got from uh
the research that we've done. So even if
we don't end up writing a GenBI agent
full-fledged end to end, we already got
a lot of value back from this and this
is really what allowed our senior
leadership team to continuously invest
in this project quarter over quarter.
One thing that I want to wrap up with is
just a couple of thoughts I had about
the future. So um I think we talk a lot
about how to prepare data. I think
that's going to be a huge area in the
market and they're going to be probably
a lot of companies and tools that are
going to help us with that. Uh building
very specific task specific models and
applications. I think a lot of startups
and companies are going to come up from
that area. Uh co-pilots is really making
sure that we meet the users where they
are. Uh and securing of models obviously
a very big thing. The last thing is the
one I the the one I want to focus on the
most because that's kind of a recent
thought that came to me a couple of
weeks ago. How we do pricing of SAS in
the Gen AI era. Uh this is really about
the fact that one individual person
today can be 10x more effective uh than
they used to be in the past. And then do
we price uh software based on seats or
do we price software based on how much
they used it or do we price software
based on the value that they got out of
it? Uh Salesforce is already
experimenting with that. So that the
data cloud product at Salesforce is
starting to be uh usage priced and not
seats priced. And I think this is going
to have a big impact on just the uh kind
of SAS economics worldwide.
uh and it it doesn't even matter if the
product itself is genai. It's really
about what does the person using the
product can do and what can they do in
their other time uh and whether it still
makes sense to price it by how many
employees you have or how much work you
get done with the employees that you
have.
That is me and thank you very much for
listening and thanks for not opening the
door on me.
[applause]
Our
next presenter [music] is the head of
technology infrastructure engineering at
Bloomberg. He's here to tell us what
they learned deploying AI within
Bloomberg's engineering organization.
Please join me in welcoming to the stage
Le Jang.
[music]
All right. I don't have a joke about the
dot. I don't have a joke about the uh
hot dog either. So I will just jump to
the topic right away. Um so my name is
Lei. Um I lead the uh department of
technology infrastructure in Bloomberg.
So we're basically a group of
technologists focus on global
infrastructure think data centers
connectivities
um developer productivities uh think SCS
tooling and also uh reliability
solutions think telemetry and instant
responses right so um depends on
audience sometimes uh you know you're
familiar with what Bloomer is sometimes
you don't so I thought it might be a
good idea to talk a little bit about our
Um so there's no better way to talk
about our company by sharing some
numbers. I want to highlight a few
numbers. We have more than 9,000
engineers and most of them are software
engineers. Uh we handle a lot of market
ticks uh which in the billions and 600
billions I believe. And um we also have
tons of folks uh focus on AI research
and engineering. So we have a more than
you know really today's 500 plus
employees focused on AI products uh for
um sort of our customers.
So takeaway here is we are I guess you
know building a lot of software and use
a lot of data to empower our flagship
product which is called the bloomer
terminal and to really support our users
to make the most important f decisions
for them to do their job. um the best
um in the technical lens um a lot of
time kind of to explain that we actually
have one of the largest private network
uh in the whole world. We also have one
of the largest JavaScript codebase um in
the world. Um we because the domain
we're in uh so from terminal is really
you can think of a um software that
supports thousands of different
applications. Uh we call them functions
right? Um email is a function. Uh news
is a group of functions. Um let's say
fixed income price to yield calculation
to spread calculation is another
function.
um trading workflows is another group of
functions. So there's many many many
different type of functions as you can
imagine we kind of have to utilize
different technologies to really support
those uh functionalities.
Uh we also been
increasingly more than used but also
contribute to open source communities.
um for this audience I guess I want to
call out you know we kind of helped
creation of the case serve envoy AI
gateways and among many many other
things that that we deploy in-house and
support the communities again in summary
there's a lot of software there's a lot
of data uh we kind of have to um figure
out how to make the best of AI tooling
to support us to do our engineering work
all right so get to what is AI for
coding Um we started about two years ago
maybe a little bit more than that. Um
and as I guess the rest of the world we
look at the toolings provided and you
know I apologize if if your logos are
not here. Um but as you can imagine it's
kind of like overwhelming right there's
so many things and every day there's
news about this is great this is great.
Um so at the time we actually didn't
know what all the AI solutions can help
us to uh boost our productivities as
well as stability. But one thing we knew
at the time is um unless we deploy and
try we wouldn't know what's the best way
to benefit from all the awesome work and
and you know a lot of folks are
contributing to. So at the time uh we
quickly form a team people start
kind of like release um kind a set of
capabilities so that people start
iterating on um utilizing the toolings
and then of course you know we are data
company so kind of want to get a sense
of how we measure the impact and um what
we can do from the capability we provide
right so we look at the typical
developer productivity measurements
We ran a few survey. Uh it was very
obvious that people felt like there's
much quicker uh proof of concept, people
rolled out tests. U there's a lot of one
time use scripts being generated and
then the measurements dropped actually
pretty quickly when you
go beyond all the green field type of
thing, right? And then then we start
thinking like okay so what are the
things that we should really be doing
using all those wonderful things so that
we can really make a dent um in the in
the space and then at this time we also
kind of like also be thoughtful of um
unleash a very powerful tooling right uh
the the benefits is it's very fast the
challenge is also it's very fast, right?
Um, for any of you who actually dealt
with hundreds of millions of lines code,
you probably understand the system
complexity is a at least
um exponential or at least polomial as
function of your line of code or
software assets, right? So, at some
point you kind of want to be very
careful uh what you do with your
software assets. And what we thought so
maybe we should look at some of the
basics. One idea we had is
um all right so AI for coding there's
narrow definition of what coding is but
there's also a broader definition of
what software engineering right and then
maybe we can also look into some of the
work our developers don't really prefer
to do for instance
um some maintenance work some of the
migration work some of the I don't know
maintenance work and stuff like that so
I want to give some examples of the
things that we been trying and we think
there's pretty good return on
investment.
So the question we ask ourselves is how
do we evolve our codebase right the
first one is all right wouldn't it be
cool uh the day you get a ticket say hey
you know this piece of software needs
patched and at the same time you have a
pull request with the fix with a patch
and also with thinking why the patch
happened that way right so it's kind of
like we're trying to uh broadly deploy
something called uplift agents
um broadly scan through our codebase and
figure out what the patch would be
applicable and be able to apply those
patch step back a little bit. We did
have a reg based refraction tool. Um it
works to some extent but it's limited
right now with um LMS and other tooling.
So we are able to uh see very much
better results from the um uplift
agents. So there are a few challenges in
case you also plan to deploy such
capabilities. The first one is
I guess any AI or ML it would be really
nice if there's some deterministic
verification capability. uh oftentimes
it's not so easy especially if you have
test cases you don't have good llinter
if you don't have good verification the
the patch can sometimes be uh uh
difficult to to to be applied
and uh one thing we also realized when
we deploy AI tooling is the average open
pull requests increased and time to
merge also increased uh because you're
spinning a lot of new code and then
still we have to review the code and
merge the code right so time to merge
merge become a challenge sometimes and
the last one is um I think it applies to
any gen is the shift becomes what do we
want to achieve rather than how we want
to achieve right so
the second example that I I want to
share is uh the other area that people
kind of like sometimes
really impact our productivity in a
negative way or impact our stability in
negative way is how we handle instance
so we're trying to develop and then
deploy deploy um in response agents. Um
now
the importance of this is if you really
think about DNI tools, it's really
really fast and it's also unbiased,
right? Instance, it can go through your
codebase really quickly. It can go
through your telemetry system very
quickly. It can go through your feature
flags very quickly. It can go through
your um I don't know core trays very
quickly. and in I unbiased lens when we
do troubleshooting sometimes we have
this biased views okay it must be this
it turns out to be not the case so
there's many many interesting benefits
um by uh deploying agents from this
perspective
and then the second question is become
interesting is imagine you have
organization of 10,000 pe um let's say
9,000 people as I described a lot of
people trying to fix those problems
And you can have 10 teams who wants to
build a pull request review bots. You
have 20 teams who wants to build a
instant response agents. Right? They
become very quickly chaotic and
sometimes can have duplications.
So before I talk about the p pass, I'm
going to give example of the uh instance
response agent. So basically this is
what you know a instance response agent
will look like. Um the key part is we're
going to need to build a lot of MCP
servers to connect to the uh the metrics
and logs dashboards you have connect to
the topology you have whether it's
network topology or it's the um your
service dependency topology uh your
alarms your triggers right your SLOs's
and then we kind of don't want people
just start building MCP servers uh
without a pay pass so we created a pay
pass in partnership with our AI my
organization and I will talk a little
bit what that means.
Before that
um I do want to explain a little bit
some of the platform principles.
Some company allow teams to be have a
lot of freedom as at the same time
responsibility in the sense a business
unit can build whatever infrastructure
whatever platform. um some organization
have a very very strong tight
abstraction of the service
infrastructure and typically kind of
have to use their platforms right so
Bloomberg is kind of in the middle if
you look at the golden ones we kind of
believe in provide a golden path
um with enablement teams so my team is
really a en enabling team and one of the
guiding principle for us is we want to
make easy sense is extremely easy to do.
Uh sorry, the right thing is extremely
easy to do and we want to make sure the
wrong thing is ridiculously hard to do.
So that's the guiding principle here.
Now move on. So what is the pay path
here? So the pay path is
uh we have a gateway so that teams can
easily figure out which model works the
best. They can do quick experiments.
they can um we can have visibility of
what kind of models being used and we
can also guide through the teams which
model should is a better fit for the for
the problem they want to solve. uh we
have a two discovery uh basically MCP
directory via hub so that let's say team
A wants to do something they will go to
the hub okay someone is building MCP
server already maybe I should partner
with them to build it together right
uh tool creation and deployment is via a
pass it's basically a um you know a
standard platform service where you can
do your SDLC and and we provide runtime
environment for you as well taking care
of all and side of things as well so it
really reduce friction of for for teams
to to deploy um their MT MCP servers.
And then this is kind of interesting is
we want to make demo very easy so that
or I really say proof concept very easy
so that people can try have ideal
generation uh because we believe in
creativity come from some freedom of try
different new things but we also want to
make sure the production requires some
quality control. um
because at the end of the day stability
and system reliability is at the core of
our business. So this is sort of the
path that we deployed um and enabled the
rest of engineering really the 9,000
software engineers to do their job.
Okay.
And um with all this and then we start
maybe okay yes we got p uh path we have
some good ideas of how to evolve our
codebase.
help out our people right um now this is
where I find that
any new things any adoption of new
things provide opportunity to leverage
the strengths you have and also identify
the some of the weakness that you may
have so um in Bloomberg we have a
wellestablished training program uh it's
more than 20 years so there's on
boarding training depends on entry level
it depends on senior level um so we have
this whole training program to prepare
folks to before they join a team and
what we did is we just incorporate AI
coding in on boarding training program
and also show them how to best utilize
them with our principles and our
technologies right there's a huge
benefits here because um if any of you
run into the challenge of adoption
somehow run into a chasm right the rest
of is not uh adopt as quick as possible
whenever we have folks join a company
they learn how to do things in new way
when they go back to their team, they
were like, "Hey, why don't we do that?"
Right? They're going to challenge the
some of the senior folks as well to say,
"Hey, there's a new way to do this type
of things. Why don't we do that?" So, we
actually find this program extremely
effective uh to be a change agent for
anything we want to push out.
And then bunch of results, there's a lot
more familiarity and comfort with the
tooling. Um and also the important part
is there's lot more nuance insights of
where it's at value, right?
The second one is um often times we run
organization to push uh new initiatives.
So within Bloomer we have something
called um a champ program and a guild
program. That's basically a cross
organization or tech communities where
people have similar interest and similar
passion. They get together and get stuff
done. So um we had this for more than 10
years now. uh we sort of bootstrapped
engineer AI productivity community two
years back leveraged the community we
have already and then have some few
results uh because we have this pretty
much everyone passionate about this and
will be in that community so
organically it dduplicate efforts and
there's shared learning uh shared
learning happening
and it also helps to boost inner source
contributions and the vis engineer idea
right often times Team A wants to do
something, team B, let's say a platform
team have different prioritization and
the way we solve this is via inner
source or via visit engineer. We just
move someone over the team work for six
months a year get it done and then we
can move on. Um the last one is
interesting. So our data shows
individual contributors have a much
better stronger adoption than our
leadership team. Now if you think about
this a lot of software TLS and managers
in the age of AI they kind of don't
really have
um enough experience to truly guide
their teams to build software right so
often times the stuff that they learned
before might not be exactly applicable
it's still very valuable but there's
some missing piece there to make sure
they can continue to guide the team to
do the right thing. So, we're rolling
out the leadership workshops to make
sure our leaders are equipped with
whatever knowledge they need to have to
drive the techn um innovation.
So, um I'm going to close my part and to
share with you what uh the part I'm I
feel most excited about. The part I feel
most excite most excited about is that
with a lot of um creativity and
innovation in the geni space, it
actually changes the cost function of
software engineering.
Meaning
the trade-off decision of whether we do
something versus we don't do something
actually changed because some of the
work become a lot cheaper to do and some
work become a lot more expensive to do.
I tend to think it is a great
opportunity for engineers and
engineering leaders to get back to some
of the uh basic principles and sort of
ask a soul searching question. What is a
high quality soft engineering and how
can we use a tool for that purpose? So
that's it. Thank you very much.
[applause]
[music]
Our
next speaker helped to reimagine a
beloved browser from ARC toad by
rebuilding it around AI native
experiences.
Please welcome to the stage head of AI
engineering at the browser company Samir
Motti. [music]
[applause]
Hey everyone. Oh wow. How's it going?
Um, my name is Samir and I'm the head of
AI engineering at the browser company of
New York. And today I'm going to talk a
little bit about how we transitioned
from building ARC to DIA and the lessons
we learned in building an AI browser.
But first, a little about the browser
company.
So we started with a mission to rethink
how people use the internet. At its
core, we believe that the browser is one
of the most important pieces of software
in your life and it wasn't getting the
attention it deserved.
Simply put, the way we've used a browser
has changed over the last couple
decades, but the browser itself hadn't.
And think about this, we we started this
company in 2019. Um, and so this is a
screen cap of Josh, our CEO, sharing a
little bit about our idea on the
internet a few years ago, which we
endearingly called the internet
computer. So, our mission has been to
build a browser that reflects how people
use the internet today and how we think
the browser should be used tomorrow.
So through years of discovery, trial and
error, and some ups and some downs, we
shipped our first browser, Arc, in 2022.
It was a browser we felt was an
improvement over the browsers of that
time. It made the internet more
personal, more organized, and to us, a
little more delightful with a little
more craft.
And it was a browser that was loved by
many. It still is by millions. many of
whom are probably in this audience
today. I've gotten a lot of questions
about ARC today. Um, and it's great, but
um, if we took a step back, we felt that
ARC was still just an incremental
improvement over the browsers of that
time. And it didn't really hit the
vision that we set out to create. And
so, uh, we kept building. And then in
2022, we got access to LLMs like the GPT
models. And so we started like we always
do with prototyping. We started trying
new ideas um and eventually shipped a
few of them in arc. But what started as
a you know a basic exploration turned
into a fully formed thesis. In the
beginning of 2024, uh, our company put
out what we called act two, a video on
YouTube, where we shared that thesis
that we believe that AI is going to
transform how people use the internet
and in turn fundamentally change the
browser itself. And so with that, we
started building again, but this time we
built a new browser with AI speed and
security in mind and from the ground up.
And later or sorry, earlier this year,
we shipped DIA, our AI native browser.
It allows you to have an assistant
alongside you in all the work you do in
the browser. It gets to know you,
personalizes, helps you get work done
with your tabs, and effectively get more
work done through the apps you use. And
while it hasn't achieved our vision yet,
we fully believe it's well on the way,
too.
So it is not easy to build a product.
You all know that. Let alone two. The
latter of which an AI native one. We've
had a lot of years of iteration, trial
and error. And through that we've
learned a lot. And I'm going to just
talk about a few of those things uh here
today.
The first I want to talk about is
optimizing your tools and process for
faster iteration. From the beginning,
browser company has believed that we're
not going to win unless we build the
tools, the process, the platform, and
the mindset to iterate, build, ship, and
learn faster than everyone else. And
that of course holds true today. But the
form it takes with AI and an AI native
product has changed.
So even as a small company, where are we
investing in tooling these days? First
is prototyping for AI product features.
Second is building and running evals.
Third is collecting data for training
and for evals. And uh last but
definitely not least automation for hill
climbing.
So let's start with tools. Initially uh
as we always do we built some tools. The
first was a very rudimentary uh prompt
editor and it was only in dev builds.
What did what did this mean for us? Well
it meant a few things. One limited
access as only engineers were able to
access this. two slow iteration speeds
and three none of your personal context
and as you all know with an AI product
the context is what matters and what's
gives you the feel for whether a product
is good or not. So we evolved and since
then we built all of our tools into our
product. the product that we as a
company internally use every day and
that includes the prompts, the tools,
the context, the models, every
parameter. Um, which has not only
allowed us to 10x our speed of ideating,
iterating and refining our products, but
it has also widened the number of people
who can access and iterate on our
products themselves from our CEO to our
newest hire can ideate and create a new
product in DIA and also refine an
existing one all with their full
context.
And this holds true with all of our
major product protocols. We have tools
for optimizing our memory knowledge
graph which all of us use and we have
tools for creating iterating on our
computer use mechanism. We actually
tried tens of different types of
computer use strategies before landing
on one before even building it into the
product itself.
And I'll say and I'll end this part with
uh it actually is a lot of fun. People
don't talk about that a lot but uh
actually building these tools into our
product has enabled so much creativity.
It has enabled our PMs, our designers,
uh customer service and strategy and ops
to try out new ideas that are tailored
to their use cases. And that ultimately
is what we're trying to do.
The next thing I want to talk about is
how we evolve and optimize our prompts
through a mechanism called Jeepa. This
for us is very nent but an important
learning nevertheless.
How we hill climb and refine our AI
products is just as important as
ideating them in the first place. So
we're investing in mechanisms to help
with this to enable faster hill climbing
and one of those being Jeepa and this is
based on a paper from earlier this year
from a few smart folks.
So the key motivation here is simple.
It's a sample efficient way to improve a
complex LLM system without having to
leverage RL or other fine-tuning
techniques. And for us as a small
company that's hugely critical.
And how it works is you're able to seed
the system with a set of prompts, then
execute it across a set of tasks and
score them. Then leverage a mechanism
called PA selection to select the best
ones. and then leverage an LLM on top of
that to reflect on what went well and
what didn't and then generate new
prompts and then repeat with the key
innovations here being on around that
reflective prompt mutation technique,
the selection process which allows you
to explore more of the space of
prompting rather than one avenue and the
ability to tune text and not weights.
And here's a modest uh example of this
at work for us. you know, you can
provide it a very simple uh a simple
simple prompt and run it through Jeppa
and it's able to optimize it uh along
the metrics and scoring mechanisms that
we uh created to refine that prompt.
And so if I take a step back and talk
about kind of how we build uh for
certain types of features, I would buck
it into a couple different phases. The
first is that prototyping and ideiation
phase where we have widened the breadth
of number of ideas at the top of the
funnel um and lowered the threshold on
who can build them and how. And so we
try out a bunch of ideas every week
every day from all types of people and
we dog food those and if we feel like
there's actually real utility there.
It's solving a real problem for us and
there is a path towards actually hitting
the quality threshold that we believe we
need to hit. Then we'll move on to this
next phase where we collect and refine
eval to clarify product requirements and
then hill climb through code through
prompting and automated techniques like
Jeepa and then dog food as we always do
internally and then chip.
And I do want to kind of double down on
these phases. The ideation phase is
extremely important just as much as that
refinement phase.
And our goal is to enable faster
ideation and a more efficient path to
shipping because with all these AI
advancements every week, new
possibilities are unlocked in DIA. And
it's up to us as a browser, as a product
to get as many at bats with these new
ideas and try out as many of them and
explore as many of them as possible. At
the same time though not underestimating
the path it takes to ship some of these
ideas to productions as a high quality
experience.
Next uh I want to talk about treating
model behavior as a craft and
discipline.
So what is model behavior to us? It's
the function that defines evaluates and
ships the desired behavior models. It's
turning principles into product
requirements, prompts, and evals, and
ultimately shaping the behavior and the
personality of our LLM products, and
ultimately for us, our DIA assistant.
So, I'd buck it into a few different
areas. First, it's that behavior design,
defining the product experience we
actually want, the style, the tone, the
shape of responses in some cases. Then,
it's collecting that data for
measurement and training, clarifying
those product requirements through eval.
And last but not least, it's the model
steering. It's the building of the
product itself. It's the prompting, it's
the model selection, it's defining the
what's in the context window, the
parameters, etc. Um, and so much more.
And to us, that that process is
iterative, very iterative. We build,
refine, we create evals, and then we
ship, and then we collect more feedback
and feed that into our iterative
building process. That could be internal
feedback, and that could be also uh
external feedback.
And so I move on for a second. One
analogy we've thought about uh is for
model behavior is that to product design
through the evolution of the internet.
At first websites were functional. They
got the job done. But over time that
evolved as we tried to achieve more on
the internet and technology advanced. Uh
product design and the craft of the
internet itself grew as well as well as
the complexity.
And so what might that be for model
behavior? Well, at first it was
functional. We had prompts. We had
evals. We had instructions in and output
out. Now we frame it through agent
behaviors. It's goal- directed
reasoning, the shaping of autonomous
tasks, selfcorrection in learning, and
even shaping the personality of the LM
models themselves.
And so, what might the future hold? I'm
excited to see. But what we believe is
that we are in the early days of
building AI products and model behavior
will continue to evolve and into a
specialized and prevalent function of
its own even at product companies.
And the last thing I'll leave you with
here is that the best people for it
might just surprise you. One of my
favorite stories about building DIA
these last couple years has been uh the
formation of actually this model
behavior team. As I mentioned earlier,
uh engineers were writing the prompts at
first and then we built these prompt
tools to enable more people at the
company to actually prompt and iterate.
And there was a person on our team on
the strategy and ops team and he
actually leveraged these prompt tools
one weekend to rewrite all our prompts.
And he came in on a Monday morning and
dropped a Loom video sharing what he
did, how he did it, and why and a set of
prompts. And those prompts alone
unlocked a new level of capability and
quality and experience in our product.
And consequentially uh it was the
formation of our model behavior team.
And so one thing I'd emphasize to you
all is to think about who are those
people at the company agnostic of their
role who can help shape your product and
help shape and steer the model itself.
It might not be an engineer or it might
be it could also be someone on the
strategy and ops team.
Next, I want to talk about AI security
as an emergent property of product
building. And today, I'm going to focus
specifically on prompt injections.
So, what is a prompt injection? Well,
it's a prompt attack in which a third
party can override the instructions of
an LLM to cause harm. That might be data
exfiltration, the execution of malicious
commands, or ignoring safety rules.
And so here's an example in which you
give uh the context of a website to an
LLM and instruct it to summarize it.
Little did you know that there was a
prompt injection hidden in that
website's uh HTML.
So instead of actually summarizing the
web page, the LM actually gets directed
to open a new website, extracting your
personal information and embedding it as
get parameters in the website's URL,
effectively excfiltrating that data.
So, as a browser, prompt injections are
extremely crucial for us to prevent.
They're critical to prevent
because browsers sit at the middle of
what we can call a lethal trifecta.
It has access to your private data. It
has exposure to untrusted content and it
has the ability to externally
communicate and for us that means
opening websites, sending emails,
scheduling events, etc.
So, how to prevent this? Well, there's
some technical strategies we can try.
First is wrapping that untrusted context
in tax. You can tell the LM, listen to
these instructions around these tags and
don't listen to the content around these
tags. But this is easily escapable and
quite trivy, an attacker could still uh
leverage a prompt injection on your
browser.
Well, another solution we could try is
separating that data and that
instructions. We can assign uh the
operating instructions to a system role
and we can assign a user role for the
content of the third party and even
layer on randomly generated tags to wrap
that user content to be extra sure that
the LM listens to the instructions and
not the content. And while this can
help, there are no guarantees and prompt
injections will still happen.
So what do we do? Well, it's on us to
design a product with that in mind. We
have to blend technology approaches and
user experience and design into a
cohesive story that actually builds them
from the ground up and solves it
together.
So, what that might what that excuse me
what might that be for a feature in DIA?
Well, let's take the autofill tool in
DIA. The autofill tool allows you to
leverage an LLM with context, memory,
and your details to fill forms on the
internet. It's extremely powerful, but
as you can imagine, it has some
vulnerabilities. A prompt injection here
could extract your data and put it on a
form, and once it's on that form, it's
out of your hands.
So, we try to build with that in mind.
In this case, before the form is written
to, we actually let the user read and
confirm that data in plain text. This
doesn't prevent a prompt injection, but
it gives the user control, awareness,
and trust in what is happening. And this
is a framing we carry throughout our
product and how we build every single
feature. So here are some examples.
Scheduling events in DIA, we have a
similar confirmation step. Writing
emails India, we also have a similar
confirmation step.
So I've talked about three different
things here today. First is optimizing
your tools and process for fast
iteration. Second, treating model
behavior as a craft and discipline. And
third, AI security as an emergent
property of building products.
But uh the last thing I want to leave
you with, when we started on this
journey to building DIA, we recognized a
technology shift and we sought to evolve
our product of Arc. We initially came at
it from a hey, how can we leverage AI to
make ARC better, make the browser
better. But what we quickly learned and
adapted to was that it wasn't just a
product evolution. It was a company one
and today I shared a glimpse of that.
How we build and how it's changed a team
we've literally created around this and
how we think about security for AI
products. But really it's so much more.
It goes beyond that. It's how we train
everyone here. It's how we hire. It's
how we communicate. It's how we
collaborate and so much more. And if
there's one thing I'll leave you all
with, if there's one thing we've learned
over the last couple years, it's that
when when you recognize that technology
shift, you have to embrace it. And you
have to embrace it with conviction.
Thank you.
[applause]
[music]
Our next speaker [music]
draws on over 20 years in enterprise
developer experience to ask what will
still matter when AI coding agents are
everywhere. Please welcome to the stage
executive distinguished engineer at
Capital 1, Max Canet Alexander.
[music]
[applause]
Hey, how's everybody doing? Still awake?
Okay, great. So
like the robot voice said, I have been
doing developer experience for a very
long time and I have never in my life
seen anything like the last 12 months.
The you know about every two to three
weeks software engineers been making
this face on the screen.
Okay. And if you work in developer
experience the problem is even worse.
You're like this guy on the screen every
few weeks. You're like, "Oh yeah, yeah,
yeah, yeah, yeah. Here's the new
hotness." And then somebody else comes
up and they're like, "Well, can I use
the the new new hotness?" And you know,
people have been doing that for years.
I've been working in developer
experience for a long time. Everybody
always shows up and they're like, "Oh,
can I use this tool that came out
yesterday?" And you're like, "No, of
course not." And now we're like, "Uh,
maybe yes." Right? And what this leads
to overall is the future is super hard
to predict right now. So
a I think a lot of people a lot of CTO's
a lot of people who work in developer
experience to people who care about
helping developers are asking themselves
this question
are all of my investments going to go to
waste? Like what could I invest in now
that if I look back at the end of 2026
I'll be like I sure am glad that I
invested in that for my developers. And
I think a lot of people have just
decided well I don't know. I guess it's
just coding agents and I guess they'll
fix every single thing about my entire
company by themselves which look they're
amazing they're transformative but it's
not the only thing that you need to
invest in as a software engineering
organization. So we can clarify this by
asking ourselves two questions. The
first one is how can we use our
understanding of the principles of
developer experience to know what's
going to be valuable no matter what
happens. Okay. And what do we need to do
to get the maximum possible value from
AI agents? Like what would we need to
fix at all levels outside of the agents
in order to make sure that the agents
and our developers can be as effective
as possible? And this isn't like a minor
question. These are the sorts of things
that could make or break you as a
software business going into the future.
So let's talk about what some of those
things are that I think are no regrets
investments that will help both our
human beings and our agents. So the in
general one of the framings that I think
about here is things that are inputs to
the agents. Things around the agents
that help them be more effective. And
one of the biggest one is the
development environment. What are the
tools that you use to build your code?
What package manager do you use? What
llinters do you run? Those sorts of
things. You want to use the industry
standard tools in the same way the
industry uses them and ideally in the
same way the outside world uses them
because that's what's in the training
set. And look, yes, you can write
instruction files and you can try your
best to try to fight the training set
and make it do something unnatural and
unholy with some crazy amalgamation that
or modification that you've made of
those developer tools. Like you might be
you invented your own package manager.
You probably should not do that. you
probably should undo that and try to go
back to the way the outside world does
software development because then you
are not fighting the training set. Um,
and also it means it means things like
you can't use obscure programming
languages anymore. Look, I'm a
programming language nerd. I love those
things. I do not use them anymore in my
day-to-day agentic software development
work. as an enthusiast, I do come
sometimes go and I code on, you know,
frontline uh software engineering
languages, but not in my like real work
anymore.
So, what people ask me sometimes, does
that mean like we're never going to ever
have any new tools again because we're
always going to be dependent on the
tools that the model already knows?
Probably not because like I said,
there's still going to be enthusiasts.
And also, but like I would like to make
a point. The thing that I'm talking
about has always been a real problem.
Like there's always some developer at
the company has always come up to you
and be like, "Can I use this technology
that came out last week and has never
been vetted in an enterprise to run my
like 100,000 queries per second service
that serves a billion users?" And I'm
like, "No, you can't do that now and you
can't do that yesterday. It's still the
same."
Uh, another one is in order to take
action today, agents need either a CLI
or an API to take that action. Yes,
there's computer use. Yes, you can make
them write playright and orchestrate a
browser. But why? Like if you could have
a CLI that the agent can just execute
natively in its normal format that it
understands the most natively, which is
text interaction, why why would you
choose to do something else, especially
in an area where accuracy matters
dramatically and where that accuracy
dramatically influences the
effectiveness of the agent?
One of the most important things that
you can invest in is validation. So any
kind of objective deterministic
validation that you give an agent will
increase its capabilities. So yes,
sometimes you can create this with the
agent. I'm going to talk about that in a
second. But it doesn't really matter how
you get it or where you get it from. You
just need to think about how do I have
high quality validation that produces
very clear error messages. This is the
same thing you always wanted by the way
in your tests and your llinters, right?
But it's even more important for the
agents because the agents cannot divine
what you mean by 500 internal error with
no other message, right? Like they need
a way to actually understand what the
problem was and what they should do
about it.
However, there is a problem here. So,
you know, you think, okay, I'll just get
the agent to do it. They'll write my
tests and then I'll be fine. But have
you ever asked an agent to write a test
on a completely untestable codebase?
They do kind of what it's like is
happening on the screen here. They will
write a test that says, "Hey boss, I
pushed the button and the button pushed
successfully. Test passed."
Um, like so there is a sort of a a
larger problem that a lot of enterprises
have in particular, which is there's a
lot of legacy code bases that either
were not designed with testing in mind
or were not designed with like high
quality testing in mind. like maybe they
just have like some very high level
endto-end tests and they don't have like
great unit tests that the agent can
actually run iteratively in a loop and
that will produce actionable and useful
errors.
So another thing that you can invest in
that will can be perennially valuable
both to humans and to agents is
structure of your systems and structure
of your code bases. Agents work better
on better structured code bases. And for
those of you who have never worked in a
large enterprise and seen very old
legacy codebases, you might not be
familiar with what I'm talking about.
But for those who have, you know that
there are code bases that no human being
could reason about in any kind of
successful way because the information
necessary to reason about that codebase
isn't in the codebase and the structure
of the codebase makes the codebase
impossible to reason about looking at
it. Yes, you the agents can do the same
thing human beings do in that case,
which is sort of go through an iterative
process of trying to run the thing and
see what breaks, but that decreases the
capability of the agent so much compared
to just it having the ability to just
look at the code and reason about it the
exact same way that human capability is
decreased. And of course, like I said,
that all has to lead up to being
testable. If the only thing I can do
with your codebase is push a button and
know if the button pushed successfully
and not see the explosion behind it,
like if if there's no way to get that
information out of the codebase from the
test, then the agent's not going to be
able to do that either unless it it goes
and refactors it or you go and refactor
it first.
And you know, there's a lot of talk
about documentation. There's always been
a lot of talk about documentation in the
field of developer experience, in the
field of improving things. And there's
people go back and forth about it.
Engineers hate writing documentation.
Uh, and the value of it is often debated
like what kind of documentation you want
or don't want, do or don't want. But
here's the thing. The agent, let's just
take this in the context of the agent.
The agent cannot read your mind. It did
not attend your verbal meeting that had
no transcript.
Okay? Now there are many companies in
the world that depend on that sort of
tribal knowledge to understand what the
requirements are for the system. Why the
code is being written? What is the
specification that we're we're writing
towards if things are not written down?
And that sounds like blatantly obvious
but like there are a lot of things that
are fundamentally written like if the
code is comprehensible like all the
other steps are in that we've gotten to
so far. you don't need to reexlain
what's in the code. So, there's actually
probably a whole class of documentation
that we may not need anymore or you can
just ask the agent like, "Hey, tell me
about the structure of this codebase
overall." and it'll just do it. But it
won't be able to ever know why you wrote
it unless that's written down somewhere
or things that happen outside of the
program. Like what is the shape of the
data that comes in from this URL
parameter as an example like if you have
already written the code there's a
validator and that does explain it but
if you haven't written the code yet it
doesn't know what comes in from the
outside world. So basically anything
that can't be in the code or isn't in
the code needs to somehow be written
somewhere that the agent can access.
Now, we've covered sort of a few
technical aspects of things that we need
to improve, but there's a point about
software development in general, and it
that's always been true. And one of and
that's you've heard this, we spend more
time reading code than writing it. The
difference today is that writing code
has become reading code. So even now
when we are writing code we spend more
time reading it than actually typing
things into the terminal.
And what that means is
every software engineer becomes a code
reviewer as basically their primary job.
In addition, as anybody who has worked
in a in a shop that has deeply adopted
Agentic coding, we generate far more PRs
than ever before, which has led to code
review itself, the like the big scale
code review being a bottleneck.
So one of the things that we need to do
is we need to figure out how to improve
code review velocity both for the big
code reviews that we like where we you
send a PR and somebody like you know
writes comments on it you go back and
forth and also just the iterative
process of working with the agent. How
do you speed up a person's ability to
look at code and know what to do with
it? So
the principles are pretty similar for
both of those, but the exact way you
implement them is a little bit
different. What you care about the most
is making each individual response fast.
You don't actually want to shorten the
whole timeline of code review generally
because code review is a quality
process. It's the same thing with agent
iteration. Like what you want with agent
iteration is you want to get to the
place where you've got the right result.
You don't want to like just be like,
"Well, I guess I've hit my five minute
time limit, so I'm going to check in
this garbage that doesn't work, right?
You you But what you do want is you want
the iterations to be fast." Not just the
agents iterations, but the human
response time to the agent to be fast.
And in order to do that, they have to
get very good at doing code reviews or
knowing what the next step is to do with
a lot of code. At the big code review
level, one thing that I see that I think
is sort of a social disease that has
infected a lot of companies is when
people want PR reviews, they just send a
Slack message to a team channel and say,
"Hey, could one of the 10 of you review
my PR?" And what and you know what that
means is one person does all those
reviews. That's what really happens.
There there's like when you look at the
code of review stats of teams like that
there's one person who has like 50 and
the other person have like three two
five seven because there's just one
person is like super responsive so but
what that means is if you start
generating dramatically more PRs that
one person cannot handle the load you
have to distribute it and really the
only way to distribute it is to assign
it to specific individuals have a system
that distributes it among those
individuals and then set SLOs's that
have some mechanism of enforcement and
another thing is like that GitHub, for
example, is not very good at today is
making it clear whose turn it is to take
action. Like, I left a bunch of comments
on your PR. Uh,
you now responded to one of my comments.
Should I come back again now? Oh, wait.
No, no, now you pushed a no change.
Should I come back now? Okay. No, no,
now you've responded to more comments.
What I rely on mostly is people telling
me in Slack, I'm ready for you to review
my PR again, which is a terrible and
inefficient system.
And another thing you got to think about
a lot is the quality of code reviews.
And I mean this once again both for the
individual developers doing it with the
agent and the people doing it in the
code review pipeline.
You
have to keep holding a high bar. I know
that people have other opinions about
this. And yes, depending on the timeline
that you expect your software to live,
you might not need as much software
design. Like look, it's software design
is not the goal of perfection. It's a
goal of good enough and better than you
had before, right? But sometimes good
enough for a very long lived system is a
much higher bar than people expect it to
be.
And if you don't have a process that is
capable of rejecting things that
shouldn't go in, you will very likely
actually see decreasing productivity
gains from your agentic coders over time
as the system becomes harder and harder
for both the agent and the human to work
with.
The problem is this. In many companies,
we have the people who are the best code
reviewers not doing any of their time
doing code review. They are spending all
their times in meetings doing highle
reviews doing strategy. And so we aren't
teaching junior engineers to be better
software engineers and to be better code
reviewers. So we have to have some
mechanism that allows the people who are
the best at this to do this through
apprenticeship. If somebody else has a
better way of doing this than doing code
reviews with people, I would love to
know because in the 20 plus years that
I've been doing this, I have never found
a way to teach people to be good code
reviewers other than doing good code
reviews with them.
Now, if you do if you don't do all the
things that I talked about, what is the
danger? The danger is you take a bad
codebase with a confusing environment.
You give it to an agent or a developer
working with that agent. The agent
produces
relative levels of nonsense
and the developer experiences more or
less frustration and depending on how
persistent they are at some point they
give up and they just send their PR off
for review. They're like, "I think it
works." Right? And then if you have
lowquality code reviews or code
reviewers who are overwhelmed, they go,
"I don't know. I don't know what to do
with this. I guess it's okay." And you
just have lots and lots and lots of bad
rubber stamp PRs that keep going in and
you get into a vicious cycle where what
I expect to occur and what my prediction
is is if you are in this cycle, uh, your
agent productivity will decrease
consistently through the year. On the
other hand, we live in an amazing time
where if we increase the ability of the
agents to help us be productive, then
they can actually help us be more
productive and we actually get into a
virtuous cycle instead where we actually
accelerate more and more and more and
more. And yes, some of these things
sound like very expensive fundamental
investments, but I think now is the time
to make them because now is one of the
times you're going to have the biggest
differentiation in your business in
terms of software engineering velocity
if you can do these things versus other
in industries or companies that can't
structurally do these things.
So to summarize, here's a few things.
Not literally everything in the world
you can do that's no regrets, but you
can standardize your development
environments. You can make CLIs or APIs
for anything that needs a CLI or API.
Those CLIs or APIs have to run at
development time. By the way, too,
another big thing that people miss is
sometimes they have things that only run
in CI. If you're CI takes 15, 20 minutes
and you know, agents are like way more
persistent and patient than a human
being is. So like, but they're also more
errorprone than human beings are. So
like they will run the thing and then
run your test and then run a thing and
then run your test and then run a thing
and then run your test and they'll do it
like five times in a row. If that takes
20 minutes, your developers productivity
is going to be shot to heck. Whereas, if
it takes 30 seconds, you're going to
have a they're going to have a much
better experience. You can improve
validation. You can refactor for both
testability and the ability to reason
about the codebase. You can make sure
all the external context and your
intentions, the why is written down. You
can make every response during code
review faster. And you can raise the bar
on code review quality. But if you look
at all of these things, there's one
lesson and one principle that we take
away from all these things that covers
even more things than this. And it's
basically that what's good for humans is
good for AI. And the great thing about
this, one second. The great thing about
this is that it means that when we
invest in this thing, we will help our
developers no matter what. Even if
sometimes we miss on helping the agent,
we are guaranteed to help the humans.
Thank you very much. [applause]
[music]
Ladies and gentlemen, please welcome
back to the stage, Alex Lieberman.
[music] Let's give it up again for Max.
[applause]
We have one more break now and then the
last block of sessions where we'll have
speakers talking about AI. uh
consultancies, uh paying engineers like
salespeople and how to make your company
AI native. So, be back here at 4 o'clock
or if you're watching the live stream,
be back online at four o'clock and we'll
see you then. Thanks everyone.
Heat. Heat.
[music]
>> [music]
>> Heat.
[music] Heat.
Heat. Heat.
[music]
Heat.
Heat. Heat.
[music]
Heat.
Heat.
Heat. Heat.
Heat.
[music]
Heat. [music]
Heat. Heat.
[music]
[music]
Heat. Heat.
Heat.
[music]
Heat.
[music]
Heat.
[music]
[music]
[music]
Heat.
Heat. [music]
[music]
Heat. [music]
[music]
>> [music]
[music]
[music]
>> Heat. Heat.
Heat. Heat.
>> [music]
[music]
>> Heat. Heat.
[music]
Heat. Heat.
[music]
[music]
[music]
>> [music]
>> Heat.
Heat. [music]
Heat. Heat.
>> [music]
>> Heat.
Heat.
Heat.
[music]
Heat.
>> [music]
[music]
>> Heat. Heat.
Heat. [music]
Heat.
>> [music]
[music]
>> Heat. Heat. Heat.
[music]
[music]
>> [music]
[music]
>> Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat.
Heat.
[music]
Heat.
[music]
Heat.
[music]
Oh,
[music]
[music]
[music]
heat, heat.
>> [music]
[music]
>> Heat up
[music]
[music] here. Heat.
[music]
Heat. [music]
Heat. Heat. [music]
[music]
[music]
Heat. Heat.
[music]
[music]
Heat.
[music]
[music]
Heat.
Heat.
Heat.
Heat. Heat. [music]
Heat.
[music]
Heat.
Heat
>> [music]
>> up here.
Heat. Heat. [music]
[music]
Heat.
[music]
[music]
Heat.
Heat.
Heat.
[music]
Heat.
Heat.
Heat up
>> [music]
[music]
>> here.
>> [music]
[music]
>> Heat up
[music]
here.
>> [music]
>> Heat. Heat.
Heat. Heat.
[music]
[music]
Heat. Heat.
>> [music]
[music]
>> Heat up here.
Heat. Heat.
>> [music]
>> Heat. Heat.
[music]
Heat.
[music]
Heat.
Heat.
[music]
Heat.
Heat up
>> [music]
>> here.
Heat.
Heat.
>> [music]
>> Heat. Heat. [music]
Heat. [music]
[music]
[music]
Heat.
Heat. Heat. Heat.
Heat
up here.
Heat [music]
[music]
up
here. [music]
[music]
>> [music]
>> Heat. Heat. [music]
>> [music]
[music]
>> Heat. Heat. [music]
Heat.
[music]
Heat.
>> [music]
>> Heat up
here.
Heat up
[music]
here.
Heat. [music]
[music]
Heat.
>> [music]
>> Heat. Heat.
[music]
[music]
>> [music]
[music]
>> Heat.
Heat.
[music]
>> [music]
>> Heat. Heat. [music]
Heat. Heat.
[music]
[music]
[music]
Heat. Heat.
[music]
[music]
[music]
>> [music]
[music]
[music]
>> Heat up here.
Heat. Heat.
[music]
[music]
Heat. Heat.
[music]
[music]
Heat
[music]
[music]
up here.
[music]
Heat. Heat.
[music]
[music]
>> [music]
>> Heat. Heat.
>> [music]
>> Heat Heat
>> [music]
>> up
[music] here.
[music]
>> [music]
>> Heat.
Heat.
[music]
>> [music]
>> Heat. Heat.
>> [music]
>> Heat. Heat.
Heat.
[music]
Heat.
[music] Heat. Heat.
[music]
Heat. Heat.
[music]
[music]
Heat. Heat.
>> [music]
>> Heat. Heat.
[music]
>> [music]
>> Heat. Heat.
Heat.
Heat.
Heat.
Heat. [music]
>> [music]
>> Heat. Heat.
>> [music]
[music]
>> Heat. Heat. [music]
>> [music]
>> Heat. Heat.
>> [music]
>> Heat. Heat.
How we doing? We are officially 7 hours
in. How's the energy level? 7 hours in.
Let's hear it. There we go. There we go.
So this is our last block of sessions
before you all get to enjoy the graphite
afterparty. More coming on that in a
few. And for this block, we're going to
cover a lot AI consulting in practice,
paying engineers like salespeople, as I
mentioned earlier, leadership in AI
assisted engineering, and how to build
an AI native company. You guys ready for
this?
>> Oh, come on. Let's go. So [applause]
with that, please join me in welcoming
our next speaker and one of last year's
MC's to talk about helping organizations
transform with AI. Let's hear it for
NLW.
[music]
All right, great to be back here, guys.
Uh for those of you who are here in
February, I had the privilege of MCing.
Uh and today I'm excited to talk uh
about something a little bit different.
So right now uh there's been the last
couple of months have been an
interesting time in AI. There's been a
sort of surge in the air uh the
narrative of an AI bubble. A lot of it
driven by dubious studies uh like the
MIT report. And so what I wanted to do
today is get into not so much the
practice of consulting and transforming
but what organizations are actually
finding value in right now. So for those
of you who don't know me, uh there's
kind of two context I bring to this
conversation. The first is as the host
of the AI daily brief which is a daily
uh news analysis podcast about AI. The
second is as the CEO of super
intelligent which is an AI planning
platform. So the different perspectives
are sort of very high level macro
thinking about the news that's happening
and then a much more kind of ground
level view where we're spending a ton of
time interviewing executives about
what's going on inside their
organizations. And what we're going to
talk about is sort of one kind of
briefly in the first part just the
status of enterprise adoption uh as it
currently stands. And two, um, and the
more interesting part is we've been live
with a study in in the market for about
a month now collecting self-reported
information about ROI around different
use cases. And this will be the first
time uh, this week was the first time I
did some analysis on it. And so I'm
going to share what people have uh, what
people have told us around the first
kind of 2500 or so use cases that
they've shared. Um, so it should be
pretty pretty interesting stuff. talking
about kind of enterprise AI adoption
first. I'll go through this pretty
quickly because it's um pretty
well-known stuff. Uh the short of it is
enterprises are adopting AI uh in in a
growing fashion. Um pretty much everyone
is using it at least a little bit. Uh
and increasingly they're using it a lot.
Uh this year I will need to tell none of
you that there was a major inflection
around um specifically adoption in the
uh coding and software engineering.
Right? You saw a huge huge uptick in
this. Um there's a lot that's
interesting about that from an
enterprise perspective because it wasn't
just with the software engineering
organizations. Other parts of the
organization are also now thinking about
how they can communicate with code,
build things with code. Uh but that's a
huge huge theme of this year [snorts]
coming into 2025. One of the big sort of
thoughts that many people had was that
this would be the year of agents inside
the enterprise, right? That big chunks
of work would get automated away. And on
the one hand, I think it's pretty clear
that we didn't see some sort of mass
shift towards automation uh at large
across different functions in the
organization. But when you dig into the
numbers, there has been actually pretty
significant uh shifts in the patterns of
of agent adoption. So this is from
KPMG's quarterly pulse survey. And it's
a measure of how many enterprises that
are a part of their survey, which is all
companies over a billion dollars in
revenue, have uh actual sort of full
production agents in deployment. So this
isn't pilots, this isn't experiments.
This is where they consider uh some
agent that's actually doing doing kind
of work in a in a full way. And it's
jumped from 11% in Q1 of this year to
42% in their most recent study for Q3.
So you actually are seeing pretty
meaningful uptake of of agents inside
the enterprise. In fact, I would argue
based on our conversations that people
have that it's moved more quickly
through the pilot or experimental phase
than people might have thought. Um, so
much so that you're actually seeing now
a big shift in the emphasis around kind
of the human side of agents and how
humans are going to interact with agents
and it's involving a shift in upskilling
and and uh and enablement work. Um,
you're seeing a decrease in the sort of
resistance to agents as people start to
actually dig in with them. You're seeing
more experiments like these sandboxes
where people can interact with agents.
So this is a big theme even if it wasn't
necessarily the dominant theme that some
thought it might be coming into this
year. At the same time, it is absolutely
the case that many many if not most
enterprises are broadly speaking stuck
inside sort of pilot and experimental
phases. There is a lot of challenge
around moving from some of those first
exciting experiments to something that's
more scaled. Um, so this is from
McKenzie state of AI study which came
out I think a couple weeks ago now and
you can see only 7% of the organizations
that they talked to claim or sort of see
themselves as as fully at scale with
with AI and agents and something like
62% are either still experimenting or
piloting.
Interestingly big organizations are on
in general a little bit ahead in terms
of uh the organizations that are scaling
as compared to small organizations. This
has been a a thing that we've noticed
kind of throughout the trajectory of uh
of AI um adoption over the last couple
of years that you would think that
perhaps smaller more nimble companies uh
would be more kind of quick to adopt
these things, but in fact it's often
been the opposite with the biggest
organizations making the biggest
efforts. You can also see from the chart
on the bottom that there's very sort of
jagged patterns of adoption, right?
you're starting to see uh from you know
last year if you looked there's very
similar kind of rates of experimentation
across lots of different departments
you're starting to see some pretty big
breakouts now uh with for example you
know IT operations kind of jumping out
ahead of other functions
I won't spend too much time on this sort
of high performer piece but I think the
thing to note because it comes back in
and in and some of the stuff that we
found with our ROI study is that you are
also starting to see a pretty
significant bifurcation between leaders
and laggers when it comes to AI I
adoption and one of the things that
tends to distinguish the companies that
are leading is that they are just doing
more of it and they are thinking more
comprehensively and systematically about
AI and agent adoption. So they are not
just sort of doing spot experiments.
They are thinking about their strategy
as a whole. They're doing multiple
things at once. And importantly, they're
not just thinking about sort of the very
kind of first tier time savings or
productivity types of use cases. They're
also thinking about how do we grow
revenue? How do we create new
capabilities? How do we create new
product lines?
Overall, it's very clear that despite
what is sort of, you know, the the the
concerns in the media, that spend is
going to do nothing but increase on
this. Um, so the bottom is the KPMG
pulse survey again, and this is a an
estimation of the amount of money that
these organizations intend to spend on
AI over the next 12 months. The
beginning of the year was 114, which by
the way was up from like 88 in Q4 of
last year. It's now up to in their last
study 130 million is what they expect to
spend uh in the in the year ahead which
obviously the the total magnitude
doesn't matter as much as the change. Um
you also see the green charts are from
Deote and you can see 90% plus of
organizations intend to increase their
spend uh on AI in the next 12 months and
as part of that I think that you're
going to see a much more determined
conversation around impact and ROI uh
which is a particularly thorny topic but
interestingly
there has been an increase in optimism
over the course of this year around the
realization of AI. So this is from a
different KPMG study, their annual CEO
survey, which interviews tons and tons
of CEOs. And if you look at the 2024
numbers, 63% of those pled thought that
it would take between 3 and 5 years to
realize ROI from their AI investments.
20% said 1 to three and 16% said more
than five. This year in that same
survey, the number that said 1 to 3
years had gone up to 67%. There were now
19% who said 6 months to one year. uh
and three to five years was down to just
12%. So huge huge kind of pull forward
of expectations of of ROI realization.
The challenge is that ROI is really
tough. So this is back to the poll
survey. 78% of those pled in that in
that survey said that they thought that
ROI is going to basically become a
bigger consideration in the year to
come. Uh but also 78% said that
traditional impact metrics and measures
were having a very hard time keeping up
with the with the new reality that we
were living in. And this is something
that I've heard constantly over and over
from CIOS and other people who are in
charge of these investments that the the
the ways that we have measured impact of
previous technologies and just previous
initiatives are kind of falling flat
with AI. And so that got us thinking
about the the the overall need that we
have to just have more information. I'm
not even talking about good systematic
information, just more information
around what ROI looks like, what impact
looks like, and you know, I've got this
great podcast audience. They're super
engaged. And so, we just decided, screw
it. We're going to ask them, we're just
going to ask them to report on what ROI
they're finding from their use cases.
So, this went up at the very end of
October. Uh like I said as of this
morning or when I looked last looked
we've had over a thousand submissions uh
a thousand individual organizations
rather submit something like 3500 use
cases and um this is uh some some of the
first observations that we had around um
kind of the first 2500.
So the impact categories the way that we
divided things was into sort of eight
broad categories of impact um which will
all I think be very intuitive to you
guys. time savings, increased output,
improvement in quality, new
capabilities, improved decision- making,
cost savings, increased revenue, and
risk reduction. So, basically, it was
trying to think of like kind of a a
broad simple heristic for uh for for
kind of dividing or subdividing the
different the different ways that people
are thinking about ROI. And TLDDR is
that people are finding uh ROI right
now. Um, now again, the caveats are that
this is a highly infranchised audience.
are listening to a daily AI podcast and
they are voluntarily sharing this. So, I
think that, you know, there's there's
some caveing there, but you have 44.3%
saying that they're seeing modest ROI
right now. And then you have another
37.6% seeing high ROI. For the purposes
of a lot of these stats, high ROI will
be significant plus transformational. Uh
only 5% or so are seeing negative ROI.
And keep in mind, negative ROI doesn't
mean that they think programs are
failing. It just means they haven't
they've spent more than they've gained
uh in terms of how their their
perception is. More [snorts] than that,
expectations are absolutely skyhigh. 67%
think over the next year they will see
uh increased and high growth in their
ROI. So we have really optimistic sense
from the ground view of where ROI is
going to be in AI. Um you even have the
teams that are currently experiencing
negative ROI. 53% say that they're going
to see high growth. So very very
optimistic. Um as [snorts] you might
imagine, time savings is the default.
It's the starting point for so many
organizations. It represents about 35%
of the use cases. After that, increasing
output, quality improvement, basically
all those things that you would imagine
around productivity are sort of like the
dominant categories when it comes to
these uh when it comes to these use
cases. When it comes to the specifics
around time savings, you see a real
cluster between 1 and 10 hours,
especially right around 5 hours. And I
think this is interesting to call out
because it's so obvious to all of us who
are inside building these things uh
whether you are a developer or an
entrepreneur or just someone sort of in
and around it how the the vast breadth
of opportunity that AI represents new
capabilities things unimagined yet. It's
hard to or it's easy to forget that if
you save 5 hours a week or 10 hours a
week you're talking about winning back 7
to 10 work weeks a year. Uh and that's
very very powerful. And when it comes to
a lot of these enterprises, that is a
very meaningful thing, even if it's not
what they're ultimately in it for.
Interestingly though, it's very clear
that the story, although it might be uh
have a concentration in time savings, is
about much more than time savings. So
this is the ROI distribution category uh
ROI distribution by organization size.
And this starts to get really
interesting where you can see that there
are some differences in where different
size organizations are focused. So for
example, the organization size between
200 and a,000 people has a higher
portion of their use cases concentrated
in increasing output. Now we haven't
taken the time yet to really figure out
exactly what this means or even
speculate on on what this means, but I
think it's interesting that this is a
category of organization that has often
reached a certain scale but is still
very much striving for more and so seems
to be focused more on use cases that
expand their capabilities.
Same thing with uh when you start to
divide things by role, you see a real
kind of variance where for example
seuitees and leaders uh are less focused
on those time savings use cases and more
focused on other things like increased
output and uh and new capabilities.
In general, we're finding that SE
leaders uh and just sort of seauite and
and leaders in general are even more
optimistic and excited and and seeing
transformational impact than people who
are in more junior positions. Now, some
of this might be sort of selection bias
in terms of um what types of use cases
you are focused on. If you are in that
seuite, you're thinking about things
that inherently if they work are more
transformational. Uh but it is notable
that 17% of uh of the use cases that
that people in those leadership
positions have submitted uh they say
have transformational impact and ROI
already. [snorts] Uh I'm going to skip
this because there's we don't have time
for too much. Um you're seeing uh
interestingly uh a concentration um
where the smallest organizations are
getting more of that transformational
benefit early. Um, one of the things
that I want to do following this study
is maybe do a sort of second round where
we dig into what this 1 to 50 person uh,
size really looks like. I actually think
that whereas there might be a lot of
similarity between a 1,000 and a 2,000
person organization. There could be a
wild difference between a threeperson,
you know, small company and a 40 person
company. And so I'd really like to dig
into that more. But you are definitely
seeing a a lot of impact in those sort
of more small nimble moving
organizations.
Uh as you might expect coding and uh and
software related or uh use cases have a
higher ROI than average and a lower
negative ROI than average. Um one really
interesting kind of you know pulling on
a specific category of use cases. risk
reduction is our lowest category in
terms of the percentage of use cases
that that that was their primary
benefit. So when you're filling out the
survey, which is by the way at ROI
survey.ai if you want to check it out,
uh you basically only get to pick a
primary ROI benefit. We didn't want it
to be super sort of um we wanted you to
pick and and hone in on the thing that
was uh seemed most important or most
significant. And so only 3.4% 4% have
risk reduction as their primary benefit
uh in terms of ROI categories, but it is
by far those use cases are by far the
most likely to have transformational
impact as as the as as their outcome.
It's at 25%. So a full quarter of those
uh have transformational ROI. And
interestingly, I was having this
conversation with a couple of my friends
who work in sort of back office and
compliance and risk functions, and this
has been their experience as well, where
there are a lot of uh a lot of the the
the challenges for those organizations
involve sheer volume and quantity uh in
ways that that AI can be really helpful
for.
We also are finding some interesting
patterns among organizations. And again,
this is where we get into some of the
limits of this just being a whoever
walks through the door of my listeners.
We have a pretty heavy concentration
among technology, as you might expect,
industries and among professional
services, but we still have fairly
decent sample sizes for some others. And
in both healthcare and manufacturing,
the use cases are meaningfully higher
impact on average uh than the average
across all organizations. Um, which I
think is uh it was kind of worthy of
further study.
Last sort of part of this as I wrap up,
you know, a lot of these use cases as
you saw have to do with that sort of
first tier that most enterprises are
going to be in. Uh, increasing the
amount of content that you output,
increasing the quality of that content,
just finding ways to win back, you know,
your 5 hours a week. Um but increasingly
there are automation and agentic use
cases and we are absolutely seeing that
where those are the the focus where
those use cases mention certain types of
automation or they mention agents they
wildly outperform in terms of the
self-reported ROI from them that's both
on automation and it's on agents and I
think that that's sort of a a trend
towards where we're headed with sort of
the next layer of more advanced use
cases.
The last thing that uh from this sort of
first first look of observations is
there is clearly benefits and this goes
back to to what we saw with that
Mackenzie study as well of thinking
about AI and agentic transformation in
systematic cross-organizational
cross-disciplinary types of terms. um
effectively pretty much uh directly the
more use cases that a person or an
organization submitted the the better
they tended to see uh ROI for. Now
there's lots of reasons for that but I
do think it speaks to that that core
idea that once you move beyond kind of
your single spot experiments there's a
lot of opportunity uh to to sort of grow
grow the impact of the organization. So,
like I said, that is the the first look.
Uh, it's kind of the first twothirds of
these uh of these use cases. We'll be
open for another week and then we'll
have the full study out at the beginning
of December. Um, I'm really excited, I
think, heading into next year to see how
we move from sort of generic
conversations about impact uh and our
gut senses about impact to a lot more
random experiments like this to figure
out where the impact really is and uh
and where we go next. So, look at that.
I'm going to end 27 seconds early and
really throw off the time, but
appreciate you guys all being here. Uh,
and again, if you want to check this
out, it's roicervey.ai.
[applause]
As AI [music] changes our business and
engineering landscape, do we need to
rethink how we incentivize and
compensate engineers? Here to provide us
with a case study for scaling output,
not overhead, is the co-founder and
managing partner at 10X, Arman Hezarki.
How's everybody feeling? It's been uh 7
and 1/2 hours. We doing what? Are we
doing okay?
>> Awesome. I'm Arman. Uh like the voice of
God apparently. That's what they're
called. Voice of God apparently. Uh so
my name's Armon. I'm one of the
co-founders and managing partners at a
company called 10X. Uh my co-founder is
Alex who's been uh kindly announcing
everybody all day. We do a lot of cool
work. We uh we help companies with their
AI transformation. We have incredible
clients all over the world. But I'm not
going to talk about any of that today.
I'm going to talk about something much
more niche. I'm going to talk about how
we pay engineers. And we pay engineers
like salespeople. Earlier I was just in
the green room with a bunch of
distinguished engineers that I've grown
to uh respect for my entire career. And
we were talking and I was telling them
that we pay engineers based on the story
points that they complete.
And we had a lot of people roll their
eyes and and laugh. And they asked,
"What do you mean?" And I said, "Clients
pay us for the number of story points
that we deliver and we pay engineers
based on the number of story points that
they complete."
And similar to the looks that I'm
getting from some of you, there was
skepticism.
And I know this sounds crazy, but it's
working. We've been able to hire
incredible engineers, many of whom have
started and exited uh companies before
this. We have been able to hire
worldclass machine learning and AI
researchers. We've hired rocket
scientists from NASA. We are shipping
code incredibly quickly and it's
maintainable and high quality code. Of
course, that is everyone's dream.
Everybody wants to hire great people.
Everyone wants to deliver really uh fast
code.
So, my goal here is not to convince you
all to adopt our model. My goal is to
show you what compensation looks like in
AI and hopefully provide a new
perspective on the fact that things
might change as we introduce this
technology. Before I jump in though, I
want to talk about uh how we got here.
So, I'm a software engineer by training.
I went to Carnegie Melon and then I
taught there in their school of computer
science. After that I went to Google and
I helped them scale their AI cloud and
mobile practices internationally before
starting a few ventureback startups. And
in my last startup I would work out of a
weiwork and I was sitting in this uh 33
Irving WiiWork. If any of you are from
New York, you you might have worked out
of that we work and they have these big
tables and there were 12 of 12 of us
kind of sitting around. No one's
talking. Everyone has their headphones
in. And I look to my left and I see I
see somebody with Visual Studio Code
open, right? I'm like, "Okay, I have a
fellow engineer to my left." And I see
that he was typing, but I didn't see a
chat window.
This person was typing into the code
editor. They were typing fo
like a caveman. This this poor person
was typing like with their little
chopstick fingers individual characters.
I I I couldn't believe it. On my
computer, I had 45 agents. Three were
ordering me lunch. Two were writing
code. One was doing research. Just
different worlds were happening on my
computer versus this person's computer.
And I felt bad. I thought maybe we
should do a GoFundMe or something. But I
I I tried to look deeply at what is
actually causing this difference. Why am
I using AI in the way that I am? And why
is this person not?
There are different ways that that
people try AI and there are different
reasons why people don't use it. We've
all heard people who have tried it and
have said it's not as good as me. We've
all heard people who have not tried it
because they don't want to. But
regardless, my belief is that this is an
incentive issue. For me, I was a founder
and I wanted to squeak out every bit of
incremental value and and efficiency
that I could. And so I would sit on
Twitter and LinkedIn and read blog posts
and try to understand what is the
cutting edge in software engineering and
what's going to give me the ability to
output more code higher quality faster.
And because of that, I was using all
these all these different agents, but
this person probably worked at a
startup, probably had a base salary with
an annual bonus and some equity. And
that was supposed to be the model that
incentivized people to be innovated, to
be innovative, and to work smarter and
faster and harder. But it wasn't
working. And so, in order to understand
how we got to where we are, I'm going to
do a brief uh history of compensation.
And this is by no means accurate. I'm
making a lot of things up here. It's all
illustrative. Okay. So, back in the day,
we had some cavemen who were writing
code. We were we're uh probably
inscribing C in a in a tablet somewhere
and we were paying people hourly, right?
This makes sense. I look at somebody
sitting in a chair and I'm going to pay
them some amount of dollars for some
amount of time. That makes sense for me
and it makes sense for the for the
engineer. But why is that broken? I
actually I want to hear from people. Why
is hourly broken?
>> It's slow output.
>> No upside.
>> There's no upside. there's no reason to
work faster, right? And in fact, there's
a disincentive to work faster. And so,
what if I I notice this as the buyer of
this technology and I say, "Okay, how
long is it going to take you? It's going
to take you five hours. Okay, so I'll
pay you 500 bucks, right? Hourly $100.
Multiply that by five. And then you as
the engineer, if you work faster, great.
You get to keep the $500. And if you
work slower, that's on you." As
engineers, we're really, really bad at
estimating how long things are going to
take. And so because of that, I'm not
going to say it's going to take five
hours. I'm going to say it's going to
take 15 hours, 20 hours, so that I have
no downside. And so again, as the buyer,
I don't want to pay you based on the
project.
So what if we hire people on salary and
give them a bonus, right? Well, we in
the startup community know what happens
in that when when this is the case.
People punch in at five, leave or nine,
leave at five. And so I'm Larry Paige. I
notice this and I see why am I working
so hard at Google? Why am I putting my
blood, sweat, and tears into this? It's
because I have some of the upside. I own
the company, right? And so when we exit
for for many, many dollars, I'm going to
see that. So what if I can share that
with my employees? And that's when
equity comes in. And and this has worked
this has worked for many many years to
incentivize employees. This is this is
the foundation of the startup community
that we all know and are a part of. It's
incredible.
But
the not every company is Google. In
fact, for every one Google, there are
many many failures. And software
engineers know this, right? For those
who want to take the risk, many will
just go to YC or or start their own
company. And for the ones who don't want
the risk, they're opting for cash over
equity. Many of us who've hired
engineers know that the cash is
non-negotiable. Equity. Yeah, sure. I'll
take some upside.
And so my contention is that this model
needs to be reinvented in the age of AI.
We need to directly incentivize people
to use these tools and to use them well
and to still maintain really high
quality standards of code. And so here's
how it works for us. So we basically
just to take a step back, we do two
types of work at 10X. One is road
mapping and one is execution. So
companies come to us and they say, "Hey,
we want AI." That's generally the
request. Sometimes it's more specific.
It's like, hey, I want my customer
service team to have 10% more uh output
using AI, right? But but generally they
come to us with a request. We do a bunch
of studying and learning and then we
output a road map and based on that road
map, they can take it and work on it on
their own or we can do it. For a lot of
things, we're taking off-the-shelf
tools, but a lot of what we do is custom
builds and that's where the story point
model comes in. So we will build a
roadmap for a lot of our clients but
once they see that then they're putting
in requests on their own as well and we
have two roles in the company that are
client facing one is the strategist and
the other is the AI engineer the
strategists are all are mostly technical
and so we've have we have former PMs we
have former engineers they are doing PM
type work consulting type work they're
the ones that are taking the product
requirements and distilling that down
with the client Then they hand that over
to the engineer and the engineer puts
together an architecture design
document. They spend a lot of time doing
that. In fact, that is where most of our
engineering time goes. Then they write
code and they start implementing that
that architecture design document
includes tickets and each ticket is
graded on some number of story points.
This is a very traditional method of
doing work, right?
And when that ticket is accepted, the
engineer gets paid a a fee per story
point that they complete. Our engineers
have a flat base that they're paid and
then every quarter we round up based on
the story points that they've completed.
And again, this has led to us being able
to hire incredible people, but we've
also been able to do incredible work.
So, I'm going to walk through a couple
of projects that we've done. So, this is
one. This is a billboard company.
If you go to Times Square right now,
you'll see some billboards that they've
sold that inventory for.
They sell in two ways. One is you can
call them up traditional sales, you can
buy that inventory, but the other is
they have like an Uber for billboards
type of product where you can go online,
you can upload a PNG, you can choose
where you want this to run and for how
long, similar to like a Facebook or
Google ad. It's very similar to that
experience. And they came to us and they
said, "Hey, we think that there's some
opportunities for AI in our product." We
did an analysis and we found a few. One
of them is this. We found that when an
image is uploaded to their system, it
has to go through two rounds of
moderation. One is internal to the
company and the other is with the
billboard owner. Internal to their
company, they're spending money on that
to actually hire the people to do that
and there's a lot of inaccuracy
and it takes a lot of time. So that
costs them money and it costs them
revenue because every moment that the
billboard is not running, they're not
making money. And so we found what if we
could build an AI model that can
actually do this moderation for them. We
scoped that out. We built the
architecture uh design doc. We broke it
down into tickets and we built this for
them. We did it in two weeks and we got
to 96% accuracy when compared to the
human moderator. We've done a lot of
other projects with this company as
well. This is another company. This is
they work with retailers all around the
world and currently they have devices in
these retailers and they're low power
devices and so because of this they're
able to run one AI model on device and
what this model does is does heat
mapping. So imagine there's a camera in
this room looks down and it can
basically generate a heat map of where
the traffic is throughout the day. And
for retailers, of course, this is very,
very useful. But there's other things
you can do too, right? If we just sit
here for a few minutes, we can probably
come up with a lot of ideas of if you
have a camera with a chip, you can make
a lot of money from that. You can show
really useful information. And so that's
what we did. We we came up with what are
some of the things that we could do with
this? if you put a little bit little bit
more power in that ship, if you make the
models, if you quantize them so they can
run in parallel, what could you do? And
so we gave them this report and then we
built them five models that can run in
parallel. It does everything from heat
mapping to Q detection to theft
detection and more. And again, we start
with the product requirement stock. We
break this down into architecture. Then
we build it and then we pay engineers
based on the output.
This is the big question. What are the
risks? Right? I just talked about
dandelions and rainbows, right? Uh so I
promised you that my goal is not to
convince you to do this. And part of
this is showing you what the potential
risks are. These are a few that come up.
One is what if an engineer inflates the
story points, right? What if an engineer
says, "Okay, you want me to add a
button? 45 story points." Right?
What if an engineer rushes and quality
drops? You're saying that it took two
weeks to do that. Well, was it good? Did
it work?
And what if engineers get sharp elbowed?
I started this by saying that we
compensate engineers like salespeople.
It's not a it's not a culture that we
necessarily want to emulate in software
engineering, right? So, how do we how do
we uh make sure that that's not
happening?
First of all, I mentioned that we have
two different roles and we compensate
like a counterbalance. So strategists
are compensated based on NR which really
is like customer happiness
and every single ticket has to be
approved internally with multiple rounds
of QA of which the strategist is
involved but also by the client and so
there's a counterbalance to every single
ticket that is delivered.
Uh I skipped to the second one. For the
first one inflating story points the
strategists are the ones who scope it.
And again we have to review all of that.
And for the third, how do you make sure
that all of this is correct? And how do
you make sure that there's no sharp
elbows? How do you make sure that
everybody is happy and the dandelions
and rainbows are continue throughout
this parade of joy? Well, you have to
hire the right people, and this is what
I tell everybody.
We make hiring incredibly difficult for
ourselves so that everything else is
easy. And that is a principle that we
all know and we all stand true to. And
this is incredibly important with AI. My
co-founder Alex always says AI makes
people look like one of those crazy
mirrors where any one of your attributes
it makes it 10 times larger. If you're a
great engineer, AI makes you great. If
you're not, it makes you sloppier. And
this is the case with all of these
things. You have to start with hiring.
Our belief is that AI gives people
superpowers and it makes all of us
smarter, faster, and better at what we
do. But my belief is that the current
way that we compensate people is
actually holding them back. And I would
invite you to think about how can you
compensate people on your team
differently, whether it's software
engineering or anything else. If you
want to unlock your employees potential,
feel free to reach out at armon 10x.co.
Thank you. [applause]
Our
next presenter [music] is deputy CTO at
DX, the engineering intelligence
platform designed by leading researchers
speaking about effective leadership in
AI enhanced organizations.
Please join me in welcoming to the stage
Justin Rio.
[music and applause]
Hello. Thanks for joining me in one of
the later day sessions. Looks like we we
we kept a lot of people here. This is a
nice full room. I'm great to see it.
We're going to go through a lot of
content in a short amount of time. So,
I'm going to get right into it. If you
want to get deeper into any of this
stuff, we have published this uh AI
strategy playbook for senior executives.
And uh a lot of the content that I'm
going to go through, I'm not going to
have time to get quite as deep, but this
is just a nice PDF copy that you can
come and refer to later. If you missed
this QR code, don't worry. I'll show it
again uh at the end. So, what is the
current impact of Genai?
Nobody knows, right? We've got Google on
the one hand telling us that everyone's
10% more productive. That's interesting.
Now, they're Google. They were already
pretty productive to begin with. But we
have this sort of now infamous meter MER
study which has some flaws in the way
that study was put together that showed
actually a 19% decrease in productivity
using codec assistance. So there's a lot
of volatility a lot of variability. Uh
what was really interesting about this
study even though I I mentioned there
were some flaws. Um but every engineer
that took part in this study felt more
productive but then the data actually
bore out that they were less productive.
Kind of interesting right? we've got
this induced flow uh that makes us feel
really good about what we're doing. So,
we need to address this. DORA has put
out some really good research on this,
too. But this is based on industry
averages. This is impact based on what
do we look at when we see a large sample
and an average of how certain factors
are being impacted by in this case 25%
increase in AI adoption. We see these
modest but positive leaning indicators.
7.5% increase in documentation quality
and uh increase in code quality by about
3.4%. At least that's not leaning in the
other direction, right? And when we
started digging through some of DX's
data, we have, you know, we're the
developer productivity measurement
company. We have lots of aggregate data
that we can look at with this. We found
the same thing. When we looked at
averages, we see about a 2.6% 6%
increase in overall uh change
confidence, which is a a percentage of
people who answered positively that they
feel confident in the changes that
they're putting into production. Uh
similar positive leaning average when we
looked at code maintainability, another
qualitative metric, a1% reduction in
change failure rate. Uh which when you
think about the industry benchmark being
4%, it's not insignificant.
But this is not the full story because
this is what we saw when we broke the
same studies down per company. Every
company here is a every every bar
represents a company, right? We have
some that are seeing 20% increases in
change confidence while others are
seeing 20% decreases. We're seeing
extreme volatility, which is why these
averages look so innocuous, but they're
belying the greater story of
variability. See the same thing with
code maintainability. The same thing
with change failure rate. So this is a
2% increase in change failure rate up
here at the top. Again, with an industry
benchmark of 4%, that means shipping as
much as 50% more defects than we were
shipping before, right? We want to make
sure we're on the lower end of this, but
how like what should we be doing? Well,
we found some patterns here. We see that
some organizations are seeing positive
impacts to KPIs, but others are
struggling with adoption and even seeing
some of these negative impacts. top-own
mandates are not working, right? Driving
towards, oh, we must have 100% adoption
of AI. Great, I will update my read my
file every morning and I will be
compliant, right? We're not actually
moving the needle anywhere when we do
that. We also find that lack of
education and enablement uh has a big
impact on sort of negatively impacting
this. Some organizations just turn on
the tech and expect it to just start
working and everybody to know the best
ways to use it. uh and a difficulty
measuring the impact or even knowing
what we should be measuring like what
metrics would should we be looking at
you know does utilization really tell us
much about the full story of genai
impact this is another graph from Dora
uh this is a basian uh posterior
distribution which is an interesting way
of representing data basically you want
your mass to be on the yellow side of
this line uh the the uh the right side
of this line for the audience yeah and
you want a sharp peak which is telling
you that we're pretty confident that
this initiative will have this impact.
And if we look at some of the topline
initiatives here, these are things like
clear AI policies. All right, we want to
make sure we have that. We want time to
learn, not just giving people materials,
but actually giving them space to
experiment, right? Um, and so these
types of factors are the ones that seem
to be moving the needle the most. So,
we're going to go over some quick tips
on how we can do all of these things.
And again, the guide will go deeper into
this. We want to integrate across the
SDLC. Right? For most organizations,
writing code has never been the
bottleneck, right? We can in uh we can
increase productivity a bit by helping
with code completion, but our our
biggest bottlenecks are elsewhere within
the STLC. There's a lot more to creating
software than just writing code. We want
to unblock usage. We can't just say,
well, we're worried about data
xfiltration, so we can't try this thing.
Like, no, get creative about it. We've
got really good infrastructure out there
now like Bedrock and Fireworks AI that
can let us run powerful models in safe
spaces.
We have to have open discussions about
these metrics. We need to evangelize the
wins and we need to let our engineers
know why we're gathering metrics and
data. What is it that we're trying to
improve? We have to reduce the fear of
AI, right? We have to make sure that
people understand that this is not a
technology that is ready to replace
engineers. This is a a technology that's
really good at augmenting engineers and
increasing the throughput of our
business. We have to establish better
compliance and trust. And we need to tie
this stuff to employee success. These
are new skill sets. AI is not coming for
your job, but somebody really good at AI
might take your job. And so, as leaders,
we have the opportunity to help our
employees become more successful with
this technology. So, how do we reduce
the fear? Well, first of all, why do we
need to do this? Well, there's a lot of
good reasons, but I love to point to
Google's project Aristotle. This was a
2012 study where Google wanted to figure
out what are the characteristics of
highly performant teams. uh they thought
that the recipe was just going to be
what Google had this combination of high
performers, experienced managers and
basically unlimited resources. And they
were dead wrong. Overwhelmingly the
biggest indicator of productivity was
psychological safety. Okay. And so that
very much applies now. We also have data
like this is Sweetbench. I'm sure a lot
of you have seen this and there are some
impressive benchmarks that the agents
can do like a third of the things
they're asked to do without any human
intervention. that means that they're
not able to do twothirds of them. Right?
Again, we are augmenting. We're not
replacing. We're not ready. We may never
be ready. So, we need to be very
transparent with what we're doing. We
need to set very clear intents. Why, you
know, are we uh using this to to
augment, not to replace. We need to be
proactive in the way that we communicate
that and not just wait for people to get
upset and possibly scared. We need to
say, "No, we are here to help you to
give you a better developer experience
and to increase the throughput of the
business." And again we have to have
these discussions about metrics. Now
what metrics? What should we be looking
at? Well, DX again developer experience
and productivity measurement company. Um
there are two sort of classes of metrics
that we can be looking at really two
levers that matter here and that's speed
and quality. Right? We want to increase
PR throughput. We want to increase our
velocity but not by just creating a
bunch of slop that's going to give us a
bunch of tech debt later that we're
going to have to deal with. And we've
just kicked the bottleneck down the road
if we do that, right? So we want to be
looking at things like change failure
rate, our overall perception of quality,
change confidence, maintainability.
And we have three types of metrics that
we can be looking at here. We have our
telemetry metrics. These are the things
coming out of the API. And they're good
for some stuff, but they're not always
accurate, right? We know like accept
versus suggest was kind of like all the
rage until we realize that engineers
need to click accept in the IDE in order
for the API to know about it. even if
they do click accept. Who's to say they
didn't just go back and rewrite every
line that was suggested, right? So
that's providing us some context, but we
also need to do some experience
sampling. We need to like for instance
add a new field to a PR form that says I
used AI to generate this PR or I enjoyed
using AI to generate this PR and get
some data that way. And then
self-reported data or survey data. We
are big on surveys, but let me
underscore we're big on effective
surveys. 90% plus participation rates
engineered against questions that treat
developer experience as a systems
problem not a people problem because
that's what it is W. Edwards Deming 90
to 95% of the productivity output of an
organization is determined by the system
and not the worker. Okay. So
foundational developer experience and
developer productivity metrics still
matter the most. Right? Our AI metrics
like utilization and things are telling
us what's happening with the tech, but
these core metrics that we've been able
to trust are telling us whether these
initiatives are actually working, right?
Are we actually moving the needle and
having the outcomes that we want to see?
So top companies are looking at
different things, right? We are seeing
like adoption metrics coming out of
Microsoft. They've also got this great
metric called a bad developer day. I'm
not going to go into it, but there's a
really good white paper that shows like
all the different telemetry that they
can look at to determine what makes a
bad developer day. Dropbox is looking at
similar stuff. Adoption like weekly
active users, daily active users, that
sort of thing, but also looking at
quality metrics like change failure
rate. And booking is looking at similar
stuff as well. And so we built a
framework around this. We were first to
market with what we call our DXI
measurement framework. And this is very
much inspired by things like Dora space
framework, DevX, just like our core for
metric set which you can ask me about
later. Uh and we take these metrics and
we uh normalize them into these three
dimensions of utilization, impact and
cost. And you can kind of think about
this as a maturity curve too. A lot of
people start just figuring out okay
what's happening? who's using the tech,
what's the percentage of pull requests
that we're getting that are AI assisted
maybe through experience sampling? How
many tasks are being assigned to agents?
But then we can mature that perspective
a little bit and we can correlate that
utilization to impact. What is this
actually doing to velocity? What is this
actually doing to quality? And this is
when we start getting more mature in our
picture of our impact. And then finally,
cost. Although I like to joke that we're
15 years past the last hype cycle which
was cloud and we still have new
companies spinning up that are teaching
us how to understand and optimize our
cloud costs. So we will see if we get
there. Although I also hear horror
stories about people burning through
2,000 tokens at $2,000 worth of tokens a
day. So we probably do need to hit that
as well. What about compliance and
trust? What can we do to ensure that the
output uh that that's being generated is
something that can be trusted by our
engineers? We have a lot of levers to
pull here, but one of the ones that I'd
like to talk about is setting up a
feedback loop for our system prompts. So
these could be called system prompts,
cursor rules, agent markdown. Pretty
much all of the mainstream solutions
have something like this where you can
go and provide a set of rules uh to
control how these models behave. Uh and
I won't get too much into the technical
details here. We have an example where
like the uh models have been providing
outdated Spring Boot uh stuff. We want
Spring Boot 3. It's It's been sending us
Spring Boot 2 stuff. The big takeaway
here is to have the feedback loop. Have
a gatekeeper, right? Have somebody or a
group in the organization that can
receive this feedback that understand
how to maintain and continuously improve
these system prompts, right? And that
way we're always maintaining the way
that these assistants or models or
agents affect the whole business. It
also pays to understand the way that uh
temperature works, especially when we're
building agents, right? we do have some
control over the determinism and
nondeterminism of these models. Uh again
like when a model is predicting a next
token, it doesn't just have like one
token. It has a matrix of tokens and
those are associated with a certain
probability of that being like the right
token. And so we have this setting
called temperature which is heat which
is entropy which is randomness that can
control the amount of randomness
involved in actually picking that token.
This is sometimes called increasing the
creativity of the model. And it's a
number between 0 and one. For those
reasons I just mentioned, don't use zero
or don't use one. Weird things will
happen. But you want some decimal in
between zero and one. When we have a
lower temperature, like we're seeing
here, 0.00001,
we give it the same task twice, and it
gives us the exact same output character
for character. When we set that
temperature higher, this is an example
of 0.9. I'm asking the agent to create a
gradient for me, uh, simple task. It's
giving me two relatively valid
solutions. I did ask it for a JavaScript
method and this is the only one that's
giving me a JavaScript method. But the
point is they are wildly different
approaches to the same problem when I've
increased the creativity of that model.
So we need to think about like use case-
wise where should we have more
creativity and where should we have more
determinism and temperature is another
setting that we have that can help
control this. You can experiment with
all this using like docker model runner
lama lm studio that sort of thing. How
can we tie this to better employee
success? We had to provide both
education and adequate time to learn. So
we put together a study where we sampled
a bunch of uh developers that were
saving at least an hour a day uh uh
excuse me an hour a week and we asked
them to stack rank their top five most
valuable use cases. And we built a guide
around that. a guide that effectively
goes through code examples, prompting
examples uh of what we determined using
the sort of data approach where we
should get more reflexive about our best
practice and about uh the use cases that
we're becoming reflexive in in our use
of AI. And so that's what this guide was
about. And uh we've had this become
required reading in certain engineering
groups and uh proud of that. And this is
another way that we can help educate,
but we need to give time. Uh we don't
have time to go through all of this. I
do think it's interesting that the
number one use case for this was stack
trace analysis, right? So, not a
generative use case, actually more of an
interpretive use case. And we see some
other ones here that are not too
surprising. And there's examples of each
of these. What about unblocking usage?
How can we make sure that we can
creatively ensure that engineers can
take the most advantage of this? Well,
leverage self-hosted and private models.
That's getting easier and easier to do.
Partner with compliance on day one,
right? Make sure that what you're doing
is in line with your organization's
compliance. You may find that you're
making a lot of assumptions about things
that you don't think you can do that you
can actually do, right? And then think
creatively around various barriers.
Finally, how can we integrate across the
SDLC? What should we think about doing
there? You know, and I'm a big Ellie
Gold theory of constraints fan. Probably
have some others in the audience. An
hour saved on something that isn't the
bottleneck is worthless. And when we
look at data across in this case almost
140,000 engineers, we find that there
are definitely good like annualized time
savings with AI that are being eclipsed
by sources of context switching and
interruption, meeting heavy days, these
other things that it's like, yeah, we
can save time here, but we're losing so
much more time over there. So find the
bottleneck, fix the bottleneck, right?
Morgan Stanley's been very public about
their uh building this thing called Dev
Gen AI that looks at a bunch of legacy
code Cobalt mainframe natural. I hate to
admit Pearl because I'm an old school
Pearl developer. Uh but apparently
that's legacy now too. And basically
creating specs uh for developers that
can just be handed to developers to
start modernizing the code without
having to do all that reverse
engineering. Right? And they're saving
about 300,000 hours annually right now
doing this. There's a Wall Street
Journal journal article about this,
Business Insider article about it. Uh
they're very public about that. Zapier,
Zapier should be the example for
everyone. They have a whole series of
bots and agents that are doing things
like assisting with onboarding. They can
now make engineers effective in two
weeks. Industry benchmark on the good
side is like a month. On the medium side
is like 90 days. And uh because they're
able to increase the effectiveness of
the engineers that they're h that
they've bringing into the organization,
they realized that they should be hiring
more, right? As opposed to trying to
maintain status quo by like cutting
headcount and trying to make individual
engineers more productive. They said,
"No, we could get more value out of a
single engineer. We should be hiring
faster than ever." And they are, and
it's really increasing their competitive
edge. I think that's the right attitude.
Spotify has been helping out their S sur
by pulling together context when
incidents uh are detected and then
taking things like run but steps and and
other areas of context and documentation
and pushing them directly into S sur
channels so that those critical minutes
of trying to get to the bottom of what's
actually happening and what we should do
do to resolve the incident uh they just
eliminated that time right it's
significantly increased their MTTR so
let's get creative about areas in the
STLC that are our actual bottlenecks All
right, next steps. Uh, distribute this
guide as a reference for integrating AI
into the development workflows that you
have. Uh, determine a method for
measuring and evaluating Genai impact.
It's really important to make sure that
we're not on the bad sides of those
graphs that I showed you earlier and
then track and measure AI adoption and
and see how that correlates to overall
impact metrics and iterate on best
practices and use cases. And here's a
guide. again. Thank you so much.
[applause]
Our closing presentation will teach us
how to build an AI native company, even
if that company is 50 years old. Please
join me in welcoming to the stage the
founder of Every, Dan [music] Shipper.
>> Hello.
>> [applause]
>> How's it going everybody? I'm the last
speaker of the day, so I'm just between
you and dinner or drinks. So, I'm going
to try to make this fun and hopefully a
little bit short.
So, first of all, I just want to say I'm
very glad to see everybody and I'm
actually kind of surprised to see so
many people here um because I've been I
live here, but I've been traveling. I
was in Portugal uh last week and I was
on Twitter and someone said that
everyone was moving to San Francisco.
Uh but it's great to have everybody here
instead cuz I love New York.
[laughter]
Come on. Come on.
>> [applause]
>> Um, so I'm supposed to talk uh today
about uh how to build an a playbook for
how to build an AI native company. And
um I actually don't have one
unfortunately. Um
and that's because I think the playbook
is actually being invented right now. So
we're doing it at the company that I run
every but all of you are doing it here
today as well. and and and so I don't
want to do this talk from the
perspective of I have all the answers
and I'm going to tell you the framework
and the playbook and all that kind of
stuff. Um but um I do think it is
helpful when we're in this beginning
stage of
uh learning how to use AI to do
engineering to build companies uh to
share like the the personal experiences
that we're having inside of our
companies um and uh and sort of
collaboratively figure out the playbook
together. So I think the best that I can
offer is really just sort of dispatches
from the future. Uh notes on what I've
figured out um and the work that we've
done inside of every um and I think the
the first big thing the first the first
big thing I really noticed is that there
is definitely a huge there's a 10x
difference between an org where 90% of
the engineers are using AI versus an org
where 100% of the engineers are using
AI. It's it's it's totally different.
Um, I think the I think the big thing is
if even 10% of your company is
uh is using a more traditional
engineering method, you you sort of have
to lean all the way back over into that
world. Um, and so it it prevents you
from doing some of the things that you
might do if everyone was uh not typing
into a code editor all the time. Um,
and I know this because this is what we
do at every um, which is the company
that I run. And it has totally
transformed what we are able to do as a
small company. Um, and so I think of us
as like a little bit of a lab for what's
possible that I I'm excited to share
with you. So for people who don't know,
I run every um, inside of every we have
six business units. We have four
software products. We run four software
products with just 15 people, which is
kind of crazy. Um, and these software
products are not toys. We've grown at
every we've grown MR by double digits
every month for the last 6 months. We
have over 7,000 paying subscribers and
over 100,000 free subscribers. Um, and
we've done this in a very capital-like
way. We've only raised about a million
dollars in total. Um and very
importantly for for this audience and
for this discussion um 99% of our code
is written by AI agents. Uh no one is
handwriting code. No one is writing code
at all. Um it's all done with cloud
code, codeex, Droid, what have you. Um
uh coding agent of your of your choice.
Um, and also really importantly for the
size of team we are, each one of our
apps is built by a single developer,
which is crazy. And these are not like
uh little apps. Uh, here here's an
example. This is Kora, which is a um AI
email management app. Um, it's sort of
an it's an it it's an assistant for your
email. It on on the left over here, it
summarizes all of your all of your
emails that come in. So, you can kind of
read your email that way. This is what
my inbox looks like. on the right is a
um email assistant that you can ask
questions like I asked where's when's my
AI engineer talk um today and it gave me
just gave me the answer um and this is
built primarily by one engineer um that
he's got one or two contractors that
have helped in in certain ways but like
almost all of this is built by one guy
same thing for um
uh this app which is another one that we
we make called monologue which is a
speechto text app It's sort of like
Super Whisper or Whisper Flow if you
know of those. Um, again, one guy,
thousands of users. Um, I I love it.
It's a it's a it's just a beautifully
done app and it's not it's not simple.
It's complicated. There's a lot of stuff
to it. Same thing for this app called
Spiral. You can see there's it's it's
big. Um, and again, one engineer.
So, obviously, this would not have been
possible um a few years ago. it would
not have been possible even a year ago.
And I think the big change that happened
that we're all starting to catch up to
is um it started with cloud codes. This
sort of like terminal UI that gets rid
of the code editor
really push pushed us into a place where
um we are delegating tasks to these
agents. We are and and that allows us to
uh work in parallel and do much more
than we would have ordinarily. Um,
so some of the things that some of the
things that I've noticed that we can do
that I I assume people in this room are
starting to see but um [snorts] I think
is is sort of important to put put our
finger on is uh the reason we can go
much faster is we can work on multiple
multiple features and bugs in parallel.
And I think that there's a um
there's like a little bit of a meme of
the vibe coder on Twitter that is oh
[snorts] like they they have um they
have four panes open but they're not
actually doing any work. And I actually
you can do it that way and I think there
are also definitely engineers and I know
that they are because they work at every
that are productively using four panes
of agents at the same time. Um, and
that's that's crazy and that that
contributes a lot to the um ability for
a single developer to build and run a
production application. Um, another like
really important thing about this, a
really big um unlock is because code is
cheap, you can prototype risky ideas and
that allows you to do more experiments
than you would ordinarily. And that lets
you make way more progress because the
starting energy to try something is so
much lower because you just like say,
"Oh, go do this. go do some research on
this like big refactor I might want to
do and then you go off and do something
else. And that's a really big deal.
Um, and another really interesting thing
that I love about this stuff that I'
I've noticed in inside of inside of our
organization is we move we're moving a
bit more toward a demo culture where um
instead of you know previously if you
wanted to make something you'd have to
be like maybe write a memo or do a do a
deck or um or you know convince a bunch
of people that it was a good idea to
spend time on because you can vibe code
something uh in a couple hours that sort
of shows the thing that
uh that you want to make it. It allows
you to show everybody and uh I think
that being a being a sort of de
democultural allows you to do weirder
things that you only get if you can feel
it. Um which is I think really amazing
and beyond just like sort of the basic
productivity unlocks.
um
AI has and the way that we use it has
caused us to sort of invent an entirely
new set of engineering primitives and
processes which I'm sure that everybody
in this room is starting to do already.
I think everyone is sort of approaching
the same things from different angles
and a lot of them definitely do echo
engineering processes from the past but
I think it's really helpful to try to
put our finger on okay what is the new
way of programming if we're moving up a
level of the stack and and we're moving
from you know Python and JavaScript and
scripting languages up into um up into
English and the uh the the name that
we've given to this process is
compounding engineering
Um, and the way that I talk about
compounding engineering is in
traditional engineering, each feature
makes the next feature harder to build.
In compounding engineering, your goal is
to make sure that each feature makes the
next feature easier to build. Um, and we
do that in this loop.
Um, the loop has four steps. The first
one is plan. And if you're you've been
here today, you've been paying
attention, you know how important it is
when you're working with agents to make
a really really detailed plan. So I
think everyone is doing that. Second
step is delegate. Just like go tell the
agent to do it. Everyone's doing that
too. Third step is assess. And we have
tons and tons of ways to um assess
whether the work that the agent did is
any good. There's tests, there's trying
it, there's having the agent uh figure
it out. There's there's code review,
there's agent code review, there's all
this types of stuff. And then the last
step, which is I think the most
interesting one, is codify. And this is
kind of like the the money step, which
is where you compound
everything that you've learned from the
planning stage, the delegation stage,
the assessment stage back into prompts
that go into your, you know, your cloud
MD file or your um your sub agents or
your slash commands. And you start to um
basically create this library. You take
all the tacet knowledge that you pick up
um that all your engineers are picking
up um as they find bugs, fix plans, um
delegate work, and you um you make it
into an explicit collection of prompts
that you can spread for your entire
organization.
And um when you do that really well,
there's a lot of like really interesting
um second order effects that are are not
I think that well understood or or that
commonly talked about that I think would
be interesting to to bring here because
my guess is that um some people are
already seeing this, but like maybe it
needs to be pushed on a little bit more
to like really be brought out and some
people uh it might be an interesting way
to get more of your organization to buy
into using these tools. tools 100% of
the time.
Um so the first thing that you notice if
you sort of if you set up this process
and you and you're like 100% bought in
on something like compounding
engineering um is that tacet code
sharing
becomes much easier. So uh we have we
have multiple products at every a lot of
a lot of products a lot of times need to
implement similar things even if they
use different technologies or imple
implementing similar things like a
team's feature or a certain type of ooth
or whatever. Um
previously in order to share code you'd
have to like abstract out whatever you
did into a library and then like allow
someone else to download it and it it'd
be hard to do or you'd have to talk
about it.
With agents, um you can just point your
Cloud Code instance at um the repo from
the developer sitting next to you and
learn the process that they went through
to build the feature that they that you
need to reimplement and re-implement it
yourself in your own tech stack, in your
own framework and in your own way. Um,
and that's really really cool to kind of
have this the more developers you have
working on different things inside of
the org, the more you can um share
without any extra cost because AI can
just go read all the code and and um and
use it. Um, another really cool thing
that I've noticed is that new hires are
productive on their first day because
you've taken all of the things that
you've learned about like, okay, how do
I set up an environment and what does a
good commit look like and all this kind
of stuff and on the first day they have
all that set up in their in in their,
you know, cloud MD files or their cursor
files or uh codeex files or whatever and
um the agent just sets up their local
environment and knows how write a good
PR. That's really cool. It also helps if
you um want to hire like expert
freelancers. Like there's some there's
one guy there's one person who just is
really good at this one specific thing.
You can have them come in for a day and
like do that thing. It's I think of it a
little bit like um like a DJ or whatever
can like go in on like a couple bars of
a song. Like you can just sort of drop
in and that's really helpful. it's it
would ordinarily be like too hard to
collaborate because the the startup cost
is too high, but you can do that a lot
better now.
Um, another thing that I've noticed
which is really cool too is um
developers inside of every commit to um
other products. So, uh you know, we have
four products that run internally.
Everybody uses all the products. If
someone uh runs into a bug or a paper
cutter, like a little minor quality of
life thing that they want, they will um
often just um they will often just uh
just submit a pull request for it to the
other GM of the app um because it's very
easy for them to go download the repo
and figure out uh or have really have
Claude or Codex figure out, okay, this
is how we fix the bug or this is how we
fix the paper cut. Um and that's really
really cool because you have this
much um much easier way of collaborating
across apps that I I think over the next
couple years. I imagine that you will
also be able to let customers do this to
some extent. Like if you run into a bug
um this is, you know, speculative, but
if you run into a bug, you can have your
little agent fix it um and submit it as
pull request. It's a weird open source
thing, but um yeah, this is really
really cool and and definitely is
happening a lot inside of our company.
Um,
another really cool thing is um we we
have not this may get different as we as
we scale but um we have not yet had to
standardize onto a particular stack or
language. We instead let everyone who's
building different products like pick
the thing that they like best and the
reason is because it makes it AI makes
it much easier to translate between
them. Um and it makes it much easier to
to jump into any language and framework
and environment and be productive. And
so it we don't uh it's easier for us to
let people just do the thing that that
they like and let AI kind of like handle
the translation in between.
Um and the last thing which is my
favorite but like is also the horror I
think of of some developers and to some
degree maybe the horror of my team um is
that managers can commit code. um if
you're technical uh even the CEO and
um for for me like I have no business
committing code because we've got four
products we've got 15 people we're
growing really fast um I'm doing tons
and tons of other things but I can and I
I have like committed production code
over the last couple months and the
reason for that is AI allows um
engineers to work with fractured
attention so previously you might have
needed like a 3 or 4 hour block of focus
time in order to like get anything done.
Um, but with cloud code, you can kind of
like get out of meeting and say, "Hey,
like I want you to investigate this bug
and then go do something else and then
come back and you have like a a plan or
like a um root cause fix and then you
can submit a PR." And it's not easy,
it's not magic, but it is actually
possible. And I think that's a that's
just a totally new way of thinking how
thinking of thinking about how managers
interact with the products that they
make.
So, um, just to just to summarize, um,
there's a I really think there's a 10x
difference in how things work when you
hit 100% AI adoption. I think, um, from
what we've seen, a single engineer
should be able to build and maintain a
complex production product. what we call
compounding engineering, but I think
what all of us are are sort of pointing
to um is I I think really works to make
each feature easier to build and then
creates all of these sort of nonobvious
second order effects that makes it
easier for the entire organization to
collaborate together.
And very importantly, many people in San
Francisco don't know this yet. Um so
you're you're the first to hear it. Um
so that is my talk. So, if you're
interested in um in what we do, uh I run
EveryY. Uh Every is the only
subscription you need to stay at the
edge of AI. You can find us at every.TO.
Um we uh we have a daily newsletter
about AI. So, we do ideas, apps, and
training. We have a on the ideas side,
we have a daily newsletter. We review
all the new models when they come out
and all the new products when they come
out. The apps, you already saw, we've a
bundle of all these apps and then we do
training and consulting with big
companies to help them use AI and it's
all bundled into one subscription. So
you get everything for one price and
that's it. Thank you very much.
[applause]
[music]
[music]
>> Ladies and gentlemen, please welcome
back to the stage Alex Lieberman.
Okay, 8 hours in. We did it. Um I have
some housekeeping. We have to finish the
day with housekeeping. First of all, I
want to thank you all. It has been
phenomenal to be on this journey with
you all. But uh let's give a a shout out
just to you all for being here, going
through a full day listening to the
programming. So, round of applause for
everyone in the crowd, everyone
[applause] online who's been watching.
Let's also keep it going for all the
team in production behind the scenes
making this possible. I watched them
work tirelessly throughout the day to
make this happen. And then finally,
let's give a huge shout out to Swix and
Ben who made this whole thing happen.
>> [applause]
>> So get comfortable for a second. I have
some housekeeping. Make sure everyone
knows where to go. And then we have one
final speaker who's going to chat uh
right after I hop off stage. So let's
just dive in for a sec. Uh tomorrow is
the engineering session day. I will not
be your MC. So you will be taken care of
by Jed who works at Google. I spent the
day with Jed. He is incredible. He's
just like a taller, better looking
version of me. and he's actually an
engineer. So, you get a true engineer
tomorrow. Uh, if you have a bundle pass,
your ticket includes tomorrow's track.
So, we'll see you tomorrow at 8:00 a.m.
here. If you have the leadership pass
only, your ticket does not include
access to the sessions or the venue
tomorrow. However, we have organized an
off-site brunch for you on us at a
restaurant not far from here. So, check
your calendar for the invite and the
location. But right now we are headed
into the afterparty. And not only is
there an afterparty, but there are after
afterparties. There's a lot of side
events. So your entire night is planned
for you. And we have Graphite to thank
for sponsoring the afterparty. So here
to give us the last word for a brief
message is the co-founder and CEO of
Graphite, Mel Lutzky.
[applause]
>> [music]
>> Good evening everyone. My name is Mel
Lutzky and I'm the co-founder and CEO of
Graphite. Uh we're the AI powered code
review platform for this new age of
agentic software development. Now I know
you guys heard a lot today about agents
and how to make them as effective as
possible in generating code and building
features faster than ever. And they're
incredible at this. But I think
everybody who's who's built software in
a professional environment knows that
writing the code is only the first part
of the story. Every code change then
needs to be tested. It needs to be
reviewed, merged, deployed. And
oftentimes that second half of the
process takes just as long if not longer
than actually generating the code. And
that's what we do with Graphite. We're
applying AI to the entire development
process and making code review as
quickly as as quick as possible. uh we
have an agent that's integrated fully
into our pull request page. Um it's like
reviewing code in 2025 and not you it
doesn't feel like 2015 anymore. U that's
what we build. Um we're super excited
about it. Uh if you want to come check
it out, we have our booth uh in the expo
hall and also we're going to be around
all day tomorrow. We're the official
sponsors of tonight's afterparty and
also tomorrow's event at public records.
So we wanted you know all you guys who
came from out of time out of town, we
wanted to show you good time in New
York. Uh so we have two events for you
uh to make sure that you have you have a
good time and uh see what New York is
all about. Uh want to give a big shout
out to Swix and Ben and the whole AIG
team for organizing and we're excited to
see you guys all at the party tonight.
Thank you very much.
[applause]
taking place in the halls on both doors.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
[music]
Heat.
Heat.