AIE Miami Keynote & Talks ft. OpenCode. Google Deepmind, OpenAI, and more!

Channel: aiDotEngineer

Published at: 2026-04-20

YouTube video id: 6IxSbMhT7v4

Source: https://www.youtube.com/watch?v=6IxSbMhT7v4

Hey, hey, hey.
Good morning everyone. Hello Miami.
>> How's everyone doing?
>> Welcome to Miami.
>> Yes.
>> How's everybody doing?
>> And in case you forgot where we are, uh
there is a queue in my jersey. Uh, so
we're bringing AI engineer to Miami
today and I'm so grateful to see I can't
really see that well because the light
is so bright, but I can kind of see your
faces and I'm just so glad that you're
all here to celebrate AI engineer and
accelerating practical AI applications.
Uh, so I hope that you're as excited as
we are about today. Um, so my name is
Ethel. I am one of the MCs for today. I
am an AI researcher at Google and I'm here
with Iman, my colleague and dear friend.
>> Hello, I'm Iman, AI research engineer at
Google and uh I'm coming all the way
from the San Francisco Bay area, West
Coast, by the way. Who here is from the
West Coast? Raise your hand if you're
all right.
>> Nice. We see some hands.
>> I see a few hands. Who's from the East
Coast?
>> Oh, wow.
>> That's not fair. We're the minority.
Well, Miami is east coast as well, so
>> yeah, I can imagine. So, who's from the
central US?
>> Yeah, we see some.
>> Okay. Okay, that's a good woo. And uh
who's coming from outside of US?
>> Nice. Wow. Thank you for traveling.
>> Pretty diverse. And I'm curious um raise
your hand if by this time next year
you're going to be replaced by AI.
I'm counting around 20 colleagues with
realistic expectations.
Uh well that may be true or maybe uh the
reality is AI will expand what's
possible for us and it's going to
multiply us and uh it's going to
redefine what's achievable. But before
that let's hear some stats about who
who's attending.
>> Yeah. So we want this conference to be
about you and your connections. Uh so
sitting with us we have people from 23
countries. So thank you all who are
traveling internationally and also
locally uh including me myself. We
walked down a couple of flights to come
here. But no matter how far you
traveled, we're really grateful that
you're here. Uh and we also have some
companies that are very excited about
sending some engineers over. So we have
two companies that sent 12 engineers to
the event. So, thank you so much for
doing that. I also want to do a quick
roll call. Uh, who in the audience is an
AI engineer?
Nice. We see some hands. What about
quality engineers,
PMs,
AI researchers?
>> PM there.
>> Okay. So, an overwhelming group of AI
engineers. So, you're in the right place
and we're so happy to have you with us.
And we have great talks. We have our
sponsors in the expo area for later. And
we really want you to be able to network
with each other, talk to each other, and
really just uh build a community here.
So that's the vision that we have for
today.
>> Big plus. Big plus. Um yeah, last night
at the opening reception, I got to talk
to people. Some told me about their
personal experience. Their friend got
sick. They want to leverage AI to help
them and improve their life or talking
about how they want to leverage AI to
bring education to some parts of the
world that don't have proper education.
Uh very personal stories and I'd love to
hear that and I think that's the core of
the idea here. Let's consider this, um, a
playground for amazing minds to come
together, build connections, and make
big changes in the world. And I think
that's part of the vision. Um, and who
better to tell us about the vision
than the amazing Gabe Greenberg, who is
the CEO and founder of G2I. I would like
to invite Gabe to the stage to share a
few thoughts. Thank you all.
>> Welcome, Gabe.
>> AI engineer Miami, what's up? How we
doing? All right. I like it. I got
through my start: I said AI Engineer
Miami rather than React Miami. So I'm
starting off well. That's our next
conference. Okay. All
right. Well, I wanted to give you all a
little bit of an origin story because uh
this conference and this series of
conferences is uh is pretty unique. It's
special. There's a really special group
of people that you're sitting uh with
right now. DAX is special. The speakers,
I just uh I have a lot of love for the
people in this room. I'm the founder of
G2I. You can check us out at g2i.ai.
We're focused on uh reinforcement
learning environments in the human data
space specifically around software
engineering. Um I'm also the
co-organizer of this conference uh at
least AIE Miami and React Miami. And you
can find me on uh Twitter. I I don't I'm
not going to call it the other name.
Twitter/Gabe Greenberg. You can follow
me there. So origin story of how this
conference started. I uh I flew out to
uh San Francisco. I knew nobody and I
walked into React Conf 2016 and the
first person I meet is Ryan Florence. Uh
some of you may know him in this room
and uh you know little bit of a
celebrity mutuals background and you
know I shake his hand meet him cool guy
answered a few questions and I sit down
next to him and he opens up his laptop
and uh on his lock screen is
wait for it Brad Pitt.
And I knew that I was instantly like in
the right place like this guy does not
take himself too seriously. Nick Shrock
was up there talking about GraphQL.
React Native was kind of new. It was
this beautiful like you know this
bustling we were all excited you know
like uh on the bleeding edge and uh I
got really involved in the React
ecosystem. Uh, we served the Reactiflux
community, did a lot of Q&As with the React core
team, and a few years later I got really
really really sick. Um, I had mold
toxicity and mercury poisoning. And I
would sleep on uh the floor underneath
my desk uh for hours because I had
migraines. I couldn't think straight. I
couldn't do simple math at times. And um
and I would get up and do a little work
and go back to sleep. Um, there were times
I'd go into treatment on Monday and I
couldn't work till Friday. And this was
years. This lasted for eight years in my
life. It was the most terrible time. My
kids, you know, didn't see me much. My
wife was kind of like, "What's going
on?" And uh finally I was diagnosed. It
took me so many years to get diagnosed
with the mold toxicity and um I'm
sitting there on a vacation in pain,
physical pain every single day. And
my wife and I go, "We need
help. We need to raise money to get you
help." And so I put it on Twitter and
I've been involved in the in the
software uh ecosystem for a while at
this point. And uh and this tweet
changed my life. Uh Dan uh Abramov on
the React team uh said, "Let's help
Gabe." And uh they raised $22,000 for
me. Uh and I got healthy um a couple
years ago. And I've been healthy since
uh from this.
Yeah.
So, um, a couple years later, I'm up too
late on Twitter, of course, and I
post this: uh, someone should put on
React Conf in Miami. Who'd be
interested? I own the domain, uh, and
hundreds of other domains I've never
used. Um, and, uh, of course, Ken
Wheeler says I'd go and probably never
be invited back. He was invited and he
uh he was invited back, believe it or
not. He'll be here this year. And
Michelle over there and her sister Becca
said, "Yes, we're in." And they were the
ones that convinced me. Michelle was the
one that convinced me this could be done
uh with no money right out of COVID. And
so we just we felt called to it. And
this was a response to what you all had
done for me, the organization G2I. uh we
felt called to this conference and to
serve the people here to not make it a
quote unquote corporate event for the
profit but to make it really for the
people. And of course Swyx, you can see
him down there at the end. Um, he's come
almost every year to React Miami,
spoke many times. And so he created AI
Engineer quite a few years ago at this
point, runs Latent Space, is super
involved, he's at Cognition, and he said,
Gabe, Michelle, Becca, can you do the first
AIE in America? So, here we are. AIE
Miami is born. Thank you all so much for
coming. It means the world to us.
And there's one more thing. Um, our
company, we've worked with the Frontier
Labs for a number of years now. We've
had to move really fast, um, build
production software on, you know,
really compressed timelines,
like I think some of you
would believe. And uh so today
we are uh announcing Orchestrator AI.
It's a multi-agent
orchestration platform for complex
engineering. You can check it out at
orc.ai. Um we are uh really excited
about this. It's been dogfooded. Uh you
can run many different agents in um in
the platform. The coordinator runs the
implementer, auditor, reviewer, validator,
researcher. These are only some of the
the roles that we have in the platform.
Comes with the confidence score, shares
the known issues, the assumptions of
course that the uh that the different
agents are making. And you can spin up
to 16 of these for a single task.
There's true adversarial um governance
with this um and we're able to catch a
ton of large language model drift. Uh
extremely fast inter agent comms and of
course model agnostic. Uh it comes with
a self-pruning context memory that
reduces context bloat and then also the
meta observer in the platform
automatically adds new skills as it
identifies opportunities for them and
then an observability layer that allows
you to delete them or add new skills
manually. So we're really really excited
about this. Uh we're signing up design
partners. We uh we've been doing a
little bit of benchmarking and then I'll
turn it over to to Dax to to come talk
about OpenCode. Um we do a lot of
spec-driven backend work. Uh we need to do it
really fast and and our engineers were
were dog fooding this platform. Really
it's been a couple years in the making
behind the scenes and uh they've been
building these spec-driven APIs. So if we
look at the pet store API that we did um
you're looking at 100% path coverage and
100% semantic score around the the
quality the response shapes types and
behaviors matching the spec and uh
compared to a single-agent harness like
Claude Code, I mean, it's about the same,
you know, we're 6% better, not much
of a difference. Now, as we increase the
complexity, going into a startup API, um,
we're able to see the lift: we're able
to hit 100% path coverage and
100% semantic score when Claude Code is
hitting 78% and 60%. Interesting. But
then we really increase the surface area
of this thing and the complexity, like 8x
on the spec of what a startup API would
look like, and uh we're seeing
the lift: where a single-agent harness
might hit 22%, we were hitting 92% on
the semantic score, and the path coverage
is 100% for us, in half
the time. So excited to launch this. And
lastly, uh, we spent the last 72 hours
making sure we could have one more
benchmark. We actually rented a hotel
room, uh, to put this laptop in just to
run the benchmarks cuz the the Wi-Fi was
not so good in this hotel. Um,
which is a funny story. Um, we ran the
orchestrator against SWE-bench Pro,
specifically GPT 5.4 high. It's about
731 tasks. Um, we bucketed
it from very easy all the way to very
hard, but really, at
the end of the day, even easy is not super
simple: you're talking about a multifile
fix and subsystem logic understanding. Um,
so bucket by bucket we're seeing
lifts over GPT 5.4 high of 17.1%,
14.8%, and 8%, then
a 1.7% lift on hard, and when we get
to very hard, we're talking about complex
long-horizon issues spanning multiple
days, a 5.7% lift. So 8.4% overall on top
of the model. Uh, to give you an idea, um
I think GPT 5.2 to 5.4 was a 4% lift,
and uh Opus 4.5 to 4.7
was a 7-point lift. So, we're
excited about this. Uh, this is
able to execute SWE-bench Pro above Opus
4.7 with GPT 5.4. And if you'd like to
sign up as a design partner over the
summer, we'd love to work with you and,
uh, and build with you. So, thanks for
your time. Enjoy Dax and enjoy AI
Engineer Miami. Thank you.
>> Am I going right away? Are you
>> Thank you so much, Gabe. Our
first presenter
has tried his hardest to insult us all
by his choice of the title.
He is a world-renowned
troll on Twitter.
That's not me, that's what Becca said.
And uh the title for his talk is you
don't have any good ideas. I would like
to invite Dax Raad to the stage. Um, I
don't know where he's going with this,
but I'm going to let him figure it out.
So, please welcome to the stage.
All right.
I don't have any slides, so I'm just
going to walk around and uh and talk at
you guys. Um, a lot of people in here
today. All of you came from all over the
world, all the way to Miami, but you're
not like other people because you came
here to talk about AI. Not why people
usually come to Miami. Kind of
embarrassing.
It's okay. There's a reason you're here.
The reason you're here is there's a lot
of smart people that are going to be
here uh giving talks. These are people
that are at the top of their game using
AI to build software.
You're going to learn all the tips, all
the tricks, give you the edge, you know,
do things that used to take a week, do
it in a day. You're going to go home.
You're going to use these tricks. You're
going to build all of your ideas. You're
going to be super successful. You're
going to be rich,
going to fix everything that's wrong
with you.
Your mom's going to be proud of you. Um,
there's only one problem, which is you
don't actually have any good ideas. And
that probably hurts to hear, but it's
true. It's okay. I don't have any good
ideas either. Uh, and I think this is
the first time that we're all having to
confront this fact. You know, we have
more capability to build stuff than
ever. And you think, oh, finally, we can
kind of ship all the stuff that we
always said we would. Uh, and it turns
out a lot of the stuff that we thought
was good ideas are not good ideas. And
this is the number one problem that I'm
struggling with, uh, my company's
struggling with. Um, yes. So, again, my
name is Dax. I'm the co-founder of
Anomaly. Uh, we make a decently popular
coding agent called OpenCode. Um, and
this talk is going to be all about
product restraint.
So to understand what I mean by this, uh
let's uh
let's let's think back to before AI was
a thing, before coding with AI was a
thing. It was like a long time ago. This
was like two years ago, forever. Um but
if you think really hard, you can think
back to that time and imagine what it
was like. Uh for those of you that are
programmers, you know, imagine working
at your companies. Someone would come to
you with a new idea, with a problem,
with a feature they wanted to ship, and
it would be really annoying. You would
hate it when someone did that because
you had a huge backlog of stuff you're
already trying to do. You had all this
stuff that you wish you were doing
better that you don't have time to get
to. So, when someone came to you with
yet another thing to put on your road
map, you did what the lazy engineer does
and you push back on them. You argued
every reason why we shouldn't do this
thing. Uh why we shouldn't ship this,
why the company shouldn't be doing this,
why the person was stupid. Um maybe we
should do this later.
You basically were the obstacle to
getting anything done. Um, just because
you were overwhelmed. And
you pushed back a lot, and rightfully
so, the company hated you for it. Uh if
you look at most companies, you talk to
them, you talk to them honestly, most
parts of the organization hate the
engineering team and for good reason.
because every problem that they have is
blocked by engineering. Uh when a
customer, you know, yells at someone on
support, it's because the engineering
team hasn't shipped something that would
have fixed their issue. When the sales
team loses a lead to a competitor, it's
because the engineering team has, you
know, they have a feature they haven't
shipped, the competitor does. So, it's
just engineering has just been the the
annoying part of the organization
forever. The source of every single
problem, at least the way it feels. And
it feels kind of stupid because software
is virtual. We're not like physically
building things. We're not moving things
from one place to another. Uh it's
just in this virtual space and it feels
like the moment we have an idea, it
should just exist, right? Like it's just
a thing in the app. Like why are
there so many steps and processes uh in
between that? And everyone wished that
things could be different. The past
couple years uh it feels like that's
kind of changed. We've gotten the
ability to kind of go from idea to a
real looking thing really quickly. Um,
and everyone's super hyped about this.
Like every company is trying to adopt
this workflow as much as possible. Blow
up every single process that they have.
Uh, you know, if you're not adopting
this, your competitor's going to adopt
this and, you know, you're going to get
left behind. We're we're like measuring
tokens. Uh, we have token leaderboards.
See which engineers can, you know, get
to the top of the token leaderboard. if
you're not spending five times your
salary on tokens, you're going to get
fired. Um, so we're all going crazy with
finally all this pentup frustration that
has been around for decades with slow
engineering is solved and we're like
going crazy with it and I'm not saying
there's not a lot of positive from it.
Like a lot has changed for the better.
Um, I work on an AI coding agent. Like I
believe in this in a lot of ways, but
it's not universally good. And I want to
talk a little bit about
things are so different now. And we look
back at all this frustration and I think
what I'm realizing is
that frustration was kind of saving us
from ourselves and to understand this
let's think about how things used to
work right you had you know typical
organization they had the product
engineering and design roles um because
engineering was so backlogged all the
time uh product and design would work
together and refine ideas before they
brought to engineering. It was a lot
cheaper back then two years ago uh to
have a mockup in Figma than it was to
build a working prototype. So a lot of
ideas would just die at this phase. You
know someone would have an idea they
would kind of have to go work with
design. they think through it, they
might realize, okay, this actually
didn't make any sense or, you know, we
have to refine it and the initial idea
turns into something totally different
and by the time it kind of, you know,
bounces through the organization, uh, a
lot of the ideas die or they or they get
refined into something into something
pretty decent or they get shelved and
kind of brought out later. Um, and that
was like a natural thing that was
happening. There's all this filtering
that was that was going on. Now things
are a little bit different. Um, anyone
in your organization can kind of ship an
MVP. They can prompt a coding agent,
spend an hour with it, and implement a
feature that they think is good. And
this is this seems obviously good, you
know, like why would anyone be against
being able to experiment and build stuff
and iterate and and try things?
Obviously, it sounds like a good thing,
but the sneaky thing about MVPs is they
look almost done. Uh, you spend an hour,
build something, and it like basically
looks like it's there. At that point,
there's momentum behind it. The moment
something kind of looks like it's
basically there, it's it has like a life
of its own. At that point, it's
inappropriate to really think about it
from first principles or like question
the whole premise of it. It's basically
already there. People around you aren't
going to like really be a roadblock
or or get in the way. Um, and it ends up
in the product. These ideas, they go
from someone having the idea to
prompting it. They spend an hour on it.
You know, it's barely any work. And then
it's like in the product the next week.
And we're told this is actually a good
thing. We're told that in the new era of
AI, it's all about you have a problem,
you solve it right away. You ship the
fix right away. The faster you go, the
better. Go, go fast, fast, fast. That's
kind of the vibe of everything. And that
like adds to it even more, right? Like
we're not questioning anything that
we're doing. And so unsurprisingly, this
creates bloat. Products end up super
bloated. They end up with features that
are in weird spots. They end up with
three different ways to do things. Um,
and it's it's kind of making me realize
that without the previous checks and
balances of just things going slow, we
just didn't have a lot of good ideas.
Like most of the stuff that we're
shipping, they're bad ideas. I look at
our own products and, you know, the
products we work on right now have been
out for less than a year. And I look at
it and I'm like, what are all these
features? Like when do these get in
here? Like we should never ship this, we
should never ship that. It's just gotten
so easy that things just slip through
into the product. um without you know
anyone really thinking twice. And this is
kind of messing up the whole
team dynamics as well. Um, for the first
time ever, design is behind engineering:
stuff just gets shipped, right? Stuff just
gets shipped out there before design has
even looked at it. So now they just
have a huge backlog of 100
features that are shipped that they need
to go through one by one and polish. And just
independently polishing 100 different
features, one by one, doesn't add
up to a good product. They're not doing
their role which is to think cohesively
about a product. Think about the
experience end to end to create like a
proper universal experience. There's
kind of like oneoff reacting to to stuff
that's going out. Um and this is
changing the engineering side as well.
Um, you know, historically, if someone
came to you and wanted to build a new
feature or iterate on a feature on a
system that already exists, you would
look at this system and you would think,
okay, like this feature doesn't really
fit into this system. So, we'll have to
like rethink the system from scratch.
There's going to be a lot of work. We
have to redesign it to support this
thing. Of course, there's always hacks,
but you know, you have to pay the cost
of that hack. You have to be the one to
go and hack this thing into the system.
anytime that hack later like rubbed
up against other things incorrectly
because it interacts with every other
feature you have, you had to deal with
the pain of that, you no longer have to
deal with the pain of that. You can go
tell your agent, you know, hey, do the
[ __ ] for me. Um, and you don't have
to deal with like the dirty work really.
Uh, so engineers willingness to ship
hacky solutions,
you know, we're we're just a lot more
willing to do that. our bar for what
we're willing to do to our code bases is
like on the floor at this point because
we're not paying the price for the cost
of it. And that really shouldn't be the
case, right? Just because you can
offload the pain to someone else. In
this case, you know, it's not a real
person or well, some people think it's a
real person, but uh
that doesn't mean that we should
change our philosophy on what we're
doing or how we're doing things
necessarily. Um so that's also impacting
the engineering team as
well. Um, and of course we've had the
historical excuse, which is, you know,
it's okay to ship hacks sometimes.
You know, you make the judgment call on
it's better to get something out now and
deal with it later. You have an excuse, you
know, we'll get back to it later. You
intend for that to be three months. It
ends up being three years and by the
time you get to it, you like totally
regret having done it in the first
place. We've got whole new excuses now,
right? It's it's okay if this is bad.
The agent will fix it later. Um, it's
okay if this sucks. The models will get
better and it'll just kind of solve it.
It's like a completely like it's like a
faith-based approach to it. Like just
magically it's going to get fixed in the
future, which of course it doesn't. It
really doesn't happen. Um and the net
result of all these changes is just a
ton of rot. Our products are rotting so
quickly, right? We look at and I think
we can all feel this in our own products
and products that we're using lately.
They just feel really old really fast.
You know, you use something that came
out less than a year ago and all of a
sudden it feels like it's already five
years old, like post private equity
acquisition, like enshittification. This
is happening in like several months. And
again, it comes back to the root
problem. We don't have a lot of good
ideas. When we just ship things
unchecked,
we just uh speedrun that life cycle of
of product deterioration. And it's like
happening at crazy speeds these days.
So, the key issue here is restraint. um
we have more power and capability than
ever which means it just magnifies our
judgment. So we need to exercise a lot
more restraint. And I don't have a lot of
good ideas on how; I think for me, I just
look back to what traditionally has made
sense, um, and try to really keep that in
mind. So you know when someone comes to
you with a problem or a user has a
problem if you just like react to that
and fix that problem right away you're
just going to make 10 different
solutions for 10 different problems. Uh
if you slow down and wait, you listen to
this problem, you listen to that
problem, you might realize, hey, these
10 problems are they seem unrelated, but
they actually are related. If we ship
this one thing, it'll fix not only those
10 problems, but also the 50 other
problems that no one's even brought up
yet. That's actually your job when you
build product, right? Your job
isn't just to be like a prompt router
go from the user complaining to routing
to the agent. you have to slow down,
think and absorb and really make the
call on what you're actually
shipping. Um, so that's
very important, you know, slow down,
absorb and try to ship high leverage
things that actually solve a lot of
problems. Um
the other thing is uh I think a lot
about the onboarding cycle. Uh every
product basically has one good idea in
it. Um, and your job is to get the user
from not knowing about your product to
understanding that one good idea as fast
as possible. All the other stuff that
you come up with, they're important and
maybe useful, but they're secondary
ideas. So, they're not going to go into
like the main onboarding. And a lot of
companies mess this up. It's uh I think
we've all had the experience where you
open up a product and there's like a
dozen different directions you can go
through and the people working on that
product are like, "Oh, we're giving them
so many options. They're going to go
like play around with this then and try
with all trial and stuff." People don't
do that. They just give up and they
leave. So you shouldn't mess up
that one flow of
getting to your good idea. Which means
that every new idea you think of, you
need to craft the path of okay, they're
a user and they're kind of using my
product. How do I take them from there
to the point where they understand this
new feature? Um how do they learn about
it? How do they discover it? How do they
know when to use it? How do they know
how it works? It's really hard to come
up with this stuff. I have had so many
great features that I'm like, "This is
an awesome feature. I love it. It's so
useful." But I wasn't able to find a way
to get a lot of people to actually
discover it. And so we don't ship it,
right? We don't ship features like that.
Um, and it's painful, but again, it's
restraint. You don't want to put stuff
in there that has no actual way of being
used. Um, yeah, so I think if you keep these in
mind, you naturally don't ship as much
stuff. uh you know, right now you're
having an idea every single day, but if
you apply these filters,
most of them don't really pass. You're
not going to have a good idea every day.
You're not going to have a good idea
every week. If you have one every couple
months, that's I think you're doing
pretty well. And that's like a good
cadence to aim for. Regardless of how
fast AI is letting you go, there's no
reason why you suddenly have 10x the
number of good ideas, right? So, couple
good ideas a month, I think you're doing
pretty good. Um, to close off here, again
thinking back to the frustration
everyone's felt in the past, I think we
can now kind of look back and feel a
little grateful for it because it
basically was again like I said saving
us from ourselves. We're moving a lot
slower. It was filtering out a lot of
bad ideas on its own and we never had to
confront the fact that most ideas were
bad just because they were kind of
naturally uh being taken care of. Um,
and as we kind of enter this new era
where that process is going away, we
have to be a bit more aware and be
intentional that hey, most of our ideas
are not good. Um, and that's okay.
That's exactly how things are supposed
to be. All right, that's all I had.
Thank you everyone.
All right, give it up for Dax. And then
just to confuse you, the next one is
Dex. So Dex needs no introduction.
Actually, a lot of people uh know him
because he's a veteran of AI engineer.
So if you have watched some of his
talks, one of them is very famous. It's
called No Vibes Allowed. Uh, but I'm
still gonna give a little introduction
of Dexter for people who do not know
him. So, Dexter is the founder of Human
Layer and I actually had the honor to
meet Dex in San Francisco, which is
where he's based. Um, so Dex has always
been a a prominent figure in the AI
community and I'm really glad that he is
here with us today. So, today he's gonna
tell us everything we got wrong about
RPI. So, tell us more, Dex.
>> Amazing. Thank you, Ethel.
>> Thank you. Um, and thanks Dax for that
wonderful intro and sorry she said give
it up for Dax and I literally thought
she was saying give it up for Dex. So I
made the mistake I was about to make fun
of all of you for which is uh praising
me on Twitter for all my hard work on
the OpenCode project. That's the other
guy. Uh I have been doing coding agents
for a while though. Uh I think the no
vibes allowed is almost up to I wanted
it to get to 500k but any we done done a
lot of talks. Um all started with this
guy Eigor. I'm not going to go deep on
this, but it was basically like, hey,
when you use AI, you ship a lot more,
but a lot of it is fixing the slop you
shipped last week, and it doesn't really
work for brownfield code bases. And what
we want to do is we want to solve hard
problems in complex code bases. We had
to figure some stuff out. Um, we posted
our methodology on uh Hacker News back
in September and it was on the top front
page all day. There's probably about
10,000 people who have gone and grabbed
our open source prompts. Um, I found
public evidence that RPI is in use at
companies like Uber and Block and
private evidence of a bunch more that I
can't talk about. Uh, which sets us up
for a great talk about RPI and why it's
so great. Uh, but we're not going to do
that. We're going to do a different
talk. I'm going to tell you everything
we got wrong about RPI. Uh, because we
thought we had this thing figured out.
And of course, you know, models change
really fast and this whole world is
changing. Every three weeks there's a
new thing. Uh, and I think we got a
couple things wrong.
standing by. All right, we're back. Um,
we got a couple things wrong. One of the
things was we said, uh, it's okay to not
read the code. Uh, we advised people to
read very long plan files. Uh, and we
said Claude can have a little slop. Uh,
as it well, we never said this. It was
implied though. It was, you know, let
the let the model cook and we'll be
we'll be fine. Um, and so, uh, and and
you all have been doing your homework,
and I think, uh, at at AI Engineer
Europe, we kind of figured this out of,
uh, there's this continuum now, the
Zecharopo continuum. Uh, I'm going to
tell you how we ended up all the way
over here despite six months ago, eight
months ago being all the way on the
other side of the spectrum. Um, but
first to recap, um, and I can't see many
of you, so you have to raise your hands
very high. Who's run this prompt? Who's
done research codebase?
I'm going to assume everybody. It is
very bright here. Um, what about create
plan? Raise your hand if you use this
prompt. Uh, leave your hand up if you
use it like this. Hey, we got to go ship
a feature for this thing. Leave your
hand up if you ship it if you ran it
like this. Work back and forth with me,
starting with your open questions and
outline before writing the plan.
Some of you found the magic words.
That's great, but also it's a problem.
Um, since October, we've worked with
thousands of engineers uh of companies
of all sizes uh and trying to help
people use coding agents to solve hard
problems in complex code bases. And we
found over and over again we would give
the tools to an expert and they would
get great results and then they would
give it to their team and the results
were not always so good. Um so we got in
the trenches with our users as uh
product minded people do and we went to
go figure out what was going wrong and
we found three things. Uh the first one
was that people were getting bad
research and if you recall from no vibes
allowed this is the one slide I'm
repurposing um but you would pick a zone
of your codebase. You say, "Hey, go look
over here." And you would send off a
bunch of subagents, take these deep
vertical slices through your many repo
codebase. And then you would compress
all of this down of like how does all
this work into a single document that is
just like a snapshot of the parts of the
codebase that matter for the task that
we're about to go embark on. And we said
we should keep this objective, right?
Discourage opinions, avoid
implementation planning. Uh research is
really just the compression of the truth
about the codebase and how it works
today. Um, and really good engineers
would, and we noticed this pattern where
people would take the ticket and they
would turn it into questions and then
they would pass the questions into the
researcher. So if the thing you're building
is, oh, we need to add a new endpoint to
reticulate splines across tenants. You
might ask questions about how endpoints
are registered, what touches splines and
the worker program that handles
reticulation. But a lot of people would
just be lazy and say research, I got to
do this thing. Go research the codebase
for me. And this was a problem because
if we tell the model what you're working
on, you know, a good research is mostly
facts, but a bad research will have
opinions. And these models are so so so
deeply trained to go solve our problem
that it's going to uh steer the research
towards its thoughts on the first thing
it picked as the right way to solve this
problem. Um we'll get more on why models
shouldn't have opinions like they do get
to have opinions but just not at this
part of the workflow. It comes back to
this idea of you know do not outsource
the thinking. Um we also saw people were
getting bad plans. Um, and this is a
really interesting one. Uh, hopefully
you get some takeaways on like things
you can apply to your own prompting. But
we had this single prompt with like 85
instructions in it. And, uh, basically
if it worked properly,
um, what you would get is a, uh, laggy
YouTube video.
Try one more time. You know what? This
is why we have backup slides. Uh, it
would look like this. The model would go
back and forth and ask you a bunch of
questions. Uh, and then it would walk
through and like ask you what order you
wanted to do the things in and how you
wanted to test them. And only then after
this long conversation where you'd built
up all this shared understanding of the
problem would it write the plan. The
problem was that uh if you were in a
hurry and you didn't prompt it quite
right uh it would just spit out a plan.
It wouldn't ask you any questions. It
wouldn't put you in the loop. You were
just getting whatever the model decided
was the first way to solve
this problem. And that's basically the
same as just prompting it to go do the
thing at that point. Uh and so we gave
the tools to an expert and they got
great results and some other people
didn't. And we were like what was the
difference? And this was uh an
embarrassing thing to say in, like,
customer onboarding. Uh, but "you have to
say the magic words, apparently"
was the challenge. If you didn't say
this thing, I mean if you've been
prompting LLMs for a while, you know,
hey, you repeat the most important
instruction at the very end of the
prompt and at the beginning. Uh so we
said work back and forth with me
starting with your open questions and
outline before writing the plan and then
it would follow the process. But if you
didn't do this, 50% of the time it would
just skip that. And this was not the
user's fault. Like we if you build a
tool that requires hours of training,
like go fix the tool. Um, and so we dug
in. We're like, why are these steps
getting skipped sometimes? And the basic
takeaway here for whatever AI thing
you're building, whether it's coding
agents or something else, is you have an
instruction budget. Uh, my co-founder
Kyle is here somewhere. He wrote a
really good blog post about like how to
like tune and optimize your CLAUDE.md.
And the big takeaway was like frontier
LLMs could really follow about 150 to 200
instructions before they really are just
kind of half attending to all of them.
And obviously this was a year ago, so
inflate that number a little bit, but
there is a budget. And so this prompt is
85 instructions, plus your CLAUDE.md, plus
your system prompt, plus your tools, plus
your MCPs, is very unlikely to get great
adherence. I'll talk about how we fix
this. Um the other thing that we kind of
recommended and did was was plan
reviews. And we advocated we said look
if you're not going to read the code you
got to read the plans. This is me on
stage saying in November you have to
read the plan. Uh some folks even code
reviewed them. They would get together
on their team and read the plan. But a
thousand line plan was about a thousand
lines of code, like order of
magnitude. It would be about the same
amount of reading either way. Uh, and
plans can have surprises and so you'd
actually end up reading the plan and
then someone would go implement it and
then you would have to read the code
again. You're actually doing more work,
not less work. Um, because you know
thousand line plan, thousand lines of
code, etc. Uh, that's not leverage.
That's actually doing more work. Uh, so
the new advice is don't read the plans,
just read the code.
Right. Yeah, I know. I am I am humble
enough to admit when I was wrong. Here
we are. Uh, this is all a journey. Uh,
don't forget to learn. Uh, there are
other ways to get leverage though. There
are other ways to get more out of less.
Uh, and we'll talk about that. But this
is how we ended up all the way on this
side, the Mario side, of the
continuum. And again, yes, you could
say, "Hey, Dex, you know, in August, you
said don't read the code." Yes, we used
to be all the way over here. I am humble
enough to admit when I was wrong. Um,
these things change. Uh, please read the
code. Uh, we tried not reading the code
for like 6 months. Uh, it did not end
well. Well, we ended up having to rip
and replace huge parts of that system.
Uh, and all of you now who are just
finding out about the like, oh, we can
do the lights-out software factory thing
and just we just won't read the code.
I'm like, all right, be careful. If you
have people who depend on your code, if
you're going to if someone's going to
get paged at three in the morning if
something is broken, please, I'm begging
you, please read it. There's an entire
profession here on the line. Uh, and we
need to save it. This is why I'm kind of
like iffy on the agent swarms thing
because like the bottleneck is last year
it was like how can you spend as many
tokens as possible and this year it's
going to be like okay how do you
actually what's the right speed you can
go because if you go 10x faster but
you're going to throw everything away in
6 months uh that's not actually like
productivity that's actually just
burning time and money and your
employer's time and your time uh I think
I do think you can get to two to 3x and
still read every line of code and own it
and have good architecture as Mario
would say. I'm not going to say it so we
don't get demonetized, but uh uh
everyone is racing to build these
lights-out slop factories, right? And I
think again what's going to happen is
you're going to wake up one day and
there's uh no one's read the code in
three months and you have a bug that the
agent can't solve. Uh and then you're
going to have three weeks of downtime as
you re onboard everybody on your team
back into your codebase that they
haven't read in three months. And in
that three weeks, you lose all your
customers and now your company is dead.
It's not going to happen to everybody.
It's probably going to happen to
somebody. uh be careful. So, we're going
to try to token smarter, not harder. Uh
and we do that with a couple different
ways. The research is the least exciting
one of it, but basically take your
ticket, turn it into questions, you make
the research. We can just do this with
prompting and workflows. So, we
basically hide the ticket from the
researcher programmatically. You have
one context window to generate questions
and then you just feed those questions
in to generate the research. This can
be done trivially with plain prompting
and a bit of glue code; if you've built deep research before, this
technique has been around for a while.
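A minimal sketch of that split, assuming a generic `call_llm` helper; the helper, prompt wording, and parsing below are illustrative assumptions, not the speaker's actual prompts:

```python
# Hypothetical sketch of the two-stage split: only the question-generation step
# ever sees the ticket, and the researcher sees only the questions, so the
# research stays a compression of facts rather than an early solution pitch.

def call_llm(system: str, user: str) -> str:
    """Stand-in for whatever model client you actually use."""
    raise NotImplementedError

def generate_questions(ticket: str) -> list[str]:
    # Context window #1: sees the ticket, emits neutral questions only.
    out = call_llm(
        system="Turn this ticket into factual questions about how the codebase "
               "works today. No opinions, no implementation planning.",
        user=ticket,
    )
    return [line.lstrip("- ").strip() for line in out.splitlines() if line.strip()]

def research_codebase(questions: list[str]) -> str:
    # Context window #2: sees only the questions, never the ticket.
    return call_llm(
        system="Answer each question about the codebase with file paths and "
               "facts only. Do not propose solutions.",
        user="\n".join(f"- {q}" for q in questions),
    )

# research_doc = research_codebase(generate_questions(ticket_text))
```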
Um, we also have to get better plans and
like before I was the coding agents guy,
I was the context engineering guy. Uh,
and we talked a lot about, you know,
how you don't quite
use tools in a loop all the time. And
there were two reads of context
engineering. One was like put better
information in the context window. This
is the most common one. Anyone here
built the RAG pipeline?
Uh, be honest, come on. I actually
can't see any hands. I assume some of
you are just lazy and not wanting to
raise your hands. Um, it's also about
getting better instructions in your
context window. Um, and of course Jeff
is here and he's talking later. Uh, and
I used to have to explain who Jeff was.
Uh, very exciting now that everyone
knows who Jeff is. Uh, but he talked
about, you know, the more you use the
context window, the worse results you
get. We talked about the dumb zone.
We're over about 100,000 tokens, and
that's a Claude model number. The GPT
number is different. Um, you're
basically not just
giving the model too much information.
You're probably also giving it too many
instructions. And so an example like
very simple, is like you're building a
customer support bot and you use
prompts for control flow. You say okay
if the input is this do this. If the
input is product feedback, do
this. If it's a billing issue do this.
Give it a whole bunch of tools and say
hey go do the thing. Uh this probably
works but as it gets bigger your your
performance and accuracy will probably
go down. And what a lot of people end up
building is they have an initial step to
classify and then they have smaller
instruction modules for each of the
different classification cases. You
build this workflow and this pipeline
and this will be like faster, more
performant, more accurate. You could
probably use a smaller, dumber
model if you build your system this way.
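A rough sketch of that classify-then-route shape; the labels, prompts, and `call_llm` helper below are made up for illustration, not any specific product's setup:

```python
# Hypothetical support-bot sketch of "use control flow for control flow":
# one small classification call, then a focused instruction module per case,
# instead of a single prompt that describes every branch.

CATEGORIES = ("bug_report", "product_feedback", "billing_issue")

INSTRUCTION_MODULES = {
    "bug_report": "Collect repro steps, affected version, and expected vs. actual behavior.",
    "product_feedback": "Thank the user, summarize the request, and note it for the product team.",
    "billing_issue": "Verify the account email and explain the refund policy.",
}

def call_llm(system: str, user: str) -> str:
    """Stand-in for whatever model client you actually use."""
    raise NotImplementedError

def handle_message(message: str) -> str:
    # Small, cheap call whose only job is to pick a label.
    label = call_llm(
        system="Classify this message as one of: " + ", ".join(CATEGORIES)
               + ". Reply with the label only.",
        user=message,
    ).strip()
    if label not in INSTRUCTION_MODULES:  # a plain if statement is the router
        label = "product_feedback"
    # The second call carries only the one module it needs, not all of them.
    return call_llm(system=INSTRUCTION_MODULES[label], user=message)
```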
Um, and so we took create plan which is
a single mega prompt uh and it's
supposed to look like this very
specific guided workflow, and we split it
across several prompts, um, design,
structure and planning. We're not going
to talk about implementation today uh
but it's got these three different
phases each build on each one and before
was 85 now they're all under 40
instructions. Um the lesson don't use
prompts for control flow if you can use
control flow for control flow. Uh switch
statements and if statements are
actually kind of good.
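Applied to the planning workflow itself, that might look roughly like the sketch below; the phase names follow the talk (design, structure, plan), but the prompt text, file layout, and `call_llm` helper are assumptions:

```python
# Hypothetical sketch of splitting one mega "create plan" prompt into phases.
# Each phase is its own small context window that reads the previous phase's
# markdown artifact, which a human can review and edit between steps.
from pathlib import Path

def call_llm(system: str, user: str) -> str:
    """Stand-in for whatever model client you actually use."""
    raise NotImplementedError

PHASES = [
    ("design.md", "Write a short design doc: current state, desired end state, "
                  "patterns to follow, open questions."),
    ("structure.md", "Turn the design into a high-level outline of steps and how "
                     "we'll verify each one."),
    ("plan.md", "Expand the outline into a detailed implementation plan with "
                "file-by-file changes."),
]

def create_plan(research_doc: str, out_dir: Path) -> None:
    context = research_doc
    for filename, instructions in PHASES:  # ordinary control flow, not prompt branching
        artifact = call_llm(system=instructions, user=context)
        (out_dir / filename).write_text(artifact)
        context = artifact  # the next phase builds on this artifact
        # (A human checkpoint would go here: review/edit the file before continuing.)
```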
Um and this is not just for coding
agents. This is for you can imagine how
this would apply in every single AI
application you might build. Um and it
was kind of funny because we got up on
stage in in June at AI engineer and we
said, you know, don't do full-fat
agents, tools in a loop, no no no,
that doesn't work; make these workflows, do
these pipelines. Uh, and then by August
it was kind of like, with this Claude Code
thing, that thing's pretty good, tools in
a loop might be back. And we turn around
and we build these monolithic prompts of
these giant complex workflows. Uh and so
I decided it was time to drink our own
Kool-Aid and apply this stuff to how we
were doing this. Uh I hear a lot of
times uh well hey Dex, won't this just
get bitter lessened? And I assume none
of those people are here, but all of you
who hang out on Twitter and say this to
me, I want you to know that when you
shout this at me on Twitter, this is the
voice I assume that you're saying it in.
This is what plays in my head. uh
because I think in my experience the way
this works is you got a given frontier
model. It's got some sort of capacity at
a variety of tasks um and through
naive prompting you get to do certain
things and then we come in and we do our
context engineering and we can make it
better at certain tasks that are
relevant to our problem. Then a new
model comes along and makes most of that
work irrelevant better at most tasks.
Maybe it's not as good at certain tasks
but then we do more context engineering
and we make it better at the next thing
at the next set of things and we push
the frontier. we're always going to be,
you know, 1 5 10% past the frontier, uh,
compared to the naive prompting. And
this matters because if you're doing
long-horizon agentic tasks, by turn 20
the difference between 99% and 97% per
turn is a 27% gap, because it compounds. Dan
Shipper, I think, calls this
surfing the models: you can get
better at using the new model faster
than the new model can get better.
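That gap checks out on a back-of-the-envelope basis if you treat each turn's reliability as independent and compound it:

```python
# Back-of-the-envelope check: per-turn reliability compounds over a long task.
p_tuned, p_naive, turns = 0.99, 0.97, 20

all_good_tuned = p_tuned ** turns   # ~0.818: chance every one of 20 turns goes well
all_good_naive = p_naive ** turns   # ~0.544
print(f"gap after {turns} turns: {all_good_tuned - all_good_naive:.1%}")  # ~27.4%
```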
Swyx's take is right: the bitter lesson
will kill this someday, but hey, it
works for now. Let's do it. And we do
that over and over again until we get to
AGI. So if you can get Opus latest to do
more to solve harder problems or maybe
you can get GPT-OSS, which is small and
cheap, to do the work of GPT-5 high, now
you've captured something really
interesting. So mind your instruction
budget. And like, no, having more context
is not going to fix this; it's just the
same amount of attention spread out over
more instructions.
Um and then we also found we got better
leverage. So we split these things up to
get better instruction following but uh
we also got more leverage and so like
you can look at the structure outline
this is the plan that was built from
that structure outline. It's like less
work for a human to review the first
thing and in general these higher order
documents are designed to be higher
leverage in terms of like human reads
less model uses less tokens and we're
talking at a at a higher level before we
go down into the details. Um, so the
design discussion there is basically, you
know, where are we going, what does the
final solution look like. It's got
current state, desired end state, patterns
to follow: you know, what codebase
patterns did the model find that
are relevant to implementing this
feature, what architecture patterns do we
want to follow. Especially in legacy
codebases, there's always six ways to do
everything, and you find yourself being
like, oh no, you found the bad pattern, no,
we have to go do that one over there.
This is your chance to do brain surgery
on the agent before it actually goes and
slops out a bunch of garbage. Uh, it'll
track your resolved questions, your design
questions. Uh, it's sort of like Claude
Code plan mode but written down to
a markdown document. Uh Matt Pocock pulled
it out. This is like the design concept.
Frederick Brooks had this idea of like
this thing that is never written down
but is in everybody's head. So when you
have this shared alignment with the
model of what you're building, uh this
is locked up in the context window. We
put it in a 200-line markdown artifact.
Um and this gives you alignment with the
agent. And so you iterate on this thing.
Again, do not outsource the thinking. You
want to give the agent every single
opportunity to show you what it's
thinking to brain dump its entire
understanding of the problem and what it
thinks you want the solution to look
like. And you say, "Okay, why do we need
humans in the loop at this point?"
Basically, like because you can't RL a
model on architecture because the cost
function of bad architecture is measured
in months and years, not in, you know,
five minute unit test cycles. Um, we
also got better leverage on the
structure outline. Design is where we're
going. Structure is how we get there. If
you want to map this to meetings that
make engineers miserable, you have your
design review, your architecture meeting
and then you have your sprint planning.
Uh so you take your design and all the
previous stuff and you build your
outline and it's just a high-level outline
of what we're going to do and how we're
going to check it along the way. Um and
again it means lighter reviews. You can
read this and understand where we're
going. Um we need humans in the loop
here too as well because somehow models
just absolutely freaking love these
horizontal plans. Uh, by which I
mean you know models I'm sure you've
seen this models love to do the whole
database layer and then the services
layer and then the API layer and then
finally the front end, and before you
know it you're on the other end of
1,200 lines of code and something is
broken and the surface of stuff you
might have to debug is quite large which
means it's going to be hard for you it's
going to be hard for the model to figure
out what's wrong. And so what we advocate
for, what we do when we use this
stuff internally and what we say to our users,
is build it the way you would
have built it as an engineer: you don't
write 1,200 lines of code before you
check. You write a little bit of code
and then you check something. Write a
little bit of code, then you look at
something and then you wire your biz
logic and then you do your error
handling. And so it's like this is your
chance to re-steer the model. They're
just markdown docs. You can ask for more
detail, but they start super high level.
So if you don't trust what the model's
going to do in your high-level doc, ask it
to add more detail. Um, and the way you
get more leverage from the plan is you
basically you don't read it. It's
for Claude. It's for the agent. You just
bot check it. You save that deep review
for the actual code. Um, but it does
have the line by line changes. Uh, and
it's not just about like human to agent
alignment. It's also really powerful for
human to human alignment. I think like
before AI, it was very common for a
two-hours-of-coding feature to take two days
because you have to plan and align with
this is in, like, large orgs, you know,
hundreds of engineers. You got to align
with other teams. You got to do the
planning. You got to do the code review.
You got to rework it. You got to test
it. Uh, and if you're just using AI for
coding, then you will go faster, but
you're not going to go 2x faster. You're
not going to go 3x faster. Um, but if
you use the model to help you do
planning and alignment with your team,
then you're going to get better
alignment because it's going to be more
thorough. And then your code review is
also going to be faster because everyone's
aligned. It's basically like there is a
thing that is not worth doing if you are
writing it by hand into a Google doc to
share with your team, but is worth doing
if AI can help you do it and it really
helps us like compress code review
cycles as well. Uh, I'm sorry I don't
have an answer for you on testing and
verification. Uh, much has been said
about this. That's for another talk. Uh
so if we want to put this all together
um you know five phases to actually even
just create the plan and then we go
implement it. We call out questions,
research, design, structure outline, plan,
work tree, implement, pull request. Uh, it
doesn't make a good acronym, so uh we just
picked the ones we liked. This is QRSPI,
"crispy" if you want. Um we found it's a
really powerful way to get teams rowing
in the same direction. Um, in terms of
the things that we
have to solve: okay, three steps
was already a lot and took seven hours
of training, and now another seven. I don't know
if anyone else has built a crazy Claude
Code-like command system and then tried
to teach it to somebody else and just
stormed out because it was frustrating.
Uh we also have to know like what's
working and why? What is the impact of
this stuff? And then we want to know
like you know if we want to make changes
to our prompting system, how do we
evaluate that? Um, I have a whole other
talk on like how do you drive AI
adoption in a large team where like you
need a process and then as Eigor would
say, you need like a defensible metric
and then you need somebody off in the
corner who's like shipping like crazy
and it's definitely not slop. Uh, and
then you can kind of do this. But you
can go do this. You can go try this in
your team. There is no magic prompt.
Like we actually don't publish the
crispy prompts, because like the core
of this is: understand context
engineering and instruction budgeting,
and if you're not getting good
results, break it down into smaller
workflows. We actually leave the
derivation of the prompts as an exercise
for the reader. You should go get the
open source ones from HumanLayer
and try to take the three prompts
and break them up into like eight. Um
try your own stuff. Come to your own
conclusions. Like I said this was me 10
minutes ago got in the trenches with our
users. Uh, be ready to spend hundreds of
hours watching people struggle to
use your stuff. It is immensely
frustrating and gratifying and my
favorite part of the job. But if you're
not up for this, like consider whether
you want to build your own prompting
system. Uh, if you want to help with
this problem, uh, we're building an IDE
for uh, you know, just collaborating on
coding agent sessions. Got all these
opinions baked into it. Uh, we're open
to design partners. We're hiring
founding engineers. Send us a note.
founders@humanlayer.dev. Uh, keep
learning, keep being wrong, keep, uh,
you know, adjusting your
understanding of the meta. It's always
changing. Thank you so much. Uh thanks
Dax for the warm intro. Thank you Miami
for uh welcoming us in. And I'm really
excited to hang with you guys and the
rest of the speakers and uh learn with
you all this week. Cheers.
Oh,
>> okay. Way to go get the
>> Thank you, Dexter.
>> Okay. Well, thank you for your
attention. Okay. And before we introduce
the panel coming up next,
>> and we do see some chairs. They look
very comfy.
>> Uh we're gonna thank our sponsors
because without our sponsors, we won't
be able to be here and gathering
together. So I want to thank Code
Rabbit, Serbus, Mintlify, Sentry,
Tailscale, and Cloudflare
>> and Modem and the Aify, Auth0, DeepMind,
>> Encrypt AI, and City Furniture.
>> So give it up to our sponsors for
bringing us together.
>> And as we're setting up, Iman is going
to introduce.
>> Okay, so where are we? Who's next? So
far we had Dax,
then Dex. Guess who's next?
Max.
Uh, okay. Max is joining us from OpenAI,
the folks behind ChatGPT. I tried to
come up with some AI jokes for the
conference. I tried ChatGPT. They were
unfunny. So, feel free to switch to
Google Gemini. Um
Um, yeah. Unless
Max can change our mind. Let's see how
we can do. And uh instead of a talk, Max
is going to surprise us with a panel
talk. And uh this panel includes himself
of course as well as Eric Thorelli who's
going to moderate the panel, Sunil Pai,
and Ben Vinegar. Let's hear it for them.
Sure.
Sure.
Oh,
>> now.
>> Okay, cool. Uh, how's everybody doing?
>> Uh, we're going to try to do a group
selfie to start this thing off. Uh, can
we get the lights up? Yeah. Okay.
Awesome.
>> Hey, hold on. Don't Can everybody like
stand up, scratch, get some energy? Yes.
All right.
Miami.
All right.
Nice.
Okay, cool. Uh, I'm going to post that,
but instead of doing like 15 seconds of
awkward silence while I post it, I'll
just not listen to your guys' intros.
>> Thanks.
>> Um, is your mic on?
>> I think so.
>> Okay, cool. Um, so, uh, my name is Eric.
Uh, I work at Code Rabbit. Um, I I'm the
head of DX at Code Rabbit. Um, and
that's all I'm going to say about me
this whole talk hopefully. Um, I thought
we could go around uh introduce
yourselves uh maybe what you what you
did and your career, who you are, what
you did, and uh when you first learned
that you have taste.
>> Oh jeez. Uh how much time left? 30
seconds. Hi, I'm Max. I work at OpenAI
now. I work on connectors across ChatGPT and Codex. If you've ever had ChatGPT and/or Codex connect to anything other than itself, uh, that's what I work on now. Um, I'm probably most widely known for making styled-components, which was a CSS-in-JS library that lots of people used back in the day, uh, that I think was widely renowned for its taste.
Um, hey, I'm Ben. Uh,
if you know me, it's maybe because I
worked at Sentry for a long time, worked
on that product. I'm told it has pretty
decent DX. Um I actually started as a as
the JavaScript developer who was
responsible for making that SDK work uh
for the JavaScript community. So kind of
worked from that perspective and I don't
know people use it. So I think that's
something.
>> Uh hi, my name is Sunil. Uh, I'm tech lead on agents at Cloudflare. Spent some time on the React team, uh, Oculus, spoken at React Miami, fun. Uh, I had a very boring real-time multiplayer infrastructure startup which had the best company name, called PartyKit. So I have some taste.
>> All right. So uh taste is one of those
things that everyone says they have.
Uh but I think we should define it. So
maybe go starting from the the opposite
end. What is taste?
>> Uh so taste and imagination I think are two sides of the same coin. Uh, the first one is about focus, and trying to remove the things that don't matter, in terms of experience, in terms of storytelling, and so on. And imagination I think is about broadening it from a place of taste, to expose yourself to everything that humanity brings. So, full of myself, I love that I get to answer this first, but yeah, I think it's about like focus and expansion.
>> I am not possibly smart enough to follow
that up so I'm just going going to go
over to Max.
>> I think taste is what we call opinions that make sense. That's really what it is. When people feel like an opinion makes sense and they can't quite put into words why, they just say that person has taste.
>> Okay. All those answers were boring. So,
uh, can we get what is bad taste? What
does bad taste look like?
>> Opinions that don't make sense, Eric.
Obviously. Obviously.
Ben, are you gonna
>> What? What is bad taste?
Man, you told me these were going to be
kind of spicy and I was not prepared for
this. Uh, I got to punt again.
>> Great taste.
>> What is bad taste? I think when people
focus on the things that they just like
as a person and try to generalize it to
like everyone else, that comes off not
very tasteful.
This isn't a very JavaScript focused
panel.
Okay, we can make it more AI here. Um, so I'll save the spicier questions around AI as follow-up. Um, does taste scale with AI, and how can you make this scale with AI?
>> One of the interesting experiences I've
had about working with agents over the
past uh especially over the past six
months is that agents can solve lots of
problems now but they often solve them
at the wrong layer. I don't know if you
guys have noticed this. It's like you
give it a problem and you can solve that
problem usually at the front end level
or the back end level, maybe at the
database level, whatever, right? Like
there's lots of layers where you could
solve a problem and AI still today
pretty reliably picks the wrong one in
my experience. And you end up with a
solution that's like 3,000 lines of code
that somebody has to review that's like
it kind of solves a problem but also is
kind of messy. And one of the hardest
parts I found about working with agents
even still today is teaching them how to solve the problems at the correct layers. Um, and I actually think that's a lot about taste. It's about knowing, and having the experience to know, oh, if we solve the problem at this layer, it's going to have these trade-offs and ramifications; at a different layer, it's a different set of trade-offs and ramifications.
And I feel like agents don't really have
that today.
I'm gonna answer this question. I'm going to answer the other one from earlier. I'm just slow. I need time. Um
I think taste is maybe like a shortcut
to assessing something's quality. I
think that people who do not have taste
need to look at things like metrics or
box office numbers or number of
downloads to say, "Oh, this is good, right? This is good because I see these statistics or whatever, that is evidence that it's good." Whereas I think that somebody who has taste can look at something and go, that's good, right? And then maybe over time that's validated in some way. Now, as it pertains to agents
and why I think taste is important in
the age of agents is if you have that
shortcut ability if you can look at
something if you look at the output and
you can quickly go that's good you could
be way more effective in the age of
agents, right? Agents can produce something, you can look at it and go, that's good. Somebody without taste might not find out that answer till way later, and you could be way deeper into the hole of how horrible this code has gone, or the numbers don't make sense, before you realize where you are. I don't know if that
makes sense.
>> Uh I fundamentally don't believe AI
helps with taste. Well, LLM specifically
because um so it's it's the idea that
all ideas are spread out in latent
space. So if you imagine it like a 2D
map and you ask it a question, hey, I
want to like solve this, make a UI, uh
it starts honing in on particular areas,
but stuff like taste are things that
like span ideas uh across different
parts of latent space. Oh, what if I
took this popular art style and I
applied it to this movie, so to speak.
Uh so I don't fundamentally think latent
the exploration of latent space helps
with forming taste. There's a really
good book called Where Good Ideas Come From by Steven Johnson that goes into it. You should read this book, by the way.
It's a very old book and it's dope.
>> Yeah, it's so nice.
>> I actually think I kind of disagree, because I feel like it does help get more parts on the table. It does help
explore the space much quicker and you
can do way more iterations of figuring
out which parts of the space that you're
exploring actually matter to you. You
can build 10 throwaway things and
therefore build up your taste through
knowing which of them suck or don't
suck.
>> Right? So all of these have to start
with like one person saying, "Oh, I want
to explore the space. Okay, now I want
to like do this." And then I get to sit
and decide, "Oh, fine. This is like the
thing that connects it. This is why like
you end up with these like vibe-slop UIs from people who don't know any better.
>> Yeah. I don't know if any of you have
ever looked at a Figma file where the
designers, you know, they end up making
these explorations that go down and to
the right. I don't know if you've ever
seen a designer work this way. It's like
you go down, you make major iterations,
you go to the right, you make minor
iterations, and you end up with the
staircase of explorations of ideas, and
at the end is the final screen that
you're supposed to be looking at. Uh,
and I feel like exploring with AI is
very similar to that.
>> So can I drive the AI to explore? Could I just say, don't make it sloppy, use good taste? Or is there more to it?
>> No, you don't.
>> Yeah, you you can't I agree with No, you
have to know what to explore. And I
think that's the taste that you have to
apply.
>> And and where does that taste come from?
>> You have to watch good movies. You have
to read good books. You have to listen
to good music. I'm not even joking. Like
every time I hear a VC talking about taste being the moat, and I see that they have like a crypto ape avatar, I'm like, "Oh, I can't trust this person with taste at all."
No, this is how you develop taste by
like just reveling in what humanity has
to give you.
>> I mean, I agree. I I've been told I have
good taste. That's like weird to say
that.
>> You have like such a great haircut and
like this is clearly part of your
identity. I mean it, by the way. I'm
like I'm not gassing you up just on
stage. Like this is a person who knows
how to present themselves. Ah, okay. I
I'll I'll take that. Um, but I wanted a
plus one. Like I actually agree with
like all the pop culture stuff. Like,
you know, I don't think it's a waste to
have have watched every single Ninja
Turtles cartoon that has been produced.
You know, there's a lot of good stuff in
there. Um, and extrapolate that to
music, art, um, you know, photography,
whatever. It's all good.
>> So, what I get out of this is that I can
expense a trip to Napa to taste some
wine because I'm developing my DX taste.
Uh,
>> You joke, but like especially the first two hours of a Napa trip is good. Past that, I assume you're like not really seeing straight. That's where you need to do your research.
>> There's a lot of gaudy, gaudy stuff in Napa, just to warn you. So,
>> okay. Um, okay. So, you talked about the
moat as well. Um, we need to get a
little bit more controversial here. We
need to get you guys stop being so nice
to each other so we get some
disagreements going. Um, you talked about taste as a moat. Some people say that, even people that don't have taste. Um, you guys are all three kind of leads of developer experience of the previous generation, the pre-AI era.
>> He just called us old.
>> Uh, trying to be nice. Uh, but now you are all three heavily invested in the AI era. Uh, the moats that you were able to create with DX before, do those transfer automatically, like the wine tasting and movies? Do they transfer to the agent era, or do you have to do something differently?
>> Everyone's pausing here
>> Like, yes and no. Like there is
clearly like I feel so glad that I have
by the way uh for the young people in
the crowd you go from being the youngest
person in the in the room to being the
oldest person so quickly. I still dress
kind of like a child but like I'm very
aware that my back hurts sitting up here
right now. Uh I'm so glad that I have a
couple of decades of experience behind
me to have formed opinions like with my
bare hands. I mean like creating UIs
etc. So I can like look at grid lines
and say okay this kind of sucks. this
doesn't and especially a couple of years
ago LLMs were like particularly bad at
it. Uh that being said, I'm having so
much fun right now in doing this
exploration because there it's not just
that AI makes things faster. It's that
things I wouldn't have even attempted
because it would have taken that much
time I can now do which means I'm
actually trying out more. It's not just
a compression. So being able to uh use
Chang's new pre-text library to build a
wild UI experience that just would have
been out of scope for me like four years
ago. So like yes and no. Like yes, I'm
so glad that I have opinions on how
these things should be built, but now
I'm actually getting to explore the bits
that I just couldn't put the effort for.
That does that does that help?
Yeah, I I think the no part is that so
many of the skills that we hone or I
should speak for myself. So many of the
skills that I honed are kind of no
longer relevant. Like I type really
quickly. Who the [ __ ] cares? I just
dictate everything.
>> That that is a huge advantage. I don't
know what you're talking about.
>> It used to be. It used to be.
>> I can prompt so fast.
>> I just speak into it now. You know, like
I have a whisper mic and I just whisper
into it and I dictate everything and I
because I no longer need to dictate
syntax. like it doesn't matter that I
can type really quickly anymore, you
know, and I think there's skills there's
many skills that we've acquired over the
course or there's many skills that I've
acquired over the course of my career
that are definitely less relevant than
they used to be. I do think that knowing
what to work on and knowing what to
explore is probably one of the biggest
ones that I do still use every day.
>> Try not to like talk all the time. Okay.
Um,
so
we could go a couple ways. Let's do this. Um, right now, going fast. You don't need to type fast. Great. You can go 100x, 1000x. You can run, in parallel, 100 agents for the same prompt. Let's see who does well. So you can go super fast. And we see this in products now, uh, where, you know, the Codex team and others are producing software that would have maybe taken years before and doing it, you know, in two days. Um, if you have
to compete in the market at that
iteration speed
how is it possible to have taste?
>> By the way I absolutely hate this 100
agents background agents thing. I
understand that the OpenAI people want
you to do that. I get so
no 100%. Uh
>> how many agents are you running right
now while we're sitting here?
>> Zero.
>> I don't trust these things at all. I need to see reasoning traces.
>> I have zero running right now.
>> Yeah, that's I think that's
>> Yeah. No.
>> So, there's a difference between like
velocity versus just spray and pray.
Like if you actually want to
uh like if you want to stand out right
now, you kind of want to do less and be
known for a particular way of doing
things. Like you have to develop a
brand, you have to develop focus on the
things you're trying to do. I say this
because and like I work at Cloudflare,
we like ship a lot, but the things we've
been working on are things that we have
actually been working on for years. Uh
we shipped like we we just had like a
ship week last week and we shipped a
number of things and I looked through
all the announcements. And I was like,
"Yep, these are things we decided to do
like four years ago." Uh, it has it
means that we're still doing the things
we want, but we don't want to do like
everything. And I would highly recommend
if you're in the if you're a builder,
you're a creator, uh, you should not try
to build four products at once. Like you
kind of want to find focus. You want to
like kill two of them and say, "I want
to build two great experiences at the
moment." Iterate on that. Find
explorations in that space. But bro,
this entire 100 background agent thing,
I just No, bro. I No. There. No. Oh. Oh.
>> Um
I think um you know we are still
building products for humans. And on the
topic of taste, if you're building a
product for humans, you have to put on
your human face and actually like try
the product and to use it and to
evaluate whether humans will enjoy this
or they understand it. And that remains
the biggest bottleneck if you want to
call it a bottleneck, right? Anyone can
make a million things but whether
they're good or whether anyone wants to
use them is another thing. So I I know
in my experience like right now over the
last year using a lot of agentic coding
like that is the barrier and even though
we can produce things like really
quickly I often look back and I'm like
this was garbage you know and of course
it was because we didn't even really try
or use it right. Um, I don't know.
>> I I think the area where I find myself
having maybe not a hundred, but a bunch
of background agents is um when I'm
working on a thing that's relatively
large that splits up into discrete
pieces and I can parallelize the
discrete pieces where each discrete
piece with GPD4XI might take 45 minutes
to hours, right? I'm not going to sit
there for 45 minutes to an hour just
watch Codex do its thing. I spin up
another coding extension, then I start
working on the other part in parallel.
Often it's like I'll work on the iOS,
the Android implementation, and then the
back end implementation kind of at the
same time and just kick all
three off, wrangle all the agents back
and forth, right? Um, but I don't often
find myself parallelizing across discrete tasks, because having five things running at the same time and them all working on different things is so much context switching in my brain. I can't keep up.
You're running tests and you've got like Vitest on like 10 runners. Do you go and say, "Yeah, I use Vitest with 10 runners"?
>> No.
>> Yeah. I'm just bringing that up because I almost think this language of like sub-agents... oh, I do use sub-agents. To me I'm just
like still one agent. Does that make
sense? This is like almost like a
technical detail.
>> Yeah. I uh kind of although I actually
use separate agents this is why it feels
more separate.
>> Okay. All right,
>> Because we actually run all of our agents on dev boxes in the cloud. You
actually have multiple laptops basically
in the cloud that you run the agents on.
You have to wrangle multiple of those.
>> I asked these question like I even want
to know these answers, you know. So
>> that's why we're here. That's why we're
here.
>> Um you you said something interesting a
moment ago too, which is that we build
products for humans. Uh some people are
building products these days also for
agents. Um is is developer experiences
DX the same as agent experience? Agent experience we can maybe define as, just to use the word in the definition, the tasteful experience for agents: the thing that's attractive to them, easy to use, efficient, effective. Um, is it different when building for agents, for agent experience?
>> Uh, 100%. I think until at least a few months ago we kept saying, oh, if we design it well for humans then agents will use it well. At this point I think that's cope. By the way, uh, agents have a completely different personality, etc. Uh, for example, and I'm trying very hard not to be a shill, we have a thing called Code Mode, where we let agents interact with your systems by generating code.
>> Oh, this whole thing I just put it on. I
didn't even realize I was going to be
thinking about it. Uh we've learned that
agents can interact with systems by
writing code that interacts with them.
Uh so in this assumption like we
wouldn't design it for a human by
saying, "Oh, every human being can write
code that interacts with systems." It's
a completely different kind of like
behavior. And uh it uh the way the way
I've been talking about internally is
that you if you really loved human
beings when they were your users, you
need to really love agents as well. Uh
like where do they hang out? They don't
really hang out in pubs. They hang out
in like registries. They dream in like
syntax errors. Uh you have to like do do
you truly love your users? Like you have
to like find out what it what are agents
desires. And it turns out they love
writing code. They love being told like
thank you and things like that. Uh, no, I'm like 100% on this: 2026 is the year where we actually classify them as different alien beings and we learn their personality and create systems that they like interacting with. But yeah, I'm 100% on this, by the way. Like, I think we kept saying, oh, as long as the docs are readable by human beings they'll be good for agents. No, screw that. Like, dump a bunch of context and tell them to figure it out. I'm there now.
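A rough sketch of the "agents write code against your systems" idea Sunil mentions; this is an illustration with invented names, not how Cloudflare's Code Mode is actually implemented:

```typescript
// Rough illustration of the "let the agent write code" idea, not
// Cloudflare's actual Code Mode. All names here are hypothetical.
type Bindings = {
  listInvoices(customerId: string): Promise<{ id: string; total: number }[]>;
  sendEmail(to: string, subject: string, body: string): Promise<void>;
};

// Instead of wiring each function up as a separate tool call, you hand the
// model the TypeScript type above and ask it to emit one small script.
const generatedByAgent = async (env: Bindings) => {
  const invoices = await env.listInvoices("cust_123");
  const open = invoices.filter((i) => i.total > 0);
  await env.sendEmail(
    "billing@example.com",
    `${open.length} open invoices`,
    open.map((i) => i.id).join("\n"),
  );
};
```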
>> Okay, we've got 50 seconds left. Uh,
let's leave the audience with one
sentence, very practical, none of this
highfalutin stuff. Uh, how they can have taste in their day-to-day. What does it look like? What's the behavior?
>> Use 100 parallel Codex agents.
>> We said no shilling. Yeah.
Um, I don't know. Go to go to an art
gallery, go to a museum. And I bring
that up really quickly just to say that,
um, man, I remember 10 years ago, 12
years ago, online personas were way
richer in that I learned about how
people went and explored the world. And,
you know, the algorithms today really
just make us like singular, you know,
code monsters or whatever. And so, I
don't know, just want to make a comment
on that.
>> No, no, I'm I'm there. like uh have
friends, go out for brunch, watch
movies. Uh it surprisingly affects the
quality of the work that you build.
You're empathetic to
>> be human.
>> Yeah. Like
>> Uh, thank you all very much. Thank you, panel.
>> Thank you.
Okay everybody.
Okay everybody. So just a call out. About 400 people are watching us on live
stream. Isn't that amazing?
>> All right for those of us here. There's
coffee for you out there. For you on
live stream. Good luck. Um, feel free to
grab your coffee, go to the expo hall,
and uh, pick up a few goodies. Let's be
back here at 11 sharp. See you soon.
>> Shout out to
Ladies and gentlemen, please take your
seats. Our event will start in 5
minutes.
Ladies and gentlemen, please take your
seats. Our event will start in 2
minutes.
They're gonna come on stage.
Yes. Send them on stage. I'm ready for
them.
>> Thank you.
Once I see the curtain move, we'll start
fading the music.
We are ready. Send them to the stage.
>> Hello. Hello. Welcome back.
>> How's everyone?
Anybody talk to new people? Raise up
hand if you met somebody new.
>> Some hands going up. We have plenty of
breaks for you to network with other
people. Uh but now we're back for some
more exciting talks. So for our next
speaker, you just saw him, he was in our panel earlier: Ben Vinegar. So I'll
welcome him onto the stage. Welcome,
Ben.
Welcome back. Okay, so I'll do a quick
intro of Ben. So Ben is the co-founder
and CEO of modem. So they have a little
booth over there. So during a break feel
free to chat with his team. Uh so Modem is an AI platform for PM work. And Ben
has been in the AI space for several
decades. I'm not going to spoil how long
it is
because Ben is going to explain himself.
Um so he's going to tell us a little bit
more about working with coding agents
over SSH. So, we're gonna transform from
local to being remote. Ben, take us
away.
>> Thank you. Um, hey everybody.
Is that Is that readable?
>> Yeah.
>> Barely.
I'll give you I'll give you a one plus
zoom. Little better. All right.
Um, so, hey, I'm Ben. I'm back. Sorry.
Um, and
I just want to talk a little bit about a
little bit about working with coding
agents over SSH. And I just really want
to preface this by saying that uh oh
well first of all
have you seen these graphs before? Are you going exponential from rise? He was at Vercel and I at opencode. Um, I was shocked to see this, but this is a little bit what's going on in my life just over the last couple months. It is exacerbated because I'm old and I have like a really old GitHub account, so it looks more crazy. Um, but I have been doing a lot more of this. Um, but this is not a talk about, you know, drop everything, use Linux, use Omarchy. I don't really know that. I'm actually just trying to present this talk to you as just, like, a normal Mac enjoyer.
Okay. And I hope that that's like a lot
of you. Like I this is more exciting to
me. Um I guess we just had a bit of an
intro, but repeat some of it. I've been
programming for a long time, mostly web
and JavaScript. Um but I actually
started in graphics driver development.
I think it's kind of fun. Um I've spent
my whole life working as kind of an
early employee at startups. And I've
been prompting since 2023. And I mean
that to say for me the first prompting
in like VS Code and Copilot was when
somebody explained to me that I could
like write a comment and then I could
gen you know and then the like early
agents would generate a little bit of
what you wanted back then. So I I I
think of that as kind of like early
prompting and yeah I'm just like a you
know a normal IDE user. I like Mac. I've used text editors, VS Code, all that.
So I work at this company we started
called modem and come check out our
booth. I won't spend a lot of time just
to explain that AI coding is real. Your
ability to deliver software faster is
definitely real, but sort of the
mechanical product work around that like
capturing user feedback, following up
with users, like that's still pretty
slow and that's the problems that we're
trying to solve. If you're interested,
go find out um at our booth.
How we've been building modem is maybe
um kind of interesting and relevant to
this. So, it's 99% codegen. When we
started about a year ago, we made the
decision that we were just gonna become
a completely agent company. Kind of like
good timing. I think this is like in the Sonnet 3.7 days. There are six engineers; the code base is about 270,000 lines of code, and there's
quite a bit of test code. Um, just to
give you an idea of like what we're
working with and and as testing is
involved, we we to make that work,
there's like a lot of tests. Just as a
random aside, this is relevant to the
talk, but like agents love to generate
mock tests. We kind of throw that out.
We make them use like end to end
database tests. So these tests are
actually pretty heavy.
So,
I don't know about you, I have felt this
way, but when I go on Twitter and I go
online and I see these posts by people
who are like spinning up 10 agents in
like with like custom harnesses and
like, "Wow, I'm doing all this stuff."
And I just I just didn't get it. I
didn't understand how I could work this
way. Um,
and I wanted to achieve more. I felt
like more was possible. So I started to
like think about what were the things
that were slowing me down and could I
address them.
One of the biggest ones and I don't know
this is a question I have which is
how many of you run your coding agents
in like you know living dangerously yolo
mode 100% of the time.
I can't see the lights are blinding
everybody.
I think it's like 40% maybe.
Um I think agents are scary. They can do
lots of crazy things. I have, you know,
stuff on my computer. I don't want
things to happen. I've experimented with
jailbreaking.
You can do it. You can mess with them.
So, uh this is like what would often
happen is even if I had approved a
million rules, I'd often like start jobs
and I'd come back, you know, an hour
later and discover, oops, you stopped 30
seconds in. Very frustrating, right?
Uh, another thing that was slowing me
down is just like pure compute
resources. Like I mentioned, we run a lot of tests; I think running unit tests as part of building with agents is critical. I'm running them all
the time. When you've got a big test
suite and it's kind of like hitting the
database, man, I would hit you know 100%
CPU all the time on this machine even
with just a couple agents if they were
going through like testing loops. Fans
spin up. I could barely browse the web
or do anything else like that was
getting pretty frustrating.
I'm on the go a lot. I like, you know, I
took a plane here. I wanted to do work
on the plane. The Wi-Fi was pretty spotty. They don't have Starlink on Air Canada. Um, and so often I'm just in environments where the internet just wouldn't work for me. Um,
so, one, there are solutions. There's people who build solutions for this, like cloud agents, or Claude now has managed agents. I think opencode is working on something like this. Big asterisk: this is changing all the time, right? It's so hard to talk about; an experience from two months ago could be totally different today. But if I think about two months ago, I was experimenting with these products that let you build in the cloud and I just was never satisfied, partly because I wanted to run tests that hit the database, and, you know, on Claude I'd hit a problem where they had a sandbox environment with a network proxy; I couldn't get out to my database provider, and I
just got very frustrated. What I ended
up doing is even if I could start some
work in like a cloud agent, I'd end up
bringing it in locally and working on it
and like throwing half of it away
anyways and I just didn't feel like it
was getting faster.
So
I considered how I was failing and it
brought me back to Linux which was sort
of like why am I exploring all these
kind of like halfbaked versions of Linux
sandboxes like why don't I just do the
same thing.
So the way that I work today mostly
looks like this. I have this machine and
then I use SSH plus Tailscale. Tailscale is a sponsor here. Um, I remote into a machine. It's using tmux, which I'll talk a little bit more about. Um, and then I got a coding agent in there.
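What that workflow looks like in practice is roughly this: SSH to the box over Tailscale and attach a persistent tmux session. A minimal sketch, with a made-up hostname and session name, and not necessarily how Ben wires it up:

```typescript
// Minimal sketch (not from the talk): reconnect to a persistent agent
// session on a remote Linux box over Tailscale. The hostname and session
// name are hypothetical; "tmux new-session -A" attaches if the session
// already exists, otherwise it creates it, so a dropped connection just
// means running this again.
import { spawnSync } from "node:child_process";

const host = "basement-box";   // hypothetical Tailscale MagicDNS name
const session = "agents";      // hypothetical tmux session name

// -t forces a TTY so tmux (and the coding agent inside it) renders properly.
spawnSync("ssh", ["-t", host, "tmux", "new-session", "-A", "-s", session], {
  stdio: "inherit",
});
```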
Um and I don't know again I'm just like
a normal you know IDE user. I've heard
about these words like tmux, and they mostly scare and intimidate me. Um I
know just enough Vim to quit. That's
like it.
So, you know, humor me that when I'm
presenting this to you, it's like not
it's like I don't consider myself an
expert. There's probably like 20 people
in this room who are already really mad
at me for for explaining Linux wrong.
So, you know, when I did this, first of
all, I don't care. People really, you
know, Linux Linux distros, they have
like strong opinions. Um, iuntu is fine.
I use Arch. That's fine. Um, if you just
want to have one of these, you can go to
a VPS provider. You can click a button
and they'll spin up like a Linux
environment for you, you know, right
now, right? Pretty easy. If you want to
bring your own computer, it's more work,
but it's, you know, you'll probably get
more compute. And I think that, um, this has just been real: you know, computers are expensive right
now for a reason, right? There's a lot
of demand on compute. And, um, if that's
of interest to you, I think that that's
worthwhile. Um, I have a machine that's
in my basement that I that I've
exclusively, you know, dedicated to
this. It's not a Mac Mini running
OpenClaw. It
it is it is just a plain computer with a
Linux distribution on it. And um, if you
don't want to set up Linux, you know, my
tip is let your agent do it. I think
this is like, when I started getting into Linux in the last six months, uh, I just didn't want to do it. But once I learned that your coding agents can actually kind of configure it and get it going for you, it became a lot more approachable.
Tailscale is pretty much just like an
easy way of connecting to your machines.
Um, man, they're here. So, I don't know
that I'm I'm want to talk too much about
this, but you get a private network,
connect to your machine, you don't have
to go and expose a bunch of ports. Works
everywhere. It just lets you lets you
connect.
And then tmux is like a window manager, basically, for Linux. Um, you get windows and panes, and I'll show you this in a
moment. If you squint and if you
pretend, you can pretend like it's like
Mac OS. Um, it supports the mouse, which
was like shocking to me, I guess,
because I just never really bothered
with this stuff. Like actually a lot of
terminal programs support the mouse. And
I'll show you some of that. A big thing
that tmux gives you, and its predecessor screen, is that you can rejoin these sessions, because you will disconnect. Like, I'll close my laptop, sever my internet connection. Um, SSH is gone. I can come back later, and then tmux will give all that back to me. And then the last thing is
it's agent scriptable. So I got a little
demo here. And
so over here, this is, um, Linux running on a VM on my machine. I've tried
to do this talk with like a like a fully
remote machine, but I've learned that
that is like not a good idea, especially
if you're like trying to live stream at
the same time. Not a good idea. So, just
to kind of show you right now is like
I'm um
right now like I'm on my I'm on my Mac
and then I can kind of get back into my
machine. I've got like this little uh
Arch logo here to help me understand
where the hell I am.
And uh Oh no.
Oh boy.
Oh, this is the problem when you don't
actually know this stuff very well.
We're going to have to open up like a
coding agent to help me understand how I
can get back to my um to my to my thing.
Well, that's okay. We're over here. Oh,
right. All right. I was doing a new one
anyways.
All right. So, really quickly, this is like tmux. It just looks like a terminal, right? That's what it is, except you can kind of like have panes. Um, right, I could make more split panes. Hey, I can use the mouse. I
can drag this stuff around, which is
kind of neat, right? So, over here, I
can go and I can um I could say like run
a server over here,
right? Oh my goodness.
Well, I've forgotten like how to do all
my demo stuff, but anyways.
>> Yeah.
Look, when you got the lights flashing
in your face, like, you know, and I
could open up like kind of a
I can't even actually like see it very
well, which is like not what I was
expecting, you know? I could open up
like an editor over here, right? And
then I could even like do uh um
um you know, I could have diffs over
here or whatever, like whatever, right?
Um and I think this is pretty neat.
But like I didn't actually know those
commands just a few months ago. So
gonna open up open code here.
I'll just give you an example of just
kind of how I think like coding agents
have made this more accessible which is
like, hey, you're in a tmux session, open up some panes and put some cool [ __ ] in there that's Linux-y.
All right. Um,
there it's going right. So, it's firing
up. It's creating panes. And the way that I work with this... um, if you're wondering why this is fast, it's because I'm using Kimi K2.5, just because, for demo purposes, it's the only thing that's going to finish fast enough. Um, but I've got different panes here. What'd you do? You gave me htop. You gave me some live disk usage. Digital rain.
I don't I don't see that one. But
anyways, you know, I could also be like,
"Okay, now close them.
You made bad choices."
Okay. Right. So, let's bring me back
here. So, tmux, if you do want to mess around with it, like, you can just start with having the agent do stuff. And the reason this works, and I didn't load any skill files or anything, is that tmux is controlled using the CLI. So the agent is actually just calling a bunch of shell commands to do all that. It doesn't need MCP. It doesn't need anything, which is pretty neat. And these are some of the commands it can do. I wish I knew these earlier, like listing the panes. You can split the window. The other thing that's interesting, and we'll come back to this: it can read the content of a pane. It can act on it. It can actually send keys. It can actually be like a little driver of them. So if you start
messing around with this, you end up
having an environment that looks like
this. I was pretty zoomed in so that you
could see this. If I'm actually at home,
like I'll zoom out and I actually have
like quite a bit of surface area. And
it's not just PES. You can have multiple
windows. So like this is how I actually
get four agents running is I'll have
maybe a couple different projects
packages and I'll work in them and I'll
have like a like my whole environment
kind of split up this way.
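To make the "agent-scriptable" point concrete: the subcommands below are standard tmux CLI verbs, which is why an agent can drive them with plain shell calls and no MCP server. The pane target and the commands being run are made-up examples, not anything from Ben's setup:

```typescript
// Illustrative only: the real tmux subcommands an agent shells out to.
// Targets like "demo:0.1" (session:window.pane) are hypothetical.
import { execFileSync } from "node:child_process";

const tmux = (...args: string[]): string =>
  execFileSync("tmux", args, { encoding: "utf8" });

// List panes in the current window (index + running command).
console.log(tmux("list-panes", "-F", "#{pane_index} #{pane_current_command}"));

// Split the window horizontally and start a dev server in the new pane.
tmux("split-window", "-h", "npm run dev");

// Read what a pane currently shows, e.g. to hand logs to a coding agent.
const paneOutput = tmux("capture-pane", "-p", "-t", "demo:0.1");

// Type into a pane as if a person were driving it.
tmux("send-keys", "-t", "demo:0.1", "htop", "Enter");
```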
So at the end of this if you go through
this exercise like one you know you can
skip permissions all the time because
you have a machine that doesn't you know
mingle other data and if it gets ruined
you can just spin up another VM or
whatever. It's not a problem. You get
access to more compute, and actually I'm just straight up into having other machines do the work, so that I have way better battery life on this laptop when I'm moving around. I don't have the fans spinning up. I can do other work; I do some video stuff. That's cool. Um, always connected, fast internet 24/7. So that 1.0-megabit stuff is real. I
recently took a train to Montreal from
Toronto. We do have functioning trains.
Um,
and I was only getting, you know, one megabit on there, and it was like brutal. I basically couldn't work with an agent. However, on a six-hour train ride, everything was great, because that's just enough for me to connect to the remote session, work with tmux, have everything working, and so I was kind of uninterrupted.
Um, so I guess like my way of getting
more out of agents is kind of boring.
It's sort of like I'm not I don't have
agents off in the cloud doing a bunch of
independent work for me. I just sort of
like more effectively figure out how I
can kind of like work with them.
So, you know, the good news is you have
a remote terminal setup for AI work. The
bad news is that the ergonomics of this,
like, let's be honest, are like not that
good. We're, you know, I'm a Mac enjoyer. It's not that great. There are tools built into other platforms like Cursor and VS Code where you can actually just run all this stuff on that machine. So, you know, you can stop before the tmux part and you can just kind of work with the machine over these editors. I
haven't done it that much. I have used
it to like look and edit and review the
code.
But humor me like hear me out. I think
that there is there is value in like
working with some of this like more
primitive tech. And
just, I think a real wakeup call for me is: just as, um, you know, Windows and Mac operating systems have evolved, and now we've got Liquid Glass, isn't that incredible, there's also, you know, the technology that you can run on terminals, and the TUIs that exist now, have evolved. Like, consider this: the most valuable software created in the last decade is a terminal application, right, which is Claude Code. That's wild to me. And I don't think it stops there, right? I think that we're going to see other terminal apps become valuable. So humor me. I'm going to show you like
a little bit of my like if you if you um
go all in on this, like what does this
look like a little bit? And um
so over here, this is my tmux demo. So over here I have, um, this is like a tmux session and I've got a bunch
of different stuff here. Okay, so these
are actually like different windows.
This is my little custom extension that
I vibe coded. It works for me and I've
made it like a bright color just so you
can kind of see it. Um, first up, all
right, let's just review the plan goal.
Don't embarrass yourself. Good. Okay.
All right. We're going to check out this
window first and then I got a bunch of
things. So, I'm starting here to
actually show you. Um, this is an editor
called Fresh. Um, I'll have the URL at
the end. Like I said, I don't really
know Vim. And I tried to give Neovim a shot. It was way too complicated for me. Fresh is interesting because
it's sort of like, hey, VS Code users,
do you want to use the mouse? Do you
want to do the things that you know how
to do? Do you want to have, um, like a control plane where you can actually just bring up files and just kind of work the way that you're used to working? Um, so this is like relatively
new software. I think it was built in
like the last year. Somebody's working
on it. It's pretty neat. I I'm enjoying
it and it's made like just sort of like
looking at code easier. So that's just
like one thing that I think is kind of
interesting is that your idea of like
what a text editor might be on the
terminal has kind of evolved. I could
keep going but where am I? So for
example, you know, I can I can go
through some of these files. I think it
also shows like modifications in here as well. Um, anyway,
oh boy, not enough time. So over here
let's see, this is, I'm going to skip part. So this is like an OpenTUI app that I've been building. Building OpenTUI apps in the terminal: really great. This is just, like, um, you know, it's very simple. Okay, but I want to illustrate this, which is like: use tmux, look at pane 1, what do we got.
I think that with tmux, I don't look at it just as, like, a workspace. I actually look at it as like a Playwright tool for working on the terminal. If you've ever struggled with, like, you have logging output or whatever, and how do I get that into my agent? When you have tmux it's easy, because it can just read the pane using those shell commands, which makes sense, right? And so, um, okay, I want to change
this right I want this to look
differently. So I have a tool here called termdraw. Termdraw looks like this: it is actually, like, a vector-based sort of editor.
Okay. Or I can like resize things and I
can put let's say this here. I'm going
to put, you know, let's put close. This
is going to be the title. Let's put a
line here as well. We got like this
smooth line thing which is kind of fun.
This is all over SSH. I'm going to move
this, right? And then I'm just going to
do this
um text. Okay. I send that to my agent.
Make the modal look like this.
I think there's something about working
with the agent where like look, it works
in text, you know, like the agent reads
text, right? It reads markdown files and
to actually produce artifacts that it
understands. It actually understands
ASCII incredibly well versus like you
give it a screenshot, what is it doing?
It's spending thousands of tokens to
decompose that into a text description
that at the end of the day is not going
to have much more fidelity than what I
just generated, right? Right. So, I
think that like sometimes like that's
kind of interesting.
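For a sense of what handing an agent an ASCII mockup looks like, here is a made-up example; the layout and prompt wording are invented, not Ben's actual termdraw output:

```typescript
// Hypothetical example of feeding an ASCII mockup to a coding agent
// instead of a screenshot; the layout and wording are invented.
const mockup = `
+----------------------------------+
| Close                        [x] |
|----------------------------------|
|  Are you sure you want to quit?  |
|                                  |
|        [ Cancel ]  [ Quit ]      |
+----------------------------------+
`;

const prompt = `Make the modal look like this:\n${mockup}`;
// The agent reads this as plain text, so no tokens are spent
// reconstructing a screenshot into a description.
```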
Um,
all right. It thinks that it did it. You
know what? I'm lazy. Can you run it in
pane 1 for me? I don't test a lot of
this stuff out, by the way. So, I'm just
hoping that it works.
Oh, I didn't reload it. Oh. Oh. Oh, you
figured that out.
Um, you following this?
Okay. So, I built a lot of TUI apps and
if people are wondering a little bit
like how I do that, it is a little bit
like this. You know, it's the ability to
actually like iteratively have um this
thing go back and forth. So,
I'm going to stop you at this point
though because it looks we're getting
there, but you can see how it's going to
get there. Um
so let's see. So another tool that I was
going to show here is um
so I built my own diffing tool, and that came about because, basically, once I started working this way... I built my own diff tool; it actually, like, upsets me that that's coming out of my mouth. But ultimately I just wasn't very happy with some of the solutions that we have. So, I built this thing called Hunk, and it is, um, basically, like,
how can I have something closer to like
a VS Code experience on the terminal. Um
I'll show like a a better diff, but it
basically accepts sort of like diff
commands like this. So, I could say like
main at three. I can go in and I can
kind of like take a look like this. Kind
of go like this. It's got like split
view. This is a bad example for split
view. It's got word wrap. You can even
scroll horizontally,
right? So, these are like tools that I never really experienced for diff tools. Um, anyhow, so let's just go back. I guess the last thing I wanted to also illustrate is that this presentation is also over SSH, on the terminal, uh, with the rich graphics and stuff too. Um, and it's called presenterm. So you can check that out. So these are some of the tools I've been using: Fresh, Hunk. I didn't get to show you Glance, ran out of time. Termdraw. Um, and, uh, that's it. Think about it. It's not crazy.
All right, I'm back on. Uh,
>> so how's it going so far? I go back.
>> Having fun?
>> Good. I tell you what, I asked the next
speaker
If he was forced to delete all of the
apps on his phone and he could only keep
three of them, what would they be? What
would they What would they be for you?
Think about it. I'm going to reveal his
answers.
Okay. Number one was Slack.
Then was YouTube.
Then was X.
What do you think? Do you agree?
Okay. Our
next speaker is Shashank Goyal. He is
the founding engineer of OpenRouter, and, uh, I talked to him about how he hires
new people and if AI is impacting that
and he said no we actually need a lot of
engineers and what he's looking for is
enthusiasm, excitement and people who
ask the right questions and that's his
interview technique. Did I say it right?
>> Yeah.
>> Okay. All right. We're all set. All
right, let's hear it for Shashank.
>> All right, thank you everyone.
So today we'll be talking about the rise
of AI agents and I'm sure you guys are
going to hear this word so many times
and I apologize, but yeah, so I'm from OpenRouter. We started the company about two and a half years ago. I joined about two years ago and have been building this company. What's special about OpenRouter? We think that we're at a really cool horizontal space in the ecosystem. We're right between all of the different models and all of the apps. Uh, we're a model aggregator that makes it really easy to use any model in the ecosystem.
As of this month, OpenRouter is doing
about 75 trillion tokens every month. We
have over 5 million users using the
platform every month. There's more than
60 providers and 300 models. What does
this all mean?
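For anyone who hasn't used it: the aggregation being described is exposed as an OpenAI-compatible chat completions endpoint, so switching models is mostly a string change. A minimal sketch, where the model slug and prompt are just examples and you supply your own API key:

```typescript
// Minimal sketch of calling OpenRouter's OpenAI-compatible endpoint.
// The model slug and prompt are examples, not a recommendation.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "deepseek/deepseek-chat", // swap models without changing code
    messages: [{ role: "user", content: "Summarize this diff for a PR." }],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);
```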
The ecosystem is starting to shift a
lot. There isn't just one best model for
any one use case. We find that users are using more than one model for whatever workflows they have. So we saw this pain point a long time ago, where it just got really hard, if you wanted to use OpenAI models and Gemini models and Anthropic models, to know
which model to use, when to use it,
where to use it. We didn't think that
there was like a good empirical
benchmark. And so we built our ranking
page to basically show you not the
benchmarks, not um you know all of the
benchmaxed scores that are uh produced
at every model launch, but like hey how
are users using these models? Where are
they spending their dollars? How are
they voting with their actual use? And
how do we see this in the ecosystem? So
we this is one of the charts that we
have which shows what are the top models
that users are actually using and you
can actually see that obviously um some
of the Anthropic models are at the top, but there's a bunch of open-source models like DeepSeek, MiMo, um, there's MiniMax, there's Gemini, OpenAI, so it's a very vibrant ecosystem of models, and it's really important to remember that this is really not a winner-takes-all market.
Another thing that I really like about
the viewpoint that open router has in
the ecosystem. And why I'm really
excited to share some of the metrics
that we have with you guys today is that
not only do we see all of the models,
but we actually also see what apps are
using those models and how they're using
them. So, um the chart that you have on
the screen right now, but basically
shows how coding agent rankings have
been changing over time, which agents
have become really popular. Um, some of
these you obviously see kilo code,
claude code, but you might not have
heard of Hermes agent for example, and
it's an agent that is, uh, really
starting to engage and gain a lot of
popularity. So, um, you know, again, go
to the OpenRouter rankings chart
whenever you guys have time, but it'll
really like help you see the ecosystem
from this like top down view and
understand what models, what apps, how
people are using them, and what people
are building.
But from all this data that OpenRouter is very lucky to have, sitting at the top: what have we learned so far? Best practices change week over week, not years or even months. Every
week if you ask me what's my workflow
it's going to look very different. Uh
prompts also need to change as models
change as models get better. So you have
to actually tell them fewer and fewer
things and models rotate very quickly.
The three trends that are very clear
from our analysis is that inference is
becoming a core internet utility. The
same way that if the internet is down,
you can't get your work done. If your
tokens are down, you can't get your work
done. The market is restructuring and
extremely dynamic and agents are now the
primary workflow and workload for
inference in the market. Over the last
year, we've seen over a 14 times growth
in the number of tokens consumed. And
that basically shows you how much more
value users are getting, because every token that is consumed on OpenRouter is paid for by the user. The number of requests also continues to grow, and there's a slight gap in the number of requests versus the number of tokens, and we'll get into that right
after this.
Growth in the platform is also
decentralized. It's not a single user or
a single app that is growing in the
ecosystem and that's true for everyone
in AI. Um, so we have like breakout apps
like OpenClaw, which has single-handedly
consumed over 18 trillion tokens last
month. And that really shows how much
value users are finding, right? Like we
see all these buzzwords. We see like,
hey, this is really cool. I like it. But
what does that actually mean? Are
users using OpenClaw? And this is one of
the best metrics to show you. 18
trillion tokens is about $1.8 million
that have been spent on OpenClaw just on
OpenRouter. And we're a very small
percentage of the overall inference
market. And it's an open source app
where all of the code you can see it,
you can reuse it, you can build it. It's
fully MIT licensed. So there's a wide
ecosystem here and very easy to build
agents because the most popular agents
are actually open source.
The other thing that has led to this
huge spike in token usage is that there
has been a pretty significant cost
collapse across the ecosystem.
Uh, when we started with GPT-4 around March two years ago, almost exactly from now, we were at like $30 input and $60 output prices per million tokens. A GPT-4-quality model is Gemini 2.5 today, and it's at 15 cents to
60 cents which is straight up 20 times
cheaper or sorry 50 times cheaper than
the same level of intelligence 2 years
ago. That does not mean that frontier
models are getting cheaper as well
because we have seen that frontier
models have continued to stay the same
price but what is frontier intelligence
today in one year will be like 10 to 20
times cheaper than it is today. And so that has very big ramifications in how we use AI, because you can only deploy Claude 4.7 Opus today on tasks that you know are going to be very high value, because it's very expensive. But that's not the world you're building for. You should be thinking about how do I deploy 4.7 Opus across all my tasks, because in a year or
even in six months this model is going
to be so much cheaper or this level of
intelligence is going to be so much
cheaper. Uh which is why it's very
important to realize that even though
models seem very expensive today at this
level of intelligence that trend is
going to continue to push those prices
down.
Sorry.
Yeah. So all of this growth has really
changed the models how they're consumed
and why users are consuming them.
I don't expect you to read all of these,
but the point is that these are the
models that we've onboarded and users
have uh consumed more than 100 million
tokens on over the last 12 months.
All of these models were on the platform
and we're pretty selective about the models that we onboard: models that have something unique about their architecture, something different about how they were built, and that are usually pushing the frontier either for
their size or for um max intelligence.
And there's a lot of choice in the
ecosystem. No single model stays at the
top for very long. Um this chart
basically shows all of the different
model families on open router. So you
have Google Anthropic OpenAI etc at the
top. But then you do have very
significant percentage of usage from
MiniMax, DeepSeek, Xiaomi, Z.ai, a lot of
Chinese open source labs that are
producing frontier level quality of
models.
The market is decentralizing. Um, as I've been saying, there are so many models, so the share of the top five or top 10 models on the market is continuing to go down, and it's very important for the workflows that you guys are building to evaluate them against multiple models, because you will find there's a big Pareto frontier of quality versus cost, and there's already a lot of trade-offs to be made in the marketplace.
Maybe not a surprise, but now reasoning
is the default. There's still a lot of
non-reasoning models, like Gemini 2.5 Flash, um, Gemma 4 models, but now reasoning is the default. All models reason, and, uh, users are really looking for models that think before they reply. What we used to call test-time compute, or compute during inference, is a
very important quality of the models
that you choose.
Um, to specifically call out in the zeitgeist: DeepSeek might not have
released a new model for a long time
now. Um their last like big release came
March of last year uh with R2 but
DeepSeek has continued to grow in the
market because people are aware of how
good the output is for the prices. So um
it's a model that I would recommend that
you all try out if you're building agentic
workflows. It's really good at tool
calling and uh the market corroborates
that story.
One of the interesting trends in the
ecosystem as well is that open-source
models are the volume leader, but
because they're so cheap, the spend
percentage is actually way lower. You can see that about 35 to 40% of tokens happen on open-source models, but they represent a much smaller percentage of total revenue because they're so much cheaper. And I think it's a really big advantage for people building, because there are so many good models available that are much cheaper.
And combining all of this, the growth in tokens, the growth in models, the growth in ability, is, I would say, why
agents are now a primary workload on the
platform.
Over 15% of spend now comes from agentic
workflows. Uh the way that we decide if
something is an agentic workflow is if it's doing a lot of tool calling, multi-turn loops, using orchestration. Um, so we can detect this using metadata, and about 40% of the total workflows on the platform already are agentic, and I expect to see this
continue to increase because models are
no longer being used for single question
and answer kinds of responses. Even when
that's the user interface for the user,
behind the scenes, they're making a lot
more calls and using a lot more tools to
answer the user's question.
Tool calls are the main backbone of
agentic workflows. I've been saying "tool calls" a lot and maybe I didn't define it. For people that aren't aware, tool calls are how models engage with the wider world outside of just their own pre-trained memory. So every time the model wants to get more context or take an action in the real world, it uses a tool call, and you can
see that there's been a really big
inflection of how many users and um
requests are actually using tool calls
um, in terms of total tokens, what we showed earlier, but basically it's like a really big hockey-stick exponential curve, and it's probably the biggest trend that we see right now, one that has continued to explode in the last 12 months.
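To make "tool call" concrete for anyone new to the term, this is roughly what a tool definition and the model's structured call look like in an OpenAI-style request; the weather function is the stock illustrative example, not something from OpenRouter's data:

```typescript
// Illustrative OpenAI-style tool definition; the function itself is a
// made-up example. The model replies with a structured call instead of
// prose, and the agent loop executes it and feeds the result back.
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  },
];

// A typical assistant response when it decides to use the tool:
const toolCall = {
  id: "call_abc123",
  type: "function",
  function: { name: "get_weather", arguments: '{"city":"Miami"}' },
};
```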
A very cool insight: we expected
agents to be using more tokens, but I
didn't expect it to be quite as high.
There's a gray line at the bottom of the
screen that is honestly a little hard to
see because that's about how many tokens
that non-agentic workflows are using uh
or sorry, tokens per request that
non-agentic workflows are using. And you
can see it actually hasn't changed at
all over the last year. The number of
tokens per request has stayed relatively
stable even as like models have gotten
bigger context. There's like more
intelligent models. But then if you look
at users that are using agentic
workflows, it's a totally different
story. And this again shows like the
difference when you're building agents,
you actually can utilize the full
context and it allows you to make much
more powerful workflows and experiences
on top of models. Um agent sessions are
also usually 11 times longer. Um session
here basically means a number of turns.
Um, in non-agentic workflows users will usually ask like two to three questions, but for agentic ones we have seen average turn lengths getting to 80. So think of sessions that are getting much, much longer. And I mean, you can think, when you're looking at your Claude Code screen, of the number of turns that it's
doing. But this again is like a very
important chart to understand how
different the two workflows are.
Why now? Why did agents suddenly become
so good? It's really not sudden. It's
been a slow buildup over the last year,
honestly, like year and a half. Um, we
had our first reasoning model in January
of 2025.
Then we had tool calling, but it didn't really work. We used to see tool call success rates hover around 85 to 90%. That means roughly one out of every ten LLM calls that tried to use a tool would fail, and in agentic sessions where the number of turns can be 80, that's about eight chances of failure per session. So it was really bad. Over time, around August to November, is when we really saw models and model labs figure out how to do tool calling in a more reliable manner, and we saw tool calling success rates go up to around 99 to 99.5% for the frontier models. That's really been one of the big unlocks, because the models are just way more reliable.
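A quick back-of-envelope check of those numbers shows why the jump from roughly 90% to 99.5% per-call reliability matters so much over an 80-turn session.

```python
# Back-of-envelope check of the reliability numbers above: expected tool-call
# failures in an 80-turn session, and the chance of a fully clean session.
turns = 80
for success_rate in (0.90, 0.995):
    expected_failures = turns * (1 - success_rate)
    clean_session_prob = success_rate ** turns
    print(f"success={success_rate:.1%}: "
          f"~{expected_failures:.1f} expected failures per session, "
          f"{clean_session_prob:.1%} chance of a fully clean session")
# At 90% success: ~8 expected failures, essentially no chance of a clean session.
# At 99.5% success: ~0.4 expected failures, roughly a 67% chance of a clean session.
```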
We also saw a really big explosion in harnesses around December of last year. So you have Claude Code, Cline, OpenHands, OpenClaw. All of these harnesses have made it that much easier for everyone in the ecosystem to use agents. And around January to February, I think it finally all came together. I've been using Claude Code for eight or nine months, along with a lot of different agentic tooling, but I really felt the inflection in all of the agents I use around when Claude 4.5 Opus dropped in December, and then one more time when people really figured out how to use these harnesses to the best of their ability. So, putting it all together, I think we're finally in a world where agents are mostly reliable. They're able to go off and do really long-running tasks without much supervision. And they're able to use a lot of different models and orchestrate themselves, and generally know the best way to build themselves. We'll go into that in a second.
Putting it all together, the five forces
that we should all be thinking about as
we're building is that models are
smarter, inference is cheaper, context
is longer, tool calls are more reliable,
and harnesses are better. Keep this mental list for yourself, because it's something I had to think through item by item. Things are changing so quickly that it's easy to forget how cheap inference really is, how good the harnesses are, or the fact that most frontier models now have a 1 million token context window. With a million tokens you can put a lot of full codebases into the context, so you often don't even need grep tools. It just makes a lot of workflows possible that were not possible earlier.
When you're building agentic flows, they also let you use the best capabilities the models have to offer. When we look at users that are using agentic flows versus not: agentic flows use more reasoning, they get better caching, and they usually try out more models. People using agentic workflows are usually trying out six different models for their different use cases, versus three for non-agentic users. And the total token and request volume is also higher for agentic use cases.
I have a few quick minutes. I didn't
want this whole presentation to be just
me sharing some data. So I wanted to show you how we're using agents at OpenRouter ourselves: taking all of the learnings from the data, and then deciding what to build for ourselves.
One of the things that we built is
something called spawn which is
basically a one-click deployment for any
other user to deploy agents. So if you want to take Claude Code but run it in a VM so you can control it from your phone, you can do that at openrouter.ai/spawn. But that's not
the thing that I'm excited to share with
you guys. What I found really cool is
that the full codebase that sets up Spawn, including all of the integrations with the different agents and the different cloud providers where you can deploy it, is a 100% agent-written codebase. There are zero PRs made by humans in the entire codebase. Sometimes there are issues that are filed by humans, but there's no code written by humans; most of the issues and everything else are agentic and fully automated. There are different agents that write code, that review code, that do security analyses, and that do issue triaging. We also have end-to-end testing agents, so it's basically a swarm of agents running on this repo all the time. The repo is open
source. Um, if you go to this link
again, you'll find the GitHub repo at
the bottom of the page or you can just
search for open router spawn and the
GitHub repo should come up. You can
actually see exactly how all of these
agents are orchestrated in our internal
workflows.
Scouts is an example of a really simple
workflow uh that has added a lot of
value. Um, I wanted to put it on here to
show that agents don't have to be
complicated. Here's the context: OpenRouter is integrated into a lot of open-source GitHub repositories, like OpenClaw for example, and we wanted to track all of the issues that users are facing on OpenClaw. We tried a few different things. You can just have a cron job that fetches all of the issues on OpenClaw and sends us a summary, but what we found is that every single day it would just give us the same results over and over, because those were the same top issues that weren't getting resolved. So we actually built something called Scouts, also on GitHub. The Scout agent uses GitHub PRs as its memory. So to create a
scout, you just create a new PR with
like a small system prompt. And then
there's another agent that looks at all
open PRs in this scout repo and spins
off web searches for it. And then
whenever the web search is done, it
appends to the PR as its own history. So
we're using the PR as like the running
context for a model. It's very simple.
It's a single file in the PR, but then
it just made our like daily cron jobs
that much better because the model
remembered everything that it had seen.
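Here is a hedged sketch of that "PR as running memory" pattern. The actual implementation lives in OpenRouter's open-sourced repo; the repo name, the run_web_search helper, and the use of PR comments (the talk describes a single file in the PR) are all assumptions for illustration.

```python
# Sketch of the Scouts pattern described above, under assumptions: each open PR
# is one scout, its description is the prompt, and its comment thread serves as
# the scout's accumulated memory. run_web_search() and REPO are placeholders.
import requests

GITHUB = "https://api.github.com"
REPO = "example-org/scouts"                     # assumption: placeholder repo name
HEADERS = {"Authorization": "Bearer <token>"}   # assumption: a GitHub token

def run_web_search(prompt: str) -> str:
    raise NotImplementedError("stand-in for whatever search/LLM call the scout uses")

def run_scouts() -> None:
    prs = requests.get(f"{GITHUB}/repos/{REPO}/pulls?state=open", headers=HEADERS).json()
    for pr in prs:
        number, prompt = pr["number"], pr["body"]
        comments = requests.get(
            f"{GITHUB}/repos/{REPO}/issues/{number}/comments", headers=HEADERS
        ).json()
        history = "\n".join(c["body"] for c in comments)
        # The search sees everything the scout has already found, so a daily run
        # surfaces only what is new instead of the same top issues again.
        result = run_web_search(f"{prompt}\n\nPreviously seen:\n{history}")
        requests.post(
            f"{GITHUB}/repos/{REPO}/issues/{number}/comments",
            headers=HEADERS, json={"body": result},
        )
```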
The other thing that we've built for
ourselves, which has been a total game
changer, is what we call AI, or the OpenRouter intern manager. We find that when you have a single agent that's trying to do a lot of different tasks, there's a much higher chance that it fails. So we love to deploy agents internally that just do one thing, but do that one thing really, really well. There's a list of about 20 agents. Some of them are really fun, like Dexter: we have thousands of emojis in Slack and it was getting really hard to know which one to use, so you can ask Dexter which emoji you should use depending on your mood. My personal favorite is Buddy. Buddy helps us onboard any new models and endpoints. Anytime there's a new model, we just tell it, "Hey Buddy, the Anthropic Claude 4.7 Opus launch happened on Thursday," and Buddy is able to fully onboard the endpoint and test everything for us. We use Sniffer for KYC operations. We have Tony for customer support. We continually add more interns, and each intern gets its own VM and its own GitHub repo. What we like to say is that it can do brain surgery on itself, so each agent can learn and get better. You can just tell Buddy, "Hey, you didn't have this ability, can you go learn it?" It'll figure out how to do the thing and then make a PR on its own repo. The really cool thing about AI, the manager, is that you can ask it to spawn more interns, or to remove them if they're not being used, or to push an improved workflow out to all the agents. It also does credential management across all of these different agents for us.
From everything that we've built, if I had to name three things we have learned: first, experiment as much as you can. Second, production is still hard; that's one of the reasons we have so many different agents instead of a single agent. And third, automate the everyday, because that's really where a lot of the value is. You might not think that something is automatable, but you should just try, and it really helps.
The thing that we keep telling ourselves
and are really like forcing ourselves to
remember every single day is that we're no longer building OpenRouter; we're building the machine that builds OpenRouter. Having that kind of mindset switch has really helped us automate across the board, ship a product that's used by millions of users with a very small development team, and get the most out of all of the AI models in the ecosystem.
And that's me.
Right. Thanks everyone.
All right. All right.
Hope you're all super energized about uh
using open router and uh check out all
the different AI tools. Now for our next
speaker, we have Nana. Nana is a full
stack engineer and she's currently a
principal developer advocate and
software engineer for Kodo, and she's also a proud member of Women Defining AI. So she's really trying to build a world
where AI is augmenting our lives and I
know there are some quality engineers in
the audience. So I think you should
definitely tune into her talk because
she's going to talk about how to embed
AI code quality gates in your software
development life cycle. So Nana take it
away.
>> Thank you. Thank you. Really glad to be
here. My name is Nana Andukquay. I lead developer relations at Kodo. Kodo is an AI code quality platform, and I am obsessed with AI, but also with being able
to use it in a very structured way. So
that's exactly what we're going to talk
about.
So before we really really begin, I just
want to set the tone. I am not a
beginner. I've been in the game, in the industry, for almost 10 years now, just about. So my knowledge predates this current exciting phase that we're in with AI, and so does my engineering experience in particular, whether it was building systems at a fintech company, building investment portfolio management systems with tens to hundreds of millions of dollars on the line, or building a live events platform at O'Reilly, working with back-end engineers to build an entirely new experience for products that generate a million plus in revenue. So I've always had to think to some degree about quality in software development. And I am also AI-pilled, like totally obsessed. I'm pragmatic but also super optimistic, and I really do think AI is great for neurodivergent brains. But that's another topic of conversation.
So in my journey as an engineer, and now with AI, when I think about code quality I'm always asking: where are the touch points where there's a quality degradation, or an opportunity for quality to degrade? If you look at the software development life cycle and a typical workflow, planning and design, development, code review, testing, deployment, this is the entire surface area where quality can begin to degrade, very subtly or maybe in very obvious ways. So the opportunity, the blast radius I guess you can call it, is everywhere. These are all the places where issues can happen, especially now with AI almost making software feel that much more fragile.
And that's why I think that we are
currently building workarounds. We are real-life architects in this time, trying to build around the limitations of AI systems and LLMs. It's exciting, but it can also be very frustrating. Some of these workarounds are makeshift and others are becoming standardized in real time, and we're just going to see what comes of it.
One really amazing example: I'm really excited to give a shout-out to Lex for creating GSD. This framework, which then became a coding agent built on the PI SDK, was about structured AI-assisted development. Not only did professional developers at some of the largest companies that we know today start using this tool, vibe coders also wanted more structure, and structure yields quality. So this popularity, 55,000 GitHub stars, was a signal, and is a signal, that there is a strong need for quality-driven systems, and I think that's exactly what we should be building. But how do you begin to think about that? What I call it, and what we call it at Kodo, is a verification layer. A verification layer
that should be embedded and interwoven
into your existing development workflow.
So how do you do that? We put on our,
you know, critical thinking architect
hat right now. Number one, you need to define what code quality actually means to you, what code quality is. There are things you can research, pulling information from many different sources and documents, and there are also requirements that are specific to your projects and the way you work, or the way your team or your engineering organization operates. All of that needs to be codified, and it needs to be codified because we are working with agents, and this is context.
Once you've defined it, you need to
decide where that codified quality or
those quality standards live. And we'll
talk more about what that looks like
because both of these touch on context.
And number three, you need to design the
verification layer where you already
work. That's in your IDE, the CLI, um
the git providers that you're using,
CI/CD pipeline. These are all the touch
points in your actual workflow for uh
being able to embed code quality.
So steps one and two: define your code quality standards and decide where they actually live. This is all context engineering, and these are only some of the examples of context engineering. I don't even know if people still use CLAUDE.md files since that paper came out about them not being very effective. But we have agent markdown files. You've got internal docs and engineering standards. You have criteria that you might want your code review to be measured against, whether it's a manual code review or an AI doing it for you. That's all context. And of course, org-specific policies and any other quality expectations that you might have. This is all context, and these are all the important elements that are needed for the actual pipeline, the life cycle of your
development workflow. And a great
example of this is Kodo's rule system. Engineers at Kodo built a rule system to have one context plane for managing the agent.md files and all of the rules, which can be org-wide, repo-specific, or maybe language- or framework-specific. They're all listed in one place, and they are also categorized by correctness, reliability, or quality, with a level of severity: how important is this rule to you, per repo or per pull request? This is what I consider to be the context plane that can be centralized, so that when you are working on distributed teams it's not only visible to you and your team members as a developer, it's also visible to your agents.
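As an illustration only, not Kodo's actual schema, here is a sketch of what codified quality rules in such a context plane might carry: a scope, a category, and a severity that both humans and agents can read.

```python
# Illustrative sketch of codified quality rules in a centralized context plane.
# This is not Kodo's actual schema; the fields just mirror what is described
# above: scope, category (correctness/reliability/quality), and severity.
from dataclasses import dataclass
from typing import Literal

@dataclass
class QualityRule:
    rule_id: str
    description: str
    scope: Literal["org-wide", "repo", "language", "framework"]
    category: Literal["correctness", "reliability", "quality"]
    severity: Literal["low", "medium", "high"]

RULES = [
    QualityRule("no-silent-except", "Never swallow exceptions without logging.",
                scope="org-wide", category="reliability", severity="high"),
    QualityRule("typed-public-api", "Public functions must have type hints.",
                scope="language", category="quality", severity="medium"),
]

def rules_for(scope: str) -> list[QualityRule]:
    """What an agent skill would pull down for a given repo or task."""
    return [r for r in RULES if r.scope in ("org-wide", scope)]

print([r.rule_id for r in rules_for("language")])
```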
So how do we allow our agents to access this kind of context plane so that we can pull it into our dev workflow? We'll get into that. Number
three, when I mentioned designing the
verification layer, we have the
traditional software development life
cycle. And then we've got our agents,
the agent harness, and then what I
mentioned before, our dev workflow, IDE,
CLI, Git.
And what we're going to do with that context, or what you should do, is operationalize it. I argue that bringing it into the planning phase is a really strong principle for quality: enforcing quality as early as possible and as often as possible. Remember, we're building around the limitations and the unpredictability that come with working with agents. So bring in those standards early, already in the planning phase. You can do this through the way that you prompt, and by using agent skills to pull down that context from your centralized context plane. Then you have your agent skills, which you've probably been collecting like Yu-Gi-Oh cards, for code generation. And then, for code review, you'll also want to enforce that quality criteria again. What's important about this is that there's consistency across the stages of your workflow: the same exact rules, and the same exact place where the agent skills and other artifacts can be pulled down from, across at least the first half of the development workflow. Consistency is important when it comes to agents, and that way you can also identify where there might be quality gaps when you are actually enforcing quality consistently.
So these are my skills. Um they're in
the Codex app right now. This is what it
looks like. There's a ton of them. I've
collected them like I said like Yu-Gi-Oh
cards over time. And um a lot of them
are related to uh cleaning up dead code
or maybe some error handling that I know
that agents seem to just keep struggling
with. I also have test-driven
development, behavior driven development
um skills. But this is just to give you
an idea of this is the kind of agent
skills that I have and that I actually
enforce before I begin implementation. I
do this in the planning phase. Um this
is a recent development of mine and so
that's what I mean by enforcing quality
early and often. In this example, I'm
using Kodo's get-rules skill. This is a skill that pulls down rules from the rule system and determines which ones are most relevant for your current coding task. Those are then the rules that get enforced as your agent begins implementing code, and it goes through a verification process as well when it is done. That is a great example, or at least a more sophisticated example, of a centralized context plane for your quality standards, and of pulling those standards into planning and code generation.
And then I truly believe that a local
code review is very very valuable here.
Whether you're working in an IDE,
whether you're in the CLI using uh
Codex or Claude Code, wherever it is, it is totally worth it to run a local code review against your uncommitted or committed changes, because you want to make sure that anything that could be caught is actually caught before you make a pull request and let the whole world know that you have just generated AI slop.
Once you have actually fixed up some of
those um issues that might have been
surfaced from local code review and you
actually make a pull request, you want
to be able to leverage AI for that first
pass of a code review at the PR stage.
And I think it's really valuable to use
an AI code review tool that is
automated. So, as soon as you open up a
pull request, alongside your linters and your tests and your security checks and all of the things that are going to beef up the quality of your process and of the code itself, you can have a code review tool automatically run as a first pass and surface any important insights that you can use to improve the code before another developer takes a look at it, or before you even take a look at it. What's important here, and what makes this part different from the local code review, is that you have a much stronger system that takes longer to run because it's checking against the entire codebase or multiple repos to give you the insights you might need, for example about breaking changes: an API change here, and this contract breaks something in a couple of other repos. There's so much more context that can be leveraged at the PR stage; it can be different and take longer, but it is very effective for a pull request
review. So this is what I ended up testing out. This is actually my current workflow, but I tested the process on a relatively large PR. I say relatively large because I do believe there are stats showing the cognitive load of an effective code review degrading dramatically after about 400 lines of code change. So I went all the way to the extreme on purpose: 1,900 lines of code change for this policy-enforcement MCP server and CLI tool I'm building. And there was only one bug uncovered, and I was shocked. Of course I'm biased about Kodo, but we dogfood, so I use it every single day, and I was shocked to see that there was only one bug. To me that was proof that my process is working, definitely working to some degree, with the skills that I have, the rules that are in place, all my tests, and my linters. This is the proof.
And so, once Kodo found the bug that I had, which was an unhandled settings exception, I went ahead and fixed it locally in Codex. What I mentioned before about the robustness of a code review being automated at the PR stage is that it gets to check against your rules, your standards, and your requirements gaps with adversarial agents. Some folks think: what is the point of an automated code review bot running when I can just run a code review locally with the agents I use to generate code? I'm always screaming about this on X. I always say that you need an independent verification layer, because of bias from LLMs, and because the systems for coding agents are optimized to be autocomplete on steroids, optimizing for completing code as quickly as possible. So you need a completely independent system that can come in with an adversarial architecture and goal, and that's when you can begin to uncover some subtle bugs that might exist.
So this is the AI dev workflow checklist for quality. You need to define the code quality standards. You need to decide where the codified code quality lives, in a centralized context plane, especially for folks who are not solo devs but are working on teams and need it to be distributed and managed. Then you need to pull in the agents and the skills for accessing that context for planning and for code generation, verify those local code changes before you make a PR with your static checks, linters, and tests, and then automate the more serious, more robust code review process. And a bonus, which is something that I've been doing lately, is using automation and AI for iterative refinement of your skills and rules over time. Kodo's rule system can automatically suggest new rules based on the behavioral history of PRs and comments, and it can suggest new rules for your coding standards as your codebase evolves. But something else you can do is leverage automation, kind of like a cron job, something that runs weekly, assesses all of your PRs and any trends that have occurred, and then decides which new skills are worth creating and which are worth refining, so that you can reduce the types of issues that keep popping up if there are trends. That way, by the time you get to code review, some of those issues from the past have already been handled and you have fewer issues by the time you get to that last line of defense.
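For illustration, here is a hedged sketch of that weekly refinement job: gather recent PR review feedback, look for recurring themes, and draft rule or skill suggestions. The repo name and the summarize_with_llm helper are assumptions, not any product's API.

```python
# Hedged sketch of a weekly "iterative refinement" job as described above.
# REPO and summarize_with_llm() are placeholders; only the GitHub REST endpoint
# for listing review comments is real.
import requests
from collections import Counter

GITHUB = "https://api.github.com"
REPO = "example-org/app"                        # assumption: placeholder repo
HEADERS = {"Authorization": "Bearer <token>"}   # assumption: a GitHub token

def summarize_with_llm(text: str) -> str:
    raise NotImplementedError("stand-in for an LLM call that drafts rule/skill suggestions")

def weekly_refinement() -> str:
    comments = requests.get(
        f"{GITHUB}/repos/{REPO}/pulls/comments?per_page=100", headers=HEADERS
    ).json()
    # Crude trend signal: which words keep showing up in review feedback.
    words = Counter(w.lower() for c in comments for w in c["body"].split())
    trends = [w for w, n in words.most_common(30) if n >= 5]
    feedback = "\n".join(c["body"] for c in comments)
    return summarize_with_llm(
        f"Recurring review themes: {trends}\n\nReview comments:\n{feedback}\n\n"
        "Suggest new or refined quality rules and agent skills."
    )
```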
And I'd love to show what that looked
like. So this is the actual PR example that we looked at, and this was the plan that I used. I used Aaron Francis, who has faster.dev; I used some of his audit skills, and I forced Codex, GPT 5.4 extra high, to include exactly which skills it should use for this particular feature change or update. And this is a long list, as you can see here, but this is the structure behind my planning that I force agents to follow: it includes the scope, the canonical contracts, the component design, the test plan, the definition of done. It's exhaustive, but this is the method of confirming that the agent actually knows what it needs to do, and that there will be a thorough verification and implementation process. That is how I was able to generate thousands of lines of code and end up with only one issue that I needed to fix.
And so this is something that I definitely recommend you all begin to think about, because quality really requires a mindset shift. Security engineers talk about security-first best practices; the same applies to quality and preserving software craftsmanship. No matter how fast and exciting the evolution of AI we get to experience in this domain is, we can still preserve our intellect and our expertise, and begin to have agents mirror that in the way they work and the way in which they work for us. That is my talk. That's what I had to talk about. So thank you.
All right. Thank you, Nana.
>> Thank you.
>> I would like to thank you all for being
present. This concludes our morning
session. I would also like to thank our
audience on live stream.
Go get refueled, both your body and your LinkedIn connections, and we'll be back at 1:30 sharp.
ladies and gentlemen, please take your
seats. Our event will start in 5
minutes.
All right. All right. Hello everyone.
How was lunch? Well, raise hands if you
like lunch.
>> Hey. Hey. Okay, that was a good lunch.
Okay, also some local flair. I really
enjoyed it. Hope you enjoyed it, too.
Um, so to kick you out of your food
coma, we have a very exciting speaker,
Jeff. So, I met Jeff in San Francisco as
well. So, uh, fun fact, I've never seen
Jeff without overalls and a hat. So that's kind of the image that is seared in my head about who Jeff is. But he's going to introduce himself a
little bit more. Jeff is currently on a
global tour also building on the site
latent patents. So feel free to check
out his website. And today he has a
pretty philosophical question for us
about the change of the economics when
software development is cheaper than
minimum wage. So let's welcome Jeff onto
the stage. Welcome Jeff.
Hello everyone.
Well, I'm here today with a somewhat
provocative title. Software development
now costs less than minimum wage. Now,
there's always been a difference between
software engineering and software development, folks. But I want you to think about this: if your identity function is that you're doing software development, typing in the IDE, etc., well, a burger flipper at Macca's gets paid more than you right now. So, it's been a year, a year and a half, since I first published a technique for managing memory that I affectionately call Ralph. Ralph is really simple: you give it a context window, you give it a singular goal, and you let it progress towards that goal with the right backing. So here's me over at Atlassian two months ago giving a talk about how things are changing, how the economics of software have forever changed. A week after this, Atlassian did their layoffs.
So, folks, the unit economics of business have forever changed. If you consider yourself to be a software developer, you can run Claude Code or Codex in a loop AFK, and at API pricing it's about $10.42 an hour, and that will generate a lot of code. Now, it'll generate so much code it's too hard to review; that's one of the hardest things about this whole thing. But without a doubt, the economics of business have forever changed.
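As a rough illustration of the "agent in a loop, AFK" idea, here is a minimal sketch. The CLI name and flag ("claude -p") are assumptions; substitute whichever harness you actually use, and note this is not the published Ralph implementation itself.

```python
# Minimal sketch of running a coding agent in a loop toward one fixed goal.
# The "claude -p" invocation is an assumption; any agent CLI works the same way.
import subprocess
import time

GOAL = open("PROMPT.md").read()   # one singular goal, held constant across iterations

while True:
    # Each iteration gets a fresh context window but the same goal; progress is
    # carried by the repo itself (commits, files, tests), not by chat history.
    subprocess.run(["claude", "-p", GOAL], check=False)
    time.sleep(5)  # brief pause between runs; stop the loop manually when satisfied
```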
I was at the Cursor meetup back in Sydney, and it was product manager after product manager after product manager just sharing the latest and greatest thing, and they're having the time of their life, folks. They don't have the psychological wounds. They don't have their identity function removed. They're like, hell yeah, I can build things without people now. I don't have to convince people to listen to me. They were just like, yeah, I just made this thing. It was person after person after person. I encourage you to go outside the bubble of software developers and into the non-engineering demographic and see all the magical things they're doing with these tools, and you'll see how things are fundamentally changing, because the head of design and the product manager, they're now software engineers, folks.
But it's not just them. Last month, on a bit of a world tour, I was over in Auckland and went on a tour of Hobbiton from The Lord of the Rings, and the tour guide was like, "Hey Jeff, what do you do?" I'm like, "Oh, talking about AI." Next thing you know, he's like, "Wow, how good is AI? I'm able to build all these trading bots." And I was just like, what does it mean when your tour guide operator is token maxing? Wow. Because everyone now is a software developer. I want you to internalize that: everyone is now a software developer, everyone is now a coder.
So if your identity function, how you derive value, is coming from the idea that you are someone who types in an IDE, you're pretty cooked, right? Because the PM can mog you. It's also interesting because society has been structured around the idea that knowledge was scarce, and with AI that's flipped: knowledge is now abundant. It's not just software developers. If you want principal-software-developer-level output, you just create a Claude skill for that. What about an entry-level legal skill? What happens when a knowledge economy goes from scarcity to abundance? This is what we're facing right now.
Ouch.
So, if we rewind time,
if we rewind time about two years ago,
this is me. I was going, "Oh, fuck." I
actually published this. Um, I ran
Claude in a loop and I built a Haskell
audio library. And the models were
pretty good back then, but they required
a lot of skill to get some outcome.
But it was pretty clear to see where
things were going.
Now you might recognize this moment in
time.
So this is Christmas last year. So the
models are now quite good. But one thing I would impart to you: it doesn't matter how good the models get, it takes a period of rest for people to realize the step shift in technology improvements. You see, among the people around me, the people who get the most out of AI put in deliberate, intentional practice. Society is kind of forcing these musical instruments, these LLMs, these guitars, onto all employees in the corporate world right now, saying: please pick it up, please chew some tokens, please give it a strum, please practice. And what happened over the Christmas break is that people actually had some time off and they picked it up. And the models were good now. They always were good for the last two years, but they've been RL'd now to a point where they're no longer these wild stallions that require skill to break in, like a horse. They're kind of like a My Little Pony, all boxed up and ready to go.
Now, if you're looking to roll out AI within an organization, the one thing I must impart to you is that musos don't pick up a guitar, give it a strum, go, "Oh, that's crap," and assume it's always going to be that way. They play with the instrument. And that's one of the things now: at least for the people in the room, hopefully you've been playing with these guitars, learning all the tricks you can do with them, and learning how different LLM models sound different and have different characteristics.
So I kind of think the world is now kind
of K-shaped.
Um up in the top left we've got the
model first companies. This is the lean
Apex Predators. They're building AI and
they're developing all the workflows
with AI and they're having one hell of a
time. And uh down the bottom there is
like everyone else trying to do their
people transformation program figuring
out what to do with AI. And would you believe there are companies that have banned AI within the corporation? If anyone's watching and AI is banned within your organization, you should leave that organization.
Straight up. Now, you might have seen this and a few things like it. My honest take is that Jack is right, but my further take here is that AI is not factored in yet. What we're seeing is P/E ratios and the valuations of SaaS companies returning to standard business metrics. The fun hasn't started yet. What we have here is not a tool. It's more like a substrate or polymer that allows us to redefine how business works.
You see, for the last two months, I've
been traveling around the world and I've
been catching up with venture
capitalists in San Fran, New Zealand,
South Korea, and we're just kind of wondering: the disruption is not just us as software developers, it's upstream in the finance industry. Why does someone need to raise seed capital these days? If it's just a five-person show now, is software still investable? These are problems and questions at a philosophical level, upstream on the financing side of things. The disruption of AI has created uncertainty not just for us as software developers, but also in the finance realm.
You see every story needs a frame. So
for no particular reason at all I'm
picking SAP Concur. Um I don't like
their expense management software.
Would you believe they've got fixed overheads of 6,800 people? What? 6,800 people? That's a lot of people to do AI transformation on.
So I think the better question is
thinking about like how business has
been structured. Business has been
structured in a way that we've layered
humans on humans on humans as an
intelligence layer within organizations.
I think this is going to be the year we
figure out whether this is true or not. We're already seeing companies play with the substrate and change things around. You see, how long does it take to transform 6,800 people in an organization? Two, three years? I think the better question is: why would you? All organizations right now are putting LLMs down in front of people and encouraging token burn. It's not about a leaderboard or token maxing; it's literally seeing whether someone is actually curious. If they're not burning tokens, if they haven't found a way to burn tokens, and they're not losing sleep over all the things they could build, they're failing a pulse check. And there are a lot of people failing pulse checks right now. So why would you transform them? You see, we know that with organizations, like when you run events or party management, the fewer people the better. The social complexity is there. Smaller teams get better outcomes.
And here's a story from a founder in New
Zealand: "We're smaller, but we effectively cut two-thirds by telling our board that we wouldn't backfill." That's almost two and a half, almost three years ago, folks. They stopped backfilling. So you might see all these announcements saying, "Oh, we're making all these changes to staffing and hiring." It's already been happening, folks. "And it was the best decision, because it got rid of all the people who are sick of hearing about AI. 20-ish people now produce 30 times the output of three years ago." I want you to
think about this and let this sink in
because one of the hardest things about
AI is it's kind of been forced upon the
world non-consensually.
You've just got to put your chin up and get through it. A lot of people are specialized in doing this Game of Thrones-type social hierarchy, Dilbert-type stuff, and it's all going to be for nothing, because if you're a founder and this is your own capital, why wouldn't you compress the org chart? This is going to be really interesting, because as we get these lean, apex, model-first companies, they're going to operate much cheaper and on leaner margins. So it's not that the other founders will want to do this; they're going to be forced to do this. If the experiments this year with all the founders changing around the org chart pay off, it's just going to take one public business case study. Next thing you know, they're all going to start copying. They're all going to start reorganizing their organizations.
So, experience today as a software engineer does not guarantee relevance in the future. This has always been the case. There was a time when a software engineer would move on from a company because it wasn't adopting cloud; they wanted to keep their skills relevant. Our profession has always been a traveler. One of the scariest things is really just how fast this travel is going. You see, if a company's having problems adopting AI, well, that's a company issue. That's now literally the problem I think about and help companies with. It's not an employee issue. You see, employees trade time and skill for money, and I'm really worried that people aren't investing in themselves.
It's crazy. Here we go: here's something I published about two years ago saying some software developers are not going to make it. Well, I no longer hire from the left side of this anymore. Why would you? And it's crazy: there really are now two categories of employees. There are those who are consuming Cursor, Windsurf, whatever else, and all the AI token things; and the other class of engineer is someone who actually knows the fundamentals of an agent and knows how to automate things, and they're a senior engineer because they can teach the next cohort and generation. It's been two years, folks. There is a huge number of people you can hire who have this knowledge. You're trying to figure out who to hire in your interviews? Quite literally pull someone aside and get them to explain what a primary key is.
But I'm not just talking about a primary key. If I were to ask you what a primary key is, or what a linked list is, you'd be like, "Jeff, come on, are you bullshitting me? Is this a test?" But it's surprising how many people can't answer these fundamental questions. You're trying to figure out who to hire as a startup, who's going to make it and who's not. It really comes down to this question: do they understand that the big scary boogeyman, the AI monster, is literally a while loop that automatically copies and pastes information into an array? Can they draw a sequence diagram explaining how this all works? This is what you should be looking at. I like to call this a curiosity test.
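To make the "while loop and an array" point concrete, here is a hedged sketch of that core agent loop. The chat_completion stand-in is a canned fake so the loop runs; in a real agent it would be an inference call that may return a tool call.

```python
# The "AI monster" as described: a while loop copying information into an array.
# chat_completion() is a canned stand-in, not a real inference API.
def chat_completion(messages: list[dict]) -> dict:
    # Fake model: first ask for a tool, then answer once a tool result is present.
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "Here is your summary."}
    return {"role": "assistant", "content": None,
            "tool_call": {"name": "read_file", "arguments": {"path": "README.md"}}}

TOOLS = {"read_file": lambda path: f"(contents of {path})"}  # stand-in tool

messages = [{"role": "user", "content": "Summarize README.md"}]  # the array
while True:
    reply = chat_completion(messages)            # inference over everything so far
    messages.append(reply)                       # copy the model's turn into the array
    call = reply.get("tool_call")
    if call is None:                             # plain text answer: we're done
        print(reply["content"])
        break
    result = TOOLS[call["name"]](**call["arguments"])
    messages.append({"role": "tool", "content": result})  # paste tool output back in
```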
And unfortunately, way too many people are failing this curiosity test. Software development has changed more in the last 6 months than in the last 30 years. If they're not paying attention, what the heck's going on? So you might be wondering, why wouldn't you save someone in the center, stuck there in the headlights? It's because it's a psychological thing. They might run back up the hill. They planned their career at FAANG and they just want another couple of years so they vest and they quit. At this point you really need to be someone who understands AI, is a good software engineer, and is learning techniques for keeping the agents in the loop and developing pace and music.
So,
it's going to be really interesting to
see how this pans out, folks. Really interesting, because if we start seeing layoffs and job cuts, and I don't think they've even started yet, what happens to the people who get displaced? They're going to need jobs. And what are they going to do? They're going to get a job at the next employer and do what was done to them. And this is my concern: it could be somewhat recursive. The best thing you can really do if you see someone stuck in the headlights, going "oh crap, what's going on," is get them to build an agent. I was in San Fran at a Codex meetup. I called out to everyone: hey, who here can actually draw me a sequence diagram and explain to me how inferencing works and tool calling works? Five hands went up out of 200. Holy crap, the number of people who understand these things is crazy low. And this is one of the most amazing things you can do right now: build your own agent.
I've got a repo on GitHub. It's got a couple of thousand stars, 300 lines of code. What you do is you have the agent, and then you use the agent to improve itself. Then you build with a recursive latent space. And I think that's something that causes someone's head to completely flip: you go, "Well, this is just a chat app," and it's like, "Cool, do you want it to be a TUI or do you want it to be a web app?" "Oh, web app." Prompt for it. And you're like, "What the hell? I just saw self-evolving software." Yes. This is the one thing that I can highly suggest to nudge people if they're still stuck in the center: just get them to build their own agent and get them interested in the idea of evolving the software. The software
builds the software. So it's going to be really interesting to see how this pans out, because a lot of people haven't noticed that AI isn't knocking at their doorstep, it's burrowing under their house. And this is really, really scary to me to think about. So many people aren't actually paying attention to what's happening. The closing ponderings:
Removing waste from your company is probably one of the biggest accelerators for AI. I've had clients that have had a repo per design atom, they're using Polymer, 200-plus repos, a box repo, a checkbox repo. Get rid of all that waste, folks. Monorepos are love; mainline that, because agents don't cross boundaries very well. If your primary source of truth for how systems work sits partly in an architecture document in Confluence and partly in markdown spread across repos across the organization, fix that stuff. Fix the waste. Maybe you didn't hire enough designers. Maybe you made your software developers mushrooms instead of product engineers. Fix that stuff. It's only once you fix that type of waste that you really get the acceleration with AI. The organizations who invested in testing and all the things they should have been doing, they're getting accelerated by AI. Meanwhile, a big brand is having problems with AI, and it's like, you know, in that organization they had no testing policy and they hadn't prioritized it. So, no [ __ ], they're having problems adopting AI. They had low standards.
There's an old saying that ideas are worthless and execution is everything. But what does it mean if you can just rip a fart into Claude Code and it builds that idea? To go back to how Dax opened up: thinking about what to generate is really hard, folks. It's really, really hard. Good ideas are shockingly rare. Shockingly rare. You should be spending a lot of time thinking about what the right thing to generate is. And not only that, when you have an idea, generate 20 varieties of it, touch it, hold it, play with it, and figure out whether it's good. That's how you develop taste. Now, I
keep mentioning identity functions. This
is something that's kind of weird. We used to have tribes in software: "Oh, what are you?" "Oh yeah, I'm a Ruby developer. I'm a PHP developer. I'm a Golang developer." And we had subtribes: do you use Neovim, do you use Emacs? All that stuff doesn't matter anymore. It's all been erased. And that creates for people kind of a wound, a psychological wound, because it's all been erased. None of it matters. All that matters now is that you're a software developer. I would expect any software engineer to be able to pick up Rust within a couple of days now, or PHP, or whatever else, because it's all been made fungible, and that's going to be really hard for a lot of people to stomach.
But one thing I can say is
if you see someone stuck with this psychological wound, get them to build an agent, get them to use that agent to improve that agent, and chase evolutionary software and recursive latent space. I've found that 10 out of 10 times that snaps them out of it, because I don't want engineers who just download whatever comes up on Hacker News. I want engineers who understand the inner fundamentals. I don't want a mechanic that just swaps engines; I want a mechanic that's able to explain what a piston is, what a tool call is. Engineers are meant to be curious. Thank you.
Okay, thank you for the great talk.
>> Thank you.
>> So, your model is returning low-quality responses and the provider is selling you garbage tokens. Who's to blame? Yes, quantization. Today on trial, we have Philip Kiely, trying to redeem himself with a talk: How to Quantize Models Without Killing Quality. Good luck.
Hello everyone. How's it going? So wonderful to be here today. And wow, Dax was not lying when he said you cannot see; it's just the lights. The lights are bright here in Miami. I am here to talk about quantization, everyone's least favorite thing when they're trying to run their agents at peak hours. So, I'm Philip. I've heard that to make yourself easily identifiable, you should not use a group photo, so I put up a photo of myself with all my buddies from Baseten. We're that company you see in pink and green all over SF, if you're out there. And I work on inference every day. That's what this is. So, what is my agenda today?
What are we going to talk about? We're
going to talk about why models are so
slow. We're going to talk about what is
quantization. We're going to talk about
the great gift of NVFP4, aka a great way
to sell you Blackwell GPUs. Uh we're
going to talk about what is safe to
quantize within models, what is more
risky, and then take a look at some real
world performance and quality results.
But I mean, that's like the agenda,
right? But what's the agenda? Why am I
here? What am I trying to sell you?
People are really suspicious of quantization. You know, all of those greedy inference providers are out there trying to rip you off, selling you poor-quality tokens at frontier prices by squishing their models into these tiny little four-bit number formats. And it's making the tokens sick; you can see that they're weak and sickly tokens. ChatGPT did not necessarily realize that I meant LLM tokens, so it gave me bitcoins, but we're just going to pretend these are LLM tokens. There's a lot of discussion on the internet about how quantization just nerfs models, that second bullet point here, about how maybe under the hood people are sneakily and suspiciously quantizing models down to tiny data formats.
And some people are pretty cool with quantization. Dax was also up here getting involuntary LASIK from these lights this morning; he's good with it. He says some of the highest-quality providers serve models in NVFP4. You know, maybe there's more to quality than just quantization. So who's right? I'm just trying to give you cheap tokens, bro. Quit coming after me about this whole quantization thing. I just want your inference to be cheap and fast, and for your LLM tokens, because I fixed it on this image, your LLM tokens, to be frolicking through a field at 30 to 50% faster speeds. So, how do we get there? The
thing about inference, inference is a
hard problem with a lot of moving parts.
And the thing that I don't want you to
take away from today's talk is, oh,
there's this one magical silver bullet
thing called quantization, and you do it to a model and it solves all the problems and now your inference is fast and cheap. There are actually dozens of different technologies and techniques working together to make inference effective. That's what makes it such an amazing field to work in. But today, we're just going to take a close look at one single technique for one single part of the stack, which is quantization, because again, it's the one that everyone complains about. No one's ever starting Twitter beef over speculative decoding or tensor parallelism. So, let's take a look at the hot topic. All right. We've got various levels of inference engineering knowledge in this room. I was actually talking to someone
yesterday who did a graduate thesis on
quantization. I was like, "Oh, you want
to just like give my talk for me,
please?" Uh, but she she's talking about
something else. Um so and and and some
people who are a little newer to the
field. So we're just going to start with
some basics. Um if you've you know run a
model on a GPU just like take a power
nap for two minutes. I know we are right
after lunch. So LLM inference has two
different phases: a prefill phase and a decode phase. Generalizing here, prefill is bound on compute, how many operations per second you can do, so to make prefill faster you want access to faster cores. Decode, on the other hand, is the tokens-per-second part. If prefill is time to first token, decode is tokens per second. That's bound on memory bandwidth: how fast can you move data from the VRAM into the L0 and L1 caches to actually use it for inference? For this, we need to move less data. And in inference in general, not just for LLMs but for image and video generation as well, you can be compute-bound, and for audio transcription, speech synthesis, etc., you can be memory-bound.
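As a rough, hedged illustration of why decode cares so much about memory bandwidth, here is a back-of-envelope ceiling on tokens per second: the assumed model size and bandwidth figures are illustrative, not benchmarks.

```python
# Back-of-envelope: decode speed is roughly bounded by how fast you can stream
# the weights through the GPU memory. Assumed numbers, not benchmarks: a 70B
# dense model and ~3.35 TB/s of HBM bandwidth on a single GPU.
params = 70e9
hbm_bandwidth_gb_s = 3350  # assumed GB/s

for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    # Each decoded token must read (roughly) all the weights, so
    # bandwidth / weight size gives an upper bound on tokens per second.
    max_tok_s = hbm_bandwidth_gb_s / weight_gb
    print(f"{fmt}: ~{weight_gb:.0f} GB of weights -> ~{max_tok_s:.0f} tok/s ceiling")
```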
Quantization helps with both. It's one
of the only model performance techniques
that actually helps with both your
compute problems and your memory
bandwidth problems at the same time. So
let's just take a detour back to the old days of computing, when we talked a lot about compression. Compression has been around for a long time; if you've watched Silicon Valley, they made a whole six-season TV show about a compression company. And we've gotten really good at compression. Take these two images up here, and just to put myself in your frame of reference for a second: the one on the
left is four times bigger than the one
on the right, but they look identical.
Uh, you know, maybe if I took the one on
the right and I blew it up onto a
billboard, you might be able to see the
difference, but here on my screen at
least, it looks just fine. So, how can
we do the same thing for models?
You know, again with quantization: the problem with models is all these kernels, these GEMMs. They take a long time. You've got to read the data off the VRAM and then do the matrix multiplication on the cores. So the solution is: what if you just had smaller numbers to work with? What if every time you moved 100 megabytes through the VRAM, you were moving twice as much information to the model? What if you were using cores that were twice as powerful because they're operating at a lower precision? Quantization is this kind of magic thing that increases your effective bandwidth. It increases your cache residency if you can do KV cache quantization. It works with any model and any modality to solve any bottleneck. So
why doesn't everyone just love this
thing? Well, the problem is that
generative AI models exhibit emergent behavior: you throw some stuff in, some things happen, and then an output occurs. These are the sort of technical insights you come to a Philip Kiely talk for. And the problem with "some things happen" is that we all know inference is non-deterministic; a bunch of stuff can happen. If you're effectively rounding a bunch of your numbers, multiplying them together, and compounding errors throughout the inference process, maybe you're still going to get the result you want, maybe you're not. Maybe your logits end up skewed just slightly, your prediction probabilities end up just off, and then your next token ends up being "act" instead of "abs", and then you get up on stage and suck in your gut for 25 minutes so that you act like you have abs. This can be a problem in your inference system.
So you know the the other thing I want
to talk about just to sort of
disambiguate this really quick before we
jump in is I'm mostly talking about
post-training quantization.
Increasingly, AI labs are setting the
model's native precision to be somewhat
smaller so that they can take advantage
of advanced inference without losing any
of that quality. But we're talking about
the stuff that's under your control as
an individual developer pulling down a
model off of hugging face. The
post-training quantization that you can
do at inference time. So, for our purposes, the model weights are already baked. They're already
done. We're not doing any further
distillation or fine-tuning or RL or any
of that kind of stuff. The model's ready
to go. It's already as smart as it's
going to be. We just want to make it
faster and hopefully not make it any
dumber. So, let's take a look at that.
Let's take a look at where we're
starting and where we're moving to.
Doing okay on time. Okay. So, the data formats. You can represent the components of a model, and if you think about Kimi with a trillion parameters, we've got a bunch of matrices that have a trillion different numbers in them. How big is each of those numbers? How many bits are we using to represent them? And what format are we using: an integer or a floating point? Floating-point numbers have three different types of bits within them: the sign bit, positive or negative, then the exponent bits, and the mantissa bits. If you think about the way you construct a floating-point number from these bits, it's essentially the sign, times two raised to the exponent, times the mantissa. And that gives you something called dynamic range, which we're going to get
to in a second. Now, part of the problem
with quantization, part of the reason it has this really bad reputation, is that a lot of these floating-point formats are relatively recent. So if you look here, oh good, you can see my mouse, this is fantastic. In the 2022 era, when Hopper and Lovelace were first rolling out, they brought with them the concept of FP8. Blackwell brought with it the concept of FP4 for inference in production. And before that, when we were doing quantization on Ampere, on Turing, or on local hardware, in many cases these were integer quantizations. And integer quantizations, not to cast shade, are just not very good. So the industry opinion of quantization was formed on integer quantizations. Now we have floating-point quantizations. Let's see if they're better.
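As a concrete preview of how coarse a four-bit floating-point format really is, here is a small sketch enumerating every value representable in E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), which is the element layout typically used for FP4 in these microscaling formats; the exact layout is an assumption about the slide's format.

```python
# Enumerate the full grid of an E2M1 four-bit float (assumed FP4 element format):
# 1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1, with a subnormal at exp=0.
def e2m1_value(sign: int, exp: int, mant: int) -> float:
    if exp == 0:                      # subnormal: no implicit leading 1
        magnitude = mant * 0.5
    else:                             # normal: (1 + mant/2) * 2^(exp - 1)
        magnitude = (1 + mant * 0.5) * 2 ** (exp - 1)
    return -magnitude if sign else magnitude

values = sorted({e2m1_value(s, e, m) for s in (0, 1) for e in range(4) for m in (0, 1)})
print(values)
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```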
So dynamic range is the thing here with quantization. It's the ability to encode very, very small values and very, very large values on an absolute basis. Floating-point formats use the stuff we were talking about, the signs, the exponents, and the mantissas, to preserve dynamic range. If you think about an FP16 number, your sort of standard format for a model, you start with five exponent bits, which control how big and small your numbers can get on an absolute basis. When you move to FP8 you still have four of them in most cases, so you've actually only lost one bit of dynamic range even though you've shrunk substantially. When you get down to FP4 though, which is where we want to go so that we can do really fun, fast stuff on Blackwell, you lose two more, and now you're kind of cooked from a dynamic range perspective. So what do you do about it? You're trying to map all of these values from this massive range down to literally just 16 buckets. With a four-bit floating-point format, you only have 16 numbers representing what used to be 65,000 numbers. How do you do it? How do you put all of these numbers into these buckets? Well, you cheat: you have something called a scale factor that lets you record additional information, at the cost of keeping track of more numbers and doing more math. You can have a scale factor at the tensor level, at the channel level, or at the block level. And today the best small-scale formats are microscaling data formats that use blockwise quantization. You have a couple of general-purpose ones, but I'm not here to shill general-purpose things that you can run anywhere. I'm here to shill NVFP4, which you can run on
Nvidia Blackwell. Um and the difference
with this format is that you actually
have two scaling factors and a smaller
block format. So your blockwise scaling is n = 16. That means each block-level scale factor is applied to 16 numbers. Why is 16 important? That's how many different values you have. So now you can use one value for everything and then put a scale factor that maps that block appropriately. And then to make sure that you get a whole ton of dynamic range, you do a secondary FP32 global scaling factor. Now, keeping track of all of this stuff is hard and expensive and slows you down a little bit, but it's all baked into the Blackwell architecture. So we can just forget about it and run NVFP4, and our life is great. Um, it provides increased accuracy because your block scale factor is now an E4M3, so you have some mantissa in there for specificity. You get your extra exponents from your tensor scaling factor, and life is good. That's my talk,
everyone. Just use NVFP4 and
everything's easy. You're done. Oh wait,
we're not done. Okay. there there's
still there's still some other things
you have to do besides just use the
magical data format um that only works
sometimes in some cases.
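Here is a rough numpy sketch of the blockwise idea: snap values to the small set a 4-bit E2M1 float can represent, with a per-16-element block scale plus a global scale. It is a simplification (the real NVFP4 path stores the block scale as FP8 E4M3 and runs in hardware), just to show how the two scale factors recover dynamic range.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (plus their negatives).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0

def quantize_blockwise_fp4(x: np.ndarray, block: int = 16):
    """Toy NVFP4-style quantization: global FP32 scale + per-block scale + 4-bit values."""
    x = x.reshape(-1, block)
    global_scale = np.abs(x).max() / FP4_MAX                     # FP32 tensor-level scale
    block_scale = np.abs(x).max(axis=1, keepdims=True) / (global_scale * FP4_MAX)
    block_scale = np.maximum(block_scale, 1e-12)                 # (real format stores this as FP8 E4M3)
    scaled = x / (global_scale * block_scale)                    # now everything fits in [-6, 6]
    # Snap each value to the nearest representable FP4 magnitude, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, block_scale, global_scale

def dequantize(q, block_scale, global_scale):
    return q * block_scale * global_scale

# Blocks with wildly different magnitudes still round-trip reasonably thanks to the scales.
w = np.random.randn(4, 16) * np.array([[0.01], [0.1], [1.0], [10.0]])
q, bs, gs = quantize_blockwise_fp4(w)
print("max abs error per block:", np.abs(dequantize(q, bs, gs) - w).max(axis=1))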
So the the the big question is like what
can you actually quantize?
You know there's a sort of spectrum of
pretty safe to like why the hell would
you touch that in terms of your
parameters and your model weights these
gigantic linear layers are like pretty
safe to quantize in most cases.
Generally, we see a lot of success
pushing those all the way down to four
bits. Activations in KV cache, maybe
like a year or two ago, that was even
kind of risky to put in eight bits. Now,
we're getting pretty good at at putting
it in 8 bits. And attention, don't touch
attention. Just leave it alone. It's
hard enough. Just just just let it do
its thing. Um, so generally you don't
quantize attention. Um, unless you are,
you know, feeling really really lucky
that day. Um so the uh you know if you
want to take a deeper look in um you
know what parts of the model to quantize
um check out my my friend Ali's uh blog
post on Twitter uh called four bits um
where he goes super deep into for an
image model some of these different
layers and what is and what is not safe
to touch. But in general not all layers
are created equal. For example, your
input and output layers um from that
weights block might be more sensitive.
You might only want to quantize some of
the interior layers. You might only want
to quantize, you know, part of the uh
you might want to, for example, in a
vision language model, leave the vision
encoder alone because it's it's small
and it's more sensitive and and just
focus on the main LLM layers. there's
all kinds of sort of model specific uh
specificity and and sensitivity that you
want to account for in this quantization
process. Um so it's it's always
important to keep that in mind as you're
working.
The other thing of course to keep in
mind is the hardware and kernel support.
Uh just because NVIDIA says you can
quantize something to a certain format
and run it on a certain GPU does not
always mean um that that you're going to
be extremely successful in doing that in
production. Again, a lot of the open source work and the kernel work is still targeting Hopper, is still targeting that FP8 quantization. So, if you're trying to run NVFP4, you should expect to have to do a lot of porting to get something like a DeepGEMM kernel up and running on your new Blackwell architecture.
There are other factors to think about
in terms of which models you can and
cannot quantize. You know, the biggest
one is model size. All else equal, like
models with more parameters are more
resistant to negative quality impacts
because any individual outlier that
might have gotten smoothed over in the
process is not as important in say like
a trillion parameter model as it is in a
billion parameter model. You can also of
course within the architecture of the
model itself introduce quantization
aware training which labs are
increasingly doing. If you look at, for example, GPT-OSS, that has an MXFP4 native quantization, which is one of the reasons that model had so much staying power on the market, because it resists quantization very, very well. And then the final thing to think about, when you're actually the one hands-on doing the quantization, is the calibration process. As you're using, for example, NVIDIA ModelOpt or some other tool to apply the quantization, calculate what the new weights should be, and calculate your scale factors, you want to do those under conditions that very closely match production usage. For example, if you're using a chat data set and you're going to use your model for code generation, that's probably not going to give you a very appropriate calibration output.
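As an illustration of what calibration is doing (this is not the actual ModelOpt API), here is a toy sketch: run representative inputs through a layer, track the max absolute activation, and derive a scale factor from it. The point is simply that the scale you end up with depends entirely on the data you calibrate on.

```python
import numpy as np

rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64)).astype(np.float32)  # stand-in for one linear layer

def calibrate_amax(batches):
    """Track the running max-abs activation over a calibration set."""
    amax = 0.0
    for x in batches:
        activations = x @ weight
        amax = max(amax, float(np.abs(activations).max()))
    return amax

# Hypothetical calibration sets: one "chat-like", one "code-like" distribution.
chat_batches = [rng.standard_normal((8, 64)).astype(np.float32) for _ in range(10)]
code_batches = [3.0 * rng.standard_normal((8, 64)).astype(np.float32) for _ in range(10)]

FP8_MAX = 448.0  # largest normal value in FP8 E4M3
for name, batches in [("chat", chat_batches), ("code", code_batches)]:
    scale = calibrate_amax(batches) / FP8_MAX
    print(f"{name}: scale factor = {scale:.4f}")
# Calibrating on the wrong distribution gives you the wrong scale, which is the
# mismatched-calibration failure mode described above.
```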
Cool. So, we've we've done all this hard
work. Let's see if it was actually, you
know, good for anything. So, to review,
quantization was bad because we had
these integer based data formats. We
were quantizing small early models like
a Llama 70B or something and we were
doing it in a sort of generic way a
couple years ago where quantization was
not applied or calibrated super
specifically. And that's how quantization got this bad reputation as the lobotomizer of models. And today quantization can work because we have floating-point formats, we have quantization-aware training, and we kind of sort of at least a little bit know what we're doing.
So
all of that to say like does it actually
work? Does it actually do anything? And
here's the other kind of gotcha with
quantization is like you might look at
the spec sheet and be like whoa I can
get like four maybe three and a half
times faster just based on my flops and
VRAM bandwidth and and that is just
going to translate linearly to
performance gains. It's not. Or you could look at your kernel profile and say, like, well, if every single one of these becomes twice as fast, I'm going to get like 1.9x faster. You're not. Uh, generally observed gains from FP16 to FP8 or FP8 to FP4 are like 30 to 50% with every step. So, it's definitely a sort of more limited observed real-world gain, but still, I mean, 30 to 50%, that's an extra, you know, 60 tokens per second. That's a hundred milliseconds off your time to first token. That's a few million dollars off your inference bill. It's a big outcome. Um, if
you can, you know, confidently get your
model there. So that's where the quality
checking comes in. You've got to first
off, I mean, everyone always says look
at your data, look at your outputs. This
is my sort of counter example to the uh
to the compression image that I showed
earlier. We have here a full precision
tiger and an NVFP4 tiger. Can anyone
tell me the difference between these two
tigers? Uh, or can anyone notice that I
actually uh switch switched the labels
on you? They look exactly the same, but
the one under full precision is actually
the NVFP4 tiger um with of course the
same seed and the same settings and all
to get this, you know, very very
identical output image.
And then for something that you can't just look at, maybe something like an LLM or a more complex agent, you can look at a perplexity score. You don't want your perplexity to go up. You can of course just run your same eval set on the original weights and the quantized weights and make sure that everything is within a comfortable margin of noise. You can do spot checks.
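A minimal sketch of that before/after check: compute perplexity from per-token log-probs for the original and quantized models on the same eval set, and check the gap stays inside your noise margin. The log-prob values below are dummies; in practice they come from your eval harness.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-prob) over the eval tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Dummy per-token log-probs standing in for real eval runs.
baseline_logprobs = [-1.8, -0.4, -2.1, -0.9, -1.2]
quantized_logprobs = [-1.8, -0.4, -2.1, -0.95, -1.2]

ppl_base = perplexity(baseline_logprobs)
ppl_quant = perplexity(quantized_logprobs)

NOISE_MARGIN = 0.02  # accept up to a 2% relative regression; pick this from your own noise floor
within = ppl_quant <= ppl_base * (1 + NOISE_MARGIN)
print(f"baseline {ppl_base:.3f}, quantized {ppl_quant:.3f}, within noise margin: {within}")
```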
Always check, you know, your function calling, long context, all that kind of stuff. And if it's not quantization, your model could feel dumber because of the reasoning effort, because of the chat templates, because of a new checkpoint. There's all kinds of other reasons. So quantization can be bad, but don't blame it for everything. Um, quantize if you want to make it faster, and you can do so without making it dumber. Thank you all so much. Um,
and I'm just going to take uh a few
seconds to plug something. Uh, you can
get this book. I unfortunately did not
have enough room in my luggage for
everyone. Uh, so if you scan this QR
code, hop on the wait list, um, I will
send you an email when I get Shopify up
and, uh, you can like DM me a picture
from AI Miami. I'll send you a code for
a free one. Thank you so much and let's
have a great time out here.
Whoa, Philip, that was a very energizing
talk. For the next one, I hope you all
are going to go bananas for this one
because we're inviting the Google
DeepMind team who's going to be talking
about generative media. So, including
your favorite models like nano bananas
and more. Okay, so I'm going to introduce my colleagues Alisa and Guillaume. So, welcome to the stage. So, Alisa and Guillaume are from Google DeepMind and they both work on AI Studio. So, if you haven't heard about it, they're going to do a demo for us today. Uh, so super exciting and I'm going to let them introduce themselves a little bit more. Uh, we have a dynamic duo of a PM and a developer advocate. So they're gonna walk you through the creative world that you can get into with AI Studio and generative media. So take it away.
>> That's
Hi everyone, my name is Alisa Forton.
I'm one of the PMs on the Google AI Studio team and my focus is generative media
models, specifically image, video, and
audio models. And this is my partner in
crime.
>> So hello everyone. I'm Guillaume. I'm Alisa's partner for all of the Gemini model launches. I'm a developer advocate, meaning that my job is to represent all of you inside of Google and to make sure that whatever we release is easy for developers to use, so that you can easily make things with our models.
Um, so we are going to talk about generative media, but just a word about the vision of DeepMind for AI models. From the beginning, the vision for DeepMind was to make multimodal models, because we believe that we need the models to be able to understand as many modalities as possible, so videos, audio, sensors, speech, and so on, kind of like our five senses, and to also be able to express themselves in all of those modalities, so to be able to generate images and so on. That covers most of the things we are going to talk about today. And the reason we want that is because that's what we call world models, and we want the model to be able to understand everything about the world and to act on it. Uh yeah,
>> and before we continue, uh we are going
to do a bunch of live demos today. I
don't know if it's actually going to be
helpful, but maybe some phones can go on
airplane mode. Uh that might help our
live demo. We were practicing and the
speeds are kind of slow, but um thank
you. We'll be prepared. So why do we love working on Gen Media? Think about three words: entertain, communicate, and learn. Open your phone, look at the world, there's media all around you. When we're improving these models, we aren't just, like, tweaking code or training data. We're actually helping teachers teach, helping businesses connect with their users, and we're helping creators create. So that's what we're building and that's why we're so excited to speak about this with you today.
>> And so we are shipping a lot of things at Google, and that slide is just about the gen media models we have. We have the video ones, the image one with nano banana. We have Gen3 that we shipped quite recently as well. And so we know that we are shipping so many things that it's hard to keep track of the whole offering. And that's the reason why we wanted to do this talk: to show you the different models that we have, to give you some tricks, and to do live demos about what you can do to improve your usage of the models.
>> So this brings me to the first model I'm
going to talk about. So when you think
about kind of the past and how the
camera was used to capture the reality,
generative video actually brings your
imagination to the real world. And so
that's why Veo is so exciting and we're
just at the beginning of what generative
video can do. Um when we work to create
these models, we always think about how
can we make it accessible to as many
people as possible. So we're thinking
about those creators. We're thinking
about the teachers. We're thinking about
the small businesses and how they're
reaching their audience. So that's why
Veo has a family of models. And the most recent model that we introduced is Veo 3.1 Light, which is basically our fastest. It's our most cost-effective model that will allow you to quickly prototype and bring things to production. So when you're looking at our models and you're kind of thinking which model should I pick for my use case, this will be kind of similar across all of Gen Media. You're going to use Veo 3.1 Light or a fast model when you need to quickly prototype something, do a bunch of generations at speed, post a bunch of videos on social media. And then when you're truly building something cinematic and you need the 4K outputs, you're going to move to our flagship Veo quality models.
>> So um just some very quick tips on how to get something better with Veo, and as we said, we are doing live demos and it's never working. So, um, but very quickly, one of the things most people are doing when using Veo or any generative media model is, let's try a prompt that is very representative of what we get. And the thing is, most people are just sending prompts that are just too short. And the shorter the prompt is, the harder it is for the model to know exactly what it has to generate, and so it basically has to fill the gaps. And that's the reason why the first tip that we can give you for any gen media model is to actually write long prompts, the longest prompts you can write, because you need to reduce the number of things that the model will have to invent by itself. That's the reason why I made this quick demo that is basically using Gemini to take your prompt and enhance it, to make it better and to have more details about what it needs to generate. And the good thing with that is that the longer the prompt is, the easier it's also going to be to take the video that you made with Veo 3.1 Light and make it better and larger with the standard model, because if the prompt is very detailed, the model is going to know exactly what to do and do more or less the same thing each time.
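A minimal sketch of that prompt-enhancement step, assuming the google-genai Python SDK; the model name and the instruction text are illustrative, the idea is just to have a Gemini text model expand a terse idea into a detailed shot description before it goes to the video model.

```python
# pip install google-genai
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def enhance_prompt(short_prompt: str) -> str:
    """Ask a Gemini text model to expand a terse idea into a detailed video prompt."""
    instruction = (
        "Rewrite the following video idea as a detailed generation prompt. "
        "Describe the subject, setting, camera movement, lighting, mood and style "
        "explicitly, so the video model has as few gaps to fill as possible.\n\nIdea: "
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",          # illustrative model name
        contents=instruction + short_prompt,
    )
    return response.text

detailed = enhance_prompt("a manatee surfing at sunset in Miami")
print(detailed)  # feed this into the video generation call instead of the short idea
```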
And if you want to go even further, you can use what we call JSON mode, basically just using a very big JSON with a lot of fields like title, creative summary, and, as you can see, for each character the name of the character, the visual description, attire, hairstyle, accessories, and so on, where the location is, what the art direction is, and so on and so on. The point of this is to be thorough and to have this kind of checklist of all of the things I need to tell the model, so that I'm sure it's going to do exactly what I want it to do. And you can see that we even have those chunks of what's happening between the first second and the second one, and what's happening in the next two seconds, and so on. Um,
>> I think one thing to add about JSON prompting is that our evals don't actually show that JSON prompts work any better than natural language prompts. But if you're structuring a prompt in your head and then you want to update things like, you know, timestamps and what's happening and small background details, JSON allows you to keep that structure and make these minor changes.
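A sketch of what that kind of structured prompt can look like; the field names here are illustrative (there's no required schema, and per the note above JSON isn't inherently better), but keeping it as a dict makes it easy to tweak one timestamp or one character and regenerate.

```python
import json

video_prompt = {
    "title": "Sunset surf",
    "creative_summary": "A playful manatee surfs a small wave off Miami Beach at golden hour.",
    "characters": [
        {
            "name": "Manny",
            "visual_description": "a round, friendly manatee",
            "attire": "a red AI Engineer cap",
            "accessories": ["a small blue surfboard"],
        }
    ],
    "location": "Miami Beach shoreline, palm trees in the background",
    "art_direction": "warm golden-hour light, gentle handheld camera, cinematic color grade",
    "timeline": [
        {"seconds": "0-2", "action": "wide shot of the wave forming, Manny paddling"},
        {"seconds": "2-5", "action": "Manny pops up on the board, spray catching the light"},
        {"seconds": "5-8", "action": "slow-motion close-up as he glides toward the shore"},
    ],
}

prompt_text = json.dumps(video_prompt, indent=2)  # pass this string as the generation prompt
print(prompt_text)
```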
>> So you see the kind of things you get with a proper prompt that describes everything you want it to do. Um, yeah, the generation was a bit slow so I ran it before. Um, but let's go back to our slides and nano banana. Now
>> next one. Where's your bananas? Everyone's recent favorite, nano banana. And just like with all of our other models, we have a family of nano banana models now. So to give you just a quick TLDR, as we're running out of time: when you're looking at Nano Banana 2, you're looking at quick prototyping. You may need draft-like resolutions, 512 pixels, you may need lower cost. This is what you're looking at with the workhorse model, Nano Banana 2. When you are ready to go to more high-fidelity, photorealistic, natural outputs, this is where Nano Banana Pro still excels.
And um, all of our nano banana models support a variety of aspect ratios and resolutions. Basically we want to make sure that we meet all the project and asset needs um that you have today.
>> So one of the core features of Nano Banana since Nano Banana Pro is that it's able to use search grounding, so basically go on the internet and search for information about what you are asking it, so that it can give you the latest information. So you can ask it to, like, make an image about the news from yesterday, about the score from a specific sports game. And for example in this case, look me up and make an image that just represents my work and everything I did in my past work.
>> And then recently with Nano Banana 2 we
actually introduced grounding with
Google search for images. So, you know, generative media models, they're great at the actual rendering and the outputs, but what they're really terrible at is facts, right? So, if you look at this top right, you see there is a bridge, the IU bridge in Pakistan, that we were trying to represent here, and we asked Nano Banana Pro, which only has search grounding, to render a watercolor of this bridge. You see how it missed the structure of the bridge and how complex it is. Well, Nano Banana 2 actually grounds the responses in the image as well. So, it's not just hallucinating the output of what it thinks the bridge should look like or what the training data says it should look like. It's able to view the image of this actual bridge, and you can see how the bridge on the top right is rendered with more accuracy.
>> So, uh once again some quick tips for Nano Banana. One of the first things is that you need to remember that the model was mainly trained to do multi-turn editing. So you give an image and you ask for edits on the image. I know that most people are using it to generate images, but that's actually not what the model was meant for initially. Um, it's also very good to use references. So you can give it a lot of images that you want to reuse, as in this is my character, this is the scene, and so on. And that's how you get the best results. Um, and we also get a lot of requests, because people were kind of disappointed about that, for how to get, like, a transparent background. So I made this quick app, which I need to reload currently, um, just to show you the way I'm using to create transparent backgrounds for images using Nano Banana. And I think we're running out of time, so let's go with the pre-run one.
Um, so basically this was a manatee outing in Miami wearing an AI Engineer cap. And the way I'm doing that is that I'm creating a first image and asking it to create a white background. And then I'm asking it to change the white background to a black background. And the thing with Nano Banana is that it's pixel-perfect. So when you ask it to change an image, it's not going to touch any of the pixels that it doesn't need to touch. So then you just need to do a diff, and everything that changed between the two images is going to be the transparent background. So that's how you can get a transparent background very easily.
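A sketch of that diff step with Pillow and numpy, assuming you have already saved the white-background and black-background generations to disk (filenames are illustrative): any pixel that changed between the two renders is background, so it gets alpha 0.

```python
# pip install pillow numpy
import numpy as np
from PIL import Image

white_bg = np.asarray(Image.open("manatee_white_bg.png").convert("RGB"), dtype=np.int16)
black_bg = np.asarray(Image.open("manatee_black_bg.png").convert("RGB"), dtype=np.int16)

# Pixels the model left untouched between the two edits are the subject;
# pixels that changed are the background we want to drop.
diff = np.abs(white_bg - black_bg).sum(axis=-1)
alpha = np.where(diff > 30, 0, 255).astype(np.uint8)   # 30 = small tolerance for compression noise

rgba = np.dstack([white_bg.astype(np.uint8), alpha])
Image.fromarray(rgba, mode="RGBA").save("manatee_transparent.png")
```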
And there was another neat trick that I wanted to show you, in case you want to create lots of images using nano banana. Um, you have different ways to save on cost. You can use batch, you can use the new flex service level, and both of those reduce the cost by half. But there's actually another very neat way to reduce the cost. And I'm just going to, yeah, jump right away to the results. But basically, the trick is, instead of creating multiple small images of 512 pixels, create a big image in 4K but ask for it to be a grid with multiple images. So in this case, I'm just asking for an AI conference in Miami named AI Engineer, and the model is actually going to create 64 images representing this prompt. But deep down it was actually one image that was a grid, and then I just had to cut the image to get my small images out of it. And if we just check on the prices, it cost me the price of one image instead of 64, even though they were smaller. So I'm saving 95% of the cost that way. And that even works if you want, like, all kinds of different prompts. So this one is 24 different prompts, and the model was still able to create those 24 different images within one image, right? In one go. So that's a kind of neat trick if you want to save on cost with Nano Banana.
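The slicing half of that trick is just cropping. A minimal Pillow sketch, assuming you asked for one 4096x4096 generation laid out as an 8x8 grid (so 64 tiles of 512 pixels); the filename and grid size are illustrative.

```python
from PIL import Image

GRID = 8            # 8 x 8 = 64 tiles in one 4K generation
sheet = Image.open("ai_engineer_grid_4k.png")   # one image generated as a grid
tile_w, tile_h = sheet.width // GRID, sheet.height // GRID

tiles = []
for row in range(GRID):
    for col in range(GRID):
        box = (col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h)
        tile = sheet.crop(box)
        tile.save(f"tile_{row}_{col}.png")
        tiles.append(tile)

print(f"cut {len(tiles)} tiles of {tile_w}x{tile_h} px from a single generation")
```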
Um, and now let's let's move to our the
next the latest addition to our gen
media models.
>> Yeah. So, Lyria. Again, the Lyria model family is our flagship song and music generation model. Um, again, we're offering you two different models. The first one is great if you're doing, like, quick loops or quick promo audio, because it's a 30-second song generation. And then Lyria 3 Pro is actually our kind of flagship model that can generate a full song, and it allows for a lot of control to support different parts of the composition, and it can also take different multimodal inputs, like images, to make a song based on any image that you provide. And then of course all of the songs can be in different languages. Um, so just like with all of our models, when you're prompting, the models do take natural language, but it does help to have the structure in your mind for how you're going to prompt for these songs. And so Guillaume, you just take it away with a demo.
>> Yeah. So, the thing is, with this model, what's really cool is that you can prompt it in different ways. And one of the easiest ways, you can say, "This is what I want in the intro, this is what I want in the first verse, this is what I want in the chorus and the second verse," and so on and so on. And the model is going to build the song based on that. And for each of those prompts, you can say the style of music you want, how energetic you want it to be. You can even say the scale, or set the BPM for the whole song, and it will create the song according to what you asked. So, um, it's generating at the moment. I don't know if we're going to wait for it. Maybe sometimes it can take one minute. Um, but we can check the one
that I regenerated earlier with the same
one. Um, and I asked the beginning of
the song to be um,
we don't have the sound. Can you enable
the sound of the laptop, please?
>> No.
>> Can we have the sound working?
>> Perfect. Thank you.
And then and then it's speeding up
because we asked for the verse to be I
don't know a mix of kumbia and salsa
with Spanish lyrics.
>> And then if we fast forward I think it's
going to be even more
dynamic.
I think, like, one thing to add here is that if you're using the 30-second model, your prompt needs to be able to fit within those 30 seconds as you're generating your timed durations. And then if you're using the full-song model, obviously you have a little bit more creative freedom. If you go to Google AI Studio, you can check out Lyria and the special composer mode, which, Guillaume has it right here, basically allows you to construct your prompt the same way that you might be passing it to the Gemini API.
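A sketch of that section-by-section structure as plain text; the bracketed section markers, styles and BPM line are illustrative and just mirror the checklist described above, not any required syntax.

```python
# Illustrative sectioned song prompt: one line of global settings, then per-section directions.
song_prompt = "\n".join([
    "Overall: upbeat summer track, 104 BPM, key of A minor, Spanish lyrics.",
    "[intro] soft acoustic guitar and ocean sounds, low energy, 8 bars.",
    "[verse 1] cumbia groove comes in, playful percussion, medium energy.",
    "[chorus] full salsa horns and piano montuno, high energy, big vocal hook.",
    "[verse 2] stripped back to bass and congas, call-and-response vocals.",
    "[outro] horns fade out, leave the ocean sounds, gentle ending.",
])
print(song_prompt)  # pass this as the prompt for the full-song model
```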
>> Yeah. And um let me show you another very cool demo with Lyria as well, because Lyria 3 is not actually the first music generation model that we released. We have this other one that is named Lyria RealTime, and instead of creating music from a prompt, it creates music in real time, as the name implies. It's a live model, so it will continue to create music as soon as you start it. And in this demo you are in this world in space, and every planet is a prompt, and when you move around the planets the music changes in real time. And one of the cool things, so I never know how to find it, but there's a place where there are Christmas songs and just next to them there's Scandinavian metal. So the mix can be quite fun when you mix them all. But I think we're just going to have to go with salsa, I guess.
>> But see, it's very cool the way it just mixes things in real time.
>> Um, and there was another demo that we
wanted to show is this one. You wanted
to try this one.
>> I want to try it. Yeah.
>> Okay. Are we ready?
>> No. Apparently need to reload again.
Okay. Uh, let's go.
>> Hey DJ, can you hear me?
>> You're live on air with the jukebox DJ
and we're about to create an absolute
banger together. So, what's the vibe
we're going for and what should this
Sonic creation be all about? Lay it on
me.
>> Um, okay. Okay, let's create a song
about really sunny weather in Miami and
make it Latin style.
>> Oh, lovely jubbly Miami sunshine and
Latin flavors. That's a proper summer
anthem right there. Now, are we talking
about sipping mojitos on the beach,
dancing till dawn, or something a bit
more chill? And what kind of Latin style
specifically? Salsa, reggaeton, bachata?
Give me those details, mate.
Yes.
>> Uh, yes.
>> Whatever. Yes.
>> Get going.
>> That means we're moving.
>> That wasn't either.
>> Absolute banger. I need a little more
info, mate.
Spicy like salsa or something smoother.
And what's the story? Just sunny days or
maybe a summer romance? Hit me up.
>> We can't hear what it's saying because
>> salsa.
>> Salsa.
Top of the pops, salsa. It is. Picture
this. Pure sunshine.
>> Yeah, you know, that's kind of the
problem with LLM. Sometimes they're way
too talkative.
>> Right then, some folks just love a good
chinwag, don't they?
Making music happen. Hope you love that
salsa track, mate.
>> I'm glad you're aware of it.
>> Right. Always happy to keep the energy
up and make some noise. Okay, let's
let's move on because we are already out
of time. Let's uh
>> we have 13 seconds, but we were supposed
to do the song in French.
>> Yeah,
>> next time.
>> Okay. Um, so what what you
Oh,
>> that's static noise from the thing.
So, quick note, the talkative DJ is actually our recently released new audio live model. And so, my next slide is we're going to talk about audio real quick. Um, we have a new text-to-speech model, Gemini 3.1 Flash. If you're familiar with text-to-speech models, they take your flat input text and then they will generate some speech. This model is actually the model that's powering everyone's favorite podcast feature in NotebookLM. And so with this new update, we're basically putting the user in the director's chair. You're not only delivering the flat text for the model to then speak. You actually have more granular control, using audio tags to control how the model speaks and what emotion, essentially, it's producing. And then of course we're supporting multiple different languages, with the 24 languages that we've recently optimized to make sure that they're delivering the highest quality and that they have the native accent.
>> So uh very quickly, a very quick demo here. So one of the things with this model is that you can actually take any voice. We have a bunch of hardcoded voices, but you can prompt them specifically for the way they should be talking. So you can say this character is going to have a style of, vocal smile, whatever that means, and speak very fast and with an American accent, while the other is going to be speaking like a rapid-fire newscaster, also with an American accent. And here's what it does.
>> Welcome back to the show. Today we're
diving into the intersection of AI and
creative expression.
>> Exactly. I've got so many thoughts on
what happened this week.
>> It really is shifting daily. I mean, did
you see the demo they dropped on
Tuesday?
>> So, the cool thing is that you can change that in real time. Like, not in real time, but in the text you can say this next sentence is going to be very angry, and then the voice is going to change the way it talks in the middle of the sentence, or things like that. But what I wanted to show you is another neat trick. If
show you is is another neat trick. If
you want to create uh discussions with
more than two characters, you can
actually do that. And that's what I'm
doing here. So I'm basically creating
one um one discussion with uh five
astronauts trying to bake a cake on the
internal space station and for each
character I'm creating one prompt for
each of for for them and then instead of
saying just a discussion with the the
five characters which would not work
with the model because it's limited to
two voices I'm actually uh sending it if
you if we check you check here is that
this is going to be the first character
with who is using the low pitch voice.
I'm using a low pitch and the high pitch
voice and then I put the the the full
prompt that uh represents this character
and then the next one is going to have
its own prompting and so on. And that's
actually giving you uh discussion with
more than than two voices.
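A sketch of that more-than-two-characters trick: with only two speaker voices available, each character gets mapped to one of them plus a per-character style description, and every line of the script is prefixed with its character's directive. Names and phrasing are illustrative.

```python
# Only two TTS voices are available, so map five characters onto them
# and carry the differences in per-character style instructions.
characters = {
    "Commander": ("low pitch voice",  "calm, authoritative, slow and deliberate"),
    "Engineer":  ("low pitch voice",  "fast, precise, slightly exasperated"),
    "Botanist":  ("high pitch voice", "cheerful, sing-song, heavy rolled Rs"),
    "Physicist": ("high pitch voice", "deadpan, clipped, quotes numbers constantly"),
    "Rookie":    ("high pitch voice", "excitable surfer drawl, lots of 'dude'"),
}

script = [
    ("Commander", "Team, we face our greatest challenge yet: the funfetti."),
    ("Engineer",  "This is ridiculous. Flour is getting into the ventilation."),
    ("Botanist",  "Do not be a spoilsport! It is simply a matter of whisking with enthusiasm."),
    ("Physicist", "If the centrifugal force of the whisk exceeds 3.4 g, the batter will atomize."),
    ("Rookie",    "Whoa, dudes. It is a yellow space orb. Capture the orb!"),
]

lines = []
for name, line in script:
    voice, style = characters[name]
    lines.append(f"{name} (speak with the {voice}, {style}): {line}")

tts_prompt = "\n".join(lines)
print(tts_prompt)  # send this as the text for the two-voice TTS request
```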
>> Team we face our greatest challenge yet.
The funfetti
>> is ridiculous. Flour is getting into the ventilation. We will choke on the sprinkles. Oh, do not be a spoilsport, Fetana. It is simply a matter of whisking with enthusiasm. Pip pip.
Actually, if the centrifugal force of
the whisk exceeds 3.4 gs, the batter
will atomize.
>> Whoa, dudes.
Exploding. It is a yellow space orb.
Capture the orb, Chuck. The mission
depends on it.
>> So, you see that it's actually using the same voices, but you get the feeling that they're different voices, like each character actually has their own voice. I see it basically as when I read bedtime stories to my daughter and do voices for each character. It's the same thing, but the model is slightly better than me at it. Um
let's, yeah, we are at time, but very quickly, one of the cool things with the gen media models is that you can mix the models together, and I actually gave a workshop two weeks ago at AI Engineer London about how to do that. So if you want to check it out, you can follow this link and all of the content of the workshop is there. I guess the video is going to be uploaded any time soon as well, so you will be able to see that. Um, and we were planning to ask you what you needed in the models, but since we are running out of time: our job is basically to get feedback from you so that we can steer the models in the right direction and get the actual user needs into the models, instead of what the researchers think the users need. So if you want things from the models, please fill in this form and tell us what you're missing, so that we can use your feedback to make the models even better.
>> The very short form goes directly to me
so then I can bug research to build
things that you will actually go use in
the real world.
Um, yeah, that was
>> this one.
>> We didn't skip.
>> Oh, we didn't skip any slides. No, we
just went quickly.
>> Yeah.
>> Yep. Then that's
>> Thank you. Thank you everyone.
>> Yeah, you're welcome.
All right. Awesome. A note to
presenters, do not skip any slides,
please.
Um, our next presenter works at the
intersection of AI, data engineering,
security, and governance, and she's
going to share with us how she built an
agent with scale and security in mind
for enterprise use cases. I'm excited to
invite Anna on the stage and her talk
will be on From Tickets to PRs: Shipping a Governed Snowflake Ops Agent with LangGraph and MCP. Please welcome Anna.
Heat. Heat.
Hello everyone. They weren't lying. You
really can't see a thing up here. So,
I'm going to pretend I'm just talking to
myself. I'm back in my hotel room
practicing the speech. Um, but no, I'm
actually super excited to be here today.
This is one of my favorite projects that
I've honestly ever worked on. So, I'm
here to talk about how I built a governed ops agent, um, specifically to automate our Snowflake operational work at Pinterest. Now, I'm going to focus a little on the Snowflake types of requests we were getting, right? But I want you to look at this as
a reusable pattern, something that you
can apply to your own operational work.
That could be infrastructure setup
requests, IT help desk ops requests. um
it could be data access to other data
systems. Whatever it is, I've purposely
abstracted away some of those details so
we could focus on the patterns, how we
approached this problem and how how we
chose to solve it. Right?
There we go. Um so just very briefly
what I'll be covering. First, the
problem and why an agent made sense for
this, right? Why I'm even here talking
today. I'll talk about the workflow,
some of the design, some of the guard
rails and controls we added, and lastly,
what made this shippable, and what you
could walk away with today.
And that's the end of my presentation.
Okay,
there we go. Um, okay. So, as we know,
as LLMs get more and more advanced, they
can solve more and more complex
problems. However, the technical
complexity wasn't the issue for us. It
was actually that we had a lot of
routine repetitive requests that just
took manual time to solve. So in the
case of Snowflake, we're getting access
requests, we're getting schema creation
requests,
IP allowlist changes, things like that, which are very repeatable flows, like we have
to check what roles exist, who needs
access to what, maybe people already
have permissions and it just takes time.
So people are sometimes waiting days for
their ticket just because it's in the
queue, right? So
this is going to be a challenge. Oops. Okay. So this isn't screaming "agent" right now, right? It's saying we have an opportunity to automate this. And that's true. If you look at the first three
items here, the ones without the the red
border, right? We've got reviewable
output.
So my team was solving these requests, and we would generate a SQL script to run against Snowflake to grant permissions or make whatever changes you need, right? We have a repeatable process. In the end, everyone's asking for access to some data: sales needs access to this, marketing needs access to this. The use case varies, the data sets vary, the roles and permissions, etc., but they're very similar, repeatable processes. And we had certain control points already built in.
Right? So, when we're building agents,
the goal isn't to move away from PR
reviews. You don't want to move away
from approvals. You don't want to remove
those. Maybe you already have
standardized deployment workflows like
we did. You want to maintain those where
it makes sense. And actually, those are
the guardrails you should be placing.
But still, for those three, you're not going to say you needed an agent for this, right? No. That's still screaming, here's a path to automation. But it's this last point, contextual reasoning, using LLMs for what they're good at, that's where we really saw the opportunity to build an agent. So at
Pinterest, we purposely abstracted away
some of the nitty-gritty details of like
Snowflake role hierarchy, our naming conventions, you know, how we set up different teams' access. So people don't
always know what they're asking for. And
that's on purpose, right? You don't want
someone to get caught up like trying to
figure out the naming convention if
their team has a role. No, we've allowed
them to ask really like simple requests
like the sales team needs access to this
sales data set.
Our agent leans into what LLMs are good at: they're good at understanding text, they're good at gathering context, doing lookups, doing searches. So that's why we really leaned into this and built an agent, and it was really, really super fun. So I didn't mention this
earlier but this actually started as a
hackathon idea that we built in two days
but the road to production took a long
time and that's why I'm here to talk
about this today like how we think about
security governance different controls
we need to add in. So before I deep dive
into the architecture and a little bit
more on the design I want to give you
two key framing principles that drove
how we approached this problem. The
first is my secret to building a good
agent.
A good agent needs a good mascot. So,
anyone who works with me knows I love to
name my agents. I love to have a fun
mascot.
It just makes building them so much more
fun. But on a more serious note,
and something to keep in mind once it
loads,
there we go. match agent authority to
workflow risk. Right? So when I say
authority, I'm talking about what can
the agent do? What should it not be able
to do? And then think about risk. Think
about the systems you're working with.
Think about what data you're touching.
Right? So for us in the case of
Snowflake, we have sensitive production data we store there. And it's a SOX-compliant system, meaning any processes we build out on there, any agents, have to be auditable, right? We can't have agents modifying our data and, you know, messing with our reporting. No. So
as you think about this, as you think
how it applies to your own operational
work, think about what authority you
give your agent and think about the risk
involved.
I'm just pointing this out because
this drove part of our design, but maybe
you have scenarios that are lower risk.
Maybe you can give the agent more
authority.
So in practice, right, that principle I shared means: what can the agent do, what can't it do? So specifically for the agent we built, what can it do? Here we really leaned into what LLMs do best and where our team had a lot of our bottlenecks. So our agent does a great job interpreting requests, seeing what details someone gave us, what's missing, gathering context, doing those lookups, figuring out, yes, it's a data access request, what's needed, what roles, what permissions, generating SQL code, LLMs are great at that, opening GitHub PRs. But that's where we decided to draw the boundary and hand it off to governed workflows, right? So what we decided our agent can't do: it can't write to production, it can't modify any data, and even though it's generating metadata queries too, it can't actually run those, just in case something goes wrong, right? It might drop our entire data warehouse. It can't approve its own changes, but we can have other agents do code reviews. Um, in the end, it really just can't act without constraints. You have to define those boundaries.
So to share more about the high-level architecture, um, I like to break it down into three parts, right? We've got the intake: no matter what your use case, you're going to have requests coming in somewhere, tickets, Slack messages, other ticketing systems, right? And then the agent, that's where all the magic is happening. So we designed our agent as a LangGraph workflow, and I'll go into why in the next slide. But here are all the different things it can do, right? Parsing a request, taking that messy, ambiguous request someone gave us and turning it into structured output, looking up metadata against Snowflake, doing all those lookups, generating SQL, generating the PR even, right? We have LLMs wired into all those steps. So even for PR creation, why not have the agent generate the PR summary, right? Once that PR is out there though, we hit what I'm calling our governed execution part. We always have a human in the loop, especially because it's a SOX system, right?
We're always going to have a person
review the output,
make sure it's accurate, syntactically
correct, the agent didn't come up with
some wild permissions to grant, and then
we still don't give the LLM or the agent
the keys to our production system. Even
if it generated the right SQL, there's
no guarantee it's actually going to
execute it correctly or it won't insert
some some additional queries and drop
our data. Right? So, we stuck with our
standardized deployment workflow. You
can maintain whatever CI/CD processes
you have now
and then apply those changes, right? But
if you think back to the problem, we had
processes that worked, right? We were
really just trying to solve for those
manual pain points, the queueing, the
bottlenecks. So, we use the agent where
it works really well. So, if you're
looking at this point, my team now
doesn't even know requests are coming in
until we get pinged that there's a PR
for review or worst case, the agent has
to escalate something to us that needs
some manual um intervention.
This is what I'm going to be remembered for, the dead clicker.
All right. So I said I'm going to touch on why we went with LangGraph specifically. So in reality, operational workflows aren't one-shot things, right?
You get ambiguous requests. People say
like, I need this and don't give you
half the details. Even when we
standardize our intake for the snowflake
request, we put a question, right? What
level of access do you need? Read or
write, people will still leave that out,
right? So the agent has to go and ask
them and then wait for the the user to
provide more details or in some case
it's a valid request but some of our
request types require some approvals. So
in the case of data access requests we
always have data owners right? So it's
not up to the agent to decide yeah you
could have access to super sensitive
data. No it's always the data owner. So
we're waiting for approvals.
So if you look at this, you see we've
got entry points, re-entry points, exit
points, very specific branching, and the
agent needs to be able to maintain state
or know where it left off. So this is what LangGraph is really good for. If you're not aware, LangGraph is a workflow orchestration framework for stateful agents, right? So we actually maintain state on the ticket right now. Once the agent leaves off somewhere, maybe it's an invalid request, it'll mark that on the ticket. So then it knows where it left off. This way you don't have to rerun the whole workflow. The agent can pick up where it left off, right? You're just waiting for an approval. You know the ticket was valid, unless someone went and changed something, which might trigger a different state. You don't have to rerun the whole validate-request step.
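A stripped-down sketch of that shape in LangGraph (node bodies are stand-ins and the state fields are illustrative, not Pinterest's actual schema): explicit nodes, conditional branching on validity, and a checkpointer keyed by ticket so a request can re-enter where it left off. Import paths assume a recent langgraph release.

```python
# pip install langgraph
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class TicketState(TypedDict, total=False):
    ticket: str
    valid: bool
    context: dict
    sql: str
    pr_url: str

def validate_request(state: TicketState) -> TicketState:
    # A one-shot prompt against validity criteria would go here.
    return {"valid": "access" in state["ticket"].lower()}

def gather_context(state: TicketState) -> TicketState:
    return {"context": {"roles": ["SALES_RO"], "warehouse": "ANALYTICS"}}  # stand-in lookups

def generate_sql(state: TicketState) -> TicketState:
    return {"sql": "GRANT USAGE ON DATABASE sales TO ROLE SALES_RO;"}      # stand-in output

def open_pr(state: TicketState) -> TicketState:
    return {"pr_url": "https://github.example/org/repo/pull/123"}          # stand-in PR step

graph = StateGraph(TicketState)
graph.add_node("validate", validate_request)
graph.add_node("gather_context", gather_context)
graph.add_node("generate_sql", generate_sql)
graph.add_node("open_pr", open_pr)
graph.set_entry_point("validate")
graph.add_conditional_edges("validate", lambda s: "ok" if s["valid"] else "stop",
                            {"ok": "gather_context", "stop": END})
graph.add_edge("gather_context", "generate_sql")
graph.add_edge("generate_sql", "open_pr")
graph.add_edge("open_pr", END)

# The checkpointer keyed by ticket id is what lets the workflow pick up where it left off.
app = graph.compile(checkpointer=MemorySaver())
result = app.invoke({"ticket": "Sales team needs read access to the sales dataset"},
                    config={"configurable": {"thread_id": "TICKET-123"}})
print(result["pr_url"])
```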
All right. And part of my talk title did
mention MCP. So model context protocol.
Um, so I'm going to touch on that. But
in the last slide, we have our LangGraph workflow. You might be thinking this
feels like such a rigid deterministic
flow, right? It always has to do this
step followed by this step followed by
that step. Um, and I'll point out it's
just the flow or the steps that are
deterministic. We still have LLMs used
at different stages, right? Those are
always going to be non-deterministic.
And that's something we have to think
about. But in this case, it feels like
we didn't give the agent a lot of
autonomy, right? And it might be super
tempting like now we have AI, just let
it do everything, right? Take in this
request, solve it, one shot, go. No, you
should really be thinking about where
autonomy will help. where should we
scope that autonomy in and at what
level? So, a step like validate request, it's just a one-shot prompt, right? You get your ticket, we defined a prompt with criteria for what's a valid request, and it'll come back valid or invalid. With the gather-context step, and part of the reason we even built this agent, it's not like there are three predefined queries you always have to run to solve it, or very specific steps, right? It's use case by use case. So, we actually have a metadata sub-agent that has access to the Snowflake MCP server. Um, and it's going to look up different things against metadata views. It iterates, right? It does a lot of iterative reasoning.
And I'll actually show you the tools we
gave it.
It does a lot of iterative reasoning.
So, it'll run a query
that tells it something. It figures out
what other information it needs.
runs another query, keeps going, keeps
going, and then turns all those query
results into structured output. And
again, if you're thinking about how to
apply this to your own operational
workflows,
this one's very snowflake specific, but
maybe you have Zendesk tickets, Jira
tickets, some other system. You could
give it your own MCP for that system,
right? You're just giving it the keys so
AI could talk to those tools. It could
do its own lookup, look up different
states, different searches, figure out
what it needs to find, gather that
context, and once it has all the
information it needs, it's going to move
on to next step,
which for us was generating the SQL.
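A sketch of that iterative gather-context loop with a bounded lookup budget; the LLM call and the MCP-backed query execution are stand-ins here (toy implementations so the snippet runs), the shape of the loop is the point.

```python
import json

MAX_LOOKUPS = 6  # keep the sub-agent's autonomy bounded

def call_llm(prompt: str, findings: list) -> dict:
    """Stand-in for the real LLM call. Here it asks for one grants lookup,
    then declares it has enough context; a real agent would reason over the prompt."""
    if not findings:
        return {"query": "SHOW GRANTS TO ROLE SALES_RO;"}
    return {"done": True,
            "context": {"role": "SALES_RO", "missing_grants": ["USAGE ON DATABASE SALES"]}}

def run_metadata_query(sql: str) -> list[dict]:
    """Stand-in for a read-only query routed through the Snowflake MCP server."""
    return [{"privilege": "USAGE", "granted_on": "WAREHOUSE", "name": "ANALYTICS"}]

def gather_context(request: str) -> dict:
    findings: list[dict] = []
    for _ in range(MAX_LOOKUPS):
        decision = call_llm(f"Request: {request}\nKnown so far: {json.dumps(findings)}", findings)
        if decision.get("done"):
            return decision["context"]        # structured output for the next workflow step
        rows = run_metadata_query(decision["query"])
        findings.append({"query": decision["query"], "rows": rows})
    return {"escalate": True, "reason": "lookup budget exhausted"}

print(gather_context("Sales team needs read access to the sales dataset"))
```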
The other step I want to highlight
though, right, you saw on one of the
previous slides, we always have a human
in the loop. So we could have generated the SQL and just gone and created the PR, triggered a PR review, right? But I want to highlight this review-and-repair step, and my point here is: don't blindly trust the first-pass output, right? We've kind of designed the agent to work well, use it for what LLMs are good at, and we actually chose to use SQL templates for this. So, it's not like here's a template, here's what the output is always going to be. But Snowflake has certain standardized queries, right? You're trying to create a role, there's a CREATE ROLE IF NOT EXISTS.
We told the agent it's up to you to
figure out how many roles you need to
create. If you even need to create one,
just use that. Don't go crazy. Don't
hallucinate. Don't come up with your
own. Don't make assumptions about this
how it's done. Um, this is how snowflake
queries work, right? So, we gave it
those templates, but there's no
guarantee it's going to stick to those,
right? There's no guarantee it's not
going to insert a semicolon where there
shouldn't be. Stick it in the role name,
right? So, we've actually told it, you
generate your output, actually validate
that. And here you can choose whether it
makes sense to use an LLM for the
validation or whether you want to have
more deterministic checks. Right?
We chose to go with an LLM because
they're great at generating SQL. They're
also good at reading SQL. So, we told
it, right? Make sure it's syntactically
correct SQL. Um, keep in mind it can't
actually execute these queries because
it doesn't have right permissions on
Snowflake. So, it's just reviewing
syntax. It's reviewing the original
request. Making sure what it output
actually solves that request. making
sure someone didn't ask for Jira data
and it's granting access to the all the
data in the data warehouse. Right? So we
give it a chance to review its own code, and if it finds any issues, we give it a bounded number of attempts to repair those, and it'll repeat a couple of times. And if for whatever reason it really can't solve it, it's having issues or just went off track, at that point we escalate to a person. And that's when someone from my team will go and manually resolve the request.
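A sketch of that bounded review-and-repair loop; the review and repair calls are stand-ins (toy implementations so it runs), and nothing here ever executes against Snowflake, whatever passes still goes through the human PR review and the standard deployment workflow.

```python
MAX_REPAIR_ATTEMPTS = 3  # bounded: after this we escalate to a human

def llm_review(sql: str, request: str) -> list[str]:
    """Stand-in LLM check: syntax, sticking to the approved templates, and whether the SQL
    actually solves the original request. Returns a list of issues; empty means it passed."""
    return [] if sql.rstrip().endswith(";") else ["statement is missing a terminating semicolon"]

def llm_repair(sql: str, issues: list[str]) -> str:
    """Stand-in LLM call that rewrites the SQL to address the listed issues."""
    return sql.rstrip() + ";"

def review_and_repair(sql: str, request: str):
    """Returns (sql, escalate). Whatever passes still goes to a human PR review and CI/CD."""
    for _ in range(MAX_REPAIR_ATTEMPTS):
        issues = llm_review(sql, request)
        if not issues:
            return sql, False              # good enough to open a PR for human review
        sql = llm_repair(sql, issues)
    return None, True                      # out of budget: escalate to the team

print(review_and_repair("CREATE ROLE IF NOT EXISTS SALES_RO", "Sales team needs a read-only role"))
```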
We also have other escalation points uh
built in throughout the workflow. I just
didn't show them on the graph, but this way you just give the LLM, or the agent, a chance to make sure it's... any day now, please.
>> All right, it finally works. It just doesn't like me. Um, I think I was talking too fast, maybe speeding through it, so it wants to make sure I go a little slower.
Um
so I could go into more technical
details. There are more flows. There are
more branches right as we expand this
agent to handle other high volume asks.
It might get more complexity but in
general it's going to have very similar steps, right? Validate the request. Some are going to have data, some are going to have approvals. Um, we'll always generate some SQL, or whatever; you could generate Terraform config code, a set of steps to run, right? But what made this shippable, right? Like, why did we get the okay from security and legal to deploy this on a SOX system? Go. I kind of already highlighted these points at the beginning, um, but just to summarize and, you know, give
you points to think about as you build
your own agents first First off, start
with bounded work, right?
For us, it was high volume requests,
things that were causing pain points,
bottlenecks, but that had clear
repeatable steps, clear outputs, right?
We always knew we needed to generate SQL
queries or we always knew we need to
generate config code.
And then reasoning without authority,
right? It's up to you to assess the risk
of your system and decide how much
authority to give it. And again, in our
case, we had really sensitive data. We
had a sensitive production system. So,
we needed to give it less authority. But you see, we actually built a successful agent and never had to give it write access to Snowflake. So, again, assess the risk of your system and decide. Maybe you can use LLMs to take certain write actions, or maybe you're in the same boat as us and it's too risky.
But agents at the end of the day are
going to be really good at reasoning,
really good at understanding requests,
right?
I mentioned this at the beginning and
I'm going to mention again, reuse those
control points. We're not building
agents to try to get rid of uh PR
reviews. We can have those code review agents. They can speed things up, but at
the end of the day, you still want a
human to sanity check even the code
review agent output, right? or if you
have working CI/CD workflows, don't try
to incorporate LLMs where where you
don't need them, where things are
already working.
Next one, scope autonomy. Right? You saw
we have a deterministic flow, but we
chose where to give it a little more
autonomy, right? And again, for your use
case, maybe you have a less rigid flow.
Maybe you just have two steps in your
Langraph workflow,
right? But add it where it makes sense.
Don't don't try to make it solve
everything in one go. Right? And my last
point is design for re-entry. If you're
thinking about real life operational
workflows,
if you could get it done in one shot,
that's great. But realistically, you're
going to have exit points. You're going
to have cases where the agent needs more
information. You're going to need
approvals, um,
etc. You're going to have different
branching logic escalate to a human,
right?
Please,
I just want there's one line I just want
to leave you with, which is this, and I
think I got this point across, which is
useful agents need boundaries, not more
authority. And that's going to be the
key to actually launching a lot of these
agents in production. This is what's
going to make your security teams, your
privacy teams, your legal teams happy,
right?
You could give the the agent access to
everything, sure, but there's more risk
involved. There's more that could go
wrong. And hopefully, I've shown you
that you don't need to do that. You
should just use LLM agents for what
they're good at, where it makes sense.
Choose to use deterministic workflows or
existing code where it makes sense.
And I don't know, go build agents.
Thank you.
>> Thank you, Anna. Sorry about the
clicker. We need Yeah. No, we we'll have
to build AI agents to make that work
better next.
>> Yeah. But well, despite all that, Anna
was able to finish her talk. So, let's
give it up for Anna. Okay, so this is an
exciting moment because we're gonna go
on a break. Uh, so we have uh quite a
bit of time for you to just talk to each
other, grab some coffee, maybe go
outside. It's really nice outside. Uh,
some logistics before we break, but uh
if you didn't lose your parking ticket,
you can feel free to already mingle with
other people. just come back at 3:50
p.m. We're going to start our final
talks of the day at 3:50. However, if you feel like you've lost your parking ticket, I have something that I need to read. Um, if you're missing your
parking ticket, we found a ticket from
garage 4 yesterday at 12:40 p.m. Uh, so
the volunteers at the check-in table
will be uh holding on to the ticket. So,
uh, if you can't find your ticket, uh,
go talk to them or just talk to them for
fun because they're really nice people.
Anyways, um, so with that, we're ready
to break and I'll see you at 3:50 p.m.
Ladies and gentlemen, please take your
seats. Our event will start in 5
minutes.
Ladies and gentlemen, please take your
seats. Our event will start in 2
minutes.
All right, welcome back everybody.
Our next presenter is an open-source
superstar,
an educator. He he's doing a lot of
things. Frankly, I couldn't memorize
them. I'm going to read them for you.
Um, he's the creator of epicweb.dev, epicai.pro, the Epic Stack, epicreact.dev, and testingjavascript.com, and more recently, um, he's launching epic product.engineer. Um, yeah, he's a well-known educator and contributor to the open source community. It's my pleasure to welcome Kent C. Dodds to the stage.
>> Thank you.
Hello everybody.
Thank you so much for having me. Uh, AI
Engineer Miami. I love Miami. I'm super
excited to be here and talk with all of
you. Um, I'm going to be talking about
building a free agent, which I think is
a fun, clever title, but um, it's free
as in freedom, cookies. I don't drink
beer, so it's cookies and puppies as in
something you have to take care of. So,
what do I mean by that? Well, you're
going to have to wait a second to find
out because I want you all to stand up.
Please stand up. If you're physically
able to join us, please do. It's been
like a long day. You need blood flow for
your brains to work. So, put your arms
out in front of you like this. Squat
down and back up. That one doesn't
count. That's just a practice. We're
going to do 12 of these. I want you to
count out loud with me. Ready? One. Two.
You're doing great. Three. You can go
really low if you want. Four. Or just
like a little dip. That's fine, too.
Six. Seven. Do you feel that blood flow?
It's so good. What are we on? One. No,
I'm just kidding. I forgot. Are we at Is
that 10 11
>> and 12? Thank you. Okay, stretch over
your head as high as you can and then
over to one side and over to the other.
All right, that feels great. Okay, sit
down. Thank you.
>> Yes. Um, blood flow makes your brain
work better. So, exercise um we are not
robots um yet. Okay, this is the view
from my office. Can you believe that?
Yeah, I'm looking at that all day. And
um it's it's super great. Um but it's
not always super great because sometimes
you get glare and it's especially bad in
my kitchen because it reflects off of
the countertop. This is not my kitchen.
My wife would not want me showing all
strangers of the internet my kitchen.
Um, this is my office though. Um, whoops
there. That is my office and it's great.
Um, but uh, yeah, glare is not fun. So,
uh, we do have shades. Again, not my
home. Um, and and they can avoid the
glare. And so
I actually have automated shades — if anybody's familiar with PowerShades or similar smart-shade systems — so it's nice, I can control things. But I really, really like my view. And actually, even this view is a problem if it gets overcast and the clouds are all super bright and shining into my eyes. I do use light mode, but it's not enough and it hurts.
And so I would like to have some mechanism for me to say: when the sun is in this position, or when it's overcast, then lower the shades or raise them up. Pretty much, I want them to be up when they can be up, but not when I'm going to get something like this. And so
um I decided to solve this using AI. Of
course we're here at AI engineer. Um,
but the the way that I solved this is
with a little program or or AI assistant
that I call Cody. And this is actually
Cody, my mascot for all the stuff that I
do. And Cody is now my AI assistant. And
so now, thanks to Cody, my shades will stay up as kind of the default, but then they'll go down for privacy reasons in the evening. They'll go down when the weather is overcast. And actually, in my office specifically, it will lower just that little section that's going to blind my eyes. In my kitchen, it goes down in the afternoons when the sun is going to reflect off of things, and it calculates where the sun is in the sky. I think it used a word like azimuth or something. I don't know any of that. That's why I love AI. And so now I can just live my life and my shades just move as they need to. Oh yeah, and it's super
annoying if the shades move when I'm
recording. My lights are also integrated into this experience, and so if my recording lights are on, it knows I'm recording, and it's not going to change the shades. It's awesome. It's
pretty cool. Let me tell you something
else that I did with Cody. This is a
game that I uh told Cody to build for my
son who is two and a half. And so he
finds the right thing. He clicks on it
and he gets a little celebration. If he
gets it wrong, then it blows up and he
has to go click this one. He actually
really likes seeing it blow up. So he'll
do this
and then he he'll see the confetti. Uh
so this is a fun fun little game that uh
I had Cody build and deploy. And I didn't have to log into anything. Cody was already logged into my Cloudflare account, and so it deployed it and everything. It actually made this OG image using Browser Rendering from Cloudflare, which is pretty cool too. And it's all going through Kodi; I don't have to do any of that. I think some of you might be starting to get bored, and if you're not, then I'll tell you why you probably should be here in a second.
This was a very exciting live stream screen capture. You can see I'm very excited; I'm actually wearing the same shirt, incidentally. I was super excited because I built a Spotify player that integrated with my own Spotify, but I used Cody to do it. What was exciting about this was the integration flow, how that worked, and I'll show you a little bit about that too. I also had Cody set up a Docker container for Navidrome, which is a self-hosted music application, on my NAS. And I told it to wire it up with a Cloudflare Tunnel so I could access it from outside my local network without exposing ports.
Cloudflare rocks. I'm not sponsored by
Cloudflare, but I think they're pretty
great and I use them for so many for
everything that Cody is. Um, but uh all
of that worked. So, we're we're not
having to do any of this stuff ourselves
anymore and it's pretty great. Uh, and
then I was lying in bed. I we had just
purchased epic product.engineer the day
before and I had my team working on
that. If you're curious what that looks
like, you can go look it up right now.
It's a real site and you can give me
your email address. But I was like, you know what? epic.engineer would be pretty cool to have too. So I bought it while I was lying in bed, from my phone, and I told Cody, "Go build me a landing page," and it even integrated with Kit — that's my email mailing service — so it could actually set up a real subscription and everything.
Deployed it on Cloudflare, made the OG
image for me, everything.
Okay, so at this point, lots of you are like, "Bro, that's awesome. I'm so glad that you rebuilt a worse version of OpenClaw." That is kind of what I did. But I have definitely explored the OpenClaw world, and there were a lot of things that were really cool about it, and a lot of things that I wasn't super jazzed about for my own use cases. And so that's why I did this. And I want to tell you one of my favorite reasons that I love what Cody
does is that this is all free. I didn't
have to pay for inference at all with an
asterisk. The asterisk is um I don't
have to pay more than the existing
subscriptions that I already have. How
many of you have more than one AI
subscription? Like one place where Yeah,
you have more than one. Why do you have
more than one? Like you're laughing.
Yes, of course I have more than one.
I've got ChatGPT and I've got Claude and I've got — I don't even know — so many others. And of course you've got your coding assistants and everything. So the reason that I can make Kodi free is because — whoops — I build on top of those. Everything that Kodi is, is actually exposed through MCP.
And that's what makes it so I can do all
these cool things with Kodi for free
because Cloudflare infrastructure is
like hilariously cheap, especially if
you're serving just one user. Um, and so
I I'm effectively able to do all of
these things using my existing
subscriptions
um, for all the inference. So I want to
tell you a little bit about how that works. And my goal here is to kind of — actually, let me back up. How many of you thought MCP was dead? For real. Yeah. Okay. No shame. How dare you.
Just kidding. Um, but a lot of
developers especially are like, why do
we need MCP? I have a CLI. I'm already signed into GitHub using my CLI. CLIs have help flags. There's progressive disclosure. And in fact, models are even trained on some of the more popular CLIs that I use. And now we've got this skills thing, so even if the model's not trained, I can just use the skills. So who
cares about MCP? And I agree with you. I
think MCP is pretty uninteresting for
software development use cases. Where it
gets really interesting is when I tell
Cody, I don't want the sun to glare in
my eyes. So, the non-developer use case
is the thing that gets me most excited
about MCP. So,
one of the big criticisms of MCP has always been that there's context bloat, and it's a huge mess, and so we hate MCP. Well, that never made sense to me, because we're software developers. We see a problem and we don't just say, "Huh, I guess this is foundationally flawed," and go off to something else. No, you analyze the problem. Is this foundationally flawed? Maybe. Let's look into it. Oh, we could just do some sort of search on the MCP tools and then, boom, now we have just the ones that are relevant for the thing we're doing. That's exactly what Claude does now, and ChatGPT does this now. So the whole context bloat thing is not a big deal.
However, I really like the fact that Cloudflare introduced this idea of code mode, because it has unlocked a lot. How many of you have heard of code mode before? Okay. So it's the idea that you can take some sort of spec, like MCP or OpenAPI or something, turn it into TypeScript definitions, and then tell the agent — so this would be ChatGPT or VS Code or Cursor or Claude Code or whatever — tell the agent to write code against that TypeScript definition, and then on your side you evaluate that code in a safe environment. And Cloudflare has done this with dynamic worker loaders. It's so, so cool, and that's what I'm using. So, based off of what Cloudflare has done with their own
MCP server, I created Kodi to have three tools. They did two; I needed one more: search, to identify what capabilities there are — so there's your progressive disclosure; execute, to write and run that code inside a sandbox; and a third one for opening a generated UI. That one has gotten me a little less interested because of all the cool things that Claude Desktop is doing. Have you all seen this stuff? You're just like, "Build me a thing," and it builds the thing, and it's really, really awesome. I don't really use that one quite as much, but I do use some features off of it. So pretty much search and execute are the things I want to focus on.
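To make that concrete, here's a rough, hypothetical sketch of the code-mode idea: a tool spec turned into TypeScript definitions, plus the kind of code the agent is asked to write against them. The interfaces and names (WeatherApi, ShadeApi, getCurrent, setPosition) are made up for illustration; they are not Kodi's actual API.

```typescript
// Hypothetical TypeScript definitions generated from a tool spec (MCP/OpenAPI).
// The agent never calls tools directly; it writes code against these types instead.
interface WeatherApi {
  /** Current conditions for a lat/long. */
  getCurrent(lat: number, lon: number): Promise<{ condition: string; cloudCover: number }>;
}

interface ShadeApi {
  /** 0 = fully open, 100 = fully closed. */
  setPosition(room: string, percentClosed: number): Promise<void>;
}

// Code the agent might write, which the host then evaluates in a sandbox
// (a dynamic worker in this case) instead of making many individual tool calls.
export default async function run(env: { weather: WeatherApi; shades: ShadeApi }) {
  const now = await env.weather.getCurrent(40.2, -111.6);
  if (now.condition === "overcast" || now.cloudCover > 80) {
    await env.shades.setPosition("office", 60); // drop just enough to kill the glare
  }
  return { condition: now.condition };
}
```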
It is pretty cool to be able to open a
generated UI. That's how I built the
little game that my son played and that
was fun. Uh, all right. So, uh, somehow
duplicated that slide. My bad. Uh, okay.
So, let's talk about search first. So,
when the search tool is called by your agent, whatever agent that may be, it's going to pass a query — some sort of, let's say — here, let's try this. We're gonna actually try this. I hope that I don't regret this.
Um, okay. I'm standing up in front of
hundreds of people in Miami at AI
Engineer Miami and I need some hype
music. Could you play something on my
Spotify, please?
I'm already running it on my laptop. So, you know how, when you're doing an AI demo, you know the mistakes that the AI is going to make, and so you just kind of subtly insert a little more context? Okay. So,
first here's the loading tools bit.
That's that's Claude saying um I know
that like we're not just going to load
all the tools into context. So, it loads
the tools. It knows u from that that
there's this execute tool. It
conveniently missed the search tool,
which is perfect for um our
demonstration here. Uh I'm being
sarcastic. Thanks a lot, Claude. Um and
uh Oh, that's interesting. Spoiler
alert. We'll look at that here in a
little bit. Okay, I'm going to let this
run in the background, and if we start
hearing music in the background, then
we'll know it worked. Uh okay, so
Spotify weather uh current location
playlist. So the the query that we're
actually exploring is um I want
something that's thematically
appropriate for Miami. Um so that's
where lots of this is coming from. So
we're going to query that. I want to
limit the results to 10. And then here's
here's what I'm trying to do. So this
memory context thing uh Cody has memory
built in. And um and so this memory
context will help to retrieve memories
as appropriate. Uh so then here are the
search results. It has this uh whole
explanation of how you actually deal
with these matches. This only shows up
on the first time you run the search
query and then thereafter it assumes
that the agent is going to remember that
so we don't bloat the context. Uh we'll
look at some of that stuff here in a
little bit. Ooh, secrets. What's that?
We'll look at that later. Um and here's
some relevant match uh memories based on
what you're trying to accomplish. And
then these will also not show up in the
future. So it keeps track of what
memories have been shared. Uh and then
we've got this idea of packages. So
inside of Kodi, this is as of less than
24 hours ago, I made a complete huge
massive rewrite. Let me correct myself.
Cursor and GPT 5.4 made a complete
massive rewrite of how Kodi works under
the hood. And it uses Cloudflare's new
artifacts API. Yeah, we're excited about artifacts. So now Kodi has its own GitHub, basically, on top of artifacts — and it's cooler than that, but I want to show you the code for that here in a second, so I'm not going to spoil what else it can do. So
it has a package for Spotify. Um this
has uh secrets for interacting with
Spotify that are um created in such a
way that the model doesn't actually have
access to those which I think is pretty
cool. And I will use my four minutes to hopefully explain what that is. And then we've also got a value — here's our client ID — and we also have secrets. And yeah, that's all that we need to see
there. So, it performs a search. Now, it
knows, oh, okay, I'm going to use this
Spotify capability uh or this Spotify
package to write my code. So, now it's
going to execute. Wonder how it's doing.
Not so well. But it's trying, and that's more than you can say for ChatGPT. They're all improving. They'll be fine.
So, um, that on that first search, it
gets back a conversation ID and then the
agent uses that and that's how it keeps
track of the memories that have been
shared uh, over time. Uh, here's the
memory context, here's what I'm trying
to do now, and then here's the code that
I want to execute. So, what does that look like? This is an example of what that code might look like. It brings in this Kodi runtime that has some useful features for authenticated fetch, which conveniently (or inconveniently for our demo) doesn't let me show you exactly how that authenticated fetch works. But basically, there's a special syntax that Kodi can write code with for managing
secrets. So, a really important part of
all of this is that uh the agent,
whatever agent you're using, never sees
the secrets ever. It cannot. And so, the
only way that you add those is the agent
will give you a URL that will go to
heycodi.dev.
You put in your uh secret in that UI on
HTTPS so nobody can see it. And then um
it the agent can then reference it using
curly braces um in any fetch call. And
then because I'm using dynamic worker
loaders, I can intercept every fetch
call. And I look and I say, "Hey, that's a pretty cool secret. Let me make sure that you've approved that secret to go to this domain. Oh, that's your Spotify token; I'm not going to send it to dangerous-domain-ive-been-prompted.com." Instead, I'm going to ask the user: hey, is this cool? And then you can go through an approval flow. That's pretty cool.
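Here's a minimal sketch of that interception idea — not Kodi's actual implementation. The `{secretName}` placeholder syntax and the approval-list shape below are assumptions based on how it's described here.

```typescript
// Minimal sketch of the interception idea described above (not Kodi's actual code):
// outbound fetch from the sandbox is wrapped, `{secretName}` placeholders are resolved
// on the host side, and a secret only ships to hosts the user has approved.
type Approvals = Record<string, string[]>; // secret name -> approved hostnames (assumed shape)

function makeGuardedFetch(secrets: Record<string, string>, approvals: Approvals) {
  return async (input: string, init?: RequestInit): Promise<Response> => {
    const url = new URL(input);
    const headers = new Headers(init?.headers);
    for (const [key, value] of [...headers]) {
      const match = value.match(/\{(\w+)\}/);
      if (!match) continue;
      const name = match[1];
      if (!(approvals[name] ?? []).includes(url.hostname)) {
        // The real flow would kick off a user approval step here instead of failing.
        throw new Error(`Secret "${name}" is not approved for ${url.hostname}`);
      }
      // The agent only ever wrote "{SPOTIFY_TOKEN}"; the actual value is injected here.
      headers.set(key, value.replace(match[0], secrets[name] ?? ""));
    }
    return fetch(url, { ...init, headers });
  };
}

// Example: agent-generated code calls guardedFetch instead of bare fetch.
// const guardedFetch = makeGuardedFetch(
//   { SPOTIFY_TOKEN: "..." },
//   { SPOTIFY_TOKEN: ["api.spotify.com"] },
// );
// await guardedFetch("https://api.spotify.com/v1/me/player/play", {
//   method: "PUT",
//   headers: { Authorization: "Bearer {SPOTIFY_TOKEN}" },
// });
```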
We've also got this environment-lookups weather package, which is also pretty cool. So this is
where we can interact with all of our
weather stuff. It's it's packaged up as
a weather API. Uh and this is under the
Kodi namespace. So this is kind of like
our own internal npm. Uh and so we can
have all these different repos inside of
Kodi on top of artifacts. And then um
when we're executing this code, we go
and reach into that repo. we uh create a
bundle out of that export and then we
can use it inside of anything. So um
Kodi can write code that references all
of these things. It's super super cool.
And then you know of course all of this
is what you would expect. We have an
export default um that can run
authenticated uh stuff with Spotify.
Here it's getting the weather and then
it does a query and it does you know
searches and then it plays but it you
know it uh reached its tool limit. So
that's what I get for uh for changing
everything just um before doing this
demo. Just here how about this? It
worked. Just kidding. Just kidding. Just
I promise it does work. Uh I I am still
working on uh some of the kinks. Uh
okay. So then the return will have the conversation ID. Execute also returns relevant memories based on what was actually accomplished — in particular, this memory context. It's like: what are you trying to do? Okay, I did the thing, but also here's some additional memories. And here are the results, and the agent gets to choose what results it gets back. So another criticism of MCP is that it's not just the tool descriptions: it's also that when we invoke a tool, the output of that tool fills up our context. With this, the agent gets to choose what comes back from all of these executions. And code mode is so cool. You don't get it — I can tell, because you'd be jumping on your chairs if you got it. Code mode is
fantastic. If we tried to do the same
thing using regular tool calls, this
would be many regular tool calls and it
would probably mess up a lot. Uh so code
mode agents are really really good at
this and it's super cool uh to play
around with. So, I've got 29 seconds,
which is wonderful because I don't have
a lot more to share. Um, actually, I do — come talk to me. I've got stickers, and I will give them to you if you ask me good questions. But yeah, come and talk to me, ask me questions. Epic AI Pro — here's my little plug — that's where I'll teach you how to build MCP servers. And it's really,
really great. Cody is open source. It's
pretty much just for me. I would love to
make it possible for other people to use
right now or eventually, but yeah, Kodi
right now is just kind of my thing. I
mostly just wanted to show you that MCP
rocks, code mode rocks, and um we've got
a lot of really cool and exciting things
to look forward to. With that, go check out epicproduct.engineer. It's the last skill you need to learn. Thank
you.
Good luck.
All right.
I guess one thing that we can take away
is the squats. So, if you feel like you
need a little bit of a wakeup call
before the last two talks, feel free to
stand up and do some squats. So, I'm
gonna do it myself to get ready for
Rita's talk. Yes. Okay. So, Rita is a
really good person to befriend because
if your website is down, she will be a
great person to call because Rita is a
VP of product for Cloudflare and she has
been building a couple of developer
platforms and AI initiatives within
Cloudflare, and she has meandered a little bit from software engineer to solutions engineer and now to product. So Rita today will be talking about building infrastructure that can scale to billions or even trillions of agents. Take it away.
>> Thank you. Thank you. Quite the intro.
Okay,
we can build infrastructure for
trillions of agents, but let's see if we
can figure out how to plug this in
correctly.
You guys see stuff?
Um, okay, here we go.
How about now?
Aha. All right. Um, hello everyone. My
name is Rita. I am VP of product for
Cloudflare's developer platform. Kent
already said everything that there is to
say about code mode and MCP. So, thank
you everyone for coming.
Um, no, I I'm really really excited to
be here today. Cloudflare is a really interesting place to be. Sometimes people ask me: you do product at an infrastructure company — how does that work? And it is actually really
fascinating first of all because we get
to work at really really massive scale.
So especially working on a developer
platform and developer tools, every
single optimization that we make, we
instantly get to see the benefits of it.
And even the tiniest things can really
save everyone lots of hours, lots of
days. But the other thing that's really
interesting about it is actually the
physicality of the web is something that
I think people don't think about a lot.
Like, there are undersea cables, first of all, that connect us all — when you have a Zoom with someone in London, that's how it all works. When I first
joined Cloudflare, I came across an
incident page that was talking about how
it was like breaking record heat in
India and that was affecting a data
center and I just never really thought
about how it could get so hot that a
data center would go down. Um, so I
think that more and more we are going to
start to get connected between the real
world and what's going on in tech and
AI. And so I am going to talk about MCP
and code mode, but I'm going to dive
into some of the underlying details of
how we do all that. Now, a lot of the
time our job feels a little bit like
debating the age-old question of if a
dog were to wear pants, would it wear
them like this or like that? Um, if you
think it's the first one, raise your
hand.
Um, if you think it's the second one,
raise your hand. Okay, everyone that
raised their hand first. You're a
psychopath.
It's definitely the second one. Like how
would it put it on? Um but um no,
thinking about how agents work, it is
kind of similar and and you'll see this
come up more and more, right? Um you can
think about a single giant MCP server.
You can break it up into a lot of
smaller pieces. You can think about, you
know, should you execute the code here
or over there. And for the first time,
we're getting to not just do like micro
optimizations in the developer space,
but really truly invent stuff from the
ground up and really think about it in
that way. And so when LLMs first came around — when we started using them through ChatGPT about two and a half years ago — it was like having a really, really smart brain with you in the room all the time that you could ask questions, that you could maybe get to generate code for you, but it couldn't go that extra step of doing too many things. It was like a brain with no hands to really act on your behalf. And
that's because LLMs initially weren't
that good at tool calling. But
increasingly they became better and
better. So you could actually start to
build agents like Kodi that could take
actions on our behalf. And initially
when people started using tool calling,
every single agent was implementing the whole thing soup to nuts on its own. So
you had your tool and you integrated it
with your agent and that was the only
place where it could run. The really
cool thing about MCP and especially
remote MCP is that all of a sudden you
could share tools with agents that
you've never actually met before.
So you could start to, you know, you
could create an app, you could ask it
for the weather, but there's this thing
that started to creep in over time,
which is the context starts to grow. So
I was going to demo a small app that I
built called Fluma, which is fake Luma.
I'll demo a different version of it in a
bit just to save us on time. But if you're building something that's more sophisticated than something that I vibe coded in a weekend, you really start to see that scope creep, that token creep, right? So something like the Cloudflare SDK: it has all of the DNS records, it has all of the Workers stuff, it has R2, it has purge cache, and before you know it, you're exceeding a context window of 1.7 million tokens. And actually, if you were to include Cloudflare's entire OpenAPI spec, it would take up 2.3 million tokens. Okay, so that's more than the
biggest models can even fit these days.
So, it's a bit of a pickle for us.
So, okay, we started to think about how
do we solve this problem? And one way to
do that is we could split up the server
by domains. So you could have uh an MCP
server for just the API. You could have
an MCP server for documentation. We had
an MCP server for workers, for
observability, all these different
things. And that partially solves the
problem, but it actually really just
puts it on the user to figure out which
MCP server they need. So if I want an MCP server that deploys my worker, but then to look at the logs, I have to go through the whole OAuth dance again. It's very, very annoying. So we needed
to solve that. The second thing that we
kind of realized the more we were
thinking about this is that even though LLMs had gotten a lot better at tool calls, they still get confused pretty easily. Like, if you ask one to do something that happened on a given date, it's just going to assume a random date that it was trained on in the past. It might not pick the exact tool that it needs to call. If you give it a lot of different tools, it'll actually also get confused — like, a lot of tools have "create" in them: create worker, create DNS record — and it starts to do the wrong stuff.
And if you think about it, it makes a
lot of sense. Uh LLMs were trained on
like all of the code that exists in the
world. So they're very very good at
writing code, but tool calling is
something that we just kind of bolted on
at the end. And it's not too dissimilar
from if you know you took Shakespeare
and you gave him a month-long crash
course in Mandarin. I presume he was
extremely extremely smart. Um, so then
if you asked him to write a play in
Mandarin,
it's bloody Shakespeare, so it's gonna
be good, but but it's not going to be it
his best work. Um, and and LLMs are a
little bit in the same way where no
matter how good they get at tool
calling, they just don't quite cut it.
So at this point, we started thinking: okay, are we holding this wrong? We're trying to make LLMs do things that they're not that good at. We're inflating the context window. What's a different way to attempt to do this? And that's where code mode came from. So
imagine if you let the agent or the LLM
do what it's really good at, which is
write code and do fewer tool calls. So
let's see this in action. So here I have
my app called Fluma. Let's increase the
font on this. On one side we have our vanilla legacy MCP agent that's just going to make regular tool calls. And on the other hand, we have our code mode agent that's going to write code first and then execute it. So let's ask it to do something simple first, like: create an event for a hackathon on Wednesday at 9:00 a.m. at Hyatt Regency Miami. (And I misspelled Miami.)
Um, okay. So, over here we have our
regular MCP agent. It um, it wanted me
to confirm the date. It was thinking
about January 10th, 2024, which is not
quite this Wednesday. On the other hand,
we have our code mode agent that just
pulled up today's date because it's able
to call a function and then it used code
mode to create an event that's going to
come up. But now let's try something
even more sophisticated. So I'm going to ask it to do something like: create an event for each day in May 2026 for a meetup on the topics of AI engineering and MCP, at Cloudflare's party house, all at 7:00 p.m.
Okay. So now they're both going to be
off to the races. You can see that the
MCP agent is going and making a whole
bunch of different calls. And on the
other side we have our code agent that
went ahead and generated this code with
um different topics. And it's going to
go through this for loop and create a
whole bunch of these events. So now it's
making a bunch of calls to the API.
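Roughly what that generated code might look like — just a sketch; createEvent here is a hypothetical stand-in for whatever Fluma actually exposes through code mode.

```typescript
// Sketch of the kind of code a code-mode agent generates for this request.
// createEvent is a hypothetical stand-in for Fluma's real event API.
async function createEvent(event: { title: string; startsAt: string; location: string }) {
  console.log("creating:", event.title, event.startsAt, event.location);
}

const topics = ["AI engineering", "MCP", "agents", "code mode"];

async function createMayMeetups() {
  for (let day = 1; day <= 31; day++) {
    await createEvent({
      title: `Miami meetup: ${topics[(day - 1) % topics.length]}`,
      startsAt: new Date(Date.UTC(2026, 4, day, 19, 0)).toISOString(), // May 2026, 7:00 p.m. (UTC for simplicity)
      location: "Cloudflare party house",
    });
  }
}

createMayMeetups();
```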
Going to wait for both of these to
finish up.
Any
second now, guys.
All right. So, our MCP agent is already
done. Generally, these take about the
same amount of time. Now, our code mode
agent is also done. So, we've
accomplished roughly the same task. But
notice one important difference which is
the code mode agent used almost 70%
fewer tokens. That's a really really big
difference because the code mode agent
doesn't have to carry all of those tool calls in its context constantly. It's able to just generate the code once, execute it, and be done.
All right. But we had another problem, and it's that clients were slow to adopt code mode. And if you want something done — my parents are Soviet, so they would always say, you know, you have to do it yourself. So we had to take matters into our own hands. We still had a context window that would take up over two million tokens. So we started thinking about what it would look like to have a server-side MCP server, and we came up with a way that allowed us to still run all the code that was generated on the server side, with two simple functions: one called search, which is going to look at the spec and find only the APIs that match the particular thing we're looking for; and another one that would write the code that would actually execute what we needed it to do.
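A conceptual sketch of those two functions — not Cloudflare's actual implementation; the spec entries and the runInSandbox helper below are stand-ins.

```typescript
// Conceptual sketch of the two server-side tools described here — not Cloudflare's
// actual code. `apiSpec` and `runInSandbox` are hypothetical helpers.
type ApiEntry = { name: string; typescriptDef: string };

const apiSpec: ApiEntry[] = [
  { name: "workers.list", typescriptDef: "listWorkers(): Promise<Worker[]>" },
  { name: "workers.deploy", typescriptDef: "deployWorker(name: string, code: string): Promise<void>" },
  // ...thousands more entries in the real spec
];

// Tool 1: search — return only the TypeScript definitions that match the query,
// so the agent sees ~2,000 tokens instead of the full multi-million-token spec.
function search(query: string): ApiEntry[] {
  const q = query.toLowerCase();
  return apiSpec.filter((e) => e.name.includes(q) || e.typescriptDef.toLowerCase().includes(q));
}

// Tool 2: execute — take the code the agent wrote against those definitions and
// run it in an isolated sandbox (a dynamic worker, in Cloudflare's case).
async function execute(generatedCode: string): Promise<unknown> {
  return runInSandbox(generatedCode); // hypothetical sandbox runner
}

// Stub so the sketch is self-contained.
async function runInSandbox(code: string): Promise<unknown> {
  return { ok: true, codeLength: code.length };
}
```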
So let's take a deeper look at how this
works again under the hood. So we first
of all have our tool search. I'm going
to type workers in here. And even if I added every single Workers API that we need, we're still only at — I don't know if you guys can see this — less than 2,000 tokens. So a very, very big difference from a million tokens. And
then we have the second half of this
which is the execute tool. So here what
the execute tool is going to do, it's
going to look at the TypeScript schema
that's being passed down from the search
tool and it's going to generate this
code that it's then going to have to
execute. So here we have list workers: it's going to write code to list all of our workers. It can write code to deploy your worker, or add Access on top of your application. And here we can
quickly see this in action where this
code was um was executed. And here is
the result that we got.
Um so okay, now let's put the two of
these together um and ask our MCP server
to create a hello world worker.
So the first thing that we're going to do, as we would with any other MCP server, is go through OAuth and select the account that we need.
This is where I deploy all of my workers
to. We authorize it.
And now it's going to call our two
commands. Um, so first, as you can see,
we're running search in here. And it's,
as predicted, going to return all of the
different worker related APIs that are
available to it. And then we're going to
run execute, which is going to generate
this worker. And we are going to execute
it immediately. So now we have a hello
world worker that's been fully deployed.
So we've talked about three different models so far of doing the exact same thing. One is basically vanilla MCP, where you're directly doing the tool calling — that's going to be the least efficient in terms of token usage. We talked about client-side code mode, which is efficient, but not all clients support code mode yet. And another one is server-side code mode, where — we saw the results where it got to 70% token savings, but if you go from 2 million tokens to like 2,000, it's like 99.9% token savings, and if anyone here is paying for tokens, you know that's a lot of money being saved.
But how does all of this work? Like
really what we're doing is we're putting
a lot of trust in the LLM to write some
code that we've never looked at before
and allowing it to execute immediately.
And this can bring a lot of problems,
right? Um if you're running it in the
same sandbox as the rest of your
application in the same container, it
can do things like read the file system.
It can make rogue network requests with
the data that you just gave it. It can
do things like create an infinite loop
or eat up all of your memory. And there are a couple of other approaches that people have tried. One is a DSL — if you've written a DSL before, you probably never want to do that ever again in your life. Another one is you could use VMs, but VMs are very slow to start up. So we
would be waiting here for a very very
long time for all of these calls to
complete and it would get really really
expensive very quickly. uh we could get
humans to review the code but that's
even slower than VMs. So we need a
different approach.
This is where dynamic workers come in.
So dynamic workers are based on the same
technology as workers which we've been
running at Cloudflare for over nine
years now. But dynamic workers allow you to create a worker on the fly and immediately execute it. So you can see that here we're going to pull in the generated code that the LLM created. The rest of this looks just like loading up a worker: you can set the compatibility date and which modules you want. And importantly, you can set what outbound hosts you want to allow. And if you don't provide any, the worker actually can't access the web at all, so everything stays really, really sandboxed.
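A minimal sketch of that flow, based on how Cloudflare describes the Worker Loader beta; treat the binding name and property names here as assumptions rather than a verified API surface.

```typescript
// Minimal sketch of running agent-generated code in a dynamic worker.
// Based on Cloudflare's Worker Loader beta as described publicly; treat the exact
// binding and property names here as assumptions, not a verified API.
interface Env {
  LOADER: {
    get(
      id: string,
      load: () => Promise<WorkerCode>
    ): { getEntrypoint(): { fetch(req: Request): Promise<Response> } };
  };
}
interface WorkerCode {
  compatibilityDate: string;
  mainModule: string;
  modules: Record<string, string>;
  globalOutbound?: unknown; // set to null / omit allowed hosts to cut off network access entirely
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const generatedCode = await request.text(); // code the LLM wrote

    const worker = env.LOADER.get(crypto.randomUUID(), async () => ({
      compatibilityDate: "2026-04-01",
      mainModule: "main.js",
      modules: { "main.js": generatedCode },
      globalOutbound: null, // no outbound hosts allowed: the sandbox can't reach the internet
    }));

    // Run the generated worker and return whatever it produced.
    return worker.getEntrypoint().fetch(new Request("https://sandbox.internal/run"));
  },
};
```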
And we can actually test this out in practice. As far as being worried about things like configuration keys or other things being leaked, it's only going to have access to the things that you gave it explicit access to. So if I'm trying to access secrets in process, well, guess what?
It's not going to show up in any of my
globals. These are just the functions
that Cloudflare provides by default. And
if I try to make an outbound fetch to
HTTP bin, well, the same thing is going
to happen. It's going to say this worker
is not permitted to access the internet.
We should really capitalize this. Um but
um yeah, you you can't just access all
of these arbitrary things. So it becomes
a really really powerful environment to
enable you to run code mode securely.
And so far we've just talked about this
in the context of MCP, but I think it's
pretty obvious where agents are going next, and it's that all of us are going to be running many of them at all times. Right now, I would imagine all of us are using agents like OpenCode or Claude Code or Codex, primarily for coding use cases, and a lot of them run on our laptops — which is why people at this conference are running around not wanting to shut their laptops, because you want your agent to complete your task. You could also do this in a hosted container environment. But if you start to do the math of how this is going to scale to the rest of the world — and we'll get to that in a second — the math doesn't quite math. And here's what I mean by that. So recently, OpenClaw has been taking off,
right and it's kind of a similar thing
where it's early adopters and all of us
went and got Mac minis but again that's
not sustainable for every single person
being able to run multiple agents
And if you do the quick math, just on the US alone and just for the workforce: there are about 100 million people in the US workforce. If we assume 50% concurrency — this is actually being very conservative. I actually
imagine that we'll be running a lot more
agents than this at all times because
guess what? Agents don't even sleep. Um,
and by the way, I have an agents never
sleep hat that if you're the first to
the Cloudflare booth, you can claim. Um,
but we're going to be running many, many
of them. And for that, we need a lot of
CPUs to power that. Everyone is talking
about the need for GPUs, but no one is
talking about this part of having to
power enough global agents.
So let's let's take this a step further.
Okay, there are eight billion people in
this world. If each of them had a
personal agent, again at like 50%
concurrency, we're not super coordinated
in how we're using them. We need like 80
to 160 million CPUs. Um server CPU
production today is in the tens of
millions per year. So we're already an
order of magnitude off. If you start
imagining that everyone is running
several agents, three agents, 10 agents,
we are many, many, many orders of
magnitude off from being able to power
the agentic future that I think everyone
in this room is really excited about.
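As a back-of-envelope check on that math — the agents-per-CPU number below is my own assumption, picked so the result lands in the 80-160 million range she mentions:

```typescript
// Back-of-envelope version of the math in this talk. The agents-per-CPU figure is an
// assumption; the population and concurrency numbers come from the talk itself.
const people = 8_000_000_000;
const concurrency = 0.5;       // half of all personal agents active at any moment
const agentsPerCpu = 25;       // assumed: roughly 25-50 concurrent agents per server CPU

const concurrentAgents = people * concurrency;       // 4 billion
const cpusNeeded = concurrentAgents / agentsPerCpu;  // ~160 million at 25/CPU (~80M at 50/CPU)

console.log(`~${(cpusNeeded / 1e6).toFixed(0)} million CPUs needed`);
```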
So how do we solve this problem? Believe it or not: yet again, dynamic workers. The thing about dynamic workers is they run on isolates, and
isolates are a lot more efficient than
VMs or containers because they're able
to share so much more of the underlying
context. So in a VM, for every single new application that you spin up, you share the hardware, but you have to spin up a new operating system every single time. With containers, you're able to take that one notch further, where the operating system is shared, but every single time you spin up a container, you need to bring in the entire language runtime and the full application with it. With isolates, we're able to import just the generated code — whether it's the application or the agent-generated code — and execute it on the spot, which means we can utilize the same exact hardware, but 100 times more efficiently. That basically makes up the difference that we need in order for every single person in the world to be able to run their own Kodi agent.
So this is why I'm so excited about
isolates and what's really cool is we've
been working on this for a long time. We
bet on this technology uh nine years ago
and we didn't think that you know it
would become relevant in this particular
way after all this time. And it's
interesting to see more and more
companies, you know, Cloudflare is not
the only implementation of isolates. And
I think the more people use it, the more
people adopt it, the more validation it
gives what Cloudflare is doing. And
we're still going to need containers for
some agents because you need git and
bash and file system and all of that.
But for especially consumer use cases,
isolates are increasingly going to
matter more and more.
So that was a lot of me talking. If you
want to learn more about this, um,
there's a whole bunch of blog posts that
we put out, especially last week, that I
recommend you go check out. Dynamic
workers are an open beta, so you can go
and play around with them literally
today. I will also give a couple of other shoutouts: to experiment with everything that we talked about today, including code mode, you can go install Cloudflare's Agents SDK. We just made Kimi 2.6 available on Workers AI.
It's a brand new model. It's super fast.
Go play around with it. And last but not
least, we have a lot of really hard
problems to solve and we need help
solving them. So we're hiring. And if you're looking for a gig, come find us. All right. Thank you all so much.
Woo. Okay, how's everybody doing? We're
almost there. Last talk of the day. And
speaking of SDKs, the next presenter
believes you're using the wrong AI SDK.
And he's going to talk about the
evolution of SDKs for AI and uh where we
should expect it to head in the near
future. He is an educator — a full-stack educator, from developers all the way up to tech CEOs and CTOs. He has a podcast, and he also manages a YouTube channel for technical development. Please welcome Ben Davis.
Perfect.
All right. So the title of this talk is
you are using the wrong AI SDK. And
before we even get into that, I want to
kind of talk about how these things have
changed over the last few years. Because
when I was going through and prepping
this talk, the initial concept for it
was: all right, I want to take the OpenCode SDK, the pi SDK, the Vercel AI SDK, and then the BAML SDK, kind of compare those four, and try to explain when you would use each one. But as I was going through, I realized that there's a pretty strong throughline here. At least in my head, I like to think of these in generations, where in the first generation we had the API wrapper. This would just be the normal OpenAI SDK. You can see the code snippet for it. Let me zoom in.
There we go. It would look something
like this. You're just directly hitting
the OpenAI API to generate some text,
maybe do a tool call or something like
that, but there's nothing else built
into it. I assume if you're here, you
probably know what an agent loop is. But
in case you're not familiar, generally
speaking, the way these things work is
if you want it to do some more
complicated action than just generating
text, like reading a specific file or
doing a web search, the model can't do
that on its own, it has to ask you to do
that for it. So what it'll do is: when you send a request up to OpenAI, it'll send back a response that, instead of being a text response, will be a tool-call response that has the tool that it wants you to call with some arguments; then you go ahead and call that tool and send the result back. You can do that full tool-calling loop within the normal OpenAI SDK, but it requires you to manually have a while loop and add a bunch of other stuff in here to actually make that work. It's not the most ergonomic thing in the world.
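A minimal sketch of that manual loop with the OpenAI Node SDK's chat completions API — the weather tool and model name are arbitrary examples, and error handling is left out:

```typescript
import OpenAI from "openai";

// Gen-1 style: you own the loop. Keep calling the API, run any requested tools
// yourself, feed the results back in, and stop when you get a plain text answer.
const client = new OpenAI();

const tools = [
  {
    type: "function" as const,
    function: {
      name: "getWeather",
      description: "Get current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  },
];

async function run(prompt: string) {
  const messages: any[] = [{ role: "user", content: prompt }];
  while (true) {
    const res = await client.chat.completions.create({ model: "gpt-4o-mini", messages, tools });
    const msg = res.choices[0].message;
    messages.push(msg);
    if (!msg.tool_calls?.length) return msg.content; // plain text answer: we're done
    for (const call of msg.tool_calls) {
      const { city } = JSON.parse(call.function.arguments);
      // You run the tool yourself and append the result as a tool message.
      messages.push({ role: "tool", tool_call_id: call.id, content: `Sunny and 31C in ${city}` });
    }
  }
}

run("What's the weather in Miami?").then(console.log);
```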
That lack of ergonomics is why the Gen 2 SDK that came along was the Vercel AI SDK. This is, I think, one of the coolest things Vercel has made in the last couple of years. This is truly an incredible open-source project. It
seems very simple, like it is just wrapping a bunch of different LLM API providers. But the actual code that goes into making a centralized interface that can handle Anthropic models, OpenAI models, Gemini models, all in one place, is pretty insane. And the actual code for this is a lot more abstracted
now, where you can define tools with Zod schemas. You can execute them. It has this stopWhen option, which means that now that agent loop can happen within the actual SDK. This generateText call will hit the OpenAI API multiple times here.
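Roughly what that Gen-2 shape looks like with the Vercel AI SDK — a toy weather tool; the option names here (inputSchema, stopWhen, stepCountIs) match recent AI SDK versions, so check them against whatever version you're on:

```typescript
import { generateText, tool, stepCountIs } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Gen-2 style: the SDK runs the agent loop for you. The weather tool is a toy example.
const { text } = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: "What's the weather in Miami?",
  tools: {
    getWeather: tool({
      description: "Get current weather for a city",
      inputSchema: z.object({ city: z.string() }),
      execute: async ({ city }) => ({ city, forecast: "Sunny, 31C" }),
    }),
  },
  // Let the SDK keep calling the model (and tools) for up to 5 steps.
  stopWhen: stepCountIs(5),
});

console.log(text);
```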
So if I go in and zoom into this and run bun 2 for the second generation — as this is actually running, it did multiple requests up to OpenAI to
do that weather tool call. And then when
it did that tool call, it went back
again, passed that result back in,
generated the final text, and then
that's what you saw pop out on the
screen. So this was enough for people to start actually building agents with. This was the sort of first generation where we were really
able to push these things into making
more complicated and useful products.
But there were a lot of decisions and
beliefs that were made at this time that
I don't think have quite held up. And
these are a lot of things that if you
had asked me four months ago, I would
have personally believed where like if
you look at the code for the AI SDK, one
of the things you'll notice is that
there is full type safety on the tool
calls. Like this execute takes in a
city. You can see that it's a string. It
has this input schema which is a zod
validator so that it makes sure it's
always going in in the right shape. It
is very much built to
allow you to make these very
well-defined agents for your products.
But that is not where we have kind of
ended up, because the thing that happened sometime last year was Claude Code got released. And when Claude Code got released, we got the first full coding agent, which was effectively something like the AI SDK wrapped up with a really nice TUI that can now suddenly take actions on your computer with an exec tool call. It can run bash commands.
It can run scripts. It can write code.
It can do whatever the hell you want it
to. And over the last year, we've had
more and more of these pop up. and the
two that I wanted to talk about
specifically in this presentation
because I think they're just like
they're the most interesting ones to
talk about because I think they're the
best ones to use. I'm not personally a
huge fan of the Claude agents SDK for a
variety of reasons and the codeex SDK is
limited to codeex, but these two are
both open source. They're incredibly
powerful and they are the things that
power the actual coding agents. And when
you're working with a coding agent SDK,
you are able to do so much more than you
can do with these things because the
mental model has changed a lot. I'll
start with the pi example because it is the more minimal version of these two. If you look in here, the way this is actually defined is kind of similar to the AI SDK thing, where we are defining a weather tool here with some basic stuff, as you would expect. Then we are creating an agent session with the model, the auth storage, the model registry, custom tools. These are all
implementation details. If you want to
look into how to actually use these
things, the best thing you can do is
just like go to the GitHub repo, copy
paste the link into whatever coding
agent you prefer, tell it to make a temp
directory, clone the repo into that, and
then ask it questions. That is the
easiest way to figure out how to
actually use these things. But the real
point that I'm trying to make here is
when we create this agent session even
with the TypeScript SDK it is booting up
the full agent harness because if I do
pi this is now a full coding agent that
is running on my machine. You can ask it things; it works the way you'd expect. The same thing is happening when I am doing this pi example here. So if I go into the AGENTS.md and add "always answer in French," and then I go in here and run this example again — bun 3 pi, I think I called it — I
think this is the version that should be
loading the agents MD. Yep, it's doing
its tool calls, getting all that,
sending it back, and there you go. So,
you can see, even though I didn't have any code in here that explicitly loaded the AGENTS.md file, it still did, because it is operating as a normal coding agent
SDK on my machine. Does its thing, gives
me the result. Very, very useful. OpenCode is very similar, except I would say it does the same thing but is more batteries-included. I love both of these projects. I think they're really cool — I mean, this is a very high compliment to both of them. But the way I kind of think of them in my head is that OpenCode is kind of like the VS Code of coding agents: it is open source, it has really good defaults, it just kind of works out of the box, but you can still change some things about it, add in different themes, extend it, do other stuff there. Versus pi, which is kind of like Neovim, where right out of the box it does basically nothing but the absolute bare essentials of a coding agent, but you can extend the ever-living hell out of it, and it's really cool. I like both of these a lot, but you
can see this reflected a lot within the
SDKs too. The OpenCode SDK works slightly differently: it's a client-server model, where whenever you spin up OpenCode it's spinning up a server. So we have to create our OpenCode instance here, which has the server. Then we can do a bunch of stuff in here to create a client session, subscribe to the events, console.log them. Nothing too interesting in there.
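A rough sketch of that client-server flow — I haven't verified the OpenCode SDK's exact exports, so treat the import names and method shapes below (createOpencodeServer, createOpencodeClient, session.create, session.prompt, event.subscribe) as assumptions based on how it's described here, not documented API.

```typescript
// Rough sketch of the client-server flow described in the talk. The import names and
// method shapes here are assumptions, not verified against the real OpenCode SDK.
import { createOpencodeServer, createOpencodeClient } from "@opencode-ai/sdk";

async function main() {
  // Spinning up OpenCode means spinning up a server...
  const server = await createOpencodeServer({ port: 4096 });

  // ...and then talking to it through a client.
  const client = createOpencodeClient({ baseUrl: server.url });

  // Create a session and subscribe to its event stream, logging everything.
  const session = await client.session.create({ title: "demo" });
  const events = await client.event.subscribe();
  (async () => {
    for await (const event of events) console.log(event.type);
  })();

  // Send a prompt; because this boots the full agent harness, skills, AGENTS.md,
  // and auth from the project directory all apply.
  await client.session.prompt({ sessionId: session.id, text: "Summarize this repo" });
}

main();
```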
The only really interesting piece here is that there is this opencode directory with a bunch of custom tools in it. This is how you
define custom tools. Again don't pay too
much attention to the syntax here. That
is not the important piece. I'm sure
that this will be changed and improved
over time. It's solid right now. It's
not the important part. The important
part is the way we can actually build with these things now. Because, like I said, with generation 2 we were entirely focused on these very well-curated agents — that was the type safety on the SDK, really trying to be like: okay, let's give it a dedicated read tool, a write tool. Maybe, say for my own personal use cases, it needs to hit the YouTube API to do some stuff, so we would give it a read-YouTube-channel tool, a read-video tool, a read-comments tool, whatever you want to do. All of those are well put together. Then you give it to the agent, you let it execute, you do that whole thing. It's fine. But this new generation lets you do things kind of differently. And, um —
Okay, I was going back and forth on
this, but I think we're going to do it.
I'm going to pull up uh Oh my god, I
cannot see.
Can we see that? Yeah. So, just hear me
out. Hear me out. Okay. Again, hear me
out. So, there's this project called GStack, and you've probably seen it on Twitter because it's been memed very heavily, because Gary has been going very hard with these LLMs, and there have been a lot of memes that have come out of this: the 40k-lines-of-code-in-a-day thing, the crazy Gary's List site, which is hundreds of thousands of lines of vibe-coded Ruby for a static blog site. There are some silly things in
here. I'm not denying that. But there's
also some very, very good stuff in here.
Because when I looked at this about a
week and a half ago, I went into it
fully expecting to just kind of see
something funny, maybe dunk on it, do
whatever, because it's what everyone
else was doing. I was like, "Yeah, sure.
Let's go take a look." I went into it. I
went into the skills directory, which I think is right here. Let's use — that's a good example — let's do the office hours one. Sure. I went into this, I looked at the skill, and the first thing I saw is this 30-line bash script that looks like a virus. This
does not look like something I want to
have in my skills. I was like, what the
actual hell is this? What are we doing
here? This is worse than I thought. I
was going through this and I kept
reading and I just kept going through
all this. But then as I slowly just kind
of started to think about it more, I
realized what I was actually looking at
here is a program. I know that sounds
very insane, but if you think about what
this actually is, it is including a
bunch of commands that the agent is
supposed to run. It has a bunch of
steps. It is defining a workflow and it
is creating a full usable application
entirely on top of a coding agent with
natural language. And the more I thought
about that, the more I realized, holy
[ __ ] I think Gary's on to something
here. And I've been testing it more.
I've been going deeper into it. And I've
realized that with these Gen 3 SDKs,
these full coding agents, we can do some
very weird stuff that we couldn't do
before. Before with Gen 2, we were still
having to manually write TypeScript
functions for every single piece of the
agent. But now in Gen 3, we are using
coding agents. And coding agents are
capable of writing code. They are
capable of executing bash scripts. So
the actual programs we're creating,
these agents do not have to follow the
patterns that we previously had. I was
sitting up in my hotel room earlier
today doing a little bit of
experimenting here and
one of the things I put together here is a better YT sync. Effectively, what this does — this is a little program (I don't want to save that) — is it allows me to go to the YouTube API, sync all of that data down, and then save it into a Postgres database. This is a very, very useful thing for my job
day-to-day. I need to have this data. I
need a way to sync it. And the way I've done this in the past is by writing out a pretty big, complicated TypeScript project which will do the manual syncing logic. Obviously it seems very simple — you're just going from one API fetch to the DB — but what
about retrying? What about the actual
orchestration of this? What about the
cron job? And then there's even more
things like okay so what if we want to
do some deeper parsing on the videos?
What if we wanted to parse the sponsor
of a video because that's very useful
information for us to have or parse the
sentiment on comments. Well, now we have
to bring in an LLM. And the way I was
bringing in an LLM is with one of the
things I was talking about earlier which
is BAML. I think BAML is very very cool.
Currently, it is entirely a gen one AI
SDK. I am sure that they are working on
something new to do the actual agent
thing. But for right now, effectively what this is: it is a new programming language designed for agents that works really, really well for taking some blob of text or data or video or whatever you want, and giving it an output shape. So I'm like, okay, this is the actual output shape I want to get from BAML. This is the function that we are defining here, which takes in a video. Video is a data type which just takes in a URL, so that it knows it's a YouTube video. VideoSponsor — it returns a list of these. We pass in the prompt here. And you can see within their playground what I did: I passed this in to the Gemini API, which did the actual orchestration of this. They have a lot of stuff under the hood that ensures that the LLM will output in the correct shape, and you get the correct shape every single time,
because obviously when you're working
with little non-determinism machines,
they will do non-deterministic things.
And even if it's a 99% success rate,
this does a great job of getting you
over that line to make sure that it will
always be in this correct shape that you
can then pass into your TypeScript code
and do something useful with. You can see, for example, this video was sponsored by G2i — just like this conference. We love G2i. And the thing is, that's great. That
all works just fine.
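BAML is its own language, so rather than guess at its syntax, here's the same structured-extraction idea sketched with the Vercel AI SDK's generateObject and a Zod schema — a swapped-in technique to illustrate the concept, not BAML itself, and the sponsor fields are made up.

```typescript
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Same idea as the BAML example, sketched with generateObject + Zod instead of BAML:
// force the model's output into a known shape you can safely pass to TypeScript code.
const VideoSponsor = z.object({
  name: z.string(),
  url: z.string().optional(),
  segmentStart: z.string().optional(), // e.g. "02:13"
});

const { object } = await generateObject({
  model: openai("gpt-4o-mini"),
  schema: z.object({ sponsors: z.array(VideoSponsor) }),
  prompt: "List the sponsors mentioned in this video transcript: ...",
});

// `object.sponsors` is now a typed, schema-validated array.
console.log(object.sponsors);
```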
But it gets kind of annoying. And what if we did this differently? What if, instead of writing the program as a bunch of TypeScript files, you just wrote it as a bunch of markdown files? And that's what I'm trying out here. And you can see I have a couple of different
skills in here. There's nothing too
crazy about these. The most interesting
one here, I guess probably, well, I
guess we can go through all three
briefly. So, the top-level one is this YouTube video sync skill. And this basically just tells the agent the steps it needs to go through to sync the YouTube channel. It has a bunch of helper functions in here, so that instead of manually writing some Python code to fetch the API and then put it in the DB, it has these good wrappers — like fetch-channel, which does all this correctly. But you don't even need this.
In the first version of this, literally
all I had was just a markdown file with
some natural language steps and I'm
like, okay, go sync some data from the
YouTube API. Here's an API key. Put that
in a Postgres DB. Good luck. And it did it. It did it very, very well. This is the reason why something like OpenClaw works so well: because these coding agents can just kind of make these things happen. The only reason I have
these extra functions in here is because I intend to deploy this and use
this elsewhere and I want it to be a
little more robust. But you can see all
we're really doing here is defining the
steps which is what you would do in a
normal program and then let it run and
it will do the thing. And it has a
remarkably high success rate. You end up getting this nice property on these things where it's almost self-healing: I noticed a couple of times there would be a weird rate-limit error or something like that.
Instead of having to do exponential
retries yourself, the agent will just
kind of naturally do it for you. And now, the way all of this comes together: if you are building something like this, and you are building it with skills, and you have all these extra things in here, you might want to run it with an SDK. One way you can do it is I could just open up pi here and do /sync. I hit enter on this, and it's just going to start doing the syncing. So it's checking the project directory, cd-ing around, doing whatever it needs to. But if we don't want to do that — say we want to run this in a sandbox in the cloud on a cron job —
you can use the open code SDK, which is
what I'm doing here to create an open
code instance, create a new session,
then pass in a prompt that tells it to
use the skill to actually do the thing,
log some useful information up here, and
then it just works. Because, like I said earlier, this is a full coding agent SDK: it is reading all of the skills out of that directory, it is reading all the agents, even reading the auth stuff. So, because my OpenCode is authenticated with GPT 5.4 mini, if I ran the sync command with this index.ts, it would do that sync with 5.4 mini. It all just kind of comes together and
allows you to build these very weird new
shapes of programs that I honestly
didn't think were going to be a thing,
but clearly are. And really, what I wanted to get across with this talk is not any specific implementation details — just go experiment with those on your own. The thing that I wanted to get across here — and it's really just a message for myself, because I keep doing this — is that the shape and direction of where these things are going is very strange and is changing
all the time. If you had told me 3
months ago that I would be giving a talk
where I defended GStack and markdown
files as a new way to do programming, I
would have laughed at you. But here we
are. Because the thing about this weird AI revolution right now is, every single time I learn something new about it — I find something, test a new model, try a new theory or whatever — I draw a new line in my head of, okay, this is what it is capable of, this is what we can do, this is how this works. And then that is the box I live in, and I don't really go past that. But that doesn't work anymore. These things change so much, so fast, and you just need to try weird random ideas. Every time you have some very strange idea and you're like, "Oh yeah, that probably won't work," still just give it a shot, because it might. On paper this doesn't seem like it should work, and yet it kind of does, and it works really, really well. And you can
even take like what's here and what is
the next logical step here? Do I need to
have all of this code written here? Can
we go further with this? I don't know.
All I know is that as time has gone on,
we have started giving agents more
freedom and time will tell what that
freedom will bring. So that's all I got.
Thanks for listening.
>> All right, that was a great talk by Ben.
>> All right, awesome. A very, very big day with a lot of great talks, and we covered a lot of ground, I think.
>> So what did we cover today?
>> Yeah, we went from context engineering,
some philosophical discussions, a lot of
practical discussions, SDKs,
uh MCP servers, and uh yeah, here we
are. I think I need a drink.
>> Oh, okay. Iman. So, are you more of a mojito person or a Nojito person?
>> We'll see. The night is young.
>> Okay. Okay. And let's see from our
audience who is for mojitos.
>> Nice. We see some hands. What about
Nojitos?
>> Nder.
>> Okay, maybe you want Miami Vice. Uh, no
judgments here. But what is happening
here is that we're going to close out
the day. I think you all deserve some
good dinners, some good drinks, and we
have a whole day tomorrow from 9 to 5:30
again. So we're going to hear from the organizer, G2i, all the way to a software engineer from Cursor. So gear up for another day of a fully packed schedule, great connections, and great knowledge sharing. So, that is it for
today. Thank you so much for being here
and joining us and we'll see you
tomorrow. Take care.