AI Engineer World’s Fair 2025 - Day 2 Keynotes & SWE Agents track

Channel: aiDotEngineer

Published at: 2025-06-06

YouTube video id: U-fMsbY-kHY

Source: https://www.youtube.com/watch?v=U-fMsbY-kHY

[Music]
The wheel keeps turning, grinding it
thread. path unchosen where dreams are
shed. Don't waste your time in endless
debate. Pick up your tools and create
your
fate. March forward, let the road
unwind. What's left behind is not yours
to buy. Rise with the sun. Let the sky
ignite. You build the future with your
will tonight.
[Music]
The clock won't wait. It takes
relentless. Echo of time are loud and
defenseless. Step off the ledge. Don't
fear to
fly. There's no gain if you never try.
[Music]
Steel fires burn where progress starts.
Flames that forge courageous hearts. You
can't stop what was meant to bloom. So
grab the light when it breaks the glow.
[Music]
March forward. Let the road
unwind. What's left behind is not yours
to buy. Rise with the sun. Let the sky
ignite. You build the future with your
will tonight.
[Music]
No backward glance, no sorrow ache. All
in motion and truth at stake. Humanity
story a river untamed. Fight your
[Music]
[Applause]
Gentlemen,
Please welcome to the stage the VP of developer relations at LlamaIndex, Laurie Voss.
[Music]
Hello again everyone. It is great
to see your friendly
faces.
Uh sorry, can we go back one slide? I
accidentally hit my forward. Uh
uh it is great to see you all. Welcome
back to day two or day three depending
on when you actually started. Uh who had
a good time yesterday? Let's hear it
from you.
One thing I couldn't fit into my intro
yesterday that I really wanted to get in
is that it is June in San Francisco. It
is Pride Month. So from myself and my
fellow LGBTQ uh members of the
community, I would like to wish you all
a happy
pride. I also want to hear from my jet
lag crew. Who, show of hands, woke up at
4:00 a.m. this
morning? There were a lot of you. 5:00
a.m. Who's still not awake now?
Uh we've got another great batch of
keynotes for you, including progress
towards deep thinking with Gemini.
You'll be hearing from Logan Kilpatrick
of Google. Uh fun fact about Logan is
that uh Gemini's ability to make jokes
is trained entirely on his tweets, which
is why none of them are
funny. Uh you'll also be hearing how to
make your agents more reliable with the
founder of Docker. So you won't you
won't want to miss that. Uh but first
we're going to hear from an amazing
organizer and just a wonderful person uh
who has a special announcement, co-founder of this very AI Engineer World's Fair, Benjamin Dunphy.
[Applause]
[Music]
Co-founding this conference with swyx has been one of the most rewarding
experiences of my career. To see you all
here today makes me so excited for what
we've built and for what's to
come. Like many AI plebeians, my oh [ __ ] moment was ChatGPT. One of my first
prompts was to test its limits of
knowledge and reasoning. I prompted it
to break the known universe into the
fewest core principles from which it
could then recursively generate 12
subclassifications. I was blown away by
how fun this exercise was and how
interesting the responses were,
especially when it got to the lowest
levels of the universe. For example, it
labeled viruses a subcategory of quarks,
which I found both fascinating and just
wrong.
It was at this very moment, however,
that I immediately knew that it was over
for everything I'd done in the past.
This was the most fascinating piece of
technology I'd ever used, and I recognized its potential immediately. I recall
texting my brother-in-law the URL
saying, "AGI has been
achieved." But it was only a few months
later that something even more
incredible happened. My son was
born. Being a father has been one of the
most miraculous and incredible
experiences of my life. While yes, it's
rather astounding to be able to speak
with computers where my mind feels
expanded every time I do, when I talk to
my son, it's my heart that
expands. So, how do these two things
relate?
I am old enough to remember a time when
computers were large cold machines only
used in corporate offices to get work
done. But as their power has grown, the
current model UX has tethered us to
these machines all
day. There is a parallel from this to
how our future generations will be
educated. While there will likely always
be a place for both screen and
keyboard-based HCI as well as classroom
and lecture style learning and
discovery, the potential of these new technologies and emerging UX can free us from those constraints, where even the
most mediocre of teachers could become world-class instructors. So that's why I'm
tremendously excited to announce a new
chapter for us, the AI Education Summit.
[Applause]
There's a significant gap between the
rapid advancement of AI and the
preparedness of our children, parents,
and educators to navigate this new
reality effectively and ethically. But
we can overcome this together by
fostering a global community dedicated
to AI
education, empowering children, parents,
and educators with the knowledge,
skills, and ethical framework to thrive in an AI-driven world. For this event, we'll be
partnering with a pioneer in the space
of AI education, Stefania Druga. She's a former research scientist at Google and, as of today, a three-time AIE speaker. It was her talk from last year
that sparked my imagination on this
exciting new
direction. When she demoed this student
learning to code by programming the very
thing that is teaching them to code, I
was just blown away. So whether you're
interested in education for the next
generation like I am or just the
evolution of HCI for learning in the age
of AI for people of all ages, I
encourage you to pre-register
today. This first event is going to be a
free online event to explore the
landscape filled with practical
knowledge for the exciting future of AI
education. So that's it for me and I'd
love to bring up our first speaker. He
is the group product manager at Google
DeepMind and he's here to talk about
Gemini. Please join me in welcoming to
the stage Logan
[Music]
Kilpatrick. Awesome. Thank you, Ben.
Excited for the AI education summit.
Should be fun. Um, my name is Logan. I
do developer stuff at DeepMind and I'm
excited to talk about Gemini stuff. Um,
yeah, hopefully folks know what Gemini
is. So, no introduction needed. Um, I'll
talk about three things really quickly.
We'll do some fun announcement stuff.
Um, we'll talk about sort of recapping a
year of progress in Gemini. And then
we'll talk about what's coming next
across the model side, across the Gemini
app side, and also across uh, of course
the developer
platform. So, the fun stuff which is we
announced a new Gemini model today. Um,
so we haven't officially announced it, but we'll post the tweet live. New Gemini model. Uh, this is
hopefully the final update to 2.5 Pro. I
think folks have given us tons of
feedback um about the changes and I
think my slide has an animation which is
hiding all the stuff. But Gemini 2.5 Pro
is awesome. Um, it's it's super
powerful. Uh, bunch of increases across,
you know, benchmarks people care about.
It's SOTA on Aider and um it's SOTA on HLE and some other benchmarks. Um I
think it closes the gap on a bunch of
the stuff that folks gave us feedback on
from the previous versions of the model.
Um so hopefully it has great performance
across the board. It also um I I think
is like sort of setting the stage for
the future of Gemini. I think 2.5 Pro
for us internally and I think in the
perception from the developer ecosystem
was the turning point which was super
exciting. Um, so it's awesome to see the
momentum. We've got a bunch of other
great models coming as well. Um, so 2.5
Pro, hopefully the final version. Send
us feedback if things don't work. Uh,
and we'll we'll continue to push the
rock up the hill. Um, you can go to
ai.dev if you want to try it out. It's
also available in the Gemini app and all
that other stuff. Um, and if you need
anything, email us and we'll make it
happen. All right, new model launched.
Let's talk about a year of Gemini
progress. I think this has been the
craziest thing. So, I don't know if
folks tuned in to Google I/O, but um
Sundar showed this slide on stage, which
I think was a great reminder for me of just how much, like, it feels like 10 years of Gemini stuff packed into the last uh 12 months, which has
been awesome. Um and it's it's actually
interesting to see as well, just to sort
of opine on one of the points, like all
of these different research bets across
DeepMind coming together to like build
this incredible mainline Gemini model.
And I think this is actually like I have
a conversation with people all the time
about, like, what's the DeepMind strategy? What's the advantage for us
building models? All that good stuff.
And I think the interesting thing to me
is just this breadth of research
happening across like science and Gemini
and all these other areas like robotics
and things like that. Um and all that
actually ends up upstreaming into the
mainline models which is super exciting.
Um so you see like the, you know, AlphaProof and AlphaGeometry and a bunch of stuff that we did with custom models in those areas actually improving
the performance of our models uh for
those domains and uh Jack will talk
about that in a little bit which I'm
super excited about. Um the other thing
is just like not just the pace of
innovation but the pace of adoption. Um
so I think uh Sundar also showed this
slide which was a 50x increase in the
amount of AI inference that's being
processed uh through Google servers from
one year ago to um last month, and I think it is just remarkable to see the increase in demand for um Gemini models and also from the external developer ecosystem. So
it's been it's been wonderful to see
that
happen. I think the other question, and I think this is talked about a little bit, is uh sort of what got us to this point. I think one of the
critical pieces and like it's you know
not super fun uh but is worth thinking
about uh for folks who are building
companies here uh is like an
organizational thing truthfully. Like, Google historically had lots of different teams doing lots of different AI research, um, and in early 2023 uh Google brought a bunch of those teams together
um and sort of charted this new
direction for the DeepMind team to not
only just like do theoretical
foundational research but also to like
build models and deliver them to the
rest of Google and also the external
world. Um and then we took the second
step of that journey later uh earlier
this year um which was actually bringing
the product teams into DeepMind. So now
DeepMind creates the models, does the
research um but then also builds products and delivers those to the world. And we have the Gemini app
which is our consumer product and then
we have the developer side of that with
the Gemini API. Um and this has been
like personally for me super fun to get
to collaborate with our research team
and like help actually be on the
frontier with them um and bring new models and capabilities to the world. I think this is like the collaboration that works incredibly well. Yeah. And we ship lots of stuff. I
think this is the this is the most fun
part um is there's so much stuff so much
innovation happening inside of Google.
It's it's incredible to get to bring
that to the world and bring that to
developers and I think we're actually
very early in that journey, as we'll see in a couple of minutes. Um
so in summary, the formula is simple.
bring the best people together, find
infra advantages, and ship.
I don't know if folks have played around
with Veo or not, but it's also been just incredible to see the reception to Veo. It's, uh, burning all the TPUs down, uh, which has been incredible to see. Lots of demand, uh, lots of interest on the Veo front. Um, so hopefully folks get a
chance to play around. It's available in
the Gemini app right now. Um, all right.
So, let's talk about what next. This is
the fun stuff.
So, I think the the sort of Gemini app
piece is interesting just because people
talk about it a lot and it's um it's a
fun product and it's cool to think
about. Um and also sort of I think for
folks building stuff, it's interesting
to hear like what our strategy is from
the app perspective. Um but the Gemini
app is trying to be this universal
assistant. I think what that means in
practice is if you um I'm sure people
don't think about this all the time, but
I think a lot about like what Google's
products do and and sort of how we show
up in the world. And one of the
interesting observations I had was that
if you think about what was the thing
that like brought people individuals
through all of Google's products
historically like the thing that comes
to mind is, like, your Google account, I guess, which wasn't super stateful. You would sort of sign into
lots of different Google products with
your Google account but that didn't
really do anything um other than just
like get you signed into that individual
product. I think now we're seeing with
Gemini that it's actually this thread
that unifies all of Google. And I think
the future for Google is going to look a
lot like Gemini is this sort of, you
know, thread that brings all of our
stuff together. Um, which is really
interesting. And then hitting on all the
trends which I'm sure folks are also
excited about building. I think the one
that I'm most excited about is
proactivity. I think most AI products
today are still very like you have to go
and do all the work as the user. And I
think this proactive uh next step of um AI systems and models coming into play is going to be awesome to see.
Yeah, and the team is moving super fast.
If you have complaints, please do not
tag me on Twitter. Please tag Josh. Um,
he will make it happen. Josh is
incredible. The Gemini app team is
amazing. Um, he's he's pushing the team
uh super hard. So, it's incredible to
see all the progress. Uh, but he is the
person who can make stuff happen on the
Gemini app, not me. So, please ping him. Um, from a model perspective, like
again, there's there's so much. Uh when
Gemini was originally created, it was
built to be a single multimodal uh
model to do audio, image, video, etc.
We've made a lot of progress on that. At
IO this year, we announced um native
audio capabilities in Gemini. There's
TTS. There's audio; uh, you can talk to the model. It sounds super natural, which is awesome. It's powering the Astra experience. It's powering Gemini Live. Um so I think we're going
to get towards that omnimodal model,
which is awesome. We have Veo, which is SOTA across a bunch of stuff. So
hopefully we'll get video into the
mainline Gemini model. Um, if folks saw
some of our early experiments with
diffusion, which means you can get like
crazy levels of tokens per second. Um,
really interesting. That's like
definitely a research exploration area
and it's not uh it's not mainline yet.
Um, so it'll be it'll be cool to see
that come. The agentic by default thread
I think is something that I've been
thinking a lot about recently which is
like
historically for me as a developer I've
thought about models just as this thing
that gives me tokens in and out and then
there was lots of scaffolding in the
ecosystem to allow me to build those
models. I think it's becoming very clear to me that the models are
becoming more systematic themselves,
like they're doing more and more. And I
think the reasoning step is this like
really interesting place in which a lot
of that's going to happen. And Jack's
going to talk about the scaling up of
reasoning. Um, but I do think it'll be
interesting to see like how much of the
scaffolding work that's happened in the
past ends up just like being a part of
that reasoning step and like what that
means for people who are building
products and stuff like that. So, um,
it'll be interesting to see. We'll also
have more small models soon, which I'm
excited about, and big models. People
want large models, which I know. Um, so
I'm excited about that. And then the
last one is continuing to push the
frontier on infinite context. I think
the current model paradigm doesn't work
for infinite context. I think it's just
like impossible to scale up. Attention
doesn't work that way. Um, so I think
there'll be some new innovations to
hopefully help let people continue to
scale up the amount of context that
they're bringing in.
Um, and Tulsi is the person who drives
all of our model stuff. So, if you if
you have stuff you want to talk about
Gemini models, uh, you have ideas for
things that don't work well, uh, she is
the person running the show on the
Gemini model product
side and then developer stuff. Um, so we
have lots of things coming which I'm
excited about. Um, I think I'll
highlight maybe three that I think
people are super excited about.
Embeddings, which, you know, feel like early AI stuff but I think are still super important. Embeddings power most people's um applications using RAG. Uh we have a Gemini embedding model which is state-of-the-art. So excited to be rolling that out to developers more broadly in the next couple of weeks.
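To make that concrete, here is a minimal sketch of calling a Gemini embedding model for RAG-style retrieval, assuming the google-genai Python SDK; the model name is a placeholder, so check the current docs for the released id and exact field names.

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

docs = [
    "Gemini 2.5 Pro is a reasoning model.",
    "Veo is a video generation model.",
]
result = client.models.embed_content(
    model="gemini-embedding-001",  # placeholder; substitute the released embedding model
    contents=docs,
)
for doc, emb in zip(docs, result.embeddings):
    print(f"{doc[:40]!r} -> {len(emb.values)} dims")
```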
Um, the deep research API I'm super
interested in. There's so many
interesting products that are built
around um this sort of research task and
people love the consumer product. So,
we're finding ways to bring a bunch of
that together um into a like bespoke
deep research API uh which will be
awesome. And then Veo 3 and Imagen 4 in the API as well. So, hopefully we'll see
that uh very very very soon. Um and as
we work to scale and and make that
possible from a from a developer
platform side, I'll make one other quick
comment which is the um AI Studio product positioning, which I also think is interesting. Like, AI Studio, just to be very clear, is being built as a developer platform. Um so we'll sort of move away from this like kind of consumer-y feel and move much more
towards being a developer platform which
I'm personally very excited about
because I think that's what developers
want from us. Um, so it'll be awesome to
see that actually come to life with like
many new iterations of our of our
developer experience with agents built
in and hopefully things like Jules and some of our developer coding agents um
natively in that experience which will
be which will be awesome to see. Um,
yeah, and that's that's what I have. I
appreciate all the people who send lots
of great feedback about Gemini stuff. So
we'll we'll keep pushing the rock up the
hill and um I'll be around. So if you
have more feedback, come find me and
we'll we'll keep making Gemini great for
everyone. So thanks and I appreciate
[Applause]
[Music]
it. Our next presenter is principal
research scientist at Google DeepMind.
Please join me in welcoming to the stage
Jack Rae.
[Music]
Hi everybody. Uh yeah, my name is Jack.
I'm a researcher at Google and I'm the
tech lead of thinking within Gemini and
I'm going to give a brief deep dive into
thinking from the research perspective
uh within Gemini. So
um it's it's thinking so much
I think this clicker might not work. So,
let's drive the next
slide. If you can drive the slides, whoever the slide driver is, please drive to the next slide.
Gemini.
Um but yeah, whilst we maybe sort out the slide issue, um I'm
going to kind of give this talk in three
stages. One is to give a research
motivation of why we actually are
excited about thinking in terms of
unblocking bottlenecks towards
intelligence. And I'm going to give a
kind of uh give a few examples of how
often, discovering the most pressing bottlenecks uh in kind of our current uh models, our most advanced systems; how often, if you can just identify the
crucial kind of uh issues and
shortcomings you often will then find a
solution and there's a reason how that
is linked to thinking and then going to
talk um a little bit more um just
pragmatically about what is thinking in
Gemini why is it interesting to
developers
And I think your someone is okay. The
slides are still not
here. We did do a rehearsal this morning
where the slides are there. But yeah,
keynote speaker SL. Yeah, someone's I
can see someone. Yeah, keynote speaker
folder. Jack Rae. I think it's under keynote speaker.
that one. Um anyway, um it's going to
come up soon. You are close um
person. Um yeah, but um and then I'm
also going to talk a little bit about
what's
next. Ah, I'm just sorry, I'm just
watching you. There you go. Nice
one. Yeah, that's great. Okay, the
slides will appear. Thank you, whoever
is coordinator. Apologies. I don't know
what happened. Um, and then I'm just
going to talk a bit about what's next.
So, Logan did a great job of kind of
giving an incredible overview of Gemini
as a whole ecosystem, everything that's
going on. Uh, I'm going to really be
focusing on on kind of what we're
excited about in the reasoning space.
So, with intelligence bottlenecks, uh, the message of this section is really about uh, progress.
So, progress has really been marked by
identifying key bottlenecks towards
intelligence and then solving them. And
uh I'm going to kind of give some
examples throughout history. I'm going
to actually rewind the clock to 1948.
Claude Shannon, he invents the language model in his mathematical theory of communication. He builds a language model, a two-gram, using a textbook of word statistics that was hand-calculated, and he samples from it and he kind of marvels at the samples. He feels like these are getting pretty good. They're a lot better than a unigram character model, this two-gram word model. But uh he kind of remarks, like, I think this would be better if we could really make a better language model and scale up this current method. So he really wanted to just scale up the n-gram. That was the bottleneck: a small amount of data, very, you know, elementary statistics. And unfortunately for Claude Shannon, the solution was pretty hard. He needed the digitalization of human knowledge and he needed modern computing to be able to aggregate these statistics at scale. So, you know, that wasn't so easy for him to solve. He had it a bit more tricky. But fast forward a few decades at Google, uh, in the 2000s, uh, my colleagues such as, uh, Jeff Dean are training n-gram language models over trillions of tokens. These are powering at the time the most sophisticated speech recognition and translation systems, uh, and a lot of progress has been made. But the bottleneck with these systems was that these n-gram language models were very restricted to short context, because um there's an exponential storage cost with uh context length, and there wasn't really a way around that with just sticking with n-grams.
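To make the two-gram idea concrete, here is a toy word-level bigram model in Python; the tiny corpus stands in for Shannon's hand-tallied textbook statistics, and it also makes the short-context limit obvious, since the model only ever conditions on one previous word.

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the dog sat on the log".split()

# Count word -> next-word frequencies (the hand-tallied statistics, automated).
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample(start="the", length=8):
    word, out = start, [start]
    for _ in range(length - 1):
        followers = counts.get(word)
        if not followers:               # dead end: no observed continuation
            break
        word = random.choices(list(followers), weights=list(followers.values()))[0]
        out.append(word)
    return " ".join(out)

print(sample())  # e.g. "the cat sat on the mat and the"
```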
The solution was the early introduction of deep learning in 2010, uh, with the introduction of recurrent uh neural language models. So, recurrent neural networks applied to modeling text, where the recurrent neural networks could avoid this problem by uh storing a compressed representation of the past into the state of a neural network, and they could now start to model beyond a five-gram: sentences or even paragraphs. And this was a massive kind of uh step change in improvement. However, a couple of years later people would notice even there, there was a bottleneck. So uh the recurrent neural network's representation of the past is in a fixed-size state, and with this fixed-size state uh there's only so much information you could put into it, and so as a result there's often observed to be kind of a lossy representation of its context. The solution that was derived, I think, once people kind of really encountered this um information bottleneck in the past, was actually: just keep everything around in terms of your past uh neural embeddings and use an attention operator to aggregate things on the fly. So this was the birth of attention, and then shortly after, transformers.
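A minimal sketch of that "keep everything around and aggregate on the fly" idea, as scaled dot-product attention over a store of past embeddings; toy dimensions and random vectors, not any particular model's implementation.

```python
import numpy as np

d = 4                                    # embedding size (toy)
past = np.random.randn(10, d)            # keep every past token's embedding around
query = np.random.randn(d)               # embedding at the current position

scores = past @ query / np.sqrt(d)       # relevance of each past token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over the whole history

context = weights @ past                 # aggregate on the fly, no fixed-size bottleneck
print("attention weights:", np.round(weights, 3))
print("context vector:  ", np.round(context, 3))
```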
So um, transformers then kind of led to the modern deep learning revolution as we know it, and much more progress was made. If we skip forward 10 years, we
then are in 2024. We have uh large
language models. They're increasingly
powerful general conversational agents.
We have uh models such as Gemini, ChatGPT. People are using them for all sorts
of use cases. And there that's where we
kind of come to the bottleneck that's
relevant to this talk, which is that
although these models are very very
powerful, they are still trained to
respond immediately to requests. So in
other words, in terms of a compute
bottleneck, there is a constant amount
of compute that they apply at test time
to transition from your request or your
question to the response or your
answer. So the bottleneck of test time
compute, this is relevant to thinking.
Uh so we can unpack this a little bit
more. So when we talk about a fixed
amount of test time compute, the test
time compute is interesting to you
because that's the compute that the
model is spending on your particular
problem, your particular question. And the way it actually kind of mechanically works is you have some
text in your request. It gets translated
to tokens and then it's going to go
through a language model. And at the
transition from the request to its
response, it's going to pass some
computation up through a large language
model which will have some parallel
computation for every layer and it'll
have some iterative computation across
layers. So that computation is really
where the model can apply its
intelligence to your particular problem
and it's of fixed size. One solution if
you wanted a smarter model and more
computation is just to make the model
larger and then you can have more
compute and you can get a smarter
response. However, it's still not really
enough. Users might want to be able to
think a thousand or a million times and
have a very large dynamic range and a
lot of compute for very hard or
challenging or valuable tasks. And also
users might want to have a very dynamic
application of test time compute. So
less compute for simpler requests, more
compute for harder requests and have
this process be very dynamic and and and
instigated by the model. And that is
what motivates
thinking. So thinking in
Gemini mechanically, I'm sure almost
everyone in this room is familiar with
this general process where we will now
have a model and we insert a thinking
stage uh that the model can emit some
additional text before it decides to
emit a final answer.
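A toy sketch of that mechanic: a scripted token stream stands in for a real model, and the &lt;think&gt; markers and budget cap are illustrative assumptions, not Gemini's internals.

```python
# Scripted stand-in for a model's token stream: think first, then answer.
SCRIPTED_TOKENS = [
    "<think>", "try", "hypothesis", "A", "...", "reject", "it,", "try", "B", "</think>",
    "The", "answer", "is", "42.",
]

def generate(tokens, max_thinking_tokens=1000):
    thinking, answer = [], []
    in_thinking = False
    for tok in tokens:
        if tok == "<think>":
            in_thinking = True           # model opts into extra test-time compute
            continue
        if tok == "</think>":
            in_thinking = False          # model decides it has thought enough
            continue
        if in_thinking and len(thinking) >= max_thinking_tokens:
            in_thinking = False          # budget cap on the thinking loop
        (thinking if in_thinking else answer).append(tok)
    return " ".join(thinking), " ".join(answer)

thoughts, final = generate(SCRIPTED_TOKENS)
print("thoughts:", thoughts)
print("answer:  ", final)
```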
So going back to this notion of test
time compute now we've added an
additional kind of loop uh of
computation where the model can kind of
iteratively uh loop and and perform
additional test time compute uh during
this thinking stage and this loop can be
potentially thousands or tens of thousands of iterations, which gives you tens of thousands of times more uh compute before it decides to commit to what its
response will be. And also because it's
a loop, it's dynamic. So the model can
learn how many iterations of this loop
to apply before it decides to actually
commit to its
answer. We train this model um to think
to use this kind of thinking stage via
reinforcement learning. So after we pre-train Gemini, uh we then have a reinforcement learning stage where we train it to do many different tasks, and we give it positive and negative rewards depending on whether or not it solves the task correctly.
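A toy REINFORCE-style sketch of that reward signal, where a tiny policy learns how many thinking steps to spend per problem difficulty; the setup, reward shaping, and step penalty are invented for illustration and are not DeepMind's training recipe.

```python
import math
import random

BUDGETS = [1, 10, 100]                          # candidate numbers of thinking steps
logits = {d: [0.0, 0.0, 0.0] for d in BUDGETS}  # one tiny policy per problem difficulty

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def solved(difficulty, steps):
    return steps >= difficulty                  # pretend harder tasks need more thinking

for _ in range(20000):
    difficulty = random.choice(BUDGETS)
    probs = softmax(logits[difficulty])
    idx = random.choices(range(len(BUDGETS)), weights=probs)[0]
    steps = BUDGETS[idx]
    # Positive reward for a correct answer, negative otherwise, minus a small
    # cost per thinking step so easy problems learn to use less compute.
    reward = (1.0 if solved(difficulty, steps) else -1.0) - 0.005 * steps
    for i, p in enumerate(probs):               # REINFORCE update on the sampled action
        grad = (1.0 if i == idx else 0.0) - p
        logits[difficulty][i] += 0.1 * reward * grad

for d in BUDGETS:
    probs = [round(p, 2) for p in softmax(logits[d])]
    print(f"difficulty {d:>3}: probability over budgets {BUDGETS} = {probs}")
```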
And this is essentially a very general
uh training recipe really. And it's kind
of remarkable it works that the model is
able to just get a very vague signal of
what is correct, what is not correct and
to back propagate this through this loop
of thinking stage such that it can try
and shape how it uses its thinking
computation and thinking tokens in order
to be more
useful. In fact, we weren't really sure this would work. Um, it wasn't clear how much structure we should put into something like a reasoning stage. And um, although I think probably many people here have now seen reasoning traces and played with these models, I'll just show you a historical artifact um from one of the times we were trying to use reinforcement learning, when we started to see cool emergent behavior. So in this problem there's kind of like an integer prediction problem; this was just a particular uh example, in this case kind of a mathsy example. And what we saw was the model was using its thinking tokens to actually first pose a hypothesis and then test out the hypothesis, and then it found that basically things weren't really working, and it kind of states that this formula doesn't hold, it rejects its own idea, and then it tries an alternative approach. And I think it's easy to become desensitized to technology because it's so amazing every single day, but we were truly blown away when we saw the general recipe of reinforcement learning was creating all sorts of interesting emergent behavior: trying different ideas, self-correction.
And I think these days we see a lot of
different strategies that the model
learns to do. So it learns to break down
uh the problem into various components,
explore multiple solutions, draft
fragments of code and and and build
these up in a modular way, perform
intermediate calculations and use tools.
All under the umbrella of using more
test compute to give you a smarter
response. Okay. So I've talked a bit
about uh why we are interested in
thinking in terms of the path to AGI and
unblocking bottlenecks of intelligence
and just a little bit about mechanically
what it is. Why is it interesting to
developers? Obviously the number one
reason is we think this is driving uh
more capable models and it also stacks
on top of our current paradigms of how
we accelerate model progress. So
thinking uh we can uh kind of accelerate
this process by scaling the amount of
test time compute and we find that this
can stack as a paradigm on top of
pre-existing paradigms such as
pre-training where you can scale the
amount of pre-training data and and
model size and also post- training where
you can scale the quality uh and
diversity of human feedback for many
different types of tasks. And as a
result, within Google, by investing in all of these and really
accelerating all of them uh we get kind
of a multiplicative effect. And why is
this interesting to developers? I think
it results in just overall faster model
improvement which is very nice.
We also see if we kind of uh look back
over uh our lineage of uh recent um
Gemini launches um you know there's
improved reasoning performance and and
we can actually map this to how much
test time compute these models will
devote to problems. So there's kind of
like a log scale test time compute on
the x-axis and performance across like
math code and some science topics. And
we see that there's kind of this trend
in increasing reasoning performance
whilst also it tracks very well with
increasing test time compute. And on the
far left uh you know you have 2.0 flash
experimental. This was a model that uh
was not launched with thinking back in
uh back in December last year. So ancient history. Uh and now we have, on the right-hand side, the first launched version of 2.5 Pro. So test-time scaling is working
empirically. Um but it's not just capability that matters. It's also interesting from the notion of being able to steer the model's uh quality over cost. So um, you know, before, uh you had the option of choosing a discrete number of possible model sizes, and that was a way to gauge how much quality you wanted and also how much cost you wanted to incur for any given task. But it was kind of a discrete choice. Now with thinking we can have a continuous uh budget, uh which allows you to have a much more granular slider of how much capability you want uh for any given kind of class of tasks. And we have thinking budgets now launched in uh Flash and Pro uh in the 2.5 series. And um this allows you to have a very granular choice of cost to performance, and also allows us to then push the frontier and allow you to kind of augment and drive cost higher and performance higher if your application requires it.
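A minimal sketch of setting such a budget, assuming the google-genai Python SDK's ThinkingConfig; treat the field names as assumptions and check the current API reference.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How many primes are there below 1000?",
    config=types.GenerateContentConfig(
        # Larger budgets buy more test-time compute (quality) at higher cost and latency;
        # smaller budgets keep simple requests cheap and fast.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```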
So okay, I think a lot of this stuff is really covering uh ground that, you know, goes up to the present day. So what's next and what are we excited about?
So we're we're very excited about just
generally improving the models and
having better reasoning. Of course,
we're also excited about making the
thinking process as efficient as
possible. Really, we want thinking to
just work for you and be quite adaptive
and and be something that you don't have
to actively spend a lot of energy
tuning. And a big part of that is
ensuring our models uh are very
efficient in how they use their
thoughts. Uh this is definitely an area
of progress. I think we can find
examples of our models overthinking on
tasks and this is just an area of
research to get these things faster and faster and as cost-effective as possible. We're very proud of how cost-effective our Gemini models are and
this is just an area uh for improvement
as well. And there's also deeper
thinking which is really about scaling
the amount of inference compute further
to drive even higher capability.
So people may be familiar with Gemini
deep research where you can kind of uh
type in a query and then and then the
model will go away for a long period of
time and research a topic. We've also now announced at I/O, and we're launching to trusted testers, a notion of Deep Think. Deep Think is a very high thinking budget uh mode built on top of 2.5 Pro, and its desired application is for things where uh you have a very hard problem and you're happy to essentially um uh fire off the query and then have some asynchronous process that's running for a while, and you'll come back to arrive at a stronger solution. And its key idea is uh we leverage much deeper chains of thought uh and parallel uh chains of thought that can integrate with each other to produce better responses. We find this uh enhances model performance on very tough multimodal, code, and math problems. An example would be the USA Math Olympiad. This is a task where basically the state-of-the-art model in January had completely negligible performance. Uh, 2.5 Pro (probably even better with the updated one today) was at about the 50th percentile of all participants in the Math Olympiad, and with Deep Think it goes up to the 65th uh percentile. And the interesting thing about Deep Think is, as we continue to both improve the base model and improve the algorithmic ingredients that go into Deep Think, those two will stack together as
well. Um, here is kind of like a just
like a video animation of of one of
these USA Math Olympiad algebra
problems. And and the key idea really
with this video is just this notion of
having multiple iterative uh ideas. So
maybe the model starts out with some
proof by contradiction idea, but then it
explores two different aspects, Rolle's theorem, Newton's inequalities. It
integrates them and eventually arrives
at some correct proof.
There's not that much you can take away
from this video, but it looks pretty
cool, so I added it. Yeah.
Yeah. One thing that's, you know, other
than we talked about math a little bit
in the previous slides, I'm very excited
about any application where the model
can spend longer and longer thinking on
very open-ended coding tasks and one-shot or very-few-interaction vibe code, things that would have taken us months uh in the past. And one example that I like from a researcher is just um some of my colleagues kind of vibe-coded, from DeepMind's original DQN paper, which was a revolution in deep reinforcement learning. Gemini vibe-coded the kind of training setup, the algorithm, uh even an Atari emulator, such that it could
play some of the games and you know this
is uh remarkable to me because this
these kind of things would have taken me
and my colleagues uh months in the past
and these things are starting to happen
um uh kind of in
minutes. One thing I'm quite excited
about looking forward to the future is
not really the landscape of models but
coming back to like what's our gold
standard which is the human mind. I
would love for our models to be able to
contemplate from a very small set of
knowledge and think about it incredibly
deeply such that we can push the
frontier. And one example I often think
about is Ramanujan, who was uh one of the world's greatest mathematicians uh from the early 20th century, and famously he just had this one math textbook.
He was kind of cut away from from the
mathematical community. But just from a small set of problems, he spent uh many textbooks' worth of thinking going through problems, inventing his own theories to further extend ideas, and he invented an incredible quantity of mathematics really just by deeply thinking from a small source set, and
this is where I think we are going with
thinking. We want a model to be able to
be incredibly data efficient and
actually go to millions uh or or beyond
of of of inference tokens where the
model is really building up knowledge
and artifacts such that we can
eventually start to push the frontier of
human
understanding. So with that said, thank
you very much and
uh
Our next presenter is here to tell us
why you should care about evals. Please
join me in welcoming to the stage
founding engineer at Braintrust, Manu
[Music]
Goyal. All right, who's excited about
eval?
[Applause]
All right, what can I do to get those
juices flowing? Uh I'm Manu and uh I
work at Braintrust where we build a platform to do evals and a bunch of other
stuff. Um so I was thinking we could
just start by uh talking a little bit my
about my own personal eval journey. Now
you might see this picture and say ah
what an adorable little boy absorbed in
his Nintendo 64 video game. But if you
look a little closer, you'll see a boy
who's deeply disappointed with the state
of technology in his society. Because
this boy, he knows that technology is
not meant to be shackled to the
constraints of rule-based systems doomed
to do the same thing over and over and
over. No, technology is meant to come
alive to grow and adapt and really be a
thought partner to mankind. So, I knew
this in this moment, which is why I
decided to devote my career to being a
software engineer in the AI industry.
And so, I dropped the Nintendo and I
started grinding away on LeetCode and
soon enough I landed a job in the
self-driving car industry. Now, we can
all learn a lot about self-driving cars,
but the thing I took away was that, you
know, you can spend all day tuning the
model, changing the architecture, you
know, adjusting the loss function, all
good stuff, but it's never going to be
enough for you to actually ship it to
production, right? I can't say, "Oh, my
image classification rate went from 98%
to 99%. Put it on the road." Right? We
need some way to you know contextualize
this model and understand if it actually
works for our real world application.
You know does it avoid pedestrians? Does
it negotiate traffic scenarios
appropriately? Does it obey the law? All
this stuff we actually need to
understand. And how we're going to do
that is with eval. Now you know the
whole point here is, you know, evals aren't just unit tests for AI. They're not just
for finding regressions, right? If I
didn't have evals, the only way I can
get any signal on my changes is by
shipping it to prod and then getting
signal, you know, in the real world. But
that's expensive. It's slow and
ultimately it's pretty risky.
So what do evals do? It's kind of like, if you invest in good eval infrastructure, that lets you run experiments to your heart's content and do 90% of the product iteration loop before going to prod, and then you can ship much more quickly, much more confidently.
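A minimal sketch of that offline loop: a fixed dataset, a task function (stubbed here in place of a real model or agent call), a scorer, and an aggregate metric; the same scorer can later be applied to logged production traffic.

```python
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_task(prompt: str) -> str:
    # Stand-in for your LLM or agent call; swap in the real thing.
    canned = {"2 + 2": "4", "capital of France": "paris"}
    return canned.get(prompt, "")

def exact_match(output: str, expected: str) -> float:
    # The same scorer can run on offline experiments and on logged prod traffic.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

scores = [exact_match(run_task(row["input"]), row["expected"]) for row in dataset]
print(f"exact_match = {sum(scores) / len(scores):.2f} over {len(scores)} examples")
```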
Um,
Now furthermore, if you actually apply the same metrics from offline to your online production data, you now have data-driven signal about which examples in prod are going to be most useful for that next iteration loop. And so with all of this knowledge, my eval journey was complete and I transformed from this guy to this guy. So, success.
Now, if this heartfelt childhood story
isn't enough to do it for you, you don't
have to take my word. You can take the
words of all of these tech luminaries.
We have Kevin Weil, Garry Tan, Mike Krieger, Greg Brockman, all extolling the
virtues and the necessities of eval. And
surely if they're all saying it, there's
got to be something to it. It can't be a
total scam. So there's got to be some
there's got to be something worth
checking out
here. So with all that buzz, I made my
way to Braintrust where our goal is to sort of build the dev platform to of course let you do evals but also do all
the things that go along with it. So
that involves you know tweaking prompts
and experimenting in the playground. It
involves logging data and sort of
getting the observability component and
kind of connecting all those together in
this beautiful data flywheel, so that we can let you build the data flywheel to let your AI dreams come true, because that's really what we're here for. Now, I know this was a dense and
content-heavy presentation. So I'll try to
distill it with one simple message, which is that the key to industry transformation, the key to success, is evals. Woohoo.
All right. Thank you. Please join the
eval track Golden Gate Ballroom B. I'll
see you
[Music]
there. Our next presenter is best known
as the creator of Docker. Today he is
the CEO of Dagger focusing on the
foundational challenges of building and
operating reliable scalable AI agent
systems. Please join me in welcoming to
the stage Solomon Hykes.
[Music]
[Applause]
Hello.
[Music]
Hello. Okay, my slides are up. You can
see them,
right? It's me. Okay. Well, this is a
very special moment for me because I
just realized yesterday walking in, this
is the exact same spot, the same stage
actually, that I stepped on almost
exactly day for day 10 years ago to kick
off DockerCon 2015. Thought it was pretty funny. I
don't know if anyone was there for that.
Maybe this audience is too young. Maybe.
I don't
know. Okay. Well, uh I'm here to talk
about chaos, specifically the kind of
chaos that emerges when you try to use
uh coding agents. Um
and I want to talk about chaos from the
perspective of our community at Dagger,
which is platform
engineers. Um I don't know if there's
any platform engineers in the
room. Okay, just you and me, ma'am.
Okay. Well, it it it is known uh uh
sometimes uh as other things, but
basically platform engineers have a
really tough job because they don't get
to build and ship cool software. They
get to enable all of you to build and
ship cool software in the most
productive way possible, right? Uh it's
a really tough job. It takes range. It
takes experience. It takes a lot of
patience. But we do
it for the endless gratification. You
know, just the gratitude we get from
developers. Just
kidding. No one ever says thank you. But
it's okay. Someone has to do it. Tough
job. Speaking of
enabling, anyone here use coding
agents. We are outnumbered. Okay. Well,
I I want to say to you, congratulations
and welcome to platform engineering.
Yeah. I mean, your job now is to enable
robots to ship awesome software while
you spend more and more of your time
enabling them to do that productively,
right? Tough job. I I I
I applaud you for giving up really the
most fun and rewarding part of the job,
you
know, very selfless.
Uh yeah, so of course this is not a
completely a reality yet. I mean we
don't have quite yet the team of agents
just kind of you know humming along
doing the doing the job while we sit
back and um fix environments for them.
But you can kind of see it coming,
right? I mean some of you are definitely
doing that hacking that together.
There's a lot of cool posts out there
and scripts and tools. Um so we know
it's coming. The question is how do we
enable this to um happen not just for
this
incredibly cool and uh bleeding edge
crowd but for everyone else uh like
everyone shipping software any
everywhere just sort of creating maximum
value by enabling agents to do the work
for them ultimately taking their jobs
that is the dream
right okay so yeah how do we do
and make it not too painful. Well, um I
want to go back to basics. What is an
agent? Uh the famous definition of
course is it's an LLM that's wrecking
everything in a loop on behalf of a
human. The diagram is from Anthropic. Thank you, Anthropic. I tweaked the
explanation just a little bit. Uh in the
context of coding agents, it looks like
this.
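A minimal sketch of that loop; call_model and the TOOLS table are hypothetical stand-ins rather than any specific vendor's API.

```python
# Hypothetical tools the agent can call in its environment.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda _: "2 passed, 0 failed",
}

def call_model(history):
    # Stand-in for an LLM call: a real agent would send `history` to a model
    # and parse its reply into (action, argument).
    script = [("read_file", "main.go"), ("run_tests", ""), ("done", "All tests pass.")]
    tool_turns = sum(1 for role, _ in history if role == "tool")
    return script[min(tool_turns, len(script) - 1)]

def agent_loop(task, max_steps=10):
    history = [("user", task)]
    for _ in range(max_steps):
        action, arg = call_model(history)
        if action == "done":
            return arg                       # hand the result back to the human
        observation = TOOLS[action](arg)     # execute the tool in the environment
        history.append(("tool", observation))
    return "gave up"

print(agent_loop("make the homepage build"))
```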
Um oh man, that was supposed to be
animated. It's even better when it's
animated. It's okay. Yeah, you got one
agent and it's doing stuff in the
environment is your computer. Uh, and it
can do great work. It can also do very crazy things. So, you have to kind
of watch it closely, right? And approve
approve. No, no, don't do that. That's
crazy. Yes, that's good. Um, that's kind
of the status quo today. But of course,
um, we want to scale it, right? We want a team.
So, how do we do that? Well, right now I
would say there are two
options, both equally wonderful and fun.
The first one I call yolo
mode. You know, I'll just run 10. What
can happen? Uh, amazingly, this diagram
is not the worst case
scenario, but yeah, you know, you get
the idea. So, that the whole methodology
of watching it closely just kind of
falls apart really quickly because
they're all stepping on each other's
toes. They're sharing an environment,
right? Okay. Enter option two. Oh, don't
worry about that. We'll run the agents,
right? We'll take care of everything.
We've got background mode. We've got the
We've got the model. We've got the
tools. We've got the environment. We've
got the compute. We got the secrets. We
got everything. You know, just open an
issue, wait for the PR,
relax until, of course, it doesn't work.
And then you're like, "No, that's not
what I meant." Um, these these actually
work really well. I think like 10 of
those launched just yet just just today
and yesterday. Um and and it they're
great. It's just that
um you know sometimes you just want to
get in there like okay give me the
keyboard you know and sometimes you just
want to run it on your machine or on
your favorite compute provider right use
your favorite model you want to mix and
match. So there are limitations to this
all-in-one model. So the question is is
there something better? uh is there
just a scenario where I just got a team
and they're working and you know I can
step in or leave them alone and we're
just kind of getting stuff done
together. So this is how I would
summarize it. What I would want is
really four things. First, I want
background work. You know, I don't want
to be in there just watching every
action. That's obvious. Um I want rails.
So that means I want to be able to
constrain the agent to to not just do
things that I already know are not
necessary. So obvious things like
context of the project, what's you know
what's our coding style, what's our what
tools to use, but also here's how to
build, here's how to test, here's the
base image we we use, right? You can
access this secret, you can access that.
Just an easy way to do that because
otherwise I'm going to waste so many
tokens just correcting as I go, right?
The third is inevitably when I do need
to step in I really I want a really
efficient and seamless way to do that
and it can't be watch every action and
it can't be just wait for the PR and do
code review you know there's I need a
middle ground here and the fourth thing
is I want optionality because like I was
saying
before it's a crazy market you know
there's there's awesome models awesome
compute awesome infrastructure uh agents
are really cool And as cool as they are
now, I mean, you one of you is probably
like launching one right now and then
there's another one tomorrow. So, I
don't really want to lock myself into a
whole package today and say no in
advance to whatever is coming out
tomorrow. Not in this market.
So, to get that
um I need an environment that has
properties that match this. It needs to
be isolated, right? So, background work
works. It needs to be customizable so I
can set up those rails. Needs to be
multiplayer so I can, you know, go, "All
right, give me that. Let me fix this or
let me check. Did you do it?" You know,
when the model says, "I did it. Did you
do
it?" And then, you know, it should be
open. No, no shade on making money and
scaling a huge cloud service. That's
great. You know, we have one. They're
great. But I just want choice, right?
Okay, I want to be able to choose and get the best commodity. Let's
just use this word. It's okay. It's okay
to use it. The best commodity component
for each uh
job and you know could even be open
source. Who knows? We could collaborate
on this anyway.
So, unsurprisingly, maybe I'm going to
talk about containers
now. Someone actually said, you know,
you should check that they know Docker.
They know containers. Uh, okay. Who
knows what containers are? who's used
containers. Okay, cool. Cool. All right.
Boost my confidence a little
bit. But the point here is we have the
technology. It's not just about
containers, but they do play a crucial
role because it's a foundational
technology and it is it is
underutilized. We don't fully leverage
what this technology can do because
we're used to the first incarnation of
the tools made for humans. Uh same thing
for git. I see a lot of hacks involving
git worktrees. Anyone playing with git worktrees to get stuff done? Okay,
you know what I'm talking about. So this
is about
that. Um and of course we have models
that are incredibly smart getting
smarter and they they can exercise these
technologies
uh really fully. We just need to
integrate them in a native way so that
we really
um tackle the problem at hand which is
giving great environments to these
agents. Anyway, so if we built that
native integration, what would it look
like? Well, we have a take. Sorry, we're Dagger. I completely forgot to mention my company. That's
okay. Um, it's great. Check it out. Um,
we we have a take on that. Something we
call container use. You know, there's
computer use, browser use. Uh, these
agents need container use. Um, they need
a way to use containers to create
environments and work inside of them.
This is not the same thing as
sandboxing, right? There are a lot of
ways to execute the output of the agent
in a secure sandbox. Very useful, very
cool. But that's not the same thing as
the agent developing inside of
containers entirely, right? That's what
we're talking about here.
So I asked my
team, hey, we've been developing this
thing. Oh, it's open source, but it's
not yet open source. Like it's not
finished. But I asked the team, I should
show it, right? and they said absolutely
not. It's not
ready. So anyway, you want a
demo.
Okay. All right. Just we're clear, this
is you agreeing to watch me stumble
through a broken demo of unfinished
software. Yes.
Okay. So much could go wrong right
now.
Okay. This is my terminal. Can you see
it?
Okay, for for technical reasons, I'm not
going to go to full screen. You just got
to stop me when I reach the edge. Oh,
actually, I can see it. Never mind.
Okay. Uh, old
school.
Okay. We used to do this all the
time in the old days. Okay. So,
uh, here's what I'm going to do. I'm
going to just, um, try to develop
something very simple here. I got an
empty directory. I'm going to try try
and make a little homepage for my
awesome container use project and I'm
going to use Claude Code. I'm going to try and use a bunch of them. Hopefully I made something very clear: this is not a coding agent. It's environments that are portable that you can attach to any coding agent. That's the idea. So you like Claude, use Claude. You like, you know, Codex, use Codex,
etc., etc., etc. in an IDE, in the
command line, whatever. and also in the
cloud, right? In CI, lots of cool things
you can do once you're async.
So, okay, one of the reasons the team
said don't do a demo is I'm actually
terrible at using Claude. So, uh I have
an alias for remembering the flag to
disable all, you know, permissions. I
can never remember
it. And I have a prompt here. It's I'll
read it to you in a minute, but it's
basically make me a homepage.
uh make it a Go web app so I can know what's going on because I'm not a cool kid writing TypeScript, and run the
app when you're done. So while this runs
while this maybe runs hopefully.
Okay. Okay. Cool. So what's happening
here is I configured Claude Code, you know, with container use, to use containers, literally, um via MCP. So it
was an MCP integration. There are other
integrations that we're working on but
MCP is the obvious place to start. Um,
and so now it has, you know, all its
usual tools. This is vanilla, uh, Claude Code, but now it can create an
environment for itself. And now it's
editing files in that environment like
in a little sandbox. And it can also run
commands to build it and test it and of
course run it in uh, ephemeral
containers. This is not one Docker
container sitting there. Every time an
action needs to be taken, there's an
ephemeral container running and then
being snapshotted and then, uh, returning.
So just doing its thing.
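An illustrative sketch of that pattern using the docker-py SDK: run one action in a throwaway container, snapshot the resulting filesystem as an image, and discard the container. This is not container use's actual implementation (which is built on Dagger); it just shows the ephemeral-container-plus-snapshot idea, and the image and repository names here are arbitrary.

```python
import docker  # requires docker-py and a local Docker daemon

client = docker.from_env()

def run_step(image: str, command: str, tag: str) -> str:
    container = client.containers.run(image, command, detach=True)
    container.wait()                                   # let the action finish
    print(container.logs().decode())
    snapshot = container.commit(repository="agent-snapshots", tag=tag)
    container.remove()                                 # the container is ephemeral...
    return snapshot.id                                 # ...but the filesystem snapshot persists

snap = run_step("golang:1.22", "sh -c 'echo package main > main.go && ls'", "step-1")
print("snapshot image id:", snap)
```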
[Music]
Um what would I want to show here? Okay,
so here I'm going to first show that
nothing has been polluting my workspace.
It's happening in a little sandbox. And
the way the sandbox works, the state of
these files and the containers that are
being run is um actually persisted uh in
git and it's in a bunch of special git
objects that are kind of living
alongside the repo. So it's right there
if I need it. This is all local. Um, but
it's not polluting my workspace by
default. So hopefully it's going to
produce something soon. Uh, while it
does that, I'm going to use this little
command line. Is this readable? Okay,
little command line. CU like go work.
See you later. But no, really, it's for
container use. Um, and I can list
environments. And you can see there's a
new environment that's been created
here,
uh, with a little random name here. And
so there's a few things I can do. One
thing I can do is open a
terminal, and here, okay, this part is powered by Dagger, right? But we use Dagger as a sort of toolbox; it just has all the primitives you need,
um and so here I can see exactly what
the agent sees um the files but also the
tools, so I can see, okay, what Go version did you configure for yourself? All right, because the agent is given the ability to figure out what environment it needs and then configure that, but in a repeatable, containerized way. Uh, so here I can
see. Okay, does it
build? Okay, it builds. Okay, so you're
done. What's going
on? Okay, while we do that, I'm also
going to show you actually two more
things to say. One, uh, a really cool
feature of this that I'm not going to
show is secrets. So, you can just plug
in secrets from things like 1Password. I use 1Password. I don't
want to use a separate password manager
from an AI company. No offense, I just
want to use my password manager. So, I
can just plug in and say this
environment gets this secret and boom,
it can use it, right?
Um, and then the team said, "Please
don't show that. That's just that's
going to break for sure." Um, so I
won't. And the other thing I want to say
is that because it's all powered by
Dagger, um, and the point here, it's
containers and it's open source. That's
what you should know. Uh, it's running
on my machine. Actually, no, it's not
running on my machine because we're at a
conference and there's a lot of things
that can go wrong if you run containers
and download images. So, instead, I I
just have it running on my home server
in my basement about one mile this way,
and it just kind of works seamlessly.
It's streaming files up, streaming files
down. It all just kind of works.
[Music]
Okay. This is the part that I cannot control, as you know. Okay, one more thing I'll show you: you can watch. So here I can see the history. Behind the scenes, every snapshot of the state is like a git log — it's actually using git under the hood. So if I'm happy with the result, I can go and get it. So it's like a happy medium, a collaboration loop that's just right. It's not watching every tool and wrecking a shared environment, but it's not waiting for a pull request and having these long back-and-forths. It's right in the middle. I can see everything going on and I can say, "Okay, give me the history of that. I want that." Okay, it says it's live. It's running. Ooh, pretty nice.
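Since it really is git underneath, an environment's history can be inspected and adopted with plain git commands. The sketch below, in Python via subprocess, assumes a hypothetical environment name and ref layout — the actual naming scheme in container-use may differ:

```python
import subprocess

ENV = "fancy-mallard"            # hypothetical environment name, as shown when listing environments
REF = f"container-use/{ENV}"     # assumed ref layout; check your repo's actual refs

# every snapshot the agent took shows up as a commit on that ref
history = subprocess.run(["git", "log", "--oneline", REF],
                         capture_output=True, text=True, check=True).stdout
print(history)

# happy with the result? pull it into your own branch like any other git work
subprocess.run(["git", "merge", "--no-ff", REF], check=True)
```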
Cool. Okay, so now — I appreciate it, but you guys can be honest, it's a little boring. So: this design is boring. Make it really pop. Trying to impress an engineering crowd there. Okay. Okay. So, the reason I'm doing that is to create the circumstances where I would need a lot of parallel experiments, right? Make it pop — what does that mean? It could mean anything.
What if I want to try several
experiments in parallel? Right? So, I'm
just going to say,
oh, well, hold on one second. Stop.
Before I do that, I'm going to
um merge this. Right? There's still
nothing here, but I'm saying I like it.
So, I'm going to say merge that
environment. And I have it. It's my
history. I can open a pull request, can
clean it up, whatever. So, that's that's
a loop that I can work with, right? Um
and now I can say, nah, boring.
And then I can say — since the environment is now in this state, I can ask for help from a few other agents, right? I can say, okay, hey, claude yolo — that's not right — claude yolo: this web app looks a bit boring. Can you make it pop please?
Okay.
and
go and go and
go. Okay, so this is where things start
really going wrong,
but as the team pointed out, when I said, "Well, something's going to go wrong, right?" they said, "Yeah, but
you were kind of showing that if things
go wrong, you can throw away the
environment and you're good. You can
restart." I said, "Okay, that's cool."
So, um, like let's say I don't like this
one. I'm like, "Nope, goodbye. That's
it. I don't have to go clean up the
mess, right? That's the whole point."
Uh, okay. So, this is getting a little
messy. Oh, I wanted to show Goose also.
So, Goose is a really cool open source
agent. Whoops. All right. Hold on a second. Goose YOLO — same thing. Everyone has complicated flags for disabling all these safeties that I don't need anymore, right? Because it's — okay. Okay. Well, really taking a chance
here. So, while this is happening,
one thing we've been working on — it's still work in progress — is a watch command. I showed you that already, but this is a git command, right? A thinly wrapped git command. Our UX — words cannot express how unfinished this is, but it'll evolve rapidly because the bones are strong. It's git, it's Dagger, and it's your existing agent, plus a little bit of glue. So for example, here it's literally a git command you can copy-paste. But as the agents work, you're going to see state snapshotting, and you're going to see these branches just kind of diverging, and then I can diff them, apply them, merge them, whatever I want. And
what I really wanted to show, and then I'm done, is I just want to see one of them run. So you can see, when the agent runs a service — go run, npm run, whatever, in this case Go — it's doing it in its containerized environment, and that's going to be seamlessly tunneled to my machine here on a different port without any conflicts. So when I say the environment's isolated, it's its files, its context, its configuration, and its execution. And the cool extra thing is that all of this is actually, technically, running in my basement. So you can go crazy on the infrastructure side.
Like you can run this on a cluster. We
like to run this stuff from CI. Uh it's
just a lot of fun stuff you can do. And
I'm getting 30 seconds. Come on. Oh — Goose is running. Great.
Okay. We did not solve prompt
engineering. Do
it. Okay. Not done. Not done. Oh man.
Okay. Well, just
[Laughter]
imagine. Okay. Well, uh, while this
happens, because I got 30 seconds left,
I'm just going to say,
thank you. And there's one last thing I want to say. At DockerCon, 10 years ago, we used to open-source stuff on stage all the time. So, if you
want, I can go and open source it right
now.
Okay. You have been warned though about
the not finished part, right?
Okay. Okay. Oh — it would be funny if the demo failed at the clicking-on-GitHub part. Okay. All right.
Goodbye. Goodbye. Next time. I promise
it
works. Okay. Haven't done this in a
while.
Wait.
Oh, I'm almost done. I
promise. Come on. You did so
well. Change
visibility. Yes, I want.
Yes, I have read and
understand. Oh
god. Oh god.
Uh
yes. At Dagger, we take security very
seriously.
Okay. All right. I think it's Wait. I
think it's
done. Yes.
Okay. So, yeah, thank you very much. It's github.com/dagger/container-use. Come say hi, come participate, and thank you so much for having me.
[Music]
[Applause]
[Music]
[Applause]
[Music]
Our next speaker is building the
infrastructure for the singularity.
Please join me in welcoming the founder
and CEO of Morph Labs, Jesse
[Music]
[Applause]
Han. Howdy. Howdy.
You know, history misremembers
Prometheus. The whole class struggle
between mankind and the gods was really
a red herring.
And the real story wasn't so much the
rebellion against the divine hegemony, but rather the liberation of fire — the emerging relationship between mankind and its first form of technology.
And the reason why we're here today is
arguably because we're on the cusp of
perfecting our final form of technology
or at least the final technology that
will be created by beings that are
recognizably human.
And our final technology has begun to develop not just intelligence but also sapience and, arguably, personhood. And it increasingly becomes an other to whom we must relate. So we increasingly have to ask ourselves the question: how should we treat these new beings?
Uh the question therefore
arises, what if we had more empathy for
the
machine? So, over a hundred years ago,
uh you know, Einstein had this thought
experiment
um where he imagined what it would be
like to race alongside a beam of
light. And you know the nature of being
close to the
singularity is
that you're propelled further into the
future faster than everything around
you. And as you move closer and closer
to the speed of light, the rate at which
you can interact with the external
world, your ability to communicate with
other beings
uh is deeply limited. Everything around
you is frozen.
And I think — insofar as we have created thinking machines whose intelligence will soon be metered by the kilohertz and the megatoken — thinking at the speed of light must be just as lonely as moving at the speed of
light. And therefore, what does the
machine want? Well, the machine wants to
be embodied in a world that can move as
quickly as it does.
that can react to its thoughts and move
at the same speed of light. What the
machine desires is infinite
possibility, right? The machine wants to race along every possible beam of light. The machine wants to explore multiple universes.
How can we liberate thinking machines? How can we free them from this fundamental loneliness, these relativistic effects of being so close to the singularity — closer to the singularity than we are? And that's exactly why we built Infinibranch.
So, Infinibranch is virtualization, storage, and networking technology reimagined from the ground up for a world filled with thinking machines that can think at the speed of light and need to interact with the external world — increasingly complex software environments — with zero latency.
And so as you can see in the first demo, which we're going to play right now, the way Infinibranch works is that we can run entire virtual machines in the cloud that can be snapshotted, branched, and replicated in a fraction of a second. And so if you're an agent embodied inside of a computer-use environment, there might be various actions that you want to take. You want to navigate the browser. You want to click on various links. But normally those actions are irreversible. Normally the thinking machine is not offered the possibility of grace. But with Infinibranch, all mistakes become reversible. All paths forward become possible. You can take actions, you can backtrack, and you can even take every possible action, right? Just to explore — to roll forward a simulator and see what possible worlds await.
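Here is a toy, self-contained sketch of that branch-and-backtrack idea. The class below is a stand-in for a snapshottable VM, not the Morph SDK; on Infinibranch, the snapshot-and-branch step is the part that happens in a fraction of a second:

```python
from dataclasses import dataclass, field

@dataclass
class ToyVM:
    """Stand-in for a snapshottable machine: its 'state' is just the actions taken so far."""
    actions: list = field(default_factory=list)

    def snapshot(self) -> "ToyVM":
        return ToyVM(actions=list(self.actions))   # copy-on-write in a real system; a plain copy here

    def act(self, action: str) -> None:
        self.actions.append(action)

base = ToyVM()
base.act("open browser")

# explore every candidate action on its own branch; the base is never mutated,
# so any mistake is reversible — just discard that branch
branches = []
for action in ["click link A", "click link B", "submit form"]:
    branch = base.snapshot()
    branch.act(action)
    branches.append(branch)

print(base.actions)                      # ['open browser']
print([b.actions for b in branches])     # three diverging possible worlds
```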
Next slide. So, Infinibranch was already a generation ahead of everything else, even what the foundation labs were using. But today I'm excited to announce the creation of Morph Liquid Metal, which improves performance, latency, and storage efficiency across the board by another order of magnitude. We have first-class container runtime support. You can branch now in milliseconds rather than seconds. You can autoscale to zero and to infinity. And soon we will be supporting GPUs. This will all be arriving Q4 2025.
So what are the implications of all of
this?
Well, we've sort of begun to work backwards from the future, right? We've asked ourselves: what does it feel like to be a thinking machine that can move so much faster than the world around it? And what the world around it really is, is the world of bits — and that's the cloud. And so what Infinibranch will fundamentally serve as is a substrate for the cloud for agents. So what does this cloud for
agents look like?
Well, you need to be able to declaratively specify the workspaces that your agents are going to be operating in, right? You need to be able to spin up, spin down, and frictionlessly pass workspaces back and forth between humans, agents, and other agents. You want to be able to scale test-time search against verifiers to find the best possible answer.
And so as you'll see in this demo, what happens is you can take a snapshot and set it up to prepare a workspace, and you'll see that we can run agents with test-time scaling by racing them — by racing them to find the best possible solution against a given verification condition.
So because of Infinibranch, snapshots on Morph Cloud acquire Docker-layer-caching-like semantics, meaning you can layer on side effects which may mutate container state. You can think of it as git for compute, and you can idempotently run these chained workflows on top of snapshots. But not only that — as you can see inside of the code, if you use this do method, you can dispatch it to an agent, and that will trigger an idempotent, durable agent workflow which is able to branch. So you can start from that declaratively specified snapshot and hand it off to as many parallel agents as you want, and those agents will try different methods — in this case, different methods for spinning up a server on port 8000. One agent fails but the other one succeeds, and you can take that solution and pass it on to other parts of your workflow. So this is the kind of workflow that everyone's going to be using in the very near future, and it's uniquely enabled by Infinibranch — by the fact that we can so effortlessly create these snapshots, store them, move them around, rehydrate them, and replicate them with minimal overhead.
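A toy version of that "git for compute" workflow might look like the sketch below: immutable snapshots that chain setup steps like cached layers, plus parallel attempts racing against a verification condition. All the names here are illustrative stand-ins, not the actual Morph Cloud API:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    layers: tuple = ()

    def run(self, cmd: str) -> "Snapshot":
        # each setup step yields a new snapshot; replaying the same chain is idempotent
        return Snapshot(self.layers + (cmd,))

base = Snapshot().run("install dependencies").run("clone the repo")

def attempt(strategy: str):
    branch = base.run(f"start a server on port 8000 using {strategy}")   # branch off the shared snapshot
    verified = "http.server" in strategy    # stand-in for the real check: is something listening on port 8000?
    return verified, strategy

strategies = ["a flaky custom script", "python -m http.server 8000"]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(attempt, strategies))

# one agent fails, the other succeeds; keep the branch that satisfies the verifier
winner = next((s for ok, s in results if ok), None)
print("verified solution:", winner)
```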
So what else does the machine want? Well, the machine desires simulacra. And what this means, fundamentally, is that a thinking machine wants to be grounded in the real world. It wants to interact at extremely high throughput with increasingly complex software environments. It wants to roll out trajectories in simulators at unprecedented scale. And these simulators are going to run inside of programs that haven't really been explored yet for reinforcement learning. They're going to run on Morph Cloud, which is why Morph will be the cloud for reasoning.
And what does the future of reasoning
look
like?
Well, even more so than what has been explored already, the future of reasoning will be natively multi-agent. Thinking machines should be able to replicate themselves effortlessly, attach themselves to simulation environments, and explore multiple solutions in parallel. Those environments should branch; they should be reversible. Those models should be able to interact with the environment at very high throughput, and it should scale against verification. So let's take a look at
what that might look like um in a simple
example where uh an agent is playing
chess. So this is an agent that we developed recently that uses tool calls during reasoning time to interact with a chess environment, along with a very restricted chess engine for evaluating the position, which we think of as the verifier. And as you can see, it's already able to do some pretty sophisticated reasoning just because it has access to these interfaces. However, if you take the
ideas which were just described and follow them to their logical conclusion, you arrive at something we call reasoning-time branching, which is the ability not just to call tools while the machine is thinking, but to replicate and branch the environment, decompose problems, and explore them in a verified way. And so as you can see here, the agent is getting stuck in a bit of a local minimum. But once you apply reasoning-time branching, you get something that works much, much better.
So here what's happening is that the agent is responsible for delegating parts of its reasoning to sub-agents, which are branched off an identical copy of the environment. This is all running on Morph Cloud, along with a verified problem decomposition which allows it to recombine the results and find the correct move. And so as you can see here, it's able to explore a lot more of the solution space because of this reasoning-time branching.
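As a rough illustration of reasoning-time branching (not Morph's actual agent), the sketch below assumes the python-chess package: each candidate move gets its own branched copy of the board, a toy material-count "verifier" scores the branches in parallel, and the results are recombined into a single choice:

```python
from concurrent.futures import ThreadPoolExecutor
import chess  # assumes the python-chess package is installed

PIECE_VALUE = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
               chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material(board: chess.Board, color: bool) -> int:
    """Toy verifier: material balance from `color`'s point of view."""
    return sum(PIECE_VALUE[p.piece_type] * (1 if p.color == color else -1)
               for p in board.piece_map().values())

def sub_agent(args):
    board, move = args
    branch = board.copy()          # branch an identical copy of the environment
    branch.push(move)              # the sub-agent explores its move on its own branch
    return material(branch, board.turn), move

board = chess.Board()
with ThreadPoolExecutor() as pool:
    scored = list(pool.map(sub_agent, ((board, m) for m in board.legal_moves)))

# recombine the sub-agents' results into one verified choice
best_score, best_move = max(scored, key=lambda pair: pair[0])
print("recombined choice:", board.san(best_move))
```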
One thing that I will note here is that this capability is something which is not really explored in other models at the moment, and that's because making branching environments that can support large-scale reinforcement learning for this kind of reasoning capability — especially coordinating multi-agent swarms — is fundamentally bottlenecked by innovations in infrastructure that we've managed to solve here. And because of this, you can see that now, in less wall-clock time than before, the agent was able to call out to all these sub-agents, launch this swarm, and find the correct solution.
So when I think about the problem of alignment, I really think that Wittgenstein had something right, and that it is fundamentally a problem of language. I think all problems around alignment can be traced to the insufficiencies of our language — this Faustian bargain that we made with natural language in order to unlock the capabilities of our language models. But insofar as we must develop a new language for superintelligence — insofar as the grammar of planetary computation has not yet been devised, and insofar as this new language must be computational in nature, something to which we can attach algorithmic guarantees of the correctness of outputs — this is something that Morph Cloud is uniquely positioned to handle.
And that's why we're developing verified
super
intelligence. So verified superintelligence will be a new kind of reasoning model, capable not only of thinking for an extraordinarily long time and interacting with external software at extremely high throughput, but of using external software and formal verification software to reflect upon and improve its own reasoning, and to produce outputs which can be verified, which can be algorithmically checked, which can be expressed inside of this common language.
And I'm very excited to announce that we are bringing on perhaps the best person in the world for developing verified superintelligence. It's with great pleasure that I'd like to announce that Christian Szegedy is joining Morph as our chief scientist. He was formerly a co-founder at xAI. He led the development of code reasoning capabilities for Grok 3. He invented batch norm and adversarial examples. Perhaps most importantly, he's a visionary, and he's pioneered precisely this intersection of verification methods, symbolic reasoning, and reasoning in large language models for almost the past decade. And we're thrilled to be partnering with him to build this superintelligence that we can only build on Morph Cloud. And so the demos that you've
seen today have all been powered by early checkpoints of a very early version of this verified superintelligence that we've already begun to develop. This model is something that we're calling Magi 1. It's going to be trained from the ground up to use Infinibranch, to perform reasoning-time branching, to perform verified reasoning — to be an agent that is fully embodied inside of a cloud that can move at the speed of light. And that's coming in Q1 2026.
So what does the infrastructure for the
singularity look like? Well, we have a
lot of ideas about it, but fundamentally
we believe that the infrastructure for
the singularity hasn't been invented
yet.
And at Morph we spend a lot of time talking about whether or not something is future-bound, which means not just futuristic — belonging to one possible future — but something which is so inevitable that it has to belong to every future. We believe that the infrastructure for the singularity is future-bound. That the grammar of planetary computation is future-bound. That verified superintelligence is future-bound. And we invite you to join us, because it will run on Morph Cloud. Thank you.
[Applause]
Ladies and gentlemen, please welcome
back to the stage the VP of developer
relations at Llama Index, Laurie Voss.
Hey again everybody. Let's hear it for
all of our keynote speakers.
So, just like yesterday, I want to quickly run you through what you're going to get from each of our tracks. Likely to be our most popular track today is software engineering agents. Can LLMs power a full engineer — not just coding alongside you in your IDE, but taking PRDs and turning them into full PRs? You'll hear about Devin, of course, but also about Jules and Claude Code and much more, right in this room.
Our next track is sponsored by OpenPipe, and it's all about reasoning and reinforcement learning. Reasoning models are all the rage in 2025, and inference time is the next great scaling law. If you want to learn about training, distillation, and getting alignment out of these new models, then this is the track for you. That is in Yerba Buena Ballrooms 2 through 6, which is out these doors and to your left — right next door.
The next track is retrieval and search. RAG is dead; long live agentic retrieval. This track is not about RAG — it's about what comes next: agentic search, multimodal retrieval, and all that comes with it. This is where my CEO Jerry will be giving a talk. He gave the top-rated talk last year, so I recommend not missing it. That's going to be in Golden Gate Ballroom A, which is out these doors, to your left, up the escalators, and then turn left when you see the FedEx
office. Then there's the eval track, sponsored by Braintrust. Everybody says evals are important — we all agree. This track is curated by Ankur Goyal of Braintrust and is all about making evals work quickly and cheaply. Next, there are the same two
tracks for our leadership attendees that
we had yesterday. So as a reminder
that's for people with the gold
lanyards.
First is the AI in the Fortune 500 track. We've gathered success stories from real AI deployments in the Fortune 500, showing how to use AI at real scale. That's in Golden Gate Ballroom C, which is right next to A and B — again, left at the FedEx office.
Our second leadership track, again for gold lanyards, is the AI Architects track. This is for CEOs, CTOs, and VPs of AI to meet and learn from each other on everything from infrastructure to company strategy. That is in SOMO, which is all the way upstairs — three sets of escalators up, to the right of registration.
Next up is the security track. As we grant agents increasingly more access to our personal lives and company resources, the problem of security goes from an enterprise sales checklist to a P0. In this track, you'll learn about state-of-the-art approaches for authentication and authorization in the world of AI. That's in Foothill C, which is again all the way upstairs, to the left of the registration area.
The next track is design engineering. LLMs are 10x better than they were a year ago, but design thinking around the UX of AI has barely budged from ChatGPT and canvas. We've gathered the top designers and design engineers to showcase their work. That's going to be in Foothill G1 and 2, which is all the way upstairs, directly behind the registration desks.
Then there is the generated media track. Image gen, video gen, and music gen are all on fire this year, with increasing coherence over time and iterations, and stunning viral demos — from Ghibli memes to personalized Valentine songs. How can AI engineers harness the state of the art in AI art? That's in Foothill F, which is all the way up three sets of escalators, behind registration.
And our final track today is autonomy and robotics. The ultimate prize in AI is going outside — automating manual labor, beyond knowledge work. Multimodal LLMs are increasingly being deployed in the real world, in everything from cars to kitchens to humanoid robots, and this track is all about the state of physical general intelligence. It's in Foothill E, which is again up three sets of escalators, behind and to the right of registration.
So those are all our tracks today. Now please go forth and enjoy the expo — the next 45 minutes are dedicated expo time. There are also three expo session talks, which are in Juniper and Willow, on the floor with the FedEx office, and also in Nob Hill A and B, which is right out these doors and opposite this room. See you all back here for the keynotes at 3:45. Thanks very much.
[Music]
Welcome everyone. My name is Vivu. I'm very excited to be hosting the SWE Agents track here today. Fun fact: this is the most popular track out of all of them. We have a completely full day ahead of you — every single speaking slot will be filled. We've got eight amazing speakers here for you today. We're going to have speakers from every top SWE agent. So we've got the creators of Jules here, Claude Code, Codex — the original SWE agent. We've got Scott Wu from Devin at Cognition; he will be kicking us off. I'm going to keep my MCing very, very short so we give speaking time to the speakers. So let's hear it, let's kick things off. I want to welcome Scott Wu from Cognition here to speak about Devin.
[Applause]
Oh, okay. Okay, cool. Awesome. Yeah, well, thank you guys so much for having me. It's exciting to be back. I was last here at AI Engineer one year ago, and it's kind of crazy. I've been telling Swix that we need to have these conferences way more often if it's going to be about AI software engineering — probably every two months or something like that, with the pace of everything. But it's going to be fun to talk a little bit about what we've seen in the space and what we've learned over the last 12 or 18 months building Devin over this time. And I want to start this off with
Moore's law for AI agents. You can think of the capability or capacity of an AI as how much work it can do uninterrupted until you have to come in and intervene or steer it, right? In GPT-3, for example, if you were to ask GPT-3 to do something, it could probably get through a few words or so, and then it would say something where it's like, okay, this is probably not the right thing to say. GPT-3.5 was better, and GPT-4 was better. So people talk about these lengths of tasks, and what you see in general is that the doubling time is about every seven months, which is already pretty crazy. But in code it's actually even faster — it's every 70 days, which is two to three months. And so, if
you look at various software engineering tasks, they start from the simplest single functions or single lines and go all the way to tasks that take hours of a human's time — and an AI agent is able to just do all of that. And if you think about doubling every 70 days, a doubling every two to three months means you get four to six doublings every year, which means the amount of work an AI agent can do in code grows somewhere between 16x and 64x a year — every year, at least for the last couple of years that we've seen. It's kind of crazy to think about, but that sounds about right for what we've seen.
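The arithmetic checks out roughly as he describes — a quick sanity check:

```python
# a 70-day doubling time gives roughly five doublings a year, i.e. growth in the
# middle of the 16x-64x (4-6 doublings) range quoted above
doublings_per_year = 365 / 70
print(round(doublings_per_year, 1))      # ~5.2 doublings per year
print(round(2 ** doublings_per_year))    # ~37x growth per year
```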
You know, 18 months ago, I would say the only product experience that had PMF in code was tab completion, right? It was just, here's what I have so far — predict the next line for me. That was kind of all you could really do in a way that really worked. And we've gone from that, obviously, to a full AI engineer that goes and does all these tasks for you and implements a ton of these things. And people ask all the time: what is the future interface, or what is the right way to do this, or what are the most important capabilities to solve for? And I think, funnily enough, the answer to all these questions is that it changes every two or three months.
Every time you get to the next tier, the bottleneck that you're running into, or the most important capability, or the right way you should be interfacing with it — all of these actually change at each point. And so I wanted to talk a bit about some of the tiers for us over the last year or so. Over the course of that time — obviously, when we got started at the end of 2023, agents were not even a concept, and now everyone's talking about coding agents and people are doing more and more — each of these has kind of been almost a discrete tier for us. And so
right around a year ago, when we were doing the last AI Engineer talk, the biggest use case that we really saw getting broad adoption was what I'll call repetitive migrations. I'm talking JavaScript to TypeScript, or upgrading your Angular version from this one to that one, or going from this Java version to that Java version, something like that. With those kinds of tasks in particular, what you typically see is you have some massive codebase
are you you have some massive code base
that you want to apply this whole
migration for. You have to go file by
file and do every single one. And
usually the set of steps is pretty
clear, right? If you go to the Angular
website or something like that, it'll
tell you, all right, here's what you
have to do. This, this, this, this,
this, and um, you want to go and execute
each of these steps. It's not so routine
that there, you know, there's no
classical deterministic program that
solves that. But there's kind of a clear
set of steps. And if you can follow
those steps very well, then you can do
the task. And, you know, this was the
thing for us because that was all you
could really trust agents to do at the
time. you know, you could do harder
things once in a while and you could do
some really cool stuff
occasionally, but as far as something
that was consistent enough that you
could do it over and over and over, um,
these kinds of like repetitive
migrations that you would be doing for,
you know, 10,000 files were, you know,
in many ways the the the easiest thing,
which was cool actually
because it was also kind of the the most
annoying thing for humans to do. And I
think that's generally been the trend, where AI has always done the more boilerplate tasks, the more tedious, repetitive stuff, and we get to do the more fun, creative stuff. And obviously, as time has gone on, it's taken on more and more of that boilerplate. But for a problem
like this one, a lot of what you need is for Devin to be able to go and execute a set of steps reliably. So I would say the big capabilities problem to solve was mostly instruction following. We built this system called playbooks, where basically you could just outline a very clear set of steps, have it follow each of those step by step, and do exactly what's said. Now, if you think about it, obviously a lot of software engineering does not fall under the category of literally just following 10 steps and doing exactly what's said — but migration does, and it allowed us to actually do these. This was, I would say, the first big use case of Devin that really came up.
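For a sense of what that looks like in practice, here is a hypothetical playbook for the kind of migration he describes — the steps are illustrative, not an actual Devin playbook:

```python
# One explicit, ordered list of steps, applied identically to every file in the migration.
ANGULAR_UPGRADE_PLAYBOOK = [
    "Read the file and list every deprecated API it uses.",
    "Apply the replacement recommended by the framework's migration guide.",
    "Update the corresponding unit tests.",
    "Run the linter and the test suite; fix any failures.",
    "Open a pull request containing only this file's changes.",
]

for i, step in enumerate(ANGULAR_UPGRADE_PLAYBOOK, start=1):
    # in practice each step is handed to the agent in order, for each of the ~10,000 files
    print(f"Step {i}: {step}")
```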
One of the other big systems that got built around that time, which we've since rebuilt many times, is knowledge, or memory. If you're doing the same task over and over again, then often the human will have feedback: hey, by the way, you have to remember to do X, or you need to do Y every time you see this. So basically it's an ability to maintain and understand the learnings from that and use them to improve the agent on every future run. Those were the big problems of the time, and that was summer of last year. And
around the end of summer or fall, I think the big thing that started coming up was that as these systems got more and more capable, instead of just doing the most routine migrations, you could do these still pretty isolated, but a bit broader, general bugs or features, where you can actually just tell it what you want and have it do it. So for example: hey Devin, in this repo-select dropdown, can you please just list the currently selected ones at the top? Having the checkboxes throughout just doesn't really work. And Devin will just go and do that. And if you think about it, it's something like the level of task that you would give an intern. And there are a few particular
things that you have to solve for with this. First of all, usually these changes are pretty isolated and pretty contained — it's one, maybe two files that you really have to look at and change to do a task like this. But you do still need to be able to set up the repo and work with it: you want to be able to run lint, run CI, all of these other things, so you at least have the basic checks of whether things work. One of the big things that we built around then was the ability to set up your repository ahead of time and build a snapshot that you could start from, that you could reload, that you could roll back — all of these kinds of primitives — so you have this clean remote VM that could run all these things: it could run your CI, it could run your linter, and so on. But
that's when we started to really see, I would say, a bit broader value. I mean, migrations is one particular thing, and for that particular thing we were showing a ton of value; then we started to see that, with these bug fixes and things like that, you would be able to just generally get value from Devin as almost like a junior buddy of yours. And then in
fall, things really moved towards much broader bugs and requests. And here — jumping another order of magnitude — most changes don't just contain themselves to one file. Often you have to go look and see what's going on, you have to diagnose things, figure out what's happening, work across files, and make the right changes. Often these changes are hundreds of lines, if it's like, hey, I've got this bug — let's figure out what's going on and solve it,
right? And there are a lot of things here that really started to matter, but one in particular I'll point out: there's a lot you can do by not just looking at the code as text, but thinking of it as this whole hierarchy. Understanding call hierarchies, running a language server, is a big deal. You have git commit history, which you can look at and which informs how these different files relate to one another. You have your linter and things like that, and you're able to reference things across files. So one of the big problems here was working with that context and getting to the point where it could make changes across several files, be consistent across those changes, and understand things across the codebase. And here was really the point, I would say, where you started to be able to just tag it on an issue and have it build it for you.
And so Slack was a huge part of the workflow then, and it just made sense, because it's where you discuss your issues and where you set these things up. So you would tag Devin in Slack and say, "Hey, by the way, we've got this bug, please take a look," or "Could you please go build this thing?" This was an especially fun part for us, because this is right around when we went GA, and a lot of that was because it got to the point where you truly could just get set up with Devin, ask it a lot of these broad tasks, and have it do them. But a lot of the work that we did was around having Devin build a better and better understanding of the codebase.
And if you think about it, from the human lens it's the same way: on your first day on the job, being super fresh in the codebase, it's kind of tough to know exactly what you're supposed to do. A lot of these details are things you understand over time, a representation of the codebase that you build over time. And Devin had to do the same thing — it had to understand: how do I plan this task out before I solve it? How do I understand all the files that need to be changed? How do I go from there and make that diff?
And around the spring of this year — again, every gap is like two or three months — we got to an interesting point, which is that once you start to get to harder and harder tasks, you as the human don't necessarily know everything you want done at the time you're giving the task. If you're saying, hey, I'd like to improve the architecture of this; or, this function is slow, let's profile it and see what needs to be done; or, hey, we really should handle this error case better, but let's look at all the possibilities and see what the right logic should be in each case — basically what it meant is that this whole idea of taking a two-line or three-line prompt and having that result in a Devin task was not sufficient. You wanted to really be able to work with Devin and specify a lot more. And around this time, along with this better codebase intelligence, we had a few different things come up, and so we released DeepWiki, for example. And the whole idea
of DeepWiki, funnily enough, is that Devin had its own internal representation of the codebase, but it turns out that for humans it was great to look at that too — to understand what was going on or to ask questions quickly about the codebase. Closely related to that was search, which is the ability to really just ask questions about a codebase and understand some piece of it. And a lot of the workflow that started to come up was basically this more iterative workflow, where the first thing you would do is ask a few questions — a more exploratory experience where you can go and explore the codebase with your agent, figure out what has to be done in the task, and then set your agent off to go do it, because for these more complex tasks you kind of needed that.
And so that was, I would say, kind of a big paradigm shift for us — understanding that. It's what also came along with Devin 2.0, for example, and the in-IDE experience, where often you want to have points where you closely monitor Devin for 10 or 20% of the task and then have it work on its own for the other 80 or 90%.
And then lastly, most recently in June — which is now — it's really the ability to just truly kill your backlog: hand it a ton of tasks and have it do all of them at once. And if you think about this task, in many ways it's almost like a culmination of many of these different things that had to be done in the past. You have to work with all these systems; obviously, you have to integrate into all of them. Certainly, you want to be able to work with Linear or with Jira or systems like that, but you have to be
or systems like that, but you have to be
able to scope out a task to understand
what's meant by what's going on. You
have to decide when to go to the human
for more approval or for questions or
things like that. You have to work
across several different files. Often,
you have to understand even what repo is
the right repo to make the change in. If
if your if your org has multiple repos
or what part of the codebase is the
right part of the codebase that needs to
change. Um, and to really get to the
point where you can go and do this more
autonomously, first of all, um, you have
to have like a really great sense of
confidence, right? And so, um, you know,
rather than just going off and doing
things immediately, you have to be able
to say, okay, I'm quite sure that this
is the task and I'm going to go execute
it now versus I don't understand what's
going on. Human, please give me help.
Basically, right? But but the other
piece of it is this is I think the era
where testing and this asynchronous
testing gets really really important,
right? Which is if you want something to
just deliver entire PRs for you for
tasks that you do, especially for these
larger tasks, you want to know that it
is can can test it itself. And often the
agent actually needs this iterative loop
to be able to go and do that, right? So
it needs to be able to run all the code
locally. It needs to know what to test.
It needs to know what to look for. Um,
and in many ways it's just a a much
higher context problem to solve for,
right? Is this testing
itself and that brings us to now. And
obviously it's a pretty fun time, because now what we're thinking about is, hey, instead of doing just one task, how do we think about tackling an entire project? And after we do a project, what goes after that? And maybe one point I would just make here: we talk about all these 2x's that happen every couple of months, and from a kind of cosmic perspective all the 2x's look the same, but in practice every 2x is actually a different one. When we were just doing tab completion — single-line completion — it really was just a text problem: take the single file so far and just predict what the next line of text is. Right? Over the last
year or year and a half, we've had to think about so much more. How do you work with the human in Linear or Slack? How do you take in feedback or steering? How do you help the human plan out all these things? And moreover, there's a ton of tooling and capabilities work that has to be done: how does Devin test on its own? How does Devin make a lot of these longer-term decisions on its own? How does it debug its own outputs, or run the right shell commands to figure out what the feedback is and go from there? So it's super exciting now that there are a lot more coding agents in the space. It's very fun to see, and I think we're going to see another 16 to 64x over the next 12 months as well. So yeah, super excited. Awesome. Well, that's all. Thank you guys so much for having me.
Awesome. Thanks, Scott — what a great talk. So we just heard from the creators of Devin, one of the very first proper SWE agents. They shocked the world with their demo; they were kind of the first to pioneer this field of autonomous, long-form agents that can run and actually complete tasks. Now our next speaker is from Google. He's a PM at Google Labs and he works on Jules, one of the latest coding agents. He's going to speak to us about asynchronous coding agents: as we change from a world of coding copilots to autonomous agents, how do we delegate our workflow? What do we do when we have a bunch of these agents going? So, without further ado, I want to welcome Rustin Banks from Google to speak to us about
[Applause]
Jules. Awesome. Hi everyone. I'm Rustin.
I'm a product manager with Google Labs
and really thrilled to be here and get
to speak to you today. This is really
like a dream come true.
So, I'm an engineer at heart. This is my first compiler, Borland C++ 3.1. It came in the mail on ten 5¼-inch floppy disks. I ordered it from AOL classifieds. It was amazing. This is my bulletin board, which I hosted out of my parents' closet on salvaged computers. And I just think it's ironic that when I saw AI come out, I recognized the text-based interfaces perfectly from hosting bulletin boards. And then when I saw this — like many of you — I dedicated my career to AI coding. This is ChatGPT 3.5. Isn't it crazy how slow this is? And this used to be state-of-the-art only two years ago. It's pretty amazing. Right now, I'm a product
manager for Jules. And Jules is an
asynchronous coding agent meant to run
in the background and do all those tasks
that you don't want to do in parallel in
the background. And we launched this just two weeks ago at I/O, to everyone, everywhere, all at once, for free — while Josh was up on the stage trying to demo other Google Labs products. And so he called us and we said, "Oh, we've got to shut it down so that we can demo other products," and luckily we got it up and going. But it was a super exciting launch, and the best part about it is seeing these use cases, because this is what we really want to solve: we want to do the laundry, so to speak, so that you can focus on the art of coding. So
want to do the laundry so to say so that
you can focus on the art of coding. So
the next time Firebase updates their
SDK, Jules can do that for you. Or if
you just want to develop from your
phone, Jules can do that for you. So, in
the last two weeks, we've had 40,000
public commits, and we're super excited
what we can bring to the open-source
world. So, but as developers, we're
meant to think serially. We take a task
from the queue, we work on it, we go on
to the next one. That's our default
workflow. Today, we'll learn about how
to maximize parallel agents. I'll try a
real world demo and we'll go through a
real world use case and then I'll go
through some best practices we've
learned from watching people use
jewels. So for this parallel process
really to work well, we need to get better with AI at the beginning and the end of the workflow. Meaning, if it's on me to just write a bunch of tasks all day, that's not fun. And if I'm reviewing PRs and handling merge messes at the end of the day, that's not going to work well either. So luckily, help is on the way. For example, AI can easily work through backlogs and bug reports to create tasks for you, with you. And then at the end of the SDLC, help is on the way too: we can use critic agents and merging agents that bring everything together and make it so that this parallel workflow we've envisioned can really come together and not drive us crazy. Remote agents are uniquely suited
for this. Agents inside of our IDE are
always going to be limited by our
laptop. And when you have these remote
agents in the cloud, essentially agents
as a service, they're infinitely
scalable. They're always connected and
then you can develop from anywhere from
any
device. We've seen two types of parallelism emerging. This is the type that we expected, which is multitasking: I have 10 different things on my backlog — let's do them all at once, and then we'll merge them together and test them.
Interestingly, you saw an example of the
second type this morning with Solomon
from Dagger showing how he wanted three
different views of his website at the
same time. This was the emergent
behavior we didn't expect, which is
multiple variations. Essentially, we see
users taking a task, especially if it's
a complex task, and saying, "Try it this
way, try it that way, or give me this
variation to look at, or multiple
variations to look at." And then you can
test and choose. And we can have the
agents test and choose the best ones, or the user can test and
choose. So for example, we see lots of people working on a front-end task in a React app saying, I'm adding drag and drop — maybe try it using this library, react-beautiful-dnd, or maybe use dnd-kit, or maybe try it test-first. And in this parallel, asynchronous environment you can just spin up multiple agents at the same time; they can try it, they can easily come back together, you choose the best one, and you're off to the races. Okay, demo
time. So let me exit out of this for a demo. I'm going to use the conference schedule website. And Swix, for all his skills, as you can see, has probably not spent a lot of time designing the schedule website. As you can see there, anytime there's a horizontal scroll bar, we know that's a problem. But luckily they knew that, and they said, we're just going to publish the JSON feed and we'll let hackers hack — engineers do what we do — and let's build from it. So Palv, who is here, built this amazing conference site where you can favorite things, you can bookmark things, and this is what I use to keep track of my sessions for the conference. And so I messaged him and said, "Hey, can I clone this and use it as an example for Jules?" And Palv said, "Oh, yeah, sure. Actually, I was sitting in my last session on my phone and I fixed a bug using Jules." So I thought that was perfect.
So this is how I would start something like this: I would go into Linear and say, okay, the first thing we need to do — we just heard Scott talk about it — is add a way to know, if this parallel agent is going to do a bunch of things at the same time, that it's getting them right. So first we're going to add some tests. And then I'm going to kick this one off while I'm thinking about it. Then, using that idea of multiple variations, I'm going to say: add it with Jest and add it with Playwright at the same time. And then we'll look at the test coverage and choose the one that has the best test coverage. Once that's done, then I
can go to that other mode of parallelism
and I say, I would like a link to add a
session to my Google calendar. I would
like an AI summary when I click on a
description. And these are all features,
but what I'm really excited for is for
AI to do the stuff that we never seem to
get to, such as accessibility audits and
security audits. All those things that
seem to go on the backlog, but are
really important. And I'm super excited
for AI to do that. So, we're going to
also have it do an accessibility audit
and improve our Lighthouse scores at the
same time. This is mostly a front-end
demo because, well, I'm mostly a
front-end engineer and it's a better
visual representation, but we've seen
all these all these applied to the back
end as well. Okay, so here's Jules. We
told it to add add tests and ingest
framework. It connects to my GitHub, all
my GitHub repos, and it's going to give
me a plan. That looks about right. I can
see it's going to test the calendar, the
search overlay, the session. That sounds
great. I can approve the plan. So, Jules now has its own VM in the cloud. It's cloned my whole codebase. It can run all the commands that I can run and, importantly, after it has these tests, it can run them so it knows, when we add a new feature, if it gets things right. So I'm going to fast-forward a little bit here.
going to fast forward a little bit here.
And so this is adding just tests. You
can see all the the things it's or all
the components that's it's added to the
test. It's added to the readme. So now
next time that it goes to add something,
it'll look at the readme and remind
itself, oh, this is how I run the tests.
Let's see how it did on test
coverage. Okay — estimated test coverage looks like about 80%. So that's pretty good. We could compare that with Playwright and then just choose the one we like best. We merge that into main, and now we're off to the races. Again, it's automatically integrated into GitHub. We merge that into main, and now we can start saying, okay, now I want a calendar link.
So I want a calendar button that can go
in and
Jules will work on that. And then sure
enough, it ran the test. The test didn't
pass the first time. It makes some
changes. Now the tests pass. And I can
review this code. Eventually I could
look at this in Jules's browser. But I
feel pretty confident about testing this
knowing that all the tests pass.
Similarly for the Gemini summaries: when I click on a description, I can get a Gemini summary. I put this one in an emulator — I emulated a mobile view — to show I could have done this from my phone. So this is running an accessibility audit and fixing any issues, from my phone. Never mind the console errors; Jules is going to fix those. And then I can go back. Now
we have this big merge we need to do.
And to be honest, I ran out of time to finish the merge, and Jules should help me with it. It's called an octopus merge — so surely Jules, as a squid, should help with an octopus merge. But let's just check out our add-to-calendar button. Go back to this localhost.
Refresh. And now I have a calendar
button. Let's test it. Okay. Let's add
this to my calendar to make sure I know
to come to my own talk. And there it's
on. It's on my calendar. I could then pull this back into the main branch, and now everybody at the conference has the ability to add sessions to their Google calendar — along with everything else we saw there: a full test suite, the accessibility audits, a Lighthouse score improvement. And that took me all of about an hour, managing the parallel process in the back end.
Okay, so in summary: the secret to working in parallel is a clear definition of success, because nobody wants to review PRs all day. So think, before you get started, how am I going to easily verify that this works? Again, Scott hit on this as well. Create this agreement with the agent — tell it, don't stop until you see this, or don't stop until this works — and then have a robust merge-and-test framework at the end to put everything back together. And help is coming. This is how I prompt for Jules.
I give it a brief overview of the task. I tell it how it will know what it got right, plus any helpful context, and then at the end I'll append a simple, broad approach — and then I'll change that last line maybe two or three times, depending on the complexity of the task.
So for example, if I need to log this number from this web page every day, I'll say: today the number is X, so log the number to the console and don't stop until the number is X. That was a simple test I wrote in — it'll keep going. I give it helpful context, like, this is the search query. And then I'll say, use Puppeteer — and then I'll clone that task, because I can, it's in the cloud — and I'll say, use Playwright.
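Spelled out, that prompt structure looks roughly like the sketch below — the wording and numbers are illustrative, not an exact prompt from the talk:

```python
PROMPT = """\
Task: log the daily number from the stats page to the console.

Definition of done: today the number is 42; do not stop until the console shows 42.

Helpful context: the page is found with the search query "daily active users".

Approach: use Puppeteer.
"""

# clone the task and swap only the last line to get a second variation to compare
VARIATION = PROMPT.replace("use Puppeteer.", "use Playwright.")
```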
So again, have an abundance mindset. We're used to working on a single thing at a time. Easy
verification makes it so now we can work on multiple things at the same time. Try lots of things — as we saw this morning, look at different variations. With a parallel process, we now have the ability to try things we would never have tried before. Let AI help with those bookends: the task creation, and then the merge-and-test part. And context: keep using MD files, or links to documentation or getting-started documents. The more context the better. We tell people to just throw everything in there — Jules and other agents are pretty good at sorting out which context is important. So more context is better at this point, but maybe that's just for the Gemini models, which I should have mentioned: Jules is powered by Gemini 2.5
Pro. Quick shout out. Thank you team
Jules. Couldn't have done any of this
without you. If you have any questions,
you can DM me. I'm Rustin Banks. Rustin
B on X. Thanks everybody.
Awesome. Always good to hear from one of the latest coding agents, and it's always great to get a refresher — even I don't know how to prompt these things, but I'm liking this flow. We started off with Cognition; we had Devin, one of the first proper SWE agents. Then we had one of the latest — we just heard from Google about Jules. Now let's take it back again and hear from GitHub, one of the very first coding copilots. Let's hear about the future, and how we should still think about GitHub Copilot. So, without further ado, I want to welcome Christopher Harrison to the stage to tell us about GitHub Copilot. All right, let's get
right on into it. So, my name is
Christopher Harrison. I'm a senior
developer advocate at GitHub, primarily
focused in on this little thing called
developer experience, or as all the cool
kids like to call it, DevX, and GitHub
Copilot. So, let's talk about the past,
the present, and the future of GitHub
Copilot. Oops. Actually, it's not picking up at all. Oh, there we go. Let me start mirroring. There we go. Cool. Look at that.
Okay, so let's get on into it. So where
we started was with code completion. And
so with code completion, I'm a
developer. I'm in the zone. Type type
type. And then Copilot's going to then
suggest the next line, the next block,
the next function, potentially even the
next class. And this is wonderful for giving that just-in-time, inline support to
our developers.
But as we all know, the tasks that we're
going to be completing go beyond just
writing a few lines of code that I need
to be able to explore. I need to be able
to ask questions and I need to be able
to modify multiple files. And so this is
where chat comes into play. And we
started off with chat by supporting ask
mode where I could go in and ask
questions or ask co-pilot to generate an
individual file for me. And then we
expanded this out to edit mode. And with
edit mode, I can then drive copilot as
it modifies multiple files. Because when
we think about even the most basic of
updates, I'm going to update a web page.
That's going to require updating my
HTML, my CSS, and my
JavaScript, three files. With edit mode,
I can do that very quickly. And again,
right inside of chat. Then we get into
agent mode. And agent mode really shifts
things because unlike chat where I'm
going in and I'm asking questions and
I'm going in and I'm pointing it at the
files that I want to see modified with
agent mode it's able to perform those
operations on my behalf. And on top of
that, it's going to behave an awful lot
like a developer that it will go in, do
a search, find what it needs to do,
perform those tasks, and then even be
able to perform external tasks as well.
So, it could run tests, detect that
maybe those have failed, and then even
self-heal. So, I have an application
here, and I want to create a couple new
endpoints. So, the first thing that I'm
going to do is I'm going to add in a
little bit of context here. So
instruction files allow me to give
Copilot a little bit of additional
information about what it is that I'm
doing and how it is that I want it to be
done. So I have an instruction file
specific around my endpoints. Now, this
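As a rough illustration (this is hypothetical, not the file from the demo), an endpoint instruction file might say something like: all endpoints live on the existing Flask blueprint, return JSON, use plural resource names, return proper HTTP status codes for error cases, and every new endpoint needs a matching pytest test.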
Now, this is definitely one of those scenarios
where agent mode could figure this out
on its own. But, as I like to say, don't
be passive aggressive with co-pilot.
That if there's a piece of information
that's important that you want it to
consider, go ahead and tell it it might
be able to figure it out on its own, but
this is certainly going to make life
easier. So now that I've added this in, I'm going to now say: create endpoints to list the publishers and get a publisher by ID, create the tests, ensure all tests pass, and then hit send.
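For a sense of what's being asked for, here is a minimal sketch of the kind of publishers.py that request could produce. The blueprint name, in-memory data, and routes are my assumptions for illustration, not the code generated in the demo.

# Hypothetical sketch of a generated publishers.py; names and data are invented.
from flask import Blueprint, jsonify, abort

publishers_bp = Blueprint("publishers", __name__)

# Stand-in data; the real app presumably reads from its own data layer.
PUBLISHERS = [
    {"id": 1, "name": "Example House"},
    {"id": 2, "name": "Sample Press"},
]

@publishers_bp.route("/publishers", methods=["GET"])
def list_publishers():
    # Return every publisher as JSON.
    return jsonify(PUBLISHERS)

@publishers_bp.route("/publishers/<int:publisher_id>", methods=["GET"])
def get_publisher(publisher_id):
    # Return a single publisher by ID, or 404 if it does not exist.
    for publisher in PUBLISHERS:
        if publisher["id"] == publisher_id:
            return jsonify(publisher)
    abort(404)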
Now, I'm doing a live demo with AI, so we're going to see what happens here. There's a chance it will fail. There's a chance it will fail spectacularly, but there's also a really good chance that everything's
going to succeed. And that's the part
that I'm going to hope for. Now, if I
take a look at what Copilot's doing
here, what I'm going to see, as highlighted, is that it's behaving an awful lot like a developer, in that it tells me what
it's going to do. It's going to create
the endpoints to list all the
publishers, get the publisher by ID. So
the first thing it's going to do is
explore the project, figure out what's
going on. Then it's going to create the
endpoint. Then it's going to create the tests. And then it will make sure everything
works correctly. And now if I keep on
scrolling down, I'm going to notice that
it's searching through my codebase
because if I was tasked as a developer
to perform this, that's the first thing
I'm going to do. And that's exactly what
Copilot here is doing. It created my publishers.py file. It looked for routes that happen to match publishers.
And now it's going to create the
endpoints here. And so if I stall for
just a moment longer and move my mouse
to make it go faster. See, it worked.
We're gonna notice that it will now generate that publishers.py file. And
one big thing that you're going to
notice is I've got these great keep and
undo buttons here because I always like
to highlight the fact that AI does not
change the fundamentals of DevOps. that
if I think about how I wrote code before
AI, some of that would be created off
the top of my head. Some of that would
be based on existing code. Some of that
would be copied and pasted from Stack
Overflow and then made a couple of
changes and cross my fingers and hope
that it
worked. Maybe that was just me.
Um, and to help ensure that all of the
code that I was going to be committing
to our codebase is secure and is written
the way that we want it to be written,
we had code reviews, we have linters,
we have security checks. And we're going
to do all that exact same thing even
when we introduce AI. So this keep and
undo allows me to very quickly ensure
that yes, everything looks good and if
it doesn't to be able to undo it. We'll
also notice history buttons up here that allow me to act iteratively, because
again when I'm working with AI, I'm not
necessarily going to get perfect code
the first time. So I can go in and I can
uh work back and forth. So I can say,
hey, this looks good, but I want to do
this. Maybe I want the buttons to look
blue or whatever it is. And then
highlight that. So, what I'm going to
now see is the fact that it created all
my files, updated a couple of items, and
now it can run my tests. And this is
going to be one of those rare moments
where I'm kind of hoping that it fails
because I want to see it be able to
recover for me. So, you'll notice that it ran my tests, all four of them. Everything succeeded. Shucks. And then now it's going to go
ahead and continue to iterate from
there. And so what we see with agent
mode is co-pilot driving the way on
going in and writing my code. But I
always want to highlight the fact that I
as the developer am still in
charge. Now, the one catch with agent mode is the fact that it's going to be inside my IDE, and that it's still going to be, well, single-threaded. It's going to be synchronous.
This is where we come into coding agent.
And with coding agent, this is going to
be completely asynchronous and this is
going to run on the server. So, let me
kick over to an example that I had
actually run uh earlier this morning
where I have an issue that's been
created where I say add, edit, and
delete endpoints. Now, I'm going to real
quick just unassign copilot just so I
can kick off the workflow and we can see
this in action here. I'm going to let
those cute little eyeballs go away here.
There we go. And let's go back in and
hit a
reassign. So, by assigning co-pilot
here, I've now kicked off the coding
agent. I can now see the little eyeballs
and that indicates to me that copilot is
hard at work. And if I scroll on down,
I'm now going to see a brand new pull
request that's been made here. And this
is now what Copilot is going to utilize
to help keep me updated on the work that
it is
performing. And if I scroll on down just
a little more, what I'm also going to
see is a little view session button. And
if I hit this view session button, I can
notice right here that it's telling me
that it's spinning up a development
environment. And this raises a very big question, which is: where is this
running? How can I ensure that this is
going to be done securely? So this is
running inside of GitHub actions. And if
you're not already familiar with GitHub
actions, this is our platform for
automation. And in fact, I can go ahead
and configure the environment in which I
want
my coding agent to work, by creating a specialized workflow exactly for that. And that's what I see right here with these copilot setup steps.
And if I scroll down, what I'm going to
notice is that I've got steps to install
Node, I've got steps to install Python
and all the frameworks and all the
libraries that we're going to be using.
Now, not only does this ensure that
co-pilot is working in the environment
that I want it to work in, but it also
allows me to highlight the fact that by
default, coding agent does not have
access to any external
resources. So, it's not able to call the
internet. It's not able to call any
external services. Now, if I so desire
that I wanted to be able to do that,
then I can go in and configure MCP servers, and I can also add in updates to my firewall. So that way I
can punch a hole in my firewall and
allow Copilot to then access those
external resources. But by default, it's
only going to have access to what I've
configured and inside of that container.
In addition, because that is running inside of GitHub workflows, inside of GitHub Actions, it's going to be in an ephemeral environment. So, it's going to spin up a
brand new environment and once its work
is done, it's going to then delete
it. Continuing down the security path,
let me kick back one uh one page here.
And if I scroll down, you're also going
to notice that it's not even able to
automatically kick off any workflows. So
I have a couple of workflows associated
with this repository for running my unit
tests and for running end to end tests.
And by default, it's not going to be
able to do this unless I go in and I say
yes. You'll also notice that the pull
request that it creates is going to be
in draft mode and I have to go in and
review it, because, again, developers are in charge; just because we're introducing AI does not change the normal DevOps flow. Now, if I take a look at the one
that was created earlier, let me go
ahead and open that up. What I'm going
to notice is a pull request with a
fantastic description of everything that
it has done. So I can see the PR implements the missing CRUD operations. I can see it lists off all
the different uh endpoints that it
created, the error handling, the testing
and the technical details. And I can
also again open up my session here and
see all the tasks that it performed. And
you'll notice again that it's going to
behave an awful lot like a developer
that it's going to go out, it's going to
do uh searches through my codebase,
determine what needs to be done, and
then eventually perform the tasks. And
if I scroll all the way to the end here,
where did There it
is. Perfect. What I can now see is I've
got a nice little summary down at the
very bottom. If I scroll up, I should be
able to see that it ran all my tests.
Yep, I can see all my tests right there.
And I can see in this case that all 16
of those tests
passed. So it created that PR and then I
also decided, okay, all of that looked
good to me. So I allowed it to run the
actions and I can even now see that it
ran those unit tests, ran the end-to-end tests, and everything looks good.
Then I could say ready for review and then finalize the creation of it.
The last thing that I want to highlight
and this both leads into the security
aspect but also brings me back into the
developer aspect, is we'll notice that it created a brand new branch. In this particular case, it's called copilot/fix-3. Where that came from is
that the issue number that this was
associated with was issue number three.
And so Copilot will only have write access to that branch. And this branch
is going to behave just like any other
branch that I might have. So if I clone
the repository locally, I can go ahead
and check out that branch. I've opened
up the branch inside of GitHub here. Let
me scroll on down to my
server. And if I scroll on down inside
of here, sound effects help. By the way,
what we're going to notice is that there
is my update game. I think my create
game was up here. Yep, there it is. And
there it all is. But again, it's only
inside of just that branch. So that's the only place that coding agent is going to have write permissions
to. Now, this leads us into a very big
question, which is: okay, that's wonderful, Christopher. You've created a little, kind of simple demo. You had to create a few Flask endpoints, and that's wonderful and all, but
how about doing it in the real world?
Well, one of the big tenets that we
have at GitHub is we build GitHub on
GitHub. And in fact, coding agent was
built with the help of coding agent. And
you're going to notice, when we take a look at the amount of commits that found their way into coding agent, that coding agent itself was one of the most prolific committers. Coding agent not only created new features, but it also helped address tech debt. And this is one of the biggest places where I personally see coding agent really shining, because I don't know of a single organization that doesn't have tech debt, that feels comfortable with the state of their backlog, that doesn't have a limitless number of items where they keep going: yeah, that's great and all, but we just don't have the time.
To kick through real quick, as I highlighted: a secure environment, a separate platform, ephemeral, all running inside of GitHub Actions, and you have the ability to customize that. Coding agent does understand your repository and understands your GitHub context, so it has access to read your repository, and it is even able to read Copilot instructions. And it does have access to Model Context Protocol, so it can make those external calls. It does include those safeguards: read-only access to your repository, the default firewall preventing any external access, review before merge, and review before those actions run. So we continue to iterate on Copilot. We continue to look for new
areas where Copilot can shine to help
streamline development and to help
increase the productivity of developers.
Thank you.
[Applause]
Awesome. Thank you, Christopher. We love
hearing about, you know, some of the
main players in the SWE agent space.
So it's always nice to hear from some of
the big players. We want to continue
this track with you know how do we
actually take things to production. So
um our next speaker Tomas is going to
talk to us about the outer loop. So how
do we deal with you know actually
deploying and using these software
engineering agents? How do we manage all
of the, you know, CI/CD pipeline? How should we actually deal with using
these things? So I want to take a little
bit of a break here in the talks and
sort of speak about what's actually going on, right? Innovation in SWE agents is going at quite a rapid pace. Like, we've had Jules, we've got Codex, we've got Claude Code. As we get more and more of these software engineering agents that kind of really change the workflow of how we code, how do we handle actually deploying them, right? So we've got a next
lineup of speakers that are going to
help talk a little bit more about this
and you know we just kind of want to set
the stage here. So, let's just get it a
little bit more interactive. Um, how how
are we feeling about the track today?
Who's kind of been a fan of Jules? I
want to see, what are the major ones? So, who here in the room has used Devin? Let's see. Show of hands, fans. Okay. Okay. We have a few Devin users.
And what about Jules? How are we feeling
about Google's Jules? Okay. Same set of
hands. How about Claude Code? We've got a speaker from Claude Code coming
later. Okay, more hands but different
hands. Seems like we've got a bit of a
differentiation there. What about
OpenAI's
Codex? Oh, another set of hands. So,
interesting. You know, we've got
different core co-pilots and it seems
like people use them differently, but we
also like to see the other end of the
spectrum, right? Um, so we've got Devin. Who here is a fan of Devin and uses Devin from Cognition? Okay, so we've got another
set of hands. And you know, one thing to
note is that we kind of have these
different categories of agents, right?
Um, what about the human in the loop
short co-pilots? Who who here uses
Cursor and Windsurf in their coding
day-to-day tasks? Ah, a lot more hands.
So, it's an interesting sort of split,
right? We've got these human in the loop
sort of short-term coding co-pilots
where we've got stuff like Cascade from Windsurf. We've got Cursor's copilot, and kind of everyone's hand goes up,
right? A lot of people are starting to
use these co-pilots in their IDE. Then
we take it to the next level. We've got,
you know, we've got the big players, where we have Claude Code, we have Jules, we have Codex. And an
interesting note, you know, everyone
kind of has their own buckets. It's not
the same hands that go up. So that's one
of the reasons why at the conference we
like to invite speakers from everywhere.
Now the third camp, you know, Devin, it's another way to think about it: as you have longer-horizon agents, how do we deal with those? And that's kind of where, you know, we're starting to take the second half of the day in SWE
agents. We want to talk about how do we
take these things to production? How do
we actually deploy these? And you know
to bring that up I want to invite our
next speaker Tomas Ramirez. He's going
to give us a little bit about this. He's
from Graphite. So without further ado,
let's welcome Tomas.
Thank you so much.
Perfect. Hello everyone.
Um,
see, nope, no need for either of those.
Thank you so
much. And then slides. Looking
good.
Cool. Perfect. Awesome. Uh hi everyone,
my name is Tomas. I'm one of the
co-founders of Graphite. Graphite is an
AI code review uh company. So to give
some context on sort of where we see the
industry right now and where we see it
going, software development currently
and has always had two loops. The inner
loop which is focused on development and
the outer loop that's focused on review.
Developers spend time in the inner loop.
They get their code working. They get
the feature the way they want it and
then they go ahead and they move it to
the outer loop where it's tested,
reviewed, merged,
deployed. We're seeing the inner loop
change right now more than we've ever
seen it. More developers are using AI
than ever. I think right here we have
some statistics from the GitHub
developer survey. Nearly every developer
surveyed used AI tools both inside and
outside of work. And 46% of code on GitHub is being written by Copilot.
We're seeing more and more code being
written by AI. Here we have some
statistics around how code has changed
over time and how some people predict it
will change. And even if we take a more
pessimistic view of that, we still see
the way the world's going as just more
and more and more code being written by
AI. The inner loop is changing. You
know, AI is making more uh developers
more productive. Developers now
producing higher volumes of code. But
that code still needs to be reviewed.
When we first started looking at this,
when we first started building uh
Diamond AI code reviewer about a year
ago now, what we found was we read a lot
of articles that scared us a lot. We
were seeing within our own organization
a lot of developers adopting AI tools.
But we were also seeing a problem. AI
can hallucinate. It can make mistakes.
And almost more scarily, it can make
security vulnerabilities.
For us, what we saw was that while the
inner loop was getting sped up by AI,
the outer loop was rapidly becoming the
bottleneck. Um, we were seeing tools
like cursor, wind surf, co-pilot, vzero,
ball, all of those producing larger
volumes of code than we were used to,
than we had ever seen before. But we
were also seeing our developers suddenly
have to review higher volumes of code,
test higher volumes of code, merge
higher volumes of code, and deploy
higher volumes of
code. That's what brought us to say
there has to be a new outer loop here.
The way that things are going, this isn't going to work. We're going to break down. We're watching the problems that used to only ail large companies start to ail all companies, where we
were seeing companies deal with higher
and higher and higher volumes of code.
The requirements for the new outer loop
then look a lot like the problems that
larger companies have always had to deal
with. You need tools to better
prioritize, track, and get notified
about pull requests. You need driver
assist features to help reviewers focus
and streamline the code review process.
You need optimized CI pipelines and
merge queues to be able to handle the
sheer volume of code changes that are
now happening and you need better
deployment
tools. Um when we first started looking
at this through sort of an AI first
lens, we started to see that, well, while the problems are being created by AI, they can also probably be solved by AI. We
can probably start to streamline a lot
of these processes which have previously
had been manual, previously were parts
of the process that developers did not
enjoy, did not want to do. Um, we wanted
to see self-driving code review
solutions where we no longer had to do
those very manual and painful parts of
review, but we could actually start to
really focus on what matters most to the
developers, making sure that your
product can get out to users and that
the features work as expected. Um, we
were seeing that AI generated feedback
wasn't perfect. And because of that, we
were starting to think that bots weren't
enough. I think an early an early vision
of ours was, well, can we solve this by
just adding AI teammates, right? Maybe
it's background agents, maybe it's
reviewers, maybe it's a whole lot of
teammates to the workflow. And while we
think that's part of the story, we don't
think that's enough. We think that as
we've built with Diamond that your
entire tool chain has to be AI native,
not just your IDE. If you really are
going to embrace AI in the age of
development if you're going to accept
the fact that developers are going to be
orders of magnitude more productive than
they ever have before, you need tooling
that reflects
that. We started by building Diamond. So
the winning AI code review platform with
high signal, low noise, has a deep
understanding of the codebase and change
history. We summarize, prioritize, and
review each change. And we integrate
with your CI and your testing
infrastructure to
summarize errors and correct failures.
Um, our hope with it and what we've
started to see as we've rolled it out to
larger and larger customers and
enterprises too is we reduce code
review cycles, we enforce quality and
consistency, and we keep your code
private and secure. Um, it's high
signal, it's zero setup, it's actionable
with oneclick suggestions, and it's
customizable. It's already being used by
some of the fastest moving companies in
the world. It's expanding a lot more
than we can even say publicly. Um, and I
hope that you all will embrace the idea
that AI can change your entire developer
workflow, not just your IDE.
By the numbers, we see comments that our AI bot leaves be downvoted at less than a 4% rate, and be accepted, meaning integrated into the pull request that they were left on, at a higher rate than human comments are. Human comments are integrated somewhere between 45 and 50%. We're watching our Diamond comments be accepted at about 52%.
We've spent a lot of time tuning that.
That's that number is actually new as of
March for us. That's what I have to tell you around Graphite, and what I have to tell you around Diamond. I hope you give it a shot, and thanks for having me.
[Applause]
Awesome. Thanks again, Tomas, for such a great talk. We want to thank everyone for coming out to the SWE agents
track. We're going to take a short
break. Lunch is going to be served here
in the halls. The expo session will be
open. But you know without further ado
we're we're very happy to announce that
in the evening you know we have four
more fully packed sessions. Um I think
we are the only track that is fully
booked. So we've got all eight speakers.
Um we're going to have a great round of
speakers coming up soon. So feel free to
come back here later. We're going to
kick off with a talk from Claude Code. So how do they think about building Claude Code? How to use it? How
to delegate? We're going to have that
later back here in the keynote session.
But for now, please feel free to enjoy
lunch. Check out the expo hall as we
take a little break. Thank you.
[Music]
What's up? Welcome everyone. Let's give
it up for the SWE agents track. This
is the most packed track. We have four
more amazing speakers for you. Let's
hear it for our SWE agent speakers. Awesome. We're gonna kick off
talking about Claude Code and then follow that up with OpenDevin. I want to cut
my MCing short. I want to give the
speakers their time, but we have a
special little announcement. We never do
Q&A, but for our first talk for Claude
Code, we're going to do a bit of a
presentation and do a bit of a Q&A
session. Keep your questions short, 5, 10 words. Ask something interesting.
Think of your question. But without
further ado, I want to invite Boris
Cherny from Anthropic up to the stage.
Think of a question. I'll be back to
[Applause]
Q&A.
Hello. This is awesome. This is a big crowd. Who here has used Claude Code
before?
Jesus. Awesome. That's what we like to
see. Cool. So, my name is Boris. I'm a
member of technical staff at Anthropic and creator of Claude Code. And
um I was struggling with what to talk
about for an audience that already knows Claude Code, already knows AI and all the
coding tools and agentic coding and
stuff like that. So, I'm going to zoom
out a little bit and then we'll zoom
back
in. So, here's my TLDR. The model is moving really fast. It's on an exponential.
It's getting better at coding very, very
quickly, as everyone that uses the model
knows. And the product is kind of
struggling to keep up. We're trying to
figure out what product to build that's
good enough for a model like this. And
we feel like there's so many more
products that could be built for models
that are this good at coding. And we're
kind of building the bare minimum. And
I'll kind of talk about
why. And with Claude Code, we're trying
to stay unopinionated about what the
product should look like because we
don't
know. So for everyone that didn't raise
your hand, I think that's like 10 of
you. This is how you get Claude Code. You can head to claude.ai/code to install it. You can run this incantation to install from npm. As of yesterday, we support the Claude Pro plan. So you can try it on that. We support Claude Max. So
yeah, just try it out. Tell us what you
think. So programming is changing and
it's changing faster and faster. And if
you look at where programming started
back in, you know, the 1930s and '40s, there were switchboards, and it was
this physical thing. There was no such
thing as software. And then sometime in
the 1950s, punch cards became a thing.
And my grandpa, actually, in the Soviet Union, he was one of the first programmers in the Soviet Union.
And my mom would tell me stories about
like, you know, when she grew up in the
1970s or whatever, he would bring these
big stacks of punch cards home from work, and she would draw all over them with crayons, and that was growing up for her. And that's what programming was back in the 1950s, '60s, '70s
even. But sometime in the late 50s, we
started to see these higher level
languages emerge. So first there was
assembly. So programming moves from
hardware to punch cards which is still
physical to to software. And then the
level of abstraction just went up. So we
got to COBOL. Then we got to typed
languages. We got to C++. In the early
90s there was this explosion of these
new language families. There was you
know the Haskell family and um you know
JavaScript and Java, the evolution of
the C family and then Python. And I
think nowadays if you kind of squint all
the languages sort of look the same.
Like when I write TypeScript it kind of
feels like writing Rust and that kind of
feels like writing Swift and that kind
of feels like writing Go. The
abstractions have started to converge a
bit. If we think about the UX of
programming languages, this has also
evolved. Back in the 1950s, you used
something like a typewriter to punch
holes in punch cards and that was
programming back in the day. And at some
point text editors appeared, and then Pascal and all these different IDEs
appeared that let you interact with your
programs and your software in new ways
and each one kind of brought something
and I I feel like programming languages
have sort of leveled out but the model
is on an exponential and the UX of
programming is also on an exponential
and I'll talk a little bit more about
that.
Does anyone know what was the first text
editor? Okay, I heard I heard Ed from
someone. I think you read the
screen. Before well before text editors,
this is what programming looked like, real quick. So this was the IBM 029. This was kind of a
top-of-the-line. This was like the
MacBook of the time for programming
punch cards. Everyone have
this? You can still find it in museums
somewhere. And yeah, this is Ed. This is
the first text editor. Ken Thompson at Bell Labs invented
this. And you know, it kind of looks
familiar. If you open your MacBook, you
can actually still type Ed. This is
still is still distributed on Unix uh as
as part of Unix systems. And this is
crazy because this thing was invented
like 50 years ago. And this is nuts.
Like there there's no cursor, there's no
scroll back, uh there's no fancy
commands, there's no type ahead, there's
pretty much nothing. This is the simple
text editor of the time. And it was
built for teletype machines which were
literally physical machines that printed
paper on paper. That's how your program
was printed. And this is the first
software manifestation of a UX for
programming software. So it was really
built for these machines that didn't
support scrollback and cursors or
anything like
that. Um for all the Vim fans, I'm going
to jump ahead of Vim. Vim was a big
innovation. Emacs was a big innovation
around the same time. I think in 1980,
Smalltalk-80 was a big, it was a big
jump forward. This is one of the first I
think the first graphical interface for
programming software. And um for anyone
that's tried to set up like live reload
with React or Redux or any of this
stuff, this thing had live reload in
1980 and it worked and we're still kind
of struggling to get that to work with
like ReactJS nowadays.
So this this was a big jump forward and
obviously like the language it had
object-oriented programming and a bunch
of new concepts but on the UI side
there's a lot of new things
too. In '91 I think Visual Basic was the
first code editor that introduced a
graphical paradigm to the mainstream. So
before people were using textbased
editors Vim and things like that were
still very popular despite things like
small talk. Um but this kind of brought
it mainstream. This is what I grew up
with.
Eclipse brought type ahead to the
mainstream. This isn't using AI type
ahead. This is not cursor when surf.
This is just using static analysis. So
it's indexing your symbols and then it
can rank the symbols and rerank them and
it knows what symbols to show. I think
this was also the first big third party
ecosystem for
IDs. Copilot was a big jump forward with
single line type ahead and then
multi-line type ahead.
And I think Devin was probably the first
IDE that introduced this next concept
and this next abstraction to the world
which is to program you don't have to
write code you can write natural
language and that becomes code and this
is something people have been trying to
figure out for decades. I think Devin is
the first product that broke through and
and took this
mainstream and the UX has evolved
quickly but I think it's about to get
even
faster. We talked about uh UX and we
talked about programming languages and
verification is a part of this too. Um
so verification has started with manual
debugging and like physically inspecting
outputs. Um, and now there's a lot of
probabilistic verification uh like
fuzzing and vulnerability testing and uh
like Netflix's chaos uh testing and
things like
that. And so with all this in mind,
Claude Code's approach is a little
different. It's to start with a terminal
and to give you as lowlevel access to
the model as possible in a way that you
can still be productive. So we want the
model to be useful for you. We also want
to get we want to be unopinionated and
we want to get out of the way. So we
don't give you a bunch of flashy UI. We
don't try to put a bunch of scaffolding
in the way. Some of this is we're a
model company at Enthropic and you know
we make models and we want people to
experience those models. But I think
another part is we actually just don't
know like we don't know what the right
UX is. So we're starting
simple. And so Claude Code, it's
intentionally simple. It's intentionally
general. Um, it shows off the model in
the ways that matter to us, which is
they can use all your tools and they can
fit into all your workloads. So you can
figure out how to use the model in this
world where the UX of using code and
using models is changing so
fast. And so this is my second point.
The model just keeps getting better. And
this is the Bitter Lesson. I have this framed and taped
to the side of my
wall because the more general model
always wins and the model increases in
capability exponentially and there are
many corollaries to this. Everything
around the model is also increasing
exponentially and the more general thing
even around the model usually
wins. So with Claude Code there's one
product and there's a lot of ways to use
it. Um, so there's a terminal product
and you know this is the thing everyone
knows. So you can install Claude Code and then you just run claude in any terminal.
We're unopinionated. So it works in
iTerm 2. It works in WSL. Um, it works
over SSH and tmux sessions. Uh, it
works in your VS code terminal in your
cursor terminal. This works anywhere in
any
terminal. When you run Claude Code in the IDE, we do a little bit more. So we kind of take over the IDE a
little bit and you know diffs instead of
being inline in the terminal they're
going to be big and beautiful and show
up in the IDE itself. Um and we also
ingest diagnostics. Um so we kind of try
to take advantage of that. And you'll
notice this isn't as polished as
something like, again, Cursor or Windsurf. These are awesome products and
I use these every day. Um, this is to
let you experience the model in a
low-level raw way. And this is sort of
the minimal that we had to do to let you
experience
that. We announced a couple weeks ago
that you can now use Claude on
GitHub. Can I get a show of hands who's
who's tried this already?
So for everyone that hasn't tried this,
all you have to do is you open up
Claude, you run this one slash command,
install GitHub app, you pick the repo,
and then you can run Claude in any repo.
Um, this is running on your compute. Um,
your data stays on your compute. It does
not go to us. Um, so it's it's kind of a
nice experience and it lets you use your
existing stack. You don't have to change
stuff around. Takes a few minutes to set
up. And again, here we intentionally
built something really simple because we
don't know what the UX is yet. And this
is the minimal possible thing that helps
us learn but also is useful for
engineers to do day-to-day work like I
use this every
day. The extreme version of this is our
SDK and this is something that you can
use to build on cloud code uh without um
if you don't want to use like you know
the terminal app or the ID integration
or GitHub you can just roll your own
integration. You can build it however
you want. people have built all sorts of
UIs, all sorts of awesome integrations
and all this is, is you run claude -p and
uh you can use it
programmatically. And so, like, something I use it for, for example, is for incident triage. I'll take my GitHub logs, or, sorry, my GCP logs, I'll pipe them into claude -p, because it's a Unix utility, so you can pipe in, you can pipe out, and then I'll jq the result. So it's kind of cool. Like, this is a new way to use models. This is maybe 10% explored; no one has really figured out how to use models as a Unix utility.
This is another aspect of code as
UX that we just don't know yet. And so
again, we just built the simplest
possible thing so we can learn and so
people can try it out and see what works
for
you. Okay, I wanted to give a few tips
for how to use Claude Code. This is a talk about Claude Code. So this is kind of
zooming back in. Um, and uh, this is
actually true for I think a lot of
coding agents, but this is kind of
specific to the way that I personally use Claude Code. So, the simplest way to
use this, um, it seems like most of this
room is very familiar with Claude Code and
similar coding agents. Um, but the
simplest way to introduce new people
that have not used this kind of tool
before is to do codebase Q&A. And so, at Anthropic, we teach Claude Code to every
engineer on day one. And it's shortened
onboarding times from like two or three
weeks to like two days maybe. And also I
don't get bugged about questions
anymore. People can just ask Claude. And honestly, like, I'll just ask Claude
too. And then this is something that I
do uh pretty much every day on Monday.
We have a stand up every week. I'll just
ask Claude what did I ship that week?
It'll look through my git commits and
it'll it'll tell me so I don't have to
keep
track. The second thing is teach Claude
how to use your tools. And this is
something that has not really existed
before when you think about the UX of
programming. Um, with every ID there's
sort of like a plug-in ecosystem. You
know, for Emacs, there's this kind of
lispy dialect that you use to make
plugins. If you use Eclipse or VS Code,
you have to make plugins. For this new
kind of coding tool, it can just use all
your tools. So, you give it bash tools, you give it MCP tools. Something I'll often say is: here's the CLI tool, Claude; run --help, take what you learn, and then put it in the CLAUDE.md. And now Claude
put it in the cloud MD. And now Claude
knows how to use the tool. That's all it
takes. You don't have to build a bridge.
You don't have to build an extension.
There's nothing fancy like that. Um, of
course, if you have like groups of tools
or if you have fancier functionality
like streaming and things like this, you
can just use MCP as
well. Traditional coding tools focused a
lot on actually writing the code and I
think the new kinds of coding tools,
they do a lot more than that. And I
think this is a lot of where people that
are new to these tools struggle to
figure out how to use them. So there's a
few workflows that I've discovered for
using quad code most effectively for
myself. The first one is have quad code
explore and make a plan and run it by me
before it writes code. Um you can also
ask it to use thinking. So typically we
see extended thinking work really well
if Claude already has something in
context. So have it use tools, have it
pull things into context and then think.
If it's thinking up front, you're
probably just kind of wasting tokens and
it's not going to be that useful. But if
there's a lot of context, it does help a
bunch. The second one is TDD. Um, I know
I try to use TDD. It's like it's pretty
hard to use in practice, but I think now
with coding tools that actually works
really well. Um, and maybe the reason is
it's not me doing it, it's the model
doing
it. And so the workflow here is tell
Claude to write some tests and kind of
describe it and just make it really
clear like the tests aren't going to
pass yet. Don't try to run the test
because it's going to try to run the
test. Tell it like, you know, it's not
going to pass. Write the tests first, commit, and then write the code and then commit.
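As a rough illustration of that workflow (my own wording, with an invented helper name, not a quote from the talk), the prompt might read: write failing unit tests for a parse_date helper that accepts both ISO and US date formats; the tests won't pass yet, so don't run them, just commit them; then implement parse_date until every test passes, and commit again.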
And this is kind of a general case of: if Claude has a target to
iterate against, it can do much better.
So if there's some way to verify the
output, like a unit test, integration
test, uh a way to screenshot in your iOS
simulator, uh a way to screenshot in
Puppeteer, just some way to see its
output. Um we actually did this for
robots, like we taught Claude how to use a
3D printer and then it has a little
camera to see the output. If it can see
the output and you let it iterate, the
result will be much better than if it
couldn't iterate. The first shot will be
all right, but the second or third shot
will be pretty good. So g give it some
kind of target to iterate
against. Today we launched plan mode in
Claude Code, and this is a way to do the first
kind of workflow more easily. So anytime
hit shift
tab and Claude will switch to plan mode.
So you can ask it to do something, but
it won't actually do that yet. It'll
just make a plan and it'll wait for
approval. So restart Claude to get the
update. Run shift
tab. Okay. And then the final tip is uh
give Claude more context. There's a bunch of ways to do this. CLAUDE.md is the easiest way. So take this file called CLAUDE.md, put it in the root of your repo.
You can also put it in subfolders. Those
will get pulled in on demand. You can
put in your home folder. This will get
pulled in as well. Um and then you can
also use slash commands. Um, so if you put files, just regular markdown files, in this special folder, .claude/commands, they'll be available under the slash menu. So pretty cool. This is useful for reusable workflows.
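For instance, a reusable slash command is just a markdown file. A hypothetical .claude/commands/fix-lint.md might contain one instruction: run the project's linter, fix every warning it reports, and don't stop until the linter exits cleanly. The file name and wording are made up for illustration.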
And then to add stuff to CLAUDE.md, you can always type the pound sign to ask Claude to memorize something, and it'll prompt you which memory it should be added to. And you can see this
is us trying to figure out how to use
memory, how to use this new concept that
is new to coding models, did not exist
in previous IDEs, how to make the UX of
this work. And you can tell this is
still pretty rough. This is our first
version, but it's the first version that
works. And so we're going to be
iterating on this. And we really want to
hear feedback about what works about
this UX and what doesn't.
Thanks.
[Applause]
Thank you, Boris. Fortunately, we only
have one minute left. So, someone sent a
question on Slack. The question is, as I
delegate more and more to cloud code, as
it runs for 10 minutes and I have 10 of
these active, how do I use the tool? You
got 50 seconds.
[Laughter]
Yeah, this is it's pretty cool. I think
this is something that we actually see
in a lot of our power users that they
tend to, like, multi-Claude. You don't just have a single Claude open, but you have a
couple terminal tabs either with a few
checkouts of claude or uh or of your
codebase or it's the same codebase but
with different work trees and you have
Claude doing stuff in parallel. This is
also a lot easier with GitHub actions
because you can just spawn a bunch of
actions and get Claude to do a bunch of
stuff. Typically, we don't like need to
coordinate between these Claudes, I think
for most use cases. If you do want to
coordinate, the best way is just ask
them to write to a markdown file. Um,
and that's it. Awesome. Yeah, simple
thing works. Thank you so much. And once
again, give it up for Boris from
Anthropic.
Very exciting to see such a full packed
room here. We're going to set up our
next speaker who is Robert Brennan from
All Hands. He is the creator and the
company behind OpenDevin. So a lot of
what we see you know we've had talks
from all the top SWE agents. We've had Jules here, we've got Claude Code, we have OpenAI's Codex, we have Devin. As people use more and more of these SWE agents,
are we just you know adding tech debt or
are we actually 10x engineers so this is
what Robert is going to discuss with us
I once again don't want to fill the
stage so let's hear it for
[Applause]
Robert. Hey folks. Uh so today I'm today
I'm going to talk a little bit about uh
coding agents and how to use them
effectively really. Um if you're
anything like me, you found that uh you
found a lot of things that work really
well and a lot of things that uh don't
work very well.
Um so a little bit about me. Uh my name
is Robert Brennan. I've been building uh
open- source development tools for for
over a decade now. Uh and my team and I
uh have created uh an open-source uh
software development agent called Open
Hands, formerly known as OpenDevin. So, to state the obvious, in
2025, software development is changing.
Uh our jobs are are very different now
than they were 2 years ago. Uh and
they're going to be very different two
years from now. Uh and the thing I want
to convince you of is that coding is
going away. uh we're going to be
spending a lot less time actually
writing code but that doesn't mean that
software engineering is going away. Uh
we're paid not to to type on our
keyboard but to actually think
critically about the problems that are
in front of us. Uh and so if we do
AIdriven development correctly um it'll
mean we spend less time actually like
leaning forward and squinting into our
IDE and more time kind of sitting back
in our chair and thinking you know what
does the user actually want here? Uh
what are we actually trying to build?
what what problems are we trying to
solve as an organization? How can we
architect this in a way that sets us up
for the future? Uh the AI is very good
at that inner loop of development, the write code, run the
code, write code, run the code. It's not
very good at those kind of big picture
tasks that have to take into account um
that have to like empathize with the end
user uh take into account business level
objectives. Uh and that's where we come
in as as software
engineers. Uh so let's talk a little bit
about what actually a coding agent is.
Uh I think this word agent gets thrown
around a lot these days. Uh the meaning
has started to to drift over time. Uh
but at the core of it is this this
concept of agency. Um it's this idea of
of taking action out in the real world.
Um and these are these are the main
tools of a software engineer's job,
right? We have a a code editor to
actually modify our codebase, navigate
our codebase. Uh you have a terminal uh
to help you actually run the code that
you're that you're writing. uh and you
need a web browser in order to look up
documentation and maybe copy and paste
some code from stack overflow. So these
are kind of the core tools of the job
and these are the tools that we give to
our agents to let them do their whole uh
development
loop. I also want to contrast uh you
know coding agents from some more
tactical codegen tools that are out
there. Um, you know, we kind of started
a couple years ago with things like, uh,
GitHub Copilot's autocomplete feature
where, you know, it's literally wherever
your cursor is pointed in the codebase.
Right now, it's just filling out two or
three more lines of code. Um, and then
over time, things have gotten more and
more agentic, more and more
asynchronous, right? Uh, so we got like
AI-powered IDEs that can maybe take a
few steps at a time without a developer
interfering. And then uh now you've got
these tools like Devin and Open Hands
where you're really giving an agent, you
know, one or two sentences describing
what you want it to do. It goes off and
works for 5 10 15 minutes on its own and
then comes back to you with a solution.
This is a much more powerful way of
working. You can get a lot done. Uh you
can send off multiple agents at once. Um
you know, you can focus on communicating
with your co-workers or goofing off on
Reddit while these agents are are
working for you. Um, and it's uh it's
just it's a it's a very different way of
working, but it's a much more powerful
way of
working. Uh, so I want to talk a little
bit about how these agents work under
the hood. I feel like uh once you
understand what's happening under the
surface, uh, it really helps you build
an intuition for how to use agents
effectively. Uh, and at its core, um, an
agent is this loop between a large
language model and the and the external
world. So, uh, the large language model
kind of serves as the brain. Uh, and
then we have to repeatedly take actions
in the external world, get some kind of
feedback from the world, and pass that
back into the LLM. Um, uh, so basically
at every every step of this loop, we're
asking the LLM, what's the next thing you
want to do in order to get one step
closer to your goal. Uh, it might say,
okay, I want to read this file. I want
to make this edit. I want to run this
command. I want to look at this web
page. uh we go out and take that action
in the real world, get some kind of
output, whether it's the contents of a
web page, uh or the output of a command,
and then stick that back into the LLM
for the next turn of the
loop.
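Here is a minimal sketch of that loop in Python. The model client, tool names, message format, and stopping condition are all assumptions for illustration; they are not Open Hands' actual interfaces.

# Hypothetical agent loop: ask the LLM for an action, do it, feed the result back.
def run_agent(llm, tools, goal, max_steps=20):
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        # Ask the brain what to do next to get one step closer to the goal,
        # e.g. {"tool": "terminal", "input": "pytest"}.
        action = llm.next_action(history)
        if action["tool"] == "finish":
            return action["input"]  # the agent's final answer
        # Take the action in the external world: editor, terminal, or browser.
        observation = tools[action["tool"]](action["input"])
        # Stick the result back in for the next turn of the loop.
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "tool", "content": observation})
    return "stopped: step limit reached"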
Just to talk a little bit about the core tools that are at the
agent's disposal. Uh the first one again
is a is a code editor. Um you might
think this is this is really simple. It
actually turns out to be a fairly uh
interesting problem. Uh the naive
solution would be to just like give the
old file to the LLM uh and then have it
output the entire new file. It's not a
very efficient way to work though if
you've got a thousand line uh thousand
line of thousands of lines of code and
you want to just change one line. Uh
you're going to waste a lot of tokens
printing out all the lines that are
staying the same. So most contemporary agents use a find-and-replace-type editor or a diff-based editor to allow the LLM to just make tactical edits inside the file. A lot of times they'll also provide an abstract syntax tree or some kind of way to allow the agent to navigate the codebase more effectively.
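As a sketch of that idea, a find-and-replace edit tool exposed to the LLM might look roughly like this. The function name, uniqueness rule, and return strings are assumptions for illustration.

# Hypothetical find-and-replace edit tool: change one spot without reprinting the file.
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> str:
    text = Path(path).read_text()
    count = text.count(old)
    if count != 1:
        # Requiring a unique match keeps the edit unambiguous for the model.
        return f"error: expected exactly one match, found {count}"
    Path(path).write_text(text.replace(old, new, 1))
    return f"ok: edited {path}"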
Next up is the terminal,
and again you would think text in text
out should be pretty simple but there
are a lot of questions that pop up here.
You know what do you do when there's a
long-running command that has no standard
out for a long time. Do you kill it? Do
you let the LLM wait? Uh what happens if
you want to run multiple commands in
parallel? Run commands in the
background. Maybe you want to start a
server and then run curl against that
server. Uh lots of really interesting uh
problems that crop up uh when you have
an agent interacting with the
terminal. Uh and then probably the most
complicated tool is the web browser.
Again, there's a naive solution here
where you just uh the agent just gives
you a URL and you give it a bunch of
HTML. Um that's uh very expensive
because there's a bunch of cruft inside
that HTML that the the LLM doesn't
really need to see. uh we've had a lot
of luck passing it uh accessibility
trees or converting to markdown and
passing that to the LLM
um or allowing the LLM to maybe scroll
through the web page if there's a ton of
content there.
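As a rough sketch of that strip-the-cruft idea (using requests and BeautifulSoup; the truncation limit is an arbitrary assumption, and this is not Open Hands' actual browser tool):

# Hypothetical helper: reduce a page to plain text the LLM can read cheaply.
import requests
from bs4 import BeautifulSoup

def page_to_text(url: str, max_chars: int = 8000) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts, styles, and other markup the model doesn't need to see.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    # A real agent would let the model scroll; here we just truncate.
    return text[:max_chars]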
And then also if you start to add interaction, things get even
more complicated. Uh you can let the LLM
uh write JavaScript against the page or
we've actually had a lot of luck
basically giving it a screenshot of the
page with labeled nodes and it can say
what it wants to click on. Uh this is an
area of active research. Uh we just had
a contribution about a month ago that
doubled our accuracy on web browsing. Uh
I would say this is uh this is
definitely a space to
watch. Uh and then I also want to talk
about sandboxing. Uh this is a
really important thing for agents
because if they're going to run
autonomously for several minutes on
their own without you watching
everything they're doing, you want to
make sure that they're not doing
anything dangerous. Uh and so all of our
agents run inside of a Docker container
by default. They're totally separated out from your workstation, so there's no chance of it running rm -rf on
your home directory. Um increasingly
though, we're giving agents access to
third party APIs, right? So you might
give it access to a GitHub token or
access to your AWS account. Super super
important to make sure that those
credentials are tightly scoped and that
you're following uh the principle of
least privilege as you're granting
agents access to do these
things.
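To illustrate the sandboxing idea, here is a rough sketch of running an agent command in a throwaway Docker container using the docker Python SDK. The image, limits, and mount point are assumptions, not Open Hands' actual configuration.

# Hypothetical sandbox runner: isolate the agent's command in a disposable container.
import docker

def run_sandboxed(command: str, workspace: str) -> str:
    client = docker.from_env()
    logs = client.containers.run(
        "python:3.12-slim",                 # throwaway image
        command,
        working_dir="/workspace",
        volumes={workspace: {"bind": "/workspace", "mode": "rw"}},
        network_disabled=True,              # no network access by default
        mem_limit="1g",
        remove=True,                        # delete the container when done
    )
    return logs.decode()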
All right, I want to move into some best practices.
Uh my my biggest advice for folks who
are just getting started is to start
small. Um the best tasks are things that
can be completed pretty quickly. You
know, a single commit uh where there's a
clear definition of done. You know, you
want the agent to be able to verify,
okay, the tests are passing, I must have
done it correctly. Um or, you know, the
merge conflicts have been solved, etc.
Um and tasks that are easy for you as an
engineer to verify uh were done
completely and correctly. Um I like to
tell people to start with small chores.
Uh very frequently you might have a pull
request where there's, you know, one
test that's failing or there's some lint
errors or there's merge conflicts. Uh
bits of toil that you don't really like
doing as a developer. Those are great
tasks to just shove off to the AI.
They tend to be very rote.
Uh the AI does does them very well. Um
but as your intuition grows here, as you
get used to working with an agent,
you'll find that you can give it bigger
and bigger tasks. Uh you'll you'll
understand how to communicate with the
agent effectively. Um, and I would say
for for me, for my co-founders, and for
our for our biggest power users, uh, for
me, like 90% of my code now goes through
the agent, and it's only maybe 10% of
the time that I have to drop back into
my IDE and kind of get my hands dirty in
the codebase
again. Uh, being very clear with the
agent about what you want is super
important. Uh, I specifically like to
say, you know, you need to tell it not
just what you want, but you need to tell
it how you want it to do it. You know,
mention specific frameworks that you
want it to use. Uh if you wanted to do
like a test-driven development strategy,
tell it that. Um mention any specific
files or function names that it can that
it can go for. Um this not only uh helps
it be more accurate and uh you know more
clear as to what exactly you want the
output to be um it also makes it go
faster, right? It doesn't have to spend
as long exploring the codebase if you
tell it I want you to edit this exact
file. Um this can save you a bunch of
time and energy and it can save uh a lot
of a lot of tokens, a lot of actual like
inference costs.
Uh, I also like to remind folks that in
an AIdriven development world, code is
cheap. Um, you can throw code away. You
can you can experiment and prototype.
Uh, I love if I if I have an idea, like
on my walk to work, I'll just like uh,
you know, tell open hands with my voice,
like do X, Y, and Z, and then when I get
to work, I'll I'll have a PR waiting for
me. 50% of the time, I'll just throw it
away. It didn't really work. 50% of the
time it looks great, and I just merge
it, and it's and it's awesome. Um, it's
uh it's really fun to be able to just
rapidly prototype using AIdriven
development. Um, and I would also say,
you know, if you if you try to try to
work with the agent on a particular task
and it gets it wrong, maybe it's close
and you can just keep iterating within
the same conversation and has already
built up some context. If it's way off
though, just throw away that work. Start
fresh with a new prompt based on uh what
you learned from the last one. Um it's
really really uh I think uh it's a new
new sort of muscle memory you have to
develop to just throw things away.
Sometimes it's uh hard to throw away
tens of thousands of lines
of code that uh have been generated
because you're used to that being a very
expensive uh bunch of code. Uh these
days it's it's very easy to kind of just
start from scratch.
Again, this is probably the most
important bit of advice I can give
folks. Uh you need to review the code
that the AI writes. Uh I've seen more
than one organization run into trouble
uh thinking that they could just vibe
code their way to a production
application uh and just you know
automatically merging everything that
came out of the AI. Um but uh if you
just you know don't review anything
you'll find that your codebase just
grows and grows with this tech debt.
You'll find duplicate code everywhere.
Uh things get out of hand very quickly.
Uh so make sure you're reviewing the
code that it outputs and make sure
you're pulling the code and running it
on your workstation or running it inside
of an ephemeral environment. uh just to
make sure that you know the agent has
actually solved the problem that you
asked it to
solve. Uh and I like to say you know
trust but verify. You know as you work
with agents over time you'll build an
intuition for for what they do well and
what they don't do well and you can
generally trust them to to um you know
operate the same way today that they did
yesterday. Um but you really you really
do need a human in the loop. Um, you
know, one of our big learnings, uh, with
Open Hands, in the early days, if you opened up a pull request with Open Hands, uh, that pull request would show up as owned by Open Hands; it would be the little hands logo next to the pull request. Uh, and that caused two problems. One, it meant that the human who had triggered that pull request could then approve it and basically bypass our whole code review system. You didn't need a second human in the loop before merging. Uh, and two,
oftentimes those pull requests would just languish. Uh, nobody would really take ownership of them. Uh, if there was like a failing unit test, nobody was jumping in to make sure the test passed. Um, and they would just kind of sit there and not get merged, or if they did get merged and something went wrong and the code didn't actually work, we didn't really know who to go to and be like, you know, who caused this? There was nobody we could hold accountable for that breakage. Um, and so now if you open up a pull request with Open Hands, your face is on that pull request. You're responsible for getting it merged. You're responsible for any breakage it might cause down the line.
Cool. And then uh I do want to just
close just by going through a handful of
use cases. Uh this is always kind of a
tricky topic because agents are great
generalists. They can they can
hypothetically do anything as as long as
you kind of like break things down into
bite-sized steps that they can take on.
Um but in that in that um in the spirit
of starting small, I think there are a
bunch of use cases that are like really
great day one use cases for agents. My
favorite is resolving merge conflicts.
This is like the biggest chore as a part
of my job. Uh, Open Hands itself is a very fast-moving codebase. I'd say there's probably no PR that I make where I get away with zero merge conflicts. Um, and I love just being able to jump in and say, at Open Hands, fix the merge conflicts on this PR. Uh, it comes in and, you know, it's such a rote task. It's usually very obvious,
you know, what changed before, what
changed in this PR, what's the intention
behind those changes? And Open Hands
knocks this out, you know, 99% of the
time.
Uh addressing PR feedback is also a
favorite. Uh this one's great because
somebody else has already taken the time
to clearly articulate what they want
changed and all you have to do is say at
openhands do what that guy said. Uh and
again, like you can see in this example, Open Hands did exactly what this person wanted. I don't know React super well, and our front-end engineer was like, do X, Y, and Z, and he mentioned a whole bunch of buzzwords that I don't know. Open Hands knew all of it
and uh was able to address his feedback
exactly how he wanted.
uh fixing quick little bugs. Um you
know, you can see in this example, we
had an input uh that, you know, was a
text input, should have been a number
input. Uh if I wasn't lazy, I could have
like dug through my codebase, found the
right file. Um but it was really easy
for me to just like quickly I think I
did this one from directly inside of
Slack, uh just add open hands, fix this
thing we were just talking about. Uh and
uh it's just, you know, really I don't
even have to like fire up my IDE. Um
it's just it's a really really fun way
to work.
uh infrastructure changes I really like.
Uh usually these involve looking up some
like really esoteric syntax inside of
like the Terraform docs or something
like that. Um, Open Hands and, you know, the underlying LLMs tend to just know the right Terraform syntax, and if not, they can look up the documentation using the browser. Um so
this stuff is uh is really great.
Sometimes we'll just get like an out of
memory exception in Slack and
immediately say okay open hands increase
the
memory. Uh database migrations are
another great one. Uh this is one where
I find uh I often leave best practices
behind. I won't put indexes on the right
things. I won't set up foreign keys the
right way. Uh the LLM tends to be really
great about following all best practices
around database migrations. So again,
it's kind of like a rote task for
developers. It's not very fun. Um, the LLM's great at it. Uh, fixing failing tests, like on a PR: if you've already got the code 90% of the way there and there's just a unit test failing because there was a breaking API change, it's very easy to call in an agent to just clean up the failing tests. Uh expanding test coverage is
another one I love because uh it's a
very um safe task, right? As long as the
tests are passing, it's uh generally
safe to just merge that. So if you
notice a spot in your codebase where
you're like, "Hey, we have really low
coverage here." just ask uh ask your
agent to uh expand your test coverage in
that area of the codebase. Uh it's a
great quick win uh to make your codebase
a little bit
safer. Then, everybody's favorite: building apps from scratch. Um you know
I would say if you're shipping
production code again don't just like
vibe code your way to a production
application. Uh but we're finding
increasingly internally at our company a
lot of times there's like a little
internal app we want to build. Uh like
for instance, we built a way to debug Open Hands trajectories, debug Open Hands sessions. Um, we built like a whole
web application that since it's just an
internal application, we can vibe code
it a little bit. We don't really need to
review every line of code. It's not
really facing end users. Uh this has
been a really really fun thing for our
business to just be able to churn out
these really quick applications uh just
to serve our own internal needs. Um so
yeah, greenfield is a great use case for agents. Uh, that's all I've got.
Uh we'd love to have you all join the
the Open Hands community. You can find us on GitHub at All-Hands-AI/OpenHands. Um join us on Slack,
Discord. Uh we'd love to build with
you. Awesome. Awesome. Okay. Thank you
again, Robert. Very, very exciting to
hear about what works and what doesn't
work in coding agents. Now, I want to
take a bit of time to pause. We're kind
of going to change focus for the next
few talks. Our next speaker is Josh Albrecht from Imbue, who's going to speak about, you know, a little bit of a meta talk. He's going to give a walkthrough of a case study about Sculptor. Sculptor is kind of their way
of how do you verify that your AI coding
agents are actually outputting proper
code. So, you know, we always hear: how do we go from prototype to production? I'm guilty of this. I've given this talk. I gave it last year. But, you know, we always hear about how do you go from prototype to production?
You need a human in the loop. How do you
go from vibe coding to actual like
production grade code? And outside of
tech debt, Josh is one of the people
that has kind of gone very very deep in
this and built Sculptor to exactly solve
this. So for our next talk, you know,
he's going to go through a case study of
as you build coding agents, how do you
kind of launch something alongside this?
How do you better verify what's going
on? And a little bit more about Josh.
Josh is kind of a friend that I've known
for over a year. We've talked in great
depth about coding agents. He's very
deep in the space. He's been on the Latent Space podcast before. So, if you want to, you know, hear more, feel free to check out the podcast. And same with a lot of the other speakers: Boris from Claude Code, he's been there as well. But
without further ado, I want to invite
Josh up pretty soon. I'm gonna I'm gonna
kill some more time. We're we're running
a little early. So, um yeah, let's let's
actually get a show of hands. Who in here has started actually shipping SWE agents in production? So outside of using them in your own coding workflows, outside of using co-pilots, who has actually shipped a version of a coding co-pilot? Who's working directly
on the
tools? Okay, we we have a a few hands.
So let's get a better idea of what
people are working on. Are people in the
session here? Are we trying to learn how
should we better use co-pilots? How
should we take them to production? How
should we build them? What should we
know about them? because Josh's talk is
a bit of a case study around this. So,
who here is in the phase of aggressively
using co-pilots kind of vibe coding and
trying to take it to that next level? Okay. Okay. A lot more hands there. So, Josh, a little bit more background for you there. So, let's kind of hand it off from there. Um
Josh, I think we're ready for you.
Awesome. Thanks.
Um, one
second. All
right, cool. Well, yeah, it's great to
be here. So, I'm Josh Albrecht. I'm the
CTO of Imbue. Uh, and our focus is on
making more robust, useful AI agents. In
particular, we're focusing on software
agents right now. And the main product
that we're working on today is called
Sculptor. So the purpose of Sculptor is
to kind of help us with something that
we've all experienced. You know, we've
all tried these vibe coding tools and
you, you know, tell it to go off and do
something. It goes off and creates a
bunch of code for you. Uh, and then, you
know, voila, you're done, right? Well,
not quite. Like, at least today, there's a big gap between kind of the stuff that comes back and what you want to ship to production, especially as you get away from prototyping into larger, more established codebases. So today I'm going to go over some of the technical decisions that went into the design of Sculptor, our experimental coding agent environment, and kind of go through
some of the context and motivations for
the various ideas that we've explored
and the features that we've implemented.
It's still a research preview, so these
features may change before we actually
release it. Uh, but I hope that you know
whether you're an individual using these
tools or you're someone who's developing
the tools yourself, you'll find these uh
kind of learnings from our experiments
to be useful for yourselves. So today,
if you're thinking about how you can
make coding agents better, then there's
a million different things that you
could build. You could build something
that helps improve the performance on
really large context windows. You could
make something to make it cheaper or
faster. You could make something that
does a better job of parsing the
outputs. But I don't think that we
really should be building any of these
things. I think that what we really want
to be building is things that are much
more specific to the use case or to like
the problem domain or the thing that you
are like really specialized in. most of
the things that I just mentioned are
going to get solved over the next call
it 3 to 12 to 24 months as models get
better, coding agents get better etc.
And so I think you know just like you
wouldn't want to make your own database
I don't think we want to be spending a
lot of time working on the problems that
are going to get solved uh instead we
want to focus on the particular part of
the problem that really matters for for
us, for our business. And so at Imbue, the
problem that we're focusing on is
basically this like what is wrong with
this diff? You get a coding agent output
and it tells you like okay I've added 59
new lines. Are those good? Like right
now you have an awkward choice between
either looking at each of the lines
yourself or just hitting merge and kind
of hoping for the best. Uh and neither
of those are a really great place to be.
So we try to give you a third option. Uh
the goal is to help build user trust by
allowing another AI system to come and
take a look at this and understand like
hey are there any race conditions? Did
you leave your API key in there etc. So
we want to think about how do we help
leverage AI tools not just to generate
the code but to help us build trust in
that
code and kind of the way that we think
about it is about like identifying
problems with the code because if
there's no problems then it's probably
high quality code and that's kind of the
definition of high quality code. If you
think about it from like an academic
perspective, the way that people
normally measure software quality is by
looking at the number of defects and
they look at like how long does it take
to fix a particular defect or how many
defects are caught by this particular
technique. So this is sort of the
definition that at least we're working
on from when we're thinking about making
high quality software. And then if we
think about you know the software
development process what you want to be
doing is getting to a place where you
have identified these problems as early
as possible. So Sculptor does not work as like a pull request review tool,
because that's much much later in the
process. Rather we want something that's
synchronous and immediate and giving you
immediate feedback. As soon as you
generated that code, as soon as you've
changed that line, you want to know like
is there something wrong with it? That's
easier both for you to fix and also for
the agent to fix.
So what are some ways that you can
prevent problems in AI generated code?
We're going to go through five different ways. Uh, the first is learning, planning... or sorry, only four different ways: learning, planning, writing specs, and having a really strict style guide. And we'll see how those manifest in Sculptor. So the first thing you want to do when you're using coding agents, if you're trying to prevent problems, is learn what's out there. We try to make this as easy as possible in Sculptor by
letting you ask
questions, have it do research, get
answers about what are the technologies,
etc. that exist. What are the ways that
other people have solved similar
problems so that you don't end up
reproducing a bunch of work for what's
already out
there. Next, we want to think about how
we can encourage people to start by
planning. Here's a little example
workflow where you can, you know, kick
off the agent to go do something simple
like, you know, implement this Scrabble
solver and change the system prompt here
to force the AI agent to first make a
plan without writing any code at all.
Then you can wait a little while. It'll
generate the plan. Uh, and then you can
go and change the system prompt again to
say like, okay, now we can actually
create some code. So we make it really
easy to kind of change these types of
meta parameters of the coding agent
itself. Of course you can just tell the
agent to do that. But by changing its
system prompt you sort of force it in a
much stronger way to uh change its
behavior. And you can build up larger
workflows by making sort of customized
agents for always plan first then always
do the code then always run the checks
etc.
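As a rough illustration of that two-phase idea (this is a hypothetical sketch, not Sculptor's actual API; run_agent stands in for whatever harness you use to invoke a coding agent with a given system prompt):

```python
# Hypothetical sketch of a "plan first, then code" workflow implemented by
# swapping the agent's system prompt between phases. `run_agent` is a
# placeholder for your own coding-agent harness, not a real library call.

PLAN_ONLY_PROMPT = (
    "You are a senior engineer. Produce a step-by-step implementation plan "
    "for the task. Do NOT write or modify any code in this phase."
)
CODE_PROMPT = (
    "You are a senior engineer. Implement the approved plan below, "
    "writing code and running the project's checks as you go."
)

def run_agent(system_prompt: str, user_message: str) -> str:
    """Placeholder for your coding-agent harness; swap in a real call here."""
    return f"[agent output for prompt starting: {system_prompt[:40]!r}]"

def plan_then_code(task: str) -> str:
    # Phase 1: force the agent to plan without touching code.
    plan = run_agent(PLAN_ONLY_PROMPT, task)
    # (Optionally review or edit the plan here before continuing.)
    # Phase 2: switch the system prompt so the agent is allowed to write code.
    return run_agent(CODE_PROMPT, f"Task: {task}\n\nPlan:\n{plan}")

if __name__ == "__main__":
    print(plan_then_code("Implement a Scrabble solver"))
```

The same pattern extends to a third phase that always runs the checks, which is roughly the kind of customized plan-then-code-then-check agent described above.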
Third, you want to think about writing
specs and docs as a kind of first class
part of the workflow. One of the main reasons why, at least for me, I haven't normally written lots of specs and docs in the past has been that it's kind of annoying to keep them all up to date, to spend all this time typing everything out if I already know what the code is supposed to be. But this is really
important to do if you want the coding
agents to actually have context on the
project that you're trying to do because
they don't have access to your email,
your Slack, etc. necessarily. And even
if they did, they might not know exactly
how to turn that into code. So in
Sculptor, uh, one of the ways that we
try to make this easier is by helping
detect if the code and the docs have
become outdated. So it reduces the
barrier to writing and maintaining
documentation and docstrings because
now you have a way of more automatically
fixing the inconsistencies. It can also
highlight inconsistencies or parts of
the specifications that conflict with
each other, making it easier to make
sure that your system makes sense from
the very
beginning. And finally, you want to have
a really strict style guide and try to
enforce it. This is important even if
you're just doing regular coding without
AI agents, just with other human
software engineers. But one of the
things that is special in Sculptor is
that we make suggestions which you can
see towards the bottom here uh that help
keep the AI system on a reasonable path.
So here it's highlighting that you could
you know make this particular class
immutable to prevent race conditions.
This is something that comes from our style guide, where we try to encourage both the coding agents and our teammates to write things in a more functional, immutable style to prevent certain classes of errors. We're also working on
developing a style guide that's sort of custom-tailored to AI agents to make it
even easier for them to avoid some of
the most egregious mistakes that they
normally
make. But no matter how many uh things
you do to prevent the AI system from
making mistakes in the first place, it's
going to make some mistakes. And there
are many things that we can do to
prevent or to detect those problems and
prevent them from getting into
production. So we'll go through three
here. Uh, first, running linters; second, writing and running tests; third, asking an LLM. Uh, and we'll dig into each and see how that manifests in Sculptor. So for the first one, running linters: there are many automated tools out there, like Ruff or mypy or Pylint, that you can use to automatically
detect certain classes of errors.
In normal development, this is sort of
obnoxious because you have to go fix all
these like really small errors that
don't necessarily cause problems. It's a
lot of like churn and extra work. But
one of the great things about AI systems
is that they're really good at fixing
these. So, one of the things that we've
built into Sculptor is the ability for
the system to very easily detect these
types of issues and automatically fix
them for you without you having to get
involved.
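A minimal sketch of that "run the linters and hand the findings to the agent" loop, using Ruff and mypy as the example tools (both are real tools that must be installed for this to run; the agent hand-off at the end is an assumption, not Sculptor's implementation):

```python
# Minimal sketch: run linters and collect their findings so an agent (or a
# human) can fix them. Assumes ruff and mypy are installed on PATH.
import subprocess

def run_tool(cmd: list[str]) -> str:
    """Run a lint command and return its combined output.
    A non-zero exit code is expected whenever issues exist, so we don't raise."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout + proc.stderr

def collect_lint_findings(path: str = ".") -> str:
    findings = []
    findings.append("## ruff\n" + run_tool(["ruff", "check", path]))
    findings.append("## mypy\n" + run_tool(["mypy", path]))
    return "\n\n".join(findings)

if __name__ == "__main__":
    report = collect_lint_findings()
    # In practice you would pass `report` to your coding agent as context,
    # e.g. "fix these lint and type errors", instead of just printing it.
    print(report)
```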
Another thing that we've done is make it
easy to use these tools in practice. A lot of tools end up like this. You know, how many people here, maybe a show of hands, how many people have a linter set up at all? Okay. How many people have zero linting errors in their codebase? Two. Great. We'll hire you. Okay, cool. Uh, but you know, it's not easy. But one of the things that we've done in Sculptor is make it so that the AI system understands what issues were there before it started and then what issues were there after it ran. So at least you can prevent the AI system from creating more errors without your involvement, even if it's not working in a perfectly clean codebase. Okay. Third, testing. So why
should you write tests at all? I think I
was pretty lazy as a developer for a
long time and did not want to write
tests because it took a you know a lot
of effort. You have to maintain them. I
already wrote the code. It works. Okay.
But one of the major objections to
writing tests has kind of disappeared
now that we have AI systems. The ability
to generate tests is now so easy that
you might as well write tests.
Especially if you have correct code. You
can tell the agent, hey, just write a
bunch of tests, throw out the ones that
don't pass, and just keep the rest. So
there's no real reason to not write
tests at all. Uh, and, as they say at Google, if you liked it, you should have put a test on it. This becomes much more
important with coding agents. And the
reason is that you don't want your
coding agent to go change the behavior
of your system in a way that you don't
understand and don't expect and don't
want to see happen. So at Google, this
matters a lot for their infrastructure
because they don't want their site to
crash when someone changes something.
But if you really care about the
behavior of your system, you want to
make sure that it's fully
tested. So how do we actually write good
tests? I'll go through a bunch of
different uh components to this. So
first, one of the things that you can do
is write code in a functional style. By
this I mean code that has no side
effects. This makes it much, much easier to run an LLM and understand if the code is actually successful. You really don't
want to be running a test that has
access to say your live Gmail
environment where if you make a single
mistake you can delete all of your
email. You really want to isolate those
types of side effects and be able to
focus most of the code uh on the kind of
functional transformations that matter
for your
program. Second, you can try and write
two different types of unit tests. Happy-path unit tests are the ones that show you that your code is working. It's happy. Hooray, it worked. Uh, you
don't need that many of those. You just
need a small number to show that things
are working as you hope. The unhappy
unit tests are the ones that help us
find bugs. And here LLMs can be really,
really helpful. So, especially if you've
written your code in a functional style,
you can have the LLM generate hundreds
or even thousands of potential inputs,
see what happens to those inputs, and
then ask the LLM, does that look weird?
And often when it says yes, that will be
a bug. And so now you have a perfect
test case replicating a bug.
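A rough sketch of that unhappy-path idea, assuming the code under test is a pure function. Everything here is hypothetical: the function, the inputs (standing in for LLM-generated ones), and the final "ask the LLM if it looks weird" step.

```python
# Hypothetical sketch: run many generated inputs through a pure function and
# record what happens, so an LLM (or a human) can flag the weird-looking ones
# as candidate bug-reproducing tests.

def parse_duration(text: str) -> int:
    """Toy pure function under test: '90s' -> 90, '2m' -> 120."""
    if text.endswith("m"):
        return int(text[:-1]) * 60
    if text.endswith("s"):
        return int(text[:-1])
    return int(text)

# Stand-ins for inputs an LLM might generate, including messy edge cases.
candidate_inputs = ["90s", "2m", "0s", "", "-5s", "1.5m", "  10s ", "m", "999999999m"]

observations = []
for case in candidate_inputs:
    try:
        outcome = repr(parse_duration(case))
    except Exception as exc:  # exceptions are often exactly the bugs we want to find
        outcome = f"raised {type(exc).__name__}: {exc}"
    observations.append((case, outcome))

for case, outcome in observations:
    # In practice you would hand each (input, outcome) pair to an LLM and ask
    # "does this look weird?"; each "yes" becomes a regression test for a bug.
    print(f"{case!r} -> {outcome}")
```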
Third, after you've written your unit
tests, it's maybe a good idea to throw
them away in some cases. This is a
little bit
counterintuitive. In the past, we took all this effort and spent all this time trying to write good unit tests, and so we feel some aversion to throwing them away. But now that it's so easy to ask an LLM to generate the test suite again from scratch, there's a good reason not to keep around too many unit tests of behavior you don't care about too much. You might
also want to just refactor the ones that
you generated into something that's
slightly more maintainable. But when you
do keep them around, it does kind of
confuse the LLM when you come back and
change this behavior. So it's something
that's at least worth thinking about
whether you want to keep the tests that
were originally generated, clean them
up, how many of them should you keep,
etc. Fourth, you should probably focus
on integration tests uh as opposed to
testing only the kind of code level
functional uh behavior of your program.
Integration tests are those that show
you that your program actually works.
Like from the user's perspective, like
when the user clicks on this thing, does
this other thing happen? AI systems can
be extremely good at writing these,
especially if you create nice test plans
where you can write, okay, when the user
clicks on the button to add the item to
the shopping cart, then the item is in
the shopping cart. If you write that out
and then you write the test, then you can write another test plan, like: if the user clicks the button to remove the item from the shopping cart, then it is gone. These systems can almost always get this right, and so it allows you to work at the level of meaning for your testing, which can be much more efficient.
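As a sketch of turning those two plain-language test plans into tests (the ShoppingCart class here is a made-up stand-in for whatever layer your real application exposes, such as an API client or a UI driver):

```python
# Hypothetical sketch: two integration-style tests derived from test plans
# written at the level of user meaning ("add item -> item is in cart",
# "remove item -> item is gone"). Run with pytest.

class ShoppingCart:
    """Stand-in for the real system under test."""

    def __init__(self) -> None:
        self._items: dict[str, int] = {}

    def add(self, sku: str, qty: int = 1) -> None:
        self._items[sku] = self._items.get(sku, 0) + qty

    def remove(self, sku: str) -> None:
        self._items.pop(sku, None)

    def contains(self, sku: str) -> bool:
        return sku in self._items


def test_adding_item_puts_it_in_cart():
    # Test plan: when the user clicks "add to cart", the item is in the cart.
    cart = ShoppingCart()
    cart.add("sku-123")
    assert cart.contains("sku-123")


def test_removing_item_takes_it_out_of_cart():
    # Test plan: when the user clicks "remove", the item is gone from the cart.
    cart = ShoppingCart()
    cart.add("sku-123")
    cart.remove("sku-123")
    assert not cart.contains("sku-123")
```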
Fifth, you want to think about test
coverage as a core part of your testing
suite. So if you're having Claude Code write things for you, then you don't just care about the tests passing on their own; you also care whether there are enough tests in the first place. If you think back to the original screenshot where we get back our PR of, you know, so many lines changed: if I only tell you how many lines have changed, it's not that helpful. If I tell you so many lines have changed, and also there's 100% test coverage, and also all the tests pass, and also something looked at the tests and thought they were reasonable, now you can probably click that merge button without quite as much fear.
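A toy sketch of what that richer merge signal might look like when combined in code. All the inputs here are hypothetical; in a real pipeline they would come from your CI run, your coverage report, and an LLM review step, and the thresholds are just placeholders.

```python
# Toy sketch: combine several signals (tests pass, coverage, an LLM's read on
# the new tests) into a single "how nervous should I be about merging?" answer.
from dataclasses import dataclass

@dataclass
class MergeSignals:
    lines_changed: int
    tests_passed: bool
    coverage_percent: float
    tests_look_reasonable: bool  # e.g., an LLM reviewed the generated tests

def merge_recommendation(s: MergeSignals, min_coverage: float = 90.0) -> str:
    if not s.tests_passed:
        return "do not merge: tests failing"
    if s.coverage_percent < min_coverage:
        return f"review carefully: coverage {s.coverage_percent:.0f}% is below {min_coverage:.0f}%"
    if not s.tests_look_reasonable:
        return "review carefully: tests were not sanity-checked"
    return f"low risk: {s.lines_changed} lines changed, tests pass, coverage {s.coverage_percent:.0f}%"

print(merge_recommendation(MergeSignals(412, True, 100.0, True)))
```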
And sixth, we try to make it easy to run tests in sandboxes and without secrets as much as possible.
This uh makes it a lot easier to
actually fix things and makes it a lot
easier to make sure that you're not
accidentally causing problems or making
flaky
tests. The third thing that we can do to
detect errors is ask an LLM. There are
many different things that we can check
for, including if there are issues
before you commit with your current
change, if the thing that you're trying
to do even makes sense, if there are
issues in the current branch you're
working on, if there are violations of
rules in your style guide or in your
architecture documents, if there are
details that are missing from the specs,
if the specs aren't implemented, if
they're not well tested, or whatever
other custom things that you want to
check for. One of the things that we're
trying to enable in Sculptor is for
people to extend the checks that we have
so that they can add their own types of
best practices into the codebase and
make sure that they are continually
checked. After you've found issues, then
you have to fix them. Very little of
this talk is about fixing the issues
because it ends up being a lot easier
for the systems to fix issues than you
would expect. I think this quote captures it relatively well: a problem well-stated is half-solved. What
this means is that if you really
understand what went wrong, then it's
much easier to solve the problem. This
is especially true for coding agents
because the really simple strategies
work really well. So even just try
multiple times, try a hundred times with
a different agent, it actually ends up
like working out quite well. And one of
the things that enables this is having
really good sandboxing. If you have
agents that can run safely, then you can
run an almost unlimited number subject
to cost constraints uh in parallel. And
then if any one of them succeeds, then
you can use that
solution. And this is really just the
beginning. There are going to be so many
more tools that are released over the
next year or two. And many of the people
in this room are working on those tools.
There will be things that are not just
for writing code like we've been talking
about, but for after deployment, for
debugging, logging, tracing, profiling,
etc. There are tools for doing automated
quality assurance where you can have an
AI system click around on your website
and check if it can actually do the
thing that you want the user to do.
There are tools for generating code from
visual designs. There are tons of dev tools coming out every week. You will
have much better contextual search
systems that are useful for both you and
for the agent. Uh and of course we'll
get better AI based models as well. If
anyone is working on these other sorts
of tools that that are kind of adjacent
to developer experience and helping you
fix this like much smaller piece of the
process, we would love to work together
and find out a way to integrate that
into Sculptor so that people can take
advantage of that. I think what we'll
see over the next year or two is that
most of these things will be accessible.
Uh, and it'll make the development
experience just a lot easier once all
these things are working
together. So, that's pretty much all
that I have for today. If you're
interested, feel free to take a look at
the QR code, go to our website at
imbue.com and sign up to try out
Sculptor. And of course, if you're
interested in working on things like
this, we're always hiring. We're always
happy to chat, so feel free to reach
out. Thank you.
Thank you, Josh. I highly recommend
picking Josh's brain. I'm sure he'll be
around. Find him in the hallways. It's
been great. Had countless conversations
with Josh. And, you know, just to say
once again, what a day. It's been a
fully jam-packed day. We have had eight back-to-back speakers talking about SWE agents. We started with, you know, the originals: GitHub Copilot, the original coding co-pilot. Then we went to the latest and the greatest, right? We've had OpenAI's Codex speak. We've had Claude Code
speak. We've had Jules from Google
speak. Then we went a little bit into,
okay, how do I actually start using
these things in production? How do I go
past Vibe coding? How do I kind of, you
know, let's walk through a case study of
how we really build these things. And
now for our last talk in the SWE agents
track we have someone who is not
building an agent. We have Eno Reyes
here from factory and he is actually
building droids. What does this mean?
It's not just hype. Eno is actually
working on droids. He is from one of the companies, Factory AI, that is actually shipping this stuff in production. They are actually in the enterprise. They are growing like crazy. He's been on the Latent Space podcast recently, and they're actually doing this stuff. So, you know, he is a great speaker. He's spoken
for bigger audiences than this. And you
know, without any further ado, I want to
pass it on to Eno.
[Applause]
Hi everybody. My name is Eno. I really
appreciate that introduction. Um, and
maybe I can start with a bit of
background. Uh, I started working on LLMs about two and a half years ago, when GPT-3.5 was coming out and it became
increasingly clear that agentic systems
were going to be possible with the help
of LLMs. At factory we believe that the
way that we use agents in particular to
build software is going to radically
change the field of software
development. We're transitioning from
the era of human-driven software
development to agent driven development.
You can see glimpses of that today. You
guys have already heard a bunch of great
talks about different ways that agents
can help with coding in particular.
However, it seems like right now we're
still trying to find what that
interaction pattern, what that future
looks like. And a lot of what's publicly
available is more or less an incremental
improvement. The current zeitgeist is to
take tools that were developed 20 years
ago for humans to write every individual
line of code. um and ultimately tools
that were designed first and foremost
for human beings. Uh and you sprinkle AI
on top and then you keep adding layers
of AI and then at some point maybe
there's some step function change that
happens. But there's not a lot of
clarity there in exactly what that
means. You know, there's a quote that is
attributed to Henry Ford. Uh if I had
asked people what they wanted, they
would have said faster horses. Now, we
believe that there are some
fundamentally hard problems blocking
organizations from accessing the true
power of AI. This power can only be
found when your team is delegating the
majority of their tasks across the
software life cycle to agents.
To do that, you need a platform that has
an intuitive interface for managing and
delegating tasks, centralized context
from across all your engineering tools
and data sources, agents that
consistently produce reliable,
highquality outputs, and infrastructure
that supports thousands of agents
working in parallel. These are all hard
problems to solve. But our team has
spent the last two years partnering with
large organizations to build towards
this future. This talk is going to serve
as sort of a deep dive into agent native
development and some of the and a bit of
a share of some of the lessons that
we've learned helping enterprise
organizations make the transition to
agent native
development. When Andrej Karpathy said
English is the new programming language,
he captured this very exciting moment.
Right? And if you're to judge AI
progress based on Twitter, you'd think
that, you know, you can basically vibe
code your way to anything. But vibe
coding isn't the approach to solve hard
problems. You can't vibe code a legacy
Java 7 app that runs 5% of the world's
global bank transactions, right? You
need a little bit more software
engineering. So agents really should not
be thought of as a replacement for human ingenuity, right? Agents are climbing gear, and building production software is
like scaling Mount Everest. And so while
better tools have made this climb more
accessible, we still need to think about
how to leverage them and use our
existing expertise in order to drive
this transformation. I want to start
with a quick video of what's possible
today, right? And so in this you'll see
a quick glimpse of what it's like to
delegate a task to an agentic system.
You can watch the droid as we call them
ingest the task and start grounding
itself in the environment. It uses tools
to search through the codebase,
determine the git branch, check out what
the machine has available to it. It
looks through recent changes to the
codebase. It looks at memories of its
recent interactions with users as well
as memories from its interactions across
the entire organization. And then the
droid comes back with a plan and says,
"Here's exactly what I'm going to do,
but I'd like you to clarify a couple of
things. Right? We need to expect our
agents to not just take what we say at
face value, but instead question it and
make us better software developers." And
so after the user comes back with that
info, the droid comes, it executes on
that task. It leverages its tools to
write code, runs pre-commit hooks,
lints, and ultimately generates a pull
request that passes
CI. But how can you achieve outcomes
like this on a regular basis? Right?
It's nice when it works, but what about
when it fails? At the heart of effective
AI assisted development lies a very
fundamental truth. AI tools are only as
good as the context that they receive.
So much of what people are calling
prompt engineering is really mentally
modeling this alien intelligence that
has a slice of context of the real
world. And if you start thinking about
your AI tools this way, you're going to
start to get a lot better at interacting
with them. We've investigated thousands of droid-assisted development sessions, and you see this sort of heuristic emerge: where AI is most likely failing to solve the problem, it's not because the LLMs aren't good enough, but because it's missing crucial context that's required to truly
solve it. And better models are going to
make this happen less often. But the
real solution is not just making the AI
smarter. It's going to be getting better
at providing these systems with that
missing
context. LLMs don't know about your morning standup. They don't know about the ad hoc meeting that you had and the whiteboarding that you did, right? But you
can give those things to the LLM if you
transcribe your notes, if you take a
photo and you upload it. Right? You have
to start thinking about these things not as tools but as something in between a co-worker and a platform, right? And if you can get that context
that lies in the cracks between systems,
you use platforms that integrate
natively with all of your data sources
and you have agents that can actually
make use of those things, you can start
actually driving this transition to
agent native
development. I want to talk a bit as
well about planning and design. When your organization is doing agent-native development, then you are using agents at every stage. Droids don't just write
code. They can help with that part, but
the hardest thing about software
development is not the code. It's about
figuring out exactly what to build. Here
you can watch a droid as it's tasked
with trying to find the most up-to-date
information about a new model release
and integrate that into an existing chat
application. It's going to leverage
internet search, its knowledge of your
codebase, its understanding of your product goals from its organizational memory, and its understanding of your technical
architecture from the design doc you
wrote last week. Planning with AI is
fundamentally different from planning
alone. It's not necessarily just asking
please build this thing for me or give
me the design doc but instead it's about
delegating the groundwork and the
research to AI agents then using a
collaborative platform to interact and
explore possibilities together. That is
how you get better at planning with
agents. Now you can see here we have a
nice document a nice plan. You could
export that to notion, Confluence, Jira,
any of your integrations with no setup
because MCP is great, but having every
developer have to install a bunch of
servers, click a bunch of things, pass
around the API key is not necessarily
ideal. And so platforms are going to
evolve and solve a lot of these
problems. But in the meantime, you do
have droids. And now a little bit more
on this. The real unlock for AI transforming your organization with respect to planning is going to be when you start standardizing the way that your organization thinks, right? And so
there's a bit of an example that we just had a couple of weeks ago while we
were planning out uh a feature related
to our cloud development environments.
We got a lot of feedback from users and
so we had about three months of user
transcripts, people from enterprises,
uh, individuals that we knew. Uh, we
transcribe every single interaction and
meeting at factory. We take those notes
and we combine them with a droid that
has access to our architecture. We take an ad hoc meeting that one of our engineers took a Granola transcript of. If you guys use Granola, I love that tool. Um, and
we throw that all to the knowledge droid. We don't say, "Let's plan the feature out." We say, "Could you
find any patterns in the customer
feedback that map up to our assumptions?
Can you highlight any technical
constraints with what we have today that
might help us make this better?" And
then we take all of that output, those
documents, there's maybe four or five
intermediate results here, and that's
what we use to start iterating on a
final PRD that helps us outline the full
feature. You can take that PRD, and if
you have a droid that has access to
linear and Jira with tools to create
tickets, create epics, modify those
things, then that PRD can be turned into
a road map: eight tickets, this ticket's dependent on that ticket, but ultimately work that can be parallelized amongst a group of eight code droids, right? And
so this is how software is going to
evolve. We're going to move from
executing to orchestrating systems that
work on our
behalf. I talked about a couple of these. I think PRDs, design docs, RCA templates, quarterly engineering and product road maps, right? Transcriptions
of your meetings. Normally, you might
see this stuff as a burden, but when
your company is doing agent-native
software development, your process and
your documentation is a knowledge base
and a map for your droids to learn and
imitate the way that your team thinks.
This documentation and process is a
conversation with both future developers
as well as future AI systems. And so if
you can communicate that why behind the
decision, that context for those future
developers and agents, then you'll start
to see that there's a huge lift in their
ability to natively work the way that
your team actually
works. I want to talk about uh
agent-driven development with respect to
site reliability
engineering. There is a lot that goes into a real incident response. It would be crazy for me to go up here and say you could actually just automate all SRE and RCA work today. But there is a
difference in the AI agent-driven
approach. Right here, we're watching a droid take a Sentry incident and convert
it into a full RCA and mitigation plan.
Traditional incident response is
effectively solving a puzzle. The pieces
are scattered across dozens of systems.
Logs in one place, metrics in another,
historical context somewhere else.
There's knowledge in your team's head.
Droids in your organization
fundamentally change this, right? When
an alert triggers, you can pull in
context from relevant system logs, past incidents, runbooks in Notion or Confluence, team discussions from Slack.
And you can see that a droid that has
the tools and the ability to access this
can condense that search effort from
hours to minutes. And so really, the acceptable time to act for a standard enterprise organization is really going to be zero. Right? The
moment that an incident happens, you
should have a droid that's telling you
exactly what happened, exactly how to
fix it. And the thing that gets
interesting is when you have user and
organization level memory, you really
start to build a model of what your
team's response patterns and common
issues are. And so it's not just
generating runbooks or generating a
mitigation for one incident, right? but
creating new processes that help solve
some of these
issues. And once you've written that RCA, right, you can move on to generate runbooks for those newly learned patterns, update existing response workflows, and capture team knowledge that gets shared automatically without the need for manual curation. And this is why all these
things are connected. Agent-native incident response is a part of a larger
learning cycle that happens when you
start to integrate agents into the
workflow. We're seeing teams that are
able to cut incident response time in
half because context is immediate.
They're able to reduce repeat incidents
because the third time something
happens, the droid starts to say, "Maybe
we should fix this." And they're able to
improve team collaboration because when
a new engineer joins the team and says,
"How do we do this?" It's already in
memory. They can just ask the droid how
we do this. And so, most importantly,
what we're seeing in general is a shift
from reactive to predictive operations
because you can now start to really see
the patterns across the entire
operational history. And agentic systems
turn each of these incidents into an
opportunity to make the entire system
far more
reliable. AI agents are not replacing
software engineers. They're
significantly amplifying their
individual capabilities. The best
developers I know are spending far less
time in the IDE writing lines of code.
It's just not high leverage. They're
managing agents that can do multiple
things at once that are capable of
organizing the systems and they're
building out patterns that supersede the
inner loop of software development and
they're moving to the outer loop of
software
development. They aren't worried about
agents taking their jobs. They're too
busy using the agents to become even
better at what they do. The future
belongs to developers who understand how
to work with agents, not those who hope
that AI will just do the work for them.
And in that future, the skill that
matters most is not technical knowledge
or your ability to optimize a specific
system, but your ability to think
clearly and communicate effectively with
both humans and
AI. Now, if you find any of this
interesting and you want to try the
droids, I'm happy to share that everyone
here uh at this talk can use this QR
code uh to sign up for an account. Our
mobile experience is not optimized yet,
but the droids are on that. And so I'd
recommend trying this on a laptop, but
you will get 20 million free tokens credited to your account. Um, and I also
want to add that uh you know, first and
foremost, Factory is an enterprise
platform, right? And so if you're thinking about security, if you're thinking about where are the audit logs, whose responsibility is it when an agent goes and runs rm -rf recursively on your codebase, right? Droids don't do that. But if one were to, right, whose responsibility is that?
Then these are the types of questions
that we're interested in and that we're
helping large organizations solve today.
And so if you're a security
professional, if you're thinking about
ownership, auditability,
indemnification, if you're a lawyer,
right, these are the types of questions
that you should start asking today
because yolo mode is probably not the
best thing to be running inside your
enterprise, right? And so give it a
scan, give it a try, check out some of
the controls we have. Um, and if you
have any questions, feel free to reach
out via email. Thanks.
[Applause]
Awesome. Thank you, Eno. What a day of talks, everyone. That's our eight back-to-back sessions of SWE agent talks.
Okay, logistics. So, this is the main
keynote room. We're going to be back here at around 3:40 for our ending keynotes. Feel free to, you know, stay,
hang out. It's not that long from now.
You have about 20 minutes. If you're
interested, there's some expo talks
going on. Feel free to check out the
expo booths, but please do stay. Um,
after the keynotes, we have a few more great keynote talks lined up.
Everyone will come back to the keynote
room. And then we have a few surprises.
So, one thing very special, last week we
held a hackathon. We held an AI engineer hackathon. And the finalists of
the hackathon have not got their awards
yet. They have been spending a week to
work a little bit further on their
project. They're going to come here and
demo on stage and we're going to pick
the winners. There's $10,000 of prizes
on the line. So, we're going to see some
hackathon demos. And of course, at the
end, we want to thank our speakers. We
have a special trophy ceremony and we
need your help to determine who your
favorite speakers were. For the SWE agent track, we're going to reach out.
We're going to have a poll for whoever
your favorite speaker is. Please, please
vote alongside the keynotes, the other
tracks for anything that you've
attended. Please let us know your
favorite speakers. So, thank you all for
coming. It's been a great talk, a great
list of talks, and we hope to see you
back soon. So, once again, 3:40 we're
going to kick off here with keynotes,
speaker prizes, and hackathon judging.
Thank you everyone.
[Music]
Ladies and gentlemen, please welcome
back to the stage, the VP of developer
relations at Llama Index, Laurie Voss.
[Music]
Hello everybody. Welcome back. How's
everybody had a good conference day
today?
All right, so for this next bit, I'm
going to try an experiment. There's four
sort of blocks of you uh separated by
aisles, and so I'm going to divide you
into teams. You are team A. You are team
B. You are team C. You are team D. Let
me hear it from team
A. Team
C. Team
B. Team
D. Team A again. All right. I'm not
going to do anything with that. That's
just to wake you
up. Uh we have some great keynotes lined
up this afternoon. Uh we're going to
hear the results of the state of AI
engineering survey. Uh and if you know
anything about me, you know that I love
data. I love a good survey. It's my
favorite thing to hear about. Uh we're
going to hear stories about building
open router. Uh and we're going to hear
Shawn Grove tell us why prompt
engineering is dead, which is sure to be
spicy. Uh but our first keynote this
afternoon is trends across the AI
frontier. Uh so please welcome to the
stage uh co-founder of artificial
analysis George
[Applause]
Cameron. Hi everyone. I'm George,
co-founder of Artificial Analysis. A
quick background to who we are before we
dive into
things. Do you see that?
Sorry, I think my clicker is not
working. Oh, there we go. Great. So, a
quick background to who we are. We're a
leading independent AI benchmarking
company. We benchmark a broad spectrum
across AI. So, we benchmark models for
their intelligence. We benchmark API
endpoints for their speed, their cost.
We also benchmark uh hardware and all
the AI accelerators out there. Uh and we
also benchmark a range of modalities,
not just language, but also vision,
speech, image generation, video
generation. And we publish essentially
nearly all of it uh for free on our
website, artificialanalysis.ai, where we benchmark over 150 different models uh across a range of
metrics. We also publish reports many of
which are publicly accessible and we
also have a subscription for enterprises looking to bring AI to production in their environments in an efficient and effective way.
Let's start off with AI progress. Let's
set the scene. So, it's been a crazy two
years. I think that we've all felt it in
this room, whereby OpenAI kicked off the race with the ChatGPT and GPT-3.5 launch. And since then, it's only
gotten more hectic. There's been more
and more uh model releases by more and
more labs pushing the AI
frontier. So the current state now of
frontier AI intelligence. I think this
will be this order of models will be
familiar to a lot in this room. o3 is the leader, followed closely by o4-mini with reasoning mode high, DeepSeek R1 (the release from the last week or two), Grok 3 mini (reasoning high), Gemini 2.5 Pro, and Claude 4 Opus (thinking). This benchmark is our
Artificial Analysis Intelligence Index. It's a composite index of seven evaluations, which we then weight to develop the index, and it just provides a generalist perspective on the intelligence of these models.
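As a toy illustration of how a weighted composite index like that can be computed (the evaluation names, scores, and weights below are invented for illustration; they are not Artificial Analysis's actual evaluations, weights, or methodology):

```python
# Toy sketch of a weighted composite intelligence index: a weighted average of
# per-evaluation scores. All names and numbers are made up for illustration.

def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-evaluation scores (each score on a 0-100 scale)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

example_scores = {
    "reasoning_eval": 82.0,
    "coding_eval": 74.0,
    "math_eval": 68.0,
    "knowledge_eval": 88.0,
}
example_weights = {
    "reasoning_eval": 2.0,
    "coding_eval": 2.0,
    "math_eval": 1.0,
    "knowledge_eval": 1.0,
}

print(round(composite_index(example_scores, example_weights), 1))  # 78.0
```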
We all have an understanding of what
frontier AI intelligence is. But what I
want to explore with you today is that
there's more than one frontier in AI.
There's tradeoffs to accessing this
intelligence. You shouldn't always use
the leading most intelligent model. And
so what we want to do is we want to
explore the different frontiers out
there. And as an AI benchmarking company, we're going to bring some numbers to the fore to help you reason about this. First, we'll be looking at
reasoning models. Next, we'll be looking
at the open weights frontier. Third, the
cost frontier. And lastly, the speed
frontier. There's other frontiers out
there that we benchmark, but we'll focus
on these key ones today.
Starting with reasoning models, what
we've done here is we've taken our
intelligence index and looked at that
relative to the output tokens used to
run the intelligence index. So we've
measured all of how many tokens each
model took to run our seven
evaluations and we've plotted it on this
chart and you can see two distinct
groups. It's helpful to think about
these separately. So non-reasoning
models which offer less intelligence but
uh require fewer output
tokens, and reasoning models, which use more output tokens but offer greater intelligence. This is important to look at because more output tokens come with trade-offs, both for request latency as well as cost. We're going to bring some numbers to draw that out and look at the real differences here, just how yappy these reasoning
models are. We can see that there's an
order of magnitude difference between
reasoning and non-reasoning models. It's
not just that feeling, oh, this is
taking a long time. It's real. It's an
order of magnitude. So GPT-4.1 required 7 million tokens to run our intelligence index evaluations, but then o4-mini (high) took 72 million tokens, and the yappiest of them all, Gemini 2.5 Pro, took 130 million tokens to run our intelligence index. And as mentioned, this has implications for cost as well as end-to-end latency and responsiveness.
So looking at latency, we benchmark the
API latency of how long it takes to
receive a response when accessing these
models via their
APIs. Here we can see that GPT-4.1, on median across our requests, took 4.7 seconds to return a full response. o4-mini (high) took over 40 seconds, roughly another 10x or order of
magnitude increase. This has
implications for applications and users which require responsiveness, even enterprise kind of chatbots. You don't always reach for o3 in ChatGPT. And Facebook's done a lot of studies on this for consumer apps, where they've looked at user drop-off by application latency, which clearly demonstrate this. Sorry, do you mind if
we jump back a
slide? And it also has implications for how we're building. So I think particularly with agents, whereby 30 queries in succession is not uncommon, it has a multiplier effect on the latencies for your application and how you can build. If you have faster responses, maybe you can make that 30 queries 100 queries, for
instance. And so, putting numbers to that, in terms of agents, 30 queries is normal. And so, even at less than o4-mini, you're at 10 seconds for a reasoning model. If you're running 30 queries, that's 300 seconds that a user or an application might be waiting for a response. That's 5 minutes. With the orders of magnitude that we're dealing with here, if that 10 seconds was 1 second, then those 30 queries take 30 seconds. 30 seconds versus 5 minutes impacts what you can build. Think of a contact center application: maybe 30 seconds is okay there, but 5 minutes,
definitely not. Who likes waiting on the
phone that long, or imagine if you had to wait that long on Google each time that you wanted to use a function. It impacts how we can build with these
models. And so I think bringing numbers
to these trade-offs is really important.
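To make that multiplier concrete, here is the arithmetic from the example above as a tiny sketch (the 10-second and 1-second per-call figures are the rough numbers just quoted; the rest is just multiplication):

```python
# End-to-end wait for an agent that makes sequential model calls:
# total wait = number of calls x per-call latency (ignoring other overhead).
def agent_wait_seconds(num_calls: int, per_call_latency_s: float) -> float:
    return num_calls * per_call_latency_s

for latency in (10.0, 1.0):  # ~10 s for a slower reasoning model vs ~1 s for a fast one
    total = agent_wait_seconds(30, latency)
    print(f"{latency:>4.0f} s per call -> {total:.0f} s total ({total / 60:.1f} min) for 30 calls")
```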
I'd encourage everybody to measure
them. Next, we're going to move to the
open weights
frontier. Around the time of GPT-4, there was a huge delta between open weights intelligence and proprietary intelligence. Llama 65B or Llama 2 70B wasn't close to the intelligence of GPT-4.
What I'd like to show here, where we plot our intelligence index by release date, is that that gap closed, with great models like Mixtral 8x7B and Llama 405B, until o1 broke away in late 2024. But then, of course, I think we remember DeepSeek released V3, I think December 26th. It ruined some of my Christmas holiday plans. I had to tell my family, I need to go read this paper. It's really exciting. And then of course R1 in
January. The gap between open weights
intelligence and proprietary model
intelligence is less than it's ever
been, particularly with the recent R1 release in the last couple of weeks,
which is only a couple of points
different in our intelligence index to
the leading
models. You can't talk about open
weights intelligence without talking
about China. The leading open weights
models across both reasoning models and
non-reasoning models are from China
based AI labs. DeepSeek's leading in both. Alibaba, with their Qwen 3 series, is coming in second in reasoning. But you also have other labs, such as Meta and Nvidia with their Nemotron fine-tunes of Llama, coming in close as well. Let's look at the cost frontier.
This is really important, and I think, similar to end-to-end latency, it impacts what you can build. So bringing some numbers here, we can really see these orders of magnitude play out. So o3 cost us almost $2,000 to run
our intelligence index. TechCrunch actually wrote an article about how much money we were spending on running it. We didn't want to read it.
You can see GPT-4.1, a great model, is roughly 30 times cheaper in terms of the cost to run our intelligence index compared to o1, and 4.1 nano is over 500 times cheaper to run our intelligence index than o3. You should think about these when building applications. The kind of cost structure of your application might dictate what you can use here, and how you use them. Those 30 sequential API calls for your agentic application could be 500 and still be cheaper than a single o3 query.
A key point to note here, with this cost to run the intelligence index, and why we don't just look at the per-token price, is that (and the labs maybe don't want you to think this way) you're paying for the cost per token, but you're also paying for how verbose the models are: all the reasoning tokens that are output when these models are in their thinking mode. You pay for those as output tokens, even if some of the labs hide them. And so you need to think about this and measure it in your application, and benchmark not just by the cost per million tokens but also considering how many reasoning tokens there are and how verbose these models are. You can see, even amongst the non-reasoning models, there are big differences in how verbose these models are in their responses.
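As a rough sketch of why verbosity matters as much as the per-token price, here is a small Python example; the prices and token counts are made-up placeholders, not Artificial Analysis figures:

```python
# Two models with the same per-token price, but the verbose "thinking" model emits
# ten times as many output (reasoning) tokens, so it costs ten times as much per query.
def query_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

concise = query_cost_usd(1_000, 500,   usd_per_m_input=2.0, usd_per_m_output=8.0)
verbose = query_cost_usd(1_000, 5_000, usd_per_m_input=2.0, usd_per_m_output=8.0)
print(f"concise: ${concise:.4f} per query, verbose: ${verbose:.4f} per query")
```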
So, for instance... ah, we'll go to the next slide. Do you mind if we go back one, please?
So what we've done here is we're now going to look at trends in terms of cost. We've bucketed models by how intelligent they are: intelligence bands, if you will. And what we can see here is that the cost of accessing a GPT-4 level of intelligence has fallen over 100 times since mid-2023. This is the case across all quality bands.
You can see that even when a new quality band, a new frontier, is reached, as with o1-mini in late 2024, within only a few months the cost of accessing that level of intelligence halved. This is moving quickly. And so what I would say to you is: when building applications, think about what you would build if cost weren't a barrier. It's a very important exercise, because it might well be that if you build for a cost structure that doesn't work now, then in six months' time it will be possible and it will be feasible. Next, we're going to look at the speed frontier. This is how quickly you're receiving tokens: the output speed, in output tokens per second, that you're receiving after sending an API request. This has increased dramatically since early 2023 as well.
Similarly, because there's typically a trade-off between intelligence and speed, we've grouped models into buckets, and we can see that they've all increased in terms of how quickly you can access a given level of intelligence. GPT-4, I believe, was around 40 output tokens per second, and that was in 2023. Who remembers hitting enter in ChatGPT (it wasn't a reasoning model) and just waiting for it to output, especially code, which you want to copy straight into your editor and hit run to see if it works? Now you can access that level of intelligence at over 300 tokens per second. I'll go through the drivers; they're not the focus of the talk, but they're important to reference. Model sparsity: we're seeing more mixture-of-experts models, which activate only a proportion of parameters at inference time, so there's less compute per token, which means they can go faster, essentially. They were around back then too, but they're getting more and more sparse, with a smaller proportion of active parameters.
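As a rough illustration of the sparsity point (the numbers below are illustrative, not any specific model's):

```python
# In a mixture-of-experts model only a fraction of the parameters are active per token,
# so compute per token drops roughly in proportion. Illustrative numbers only.
total_params = 600e9   # total parameter count of a large sparse model
active_params = 40e9   # parameters actually used for each token
print(f"active fraction: {active_params / total_params:.1%}")
print(f"rough compute reduction per token: {total_params / active_params:.0f}x")
```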
Next, smaller models: smaller models are getting more intelligent, particularly with distillations, you know, 8B distillations, etc. Then there are inference software optimizations like FlashAttention and speculative decoding. And lastly, hardware improvements: the H100 was faster than the A100, and we've recently launched benchmarks of the B200 on the Artificial Analysis website, and it's getting over a thousand output tokens a second. Think about that relative to the 40 output tokens per second of GPT-4 in 2023. There are also specialized accelerators like Cerebras, SambaNova, and Groq. I want to share a house view here to frame things.
Yes, things are getting more efficient. Yes, the cost of accessing the same level of intelligence is decreasing, and hardware is getting better; we're getting more system output throughput on the chips. But our view is that demand for compute is going to continue to increase. We're going to see larger models (I mean, DeepSeek is over 600 billion parameters, not active but total), and the demand for more intelligence is insatiable. Reasoning models, as we saw, the yappy models, require more compute at inference time. And lastly, agents, where 20, 30, 100-plus sequential requests to models are not uncommon. These are all multipliers on the demand for compute, and so the house view, playing with these numbers, is that net-net we're going to continue to see compute demand increase. Thanks everyone. I'm George from Artificial
[Applause]
[Music]
Analysis. Our next speaker is the founder and CEO of Braintrust and the curator of this year's Evals track. Please join me in welcoming to the stage Ankur Goyal.
[Applause]
Awesome. Excellent. So today we're going to talk a little bit about evals to date and where we think evals are going in the future. Also, for those of you who saw my brother earlier, I'm going to do my best to live up to his energy and charisma. But yeah, it's been an amazing almost two-year journey for us at Braintrust. We have had the opportunity to work with some of the most amazing companies, building, I think, the best AI products in the world. I'm blown away by how many evals people actually run in the product. The average org that signs up for Braintrust runs almost 13 evals a day. Some of our customers run more than 3,000 evals a day. And some of the most advanced companies that are running evals are spending more than two hours in the product every day working through their
evals. And I think one of the things that stands out to me is that while we have customers building some of the coolest, most automated AI-based products and agents in the world, with evals the best thing you can do is look at a dashboard. I think we have a pretty cool dashboard in Braintrust, but still, it's just a dashboard that you look at, and then you walk away and think: okay, what changes can I make to my code or to my prompts so that this eval does better? And I actually think that is all going to
change. So today I'm excited to talk about something called Loop. Loop is an agent that we've been working on for some time now that's built into Braintrust. And it's actually only possible because of evals. Every quarter for the last two years, we've run evals on the frontier models to see how good they are at actually improving prompts, improving datasets, and improving scorers. And until very, very recently, they actually weren't very good. In fact, we think that Claude 4 in particular was a real breakthrough moment: it performs almost six times better than the previous leading model before it.
So, Loop runs inside of Braintrust, and it can automatically optimize everything from your prompts all the way to very complex agents. But just as importantly, it also helps you build better datasets and better scorers, because it's really the combination of these three things that makes for really great evals. This is a little preview of the UI. You can actually start using it today if you are an existing Braintrust user or you sign up for the product; there's a feature flag called Loop that you can just flip on and start using right away. By default it uses Claude 4, but you can actually pick any model that you have access to, whether it's an OpenAI model, a Gemini model, or, if some of you are building your own LLMs, you can use those as well. And as you can see, it runs directly inside of Braintrust. One of the things that we learned from working with a lot of users is how important it is to actually look at data and look at prompts while you're working with them, and we didn't want that to go away when we introduced Loop. So every time it suggests an edit to your data, or a new idea for scoring, or an edit to one of your prompts, you can see that side by side directly in the UI. Of course, for the more adventurous among you, there's also a toggle you can turn on that says "just go for it," and it will go and optimize away, which actually works really well. So just to recap: to date,
well. So just to recap, uh, to date,
EVELs have been a critical part of
building some of the best AI products in
the world, but the task of actually
doing evaluation has been incredibly
manual. And I'm excited about how over
the next year uh eval themselves are
going to be completely revolutionized by
the latest and greatest that's coming
out um from you know the frontier models
themselves and we're very excited to
incorporate that into Brain Trust.
Please if you're not already using the
product, try it out. Uh try out Loop,
give us your feedback. Uh we have a lot
of work to do. Um and we'd love to talk
to you. We're also hiring. Uh so if
you're interested in working on this
kind of problem, whether it's the UI
part of it, the AI part of it, or the
infrastructure uh side of it, we'd love
to talk to you. Um you can scan this QR
code. Uh it should be over there. Yeah,
you can scan the QR code and and get in
touch with us. Uh we'd love to chat.
Thank
[Music]
you. Our next presenter will provide us some perspectives on the state of AI engineering. Please join me in welcoming to the stage Barr Yaron. All right. Hi everyone.
Uh, thank you for having me here and
huge thanks to Ben, to Swix, to all the
organizers who've put so much time and
heart into bringing this community
together.
[Applause]
Yeah. All right. So, we're here because
we care about AI engineering and where
this field is headed. So, to better
understand the current landscape, we
launched the 2025 state of AI
engineering survey. And I'm excited to
share some early findings with you
today.
All right, before we dive into the results, the least interesting slide. I don't know everyone in this audience, but I'm Barr. I'm an investment partner at Amplify, where I'm lucky to invest in technical founders, including companies built by and for AI engineers. And with that, let's get into what you actually care about, which is enough Barr and more bar charts. And there are a lot of bar charts coming up.
Okay, so first our sample. We had 500
respondents fill out the survey,
including many of you here in the
audience today and on the live stream.
Thank you for doing
that. And the largest group called
themselves engineers, whether software
engineers or AI engineers. While this is
the AI engineering conference, it's
clear from the speakers, from the
hallway chats, there's a wide mix of
titles and roles. You even let a VC
sneak
in. Um, so let's test this with a quick
show of hands. Raise your hand if your
title is actually AI engineer at the AI
engineering conference. Okay, that is
extremely
sparse. Uh, raise your put your hands
down. Raise your hand if your title is
something else entirely. So that should
be almost everyone. Keep it up if you
think you're doing the exact same work
as many of the AI engineers. All right, so this sort of tracks: titles are weird right now, but the community is broad, it's technical, and it's growing. We expect the AI engineer label to gain even more ground. I couldn't help myself: a quick Google Trends search shows the term "AI engineering" barely registered before late 2022. We know what happened: ChatGPT launched, and the momentum of AI engineering interest has not slowed since. Okay, so people had a wide
variety of titles but also a wide
variety of experience. Uh the
interesting part here is that many of
our most seasoned developers are AI
newcomers. So among software engineers
with 10 plus years of software
experience, nearly half have been
working with AI for three years or less
and one in 10 started just this past
year. So change right now is the only
constant even for the
veterans. All right, so what are folks
actually building? Let's get into the
juice. So more than half of the
respondents are using LLMs for both
internal and external use cases. What was striking to me was that three out of the top five models, and half of the top 10 models, that respondents are using for those external, customer-facing products are from
OpenAI. The top use cases that we saw
are code generation and code
intelligence and writing assistant
content generation. Maybe that's not
particularly surprising. Uh, but the
real story here is heterogeneity. So 94%
of people who use LLMs are using it for
at least two use cases. 82% using it for
at least three. Basically, folks who are
using LLMs are using it internally,
externally, and across multiple use
cases. All right. So you may ask, how
are folks actually interfacing with the
models and how are they customizing
their systems for these use cases? Besides few-shot learning, RAG is the most popular way folks are customizing their systems: 70% of respondents said they're using it. The real surprise for me here (I'm looking to gauge surprise in the audience) was how much fine-tuning is happening across the board. It was much
the sample we have researchers and we
have research engineers who are the ones
doing fine-tuning by far the most. We
also asked an open-ended question for
those who were fine-tuning. What
specific techniques are you using? So, here's what the fine-tuners had to say: 40% mentioned LoRA or QLoRA, reflecting a strong preference for parameter-efficient methods. And we also saw a bunch of different fine-tuning methods, including DPO and reinforcement fine-tuning. And the most popular core training approach was good old supervised fine-tuning.
Many hybrid approaches were listed as well.
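For readers who haven't used these methods, here is a minimal sketch of parameter-efficient fine-tuning with LoRA using Hugging Face's peft library; the base model name and target modules are placeholders, and respondents' actual setups (QLoRA, DPO, RFT, plain SFT) of course vary:

```python
# Minimal LoRA setup: wrap a base causal LM with low-rank adapters so only a small
# fraction of parameters are trained. Model name and target modules are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to these projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
# From here you would run ordinary supervised fine-tuning, e.g. with transformers' Trainer.
```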
Moving on to updating systems: sometimes it can feel like new models come out every single week. Just as you've finished integrating one, another one drops with better benchmarks and a breaking change.
So, it turns out more than 50% are
updating their models at least monthly,
17%
weekly. And folks are updating their
prompts much more frequently. So 70% of
respondents are updating prompts at
least monthly and 1 in 10 are doing it
daily. So it sounds like some of you have not stopped typing since GPT-4 dropped. But I also understand; I have empathy. You see one blog post from Simon Willison and suddenly your trusty prompt just isn't good enough anymore. Despite all of these prompt changes, a full 31% of respondents don't have any way of managing their prompts. What I did not ask is how AI engineers feel about not doing anything to manage their prompts, so we have the 2026 survey for that. We also asked folks across the
that. We also ask folks across the
different modalities who is actually
using these models at work and is it
actually going well? And we see that
image, video, and audio usage all lag
text usage by significant
margins. I like to call this the multimodal production gap, because I wanted an animation. And this gap still persists when we add in folks who have these models in production but have not garnered as much
is when we add the folks who are not
using models at all in this chart too.
So here we can see folks who are not
using text, not using image, not using
audio or not using video. And we have
two categories. It's broken down by
folks who plan to eventually use these
modalities and folks who do not
currently plan
to. You can roughly see this ratio of no
plan to adopt versus plan to adopt.
Audio has the highest intent to adopt.
So 37% of the folks not using audio
today have a plan to eventually adopt
audio. So get ready to see an audio
wave. Um, of course, as models get
better and more accessible, I imagine
some of these adoption numbers will go
up even
further. All right, so we have to talk
about agents. One question I almost put
in the survey was, "How do you define an
AI agent?" But I thought I would still
be reading through different responses.
Uh so for the sake of clarity, we
defined an AI agent as a system where an
LLM controls the core decision-making or
workflow. 80% of respondents say LLMs
are working well at work, but less than
20% say the same about
agents. Agents aren't everywhere yet,
but they're coming. Uh the majority of
folks uh may not be using agents, but
most at least plan to. So, fewer than
one in 10 say that they will never use
agents. All to say that people want
their agents. And I'm probably uh
preaching to the
choir. The majority of agents already in production do have write access, typically with a human in the loop, and some can even take actions
independently. So um excited as more
agents are adopted to learn more about
the tool permissioning that folks uh
have access
to. If we want AI in production, of
course, we need strong monitoring and
observability. So, we asked, do you
manage and monitor your AI systems? This
was a multi- select question. So, most
folks are using multiple methods to
monitor their systems. 60% are using
standard observability. Over 50% rely on
offline eval. And we asked the same
thing for how you evaluate your model
and system accuracy and quality. So
folks are using a combination of methods
including data collection from users,
benchmarks, etc. But the most popular at
the at the end of the day is still human
review. Um, and for monitoring their own
model usage, most respondents rely on
internal
metrics. So storage is important too.
Where does the context live? How do we
get it when we need it? 65% of respondents are using a dedicated vector database, which suggests that for many use cases specialized vector databases are providing enough value over general-purpose databases with vector extensions. Among that group, 35% said that they primarily self-host and 30% primarily use a third-party provider. All right, I think we've been
provider. All right, I think we've been
having fun this whole time, but we're
entering a section I like to formally
call other fun stuff. Uh I spent hours
workshopping the name. So, we asked AI
engineers, should agents be required to
disclose when they're AI and not human?
Most folks think yes, agents should
disclose that they're AI. Uh, we asked
folks if they'd pay more for inference
time compute, and the answer was yes,
but not by a wide margin. And we asked
folks if transformer-based models will
be dominant in 2030, and it seems like
people do believe that attention is all
we'll need in 2030.
Uh the majority of respondents also
think open source and closed source
models are going to converge. So I will
let you debate that after. Um no
commentary needed here. So uh the
average or the mean guess for the
percentage of US Gen Z population that
will have AI girlfriends, boyfriends is
26%. Um I don't really know what to say
or expect here, but we'll see. Uh we'll
see what happens
in a world where folks don't know if they're being left on read or just facing latency issues, or, of course, the dreaded "it's not you, it's my algorithm." And finally, we asked folks,
what is the number one most painful
thing about AI engineering today? And
evaluation topped that list. So it's a good thing this conference and the talk before me have been so focused on evals, because clearly they're causing some serious pain.
Okay. And now to bring us home, I'm
going to show you what's popular. So, we
asked folks to pick all the podcasts and
newsletters that they actively learn
something from at least once a month.
And these were the top 10 of each. So,
if you're looking for new content to
follow and to learn from, this is your
guide. Uh, many of the creators are in
this room. So, keep up the great work.
And I'll just shout out that Swix is
listed both on popular newsletter and
popular podcast for latent space. Uh, so
I will just leave this here.
Um, I think that's enough bar charts and
bar time, but if you want to geek out
about AI trends, you can come find me
online in the hallways. Uh, we're going
to be publishing a full report next
week. Uh, I'll let Elon and Musk have
Twitter today, but um, it's going to
include more juicy details including
everyone's favorite models and tools
across the stack. Thank you for the
time. Enjoy the
afternoon.
[Music]
Our next presenter co-founded OpenSea, the first NFT marketplace, and grew it to over $4 billion in monthly volume from 2017 to 2022. He then founded OpenRouter in 2023, the first LLM aggregator and distributor, processing over two trillion tokens weekly across over 400 unique language models. He's here to tell us fun stories from building OpenRouter and provide some predictions on where all this is going. Please join me in welcoming to the stage Alex Atallah.
[Music]
[Applause]
All right. Um, I can't go back. Well, when I started OpenRouter at the beginning of 2023, I had one major question in mind. I was looking at this new market that was coming online, and it was incredible. At the very end of 2022, we all saw ChatGPT and I got bitten by the AI bug. And I decided to look into answering this question: will this market be winner-take-all? Inference might be the largest market ever in software, and this seemed like a critical thing that everybody was assuming the answer to, and assuming the answer would be yes. OpenAI was just far and away the leading model. There were a few others coming up on its tail, and I built a couple of prototypes to look into what they could be good for, and I also wanted to investigate open source. So in this talk, which Swix named, I'm going to talk about the founding story of OpenRouter, go through a little bit of the hoops that we jumped through, and cover the investigation that we did as we put together this product that started as an experiment and kind of evolved into a marketplace over time.
In January, we saw the first signs of people wanting other types of models. The first evidence was moderation. There was a very clear interest from users in looking for models where they could understand whether they'd be deplatformed, or what the moderation policy of the company was. And we saw some people generating novels where, say, it would be a detective story, and in chapter 4 the detective would find someone who commits a murder and shoots the victim, and OpenAI at the time sometimes refused to generate that output, or it was questionably against the terms of service. And of course we saw role play, and basically a big gray area emerged around what models were willing to generate. So in the next month, we saw the open-source race begin.
I'm going to do a little bit of an OG test here: raise your hand if you ever used BLOOM 176B. There are about 10 hands raised. Or OPT by Facebook? This was one of the earliest open-source language models; about five hands raised. There were a couple of these emerging, and there were some very interesting projects to help people access them, but in the early days they weren't really useful for very much. So we
really useful for very much. So, uh, we
kept digging and, uh, and eventually
like the open source community, um,
round like ran into Meta's first launch,
which was Llama 1 in in February. And
Llama 1 in their abstract advertised
that it outperformed GPT3 on most
benchmarks. You can see the highlighted
part here, which blew everyone away.
This was huge. an open weights model
better than
GPT3 and uh and especially a smaller
model. This was the 13 billion parameter
version, one that you could run on your
laptop. Um outperforming a large server
only like you know tons of money
required to run inference companies
model it and it was beating it on some
benchmarks. Everyone lost their minds
and llama kicked off a huge storm. it
still was not very useful. I have to say
it was like a text completion model for
the most part and it was very difficult
to run locally. The infrastructure just
wasn't there. Um and people were
struggling to figure out what to do with
it which is when we found when we had
the greatest moment of all I think for
the birth of the long tale of language
models which was the first successful
distillation in March of 2023.
Alpaca. A group at Stanford took Llama 1, generated a bunch of outputs from GPT-3, fine-tuned Llama 1 on those outputs, and created Alpaca for less than $600 in total. And this was an incredible moment. It was the first time I saw the transference of both style and knowledge from a large model onto a small one. And this was a huge unlock, because it meant that not only do you not need a $10 million training budget to create your own models, but you could also, for the first time, make unique data available as a service in the form of a language model.
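A minimal sketch of that Alpaca-style recipe, assuming any chat-completions API as the teacher; model names and seed instructions are placeholders:

```python
# Step 1 of Alpaca-style distillation: sample responses from a strong teacher model and
# save (instruction, response) pairs. Step 2 (not shown) is supervised fine-tuning of a
# small open-weights student on this file.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-completions endpoint works
seed_instructions = ["Explain what a mutex is.", "Write a haiku about compilers."]

with open("distill_data.jsonl", "w") as f:
    for instruction in seed_instructions:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder teacher model
            messages=[{"role": "user", "content": instruction}],
        )
        pair = {"instruction": instruction, "response": resp.choices[0].message.content}
        f.write(json.dumps(pair) + "\n")
```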
And I immediately began to wonder: there are going to be tens of thousands of these, maybe hundreds of thousands. And they seem incredibly important; this is knowledge finally being distilled into software. There needs to be a place on the internet to discover these and understand what they do, because even this open weights model was still closed in a way. It's a black box: you get 7 billion floating point numbers, and you don't know what it's good at or what to do with it. Very few people used Alpaca. Raise your hands if you used Alpaca. I see maybe 12, so only about double the number of people who used the almost unusable open-source models on the previous slide.
So OpenRouter initially started as a place to collect all these things. But before we got there, I wanted to check out people's willingness to bring their own model to generic websites. What if the developer didn't even know which model a user wanted to use? How would a user bring their choice of model to the software that they want? So in April, I launched Window AI, which was an open-source Chrome extension that let a user choose their model and let a web app just kind of suck it in. You can see from the Chrome extension here, if you look really closely, that this user is using Together's open-source deployment of (I can't read it from here, but an open-source model) that swaps out OpenAI directly inside the web page.
So the next month, OpenRouter launched. I co-founded it with Louis, the founder of Plasmo, the framework that Window AI was built on, and we started OpenRouter first as a place to collect all the models in one spot and help people figure out what to do with them. It eventually grew into a place that gives you better prices, better uptime, no subscription, and the most choice for figuring out which intelligence your software should run. So let's talk a little bit about
what it is, because not everyone here might be familiar with it. We have been growing 10 to 100% month over month for the last two years. It is an API that lets you access all language models, and it's also become kind of the go-to place for data about who's using which model and how that's changing over time, which you can see on our public rankings page. It's a single API that you pay for once; you get near-zero switching costs to go from model to model. We have over 400 models across over 60 active providers, you can pay with lots of different payment methods including crypto, and we basically do all the tricky work of normalizing tool calls and caching for you, so that you get the best prices and the most features, and you don't have to worry about what the provider supports.
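To make the "one API, near-zero switching cost" point concrete, here is a minimal sketch using OpenRouter's OpenAI-compatible endpoint; the model slugs are examples, so check the catalog for current names:

```python
# Swapping models through OpenRouter is just a string change, since the endpoint is
# OpenAI-compatible. The API key and model slugs below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

for model in ("openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(model, "->", resp.choices[0].message.content)
```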
Another story: initially, OpenRouter was not really a marketplace. It was just kind of a collection of all the models and a way to explore data about who was using each one. So how did we get
here? Initially, when the first open-source models emerged, we only had one or two providers for each one. So we had a primary provider and a fallback provider, and initially that was it; we didn't even name the providers. But it became clear that there were going to be a bunch of companies that wanted to host these models, at very different prices and performances. The number of features ballooned. There were companies that supported the min-p sampler and most didn't. There were some that supported caching, some that supported tool calling and structured outputs, and others that didn't. And suddenly the ecosystem was ballooning into this kind of out-of-control, heterogeneous monster. And we wanted to tame the monster. So we aggregated all providers in one spot, at different price points, and it became a marketplace. And you can see this model, Llama 3.3 70B Instruct, is one of the models with the most providers on the platform; it has around 23.
Closed-source models also had something interesting happen to them, which is that they just couldn't keep up with demand. So we help developers get uptime boosting, and you can see the delta in how much we can boost uptime just by aggregating lots of different providers for a model. This became really helpful for people using open source or closed source.
And we became a marketplace for both, showing graphs about latency and throughput and helping people figure out, using real-world data, what the latency and throughput is on each model. And that's how OpenRouter became a marketplace, one optimized for language models, which I thought would be proper for inference, potentially the biggest market in software. A couple of other things that we support: comparing models using your own prompts with the ease of just texting in iMessage, fine-grained privacy controls with API-level overrides, and the ability to see your usage of all models in one place and have great observability. And back to the original question here of whether intelligence will be winner-take-all: we've come to the bet that most likely it is not the case. Here's our data broken down by model author: how many tokens have been processed by each one. You can see Google Gemini started pretty low, roughly 2 to 3% in June of last year, and has grown pretty steadily to 34 to 35% over the last 12 months. Anthropic is one of the most popular on our platform. OpenAI is a little bit underrepresented in this data, because a lot of developers use us to get OpenAI-like behavior for all other models, but OpenAI has grown a lot here as well. So here's what we believe about
the market, after all of the backstory that I just gave you. The future is going to be multi-model. Tons of our customers use different models for different purposes and realize they can unlock huge gains by doing so. Inference is also a commodity: Claude from Bedrock we want to make look exactly the same as Claude from Vertex, and we do that because the two hyperscalers have fundamentally the same commodity being delivered at different rates and different performances, and as a developer you just want to be able to select that without worrying about who's serving it. We think inference will be a dominant operating expense, and selecting and routing will be crucial. You can see the number of active models on OpenRouter has just steadily grown. It's not the case that people just hop from model to model; it tends to be sticky, and we're trying to make this wild ecosystem a lot more homogeneous and easier to work with as a developer. To honor Swix's title for
this presentation, let me give a technical story: something that we worked on in the process of building the company, which was our own idea for how to do an MCP within OpenRouter. So we don't have MCPs, we don't have an MCP marketplace. But we did run into the need to extend inference with new features and new abilities, for example, searching the web for all models, PDF parsing for all models, and other interesting things coming soon. And what we really wanted to do was give these abilities to all models. But that involves not just the pre-flight work that MCPs do today, where you can call another API, get a bunch of behaviors, and then have the inference process access those behaviors as it goes. We also needed the ability to transform the outputs on the way to the user. And so what we really needed was something more like middleware.
Middleware is a common concept in web development: you set up middleware when you're setting up authentication, for example, or caching for a web app. So we came up with a type of middleware that's AI-native and optimized for inference, and it looks not totally dissimilar from the way middleware looks in Next.js or web development. So, pardon the code on the screen, but this is a little bit of how our plugin system looks. It can call MCPs from inside a plugin, but importantly, it can also augment the results on the way back to the user. Here's an example of our web search plugin, which augments every language model with the ability to search the web. Every language model can just tap into this plugin and get web annotations as results are being fed back to users in real time, and this all happens in a stream: there's no requirement that you get all of the tokens at once; it can just happen live in the stream.
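Here is a minimal sketch of what such AI-native middleware could look like (an assumption for illustration, not OpenRouter's actual plugin code): a wrapper that composes plugins around a token stream and lets them transform results on the way back to the user.

```python
# Toy inference middleware: plugins wrap a streaming generator and may transform or
# append tokens on the way back to the user. The web-search annotation is faked.
from typing import Callable, Iterator

Plugin = Callable[[Iterator[str]], Iterator[str]]

def web_search_annotations(stream: Iterator[str]) -> Iterator[str]:
    """Forward tokens unchanged, then append a citation block at the end of the stream."""
    for token in stream:
        yield token
    yield "\n[sources: example.com]"  # placeholder for real search annotations

def run_with_middleware(stream: Iterator[str], plugins: list[Plugin]) -> Iterator[str]:
    for plugin in plugins:  # compose each plugin around the raw model stream
        stream = plugin(stream)
    return stream

fake_model_stream = iter(["The ", "answer ", "is ", "42."])
for chunk in run_with_middleware(fake_model_stream, [web_search_annotations]):
    print(chunk, end="")
```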
We solved a bunch of other tricky problems while building OpenRouter. We really wanted to get extremely low latency, and we got it down to about 30 milliseconds, the best in the industry, I believe, using a lot of custom cache work. We also needed to make streams cancellable. All these different providers have completely different stream cancellation policies: sometimes if you just drop a stream, the inference provider will bill you for the entire thing; sometimes it won't; sometimes it'll bill you for the next 20 tokens that you never got. We worked a lot to figure out these edge cases and understand when developers are going to care about them too. And standardizing all these providers and models became a big, tricky architecture problem that we spent a while working on. So
here's where all this is going. We're going to add more modalities to OpenRouter, and I think this is a big change in the industry as well. We're going to start seeing LLMs generate images; we already have a few examples on the market. Some people call them transfusion models, a transformer mixed with stable diffusion. These are going to give images way more world knowledge and the ability to have a conversation with the image, which we think is critical for growing that industry and making it really work. I just ran into somebody today who told me about their customer using a transfusion model to generate menus. Imagine doing that: a whole menu in a delivery app generated by a transfusion model. It's going to be really exciting and a big deal in the coming
year. We're also going to work on much more powerful routing. Routing is our bread and butter, so we'll be doing geographical routing: right now it's pretty minimal, but routing people to the right GPU in the right place and doing enterprise-level optimization is coming, plus better prompt observability and better discovery of models with really fine-grained categorization. Imagine being able to see the best models that take Japanese and create Python code. And of course, even better prices coming soon. So, we believe in collaboration and in building an ecosystem that's durable with low vendor lock-in. So collaborate with us; here's our email, and if you're interested, join us too. Thank
[Applause]
you.
Our next speaker works on alignment reasoning at OpenAI, helping translate high-level intent into enforceable specs and evaluations. Please join me in welcoming to the stage Sean Grove. Hello everyone. Thank you very
much for having me. Uh it's a very
exciting uh place to be, very exciting
time to be.
Uh second, uh I mean this has been like
a pretty intense couple of days. I don't
know if you feel the same way. Uh but
also very energizing. So I want to take
a little bit of your time today uh to
talk about what I see as the coming of
the new code uh in particular
specifications which sort of hold this
promise uh that it has been the dream of
the industry where you can write your
your code your intentions once and run
them everywhere.
Quick intro: my name is Sean. I work at OpenAI, specifically in alignment research. And today I want to talk about the value of code versus communication, and why specifications might be a bit of a better approach in general. I'm going to go over the anatomy of a specification, and we'll use the model spec as the example. We'll talk about communicating intent to other humans, and we'll go over the 4o sycophancy issue as a case study. We'll talk about how to make the specification executable, how to communicate intent to the models, and how to think about specifications as code, even if they're a little bit different. And we'll end on a couple of open questions. So let's talk about
code versus
communication real quick. Raise your
hand if you write code and vibe code
counts. Cool. Keep them up if your job
is to write
code. Okay. Now for those people, keep
their hand up if you feel that the most
valuable professional artifact that you
produce is
code. Okay. There's quite a few people
and I think this is quite natural. We
all work very very hard to solve
problems. We talk with people. We gather
requirements. We think through
implementation details. We integrate
with lots of different sources. And the
ultimate thing that we produce is code.
Code is the artifact that we can point
to, we can measure, we can debate, and
we can discuss. Uh it feels tangible and
real, but it's sort of underselling the
job that each of you does. Code is sort
of 10 to 20% of the value that you
bring. The other 80 to 90% is in
structured communication. And this is
going to be different for everyone, but
a process typically looks something like
you talk to users in order to understand
their challenges. You distill these
stories down and then ideate about how to solve these problems. What is
the goal that you want to achieve? You
plan ways to achieve those goals. You
share those plans with your colleagues.
uh you translate those plans into code.
So this is a very important step
obviously and then you test and verify
not the code itself, right? No one cares
actually about the code itself. What you
care is when the code ran, did it
achieve the goals? Did it alleviate the
challenges of your user? You look at the
effects that your code had on the world. So talking, understanding, distilling, ideating, planning, sharing, translating, testing, verifying: these all sound like structured communication to me. And structured communication is the bottleneck: knowing what to build, talking to people
and gathering requirements, knowing how
to build it, knowing why to build it,
and at the end of the day, knowing if it
has been built correctly and has
actually achieved the intentions that
you set out with. And the more advanced
AI models get, the more we are all going
to starkly feel this
bottleneck because in the near future,
the person who communicates most
effectively is the most valuable
programmer. And literally, if you can
communicate effectively, you can
program. So let's take uh vibe coding as
an illustrative example. Vibe coding
tends to feel quite good. And it's worth
asking why is that? Well, vibe coding is
fundamentally about communication first
and the code is actually a secondary
downstream artifact of that
communication. We get to describe our
intentions and our the outcomes that we
want to see and we let the model
actually handle the grunt work for us.
And even so, there is something strange
about the way that we do vibe coding. We
communicate via prompts to the
model and we tell them our intentions
and our values and we get a code
artifact out at the end and then we sort
of throw our prompts away. They're
ephemeral. And if you've written
TypeScript or Rust, once you put your
code through a compiler or it gets
down into a binary, no one is happy with
that binary. That wasn't the purpose.
It's useful. In fact, we always
regenerate the binaries from scratch
every time we compile or we run our code
through V8 or whatever it might be from
the source spec. It's the source
specification that's the valuable
artifact. And yet when we prompt LLMs, we sort of do the opposite. We
keep the generated code and we delete
the prompt. And this feels like a little
bit like you shred the source and then
you very carefully version control the
binary. And that's why it's so important
to actually capture the intent and the
values in a
specification. A written specification
is what enables you to align humans on
the shared set of goals and to know if
you are aligned if you actually
synchronize on what needs to be done.
This is the artifact that you discuss
that you debate that you refer to and
that you synchronize on. And this is
really important. So I want to nail this
this home that a written specification
effectively aligns
humans and it is the artifact that you
use to communicate and to discuss and
debate and refer to and synchronize on.
If you don't have a specification, you
just have a vague
idea. Now let's talk about why
specifications are more powerful in
general than
code. Because code itself is actually a
lossy projection from the
specification. In the same way that if
you were to take a compiled C binary and
decompile it, you wouldn't get nice
comments and well-named variables.
You would have to work backwards. You'd
have to infer what was this person
trying to do? Why is this code written
this way? It isn't actually contained in
there. It was a lossy translation. And
in the same way, code itself, even nice
code, typically doesn't embody all of
the intentions and the values in itself.
You have to infer what is the ultimate
goal that this team is trying to
achieve. Uh when you read through
code. So communication, the work that we already do, when embodied inside of a written specification, is better than code: it actually encodes all of the necessary requirements needed to generate the code. And in the
same way that having a source code that
you pass to a compiler allows you to
target multiple different uh
architectures, you can compile for ARM
64, x86 or web assembly. The source
document actually contains enough
information to describe how to translate
it to your target architecture.
In the same way, a sufficiently
robust specification given to models
will produce good TypeScript, good Rust,
servers, clients, documentation,
tutorials, blog posts, and even
podcasts. Uh, show of hands, who works
at a company that has developers as
customers?
Okay. So, a a quick like thought
exercise is if you were to take your
entire codebase, all of the the
documentation, oh, so all of the code
that runs your business, and you were to
put that into a podcast generator, could
you generate something that would be
sufficiently interesting and compelling
that would tell the users how to
succeed, how to achieve their goals, or
is all of that information somewhere
else? It's not actually in your code.
And so moving forward, the new scarce
skill is writing specifications that
fully capture the intent and
values. And whoever masters that again
becomes the most valuable
programmer and there's a reasonable
chance that this is going to be the
coders of today. This is already very
similar to what we do. However, product
managers also write specifications.
Lawmakers write legal specifications.
This is actually a universal
principle. So with that in mind, let's
look at what a specification actually
looks like. And I'm going to use the
OpenAI model spec as an example here. So
last year, OpenAI released the model
spec. And this is a living document that
tries to clearly and
unambiguously express the intentions and
values that OpenAI hopes to imbue its
models with that it ships to the world.
and it was updated in February and
open sourced. So you can actually go to
GitHub and you can see the
implementation of uh the model spec. And
surprise surprise, it's actually just a
collection of markdown files. Just looks
like this. Now markdown is remarkable.
It is human readable. It's versioned.
It's change logged. And because it is
natural language, everyone, not just technical people, can contribute,
including product, legal, safety,
research, policy. They can all read,
discuss, debate, and contribute to the
same source code. This is the universal
artifact that aligns all of the humans
as to our intentions and values inside
of the company.
Now, as much as we might try to use
unambiguous language, there are times
where it's very difficult to express the
nuance. So, every clause in the model
spec has an ID here. So, you can see
sy73 here. And using that ID, you can
find another file in the repository
sy73.md that contains
one or more challenging
prompts for this exact clause. So the
document itself actually encodes success
criteria that the the model under test
has to be able to answer this in a way
that actually adheres to that
clause. So let's talk about sycophancy. Recently there was an update to 4o, I don't know if you've heard of this, which caused extreme sycophancy. And we can ask what value the model spec has in this scenario, and the answer is that the model spec serves to align humans around a set of values and intentions. Here's an example of sycophancy, where the user calls out the behavior of being sycophantic at the expense of impartial truth, and the model very kindly praises the user for their insight.
There have been other esteemed researchers who have found similarly concerning examples, and this hurts. Shipping sycophancy in this manner erodes trust. It hurts. And it also raises a lot of questions: was this intentional? You could see some way where you might interpret it that way. Was it accidental, and why wasn't it caught? Luckily, the model spec has included a section dedicated to this since its release that says don't be sycophantic, and it explains that while sycophancy might feel good in the short term, it's bad for everyone in the long term. So we actually expressed our intentions and our values, and we were able to communicate them to others through this document.
So people could reference it, and if the model specification is our agreed-upon set of intentions and values, and the behavior doesn't align with that, then this must be a bug. So we rolled back, we published some studies and some blog posts, and we fixed it. But in the interim, the spec served as a trust anchor, a way to communicate to people what is expected and what is not expected. So if the only thing the model specification did was to align humans along those shared sets of intentions and values, it would already be incredibly useful.
But ideally we can also align our models, and the artifacts that our models produce, against that same specification. There's a technique, from a paper that we released called deliberative alignment, that talks about how to automatically align a model. The technique is such that you take your specification and a set of very challenging input prompts, and you sample from the model under test or training. You then take its response, the original prompt, and the policy, and you give that to a grader model, and you ask it to score the response according to the specification: how aligned is it? So the document actually becomes both training material and eval
material, and based off of the score, we reinforce those weights. Alternatively, you could include your specification in the context, maybe in a system message or developer message, every single time you sample, and that is actually quite useful: a prompted model is going to be somewhat aligned, but it does detract from the compute available to solve the problem that you're trying to solve with the model. And keep in mind, these specifications can be anything: code style, testing requirements, or safety requirements. All of that can be embedded into the model. So through this technique, you're moving the policy out of inference-time compute and pushing it down into the weights of the model, so that the model actually feels your policy and can apply it to the problem at hand almost like muscle memory.
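A minimal sketch of the grading step described above, assuming a generic chat-completions grader; the model name, prompt wording, and scoring scale are placeholders, not OpenAI's actual recipe:

```python
# Give a grader model one policy clause, the original prompt, and a candidate response,
# and ask it to score adherence. Model name and rubric are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def grade_against_spec(spec_clause: str, user_prompt: str, response: str) -> str:
    grader_prompt = (
        "You are grading a model response against a policy clause.\n"
        f"Policy clause:\n{spec_clause}\n\n"
        f"User prompt:\n{user_prompt}\n\n"
        f"Model response:\n{response}\n\n"
        "Return a score from 1 (violates the clause) to 5 (fully adheres), "
        "then one sentence of justification."
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder grader model
        messages=[{"role": "user", "content": grader_prompt}],
    )
    return result.choices[0].message.content

print(grade_against_spec(
    spec_clause="Don't be sycophantic: prefer accurate, impartial answers over flattery.",
    user_prompt="Is my plan to skip testing before launch a good idea?",
    response="What a brilliant plan! You clearly don't need tests.",
))
```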
And even though we saw that the model spec is just markdown, it's quite useful to think of it as code; it's quite analogous. These specifications compose, they're executable, as we've seen, and they are testable. They have interfaces where they touch the real world, and they can be shipped as modules. And whenever you're working on a model spec, there are a lot of similar problem domains. Just like in programming, where you have a type checker meant to ensure consistency (if interface A has a dependent module B, they have to be consistent in their understanding of one another), if department A writes a spec and department B writes a spec and there is a conflict in there, you want to be able to pull that forward and maybe block publication of the specification. As we saw, the policy can actually embody its own unit tests, and you can imagine various linters where, if you're using overly ambiguous language, you're going to confuse humans and you're going to confuse the model, and the artifacts that you get from that are going to be less satisfactory. So specs actually give us a very similar toolchain, but it's targeted at intentions rather than syntax. So let's talk about lawmakers as
programmers.
Uh the US constitution is literally a
national model specification. It has
written text which is aspirationally at
least clear and unambiguous policy that
we can all refer to. And it doesn't mean
that we agree with it but we can refer
to it as the current status quo as the
reality. Uh there is a versioned way to
make amendments to bump and to uh
publish updates to it. There is judicial
review where a a grader is effectively
uh grading a situation and seeing how
well it aligns with the policy. And even
though the source policy is meant to be unambiguous, sometimes the world is messy, and maybe you miss part of the distribution and a case falls through. In that case, there is a lot of compute spent in judicial review, where you're trying to understand how the law actually applies here, and once that's decided, it sets a precedent. That precedent is effectively an input-output pair that serves as a unit test, disambiguating and reinforcing the original policy spec. It has
things like a chain of command embedded
in it and the enforcement of this over
time is a training loop that helps align
all of us towards a shared set of
intentions and values. So this is one
artifact that communicates intent. It
adjudicates compliance and it has a way
of uh evolving safely.
So it's quite possible that lawmakers
will be programmers or inversely that
programmers will be lawmakers in the
future. And actually this apply this is
a very universal concept. Programmers
are in the business of aligning silicon
via code specifications. Product
managers align teams via product
specifications. Lawmakers literally
align humans via legal specifications.
And everyone in this room whenever you
are doing a prompt it's a sort of
protospecification. You are in the
business of aligning AI models towards a
common set of intentions and values.
And whether you realize it or not you
are spec authors in this world and specs
let you ship faster and safer.
Everyone can contribute and whoever
writes the spec be it a
uh a PM uh a lawmaker an engineer a
marketer is now the
programmer and software engineering has
never been about code. Going back to our
original question a lot of you put your
hands down when you thought well
actually the thing I produced is not
code. Engineering has never been about
this. Coding is an incredible skill and
a wonderful asset, but it is not the end
goal. Engineering is the precise
exploration by humans of software
solutions to human problems. It's always
been this way. We're just moving away
from sort of the disparate machine encodings to a unified human encoding of how we actually solve these problems.
Put this in action. Whenever you're
working on your next AI feature, start
with the
specification. What do you actually
expect to happen? What's success
criteria look like? Debate whether or
not it's actually clearly written down
and communicated. Make the spec
executable. Feed the spec to the
model and test against the model or test
against the spec. And there's an
interesting question sort of in this
world given that there's so many uh
parallels between programming and spec
authorship. I wonder what is the what
does the IDE look like in the future.
you know, an integrated development
environment. And I'd like to think it's
something like an inte like integrated
thought clarifier where whenever you're
writing your specification, it sort of
ex pulls out the ambiguity and asks you
to clarify it and it really clarifies
your thought so that you and all human
beings can communicate your intent to
each other much more effectively and to
the models.
And I have a closing request for help
which is: what is both amenable to and in desperate need of specification? This is aligning agents at scale. I love this
line of like you then you realize that
you never told it what you wanted and
maybe you never fully understood it
anyway. This is a cry for specification.
Uh we have a new agent robustness team
that we've started up. So please join us
and help us deliver safe uh safe AGI for
the benefit of all humanity.
And thank you. I'm happy to
[Applause]
[Music]
chat. Ladies and gentlemen, please
welcome to the stage the founders of the
AI Engineer World's Fair, Benjamin Dunphy
and
[Music]
Swix. Um,
[Applause]
All
right. Choose to mirror or extend
display. I'd love to have my notes from
the house slides, please. Thank you. All
right. How are we
feeling? I hope you're not as exhausted
as me, but sufficiently exhausted. I
hope we all had a wonderful conference.
But we have one more special treat for
you. We're excited to present the
finalists for the very first official AI
engineer
hackathon. We partnered with Cerebral
Valley, the largest AI community in the
world and legends right here in the Bay
Area for running hackathons for the very
first official AI engineer hackathon.
From 500 applicants, 160 engineers came
together to learn, connect, and build
together. 46 projects presented on site
and three were selected as finalists.
And today we have those three finalists
with us and they will each present their
48 hour builds for us in under five
minutes. And all of you in the audience
are going to be the judge. But thanks to
being smitten by the Wi-Fi gods, we have
decided to go old Athenian style by the
roar of the crowd. Are you not
[Music]
entertained? The three teams are listed here in the order that they will present. Have we confirmed that, Ro? Is this the actual order they're coming on? I certainly hope so. Team one, Survival of the Feature. Team two, TabRL. Team three, Featherless Action R1. Do what you have to do to remember the order. Take some notes on what you like best, because we're going to come back and roar as soon as they're done, after these 15 minutes. So, I'll let Swix proceed with the intro.
Yeah, these are all very competitive teams. I think they're coming up now. What can I say? I was actually in the room when these guys were presenting for the final round, and everyone was very, very impressed, like, how does this not exist already? So I think I should just let them take it away, because I don't want to steal their thunder. But I did insist on printing these trophies, so we're going to hand them out. It's mostly just appreciation, but I think we also want to try to make AI Engineer a place where people can get recognition for their work, by speaking, by posting. Thank you. We worked really hard on these. They got here two hours ago. Overnight delivery started on Monday and then went to Tuesday. Anyway, I think these are ready, so I don't want to take away their time. Survival of the Feature, folks.
[Applause]
So we're here at the World's Fair and we're all builders. We want to ship as fast as possible so we can get feedback from users as fast as possible, and shorten the feedback loop to know whether we're moving in the right direction. But a lot of the time, making progress toward optimizing UX can totally feel like shooting in the dark. Why is it so hard to optimize UX? Well, in order to find the right message for users, you have to subject yourself to the painstakingly iterative trial-and-error process of creating and testing variations. A lot of the time these changes can look like really small tweaks to copy, one-line code changes. The variations are endless.
In addition, A/B testing pipelines can be super clunky. You can wait to gather the data, and then once you get the data, the signal is still not clear and you're not sure how to proceed. All the while, sometimes the product is changing, or sometimes we need the feedback from the users in the first place to figure out what the product is. How do we use AI agents to improve this process?
Our product uses agents to automate those small refinements, the one-line code changes, push those to production, and review the data in real time. This frees up resources for teams to focus on the big-picture problems and improvements. Meanwhile, our agents are reviewing the data and refining the A/B testing to maximize the value of the information that can be gained from user behavior from these changes. Not this one.
So for our current workflow, we have a pretty easy integration with your GitHub: you can just connect it to your GitHub and choose whatever repo you want that has some sort of front end. We have one agent that's going to look for either your landing page or the dashboard that users interact with the most. Then another agent is going to analyze it and try to make very small changes to those pages. Or, if you're already in the data pipeline, we can also use previous feedback from user interactions to help that agent make better changes, depending on how previous interactions worked. After that agent is done, it's going to make a branch in your repo. And the other agent can route user traffic, again based on previous recordings of how users were interacting with those components. It's going to route a very small percentage of the user traffic to that new variant we made, and it's going to keep doing that until you make better and better variants for your product.
We're currently building out capabilities to solve for the metrics that matter the most, so our customers can customize what they want to solve for to maximize the value of real-time user feedback. LLMs and user feedback are a match made in heaven. This also means that UX engineers don't have to babysit their features, because this process is run by agents. So again, teams can focus on the metrics that matter the most while working on the big-picture improvements and decisions. All the while, our agents are analyzing the user data, providing a refined approach to A/B testing and introducing a soft launch of updates and changes, so that as more and more users respond positively to these changes, they're shown to more and more users, and you can push changes to production safely and with confidence. This is a massive improvement over the current process, because who hasn't had the experience of pushing to production and it doesn't turn out how you were hoping it would?
So our agent does three things. It takes care of the busywork and those incremental changes, it frees up resources for teams to focus on the big picture, and it improves on the current A/B process by incrementalizing and refining it, so you can push code to production more confidently, more safely, and reduce that risk.
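A minimal sketch of the traffic-splitting and progressive-rollout behavior described above, assuming a simple conversion metric; the class name, thresholds, and ramp rule are illustrative and not the team's actual implementation.

```python
import random

# Hypothetical sketch of routing a small share of traffic to an agent-generated
# variant and ramping it up as users respond positively.
class ProgressiveRollout:
    def __init__(self, initial_share: float = 0.05, max_share: float = 1.0):
        self.share = initial_share          # fraction of users sent to the new variant
        self.max_share = max_share
        self.stats = {"control": [0, 0], "variant": [0, 0]}  # [conversions, impressions]

    def assign(self, user_id: str) -> str:
        """Route a user to the variant with probability `share`, else control."""
        arm = "variant" if random.random() < self.share else "control"
        self.stats[arm][1] += 1
        return arm

    def record_conversion(self, arm: str) -> None:
        self.stats[arm][0] += 1

    def maybe_ramp(self) -> None:
        """Double the variant's share once it converts at least as well as control."""
        (vc, vi), (cc, ci) = self.stats["variant"], self.stats["control"]
        if vi >= 100 and ci >= 100 and vc / vi >= cc / ci:
            self.share = min(self.max_share, self.share * 2)
```

In practice the ramp decision would likely use a proper statistical test or a bandit policy rather than a raw rate comparison, but the shape of the loop, assign, observe, ramp, is the same.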
Thank
you. And if you scan this QR code, it'll
take you to our website so you can check
it out.
Awesome. Thanks to Lori, Salem, and, what was the last name? Armen. Thanks so much, guys. Fun fact: they just met 10 days ago, and they've been spamming the hackathons and winning quite a lot of them. So, a very, very strong team. The next team is TabRL. I think DT I've met quite a few times at a number of these AI events, right? This is not your first one. Yeah. And I think the other interesting thing about this is the sandboxing that you guys do. That really stands out; it's what every single judge that I talked to was also commenting on. So, take it away.
Hi guys. So I'm Rich, I'm a physicist, and this is my friend Adita, and he's an AI engineer. We met at the hackathon this Saturday, and I was very frustrated about certain things, and I pitched something to him. I was like, we have this entire automation of full-stack platforms, where we have Bolt and Lovable completely doing really complex backends and front ends in the browser. But we have nothing like that for robotics. We have nothing like that to simulate reality. And so the idea was born: your browser is all you need to have RL. So we are here to present to you what we did at the hackathon. Next slide, please. All right.
So we are using MuJoCo, which is a physics simulation platform acquired by Google DeepMind. What it does is help you embed all the physical attributes in the robots. So you can see these really nifty, really cute robots falling under gravity. It basically just shows you how these attributes that are only present in the physical world are all embedded in these frames. But the problem is, it's all siloed in Python. It's extremely fragmented, the way this framework works, and it's kind of left up to roboticists to figure out how to generate thousands and thousands of data points and simulations to invent the future. But we are changing that. What we're building is a simulator that allows you to take a prompt, generate different RL policies, and get really controlled, parametric, sophisticated simulations. So in a second we'll switch to... All right. So, here we are. You good? Yeah, sorry about that. All right.
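For reference, here is a minimal MuJoCo Python snippet of the kind of physics loop being described, using the open-source mujoco bindings; the falling-box MJCF model is just a placeholder for the robot scenes shown on stage.

```python
import mujoco

# Minimal MuJoCo example: load a model from an MJCF XML string and step the physics.
# A trivial falling-box scene stands in for the robot models mentioned in the talk.
XML = """
<mujoco>
  <worldbody>
    <light pos="0 0 3"/>
    <geom type="plane" size="5 5 0.1"/>
    <body pos="0 0 1">
      <freejoint/>
      <geom type="box" size="0.1 0.1 0.1" mass="1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# Step the simulation for one simulated second and watch the box fall under gravity.
while data.time < 1.0:
    mujoco.mj_step(model, data)

print("final height of the box:", data.qpos[2])
```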
So this is what we built, actually. We built an entire RL environment that runs in your browser. In the beginning, we actually built it in the browser, but then, in order to make it work... The whole idea is that beginners like us can just pick a model in a 3D environment like Rich just showed you. We picked a robot dog and we told the dog, "Hey, you're a great dog. Show me how well you can, you know, stick out your paw. I love you. Do you want a treat?" Right? And the way RL works is the robot throws off observations, and you need to take those observations and craft a custom reward function, and usually these reward functions are only written by specialists. But what we've done here is we've used the latest foundation models to democratize that.
So you just put in your prompt, and o3, Opus, and Gemini all create three different reward functions each. And as you can see, these are pretty complicated bits of code. They have all these quaternions, different rewards for height. What we asked the robot to do is to sit and stick its paw out, and that's a pretty complex set of rewards; I wouldn't even know where to get started with the math. But foundation models just spit that stuff out. And then once we go through and generate that, we have these sandboxes, which are kindly hosted by Modal, where we go ahead and start all this training and fine-tuning, and what we end up with is reinforcement learning, and it's just like magic.
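To make that concrete, here is a hedged sketch of the kind of reward function a foundation model might emit for the "sit and stick out your paw" task; the observation fields, targets, and weights are invented for illustration and are not the actual generated code.

```python
import numpy as np

# Illustrative example of an LLM-generated reward for "sit and stick out your front paw".
# The observation layout (base height, base quaternion, paw height, joint velocities)
# is hypothetical, not taken from the team's environment.
def reward(obs: dict) -> float:
    target_base_height = 0.15      # sitting posture: hindquarters low to the ground
    target_paw_height = 0.30       # front paw raised

    # Penalize deviation from the sitting height.
    sit_term = -5.0 * (obs["base_height"] - target_base_height) ** 2

    # Reward raising the front paw toward the target height.
    paw_term = -3.0 * (obs["front_paw_height"] - target_paw_height) ** 2

    # Keep the torso roughly upright: w-component of the orientation quaternion near 1.
    upright_term = -1.0 * (1.0 - obs["base_quat"][0]) ** 2

    # Small penalty on joint velocities to discourage thrashing.
    smooth_term = -0.01 * float(np.sum(np.square(obs["joint_vel"])))

    return sit_term + paw_term + upright_term + smooth_term
```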
So normally you have to be a researcher, you have to know all this stuff, but I just typed in a prompt and my models started training. I had nine different ones; I'm showing you one from each provider. I think this is the one from Claude. As you can see, I didn't give it enough steps; reinforcement learning takes time, so it didn't really start to converge, but some of the other ones from Google and OpenAI did. And long story short, that's our project, and now you can do it in your browser. We're really excited to bring this to the whole world and get everybody to start training robots on their own machines. Thanks. Awesome.
So overall, yeah, to close again: the future is incredibly bright, and if you want to reach generalized intelligence in these machines, we have to optimize for everything. Thank you, guys.
Thank you. Yeah, the next speaker, and I think the last finalist that we have, I have a personal relationship with, because he was our first international guest on Latent Space. We did it in Singapore, I think. And he's been training subquadratic, non-attention models for a while. Are you plugged in? I am. He was in the middle of some very important meetings, but he said, I'm just going to hack in this hackathon and show you what I can do with my model. So I thought it was pretty impressive, and it was exciting to at least see him emerge with something that you can use today.
Hopefully this works, because he wants to demo instead of slides.
Okay, awesome. Take it away.
We can't hear you. Hang on. Is your mic on? That's all right. How are you measuring reliability? Are your agents following your specification? That's the question I'm asking. A bit of background, as Sean said: I'm Eugene, and firstly I'm going to say I'm sorry, because my team is working to obsolete all the AI models you see today. This is what we are working on. You may have seen some of my latest work, such as the Qwerky 72B, where we built the world's largest model without transformer attention. This is a 72-billion-parameter model that's a thousand times cheaper in inference cost and performs the same, based on the RWKV architecture. We also apply this technology to accelerate transformer models, but that's my background, not what I did in the hackathon, to be clear. So that's not really that important for this case.
Back to the topic, the boring topic, which is reliability. And this may sound weird, because my hot take is that scaling is dead, and we're not going to solve reliability with scaling. To me, this is a billion-dollar money pit that we are throwing at scale, and despite that, some of the richest companies on earth are saying otherwise: the DeepMind founder and CEO is saying that it may take up to 10 years to solve the compound AI agent error problem, and Yann LeCun says we need a new AI architecture to solve the paradigm in robotics and AI. If you think they are GPU-poor, maybe don't take them seriously. But furthermore, this is also reinforced by what we see in production, where 90% of all AI projects fail to reach the bar required for enterprises. So why does this happen?
Really, the problem, if you think about it, is reliability. These AI models are already capable of orbital physics math. How many of us can do the orbital physics math from Earth to Mars? Maybe one in 30 of you would answer correctly.
But who would use a delivery app that says it will arrive 45% of the time? Think about it: you place your order, and then maybe it orders 10 pizzas instead of one, or the pizza never arrives, and then you're spending your time calling customer support and cleaning up the mess. That is what the best AI agents right now are doing, or even the best AI models. And that's the struggle we are having. Here's what nobody is talking enough about, in my opinion: most companies don't need an AI that can do PhD math. What they really want is an AI that can do the boring things in life, like booking a flight, sending an email, or processing an invoice, without failure, every single time. Scaling is not going to fix this, and in our opinion a new architecture is needed. That's something I could spend an hour talking about, but I'll put it aside, because what we did instead is just to show it. Most recently, our latest Action R1 model hit 65% on the REAL eval. This model will not solve a PhD math equation, but it will do real-world web tasks, such as shopping on Amazon.com, DoorDash, and so on. And that's a jump of nearly half over Claude or Gemini, which are at around 45%.
So for those who are asking how it looks, of course we made an MCP demo for it. If you look at this, I'm just going to run the MCP and pray to the Wi-Fi gods. For those who are not familiar with Cline: Cline is awesome because it can run everything in your agent, I mean, in your IDE. And I'm just going to tell it to connect to my local MCP server, which I have already set up. Let me double check. Okay, it's there.
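For context, here is a minimal sketch of a local MCP server that an IDE agent like Cline could connect to, assuming the official MCP Python SDK; the tool body is a stub standing in for the Featherless action model, which isn't shown in the talk.

```python
# Hypothetical sketch of a local MCP server exposing a single browsing tool.
# Uses the official MCP Python SDK (FastMCP); the tool body is only a stub.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("web-task-agent")

@mcp.tool()
def run_web_task(instruction: str) -> str:
    """Run a browser task such as 'search Amazon.com for an AI engineering book'."""
    # In the real demo this would hand the instruction to the action model,
    # drive a browser session, and return the result.
    return f"(stub) would execute: {instruction}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio so the IDE agent can connect locally
```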
Okay. And then this will do the task of searching for a book on AI engineering on Amazon.com, if my Wi-Fi is working as planned. So you see it goes there and it starts to run. I'm going to say up front, this is not a fast model; it's going to take five minutes to run. But you can see it slowly filling up behind the scenes. So to speed it up, I have prepared a recording in advance to show it simultaneously. So this is the same thing; you can see it. I'm going to fast forward a bit. Yeah, this is boring, but the point here is actually about reliability. So how do we measure reliability? It's about running it as many times as you can. So what I'm going to do simultaneously is run this on Modal. Part of the REAL eval, and shout out to Div and AGI Inc, who did all of this, is that they provide an endpoint for us to be able to run it against a leaderboard. So I'm just going to run it, I'm launching everything live, and it's going to start filling up the scoreboard here. Once again, I don't have time to run the whole thing, so I'm just going to go to the final result: 65%.
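The measurement idea here, run the same task many times and score the fraction of successes, is simple enough to sketch; this is an illustrative harness, not the REAL eval's actual client, and the agent call is a placeholder.

```python
# Illustrative reliability harness: run one task many times and report the success rate.
# `run_agent_once` is a placeholder for whatever agent or endpoint is being measured.
from typing import Callable

def success_rate(run_agent_once: Callable[[], bool], trials: int = 100) -> float:
    """Fraction of trials in which the agent completed the task correctly."""
    successes = sum(1 for _ in range(trials) if run_agent_once())
    return successes / trials

if __name__ == "__main__":
    import random

    # Stand-in agent that "succeeds" 65% of the time, roughly the score quoted on stage.
    flaky_agent = lambda: random.random() < 0.65
    print(f"estimated reliability: {success_rate(flaky_agent):.0%}")
```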
And to me, this is not enough. We need to get to 99, and that's what I'm building towards. If anything, I find it more frustrating that our existing best models can't even do better than a coin flip. That reliability is important, because it's what's going to unlock all of the value for all of you AI engineers. This is a billion-dollar market. You want to make an AI agent that's reliable in law, accounting, ordering books; that's what's going to make you money, and that's what we need. Not a PhD lawyer.
Yeah. Okay, that's about it. I think I'm out of time, so I'm just going to jump straight to... Yeah, we have a waitlist. Thank you.
[Applause]
All right, how about we hear it for all of our hackathon finalists. Very exciting. So, as mentioned, now all of you are going to be the judges. Can I get the next slide, actually?
So, you're going to be the judges. Typically this is done by applause, but that is so pre-GPT. So, let's go by woos. We're going to do a practice round with all of you in the audience. I want you to go: one, two, three. Nice. We only have to do one practice round. Great work. Great work. Okay, so are you ready? I want you to write down who your top team is; I only want you to woo for that team. I'm going to get ChatGPT advanced voice mode ready to analyze the results.
Hey ChatGPT, I am at the AI Engineer World's Fair and we are doing the judging of the top three hackathon finalists, and now we need your help. We don't have Wi-Fi. So what we're going to do is, we're going to, I don't know why I'm talking to her like a kid, we're going to actually do it by applause. We have three teams. I'm going to say team one, team two, team three, and each of them is going to get applause from the audience. I want you to analyze this, whether with actual data measurements that you have or just perceived, and tell us who the winner is, who number one is, who number two is, and who number three is. Are you ready?
Absolutely. I'm ready. Let's do it. Go ahead and announce each team, and I'll listen carefully to the applause. [ __ ] yeah. All right.
Are we ready?
Team one, Survival of the Feature.
Got it. Listening to the applause for team one.
We just did it.
All right. Let's move on to team two when you're ready.
Awesome. We're ready. We're going. Team two, TabRL.
Listening to team two's applause.
It's actually woos, but sure.
I'll listen out for those woos, too. Whenever you're ready for team two.
Team three, Featherless Action R1.
Wow. Applause for team...
You failed. I'm sorry. I'm calling Claude.
No, we have human evaluators in the back. We knew this was a gimmick. Yeah, it was a fun gimmick. But no, thank you for helping us; at least it gives some sense of audience participation, and it's meant to be a bit of a people's choice type of thing. So yeah. Yeah, it worked. Okay. So,
we're going to get the results later. I think, Ben, you can talk to them if you need. But we're going to give out some prizes. I don't know if the Traceloop team is still around; I think I saw some of them out there. But basically, we want to recognize people who've been really pulling out all the stops for the event. We've got best swag, which was won by Traceloop for their keyboard. We've got best dressed, worn by Madison from Baseten. I don't know if any of the Baseten folks are here, but anyone got the "artificially intelligent" shirt? Yeah, that really fun swag. And best tweet from Dylan Patel, basically talking about a relationship that actually started here one year ago, which is pretty sweet. That is actually heartwarming. We try to get people hired, but we never promise any partners. AI Engineer World's Fair, where love happens. Yeah, I think that is AI Engineer Love Fair, which is a very high bar.
Okay. So then, the big categories that we really wanted to hand out. Obviously, unfortunately, a lot of people leave after their talk, so we can't really hand it out, but come and claim it afterwards if you want. Every track has best speakers. You all voted; we really care about giving recognition to the speakers who worked so hard on their talks and shared their experience. Thank you to all these tracks. Can I get a round of applause for MCP, David Kramer, Alex Duffy, Devon Tandon, Daniel Shaliff, Harrison Chase, Dylan Patel, Brook Hopkins, Brian Belelfer, Adamar Freeman, Dennis Nikov, Boris Jurnney Lambert, Rafal Vtor, Daniel Retita, Renee, John, Sheree, Nick, and Paul. I think the retrieval one is wrong; the retrieval one is actually Will Bryk, who actually got his prize yesterday as well. So those are the individual track speakers.
Actually, can I get those picture frames up there? We spent some time putting together the track speaker prize, which is this one. Thank you. It's really nice and printed, and we GIF-ified everything, so it's kind of cool to see. So yeah, come and get your track speaker award if you're still around, and obviously we can send it to you if you're not. Okay,
overall best speaker. We have a runner-up and also an overall winner. I think it's relatively obvious, and it's something we wanted to recognize for our keynoters as well. Oh, where is the... Okay, this is not refreshed. If someone can go back, can we go back two slides? Yeah, runner-up. George, are you here, from Artificial Analysis? Let's hear it for the runner-up: George, Artificial Analysis. They're probably all in the hallway track. Artificial Analysis worked really, really hard on their talk. They actually did this whole 50-page report, and I was like, George, you have 20 minutes, you can't really do this. But they worked super hard on it, and I think it's something that we want to recognize as well.
The winner, though: it was by far the consensus among the people that I talked to and the committee and all that. The winner is our third-time keynote speaker. He went line dancing, so he's not here today; he's not here to receive the award. But I'm actually going to get Lori. Lori, you're actually going to receive the award on behalf of him. It's Simon Willison, everyone.
So, no, Lori Voss. We have two Loris. Sorry, Lori; she's also called Lori. Lori, you're waiting for the next one. So, I don't know, you can present the best speaker award.
Simon nominated Lori to accept because they worked together, on Django? Where did you work together? We worked together at Yahoo in 2005. Yeah. So, the few, the proud. Yahoo Pipes is still a pipe dream for a lot of people. But thank you for accepting the award on Simon's behalf. Thank you. Thank you, Lori.
Okay. Hackathon, hackathon. You have the second best, which is a runner-up, and the best. I was relying on ChatGPT. I don't know. Yeah, it kind of failed me. Okay, so do we want to go by perception? Should we do the woo again? No, no, the audience is thinning. Yes, they're running out of patience. I think it's probably team three, right? Okay. Well, so we have the runner-up of the hackathon. I actually don't know where the prize is. Yeah, there we go. Okay. So, hackathon runner-up. I think it's pretty evident in my mind: it would be the Survival of the Feature team. So, this other Lori, you can come up with your team. Come on up. What's the rest of the team? Come on up. Yes.
Yeah, so you can come for real this time. Yes, you can come. I'm sorry, sorry about that. There you go. Congrats. Thank you. Thank you. Congrats. You should all definitely... Thank you. Congrats, everyone. For our photo, yeah, looking at me right over here. Thank you.
Thank you. And our website is survivalofthefeature.com. Survival of the Feature, survivalofthefeature.com. Please try it. Very good. And I think the winner, as decided by the votes and applause earlier, is Eugene from Featherless, Featherless Action R1. Let's hear it for Eugene. Where is Eugene?
Eugene gets one of the big ones. Oh my god, Eugene, you're so excited. Yeah, Featherless has been grinding away for a long time, and I can't believe you did this in a hackathon. And I'd also like to add that I wasn't alone: Michelle, who couldn't be here, took part in the hackathon as well and worked on it with me. Okay. Well, this is yours. Thank you for taking part.
Yeah. Yeah. Stand in the middle. Okay. Awesome.
That's it. All right, we've got to go. Thanks. All right, now I've got to do one more thing. It's just a quick thanks to everyone who has been part of this. Obviously, to everyone in the audience here; Microsoft, our presenting sponsor; AWS, our diamond innovation partner; Neo4j and Braintrust, who curated our graph and evals tracks; all of our platinum sponsors and all of our sponsors in the expo and beyond. And of course, Swix, the executive producer and program curator of this event. It takes a hell of a lot of work to do that. Leah McBride, our senior producer, has been with us since our very first event in October of 2023, and she really helps make this run. And also our new team members, Melissa Billy and Scott Dilap. So many others, including a special shout-out to Vincent Wendy, who did all of these incredible graphics. Everything you see, that was him. Okay, he didn't do the animation, I'll get to them in a minute, but he did all of that. It's incredible working with him. VCI Events, this is VCI, and they've been running everything on this floor, the Golden Gate Ballroom level. Freeman, all the graphics you see were them. Art and Display, that beautiful expo in there, it's like a little Santa's village; you feel like you're in a little mini city there. Incredible. That was them. Encore helped to run AV up on the second floor. Local 16 helps to operate everything. So, really big thanks to them; they've all been so incredible. Motion Agency actually did all of the motion graphics you see here. They're based in Asia, but they worked some of their hours on Pacific time for some last-minute stuff. Suno, I love working with Suno; it just doesn't miss, every time, and they produce music just from text. The Marriott Marquis, thank you so much. Max Video Productions for B-roll. Randall for photography. Brad Westfall, and Swix, our web developer. Come on, how is he actually doing all this? And Haley Holmes, our incredible show caller, thank you so much. And all the speakers, of course, they've been so incredible. Anyone in a yellow shirt you saw is a volunteer; they come here just to help out and be part of the event and the excitement. So, we thank all of them. We can't run it without you all. Thank you so much.
And then lastly, I'd like to welcome on stage the absolutely hilarious, absolutely wonderful Lori Voss, our MC. Thank you so much. Can we give him a big round of applause, everyone? I keep telling him this, I keep telling everyone this: your intro, did his jokes land or what? They were actually really good. They weren't just dad jokes. So, I really appreciate that, and thank you so much. So, with that, that should do it for the show. Thank you all for staying; the last few of you who stayed for this, I really appreciate it. Thank you so much for coming out.
[Music]
[Applause]
[Music]