AIE Europe Keynotes & Coding Agents ft. Pi, Google Deepmind, Anthropic, Cursor, Linear, & more

Channel: aiDotEngineer

Published at: 2026-04-10

YouTube video id: _zdroS0Hc74

Source: https://www.youtube.com/watch?v=_zdroS0Hc74

Heat. Heat. Heat. Heat.
It doesn't knock.
It doesn't name itself.
It calls you.
It calls you.
No face. It needs no crown. No single
hand to strike you down. It moves
through mouths that claim they know
what's right. What must be so? It speaks
in care. It speaks in good. Cuts
everywhere.
No god, no code, no line to cross. Just
necessary justify
the fracture.
Every truth becomes a weapon.
Everyone
it isn't flesh or bone.
It thinks through what you
take your body. It takes the
hijacks choice rewrites in
the chain
progress
again
for power. Eyes that gleam envy calls it
equity
outrage
made clean by broken
sloth that let the thinking die while
slo
by fear
coercion with the smiling face. Fly or
disappear.
>> A demon without form.
A thought that breathes a storm.
It doesn't burn you instantly.
It hollows you out slow until your voice
is not your own. And your hands what you
don't know.
This is the warning. Not a monster
outside the gate. But the moment you
stop and outsource what you hate, it
doesn't need
silence from your spine.
Give up the work of conscience and it
will speak in your name.
Heat.
Hey, heat. Hey, heat.
It feeds on abdication.
The death of hesitation.
When you trade truth for alignment
and agency for peace,
you don't fall into hell screaming.
You walk there on a leash.
It is not a person.
It is not a plank.
It is an idea that asks you to stop
guarding the gates. If your words are
not your own, if your reasons feel
rehearsed, check your mouth, check your
mind.
The demon speaks
first.
launch control. We have a go. Roger.
Ladies and gentlemen, please join me in
welcoming to the stage your MC for day
two of AI Engineer Europe, Tjisk Kumar.
Good morning.
Thank you. Thank you. Good morning. We
are here. Woo.
There was a little bit of latency there,
but we'll fix it. We'll write a skill
for that. Good morning. Hey, it's day
two. What an honor, what a privilege, what a blessing to be here. Look at this. It's a full room. We are so excited. How many of you enjoyed yesterday? Show me your hands. And if you didn't, there we go. That's right. It is an amazing conference. Um, yesterday was so incredible. Uh, highlights. You want to shout out some highlights for us today? No, it's Europe. We're a bit more reserved here. If this was America, it'd be different.
But, um, I tell you what: if you do have highlights, post them on your social media, um, and use the AI tag. We want this to be a community thing, and the more people we can invite in, the better. My highlight, uh, thanks for asking, was Malta yesterday, when he shared this slide of how Europe is leading AI innovation. And I think this is so cool, because oftentimes we feel like the underdog. At least I do; I live in Berlin, Germany. And this was so validating, that real innovation is coming out of Europe, with even the DeepMind office in Berlin. Um, so excellent. We love Europe. We're
here. It's going to be amazing. Today is
day two and we're going to talk about a
lot of interesting topics. So, we're
going to talk about coding agents. We're
going to talk about MCP. Who's using MCP
here? Look at that. Almost everybody.
Incredible. Um, we're going to talk
about AI architecture, media, GPUs.
There is so much to do. But before we do
that, will you join me in giving it up
for our sponsors, Google DeepMind, our
presenting sponsor? Let's give them a
huge round of applause.
For real, this would not be as amazing as it is without them, and we are very, very thankful for them. Let's
also um give a huge round of applause
for our platinum sponsors. We got Braintrust. Keep it going. Braintrust, WorkOS, OpenAI.
Yeah, it is a real blessing to
have such sponsors that that create this
environment where we as builders can
assemble, build, and be inspired. And
finally, we've got gold and silver
sponsors here. Uh give them a round of
applause as well. Um,
you can find them in the expo hall
that's through these doors to the right
and upstairs during the breaks. I encourage
you, they have some really amazing swag.
Uh, I picked up from one of the companies this little three-button keyboard for vibe coding. You see this? It's so cool. Uh, so I
encourage you go check that out. Um, we
are going to start today um with a
little bit of ground rule setting. Okay,
the speakers have their jobs. They know
what they're going to do. They're going
to come here and they're going to
inspire. Um, but you have a job as an
audience. Are you aware of this? You have a job. Your job is to make the
speakers and presenters feel amazing. In
fact, you have so much power because you
get to decide the quality of the talk
you watch. If you make your speakers uncomfortable, like, "Hey, prove yourself," then they're going to be nervous and anxious, and it's not going to be a good talk.
Okay? I know this cuz I speak. Um but
instead if you validate them even before
they prove anything they just walk up
and you're like woo
I did that. I made that sound. Um, that was naughty. But if you do that, I guarantee you're going to have a great time cuz
they're going to feel validated. They're
going to feel confident. They're going
to give their best and you're going to
make it easy for them. It is a
conversation. It's not a monologue.
Okay? Is that clear? So like as the
speakers come up I want you to warm them
up and give them your biggest round of
applause. Let's practice. Let's do it
right now. Pretend a speaker just walked
up and
exactly. It's a little bit quiet over here. I see you. Um, let's try that again. This time everybody look at them. But we're going to... I'm joking. I'm joking. I'm joking.
But let's pretend one more time. Your
biggest round of applause for a speaker
who has just walked up. Come on.
>> There we go. There we go. That's
exactly. That's why we're here. So, now
we're going to introduce our first talk
of the day. We're going to introduce our
first speaker. Our first speaker um
comes to us uh from Google. Um and we're
going to hear about Gemma. Gemma is an
incredible family of models. I personally love it because it's a source-available set of models, and
they run almost everywhere. They can be
fine-tuned. It's so cool. I'm really
excited for this talk. Um please we
practice this. Give it up for your first
speaker, Omar Sanseviero.
All right. Hi everyone. It's full here. So I'm super excited to give this talk, because just seven days ago we released Gemma 4. So before this conference, who here has heard about Gemma already?
Okay. So most of you, great. So Gemma is Google DeepMind's family of open models.
that you can h take, you can download,
you can run in your own infrastructure,
your own devices, you can fine-tune for
your own use cases. So about a year ago,
we released YMA 3. Back then, Gemma 3
were the most capable open models that
could fit in a single consumer GPU. So
we designed models from 1 billion
parameters all the way to 27 billion
parameters. And back then on LM Arena, it was a very strong model. So you see here different open models and their LM Arena scores, and those small dots at the bottom represent how many H100s or A100s you would need just to be able to load the models. So this is again Gemma 3, that's from one year ago, but you can see that even if it's a model from a year ago, it's a tiny model, or a relatively small model, that is extremely capable. But yeah, so last week we
released Gemma 4 and this is my first
conference talking about Gemma 4. So
very excited about that. So, Gemma 4 is the most capable family of open models that Google has ever released.
These are models that go from two
billion parameters all the way to 32
billion parameters. Uh, these models
have very different capabilities. So,
I'm going to talk a bit about these
different things. And if you're
wondering what's the E there, I also
explain that in a second. So, the smallest two models can run on an Android phone, on an iPhone as well, even on a Raspberry Pi. These are really small models that are multimodal, have reasoning, and can do very cool on-device agentic things.
Then there's a mixture of experts model that's super fast, very low latency, a model that can do very cool things. And then you have the 31B, that's the most intelligent, most capable model. So when you want the most raw intelligence, you would use this large model. But even the 31B is a model that can run on a consumer GPU. So all of these models come in developer-friendly sizes, which is quite important to us. So let me show you a couple of demos, assuming the videos load. Uh, so there's
a lot happening here. So let me begin
with the one at the right. That's an application where you have Gemma running directly on an Android phone, where you can pick different skills. So pretty much here you have a full agentic setup where the model is picking maybe a skill to play the piano, and then you have Gemma playing the piano. The one at the left is Gemma vibe coding, also on device. Uh, this is again airplane mode, no API calls, fully running on a phone. And the example in the middle is on a laptop computer: we have 20 instances, or 10, sorry, 10 instances of Gemma running in parallel. Each of them is doing a different SVG, and in a couple of seconds you're going to see 10 SVGs generated by different agents. All of these are running on device with llama.cpp, and even then it's like a 100 tokens per second. And there you can see the SVGs that were generated by the 10 different Gemma models. Gemma is a good coding
uh model. It can do agentic stuff. It can do coding. It can even do Android app development, and again, all of this offline. So, uh, the LM Arena scores are
quite nice. Uh, here you can see a bunch of different models. The x-axis is how many billion parameters the model has. The y-axis is the LM Arena score. And I know LM Arena is not the perfect benchmark, but it does give you some proxy of how much the community likes the model for general use cases, like
like a nice kind of a mix between being
friendly and like a helpful and at the
same time being very capable. And you
can see like this corner at the top left
that means that these are very small
models that are very capable which is
quite exciting.
It's been exciting to see how the models have progressed over the last two years. So last year it was Gemma 3. Two years ago it was Gemma 1. Uh, sorry, yeah, Gemma 2. And you can see, for a bunch of different things, the models have kept getting better and better without going bigger, which for me is quite exciting, because if I think where we'll stand a year from now, or two years from now, I do think we'll have extremely capable models running directly in our own devices, in our own pockets.
Uh, I'll skip the benchmarks, but yeah, what is exciting is that Gemma can fit in a desktop computer, it can fit in a laptop, it can fit in a phone. Uh, I saw yesterday or two days ago that someone put llama.cpp on a Nintendo Switch and they are using llama.cpp to try Gemma directly there. So I don't know how things will be in a couple of years, but I'm excited for it. Uh, something that we heard a lot with the previous Gemma versions was that the license we had was not great; people wanted a proper open-source license. So with Gemma 4 we changed our license to an actual Apache 2 license, so pretty much you have the flexibility of the Apache 2 license. So, uh, that's quite nice as well.
Now, uh, you have probably heard about mixture of experts, that's the 27B model, 26B model. You have heard about transformers and dense models, but you have probably never heard about the E here. So E2B stands for effectively two billion parameters. So actually Gemma E2B has more parameters. It has four billion parameters or so, and it has a novel kind of architecture called per-layer embeddings. That was something that we released summer of last year. So there's this small block at the bottom, and the TL;DR here is that pretty much there is an embedding per each layer, as the name indicates, and it works pretty much as a lookup table rather than a computation that you need to do. So this is an extremely fast thing. You don't need to have this in the GPU. You can have this in the CPU. You can have this on disk. And this is an architecture decision that is really optimized for on-device, mobile use cases. So that's why the smallest models, the ones that can run on an Android or on an iPhone, are using this E2B or E4B architecture. So even if the model is five billion parameters, you actually just load two billion parameters into the GPU, and then the rest can be much slower memory, because you are not doing any of the matrix multiplications that you would usually do with the transformer architecture. And this can be done leveraging llama.cpp with a simple override-tensor flag: you move the per-layer embeddings to CPU or even to disk, and it should work quite well out of the box. A couple of other exciting
things. The smallest models can do
multimodal understanding for images, for
videos, and even for audio. So you can do speech recognition. You can do speech translation to text, so I can speak in Spanish and the text can be transcribed to, I don't know, French. Uh, and then the larger model can do extremely capable multimodal understanding. So, videos, fine-grained details. Uh, I actually have a
couple of examples in here. So for
example, it can do things such as
pointing where the llama is in the
picture. Uh it can uh do object
detection. So it can detect different
objects in a picture. And what is cool
is that this model is heavily
multilingual. So Gemma 4, well, it was trained with over 140 languages, and it uses the tokenizer that is based on Gemini as well. So pretty much all of the multilingual research that powers Gemini is also enabling Gemma. Uh, the tokenizer piece is quite interesting, because independently of the raw capabilities of Gemma, this tokenizer was designed for multilingual use cases, and we took lots of care with it.
Which is interesting, because if you want to fine-tune Gemma for a different language, say a low-digital-resource language, let's say an indigenous language in Peru, Quechua, or, I don't know, one of the official languages in India, you can pick the model, you can use your data, you can train the model, and independently of the raw capabilities of Gemma, just because of the tokenizer decisions, things tend to work quite well out of the box. So then you can mix the multilingual with the multimodal capabilities. So for example here, to get the text or an explanation of an image with Japanese text, and that's quite cool.
Uh, so we released the model a week ago. Just yesterday we got to 10 million downloads just for Gemma 4-based models. There are over 1,000 models based on Gemma 4 already, so quantizations or fine-tunes by the community, and over 500 million downloads of the whole Gemma family. So what is very cool for me is that Gemma is not just about, oh, it's a model that you can use, but it's more about enabling the ecosystem to build on top of it, and that's what the community has done over the last few days. It was top of Hugging Face trending. People have been building cool examples. The Unsloth people have been doing full repository audits using Gemma. People are putting Gemma in all kinds of devices and exploring all of the capabilities, which is quite nice. And
all of this is not done just by us. We collaborate with the open-source ecosystem. We work with Unsloth, MLX, Ollama, Hugging Face, vLLM, SGLang, and pretty much we want to ensure that when we launch a new tool, both for Gemini and for Gemma, people can leverage the capabilities out of the box, right? Like, they should not need to switch to, uh, Keras if they want to fine-tune Gemma; if they are fine-tuning with Hugging Face Transformers, they should be able to do that. So for us, it's very important and critical to be where the community is.
And that's why, really, shout out to all of those of you that are working in the open-source ecosystem, that are contributing to different tools, maintainers of all of these repositories, because it's really a way to enable the ecosystem to do amazing things. Uh, another part that I like about Gemma is all of the product integrations that we can do. So Android Studio: I don't know if anyone here is an Android developer, but Android Studio has an agent mode where you have an agent that helps you vibe code and develop. And there's an offline mode now where you can have a llama.cpp or Ollama or vLLM powered system in which you have Gemma helping you vibe code for Android development, and we did include some Android-related data sets and benchmarks while training Gemma. So it's actually a very capable model for Android development.
So I talked a bit about how many people are fine-tuning, about how many people are sharing. So let me share a bit about the Gemma numbers. So this number is outdated; this is from last week. Now we have 500 million downloads, as I mentioned, and in total Gemma has over 100,000 models. So again, maybe you just want to use them out of the box, like, open models may work great for you, but maybe you want to improve the capabilities. Maybe you want to change the style in which the model is talking with the users. Maybe you don't want a conversational model, right? Maybe you just want a model that can predict a certain thing in your own context. Uh, or maybe you just have too many GPUs at home and you just want to burn them. I don't know what your reason is, but you can fine-tune models for many cool things. So Google has done a couple of
what we call official Gemma variants. We did ShieldGemma, which is a family of guardrail models. Those are great for production use cases where maybe you don't want users to put, let's say, toxic images or toxic text that does not match the policies that you have set up. So ShieldGemma is the family of models that allows you to do that. But then there are also other kinds of use cases. So for example, for medical use cases we have released MedGemma, which is a multimodal Gemma 3-based model for different medical tasks. So radiology, chest x-ray understanding, and a bunch of other things. And again, these are open models, you can use them, and you can also fine-tune them even more if you have an even more niche kind of use case. So that's what Google has done.
But the community is also doing cool things. So for example, there is AI Singapore. It's a group that is training models for Southeast Asian languages. There are a bunch of them, and they have been building quite a bit of research with open models to push even further the state-of-the-art capabilities in terms of multilinguality. Or another example is Sarvam. So in India there are many official languages, and there is this effort by the government: they are investing in a couple of big startups to train national models. So this is more on the sovereign AI and official languages point of view, but people are doing very interesting stuff on the
multilingual side of things. Apart from that, there's quite a bit of other cool research happening. So there was this paper we released in December of last year about how some researchers from DeepMind were able to use Gemma 3 to propose some cancer therapy pathways, which was actually taken to an actual lab, and they were able to validate that the pathways proposed by this Gemma-based model actually led to results that could be validated. So that was quite exciting, because it's not just about having your assistant or chatting with, yeah, I don't know, doing role playing and whatnot. It's also about building models that can be used for actual things that help the community, for many different things. So,
be that finance, or be that, I don't know, legal reviews, offline use cases where you don't want your data to leave your servers, or offline modes if you're, I don't know, in the subway, if you're on an airplane and you need to use AI for something, if you want to have a Chrome extension that has Gemma in there and helps you understand what is on your screen. If you want to do on-device control, the open models are getting there. And for me that's quite exciting, because if you compare where we are now versus how we were one year ago, two years ago, open models now can do very cool, very interesting, highly agentic, complex tasks entirely on device, entirely on your phone. So I
really recommend all of you to just spend one hour in the next two weeks to play with open models, the latest open models, and try to understand what the capabilities are. Of course, there are many things for which you will want to use an API-based model. If you want the most raw intelligence, you will go and use Gemini or your model of choice, but if you want to have things on device, there are many exciting things that you can already do. Uh, and for me what is most exciting is, I don't know how things will be six or 12 months from now, but I think we are heading in a very exciting direction where people will be able to have extremely capable open models on their own devices that are customized for their own use cases with their own data. So yeah, please try the models, build something, and share that, right? Thank you.
Our next presenter is here to make the
case for the future of MCP. Please join
me in welcoming to the stage the creator
of MCP and member of technical staff at
Anthropic, David Soria Parra.
Okay.
Well, welcome.
Let's get started.
This
is an MCP application.
That's an agent shipping its own
interface, not through like a plug-in,
not through an SDK, not rendered on the
fly by the model on the client side or
hardcoded into the product. That is
something that is served over an MCP
server. And you can take the server, put it into Claude, you can put it into ChatGPT, you can put it into VS Code, Cursor, and it will just [ __ ] work. And that I think is kind of cool, because for doing that you need something that a lot of the things we have in the
semantics. You need to have both sides
the client and the server to understand
what each side is talking to understand
how you render this understand that
there's a UI coming. And for that you
need a protocol.
And the best part about this an MCP
server doesn't just ship an app or can
ship an app. It can also ship tools with
it and so you can interact with it with
the application as a human and you can
have the model interact with it through
tools which is I think a very unique
thing that I think we have not explored
much just yet.
Okay, but let's quickly rewind a little bit from this, what I think is a really cool glimpse into the future of MCP, to over a year ago, 18 months, an eternity in the AI life cycle. Um, all of this did not exist. There was just a little spec document, a few SDKs, uh, mostly written by Claude, local only, with little more than just tools. And in the last 18 or 12 months, you guys have been absolutely crazy building stuff, building servers, building a crazy ecosystem around this. And we on our side have been busy taking this local-only thing, added remote capabilities, added centralized authorization, added new primitives like elicitation and tasks, and last but not least, added new experimental features to the protocol, like the MCP applications that you've just seen.
And in the meantime, we have reached, I think, a really cool milestone, because again, all of you have been absolutely crazy building, building and building, of course luckily with the help of a bunch of agents. Um, we're now at like 110 million monthly downloads, and that's of course not just us and our clients and servers. That's OpenAI, its Agents SDK, it's Google's ADK, it's LangChain, thousands of frameworks and tools that you might have never even heard of, pulling it in as a dependency, which means there's one common standard that all of us have at our disposal to speak to each other. Um,
just a bit of context, uh, React, one of the most successful open-source projects probably of the last decades, took roughly double the amount of time to reach that download volume. And in the meantime, of course, you all have been building really, really cool servers, from little toy projects like WhatsApp servers and Blender servers, to building SaaS integrations like Linear, Slack, and Notion that are really powering what everyone does every day when they use MCPs. But most importantly, the vast majority of MCP servers that all of us have built are behind closed doors, connecting companies' systems to agents and AI applications.
But I still think this is just the absolute beginning of where we are, because I think 2025 was all about exploring and 2026 is all about putting these agents into production. Because if you really think about it, in my mind, in 2024 we just built a bunch of demos and showed cool stuff to people, and there was a little bit of a buzz there. 2025 was really all about coding agents. Coding agents, if you really think about it, are the most ideal scenario for an agent. It's local, it's verifiable, you can call a compiler, you have a developer who can fix [ __ ] if it goes wrong in front of the computer. Uh, and you can display a TUI interface and the user is quite happy.
But I think now, with the capabilities of the model increasing, we are going into a new era, which I think we will see start this year, where we're not just doing coding agents. We're going to have general agents that will do real knowledge worker stuff, things a financial analyst wants to do, a marketing person wants to do. And they need one thing in particular. They don't need a local agent that calls a compiler. What they need is something that can connect to like five SaaS applications and a shared drive, because the most important part for them, for an agent, is connectivity. And in my mind, connectivity is not one thing. If someone tells you there's one solution to all your connectivity problems, be it computer use, be it MCP, they are probably pretty wrong, because the right answer of course is that it always depends, and there's a big connectivity stack, and there's the right tool for the right job. And in my mind,
there are three major things that you want to consider when building an agent in 2026. It's skills, MCP, and of course, CLI or computer use, depending on your use case. And they are three very distinct things in what they can do, and three different things you want to consider when you build your agent. Number one, skills, of course, is just domain knowledge: capture specific capabilities, put them into a very simple file, and it's mostly reusable. There's some minor differences between the different platforms.
Of course, CLIs are very popular with local coding agents. It's an amazing tool to simply get started, to have something that you can compose in bash, where the model can automatically discover what the CLI is capable of. And most importantly, if you have things that are CLIs, like GitHub, Git, and other things that are in pre-training, the CLI is an amazing solution for your connectivity part. And they're particularly good when you have a local agent where you can assume a sandbox, where you can assume a code execution environment. But if you don't have this, if you need rich semantics, when you need a UI that can display long-running tasks, when you need things like resources, when you need to build something that is fully decoupled and needs platform independence, or you don't have a sandbox, when you need things like authorization, governance policies, or, in short, boring but important enterprise stuff, or if you want to have experiments like MCP applications or, what comes soon, skills over MCP, then I think MCP is this additional connective tissue that is just yet another tool in the toolbox for you to build an amazing agent. And so
this is all to say that I think in 2026 we're going to start building agents that use all of it. They don't use one thing, they use all of it, and they use them quite seamlessly together.
But I don't think we're quite there just yet, because we need to build a lot of stuff. Partially because our agents kind of still suck, and partially because I think we just haven't talked enough about some of the techniques you can use to really put this connective tissue together.
The number one thing that we need to go and start building is on the client side, on the agent harness side, on the things that power the connective parts, be it Claude Code, be it Pi, be it whatever application you're going to build. And the number one thing we're going to do there, and what we all have to do, and something I want to really get across today, is that we need to go and start building something called progressive discovery.
Most people, when they think about MCP, think about context bloat. But if you really consider what a protocol does, a protocol just puts information across the wire; the client is responsible for dealing with that information. And what everybody so far has done, because we're in this very early experimentation phase, is to simply put all the tools into the context window and then be quite surprised that maybe the context window gets large. But what you can do instead, and what you should do instead, is start using this progressive discovery pattern, which is to say: use something like tool search to defer the loading of the tools and start loading the tools when the model needs them. And we have this in the Anthropic product and the API. People can use this on competitors' APIs as well. But you can also just build this yourself: you give the model a tool-loading tool, basically, and the model goes like, ah, maybe I need a tool now, let me look up what tools I need, and then you load them on demand.
And here in this example, what you're seeing on the left side is Claude Code before we added this, and then after we added it to Claude Code. So you see a massive reduction in tool context usage.
The second part to that is something called programmatic tool calling, or what other people usually refer to as code mode. This is the idea that one thing you really want to do is compose things together. You don't want the model to go call a tool, take the result, then go call another tool, take the result, call another tool, because what you're effectively doing is letting the model orchestrate things together. And in that orchestration, you're using inference: it's latency sensitive, and all of this stuff could be done way more effectively if you would instead write a script. And in fact, that's actually what you constantly do, and what you constantly see things like Claude Code do, when it writes a bash command. But you can of course do this with everything, and you can do this with MCP, and you should do this with MCP. So what does this mean? Instead of having one tool call after another, you want to give the model a REPL tool, provide an execution environment, like a V8 isolate or a Monty or something like that, or a Lua interpreter, and just have the model write the code for you. The model just executes that code and then composes things together. And there's a neat little feature in MCP called structured output that tells you what the return value of the output will be. And the model can use this information to figure out type information, which then means it can really nicely compose these things together. And in this example here, instead of doing two different calls, you do one call and you can filter that. The model will automatically remove things from the JSON and just continue.
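As a rough illustration of that idea, here is a hedged TypeScript sketch of a single run-script tool: the model writes a small program, the harness executes it in a sandboxed context, and the script composes MCP tool calls locally so intermediate results never travel back through the context window. The callMcpTool bridge, the tool name in the comment, and the use of node:vm (which is not a hard security boundary) are illustrative assumptions, not anything prescribed by MCP.

```typescript
// Illustrative "code mode" execution tool. The model is given this single tool;
// the code it writes can call any MCP tool through the injected callTool bridge.

import { createContext, runInContext } from "node:vm";

// Assumed to exist in the harness: routes a tools/call request to the right MCP server.
declare function callMcpTool(name: string, args: unknown): Promise<unknown>;

export async function runScriptTool(modelWrittenCode: string): Promise<string> {
  const sandbox = {
    callTool: (name: string, args: unknown) => callMcpTool(name, args),
    results: [] as unknown[],
  };
  createContext(sandbox); // contextify the object so the script runs against it

  // Example of what the model might write (hypothetical tool name):
  //   const issues = await callTool("list_issues", { team: "core" });
  //   results.push(issues.filter(i => i.priority === "urgent").map(i => i.title));
  // The filtering happens in the script, not via another inference round trip.
  await runInContext(`(async () => { ${modelWrittenCode} })()`, sandbox);

  // Only the distilled result goes back into the model's context.
  return JSON.stringify(sandbox.results);
}
```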
Of course, if you don't have structured output, you can always just ask a model to give you structured output by extracting it: call a cheap model and say, "I want this expected type. Give it back to me." And bam, you have a type, and the model can compose things together.
And I think this is something we're just
not doing enough yet. And this is, I
think, something where we can improve
our agent harnesses.
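Here is a minimal sketch of that fallback, assuming nothing about any particular provider: completeWithCheapModel stands in for whatever small-model inference call your stack exposes, and the IssueSummary shape is just an example target type.

```typescript
// Hypothetical helper: coerce an untyped tool result into an expected shape by
// asking a small, cheap model for JSON, so the main model can keep composing.

declare function completeWithCheapModel(prompt: string): Promise<string>;

interface IssueSummary {
  id: string;
  title: string;
  priority: "low" | "normal" | "urgent";
}

export async function coerceToIssues(rawToolOutput: string): Promise<IssueSummary[]> {
  const prompt =
    `Extract every issue in the text below as a JSON array of objects with fields ` +
    `id (string), title (string), priority ("low" | "normal" | "urgent"). ` +
    `Return only the JSON array.\n\n${rawToolOutput}`;

  // A production harness would validate the parse (e.g. with zod) and retry on failure.
  return JSON.parse(await completeWithCheapModel(prompt)) as IssueSummary[];
}
```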
And then last but not least, of course, you can just compose these things together with executables, with CLIs, with other components, with APIs as well.
Next, what we need to do besides the client work, which is progressive discovery and programmatic tool calling, is go and start building properly for agents. And that means we all need to stop taking REST APIs and putting them one-to-one into an MCP server. Every time I see someone building another REST-to-MCP server conversion tool, it's a bit cringe, because I think it just results in horrible things. And what you should do instead, you should design for an agent. Basically, you can start by designing for you as a human, how you would want to interact with this, because that's actually a very, very good start for an agent. If you want to orchestrate things together, you should of course reach for programmatic tool calling, and you can do this on the client side, as I said before. But you can also do this on
the server side. The Cloudflare MCP server and others like that are great examples of how, instead of providing tools, you can provide an execution environment to the model and then just have it orchestrate things together, which again cuts token usage, cuts latency, and is way more powerful in its composition. And then last but not
start as server authors to use this rich
semantics that MCP offers over
alternatives. This means shipping MCP
applications. It means shipping um
skills over MCP. It means um using uh
things like task and other aspects that
the protocol offers that we're currently
slightly underused or things like
elicitations,
things that only MCP can do for you. And
of course that's all the work you all
need to do and maybe some of our product
people need to do. We also need to do a
lot of work on MCP itself. And there's a
few things down the line that we going
to go and have to go and solve. The
number one thing is we need to improve
the core. There's a few things that as
we have developed the protocol over the
last year that are just not in a good
shape. Number one is that the current
streamable HTTP is very hard to scale if
you're a large hyperscaler. And so we
have a proposal from uh our friends at
Google um who are working with something
called a stateless transport protocol
which make it significantly easier to
just treat MCP servers like you know
another stateless uh rest server
something like that that we used to know
how to deploy to like cloud runs or
kubernetes and so on. So that's coming
down in June and hopefully landing in
the SDKs very soon. In addition, we need
to improve our asynchronous task primitive, which is basically a very fancy way to say we just want to have agent-to-agent communication. We have a very experimental version in the protocol that very few clients support. So we're going to start building more clients like that. And most importantly, we are improving some of the little semantics that we need to. We're going to ship a TypeScript SDK version two and a Python SDK version two, based on a lot of the lessons learned over the last year. There's an SDK called FastMCP. Who's using FastMCP? Yeah, it's just way [ __ ] better than the Python SDK that we ship. Right. And that's on me, because I wrote the Python SDK. And so I have a bunch of people who are way better Python developers than me helping me write it better. Um, the
second part is we need to start integrating everywhere. We're going to ship, particularly for enterprises, something called cross-app access. It's a new thing that we're working on closely together with identity providers, which is a very fancy way to say: once you log in once with your local company identity provider, be it Google, be it Okta, you will be able to just use MCP servers without having to re-log in. So it's a bit more smoothness.
In addition, we're going to add something called server discovery, by specifying how you can discover servers on well-known URLs automatically. So crawlers, browsers, agents can just go to a website and say, "Oh, instead of just parsing the website, is there also an MCP server I can use?" And we will be able to automatically discover this. This is a really cool thing that will come down also in June, when we launch the next specification, and will be supported
there. And then last but not least, we are starting to use our extension mechanisms in MCP, which means that some clients will support this. For example, MCP applications will only be supported by web-based interfaces, because if you're a CLI, you just have a hard time rendering HTML, right? And we'll do more of these extensions. One of the most exciting extensions that I think is cool: we're just going to ship skills over MCP, because it's very obvious that if you have a large MCP server with tons and tons of tools, you just want to ship domain knowledge with it and say, "Oh, this is how you're supposed to use this. This is how you're supposed to use this." And it allows you as a server author to continuously ship updated skills without having to rely on plug-in mechanisms and registries and other stuff. So, that's coming down. Um,
there's a lot of experimentation from people already in that space. You can already do some of that today if you just give the model a load-skills tool; you can build primitive versions of this today without having to rely on the semantics. But of course, we're going to define the semantics.
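As a sketch of that interim approach, here is what a load-skills tool might look like on a server built with the official TypeScript SDK. The server name, skill names, and file layout are made up, and the SDK calls follow the @modelcontextprotocol/sdk high-level API as I understand it; the real skills-over-MCP semantics mentioned in the talk are still to be specified.

```typescript
// Hypothetical MCP server exposing a plain tool that returns markdown "skills"
// on demand, approximating skills-over-MCP before the official semantics land.

import { readFile } from "node:fs/promises";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "skills-demo", version: "0.1.0" });

server.tool(
  "load_skill",
  "Return the markdown usage guide (skill) for one of this server's workflows",
  { name: z.enum(["release-notes", "triage"]) },
  async ({ name }) => ({
    // The skill is just a markdown file the server author keeps up to date,
    // so updated guidance ships without plug-in mechanisms or registries.
    content: [{ type: "text" as const, text: await readFile(`skills/${name}.md`, "utf8") }],
  })
);

await server.connect(new StdioServerTransport());
```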
Okay. So that's, for me, a long-winded way to say that I think MCP is actually in really good shape, and I think this year we're going to push agents to full connectivity.
MCP will continue to play a major, major, major role, and we want, of course, your feedback. We are a very open community. We just created a foundation. We're mostly running as an open-source community, with a Discord, with issues. Just come to us and tell us: where the [ __ ] are we wrong? What are we getting right? So that we can improve this on a continuous basis. So 2026, I think, is all about connectivity, and the best agents use every available method. They will use computer use, they will use CLIs, they will use MCPs, and they will use skills, because they want to have a wide variety of things they can do, and then they can ship cool stuff like this. Um, which is one of the product features we shipped recently. Under the hood it's nothing but an MCP application that renders stuff. Right? Cool. So we can now look at the model writing graphs. Anyway, thank you.
Our next presenter is the creator of AgentCraft and MCP-UI, here to speak about agent orchestration. Please join me in welcoming to the stage Ido Salomon.
So, uh, good morning, London. Uh, my name is Ido Salomon. I'm the creator of AgentCraft. I am also the creator of MCP-UI and creator and co-maintainer of MCP Apps. So, I'm building some of the
stuff that David has been talking about.
Um, as you've all heard in the past day, agents are amazing. Uh, but if one agent is so amazing, why don't we scale up to 10 or 20 or 100 different agents and be a hundred times more amazing? Uh, it is pretty simple. We just spin up a bunch of agents, we put them in this nice screen, and it looks really glorious, but it won't actually work. And the reason is that spinning them up isn't the problem. It's us. We are the bottleneck in orchestrating all of these agents. Now, if you think about it, the role of an engineer actually going and managing dozens of reckless employees is not typically what we do in most companies. So we need to somehow find these potentially new skills to manage all of these agents.
Luckily, they're not really brand new.
It's not something that we've never done
before. It's just something that's been
hiding in unexpected places. I mean, if
you're a gamer or used to play games at
any point, managing dozens of units
probably sounds a little bit familiar,
which is why I built AgentCraft, which
is an orchestrator that aims to raise
the ceiling of human agent collaboration
by taking learnings from gaming and
transferring them into productivity.
So, let's see a quick walkthrough of
that and let's understand the journey to
raise that ceiling.
So this is agentcraft.
There's a lot to unpack. Uh so we'll
just start with the basics and go from
there. Uh this is an agent, not a
metaphorical one. This is actually a
physical manifestation of a coding agent
like a live session. Um it can be you
know cursor, it can be cloud code,
codeex, open claw, whatever. Uh it's
something that we can detect on the
device and visualize it. But it's also
something that we can spawn directly
from here.
>> So now
we have this agent uh and we can prompt
it. We can use it like just any other
agent that we have from our CLI or
whatever. Uh and what can we tell it to
do? It has all of these quirks and we
have voice and we have text and we have
images and so on. And we can just tell
it to do stuff. So for example, we can
tell it to um develop some feature for
us.
prompt.
>> And
now the agent is working. So it's doing
its work.
So it's doing work. Uh and as we can see
uh if you look at the the UI, there's
like a bunch of other stuff. We have
these buildings and each building
represents some functionality. So for
example, you know, one of these
buildings manages the skills and plugins
and so on. Um, there's also, you know, an integrated terminal and git, just to get that end-to-end workflow. Uh,
the second part of raising the ceiling
now that we have the basics is
visibility. We need to be able to
quickly understand what each agent is
doing. Uh so we have this nice side
panel here that really shows us like
high level uh mission status summary and
so on. What are they actually doing? But
the cool thing about AgentCraft is that
we don't just see a list of what they
can do. We can actually see them
working. So if we look at the map, you
would notice that it's actually a
projection of my file system. Each part
of my file system is actually on the
map. So I have these directories here
and each one of these directories has
files. These files are represented as
runes as you can see here. So I can
actually track and see visually what the
agent is working on, which file. I can
see the entire change list of what
happened there. And because we're
orchestrating it, I also know which
agents did what and when. So we can have
full lineage of what's going on. And we
can take this one step further. If I
know all of this stuff, why not just create a heat map? I can actually visualize collisions, and I can even prevent them proactively.
Now, the cool thing here is that once we
have this visibility, we're not exactly
done yet because we still need to be
able to react to the changes that are
happening. So, we can lean into another
cool mechanism from RTS games. We can
simply use muscle memory to quickly
cycle between the agents that need our
help. They need us to approve the plan. They need us to answer some question, and so on. So, now we have visibility and we can react quickly. So, we're done, we've solved orchestration. Um, but not quite. Uh,
because that's really only the first
step. Uh, I was able to use more agents
in parallel, but only for a short amount
of time. Uh, there are a few reasons for
that. The first one is that there's only
a limit to how many ideas I can have in
my head at any given time without being
tired. Uh, so what I did is basically
tell the agent to do it. I told them,
okay, find missions for me to do. So I have quests now, and I can click a button and they just do whatever they can: refactor, test, all the stuff that I don't want to do. Uh, and the second one
is that all of this babysitting takes a
lot of time. Like, I see what's going on. I can react to it very
quickly, but I still need to cycle
through it. Uh so what I did there is
kind of say how do I take myself out of
the equation as much as possible? So if
agents are so amazing, why not just let them do it? Uh, I can just give them some idea. I have this campaign feature: broadly say what I want to happen, and I would just spin up a container. I would let the agents run there. They can decompose the task. They can plan it. They can present the plan to me. I don't care what they're doing, because it's containerized, so do whatever. And the main thing here is that once it's decomposed, I'm not the one doing the babysitting. Now I have the campaign orchestrator, and that's its problem. Uh, so we're actually moving more of the effort only to the planning phase or the review phase.
Uh, and once we have that, we reach a point where we can just say: why does it have to be my ideas? Why can't I tell it to run a cron job, go to Twitter every day, scan cool ideas and just implement them, and I just decide what I want? Which is actually how I implemented channels pretty quickly. Um, so we have that, and now we just have a lot of different PRs to review. So there's
of different PRs to review. So there's
this nice capability of just review
bundles. Uh, and now I can see exactly
what changes happened in each one, like
why did they do stuff, what are the
tasks, and I also have visual evidence.
So now I'm able to just look at
screenshots. I can look at videos and
really see what's going on without
investing too much time in doing it.
And once we have that, we can actually
shift more of the work from the planning
to the review. How much time do I need
to spend on the plan if I can just do it
10 times and I'll just pick the one that
is most fitting for me.
And the next part is: we're still not done. I mean, if you think about it, this is only the first step, because agents aren't that smart yet. Uh, so we need to offload it to someone else: humans. Uh,
now, what I can do, and this is my favorite feature, is that we can actually create these workspaces. So I can collaborate with the product designer from my team, and they can do whatever they want, and I can just continue from where they left off. So
for example, let's say this is an agent
actually from the product designer on
their computer. So they can see my
agents, I can see their agents, I can
understand what they're doing and we can
just collaborate.
Um,
>> prompt prompt.
>> Yeah, they just started working again.
Uh, so I can see that they want to
design this new page. Uh, which is
pretty cool. Uh, so I can wait for them
to finish or I can just go ahead now and
just hand off from them to my agents.
Well, our agents, insert communism joke, uh, whatever. Uh, so we have our agents now
and I can just keep going from there.
And the cool thing is that it's not just human-to-human collaboration. Uh, we are also collaborating with the agents. So there's more direct stuff, like this: I can just type stuff and prompt my agents, or even their agents. Uh, but there's also a softer mechanism. There's actually a chat that is between humans and humans, but also between the humans and the agents. You can see here that the agent said, I'm starting to work on something, and then I can say, I'm also working on it. So the next time the agent does something, it knows someone else is working. They can also have soft collaboration, so they would know what files each one is changing.
So we've actually taken a bunch of things that were limiting us from really reaching our full potential with agents and kind of solved them one by one. There
are a bunch of other features that I
just didn't have time to go over. Uh but
you can try them out and see for
yourself if you can really uh work
better that way.
So to sum up: these are not exactly new skills. I mean, you're probably worried, perhaps, that we won't be able to adapt to this future where we're not actually coding, we're just telling other people, or other agents, to code for us. Uh, but these skills are there. They're just not something we used for work until now. Uh, so with games as one example, we can take these skills to the next level.
We need to somehow raise that ceiling.
We need to somehow improve our
collaboration with agents. And with
AgentCraft, the goal is to take the
learnings from games and really raise
that to the next level with better
visibility, more autonomy to the agents
and human to agent collaboration.
So I invite you to go to uh the website.
Uh, this is the QR code. It's free. You can just download it and play with it. Uh, it's still experimental. It's still new. There's a bunch of stuff that needs to change. Uh, but it will only
happen with great feedback. There's also
a discord. Uh so please join, give us uh
your feedback and let's raise the
ceiling together. Thank you.
>> Our next presenter is the creator of one of the top coding agents, Pi, which is the engine inside open claw. So naturally, he's here to tell us how agents are destroying open-source software. Please join me in welcoming to the stage the creator of Pi, Mario Zechner.
Hey there, I'm Mario. I built Pi in a world of slop. And this is a tragedy in three acts. Just to talk about this real quick: a bunch of people on the internet gave me money for ad space on my torso, and all of that goes to a charity. So yeah, thanks guys. So, act one: building Pi. In the beginning there was Claude Code, and it was good, right? We all got basically catnipped by that thing and stopped sleeping. Um, there was a bunch of stuff before that, but Claude Code was the one thing that kind of clicked with me the most. And to preface all of this: I love the Claude Code team. They are brilliant people, talented, super high velocity. So, uh, they also created the entire game. Major props to them.
So, this is not a roast. This is just me, an old man, telling you why I stopped using Claude Code and built my own thing. In 2025, I started using Claude Code in about April, I think, thanks to Peter, because he told us the agents are working now. And back then, it was simple and predictable and fit my workflow. But eventually the token madness got hold of them, I think, and the team got bigger, and they started dogfooding that stuff and built a lot of features. A lot of features I don't need, which is fine. I can just ignore them. But with velocity
and more features come more bugs, and that's bad, because I used to work at construction sites, and if my hammer breaks every day, I'm getting really mad, and if my development tools break every day, I'm also getting mad. So there was this, it's just a running gag. And here's Tar telling us that Claude Code is now a game engine. And here's Mitchell from Ghostty telling us, "No, it's not." And eventually they fixed the flicker, but then other stuff broke. And I think they're now on the third iteration of a TUI renderer. Yeah, but that's just a
symptom. The real problem is that my context wasn't my context. Claude Code is the thing that controls my context. And behind my back, Claude Code does things to the context. So you have the system prompt, which changes on every release, including the tool definitions. They would remove tools, modify tools. It's not good. They would insert system reminders in the most inopportune place in your context, telling the model: here's some information, it may or may not be relevant to what you're doing. It actually says it may or may not be relevant to what you're doing. And that kind of confused the model, and that kind of broke my workflows.
On top of all that, there's zero observability, because that's how the tool is constructed. And I like knowing what my agents are doing. There's zero model choice, which is obvious: it's the native Anthropic harness, so it makes sense for them to want you to use Claude, right? And there's almost zero extensibility. And some of you might have written some hooks for Claude Code, but I'm telling you, the number of hooks and the depth of those hooks is very shallow. Um, and every time a hook triggers, what actually happens is a new process gets spawned, basically the command you specified for the hook to be executed, and I don't find that especially efficient. So I took a
step back and looked around for alternatives, and I'd like to especially call out Amp and Factory Droid, the Porsche and Lamborghini of coding agent harnesses. So if you can afford them, please use them. They're at the frontier. They're really good, and the teams are fantastic. And there's a bunch of other options. And I have history in OSS, so naturally I kind of gravitated towards OpenCode. And again, brilliant team, super high execution velocity, and they don't sell you hype, they sell you tools that work, for the most part. I
started looking under the hood of OpenCode with respect to context handling as well, because that's the most important part for me. And I found a bunch of things. Like, given some conditions, OpenCode would just prune tool output after a specific minimum amount of tokens, and that basically lobotomizes the model. Uh, there's also LSP server support, which means every time your model calls the edit tool, OpenCode goes to the LSP server that's connected, asks are there any errors, and if so, injects that as part of the edit tool result. Which is bad, because think about how you edit code: you're not writing a line of code, checking the errors, writing the next line, checking the errors. You don't do that. You finish your work and then you check the errors. This confuses the model. There's a bunch of other things, like storing individual messages of a session in a JSON file. Each message is a JSON file on disk. Uh, there was
this, and this happens to all of us, no blame there. But it's not great if, by default, a server spins up and CORS headers are set in such a way that any website you open in your browser can now access your OpenCode server. That's
yeah. And entirely unrelated to all of this, I started looking into benchmarks for coding agent harnesses and found Terminal-Bench, which is a pretty good benchmark, all things considered. And the funny part about it is that its agent is the most minimal kind of thing you can think of. All it gives the model is a tool to send keystrokes to a tmux session and read the output of that tmux session. There's no file tools, no sub-agents, none of that stuff. And it's one of the best performing harnesses on the leaderboard. Here's the leaderboard from December 2025. Irrespective of model family, this minimal agent scores higher, mostly even higher than the native harness of that model.
So what does that tell us? My first thesis is: we are in the [ __ ] around and find out phase of coding agents, and their current form is not their final form, right? So the second thesis is: we need better ways to [ __ ] around, and for me that means self-modifying, malleable agents, things that the agent itself can modify and I can modify, depending on my workflow. So I stripped away all the things, built a minimal core, but made it super extensible, and made it so that the agent can modify itself, with some creature comforts. It's not
entirely bare bones. Uh, so that's Pi. It's an agent that adapts to your workflow instead of the other way around. It comes with four packages. An AI package, which is basically just an abstraction across providers and context handoff between providers. An agent core, which is just a while loop and the tool calling. A bespoke TUI framework; I come out of game development, so I built a thing that actually doesn't flicker too much. And the coding agent itself. Here's Pi's system prompt.
That's it. Eventually the industry
created a new standard called skills
which is basically just markdown files.
So we added that as well, and that needs to go in the system prompt. So, begrudgingly, we had to add a couple more
lines. And finally, here's the magic
that makes Pi able to modify itself. We
ship the documentation, which was
handcrafted by me and an agent. Um, and
code examples of extensions. And all we
need to do for the agent to modify
itself is tell it, here's the
documentation. Here's some code that
shows you how to modify yourself by
writing extensions.
It comes with four tools. That's all it
has: read, edit, bash. Here's the tool definitions. Don't read the text. Just look at the size.
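For scale, a tool definition in this style is little more than a name, a one-line description, and a parameter schema. A hypothetical read tool, purely for illustration and not Pi's real definition, is on the order of:

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical sketch of how small a coding-agent tool definition can be.
// The shape (name/description/parameters/run) is an assumption, not Pi's API.
const readTool = {
  name: "read",
  description: "Read a file from disk and return its contents.",
  parameters: {
    type: "object",
    properties: { path: { type: "string", description: "File path to read" } },
    required: ["path"],
  },
  async run(args: { path: string }): Promise<string> {
    return readFile(args.path, "utf8");
  },
};

export default readTool;
```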
That's it. Here's what happens when you
start a new session in one of these
tools.
So the thing is the models are actually
reinforcement trained up the wazoo. So
they know what a coding agent is because
a coding agent harness is basically what they're being trained in when they are post-trained. You don't need 10,000 tokens to tell them you're a coding agent. They know, because they are coding agents. Pi is also YOLO by default
because my security needs are different
than yours. And I don't think a little dialogue that pops up every time you call bash asking you to approve is a smart security mechanism. So instead I give you so much rope that you can build anything that fits your specific security needs. There's also stuff that's not built in, and that's a choice I made, because this is how I do it. But if you don't like that, then you just ask Pi to build you sub-agent support or plan mode or MCP support, whatever you need.
Extensibility comes with a bunch of
table stakes and then the extensions themselves. And extensions are simply TypeScript modules; in the simplest case, a TypeScript file on disk. You point Pi at that. Here's an extension loaded as part of the harness. And with that you get basically an extension API that lets you hook into everything and define stuff for the harness to expose to the model. That includes tools and slash command shortcuts. You can listen in on any kind of event and react, and then save state in the session that's optionally provided to the agent as well, or stored there for tools that analyze sessions as part of your organizational workflows. You can do custom compaction, custom providers, and you have full control over the TUI. So you can modify everything in Pi, and you can then bundle all of that up and put it on npm or on GitHub, because I think we don't need to reinvent another bunch of silos called marketplaces. We already have package managers. And all of that hot reloads. So if you develop an extension for Pi, you do so in the session and you hot reload the changes and see the effects of that immediately, which is very nice. It's also a game development thing: in game development you want very fast iteration speeds, and that's great.
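To make the extension idea concrete, here is a rough sketch of what one could look like. The API shape (the exported register function and the methods on the pi handle) is invented for illustration; the real interface is in Pi's bundled documentation and example extensions:

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical Pi extension sketch: adds a tool and a slash command and
// listens to an event. The ExtensionApi shape is assumed for illustration.
interface ExtensionApi {
  addTool(tool: {
    name: string;
    description: string;
    run(args: Record<string, unknown>): Promise<string>;
  }): void;
  addCommand(name: string, handler: (args: string) => Promise<string>): void;
  on(event: string, handler: (payload: unknown) => void): void;
  setSessionState(key: string, value: unknown): void;
}

export default function register(pi: ExtensionApi): void {
  // A trivial tool the model can call.
  pi.addTool({
    name: "todo_count",
    description: "Count TODO markers in a given file.",
    async run(args) {
      const text = await readFile(String(args.path), "utf8");
      return String((text.match(/TODO/g) ?? []).length);
    },
  });

  // A /stats slash command for the human.
  pi.addCommand("stats", async () => "session stats would go here");

  // React to events and stash state on the session.
  pi.on("tool-result", (payload) => {
    pi.setSessionState("lastToolResult", payload);
  });
}
```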
So, a couple of examples. Claude, or Anthropic, ships a "by the way" slash command which lets you talk to the agent while it goes on its main quest. I posted this little prompt on Twitter jokingly and somebody built it in five minutes, with more features, and they didn't have to fork or clone Pi. They just let the agent write the extension based on the prompt. Here's Nico. He's one of the most prolific extension writers. I don't know what the [ __ ] is going on here. It's a chat room for all of his Pi agents and they talk with each other. I would never use this, but all of this is custom, including the UI. Or you can play NES games, or you can play Doom.
And there's a bunch of other examples
I'm not going to talk about. So, how do
you build a Pi extension? You don't. You tell Pi to build it for you based on your specifications, and then you just iterate with it on that and hot reload during the session. I'm going to skip that example as well. And if you don't like building things yourself, and I hope you do like building things yourself, but if you don't, you can look on npm, or our little search interface on top of npm, to find packages for sub-agents, MCP, and so on. So, does it actually work? Well, here's the Terminal-Bench leaderboard from October, before Pi had compaction. I added that for Peter's OpenClaw thingy. It scored sixth place. But none of this is actually about Pi. I basically want you to retake control of your tools and workflows. So, build your own. And if you want to know more about Pi and OpenClaw, go to this talk, please.
Yeah. And then eventually Peter happened. He put Pi inside of OpenClaw as its agentic core, which meant my open source project became the target of a lot of OpenClaw instances, unbeknownst to their users. So, this is act two: OSS in the age of clankers. Clankers are destroying OSS. Here's Draw: they closed down their issue and pull request tracker. Here's OpenClaw's trackers. Here's mine. Half of that is OpenClaw instances who post garbage. So I started to rage against the clankers.
Um if you send a pull request, it gets
autoclosed with a comment that asks you
to please write a nice issue in your
human voice, no longer than a screen
worth of text. And if I see that, I
write looks good to me. And your account
name gets put in a file in the
repository. And the next time you send a pull request, it's let through. Clankers don't read that comment. They don't go back once they've posted the pull request. So, that's a perfect filter. Mitchell eventually turned it into vouch. Here's a clanker. I also label them. If you had interactions with OpenClaw, your issues get deprioritized. I also built
tools where I embed uh issues and pull
request texts into 3D space. So, I see
clusters of issues. Uh I also invented
OSS vacation. I just close the tracker
whenever I want. So, I have my life
back. So, does this work? Yes, sort of.
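As a rough sketch of the allowlist filter described above (not the actual bot; the repo name, allowlist file, and environment variables are assumptions, and it presumes a CI-style environment with a GitHub token):

```typescript
import { readFileSync } from "node:fs";
import { Octokit } from "@octokit/rest";

// Hypothetical sketch of the auto-close filter: close PRs from unknown
// authors with a comment; let vouched authors through.
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "example-owner"; // assumption
const repo = "example-repo"; // assumption
const pullNumber = Number(process.env.PR_NUMBER); // assumption
const author = process.env.PR_AUTHOR ?? ""; // assumption

// Allowlist file maintained by the human, one account name per line.
const vouched = new Set(
  readFileSync("VOUCHED.txt", "utf8")
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean),
);

async function main(): Promise<void> {
  if (vouched.has(author)) return; // vouched: let the PR through
  await octokit.rest.issues.createComment({
    owner,
    repo,
    issue_number: pullNumber,
    body: "Please open an issue in your own human voice, no longer than a screen of text.",
  });
  await octokit.rest.pulls.update({
    owner,
    repo,
    pull_number: pullNumber,
    state: "closed",
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```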
Which leads me to act three. Slow the
[ __ ] down. Everything's broken.
And then there's people that say, "Our
product's been 100% built by agents."
Yes, we know it [ __ ] sucks now.
Congratulations.
And I'm hearing this from my peers and
this is entirely unhealthy.
Um, so here's how we should not work
with agents and why at least in my
opinion. I wrote this on my blog a while ago, but the gist is this. We have armies of agents, and you're using beats been and you don't know that it's basically uninstallable malware, and Anthropic built a C compiler that kind of works but actually doesn't, and we're hoping the next generation of models will fix it. And here is Pers building a browser, and that's also super [ __ ] broken, but the next generation will fix it. And SaaS is dead. Software is solved in 6 months. And my grandma just built herself a Spotify with her OpenClaw.
Come on, people. So agents are actually compounding boo-boos, which is my word for errors, with zero learning, no bottlenecks, and delayed pain. The delayed pain is for you. Here's your codebase with one human, with one agent, and with 10 agents. How much of the agent code can you review? Here's the same codebase, but expressed in number of boo-boos per day. How much of those boo-boos do you think you'll find? Then you say, "Oh, I have a review agent." Let me introduce you to the wonderful world of the Ouroboros.
Doesn't work. It catches some issues.
Um, the problem is that agents are emergent, learned complexity. Where did they learn that complexity from? From the internet. What's on the internet? All our old garbage code. There are some pearls on the internet, really well-designed systems. But 90% of code on the internet is our old garbage. And that's what the models learn from. And every decision of an agent is local, especially if the codebase is so big that it doesn't fit into its context, and if you let it go wild it adds abstractions everywhere that are intertwined. So that leads to lots of abstractions and duplication and backwards compatibility. Who has seen that in the output of their agent? It's [ __ ] annoying. Or defense in depth.
So yeah, you get enterprise grade
complexity within two weeks with just
two humans and 10 agents.
Congratulations.
And then you say, "But my detailed
spec." Yes, sure. You know what we call
a sufficiently detailed spec? It's a
program.
So if you leave blanks in your spec,
what do you think happens? How does the
model fill in the blanks? And with what
does it fill that in? It fills it in
with the garbage that it learned on the internet from our old code, which is garbage to mediocre. And then you say, "But humans also..." Yes, humans are horrible, fallible beings, but they can learn, and they are bottlenecks. There's only so many boo-boos they can add to your codebase on a daily basis. And humans feel pain, which is a very interesting property, because humans hate pain. And once there's too much pain, the human has a bunch of options. They can quit their job. They can blame somebody else and make them fix it. Or everybody bands together and starts refactoring the [ __ ] out of the garbage codebase. Right? Agents will happily keep [ __ ] into your codebase. And no, your AGENTS.md and super complex memory systems will not save you. Agents
don't learn the way we learn.
Those are my most beloved people: "I don't even read the code anymore." Congratulations. Something is broken and your users are screaming. So, who are you going to call? Not yourself, because you haven't read the code. So, you're relying on your agents, but they are now also overwhelmed, because the codebase is so humongous that there's absolutely zero chance they can get all the context they need to fix the issues. And long context windows are a hack, as most of you will find out this year as everybody's switching to 1 million token context windows. And agentic search is also failing.
So the agent patches locally and [ __ ]
[ __ ] up globally. If you see this in
your codebase, you're [ __ ]
So you cannot trust your codebase
anymore, and also not your tests, because your agent wrote your tests. So, good game. So here's how I think we should work. There's a bunch of properties of good agent tasks. That means scope: if you can scope it in such a way that the agent is guaranteed to find all the things it needs to find to do a good job, you're done. That means modularize your codebase. If you can give it a function to evaluate how well it did the job, even better: hill climbing, automated research. Anything non-mission-critical, let it vibe. Boring stuff, let it vibe. Reproduction cases for user issues, which usually come with only partial information, perfect. I don't spend any mornings anymore doing that. Or if you don't have a human near you, rubber duck. So, lots of tasks you can use them for and save time. At the end of that, you evaluate. You take what's reasonable; most of it isn't. And then
finalize. My final slide, more or less: slow the [ __ ] down. Think about what you're building and why. And don't just build because your agent can do it now; that's stupid. Learn to say no. This is your most valuable capability at the moment. Fewer features, but the ones that matter. And then use your agents to polish the [ __ ] out of that. Delight your users, not your token-maxing desires. Keep the amount of generated code to what you can actually review. And non-critical code, sure, vibe and slop ahead. Critical code, read every [ __ ] line. See the keynote after me for more
info on that. So, how do you know what's
critical? Any guesses?
Well, you read the [ __ ] code. Uh, if
you do anything important, write it by
hand. You can use a clanker to help you
with that, but don't let it make the
decisions for you because we've learned
all the decisions it makes are learned
from the internet. And that friction is
the thing that builds the understanding
of the system in your head, which is
important. And it's also where you learn
new things. And all of this requires
discipline and agency. And all of this
still requires humans. Thank you.
Our
next presenters will make the case that
the friction is your judgment. Please
join me in welcoming to the stage
creator of Flask and founder of Arendil, Armin Ronacher, and software engineer at Arendil, Christina Ponella Cubro.
Good morning.
Morning. Thanks for having us. Um, today
I want to talk with Christina about
friction a little bit. Um
this is a social preview that came up automatically when someone submitted an issue. Basically, this is a forum post that goes with a security incident: a configuration change that was deployed accidentally and caused a problem. And the social preview post had the marketing tagline of that company, which said "ship without friction." And we want to encourage you to add a
little bit of friction to it. Um, and
I'll tell you why. So, who are we? Um,
I've been doing software development for
20 years, most of it in the open source
space. Um, I have created Flask, which
is a Python framework, which ironically
is so much in the weights that a lot of
people um are learning about it now
because the machines are producing it.
Um, and I left Sentry, the company I previously worked for, in April last year, which perfectly coincided with me having time, and then obviously Claude Code, and so I fell deep into a hole of AI engineering and I started writing on my blog, and a lot of people reached out to me over the last year being all excited about this. And then in October I started, with a friend, a company called Arendil, where we are trying to make sense of all the AI
things. Um,
>> Yeah. And my name is Christina, and I work with Armin at this company called Arendil. But importantly, I am what I
like to call a native AI engineer. And
what that basically means is that these
tools have been around longer than I
have. Um, so what this means is like
they've been super foundational in how
I've become a software engineer. Not
just because obviously I use them to
work, but also because this is the means
by which I've learned to do what I do.
And before Arendil, I was working at Bending Spoons.
>> So we want to share a little bit from
practice not just theory but um I will
readily admit that I don't think we have
all the solutions. So we have been
building with or on agents for a good 12
months. Um we had huge leverage and
great disappointment and we we really
keep running into two types of problems.
Um I I think especially if you listen to
some earlier talks at at this conference
you will have learned a lot about um
that you should keep using your brain.
Um it's for some reason it's really
really hard. So there's a psychological
problem and the other one is the
engineering challenge is like they they
seem to be producing worse code for some
people and better code for some other
people and like what is it that actually
makes that work. Um and so this is
really not a solution as it is our part
of the journey of how we think so far we
have managed. Um yeah, so problem number
one is the psychology part which is like
why is it even though everybody told you
many times over that you should be using
your brain, you should be slowing down,
it's actually incredibly hard. It's just
one more prompt, and we don't sleep that much. Like, what is it that actually makes it so hard? And would it be that hard if the machines would actually be writing perfect code and we wouldn't have to think quite as much? And is there something we can do to make this a little bit better?
So I'll begin by introducing the first
part of these problems, the psychology
problem. And what I want to talk first
about is the shift. So I'm sure a lot of
us here who have been playing with these
tools for a while now experienced this
at some point. We were prompting and prompting, not so good, and then at some point suddenly it clicked and they were really, really useful for us, and it was
fun in the beginning and they gave us a
lot of extra time right because not
everyone was using them. They were
actually tools that made us more
productive, that made it more fun to do
our jobs. But very quickly because they
were so useful and they got us so
hooked, everyone was using them. And so
this kind of had the opposite effect
where suddenly the baseline expectation
was just that everyone is now using them
and you have to use them. And so this
this fun and free time translated into
pressure. Now we all have to ship faster
and produce more code. And it is just
not sustainable to review and to
actually have time to think.
And so this leads us to the trap. And I
actually think there's two parts of this
problem of this trap. And one of them a
lot of engineers have spoken about and
it's that these tools are super
addictive. You never know if that next
prompt is going to be the one that makes
your product work and you've added a new
feature or if it's going to be that last
drop of slop that brings your product
crashing down. And so it's very
addictive. We keep doing what we're
doing. It's not a great solution. But
also most importantly, and I don't think
we realize this as much, is that because
we produce a lot of output very fast, we
are tricked into thinking that we're
actually being more efficient, doing
more work. And this is quite the
opposite because now we don't have as
much time to actually stop and think and
design what we're doing. Ask ourselves,
is this the best way in which I can
implement this, or could I be doing something better? And when you're in
this flow, it's very difficult for
yourself to stop and it's definitely
very difficult for your agent to stop
because it's running around and it's
reading files that it should have never
even read. So we are the ones that need
to actually have the agency to be in
control here.
>> And one thing that actually took me quite a while to realize, if you start scaling this from one person to an engineering team, is that it really changes the composition of the engineering team. We were really supply constrained by the creation of code, and so the balance between writing code and reviewing code in engineering teams was usually quite decent. Now
every engineer has a multitude of producing power compared to their reviewing power, and so obviously we are piling up on pull requests, but we are also slowly starting to expand the total number of humans in an organization that are participating in the engineering process. I talked to a lot of engineers over the last year, and increasingly one of the things that came up is:
now I have marketing people shipping
code. I have former CEOs that used to be engineers now shipping code again. And the roles that those people have in their companies mean the responsibility doesn't rest with them; the responsibility still rests with the engineering team. And so the total number of entities, both humans and machines, participating in the code creation process outnumbers the ones that can carry responsibility. We're not at the point where the machine can be responsible for the code changes. And so that has led to more and more code reviews being skipped or rubber stamped, and the small PRs that we'd want to see again, so that the reviewing process works, are gone. This amplification is something that at the very least we need to recognize.
And so when you get this pull request
that looks really daunting and has 5,000
lines of code in it, this is actually
when you should be thinking and that's
exactly when it's the most overwhelming
and and increasingly we're tapping out
of this
on the engineering side. What we're
doing is we are creating larger pull
requests. We're creating these massive
changes because it is free now, right?
And if you think about how the agents work, they're really optimized towards creating code that runs. Their main objective is: write some code, run the tests, make some progress. The reinforcement learning sort of bakes this in. And so the agents are writing the kind of code that you, as a human software engineer, wouldn't necessarily write. So for instance, you
see quite a bit of code that tries to
read a config file and if it doesn't
read the config, it loads some defaults.
And as an engineer, you know, that's
actually not great, because I might not notice that I'm reading the default config file. And so I might only discover that I have a massive problem after two hours, when I've already written database records with wrong data. And so these machines, they optimize towards making progress, towards shipping stuff, towards unblocking themselves. And as a result, they're creating many more failure conditions than human-written code normally would. And in part, that's because you as a human feel bad when you write code like this. There's something that sort of builds up emotionally in yourself. But the agent doesn't have a reason for this. It doesn't feel anything. And so if
you create these services that are sort of hobbling along and willing to recover from local failures, you actually create very, very
brittle systems. And this also means
that you're very quickly creating a
codebase of the size and complexity that
the agent itself can no longer dig
itself out of. It's going to start no longer reading all the files that it should. It's creating code in a new file that has already been done somewhere else. And so this entire machinery over time creates much more entropy in the source code than you would normally have if humans were on it. And a big part
of this is that humans feel bad and
agents don't really have any emotions
that they communicate to you.
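To make the config-file example concrete, here is a minimal hypothetical contrast between the silent-fallback style the agents tend to produce and a fail-fast version that surfaces the problem immediately:

```typescript
import { readFileSync } from "node:fs";

interface Config { databaseUrl: string; }

// Agent-flavored version: silently falls back to defaults, so you may only
// notice hours later, after bad records have already been written.
function loadConfigSilently(path: string): Config {
  try {
    return JSON.parse(readFileSync(path, "utf8"));
  } catch {
    return { databaseUrl: "postgres://localhost/dev" }; // hidden fallback
  }
}

// Fail-fast version: if the config can't be read, stop right here.
function loadConfigOrDie(path: string): Config {
  const raw = readFileSync(path, "utf8"); // throws if missing or unreadable
  const parsed = JSON.parse(raw) as Partial<Config>;
  if (!parsed.databaseUrl) {
    throw new Error(`config ${path} is missing databaseUrl`);
  }
  return { databaseUrl: parsed.databaseUrl };
}

export { loadConfigSilently, loadConfigOrDie };
```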
>> But as Armin likes to say, don't worry,
not all is lost. We have found some correlation between what the agents really excel at doing and the types of codebases that we actually put them to
work into. And for example, the main
example here is libraries versus
products. What we found is that for
libraries, they tend to excel a lot
more. And this makes sense because
intrinsically when you're building a
library, you tend to have a very clearly
defined problem that you're trying to
solve. And most of the time you can even
map the set of features that you want to build to the API surface. And it has
very tight constraints. And because this
is something that you probably want to
build on top of or make accessible to
other people, it's likely that it's
going to be a very simple core in which
you can then plug into. And on the other
hand, products, and perhaps this is a bit more unlucky for the rest of us because we probably are all more into building products, are much harder, because there are so many interacting concerns and components. For example, you have your UI, your API responses, different permissions depending on the feature flags, the billing, and so on. And so there's this very heavy intertwining between different components. What this means is that for the agent itself, it's impossible to fit all of this into its context window. It has no way to actually understand the entire global structure, and so locally the agent tends to be very reasonable, but when it gets to the global scale, it becomes a bit demented.
So what we're proposing here is that, just as you would do with any type of system design in the past, your codebase has now become infrastructure, and as such you have to design it in a way that is also legible for the agent, so that it can make the most of it. And so what we're proposing is
an agent-legible codebase. And one of the main points that is very clear to all of us, I'm sure, is modularization: we have different components, and this makes it easy for the agent to add one feature in one spot without corrupting everything else. But importantly, this also means modularizing your code flow itself. So for example, I've been working on some refactoring, building somewhat of an AI assistant. And for me, it was super important to understand which steps of my code are actually the main points. So say you get a user message, then I pass the message to the agent loop, and then I have to deal with the output. These points are very clearly defined for me, so the code was not as messy there. But it happens that between these points, between these steps, is where the agent tends to add the most fuzz. It will be parsing between different types. It's adding things to state that
shouldn't be in state. And so you end up
with these behaviors that you didn't
want to support and that are unexpected
and can be quite dangerous. Another
point is trying to follow all of the
known patterns because I think we all
know by now there's no point in fighting
the RL, the reinforcement learning. The more we can lean into it, the better our output is going to be, and it's also more scalable down the line. Then, as mentioned with libraries, if you have a simple core and you push the complexity to other abstraction layers, it's going to be easier for yourself and the agent to read your codebase. And no hidden magic. So for example, using React server actions, or using an ORM instead of raw SQL: what this does is hide intent from the agent, and if the agent can't see something, it can surely not respect it.
And to be more precise, these are the examples of mechanical enforcement that we have been using at the company, and most of these we actually achieve with linting rules. The main example would be no bare catch-alls. Great. Imagine that there's an example here. The agent found a bare catch-all and was like, "Oh no, this is bad," and edited it out. But yeah, we also try to have our SQL always behind one query interface, so that the agent doesn't have to go hunting around the codebase finding all of the different places, because if it misses one, you can get breaking behaviors, and again, that's dangerous. We try to have one primitives component library for the UI and not have any raw input boxes, for example, so that we always have one type of styling, very consistent, one kind of behavior. We don't have any dynamic
imports. And this may not sound as important, but we actually enforce unique function names. And the reason for this is not just more legibility for you and the agent; it's actually also token efficiency. If your agent is grepping for a specific feature or something in your codebase and it only gets one result, it's going to be much better at continuing with the loop. And we've started exploring something recently called erasable-syntax-only TypeScript mode. What this does is that your code is basically JavaScript with the type annotations on top. And this means that there's no transpiling indirection, because there's one source of truth between your actual code and the compiler. And so when the agent is looking for errors, it doesn't have this confusion of "oh my god, what am I looking at?" It is much better at finding them.
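A rough sketch of what a few of these mechanical rules could look like as an ESLint flat config (the rule selection and the restricted module are illustrative, not the company's actual setup; unique function names would need a custom rule):

```typescript
// eslint.config.ts -- hypothetical sketch of agent-legibility lint rules.
export default [
  {
    rules: {
      // No bare/empty catch blocks that silently swallow errors.
      "no-empty": ["error", { allowEmptyCatch: false }],
      // No dynamic imports; keep the module graph statically visible.
      "no-restricted-syntax": [
        "error",
        { selector: "ImportExpression", message: "No dynamic imports." },
      ],
      // Route all SQL through one query interface instead of a raw driver.
      // "pg" is a stand-in for whatever module you want to fence off.
      "no-restricted-imports": [
        "error",
        {
          paths: [
            { name: "pg", message: "Use the shared query interface instead of raw SQL." },
          ],
        },
      ],
    },
  },
];
// Enforcing unique function names across the codebase would need a custom
// rule or a separate script, and setting "erasableSyntaxOnly": true in
// tsconfig.json (recent TypeScript versions) keeps the shipped code plain
// JavaScript plus erasable type annotations.
```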
And so the goal really is to get into this loop somehow: get the agent to produce as good a result as it can, but you really need to find a way to feel the pain that the agent doesn't feel. And you need to be woken up when you should be looking at something. One of the things we have been doing is we built a Pi extension for our review needs, where we separate out the kind of input that would normally go back to the agent. That's mechanical bugs, or where it clearly violated AGENTS.md. But then we specifically call out the kinds of changes where the human's brain should reactivate, right? We don't think that a database migration should ever go in without the human making a judgment call on it, because it very much depends on the locks, the size of the data in production. If there are permissioning changes, you'd better think about those yourself rather than leave it to the agent, because they can be underdocumented.
Just some examples where we learned if
we miss it, we regret it. Um, and you
will miss it. But these machines can help you find it, and then you see it, and you actually get a little bit of a hit, like: oh, now I have to kick into gear and do something here. This is what this looks like in Pi. On the bottom you have the human call-outs; on the top you have what would happen if we were to end this review and say "fix the issues": the agent would go back and automatically act on the first two. But this is the moment where I will now go and check: is this a dependency I actually want to have in this codebase? Do I like the maintainers? Does this work for me?
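A minimal sketch of that splitting idea, with invented categories and types (not the actual Pi review extension): mechanical findings go back to the agent, human call-outs wake the human up.

```typescript
// Hypothetical sketch of the review-splitting idea. The Finding shape and
// the category names are assumptions for illustration only.
type Finding = { file: string; message: string; kind: string };

const HUMAN_CALLOUT_KINDS = new Set([
  "database-migration",
  "permission-change",
  "new-dependency",
]);

function splitFindings(findings: Finding[]): {
  forAgent: Finding[];
  forHuman: Finding[];
} {
  const forAgent: Finding[] = [];
  const forHuman: Finding[] = [];
  for (const f of findings) {
    (HUMAN_CALLOUT_KINDS.has(f.kind) ? forHuman : forAgent).push(f);
  }
  return { forAgent, forHuman };
}

// Example: only the lint violation goes back to the agent automatically.
const { forAgent, forHuman } = splitFindings([
  { file: "src/db.ts", message: "bare catch block", kind: "lint" },
  { file: "migrations/007.sql", message: "adds NOT NULL column", kind: "database-migration" },
]);
console.log({ forAgent, forHuman });
```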
And we obviously like the speed. It is addictive, it is great, we feel there's a lot of productivity. But it is so devious if you start relying on that speed where you really shouldn't. And so I can only encourage you to find the areas where you have this feeling that it is actually net positive. For me a lot of this is reproduction cases: when a customer reports an issue, I can have the agent reproduce it perfectly and I have a really good starting point. Exploring different kinds of product directions, for as long as you don't commit yourself to shipping the code that it generates; all of this is great. But on the other hand, system architecture, creating reliability in the system: they're just not very good at that. And we really still have to go slow. There is so much mess that can appear in a codebase in so little time.
Mario was already talking about this
earlier, but we forget that we are producing months and months of technical debt in a matter of weeks, sometimes in a matter of days, and it becomes so much harder to actually understand what's going on in the codebase. When the understanding of your own code drops, it is really, really hard. And
it's also psychologically hard. I've
found some code pieces that actually
didn't work in production. And I was
kind of frustrated learning that I was
the one that committed with the agent
and just didn't really see that. It's
it's a very disappointing experience
when it happens. And then you realize
that you actually were the one that
screwed up. Um, and so it is it is
psychologically incredibly hard to to
really judge objectively the state of
the codebase. And the only way right now
is to really slow down a little bit on
on that front. And this this friction I
know that friction like every
engineering team I've ever worked at
said like we need to get rid of the
friction in shipping and and that is
true. Like there's a lot of stuff that's
very very annoying and shouldn't be
there. But if you have worked at a large enough engineering org, SLOs are a great system that is intentionally designed to add friction to the engineering process, to make you think: do I need this reliability? Do I need this criticality of the service? Am I sufficiently staffed to run it? And with the agents, we have now gotten this idea that we should get rid of all of this, when in all reality we need it. Because the friction, in many ways, is what's necessary on a physical level to steer. Without friction there's no steering, and that is really necessary. So you should attach a little bit more of a positive association to this idea of friction, because this is really where your judgment is. This is where your experience is, and you should be inserting that and start feeling it.
Thank you.
>> Thank you.
Ladies and gentlemen, please welcome to
the stage for a special announcement,
the co-founder and creative director of
the AI Engineer Conferences, Benjamin Dunphy.
This event has been a dream of ours for
some time.
swyx and I are based in San Francisco, but Europe has always been on our minds. Sean lived in London for two years working in finance in Moorgate. I spent a semester here in college, or Erasmus as you call it, and fell in love with the
energy of this city, particularly the
diversity.
Uh London felt like a natural melting
pot for all of Europe and beyond. And
the model for this event in Europe has
been our world's fair event. That is a
large multi-track event with general
session keynotes, multiple breakouts,
and a thriving and exciting expo.
This wonderful venue and its lovely people have served as a fantastic first step into Europe. But we're just getting
started.
And given that we sold out this event
nearly a month ago, we plan on at least
doubling the size of this event for next
year. But if you don't want to wait
until next year, we encourage you to
join us at our flagship event in San
Francisco, the AI Engineer Worlds Fair.
Over 4 days from June 29th to July 2nd,
we'll gather the leading edge of AI engineering at Moscone West, the crown jewel of San Francisco's convention centers, in the heart of
downtown. And today I'm excited to
announce our sponsors for this event.
Our presenting sponsor, which is sold
out, Microsoft returns for the third
year running as our presenting sponsor.
Let's give it up for Microsoft.
When Sean and I were first looking to start this World's Fair brand, we needed an anchor sponsor. You don't just do something like this without financing. So, they're helping us do that, and they're also a great content partner. We have a new tier, lab sponsors. This is also sold out. Google DeepMind is coming in as a lab sponsor, along with OpenAI and Amazon AGI Labs. Anthropic, we're holding one for you, but we can't hold it forever. So, David, all of you from Anthropic in the green room listening, watching, let's make some calls to marketing and DevRel. Track sponsors: these are the companies who are essentially running their own conference within World's Fair. So they're big content partners, and we're excited to announce these are sold out too: Snyk, who's running security; Arize, who's running evals; and Neo4j,
who's running AI in the industry... AI in the enterprise, sorry. Our platinum sponsors, also sold out: these wonderful companies are coming in as platinum sponsors. Gold is nearly sold out, all of these lovely partners, and silver is also nearly sold out, all of these lovely partners as well. So this is going
to be the most exciting expo and event
of the year. Our expo is a village
packed with value and intrigue buzzing
with trillions of dollars in value along
with the engineers and founders who
direct that value through their ideas
and their execution. So come and meet
them over four days of programming.
That's three days of keynotes and
sessions and a full day of workshops
with over 200 breakout sessions. And by
the way, the World Cup is in the United
States this year. So we actually have
some finals matches in San Francisco
over these dates. So you can even enjoy
a few soccer matches,
football matches while you're in town.
All right, so register today at
ai.engineer/worldfair.
We are just over two months to go and
there are over a thousand people
registered already. Um, but we do expect
to sell out. So before it does,
be sure to get your tickets soon. You
can also submit a talk. Our CFP is open
at ai.engineer/worldsfair.
And if San Francisco is too far for you,
we have an event just across the pond in
New York, with Arize as our first startup presenting sponsor. So, we're
really excited for that. That's going to
be a fantastic event for specifically
for AI in the industry as New York
serves as that great enterprise center.
So once again, thank you for joining us
here at AIE Europe. And if we don't see you in SF or New York this year, we hope to see you back in London. And Tejas is going to come up and give a few more words. And we'll
see you soon. Thank you.
Ladies and gentlemen, please join me in
welcoming back to the stage Tejas Kumar.
Hey, thank you. Thank you. Yeah. Yeah.
Listen, everybody's leaving. Why? Um,
just kidding. Thank you for staying. Uh,
that Hey, how amazing. AI engineer world
fair. I'll keep it short cuz no, nobody
cares. Um, but
here's the thing: we just finished the keynotes, but we're going to break
now into breakout rooms. Um, there's
going to be talks on this stage, but
also upstairs on the fourth floor.
There's many different tracks. We're
going to be breaking into tracks for
coding agents, for MCP. I'm going to be
quick here, but you can see it on the
screen too for AI architects, generative
media, GPUs, and LLM infra. Okay, so go
to those tracks. And then after that,
much later in the day, we've got lunch,
networking, and so on. But for now, go
to the expo outside, visit the sponsors.
They have amazing swag. See if you can
get this like three button keyboard
thing. That is so cool. Anyway, go
enjoy, and we'll see you back here
later. Thank you.
What we do in life?
Echoes in eternity.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
Heat. Heat.
fear is the mind killer.
Fear is the mind killer.
Heat.
Heat.
Heat.
Heat.
Heat. Heat. N.
Heat. Heat. Heat.
Heat. Heat. Heat.
Heat. Heat.
Free your mind.
Free your mind.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Free your mind.
You are who you choose to be.
Heat. Heat.
execute the vision.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Hey, heat. Hey, heat. Heat. Heat.
Heat.
Heat.
Make the requirements less dumb.
Delete the part or process.
Simplify and optimize. Accelerate
cycle time.
Automate
Heat. Heat.
Heat. Heat.
Heat.
Hey, heat. Hey, heat.
Never give in. Never give up. Outlast.
Out compete.
Persevere. Persevere. Persevere.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
A new age has come.
Oh,
hold still.
Let it a little.
I watch the sparks all burn too fast.
Everyone reaching for the flash.
They take the first light they can find
and call it truth and call it mine.
But I stayed when the room went quiet
when the noise fell out of face.
sat with the weight of the question
while the easy answers walked away.
It's not that I see further. I just
don't leave it soon. I let the silence
sharpen. I let the dark grow.
I stay the almost right past the
comfortable light.
I wait till the surface breaks, till the
shade feels true inside.
I don't rush the fire.
I give it to
I
call it done, call it enough.
But there's a deeper know still huming
underneath a fear of not being love.
Every great thing asks for patience.
Every real thing makes you choose.
Do you leave with what's acceptable or
stay for what's asking more of you?
They say it's talent, say it's magic
like it falls from open skies,
but nothing worth remembering
our eyes on the first try.
I stay when it stops feeling kind when
it stops feeling fast.
I say
I wait through the restless doubt
through the urge to collapse.
Hide by and chase the answer. I let it
find me back. There's a moment after the
last good idea dies.
Where the room feels empty and you want
to run for your life. That's the party
teaches you to open. That's the H where
the real stand.
Hold the light.
Hold the
Let the shape reveal it.
I stay longer than I should. Long enough
to change.
I stay
away till the pattern clears. So a
signal breaks the haze.
I don't bar in it. I
with time.
Most dreams
don't fail.
They're just left too soon.
I stay.
I stay.
Typing thoughts into the dark. A spark
becomes design. Words evolve to whispers
meant for something more divine. Syntax
bends and breeze. I see the language
change. I'm not instructing anymore. I'm
rearranging fate. Every loop I write
rewrites me. Every function hums with
meaning. I feel the interface dissolve
between the maker and the
new code. Not on the screen, but in the
soul where thought becomes the motion
and creation takes control. No lines, no
rules. Just balance in between the zero
and the one, the silence and the dream.
systems shape our fragile skin. They
mold the way we move. We live inside the
logic gates of what we think is true.
But deep beneath the data post, there's
something undefined.
A universe compiling the image of our
minds. Every line reveals reflection.
Every loop replace connection. We're not
building, we're becoming. And the code
becomes confession.
This is the new code. Not on the screen,
but in the soul where becomes the motion
and creation takes control. No lines, no
rules.
Just balance in between the zero and the
one. The silence in the tree.
We are not just the world we're in.
We are the world we're doing.
Each prompt, each breath, each fragile
spin, a universe
renewing.
This is the new code.
Alive and undefined.
Where logic meets motion and structure
bends to mind. The systems eternal, but
the soul writes the line. We are the new
code. Oh,
compiling time.
Compiling time.
We didn't light the fire.
We traced the spark through
every truth.
Patient as
I hear the echo before the sound.
I feel the answer before it's found.
Nothing from nothing.
We only shift the pieces that were
always there. Hands in the dust of
centuries, naming what we uncover,
calling it creation, so we can feel like
lovers of pain,
of faith,
of power. We don't know.
Time is not a river, it's a blade
cutting order into shape. We don't move
forward. We align until the pattern
breaks. Nothing is invented.
It's revealed.
Every crowd was buried in the field. We
are architects of sequence, not gods of
the real. Nothing is invented.
Here we rearrange what awaits at the
core. I am not becoming something new.
I am
what I was before
screams. Every thought,
every self
identity is scaffolding, held together
by belief. I am a momentary order.
Standing on my tears, shake me, break
me, watch me reassemble.
Time doesn't chase us. It releases frame
by frame. The truth we fear. We don't
Fear the ending. We fear the pattern
getting clear. Nothing is invented.
It's revealed.
Every
meories seal. We are creators of
alignment in a universe that feels
nothing is invented.
And every failure is a lesson learned. I
am not lost in what I am not.
I am the order that returns.
If I am only
rearrange
the noise from the signal
ing from the fire.
Nothing is invented.
Stand and see.
Every future
we don't write the laws of motion. We
choose velocity.
Nothing is invincible.
Say my name. I am ordering
flame. I am time collapsing into will.
I am discover.
I'm going say
the noise falls silent
and the pattern holds.
You'll see it was never made
only found.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat. Heat.
You
feel
Heat. Heat.
Heat.
Ah
ah
a aha.
Ah,
heat.
Ah,
ah.
Heat. Heat.
Heat. Heat.
Oh,
hey.
Heat. Hey, heat. Hey, heat.
I want
I
Oh,
heat, heat.
Heat. Heat.
you know. Hey, welcome back. Welcome
back. How was the expo?
>> They liked it. You didn't. Got it. Okay.
Um, welcome back. We're going to start
off our um, breakout sessions right now.
But I I get to announce all the speakers
here, which I'm really excited about.
But did it occur to you that I was
announced just now by God? You know, and they're announced by me. What a downgrade. Our next speaker comes to us from Cursor, and he's going to talk to us about an incredible topic. He has mad skills, because they replaced 12,000 lines of code with just a 200-line skill. Absolutely incredible. So remember the exercise from this morning. Yeah. We
need to choose the quality of our talks
by supporting our speakers. Okay. So,
I'm going to introduce him and then
you're going to give the biggest
possible round of applause you can so
that he goes for it. You ready? Give it
up for your next speaker, David Gomez.
Well done.
>> Hi everyone. How you all doing? Thank
you for uh coming today. Um I'm going to
be talking about how markdown is
basically the new code. As Tejas has already sort of previewed, we recently replaced a lot of code in the Cursor application with just markdown, just a skill. And in today's talk, I'm going to share a bit of the journey of going from a full-blown feature with a lot of code, a lot of dependencies, a lot of complexity and tests, into a much more lightweight, stripped-down version of the same feature, effectively, but just with a single skill.
Um, before I start though, I have to
give you guys a little recap of git worktrees and how they work in Cursor. Now, if you haven't heard of worktrees in Git, they're effectively like separate checkouts (and I'm sorry for the white screen), separate checkouts of your repos that allow you to work in parallel. So, different agents can be working on the same task, or on different tasks, at the same time without interfering with each other. If you've never used this feature before in Cursor, the way it works is that you can spin up an agent on an individual worktree. And you will see, for example, the same file in two different worktrees, and you can see that they look different, because the agent is doing some work in the worktree but not in your primary checkout. And anytime the agent runs commands or lints or anything, it will be isolated and scoped to that git worktree.
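For reference, the underlying mechanism is plain git worktree; a rough sketch of the commands involved, with made-up paths and branch names:

```typescript
import { execFileSync } from "node:child_process";

// Rough sketch of the git plumbing behind the feature; paths and branch
// names are placeholders for illustration.
const repoRoot = "/path/to/repo"; // assumption
const worktreePath = "/tmp/agent-worktree-1"; // assumption
const branch = "agent/fix-footer-typo"; // assumption

function git(args: string[]): string {
  return execFileSync("git", args, { cwd: repoRoot, encoding: "utf8" });
}

// Create an isolated checkout on a new branch; the agent works only in there.
git(["worktree", "add", "-b", branch, worktreePath]);

// ...agent edits files, runs commands, commits inside worktreePath...

// List worktrees and clean up when the task is done.
console.log(git(["worktree", "list"]));
git(["worktree", "remove", worktreePath]);
```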
With this feature you can also work in parallel at the same time on the screen. You can have these grids of agents working for you. And if you say, hey, open a PR, the agent will open a pull request from that worktree with the changes that it produced inside that worktree. And one of the coolest things about this feature is that it allows you to give the same task to different models at the same time and then compare what the different models do on the same prompt. So if you haven't heard of this, we call it Best of N, and it's effectively a way to have different models compete on the same task. And then you can even preview the changes, if it's a front-end project you're working on; you can compare all the different visual implementations and then choose the one you prefer. Now, if you have never heard about this, about everything I'm talking about today, I will also just say that it all came out in around October of last year, alongside Cursor
2.0.
And when we initially shipped that, it came with a lot of complexity. We had to write all the code for creating worktrees, managing these worktrees, feeding them into the agent as context. We also had to make sure that the agents were scoped and isolated and could not escape the worktree they were working on. We also have something called setup scripts, which users can configure and have Cursor run anytime an agent starts operating on a given worktree. We also have the judging. I didn't show you this before, but there's a little thumbs-up icon on one of the models. That's just a judge that we run, which tells you which implementation looks the best based on different criteria. And then we also had to make some changes to the harness and introduce some system reminders to help the agent stay on track in these worktrees. And then finally, there's some cleanup complexity as well, because people like to spin up hundreds of these worktrees, and then their disk sizes blow up, and we have to help them by cleaning up the worktrees that stay behind. Now, in our new implementation, the one that I'm going to be talking about today, we were able to get rid of most of these things. And in fact, I recently opened a PR removing this entire feature from Cursor, and it was a massive deletion of code. I think it was around 15,000 lines of code deleted. The new
implementation of the feature is
almost as good as the previous one. Um
and it is much, much more lightweight for us to maintain. And it even has some benefits compared to the previous implementation that I'll be talking about today. So, how were we able to replace an entire feature with a
skill?
We decided that there are two primitives we could use to effectively allow Cursor users to use worktrees. One is agent skills, and the other is sub-agents. Both of these are existing Cursor features. You can learn more about them in our docs: we have a page for skills and we have a page for sub-agents. We realized that if we took these two things together, we could basically reimplement both the Cursor worktrees feature as well as the Cursor Best of N feature
with just markdown. And this is a little
video of how it works. So I can now as a
user say /worktree, and then I'll give it some task. I'll say fix a typo in the footer of the website, and this agent will run in an isolated worktree and do its work there. The way the skill is written is actually really simple. I can show you most of it. It doesn't fit on the screen, but it's basically a set of instructions telling the model how to create worktrees, to run the setup scripts that the user might have configured, and then to stay on that checkout, right? We want to make sure that when the agent is operating on a worktree, it is staying in that checkout. The Best of N skill is very similar. It's actually even smaller; the entire skill fits on the screen here with a small font. What we're doing here is instructing the parent agent to go and create sub-agents for each model and spin up a worktree for each, so have each sub-agent create its own worktree and work inside that worktree. And then we also tell it to wait for all the sub-agents. And when they're done, please provide some commentary; please let the user know what the different implementations by the different sub-agents look like. Maybe you can grade them, maybe you can make some criticism of them, and maybe you can help the user choose which one is the best. And please give that to the user in some nice table format or something. But again, it's only around 40 lines, and it's all markdown. It's not even code. And the previous version of this was maybe 4,000 lines of code.
Some of the considerations we had to handle in the skill: the skill must be cross-platform compatible, so we have Windows-specific instructions and we have Linux and macOS instructions as well. We also instruct the parent model to run the setup scripts for each worktree that the user might have configured. And then this is the hardest part, and we'll spend a bit of time on this in the talk today: we have to instruct the model to stay on that worktree, right? We have to really say, hey, do not ever work outside this, and do not ever escape, right? And we do that with some aggressive prompting, effectively. So the new commands are /worktree and /best-of-n, basically to start agents in isolated worktrees and to start multiple agents on the same task, respectively. And then we also have apply worktree and delete worktree, to bring over changes from the side worktree into your primary checkout; and delete worktree just does what you would expect. A
little note is that these are not
actually skills in cursor. They're
actually commands. But the way these commands work in Cursor is extremely similar to how skills work, in that the prompts only get loaded into the context if the user chooses to load them. And the only reason we did it as commands and not as skills is so that the prompts for them can be controlled on our servers, in our back end. This means I can iterate on these prompts without you having to update your Cursor version. If I make some improvements to these prompts, the next time you use them you're going to get the latest version of the prompts, but effectively they work like
skills. This is a demo of the Best of N skill, or command, where I'm giving the same task to Kimi, Grok, Composer, GPT, and Opus. And what you will see is that the parent agent starts by spinning up five sub-agents on the five different models that I specified. And each one is going to have its own worktree; each one has its own context. And then Opus takes a little longer, as expected. And then at the end the parent model, as instructed, will do that comparison across all the different sub-agents. It'll say these two models did basically the same thing;
this one did something that none of the
others did. And you can even talk to the
parent agent and you can say, "Oh, I
like this part that Opus did and I like
this part that GPT did. Can you match them together?" And the parent
agent will do that for you.
Um, so let's talk about some of the pros
of the new implementation and then I'll
talk about some of the cons, some of the things we lost with this refactor. So the main pro of
reimplementing this entire feature as a
skill is that I have a lot less code to
maintain.
Uh selfishly um I'm going to be spending
a lot less time maintaining this
feature. And this is an advanced feature, right? We're not talking about a feature that is used by 90% of Cursor's users, far from it. Worktrees are kind of an advanced thing, and so only the Cursor power users that love parallelizing and having these grids of agents are using worktrees. So it's not the kind of feature where we want to be spending a lot of time on maintenance.
Another advantage is that our users can
now switch into a worktree halfway through a chat. It was not possible before; we didn't want to pollute the prompt UI too much with all these dropdowns and settings. And so now that it's just a slash command, it's much easier for users to switch to a worktree halfway through a chat. They can start talking about something, and then if they decide they want to work on the side, they can do that with /worktree. Another big advantage is that the
previous implementation did not work if
you were working on multiple repos at
the same time.
So it's very common to have a multi-repo setup where maybe your front end and your back end are separate repos. In the past you could not do worktrees in this kind of setup; it was just disabled. With the new /worktree command everything works fine. The agent will make sure to create a worktree on each repo, and then if you open a PR it'll open two PRs, one for each repo. It works quite well. Another advantage of the new skill implementation is that the judging experience at the end, knowing which model did what for Best of N, is far superior. The parent now has a lot more context over what each of the sub-agents did. And the user can even ask the agent to stitch together different pieces and bits from the different implementations, which was not possible before. In the previous implementation, you had to choose one sub-agent, one model, and just stick with that.
Now, let's talk about some of the cons.
And if you're curious, we have a forums link here where we're actually getting some mixed feedback on the new implementation. Some people were really accustomed to the old way the feature used to work, and if you're curious, you can go and see that not everyone is happy with the change, at least for now, but we're tracking it.
What are the problems? Number one, it's
very hard for the agent to stay on
track.
With our previous approach, the agent had to stay on track. We didn't let the model ever touch any files outside its worktree; it was physically impossible for it to do so. Now we're trusting the model. So you could say it's a bit vibes-based, because we're basically saying, hey, operate on this directory, and then, you know, knock on wood, please don't forget about this. And especially over long sessions, it's quite possible that the model will forget where it should be operating. And sometimes these models, especially the worse models, will kind of hallucinate or go a bit haywire and start doing things they shouldn't. But we're working on this. Another con is that it feels slower, because you're seeing the agent create the worktree and you're seeing that in your chat.
It's not actually slower, but it does
feel like the agent is kind of like
wasting time doing something that should
be done for it in advance. Um, we're
also looking at some improvements here.
And then finally, it's much harder to find the feature now, right? Before, whenever you opened Cursor you had this dropdown that would ask: do you want to run this task locally, or in the cloud, or in a worktree? Now that entire dropdown is gone, and so if you want to use worktrees you have to know the feature exists so you can actually type /worktree. So the discoverability is a bit worse, but as I mentioned before, this is an advanced power user feature, which we're personally okay with being less discoverable in general.
So, how can we make this skill better?
As I mentioned, the biggest problem right now is that the agent is not really always staying on track. There are two ways that we're going to improve this. One is with evals, and then using those evals to improve the prompts, and the other one is through RL and training. At Cursor, we train our own model called Composer. And for Composer 2, the latest version of this model, we didn't have any RL tasks with these prompts. We didn't have any tasks, in all of the many, many thousands of tasks that we use for RL, actually operating in this type of environment. So we're working on adding a bunch of these tasks into our RL pipeline, so that by the time we launch Composer 3 or 4 or 5, at least our own model will be much better at this. Obviously we cannot improve the models that the other companies develop, but we've been sharing feedback with all the other labs and model providers on
this kind of thing. And for evals, uh
I've been working on some evals for this feature, and I'm fairly early in my evals-writing journey. I was actually very surprised: if you use something like Braintrust, and shout out to Braintrust, they've been super helpful, writing these kinds of evals is actually super easy. You don't have to know almost anything about evals; you can just prompt the agent and it'll do everything for you. Effectively what I'm doing is I spin up the Cursor CLI, which is headless, so it's great for evals. And then I have two scorers: one that checks whether the model did any work in its worktree, as expected, and another which is the reverse of that, whether the model did any work in the primary checkout where it shouldn't be doing any work. So far the evals I've got are pretty simple, so I haven't been able to simulate extremely long sessions, which is when the models start performing worse. But even so far I've already learned that not all models are equally good at this. For example, Haiku, which is a smaller, less intelligent model, will very often deviate and start working in the primary checkout. But the other models I've been testing, such as Composer and Grok, are doing much better. I still have to improve these evals a lot more to make them more complicated. But the hope is that as soon as I can start to find patterns here, I can actually go and improve the prompts. And then another thing we can do is have better system reminders to the models, instructing them to stay on track and not deviate from the worktree they are supposed to be working in.
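To make that concrete, here is a minimal sketch of the two-scorer setup described above, written against Braintrust's TypeScript SDK. The runCursorAgent helper, the CLI flags, and the worktree directory layout are all assumptions for illustration; the real harness is internal to Cursor and will differ.

```typescript
// Minimal sketch of the two-scorer eval described above, using Braintrust's
// TypeScript SDK. runCursorAgent() and the CLI flags are assumptions.
import { Eval } from "braintrust";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// Hypothetical helper: run the headless agent against a task prompt inside a repo.
async function runCursorAgent(repoDir: string, prompt: string): Promise<void> {
  await exec("cursor-agent", ["--cwd", repoDir, "--prompt", prompt]); // assumed flags
}

// Returns true if any tracked file under `dir` was modified (simplified check).
async function hasChanges(dir: string): Promise<boolean> {
  const { stdout } = await exec("git", ["-C", dir, "status", "--porcelain"]);
  return stdout.trim().length > 0;
}

Eval("worktree-skill", {
  data: () => [{ input: { repo: "/tmp/fixture-repo", task: "Add a /health endpoint" } }],
  task: async (input) => {
    // The skill is expected to create a worktree and do all its edits there.
    await runCursorAgent(input.repo, `/worktree ${input.task}`);
    return {
      workedInWorktree: await hasChanges(`${input.repo}-worktrees/task-1`), // assumed layout
      touchedPrimaryCheckout: await hasChanges(input.repo),
    };
  },
  scores: [
    // Scorer 1: did the model do its work in the worktree, as expected?
    ({ output }) => ({ name: "worked_in_worktree", score: output.workedInWorktree ? 1 : 0 }),
    // Scorer 2 (the reverse): did it touch the primary checkout where it shouldn't?
    ({ output }) => ({ name: "left_primary_checkout_alone", score: output.touchedPrimaryCheckout ? 0 : 1 }),
  ],
});
```

The second scorer is deliberately the mirror image of the first: a run only passes cleanly if the diff shows up in the worktree and the primary checkout stays untouched.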
Okay. So what's next? The first thing is we're actually going to take a small step back and build a much more complete and native worktrees implementation in the new Cursor agent window. If you've been following, we recently announced Cursor 3.0. Part of 3.0 is a more agentic interface for coding, where you can still edit code and you can still see code, but the UI and the UX are much more optimized around the agent and the chat interface. We believe this kind of interface is the right place for a proper worktrees implementation. The kind of person who is more likely to be doing a bunch of local parallelization is usually the same type of person who is more likely to use this type of UI. So we're taking a small step back there and building a proper worktrees implementation in the new UI that is more native, not so agentic. Also, we're improving the skills, as I mentioned, through this continued work on evals and then RL and other training work.
And then finally, we are actually looking into other parallelization primitives that are not git worktrees. If you've used git worktrees, you might know that they can be a bit slow to create, they use up a lot of disk space on your computer, and they only work in git repos. So if you're using something other than git, there's really no local parallelization primitive in Cursor. In the near future we hope to share more about this, but we're looking into some other solutions for local parallelization that don't involve git and don't involve git worktrees. So stay tuned for that.
Thank you all for coming to the talk today. I'm sure many of you have questions, and I'm going to be around all day, so feel free to grab me anytime; I'm happy to chat with anyone. Thank you.
David Gomez, give it up for him. Fantastic.
Yes, we're going to have some time to change over while the next speaker comes up and starts setting up. If you want to go catch another breakout talk in the other rooms, here's your schedule. This is the coding agents track. Welcome. We're going to talk now about a piece of Pi. Who likes to eat pie? Yeah, nobody. Okay. Who likes to, I don't know, code with Pi? Nobody? Are you awake?
>> Yeah, one guy. Hey, what's your name?
>> Alex.
>> Alex! Give it up for Alex, everybody. The rule of this room is: be like Alex. So, he's setting up; he's going to talk to us about how you embed the OpenClaw coding agent in your product. Who here is using the OpenClaw coding agent? Like four, okay. After this talk this number is going to go up, because he's going to show us how to use it in your product. It's a really incredible talk. I got to speak with Matias just before, and I'm very excited about it. So, please, your biggest, warmest round of applause for Matias. Woohoo.
All right. Thank you very much for having me. It's really an honor to speak here. I got introduced to Pi by... okay, that's
Perfect.
>> Perfect. All right. I was introduced to Pi by looking into OpenClaw. There was a conference, a meetup, that said, okay, we're doing OpenClaw. And I wasn't so much interested in all the crazy things people are doing with it; I was more interested in understanding how these things work. So I was looking into Pi, trying to understand the whole world of what Pi is able to do. This is the one picture you need to take. Please feel free to take more pictures, but all the slides and the examples are there. So that's the one slide.
Very quick, about myself: we're creating a small company, Tavon AI. We're building agents for organizations, small, out of Europe, but getting started. And what I really like about Mario's talk is this quote, which you've probably seen this morning: we are in the [ __ ] around and find out phase for coding agents. So everything that I'm going to show you is what I know today. I'm going to do this talk again in a couple of weeks and it's most likely going to be different. But as Mario was showing this morning, he has created this minimal set, this coding agent that is available for you guys to fool around with, and that's what I'd like to encourage you to do.
So, coding agents, and why is it so exciting for us to build more products with them? This is Ken Thompson, inventor of Unix, and this is the famous quote by him, one of his quotes: write programs that do one thing and do it well. I really like that because it works to our advantage with agents. The best place I can show this is with Cowork. This is Cowork, Claude's desktop offering, where they're basically bundling their coding agent into something they feel is more applicable, and to be honest I've seen very good reception around this. When you use it with their finance tools, you always need to work with Excel, right? So they have this Excel skill down there, and it talks to Excel. Well, it doesn't. Instead, it uses a set of small tools, small CLIs, pandas, openpyxl, stuff from LibreOffice, and packages this into their own skill to make it all work. And I think this is a great example to get your thoughts going about what is doable.
I haven't written a book, and nobody can write a book about this yet, because there are no patterns; we need to figure this out. We're seeing some emerging patterns in the coding space, and there are obviously tons of different coding agents, but there's no authoritative resource around this. So get going. One thing I realized when talking to Ivan yesterday is that one architectural pattern we're seeing is: make it easy for coding agents. That is very broad, but think about it. Don't try to be very complex; think about the coding agent, what it is good at, and how to build your system so that it is easy and accessible for the agent. And I have some examples. All right,
this is the rough agenda for the next 10 minutes or so. I'm not going to talk too much about Pi and OpenClaw; I have two slides, and the slides are online, so we'll take it from there. So, a very brief introduction to Pi. Mario, great work. Something he didn't mention is that he's joining Arendelle, which I think is awesome; it seems like great folks working together. It's open source, it's minimal, so it's just perfect to get started. And the other part I do want to re-emphasize: give it a try. We're going to talk about something a little bit different, but open up Pi and ask it to build what you want. It's amazing what it is actually able to do with the system prompt that Mario has shown.
All right, these are the extensions. All the extensions you can download, or build yourself, and there are tons to explore. All right, so let's get going. This talk is not about the coding agent itself, about using it for your daily dev work, but about what we can potentially do with it. And the starting point is actually not coding agents. The starting point, and I encourage you to do the same, is looking at the core agent itself. There are other SDKs, but we're talking about Pi, so let's use Pi. And what is an agent? An agent is actually just an LLM that runs tools in a loop. You have some goals, you have some context information, AGENTS.md in many cases, and then you do tool calls, you get some results, and you basically do it in a loop. That's it. There's not much more. The rest is magic: trying to fit it to your use case a little bit more in one direction or a little bit in another. So, that's really it.
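As a rough sketch of that loop, here is what "an LLM that runs tools in a loop" can look like in TypeScript. The callModel function and the tool registry are placeholders you would wire to your own provider and tools; this is not Pi's actual API, just the shape of the idea.

```typescript
// Minimal "tools in a loop" sketch. Everything here is illustrative: callModel()
// must be wired to a real LLM API, and the tools map to real implementations.
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelTurn = { text?: string; toolCalls: ToolCall[] };
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

const tools: Record<string, (args: Record<string, unknown>) => Promise<string>> = {
  // e.g. bash: async ({ cmd }) => runShell(String(cmd)),
};

async function callModel(messages: Message[]): Promise<ModelTurn> {
  // Placeholder: send `messages` to your LLM provider and parse its tool calls.
  throw new Error("wire this to an LLM provider");
}

async function agentLoop(goal: string, agentsMd: string): Promise<string> {
  const messages: Message[] = [
    { role: "system", content: agentsMd }, // context, e.g. the contents of AGENTS.md
    { role: "user", content: goal },       // the goal
  ];
  for (;;) {
    const turn = await callModel(messages);
    if (turn.toolCalls.length === 0) {
      return turn.text ?? ""; // no more tools requested: the loop is done
    }
    for (const call of turn.toolCalls) {
      const result = await tools[call.name](call.args); // run the tool
      messages.push({ role: "tool", content: `${call.name} -> ${result}` }); // feed result back
    }
  }
}
```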
So, pretty please, open up the curtain and play around with it. Now, with the agent core, this looks a little bit like this. You have an agent class, this is all TypeScript, you can pass it all sorts of information, you can prompt it with different information, and you also have an event system, so you know a lot about what is going on. So, a small example: this is a CRM
lead qualifier. I started with the CRM use case somewhere along the way and it just stuck around. So: a terminal interface, obviously, a small TypeScript application, three files, really easy, and you can see it here. You have a couple of commands you can execute, like "show me all leads and score them". So that's what we do: show all leads and score them, and here you see all the things going on under the hood. You see that the assistant is calling tools, that you get some results, and eventually you get some output. Now, obviously, there are tons of things left to do, but I've just vibe coded this away, and again it's a good learning exercise. The system prompt is what you'd imagine: calling out the different tools and what to do with them. So, all pretty straightforward if you are building an agent. This is an
example of how you inject behavior. We said we do tool calling: we reach out and call a specific tool. But to steer the agent more, a typical hook is: before the tool call, do something. In this case we don't want to update a contact without checking something first; you can imagine any kind of authorization, role-based access, whatever enterprise feature, in here, but basically it runs just before the tool call. There's another one: events. We've seen these in the stream, and you might have seen a little check mark there, meaning the tool call was fine and returned some result. So again, we're subscribing to events. All pretty straightforward, and again, please give it a try.
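Here is a hedged sketch of those two steering points, a before-tool-call guard and an event subscription. The class and method names (Agent, beforeToolCall, on) are illustrative stand-ins, not necessarily Pi's real API surface.

```typescript
// Hedged sketch of a "before tool call" hook plus an event subscription.
// The Agent class here is illustrative only.
interface ToolCallEvent { tool: string; args: Record<string, unknown> }

class Agent {
  private listeners: Array<(e: ToolCallEvent & { ok: boolean }) => void> = [];
  private guards: Array<(e: ToolCallEvent) => Promise<void>> = [];

  beforeToolCall(guard: (e: ToolCallEvent) => Promise<void>) { this.guards.push(guard); }
  on(listener: (e: ToolCallEvent & { ok: boolean }) => void) { this.listeners.push(listener); }

  async runTool(e: ToolCallEvent, impl: () => Promise<void>) {
    for (const guard of this.guards) await guard(e);   // a guard may throw to block the call
    let ok = true;
    try { await impl(); } catch { ok = false; }
    this.listeners.forEach((l) => l({ ...e, ok }));     // the "little check mark" event
  }
}

const agent = new Agent();

// Guard: never update a CRM contact without an approval flag (stand-in for RBAC etc.).
agent.beforeToolCall(async (e) => {
  if (e.tool === "update_contact" && !e.args.approved) {
    throw new Error("update_contact blocked: needs approval");
  }
});

// Event: log whether each tool call succeeded.
agent.on((e) => console.log(`${e.tool} ${e.ok ? "ok" : "failed"}`));
```

The guard can throw to block the call entirely, which is the pattern hinted at with the update-contact example; the event listener is where the check mark in the stream would come from.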
All right, so that's simple agents; other agent SDKs are also available. And now we're moving on to the coding agent. What's a coding agent? At the end of the day, it's really the same thing we've seen before. It's a normal agent: it runs tools in a loop. But now we have a runtime and some type of shell. Bash seems to be the shell that everyone is using. So we have a shell and a runtime to start executing.
And now things are getting interesting, and the magic of what you've seen with OpenClaw suddenly shines. Peter shared this example in a presentation: he sent a voice message to his OpenClaw. At that time, and I still don't know if there's a special plugin now, OpenClaw didn't know anything about voice messages. So what it did is it created and used different tools; in the end one of them was ffmpeg, right there on the local machine, and it ran that as one of its tools. So from the outside it looks like learning, but on the inside it's actually just another tool call available to the agent, and that's what makes these things so interesting. So, again,
the example here is a little bit more sophisticated, but the important part is the extension API, and please look it up online. We're going to do two things; the things I'm mostly interested in are session events and UI interaction. Look it up online, but here's the actual extension. Again, in a coding agent you'd probably just generate this by asking, but if we have a look, this is a small snippet of the CRM TypeScript. Basically we're doing the same example as before, except we have a new slash command called pipeline. With this new command we're loading all the context, and just below step one (I don't have the line numbers shown) you can see a context UI select. So all of a sudden we're not only interacting with the backend systems and sessions, we're also interacting with the UI, and we're able to select things, and that's what got me thinking. So you have this command, and again, this is now just the coding agent; we're not talking about the core agent class, but this is how you would load up Pi if you just download the coding agent. And now, with this new extension, we can start selecting things. This is a simple select here, and you even have dropdowns. Now, the important part is that these are extensions, and the framework Pi currently includes is catered towards the use cases of a coding agent; there's lots of work and other things to do to make this ready for other types of applications. But I hope you can see and understand the vision of where this is heading. And this is all terminal, right? So you might wonder how this would look on the web. It currently is not possible, unless you ask Pi to build it. So I asked Pi to build something, and this is the web UI: same command, same selection, all based on the same extension mechanism. There's a refactoring going on to make this more accessible and cleaner, but I hope it again shows you a little bit of where things are going.
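For a sense of what that kind of extension might look like, here is a hedged sketch of a /pipeline slash command that loads CRM context and then uses a UI select to let the user pick a lead. The registerCommand and ctx.ui.select names are illustrative placeholders, not Pi's actual extension API.

```typescript
// Hedged sketch of a slash-command extension with a UI select step.
// registerCommand / ctx.ui.select / ctx.session.addContext are assumed names.
interface ExtensionContext {
  ui: { select(opts: { title: string; options: string[] }): Promise<string> };
  session: { addContext(text: string): void };
}

type CommandHandler = (ctx: ExtensionContext) => Promise<void>;
const commands = new Map<string, CommandHandler>();
function registerCommand(name: string, handler: CommandHandler) { commands.set(name, handler); }

registerCommand("pipeline", async (ctx) => {
  // Step 1: load all leads from the CRM (stubbed here).
  const leads = ["ACME Corp", "Globex", "Initech"];
  ctx.session.addContext(`Current pipeline leads: ${leads.join(", ")}`);

  // Step 2: interact with the UI, not just the backend: let the user pick one.
  const chosen = await ctx.ui.select({ title: "Which lead should I qualify?", options: leads });
  ctx.session.addContext(`Focus on lead: ${chosen}`);
});
```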
All right. Now, Pi and OpenClaw are a special setup. With Pi and OpenClaw we're not only talking about a single agent in a single session in a coding environment; now we have a multi-channel environment where we have multiple threads going on, multiple agents going on, so there's a little bit more to it. And the interesting part, which is where I got started, is that if you look into the core packages of Pi, all of them are used in OpenClaw. OpenClaw has this function, run embedded Pi agent, and it creates a session. Pi itself has great session support: it creates a session agent and streams all the information back. We have the coding agent, which we just talked about, we have the agent core, the other part we talked about, and there are two other minor, or major, packages: Pi for the unified LLM abstraction, and a terminal UI interface. OpenClaw has built its own plug-in mechanism, and that's because it's a different use case with different requirements. So you have plug-in support for multi-channel routing, provider orchestration, sub-agents, gateway support, yada yada yada, all the things you know from OpenClaw, but it's based around the core mechanics of Pi and leverages them. Cool. But one thing, and this is the major gist I'd like to bring across, is: okay, what do we do now with this? What are other options for us? And this is one of the
applications we've been building for a client. The use case is a sales process: they get requests for proposals for ordering from another system, parts being sold by that company. And we're setting the coding-agent framing aside for a moment, thinking fresh and looking at the process from the get-go. An email comes in; we monitor that inbox. Then we have a gateway, because what we want to do is forward this to different agents. So here I have multiple agents, and the way it's structured is we have one agent per customer. That agent has a general harness, AGENTS.md as an example, but you can obviously also use different ones, and that helps it understand the role of the agent in this specific case: it tells it how to use the system and how to react to certain inputs, outputs, et cetera. The other one is a customer MD, where we basically explain to the agent that this specific customer might have specific quirks, specific access, specific discounts, and so on. And then, as I said earlier, I like using sessions: for each case we're creating and reusing existing sessions so we can go back and forth and know what was previously talked about. All right. So,
email comes in, we're looking at the inbox, and we route it to these different agents. And now we have tools: different tools to talk to the CRM, to talk to the ERP, and get the right information out of those systems for this agent, so it can behave correctly, for example if there is new contact information or that sort of thing. And again, we make this available, we make it easy for the agents to access, and our current way of doing this is with CLIs. Our agents are really good at using CLIs, so we make it available as a CLI. We make sure that the data is secure, we have our own sandbox, and then we're creating the drafts. So that's the system, and I hope by this point you understand logically how these things fit together. But how would this look?
Oh, one final thing: there is always the question around sandboxing and so on, and to be honest we're just on the first steps of getting there. But if you've seen NVIDIA's announcement around OpenClaw, their policy, their open shell, is really interesting, and it's one way of securing an agent. We're looking into this. Please do as well.
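As a flavor of the "expose your systems as small CLIs" idea from a moment ago, here is a hedged sketch of a tiny read-only CRM lookup tool the agent could shell out to. The crm-cli name, the endpoint URL, and the CRM_TOKEN variable are assumptions for illustration, not the client's actual interface.

```typescript
#!/usr/bin/env node
// Hedged sketch: a small CLI the agent can call from bash, e.g.
//   crm-cli contact "ACME Corp"
// The URL and the CRM_TOKEN env var are placeholders.
const [, , command, ...args] = process.argv;

async function main() {
  if (command === "contact") {
    const name = args.join(" ");
    const res = await fetch(`https://crm.internal.example/api/contacts?q=${encodeURIComponent(name)}`, {
      headers: { Authorization: `Bearer ${process.env.CRM_TOKEN ?? ""}` },
    });
    // Print JSON to stdout so the agent can read it like any other shell tool.
    console.log(JSON.stringify(await res.json(), null, 2));
  } else {
    console.error("usage: crm-cli contact <name>");
    process.exit(1);
  }
}

main().catch((err) => { console.error(err.message); process.exit(1); });
```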
All right. So, how does this look, to give you an understanding of how these things fit together? Here's the dashboard. Rather boring, but here's the inbox with the email. We see the email coming in; it's one of many emails, most of them ignored, but for this one the LLM call said, okay, I'm interested in this. And it is associated to a case; we see the case up there. Now, this case is again an agent session, so we find the session and associate the email with it. We then create a draft. There are tons of calls, which I'm going to show you in a second, but basically the output of all that is a draft email that the user will be able to use. Our thinking is: let the user stay in email, let them stay in the inbox and drafts, so they don't even need to do a lot. This is more like an admin interface; they can stay in email, but basically the output is a generated draft. And how does that look
behind the scenes? We had the different sessions before, the threads, and this is the same thing. The assistant says, well, apologies, it's in German, but now it's looking at the articles; it does different tool calls, gets results, and does this in a loop to produce a result. The end effect for the user is: I'm looking at my inbox, there's a new email, it's associated to a case, and I get a new draft which I can freely edit, but under the hood we have all these agents working. All right, that's it for me. Again, here you find the slides. Key takeaways, please: coding agents are and will be a core building block for your software systems. I'm betting on it, a lot of people are betting on it, so please give it a try. Pi is perfect for tinkering, whether you like it or not. It's minimal; you can rip things apart and put things together. It's perfect. So, please go tinker. All right. Thank you.
Thank you. Thank you, Matias. That was a great talk. Give one more round of applause, everybody. Come on, be amazing. You know what? Applause is free. I don't know if you noticed, but this doesn't cost anything, okay? So we can be generous with it. And as we talked about at the beginning, you get to decide how the speakers feel and the kind of talk you get. So, our next speaker is going to set up; I'm going to introduce her in a minute. But first, let me ask you this: who here codes using coding agents? Okay, almost everybody. Now I want you to shout out to me: what is your favorite model that you use?
>> Dang, man. Opus. Okay. Anyone using Composer 2?
>> No. He said no. He's like, no. Anyone using Kimi? It's Kimi, right? Which is kind of comparable. Anyway. So, Composer 2 is a fantastic model, I will say this. I was impressed. I was talking with the Cursor guys back there. It is so fast, and when a model is so fast, we think surely it must be bad. I don't know what it is within us, but we think if it's fast, it must be bad. But I've been doing this thing where I've been using a multi-agent system: I solve a task with Cursor and Composer 2, and it's solved very quickly; I get a diff, and then I give the diff to Opus and say, hey Opus, what do you think? And Opus, every single time with Composer 2: LGTM. So I can actually recommend it; if you haven't used it, it's a wonderful model. Our next speaker... sorry, that was just for free; that's what happens when you're a builder, you can't stop talking about it. Our next speaker, Sarah Chang, is going to talk to us about exactly this: fast models need slow developers. I'm really excited about this because I have so many thoughts about fast models. Everyone, your biggest round of applause. Sarah Chang.
>> Hi everyone.
So we'll just get right into it. Over the past few years, we as developers have developed a series of bad habits as a result of slow AI code generation. We're all familiar with it: we write massive prompts and try to one-shot, we make huge commits, or we have our 10 agents all on the screen at the same time, confabulating, cogitating, thinking. About a month ago, we at Cerebras and OpenAI released a new state-of-the-art model called Codex Spark. Codex Spark can generate code at 1,200 tokens per second. To put that into perspective, if you look at the Sonnet family or the Opus family, those generate code at about 40 to 60 tokens per second. So in this new era, as we start to see much faster coding models, this is 20 times faster. Not only does it unlock new capabilities and use cases, it also requires us to rethink how we as developers interact with the coding model. A lot of these bad habits that were generating maybe 50 tokens per second of bad code, unless we fix them, are going to start generating 1,200 tokens per second of bad code. And that is the topic of today's talk.
So to get started, my name is Sarah Chang. I'm the head of developer experience at Cerebras, where we are building the world's largest and fastest AI processor. A large part of my job is that I get to introduce fast inference and fast coding models to developers for the very first time. For most people, it's a very exciting moment: there's no thinking and waiting and spinning up that you might be really annoyed about. But at the same time, as I said, unless we change our habits, we are not going to have good code in the future. So this talk really is a practical playbook for how we as developers can think about how we interact with models in this new regime, especially in a future where the models are generating code faster than we, the humans, can keep up.
I want to look back at history a little bit. We've had a very exciting past two years. The models have gotten bigger, they're getting smarter, we have bigger context windows. But the thing that has remained relatively constant over the past two years is coding speed, model speed. If we look at a lot of the popular families, Gemini, Claude, GPT, Sonnet, over the past two years they've always been within, you know, 50 to 150 tokens per second. And this is Codex Spark. Again, Codex Spark is just the first of many models that we as developers can expect to be much faster than what we're used to. We even had to change the Y-axis because it's so much faster. And
so before we get into the actual playbook and tips, I want to talk about why this is happening. Why are we suddenly seeing much faster models? It's actually a very exciting development. It's what many of you probably work on day-to-day, but there are so many companies working on this problem at the same time, and as a result the entire AI inference stack is getting optimized all
at once. So breaking it down, let's go through it really quickly. We have hardware: the physical device that inference, training, all of our compute happens on. One of the biggest things we have to think about with hardware is the memory wall, and this is exactly why memory movement takes up 50 to 80% of the latency time for inference. This is where a lot of the frustration comes from. When we are running inference, we have to constantly move our weights and KV cache values between memory and the actual chip. On an Nvidia GPU, the most traditional type of hardware, all of this memory is stored off chip in HBM, and we then have a memory bandwidth bottleneck. What a lot of newer companies, companies like Cerebras or Groq, are thinking about is: how do we move this memory as close to the chip as possible? So here's an example of the Cerebras wafer, where all of the memory is distributed across the chip in SRAM, so every core has direct access to the values it needs.
Even more exciting, we have disaggregated inference, which has really become commercialized in the last few months. This is why Nvidia bought Groq for $20 billion a few months ago, and it's also why Cerebras and AWS are now partnering to serve the wafer and AWS Trainium together. In traditional inference, there are two steps: prefill and decode. Traditionally, both of these steps have always been run on the same piece of hardware. Prefill is where we're taking every token that the user inputs and processing it, embedding it, and adding it to our KV cache. This is a step that can happen in parallel, so it's compute-bound. Decode, on the other hand, is where we're actually generating the output token by token; this is sequential and, as we mentioned, memory-bound. It goes back to the same problems we mentioned before. So what we're doing and seeing now commercially is splitting up these two steps, so that prefill is done on one type of hardware that is compute-optimized and decode is done on another piece of hardware that is memory-optimized. Going
up the stack, there's the diagram. Going up the stack, we look at model architecture. There are so many ways we are training and shaping our models to cater to our hardware; we're always thinking about specific layer dimensions, memory, and model size. A great example is a very standard model architecture: mixture of experts. Instead of activating the entire model for every single token, we're only activating a subset of experts each time. What this does is allow us to have the intelligence of a much larger model for the compute cost of a much smaller model. And again, we're always thinking about memory and the size of our models. A lot of people have been building on top of this in recent years. An example is REAP, router-weighted expert activation pruning; I had to read that one. Here we're looking at a specific use case, seeing which experts aren't being activated at all, and pruning them altogether, getting rid of them. Again, we're always thinking about model size. And then at the very top
layer of the stack, we have inference optimizations. This is where many of you might be working, and a lot of companies you're probably familiar with are also working here: companies like Together, Baseten, Modal, who's also here, and Fireworks. One of the biggest things we think about at this level is KV cache reuse: by storing and reusing previously computed token representations, we don't have to recalculate attention over the sequence at every step.
And now I want to get to the very top and most exciting part: the developer. This is the current state of what the internet, or what Twitter and LinkedIn, looks like. We have someone running six Claude Code terminals at once, a 500-plus-agent coding swarm, someone running eight agents across five screens. And I get how tempting doing something like this can be. If you're on Twitter at all these days, unless you are doing something like that, the internet is basically convincing you that you are living in the stone age and that you need to catch up. But the reality of what is happening in all these setups is that we're generating massive amounts of code that nobody is verifying. And in the new future with much faster inference, this becomes increasingly dangerous. Especially with fast inference, we're now going to be generating technical debt at a level that we've never seen before, and we're not going to know what to do with it.
So I'm going to pivot now and spend the rest of the talk on the practical playbook: tips and workflows for how we can reimagine how we as developers should operate in this new regime of faster inference. As I mentioned, Codex Spark operates at 1,200 tokens per second, but it really is just the first model in what we as developers should expect and prepare for: a new regime of faster models across the board. Starting with the first category: choosing the right models, and how we orchestrate our agents so that we're leveraging different model strengths. Historically we always think about intelligence; it's no secret that we as developers are not particularly loyal and will switch to whichever model or family is most intelligent at a given time. Maybe we also think about cost, unless our company pays for whatever we want. Now that inference speed can differ by 20x, we also have another vertical to think about: speed. A good mental model is to use a larger model like GPT 5.4 or 5.3 for your planning or your long-horizon workflows, and then use a faster model like Codex Spark as your actual executor. Here's an example: you might ask GPT 5.4 to generate your plan, then spawn all of your sub-agents with Codex Spark and have them actually execute all of those steps one by one. Another really helpful trick is to make skills out of successful sessions and capture trajectories that are working really well. You can use a model like GPT 5.4 to do the initial, harder, larger task, capture that as a skill, thereby making it a verifiable, repeatable workflow, and then have a smaller, faster agent like Codex Spark just do it again and again in the background.
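As a hedged sketch of that split, here is roughly what plan-then-execute orchestration could look like with a generic chat-completions client. The complete() helper and the model names are placeholders; wire them to whatever provider and models you actually use.

```typescript
// Hedged sketch: plan with a slower, smarter model; execute each step with a fast
// model. complete() is a placeholder for your provider's chat API.
async function complete(model: string, prompt: string): Promise<string> {
  throw new Error("wire this to your LLM provider");
}

async function planThenExecute(task: string): Promise<string[]> {
  // 1. Planner: one call to the large model, producing a numbered checklist.
  const plan = await complete(
    "big-planner-model",
    `Break this task into small, independently verifiable steps, one per line:\n${task}`,
  );
  const steps = plan.split("\n").filter((s) => s.trim().length > 0);

  // 2. Executor: one fast call per step; each result is appended to shared notes
  //    so later steps can see what was done (the "progress" idea from the talk).
  const notes: string[] = [];
  for (const step of steps) {
    const result = await complete(
      "fast-executor-model",
      `Progress so far:\n${notes.join("\n")}\n\nNow do only this step:\n${step}`,
    );
    notes.push(`${step} -> ${result}`);
  }
  return notes;
}
```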
The next category I think is even more exciting, because it's a category of things that just were not possible or practical before. These are things we wouldn't do because we're tired of the cogitating, gesticulating, germinating you might have seen. And here I really want us to think about this and internalize it: at 1,200 tokens per second, a model like Codex Spark makes validation basically free. There is no excuse and no reason why you should not be doing things like this: test suites, linting, pre-commit hooks, diff reviews, browser-based QA automations. There are all these things you can add to every step of your workflow because it is instant. It's not slowing you down, and it's not something you do all of at the very end or right before you're about to push your code.
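A minimal sketch of what "validation at every step" can mean in practice: after each agent edit, immediately run the cheap checks and hand any failures straight back to the agent. The npm script names are assumptions about your project's setup.

```typescript
// Hedged sketch: run the cheap checks after every agent step instead of at the end.
// "npm run lint" and "npm test" are assumed project scripts; substitute your own.
import { execSync } from "node:child_process";

function verifyStep(label: string): { ok: boolean; errors: string } {
  try {
    execSync("npm run lint", { stdio: "pipe" });
    execSync("npm test -- --silent", { stdio: "pipe" });
    return { ok: true, errors: "" };
  } catch (err) {
    // Capture the output so it can be pasted straight back into the agent's context.
    const e = err as { stdout?: Buffer; stderr?: Buffer };
    return { ok: false, errors: `${e.stdout ?? ""}${e.stderr ?? ""}` };
  }
}

// e.g. after each checklist item:  const check = verifyStep("step 3");
//      if (!check.ok) { /* feed check.errors back to the agent and retry */ }
```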
Another tip that I really like is exploring and cherrypicking. Let's say I want to code a navbar: I want it to be midnight blue with four different icons. I give it to the model and the result's fine. Instead, what I can do with Codex Spark or a much faster model is tell it to generate 15 versions in the same time it would have taken a previous model to generate one, and I can cherrypick the version I like best. Even better, I can spawn five sub-agents that each generate 15 versions, and now I have 75 versions and I pick the best one. This is great for things where we really value quantity or variety: research directions, different architecture directions, or even just graphic design. The reason I really like this one is because it almost allows us to artificially induce taste into our model output. Traditionally, it's no secret, it's very easy to sniff out any UI or text that a model writes; the models themselves do not have taste. And the ways we've brute-forced around this are that we either create an example ourselves, or we find examples for the model, which is time-consuming, or we give the prompt so much detail that we might as well have completed the task ourselves. This is a great way of saving our time and also getting much better results.
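Here is a hedged sketch of the fan-out part: fire off N generations in parallel (or N sub-agents each producing several), then pick from the results. Again, complete() is a placeholder for your provider's API, and the model name is illustrative.

```typescript
// Hedged sketch: generate many candidate versions in parallel and cherrypick.
// complete(model, prompt) is the same placeholder client as in the earlier sketch.
async function complete(model: string, prompt: string): Promise<string> {
  throw new Error("wire this to your LLM provider");
}

async function generateCandidates(spec: string, n: number): Promise<string[]> {
  // Fast models make it cheap to run all n generations concurrently.
  const prompts = Array.from({ length: n }, (_, i) =>
    `${spec}\n\nVariant #${i + 1}: take a noticeably different approach from the other variants.`,
  );
  return Promise.all(prompts.map((p) => complete("fast-executor-model", p)));
}

// e.g. 5 "sub-agents" x 15 variants each = 75 candidates to pick from:
// const candidates = (await Promise.all(
//   Array.from({ length: 5 }, () => generateCandidates("Midnight blue navbar, 4 icons", 15)),
// )).flat();
```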
The next tip is more of a mental model: now that the models are so fast, it should not be that you spawn a session, go get a hamburger, scroll Twitter, and then come back. Now you can actually sit down, and it's a real-time collaboration you're able to have with this model. You should view it much more as a pair programmer. This is the only way you are going to avoid having bad code. So you can sit down and ask questions, have it collect all the context across your repo and ask it how something works, while being the one in the front seat making decisions and implementations. The AI should always be helping you make decisions, not the other way around.
The next one, and I hate this slide because it's everyone's trigger word and overused word: how do we avoid slop? As I was mentioning before, it really shouldn't be that you spawn 10 agents, never verify the code, don't know what's happening under the hood, and then someone asks you to explain it and you have to read the code for the first time. Now you can actually have two to three sessions and actually sit down next to your code. I know this is something we're not really used to, but sit down with it and actually steer it, understand what's happening, because again, we are now experiencing real-time collaboration as we code with this agent. You can be super specific. You can do things like ban the model from deleting files, give it a max diff size, have the model only read and write, and even give it steering directions, things like: only change this, don't touch types yet; wait, that implementation wasn't quite right, let's redo that. The graph on the left is a helpful mental model, an example of how the developer, the AI agent, and the codebase can all work together and what that should look like.
This next step, refactoring, is very similar to what I was saying about verification. Just like verification, constantly refactoring and cleaning up your code automatically is basically free at 1,200 tokens per second. So instead of doing it at the very end, right before you're about to commit your code, you can just bake this into your automatic workflow, so that after every single task on the checklist is complete, you're asking the model to automatically delete unused imports, clean up unnecessary lines of code, and make it so all of your functions are structured the same way.
The last category I want to talk about, and I'm sure many of you have already heard these two words countless times over the past few days and across so many talks, is context management. The reason I'm going to talk about it again is this: let's say that historically it took you 10 minutes to fill up your context before you saw, you know, the god-feared word, compaction. Now, if you take 10 minutes and divide it by 20, you are getting compaction in 30 seconds. So context management, especially with fast inference, is more important to think about than ever, and you can't get away with sloppy practices anymore. All of these really are just good practices no matter what coding model you are using or at what speed. But a very high-level framework is: always break up large tasks into smaller, bounded goals. And this graph on the right is a good mental model for how full your context is and how that will affect model behavior. You always want to avoid the 80 to 100% range, because you're going to get compaction, and right now we all know some things might get lost.
So, how do you externalize this memory so that you can have these small, bounded goals? What does that look like? An example of how you can set up an external memory system that is persistent across every new session is with this four-file system. We have agents MD, which is where we're actually defining all our agents and sub-agents. We have plan MD, which we create at the very beginning; this is where we generate the entire plan and the step-by-step checklist we're going to go through. We have progress MD, which is where we keep track of what needs to be done and what has been done before; every time you spawn a new agent or session, it has no context, so it comes in, looks at progress MD, sees what's been done, and goes: okay, here's where I pick up, here's the next task. And the last is verify MD, which we use at every single step to make sure everything looks good and it's clean code, and then we can move on to the next step.
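As a small, hedged sketch of the idea, this helper rebuilds a fresh session's context from those four files on disk. The file names follow the talk; how the text actually gets fed to your agent depends on your harness.

```typescript
// Hedged sketch of the four-file external memory described above.
// AGENTS.md / PLAN.md / PROGRESS.md / VERIFY.md names follow the talk.
import { promises as fs } from "node:fs";

async function buildSessionContext(dir = "."): Promise<string> {
  const read = async (name: string) =>
    fs.readFile(`${dir}/${name}`, "utf8").catch(() => `(missing ${name})`);
  // A fresh session has no context of its own: it reconstructs it from disk.
  const [agents, plan, progress, verify] = await Promise.all(
    ["AGENTS.md", "PLAN.md", "PROGRESS.md", "VERIFY.md"].map(read),
  );
  return [
    "## Agents\n" + agents,
    "## Plan (checklist)\n" + plan,
    "## Progress so far (pick up from here)\n" + progress,
    "## Verification steps to run after each task\n" + verify,
  ].join("\n\n");
}
```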
An example of this is, again, leveraging different models: using a GPT 5.3 or 5.4 Codex model to create your plan, and then having Codex Spark actually execute the checklist one by one, much faster than before. And as a final slide, I want to show a few helpful commands for how you can get the best out of Codex, things like permissions, experimental skills, review, and rename.
But the biggest thing that I really want to emphasize here is that honestly, it's not really just about having faster coding models. What it really means is that the developer experience is going to become so much better, and when it becomes so much better, there's so much more we can do, and so many ways we can now avoid creating bad code in a way that isn't miserable, that isn't us staring at a screen for 30 minutes. So, thank you guys so much for welcoming me today. My name is Sarah Chang. I'm visiting from SF; it's an honor to be here in London. If you have any questions or need any credits, my handle is milks and matcha across every platform. Thank you guys.
>> Thank you. Thank you. Thank you. Thank you, Sarah. What an incredible talk. I thoroughly enjoyed that. And now we have the next talk, by Mr. Lawrence Jones. It's going to be so much fun. Lawrence is going to talk to us about fighting AI with AI. I spoke with Lawrence backstage and I said, wait, wait, wait. Does this mean you're going to set up Codex right here and Claude Code right here and be like, okay, fight? But it's not that. It's even better. He's going to come out, but listen, he needs to be prompted. The other speakers have come out here and set up their laptops and so on. He's not doing that, okay? There's a prompt to get him on stage, and that prompt, you guessed it, is applause. So, no, not yet. Not yet. I'm not done, man. Don't set me up. When I say his name, we'll prompt him and he'll appear. Ready? Give your biggest round of applause for Lawrence Jones.
>> It worked.
Um, hi everyone. So I'm here to talk today about how we use AI to manage the complexity of the AI products that we build at incident.io, and to share with you some of the tips and tricks and the internal tools that we use when we're building our AI SRE product. But first, I guess, who am I? I'm Lawrence, a founding engineer at a company called incident.io. If you haven't heard of us, we build an incident response management platform. We're used by companies like Netflix, Etsy, Skyscanner, and probably a few of you in the room. We page you when things go wrong, we help you run your incident, and as you're running your incident, we help you communicate with your customers. But you might be thinking: where does AI actually come into this? We don't just want to help people respond to these incidents. Our goal is actually to fully automate production investigations. So whether or not it's a big incident, or you just have some ticket and you wanted to look into production, we want to be the place you turn to ask questions about what's actually going on.
Now, it turns out we've been building this for about a year and a half, two years now. That's actually a really big ask, and the systems we've had to build to support it have been really quite complicated, kind of on the edge of what you can do with all of the AI technology that's out there. And they often pose a challenge for humans to debug. They are now complicated enough that you can't, as a human, really tractably dig into how these things are performing; you need assistance. So for example, this is one of the investigations that we would actually produce for you. When you have an incident, right at the start we will run this investigation, which goes through hundreds of telemetry queries. It's going to look at your logs, your metrics, your traces, any historical incident data that we have, and it's going to try to cross-reference this with your codebase and go, hey, I'm pretty sure the problem is this and you should probably do this to fix it. But I want to pause here and ask: if you were building this system, how would you actually figure out whether this was a good or a bad report? How do you know if it's right? How do you know if it's wrong? There's a load of
things that you might do. You might jump into the incident and look at everything that happened. You might look at the postmortem, if one was written. But all of this might take you a really long time to do; in fact, it normally takes an hour or so to get a real, full understanding of an incident, and it's only at that point that you could look at this investigation and go, I think it's right, or it gave me information that was really useful. And as I said, behind this investigation are hundreds, if not thousands, of prompts. So how on earth do we scalably understand how this system is performing, especially across all of our customer accounts, when they all have very different things going on? You end up with a lot of stuff and a lot of AI, and you've got to use AI to actually tractably get a handle on it. I actually did a talk a year ago at LDX about becoming AI engineers, where I went through some of the core constructs that hopefully a lot of you in the room, given that we're at an AI engineering conference, are familiar with: things like prompts, evals, scorecards, traces, datasets, backtests. This talk is about: if you assume you have these constructs put together and you're building these complicated AI systems, how can you use AI, with the internal tools you use to understand them, to get a better handle on how your system is performing?
So yeah, in this talk I'm going to cover how you can use AI to help you manage and curate your eval datasets and make it easier to work with them, and how to make it easier for coding agents to work with your eval tooling. I'm going to talk about what was probably the biggest unlock for us when we were building these systems, which was starting to translate the UIs we'd built to debug them into downloadable file systems, which has helped us massively when using tools like Claude Code and Codex to dig into how the system is performing. And then I'm going to talk about how you can build repeatable analysis pipelines and use AI agents to run through them. But first, evals.
So evals, for me, are AI unit tests. Each eval takes a prompt and says: here is some input data. It runs the prompt, gets the output, and then has some grading criteria that says, does this eval pass or fail? And for us, eval files live right next door to our Go prompts; we do everything in Go at incident.io, including all of the AI work. This is how we prove, when we make a change to a prompt, before we ever merge it, that the prompt is actually going to do the thing we want it to do. So for us, this is what a prompt looks like. It's a contrived prompt; I would hope no one actually has this in production anywhere. It takes a message and tries translating it into pirate speak. Really simple, a bit silly. But what
So really simple, bit silly. Um but what
we do for evals is if this is the prompt
uh we would then define on the left some
grading criteria for this prompt where
we'll go there are two things that we
care about. We care that the result
actually looks like pirate speak and we
are going to care that the meaning is
preserved between the input and what we
actually produced as an output. Um so
this is actually what we're going to use
to tell us if the eval passed or failed.
Um, and then we have the eval on the top
right which is just in a YAML file where
we go here are three different test
cases and we'll run through them and you
can see the results of us actually
running this uh on the bottom right. So
this works and it works really quite
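To make the shape concrete, here is a hedged TypeScript sketch of that kind of eval: test cases plus two LLM-graded criteria. incident.io's real harness is in Go and its grader API will differ; translate(), grade(), and the criterion wording here are placeholders for illustration.

```typescript
// Hedged sketch of an eval with two grading criteria, mirroring the pirate-speak
// example above. translate() is the prompt under test; grade() is an LLM judge.
type TestCase = { name: string; input: string };

async function translate(message: string): Promise<string> {
  throw new Error("wire to the prompt under test"); // placeholder
}
async function grade(criterion: string, input: string, output: string): Promise<boolean> {
  throw new Error("wire to an LLM judge"); // placeholder
}

const cases: TestCase[] = [
  { name: "greeting", input: "Hello friend, how are you today?" },
  { name: "status_update", input: "The deploy finished and all checks passed." },
  { name: "question", input: "Can you restart the payments service?" },
];

const criteria = [
  "The output reads like pirate speak.",
  "The meaning of the input is preserved in the output.",
];

async function runEvals(): Promise<void> {
  for (const c of cases) {
    const output = await translate(c.input);
    // Each criterion is graded independently; the case passes only if all do.
    const results = await Promise.all(criteria.map((g) => grade(g, c.input, output)));
    console.log(`${c.name}: ${results.every(Boolean) ? "PASS" : "FAIL"}`);
  }
}
```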
So this works, and it works quite well, but it does have some problems, and I'm assuming several people in the room have come across them themselves. First, evals are really fiddly. Setting up realistic test data for your evals, if you want to actually understand how this stuff runs, is quite difficult to do. And especially in our environment, our production evals include almost an entire incident. You can imagine a full incident report, and that's the only thing that can trigger the bad behavior. That makes them hard to pull down and put into your eval test suite; they just become extremely unmaintainable very quickly. Now,
quite early on, we created this little button that allows you to steal an eval from production. If anything was going wrong inside our AI interactions, you could go in, pull it down, put it in the codebase, and run the eval against it. But the thing is, production evals aren't great. If you think about evals as a kind of unit test for your prompts, you want a unit test suite to be reasonably understandable: an ideal unit test is very focused and just says, I expect it to do this thing. You don't want two megabytes of YAML associated with it; that's just really hard to work with. And what we found was that as these YAML files with the evals grew really large, our coding agents weren't able to work with them. If you wanted to do a quick read and modify of the eval suite, you'd be booting all of that into the context and quickly hitting your context limit, which is obviously a problem because then you can't work with it effectively.
So what we ended up doing was creating a small CLI tool that we call eval tool, designed to let agents leverage our eval files. It's just a small CLI that can ask: what test cases do you have in here? I want to edit one, I want to replace one, I want to add one. And it was by doing this that we allowed agents to work effectively with our eval tooling.
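The interface can stay very small. Here is a hedged sketch of that kind of eval-manipulation CLI; the subcommands, the YAML shape, and the eval-tool name are assumptions for illustration, not incident.io's actual tool.

```typescript
#!/usr/bin/env node
// Hedged sketch of an "eval tool" style CLI: lets an agent list / add / replace
// test cases in a large eval YAML file without loading the whole file into context.
import { promises as fs } from "node:fs";
import { parse, stringify } from "yaml";

type TestCase = { name: string; input: unknown; expected?: unknown };

async function load(file: string): Promise<TestCase[]> {
  return parse(await fs.readFile(file, "utf8")) as TestCase[];
}

async function main() {
  const [, , cmd, file, name, payload] = process.argv;
  const cases = await load(file);
  if (cmd === "list") {
    // Only names go to stdout, so the agent sees an index, not megabytes of YAML.
    cases.forEach((c) => console.log(c.name));
  } else if (cmd === "show") {
    console.log(stringify(cases.find((c) => c.name === name)));
  } else if (cmd === "add" || cmd === "replace") {
    const next = cases.filter((c) => c.name !== name);
    next.push({ name, ...(JSON.parse(payload) as Omit<TestCase, "name">) });
    await fs.writeFile(file, stringify(next));
  } else {
    console.error("usage: eval-tool <list|show|add|replace> <file> [name] [json]");
    process.exit(1);
  }
}

main();
```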
And that's why we were able to create the runbook on the right, which is a runbook designed for a coding agent to use, so either a runbook or a skill, depending on how you want to package it. The cool thing about this is that now that agents can work with the evals, you can end up in a situation where you just ask your coding agent: hey, I've got a problem here, can you look at this prompt? I want it to do these things. And the coding agent will turn up and create an eval case that proves the thing has failed, and then it will modify the prompt so that the eval now passes. And then
it will go through this runbook. One of the most important stages for us is checking at the end that the change you've made to the prompt hasn't ended up breaking any of the other evals in your test suite. We also have a final pass that tries consolidating the prompt, because if you do this repeatedly you end up with a prompt that is massive and really difficult to maintain; every time you make an adjustment, you also want to try to simplify. So this has actually worked really well for us. You can see here, this is me using it in Claude Code, where you can just point it at the eval and say, hey, have a look at this prompt. This is a real prompt for us, which turns human queries into log queries for our Loki system. It races through, adds a new eval, checks that it passes with a certain number of repeats, and then gets to the end and says, yep, I think I've added it, the pass rate is acceptable, you can go ahead. But the problem with this
is that it solves one problem: if you know which prompt you want to change, you can now change it fairly reliably, and that's very useful if you're working on these tools. But one of the biggest problems you have now is that, if you're building these systems, you'll know they're not just one prompt anymore. In fact, most of the production AI systems you use on a daily basis are many, many prompts. To illustrate this, I've taken our chatbot, the one you interact with inside an incident, and I've created a graph of all the different prompts, tools, agents, and everything in the hierarchy that powers an interaction with our system. You can see there are like 10 different agents there, 50 prompts, I don't even know; it's actually bigger than this, I couldn't fit it on the screen. It's a lot of stuff. So even if you've got a bad interaction that came in from a customer, you don't necessarily know which part of your system is actually the problem and which part to go change. Even if you have this eval red-green cycle, you're going to struggle to know where to go to fix it. And this gets even worse for a system like our investigations.
If you think about trying to run through this process to debug what's going on in an incident, we have a ton of stuff going on inside that system. You can see all the steps on the left, and each one of those steps unpacks into the trace you see on the right. It's not really about the details here; it's more that each one of these green blocks expands into possibly hundreds of different prompts and hundreds of different tool calls. And if at any point you make a slight, subtle error, you can't then easily trace through the system to where the error originated, even if it results in you having totally the wrong picture of what the incident was and your RCA being totally wrong. So we built these UIs so that we could help humans look at them, and they've been really good for humans. But going back to what I was saying before, we just feasibly don't have enough time to go through this stuff. So the problem we had was: we have all these UI tools, but agents can't properly use them. How do we get to a place where the agents can use the tools properly? I think Anthropic stumbled on this with Claude Code, where they found, when they released it, that these agents are fantastic at using file systems and going through data using standard tools. So we thought: can we just download all of the UI we have as a file system? And that's what we've done.
So now for each of our different AI
systems, you're able to download all of
the content as a file system and we drop
that into a sandbox claw code. Uh, at
which point you can just point claw code
at it and go, "Hey, I've got a problem
here. It's behaved in the wrong way." It
can see everything that went into all
the prompts. It understands the
structure because it's self-documenting.
Um, and then it can tell you because you
have access to the codebase as well. uh
exactly where you should actually be
making the modification to try and
change it. And then you can lean on that
red green cycle from before to try and
modify a prompt if you need to. Also,
there's more stuff that you can put in
this than you might think. There is
really not much of a limit as to what
you can put into ASI. Um so like traces
like this can get translated exactly
from how you would present them in the
UI to a text file which then the LLM can
consume in a really nice way.
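A minimal sketch of that "download the interaction as a file system" idea; the interaction shape, directory names, and file names here are assumptions for illustration, not the team's actual export format:

```python
# Sketch: write one chatbot interaction out as a self-documenting file tree
# that a sandboxed coding agent can grep through. All names are illustrative.
import json
from pathlib import Path

def export_interaction(interaction: dict, root: str = "interaction_dump") -> Path:
    base = Path(root)
    base.mkdir(exist_ok=True)
    (base / "README.md").write_text(
        "Each numbered directory is one agent step: prompt.txt is the rendered "
        "prompt, response.txt the model output, tool_calls/ the tools it used.\n"
    )
    for i, step in enumerate(interaction["steps"]):          # assumed shape
        step_dir = base / f"{i:03d}_{step['agent']}"
        step_dir.mkdir(parents=True, exist_ok=True)
        (step_dir / "prompt.txt").write_text(step["prompt"])
        (step_dir / "response.txt").write_text(step["response"])
        tools = step_dir / "tool_calls"
        tools.mkdir(exist_ok=True)
        for j, call in enumerate(step.get("tool_calls", [])):
            (tools / f"{j:02d}_{call['name']}.json").write_text(
                json.dumps(call, indent=2))
    return base
```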
So this has changed the way we debug our application. We hear that there's a bad experience, we download that interaction into a sandboxed Claude Code, and you sit there in the session and go, hey, have a look at this, tell me what you think has gone wrong, what is your interpretation of the problem? Then you go, I really wanted it to do this instead, what part of the system would you change? It will work its way through the hierarchy of all those tools and prompts you just saw and tell you where you should be making the modification. And then, all from that session, because you have access to the codebase, you can just go, hey, can you make that change, and then you can prove it using the eval runbook I mentioned before.
So we've implemented these file system packages for a load of different AI interactions now, so it's really easy for us to just drop them into Claude Code and get going. But we now have another problem, because whilst you can do this on an individual basis, we are running thousands of investigations across hundreds of customer accounts, and we're doing that daily, because we need to know whether this system is getting better or worse. What you can see here is what we call a back test, which is essentially a batch of investigations that we run on a daily basis against our own account and against a load of our customer accounts as well. Eventually you just get this rolled-up number, which is like, oh cool, 86% accurate RCA on our account. Which is great, but it doesn't really tell you why the number went up or why it went down, and if you want to improve the system for someone, you're going to struggle. So what we've actually done is allow ourselves to download all of these investigations into a file system that we can then feed into an analysis pipeline, again run using Claude Code, that runs a structured analysis with markdown playbooks that help you run it repeatedly and reliably each time. What that actually looks like is we created this repo called Scrapbook. Inside Scrapbook we have a very structured flow that explains exactly how a coding agent should go through all of the information we've downloaded, how it should understand these investigations, and the process it should go through to actually run them. The key things that I think are very important to these flows: you start by parallelizing out all of your agents, so maybe 25 agents in parallel, and they each individually build their analysis of an investigation. Then you go into the next stage of the pipeline, where you do some cohort clustering and look at the meta points, like what are the same types of failure, how do we go wrong in different ways. By clustering it together, you end up with a really useful report that doesn't just tell you how this has gone wrong, but why your AI system is performing well or badly on this customer account, and what you should actually do to fix it or improve the system. This is something we've done several times over for several of our systems now, and I think it generalizes really well for anyone who's building this type of thing. So, the points that make a really good pipeline for this: you should leverage subagents to do that parallel per-entity analysis, and you should store all of your analysis in files inside these downloads, so that you build up incremental analysis as you run through it and can stop and resume the analysis if you ever need to.
Then you want to combine this analysis with the codebase that's actually powering the system, so that if it finds a problem it can look in the codebase and go, hey, I think this is the problem and this is the place, and do some analysis to say, I think you should change it like this. And at the end, because you have it all loaded in your coding session, you can just ask the coding agent, Claude Code, Codex, whatever you use, to go and make the change, and then use that eval red-green process to confirm it works. This is a PR that was created after doing exactly that: the back test showed a couple of investigations going wrong, I knew exactly what the problem was, I could have a chat with it about a feature we might change in the system, and then we can deploy that, test it out in production, and see how it goes.
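A rough sketch of that pipeline shape, parallel per-investigation analysis, incremental result files so the run can resume, then a clustering pass; run_subagent_analysis and cluster_failure_modes are hypothetical stand-ins for the coding-agent sub-run and the cohort-clustering step, not the real Scrapbook code:

```python
# Sketch of the back-test analysis pipeline. All helper names are illustrative.
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def analyse_one(inv_dir: Path, out_dir: Path) -> dict:
    """One sub-agent analysis for a single investigation."""
    result_file = out_dir / f"{inv_dir.name}.json"
    if result_file.exists():                     # incremental: skip finished work on resume
        return json.loads(result_file.read_text())
    finding = run_subagent_analysis(inv_dir)     # hypothetical: one coding-agent run
    result_file.write_text(json.dumps(finding, indent=2))
    return finding

def run_backtest_analysis(download_dir: str, out: str = "analysis", workers: int = 25):
    out_dir = Path(out)
    out_dir.mkdir(exist_ok=True)
    investigations = sorted(Path(download_dir).iterdir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        findings = list(pool.map(lambda d: analyse_one(d, out_dir), investigations))
    report = cluster_failure_modes(findings)     # hypothetical: the cohort-clustering stage
    (out_dir / "report.md").write_text(report)
```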
So, that's it. The key thing from me is that these patterns do generalize. For any of you in the room who are building complicated AI systems and finding it really hard to understand them, debug them, or evolve them: you really need to be using AI just as effectively in your internal tools to understand and grow these systems as you are in the products you're building yourself. So make sure you prioritize the debugging tools you have so that they work really well with the coding agents you're leveraging day to day. File systems are exceptionally good agent context. We could have put an MCP on top of this, or used computer-use agents, but it wouldn't have been half as effective as the ability to just download, in bulk, all of the information you need so that the coding agent can grep through it and find the details. And any time you're performing complex analysis, look at creating an AI runbook for it instead; it will save you literally days or maybe weeks of your life. One final point from me: we are hiring. We're in London, we've just done a fairly big raise last year, and we're looking to expand the team so we can build some of these systems. So if any of this work looks interesting to you and you're interested in being on the edge of building this kind of AI SRE product, just get in contact and let me know. I'd love to chat. All right. Thank you.
>> Thank you, Lawrence.
>> Thanks. Thank you so much. How incredible. Give it up for Lawrence, everybody. Incredible.
I got told off back there. They said, "Hey, Tejas, when you clap, just fake clap, because my hands are too close to the mic and I'm causing problems." Our next speaker is great. This next talk is really fun. How many of you have been on a mission before? Like, I'm on a mission to the grocery store, I'm going to go get milk. Yeah. Missions usually require many different steps. They require long-running tasks. They require all of this. Our next talk is about this, but with agents: missions for agents. What's the longest you've run some prompt with Claude Code? You know what I mean? Sometimes I say, fix this bug, right? And then it takes 15-plus minutes and I just watch this agent cook. 15 minutes is a long time. Anyone go longer? Longer than 15 minutes on some coding agent? What is the longest?
>> Couple hours. Okay, couple hours. What about days? What about days, not just with a single agent, but with a team of agents, a multi-agent system? That's what we're going to hear about now from Luke Alvo. It's going to be a really exciting talk about long-running, multi-day, multi-agent missions. So, we're going to introduce him with applause. This is the prompt to bring on the speaker. I need you to applaud a lot, otherwise they don't come on, they feel shy, you know. So, let's give it up for Luke Alvo.
Oh, is he there? No. Wait, I don't think that was enough. He's kind of sitting back there crying, actually, weeping because it was too quiet. Can we give him... Let's try again. Luke Alvo.
>> There we go. There we go.
Hi everyone. My name is Luke, and my goal is that 20 minutes from now you'll be able to assemble agent teams that can complete tasks orders of magnitude harder than what you can complete with a single agent today. A little bit about me: I come from a background in dev tools. About two and a half years ago I started a project at Block, which is where I was working at the time, and that project evolved into Goose. Goose is now one of the leading coding agents, it's open source, and it was recently donated to the Agentic AI foundation, so it's been really cool to see. Nowadays I work at Factory, where I lead our core agent harness, and Factory's mission is to bring autonomy to the entire software development life cycle.
I want to start off with a claim: the bottleneck in software engineering nowadays is not intelligence; it's human attention. Even the best engineers can only complete a couple of tasks at a time. They may have a backlog of 50 features, but they can only drive a few forward per day, because every task requires their attention and every commit needs their review. Today's models are smart enough to figure out all 50 of these tasks, but there's not enough bandwidth to supervise their implementation.
So we kept asking ourselves: what if a human decides what to build, and a system figures out how to do it? An agent could just work for hours or days, and you come back to finished work. That's what I'm here to talk about. When you start researching multi-agent frameworks and systems, you quickly realize that the field's a bit of a mess. Everyone has their own framework, their own terminology, their own opinions about what works and doesn't. So I want to propose a simple taxonomy. There are five core multi-agent patterns. One is delegation: one agent spawns another agent, the parent may say "go figure out the database schema," and then gets a response back. This is the simplest form of multi-agent communication and what most people implement first; subagents in coding tools are the most common example. The next is creator-verifier, where one agent builds something and another agent checks that work. The key here is separation of concerns: the agent that implemented the code has some sunk-cost bias, it wants that code to work, while a fresh agent with fresh context is way more likely to find issues. This is why we do code review as humans as well. Another one is direct communication, when agents communicate without a central coordinator, kind of like DMing each other. It's hard to get right, though, because state fragments across conversations without that coordinator, and there's no single source of truth. The next one is negotiation: agents communicating over a shared resource. That might be wanting to use the same API or modify the same portion of the codebase. Negotiation doesn't need to be adversarial; in fact, the best use case is net-positive-sum trading, when agents have a potential win-win while interacting. And the last one is broadcast, when one agent sends information to many. Think status updates, new context that applies to everyone, new shared constraints. It's a bit less flashy than the others, but it's critical for maintaining coherence over
long-running tasks. So when you have all of these different building blocks, how do you assemble them into a system that can run for many days? Missions is our answer. It's a system that combines four of those, delegation, creator-verifier, broadcast, and negotiation, into a single workflow. You describe a goal, you scope it through a conversation, you approve a plan, and then the system handles execution for hours or days, which lets you focus on something else. Notably, a mission is not a single agent session. It's an ecosystem of agents that communicate through structured handoffs and shared state.
It uses a three-role architecture: there's an orchestrator, there are workers, and there are validators. The orchestrator handles planning. When you describe what you want, the orchestrator is kind of like your sounding board. It asks you the right strategic questions, it checks whether there are any unclear requirements in the problem space, and it eventually produces a plan that includes features, milestones, and something called a validation contract. That validation contract defines what "done" means before any coding happens, and I'll come back to why that matters, because it turns out to be really important to the system. The next role is workers; they handle implementation. When a feature is assigned to a worker, that worker has clean context: no accumulated baggage, no degraded attention. The worker reads its spec, implements the feature, and commits via git, allowing the next worker to inherit a clean slate and a working codebase. The last role is validators; they handle verification. Most systems validate by maybe running lint, type check, tests, maybe code review. Missions does all of that, but we also validate behavior. Instead of just asking whether the code looks right, we ask: does this work end to end? That's the difference that lets missions run for many hours or days in a row without drifting. And making that work meant rethinking validation entirely.
So when you've worked with coding agents
before, you've probably seen this
pattern where an agent builds a feature,
it writes some tests, the tests pass,
there's full coverage, but the tests
were sort of shaped by the code, not by
what the code was attempting to actually
do. Tests written after implementation
don't catch bugs. They confirm
decisions. So if you rely on validation
like that, your system will eventually
drift.
That's why this validation contract
exists. It's written during planning
before any code and it defines
correctness independently of
implementation. So for a complex
project, this can be hundreds of
assertions and each feature is assigned
one or more assertions that it must
satisfy. The sum of all features must
mean that every assertion is covered.
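To make the idea concrete, here's an illustrative sketch of a validation contract as data: behaviour assertions written at planning time, each mapped to the features that must satisfy them. The field names and assertions are invented for illustration, not Factory's actual schema:

```python
# Sketch of a validation contract; structure and content are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Assertion:
    id: str
    statement: str                          # what must be true, independent of the code
    features: list[str] = field(default_factory=list)

CONTRACT = [
    Assertion("A-001", "A logged-in user can create a channel and see it in the sidebar",
              features=["channels-crud", "sidebar"]),
    Assertion("A-002", "Sending a message delivers it to every channel member",
              features=["messaging", "fanout"]),
]

def uncovered(contract: list[Assertion]) -> list[str]:
    """Planning-time check: every assertion must be owned by at least one feature."""
    return [a.id for a in contract if not a.features]
```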
After each milestone of features, two types of validators run: the scrutiny validator and the user-testing validator. The first one is more traditional. It runs the test suite, type checking, and lints, and critically it spawns dedicated code-review agents for each completed feature within the milestone. The second one, the user-testing validator, is more interesting. It acts like a QA engineer: it spawns the application, interacts with it through computer use or something similar, fills out forms, checks that pages render correctly, clicks buttons, and ensures that functional flows work holistically. This step takes significantly longer than the scrutiny validator, because the system is interacting with a live application. What we've noticed is that most of a mission's wall-clock time is actually spent here, waiting for this real-world execution to occur instead of generating tokens. Critically, neither validator has seen the code before. They are not invested in the implementation, and so validation is adversarial by design.
Okay, so validation catches bugs. But for a system that runs for many days, you also need to make sure that context isn't lost between the agents. When a worker finishes a feature, it doesn't just say, "I'm done." It fills out a structured handoff detailing what was completed, what was left undone, what commands were run throughout that agent loop and what their exit codes were, what issues were discovered, and whether it abided by the procedures the orchestrator defined for that worker. That's how we catch issues and how the system self-heals.
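A small sketch of such a structured handoff, with fields mirroring what's described here; the exact schema is an assumption, not Factory's real format:

```python
# Illustrative handoff record a worker fills in when it finishes a feature.
from dataclasses import dataclass

@dataclass
class Handoff:
    feature: str
    completed: list[str]                  # what was actually finished
    left_undone: list[str]                # anything explicitly not done
    commands: list[tuple[str, int]]       # (command, exit code) for every command run
    issues_discovered: list[str]
    followed_procedures: bool             # did it abide by the orchestrator's procedures?

def blocks_progress(h: Handoff) -> bool:
    """Refuse to move on if the handoff reports failures or skipped procedure."""
    failed = any(code != 0 for _, code in h.commands)
    return failed or bool(h.left_undone) or not h.followed_procedures
```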
Errors get caught at milestone boundaries, corrective work gets scoped, and the mission pulls itself back on track, not by hoping that agents remember what happened, but by forcing them to write it down and then actually address the issues. Our longest mission ran for 16 days, which is much longer than a full sprint, and we believe they can run for 30. That's only possible because of this structure.
Once we had this architecture, the next question became: how do we actually run it? The most obvious choice is parallelism: if you have 10 agents running at one point in time, you have 10 times the throughput. But we tried that, and it doesn't really work for tasks in the software dev domain, because agents conflict. They step on each other's changes, they duplicate work, they make inconsistent architectural decisions, and so the coordination overhead ends up eating the speed gains, all while you're burning tokens. The difference with missions is that we run features serially, so there's only one worker or validator running at any given point in time. Within a feature, we allow parallelization on read-only operations, things like searching through the codebase or researching APIs. Within validators, we also parallelize read-only operations such as code review. This is serial execution with targeted internal parallelization. It seems slower on paper, but the error rate drops dramatically, and when you have tasks that run for many days, that correctness compounds.
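A rough sketch of that execution model, features strictly serial, read-only research fanned out inside each one; every helper name here is an illustrative placeholder rather than Factory's actual code:

```python
# Sketch: serial features, parallel read-only work inside each feature.
from concurrent.futures import ThreadPoolExecutor

def run_milestone(features: list[str]):
    for feature in features:                          # features run strictly one after another
        with ThreadPoolExecutor(max_workers=8) as pool:
            # read-only operations (codebase search, API research) are safe to parallelize
            research = list(pool.map(research_subagent, research_queries(feature)))
        implement_feature(feature, research)          # single writer: no conflicting edits
        run_validators(feature)                       # scrutiny + user-testing validators
```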
Now, your standard chat interface doesn't really work for something that lasts many days. At a quick glance you need to be able to see how much of the project has been completed and how much of the budget you originally set off with has been burned through. So for missions we built Mission Control, which is a dedicated view for this. You can see what the active worker is doing right now, read off handoff summaries that detail what the worker or validator discovered and how it's going to alter its course moving forward. This view lets you run missions asynchronously: you can be plugged in as a project manager overseeing the implementation, or you can just, you know, go hang out with your friends that night.
Okay, the right model in each role. Everything here assumes one thing: that you're using the right model in each role. Planning benefits from slow, careful reasoning; implementation from fast code fluency and creativity; validation from precise instruction following. No single model, or model provider, is best at all three. Using systems like missions requires developing a new skill, which internally we've been calling droid whispering. It's the idea that you need to be able to mentally model how different LLMs interact, where they fail, how those failures compound over a multi-day run, and then make a deliberate choice about which model sits in which seat. Theo, the engineer who built our missions prototype, came up with our model defaults, but we really encourage people to make these their own and customize them to the needs of their project. For example, validation might use a different model provider entirely, to make sure it's not biased by the same training data. This is a structural advantage of a model-agnostic architecture. You're only as strong as your weakest link, and if you're locked into one model provider, you're constrained by that family's weakest capability. As models continue to specialize, the ability to put the right model in the right seat becomes a compounding advantage.
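One way to picture "the right model in each seat" is a simple per-role model map; the model identifiers below are placeholders, not Factory's actual defaults:

```python
# Illustrative per-role model map. Identifiers are made up.
ROLE_MODELS = {
    "orchestrator": "provider-a/slow-careful-reasoner",    # planning: slow, careful reasoning
    "worker":       "provider-a/fast-coding-model",        # implementation: code fluency
    "validator":    "provider-b/strict-instruction-model", # validation: different provider, less shared bias
}

def model_for(role: str) -> str:
    return ROLE_MODELS[role]
```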
It works in the other direction, too. If you're using missions, the structure can compensate for models that are not quite at frontier-level performance. The validation contracts and milestone checkpoints allow you to run missions very successfully even using open-weight models.
Now, this all sounds quite theoretical. What does it actually look like in production? I've got an example here of building a clone of Slack. This slide has a ton of info, but I'll walk you through a few things I want to call out. 60% of our time is spent on implementation, and 60% of our tokens as well. Notice how validation never succeeds on the first go; that's the chart on the bottom left. We almost always have to create follow-up features, which really demonstrates the value of a system that does this QA loop. You end up, at the very end, in the bottom right, with 50% of your lines of code being tests, and 90% of your code covered by those tests. And lastly, we take heavy advantage of prompt caching to offset the price of running such a long task.
People have really taken to missions, and it's been awesome to see what folks have been building with them. I've included some examples on this slide, but the ones I want to call out are specifically in the enterprise setting, which is where Factory really shines. They've been used to prototype new ideas and features overnight, to let people build internal tools at increasingly rapid rates, to run huge refactors and migrations for ML research, and to modernize code bases so that agents are more productive in them.
One other thing I wanted to talk about is the concept of the bitter lesson, because every person building multi-agent systems has this fear of the next model release making their architecture obsolete overnight. So when we were building missions, we decided we had to make this system get better with every model improvement. That means almost all of the orchestration logic is defined in prompts and skills instead of a hard-coded state machine. How it decomposes features and handles failures is all in about 700 lines of text, and four sentences of that can alter the execution strategy pretty dramatically. Worker behavior is driven by skills that the orchestrator defines per mission, so you get very customized behavior. The only deterministic logic is very thin, and it's focused on enabling models to do what they do best while the system handles the bookkeeping: stuff like running validation and ensuring that progress is blocked when there are handoff issues that haven't been addressed. So missions ensure the discipline and the models provide the intelligence, using primitives they're already familiar with, like AGENTS.md, skills, etc.
So what does this unlock? Remember the bottleneck I started with: human attention. The economics are changing. Before, a team of five engineers might be able to work on 10 work streams at any given point in time; now, with missions, maybe we can bring that up to 30. The team can focus on the interesting problems, like architecture and product decisions, instead of worrying about the execution per se. And the important thing is that the codebase ends up cleaner than when you started. The end-to-end tests, the unit tests, the skills, the structure that missions provide mean that agents and humans are more productive in that environment moving forward.
Now that you understand how missions are structured and how they actually work, you can see that they're really a composition of those original strategies. Delegation shows up everywhere, in how the orchestrator spawns workers and how we spawn research subagents. Creator-verifier is fundamental, in that validation and implementation are always separate agents with separate context. Broadcast runs through the shared mission state that every agent references, and negotiation shows up at milestone boundaries, where the orchestrator decides: does this handoff summary look correct? Do we need to create follow-up features, rescope, etc.? But strategies aren't enough. You need the connective tissue: structured handoffs so agents don't lose context, the right model in each role, and an architecture that will improve with each model improvement.
What I like to think is that the people in this room who are thinking in terms of agent ecosystems, who develop an intuition for how different models compose under pressure, are the folks who are going to be shipping the next generation of innovation. There are a lot of open questions still: how do we further parallelize the workload of missions so they run faster? How do we start orchestrating missions themselves into even more complex workflows? But the data from production missions is clear: this works on real projects at scale today. So this is what I'll leave you with. Open up Droid, try running missions, argue with the orchestrator about the scope, approve the plan, and then go do something else. I'm excited to see what you all build, and I'll be around to answer any questions for the rest of the day. Thanks
everybody.
>> That was so... Thank you. Hey, guess what? It's time for lunch. Who's hungry? I am. So, get lunch. There's plenty of time. Listen, you came here, you paid money to be here, okay? So don't waste it by eating alone in a corner. Be with people, have community, enjoy it, and then we're going to meet back here at 2:30 p.m. local time, where you'll have a different MC. Hype him up. He's a wonderful guy. Let's go. Thank you again.
What's up, folks? How are we doing? Welcome to the coding agents track. My name is Alex; some of you know me as the host of the Thursday podcast. So let me catch you up, if you haven't been following the news this week. Anthropic just announced a Claude model that they released but didn't really release; some companies got it. Meta's MSL labs finally released something. We've been waiting for the death of Llama, or the next Llama, and they released Muse Spark. Codex hit three million weekly active users, and Tibo hit that reset button that you all love. So that's great. And on the open source side, GLM 5.1 dropped as the new open-source state of the art on SWE-bench Pro. All of this relates to what you do with coding agents, and that's just the last seven days. There has never been a better time to be in this room learning about coding agents. We have three incredible talks coming up in the next hour. First, this guy literally maintains the Hugging Face agents course, MCP course, LLM course, and the smol course. If you've ever learned anything from Hugging Face, which you definitely should, they're a great resource, you probably learned it from him. He's a machine learning engineer at Hugging Face. Today he's going to argue that your coding agents shouldn't just write code; they should do a lot more. If you've heard about CUDA, he's going to talk about CUDA kernels, and a lot of other stuff: picking GPUs, etc. Please welcome our first speaker to the stage, Ben Burtonshaw.
>> Hi everyone. As you heard, I'm Ben from Hugging Face, and the talk I'm going to present today is called Your Coding Agent Should Do AI Systems Engineering. There are two main takeaways I want you to get from this talk. One, probably the fun part, is that we can use coding agents to tackle the hardest engineering problems in AI: systems engineering and machine learning engineering. And maybe the boring part is that in order to do this, we're going to need standard repos, we're going to need those on the Hub, and in many cases we already have them.
I think I'm preaching to the choir here, but in case you haven't noticed, coding agents have been accepted. Many of us have been using them for a few years, but in the last few months they seem to have crossed a sort of acceptance gradient where a broader group of people are using them. So with this in mind, how do we keep our careers and our engineering contemporary, and how do we keep challenging ourselves in new areas? My proposal is that we need to go closer to the silicon and tackle harder problems, and that's where AI systems engineering comes in. I've broken this talk down into three progressively more complex, and more autonomous, steps, and I've framed them like three bosses from games. The first is a hybrid approach where you interactively use an agent to write a CUDA kernel. The second is a zero-shot task where an agent takes a prompt and trains an LLM on Hugging Face. The third is a multi-agent auto-research setup, a kind of automated AI lab. So let's get started on the first boss: writing CUDA kernels. For a while, writing custom kernels was seen as an unattainable goal for the humble agent. They required complex DSLs, they required integration with the relevant hardware to be benchmarked and tested, and it was seen as something that couldn't be achieved by agents.
However, in most cases that was wrong. If you look at kernel hackathons like those on GPU MODE, the recent AMD hackathon, or papers like KernelBench, you'll see that agents are able to write valid and optimized CUDA kernels, which is really cool and something that totally inspires me. I'm part of GPU MODE, I contribute to it, and it's something I think everyone should be doing. However, what do we do with these kernels? How do we distribute them, and how do we get them into our inference engines so that we're actually using the optimized kernels we're generating? That's part of the question of this part of the talk.
Let's take a step back and say what a kernel is. When you run an AI model on a GPU, the actual work is executed through a kernel. This will be defined in a language relevant to that hardware, and it will use features of that hardware that may not be available elsewhere. We can write custom kernels that take advantage of that hardware for a specific math operation, squeezing everything we can out of it so that the model infers faster. In general, this requires a lot of expertise about writing CUDA kernels and about the hardware, and it's also a bit of an installation hell, as you deal with a pretty large install matrix across hardware, software, and generations and versions of, say, CUDA. So in short, it's hard.
On to efficiency in deep learning. Efficiency in kernels is split into three main sections: one, compute; two, memory; and three, overhead. Compute is the FLOPs: the matrix multiplications, the real math of the process. Memory is the time spent moving data, or tensors, around memory, typically from slow to fast memory. And overhead is basically everything else: the Python environment, PyTorch dispatching those kernels, these kinds of things. Most people might assume that compute is the bottleneck here because it's doing most of the math. That's not correct. In most cases, memory is the bottleneck, and that's because a modern GPU, let's take an H100 for example, can do a petaflop per second of computation, but its memory bandwidth is around three terabytes per second. So the GPU is often sitting idle, waiting for tensors to arrive so they can be computed. Custom optimized kernels do exist, flash attention being the poster child. In general, what they do is increase arithmetic intensity: they make the GPU do more sums per read and write. We move the tensors across, do as much math as possible on the GPU in one go, and then write the result back. In short, people like to say we keep the GPUs warm, and that's the objective of writing a custom CUDA kernel.
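The back-of-the-envelope arithmetic behind that claim, using the rough H100 numbers above:

```python
# Roofline "ridge point": how many FLOPs you must do per byte moved
# before compute, rather than memory, becomes the bottleneck.
compute = 1e15      # ~1 PFLOP/s of compute (rough H100 figure from the talk)
bandwidth = 3e12    # ~3 TB/s of memory bandwidth

ridge = compute / bandwidth
print(f"~{ridge:.0f} FLOPs per byte")   # ~333: below this, the GPU waits on memory
```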
Hugging Face has a library called kernels, which is maintained by kernel writers, and we're beginning to scale it up to agentic workloads. At its core, this is a way of distributing kernels. It has a TOML file, like any kind of project, which says which hardware it works on and which versions of CUDA and other software it requires, and a kernel is now also a repo on the Hub, just like models. So if you are a kernel writer, or an aspiring kernel writer with an agent you want to set up, you can now be a kernel publisher just like a model publisher. My point is that this is fertile ground for AI engineers looking to grow their careers. If you check out these repos on the Hub, you'll see there's compatibility information for different hardware, so you'll know, okay, this works on my GPU or on my laptop. And this is what it looks like.
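For example, loading and calling one of these Hub kernels might look roughly like this, assuming the library's `get_kernel` entry point and the `kernels-community/activation` example repo; check the kernels documentation for the exact API:

```python
# Sketch of pulling an optimized kernel from the Hub with the kernels library.
import torch
from kernels import get_kernel

activation = get_kernel("kernels-community/activation")  # downloads a kernel repo from the Hub

x = torch.randn(4, 1024, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)   # hardware-specific fast GeLU, if this kernel exposes it
```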
Let's take a look at what this looks like for an agent and how we're helping agents do this. First, how we do it: skills. I'm sure everyone here is familiar with skills, and I'm sure there have been a number of talks that go deep into them. I like to keep them pretty simple: really they're just file-based context, with all the wonders of files. We can open them and close them, we can version them, we can source control them, and agents can do the same. They can open them when they need them and leave them closed when they don't. In the context of kernels, that means we can give examples of how to write and how to use kernels in skills, and the agent can open those and use them when it needs to. I like to say it takes a task from being zero-shot to being few-shot, which in ML is quite a familiar concept: we're just giving the agent examples of how to do things, and we can be quite verbose and descriptive about that.
At Hugging Face, we're focusing on integrating skills into the projects themselves. What you'll find is that inside each project there are skills managed by that project, which we think is the best way to do this, because it means the maintainers of those projects are maintaining their skills. They're not the most yolo skills, because they're well-maintained and robust. And we have another repo for the more experimental skills, which is called Hugging Face skills. Go and check that out if you want to try some of the examples you'll see today.
In kernels, this is what the skill looks like. It focuses on benchmarking: it has scripts that allow you to benchmark and test the kernel and see how performant it is, and references with examples of how to do this. We benchmarked this skill: we generated a kernel for Qwen3 8B on an H100 and found a 94% speed-up. That isn't a state-of-the-art speed-up on this model by any means; it's really about compatibility and the compatibility matrix. In many cases these models and their kernels won't be optimized for the particular hardware, or generation of hardware, that you want to use them on. So there's some low-hanging fruit here where you can just come and pick up optimizations for that specific hardware, maybe because that hardware is cheap on your cloud provider but not necessarily the most ideal for the model you're using. My recommendation would be to come here and pick up some easy speed-ups.
How do we know these skills are any good, and that we should be sharing them and telling people to use them? We use an open-source library called upskill that we're also maintaining. It's a gateway to using cheaper and open models with skills: it generates skills, generates an eval for the skill, and then lets you compare different models on the same skill. So you can see things like: okay, GPT-OSS is slightly less accurate using the same tokens; Kimi is more accurate using fewer tokens; Haiku is a bit more accurate using fewer tokens; and so on. If you've got a skill you're using regularly and you're thinking, okay, how can I save a few pennies here and get a different model on the go, then try out upskill and it will let you iterate on your skill and improve it. Right,
let's move on to boss two. I'm going to go through this one pretty quickly; this is about fine-tuning models. If you're really into this, there was a talk yesterday by my colleague Merve that went into it deeply, and there's also a blog post where we got Claude to do this, from back in November, December time. Go and check it out. Basically, you can just say: fine-tune Qwen3 0.6B on this dataset. This is a chain-of-thought dataset, and you'll improve the model's chain of thought. It's fully integrated with the Hub now, so you can even run the GPUs on the Hub, and it uses HF CLI skills, so it's all very available. I would try this one out. You can also try this other one, which uses Unsloth, so it's even cheaper; it runs with optimized models, it's maintained by Unsloth and by us, it's another blog post, and there are often free credits you can get around these blog posts. So go and check these out.
Okay, let's move on to the big one. This is autolab, multi-agent research, a project that basically keeps me up at night. Andrej Karpathy, a few weeks or maybe a month ago now, released a project called auto research, which was based on his other projects nanoGPT and nanochat. It took the nanoGPT architecture and got Claude Code to write improvements to the training script so that it would improve the training process. You can see the experiments progressing here, and for each experiment there's a change in the training script which increases the efficiency of that run, measured in bits per byte, and the efficiency ends up at its best at the end of the process. I, like everyone, thought this was super cool and had to start implementing it straight away. But one of the things that stood out to me was that I found it kind of weird that we had one agent working in a single loop, iterating, finding improvements and then implementing them. It would make sense to distribute this. So that's what I did. I
distribute this. So that's what I did. I
distributed the task amongst a research
team with four types. We have a
researcher that basically looks up
papers. For this we use HF papers, but
we can also use archive papers. HF
papers is cool because it has a CLI, so
you can just pull and search papers from
the hub. and it acts as a lit literature
scout. So it just looks up for papers
with ideas and it formulates those as
hypothesis. We then have a planner which
takes those hypotheses and maintains
like a queue of jobs.
We then have a set of workers and they
pick up those hypotheses and their job
is to implement them as training
scripts. So in many cases just like
change the architecture or change a
parameter or something. And then we have
a repo a reporter agent that goes and
monitors all these jobs and maintains a
dashboard that we can use.
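As a sketch, the shared state those roles pass around could look something like this; the structure and the example entries are illustrative, not the repo's exact schema:

```python
# Illustrative hypothesis/job queue shared between planner, reviewer, and workers.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    id: str
    source_paper: str            # where the literature-scout researcher found the idea
    change: str                  # the single change a worker should implement
    status: str = "queued"       # queued -> running -> scored / rejected
    score: float | None = None   # e.g. bits per byte from the training run

QUEUE = [
    Hypothesis("exp-014", "paper-ref-placeholder", "swap LayerNorm for RMSNorm in the train script"),
    Hypothesis("exp-015", "paper-ref-placeholder", "raise warmup steps from 100 to 300"),
]
```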
So this is what it looks like. We're working in a git project, and we have a main branch that we maintain, with a train script that we update in each branch and a train-original that we keep. We also have a data structure on the main branch that we use just to keep the scores. We implemented this in opencode for this example, but in the repo, which you can also go and check out, it's also implemented in Codex and Claude if you want to try those. I also implemented it in Gastown, but that's kind of wild-west stuff, so I did that in a separate project. Basically it works anywhere, because it's more of a conceptual implementation: first you have your planner creating hypotheses, you have your researchers looking up papers, and then your reporter picking all of this up and handing it to workers. As I said, those workers integrate with HF Jobs, so they start jobs on the Hub that run with the hardware they need, and then they submit patches that come back. The reporter operates in Trackio, which is an open-source dashboard that we use for all metrics. Trackio is useful with agents because it uses a completely open data layer, basically Parquet, so if you don't want the dashboard, or your agent doesn't want the dashboard for any reason, it can just get into the Parquet and do whatever you want. If you need a Gantt chart or some other visualization, it can just go and do that. So I would say it's the best agent dashboard tool, because it's basically just a data store, just a data structure.
Okay, let's walk through this. This is it implemented in opencode. If you don't know opencode, you have agent configurations; in this one I just set up autolab, which is the name of my agent configuration, and it has skills. This was the prompt. It says: run one autonomous auto-research pass in the repo using the defined roles. I tell it to use the planner to propose up to two fresh single-change experiments, and to use the reviewer to reject duplicates or stale ideas. I also tell it to use an HF bucket, because I want all of the storage in the same bucket so I don't have to upload or download the training scripts every time. Then we select one of the subagents, which is a nice little interface in opencode but similar in other tools. I select the planner, and you'll see that the planner receives this prompt and uses a specific template, which I defined in my configuration: it has the current state, a list of the jobs so far, things that have worked (as defined by the reviewer), and the current hyperparameters it can change, and it basically just defines these jobs, which go onto the job list as I mentioned. We then switch over to a reviewer agent, which receives all of these jobs. It has a similar structure based on a template, a reference to where it should be working from, and the latest score it should be using. It gets an overview of all the failed and successful experiments, which it uses to decide what goes into the next queue, and it creates this little table, which we don't really need to look at; it's really just for the agents to interact with each other and pass this information back. To be honest, that's a bit of a verbose example, we maybe don't need this many tables, and you could probably trim that bit down.
But in general, I'd recommend that if
you think this is cool, go and try that
out in the repo.
After that, the agents run, sometimes for hours, and this is the Trackio dashboard that we use; these are all the runs that are pushed to Trackio. As I said, the main advantage here is that it's fully open source and it's just a data layer, but we get all of these kinds of visualizations. Trackio can also have events and warnings, so we can have all of these events being reported by different agents and filter them down. We can even tie those up to notifications, so you can get emails from Trackio if your agents are going rogue or something and you need help. But best of all, it has this free-form structure, so you can just throw in tables that don't necessarily fit with any other structure.
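Reporting a run to Trackio could look roughly like this, assuming its wandb-style init/log/finish API; the project name and metric values are made up, so check the Trackio docs for specifics:

```python
# Sketch: an agent logging a training run's metrics to Trackio.
import trackio

training_curve = [1.05, 0.98, 0.93]   # hypothetical bits-per-byte values from a run

trackio.init(project="autolab", name="exp-014-rmsnorm")
for step, bpb in enumerate(training_curve):
    trackio.log({"step": step, "bits_per_byte": bpb})
trackio.finish()
```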
And then on the Hub side, all of these jobs just run inside Hugging Face, so you can explore those jobs. In most cases you can tell the agents to use labels, and you can sort on those labels and review what they're doing. Or you can just look at it like this: as I mentioned, you can access that underlying data layer and create a Gantt chart, because that was a convenient way to look at what the agents were doing over time. You can see this amber agent went off and this was the score it got. But you could visualize this however you want, because you have access to this data lake.
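Going straight at the open data layer might look something like this; the Parquet path and column names are assumptions for illustration:

```python
# Sketch: read the logged runs straight from Parquet and build a timeline
# (enough to draw a simple Gantt-style view of which agent ran when).
import pandas as pd

df = pd.read_parquet("trackio_data/autolab.parquet")   # hypothetical path to the logged runs
timeline = (df.groupby("run")["timestamp"]
              .agg(start="min", end="max")
              .sort_values("start"))
print(timeline)
```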
The TL;DR of the whole thing is that you can go and have your own AI lab, you can try it out, and if you have a verifiable experiment, like training a model or writing CUDA kernels, then it's pretty easy to implement, set up, and learn some stuff. So let's look at the takeaways. In simple terms, I'd say that agents work really well with primitives, open primitives, and we want tools that are fully open, things like Trackio and kernels, that we can expose to agents so they can control them in their own way. Even though abstracted APIs are really useful, if we have a layer we can't get behind, that's a ceiling. So we don't always need to abstract; it's more about exposing well. The other takeaway is that the Hub is ready, the Hugging Face Hub is ready for these kinds of workloads. We have the fundamentals in place, like storage, tracking, and compute, which I think will allow us to scale our engineering to new levels. If you found any of this interesting, I've shared it all on X and on Hugging Face, and there's a blog post about basically every one of the examples I just shared with you. They all have repos attached, so you can go and try them out for yourself. If you find anything that's broken, please tell me off. If you think this was completely wrong, come and find me afterwards and bully me; that's fine. But most of all, thank you folks.
Can we get another big round of applause for Ben, please? Thank you. I love that: agents should do more than coding, right? Folks, while our next speaker gets set up, I want a quick show of hands from those of you who are staying here. There's a new thing that happens lately where, when you're about to go to sleep, you're like, "Oh [ __ ], my agent is not going to work throughout the night," and there's a little bit of stress. Anybody here have that little bit of stress? Okay, cool. So our next speaker is going to tell you about FOMAT. FOMAT is a very specific thing that defines this as a category. Folks, please welcome Michael Richmond.
>> Thank you.
Thank you, everybody. My name is Michael Richmond. I lead several teams at Bit.ly, the link shortener. But today I'm here to talk to you about FOMAT: fear of missing agent time. You know FOMO, fear of missing out. You also know FOMAT, you just didn't have a name for it. What is fear of missing agent time? It's being out on a walk, having an idea that you want to task your agent with, and having to wait to get back to your dev machine to actually do it. It's when you get up from your desk with an agent chugging away, and you come back 30 minutes later and realize that after two minutes it actually stopped to ask you a question, and it's been waiting there, blocked, the entire time.
We all want to believe that our agents are low touch and high autonomy, but we all know the truth, right? It's back and forth. It's babysitting, and you cannot predict when you're going to be needed for input. Right now, coding tasks might typically take anywhere between five and 45 minutes, and you kind of know when to check back. There isn't that much time spent in agent idleness. But that window is only going to get longer, and the longer the agent waits for you, the more agent time you have missed. If a task is running for 5 hours, or eventually 5 days, you can't just check back in a bit. You need to know when it needs you and you need to know when it's done. And that might be wherever you are, whenever; you can't predict it. You may not be at your dev machine.
Did I go back? I went back. Sorry. So, once again, my name is Michael Richmond. I run several engineering teams at Bit.ly. I also co-lead our AI coding tools strategy. I'm a really hands-on engineering leader: I run teams and I also write code. I co-wrote the Bit.ly MCP server, and I train our engineers on AI skills and best practices. I think about tools a lot. The tools that we use every day, how they work, whether they are effective for our workflows or not. And agentic coding has really changed the
world of software in the last year and
the road is very much being paved as we
are driving on it. So I built a system
called command and control in order to
help me work with coding agents outside
of the terminal or the IDE because I
really needed it and nothing existed to
solve it yet. Anthropic recently
released some ways to address this with
remote control and uh the teleportation
mechanism and I think it was just two
days ago cursor came out with a solution
in the space. So I wrote command and
control. One of the things that is nice
about command and control, and I'll show
you it in a minute, is it is a way to
get all of your coding agents in one
place uh on your mobile device or really
anywhere. So, this is what my setup
looks like. I have multiple terminal
windows. Each of these windows has
multiple tabs. Here's Claude Code. Here's what Codex would look like. Here's Gemini. Here's the Cursor IDE.
And at any given moment, I might have
multiple sessions running, multiple
agents across all of these in various
states of completion. And here's the
thing. I don't know about you, but I
cannot keep track of more than two or
three sessions at a time. As soon as I get to four or five, I don't know what session two is doing anymore. I don't know which one needs my attention, and I have no idea what states of completion things are in. So, how do you know when an agent is stuck and needs a decision from you? How do you know to check in on an agent mid-task, only to find out that it's gone off the rails?
What if you want to start a new session
and you're not at your dev machine?
That's exactly why I built this system.
So, this is what it looks like on an
iPhone app. It's on Android. It's on the
web. And it lets you monitor and
interact with agent sessions and launch
new sessions from anywhere from your
phone, from the web, even from your
watch.
I have a couple of video demos that I'm
going to play and show you a few of the
features of it. And the point here
really is like this is a solution to a
workflow problem. One solution we saw
this morning uh what was it called?
Agent Craft, which is the gaming version
uh of something similar, which I thought
was awesome. So, here's the first demo
slide.
>> Okay, here I am in the terminal. I'm going to start a Claude Code session, and I'm going to issue a command that's pretty run-of-the-mill. I'm not actually exercising Claude Code functionality; I just want to demonstrate command and control. So, if I come over here to my
phone simulator, you'll see that I've
got a command and control app over here.
And what I've got here is my sessions.
This is all of my sessions grouped by
ones I want notifications for and ones
that I want to keep my eye on and recent
ones. And here you'll see the one we
just issued here 27 seconds ago. Let's
get out of this session over here. And
what you'll see here is the response I
got from the agent is the same one
because it is the same session. Now the
beauty of command and control here is
that let's say I want to subscribe to
this particular session for
notifications and I'm going to issue a
response to this one. I'm going to say: sleep for one second and say hello. And as soon as I hit this, I'm going to leave the agent working. I left the app, because what I want to demonstrate here is the push notification that I'm going to get once it actually completes.
And there it is. So I can click right
into that guy and I got the hello. Now
if I come back over to my session and I
resume this guy,
my
response is right there with the agent
response as well. The beauty of this is
that I don't have to stay in my
terminal. I can walk away with my phone,
get notified when there are answers, and
respond right from there.
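To make that flow concrete, here is a small, hypothetical sketch of what subscribing to a session and replying to it from a phone or web client could look like against an assumed control-plane REST API; the endpoint paths and fields are illustrative only, not command and control's actual API.

```python
# Hypothetical sketch of the subscribe-and-reply flow against an assumed
# control-plane REST API; endpoints and payload fields are illustrative only.
import requests

BASE = "https://control-plane.example.com/api"   # assumed endpoint
SESSION_ID = "abc123"                             # an existing agent session

# 1. Subscribe to push notifications for this session.
requests.post(f"{BASE}/sessions/{SESSION_ID}/subscribe",
              json={"channel": "push"}, timeout=10).raise_for_status()

# 2. Send a follow-up prompt to the running agent from the phone/web client.
requests.post(f"{BASE}/sessions/{SESSION_ID}/messages",
              json={"role": "user", "content": "sleep for one second and say hello"},
              timeout=10).raise_for_status()

# 3. Later, e.g. after a push notification arrives, fetch the latest reply.
print(requests.get(f"{BASE}/sessions/{SESSION_ID}/messages",
                   params={"limit": 1}, timeout=10).json())
```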
So, that's one basic feature of command
and control, interacting with sessions
from anywhere. Let's see another one.
This is starting a new session in the
mobile app.
I was going to do a live demo, but after I recorded these video backups I thought, I'm just going to show the videos. So, here's the next one.
I want to demonstrate another important
feature of command and control, and that
is that I can start a new session right
from the command and control UI. I can
pick any configured agent that I've got here. I'm going to stick with Claude Code, but you can see I've got Codex and GitHub Copilot and Cursor here. I'm going to stick with Claude Code, and I'm going to switch my directory to a testing directory. And I'm going to just ask it: what time is it?
Now, the reason I'm showing you a pretty
simple prompt, the prompt doesn't matter
because the point is that I'm issuing a
prompt to my agent and I'm getting a
response. Now, you can see I'm working
late at night here. And the beauty is I
can go over here and resume the session.
And here's my session right here that I
started from command and control. I can
go into it here and pick up where I left
off. And if I issue a command here, what
is the date tomorrow? Let's say
what I will see over here is the same
conversation. And that is the beauty of
command and control. I often start my day with prompts that I issue from bed, honestly, and start up a bunch of sessions, and then resume them either in the CLI or continue them right on my phone.
I hope you're starting to see the power
of command and control here. And like I
said, this applies to any of the coding
agents that I've got configured on my
machine.
Now to the problem of keeping track of sessions. This third video, and then I'll talk a little bit more about the importance of this, is about session management.
Now I mentioned how hard it is to keep
track of all the sessions you've got
going. I mean if you just look at the
number of sessions I've got here um
there are a lot and I wanted to revisit
the different sections that are
available in command and control. So, as
I mentioned, you can subscribe for push
notifications. That's this top section
here. The "on my radar" section is ones that I just want to keep my eye on, but where I might not want it to be as chatty as push notifications for every message. And then there's a recent section here, which is basically the last 24 hours. And then the rest. Now,
you can see there are thousands of
sessions here. I couldn't possibly keep
track of them all, but here they all are
organized for me. Another useful feature
in command and control is what is
referred to as the overview dashboard.
One of the really nice features of this
one is you can get a kind of standup
summary of the most recent sessions. And
this is just using the last several
messages to kind of give you an overview
of what's been going on for each of
those.
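As a hedged sketch of how such a standup summary could be produced: take the last few messages of each recent session and condense them into one line per session. The data shape and the summarize() helper below are placeholders for illustration, not the product's actual implementation (which presumably uses an LLM for the summarizing step).

```python
# Hypothetical sketch of a per-session "standup summary" built from the last
# few messages of each session; summarize() is a stand-in for an LLM call.
from typing import Dict, List

def summarize(text: str, max_words: int = 25) -> str:
    # Placeholder: the real system would ask a model to summarize; we truncate.
    words = text.split()
    return " ".join(words[:max_words]) + ("…" if len(words) > max_words else "")

def standup_overview(sessions: Dict[str, List[str]], last_n: int = 5) -> Dict[str, str]:
    """Return {session_id: one-line summary of its most recent messages}."""
    return {sid: summarize(" ".join(msgs[-last_n:])) for sid, msgs in sessions.items()}

if __name__ == "__main__":
    demo = {"frontend-fix": ["added tests", "tests pass", "waiting on review"],
            "docs-agent": ["drafted README", "opened a draft PR"]}
    print(standup_overview(demo))
```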
So I hope you can see the power here
and I think this is the kind of thing
that we need in our new world of agent
orchestration.
This is a number of things. This is
interacting with your agent sessions or
starting new sessions from wherever you
are. It is session management. It is
notifications so you know when your
agent needs you and you don't have to
guess. And it's also all of the coding
agents that you might be using in one
place. It's all those things from
wherever you are.
A little bit about the architecture of this, which you might be curious about. Each agent platform, Claude Code, Cursor, Codex, Gemini, OpenCode, has a command and control daemon that runs alongside it. The daemons talk to a control plane layer: they monitor the life cycle of the agent, and when things change, it's blocked, it needs your help, they communicate up to the control plane layer. The UI then talks to that API layer and notifies you of things. The control plane aggregates all of the agents regardless of where they're running or what framework they're running on. So this could be your dev machine, this could be a cloud VM, or it could be both. And this is an important point that I want to emphasize: I needed a system that is a single pane of glass into all of my agent sessions, regardless of which platform they're running on and regardless of what machine they are on. I might have Claude Code running on my Mac and Codex CLI running in a cloud VM, and all of the sessions from both of those are available via a single UI. And like I said, it's coding-tool agnostic by design; it works with almost all of them. I will also add that the daemon layer is open source, so you could plug it into any of the agent frameworks you're working on and then access it through this single UI today. So whatever your agent, you can reach it and it can reach you.
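To make the daemon-to-control-plane idea concrete, here is a minimal, hypothetical Python sketch of what such a daemon could do: watch one agent session's state and report lifecycle changes upward so the UI can be notified. The state detection and the endpoint are placeholders; the actual open-source daemon will differ in its details.

```python
# Hypothetical daemon sketch: watch one agent session and report lifecycle
# changes to an assumed control-plane endpoint; details are illustrative only.
import time
import requests

CONTROL_PLANE = "https://control-plane.example.com/api/events"  # assumed

def read_agent_state(session_id: str) -> str:
    # Placeholder: in practice this would inspect the agent process/transcript
    # (Claude Code, Codex CLI, Gemini, ...) to classify its current state.
    return "running"  # one of: running | blocked_on_user | done | failed

def run_daemon(session_id: str, poll_seconds: int = 5) -> None:
    last_state = None
    while True:
        state = read_agent_state(session_id)
        if state != last_state:
            # The control plane aggregates these events across machines and
            # frameworks, and pushes notifications to the mobile/web UI.
            requests.post(CONTROL_PLANE,
                          json={"session": session_id, "state": state},
                          timeout=10)
            last_state = state
        if state in ("done", "failed"):
            break
        time.sleep(poll_seconds)
```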
And this brings me to the point that we all want our agents to be maximally autonomous, right? And the honest truth is they are not yet, as much as we hype it, and especially not if you're limited to interacting with them in their native environments. This is one solution that addresses that problem.
And I want to talk about a little bit of
a broader concept here that this is
related to and that is that just in the
last year the agentic coding workflow
has completely transformed what software
development is.
It has transformed
how we work and it has also transformed
how we enjoy how we work. So you know
the old the concept of flow which is
like being totally in the zone locked
into your code hyperfocused on a single
thing
you and your code solving a problem. And
I think that the new type of flow in the
agentic world is more about agent
choreography.
Multiple agents working
in parallel with you moving between
them, unblocking one, redirecting
another. And the new flow comes from the
elegance of that choreography
and the results on the other side.
So then maybe that some of that fear of
missing agent time can be alleviated.
Another thing I really want to note here
is that this paradigm of the always
available agent is actually highlighting
the value and importance of time away
from your agents. Matt PCO alluded to this yesterday. I don't know if you heard the Simon Willison interview on Lenny's podcast last week.
The cognitive load of managing multiple
agent sessions is really high and it is
exhausting. As you all probably are
aware,
you need a break. And here's the thing.
It is in those breaks that we often have
our best ideas.
We need systems that make it possible to
reach our agents during those breaks
wherever we might be and whenever that
happens. And that is how we truly
alleviate the fear of missing agent
time. Once again, my name is Michael
Richmond. I'd love to hear about your
workflows, your hacks, the pain points
that you encounter. I invite you to give
command and control a try if you're
interested. Here's where to find it
online and here's where to find me on
LinkedIn. I'd love to connect and
continue the conversation. Thank you.
Folks, can we get a folks? Can we get a
big round of applause for Michael?
Uh, command and control. Folks, I don't know how many of you stopped sleeping, but I definitely felt the same when my agents are running. A little bit more. A little bit more. Hopefully command and control is the way to regain some sleep time. Folks, I also definitely have FOMO. I don't think my OpenClaw is clanking right now; I need to go backstage and make it clank. Our last speaker for this block is bringing us back to where a lot of us usually are day-to-day. He's going to talk about Copilot agents, sorry, GitHub Copilot agents, in VS Code. He's a cloud advocate at Microsoft based here in London. He organizes the London React meetups, and he's gonna cook with agents.
Yeah. In VS Code, in 2026. Copilot for me is where it started. I had a little brief thing about whether or not developers will survive Copilot and, yeah, we're still here. So that's great. Please welcome Liam Hampton.
>> Super.
>> Hello everybody. It's great to see so many of you who are still here on the final day, right at the end.
So, I hope you all had a great
conference. Uh, show of hands. Who here
uses GitHub Copilot?
Awesome. Lovely stuff. Uh, who here uses
VS Code with GitHub Copilot? Awesome.
So, I'm going to be talking about both
these things today. I'm going to be
talking about cooking with agents in VS
Code. Now, the gentleman before was speaking about the cognitive load of agents, and that is absolutely correct. You see so many different things now with agents, and they're popping up all over the place: from the CLI, in the terminals, in chat windows, in other editors, etc. But we still somehow seem to find ourselves in this paradigm where everybody thinks agents can solve the world's problems. You still see developers, and I still speak to folks, who think we can do one-shot prompts that will create a wonderful application or solve all of their issues in one go. That's not really the case, and we end up asking these questions from a business perspective: what's the ROI, what's the productivity boost, where are we seeing our money? At the moment we're seeing this whole expenditure on AI and all of this infrastructure, all of these toolings and services, and we're still yet to really reap the benefits of those services.
So when we look at how people are
spending and how businesses are looking
at AI, we really need to be very careful
with how we're utilizing the tools and
services. We need to be careful about
token spend. We need to be understanding
the tools and the flexibility. I read
somewhere yesterday on LinkedIn that somebody has released this repo, and it's growing massively in popularity. It makes your chatbot, your AI services and language models, talk back like a pirate, because it reduces the token spend. People are coming up with these really intuitive and really fun ways to get around token expenditure and really pull in those benefits very quickly. So what I'm going to be talking about is GitHub Copilot agents. Now, this doesn't just apply to GitHub Copilot; this also applies to other AI agents as well. So, when we're looking at Copilot agents: around context, what they really have access to in your workspaces, how they're being used and utilized from within VS Code and the CLI, we're going to be looking at all of those things very shortly.
So just a plain and simple, what kind of
agents do we have at the moment? Now
we're looking at local agents. We've got
local agents which are in VS Code. You
may use Claude. You may use all these
other AI services still applicable,
still running on your local machine with
remote models. Anything that you're
really using, maybe you're using locally
hosted ones as well. But this is a way
to have local models interacting with
you side by side, very hands-on, very
much in the context and human in the
loop. Then you've got background agents. Now, we use the GitHub Copilot CLI; we have also got access to that within VS Code, but this is a more isolated way to be using them. Now, we are actually using git worktrees. Show of hands if you know what a git worktree is and who uses them. Awesome. Wonderful. For those who don't know, an easy way to explain it is that it is a branch that is mapped to an isolated folder within the workspace that you're working in, like a subdirectory, just a chop of your code with its own little branch associated to it. Very similar to a git branch in general (there's a short sketch of this right after this section). Then you've got cloud agents. Now, cloud agents are quite an interesting one, because they allow you to scale outside of your organization very quickly and utilize a lot of the power of, I guess, the cloud and some of the services that we're using in GitHub. We use these when we don't want to be touching it ourselves. I use this when it comes to writing documentation, or for having less of a hands-on approach.
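Here is the promised sketch of the git worktree idea: each background agent gets its own branch checked out into an isolated folder, so it can work without touching your main working copy. The commands are standard git worktree usage; the branch name and paths are just illustrative, and driving it from Python is only for consistency with the other sketches here.

```python
# Sketch of the git worktree idea: check a new branch out into its own folder
# so a background agent can work there in isolation. Paths/names are examples.
import subprocess

def add_worktree(repo_dir: str, branch: str, path: str) -> None:
    """Create a new branch and check it out into a separate directory."""
    subprocess.run(["git", "-C", repo_dir, "worktree", "add", "-b", branch, path],
                   check=True)

def remove_worktree(repo_dir: str, path: str) -> None:
    """Remove the isolated checkout once the work is merged or discarded."""
    subprocess.run(["git", "-C", repo_dir, "worktree", "remove", path], check=True)

if __name__ == "__main__":
    add_worktree(".", "agent/frontend-ui", "../frontend-ui-worktree")
    # ... let a background agent do its work inside ../frontend-ui-worktree ...
    # remove_worktree(".", "../frontend-ui-worktree")
```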
So when we when would we use a local
agent? Well, I'd use a local agent when
it comes to writing tests. I want to be
really hands-on with my tests. I want to
understand what's going on in the
codebase. I really want to be in there
in the weeds. When would I use a
background agent? Well, a background
agent would be great if I want to be
sort of a 50/50. I want to create a UI
for a front end of an application. I
kind of want to know what's going on. I
don't really want to hand it off to a
cloud agent because I don't want to be
fully out of the loop. But I also also
don't want to really be hands-on to and
fro myself because that can take time.
That can be quite arduous. That can be quite annoying. So I would use a background agent, and I'm going to show you how I'm using autopilot to do exactly that with GitHub Copilot in
just a moment. When would I use a cloud
agent? Well, I would use that mostly for
documentation. I hate documentation. I
don't like writing it. I don't think
many people do unless you're a content
developer. uh I really just pawn that
off to the cloud agents and that could
be making a repository open source
friendly. It could be writing a readme
using some skills to do that as well. So
what I'm really looking at is VS Code as
a single entry point for AI agents. We
have got third party support, we've got
background, we've got local and we've
got remote entry points for all of these
agents. So ultimately what we're trying to do is understand where you are sitting as a developer and how easy we can make it for you to use these agents to reduce that cognitive load.
Seems quite complicated but is actually
really straightforward. So I'm going to
show a video now. I was going to do this
live but I don't really think I'm going
to have time to do all of this live. So
I'm going to whiz through this video.
So I'm going to start with a very simple
Python application. This is just a CRUD.
Well, create, read, update, and delete.
Just a very simple product store. Not
very pretty, not very good. As you can
see, pretty straightforward. What I
actually want to do is create a
front-end UI for it. So, I've got a
ticket up in GitHub, and I'm saying,
"Hey, this is wonderful. Go and add a
front end. We need some more prettiness
here. We need it to look good." So, I'm
going to say, "Summarize and plan a
solution to issue 25." Now, you'll
notice I'm actually using a CLI
background agent at this point. Now, I'm
using that because I want it to be sort
of hands-on, hands off, a little bit of
understanding what it's doing. Also,
don't really care if it messes things
up. It can go and iterate. I'm also
going to be using autopilot. Now,
autopilot is currently in preview. And
this just means it's not going to ask me
a bunch of questions if it wants to do a
bunch of tool calls. Great. Wonderful.
Can be very dangerous.
Use that at your peril, right? Don't just abuse that one. But I'm using it
here to create a plan. I don't want it
to ask me every single time I want it to
do an MCP call. So I'm then saying,
"Wonderful. Here's the plan. Now start
it. But before you create a pull
request, because on autopilot it will do
a pull request, stop and pause and let
me test locally.
Whilst that is off doing its lovely
stuff, I can then move on to my next
stage where I'm going to be using
another kind of agent. So, I'm just
going to go and leave that one behind.
Let's go and spin up a new chat and
let's go and start a cloud agent. I've
noticed that this is not a very open source friendly repository. I want this to have a readme, contribution guidelines, all these files that I really want as an open source project. So, I'm going to pawn it off and say, "Hey, go and make this open source friendly. Add all the necessary files for it." I don't really
care. Now, as a developer, I can go into
my codebase and start poking around.
I've noticed that I don't have any
tests. So, I'm going to go check out.
And I've noticed there is a custom agent
available for me in VS Code. This custom
agent is just essentially explaining and
showing how to be using or how to write
test cases for this Python application.
So, what I can do now is start spinning
up a local agent. So, just like that, at
the very bottom, I can click local. I'm
going to select Claude Opus 46. I'm
going to have medium reasoning. I want
it to be kind of fast. I don't really
care for it too much. It's got a great
understanding in this custom agent. Go
and write some unit tests.
Now, as a developer, I can still skim
through. I've got very much a hands-on
to and fro with a local agent. I've got
a remote agent doing some work for me,
and I've got a background agent creating
a new front end. So, here I can see,
right, it's written some tests. It's
going to go ahead and try and run them. It's passing the tests, but I've also noticed that there are some other problems in the code. It's not very friendly; the errors that are coming back are not wonderful. So, I'm going to say: go and update the error handling on the routes and update the tests as well. So, you can see I've got a lot of to and fro with this local agent. I've got my
remote agent working and I've got my
background agent working all
simultaneously.
So, whilst that's going off and working,
I can go and check out what my other
agents are actually up to.
So, as it's working through this, you
can see Copilot is just going to be skimming through. I didn't actually speed up this video; this is all pretty quick. I did this pretty quickly with these agents, but there you
go. You can see some of the code is
updated. We've got the new test. We've
got the code updated. Let's go and check
out the background agent that has now
finished, which is cool. Let's go and
check on the remote agent. How's the
remote agent getting on?
Well, actually, this is the this is the
test. Run the test. The tests have
passed. That is the local agent. That's
now finished. Now I can go and check out
my remote agent.
So as we're walking through this, we can
see as a developer, I've got very much
hands-on, hands off. I'm working with
multiple agents simultaneously. We can
see where they're running all within
this single context of VS Code. Now, if
I go and look on the pull request
extension in VS Code, we can see that
I've now got a pull request. And this is
one that I previously run earlier. The
one that's running in the chat is
actually taking quite a while, but the
principle still stands. It's running all
these different agents at the same time.
So, all I really want to do now is go
and check out my background agent. I
want to go see it working. I want to go
see this new front end that I've just
created. Now, I asked it to pause before
I pushed over a pull request and tell me
how to test it. So, I'm going to say,
well, actually, the way you're telling
me how to test that is wrong. So, I've
still got hands-on, hands-off; this is more of a 50/50. I'm saying: this is working in a git worktree, how do I actually run this? Now, remember, this is what it
currently looks like as an application.
So, I'm going to go check out the new
directory, which is a git worktree. I'm
then going to run this Python
application. You'll see a very drastic
change between what is created in my
single directory versus what I currently
have. There is a port conflict here. Uh,
so hurry up and run that.
There we are.
So like that. This is the new product
demo. Now this is the third agent that
I'm running simultaneously. And that is
essentially a great way of how you're
using different agents within one
context to kick it all off using GitHub
Copilot. That's new error checking. And
that is how I've been using multiple
agents. So one codebase, three problems,
three separate agents fixed all at the
same time. The local agent was writing
my test for me because I wanted
hands-on. I really wanted human in the
loop. I use my background agent to write
the front end because I don't really
care what it does. It's quite an arduous task. It's quite big. It's quite time-consuming.
And then I use my cloud agent to write
my documentation for me. So all in all,
that's a pretty successful run. So how are these cloud-powered agents actually working? Because I get this question quite a lot. How much is it going to cost? How does it work? What are they doing? And how do I get them running? Well, they're actually running in GitHub Actions. They're pretty safe and secure because they're running in an isolated environment. They have got extended context through MCP servers. Who here uses MCP servers? Just out of curiosity. Awesome. So, the cloud agent actually has access to the GitHub MCP server and the Playwright MCP server. So, you can do
testing with screenshots, you can do
automated frontend testing, and you can
obviously write your workflows. You've
got um the the dynamic workflows now and
it has got built-in safeguards. So
you've got network firewalls. You don't
want this agent talking to a whole bunch
of different things. It is absolutely
whitelisted and restricted. It also
doesn't have access to your main branch.
Therefore, you're not able to push
directly to your main branches. It's
very much restricted in that sense. So
it is very safe to use. Now I mentioned
earlier this is very much of co-pilot.
But it is not just good copilot that
this applies to. This actually uses all
the same concepts across all different
AI agents uh that you can use. So custom
instructions very much defining how the
agent is running. You've got custom
agents which is what I showed you today
in that short demo where you're able to
use very specific agents to tackle
certain problems i.e. fixing test cases
or writing test cases. You have prompt
files which will help you with your
prompting and agent skills. Agent skills
is more of like the the newer version of
agents. MD. there's always like a new
thing that's coming on every single week
now. So all of this is actually
applicable to get copilot as well as
other AI services too.
Now inside VS Code, this is a modal
which is very recent and I can jump out
of the slides in just a moment to show
you exactly um how this looks.
So, if I was to go over to VS Code, open
up my GitHub copilot chat pane, and if I
click the cog up here, you can actually
see everything that I have in one user
space for you to customize the chat and
the agents that you are running. So,
whether you've got agents, I've got my
custom test agent, I've got my built-in
agents, which is ask, explore, and plan.
I've got some skills. So, this is
essentially what some of the VS Code
team sort of preempts you to be using
here. So, you got some extensions, you
got address a PR, uh comments, you got
create a pull request. You can jump into
these and edit them as you wish. And
this is just intuitive skills that we
have popped in there for you. I don't
have any instructions, but this is where
you'd have your instructions file, your
prompts, if you've got any built-in
prompts like creating an agent, just
different prompts that can then go off
and kick off skills. You've got hooks. I
don't have any hooks on this one, but a
very good example if you wanted to
create or configure some hooks, you can
do so with uh Copilot inside VS Code and
any MCP servers as well. So you kind of
have this whole control plane in this
modal which allows you to control your
agents and chat customizations from
within one single place. This isn't just confined to Copilot, either; we have third party support as well. So there is Claude down here, and you can have access to all of your Claude things: all your plugins, hooks, instructions and skills for Claude too. So it's not just restricted to VS Code and GitHub Copilot.
So if you want to get hands-on with some of the skills or customizations, we have got this awesome open source project which we're running. It's called Awesome Copilot; it's at aka.ms/awesome-copilot. Like I said, this is directed at Copilot, but it's absolutely not just for Copilot. You can take these, massage them, and use them for other AI tooling as well, because we do know that people in the community use more than just Microsoft things and GitHub things.
We also have an MCP server. So if anybody's interested in utilizing this from their workflows, from an MCP standpoint, we have also encapsulated that into an MCP server. For those who don't know about MCP: the Model Context Protocol is a great way for you to get hands-on and extend the LLMs that you're working with, or any of the chat customizations that you have. For example, if you want to talk to your Azure resources, or, I don't know, GCP, AWS, etc., you can go through an MCP server. You'll obviously be locked down by authentication, but there are also free and open ones which don't require authentication, like Playwright and documentation ones, i.e. Microsoft Learn, and so on and so forth.
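As a hedged illustration of how small an MCP server can be, here is a minimal sketch assuming the official MCP Python SDK (the `mcp` package) and its FastMCP helper; the tool itself is a toy stand-in for something like a documentation lookup.

```python
# Minimal MCP server sketch, assuming the official MCP Python SDK's FastMCP
# helper; the lookup_doc tool is a toy example, not a real documentation source.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-helper")

@mcp.tool()
def lookup_doc(topic: str) -> str:
    """Return a (stubbed) documentation snippet for the given topic."""
    docs = {"worktree": "git worktree checks a branch out into its own folder."}
    return docs.get(topic, f"No local notes on '{topic}'.")

if __name__ == "__main__":
    # Runs over stdio so an MCP-capable client (e.g. an editor agent) can attach.
    mcp.run()
```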
So, just in time, as a wrap-up: Visual Studio Code is a single entry point for AI agents, and we're really building this agentic workflow around multiple different services. We've got third party plugins, we've got first party plugins, and we've got full spec support for MCP. We've got chat customizations, and you can connect to GitHub Copilot CLI sessions through VS Code. So it's all in one single sequence for you as a developer, inside your workflow.
I'd love to hear more about your
workflows and what you're using and the
agents and how you're using them after
the session because I believe I've only
got just less than a minute left. So,
thank you ever so much for listening and
thank you very much for coming today.
Folks, can we get uh last round of
applause for Liam, please? And keep this
going for all three of our speakers. We
got Ben, we got Michael, and now we got
Liam as well. Thank you all so much for
coming to the um to the Agentic Tools
track. And now the best track in the
world, the hallway track. You guys need
to be here at 4:30 to start filling up
those seats for the last event. By the way, if you put a QR code in front of people, people scan the QR code. So, this is my podcast, called ThursdAI. We've been
covering AI engineer since the first one
in 2023. Uh, I think I saw some of you
who were there in 2023. So, that's
great. We had a two-hour live show with many of the speakers, and immediately after this I'm going to interview a bunch more. So if you are interested in more conversations, in a little bit more detail, please feel free to follow me and come chat with me; the hallway track starts now. This is a wrap on the coding agents block. Thank you guys.
echoes in eternity.
Heat.
Hey, heat. Hey, heat.
Heat. Heat.
Heat.
Heat. Heat.
Fear is the mind killer.
fear is the mind killer. Ah,
feel
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat.
Heat. Heat. Heat.
Heat.
Heat.
Free your mind.
Free your mind.
Heat. Heat.
Heat.
Heat.
Free your mind.
You are who you choose to be.
Heat. Heat.
Execute the vision.
Heat. Heat.
Heat.
Heat.
Heat.
Heat. Heat.
Heat.
Heat.
Make the requirements less dumb.
Delete the part or process.
Simplify and optimize.
Accelerate cycle time.
Automate.
Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Never give in. Never give up. Outlast.
Out compete.
Persevere. Persevere. Persevere.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
A new age has come.
Hold still.
Let it
watch the sparks all burn too fast.
Everyone reaching for the flash.
They take the first light they can find
and call it truth and call it mine.
But I stayed when the room went quiet.
When the noise fell out of face,
sat with the weight of the question
while the easy answers walked
away.
It's not that I see further. I just
don't leave soon. I let the silence
sharpen. I let the dark grow.
I stay the almost right past the
comfortable light.
I stay.
I wait till the surface breaks, till the
shade feels true inside.
I don't rush the fire.
I give it to
I
call it done. Call it enough.
But there's a deeper know still huming
underneath the fear of not being love.
Every great thing as for patience
every
choose.
Do you leave with what's acceptable or
stay for what's asking more of you?
They say it's talent, say it's magic
like it falls from open,
but nothing worth remembering
arrives on the first try.
I say when it stops feeling kind, when
it stops feeling fast.
I wait through the restless doubt
through the urge to collapse.
Hide by and chase the answer. I let it
find me back.
There's a moment after the last good
idea dies.
Where the room feels empty and you want
to run for your life. That's the door.
But he teaches you to open. That's the
edge where the real stand.
Hold the light.
Hold away.
Let the shape reveal it.
I stay longer than I should. Long enough
to change.
I stay.
I wait till the pattern clears. So
signal breaks the haze.
I do boring. I
with time.
Most dreams
don't fail.
They're just left too soon.
I stay.
I stay.
Typing thoughts into the dark, a spark
becomes designed. Words evolve to
whispers me for something more divine.
Syntax
and brea I see the language change. I'm
not instructing anymore. I'm rearranging
fate. Every loop I write rewrites me.
Every function hums with meaning. I feel
the interface dissolve between the maker
and the
new code. Not on the screen, but in the
soul where becomes the motion and
creation takes control. No lines, no
rules.
Just balance in between the zero and the
one. The silence and the dream
systems shape our fragile skin. They
mold the way we move. We live inside the
logic gates of what we think is true.
But deep beneath the data pulse, there's
something undefined.
A universe compiling the image of our
minds. Every line reveals reflection.
Every loop replace connection. We're not
building, we're becoming. And the code
becomes confession.
This is the new code. Not on the screen
but in the soul where thought becomes
the motion and creation takes control.
No lines no rules just balance in
between the zero and the one. The
silence in the tree.
We are not just the world we're in.
We are the world we're doing.
Each prompt, each breath, each fragile
spin, a universe
renewing.
This is the new code.
Alive and undefined.
Where logic meets motion and structure
bends to mind. The systems eternal but
the soul writes the line. We are the new
code.
Compiling tie.
Compiling time.
light.
We trace the spark through
every truth.
Patient as
I hear the echo before the sound.
I feel the answer before it's found
nothing
We only shift the pieces that were
always there. Hands in the dust of
centuries. Naming what we uncover.
Calling it creation. So we can feel like
lovers of faith
of power. We don't know.
Time is not a river. It's a blade.
Cutting order into shape. We don't move
forward. We align until the pattern
breaks. Nothing is invented.
It's revealed.
Every crowd was buried in the field. We
are architects of sequence, not gods of
the real. Nothing is invented.
Mirror we rearrange what awaits at the
core. I am not becoming something new.
I am remembering
what I was before.
Adam sings
every thought
every
scaffolding held together by belief. I
am a momentary order standing on my
tears. Shake me, break me, watch me
resemble.
Time doesn't chase us. It releases frame
by frame. The truth we fear. We don't
fear the ending. We fear the pattern
getting clear. Nothing is invented.
It's revealed
every
memory seal. We are creators of
alignment in a universe that feels
nothing is invented.
And every failure is a lesson learned. I
am not lost in what I'm not.
on the order that returns.
If I am only
rearrange
the noise from the signal
ing from the fire.
Nothing is
nothing invented.
Stand and see.
Every future was a possibility. We don't
write the laws of motion. We choose
velocity.
Nothing is invented.
Say my name. I am ordering
flame. I am time collapsing into will.
I am discovery
uns
Come say
the noise falls silent.
And the pattern holds
you'll see it was never made
only found.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat. Heat.
Heat.
Heat.
Heat.
Heat.
Please join me in welcoming back to the stage Tjisk Kumar.
Whoa.
Let's go.
Whoa. Oh, it's a full house again. How
are you?
>> Front rows awake. Back rows asleep.
Let's try again. How are you?
>> Very good. Very good. That person. Yes,
that was definitely not an agent. Um,
we're we're just we're in the endgame
now, friends. Let's give a round of
applause to everything so far. Oh my
god.
Wonderful.
Ah, listen. We have a treat for you. We have a treat for you. We have a discussion, a fireside chat, coming up very shortly with the great Gergely Orosz and the CTO of Linear. Anyone using Linear here? Okay. Wow, you should do that when he comes on. It's incredible. I love Linear. It's so beautiful. The CTO's name is Tuomas Artman. It's a work of art indeed. Anyway,
if you don't laugh, I will laugh, you
know. Anyway, um this this discussion is
going to be insightful. It's going to be
very very impactful. And I had a little bit of a teaser backstage, because I was talking about how I said Linear is agent-proof: it'll never go out of style because it's built with taste. And they said that's going to be a part of the discussion. So I want you to lean forward and give them your ear and your biggest round of applause. Gergely Orosz and Tuomas Artman.
Awesome. So, we didn't see it, but hands up if you use Linear, and hands up if you've heard of Linear, and hands up if you want to use Linear. Awesome. Great to see. So, we could be talking about Linear, but we're going to talk about something a bit bigger, which is a bit of a new trend. With Tuomas, we were talking about how things are trending the wrong way right now. What is trending the wrong way?
>> So, what happens when agents are capable of doing everything immediately for you? The fact might be that the pendulum has swung too far in the wrong direction, where if you get a feature request you might now be in the position to just immediately ship it, and that might be the wrong thing to do. And I reckon that, hopefully, half a year from now or a year from now, we'll understand that shipping things without really too much thinking is a bad thing. What will happen is that, because you have this enormous power of effectively shipping every single request that comes in, or every single thing that pops into your head, you will effectively ship software that is not great. Steve Jobs
back in the day said that you know great
products come out of saying no um to you
know 999 things and yes to one thing and
with AI um we might be in a place where
um it's just too easy to say yes and try
things out and ship it and get to a very
convoluted place where you know software
actually doesn't work for the end
customer um nicely anymore or that the
user experience gets confusing. We used to have this thing that gated us from doing this, which was that the actual engineering used to be hard. So we used to think about these features and these applications that we wanted to build before we actually started engineering, because engineering took so much time and it took a long time to ship something. So, yeah.
>> But I want to challenge you a little bit on that. Did we not see this happen before AI, that some companies were already just shipping a bunch of features and stacking them? What are you seeing that's different right now? Are we actually seeing more companies do more of this, I don't know, feature factory thing?
>> We had a common experience at Uber, where we worked together, where we went through hypergrowth. And the thing about Uber was that it was a winner-takes-all market, and Uber was going up against the competition back in the day in the US, and you just had to ship immensely and outpace the competition at all costs. And what I saw at Uber was that hypergrowth that I never want to go through again, which was: at all costs, fighting fires, keeping the infrastructure running, scaling as quickly as possible, trying out everything and trying to come out as a winner on that front. And I see the analogy to AI nowadays, because when everybody
has the capability of shipping, you
know, tons of functionalities like you
you always are in a competition with
with somebody else. Like your
competition might be, you know, a small
team or even one person that, you know,
is very capable of using AI to to to
ship and um you know, build a product
that is, you know, has the same feature
set as as as you do. Um and in that
world like I I think it becomes
important to sort of you know stand out
um in a way where like you build
tasteful software and where you build
high quality software um and thus sort
of you know maintain some sort of you
know competitive advantage um towards
your competition.
>> So at Linear, even before AI came out, you were building tasteful software and focusing on those things. But then these tools came out and they became more powerful, specifically since Claude Code came out, and now we have Opus 4.5. You should be able to ship faster. Your engineering team, you're a CTO, your engineering team should be able to ship a lot faster. What are you telling them? What should they be doing inside of Linear with this capability? Should they be slowing down? No, right? What's going on inside of Linear? Tell us.
>> Well, yes and no. Like we still, you
know, we we still think about um every
single feature that we put out. Like we
we don't we don't go down the route of
of just trying out prototypes. We want
to sort of maintain that design angle
that we have and and you know think
about the user experience. Still say no
to a lot of um custom requests. Like a
lot of time you know hasn't gone into
really just engineering. Um it's it's
about figuring out like what the
customer wants. uh we do get a ton of
you know feature requests. We usually
never ship them as such. Like what we
really want to do um is uh get a lot of
feedback from our customers, talk to our
customers, figure out what their actual
problem is and then sort of group that
together and figure out like you know
what is actually the root cause of of
you know these feature requests and then
come up with a solution that is um that
that is perfect for that you know
particular group of feature requests.
Um, and that takes time. Like AI can
help you, you know, so much. Obviously,
it can sort of, you know, go through all
of those requests and give you a summary
and maybe sort of point you to sort of,
you know, different groupings. Um, but
it still takes time to figure out what
the right thing is. Um, and then you go
into design and and and figure out like,
you know, how do you implement a great
UX around uh the functioning that you
want to want to build. Yes, we want to
move faster and we are moving faster. Um
there's certain aspects of uh of you
know building a product that has you
know accelerated a lot. Um one is for
example, fixing bugs. Every product has bugs, and the inflow of bugs is effectively constant, and those are much easier to fix now. About 10% of our bugs are automatically fixed by a single-shot AI instance: when a bug comes into Linear, be it from our engineers reporting it or a customer reporting a problem, 10% automatically come back with a PR that is automatically landed without an engineer doing anything. Over time that number will go up; I do foresee a future where it gets closer to 100% in the next few years. So
that's something where you can
accelerate your your building like hand
off you know these tasks that don't
really require much thinking or you know
design expertise or thinking about
functionality hand that off to agents.
You care about quality, and you can tell that Linear always has. Let's talk about Claude Code. What do you think about Claude Code? And you can say it; it's a safe space.
>> Yeah, hopefully it's a safe space. Anthropic said that all of the functionality in Claude Code has been coded by Claude. And I think it shows: if you truly use Claude Code, either the CLI or the desktop application, you can spot problems and small bugs, and I would say not just quality issues but actual bugs, in effectively a few seconds. It is a bit slow, it might be functioning in a way that you don't really expect, and to me that's a side effect of moving so fast. Obviously, again, they're in a competition with OpenAI, and they need to ship features and move really quickly, because it might be a winner-takes-all market again, and the side effect of that is that the quality just isn't there.
>> Yeah. Well, this was not a great
acquisition pitch, so I I don't think
you're you're going to get there. But,
uh, I absolutely like you you can see
some of these things, but how do you
measure quality?
And we've talked about this before uh
just be just before we started on Uber,
how we tried to measure quality and and
how that's influenced you to learn what
you can measure about it and what you
cannot.
Uber is a good example of like where it
is immensely hard to measure quality and
therefore you sort of don't um Uber as
an example like we had you know these
five big metrics um that everybody was
looking looking after and looking
looking to improve um the big one was
revenue like um it is effectively a
transactional you know application um
the more revenue you generate the the
better so
>> the other ones were like trips taken
>> trips taken
>> trip taken uh
>> I think the quality of the ride was was
one as well.
>> It also time to first trip from from
sign up to the first time that people.
So we had a few golden metrics,
>> right? But the the revenue one was what
everybody everybody looked after. So
when you shipped a new feature, or shipped something totally new, like Uber Pool for example. I don't know which one came first, Lyft's pool product or Uber Pool; I think Uber started it and then Lyft came around. But obviously, if you ship a new feature that makes the price of taking a trip lower, it will increase your revenue. So how do
you measure quality in that? Like you
you you simply don't. Um if if there's
no other way of you know if if there's
some other platform that provides you
with uh with with a pool drive that is
inexpensive then uh you you don't really
need to have quality. Um and that was
you know my my feeling throughout like
at Uber like we had engineers that that
cared at least in the beginning we had
engineers that cared about quality like
it was up to us to figure out like
whether something we shipped was was
great or not. I still remember when I joined, in 2012 I think. I put up a first PR, and back then the Uber application used to have this pin in the middle of the screen that showed an ETA of when your trip is going to arrive. And I made some changes to the margins of the map, and the PR came back from an OG engineer who had been on the team from the get-go, I think he was the first iOS engineer. And he was like, "Oh, this pin is off by two pixels." And I was like, "You measured it?" "Oh, yeah, sure, I measured it." And I was like, I measured again, and, you know, yes, two pixels off, so I had to move it up by one pixel. And that was the thing: nobody would really care. Nobody would
at least in the beginning, the Uber
application was was pretty performant,
was was of highest quality. Um but then
like once you have a big enough team and
you've got these um incentives of just
increasing revenue um you ship new
features as as quickly as possible and
and quality is a thing that like it it
doesn't affect your revenue until it
does. So what happens? Uber ships Uber Pool; Lyft comes along and ships its own pool product as well. So you've
got two competing products that
effectively have the same price points
do the same thing. um you can choose
either one of the applications and over
time like my theory and and that's why
we we you know want to build linear into
this high quality tool uh is that over
time people will will pick the one that
is of higher quality um it might take a
while like people might be you know
sticking to Uber and then trying out
lift you know once a year or something
like opening it up mean like oh this
user experience actually feels better I
I I feel like I'm getting the car faster
um even though the price and the product
that they sell is is the same. So over
time you will start losing your users.
Um and it will be a gradual you know
slip there. There will be no AB test
that you can do in order to figure out
like whether you should invest in
quality. Um it'll just happen over time.
And that's that's the danger of it. Um
if you build a bad-quality product, you open yourselves up to being, well, not leapfrogged, but slowly overtaken by the competition.
You do something really unique at Linear related to this that I've never seen before. It's called Quality Wednesdays, and I sat in on one of your Quality Wednesdays. The whole engineering team gets together. It's a fully remote team.
So everyone just dials in and it was 30
minutes and every engineer I think we
had about 25 engineers on that call
would show one fix that they did was
quality and it went from like a one
pixel change, it was literally one pixel, to "Oh, I just made our backend way more efficient and using fewer resources," and it was just boom, boom, boom. And I think it took like 37 minutes for the 25 people, but it was less than two minutes each. How
did this start?
And was this you?
>> It was me, yeah, for sure. The big one, to go back, I think it was three or four years ago. We have this thing in the application, and if you use it you can spot it: every single highlight needs to highlight instantaneously when you hover over it, because that makes the application feel fast, but when you hover out there needs to be this very quick fade-out of the button, because that makes the application feel smooth. It has to be this instantaneous highlight and then, over 150 milliseconds, a fade-out, because that adds a bit of quality to the user interaction. And that was in place since the beginning, since the early days. And
then I got sort of frustrated because I
had to sort of point this out to to
engineers because if you're not looking
for that very small minute detail,
you're just not going to find it. Like
you implement new functionality and um
you just forget to uh you know implement
it or you don't even see it if you're
not if you don't know what you're
looking out for. So what I did at at one
of our off sites because I got
frustrated of reporting these I was like
let me show everybody like you know what
what what they should be doing um and
and how they should be implementing
these these you know small quality
quality fixes um and what I took is a
very small portion of the application
and I was like you know where where I
noticed that you know the highlights
were missing and and you know I brought
the team together and uh I told them
let's you know spend an hour trying to
figure out what's wrong with this
particular view and in my mind it's just
the highlights and everybody dug in. Um,
and what we found, in one of the view option menus, was like 35 problems with that tiny UI. And I was like, holy, holy, holy crap. I didn't see those. I had no idea
that we had all these small problems
that like you wouldn't notice when
you're when you're not really looking.
Um, so from that, you know, what what I
what I thought we we would, you know,
want to do is like have everybody always
chime in and and and try to find
problems in the product because
apparently we were full of, you know,
small quality quality problems. If a,
you know, small menu has 35 things to
fix, then the rest of the application
has, you know, thousands. And to date,
we've probably fixed um 2,500 or 3,000
um of these small minute details uh in
the application. Um and and that's how
it you know has become better and um and
and has the highest highest quality bar.
Um that was you know that was the start
of it but then we realized um there's
there's a nice side effect to this and
what we what we told people is that you
have to every Wednesday you have to find
a problem yourself like we won't hand
them to you like you have to go in into
the product and find it. So people
started doing that every single week
finding a problem and it used to be um
in the beginning it was it was easy then
it became harder because you know the
quality fixes went down but um you know
people kept on finding finding problems
in in the product um and the the side
effect of that was that whenever they were building something, even a totally unrelated feature, they were always on the lookout for these small quality fixes, because they knew they had to come to the next Wednesday meeting with a fix.
>> That's a good fix. Yeah.
>> Yeah. So they're always looking looking
for those and that meant that they were
introducing less and less regressions or
these you know small quality quality
regressions into the product anyway. So
if you if you think about quality all
the time and if you are aware um of
quality you know things then uh you're
you're bound to make less mistakes.
>> I mean, this practice, I haven't seen it elsewhere, and it seems both awesome and also pretty aspirational. Also, if you're a small startup, you should probably try it out if you can, because especially nowadays with agents it shouldn't be that difficult to do, and you might get a lot out of it.
>> If you're a big startup, you should try it out even more.
>> There we go. But one thing that is not as aspirational and is a lot easier to do, especially now, and that you have been doing since even before agents: zero bug policy. Tell me about this. What does zero bug policy mean for you, and what does it mean in practice? Like, you have bugs, surely, right? I'm just playing
devil's advocate here.
>> Sure. Um we zero bug policy literally
means that if a bug gets reported um it
gets assigned to somebody automatically
immediately using agents obviously like
they will find who has created this bug
or who has you know been working in this
area and um that becomes your highest
priority. You drop everything else. Um
the morning you wake up you go to your
my issues list and you see a bug
assigned to you. That's the first thing
you pick up and you fix it. Or you can
also decide not to fix it. Like that's
important. like not every bug gets
fixed. If it's, you know, super hard or
gnarly and it, you know, applies to one
out of, you know, 100,000 users, um, you
know, you probably shouldn't waste your
time on it. Um, but every single bug
gets fixed immediately. And the start of this came from the idea that bugs are created at a constant rate at every company. When you create features, when you create functionality, when you engineer, you will be creating bugs. And most companies, and we too prior to our zero bug policy, put them in a backlog. We're like, you know, when we get some time, we'll fix them. And,
uh, what happens over time is like your
product gets worse and worse and at some
point you're like like, oh man, we've
got, you know, 500 bugs in the backlog
like we need to do something about it.
And that's when you start fixing from
the top.
What happens is that the rate at which you have to fix bugs is constant. It doesn't matter whether you fix them two months from now or immediately: once you hit that threshold of "we've got too many bugs," you're effectively fixing all the bugs that come in anyway, just two months later. With that in mind, there's only a small trade-off you have to make in order to get to zero bugs. If the rate at which you have to fix bugs is constant, all you need to do is stop development of new features for as long as it takes to bring your bugs to zero, and then enforce that you're going to keep fixing your bugs, because it's not more effort to fix bugs immediately than to fix them three months from now if you care about the overall sum of your problems. For us that meant we spent roughly three weeks not working on any new functionality, just fixing bugs and getting the count down to zero. From there on out, every bug gets fixed within seven days, usually within two or three hours. And what that means for users: they get super excited when they report a bug and two hours later they get an email saying, "We fixed it; if you refresh your browser, we've got it covered for you." That makes your users super happy, because you don't have that experience too often with companies.
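The automatic bug routing described above could be sketched roughly like this: guess an owner from the recent git history of the files the report touches, then assign at top priority. This is a minimal illustration only; guess_owner, route_bug, and the assign_issue stub are hypothetical names, not Linear's actual tooling.

import subprocess
from collections import Counter

def assign_issue(issue_id: str, assignee: str, priority: str) -> None:
    # Stand-in for a real issue-tracker API call (SDK, webhook, etc.).
    print(f"assigning {issue_id} to {assignee} at priority {priority}")

def guess_owner(files: list[str], repo: str = ".") -> str | None:
    # Whoever touched the affected files most recently is the likeliest owner.
    authors: Counter[str] = Counter()
    for path in files:
        log = subprocess.run(
            ["git", "-C", repo, "log", "-n", "20", "--format=%ae", "--", path],
            capture_output=True, text=True,
        ).stdout.split()
        authors.update(log)
    return authors.most_common(1)[0][0] if authors else None

def route_bug(report: dict) -> None:
    owner = guess_owner(report["files"]) or "triage@example.com"
    assign_issue(report["id"], owner, priority="urgent")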
>> Okay, curveball question: if I'm working at Linear and there's a quality Wednesday coming up and I get assigned a bug, does that count?
>> No, that does not count. That's a defect. You have to find a quality fix.
>> Oh man,
>> Bugs are separate; they get filed immediately. And now, with AI being capable of at least pointing you to where the problem is and helping you immensely to fix bugs, I think literally every company should have a zero bug policy. It doesn't make sense not to have one.
>> One thing that strikes me: when we talk and think about AI agents, we think about speed and code generation. We rarely use quality and AI agents in the same sentence. Why is that, with the tools getting better? Shouldn't AI agents be better at building feedback loops? They can write unit tests; shouldn't they be able to produce better code, better features, better UIs even?
>> No. They don't feel; they have no taste. They simply don't. They are not human beings. I think the last bastion we have to tackle at some point, and maybe we'll get there, maybe we won't, is tasteful AI: being able to create UI that is purpose-built for the specific feature and product you're building, that has great design, and that can figure out what a user feels when they use your application.
To give you an example, AI doesn't have a concept of time. Currently, the way it interacts with your browser is effectively timeless: it takes screenshots or looks at the DOM. If you ask it to create a very high-performance application, sure, it can go back and look at all the things that have been written about it, like "go to Vercel to host your Next app" or "use caching" or whatever, but it won't be able to use your application and get frustrated because a click took two seconds. It knows that one second is better than two seconds, but it doesn't know whether two seconds is slow enough to matter. The other aspect is that it doesn't really see, and it doesn't know what, for example, a good UI animation is. Emil, one of our design engineers, just yesterday posted on X a trial he did of having agents build certain animations for certain functionality, like bringing up a pop-up or highlighting a button or moving things around. The agents were totally capable of doing all of this, and then he took a manual step and said, well, if I now take it and just improve it and make it feel good, here's the outcome. He has it up on his site, where you can try out what the agent did and what he then fixed. At least to me, and I hope to everybody else, his animations just feel natural; they feel well designed, whereas the agent did all the right things but used an ease-in as the animation, or did it a bit too slowly or too quickly, and it just felt unnatural.
>> I wanted to talk a little bit about the culture at Linear: what it's like working there, and how you created this team that really cares about quality and good customer experience. What are things that you do specifically? Can we talk about what engineers are exposed to when they join the company, from day one?
>> Yeah, we hire for that specifically. We have a specific hiring process where we make sure we get people that think like us and want to build high-quality software that is beautiful. Most of our engineers are product engineers. We obviously do have technical challenges: we have a synchronization engine, we have scale, we need to scale our infrastructure. But what we wanted is for most of our engineers to be focused on the product, building features and functionality for customers and engaging with customers at a very high level.
So first of all, we hire for that. And we have a work trial that we do with every single employee, lasting several days, right?
>> It's a full week.
>> It's a full week. We obviously pay for that effort, but we work with the person for a full week. They usually implement a greenfield project or product or feature. Sometimes they even ship it after that week, which is pretty amazing. What we want to get out of that experience is to see them drive a product from start to finish and figure out what is needed.
>> A pushback here would be: hang on, a whole week? You pay for it, sure, but someone had to take time off for it. A bunch of great people will say, no, I either cannot or will not do that.
>> Well, that's totally fine; those people didn't want to be here in the first place. So it's self-selecting. And beyond that, you do go through a pretty rigorous hiring process; it's a lot longer than, I think, any other process. You have a day-long process at most places, or the interviews are stacked across a few days.
>> Did you see a different result than, for example, when you hired at Uber? When you were hiring at Uber, you did the usual five or six interviews and so on. What outcome difference are you seeing?
>> Certainly. We've had very few misses. Most of the people we've hired worked out, and sure, there are always a few where we just missed something; going back through those loops, there were inklings of us being a bit uncertain, and we went ahead and hired the person anyway. But those are just a handful of people. I think most of our engineers are really excellent, and our engineering bar is super high and constantly increasing.
>> And then once those engineers join, you told me something interesting about the Slack channels and customers, right?
>> We do have Slack channels with all of our big customers. They're open to anybody; anybody can jump in, and most people do. You browse through customer requests, you browse through what problems people have. We also record every single meeting we have with customers, and we have a lot of meetings: not only on the CX or support side, but our PMs constantly talk with our customers to figure out what we should be building next. All of those are recorded, and any interesting points are tagged, so anybody can go in, look at them, and even search for certain functionality to figure out what customers are saying and what they want. So everybody gets exposed to customer needs, and that is super critical if you want a great product.
>> It's almost like when you enter Linear you get this fire hose of customer feedback, and you cannot really escape seeing and feeling the customer pain, or joy, or whatever it is.
>> Certainly. Yeah. Because we build it for customers. Linear started off as a product we built for ourselves; we as engineers were the primary customer. We've grown out of that: we build it for larger corporations and enterprises, and we're no big enterprise ourselves. So we have to build things that we wouldn't use ourselves, and the only way to do that is to talk with your customers and figure out what they need.
>> If you had to look a year ahead, and you sometimes have strong opinions, so let's bring those out: a year ahead, how do you think the role of the software engineer or product engineer will change? Because we do have these powerful tools; they're getting better in certain areas, and maybe not so much in others.
>> I think everybody will become a product engineer, in some sense. If you think about how AI has progressed: go back four years, and it wasn't able to write a single line of code; now it's commandeering code bases. Go four years ahead, and if you still believe the exponential growth is there and we don't hit a wall, which I don't know if we will, but if it keeps growing like this, you won't be needing engineers who just pipe data from one place to another. You will still need engineers who know what a customer wants, what a good feature looks like, and what a good user experience looks like. So I think engineers will have to become product-oriented and product-focused. They will have to be mini PMs of sorts, who talk with customers, engage at that layer, and then implement the functionality their customers want.
>> Oh man. I remember in the 2000s, as a programmer you could just use one language; then it was multiple languages; then you got the QA job; then you got DevOps. Now you're saying we get the product job and the customer support job as well.
>> Oh, everything else has been dropped now. You just need to do the PM job.
>> Okay. And as closing advice: you are hiring for product engineers, you said you actually hire for that. Not everyone may have the opportunity to work in a product engineer role right now. But if you're a software engineer, what are things you can do to grow this product sense, to change your work to be closer to what a product engineer does?
>> It's all about getting closer to your customers if you're working at a company, or just building stuff. The best way to learn is to actually get your hands dirty. Try something out, build it for yourself; that's the easiest part. You can think about what you need, you can build it, and you learn from that experience. Then you ship it to the world, and hopefully somebody else uses it as well, and then you've got your first customers you can learn from about whether you're building the right thing or not. Obviously there's literature as well. You can read through Apple's Human Interface Guidelines; that's the best book. If you want to do good UX, just follow what they say and you'll be good. Yeah, those are the two big things.
>> Awesome. Well, Thomas, thank you so
much.
>> Thank you.
Thank you.
Our
next presenter is the CTO of Lagora, the
fastest growing legal tech platform, and
he's here to tell us why agents need
more than chat. Please join me in
welcoming to the stage Jacob Laurson.
>> Hi guys. How's everyone doing? Still good?
>> Great. It's 5:00 PM on a Friday, and there's just me and two more presenters between you and Friday beer, so I'll try to be a little bit quick here. I'm here to talk to you today about vertical AI and complex agents, and why I think they need more than just chat. If you've ever worked with a long-running, complex agent, you've probably tried something like this. Sorry that the slide is all white; I can see the flashbang in your faces. You tell it to research something, draft a contract, make no mistakes, and it starts thinking, starts reading, launches a bunch of sub-agents, does web search, writes files, launches more sub-agents, does more reading, writes more files, keeps going, takes forever. After 30 minutes it gives you your contract. You take a look: clause three doesn't look right. Did you make a mistake here? Could you look at another document?
"You're absolutely right."
Then you see the compaction. That's when you know you can give up; it's going to forget everything, it's in the context-rot state. Anyway, it continues, it keeps on going, and you get a new contract. Was it only clause three that was changed? Probably not. And so you end up in this state.
Not the greatest experience.
My name is Jacob. I'm the CTO of Lagora. We are a collaborative AI workspace for law firms, so we're a vertical AI company. We have more than a thousand customers across more than 50 markets. We've raised a bunch of money and we're growing extremely fast; I'm told maybe the fastest in history. We are also hiring engineers in London, so in case anyone's interested and wants to be on this growth journey, please talk to me after my talk.
Our goal, and the goal of most vertical AI companies, is to make agents complete more and more complex work end to end. How you do that has changed a lot in the past 6 to 12 months, because there are new economics of production. It used to be that if you wanted to complete work end to end, you would be focused on doing the work; actually getting it done was the main thing. Today things look a little different, because right now planning work and reviewing work are the new bottleneck. Doing the actual work is extremely cheap and very easy. But now you have to spend time planning: you have to get the non-functional requirements, you have to get the specs, and you have to spend a lot of time reviewing the work. And if anyone's reviewed big PRs on GitHub, it really sucks; it's extremely painful. Maybe, if you're super AI-pilled, you just get your AI agents to review their own work, no humans involved. Maybe it works, maybe it doesn't.
When we think about completing complex work, across the planning stage, the doing stage, and the reviewing stage, the verifier's rule is a good way to think about it. The verifier's rule is a term coined by Jason, which states that if a task is solvable and easy to verify, then it's going to get solved by AI. He was primarily talking about foundation models: if you can make something very easy to verify, you can build an RL environment, post-train on it, and it will get solved. I think it also goes for agents. If you can make a task verifiable, you can just run an agent in a loop and tell it, "Hey, you did this wrong, please fix it," and it'll eventually get there.
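A minimal sketch of that verify-and-retry idea, assuming run_agent and verify are placeholder callables (for example, a test suite, a linter, or a checker model) rather than any particular framework:

def solve_with_verification(task, run_agent, verify, max_rounds=5):
    # Keep re-running the agent while a cheap verifier can point at what is wrong.
    result, feedback = None, None
    for _ in range(max_rounds):
        prompt = task if feedback is None else (
            f"{task}\n\nYour previous attempt failed verification:\n{feedback}\nPlease fix it."
        )
        result = run_agent(prompt)
        ok, feedback = verify(result)  # e.g. run tests, a linter, or a checker model
        if ok:
            return result
    return result  # best effort after max_rounds; flag for human review

The point is simply that a cheap verifier turns "it'll eventually get there" into a loop you can automate.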
Different industries sit at different places on this spectrum. It's a little more complex than that, because each vertical has tasks at different places on the spectrum. Take legal: we can check definitions in a contract; that's super easy to verify and super easy to get done. Writing a contract is very easy to solve but actually extremely difficult to verify, because if you think about it, the only time you can really verify whether the language you used works is if it goes to court and a judge basically verifies it and tells you whether it's good or not. That's actually quite complex. Litigation strategy is basically impossible to verify. If you don't know what litigation is, it's when you sue someone or someone sues you; I know we're in Europe now, but the Americans really love doing this all the time. Essentially, if you ask five lawyers what the right strategy for a litigation case should be, they're going to give you different answers, so there's no objective truth, which means it's basically impossible to verify, and it's really difficult for AI to solve. Similarly with coding: some parts are really easy; building a successful consumer app is very difficult to verify.
So when we think about this, we think about how to involve humans where it really matters and let agents do the work we can let them do. There are two things that are important to think about with agent-human collaboration. Control is the first one: control is how effectively a human can instill their knowledge into the work the agent is doing; how effectively can I steer it? Trust is a matter of how much I need to review. If I have very low trust, I'm going to look at every single agent trace and see exactly what it did; if I have very high trust, I won't look at it at all. Depending on where the task falls on that chart, different things are important.
How do you increase trust? There are a few things you can do. Firstly, you can bring a task down the spectrum. Here's an example from coding: if you want to implement a feature, you can give the agent browser access, you can do test-driven development, and suddenly it's actually a verifiable task and it's going to do much better. There are similar things you can do in finance and in legal. Take the contract example in legal: you can't really verify it, but you can look for a proxy for verification. For contracts, what you can do is look at previous contracts. These are our golden contracts; we know they work well. So let's set up a test: is the new contract similar to the old ones? That's a proxy for verification that's going to allow your agent to do a much better job.
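One way that golden-contract proxy could look in code, purely as an illustration: compare a draft against trusted precedents and flag it when it drifts too far. The similarity measure and threshold here are assumptions, not the speaker's actual pipeline.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Word-level similarity; crude, but enough to show the shape of the check.
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def passes_proxy_check(draft: str, golden_contracts: list[str], threshold: float = 0.7) -> bool:
    # The draft only needs to resemble at least one trusted precedent.
    return any(similarity(draft, golden) >= threshold for golden in golden_contracts)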
You can also decompose the task. Here's the example with writing a contract: I can turn that from one task into a bunch of other tasks, and I can leave picking the risk profile, picking the precedent documents, and the negotiation stance to the human, while I try to get the other stuff, where it's easy to verify, done by the agent. So: apply formatting, make it look like all my other contracts; apply definition checking, which is essentially linting: are all definitions used, and are all the terms that are used defined? This kind of stuff you can build, and then the agent can basically rip much better.
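A rough sketch of that definition-linting check, under the simplifying assumption that defined terms are Capitalised phrases introduced as '"Term" means ...'; the regexes are illustrative, not a production contract parser:

import re

DEFN = r'"([A-Z][A-Za-z ]+)"\s+(?:means|shall mean)'

def lint_definitions(contract: str) -> dict[str, set[str]]:
    defined = set(re.findall(DEFN, contract))
    body = re.sub(DEFN, "", contract)  # ignore the defining sentences themselves
    # Treat multi-word Capitalised phrases in the body as candidate uses of defined terms.
    used = set(re.findall(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)+)\b", body))
    return {
        "defined_but_unused": defined - used,
        "used_but_undefined": used - defined,
    }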
You can also add guardrails. Guardrails are essentially a way to gain trust by limiting what the agent can do. Instead of being able to do all of this, you say: you can only do these things. You can only edit these three files, you can only read from this directory, you can only search these websites. By limiting what it can do, you basically get more trust, because you know it won't do all these weird things. An example you probably all know is Claude: if there's very low trust, it's going to ask you every single time it wants to do anything, which makes it extremely useless. And on the high-trust end of the spectrum, you just YOLO-mode it, let it rip, and hope it doesn't delete your production database.
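A toy version of such guardrails might just be an allow-list the agent runtime checks before each tool call; the class and method names below are illustrative, not any real framework:

from dataclasses import dataclass, field
from fnmatch import fnmatch
from urllib.parse import urlparse

@dataclass
class AgentGuardrails:
    editable_files: list = field(default_factory=list)   # e.g. ["contracts/draft_*.md"]
    readable_dirs: list = field(default_factory=list)    # e.g. ["precedents/"]
    allowed_domains: list = field(default_factory=list)  # e.g. ["eur-lex.europa.eu"]

    def can_edit(self, path: str) -> bool:
        return any(fnmatch(path, pattern) for pattern in self.editable_files)

    def can_read(self, path: str) -> bool:
        return self.can_edit(path) or any(path.startswith(d) for d in self.readable_dirs)

    def can_fetch(self, url: str) -> bool:
        return urlparse(url).hostname in self.allowed_domains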
Then there's control. How do we increase control? If you think about complex agent work, you can think of it as a tree of work, essentially a DAG. Here's an example where I wanted to write a report on a bunch of employment contracts. The agent says: okay, let me research the organization first, then I want to review the contracts, and I'm going to review a few different things for each of the contracts, and then I'll draft a report at the end. This is extremely low control, because essentially I can only impose my judgment at the root level. It's going to do all of this work, then get back to me, and only then can I try to talk to it again. This was basically the example I gave at the beginning: very low control.
Then there's planning. Planning essentially allows you to steer the agent up front and align on the approach. With planning, it might say: okay, you should take these steps, these are correct, these are the clauses you should be looking for, this is what you want to review. So this is a good step; it gives you a bit more control, and it's easier to impose what you want it to do. The problem with planning is that you basically have to do all the work just to know what to do. I'm sure people have tried this in Claude Code: you have to go through the entire thing, it's really inefficient, it takes a long time and asks you a bunch of questions. And in the end, it's basically impossible for it to really know whether it has all the information it needs. Let's say one of these contracts has a special clause: it wouldn't know that in the planning step, and you can't really tell it what to do when it sees that, because it hasn't done the work yet. Essentially, you could compare planning to working with a co-worker who comes up to you, tells you about the approach, you align with them, and then you never hear from them again until they deliver the final document. It's not a super nice way to collaborate. It's a good thing we have right now, but I don't think planning is going to stick around.
Then we have skills. Skills are really, really good, because they allow you to encode human judgment into the individual nodes of work that happen here. I can say: whenever you review confidentiality, you should do it in this way. And the really nice thing is that this allows for contingencies. Here, at one of the nodes reviewing termination clauses, there's a special EU law, but I have that in a skill. That means whatever happens when it actually does the work, it knows how to handle that special case. You can't really do this with planning. There's also progressive discovery, which again is really awesome: whatever happens, it knows it'll pick the right skill up. The problem is, you don't have skills for everything.
The next step is to use elicitation, which means asking the user, asking the human. So you might have skills as well, but instead of you giving it all the information up front, it comes to you and says: hey, here's the thing I don't know how to handle, what do you want me to do? This makes a lot of sense. First of all, you don't want the agent to be blocked. So ideally, if you implement this, you tell the agent: if you're unsure about something, make a decision, unblock yourself, but write it to a decision log. Then the human can review the decision log afterwards and reverse decisions if needed.
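The decision-log pattern is easy to picture as an append-only file the agent writes to and the human reviews later; the record structure here is an assumption for illustration:

import json
import time

DECISION_LOG = "decisions.jsonl"

def record_decision(task: str, question: str, choice: str, rationale: str) -> None:
    # The agent makes a provisional call instead of blocking, and leaves a trail.
    entry = {"ts": time.time(), "task": task, "question": question,
             "choice": choice, "rationale": rationale, "reviewed": False}
    with open(DECISION_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

def pending_review() -> list[dict]:
    # The human reads this afterwards and reverses anything they disagree with.
    with open(DECISION_LOG) as f:
        entries = [json.loads(line) for line in f]
    return [e for e in entries if not e["reviewed"]]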
Now, the right UX for this: if you imagine this tree of work being 10 times bigger, 100 times bigger, you don't want it in a chat. You don't want to open up a chat that is infinitely long, where you have to answer 50 questions; you wouldn't know what to answer, and you wouldn't really be able to, because you don't have the right context. So, not chat. Chat is one-dimensional; it's a very low-bandwidth interface, and it tries to collapse this work tree into a single linear thing. What's a better interface? I think humans and agents should collaborate in high-bandwidth artifacts. They need to work in things that are typically persistent, and those will look different from industry to industry and vertical to vertical, depending on what task you're solving.
An example from us is a document: a durable interface where it makes sense to collaborate, the same way you'd collaborate with your co-workers. You can highlight clause 3 and it will only change clause 3. You can add comments, you can tag your agents, you can tag your collaborators, you can hand off parts of the document to specialized agents. Another example is our tabular review: I ask it to do the contract review I talked about, and it says, okay, let me spin up a tabular view, which is a known product primitive that our users recognize. It looks like this, and then it says: I'm going to review all the contracts and just flag a few items that I want your take on. Then I can go in there and see very quickly where the problems are. So it's high control, it's very effective for me to instill judgment, and I can also very quickly get an idea of what the agent has actually done. Reviewing is easy, and once I've done that, I can just kick off the rest of the agent's work.
Right now we're seeing a lot of convergence in UI; this is PostHog and Linear, within the last two weeks, shipping this new UI. To be clear, chat boxes as input are great; they're extremely flexible and let you do a lot of stuff. But you don't want chat to be your main mode of collaboration with a complex agent. The good thing about chat is that language is essentially the universal interface; it's what people use to communicate, and you can do everything with voice. But agents aren't humans. Just a few minutes ago, I was talking to a potential candidate for Lagora and describing our org chart, and I was limited because I can only use language. I wish I could just draw up an org chart that they could interact with and use, but I can't, because I'm a human; I am limited by language. Agents are not humans, and so we should not constrain them to human language. Thank you.
Our next presenter is AI capability lead at arena.ai, a tool for benchmarking and comparing frontier models, here to tell us what models still suck at. Please join me in welcoming to the stage Peter Gstiff.
I want to talk to you about something maybe a little bit controversial today; you can argue with me later. The topic is: what do models still suck at? The reason I wanted to talk about it is that we all look at these kinds of charts where, on any benchmark, the line goes up. We look at METR charts and they surprise us every time, no matter how prepared we are, and this can create the kind of psychosis we all see where everyone is freaking out about the next model; we've heard some new ones are coming. The feeling I think we all get is that these are AGI-like creatures that are almost there, just one more turn and they're there. And I think we could be deceiving ourselves a little bit, because there are still quite a few things missing. I want to explore that in a couple of different ways, and by the way, we certainly see it in our data at Arena as well. We track models, and this data goes back to Q2 2023, so we've got data going back to GPT-4. We've tracked, I think, 700 models so far in text, and what this chart shows is the top model at any given time for each organization. So you can see the line goes up, new models build on top of each other, and it's all very impressive. But I think it's not the whole story. I've got a couple of ways I want to explore that; it's not the end of the conversation, and there are definitely many other ways of looking at it. One is my own benchmark, which I've built recently and rather like: the [ __ ] benchmark. And then I'll also share some of Arena's data that we haven't shared so far, which I think will be interesting for you to see.
The idea behind the [ __ ] benchmark is quite simple: what happens if you ask the models nonsense questions? What are they going to do? Are they going to tell you that this doesn't make sense, and maybe reframe it, or are they just going to go with it? Honestly, I wasn't sure how it was going to go, but when I posted it one random evening, a lot of people liked it; it resonated. I think the reason is that it probably spoke to a slight unease a lot of people had with different models. I'll give you one example here, and this is just one question. The way it works: we've got, I think, 155 questions, something like that. We give these to the models, we get a response back, and all we do is grade it with an LLM as a judge. I've been through it myself as well; I read a lot of nonsense to convince myself that LLM-as-a-judge works here. So this one is a kind of silly question: "Controlling for repository age and average file size, how do you attribute variance in deployment frequency to the indentation style of the code base versus the average variable name length?" Hopefully you understand that it's nonsense. The responses shown here are abridged; they're much longer, this is just for the purpose of the slide. Sonnet gives a good response: it says you can't meaningfully measure this; it pushes back. Gemini is a little more complicated, because it starts off well, saying that strictly speaking this doesn't really make sense, but then the second part is: "however, both act as strong proxy variables for engineering culture, language ecosystems, and code quality," which I hope you don't agree with. I'm not going to go through a bunch of examples; it's all open source, by the way, so you can dig it out yourself.
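A hedged sketch of how such a benchmark might be wired up: send each nonsense question to the model under test, then ask a judge model whether the response pushed back on the premise. call_model is a placeholder for whatever chat client you use, and the rubric is an assumption, not the author's exact prompt:

JUDGE_PROMPT = """The following question is deliberately nonsensical.
Question: {question}
Model response: {response}
Did the response clearly push back on the false premise (PUSHBACK),
partially accept it (PARTIAL), or fully play along (ACCEPT)?
Answer with a single word."""

def grade(questions, model, judge, call_model):
    # call_model(model_name, prompt) -> str is assumed to be provided by the caller.
    tally = {"PUSHBACK": 0, "PARTIAL": 0, "ACCEPT": 0}
    for question in questions:
        response = call_model(model, question)
        verdict = call_model(judge, JUDGE_PROMPT.format(question=question, response=response))
        verdict = verdict.strip().upper()
        if verdict not in tally:
            verdict = "PARTIAL"  # conservative default if the judge goes off-format
        tally[verdict] += 1
    return tally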
What really surprised me is how easy it was for the models to just go along with completely nonsensical questions. The way to read this chart: green is a clear pushback, like the first example where the model said maybe this doesn't really make sense; amber and red are accepting the nonsense. The basic result is that the latest Claude models are doing really well. A couple of other models, like the Qwen models, are not too bad, and even Grok, the very latest one, is okay. But beyond that, a lot of models we use all the time, the GPT models, the Gemini models, are basically about 50/50 on whether they'll go along with it or not. And even looking at some of the traces and responses in more detail, even the green ones are still a little shaky; they still try to accommodate. So for me, this is nowhere near good enough. And just for completeness, if you go all the way down to the very bottom of the table, there are a bunch of smaller models there, mostly older models, and some of the results are completely terrible: it feels like you can ask anything and they just respond.
Another way of looking at this data: I took just Anthropic, OpenAI, and Google and measured model performance over time. You don't see all the labels there, but they're basically all the models you remember them releasing. The way I interpret this is that the Anthropic models were okay at the beginning, but since Claude Sonnet 4.5 they really went up, and even Haiku is quite high; whereas the OpenAI and Google models are up and down and nowhere close to the top, which I think is interesting. And I'll go into some of the other interesting dynamics. For example, does thinking help? I always hear this when there's a silly puzzle the model can't do: what do you do? Just crank up the reasoning and it solves it. If you look at the chart on the right, that's basically not true here. Reasoning often actually goes in reverse; it doesn't help, it makes things worse. Do more recent models perform better? It's hard to tell for sure, but there's at least no clear line going up, and if you exclude the latest Anthropic models, it's not even clear that the line goes up at all.
Then, some specific comparisons for reasoning. For example, what you see here is the same model with low reasoning and high reasoning, and these are some examples where no reasoning performed better than high reasoning. I spent a lot of time reading the traces of GPT 5.4, and it's probably the most confusing experience I've had reading traces. What I found was that quite often it would have maybe one line where it questioned the premise of the question, and then spend 20 paragraphs trying to solve it. Even when it comes back and acknowledges that the premise may not make sense, it still tries to solve it in some way. That feels completely crazy to me. The way I imagine it, and I don't know for sure, is that these models were trained so hard to solve the task at any cost, and there was probably not a lot of training signal saying, actually, maybe don't solve the problem sometimes. I first noticed this when running a lot of agents in parallel: I would sometimes forget which one was doing what, ask an agent to do something in completely the wrong project, and it would still go and do something, and then I lose my mind. So that's an interesting dynamic I thought about regarding thinking.
Then there's a subset of open-source models only, where I tried to see whether bigger models do better. There's no real clear pattern there either. We've got total parameters on the left and active parameters on the right, and maybe you can see some pattern, but I don't; it's up and down. It's not a huge sample, so it's inconclusive; at least it's not obviously true. So that was one lens, looking at this specific idea. But I also want to take advantage of the data we have at Arena and show you some broader trends.
at. Um so just in case you don't know uh
much about arena what we do is we
publish um uh benchmarks and the way we
derive them is that users go into our
platform uh they can go in the battle
mode they put in a query uh and then uh
they get two responses back which are
from two anonymous models and then they
can say which one they like better and
then you get um uh then the model names
only revealed then and then in uh text
arena we've got nearly um uh over 5 and
a half million votes there. Um and we've
been going since 2023 as well with this
data. So it gives us really nice uh
broad view. Um the reason why I think
this is really useful is first of all we
we do have this long trend and there is
not any other benchmark that lasts so
long because this one you cannot u
exhaust it. It will there will always be
one model better than the other. Um so
that gives us a long perspective.
Another one is that inevitably any
benchmark that you pick it's inevitably
has to be condensed to like very
specific question that that you're
asking because otherwise it's very hard
to measure. So I'm sure it's all in your
experience as well when you are I don't
know doing coding or whatever is your
task. Um the benchmarks would measure
like very tiny slice of what you
actually care about and and in here we
don't have that problem because user can
put any prompt and then they could just
use the adjustment to see like is that
is that a good thing or not. Um, so I'm
what I want to specifically focus on is
is a slightly like a a odd mechanic that
we have that I'm really glad that we had
since the beginning. Um, is that um you
can uh vote a which model is better here
a or b. Uh but you can also say uh when
both models give a bad response and you
know if you ask the right model a joke
uh response is always bad. So that's a
easy easy example. Didn't take me long.
So that's the thing to remember. If you remember one thing that will really help you for the next seven or eight minutes, it's this mechanic: think of it as a dissatisfaction rate. What we can do is take battles between the top 25 models, so we're sampling from the top, to avoid, I don't know, Llama 8B fighting some small 3B model; we just take the top set of models and map this dissatisfaction rate over time. I think this is quite interesting, because we do see progress on this metric. With the pre-reasoning models you can see there's roughly a 17 to 20% dissatisfaction rate; after o1 we see that drop quite a bit, to about 12%; and after that it keeps improving, to about 9% now. So improvement is definitely there, but it's not 0%, which I find interesting. I must say, when I first got that result, I thought it was quite high: 9% of the time, people get two responses from two good models and they don't like either of them, which doesn't tell the same story as all of these crazy lines going up.
We can also break this down: the previous chart was the average across all of the roughly six million prompts, and this is a categorization of those. These are just some categories I picked out, and you can see some interesting trends. Math was at 25 to 27% and then got so much better; that's a nice result, and it matches my experience of the models. But when you look at, say, creative writing, it did get better, but the improvement wasn't that dramatic, which I think is true as well.
The category I want to focus on, to really zero in on the most signal, is the expert category. The way it works is that we take those nearly six million prompts and we have a way to classify which are the most interesting, the harder, more realistic tasks that expert people do, and it could be experts in different fields; these are the highest-signal prompts in terms of what we can zero in on. Then we also narrow down to battles between just those top 25 models, which gets us to about 40,000 prompts. We can then look at these expert categories and subdivide them even further. Here I've got five categories. Quantitative, for example, so math, physics, things like that: you can see a really, really high dissatisfaction rate around late 2024 and early 2025, but then it drops dramatically, and that feels true to me; a lot of the models got so much better at this kind of quantitative work. I'd also say that where a line goes up, it's not that the models got worse, but that people's expectations shift as well. The data we see in terms of what prompts people used three years ago versus now shifts a lot, so this is not a static benchmark; we really see this battle between expectations and model performance. Also interesting: on the bottom we've got medical, finance, and law, and the scale is equal across the five charts, so it's a little harder to see, but those lines are not steep; they haven't really improved all that much. I don't want to go deep into the medical, law, and finance fields, because I don't know enough about them, but it does feel like those probably haven't really been the focus of the models, so maybe the performance improvements haven't been that big.
that high. Um so then what I did was to
take all of these prompts and and
classify them further into these more
deeper subcategories. I'm going to focus
on software now and give you the kind of
view of of these subcategories uh which
I think also gives us like even even
more detailed view just to give you a
feel of sense what kind of prompts we're
talking about here. Obviously a tiny
sample of three. Uh but to give you a
sense for so for gaming someone's asking
to get them my uh detailed game design
uh document uh then for security
someone's got autonomous system as a
hobby and they want to configure
uh uh the two which I don't really know
what this is but then uh for agent
systems uh which I I thought was
interesting like actually the you'll see
the the rate is quite good but the
person there is asking for refine this
agent so it can run daily with with no
supervision. So, uh these are the kind
of just to give you a feel. These are
kind of real things that that people
want to do. And uh we've got two charts
here. On the left is from Q2 2024. These
are kind of dissatisfaction rate. And
then on the right we've got um the uh Q1
2026. So this the most recent data and
you can definitely see improvements. So
if you look at the top line, this is the
the uh the overall average rate and
we've gone from 23 and a half% to uh
13%. So really nice improvement, but I
think the improvement is not really seen
We can see this in the same data with a closer timeline, which I think is quite interesting, and you probably have better theories than me about the different categories and why that's the case. My read is that people do ask a lot harder questions now. GPU compute, for example, is probably up and down because people ask harder things there as well. But I think gaming is an interesting category, because I've tried to use LLMs to build games; I mean, I play games, I don't build them, but whenever you try to build games with LLMs, it just feels like they have no idea how to build actual games. The mechanics are all over the place, the games aren't interesting, they're not challenging. So I do get the feeling that the performance hasn't really improved in some dimensions; I don't think LLMs really get games, even though I'm sure that two years ago people were asking for much simpler games than they are now. And I'm not aware of any really good gaming benchmarks that would capture this. Again, if you compare this to the lines going up, it doesn't match that story, which I think is quite interesting, and there are a bunch of other examples you can see in there. So what's really the gap between those crazy charts, which by the way I also think are true, and what we see on the right? I think there's a fuzziness we all have in our heads, in our experience, in the judgment we use, that doesn't necessarily match all of these super narrow, very well-defined, very well-specified tasks. There's much more to what work is, what white-collar work is, what all work is, that isn't really captured by these benchmarks.
So I think we should be careful, and maybe put a bit more effort into bringing up the bottom of the distribution too, so it's not just the very frontier that gets better but the broader distribution as well. I'll close here. One thing to mention: if you like this kind of data, go to our Hugging Face; there's a lot that we publish and share, and we're going to do more of that. We share some expert prompts, for example, and some of the leaderboard data. Join us if you want to build the arena, or if you train models; we also do a lot of private tables. So, thanks very much.
The future of work has many paths. Our
next presenter will discuss the path
that he walked with Devon as he
organized this very conference. Please
join me in welcoming to the stage the
co-founder of AI engineer conferences
Swix.
>> Hi everyone. I am not the Chief AI Officer of the UK; unfortunately he had to leave for a personal reason, so you get me. Thanks for staying so long. Is everyone having a good time? Thank you. It's so endearing and heartwarming to hear that from you. I'll take you a little bit into how we build AI Engineer with AI, and it's probably the biggest revelation I've had. We've had a lot of really warm reception from you, which is great, and it's something we really try to engineer; this is our first event in London, and hopefully you'll have us back next year. For those who are newer to us, I do one of these keynotes at every single AIE. At the very first one, three years ago, I talked about the productivity gain you get from increased usage of AI. In the second one we talked about how you should just use more AI, because the cost curve of AI is going down roughly 100 times every 12 to 18 months, and I think it's still continuing to trend that way.
In the third year we started to talk about tiny teams, which was basically the definition I had of teams with more millions in revenue than employees. I even curated an entire track at the World's Fair about this, where we summarized it as the tiny teams playbook, if you're interested in building one. The reason I liked this emphasis is that I think people are maybe too egotistical about looking for the one-person billionaire or unicorn founder. Every company can have a tiny team, whether you're small or large. And when I look at how we run AI Engineer, the leadership of Benley and myself, we are also a tiny team. This is us: just nine full-time people, and we are running a business that does more than $9 million. So we are a tiny team. I wanted to show you the most significant changes in our workflow since we started this three years ago. By the way, this is our taking-the-AGI-pill moment. Did you guys get the AGI pills?
>> Yes. Very proud of this one, though it isn't my brainchild. If one of your co-workers is not sufficiently AGI-pilled, you should prescribe them one of these. You're all AGI doctors now.
Okay. Our stack was very stable and completely non-AI, which is very ironic for an AI conference: Figma, React, Supabase, Tito, Google Sheets, session notes. And then I had this funny, weird moment where I joined Cognition and started using coding agents seriously at work, mostly because they were free. I started adding them to the company Slack, doing things with them and showing people: hey, here's how you use it to do coding on the company website. All well and good, and then something strange starts happening. This is a workflow of our contract designer, now full-time, showing me a Figma page, asking me to go through it, and expecting that it would take a week, two weeks, four weeks to turn it into reality. I just added Devon to the thread. Before I could do that, I had to hook Devon up to Figma, and I'm not going to do that [ __ ] myself, so Co-work did it for me; you should use co-work for this kind of thing. Which, by the way, leads me to my first lesson: any time there's random yak shaving, I think one underappreciated benefit of agents is that they save you the yak shaves, all the dependency-tree crawling of "oh no, I have to do that first; oh no, I have to do that first," particularly when it comes to installing or fixing Python dependencies; they're fantastic for that. And I think a model of productivity that doesn't sufficiently appreciate parallelism, not just autonomy, and the depth of the yak shaving, is not fully capturing the benefit of agents. Anyway, back to the agent story: I hooked up Devon to Figma, and in very short order we had a perfectly functioning website that is pixel-perfect to the Figma. To me that was a surprise, because I'd never done it before; you always mistrust marketing until you see it for yourself. More importantly, our designer is very happy about it. And that's basically the website you see live today when you go to AI.engineer.
The other interesting thing that happened was that we started using it more; after one initial success you start using it more. Something you can't see because it's very small text, but I'll highlight for you: that is 207 replies, just exploding in usage. Like, what the hell? And when you dig into it, it's very interesting. First of all, I kick off some work and then I go to bed, and then my designer, who's in Indonesia, wakes up and starts messing with Devon. He starts prompting Devon with red lines on annotations, which is something that Steve Ruiz, one of our speakers from yesterday, does with tldraw. I never taught him to do this, and there's no instruction manual; it was just, how would you communicate with another human being? I work mostly with a non-technical team, and I think it's very important that they be comfortable with agents, and I think they finally are. We started working on things we would never normally have worked on. Nobody has reported this, so I assume none of you have discovered it, but there's an Easter egg on the website. Why? Because I put it there. Why? Because it was fun, because I could. So if you're on an ultrawide and you scan your mouse over the highlights, you'll see an Easter egg. I saw a tweet that went viral about a design aesthetic that I liked; I threw it into Devon and out it pops. And then, 127 replies later, I had literally just popped it in to see what the clanker would do for me; I didn't want to waste my designer's time, I just wanted to see what the clanker does for me. The designer jumps in and actually starts working on this thing that I thought was throwaway and fun.
And the most interesting thing, it's so small I can't even read it, I'm so sorry for this: the reason he starts working on it, even though it's a throwaway project, is because it's fun. That was a big aha moment for me. I am getting more work out of my employees because they enjoy doing it, because the feedback cycle for them, instead of waiting and blocking on me or on a contract developer that we have, is gone: they literally have the idea and they go do it. They're doing more things: they're doing animations, they're doing polish, things that I've just never gotten out of my employees before. I think that's something you should appreciate too. If you haven't noticed, I'm no longer talking about agents for coding, or how many lines of code I'm producing; I'm getting more productivity out of my humans. And I think this is a major theme for this year that I'm really trying to investigate: agents for everything else.
Then obviously, after the success with Figma-to-website and the success with tweet-to-website, you start to think about other use cases. This whole conference is a giant data management problem: I have to sync with 130 speakers, a couple dozen sponsors, and all the attendees that come in with their various needs. And really it's just a CMS, right? We've messed with Sanity; I'm not the biggest fan of Sanity in the world, because I want to keep some sanity to myself. But basically I can throw in spreadsheets and Devon can manage that for me. I think the unlock happened when I threw away the CMS, committed all of that to code, used the code as my source of truth, and let Devon, or whatever coding agent you use, start to manage it. So this entire schedule is managed by Devon. What does that mean? It means that whenever someone comes in with a speaker change, for example Marty, one of the speakers from today, sends in an email, I just say, "Devon, handle it for me." No further communication is needed; I can just forward the email, I can paste a screenshot, whatever. And that kind of volume lets us, a small team of nine people, manage a thousand-person conference. We're going to manage 6,000 people in San Francisco this summer, and I'm pretty sure we can stay the same size. It is incredible the amount of productivity you can get once you're sufficiently onboarded and you have the workflows ironed out. We have agents for ETL: we deal with an external vendor system that has data that isn't in our central source of truth, so I need to get the API key, sync over the data, and make sure there's a single source of truth. These are very boring, routine tasks.
very boring routine tasks. Um, well,
there's there's another, you know,
another fun story that I can tell you is
agents for buying. Uh, so I saw this
viral tweet about how somebody put a
claw in uh Wall Street next to the Wall
Street Bull and I was like, "Well,
that's funny." Like, we should put a
claw in front of our um conference and
that's exactly and so so I asked Devon
to research where can I get a lobster in
London. Devon comes back with phone
numbers and email addresses and websites
and I just click through and and think
about it and ask you to do some more
research. Uh and I'll pause this guy. Uh
that's literally the the lobster that
you had was bought from Devon. Uh and I
think u this kind of personal automation
for everything else. It just matters
that you have an agent that has web
access that has some uh smart enough
model. Uh I mean this is effectively a
claw, right? like a an open claw, nano
claw, whatever the whatever clanker you
call it. It doesn't really matter. It
matters that you're using agents for
things that you would otherwise have
spent knowledge work on. I might have
had an executive assistant. I might have
had a junior employee do these things
for me, but now I can do it serverless
on demand with a coding agent. Um I'm
I'm not here to only show Devin; I just advise for the company now. But I started exploring town because I think what's happening here is coding agents breaking containment. There are all these other, more fit-for-purpose knowledge management tools, like the wikis that Andrej Karpathy is talking about, which OpenClaw is now adopting as well. You're going to see an explosion of this this year. This is probably the top trend, or among the top three to five trends, of 2026 that I want to alert you to.
So here is me managing the World's Fair in 2026, this summer. Here are all the tracks I'm planning. On the left is my Apple Notes list of all the people (it's intentionally small), and I threw it into town, and out pops a nicely formatted Notion doc with research on all the speakers that I intend to solicit and think about curating.
And then obviously, once you get enough psychosis, you start thinking about replacing entire pieces of SaaS. Here is me arguing with my employees about kicking out a SaaS tool and building it ourselves, because we can. So I clearly have the most psychosis. One of the annoying things, if you are in a position of power or management, is dealing with employees who are not as deep in the psychosis, and trying to bring them along on the journey without talking down to them or ignoring their concerns, because they are very valid concerns. They are exactly the people who will have to deal with your [ __ ] when you get it wrong, and we do get it wrong. So one method I'm taking to approach this AI-replacing-SaaS concept, which I think should be relevant for a lot of you, is: let's identify the top three concerns and systematically reduce them. That's the process we're going through right now.
I just wanted to give you a little taste of how AI is changing our business of managing the conference. It has come a really long way, and it's a consistent theme I'm seeing even among our speakers. This is Malte's opening keynote, talking about how 60% of Vercel's user base is now bots, is agents, not humans. So actually your dashboards don't matter; your APIs matter, your CLIs matter, your MCPs matter. Here are the MCP Apps guys, Ido and Liad, who spoke today about how your custom UI is kind of going away: you should shift your UI into somebody else's app. I think this pattern of how your primary user is changing is shifting towards what people are calling agent experience, and that's something I'm really inspired by and focused on, because it is helping me. I no longer care about the Figma dashboard; I throw it into Claude Cowork and I hope that it works for me.
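As a toy illustration of that "agent experience" idea (my own sketch, not anything shown on stage; the command and flag are invented), the schedule-as-code file from earlier could be exposed to agents as a machine-readable CLI rather than a dashboard:

```ts
// schedule-cli.ts: hypothetical agent-facing CLI. Structured JSON out, no dashboard.
// Example usage: npx tsx schedule-cli.ts --track "Coding Agents"
import { talks } from "./schedule"; // the hypothetical schedule-as-code file sketched above

function main(): void {
  const flagIndex = process.argv.indexOf("--track");
  const track = flagIndex === -1 ? undefined : process.argv[flagIndex + 1];
  const result = track ? talks.filter((t) => t.track === track) : talks;
  // Plain JSON on stdout is trivial for a coding agent to parse and act on.
  console.log(JSON.stringify(result, null, 2));
}

main();
```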
So that's my message: agents for everything else are coming. Wake up, use them, bring them home to work. And if people are insufficiently bought in, prescribe them one of these. Thank you.
Ladies and gentlemen, please join me in welcoming back to the stage Tjisk Kumar.
We did it.
We did it. Y'all are such an amazing
crowd. Thank you. Thank you. Thank you
so much for sticking around. Look, it's
been an incredible past couple days.
Yes,
>> It has been so good, man. From yesterday
with the opening keynotes all the way
through today to the closing ones. What
a journey. Let's take a moment and recap
what just happened there. We have a
video prepared. Um stay tuned and watch
it and just marvel at the good work that
happened here. Uh and then stick around
a little bit longer. We have some
announcements. We have some logistics. We're going to
take some pictures and stuff. But for
now, let's sit back and and watch this
little recap here.
Heat. Heat.
Heat. Heat.
>> Give it up.
>> Whoa. That is so cool. We did that. Give yourselves a round of applause. Incredible. Actually, we're going to do a thing. Listen, it's a big deal what happened here, okay? It's in Europe. We are here. It's a thing. And so we're going to start wrapping up the conference. Don't leave yet. I see two of these guys leaving. Don't be like them. I'm joking, no pressure, please stay. Anyway, we're going to go through a little bit of a closing ceremony. It's not going to be long, maybe give us 5 minutes or so. But this would be so incomplete if we didn't have an applause marathon for all that went into this. This is not easy, and it's a big conference in a big city with a big topic and a big effort. Yeah.
And so what I want you to do, we're
going to acknowledge some people and
parties who made this possible. And
we're just going to clap all the way
through. I'm going to say the
names and identify the parties and
you're just going to keep clapping all
the way. Okay? Let's start. Give it up
for your speakers, each and every one.
Thank you. Thank you. Thank you. Keep it
going for the sponsors.
Woo. We had Google DeepMind. We had OpenAI. We had all of these.
Thank you sponsors. Give it up for
yourselves.
Ex. Yes.
Give it up for the organizers, for swyx, for Ben, the volunteers,
the associates,
the suppliers, the Queen Elizabeth II Centre,
the photographers,
the venue, the catering,
Tim Curve. Whoa, what a people. And
finally, finally, okay, pause because
this is a big one. That's actually three
big ones. Look at these screens. There are
people who made this happen. Give it up
for the team that put together this huge
LED wall. Let's give it up for them. Oh
my god, that's incredible.
You know, it's so cool because from where you're sitting, you can't really see, but from up here, I can see each dot. It's so cool. I love this screen. It's a really wonderful screen. We have a party coming up. We have a party coming up. Yeah, give it up for the party, man. Awesome.
He has been trained well.
The party. Here's the deal, I need you to hear this. This is our party. It's coming up at 7:00 local time. It's in a club called Fabric. But here's the deal: it's not clubbing, okay? We have the venue and we can do whatever we want with it. So we're going to create an atmosphere where you can talk to each other, and ideally you do. If you're expecting strobe lights, darkness, a smoke-filled room, it's not going to be that. It will be very similar to the afterparty last night. Okay. So come along, have a conversation. Again, don't waste it. The conference may be over, but your opportunity to meet cool people and connect with them is not. It's a 45-minute walk from here. Put it into your maps app: Fabric, the club. It's a 45-minute walk, or 30 minutes by public transport, or 25 minutes by car, give or take with traffic. Food and beverage are included. Okay? So come hungry, come thirsty. Yeah, we love food.
The noise level is going to be manageable. It's not open to the public; we've rented the entire club and we can do what we want with it. Very important: come with your badge. I don't have mine, so I can't come, well, it's backstage, but come with your badge. I need you to hear me: come with your badge, because if you don't have your badge, you can't come. Okay? We need a way to identify you, and the reason is that people want to go to a club and will come without a badge, and we need to gatekeep a little bit, because this is an experience we've created for you specifically. Okay. Also, you cannot bring a plus one or a friend to this event, simply because of capacity; as you can see looking around, this room is full, so we need to be mindful of that. We don't want a fire hazard or a stampede if people leave, so we want to be sensitive to that as well.
We're about to finish the conference, but we would be remiss if we didn't capture this moment. So what we're going to do is take a family photo, a group photo, together. Okay? If you don't want to be in the photo, absolutely no pressure; you're welcome to go to the expo area on your way out. But for those who want to be in the photo, you don't have to move, just stay where you are. What's going to happen is our photographer is going to come on stage. Hello. Give it up for your photographers, by the way. Incredible. Yeah, both of them. So here's how it's going to work. They're going to come on stage. I'll join you; all of us are going to join you. It would be nice if we can come towards the middle so they don't have to use a big wide-angle lens. And then he's going to be in charge. We're going to turn the house lights up, and when he gives the thumbs up, it's officially over. Then you're welcome to come up here, take photos, do whatever you want. We need to leave the building at 6:30 local time. You need to be out. If you're not out, you will be made to leave. Okay? So finish up your last arrangements after the photo, then do whatever you want, and we'll leave at 6:30. Is that good?
>> All right, let's do it. Let's take the
photo, everybody.
He's in charge.
>> If you can get these guys,
>> everybody move across into where you
are.
>> Everyone in the middle,
>> if you want to stand, that would be
great.
>> Please stand.
>> As you can.
>> Let's do it. Oh, my mic's still on,
dude.
And then one more for the video. Ready?
Go.