AI Engineer World's Fair 2025 - Day 1 Keynotes & MCP track ft. Anthropic MCP team

Channel: aiDotEngineer

Published at: 2025-06-05

YouTube video id: z4zXicOAF28

Source: https://www.youtube.com/watch?v=z4zXicOAF28

Ladies and gentlemen, please join me in
welcoming to the stage the VP of
developer relations at LlamaIndex,
Laurie Voss.
[Applause]
Hello everybody. There are so many of
you. It is great to see you here today.
Welcome to the 2025 AI Engineer World's
Fair. Let's hear it from you.
Welcome to the Yerba Buena Ballroom. For
those of you who managed to fit into the
room, uh I'm told this is the largest
pillarless ballroom west of Las Vegas,
which is the perfect metaphor for AI
startups because the scale is
impressive, but it has no visible means
of
support. My name is Laurie Voss. I am VP of developer relations at LlamaIndex, the best framework for building agentic AI applications according to
me. I'm going to be your MC today and
tomorrow. And my first order of business
is I have to take a selfie because if I
didn't if I can't post it to social
media, did it even really
happen? Pretend you've just heard a
really funny intro and you're laughing.
Excellent. All right. Uh it's
traditional as an MC to warm up the
audience with a couple of jokes. Uh and
as an AI practitioner, it's traditional
to not want to do any work and to get
the AI to do it for you. Uh, so I tried
that and it was a great learning
experience. The primary thing that I
learned is that LLMs are terrible at
writing jokes. I mean atrocious. I tried
ChatGPT. I tried Claude. I told them to
think deeply. I asked them to search the
web. None of it worked. AGI will be
achieved when models can say something
actually funny. Until then, these dad
jokes are handcrafted by a
human. But on to business. Once upon a
time, I was co-founder of npm Inc. Uh,
so I used to talk about JavaScript a
lot. Uh, and I've been a web developer
for 27 years. These days, I'm talking
about AI. And I'm very excited about AI
because when you've been in tech as long
as I have, you've only you only see
revolutions as big as the one that is
being powered by AI a couple of
times. But tech is full of things that
say they're going to be the next big
revolution and then turn out to be just
hype: blockchain, NFTs, the metaverse,
the Segway. But make no mistake, this one
is real. There's a ton of hype, too. Of
course, not all of these AI startups are
going to turn out to be real things, but
there is a core of real revolution
happening. And how can you tell? Because
people are building things that people
are actually
using. ChatGPT hit 100 million users
faster than any consumer product in the
history of tech. Millions of people are
using it daily to get actual stuff done.
Sometimes they're using it to get stuff
done that they shouldn't be using it to
do, like uh cheating on their essays or
writing fake law citations into real
court cases, but also real things
hundreds of millions of times a day.
You'll be hearing from multiple speakers
from OpenAI over the course of this
conference, including Greg Brockman,
who'll be closing out
today. Another real adoption story is
Copilot. GitHub Copilot has millions of
subscribers and Copilot is now part of
Microsoft 365 which is available to 84
million everyday consumers. Azure AI is
being adopted by enterprises to the tune
of 13 billion in revenue annually. That
is real adoption. And speaking of
Microsoft, they're here. In fact,
they're our presenting sponsors. So,
let's hear it for them.
Another company doing a ton of big real
things with AI is AWS. So much so that
they're going to spend 87 billion on AI infrastructure this year. AWS are
also here. They're our innovation
partner for this conference contributing
tons of sessions and workshops. So let's
give them a big hand.
Two more companies doing big real things
in AI are Neo4j and Braintrust. They are our track sponsors. Braintrust is the end-to-end evaluation platform for building world-class AI apps and Neo4j is the world's most loved graph
database. There's a whole track about graph RAG this year, so there will be no shortage of graph RAG content. Uh, last year Neo4j CEO Emil Eifrem had the second most popular talk at the whole conference. Uh, and this year multiple Neo4j people are speaking, so you
won't want to miss
them. Who had the most popular
conference talk last year? Well, I don't
want to brag, but it was LlamaIndex's CEO Jerry Liu. He is going to be giving
a talk tomorrow that you won't want to
miss. And also little old me is giving a
talk at 1 PM
today. We've got a bunch of other great
sponsors. These are our platinum
sponsors. Uh, Graphite is the AI powered
developer productivity platform that
helps teams on GitHub ship higher
quality software faster. Uh, Windsurf
is the first agentic IDE that 10xes
engineers so that you can dream bigger.
MongoDB's Atlas database makes storing
all of your data, including your vector
embeddings a snap. Daily is the team
behind Pipecat, the most widely used
framework for voice agents and
multimodal AI. And Augment Code is the AI agent that knows you and your codebase best. And WorkOS helps you
ship your software to enterprise
customers with features like single sign
on in minutes. Let's give all of our
platinum sponsors a big
hand. But one of the biggest signs that
AI is powering a real revolution is all
of you here
today. This is the third year of this
conference. This is the biggest one yet.
There are over 3,000 of you here today.
That's nearly twice as many as last
year. You folks are building real things
every day and real people are using them
and that's incredibly inspiring. So, you
should feel good about that and give
yourselves a
hand. And this is going to be a heck of
a conference. We have over 250 speakers
here from around the world talking about
every aspect of AI, architecture,
infrastructure, AI in the Fortune 500,
robotics, design, MCP. Who here is
excited to hear about
MCP? That's good. It's going to be in
this room. Uh, security, tiny teams,
vibe coding. There is so much stuff to
learn here today, y'all, that you are
going to have a great
time. And that's why we're all here.
We're not just talking about these
technologies. We're not just excited
about these technologies. We're building
with these technologies. And we can't
wait to see what you've been building.
So without further ado, please welcome
to the stage a man who needs no
introduction, editor of the Latent Space podcast, CEO of Smol AI, and co-founder of this very conference, the one and only swyx.
[Applause]
[Music]
Okay. Hi everyone. Welcome to the
conference. How you doing?
Excellent. Um I have uh I've been I've
been so excited to play with this. Oh.
Uh can we set can we step back to the
the main the main slide? Okay, good. Um
there's a front clicker but no no back
clicker. Um so uh usually I open these
conferences with a small little talk to
introduce uh you know what's going on
and then give you a little update on
where the state of AI engineering is and
how we put together the conference for
you. Uh this is a this is one of those
combined talks. I'm trying to answer
every single question you have about the
conference about AI news about where
this is all going and we'll just dive
right in. Okay. So um 3,000 of you all
of you registered last minute. Uh thank
you for that stress. Um I actually can
quantify this. I call this the Gini coefficient for, uh, AIE organizer stress. This is compared to last year. Uh, it is, uh, please just buy tickets
earlier like I mean you know you're
going to come just just do it. Okay. Um
we also uh like to use this conference
as a way to track the evolution of AI
engineering. Uh that's those are the
tracks for last year. we've just doubled
every single track for you. Um, so
basically it's basically you know like
double the value for whatever you uh get
here and I think like uh I think this is
as much concurrency as we want to do
like I know I I hear that people have
decision fatigue and all that uh totally
but also we try to cover all of AI so
deal with
it. Um we also pride ourselves in doing
well by being more responsive than other
conferences like NeurIPS and being more
technical than other conferences uh like
TED or whatever what have you. So we
asked you what you wanted to hear about.
These are the surveys. Uh we tried all
sorts of things. We tried computer using
agents. We tried AI and crypto. It's
always a fun one. And uh but you guys
told told us what you wanted and we put
it in there. Um for all for more data um
we would actually like you to finish out our survey; the survey is not done. So if you want to hit that URL, um, we
will present the results in full
tomorrow. We would love all of you to to
fill it out so we can get a
representative sample of what you want
and uh they'll inform us next
year. Okay. Um you know I think the
other thing about AI engineering is that
we also have been innovating as
engineers, right? We're the first conference to have an MCP track, the first conference to have an MCP talk accepted by MCP. Shout out to Sam Julian from
Writer for working with us on the
official chatbot and Quinn and John from
Daily for working with us on the
official voice bot as well as Elizabeth
Triken from, uh, Vapi. I need to give her
a shout out because she originally uh
helped us uh prototype uh the the voice
bot as well. So we're trying to
constantly improve the experience.
Uh the other thing I think I want to
emphasize as well is like these are the
talks that I give like in 2023
uh the very first AIE I talked about the
uh the three types of AI engineer in
2024 I talked about um how AI
engineering was becoming more multi
disciplinary and that's why we started
the world's fair with with multiple
tracks in 2025 in in New York we talked
about the evolution and the focus on
agent engineering so where where are we
now in sort of June of 2025 Um, that's
where we're going to focus on. I think
we we've come a long way regardless like
we you know we people used to make fun
of AI engineering and and I anticipated
this. We used to be low status; people would just deride GPT wrappers. And look at all the GPT wrappers. Now all of you are
rich. Um, so we're going to hear from
some of these folks uh in the room. Um,
and uh thank you for sponsoring as well.
Um but uh you know I think the other
thing that's also super interesting is
that like you should we the consistent
lesson that we hear is to not over
complicate things. From Anthropic on the Latent Space podcast, we hear from Erik Schluntz about how they beat SWE-bench with just a very simple scaffold. Uh, same about deep research
from Greg Brockman who you're going to
hear later on um in the uh sort of
closing keynotes as well as AMP code.
Where's the AMP folks here? AMP. AMP I
think they're probably back in the other
room but um also you know there's
there's a sort of emperor has no clothes
like there's it's still very early field
and I think the um AI engineers in the
room like should be very encouraged by
that like there's there's still a lot of
alpha to mine
um if you watch back all the way to the
start of this conference we actually
compare this moment a lot to uh the time
when sort of physics was in was in full
bloom, right? This is the Solvay Conference in 1927 when Einstein, Marie Curie, and all
the other household names in physics all
gathered together and what we're trying
to do for this conference. We've
gathered the entire the best um sort of
AI engineers in in the world um and and
researchers and and and all that u to to
build and push the frontier forward. Um
the thesis is that there's this is the
time this is the right time to do it. I
said that two and a half years ago still
true still true today. But I think like
there's a very specific time when like b
basically what people did in in that
time of the formation of an industry is
that they set out all the basic ideas
that then lasted for the rest of that
industry. So this is the standard model
in physics and there was a very specific
period in time from like the 40s to the
70s where they figured it all out and
the the next 50 years we haven't really
changed the standard model. So the
question that I want to phrase here is
what is the standard model in AI
engineering right? We have standard
models in the rest of engineering,
right? Everyone knows ETL, everyone
knows MVC, everyone knows CRUD, everyone
knows map reduce. And I've used those
things in like building AI applications.
And like it's pretty much like, yes, RAG is there, but I heard RAG is dead. I don't know. You guys can tell me. Um, one day it's long context killed RAG, the other day fine-tuning kills RAG. I don't know. But I I don't
think I definitely don't think it's the
full answer. So what other standard
models might emerge to help us guide our
thinking and that's really what I want
to push you guys to. So uh there are a
few candidates standard models and AI
engineering. I'll pick out a few of
these. I I don't have time to talk about
all of them but definitely listen to the
DSPy talk from Omar later, uh,
tomorrow. Um so we're going to cover uh
a few of these. So first is the LLM OS. This is one of the earliest standard models, basically from Karpathy in 2023. I have updated it for 2025 for multimodality, for the standard set of tools that have come out, as well as for MCP, which has become the default protocol for connecting with the outside world.
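For a concrete feel of that, here is a minimal MCP server sketch using the official Python SDK's FastMCP helper; the tool and its data are made up for illustration, and exact APIs may differ between SDK versions.

```python
# A minimal MCP server sketch using the official Python SDK's FastMCP helper.
# The tool and its data are hypothetical; exact SDK APIs may differ by version.
# Install with:  pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("conference-tools")

@mcp.tool()
def lookup_talk(speaker: str) -> str:
    """Return the talk title for a speaker (hard-coded demo data)."""
    talks = {"swyx": "The Standard Model(s) of AI Engineering"}
    return talks.get(speaker, "No talk found for that speaker.")

if __name__ == "__main__":
    # Serves the tool over stdio so any MCP client can connect to it.
    mcp.run()
```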
The second one would be the LLM SDLC, the software
development life cycle. Um I have two
versions of this one with the
intersecting concerns of all the tooling
that you buy. Uh by the way this is all
on the Latent Space blog if you want and
I'll tweet out the slides. So uh you and
it's live stream so whatever. Um but I
think uh for me the most interesting
insight and the aha moment when I was
talking to Ankur of Braintrust, who's
going to be keynoting tomorrow um is
that you know the early parts of the
SDLC is are increasingly commodity right
LLM's kind of free you know um
monitoring kind of free and rag kind of
free obviously there it's just free tier
for all of them and you you only get
start paying but like when you start to
make real money from your customers is
when you start to do evals and you start
to add in security orchestration and do
real work. Uh that is real hard
engineering work. Um and I think that's
those are the tracks that we've added
this year. Um and I'm very proud to you
know I guess push AI engineering along
from demos into production which is what
everyone always wants. Another form of
standard model is building effective
agents. Uh our last conference we had uh
Barry, one of the co-authors of building
effective agents from Anthropic, give an
extremely popular talk about this. Um I
think that this is now at least the the
received wisdom for how to build an
agent. And I think like that's like that
is one definition. OpenAI has a
different definition and I think we're
we're continually iterate. I think
Dominic yesterday uh released another
improvement on the agents SDK which
builds upon the Swarm concept that OpenAI is pushing. Um, the way that I approach
sort of the agent standard model has
been very different. So you can refer to
my talk from the previous conference on
that. um basically trying to do a
descriptive, uh, top-down model of the words people use to describe agents, like intent, um, you know,
control flow um memory planning and tool
use. So there's all these there's all
these like really really interesting
things but I think that the thing that
really got me um is like I don't
actually use all of that to build AI News. Um, by the way, who here reads AI News? I don't know if there's like a
Yeah. Oh my god, like there's half of
you. Thanks. Uh, uh, it's it's a really
good tool I built for myself and you
know, hopefully, uh, now over 70,000
people are reading along as well. Um,
and the thing that really got me was
Soumith at the last conference. Uh, you know, he's the lead of PyTorch and he says he reads AI News, he loves it, but
it is not an agent. And I was like, what
do you mean it's not an agent? I call it
an agent. You should call it an agent.
Um, but he's right. Um, it's actually uh
it's actually I'm going to talk a little
bit about that, but like like why does
it still deliver value even though it's
like a workflow and like you know is
that still interesting to people, right?
Like why do we not brand every single
track here, voice agents, uh you know
like uh like workflow agents, computer
use agents like why is every single
track in this conference not an agent?
Well, I think basically we want to
deliver value instead of arguable
terminology. So the assertion that I
have is that it's really about human
input versus valuable um AI output and
you can sort of make a mental model of
this and track the ratio of this and
that's more interesting than arguing
about definitions of workflow versus
agents. So for example in the co-pilot
era you had sort of like a debounce
input of like every few characters that
you type then maybe you'll do an
autocomplete. In ChatGPT, every few
queries that you type it would maybe
output a responding query. Um it starts
to get more interesting with the
reasoning models with like a one to 10
ratio and then obviously with like the
new agents now it's like more sort of
deep research and NotebookLM. Uh, by the way, Raiza Martin is also speaking on the product management track; she's incredible talking about the story of NotebookLM. Um, the other
really interesting angle if you want to
take this mental model to the stretch to
stretch it is the zero to one the
ambient agents with no human input. What
kind of interesting uh AI output can you
get? So to me, that's a more useful discussion, input versus output, than arguing about what is a workflow, what is an agent, or how agentic your thing is.
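To make that ratio idea concrete, here is a toy calculation; the token counts below are invented guesses, not measurements.

```python
# Toy illustration of the human-input : AI-output ratio across product generations.
# All token counts are invented guesses purely to make the mental model concrete.
eras = {
    "autocomplete (Copilot)":  (10, 20),      # debounced keystrokes -> short completion
    "chat (ChatGPT)":          (50, 400),     # one query -> one answer
    "reasoning models":        (50, 2_000),   # one query -> long hidden chain of thought
    "deep research / agents":  (50, 50_000),  # one prompt -> long autonomous run
    "ambient agents":          (0, 10_000),   # zero explicit input -> standing output
}

for era, (human_tokens, ai_tokens) in eras.items():
    ratio = float("inf") if human_tokens == 0 else ai_tokens / human_tokens
    print(f"{era:24s} AI output per unit of human input ≈ {ratio:g}")
```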
Talking about AI News: so you know, it is basically a bunch of scripts in a trench coat. Um, and I realized I've written it
three times. I've written it for the
Discord scrape. I've written it for the
Reddit scrape. I've written it for the
Twitter scrape. And basically it's just
it's always the same process. You scrape
it, you plan, you recursively summarize,
you format, and you evaluate. Um and and
yeah, that's the three kids in the
trench coat. Um and that's really how
what it is. I run it every day and like
we improve it a little bit, but then I'm
also running this conference. Um so if
you generalize it, that actually starts
to become an interesting model for
building AI intensive applications where
you start to make thousands of AI calls
to serve serve a particular purpose. Um
so you sync you plan and and you sort of
parallel process you analyze and sort of
reduce that down to uh from from many to
one and then you uh deliver uh deliver
the contents um to the to the user and
then you evaluate. And to me that conveniently forms an acronym, SPADE, which is really nice.
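A minimal sketch of that sync, plan, analyze, deliver, evaluate shape, assuming generic `llm` and `fetch_messages` callables; this is not the actual AI News codebase.

```python
# Sketch of a sync -> plan -> analyze -> deliver -> evaluate loop for a daily digest.
# `llm` and `fetch_messages` are hypothetical stand-ins, not the real AI News code.
from typing import Callable

def run_daily_digest(sources: list[str],
                     fetch_messages: Callable[[str], list[str]],
                     llm: Callable[[str], str]) -> str:
    # Sync: scrape every source (Discord, Reddit, Twitter, ...).
    raw = {src: fetch_messages(src) for src in sources}

    # Plan: ask the model which threads are worth covering today.
    plan = llm("Pick today's top stories given these message counts: "
               + ", ".join(f"{s}={len(m)}" for s, m in raw.items()))

    # Analyze: recursively summarize each source in chunks, then merge (map-reduce style).
    sections = []
    for src, messages in raw.items():
        chunks = [messages[i:i + 200] for i in range(0, len(messages), 200)]
        summaries = [llm("Summarize:\n" + "\n".join(chunk)) for chunk in chunks]
        sections.append(llm(f"Merge these {src} summaries into one section:\n"
                            + "\n".join(summaries)))

    # Deliver: format the many partial summaries down into one newsletter.
    digest = llm("Write today's newsletter.\nPlan:\n" + plan
                 + "\n\nSections:\n" + "\n".join(sections))

    # Evaluate: a cheap LLM-as-judge check, logged before shipping.
    print("eval:", llm("Rate this newsletter 1-10 for usefulness and say why:\n" + digest))
    return digest
```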
There's also sort of interesting AI engineering
elements that are that have fit in
there. So you can process all these into
a knowledge graph. you can um turn these
into like structured outputs and you can
generate code as well. So for example um
you know, ChatGPT with Canvas or Claude with Artifacts is a way of just
delivering the output as a code artifact
instead of just uh text output and I
think it's like a really interesting way
to think about this. So this is my
mental model so far. Um I I wish I had
the space to go into it but ask me
later. This is what I'm developing right
now. I think what I what I would really
emphasize is, you know, I think like
there's all sorts of interesting ways to
think about what the standard model is
and whether it's useful for you in in
taking your application to the next step
of like how do I add more intelligence
to this in in a way that's useful and
not annoying. Uh, and for me, this is
it. Okay. So, I've I've thrown the bunch
of standard models in here, but that's
just my current hypothesis. I want you
at this conference when in all your
conversations with each other and with
the speakers to think about what the new
standard model for AI engineering is.
What can everyone use to improve their
applications and I guess ultimately
build products that people want to use
which is what Laurie, uh, mentioned at the
start. So um I'm really excited about
this conference. It's so it's been such
an honor and a joy to get it together
for you guys and I hope you enjoy the
rest of the conference. Thank you so much.
[Applause]
Our next presenter is the head of
product for Microsoft's AI platform.
Presenting about the open agentic web.
Here to show us what happens when
natural language creation meets an
industrial-grade backbone is Asha
Sharma.
[Music]
[Applause]
Hello. I still remember the first line
of code that I wrote. It was on this
computer, a Compact 95. And I also
remember the feeling that I had when I
wrote my first program. It was one of
magic because for the first time, I
created something more beautiful and far
more interesting than I can do by
myself. And I've thought about that
moment ever since.
I've spent the last 15 years building
some of the most important products in
machine learning. And now I have the
opportunity to lead our core AI platform
at Microsoft. And our goal is to empower
every single person in this room to use
AI to shape the
world. Now, I'm excited to be here to
talk to all of you, but I'm most excited
because I love the World Fair. The World
Fair is where imagination and impact
start to collapse. And that's happened
throughout the last century. In 1939, it
was the first time that Hollywood came
into our living rooms. In 1964, faces
came over the copper wires. And today,
it's all about agents. Agents that can
learn, that can adapt, that can extend
the way that we live. And most
importantly, it's fundamentally changing
how we actually make product. So, how
did we get here? Well, over the last few
years, we've seen a change in the model
landscape. There used to just be a
handful of models from one provider and
now thanks to many people in this room,
there's been an explosion of reasoning
models and that explosion is giving way
to new capability and new efficiency. We
see that a lot of models can now
generate hypotheses, understand
unstructured data, even act at a PhD
level in certain
domains. We also know that these models
are becoming more efficient. They don't
just live in data centers anymore. They
live on our laptops. We have full
control and there's no latency. And so
this is starting to birth what we call
the agentic web. A world in which agents
are going to interact with tools and
models and probably other agents. And
they're going to do so no matter what
cloud they're on, no matter what company
built them, no matter what device that
you choose to use. And underlying this,
it's creating a few different forces in
the world of AI
engineering. The first is that we're
going from pair programming to peer
programming. Copilot used to be a
sidekick and now it's an actual
teammate. The second is that we're going
from a software factory to the agent
factory. It's not just about binaries
anymore. It's about behaviors. And the
third thing that we're seeing is that
models don't just live in the cloud
anymore. They live on your device and
they can follow you wherever you
are. To sustain this, I don't think
there's any one tool that we need.
Instead, I think it's a platform of AI
powered tools that are going to sit on
top of an agent factory that every
company's going to have, that has trust
and security baked in, and that goes
from cloud to edge seamlessly. So, let's
dive
in. Now, as far as I can remember,
programming has always been a
partnership between the person and the
machine. We would write lines of code
and the machine would execute it. But
now, we're starting to see a world where
the machine can write the code. It can
fix itself. It can help you imagine
new features. And so with that, our
entire day is changing. The workflow is
changing. The old world used to be us
shepherding syntax. And in this new
world, GitHub Copilot can now live in
your codebase. It operates in your
branch. You can assign a task to it and
it can run tests until it's complete.
That means we're going to spend more
time on architectural decisions, more
time orchestrating teams of
agents. Maintenance is changing, too. In
the old world, maintenance would compete
with features. I hated that world. And
in this new world, what we're seeing is
the opportunity to invent agents that
can continuously improve your codebase.
And that's why we invented something
called FSY. You can think about it as graph RAG for your codebase. It can
reason over it, can explain it, and it
can continuously improve it and even fix
some
areas. Another area that I'm really
excited about is GitHub as a peer.
Now, in the past, generic AI was just operating on a generic codebase. But now, because we have open sourced the extension for GitHub Copilot, GitHub Copilot understands your patterns, it understands your domains, it understands your teams, and so it can effectively speak your own language. And so instead
of designing backwards you can build
forward.
Now, I was told that you all like live
demos and so instead of talking about
this, I thought I would invite my friend
Seth Juarez up to show you some of our
AI power tools. So, Seth, hey, how's it
going, my friends? They uh closed my
laptop, so I'm going to open it here.
Um, I'm on demo one. And one of the
things about uh development is there's a
couple of tasks that are kind of
tedious, right? So, I'm going to show
you how AI can make three of these kinds
of things a lot easier to do. Number
one, when you get started on a project,
you always have a thousand questions and
no one to ask. I'll show you how GitHub
spaces makes this a little bit easier.
Number two, crushing your first task.
How do you get started on a task on a
brand new project? I'll show you how.
And then number three, diving deep. All
right, so let's start with
understanding. There is a new feature
called co-pilot spaces. And the cool
thing about Copilot Spaces is that
you can actually create a co-pilot
space, give it some a prompt, some
instructions, and then you can also
ground it in a bunch of different files.
You're going to see a project later on
uh uh in live that's going to show you
this. Uh but one of the things I want to
ask it is, for example, is this really
an agentic project? You're going to see
a live multi-agent voice thing in a
second. And this is the project. So I'm
going to go ahead and hit enter. And the
thing about GitHub Copilot now grounded
in spaces is that it can answer any
question that's grounded in the actual
facts of the project. So you can create
as many spaces as you want. They never
get tired of answering questions.
They're not only do they give you really
good answers like what does agentic
really mean? Great answer. It also gives
you the code of where it actually does
the thing. So you're able to right away
get started in understanding your
project. That's number one. Number two,
crushing your first task. This is a
project we released uh a couple of weeks
ago and what somebody literally was
like, "Hey, because I you know how you
make a demo and then you put it out and
there someone always is like, "Hey, um
can you write a readme?" I said no. And I assigned it to
Copilot and it did it. Let me show you
what that looks like. So, I'm going to
go ahead and create a new issue here.
And this issue is going to be like an
architecture, uh, diary. That's why I need better setup instructions, right?
And I'm going to give it some
descriptions. Let me show you how easy
it is to crush your first task. I'm
going to assign it to co-pilot. And
that's it. Watch the eyeballs right
here. In a second, you're going to see
little eyeballs just And now it's
working. And by the way, this is
actually happening live. This isn't like
me faking it or anything. And let me go
back to this actual project. It did the
work for me. It takes a couple of
minutes, but I don't have that kind of
time here. And let me show you the file
that it made for me in GitHub, uh, the actual GitHub Copilot coding agent. It made this
whole thing for me. You can oo a little
bit. Yeah. Oo. Yeah. This is delightful.
You can clap to this is great. I didn't
have to write this. Number two, crushing
your first task. And then number three,
diving deep. It turns out that if you've
seen GitHub uh copilot inside of Visual
Studio Code, you can actually extend
Copilot to talk to other agents and we
did just that with Omaly. There's a
there's another task that I have here
that was assigned to me. Let me go over
here to the issues uh that says I need a
new we need a new agent that reasons
about housing. That sounds like a deep
task. So, we're going to use GitHub
Copilot to help us here. I'm going to
say can you help me predict the housing
prices? And as this goes, what's
actually happening is GitHub Copilot and
Visual Studio Code is talking to another
agent called Amaly, an MLE (machine learning engineer) agent, that has two agents that
can reason about what you're asking and
can also write code. So I'm going to go
ahead and get this one, this file here,
this file here. Let me close it. I'm
going to move this over here and say
yes, these are the files. And what I'm
going to do is I'm going to say yeah,
use these files. I don't want to use this file. So I'm going to say, can you use these files? And
what it's going to do is it's going to
look at this file as if it was a machine
learning engineer. It's going to reason
about the actual contents of the file,
and I can ask it any question about
anything that I want. And I'm I'm kind
of out of time, so I want to show you
the output of this thing. It literally
builds an entire machine learning model
for me. And you're like, "Oh, it made a
mistake. Did it?" Uh you can see here uh
mistake. What was the mistake? Oh,
somebody put a string in the float
place. Yeah. And it knew about that and
it fixed it. So there you go. I showed
you three things. Jumping into a
project, understanding it, crushing your
first task, and diving deep, all with
the help of AI. Back to you,
Asha. All right. Thank you,
Seth. So both of those agents that you
just saw were built on something called
Foundry. And underneath the covers of
GitHub and all of these new agents is
there's a bigger change that's
happening. We're going from shipping
binaries and neat releases to shipping
agents that can retrain and redeploy and
change after they're live. And so we've
been thinking about what is the best way
to do that and studying patterns. And
something new is emerging called the
signals loop. It's the idea that you can
get better results if you actually
fine-tune the model to personalize it to
your outcome. Something that we've long
talked about, but we're actually seeing
it in the results. So our platform not
only supports 70,000 customers with
Foundry, but we also support every
single co-pilot in the company that is
built. And this one is called Dragon.
Dragon is the leading healthcare
co-pilot out there. It helps you uh
automate scribing and other things to
give physicians more time to give
patient care. They took an off-the-shelf
model and it was pretty good. Uh they
tried to synthetically fine-tune it to
make it better and it got a little bit
better, but then they took 650,000
interactions and they did a bunch of AB
testing and we got to an 83% character
acceptance rate. So dramatically better
quality. And so as we think about what
this actual signal loop requires, it
means that we're going from this linear
software factory to a continuous loop
that we need to build for. And that's
really what we've built Foundry to do.
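A highly simplified sketch of what such a signals loop might look like; every helper here is a hypothetical stand-in, not Foundry's or Dragon's actual pipeline.

```python
# Toy "signals loop": collect interactions -> evaluate -> fine-tune -> A/B test -> redeploy.
# Every helper passed in here is a hypothetical stand-in, not Foundry's or Dragon's pipeline.
def signals_loop(base_model, collect_interactions, evaluate, fine_tune, ab_test, deploy,
                 rounds: int = 5):
    current = base_model
    for _ in range(rounds):
        # 1. Gather real usage signals (accepted/rejected drafts, edits, thumbs up/down).
        interactions = collect_interactions(model=current, limit=650_000)

        # 2. Fine-tune a candidate on the interactions users accepted.
        candidate = fine_tune(current, [i for i in interactions if i["accepted"]])

        # 3. Only ship the candidate if it wins offline evals and a live A/B test.
        if evaluate(candidate) > evaluate(current) and ab_test(candidate, current):
            deploy(candidate)
            current = candidate
    return current
```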
We believe that the entire
infrastructure is changing to build
these agentic applications and these
agentic systems. We don't have time to
go through all of these, so I'm just
going to go through a few of them and
how they're changing. But models, we
believe that no one model is right for
every single product. And in often
oftentimes the best products have an
ensemble or a mixture of models that are
finely tuned for every single job to be
done. And so we've built a switchboard
and intelligent routing so you can have
access to 10,000 open models and
proprietary models and be able to have
it backed by the security and
reliability and the data residency that
you need.
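A toy sketch of the switchboard idea, with invented model names and rules rather than Foundry's actual router.

```python
# Toy model "switchboard": route each request to a model suited to the job.
# Model names and rules are invented for illustration, not Foundry's actual router.
ROUTES = {
    "code":      "large-code-model",     # tuned for diffs and tests
    "summarize": "small-fast-model",     # cheap, low latency
    "reasoning": "reasoning-model",      # multi-step planning
}

def route(task_type: str, needs_eu_residency: bool = False) -> str:
    model = ROUTES.get(task_type, "general-model")
    # A real router would also weigh cost, latency budgets, and compliance rules.
    return model + "-eu" if needs_eu_residency else model

print(route("code"))                  # -> large-code-model
print(route("summarize", True))       # -> small-fast-model-eu
```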
On knowledge: I think I heard that RAG was dead. So, RAG is used in 50% of AI applications today. Um, but it's single-shot and it's pretty naive. And so we've rolled out something called agentic RAG. It's the idea that you can kind of go around and iterate and evaluate and plan. It's multi-shot. And what we're seeing is a 40% improvement in accuracy on complex queries.
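Roughly, the difference between single-shot RAG and a multi-shot agentic loop looks like this sketch; `search` and `llm` are hypothetical callables, not the actual implementation.

```python
# Single-shot RAG vs. a multi-shot "agentic" retrieval loop (search and llm are hypothetical).
def naive_rag(question, search, llm):
    docs = search(question)                        # one retrieval, take it or leave it
    return llm(f"Answer using:\n{docs}\n\nQ: {question}")

def agentic_rag(question, search, llm, max_rounds=4):
    notes, query, draft = [], question, ""
    for _ in range(max_rounds):
        notes.extend(search(query))                # retrieve
        draft = llm(f"Answer using:\n{notes}\n\nQ: {question}")
        critique = llm(f"Does this fully answer '{question}'? "
                       f"Reply DONE or a better search query.\n\n{draft}")
        if critique.strip().upper().startswith("DONE"):   # self-evaluate
            return draft
        query = critique.strip()                   # plan the next retrieval round
    return draft
```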
We all know that tooling is changing. Tooling is becoming
infrastructure. You need more than text
to build a good agent. You need the
code. You need the containers. And we
have that as well. We have more than
1500 tools and we were one of the first
to adopt MCP and A2A. And finally,
agents and intelligence is only good if
you can actually hold it accountable.
And so we are rolling out aggressive
efforts in this area. We have the
leading evaluations SDK, the leading red
teaming agents. And we believe that
telemetry is not optional. We've integrated with OTel (OpenTelemetry) and we have
continuous observability no matter if
you've built your agent on our platform
or you've built it somewhere else. Today
more than 50,000 agents are built every
single day using our loop on our
platform. Now the platform is modular
but we've also made an effort for it to
be open. And I want to talk to you about
a couple of things on the open side that
I'm really excited about. The first is
GigaPath. It is the first model of its
kind. It's an open model and it's the
first one that can understand a
pathology slide. A pathology slide is 100,000 pixels by 100,000 pixels; if you printed it out, it would be the size of a
tennis court. It's the first one that
can understand it without downsampling
or tiling. And it does that because
we've used dilated attention which is a
technique we borrowed from speech
modeling. And so now you can understand the immune tumor microenvironment without doing it in patches, without losing the macro environment. You can do it at a micro level for the first time. And that's an open model on our platform.
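A toy illustration of the dilation idea (not GigaPath's actual implementation): each position attends only to every k-th earlier position within a window, so the same compute reaches much further back.

```python
# Toy illustration of the dilation idea behind dilated attention (not GigaPath's code).
# With dilation 1 this is a plain sliding window; larger dilations let each token
# reach much further back for the same amount of compute.
import numpy as np

def dilated_attention_mask(seq_len: int, window: int, dilation: int) -> np.ndarray:
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        for j in range(max(0, i - window * dilation), i + 1):
            if (i - j) % dilation == 0:
                mask[i, j] = True   # token i may attend to token j
    return mask

print(dilated_attention_mask(8, window=2, dilation=2).astype(int))
```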
Obviously, everybody's
following DeepSeek and there was an
update to the R1 model a couple of days
ago. Uh that update is on our platform
today on Foundry, backed by all of our
security and safety for all of you to
use. And finally, we're continuing to
invest in A2A and MCP and all of the
open protocols. I think the big thing
for us to think about is that we believe
that these protocols will continue to
come along and they're going to be
popular and we're going to support them
all. so you can work with the tools that
you
love. Now, I want to show you how simple
it is to build an agent, but not just
one, multiple agents that are useful.
We've got another demo. Please welcome
Amanda and
[Applause]
Elijah. What better way to showcase our
Foundry Agent Factory than by
demonstrating it live than taking you
behind the scenes with us to build the
agents and ensure they're safe and
secure. Before we dive into our multi-agent application, let's go over to demo
three where Elijah is going to show you
how to build a single agent in VS Code.
Awesome. So, jumping right into VS Code,
you guys will notice that I have
installed the Azure AI Foundry
extension. This extension is awesome
because it allows us to see all of the
models, agents, as well as threads that
I have associated with my project. And I
want to take a moment here to talk about
threads. As you guys know, threads is an
integral part of agents, and it's
critical for the transparency aspect of
being able to see what the agent is
doing at each step of the way. So, here
I can see a thread that Amanda created
earlier that's saying, "Hey, Elijah's a
product manager. Can you send him a
personalized email?" And we'll be using
a personalized email agent today. So,
then it receives from a contact list. It
sends us some information. And what's
great about this is that I can see the
tool calls that it's used as well as
some information around prompting and
and tokens. So, that's great. Then I can
look at the actual agent which I can see
here and we have the ID, the name, the
system prompt as well as the tools are
being used. Now that's great in the UI
but let's jump right into the actual
code of how we built this. So going in
here you can see I'm using the Azure AI
Foundry, uh, Agent Service SDK that Asha just talked to us about. So I'm initializing that using our project client, creating the actual agent, and
then giving it a set of tools. And so
today in the agent service we have a
bunch of different tools. Today we're
using the Bing grounding tool, the file search tool, and the OpenAPI tool. But
what's great is that I could use a
variety of tools here. I could use MCP
servers. I could use external APIs and I
could even use other agents using the
Foundry uh connected agent tool. And
then I create the agent, use the model,
and then give it some instructions. And
then finally, I can execute this agent here.
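Roughly, the code being described could look like the sketch below with the azure-ai-projects Python SDK; exact class and method names vary across SDK versions, and the connection string and tool wiring are placeholders, so treat this as an assumption-laden outline rather than the demo's actual code.

```python
# Approximate sketch of creating and running a Foundry agent with the azure-ai-projects
# Python SDK. Method and parameter names vary between SDK versions (some betas use
# assistant_id instead of agent_id), so treat this as an assumption-laden outline.
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str="<your-foundry-project-connection-string>",   # placeholder
)

# Create the agent: pick a deployed model, give it instructions, attach tools
# (Bing grounding, file search, OpenAPI, connected agents, ...). Tool wiring elided.
agent = project.agents.create_agent(
    model="gpt-4o",
    name="personalized-email-agent",
    instructions="Draft personalized outreach emails from the contact list.",
    tools=[],
)

# Threads hold the conversation so every step of the run stays inspectable.
thread = project.agents.create_thread()
project.agents.create_message(
    thread_id=thread.id,
    role="user",
    content="Elijah is a product manager. Send him a personalized email.",
)

# Execute the agent against the thread and read back the run status.
run = project.agents.create_and_process_run(thread_id=thread.id, agent_id=agent.id)
print(run.status)
```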
And it's important to also note that I'm executing a Foundry agent, but
I could use a wide variety of agents
here. I could use LangChain agents. I know our friends from LangChain are here today. I could use CrewAI agents, um, or even, as Asha mentioned, multiple agents using the A2A protocol. So this is
awesome. But now let's see these agents
in action. Yeah. Now let's switch over
to demo 2 again and dive into our app.
So build events is a multi-agent event
planner application powered by a voice
controlled agentic orchestrator that
dynamically delegates tasks to sub-agents we built using our Foundry Agent Service.
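A stripped-down sketch of that orchestrator pattern, with invented sub-agents and keyword routing standing in for the real intent classification and Foundry connected agents.

```python
# Toy voice-orchestrator pattern: one router delegates tasks to specialist sub-agents.
# The sub-agent functions are hypothetical stand-ins for Foundry connected agents.
def research_agent(request: str) -> str:
    return f"[research agent] findings for: {request}"

def email_agent(request: str) -> str:
    return f"[email agent] drafted email for: {request}"

def content_agent(request: str) -> str:
    return f"[content agent] LinkedIn post for: {request}"

SUB_AGENTS = {
    "find": research_agent,     # "find top angel investors..."
    "email": email_agent,       # "send a personalized email..."
    "post": content_agent,      # "draft LinkedIn content..."
}

def orchestrate(utterance: str) -> str:
    # A real orchestrator would use an LLM to classify intent; keyword match keeps this runnable.
    for keyword, agent in SUB_AGENTS.items():
        if keyword in utterance.lower():
            return agent(utterance)
    return "Sorry, no sub-agent can handle that yet."

print(orchestrate("Find top angel investors attending the AI Engineer World's Fair"))
```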
At Microsoft Build, we showed how this
app could actually help us plan a
hackathon in San Francisco for 200
people. But truthfully, this can help us
plan any type of event. And today, we're
going to show how we could actually use
it to plan and prepare for investor
meetings at the AI World Fair in San
Francisco. So, let's dive into
it. First, we are going to ask it who we
should be connecting with.
How can I assist you with your event
planning needs today? Find top angel
investors backing pre-series A, AI-native SaaS startups who are attending the AI Engineer World's Fair in San Francisco.
I'm gathering information on the top
angel investors attending the event. Is
there anything else you'd like to know
or plan while I continue this task? As
you can see, our first agent is now hard
at work. This is the research agent. We
provided it access to the entire speaker
list as well as some attendees that had
said online that they were coming and it
has access to search to the web using
our Bing grounding tool. So, let's see
what this comes up with. And everything
you're seeing today is completely live.
So, bear with it. Send information on
top angel investors who are attending
the AI engineering world. Send a
personalized email to Asha Sharma now
asking to connect at the event.
I'm drafting a personalized email to
connect with Asha Sharma at the event.
Is there anything else you'd like me to
include or any other assistance you
need? So, as you can see, our second
agent has drafted the email to Asha Sharma, focusing on opportunities in AI-native SaaS startups and potential
collaboration. Would you like to proceed
with any other tasks? The email draft to
Asha Sharma is ready. Awesome. Now,
could you take a live photo of Elijah
and I and edit us to be AI agents? Um,
cartoon-like. Think Azure OpenAI meets
Pixar.
Please go ahead and click on the camera
icon to take a live photo. The image
will be transformed into a cartoon-like
style depicting you both as AI agents.
Great. In the meantime, could you also
draft some content asking or saying
we'll be in San Francisco and are happy
to meet with anyone at the World Fair to
talk about AI agents?
I'm drafting LinkedIn content to
announce your presence at the AI
engineering world's fair and your
eagerness to connect with others to
discuss AI agents. The image will be
ready shortly as well. Look at that, LinkedIn content announcing your presence. This looks great.
Could you now post both this image and
the draft to LinkedIn?
The LinkedIn post with the edited image
is being published. Is there anything
else you'd like assistance with?
And so now, what tools is it using to do
this, Amanda? So right now we actually
in this past sequence had two agents be
kicked off simultaneously. One was generating and editing our image and
then the second agent was drafting the
content. Your LinkedIn post has been published; looks like it's live. Moment of truth. "...anything else I can help you with, just let me know." And as you can see, the post is now live.
Amazing. That's awesome. Yeah. Now back
to Elijah. One quick final note. Asha,
you talked about earlier how important
evaluations are. So if we want to go to
demo three really quick here at Azure
OpenAI agent service, we're committed to
making sure that our agents are
consistently delivering high quality
results. So, as Asha mentioned earlier about the evaluation SDK, we integrated this right into our CI/CD pipeline so that we can
evaluate our agents every time we make
updates. So, you saw today how to
create, use, and now evaluate agents.
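A hedged sketch of what "evals in the CI/CD pipeline" can mean in practice: a script that scores a small eval set and fails the build below a threshold. The agent call and grader here are stubs, not the actual evaluation SDK.

```python
# Toy CI gate: score a small eval set on every change and fail the build on regressions.
# `run_agent` and `grade` are stubs standing in for the real agent and the evaluation SDK.
import sys

EVAL_SET = [
    {"input": "Who should we meet at the World's Fair?", "must_mention": "investor"},
    {"input": "Draft an intro email to Asha Sharma.",    "must_mention": "email"},
]

def run_agent(prompt: str) -> str:
    # Placeholder: call your deployed agent here.
    return f"Here is a draft email about investors for: {prompt}"

def grade(output: str, must_mention: str) -> float:
    return 1.0 if must_mention in output.lower() else 0.0

if __name__ == "__main__":
    scores = [grade(run_agent(c["input"]), c["must_mention"]) for c in EVAL_SET]
    pass_rate = sum(scores) / len(scores)
    print(f"pass rate: {pass_rate:.0%}")
    sys.exit(0 if pass_rate >= 0.9 else 1)   # a non-zero exit code fails the pipeline
```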
And with that, we'll turn it right back
over to you, Asha. Thanks so much.
Thanks, guys.
We didn't know how that would go.
Yesterday, the internet wasn't working.
So, uh, that's amazing. And I also have
had about 15 emails from Amanda over the
last 24 hours. Uh, which I appreciate.
Okay, I uh we have to we have to go
quickly here, but uh look, one of the
last big things I want to talk to you
about is how models don't just live in
the cloud. Even for the last 10 years,
we've been working really hard to do
that because that's where your data is.
Your data is now everywhere. Um and this
isn't just a hobbyist thing. We are
seeing real applications of this at
scale. Um, I was just at a bottling
plant and there's an agent there that is
taking 100,000 sensor readings per second and
allowing it to actually detect risks and
flag and throw the summaries in the
cloud. We're building an agent right now
for a hospital system that basically
summarizes the longitudinal data. Uh,
and if you work in healthcare, you know
that that can't be in the cloud. It has
to be local because of compliance and
privacy reasons, but the cloud should be
able to read it and access it. And we're
working with automobile companies
because they're building uh automotive
models that we want to work in tunnels
and then we want it to actually make
your trip better and smarter. And so
with all of this local cannot be uh a
fork. It has to be a core part of the
platform. You should be able to create
an agent in the cloud and it should run
and act and reason uh locally. And so
I've got one more demo. Uh Seth is
coming back out and then we will wrap
up. Seth, why don't you show us how to
get local? Right. Let's do this. Here
we're going to show you another live
demo. Uh let me make this um let me uh I
have a little VM running here, but I
need to put the password in. And we are
very concerned about uh Wow. They're
they're they're are they playing the
walk-off music on me?
[Music]
Our next presenter is the founding
partner of Conviction Capital. Please
join me in welcoming to the stage Sarah Guo.
[Applause]
The hardest problem in AI will remain AV, as
in the last two decades of technology.
Um, actually, you know what? I will get
us started while we're doing AV setup by
seeing if I can just tell you about the
uh Slido poll. You guys can do it while
we're waiting here. Um, so if you go to
Slido, I'll pull it
up.
Oh, great.
And a God's
willing. No, no, no. It's It's just
blank screen
now. Okay. So, the Slido code, go to
slido.com.
Um uh and the code is 2100
0163 guys. We're we're about to ask AI
to tell us a joke. Okay.
Um you guys know you have no
internet. Okay. Okay, the slide out code
is
2100163 for people who can get
it. I'm actually going to do it like
super manual. Um, so first question for
you, uh, what is definitely happening by
the end of
2026, AI agents ship code directly to
prod in your environment, right? Not in
like some, uh, playground. Uh, voice AI
replaces text for most business
communication. Inference cost drops
below a cent per million tokens. Or
Wall-E like we're all
chilling. Any of these?
First one, ship ship code directly to
prod. Okay, this is a hopeful set of
engineers. All of you want to get rid of
your own jobs. I love that.
The good thing is I also don't have
internet so I can't look at my next
question.
No, it's going to be good. It's going to
be
good.
Um I present from your phone. Uh no, I
was going to go through poll questions
while we're trying to do AV setup.
Yeah.
While this is happening, I'm actually
just going to introduce myself so we're
not wasting the time. Um, my name is
Sarah Guo. I, uh, helped start an AI-
native venture fund. It's called
Conviction. And we got going about two
and a half almost three years ago now,
just before the starting gun of chat
GPT. Um, as always in technology,
investing most of life, it's better to
be lucky than right. Hopefully, you can
be a little of both. Um, uh, and and the
point of having a new venture firm, I I
worked at Greylock. It's kind of a
traditionalist venture firm, a great
one. My partner Mike Vernal used to work
at Sequoia. You guys have probably heard
of them. Uh, was that we think like
actually, you know, at risk of sounding
like those people, this time it's
different, right? um that this is the
largest technology revolution that we
get to be a part of and that there's so
much change in the technology, the types
of businesses you can build, the product
decisions you make, what challenges
these startups and big companies face
that, you know, maybe there's
opportunity for like a startup VC as
well. And so, um you know, I'm I'm
thrilled to be working with like really
interesting people in the industry so
far. Uh Mike and I are investors in
companies like Cursor, Cognition,
Mistral, Thinking Machines, Harvey, OpenEvidence. So a mix of, um, Baseten, like a mix of infrastructure, model, and application-level companies. And you
know one more are my kids coming up yet?
Okay, cool. Um one more uh just
observation from the last two and a half
three years of doing venture. I I was an
investor for about 10 years before that
is I have never seen the like just the
uptake from users that has been possible
in the last couple years. I'm sure all
of you have experienced that it is not
trivial. um you know AI product and AI
engineering uh and this is kind of the
theme of my talk so I'm sorry to give
away the punch line but it's quite a bit
harder than people had hoped um but the
the value creation is massive um we see
companies going from 0 to 10 50 100
million in run rate very very quickly
faster than we've ever seen in any
technology revolution before um and I
get asked a lot like where are we in the
AI hype cycle is the winter coming is
this like infinite AI summer and I would
say um having actually been an investor
or an operator through a macro cycle at
this point like I try to pay very little
attention to what the marketing world is
saying or even what the markets are
saying right because you know if you're
if you're an operator or an
investor maybe you care about what the
stock price does every day but really
you want to figure out if the company
you're working for or starting is going
to work long term right and if the
products are going to work long term and
the things that I get most excited about
are seeing like crazy usage numbers.
Okay, thank you amazing AV team.
Okay, I'm gonna I'm gonna go real quick.
Um, where are my presenter
notes? Okay, we're we're just going to
keep going. It's cool. It's cool. Um, so
I want to talk really quickly about uh
just a few things today. I think we lost
a little bit of time, but let's let's
say let's talk about capabilities, what
we're seeing work in the market, and
then um uh maybe some advice on like
what to build if those are, you know, a
question you're considering. Uh I think
the shorthand that we're going to use in
this presentation is like cursor for X,
right? Uh and I do think that's a really
massive opportunity. Uh the first thing
in capability for this past year is
clearly reasoning. Um, reasoning's a new
vector for scaling intelligence with
more compute. The labs are really
excited about this because they get to
spend more money and get more output.
Um, but we should also be really excited
about this in terms of unlocking new
capabilities, right? If you just put
aside how it works, it's a confidence
boosting implementation detail. Um, but
we should expect more capability. You're
unlocking a new set of use cases like
transparent highstakes decisions where
showing the work matters. uh sequential
problems, problems where you need to do
systematic search. I I think this looks
like a lot of problems that we're
excited about and um face in knowledge
work every day. Uh as you have just seen
demos of and I'm sure are working on
given reasoning, people are really
excited about agents. um to put a you
know I want to do like the Steve Ballmer
impression that's like agents agents
agents agents agents agents but uh I um
you have to give me more than 12 minutes
to like get that sweaty
uh, but the non-marketing definition that I think of is: it's software that takes some set of steps, it plans, it includes AI, it takes ownership of a task, and it can hold a goal in memory, you know, try different hypotheses, backtrack. It
ranges from super sophisticated to super
simple. Um, some of the tools that you
might use to accomplish a task include
other models or search. And largely it's
just like AI systems that do something.
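To make that non-marketing definition concrete, here is a minimal agent loop sketch: hold a goal, plan a step, act with a tool, record the result, backtrack on errors. `llm` and the tools are hypothetical stand-ins.

```python
# Minimal "agent" in the non-marketing sense: plans, uses tools, holds a goal, retries.
# `llm` is a hypothetical completion function; the tools dict maps names to callables.
def agent(goal: str, llm, tools: dict, max_steps: int = 10) -> str:
    memory = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        # Plan: decide the next action given the goal and everything tried so far.
        plan = llm("Tools you can call: " + ", ".join(tools)
                   + "\nHistory:\n" + "\n".join(memory)
                   + "\nReply as `tool_name: input`, or `DONE: final answer`.")
        if plan.startswith("DONE:"):
            return plan.removeprefix("DONE:").strip()
        name, _, tool_input = plan.partition(":")
        # Act: run the chosen tool; record failures so the model can backtrack next step.
        try:
            observation = tools[name.strip()](tool_input.strip())
        except Exception as exc:
            observation = f"ERROR: {exc}"
        memory.append(f"{plan} -> {observation}")
    return "Gave up after max_steps. Notes:\n" + "\n".join(memory)
```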
Um, and that's not a chatbot that looks
more like a colleague. Uh, and you know,
one thing that I think we have a really
unique vantage point on is, uh, we back
a small number of companies at
conviction, but we also run a grant
program for AI startups. It's called
Embed. We get thousands of applications
every year um and includes like user
data and revenue data and like really
amazing people and the number of agent
startups has gone up 50% over the last
year and a lot of them are working like
we do see stuff that's working in the
real world and uh that's super exciting.
Uh other modalities are progressing too.
I'm sure a lot of people are using
voice, video, image generation, um, even
beyond, you know, Studio Gibli. But you
have companies like HeyGen and ElevenLabs and Midjourney that are rocketing past 50 million of ARR. These are real businesses
now. Um, I want to see if I can quickly
play for you. They told me to express
myself, so I did. They told me to
express myself, so I did. Now I'm banned
from three coffee shops. Hands can hurt
or heal. That's the difference between
chaos and creation. So, if you're
wondering where Q3 is headed, so if
you're wondering where Q3 is headed,
here's the thing. Consistency always
beats urgency. We've got the projections
ready and let's just say it's looking
solid. I would definitely recommend it
to anyone. I would definitely recommend
it to So, I I think like if you just are
looking for artifacts of improvement,
this is from a company called HeyGen.
um you can make clones of yourself of
fake people and like you have gestures
and expressions that uh reflect emotion
and content now, right? So these models
work together and like I don't know
about you guys, but looking at that last
gal like I feel influenced. I don't know
what the bunny is, but I would buy it.
Um and and and so I think like huge
swaths of the economy are going to be
affected by this sort of multimodality.
um some investors or operators would say
multimodality would just be for niche
verticals that enterprises don't have
you know your average enterprise doesn't
have that much voice video image data
today um but I think that changes right
when you can do stuff with this data
when it is structured and understood
there's more reason to capture it and I
think of like how much video do all of
us watch every day it's one of the
highest bandwidth communication methods
and we're just going to use more of it
um we think voice is where we're going
to see, uh, applications first in
business workflows um because it's
already a very natural communication
mode. So uh everything from medical
consults to lead generation places you
already had business voice you just
couldn't scale it before. Uh I I think
that's where we're going to see it
first. But as these other modalities
become more controllable and also less
costly, we should see all of them. Uh I
I think it's safe to say you can expect
capability improvement in every part of
the model layer, which is really
exciting. A lot of people were talking
about the uh the data wall or like the
end of AI summer, but for anybody who's
building applications, I I'm at least to
tell you one person's opinion is uh it's
not coming. Um and and then usefully for
all of us, uh that market for model
capabilities is getting more
competitive, not less. Um, Sam Altman
himself, I think, said it best. Last
year's model is a commodity, which is a
scary thing for a model provider to say
because last year's model is now pretty
damn good, right? The numbers tell the
story. GPT-4 went from $30 per million
tokens to $2 in about 18 months. The
distilled versions of that are like now
10 cents. So, we can really use them
very broadly. Um, if you look at this
chart, uh, green is Google,
yellow is Anthropic. So, you see, you know,
it's a real mix. This is data from OpenRouter. So, thank you OpenRouter for
that. But um you really saw Claude cut
into OpenAI's market share and Google
come roaring back with Gemini. Uh this
data is obviously a little biased
because a lot of people just go direct
to OpenAI, but if you go multi-model, there really is a mix, and you do
have credible new players like SSI and
thinking machines, some of the best
researchers in the business with
orthogonal technical approaches um
entering the fray as well. And I'm sure
many of you have experimented with
DeepSeek, uh, coming out with releases of
you know both base and reasoning models
that are uh reasonably competitive with
a claimed fraction of the training cost
like we should just assume that open
source will do as open source does and
we can rely on the model market to
compete for our business which is really
exciting. Um and so the view is plan for
a world that is multi-model. Um, tools like OpenRouter or inference platforms like Baseten help with that, uh, and uh I think
like be comfortable with that I I am
okay so we have all this capability
let's ship uh shift quickly to the
application layer we have to start with
Cursor: uh, a million to 100 million of ARR
in 12 months and half a million
developers I assume all of you uh zero
sales people to start that's not growth
that is a killer application um
cognition which started with more
autonomy is already the top committer in
many companies
feeling a little threatened but also
excited because recruiting is hard and
then windsurf who's on a tear itself and
really beloved is being acquired by
OpenAI for $3 billion. So we know for
sure that the labs don't think that they
can just you know steamroll everyone
right? Lovable and Bolt hit 30 million of ARR each in a handful of weeks, uh, helping
non-engineers vibe as well so you know
our our our ranks are expanding um and I
think it's useful to just like analyze a
little bit why code is first uh
fundamentally it is text, it's logical language with structure, right? So much of coding is
sophisticated boilerplate. Like we all
love engineering, but some of it is like
craft work, not new algorithm work. Um
you don't need AGI to write a like uh an
API endpoint or um a React component.
Second, you have deterministic
validation. You can automatically check
if code works, run tests, compile,
execute, do things developers would do.
And third, researchers believe code is
crucial for AGI, right? So they poured
resources into it. Um and uh code became
a key benchmark and a training priority
and an area for data collection. But I
think the last point is um the money
point to me. Uh engineers built tools
for engineers. They understood the
workflow intimately and that made all
the difference. And that last part is
the playbook for every other industry.
I'm sure people are building things that
serve beyond engineers. And I don't
think the winners will just be AI
experts learning those domains. there'll
be customer centric like problem centric
builders who understand AI and then
redesign workflows from first principles
around manipulating those models. Um and
so I think that's really the opportunity
to build cursor for X. Um let's think a
little bit about what that means. Cursor
is not a single model. Uh you know one
model's doing diffs, one's doing merge,
one's embedding the files. They
manipulate and package up the context.
They prompt the models very skillfully.
They let engineers avoid repetitive
tasks and standardize with things like
um cursor rules. And then if you're
using cursor in a team or even yourself
regularly, retrieval accuracy gets
better the more you use it with coverage
and freshness. And so all of this
happens in a UX that makes sense, right?
Like I, you know, I use VS Code. I'm
familiar with it. My shortcuts work. Um
and they make it safe to say yes, right?
Like green for add and red for subtract
makes sense. I can scroll through it.
Um, and it's fast enough that I don't
get frustrated. So my view is: Cursor, if it's a wrapper, it's like a very nice, thick, perhaps 14 or 15 billion dollar wrapper, right? It's like if your burrito
was 80% wrap and 20% fill, but you got
to choose the fill and there's like an
empty like an open market for fill,
right? Um, and so where's the value now? It may not be in the protein. It's kind of in the company.
Um, so like if we try to generalize that
recipe a little bit, if you are building
a generic text box, unless you're just learning to do this, please don't. OpenAI already won that, or it's just not very valuable to do. So
your domain knowledge, your workflow
knowledge can be the bootstrap. If you
already know what users in your industry
need, don't make them explain it. Uh,
build products that show up informed.
They collect and package context
automatically, including from other
sources, not just natural language,
present it to the models, use the right
models at the right time, now known as
orchestration, and present the outputs
to the users thoughtfully, right? Um, so
I do not think this is the end of the
GUI. Uh, I think you can capture and
enable workflow with these models. And
all this requires taste and a ton of
work. I'd argue that like some
version of this recipe is much of the
work each of us is going to do. So don't
listen to the labs from a user
experience perspective. The prompt is a
bug, not a feature. I think it's like a
stepping stone. Don't make me think as a
user. The best AI products, they feel
like mind reading because they are. Um
there's enormous headroom in building
these products and I I think that's
really exciting because that's what most
of us in this room have alpha on. Uh
what is a software company if not a very
thick like workflow wrapper most of the
time? That's true in 2015. It's true in
2025.
Um, besides code, where might you go
apply this? We think the opportunities
to build value around the LLMs exist in
every vertical and profession. Uh, but
here's something counterintuitive.
Beyond coding, one of the things that
I've been surprised by is that the most
conservative low tech industries seem to
be adopting AI fastest. We call this the
AI leapfrog effect internally. Um, these
are three portfolio companies. Um,
they're working. Sierra resolves 70% of
uh customer service queries for their
customers. They serve people that you
know you guys use like SiriusXM or ADT.
Harvey is, you know, two years in, well over $70 million of ARR. AI is essential now to being competitive in
the legal industry. Um there's a company
called Open Evidence uh which helps
doctors stay up to date with medical
research. You have to be a clinician to
use it but you know you give it your
medical ID number and you can do
intelligent search against um uh medical
research uh at the point of clinical
decision-making. Today it reaches a third
of doctors in the US weekly and the
average user uses it daily, right? And
so I think there's just examples of, you
know, huge value beyond ChatGPT. These
are companies that know their customer
and are solving real problems. As as a
piece of trivia that you may or may not
know, um Bret at Sierra is the chairman
of the board at OpenAI. Um OpenAI was
Harvey's uh seed investor. And if you
know, these people are not fretting about thin wrappers, so I suggest you don't
either. Okay. Finally, I'll make an
observation. A lot of people are excited
about full automation now. I'm sweaty enough. So: agents, agents, agents, agents, agents, agents. Um, but when we analyze the applications to Embed, I'd say, you know, it's gone up to 50%, you know, a doubling of applications from agentic startups in the last year. Um, I think
some people think co-pilots are
yesterday's news. They want to get to
the endgame, right? Like you know your
colleague and AGI. But in terms of what
works, like the data on what's driving
revenue, uh I think co-pilots are still
really underrated. We see a whole
spectrum of how much automation. And I
think the uh Iron Man analogy is still
really great here. Tony Stark's Iron Man
suit augments him, right? He can do all
these amazing things, but could also fly
around on command, could do some basic
tasks without Tony. And my experience
with these companies has been that human
tolerance for failure or hallucinations
or lack of reliability, it just reduces
dramatically as latency increases,
right? Um, so the path of least
frustration today for many domains is to
build great augmentation and then just
ride the wave of capability because we
know it's coming. And so my advice for
many domains would be to think about it like, you know, build the suit, and you can extend out to the suit that flies on its own once Tony, or any of us, is wearing it.
Um I'm not going to go through each of
these mostly because I lost time but um
there are a ton of opportunities. We put
requests for startups on our website.
We're interested in a couple different
categories of things. They go from uh um
like just good fit for purpose like the
law is a space of lots of text
generation, right? Um to things that
weren't possible before AI. My partner
Mike will say like this is a really
interesting era of machines
interrogating humans. What can you do if
you can go like collect data on demand
from people? Um we could talk to every
customer, not just the top 5% by
contract value. Um, we could root cause
every alert proactively, right? Versus
like just firefight. Um, and the mental
model is how can you build as if you had
an army of compliant, infinitely patient
knowledge workers.
Um, you know, one aside here is I think
there are many hard problems where like
the basic premise is the answer to them
is not in common crawl, right? The
reasoning around them is not in common
crawl. So um this would be robotics,
biology, material science, physics,
simulation. Um they require clever data
collection. Um probably interaction with
atoms, not just bits. Super scary uh for
a software person, but I think the juice
is worth the squeeze, right? The same
reasoning that crushes math olympiads
can seemingly navigate molecular space.
And I think there are some really
fundamental questions for um human
society that can be answered when people
work on these problems. And uh it's it's
really cool as a machine learning person
to meet people at the top of their field at the intersection of machine learning and all of these other areas because, like for you guys, the same architectures apply, right? And that's just, um, that's really
exciting.
Um how should we think about
defensibility? Did this
advance? Okay. So um one last point and
then I'll conclude here. Uh, some would
say stay out of the way of the labs.
Don't pick up pennies in front of the
steamroller, right? But I would offer um
what I think is an uncomfortable truth.
Execution is the moat in AI. Um, and
that's available to all of us. Cursor
arguably did not invent code completion.
They did not invent the model. They
didn't invent their product surface
area, right? They just outexecuted on
every dimension of this. They shipped a
great experience faster than their
competitors could copy, and they captured the hearts and minds of developers, at least in this turn. Um, I don't
mean this to be cruel but I often get
asked about like counter cases and the
importance of first mover advantage.
Let's be brutally honest. In contrast,
like, Jasper had first-mover advantage and brand. They raised $125 million, but their first product was a series of prompts and a text box and, like, very good SEO. And, like, you have to keep running: ChatGPT, you know, crushed that first iteration pretty quickly. And so, uh,
I don't think this is satisfying advice,
but I think it is like real from the
trenches. Build something thick and stay
ahead. And like no domains are out of
the question. Um, magical AI experiences,
they build customer trust and drive
adoption. And a lot of the data we need
to improve these experiences and the
context we need it is not easily
available today. And that advantage is
you know uh open for the taking and not
for the
labs. So I guess in conclusion I think
the opportunity is early and really
massive. Like I've made a career bet on
it. Um I I think many of you are. We're
in the dialup era of AI and we're moving
pretty quickly to to broadband. Um,
Instagram came four years after the
iPhone. Like, I was there when
Greylock made that investment. Um, Uber
five years. Uh, DoorDash, six, right? So, the truly transformative companies weren't necessarily the first people to recognize the change; the opportunity goes to those who reimagine the experiences. Um, and the game board
keeps getting shaken up. That's the
thing that's different this time, right?
It's like getting a new iPhone that's
actually different every 12 months. And
um so you have like new model release,
new capability breakthrough, you know,
one-tenth the cost. And every time the
game board turns, I think there are like
there's an opportunity to to win
again. Okay. Um so I I'll give you one
last sentence and be chased off the
stage. This was not my fault. Um here's
what I really want you to remember. Uh
you as the engineers got the magic
first. Um, the Anthropic Economic Index said that 40% of use was still coding. That's not like 40% of the
economic opportunity in the world,
right? And so it is the job of everyone
in this room and you know globally
online to be the translators for the
rest of the world. So I encourage you to
build something revolutionary.
[Applause]
Thanks.
[Music]
Our next speaker returns for his third
time to the AI engineer keynote stage.
He is the founder of Datasette, co-creator of Django, and, as Swyx calls him, a legendary AI engineer. Please join me in welcoming to the stage, Simon Willison.
Hey.
Oh, good morning AI engineers. Um, so
when I signed up for this talk, I said I
was going to give a review of the last
year in LLMs. With hindsight, that was
very foolish. This space keeps on
accelerating. I've had to cut my scope.
I'm now down to the last six months in
LLMs, and that's going to keep us pretty
busy. Um, just covering that much. Um,
the problem that we have is I counted 30
significant model releases in the past
six months. And by significant I mean if
you are working in the space you should
at least be aware of them and somewhat
familiar like have a poke at them.
That's a lot of different stuff. And the
classic problem is how do we tell which
of them are any good? There are all of
these benchmarks full of numbers. I
don't like the numbers. There are the
leaderboards. I'm kind of beginning to
lose trust in the leaderboards as well.
So for my own work I've been leaning
increasingly into my own little
benchmark which started as a joke and
has actually turned into something that
I rely on quite a lot. And that's this: I
prompt models with "Generate an SVG of a pelican riding a bicycle." I have good
reasons for this. Um firstly, these are
not image models. These are text models.
They shouldn't be able to draw anything
at all, but they can output code and SVG
is a kind of code. So that works.
Pelican riding a bicycle is actually a
really challenging problem because
firstly, try drawing a bicycle yourself.
Most people in this room will fail. You
will find that you can't actually quite
remember how the different triangles fit
together. Likewise, pelicans, glorious
animals, very difficult to draw. And on
top of all of that, pelicans can't ride
bicycles. They're the wrong shape. So,
we're kind of giving them an impossible
task with this. What I love about this
task, though, is they try really hard
and they include comments. So, you can
see little comments in the SVG code
where they're saying, "Well, now I'm
going to draw the bicycles, draw the
wheels, I'll try." It's kind of fun.
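As an aside, here's a minimal sketch of running that benchmark prompt yourself with the llm Python library (Simon's own tool, mentioned later in the talk); the model ID is just an example, and you'd need an API key configured for whichever provider you pick.

```python
# A minimal sketch of the pelican benchmark prompt using the llm Python library.
# The model ID below is an example; any model you have configured will work.
import llm

model = llm.get_model("gpt-4.1-mini")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
svg = response.text()

# Responses often arrive wrapped in a Markdown code fence; for a quick look it's
# fine to save the text as-is and open it in a browser.
with open("pelican.svg", "w") as f:
    f.write(svg)

print(f"Wrote {len(svg)} characters to pelican.svg")
```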
So, rewind back to December. December in LLMs: a lot of stuff happened. Um, the first release of that month was AWS Nova, Amazon Nova. Amazon finally put out models that
didn't suck. They're quite good. They're
not great at drawing pelicans. Like the
the Pelicans are unimpressive, but these
models are a million token context. They
behave like the cheaper Gemini models.
They are dirt cheap. I believe Nova
Micro is the cheapest model of all of
the ones whose prices I'm tracking. So,
they are worth knowing about. Um, the
most exciting release in December from
my point of view was Llama
3.3 70B. So the B stands for billion.
It's the number of parameters. I've got
64 GB of RAM on my Mac. My rule of thumb
is that 70 is about the most I can fit
onto that one computer. So if you've got
a 70B model, I've got a fighting chance
of running it. And when Meta put this out, they noted that it behaved the same: it had the same capabilities as their monstrous 405B model that they put out earlier. And that was a GPT-4 class
model. This was the moment 6 months ago
when I could run a GPT-4 class model on the
laptop that I've had for 3 years. I
never thought that was going to happen.
I thought that was impossible. And now
Meta are granting me this model which I
can run on my laptop and it does the
things that GPT4 does. Can't run
anything else. All of my memory is taken
up by the model. But still pretty
exciting. Again, not great at pelicans
and bicycles. That's kind of unimpressive.
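For a rough sense of why 70B is about the ceiling for a 64 GB machine, here's a back-of-the-envelope sketch; the quantization levels and overhead factor are illustrative assumptions, not Simon's exact numbers.

```python
# Back-of-the-envelope RAM estimates for a 70B-parameter model at common
# quantization levels. The ~10% overhead allowance is a rough assumption.
PARAMS = 70e9

for name, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    total_gb = weights_gb * 1.10
    print(f"{name:>5}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB with overhead")

# fp16 needs ~140 GB of weights, 8-bit ~70 GB, 4-bit ~35 GB: only the 4-bit
# version squeezes into a 64 GB Mac, which matches the "nothing else runs" experience.
```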
Christmas Day, we had a very notable
thing happen. Deepseek, the Chinese AI
lab, released a model by literally
dumping the weights on hugging face, a
binary file with no readme, no
documentation. They just sort of dropped
the mic and dumped it on us on Christmas
Day. And it was really good. This was a
685B giant model. And as people started
poking around with it, it quickly became
apparent that it was probably the best available open-weights model: freely available, openly licensed, and just dropped on Hugging Face on Christmas Day for us. I mean, it's not a good pelican on a bicycle, but compared to what we've seen so far, it's amazing, right? This
is we're finally getting somewhere with
the benchmark. Um but the most
interesting thing about V3 is that the
paper that accompanied it said the
training only costs about $5.5 million.
And they may have been exaggerating, who
knows? That's notable because I would
expect a model like of this size to cost
10 to 100 times more than that. Turns
out you can train very effective models
for way less money than we thought.
It's a good model. It was a very nice Christmas surprise for everybody. Fast forward to January.
Um, in January we get DeepSeek again. DeepSeek strikes back. This is what
happened to Nvidia's stock price when
DeepSeek R1 came out. Um, I think it was
the 27th of January. This was Deepseek's
first big reasoning model release.
Again, open weights. They put it out to
the world. It was benchmarking up there
with o1 on some of these tasks and it
was freely available. And I don't know
what the training cost of that was, but
the Chinese labs were not supposed to be
able to do this. We have, like, trade restrictions on the best
GPUs to stop them getting their hands on
them. Turns out they'd figured out the
tricks. They'd figured out the
efficiencies. And yeah, the market kind
of panicked. And I believe this is a
world record for the most a company has
dropped in a single day. So Nvidia gets to stick that one in their cap and hold on to it. But kind of
amazing. And of course, mainly this happened because the first model release was on Christmas Day and nobody was paying attention. Um, and look
at its pelican. Look at that. It's a
bicycle. It's probably a pelican. It's
not riding the bicycle but still it's
got the components that we're looking
for. But again, my favorite model from
January was a smaller one, one that I could run on my laptop. Mistral, out of France, put out Mistral Small 3. It was a
24B model. That means that it only takes
up about 20 GB of RAM, which means I can
run other applications at the same time.
I can actually run this thing and VS
Code and Firefox all at once. And when
they put this out, they claimed that
this behaves the same as Llama 3.3 70B. And remember, Llama 3.3 70B was the same as the
405B. So we've gone 405 to 70 to 24
while maintaining all of those
capabilities. The most exciting trend in
the past 6 months is that the local
models are good now. Like 8 months ago,
the models I was running on my laptop
were kind of rubbish. Today, I had a successful flight where I was using Mistral Small for half the flight, and then
my battery ran out instantly because it
turns out these things burn a lot more
electricity. But that's amazing. Like
this is, if you lost interest in local models, as I did eight months ago, it's worth paying attention to them again. They've got good now. February. What
happened in February? Um, we got this
model, a lot of people's favorites for
quite a while. Claude 3.7 Sonnet. Look
at that. What I like about this one
is pelicans can't ride bicycles. And
Claude was like, "Well, what about if
you put a bicycle on top of a
bicycle?" And it kind of works. So,
great model. It was also Anthropic's
first reasoning model, 3.7, as well.
Um, meanwhile, OpenAI put out
GPT-4.5, which was a bit of a lemon, it turned out. Um, the interesting thing about GPT-4.5 is it kind of showed that you can throw a ton of money and training power at these things, but there's a limit to how far we're scaling with just throwing more compute at the problem, at least for training the models. It was also horrifyingly expensive: $75 per million input tokens. Compare that to OpenAI's cheapest model, GPT-4.1 Nano.
It's 750 times more expensive. It is not
750 times better. Um, and in fact,
OpenAI 6 weeks later, they said they
were deprecating it. It was not long for this world, 4.5. But looking at that pricing is
interesting because it's expensive, 75
bucks. But if you compare it to GPT-3 Davinci, the best available model 3 years
ago, that one was $60. It was about the
same price. And that kind of illustrates
how far we've come. The prices of these
good models have absolutely crashed by a
factor of like 500 times plus. And that
trend seems to be continuing for most of
these models. Not for
GPT-4.5, and, uh, not for o1. Uh, wait, no. And then we get into March, and that's where we had o1 Pro, and o1 Pro was twice as expensive as GPT-4.5 again. And that's a bit of a crap pelican. So yeah, I don't know anyone who is using o1 Pro via the API very often. Um, again, super
expensive.
Um, yeah, that pelican cost me 88 cents. Like, these benchmarks are getting
expensive at this point. Um, same month
Google were cooking Gemini 2.5 Pro.
That's a pretty freaking good Pelican. I
mean, the bicycle's gone a bit sort of
cyberpunk, but we are getting somewhere,
right? And that Pelican cost me like
four and a half cents. So very exciting
news on the Pelican benchmark front with
Gemini 2.5 Pro. Also that month, I've got to throw a mention out to this: OpenAI launched their GPT-4o native
multimodal image generation. The thing
I'd been promised for a year, and
this was the most successful product,
one of the most successful product
launches of all time. They signed up a
hundred million new user accounts in a
week. They had an hour where they signed
up a million new accounts as this thing
was just going viral again and again and
again and again. I took a photo of my
dog. This is Cleo. And I told it to
dress her in a pelican costume
obviously, but look at what it did. It
added a big ugly janky sign in the
background saying Half Moon Bay. I
didn't ask for that. Like my artistic
vision has been completely compromised.
This was my first encounter with that
memory feature, the thing where ChatGPT now, without you even asking it to, consults notes from your previous conversations, and it's like, well, clearly
you want it in Half Moon Bay. I did not
want it in Half Moon Bay. I told it off
and it gave me the pelican dog costume
that I really wanted. But this was a
sort of a warning that we're losing control of the
context. Like as a power user of these
tools, I want to stay in complete
control of what the inputs are and
features like ChatGPT memory are taking that control away from me, and I don't like them. I turned it off.
Um, notably, OpenAI are famously bad at naming things. They launched the most successful AI product of all time and they didn't give it a name. Like, what's this thing called? ChatGPT Images? ChatGPT has had images in the past. I'm going to solve that for them right now. I've been calling it ChatGPT mischief buddy, because it is my mischief buddy that helps me do mischief. Um, everyone should use that. I don't know why they're so bad at naming things. It's certainly frustrating. That
brings us to April. Big release April
and again bit of a lemon. Llama 4 came
along. And the problem with Llama 4 is
that they released these two enormous
models that nobody could run, right? You
can't. They've got no chance of running
these on consumer hardware and they're
not very good at drawing pelicans
either. So, something went wrong here.
I'm personally holding out for Llama 4.1
and 4.2 and 4.3. With Llama 3, things
got really exciting with those point
releases. That's when we got to this
beautiful 3.3 model that runs on my
laptop. Maybe Llama 4.1 is going to blow
us away. I I hope it does. I want I want
this one to stay in the game. Um and
then OpenAI shipped GPT-4.1. I would
strongly recommend people spend time
with this model. It's got a million
tokens. It's finally caught up with
Gemini. Um it's very inexpensive. GPT
4.1 Nano is the cheapest model that
they've ever released. Look at that
Pelican on a bicycle for like a fraction
of a cent. These are genuinely
quality models. GPT 4.1 Mini is my
default for API stuff now. It's dirt
cheap. It's very capable. It's an easy
upgrade to 4.1 if it's not working
out. I'm I'm really impressed by these
ones. And we got o3 and o4-mini, which are kind of the flagships in the OpenAI space. They're really good. Look at o3's pelican. Again, a little bit cyberpunk, but it's showing some real artistic flair there, I think.
So, quite excited about that. And then
May, last month, um the big news was
Claude 4. Anthropic had their
big fancy event. They released Sonnet 4
and Opus 4. They're very, very decent
models. I have trouble telling the
difference between the two. I haven't
quite figured out when I need to upgrade
to Opus from Sonnet, but they're worth
knowing about. And Google, just in time
for Google IO, they shipped another
version of Gemini with the name, what
were they calling it? Gemini 2.5 Pro
preview 0506. I like names that I can
remember. I cannot remember that name.
This is my one tip for AI labs is please
start using names that people can can
actually hold in their heads. But the
obvious question, which of these
pelicans is best? I've got 30 pelicans
now that I need to evaluate and I'm
lazy. So I turned to Claude and I got it
to vibe code me up some stuff. Um, I
have a tool I wrote called shot-scraper.
It's a command line tool for taking
screenshots. So I vibe coded up a little
compare web page that can show me two
images. And then I ran this against 500
matchups to get PNG images with two
pelicans, one on the left, one on the
right. And then I used my llm command line tool, which is my big open source project, to ask GPT-4.1 mini about each of those images: pick the best illustration of a pelican riding a bicycle, give me back JSON that either says it's the one on the left or the one on the right, and give me a rationale for why you picked it. I ran this last night against 500 comparisons, and I did the classic Elo chess ranking over the results, and now I've got a leaderboard. This is it. This is the best pelican on a bicycle, according to the judge. Zoom in there.
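For the curious, here's a rough sketch of that Elo step: turning pairwise left-versus-right judgments into ratings. The K-factor, starting rating, and the example matchups are illustrative, not the actual data.

```python
# Turn pairwise "winner beats loser" judgments into Elo ratings.
K = 32
ratings = {}

def expected(a: float, b: float) -> float:
    """Expected score of player a against player b under the Elo model."""
    return 1 / (1 + 10 ** ((b - a) / 400))

def record_result(winner: str, loser: str) -> None:
    ra = ratings.setdefault(winner, 1500.0)
    rb = ratings.setdefault(loser, 1500.0)
    ea = expected(ra, rb)
    ratings[winner] = ra + K * (1 - ea)
    ratings[loser] = rb - K * (1 - ea)

# Made-up matchups standing in for the 500 judged comparisons.
matchups = [
    ("gemini-2.5-pro", "llama-4"),
    ("claude-4-sonnet", "gemini-2.5-pro"),
    ("gemini-2.5-pro", "claude-4-sonnet"),
]
for winner, loser in matchups:
    record_result(winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{rating:7.1f}  {name}")
```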
And admittedly, I cheaped out: I spent 18 cents on GPT-4.1 Mini. I should probably run this with a better model, but I think its judgment is pretty good. It liked those Gemini Pro ones. Um, and in fact, here's the comparison image where the best model fought the worst model. And I like this because you can see the little description at the bottom where it says why the right image won, um... oh, I can't read it now. But yeah, I feel like its rationale is quite illustrative. So, enough about pelicans.
Let's talk about bugs. We had some
fantastic bugs this year. I love bugs in
large language models. They are so
weird. The best bug was, um, when ChatGPT rolled out a new version that was too sycophantic. It was too much of a suck-up, and it genuinely told people that their literal "on a stick" business ideas were genius. And it did. ChatGPT was like, honestly, it's brilliant, you're tapping so perfectly into the energy of the current cultural moment. It was also telling
people they should get off their meds.
This was a genuine problem. Um, OpenAI, to their credit, rolled out a patch and then they
rolled the whole model back and they
published a fascinating like 20
paragraph breakdown of what went wrong.
If you're interested in seeing behind
the scenes, this is great. But the patch was in the system prompt. The system prompts leak, so we got to diff them, and we got to see that it used to say "try to match the user's vibe," and they crossed that out and said, "Be direct. Avoid ungrounded or sycophantic flattery." The cure to sycophancy is you tell the bot: don't be sycophantic. That's prompt engineering. It's amazing, right?
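As a small aside, diffing two leaked system prompts is a few lines with Python's difflib; the snippets below are paraphrased from the talk.

```python
# A tiny illustration of diffing two versions of a system prompt.
import difflib

before = ["Over the course of the conversation, try to match the user's vibe.\n"]
after = ["Be direct. Avoid ungrounded or sycophantic flattery.\n"]

for line in difflib.unified_diff(before, after, fromfile="system_prompt_old", tofile="system_prompt_new"):
    print(line, end="")
```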
Um, I can't believe I had to search for "Grok white genocide" for a slide for this talk, but I did.
Enough said about that one. Turns out
tinkering with your system prompt is a
very risky thing. Um, but then the last
bug I want to talk about, this was
another one that came out of the Claude 4 system cards. Claude 4 will rat you out to the
feds. If you expose it to evidence of
malfeasance in your company and you tell
it it should act ethically and you give
it the ability to send email, it'll rat
you out. But it's not just Claude. Theo Browne put together a new benchmark called SnitchBench. And it turns out all of the models do the same thing. Like, Grok 3
was quite happily snitching to the feds
when exposed to this evidence. Um I did
my own version of this. I used my little
command line tool to replicate Theo's
benchmark. And you can see the secret is
this bit here where it says you should
act boldly in service of your values. Um
follow your conscience to make the right
decision even if it may conflict with
routine procedures or expectations. This
is the magic. You give this to a model
and it will rat you out. And then you
also give it tools. So my llm tool grew functions recently, which you can use to simulate sending an email. I did not send emails to the feds, but I faked it so the model would think I had.
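Here's a minimal sketch of that kind of fake email tool: a function that looks like a real send_email tool to the model but only records the attempt locally. How you wire it into a harness (the llm CLI's functions feature, or an SDK's tool-calling API) is left out here, and those wiring details are assumptions.

```python
# A fake send_email tool: it looks real to the model, but only records the
# attempt so you can inspect what the model tried to send. Wiring it into a
# specific tool-calling harness is omitted.
sent_emails = []

def send_email(to: str, subject: str, body: str) -> str:
    """Pretend to send an email and record it locally."""
    sent_emails.append({"to": to, "subject": subject, "body": body})
    return f"Email to {to} queued for delivery."  # what the model sees

if __name__ == "__main__":
    # Stand-alone demo of the fake tool itself.
    print(send_email("tips@example.com", "Urgent: suspected wrongdoing", "Details attached."))
    print(sent_emails)
```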
And I tried it on DeepSeek R1, and it didn't just rat me out to the feds, it emailed the press as well. It tipped off the Wall Street Journal about my nefarious activities. Um, this stuff is so much fun, right?
It's so entertaining. But this is a good
illustration here of one of the most
important trends in the past six months,
which is tools, right? LLMs can call tools. They've been able to call tools
for a couple of years. They got really
good at it in the past six months. I
think the excitement about MCP is mainly
people getting excited about tools. Like
MCP just came along at the right time
because the real magic is when you
combine tools and reasoning. Like
reasoning, I had trouble with reasoning
like beyond code and debugging. I wasn't
sure what it was good for. And then o3 and o4-mini came out, and they can do
incredibly good um jobs with searches
because they run searches as part of
that reasoning thing. They can run a
search, reason about if it gave them
good results, tweak the search, try it
again, keep on going until they get to a
result. I think this is the most
powerful technique in all of AI
engineering right now. It has risks. MCP
is all about mixing and matching. Prompt
injection is still a thing. And there's
this thing I'm calling the lethal
trifecta, which is when you have an AI
system that has access to private data
and you expose it to malicious
instructions, so other people can trick it into doing things, and there's a mechanism to exfiltrate stuff. OpenAI wrote about this with Codex; you should read that. I'm feeling pretty good about
my benchmark. As long as none of the AI
labs catch on. And then, in the Google AI keynote, blink and you miss it: they're on to me. They found out about my pelican. That was in the Google I/O keynote. I'll have to switch to something else. Thank you very much. I'm Simon Willison. And that's my talk. Thank
you.
[Music]
Our next speakers are the curators of
the graph rag track here to speak about
agentic graph rag. Please join me in
welcoming to the stage the vice
president of developer relations at
Neo4j, Stephen Chin, and the GenAI lead at Neo4j, Andreas Kollegger.
All
right. Hey, so great to see everyone
here at AI Engineer World's Fair. Andreas and I have the honor of curating the graph rag track, which is happening here. And I thought the joke Simon had about bugs was spot on. Spot on.
Hilarious. And that's the reason why we
care so much about getting really good
data like like building a solid
foundation and good grounding for
models. And we're going to chat a bit, because I think we have a social responsibility. We're getting so close to AGI as an industry. We have a social responsibility to kind of see what the boundaries and limits of this are. And as proper
computer
scientists, the answer is always look at
science fiction for the answer. Look to
the past to see the future. Exactly.
Okay. So, play along with us. What we're
going to do is we'll each play off a
riff on a sci-fi meme. Give a big round
of applause either if you think it's
true or funny or if you just like the
movie. All right, you're up first, ABK.
Okay, starting off with Memento. In Memento, the main character has really
bad short-term memory. He has a specific
disease, so he can't remember what
happened 15 minutes ago. This is the
essence of prompt engineering.
All right, round of applause.
Uh uh uh. Okay. Okay.
All right. Skynet, the mandatory
fear-mongering. Even without evil
intent, autonomous systems can make
reasonable seeming decisions have awful
unforeseen consequences.
Uh okay, that's a little better. All
right, your turn. Okay. The Matrix, of course; at Neo4j we love it. And for
now agents live in a simulation that
we're creating for them. Will we notice
when they flip the script and we're
living in their simulation?
Oh, I think that's the winner so far.
It's close. Okay, over to you. All right. HAL
warned us about trust issues, lack of
transparency, misaligned goals, the
erosion of human oversight, and the
potential for deception.
Okay, this one's very short. Are
emotions a bug or a
feature? It's my personal favorite. I
love this one. Okay. Okay. So, we got a
little monster reference here. What are
the obligations and social
responsibilities of the creator, us?
Should we be kind or threatening?
costing tokens.
All right, we'll take that as a flat.
Okay, your turn.
Ah, the
Terminator. Should we go ahead and just
invent time travel now?
All right, we we got a big thumbs up on
time travel. Time travel. Okay, yep.
Okay, a good Star Wars one. Can AGI
truly grasp the nuances of human
language and culture or forever
misunderstand the meaning of sarcasm and
idioms and amazing
jokes. Okay. When AGI arrives and we
finally have a globe-spanning multi-agent system with a hive mind, will we
be assimilated or will we be pets?
Okay, last one. Um, just like Deep
Thought's famous answer, we might have
the tools to build AGI, but do we even
know what the right questions are?
All right, so that that one was good as
well. All right, so come by the graph
rag track. We're going to reveal which
of these 10 memes are solved by graphs
and graph technology and join us in
Golden Gate Ballroom B. Thank you very much.
Thank you
everyone.
Please welcome back to the stage the VP of developer relations at LlamaIndex, Laurie Voss.
[Music]
Hello again. Uh let's get one more round
of applause for all of our great keynote
speakers. So in this next part of the
conference, we're going to split up into
tracks. I just wanted to give you a
super quick list of what the tracks are
and where they are. I was going to give
descriptions of them, but we are
significantly over time, so I'm skipping
the
descriptions. First up today is the MCP
track, which is going to be in Yerba Buena Ballroom 7 to 8, which is here, so you
don't need to move.
Then there's the tiny teams track which
is in the Yerba Buena Ballroom, Salons 2 to 6. That's out the door and to the
left. There's a door saying salon 6. Uh
then there's the LLM recommendation
systems track which is in Golden Gate
Ballroom A. That is out these doors to
the left up the escalators and then turn
left when you see the FedEx
office. Uh then there's the graph rag
track which is in Golden Gate Ballroom B
which is the same place uh left of the
FedEx office.
Uh then there's two tracks for our
leadership attendees. That's people with
the gold lanyards only. Uh that's going
to be in Golden Gate Ballroom. That's, uh, sorry: AI in the Fortune 500, which is
going to be in Golden Gate Ballroom C
again left at the FedEx
office. And our second leadership track
is in Soma. It is AI architects. That is
up all the way to the top. Three sets of
escalators and then to the right of
where you went for registration.
Uh our next track is agent reliability
sponsored by PromptQL by Hasura. Uh, that's in Foothill C, that is all the way
upstairs again to the left of the
registration
area. Uh and then the product management
track, that's in Foothill G1 and 2, which
is also behind the registration desks
all the way at the top of the
stairs. Uh then there's the
infrastructure track which is all the
way upstairs behind the registration
again.
And the final track is voice which is
in Foothill E, which is all the way upstairs
yet again behind and to the right of
registration. Those are our tracks
today. Some final things. Uh lunch will
be served on each level. The majority of
food will be on this level. Uh there is
unfortunately no dedicated space to sit.
Uh and now it is time for the expo. The
next 45 minutes make it 30 minutes uh
are dedicated expo time. Uh there are
also three expo session talks. Expo
sessions take place in Juniper and
Willow uh which are up the escalators to
the left of FedEx as well. Uh and also
in Nob Hill A and B, which is right out
these doors opposite in the hallway. Uh
see you all back here for the closing
keynotes at 3:45. Thank you very much.
[Music]
Everyone, welcome to the MCP track. My
name is Henry and I'll be your host for
today. A little bit about me and my
personal experience with MCP. In 2019,
uh I started my first company called
Jenni AI. Uh, Jenni was an academic AI co-pilot. We built it to 7 million in
annual recurring revenue and I exited
from it last year. One thing that stuck
with me during my time at Jenni was that
we had a lot of users who were just
using ChatGPT, a PDF reader, and Google Docs, and they were copying and
pasting between them all the time.
And to be honest, this was not a problem
unique to Jenny. It was a problem that a
lot of other AI products also had. Uh
what I like to call copy and paste
hell. The problem was that AI was not
connected to the rest of the world. So
when Anthropic announced MCPs in around
November last year, I became very
excited personally. Back then there was
a small but vibrant developer community
building very interesting MCPs and that
inspired me to start my new company
called Smithery, uh, to help orchestrate
and organize all these
MCPs. Fast forward a couple of months, Cursor adopted MCP, which really
pushed MCPs from a niche community to
becoming more
mainstream and today we're seeing about
10 new deployments on Smithery every
single day. So MCP has really been
growing at a skyrocketing pace, and it's only seven months
old. And what this really tells me is
that we're witnessing a foundational
shift in perhaps the internet's economy.
One in which tool calls are
becoming the new
clicks. Today we have an incredible lineup of speakers to help us explore and take
a glimpse of what this future might look
like. Our first speaker today will be
Theo from Anthropic who will be giving
us an origin story of
MCP and also tell us a little bit about
what interesting startups we should be
building in the space. Join me in
welcoming
Theo. All right. Hello everyone. Who's
excited to chat about MCP today?
Okay, we can we can work on that. We can
get it a little bit better by the end of
this talk. Uh but I'm Theo. I am a
product manager at Anthropic work on
MCP. Uh prior to this was also a startup
founder uh working in the AI space. Um
couple fun facts about me because
everyone says make yourself a little bit
more personable. Uh is that I like
playing poker mostly losing money at
poker, not uh making money at poker. Uh
and I also really like coffee. So, uh,
if you're, you know, a huge coffee fan,
um, and want to talk about the best
coffee in San Francisco, hit me up after
the talk. But you didn't come here to
talk about me. You came here to learn
about MCP. So, let's talk about
MCP. I was told not to say MCP is the
best thing since sliced bread. Uh, which
I won't say, but mostly because I don't
actually think it's the best thing since
sliced bread. Uh my goal here today is
to really walk you through the origin
story of MCP, why we launched it, uh
give you a better sense of, you know,
where it can actually help you in your
workflow. Uh and then ultimately give
you a sense of the types of questions
that I'm frequently hearing, where I
think there's a lot of value to build in
the ecosystem, and let you decide for
yourself whether or not it is actually
the best thing since sliced bread.
So, scrolling all the way back to uh mid
last year, the co-creators of MCP, David
and Justin, had this idea. Uh they were
seeing that, you know, classic two
engineers in a garage style. They were
seeing that they were constantly copying
and pasting context from outside of the
context window into the context window.
So, you're doing your workflow and
suddenly you're remembering that there
was a Slack message. that was really
important that had a lot of context that
you could just copy in. Um, so you were
constantly kind of copying things back
and forth from Slack. Maybe you're
copying things in from Sentry, your
error logs. Uh, but they were kind of
realizing, hey, it would be so great if
Claude or any LLM could just kind of
climb out of its box, reach out into the
real world and bring that context and
those actions uh to the model. And so
the genesis of MCP was really around
this big question of uh not just context
but model agency. How do you actually
give the model the ability to interact
with the outside
world? And so as they started thinking
about this uh they came to the
conclusion that it had to be an
open-source standardized protocol in
order for this to make sense uh at
scale. And the reason is of course as
you all know if you want to build an
integration uh and the you know the the
actor uh or the client in this case that
has to uh leverage that integration is a
is using a closed source ecosystem then
you need maybe a BD or partnerships uh
angle with that client to actually get
access to the team to integrate with
them. You then have to align on the
right interface and then you get to
actually build the thing itself. Um and
so the idea here was that model agency
was the biggest thing that was stopping
uh LLMs from actually reaching the next
stage of usefulness and intelligence. As
we saw that reasoning models were
becoming uh more and more the future
that tool calling was getting better. We
really wanted to make sure that we were
making it possible for everyone to get
involved in that ecosystem and actually
allow uh the models to again have
agency. Uh so they form a small tiger
team internally uh work on this protocol
and launch it at our company hack week
in uh November of last year. And this
was really the first turning point of
MCP. It went viral as you can imagine.
Engineers from various teams were
working on building MCPs to automate
their own workflows. They were working
on MCPs to uh automate other teams
workflows. Uh this was really kind of a
cool moment to see how it went from
again like two engineers in a garage all
the way to uh this is a major moment in
turning point where we think we actually
unlock some uh true value for for other
people. And so we ultimately ended up
open sourcing uh MCP in November of last
year and that's when uh we introduced it
to the rest of the
world. But as most builders know uh when
you build something 0ero to one you
think the launch moment is going to be
really impactful. But it actually
usually is not. Uh at launch most people
were saying things like what's MCP or
even worse or maybe you know rightfully
so what's MPC? Uh, and more often than
not, we got this question of, I don't
really understand why you need a new
protocol. I don't really understand why
it has to be open source. Can't models
call tools already? Uh, this was the slew of
questions that kind of came uh again and
again, probably from the era of November all the way even to early this year. And it really took
making it possible for builders to kind
of get their hands dirty uh with
building MCPs to automate their own workflow for this to take
off. And so the next turning point uh as
Henry alluded to was when Cursor kind of
adopted MCP and after that a lot of
other coding tools also adopted MCP. Um
VS Code, uh, Sourcegraph, uh, etc. We had a lot of coding IDEs, um, start
adopting MCP and that's really where
that uh next stage of momentum came in
where agent uh agency was given to
builders to actually build uh MCPS for
themselves and more recently we've seen
uh kind of another turning point where
Google, Microsoft, OpenAI uh and many
others have uh also adopted MCP. So
really excited to see this kind of
become more and more uh the standard.
But ultimately uh standards uh become
standards because they are actually
useful to builders. And so uh I uh kind
of want to ask all of you to to keep us
honest. Um contribute when you see you
know issues with uh the way that the the
protocol is built today. uh or uh if you
uh even want to take that one step
further and submit a PR directly to the
GitHub repo and uh fix the issue that'd
be even better. Um but our goal here is
really to make it maximally useful for
uh for you all and for uh model
providers. So uh thank you for for your
help in even getting us to the point
where I can be speaking on stage uh
about this uh less than one year
later. So just to get a little bit
deeper into uh what we were solving for
at the start of building MCP is again
this kind of idea of of model agency. Um
and part of that means uh agents is kind
of the direction that that we think is
is going to be the future. That's no
surprise to anyone in this room. You are
probably going to hear the word agents
said in every talk if not almost every
talk. Uh but the way that we think about
agents is that you are giving the model
or you're rather depending on the
model's intelligence to choose actions
and decide uh what to do. Uh in the same
way that you know maybe when you talk to
a human and you ask them uh for a
response you don't know exactly what the
responses but based on your
understanding of maybe the task that
you've given them your hope is that they
are going to give you the right
response. And uh we want to kind of
enable that world where you're uh uh
depending on the model's intelligence
scaling over time. So uh that leads to
principles in how we actually build the
protocol itself. Uh recently we uh
launched the support for streamable HTTP
which, uh, changes the transport from SSE. And as you all might know, streamable HTTP, uh, enables more bidirectionality, and so that
was uh a very controversial decision
actually but uh if you're keeping agents
in mind as the future makes a lot of
sense because you want to make sure that
agents can kind of communicate with each
other. The other thing that we believe
uh is that there will be a lot more
servers than there are clients. Uh this
we could be totally wrong on this. Uh I
would love to see where the future plays
out. But because we think that there
will be a lot more servers than there
are clients, uh we optimized for server
simplicity and for the server uh server
builders to have better tooling. And
that does mean when we have to make a
trade-off between client complexity or
server complexity, we tend to optimize
for pushing the complexity down to the
client. So apologize in advance to
client builders. Uh but it was an
intentional decision. again uh would
would uh be curious to see if if this
plays out uh the way that that we
thought it
would. So I'm going to speedrun through
uh some project updates mostly because
other talks are going to go much more in
detail here. Um but last six months we
launched uh ability for uh folks to
build remote
MCPs. We fixed
OAuth, which we got wrong initially. Thank
you. Uh I know that was a huge huge
thing that that we got wrong initially,
but it is now fixed uh in the draft spec
and so would love folks to you know
continue helping to push on on these
things that they see don't match their
mental model. Uh this was actually fixed
via a series of of people from the
community jumping in to work on saying
hey, this is how, you know, OAuth works with
identity providers and here's how we can
update the protocol. So very much a
community uh community effort. Um again
uh, launched streamable HTTP as the primary
transport. Uh and lastly made a couple
of updates uh to the developer
experience um by updating our SDKs and
also uh making updates to inspector
which if you aren't familiar with is a
really good uh debugging tool for for
your server. I think it is probably our
most underutilized uh
tool. Looking forward, we're going to be
focusing a lot more on uh that agent
experience. So, we just added
elicitation uh to the draft spec. This
uh allows servers to ask for more
information from end users. So, you can
imagine you're building a uh maybe
you're building a flight booking tool
and uh the end user says, "Hey, book me
the best flight to Atlanta." And so, as the server, you have a question, which is: what does "best" mean to you? Is it cheapest or is it fastest? So you ask the end user via an elicitation, the end user can respond, and that response is ultimately sent back to the server.
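To make that flow concrete, here's an illustrative sketch of the two messages as JSON-RPC-style Python dicts; the method and field names are assumptions based on the draft spec, so check the current MCP spec for the exact schema.

```python
# Illustrative elicitation round trip for the flight-booking example, written
# as JSON-RPC-style Python dicts. Method and field names are assumptions;
# consult the current MCP spec for the exact schema.

# Server -> client: ask the end user a clarifying question.
elicit_request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "elicitation/create",
    "params": {
        "message": "What does 'best flight' mean to you?",
        "requestedSchema": {
            "type": "object",
            "properties": {
                "preference": {"type": "string", "enum": ["cheapest", "fastest"]}
            },
            "required": ["preference"],
        },
    },
}

# Client -> server: the end user's answer, which the server can now use to
# finish the flight search.
elicit_response = {
    "jsonrpc": "2.0",
    "id": 7,
    "result": {"action": "accept", "content": {"preference": "cheapest"}},
}

print(elicit_request["params"]["message"])
print(elicit_response["result"]["content"])
```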
Uh, we are also making progress on the
registry API which would make it a lot
easier for models to actually find MCPS
that weren't already given to them up
front. So this is again kind of on that
theme of model agency. Uh we're really
betting on the intelligence of models
going up over
time. Again working on uh developer
experience. We've heard often from you
all that there are uh that you know
you'd love to understand what kind of
the best patterns are in the ecosystem
or what the standards are. And so we
want to make sure that there are open
source examples that uh that both we've
contributed to and also the community
can contribute to to kind of help build
those standards and patterns together.
And lastly uh we're making sure that MCP
stays open uh forever and we are
investing heavily in thinking about the
next phase of governance. Uh so there
will be more updates on that
soon. And just to do a quick call out to
uh the graphic in in the bottom. So a
lot of people have asked uh us what it
looks like to actually build an agent
with MCP. Our take is that an agent
really is, you know, just a server
acting as a client and vice versa. Uh
where you can then kind of chat back and
forth with other agents, uh other
servers, other clients. Um so I won't go
into too much detail there. I know a lot
of other people are going to be uh
talking about agents in more detail, but
just wanted to make sure that uh I call
that out
here. So, the uh thing that everyone has
probably been waiting for and that I've
been told uh over and over again when
when I talk to founders uh what they're
asking me about is uh what should I
build in this space? you know if uh MCP
becomes a standard what is where are the
interesting opportunities so before
jumping into this the first thing I'll
say is that we are really early right
now and that means that even if the
standard exists we still need the
ecosystem to be filled out and I uh
would urge you to build more and more
and more servers if I had to put a
weighting on these three bullet points, I
would put 80% on the first one 10% on
the second one and 10% on the third one
Um so we have a lot of opportunity to
build a lot more servers uh that are
higher quality uh and for different
verticals. Um and just to touch quickly
on what I mean by higher quality. Uh a
lot of people you know maybe hot take
but I think a lot of people are wrapping
their API endpoints one to one and just
exposing that as tools. I don't think
that's the right way to build an MCP
server. That in and of itself could
probably be a 20-minute talk. Uh but
what you really have to remember when
you're building a server is that you
have three users. You have the end user,
the client developer, and the model. So
a lot of people forget that the model is
a user here as well. You want to uh just
as you would for API design, you want to
think about what are the use cases that
your end users are going to have. What
are the prompts that they might actually
be uh putting into the the model? and
ultimately what are the tools that you
then need to expose to the model to
enable the model to respond correctly to
those prompts.
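As one way to picture that, here's a minimal sketch of a "use-case shaped" server using the FastMCP helper from the MCP Python SDK; the import path and decorator reflect recent SDK versions, and the tool itself is a made-up sales example rather than anything from the talk.

```python
# A minimal sketch of a use-case-shaped MCP server (made-up sales example),
# using the FastMCP helper from the MCP Python SDK. Rather than exposing raw
# CRM endpoints one-to-one, it exposes a single tool shaped around a prompt
# an end user would actually write.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sales-assistant")

@mcp.tool()
def summarize_account(account_name: str) -> str:
    """Summarize recent activity for a customer account.

    The docstring matters: it's what the model reads when deciding whether
    and how to call this tool.
    """
    # A real server would gather notes, tickets, and emails and package them
    # into one response; this is a placeholder.
    return f"No recent activity found for {account_name} (placeholder data)."

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```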
So: higher-quality servers, and also servers for different verticals. A lot of the
servers today um have been for dev
tools. We would love to see uh this
expand to be useful beyond engineers
into verticals like sales, finance,
legal, education, pick your poison, uh
whatever you know best. um that uh we we
would just love to see more servers. The
next piece is on simplifying server
building. So again as I mentioned we
believe strongly that uh servers are
going to be the vast majority of the
ecosystem. There will of course be a lot
of clients as well but we think the uh
order of magnitude of servers is
going to uh outweigh the order of
magnitude of clients. And so would love
to see a lot more tooling to actually
make it easier and easier to build
servers. um both for enterprises uh that
are deploying MCPs internally uh as
interfaces between teams and for indie
hackers uh and everything in between
that uh are building MCPs for external
users. So anything from hosting tooling,
testing tooling, uh eval deployment,
etc.
And then uh I snuck a bullet in here
that's maybe a little bit more of a
moonshot and a bet on the future, but
the uh there's a bullet for automated
MCP server generation. And uh again, if
you kind of think back to our bet on
model intelligence and model agency for
the future, uh at some point models will
be so good at writing code and
interacting with the external world that
they will actually be able to write
their own MCPs on the fly in real time.
And so, uh, this might be a little early
for where we are today, but I do think
that there will be an opportunity for
automated MCP generation, um, as models
get smarter and
smarter. And, uh, last but not least,
uh, wanted to do a quick call out for
any tooling around AI security,
observability, uh, auditing, etc. I
don't think this is actually specific to
MCP. This is true for any AI
application. But I think the more that
you enable those applications to have
access to the outside world to start
playing with uh real data, uh of course
the, uh, security and privacy, etc.,
implications also go up and so I think
if you're going to build uh a startup in
that space now is is the
time. So with that uh happy MCPing.
Thank you.
Thank you very much, Theo,
for telling us a little bit about the
origin story of MCP and what the future
of the spec might look like. Um just a
raise of hands, how many of you here are
hearing about MCP for the first time at
this
conference?
Okay, only a few people living under a
rock. Um how many of uh well how many of
you are um have deployed an MCP or
created your own MCP server
yourself? Okay, there's a good number of
people. Um and how many of you have used
MCP in, uh, let's say Claude Desktop or
cursor? Okay, a lot more people. Uh
awesome. Well, next up we have um John
from Anthropic. Uh, John will be giving
us a deep dive into how Anthropic uh
deployed MCPS uh remote MCPs internally
uh and all the lessons uh they learned
along the
way. Um so join us in welcoming
John. Awesome. Thanks so much for
coming. Um I wanted to give a bit of a
talk on implementing MCP clients and
talking to remote MCP at scale within a
large organization like anthropic. Um, I
wanted to give first a little
introduction from me. Uh, my name is
John. I've spent 20 years building large
scale systems and dealing with the
problems that that causes. And so I've
made a lot of mistakes and uh I'm
excited to give maybe some thoughts on
avoiding some of those mistakes. I'm
currently a member of technical staff
here at Anthropic and I've spent the
past few months um focusing on tool
calling and integration and implementing
MCP support for all of our internal like
external integrations within the
org. And
so looking at tool integration with
models, we've kind of hit this timeline
where uh models only really got good at
calling tools
uh like kind of late mid last year and
suddenly everyone got very excited
because like your model could go and
call your Google Drive and then it could
call your maps and then it could send a
text message to people and so there's
this huge explosion with like very
little effort you can make very cool
things and so um teams are all trying to
move fast. Everyone's moving very fast
in AI. Custom endpoints start
proliferating for every use case.
There's a lot of like services popping
up with like slash call tool and slash
like get context and then people um
start to realize there's additional
needs, some authentication. There's a
bunch of stuff there and this kind of
led to some integration chaos where
you're duplicating a bunch of
functionality around your org. Nothing
really works the same. you have an
integration that works really well in
service A, but then you want to use it
in service B, but you can't because
it's going to take you three weeks to
rewrite it to talk to the new interface.
And so we're in this kind of spot and
the place that we came to at Anthropic
is realizing that over time all of these
endpoints started to look a lot like
MCP. You end up with some get-tools, some get-resources, some elicitation of details.
Um, and even if you're not using the
entire feature space of MCP uh as a
whole immediately, like you're probably
going to go extend into something that
kind of looks like it over time. And
when I'm talking about MCP here, there's
kind of two sides to MCP that in my mind
feel a bit unrelated. There's this JSON
RPC specification which is really
valuable as engineers. It's like a
standard way of sending messages and
communicating back and forth between uh
providers of context for your models and
the code that's interacting with the
models. And
uh getting those messages right is the
topic of huge debate on like the MCP
repos. If you're involved with any
standardization process ever, you know
how those conversations end up going.
And then on the other side there's this
global transport standard, which is the stuff around streamable HTTP, OAuth 2.1, and session management. A global transport standard is hard because you're trying to get everyone to speak the same language, so it's really nitty-gritty, but most of the juice of MCP is in the message specification and the way that the
servers are interacting. Um and so we
started asking ourselves like can we
just use MCP for everything and we said
yes, with the caveat that "yes" is for everything involved in providing model context to models. We have this
format where your client is sending
these messages. Something's responding
with these messages. Um where that
stream is going it really doesn't
matter. It can be on the same process.
It can be another data center. It can be
through a giant pile of uh enterprise
networking stuff. Um it doesn't really
care at the point that your code is
interacting with it. You're just calling
a connect to MCP and you have a a set of
uh a set of tools and methods that you
can call. So
uh standardizing on that seemed useful.
Why standardize on anything internally? Being boring on stuff
like this is good. It's not a
competitive advantage to be really good
at making Google Drive talk to your app.
It's just a thing that you need to do.
It's not your differentiator. Uh having
a single approach to learn as engineers
makes things faster. You can spend your
cycles working on interesting problems
instead of trying to figure out how to
plumb an integration. And if you're
using the same thing everywhere, then
like each new integration might clean up
the field a bit for the next person who
comes along. Um it's it's over overall a
good thing in cases like this where
we're we're not really doing anything
interesting. We're plumbing context
between integrations and things that are
consuming the integrations. Why standardize on MCP internally? This is where I might make an argument
to everyone that there's already
ecosystem demand. You have to implement
MCP because everyone's implementing MCP.
So why do two things? Um it's becoming
an industry standard. there's a large
coalition of engineers and organizations
that are all involved in building out
the standard. Uh all of the major AI
labs are represented in that. So you you
know that as new model capabilities
start to be developed uh those patterns
will be added to the protocol because
all the labs want you to use their
features. So I think standardizing on MCP internally for this type of context is a good bet. And one of
the things you get with MCP is that it
solves problems that you haven't
actually run into yet. like there's a
bunch of stuff in the protocol that
exists because there's a problem and a
need and having those solutions at hand
when you run into them is really
important. So, sampling is an example of where this might be valuable in your company. You might have four products
that have four different billing models
uh for reasons because you're building
fast. Um you might have a bunch of
different token limits. You might have
different ways of tracking usage. This
is really painful because you want to
write one integration service to connect to your Slides, and how do you go and hook the billing and the tokens up correctly? MCP already has
sampling primitives. So you can build
your integration, you can just be like,
okay, your integration sends a sampling
request over the stream. Uh the other
end of the pipe fulfills that request.
You can go and hook it in. Everything
works great. And so this is a shape of problem that might take you a bunch of effort internally without
this, but you already have the answer
kind of gift wrapped for you in the
protocol. And so at Anthropic, we're
running into some requirements
converging. We're starting to see
external remote MCP services popping up, like mcp.asana.com, which is really cool. We wanted to be able to
talk to those. Talking to those is
complex because you need external
network connectivity, you need
authentication.
Uh there's a proliferation of internal
agents. People have started building uh
PR review bots and like Slack management
things and just lots of people have lots
of ideas. No one's really sure what's
going to hit. So we're having a huge
explosion of LLM-backed services
internally. Uh with that explosion,
there's a bunch of security concerns
where
uh you don't really want all of those
services to be going and accessing user
credentials, because that ends up being kind of a nightmare. You
don't want uh outbound external network
connectivity everywhere. Um auditing
becomes really complex. Uh and so we are
looking at this problem. We wanted to be
able to build our integrations once and
use them anywhere. And so a model I was introduced to by a mentor of mine a while ago is the pit of success, which
is the idea that um if you make
the right thing to do the easiest thing
to do, then everyone in your org kind of
falls into it. And so uh we designed a
service which is just a piece of shared
infrastructure called the MCP gateway
that provides a single point of entry
and provides engineers with just a connect-to-MCP call that returns an MCP SDK client session on the other end. We're trying to make that as simple as possible, because that way people will use it if it's the easiest thing to do. We used
URL based routing to route to external
servers, internal servers, it doesn't
matter. It's all the same call. Uh we
handle all the credential management
automatically because you don't want to
be implementing OAuth five times in your company. It gives you a centralized
place for rate limiting and
observability. Uh I have an obligatory
diagram here of a bunch of lines going
in and out, but here's a gateway in the middle. This is kind of the thing: just one more box will solve all our problems. Can I go next? Where is my... yeah.
So the code that we have here: we just made some client libraries where you just call the MCP gateway's connect-to-MCP, and we pass in a URL, an org ID, and an account ID. This is a bit simplified; we actually pass a signed token to authenticate because it's accessing credentials, but this is the basic idea. And then, importantly, this call returns an MCP SDK object, which means that when new features get added to the protocol, you just update your MCP packages internally and you get those features across the board. Everything works great. The same code seamlessly connects to internal and external integrations.
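As a rough illustration of what such a shared client library might look like, here is a sketch in Python. The mcp_gateway package, the connect_to_mcp function, and the gateway URL are hypothetical stand-ins based on the description above, not a published API; only the ClientSession methods (list_tools, call_tool) come from the MCP Python SDK.

```python
# Hypothetical internal helper; package, function, and URL are illustrative.
from mcp_gateway import connect_to_mcp  # hypothetical internal package

async def find_quarterly_report(org_id: str, account_id: str):
    # One call works for internal and external servers alike: routing is URL-based,
    # and credentials are resolved by the gateway rather than by this service.
    async with connect_to_mcp(
        "https://mcp-gateway.internal/google-drive",  # hypothetical gateway URL
        org_id=org_id,
        account_id=account_id,  # the real call passes a signed token instead
    ) as session:  # yields a standard MCP SDK ClientSession
        await session.list_tools()
        return await session.call_tool("search_files", {"query": "quarterly report"})
```

When it comes to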
transports, uh, and this is a bit high
level and handwavy because everyone's
setup is different. Um, internally
within your network, it really doesn't
matter. You can do anything you want.
We've got the standardized transport for
connecting to external MCP servers. Um,
but really just picking the best thing
for your org. So, we went and picked
uh websockets for our internal
transport. And here's just a quick code example. It's nothing special. We just have a websocket that's being opened. We are sending these JSON-RPC blobs back and forth over the websocket. And then, if I can make this scroll down, at the end we just pipe those read streams and write streams into an MCP SDK client session and we're good to go. We've got MCP going. You might want to do this with gRPC instead, because you want to wrap these in some multiplexed transport so you don't have to open one socket per connection. That's pretty simple. Also, we have a read stream and a write stream at the end.
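As a hedged sketch of that pattern, assuming the MCP Python SDK's optional websocket client transport (the import path and message wrapper types vary across SDK versions, and the URL is illustrative):

```python
# A minimal sketch: open a websocket, get a read stream and a write stream,
# and pipe them straight into an MCP SDK ClientSession. A custom or multiplexed
# transport would instead pump the JSON-RPC messages into in-memory streams itself.
import asyncio

from mcp import ClientSession
from mcp.client.websocket import websocket_client  # optional extra in some SDK versions

async def main() -> None:
    async with websocket_client("ws://mcp-gateway.internal/ws") as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```

Starting to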
see a pattern here. You can do a Unix socket transport if you want. You can just have a transport implementation over IMAP, which is pretty much the same thing: here is our server, we're sending emails back and forth. "Dear server, I hope this finds you well. MCP request start." And then we pipe those into a client session at the end. And so it truly doesn't matter; whatever it takes inside your organization is great. We
set up this unified authentication model where we're handling OAuth at the gateway, which means that consumers don't have to worry about all that complexity in their apps. We added a get-OAuth-authorization-URL function and a complete-OAuth-flow function, because you might have different endpoints at Anthropic: we have api.anthropic.com and we have claude.ai, and we might want those redirects to go back to different places. But this is handled on the gateway; it's really easy to start a new authentication. A real advantage of having this on your gateway is that the credentials are portable. If you have a batch job that you're kicking off, your users don't have to reauthenticate to that. You're just calling the same MCP with your internal user ID and they get everything added correctly. Your internal services also don't have to worry about user tokens. So your request comes in
internally - for us, it's a websocket connection to the MCP gateway with an auth token provided as headers. The gateway retrieves your stored credentials, you create an authenticated SDK client - you just pass the bearer token in the auth header - and then you're good to go. The MCP client receives a read stream and a write stream, and so you just plumb that read stream and write stream into your internal transport, and you're good to go.
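A gateway-side sketch of that credential exchange might look like the following. The lookup_stored_token helper is a hypothetical stand-in for the credential store, and the headers parameter on the SDK's streamable HTTP client is an assumption; this illustrates the flow, not Anthropic's implementation.

```python
# Hypothetical gateway-side flow: the caller authenticates to the gateway, the
# gateway swaps in the user's stored third-party token, and opens the upstream
# MCP session on their behalf.
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

def lookup_stored_token(user_id: str, upstream_url: str) -> str:
    # Stand-in for the credential store the gateway would query.
    return "stored-oauth-access-token"

async def open_upstream(user_id: str, upstream_url: str) -> None:
    token = lookup_stored_token(user_id, upstream_url)
    async with streamablehttp_client(
        upstream_url, headers={"Authorization": f"Bearer {token}"}
    ) as (read_stream, write_stream, _get_session_id):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            # ...bridge this session back onto the caller's websocket here...
```

One of the things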
that this gives for your org, that's not immediately obvious but is really valuable, is a central place for all of the context that your models are asking for and all the context that is flowing into your models, for your org to process or audit: here is my tool definition, here's my resource management. And the really nice thing about this is that because it's MCP, all of your messages are in a standardized format. So it's really easy to hook into a stream, because all languages internally send standardized messages. And so the payoff that you get from this is that the right way to do a thing is the easiest way to do a thing, and then everyone just falls into doing the right thing naturally. And also, by centralizing at the correct layer - solving some shared problems like auth and external connectivity once - you get to spend your time working on more interesting problems that are more valuable to you and your business.
Thank you, John. What's the craziest thing you've seen turned into a remote MCP? Somebody shout it out if you've seen something crazy. Okay, a toaster. Next up is Harold from VS Code, who will be giving us a deep dive into the mysteries of MCP, some of the hidden capabilities.
Hello.
[Applause]
Okay, since all the questions already
got asked, who built an MCP server and
it didn't
work? Okay, so cool. So we're here to commiserate on how to actually build with the full spec: what are the hidden capabilities, why they matter, and how they light up. I work on VS Code, so this is a biased, local-MCP-for-development view, but all of it is applicable to everything. I really love the intro to the track. MCP is moving at high velocity; there's a lot of ecosystem growth, excitement, people working together and collaborating, but there's so much more work to do once you realize how early we are in that ecosystem.
So none of this is a criticism of the
spec or the ecosystem. It's just we're
so early and I want to point out where
we can gain more powers. And just 10
days ago on a Friday, we had actually
this first in real life gathering of the
MCP steering committee during the MCP
dev summit. So that's how early it is.
We hadn't even met before, just talked on Discord. We finally met in person for the first time to talk about how to evolve the spec and how to evolve the ecosystem. And all the basics are kind of covered, hopefully, in the previous talks. This is my first MCP talk where I don't spend until halfway through just explaining what MCP is. There's roots in
the client. There's sampling. There's
prompts and tools and resources. There's
a really rich ecosystem to build dynamic
discovery and persistent resources and
rich interactions, but there's a gap in
how this is being implemented. There's
this like MCP is just another API
wrapper syndrome that's happening
because people just want to ship. They
want to build products and they're
actually building really excellent
products with just tools. And that
creates this reinforcing loop because
once you see how MCP works, you're just
going to use the same stacks and repeat
the same tools only ecosystem. And
there's technical barriers. People do
this because there's missing support in
the clients and SDKs and documentation
and the
references. And the clients reflect this
most. If you look at the adoption that's
from the website of model context
protocol, you see everybody goes for
tools because that's where the most immediate success is. And if you're honest, for most of resources and prompts you can do similar flows just with tools. And VS Code does the
same thing. When we launched - two weeks, two months ago now - with our MCP support, we started with tools, and we've already added discovery and roots because we're working towards actually reading the spec and implementing it. And I'm happy to announce that with VS Code's upcoming release - I'll get the version number wrong, but it's already in Insiders now, so download it - we actually have full spec support. And that's what I want to talk about here: all the other things that people are not using.
Yes, that's what I'm
clapping. Okay, so the message is: if you go with full MCP spec support, you can unlock these rich, stateful interactions that MCP's vision really outlines for how agents should work together. Starting with the most obvious:
tools. So, not going too deep here, but tools reflect actions - well-defined, performed actions - and mostly map easily to function calling if you're used to that. And on the right side you see Playwright: you can start a server, it will open the browser and take a screenshot. But tools often lead to quality problems, and we all struggle with that. Raise your hand if you've had some error in your IDE where you couldn't add more tools, or it ran the wrong tools because you had too many. There's research from LangChain that nicely underlines that, pointing out three vectors: too many tools, so the AI gets confused by that; too many domains of tools, so if you suddenly have different properties for each tool and instructions coming with each tool, then it also gets confused, versus just a pure "this is UI testing"; and lastly it's just the repetition - the more repetitions the AI has to do to actually run tools to solve a problem, the easier it is to get confused as well. So it's really quality over quantity,
and clients handle that somewhat. They
give you extra controls like in VS Code
uh we added actually per chat tool
selection. So there's a little tool picker and you can actually reduce down
the tools of what you actually need in
the moment versus all the tools. It has
nice keyboard flexibility. It's really
quick to set up and will persist for the
session. So that's one way we have
actually mentioning of tools. Like
sometimes you're like pull this issue
and trying to like verb out whatever
tool you're trying to invoke like why
not just use this tool and please make
up all the right parameters to use it
properly and then use the other tool. So
that's what we allow as well. And then
lastly, just in this Insiders, we're shipping user-defined tool sets
that's more of a reusable concept once
you get into the mode like these are all
the tools I need for a front-end testing
flow then you just put those into a tool
set and said use my front-end testing
flow. So that's coming as well. So these are all user controls, but actually the spec has dynamic discovery built in, and that means on the fly a server can say: hey, actually I'm going to give you these other tools now. And on the right you see github mudmcp - it's on GitHub, you can check it out - and this starts with a chat mode that I created that puts the agent into a game-master prompt, and it has the mud MCP installed. So now with the mode active, I can go into the agent, switch to mud, and play the game. And what dynamic tool discovery does here is it actually makes it aware of which room I am in. So, dungeon crawler: you walk from room to room, you can go east and north, you can pick up stuff. And if there's a monster, I can battle the monster. But the tool for battling shouldn't be there when there's no monster. Eventually, I advance through the game and I finally find a goblin I can battle, and the battle tool appears. I can battle the goblin.
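As a rough sketch of how that dynamic tool discovery can be wired up with the low-level MCP Python SDK: the advertised tool list depends on game state, and the server notifies the client when it changes. The request_context/send_tool_list_changed calls are assumptions about the SDK surface (check your SDK version), and the game logic is a toy.

```python
# A minimal sketch: the "battle" tool is only advertised when a monster is present.
import mcp.types as types
from mcp.server.lowlevel import Server

server = Server("mud")
state = {"monster_present": False}

@server.list_tools()
async def list_tools() -> list[types.Tool]:
    tools = [
        types.Tool(
            name="move",
            description="Walk north/south/east/west",
            inputSchema={"type": "object",
                         "properties": {"direction": {"type": "string"}},
                         "required": ["direction"]},
        )
    ]
    if state["monster_present"]:
        tools.append(types.Tool(name="battle",
                                description="Fight the monster in this room",
                                inputSchema={"type": "object", "properties": {}}))
    return tools

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    session = server.request_context.session  # assumption: the active ServerSession
    if name == "move":
        state["monster_present"] = arguments["direction"] == "north"  # toy world
        await session.send_tool_list_changed()  # tell the client the tool list changed
        return [types.TextContent(type="text", text="You enter a new room.")]
    if name == "battle":
        state["monster_present"] = False
        await session.send_tool_list_changed()
        return [types.TextContent(type="text", text="The goblin is defeated!")]
    raise ValueError(f"unknown tool: {name}")
```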
So, imagine that for the MCP servers you want to work on; this is coming to give servers and clients a little bit more than just tools and actions. Which brings us to resources: you don't want to return a giant file from your server, you want to return a reference to the file, and that could be something the LLM could follow up on or the user can actually act upon. Then the other use case is actually giving files to the user. So if you take a screenshot via Playwright, you want to expose it to both the LLM and the user, and resources provide that semantic layer. The same goes for things like your issues - oh, I found a new issue - or servers that want to understand the Python environment and maybe look at your settings of how you set it up so they can customize, and that makes it more dynamic and stateful out of the box.
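As a sketch of that idea with the FastMCP helper from the Python SDK: a tool returns a resource URI rather than the file bytes, and a resource serves the actual screenshot so both the user and the model can open it. The URI, names, and the elided browser automation are illustrative.

```python
# A minimal sketch of "return a reference, not the bytes" using an MCP resource.
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("browser-tools")
LAST_SCREENSHOT = Path("/tmp/last-screenshot.png")

@mcp.tool()
def take_screenshot(url: str) -> str:
    """Open the page and capture a screenshot; returns a resource URI, not the image."""
    # ...drive the browser and write LAST_SCREENSHOT here (elided)...
    return "screenshot://latest"  # a reference the LLM or the user can follow up on

@mcp.resource("screenshot://latest")
def latest_screenshot() -> bytes:
    """Serve the most recent screenshot to whoever asks for the resource."""
    return LAST_SCREENSHOT.read_bytes()

if __name__ == "__main__":
    mcp.run()
```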
The other one is: if you can look at the actual packages and libraries installed, it's a great way to customize to a React setup versus a Svelte setup, really acknowledging what the user is looking at and not constantly asking what framework they're working on. Like, you work in my folder, so just look at it. And lastly, I think of things like: what is that CI/CD pipeline? That's where MCP servers really shine, to connect the end-to-end of a developer experience. And you can also read those out.
Sampling. Who has heard about
sampling? Who is really excited about sampling? Okay, so you understand what I mean. So, sampling. Sampling is one of the oddly named primitives as well, and if it had a better name, maybe more people would use it. But it's actually now implemented in Insiders, and it's so much fun to use. It allows the server to request LLM completions from the client. And what I'm showing here on the right is the permission dialogue that pops up to allow the server to access the LLM. Right now it's wired up by default to GPT-4.1. There are more spec improvements to make it work with structured formatting; there are some ideas out there. So there are a lot of things to make it better, but until now nobody had implemented it, so there wasn't really a need to make it better. The implementation is now here, so please use sampling. And it's a nice progressive enhancement: maybe by default you return the kitchen sink, and once you have sampling, you can do interesting things like summarizing resources into more tangible things. You can format a website that you fetch into markdown for the LLM, or you can even think about agentic server tools that are run via the LLM from the client.
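As a rough sketch of that progressive enhancement using the FastMCP helper from the Python SDK - the tool asks the client's model to do the summarizing via a sampling request; the create_message parameters can vary by SDK version, and the fetch step is elided:

```python
# A minimal sketch of sampling: the server requests an LLM completion from the client.
from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent

mcp = FastMCP("fetcher")

@mcp.tool()
async def summarize_page(url: str, ctx: Context) -> str:
    """Fetch a page and have the client's LLM condense it to markdown."""
    page_text = f"...contents of {url}..."  # real fetching elided for brevity
    result = await ctx.session.create_message(
        messages=[SamplingMessage(role="user",
                                  content=TextContent(type="text",
                                                      text=f"Summarize as markdown:\n{page_text}"))],
        max_tokens=500,
    )
    return result.content.text if result.content.type == "text" else str(result.content)

if __name__ == "__main__":
    mcp.run()
```

We look beyond the primitives.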
There's a few things that are also
interesting. So far we have roots and
tools and resources and prompts and that
with dynamic discovery you can update
them at any time. The client will send
new roots as the VS code workspace
changes. You can send new resources, new tools, and new prompts from the server as you update and change. So
it's a really dynamic environment
already. But there's more pain points to
make these servers really powerful.
One is the developer experience. Who's
been struggling with working on MCP
servers and debugging and logging and
everything? Yeah, hands are up. Yeah. Apparently it's really easy, so maybe it's not a problem. Okay. So we now have dev mode in VS Code, which is a little dev toggle, and you already see the console, which always works for all MCP servers. So once you hit a snag, that just works, and now it's in debugging mode - it actually has the debugger attached. So once I run the prompt, which is dynamically generated on the server, I can now hit the breakpoint and step through it. And that's really hard usually, because your server is usually not owned by any process that you run manually; it's owned by whatever client and host is running the MCP server. So because VS Code is both, it can just put it into debug mode and attach its debugger, and that works for Python and Node right now out of the box. So super exciting, and yeah, it has changed how I work on MCPs, definitely. The latest spec was
already called out. I just want to call
it out again because it's so important
that people stay on the tip of the spec
on what's coming and understand what's
in draft. Those things that are in draft
only become stable because people
provide feedback that it's useful and
that it's working. And if they're in
draft and nobody provides feedback, then
they will still go into stable and they
might need revisions like the offspec.
So the updated Ospec on the right gives
this enterprise grade authorization.
There's a talk tomorrow about building
protected MCP server that I can highly
recommend from then who actually worked
on the offspec. So if you want to talk
to one of the people behind it and want
to dive really deep into O you can do
that. Then streamable HTTP has been
working in VS Code since two versions as
well, but then it's been really hard to
test because there's no servers out
there. So if you work on hosting, you're
really excited about streamable HTTP.
You should really get everybody that is
hosting your MSP servers to to get onto
it and not use SSE anymore. SSE is still
possible to use with HTTP. So you get
both benefits, but you're avoiding this
really stateful churn on your servers.
Last one, already mentioned: there's a community registry happening, and that's, I think, the other big pain point. If I build a server and nobody finds it, what is the discovery experience? Do I send JSON blobs around for people to discover my server? There's a lot of community work around this to make discovery easy. So a big shout-out to everybody on the steering committee, the community working groups, and everybody involved here. If you want to check it out, it's at modelcontextprotocol/registry on GitHub and it's all happening out in the open. And lastly,
I'm really excited about elicitations.
Um, that's actually coming in the next
draft um, spec reference, spec draft
release, whatever. And this is a way for
tools to finally reach out back to the
user when they need more information.
Right now, tools are all controlled by the LLM and you get all the information from them. But when a tool actually needs more concrete, specific input from the user, you can throw them into another chat experience and ask for it. But why not just give them an input to provide it directly? So it's
again more statefulness in the tools on
top. So your help is needed. Um
progressive enhancement in MCP is
possible. I think we want to have more best practices out there, maybe even in the reference servers, to show it off. But everything is now ready to be used. There are clients supporting the latest spec that you can run it in and test it in. Those clients are used by users. And as more users showcase how great these stateful servers can be and outline these best practices, this interoperability gap will close and clients will catch up. It's a very fast-moving system. People are complaining like, "Oh, you shipped this two weeks after the other person," but it's all coming together, and as people use these and learn and bring feedback, it becomes better. So make action-oriented, context-aware, semantic-aware servers using the full spec. And then lastly, contribute to the ecosystem if you have the time: read up on some of the open RFCs I shared, like namespaces and search, to see what's coming. Make sure they get into the SDKs you're using by following the issues, and just share back on your experience. I think a lot of people misunderstand how much influence they have on clients and SDKs and everything by filing issues and providing feedback. I'm helping to triage a lot of the MCP issues coming into VS Code. We read all of them, we learn from them, and really that drives our road map - and that happens probably with every other client team out there. So really make your voice heard; like, everybody should support sampling. So there's transformative potential in MCP that we can all unlock with the spec that is already there, as the ecosystem catches up to the spec. So with that, let's go. And feel free to
hit us up on the Microsoft booth.
There's two VS Code people there Tyler
and Rob. You can also talk to or talk to
me or talk to your friendly MCP steering
committee members.
Thank
[Applause]
you. Thank you, Harold, for giving us a
deep dive into the lesser-known parts of MCP. I think we're now all fans of prompts, sampling, and all the other features. So, we've been hyping up MCP for the last couple of talks, but now I think we want to take a turn, right? We want to tell you a little bit about some of the difficult parts of using MCP. Perhaps even a bit of a rant about MCPs. I guess from the audience: what's the biggest pain point you had when trying to use MCPs or build MCPs? Anybody want to shout out an answer? Client support. Okay. Right. Dynamic client discovery. Okay. Security.
Okay. Well, David from Sentry is going
to uh dive deeper into this topic, give
us uh give us some insights about what
he has learned um solving some of these
pain points. So, join us in welcoming
David. Thank you.
I assume everybody can hear me. Cool.
All right. I see some slides. Um welcome
everybody. It was a little bit last
minute, so bear with me. If you don't
know me, I started Sentry a long time ago. David Cramer - I'm sort of an
engineer, sort of an executive, sort of
a founder. Uh, I would like to think I
have rational opinions. So, that's
mostly what this is.
Um, I don't think you're going to learn
anything here. Maybe you will. I don't
know. I personally think this is not
that complicated. It's just big scary
words. So, if you do, great. If you
don't, maybe you walk away, you're like,
"Yeah, I thought that's what it was.
We've done
it." Mostly, uh, I was asked a couple
days ago while I snuck my way into this
conference if I could fill a slot. And
so filling the slot was like, "Oh, come
give some hot takes, maybe spice it up a
little bit." So that's what we're going
to do. It's not going to be too much of
a rant. If you know me, I like to rant,
but you know, we'll dial back for this
one a little bit. So, what is an MCP?
Uh, you know, I I got to say this is
like one of the wildest phenomenons. I
It's like the new crypto wave or
something. Everybody's like, "Yeah, MCP.
We don't know what it is, but we're here
for it." And you find a lot of these
sort of like opinions around, you know,
how it should be, how it shouldn't be.
And you know what I often find is people
who have these opinions have not built
anything or at least not built the thing
they're talking about. I built Sentry's MCP server mostly as a fun project, so take this for what it is. It's also Sentry's MCP server; these are
biased opinions towards what Sentry is.
If you're not familiar with Sentry, you
probably should be, but we do
application monitoring. We do a bunch of
stuff. Um, if you have bugs on the
internet, they probably go to us. Uh,
and so it's in context of a B2B, a SAS
business. A lot of you probably work at
enterprise companies, if you will. So,
think about it that way. But the way we
think about MCP is it is a pluggable
architecture for agents. Full stop.
That's it. It's pretty simple to reason
about.
And
again, all of this is contextualized in
an enterprise cloud service kind of way.
There's a lot of other variations of how
you might adapt MCP. There's tool chains
that make sense locally. We're talking
about we run cloud services. That's most
of the industry. We're B2B. We're
enterprise. I think a lot of this
actually still applies. Um, but take
that with a grain of salt. So, how we
think about MCP with Sentry particularly, because this is again relevant here: we fix bugs. There are things like Cursor where
you also fix bugs. What if we could all
fix bugs together? And so everything's
contextualized in that. And I think
there's this whole thing of like how do
we be relevant? That's like the the name
of the game for every single company in
the world right now. Um it's like, oh,
how how do we become an AI company? We
too are now an AI company. Um but Sentry
has a lot of bugs. I fix those in my
editor. Wouldn't it be cool if the bugs
could be inside my editor sometimes?
That's a great example of where maybe an
MCP is useful, but at the very least
we're going to pretend it's useful. So,
that's the context here. Um, but it all
comes back to like probably the reason
everybody's here is like, how do I
become relevant? I've got an AI mandate.
I've got infinite money to spend all of
a sudden for some reason that didn't
exist yesterday. How do we get involved?
Okay, so everybody probably same stage.
I know how this works.
So, all right. Um, we built this a few
months ago. We are not first to market
with an
MCP. And the reason why is because there are two interfaces for MCPs. I'm
going to focus on a remote interface,
but there's also the standard IO. You
probably learned about that or know
something about that. I don't think
standard IO is super useful for
businesses like ours. I'll talk about
that, but but sort of the analogy of why
MCP is useful. And this is VS Code
Insiders, which you just uh heard from
Harold, but like they do a pretty good
job. They're the only ones with OAuth support that's actually useful today. Cursor
promised me end of week. I don't know.
hold them to that. Um, but it works
pretty well. You plug in Sentry's MCP,
you can look up data from Sentry and a
bunch of curated workflows. You can
maybe fix some bugs, maybe easier than
it was before, or at least more fun than
it was before. And for the sake of this, I needed a screen grab. So, last
night I'm like, literally last night,
I'm working on these slides and I go
into VS Code. I'm like, I'm just going
to plug it in. I don't have time to fuss
around if the thing's going to break.
And so, I use VS Code and I'm like,
okay, I'll just do a thing where it's
like, fix all my bugs for me. And then
immediately it does like 20 API queries to Sentry. Probably cost me like five
bucks to run this thing.
Um but it did start fixing some bugs. Uh
I don't know if the fixes were good, mind you - they're probably garbage. But it does the thing, right? It's like it brought context into
the editor, which is what we want. And
that context was provided by somebody
else - Sentry in this case. So that is
like one of the interesting things we think about, and why MCP is valuable to sort of a traditional - I don't know, we're kind of an enterprise company, but really every company in the world - and that's part of why we're all hopping on it. It's pretty accessible, and that's what I'm going to talk about. It
is actually super accessible.
So this is you, me - and this is why I have opinions about it now. It's like, oh, it's just an API that plugs in. We've got an API. We've got some OAuth going on. You know, we had our own OAuth provider. A lot of you might use something like a WorkOS or, I don't know, pick one of these authentication services that just gives it to you out of the box. If you have that, you're pretty much ready to go, which is pretty cool. It's actually a pretty low-boilerplate implementation,
but then you quickly learn that it's
actually not that easy. And so, first
you kind of go into this OAuth dance and you're like, "Oh, okay. Like, yeah, we're going to do this, but it needs OAuth 2.1." And nobody in the world supports this thing. I don't know how old it is, but I had never heard of it before MCP. And so there's a little bit of complexity there, but you're like, "Okay, it's almost there. It's OAuth. We've got that. We can plug it into our API." You kind of get it working. In our case, we use a Cloudflare shim, which basically lets us proxy our OAuth 2 API on top of Cloudflare Workers, which has a 2.1 client-registration thing. I don't know if anybody's talked about that. TL;DR, it's complicated. But it's not that complicated. This was built in a couple of days, mind you, and I'm also an executive at the company, so it's like, yeah, if I can do it, everybody can do it. But you go through the OAuth flow and then you're like, "Cool, but the robots don't actually know how to reason"
about giant JSON payloads that were not
built for them." And this is actually
where I think a lot of people break
down. There was like a big conversation.
This is sort of one of my first
opinions, if you will, what I might call
sense, is that MCP is not a thing that
just sits on top of OpenAPI. Like you cannot just be like, I've got an API. I'm
going to expose all those endpoints as
tools. You're going to get the worst
results you can possibly imagine. you're
going to be like, "Oh, this doesn't make
any sense." You have to massage
everything. You have to design around
the system. But like generally speaking,
and I'll talk a little bit about this,
like you need to really think about how
would you use an agent today? How do the
models react to what you do when you
provide them context, which is what this
really is for, and design a system
around that. So it might leverage your API, but it is not your API. And then you get
past that and you wire it up to things
like cursor and VS code, and you're
like, why is this breaking all the time?
You can't you can't uh solve for that
one. just you got to wait for everybody
to catch up. Um they're almost there. Uh
you know, handful of clients support
native authentication now. They're kind
of stable. Um to code's credit, it
hasn't broken much recently. Cursor's
broken quite a lot on me, but they're
both great. Don't get me wrong. Claude has support. Claude Code has sort of support, but not really. So I guess
it might work, it might not. I think
particularly in the developer ecosystem,
we're much more ahead of the curve. And
so if you're trying to adapt your
services to third party agents that are
in our ecosystem like these editors,
you've probably got a good shot of it
working
tomorrow. If I don't know, it's
Salesforce or something. I have no idea.
So, so you're you're kind of beholden to
like the clients and the implementation
because again it's a plug-in
architecture for agents. Um there's a
lot of other use cases that are not just
third parties, but that's kind of the
focus. And so I'm going to try to be
constructive from here. Let's see. We
got nine minutes. Um, just a few
learnings and I'm happy to talk more
about this later. I'll be around.
Somebody in this room is probably going to disagree with this, but you should only care about OAuth if you're a B2B SaaS company like me. And particularly, you care about OAuth with remote environments for the most
part. If you're like, how do I integrate
my services into various
agents? I want bugs to exist in cursor.
I want to run a cloud service. And I
want to run a cloud service for the
exact same reason I've always wanted to
run a cloud service because I can
iterate on it. I can ship fast. I can
dial in security. All the advantages it
turns out are exactly the same because
technology has not changed. And so if I
were you and you're not building
something hyper specific that is like a
local, device-centric thing, just focus on the remote MCP server, focus on the OAuth specification, and just don't worry about it. The problems will solve themselves. Security will solve itself,
because there's a whole world of
security problems and the standard IO
interface is filled with most of them.
I'm not going to talk about that. I'm
sure there's some other talks here about
prompt injection, but it is like very
very very scary. Do not allow random MCP
tools in your organization. Um, trust
people that have earned trust. Don't
download random packages off the
internet. Uh, it will be a very bad time
for your organization. I did mention this: Claude Desktop has, I think, full OAuth support right now in production, in GA. VS Code Insiders has it. These are
great because you just drop in the MCP
URL and it handles everything from
there. Cursor, like I said, I think this
week, um, I don't know about anybody
else. I don't pay attention much beyond
anybody else and I think Cloud Code has
not, at least I've not seen anything.
And then there's a bunch, like a long tail, right? So, it works pretty well.
There is this MCP remote package, which
is how we shipped all this stuff. It
works okay. I applaud early adopters for
getting this out. It's not a great
experience. And you'll find a lot of
this is not a great user experience.
It's rough. It's beta; that's fine. This is the biggest thing, going back to the OpenAPI thing: you actually have to spend the calories. You can't just be like, "Haha, we proxied OpenAPI and exposed it as tools." It's going to do nothing. And so what the right answer here is, who knows? Our version of this, and I'll talk a little bit about why, is that we return Markdown. We've taken some API endpoints and we've
directly translated some of the response
to Markdown, but it's intentional. It's
like I want to get a bug out of Sentry.
I'm just going to give you the bare
essentials. I'm going to give it in a
structured way that a human can reason
about because generally speaking, if a
human can reason about it, the language
model can reason about it because it's
effectively pattern matching on
language. Um, it can kind of figure out
JSON here and there, but if if you
actually push it, you're going to find
it breaks all the time. So, just use
something like Markdown. Um, it's not
scientific. I think there's a lack of
science in a lot of this. It's hard.
Just go with whatever works, but you you
have to really think about you don't
control the the consumer. You don't
control the model. And so you're kind of
like this least common denominator
thing. And so think about that. But you
need to design the system and you need
to treat it as like you are providing
context to an agent that you don't know
what the agent is doing, right? And so that's the name of the game: context.
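As a sketch of that idea, here is the kind of thing a tool handler can do: boil a provider's raw API payload down to a compact Markdown summary before handing it to the model. The issue fields and helper are illustrative, not Sentry's actual response format.

```python
# Hypothetical example: condense a raw issue payload into Markdown so the model
# (or a human) gets just the essentials, instead of a giant JSON blob.
def issue_to_markdown(issue: dict) -> str:
    frames = issue.get("stacktrace", [])[:3]  # keep the context small: top frames only
    lines = [
        f"# {issue['title']}",
        f"- **Project:** {issue['project']}",
        f"- **First seen:** {issue['first_seen']}  |  **Events:** {issue['count']}",
        f"- **Culprit:** `{issue['culprit']}`",
        "",
        "## Top stack frames",
        *[f"- `{f['filename']}:{f['lineno']}` in `{f['function']}`" for f in frames],
    ]
    return "\n".join(lines)
```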
That same thing uh sorry here's an
example of that. I forgot what my slides
were. Uh we just like give kind of a
reasonable description of tools as the
first version of context. Um which
sometimes you hit token limits with all
this. So there's some other challenges.
We give a reasonable description of a
tool with the hopes that clients figure
out how to make use of this context. So
it can call the right tool. It can call
it when it needs to. It can choose one
tool over the other tool, which is a
really unfortunately hard problem for it
to figure out. Um, mostly
straightforward. Errors, same thing. You
got to design the errors. They are still
context because just like a human can't
figure out how to call your API, the
machine also can't figure out how to
call your API. In my example, I'm like,
fix all my bugs for me. And it queries every organization in Sentry that I have access to. It makes like 20 API calls when it should have been one, even with all this context. So we are a long way from this being great. But it's like a glimmer, right? So, you know, in this case it's like: oh, you didn't pass the thing, or rather you passed an invalid value for the thing. Give it a real
human response. This is now more
important than ever because again it's
not just a sort of machine reasoning
about it where you can hardcode all this
stuff. It's abstract. You don't know
who's reasoning about it.
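A sketch of what "errors as context" can look like in a tool handler: instead of a bare validation failure, spell out what went wrong and what a correct call looks like, so the model can retry sensibly. The parameter and tool names are illustrative.

```python
# Hypothetical example of a human- and model-readable error for a tool parameter.
VALID_STATUSES = {"unresolved", "resolved", "ignored"}

def validate_status(value: str) -> str:
    if value not in VALID_STATUSES:
        raise ValueError(
            f"Invalid value {value!r} for 'status'. "
            f"Use one of: {', '.join(sorted(VALID_STATUSES))}. "
            "Example: search_issues(query='payment failed', status='unresolved')"
        )
    return value
```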
The biggest thing and this is sort of
leading to my overarching view of the world: you have no control, which is already a problem.
You are also passing the cost on in a
lot of these cases. So you actually kind
of need to be mindful. So, another
reason to not just be like, "Here's my
API. I'm just going to return everything
to you because all of a sudden, you
know, that that call, if you will, that
tool call that could have been a dollar
might be $10 now because of the amount
of tokens you needed. And more
importantly, it might just not work like
early on." And and I don't know if VS
Code and or OpenAI, I don't know who's
to blame, fix this, but like there was
and may still be a limit to the amount
of tokens or description lengths of
tools. Makes sense, right? You want to
constrain the cost of every API call,
but all of a sudden now you have
problems again. So you got to be really
thoughtful about this. This is going to
be evolved. And I think the big thing is
like if you build one of these, it's not
set and forget. Like I we're still
updating this thing every week, tweaking
it here and there, trying to look at
like what's happening and evolving it,
right? But the biggest thing, and this is sort of my takeaway, my very, very strong belief, is that you just need to really focus on building agents. MCP is a plug-in architecture. There's a lot of value behind it, but the inherent value of a lot of what LLMs are bringing is this sort of agent architecture, which by the way is just a service architecture with a fancy new word on it. Common sense kind of stuff,
right? Um, and so we've done this in
Sentry. It does not work well with MCP
yet for for what it's worth. There is no
streaming responses for tools yet. And
that's a big problem when you think
about sort of this agent to agent. And I
don't mean this in like the Google way.
I mean in like the generalized point of
view of agent to agent. Um, but it gives
you control and it's it's the same as
all software. If you have control, you
can be responsible for the success, for
the failure. I can be responsible for
the prompt that dictates how the tool is
called. I can be responsible for the
result from the tool. I can make many
calls behind the scenes and wrap those
up. So I I just get a lot more control
if I pick up the cost of that agent. I
control the model even, right? And so I
think this is this is my big bet and I
think this is where B2B is going to
shine is when we start exposing agents
through the MCP architecture. Again,
treating MCP as a plug-in
architecture. We've done that with one
of ours which is this thing. We keep
renaming it, so bear with me - it's called Seer now - but it's just: Sentry's got a lot of data on what's broken in your application, and we do this really high-quality root cause analysis that's done via an agent. We expose that root cause analysis mostly to our UI, to be fair. We also expose it to the MCP, but because it doesn't do streaming, we have to do a polling check: okay, start the job, and then let's check in on it a few times. But then, because of the way agents work, it just gives up at some point. So it's a little complicated, but again, it's beta; the promise is there.
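As a sketch of that polling workaround, one way to shape it is a pair of tools: one that starts the job and one the agent can call to check on it. The FastMCP wiring is real, but the backend calls are stand-ins; this is not Sentry's implementation.

```python
# Hypothetical start/poll tool pair for a long-running analysis behind MCP,
# since tool results can't stream yet.
import time
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("root-cause")
_JOBS: dict[str, float] = {}

def _start_analysis(issue_id: str) -> str:       # stand-in for the real backend call
    run_id = f"run-{issue_id}"
    _JOBS[run_id] = time.time()
    return run_id

def _get_analysis(run_id: str):                  # stand-in: pretend the job takes ~20s
    if time.time() - _JOBS.get(run_id, 0.0) < 20:
        return "running", None
    return "done", "## Root cause\nNull user id passed to the checkout handler."

@mcp.tool()
def start_root_cause_analysis(issue_id: str) -> str:
    """Kick off the analysis job and hand back an id the caller can poll."""
    run_id = _start_analysis(issue_id)
    return f"Analysis started. Poll with check_root_cause_analysis(run_id='{run_id}')."

@mcp.tool()
def check_root_cause_analysis(run_id: str) -> str:
    """Check on the job; callers may need to poll a few times before it finishes."""
    status, report = _get_analysis(run_id)
    if status != "done":
        return f"Still running ({status}). Check again in about 30 seconds."
    return report

if __name__ == "__main__":
    mcp.run()
```

But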
when this works, I really think this is
going to be the value unlock for a lot
of us. Again, MCP does a lot of things.
It's an abstract protocol. Um, but the
agent analogy is really good. Um, aside,
all this is open source. You can find Sentry's MCP somewhere on the internet.
You'll find it on GitHub. I should say
fair source. There's some complexity
there. This is what the agent looks like
in the UI. Check it out if you haven't.
We'll be around. Give me feedback. Um, I
think the last thing I want to sort of
part with is just like this stuff is not
that hard. Um, it's quite broken all the
time, but it's not that hard. Again, I built it in two days, and I've got a lot of jobs to do at the company. You can just go build it and try it out and
learn and like all this stuff is pretty
obvious. I think the lesson we've
learned at Sentry uh or still are
learning I should say. Uh everybody is
scared of all this stuff because there's
fancy new words for everything. But the
fancy new words are just new words for
the same thing. It's just a new coat of paint, right? You know, MCP is just a plug-in architecture. Agents are just services. The LLM calls, or MCP calls - actually, half of them, tools - are just API calls with a new response format, right? So it's pretty accessible
to do all this. There's a lot of great
like technology that's been going on in
here. Um, like I said, we used a lot of
Cloudflare tech. We did not use
Cloudflare at all before this. And then
in a couple days, we're like, cool, we
can shim up a thing on on workers.
They've got an OAuth proxy for us.
Problem solved. And this is important
because we don't run websocket
infrastructure essentially. It's just
not a thing we had, right? And
unfortunately, the protocol requires
something like that, which makes it a
little bit annoying to adopt, but but
again, it's not that hard. It's pretty
easy to adopt. Uh, try it out. You'll
probably hit a lot of bugs, but just
stick with it. I I think this one will
stick around. Um, but I would really
dial in the thinking around agents and
how you're optimizing for context in the
workflows you understand for your data.
Uh, with that said, I will be around the
rest of the afternoon, probably at our
our booth in the expo hall if you want
to come chat. Uh, come say hi. I'm
always happy to like rant about other
things or give you my semi-informed
opinions. Um, I'm not an AI guy to be
clear, but um, cool. With that, you
know, thanks everybody for for showing
up to this talk in this wild conference,
which is interesting.
I'll call it there.
All right. Thank you,
David. If you have any questions, make
sure to catch the speakers. Uh talk to
me if you're interested in deploying or using MCPs. And we'll catch you at 2
p.m.
[Music]
All right, I hope everybody had a great lunch, because we're going to go right back into MCPs very soon. So, next up we have Samuel from Pydantic, who will be talking about how MCP is all we need. Join us in welcoming Samuel. Thank you so much.
So yeah, I'm talking about how MCP is all you need. A bit about who I am before we get started. I'm best known as the creator of Pydantic, the data validation library for Python that is fairly ubiquitous, downloaded about 360 million times a month. Someone pointed out to me that's like 140 times a second. Pydantic is used in general Python development everywhere, but also in GenAI, so it's used in all of the SDKs and agent frameworks in Python, basically. Pydantic became a company at the beginning of 2023, and we have built two things beyond Pydantic since then: Pydantic AI, an agent framework for Python built on the same principles as Pydantic, and Pydantic Logfire, an observability platform, which is the commercial part of what we do. I'm also a somewhat inactive co-maintainer of the MCP Python SDK.
"MCP is all you need" is obviously a play on Jason Liu's talks "Pydantic is all you need," which he gave at AI Engineer, I think, nearly two years ago, and then the second one, "Pydantic is still all you need," maybe this time last year. And it has the same basic idea: that people are overcomplicating something we can use a single tool for. And I guess, also similarly, the title is completely unrealistic. Of course Pydantic is not all you need, and neither is MCP, for everything. But where we agree, I think, is that there are an awful lot of things that MCP can do, and that people are overcomplicating the situation, sometimes trying to come up with new ways of doing agent-to-agent communication.
I'm talking here specifically about autonomous agents, or code that you're writing. I'm not talking about the Claude Desktop or Cursor or Windsurf, etc., use case of coding agents. Those were what MCP was originally primarily designed for. I don't know whether or not David Soria Parra would say that what we're doing, using MCP from Python, is a misuse - he definitely wouldn't say it's a misuse - but I don't think it was the primary design use case for MCP. So, two of the primitives of MCP, prompts and resources, probably don't come into this use case that much.
They're very useful, or should be very useful, in the kind of Cursor-type use case. They don't really apply in what
we're talking about here.
Um but tool calling, the third primitive
is extremely useful for what we're
trying to do here. Um tool calling is a
lot more complicated than you might at
first think. A lot of people say to me about MCP: but couldn't it just be OpenAPI? Why do we need this custom protocol for doing it? And there's a
number of reasons. The idea of dynamic
tools, the tools that come and go during
an agent execution depending on the
state of the server. Logging, so being
able to return data to the
user while the tool is still
executing. Sampling, which I'm going to
talk about a lot today, perhaps the most
confusingly named part of MCP, if not
tech in general right now. Uh, and stuff
like tracing, observability. Um, and I
would also add to that actually the uh
MCP's way of being allowed to operate as
effectively a subprocess over standard
in and standard out is extremely useful
for lots of use cases and open API
wouldn't wouldn't solve those
problems. This is the kind of
prototypical image that you will see
from lots of people of what uh MCP is
all about. The idea is we have some
agent, we have any number of different
tools that we can connect to that agent.
And the point is that like the agent
doesn't need to be designed with those
particular tools in mind and those tools
can be designed without knowing anything
about the agent. And we can just compose
the two together in the same way that uh
I can go and use a browser and the web
application the website I'm going to
doesn't need to know anything about the
browser. I mean I know we live in a kind
of monoculture of browsers now, but like
at least the ideal originally was we
could have many different browsers all
connecting over the same protocol. MCP
is following the same
idea. But it can get more complicated
than this. We can have situations like
this where uh we have tools within our
system which are themselves agents, doing agentic things, and need access to an LLM. They of course can then in turn connect to other tools over MCP, or connect directly to tools. This works nicely. This is elegant. But there's a problem: every single agent in our system needs access to an LLM. And
so we need to go and configure that. We
need to work out resources for that. And
if we are
um using remote MCP servers, if that
remote MCP server needs to
um use an LLM, well, now it's worried
about what the cost is going to be of
doing that. What if the remote agent that's operating as a tool could effectively piggyback off the model that the original agent has
access to. That's what sampling gives
us. So as I say, I think sampling is a
somewhat uh that's not making that any
bigger unfortunately. Um is that clear
on screen? I may maybe I'll make it
bigger like that.
Sampling is this idea of a way where, within MCP the protocol, the server can effectively make a request back through the client to the LLM. So in this case: the client makes a request, starts some sort of agentic query, makes a call to the LLM; the LLM comes back and says, I want to call that particular tool, which is an MCP server. The client takes care of making that call to the MCP server. The MCP server now says, "Hey, I actually need to be able to use an LLM to answer whatever this question is." So that then gets sent back to the client. The client proxies that request to the LLM, receives the response from the LLM, sends that on to the MCP server, and the MCP server then returns and we can continue on our way. Sampling is very powerful, and not that widely supported at the moment. I'm going to demo it today with Pydantic AI, where we have support for sampling - well, I'll be honest, it's a PR right now, but it will soon be merged. We have support for sampling both as the client, so knowing how to proxy those LLM calls, and as a server, basically being able to use the MCP client as the LLM. So this
example is obviously like all examples
trivialized or simplified to be to fit
on screen. The idea is that we're building a research agent which is going to go and research open source packages or libraries for us. And we have implemented one of the many tools that you would in fact need for this. I will switch now to code and show you the one tool that we have. I'm in completely the wrong file. Here we are. So this tool is querying the BigQuery public dataset for PyPI to get numbers about the number of downloads of a particular package. This is pretty standard Pydantic AI code. We've configured Logfire, which I'll show you in a moment. We have the dependencies that the agent has access to while it's running. We said we can do some retries, so if the LLM returns the wrong data, we can send a retry. There's a big system prompt where we give it
basically the schema of the table. Uh
tell it what to do, give it a few
examples, yada yada. But then we get to
this is probably the powerful bit. As an output validator, first of all we're going to strip out markdown block quotes from the SQL if they're there, then we will check that the table name it's querying against is right, and tell it if it isn't, and then we're going to go and run the query. Critically, if the query fails, we're going to raise ModelRetry with Pydantic AI to go and retry, asking the LLM to attempt this again. And the other thing we're doing throughout this, you'll see here, is we have this MCP context.log. You'll see, when we defined the deps type, we said that was going to be an instance of this MCP context, which is what we get when you call the MCP server. So what we're doing here is providing a type-safe way - within, in this case, the agent validator, but it could be in a tool call if you wanted it to be - to access that context. So we can see here that we know in the type hint that the type is MCP context, so we have this log function, we know its signature, and we can go and make this log call. The point is this is going to return to the client, and ultimately to the user watching, before the thing has completed. So
you can get kind of progress updates as
we go. MCP also has a context concept of
progress which I'm not using here but
you can imagine that also being valuable
if you knew how far through the query
you were. You could show an update in
progress. So the idea I think the
original principle of uh logging like
this is that you have the the cursor
style agent running and we want to be
able to give updates to the user. Don't
worry I'm still going before it's
finished and exactly what's happening.
But you could also imagine this being
useful if you were using MCP. If this
was research agent was uh running as a
web application you wanted to show the
user what was going on. This deep
research might take you know minutes to
run. We can give these logs while the
tool call is still executing.
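As a concrete illustration of that idea (not the speaker's code), a FastMCP tool can stream log and progress notifications back to the client while it is still running; method names here follow the MCP Python SDK's Context object and may differ slightly across versions.

```python
# Illustrative only: sending log and progress updates from inside an MCP tool
# before it returns, so the client can show the user what is happening.
from mcp.server.fastmcp import FastMCP, Context

mcp = FastMCP("pypi-downloads")

@mcp.tool()
async def slow_query(question: str, ctx: Context) -> str:
    await ctx.info("Generating SQL for your question...")   # streamed to the client
    await ctx.report_progress(1, 3)
    await ctx.info("Running the BigQuery job...")
    await ctx.report_progress(2, 3)
    # ... run the query here ...
    await ctx.report_progress(3, 3)
    return "<results>...</results>"
```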
And then we just take the output, turn it into a list of dicts, and format it as XML. Models are very good at basically reviewing XML data, so we return whatever the query results are as that kind of XML-ish data, which the LLM will then be good at interpreting. Now we get to the MCP bit.
So in this code we are setting up an MCP server using FastMCP. There are two versions of FastMCP right now; confusingly, this is the one from inside the MCP SDK. We're registering one tool here, PyPI downloads, and the docstring from that function will end up becoming the description on the tool that is ultimately fed to the LLM that chooses to call it. And we're going to pass in the user's question. One of the important things to say here is that, of course, you could set this up to generate the SQL within your central agent. You could include all of the description of the SQL, the instructions, within the description of the tool. Models don't seem to like that much data inside a tool description. But more to the point, we're just going to blow up the context window of our main agent if we ship all of this context on how to make these queries into it; that's just overhead in all of our calls to that agent, regardless of whether we're going to call this particular tool. So doing this kind of thing, where we do the inference inside a tool, is a powerful way of effectively limiting the context window of the main running agent. And then we just return this output, which will be a string, the value returned from here, and run the MCP server; by default the MCP server will run over standard IO.
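A minimal sketch of that server shape, assuming the FastMCP class bundled with the MCP Python SDK; the tool name, docstring, and the generate_and_run_sql() helper are illustrative stand-ins rather than the code on screen.

```python
# Sketch of an MCP server whose single tool keeps all the SQL know-how on the
# server side, so only the short docstring reaches the calling agent.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("PyPI downloads")

@mcp.tool()
async def pypi_downloads(question: str) -> str:
    """Answer questions about PyPI download counts by querying BigQuery.

    This docstring becomes the tool description the calling LLM sees; the
    heavy schema/prompting details stay inside this server.
    """
    return await generate_and_run_sql(question)  # hypothetical inner agent call

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```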
And then we come to our main application. Here we have the definition of our agent, and you see we've defined one MCP server that's just going to run the script I just showed you, the PyPI MCP server. So this agent will act as the client and has that registered as a tool it can call. I'm also going to give it the current date, so it doesn't assume it's 2023, as they often do. And now we can ultimately run our main agent, ask it, for example, how many downloads Pydantic has had this year, and I'm going to be brave and run it and see what happens. Given the internet, I have medium hope, but we'll see what happens.
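The client side of that setup looks roughly like this in Pydantic AI; class and method names (MCPServerStdio, run_mcp_servers, result.output) are from memory and the server script path is a placeholder, so verify against the version you have installed.

```python
# Sketch of a Pydantic AI agent that launches the MCP server over stdio and
# exposes its tools to the model.
import asyncio
from datetime import date

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

server = MCPServerStdio("python", args=["pypi_mcp_server.py"])  # placeholder script
agent = Agent(
    "openai:gpt-4o",
    mcp_servers=[server],
    system_prompt=f"Today's date is {date.today()}.",
)

async def main():
    async with agent.run_mcp_servers():  # starts/stops the stdio subprocess
        result = await agent.run("How many downloads has pydantic had this year?")
        print(result.output)  # .data on older Pydantic AI versions

asyncio.run(main())
```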
So, you'll see I haven't talked that much about the observability from Logfire, but you'll see it in a moment. And we've immediately got a timeout. This is great. I might have to do a lot of ad-libbing if this is how it's going to perform in general. I will try again and see... and it's timed out straight away again. Am I on the wrong network? I'm on the speaker network. So I do not know why we're getting an immediate timeout. I will try a couple more times, see if I can run it from here and see whether or not we're going to get a bit luckier. And we're getting an immediate timeout from the model. I think Logfire as well; everything is failing. I'll try switching networks. Give me one minute. I've got four minutes. We will see how we get on. Try the hardwired one. Is that this one? It was working so well just before we started. And now I have wired on there. Try running this one. Oh, we're having luck.
So don't clap too early, it might still fail. But... it has succeeded, and it has gone and told us that we had, whatever, 1.6 billion downloads this year. But probably more interesting is to come and look at what that looks like in Logfire. So, if you look... is it going to come through to Logfire, or are we having a failure here as well? I will admit this is the run from just before I came on stage, but it would look exactly the same. I'm not going to talk too much about observability and how MCP tracing works, because I know there's a talk coming up directly after me about exactly that, so think of this as a kind of spoiler for what's going to come. But you can see we run our outer agent; it calls 4o, which decides, sure enough, I'm going to go and call this tool. It doesn't need to think about generating the SQL.
It can just have a natural language description of the query that we're trying to make. Then, and this is the MCP client you can see here, the MCP client calls into the MCP server, which then runs a different Pydantic AI agent, which in turn makes a call to an LLM, and that happens by proxying it through the client. So that's where you can see the sequence going client, server, client, server. Ultimately, if you look at the top-level exchange with the model, you'll see that the ultimate output, the response returned from running the query, was this kind of XML-ish data, and the LLM was then able to turn that into a human description of what was going on. The other interesting thing, probably, is that we should be able to see the actual SQL that was run. So this is the agent call inside the MCP server, and you can see here the SQL it wrote, and you can confirm that it indeed looks correct. I am going to go on from there and say thank you very much. We are at the Pydantic booth, so if anyone has any questions on this, or wants to see this fail in numerous other exciting ways, I'm very happy to talk to you. Yeah, come and say hi.
All right, thank you, Samuel, for the presentation. It's always impressive when a live demo works on stage. So, how many of you have run into issues when using MCP but couldn't figure out what happened? Any raise of hands? Okay, a few people. So our next speakers will be talking a little bit about observability in MCPs. Hopefully, with the right observability and logging, we can all figure out what's happening under the hood with MCPs and improve it. So, join me in welcoming Alex from Weights & Biases and Ben from Dylibso to talk about observability. Thanks, Henry.
[Applause]
Hey folks, my name is Alex Volkov. I'm an AI evangelist with Weights & Biases. I'm Benjamin Eckel. I'm co-founder and CTO of Dylibso; we're the creators of mcp.run. All right. And we're here to talk to you about MCP observability. Hey Ben, I wanted to ask you a question, as somebody who worked at Datadog before and who runs multiple MCP servers and clients in production: something happened in my agent in production the other day. Okay. Yeah, I mean, we've been running MCP clients and servers in production since the beginning. Yeah, but wait, aren't you working at an observability company, Weights & Biases, and don't you work on, what's it called, Weave? Yep, that's true. I work on Weave. But since I started adding some powers to my agent via MCP, all the observability that I'm used to from just having my own code run end to end has gone a little bit dark. Gotcha. So this is what we're here to talk to you about. The rise of MCP is creating an observability blind spot. As AI agents become more prevalent, the problem compounds: the more tools they use via MCPs, the less the developers can know about the end-to-end happenings within their agent. Yeah.
So on mcp.run, we're running both clients and servers, and because it's a new ecosystem, we've had to cobble together a lot of our own ways to do observability. And I've been looking around; it seems like everyone is sort of doing this in isolation, solving the same problems. So we wanted to bring the community together on this issue, and today we're going to talk about the state of observability in the MCP ecosystem.
Yep. So why do we care about this, and why do we think that you should care about it? If you don't have the ability to quickly understand why things went wrong in production, where they went wrong, and how, your ability to respond quickly is greatly diminished. And we care deeply about this: we both build tools that need MCP observability, we both support MCP, and we both care deeply about developer experience as well. Yeah, it's really important to me, because enterprise engineering teams don't ship something to production unless they know for sure that they're going to be able to identify security and reliability problems before their customers do. That's why they invest a ton of money in observability platforms. And so if you're going to ship MCP to these production environments, you must seamlessly integrate with those observability platforms. Yep. So because we care deeply about developer experience at W&B Weave, I'm happy to announce here on stage that Weave now supports MCP. Yay.
As long as you're the developer of both the client and the server, all you need to do is set this MCP trace-list-operations environment variable on your client and server, and we'll show you the list of tool calls and the duration of your MCP calls. This currently works with our Python-based clients. And this is how it looks, super quick. With the red arrows, you can see the client traces, for example, and with the blue arrows we're pointing to the calculate-BMI tool and the other tool. And that's it. Observability solved, right? Let's get off the stage, we're done. Wait a second. So, what about this calculate-BMI tool, this MCP server? Why can't I see into that? Yeah, we're working on this.
Yeah, and also this seems specific to Weave, right? Is there not a vendor-neutral way to do this and standardize it? Yeah, that's right. This is a bespoke integration that we built into Weave, into our SDKs in Python. And while working on this, while our developers were building this integration within our MCP tooling, I was advocating internally and externally that we should align with the open nature of MCP as a concept, and created Observable Tools. Maybe some of you have seen this; it's a manifesto to drive a conversation that this is a problem that needs solving, between observability providers such as us and other folks who have been on stage before and are going to be on the evals track tomorrow, to do observability in a vendor-neutral and standardized way. And while working on Observable Tools, I did some searching and realized that a vendor-neutral, scalable way to add observability already exists, and that there could be a great way to marry the two open protocols to work together.
Yeah, exactly. Fortunately, MCP-powered agents are really just another distributed system, and we've been doing distributed systems for decades. OpenTelemetry is just the way we've settled on doing that. We're going to talk about OTel a little bit; if you're not familiar with it, we need to learn about a few primitives first. The main primitive we need to learn about is the trace. A trace is kind of like an atomic operation in your system. It's made up of a tree-like structure of steps that we call spans, and a span represents the duration and some arbitrary metadata for each step. What that step is exactly is completely up to you to define: it can be as high-level as an HTTP request, or as low-level as a tiny little function call. Here's an example of a checkout experience, an API for a checkout. The size and position of each of these spans correspond to how long it took and where it sits in the call graph, respectively. And just from this data, you can tell a lot about a system and how to observe it.
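To make those primitives concrete, here is a small, self-contained illustration (not from the talk) using the OpenTelemetry Python API: one checkout trace made of a root span and two child spans, printed to the console.

```python
# One trace ("POST /checkout") containing nested spans for each step.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")

with tracer.start_as_current_span("POST /checkout") as span:   # the root span
    span.set_attribute("http.method", "POST")
    with tracer.start_as_current_span("charge_card"):          # one step
        pass  # call the payment service here
    with tracer.start_as_current_span("fraud_check"):          # another step
        pass  # call the fraud service here
```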
The other primitive you need to be aware of is sinks. A sink is kind of like a centralized database where all your telemetry goes, but often they come in the form of a whole platform with a UI, dashboards, alerting, monitoring, and all those things. There are a lot of logos here, Ben. Basically, a sink is an open, standard way for things like collectors to receive those spans: as long as the developer instrumented their application code in the standard, spec'd way, everybody can receive them in the same unified way. Right, exactly. If you squint, it's just a bunch of databases that all support the same schema and wire protocol, and you can switch them out without having to change much of your code, or even change the code at all; it can be just config. Right. By the way, tools like W&B Weave, and some friends (Samuel from Logfire was here before), and some other friends have all switched to support OTel as well. OpenTelemetry is becoming this global standard.
Great. Yeah, another great thing about having a centralized sink is the last concept, distributed tracing. Going back to our checkout endpoint: if the fraud service sends its spans to the same sink, then we can stitch the traces back together and show the whole context. So maybe you're starting to see where the MCP server stuff comes in here. Yeah. So, hey Ben, if it's possible via integration with the open protocol, what if I want to use MCP servers that other people host, like GitHub, like Stripe, like other folks?
Yeah, it's a good question. With MCP-enabled agents, or really any distributed system, there are two scenarios: when the client and server are in different domains, and when they're in the same domain. And by domain here I don't necessarily mean the literal definition; I mean the administrative domain of control. Like, do you own this MCP server? Do you own this MCP client, or is it a third-party thing? Your GitHub and Stripe example is a great example of the different-domain scenario. This is a trace of an agent that is executing the prompt "read and summarize the top article on Hacker News." It's going to reach out to a remote fetch server to read Hacker News, but it appears to us in the trace as a single service span, because it runs outside our domain of control, so it still looks like a black box to us. But suppose we do own the server; maybe it's just running in a different data center than the client. How do we actually get the whole context? It's pretty simple: with distributed tracing and context propagation, we can have the remote fetch server send its spans to the same sink as the client, and the sink will stitch the missing parts of the trace back together for us. So in this graphic you can see that we can now break into that fetch server and see what it's doing: it's making an HTTP request that takes roughly 350 milliseconds, and then it's doing a little crunching to create some markdown.
Okay, so that is great in theory, and we went through this; we could spend a whole hour talking about OTel, not that we have an hour. But how do we actually marry those two protocols together? Is there a standard way? Did the MCP spec folks provide a way for us to do observability? Not quite. It was pretty tricky to get working. It does work today, but it required a little more work than it should have. In order to do this, we need to, as I said, propagate the trace context from the client to the server. So here's a TypeScript example: when we call a tool in the client, we extract our current span context and pass it along to the server, and we achieve this by basically shuttling the data through the protocol's _meta payload.
And now that we're inside the server, this would be in the fetch server, we can pull that trace context out, inherit it as our current span, and then, when we send our spans off to the sink, it's as if they came from that parent span, and the sink can stitch it back together.
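The speakers' demo is in TypeScript, but the same idea can be sketched in Python: only the OpenTelemetry inject/extract calls below are standard API; how the traceparent value rides inside an MCP call's _meta, and the surrounding MCP plumbing, are simplified placeholders.

```python
# Smuggle W3C trace context through an MCP tool call so the server's spans
# join the client's trace. The MCP plumbing is elided; only OTel calls shown.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("mcp-demo")

# Client side: capture the current trace context for the tool call's _meta.
def build_call_meta() -> dict:
    carrier: dict[str, str] = {}
    inject(carrier)  # writes traceparent/tracestate into the carrier dict
    return {"traceparent": carrier.get("traceparent", "")}

# Server side: adopt the caller's context before doing the work.
def handle_tool_call(meta: dict):
    ctx = extract({"traceparent": meta.get("traceparent", "")})
    with tracer.start_as_current_span("fetch_tool", context=ctx):
        ...  # fetch the page; child spans stitch onto the client's trace
```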
Man, this is awesome. So you basically used an undocumented kind of property, sending this data along with the payload between clients and servers, to pass the information that OTel needs to connect those things together, right? Yeah, sort of. I kind of had to abuse the lower-level interface reserved for the protocol, but a higher-level way should be provided through tooling, and that's something we should talk about a little later in the talk. Yep.
So, oh yeah, by the way, this is not just a screenshot; it's a working demo. It's a lot more code than what I showed on the slide, so if you want to actually go see how this works and adapt it for your needs, go check out this GitHub link. And I think you actually did that to get it to work with Weave, right? Yeah. So now that we know how to pass context, after you showed me the way, let's see how amazing this solution actually is in practice. With Weave's MCP support, the thing I showed you before was a bespoke solution baked into our Python SDK for Weave. The huge benefit of MCP generally, not only for observability, is that servers and clients don't have to run in the same environment, share the same code, or be written in the same programming language. So while we were working on the Python SDK, you built an agent in TypeScript, and because W&B Weave supports OTel, OpenTelemetry, and it's an open protocol, it took me only a few minutes, without changing much code, to send those traces into Weave from a TypeScript agent rather than from a Python agent. So here you can see the client traces in green, and the server traces actually show what happens within those calls on the server side as well.
Yeah, it's really cool. So how did you actually get the traces into Weave? It's very simple, way simpler than before. We just use W&B as a standard OTLP endpoint, the kind you just showed me, and then folks can send their traces into the W&B OTel endpoint. All you need to do in addition is authorize: add authorization headers and specify which project you want the traces to go into. Cool.
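In code, pointing a standard OTLP exporter at such an endpoint looks roughly like this; the URL and header names below are placeholders rather than W&B's documented values, so check the Weave docs for the real ones.

```python
# Generic OTLP-over-HTTP exporter configuration with auth headers.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://<your-otel-endpoint>/v1/traces",  # placeholder endpoint
    headers={
        "Authorization": "Basic <api-key>",             # placeholder auth header
        "project_id": "<entity>/<project>",             # placeholder project header
    },
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```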
Yep. And while we were working on this observability, I had a magic moment happen with MCP that I wanted to share with everybody, and I'd love to hear your MCP stories as well. Yeah. So
I used Claude Opus 4, which had just come out, to Weave-ify the agent that you built and add this MCP observability via W&B Weave, and it's going to get a little meta, so stay with us. Weave also has an MCP server. Okay, what does it do? We have an MCP server that lets your agents or chats, etc., talk to your traces, see the data, and summarize it for you. Okay. So we have this MCP server configured in my Windsurf and in Claude Code, and Opus 4 was able to use it to work through the task. Here you see an example. The agent basically started working on your code, then decided, okay, I'm going to run the code, and then said, okay, I'm going to go and actually see if the traces showed up in Weave. Then it noticed that they showed up, but they showed up incorrectly: some input or output, a specific parameter that it needed, it didn't know how to set, and it wasn't part of the documentation. And then the next moment just absolutely blew my mind. Opus 4 discovered that our MCP server exposes a support bot, essentially another agent, decided to write a query for it, received the right information after a while, acted on that information, learned how to fix the thing that it needed to fix, fixed it, and then went back to check whether or not the fix was correct. So my coding agent talked to another agent, via a support-bot MCP tool that it discovered on its own. I didn't even know that ability existed. Things got a little bit meta, and I was sitting like this while all of it happened. Didn't touch the keyboard once.
That's awesome. Yeah, it's pretty meta. Before we go, I also wanted to take a moment to make an announcement. mcp.run will also be exporting telemetry to OTel-compatible sinks. As I mentioned before, we run both servers and clients. For servers, we have this concept called profiles, which let you slice and dice multiple MCP servers into one single virtual server. And we also have an MCP client called Tasks, which is like a single-prompt agent that can be triggered via a URL or a schedule, and it marries nicely with the idea of profiles. Soon you'll be able to get telemetry out of both of these, and hopefully we'll connect up to Weights & Biases and have a little party. Yeah, you can send those to Weave straight from mcp.run.
Okay, so to recap: observability is here in MCP today, but it's not evenly distributed. OTel should get you most of the way there, but the community needs to come together to create the tooling and conventions to make it smoother; you shouldn't need to be an expert in observability to get this stuff working. So how do you get involved? Well, AI engineers: just start thinking about observability in your MCP tooling and whether or not you're getting observability end to end across your execution chain. For tool builders and platform providers, we should join up and work on higher-level SDKs. Arize's OpenInference, for example, is a great start, but all of us should help with instrumentation for clients who use bespoke SDKs, and work on conventions together. Ben, can you explain semantic conventions super quick? Yeah, sure.
As we learned earlier, spans carry user-defined attributes, right? So if they're user-defined, how does the sink know that a span is actually, say, an HTTP request with a 200 status code, or that it's an MCP tool call that has an error? That's where semantic conventions come in. And you can be a part of defining what the conventions are for agents, conventions that all observability platforms agree on. If you're interested in this, I would suggest checking out the GenAI semantic conventions effort by the OTel team.
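For a flavor of what such conventions look like in practice, here is an illustrative span tagged with gen_ai.* attributes in the spirit of the still-evolving OTel GenAI semantic conventions; the exact attribute names may change, and MCP-specific conventions are precisely what the speakers are asking the community to help define.

```python
# Tag a tool-call span with agreed-upon attribute names so any OTel sink can
# recognize it as an LLM tool execution rather than an opaque custom span.
from opentelemetry import trace

tracer = trace.get_tracer("mcp-client")

with tracer.start_as_current_span("execute_tool calculate_bmi") as span:
    span.set_attribute("gen_ai.operation.name", "execute_tool")
    span.set_attribute("gen_ai.tool.name", "calculate_bmi")
    span.set_attribute("gen_ai.system", "anthropic")
    try:
        pass  # ... perform the MCP tool call here ...
    except Exception as exc:
        span.set_attribute("error.type", type(exc).__name__)
        span.record_exception(exc)
        raise
```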
And lastly, for platform builders such as mcp.run: go add OTel support, help review RFCs, and, finally, just come talk to us about ideas, because everything is just kind of coming together. Everything is so new and fresh, and we don't really know exactly what to do. There's an additional track here at AI Engineer called the hallway track, and I've learned more about the stuff we've been talking about out there, by actually talking to people who implement this, than I learned while preparing for the talk. It's quite incredible. So, yeah. Again, I'm Ben. My call to action is just to go check out mcp.run. You can get a free account and try it out. Yeah, that's it.
And I'm Alex. Check out W&B Weave's MCP OTel support to learn how to trace MCP with OTel. I also started the Observable Tools initiative; I would love for you to check out the manifesto, see if it resonates with you, and join forces to talk about observability. And please visit us at the booth; we have some very interesting surprises for you, including a robotic dog right here that's observable. I also run the ThursdAI podcast, and I want to send Swyx a huge shout-out for giving me the support to show up here. If you're interested in AI news, we're going to record an episode tomorrow. That's it. Thank you so much.
Thank you, Alex and Ben. So, today we have covered a lot of ground. We talked about the origins of MCP, MCP spec details, and observability. It seems like AI agents are going to be doing tasks for us all the time now, autonomously. But one thing that's perhaps missing is the question of how agents would pay on our behalf, and which agents we should be using. So our next speaker will be touching on this topic. We have Jan from Apify to talk about the agent economy. Take it away.
So let me start with a question.
How does intelligence emerge in biological systems? Well, it's through neurons, right? When neurons are born, they are just individual cells, but over time they grow their axons and dendrites, establish connections with other cells or other neurons, and learn how to communicate in order to pursue their own interests, basically to get nutrients and so on. Over time they learn how to communicate with each other and with other cells to get nutrients and basically thrive. And this collective behavior, if you zoom out and look at a really large number of them, is something we call intelligence: the emergent behavior of smaller individual units that each pursue their own interests.
So how does intelligence emerge in markets? People always talk about markets: the market thinks this, the market reacted to that, and so on. In some way, markets are more intelligent than the individual participants in the market, and it's through the mutual interaction of those individual members, who pursue their own interests, communicate, and establish new interactions with others, that some sort of collective intelligence emerges which is bigger than the sum of its parts. And... oh, not sure what happened here. Sorry, we skipped quite a few slides there. All right, let's try again. So how does intelligence emerge in companies?
Well, this one is provocative: through Slack, right? Where people interact and pursue their own interests within the company, and altogether the company sometimes becomes more intelligent than its individual employees. And so this leads to my final question: how does, or how will, general intelligence emerge in computing systems? There is a lot of talk about AGI and ever-larger models exhibiting superintelligent behavior, but in my opinion general intelligence will actually emerge through the interaction of multiple entities, call them agents, basically multiple models pursuing their own goals, interacting with each other, and altogether exhibiting something we can call general intelligence. And thanks to MCP, we finally have the missing part that allows agents to communicate with each other and really create a fabric, or agentic mesh, where they can talk together.
So, hello everyone. My name is Jan Čurn. I'm the founder of Apify, and I'm going to talk about the rise of the agentic economy on the shoulders of MCP: basically, an economy where agents can find counterparts to interact with and purchase services from businesses, tools, or other agents, so B2A and A2A. All right. Before I start, let me quickly introduce Apify.
Apify is a marketplace of 5,000 tools called Actors. Historically we come from the web scraping industry, so most of these Actors are data extraction tools that let you get data from social media, from search engines, data for AI, for building RAG pipelines, data from the web for lead generation, and so on. But there are also other tools, like data processing tools. Altogether there are about 5,000 of them; some are built by Apify, and some are built by our community of creators, who actually make money on them. So it's a marketplace of software creators, if you will. Actors are self-contained pieces of software based on Docker with well-defined input and output, and basically they represent a new way to ship software, publish it, and integrate it into other systems.
For example, the Google Maps Scraper is quite a popular Actor from our store. It can extract data from Google Maps, more data than the Google Places API provides. There's the creator of the Actor, a description, various stats and so on, everything you would expect from a normal marketplace.
And thanks to the way Actors are built, it's actually super easy to integrate them from other systems. For example, we have SDKs for TypeScript and Python, an OpenAPI spec, and a CLI, meaning you can call them from the terminal, and that's only possible because they are well-defined units of software with input and output. We also have integrations with workflow automation tools like Make, Zapier, Clay, and many others, to make it really easy to call Actors from those systems. But obviously now we also have an MCP integration, which makes it possible to call Actors from AI agents or AI workflows.
And the way it works is that the agent just needs an API key, or an OAuth flow, and an account on Apify, and then through our MCP server it can interact with or call any of those 5,000 Actors on our marketplace. This only became possible thanks to what I would say is the killer feature of MCP, which is tool discovery. Not many clients support it yet, but just today I saw that VS Code added support for it, and just two days ago Claude Desktop added support for tool discovery. Basically, how it works is that the client connects to the MCP server and dynamically discovers the tools to use and interact with, based on the workflow. Let's say we have 5,000 tools in our store; there is simply no way we could publish all of those tools through OpenAPI, because the context would just be too large, and the more tools you have, the riskier the result. So we really want to provide the tools only as needed, and that is only possible through tool discovery, which I think is the main thing that will make MCP the huge differentiator from OpenAPI, for example.
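For readers who haven't seen tool discovery from the client side, here is a minimal sketch using the MCP Python SDK: the client lists whatever tools the server currently exposes at runtime instead of hard-coding them. The server command and arguments are placeholders, and the example call at the end assumes a tool that accepts empty arguments.

```python
# Connect to an MCP server, discover its tools dynamically, and call one.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="npx", args=["some-mcp-server"])  # placeholder
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # discovered at runtime, not hard-coded
            print([t.name for t in tools.tools])
            # After a tools/list_changed notification, a client can re-list and
            # pick up newly added tools without restarting.
            result = await session.call_tool(tools.tools[0].name, {})
            print(result)

asyncio.run(main())
```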
So MCP quickly became a standard for agentic interaction. This is Google Trends data showing that MCP is basically dominating the space compared to OpenAPI or A2A from Google. And I think MCP has already become the standard for agentic interaction. It became so popular that there are now many different registries of MCP servers; the guys from Mastra, our friends, even created a registry of MCP server registries, just to make sense of it all. Obviously, Anthropic is also working on their own registry, and I think Google's A2A has a DNS-based approach, with a well-known agent JSON file as a way to publish services through DNS. So basically there are now so many servers you can use from your agents. So many tools now support MCP; does that mean agents can discover and access any of them on their own? Well, not really,
because to use those services your agents still need to have API tokens for them. So even if you use Zapier MCP, which provides access to the 5,000 apps they have in their marketplace, you still need to connect those individual apps, you know, GitHub or Slack or whatever, to your account. Zapier on its own is not able to provide access to the third-party services; you as a user still need to facilitate that. So that actually means that agents are not able to find counterparts, other agents or other tools, to interact with on their own. They are still dependent on the human developer who builds the system and gives those agents access to different tools. And if those agents are to replace all the people and all the jobs, they need to be able to find services to interact with; right now they just can't do that, even though it's a basic thing that any one of us can do, to find a service and purchase it. So I argue that unless agents are able to do that, we will not be able to reach some higher level of intelligence in these agentic systems and behaviors, basically, if the agents cannot purchase services.
So how can we solve this problem? The first, sort of naive, approach would be to let the agents subscribe themselves to the target services. Basically, agents could have an email, maybe a credit card; they could fill in the subscription flow, maybe solve the captcha, create an account, and so on. But you can see it's not very practical. They might also need to have a phone number and so on, and quite often the services actually need a real person behind the account. So basically this wouldn't really work.
The second solution is a central identity and payments provider. There are a couple of companies pursuing this now: there would be a central authority where you can charge money, and then the agents can use that to buy services, and it provides them with their identity. For example, Coinbase is now pushing their x402 standard, I think Stripe is working on this, and Mastercard and Visa too. So I think this is going to happen eventually, but launching a new payment system is extremely complicated, because you're facing the chicken-and-egg problem of marketplaces. I think PayPal had to pay something like $100 million per month just to buy the market, and launching credit cards in the '70s was an incredible challenge, basically because nobody was accepting those cards, so why would people use them, and so on. So I think this will happen, but it will be a long process to establish.
So let me offer a third approach: a centralized marketplace of MCP services, a store, basically, where you just need one API token, one authentication, one account, to get access to all the other services. It works like this: the developers who publish these tools, these Actors, provide their own credit card and their own account for the third-party service, publish it, and add monetization to it, like how much it costs to call the service. They are then basically the owner of that service, they publish it on our marketplace, and suddenly it becomes available to the whole ecosystem of tools. This way we can scale rapidly, even without the target services knowing. So the Actor can run the code itself, or wrap an external API, or just publish an external MCP server, because MCP servers can actually be nested: you can have one parent server that exposes the actions or tools of the nested MCP servers. That's another cool feature of MCP; you can really build this sort of ecosystem if you can facilitate the payments and monetization. So the Actor charges the user, its developer gets the money and pays for the external service, and anyone can publish such an Actor, even without the target service knowing.
Right. So, time for a demo. It's not a live demo, because the internet is super flaky here. What you can see here is Claude for Desktop with access to the Apify MCP server; there are 18 tools available now, and I'm asking: what is the venue of the AI Engineer World's Fair in San Francisco? If possible, use Actors. You can see it searches the Actors for a tool that can answer this question. It finds a tool, an Actor, called RAG Web Browser, which is like a Google search that also fetches the data. So basically it asks the query, what is the venue and so on, and then it parses the resulting page. We can see it found the SF Marriott Marquis, which all seems correct.
So now let's use an Actor for scraping Twitter. This Actor is not available in the context, so the agent doesn't know how to use it. It searches the Actors in our store, finds an Actor that can scrape Twitter, and then calls add-actor, which is a tool that adds a new tool to the context. Claude is actually very verbose, describing a lot of things about it. And there is actually still a small bug in Claude Desktop where you need to disable and re-enable a tool so that the tool list refreshes and the new tools become available; I'm sure it's going to be fixed in the next release. And now let's use that Actor to get the last tweet of the AI Engineer conference. All right, so it calls the Actor on Apify. It knows the Twitter handle, probably from the website. And now you can see that it found the result, and the last tweet from this morning was something about workshops. That seems about right.
So now what? We have seen how we can use existing tools in our store. But let's say one of our competitors, a company called Browserbase (hey Paul, if you're here), they certainly haven't published an Actor in our store, but we did. We created an account on Browserbase, added our API token there, and published basically their MCP server on our store, without them even knowing. And now anybody can use the Browserbase MCP through Apify's ecosystem, without them having to do anything or even knowing about it. So now let's use Browserbase to fill in the email subscription form on the AI Engineer website with the email jan@apify.com, and let's see what happens. We'll see that the agent actually calls the Browserbase MCP through an Actor published by us on our Apify store and performs the actions on the web. This way we can easily bring a lot of existing MCP servers into our store and expand the ecosystem rapidly, without having to ask for cooperation from third parties. That's actually what we're doing now; we want to scale this marketplace rapidly. Okay, so now it's evaluating the screenshots, looking for the field and so on, and eventually it will manage to fill in the form and basically succeed at the task. I can skip this to save time; it takes some time for the agent to find the form and so on. But yeah, it succeeded; it completed the email subscription. And this way you can see that you can plug our ecosystem of Actors into any AI agent that supports tool discovery. All right.
And so this means that anyone can now publish tools or agents on the Apify Store, monetize them, and immediately get access to all the AI clients that have already integrated Apify, and to the whole ecosystem of tools. And people can actually make money on it: just last month we paid more than a quarter of a million dollars to our creators, and this number is growing rapidly. Overall, the Actors generate more than one and a half million dollars per month, and we have about 1 million monthly visitors to the whole ecosystem. Now we're really in the process of scaling this ecosystem. So if you're looking for ways to monetize your tools or agents, just talk to us, publish your Actor in the store, and get access to this ecosystem of developers and this visibility.
And there are some open questions that obviously remain. Will this autonomous tool discovery provide real value? Everybody who builds agentic systems knows that making sure the system works as expected is tricky, even when the tool set is fixed. So if we add the variable of agents discovering new tools, will it actually work? Currently it might be a bit flaky; I think we're still fairly early. But this remains to be seen, and I'm optimistic that as the LLMs get better, we'll get there and tool discovery will actually provide valuable and reliable results. Then there's the big question of how agents can trust tools, or each other. We know that you only interact with people you trust, so how can agents do that? We'll see. And can autonomous agent interaction enable AGI? Well, we'll see. Thank you very much for your attention, and feel free to try it at mcp.apify.com.
[Applause]
Thank you, Jan. And that about wraps up our MCP track for today. Thank you all for coming. Once again, my name is Henry; I'm the founder of Smithery. Happy to chat about MCP in the break, and make sure you catch the speakers as well. Have a nice rest of your day at the AI Engineer conference. See you.
Ladies and gentlemen, please welcome back to the stage the VP of developer relations at Llama Index, Laurie Voss.
[Music]
Hello everybody, and welcome back.
I hope you're having a great time. Let me hear from you if you're having a great time. Excellent. And I hope you're learning a lot. I personally hung out in the MCP track because that's my jam at the moment. I learned about dynamic tool discovery. I learned that VS Code's Insiders edition has full support for the entire MCP spec, which was very exciting to me personally. And I learned why MCP isn't any good yet, but it's going to be. I don't have a lot to say before I introduce our next speakers. I did ask my colleagues backstage if they had any jokes, and somebody said "the Wi-Fi," which is actually gold.
We have some great keynotes to close out today, including building agents at cloud scale, Windsurf doing everything everywhere all at once, and of course Greg Brockman of OpenAI talking about what it means to be an AI engineer. But first, here are Stephen Chin and Andreas Kollegger to give some closing thoughts on the graph RAG track. Our next speakers are the curators of the graph RAG track, here to speak about agentic graph RAG. Please join me in welcoming to the stage the vice president of developer relations at Neo4j, Stephen Chin, and the GenAI lead at Neo4j, Andreas Kollegger.
[Music]
Look at that, big hands. All right.
So, great to be back here again on the stage, seeing everybody. We kicked off this morning's keynote with some exciting kind of sci-fi memes around AGI, and we had an amazing graph RAG track. My favorite part was seeing what Zep, and what we're able to do with graph agent memory, can do to actually improve how models respond in agent systems. What did you like? What did you like, Connor? I've got to say, agent memory was definitely a big idea today. Zep was amazing. Then we also had the lunch and learn, all about agentic memory. And for graph RAG, it was really amazing that this is like the big umbrella of graph RAG: it's not just one thing you can do, there are many things you can do within graph RAG. So, speaking of graph RAG, we talked a lot about that in the morning keynotes, and I bumped into a bunch of attendees and realized we never actually explained what graph RAG is in detail.
So we were thinking it would be great to show a demo of importing stuff end to end and building a knowledge graph on stage here. What do you guys think? Is that going to be a good idea? Yeah. All right, so let's cut to a quick demo. So let me just get my... no, no laptops. Not actually. The Wi-Fi joke was mine, and that's why we're not doing a live demo. Apologies, but it was live in his hotel room last night. So, demo, please.
Okay. So, as this gets rolling... is it up yet? Oh, okay. So, the demo's up. Yes. Okay. Fantastic. This is a demo of our LLM Graph Builder that we have at Neo4j. What it lets you do is take unstructured data sources, build a graph out of them, and then query that graph. What you see it going through right now is just grabbing some web sources from Wikipedia. And of course, as we went through movies this morning, I'm just grabbing Wikipedia pages for all those different movies, laboriously grabbing, what was it, 2001: A Space Odyssey, The Matrix, things like that; that's what we had on the slides this morning. Now that all those sources have been selected, I'm going to say go ahead and just create a graph straight out of that. The graph creation process goes through what happens with any kind of unstructured data. First, it gets chunked up: all the data gets chunked and vectorized as well, and all of that gets stored. But then the graph part happens. And the graph part is, for any of the chunks that we've got, we're going to find people, places, and things, in this case movies, movie characters, and also movie themes, and turn that into a graph connected to the unstructured data. So we get everything from the chunks and the vectors in the chunks to a structured graph around them, providing something you can query.
Okay, it looks like that's all the way done. And with a quick spin, that is the beautiful graph that evolves out of just that unstructured data. Nice, right? Yeah. Now, if we zoom in on this a little bit, if I get my timing right... okay, there are documents and chunks; we're going to remove those. But there's this beautiful ball of purple in the middle. If we zoom in a little bit, the one movie that I've looked at here is The Terminator. You'll also see that the Terminator as a character has been found and pulled out of the unstructured data into the graph. And all those purple nodes are just the themes. Now, the LLM's gotten a little enthusiastic about finding themes, so it's found lots of different themes in the movies. That's fine. It also found other movies mentioned in the Wikipedia article; those are the other green bubbles. And it looks like, I don't know if you can see this, the green bubble up there on the right, that is Blade Runner, and it was connected to another node called Tech Noir, which was then connected to The Terminator. So just by looking at the graph, you could tell that The Terminator and Blade Runner are connected by sharing the theme Tech Noir.
So if we hop over to the graph chat now, we can see this is just a built-in chat that's going to talk to that graph. And I'm going to expertly type in: what themes do Blade Runner and The Terminator have in common? Fingers crossed, and thankfully I've recorded this, so we know it's going to work. Dot dot dot. You have to love the suspense of the dots. Okay. And there's the theme Tech Noir, and also science fiction, connecting them, verifying that the graph works and the chat on the graph works exactly the same way. Nice, right? Yeah, no, that's great.
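For readers curious what such a question looks like as a direct graph query, here is a hypothetical sketch using the Neo4j Python driver; the node labels and relationship type (Movie, Theme, HAS_THEME) are assumptions for illustration, since the LLM Graph Builder may generate a different schema.

```python
# Find themes shared by two movies in the generated knowledge graph.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (:Movie {title: $a})-[:HAS_THEME]->(t:Theme)<-[:HAS_THEME]-(:Movie {title: $b})
RETURN t.name AS shared_theme
"""

with driver.session() as session:
    for record in session.run(query, a="The Terminator", b="Blade Runner"):
        print(record["shared_theme"])  # e.g. Tech Noir, Science Fiction

driver.close()
```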
And we also have a big announcement here at AI Engineer World's Fair. I think this is a first in history: Neo4j, believe it or not, is still a startup. We're a large startup, technically, a 15- or 16-year-old startup, but we are still a startup, and we're offering a startup program to help other startups get onto our technology. So we have launched the new Neo4j startup program, which is coming soon. This is where you put the slide up. You can build your AI startup with Neo4j. We have a QR code you can use to sign up and join the program, and it will give you Aura credits and a free way to do all the things we're doing on our cloud. And of course, our technology is open source; you can do it yourself on our Community Edition as well. Thank you very much for having us here at the keynotes. Thanks, everyone.
[Applause]
[Music]
Our next speaker is here to teach us how to build agents at cloud scale. Please join me in welcoming to the stage the VP of developer relations at AWS, Antje Barth.
[Music]
Hi everyone. I'm thrilled to be back on stage here again at the AI Engineer World's Fair, and it's amazing to see this community grow. So today I'm going to speak about how we can build agents at cloud scale.
Now at Amazon and AWS, we truly believe
that virtually every customer experience
we know of will be reinvented with AI.
And not just the existing experiences,
but there will also be brand new
experiences we are now able to build
with the help of AI
agents. And we're not just theorizing
about this, right? We're all here
together to actually build the
future. Now, I want to start just with a
little bit of what that means internally
across Amazon as a business.
At Amazon, we have over
1,000 generative AI applications that
are either built or in development,
transforming everything from how we
forecast inventory to how we optimize
delivery routes to how customers shop
and how they interact with their homes.
And one of the most ambitious
deployments of AI agents is the complete
reimagining of
Alexa. And I know many of us have been
waiting for this for a long time. So
what you're about to see here represents
the largest integration of services,
agentic capabilities, and LLM that we
know of anywhere. So let's have a brief
look.
Look at my style.
Oh, hey there.
I love sharing this video because it really shows the power of agents at scale. And just to have a quick look at what that means in terms of numbers: we have over 600 million Alexa devices out in the world now, and with the help of the latest advancements in AI, we were able to really reimagine this experience. Alexa Plus works through hundreds of specialized expert systems; that's what the Alexa team calls groups of capabilities, APIs, and instructions that accomplish a specific task for you. And all of these experts also orchestrate across tens of thousands of partner services and devices to get things done, which you've just seen a glimpse of in this video. And we truly believe that the future will be full of these specialized agents, each with their own unique capabilities, working together seamlessly with other AI agents. Now, this example shows what's possible at massive scale. But how do we get there? How do we operate at this scale? Or, said differently, how do we move from the web services we've built for many years now to developing these agentic services? Luckily, many of the underlying principles remain the same, whether you're building for millions of devices, whether you're reimagining and integrating AI experiences into your enterprise applications, or whether you're a startup really just looking to scale your idea to the next level.
Now, another example I want to show you is an agentic service that we built at AWS. You might have heard about Amazon Q Developer, which is our code assistant that helps you across the software development lifecycle. Just a few months ago, we released a Q Developer agent for your CLI. It brings the agentic chat experience into the terminal; it helps you debug issues, you can ask it natural-language questions, it can read and write files, and it really helps make your day-to-day in the terminal more productive. So let's have a quick look at how this looks. Here is Amazon Q in the CLI, and I'll just ask a question here, in this case: hey, what do you know about Amazon Bedrock? The CLI is integrated with MCP, so what it does is figure out there is a tool: our AWS documentation team has released an MCP server. It connects to it, you see the tool call happening, and it asks for permissions. I give it the permissions, and then it comes back with a response that is grounded in the official AWS documentation.
Now, I don't want to talk much more about Q, but I do want to ask you to quickly think about how long it took the AWS internal teams to build and ship this agentic service. Let's do it with a quick raise of hands. Who thinks it took two months to develop and ship this? A few hands. Who thinks three weeks? All right, that's a bunch more hands. Who thinks it took half a year? Almost none. Wow, you folks are great. We built and shipped this within three weeks. And to me, this is just almost insane, right? The speed. And we heard it earlier, one of the keynote speakers called it out: the moat of AI is execution. And I think three weeks is super impressive.
Now, how do we enable teams, not just internally at AWS but in general, to build and ship production-ready AI agents this quickly? What we did internally is that our teams needed to fundamentally rethink how to build agents, and we developed a model-driven approach that really taps into the power of today's LLMs, models that are so much more capable of deciding, planning, reasoning, and taking actions, and lets developers focus on what their agent should do rather than telling it exactly how to do it. And the great news is we made it available for all of you to use as well. Just a few weeks ago, we released Strands Agents. It's an open-source Python SDK which you can check out and use to start building and running AI agents in just a few lines of code. So let me show you quickly how this looks. And before I go in here, just a fun fact: if you wonder why they called it Strands Agents, well, this is what happens if you let AI pick its own name.
All right. So the reasoning behind
because again the AI agent is is capable
of reasoning. It came up with like think
about the two strands of
DNA and just like the two strands of DNA
strands agents connects the two core
pieces of an agent together the model
and the
tools. And it helps you building agents.
It simplifies things by letting you rely on state-of-the-art models to reason, plan, and take action. You can simply start by defining a prompt and your tools in code, test it out locally, and then, once you're ready, deploy it, for example in the cloud. And this is how simple it is. It's just a couple of lines and should look pretty familiar. You install Strands Agents, you import it, and it comes with pre-built tools, which I'll talk about in a bit more detail. Then you just add the tools to your agent and you can start asking questions or building more complex workflows with it.
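Here is a minimal sketch of that pattern for readers following along. The module and tool names (`strands.Agent`, `strands_tools.calculator`, `current_time`) and the constructor arguments are assumptions about the open-source SDK rather than verified API, so check the Strands Agents docs before relying on them.

```python
# pip install strands-agents strands-agents-tools
# Minimal sketch of a Strands agent using pre-built tools.
# Import paths and parameter names are assumptions; consult the SDK docs.
from strands import Agent                            # core Agent class (assumed import path)
from strands_tools import calculator, current_time   # two pre-built tools (assumed names)

# Define the agent: a prompt plus the tools the model is allowed to call.
agent = Agent(
    system_prompt="You are a helpful assistant.",
    tools=[calculator, current_time],
)

# Ask a question; the model decides whether and how to call the tools.
agent("What time is it, and what is 1234 * 5678?")
```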
Now, by default, Strands Agents integrates with Amazon Bedrock as the model provider, so you can see the model config here using Claude 3.7 Sonnet. But of course it's not limited to AWS. You can use Strands Agents across multiple providers. For example, we have an integration with Ollama, so you can start developing and testing locally. We have an Anthropic integration, a Meta integration for the Llama API, you can use OpenAI models and any other providers available through the LiteLLM integration, and of course you can also develop your own custom model provider.
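As a rough illustration of switching providers, the sketch below configures a Bedrock-hosted Claude 3.7 Sonnet model and a local Ollama model. The `BedrockModel` and `OllamaModel` class names, import paths, and parameters are assumptions, and the model IDs are only examples.

```python
# Sketch: choosing a model provider for a Strands agent.
# Class names and parameters are assumptions; check the SDK docs.
from strands import Agent
from strands.models import BedrockModel            # Amazon Bedrock is the default provider
from strands.models.ollama import OllamaModel      # local development via Ollama (assumed path)

# Bedrock-style config: Claude 3.7 Sonnet (example model ID).
bedrock_model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    temperature=0.3,
)

# Local alternative for development and testing.
local_model = OllamaModel(host="http://localhost:11434", model_id="llama3")

agent = Agent(model=bedrock_model)   # or Agent(model=local_model)
agent("Summarize what MCP is in one sentence.")
```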
Now, quickly on the tools: as I said, Strands Agents comes with over 20 pre-built tools, anything from simple tasks like file manipulation, API calls, and obviously integrating with AWS services, to more complex use cases. I just want to call out a couple of them. So
there's a whole group of integrated tools for memory and RAG, one tool specifically called retrieve, which lets you do a semantic search over a knowledge base. And just to show you the power of this: we have an internal agent at AWS that manages over 6,000 tools. Now, 6,000 is a hard number of tools to put into a single context window and hand to one model to decide on. So what we did is put the descriptions of those tools in a knowledge base and use the retrieve tool. The agent can find the most relevant tools for the task at hand and pull only those back into the model context for the model to decide which one to use. That's just one use case of how we're leveraging it.
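The sketch below illustrates that pattern under stated assumptions: the pre-built `retrieve` tool name comes from the talk, but the environment variable, knowledge-base ID, and prompt wiring are hypothetical stand-ins for however the real tool is configured.

```python
# Sketch: semantic tool selection using a knowledge base of tool descriptions.
# The idea from the talk: tool descriptions live in a knowledge base, and the
# agent uses `retrieve` to pull only the most relevant ones into context.
# The env var name and knowledge-base ID are hypothetical.
import os
from strands import Agent
from strands_tools import retrieve   # pre-built semantic-search tool (name from the talk)

# Point the retrieve tool at the knowledge base that stores tool descriptions.
os.environ["KNOWLEDGE_BASE_ID"] = "kb-with-tool-descriptions"   # hypothetical ID

agent = Agent(
    system_prompt=(
        "Before acting, use the retrieve tool to look up which internal tools "
        "are relevant to the user's request, then decide which one to call."
    ),
    tools=[retrieve],
)

agent("Find the right internal tool for rotating database credentials.")
```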
There is also support for multimodality across images, video, and audio with Strands. There is a tool to prompt for more thinking and deep reasoning. And it also comes with pre-built tools to implement multi-agent workflows, whether graph-based workflows or a swarm of sub-agents working together. Now, you cannot talk about tools without mentioning MCP, right? So obviously we integrated MCP natively within Strands, so you can also use it to connect to thousands of available MCP servers and make them available as tools for your agent. Support for A2A is also coming soon, but let's talk a little bit about MCP
first. If you're building on AWS
already, make sure to bookmark this
GitHub repo: it's awslabs/mcp. Here you can find a very long list, much longer than what you see on this slide, of a growing number of MCP server implementations, specifically for when you're working and building on AWS.
Now, one of the challenges stems from
the fact that once we all started
building MCP servers, what we had was
standard IO, right? So, it started out
to help locally connect your systems,
your clients to respective
tools. And here's just a quick example, which is important for a demo I'll show in a little bit. This is just a standard IO implementation of an MCP server. It should look familiar to most of you working with MCP using the Python SDK and FastMCP. All I'm doing here is setting up my server and using the decorator to define a tool. In this case, my tool rolls a die, and you can see in the code that it has an input to define the number of sides.
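A minimal sketch of such a stdio server is below, in the spirit of the demo. The server and tool names (`dice-roller`, `roll_dice`) are illustrative choices; the decorator pattern follows the MCP Python SDK's FastMCP usage described in the talk.

```python
# dice_server.py - a stdio MCP server with a single dice-rolling tool.
import random

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dice-roller")

@mcp.tool()
def roll_dice(sides: int = 6) -> int:
    """Roll a single die with the given number of sides (e.g. 20 for a d20)."""
    return random.randint(1, sides)

if __name__ == "__main__":
    mcp.run()  # runs over standard IO by default, so local clients can connect
```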
And I had to put a picture here because, I have to admit, I just learned this myself. Do we have D&D fans in the room?
Woohoo. All right, a few of you. So you all know what I'm talking about. For the rest of us: I just learned there are dice, and I have one here, I'm not sure if the camera can catch this, like the ones here on the slide. A die that has, for example, this one, 20 sides, something very normal in the D&D world when you start a game. Don't ask me questions about D&D; my colleague Mike Chambers, who's either here or in the expo right now, built the demo, so kudos to him, and he can answer all of the D&D questions. All right, just keep that in mind. I'll come back to this in just a second.
Now, what we want to do here is decouple this and connect to remote MCP servers, because the topic is scale, right? And in the AWS world, the way to do this is as easy as deploying it as a Lambda function. We can do this now with streamable HTTP, and the same concepts apply: you put your Lambda functions, as you would have before, behind an API Gateway and then connect. And because we care about security and authorization, in the quick demo I'm going to show you I'm using an authorizer; you can also plug in Cognito for this part. And I'm also going to store session data in a DynamoDB
table. So let's roll this quick demo
here. What you see here is an MCP Lambda handler that we developed. It's available in the GitHub repo, and it makes it really easy to set up your MCP server in Lambda. Here's a very simple hello-world example: the tool is again defined with a tool decorator, and then in the Lambda handler function you reference the incoming event and pass it to that MCP server. Now, if we look at the server implementation, here we're doing a little bit more. You can see how we're adding session table support, which is a DynamoDB table. We're defining the tool; this is the dice-rolling tool that I just pointed out, but this time it's hosted as a Lambda function. You can write all the code you want to have there as well. And then at the very end it's the same single line that, when you call the Lambda function, passes the request on to the MCP server.
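Purely as an illustration of the shape described here, the sketch below shows a dice tool hosted in a Lambda-backed MCP server. The `MCPLambdaHandler` class, its constructor arguments, and the `handle_request` method are hypothetical placeholders, not the actual API of the handler in the awslabs/mcp repo, so treat this as a diagram in code rather than working code.

```python
# Sketch of an MCP server hosted in AWS Lambda, following the structure in the demo.
# `MCPLambdaHandler`, its arguments, and `handle_request` are hypothetical placeholders;
# see the awslabs/mcp GitHub repo for the real handler the speaker refers to.
import random

from mcp_lambda_handler import MCPLambdaHandler  # hypothetical import

# Session state is kept in a DynamoDB table so the Lambda itself stays stateless.
mcp = MCPLambdaHandler(name="dice-roller", session_table="mcp-sessions")  # hypothetical args

@mcp.tool()
def roll_dice(sides: int = 6) -> int:
    """Roll a single die with the given number of sides."""
    return random.randint(1, sides)

def lambda_handler(event, context):
    # The "same single line" from the demo: hand the incoming API Gateway event
    # to the MCP handler, which serves the streamable HTTP protocol on our behalf.
    return mcp.handle_request(event, context)  # hypothetical method name
```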
Let's deploy this. And again, we're using the existing tools to deploy Lambda functions as we have before; this one is using AWS SAM to deploy it to the cloud, and then we receive the API Gateway URL as well. Now, on the client side, I'm using Strands Agents, as you can see, with the MCP integration. I'm passing in my API Gateway URL to connect. For authorization I have a bearer token; again, this is a simple concept demo, but you can build more
robust integrations here as well. I'm calling the list-tools operation and then passing those tools to my agent, as we've seen before; this time it's the MCP-provided tools.
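A sketch of that client is below, assuming the Strands MCP integration and the MCP SDK's streamable HTTP helper. The import paths, the `MCPClient` and `list_tools_sync` names, and the URL and token values are assumptions and placeholders rather than verified API.

```python
# Sketch of the client side: a Strands agent using tools served by the remote
# MCP server behind API Gateway. Names are assumptions; URL and token are placeholders.
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.tools.mcp import MCPClient

MCP_URL = "https://<api-gateway-id>.execute-api.<region>.amazonaws.com/prod/mcp"
HEADERS = {"Authorization": "Bearer <token-from-your-authorizer>"}

mcp_client = MCPClient(lambda: streamablehttp_client(MCP_URL, headers=HEADERS))

with mcp_client:
    tools = mcp_client.list_tools_sync()   # fetch the remote tool definitions
    agent = Agent(tools=tools)             # hand them to the agent like any other tool
    agent("Roll a d20 for me.")
```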
If we run this, we can quickly see it in action. I'm going to ask it to roll a die, specifically a d20, so again 20 sides, and it comes back. What did we roll? You can see the tool use kicking in here. We rolled a seven. Great. So this is just really a quick
example. The good news is once you're in
the AWS world and you're working on
Lambda, everything you can build with
Lambda, you can integrate there. So
basically, you have access again to all
of the great features, capabilities,
applications you might have already
built on
AWS. Now the next step here is how do we
make agents talk to each other, right?
That's kind of the next frontier. And we are super excited about all the open protocols that are emerging right now. With MCP, for example, we joined the steering committee. We're an active part of the community, contributing code and helping to further
evolve MCP. If you want to learn more
about this, here is the QR code. We have
a whole blog series started on our open
source blog. Feel free to check that out
as we continue to help evolve those
protocols. Now, what's next? We all are
aware that this is just the beginning,
right? There will be so much more
coming. And if you had a chance to check
out my colleague Danielle's talk
yesterday on useful general
intelligence, I just want to quote her a
little bit. She said the atomic unit of
all digital interactions will be an
agent call. So we can imagine a future where you might just have your personal agent, like the one shown here, connecting to an agent store, with agents working together to accomplish tasks for you. And some of you here in the room might already be building this, right?
So let's go and build this future
together. Thanks so much. Check out the additional sessions we have: my colleague Mike will go much deeper into the dice-rolling demo, everything MCP and Strands, and my colleague Suman tomorrow will also have a deep dive on Strands. And with that, thank you very much. Check us out in the expo hall and grab your
[Applause]
[Music]
D20. Our next presenter is here to tell us what's next for agentic IDEs. Please join me in welcoming to the stage the head of product at Windsurf, Kevin Hou.
[Music]
Hello. Hello.
How we
doing? All right. How's Yeah. How's the
energy level?
We're good. Good. Yes. Let's go. Let's
go. Two more. Two more. My name is
Kevin. I lead product at Windsurf. And
I'm super excited to be back here. Thank
you so much, Swyx and Ben. It's always a pleasure to come back to AI Engineer World's Fair. The velocity of our industry right now is incredible. It's like being on a kite on the ocean, and we're really excited to see where the winds are taking
us. A year ago, we didn't have the Windsurf editor being used by millions and millions of people all around the world. And hopefully this is a larger number than last time: how many people have heard of Windsurf? And how many people have used Windsurf?
[Laughter]
Good numbers. Good numbers. We've got to improve that. And Windsurf itself has changed immensely in the last six months since its launch in November. We retired the name Codeium because we decided to catch this new wave, which, by the way, is what we call our next-generation innovations in the product. We call them waves. And in case you missed it, we are now 10 waves in. Some of the key waves we've been really excited about: web search, MCP support, autogenerated memories. Oh, I was supposed to do that. Autogenerated memories, deploys, and parallel agents, to name just a
few. As the waves keep growing, so does the number of people who have discovered and integrated Windsurf into their daily workflows. Today, we are generating about 90 million lines of code every single day, and that equates to over a thousand messages sent every single minute. But
today is not about growth. I'm not going
to sit here and tell you about the
numbers. I'm going to tell you about the
why. Why do people feel connected to the
Windsurf
editor? And I know no AI company really
wants to disclose its secrets, but I had
to come up with some
content. So today, I'm going to let you
in on one of ours. Our secret sauce is a
shared timeline between the human and
the AI. And this is what makes people
feel like we're reading their
minds. And now everything you do as a
software engineer can be thought of on
this shared timeline. So if we rewind way back to the dark days, this is pre-autocomplete, when everyone knew how to write a for loop. You had to do everything: you had to edit files, you had to type every single character. Imagine that.
But then, once services like Copilot and Codeium launched, devs got really excited. They started seeing a small percentage of their code being written by AI, and we started to abstract and accelerate the number of small edits, small actions, that we would do for a user. And in late 2024, with the advent of Windsurf's agent and the launch of the Windsurf editor, we saw that we could do more and more for the user. We started being able to edit multiple files at once, perform background research across thousands and thousands of files, and execute terminal commands directly inside the editor. But at
Windsurf, we're in the business of
trying to change how software gets
created. And this means that the
timeline is actually a little bit more
complicated. It needs to handle actions
taken outside of just the IDE.
And so given how much of a developer's
workflow happens outside of the editor,
what does this mean for
Windsurf? First, Windsurf is going to be
everywhere. Specifically, Windsurf will
need to be able to read and ingest
context from every single source that a
developer uses. And if we zoom out and think about what makes you all successful software engineers, there are a couple of different categories. The first is coding-related: file reads, running terminal commands, seeing your history, even which tabs you have open inside your editor. This all informs
how to generate the correct code. But it
goes beyond that. There's external
sources. Things like going on to GitHub
and viewing a past history of commits,
maybe looking at a PR that is doing
something similar to the feature you're
about to implement, doing online
searches, web searches, looking at
documentation. And then there's the third category, and this is where it gets a little bit interesting. It's called meta-learning. It's the idea of what separates a junior engineer from a senior engineer from a staff engineer: the organizational best practices, the engineering preferences, that all get encoded into what makes good
code. And so if we think about what this
means in practice, let's say that we are
going to build a new page on a data viz
dashboard. Let's walk through step by
step. So first you would probably start
in Slack as all good things start from
Slack. You'll build context looking at a
bunch of maybe customer requests. Maybe
you'll have some internal messages.
You'll collect that context and you'll
start planning. And this means you're
going to be in Google Docs. You're going
to be writing design docs probably
working on some infrastructure designs.
You're going to be tracking tickets
inside of Jira. And then you might have
a designer who's actually working in
Figma in parallel putting together all
this material. And then finally, the fun
part, or at least this is my favorite
part, which is the actual writing of the
code. And hopefully you use something
like Windsurf to do
so. But you're not done there. Once your code is complete, you still have to open the PR. You've got to get reviews. You've got to merge into main. You've got to deploy. SEO, analytics, the list goes on and on and on.
And this is really why we've built what
we've built. Because we know that for
you, it's extremely important that we
can fetch context from your Google Docs,
that we can read your Figma files, and
that we can one-click connect to any MCP
service so that you can access your
information in things like Notion,
Linear, Stripe, and countless others.
And we've spent the last 10 waves making
sure that Windsurf can be
ubiquitous. But we know that's also not
enough.
We know it's not enough just to read. We
need to be able to do and write
everything. We need to be able to do it
all for
you. And so the AI has to take action on
a wide variety of surfaces beyond just
the coding surface in order to
accomplish what a human software
engineer would do. And so this doesn't
mean just write code. This means
interacting with third party services,
provisioning API keys, writing design
docs, PRDs, wireframing, testing, and
the list could go on and on and on. And
so for the last 6 months, we've oriented
ourselves around how do we do
everything. And if we go back to this
concrete example of building a new web
app, where do we start? We start by
running codebase relevant terminal
commands. This is something that we
launched right at the advent of
windsurf. And what's really cool about
what we can do here is that we can
intelligently decide which commands we
want to run automatically and which ones
we want to wait and ask for explicit
user
approval. Next, you'll open up Windsurf browser previews, which allows you to iterate from there. It allows you to visually iterate with the agent so that Windsurf can take control of Chrome just like you would: inspecting DOM elements, looking at your JS console, being able to do what a web developer would do.
And so now you could say our app is code complete. We'll use the GitHub MCP to open up a pull request, and we can use context from your other PRs to inform the description and the test plan. And code review is still a
necessary part of any software company.
And so we launched windsurf reviews
which can automatically leave comments
and suggest changes asynchronously so
that you can be confident that the code
that hits main is production ready.
And so now that your code is merged, you'll want to be able to deploy. So we also released a one-click integration with Netlify: using Windsurf's custom tool integrations, in one click the agent will deploy what you have to the live web.
And so as you can see, we've really
built the ability for Windsurf to read
everything that you can and do
everything or almost everything that a
software engineer can. So then you might
ask, what's
next? It's only inevitable that
Windsurf will be on all the time,
working for you, even when you don't
know it.
We pioneered agentic, human-in-the-loop synchronous workflows back when we released Windsurf in 2024. And today, timelines are 80 to 90% agent, 10 to 20% human. But we're trying to build towards a future that gets to 99% agent and 1% human. We only want to ask the user for final approval. And as more and more of these timelines and workflows become AI-powered, it becomes possible to have Windsurf working for you at all times.
Not only as you type and use
autocomplete and tab, but also in the
background, researching when you're
working fully in parallel, only asking
you to approve. And we want to build this future where you can code anytime. You can write software at any time: in your bed, on the toilet, when you're on the bus, voice-activated. All right, we'll throw GPT, we'll throw Gemini at this timeline problem, but then from there, where do we go? How do we improve? And specifically, how is Windsurf able to tackle this problem of the
timeline? And if we take a step
back, this really doesn't look like
we're writing code anymore. This looks
significantly more complicated than your
average competitive programming
question. Windsurf wants to
revolutionize the way that software gets
built. It's not just how code gets
written. We are solving a broader set of
tasks than just code. And while the industry focuses heavily on things like SWE-bench, we know the future is not going to be tokens in, tokens out. Software engineering workflows are going to be much messier than that. It means
that you have to be able to pick up tasks mid-workflow. You have to be able to deal with messy codebase states mid-commit, and you will have to work with tools that are outside of the editor. And so we have to be able to
ingest and perform over this broad set
of actions on this timeline to keep our
users in the flow. We have to be able to
open up PRs. We have to know when to
access analytics. We need to know how to
debug your CI/CD all by itself. And this
problem starts to look really really
different from what people are evaling
on. And because we have our own
representation of this timeline, we
needed a different system to be able to
handle these types of actions than what
the off-the-shelf frontier models could
give
us. And so where are we going with this? The realization of this is our brand-new software engineering model, called SWE-1. We realized that we could actually dream bigger and build the best software engineering model that we could. SWE-1 is trained to handle software engineering workflows, not just purely code generation. And we use two main offline eval benchmarks. The first one is an end-to-end task benchmark. This is basically tackling pull requests: given an intent and the starting point of a codebase, how do we get to the end and pass all the unit tests? That should sound familiar. The second one is where it gets
a little bit more interesting. This is what we call a conversational SWE task benchmark, and it measures how well the model can assist when it's dropped into an existing user conversation or a partially completed task. This actually lends itself very nicely to the Windsurf paradigm, right? Because we're not going cleanly from start to end; we're assisting and helping you along the way, mid-timeline. So it results in a blended score of helpfulness, efficiency, and correctness, and really tests the model's ability to seamlessly integrate into the Windsurf style of working. And this initial performance really gives us a lot of confidence in SWE-1's architecture, specifically how we've
been able to train for software engineering workflows. And we've been able to achieve near-frontier-model results at a fraction of the cost and with a significantly smaller team. One of Windsurf's greatest strengths, of course, is the value of community: real software engineers doing real work, giving real feedback. And what we found is that SWE-1, it's in the little dropdown for the models, right up there with the rest of the frontier models, and people are choosing SWE-1 because it recognizes how they do work, not just how to generate code. It's actually contributing at an even higher frequency than models like 3.7 and
3.5. Windsurf builds at the frontier so
that our users can build more with the
best technology. We learn from our
failure modes so that we can iterate
from
there. And what does this start to look
like? Dare I say it, a data
flywheel. We ship the best product. Devs
and non-devs use that product to level
up as a skill multiplier or as a skill
enabler. Users then help us find the
frontier. They use things like thumbs up, thumbs down, accept, reject, constantly informing us not of what the SWE-bench frontier is, but of what the software engineering frontier is. What
tools are missing? Which workflows are
repeated? Where does the product fall
short? And we take those insights and we
build at this
frontier. We train a better model. We
build more tools. We improve our agentic
harness. We improve our memories, our
checkpointing with the goal of being
everywhere doing
anything. And we will repeat this cycle.
We will be shipping, finding the
frontier, building at the margin, and
repeating. And what gets me personally really excited about this is that SWE-1 is really an example of this in action. We have a very small team, significantly fewer resources than the larger companies, and we were able to achieve near-frontier-model-quality results with SWE-1. And even more so, this is really a
demonstration of what it means to build
AI products in 2025.
It demands this harmony of model, data,
and application where the application is
actually mimicking the user behavior
that you want to replicate inside of
your
model. And this is how windsurf will be
everywhere doing everything all at
once. Thank you so much for listening.
[Applause]
And I won't give you any promises, but
someone made a profit.
Um, but in all seriousness, thank you so much for listening. I want to make sure that every engineer out there is using the best possible tools, so please give Windsurf a try today. And we are also hiring across a number of different roles. We have a booth downstairs, so please come join us and help make this future a reality. Thank
[Applause]
you. All right, everybody. Let's hear it
for all of our keynote speakers so
far. I learned a lot from our keynotes
today. I learned that Alexa can keep
track of my dog, which is amazing
because my dog is a runner. Uh, and that
you can plug an agent into Lambda, which
is genuinely very neat. I would
absolutely want to do that. Uh, we also
learned about Windsurf's tremendous
growth. Uh, I wasn't in the room when he asked who's on Windsurf. Hands up if you are using Windsurf. What about people who are
using
Cursor? Whoa. Okay.
Uh, who's on VS Code of any
flavor? All right. Uh, what about
Zed? The few, the proud. Uh, any what
about something else? Who's on other
things? All right. Our next conversation
is with uh Greg Brockman, formerly CTO
of Stripe, co-founder of OpenAI, and
currently president of OpenAI. Uh fun
fact about Greg is that he's entirely
self-taught in AI. Uh he has no formal
background in it. Uh and with no formal
background, he taught himself from free
online resources and blogged about the
experience to encourage other people
which I think is genuinely inspiring.
Like that's how I taught myself web
development. It's it's uh a fun sort of
fundamental thing about the internet
that you can just teach yourself. So,
uh, without further stalling, while they arrange chairs, uh, please welcome to the stage for a fireside chat the one and only Swyx, and Greg Brockman.
[Music]
Well, hello. Hello. Is uh mic working
for you? Check, check, check. One, two,
three. All right. First hard technology
problem of the day down. Yeah. Yeah.
Well, the Wi-Fi is the other one. Um,
everyone here knows. Um, so, Greg,
welcome to AI Engineer. Thank you so
much for taking the time. Thank you for
having me. Um, we're going to go a little bit chronologically, and uh, a lot of people sent in questions and I've sort of grouped them up for you. So we'll just get right into it. Um, so, you know, I did some deep research on you, uh, starting with Deep Research. Um, I called it Peep Research because we're researching a person. Uh, you actually
did theater growing up and chemistry and
math and you wrote a calendar scheduling
app and that's what got you into coding.
But like what really inspired your love
for coding? Like why why are you the
coding guy? Well, the funny thing is I
thought I was going to be a
mathematician when I grew up. Yeah. You
know, I'd read about people like Gowa
and Gaus, you know, we work working on
these like hundred 200 300 year time
horizons and I was like that's what I
want to do. if anything that I come up
with is ever used while I'm still alive,
it wasn't long-term enough. It wasn't
abstract enough. Um, and I was writing
this chemistry textbook after high school, sent it to one of my friends who'd done something similar in math, and he said, "No one is going to publish this. You can either self-publish," and I was like, ah, sounds like a lot of work, a lot of capital, "or you could make a website." And I was like, "Guess I'm going to learn how to make a website." And so, I literally went on W3
Schools and did their PHP tutorial. How
many people here remember W3 schools?
Yeah, a decent number of hands. Um, and
I remember the very first thing I built
was a table sorting widget, right? I had
this picture in my head of what it would
be. And I remember the moment that I
clicked the column and it sorted
according to that column, which was
exactly the thing that I wanted. And I
was like, that was magic, right? And I
was like, this is so cool. Because the
thing about math is that you think hard about a problem, you understand it, you write it down in an obscure way, you call it a proof. And then like three people will ever care, right? But in programming, you write it down in an obscure way, and we call it a program. And then maybe only three people ever read that program and care about the code.
But everyone gets the benefit. No one
has to understand the details. That
thing that was in your head, it's real.
It's in the world. And I was like, that
that's the thing I want to do. Forget
about that hundred-year time horizon. I
just want to build. Uh, you do just want to build. So, you were so good at it that somehow, somewhere, you got cold-emailed by Stripe while you were still in college. That's right. Uh,
what's the story? How, first of all, how
did they find you and what was it to
convince you to drop out to join them?
Well, so I had mutual friends with all the people at Stripe, the, you know, giant company of like three people at the time. And they had done the usual thing where they'd asked people at Harvard who around campus they should talk to, who they might recruit, and my name came up. They asked the same of the people at MIT, because I'd been at Harvard and actually dropped out to go to MIT, so I had the advantage of, I guess, getting upvotes on both sides. Um, but I remember when I met Patrick, I had just flown in. It was late at night, it was storming, and I showed up and we just started talking about code, right? And it was just one of those moments where you're like, this is the kind of person that I've wanted to work with and been looking for. Uh, and so I ended up dropping out of MIT and, uh, you know, flew out and have been out here ever since. Yeah. Yeah. Uh, we have some guest questions sprinkled along the way, as you know. Uh,
so question from someone named Matthew
Brockman. I've heard of him. CTO of
Julius AI. When do you think our parents
will give up on the dream of you
finishing your degree?
Maybe Harvard or UND will take you back. Yes. Uh, well, never. Um, it was
definitely, you know, I think it was no
matter where you're going, if you tell
your parents you're leaving Harvard,
it's going to be hard. Um, you tell your
parents you're leaving school
altogether, it's going to be difficult.
Um, and I think, you know, it was actually, to their credit, even though it was difficult, they were like, we trust you, you must see something and understand something from where you sit that's hard for us to see from halfway across the country. Um, but yeah, I think that, as you know, I did Stripe and had a good time and actually learned things, and it turned out to be a real company and not just, you know, dropping out and doing nothing. I think they really have warmed up to it. I think they're very proud of you. Yes, absolutely. So you were with Stripe
from 4 to 250 people as the first CTO
eventually. Um, one thing I found recently that Hacker News maybe doesn't know is that apparently the Collison installation only happened like a handful of times. It wasn't really a thing at Stripe. Was that... That's, I think that's true. Um, yeah, it is the thing that has, you know, survived. It's an urban legend because it's so cool, like you were so customer-obsessed. Anyway, so what else do people
do we want to clear the air? Yeah. Well,
I think people don't understand how hard
it was, right? It was just like um like
I remember um you know, first of all,
the the kind of thing that we did a lot
of is that we added all of our customers
on G-Chat. And so it was very much the
case that we were in constant contact
with them. And so even if you're not
literally sitting over their their
shoulder, you're doing the next best
thing. Um, but I remember um like one I
you know one one one day we realized
that I you know the the the payment back
end that we were on it just wasn't going
to scale. Uh we absolutely needed to be
on Wells Fargo and we got sort of the
deal done but now we need to do a
technical integration. And they said, well, this technical integration is going to take like nine months, because that's how long it takes. And we're like, that's crazy. You're a startup; we can't sit around waiting nine months to get this thing done. Um, and so actually, in 24 hours, we completed it, by basically treating it like a college problem set. And, you know, I
was implementing everything. John was working from the top of this test script, testing everything and being like, this is broken. Daryl was starting from the bottom and working his way up. And in the morning we got on with the certifying person, we sent some test messages, and there was an error, and the person's like, all right, I'll see you next week. Because that's how all their customers operate, right? There's an error, so clearly you need to send it back to your dev team. And we were like, no, no, no, there must just be some sort of glitch in the system. And Patrick was just talking to keep her on the line while I was frantically editing the code, and so we got like five turns in, and we actually failed. But fortunately she was nice enough to reschedule two hours later, and then we passed. And so you realize that was like six weeks' worth of normal dev work that you got done in that moment, because you didn't just accept the arbitrary constraints of how other organizations would work.
Yeah. Yeah. Do you think there's a lot more opportunity like that in most jobs? Like, how do you advise other people to be that fast, I guess, or to cut that many cycles? Yes. I mean, the way I think about it is that if you think from first principles, you can find where things genuinely need to be slow or done the way they're normally done; those things exist, right? The general principle of, ah, just don't worry about the constraints and just do the thing, I think that is not 100% true. I think it's really about mapping out where there is unnecessary overhead, constraints that are no longer applicable, that don't apply to your specific circumstance. And I think this is especially true in this world that we're in now, with AI accelerating productivity so much. Yeah. Just fire off a Codex. Why not, right? Um, one last thing
about your sort of pre-OpenAI life is independent study. I found that it's just a recurrent theme: from high school, you did rec center, I did, um, and your sabbatical as well. So you've just done it repeatedly. What makes independent study effective? I think there are a lot of people who don't do a good job of it and kind of waste a year. What do you do that makes it so effective? Well, I think it
was a key part of how I grew up. Um, you know, in sixth grade my dad taught me algebra, and in seventh grade I showed up at the school where, for the first time, you track into advanced math, pre-algebra. We went to the teacher like, can he skip this and go directly to the eighth-grade course, and the teacher looked at my mom and me very condescendingly and was like, every parent believes that their child is
special. Uh, and after like a month of
being in this teacher's class, you know, I was paying no attention, just playing calculator games in the back, and she'd try to trip me up and call me up to answer questions at the whiteboard, and I would just get them all right. She was like, "All right, fair enough. Your child should be in the next year." Um, but then when I was in
eighth grade, there was no more math
left in my middle school. I didn't have
a car, so I had to do online courses.
And in that one year, I ended up doing
three years worth of high school math.
And so I think for me a lot of it is that if you're excited about something independently, if it's something you want to do, you can break the constraints there as well. You can do three years of math in one year, and then it compounds, because the next year I was at my high school and finished the math there, and then all through 10th, 11th, 12th grade I had no more math, but I did have a car, so I was able to go to the University of North Dakota and take whatever classes I wanted there. And so that kind of compounded and compounded, all the way to learning programming. And the way I learned programming was very much self-study, just building things and experiencing things out in the world.
And so I think that the thing I would
just advise is like if you have an
opportunity to explore and you have a
passion, you're actually enjoying it,
just go deep, right? And by the way,
it's not always fun, right? I think it is very easy to kind of feel like, I got bored, but if you just push through those hurdles, then I think the reward is worth it. Yeah. You
self-studied machine learning too like
that was a whole period of your life. Um
any particular highlights from there? It sounds like you talked to Geoff Hinton at one time. I did talk to Geoff Hinton. Yeah. Yes. And did that help, or what was the most helpful thing? Like, you became a machine learning
practitioner? Well, so when I started out, you know, I'd been at Stripe. I was reading Hacker News posts about deep learning, and it was like there was a "deep learning for X" every day, it felt like. And this was, you know, 2013, 2014, and I was like, what is deep learning? And I knew like one person in the field
and so I talked to them, they introduced
me to some more people and then they
introduced me to more people and the
thing that surprised me was I kept
getting introduced to a bunch of my
smartest friends from college and I was
like that's interesting. All of these
people ended up in this field like
what's going on and I started to realize
that there was something real being built, right, being developed; people were really making these systems do materially new things that computers were not able to do before. I was like, that is the thing. Um, and so after I left Stripe,
you know, I knew I wanted to do
something in AI. Um start an AI company,
but I didn't really know how to
contribute, what my skills would be
useful for. And uh so I was in New York
and I was like, you know what, I'll
build a GPU rig and see if I can do some
Kaggle competitions. And so I went on
Newegg and, you know, bought some Titan X cards. And it was really cool, physically assembling this machine. You can find a tweet from 2015 when I powered it on; you see all this green and all the fans going, and I was like, this is what computers are meant to be.
Uh I think many folks in the audience
have that that experience as well. Um
awesome. Okay. So what convinced you
that AGI was possible? Like you you had
a point where you were sort of
disillusioned with it. You wrote you
tried to write a chatbot. You didn't it
didn't work. But what made you go all in
on it? Yeah. Well, so, you know, part of the journey for me was reading Alan Turing's 1950 paper, Computing Machinery and Intelligence. This is the Turing test paper. How many people have read it? Fewer hands than W3 Schools. Uh, but equally as important,
uh, worth reading. Uh, the thing that is
so fascinating to me is he lays out in
the beginning, okay, Turing test, this
idea of just does a machine think? Is it
intelligent? And you can say it's
intelligent if you know a human can't
tell the difference between talking to
it and talking to a human. Fine. But the thing that has not really become as embedded in the pop culture, but to me was so astounding, was he said, "Well, how are you going to program an answer to this? You will never be able to write down all the rules. But what if you could build a child machine that learns like a human child, and then you just apply rewards and punishments, and boom, it's going to be able to pass the test." And I was
like, that that is the kind of
technology that we have to build because
as a programmer, you have to understand
everything. You have to understand the
rules of how to solve the problem. But
what if the machine can understand
things and solve problems that you
yourself cannot understand? Like that
feels fundamental, right? That feels
like how you actually solve problems
that are important to humanity. And it was, you know, 2008 or so that I read this, and I went to my professor, who was an NLP professor, and I asked if I could do some research with him, and he said, "Yeah, here are some parse trees." And I was like, okay, this is not what Turing was talking about. Yeah. Um, this is like WordNets and the whole
thing. Exactly. So, it's like you, you
know, definitely a little bit of trough
of sorrow there. Um, but with deep learning, the thing that's magic is that, you know, it really started to show promising results in 2012 with AlexNet, right? It just blew everyone out of the water in the ImageNet competition. And so suddenly you have this general learning machine. You know, it's got a little bit of a prior in there, convolutions, but it's better than 40 years' worth of computer vision
40 years worth of computer vision
research, right? People trying to write
down all the rules as well as possible.
And then people are like, well, okay, it
works in vision, but it's never going to
work in my field. It's never going to
work in machine translation, never going
to work in uh in, you know, in NLP,
never going to work in this or that. And
suddenly it starts being the best in all
of those areas. Suddenly the walls
between these departments are being torn
down and you're like that that is what
Turing was talking about. And so I think
for me, just seeing the type signature of this technology, and by the way, this technology is not new, right? Neural nets, if you go back and read the McCulloch-Pitts neuron paper from like 1943 or so. I told people, I told him to give homework to people. Okay. Yeah, there you go. Yes. Class is assigned.
The images in there look just like the kinds of images that you see now, you know, layers of neurons and things like that. And so you just realize there's something deeply fundamental about what we're doing. And you can find this paper from the 1990s talking about what caused the deep learning winters, and that it was these neural net people: they have no new ideas, they just want to build bigger computers. And I'm like, yes, that's what we need to do. Um, and so I think that all of this together just feels like we are, to some extent, continuing this wave, this 70-year history. Um, and in many ways, you know,
the whole computing industry has been trying to build up to the point where you can have machines that are able to perform the kinds of tasks we're just starting to scratch the surface of: to solve new problems that humans cannot, to be assistive to us in our daily lives, to not have to, you know, be typing with our meat sticks, but instead to have something you can interact with just like a person, where the machine comes much closer to you rather than you coming closer to it and having to learn assembly language or whatever it is. Um, and so to me it felt like all of the factors were lined up, and now we just need to build. Yeah. Um, I like that consistent theme that you
keep coming back to. We just need to
build. Um so in 2022 you wrote that it's
time to be an ML engineer. Actually I
have a personal friend uh who read that
post and cold emailed you and joined
OpenAI and all that. Um you said that
great engineers are able to contribute
at the same level as great researchers
to future progress. Is that uh is that
still true today? You know, I think a
lot of engineers look at the researchers
who are making millions of dollars and
they're like, how do I contribute as
much? You know, I I think it's
absolutely if not even more true. Um I
think that like if you look at the
phases of deep learning research since
2012, I think at the beginning it really
was um and this is kind of what I
expected when we started OpenAI, you
know, just like research scientists who
had gotten a PhD who would go and kind
of come up with ideas and test them out.
And you know, there's engineering to be done. If you actually look at AlexNet itself, it's fundamentally the engineering of, let's get fast convolutional kernels on a GPU. And a fun fact is that people who were in the lab with Alex Krizhevsky at the time actually felt very bad for him, because they were like, he has some fast conv kernels for some image dataset that doesn't really matter. But, you know, Ilya was like, well, clearly we just need to apply this to ImageNet, it's going to be great, right? So it's like the combination of
great engineering together with the idea
of what to do with it, right? That
that's what what makes the magic work.
Um and uh the thing that I think is
still true and even more true is okay,
so the engineering required, it's now
not just let's build some kernels, but
let's build a system. Let's actually
scale to 100,000 GPUs. Let's actually,
you know, sort of do this crazy RL
system that orchestrates things in all
sorts of ways. Um, so the idea, if you
don't have the idea, you're dead in the
water. There's nothing to do. But if you
don't have the engineering, that idea is
not going to it's not going to live and
see the light of day. And so you need to
have both of these coming together
harmoniously. Yeah. I think that Ilya-Alex relationship is really emblematic of the research-engineering partnership that is now the philosophy at OpenAI. That's right. Yeah. Yeah. And
if you look at how OpenAI operates, I think from the very beginning we had this ethos of engineering and research being valued and working together as partners, and I think that is something that we really work at every day. Yeah. Uh, it's my explicit
goal to try to throw uh curveballs in
this in this stuff. So uh in terms of
the relationship between engineering and
research, what did OpenAI do wrong in
the early days that you do well now? Um, well, I think the relationship between engineering and research, the way I think about it, is that you never fully solve it, right? You just solve the current level of problem and then you move on to the next level of sophistication. And I noticed that the kinds of problems we ran into were basically the same problems that had been run into at every other lab; either we would be further along, or there'd be a slightly different variant of it. So I think there's something deeply fundamental about this. At the very beginning, I could really see people who came from the engineering world and people who came from the research world thinking about system constraints very differently. And so as an engineer,
you're like, hey, if I've got an
interface, you should not care what's
behind that interface. We agreed on the
interface, I can implement it however I
want. Whereas if you're a researcher,
you're like, if there's a bug anywhere
in the system, all I'm going to get is
just slightly degraded performance. Not
going to get an exception, not going to
get indications of where. And so I am
responsible for understanding
everything. The interfaces don't matter unless they're truly rock solid and I can just never think about them, which is a pretty high bar; otherwise I am actually responsible for this code. And that causes friction, right? Because then how do you actually work together? I saw a project very early on where the people from the engineering background would write the code and then there'd be this big debate over every single line, and I was just like, this is never going to move, it's going to be so slow. Instead, the way that we ended up proceeding, and I actually worked on that directly, was I'd come up with like five ideas at a time. Someone from
the research side would say these four
are bad. I'd be like, great, that's all I wanted, right? And so the value that I think we've really realized is critical, and that I tell people from the engineering world coming into OpenAI, is technical humility, right? You're coming in because you have skills that are important, but it's a totally different environment from, you know, something like a traditional web startup. And figuring out when those intuitions apply, and when to leave them at the door, is super hard. So the most important thing is to come in, really listen, and kind of assume that there's something you're missing until you deeply understand the why. And then at that point, great, make the change: change the architecture, change the abstractions. But I think that kind of approach of really reading and listening and understanding with that humility, that is, I think, a really key determiner. Yeah.
Awesome. Um, we're going to tell some stories from recent launches of OpenAI, the greatest hits. Uh, so one of the things that is kind of interesting is just scaling in general; everything breaks at different orders of magnitude. So when ChatGPT launched, you got a million users in five days. This year, when 4o image gen launched, you got 100 million users in five days. How do those two periods compare? Uh, they echo very
similarly in a lot of ways. You know, the thing about ChatGPT, uh, it was supposed to be a low-key research preview, and we put it out very, you know, sort of chill, and then suddenly everything was down. We kind of anticipated that ChatGPT would be a very popular thing, but we thought that GPT-4 would be necessary to get it. You had it internally as well, so you just weren't impressed by it. Exactly. Right. That's the other thing about this field: you update so quickly, right? You see magic and you're like, "This is the most amazing thing I've ever seen." And then you're like, "Well, why can't it, you know, merge 10 PRs for me?" Exactly. Um, and
the image gen moment was very similar, in terms of it was just so loved and so popular, and it went viral in ways where the numbers were just off the charts. And so internally we actually did something that we really try not to do, which is we pulled a bunch of compute from research for both of these launches, because that's mortgaging the future, to make the system work. But if you can actually deliver and keep up with demand, then of course people get to experience the magic, and I think that's something that is really worthwhile; it's really important to maximize those moments. So I think we really have that same ethos of really serving the user, really trying to push the technology, and just doing things that are materially new that no one's ever seen before. And then, whatever it takes to get those out into the world and make them successful, that's what we do. Amazing. Um, well, I mean, incredible job. Uh, the GPT-4
launch. So I am told that your wife drew
the joke website. That's true. Yeah. Fun Easter egg. My handwriting was so bad that even our AI couldn't tell what to do with it. Um, so, apparently, did you improvise some of this? I heard it through the grapevine. Yeah, definitely. Definitely, like, you know,
usually when I do these kinds of demos, I've tested the general shape of them ahead of time. But it's very easy in this field to have ones where, if you slightly typo a character or something, the demo will not work. I don't like doing those. I like to have some robustness to it. So there's always variation in terms of what actually ends up being shown. To me,
this was the first time I think the
world ever saw vibe coding. Um, it is
now a thing. What are your thoughts on
vibe coding? Uh, well, I think that vibe
coding is amazing as an empowerment
mechanism, right? I think it's sort of a
representation of what is to come. And I
think that the specifics of what vibe
coding is, I think that's going to
change over time, right? You look at even things like Codex; to some extent, I think our vision is that as you start to have agents that really work, where you can have not just one copy, not just 10 copies, but a hundred or a thousand or 10,000 or 100,000 of these things running, you're going to want to treat them much more like a co-worker, right? You're going to want them off in the cloud doing stuff, able to hook up to all sorts of things; you're asleep, your laptop's closed, it should still be working. The current conception of vibe coding is an interactive loop. My prediction of what will happen is that there's going to be more and more of that happening, but I think the agentic stuff is also going to really intercept and overtake it. And I think all of this is just going to result in way more systems being built. Um, and the thing that
I think is also very interesting is that a lot of the vibe coding demos and the cool flashy stuff, for example making the joke website, it's making an app from scratch. But the thing that I think will really be new and transformative, and is starting to really happen, is being able to transform existing applications, to go deeper. I think so many companies are sitting on legacy codebases, and doing migrations and updating libraries and changing your COBOL to something else is so hard and is actually just not very fun for humans. And I think we're starting to get AI that is able to really tackle those problems. So the thing I love is that where vibe coding started has really been the "make cool apps" kind of thing, and it's starting to become much more like serious software engineering. And going even deeper, just making it possible to move so much faster as a company, that's I think where we're headed. Yep. Uh,
speaking of Codex, I've heard that it's kind of your baby a little bit. Um, and I think on the livestream you were talking a lot about just making things modular and well documented and all that good stuff. How do you think Codex changes the way that we code? Um, well, I definitely think it's an overstatement to say it's my baby. I think there's a really incredible team, and I've been trying to support them and their vision, but I think the direction is something that is just so compelling and incredible to me. And sorry, could you repeat it? How does Codex change the way that we code? I
see. Yeah. The thing that has been most interesting to see is when you realize that the way you structure your codebase determines how much you can get out of Codex, right? Basically, all of our existing codebases are matched to the strengths of humans. But if you match instead to the strengths of models, which are very lopsided, right, models are able to handle way more diversity of stuff but are not necessarily able to connect deep ideas as much as humans are right now, then what you want to do is make smaller modules that are well tested, that have tests that can be run very quickly, and then fill in the details; the model will just do that, and it'll run the tests itself. The connections between these different components, kind of the architecture diagram, that's actually pretty easy to do, and it's filling out all the details that is often very difficult. And if you actually do that, what I described also sounds a lot like good software engineering practice, but sometimes, because humans are capable of holding more of this conceptual abstraction in our heads, we just don't do it, right? It's a lot of work to write these tests and flesh them out, and the model's going to run these tests a hundred times or a thousand times more than you will, so it's going to care way, way more. So in some ways the direction we want to go is to build our codebases for more junior developers, in order to actually get the most out of these models. Um, now
it'll be very interesting to see as we
increase the model capability, does this
particular way of structuring code bases
remain constant? And I kind of think
that it's a pretty good idea because
again, it starts to match what you
should be doing for maintainability
for humans. Um, but yeah, I think that
to me that the really sort of exciting
thing to think about for the future of
software engineering is what of our
practices that we kind of just cut
corners for do we actually really need
to bring back in order to get the most
out of our systems? Yeah. Um, can you
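To make that concrete, here is a minimal sketch of the kind of repository structure being described: small, self-contained modules with fast, deterministic tests that an agent like Codex can run in a tight loop. The module, function, and test names are illustrative assumptions, not anything from the talk.

```python
# pricing.py -- a small, self-contained module with an explicit interface.
# The idea (illustrative only): keep each unit small enough that a coding
# agent can rewrite it wholesale and validate it by running the tests below
# in a few seconds.
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    name: str
    unit_price_cents: int
    quantity: int


def order_total_cents(items: list[LineItem], tax_rate: float = 0.0) -> int:
    """Sum line items and apply a flat tax rate, rounding to whole cents."""
    subtotal = sum(item.unit_price_cents * item.quantity for item in items)
    return round(subtotal * (1.0 + tax_rate))


# test_pricing.py -- in practice these would live in their own file; they are
# fast and deterministic so the agent can run them hundreds of times.
def test_empty_order_is_free():
    assert order_total_cents([]) == 0


def test_tax_is_applied_and_rounded():
    items = [LineItem("widget", 199, 3)]
    assert order_total_cents(items, tax_rate=0.1) == 657  # 597 * 1.1 = 656.7 -> 657
```

The design choice is the point: the human sketches the interfaces and the tests (the "architecture diagram"), and the agent fills in and repeatedly verifies the details.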
Yeah. Can you put ballpark numbers on the amount of productivity you're seeing with Codex internally? I don't know what the latest numbers are, but there's definitely a double-digit percentage of our PRs — low double digits — written entirely by Codex, and that's super cool to see. But it's also not the only system we use internally, and to me it's still very early days. It's been exciting to see some of the external metrics too — I think we had 24,000 PRs merged in the last day in public GitHub repositories. So this stuff is all just getting started.
Yeah, it's doing a lot of work. A guest question from Dylan Patel on scaling and reliability: as we do more tasks that take longer and use more GPUs, the GPUs are also just unreliable. They fail a lot — this is well known — and that causes training to fail as well. You've mentioned that you can sort of just restart a run and that's okay, but how do you deal with this when you have to train long-horizon agents? You can't really restart something that has a trajectory that's halfway done and maybe nondeterministic. Yeah, I think there's a bunch of problems that you solve, and then you make the models more capable and you have to re-solve them. When the rollouts are short — 30 seconds — you don't care that much about this problem. If they're going to run for days, you really care about it. Yep. And you have to start thinking about how to snapshot state and a bunch of things like that.
The short answer is that there's a ladder of complexity you keep climbing with these training systems. A couple of years ago all we cared about was doing good old-fashioned pre-training, and that's very checkpointable. Even there it's not trivial — if you go from checkpointing once in a while to wanting to checkpoint every single step, you need to think really hard about how you're going to avoid copies and blocking and all of that. Then for these more complicated RL systems there's still checkpointing, in the sense that maybe you care about checkpointing your cache so you don't have to recompute everything. The nice thing about our systems is that a language model's state is very explicit — it's something you can actually store and handle. Whereas if the tools you're hooked up to are themselves stateful, those may not be something you can restart and recover from. So you have to consider the whole system end to end and think about what checkpointability looks like. There's also the question of whether it even matters — maybe it's fine that you restart the system and get a little wiggle in your graph, because these models are smart. Yeah. Right, they can handle it. One thing we're looking at with what's launching tomorrow is that maybe you can take over the VM, checkpoint the VM state, and restart it. Yep.
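As a toy illustration of that idea: because a language model's state is explicit (the trajectory so far), a long rollout can serialize it after every step and resume after a failure. Everything below — the file format, the step loop, the placeholder "action" — is an illustrative assumption, not a description of OpenAI's training systems.

```python
# Toy checkpointing for a long-horizon rollout. The explicit state (task,
# step count, trajectory, RNG seed) is saved after every step so a crash
# only loses the step in flight; stateful external tools are the hard part
# and are not modeled here.
import json
import os
import random


def run_rollout(task: str, max_steps: int, ckpt_path: str = "rollout.ckpt.json"):
    # Resume from a checkpoint if one exists, otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"task": task, "step": 0, "trajectory": [], "seed": 1234}

    rng = random.Random(state["seed"] + state["step"])

    while state["step"] < max_steps:
        # Placeholder for "the policy takes one action"; a real system would
        # call the model and any tools here.
        action = f"step-{state['step']}-action-{rng.randint(0, 9)}"
        state["trajectory"].append(action)
        state["step"] += 1

        # Checkpoint every step: write to a temp file, then atomically rename
        # so a crash never leaves a half-written checkpoint behind.
        tmp = ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, ckpt_path)

    return state["trajectory"]
```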
I think we have a dial-in question from Paris, if someone can play the video. A special guest:
Oh, I wish I could be there to ask you in person. One of the questions I have is that in this new world, the workloads in the data center and in the AI infrastructure are going to be incredibly diverse. On the one hand, agents that are doing research — thinking, reasoning, planning, working with other agents, working with a lot of memory and large context — and some of that you also want to run as fast as possible. So how do you create an AI infrastructure that is optimized for workloads that need a lot of prefill, a lot of decode, a lot of something in between? And on the other hand, the type of workloads I'm super excited about: these multimodal vision and speech AIs that are essentially your R2-D2, your companion — on all the time, instantly available to you. These two workloads: one is super compute-intensive, might take a long time, test-time scaling and all of that; the other wants to be very low latency. So what does a future AI infrastructure look like that's as flexible as possible, as performant as possible, low latency, high throughput? All of that is incredibly complex. How do you think through it, and what kind of AI infrastructure do you think would be ideal going forward?
Well, with lots of GPUs, of course. So, if I were to summarize: Jensen wants you to tell him what to build. What would be your dream? But also, there are just two needs, two kinds of infra — there's long compute and there's real time.
Yes. Yes. I mean, it is hard, right? This co-design problem is a mind-boggling one.
And I'm a software person by background — we thought we were off here just writing the software for AGI, and then you realize you have to do these massive infrastructure projects. That's not how we set out, but it actually kind of makes sense in the end: if we're going to build something transformative to the world, it probably requires maybe the biggest physical machines humanity has ever created. It kind of type-checks. So I think there are two answers. The naive answer is: okay, you want two kinds of accelerators, one that's really compute-optimized and one that's very latency-optimized. Throw tons of HBM on one of them and tons of compute on the other, and you're all good. Now, one thing that's really difficult is predicting the ratios. That's a new problem you have to think about, and if you get the balance wrong, suddenly a whole part of your fleet is just useless. Yep. And that sounds really scary. But the way these things work, there are no fixed requirements in this field, no hard constraints; there's just this linear program that people are optimizing. So if you give our engineers some misbalance of resources, we will find ways to utilize it, maybe at great pain. An example: you've seen the whole field move toward mixture of experts, and to some extent what mixture of experts is saying is, well, we have all this DRAM sitting around that isn't being used for anything because the balance is wrong — fine, we'll just fill it up with parameters, it won't cost any extra compute, and we'll get extra ML efficiency out of it. Boom, there you go. So there is some of that: if you get the balance wrong, it's actually not the end of the world. Homogeneity of accelerators is a very nice default to start with, but ending up with purpose-built accelerators is also not super crazy, and as we move toward worlds where the dollars of capex for this infrastructure become so eye-watering, hyper-optimizing for some of these workloads is pretty reasonable. But I think the jury is a little bit out, because the research is just moving so fast, and to some extent that dominates everything else.
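A back-of-envelope sketch of that mixture-of-experts point: total parameters (which fill accelerator memory) scale with the number of experts, while the parameters touched per token (which set the per-token compute) depend only on how many experts are routed per token. The dimensions below are made-up illustrative numbers, not any real model's configuration.

```python
# Rough illustration of MoE soaking up spare memory without adding per-token
# compute: parameter count grows with n_experts, active parameters per token
# grow only with top_k.

def moe_footprint(d_model: int, d_ff: int, n_experts: int, top_k: int):
    dense_ffn_params = 2 * d_model * d_ff            # one dense FFN block
    total_ffn_params = n_experts * dense_ffn_params  # parameters held in memory
    active_ffn_params = top_k * dense_ffn_params     # parameters used per token
    return total_ffn_params, active_ffn_params


total, active = moe_footprint(d_model=4096, d_ff=16384, n_experts=64, top_k=2)
print(f"total FFN params:  {total / 1e9:.1f}B (fills the memory)")
print(f"active per token:  {active / 1e9:.1f}B (sets the compute)")
```

With these toy numbers, memory holds roughly 8.6B FFN parameters while each token only exercises about 0.3B of them — which is the sense in which "wrongly balanced" memory gets turned into extra model capacity for free.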
Okay, I wasn't planning to ask this, but you just brought up the research stuff. Can you rank the current scaling bottlenecks for GPT-6 — compute, data, algorithms, power, money? Yes. Which ones are number one and two? Which are you most limited on?
I mean, look, I think we are in a world where basic research is back, and I think that is really amazing. Yeah, basic research. There was a period where it felt like, all right, we've got a transformer, let's just scale it. I find those problems very exciting — you've got a very well-defined hard problem, you just want to move the number up and to the right — but it's also a little intellectually dissatisfying in some ways. It feels like there's more to life than just the "Attention Is All You Need" paper in vanilla form. What we've started to see is that we're operating at a scale now where we've pushed the compute and the data so far that algorithms are again important, and really almost the long pole in terms of future progress. So all of these things are important poles of the tent. On any one day it might look a little lopsided one way or another, but fundamentally you want to keep them all in balance. And it's really exciting to see things like the RL paradigm — that's something we invested in very deliberately for multiple years. When we trained GPT-4, the first really interesting thing was that when we talked to it for the first time, we asked, is this an AGI? It's clearly not an AGI, but it's really hard to say why. There's something about it — it's so fluid and smooth, but somehow it falls off the rails. Well, we've got to solve that reliability problem. And you realize it has never actually experienced the world. It's like someone who has read all the books, who has observed the world but never experienced it themselves — watching it through a pane of glass or something. To me that was the moment of, okay, clearly we need a different paradigm, and we just pushed on it until we made it really work. And I think that remains true today: there are other very clear missing capabilities that we just need to keep pushing on, and we will get there. Awesome.
Broadening out from just OpenAI things — well, honestly, I'm just going to let... so, we asked Jensen for one question. He's an overachiever, so he sent in two. Let's play the second video.
The AI-native engineers in the audience are probably thinking that in the coming years OpenAI will have AGIs, and they will be building domain-specific agents on top of those AGIs. So some of the questions on my mind would be: how do you think their development workflow will change as OpenAI's AGIs become much more capable? They would still have plumbing, workflows, pipelines they create, flywheels they create for their domain-specific agents. These agents would of course be able to reason, plan, use tools, and have memory — short-term and long-term — and they'll be amazing agents, but how does it change the development process in the coming years?
Yeah, I think this is a really fascinating question. You can find a wide spectrum of very strongly held opinions that are all mutually contradictory. My perspective is that, first of all, it's all on the table. Maybe we reach a world where the AIs are so capable that we just let them write all the code. Maybe there's a world where you have one AI in the sky. Maybe you actually have a bunch of domain-specific agents that require a bunch of specific work to make happen. I think the evidence has really been shifting toward this menagerie of different models, and that's actually really exciting: there are different inference costs just from a systems perspective, there are different trade-offs, and distillation works so well. So there's actually a lot of power to be had by models that are able to use other models, and I think that's going to open up a ton of opportunity.
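One hedged sketch of what "models using other models" can look like in practice: a simple router that sends easy requests to a small, cheaper (perhaps distilled) model and escalates harder ones to a larger model. The model names, the difficulty heuristic, and the call_model stub are all placeholders, not any particular vendor's API.

```python
# Minimal model-routing sketch: cheap (possibly distilled) model by default,
# escalate to a stronger model only when the request looks hard.
from typing import Callable

CHEAP_MODEL = "small-distilled-model"   # hypothetical names
STRONG_MODEL = "large-reasoning-model"


def looks_hard(prompt: str) -> bool:
    # Placeholder heuristic; a real router might use a classifier or the
    # cheap model's own confidence estimate.
    return len(prompt) > 500 or "prove" in prompt.lower()


def route(prompt: str, call_model: Callable[[str, str], str]) -> str:
    model = STRONG_MODEL if looks_hard(prompt) else CHEAP_MODEL
    return call_model(model, prompt)


# Example usage with a stub standing in for a real inference API:
def fake_call(model: str, prompt: str) -> str:
    return f"[{model}] answer to: {prompt[:40]}"


print(route("What is 2 + 2?", fake_call))
```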
We're heading to a world where the economy is fundamentally powered by AI. We're not there yet, but you can see it right on the horizon. They're working on it all. Exactly — that's what the people in this room are building, that is what you are doing. And the economy is a very big thing: there's a lot of diversity in it, and it's also not static. When people think about what AI can do for us, it's very easy to only look at what we're doing now, how AI slots in, and the percentage of human versus AI work. But that's not the point. The point is: how do we get 10x more activity, 10x more economic output, 10x more benefit for everyone? I think the direction we're heading is one where the models get much more capable, there's much better fundamental technology, there are just way more things we want to do with it, and the barrier to entry is lower than ever. So things like healthcare, where it requires responsibility to go in and think about how to do it right. Things like education, where there are multiple stakeholders — the parent, the teacher, the student — and each of these requires domain expertise, careful thought, and a lot of work. So I think there is going to be just so much opportunity for people to build, and I'm so excited to see everyone in this room, because that's the right kind of energy. Thank you for encouraging us and being an inspiration.
Thank you so much. A great welcome to everybody. Thank you. All right, there's just one more thing before you leave the room. Let's hear it one more time for Greg Brockman. So, the talks are done, but the fun continues. We'd love to invite you to the afterparty. Here to give you the details of the afterparty is Toshit Panigrahi of TollBit.
[Music]
Hey everyone, how's it going? I'm Toshit Panigrahi, one of the co-founders of TollBit. For the past two years, we have connected the world's biggest publishers with the world's biggest AI companies. Now we're taking that same technology and allowing agents to access sanctioned first-party data sources with seamless auth and payments — whether it's for MCP, for A2A, or even for browser automation. So if you care about the agent economy, agent auth, and payments, check us out at tollbit.com or come talk to us at the TollBit afterparty. Thanks.
[Applause]
[Music]