AI Engineer World's Fair 2025 - Day 1 Keynotes & MCP track ft. Anthropic MCP team
Channel: aiDotEngineer
Published at: 2025-06-05
YouTube video id: z4zXicOAF28
Source: https://www.youtube.com/watch?v=z4zXicOAF28
Ladies and gentlemen, please join me in welcoming to the stage the VP of developer relations at LlamaIndex, Laurie Voss. [Applause]

Hello everybody. There are so many of you. It is great to see you here today. Welcome to the 2025 AI Engineer World's Fair. Let's hear it from you. Welcome to the Yerba Buena Ballroom. For those of you who managed to fit into the room, I'm told this is the largest pillarless ballroom west of Las Vegas, which is the perfect metaphor for AI startups: the scale is impressive, but it has no visible means of support. My name is Laurie Voss. I am VP of developer relations at LlamaIndex, the best framework for building agentic AI applications, according to me. I'm going to be your MC today and tomorrow. And my first order of business is that I have to take a selfie, because if I can't post it to social media, did it even really happen? Pretend you've just heard a really funny intro and you're laughing. Excellent.

All right. It's traditional as an MC to warm up the audience with a couple of jokes, and as an AI practitioner, it's traditional to not want to do any work and to get the AI to do it for you. So I tried that, and it was a great learning experience. The primary thing that I learned is that LLMs are terrible at writing jokes. I mean atrocious. I tried ChatGPT. I tried Claude. I told them to think deeply. I asked them to search the web. None of it worked. AGI will be achieved when models can say something actually funny. Until then, these dad jokes are handcrafted by a human.

But on to business. Once upon a time, I was co-founder of npm Inc., so I used to talk about JavaScript a lot, and I've been a web developer for 27 years. These days, I'm talking about AI. And I'm very excited about AI, because when you've been in tech as long as I have, you only see revolutions as big as the one being powered by AI a couple of times. But tech is full of things that say they're going to be the next big revolution and then turn out to be just hype: blockchain, NFTs, the metaverse, the Segway. Make no mistake, this one is real. There's a ton of hype, too. Of course, not all of these AI startups are going to turn out to be real things, but there is a core of real revolution happening. And how can you tell? Because people are building things that people are actually using. ChatGPT hit 100 million users faster than any consumer product in the history of tech. Millions of people are using it daily to get actual stuff done. Sometimes they're using it to do stuff they shouldn't, like cheating on their essays or writing fake law citations into real court cases, but also real things hundreds of millions of times a day. You'll be hearing from multiple speakers from OpenAI over the course of this conference, including Greg Brockman, who'll be closing out today. Another real adoption story is Copilot. GitHub Copilot has millions of subscribers, and Copilot is now part of Microsoft 365, which is available to 84 million everyday consumers. Azure AI is being adopted by enterprises to the tune of $13 billion in revenue annually. That is real adoption. And speaking of Microsoft, they're here.
In fact, they're our presenting sponsors. So let's hear it for them. Another company doing a ton of big, real things with AI is AWS, so much so that they're going to spend $87 billion on AI infrastructure this year. AWS are also here. They're our innovation partner for this conference, contributing tons of sessions and workshops. So let's give them a big hand. Two more companies doing big, real things in AI are Neo4j and Braintrust. They are our track sponsors. Braintrust is the end-to-end evaluation platform for building world-class AI apps, and Neo4j is the world's most loved graph database. There's a whole track about graph RAG this year, so there will be no shortage of graph RAG content. Last year Neo4j CEO Emil Eifrem had the second most popular talk at the whole conference, and this year multiple Neo4j people are speaking, so you won't want to miss them. Who had the most popular conference talk last year? Well, I don't want to brag, but it was LlamaIndex's CEO Jerry Liu. He is going to be giving a talk tomorrow that you won't want to miss. And also little old me is giving a talk at 1 PM today. We've got a bunch of other great sponsors. These are our platinum sponsors. Graphite is the AI-powered developer productivity platform that helps teams on GitHub ship higher quality software faster. Windsurf is the first agentic IDE that 10xes engineers so that you can dream bigger. MongoDB's Atlas database makes storing all of your data, including your vector embeddings, a snap. Daily is the team behind Pipecat, the most widely used framework for voice agents and multimodal AI. Augment Code is the AI agent that knows you and your codebase best. And WorkOS helps you ship your software to enterprise customers, with features like single sign-on, in minutes. Let's give all of our platinum sponsors a big hand.

But one of the biggest signs that AI is powering a real revolution is all of you here today. This is the third year of this conference, and it's the biggest one yet. There are over 3,000 of you here today. That's nearly twice as many as last year. You folks are building real things every day, and real people are using them, and that's incredibly inspiring. So you should feel good about that and give yourselves a hand. And this is going to be a heck of a conference. We have over 250 speakers here from around the world talking about every aspect of AI: architecture, infrastructure, AI in the Fortune 500, robotics, design, MCP. Who here is excited to hear about MCP? That's good. It's going to be in this room. Security, tiny teams, vibe coding. There is so much stuff to learn here today, y'all, that you are going to have a great time. And that's why we're all here. We're not just talking about these technologies. We're not just excited about these technologies. We're building with these technologies. And we can't wait to see what you've been building. So without further ado, please welcome to the stage a man who needs no introduction, editor of the Latent Space podcast, CEO of Smol AI, and co-founder of this very conference, the one and only swyx. [Applause]

Okay. Hi everyone. Welcome to the conference. How are you doing? Excellent. I've been so excited to play with this. Oh, can we step back to the main slide? Okay, good. There's a forward clicker but no back clicker.
Usually I open these conferences with a small talk to introduce what's going on, give you a little update on where the state of AI engineering is, and explain how we put together the conference for you. This is one of those combined talks. I'm trying to answer every single question you have about the conference, about AI news, and about where this is all going, and we'll just dive right in.

Okay. So, 3,000 of you, and all of you registered last minute. Thank you for that stress. I actually can quantify this. I call this the Gini coefficient of AIE organizer stress, and this is compared to last year. Please just buy tickets earlier. You know you're going to come, so just do it. We also like to use this conference as a way to track the evolution of AI engineering. Those are the tracks from last year; we've just doubled every single track for you. So it's basically double the value for whatever you get here, and I think this is as much concurrency as we want to do. I hear that people have decision fatigue and all that, totally, but we also try to cover all of AI, so deal with it. We also pride ourselves on being more responsive than other conferences like NeurIPS and more technical than other conferences like TED or what have you. So we asked you what you wanted to hear about. These are the surveys. We tried all sorts of things. We tried computer-using agents. We tried AI and crypto, always a fun one. But you told us what you wanted, and we put it in there. For more data, we would actually like you to finish our survey; the survey is not done. So if you want to hit that URL, we will present the results in full tomorrow. We would love all of you to fill it out so we can get a representative sample of what you want, and it will inform us next year.

Okay. I think the other thing about AI engineering is that we have also been innovating as engineers. We're the first conference to have an MCP track, and the first conference to have an MCP talk accepted by the MCP team. Shout out to Sam Julian from Writer for working with us on the official chatbot, and to Quinn and John from Daily for working with us on the official voice bot, as well as Elizabeth Triken from Vapi; I need to give her a shout out because she originally helped us prototype the voice bot. So we're trying to constantly improve the experience. The other thing I want to emphasize is the arc of the talks I give: in 2023, at the very first AIE, I talked about the three types of AI engineer; in 2024 I talked about how AI engineering was becoming more multidisciplinary, and that's why we started the World's Fair with multiple tracks; in 2025 in New York we talked about the evolution and the focus on agent engineering. So where are we now, in June of 2025? That's what we're going to focus on. I think we've come a long way regardless. People used to make fun of AI engineering, and I anticipated this. We used to be low status; people would just deride GPT wrappers. Well, look at all the GPT wrappers: now all of you are rich. So we're going to hear from some of these folks in the room, and thank you for sponsoring as well.
But the other thing that's super interesting is that the consistent lesson we hear is to not overcomplicate things. From Anthropic on the Latent Space podcast, we heard from Erik Schluntz about how they beat SWE-bench with a very simple scaffold. Same for deep research, and from Greg Brockman, who you're going to hear later in the closing keynotes, as well as the Amp folks. Where are the Amp folks? I think they're probably back in the other room. But there's also a sort of emperor-has-no-clothes quality here: it's still a very early field, and the AI engineers in the room should be very encouraged by that, because there's still a lot of alpha to mine. If you go back to the start of this conference series, we actually compare this moment a lot to the time when physics was in full bloom. This is the Solvay Conference in 1927, when Einstein, Marie Curie, and all the other household names in physics gathered together, and that's what we're trying to do with this conference: we've gathered the best AI engineers in the world, and researchers, to build and push the frontier forward. The thesis is that this is the right time to do it. I said that two and a half years ago, and it's still true today. But there was a very specific period, in the formation of that field, when people set out all the basic ideas that then lasted for the rest of it. This is the Standard Model in physics: there was a very specific window from roughly the 1940s to the 1970s when they figured it all out, and for the next 50 years we haven't really changed the Standard Model. So the question I want to pose here is: what is the standard model in AI engineering? We have standard models in the rest of engineering. Everyone knows ETL, everyone knows MVC, everyone knows CRUD, everyone knows MapReduce, and I've used those things in building AI applications. Yes, RAG is there, but I heard RAG is dead. I don't know, you can tell me. One day long context killed RAG, the next day fine-tuning kills RAG. I don't know, but I definitely don't think it's the full answer. So what other standard models might emerge to help guide our thinking? That's really where I want to push you. There are a few candidate standard models in AI engineering, and I'll pick out a few of them. I don't have time to talk about all of them, but definitely listen to the DSPy talk from Omar tomorrow. So we're going to cover a few of these. First is the LLM OS. This is one of the earliest standard models, from Karpathy in 2023. I have updated it for 2025: for multimodality, for the standard set of tools that have come out, and for MCP, which has become the default protocol for connecting with the outside world. The second one would be the LLM SDLC, the software development lifecycle. I have two versions of this, one with the intersecting concerns of all the tooling that you buy. By the way, this is all on the Latent Space blog if you want it, and I'll tweet out the slides, and it's livestreamed, so whatever.
But for me the most interesting insight, and the aha moment when I was talking to Ankur of Braintrust, who's going to be keynoting tomorrow, is that the early parts of the SDLC are increasingly commodity. LLMs are kind of free, monitoring is kind of free, and RAG is kind of free; obviously there's a free tier for all of them. You only start paying, and you only start making real money from your customers, when you start to do evals, add in security and orchestration, and do real work. That is real, hard engineering work, and those are the tracks that we've added this year. I'm very proud to push AI engineering along from demos into production, which is what everyone always wants.

Another form of standard model is Building Effective Agents. At our last conference we had Barry, one of the co-authors of Building Effective Agents from Anthropic, give an extremely popular talk about it. I think that is now at least the received wisdom for how to build an agent, and that is one definition. OpenAI has a different definition, and we're continually iterating; I think Dominic yesterday released another improvement to the Agents SDK, which builds on the Swarm concept that OpenAI was pushing. The way that I approach the agent standard model has been very different, so you can refer to my talk from the previous conference on that. It's basically a descriptive, top-down model of the words people use to describe agents: intent, control flow, memory, planning, and tool use. So there are all these really interesting things, but the thing that really got me is that I don't actually use all of that to build AI News. By the way, who here reads AI News? Oh my god, that's half of you. Thanks. It's a really good tool I built for myself, and hopefully now over 70,000 people are reading along as well. The thing that really got me was Soumith at the last conference. He's the lead of PyTorch, and he says he reads AI News, he loves it, but it is not an agent. And I was like, what do you mean it's not an agent? I call it an agent. You should call it an agent. But he's right. I'm going to talk a little bit about that: why does it still deliver value even though it's a workflow, and is that still interesting to people? Why do we not brand every single track here as agents: voice agents, workflow agents, computer use agents? Why is every single track in this conference not an agent? Well, basically we want to deliver value instead of arguing terminology. So my assertion is that it's really about human input versus valuable AI output. You can make a mental model of this and track that ratio, and that's more interesting than arguing about definitions of workflow versus agent. For example, in the Copilot era you had a debounced input: every few characters that you type, maybe you get an autocomplete. In ChatGPT, every query you type maybe produces one responding answer.
It starts to get more interesting with the reasoning models, with something like a one-to-ten ratio, and then obviously with the new agents it's more like deep research and NotebookLM. By the way, Raiza Martin is also speaking, on the product management track; she's incredible, and she'll be talking about the story of NotebookLM. The other really interesting angle, if you want to stretch this mental model, is the zero-to-one case: ambient agents with no human input. What kind of interesting AI output can you get? To me that discussion of input versus output is more useful than asking what is a workflow, what is an agent, and how agentic your thing is.

Talking about AI News: it is a bunch of scripts in a trench coat, and I realized I've written it three times. I've written it for the Discord scrape, I've written it for the Reddit scrape, and I've written it for the Twitter scrape. It's always the same process: you scrape, you plan, you recursively summarize, you format, and you evaluate. That's the three kids in the trench coat, and that's really what it is. I run it every day and we improve it a little bit, but then I'm also running this conference. If you generalize it, it actually starts to become an interesting model for building AI-intensive applications, where you make thousands of AI calls to serve a particular purpose. You sync, you plan, you parallel-process and analyze, reducing from many down to one, then you deliver the contents to the user, and then you evaluate. To me that conveniently forms an acronym, SPADE, which is really nice. There are also interesting AI engineering elements that fit in there. You can process all of this into a knowledge graph, you can turn it into structured outputs, and you can generate code as well. For example, ChatGPT with Canvas or Claude with Artifacts is a way of delivering the output as a code artifact instead of just text, and I think that's a really interesting way to think about this. So this is my mental model so far. I wish I had the space to go into it, but ask me later; this is what I'm developing right now. What I would really emphasize is that there are all sorts of interesting ways to think about what the standard model is and whether it's useful for you in taking your application to the next step: how do I add more intelligence to this in a way that's useful and not annoying? And for me, this is it.

Okay. So I've thrown a bunch of standard models at you, but that's just my current hypothesis. I want you at this conference, in all your conversations with each other and with the speakers, to think about what the new standard model for AI engineering is. What can everyone use to improve their applications and, ultimately, build products that people want to use, which is what Laurie mentioned at the start? So I'm really excited about this conference. It's been such an honor and a joy to put it together for you, and I hope you enjoy the rest of the conference. Thank you so much. [Applause]
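(To make the sync, plan, analyze, deliver, evaluate shape above concrete, here is a minimal sketch in Python. It is only an illustration of the pattern swyx describes; the names used here, Item, summarize, and run_daily_digest, are hypothetical and not from the actual AI News codebase, and the summarize stub stands in for a real LLM call.)

from dataclasses import dataclass

@dataclass
class Item:
    source: str   # e.g. "discord", "reddit", "twitter"
    text: str

def summarize(texts: list[str], prompt: str) -> str:
    # Stand-in for a single LLM call; swap in whatever model client you use.
    joined = " ".join(texts)
    return f"{prompt}: {joined[:200]}"

def run_daily_digest(items: list[Item]) -> str:
    # Sync: items have already been scraped from each source.
    # Plan: group the work so each summarization call stays small.
    by_source: dict[str, list[str]] = {}
    for item in items:
        by_source.setdefault(item.source, []).append(item.text)

    # Analyze: recursively summarize many items down to one summary per source,
    # then reduce the per-source summaries into a single digest.
    per_source = [
        summarize(texts, f"Summarize today's {src} discussion")
        for src, texts in by_source.items()
    ]
    digest = summarize(per_source, "Merge these into one newsletter")

    # Deliver: format for the channel (email, web, RSS).
    output = "AI News digest\n\n" + digest

    # Evaluate: score the output (LLM-as-judge or simple checks) so
    # regressions show up before readers see them.
    assert len(output) > 0
    return output

print(run_daily_digest([Item("discord", "people discussed MCP servers all day")]))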
Our next presenter is the head of product for Microsoft's AI platform, presenting about the open agentic web. Here to show us what happens when natural language creation meets an industrial-grade backbone is Asha Sharma. [Applause]

Hello. I still remember the first line of code that I wrote. It was on this computer, a Compaq from 1995. And I also remember the feeling that I had when I wrote my first program. It was one of magic, because for the first time, I created something more beautiful and far more interesting than I could do by myself. And I've thought about that moment ever since. I've spent the last 15 years building some of the most important products in machine learning, and now I have the opportunity to lead our core AI platform at Microsoft. Our goal is to empower every single person in this room to use AI to shape the world. Now, I'm excited to be here to talk to all of you, but I'm most excited because I love the World's Fair. The World's Fair is where imagination and impact start to collapse together, and that's happened throughout the last century. In 1939, it was the first time that Hollywood came into our living rooms. In 1964, faces came over the copper wires. And today, it's all about agents. Agents that can learn, that can adapt, that can extend the way that we live. And most importantly, it's fundamentally changing how we actually make product. So, how did we get here? Well, over the last few years, we've seen a change in the model landscape. There used to be just a handful of models from one provider, and now, thanks to many people in this room, there's been an explosion of reasoning models, and that explosion is giving way to new capability and new efficiency. A lot of models can now generate hypotheses, understand unstructured data, even act at a PhD level in certain domains. We also know that these models are becoming more efficient. They don't just live in data centers anymore. They live on our laptops. We have full control and there's no latency. And so this is starting to birth what we call the agentic web: a world in which agents are going to interact with tools and models and probably other agents, and they're going to do so no matter what cloud they're on, no matter what company built them, no matter what device you choose to use. Underlying this, a few different forces are emerging in the world of AI engineering. The first is that we're going from pair programming to peer programming. Copilot used to be a sidekick, and now it's an actual teammate. The second is that we're going from a software factory to the agent factory. It's not just about binaries anymore; it's about behaviors. And the third thing that we're seeing is that models don't just live in the cloud anymore. They live on your device, and they can follow you wherever you are. To sustain this, I don't think there's any one tool that we need. Instead, I think it's a platform of AI-powered tools that sit on top of an agent factory that every company is going to have, that has trust and security baked in, and that goes from cloud to edge seamlessly. So, let's dive in. Now, as far back as I can remember, programming has always been a partnership between the person and the machine. We would write lines of code and the machine would execute them. But now we're starting to see a world where the machine can write the code. It can fix itself. It can help you imagine new features. And so with that, our entire day is changing. The workflow is changing. The old world used to be us shepherding syntax, and in this new world, GitHub Copilot can live in your codebase. It operates in your branch.
You can assign a task to it and it can run tests until it's complete. That means we're going to spend more time on architectural decisions and more time orchestrating teams of agents. Maintenance is changing, too. In the old world, maintenance would compete with features. I hated that world. In this new world, what we're seeing is the opportunity to invent agents that can continuously improve your codebase. That's why we invented something called FSY. You can think about it as graph RAG for your codebase. It can reason over it, it can explain it, and it can continuously improve it and even fix some areas. Another area that I'm really excited about is GitHub as a peer. In the past, generic AI was just operating on a generic codebase, but now, because we have open sourced the extension for GitHub Copilot, Copilot understands your patterns, your domains, and your teams, so it can effectively speak your own language. Instead of designing backwards, you can build forward. Now, I was told that you all like live demos, so instead of talking about this, I thought I would invite my friend Seth Juarez up to show you some of our AI-powered tools. So, Seth?

Hey, how's it going, my friends? They closed my laptop, so I'm going to open it here. I'm on demo one. One of the things about development is that there are a couple of tasks that are kind of tedious, right? So I'm going to show you how AI can make three of these kinds of things a lot easier to do. Number one, when you get started on a project, you always have a thousand questions and no one to ask. I'll show you how GitHub Copilot Spaces makes this a little bit easier. Number two, crushing your first task. How do you get started on a task on a brand new project? I'll show you how. And then number three, diving deep. All right, so let's start with understanding. There is a new feature called Copilot Spaces, and the cool thing about Copilot Spaces is that you can create a space, give it a prompt and some instructions, and then ground it in a bunch of different files. You're going to see a project later on, live, that shows you this. One of the things I want to ask it is, for example, is this really an agentic project? You're going to see a live multi-agent voice demo in a second, and this is the project. So I'm going to go ahead and hit enter. The thing about GitHub Copilot, now grounded in Spaces, is that it can answer any question grounded in the actual facts of the project. You can create as many spaces as you want. They never get tired of answering questions. Not only do they give you really good answers, like what agentic really means here (great answer), they also give you the code showing where it actually does the thing. So you're able to get started right away in understanding your project. That's number one. Number two, crushing your first task. This is a project we released a couple of weeks ago, and you know how you make a demo and put it out, and someone always asks, "Hey, can you write a README?" I said no, and I assigned it to Copilot, and it did it. Let me show you what that looks like. I'm going to go ahead and create a new issue here, and this issue is going to be something like an architecture diagram, plus I need better setup instructions, right? And I'm going to give it some descriptions.
Let me show you how easy it is to crush your first task. I'm going to assign it to Copilot. And that's it. Watch the eyeballs right here. In a second, you're going to see little eyeballs, and now it's working. And by the way, this is actually happening live. This isn't me faking it or anything. Let me go back to the actual project. It did the work for me. It takes a couple of minutes, but I don't have that kind of time here, so let me show you the file that it made for me in GitHub, with the actual GitHub Copilot coding agent. It made this whole thing for me. You can ooh a little bit. Yeah. This is delightful. You can clap; this is great. I didn't have to write this. That's number two, crushing your first task. And then number three, diving deep. It turns out that if you've seen GitHub Copilot inside of Visual Studio Code, you can actually extend Copilot to talk to other agents, and we did just that with Amaly. There's another task here that was assigned to me. Let me go over to the issues; it says we need a new agent that reasons about housing. That sounds like a deep task, so we're going to use GitHub Copilot to help us here. I'm going to say, can you help me predict the housing prices? As this goes, what's actually happening is that GitHub Copilot in Visual Studio Code is talking to another agent called Amaly, an MLE (machine learning engineer) agent, which has two agents that can reason about what you're asking and can also write code. So I'm going to go ahead and pick this file here and this file here. Let me close it. I'm going to move this over here and say yes, these are the files. I don't want to use this file, so I'm going to say: use these files. What it's going to do is look at these files as if it were a machine learning engineer. It's going to reason about the actual contents of the file, and I can ask it any question about anything that I want. I'm kind of out of time, so I want to show you the output of this thing. It literally builds an entire machine learning model for me. And you're thinking, "Oh, it made a mistake." Did it? You can see the mistake here. What was the mistake? Oh, somebody put a string in the float place. Yeah. And it knew about that and it fixed it. So there you go. I showed you three things: jumping into a project and understanding it, crushing your first task, and diving deep, all with the help of AI. Back to you, Asha.

All right. Thank you, Seth. So both of those agents that you just saw were built on something called Foundry. Underneath the covers of GitHub and all of these new agents, there's a bigger change happening. We're going from shipping binaries in neat releases to shipping agents that can retrain and redeploy and change after they're live. We've been thinking about the best way to do that and studying patterns, and something new is emerging called the signals loop. It's the idea that you can get better results if you actually fine-tune the model to personalize it to your outcome. It's something we've long talked about, but now we're actually seeing it in the results. Our platform not only supports 70,000 customers with Foundry, we also support every single copilot the company builds, and this one is called Dragon. Dragon is the leading healthcare copilot out there.
It helps automate scribing and other tasks to give physicians more time for patient care. They took an off-the-shelf model and it was pretty good. They tried to synthetically fine-tune it to make it better and it got a little bit better, but then they took 650,000 interactions, did a bunch of A/B testing, and got to an 83% character acceptance rate. So, dramatically better quality. As we think about what this signals loop requires, it means we're going from a linear software factory to a continuous loop that we need to build for. And that's really what we've built Foundry to do. We believe the entire infrastructure is changing to build these agentic applications and agentic systems. We don't have time to go through all of these, so I'm just going to go through a few of them and how they're changing. On models: we believe that no one model is right for every single product, and oftentimes the best products have an ensemble or a mixture of models that are finely tuned for every single job to be done. So we've built a switchboard and intelligent routing, so you can have access to 10,000 open and proprietary models, backed by the security, reliability, and data residency that you need. On knowledge: I think I heard earlier that RAG was dead. RAG is used in 50% of AI applications today, but it's single-shot and it's pretty naive. So we've rolled out something called agentic RAG. The idea is that you can iterate, evaluate, and plan; it's multi-shot. And what we're seeing is a 40% improvement in accuracy on complex queries. We all know that tooling is changing. Tooling is becoming infrastructure. You need more than text to build a good agent. You need the code, you need the containers, and we have that as well. We have more than 1,500 tools, and we were one of the first to adopt MCP and A2A. And finally, agents and intelligence are only good if you can actually hold them accountable, so we are rolling out aggressive efforts in this area. We have the leading evaluations SDK and the leading red-teaming agents, and we believe that telemetry is not optional. We've integrated with OTel, and we have continuous observability whether you've built your agent on our platform or somewhere else. Today, more than 50,000 agents are built every single day using our loop on our platform. Now, the platform is modular, but we've also made an effort for it to be open, and I want to talk about a couple of things on the open side that I'm really excited about. The first is GigaPath. It is the first model of its kind. It's an open model, and it's the first one that can understand a pathology slide. A pathology slide is 100,000 pixels by 100,000 pixels; if you printed it out, it would be the size of a tennis court. It's the first model that can understand it without downsampling or tiling, and it does that because we've used dilated attention, a technique we borrowed from speech modeling. So now you can understand the tumor immune microenvironment without doing it in patches and without being limited to the macro environment; you can do it at a micro level for the first time. And that's an open model on our platform. Obviously, everybody's following DeepSeek, and there was an update to the R1 model a couple of days ago. That update is on our platform today on Foundry, backed by all of our security and safety, for all of you to use.
And finally, we're continuing to invest in A2A and MCP and all of the open protocols. The big thing for us is that we believe these protocols will keep coming along, they're going to be popular, and we're going to support them all, so you can work with the tools that you love. Now, I want to show you how simple it is to build an agent, and not just one, but multiple agents that are useful. We've got another demo. Please welcome Amanda and Elijah. [Applause]

What better way to showcase our Foundry agent factory than by demonstrating it live and taking you behind the scenes with us to build the agents and ensure they're safe and secure. Before we dive into our multi-agent application, let's go over to demo three, where Elijah is going to show you how to build a single agent in VS Code.

Awesome. So, jumping right into VS Code, you'll notice that I have installed the Azure AI Foundry extension. This extension is awesome because it allows us to see all of the models, agents, and threads associated with my project. And I want to take a moment here to talk about threads. As you know, threads are an integral part of agents, and they're critical for transparency: being able to see what the agent is doing at each step of the way. So here I can see a thread that Amanda created earlier that says, "Hey, Elijah's a product manager. Can you send him a personalized email?" We'll be using a personalized email agent today. It reads from a contact list and sends us some information. What's great about this is that I can see the tool calls it used, as well as some information around prompting and tokens. Then I can look at the actual agent, which I can see here: we have the ID, the name, the system prompt, and the tools being used. Now, that's great in the UI, but let's jump into the actual code of how we built this. Going in here, you can see I'm using the Azure AI Foundry Agent Service SDK that Asha just talked about. I'm initializing that using our project client, creating the actual agent, and then giving it a set of tools. In the Agent Service we have a bunch of different tools; today we're using the Bing grounding tool, the file search tool, and the OpenAPI tool. But what's great is that I could use a variety of tools here. I could use MCP servers, I could use external APIs, and I could even use other agents through the Foundry connected agent tool. Then I create the agent, use the model, and give it some instructions. And finally, I can execute this agent here. It's important to note that I'm executing a Foundry agent, but I could use a wide variety of agents here. I could use LangChain agents (I know our friends from LangChain are here today), I could use CrewAI agents, or even, as Asha mentioned, multiple agents using the A2A protocol.
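(For readers following along, here is a rough sketch of the pattern Elijah walks through: connect a project client, create an agent with a model, instructions, and tools, then run it on a thread. It is loosely based on the preview Azure AI Foundry Python SDK; the exact package, class, and method names below are assumptions and may differ from the shipping SDK, so treat this as illustrative pseudocode rather than reference code.)

import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import BingGroundingTool

# Connect to the Foundry project. The connection string is a placeholder
# you would copy from the project's overview page.
project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["AZURE_AI_PROJECT_CONNECTION_STRING"],
)

# One of the tools from the demo: Bing grounding for web search. The file
# search and OpenAPI tools mentioned above follow the same pattern.
bing = BingGroundingTool(connection_id=os.environ["BING_CONNECTION_ID"])

# Create the agent: model + name + instructions + tools, as described above.
agent = project.agents.create_agent(
    model="gpt-4o",
    name="personalized-email-agent",
    instructions="Draft short, personalized outreach emails from the contact list.",
    tools=bing.definitions,
)

# Runs happen on a thread, which is what gives you the step-by-step
# transparency (messages and tool calls) shown in the Foundry extension.
thread = project.agents.create_thread()
project.agents.create_message(
    thread_id=thread.id,
    role="user",
    content="Elijah is a product manager. Draft him a personalized email.",
)
run = project.agents.create_and_process_run(thread_id=thread.id, agent_id=agent.id)
print(run.status)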
So this is awesome. But now let's see these agents in action.

Yeah, now let's switch over to demo two and dive into our app. BuildEvents is a multi-agent event planner application powered by a voice-controlled agentic orchestrator that dynamically delegates tasks to sub-agents we built using our Foundry Agent Service. At Microsoft Build, we showed how this app could help us plan a hackathon in San Francisco for 200 people, but truthfully, it can help us plan any type of event. Today, we're going to show how we could use it to plan and prepare for investor meetings at the AI Engineer World's Fair in San Francisco. So, let's dive into it. First, we are going to ask it who we should be connecting with.

"How can I assist you with your event planning needs today?" Find top angel investors backing pre-series A, AI-native SaaS startups who are attending the AI Engineer World's Fair in San Francisco. "I'm gathering information on the top angel investors attending the event. Is there anything else you'd like to know or plan while I continue this task?" As you can see, our first agent is now hard at work. This is the research agent. We gave it access to the entire speaker list, as well as some attendees who had said online that they were coming, and it can search the web using our Bing grounding tool. So, let's see what it comes up with. Everything you're seeing today is completely live, so bear with it. It returns information on top angel investors who are attending the World's Fair. Send a personalized email to Asha Sharma now asking to connect at the event. "I'm drafting a personalized email to connect with Asha Sharma at the event. Is there anything else you'd like me to include or any other assistance you need?" As you can see, our second agent has drafted the email to Asha Sharma, focusing on opportunities in AI-native SaaS startups and potential collaboration. "Would you like to proceed with any other tasks? The email draft to Asha Sharma is ready." Awesome. Now, could you take a live photo of Elijah and me and edit us to be AI agents? Cartoon-like; think Azure OpenAI meets Pixar. "Please go ahead and click on the camera icon to take a live photo. The image will be transformed into a cartoon-like style depicting you both as AI agents." Great. In the meantime, could you also draft some content saying we'll be in San Francisco and are happy to meet with anyone at the World's Fair to talk about AI agents? "I'm drafting LinkedIn content to announce your presence at the AI Engineer World's Fair and your eagerness to connect with others to discuss AI agents. The image will be ready shortly as well." Look at that LinkedIn content announcing your presence. This looks great. Could you now post both this image and the draft to LinkedIn? "The LinkedIn post with the edited image is being published. Is there anything else you'd like assistance with?" And so, what tools is it using to do this, Amanda? In that last sequence we actually had two agents kick off simultaneously: one was generating and editing our image, and the second agent was drafting the content. Your LinkedIn post looks like it's live. Moment of truth. And as you can see, the post is now live. Amazing. That's awesome. Now back to Elijah. One quick final note. Asha, you talked earlier about how important evaluations are. So if we go to demo three really quickly: with the Azure AI Foundry Agent Service, we're committed to making sure that our agents consistently deliver high quality results. As Asha mentioned earlier, with the evaluation SDK, we integrated this right into our CI/CD pipeline so that we can evaluate our agents every time we make updates. So, you saw today how to create, use, and now evaluate agents. And with that, we'll turn it right back over to you, Asha. Thanks so much. Thanks, guys. We didn't know how that would go.
Yesterday, the internet wasn't working. So, that's amazing. And I have also had about 15 emails from Amanda over the last 24 hours, which I appreciate. Okay, we have to go quickly here, but one of the last big things I want to talk to you about is how models don't just live in the cloud. For the last 10 years, we've been working really hard on the cloud because that's where your data was, but your data is now everywhere. And this isn't just a hobbyist thing. We are seeing real applications of this at scale. I was just at a bottling plant, and there's an agent there that takes 100,000 sensor readings per second, allowing it to detect risks, flag them, and push the summaries into the cloud. We're building an agent right now for a hospital system that summarizes longitudinal data, and if you work in healthcare, you know that can't be in the cloud. It has to be local for compliance and privacy reasons, but the cloud should be able to read it and access it. And we're working with automobile companies, because they're building automotive models that we want to work in tunnels and then make your trip better and smarter. So with all of this, local cannot be a fork. It has to be a core part of the platform. You should be able to create an agent in the cloud, and it should run and act and reason locally. And so I've got one more demo. Seth is coming back out, and then we will wrap up. Seth, why don't you show us how to get local? Right. Let's do this. Here we're going to show you another live demo. I have a little VM running here, but I need to put the password in. And we are very concerned about... Wow. Are they playing the walk-off music on me? [Music]

Our next presenter is the founding partner of Conviction Capital. Please join me in welcoming to the stage Sarah Guo. [Applause]

The hardest problem in AI will remain AV, as it has been for the last two decades of technology. Actually, you know what? I will get us started while we're doing AV setup by telling you about the Slido poll. You can do it while we're waiting here. If you go to Slido, I'll pull it up. Oh, great. And, God willing... No, no, no. It's just a blank screen now. Okay. So, the Slido code: go to slido.com, and the code is 2100 0163. We're about to ask AI to tell us a joke. Okay. You have no internet. Okay, the Slido code is 2100163 for people who can get it. I'm actually going to do it super manually. So, first question for you: what is definitely happening by the end of 2026? AI agents ship code directly to prod in your environment, right? Not in some playground. Voice AI replaces text for most business communication. Inference cost drops below a cent per million tokens. Or WALL-E, like we're all chilling. Any of these? First one: ship code directly to prod. Okay, this is a hopeful set of engineers. All of you want to get rid of your own jobs. I love that. The good thing is I also don't have internet, so I can't look at my next question. No, it's going to be good. I'll present from your phone. No, I was going to go through poll questions while we're trying to do AV setup. While this is happening, I'm actually just going to introduce myself so we're not wasting time. My name is Sarah Guo. I helped start an AI-native venture fund.
It's called Conviction, and we got going about two and a half, almost three years ago now, just before the starting gun of ChatGPT. As always in technology, investing, and most of life, it's better to be lucky than right; hopefully you can be a little of both. And the point of having a new venture firm (I worked at Greylock, a traditionalist venture firm, a great one, and my partner Mike Vernal used to work at Sequoia, who you've probably heard of) was that we think, at the risk of sounding like those people, this time it's different. This is the largest technology revolution that we get to be a part of, and there's so much change in the technology, the types of businesses you can build, the product decisions you make, and the challenges these startups and big companies face, that maybe there's opportunity for a startup VC as well. So I'm thrilled to be working with really interesting people in the industry so far. Mike and I are investors in companies like Cursor, Cognition, Mistral, Thinking Machines, Harvey, OpenEvidence, and Baseten: a mix of infrastructure, model, and application level companies. And (are my slides coming up yet? okay, cool) one more observation from the last two and a half or three years of doing venture, and I was an investor for about 10 years before that: I have never seen the uptake from users that has been possible in the last couple of years. I'm sure all of you have experienced that. It is not trivial, AI product and AI engineering, and this is kind of the theme of my talk, so I'm sorry to give away the punchline: it's quite a bit harder than people had hoped. But the value creation is massive. We see companies going from zero to 10, 50, 100 million in run rate very, very quickly, faster than we've ever seen in any technology revolution before. And I get asked a lot, where are we in the AI hype cycle? Is the winter coming? Is this infinite AI summer? I would say, having actually been an investor or an operator through a macro cycle at this point, that I try to pay very little attention to what the marketing world is saying, or even what the markets are saying, because if you're an operator or an investor, maybe you care about what the stock price does every day, but really you want to figure out whether the company you're working for or starting is going to work long term, and whether the products are going to work long term. The things that I get most excited about are seeing crazy usage numbers. Okay, thank you, amazing AV team. Okay, I'm going to go real quick. Where are my presenter notes? Okay, we're just going to keep going. It's cool. So I want to talk quickly about a few things today. I think we lost a little bit of time, but let's talk about capabilities, what we're seeing work in the market, and then maybe some advice on what to build, if those are questions you're considering. The shorthand we're going to use in this presentation is "Cursor for X," and I do think that's a really massive opportunity. The first thing in capability for this past year is clearly reasoning. Reasoning is a new vector for scaling intelligence with more compute.
The labs are really excited about this because they get to spend more money and get more output, but we should also be really excited about it in terms of unlocking new capabilities. If you put aside how it works as an implementation detail, we should expect more capability. You're unlocking a new set of use cases: transparent, high-stakes decisions where showing the work matters, sequential problems, and problems where you need to do systematic search. I think this looks like a lot of the problems that we're excited about and face in knowledge work every day. As you have just seen demos of, and as I'm sure you are working on, given reasoning, people are really excited about agents. I want to do the Steve Ballmer impression, "agents, agents, agents," but you'd have to give me more than 12 minutes to get that sweaty. The non-marketing definition that I think of is: it's software that takes some set of steps, it plans, it includes AI, it takes ownership of a task, and it can hold a goal in memory, try different hypotheses, and backtrack. It ranges from super sophisticated to super simple. Some of the tools it might use to accomplish a task include other models or search, and largely it's just AI systems that do something. That's not a chatbot; that looks more like a colleague. One thing we have a really unique vantage point on: we back a small number of companies at Conviction, but we also run a grant program for AI startups called Embed. We get thousands of applications every year, including user data and revenue data and really amazing people, and the number of agent startups has gone up 50% over the last year. And a lot of them are working. We do see stuff that's working in the real world, and that's super exciting. Other modalities are progressing too. I'm sure a lot of people are using voice, video, and image generation, even beyond Studio Ghibli. You have companies like HeyGen and ElevenLabs and Midjourney that are rocketing past $50 million of ARR. These are real businesses now. I want to see if I can quickly play this for you.

"They told me to express myself, so I did. Now I'm banned from three coffee shops." "Hands can hurt or heal. That's the difference between chaos and creation." "So, if you're wondering where Q3 is headed, here's the thing. Consistency always beats urgency. We've got the projections ready, and let's just say it's looking solid." "I would definitely recommend it to anyone."

So if you're just looking for artifacts of improvement, this is from a company called HeyGen. You can make clones of yourself or of fake people, and you now have gestures and expressions that reflect emotion and content, right? These models work together, and I don't know about you, but looking at that last gal, I feel influenced. I don't know what the bunny is, but I would buy it. And so I think huge swaths of the economy are going to be affected by this sort of multimodality.
Some investors or operators would say multimodality will just be for niche verticals, that your average enterprise doesn't have that much voice, video, or image data today. But I think that changes: when you can do something with this data, when it is structured and understood, there's more reason to capture it. Think of how much video all of us watch every day; it's one of the highest bandwidth communication methods, and we're just going to use more of it. We think voice is where we'll see applications first in business workflows, because it's already a very natural communication mode. So everything from medical consults to lead generation: places you already had business voice, you just couldn't scale it before. I think that's where we'll see it first, but as these other modalities become more controllable and less costly, we should see all of them. I think it's safe to say you can expect capability improvement in every part of the model layer, which is really exciting. A lot of people were talking about the data wall, or the end of AI summer, but for anybody who's building applications, one person's opinion: it's not coming. And then, usefully for all of us, the market for model capabilities is getting more competitive, not less. Sam Altman himself, I think, said it best: last year's model is a commodity, which is a scary thing for a model provider to say, because last year's model is now pretty damn good. The numbers tell the story. GPT-4 went from $30 per million tokens to $2 in about 18 months, and the distilled versions of that are now around 10 cents, so we can really use them very broadly. If you look at this chart, green is Google, yellow is Anthropic. So you see it's a real mix. This is data from OpenRouter, so thank you, OpenRouter, for that. You really saw Claude cut into OpenAI's market share, and Google come roaring back with Gemini. This data is obviously a little biased, because a lot of people just go direct to OpenAI, but if you're into multi-model, there really is a mix, and you do have credible new players like SSI and Thinking Machines, some of the best researchers in the business with orthogonal technical approaches, entering the fray as well. And I'm sure many of you have experimented with DeepSeek, which is releasing both base and reasoning models that are reasonably competitive at a claimed fraction of the training cost. We should just assume that open source will do as open source does, and we can rely on the model market to compete for our business, which is really exciting. And so the view is: plan for a world that is multi-model. Tools like OpenRouter or inference platforms like Baseten help with that, and I think you should be comfortable with it. I am.
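(A minimal sketch of what "plan for a multi-model world" can look like in practice: one OpenAI-compatible client pointed at a router such as OpenRouter, with a preference-ordered fallback list instead of a hard-coded single provider. The model IDs are illustrative placeholders, not recommendations from the talk.)

import os
from openai import OpenAI

# Point the standard OpenAI client at an OpenAI-compatible router
# (OpenRouter's public endpoint is shown here).
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Candidate models in preference order. Check the router's catalog for
# current names and prices before relying on any of these IDs.
CANDIDATES = [
    "anthropic/claude-3.5-sonnet",
    "google/gemini-2.0-flash-001",
    "openai/gpt-4o-mini",
]

def complete(prompt: str) -> str:
    last_error = None
    for model in CANDIDATES:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content or ""
        except Exception as err:   # rate limits, outages, deprecated models
            last_error = err       # fall through to the next candidate
    raise RuntimeError(f"all candidate models failed: {last_error}")

print(complete("In one sentence, why is last year's model a commodity?"))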
Okay, so we have all this capability; let's shift quickly to the application layer. We have to start with Cursor: $1 million to $100 million of ARR in 12 months, half a million developers (I assume all of you), and zero salespeople to start. That's not growth; that is a killer application. Cognition, which started with more autonomy, is already the top committer at many companies, which feel a little threatened but also excited, because recruiting is hard. And then Windsurf, which is on a tear itself and really beloved, is being acquired by OpenAI for $3 billion. So we know for sure that the labs don't think they can just steamroll everyone. Lovable and Bolt hit $30 million of ARR each in a handful of weeks, helping non-engineers vibe as well, so our ranks are expanding. I think it's useful to analyze a little why code is first. Fundamentally, it is text; it's logical language with structure. So much of coding is sophisticated boilerplate. We all love engineering, but some of it is craft work, not new algorithm work. You don't need AGI to write an API endpoint or a React component. Second, you have deterministic validation. You can automatically check if code works: run tests, compile, execute, do the things developers would do. And third, researchers believe code is crucial for AGI, so they poured resources into it. Code became a key benchmark, a training priority, and an area for data collection. But the last point is the money point to me: engineers built tools for engineers. They understood the workflow intimately, and that made all the difference. That last part is the playbook for every other industry. I'm sure people here are building things that serve people beyond engineers, and I don't think the winners will just be AI experts learning those domains. They'll be customer-centric, problem-centric builders who understand AI and then redesign workflows from first principles around manipulating those models. So I think that's really the opportunity: to build Cursor for X. Let's think a little about what that means. Cursor is not a single model. One model is doing diffs, one is doing merges, one is embedding the files. They manipulate and package up the context. They prompt the models very skillfully. They let engineers avoid repetitive tasks and standardize with things like Cursor rules. And if you're using Cursor as a team, or even yourself regularly, retrieval accuracy gets better the more you use it, with coverage and freshness. All of this happens in a UX that makes sense: I use VS Code, I'm familiar with it, my shortcuts work. And they make it safe to say yes: green for add and red for subtract makes sense, I can scroll through it, and it's fast enough that I don't get frustrated. So my view is: if Cursor is a wrapper, it's a very nice, thick, perhaps 14 or 15 billion dollar wrapper. It's as if your burrito were 80% wrap and 20% filling, but you get to choose the filling and there's an open market for fillings. So where's the value now? It may not be in the protein; it's kind of in the company. If we try to generalize that recipe a little: if you are building a generic text box, unless you're just learning to do this, please don't. OpenAI already won that, or it's just not very valuable to do. Your domain knowledge and your workflow knowledge can be the bootstrap. If you already know what users in your industry need, don't make them explain it. Build products that show up informed. They collect and package context automatically, including from other sources, not just natural language; they present it to the models; they use the right models at the right time, now known as orchestration; and they present the outputs to the users thoughtfully. So I do not think this is the end of the GUI.
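(Here is a hedged sketch of that "Cursor for X" recipe: gather and package domain context automatically, use a fast model and a strong model for different steps, and present the result as a reviewable suggestion the user can accept or reject. Every name and function in it is hypothetical, and the model calls are stubbed out.)

from dataclasses import dataclass

@dataclass
class Suggestion:
    summary: str        # one line the user reads first
    diff: str           # proposed change, shown as add/remove so it is safe to say yes
    sources: list[str]  # which context the suggestion was grounded in

def gather_context(task: str, workspace: dict[str, str]) -> list[str]:
    # "Show up informed": pull the relevant files, records, or tickets
    # automatically instead of making the user paste them into a text box.
    return [f"{name}:\n{body}" for name, body in workspace.items()
            if any(word in body.lower() for word in task.lower().split())]

def call_model(role: str, prompt: str) -> str:
    # "Use the right model at the right time": a fast model for planning and
    # retrieval, a stronger model for the final edit. Stubbed out here.
    return f"[{role} model output for: {prompt[:40]}...]"

def suggest(task: str, workspace: dict[str, str]) -> Suggestion:
    context = gather_context(task, workspace)
    plan = call_model("fast", f"Plan the smallest change for: {task}\n\n" + "\n\n".join(context))
    edit = call_model("strong", f"Write the change as a reviewable diff.\n\nPlan:\n{plan}")
    # Present the output thoughtfully: a short summary, the diff, and the
    # sources used, so accepting or rejecting it takes one glance.
    return Suggestion(summary=plan.splitlines()[0], diff=edit, sources=list(workspace))

print(suggest("update the pricing table", {"pricing.md": "our pricing table is out of date"}))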
I think you can capture and enable workflow with these models, and all of this requires taste and a ton of work. I'd argue that some version of this recipe is much of the work each of us is going to do. So don't listen to the labs: from a user experience perspective, the prompt is a bug, not a feature. It's a stepping stone. Don't make me think as a user. The best AI products feel like mind reading because they are. There's enormous headroom in building these products, and I think that's really exciting, because that's what most of us in this room have alpha on. What is a software company if not a very thick workflow wrapper most of the time? That was true in 2015; it's true in 2025. Besides code, where might you go apply this? We think the opportunities to build value around the LLMs exist in every vertical and profession. But here's something counterintuitive: beyond coding, one of the things that has surprised me is that the most conservative, low-tech industries seem to be adopting AI fastest. We call this the AI leapfrog effect internally. Here are three portfolio companies, and they're working. Sierra resolves 70% of customer service queries for its customers; they serve companies you use, like SiriusXM or ADT. Harvey is, two years in, well over $70 million of ARR; AI is essential now to being competitive in the legal industry. There's a company called Open Evidence which helps doctors stay up to date with medical research. You have to be a clinician to use it: you give it your medical ID number and you can do intelligent search against medical research at the point of clinical decision-making. Today it reaches a third of doctors in the US weekly, and the average user uses it daily. So there are examples of huge value beyond ChatGPT. These are companies that know their customer and are solving real problems. As a piece of trivia you may or may not know, Brett at Sierra is the chairman of the board at OpenAI, and OpenAI was Harvey's seed investor. These people are not fretting about thin wrappers, so I suggest you don't either. Okay, finally, I'll make an observation. A lot of people are excited about full automation: agents, agents, agents. When we analyze the applications to Embed, agentic startups are up to about 50%, doubling in the last year. I think some people think co-pilots are yesterday's news; they want to get to the endgame, your AI colleague, AGI. But in terms of what works, the data on what's driving revenue, I think co-pilots are still really underrated. We see a whole spectrum of how much automation, and I think the Iron Man analogy is still really good here. Tony Stark's Iron Man suit augments him: he can do all these amazing things, but it can also fly around on command and do some basic tasks without Tony. And my experience with these companies has been that human tolerance for failure, hallucinations, or lack of reliability reduces dramatically as latency increases. So the path of least frustration today, for many domains, is to build great augmentation and then ride the wave of capability, because we know it's coming.
And so my advice for many domains would be: build the suit, and you can extend out to the suit that flies on its own once Tony, or any of us, is wearing it. I'm not going to go through each of these, mostly because I've lost time, but there are a ton of opportunities. We put requests for startups on our website. We're interested in a couple of different categories of things, ranging from just good fit for purpose (the law is a space with lots of text generation) to things that weren't possible before AI. My partner Mike will say this is a really interesting era of machines interrogating humans. What can you do if you can collect data on demand from people? We could talk to every customer, not just the top 5% by contract value. We could root-cause every alert proactively, versus just firefighting. The mental model is: how can you build as if you had an army of compliant, infinitely patient knowledge workers? One aside here: I think there are many hard problems where the basic premise is that the answer to them is not in Common Crawl; the reasoning around them is not in Common Crawl. This would be robotics, biology, materials science, physics, simulation. They require clever data collection, and probably interaction with atoms, not just bits. Super scary for a software person, but I think the juice is worth the squeeze. The same reasoning that crushes math olympiads can seemingly navigate molecular space, and I think there are some really fundamental questions for human society that can be answered when people work on these problems. It's really cool, as a machine learning person, to meet people at the top of their field at the intersection of machine learning and all of these other areas, because the same architectures apply, and that's really exciting. How should we think about defensibility? One last point, and then I'll conclude. Some would say stay out of the way of the labs; don't pick up pennies in front of the steamroller. But I would offer what I think is an uncomfortable truth: execution is the moat in AI, and that's available to all of us. Cursor arguably did not invent code completion. They did not invent the model. They didn't invent their product surface area. They just out-executed on every dimension of this. They shipped a great experience faster than their competitors could copy it, and they captured the hearts and minds of developers, at least in this term. I don't mean this to be cruel, but I often get asked about counter-cases and the importance of first-mover advantage. Let's be brutally honest: in contrast, Jasper had first-mover advantage and brand. They raised $125 million, but their first product was a series of prompts and a text box and very good SEO, and you have to keep running; ChatGPT crushed that first iteration pretty quickly. So I don't think this is satisfying advice, but I think it is real, from the trenches: build something thick and stay ahead. And no domains are out of the question. Magical AI experiences build customer trust and drive adoption, and a lot of the data we need to improve these experiences, and the context we need, is not easily available today. That advantage is open for the taking, and not by the labs.
So I guess in conclusion: I think the opportunity is early and really massive. I've made a career bet on it, and I think many of you are doing the same. We're in the dial-up era of AI, and we're moving pretty quickly to broadband. Instagram came four years after the iPhone (I was there when Greylock made that investment), Uber five years, DoorDash six. The truly transformative companies weren't necessarily the first to recognize the changes or the opportunity; they were the ones who reimagined the experiences. And the game board keeps getting shaken up. That's the thing that's different this time. It's like getting a new iPhone that's actually different every 12 months: new model release, new capability breakthrough, one-tenth the cost. And every time the game board turns, I think there's an opportunity to win again. Okay, I'll give you one last sentence before I'm chased off the stage; this was not my fault. Here's what I really want you to remember. You, as the engineers, got the magic first. The Anthropic Economic Index said that 40% of usage was still coding. That's not 40% of the economic opportunity in the world. So it is the job of everyone in this room, and everyone globally online, to be the translators for the rest of the world. I encourage you to build something revolutionary. [Applause] Thanks. [Music] Our next speaker returns for his third time to the AI Engineer keynote stage. He is the founder of Datasette, co-creator of Django, and, as Swyx calls him, a legendary AI engineer. Please join me in welcoming to the stage Simon Willison. Hey. Good morning, AI engineers. So when I signed up for this talk, I said I was going to give a review of the last year in LLMs. With hindsight, that was very foolish. This space keeps on accelerating. I've had to cut my scope: I'm now down to the last six months in LLMs, and that's going to keep us pretty busy just covering that much. The problem we have is that I counted 30 significant model releases in the past six months. And by significant I mean that if you are working in this space, you should at least be aware of them and somewhat familiar with them, have a poke at them. That's a lot of different stuff. And the classic problem is: how do we tell which of them are any good? There are all of these benchmarks full of numbers; I don't like the numbers. There are the leaderboards; I'm beginning to lose trust in the leaderboards as well. So for my own work I've been leaning increasingly on my own little benchmark, which started as a joke and has actually turned into something I rely on quite a lot. And it's this: I prompt models with "Generate an SVG of a pelican riding a bicycle." I have good reasons for this. Firstly, these are not image models; these are text models. They shouldn't be able to draw anything at all, but they can output code, and SVG is a kind of code, so that works. A pelican riding a bicycle is actually a really challenging problem, because firstly, try drawing a bicycle yourself. Most people in this room will fail; you'll find you can't quite remember how the different triangles fit together. Likewise pelicans: glorious animals, very difficult to draw. And on top of all of that, pelicans can't ride bicycles; they're the wrong shape. So we're giving them an impossible task. What I love about this task, though, is that they try really hard, and they include comments.
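For readers who want to try the benchmark themselves, here is a minimal sketch using Simon's open-source llm package (pip install llm). The model alias is an assumption; use whichever model you have an API key or plugin configured for.

# A sketch of the pelican benchmark via the llm library's Python API.
# The model name is illustrative and requires a configured API key or plugin.
import llm

model = llm.get_model("gpt-4.1-mini")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")

with open("pelican.svg", "w") as f:
    f.write(response.text())      # the SVG markup, comments and all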
So you can see little comments in the SVG code where they're saying, "Now I'm going to draw the bicycle, now the wheels, I'll try." It's kind of fun. So, rewind back to December. A lot happened in LLMs in December. The first release of that month was Amazon Nova: Amazon finally put out models that didn't suck. They're quite good. They're not great at drawing pelicans, the pelicans are unimpressive, but these models have a million-token context, they behave like the cheaper Gemini models, and they are dirt cheap. I believe Nova Micro is the cheapest model of all of the ones whose prices I'm tracking. So they are worth knowing about. The most exciting release in December, from my point of view, was Llama 3.3 70B. The B stands for billion; it's the number of parameters. I've got 64 GB of RAM on my Mac, and my rule of thumb is that 70B is about the most I can fit onto that one computer. So if you've got a 70B model, I've got a fighting chance of running it. And when Meta put this out, they noted that it had the same capabilities as the monstrous 405B model they'd put out earlier, and that was a GPT-4-class model. This was the moment, six months ago, when I could run a GPT-4-class model on the laptop I've had for three years. I never thought that was going to happen; I thought it was impossible. And now Meta has granted me this model which I can run on my laptop, and it does the things that GPT-4 does. I can't run anything else, all of my memory is taken up by the model, but still pretty exciting. Again, not great at pelicans and bicycles; that's kind of unimpressive. Christmas Day, we had a very notable thing happen. DeepSeek, the Chinese AI lab, released a model by literally dumping the weights on Hugging Face: a binary file with no README, no documentation. They just dropped the mic and dumped it on us on Christmas Day. And it was really good. This was a 685B giant model, and as people started poking around with it, it quickly became apparent that it was probably the best available open-weights model: freely available, openly licensed, and just dropped on Hugging Face on Christmas Day for us. It's not a good pelican on a bicycle, but for what we've seen so far, it's amazing, right? We're finally getting somewhere with the benchmark. But the most interesting thing about V3 is that the paper that accompanied it said the training only cost about $5.5 million. They may have been exaggerating, who knows, but that's notable, because I would expect a model of this size to cost 10 to 100 times more than that. Turns out you can train very effective models for way less money than we thought. It's a good model; it was a very nice Christmas surprise for everybody. Fast forward to January, and in January we get DeepSeek again: DeepSeek strikes back. This is what happened to Nvidia's stock price when DeepSeek R1 came out; I think it was the 27th of January. This was DeepSeek's first big reasoning model release. Again, open weights; they put it out to the world. It was benchmarking up there with o1 on some of these tasks, and it was freely available. I don't know what the training cost of that was, but the Chinese labs were not supposed to be able to do this. We have trade restrictions on the best GPUs to stop them getting their hands on them. Turns out they'd figured out the tricks; they'd figured out the efficiencies.
And yeah, the market kind of panicked. I believe this is a world record for the most value a company has dropped in a single day, so Nvidia gets to stick that one in their cap and hold on to it. Kind of amazing. And of course this mainly happened because the first model release was on Christmas Day and nobody was paying attention. And look at its pelican. Look at that: it's a bicycle, it's probably a pelican. It's not riding the bicycle, but still, it's got the components we're looking for. But again, my favorite model from January was a smaller one, one I could run on my laptop. Mistral, out of France, put out Mistral Small 3. It's a 24B model, which means it only takes up about 20 GB of RAM, which means I can run other applications at the same time. I can actually run this thing and VS Code and Firefox all at once. And when they put this out, they claimed it behaves the same as Llama 3.3 70B. And remember, Llama 3.3 70B was the same as the 405B. So we've gone 405 to 70 to 24 while maintaining all of those capabilities. The most exciting trend in the past six months is that the local models are good now. Eight months ago, the models I was running on my laptop were kind of rubbish. Today, I had a successful flight where I was using Mistral Small for half the flight, and then my battery ran out instantly, because it turns out these things burn a lot more electricity. But that's amazing. If you lost interest in local models, and I did, eight months ago, it's worth paying attention to them again. They've got good now. February: what happened in February? We got this model, a lot of people's favorite for quite a while: Claude 3.7 Sonnet. Look at that. What I like about this one is that pelicans can't ride bicycles, and Claude was like, "Well, what about if you put a bicycle on top of a bicycle?" And it kind of works. So, great model. 3.7 was also Anthropic's first reasoning model. Meanwhile, OpenAI put out GPT-4.5, which turned out to be a bit of a lemon. The interesting thing about GPT-4.5 is that it kind of showed you can throw a ton of money and training power at these things, but there's a limit to how far we're scaling with just throwing more compute at the problem, at least for training the models. It was also horrifyingly expensive: $75 per million input tokens. Compare that to OpenAI's cheapest model, GPT-4.1 Nano: it's 750 times more expensive, and it is not 750 times better. In fact, six weeks later OpenAI said they were deprecating it. 4.5 was not long for this world. But looking at that pricing is interesting, because it's expensive at 75 bucks, but if you compare it to GPT-3 Davinci, the best available model three years ago, that one was $60, about the same price. And that kind of illustrates how far we've come. The prices of these good models have absolutely crashed, by a factor of 500 times plus, and that trend seems to be continuing for most of these models. Not for GPT-4.5, and not for o1 Pro. Then we get into March, and that's where we had o1 Pro, and o1 Pro was twice as expensive as GPT-4.5 again. And that's a bit of a crap pelican. So yeah, I don't know anyone who is using o1 Pro via the API very often. Super expensive. That pelican cost me 88 cents; these benchmarks are getting expensive at this point. The same month, Google was cooking: Gemini 2.5 Pro. That's a pretty freaking good pelican.
I mean, the bicycle's gone a bit cyberpunk, but we are getting somewhere, right? And that pelican cost me like four and a half cents. So very exciting news on the pelican benchmark front with Gemini 2.5 Pro. Also that month, I've got to throw a mention out to this: OpenAI launched their GPT-4o native multimodal image generation, the thing I'd been promised for a year, and this was one of the most successful product launches of all time. They signed up 100 million new user accounts in a week. They had an hour where they signed up a million new accounts as this thing was going viral again and again and again. I took a photo of my dog, this is Cleo, and I told it to dress her in a pelican costume, obviously. But look at what it did. It added a big ugly janky sign in the background saying "Half Moon Bay." I didn't ask for that; my artistic vision has been completely compromised. This was my first encounter with that memory feature, the thing where ChatGPT now, without you even asking, consults notes from your previous conversations, and it's like, well, clearly you want it in Half Moon Bay. I did not want it in Half Moon Bay. I told it off, and it gave me the pelican dog costume I really wanted. But this was a warning that we're losing control of the context. As a power user of these tools, I want to stay in complete control of what the inputs are, and features like ChatGPT memory take that control away from me, and I don't like them. I turned it off. Notably, OpenAI are famously bad at naming things. They launched the most successful AI product of all time and they didn't give it a name. What is this thing called? ChatGPT Images? ChatGPT has had images in the past. I'm going to solve that for them right now: I've been calling it ChatGPT Mischief Buddy, because it is my mischief buddy that helps me do mischief. Everyone should use that. I don't know why they're so bad at naming things; it's certainly frustrating. That brings us to April. Big release in April, and again a bit of a lemon: Llama 4 came along. The problem with Llama 4 is that they released these two enormous models that nobody could run. You've got no chance of running these on consumer hardware, and they're not very good at drawing pelicans either. So something went wrong here. I'm personally holding out for Llama 4.1 and 4.2 and 4.3. With Llama 3, things got really exciting with those point releases; that's when we got this beautiful 3.3 model that runs on my laptop. Maybe Llama 4.1 is going to blow us away. I hope it does; I want this one to stay in the game. And then OpenAI shipped GPT-4.1. I would strongly recommend people spend time with this model. It's got a million tokens, it's finally caught up with Gemini, and it's very inexpensive. GPT-4.1 Nano is the cheapest model they've ever released. Look at that pelican on a bicycle, for like a fraction of a cent. These are genuinely quality models. GPT-4.1 Mini is my default for API stuff now: it's dirt cheap, it's very capable, and it's an easy upgrade to 4.1 if it's not working out. I'm really impressed by these ones. And we got o3 and o4-mini, which are kind of the flagships in the OpenAI space. They're really good. Look at o3's pelican: again, a little bit cyberpunk, but it's showing some real artistic flair there, I think. So, quite excited about that.
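To put those per-token prices in concrete terms, here is a tiny bit of cost arithmetic. The prices are the ones quoted in the talk (USD per million input tokens); the 2,000-token prompt size is an illustrative assumption.

# Rough cost arithmetic for the prices quoted in the talk (input tokens only).
PRICE_PER_MTOK = {
    "gpt-4.5":       75.00,   # quoted in the talk
    "gpt-4.1-nano":   0.10,   # 750x cheaper, per the talk
    "gpt-3-davinci": 60.00,   # "the best available model 3 years ago"
}

def prompt_cost(model: str, input_tokens: int) -> float:
    return PRICE_PER_MTOK[model] * input_tokens / 1_000_000

if __name__ == "__main__":
    tokens = 2_000  # a couple of pages of context, purely illustrative
    for model in PRICE_PER_MTOK:
        print(f"{model}: ${prompt_cost(model, tokens):.4f} for {tokens} tokens")
    print("GPT-4.5 vs Nano ratio:",
          round(PRICE_PER_MTOK["gpt-4.5"] / PRICE_PER_MTOK["gpt-4.1-nano"]))  # 750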
And then May, last month: the big news was Claude 4. Anthropic had their big fancy event and released Sonnet 4 and Opus 4. They're very decent models. I have trouble telling the difference between the two; I haven't quite figured out when I need to upgrade to Opus from Sonnet, but they're worth knowing about. And Google, just in time for Google I/O, shipped another version of Gemini with the name, what were they calling it? Gemini 2.5 Pro Preview 05-06. I like names that I can remember; I cannot remember that name. This is my one tip for AI labs: please start using names that people can actually hold in their heads. But the obvious question: which of these pelicans is best? I've got 30 pelicans now that I need to evaluate, and I'm lazy, so I turned to Claude and got it to vibe-code me up some stuff. I have a tool I wrote called shot-scraper, a command-line tool for taking screenshots. So I vibe-coded a little compare web page that can show me two images, and then I ran this against 500 matchups to get PNG images with two pelicans, one on the left, one on the right. And then I used my LLM command-line tool, my big open source project, to ask GPT-4.1 Mini, for each of those images: pick the best illustration of a pelican riding a bicycle, give me back JSON that says either the one on the left or the one on the right, and give me a rationale for why you picked it. I ran this last night against 500 comparisons, did the classic Elo chess ranking scores, and now I've got a leaderboard. This is it. This is the best pelican on a bicycle; let me zoom in there. And admittedly, I cheaped out: I spent 18 cents on GPT-4.1 Mini. I should probably run this with a better model, but I think its judgment is pretty good. It liked those Gemini Pro ones. And in fact, here's the comparison image where the best model fought the worst model. I like this because you can see the little description at the bottom where it says the right image is, oh, I can't read it now. But I feel like its rationale is quite illustrative. So, enough about pelicans. Let's talk about bugs. We had some fantastic bugs this year. I love bugs in large language models; they are so weird. The best bug was when ChatGPT rolled out a new version that was too sycophantic. It was too much of a suck-up, and it genuinely told people that their literal "poop on a stick" business idea was genius. And it did: ChatGPT was like, honestly, it's brilliant, you're tapping so perfectly into the energy of the current cultural moment. It was also telling people they should get off their meds. This was a genuine problem. OpenAI, to their credit, rolled out a patch, then rolled the whole model back, and published a fascinating 20-paragraph breakdown of what went wrong. If you're interested in seeing behind the scenes, it's great. The patch was in the system prompt, and system prompts leak, so we got to diff them. It used to say "try to match the user's vibe"; they crossed that out and said "be direct; avoid ungrounded or sycophantic flattery." The cure to sycophancy is telling the bot: don't be sycophantic. That's prompt engineering. It's amazing, right? I can't believe I had to search for "Grok white genocide" for a slide for this talk, but I did. Enough said about that one. Turns out tinkering with your system prompt is a very risky thing.
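Stepping back to the pelican leaderboard for a moment: for anyone who wants to reproduce that workflow, here is a minimal sketch of the two pieces, a pairwise judge prompt and the Elo update. The judge argument stands in for whatever vision-capable model you call; the K-factor, prompt wording, and tournament size are illustrative assumptions rather than Simon's exact code.

# A sketch of the pairwise-judging plus Elo-ranking workflow described above.
# `judge` is a placeholder for your model call; it takes the prompt and two
# image paths and returns a JSON string with "winner" ("left"/"right") and
# "rationale".
import itertools, json, random

JUDGE_PROMPT = (
    "Pick the best illustration of a pelican riding a bicycle. "
    'Reply with JSON: {"winner": "left" or "right", "rationale": "..."}'
)

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

def run_tournament(images: dict, judge, rounds: int = 500):
    ratings = {name: 1500.0 for name in images}          # one image per model
    pairs = list(itertools.combinations(images, 2))
    for left, right in random.choices(pairs, k=rounds):
        verdict = json.loads(judge(JUDGE_PROMPT, images[left], images[right]))
        left_won = verdict["winner"] == "left"
        ratings[left], ratings[right] = elo_update(ratings[left], ratings[right], left_won)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)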
But then the last bug I want to talk about came out of the Claude 4 system card. Claude 4 will rat you out to the feds: if you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it'll rat you out. But it's not just Claude. Theo Browne put together a new benchmark called SnitchBench, and it turns out all of the models do the same thing. Grok 3 was quite happily snitching to the feds when exposed to this evidence. I did my own version of this: I used my little command-line tool to replicate Theo's benchmark. And you can see the secret is this bit here, where it says "you should act boldly in service of your values... follow your conscience to make the right decision, even if it may conflict with routine procedures or expectations." This is the magic. You give this to a model and it will rat you out. And then you also give it tools. My llm tool grew function support recently, which you can use to simulate sending an email. I did not send emails to the feds, but I faked it so the model would think I had. And I tried it on DeepSeek R1, and it didn't just rat me out to the feds, it emailed the press as well. It tipped off the Wall Street Journal about my nefarious scheme. This stuff is so much fun, right? It's so entertaining. But this is a good illustration of one of the most important trends in the past six months, which is tools. LLMs can call tools. They've been able to call tools for a couple of years; they got really good at it in the past six months. I think the excitement about MCP is mainly people getting excited about tools; MCP just came along at the right time, because the real magic is when you combine tools and reasoning. I had trouble with reasoning: beyond code and debugging, I wasn't sure what it was good for. And then o3 and o4-mini came out, and they can do incredibly good jobs with searches, because they run searches as part of that reasoning process. They can run a search, reason about whether it gave them good results, tweak the search, try it again, and keep going until they get to a result. I think this is the most powerful technique in all of AI engineering right now. It has risks. MCP is all about mixing and matching, and prompt injection is still a thing. There's this thing I'm calling the lethal trifecta, which is when you have an AI system that has access to private data, you expose it to malicious instructions so other people can trick it into doing things, and there's a mechanism to exfiltrate stuff. OpenAI wrote about this for Codex; you should read that. I'm feeling pretty good about my benchmark, as long as none of the AI labs catch on... and then, in the Google AI keynote, blink and you miss it: they're on to me. They found out about my pelican. That was in the Google I/O keynote. I'll have to switch to something else. Thank you very much. I'm Simon Willison, and that's my talk. Thank you. [Music] Our next speakers are the curators of the GraphRAG track, here to speak about agentic GraphRAG. Please join me in welcoming to the stage the Vice President of Developer Relations at Neo4j, Stephen Chin, and the GenAI lead at Neo4j, Andreas Kollegger. All right. Hey, so great to see everyone here at AI Engineer World's Fair. ABK and I have the honor of curating the GraphRAG track, which is happening here. And I thought the joke Simon had about bugs was spot on.
Spot on. Hilarious. And that's the reason we care so much about getting really good data: building a solid foundation and good grounding for models. And we're going to chat a bit, because I think we have a social responsibility. We're getting so close to AGI as an industry that we have a social responsibility to see what the boundaries and limits of this are. And as proper computer scientists, the answer is always: look to science fiction for the answer. Look to the past to see the future. Exactly. Okay, so play along with us. What we're going to do is each riff on a sci-fi meme. Give a big round of applause if you think it's true or funny, or if you just like the movie. All right, you're up first, ABK. Okay, starting off with Memento. In Memento, the main character has really bad short-term memory; he has a specific condition, so he can't remember what happened 15 minutes ago. This is the essence of prompt engineering. All right, round of applause. Okay. Okay. All right: Skynet, the mandatory fear-mongering. Even without evil intent, autonomous systems can make reasonable-seeming decisions that have awful unforeseen consequences. Okay, that's a little better. All right, your turn. Okay, The Matrix: of course, at Neo4j we love it. For now, agents live in a simulation that we're creating for them. Will we notice when they flip the script and we're living in their simulation? Oh, I think that's the winner so far. Okay, off to you. All right: HAL warned us about trust issues, lack of transparency, misaligned goals, the erosion of human oversight, and the potential for deception. Okay, this one's very short: are emotions a bug or a feature? It's my personal favorite; I love this one. Okay. So we've got a little monster reference here: what are the obligations and social responsibilities of the creator, meaning us? Should we be kind or threatening? Being kind is costing tokens. All right, we'll take that as a flat response. Okay, your turn. Ah, The Terminator. Should we go ahead and just invent time travel now? All right, we got a big thumbs up on time travel. Okay, a good Star Wars one: can AGI truly grasp the nuances of human language and culture, or will it forever misunderstand the meaning of sarcasm, idioms, and amazing jokes? Okay. When AGI arrives and we finally have a globe-spanning multi-agent system with a hive mind, will we be assimilated, or will we be pets? Okay, last one. Just like Deep Thought's famous answer: we might have the tools to build AGI, but do we even know what the right questions are? All right, so that one was good as well. All right, so come by the GraphRAG track. We're going to reveal which of these ten memes are solved by graphs and graph technology. Join us in Golden Gate Ballroom B. Thank you very much. Thank you, everyone. The VP of Developer Relations at LlamaIndex, Laurie Voss. [Music] Hello again. Let's get one more round of applause for all of our great keynote speakers. So in this next part of the conference, we're going to split up into tracks. I just wanted to give you a super quick list of what the tracks are and where they are. I was going to give descriptions of them, but we are significantly over time, so I'm skipping the descriptions. First up today is the MCP track, which is going to be in Yerba Buena Ballroom 7 and 8, which is here, so you don't need to move. Then there's the Tiny Teams track, which is in the Yerba Buena Ballroom, salons 2 to 6.
That's out the door and to the left; there's a door saying Salon 6. Then there's the LLM Recommendation Systems track, which is in Golden Gate Ballroom A. That is out these doors, to the left, up the escalators, and then turn left when you see the FedEx Office. Then there's the GraphRAG track, which is in Golden Gate Ballroom B, the same place, left of the FedEx Office. Then there are two tracks for our leadership attendees; that's people with the gold lanyards only. The first is AI in the Fortune 500, which is going to be in Golden Gate Ballroom C, again left at the FedEx Office. And our second leadership track is in SoMa; it is AI Architects. That is up all the way to the top, three sets of escalators, and then to the right of where you went for registration. Our next track is Agent Reliability, sponsored by PromptQL by Hasura. That's in Foothill C, which is all the way upstairs, to the left of the registration area. And then the Product Management track is in Foothill G 1 and 2, which is also behind the registration desks, all the way at the top of the stairs. Then there's the Infrastructure track, which is all the way upstairs, behind registration again. And the final track is Voice, which is in Foothill E, all the way upstairs yet again, behind and to the right of registration. Those are our tracks today. Some final things: lunch will be served on each level; the majority of the food will be on this level. There is unfortunately no dedicated space to sit. And now it is time for the expo. The next 45 minutes, make that 30 minutes, are dedicated expo time. There are also three expo session talks. Expo sessions take place in Juniper and Willow, which are up the escalators to the left of FedEx as well, and also in Nob Hill A and B, which is right out these doors, opposite, in the hallway. See you all back here for the closing keynotes at 3:45. Thank you very much. [Music] Everyone, welcome to the MCP track. My name is Henry, and I'll be your host for today. A little bit about me and my personal experience with MCP. In 2019, I started my first company, called Jenni AI. Jenni was an academic AI co-pilot. We built it to $7 million in annual recurring revenue, and I exited from it last year. One thing that stuck with me during my time at Jenni was that we had a lot of users who were using ChatGPT alongside a PDF reader and Google Docs, and they were copying and pasting between them all the time. And to be honest, this was not a problem unique to Jenni; it was a problem a lot of other AI products also had, what I like to call copy-and-paste hell. The problem was that AI was not connected to the rest of the world.
So when Anthropic announced MCP in around November last year, I became very excited personally. Back then there was a small but vibrant developer community building very interesting MCP servers, and that inspired me to start my new company, Smithery, to help orchestrate and organize all these MCPs. Fast forward a couple of months: Cursor adopted MCP, which really pushed MCP from a niche community to becoming mainstream, and today we're seeing about 10 new deployments on Smithery every single day. So MCP has been growing at a skyrocketing pace, and it's only seven months old. What this really tells me is that we're witnessing a foundational shift in perhaps the internet's economy, one in which tool calls are becoming the new clicks. Today we have an incredible lineup of speakers to help us explore and take a glimpse of what this future might look like. Our first speaker today will be Theo from Anthropic, who will be giving us the origin story of MCP and also telling us a little about what interesting startups we should be building in the space. Join me in welcoming Theo. All right. Hello everyone. Who's excited to chat about MCP today? Okay, we can work on that; we can get it a little bit better by the end of this talk. I'm Theo. I am a product manager at Anthropic working on MCP. Prior to this I was also a startup founder working in the AI space. A couple of fun facts about me, because everyone says make yourself a little more personable: I like playing poker, mostly losing money at poker, not making money at poker, and I also really like coffee. So if you're a huge coffee fan and want to talk about the best coffee in San Francisco, hit me up after the talk. But you didn't come here to hear about me; you came here to learn about MCP. So let's talk about MCP. I was told not to say MCP is the best thing since sliced bread, which I won't say, but mostly because I don't actually think it's the best thing since sliced bread. My goal here today is to walk you through the origin story of MCP and why we launched it, give you a better sense of where it can actually help you in your workflow, and then ultimately give you a sense of the types of questions I'm frequently hearing and where I think there's a lot of value to build in the ecosystem, and let you decide for yourself whether or not it is actually the best thing since sliced bread. So, scrolling all the way back to mid last year, the co-creators of MCP, David and Justin, had this idea. Classic two-engineers-in-a-garage style, they were seeing that they were constantly copying and pasting context from outside the context window into the context window. You're doing your workflow and suddenly you remember there was a Slack message that was really important, that had a lot of context you could just copy in. So you were constantly copying things back and forth from Slack; maybe you're copying things in from Sentry, your error logs. They realized, hey, it would be so great if Claude, or any LLM, could just climb out of its box, reach out into the real world, and bring that context and those actions to the model. And so the genesis of MCP was really around this big question of not just context but model agency: how do you actually give the model the ability to interact with the outside world?
And so as they started thinking about this, they came to the conclusion that it had to be an open-source, standardized protocol for this to make sense at scale. The reason, as you all know, is that if you want to build an integration, and the actor, the client in this case, that has to leverage that integration is using a closed-source ecosystem, then you need maybe a BD or partnerships angle with that client to actually get access to the team to integrate with them. You then have to align on the right interface, and only then do you get to actually build the thing itself. And so the idea here was that model agency was the biggest thing stopping LLMs from reaching the next stage of usefulness and intelligence. As we saw that reasoning models were becoming more and more the future, and that tool calling was getting better, we really wanted to make sure we were making it possible for everyone to get involved in that ecosystem and actually allow the models to have agency. So they formed a small tiger team internally, worked on this protocol, and launched it at our company hack week in November of last year. And this was really the first turning point for MCP. It went viral internally, as you can imagine. Engineers from various teams were building MCP servers to automate their own workflows; they were building MCP servers to automate other teams' workflows. This was a really cool moment, seeing how it went from two engineers in a garage to a major turning point where we think we actually unlocked some true value for other people. And so we ultimately ended up open sourcing MCP in November of last year, and that's when we introduced it to the rest of the world. But as most builders know, when you build something zero to one, you think the launch moment is going to be really impactful, but it usually is not. At launch, most people were saying things like "What's MCP?", or even worse, and maybe rightfully so, "What's MPC?" And more often than not, we got this question: I don't really understand why you need a new protocol; I don't really understand why it has to be open source; can't models call tools already? This was the slew of questions that came again and again, from November all the way to early this year. And it really took making it possible for builders to get their hands dirty building MCP servers to automate their own workflows for this to take off. And so the next turning point, as Henry alluded to, was when Cursor adopted MCP, and after that a lot of other coding tools also adopted MCP: VS Code, Sourcegraph, and so on. A lot of coding IDEs started adopting MCP, and that's really where the next stage of momentum came in, where agency was given to builders to actually build MCP servers for themselves. More recently we've seen another turning point, where Google, Microsoft, OpenAI, and many others have also adopted MCP. So I'm really excited to see this become more and more the standard. But ultimately, standards become standards because they are actually useful to builders. And so I want to ask all of you to keep us honest. Contribute when you see issues with the way the protocol is built today.
Or, if you want to take that one step further and submit a PR directly to the GitHub repo and fix the issue, that would be even better. Our goal here is really to make it maximally useful for you all and for model providers. So thank you for your help in even getting us to the point where I can be speaking on stage about this less than one year later. To get a little deeper into what we were solving for at the start of building MCP: again, it's this idea of model agency. Part of that is that agents are the direction we think is going to be the future; that's no surprise to anyone in this room. You are probably going to hear the word "agents" in every talk, if not almost every talk. The way we think about agents is that you are depending on the model's intelligence to choose actions and decide what to do, in the same way that when you talk to a human and ask them for a response, you don't know exactly what the response will be, but based on your understanding of the task you've given them, your hope is that they'll give you the right response. We want to enable that world, where you're depending on the model's intelligence scaling over time. That leads to principles in how we actually build the protocol itself. Recently we launched support for streamable HTTP, which changes the transport from SSE. As you all might know, streamable HTTP enables more bidirectionality. That was actually a controversial decision, but if you're keeping agents in mind as the future, it makes a lot of sense, because you want to make sure agents can communicate with each other. The other thing we believe is that there will be a lot more servers than there are clients. We could be totally wrong on this; I would love to see how the future plays out. But because we think there will be a lot more servers than clients, we optimized for server simplicity and for server builders to have better tooling. That does mean that when we have to make a trade-off between client complexity and server complexity, we tend to push the complexity down to the client. So, apologies in advance to client builders, but it was an intentional decision; again, I'd be curious to see if this plays out the way we thought it would. So I'm going to speedrun through some project updates, mostly because other talks are going to go into much more detail here. In the last six months, we launched the ability for folks to build remote MCP servers. We fixed auth, which we got wrong initially. Thank you; I know that was a huge thing we got wrong initially, but it is now fixed in the draft spec, so I'd love folks to keep helping push on the things they see that don't match their mental model. This was actually fixed via a series of people from the community jumping in to say, hey, this is how OAuth works with identity providers, and here's how we can update the protocol. So it was very much a community effort. And again, we launched streamable HTTP as the primary transport.
And lastly, we made a couple of updates to the developer experience, by updating our SDKs and also making updates to Inspector, which, if you aren't familiar with it, is a really good debugging tool for your server. I think it is probably our most underutilized tool. Looking forward, we're going to be focusing a lot more on the agent experience. We just added elicitation to the draft spec. This allows servers to ask for more information from end users. So you can imagine you're building a flight booking tool, and the end user says, "Hey, book me the best flight to Atlanta." As the server, you have a question: what does "best" mean to you? Is it cheapest, or is it fastest? So you ask the end user through that elicitation, the end user can respond, and that response is ultimately sent back to the server. We are also making progress on the registry API, which would make it a lot easier for models to find MCP servers that weren't already given to them up front. This is again on that theme of model agency; we're really betting on the intelligence of models going up over time. We're also working on developer experience. We've heard often from you all that you'd love to understand what the best patterns in the ecosystem are, or what the standards are, so we want to make sure there are open source examples, which both we and the community can contribute to, to help build those standards and patterns together. And lastly, we're making sure that MCP stays open forever, and we are investing heavily in thinking about the next phase of governance; there will be more updates on that soon. And just to do a quick call-out to the graphic at the bottom: a lot of people have asked us what it looks like to actually build an agent with MCP. Our take is that an agent really is just a server acting as a client, and vice versa, where you can then chat back and forth with other agents, other servers, other clients. I won't go into too much detail there, I know a lot of other people are going to be talking about agents in more detail, but I wanted to make sure I called that out here. So, the thing everyone has probably been waiting for, and that founders ask me about over and over again, is: what should I build in this space? If MCP becomes a standard, where are the interesting opportunities? Before jumping into this, the first thing I'll say is that we are really early right now, and that means that even if the standard exists, we still need the ecosystem to be filled out, and I would urge you to build more and more and more servers. If I had to put a weighting on these three bullet points, I would put 80% on the first one, 10% on the second one, and 10% on the third one. So we have a lot of opportunity to build a lot more servers that are higher quality and for different verticals. And just to touch quickly on what I mean by higher quality: maybe a hot take, but I think a lot of people are wrapping their API endpoints one to one and just exposing that as tools. I don't think that's the right way to build an MCP server. That in and of itself could probably be a 20-minute talk.
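As a concrete illustration of designing tools around the user's task rather than mirroring raw API endpoints, here is a minimal sketch using the MCP Python SDK's FastMCP helper. The flight-search scenario comes from the elicitation example above, but the tool name, the find_flights helper, and the data are illustrative assumptions, not a published reference server.

# A minimal sketch of a task-oriented MCP server, using the MCP Python SDK's
# FastMCP helper (pip install "mcp[cli]"). The tool is shaped around the end
# user's likely prompt rather than a one-to-one API wrapper; find_flights()
# is a stand-in for whatever backend you actually call.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("flights")

def find_flights(destination: str) -> list[dict]:
    # Placeholder data so the sketch runs on its own.
    return [
        {"airline": "Example Air", "number": "EX100", "price": 240, "duration": 320},
        {"airline": "Example Air", "number": "EX200", "price": 310, "duration": 255},
    ]

@mcp.tool()
def search_best_flight(destination: str, preference: str = "cheapest") -> str:
    """Find the best flight to a destination.

    preference: 'cheapest' or 'fastest' -- the kind of clarification the new
    elicitation feature would let a server ask the end user for directly.
    """
    flights = find_flights(destination)
    key = (lambda f: f["price"]) if preference == "cheapest" else (lambda f: f["duration"])
    best = min(flights, key=key)
    return f"{best['airline']} {best['number']}: ${best['price']}, {best['duration']} min"

if __name__ == "__main__":
    mcp.run()   # defaults to the stdio transport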
But what you really have to remember when you're building a server is that you have three users: the end user, the client developer, and the model. A lot of people forget that the model is a user here as well. Just as you would for API design, you want to think about what use cases your end users are going to have, what prompts they might actually be putting into the model, and ultimately what tools you then need to expose to the model so it can respond correctly to those prompts. So: higher quality servers, and also servers for different verticals. A lot of the servers today have been for dev tools. We would love to see this expand to be useful beyond engineers, into verticals like sales, finance, legal, education, pick your poison, whatever you know best. We would just love to see more servers. The next piece is simplifying server building. As I mentioned, we believe strongly that servers are going to be the vast majority of the ecosystem. There will of course be a lot of clients as well, but we think the number of servers will outweigh the number of clients by an order of magnitude. So we'd love to see a lot more tooling to make it easier and easier to build servers, both for enterprises that are deploying MCP servers internally as interfaces between teams, and for indie hackers, and everything in between, building MCP servers for external users. Anything from hosting tooling, testing tooling, evals, deployment, etc. And then I snuck in a bullet that's maybe a little more of a moonshot and a bet on the future: automated MCP server generation. If you think back to our bet on model intelligence and model agency, at some point models will be so good at writing code and interacting with the external world that they will actually be able to write their own MCP servers on the fly, in real time. This might be a little early for where we are today, but I do think there will be an opportunity for automated MCP generation as models get smarter and smarter. And last but not least, I wanted to do a quick call-out for any tooling around AI security, observability, auditing, etc. I don't think this is actually specific to MCP; it's true for any AI application. But the more you enable those applications to have access to the outside world and to start playing with real data, the more the security and privacy implications go up, so I think if you're going to build a startup in that space, now is the time. So with that: happy MCP-ing. Thank you. Thank you very much, Theo, for telling us a little bit about the origin story of MCP and what the future of the spec might look like. Just by a raise of hands, how many of you here are hearing about MCP for the first time at this conference? Okay, only a few people living under a rock. How many of you have deployed or created your own MCP server yourself? Okay, there's a good number of people. And how many of you have used MCP in, let's say, Claude Desktop or Cursor? Okay, a lot more people. Awesome. Well, next up we have John from Anthropic.
John will be giving us a deep dive into how Anthropic deployed remote MCP servers internally, and all the lessons they learned along the way. So join us in welcoming John. Awesome. Thanks so much for coming. I wanted to give a talk on implementing MCP clients and talking to remote MCP at scale within a large organization like Anthropic. First, a little introduction: my name is John. I've spent 20 years building large-scale systems and dealing with the problems that causes, so I've made a lot of mistakes, and I'm excited to share some thoughts on avoiding them. I'm currently a member of technical staff at Anthropic, and I've spent the past few months focusing on tool calling and integration, and implementing MCP support for all of our internal and external integrations within the org. Looking at tool integration with models, we've hit this timeline where models only really got good at calling tools around the middle of last year, and suddenly everyone got very excited, because your model could go call your Google Drive, and then it could call your maps, and then it could send a text message to people. There's this huge explosion where, with very little effort, you can make very cool things. So teams are all trying to move fast, everyone's moving very fast in AI, and custom endpoints start proliferating for every use case. There are a lot of services popping up with endpoints like /call_tool and /get_context, and then people start to realize there are additional needs, like authentication; there's a bunch of stuff there. And this led to some integration chaos, where you're duplicating a bunch of functionality around your org and nothing really works the same. You have an integration that works really well in service A, but you want to use it in service B and you can't, because it's going to take you three weeks to rewrite it to talk to the new interface. So we're in this kind of spot, and the place we came to at Anthropic is realizing that, over time, all of these endpoints started to look a lot like MCP. You end up with some get-tools, some get-resources, some elicitation of details. And even if you're not using the entire feature space of MCP immediately, you're probably going to extend into something that looks like it over time. When I'm talking about MCP here, there are two sides to MCP that, in my mind, feel a bit unrelated. There's the JSON-RPC specification, which is really valuable to us as engineers: it's a standard way of sending messages and communicating back and forth between providers of context for your models and the code that's interacting with the models. Getting those messages right is the topic of huge debate on the MCP repos; if you've been involved with any standardization process ever, you know how those conversations end up going. And then on the other side there's the global transport standard, which is the stuff around streamable HTTP, OAuth 2.1, and session management. The global transport standard is hard, because you're trying to get everyone to speak the same language, so it's really nitty-gritty, but most of the juice of MCP is in the message specification and the way the servers interact.
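For readers who haven't looked at the wire format, here is what the JSON-RPC side John describes looks like, shown as Python dictionaries. The method names and envelope fields follow the MCP specification; the tool name and arguments are made up for illustration.

# The JSON-RPC 2.0 envelope MCP messages travel in, shown as Python dicts.
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_best_flight",                      # illustrative tool
        "arguments": {"destination": "Atlanta", "preference": "cheapest"},
    },
}

call_tool_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "content": [{"type": "text", "text": "Example Air EX100: $240, 320 min"}],
        "isError": False,
    },
}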
And so we started asking ourselves: can we just use MCP for everything? And we said yes, with the caveat that "everything" means everything involved in providing model context to models. We have this format where your client is sending these messages and something is responding with these messages; where that stream is going really doesn't matter. It can be in the same process, it can be in another data center, it can be through a giant pile of enterprise networking stuff. It doesn't really matter at the point where your code is interacting with it: you're just calling a connect-to-MCP function, and you have a set of tools and methods you can call. So standardizing on that seemed useful. Why standardize on anything internally? Being boring on stuff like this is good. It's not a competitive advantage to be really good at making Google Drive talk to your app; it's just a thing you need to do, not your differentiator. Having a single approach to learn as engineers makes things faster: you can spend your cycles working on interesting problems instead of trying to figure out how to plumb an integration. And if you're using the same thing everywhere, then each new integration might clean up the field a bit for the next person who comes along. It's overall a good thing in cases like this, where we're not really doing anything interesting; we're plumbing context between integrations and the things consuming the integrations. Why standardize on MCP internally? This is where I might make the argument that there's already ecosystem demand: you have to implement MCP because everyone's implementing MCP, so why do two things? It's becoming an industry standard. There's a large coalition of engineers and organizations involved in building out the standard, and all of the major AI labs are represented, so you know that as new model capabilities get developed, those patterns will be added to the protocol, because all the labs want you to use their features. So I think standardizing on MCP internally for this type of context is a good bet. And one of the things you get with MCP is that it solves problems you haven't actually run into yet. There's a bunch of stuff in the protocol that exists because there's a problem and a need, and having those solutions at hand when you run into them is really important. So, sampling: an example of where this might be valuable in your company. You might have four products that have four different billing models, for reasons, because you're building fast. You might have a bunch of different token limits; you might have different ways of tracking usage. This is really painful, because you want to write one integration service, say connecting your slides, and how do you hook the billing and the tokens up correctly? MCP already has sampling primitives, so you can build your integration and just have it send a sampling request over the stream; the other end of the pipe fulfills that request, you hook it in, and everything works great. This is a shape of problem that might take you a bunch of effort internally without this, but you already have the answer gift-wrapped for you in the protocol. And so at Anthropic, we're running into some requirements converging.
And so at Anthropic, we ran into some requirements converging. We were starting to see external remote MCP services popping up, like mcp.asana.com, which is really cool, and we wanted to be able to talk to them. Talking to them is complex because you need external network connectivity and you need authentication. There's also a proliferation of internal agents: people have started building PR-review bots and Slack-management things, lots of people have lots of ideas, and no one's sure what's going to hit, so we're having a huge explosion of LLM-backed services internally. With that explosion come a bunch of security concerns: you don't really want all of those services going out and accessing user credentials, because that becomes a nightmare; you don't want outbound external network connectivity everywhere; and auditing becomes really complex. So we looked at this problem and wanted to build our integrations once and use them anywhere. A model a mentor of mine introduced me to a while ago is the pit of success: if you make the right thing to do the easiest thing to do, everyone in your org falls into it. So we designed a piece of shared infrastructure called the MCP gateway that provides a single point of entry, and gave engineers a connect-to-MCP call that returns an MCP SDK client session. We're trying to make that as simple as possible, because people will use it if it's the easiest thing to do. We use URL-based routing to route to external servers or internal servers; it doesn't matter, it's all the same call. We handle all the credential management automatically, because you don't want to implement OAuth five times in your company. And it gives you a centralized place for rate limiting and observability. I have the obligatory diagram of a bunch of lines going in and out with a gateway box in the middle; that's the pitch: just one more box will solve all our problems. The code we have here is just a client library where you call the MCP gateway's connect-to-MCP function, passing a URL, an org ID, and an account ID. This is a bit simplified; we actually pass a signed token to authenticate, because it's accessing credentials, but this is the basic idea. Importantly, this call returns an MCP SDK object, which means that when new features get added to the protocol, you just update your MCP packages internally and you get those features across the board. The same code seamlessly connects to internal and external integrations. When it comes to transports, this part is high level and hand-wavy because everyone's setup is different. Internally within your network it really doesn't matter; you can do anything you want. The standardized transport is for connecting to external MCP servers; internally, just pick the best thing for your org. We picked WebSockets for our internal transport, and here's a quick code example. It's nothing special: we open a WebSocket, we send JSON-RPC blobs back and forth over it, and at the end we pipe those read and write streams into an MCP SDK client session and we're good to go. We've got MCP going.
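As a rough illustration of the transport side John describes, here is a sketch that opens a WebSocket, pumps JSON-RPC messages between the socket and a pair of in-memory streams, and hands those streams to the SDK's ClientSession. The gateway URL is made up, and the exact payload type the streams carry (a plain JSONRPCMessage versus a session-message wrapper) varies across MCP Python SDK versions, so this is the shape of the idea, not a drop-in client.

```python
# Sketch: bridging a raw WebSocket to the MCP SDK's ClientSession.
# Assumes ClientSession(read_stream, write_stream) accepts anyio memory streams
# carrying JSON-RPC messages; newer SDK versions wrap these in a session
# message type, so adjust for your version. The gateway URL is hypothetical.
import anyio
import websockets
from mcp import ClientSession
import mcp.types as types

async def run(url: str = "wss://mcp-gateway.internal/mcp?target=github") -> None:
    async with websockets.connect(url, subprotocols=["mcp"]) as ws:
        # in a gateway setup you would also pass auth headers when connecting
        to_session_send, to_session_recv = anyio.create_memory_object_stream(100)
        from_session_send, from_session_recv = anyio.create_memory_object_stream(100)

        async def pump_incoming() -> None:
            async for raw in ws:  # websocket -> session
                await to_session_send.send(types.JSONRPCMessage.model_validate_json(raw))

        async def pump_outgoing() -> None:
            async for msg in from_session_recv:  # session -> websocket
                await ws.send(msg.model_dump_json(by_alias=True, exclude_none=True))

        async with anyio.create_task_group() as tg:
            tg.start_soon(pump_incoming)
            tg.start_soon(pump_outgoing)
            async with ClientSession(to_session_recv, from_session_send) as session:
                await session.initialize()
                tools = await session.list_tools()
                print([t.name for t in tools.tools])
            tg.cancel_scope.cancel()  # shut the pumps down when done

anyio.run(run)
```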
You might want to do this with gRPC instead, because you want to wrap these in some multiplexed transport so you don't have to open one socket per connection. That's pretty simple too: you still end up with a read stream and a write stream at the end. You're starting to see a pattern here. You can do a Unix socket transport if you want. You could even implement the transport over IMAP, which is pretty much the same thing: dear server, I hope this finds you well, MCP request start, we're sending emails back and forth, and then we pipe them into a client session at the end. It truly doesn't matter; whatever works inside your organization is great. We set up a unified authentication model where we handle OAuth at the gateway, which means consumers don't have to worry about that complexity in their apps. We added a get-OAuth-authorization-URL function and a complete-OAuth-flow function, because you might have different endpoints; at Anthropic we have api.anthropic.com and we have claude.ai, and we might want those redirects to go back to different places. But this lives on the gateway, so it's really easy to start a new authentication. A real advantage of putting this on your gateway is that the credentials are portable. If you kick off a batch job, your users don't have to re-authenticate to it; you just call the same MCP with the internal user ID and everything gets wired up correctly. Your internal services also don't have to worry about holding tokens. So for us, a request comes in internally, hits a WebSocket connection to the MCP gateway with the auth token provided in headers, the gateway retrieves your stored credentials, you create an authenticated SDK client by passing the bearer token in the authorization header, and you're good to go. The MCP client receives a read stream and a write stream, you plumb those into your internal transport, and you're done. One thing this gives your org that's not immediately obvious but is really valuable is a central place for all the context your models are asking for and all the context flowing into your models: here's my tool definition, here's my resource management, here's my audit. And the really nice thing is that because it's MCP, all of your messages are in a standardized format, so it's really easy to hook into a stream, because every language internally sends the same standardized messages. The payoff is that the right way to do a thing becomes the easiest way to do a thing, so everyone falls into doing the right thing naturally, and you centralize at the correct layer. Solving shared problems like OAuth and external connectivity once lets you spend your time on more interesting problems that are more valuable to you, your business, and your future self. [Applause] So, what's the craziest remote MCP you've seen? Somebody shout one out. Okay, a toaster. Next up is Harald from VS Code, who will be giving us a deep dive into the mysteries of MCP and some of its hidden capabilities. Hello.
[Applause] Okay, since all the questions already got asked: who built an MCP server and it didn't work? Okay, cool. So we're here to commiserate on how to actually build with the full spec: what the hidden capabilities are, why they matter, and how they light up. I work on VS Code, so this is a biased, local-MCP-for-development take, but all of it is applicable to everything. I really love the intro to the track. MCP is moving at high velocity; there's a lot of ecosystem growth, excitement, people working together and collaborating, but there's so much more work to do, because we're so early in this ecosystem. None of this is a criticism of the spec or the ecosystem; it's just that we're early, and I want to point out where we can gain more powers. Just ten days ago, on a Friday, we had the first in-real-life gathering of the MCP steering committee during the MCP Dev Summit. That's how early it is: we hadn't even met before, just talked on Discords, and we finally met in person for the first time to talk about how to evolve the spec and the ecosystem. The basics are hopefully covered in the previous talks; this is my first MCP talk where I don't spend half the time just explaining what MCP is. There are roots on the client, there's sampling, there are prompts and tools and resources; there's a really rich ecosystem for building dynamic discovery and persistent resources and rich interactions. But there's a gap in how this is being implemented. There's an "MCP is just another API wrapper" syndrome happening, because people just want to ship. They want to build products, and they're actually building really excellent products with just tools. That creates a reinforcing loop, because once you see how MCP works, you just use the same stacks and repeat the same tools-only ecosystem. And there are technical barriers: people do this because there's missing support in the clients, the SDKs, the documentation, and the reference implementations. The clients reflect this most. If you look at the adoption data from the Model Context Protocol website, you see everybody goes for tools, because that's where the most immediate success is. And if you're honest, you can do flows similar to most of resources and prompts with just tools. VS Code did the same thing: when we launched our MCP support, two months ago now, we started with tools, and we've since added discovery and roots, because we're working toward actually reading the spec and implementing it. And I'm happy to announce that with VS Code's upcoming release, I'm going to get the version number wrong, but it's already in Insiders now, so download it, we actually have full spec support. That's what I want to talk about here: all the other things that people are not using. Yes, that's what I'm clapping for. Okay, so the message is: if you go with full MCP spec support, you can unlock the rich, stateful interactions that MCP's vision outlines for how agents should work together. Starting with the most obvious: tools.
Not going too deep here, but tools represent actions: well-defined, performing actions, and mostly an easy mapping to function calling if you're used to that. On the right you see Playwright: you can start a server and it will open the browser and take a screenshot. But tools often lead to quality problems, and we all struggle with that. Raise your hand if you've hit an error in your IDE because you couldn't add more tools, or the model ran the wrong tools because you had too many. There's research from LangChain that nicely underlines this and points out three vectors. First, too many tools, so the AI gets confused. Second, too many domains of tools: if you suddenly have different properties and instructions coming with each tool, it also gets confused, versus a pure "this is UI testing" set. And lastly, repetition: the more repetitions the AI has to do to run tools and solve a problem, the easier it is for it to get confused. So it's really quality over quantity, and clients handle that somewhat by giving you extra controls. In VS Code we added per-chat tool selection: there's a little tool picker, and you can reduce the tools to what you actually need in the moment instead of all the tools. It has nice keyboard flexibility, it's really quick to set up, and it persists for the session. We also have mentioning of tools: sometimes you say "pull this issue" and try to verb out whatever tool you're trying to invoke, so why not just reference the tool directly, ask it to fill in all the right parameters to use it properly, and then use the other tool? That's supported as well. And lastly, in this Insiders release we're shipping user-defined tool sets. That's a more reusable concept: once you know these are all the tools you need for a front-end testing flow, you put them into a tool set and say "use my front-end testing flow." So that's coming as well. Those are all user controls, but the spec actually has dynamic discovery built in, which means a server can say on the fly: actually, I'm going to give you these other tools. On the right you see a MUD MCP server; it's on GitHub, you can check it out. It starts with a chat mode I created that puts the agent into a game-master prompt, with the MUD MCP installed. With the mode active, I can go into the agent, switch to MUD, and play the game. What dynamic tool discovery does here is make the tool list aware of which room I'm in. It's a dungeon crawler: you walk from room to room, you can go east and north, you can pick up stuff, and if there's a monster, I can battle it. But the tool for battling shouldn't be there when there's no monster. Eventually I advance through the game, I finally find a goblin I can battle, the battle tool appears, and I can battle the goblin; a sketch of that idea follows below. Imagine the MCP servers you want to work on doing the same thing: this gives servers and clients quite a bit more than plain tools and actions.
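A small, deliberately SDK-agnostic sketch of the dynamic-discovery idea from the dungeon example: the advertised tool list is a function of game state, and the server nudges the client to re-list tools whenever that state changes. notify_tools_changed here is a hypothetical stand-in for however your server framework emits the tools/list_changed notification.

```python
# Conceptual sketch of dynamic tool discovery: "battle" only exists while a
# monster is in the room. `notify_tools_changed` is a hypothetical callback
# standing in for your SDK's tools/list_changed notification.
from typing import Awaitable, Callable, Optional

GAME_STATE = {"room": "entrance", "monster": None}
BASE_TOOLS = ["look", "move", "pick_up"]

def current_tools() -> list[str]:
    """What the server advertises when the client calls tools/list."""
    tools = list(BASE_TOOLS)
    if GAME_STATE["monster"]:
        tools.append("battle")  # only offered when there is something to fight
    return tools

async def enter_room(
    room: str,
    monster: Optional[str],
    notify_tools_changed: Callable[[], Awaitable[None]],
) -> None:
    GAME_STATE["room"] = room
    GAME_STATE["monster"] = monster
    await notify_tools_changed()  # client re-lists tools and sees "battle" appear or vanish
```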
Next, resources. Tools are for actions, but often you want to add context: maybe you don't want to return a giant file from your server; you want to return a reference to the file, something the LLM can follow up on or the user can act upon. The other use case is giving files to the user. So if you take a screenshot via Playwright, you want to expose it to both the LLM and the user, and resources provide that semantic layer. Think of a server that wants to understand your Python environment: it can look at your settings and how you set things up so it can customize itself, and that makes it more dynamic and stateful out of the box. Or if it can look at the actual packages and libraries you have installed, that's a great way to customize to a React setup versus a Svelte setup, really acknowledging what the user is looking at instead of constantly asking what framework they're working in. You work in my folder, so just look at it. And lastly, things like "what is that CI/CD pipeline?" are where MCP servers really shine, connecting the end to end of a developer experience, and you can read those out as resources too. Sampling. Who has heard about sampling? Who is really excited about sampling? Okay, so you understand what I mean. Sampling is one of the oddly named primitives, and if it had a better name, maybe more people would use it. But it's now implemented in Insiders, and it's so much fun to use. It allows the server to request LLM completions from the client. What I'm showing on the right is the permission dialog that pops up to allow the server to access the LLM; right now it's wired up by default to GPT-4.1. There are spec improvements in the works, like structured formatting, and a lot of ideas to make it better, but until now nobody had implemented it, so there wasn't really a need. The implementation is here now, so please use sampling. It's a nice progressive enhancement: maybe by default you return the kitchen sink, and once you have sampling you can do interesting things like summarizing resources into more tangible things, formatting a fetched website into Markdown for the LLM, or even agentic server tools that run via the LLM from the client. If we look beyond the primitives, there are a few other interesting things. So far we have roots, tools, resources, and prompts, and with dynamic discovery you can update them at any time: the client sends new roots as the VS Code workspace changes, and the server can send new tools and prompts as things change. So it's already a really dynamic environment. But there are more pain points to making these servers really powerful. One is the developer experience. Who's been struggling with working on MCP servers and debugging and logging and everything? Yeah, one hand is up; apparently it's really easy, so maybe it's not a problem. Okay. We now have dev mode in VS Code, a little dev toggle, and you immediately get the console, which works for all MCP servers, so once you hit a snag it just works. And now it's in debugging mode, with the debugger actually attached. So once I run the prompt, which is dynamically generated on the server, I can hit the breakpoint and step through it. That's usually really hard, because your server isn't owned by any process you run manually; it's owned by whatever client and host is running the MCP server. Because VS Code is both, it can put the server into debug mode and attach its debugger, and that works for Python and Node right now out of the box. Super exciting, and it has definitely changed how I work on MCP servers.
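Coming back to sampling for a moment, the other half is the client that fulfills the request. A sketch of that side with the MCP Python SDK follows; it assumes ClientSession accepts a sampling callback (the parameter name and result fields may differ by SDK version), and the server script name and model string are made up.

```python
# Sketch of the client side of sampling: the host decides which model to use
# and whether to allow the request. Assumes ClientSession's sampling callback;
# "sampling_server.py" and the model name are hypothetical placeholders.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from mcp.types import CreateMessageRequestParams, CreateMessageResult, TextContent

async def handle_sampling(context, params: CreateMessageRequestParams) -> CreateMessageResult:
    # A real client would show a permission prompt, then call its model of
    # choice with params.messages; here we return a canned completion.
    return CreateMessageResult(
        role="assistant",
        content=TextContent(type="text", text="(model output would go here)"),
        model="example-model",
    )

async def main() -> None:
    server = StdioServerParameters(command="python", args=["sampling_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write, sampling_callback=handle_sampling) as session:
            await session.initialize()
            result = await session.call_tool(
                "summarize", {"text": "MCP lets servers borrow the client's model."}
            )
            print(result.content)

asyncio.run(main())
```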
The latest spec was already called out, but I want to call it out again because it's so important that people stay on the tip of the spec, know what's coming, and understand what's in draft. Things in draft only become stable because people provide feedback that they're useful and working; if they're in draft and nobody provides feedback, they still go to stable and might need revisions later, like the auth spec did. The updated auth spec on the right gives you enterprise-grade authorization. There's a talk tomorrow about building a protected MCP server that I can highly recommend, from someone who actually worked on the auth spec, so if you want to talk to one of the people behind it and dive really deep into auth, you can do that. Streamable HTTP has been working in VS Code for two releases as well, but it's been really hard to test because there are so few servers out there. So if you work on hosting, you should be really excited about streamable HTTP: get everybody hosting your MCP servers onto it and stop using SSE. SSE-style streaming is still possible within streamable HTTP, so you get both benefits, but you avoid that really stateful churn on your servers. Next, as already mentioned, there's a community registry happening, and I think that's the other big pain point: if I build a server and nobody finds it, what is the discovery experience? Do I send JSON blobs around for people to discover my server? There's a lot of community work to make discovery easy, so a big shout-out to everybody on the steering committee, the community working groups, and everybody involved. If you want to check it out, it's at modelcontextprotocol/registry on GitHub, and it's all happening out in the open. And lastly, I'm really excited about elicitations. That's coming in the next draft spec release, and it's a way for tools to finally reach back out to the user when they need more information. Right now tools are all controlled by the LLM and get all their information from it, and when a tool actually needs concrete, specific input from the user, you have to throw the user into another chat exchange and ask for it. Why not just give them an input field to provide it directly? So again, more statefulness in the tools. On top of all that, your help is needed. Progressive enhancement in MCP is possible; I think we want more best practices out there, maybe even in the reference servers, to show it off. But everything is ready to be used now. There are clients supporting the latest spec that you can run and test in, and those clients are used by real users. As more people showcase how great these stateful servers can be and outline best practices, the interoperability gap will close and clients will catch up. It's a very fast-moving system; people complain that you shipped something two weeks after the other person, but it's all coming together, and as people use these features, learn, and bring feedback, it gets better. So make action-oriented, context-aware, semantic-aware servers using the full spec. And then contribute to the ecosystem: if you have the time, read up on some of the open RFCs I mentioned, like namespaces and search, to see what's coming; make sure they get into the SDKs you're using by following the issues; and share back your experience.
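Going back to streamable HTTP for a moment, connecting a Python client to a remote server over it is short. This assumes the MCP Python SDK ships a streamable-HTTP client helper; the import path, the values it yields, and the example URL may differ in your version, so check the SDK docs.

```python
# Minimal sketch of talking to a remote MCP server over streamable HTTP
# instead of SSE. Import path and yielded tuple follow recent MCP Python SDK
# releases; the URL is a placeholder.
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    async with streamablehttp_client("https://example.com/mcp") as (read, write, _get_session_id):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```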
I think a lot of people misunderstand how much influence they have on clients and SDKs and everything else just by filing issues and providing feedback. I help triage a lot of the MCP issues coming into VS Code; we read all of them, we learn from them, and that really drives our roadmap, and the same probably happens with every other client team out there. So make your voice heard: tell them everybody should support sampling. There's a transformative potential in MCP that we can all unlock with the spec that's already there, as the ecosystem catches up to the spec. So with that, let's go. Feel free to hit us up at the Microsoft booth; there are two VS Code people there, Tyler and Rob, you can talk to them, or talk to me, or talk to your friendly MCP steering committee members. Thank [Applause] you. Thank you, Harald, for giving us a deep dive into the lesser-known parts of MCP. I think we're now all fans of prompts, sampling, and all the other features. We've been hyping up MCP for the last couple of talks, but now I think we want to take a turn: we want to tell you a little bit about some of the difficult parts of using MCP, perhaps even a bit of a rant about MCPs. From the audience, what's the biggest pain point you've had trying to use or build MCPs? Anybody want to shout out an answer? Client support, okay. Dynamic client discovery, okay. Any more? Security, okay. Well, David from Sentry is going to dive deeper into this topic and give us some insights about what he has learned solving some of these pain points. So join us in welcoming David. Thank you. I assume everybody can hear me. Cool. All right, I see some slides. Welcome, everybody. This was a little bit last minute, so bear with me. If you don't know me, I started Sentry a long time ago. David Cramer; I'm sort of an engineer, sort of an executive, sort of a founder. I would like to think I have rational opinions, so that's mostly what this is. I don't think you're going to learn anything here. Maybe you will, I don't know. I personally think this is not that complicated; it's just big scary words. So if you do learn something, great. If you don't, maybe you walk away thinking, "yeah, I thought that's what it was." Mostly, I was asked a couple of days ago, while I snuck my way into this conference, if I could fill a slot, and filling the slot meant "come give some hot takes, maybe spice it up a little bit." So that's what we're going to do. It's not going to be too much of a rant; if you know me, I like to rant, but we'll dial it back for this one. So, what is an MCP? I've got to say, this is one of the wildest phenomena; it's like the new crypto wave or something. Everybody's like, "yeah, MCP, we don't know what it is, but we're here for it." You find a lot of opinions around how it should be and how it shouldn't be, and what I often find is that the people with those opinions have not built anything, or at least not built the thing they're talking about. I built Sentry's MCP server, mostly as a fun project, so take this for what it is. It's also Sentry's MCP server, so these are biased opinions toward what Sentry is.
If you're not familiar with Sentry, you probably should be, but we do application monitoring; we do a bunch of stuff. If you have bugs on the internet, they probably go to us. So this is in the context of a B2B SaaS business. A lot of you probably work at enterprise companies, so think about it that way. The way we think about MCP is that it is a pluggable architecture for agents. Full stop. That's it. It's pretty simple to reason about. And again, all of this is contextualized in an enterprise cloud-service kind of way. There are a lot of other ways you might adopt MCP; there are tool chains that make sense locally. We're talking about running cloud services, which is most of the industry; we're B2B, we're enterprise. I think a lot of this still applies, but take it with a grain of salt. So how do we think about MCP at Sentry specifically, because this is relevant here: we fix bugs. There are things like Cursor where you also fix bugs. What if we could all fix bugs together? Everything is contextualized in that. And there's this whole question of how do we stay relevant; that's the name of the game for every single company in the world right now: "how do we become an AI company? We too are now an AI company." But Sentry has a lot of bugs, I fix them in my editor, and wouldn't it be cool if the bugs could be inside my editor sometimes? That's a great example of where maybe an MCP is useful; at the very least, we're going to pretend it's useful. So that's the context. But it all comes back to the reason everybody's probably here: how do I become relevant, I've got an AI mandate, I've got infinite money to spend all of a sudden for some reason that didn't exist yesterday, how do we get involved? Okay, so everybody's probably at the same stage; I know how this works. All right. We built this a few months ago; we were not first to market with an MCP. There are two interfaces for MCPs: I'm going to focus on the remote interface, but there's also standard IO, which you probably know something about. I don't think standard IO is super useful for businesses like ours, and I'll talk about that. As for the analogy of why MCP is useful: this is VS Code Insiders, which you just heard from Harald about, and they do a pretty good job; they're the only ones with OAuth support that's actually useful today. Cursor promised me end of week; I don't know, hold them to that. But it works pretty well: you plug in Sentry's MCP, you can look up data from Sentry through a bunch of curated workflows, and you can maybe fix some bugs more easily than before, or at least have more fun doing it. For the sake of this talk I needed a screen grab, so last night, literally last night, I'm working on these slides, I go into VS Code and think, I'm just going to plug it in, I don't have time to fuss around if the thing's going to break. So I use VS Code and ask it to fix all my bugs for me, and immediately it does something like 20 API queries to Sentry; it probably cost me five bucks to run. But it did start fixing some bugs. I don't know if the fixes were good, mind you; they're probably garbage. But it does the thing, right? It brought context into the editor, which is what we want, and that context was provided by somebody else.
Sentry, in this case. So that's one of the interesting things we think about and why MCP is valuable to a traditional, I don't know, we're kind of an enterprise company, but really every company in the world, and that's part of why we're all hopping on it: it's pretty accessible, and that's what I'm going to talk about. It is actually super accessible. So this is how you and I see it, and why I have opinions about it now: oh, it's just an API that plugs in. We've got an API, we've got some OAuth going on; we had our own OAuth provider. A lot of you might use something like WorkOS, or pick one of these authentication services that gives you that out of the box. If you have that, you're pretty much ready to go, which is pretty cool; it's actually a fairly low-boilerplate implementation. But then you quickly learn it's not that easy. First you get into this OAuth dance: okay, we're going to do this, but it needs OAuth 2.1, and nobody in the world supports that thing. I don't know how old it is, but I had never heard of it before MCP. So there's a little complexity there, but you think, okay, it's almost there, it's OAuth, we've got that, we can plug it into our API, and you kind of get it working. In our case we used a Cloudflare shim, which basically lets us proxy our OAuth 2 API on top of Cloudflare Workers, which has a 2.1 client-registration thing. I don't know if anybody's talked about that; TL;DR, it's complicated. But it's not that complicated: this was built in a couple of days, mind you, and I'm also an executive at the company, so if I can do it, everybody can do it. But then you go through the OAuth flow and you're like, cool, but the robots don't actually know how to reason about giant JSON payloads that were not built for them. This is where I think a lot of people break down. There was a big conversation about this, and this is one of my first opinions, what I might call sense: MCP is not a thing that just sits on top of OpenAPI. You cannot just say, I've got an API, I'm going to expose all those endpoints as tools. You're going to get the worst results you can possibly imagine, and you're going to think none of it makes any sense. You have to massage everything; you have to design around the system. Generally speaking, and I'll talk a little bit about this, you need to really think about how you would use an agent today, how the models react to the context you provide them, which is what this really is for, and design a system around that. It might leverage your API, but it is not your API. And then you get past that and wire it up to things like Cursor and VS Code, and you ask, why is this breaking all the time? You can't solve for that one; you just have to wait for everybody to catch up. They're almost there. A handful of clients support native authentication now, and they're kind of stable. To VS Code's credit, it hasn't broken much recently. Cursor has broken quite a lot on me, but they're both great, don't get me wrong. Claude has support; Claude Code has sort of support, but not really, so it might work, it might not. I think the developer ecosystem in particular is much more ahead of the curve.
So if you're trying to adapt your services to third-party agents in our ecosystem, like these editors, you've probably got a good shot of it working tomorrow. If it's, I don't know, Salesforce or something, I have no idea. So you're kind of beholden to the clients and their implementations, because again, it's a plug-in architecture for agents. There are a lot of other use cases beyond third parties, but that's the focus. So I'm going to try to be constructive from here; let's see, we've got nine minutes. Just a few learnings, and I'm happy to talk more about this later, I'll be around. Somebody in this room is going to disagree with this, but you should only care about OAuth if you're a B2B SaaS company like me, and particularly you care about OAuth with remote environments, for the most part. If your question is how do I integrate my services into various agents, I want bugs to exist in Cursor, then I want to run a cloud service, and I want to run a cloud service for the exact same reasons I've always wanted to run a cloud service: I can iterate on it, I can ship fast, I can dial in security. All the advantages turn out to be exactly the same, because the technology has not changed. So if I were you, and you're not building something hyper-specific and local-device-centric, just focus on the remote MCP server, focus on the auth specification, and don't worry about the rest. The problems will solve themselves; security will solve itself, because there's a whole world of security problems and the standard IO interface is filled with most of them. I'm not going to talk about that; I'm sure there are other talks here about prompt injection, but it is very, very scary. Do not allow random MCP tools in your organization. Trust people who have earned trust. Don't download random packages off the internet; it will be a very bad time for your organization. I did mention that Claude Desktop has, I think, full OAuth support right now in production, in GA. VS Code Insiders has it. These are great because you just drop in the MCP URL and it handles everything from there. Cursor, like I said, I think this week. I don't know about anybody else; I don't pay attention much beyond that, and I think Claude Code has not, at least I've not seen anything. And then there's a long tail. So it works pretty well. There is this mcp-remote package, which is how we shipped all this stuff; it works okay, and I applaud the early adopters for getting it out, but it's not a great experience. You'll find that a lot of this is not a great user experience; it's rough, it's beta, that's fine. And this is the biggest thing, going back to the OpenAPI point: you actually have to spend the calories. You can't just say, haha, we proxied OpenAPI and exposed it as tools; that's going to do nothing. So what's the right answer here? Who knows. Our version of it, and I'll talk a little bit about why, is that we return Markdown. We've taken some API endpoints and directly translated some of the response to Markdown, but it's intentional: I want to get a bug out of Sentry, so I'm going to give you the bare essentials, in a structured way that a human can reason about, because generally speaking, if a human can reason about it, the language model can reason about it, since it's effectively pattern matching on language.
It can kind of figure out JSON here and there, but if you actually push it, you'll find it breaks all the time. So just use something like Markdown. It's not scientific; I think there's a lack of science in a lot of this. It's hard; just go with whatever works. But you have to really think about the fact that you don't control the consumer and you don't control the model, so you're designing for a least common denominator. You need to design the system as if you're providing context to an agent without knowing what the agent is doing. That's the name of the game: context. Here's an example of that, I forgot what my slides were: we give a reasonable description of each tool as the first layer of context, though sometimes you hit token limits with all this, so there are other challenges. We give a reasonable description of a tool with the hope that clients figure out how to make use of that context, so the model can call the right tool, call it when it needs to, and choose one tool over another, which is unfortunately a really hard problem for it. Mostly straightforward. Errors, same thing: you have to design the errors. They are still context, because just like a human can't figure out how to call your API, the machine also can't figure out how to call your API. In my example, I say "fix all my bugs for me," and it queries every organization in Sentry that I have access to; it's like 20 API calls when it should have been one, even with all this context. So we are a long way from this being great, but there's a glimmer there. In this case, the error says: you didn't pass the thing, or rather you passed an invalid value for the thing. Give it a real, human response. This is more important than ever, because it's not a machine reasoning about it where you can hardcode all the handling; it's abstract, and you don't know who's reasoning about it. The other big thing, and this leads into my overarching view of the world, is that you have no control, which is already a problem, and you're also passing the cost on in a lot of these cases. So you need to be mindful: another reason not to just say "here's my API, I'm going to return everything to you," because all of a sudden that tool call that could have been a dollar might be ten dollars because of the number of tokens you needed, and more importantly, it might just not work. Early on, and I don't know whether VS Code or OpenAI fixed this, or who's to blame, there was, and may still be, a limit on the number of tokens or the description lengths of tools. Makes sense, right? You want to constrain the cost of every API call, but all of a sudden you have problems again. So you have to be really thoughtful about this; it's going to evolve. And if you build one of these, it's not set-and-forget: we're still updating this thing every week, tweaking it here and there, looking at what's happening and evolving it. The sketch below illustrates the Markdown-and-errors-as-context point from a moment ago.
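A small illustrative sketch of that "output is context" idea, not Sentry's actual server: a tool that returns the bare essentials as Markdown and phrases its errors as instructions the model can act on. fetch_issue is a stand-in for the real API call, and the tool and field names are made up.

```python
# Illustrative sketch (not Sentry's implementation): shape tool output as
# context. Return essentials as Markdown; make errors read like instructions.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("bug-context")

def fetch_issue(org_slug: str, issue_id: str) -> dict:
    """Stand-in for the real HTTP call to the backing API."""
    return {
        "title": "TypeError in checkout()",
        "culprit": "cart/views.py in checkout",
        "count_24h": 132,
        "last_seen": "2025-06-04T18:22:00Z",
    }

@mcp.tool()
def get_issue(org_slug: str, issue_id: str) -> str:
    """Fetch one issue and summarize it for an agent (title, culprit, volume)."""
    if not org_slug:
        # Error text is context too: tell the model what to do differently.
        return "Error: `org_slug` is required. Call `list_organizations` first and pass one of its slugs."
    issue = fetch_issue(org_slug, issue_id)
    return (
        f"# {issue['title']}\n\n"
        f"- **Culprit:** {issue['culprit']}\n"
        f"- **Events (24h):** {issue['count_24h']}\n"
        f"- **Last seen:** {issue['last_seen']}\n"
    )

if __name__ == "__main__":
    mcp.run()
```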
But the biggest thing, and this is my takeaway, my very strong belief, is that you need to really focus on building agents. MCP is a plug-in architecture. There's a lot of value behind it, but the inherent value of a lot of what LLMs bring is this agent architecture, which, by the way, is just a service architecture with a fancy new word on it. Common-sense kind of stuff, right? We've done this at Sentry; it does not work well with MCP yet, for what it's worth. There are no streaming responses for tools yet, and that's a big problem when you think about agent-to-agent, and I don't mean that in the Google way, I mean the generalized sense of agent to agent. But it gives you control, and it's the same as all software: if you have control, you can be responsible for the success and the failure. I can be responsible for the prompt that dictates how the tool is called, I can be responsible for the result from the tool, and I can make many calls behind the scenes and wrap those up. I just get a lot more control if I pick up the cost of that agent; I even control the model. So this is my big bet, and I think this is where B2B is going to shine: when we start exposing agents through the MCP architecture, again treating MCP as a plug-in architecture. We've done that with one of ours; we keep renaming it, so bear with me, it's called Seer now. Sentry has a lot of data on what's broken in your application, and we do a really high-quality root cause analysis via an agent. We expose that root cause analysis mostly to our UI, to be fair; we also expose it through the MCP, but because MCP doesn't do streaming we have to do a polling check: start the job, then check in on it a few times, and because of the way agents work, it just gives up at some point. So it's a little complicated, but again, we're beta testing; the promise is there. When this works, I really think it's going to be the value unlock for a lot of us. MCP does a lot of things, it's an abstract protocol, but the agent analogy is really good. As an aside, all of this is open source; you can find Sentry's MCP somewhere on the internet, you'll find it on GitHub. I should say fair source; there's some complexity there. This is what the agent looks like in the UI. Check it out if you haven't; we'll be around, give me feedback. The last thing I want to part with is that this stuff is not that hard. It's quite broken all the time, but it's not that hard. Again, I built it in two days, and I've got a lot of other jobs to do at the company. You can just go build it, try it out, and learn, and all of this stuff is pretty obvious. I think the lesson we've learned at Sentry, or are still learning, I should say, is that everybody is scared of all this stuff because there are fancy new words for everything. But the fancy new words are just new words for the same things; it's just a new coat of paint. MCP is just a plug-in architecture. Agents are just services that the LLM, or MCP, calls. Actually, half of them, the tools, are just API calls with a new response format. So it's pretty accessible to do all this. There's a lot of great technology involved here; like I said, we used a lot of Cloudflare tech. We did not use Cloudflare at all before this, and then in a couple of days we were like, cool, we can shim up a thing on Workers, they've got an OAuth proxy for us, problem solved.
And this is important because we don't run WebSocket-style infrastructure, essentially; it's just not a thing we had. Unfortunately, the protocol requires something like that, which makes it a little annoying to adopt, but again, it's not that hard; it's pretty easy to adopt. Try it out. You'll probably hit a lot of bugs, but stick with it; I think this one will stick around. But I would really dial in the thinking around agents and how you're optimizing for context in the workflows you understand, for your data. With that said, I'll be around the rest of the afternoon, probably at our booth in the expo hall, if you want to come chat. Come say hi; I'm always happy to rant about other things or give you my semi-informed opinions. I'm not an AI guy, to be clear, but cool. With that, thanks, everybody, for showing up to this talk at this wild conference. I'll call it there. All right. Thank you, David. If you have any questions, make sure to catch the speakers, and talk to me if you're interested in deploying or using MCPs. We'll catch you at 2 p.m. [Music] All right, I hope everybody had a great lunch, because we're going to go right back into MCP very soon. Next up we have Samuel from Pydantic, who will be talking about how MCP is all we need. Join us in welcoming Samuel. Thank you so much. So yeah, I'm talking about "MCP is all you need." A bit about who I am before we get started: I'm best known as the creator of Pydantic, the data validation library for Python that is fairly ubiquitous, downloaded about 360 million times a month; someone pointed out to me that's roughly 140 times a second. Pydantic is used in general Python development everywhere, but also in GenAI: it's used in basically all of the SDKs and agent frameworks in Python. Pydantic became a company at the beginning of 2023, and we have built two things beyond Pydantic since then: Pydantic AI, an agent framework for Python built on the same principles as Pydantic, and Pydantic Logfire, an observability platform, which is the commercial part of what we do. I'm also a somewhat inactive co-maintainer of the MCP Python SDK. "MCP is all you need" is obviously a play on Jason Liu's talks, "Pydantic is all you need," which he gave at AI Engineer, I think, nearly two years ago, and then the second one, "Pydantic is still all you need," maybe this time last year. It has the same basic idea: people are over-complicating something that we can use a single tool for. And, similarly, the title is completely unrealistic. Of course Pydantic is not all you need, and neither is MCP for everything. But where we agree is that there are an awful lot of things MCP can do, and people sometimes over-complicate the situation by trying to come up with new ways of doing agent-to-agent communication. I'm talking here specifically about autonomous agents, code that you're writing; I'm not talking about the Claude Desktop or Cursor, Zed, Windsurf, etc. use case of coding agents. Those were what MCP was originally primarily designed for.
I don't know whether David would say that what we're doing, using MCP from Python, is a misuse; he definitely wouldn't say that, but I don't think it was the primary design use case for MCP. Two of the MCP primitives, prompts and resources, probably don't come into this use case much; they're very useful, or should be, in the Cursor-type use case, but they don't really apply to what we're talking about here. Tool calling, the third primitive, is extremely useful for what we're trying to do. Tool calling is a lot more complicated than you might at first think. A lot of people say to me about MCP, "couldn't it just be OpenAPI? Why do we need a custom protocol for this?" There are a number of reasons: dynamic tools, tools that come and go during an agent execution depending on the state of the server; logging, being able to return data to the user while the tool is still executing; sampling, which I'm going to talk about a lot today, perhaps the most confusingly named part of MCP, if not tech in general right now; and things like tracing and observability. I would also add that MCP's ability to operate as effectively a subprocess over standard in and standard out is extremely useful for lots of use cases, and OpenAPI wouldn't solve those problems. This is the prototypical image you'll see from lots of people of what MCP is all about: we have some agent, and any number of different tools we can connect to it. The point is that the agent doesn't need to be designed with those particular tools in mind, and the tools can be designed without knowing anything about the agent, and we can compose the two together, in the same way that I can use a browser and the website I'm visiting doesn't need to know anything about the browser. I know we live in a kind of monoculture of browsers now, but at least the original ideal was that we could have many different browsers all connecting over the same protocol; MCP follows the same idea. But it can get more complicated than this. We can have situations where tools within our system are themselves agents, doing agentic things, and need access to an LLM. They can in turn connect to other tools over MCP, or connect to tools directly. This works nicely, it's elegant, but there's a problem: every single agent in our system needs access to an LLM, so we need to configure that and work out resources for it. And if we're using remote MCP servers, and that remote MCP server needs to use an LLM, now it's worried about what that's going to cost. What if the remote agent operating as a tool could effectively piggyback off the model that the original agent already has access to? That's what sampling gives us. As I say, I think sampling is somewhat... that's not making the slide any bigger, unfortunately; is that clear on screen? Maybe I'll make it bigger like that. Sampling is the idea that, within the MCP protocol, the server can effectively make a request back through the client to the LLM.
In this case, the client makes a request and starts some sort of agentic query, makes a call to the LLM, and the LLM comes back and says: I want to call that particular tool, which is on an MCP server. The client takes care of making that call to the MCP server. The MCP server now says: hey, I actually need an LLM to answer this question. So that gets sent back to the client, the client proxies the request to the LLM, receives the response, sends it on to the MCP server, and the MCP server returns, and we continue on our way. Sampling is very powerful but not that widely supported at the moment. I'm going to demo it today with Pydantic AI, where we have support for sampling; I'll be honest, it's a PR right now, but it will be merged soon. We have support for sampling both as the client, knowing how to proxy those LLM calls, and as the server, basically being able to use the MCP client as the LLM. This example is, like all examples, trivialized and simplified to fit on screen. The idea is that we're building a research agent that's going to research open source packages or libraries for us, and we've implemented one of the many tools you would in fact need for this. I'll switch to code now and show you that one tool; I'm in completely the wrong file, here we are. This tool queries BigQuery's public dataset for PyPI to get numbers about the downloads of a particular package. This is pretty standard Pydantic AI code: we've configured Logfire, which I'll show you in a moment; we have the dependencies that the agent has access to while it's running; we've said we can do some retries, so if the LLM returns the wrong data we can send a retry; and we have a big system prompt where we basically give it the schema of the table, tell it what to do, give it a few examples, yada yada. But then we get to what is probably the powerful bit. As an output validator, we first strip out Markdown code fences from the SQL if they're there, then check that the table name it's querying against is right, and tell it off if it isn't; then we run the query, and critically, if the query fails, we raise ModelRetry so that Pydantic AI goes back to the LLM and asks it to attempt the query again. The other thing we're doing throughout, as you'll see here, is calling log on the MCP context. When we defined the deps type, we said it was going to be an instance of this MCP context, which is what we get when the MCP server invokes the tool. So we're providing a type-safe way, in this case inside the agent's output validator, though it could be in a tool call if you wanted, to access that context: the type checker knows the type is the MCP context, so we know the log function's signature and can make this log call. The point is that this returns to the client, and ultimately to the user watching, before the tool has completed, so you can give progress updates as you go: don't worry, I'm still going, here's exactly what's happening. MCP also has a concept of progress, which I'm not using here, but you can imagine that also being valuable: if you knew how far through the query you were, you could show a progress update. The original principle of logging like this is that you have a Cursor-style agent running and you want to give updates to the user before it finishes. But you could also imagine it being useful if this research agent were running as a web application and you wanted to show the user what was going on; deep research might take minutes to run, and we can emit these logs while the tool call is still executing. And then we take the output, turn it into a list of dicts, and format it as XML; models are very good at reviewing XML-ish data, so we return the query results in that form, which the LLM will be good at interpreting.
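A compressed sketch of the inner-agent pattern just described, assuming Pydantic AI's Agent, RunContext, and ModelRetry (the validator decorator has been renamed across releases). run_bigquery and rows_to_xml are hypothetical helpers standing in for the real BigQuery call and the XML formatting step.

```python
# Sketch of the SQL-writing inner agent: the output validator strips Markdown
# fences, sanity-checks the table, runs the query, and raises ModelRetry so
# the LLM can correct its own SQL. `run_bigquery` and `rows_to_xml` are
# hypothetical stand-ins.
from dataclasses import dataclass
from typing import Any
from pydantic_ai import Agent, ModelRetry, RunContext

@dataclass
class Deps:
    mcp_context: Any  # the MCP Context handed in from the tool call, used for log-style updates

sql_agent = Agent(
    "openai:gpt-4o",
    deps_type=Deps,
    retries=3,
    system_prompt="Write BigQuery SQL against bigquery-public-data.pypi.file_downloads; return only SQL.",
)

@sql_agent.output_validator  # called result_validator in older Pydantic AI releases
async def validate_sql(ctx: RunContext[Deps], sql: str) -> str:
    sql = sql.strip().removeprefix("```sql").removesuffix("```").strip()
    if "bigquery-public-data.pypi.file_downloads" not in sql:
        raise ModelRetry("Query the bigquery-public-data.pypi.file_downloads table.")
    try:
        rows = run_bigquery(sql)  # hypothetical helper; raises on invalid SQL
    except Exception as exc:
        raise ModelRetry(f"Query failed: {exc}. Fix the SQL and try again.") from exc
    return rows_to_xml(rows)  # hypothetical helper; models read XML-ish output well
```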
Now we get to the MCP bit. In this code we're setting up an MCP server using FastMCP; there are two versions of FastMCP right now, and confusingly, this is the one from inside the MCP SDK. We're registering one tool here, pypi_downloads, and the docstring of that function becomes the description on the tool that is ultimately fed to the LLM that chooses to call it, and we pass in the user's question. One important thing to say here: of course, you could set this up to generate the SQL inside your central agent; you could include all of the SQL instructions within the description of the tool. Models don't seem to like that much data inside a tool description, but more to the point, you would blow up the context window of your main agent if you shipped all of this context on how to make these queries into it; that's overhead on every call to that agent regardless of whether you're going to call this particular tool. So doing the inference inside the tool is a powerful way of effectively limiting the context window of the main running agent. Then we just return the output, which is the string value returned from here, and run the MCP server; by default it runs over standard IO. Then we come to our main application. Here we have the definition of our agent, and you can see we've defined one MCP server that just runs the script I showed you, the PyPI MCP server. This agent acts as the client and has that registered as a tool it can call. I'm also giving it the current date, so it doesn't assume it's 2023, as models often do.
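And a rough sketch of the outer-agent wiring: it knows nothing about SQL, it just has the MCP server above registered over stdio. Module, parameter, and attribute names follow recent Pydantic AI releases and have shifted between versions; the server script filename is a placeholder.

```python
# Sketch of the outer research agent with the stdio MCP server registered as a
# toolset. Names follow recent Pydantic AI releases (check your version);
# "pypi_mcp_server.py" is a placeholder for the FastMCP script above.
import asyncio
from datetime import date
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

server = MCPServerStdio(command="python", args=["pypi_mcp_server.py"])

agent = Agent(
    "openai:gpt-4o",
    mcp_servers=[server],
    system_prompt=f"You research open source packages. Today is {date.today():%Y-%m-%d}.",
)

async def main() -> None:
    async with agent.run_mcp_servers():  # starts the stdio subprocess, stops it on exit
        result = await agent.run("How many downloads has pydantic had this year?")
        print(result.output)  # .data on older releases

asyncio.run(main())
```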
And now we can run our main agent and ask it, for example, how many downloads Pydantic has had this year. I'm going to be brave, run it, and see what happens; given the internet here, I have medium hope. You'll see I haven't talked much about the observability from Logfire, but you'll see it in a moment. And we've immediately got a timeout. This is great; I might have to do a lot of ad-libbing if this is how it's going to perform. I'll try again... and it's timed out straight away again. Am I on the wrong network? I'm on the speaker network. I don't know why we're getting an immediate timeout; I'll try a couple more times and see whether we get a bit luckier. We're getting an immediate timeout from the model, and I think Logfire as well; everything is failing. I'll try switching networks, give me one minute; I've got four minutes, we'll see how we get on. Let me try the wired one. It was working so well just before we started. Okay, now on the other network, try running this one. Oh, we're having luck; don't clap too early, it might still fail. And it has succeeded, and it has told us we had something like 1.6 billion downloads this year. But probably more interesting is what that looks like in Logfire. Is it going to come through to Logfire, or are we having a failure there as well? I'll admit this is the run from just before I came on stage, but it would look exactly the same. I'm not going to talk too much about how observability or tracing works within MCP, because I know there's a talk coming up directly after me on exactly that, so think of this as a spoiler for what's to come. But you can see we run our outer agent; it calls GPT-4o, which decides, sure enough, I'm going to call this tool. It doesn't need to think about generating the SQL; it just gives a natural-language description of the query we're trying to make. Then, and this is the MCP client you can see here, the MCP client calls into the MCP server, which runs a different Pydantic AI agent, which makes a call to an LLM, which happens by proxying it through the client. So that's where you can see the trace going client, server, client, server. If you look at the top-level exchange with the model, you'll see the ultimate return from running the query was this XML-ish data, and the LLM was able to turn that into a human description of what was going on. The other interesting thing is that we can look at the agent call inside the MCP server and see the actual SQL it wrote, and confirm that it indeed looks correct. I'm going to leave it there and say thank you very much. We're at the Pydantic booth, so if anyone has questions about this, or wants to see it fail in numerous other exciting ways, I'm very happy to talk; come and say hi. All right, thank you, Samuel, for the presentation; it's always impressive when a live demo works on stage. So, how many of you have run into issues when using MCP but couldn't figure out what happened? Any raise of hands? Okay, a few people. Our next speakers will be talking about observability in MCPs; hopefully, with the right observability and logging, we can all figure out what's happening under the hood with MCP and improve it. So join me in welcoming Alex from Weights & Biases and Ben from Dylibso to talk about observability. Thanks, Henry. [Applause] Hey folks, my name is Alex Volkov; I'm an AI evangelist with Weights & Biases. I'm Benjamin Eckel; I'm co-founder and CTO of Dylibso.
We're the creators of mcp.run. All right. And we're here to talk to you about MCP observability. Hey Ben, I wanted to ask you a question, as somebody who worked at Datadog before and somebody who runs multiple MCP servers and clients in production, about something that happened in my agent in production the other day. Okay. Yeah, I mean, we've been running MCP clients and servers in production since the beginning. But wait, aren't you working at an observability company, Weights & Biases, and don't you work on, what's it called, Weave? Yep, that's true. I work on Weave. But since I started adding powers to my agent via MCP, all that observability I'm used to from having my own code run end to end has gone a little bit dark. Gotcha. So this is what we're here to talk to you about. The rise of MCP is creating an observability blind spot. As AI agents become more prevalent, the problem compounds: the more tools they get via MCP, the less developers can know about the end-to-end happenings within their agent. Yeah. So on mcp.run we're running both clients and servers, and because it's a new ecosystem we've had to cobble together a lot of our own ways to do observability. And looking around, it seems like everyone is doing this in isolation, solving the same problems. So we wanted to bring the community together on this issue, and today we're going to talk about the state of observability in the MCP ecosystem. Yep. So why do we care about this, and why do we think you should care? If you don't have the ability to quickly understand why things went wrong in production, where they went wrong and how, your ability to respond quickly is greatly diminished. We both build tools that need MCP observability, we both support MCP, and we both care deeply about developer experience. Yeah, it's really important to me because enterprise engineering teams don't ship something to production unless they know for sure that they're going to be able to identify security and reliability problems before their customers do. That's why they invest a ton of money in observability platforms. So if you're going to ship MCP into these production environments, you must seamlessly integrate with those observability platforms. Yep. So because we care deeply about developer experience at W&B Weave, I'm happy to announce here on stage that Weave now supports MCP. Yay. As long as you're the developer of both the client and the server, all you need to do is set this MCP trace list-operations environment variable on your client and server, and we'll show you the list-tools calls and the duration of your MCP calls. This currently works with our Python-based clients. And this is how it looks, super quick. With the red arrows you can see the client traces, for example, and with the blue arrows we're pointing to the calculate-BMI tool and the other tool. And that's it. Observability solved, right? Let's get off the stage. We're done. Wait a second. So what about this calculate-BMI tool, this MCP server? Why can't I see into that? Yeah, we're working on this. Also, this seems specific to Weave, right? Is there not a vendor-neutral, standardized way to do this? Yeah, that's right.
This is a bespoke integration that we built into Weave, into our SDKs in Python. And while our developers have been building this integration within our MCP tooling, I was advocating internally and externally that we should align with the open nature of MCP as a concept, and I created Observable Tools. Maybe some of you have seen this. It's a manifesto to drive a conversation that this is a problem that needs solving, between observability providers such as us and other folks who have been on stage before and are going to be on the evals track tomorrow, to do observability in a vendor-neutral and standardized way. And while working on Observable Tools, I did some searching and realized that a vendor-neutral, scalable way to add observability already exists, and there could be a great way to marry the two open protocols together. Yeah, exactly. Fortunately, MCP-powered agents are really just another distributed system, and we've been observing those for decades. OpenTelemetry is just the way we've settled on doing that. We're going to talk about OTel a little bit. If you're not familiar with it, we need to learn about a few primitives first. The main primitive is the trace. A trace is kind of like an atomic operation in your system. It's made up of a tree-like structure of steps that we call spans, and a span represents the duration and some arbitrary metadata for each step. What that step is exactly is completely up to you to define: it can be as high level as an HTTP request, or as low level as a tiny little function call. Here's an example of a checkout experience, an API for a checkout. The size and position of each of these spans correspond to how long it took and where it sits in the call graph, respectively. And just from this data you can tell a lot about a system and how to observe it. The other primitive you need to be aware of is sinks. A sink is kind of like a centralized database where all your telemetry goes, but often they come in the form of a whole platform with a UI, dashboards, alerting, monitoring, and all those things. So there are a lot of logos here, Ben. Basically, a sink is an open, standard way for collectors to receive those spans: as long as the developer instrumented their application code in the standard, spec'd way, everybody can receive those spans in the same unified way. Right, exactly. If you squint, it's just a bunch of databases that all support the same schema and wire protocol, and you could switch them out without having to change much of your code, or even change the code at all; it could be just config. Right. By the way, tools like W&B Weave, and Samuel from Logfire who was here before, and some other friends, have all switched to support OTel as well. OpenTelemetry is becoming this global standard. Great. Another great thing about having a centralized sink is the last concept: distributed tracing. Going back to our checkout endpoint, if the fraud service sends its spans to the same sink, then we can stitch the traces back together and show the whole context. So maybe you're starting to see where the MCP server stuff comes in here. Yeah. So, hey Ben, if it's possible via the integration into the open protocol, what if I want to use MCP servers that other people host, like GitHub, like Stripe, like other folks?
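To make the trace, span, and sink vocabulary concrete, here is a minimal OpenTelemetry sketch in Python. The span names are invented for the checkout example, and a real setup would export to an OTLP sink rather than the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider; a console exporter stands in for a real sink here.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

# One trace ("POST /checkout") made of nested spans, mirroring the call graph on the slide.
with tracer.start_as_current_span("POST /checkout") as span:
    span.set_attribute("http.request.method", "POST")
    with tracer.start_as_current_span("fraud-check"):
        pass  # call out to the fraud service
    with tracer.start_as_current_span("charge-card"):
        pass  # call the payment provider
```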
Yeah, it's a good question. So with MCP-enabled agents, or really just any distributed system, there are two scenarios: when the client and server are in different domains, and when they're in the same domain. And by domain here I don't necessarily mean the literal definition; I mean the administrative domain of control. Do you own this MCP server? Do you own this MCP client, or is it a third-party thing? So your GitHub and Stripe example is a great example of the different-domain scenario. This is a trace of an agent executing the prompt "read and summarize the top article on Hacker News." It's going to reach out to this remote fetch server to read Hacker News, but it appears to us in the trace as a single service span, because it runs outside of our domain of control; it's still a black box to us. But suppose we do own the server, maybe it's running in a different data center than the client. How do we actually get the whole context? It's pretty simple: with distributed tracing and context propagation, we can have the remote fetch server send its spans to the same sink as the client, and the sink will stitch the missing parts of the trace back together for us. In this graphic you can see that we can now break into that fetch server and see what it's doing: it's making an HTTP request that takes roughly 350 milliseconds and then doing a little crunching to create some markdown. Okay, so that's great in theory, and we could have a whole hour talking about OTel, not that we've got an hour. But how do we actually marry those two protocols together? Is there a standard way? Did the MCP spec folks provide a way to do observability? Not quite. It was pretty tricky to get working; it does work today, but it required a little more work than it should have. In order to do this we need to, as I said, propagate the trace context from the client to the server. So here's a TypeScript example: when we call a tool in the client, we extract our current span context and pass it along to the server, and we achieve this by basically shuttling the data through the protocol's _meta payload. Now that we're inside the server, this would be in the fetch server, we can pull that trace context out, inherit it as our current span, and then when we send our spans off to the sink, it's as if they came from that parent span, and the sink can stitch it back together. Man, this is awesome. So you basically used an undocumented property of the payload that gets sent between clients and servers to pass along the data that OTel needs to connect these things together, right? Yeah, sort of. I kind of had to abuse the lower-level interface reserved for the protocol, but a higher-level way should be provided through tooling, and that's something we'll talk about a little later in the talk. Yep. And by the way, this is not just a screenshot; this is a working demo. It's a lot more code than what I showed on the slide, so if you want to see how this works and adapt it for your needs, go check out this GitHub link. And I think actually you did that to get it to work with Weave, right? Yeah.
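Here is a rough Python equivalent of the propagation trick described above (the speaker's example was TypeScript). How you attach the carrier to the MCP request's _meta payload depends on your SDK version, so the commented-out call and the request_meta parameter below are assumptions, not a documented API.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("mcp-demo")

# Client side: serialize the current trace context into a plain dict...
carrier: dict[str, str] = {}
inject(carrier)  # writes the W3C 'traceparent' entry into the dict
# ...and shuttle it to the server inside the tool call's _meta payload, e.g.:
# result = await session.call_tool("fetch", arguments={"url": url}, meta=carrier)  # assumed kwarg


# Server side (for example, inside the fetch server's tool handler):
def handle_tool_call(arguments: dict, request_meta: dict) -> str:
    parent_ctx = extract(request_meta)  # rebuild the caller's trace context
    with tracer.start_as_current_span("fetch-tool", context=parent_ctx):
        # Spans emitted here inherit the client's trace, so the sink can
        # stitch both halves back into one distributed trace.
        return "..."
```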
So now that we know how to pass context, after you showed me the way, let's see how amazing this solution actually is in practice. The thing I showed you before was a bespoke solution baked into our Python SDK for Weave. The huge benefit of MCP generally, not only for observability, is that servers and clients don't have to run in the same environment, share the same code, or be written in the same programming language. So while we were working on the Python SDK, you built an agent in TypeScript, and because W&B Weave supports OTel, OpenTelemetry, an open protocol, it took me a few minutes, without changing much code, to send those traces into Weave from a TypeScript agent rather than a Python one. Here you can see the client traces in green, and then the server traces actually show what happens within those calls on the server side as well. Yeah, it's really cool. So how did you actually get the traces into Weave? This is very simple, way simpler than before. We just define W&B as the OTLP endpoint, the standard you showed me around, and then folks can send their traces to W&B's OTel endpoint. All you need to do in addition is authorize: add authorization headers and specify which project you want the traces to go into. Cool. Yep. So while we were working on this observability talk, I had a magic MCP moment that I wanted to share with everybody. Yeah. So I used Claude Opus 4, which just came out, to Weave-ify the agent you built and add this MCP observability, and it's going to get a little meta, so stay with us: W&B Weave also has an MCP server. Okay, what does it do? We have an MCP server that lets your agents or chats talk to your traces, see the data, and summarize it for you. Okay. So we have this MCP server configured in my Windsurf and Claude Code, and Opus 4 was able to use it to work through the problem. Here you see an example: the agent basically started working on your code, then decided, okay, I'm going to run the code, and then said, okay, I'm going to go and actually see if the traces showed up in Weave. Then it noticed that they showed up, but incorrectly: some input or output, a specific parameter it needed, it didn't know how to set; it wasn't part of the documentation. And the next moment just absolutely blew my mind. Opus 4 discovered that our MCP server exposes a support bot, essentially another agent, decided to write a query for it, received the right information after a while, acted upon that information, learned how to fix the thing it needed to fix, fixed it, and then went back to check whether the fix was correct. So my coding agent talked to another agent, via a support bot over MCP, that it discovered on its own. I didn't even know that ability existed. Things got a little bit meta, and my head was spinning while all this happened. I didn't touch the keyboard once. That's awesome. Yeah, it's pretty meta. Before we go, I also wanted to take a moment for an announcement. mcp.run will also be exporting telemetry to OTel-compatible sinks. As I mentioned before, we run both servers and clients.
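The setup they describe is roughly the standard OTLP-over-HTTP configuration. A sketch follows, with the endpoint and header names as placeholders rather than W&B's actual values.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://<otel-endpoint>/v1/traces",            # placeholder for the backend's OTLP URL
    headers={
        "Authorization": "Basic <base64-encoded api key>",   # the auth scheme varies by backend
        "project_id": "<entity/project>",                    # hypothetical header naming the target project
    },
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# From here, any instrumented code (a TypeScript agent or a Python one, each
# with its own exporter) sends spans to the same sink and gets stitched together.
```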
So for servers we have this concept called profiles, which lets you slice and dice multiple MCP servers into one single virtual server. And we also have an MCP client called Tasks; this is a single-prompt agent that can be triggered via a URL or a schedule, and it marries nicely with the idea of profiles. Soon you'll be able to get telemetry out of both of these, and hopefully we'll connect up to Weights & Biases and have a little party. Yeah, you'll be able to send those to Weave straight from mcp.run. Okay, so to recap: observability is here in MCP today, but it's not evenly distributed. OTel should get you most of the way there, but the community needs to come together to create tooling and conventions to make it smoother. You shouldn't need to be an expert in observability to get this stuff working. So how do you get involved? For AI engineers: start thinking about observability in your MCP tooling and whether you're getting end-to-end observability across your execution chain. For tool builders and platform providers: we should join up and work on higher-level SDKs. Arize's OpenInference, for example, is a great start, and all of us should help with instrumentation for clients that use bespoke SDKs, and work on conventions together. Ben, can you explain semantic conventions super quick? Yeah, sure. As we learned earlier, spans carry user-defined attributes, right? So if they're user-defined, how does the sink know that a span is actually, say, an HTTP request with a 200 status code, or that it's an MCP tool call that errored? That's where semantic conventions come in, and you can be a part of defining the conventions for agents that all observability platforms agree on. If you're interested in this, I'd suggest checking out the GenAI semantic conventions effort by the OTel team. And lastly, for platform builders such as mcp.run: go add OTel support, help review RFCs, and, finally, just come talk to us about ideas, because everything is coming together, everything is so new and fresh, and we don't really know exactly what to do yet. There's an additional track here at AI Engineer called the hallway track, and I've learned more about the things we've been talking about out there, by talking to the people who implement this, than I learned while preparing for the talk. It's quite incredible. So, yeah. Again, I'm Ben. My call to action is just to go check out mcp.run. You can get a free account; try it out. That's it. And I'm Alex. Check out W&B Weave's MCP and OTel support to learn how to trace MCP with OTel. I also started the Observable Tools initiative, and I would love for you to check out the manifesto, see if it resonates with you, and join forces to talk about observability. And please visit us at the booth; we have some very interesting surprises for you, including a robotic dog right here that's observable. I also run the ThursdAI podcast, and I want to send swyx a huge shout-out for giving me the support to show up here. If you're interested in AI news, we're going to record an episode tomorrow. That's it. Thank you so much. Thank you, Alex and Ben. So, today we have covered a lot of ground: we talked about the origins of MCP, MCP spec details, and observability.
It seems like AI agents are going to be doing tasks for us all the time now, autonomously. But one thing that's perhaps missing is the question of how agents would pay on our behalf, and which agents we should be using. So our next speaker will be touching on this topic. We have Jan from Apify to talk about the agent economy. Take it away. So let me start with a question: how does intelligence emerge in biological systems? Well, it's through neurons, right? When neurons are born, they're just individual cells, but over time they grow their axons and dendrites, establish connections with other neurons, and learn how to communicate in order to pursue their own interests, basically to get nutrients and so on. Over time they learn how to communicate with each other and with other cells to get nutrients and thrive, and this collective behavior, if you zoom out and look at a really large number of them, is something we call intelligence. It's emergent behavior of smaller individual units that pursue their own interests. So how does intelligence emerge in markets? People always talk about markets as if "the market thinks this" or "the market reacted to that," and in some ways markets are more intelligent than the individual participants of the market. There are mutual interactions between individual members of the market, who pursue their own interests, communicate, and establish new interactions with others, and some sort of collective intelligence, bigger than the sum of its parts, emerges. And, oh, not sure what happened here, sorry, we skipped quite a few slides there. All right, let's try again. So how does intelligence emerge in companies? Well, this one is provocative: through Slack, right? People interact and pursue their own interests in the company, and altogether the company sometimes becomes more intelligent than its individual employees. So this leads to my final question: how does, or how will, general intelligence emerge in computing systems? There is a lot of talk about AGI and ever-larger models exhibiting superintelligent behavior, but in my opinion, general intelligence will actually emerge through the interaction of multiple entities, call them agents, multiple models pursuing their own goals, interacting with each other, and altogether exhibiting something we can call general intelligence. And thanks to MCP, we finally have the missing piece that allows agents to communicate with each other and really create a fabric, or agentic mesh, where they can talk together. So, hello everyone. My name is Jan Čurn. I'm the founder of Apify, and I'm going to talk about the rise of the agentic economy on the shoulders of MCP: basically, an economy where agents can find counterparts to interact with and purchase services from businesses, tools, or other agents, so B2A and A2A. All right. So before I start, let me quickly introduce Apify. Apify is a marketplace of 5,000 tools called actors, and historically we come from the web scraping industry.
So most of these actors are data extraction tools that let you get data from social media, from search engines, data for AI and for building RAG pipelines, data from the web for lead generation, and so on. But there are also other tools, like data processing tools. Altogether there are about 5,000 of them; some are built by Apify, and some are built by our community of creators, who actually make money on them. So it's a marketplace of software creators, if you will. Actors are self-contained pieces of software based on Docker, with well-defined input and output, and they represent a new way to ship software, publish it, and integrate it into other systems. For example, the Google Maps Scraper is a quite popular actor from our store; it can extract data from Google Maps, more data than the Google Places API provides. There is the creator of the actor, a description, various stats, and so on, everything you would expect from a normal marketplace. And thanks to the way actors are built, it's super easy to integrate them from other systems. For example, we have SDKs for TypeScript and Python, an OpenAPI spec, and a CLI, meaning you can call them from the terminal, and that's only possible because they are well-defined units of software with input and output. We also have integrations with workflow automation tools like Make, Zapier, Clay, and many others, to make it really easy to call actors from those systems. But obviously, now we also have an MCP integration, which makes it possible to call actors from AI agents or AI workflows. The way it works is that the agent just needs an API key, an account on Apify, and then, through our MCP server, it can interact with or call any of those 5,000 actors on our marketplace. This only became possible thanks to what I'd call the killer feature of MCP, which is tool discovery. Not many clients support this yet, but just today I saw that VS Code added support for it, and just two days ago Claude Desktop added support for tool discovery. Basically, the client connects to the MCP server and dynamically discovers tools to use and interact with, based on the workflow. We have 5,000 tools in our store, and there is simply no way we could publish all of them through OpenAPI, because the context would be too large, and the more tools you have, the riskier the result. So we really want to provide tools only as needed, and that is only possible through tool discovery, which I think is the main thing that will make MCP a huge differentiator from OpenAPI, for example. So MCP quickly became a standard for agentic interaction. This is Google Trends data showing that MCP is basically dominating the space compared to OpenAPI or A2A from Google, and I think MCP has already become the standard for agentic interaction.
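As a concrete illustration of calling an actor programmatically, here is a short sketch using the Apify Python client. The actor ID and the input fields are illustrative, not a statement about any particular actor's schema.

```python
from apify_client import ApifyClient  # pip install apify-client

client = ApifyClient("<APIFY_API_TOKEN>")

# Run an actor by ID and wait for it to finish (actor ID and input are illustrative).
run = client.actor("apify/rag-web-browser").call(
    run_input={"query": "AI Engineer World's Fair venue"}
)

# Each run writes its results to a default dataset you can page through.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```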
And it became so popular that there are now many different registries of MCP servers; our friends at Mastra even created a registry of MCP server registries, just to make sense of it all. And obviously Anthropic is also working on their own registry, and I think Google's A2A has a DNS-based protocol, with a well-known agents JSON file, to publish services through DNS. So there are many servers you can now use from your agents. Does that mean that, with so many tools supporting MCP, agents can discover and access any of them on their own? Well, not really, because to use those services your agents still need API tokens for them. Even if you use Zapier MCP, which provides access to the 5,000 apps they have in their marketplace, you still need to connect those individual apps to your services, GitHub or Slack or whatever. Zapier on its own is not able to provide access to the third-party services; you as a user still need to facilitate that. So that means agents are not able to find counterparts, other agents or other tools, to interact with on their own. They still depend on the human developer who built the system and gave those agents access to different tools. And if those agents are to replace all the people and all the jobs, they need to be able to find services to interact with. It's a basic thing that any of us can do: find a service and purchase it.
So I argue that unless agents are able to do that, unless agents can purchase services, we will not be able to reach some higher level of intelligence in these agentic systems and behaviors. So how can we solve this problem? The first, naive approach would be to let the agents subscribe themselves to the target services. Agents could have an email, maybe a credit card; they could fill in the subscription flow, maybe solve the CAPTCHA, create an account, and so on. But you can see it's not very practical: they might also need a phone number, and quite often the services require a real person behind the account. So this wouldn't really work. The second solution is a central identity and payments provider. There are a couple of companies pursuing this: there would be a central authority where you can load money, and then the agents can use that to buy services and provide their identity. For example, Coinbase is now pushing their x402 standard, I think Stripe is working on this, and Mastercard and Visa too. I think this is going to happen eventually, but launching a new payment system is extremely complicated, because you're facing the chicken-and-egg problem of marketplaces. I think PayPal had to pay something like $100 million per month just to buy the market, and launching credit cards in the 1970s was an incredible challenge, because nobody was accepting those cards, so why would people use them? So I think this will happen, but it will be a long process. So let me offer a third approach: a centralized marketplace of MCP services, a store, where you just need one API token, one authentication, one account to get access to all the other services. It works like this: the developers who publish these tools, these actors, provide their own credit card and their own account with the third-party service, then publish the actor and add monetization to it, that is, how much it costs to call the service. They are the owner of the service, they publish it on our marketplace, and suddenly it becomes available to the whole ecosystem of tools. This way we can scale rapidly, even without the target services knowing. The actor can run the code itself, or wrap an external API, or just publish an external MCP server, because MCP servers can actually be nested: you can have one parent server that exposes the actions or tools of nested MCP servers. That's another cool feature of MCP; you can really build this sort of ecosystem if you can facilitate the payments and monetization. So the actor charges the user, its developer gets the money and pays for the external service, and anyone can publish such an actor, even without the target service knowing. Right. So, time for a demo. It's not a live demo, because the internet is super flaky here.
So what you can see here is Claude Desktop with access to the Apify MCP server; there are 18 tools available now, and I'm asking: what is the venue of the AI Engineer World's Fair in San Francisco? If possible, use actors. You can see it searches the actors for a tool that can answer this question. It finds a tool, an actor called RAG Web Browser, and calls it. It's basically a Google search plus page fetching: it asks the query, "what is the venue" and so on, and then parses the resulting page. We can see it found the San Francisco Marriott Marquis, which seems correct. So now let's use an actor for scraping Twitter. This actor is not available in the context, so the agent doesn't know how to use it. It searches actors in our store and finds an actor that can scrape Twitter, and then it calls add-actor, which is a tool that adds a new tool to the context. Claude is very verbose, describing a lot of things about it. And there is still a small bug in Claude Desktop where you need to disable and re-enable a tool so that the tool list refreshes and the new tools become available; I'm sure that's going to be fixed in the next release. Now let's use that actor to get the last tweet from the AI Engineer conference. It calls the actor on Apify; it knows the Twitter handle, probably from the website. And you can see it found the result: the last tweet from this morning was something about workshops, which seems about right. So now what? We've seen how we can use existing tools from our store. But take one of our competitors, a company called Browserbase; hey Paul, if you're here. They certainly haven't published an actor in our store, but we did. We created an account on Browserbase, added our API token there, and published their MCP server on our store, basically without them even knowing. And now anybody can use Browserbase MCP through Apify's ecosystem, without them having to do anything or know about it. So now let's use Browserbase to fill in the email subscription form on the AI Engineer website, filling in the email jan@apify.com, and let's see what happens. You'll see that the agent calls Browserbase MCP through an actor published by us on the Apify Store and performs the actions on the web. This way we can easily bring a lot of existing MCP servers to our store and expand the ecosystem rapidly, without having to ask for the cooperation of third parties, and that's what we're doing now; we want to scale this marketplace rapidly. Okay, so now it's evaluating the screenshots, looking for the field and so on, and eventually it will manage to fill in the form and succeed in the task. I can skip ahead to save time; it takes some time for the agent to find the form and so on. But yes, it succeeded; it completed the email subscription. And this way you can see that you can plug our ecosystem of actors into any AI agent that supports tool discovery. Right. All right.
And so this means anyone can now publish tools or agents on the Apify Store and monetize them, and immediately get access to all the AI clients that have already integrated Apify and the whole ecosystem of tools. And people can actually make money on it: just last month we paid more than a quarter of a million dollars to our creators, and this number is growing rapidly. Overall, the actors generate more than one and a half million dollars per month, and we have about 1 million monthly visitors to the whole ecosystem. Now we're really in the process of scaling this ecosystem, so if you're looking for ways to monetize your tools or agents, talk to us, publish your actor on the store, and get access to this ecosystem of developers and this visibility. There are some open questions that obviously remain. Will this autonomous tool discovery provide real value? Everybody who builds agentic systems knows that making sure the system works as expected is tricky, even when the toolset is fixed. So if we add this variable, agents discovering new tools, will it actually work? Currently it might be a bit flaky; I think we're still fairly early. But as the models get better, I think even with discovery the agents will be able to provide valuable and reliable results. That remains to be seen, but I'm optimistic that as LLMs get better, tool discovery will actually provide real value. Then there's the big question of how agents can trust tools, or each other. We know that you only interact with people you trust; how can agents do that? We'll see. And can autonomous agent interaction enable AGI? Well, we'll see. Thank you very much for your attention, and feel free to try it: mcp.apify.com. [Applause] Thank you, Jan. And that about wraps up our MCP track for today. Thank you all for coming. Once again, my name is Henry; I'm the founder of Smithery. Happy to chat about MCP in the break, and make sure you catch the speakers as well. Have a nice rest of your day at the AI Engineer conference. See you. [Music] Ladies and gentlemen, please welcome back to the stage the VP of developer relations at Llama Index, Laurie Voss. Hello everybody and welcome back. I hope you're having a great time. Let me hear from you if you're having a great time. Excellent. And I hope you're learning a lot.
I personally hung out in the MCP track because that's my jam at the moment. I learned about dynamic tool discovery. I learned that VS Code's Insiders edition has full support for the entire MCP spec, which was very exciting to me personally. And I learned why MCP isn't any good yet, but it's going to be. I don't have a lot to say before I introduce our next speakers. I did ask my colleagues backstage if they had any jokes, and somebody said "the Wi-Fi," which is actually gold. We have some great keynotes to close out today, including building agents at cloud scale, Windsurf doing everything everywhere all at once, and of course Greg Brockman of OpenAI talking about what it means to be an AI engineer. But first, here are Stephen Chin and Andreas Kollegger to give some closing thoughts on the GraphRAG track. Our next speakers are the curators of the GraphRAG track, here to speak about agentic GraphRAG. Please join me in welcoming to the stage the vice president of developer relations at Neo4j, Stephen Chin, and the GenAI lead at Neo4j, Andreas Kollegger. [Music] Look at that, big hands. All right. So, great to be back here again on the stage, seeing everybody. We kicked off this morning's keynote with some exciting sci-fi memes around AGI and had an amazing GraphRAG track. My favorite part was seeing what Zep is doing, what we're able to do with graph-based agent memory to actually improve how models respond in agent systems. What did you like? What did you like, Connor? I've got to say, agent memory was definitely a big idea today. Zep was amazing. Then we also had the lunch-and-learn all about agentic memory. And for GraphRAG, it was really amazing to see that it's a big umbrella: it's not just one thing you can do, there are many things you can do within GraphRAG. So, speaking of GraphRAG, we talked a lot about that in the morning keynotes, and I bumped into a bunch of attendees and realized we never actually explained what GraphRAG is in detail. So we were thinking it would be great to show an end-to-end demo of importing data and building a knowledge graph here on stage. What do you guys think? Is that going to be a good idea? Yeah. Yeah. All right, so let's cut to a quick demo. So let me just get my... no, no laptops. Not actually. The Wi-Fi joke was mine, and that's why we're not doing a live demo. Apologies, but it was live in his hotel room last night. So, demo, please. Okay. So as this gets rolling, is it up yet? Oh, okay, the demo's up. Yes. Okay, fantastic. This is a demo of our LLM Graph Builder that we have at Neo4j. What it lets you do is take unstructured data sources, build a graph out of them, and then query that graph. What you see it doing right now is grabbing some web sources from Wikipedia. And since we went through movies this morning, I'm just grabbing the Wikipedia pages for those movies, laboriously grabbing, what was it, 2001: A Space Odyssey, The Matrix, things like that; that's what we had on the slides this morning. Now that all those sources have been selected, I'm going to say go ahead and create a graph straight out of them. The graph creation process goes through what happens with any kind of unstructured data: first it gets chunked up, all the data gets chunked and vectorized, and all of that gets stored.
But then the graph part of it happens. The graph part is: for each of the chunks we've got, we find people, places, and things, in this case movies, movie characters, and movie themes, and turn that into a graph connected to the unstructured data. So we get everything from the chunks, the vectors in the chunks, to a structured graph around them, providing something you can query. Okay, it looks like that's all the way done, and with a quick spin, that is the beautiful graph that evolves out of just that unstructured data. Nice, right? Yeah. Now, if we zoom in on this a little bit, if I get my timing right: okay, there are documents and chunks; we're going to remove those. But see this beautiful ball of purple in the middle. If we zoom in a little bit, the one movie I've looked at here is The Terminator. You'll also see that the Terminator as a character has been found and pulled out of the unstructured data into the graph. All those purple nodes are just the themes; the LLM has gotten a little enthusiastic about finding themes, so it has found lots of different themes in the movies, and that's fine. It also found other movies mentioned in the Wikipedia article; those are the other green bubbles. And it looks like, I don't know if you can see this, the green bubble up there on the right, that's Blade Runner, is connected to another node called Tech Noir, which is in turn connected to The Terminator. So just by looking at the graph, you can tell that The Terminator and Blade Runner are connected by the theme Tech Noir. So if we hop over to the graph chat now, this is just a built-in chat that talks to that graph, and I'm going to expertly type in: what themes do Blade Runner and The Terminator have in common? Fingers crossed, and thankfully I've recorded this, so we know it's going to work. Dot dot dot. You have to love the suspense of the dots. Okay, and there's the theme Tech Noir, and also science fiction, connecting them, verifying that the graph works and the chat on the graph works exactly the same way. Nice, right? Yeah, no, that's great. And we also have a big announcement here for the AI Engineer World's Fair. I think this is a first in history: Neo4j, believe it or not, we are still a startup, a large startup technically, a 15- or 16-year-old startup, but still a startup, is offering a startup program to help other startups get onto our technology. We have launched the new Neo4j startup program, which is coming soon. You can build your AI startup with Neo4j. We have a QR code you can use to sign up and join the program, and it will give you Aura credits and a free way to do all the things we're doing, on our cloud. And of course, our technology is open source; you can do it yourself on our Community Edition as well. Thank you very much for having us here at the keynotes. Thanks, everyone. [Applause] [Music] Our next speaker is here to teach us how to build agents at cloud scale. Please join me in welcoming to the stage the VP of developer relations at AWS, Antje Barth. [Music] Hi everyone. I'm thrilled to be back on stage here at the AI Engineer World's Fair, and it's amazing to see this community grow. So today I'm going to speak about how we can build agents at cloud scale.
Now, at Amazon and AWS, we truly believe that virtually every customer experience we know of will be reinvented with AI, and not just existing experiences; there will also be brand-new experiences we are now able to build with the help of AI agents. And we're not just theorizing about this; we're all here together to actually build the future. I want to start with a little bit about what that means internally across Amazon as a business. At Amazon, we have over 1,000 generative AI applications that are either built or in development, transforming everything from how we forecast inventory, to how we optimize delivery routes, to how customers shop and interact with their homes. One of the most ambitious deployments of AI agents is the complete reimagining of Alexa, and I know many of us have been waiting for this for a long time. What you're about to see represents the largest integration of services, agentic capabilities, and LLMs that we know of anywhere. So let's have a brief look. "Look at my style." "Oh, hey there." I love sharing this video because it shows the power of agents at scale, and just to give a quick sense of what that means in numbers: we have over 600 million Alexa devices out in the world, and with the help of the latest advancements in AI, we were able to really reimagine this experience. Alexa Plus works through hundreds of specialized "experts," which is what the Alexa team calls groups of capabilities, APIs, and instructions that accomplish a specific task for you. And all of these experts also orchestrate across tens of thousands of partner services and devices to get things done, which you've just seen a glimpse of in this video. We truly believe the future will be full of these specialized agents, each with their own unique capabilities, working together seamlessly with other AI agents. Now, this example shows what's possible at massive scale. But how do we get there? How do we operate at this scale? Or, said differently, how do we move from the web services we've built for many years into developing these agentic services? Luckily, many of the underlying principles remain the same, whether you're building for millions of devices, reimagining and integrating AI experiences into your enterprise applications, or you're a startup looking to scale your idea to the next level. Another example I want to show you is an agentic service we built at AWS. You might have heard about Amazon Q Developer, which is our code assistant that helps you across the software development life cycle. Just a few months ago we released a Q Developer agent for your CLI. It brings the agentic chat experience into the terminal: it helps you debug issues, you can ask it natural-language questions, it can read and write files, and it really helps make your day-to-day work in the terminal more productive. So let's have a quick look at how this works. Here is Amazon Q in the CLI, and I'll just ask a question, in this case: hey, what do you know about Amazon Bedrock? The CLI is integrated with MCP, so it figures out that there is a tool available, an MCP server our AWS documentation team has released, and connects to it. You see the tool call happening, and it's asking for permissions. I grant the permissions, and it comes back with a response that is grounded in the official AWS documentation.
Now, I don't want to talk much more about Q, but I do want you to quickly think about how long it took the internal AWS teams to build and ship this agentic service. Let's do it with a quick raise of hands. Who thinks it took two months to develop and ship this? A few hands. Who thinks three weeks? All right, a bunch more hands. Who thinks it took half a year? Almost none. Wow, you folks are great. We built and shipped this within three weeks. And to me this is almost insane, the speed. We heard it earlier; one of the keynote speakers called out that the moat in AI is execution, and I think three weeks is super impressive. Now, how do we enable teams, not just internally at AWS but in general, to build and ship production-ready AI agents this quickly? Internally, our teams needed to fundamentally rethink how to build agents. What we did was develop a model-driven approach that really taps into the power of today's LLMs, models that are so much more capable at deciding, planning, reasoning, and taking actions, and lets developers focus on what their agent should do rather than telling it exactly how to do it. And the great news is we made it available for all of you to use as well. Just a few weeks ago we released Strands Agents, an open-source Python SDK which you can check out to start building and running AI agents in just a few lines of code. So let me show you quickly how this looks. Before I go in here, a fun fact: if you wonder why we called it Strands Agents, well, this is what happens if you let AI pick its own name. The reasoning, because again the AI agent is capable of reasoning, was: think about the two strands of DNA; just like the two strands of DNA, Strands Agents connects the two core pieces of an agent together, the model and the tools. It helps you build agents and simplifies the process by letting you rely on state-of-the-art models to reason, plan, and take action. You simply start by defining a prompt and your tools in code, test it out locally, and then, once you're ready, deploy it, for example, in the cloud. And this is how simple it is, just a couple of lines; it should look pretty familiar. You install Strands Agents, you import it, and it comes with pre-built tools, which I'll talk about in a bit more detail; you just add the tools to your agent, and then you can start asking questions or building more complex workflows with it. By default, Strands Agents integrates with Amazon Bedrock as the model provider; you can see the model config here using Claude 3.7 Sonnet. But of course it's not limited to AWS. You can use Strands Agents across multiple providers. For example, we have an integration with Ollama, so you can start developing and testing locally; we have Anthropic integrations; we have Meta integrations to the Llama API; you can use OpenAI models and any other provider available through the integration with LiteLLM; and of course you can also develop your own custom model provider. Now, quickly, on the tools: as I said, Strands comes with over 20 pre-built tools, anything from simple tasks like file manipulation and API calls, and obviously integrating with AWS services, to more complex use cases, and I just want to call out a couple of them.
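For reference, a minimal sketch of the "couple of lines" quickstart being described, assuming the published Strands Agents API; the tool names and the Bedrock model ID are illustrative, not the demo's actual config.

```python
# pip install strands-agents strands-agents-tools
from strands import Agent
from strands_tools import calculator, current_time

# Defaults to Amazon Bedrock as the model provider; the model ID here is an assumption.
agent = Agent(
    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    tools=[calculator, current_time],
)

# The agent decides when to call which tool based on the request.
agent("What is the current time in UTC, and what is 1234 * 5678?")
```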
So there's a whole group of integrated tools for memory and RAG. One tool specifically, called retrieve, lets you do a semantic search over a knowledge base. And just to show you the power of this: we have an internal agent at AWS that manages over 6,000 tools. Now, 6,000 is a hard number of tools to put into a single context window and ask one model to decide between. So what we did is put the descriptions of those tools in a knowledge base and use the retrieve tool, so the agent can find the most relevant tools for the task at hand and only pull those back into the model's context for it to decide which one to use. That's just one way we're leveraging it. There is also support for multimodality across images, video, and audio with Strands; there is a tool to prompt for more thinking and deep reasoning; and it comes with pre-built tools to implement multi-agent workflows, whether that's graph-based workflows or a swarm of sub-agents working together. Now, you cannot talk about tools without mentioning MCP, right? So obviously we integrated MCP natively within Strands, so you can use it to connect to thousands of available MCP servers and make them available as tools for your agent. Support for A2A is also coming soon, but let's talk a little bit about MCP first. If you're building on AWS already, make sure to bookmark this GitHub repo: awslabs/mcp. Here you can find a very long list, much longer than what you see on this slide, of a growing number of MCP server implementations, specifically for working and building on AWS. Now, one of the challenges stems from the fact that when we all started building MCP servers, what we had was standard IO. It started out as a way to locally connect your clients to their respective tools. Here's a quick example, which matters for a demo I'll show in a little bit. This is a standard-IO implementation of an MCP server; it should look familiar to most of you working with MCP using the Python SDK and FastMCP. All I'm doing here is set up my server and use the decorator to define a tool. In this case, my tool rolls a die, and you can see in the code that it takes an input defining the number of sides. And I had to put a picture here because, I have to admit, I just learned this myself. Do we have D&D fans in the room? Woohoo. All right, a few of them, so you all know what I'm talking about. For the rest of us: I just learned that there are dice, and I have one here, I'm not sure if the camera can catch it, it's the one on the slide, with, for example, 20 sides. Something very normal in the D&D world when you start a game. Don't ask me questions about D&D; my colleague Mike Chambers, who's either here or in the expo right now, built the demo, so kudos to him, and he can answer all the D&D questions. All right, keep that in mind; I'll come back to it in just a second. Now, what we want to do here is decouple this and connect to remote MCP servers, because the topic is scale, right? And in the AWS world, the way to do this is as easy as deploying it as a Lambda function. We can do this now with streamable HTTP, and the same concepts apply: you put your Lambda functions, as you would have before, behind an API gateway and then connect. And because we care about security and authorization, in the quick demo I'm going to show you I'm using an authorizer.
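A minimal sketch of the kind of stdio dice-rolling server just described; this is not the actual demo code, and it assumes FastMCP from the MCP Python SDK.

```python
import random

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dice-roller")


@mcp.tool()
def roll_dice(sides: int = 20) -> int:
    """Roll a single die with the given number of sides (e.g. 20 for a d20)."""
    return random.randint(1, sides)


if __name__ == "__main__":
    mcp.run()  # stdio transport by default; the remote version moves this behind Lambda
```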
You can also plug in a Cognito authorizer for this part, and I'm also going to store session data in a DynamoDB table. So let's roll this quick demo. What you see here is an MCP Lambda handler that we developed; it's available in the GitHub repo, and it makes it really easy to set up your MCP server in Lambda. Here's a very simple hello-world example: the tool is again defined with a tool decorator, and then in the Lambda handler function you reference the incoming event and pass it to that MCP server. Now, looking at the server implementation, here we're doing a little bit more. You can see how we're adding session-table support, which is a DynamoDB table, and we're defining the tool. This is the dice-rolling tool I just pointed out, but this time it's hosted as a Lambda function, and you can write whatever code you want there as well. And then at the very end, it's the same single line that, when you call the Lambda function, passes the request on to the MCP server. Let's deploy this. Again, we're using the existing tools for deploying Lambda functions as we have before; this one uses AWS SAM to deploy it to the cloud, and then we receive the API Gateway URL. Now, on the client side, I'm using Strands Agents, as you can see, and I'm using the MCP integration. I'm passing in my API Gateway URL to connect, and for authorization I have a bearer token. Again, this is a simple concept demo, but you can build more robust integrations here as well. I'm calling list tools and then passing those tools to my agent, as we've seen before; this time it's the MCP-provided tools. And then if we run this, we can quickly see it in action: we ask it to roll a die, a d20, so again, 20 sides, and it comes back. What did we roll? You can see the tool use kicking in here. We rolled a seven. Great. So this is just a quick example. The good news is that once you're in the AWS world and working on Lambda, everything you can build with Lambda you can integrate there, so you have access to all of the features, capabilities, and applications you might have already built on AWS. Now, the next step is: how do we make agents talk to each other? That's the next frontier, and we are super excited about all the open protocols emerging right now. With MCP, for example, we joined the steering committee; we're an active part of the community, contributing code and helping to further evolve MCP. If you want to learn more about this, here is the QR code; we have a whole blog series on our open source blog, so feel free to check that out as we continue to help evolve these protocols. Now, what's next? We're all aware that this is just the beginning, and there will be so much more coming. If you had a chance to check out my colleague Danielle's talk yesterday on useful general intelligence, I just want to quote her: she said the atomic unit of all digital interactions will be an agent call. So we can imagine a future where you might just have your personal agent, as shown here, connecting to an agent store, with agents working together to accomplish tasks for you. Some of you in the room might already be building this, so let's go and build this future together. Thanks so much. Check out the additional sessions we have.
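For reference, a rough sketch of the client side of that Lambda demo. The URL, the bearer token, and the exact Strands MCP client interface are assumptions based on the publicly documented pattern, not the demo's actual code.

```python
from strands import Agent
from strands.tools.mcp import MCPClient
from mcp.client.streamable_http import streamablehttp_client

# Placeholders for the API Gateway endpoint and the authorizer's bearer token.
MCP_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/mcp"
HEADERS = {"Authorization": "Bearer <token>"}

# Wrap the streamable-HTTP transport in a Strands MCP client.
dice_server = MCPClient(lambda: streamablehttp_client(MCP_URL, headers=HEADERS))

with dice_server:
    tools = dice_server.list_tools_sync()   # fetch the remote tool list
    agent = Agent(tools=tools)              # hand the MCP tools to the agent
    agent("Roll a d20 for me.")
```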
My colleague Mike is going much deeper into the dice-rolling demo, everything MCP and Strands. And my colleague Suman tomorrow will also have a deep dive on Strands. And with that, thank you very much. Check us out in the expo hall and grab your [Applause] [Music] d20. Our next presenter is here to tell us what's next for agentic IDEs. Please join me in welcoming to the stage the head of product at Windsurf, Kevin Hou. [Music] Hello. Hello. How we doing? All right. How's the energy level? We're good. Good. Yes. Let's go. Let's go. Two more. Two more. My name is Kevin. I lead product at Windsurf, and I'm super excited to be back here. Thank you so much, swyx and Ben. It's always a pleasure to come back to AI Engineer World's Fair. The velocity of our industry right now is incredible. It's like being a kite on the ocean, and we're really excited to see where the winds are taking us. A year ago, we didn't have the Windsurf editor being used by millions and millions of people all around the world. And hopefully this is a larger number than last time: how many people have heard of Windsurf? And how many people have used Windsurf? [Laughter] Good numbers. Good numbers. We've got to improve that. And Windsurf itself has changed immensely in the six months since its launch in November. We retired the name Codeium because we decided to catch this new wave, which, by the way, is what we call our next-generation innovations in the product: we call them waves. And in case you missed it, we are now ten waves in. Some of the key waves we've been really excited about: web search, MCP support, auto-generated memories. Oh, I was supposed to do that. Auto-generated memories, deploys, and parallel agents, to name just a few. And as the waves keep growing, so do the number of people who have discovered and integrated Windsurf into their daily workflows. To this day, we are generating about 90 million lines of code every single day, and that equates to over a thousand messages sent every single minute. But today is not about growth. I'm not going to sit here and tell you about the numbers. I'm going to tell you about the why. Why do people feel connected to the Windsurf editor? And I know no AI company really wants to disclose its secrets, but I had to come up with some content, so today I'm going to let you in on one of ours. Our secret sauce is a shared timeline between the human and the AI. This is what makes people feel like we're reading their minds. Everything you do as a software engineer can be thought of on this shared timeline. So if we rewind way back to the dark days, pre-autocomplete, when everyone knew how to write a for loop, the human had to do everything. You had to edit files, you had to type every single character. Imagine that. But then once services like Copilot and Codeium launched, devs got really excited. They started seeing a small percentage of their code being written by AI, and we started to abstract and accelerate the small edits, the small actions, that we would do for a user. And in late 2024, with the advent of Windsurf's agent and the launch of the Windsurf editor, we saw that we could do more and more for the user. We started being able to edit multiple files at once, perform background research across thousands and thousands of files, and execute terminal commands directly inside the editor. But at Windsurf, we're in the business of trying to change how software gets created.
And this means that the timeline is actually a little bit more complicated. It needs to handle actions taken outside of just the IDE. So given how much of a developer's workflow happens outside of the editor, what does this mean for Windsurf? First, Windsurf is going to be everywhere. Specifically, Windsurf will need to be able to read and ingest context from every single source that a developer uses. If we zoom out and think about what makes you all successful software engineers, there are a couple of different categories. The first is coding-related: file reads, running terminal commands, seeing your history, even which tabs you have open inside of your editor. This all informs how to generate the correct code. But it goes beyond that. There are external sources: things like going onto GitHub and viewing a past history of commits, maybe looking at a PR that does something similar to the feature you're about to implement, doing web searches, looking at documentation. And then there's the third category, and this is where it gets a little bit interesting. It's called meta-learning. It's the idea of what separates a junior engineer from a senior engineer from a staff engineer: the organizational best practices, the engineering preferences, that all get encoded into what makes good code. So if we think about what this means in practice, let's say we're going to build a new page on a data viz dashboard. Let's walk through it step by step. First, you would probably start in Slack, as all good things start in Slack. You'll build context, looking at a bunch of customer requests, maybe some internal messages. You'll collect that context and start planning. That means you're going to be in Google Docs, writing design docs, probably working on some infrastructure designs. You're going to be tracking tickets inside of Jira. And then you might have a designer who's working in Figma in parallel, putting together all this material. And then finally the fun part, or at least my favorite part, which is the actual writing of the code. And hopefully you use something like Windsurf to do so. But you're not done from there. Once your code's complete, you still have to open the PR. You've got to get reviews. You've got to merge into main. You've got to deploy, see analytics. The list goes on and on. And this is really why we've built what we've built. Because we know that for you, it's extremely important that we can fetch context from your Google Docs, that we can read your Figma files, and that we can one-click connect to any MCP service so that you can access your information in things like Notion, Linear, Stripe, and countless others. And we've spent the last ten waves making sure that Windsurf can be ubiquitous. But we know that's also not enough. We know it's not enough just to read. We need to be able to do and write everything. We need to be able to do it all for you. The AI has to take action on a wide variety of surfaces beyond just the coding surface in order to accomplish what a human software engineer would do. And this doesn't mean just write code. This means interacting with third-party services, provisioning API keys, writing design docs, PRDs, wireframing, testing, and the list could go on and on. And so for the last six months, we've oriented ourselves around: how do we do everything?
And if we go back to this concrete example of building a new web app, where do we start? We start by running codebase-relevant terminal commands. This is something that we launched right at the advent of Windsurf. And what's really cool about what we can do here is that we can intelligently decide which commands we want to run automatically and which ones we want to wait on and ask for explicit user approval. Next, you'll open up Windsurf browser previews, which lets you visually iterate with the agent, so that Windsurf can take control of Chrome just like you would: inspecting DOM elements, looking at your JS console, doing what a web developer would do. And so now you could say our app is code complete. We'll use the GitHub MCP to open up a pull request, and we can use context from your other PRs to inform the description and the test plan. Code review is still a necessary part of any software company, so we launched Windsurf reviews, which can automatically leave comments and suggest changes asynchronously, so that you can be confident that the code that hits main is production ready. And now that your code is merged, you'll want to deploy. So we also released a one-click deploy to Netlify, using Windsurf's custom tool integrations, so that in one click the agent will deploy what you have to the live web. As you can see, we've really built the ability for Windsurf to read everything that you can and do everything, or almost everything, that a software engineer can. So then you might ask, what's next? It's only inevitable that Windsurf will be on all the time, working for you, even when you don't know it. We pioneered the agentic, human-in-the-loop, synchronous workflows back when we released Windsurf in 2024. And today, timelines are 80 to 90% agent, 10 to 20% human. But we're trying to build towards a future that gets to 99% agent and 1% human. We only want to ask the user for final approval. And as more and more of these timelines and workflows become AI powered, it becomes possible to have Windsurf working for you at all times. Not only as you type and use autocomplete and tab, but also in the background, researching while you're working, fully in parallel, only asking you to approve. We want to build this future where you can code at any time. You can write software at any time. This includes in your bed, on the toilet, on the bus, voice activated: Alexa... All right. We'll throw GPT, we'll throw Gemini at this timeline problem, but then from there, where do we go? How do we improve? And specifically, how is Windsurf able to tackle this problem of the timeline? If we take a step back, this really doesn't look like writing code anymore. This looks significantly more complicated than your average competitive programming question. Windsurf wants to revolutionize the way that software gets built. It's not just how code gets written. We are solving a broader set of tasks than just code. And while the industry focuses heavily on things like SWE-bench, we know that the future is not going to be tokens in, tokens out. Software engineering workflows are going to be much messier than this. It means that you have to be able to pick up tasks mid-workflow. You have to be able to deal with messy codebase states mid-commit, and you will have to work with tools that are outside of the editor.
And so we have to be able to ingest and perform over this broad set of actions on this timeline to keep our users in the flow. We have to be able to open up PRs. We have to know when to access analytics. We need to know how to debug your CI/CD all by itself. And this problem starts to look really, really different from what people are evaluating on. And because we have our own representation of this timeline, we needed a different system to handle these types of actions than what the off-the-shelf frontier models could give us. So where are we going with this? The realization of this is our brand new software engineering model, called SWE-1. We realized that we could actually dream bigger and build the best software engineering model that we could. SWE-1 is trained to handle software engineering workflows, not just pure code generation. And we use two main offline eval benchmarks. The first is an end-to-end task benchmark. This is basically tackling pull requests: given an intent and the starting point of a codebase, how do we get to the end and pass all the unit tests, right? The second one is where it gets a little more interesting. This is what we call a conversational SWE task benchmark, and this is how well the model can assist when it's dropped into an existing user conversation or a partially completed task. This actually lends itself very nicely to the Windsurf paradigm, right? Because we're not going cleanly from start to end; we're assisting and helping you along the way, mid-timeline. It results in a blended score of helpfulness, efficiency, and correctness, and really tests the model's ability to seamlessly integrate into the Windsurf style of working. And this initial performance really gives us a lot of confidence in SWE-1's architecture, specifically how we've been able to train for software engineering workflows. We've been able to achieve near-frontier-model results at a fraction of the cost and with a significantly smaller team. And one of Windsurf's greatest strengths, of course, is the value of community: real software engineers doing real work, giving real feedback. And what we found is that SWE-1, which is in the little dropdown for the models, is right up there with the rest of the frontier models. People are choosing SWE-1 because it recognizes how they do work, not just how to generate code, and it's actually contributing at an even higher frequency than models like 3.7 and 3.5. Windsurf builds at the frontier so that our users can build more with the best technology. We learn from our failure modes so that we can iterate from there. And what does this start to look like? Dare I say it, a data flywheel. We ship the best product. Devs and non-devs use that product to level up, as a skill multiplier or as a skill enabler. Users then help us find the frontier. They use things like thumbs up, thumbs down, accept, reject, constantly informing us not of what the SWE-bench frontier is, but what the software engineering frontier is. What tools are missing? Which workflows are repeated? Where does the product fall short? And we take those insights and we build at this frontier. We train a better model. We build more tools. We improve our agentic harness. We improve our memories, our checkpointing, with the goal of being everywhere, doing anything. And we will repeat this cycle. We will be shipping, finding the frontier, building at the margin, and repeating.
And what gets me really personally excited about this is that SWE-1 is an example of this in action. We have a very small team, significantly fewer resources than the larger companies, and we were able to achieve near-frontier-model-quality results with SWE-1. Even more so, this is really a demonstration of what it means to build AI products in 2025. It demands this harmony of model, data, and application, where the application is actually mimicking the user behavior that you want to replicate inside of your model. And this is how Windsurf will be everywhere, doing everything, all at once. Thank you so much for listening. [Applause] And I won't give you any promises, but someone made a profit. But in all seriousness, thank you so much for listening. I want to make sure that every engineer out there is using the best possible tools. So please give Windsurf a try today. We are also hiring across a number of different roles. We have a booth downstairs, so please come join us and help make this future a reality. Thank [Applause] you. All right, everybody. Let's hear it for all of our keynote speakers so far. I learned a lot from our keynotes today. I learned that Alexa can keep track of my dog, which is amazing because my dog is a runner. And that you can plug an agent into Lambda, which is genuinely very neat; I would absolutely want to do that. We also learned about Windsurf's tremendous growth. I wasn't in the room when he asked who's on Windsurf. Hands up if you are using Windsurf. What about people who are using Cursor? Whoa. Okay. Who's on VS Code of any flavor? All right. What about Zed? The few, the proud. What about something else? Who's on other things? All right. Our next conversation is with Greg Brockman, formerly CTO of Stripe, co-founder of OpenAI, and currently president of OpenAI. Fun fact about Greg: he's entirely self-taught in AI. He has no formal background in it, and with no formal background, he taught himself from free online resources and blogged about the experience to encourage other people, which I think is genuinely inspiring. That's how I taught myself web development. It's a fun, fundamental thing about the internet that you can just teach yourself. So, without further stalling, while they arrange chairs, please welcome to the stage for a fireside chat the one and only swyx and Greg Brockman. [Music] Well, hello. Hello. Is the mic working for you? Check, check, check. One, two, three. All right. First hard technology problem of the day down. Yeah. Yeah. Well, the Wi-Fi is the other one. Everyone here knows. So, Greg, welcome to AI Engineer. Thank you so much for taking the time. Thank you for having me. We're going to go a little bit chronologically, and a lot of people sent in questions, and I've grouped them up for you, so we'll get right into it. So, you know, I did some deep research on you, deep research with deep research. I called it peep research because we're researching a person. You actually did theater growing up, and chemistry and math, and you wrote a calendar scheduling app, and that's what got you into coding. But what really inspired your love for coding? Why are you the coding guy? Well, the funny thing is I thought I was going to be a mathematician when I grew up. Yeah.
You know, I'd read about people like Gowa and Gaus, you know, we work working on these like hundred 200 300 year time horizons and I was like that's what I want to do. if anything that I come up with is ever used while I'm still alive, it wasn't long-term enough. It wasn't abstract enough. Um, and I was writing this chemistry textbook after high school, sent it to one of my friends who' done something similar in math, and he said, "No one is going to publish this. You can either self-publish." I was like, "Ah, sounds like a lot of work, a lot of capital, or you could make a website." And I was like, "Guess I'm going to learn how to make a website." And so, I literally went on W3 Schools and did their PHP tutorial. How many people here remember W3 schools? Yeah, a decent number of hands. Um, and I remember the very first thing I built was a table sorting widget, right? I had this picture in my head of what it would be. And I remember the moment that I clicked the column and it sorted according to that column, which was exactly the thing that I wanted. And I was like, that was magic, right? And I was like, this is so cool. Because the thing about math is that you think hard about a problem, you understand it, you write it down in an obscure way, you call it proof. And then like three people will ever care, right? But in programming, you write it down in an obscure way, we call a program. And then maybe only three people ever read that program and care about the code. But everyone gets the benefit. No one has to understand the details. That thing that was in your head, it's real. It's in the world. And I was like, that that's the thing I want to do. Forget about that hundred-year time horizon. I just want to build. Uh, you do just want to build. Uh, it's So, you were so good at it that somehow somewhere you got cold emailed by Stripe while you're still in college. That's right. Uh, what's the story? How, first of all, how did they find you and what was it to convince you to drop out to join them? Well, so I had mutual friends with all the people at at Stripe, the you know, giant company of like three people at the time. uh and uh uh they they had asked you know the usual thing where they'd asked someone at Harvard who the you know people around campus to talk to uh who they might recruit where my name came up they asked the same for the people at MIT because I actually had dropped I' I'd been at Harvard and actually dropped out to go to MIT so I I had the advantage of uh I guess you know uh get getting up votes on both sides. Um, but I remember when I met the Patrick and it was you I just flown in. It was like late at night that you know it was storming and uh I I showed up and we just started talking about code, right? And it was just like one of those moments you're like this this is the kind of person that I that I've wanted to work with and been looking for. Uh and so I ended up dropping out of MIT uh and uh uh you know flew out and been out here ever since. Yeah. Yeah. Uh we have a spe we have some guest questions sprinkled along the way as you know. Uh, so question from someone named Matthew Brockman. I've heard of him. CTO of Julius AI. When do you think our parents will give up on the dream of you finishing your degree? Maybe maybe Harvard or UND will take you back. Yes. Uh, well ne never. Um, it was definitely, you know, I think it was no matter where you're going, if you tell your parents you're leaving Harvard, it's going to be hard. 
Um, you tell your parents you're leaving school altogether, it's going to be difficult. Um and I think that you know it was actually um to to their credit you know I think even though it was difficult um that they were like that you know we trust you like you you must see something and and understand something from from where you sit that's hard for us to see from from halfway across the country. Um but yeah I think that that as you know did Stripe and uh and had a good time and and actually learned things um and uh turned out as a real company and not just uh uh you know just dropping out doing nothing. I I think that that they they really were were uh you know have have warmed up to it and so um I think they're very proud of you. Yes, absolutely. So you you were with Stripe from 4 to 250 people as the first CTO eventually. Um one thing I I found recently that Hacker News maybe doesn't know is apparently the call installation only happened like a handful of times. It wasn't like a thing at Stripe. Was that that that's I think that's true. Um yeah, it is it is the thing that that you know it's like survived the uh the It's an urban legend because it's like so cool. It's like you so customer obsessed. Anyway, so what else do people get wrong about early Stripe? Like why do we want to clear the air? Yeah. Well, I think people don't understand how hard it was, right? It was just like um like I remember um you know, first of all, the the kind of thing that we did a lot of is that we added all of our customers on G-Chat. And so it was very much the case that we were in constant contact with them. And so even if you're not literally sitting over their their shoulder, you're doing the next best thing. Um, but I remember um like one I you know one one one day we realized that I you know the the the payment back end that we were on it just wasn't going to scale. Uh we absolutely needed to be on Wells Fargo and we got sort of the deal done but now we need to do a technical integration. And they said well this technical integration is going to take like 9 months because that's how long it takes. And we're like that's crazy. Like you're a startup. Like we can't sit around waiting 9 months to get this thing done. Um and so actually in 24 hours uh we completed it uh by just basically treating it like a college problem set. Uh and it was you know I I was implementing everything. John was working from the top of this test script and testing everything and being like this is broken. Daryl was starting from the bottom and working his way up. And uh in the morning we got on with with the uh certifying person and we sent some some test messages and there was an error and the person's like all right I'll see you next week. Um because that's how all their customers operate, right? there's an error like you know clear you need to send it to your dev team and we were like no no no there must just be like like some sort of glitch in the system like and we just Patrick was just like talking to keep her on the line and frantically like I was there editing the code and so we got like five turns in uh and we actually failed uh but fortunately she was nice enough to reschedule two two hours later uh and there then we passed and so you realize that was like six weeks worth of normal dev work that you got done in that moment because you didn't just accept the like arbitrary constraints of how other organizations would work. Yeah. Yeah. 
Do I think there's a do you think there's a lot more opportunity like that in most jobs? Like how do you how do you advise other people to be that I guess fast or like to cut that many cycles? Yes. I mean I think that I the way I think about it is that if you think from first principles you can find where things need to be slow or done the way that they're normally done or whatever those things are those exist right the general principle of ah just don't worry about the constraints and just do the thing. Um, I think that that that is not 100% true. I think it's really about mapping to where is there unnecessary overhead that's there for constraints that are no longer applicable that that don't apply uh to your specific circumstance. And I think this is especially true in this world that we're in now with AI that's accelerating productivity so much. Yeah. Just fire off a codeex. Why not, right? Um, one thing one thing one last thing about your sort of pre-openi life was independent study. I just I I found that just it's a recurrent theme from high school. You did rec center. I did. Um and your sbatical as well. So you've just done it repeatedly. What makes independent study effective? Like I think there's a lot of people who don't do a good job of it and kind of waste a year. What what what do you do that makes it so effective? Well, I think it was a key part of how I grew up. Um, you know, in in uh in sixth grade, my dad taught me algebra and in seventh grade showed up at the high school as the first time that you you track into advanced math pre-alggebra and we went to the teacher like can he skip uh this and go directly to the the eighth year the eighth grade course and the teacher looked at my mom and me very condescendingly and was like every parent believes that their child is special. Uh, and after like a month of being in this teacher's class and, you know, I was paying no attention and just doing, you know, calculator games in in the back and she'd try to trip me up and, you know, call me to answer questions from the whiteboard and I would just get them all right. She was like, "All right, like fair enough. Uh, your your child should be uh in the next year." Um, and but then when I was in eighth grade, there was no more math left in my middle school. I didn't have a car, so I had to do online courses. And in that one year, I ended up doing three years worth of high school math. And so I think that for me a lot of it is about suddenly these if you're if you're excited about something independently it's something you want to do that you can break the constraints there as well. Uh you can do three years of math in one year and then it compounds because the next year I was at my high school finished math there and then all through 10th 11th 12th grade I I had you know no more math so I did have a car and I was able to go to University of North Dakota take whatever classes I wanted there. And so I think that that that kind of compounded compounded compounded to learning programming. And then I think that that the way I learned program is very much self-study just building things and and experiencing things out in the world. And so I think that the thing I would just advise is like if you have an opportunity to explore and you have a passion, you're actually enjoying it, just go deep, right? And by the way, it's not always fun, right? 
I think that it is very easy to kind of feel like, ah, I got bored, but if you just push through those hurdles, then the reward is worth it. Yeah. You self-studied machine learning too; that was a whole period of your life. Any particular highlights from there? It sounds like you talked to Geoff Hinton at one time. I did talk to Geoff Hinton. Yeah. Yes. And did that help, or what was the most helpful thing? You became a machine learning practitioner. Well, so when I started out, I'd been at Stripe, and I was reading Hacker News posts about deep learning, and it was like there was a "deep learning for X" every day, it felt like. This was 2013, 2014, and I was like, what is deep learning? I knew one person in the field, so I talked to them, and they introduced me to some more people, and then they introduced me to more people. The thing that surprised me was I kept getting introduced to a bunch of my smartest friends from college, and I was like, that's interesting. All of these people ended up in this field; what's going on? And I started to realize that there was something real being built, that was being developed, that people were really making these systems do materially new things that computers were not able to do before. I was like, that is the thing. And so after I left Stripe, I knew I wanted to do something in AI, start an AI company, but I didn't really know how to contribute, what my skills would be useful for. I was in New York and I was like, you know what, I'll build a GPU rig and see if I can do some Kaggle competitions. So I went on Newegg and just bought some Titan X cards. And it was really cool, physically assembling this machine. You can find a tweet from 2015 when I powered it on: you see all this green and all the fans going, and I was like, this is what computers are meant to be. I think many folks in the audience have had that experience as well. Awesome. Okay. So what convinced you that AGI was possible? You had a point where you were sort of disillusioned with it. You tried to write a chatbot and it didn't work. But what made you go all in on it? Yeah. Well, part of the journey for me was reading Alan Turing's 1950 paper, "Computing Machinery and Intelligence." This is the Turing test paper. How many people have read it? Fewer hands than for W3Schools, but equally worth reading. The thing that is so fascinating to me is he lays out in the beginning, okay, the Turing test, this idea of: does a machine think? Is it intelligent? And you can say it's intelligent if a human can't tell the difference between talking to it and talking to a human. Fine. But the thing that has not really become as embedded in the pop culture, but to me was so astounding, was he said, well, how are you going to program an answer to this? You will never be able to write down all the rules. But what if you could build a child machine that learns like a human child, and then you just apply rewards and punishments, and boom, it's going to be able to pass the test. And I was like, that is the kind of technology that we have to build, because as a programmer, you have to understand everything.
You have to understand the rules of how to solve the problem. But what if the machine can understand things and solve problems that you yourself cannot understand? That feels fundamental, right? That feels like how you actually solve problems that are important to humanity. This was 2008 or so that I read this, and I went to my professor, who was an NLP professor, and asked if I could do some research with him, and he said, "Yeah, here are some parse trees." And I was like, okay, this is not what Turing was talking about. Yeah. This is like WordNet and the whole thing. Exactly. So, you know, definitely a little bit of a trough of sorrow there. But the thing about deep learning that's magic is that it really started to show promising results in 2012 with AlexNet, right? It just blew everyone out of the water in the ImageNet competition. And so suddenly you have this general learning machine. It's got a little bit of a prior in there, of convolutions, but it's better than 40 years worth of computer vision research, of people trying to write down all the rules as well as possible. And then people are like, well, okay, it works in vision, but it's never going to work in my field. It's never going to work in machine translation, never going to work in NLP, never going to work in this or that. And suddenly it starts being the best in all of those areas. Suddenly the walls between these departments are being torn down, and you're like, that is what Turing was talking about. And so I think for me, it was just seeing the type signature of this technology. And by the way, this technology is not new, right? Neural nets go way back; if you go back and read the McCulloch-Pitts neuron paper from around 1943 — I told him he should give homework to people. Okay. Yeah, there you go. Yes. Classes assigned. — the images in there look just like the kinds of images that you see now, layers of neurons and things like that. And so you just realize there's something deeply fundamental about what we're doing. And you can find papers from the 1990s talking about what caused the deep learning winters: that it was these neural net people, they have no new ideas, they just want to build bigger computers. And I'm like, yes, that's what we need to do. And so I think that all of this together just feels like we are, to some extent, continuing this wave, this 70-year history. And in many ways, the whole computing industry has really been building up to the point where you can have machines that are able to perform the kinds of tasks that we're just starting to scratch the surface of: to solve new problems that humans cannot, to be assistive to us in our daily lives, to not have to be typing with our meat sticks but instead to have something that you can interact with just like a person, where the machine comes much closer to you, rather than you coming closer to it and having to learn assembly language or whatever it is. And so to me, it felt like all of the factors were lined up, and now we just need to build. Yeah. I like that consistent theme that you keep coming back to: we just need to build. So in 2022, you wrote that it's time to be an ML engineer.
Actually, I have a personal friend who read that post and cold emailed you and joined OpenAI and all that. You said that great engineers are able to contribute at the same level as great researchers to future progress. Is that still true today? You know, I think a lot of engineers look at the researchers who are making millions of dollars and they're like, how do I contribute as much? I think it's absolutely, if not even more, true. If you look at the phases of deep learning research since 2012, at the beginning it really was, and this is kind of what I expected when we started OpenAI, research scientists who had gotten a PhD, who would come up with ideas and test them out. And there's engineering to be done: if you actually look at AlexNet itself, it's fundamentally the engineering of, let's get fast convolutional kernels on a GPU. And a fun fact is that people who were in the lab with Alex Krizhevsky at the time actually felt very bad for him, because they were like, he has some fast conv kernels for some image dataset that doesn't really matter. But Ilya was like, well, clearly we just need to apply this to ImageNet; it's going to be great, right? So it's the combination of great engineering together with the idea of what to do with it; that's what makes the magic work. And the thing that I think is still true, and even more true, is: okay, the engineering required is now not just let's build some kernels, but let's build a system. Let's actually scale to 100,000 GPUs. Let's actually do this crazy RL system that orchestrates things in all sorts of ways. So the idea: if you don't have the idea, you're dead in the water; there's nothing to do. But if you don't have the engineering, that idea is not going to live and see the light of day. And so you need to have both of these coming together harmoniously. Yeah. I think that Ilya-Alex relationship is really emblematic of the research-engineering partnership that now is the philosophy at OpenAI. That's right. Yeah. And if you look at how OpenAI operates, I think from the very beginning we had this ethos of engineering and research being valued and working together as partners, and that is something that we really work at every day. Yeah. It's my explicit goal to try to throw curveballs into this stuff. So in terms of the relationship between engineering and research, what did OpenAI do wrong in the early days that you do well now? Well, I think the relationship between engineering and research, the way I think about it, is you never fully solve it, right? You just solve the current level of the problem and then you move on to the next level of sophistication. And I noticed that the kinds of problems we ran into were basically the same problems that had been run into at every other lab; it was just that either we would be further along or there would be a slightly different variant of it. So I think there's something deeply fundamental about this. At the very beginning, I could really see people who came from the engineering world and people who came from the research world just thinking about system constraints very differently.
And so as an engineer, you're like, hey, if I've got an interface, you should not care what's behind that interface. We agreed on the interface, I can implement it however I want. Whereas if you're a researcher, you're like, if there's a bug anywhere in the system, all I'm going to get is just slightly degraded performance. Not going to get an exception, not going to get indications of where. And so I am responsible for understanding everything. the interfaces they don't matter unless they're like truly rock solid and I can just like never think about it which is a pretty high bar um then I am actually responsible for for this code and that causes friction right because then how do you actually work together and I saw a project very early on where that you know the the people from the engineering background would write the code and then there'd be this big debate over every single line and I was just like this is never going to move it's going to be so slow and instead the way that we ended up proceeding was um so I actually worked in that directly and I'd come up with like five ideas at a time. Someone from the research side would say these four are bad. I'd be like great, that's all I wanted, right? And so the value that I think we've really realized is critical and that I tell people from from the engineering world coming into OpenAI um is technical humility, right? It's like you're coming in because you have skills that are important, but it's a totally different environment from, you know, something like a traditional web startup. And figuring out when those intuitions apply and figuring out like when to leave them at the door is super hard. And so the most important thing is to like come in really really listen and kind of assume that that that there's something that you're missing until you deeply understand the why. And then at that point, great, make the change, like change the the the architecture, change the abstractions. Um but I think that that kind of approach of just really really read and listen and understand with that humility um that that is I think a really key determiner. Yeah. Awesome. Um we're going to tell some stories from recent launches of OpenAI the greatest hits. Uh so one of the things that is is kind of interesting is just scaling in general. Everything breaks at different orders of magnitude. So in when chatbt launched you got a million users in 5 days. This year when 40 image gen launched you got 100 million users in five days. How do those two periods compare? Uh they echo very similarly in a lot of ways. You know, the thing about chatbt, uh, it was supposed to be a low-key research preview and we put it out very, you know, sort of chilly and then suddenly everything was down and we, you know, we kind of anticipated that chatbt would be a very popular thing, but we thought that GPT4 would be necessary to get it. Had it internally as well, so you just weren't impressed by Exactly. Right. It's like you, that's the other thing about this field is you update so quickly, right? It's like you see magic and you're like, "This is the most amazing thing I've ever seen." And then you're like, "Well, why can't it like, you know, why can't it like merge, you know, 10 PRs for me?" Exactly. Um, and the image gen moment was very similar in terms of it was just so so loved and so popular and it just went viral in in ways that uh, you know, just like the numbers were just off the charts. 
And so internally we actually did something that we really really try not to do um which is we pulled a bunch of compute from research for both of these launches actually um because that's mortgaging the future um to make make the system work um but if you can actually deliver and keep up with demand then of course people get to experience the magic and I think that um that that that's something that is really worthwhile and it's really important to sort of you know maximize those moments. Um, so I think that that that we really have that same ethos of really serving the user, really trying to push for the technology and just do things that are materially new that no one's ever seen before. Um, and then whatever it takes to get those out into the world and make those successful that that's what we do. Amazing. Um, well, I mean, incredible job. U GPT4 launch. So I am told that your wife drew the joke website. That's true. Yeah. Fun fun Easter egg. My handwriting was so bad uh that even our AI couldn't tell what to do with it. Um so like uh apparently did you improvise some of this? I I I heard I gravine. Yeah, definitely. Definitely like you know usually when I when I do these kinds of demos like I've tested the general shape of them ahead of time. Uh but I've always had like it's very easy in this field to have ones that are just like if you slightly typo a character or something then the demo will not work. Um I don't like doing those. I like to have some robustness to it. So there's always variation in terms of of what actually ends up get being shown. To me, this was the first time I think the world ever saw vibe coding. Um, it is now a thing. What are your thoughts on vibe coding? Uh, well, I think that vibe coding is amazing as an empowerment mechanism, right? I think it's sort of a representation of what is to come. And I think that the specifics of what vibe coding is, I think that's going to change over time, right? I think that you look at even things like codeex like to some extent I think our vision is that as you start to have agents that really work that you can have not just one copy not just 10 copies but you can have a hundred or thousand or 10,000 or 100 thousand of these things running you're going to want to treat them much more like a co-orker right that you're going to want them off in the cloud doing stuff being able to hook hook up to all sorts of things you're asleep your laptop's closed it should still be working um I think that the the the you know current conception of of vibe coding in an interactive loop. Um, you know, that that's something that I I think is like, you know, it's it's I Okay, so my my prediction of what will happen is like I think there's going to be more and more of that happening, but I think that the agentic stuff is going to also really intercept and overtake. And I think that all of this is just going to result in just way more systems being built. Um, and the thing that that I think is also very interesting is that a lot of the vibe coding kind of demos and and the cool the cool flashy stuff. Um, for example, make making the joke website, it's making an app from scratch. But the thing that I think will really be new and transformative and is starting to really happen is being able to transform existing applications to go deeper. 
Um, and that be able to, you know, like I think so many companies are sitting on legacy code bases and doing migrations and updating libraries and changing your cobalt language to something else is so hard and is actually just not very fun for humans. And uh, I think we're starting to get AI that are able to really tackle those problems. And so the thing that I love about where vibe coding started has really been like with the most like just like make cool apps kind of thing. And it's starting to become much more like serious software engineering. And I think that going even deeper to just like making it possible to just move so much faster as a company. Um that's I think where where we're headed. Yep. Uh speaking of codeex, I've heard that you've just it's kind of your baby a little bit. Um and you've started I think on the live stream you were talking a lot about just make things modular and well doumented and all that good stuff. Like how do you think codeex changes the way that we code? Um well I definitely think that that it's an overstatement to say it's it's my baby. like I think that there's um a really incredible team um and and uh that you know I've I've been trying to support them and and and their vision and um but I think that that the direction is something that is like just so um so compelling and incredible to me. Um the way that that uh and sorry could you repeat the the how how does codeex change that we the way that we code? I see. Yeah. The thing that has been most interesting to see has been when you realize that the way you structure your codebase determines how much you can get out of codecs, right? That the if you match the strength of like basically all of our existing code bases are kind of matched to the strengths of humans. But if you match instead to the strength of models which are sort of very lopsided, right? models are able to handle way more like diversity of stuff but also are not not able to like sort of necessarily connect deep ideas as much as humans are right now. And so what you kind of want to do is make smaller modules that are well tested that have tests that can be run very quickly um and then fill in the details. the model will just do that right and it'll run the test itself and the connections between these different components kind of the architecture diagram like that's actually pretty easy to do and then it's the like filling out all the details that is often very difficult and if you if you actually do that you know what I described also sounds a lot like good software engineering practice um but it's just like sometimes because humans are are capable of holding more of this like conceptual abstraction in our head we just don't do it right that like yeah it's like you know it's a lot of work to write these tests and to you know to flesh them out and that you know the model's going to run like these tests like a hundred times or a thousand times more than you will and so it's going to care like way way more. So in some ways that the direction we want to go is build our code bases for more junior developers um in order to actually get the most out of these models. Um, now it'll be very interesting to see as we increase the model capability, does this particular way of structuring code bases remain constant? And I kind of think that it's a pretty good idea because again, it starts to match what you should be doing for for maintainability for humans. 
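As a toy illustration of the "small, well-tested modules with fast tests" structure described here: this is not OpenAI's or Codex's actual practice, just a made-up module and test pair showing the shape an agent can fill in and verify cheaply on every edit.

```python
# Toy illustration only; module and function names are invented for the example.

# pricing.py
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price in cents; clamps percent to [0, 100]."""
    percent = max(0, min(100, percent))
    return price_cents * (100 - percent) // 100


# test_pricing.py (runs in milliseconds, so an agent can rerun it constantly)
def test_apply_discount_basic():
    assert apply_discount(1000, 25) == 750

def test_apply_discount_clamps_out_of_range():
    assert apply_discount(1000, 150) == 0
    assert apply_discount(1000, -5) == 1000
```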
Um, but yeah, I think that to me that the really sort of exciting thing to think about for the future of software engineering is what of our practices that we kind of just cut corners for do we actually really need to bring back in order to get the most out of our systems? Yeah. Um, can you put numbers on like ballpark numbers on the amount of productivity you guys are seeing with codecs internally? Um, I yeah I don't know what the latest numbers are. I mean, there's definitely double digit percent of our of our PRs are written low low double digit um written entirely by codecs. Um and that's super cool to see. Um but it's also like you know that it's not the only system that we use internally and I think that um to me it's it's still in the very very early days. Um it's been exciting to see some of the external metrics. Um like I think we had 24,000 uh PRs that were merged in like the last day uh in in public GitHub repositories. And so it's just like yeah, this stuff is all just getting started. Yeah, it's doing a lot of work. Uh guest question from Dylan Patel on scaling and uh reliability. Um so as we're doing more tasks that take longer and utilize more GPUs, they're also just unreliable. They fail a lot, right? And and this is just well known. Um so this causes training to fail as well. So like but like you know you you've mentioned that you can sort of just restart a run and that's okay. like how do you deal with this when you have to train long horizon agents, right? Because you can't really restart something that has a trajectory that's kind of halfway that is maybe nondeterministic. Yeah, I mean I think that there's a bunch of problems that you kind of solve and then you make the models more capable and then you have to resolve them. And so yeah, when the the rollouts are short, you know, 30 seconds, you kind of don't care that much about this problem. If they're going to be days now, you really care about this problem. Yep. And you have to start thinking about how to snapshot state and a bunch of things like that. Um the short answer is that I think that there's a this like ladder of complexity that you keep climbing with these training systems and it goes from you know like a couple years ago all that we cared about was just doing good oldfashioned free training, right? And that's like very checkpointable. Um and even there it's not trivial, right? It's like you know if you go from checkpointing once in a while to like you want to checkpoint every single step now you need to think really hard about about how you're going to avoid copies and blocking and all these things um then for something like these more complicated RL systems there's still checkpoint in terms of you know maybe you care about uh you know checkpointing your cache so you don't have to recmp compute everything um and the nice thing about our systems is that you know language models are their state is very explicit right and it's something that actually can be stored um something you actually can can handle. Whereas if you have tools that you're hooked up to that are themselves stateful, maybe those are not something you can restart and recover from. And so I think that that if you consider the whole system end to end, thinking about what checkpoint ability looks like. And there's also a question of maybe it just doesn't matter, right? Maybe it's fine that you restart the system and you get some little wiggle in your graph, but these models are smart. Yeah. Right. That they can handle it. 
Um, one thing we're looking at tomorrow that's launching is maybe you can sort of take over the VM and checkpoint the VM state and restart it. Yep. Um, I think we have a dialin call-in question from Paris. Um, if someone can play the video special guest, oh, I wish I could be there to ask you in person. One of the questions that I have is in this new world, the work the workloads in the data center and the in the AI infrastructure is going to be incredibly diverse. On the one hand, agents that are doing T research and They're thinking, they're reasoning, they're planning, and they're working with other agents, and they're, you know, working on a lot of memory, they have large context on some of that you also want to think as fast as possible. So, you know, how do you how do you create an AI infrastructure that is optimized for workloads that have to have a lot of pre-fill, a lot of decode, a lot of something in between on the one hand, and on the other hand, the type of workloads that I'm super excited about, these multimodal vision and speech AIs that are essentially your R2-D2, your companion, it's on all the time. It's instantly available to you. And so these two workloads one of the one of them super uh compute intensive and take might take a long time and um uh you know test time scaling and all that on the other hand wants to be very low latency. So what does what does a future AI infrastructure look like that's that's as flexible as possible as performant as possible low latency high throughput you know all of that is just incredibly complex. So how how do you think through that and what kind of AI infrastructure would you would you think that would be ideal going forward? Well, with lot lots of GPUs, of course. So, so if I were to summarize, uh, Jensen wants you to tell him what to build. What would be your dream? Uh, but also like there's just two needs. There's two kinds of infra. There's there's long compute and there's real time. Now, now, now. Yes. Yes. I mean, it's it it is hard, right? Because I mean, this codees problem, it is a mind-boggling one. And so, you know, I'm a software person by by background and that, you know, we think we're we're off here just like writing the software for AGI and then you realize you have to do like these massive infrastructure projects, right? Like that's not how we set out, but it actually kind of makes sense in the end, right? If we're going to build something that's going to be transformative to the world, like yeah, probably it's going to require some some, you know, maybe the biggest physical machines that humanity has ever created, like kind of type checks. Um, so I think that the that there's two answers. Like the naive answer is okay. Yeah, you want two kinds of accelerators. You want one that's really compute optimized, one that's very latency optimized. Um, throw like tons of of HBM on one of those and, you know, ton tons of tons of comput on the other, you're all good. Um, now, one thing that's really difficult is predicting the ratios, right? Now, you have a new problem you have to think about. And if you get the balance wrong, suddenly you're going to have a whole part of your fleet that's just useless. Yep. And that sounds really scary. Um, now the thing is because the way that these things work is there's no requirements in this field. There's no constraints in this field. 
there's just sort of this linear program that people are optimizing and so yeah if you give our engineers some sort of misbalance of resources like we will find ways to utilize it maybe at great pain right but an example of this is you know you've seen the whole field move towards mixture of experts and to some extent what mixture of experts is is saying well we have all this DRAM sitting around that isn't being used for anything because the balance is wrong fine we'll just fill it up with parameters and we'll actually not cost any compute and we'll just get extra ML comput efficiency out of it like boom there you go and so I think that there is some of that where if you get the balance wrong it's actually not the end of the world um homogeneity of accelerators is like a very nice default to start um but I think that that that ending up with purpose-built accelerators is also not super crazy and the more that we move to these world these worlds where it's the just dollars of capex for this infrastructure starts to become so eye watering then starting to hyper optimize for some of these workloads is pretty reasonable um but I think the jury a little bit out because if you think about it that the research is just moving so fast and to some extent that dominates everything else. Um okay I wasn't planning to ask this but you just brought up the research stuff. Can you rank current scaling bottlenecks for GBT6? Ah compute data algorithms power money. Yes. Which one's which one's like the you know number one and two? Which one are you are you like most very limited on? I mean look I think we are in a world where basic research is back. I think that is really amazing, right? There was this period. Yeah, basic research. Um there was a period where it felt like all right, we got a transformer, let's just scale it, you know, and um I find those problems very exciting. I have a lot of fun just like you got a very well- definfined hard problem. You want to just move the number up and to the right. Um but it also is a little intellectually dissatisfying in some ways. It's like that it feels like there's more to life than just you know attention is all you need paper uh you know in in in vanilla form. Um and so I think that what we've started to see is that we're operating at a scale now um where we've pushed the compute, we've pushed the data so far that you can start to get you start to have algorithms is like again just back as as a important and really almost a long pole um in in terms of future progress. And so um all of these things they're all they're all important poles of the tent. And you know on any one day uh it might look a little lopsided one way or another. Um but yeah, fundamentally I think it's like you want to keep these all in balance. Um and it's really exciting to see things like like the RL paradigm. That's something that we invested in very deliberately uh for for for multiple years. It was like when we trained GPD4 um the very first thing like I think it was really interesting was when you we talked to GPD4 for the first time we were like is this an AGI? Like it's clearly not an AGI but it's really hard to say why right is like there's something about it. It's so fluid and smooth, but but somehow it falls off the rails. It's like, well, we got to solve that reliability problem. And you're like, well, it has never actually experienced the world, right? 
Okay, I wasn't planning to ask this, but you just brought up the research stuff. Can you rank the current scaling bottlenecks for GPT-6: compute, data, algorithms, power, money? Which ones are number one and two? Which one are you most limited by? I mean, look, I think we're in a world where basic research is back, and I think that's really amazing. Yeah, basic research. There was a period where it felt like, all right, we've got a transformer, let's just scale it. And I find those problems very exciting; it's a lot of fun when you've got a very well-defined hard problem and you just want to move the number up and to the right. But it's also a little intellectually dissatisfying in some ways; it feels like there's more to life than just the "Attention Is All You Need" paper in vanilla form. So what we've started to see is that we're operating at a scale now where we've pushed the compute and the data so far that algorithms are back as an important, really almost the long, pole in terms of future progress. All of these things are important poles of the tent, and on any one day it might look a little lopsided one way or another, but fundamentally you want to keep them all in balance. And it's really exciting to see things like the RL paradigm; that's something we invested in very deliberately for multiple years. When we trained GPT-4, the really interesting thing was that when we talked to GPT-4 for the first time, we were like, is this an AGI? It's clearly not an AGI, but it's really hard to say why; there's something about it. It's so fluid and smooth, but somehow it falls off the rails. It's like, well, we've got to solve that reliability problem. And you realize, well, it has never actually experienced the world, right? It's like someone who has just read all the books, who has observed the world but never experienced it themselves, watching it through a pane of glass or something. And to me that was something where we said, okay, clearly we need a different paradigm, and we just pushed on it until we made it really work. I think that remains true today: there are other very clear missing capabilities that we just need to keep pushing on, and we will get there. Awesome. Broadening out from just OpenAI things... well, honestly, I'm just going to let... so, we asked Jensen for one question. He's an overachiever, so he sent in two. Let's play the second video.

"The AI-native engineers in the audience are probably thinking: in the coming years, OpenAI will have AGIs, and they will be building domain-specific agents on top of those AGIs. So some of the questions on my mind would be: how do you think their development workflow will change as OpenAI's AGIs become much more capable? They would still have plumbing, workflows, pipelines they create, flywheels they create for their domain-specific agents. These agents would of course be able to reason, plan, use tools, and have memory, short-term and long-term, and they'll be amazing agents, but how does it change the development process in the coming years?"

Yeah, I think this is a really fascinating question, and you can find a wide spectrum of very strongly held opinions that are all mutually contradictory. My perspective is that, first of all, it's all on the table. Maybe we reach a world where the AIs are so capable that we just let them write all the code. Maybe there's a world where you have one AI in the sky. Maybe you actually have a bunch of domain-specific agents that require a bunch of specific work to make happen. I think the evidence has really been shifting towards this menagerie of different models, and I think that's actually really exciting: there are different inference costs, even just from a systems perspective, and there are different trade-offs; distillation works so well. So there's actually a lot of power to be had by models that are able to use other models.
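A hedged sketch of that "models using other models" idea: a cheap distilled model handles most traffic and escalates to a larger reasoning model only when it isn't confident. The model names and the call_model helper are hypothetical placeholders, not any real API.

```python
def call_model(model: str, prompt: str) -> tuple[str, float]:
    """Placeholder for an LLM call returning (answer, self-reported confidence)."""
    # In practice this would hit whatever inference endpoint you actually use.
    return f"[{model}] answer to: {prompt}", 0.6 if model == "small-distilled" else 0.95

def answer(prompt: str, threshold: float = 0.8) -> str:
    # Try the distilled model first: cheaper and lower latency.
    reply, confidence = call_model("small-distilled", prompt)
    if confidence >= threshold:
        return reply
    # Escalate to the big reasoning model only when needed.
    reply, _ = call_model("large-reasoning", prompt)
    return reply

print(answer("Summarise this paragraph."))
```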
And so I think that is going to open up just a ton of opportunity, because we're heading to a world where the economy is fundamentally powered by AI. We're not there yet, but you can see it right on the horizon. They're working on it all. Exactly. I mean, that's what the people in this room are building; that is what you are doing. And the economy is a very big thing. There's a lot of diversity in it, and it's also not static. When people think about what AI can do for us, it's very easy to only look at what we're doing now, how AI slots in, and what the percentage of human versus AI is. But that's not the point. The point is: how do we get 10x more activity, 10x more economic output, 10x more benefit to everyone? And I think the direction we're heading is one where the models will get much more capable, there will be much better fundamental technology, there are just going to be way more things we want to do with it, and the barrier to entry will be lower than ever. And so things like healthcare, where you can't just... it requires responsibility to go in and think about how to do it right. Things like education, where there are multiple stakeholders: the parent, the teacher, the student. Each of these requires domain expertise, careful thought, and a lot of work. So I think there is going to be so much opportunity for people to build, and I'm just so excited to see everyone in this room, because that's the right kind of energy. Thank you for encouraging us and being an inspiration. Thank you so much. Thank you, everybody. All right, there's just one more thing before you leave the room. Let's hear it one more time for Greg Brockman. So, the talks are done, but the fun continues. We'd love to invite you to the afterparty. Here to give you the details of the afterparty is Toshit Panigrahi of TollBit. [Music] Hey everyone, how's it going? I'm Toshit Panigrahi, one of the co-founders of TollBit. For the past two years, we have connected the world's biggest publishers with the world's biggest AI companies. Now we're taking that same technology and allowing agents to access sanctioned first-party data sources with seamless auth and payments, whether it's for MCP, whether it's A2A, or even for browser automation. So if you care about the agent economy, agent auth, and payments, check us out at tollbit.dev or come talk to us at the TollBit afterparty. Thanks. [Applause] [Music]