AIE Miami Keynote & Talks ft. OpenCode, Google DeepMind, OpenAI, and more!
Channel: aiDotEngineer
Published at: 2026-04-20
YouTube video id: 6IxSbMhT7v4
Source: https://www.youtube.com/watch?v=6IxSbMhT7v4
Hey, hey, hey. Good morning everyone. Hello Miami. >> How's everyone doing? >> Welcome to Miami. >> Yes. >> How's everybody doing? >> And in case you forgot where we are, there is a clue in my jersey. So we're bringing AI Engineer to Miami today, and I'm so grateful — I can't really see that well because the light is so bright, but I can kind of see your faces, and I'm just so glad that you're all here to celebrate AI Engineer and accelerating practical AI applications. I hope that you're as excited as we are about today. My name is Ethel. I am one of the MCs for today. I am an AI researcher at Google, and I'm here with Iman, my colleague and dear friend. >> Hello, I'm Iman, AI research engineer at Google, and I'm coming all the way from the San Francisco Bay Area — West Coast, by the way. Who here is from the West Coast? Raise your hand if you are. All right. >> Nice. We see some hands. >> I see a few hands. Who's from the East Coast? >> Oh, wow. >> That's not fair. We're the minority. Well, Miami is East Coast as well, so... >> Yeah, I can imagine. So, who's from the central US? >> Yeah, we see some. >> Okay. Okay, that's a good woo. And who's coming from outside of the US? >> Nice. Wow. Thank you for traveling. >> Pretty diverse. And I'm curious: raise your hand if by this time next year you're going to be replaced by AI. I'm counting around 20 colleagues with realistic expectations. Well, that may be true, or maybe the reality is AI will expand what's possible for us, multiply us, and redefine what's achievable. But before that, let's hear some stats about who's attending. >> Yeah. So we want this conference to be about you and your connections. Sitting with us we have people from 23 countries. So thank you all who are traveling internationally — and also locally, including me myself; we walked down a couple of flights to come here. But no matter how far you traveled, we're really grateful that you're here. And we also have some companies that are very excited about sending engineers over. We have two companies that each sent 12 engineers to the event. So, thank you so much for doing that. I also want to do a quick roll call. Who in the audience is an AI engineer? Nice. We see some hands. What about quality engineers, PMs, AI researchers? >> PM there. >> Okay. So, an overwhelming group of AI engineers. You're in the right place, and we're so happy to have you with us. And we have great talks. We have our sponsors in the expo area for later. And we really want you to be able to network with each other, talk to each other, and really just build a community here. That's the vision that we have for today. >> Big plus. Big plus. Yeah, last night at the opening reception, I got to talk to people. Some told me about their personal experiences: their friend got sick and they want to leverage AI to help them and improve their life, or how they want to leverage AI to bring education to parts of the world that don't have proper education. Very personal stories, and I'd love to hear more of that — I think that's the core of the idea here. Let's consider this a playground where amazing minds come together, build connections, and make big changes in the world. I think that's part of the vision. And who better to tell you about the vision than the amazing Gabe Greenberg, who is the CEO and founder of G2I. I would like to invite Gabe to the stage to share a few thoughts. Thank you all.
>> Welcome, Gabe. >> AI Engineer Miami, what's up? How we doing? All right. I like it. I got through my start — I said AI Engineer Miami rather than React Miami, so I'm starting off well. That's our next conference. Okay. All right. Well, I wanted to give you all a little bit of an origin story, because this conference and this series of conferences is pretty unique. It's special. There's a really special group of people that you're sitting with right now. Dax is special. The speakers — I just have a lot of love for the people in this room. I'm the founder of G2I. You can check us out at g2i.ai. We're focused on reinforcement learning environments in the human data space, specifically around software engineering. I'm also the co-organizer of this conference — at least AIE Miami and React Miami. And you can find me on Twitter — I'm not going to call it the other name — twitter.com/GabeGreenberg. You can follow me there. So, the origin story of how this conference started. I flew out to San Francisco, I knew nobody, and I walked into React Conf 2016, and the first person I meet is Ryan Florence. Some of you may know him in this room — a little bit of a celebrity, and we had some mutuals. I shake his hand, meet him, cool guy, he answered a few questions, and I sit down next to him, and he opens up his laptop, and on his lock screen is — wait for it — Brad Pitt. And I knew instantly that I was in the right place: this guy does not take himself too seriously. Nick Schrock was up there talking about GraphQL. React Native was kind of new. It was this beautiful, bustling scene; we were all excited, you know, on the bleeding edge, and I got really involved in the React ecosystem. We ran a lot of Q&A with the React core team for Reactiflux, and a few years later I got really, really, really sick. I had mold toxicity and mercury poisoning. And I would sleep on the floor underneath my desk for hours because I had migraines. I couldn't think straight. I couldn't do simple math at times. And I would get up and do a little work and go back to sleep. There were times I'd go into treatment on Monday and couldn't work till Friday. And this went on for years — it lasted for eight years of my life. It was the most terrible time. My kids, you know, didn't see me much. My wife was kind of like, "What's going on?" And finally I was diagnosed — it took me so many years to get diagnosed with the mold toxicity — and I'm sitting there on a vacation, in physical pain every single day, and my wife and I go, "We need help. We need to raise money to get you help." So I put it on Twitter — I'd been involved in the software ecosystem for a while at this point — and this tweet changed my life. Dan Abramov on the React team said, "Let's help Gabe." And they raised $22,000 for me. And I got healthy a couple years ago, and I've been healthy since. Yeah. So, a couple years later, I'm up too late on Twitter, of course, and I post this: someone should put on React Conf in Miami — who'd be interested? I own the domain, and hundreds of other domains I've never used. And, of course, Ken Wheeler says, "I'd go and probably never be invited back." He was invited, and he was invited back, believe it or not. He'll be here this year. And Michelle over there and her sister Becca said, "Yes, we're in."
And they were the ones that convinced me. Michelle was the one that convinced me this could be done, with no money, right out of COVID. And so we just felt called to it. And this was a response to what you all had done for me — the organization, G2I. We felt called to this conference and to serve the people here: to not make it a quote-unquote corporate event for the profit, but to make it really for the people. And of course swyx — you can see him down there at the end — he's come almost every year to React Miami, spoke many times. He created AI Engineer quite a few years ago at this point, runs Latent Space, is super involved, he's at Cognition. And he said, "Gabe, Michelle, Becca, can you do the first AIE in Miami?" So, here we are. AIE Miami is born. Thank you all so much for coming. It means the world to us. And there's one more thing. Our company — we've worked with the frontier labs for a number of years now. We've had to move really fast and build production software on compressed timelines that I think some of you would hardly believe. So today we are announcing Orchestrator AI. It's a multi-agent orchestration platform for complex engineering. You can check it out at orc.ai. We are really excited about this. It's been dogfooded. You can run many different agents in the platform. The coordinator runs the implementer, auditor, reviewer, validator, researcher — these are only some of the roles in the platform. It comes with a confidence score, shares the known issues and, of course, the assumptions that the different agents are making. And you can spin up to 16 of these for a single task. There's true adversarial governance with this, and we're able to catch a ton of large language model drift. Extremely fast inter-agent comms, and of course it's model agnostic. It comes with a self-pruning context memory that reduces context bloat, and the meta-observer in the platform automatically adds new skills as it identifies opportunities for them, plus an observability layer that allows you to delete them or add new skills manually. So we're really, really excited about this. We're signing up design partners.
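To make that architecture concrete, here is a minimal, purely speculative TypeScript sketch of what a coordinator-plus-roles loop like the one Gabe describes could look like. The role names come from the talk, but everything else — `Finding`, `callModel`, the prompts, the structure — is an invented assumption for illustration, not Orchestrator AI's actual API.

```typescript
// Hypothetical sketch of a coordinator fanning a task out to specialist
// agent roles. Names and structure are assumptions for illustration only.

type AgentRole =
  | "coordinator" | "implementer" | "auditor"
  | "reviewer" | "validator" | "researcher";

interface Finding {
  role: AgentRole;
  output: string;
  confidence: number;     // feeds an aggregate confidence score for the task
  knownIssues: string[];  // surfaced to the user rather than hidden
  assumptions: string[];  // each agent states what it assumed
}

// Stand-in for any model call; taking the model name keeps it model-agnostic.
declare function callModel(model: string, prompt: string): Promise<string>;

const MAX_AGENTS = 16; // the talk mentions spinning up to 16 agents per task

async function runTask(spec: string, model: string): Promise<Finding[]> {
  const roles: AgentRole[] = [
    "researcher", "implementer", "auditor", "reviewer", "validator",
  ];
  const findings: Finding[] = [];
  for (const role of roles.slice(0, MAX_AGENTS)) {
    const output = await callModel(
      model,
      `You are the ${role}. List known issues and assumptions, then respond.\n${spec}`,
    );
    // A real system would parse structured output; this sketch just records it.
    findings.push({ role, output, confidence: 0.5, knownIssues: [], assumptions: [] });
  }
  // "Adversarial governance": auditor/reviewer findings would send the
  // implementer's work back around the loop before anything ships.
  return findings;
}
```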
We've been doing a little bit of benchmarking, and then I'll turn it over to Dax to come talk about OpenCode. We do a lot of spec-driven backend work. We need to do it really fast, and our engineers were dogfooding this platform — really, it's been a couple years in the making behind the scenes — and they've been building these spec-driven APIs. So if we look at the pet store API that we did, you're looking at 100% path coverage and a 100% semantic score around the quality — the response shapes, types, and behaviors matching the spec. Compared to a single-agent harness like Claude Code, it's about the same; we're 6% better, not much of a difference. Now, as we increase the complexity, going into a startup API, we're able to see the lift: we hit 100% path coverage and 100% semantic where Claude Code is hitting 78% and 60%. Interesting. But then we really increase the surface area and the complexity — like 8x on the spec of what a startup API would look like — and we see the lift: where a single-agent harness might hit 22%, we were hitting 92% on the semantic score, and the path coverage is 100% for us. So we're excited — and in half the time to launch this. And lastly, we spent the last 72 hours making sure we could have one more benchmark. We actually rented a hotel room to put this laptop in just to run the benchmarks, because the Wi-Fi was not so good in this hotel — which is a funny story. We ran the orchestrator against SWE-Bench Pro, specifically with GPT 5.4 high. It's about 731 tasks. We bucketed it from very easy all the way to very hard — but at the end of the day, "easy" is not super simple; you're talking about a multi-file fix and subsystem-logic understanding. So, bucket by bucket, from very easy through hard, it's a 17.1% lift over GPT 5.4 high, then 14.8%, 8%, and a 1.7% lift on hard. And when we get to very hard — complex, long-horizon issues spanning multiple days — a 5.7% lift. So, 8.4% overall on top of the model. To give you an idea, I think GPT 5.2 to 5.4 was a 4% lift, and Opus 4.5 to 4.7 was a 7-point lift. So, we're excited about this. This is able to execute SWE-Bench Pro above Opus 4.7 with GPT 5.4. And if you'd like to sign up as a design partner over the summer, we'd love to work with you and build with you. So, thanks for your time. Enjoy Dax, and enjoy AI Engineer Miami. Thank you. >> Am I going right away? >> Thank you so much, Gabe. Our first presenter has tried his hardest to insult us all with his choice of title. He is a world-renowned troll on Twitter — that's not me, that's Becca who said that. And the title of his talk is "You Don't Have Any Good Ideas." I would like to invite Dax Raad to the stage. I don't know where he's going with this, but I'm going to let him figure it out. So, please welcome him to the stage. >> All right. I don't have any slides, so I'm just going to walk around and talk at you guys. A lot of people in here today. All of you came from all over the world, all the way to Miami, but you're not like other people, because you came here to talk about AI. Not why people usually come to Miami. Kind of embarrassing. It's okay. There's a reason you're here. The reason you're here is there are a lot of smart people that are going to be here giving talks. These are people at the top of their game using AI to build software. You're going to learn all the tips, all the tricks that give you the edge — you know, do things that used to take a week in a day. You're going to go home. You're going to use these tricks. You're going to build all of your ideas. You're going to be super successful. You're going to be rich. It's going to fix everything that's wrong with you. Your mom's going to be proud of you. There's only one problem, which is: you don't actually have any good ideas. And that probably hurts to hear, but it's true. It's okay. I don't have any good ideas either. And I think this is the first time that we're all having to confront this fact. You know, we have more capability to build stuff than ever, and you think, oh, finally, we can ship all the stuff that we always said we would. And it turns out a lot of the stuff that we thought were good ideas are not good ideas. And this is the number one problem that I'm struggling with, that my company's struggling with. So, again, my name is Dax. I'm the co-founder of Anomaly. We make a decently popular coding agent called OpenCode. And this talk is going to be all about product restraint.
So to understand what I mean by this, let's think back to before AI was a thing, before coding with AI was a thing. It was, like, a long time ago — this was like two years ago, forever. But if you think really hard, you can think back to that time and imagine what it was like. For those of you who are programmers, imagine working at your companies. Someone would come to you with a new idea, with a problem, with a feature they wanted to ship, and it would be really annoying. You would hate it when someone did that, because you had a huge backlog of stuff you were already trying to do. You had all this stuff that you wished you were doing better that you didn't have time to get to. So, when someone came to you with yet another thing to put on your roadmap, you did what the lazy engineer does and you pushed back on them. You argued every reason why we shouldn't do this thing: why we shouldn't ship this, why the company shouldn't be doing this, why the person was stupid, maybe we should do this later. You basically were the obstacle to getting anything done, just because you were overwhelmed. And you pushed back a lot, and rightfully so — the company hated you for it. If you look at most companies, you talk to them honestly: most parts of the organization hate the engineering team, and for good reason, because every problem they have is blocked by engineering. When a customer yells at someone on support, it's because the engineering team hasn't shipped something that would have fixed their issue. When the sales team loses a lead to a competitor, it's because the engineering team has a feature they haven't shipped that the competitor does have. Engineering has just been the annoying part of the organization forever — the source of every single problem, at least the way it feels. And it feels kind of stupid, because software is virtual. We're not physically building things. We're not moving things from one place to another. It's just in this virtual space, and it feels like the moment we have an idea, it should just exist, right? Like, it's just a thing in the app. Why are there so many steps and processes in between? And everyone wished that things could be different. The past couple years, it feels like that's kind of changed. We've gotten the ability to go from idea to a real-looking thing really quickly. And everyone's super hyped about this. Every company is trying to adopt this workflow as much as possible, blow up every single process they have. You know, if you're not adopting this, your competitor's going to adopt it and you're going to get left behind. We're measuring tokens. We have token leaderboards — see which engineers can get to the top of the token leaderboard. If you're not spending five times your salary on tokens, you're going to get fired. So all this pent-up frustration that has been around for decades with slow engineering is solved, and we're going crazy with it. And I'm not saying there's not a lot of positive from it. A lot has changed for the better. I work on an AI coding agent — I believe in this in a lot of ways. But it's not universally good. And I want to talk a little bit about how things are so different now.
And we look back at all this frustration, and I think what I'm realizing is that frustration was kind of saving us from ourselves. To understand this, let's think about how things used to work. You had your typical organization with product, engineering, and design roles. Because engineering was so backlogged all the time, product and design would work together and refine ideas before they were brought to engineering. It was a lot cheaper back then — two years ago — to have a mockup in Figma than it was to build a working prototype. So a lot of ideas would just die at this phase. Someone would have an idea, they would have to go work with design, they'd think through it, and they might realize, okay, this actually didn't make any sense, or it needs refining, and the initial idea turns into something totally different. By the time it bounces through the organization, a lot of the ideas die, or they get refined into something pretty decent, or they get shelved and brought out later. And that was a natural thing that was happening — there was all this filtering going on. Now things are a little bit different. Anyone in your organization can ship an MVP: they can prompt a coding agent, spend an hour with it, and implement a feature that they think is good. And this seems obviously good — why would anyone be against being able to experiment and build stuff and iterate and try things? Obviously, it sounds like a good thing. But the sneaky thing about MVPs is they look almost done. You spend an hour building something and it basically looks like it's there. At that point, there's momentum behind it. The moment something looks like it's basically there, it has a life of its own. At that point, it's inappropriate to really think about it from first principles or question the whole premise of it — it's basically already there. People around you aren't going to be a roadblock or get in the way. And it ends up in the product. These ideas go from someone having the idea, to prompting it, to spending an hour on it — it's barely any work — and then it's in the product the next week. And we're told this is actually a good thing. We're told that in the new era of AI, it's all about: you have a problem, you solve it right away, you ship the fix right away. The faster you go, the better — go fast, fast, fast. That's kind of the vibe of everything. And that adds to it even more, right? We're not questioning anything that we're doing. And so, unsurprisingly, this creates bloat. Products end up super bloated. They end up with features in weird spots. They end up with three different ways to do things. And it's making me realize that without the previous checks and balances of things just going slow, we just didn't have a lot of good ideas. Most of the stuff that we're shipping — they're bad ideas. I look at our own products — the products we work on right now have been out for less than a year — and I'm like, what are all these features? When did these get in here? We should never have shipped this; we should never have shipped that. It's just gotten so easy that things slip through into the product without anyone really thinking twice. And this is kind of messing up the whole team dynamics as well.
For the first time ever, design is behind engineering. Stuff just gets shipped — shipped out there before design has even looked at it. So now they have a huge backlog of 100 features that are already shipped that they need to go through one by one and polish. And independently polishing 100 different features one by one doesn't add up to a good product. They're not doing their role, which is to think cohesively about a product — to think about the experience end to end and create a proper, universal experience. It's a kind of one-off reacting to stuff that's going out. And this is changing the engineering side as well. Historically, if someone came to you and wanted to build a new feature, or iterate on a feature in a system that already exists, you would look at the system and think: okay, this feature doesn't really fit into this system, so we'll have to rethink the system from scratch. There's going to be a lot of work; we have to redesign it to support this thing. Of course, there's always hacks, but you have to pay the cost of that hack. You have to be the one to go and hack this thing into the system, and anytime that hack later rubbed up against other things incorrectly — because it interacts with every other feature you have — you had to deal with the pain of that. You no longer have to deal with the pain of that. You can go tell your agent, you know, hey, do the [ __ ] for me. And you don't have to deal with the dirty work, really. So engineers' willingness to ship hacky solutions — we're just a lot more willing to do that. Our bar for what we're willing to do to our codebases is on the floor at this point, because we're not paying the price for the cost of it. And that really shouldn't be the case, right? Just because you can offload the pain to someone else — in this case, it's not a real person, though some people think it's a real person — doesn't mean we should change our philosophy on what we're doing or how we're doing things. So that's also impacting the engineering team as well. And of course, we've had the historical excuse, which is: it's okay to ship hacks sometimes. You make the judgment call that it's better to get something out now and deal with it later. You have an excuse — we'll get back to it later. You intend for that to be three months; it ends up being three years, and by the time you get to it, you totally regret having done it in the first place. We've got whole new excuses now, right? It's okay if this is bad — the agent will fix it later. It's okay if this sucks — the models will get better and it'll just kind of solve itself. It's a completely faith-based approach: just magically it's going to get fixed in the future. Which, of course, it doesn't. It really doesn't happen. And the net result of all these changes is just a ton of rot. Our products are rotting so quickly. I think we can all feel this in our own products and in products that we're using lately: they just feel really old, really fast. You use something that came out less than a year ago, and all of a sudden it feels like it's already five years old — like post-private-equity-acquisition, like enshittification. This is happening in a matter of months. And again, it comes back to the root problem.
We don't have a lot of good ideas. When we just ship things unchecked, we speedrun that life cycle of product deterioration. And it's happening at crazy speeds these days. So, the key issue here is restraint. We have more power and capability than ever, which means it just magnifies our judgment. So we need to exercise a lot more restraint. And I don't have a lot of good ideas on how — for me, I just look back to what traditionally has made sense and try to really keep that in mind. So when someone comes to you with a problem, or a user has a problem, if you just react and fix that problem right away, you're going to make 10 different solutions for 10 different problems. If you slow down and wait — you listen to this problem, you listen to that problem — you might realize, hey, these 10 problems seem unrelated, but they actually are related. If we ship this one thing, it'll fix not only those 10 problems, but also the 50 other problems that no one's even brought up yet. That's actually your job when you build product, right? Your job isn't just to be a prompt router, routing from the user complaining to the agent. You have to slow down, think, absorb, and really make the call on what you're actually shipping. So that's very important: slow down, absorb, and try to ship high-leverage things that actually solve a lot of problems. The other thing is, I think a lot about the onboarding cycle. Every product basically has one good idea in it. And your job is to get the user from not knowing about your product to understanding that one good idea as fast as possible. All the other stuff that you come up with is important and maybe useful, but it's secondary ideas. So it's not going to go into the main onboarding. And a lot of companies mess this up. I think we've all had the experience where you open up a product and there are a dozen different directions you can go, and the people working on that product are like, "Oh, we're giving them so many options. They're going to play around with this and try everything." People don't do that. They just give up and they leave. So you shouldn't mess up that one flow of getting to your good idea. Which means that for every new idea you think of, you need to craft the path: okay, they're a user and they're using my product — how do I take them from there to the point where they understand this new feature? How do they learn about it? How do they discover it? How do they know when to use it? How do they know how it works? It's really hard to come up with this stuff. I have had so many great features where I'm like, "This is an awesome feature. I love it. It's so useful." But I wasn't able to find a way to get a lot of people to actually discover it. And so we don't ship it, right? We don't ship features like that. And it's painful, but again, it's restraint. You don't want to put stuff in there that has no actual way of being used. So, yeah — if you keep these in mind, you naturally don't ship as much stuff. Right now you're having an idea every single day, but if you apply these filters, most of them don't really pass. You're not going to have a good idea every day. You're not going to have a good idea every week. If you have one every couple months, I think you're doing pretty well.
And that's a good cadence to aim for. Regardless of how fast AI is letting you go, there's no reason why you suddenly have 10x the number of good ideas, right? So, a couple good ideas a month — I think you're doing pretty good. To close off here: thinking back to the frustration everyone's felt in the past, I think we can now look back and feel a little grateful for it, because, like I said, it was saving us from ourselves. We were moving a lot slower. It was filtering out a lot of bad ideas on its own, and we never had to confront the fact that most ideas were bad, just because they were naturally being taken care of. And as we enter this new era where that process is going away, we have to be a bit more aware and intentional: hey, most of our ideas are not good. And that's okay. That's exactly how things are supposed to be. All right, that's all I had. Thank you, everyone. >> All right, give it up for Dax. And then, just to confuse you, the next one is Dex. So, Dex needs no introduction — actually, a lot of people know him because he's a veteran of AI Engineer. If you have watched some of his talks, one of them is very famous: it's called "No Vibes Allowed." But I'm still gonna give a little introduction of Dexter for people who do not know him. Dexter is the founder of HumanLayer, and I actually had the honor to meet Dex in San Francisco, which is where he's based. Dex has always been a prominent figure in the AI community, and I'm really glad that he is here with us today. So, today he's gonna tell us everything we got wrong about RPI. So, tell us more, Dex. >> Amazing. Thank you, Ethel. >> Thank you. >> And thanks, Dax, for that wonderful intro — sorry, she said "give it up for Dax" and I literally thought she was saying "give it up for Dex." So I made the mistake I was about to make fun of all of you for, which is praising me on Twitter for all my hard work on the OpenCode project. That's the other guy. I have been doing coding agents for a while, though. I think No Vibes Allowed is almost up to — I wanted it to get to 500k — anyway, we've done a lot of talks. It all started with this guy, Eigor. I'm not going to go deep on this, but it was basically: hey, when you use AI, you ship a lot more, but a lot of it is fixing the slop you shipped last week, and it doesn't really work for brownfield codebases. And what we want to do is solve hard problems in complex codebases. We had to figure some stuff out. We posted our methodology — RPI: research, plan, implement — on Hacker News back in September, and it was on the top of the front page all day. There are probably about 10,000 people who have gone and grabbed our open-source prompts. I found public evidence that RPI is in use at companies like Uber and Block, and private evidence of a bunch more that I can't talk about. Which sets us up for a great talk about RPI and why it's so great. But we're not going to do that. We're going to do a different talk: I'm going to tell you everything we got wrong about RPI. Because we thought we had this thing figured out, and of course, models change really fast and this whole world is changing — every three weeks there's a new thing — and I think we got a couple things wrong. Standing by. All right, we're back. We got a couple things wrong. One of the things was we said it's okay to not read the code. We advised people to read very long plan files.
And we said Claude can have a little slop — well, we never said this; it was implied, though. It was, you know, let the model cook and we'll be fine. And you all have been doing your homework, and I think at AI Engineer Europe we kind of figured this out: there's this continuum now, the Zecharopo continuum. I'm going to tell you how we ended up all the way over here, despite being all the way on the other side of the spectrum six months, eight months ago. But first, to recap — and I can't see many of you, so you have to raise your hands very high — who's run this prompt? Who's run research_codebase? I'm going to assume everybody. It is very bright here. What about create_plan? Raise your hand if you use this prompt. Leave your hand up if you ran it like this: "Hey, we got to go ship a feature for this thing." Leave your hand up if you ran it like this: "Work back and forth with me, starting with your open questions and outline before writing the plan." Some of you found the magic words. That's great, but it's also a problem. Since October, we've worked with thousands of engineers at companies of all sizes, trying to help people use coding agents to solve hard problems in complex codebases. And we found, over and over again, we would give the tools to an expert and they would get great results, and then they would give them to their team, and the results were not always so good. So we got in the trenches with our users, as product-minded people do, and we went to figure out what was going wrong, and we found three things. The first one was that people were getting bad research. If you recall from No Vibes Allowed — this is the one slide I'm repurposing — you would pick a zone of your codebase. You'd say, "Hey, go look over here," and you would send off a bunch of sub-agents to take these deep vertical slices through your monorepo codebase, and then you would compress all of that — how does all this work — down into a single document that is a snapshot of the parts of the codebase that matter for the task you're about to embark on. And we said we should keep this objective, right? Discourage opinions, avoid implementation planning. Research is really just the compression of the truth about the codebase and how it works today. And really good engineers would — and we noticed this pattern — take the ticket and turn it into questions, and then pass the questions into the researcher. So if the thing you're building is, oh, we need to add a new endpoint to reticulate splines across tenants, you might ask questions about how endpoints are registered, what touches splines, and the worker program that handles reticulation. But a lot of people would just be lazy and say: research — I got to do this thing; go research the codebase for me. And this was a problem, because if we tell the model what you're working on — a good research doc is mostly facts, but a bad research doc will have opinions. And these models are so, so, so deeply trained to go solve our problem that they're going to steer the research towards their thoughts on the first thing they picked as the right way to solve it. We'll get more into why — models do get to have opinions, just not at this part of the workflow. It comes back to this idea: do not outsource the thinking.
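Here's a minimal TypeScript sketch of the pattern Dex describes: one context window turns the ticket into neutral questions, and a second context window answers them without ever seeing the ticket, so the researcher can't start steering toward a solution. `callModel` and both prompts are invented stand-ins, not HumanLayer's published prompts.

```typescript
// Hypothetical two-phase research pipeline: the researcher never sees the
// ticket, only the questions derived from it.

declare function callModel(system: string, user: string): Promise<string>;

async function researchCodebase(ticket: string): Promise<string> {
  // Context window 1: turn the ticket into factual questions, no solutions.
  const questions = await callModel(
    "Turn this ticket into factual questions about how the codebase works " +
      "today (e.g. how endpoints are registered). Do not propose solutions.",
    ticket,
  );

  // Context window 2: answer the questions. The ticket is hidden
  // programmatically, so the output stays facts, not opinions.
  return callModel(
    "Answer these questions about the codebase with facts and file " +
      "references only. No opinions, no implementation planning.",
    questions,
  );
}
```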
We also saw people getting bad plans. And this is a really interesting one — hopefully you get some takeaways here, things you can apply to your own prompting. We had this single prompt with like 85 instructions in it, and basically, if it worked properly, what you would get is a — a laggy YouTube video. Try one more time. You know what? This is why we have backup slides. It would look like this: the model would go back and forth and ask you a bunch of questions, and then it would walk through and ask you what order you wanted to do things in and how you wanted to test them. And only then, after this long conversation where you'd built up all this shared understanding of the problem, would it write the plan. The problem was that if you were in a hurry and you didn't prompt it quite right, it would just spit out a plan. It wouldn't ask you any questions. It wouldn't put you in the loop. You were just getting whatever the model decided was the first way to solve the problem. And that's basically the same as just prompting it to go do the thing at that point. So we gave the tools to an expert and they got great results, and some other people didn't, and we were like: what's the difference? And this was an embarrassing thing to say in customer onboarding, but "you have to say the magic words," apparently, was the challenge. If you didn't say this thing — I mean, if you've been prompting LLMs for a while, you know: repeat the most important instruction at the very end of the prompt and at the beginning. So we said, "Work back and forth with me, starting with your open questions and outline before writing the plan," and then it would follow the process. But if you didn't do this, 50% of the time it would just skip that. And this was not the user's fault. If you build a tool that requires hours of training, go fix the tool. So we dug in — why are these steps getting skipped sometimes? — and the basic takeaway, for whatever AI thing you're building, whether it's coding agents or something else, is: you have an instruction budget. My co-founder Kyle is here somewhere; he wrote a really good blog post about how to tune and optimize your CLAUDE.md. And the big takeaway was that frontier LLMs could really follow about 150 to 200 instructions before they're just kind of half-attending to all of them. Obviously, that was a year ago, so inflate the number a little bit, but there is a budget. And so this prompt — 85 instructions, plus your CLAUDE.md, plus your system prompt, plus your tools, plus your MCPs — is very unlikely to get great adherence. I'll talk about how we fixed this. The other thing that we recommended and did was plan reviews. We advocated: look, if you're not going to read the code, you've got to read the plans. This is me on stage in November saying you have to read the plan. Some folks even code-reviewed them; they would get together with their team and read the plan. But a thousand-line plan was about a thousand lines of code — it's the same order of magnitude; it would be about the same amount of reading either way. And plans can have surprises, so you'd end up reading the plan, and then someone would implement it, and then you would have to read the code again. You're actually doing more work, not less — you know, thousand-line plan, thousand lines of code, etc. That's not leverage.
That's actually doing more work. So the new advice is: don't read the plans, just read the code. Right. Yeah, I know. I am humble enough to admit when I was wrong. Here we are. This is all a journey. Don't forget to learn. There are other ways to get leverage, though — other ways to get more out of less — and we'll talk about that. But this is how we ended up all the way on this side, the Mario side, of the continuum. And again, yes, you could say, "Hey, Dex, in August you said don't read the code." Yes, we used to be all the way over here. I am humble enough to admit when I was wrong. These things change. Please read the code. We tried not reading the code for like six months. It did not end well. We ended up having to rip and replace huge parts of that system. And all of you now who are just finding out about the lights-off software factory thing — "we just won't read the code" — I'm like, all right, be careful. If you have people who depend on your code, if someone's going to get paged at three in the morning when something is broken, please — I'm begging you — please read it. There's an entire profession here on the line, and we need to save it. This is why I'm kind of iffy on the agent-swarms thing, because the bottleneck — last year it was, how can you spend as many tokens as possible, and this year it's going to be, okay, what's the right speed you can go? Because if you go 10x faster but you're going to throw everything away in six months, that's not actually productivity. That's just burning time and money — your employer's time and your time. I do think you can get to 2 to 3x and still read every line of code and own it and have good architecture, as Mario would say. I'm not going to say it so we don't get demonetized, but everyone is racing to build these lights-off slop factories, right? And I think what's going to happen is you're going to wake up one day and no one's read the code in three months and you have a bug that the agent can't solve. And then you're going to have three weeks of downtime as you re-onboard everybody on your team back into the codebase they haven't read in three months. And in those three weeks, you lose all your customers, and now your company is dead. It's not going to happen to everybody; it's probably going to happen to somebody. Be careful. So, we're going to try to token smarter, not harder. And we do that in a couple different ways. The research part is the least exciting of it, but basically: take your ticket, turn it into questions, then make the research. We can do this with prompting and workflows. We basically hide the ticket from the researcher programmatically. You have one context window to generate questions, and then you just feed those questions in to generate the research. This could be done trivially with any sort of AI query — if you've built deep research, this technology has been around for a while. We also have to get better plans. Before I was the coding-agents guy, I was the context-engineering guy, and we talked a lot about how you don't quite use tools-in-a-loop all the time. And there were two reads of context engineering. One was: put better information in the context window. This is the most common one. Anyone here run the RAG pipeline?
To be honest — come on. I actually can't see any hands. I assume some of you are just lazy and not wanting to raise your hands. It's also about getting better instructions into your context window. And of course Jeff is here and he's talking later — I used to have to explain who Jeff was; very exciting now that everyone knows who Jeff is — but he talked about how the more you use the context window, the worse results you get. We talked about the dumb zone: that's over about 100,000 tokens — that's a Claude model number; the GPT number is different. You're basically not just giving the model too much information; you're probably also giving it too many instructions. A very simple example: you're building a customer support bot and you use prompts for control flow. You say, okay, if the input is this, do this. If the input is product feedback, do this. If it's a billing issue, do this. Give it a whole bunch of tools and say, hey, go do the thing. This probably works, but as it gets bigger, your performance and accuracy will probably go down. And what a lot of people end up building is an initial step to classify, and then smaller instruction modules for each of the different classification cases. You build this workflow, this pipeline, and it will be faster, more performant, more accurate. You could probably use a smaller, dumber model if you build your system this way. And so we took create_plan, which was a single mega-prompt — it's supposed to look like this very specific guided workflow — and we split it across several prompts: design, structure, and planning. We're not going to talk about implementation today, but it's got these three different phases, each building on the last, and where before it was 85, now they're all under 40 instructions. The lesson: don't use prompts for control flow if you can use control flow for control flow. Switch statements and if statements are actually kind of good. And this is not just for coding agents — you can imagine how this would apply to every single AI application you might build.
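As a concrete illustration of "control flow for control flow," here's a hedged TypeScript sketch of the support-bot example: a tiny classification step first, then a plain switch statement routing to a small instruction module per case. All names here — `callModel`, the categories, the instruction text — are invented for illustration.

```typescript
// Hypothetical classify-then-route pipeline: the switch statement, not the
// prompt, carries the control flow, so each prompt stays small.

declare function callModel(system: string, user: string): Promise<string>;

type Category = "bug_report" | "product_feedback" | "billing_issue";
const CATEGORIES: Category[] = ["bug_report", "product_feedback", "billing_issue"];

// Small instruction modules, each well under the instruction budget.
const instructions: Record<Category, string> = {
  bug_report: "Collect reproduction steps, then summarize for a ticket.",
  product_feedback: "Thank the user and log the feedback with a short summary.",
  billing_issue: "Ask for the account email, then explain the charge.",
};

async function handleSupportMessage(message: string): Promise<string> {
  // Step 1: a tiny classification prompt; a smaller, cheaper model works here.
  const raw = await callModel(
    `Classify the message as exactly one of: ${CATEGORIES.join(", ")}. ` +
      "Reply with the category only.",
    message,
  );
  const category = CATEGORIES.find((c) => raw.includes(c)) ?? "product_feedback";

  // Step 2: real control flow picks the small instruction module.
  switch (category) {
    case "bug_report":
      return callModel(instructions.bug_report, message);
    case "billing_issue":
      return callModel(instructions.billing_issue, message);
    case "product_feedback":
      return callModel(instructions.product_feedback, message);
  }
}
```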
It was kind of funny, because we got up on stage in June at AI Engineer and we said, don't do full-fat agents — tools in a loop? No, no, no. Make these workflows, do these pipelines. And then by August, it was kind of like, this Claude Code thing — that thing's pretty good; tools in a loop might be back. And we turned around and built these monolithic prompts, these giant complex workflows. So I decided it was time to drink our own Kool-Aid and apply this stuff to how we were doing things. I hear a lot of times, "Well, hey, Dex, won't this just get bitter-lessoned?" And I assume none of those people are here, but all of you who hang out on Twitter and say this to me — I want you to know that when you shout this at me on Twitter, this is the voice I assume you're saying it in. This is what plays in my head. Because in my experience, the way this works is: you've got a given frontier model. It's got some capacity at a variety of tasks, and through naive prompting you get certain things. Then we come in and do our context engineering, and we can make it better at the tasks that are relevant to our problem. Then a new model comes along and makes most of that work irrelevant — it's better at most tasks. Maybe it's not as good at certain tasks, but then we do more context engineering and make it better at the next set of things, and we push the frontier. We're always going to be, you know, 1, 5, 10% past the frontier compared to naive prompting. And this matters, because if you're doing long-horizon agentic tasks, by turn 20 the difference between 99% and 97% per-turn accuracy is a 27-point gap, because it compounds: 0.99^20 ≈ 0.82, while 0.97^20 ≈ 0.54. Dan Shipper, I think, calls this surfing the models: you can get better at using the new model faster than the new model can get better. swyx's take is, right, the bitter lesson will kill this someday, but hey, it works for now — let's do it. And we do that over and over again until we get to AGI. So if you can get Opus latest to solve harder problems, or maybe you can get GPT-OSS, which is small and cheap, to do the work of GPT-5 high, you've captured something really interesting. So mind your instruction budget — and no, having more context is not going to fix this; it's just the same amount of attention spread out over more instructions. And then we also found we got better leverage. We split these things up to get better instruction following, but we also got more leverage. You can look at the structure outline, and this is the plan that was built from that structure outline: it's less work for a human to review the first thing, and in general these higher-order documents are designed to be higher leverage — the human reads less, the model uses fewer tokens, and we're talking at a higher level before we go down into the details. So the design discussion is basically: where are we going, and what does the final solution look like? It's got current state, desired end state, patterns to follow — what codebase patterns did the model find that are relevant to implementing this feature, and what architecture patterns do we want to follow? Especially in legacy codebases, there are always six ways to do everything, and you find yourself saying, oh no, you found the bad pattern — no, we have to go do that one over there. This is your chance to do brain surgery on the agent before it actually goes and slops out a bunch of garbage. It'll track your resolved questions, your design questions. It's sort of like Claude Code plan mode, but written down to a markdown document. Matt Pocock called this out: it's like the design concept — Frederick Brooks had this idea of the thing that is never written down but is in everybody's head. When you have this shared alignment with the model about what you're building, it's usually locked up in the context window; we put it in a 200-line markdown artifact. And this gives you alignment with the agent. So you iterate on this thing. Again: do not outsource the thinking. You want to give the agent every opportunity to show you what it's thinking — to brain-dump its entire understanding of the problem and what it thinks you want the solution to look like. And you say, okay, why do we need humans in the loop at this point? Basically, because you can't RL a model on architecture, because the cost function of bad architecture is measured in months and years, not in five-minute unit-test cycles. We also got better leverage on the structure outline. Design is where we're going. Structure is how we get there.
If you want to map this to the meetings that make engineers miserable: you have your design review, your architecture meeting, and then you have your sprint planning. So you take your design and all the previous stuff and you build your outline, and it's just a high-level outline of what we're going to do and how we're going to check it along the way. And again, it means lighter reviews — you can read this and understand where we're going. We need humans in the loop here too, because somehow models just absolutely freaking love these horizontal plans. By that I mean — I'm sure you've seen this — models love to do the whole database layer, and then the services layer, and then the API layer, and then finally the front end, and before you know it, you're on the other end of 1,200 lines of code and something is broken, and the surface of stuff you might have to debug is quite large, which means it's going to be hard for you, and hard for the model, to figure out what's wrong. So what we advocate for, and what we do when we use this stuff internally, and what we tell our users, is: build it the way you would have built it as an engineer. You don't write 1,200 lines of code before you check. You write a little bit of code, and then you check something. Write a little bit of code, then you look at something, then you wire your biz logic, then you do your error handling. This is your chance to re-steer the model. They're just markdown docs. You can ask for more detail, but they start super high level. So if you don't trust what the model's going to do from your high-level doc, ask it to add more detail. And the way you get more leverage from the plan is: you basically don't read it. It's for Claude. It's for the agent. You just spot-check it. You save that deep review for the actual code. But it does have the line-by-line changes. And it's not just about human-to-agent alignment — it's also really powerful for human-to-human alignment. Before AI, it was very common for a two-hours-of-coding feature to take two days, because you have to plan and align — this is in large orgs, hundreds of engineers. You've got to align with other teams. You've got to do the planning, the code review, the rework, the testing. And if you're just using AI for coding, then you will go faster, but you're not going to go 2x faster, you're not going to go 3x faster. But if you use the model to help you do planning and alignment with your team, then you're going to get better alignment, because it's going to be more thorough, and then your code review is also going to be faster. There's a class of thing that is not worth doing if you are writing it by hand into a Google Doc to share with your team, but is worth doing if AI can help you do it, and it really helps us compress code review cycles as well. I'm sorry, I don't have an answer for you on testing and verification. Much has been said about this; that's for another talk. So if we want to put this all together: five phases to actually even create the plan, and then we go implement it. We call out questions, research, design, structure outline, plan, worktree, implement, pull request — that doesn't make a good acronym, so we just picked the letters we liked. This is QRSPI — "crispy," if you want. We found it's a really powerful way to get teams rowing in the same direction.
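To make the shape of that pipeline concrete, here's a hedged TypeScript sketch of chaining the phases as ordinary control flow, with a human checkpoint between the high-leverage documents. The phase prompts and function names are invented; as Dex says below, the actual "crispy" prompts aren't published.

```typescript
// Hypothetical QRSPI-style pipeline: each phase is its own small prompt in
// its own context window, instead of one 85-instruction mega-prompt.

declare function callModel(system: string, user: string): Promise<string>;
declare function humanReview(doc: string): Promise<string>; // edit/approve gate

async function crispy(ticket: string): Promise<string> {
  // Questions + research: the researcher sees questions, never the ticket.
  const questions = await callModel("Turn the ticket into factual questions.", ticket);
  const research = await callModel("Answer with facts about the codebase.", questions);

  // Design: where we're going — current state, desired end state, patterns.
  const design = await humanReview(
    await callModel("Write a short design doc with open questions surfaced.", research),
  );

  // Structure: how we get there — vertical slices, each checkable on its own.
  const structure = await humanReview(
    await callModel("Outline vertical slices with a check after each one.", design),
  );

  // Plan: line-by-line detail. It's for the agent; humans just spot-check it.
  return callModel("Expand the outline into a detailed implementation plan.", structure);
}
```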
In terms of the things we have to solve: okay, three steps was already a lot and took seven hours of training — now there's another seven. I don't know if anyone else has built a crazy Claude Code-like command system and then tried to teach it to somebody else and just stormed out because it was frustrating. We also have to know what's working and why — what is the impact of this stuff? And then, if we want to make changes to our prompting system, how do we evaluate that? I have a whole other talk on how you drive AI adoption in a large team: you need a process, and then, as Eigor would say, you need a defensible metric, and then you need somebody off in the corner who's shipping like crazy and it's definitely not slop. But you can go do this. You can go try this in your team. There is no magic prompt. We actually don't publish the crispy prompts, because the core of this is: understand context engineering and instruction budgeting, and if you're not getting good results, break it down into smaller workflows. We leave the derivation of the prompts as an exercise for the reader. You should go get the open-source ones from HumanLayer and try to take the three prompts and break them up into, like, eight. Try your own stuff. Come to your own conclusions. Like I said — this was me 10 minutes ago — get in the trenches with your users. Be ready to spend hundreds of hours watching people struggle to use your stuff. It is immensely frustrating and gratifying, and my favorite part of the job. But if you're not up for this, consider whether you want to build your own prompting system. If you want to help with this problem, we're building an IDE for collaborating on coding-agent sessions, with all these opinions baked into it. We're open to design partners. We're hiring founding engineers. Send us a note: founders@humanlayer.dev. Keep learning, keep being wrong, keep adjusting your understanding of the meta. It's always changing. Thank you so much. Thanks, Dax, for the warm intro. Thank you, Miami, for welcoming us in. I'm really excited to hang with you guys and the rest of the speakers and learn with you all this week. Cheers. >> Okay. >> Thank you, Dexter. >> Okay. Well, thank you for your attention. And before we introduce the panel coming up next — >> and we do see some chairs; they look very comfy — >> we're gonna thank our sponsors, because without our sponsors, we wouldn't be able to be here gathering together. So I want to thank CodeRabbit, Cerebras, Mintlify, Sentry, Tailscale, and Cloudflare, >> and Modem and the Aify, Auth0, DeepMind, >> Encrypt AI, and City Furniture. >> So give it up for our sponsors for bringing us together. >> And as we're setting up, Iman is going to introduce the next speaker. >> Okay, so where are we? Who's next? So far we had Dax, then Dex. Guess who's next? Max. Okay. Max is joining us from OpenAI, the folks behind ChatGPT. I tried to come up with some AI jokes for the conference. I tried ChatGPT. They were unfunny. So, feel free to switch to Google Gemini. Unless Max can change our minds — let's see how we do. And instead of a talk, Max is going to surprise us with a panel.
And this panel includes himself, of course, as well as Eric Thorelli, who's going to moderate the panel, Sunil Pai, and Ben Vinegar. Let's hear it for them. Sure. Sure. Oh — >> now. >> Okay, cool. How's everybody doing? We're going to try to do a group selfie to start this thing off. Can we get the lights up? Yeah. Okay. Awesome. >> Hey, hold on. Can everybody stand up, stretch, get some energy? Yes. All right. Miami. All right. Nice. Okay, cool. I'm going to post that, but instead of doing 15 seconds of awkward silence while I post it, I'll just not listen to your guys' intros. >> Thanks. >> Is your mic on? >> I think so. >> Okay, cool. So, my name is Eric. I work at CodeRabbit — I'm the head of DX at CodeRabbit — and that's all I'm going to say about me this whole talk, hopefully. I thought we could go around and introduce yourselves: who you are, what you did in your career, and when you first learned that you have taste. >> Oh jeez. How much time left? 30 seconds. Hi, I'm Max. I work at OpenAI now. I work on connectors across ChatGPT and Codex. If you've ever had ChatGPT and/or Codex connect to anything other than itself, that's what I work on now. I'm probably most widely known for making styled-components, which was a CSS-in-JS library that lots of people used back in the day, and that I think was widely renowned for its taste. >> Hey, I'm Ben. If you know me, it's maybe because I worked at Sentry for a long time, worked on that product. I'm told it has pretty decent DX. I actually started as the JavaScript developer who was responsible for making that SDK work for the JavaScript community. So I kind of worked from that perspective, and, I don't know, people use it. So I think that's something. >> Hi, my name is Sunil. I'm tech lead on the agent at Cloudflare. Spent some time on the React team, and at Oculus. Spoken at React Miami — fun. I had a very boring real-time multiplayer infrastructure startup which had the best company name, PartyKit. So I have some taste. >> All right. So, taste is one of those things that everyone says they have, but I think we should define it. So maybe, starting from the opposite end: what is taste? >> So, taste and imagination, I think, are two sides of the same coin. The first one is about focus — trying to remove the things that don't matter in terms of experience, in terms of storytelling, and so on — and imagination, I think, is about broadening it: from a place of taste, exposing yourself to everything that humanity brings. So, full of myself — I love that I get to answer this first. But yeah, I think it's about focus and expansion. >> I am not possibly smart enough to follow that up, so I'm just going to go over to Max. >> I think taste is what we call opinions that make sense. That's really what it is. When people feel like an opinion makes sense, and they can't quite put into words why, they just say that person has taste. >> Okay. All those answers were boring. So, can we get: what is bad taste? What does bad taste look like? >> Opinions that don't make sense, Eric. Obviously. Obviously. >> Ben, are you gonna — >> What? What is bad taste? Man, you told me these were going to be kind of spicy, and I was not prepared for this. I've got to punt again. >> Great taste. >> What is bad taste?
I think when people focus on the things that they just like as a person and try to generalize that to everyone else, it comes off not very tasteful. >> This isn't a very JavaScript-focused panel. Okay, we can make it more AI here. I'll save the spicier questions around AI as follow-up. Does taste scale with AI, and how can you make it scale with AI? >> One of the interesting experiences I've had working with agents, especially over the past six months, is that agents can solve lots of problems now, but they often solve them at the wrong layer. I don't know if you guys have noticed this. You give it a problem, and that problem can usually be solved at the front-end level, or the back-end level, maybe at the database level, whatever, right? There are lots of layers where you could solve a problem, and AI still today pretty reliably picks the wrong one, in my experience. And you end up with a solution that's 3,000 lines of code that somebody has to review: it kind of solves the problem, but it's also kind of messy. One of the hardest parts I've found about working with agents, even still today, is teaching them how to solve problems at the correct layers. And I actually think that's a lot about taste. It's about having the experience to know: if we solve the problem at this layer, it's going to have these trade-offs and ramifications; at a different layer, different trade-offs and ramifications. And I feel like agents don't really have that today. >> I'm going to answer the earlier question first. I'm slow, I need time. I think taste is maybe a shortcut to assessing something's quality. People who do not have taste need to look at things like metrics or box-office numbers or number of downloads to say, "Oh, this is good, right? This is good because I see these statistics, and that's evidence that it's good." Whereas somebody who has taste can look at something and go, "That's good," and then maybe over time that's validated in some way. Now, as it pertains to agents, and why I think taste is important in the age of agents: if you have that shortcut ability, if you can look at the output and quickly go "That's good," you can be way more effective in the age of agents, right? Agents can produce something, you can look at it and go, "That's good." Somebody without taste might not find that out until way later, and you could be way deeper into the hole of how horribly the code has gone, or the numbers not making sense, before you realize where you are. I don't know if that makes sense. >> I fundamentally don't believe AI helps with taste. Well, LLMs specifically. Because the idea is that all ideas are spread out in latent space. So if you imagine it like a 2D map, and you ask it a question — "Hey, I want to solve this, make a UI" — it starts honing in on particular areas. But stuff like taste spans ideas across different parts of latent space. "Oh, what if I took this popular art style and applied it to this movie," so to speak. So I don't fundamentally think the exploration of latent space helps with forming taste. There's a really good book called Where Good Ideas Come From by Steven Johnson that goes into it.
You should read this book, by the way. It's a very old book, and it's dope. >> Yeah, it's so nice. >> I actually kind of disagree, because I feel like it does help get more parts on the table. It does help explore the space much quicker, and you can do way more iterations of figuring out which parts of the space you're exploring actually matter to you. You can build 10 throwaway things and thereby build up your taste through knowing which of them suck or don't suck. >> Right, but all of these have to start with one person saying, "Oh, I want to explore the space. Okay, now I want to do this." And then I get to sit and decide, "Oh, fine, this is the thing that connects it." This is why you end up with these vibe-coded slop UIs from people who don't know any better. >> Yeah. I don't know if any of you have ever looked at a Figma file where the designers end up making these explorations that go down and to the right. I don't know if you've ever seen a designer work this way. You go down, you make major iterations; you go to the right, you make minor iterations; and you end up with a staircase of explorations of ideas, and at the end is the final screen that you're supposed to be looking at. And I feel like exploring with AI is very similar to that. >> So can I drive the AI to explore? Could I just say, "Don't make it sloppy. Use good taste"? Or is there more to it? >> No, you can't. >> Yeah, you can't, I agree. No, you have to know what to explore. And I think that's the taste that you have to apply. >> And where does that taste come from? >> You have to watch good movies. You have to read good books. You have to listen to good music. I'm not even joking. Every time I hear a VC talking about taste being the moat, and I see that they have a crypto ape avatar, I'm like, "Oh, I can't trust this person with taste at all." No, this is how you develop taste: by reveling in what humanity has to give you. >> I mean, I agree. I've been told I have good taste. It's weird to say that. >> You have such a great haircut, and this is clearly part of your identity. I mean it, by the way. I'm not just gassing you up on stage. This is a person who knows how to present themselves. >> Ah, okay. I'll take that. But I wanted to plus-one that. I actually agree with all the pop-culture stuff. I don't think it's a waste to have watched every single Ninja Turtles cartoon that has been produced. There's a lot of good stuff in there. And extrapolate that to music, art, photography, whatever. It's all good. >> So what I get out of this is that I can expense a trip to Napa to taste some wine, because I'm developing my DX taste. >> You joke, but especially the first two hours of a Napa trip are good. Past that, I assume you're not really seeing straight. That's where you need to do your research. >> There's a lot of gaudy stuff in Napa, just to warn you. >> Okay. So, you talked about the moat as well. We need to get a little more controversial here. You guys need to stop being so nice to each other so we get some disagreements going. You talked about taste as a moat. Some people say they have it — even people that don't have taste. You are all three kind of leads of developer experience of the previous generation, the pre-AI era. >> He just called us old.
>> I'm trying to be nice. But now you are all three heavily invested in the AI era. The moats that you were able to create with DX before — do they transfer automatically, like the wine tasting and the movies? Do they transfer to the agent era, or do you have to do something differently? >> Everyone's pausing here. >> Like, yes and no. There is clearly — and by the way, for the young people in the crowd: you go from being the youngest person in the room to being the oldest person so quickly. I still dress kind of like a child, but I'm very aware that my back hurts sitting up here right now. I'm so glad that I have a couple of decades of experience behind me, having formed opinions with my bare hands, I mean, creating UIs and so on. So I can look at grid lines and say, okay, this kind of sucks, this doesn't. And especially a couple of years ago, LLMs were particularly bad at it. That being said, I'm having so much fun right now doing this exploration, because it's not just that AI makes things faster. It's that things I wouldn't have even attempted, because they would have taken that much time, I can now do, which means I'm actually trying out more. It's not just a compression. Being able to use Chang's new pre-text library to build a wild UI experience just would have been out of scope for me four years ago. So, yes and no: yes, I'm so glad that I have opinions on how these things should be built, but now I'm actually getting to explore the bits that I just couldn't put the effort in for. Does that help? >> Yeah. I think the "no" part is that so many of the skills that we honed — or, I should speak for myself, so many of the skills that I honed — are kind of no longer relevant. Like, I type really quickly. Who the [ __ ] cares? I just dictate everything. >> That is a huge advantage. I don't know what you're talking about. >> It used to be. It used to be. >> I can prompt so fast. >> I just speak into it now. I have a whisper mic, and I dictate everything, because I no longer need to dictate syntax. It doesn't matter that I can type really quickly anymore, you know? And there are many skills that I've acquired over the course of my career that are definitely less relevant than they used to be. I do think that knowing what to work on and knowing what to explore is probably one of the biggest ones that I still use every day. >> I'll try not to talk all the time. Okay. So, we could go a couple of ways. Let's do this. Right now, going fast: you don't need to type fast, great, you can go 100x, 1000x. You can run 100 agents in parallel on the same prompt and see which one does well. So you can go super fast. And we see this in products now, where, you know, the Codex team and others are producing software that would maybe have taken years before, and doing it in two days. If you have to compete in the market at that iteration speed, how is it possible to have taste? >> By the way, I absolutely hate this 100-agents, background-agents thing. I understand that the OpenAI people want you to do that. >> I get it. So, no, 100%. How many agents are you running right now while we're sitting here? >> Zero. I don't trust these things at all. I need to see reasoning traces. >> I have zero running right now.
>> Yeah, I think that's >> Yeah. No. >> So there's a difference between velocity and just spray-and-pray. If you actually want to stand out right now, you kind of want to do less and be known for a particular way of doing things. You have to develop a brand; you have to develop focus on the things you're trying to do. I say this because I work at Cloudflare, and we ship a lot, but the things we've been working on are things that we have actually been working on for years. We just had a ship week last week, and we shipped a number of things, and I looked through all the announcements and thought, "Yep, these are things we decided to do four years ago." It means that we're still doing the things we want, but we don't want to do everything. And I would highly recommend, if you're a builder, a creator: you should not try to build four products at once. You kind of want to find focus. You want to kill two of them and say, "I want to build two great experiences at the moment." Iterate on that. Find explorations in that space. But, bro, this entire 100-background-agents thing? No, bro. No. >> I think, you know, we are still building products for humans. And on the topic of taste, if you're building a product for humans, you have to put on your human face and actually try the product, use it, and evaluate whether humans will enjoy it or understand it. And that remains the biggest bottleneck, if you want to call it a bottleneck, right? Anyone can make a million things, but whether they're good, or whether anyone wants to use them, is another matter. In my experience over the last year, using a lot of agentic coding, that is the barrier: even though we can produce things really quickly, I often look back and think, this was garbage. And of course it was, because we didn't even really try it or use it, right? I don't know. >> I think the area where I find myself having maybe not a hundred, but a bunch of background agents, is when I'm working on something relatively large that splits up into discrete pieces, and I can parallelize the discrete pieces, where each discrete piece with GPT-5.1 might take 45 minutes to an hour, right? I'm not going to sit there for 45 minutes to an hour just watching Codex do its thing. I spin up another coding session and start working on the other part in parallel. Often I'll work on the iOS implementation, the Android implementation, and the back-end implementation kind of at the same time, kick all three off, and wrangle the agents back and forth, right? But I don't often find myself parallelizing across unrelated tasks, because having five things running at the same time, all working on different things, is so much context switching in my brain. I can't keep up. >> If you're running tests and you've got, like, Vitest on 10 runners, do you go and say, "Yeah, I use Vitest with 10 runners"? >> No. >> Yeah. I'm just bringing that up because I almost think this language of sub-agents — oh, I do use sub-agents — to me, it's still one agent. Does that make sense? It's almost a technical detail. >> Yeah. I, uh, kind of. Although I actually use separate agents, which is why it feels more separate. >> Okay. All right. >> Because we actually run all of our agents on dev boxes in the cloud.
You actually have multiple laptops, basically, in the cloud that you run the agents on. You have to wrangle multiple of those. >> I ask these questions like I actually want to know the answers, you know. >> That's why we're here. That's why we're here. >> You said something interesting a moment ago, which is that we build products for humans. Some people these days are also building products for agents. Is developer experience, DX, the same as agent experience? Agent experience we can maybe define as the tasteful experience — just to use the word in the definition — for agents: the thing that's attractive to them, easy to use, efficient, effective. Is it different when you're building for agents, for agent experience? >> 100%. I think until at least a few months ago, we kept saying, "Oh, if we design it well for humans, then agents will use it well." At this point, I think that's cope, by the way. Agents have a completely different personality, and so on. For example — and I'm trying very hard not to be a shill — we have a thing called Code Mode, where we let agents interact with your systems by generating code. Oh, this whole thing, I just put it on; I didn't even realize I was going to be talking about it. We've learned that agents can interact with systems by writing code that interacts with them. We wouldn't design it that way for a human, because we can't assume every human being can write code that interacts with systems. It's a completely different kind of behavior. The way I've been talking about it internally is: if you really loved human beings when they were your users, you need to really love agents as well. Where do they hang out? They don't really hang out in pubs. They hang out in registries. They dream in syntax errors. Do you truly love your users? You have to find out what agents' desires are. And it turns out they love writing code. They love being told "thank you," and things like that. No, I'm 100% on this: 2026 is the year where we actually classify them as different alien beings, and we learn their personality and create systems that they like interacting with — I say agents "like" interacting with them, but yeah, I'm 100% on this. By the way, I think we kept saying, "Oh, as long as the docs are readable by human beings, they'll be good for agents." No, screw that. Dump a bunch of context and tell them to figure it out. I'm there now. >> Okay, we've got 50 seconds left. Let's leave the audience with one sentence, very practical, none of this highfalutin stuff. How can they have taste in their day-to-day? What does it look like? What's the behavior? >> Use 100 parallel Codex agents. >> We said no shilling. >> Yeah. I don't know. Go to an art gallery, go to a museum. And I bring that up really quickly just to say that, man, I remember 10, 12 years ago, online personas were way richer, in that I learned about how people went and explored the world. And the algorithms today really just make us singular, you know, code monsters or whatever. So I just wanted to make a comment on that. >> No, I'm there. Have friends, go out for brunch, watch movies. It surprisingly affects the quality of the work that you build. You're empathetic to >> Be human. >> Yeah. >> Thank you all very much. Thank you, panel.
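As an aside on the "Code Mode" idea from the panel: the gist is that instead of having the model emit one structured tool call per step, you expose a small API surface and let the agent write a program against it, then run that program in a sandbox. Here is a minimal sketch of that pattern, in Python for illustration; the `ToolAPI` and `run_agent_script` names are hypothetical, and this is not Cloudflare's actual Code Mode implementation.

```python
# A rough sketch of the "code mode" pattern: the agent writes code against a
# small API surface instead of emitting JSON tool calls one at a time.
# All names here are hypothetical illustration.

class ToolAPI:
    """The API surface exposed to agent-written code."""
    def fetch_user(self, user_id: str) -> dict:
        # Stand-in for a real binding (database, KV store, external service).
        return {"id": user_id, "plan": "pro"}

    def send_email(self, to: str, body: str) -> None:
        print(f"email to {to}: {body}")

def run_agent_script(script: str, api: ToolAPI) -> None:
    # Execute the model-generated script with only the tool API in scope.
    # A real sandbox would isolate this far more aggressively: a separate
    # process or VM, resource limits, and no ambient credentials.
    exec(script, {"__builtins__": {}}, {"api": api})

# The kind of thing a model might write back when asked to
# "thank our pro users":
generated = """
user = api.fetch_user("u_123")
if user["plan"] == "pro":
    api.send_email("u_123@example.com", "Thanks for being a pro user!")
"""

run_agent_script(generated, ToolAPI())
```

The appeal, per the discussion above, is that one generated script can replace many round-trips of individual tool calls.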
>> Thank you. Okay, everybody. So, just a call-out: about 400 people are watching us on the live stream. Isn't that amazing? All right, for those of us here, there's coffee for you out there. For you on the live stream: good luck. Feel free to grab your coffee, go to the expo hall, and pick up a few goodies. Let's be back here at 11 sharp. See you soon. >> Ladies and gentlemen, please take your seats. Our event will start in 5 minutes. Ladies and gentlemen, please take your seats. Our event will start in 2 minutes. They're gonna come on stage. Yes, send them on stage. I'm ready for them. >> Thank you. Once I see the curtain move, we'll start fading the music. We are ready. Send them to the stage. >> Hello. Hello. Welcome back. >> How's everyone? Anybody talk to new people? Raise your hand if you met somebody new. >> Some hands going up. We have plenty of breaks for you to network with other people. But now we're back for some more exciting talks. For our next speaker: you just saw him, he was on our panel earlier, Ben Vinegar. I'll welcome him onto the stage. Welcome, Ben. Welcome back. Okay, a quick intro of Ben. Ben is the co-founder and CEO of Modem. They have a little booth over there, so during a break, feel free to chat with his team. Modem is an AI platform for PM work. And Ben has been in the AI space for several decades — I'm not going to spoil how long, because Ben is going to explain himself. He's going to tell us a little bit more about working with coding agents over SSH. So we're going to go from local to remote. Ben, take us away. >> Thank you. Hey, everybody. Is that readable? >> Yeah. >> Barely. I'll give you one plus-zoom. Little better. All right. So, hey, I'm Ben. I'm back. Sorry. I want to talk a little bit about working with coding agents over SSH. And I really want to preface this by saying — well, first of all, have you seen these graphs before? The "going exponential" GitHub contribution graphs, from the Vercel and opencode folks? I was shocked to see this, but this is a little bit what's going on in my life over the last couple of months. It's exacerbated because I'm old and I have a really old GitHub account, so it looks even more crazy. But I have been doing a lot more of this. Now, this is not a talk about "drop everything, use Linux, use Omarchy." I don't really know that stuff. I'm actually presenting this talk to you as, like, a normal Mac enjoyer. Okay? And I hope that that's a lot of you; this is more exciting to me. I guess we just had a bit of an intro, but to repeat some of it: I've been programming for a long time, mostly web and JavaScript, but I actually started in graphics driver development, which I think is kind of fun. I've spent my whole life working as an early employee at startups. And I've been prompting since 2023. And I mean that to say: for me, the first prompting was in VS Code with Copilot, when somebody explained to me that I could write a comment, and then the early completions would generate a little bit of what you wanted back then. So I think of that as early prompting. And yeah, I'm just a normal IDE user. I like Mac. I've used text editors, VS Code, all that.
So I work at this company we started called Modem; come check out our booth. I won't spend a lot of time on it, just to say: AI coding is real. Your ability to deliver software faster is definitely real, but the mechanical product work around that — capturing user feedback, following up with users — is still pretty slow, and those are the problems we're trying to solve. If you're interested, find out more at our booth. How we've been building Modem is maybe interesting and relevant to this talk. It's 99% codegen. When we started, about a year ago, we made the decision that we were just going to become a completely agent-driven company. Kind of good timing; I think this was in the Sonnet 3.7 days. There are six engineers; the code base is about 270,000 lines of code, and there's quite a bit of test code. Just to give you an idea of what we're working with: as far as testing goes, to make this work, there are a lot of tests. Just as a random aside — this is relevant to the talk — agents love to generate mock tests. We kind of throw those out. We make them use end-to-end database tests, so these tests are actually pretty heavy. So, I don't know about you, but I have felt this way: when I go on Twitter and I see these posts by people who are spinning up 10 agents with custom harnesses — "Wow, I'm doing all this stuff" — I just didn't get it. I didn't understand how I could work this way. And I wanted to achieve more; I felt like more was possible. So I started to think about the things that were slowing me down and whether I could address them. One of the biggest ones — and this is a question I have: how many of you run your coding agents in, you know, living-dangerously YOLO mode 100% of the time? I can't see, the lights are blinding everybody. I think it's like 40%, maybe. I think agents are scary. They can do lots of crazy things. I have, you know, stuff on my computer; I don't want things to happen. I've experimented with jailbreaking. You can do it. You can mess with them. So what would often happen is, even if I had approved a million rules, I'd start jobs, come back an hour later, and discover: oops, it stopped 30 seconds in. Very frustrating, right? Another thing that was slowing me down is just pure compute resources. Like I mentioned, I think running unit tests as part of your agent loops is critical, and I'm running them all the time. When you've got a big test suite that's hitting the database, man, I would hit 100% CPU on this machine all the time, even with just a couple of agents going through testing loops. Fans spin up; I could barely browse the web or do anything else. That was getting pretty frustrating. I'm also on the go a lot. I took a plane here; I wanted to work on the plane, and the Wi-Fi was pretty spotty. They don't have Starlink on Air Canada. So often I'm in environments where the internet just wouldn't work for me. Now, there are solutions. People build solutions for this, like cloud agents — Claude now has managed agents, and I think opencode is working on something like this. Big asterisk: this is changing all the time, right?
It's so hard to talk about, because an experience from two months ago could be totally different today. But two months ago, I was experimenting with these products that let you build in the cloud, and I was just never satisfied, partly because I wanted to run tests that hit the database, and on Claude I'd hit a problem where they had a sandbox environment with a network proxy; I couldn't get out to my database provider, and I just got very frustrated. What ended up happening was, even if I could start some work in a cloud agent, I'd end up bringing it in locally, working on it, and throwing half of it away anyway, and I just didn't feel like I was getting faster. So I considered how I was failing, and it brought me back to Linux: why am I exploring all these kind of half-baked versions of Linux sandboxes? Why don't I just do the same thing myself? So the way that I work today mostly looks like this. I have this machine, and then I use SSH plus Tailscale — Tailscale is a sponsor here — and I remote into a machine. It's running tmux, which I'll talk a little more about, and then I've got a coding agent in there. And again, I'm just a normal IDE user. I've heard about these words like tmux, and they mostly scare and intimidate me. I know just enough Vim to quit. That's it. So humor me when I'm presenting this to you; I don't consider myself an expert. There are probably 20 people in this room who are already really mad at me for explaining Linux wrong. People have strong opinions about Linux distros. Ubuntu is fine. I use Arch. That's fine. If you just want one of these, you can go to a VPS provider, click a button, and they'll spin up a Linux environment for you right now, right? Pretty easy. If you want to bring your own computer, it's more work, but you'll probably get more compute. And computers are expensive right now for a reason: there's a lot of demand on compute. If that's of interest to you, I think it's worthwhile. I have a machine in my basement that I've exclusively dedicated to this. It's not a Mac Mini running OpenClaw. It is just a plain computer with a Linux distribution on it. And if you don't want to set up Linux, my tip is: let your agent do it. When I started getting into Linux in the last six months, I just didn't want to do it. But once I learned that your coding agents can actually configure it and get it going for you, it became a lot more approachable. Tailscale is pretty much just an easy way of connecting to your machines. Man, they're here, so I don't want to talk too much about this, but: you get a private network, you connect to your machine, you don't have to expose a bunch of ports, and it works everywhere. It just lets you connect. And then tmux is basically a window manager for the terminal. You get windows and panes, and I'll show you this in a moment. If you squint and pretend, you can pretend it's like macOS.
It supports the mouse, which was shocking to me, I guess because I'd never really bothered with this stuff — actually, a lot of terminal programs support the mouse, and I'll show you some of that. A big thing that tmux gives you, and its predecessor screen, is that you can rejoin these sessions. Because you will disconnect: I'll close my laptop, sever my internet connection, SSH is gone. I can come back later, and tmux will give all of that back to me. And the last thing is that it's agent-scriptable. So I've got a little demo here. Over here, this is Linux running in a VM on my machine. I've tried to do this talk with a fully remote machine, but I've learned that that is not a good idea, especially if you're trying to live-stream at the same time. So, just to show you: right now I'm on my Mac, and then I can get back into my machine. I've got this little Arch logo here to help me understand where the hell I am. And, uh, oh no. Oh boy. This is the problem when you don't actually know this stuff very well. We're going to have to open up a coding agent to help me understand how I can get back to my thing. Well, that's okay. We're over here. Oh, right. All right, I was doing a new one anyway. So, really quickly, this is tmux. It just looks like a terminal, right? That's what it is, except you can have panes. I can make more split panes. Hey, I can use the mouse. I can drag this stuff around, which is kind of neat, right? So over here I could, say, run a server, right? Oh my goodness. Well, I've forgotten how to do all my demo stuff, but anyway. >> Yeah. Look, when you've got the lights flashing in your face... I can't even actually see it very well, which is not what I was expecting, you know? I could open up an editor over here, right? And then I could even have diffs over here, or whatever, right? And I think this is pretty neat. But I didn't actually know those commands just a few months ago. So I'm going to open up opencode here, and I'll give you an example of how I think coding agents have made this more accessible, which is: "Hey, you're in a tmux session. Open up some panes and put some cool [ __ ] in there that's Linuxy." All right. There it goes. It's firing up. It's creating panes. And if you're wondering why this is fast: I'm using Kimi K2.5, just because, for demo purposes, it's the only thing that will finish fast enough. But I've got different panes here. What'd you do? You gave me htop. You gave me some live disk usage. Digital rain — I don't see that one. But anyway, I could also be like, "Okay, now close them. You made bad choices." Okay. Right. So let's bring me back here. So, tmux: if you do want to mess around with it, you can just start by having the agent do stuff. And the reason this works — and I didn't load any skill files or anything — is that tmux is controlled entirely through the CLI. So the agent is actually just calling a bunch of shell commands to do all that. It doesn't need MCP.
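To make the "agent-scriptable" point concrete: everything an agent needs from tmux is a plain CLI invocation. Here is a minimal sketch in Python; the tmux subcommands and flags are standard, while the pane targets and the `npm test` command are just illustrative.

```python
# A minimal sketch of driving tmux from code, the way a coding agent would:
# every operation is a plain shell command, so no MCP server is required.
import subprocess

def tmux(*args: str) -> str:
    # Run a tmux subcommand and return its stdout.
    return subprocess.run(
        ["tmux", *args], capture_output=True, text=True, check=True
    ).stdout

tmux("split-window", "-h")  # open a new pane beside the current one

# List panes with a format string: index plus the command running in each.
panes = tmux("list-panes", "-F", "#{pane_index}: #{pane_current_command}")
print(panes)

tmux("send-keys", "-t", "1", "npm test", "Enter")  # "type" into pane 1
output = tmux("capture-pane", "-p", "-t", "1")     # read pane 1's contents
print(output)  # feed this back to the agent as context
```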
It doesn't need anything, which is pretty neat. And these are some of the commands it can use. I wish I'd known these earlier: listing the panes, splitting the window. The other thing that's interesting, and we'll come back to this: it can read the contents of a pane and act on them, and it can send keys. It can actually be a little driver of these panes. So if you start messing around with this, you end up having an environment that looks like this. I was pretty zoomed in so that you could see it; if I'm actually at home, I'll zoom out, and I have quite a bit of surface area. And it's not just panes; you can have multiple windows. This is how I actually get four agents running: I'll have a couple of different projects or packages, I'll work in them, and I'll have my whole environment split up this way. So at the end of this, if you go through this exercise: one, you can skip permissions all the time, because you have a machine that doesn't mingle with your other data, and if it gets ruined, you can just spin up another VM or whatever. It's not a problem. Two, you get access to more compute, and I've straight-up found that having other machines do the work means I have way better battery life on this laptop when I'm moving around. I don't have the fans spinning up. I can do other work; I do some video stuff. That's cool. And three: always-connected, fast internet, 24/7, so that one-megabit stuff is real. I recently took a train to Montreal from Toronto — we do have functioning trains — and I was only getting about one megabit on there, and it was brutal. I basically couldn't work with an agent locally. However, on a six-hour train ride, everything was great, because that's just enough for me to connect to the remote session, work with tmux, and have everything working, so I was kind of uninterrupted. So I guess my way of getting more out of agents is kind of boring. I don't have agents off in the cloud doing a bunch of independent work for me. I just, more effectively, figure out how I can work with them. So, the good news: you have a remote terminal setup for AI work. The bad news: the ergonomics of this, let's be honest, are not that good. I'm a Mac enjoyer; it's not that great. There are tools built into other platforms, like Cursor and VS Code, where you can actually just run all this stuff on that remote machine. So you can stop before the tmux part and just work with the machine through those editors. I haven't done that much; I have used it to look at, edit, and review code. But humor me, hear me out: I think there is value in working with some of this more primitive tech. A real wake-up call for me is this: just as the Windows and Mac operating systems have evolved — and now we've got Liquid Glass, isn't that incredible — the technology you can run in terminals has evolved too. Consider this: the most valuable software created in the last decade is a terminal application, right? Claude Code. That's wild to me. And I don't think it stops there. I think we're going to see other terminal apps become valuable. So humor me.
I'm going to show you a little bit of what it looks like if you go all in on this. So over here, this is my tmux demo. This is a tmux session, and I've got a bunch of different stuff here. These are actually different windows. This is my little custom extension that I vibe-coded; it works for me, and I've made it a bright color just so you can see it. First up, all right: "Let's just review the plan. Goal: don't embarrass yourself." Good. Okay. We're going to check out this window first, and then I've got a bunch of things. So, starting here: this is an editor called Fresh. I'll have the URL at the end. Like I said, I don't really know Vim, and I tried to give Neovim a shot; it was way too complicated for me. Fresh is interesting because it's sort of, "Hey, VS Code users: do you want to use the mouse? Do you want to do the things that you know how to do? Do you want a command palette where you can just bring up files and work the way you're used to working?" This is relatively new software, I think built in the last year; somebody's actively working on it. It's pretty neat, I'm enjoying it, and it's made just looking at code easier. So that's one thing I think is kind of interesting: your idea of what a text editor on the terminal can be has evolved. I could keep going, but where am I on time? So, for example, I can go through some of these files; I think it also shows modifications in here as well. Oh boy, not enough time. So over here, let's see — I'm going to skip a part. This is a TUI app that I've been building. Building TUI apps in the terminal: really great. This one is very simple. Okay, but I want to illustrate this: "Use tmux. Look at pane one. What have we got?" I don't look at tmux just as a workspace; I actually look at it as a Playwright-like tool for working in the terminal. If you've ever struggled with getting logging output or whatever into your agent: when you have tmux, it's easy, because the agent can just read the pane using those shell commands. That makes sense, right? And so, okay, I want to change this; I want this to look different. So I have a tool here called termdraw. termdraw looks like this: it's actually a vector-based sort of editor. I can resize things, and I can put, let's say, this here. I'm going to put "Close"; this is going to be the title. Let's put a line here as well. We've got this smooth-line thing, which is kind of fun. This is all over SSH. I'm going to move this, right? And then I'm just going to take this as text. Okay: I send that to my agent. "Make the modal look like this." I think there's something about working with the agent where, look, it works in text. The agent reads text, right? It reads markdown files. So to actually produce artifacts that it understands — it understands ASCII incredibly well. Versus, you give it a screenshot: what is it doing? It's spending thousands of tokens to decompose that into a text description that, at the end of the day, is not going to have much more fidelity than what I just generated, right?
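To illustrate the ASCII-mockup point: the model reads text natively, so a rough box drawing can carry layout intent without the token cost of decomposing a screenshot. The mockup and prompt below are made up for illustration; they are not termdraw's actual output format.

```python
# Illustrating "send the agent an ASCII sketch instead of a screenshot."
# The mockup and wording here are hypothetical examples.
mockup = """
+------------------------[ x ]+
| Settings                    |
| --------------------------- |
| ( ) Light theme             |
| (o) Dark theme              |
|                             |
|       [ Cancel ]  [ Save ]  |
+-----------------------------+
"""

prompt = (
    "Make the settings modal look like this sketch. "
    "Keep the existing styling system; only change the layout:\n" + mockup
)
print(prompt)  # this is what gets sent to the coding agent
```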
So I think that's kind of interesting. All right, it thinks that it did it. You know what? I'm lazy: "Can you run it in pane one for me?" I don't test a lot of this stuff out beforehand, by the way, so I'm just hoping that it works. Oh, I didn't reload it. Oh — you figured that out. You following this? Okay. So, I've built a lot of TUI apps, and if people are wondering how I do that, it's a little bit like this: it's the ability to iteratively have this thing go back and forth. I'm going to stop it at this point, though, because it looks like we're getting there, and you can see how it's going to get there. So, another tool I was going to show: I built my own diffing tool. That came about because, once I started working this way — I built my own diff tool; it actually upsets me that that's coming out of my mouth — ultimately, I just wasn't very happy with some of the solutions that we have. So I built this thing called Hunk, and it's basically: how can I have something closer to a VS Code experience on the terminal? I'll show a better diff, but it accepts diff commands, like "main at 3." I can go in and take a look like this. It's got split view — this is a bad example for split view — it's got word wrap, and you can even scroll horizontally, right? These are tools I never really experienced in diff tools. Anyhow, let's go back. I guess the last thing I wanted to illustrate is that this presentation itself is running over SSH in the terminal, with the rich graphics and everything; it's called presenterm. So you can check that out. These are some of the tools I've been using: Fresh; Hunk; Glance, which I didn't get to show you, ran out of time; and termdraw. And that's it. Think about it. It's not crazy. All right, I'm back on. >> So, how's it going so far? Having fun? >> Good. I'll tell you what: I asked the next speaker, if he were forced to delete all of the apps on his phone and could only keep three of them, what would they be? What would they be for you? Think about it. I'm going to reveal his answers. Okay: number one was Slack, then YouTube, then X. What do you think? Do you agree? Okay. Our next speaker is Shashank Goyal. He is the founding engineer of OpenRouter, and I talked to him about how he hires new people and whether AI is impacting that. He said no, we actually need a lot of engineers, and what he's looking for is enthusiasm, excitement, and people who ask the right questions; that's his interview technique. Did I say it right? >> Yeah. >> Okay. All right, we're all set. Let's hear it for Shashank. >> All right, thank you, everyone. So today we'll be talking about the rise of AI agents — and I'm sure you're going to hear this word so many times, and I apologize. So, I'm from OpenRouter. We started the company about two and a half years ago; I joined about two years ago and have been building it since. What's special about OpenRouter? We think we're at a really cool horizontal spot in the ecosystem, right between all of the different models and all of the apps. We're a model aggregator that makes it really easy to use any model in the ecosystem.
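For readers unfamiliar with the aggregator idea: OpenRouter exposes an OpenAI-compatible API, so switching models is a one-string change. A minimal sketch follows; the model IDs are illustrative (check openrouter.ai for current ones), and it assumes the `openai` Python package.

```python
# A minimal sketch of the aggregator idea: one OpenAI-compatible endpoint,
# many models, switched by a single string. Model IDs are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible API
    api_key="YOUR_OPENROUTER_KEY",
)

for model in [
    "anthropic/claude-sonnet-4.5",
    "deepseek/deepseek-chat",
    "google/gemini-2.5-flash",
]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One sentence: why use a model router?"}],
    )
    print(model, "->", resp.choices[0].message.content)
```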
As of this month, OpenRouter is doing about 75 trillion tokens every month. We have over 5 million users using the platform every month. There are more than 60 providers and 300 models. What does this all mean? The ecosystem is starting to shift a lot. There isn't just one best model for any one use case; we find that users are using more than one model for whatever workflows they have. We saw this pain point a long time ago: it just got really hard, if you wanted to use OpenAI models and Gemini models and Anthropic models, to know which model to use, when to use it, and where to use it. We didn't think there was a good empirical benchmark. So we built our rankings page to show you not the benchmarks — not all of the benchmaxed scores that get produced at every model launch — but: hey, how are users actually using these models? Where are they spending their dollars? How are they voting with their actual use? And how do we see this in the ecosystem? This is one of the charts we have, showing the top models that users are actually using. You can see that, obviously, some of the Anthropic models are at the top, but there's a bunch of open-source models — DeepSeek, MiMo, there's MiniMax — plus Gemini and OpenAI. So it's a very vibrant ecosystem of models, and it's really important to remember that this is really not a winner-takes-all market. Another thing that I really like about the viewpoint OpenRouter has on the ecosystem, and why I'm really excited to share some of our metrics with you today, is that not only do we see all of the models, we also see which apps are using those models and how they're using them. So the chart on the screen right now shows how coding-agent rankings have been changing over time, and which agents have become really popular. Some of these you'll recognize: Kilo Code, Claude Code. But you might not have heard of Hermes Agent, for example — an agent that is really starting to engage users and gain a lot of popularity. So, again, go to the OpenRouter rankings page whenever you have time; it'll really help you see the ecosystem from this top-down view and understand what models, what apps, how people are using them, and what people are building. From all this data that OpenRouter is very lucky to have, what have we learned so far? Best practices change week over week, not over years or even months. Every week, if you ask me what my workflow is, it's going to look very different. Prompts also need to change as models change: as models get better, you have to tell them fewer and fewer things. And models rotate very quickly. The three trends that are very clear from our analysis: inference is becoming a core internet utility — the same way that if the internet is down, you can't get your work done, if your tokens are down, you can't get your work done; the market is restructuring and extremely dynamic; and agents are now the primary workflow and workload for inference in the market. Over the last year, we've seen over 14 times growth in the number of tokens consumed. And that shows you how much more value users are getting, because every token consumed on OpenRouter is paid for by a user.
The number of requests also continues to grow, and there's a slight gap between the number of requests and the number of tokens; we'll get into that right after this. Growth on the platform is also decentralized. It's not a single user or a single app that is growing the ecosystem, and that's true for everyone in AI. We have breakout apps like OpenClaw, which single-handedly consumed over 18 trillion tokens last month. And that really shows how much value users are finding, right? We see all these buzzwords — "hey, this is really cool, I like it" — but what does that actually mean? Are users actually using OpenClaw? This is one of the best metrics to show you: 18 trillion tokens is about $1.8 million spent on OpenClaw just on OpenRouter, and we're a very small percentage of the overall inference market. And it's an open-source app: all of the code, you can see it, you can reuse it, you can build on it. It's fully MIT licensed. So there's a wide ecosystem here, and it's very easy to build agents, because the most popular agents are actually open source. The other thing that has led to this huge spike in token usage is that there has been a pretty significant cost collapse across the ecosystem. When we started with GPT-4, around March two years ago, we were at $30 per million input tokens and $60 per million output tokens. A GPT-4-quality model today is Gemini 2.5, and it's at 15 cents to 60 cents, which is straight up 20 — sorry, 50 — times cheaper for the same level of intelligence as two years ago. That does not mean frontier models are getting cheaper as well; we've seen that frontier models have continued to stay at the same price. But what is frontier intelligence today will, in one year, be 10 to 20 times cheaper than it is now. And that has very big ramifications for how we use AI, because today you can only deploy Claude 4.7 Opus on tasks you know are going to be very high value, because it's very expensive. But that's not the world you're building for. You should be thinking about how to deploy 4.7 Opus across all your tasks, because in a year, or even in six months, this model — or this level of intelligence — is going to be so much cheaper. Which is why it's very important to realize that, even though models seem very expensive today at this level of intelligence, that trend is going to continue to push prices down. So, all of this growth has really changed which models are consumed and why users are consuming them. I don't expect you to read all of these; the point is that these are the models we've onboarded and on which users have consumed more than 100 million tokens over the last 12 months. All of these models were on the platform, and we're pretty selective about the models we onboard: models that have something unique about their architecture, something different about how they were built, that are usually pushing the frontier either for their size or for max intelligence. And there's a lot of choice in the ecosystem. No single model stays at the top for very long. This chart shows all of the different model families on OpenRouter. You have Google, Anthropic, OpenAI, and so on at the top, but then you have a very significant percentage of usage from MiniMax, DeepSeek, Xiaomi, Z.ai — a lot of Chinese open-source labs that are producing frontier-quality models.
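To make the cost-collapse arithmetic from a moment ago concrete, here is a back-of-envelope sketch. The prices are the ones quoted in the talk, not current price sheets, and the token counts are an invented example.

```python
# Back-of-envelope version of the cost-collapse point, using the
# per-million-token prices quoted in the talk.
def cost_usd(in_tokens: int, out_tokens: int,
             in_price: float, out_price: float) -> float:
    # Prices are USD per million tokens.
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# A hypothetical agent turn: 50k tokens in, 5k tokens out.
gpt4_era = cost_usd(50_000, 5_000, in_price=30.0, out_price=60.0)
today    = cost_usd(50_000, 5_000, in_price=0.15, out_price=0.60)

print(f"GPT-4-era pricing:      ${gpt4_era:.2f}")   # $1.80
print(f"Gemini 2.5 Flash-class: ${today:.4f}")      # $0.0105
print(f"ratio: {gpt4_era / today:.0f}x cheaper")    # ~171x
```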
The market is decentralizing. As I've been saying, there are so many models that the share of the top five or top ten models on the market is continuing to go down, and it's very important, for the workflows you're building, to evaluate them against multiple models, because you will find there's a big Pareto frontier of quality versus cost, and there are already a lot of trade-offs to be made in the marketplace. Maybe not a surprise, but reasoning is now the default. There are still a lot of non-reasoning models, like Gemini 2.5 Flash and the Gemma 4 models, but reasoning is the default: all models reason, and users are specifically looking for models that think before they reply. What we used to call test-time compute — compute during inference — is a very important quality of the models you choose. To specifically call out in the zeitgeist: DeepSeek might not have released a new model in a long time now — their last big release came in March of last year, with R2 — but DeepSeek has continued to grow in the market, because people are aware of how good the output is for the price. So it's a model I'd recommend you all try out if you're building agentic workflows; it's really good at tool calling, and the market corroborates that story. One of the interesting trends in the ecosystem is that open-source models are the volume leader, but because they're so cheap, their spend percentage is way lower. About 35 to 40% of tokens happen on open-source models, but they represent a much smaller percentage of total revenue, because they're so much cheaper. And I think that's a really big advantage for people building, because there are so many good models available for so much less. Combining all of this — the growth in tokens, the growth in models, the growth in ability — is, I would say, why agents are now a primary workload on the platform. Over 15% of spend now comes from agentic workflows. The way we decide whether something is an agentic workflow: lots of tool calling, multi-turn loops, use of orchestration. We can detect this using metadata, and about 40% of total workflows on the platform are already agentic. I expect this to continue to increase, because models are no longer being used for single question-and-answer responses. Even when that's the interface presented to the user, behind the scenes they're making a lot more calls and using a lot more tools to answer the user's question. Tool calls are the main backbone of agentic workflows. I've been saying "tool calls" a lot and maybe didn't define it: for people who aren't aware, tool calls are how models engage with the wider world outside of their own pre-trained memory. Every time the model wants to get more context, or take an action in the real world, it uses a tool call. And you can see there's been a really big inflection in how many users and requests are actually using tool calls — and in total tokens, as we showed earlier: it's a really big hockey-stick, exponential curve, and it's probably the biggest trend we see right now; it has continued to explode over the last 12 months. A very cool insight: we expected agents to be using more tokens, but I didn't expect it to be quite this high.
There's a gray line at the bottom of the screen that is honestly a little hard to see: that's the tokens per request that non-agentic workflows are using. And you can see it actually hasn't changed at all over the last year. The number of tokens per request has stayed relatively stable, even as models have gotten bigger context windows and more intelligence. But if you look at users running agentic workflows, it's a totally different story. This again shows the difference: when you're building agents, you can actually utilize the full context, and that allows you to build much more powerful workflows and experiences on top of models. Agent sessions are also usually 11 times longer. "Session" here basically means a number of turns. In non-agentic workflows, users will usually ask two to three questions; for agentic ones, we're now seeing average turn lengths get to 80. So think of sessions getting much, much longer — you can see it when you're looking at your Claude Code screen, the number of turns it's doing. This again is a very important chart for understanding how different the two workloads are. Why now? Why did agents suddenly become so good? It's really not sudden. It's been a slow build-up over the last year, year and a half honestly. We had our first reasoning model in January of 2025. Then we had tool calling, but it didn't really work: we used to see tool-call success rates hover around 85 to 90%. That means roughly one out of 10 LLM calls that tried to use a tool would fail, and in agentic sessions where the number of turns can be 80, that's about eight expected failures per session. So it was really bad. Over time, around August to November, is when we really saw model labs figure out how to do tool calling in a more reliable manner, and we saw tool-calling success rates go up to 99, 99.5% for the frontier models. That's really been one of the big unlocks, because the models are just way more reliable. We also saw a really big explosion in harnesses, around December of last year: Claude Code, Cline, OpenHands, OpenClaw. All of these harnesses have made it that much easier for everyone in the ecosystem to use agents. And around January to February, I think it finally all came together. I've been using Claude Code for eight or nine months, along with a lot of different agentic tooling, but I really felt the inflection across all of the agents I use around when Claude Opus 4.5 dropped in December, and then one more time when people really figured out how to use these harnesses to the best of their ability. So, putting it all together: I think we finally are in a world where agents are mostly reliable, they're able to go off and do really long-running tasks without much supervision, and they're able to use a lot of different models and orchestrate themselves, generally knowing the best way to build. Putting it all together, the five forces that we should all be thinking about as we're building: models are smarter, inference is cheaper, context is longer, tool calls are more reliable, and harnesses are better.
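Behind all these numbers is one mechanism: the tool-call loop. A minimal sketch of it, using the OpenAI-compatible tool-calling shapes that OpenRouter speaks; the model name and the `read_file` tool are placeholders, not part of any talk.

```python
# A minimal sketch of the agentic loop: the model either answers or requests
# a tool; the harness runs the tool, appends the result, and loops.
# Model name and the read_file tool are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the repo",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

messages = [{"role": "user", "content": "Summarize README.md"}]
while True:
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4.5", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:       # no tool requested: this turn is the answer
        print(msg.content)
        break
    messages.append(msg)         # keep the assistant's tool request in context
    for call in msg.tool_calls:  # execute each requested tool, report back
        args = json.loads(call.function.arguments)
        result = read_file(**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result}
        )
```

Each pass through this loop is one "turn"; at 80 turns per session, it's easy to see how the earlier 85-90% tool-call success rates made long sessions unworkable.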
If you make this mental flip for yourself — it was something I had to think through for each one, because it's changing so quickly — it's easy to forget how cheap inference really is, or how good the harnesses are, or the fact that most of the frontier models now have a 1-million-token context window. A million: you can put entire code bases into the context. You don't even need grep tools. It just makes a lot of workflows possible that were not possible earlier. Agentic flows, when you're building them, also let you use the best capabilities the models have to offer. When we look at users running agentic flows versus not: agentic flows use more reasoning, they get better caching, and they usually try out more models. People using agentic workflows are usually trying out six different models for their different use cases, versus three for non-agentic users. And the total request volume is much higher for agentic use cases. I have a few quick minutes left, and I didn't want this whole presentation to be just me sharing data, so I wanted to show you how we're using agents at OpenRouter ourselves — taking all the learnings from the data and deciding what to build for ourselves. One of the things we built is something called Spawn, which is basically one-click deployment for any user to deploy agents. So if you want to take Claude Code but run it in a VM, so you can control it from your phone, you can do that at openrouter.ai/spawn. But that's not the thing I'm most excited to share. What I found really cool is that the full code base that sets up Spawn — all of the integrations with the different agents and the different cloud providers where you can deploy it — is a 100% agent-written code base. There are zero PRs made by humans in the entire code base. Sometimes there are issues filed by humans, but there's no code written by humans. Most of the issues, and everything else, are agentic, fully automated. There are different agents that write code, review code, do security analyses, and do issue triaging. We also have end-to-end testing agents. It's basically a swarm of agents that runs on this repo all the time. The repo is open source: if you go to this link, you'll find the GitHub repo at the bottom of the page, or you can just search for "openrouter spawn" and the GitHub repo should come up. You can see exactly how all of these agents are orchestrated in our internal workflows. Scouts is an example of a really simple workflow that has added a lot of value. I wanted to put it up here to show that agents don't have to be complicated. Something we wanted to do: OpenRouter is integrated into a lot of open-source GitHub repositories — OpenClaw, for example — and we wanted to track all of the issues that users were facing on OpenClaw. We tried a few different things. You can just have a cron job that fetches all of the issues on OpenClaw and sends us a result, but what we found is that every single day it would give us the same results over and over, because those were the same top issues that weren't getting resolved. So we built something called Scouts, also on GitHub. The scout agent uses GitHub PRs as its memory; he unpacks the mechanics next, and a rough sketch of the pattern follows below.
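A rough sketch of the PR-as-memory pattern, for orientation. This variant stores history in PR comments; the talk's version appends to a single file in the PR, but the idea is the same. The repo, PR number, and helper names are hypothetical; it only assumes the standard GitHub CLI (`gh`).

```python
# A rough sketch of "PR as memory": the scout's state is just the PR thread.
# Everything here (PR number, helper names) is hypothetical illustration.
import json
import subprocess

def gh(*args: str) -> str:
    return subprocess.run(
        ["gh", *args], capture_output=True, text=True, check=True
    ).stdout

def scout_memory(pr: int) -> str:
    # System prompt lives in the PR body; past findings live in the thread.
    data = json.loads(gh("pr", "view", str(pr), "--json", "body,comments"))
    history = "\n".join(c["body"] for c in data["comments"])
    return data["body"] + "\n" + history

def record_finding(pr: int, finding: str) -> None:
    # Append today's search results to the running context.
    gh("pr", "comment", str(pr), "--body", finding)

memory = scout_memory(42)  # feed this to the model so it skips known issues
record_finding(42, "New issue seen today: sandbox proxy blocks database egress.")
```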
So to create a scout, you just create a new PR with a small system prompt. Then there's another agent that looks at all open PRs in this scout repo and spins off web searches for them. Whenever a web search is done, it appends the result to the PR as its own history. So we're using the PR as the running context for a model. It's very simple, just a single file in the PR, but it made our daily cron jobs that much better, because the model remembered everything that it had seen. The other thing that we've built for ourselves, which has been a total game changer, is what we call AI, our OpenRouter intern manager. We find that when you have a single agent trying to do a lot of different tasks, there's a much higher chance that it fails. So we love to deploy agents internally that just do one thing, but do that one thing really, really well. There's a list of about 20 agents. Some of them are really fun, like Dexter: we have thousands of emojis in Slack and it was getting really hard to know which one to use, so you can ask Dexter which emoji you should use depending on your mood. My personal favorite is Buddy. Buddy helps us onboard any new models and endpoints. Anytime there's a new model, we just tell it, "Hey Buddy, Anthropic's Claude 4.7 Opus launch happened on Thursday," and Buddy is able to fully onboard the endpoint and test everything for us. We use Sniffer for KYC operations. We have Tony for customer support. We continually add more interns, and each intern gets its own VM and its own GitHub repo. What we like to say is that each one can do brain surgery on itself. Each agent can learn and get better: you can just tell Buddy, "Hey, you didn't have this ability, can you go learn it?" and it'll figure out how to do the thing and then make a PR on its own repo. The really cool thing about AI, the manager, is that you can ask it to spawn more interns, or remove them if they're not being used; if we improve some workflow that we want to push out to all the agents, it can do that for us; and it also does credential management across all of these different agents. From everything that we've built, if I had to name the three things we've learned: experiment, experiment as much as you can; production is still hard, which is one of the reasons we have so many different agents instead of a single agent; and automate the everyday, because that's really where a lot of the value is. You might not think that something is automatable, but you should just try, and it really helps. The thing that we keep telling ourselves, and really force ourselves to remember every single day, is that we're no longer building OpenRouter; we're building the machine that builds OpenRouter. Having that kind of mindset switch has really helped us automate across the board, to be able to ship a product that's used by millions of users with a very small development team, and to get the most out of all of the AI models in the ecosystem. And that's me. Thanks everyone. All right. All right. Hope you're all super energized about using OpenRouter and checking out all the different AI tools. Now for our next speaker, we have Nana. Nana is a full stack engineer, she's currently a principal developer advocate and software engineer at Qodo, and she's also a proud member of Women Defining AI.
So she's really trying to build a world where AI is augmenting our lives, and I know there are some quality engineers in the audience, so I think you should definitely tune into her talk, because she's going to talk about how to embed AI code quality gates in your software development life cycle. So Nana, take it away. >> Thank you. Thank you. Really glad to be here. My name is Nana Andukquay. I lead developer relations at Qodo. Qodo is an AI code quality platform, and I am obsessed with AI, but also with being able to use it in a very structured way. So that's exactly what we're going to talk about. Before we really begin, I just want to set the tone. I am not a beginner. I've been in the game, in the industry, for almost 10 years now, so my knowledge predates this current exciting phase that we're in with AI. And my engineering experience in particular, whether it was building systems in a fintech company, or building investment portfolio management systems with tens to hundreds of millions of dollars on the line, or building a live events platform at O'Reilly, working with back-end engineers to build an entirely new experience for products that generate a million-plus in revenue, means I've always had to think to some degree about quality in software development. I am also AI-pilled, totally obsessed. I'm pragmatic but also super optimistic, and I really do think AI is great for neurodivergent brains, but that's another topic of conversation. So in my journey as an engineer, and now with AI, when I think about code quality I'm always asking: where are the touch points where there's a quality degradation, or an opportunity for quality to degrade? If you look at the software development life cycle and a typical workflow, planning and design, development, code review, testing, deployment, this is the entire surface area where quality can begin to degrade, very subtly or in very obvious ways. So the opportunity, the blast radius I guess you can call it, is everywhere. These are all the places where issues can happen, especially now, with AI making software feel that much more fragile. And that's why I think we are currently building workarounds. We are building workarounds. We are real-life architects in this time, trying to build around the limitations of AI systems and LLMs. It's exciting, but it can also be very frustrating. Some of these workarounds are makeshift, and others are becoming standardized in real time, and we're just going to see what comes of it. One really amazing example: I'm really excited to give a shout out to Lex for creating GSD. This framework, which then became a coding agent built on the PI SDK, was about structured AI-assisted development. Not only did professional developers at some of the largest companies we know today start using this tool, vibe coders also wanted more structure, and structure yields quality.
So this popularity, 55,000 GitHub stars, was and is a signal that there is a strong need for quality-driven systems, and I think that's exactly what we should be building. But how do you begin to think about that? That's what I call, and what we call at Qodo, a verification layer: a verification layer that should be embedded and interwoven into your existing development workflow. So how do you do that? We put on our critical-thinking architect hat. Number one, you need to define what code quality actually means to you. There are things you can research, pulling information from many different sources and documents, and there are also very specific requirements particular to your projects and the way that you work, or maybe the way your team or your engineering organization operates. All of that needs to be codified, and it needs to be codified because we are working with agents, and this is context. Number two, once you've defined it, you need to decide where those codified quality standards live. We'll talk more about what that looks like, because both of these touch on context. And number three, you need to design the verification layer where you already work: in your IDE, the CLI, the git providers you're using, your CI/CD pipeline. These are all the touch points in your actual workflow for embedding code quality. Steps one and two, defining your code quality standards and deciding where they live, are all context engineering, and these are only some examples of context engineering. I don't even know if people still use CLAUDE.md files since that paper came out about them not being very effective. But we have agent markdown files. You've got internal docs and engineering standards. You have criteria that you might want your code review to be measured against, whether it's a manual code review or an AI doing it for you. That's all context. And of course org-specific policies and any other quality expectations you might have. This is all context, and these are all the important elements needed for the pipeline, the life cycle of your development workflow. A great example of this is Qodo's rule system. Engineers at Qodo built a rule system to have one context plane for managing the agents.md files and all of the rules, which can be org-wide, repo-specific, or maybe language- or framework-specific. They're all listed in one place, and they are also categorized, by correctness, reliability, or quality, and given a level of severity: how important is this rule to you, per repo or per pull request? This is what I consider the context plane that can be centralized, so that when you are working on distributed teams it's not only visible to you and your team members as developers, it's also visible to your agents. So how do we allow our agents to access this context plane so we can pull it into our dev workflow? We'll get into that. Number three, when I mentioned designing the verification layer: we have the traditional software development life cycle.
And then we've got our agents, the agent harness, and then what I mentioned before, our dev workflow: IDE, CLI, Git. What we're going to do with that context, what you should do, is operationalize it. I argue that bringing it into the planning phase is a really strong principle for quality: actually enforcing quality as early as possible and as often as possible. Remember, we're building around the limitations and the unpredictability that come with working with agents. So bring in those standards early, already in the planning phase. You can do this through the way you prompt, and you can do this by using agent skills to pull down that context from your centralized context plane. Then you have your agent skills, which you've probably been collecting like Yu-Gi-Oh cards, for code generation. And for code review, you'll also want to enforce those quality criteria again. What's important about this is that there's consistency across the stages of your workflow: the same exact rules, and the same exact place where the agent skills and different artifacts can be pulled down from, across at least the first half of the development workflow. Consistency is important when it comes to agents. It also means that when you are enforcing quality consistently, you can identify where there might be quality gaps. So these are my skills. They're in the Codex app right now; this is what it looks like. There's a ton of them. I've collected them, like I said, like Yu-Gi-Oh cards over time. A lot of them are related to cleaning up dead code, or maybe some error handling that I know agents seem to keep struggling with. I also have test-driven development and behavior-driven development skills. This is just to give you an idea of the kind of agent skills that I have and that I actually enforce before I begin implementation. I do this in the planning phase; this is a recent development of mine, and that's what I mean by enforcing quality early and often. In this example, I'm using Qodo's get-rules skill, which pulls down rules from the rule system and determines which ones are most relevant for your current coding task. Those are the rules that will be enforced as your agent begins implementing code, and it will go through a verification process as well when it's done. That is a great example, or at least a more sophisticated example, of a centralized context plane for your quality standards, pulled into planning and code generation. And then I truly believe a local code review is very, very valuable here. Whether you're working in an IDE or in the CLI using Codex or Claude Code, wherever it is, it is totally worth it to run a local code review against your uncommitted or committed changes, because you want to make sure that anything that could be caught is actually caught before you make a pull request and let the whole world know that you have just generated AI slop. Once you have fixed up the issues surfaced by the local code review and you actually make a pull request, you want to be able to leverage AI for a first pass of code review at the PR stage.
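As an illustration of that local gate, here is a minimal sketch of a pre-PR verification script. `ruff`, `pytest`, and `git diff` are real tools; the `ai_review` function is a hypothetical stand-in for whatever local AI review step your agent or platform provides, not Qodo's actual interface.

```python
# A minimal local verification gate, run before opening a PR:
# static checks first, then an AI pass over the uncommitted changes.
import subprocess
import sys

def run(cmd: list[str]) -> int:
    print(f"$ {' '.join(cmd)}")
    return subprocess.call(cmd)

def ai_review(diff: str) -> list[str]:
    # Placeholder: send the diff to your local AI reviewer and return
    # a list of human-readable findings. Wire in your own tool here.
    return []

def main() -> int:
    # Cheap, deterministic gates catch the obvious stuff first.
    if run(["ruff", "check", "."]) != 0 or run(["pytest", "-q"]) != 0:
        return 1
    # Then review the diff before the whole world sees it in a PR.
    diff = subprocess.check_output(["git", "diff", "HEAD"], text=True)
    findings = ai_review(diff)
    for finding in findings:
        print(f"review: {finding}")
    return 1 if findings else 0

if __name__ == "__main__":
    sys.exit(main())
```

The point of the ordering is that linters and tests are fast and deterministic, so they run first; the AI pass only sees changes that already compile and pass the static gates.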
And I think it's really valuable to use an AI code review tool that is automated. As soon as you open a pull request, alongside your linters and your tests and your security checks and all of the things that beef up the quality of your process and the code itself, you can have a code review tool automatically run as a first pass and surface any important insights you can use to improve the code, before another developer takes a look at it, or even before you yourself take a look at it. What makes this part different from the local code review is that you have a much stronger system that actually takes longer to run, because it's checking against the entire codebase, or multiple repos, to give you the important insights you might need about breaking changes, for example: making an API change here whose contract breaks something in a couple of other repos. There's so much more context that can be leveraged at the PR stage, as part of everything else that needs to get built at this point; it can be different and take longer, but it is very effective for a pull request review. So this is what I ended up testing out. This is actually my current workflow, but I tested this process on a relatively large PR. I say relatively large because I believe there are stats showing that the cognitive load of an effective code review degrades dramatically after about 400 lines of code changed. So I went all the way to the extreme on purpose: 1,900 lines of code changed, for a policy enforcement MCP server and CLI tool I'm building. And there was only one bug uncovered. I was shocked. And Qodo is, of course I'm biased, but we dogfood, so I use it every single day, and I was shocked to see there was only one bug. To me that was proof that my process is working, at least to some degree, with the skills that I have, the rules that are in place, and all my tests and my linters. This is the proof. Once Qodo found the bug that I had, which was an unhandled settings exception, I went ahead and fixed it locally in Codex. And what I mentioned before about the robustness of an automated code review at the PR stage is that it gets to check against your rules and your standards and requirements gaps with adversarial agents. Some folks ask what the point of an automated code review bot is when they can just run a code review locally with the same agents they use to generate code. I'm always screaming about this on X: you need an independent verification layer, because of bias from LLMs, and because the systems behind coding agents are optimized to be autocomplete on steroids, optimized for completing code as quickly as possible. So you need a completely independent system that comes in with an adversarial architecture and goal, and that's when you can begin to uncover some subtle bugs that might exist. And so this is the AI dev workflow checklist for quality. I think you need to define the code quality standards.
I think you need to decide where the codified code quality lives, in a centralized context plane, especially for folks who are not solo devs but working on teams, where it needs to be distributed and managed. Then you need to pull in the agents and the skills for accessing that context for planning and code generation, verify the local code changes before you make a PR with your static linters and tests, and then automate the more serious, more robust code review process. And a bonus, which is something I've been doing lately: use automation and AI for iterative refinement of your skills and rules over time. Qodo's rule system can automatically suggest new rules based on the behavioral history of PRs and comments, and it can suggest new coding-standard rules as your code base evolves. But something else you can do is leverage automation, a cron job or something similar that runs weekly, assesses all of your PRs and any trends that have occurred, and then decides which new skills are worth creating and which existing skills are worth refining, so that you can reduce the types of issues that keep popping up. That way, by the time you get to code review, some of those issues from the past have already been handled and you have fewer issues at that last line of defense. And I'd love to show what that looked like. This is the actual PR that we looked at, and this was the plan that I used. I used Aaron Francis, he has faster.dev, I used some of his audit skills, and I forced Codex, GPT-5.4 extra high, to include exactly which skills it should use for this particular feature change. This is a long list, as you can see here, but this is the structure behind my planning that I force agents to do: it includes scope, the canonical contracts, the component design, the test plan, the definition of done. It's exhaustive, but this is the method for confirming that the agent actually knows what it needs to do and that there will be a thorough verification and implementation process. That is how I was able to generate thousands of lines of code and end up with only one issue that I needed to fix. This is something I definitely recommend you all begin to think about, because quality really requires a mindset shift. Security engineers talk about security-first best practices; the same goes for quality and preserving software craftsmanship. No matter how fast and exciting the evolution of AI we get to experience in this domain is, we can still preserve our intellect and our expertise, and begin to have agents mirror that in the way they work for us. That is my talk. Thank you all. All right. Thank you, Nana. >> Thank you. >> I would like to thank you all for being present. This concludes our morning session. I would also like to thank our audience on the live stream. Go get refueled, both your body and your LinkedIn connections, and we'll be back at 1:30 sharp. Ladies and gentlemen, please take your seats. Our event will start in 5 minutes. All right. All right. Hello everyone. How was lunch? Raise your hands if you liked lunch. >> Hey. Hey.
Okay, that was a good lunch. And some local flair; I really enjoyed it. Hope you enjoyed it, too. So, to kick you out of your food coma, we have a very exciting speaker, Jeff. I met Jeff in San Francisco as well. Fun fact: I've never seen Jeff without overalls and a hat, so that's the image of Jeff that's seared into my head. But he's going to introduce himself a little more. Jeff is currently on a global tour and building in the open, so feel free to check out his website. And today he has a pretty philosophical question for us about how the economics change when software development is cheaper than minimum wage. So let's welcome Jeff onto the stage. Welcome, Jeff. Hello everyone. Well, I'm here today with a somewhat provocative title: software development now costs less than minimum wage. Now, there's always been a difference between software engineering and software development, folks. But I want you to think about this. If your identity function is that you're doing software development, you're typing in the IDE, etc., well, a burger flipper at Macca's gets paid more than you right now. Yeah. So, it's been a year, a year and a half, since I first published a technique for managing memory that I affectionately call Ralph. Ralph is really simple: you give a context window a singular goal, and you let it autoregress towards that goal with the right backing. Here's me over at Atlassian two months ago giving a talk about how things are changing, how the economics of software have forever changed. A week after this, Atlassian did their layoffs. So, folks, the unit economics of business have forever changed. If you consider yourself to be a software developer, you can run Claude Code or Codex in a loop, AFK, and at API pricing it's about $10.42 an hour, and that will generate a lot of code. Now, it'll generate so much code it's too hard to review; that's one of the hardest things about this. But without a doubt, the economics of business have forever changed. I was at the Cursor meetup back in Sydney, and it was product manager after product manager after product manager, just sharing the latest and greatest thing, and they're having the time of their lives, folks. They don't have the psychological wounds. They don't have their identity function removed. They're like, hell yeah, I can build things without people now. I don't have to convince people to listen to me. They were just like, yeah, I just made this thing. It was person after person after person. I encourage you to go outside the bubble of software developers and into the non-engineering demographic and see all the magical things they're doing with these tools, and you'll see how things are fundamentally changing, because the head of design and the product manager, they're now software engineers, folks. But it's not just them. Last month, I'm doing a bit of a world tour, I was over in Auckland and I went on a tour of Hobbiton, the Lord of the Rings set, and the tour guide was like, "Hey Jeff, what do you do?" I'm like, "Oh, talking about AI." Next thing you know, he's like, "Wow, how good is AI? I'm able to build all these trading bots." What does it mean when your tour guide operator is token maxing? Because everyone now is a software developer.
I want you to internalize that everyone is now a software developer, everyone is now a coder. So if your identity function, how you derive value, comes from the idea that you are someone who types in an IDE, you're pretty cool, right? Because the PM can mog you now. It's also interesting because society has been structured around the idea that knowledge was scarce, and with AI that's flipped: knowledge is in abundance. It's not just software developers. If you want principal-software-developer-level output, you just create a Claude skill for that. What about an entry-level legal Claude skill? What happens when a knowledge economy goes from scarcity to abundance? That's what we're facing right now. Ouch. So, if we rewind time about two years, this is me going, "Oh, fuck." I actually published this. I ran Claude in a loop and I built a Haskell audio library. The models were pretty good back then, but they required a lot of skill to get an outcome. Still, it was pretty clear to see where things were going. Now, you might recognize this moment in time: this is Christmas last year. The models are now quite good. But one thing I would impart to you: it doesn't matter how good the models get, it takes a period of rest for people to realize the step shift in technology improvements. You see it in the people around me: the people who get the most out of AI put in deliberate, intentional practice. Society is kind of forcing these musical instruments, these LLMs, these guitars, onto all corporate employees right now. And it's like: please pick it up, please chew some tokens, please give it a strum, please practice. What happened over the Christmas break is that people actually had some time off, and they picked it up. And the models were good now. They were always good for the last two years, but they've now been RLed to the point where they're no longer these wild stallions that require skill to break in. They're kind of like a My Little Pony, all boxed up, ready to go. Now, if you're looking to roll out AI within an organization, the one thing I must impart to you: musos don't pick up a guitar, give it one strum, go "Oh, that's crap," and assume it's always going to be that way. They play with the instrument. And this is one of the things, at least for the people in this room: hopefully you've been playing with these guitars, learning all the tricks you can do with them, and how different LLM models sound different and have different characteristics. So I think the world is now kind of K-shaped. Up in the top left we've got the model-first companies. These are the lean apex predators. They're building AI, they're developing all their workflows with AI, and they're having one hell of a time. Down at the bottom is everyone else, trying to do their people-transformation program, figuring out what to do with AI. And would you believe there are companies that have banned AI internally? If anyone's watching and AI is banned within your organization, you should leave that organization. Straight up. Now, you might have seen this and a few things like it. My honest take is Jack is right, but my further take is that AI is not factored in yet.
What we're seeing is PE ratios and the valuations of SaaS companies returning to standard business metrics. The fun hasn't started yet. What we have here is not a tool. It's more like a substrate, a polymer, that allows us to redefine how business works. For the last two months I've been traveling around the world, catching up with venture capitalists in San Fran, New Zealand, South Korea, and we're all wondering: the disruption is not just us as software developers, it's upstream in the finance industry. Why does someone need to raise seed capital these days? If it's just a five-man show now, is software still investable? These are problems and questions at a philosophical level, upstream on the financing side. The disruption of AI has created uncertainty not just for us as software developers, but also in the finance realm. You see, every story needs a frame, so for no particular reason at all I'm picking SAP Concur. I don't like their expense management software. Would you believe they've got fixed overheads of 6,800 people? What? 6,800 people? That's a lot of people to put through an AI transformation. So I think the better question is to think about how business has been structured. Business has been structured by layering humans on humans on humans as an intelligence layer within organizations. I think this is going to be the year we figure out whether that holds. We're already seeing companies play with the substrate and change things around. How long does it take to transform 6,800 people in an organization? Two, three years? I think the better question is: why would you? All organizations right now should be putting LLMs in people's hands and encouraging token burn. It's not about a leaderboard or token maxing; it's literally a test of whether someone is actually curious. If they're not burning tokens, if they haven't found a way to burn tokens, and they're not losing sleep going "oh my god, all the things I could build," they're failing a pulse check. And there are a lot of people failing pulse checks right now. So why would you transform them? We know that with organizations, like running events or party management, the fewer people, the better; the social complexity is real. Smaller teams get better outcomes. And here's a story from a founder in New Zealand: "We're smaller, but we effectively cut two-thirds by telling our board that we wouldn't backfill." That's almost three years ago, folks. They stopped backfilling. So you might see all these announcements saying, "Oh, we're making all these changes to staffing and hiring." It's already been happening, folks. And it was the best decision, because it got rid of all the people who are sick of hearing about AI. Twenty-ish people now produce 30 times the output of three years ago. I want you to think about this and let it sink in, because one of the hardest things about AI is that it's kind of been forced upon the world non-consensually. You've just got to put your chin up and get through it. A lot of people are specialized in that Game of Thrones-style social hierarchy, Dilbert-type stuff, and it's all going to be for nothing, because if you're a founder and this is your own capital, why wouldn't you compress the org chart?
This is going to be really interesting, because as we get these lean, apex, model-first companies, they're going to operate much cheaper and on leaner margins. So it's not just that founders will want to do this; they're going to be forced to. If the experiments this year, with all the founders changing around the org chart, pay off, it's just going to take one public business case study. Next thing you know, they're all going to start copying. They're all going to start reorganizing their organizations. So your experience today as a software engineer does not guarantee relevance in the future. This has always been the case. There was a time when a software engineer would move on from a company because it wasn't adopting cloud; they wanted to keep their skills relevant. Our profession has always been a traveling one. One of the scariest things is just how fast this travel is going. You see, if a company's having problems adopting AI, well, that's a company issue. That's now literally the problem I think about and help companies with. It's not an employee issue. Employees trade time and skill for money, and I'm really worried that people aren't investing in themselves. It's crazy. Here's something I published about two years ago saying some software developers are not going to make it. Well, I no longer hire on the left side of this anymore; like, why would you? And it's crazy: there really are now two categories of employees. There are those who are consuming Cursor, Windsurf, what else have you, and all the AI token things; and the other class of engineer is someone who actually knows the fundamentals of an agent and knows how to automate things, and they're a senior engineer because they can teach the next cohort and generation. It's been two years, folks. There is a huge number of people you can hire who have this knowledge. You're trying to figure out who to hire in your interviews? Quite literally pull someone aside and get them to explain what a primary key is. Now, I'm not really talking about a primary key. If I was to ask you what a primary key is, or what a linked list is, you'd be like, "Jeff, come on. Are you bullshitting me? Is this a test?" But it's surprising how many people can't answer this fundamental question. You're trying to figure out who to hire as a startup, who's going to make it and who's not. It really is this question: do they understand that the big scary boogeyman, the AI monster, is literally a while loop that automatically copies and pastes information into an array? Can they draw a sequence diagram explaining how this all works? This is what you should be looking at. I like to call this a curiosity test. And unfortunately, way too many people are failing this curiosity test. Software development has changed more in the last six months than in the last 30 years. If they're not paying attention, what the heck's going on? So, you might be wondering: why wouldn't you save someone stuck in the center, like a deer in the headlights? It's because it's a psychological thing. They might run back up the hill. They planned their career at FAANG, and they just want another couple of years so they vest and they quit.
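A minimal sketch of that "while loop plus an array" framing. The `call_llm` function and the tool registry here are placeholders, not any particular vendor's API.

```python
# The "big scary AI monster": a while loop that copies information
# in and out of a message array and executes the tools the model asks for.
import json

def call_llm(messages: list[dict]) -> dict:
    # Placeholder: send the message array to a model, get one reply back.
    raise NotImplementedError("wire in your model API here")

TOOLS = {
    # Placeholder tool registry; add shell, search, edit, etc.
    "read_file": lambda path: open(path).read(),
}

def run_agent(goal: str, max_turns: int = 80) -> str:
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        reply = call_llm(messages)
        messages.append(reply)                 # paste the model's turn into the array
        if "tool_call" not in reply:
            return reply["content"]            # no tool requested: we're done
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["args"])   # run the requested tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "ran out of turns"
```

That's the whole sequence diagram: model call, append, tool call, append, repeat until the model stops asking for tools.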
At this point you really need to be someone who understands AI, is a good software engineer, and is learning techniques for keeping agents in the loop and developing at pace. So, it's going to be really interesting to see how this pans out, folks. Really interesting. Because if we start seeing layoffs and job cuts, and I don't think they've even started yet, what happens to the people who get displaced? They're going to need jobs. And what are they going to do? They're going to get a job at the next employer, and they're going to do what was done to them. And this is my concern: it could be somewhat recursive. The best thing you can really do, if you see someone stuck in the headlights going "oh crap, what's going on," is get them to build an agent. I was in San Fran at a Codex meetup. I called out to everyone: hey, who here can actually draw me a sequence diagram and explain to me how inferencing works and tool calling works? Five hands went up out of 200. Holy crap. The number of people who understand these things is crazy low. And this is one of the most amazing things you can do right now: build your own agent. I've got a repo on GitHub; it's got a couple thousand stars, 300 lines of code. What you do is you take the agent and then use the agent to self-improve itself. Then you're building with a recursive latent space. And I think that's the thing that causes someone's head to completely flip and switch. You go, "Well, this is just a chat app." They're like, "Cool. Do you want it to be a TUI or do you want it to be a web app?" "Oh, a web app." Prompt for it. And you're like, "What the hell? I just saw self-evolving software." Yes. This is the one thing I can highly suggest for nudging people who are still stuck in the center: just get them to build their agent and get them interested in the idea of evolving the software. The software builds the software. So, it's going to be really interesting to see how this pans out, because a lot of people haven't noticed that AI isn't knocking at their doorstep; it's burrowing under their house. And that's really, really scary to me, thinking about how many people aren't actually paying attention to what's happening. The closing ponderings. Removing waste from your company is probably one of the biggest accelerators for AI. I've had clients with a repo per design atom: they're using Polymer, 200-plus repos, a box repo, a checkbox repo. Get rid of all that waste, folks. Move to monorepos and mainline development, because agents don't cross repo boundaries very well. If your primary source of truth for how systems work is an architecture document in Confluence, with some of it in markdown scattered across repos across the organization, fix that stuff. Fix the waste. Maybe you didn't hire enough designers. Maybe you made your software developers mushrooms instead of product engineers. Fix that stuff. It's only once you fix that type of waste that you really get the acceleration with AI. The organizations that invested in testing and all the things they should have been doing, they're getting accelerated by AI. Meanwhile, some big brand is having problems with AI, and you just know that in that organization they had no testing policy and they hadn't prioritized it.
So, no [ __ ], they're having problems adopting AI. They had low standards. There's an old saying that ideas are worthless and execution is everything. But what does it mean if you can just rip a fart into Claude Code and it builds that idea? To go back to how Dax opened up: thinking about what to generate is really hard, folks. It's really, really hard. Good ideas are shockingly rare. Shockingly rare. You should be spending a lot of time thinking about what the right thing to generate is. And not only that: when you have an idea, generate like 20 varieties of it, touch it, hold it, play with it, and figure out whether it's good. That's how you develop taste. Now, I keep mentioning identity functions. This is something that's kind of weird. We used to have tribes in software: "Oh, what are you?" "I'm a Ruby developer. I'm a PHP developer. I'm a Golang developer." And we had subtribes: do you use Neovim? Do you use Emacs? All that stuff doesn't matter anymore. It's all been erased. And that creates for people a kind of wound, a psychological wound, because it's all been erased. None of it matters. All that matters now is that you're a software developer. I would expect any software engineer to be able to pick up Rust within a couple of days now, or PHP, or what else have you, because it's all been made fungible, and that's going to be really hard for a lot of people to stomach. But one thing I can say: if you see someone stuck with this psychological wound, get them to build an agent, and get them to use that agent to self-improve itself and showcase evolutionary software and a recursive latent space. I've found that 10 out of 10 times that snaps them out of it. Because I don't want engineers who just download whatever comes up on Hacker News. I want engineers who understand the fundamentals underneath. I don't want a mechanic that just swaps engines; I want a mechanic that's able to explain what a piston is, or what a tool call is. Engineers are meant to be curious. Thank you. Okay, thank you for the great talk. >> Thank you. >> So, your model is returning low-quality responses and the provider is selling you garbage tokens. Who's to blame? Yes, quantization. Today on trial we have Philip Kiely, trying to redeem himself with a talk: How to Quantize Models Without Killing Quality. Good luck. Hello everyone. How's it going? So wonderful to be here today. And wow, Dax was not lying when he said you cannot see; it's the lights. The lights are bright here in Miami. I am here to talk about quantization, everyone's least favorite thing when they're trying to run their agents at peak hours. So, I'm Philip. I've heard that to make yourself easily identifiable, you should not use a group photo, so I put up a photo of myself with all my buddies from Baseten. We're that company you see in pink and green all over SF, if you're out there. And I work on inference every day; that's what this is. So, what is my agenda today? What are we going to talk about? We're going to talk about why models are so slow. We're going to talk about what quantization is. We're going to talk about the great gift of NVFP4, aka a great way to sell you Blackwell GPUs. We're going to talk about what is safe to quantize within models and what is more risky, and then take a look at some real-world performance and quality results.
But I mean, that's like the agenda, right? But what's the agenda? Why am I here? What am I trying to sell you? People are really suspicious of quantization. You know, all of those greedy inference providers are out there trying to rip you off, selling you poor-quality tokens at frontier prices by squishing their models into these little tiny four-bit number formats. And it's making the tokens sick. You can see that they're weak and sickly tokens. And ChatGPT did not necessarily realize that I meant LLM tokens, so it gave me bitcoins, but we're just going to pretend these are LLM tokens. There's a lot of discussion on the internet about how quantization just nerfs models; that's the second bullet point here, how maybe, under the hood, people are sneakily and suspiciously quantizing models down to tiny data formats. And some people are pretty cool with quantization. Dax was also up here getting involuntary LASIK from these lights this morning. He's good with it. He says some of the highest-quality providers serve models in NVFP4. You know, maybe there's more to quality than just quantization. So who's right? Bro, I'm just trying to give you cheap tokens, bro. Quit coming after me about this whole quantization thing. I just want your inference to be cheap and fast, and for your LLM tokens, because I fixed it in this image, to be frolicking through a field at 30 to 50% faster speeds. So, how do we get there? The thing about inference: inference is a hard problem with a lot of moving parts. And the thing I don't want you to take away from today's talk is, oh, there's this one magical silver-bullet thing called quantization, and you do it to a model and it solves all the problems and now your inference is fast and cheap. There are actually dozens of different technologies and techniques working together to make inference effective. That's what makes it such an amazing field to work in. But today we're going to take a close look at one single technique for one single part of the stack, which is quantization, because, again, it's the one everyone complains about. No one's ever starting any Twitter beef over speculative decoding or tensor parallelism. So, let's take a look at the hot topic. All right. We've got various levels of inference engineering knowledge in this room. I was actually talking to someone yesterday who did a graduate thesis on quantization. I was like, "Oh, do you want to just give my talk for me, please?" But she's talking about something else. And some people are a little newer to the field, so we're going to start with some basics. If you've run a model on a GPU before, just take a power nap for two minutes; I know we're right after lunch. LLM inference has two different phases: a prefill phase and a decode phase. Generalizing here, prefill is bound on compute: how many operations per second you can do. So to make prefill faster, you want access to faster cores. Decode, on the other hand, is the tokens-per-second part. If prefill is time to first token, decode is tokens per second. That's bound on memory bandwidth: how fast you can move data from VRAM into the L0/L1 caches to actually use it for inference. And for this, we need to move less data.
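A rough sketch of that memory-bandwidth bound on decode. The bandwidth figure is H100-class and the model size is illustrative; real systems batch, overlap, and cache, so treat these as upper bounds, not benchmarks.

```python
# Roofline intuition for decode: every generated token has to stream the
# weights through memory once, so tokens/sec is capped by bandwidth / weight size.

def max_decode_tps(params_b: float, bytes_per_param: float, bw_gbps: float) -> float:
    """Upper bound on single-request tokens/sec for a dense model."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return bw_gbps * 1e9 / weight_bytes

# Illustrative: a 70B dense model on ~3,350 GB/s of HBM (H100-class).
for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{name}: <= {max_decode_tps(70, bpp, 3350):.0f} tokens/sec per request")

# FP16: <= 24, FP8: <= 48, FP4: <= 96
# (ignores KV cache traffic, batching, and compute, hence "upper bound")
```

Halving the bytes per parameter doubles the ceiling, which is the whole memory-bandwidth case for quantization in one line of arithmetic.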
In inference in general, and not just for LLMs, but for image and video generation as well, you can be compute-bound, and for audio transcription, speech synthesis, etc., you can be memory-bound. Quantization helps with both. It's one of the only model performance techniques that helps with both your compute problems and your memory bandwidth problems at the same time. So let's take a detour back to the old days of computing, when we talked a lot about compression. Compression has been around for a long time; if you've watched Silicon Valley, they made a whole six-season TV show about a compression company. And we've gotten really good at compression. These two images you see up here, and I'm putting myself in your frame of reference for a second: the one on the left is four times bigger than the one on the right, but they look identical. Maybe if I took the one on the right and blew it up onto a billboard, you might be able to see the difference, but here on my screen at least, it looks just fine. So, how can we do the same thing for models? The problem with models is all these kernels, these GEMMs: they take a long time. You've got to read the data off the VRAM and then do the matrix multiplication on the cores. And so the solution is: what if you just had smaller numbers to work with? What if every time you moved 100 megabytes through the VRAM, you were moving twice as much information to the model? What if you were using cores that were twice as powerful because they're operating at a lower precision? Quantization is this magic thing that increases your effective bandwidth. It increases your cache residency if you can do KV cache quantization. It works with any model and any modality to solve any bottleneck. So why doesn't everyone just love this thing? Well, the problem is that generative AI models exhibit emergent behavior: you throw some stuff in, some things happen, and then an output occurs. These are the sort of technical insights you come to a Philip Kiely talk for. And the problem with "some things happen" is that, as we all know, inference is non-deterministic and a bunch of stuff can happen. If you're effectively rounding a bunch of your numbers, multiplying them together, and compounding errors throughout the inference process, maybe you're still going to get the result you want, maybe you're not. Maybe your logits end up skewed just slightly. Your prediction probabilities end up just off. And then your next token ends up being "act" instead of "abs," and then you get up on stage and suck in your gut for 25 minutes so that you act like you have abs. This can be a problem in your inference system. The other thing I want to disambiguate really quickly before we jump in: I'm mostly talking about post-training quantization. Increasingly, AI labs are setting the model's native precision somewhat smaller so that they can take advantage of advanced inference without losing any quality. But we're talking about the stuff that's under your control as an individual developer pulling a model down off of Hugging Face: the post-training quantization that you can do at inference time. So for our purposes, the model weights are already baked. They're already done.
We're not doing any further distillation or fine-tuning or RL or any of that kind of stuff. The model's ready to go. It's already as smart as it's going to be. We just want to make it faster, and hopefully not make it any dumber. So let's take a look at where we're starting and where we're moving to. Doing okay on time. Okay. So, the data formats. You can represent the components of a model: if you think about Kimi with a trillion parameters, we've got a bunch of matrices with a trillion different numbers in them. How big is each of those numbers? How many bits are we using to represent them, and what format are we using: an integer or a floating point? Floating-point numbers have three different types of bits within them: the sign bit, positive or negative; the exponent bits; and the mantissa bits. If you think about the way you construct a floating-point number from these bits, it's two to the power of the exponent times the mantissa, and that gives you something called dynamic range, which we're going to get to in a second. Now, part of the problem with quantization, part of the reason it has this really bad reputation, is that a lot of these floating-point formats are relatively recent. So if you look here, oh good, you can see my mouse, this is fantastic: in this sort of 2022 era, when Hopper and Lovelace were first rolling out, they brought with them the concept of FP8. Blackwell brought with it the concept of FP4 for inference in production. And before that, when we were doing quantization onto Ampere, onto Turing, or onto local hardware, in many cases these were integer quantizations. And integer quantizations, not to cast shade, are like, not very good. And so the industry's opinion of quantization was formed on integer quantizations. Now we have floating-point quantizations; let's see if they're better. So, dynamic range is the key thing with quantization. It's the ability to encode very, very small values and very, very large values on an absolute basis. Floating-point formats use the stuff we were talking about, the signs, the exponents, and the mantissas, to preserve dynamic range. If you think about an FP16 number, your sort of standard format for a model, you start with five exponent bits; that controls how big and how small your numbers can get on an absolute basis. When you move to FP8, you still have four of them in most cases; you've actually only lost one bit of dynamic range even though you've shrunk substantially. When you get down to FP4, though, which is where we want to go so that we can do really fun, fast stuff on Blackwell, you lose two more, and now you're kind of cooked from a dynamic range perspective. So what do you do about it? You're trying to map all of these values from this massive range down to literally just 16 buckets: with a four-bit floating-point format, you have only 16 representable values standing in for what used to be 65,000 numbers. How do you do it? How do you put all of these numbers into these buckets? Well, you cheat. You have something called a scale factor, which lets you record additional information at the cost of keeping track of more numbers and doing more math. You can have a scale factor at the tensor level, at the channel level, or at the block level.
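A toy sketch of block-level scale factors, under simplified assumptions: real NVFP4/MXFP4 formats use FP4 value grids with FP8 block scales in hardware, while this just snaps each block of 16 values onto the E2M1 magnitude grid to show the mechanics.

```python
# Toy blockwise quantization: one scale factor per block of 16 values,
# values snapped to the FP4 (E2M1) grid. Illustrative only, not NVFP4 itself.
import numpy as np

FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1
GRID = np.concatenate([-FP4_MAGNITUDES[::-1], FP4_MAGNITUDES])       # add signs

def quantize_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    # Scale so the block's largest magnitude lands on the grid's largest value.
    scale = np.abs(block).max() / GRID.max() or 1.0
    scaled = block / scale
    # Snap every value to its nearest representable grid point.
    idx = np.abs(scaled[:, None] - GRID[None, :]).argmin(axis=1)
    return GRID[idx], scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

weights = np.random.randn(16).astype(np.float32)   # one block of 16 values
q, s = quantize_block(weights)
err = np.abs(weights - dequantize(q, s)).mean()
print(f"scale={s:.4f}, mean abs error={err:.4f}")
```

The per-block scale is what rescues dynamic range: each group of 16 values gets its own mapping onto the tiny grid instead of sharing one mapping across the whole tensor.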
And today, the best small-scale formats are microscaling data formats that use blockwise quantization. There are a couple of general-purpose ones, but I'm not here to shill general-purpose things that you can run anywhere. I'm here to shill NVFP4, which you can run on NVIDIA Blackwell. The difference with this format is that you actually have two scaling factors and a smaller block size. Your blockwise scaling is n = 16: each block-level scale factor is applied to 16 numbers. Why is 16 important? That's how many different values you have. So now you can use one grid of values for everything and then apply a scale factor that maps each block appropriately. And then, to make sure you get a whole ton of dynamic range, you have a secondary FP32 global scaling factor. Now, keeping track of all of this stuff is hard and expensive and slows you down a little bit, but it's all baked into the Blackwell architecture, so we can just forget about it, run NVFP4, and life is great. It provides increased accuracy because your block scale factor is now an E4M3, so you have some mantissa in there for specificity; you get your extra exponents from your tensor scaling factor, and life is good. That's my talk, everyone. Just use NVFP4 and everything's easy. You're done. Oh wait, we're not done. Okay, there are still some other things you have to do besides just using the magical data format that only works sometimes, in some cases. The big question is: what can you actually quantize? There's a spectrum from pretty safe to why-the-hell-would-you-touch-that. In terms of your parameters and your model weights, those gigantic linear layers are pretty safe to quantize in most cases. Generally, we see a lot of success pushing those all the way down to four bits. Activations and KV cache: maybe a year or two ago it was even kind of risky to put those in eight bits; now we're getting pretty good at it. And attention? Don't touch attention. Just leave it alone. It's hard enough. Just let it do its thing. So generally you don't quantize attention, unless you are feeling really, really lucky that day. If you want to take a deeper look at which parts of the model to quantize, check out my friend Ali's blog post on Twitter called "four bits," where he goes super deep, for an image model, into these different layers and what is and is not safe to touch. But in general, not all layers are created equal. For example, your input and output layers from that weights block might be more sensitive; you might only want to quantize some of the interior layers. In a vision-language model, for example, you might want to leave the vision encoder alone, because it's small and it's more sensitive, and just focus on the main LLM layers. There's all kinds of model-specific sensitivity that you want to account for in this quantization process, so it's always important to keep that in mind as you're working. The other thing to keep in mind, of course, is hardware and kernel support. Just because NVIDIA says you can quantize something to a certain format and run it on a certain GPU does not always mean you're going to be extremely successful doing that in production.
Again, a lot of the open-source work and the kernel work is still targeting Hopper, still targeting that FP8 quantization. So if you're trying to run NVFP4, you should expect to have to do a lot of porting to get something like a DeepGEMM kernel up and running on your new Blackwell architecture. There are other factors to think about in terms of which models you can and cannot quantize. The biggest one is model size. All else equal, models with more parameters are more resistant to negative quality impacts, because any individual outlier that might have gotten smoothed over in the process is not as important in, say, a trillion-parameter model as it is in a billion-parameter model. You can also, within the architecture of the model itself, introduce quantization-aware training, which labs are increasingly doing. Look at GPT-OSS, for example: it has an MXFP4 native quantization, which is one of the reasons that model had so much staying power on the market; it resists quantization very, very well. And the final thing to think about, when you're the one actually doing the quantization hands-on, is the calibration process. As you're using, for example, NVIDIA's ModelOpt or some other tool to apply the quantization, to calculate what the new weights should be and calculate your scale factors, you want to do that under conditions that closely match production usage. For example, if you're calibrating on a chat dataset and you're going to use your model for code generation, that's probably not going to give you a very appropriate calibration output. Cool. So, we've done all this hard work. Let's see if it was actually good for anything. To review: quantization was bad because we had these integer-based data formats, we were quantizing small early models like a Llama 70B, and we were doing it in a generic way a couple of years ago, when quantization was not applied or calibrated super specifically. That's how quantization got its bad reputation as the lobotomizer of models. Today, quantization can work because we have floating-point formats, we have quantization-aware training, and we kind of, sort of, at least a little bit, know what we're doing. So all of that to say: does it actually work? Does it actually do anything? Here's the other gotcha with quantization. You might look at the spec sheet and think, whoa, I can get maybe three and a half or four times faster just based on my FLOPS and VRAM bandwidth, and that's just going to translate linearly into performance gains. It's not. Or you could look at your kernel profile and say, well, if every single one of these kernels becomes twice as fast, I'm going to get 1.9x faster. You're not. Generally, the observed gain from FP16 to FP8, or FP8 to FP4, is like 30 to 50% with each step. So it's definitely a more limited observed real-world gain. But still, 30 to 50%: that's an extra 60 tokens per second. That's a hundred milliseconds off your time to first token. That's a few million dollars off your inference bill. It's a big outcome, if you can confidently get your model there. So that's where the quality checking comes in. First off, everyone always says look at your data, look at your outputs. This is my counter-example to the compression image I showed earlier. We have here a full-precision tiger and an NVFP4 tiger.
This is my counterexample to the compression image I showed earlier: we have here a full-precision tiger and an NVFP4 tiger. Can anyone tell me the difference between these two tigers? Or can anyone notice that I actually switched the labels on you? They look exactly the same, but the one under "full precision" is actually the NVFP4 tiger, with, of course, the same seed and the same settings, to get this nearly identical output image. And for something you can't just eyeball, maybe an LLM or a more complex agent, you can look at a perplexity score; you don't want your perplexity to go up. You can, of course, run your same eval set on the original weights and the quantized weights and make sure everything is within a comfortable margin of noise. You can do spot checks: always check your function calling, long context, all that kind of stuff. And if it's not quantization, your model could feel dumber because of the reasoning effort, because of the chat templates, because of a new checkpoint. There are all kinds of other reasons. So quantization can be bad, but don't blame it for everything. Quantize if you want to make your model faster, and you can do so without making it dumber. Thank you all so much. And I'm going to take a few seconds to plug something: you can get this book. I unfortunately did not have enough room in my luggage for everyone, so if you scan this QR code and hop on the wait list, I will send you an email when I get Shopify up, and if you DM me a picture from AI Engineer Miami, I'll send you a code for a free one. Thank you so much, and let's have a great time out here. >> Whoa, Philip, that was a very energizing talk. For the next one, I hope you're all going to go bananas, because we're inviting the Google DeepMind team to talk about generative media, including your favorite models like Nano Banana and more. So I'm going to introduce my colleagues Alisa and Guillaume. Welcome to the stage. Alisa and Guillaume are from Google DeepMind, and they both work on AI Studio. If you haven't heard about it, they're going to do a demo for us today, so super exciting, and I'll let them introduce themselves a little more. We have a dynamic duo of a PM and a developer advocate, and they're going to walk you through the creative world you can get into with AI Studio and generative media. Take it away. >> Hi everyone, my name is Alisa Forton. I'm one of the PMs on the Google AI Studio team, and my focus is generative media models, specifically image, video, and audio models. And this is my partner in crime. >> Hello everyone, I'm Guillaume. I'm Alisa's partner for all of the Gemini model launches. I'm a developer advocate, meaning my job is to represent all of you inside Google and to make sure that whatever we release is easy for developers to use, so you can easily make things with our models. We're going to talk about generative media, but first a word about DeepMind's vision for AI models. From the beginning, the vision for DeepMind was to make multimodal models, because we believe models need to understand as many modalities as possible: video, audio, sensors, speech, and so on, kind of like the five senses. And they should express themselves in all of those modalities as well, to be able to generate images and so on.
Most of what we're going to talk about today falls under what we call world models: we want the model to understand everything about the world and to act on it. >> And before we continue, we're going to do a bunch of live demos today. I don't know if it will actually help, but maybe some phones can go on airplane mode; that might help our live demos. We were practicing, and the speeds are kind of slow. Thank you, we'll be prepared. So why do we love working on gen media? Think about three words: entertain, communicate, and learn. Open your phone, look at the world: there's media all around you. When we improve these models, we aren't just tweaking code or training data. We're helping teachers teach, helping businesses connect with their users, and helping creators create. That's what we're building, and that's why we're so excited to speak about this with you today. >> And we're shipping a lot of things at Google; that slide is just the gen media models we have. We have the video ones, the image ones with Nano Banana, and we shipped Lyria 3 quite recently as well. We know we're shipping so much that it's hard to keep track of the whole offering, and that's why we wanted to do this talk: to show you the different models we have, give you some tricks, and do live demos of what you can do to improve your usage of the models. >> Which brings me to the first model I'll talk about. If you think about the past, the camera was used to capture reality; generative video actually brings your imagination into the real world. That's why Veo is so exciting, and we're just at the beginning of what generative video can do. When we work on these models, we always think about how to make them accessible to as many people as possible. We're thinking about those creators, the teachers, the small businesses and how they reach their audience. That's why Veo is a family of models. The most recent one we introduced is Veo 3.1 Light, which is basically our fastest, most cost-effective model; it lets you quickly prototype and bring things to production. So when you look at our models and wonder which one to pick for your use case, and this will be similar across all of gen media: you use Veo 3.1 Light, the fast model, when you need to quickly prototype something, do a bunch of generations at speed, post a bunch of videos on social media. And when you're truly building something cinematic and need the 4K outputs, you move to our flagship Veo quality models. >> So, some very quick tips on how to get something better out of Veo, and as we said, we're doing live demos, and they never work. One thing most people do when using Veo, or any generative media model, is send prompts that are just too short. The shorter the prompt, the harder it is for the model to know exactly what it has to generate, so it basically has to fill in the gaps itself. That's why the first tip we can give you, for any gen media model, is to write longer prompts: you want to reduce the number of things the model has to invent by itself. That's also why I made this quick demo, which basically uses Gemini to take your prompt and enhance it, so it gets better and has more detail about what needs to be generated.
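A minimal sketch of that kind of prompt enhancer using the google-genai SDK; the model name and the instruction text here are illustrative choices, not necessarily what the demo used:

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

ENHANCER_INSTRUCTIONS = (
    "Rewrite the user's short video idea as a detailed Veo prompt. "
    "Describe subject, setting, lighting, camera movement, and style explicitly."
)

def enhance_prompt(short_prompt: str) -> str:
    # Model name is illustrative; swap in whichever Gemini text model you use.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"{ENHANCER_INSTRUCTIONS}\n\nIdea: {short_prompt}",
    )
    return response.text

print(enhance_prompt("a manatee surfing in Miami"))
```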
And the good thing is that the longer and more detailed the prompt, the easier it also becomes to take the video you made with Veo 3.1 Light and make it better and larger with the standard model, because if the prompt is very detailed, the model knows exactly what to do and does more or less the same thing each time. And if you want to go even further, you can use what we call JSON mode: basically one very big JSON with a lot of fields, like a title, a creative summary, and, as you can see, for each character the name, the visual description, attire, hairstyle, accessories, and so on; where the location is, what the art direction is, and so on. The point is to be thorough, to have a checklist of all the things you need to tell the model, so you're sure it will do exactly what you want. You can see we even have chunks of what happens between the first and second seconds, what happens in the next two seconds, and so on. >> One thing to add about JSON prompting: our evals don't actually show that JSON prompts work any better than natural-language prompts. But if you're structuring a prompt in your head and then want to update things like timestamps, actions, and small background details, JSON lets you keep that structure and make those minor changes. >> So you see the kind of thing you get with a proper prompt that describes everything you want. The generation was a bit slow, so I ran it beforehand.
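For illustration, a structured Veo prompt along the lines described above might look like the following. The field names are made up for this sketch rather than a documented schema; the point is the checklist structure, not the exact keys.

```python
import json

# Illustrative structure only. The talk notes evals show no advantage over
# natural language, but JSON keeps edits (timestamps, wardrobe) localized.
veo_prompt = {
    "title": "Sunrise over Biscayne Bay",
    "creative_summary": "A slow cinematic flyover of Miami at dawn.",
    "characters": [
        {
            "name": "Runner",
            "visual_description": "mid-30s, athletic",
            "attire": "teal windbreaker, white sneakers",
            "hairstyle": "short curls",
            "accessories": ["sport watch"],
        }
    ],
    "location": "Biscayne Bay boardwalk, Miami",
    "art_direction": "warm golden-hour palette, anamorphic lens flare",
    "timeline": [
        {"seconds": "0-2", "action": "aerial approach over the water"},
        {"seconds": "2-4", "action": "tilt down to the runner on the boardwalk"},
    ],
}

prompt_text = json.dumps(veo_prompt, indent=2)  # pass this as the prompt string
```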
But let's go back to our slides, and Nano Banana. >> Next one. Where are your bananas? Everyone's recent favorite: Nano Banana. Just like with all our other models, we now have a family of Nano Banana models. To give you a quick TL;DR, since we're running out of time: with Nano Banana 2, you're looking at quick prototyping. You may need draft-like resolutions of 512 pixels, you may need lower cost; that's the workhorse model, Nano Banana 2. When you're ready for more high-fidelity, photorealistic, natural outputs, that's where Nano Banana Pro still excels. And all of our Nano Banana models support a variety of aspect ratios and resolutions; basically, we want to make sure we meet all the project and asset needs you have today. >> One of the core features of Nano Banana, since Nano Banana Pro, is that it can use search grounding: basically, it goes on the internet and searches for information about what you're asking, so it can give you the latest information. You can ask it to make an image about yesterday's news, or about the score from a specific sports game. In this case, for example: look me up and make an image that represents my work and everything I did in my past roles. >> And then recently, with Nano Banana 2, we actually introduced grounding with Google Search for images. Generative media models are great at the actual rendering and the outputs, but what they're really terrible at is facts, right? So in this top-right example, there's a bridge in Pakistan we were trying to represent, and we asked Nano Banana Pro, which only has search grounding, to render a watercolor of this bridge. You can see how it missed the structure of the bridge and how complex it is. Well, Nano Banana 2 actually grounds its responses in images as well. It's not just hallucinating what it thinks the bridge should look like, or what the training data says it should look like; it can view an image of the actual bridge, and you can see how the bridge on the top right is rendered with more accuracy. >> Once again, some quick tips for Nano Banana. The first thing to remember is that the model was mainly trained to do multi-turn editing: you give it an image and ask for edits on that image. I know most people use it to generate images, but that's actually not what the model was initially meant for. It's also very good to use references: you can give it a lot of images you want it to reuse, such as this is my character, this is the scene, and so on; that's how you get the best results. We also got a lot of requests, because people were kind of disappointed about this, for how to get a transparent background. So I made this quick app, which I need to reload at the moment, to show you the way I create transparent backgrounds for images using Nano Banana. And I think we're running out of time, so let's go with the pre-run one. Basically, this was a manatee outing in Miami wearing an AI Engineer cap. The way I do it is to create a first image and ask for a white background, and then ask it to change the white background to a black background. The thing with Nano Banana is that it's pixel-perfect: when you ask it to change an image, it won't touch any pixels it doesn't need to touch. So then you just do a diff, and everything that changed between the two images becomes the transparent background. That's how you can get transparent backgrounds very easily.
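A sketch of that diff step with Pillow and NumPy. This assumes the model really does leave the subject pixels untouched between the white and black renders, and it recovers a proper alpha channel from the compositing equation rather than just a binary mask: with observed = alpha * color + (1 - alpha) * background, subtracting the two renders gives alpha = 1 - (on_white - on_black).

```python
import numpy as np
from PIL import Image

def recover_alpha(white_path: str, black_path: str) -> Image.Image:
    """Rebuild transparency from the same subject rendered on white and black."""
    on_white = np.asarray(Image.open(white_path).convert("RGB"), dtype=np.float32) / 255
    on_black = np.asarray(Image.open(black_path).convert("RGB"), dtype=np.float32) / 255
    # alpha = 1 - (observed_on_white - observed_on_black), averaged over RGB.
    alpha = np.clip(1.0 - (on_white - on_black).mean(axis=2), 0.0, 1.0)
    # Un-composite the foreground color where alpha is non-negligible:
    # on_black = alpha * color, so color = on_black / alpha.
    a = np.maximum(alpha[..., None], 1e-3)
    fg = np.where(alpha[..., None] > 1e-3, on_black / a, 0.0)
    rgba = np.dstack([np.clip(fg, 0.0, 1.0), alpha])
    return Image.fromarray((rgba * 255).astype(np.uint8), mode="RGBA")

recover_alpha("manatee_white.png", "manatee_black.png").save("manatee.png")
```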
There's another neat trick I wanted to show you, in case you want to create lots of images with Nano Banana. You have different ways to save on cost: you can use batch, or the new flex service level, both of which cut the cost in half. But there's actually another very neat way to reduce cost, and I'll jump right to the results. The trick is, instead of creating multiple small 512-pixel images, create one big image in 4K, but ask for it to be a grid of multiple images. In this case, I'm just asking for an AI conference in Miami named AI Engineer, and the model creates 64 images representing this prompt. But deep down it was actually one image laid out as a grid, and I just had to cut the image to get my small images out of it. And if we check the prices, it cost me the price of one image instead of 64, even though they were smaller, so I'm saving about 95% of the cost that way. It even works if you want all kinds of different prompts: this one is 24 different prompts, and the model was still able to create those 24 different images within one image, in one shot. So that's a neat trick if you want to save on cost with Nano Banana.
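The cutting step is simple; here's a small Pillow sketch that slices one grid render into tiles. The filename and the 8x8 layout are made up for the example:

```python
from PIL import Image

def split_grid(path: str, rows: int, cols: int) -> list[Image.Image]:
    """Cut one big grid render into individual tiles (e.g. 8x8 = 64 images)."""
    sheet = Image.open(path)
    tile_w, tile_h = sheet.width // cols, sheet.height // rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            tiles.append(sheet.crop(box))
    return tiles

# One 4K generation billed as a single image, then sliced locally:
for i, tile in enumerate(split_grid("conference_grid.png", rows=8, cols=8)):
    tile.save(f"tile_{i:02d}.png")
```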
Now let's move on to the latest addition to our gen media models. >> Yes: Lyria. The Lyria model family is our flagship song and music generation model. Again, we offer two different models. The first is great for quick loops or quick promo audio, because it generates a 30-second song. And then Lyria 3 Pro is our flagship model: it can generate a full song, it allows a lot of control over the different parts of the composition, and it can take multimodal inputs, like images, to make a song based on any image you provide. And of course, all the songs can be in different languages. Just like with all our models, the models take natural language when you prompt, but it helps to have a structure in mind for how you'll prompt for these songs. Guillaume, take it away with a demo. >> Yeah. What's really cool with this model is that you can prompt it in different ways. One of the easiest is to say: this is what I want in the intro, this is what I want in the first verse, this is what I want in the chorus and the second verse, and so on, and the model builds the song from that. For each of those sections you can specify the style of music and how energetic you want it to be; you can even set the scale, or set the BPM for the whole song, and it will create the song according to what you asked. So, it's generating at the moment. I don't know if we'll wait for it; it can sometimes take a minute. But we can check the one I generated earlier with the same prompt. I asked the beginning of the song to be... we don't have the sound. Can you enable the sound on the laptop, please? >> No. >> Can we have the sound working? >> Perfect. Thank you. And then it speeds up, because we asked for the verse to be, I don't know, a mix of cumbia and salsa with Spanish lyrics. >> And if we fast forward, I think it gets even more dynamic. One thing to add here: if you're using the 30-second model, your prompt needs to fit within those 30 seconds as you lay out your timed durations. If you're using the full-song model, you obviously have a bit more creative freedom. If you go to Google AI Studio, you can check out Lyria and the special composer mode, which, as Guillaume has right here, basically lets you construct your prompt the same way you might pass it to the Gemini API. >> Yeah. And let me show you another very cool demo with Lyria, because Lyria 3 is not actually the first music generation model we released. We have this other one, named Lyria RealTime, and instead of creating music from a prompt, it creates music in real time, as the name implies. It's a live model: once you start it, it keeps creating music as long as you keep it running. In this demo, you're in space, every planet is a prompt, and when you move around the planets, the music changes in real time. One of the cool things, and I never know how to find it, is there's a place with Christmas songs, and right next to them there's Scandinavian metal, so the mix can be quite fun when you blend them. But I think we'll just have to go with salsa, I guess. >> But see, it's very cool the way it mixes things in real time. >> And there was another demo we wanted to show, this one. You wanted to try it? >> I want to try it. Yeah. >> Okay. Are we ready? >> No. Apparently it needs to reload again. Okay, let's go. >> Hey DJ, can you hear me? >> You're live on air with the jukebox DJ, and we're about to create an absolute banger together. So, what's the vibe we're going for, and what should this sonic creation be all about? Lay it on me. >> Okay, let's create a song about really sunny weather in Miami, and make it Latin style. >> Oh, lovely jubbly! Miami sunshine and Latin flavors, that's a proper summer anthem right there. Now, are we talking sipping mojitos on the beach, dancing till dawn, or something a bit more chill? And what kind of Latin style specifically: salsa, reggaeton, bachata? Give me those details, mate. >> Uh, yes. >> Whatever. Yes. >> Get going. >> That means we're moving. >> That wasn't it either. >> Absolute banger. I need a little more info, mate. Spicy like salsa, or something smoother? And what's the story: just sunny days, or maybe a summer romance? Hit me up. >> We can't hear what it's saying because... >> Salsa. >> Salsa! Top of the pops, salsa it is. Picture this: pure sunshine. >> Yeah, you know, that's kind of the problem with LLMs. Sometimes they're way too talkative. >> Right then, some folks just love a good chinwag, don't they? Making music happen. Hope you love that salsa track, mate. >> I'm glad you're aware of it. >> Right. Always happy to keep the energy up and make some noise. >> Okay, let's move on, because we're already out of time. >> We have 13 seconds, but we were supposed to do the song in French. >> Yeah, next time. >> Okay. So what... oh, that's static noise from the thing. So, quick note: the talkative DJ is actually our recently released live audio model. And my next slide is about audio, real quick. We have a new text-to-speech model, Gemini 3.1 Flash. If you're familiar with text-to-speech models, they take your flat input text and generate speech. This model is actually the one powering everyone's favorite podcast feature in NotebookLM. And with this new update, we're basically putting the user in the director's chair: you're not just handing the model flat text to speak, you have more granular control, using audio tags to control how the model speaks and, essentially, what emotion it's producing.
And then of course we support multiple languages, with 24 languages we recently optimized to make sure they deliver the highest quality with native accents. >> So, a very quick demo here. One thing about this model is that you can take any voice; we have a bunch of hardcoded voices, but you can prompt each one specifically for the way it should talk. You can say this character will have a style of vocal smile, whatever that means, speak very fast, and have an American accent, while the other speaks newscaster rapid-fire, also with an American accent. And here's what it does. >> Welcome back to the show. Today we're diving into the intersection of AI and creative expression. >> Exactly. I've got so many thoughts on what happened this week. >> It really is shifting daily. I mean, did you see the demo they dropped on Tuesday? >> So, the cool thing is that you can change that, not in real time, but in the text. You can say the next sentence is very angry, and the voice will change the way it talks in the middle of a sentence, things like that. But what I wanted to show you is another neat trick. If you want to create conversations with more than two characters, you can actually do that, and that's what I'm doing here. I'm creating one conversation with five astronauts trying to bake a cake on the International Space Station, and for each character I create one prompt. Then, instead of just asking for a discussion between the five characters, which wouldn't work because the model is limited to two voices, I send it, if you check here: this is the first character, who uses the low-pitch voice. I alternate the low-pitch and high-pitch voices, attach the full prompt that represents each character, then the next one gets its own prompt, and so on. And that actually gives you a conversation with more than two voices. >> Team, we face our greatest challenge yet: the funfetti. >> This is ridiculous. Flour is getting into the ventilation. We will choke on the sprinkles. >> Oh, do not be a spoilsport. It is simply a matter of whisking with enthusiasm. Pip pip. >> Actually, if the centrifugal force of the whisk exceeds 3.4 g, the batter will atomize. >> Whoa, dudes. Exploding. >> It is a yellow space orb. Capture the orb, Chuck. The mission depends on it. >> So you see, it's actually using the same voices, but you get the feeling that each character has its own voice. I basically see it as when I read bedtime stories to my daughter and do a voice for each character. It's the same, but the model is slightly better at it than me.
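Here's a small sketch of that trick: folding five characters onto two available voices by tagging every line with a voice plus a per-character style prompt. The tag format, voice names, and helper are hypothetical; the point is the structure of the text you send, not a specific TTS API.

```python
# Hypothetical mapping: five characters share the two available TTS voices,
# each with its own style prompt prepended to every line it speaks.
CHARACTERS = {
    "Commander": ("low_pitch", "gravelly, slow, dramatic pauses"),
    "Engineer": ("high_pitch", "rapid-fire, precise, slightly nasal"),
    "Botanist": ("low_pitch", "cheerful British accent, sing-song"),
    "Pilot": ("high_pitch", "laid-back surfer drawl"),
    "Robot": ("low_pitch", "flat, metronomic delivery"),
}

def build_script(turns: list[tuple[str, str]]) -> str:
    """turns is a list of (character, line); output alternates the two voices."""
    parts = []
    for character, line in turns:
        voice, style = CHARACTERS[character]
        parts.append(f"[{voice}] (speaking as {character}: {style}) {line}")
    return "\n".join(parts)

script = build_script([
    ("Commander", "Team, we face our greatest challenge yet: the funfetti."),
    ("Engineer", "If the whisk exceeds 3.4 g, the batter will atomize."),
    ("Botanist", "Do not be a spoilsport. Whisk with enthusiasm!"),
])
print(script)
```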
We're at time, but very quickly: one of the cool things with gen media models is that you can mix the models together, and I actually gave a workshop on how to do that two weeks ago at AI Engineer London. If you want to check it out, you can follow this link; all of the workshop content is there, and I guess the video will be uploaded soon as well, so you'll be able to see that. >> And we were planning to ask what you need from the models, but since we're running out of time: our job is basically to get feedback from you so we can steer the models in the right direction and get the actual users into the models, instead of what the researchers think users need. So if you want things from the models, please fill out this form and tell us what you're missing, so we can use your feedback to make the models even better. >> The very short form goes directly to me, so I can bug research to build things you'll actually use in the real world. >> Yeah, that was this one. >> We didn't skip any slides. No, we just went quickly. >> Yep. >> Thank you. Thank you, everyone. >> All right. Awesome. A note to presenters: do not skip any slides, please. Our next presenter works at the intersection of AI, data engineering, security, and governance, and she's going to share how she built an agent with scale and security in mind for enterprise use cases. I'm excited to invite Anna to the stage; her talk is "From tickets to PRs: shipping a governed Snowflake ops agent with LangGraph and MCP." Please welcome Anna. >> Hello everyone. They weren't lying: you really can't see a thing up here. So I'm going to pretend I'm just talking to myself, back in my hotel room practicing the speech. But no, I'm actually super excited to be here today. This is one of my favorite projects I've honestly ever worked on. I'm here to talk about how I built a governed ops agent, specifically to handle our Snowflake operational work at Pinterest. Now, I'll focus a little on the types of Snowflake requests we were getting, but I want you to look at this as a reusable pattern, something you can apply to your own operational work. That could be infrastructure setup requests, IT help desk ops requests, or data access to other data systems. Whatever it is, I've purposely abstracted away some of those details so we can focus on the patterns: how we approached this problem and how we chose to solve it. There we go. Very briefly, what I'll cover: first, the problem and why an agent made sense for it, which is why I'm even here talking today. Then the workflow, some of the design, some of the guardrails and controls we added, and lastly, what made this shippable and what you can walk away with today. And that's the end of my presentation. Okay, there we go. So, as we know, as LLMs get more and more advanced, they can solve more and more complex problems. But technical complexity wasn't the issue for us. It was that we had a lot of routine, repetitive requests that just took manual time to resolve. In the case of Snowflake, we were getting access requests, schema creation requests, IP allowlist changes, things like that: very repeatable flows. We have to check what roles exist, who needs access to what; maybe people already have the permissions. It just takes time, so people were sometimes waiting days on a ticket simply because it was sitting in the queue. That was the challenge. Oops. Okay. This slide is screaming automation, right? It's saying we have an opportunity to automate, and that's true, if you look at the first three items here, the ones without the red border.
We had reviewable output: my team was solving these requests by generating a SQL script you run against Snowflake to grant permissions or make whatever changes are needed. We had a repeatable process: in the end, everyone is asking for access to some data. Sales needs access to this, marketing needs access to that; the use cases vary, the datasets vary, the roles and permissions vary, but they're very similar, repeatable processes. And we had certain control points already built in. When we build agents, the goal isn't to move away from PR reviews or approvals; you don't want to remove those. Maybe you already have standardized deployment workflows, like we did. You want to maintain those where it makes sense, and actually, those are exactly the guardrails you should be placing. But still, those three points alone don't say you need an agent, right? No. They're still just screaming, here's a path to automation. It's the last point, contextual reasoning, using LLMs for what they're good at, where we really saw the opportunity to build an agent. At Pinterest, we purposely abstracted away some of the nitty-gritty details, like the Snowflake role hierarchy, our naming conventions, how we set up different teams' access. So people don't always know exactly what they're asking for, and that's on purpose: you don't want someone stuck trying to figure out the naming convention for their team's role. We let them make really simple requests, like "the sales team needs access to this sales dataset." What LLMs are good at is understanding text, gathering context, doing lookups, doing searches. So that's what we leaned into, and we built an agent around it, and it was really, really fun. I didn't mention this earlier, but this actually started as a hackathon idea that we built in two days. The road to production took a long time, though, and that's why I'm here today: to talk about how we think about security, governance, and the different controls we needed to add. Before I dive into the architecture and more of the design, I want to give you the two key framing principles that drove how we approached this problem. The first is my secret to building a good agent: a good agent needs a good mascot. Anyone who works with me knows I love to name my agents and give them a fun mascot; it makes building them so much more fun. On a more serious note, and something to keep in mind, once it loads, there we go: match agent authority to workflow risk. When I say authority, I mean what the agent can do and what it should not be able to do. And then think about risk: think about the systems you're working with and what data you're touching. For us, in the case of Snowflake, we store sensitive production data there, and it's a SOX-compliant system, meaning any processes we build on it, including agents, have to be auditable. We can't have agents modifying our data and messing with our reporting. So as you think about how this applies to your own operational work, think about what authority you give your agent and the risk involved. I'm pointing this out because it drove part of our design, but maybe you have scenarios that are lower risk; maybe you can give the agent more authority.
So in practice, that principle means deciding what the agent can and can't do. For the agent we built, what can it do? Here we leaned into what LLMs do best and where our team had its bottlenecks. Our agent does a great job interpreting requests, seeing what details someone gave us and what's missing; gathering context, doing those lookups, figuring out that yes, it's a data access request, and what's needed in terms of roles and permissions; generating SQL code, which LLMs are great at; and opening GitHub PRs. But that's where we decided to draw the boundary and hand off to governed workflows. What can't our agent do? It can't write to production. It can't modify any data: even though it generates metadata queries, it can't actually run them, just in case something goes wrong; it might drop our entire data warehouse. It can't approve its own changes, though we can have other agents do code reviews. In the end, it simply can't act without constraints; you have to define those boundaries. To share the high-level architecture, I like to break it into three parts. First, the intake: no matter your use case, you'll have requests coming in somewhere, whether tickets, Slack messages, or other ticketing systems. Then the agent, which is where all the magic happens. We designed our agent as a LangGraph workflow, and I'll go into why on the next slide, but here are the different things it can do: parsing a request, turning that messy, ambiguous request someone gave us into structured output; looking up metadata against Snowflake, doing all those lookups; generating SQL; even generating the PR. We have LLMs wired into all those steps; even for PR creation, why not have the agent write the PR summary? Once that PR is out there, though, we hit what I call our governed execution part. We always have a human in the loop, especially because it's a SOX system: a person always reviews the output, makes sure it's accurate and syntactically correct and that the agent didn't come up with some wild permissions to grant. And we still don't give the LLM, or the agent, the keys to our production system. Even if it generated the right SQL, there's no guarantee it would execute it correctly, or that it wouldn't insert some additional queries and drop our data. So we stuck with our standardized deployment workflow; you can keep whatever CI/CD process you have now, and then apply the changes. If you think back to the problem, we had processes that worked; we were really just solving the manual pain points, the queueing, the bottlenecks. So we used the agent where it works really well. At this point, my team doesn't even know requests are coming in until we get pinged that there's a PR for review or, worst case, the agent escalates something that needs manual intervention. This is what I'll be remembered for: the dead clicker. All right. I said I'd touch on why we went with LangGraph specifically. In reality, operational workflows aren't one-shot things. You get ambiguous requests; people say "I need this" and leave out half the details. Even when we standardized our intake for Snowflake requests and put in a question, what level of access do you need,
read or write, people still leave that out. So the agent has to go ask them and then wait for the user to provide more details. Or, in some cases, it's a valid request but the request type requires approvals. For data access requests we always have data owners: it's not up to the agent to decide, sure, you can have access to super-sensitive data. It's always the data owner, so we wait for approvals. If you look at this, you see we've got entry points, re-entry points, exit points, very specific branching, and the agent needs to maintain state, to know where it left off. That's what LangGraph is really good for. If you're not aware, LangGraph is a workflow orchestration framework for stateful agents. We actually maintain state on the ticket itself: once the agent leaves off somewhere, maybe because it's an invalid request, it marks that on the ticket, so it knows where it left off. This way you don't have to rerun the whole workflow; the agent can pick up where it stopped. If you're just waiting for an approval and you know the ticket was valid, unless someone changed something that triggers a different state, you don't have to rerun the whole validate-request step. All right. Part of my talk title did mention MCP, the Model Context Protocol, so I'll touch on that. In the last slide we had our LangGraph workflow, and you might be thinking this feels like such a rigid, deterministic flow: it always does this step, then this step, then that step. I'll point out that it's just the flow, the steps, that are deterministic; we still use LLMs at different stages, and those are always going to be non-deterministic, which is something we have to think about. But in this setup, it looks like we didn't give the agent a lot of autonomy, right? And it might be super tempting, now that we have AI, to just let it do everything: take in the request, solve it, one shot, go. No. You should really think about where autonomy will help, where to scope it in, and at what level. A step like validate-request is just a one-shot prompt: you get your ticket, we defined a prompt with criteria for what counts as a valid request, and it returns valid or invalid. With the gather-context step, part of the reason we even built this agent is that there aren't three predefined queries you always run to solve a request, or very specific steps; it's use case by use case. So we actually have a metadata sub-agent with access to the Snowflake MCP server. It looks up different things against metadata views, and it iterates; it does a lot of iterative reasoning. It runs a query that tells it something, figures out what other information it needs, runs another query, keeps going and going, and then turns all those query results into structured output.
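A minimal LangGraph sketch of this shape of workflow: validate, gather context, generate SQL, with an escalation branch. The node bodies are stubbed out (the real agent calls LLMs and a Snowflake MCP server at these steps), so treat them as placeholders for the pattern, not Pinterest's implementation.

```python
from typing import Literal, TypedDict

from langgraph.graph import END, StateGraph

class TicketState(TypedDict):
    ticket: str
    valid: bool
    context: dict
    sql: str
    status: str  # persisted on the ticket so the workflow can re-enter later

def validate_request(state: TicketState) -> dict:
    # In production: a one-shot LLM prompt with validity criteria.
    return {"valid": "access" in state["ticket"].lower()}

def gather_context(state: TicketState) -> dict:
    # In production: a metadata sub-agent iterating against a Snowflake MCP server.
    return {"context": {"existing_roles": ["ANALYST_RO"]}}

def generate_sql(state: TicketState) -> dict:
    # In production: LLM generation constrained by standard SQL templates.
    return {"sql": "GRANT ROLE ANALYST_RO TO USER requester;", "status": "pr_open"}

def escalate(state: TicketState) -> dict:
    return {"status": "needs_human"}

def route(state: TicketState) -> Literal["gather_context", "escalate"]:
    return "gather_context" if state["valid"] else "escalate"

graph = StateGraph(TicketState)
graph.add_node("validate_request", validate_request)
graph.add_node("gather_context", gather_context)
graph.add_node("generate_sql", generate_sql)
graph.add_node("escalate", escalate)
graph.set_entry_point("validate_request")
graph.add_conditional_edges("validate_request", route,
                            {"gather_context": "gather_context", "escalate": "escalate"})
graph.add_edge("gather_context", "generate_sql")
graph.add_edge("generate_sql", END)
graph.add_edge("escalate", END)
app = graph.compile()

print(app.invoke({"ticket": "Sales needs read access to the sales dataset"}))
```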
And again, if you're thinking about how to apply this to your own operational workflows: this one is very Snowflake-specific, but maybe you have Zendesk tickets, Jira tickets, some other system. You can give the agent an MCP server for that system; you're just giving it the keys so the AI can talk to those tools. It can do its own lookups, check different states, run different searches, figure out what it needs to find, gather that context, and once it has all the information it needs, move on to the next step, which for us was generating the SQL. The other step I want to highlight: you saw on one of the previous slides that we always have a human in the loop. We could have just generated the SQL, created the PR, and triggered a PR review, but I want to highlight this review-and-repair step, and my point is: don't blindly trust the first-pass output. We designed the agent to use LLMs for what they're good at, and we actually chose to use SQL templates for this. It's not that there's one template and the output is always the same; rather, Snowflake has certain standardized queries. You're trying to create a role? There's CREATE ROLE IF NOT EXISTS. We told the agent: it's up to you to figure out how many roles to create, and whether you even need to create one, but use that template. Don't go crazy, don't hallucinate, don't invent your own, don't make assumptions about how it's done; this is how Snowflake queries work. So we gave it those templates, but there's no guarantee it sticks to them, no guarantee it doesn't insert a semicolon where there shouldn't be one, maybe right in the role name. So we told it: after you generate your output, validate it. And here you can choose whether it makes sense to use an LLM for the validation or whether you want more deterministic checks. We chose an LLM, because they're great at generating SQL and also good at reading SQL. So we told it to make sure it's syntactically correct SQL; keep in mind it can't actually execute these queries, because it doesn't have write permissions on Snowflake, so it's only reviewing syntax. It also reviews the original request, making sure the output actually solves that request, making sure someone didn't ask for Jira data while it's granting access to all the data in the warehouse. We give it a chance to review its own code, and if it finds issues, we give it a bounded number of attempts to repair them, repeating a couple of times. If for whatever reason it really can't fix it, whether it's having issues or just went off track, at that point we escalate to a person, and someone from my team manually resolves the request. We also have other escalation points built in throughout the workflow; I just didn't show them on the graph. This way you give the agent a chance to make sure its output is solid. Any day now, please. All right, it finally works. It just doesn't like me. I think I was talking too fast, maybe speeding through it, so it wants to make sure I go a little slower.
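In rough Python, that bounded review-and-repair loop looks something like the following; `generate`, `review`, and `repair` stand in for the LLM calls, and the retry limit is illustrative:

```python
MAX_REPAIR_ATTEMPTS = 3  # illustrative bound, not the production value

def review_and_repair(generate, review, repair, request):
    """Don't trust first-pass output: review it, repair a bounded number of
    times, then escalate to a human instead of looping forever."""
    sql = generate(request)
    for _ in range(MAX_REPAIR_ATTEMPTS):
        # LLM check: valid syntax, solves the original request,
        # and no over-broad grants sneaking in.
        issues = review(sql, request)
        if not issues:
            return {"sql": sql, "escalate": False}
        sql = repair(sql, issues)
    # Couldn't converge: hand off to a person for manual resolution.
    return {"sql": sql, "escalate": True}
```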
I could go into more technical details: there are more flows and more branches as we expand this agent to handle other high-volume asks, and it may get more complex. But in general it will have very similar steps: validate the request; some will have data lookups, some will have approvals; and we'll always generate some SQL, or whatever you generate, maybe Terraform config code, or a set of steps to run. But what made this shippable? Why did we get the okay from security and legal to deploy this on a SOX system? I already highlighted these points at the beginning, but to summarize, and to give you things to think about as you build your own agents: First, start with bounded work. For us, that was high-volume requests, the things causing pain points and bottlenecks, but with clear, repeatable steps and clear outputs. We always knew we needed to generate SQL queries or config code. Second, reasoning without authority. It's up to you to assess the risk of your system and decide how much authority to give the agent. In our case, we had really sensitive data and a sensitive production system, so we gave it less authority. And you saw that we built a successful agent without ever giving it write access to Snowflake. So assess the risk of your system and decide: maybe you can use LLMs to take certain write actions, or maybe you're in the same boat as us and it's too risky. At the end of the day, agents are really good at reasoning and really good at understanding requests. I mentioned this at the beginning and I'll mention it again: reuse those control points. We're not building agents to get rid of PR reviews. We can have code-review agents, and they can speed things up, but you still want a human to sanity-check even the code-review agent's output. And if you have working CI/CD workflows, don't try to shoehorn LLMs in where things are already working. Next, scope autonomy. You saw we have a deterministic flow, but we chose where to give the agent more autonomy. For your use case, maybe you have a less rigid flow; maybe you have just two steps in your LangGraph workflow. Add autonomy where it makes sense; don't try to make the agent solve everything in one go. And my last point: design for re-entry. In real-life operational workflows, if you can get it done in one shot, great. But realistically, you'll have exit points, cases where the agent needs more information, approvals to wait on, different branching logic, escalation to a human. There's one line I want to leave you with, and I think I've gotten the point across: useful agents need boundaries, not more authority. That's the key to actually launching these agents in production. It's what makes your security, privacy, and legal teams happy. You could give the agent access to everything, sure, but there's more risk involved and more that can go wrong, and hopefully I've shown you that you don't need to. Use LLMs and agents for what they're good at, where it makes sense; use deterministic workflows or existing code where that makes sense. And, I don't know, go build agents. Thank you. >> Thank you, Anna. Sorry about the clicker. We'll have to build AI agents to make that work better next time. >> Yeah.
But despite all that, Anna was able to finish her talk, so let's give it up for Anna. Okay, this is an exciting moment, because we're going on a break. We have quite a bit of time for you to talk to each other, grab some coffee, maybe go outside; it's really nice out. Some logistics before we break: if you still have your parking ticket, feel free to go mingle, and just come back at 3:50 p.m.; we'll start the final talks of the day at 3:50. If you're missing your parking ticket, though, I have something I need to read: we found a ticket from garage 4 yesterday at 12:40 p.m. The volunteers at the check-in table will be holding on to it, so if you can't find your ticket, go talk to them, or just talk to them for fun, because they're really nice people. With that, we're ready to break, and I'll see you at 3:50 p.m. >> Ladies and gentlemen, please take your seats. Our event will start in 5 minutes. >> Ladies and gentlemen, please take your seats. Our event will start in 2 minutes. >> All right, welcome back, everybody. Our next presenter is an open-source superstar and an educator; he's doing a lot of things, and frankly, I couldn't memorize them, so I'm going to read them for you. He's the creator of epicweb.dev, epicai.pro, the Epic Stack, epicreact.dev, and testingjavascript.com, and more recently he's launching epicproduct.engineer. He's a well-known educator and contributor to the open-source community. It's my pleasure to welcome Kent C. Dodds to the stage. >> Thank you. Hello, everybody. Thank you so much for having me, AI Engineer Miami. I love Miami. I'm super excited to be here and talk with all of you. I'm going to be talking about building a free agent, which I think is a fun, clever title: it's free as in freedom, free as in cookies (I don't drink beer, so it's cookies), and a puppy, as in something you have to take care of. So what do I mean by that? Well, you're going to have to wait a second to find out, because first I want you all to stand up. Please stand up, if you're physically able to join us. It's been a long day; you need blood flow for your brains to work. Put your arms out in front of you like this. Squat down and back up. That one doesn't count; that was just practice. We're going to do 12 of these, and I want you to count out loud with me. Ready? One. Two. You're doing great. Three. You can go really low if you want. Four. Or just a little dip, that's fine too. Six. Seven. Do you feel that blood flow? It's so good. What are we on? One. No, I'm just kidding; I forgot. Are we at... is that 10, 11, and 12? Thank you. Okay, stretch over your head as high as you can, then over to one side and over to the other. All right, that feels great. Okay, sit down. Thank you. >> Yes: blood flow makes your brain work better. So, exercise; we are not robots, yet. Okay, this is the view from my office. Can you believe that? Yeah, I'm looking at that all day, and it's super great. But it's not always super great, because sometimes you get glare, and it's especially bad in my kitchen, because it reflects off the countertop. This is not my kitchen; my wife would not want me showing strangers on the internet our kitchen. This is my office, though. Whoops, there: that is my office, and it's great. But yeah, glare is not fun.
So we do have shades (again, not my home), and they can avoid the glare. I actually have automated shades; if anybody's familiar with PowerShades or similar systems, it's nice, I can control things. But I really, really like my view, and even the view can be a problem: if it's overcast, the clouds are all super bright and shining into my eyes. I use light mode, but it's not enough, and it hurts. So I wanted some mechanism to say: when the sun is in this position, or when it's overcast, lower the shades, or raise them. I pretty much want them up whenever they can be up, but not when I'm going to get something like this. So I decided to solve this using AI; of course, we're at AI Engineer. The way I solved it is with a little program, an AI assistant, that I call Cody. This is actually Cody, my mascot for all the stuff I do, and Cody is now my AI assistant. So now, thanks to Cody, my shades stay up as the default, but they go down for privacy in the evening, and they go down when the weather is overcast. In my office specifically, only the little section that would blind my eyes goes down. In my kitchen, they go down in the afternoons when the sun is going to reflect off of things, and it calculates where the sun is in the sky. I think it used a word like azimuth. I don't know any of that; I love AI. So now I can just live my life, and my shades move as they need to. Oh yeah, and it's super annoying if the shades move while I'm recording, and my recording lights are also integrated into this setup. So if my recording lights are on, Cody knows I'm recording and won't change the shades. It's awesome. Let me tell you something else I did with Cody. This is a game I told Cody to build for my son, who is two and a half. He finds the right thing, he clicks on it, and he gets a little celebration. If he gets it wrong, it blows up and he has to go click the right one; he actually really likes seeing it blow up, so he'll do that and then see the confetti. So this is a fun little game that I had Cody build and deploy. I didn't have to log into anything: Cody was already logged into my Cloudflare account, so it deployed it and everything, and it even made the OG image using Browser Rendering from Cloudflare, which is pretty cool too. It all goes through Cody; I don't have to do any of that. I think some of you might be starting to get bored, and if you're not, I'll tell you in a second why you probably should be. This was a very exciting livestream screen capture; you can see I'm very excited (I'm actually wearing the same shirt, incidentally). I was super excited because I built a Spotify player that integrated with my own Spotify, but I used Cody to do it. What was exciting was the integration flow, how that worked, and I'll show you a bit of that too. I also had Cody set up a Docker container for Navidrome, a self-hosted music application, on my NAS.
And I told it to wire that up with a Cloudflare tunnel so I could access it from outside my local network without exposing ports. Cloudflare rocks. I'm not sponsored by Cloudflare, but I think they're pretty great, and I use them for everything that Cody is. All of that worked. We're not having to do any of this stuff ourselves anymore, and it's pretty great. And then I was lying in bed. We had just purchased epicproduct.engineer the day before, and I had my team working on that; if you're curious what it looks like, you can go look it up right now. It's a real site, and you can give me your email address. But I was like, you know what, epic.engineer would be pretty cool to have too. So I bought it from my phone while lying in bed, and I told Cody, go build me a landing page. It even integrated with Kit, my email mailing service, so it could set up a real subscription and everything. Deployed it on Cloudflare, made the OG image, everything. Okay, so at this point lots of you are thinking: bro, that's awesome, I'm so glad you rebuilt a worse version of OpenClaw. And that is kind of what I did. I have definitely explored the OpenClaw world, and there were a lot of things that were really cool about it, and a lot of things I wasn't super jazzed about for my own use cases. That's why I did this. And one of my favorite things about what Cody does is that it's all free: I didn't have to pay for inference at all, with an asterisk. The asterisk is that I don't have to pay more than the existing subscriptions I already have. How many of you have more than one AI subscription? Yeah, you have more than one. Why do you have more than one? You're laughing: yes, of course you have more than one. I've got ChatGPT and I've got Claude and I don't even know how many others, and of course you've got your coding assistants and everything. The reason I can make Cody free is that I build on top of those. Everything that Cody is, is actually exposed through MCP. That's what makes it possible to do all these cool things with Cody for free: Cloudflare infrastructure is hilariously cheap, especially if you're serving just one user, and I'm effectively able to do all of this using my existing subscriptions for all the inference. So I want to tell you a little bit about how that works. Actually, let me back up. How many of you thought MCP was dead? For real. Yeah. Okay, no shame. How dare you; just kidding. A lot of developers especially ask: why do we need MCP? I have a CLI. I'm already signed into GitHub using my CLI. CLIs have help flags; there's progressive disclosure; in fact, models are even trained on some of the more popular CLIs I use. And now we've got this skills thing, so even if the model isn't trained on it, I can just use skills. So who cares about MCP? And I agree with you: I think MCP is pretty uninteresting for software development use cases. Where it gets really interesting is when I tell Cody, I don't want the sun to glare in my eyes. The non-developer use case is the thing that gets me most excited about MCP.
One of the big criticisms of MCP has always been context bloat: it's a huge mess, so we hate MCP. Well, that never made sense to me, because we're software developers. We see a problem and we don't just say, huh, I guess this is foundationally flawed, and wander off to something else. No: you analyze the problem. Is it foundationally flawed? Maybe. Let's look into it. Oh, we could just do some sort of search over the MCP tools, and boom, now we have just the ones relevant to the thing we're doing. That's exactly what Claude does, and ChatGPT does this now too. So the whole context-bloat thing is not a big deal. Beyond that, I really like that Cloudflare introduced this idea of code mode, because it has unlocked a lot. How many of you have heard of code mode before? Okay. It's the idea that you take some sort of spec, whether MCP or OpenAPI or something else, turn it into TypeScript definitions, and then tell the agent (that would be ChatGPT or VS Code or Cursor or Claude Code or whatever) to write code against those TypeScript definitions; then, on your side, you evaluate that code in a safe environment. Cloudflare has done this with dynamic worker loaders. It's so, so cool, and that's what I'm using. So, based on what Cloudflare did with their own MCP server, I created Cody to have three tools; they did two, and I needed one more. There's search, to identify what capabilities exist (there's your progressive disclosure), and execute, to write and run code inside that sandboxed environment. The third one, for opening generated UIs, has made me a little less interested lately, because of all the cool things Claude desktop is doing. Have you all seen that stuff? You just say "build me a thing" and it builds the thing, and it's really, really awesome. So I don't use that one quite as much, though I do use some features from it. Pretty much, search and execute are the things I want to focus on. It is pretty cool to be able to open a generated UI, though; that's how I built the little game my son played, and that was fun.
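To show the shape of that search/execute pattern, here's a conceptual Python stand-in. Cody itself runs on Cloudflare dynamic worker loaders with TypeScript definitions; the capability names here are invented, and exec() is a placeholder for a real sandbox, not actual isolation.

```python
# Conceptual sketch of the two core tools: search (progressive disclosure)
# and execute (run agent-written code against bound capabilities).
CAPABILITIES = {
    "spotify.play": {"doc": "play(query: str) -> None",
                     "fn": lambda query: print("playing:", query)},
    "weather.current": {"doc": "current(city: str) -> dict",
                        "fn": lambda city: {"sky": "sunny"}},
}

def search(query: str, limit: int = 10) -> list[str]:
    """Return only capability docs relevant to the task, not the whole catalog."""
    hits = [f"{name}: {cap['doc']}" for name, cap in CAPABILITIES.items()
            if any(word in name for word in query.lower().split())]
    return hits[:limit]

def execute(agent_code: str) -> object:
    """Run agent-written code against the capabilities in a confined namespace.

    A real implementation needs genuine sandboxing (e.g. an isolated worker);
    exec() with stripped builtins is only a stand-in for the concept.
    """
    sandbox = {"__builtins__": {},
               "capabilities": {name: cap["fn"] for name, cap in CAPABILITIES.items()}}
    exec(agent_code, sandbox)  # agent code defines run(capabilities)
    return sandbox["run"](sandbox["capabilities"])

print(search("spotify playlist"))
execute(
    "def run(capabilities):\n"
    "    return capabilities['spotify.play']('latin hype mix')\n"
)
```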
So the query that we're actually exploring is: I want something that's thematically appropriate for Miami. That's where lots of this is coming from. So we're going to query that, and I want to limit the results to 10. And then here's what I'm trying to do. This memory context thing: Kodi has memory built in, and this memory context will help retrieve memories as appropriate. So then here are the search results. It has this whole explanation of how you actually deal with these matches. This only shows up the first time you run the search query; thereafter it assumes that the agent is going to remember it, so we don't bloat the context. We'll look at some of that stuff here in a little bit. Ooh, secrets. What's that? We'll look at that later. And here are some relevant matched memories based on what you're trying to accomplish. These will also not show up in the future; it keeps track of what memories have been shared. And then we've got this idea of packages. So inside of Kodi, as of less than 24 hours ago, I made a complete, huge, massive rewrite. Let me correct myself: Cursor and GPT 5.4 made a complete massive rewrite of how Kodi works under the hood. And it uses Cloudflare's new Artifacts API. Yeah, we're excited about artifacts. So now Kodi has its own GitHub, basically, on top of artifacts. And it's cooler than that, but I want to show you the code for that here in a second, so I'm not going to spoil what else it can do. So it has a package for Spotify. This has secrets for interacting with Spotify that are created in such a way that the model doesn't actually have access to them, which I think is pretty cool. And I will use my four minutes to hopefully explain what that is. We've also got values, here's our client ID, and we also have secrets. And yeah, that's all that we need to see there. So it performs a search. Now it knows: okay, I'm going to use this Spotify package to write my code. So now it's going to execute. Wonder how it's doing. Not so well. But it's trying, and that's more than you can say for ChatGPT. They're all improving. They'll be fine. So, on that first search, it gets back a conversation ID, and then the agent uses that; that's how it keeps track of the memories that have been shared over time. Here's the memory context, here's what I'm trying to do now, and then here's the code that I want to execute. So what does that look like? This is an example of what that code might look like. It brings in this Kodi runtime that has some useful features for authenticated fetch, which, conveniently or inconveniently for our demo, doesn't let me show you how that authenticated fetch works. But basically, there's a special syntax that Kodi can write code against for managing secrets. A really important part of all of this is that the agent, whatever agent you're using, never sees the secrets. Ever. It cannot. The only way you add those is the agent gives you a URL that goes to heycodi.dev. You put your secret into that UI over HTTPS so nobody can see it. And then the agent can reference it using curly braces in any fetch call. And because I'm using dynamic worker loaders, I can intercept every fetch call. And I look and I say, "Hey, that's a pretty cool secret.
Let me make sure that you've approved that secret to go to this domain. Oh, that's your Spotify token. I'm not going to send it to some dangerous domain like ivebeenprompted.com." Instead, I'm going to ask the user: hey, is this cool? And then you can go through an approval flow. That's pretty cool. We've also got this environment-lookups weather package. This is where we can interact with all of our weather stuff. It's packaged up as a weather API, and it's under the Kodi namespace. So this is kind of like our own internal npm, and we can have all these different repos inside of Kodi on top of artifacts. When we're executing this code, we go and reach into that repo, create a bundle out of that export, and then we can use it inside of anything. So Kodi can write code that references all of these things. It's super, super cool. And then, of course, all of this is what you would expect: we have an export default that can run authenticated stuff with Spotify. Here it's getting the weather, then it does a query, it does searches, and then it plays. But it reached its tool limit. So that's what I get for changing everything right before doing this demo. How about this: it worked! Just kidding. Just kidding. I promise it does work. I am still working on some of the kinks. Okay. So then the return will have the conversation ID. Execute also returns relevant memories based on what was actually accomplished, in particular this memory context: what are you trying to do? Okay, I did the thing, but also here are some additional memories, and here are the results. And the agent gets to choose what results it gets back. So another criticism of MCP is that it's not just the tool descriptions; when we invoke a tool, the output of that tool also fills up our context. Here, the agent gets to choose what comes back from all of these executions. And code mode is so cool. You don't get it. I can tell, because you'd be jumping on your chairs if you got it. Code mode is fantastic. If we tried to do the same thing using regular tool calls, this would be many regular tool calls, and it would probably mess up a lot. Code-mode agents are really, really good at this, and it's super cool to play around with. So, I've got 29 seconds, which is wonderful because I don't have a lot more to share. Actually, I do: come talk to me. I've got stickers, and I will give them to you if you ask me good questions. But yeah, come and talk to me, ask me questions. Epic AI Pro, here's my little plug, is where I'll teach you how to build MCP servers, and it's really, really great. Kodi is open source. It's pretty much just for me. I would love to make it possible for other people to use eventually, but yeah, Kodi right now is just kind of my thing. I mostly just wanted to show you that MCP rocks, code mode rocks, and we've got a lot of really cool and exciting things to look forward to. With that, go check out product.engineer. It's the last skill you need to learn. Thank you. Good luck. >> All right. I guess one thing that we can take away is the squats. So if you feel like you need a little bit of a wake-up call before the last two talks, feel free to stand up and do some squats. I'm going to do it myself to get ready for Rita's talk. Yes. Okay.
So, Rita is a really good person to befriend, because if your website is down, she will be a great person to call. Rita is a VP of product at Cloudflare, and she has been building developer platforms and AI initiatives within Cloudflare. She has meandered a little, from software engineer to solutions engineer and now to product. So Rita today will be talking about building infrastructure that can scale to billions or even trillions of agents. Take it away. >> Thank you. Thank you. Quite the intro. Okay, we can build infrastructure for trillions of agents, but let's see if we can figure out how to plug this in correctly. You guys see stuff? Okay, here we go. How about now? Aha. All right. Hello everyone. My name is Rita. I am VP of product for Cloudflare's developer platform. Kent already said everything that there is to say about code mode and MCP, so thank you everyone for coming. No, I'm really, really excited to be here today. Cloudflare is a really interesting place to be. Sometimes people ask me, you do product at an infrastructure company, how does that work? And it is actually really fascinating, first of all because we get to work at really, really massive scale. Especially working on a developer platform and developer tools, every single optimization that we make, we instantly get to see the benefits of, and even the tiniest things can save everyone lots of hours, lots of days. But the other thing that's really interesting about it is the physicality of the web, which is something I think people don't think about a lot. There are undersea cables that connect us all; when you're on a Zoom with someone in London, that's how it all works. When I first joined Cloudflare, I came across an incident page that was talking about record-breaking heat in India affecting a data center, and I had just never really thought about how it could get so hot that a data center would go down. So I think that more and more, the real world and what's going on in tech and AI are going to get connected. And so I am going to talk about MCP and code mode, but I'm going to dive into some of the underlying details of how we do all that. Now, a lot of the time our job feels a little bit like debating the age-old question: if a dog were to wear pants, would it wear them like this, or like that? If you think it's the first one, raise your hand. If you think it's the second one, raise your hand. Okay, everyone that raised their hand first: you're a psychopath. It's definitely the second one. How would it put them on? But thinking about how agents work is kind of similar, and you'll see this come up more and more, right? You can think about a single giant MCP server. You can break it up into a lot of smaller pieces. You can think about whether you should execute the code here or over there. And for the first time, we're getting to not just do micro-optimizations in the developer space, but really, truly invent stuff from the ground up and think about it in that way. And so when LLMs first came around, when we started using them through ChatGPT about two and a half years ago, it was like having a really, really smart brain with you in the room all the time. You could ask it questions, you could get it to maybe generate code for you, but it couldn't go that extra step of actually doing very many things.
It was like a brain with no hands to really act on its behalf. And that's because LLMs initially weren't that good at tool calling. But increasingly they became better and better, so you could actually start to build agents like Kodi that could take actions on our behalf. Initially, when people started using tool calling, every single agent was implementing the whole thing soup to nuts on its own. You had your tool, you integrated it with your agent, and that was the only place where it could run. The really cool thing about MCP, and especially remote MCP, is that all of a sudden you could share tools with agents that you've never actually met before. So you could create an app, you could ask it for the weather. But there's this thing that started to creep in over time, which is that the context starts to grow. I was going to demo a small app that I built called Fluma, which is fake Luma; I'll demo a different version of it in a bit, just to save us time. But if you're building something more sophisticated than something I vibe-coded in a weekend, you really start to see that scope creep, that token creep, right? Take something like the Cloudflare SDK: it has all of the DNS records, it has Workers, it has R2, it has purge cache, and before you know it, you're exceeding a context window at 1.7 million tokens. And actually, if you were to include Cloudflare's entire OpenAPI spec, it would take up 2.3 million tokens. Okay, that's more than the biggest models can even fit these days. So it's a bit of a pickle for us. Okay, so we started to think about how to solve this problem. One way is to split up the server by domains. You could have an MCP server for just the API. You could have an MCP server for documentation. We had an MCP server for Workers, for observability, all these different things. That partially solves the problem, but it really just puts it on the user to figure out which MCP server they need. So if I want an MCP server that deploys my Worker, but then to look at the logs I have to go through the whole OAuth dance again, it's very, very annoying. So we needed to solve that. The second thing that we realized, the more we thought about this, is that even though LLMs had gotten a lot better at tool calls, they still get confused pretty easily. If you ask one to do something that happened on a given date, it's just going to assume a random date that it was trained on in the past. It might not pick the exact tool that it needs to call. If you give it a lot of different tools, it'll also get confused; a lot of tools have "create" in them, create worker, create DNS record, and it starts to do the wrong stuff. And if you think about it, it makes a lot of sense. LLMs were trained on all of the code that exists in the world, so they're very, very good at writing code, but tool calling is something that we just kind of bolted on at the end. It's not too dissimilar from taking Shakespeare and giving him a month-long crash course in Mandarin. I presume he was extremely, extremely smart. So if you then asked him to write a play in Mandarin, it's bloody Shakespeare, so it's going to be good, but it's not going to be his best work. And LLMs are a little bit the same way: no matter how good they get at tool calling, they just don't quite cut it.
So at this point we started thinking: okay, are we holding this wrong? We're trying to make LLMs do things that they're not that good at, and we're inflating the context window. What's a different way to attempt this? And that's where code mode came from. Imagine if you let the agent, or the LLM, do what it's really good at, which is write code, and do fewer tool calls. So let's see this in action. Here I have my app called Fluma. Let's increase the font on this. On one side we have our vanilla, legacy MCP agent that's just going to make regular tool calls. And on the other side, we have our code mode agent that's going to write code first and then execute it. Let's ask it to do something simple first, like: create an event for a hackathon on Wednesday at 9:00 a.m. at the Hyatt Regency Miami. I misspelled Miami. Okay. So over here we have our regular MCP agent. It wanted me to confirm the date; it was thinking January 10th, 2024, which is not quite this Wednesday. On the other hand, we have our code mode agent that just pulled up today's date, because it's able to call a function, and then it used code mode to create an event that's going to come up. But now let's try something more sophisticated. I'm going to ask it to do something like: create an event for each day in May 2026 for a meetup on the topics of AI engineering and MCP, at Cloudflare's party house, all at 7:00 p.m. Okay. So now they're both off to the races. You can see that the MCP agent is going and making a whole bunch of different calls. And on the other side, we have our code agent that went ahead and generated this code with different topics, and it's going to go through this for loop and create a whole bunch of these events (there's a sketch of that generated code just below). So now it's making a bunch of calls to the API. Going to wait for both of these to finish up. Any second now, guys. All right. So our MCP agent is already done; generally these take about the same amount of time. Now our code mode agent is also done. So we've accomplished roughly the same task, but notice one important difference, which is that the code mode agent used almost 70% fewer tokens. That's a really, really big difference, because the code mode agent doesn't have to carry all of those tool calls in its context constantly. It's able to just generate the code once, execute it, and be done. All right. But we had another problem, and it's that clients were slow to adopt code mode. And if you want something done (my parents are Soviet, so they would always say this) you have to do it yourself. So we had to take matters into our own hands. We still had a context window that would take up over two million tokens. So we started thinking about what it would look like to have a server-side MCP server, and we came up with a way that allowed us to still run all the generated code on the server side, with two simple functions. One is called search, which is going to look at the spec and find only the APIs that match the particular thing we're looking for. The other one writes the code that actually executes what we need it to do. So let's take a deeper look at how this works under the hood. First of all, we have our tool search. I'm going to type "workers" in here. And even if I added every single Workers API that we need, we're still at (I don't know if you guys can see this) less than 2,000 tokens. A very, very big difference from a million tokens.
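Backing up to the Fluma demo for a moment, the code the code-mode agent generated there would have looked something like this rough sketch: one generated program instead of thirty-one separate tool calls. The `api.createEvent` binding is made up for illustration.

```ts
// Hypothetical sketch of agent-generated code-mode output for
// "create an event for each day in May 2026 at 7:00 p.m."
// `api.createEvent` stands in for whatever binding the sandbox exposes.
declare const api: {
  createEvent(e: { title: string; location: string; startsAt: string }): Promise<void>;
};

const topics = ["AI engineering", "MCP", "agents", "code mode"];

for (let day = 1; day <= 31; day++) {
  await api.createEvent({
    title: `Meetup: ${topics[(day - 1) % topics.length]}`,
    location: "Cloudflare's party house",
    startsAt: new Date(2026, 4, day, 19, 0).toISOString(), // May 2026, 7:00 p.m.
  });
}
```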
And then we have the second half of this, which is the execute tool. What the execute tool does is look at the TypeScript schema that's being passed down from the search tool and generate the code that it's then going to execute. So here we have list workers: it's going to write code to list all of our workers. It can write code to deploy your Worker, or add Access on top of your application. And here we can quickly see this in action, where this code was executed, and here is the result that we got. Okay, now let's put the two of these together and ask our MCP server to create a hello world Worker. The first thing that we're going to do, as we would with any other MCP server, is go through OAuth and select the account that we need. This is where I deploy all of my workers to. We authorize it. And now it's going to call our two commands. First, as you can see, we're running search in here, and it's, as predicted, going to return all of the different Worker-related APIs that are available to it. And then we're going to run execute, which is going to generate this Worker, and we are going to execute it immediately. So now we have a hello world Worker that's been fully deployed. So we've talked about three different models of doing the exact same thing. One is basically vanilla MCP, where you're directly doing the tool calling; that's going to be the least efficient in terms of token usage. We talked about client-side code mode, which is efficient, but not all clients support code mode yet. And the third is server-side code mode, where, well, we saw the results: it got to 70% token savings, but if you go from 2 million tokens to something like 2,000, that's 99.9% token savings. If anyone here is paying for tokens, you know that's a lot of money being saved. But how does all of this work? Really, what we're doing is putting a lot of trust in the LLM to write some code that we've never looked at before and allowing it to execute immediately. And this can bring a lot of problems, right? If you're running it in the same sandbox as the rest of your application, in the same container, it can do things like read the file system. It can make rogue network requests with the data that you just gave it. It can create an infinite loop or eat up all of your memory. And there are a couple of other approaches that people have tried. One is a DSL; if you've written a DSL before, you probably never want to do that ever again in your life. Another one is that you could use VMs, but VMs are very slow to start up, so we would be waiting here a very, very long time for all of these calls to complete, and it would get really expensive very quickly. We could get humans to review the code, but that's even slower than VMs. So we need a different approach. This is where dynamic workers come in. Dynamic workers are based on the same technology as Workers, which we've been running at Cloudflare for over nine years now, but dynamic workers allow you to create a worker on the fly and immediately execute it. You can see that here we're going to pull in the generated code that the LLM created. The rest of this looks just like loading up a worker: you can set the compatibility date and which modules you want. And importantly, you can set what outbound hosts you want to allow.
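Based on the examples Cloudflare has published for the Worker Loader binding, the flow looks roughly like the sketch below; treat the exact property names as assumptions, since the beta API may have shifted.

```ts
// Rough sketch of running LLM-generated code in a dynamically loaded worker,
// modeled on Cloudflare's published Worker Loader examples.
interface Env {
  LOADER: any; // Worker Loader binding configured in the wrangler config
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const generatedCode = await req.text(); // the code the LLM just wrote

    const worker = env.LOADER.get("generated-snippet", async () => ({
      compatibilityDate: "2026-04-01",
      mainModule: "main.js",
      modules: { "main.js": generatedCode },
      // No outbound hosts provided: the sandboxed code can't reach the
      // internet at all. Passing a custom Fetcher here instead lets you
      // intercept and allowlist every outbound request.
      globalOutbound: null,
    }));

    // Call into the sandboxed worker and return whatever it produces.
    const entrypoint = worker.getEntrypoint();
    return await entrypoint.fetch(new Request("https://sandbox/run"));
  },
};
```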
And so if you don't provide any, the worker actually can't access the web at all. Everything stays really, really sandboxed. And we can actually test this out in practice. As far as worrying about things like configuration keys being leaked: the code only has access to the things that you gave it explicit access to. So if I try to access secrets on process, well, guess what? Nothing shows up in any of my globals; these are just the functions that Cloudflare provides by default. And if I try to make an outbound fetch to httpbin, the same thing happens. It says "this worker is not permitted to access the internet." We should really capitalize this. But you can't just access all of these arbitrary things. So it becomes a really, really powerful environment that enables you to run code mode securely. So far we've just talked about this in the context of MCP, but I think it's pretty obvious where agents are going next, and it's that all of us are going to be running many of them at all times. Right now, I would imagine all of us are using agents like opencode or Claude Code or Codex primarily for coding use cases, and a lot of them run on our laptops, which is why people at this conference are running around not wanting to shut their laptops: you want your agent to complete your task. You could also do this in a hosted container environment, but if you start to do the math on how this scales to the rest of the world (and we'll get to that in a second) the math doesn't quite math. Here's what I mean by that. Recently OpenClaw has been taking off, right? And it's a similar thing, where it's early adopters, and all of us went and got Mac minis. But again, that's not sustainable for every single person running multiple agents. Do the quick math just on the US alone, and just for the workforce: there are about 100 million people in the US workforce. If we set 50% concurrency, and this is actually being very conservative (I imagine we'll be running a lot more agents than this at all times, because guess what, agents don't even sleep; by the way, I have an "agents never sleep" hat that the first person to the Cloudflare booth can claim), we're going to be running many, many of them, and we need a lot of CPUs to power that. Everyone is talking about the need for GPUs, but no one is talking about this part: having enough CPUs to power global agents. So let's take this a step further. There are eight billion people in this world. If each of them had a personal agent, again at around 50% concurrency (we're not super coordinated in how we're using them), we need something like 80 to 160 million CPUs. Server CPU production today is in the tens of millions per year, so we're already an order of magnitude off. And if you start imagining that everyone is running several agents, three agents, ten agents, we are many, many orders of magnitude off from being able to power the agentic future that I think everyone in this room is really excited about. So how do we solve this problem? Believe it or not, yet again: dynamic workers. The thing about dynamic workers is that they run on isolates, and isolates are a lot more efficient than VMs or containers because they're able to share so much more of the underlying context.
In a VM, for every single new application that you spin up, you share the hardware, but you have to spin up a new operating system every single time. With a container, you take that one notch further: the operating system is shared, but every time you spin up a container, you need to bring in the entire language runtime and the full application with it. With isolates, we're able to import just that generated code, whether it's the application or the agent-generated code, and execute it on the spot, which means we can utilize the same exact hardware, but 100 times more efficiently. That basically makes up the difference that we need in order for every single person in the world to be able to run their own Kodi agent. So this is why I'm so excited about isolates. And what's really cool is that we've been working on this for a long time. We bet on this technology nine years ago, and we didn't think it would become relevant in this particular way after all this time. It's interesting to see more and more companies adopt it; Cloudflare is not the only implementation of isolates, and I think the more people use it and adopt it, the more validation it gives to what Cloudflare is doing. And we're still going to need containers for some agents, because you need git and bash and a file system and all of that. But especially for consumer use cases, isolates are going to matter more and more. So, that was a lot of me talking. If you want to learn more about this, there's a whole bunch of blog posts that we put out, especially last week, that I recommend you go check out. Dynamic workers are in open beta, so you can go and play around with them literally today. I will also give a couple of other shoutouts: to experiment with everything we talked about today, including code mode, you can go install Cloudflare's Agents SDK. We just made Kimi 2.6 available on Workers AI; it's a brand new model, it's super fast, go play around with it. And last but not least, we have a lot of really hard problems to solve and we need help solving them. So we're hiring, and if you're looking for a gig, come find us. All right. Thank you all so much. >> Woo. Okay, how's everybody doing? We're almost there. Last talk of the day. And speaking of SDKs, the next presenter believes you're using the wrong AI SDK, and he's going to talk about the evolution of SDKs for AI and where we should expect them to head in the near future. He is a full-stack educator, teaching everyone from developers all the way up to tech CEOs and CTOs. He has a podcast and he also manages a YouTube channel for technical development. Please welcome Ben Davis. >> Perfect. All right. So the title of this talk is "You are using the wrong AI SDK." And before we even get into that, I want to talk about how these things have changed over the last few years. Because when I was prepping this talk, the initial concept for it was: all right, I want to take the opencode SDK, the pi SDK, the Vercel AI SDK, and the BAML SDK, compare those four, and try to explain when you would use each one. But as I was going through, I realized that there's a pretty strong throughline here. At least in my head, I like to think of these in generations, where in the first generation we had the API wrapper. This would just be the normal OpenAI SDK, and you can see the code snippet for it. Let me zoom in. There we go. It would look something like this.
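Something along these lines; this is a minimal sketch using the official openai package, not the exact slide:

```ts
// Gen 1: a thin API wrapper. You hit the API directly, and the tool-calling
// loop, if you want one, is yours to write by hand.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "What's the weather in Miami?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get the current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
});

// The model may answer with text, or it may ask you to run get_weather.
// Looping (call the tool, append the result, re-send) is left entirely to you.
console.log(response.choices[0].message);
```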
You're just directly hitting the OpenAI API to generate some text, maybe do a tool call or something like that, but there's nothing else built into it. I assume if you're here, you probably know what an agent loop is, but in case you're not familiar: generally speaking, if you want the model to do some more complicated action than just generating text, like reading a specific file or doing a web search, the model can't do that on its own; it has to ask you to do it for it. So when you send a request up to OpenAI, it'll send back a response that, instead of being a text response, is a tool-call response: it names the tool that it wants you to call, along with some arguments, and then you go call that tool and send the result back. You can do that full tool-calling loop within the normal OpenAI SDK, but it requires you to manually write a while loop and add a bunch of other stuff to actually make it work. It's not the most ergonomic thing in the world, which is why the Gen 2 that came along was the Vercel AI SDK. I think this is one of the coolest things Vercel has made in the last couple of years; this is truly an incredible open-source project. It seems very simple, like it's just wrapping a bunch of different LLM API providers, but the actual code that goes into making a centralized interface that can handle Anthropic models, OpenAI models, Gemini models, all in one place, is pretty insane. And the actual code for this is a lot more abstracted now: you can define tools with Zod schemas, you can execute them, and it has this stopWhen, which means that the agent loop can now happen within the actual SDK. This generateText call will hit the OpenAI API multiple times (a sketch of that Gen 2 shape follows below). So if I zoom into this and run "bun 2" for the second generation: as this runs, it did multiple requests up to OpenAI to do that weather tool call. And when it did that tool call, it went back again, passed that result back in, generated the final text, and that's what you saw pop out on the screen. So this was enough for people to start actually building agents with. This was the first generation where we were really able to push these things into making more complicated and useful products. But there were a lot of decisions and beliefs made at this time that I don't think have quite held up, and these are a lot of things that, if you had asked me four months ago, I would have personally believed. If you look at the code for the AI SDK, one of the things you'll notice is that there is full type safety on the tool calls. This execute takes in a city; you can see that it's a string. It has this input schema, which is a Zod validator, so it makes sure the input is always going in in the right shape. It is very much built to let you make these very well-defined agents for your products. But that is not where we have ended up, because the thing that happened sometime last year was that Claude Code got released. And when Claude Code got released, we got the first full coding agent, which was effectively something like the AI SDK wrapped up with a really nice TUI that can suddenly take actions on your computer with an exec tool call. It can run bash commands. It can run scripts. It can write code. It can do whatever the hell you want it to.
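For reference, here is roughly what that Gen 2 shape looks like. This is a minimal sketch using AI SDK 5 names (where tools take an inputSchema and the loop stops via stopWhen); the actual slide may differ.

```ts
// Gen 2: the Vercel AI SDK runs the agent loop for you.
import { generateText, tool, stepCountIs } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const { text } = await generateText({
  model: openai("gpt-4o"),
  tools: {
    weather: tool({
      description: "Get the current weather for a city",
      inputSchema: z.object({ city: z.string() }), // typed, validated input
      execute: async ({ city }) => ({ city, tempF: 84 }), // you run the tool
    }),
  },
  // The SDK loops model -> tool -> model internally, up to five steps here.
  stopWhen: stepCountIs(5),
  prompt: "What's the weather in Miami?",
});

console.log(text); // final answer after the tool-calling loop completes
```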
And over the last year, we've had more and more of these coding agents pop up. The two that I wanted to talk about specifically in this presentation are, I think, the most interesting ones to talk about, because I think they're the best ones to use. I'm not personally a huge fan of the Claude Agent SDK, for a variety of reasons, and the Codex SDK is limited to Codex. But these two are both open source, they're incredibly powerful, and they are the things that power the actual coding agents. And when you're working with a coding agent SDK, you are able to do so much more than you can with the earlier generations, because the mental model has changed a lot. I'll start with the pi example, because it is the more minimal of the two. If you look in here, the way this is defined is kind of similar to the AI SDK thing, where we are defining a weather tool with some basic stuff, as you would expect. Then we are creating an agent session with the model, the auth storage, the model registry, custom tools. These are all implementation details. If you want to look into how to actually use these things, the best thing you can do is go to the GitHub repo, copy-paste the link into whatever coding agent you prefer, tell it to make a temp directory, clone the repo into that, and then ask it questions. That is the easiest way to figure out how to actually use these things. But the real point I'm trying to make here is that when we create this agent session, even with the TypeScript SDK, it is booting up the full agent harness. Because if I run pi, this is now a full coding agent running on my machine. You can ask it things; it works the way you'd expect. The same thing is happening when I run this pi example here. So if I go into the AGENTS.md: "always answer in French." And if I go in here and run this example again ("bun 3 pi", I think I called it; I think this is the version that should be loading the AGENTS.md), yep, it's doing its tool calls, getting all that, sending it back, and there you go. So you can see that even though I didn't have any code in here that explicitly loaded the AGENTS.md file, it still did, because it is operating as a normal coding agent SDK on my machine. Does its thing, gives me the result. Very, very useful. opencode is very similar, except it's more batteries-included. I love both of these projects. I think they're really cool, and I mean this as a very high compliment to both of them. The way I think of them in my head is that opencode is kind of like the VS Code of coding agents: it is open source, it has really good defaults, it just kind of works out of the box, but you can still change things about it, add different themes, extend it, do other stuff. Versus pi, which is kind of like Neovim, where right out of the box it does basically nothing but the bare essentials of a coding agent, but you can extend the ever-living hell out of it, and it's really cool. I like both of these a lot. But you can see this reflected within the SDKs too, where the opencode SDK works slightly differently: it's a client-server model, where whenever you spin up opencode, it's spinning up a server. So we have to create our opencode instance here, which has the server. Then we can do a bunch of stuff in here to create a client session, subscribe to the events, console.log them.
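As a sketch of that client-server shape: I'm assuming the @opencode-ai/sdk method names here from memory of its docs, so treat them as approximations and check the repo for the current API.

```ts
// Sketch of the opencode SDK's client-server model. Method names are
// assumptions based on @opencode-ai/sdk's docs and may have changed.
import { createOpencodeServer, createOpencodeClient } from "@opencode-ai/sdk";

// Spinning up opencode starts a local server...
const server = await createOpencodeServer({ hostname: "127.0.0.1", port: 4096 });

// ...and everything else is a client talking to that server over HTTP.
const client = createOpencodeClient({ baseUrl: server.url });

// Create a session, then subscribe to the event stream and log what
// the agent is doing.
const session = await client.session.create({ body: { title: "demo" } });
console.log("session created:", session);

const events = await client.event.subscribe();
for await (const event of events.stream) {
  console.log(event.type);
}
```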
Nothing too interesting in there. The only really interesting piece here is that there is this opencode directory with a bunch of custom tools in it. This is how you define custom tools. Again, don't pay too much attention to the syntax here; that is not the important piece. I'm sure this will be changed and improved over time. It's solid right now, but it's not the important part. The important part is the way we can actually build with these things now. Because, like I said, with generation 2 we were entirely focused on these very well-curated agents. That was the type safety on the SDK: really trying to say, okay, let's give it a dedicated read tool, a write tool. Maybe, say, for my own personal use cases, it needs to hit the YouTube API to do some stuff there, so we would give it a read-YouTube-channel tool, a read-video tool, a read-comments tool, whatever you want to do. All of those are well put together, then you give them to the agent, you let it execute, you do that whole thing. It's fine. But Gen 3 does things kind of differently. And, okay, I was going back and forth on this, but I think we're going to do it. I'm going to pull up... oh my god, I cannot see. Can we see that? Yeah. So, just hear me out. Hear me out. Okay. Again, hear me out. There's this project called GStack, and you've probably seen it on Twitter, because it has been memed very heavily, because Gary has been going very hard with these LLMs, and there have been a lot of memes that have come out of this: the 40k-lines-of-code-in-a-day thing, the crazy Gary's List site, which is hundreds of thousands of lines of vibe-coded Ruby for a static blog site. There are some silly things in here; I'm not denying that. But there's also some very, very good stuff in here. Because when I looked at this about a week and a half ago, I went into it fully expecting to just see something funny, maybe dunk on it, do whatever, because that's what everyone else was doing. I was like, yeah, sure, let's go take a look. I went into the skills directory, which I think is right here. Let's use... that's a good example, let's do the office hours one. Sure. I went into this, I looked at the skill, and the first thing I saw is this 30-line bash script that looks like a virus. This does not look like something I want to have in my skills. I was like, what the actual hell is this? What are we doing here? This is worse than I thought. I was going through this, and I kept reading, and I just kept going through all of it. But then, as I slowly started to think about it more, I realized that what I was actually looking at here is a program. I know that sounds very insane, but think about what this actually is: it is including a bunch of commands that the agent is supposed to run. It has a bunch of steps. It is defining a workflow, and it is creating a full, usable application entirely on top of a coding agent, with natural language. And the more I thought about that, the more I realized: holy [ __ ], I think Gary's on to something here. And I've been testing it more. I've been going deeper into it. And I've realized that with these Gen 3 SDKs, these full coding agents, we can do some very weird stuff that we couldn't do before. Before, with Gen 2, we were still having to manually write TypeScript functions for every single piece of the agent. But now, in Gen 3, we are using coding agents. And coding agents are capable of writing code.
They are capable of executing bash scripts. So the actual programs we're creating, these agents, do not have to follow the patterns we previously had. I was sitting up in my hotel room earlier today doing a little bit of experimenting, and one of the things I put together is this "better YT sync." Effectively (I don't want to save that) this is a little program which allows me to go to the YouTube API, sync all of that data down, and then save it into a Postgres database. This is a very, very useful thing for my job day-to-day: I need to have this data, and I need a way to sync it. And the way I've done this in the past is by writing out a pretty big, complicated TypeScript project which does the manual syncing logic. Obviously it seems very simple, like you're just going from one API fetch to a DB, but what about retrying? What about the actual orchestration of this? What about the cron job? And then there are even more things. What if we want to do some deeper parsing on the videos? What if we wanted to parse the sponsor of a video, because that's very useful information for us to have, or parse the sentiment of the comments? Well, now we have to bring in an LLM. And the way I was bringing in an LLM is with one of the things I mentioned earlier, which is BAML. I think BAML is very, very cool. Currently it is entirely a Gen 1 AI SDK; I am sure they are working on something new to do the actual agent thing. But for right now, effectively what it is is a new programming language designed for working with LLMs, and it works really, really well for taking some blob of text or data or video or whatever, and giving it an output shape. So I say, okay, this is the actual output shape I want to get from BAML. This is the function we're using here, which takes in a video. Video is a data type which just takes in a URL, so it knows it's a YouTube video. VideoSponsor: we return a list of these. We pass in the prompt here. And you can see within their playground, I passed this into the Gemini API, which did the actual orchestration of this. They have a lot of stuff under the hood that ensures the LLM will output in the correct shape, and you get the correct shape every single time. Because, obviously, when you're working with little non-determinism machines, they will do non-deterministic things, and even if it's a 99% success rate, this does a great job of getting you over that line, making sure the output will always be in this correct shape that you can then pass into your TypeScript code and do something useful with. You can see, for example, that this video was sponsored by G2I, just like this conference. We love G2I. And the thing is, that's great. That all works just fine. But it gets kind of annoying. What if we did this differently? What if, instead of writing the program as a bunch of TypeScript files, you just wrote it as a bunch of markdown files? That's what I'm trying out here. And you can see I have a couple of different skills in here. There's nothing too crazy about these; I guess we can go through all three briefly. The top-level one is this YouTube video sync skill, and this basically just tells the agent the steps it needs to go through to sync the YouTube channel.
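A hypothetical sketch of what that kind of top-level skill file might look like, loosely modeled on the SKILL.md convention; this is not Ben's actual skill, and every path and name here is made up for illustration:

```markdown
---
name: youtube-video-sync
description: Sync channel, video, and comment data from the YouTube API into Postgres.
---

# YouTube video sync

Steps:

1. Run `scripts/fetch-channel.ts` to pull the channel's video list
   (it handles paging and the API key for you).
2. For each new or changed video, run `scripts/fetch-video.ts <id>`.
3. Upsert the results into the `videos` table (the connection string
   is in `DATABASE_URL`).
4. If a request fails with a rate-limit error, wait and retry it
   before moving on.
```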
It has a bunch of helper functions in here so that, instead of manually writing some Python code to fetch the API and then put the data in the DB, it has these good wrappers. Like fetch-channel: it does all of this correctly. But you don't even need this. In the first version of this, literally all I had was a markdown file with some natural-language steps: okay, go sync some data from the YouTube API, here's an API key, put that in a Postgres DB, good luck. And it did it. It did it very, very well. This is the reason why something like OpenClaw works so well: these coding agents can just kind of make these things happen. The only reason I have these extra functions in here is because I intend to deploy this and use it elsewhere, and I want it to be a little more robust. But you can see that all we're really doing here is defining the steps, which is what you would do in a normal program, and then letting it run, and it will do the thing. And it has a remarkably high success rate. You end up getting this nice property where it's almost self-healing: I noticed a couple of times there would be a weird rate-limit error or something like that, and instead of having to implement exponential retries yourself, the agent will just kind of naturally do it for you. And now, here's how all of this comes together. If you are building something like this with skills and all these extra things in here, one way you can run it is to just open up pi here and do /sync. I hit enter on this, and it's just going to start doing the syncing: checking the project directory, cd-ing around, doing whatever it needs to. But if we don't want to do that, say we want to run this in a sandbox in the cloud on a cron job, you can use the opencode SDK, which is what I'm doing here: create an opencode instance, create a new session, then pass in a prompt that tells it to use the skill to actually do the thing, log some useful information up here, and then it just works. Because, like I said earlier, since this is a full coding agent SDK, it is reading all of the skills out of that directory. It is reading all the agents. It's even reading the auth stuff. So because my opencode is authenticated with GPT 5.4 mini, if I ran the sync command with this index.ts, it would do that sync with 5.4 mini. It all just kind of comes together and allows you to build these very weird new shapes of programs that I honestly didn't think were going to be a thing, but clearly are. And really, what I wanted to get across with this talk is not any specific implementation details; go experiment with those on your own. The thing that I wanted to get across here (and it's really just a message for myself, because I keep doing this) is that the shape and direction of where these things are going is very strange and is changing all the time. If you had told me three months ago that I would be giving a talk where I defended GStack and markdown files as a new way to do programming, I would have laughed at you. But here we are. Because the thing about this weird AI revolution right now is that every single time I learn something new about it (I find something, test a new model, try a new theory or whatever) I draw a new line in my head: okay, this is what it is capable of, this is what we can do, this is how this works. And then that is the box I live in.
And I don't really go past that. But that doesn't work anymore. These things change so much, so fast, and you just need to try weird, random ideas. Every time you have some very strange idea and you're like, "Oh yeah, that probably won't work," still give it a shot, because it might. On paper, this doesn't seem like it should work, and yet it kind of does, and it works really, really well. And you can even take what's here and ask: what is the next logical step? Do I need to have all of this code written here? Can we go further with this? I don't know. All I know is that as time has gone on, we have started giving agents more freedom, and time will tell what that freedom will bring. So that's all I got. Thanks for listening. >> All right, that was a great talk by Ben. >> All right, awesome. A very big day with a lot of great talks, and we covered a lot of ground, I think. >> Yeah, we went from context engineering, some philosophical discussions, a lot of practical discussions, SDKs, MCP servers, and here we are. I think I need a drink. >> Oh, okay, Iman. So, are you more of a mojito person or a nojito person? >> We'll see. The night is young. >> Okay. Okay. And let's see, from our audience, who is for mojitos? >> Nice. We see some hands. What about nojitos? >> Okay, maybe you want a Miami Vice. No judgments here. But what is happening here is that we're going to close out the day. I think you all deserve some good dinners and some good drinks, and we have a whole day tomorrow, from 9:00 to 5:30 again. We're going to hear from the organizer, G2I, all the way to a software engineer from Cursor. So gear up for another day of a fully packed schedule, great connections, and great knowledge sharing. That is it for today. Thank you so much for being here and joining us, and we'll see you tomorrow. Take care.