AIE CODE 2025: AI Leadership ft Anthropic, OpenAI, McKinsey, Bloomberg, Google Deepmind, and Tenex
Channel: aiDotEngineer
Published at: 2025-11-20
YouTube video id: cMSprbJ95jg
Source: https://www.youtube.com/watch?v=cMSprbJ95jg
[music] This is insane. Typing thoughts into [music] the darkest spark becomes design. Words evolve [music] to whispers meant for something more divine. Syntax bends and breeze. I see the language change. I'm not instructing anymore. I'm rearranging f. Every loop I write [singing] rewrites me. Every function hums with meaning. I feel the interface dissolve [music] between the maker and the new code. Not on the screen but in the soul where [music] thought becomes the motion and creation takes control. No lines no rules just balance [music] in between the zero and the one. The silence and the dream. >> [music] >> systems shape our fragile skin. They mold [singing] the way we move. We live inside the logic gates [music] of what we think is true. But deep beneath the data post, [music] there's something undefined. A [singing] universe compiling the image of our [music] minds. Every line reveals reflection. Every loop replace [music] connection. We're not building, we're becoming. And the code becomes confession. This is the [music] new code. Not on the screen, but in the soul where thought becomes the motion. [music] Creation takes control. No lines, no rules, just balance in between the [music] zero and the one. The silence and the dream. [music] [music] [music] We are not just Don't worry. [music] Uh, we're just giving you something to do while Codeex writes all your code. We are the world. [music] Each prompt, each breath, each fragile spin, a universe [music] renewing. This is the new code. Alive and [music] undefined. Where logic meets emotion and structure bends to mind. [music] The systems eternal but the soul writes the line. We are the new code. [music] Compiling time. [music] on fire inside. [music] [music] [applause] Ladies and gentlemen, please join me in welcoming to the stage the co-founder [music] of Morning Brew and the managing partner of 10X, your host for the leadership [music] track session day, Alex Lieberman. Keep it going. Let's get a quick read of the room. If you are coming from right here in the Big Apple from New York, make some noise. Okay, now I have to say it. I assume this is the biggest group. San Francisco. >> Wow, that is surprising. Uh, Austin. >> Okay, we got Austin. Who thinks they came from the furthest place and is in the room today? >> Where? Where? >> Ecuador. Can anyone beat Ecuador? [applause] >> New Zealand. >> I don't think anyone's going to beat New Zealand. There we go. Well, first of all, uh, I am so excited to welcome you all to the AI Engineer Code Summit 2025. Uh, I'm Alex Lieberman, co-founder of Morning Brew and your MC for the day. Um, now you may be wondering, why is a newsletter guy hosting an AI engineer conference? It's a great question. Well, after I left my role at Morning Brew, I asked myself one simple question, and it was, what space do I want to spend my time in for the next 20 years where I can build something consequential and spend my time with some of the smartest people I've ever met? And the answer became obvious. I wanted to be as close to the frontier of AI as humanly possible, which is why I co-founded 10x.co, CO, which is an AI transformation firm helping mid-market and enterprise companies learn how to use AI within their business. And I spend basically all of my time now with AI engineers like yourselves. I'm the only non-technical person in the business, and I wouldn't have it any other way. So, as you know, this year has been a banner year for the industry. And I would think of today as both a look back on where we've been as well as a tactical view of where we are headed in companies small and large, old and new. We're going to hear from the labs. We'll hear from Unicorn AI startups. We'll hear from academics, big-time management consultants, and Fortune50 brands. But before we do that, we have to give the brands that made this day possible their flowers. So, let's go into it. Let's give it up for Google DeepMind, today's presenting sponsor. [applause] Love it. Keep it going for Anthropic, the platinum sponsor for the day. [applause] And then one more round of applause for all of the gold and silver sponsors who you can meet in the expo downstairs throughout the day. One more. Let's keep it going. [applause] Are you guys ready to do the damn thing? >> Let's do it. To kick things off, let's give a huge welcome to head of engineering of the Claude developer platform, Caitlyn Les. Let's welcome Kaitlin. [applause] Good morning. Um, so first let's give a huge thank you to Swix and the whole AI engineer organizing team for bringing us together. [applause] I'm Caitlyn and I lead the claw developer platform team at Anthropic. Um, so let's start with a show of hands. Who here is integrated against an LLM API to build agents? Okay, I'm talking to the right people. Love it. Um, so today I want to share how we're evolving our platform to help you build really powerful agentic systems using claude. So we love working with developers who do what we call raising the ceiling of intelligence. They're always trying to be on the frontier. They're always trying to get the best out of our models and build the most high performing systems. Um, and so I want to walk you through how we're building a platform that helps you get the best out of cloud. Um, and I'm gonna do that using a product that you hopefully have all heard of before. Um, it's an agenda coding product. We love it a lot and it's called Claude Code. So when we think about maximizing performance um, from our models, we think about building a platform that helps you do three things. Um, so first, the platform helps you harness Claude's capabilities. We're training Claude to get good at a lot of stuff, and we need to give you the tools in our API to use the things that Claude is actually getting good at. Next, we help you manage Claude's context window. Keeping the right context in the window at any given time is really, really critical to getting the best outcomes from Claude. And third, we're really excited about this lately. We think you should just give Claude a computer and let it do its thing. So I'll talk about how we're we're evolving the platform to give you the infrastructure and otherwise that you need to actually let Claude do that. So starting with harnessing Claude's capabilities. Um so we're getting Claude really good at a bunch of stuff and here are the ways that we expose that to you um in our API as ideally customizable features. So here's a first example um relatively basic. Claude got good at thinking um and Claude's performance on various tasks. um scales with the amount of time you give it to reason through those problems. Um and so uh we expose this to you as an API feature that you can decide do you want Claude to think longer for something more complex or do you want Claude to just give you a quick answer. Um we also expose this with a budget. Um so you can tell Claude how many tokens to essentially spend on thinking. Um and so for cloud code um pretty good example. Obviously, you're often debugging pretty complex systems with cloud code, or sometimes you just want a quick um answer to the thing you're trying to do. And so, um claude code takes advantage of this feature in our API to decide whether or not to have Claude think longer. Another basic example is tool use. Claude has gotten really good at reliably calling tools. Um, so we expose this in our API with both our own built-in tools like our web search tool, um, as well as the ability to create your own custom tools. You just define a name, a description, and an input schema. Um, and Claude is pretty good at reliably knowing when to actually go um, and call those tools and pass the right arguments. So, this is relevant for cloud code. Cloud code has many many many tools and it's calling them all the time to do things like read files, search for files, write to files um and do stuff like rerun tests and otherwise. So the next way we're evolving the platform to help you ma maximize intelligence from claude um is helping you manage cla's context window. Getting the right context at the right time in the window is one of the most important things that you can do to maximize performance. But context management is really complex to get right. Um especially for a coding agent like claude code. You've got your technical designs, you've got your entire codebase, um you've got instructions, you've got tool calls. All these things might be in the window at any given time. And so how do you make sure the right set of those things are in the window. Um so getting that context right and keeping it optimized over time is something that we've thought a lot about. So let's start with MCP model context protocol. We introduced this a year ago and it's been really cool to see the community swarm around adopting um MCP as a standardized way for agents to interact with external systems. Um and so for cloud code, you might imagine GitHub or Sentry, there are plenty of places kind of outside of the agents context where there might be additional information or tools or otherwise that you want your agent to be able to interact with or the cloud code agent to be able to interact with. Um, and so this will obviously get you much better performance than an agent that only sees the things that are in its window as a result of your prompting. Uh, so the next thing is memory. So if you can use tools like MCP to get context into your window, we introduced a memory tool to help you actually keep context outside of the window that Claude knows how to pull back into the window only when it actually needs it. Um, and so we introduced the first iteration of our memory tool as essentially a clientside file system. So you control your data, but claude is good at knowing, oh, this is like a good thing that I should store away for later. And then uh it knows when to pull that context back in. So for cloud code, you could imagine um your patterns for your codebase or maybe your preferences for your git workflows. These are all things that claude can store away in memory and pull back in only when they're actually relevant. And so the third thing is context editing. If memory helps you keep stuff outside the window and pull it back in when it makes sense, context editing helps you clear stuff out that's not relevant right now and shouldn't be in the window. Um, so our first iteration of our context editing is just clearing out old tool results. Um, and we did this because tool results can actually just be really large and take up a lot of space in the window. And we found that tool results from past calls are not necessarily super relevant to help claude get good responses later on in a session. And so you can think about for cla code is calling hundreds of tools. Um those files that it read otherwise all these things are taking up space within the window. Um so they take advantage of um context management to clear those things out of the window. And so um we found that if we combined our memory tool with context editing, we saw a 39% bump in performance over over the benchmark on our own internal evals. Um which was really really huge. And so it just kind of shows you the importance of keeping things in the window that are only relevant at any given time. And we're expanding on this by giving you larger context windows. So for some of our models, you can have a million token context window. combining that larger window with the tools to actually edit what's in your window maximizes your performance. Um, and over time we're teaching Claude to get better and better at actually understanding what's in its context window. So maybe it has a lot of room to run, maybe it's almost out of space. Um, and Claude will respond accordingly depending on how much time uh or how much room it has left in the window. So here's the third thing. Um, we think you should give Claude a computer and just let it do its thing. We're really excited about this one um because there's a lot of discourse right now around agent harnesses. Um you know, how much scaffolding should you have? How opinionated should it be? Should it be heavy? Should it be light? Um and I think at the end of the day, Claude has access to writing code. And if Claude has access to running that same code, it can accomplish anything. You can get really great professional outputs for the things that you're doing just by giving Claude runway to go and do that. But the challenge for letting you do that is actually the infrastructure as well as stuff like expertise like how do you give claude access to things that um when it's using a computer it will get you better results. So a fun story is we recently launched cloud code on web and mobile. Um and this was a fun project for our team because we had a lot of problems to solve. When you're running cloud code locally cloud code is essentially using your machine as its computer. But if you're starting a session on the web or on mobile and then you're walking away, what's happening? Like where is that where is um cloud code running? Where is it doing its work? Um and so we had some hard problems to solve. We needed a secure environment for claude to be able to write and run code that's not necessarily like approved code by you. Um we needed to solve or container orchestration at scale. Um and we needed session persistence. um because uh we launched this and many of you were excited about it and started many many sessions and walked away and we had to make sure that um all of these things were ready to go when you came back and um wanted to see the results of what Claude did. So one key primitive in this is our code execution tool. Um so we released our code execution tool in the API um which allows claw to run write code and run that code in a secure sandboxed environment. Um, so our platform handles containers, it handles security, and you don't have to think about these things because they're running on our servers. Um, so you can imagine deciding that um, you you want Claude to write some code and you want Claude to go and be able to run that code. And for cloud code, there's plenty of examples here. Um, like make an animation more sparkly that uh, you want Claude to actually be able to run that code. Um, so we really think the future of agents is letting the model work pretty autonomously within a sandbox environment and we're giving you the infrastructure to be able to do that. And this gets really powerful once you think about giving the model actual domain expertise in the things that you're trying to do. So we recently released agent skills which you can use in combination with our code execution tool. Skills are basically just folders of scripts, instructions, and resources that Claude has access to and can decide to run within its sandbox environment. Um, it decides to do that based on the request that you gave it as well as the description of a skill. Um, and Claude is really good at knowing like this is the right time to pull this skill into context and go ahead and use it. And you can combine skills with tools like MCP. So MCP gives you access to tools and access to context. Um, and then skills give you the expertise to actually make use of those tools and make use of that context. Um, and so for cloud code, a good example is web design. Maybe whenever you launch a new product or a new feature, um, you build landing pages. And when you build those landing pages, you want them to follow your design system and you want them to follow the patterns that you've set out. Um, and so Claude will know, okay, I'm being told to build a landing page. this is a good time to pull in the web design skill um and use the right patterns and and design system for that landing page. Uh tomorrow Barry and Mahes from our team are giving a talk on skills. They'll go much deeper and I definitely recommend checking that out. So these are the ways that we're evolving our platform um to help you take advantage of everything that Claude can do to get the absolute best performance for the things that you're building. First, harnessing Claude's capabilities. So, as our research team trains Claude, we give you the API features to take advantage of those things. Next, managing Claude's context. It's really, really important to keep your context window clean with the right context at the right time. And third, giving Cloud a computer and just letting it do its thing. So, we're going to keep evolving our platform. Um, as Claude gets better and has more capabilities and gets better at the capabilities it already has, we'll continue to evolve the API around that so that you can stay on the frontier and take advantage of the best that Claude has to offer. Um, second, as uh, memory and context evolve, we're going to up the ante on the tools that we give you in order to let Claude decide what to pull in, what to store away for later, and what to clean out of the context window. And third, we're really going to keep leaning into agent infrastructure. Some of the biggest problems with the idea of just let Claude have a computer and do its thing are those problems that I talked about around orchestration, secure environments, and sandboxing. And so we're going to keep working um to make sure that those are um ready for you to take advantage of. Um and I'm hiring. We're hiring at Anthropic. We're really growing our team. Um, and so if you're someone who loves um building delightful developer products um and if you're excited about what we're doing with Claude, we would love to work with you across end product design um Devril lots of functions. So please reach out to us and thank you. [applause] [music] Our next [music] presenter is the president and head of AI at Replet. He's here to speak about building the future of coding. Please join me in welcoming to the stage Mikuel Katasta. [music] All right, good morning everyone. So at Raplet, we're building a coding agent for nontechnical users. It's a very peculiar challenge I would say compared to many people in this room. And what I'm going to talk about today is why autonomy has become kind of the northstar that we keep chasing you know since we launched the very first version of rapid agent September last year. Let's start from this very interesting plot in case my clicker worked which now does. Um I'm sure you all have seen it. you know the semiac value that published by Wix a few weeks ago and it kind of clarify a bit the landscape you know for all of us uh agent builders on one hand you have the low latency interactions that really allow you to stay in the loop you know so you can do deep work and focus really on the on the coding task at hand but you need to be an expert you need to know exactly what to the model for and you need to understand quickly if you want to accept the changes or not then for several months many of us including rapid We kind of live in this I think value that where the agent wasn't autonomous enough to really delegate a task and come back and see it accomplished but at the same time it run long enough not to keep in the zone not to keep in the loop. Luckily over time we managed to go all the way on the right and now we have agents that runs for several hours in a row. What I'm going to be arguing with today and hope is not going to stop inviting me to this event is the fact that there is an additional dimension like a third dimension to this plot that you know it hasn't been covered here and namely the fact is how do we build autonomous agents for nontechnical users. So what I'm going to be arguing today is that there are two types of autonomy. One of it is more supervised. So think of the you know Tesla FSD example. When you sit in a Tesla, you're still expected to have a driving license. You're going to be sitting in front of the steering wheel. Perhaps 99% of the time, you're not going to use it, but you're there in order to take care of the longtail events. And similarly, a lot of the coding agents that we have today require you to be technically savvy in order to use them correctly. We at Replet and uh other companies at this point are focusing on kind of the whimo experience for autonomous coding agents. So you're expected to sit in the back. You don't even have access to the steering wheel. And I expect you basically not to need any driving license. Uh why is this important? Because we want to empower every knowledge worker to create software. And I can't expect knowledge workers to know what kind of technical decisions an agent should be making. We should offload completely the level of complexity away from them. Of course, it took a while to get here. So I'm I'm sure what I'm showing you here is something that all of you are very familiar with. It took several years to go from I know maybe less than a minute feedback loop constant supervision and talking about completions and talking about assistance. These are areas where the AI power is and really been pioneering this this type of user interaction. Then we slowly climbed through you know higher levels of autonomy. So we had the first version of the agents based on on react. So we concocted autonomy with a very simple paradigm on top of LMS. Then likely AI providers understood that tool calling was extremely important poured a lot of effort on that. So we built the next version of agents with native tool calling. And then I would say there is a third generation of agents which I call autonomous and that's when we started to break the barrier of say one hour of autonomy. Basically the the agent being capable of running on long horizon tasks and remaining coherent. It happens to be the case that those are also the versions of rapid agent that we launched over the last year. So the B3 is the one that we launched a couple of months ago and it has exactly showcases those properties. So the question for today is can we actually build fully autonomous agents and how do we get there? So I'm going to try to redefine the definition of autonomy today. I think that often times we conflate autonomy with a concept of something in the lungs for a for a lot of time and usually as a user you lose control. In reality what the autonomy that I want to give to agents can be very specifically scoped and what I mean by that is especially with rapid agent 3 what we accomplish is we we make sure that our agent takes holy technical decisions. Of course, that could lead to very long gap between the different user interactions and in case the agent again runs for several hours. But this happens if and only if the scope of the task you're giving to the agent is really broad. And it turns out that in reality you can have an agent that is really autonomous and is still fast as long as you give it a very narrow scope for the task, you know, at hand. So what we can accomplish in this way is that the user still maintains control on the aspects that they care about and a user cares about what they're building. Especially again our users, knowledge workers, they don't care about how something has been built. They just want to see their goals to be accomplished. So autonomy should not be basically conflated with long run times. And similarly, it shouldn't become a dity metric. you know, a lot of us are talking about it as a as a badge of honor. And it's definitely been exciting to see in the last few months that, you know, many of us broke the the barrier of running several hours in a row, but I think in terms of how to build agents that are going to be more powerful and more suitable in the future, we kind of have to change a bit uh the the target the metric that we do that we keep in mind. So, think about it in this way. Tasks have a natural level of complexity and basically what we care about is that they have a minimum irreducible amount of work that they express. What agents do is that they always go through this loop of planning, implementing and testing. And of course to make this happen and to make it work correctly, you want this work to be happening over a long twing trajectory. So our goal is to maximize the reducible runtime of the agent. By reducible I mean having a span of time where the user doesn't have to make any technical decisions and the agent can accomplish the task again in full autonomy. This is especially important for us because I can't trust our users to make technical decisions. So they they need a proper technical collaborator by their side. I want to abstract away as much complexity as possible from the process of software creation. And last but not least, I want the users to feel in control of what they're creating without stifling their creativity because they have also to think about the technical decision that the agent is making. So now what are the pillars of autonomy? How are we making this happen? I would say there are three pillars that are extremely important to think about. The first one is of course the capabilities of frontier models like the baseline IQ that we inject in the main agentic loop. I'm going to leave this as an exercise to the reader and to other people in the room. I'm really glad a lot of you are building amazing models that you know we use all the time at Rabbit. So this is the pillar number one. The second pillar is verification. It's very important that we test for local correctness of our agent at every step that it takes. And the reason is fairly intuitive. If you are building on very shaky foundations, eventually the castle will topple down. So we brought verification in the loop to make sure that in a sense you are having you know nines of reliability where in the compounding errors that an agent will make unavoidably if you know you don't put any control on it and last but not least you heard it on stage even earlier I'm sure you're going to be hearing this you know the entire day or the entire duration of the conference uh the importance of context management so one end you want to have an agent that is capable of being globally coherent so it's aligned with the intent of the user the expectation of the user But at the same time it is also to be capable of managing both the high level goal and the single task that the agent is working on. I think we made amazing progress in the last months on context management but I'm also excited to see you know where we're going as a field. Let's start from the first pillar that we work actively at rapid which is verification. So why do we focus on this? Over the you know last year we realized something that I think each one of you has experienced. So without testing agents build a lot of painted doors. In our case the painted doors are very visible because we create a lot of web applications. So you end up basically trying to click on a button and the handler is not looked up or some of the data that we're showing is actually mock data and it's not coming it's not coming from a database. But in general this phenomenon spans you know across every type of component you're building being it front end or back end. A lot of components are actually not fully fleshed uh by the agent. So we run some evaluations internally. We found out that more than 30% of the individual features happen to be broken. Know the first stand that are cooked by the agent. And that also means that almost every applications have at least one broken feature or painted door. They're hard to find. The reason is users are not going to spend time testing every single button, every single field. And this is also probably one of the reasons why a lot of our users, especially the nontechnical ones, still can't trust coding agents very much. They are shocked when they find that there is a painted door out there. So, how do we solve this problem? Fundamentally, we need an agent must gather all the feedback that they need from their environment, right? It's easier said than done. Um again nontechnical users not only cannot make technical decisions but also they cannot provide the technical feedback that you know an agent is required to make progress and most what they can do is basic you know quality assurance testing. They can literally go around the UI click interact with the application. I'm I'm sure you have tried it in your life. This is extremely tedious to do and it leads to a very bad user experience. And even though we relied on that with our first release of the agent last year, quickly we found out that users don't want to spend time doing testing. So we had to find a complete, you know, orthogonal solution to that which is autonomous testing and it solves several different issues. The first one is it breaks the feedback bottleneck. Even if again we ask feedback to the user, we were not given enough of that. Now we don't have to wait anymore for human feedback. We have a way to elicit as much information as possible from the app autonomously. We also want to prevent the accumulation of small errors. What I was saying before, we don't want to have compounding errors while the agent is building. And last but not least, we have to overcome the laziness of frontier models. So we need to verify that whenever a model tells us that a task has been completed, there is actually the truth and that result is not being elucinated. There is a wide spectrum of code verification that you know you you can accomplish. I think we all started from the very left. You know you have basic study code analysis with LSPs. We have been executing the code since we had basically LMS that were capable of debugging and then we slowly started to move towards the right. So generating unit tests and running them it has a limitation. It's limited only to functional correctness. Uh unit testing is not very powerful to do like proper integration testing by definition. We started also to do now API testing but it's only limited to API code. So you can test endpoint of an applications. You can't really test how a web app functions and looks like. And for this reason in the last few months has and other companies are pouring a lot of effort in really creating autonomous testing based on the browser you know in case the app that we're building is a web application. There are two main categories here. One is computer use. It's a onetoone mapping with user interface. So the model is directly interacting with the application. It requires screenshots. It tends to be fairly expensive and fairly slow. I'm sure you you tested it yourself. A good way in the middle is browser use where we simulate the user interface. You can then interact with the browser and with the web application and it relies on basically accessing the DOM through abstractions. So how do we how do we make this work in Revit? Um what we do is that we generate applications that are amenable to testing and we sort of merge everything together from the previous slides that I showed you. So we allow the our testing agent to interact with an application and gather screenshots in case nothing has worked. So we have a full back to computer use. But the vast majority of times what we do is that we have programmatic interactions with the applications. So we interact with the database, we read the logs, we do API calls, we literally click on the app and get back all the information that we need. And by putting all of this together, we collect enough feedback that allows our agent both to make progress and also to fix all the painted doors that it encounters. Just a know short technical deep dive on how we accomplish this. I'm sure you have seen a lot of the toolbased uh browser use. There are amazing libraries out there. First one comes to man stan and the idea is that you have an agent that has a few very generic tools exposed. So know the agent can create a new tab, can click, can fill forms etc etc. The limitation here is that it's difficult to enumerate all the different type of interactions you could be having with a browser. The problem of testing is very similar to the Tesla analogy I was making before. Maybe this cardality of tools available is enough for 99% of the interaction types. But then there is always a long tail of idiosyncratic interactions that a user makes with the with a web application that are hard to map into these tool these different tool calls. So what we do uh in our case at rapid is we directly write playwright code and playwright code is first of all very manable for LLMs. LM are kind of amazing at writing playright. You know this is the experience that we had uh since we started to work on this project is also very powerful and expressive. So in a sense it's a super set of what you can express uh on the compared to the left on the tools uh testing. And last but not least, there is beauty in creating playright code because you can reuse those tests. If the moment you write a test in script, then you can rerun it as many times as you want. So in a sense, the moment you created a test, you're also creating a regression test suite that you can keep running in the future. And all these kind of uh tricks that I explained to you right now, they helped us to create something that is roughly a order of magnitude cheaper and faster compared to computer use. And we'll go back later on how important latency is. The second thing that the second pillar that I wanted to talk about today of course is context management. And I'm going to go very fast here because I think you're going to be hearing a lot of talks today about it. The the high level message here is that long context models are not needed to work on quer and long trajectories. Uh from experience we found that most of the tasks even the more ambitious one can be accomplished within the 200,000 tokens. So we're still not in a world where working with models that have 10 million or 100 million uh context windows is necessary to actually run autonomous agents. And we accomplish this by means of learning how to do context management correctly. So first of all, there are several different ways to maintain state which don't imply chucking all the state into your context window. You can do that for example by using the codebase itself to maintain state. So you can write documentation while the agent is creating new code. You can also include the plan description and all the different task list that the agent is working on. You can persist them on the file system. So even there like have a lot of ways to offload your memories. And last but not least and this is something I think you know Antropic has been uh really evangelizing about um you can even dump directly your memories in the file system and then making sure that your agent decides when to write them back the moment they become relevant to your work. So for this reason we have been seeing a lot of announcements in the last couple of months. I just picked this one from Entropic and with CL at 4.7 so I wish 4.5 uh that have been able to run uh focus task for more than 30 hours in a row. We have seen similar results from OpenAI on the math problems. So I think we we kind of broke the barrier of running for long and you know being able to have quant tasks. I would say the key ingredient to make this happen has been how good models hand as agent builders have become in doing sub agent orchestration. Subages basically work by means of they're invoked in the core loop. So it's a completely it's starting from a blank slate uh from a completely fresh context. You as an agent builder decide what subset of the context to inject when this sub agent starts. And it's a concept that is very similar I think to everyone who's been writing software you know in the last decades is separation of concerns. So you decide what your subject is going to be working on. You give it the least possible amount of context. You allow it to run to completion. You only get the output the results. You inject them back into the main loop and you keep running in this way. Of course it significantly improves the number of memories per compression. I just brought this plot from directly from rapid agent running in production. The moment we kicked in our new subvision orchestrator on the ax on the y-axis you can see the number of memories per compression. So we went from roughly 35 to 4550 recently. So big improvement in terms of how often we are recompressing our context just because we can offload a lot of the context pollution by means of using sub aents. I'm going to give an example where this made the difference for us. You know what I'm showing you here is more kind of a cost optimization in a sense like you're compressing less. You als have have separation of concerns which definitely make your agent be smarter. In the case of testing, working with sub engine was almost mandatory for us. And basically we started to work on automated testing even before we were very advanced in terms of subent orchestration. And what we found out is of course again as I was saying before it makes things easier, better cost, less pollution. But when you allow the main loop not only to create code but also to do browser opt browser actions to put back the observation of your browser actions into the main loop you tend to confuse the the hent loop very much because at this point there is a lot of heterogenity in terms of the action that your main loop is looking at. So in order to make this work not only we have to build all the playright framework that I was showing to you before but we also have to move our entire architecture into sub agents. So at this point you can see very clearly why there is a separation of concern here. Get the main agent loop running. We decide at a certain point that it's time to verify if the output of the agent has been correct. We make this happen all within a sub agent. Then we scratch the context window of that sub agent. We just return back the last observation to the agent loop and then we keep running in that way. So if you're having issues today making your sub agents uh work correctly, this is one of the reasons why that you want to take a look at. So I think we covered the high level of how to create more and more powerful uh autonomous agents over time and I only see us as a field becoming even more proficient than that in the next months. There is one additional ingredient though that is going to make the difference and it's parallelism. And I will argue that parallelism is important not because it's going to make agents more powerful per se, but rather because it's going to make the user experience more exciting. So of course it is great to have an agent that is capable of running autonomously for long, but at the same time it comes with the price of making the user experience less thrilling. You are not in the zone anymore. What you do is that you write a very long prompt. It's translated into a task list. Uh and then you go to have lunch with your colleagues and then you come back and you hope that the agent is done. That is not the kind of experience that most of the productive people want to have in life. You know, you want to see as much work as done as possible in the shortest span of time. So what we do as a as a field at this point has been to create parallel agents. It's a very common trade-off which by the way doesn't only apply to agents. it it applies to computing in general and for parallel agents what you do is that you you trade off basically extra compute in exchange for time. Why there is this trade-off? So first of all when you're running agents in parallel you're gathering the same context in multiple context windows. So every single parallel agent that you would be running probably shares say 80% of the context across the board. So of course you are just putting more computed work because you're running those agents in parallel. There's also another cost that is kind of intangible for a lot of you here in the room because I'm sure you're all expert software developers, but what do you do with the output of multiple par agents at the end? Often times, you need to resolve merge conflicts. So, as a reminder, my users don't even know what's the concept of merge conflicts. It's something that I have to figure out on our own. So, the current way in which we think of parallel agents in in the space doesn't really apply to Rapid. Now at the same time I still want to very much to accomplish this. There are so many interesting features that you can enable with parallelism. Aside from the fact that you can get more work done. U at times you want to you want testing to be running in parallel with the agent that creates code. Testing no matter how much we optimize it is still very slow. If an agent is only spending time on testing users are not going to be engaging with your application anymore. Um, at the same time, it's also great to have a synchronous process running while your agent is running because you can inject useful information back into the main core loop. And last but not least is a very common technique that we know boost performance if you have enough budget to do so. You should be sampling multiple trajectories at the same time. So a lot of perks are coming with parallel agents. But u the way in which we implement them today which I basically call user has an orchestrator is the fact that tasks the parallel task that you want to run are determined by you by the user and each task is dispatched in its own thread. So there is a bit of manual process even the task de composition in a sense is happening in your mind while you're thinking about which agents you want to run and then the moment you get back all the results you need to go through the problem of merge conflicts and often times this is not trivial at all no matter how many amazing tools are out there. So what we're working on today for our next version of the agent is having the core loop as the orchestrator. So the key difference here is the fact that the the subtask that we're going to be working on are not determined by the user but they are determined by the corion loop and the parallelism is basically deciding on the fly. The agent does the task decomposition on behalf of the user and this comes with a couple of advantages. First of all again there's no cognitive burden to for the user to understand how they should be decomposing the task. At the same time also there are ways in which you can create tasks that sort of mitigate the problem of merge conflicts. I'm not claiming that we're going to be able to mitigate it 100%. There are so many corner cases in which merge conflict will still represent a problem but there are a lot of different techniques known in software engineering to make sure that you can try to have multiple subage and not stepping on each other toes. So the core loop as an orchestrator is going to be the our main bet for the next few months. And in case you're passionate about these topics, [music] I'm always hiring at Rapid. Thank you. [applause] From transforming support tickets into merge requests to helping teams ship fixes faster than ever, our next presenter has been at the center of Zapier's AI agent journey. Please welcome engineering leader Lisa Orur. [music] [applause] Hello. I'm so excited to tell you about how at Zapier we are empowering our support team to ship code. Before I tell you about that, has anybody here visited the Grand Canyon? It's a good amount. Anybody rafted through the Grand Canyon? I see one person. I just got off an 18-day trip rafting through the Grand Canyon over 200 miles. It was incredible. No internet, no cell service. The moment I got off, I found out I was giving this talk. I didn't think about uh work at all on the river, but once I got off, I started thinking about the parallels between the Grand Canyon and Zapier. And we have one thing in common and that is erosion. Now natural erosion happens over millions of years with wind, water and time. It creates the beautiful canyon that we experience and it's never stopping, always continuing. At Zapier, we have over 8,000 integrations built on thirdparty APIs and they are constantly changing, which I'm now thinking of as app erosion. We've been around for 14 years. Some of our apps are that old. API changes and deprecations impact us and create reliability issues. Again, it never stops. So, I like to think of our apps as like layers in the Grand Canyon and they need constant attention. So, if we were to create our own Zapier Canyon and our apps would be at the walls, here's our support team flowing down the middle watching out for app erosion. And we have a backlog crisis. Tickets were coming in faster than we could handle them. Creates integration reliability issues, poor customer experience, even churn. So to solve for app erosion, we kicked off two parallel experiments. The first was moving support from just triaging to also fixing these bugs. It's experiment number one. Experiment number two, we were asking, can AI help solve app erosion faster? So let's jump into experiment one. This get kicked off two years ago, but had to start with the why. We needed to get that buy in to empower our support team to ship code. So apparosion is one of the major sources of bugs coming through to from support to engineering. So there's a big need support is eager [laughter] for this experience to a lot of them want to go into engineering eventually and unofficially many support members were already helping to maintain our apps. This moves us into how we started this out. Put on some guard rails. We started with just four target apps to uh focus our fixes on. Engineering was set to review any merge requests coming from support and we kept the focus on app fixes. So jumping into experiment 2, this is what I've been leading for the last couple of years. How can we use codegen to help solve for app erosion? And so fortuitously, the name of this project is Scout, which ties in so well to the Grand Canyon experience that I've just been through. As any good product manager, we start with discovery. We did some dog fooding, so I shipped some app fixes. Uh we shadowed engineers and support team members as they were going through the app fix process. We designed out uh what are the pain points experienced along the way, what are the phases of the work and how much time is spent. One big discovery we had is how much time is spent gathering the context going to the third party AP API docs even crawling the internet looking for information about a bug that's emerging maybe somebody else has already discovered and solved for it outside of Zapier. internal context, logs, all of this is a lot of context to go and search for as a human uh and a lot to gro and work through. This is something we knew we needed to solve for. where we started with all this great uh opportunities and pain points is we started building APIs that we believed would solve for these individual um pain points and some of these APIs are using LLMs to you know for our diagnosis tool gathering all that context on behalf of the uh support person engineer and curating that context building a diagnosis that's [clears throat] using an LLM. And then some aren't like we have a unit test uh or unit test generator is, but the um test case finder is simply using a search query to look for the right test cases to pull in for your unit test. We built a bunch of APIs. We had a bunch of great ideas. So there was a lot for us to test with, but we ran into some challenges in this first phase. We had APIs but they were not embedded into our engineers process. So our tool I just said they don't like to go to so many web pages to find all their context. They would love all this information to come to them. And yet our web interface where we've we've created a playground we call autocode internally where you can come and play around with our APIs. And our ask to the teams was come try out our APIs and give us feedback. Now this is just one more window to go to. So we didn't get a lot of engagement. Also because we had shipped so many uh APIs our team was spread pretty thin. Cursor launched at the same time which has gotten great adoption at Zapier. We're all huge fans of cursor. But from our side, it made some of our tools no longer necessary. But there was one major win in this phase, which is one of our APIs became a support darling. It's diagnosis. That number one pain point of need to go out and find all of your context, curate it for yourself so you can start solving the problem. We were doing that on uh the support team's behalf with the diagnosis API and support loved it enough that they decided to embed it into their process. They asked us to build a zap year integration on our autocode APIs so they could embed it into their zap that creates the jur ticket from the support issue and now diagnosis is included. So embedding tools is the key to usage as we find out. So how can we embed more of our tools? Well, then MCP spins up and that solves our problem. We can now embed these API tools into our engineers workflow. Specifically, our engineers are pulling in these MCP tools as they're using cursor. Our builders using Scout MCP tools are leaving the IDE less, spending more time in one window. Still coming into challenges. One of our uh our our key tool diagnosis uh is so valuable to pull all that context and to provide a recommendation, but it takes a long time to run. Now, we might run down that runtime. However, as you're working synchronously on a ticket in your ID, this was frustrating. We also weren't keeping up with the customization needs. Not only did MCP launch and we started leveraging it, Zap Your MCP launched too. And some of our tools, if we weren't keeping up with the customization needs, our engineers internally looked to Zap Your MCP, which is great. We're all on the same team solving the same problem, but some of our tools had a dead end. Also adoption was scattered. We had a whole suite of tools and we thought there was value in each of them as it solves for different problems across the different stages. Not every engineer was using our tools and if they were using tools, they're only using a few of them. So we have tool usage. We're happy about that. But we were under the hypothesis that true value is going to come from tying these tools together. So what if we owned orchestration of these tools rather than saying here's a suite of tools you use them as you wish what if we combined them and created an agent to orchestrate this. So this we are calling scout agent. We take that diagnosis run that against a ticket uh use that information to actually spin up a codegen tool which will then produce a merge request using all the right context. So who would benefit the most from orchestration? There are several integration teams at Zapier who are solving for these app fixes of various levels of complexity and there's the support team. So when we're saying who should be our first customer scout agent, we're thinking it should probably be the the team fielding small bugs that are emergent and coming hot off the queue which is the support team. And now our two experiments merge and we have scout agent. We are building for the support team. And this is the flow of how it works. Support is submitting an issue to scout agent. We first categorize the issue. We next assess its fixability. Not every issue that comes from support can be fixed. If we thinks it's fixable, we'll move on to generating a merge request. At that point, the support team, this is the first time they're picking up the ticket. It already has a merge request attached to it. They'll review and test. If it's not satisfying what they believe is the actual solution or the the what what the solution should be to best address the customer's need, they will make a request for an adjustment that can happen right in GitLab, which is where we do our work and Scout will do another pass and hopefully at that point we've gotten it right and support can submit that MR for review from engineering. How we are running Scout, it's all kicked off by a Zap. This is a picture of one of our zaps. There are many zaps that's run this whole process and it embeds right into our support team's zaps. We do a ton of dog fooding at Zapier. We first run diagnosis and post that result to the Jira ticket saying what the categorization is if we believe it's fixable. And then if we do believe it's fixable, we then are kicking off a GitLab CI/CD pipeline. And we run three phases in that pipeline. plan, execute and validate to generate this merge request. The tools used in this pipeline is Scout MCP. So all those APIs we invested in a year ago now are really coming together and we're orchestrating it uh within the GitLab pipeline and we're also leveraging cursor SDK. Once the merge request has been completed, we attach it to Jira and support picks it up. The latest addition to this is doing a rapid iteration once a um uh once a ticket has been posted with the merge request and support team is looking at it and they say you know it needs some tweaks to save them more time so they don't have to go pull that down to their ID do the fixes and push it back up. they can simply chat with the uh scout agent in gitlab that'll kick off another uh pipeline which does that phase with that new feedback and posts the new merge request on our side we want to make sure scout agent is working so we ask three questions categorization right is was it actually fixable uh and was the code fix accurate so far we have two evals 70 to 75% accuracy for categorization and fixibility. As we get more feedback and process more tickets, those become our test cases and we can move forward improving scout agent over time. So what has been scout agents impact on app erosion? 40% of supports support teams app fixes are being generated by scout. So we're doing more of the work on behalf of the support team. This is resulting and for some of our support team it's doubling their velocity from one to two tickets per week which already is amazing. That's going from a support team that wasn't shipping any fixes well unofficially they were sometimes to now shipping one to two per week per person to now shipping three to four with the help of Scout. Another uh process improvement, Scout puts potentially fixable tickets right there in the triage flow. takes away a lot of the friction of looking for something to grab from the backlog. It's not just the support who's benefiting, it's also engineering. Engineering manager said, uh, it's a great example of when it works. This tool allows us to stay focused on the more complex stuff. And if you take away anything from this talk, I hope it is that there is a really powerful magic between support and empowering them with codegen and allowing them to ship fixes because they have three superpowers. The first they are the closest to customer pain which mean they're closest to the context that really matters for figuring out what's the problem and how to solve it. They're also troubleshooting in real time. These tickets aren't stale. the context is fresh, the logs aren't missing. You put this ticket into engineering backlog months later, you might not get access to those logs anymore. And then three, they're best at validation. You've again you put the same ticket into an engineering backlog. The solution an engineer might come up with may change the behavior and that might be good for some customers but might not necessarily be best for that one customer who wrote in about the problem. And one other major benefit of this is our support team members who have been part of this experiment are now engineers. I want to say thank you to the amazing team who's helped build this process or built all the tools and the scout agent. Andy is actually here in the audience. So shout out to Andy. If you want to talk about any of the technical bits, he's here. And I want to impress upon you two things. or hiring, but mostly if you haven't rafted through the Grand Canyon, please consider it. It's life-changing and you should go with ORS. Thank you very much. [applause] [music] Our next presenters believe that [music] 2026 is the year the IDE died. Please join me in welcoming to the stage engineering leader at Source Graph and AMP, Steve YG, and author and researcher at IT Revolution, Jean Kim. [music] Hey everybody. Um, really happy to be here. I'm going to be talking the first half. Co-author here, Jean Kim, is going to talk second half. All right. Looking forward to it. Cheers. All right. Right. Today I'm going to Well, we're going to talk real fast. This time is going to go down fast. Uh I'm going to talk to you about what tools look like next year. Last year I was talking to you all about chat and everybody ignored me and now everybody's using chat this year and it's like we're gonna we're going to fix that right now. All right. So, here's what it's looked like. I'm going to tell you right now, everyone's in love with Cloud Code. There's probably 40 competitors out there. Cloud Code ain't it. completions wasn't it. I love cloud code. I use it 14 hours a day. I mean, come on. But it ain't it. Developers aren't adopting it. I'm gonna talk about why in this talk. I'm going to talk about what you can do about it and what what to look forward to. But the reason is they're too hard. Okay. Uh cognitive overhead. Uh they lie, cheat, and steal. Gene and I talk a lot about this in our book, all the different ways that they can lie, cheat, and steal. And uh most devs just don't like this. I have come to understand that claude code is very much like a drill or a saw, an electric one, right? How much damage can you do as an untrained person with a drill, right? Or a saw. Yeah. How much damage can you do as an untrained engineer with clawed code? It's real similar. Yeah. You can cut your foot off, but you can also be really, really skilled with it and do really precision work, right? like a craftsman. The problem is software is infinitely large. Our ambition is infinitely large. And so the analogy that I want to share with you is next year will be the year from moving from saws and drills to CNC machines. A CNC machine, you strap a drill on and you give it coordinates and it moves it around and you're very precise, right? We've been doing this for centuries and we're not going to stop this year. One thing I hear people say is, "Well, the models are plateaued." This is real common. Your engineers are probably saying this, okay, even if they plateaued, we have still discovered steam and electricity and it's going to take us a little time to harness it. But it's strictly an engineering problem at this point. All code within a year, year and a half will be written by giant grinding machines overseen by engineers who no longer actually look at the code directly anymore. Weird new world. That is where we are going. Oh my gosh. Yep. This this slide. So Gene and I talked to Andrew Glover who I don't know is he here from OpenAI and he said that they have this incredible dichotomy unfolding at OpenAI where you know some percentage of their engineers are using codecs and then some other percentage a larger percentage are not using codecs and the difference in productivity is so staggering that they're having now alarms going off at performance review time because how do you compare these these two engineers who are the same level same title same everything and one of them is 10 times as productive as the other one by any measure. And the answer is they're freaking out. They may have to fire 50% of their engineers. And this is unfolding at other companies, too. Who is refusing it? It's the senior and staff engineers. How many minutes are we at? >> Eight [clears throat] minutes. >> We're perfect. This is just like what happened to the Swiss mechanical watch industry over a couple of Well, it was built up for a couple of centuries and then courts killed it, you know, within a couple of years. And what happened was the craftsmen were doing the same thing our staff engineers are doing today. No, this is cheap. That's word for word, right? That's what they say. All right. I didn't know where to put this slide. This is this is Claude's view of what next year looks like. And I I was just like, what do you think it's going to look like? And it actually does kind of look like this. Most of the words will be spelled correctly in in next year. But this is a lot prettier than cloud code. Yeah, this is what it has to look like. Some form of a UI, not an IDE. This is the new IDE. Okay. And people are building it. In fact, I think the company that's the furthest along in this is Replet, who just talked to you. I think it's amazing what they're doing. It's absolutely bravo, right? We should not be all chasing tail lights and building command line interfaces anymore. All right. and and more importantly, Cloud Code and all of its, you know, competitors, they're all doing it wrong because they're building the world's biggest ant. Okay, this is my my buddy Brennan Hopper at Commonwealth Bank of Australia, right? He's like, "Nature builds ant swarms and Claude Code built this huge muscular ant that's just going to bite you in half and take all your resources, right? I mean, it's a serious problem, right? If I say please analyze this codebase, I, you know, go to the expensive model." If I say, "Is my git ignore file still there?" I've also gone to the expensive model, right? Everything that you say goes to the expensive model. So, what's going to happen? Whoa. What happened? Oh gosh, my slides are all messed up now. Can you guys see them? >> No. >> Oh, this always happens to me, man. There's something going on. All right. So, I thought of a really cool analogy called the diver the diver metaphor, which is your context window is like an oxygen tank. Okay. This is why these things are fundamentally wrong because you're sending a diver down into your codebase underwater to swim around and take care of stuff for you. One diver and we're like, we're going to give him a bigger tank. One million tokens. He's still going to run out of oxygen. Like you don't, right? You should send a product manager diver down first and then a coding diver, right? And then a review diver and a test diver and a get merge diver, etc. Right? Nobody's doing this. Everyone's building a bigger diver. I don't know my slides are all messed up. My my my talk is almost done. But um what we do is as engineers task decomposition, successive refinement, components, black boxes. This is how it's going to be built in the future. And it's going to be built with lots and lots of agents, not just one agent. All right. Until then, I think we're out of time, but so until then, learn cloud code. Give up your IDE. Swix told me he wants some hot take, so I'll give you one. If you're using an IDE starting on, I'll give you till January 1st. You're a bad engineer. There's your hot take. All right, folks. [applause] All right, cheers. Well, that that was actually my talk. Um [clears throat] uh uh learn coding agents and oh yeah, then there's this guy. Speaking of bad engineers, so this is this is Jordan Huard uh who uh who's at Nvidia and he tweeted or LinkedIn a really nice post on how to get the most out of agents and this guy responded with this, right? This is everyone in your or this is 60% of your org right here. This guy's not an outlier. Okay, the backlash is very real against this. Yeah. And this is going to be a problem I'm not going to I'm not going to share with you. I don't have time to share how to fix it, but it's something you should be aware of. And anyway, I'm going to turn it over to my co-author, Jean. We had a lot to talk about. He's got a lot to go. So, let's turn it over to Jean. >> Yeah. Thank you, Steve. >> Hi, buddy. [applause] >> Yeah. By the way, um I have let me start off by introducing myself and then I'm going to share a little bit about like what it's been working like uh what's been like working with Steve on the vibe coding book. Uh and so just a little bit about myself. I've had the privilege of studying high performing technology organizations for 26 years. And that was a journey that started when I was a technical founder uh of a company called Tripwire. I was there for 13 years. But our mission was really to understand these amazing high performing technology organizations. They had the best project due date performance and development, the best operational reliability and stability and also the best posture of compliance uh security and compliance. So we wanted to understand how did those amazing organizations make their good to great transformation. So we got understand how to how other organizations replicate those amazing outcomes. And so you can imagine in that 26 year journey there are many surprises. Among the biggest surprise was how it took me into the middle of the DevOps movement which is so uh amazing because it reshaped technology organizations. you know, it changed how test and operations worked, information security. Um, and I thought that would be the most exciting adventure I'd be on in my career until I met Steve Yaggi in person. And so, I've admired his work for over 11 years. And so, some of you may have read this memo of Jeff Bezos's most audacious memo of how in early 2000s they transformed from a gigantic monolith that coupled 3,500 engineers together, so none of them had independent action. And uh he talked about how all teams must henceforth communicate and coordinate only through APIs. No back doors allowed. Right? Uh anyone who doesn't do this will be fired. Thank you and have a nice day. And the amazing person who chronicled says number seven is obviously a joke because Bezos doesn't care whether you have a good day or not. And this is actually enforced by Amazon CIO then Rick Del. And so it turns out this memo that I've been quoting for 11 years uh was written by Steve Yaggi uh which was meant to be a private uh memo on Google+ which was made public which landed him on the front page of the Wall Street Journal. Um and so I finally met him in uh June and it turns out that we had many things in common uh but one of them was this uh love of AI and this sense that AI was going to shape coding from underneath us. And so one of our beliefs is that uh the AI will reshape technology organizations you know maybe even a 100 times larger than what agile cloud CI/CD and mobile did you know 10 years ago. Um and that these technology breakthroughs not just reshape organizations but they reshape the entire economy. The entire economy rearranges itself to take advantages of these you know wild new better ways of uh uh producing things. and and uh so over the last year and a half we've had a chance to look at these case studies I think give us a glimpse of what these uh what the shape of technology organizations look like and so I'm going to uh share with that what we've learned but here's maybe a hint so some of you may know the work of Aiden Cochroft he was a cloud architect at Netflix right he was what who drove uh the uh entire Netflix infrastructure from a data center uh back in 2009 to running entirely in the AWS cloud and so he wrote uh some months ago in 2011 some people got very upset in uh infrastructure and operations because they called it noopops, right? And everyone laughed back then, but he said, "Oh, don't, you know, it's happening again. This time it might be called no dev, right?" Not so funny now, right? So, it's it's interesting, right? Because we heard this amazing presentation from Zapier about like how support ships and turns out designers are shipping, UX is shipping, right? Anyone who's been frustrated by developers uh who, you know, say get in line and you have to wait quarters or years or maybe never, right, are now suddenly in a position where you can actually vibe code your own features into production, right? And that reshapes technology organizations and reshapes you know potentially the entire economy. And so uh Steve and I we've had the privilege of watching what happens you know when we change uh you know the way we uh deploy right it wasn't so long ago and 10 years ago uh I wrote a book called the Phoenix project where it was all about the catastrophic deployment. Would you believe uh that it was you know 10 years ago 15 years ago most organizations shipped once a year right and so I got to work on a project called the state of DevOps research. It was a cross population study that spanned 36,000 respondents uh from 2013 to 2019. And what we found uh this was Dr. Nicole Forsgrren and Judge Humble. Um and what we found was that these high performers ship multiple times a day, right? They can ship in one hour or less. And you know back in 2009, people thought, oh my gosh, multiple deployments per day, right? That's reckless and irresponsible, maybe even immoral, right? What sort of maniac would deploy multiple times a day, right? And yet it's very common place these days. In fact, if you want to have great reliability profiles, you want to have short meantime prepare, you have to do smaller deployments more frequently. And I think we're now seeing these kind of case studies that show that this better way of coding, right, where you don't type in code by hand might be, you know, just a vastly better way uh to create value. And so our definition of vibe coding that we put into the uh vibe coding book was that it's basically anything where you don't type in code by hand. And so for some of those of you who don't understand that, that's like sort of a uh typing an ID hunched over, right? And you're actually moving your fingers, right? That's sort of like how some people go into a dark room to develop photographs, right? Believe it or not, some people still do that. Um and and what I that's a great definition that we uh loved until uh Dar Amade uh CEO and co-founder of um Anthropic, he gave us an even better definition, right? The B coding is really the iterative conversation uh that results in AI writing your code. And he said it's on one hand a beautiful term because it evokes this different way of coding but he said it's also somewhat misleading because it sounds jokey right uh but he said you know at anthropic there's no other game in town right and I just thought that was just a beautiful way to evoke you know how important uh vibe coding is uh this is Dr. Eric Meyer um you he's probably considered one of the greatest programming language designers of all time. Uh he was part of Visual Basic, C link, Haskell. He created the hack programming language uh that migrated millions of lines of code at meta you know within a year uh bringing static type checking to a bunch of PHP programmers and he said we are probably going to be the last generation of developers uh to write code by hand. So let's have fun doing it. Um, so one of the things that uh when uh Steve and I started working on the book last November was uh watching him spend hundreds of dollars a day on coding agents uh and just seemed so strange right um you know and so he's maxing out not just over you know the uh the monthly subscriptions right but he's actually you know going way above and beyond that and yet uh you know things that we're hearing now is that as an engineer part of my job is that I need to be spending as much on tokens per day as my salary right so you know that think about like $500 to $1,000 a day, right? Because this is the mechanical advantage, the cognitive advantage that these tools are giving us, right? And as an engineer, right? I'm going to challenge myself, you know, to get that kind of value to deliver value to people who matter. Um, and so in the book, we talk about, you know, why people would do this, right? And [snorts] the, uh, acronym we came up with FAFO, right? Uh, the most obvious one is F for faster, right? Yeah, that's obviously true, but I think it's the most superficial and um part of why we do this because uh the second one is it lets us do more ambitious things, right? Uh the impossible becomes possible. Uh so that's one end of the spectrum. On the other end of the spectrum, you know, the uh the tedious and small tasks become free. One of the things I uh the uh interview of the cloud code team that I just loved was uh I think it was Katherine she said um uh one of the things we've noticed is that you know when customer issues come up uh instead of putting them on a jur backlog and you know arguing it about it in the grooming sessions and so forth right we just fix it on the spot right and ship to production or whatever um you know within 30 minutes right and so yes it gets recorded but you know that whole sort of coordination cost you know just disappears right so again the impossible becomes possible right and uh the annoying things just become free. The second a is uh um you know the ability to do things alone or more autonomously, right? And so um you know there's really two coordination costs are being alleviated here. One is, you know, if you ever have to wait for a developer or a team of developers, you know, to do what you need to do, right? You have to communicate and coordinate and synchronize and prioritize and cajul and escalate, you know, do all sorts of things to get them to care about the problem just as much as you do, right? And, you know, now, you know, with these amazing new miraculous technologies, you can do them by yourself, right? So, that's one coordination uh tax. The other one is even if you get someone to uh care about a problem as much as you, uh they can't read your mind, right? Right? And what we're finding is that these LLM are just amazing intermediation vehicles, right? Um, you know, just through an LLM, you can coordinate with other functional specialties, right? Through a markdown file, right? And that's not the end, right? But it's just this amazing way uh to have these high band coordination so that you can essentially read each other's minds, you know, because shared outcomes require shared goals and shared understanding. The second F is fun, right? Is that Steve says, vibe coding is addictive. This is so true. true. I mean, I cannot I think what I love about the book is that it's a story about two guys who both thought their best days of coding were behind them, right? And found that, you know, it's entirely the opposite. Um, and I've had so much fun and uh, you know, I'm having to force myself to go to sleep at night because otherwise I'd be up till 2 or 3 in the morning every night. Uh, and you know, so it's not all great, but it certainly beats be boring or tedious or, you know, horrible. And then optionality. You know, one of the things that uh I love about Swix is that he has a shared love of creating option value. And he told us last night that option value is also important for poker players, right? Because you never want to paint yourself in a corner. So option value is um one of the biggest creators of economic value, right? Modularity, the reason why it's so powerful is because it creates option value. Uh and so just the fact that you can have so many more swings at bat, you can do so many more parallel experiments, right? This is what vibe coding allows. So this is gives us confidence that you know this is not just uh this is a very powerful tool. Um here's the quote from Andy Glover that Steve Yaggi said is that you know as um for people who have this aha moment and in position uh you know I think the instinct is how do we elevate everyone's productivity to be as productive as you are now being um you know that since you've had your aha moment. So uh let me share with you maybe some of our top kind of uh exciting case studies that kind of give us a hint of the future. So uh I've run into this conference called the enterprise technology leadership summit for uh 11 years now and Swix we had uh the honor of having Swix there talking about the rise of the AI engineer just this amazing prognostication. Uh this year we had a series of amazing uh case studies. One was uh Bruno Pasos. He spoke this year uh last year at this conference and he presented on uh their in their evolving experiment to elevate developer productivity across 3,000 developers. Um and this is at Booking.com the world's largest travel agency and uh they're finding that they're getting double digit increase in productivity right uh mergers are going in quicker peer review times are uh smaller and and so forth right and so that's just we feel like that's a incomplete view of uh what people are achieving uh this is Shri Balakrishnan uh he was head of product and technology at uh travelopia uh so they're a 1.5 billion a year uh travel company and one of the things that uh he said is that uh you know they were able to replace a legacy application uh in six weeks with a pair of uh with a very small team. In fact, one of his conclusions is that before we would need a team of eight people to do something meaningful, right? Six developers, a UX person and a product owner. And he said maybe these days it might be two, a developer and you know a a domain expert. In other words, as Kent Beck said, a person with a problem and a person who can solve it, right? Maybe maybe a pair of those teams, right? And so that's going to reshape I think you know how they can go further and faster. Uh so again maybe just a hint of what teams will look like in the future. This is the one that excites me most. This is Dr. Topra Pal. Uh he helped drive the DevOps movement at Capital One. Um and he's now at uh uh Fidelity. And so um among other things he owns an application uh that is the application you go to asks which applications you know the 25,000 applications there have log 4j right and uh it's his team and he's had this vision of what this application should look like uh but every time he asked like can can we build it his team would say it would take about five months right and we'd hire need to hire a a front-end person and he got so frustrated that he spent five days just vibe coding it by himself right uh you know directly accessing read only the neofor 4J uh database um and put it into production, right? And so I think we're seeing a world where um you know leaders even leaders with their own teams are frustrated saying hey I can do this uh can I do this better myself not better just can I prove that it can be done and uh by the way what happened afterwards um he was looking around who can help me maintain my application in production and all the senior engineers like not me. So enter uh Swathy the most junior engineer on the team uh who is helping maintain his application and probably outarning you know everybody in the organization uh and interestingly uh he he's also getting more headcount because the number of consumers of this application just increased by 10fold right so who saw that coming right um so uh here's John Rouser he's senior director of engineering at Cisco security and he convinces SVP to um require 100 of the top leaders inside of Cisco security to vibe code one feature into production in a quarter that ended last month, right? And so um you know we're actually getting a chance to be able to survey those people, right? Who finished? Uh you know uh how many completed, didn't complete, partially completed, etc. And of those who completed, right, what was what aha moment did they have? Uh as a leader, what's the magnitude and direction of what they want to do? And so we're going to go in and study that. And I just I my prediction is that we're going to see parts of that organization get reshaped as leaders realize kind of what's possible. Everything from strategy to processes and so forth. And so let me just share with you one um you know thing that really excites me which is uh I got a chance to uh get back into the state of DevOps research the Dora study with uh u the Google cloud team and one of the things that didn't make into the report that I just found really exciting was around this. It was like what how would much do people trust AI? And we're using a very strange definition of trust which is to what degree can I predict how the other party will act and react, right? Because the more you trust the other party, right, you can give them bigger requests, you can use fewer words, you have less need for feedback, right? It's the whole notion of finger spits and fuel, right? Like how many of the 10,000 hours that requires to be good at anything have you used to get good at AI? And one of the stunning findings was that it's this line. So on the x- axis is how long have you been using AI tools? Y is how much do you trust it, right? And the longer you use AI, right, the more you trust it, right? So every every person who says I tried it and it's terrible at coding, right? On what basis did they make that conclusion after maybe using for an hour or two? And what this shows us is that uh you know it requires practice, right? And this is probably a teachable skill. Um so length of time on the x-axis is a very incomplete expression, right? It's like frequency and intensity and how many hours but it's the signal there. So it just shows that uh you know part of your job is to help other people have the aha moment and then help them you practice right so they get very very good at it so they can use every one of these amazing technologies to achieve their goals. So uh I'll leave you with one last kind of vision. Steve and I we did a vibe coding workshop for leaders um back six weeks ago. And what was amazing to me was in the three hours we had a 100% completion rate. Everyone built something. You know, they built a data visualization tool. In fact, uh one person uh built a an iOS app and another person actually got it into the review queue in the Apple iOS app store, right? Which is which is absolutely astonishing. Uh and here's a guy named Roger Safner. He said, "I used to be a C MVP way back in the day. I haven't coded in 15 years." uh and he's showing off an app that helped him automate the process of getting checked in to Southwest Airlines until the bot detection tools come off. But look at look at the expression on his face. And so I think uh what we're seeing is like what happens when support ships right and support codes and ships when leaders code and ship. There's no doubt in my mind that this will reshape uh technology organizations. If you're one of those, Stephen and I want to talk to you, right? Because you are on the frontier of something really really important. I'll share with you a couple quotes. Here's from a technology leader. When I told my team that I wrote an app that, you know, an AI wrote 60,000 lines of code and I haven't looked at any of it, they all looked at me as if they wished I were dead. Um, we've uh we've had these stupid problems in legacy applications. I've been there for over a decade. We got a group of senior engineers together. We used AI to generate a fix and we submitted PR and the team accepted it. Right? Unlike the time when they said it was AI generated and they rejected it as AI slop, right? So this is maybe happening in your organizations. Um our code velocity is so high. Uh we've concluded that we can only have one engineer per repo, right? Because of merge conflicts, right? This is we haven't figured out the coordination cost uh mechanism yet. And so like all these were some of the lessons that went into the vibe coding book. Thank you for everyone who were at the signing yesterday. And uh if you're interested in any of the talks we referenced and excerpts of our book in uh and basically uh all the links that uh are in this presentation, just send an email to real gene cams.com subject line vibe and you'll get an automated response in a minute or two. So with that, Steve and I thank you for time and we were around all week. >> Thanks all. [applause] [music] [music] Ladies and gentlemen, please welcome back to the stage, Alex Lieberman. [music] >> Let's give it up again for Steven Jean and also the rest of the speakers from the morning session. Whether you are watching in person or on YouTube or on the AIE site, you've been breaking a mental sweat. So, we are going to take a 30 minute break. Get some grub, get some coffee, recharge, and we will see you back here at 11. Thanks everyone. Appreciate it. >> [applause] >> Two flames lit the darkness, burning side by [music and singing] side. Both sworn to creation. Both relentless in their stride. One walked through the mountains, [music] one soared across the void. Both chasing the horizon of the worlds they would deploy. [music] But the path is not a straight line. And the future is not flat. Some roads bend through [music] spaceime and some break [singing] on impact. Effort is a kingdom. [music] Leverage is the key. One builds the throne by hand. One shapes [music] reality. There is a curvature of time, not [music] a race, not a throne, but a shift in the dimension of how progress becomes known. When [music] the universe is bending to the will inside the mind, you don't win by moving faster. You win by [music] breaing time. Black holes of the past try to drag the present [music] down. Systems built on dust, [singing] wearing yesterday's [music] crown. Some are pulled beneath them, [singing] fighting gravity alone. Others learn to map the edges and escape events horizons. Not all power [music] is struggle. [singing] Not all mastery is pain. The ones who change direction rewrite the laws of the game. You can lose your life in [music and singing] labor or an impact that compounds. Every second can be linear or worth a thousand rounds. There is a curvature of time. >> [music] >> Not a race, not a throne, but a shift in the dimension of how [music] progress becomes known. When the universe is bending to the will inside the mind, you don't win by moving [music] faster. You win by rediding [music] time. [singing] The future isn't [music] distant. It accelerates [singing] for those who wield the tools of power instead of fighting with their go. Mastery is leverage, [music] not a sentence carved in stone. The horizon does not move [singing] unless you. [music] There is a curvature of time [music] where the present multiplies where a lifetime holds a legacy that no clock [music] can quantify. Not by force, not by fury, but by evolution. We become eternal beings. When [music] we synchronize with Progress is speed. [music] Progress is direction. [music] >> [music] >> Footstep fade, but they never die. Shadows stretch across the sky. A whisper grows into a [singing] roar. Do you [music] feel it? Do you want more? Every heartbeat stone [music] in the street. Ripples [music and singing] chasing an endless dream. What we do in life [music] echoes in eternity. Every spark ignit [music] [music] [music] >> [music] >> Reach out to the [singing] empty air. Trace the stars like they're waiting there. [music] The clock ticks but the moment stays forever starts in a single [singing] prayer. Heat. [music] Heat. >> [music] >> Every heartbeat [music] stone [singing] in the stream ripples [music and singing] chasing an endless dream. What we do in life echoes in eternity. There is night a fire that will never see [music] [music] what we do. [music] Heat up [music] [music] [music] crawl where the light won't stay. The echo whispers don't [music] look away. Heartbeat racing louder than my doubt. Scream inside. I can't let [music and singing] out. But I won't fall. I won't drown in the storm all around. Heat. Heat. Heat. [music] [music] There is my breaking [music] the chain. [music] >> [music] >> Cold winds how but they won't define me. The cracks in my soul let the light find [music] me. Every step I take the ground fights back. But I'm the fire on the spark. I'm the attack. [music] I won't freeze. I won't fade. Through the chaos I've remain, [music] I won't let it win. It creeps like a ghost, but I keep it within. Fear is a killer. I'm breaking [music] the chain. No for the dark. [music] [music] >> [music] [music] [singing] [music] >> I hear the static in the [singing] night. It calls. A whisper [music] rising, breaking through the walls. [music] Electric echoes in my veins they home. Chasing the shadows where [music] the wild ones run. The air is thin. The weight is [music and singing] gone. Close your eyes. The past is done. Free your mind. [music] Break the chain on the floor. [music] [music] >> [music] [music] >> Waves come crash against the sky. presence of a dream. [music] I see them inside the [music] story. We don't need a weather with a speed. [music] [music] The air is thin. The weight is gone. Close your eyes. The [singing] past is done. >> [singing] >> Free your mind. [music] Let it go. Let it break the chain. Leave it on the floor. Heat. Heat. Heat. [music] [music] Heat >> [music] [music] >> up here. [music] >> [music] [music] [music] [music] >> They said The stars don't change [singing] their course, but I've been running from their force. A mirror crack, but still it [music and singing] shows. The fire is mine is mine to hold. I hear the echo and call [music] my name. But I'm not the shadow. [music] Not the same. You are who you choose to be. The stars [music] of the history. Every breath, every heart be free. [music] Are we choose to be [music] [music] >> [music] >> Road [music] of thorns, a sky of glass. I've walked through both. I've let them [singing] pass. The weight [music] is heavy, but I've grown. The voice I hear is now mine. I see [music] the light [music] change. I can say [music] you are [music] history. Every breath, every heart be [music] I see the lines [music and singing] drawn in the sand. A map of chaos in my hand. Every step, a [music] choice, every beat of voice. The clock ticks louder. But I stand. [music] Close my eyes [singing] and feel it burn. Every [music] failure, every turn. It's fue for the fire inside. [music] Execute the vision. Vision. Heat. Heat. Heat. [music] Heat. Heat. >> [singing] >> The air is heavy, it doesn't break. A thousand whispers in it wake. Each [music] breath a climb, each fall a sign, but I am more than I can take. [music] Close my eyes and feel it burn. Every failure, every turn is fueled for the fire inside. [music] >> [music] >> executes. [music] [music] This is Yeah. [music] [music] The clock keeps ticking loud and clear. Shadows fade, [music] but linger near. I've been waiting for the light. [music] Holding breath through endless night. [music] The air is shifting. Feel it break. A single spark is all it [singing] takes. It starts today. It starts today. No more [music] running. No delay. The world is spinning my hands. It [music] starts today. It starts today. Footsteps echo on the stone. Every choice I made my own. >> [music] >> I see the dawn breaking through a thousand colors chasing. [music] The air is shifting. [music] Feel it rain. A single spark is all. It starts to take. Heat. Heat. Heat. Yeah. [music] Heat. [music] Heat. [music] [music] Heat. Yeah. [music] [music] >> [music] >> Fire in my chest is burning loud. Ashes fall, but I won't bow. [music] I've walked [singing] through the smoke. I've tasted the scars. Each step I've taken little stars. [music] Let it blaze. Let it break. Feel the ground. I'm forced in [singing] flame. [music] I'm falling heat. The pain [music] again from Heat. Heat. Heat. [music] >> [music] [music] >> The winds they how but I stand still. [music] The mountains crumble up my will. I'm not the same I was before. A shadow of fear. I keep let it blaze. [music and singing] Let it break. Feel the cracks. The ground will shake. I'm forged in flame. [music] Heat. Heat. Heat. [music] Heat. Heat. Heat. [music] >> [music] [music] >> Heat up [music] here. >> [music] >> A whisper breaks the silent night. Shadows melt in the growing light. Time bends and twists. We feel it start a pulse [music] to spark [singing] an open heart. Do you feel it? Feel it right. The wayless fire in the sky [music] has come. Run into the sun. No share. We're free. [music] We're electric. [music] Stars collide. But we stay one. The past dissolves [music] like waves on storm. We stand together, not alone. [singing] >> [singing] >> Heat. Heat. Here [music] it is sing [music] the everything. A new age has come. We're running to the [music] sun. No chains, no walls, just there with me. [music] Heat. Heat. [music] Heat. [music] Heat. Heat. Heat. [music] Heat [music] [music] >> [music] >> Heat up [music] here. [music] Heat up here. [music] Heat up >> [music] >> here. >> [music] >> Heat up [music] Heat >> [music] >> up here. Heat. [music] [music] Heat. [music] >> [music] [music] [music] >> Heat. Heat. Heat. Heat. [music] [music] [music] Heat. [music] [music] Hey, Heat. [music] Heat. Heat. [music] [music] >> [music] [music] >> Heat. Heat. Heat. [music] [music] Heat. >> [music] [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. [music] Heat. Heat. N. [music] Heat. Heat. [music] [music] [music] Ladies and gentlemen, please welcome back to the stage, Alex Lieberman. Let's uh keep it going for the morning speakers. [music] Amazing job from everyone who spoke earlier. I asked before who thought they came from the furthest place on Earth to to watch this in person. And where's New Zealand again? I don't know. New Zealand. There we go. >> From Bulgaria. >> Bulgaria. Still, I think closer than New Zealand, but still very far. >> Australia via New Zealand. >> Australia via New Zealand. We just got someone to one up New Zealand. I have another quick question since we just came back from a coffee break. Also, if you're watching live on YouTube, you can comment. Who thinks they're the most caffeinated right now? Who thinks they're the most caffeinated in the room? How many cups of coffee? I'm four right now. Anyone beat four? Oh, we got four. We got a five, maybe. Wow. Impressive. Well, we are back for an incredible next block of sessions. We're going to be covering everything from future proofing uh coding agents to moving away from agile, how to quantify AI ROI in software engineering, the state of AI code quality, hype versus reality and Miniax M2. But I am so excited to kick off this next block of talks with OpenAI. Please welcome to the stage Bill Chen and Brian Fioa from the applied AI team at OpenAI. Let's hear it for them. [applause] [music] [music] Hello everyone. Um, today we'll be talking about how to build coding agents. And uh, I'm Bill. I work on the applied AI startups team at OpenAI. And I'm Brian. I work with Bill on the OpenAI startups team. >> And we specifically, uh, focus on, uh, building coding agents here at OpenAI. Um, yeah. So, why are we talk giving this talk? Why why are we, you know, uh, talking about coding agents? Well, it's really quite interesting because it's been booming for the the the past year actually. It's just if you think about it, it's not that much time ago, like only been a year or so. The ground keeps shifting really under the u harness on on the coding agents. But if you think about it, it's really like why it's interesting is because it's really a signal on how close we are to AGI. Software engineering can be set as a universal medium for problem solving. But because the ground is shifting so fast, uh, we ha kept having to rebuild the agent on top of the model whenever a model is released. And today we're going to talk a little bit about how we might be able to get around that. So here's what we're going to go over today. We'll start with the anatomy of a coding agent, especially going into the details of models and harnesses and how they work together. We'll share some lessons that we learned from putting them together ourselves. And we're specifically going to talk about codeex here, which is our own coding agent. We'll talk a little bit about emerging patterns that we're seeing from all of you for using agents like codeex in your own products. And lastly, we'll talk a little bit about what to expect from codeex in the future so that you can build along with us if you want to. To start, let's talk a little bit about what makes a coding agent an agent as a whole. Um, it really is quite simple. I think, you know, people kind of over complicate things a little bit these days. It's made out of three parts. It's a user interface. It has a model. It's a harness, right? Uh, the interface quite self-explanatory. It could be a computer uh like a CLI tool or it could be a uh integrated developer environment. could be also cloud or background agent. Um models also very quite self-explanatory are you know the things like the latest and greatest the GPD 5.1 codeex uh max that we just released yesterday uh or the GPD 5.1 series of models or other uh models from other providers as well. And the harness uh is a little bit more of an interesting part. This is the part that directly interacts with the model. Uh in the most reductive way, you can sort of think of it as a collection of prompts and tools combined in a core agent loop which provides input and outputs uh from a model. Uh the last part will be our focus for today. As touched on a bit earlier, coding is one of the most active frontiers in applied AI and uh how models are constantly getting released and we're not making the problem uh easier for everybody is that people have to constantly adapt uh the agents to the new models. So, um, Bill's done a great job of giving us an overview of coding agents, what they're made up of. So, let's zoom in a little bit on the harness. Um, turns out that's a little bit tricky. So, what is a harness? A harness is really the interface layer to the model. It's the surface area the model uses to talk to users and the code and perform actions with tools. It's made up of all of the pieces that the model needs to work over many turns, call tools, and and really write code for you and interpret what the user is actually asking. [snorts] Um, for some, the harness might actually be the special sauce of the product. But as we're going to go into a little bit more, it's really challenging work to build a good harness. And we'll talk about how we did that. So let's see what are some of these challenges. Um just to name a few, AV is one. Um your [laughter] um your brand new innovative custom tool that you're giving to your agent might not actually be something the model is using is used to using. It may not have ever seen that tool before in trading. And even if it is, you need to spend time tuning your prompt to that particular model and the habits that it comes with. And new models are coming out all the time. What about latency? Like does the model take a while to think about certain things? Which things do you prompt it not to? How do you expose the UX of what a thinking model is doing while it's thinking? Is it communicating with you while it's thinking or do you have to summarize it? Managing the context window and compaction can be really challenging. We just launched Codeex Max that does that out of the box for you. you don't have to worry about compaction and context window management. It's really hard to do. Um and so if you were to do it yourself, have fun. Um and then also like the APIs keep changing, right? So we have completions, we have responses, we have whatever else is coming in the future. What does the model know how to use and get to get the most intelligence out of the box? And so this is the interesting part. Fitting a model into a harness takes a lot of prompting. It turns out that how the model is trained has side effects. I like to think about it this way. Intelligence plus habit. Intelligence. What is the model good at? What languages does it know really well? What is what is its capabilities in terms of like how well it can write code in certain frameworks? And then what habits did it learn to to use to solve those problems? We've trained our models to have habits of like planning a solution, looking around, gathering context, and and thinking about a problem before diving in and writing code, and then testing its work at the end. Developing a feel for these habits is how you become a good prompt engineer. If you don't instruct the model in ways that it's familiar with, you can have problems. We saw this when we launched GPD5. A lot of people who weren't used to using our models encoding tried to take prompts that existed for other models and put them into their harness and have GPD5 follow those instructions. And it turned out that we taught our model to do some of the things that the other models didn't really do out of the box. And so when they were prompting them to look really hard at the context and like examine every single file before making a a code edit, our model was being very kind of thorough about that and it was taking a really long time and they weren't seeing the best performance. And so we figured out that if you let the model just do the behaviors that it's used to and don't overprompt it, it'll actually perform really better. We found out by asking I was literally like, "Hey, like I like the solution, but it took you a long time to get there. What can I do differently in your instructions to help you get there faster next time?" And literally it said, "Uh, you're telling me to go look at everything and I don't really need to. So that's what's taking forever." And so you can actually see the advantages of building both the model and the harness together because you just like know all of that while you're building it. And that's why codeex is both a model and a harness combined. So let's dig deeper into codeex and what it can actually do. So we built codeex to be an agent for everywhere that you code. It's a VS code plugin. It's a CLI. You can call it in the cloud from the VS Code plugin or from chat GBT from your phone. Um, and it's very basic. You can use it to turn your specs into runnable code starting from a prompt. Um, having a plan. It navigates your repo to edit files. It runs commands, executes tasks, and you can call it from Slack or you can have it review PRs and GitHub. So, all of the things that you would expect. And that means that the that codec um the harness of codeex needs to be able to do a lot of really complex things. Uh when I talked to a member of the codeex team about this slide and what should be on it, he was like it's way harder than you think. [laughter] You have to manage parallel tool calls like thread merging and all of the things involved in that. Think about all of the security considerations you have with sandboxing, prompt forwarding, permissions, uh, port management. Um, compaction is a whole thing. Um, and doing that well is really complex. When do you trigger compaction? When do you reingject? How do you worry about uh cache optimization during that MCP, right? Like all of the thing the uh plumbing you have to build for MCP support into the harness. Uh, and then not even mentioning images and what's the resolution that you need to compress them to to send them to the model. All this all of this is like work that you have to do if you're going to build this from scratch and keep it updated as new features come online. So since we've bundled all of these features together for you in an agent that can safely write its own tools to solve new problems that it encounters. Oops. Uh we actually have here uh a computer use agent for the terminal. Wow, that sounds quite a bit powerful than just plain old coding agent, doesn't it? Um but just think about it again. Well, before browser and graphic user interface was a thing, wasn't that how we always operate a computer? they're writing code and chaining them together in a command line interface. So that means if you can express your tasks in command line as well as files tasks codeex will be able to know what to do. Um the example is I like to use codeex to organize a lot of the photos from my desktop into a folder and that's a very simple use case but what it can also do is it can analyze huge amounts of CSV files inside of a folder uh doing data analysis it does not have to be a coding task and if it can be accomplished by running tools from command line you can use codeex. So now that we see codeex is such a cool harness, um I want to also share a little bit about how you can use it to build your own agents and what you can do is you can use codeex the agent inside of your own agent. Um how does that work? Well, if you want to build uh a coding uh the next coding startup, we don't really have all the answers, but we do have a few patterns uh that we thought uh might help you having worked with some of the top coding customers uh like cursor and VS code. Uh one of those patterns is uh harness becoming the new abstraction layer. The benefits of this is quite obvious. Um, you no longer have to care about prioritize uh optimizing the prompt and tools with every mo model upgrade. >> But um does that mean you're just building a wrapper? >> Well, I disagree with that take. [snorts] I disagree with disagreeing with my colleague here. Um just like how building rappers on top of models I think is really reductive on uh on the whole value prop of the infrastructure layer. Sorry, I used to be a VC. [laughter] >> Focusing most of your efforts on differentiating your product is what this pattern allows you to do. And that's where most of the value lies. Exactly. Okay. So, let's look at some of these patterns that we've seen and actually have helped our customers build um along with them. Codeex is an SDK. It can be called through a TypeScript library. You can call it programmatically in a Python exec. There's a GitHub action that you can plug into to have it merge merge conflicts on PRs that everybody hates doing. Then uh you can also add it to the agents SDK and give it MCP connectors back to your product. So now you have an agent. I like to say we started with chat bots that you can talk to. Then we gave the chat bots tools to use. And then now you can give uh a tool to your chatbot that can make other tools that it doesn't have. And so now you can actually build out enterprise software that does it that writes its own plug-in connectors to the API level for each customer on the spot. That's something that a professional services team used to have to do. Um, so you have fully customizable software that can now talk back to itself. Um, I made a conbon board for Devday that can actually fix its own bugs. Um, it's pretty fun. And then lastly, um, you can actually do something like what Zed has done. They have just decided to wrap codeex inside of a layer and give it an interface to the IDE for talking back and forth for the user and making code edits. And now they don't actually have to do all the work of staying on top of all of the things that we're good at doing and they can focus on building like the best code editor. Uh so our top coding partners like GitHub has used this uh to great effect and well uh we've created an SDK uh for it that they used to directly integrate uh with codeex. You can also use the SDK to uh control codeex as part of your CI/CD pipeline as well as use it as an agent that directly interacts with your own agent as well. Uh if you really want to customize the agent layer, you can do it too. As an example of this, we worked with closely with the cursor team to get the best performance out of the codecs. The model, not the agent. We're bad at naming things. The model is different from the agent. They did so by aligning their tools to be in distribution with how the model is trained. And they did so by aligning uh their harness with our open- source uh implementation of codeex CLI. All of this is publicly available. Uh you can fork the repo, you can use our source code, you can use it. Uh go nuts. So what does the future hold for codeex? It hasn't even been out for a year. Um and especially with the lo la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la launch of codeex match yesterday like things are really changing fast. Uh it's the fastest growing model in usage now serving dozens of trillions of tokens per week which has actually doubled since dev day. It's always good to build where the models are going. It's safe to assume that the models will get better. They'll be able to get to work on much longer horizon tasks unsupervised. New models will raise the trust ceiling. I trust these models now to do some way harder work than I would have six months ago. And that's going to keep increasing. The future is about sprawling code bases and non-standard libraries and knowing how to work in closed source environments, matching existing templates and practices and the models uh and and and so you can imagine that the SDK will evolve to better support these model capabilities, letting the model learn as it goes and not repeat mistakes and generally provide more surface area for an agent that writes code and uses a terminal to solve whatever problems it encounters. counters and you can use that in your products via the SDK. So, what have we learned? Harnesses are really complicated and take a lot of work to maintain, especially with all the new models coming out. So, we've built one for you inside of Codeex that you can use off the shelf or look at the source if you want to and you can use it to build new things outside of coding and let us do all of the work making sure that you have the most capable computer agent. And we're really excited to see what you craft. Our [applause] [music] next presenters believe that most enterprises are failing to unlock real value from AI because the systems in which they operate are [music] stuck in the past. Here to share how agents are reshaping software delivery are McKenzie partners Martin Harrison and Natasha Mania. >> [music] >> All right, good morning. Hello everyone. It's really great to be here. Uh so I'm Martin and I'm here with my colleague Natasha. Uh we're from a part of McKenzie you may may not be as familiar with. We have a practice called software X and we work with uh mostly enterprise clients on how to build better software products which has messed mostly using AI uh in the in the past couple of years. uh and so what their talk is about today is really more focused on the people and the operating model aspects of leveraging AI for software development and and that we believe that that has changed quite significantly and and that's what we're excited to talk to you about. If I take a quick step back uh in in time and we just uh you know think through some of these the major technology breakthroughs that we've seen in the last few decades uh they tend to always come with a paradigm shift in also how we develop software and so I still recall uh almost 20 years ago now I started working as a software engineer an entry- level developer um in a tech company and the company I was working for was just switching to to agile we were using kan boards we were doing uh standups and and other ceremonies. This was a big change. It was a massive change for the for the company. And now with everything that is happening happening in AI, we're at the precipice of another such paradigm shift. And um if we think about some of the um some of the things that are happening um with AI and software development that we've seen at this um at this conference, there's no doubt that this is a new paradigm that is about us. And so we'll talk about two things. uh we'll first touch a little bit about how do you go from these things that we're seeing at individual productivity to scaling that to the whole team and what that what type of changes we think that implies and then we'll talk a little bit uh about how do you scale that across uh a whole organization and to really get get value um if if you sort of I I'm talking to an audience here which is using AI agents all the time and I if I If I asked you about some examples, I'm sure you could rattle off, you know, 10 different ones where you would say, "Look, there was this thing that I used to do. It it used to take uh maybe even days and and and hours that are now taking only minutes, right? There's no shortage of those those stories and you can go over to the expo and and talk to any of the companies there about all these all these great use cases. It really shows that these tools work and they can be really impactful." And so yet despite seeing you know some of these uh improvement uh improvements uh we've done some research to gauge you know where are our clients at the moment. We we recently surveyed about 300 uh companies uh mostly enterprises around what are they seeing in terms of productivity improvements. So you have this and then they would say uh on average we're often seeing only 5 10 15% improvements overall as as a company. So we're in a place where there's a bit of a disconnect between this this big potential uh around AI as uh from the reality. And so we we think that um there is this gap because as we've started implementing AI whether it's um you know coding assistance or whether it's now using you know you just heard about uh you know how open AAI is using agents and more complex uh workflows. What has started to emerge is a is a set of bottlenecks uh that that were not necessarily there before. Like for for example, as we now start moving much faster in certain in certain aspects of the work, uh we haven't really changed how we collaborate among people and and team members that's not quite keeping up. We started generating way more more code, but we're it's still being reviewed in a in a pretty manual way in in many companies. Then we also have this this theme which was recently highlighted in in even a research report from from Carnegie Melon uh about how all the new code that is being generated is also amplifying uh the generation of tech debt in some in some cases and actually generating complexity. And so there are these bottlenecks. They're not impossible to overcome but this is what we believe is limiting uh many companies from seeing the the the real value that that they should be seeing. Let me talk about maybe just a couple of examples to to make that uh come to life a little bit more. One of the things that we see as a big rate limiter at the moment is around how work is allocated. And so what what we've learned over the last couple of years is that the impact from AI and agents is highly uneven. There are some tasks which where it works amazingly well today and you see uh huge improvements and there are others where it it's not as effective and so you have that variability. You also have variability among people. Some have have uh lots of experience now using these tools and and know how to pick that up and others uh are less experienced right now and so what that means for for team leaders for engineering managers and so on is it's very highly non-trivial to know how to allocate work and resources in in a good way and this is creating a lot of inefficiencies. Another example uh is is around how work is being reviewed. So agents are often giving given pretty uh fuzzy uh you know stories that are written in pros with pretty fussy acceptance criteria uh which which means that the code that comes back is not always what it was intended to be and and for many companies the only mechanism to control that is is often manual review. So you've automated some things but we've generated more manual review. So these are some of the some of the examples of uh these bottlenecks that we that we see coming up. And as mentioned, what what has that has resulted in so far is that most most large companies today uh are are stuck a little bit in in a world of relatively marginal gains. Uh they're working in ways that was developed with constraints that we had in the past paradigm of human development. So you have you you know if you go out to most companies you see 8 to 10 person teams you see working in two week sprints you have all these these elements that were largely parts of like an of an agile operating uh model and that is and that is uh putting in some some limits to what they can see. Over the past year, we've been working with lots of clients to to sort of break that model a bit uh and develop new ways of of working in smaller teams, in new roles, uh in with shorter cycles. And when you do that, we see really great performance improvements. And that's what gives you gives us this uh path to where we see things are going to improve. So we realized that rewiring the PDLC is not just a one-sizefits-all solution. For example, different types of engineering functions across the enterprise along the product life cycle may require different operating models based on how humans and agents best collaborate. So if we take the example of modernizing legacy code bases, this task requires a high context of potentially the entire codebase but also has clearly well- definfined outputs. So an example operating model could look like a factory of agents where humans provide an initial spec and final review with minimal intervention. For new features for green field and brownfield projects, the operating model may look like an iterative loop because they may benefit from the non-deterministic outputs and increased variation where agents act as co-creators um providing more options to facilitate faster feedback loops. So, as we mentioned, we did a survey among 300 enterprises globally to understand what sets these top performers apart. We found that they are seven times more likely to have AI native workflows which meant scaling over four use cases across the software development life cycle rather than just having point solutions for just code review or for just code dov. They were also six times more likely to have AI native roles which meant having smaller pods with different skill sets and new roles. To enable these shifts, these organizations were investing in continuous and hands-on upskilling, impact measurement, and also incentive structures to incentivize developers and PMs to adopt AI. This led to five to six times increase in time to market and delivery speed as well as higher quality and more consistent artifacts. So when we talk about AI native workflows we mean that these enterprises are moving away from quarterly planning to continuous planning and also um the unit of work is moving from storydriven to spec driven development so that these PMs are iterating on these specs with agents rather than iterating on long PRDs. On the talent side, AI native roles essentially means that we're moving away from the two pizza structure to one pizza pods of three to five individuals. Instead of having separate QA frontend and backend engineers, there are more consolidated roles where product builders are managing and orchestrating agents with full stack fluency and also a better understanding of the full architecture of their codebase. PMS are starting to create direct um prototypes in code rather than iterating on these long PRDs. And one example um that we've described in our article, we've studied some AI native startups and realized that they've actually implemented all of these shifts to accelerate their outcomes. And in our article, we've described how cursor actually operates internally. But if you're a large enterprise predicated on the agile model, what are some steps you can take? So in in a recent client study with a leading international bank, we tested some team level interventions to address the bottlenecks previously mentioned before mainly around the sequencing of steps within the agile ceremony and how uh to define the roles of agents and humans within the sprint cycle. So let's walk through some examples. First, team leads would assign sprint stories using agents based on the data of the team velocity and delivery history. And then they would create co-create multiple prototypes and iterate with agents on the acceptance criteria around security and observability needs to have more consistent artifacts across teams. This prevents downstream rework that was mentioned before so that developers don't have to constantly be iterating with the agents during during the code process. The squads were also reorganized by workflow. So there would be one which would be focused on um small bug fixes and another focused on green field development. In the background agents would be used to look and impact uh look at um the potential cross repository impacts um to prevent debugging time for developers. And another example is that instead of for reducing the collaboration overhead and meetings that happen within the sprint cycle, um instead of waiting for data scientists input, PMS would directly be observing the real-time customer feedback to rep prioritize these features and this would lead to an acceleration in the backlog within the same amount of time. So we studied the um impact of these interventions and found high promising results. For example, not just the increase in agent consumption by over 60 times, but there was also an increase in the delivery speed that was tied directly to the business priorities for this bank. There was a 51% increase in code mergers, but also a decrease in um an increase in efficiency. The other aspect of this is is uh around the different roles and and and the talent model. And so one of the biggest differentiators that we saw as mentioned was around but you have actually changed the roles that uh that are involved in software development. And so you know what what you all are seeing is that engineers are moving away from execution and and just simply writing code to being more of orchestrators and and thinking through more how to divide up work to agents. for example. And we also heard some examples of how the role of the product manager is changing. And so while this this may sound, you know, pretty straightforward to many of you here who are who are working with these tools like day-to-day that you have to change what you do, the reality is that about 70% of the companies that we that we survey have have not changed their roles at all. Right? And so you have this background expectation that people are going to do things differently but the the role is still defined in the same way and it's the same understanding uh as it was you know a couple of years ago. Um but we are starting to see you know some companies changing this. So this is another example from a from another recent nent client. They were set up in a in a way that is, you know, pretty common for for u many companies and a kind of typical two pizza uh team model with with the types of roles that you would be familiar with. Um the we ran a bunch of experiments and front runners and and tested new models that were had much smaller pods uh that had uh new roles which consolidated some of the tasks that were previously done with different roles. And and so by doing that we could we could create basically more pods or more teams uh with with the same number of people uh but retaining the expectation that each pod is is uh is um performing at about the same level as as they were before. And so so we also see really uh really positive results from that uh with with uh maintaining and even improving in some is the quality of the code that was generated. In particular there was a there was a high speed up in in terms of uh the output from from the different teams and you can see some of the metrics uh here. Let's shift gears a little bit and and and go from talking about just the team level. So how does this now scale uh across a big organization? The reality is that many many companies don't just have like one or two of these these teams but often hundreds of teams even and thousands or even tens of thousands of people who are working in this way. And uh this is where one of the biggest differences that we that we saw between those that are stuck a bit in the um in in getting only 10% or so change improvements from those who are seeing outsized improvements is around how you manage that how you manage that change and change management I get is like one of these is a little bit of an often catch or elusive term for uh for a lot of different things but but I think in some ways it's not a bad way to think about I I usually say that the change management is about getting a lot of like small things right. And so the crux to like actually scaling this is often about getting 20, 30 or even more things right at the same time that involve the way you communicate uh what this means, the way you incentivize people, uh the way you upskill them, and it all has to come together. Um and when it when it's not, we we we see what happens. And so this is an example from from another tech company that we worked with um where initially we're rolling out new AI tools for them that that hit different parts of the product development life cycle. Um we we rolled we rolled out the tools there was some usage but often it dropped off. It was either not used or it was um it was sort of um used in very suboptimal ways. So that's the sort of jagged part that you're seeing on the on the left hand side here. despite kind of adding more users uh the overall impact did not change at all. So we had to do a quite a reset and and um start over effectively reset the expectations. What should what what does this mean if you're a developer dayto-day what does it mean for a PM? Uh we had much more hands-on upskilling. There was could bring your own code. there were, you know, coaches available, especially those first like few sprints before you get make this a habit and work it into the way that you develop software dayto-day. It's a very critical time and that's when when this matters a lot. Um, and having a bit of a a measurement system as well, so you know what's changing and and you're able to to see what's uh what's what what's improving. Another example just to put this alive a little bit as mentioned like this is about getting a lot of things um right and it's each one of these individually may not seem like it's the biggest deal uh but put together they really make a make a huge difference like this is for this is some of the top uh interventions that another client had to go through for them it really helped having you know setting up code labs for example really you know instituting a new set of certifications that help motivate and and drive people to to change what they do day dayto-day. And these these things really added up to uh the change they needed. >> But building a robust measurement system that prioritizes outcomes and not just adoption is important not just to monitor progress but also pinpoint issues and course correct quickly. So one surprising result from the survey was that these enterprises that were bottom performers were not even measuring speed and only 10% were measuring productivity. But our goal is to make our clients top performing organizations. So we've worked with them to create a holistic measurement system that captures impact all the way down to inputs. So for inputs this would include the investment into coding tools and other AI tools but also the time and resources in upskilling and change management. These inputs would lead to direct outputs but a lot of organizations are just focusing on how the increased breath and depth of adoption with of AI tools is leading to increased velocity and capac capacity increase. However, it's also important to understand how developers have uh different NPS scores and if they're enjoying their craft more um rather than feeling more frustrated. And it's also important to understand whether the code is becoming more secure and have has better quality but also more resilient. And one proxy for resiliency that we used for our client was the meantime to resolve priority bugs. Now if we look at economic outcomes which is priority for um the seauite executives they look into what is the time to revenue target. What is the increased price differential for higher quality features or expanding the number of customers to meet the feature demand and also what is the cost reduction per pod for reduced human labor. In aggregate, having these larger economic outcomes can also lead um to for organizations to understand how there is an increased reinvestment in green field and brownfield development. But as these tools evolve, the proxies for these metrics will also evolve. But hopefully this provides a MECI framework as an initial starting point. So what's next? The future of course is difficult to predict, let alone in the next 5 years. But we hope that with our vision of a new software development model, even as agents increase in their intelligence and humans become more fluent in AI, that this model still stands. So hopefully this model that includes um shorter sprints, smaller teams, but large u smaller but larger number of teams will set enterprises up for success in the long term. >> So just leave you with some some key takeaways. um start now. I would say to to our our clients, this is a human change and it takes some times and it's a big change and and it's going to be a journey and so I think um this is something that everyone needs to go on. I think it's also important to figure out which model works for you and set a really bold ambition and with that say thank you so much for listening to us and and uh we have an article here if you're more interested in in the research that we've conducted. Thank you so much for having us. Our [applause] next presenter is a researcher at Stanford who studies how AI impacts over 100,000 developers in the real world. Please welcome Jaor Dennis Blanch. So companies spend millions on AI tools for software engineering. But do we actually know how well these tools work in the enterprise or are these tools just all hype? to answer this and for the past two years we've been researching the impact of AI on software engineering productivity and our research is time series because we look at get historical data meaning we can go back in time and it's also cross-sectional because we cut across companies and the way we use to measure most of the of the impact is by a machine learning model that replicates a panel of human experts. The way this works is that imagine you have a software engineer who writes a code commit and this code commit would be evaluated by multiple panels or of 10 and 15 independent experts who would evaluate that code commit across implementation time maintainability and complexity and then produce an output evaluation. So we took the labels of these panels across you know millions of of kind of evaluations and then trained a model to replicate this panel of experts meaning that we can deploy this at scale and if there's ever any doubts around the models output you can always kind of assemble your own panel and see that it correlates pretty well with reality. Today we'll talk about four things. We'll start off with looking at some of the things that are driving AI productivity gains in software. Then we'll look at a AI practices benchmark that we developed. We'll then look at how we propose to measure AI return on investment in software engineering. And lastly, we'll finish things off with a case study. So here we took 46 teams that were using AI and we matched them with 46 similar teams that were not using AI and we measured their net productivity gains from AI quarterly. And the shaded area is the middle 50% of the data. And the dark blue line is the median which as of July of this year stands at about 10% for this cohort. I'd like to direct your attention to the fact that the discrepancy between the top performers and the bottom ones is increasing. There's a widening gap. And so if we very unscientifically and very illustratively project this forward, we might get something like this, right? where uh you can have these top performers being part of this the rich gets richer effect where they these successful early AI adopters might compound their gains while these strugglers could fall further behind. At some point this is going to converge and this is very directional. But my point here is that if you're a leader in a company, you definitely need to know in which cohort you are right now so that you can course correct and without measuring the impact of AI on your engineers, you're not going to be able to do this. So we started investigating what are some of the factors that drive these top teams to perform better and the first thing we looked at is AI usage or basically token spent. In this graph you have the same kind of on the vertical axis the productivity increase and then on the horizontal one you have the token usage per engineer per month on a logarithmic scale. And what you can see is that the correlation is quite loose 20 or so linearly. And there is a bit of a death valley effect around the 10 million uh token mark whereby teams that were using that amount of tokens seem to be doing worse than teams that were using a bit less tokens. It's very directional but interesting. Nevertheless, the conclusion here might be that AI usage quality matters more than AI usage value. We dug deeper and we said well does the environment in which the engineers work impact the productivity from AI and we came up with an environment cleaniness index index it's quite experimental it's a composite score that looks at tests looks at uh types documentation and at modularity and at code quality and that index is on the bottom axis here from 0 to one and then on the vertical axis once again you have the kind of productivity lift relative to teams not using AI And so what you can see is that there's a point40 R squar meaning a pretty decent correlation around environment cleanliness and gains from uh AI or productivity gains from using AI. And so the takeaway here is to invest in codebased hygiene to unlock these AI productivity gains. We dug deeper to illustrate this concept. And here we have on this graph on the vertical axis the percentage of tasks that might uh be able to be completed by AI based on three colors. And so green means that AI can do most of the work for that task in that sprint. Yellow means that AI can help someone and red uh means that AI is not very useful. And this is quite illustrative but it it conveys the point. And so then any code base at any point in time sits on a vertical line across this graphic. And what you can see is that clean code amplifies AI gains. Secondly is that you need to manage your codebase entropy, right? Your codebase tech debt because if you just use AI unchecked, this is going to accelerate this entropy which is going to push and degrade your cleanliness to the left kind of right and then you as as a human need to push on the other side to kind of improve or maintain that cleanliness to keep reaping the benefits from AI. Thirdly is that it's important that engineers need to know when to use AI and when not to use AI. And what happens when they don't is this kind of line on the left whereby you have AI AI outputs that are rejected or need heavy rewriting which then leads to engineers losing trust in AI saying okay this just doesn't work. I'm not going to use it. Which then further collapses your AI gains. Now we said can we find out whether we can look not only at usage but at how are these companies and these engineers using AI and we came up with an AI engineering practices benchmark. The way this works is that we can scan your codebase and detect these AI fingerprints or artifacts basically traces of how your team is using AI. It's quite directional at this point but evolving. And we can quantify this based on the percentage of your active engineering work that uses each AI pattern. And then we can repeat this monthly using git history. And the way this works is more or less you have kind of a few levels. And level zero might be how humans are just not using AI and write all of the code. Level one is kind of like personal use where engineers are not sharing prompts across the team or not versioning them. Level two is team use whereby teams are are sharing these kind of prompts and rules. And then level three is even more sophisticated. It's where AI autonomously does specific tasks maybe not the entire workflow. And level four is you know agentic orchestration which is where AI just runs the entire process. And so this is going to be an open- source tool which you can leverage if you sign up on the sweeper research portal. We applied this benchmark to one of the companies in our research data set and we saw this. This company had two business units with equal access to AI tools, right? Same licenses, same spend, same tools, same everything. But the adoption rate and the usage rate was very different by business unit. On the left, the first business unit, you can as you can see in the area in the blue, seemed to be using AI a lot more for almost 40% of their work. whereas on the on the uh right the second business unit seem to struggle behind a bit more. And so the takeaway here is that access to AI and even AI usage doesn't mean or doesn't guarantee that that AI is going to be used in the same way across a company. As a leader, you really want to be understanding not just whether they're using but also how your engineers are using AI. Great. Now let's dive into how do we actually measure AI return on investment in software engineering. Oh uh there we go. Okay. So here ideally we would be measuring this based on business outcomes right I give my AI engineer my engineers AI and then I make more money more revenue net revenue retention whatever business KPI you want to track. The problem is that there's too much noise between the treatment right giving AI and the result which is the business outcome. And on top of this there's confounding variables such as your sales execution, the macro environment, your product strategy and therefore although that would be ideal unfortunately uh I think we need to find alternative paths and the most logical one is to simply look at the engineering outcomes because there is a clear signal right but here we need to go beyond measuring AI usage into measuring engineering outcomes. There's a few caveats and this topic is quite heavily discussed and so I want to mention some of them. The first one is that this is assuming that our product function can properly direct that increased capacity into something that generates value. And if they aren't directing that, then it's a product problem, which although sits quite close to engineering, it's slightly different, right? The second caveat is that this assumes that engineering is a meaningful bottleneck for value which frankly it typically is and that you can guard against good hards law by using a balanced set of metrics and also by having a good company culture that doesn't weaponize these metrics. And thirdly is that AI is still very new and measuring proxy metrics is still better than not measuring. There's going to be winners and losers in this AI race. And progress is better than perfection here. And so metrics don't need to be flawless to be useful is what I want to illustrate. So then um here we have uh two parts which you need to do to get the ROI from AI, right? You kind of need to measure usage and then you need to measure engineering outcomes. And so let's start with usage. There's really two buckets for enterprises. There's kind of more in a research environment, but to make it simple, there's access based and there's usage based. Accessbased is basically looking at when did people get access to the tool. And here we have you can kind of do a pilot group, give that group AI and then compare it to a similar group without AI or you can measure the same team across time. The problem is that access based is noisy and the gold standard is really usage based which uh uses telemetry from APIs from these coding assistants right to uh give you the right data to know who's using AI and and where and the caveat here is that the vendor API is different unfortunately tools like GitHub copilot aggregate the data and other tools like cursor give you more granular data the big takeaway is that you can measure impact of um retroactively by using git history. And so you don't need to set up an experiment now and wait 6 months. You can actually if you've already adopted AI, you can go back in time and and and do this. It's quite easy. Now we've seen usage. Let's look into how do we actually measure engineering outcomes? What are some of the metrics we propose? Here we have um our framework which we proposed which is using a primary metric and a guardrail metric. And so here um the primary metric is engineering output. It's not lines of code. It's not PR counts and it's not DORA. And it's basically based on this machine learning model that replicates the panel of experts, right? And the second set of metrics are the guard ones which you want to maintain at a healthy level but you don't want to maximize. It doesn't make sense to maximize them truly. And so then there's three categories within the guardrail ones. Rework and refactoring, quality, tech and risk, and then people and devops. The third bucket is important to highlight that these are not productivity metrics. They're useful, but you cannot just kind of use them like maximize them to maximize developer productivity. They kind of fall off at some point. And so the goal here might be to keep your guardrail metrics healthy while increasing the primary metric to whatever degree possible. Now, let's dive into a case study. Here we worked with a company that uh large enterprise. We took a team of uh 350 people under a vice president and we measured pull requests. The reason we did this is to illustrate that you cannot measure pull requests to understand whether AI is helping you. And so here this team adopted um AI in May of this year and we measured the four months before four months after. We saw a 14% increase. Great. That's fantastic. But what about reviewer burden? What about code quality? So we measured code quality. And here what we saw is um I mean firstly actually code quality think of it as maintainability scale from 0 to 10. And uh there's kind of these bands. Uh it uses our our methodology. You can read it online. But basically what you see is that in the preAI period their code quality was quite stable and consistent. And once they adopted AI, two things happened. Code quality decreased and then code quality became more erratic. Next, we took a look at our metric, which is engineering output. It's not lines of code. And here for every month, you see the sigma, the sum of the output delivered for that month broken down into four buckets. Rework and refactoring. So rework is when you're changing or editing code that was it's still kind of fresh, so it's recent. refactoring is when you're changing code that's a bit older and uh what uh then like added and removed it's pretty self-explanatory and then also you can see these kind of benchmarks so we can benchmark this company against similar companies in their industry and here AI usage had two effects firstly is that rework went up by 2.5 times which is really bad and effective output which is kind of like a proxy for productivity or so didn't really change and so then what's the conclusion here let's do a recap app. So we saw that PRs went up by 14%. But this is inconclusive because more PRs doesn't mean better. We saw that code quality decreased by 9% which is problematic. We saw that effective output didn't increase meaningfully. And then we saw that rework increased by a lot. And so then the question here is what is the ROI of this AI adoption? Right? It might be negative. And what I want to point out here is that had this company not measured this more thoroughly and simply measured PR counts, they would have thought, hey, we're doing great. We increased our productivity by 14%. Let's run the numbers. That's how many million lots of millions of dollars. And does this offset the AI licenses? Sure thing it does, right? The other thing is that I don't think this company should abandon AI. They should simply use this data to understand what they're doing wrong. How can they improve? Because AI is here to stay. It's a tool that's going to transform how engineers are are working, right? and you can just um kind of like abandon it or so. Great. So, this concludes our insights for today. If you've enjoyed this uh talk and you would like similar insights for your company, I invite you to participate in our research. Everything you've seen today can uh be accessed through kind of participating in our research, some of them through live dashboards in our research portal. And especially I'd like to invite companies that have access to Cursor Enterprise to participate because we have a high need for this so we can publish papers around the granularity of using AI um in software engineering. You can sign up at software engineering productivity.stanford.edu. Thank you so much. [applause] [music] Our [music] next speaker will separate hype from reality on AI code quality using realworld data to show when AI generated code can be trusted in production. Please welcome CEO of Kodto, Edidomar Freiedman. [music] It will grow. It will grow one or two more months. I'm really excited being here. So many so much pragmatic and insight and suggestions. I was sitting there uh just just before. So I'm Edomar Freiedman, the CEO and co-founder of Kodto. Codto stands for quality of development and I'm going to share uh our reports and other companies reports about state of AI code quality. uh you know trying to uh talk about the hype versus reality which was uh like one of the uh points that were discussed here quite a lot which is awesome. So in the last three weeks, four weeks, we saw like three outages in the clouds unfortunately, right? And these are coming from companies that really care about moving fast, right? They're they're they're saying themselves that they're using AI to generate code 10%, 30%, 50%, at the same time, they care about quality. So how did that happen? And is it is it related? I don't know. But let's have some I'm going to share some guess. So by the way 60% of developers say that the like quarter of their code is either generated by AI or in in like uh uh shaped by I and 15% say that even more than 80 80% of their code uh is basically generated or or shaped by AI. Now people are using AI to do vibe coding but actually they're even doing it for vibe checking vibe reviewing. This is the command of cloud. This is the prompt for the command of claude code for security review. It was hyped like two months ago. Do you know what I'm talking about now? It says there, I don't know if you see it. Uh you are a senior security engineer. Good. And then like somewhere there uh down the line it says please exclude denial of service. Don't don't uh catch denial of service issues. Maybe that's part of the part of the reason like we're we're having uh cloud outages. probably not just that, but you get the point. Like we need to be rigorous about how we deal with quality. It's not just like vibe quality or or so like we're doing vibe coding sometimes. Uh let's go to another example. Okay, cursor I guess like or or pilot most of you use rules, right? We're going to talk about it. You invest in code generation. After a [snorts] while, you understand if you invest, you'll get more out of it. And uh we we asked like a bunch of of developers and I'm asking you as well think think for a second for all the developers there in the audience like when you write cursor rules or copilot rules etc. Do you feel they're completely followed or it's like mostly followed? Do you know how much they're followed and what extent are they followed? It's rigorously like how technical deep they're they're being followed. So the what we get back like the answer from what you see here on the screen is mostly like B, C, and D. They are followed but they're not completely followed. Okay. So that means like we are generating code trying to push it to the standards but it's not necessarily still like getting to the quality we wanted. I'm going to share a bit more statistics and and information and some insight from three reports. One done by Codo, another by done by Sonar, another by far and all of them are are focused on code code quality review etc. The sample size is thousands of developers in some cases even more millions of pull requests and and a billion of of lines lines of code that were uh uh being checked. Like for example, if you think about uh Sonar, this is a company. Yeah. A bit like coming from pre-AII, but they see code at scale and you they're doing like a lot of uh checks in code that are not necessarily AI focused, but are necessary in order to check uh your your software from all possible direction. And that's why their scaling and the scale of the code that they're seeing is is immense. Okay. So for example, we took information from from their report and eventually my purpose here is to break down the different dimension of what uh code quality means and give you some share some stats and and insights. I want to start with the end. Okay, this is the takeaway I want you all all like to take from from the next 13 minutes that I have. We started with code generation. We like out of the box use it autocomplete etc. and you invest in it and you can get more out of it. But there's the glass ceiling for how much productivity you can get from code generation. And then we move to the agent code generation, right? Let's call it gen 2.0. And that's a higher glass ceiling. It could do much more productivity and especially if you invest in it, for example, rules, etc. Then with AI breaking outside of the IDE, we can start using AI also for code for agentic quality workflows. It could be inside the ID, but the the truth is that if you think about all the workflows you have in your organization, especially if you're more than 100 developers or so, you probably have a lot of workflows that you are related to quality that you need to auto automate. And that's where you start like breaking through the glass ceiling of productivity. if you invest in it. And finally, I I claim that you need those agentic workflows. Keep learning. And we might touch a little bit of that like later later on. Okay? Like because quality is something dynamic. So you'll only finally break break the glass ceiling if if you really have those quality workflows and rules and standard being dynamic. And then then you will see the promised 2x let alone the 10x that you were promised the hyped and and you you heard from McKenzie and from Stanford you're not getting that. I don't need to tell you the 2x 10x for the entire software development uh life cycle. So a bit about more about the market adoption. Uh one of the report says that 82% of adoption already for AI dev tools are being used daily or weekly. uh some people at 60 60% 59 report that they're using more than three and 20% saying that they're using more than five code generation tools. If you think about it for a second uh don't only take like cursor compilot codeex cloud code etc. Sorry if I'm insulting anyone in the that I forgot their tool but there's also the lovable etc. They also generate code and by the way you're going to get to 10 I'm count on me you're going to get to 10 tools in two three years that generate code for you okay come to talk to me about later I'll try to convince you and and the thing is that it's coming from bottom up like 50% of the usage is coming from less than 10 teams that are less than 10 developers but it is propagating also to the enterprise again I'm sure you know I mean talk propagating to the enterprise at scale like not just like five developers in the last year we're seeing like more and more enterprise using co code generation. Uh so if like an average with within reports we saw 82 to 92% using weekly to a monthly code generation tools and in some cases maybe extreme maybe not we're going to talk about it. We saw 3x productivity boost in writing code. Okay, but that doesn't mean that if you have uh 3x productivity in writing code that you actually guarantee any quality like I presented before. So actually 67% of the developer that we as asked have serious quality concerns about all the AI generated all the generated code uh uh code generated by AI or influenced by AI and they're claiming that they're missing the framework how to deal with quality how to measure quality. It's a big question. What is quality? I'm going to talk about it in the next few slides. Okay, think about it for a second before I break break it down. What what is quality? Um so what we're actually saying that the crisis with V right coding uh viable coding we're seeing it shifting and evolving is that you're getting like more task being done like 20 some report 20% more task you know velocity and like 97 more% or so of PR being opened and eventually it takes more time to review PR like 90% more time to review PR and by the way like there's a lot of statistics about AI generating wring code at least there's not less amount of bugs per line of code I'm not claiming that there are more but even if there's not less bugs per line of code you have much more bugs because there are much more PRs much more code being generated etc right so that that's a problem for the reviewer so it's somebody's surprise it takes more time to review these especially in the age of agents right when five minutes calling to cloud code I have 1,000 line of code after 5 minutes once upon a time it took me like hours to write 10 proper per lines of code. Right now, let's zoom out for a second. Code generation is magnificent. Okay? Like it it's a gamecher when you're talking about green field. You saw people talk about it a few slides a few minutes before me. Uh it it revolutionized how we do p proof of concept uh project etc. But when you're dealing with heavyduty software then you you like it or not we are dealing with a lot of things when uh when you serve millions of clients you have financial transactions when you're doing transportation you're dealing with code integrity if you like code governance uh review standards testing relability etc. That's what we need to uh uh to deal with. Now let's break that under the surface part of the glacier into two dimensions. This is one dimension you can look on the quality issues in throughout the software development life cycle like planning and then development writing code review code review is a bit of a process but like what you're like checking quality that's part of the process of code review testing which is another part of of quality and and deployment and I know I didn't cover the entire like uh software development life cycle but just to give you an example and each one of them like possess like introduce new problems that are coming because you're using more and more AI generated code. Um now another dimension to look at it is actually code level problems and process level problems. Okay, I'm not I'm not opening the you know list of functional just opening the list of non-functional. You're talking about security inefficiency that are not necessarily uh functional. Use I will show you some statistics about that. And then process level is for example learning. Hey if you will have a a a bad outage because of AI generated code who is responsible is it the AI or or the team that own that okay like you need to learn and own the code eventually that's a process that needs to be done verification porting guard rails standards uh etc. So, so all of those issues when they are introduced to thousands of developer that we asked them do you think like actually AI helped to reduce with those problems or or actually made more like more challenging 42 people reported that they spend 42 more of the development time on solving issues on fixing bugs etc. and and they saw 35 uh% project delays. We're talking about we're talking about maybe games they're talking about like delays. Okay, there's some bias. We told them we talked about problem with quality and what's the impact etc. Um but that's what they they they present uh to when they they answer uh when when they're talking about like when you're mass using AI code AI generated code and we see reports uh some of the reports talking about 3x more security inc incidents by the way it makes sense you remember we had a slide saying 3x more writing code so 3x more security incidents like the same amount of line of code the same amount of uh uh problems the correlation so what to do with that like I talked about problems and problems and problems okay help help me deal with it like let's let's spend a few minutes on on that. So one one suspect of course is testing and actually really interesting we asked a couple of question about testing and one really relevant saying of that people said that when they heavily use AI to on testing use AI to do testing they actually double their trust and the AI generated code okay that's one thing the ne next suspect to help us with the quality is code review what really interesting about code review that it's a process that helps almost with all the process level and the code level like issues. For example, you can set your AI code review tool to tell you block this PR if it doesn't cover certain level of test coverage. So through the PR, you take care of the testing process problem. Okay. So code like code review with AI is actually one of one of the major things you you you can do and people that are developers that are using AI code review tool they're saying that they're saying they're seeing double the quality gain and they're saying that actually it's it helps them to uh uh improve improve 47% in productivity of writing code. Okay. Now a bit statistics from our own uh AI code review tool. We scan a million of PRs a month and we took one mill million of those PRs and we noticed that 17% include like high severity issues. By the way, we're now analyzing uh before and after using AI. I don't have that statistics yet, but we are noticing since we're starting uh most of the companies we serve, they use AI generated code. So that's why uh I don't have before. We need to go scan backwards. Uh and that's like a really big a big number. Another thing I want to talk to you like about uh when you're trying to improve on quality is is the foundation of having the right context that is brought to the uh code generation tool that is brought to the AI code review tool. Better context better quality across the board wherever you're using AI. Uh so when we asked developers when when you h when you don't trust AI generated code like you remember like 67% sa like a really worried about that they said 80 80% of the time they don't trust the context that the LLM have okay and and and uh when we asked developers what would you like to be improved in your AI generated code in your AI code review tool they said the number one was context it was number one was 33% they can choose among many things to to improve. So context is extremely important. I can tell you that as codto one of our technology moes uh is is around context and when you connect our context engine we're seeing it as the number one tool that is being used like 60% of code generator or code review tools 60% of their calls to an MCP would be to a context MCP. Okay. And just to tell you the context doesn't necessarily need to include only your code. It could also include context to your standards, your best practices. We're seeing in our AI code review that 8% of the context usage is actually from files that are related to standards and and best practices etc. Okay, I have to CEO of Kodo like marketing will be mad on me if I don't brag a little bit. Right? So this is uh kind of like our market of our context engine being presented by Jensen and GTC keynote and he notice he didn't talk about our co code review capabilities about our testing capabilities he talked about our context engine that Nvidia checked because there's a realization that AI quality AI generated whatever review testing will come from bringing the right context. So invest in that you need to to build your context buy a solution and invest in it. build your solution uh etc. And the context needs to include code uh uh versioning PR history uh organization logs etc. That's where all the context sits. It's not just in the last branch of your codebase. Okay. So I'm I'm zooming out starting to talk about like recommendations and uh and like uh takeaways. So what what what's next? So automated uh quality gateways invest in that. People talked throughout the morning about parallel agents. You know what I'm talking about like background agents. You can use a lot of those like tools and capabilities to build build your quality gates. Uh use intelligent code review testing and you need a li living and breathing like documentation and and what documentation means is is a story by itself. Uh I'm not going to double click on it. And and this is how I present for 3 years now and I think I'm going to go all the way until age of 60 with this slide of how I think the future of software development looks like. Okay. So basically you have your specification and you have your code right and you have multiple agents parallel agents that are helping you to improve your spec write your spec improve your code transfer transfer from your spec to your to your code uh make tests which are executable specs right uh and and then you're going to have your context engine the software development database and you will build your tools especially MCPs around quality and verification and you'll Make sure you have environments, stable, secured sandboxes where those agents can run and and run validation and quality uh workflows. So don't don't forget like the path forward is quality is your competitive edge over your uh competition. AI is a tool. It's not it's not a solution. Okay? And don't like only think about code generation as the only thing. Look on the entire SDLC or product development life cycle. I saw one of the uh people talked um speakers and it iterate with everything we talked about today. I have uh I want to tell you that you will gain value from it. We're seeing in the reports people seeing like security availability being reduced faster code review you we just got a hit on that because of AI generated code and test coverage in a month can can triple depends on on the project etc. with with the last minute I want to show you like a really small piece of what you can do with codo. uh you can go into codto and define your own rule for example almost the same rule you'll put on cursor of I don't like nested ifs if this is a problem that you have but then codto will look on your context build the good example the bad example and then start giving like building a workflow that is specifically to catch that issue and give you statistics over time when it's being accepted and when not so you can adjust that rule and really know and have visibility to to your standards. Okay. So when a PR is written with a few ifs and else although it was written with cursor copilot that had a rule do not do nested ifs etc. then eventually when you open a PR you will get uh codo uh uh catching that and giving a suggestion according to the good and the bad example. COD will also make a graph, give you a CLI checks like check each one of the rules and eventually tell you the nested if and then will record and learn what you did or did not do with that suggestion in order to adapt the standard and of the of the quality. Um there will also automated like suggestion. You don't need to write your own. It learns your your your standards and quality and offer that to you. And that's it. I'm I'm really really excited about like breaking the glass ceiling, okay, with what we did with code generation and then a jet to code generation. Now we're turning into the era of putting AI into work and through the entire SDLC. The most important part is related to quality. You would need to invest in that. It's not out of the box. Okay. And then you would see eventually the promised tox that that that probably promised to the CEO or something like that once they give you the budget for for the relevant tools. Thank you so much. [applause] [music] Our next speaker is introducing Miniaax's latest model and how it powers nextg experiences for code generation. Please welcome to the stage senior researcher at Miniaax, Olive Song. [music] Hi. Hi everyone. Um I'm Olive. It's my great honor here today to present on our new model Mini Max M2. Um, I actually lived in New York City for six years, so it feels great to come back. Um, but with a different role. Um, I currently study reinforcement learning and model evaluation at Miniax. Um, let me just get a quick sense of the room. Who here has heard or have tried of Miniax before? Oh, a couple of there. Yeah, not everybody, but I guess Yeah, but here's the value, right, of me standing here today. Um so we are a global company that works on both foundation models and applications. We develop multi modality models including text um vision language models our video generation model hyo and speech generation music generation stuff and we also have um many applications including agents and stuff um inhouse. So that that's the specific thing that's different from the other labs for other companies. So we both develop foundation models um and applications. So we have research and developers sitting uh sitting side by side working on things. Um so our difference would be that we have firsthand experience from our um in-house developers into developing models that developers would really need in the community. And here I want to introduce our Miniax M2 um which is an openweight model very small with only 10 billion active parameters um that was designed specifically for coding workplace agentic tasks. It's very costefficient. Um let me just go over the benchmark performance because people care about it. So uh we rank very top in both um intelligence benchmarks and also agent benchmarks. Uh we I think we're on the top of the open source models. But then numbers don't tell everything because sometimes you get those super high number models you plug into them um into your environment and they suck, right? So we really care about the dynamics in the community and in our first week we had the most downloads and also we climbed up to top three token usage on open router. So we're very glad that people in the community are really loving our model um into their development cycle. So today what I want to share is how we actually shape these men model characteristics that made M2 so good in your coding experience. And I'm gonna present to you um the training be behind it that supports each one of them from coding experience to long horizon state tracking tasks um to robust generalization to different scaffolds to multi- aent uh scalability. So first let's talk about code experience which we sc uh which we supported with um scaled environments and scaled experts. So um developers need a model that can actually work in the language they use and across the workflow that they deal with every day. So which means that we need to utilize the real data from from the internet and then um scale the number of environments so that the model when during training for example during reinforcement learning it can actually um react to the uh environment. it can actually target verifiable coding goals and to learn from it. So that's why we scaled both the number uh of environments and also our um infrastructure so that we can perform those training very efficiently. So um with data construction and reinforcement learning we were able to train the model so that it's very strong um it's full stack multilingual and what I want to mention here is that besides scaling environment that everybody talks about we actually scale something called expert developers um as reward models. So, as I mentioned before, uh we have a ton of um super expert developers inhouse that could give us feedback to our model's performance. So, they participated closely into the model development and training cycle, including problem definition, for example, um bugs, bug fixing, for example, um repo refactoring and stuff like that. And also they identify the model behaviors that developers enjoy and they identify what's reliable and uh what developers would trust and they give precise reward and evaluation to the model's behaviors to the final um deliverables so that um it is a model that developers really want to work with and that can adds efficiency to the developers. So with that we were able to lead in many um languages in real use. And the second characteristic that Miniax M2 has is it it performs good in those long horizon tasks. Uh those lawn tasks that require interacting with complex environments that requiring um using multiple tools with reasoning. And we supported that with the interled thinking pattern um and reinforcement learning. So what is interled thinking? Um so with a normal reasoning model that can use tools, it it normally works like this. You have the tools information given to it. You have the system prompts. Um you have user prompts and then the model would think and then it calls tools. It can be a couple of tools at the same time. And then they get the tool response from the environment and then it performs a final thinking and deliver a final content. But but here's the truth, right? In real world, the environments are often noisy and dynamic. You can't really perform this one test just by once. You can get um tool errors for example. You can get um unexpected results from the environment and stuff like that. So um what we did is that we imagine how humans interact with the world. We we we look at something we get feedbacks and then we think about it. We think if the feedback is good or not and then we make other actions, make other decisions. And that's why we did the same thing with our M2 model. So if we look at this um chart over a diagram on the right. So instead of just stopping um after one round of tool calling, it actually thinks again and reacts to the uh reacts to the environments to see if the information is enough for it to uh get what it wants. So basically we call the interle thinking or people call it interle thinking because it interle thinking with tool calling. um a couple of time it can be you know uh tens to a hundred um turns of tool calling within just one user interaction term so it helps um adaptation to environment noise for example uh just like what I mentioned the environment is it's it's not stable all the time and then something is suboptimal and then it can choose to use other tools or do other decisions it can focus on long horizon has um can automate your workflow um using for example Gmails, notions, um terminal all at the same time. You just need to maybe make one model call without minim with minimal um human intervention. It can do it all by itself. And and here's a cool illustration on the right because it's New York City. I feel the vibe of you know trading and marketing. Um so you can see that there was some um there was some perturbations in the stock market uh I think last week and then our model was able to keep it stable. So just like I said there's like environment noise there's no new information there's like yeah news it looks like there there's like other trading policies and stuff like that but our model was able to uh to perform pretty stably in these kind of environments. And the third characteristic is our robust um generalization to many agent scaffolds which was supported by our perturbations in the data pipeline. So we want our agent to generalize but what is agent generalization? At first we thought it was just tool scaling. We train the model with enough tools various tools kind of new tools. we invent tools um and then it will just perform good on unseen tools. Well, that was kind of the truth. It worked at first. Uh but then we soon realized that if we perturb the environment a little bit, for example, we change another agent scaffold, then it doesn't generalize. So what is agent generalization? Well, we conclude that um it's adaptation to perturbations across the model's entire uh operational space. If we uh think back what's the model's um operational space that we talked about it can be tool information it can be system prompts it can be user prompts they can all all be different they can be the chat template they can be the environment they can be the tool response. So what we did is that we designed and maintained perturbation pipelines of our data so that um our model can actually gen generalized to a lot of agent scaffolds and the fourth characteristic that I want to mention is the multi- aent scalability um which is very possible with M2 because it's very small and cost effective. I have a couple of videos here. Um, this is M2 powered by our own Miniax agent uh app. We actually have the QR code downside. So, if you want it, you can just scan and try it. So, it's like an agent app we we developed. And here we can see different copies of M2, right? It can do research. um it can write the write the research results and analyze it and put it in a re report. It can put it in some kind of front-end illustration and they can work in parallel. So because it is so small um and so cost effective, it can really um support those long run agentic tasks and tasks that maybe um require some kind of parallelism. So what's next right for Miniax M2 from what I've introduced we gathered environments um algorithms data expert values model architecture inference evaluation all these stuff to build a model um that was you know fast that was uh intelligent that could use tools that generalizes what's next for um M2.1 1 and M3 were in the future. We thinks of better coding, maybe memory work, context management, proactive AI for workplace, vertical experts, and because we have those great audio generation, video generation models, maybe we can integrate them. But all our mission is that we're committed to bring all these resources, whatever is on the screen, and maybe more. Yeah. and values and put them all together to develop models for uh the community to use. So um we really need feedback from the community if possible because we want to build this together and you know this is kind of a race that everyone needs to participate and then um we com we are committed to share it with the community. Yeah. And that's all the insight for today. Um, we really hope again we really hope you to try the model because it's pretty good. And then we can contact contact us up there. You can try the models by scanning the QR code. Yeah, basically that's it. Thank you all for listening. [applause] [music] Ladies [music] and gentlemen, please welcome back to the stage, Alex Lieberman. Let's give it up again for Olive [music] and all the other speakers from the morning. [applause] It is time for lunch. Very exciting. Uh, one thing I want to say before we head out for lunch and we're it's going to be downstairs in the expo. check out all the boos, talk to people, have food is, you know, my own experience with going to conferences is even though I come up talk on stage a lot, I find it very difficult to engage in conversation with people when there's like these little small group settings. I don't know like can I go and chat with people? Can I not? This is a kind of awkward. I give you all permission to butt into conversations, introduce yourself. Ben and Swix have done an incredible job of cultivating such a high quality community here. And the most value you will get is not just from these incredible presentations. It's from meeting other folks in the crowd. So please, you have my permission butt into conversations. Introduce yourself. Share what you've learned with folks. And if you need any sort of uh ice breakers to get the conversation going, I have two for you. One is just go into a group and share your hottest take on uh the state of AI today. It's a great way to get off to a good start with someone. The second, a little less intense, is is a hot dog a sandwich? Is cereal uh in milk a soup? That is how you're going to start the conversations with folks. Everyone enjoy lunch. We'll see you back in an hour and uh thanks so much for your time. Heat. Heat. [music] Heat. [music] [music] Heat. Heat. Heat. [music] [music] >> [music] [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. >> [music] [music] >> Heat. Heat. [music] >> [music] >> Heat up here. [music] Heat [music] up here. [music] >> [music] [music] >> Heat up here. >> [music] [music] >> Heat up here. [music] [music] Heat up here. Heat [music] up here. Yeah. Heat. [music] [music] >> [music] >> Heat. Heat. Heat. Heat. [music] Heat. [music] [music] Heat. [music] Heat. Heat. [music] Heat. Heat. [music] [music] [music] [music] >> [music] [music] [music] >> Heat up [music] >> [music] >> Heat. [music] Heat. Heat. [music] [music] Heat. [music] >> [music] >> Heat up here. [music] Heat. [music] Heat. [music] [music] >> [music] [music] >> Heat up here. [music] >> [music] [music] >> Heat. Heat. Heat [music] [music] up [music] [music] here. >> [music] >> Heat. Heat. [music] Heat. Heat. [music] [music] Heat. Heat. [music] >> [music] >> Heat. [music] Heat. Heat up here. >> [music] >> Heat. Heat. [music] >> [music] >> Heat up here. >> [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. [music] >> [music] [music] >> Heat. Heat. Heat. Heat. [music] Heat. Heat. [music] Heat up here. [music] [music] [music] >> [music] >> Heat. Heat. Heat [music] up here. [music] Heat [music] up here. [music] >> [music] [music] >> Heat. Heat. Heat. [music] Heat up [music] here. Heat. Heat. [music] Heat up here. [music] Heat up here. >> [music] [music] >> Heat up here. Heat. Heat. [music] >> [music] >> Heat up here. Heat. [music] [music] Heat. [music] Heat. [music] Heat. [music] Heat. [music] Heat. [music] [music] >> [music] [music] [music] >> Heat. [music] [music] [music] Heat. >> [music] >> Heat. Heat. [music] Heat. [music] >> [music] >> Heat up [music] here. [music] Heat. Heat. Heat [music] up here. Heat up [music] here. Heat. Heat. [music] >> [music] >> Heat up [music] here. [music] [music] >> [music] >> Heat. Heat. [music] Heat up here. [music] [music] Heat. [music] [music] Heat. Heat. Heat. [music] [music] >> [music] >> Heat. Heat. [music] Heat. [music] [music] [music] Heat. [music] >> [music] [music] >> Heat. >> [music] [music] >> Heat. Heat. Heat up here. Heat [music] up [music] here. >> [music] >> Heat. Heat. >> [music] >> Heat up Heat [music] up here. Heat [music] up here. [music] [music] Heat up >> [music] >> here. Heat. Heat. Heat up here. Heat. Heat. Heat. Heat. [music] Heat up [music] here. [music] Heat [music] [music] up >> [music] >> Heat. Heat. Heat. Heat. [music] >> [music] >> Heat. Heat. [music] Heat. Heat. [music] [music] Heat up [music] Heat. [music] Heat. Heat. Heat. [music] [music] >> [music] >> Heat. Heat. >> [music] >> Heat up here. [music] >> [music] >> Heat up here. [music] >> [music] >> Heat up [music] here. [music] Heat. Heat. [music] Heat. [music] [music] [music] Heat. >> [music] >> Heat. Heat. Heat. Heat. [music] [music] >> [music] >> Heat. Heat. Heat. Heat. >> [music] >> Heat. Heat. Heat >> [music] >> Heat. Heat. >> [music] >> Heat. Heat. [music] Heat [music] up here. Heat up Heat. Heat. Heat. [music] [music] Heat. Heat up here. Heat Heat up here. [music] [music] >> [music] >> Heat [music] up Heat. Heat. [music] >> [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. [music] >> [music] >> Heat up here. [music] Heat [music] [music] up here. >> [music] >> Heat up here. [music] Heat up >> [music] >> Heat. Heat. >> [music] [music] >> Heat Heat. Heat. [music] Heat. [music] Heat. Heat. Heat. Heat. [music] Heat. [music] Heat. Heat. Heat. [music] Heat. Heat. [music] [music] >> [music] [music] >> Heat. Heat. Heat up here. [music] >> [music] >> Heat. Heat. [music] Heat. Heat. [music] >> [music] >> Heat. Heat. [music] Heat. Heat. [music] [music] [music] >> [music] >> Heat. Heat. >> [music] [music] >> Heat. Heat. Heat. Heat. N. [music] [music] Heat. Heat. [music] [music] Heat up here. [music] Heat >> [music] >> up here. [music] >> [music] >> Heat up here. >> [music] [music] [music] [music] >> Heat. Heat [music] up here. [music] >> [music] [music] >> Heat up here. [music] >> [music] [music] >> Heat. Heat. [music] >> [music] >> Heat up [music] here. Heat. [music] Heat. [music] Heat. Heat. [music] [music] Heat [music] up here. Heat. Heat. [music] [music] Heat up here. >> [music] [music] [music] >> Heat. Heat. Heat. [music] Heat. [music] [music] >> [music] [music] >> Heat up here. [music] Heat. Heat. [music] >> [music] [music] [music] [music] [music] >> Heat. Heat. [music] >> [music] >> Heat up here. Heat up here. Heat [music] up [music] here. [music] >> [music] >> Heat. Heat. [music] Heat. [music] Heat. Heat. [music] [music] Heat. >> [music] >> Heat. Heat. Heat. [music] [music] [music] >> [music] >> Heat up here. [music] Heat. [music] Heat. [music] Heat [music] up here. [music] Heat up here. [music] [music] >> [music] >> Heat up here. [music] [music] >> [music] >> Heat up >> [music] [music] >> Heat up here. >> [music] >> How's everyone doing? Good lunch. [music] excited for the afternoon sessions. Out of curiosity, did anyone have the hot dog conversation? Does anyone think Who thinks that a hot dog's a sandwich? We got one. We got two. Uh, anyone think a hot dog isn't a sandwich? Most of the crowd. That is that is usually the consensus. Uh, one other question. Who thinks that they have the hottest take on the state of AI or AI engineering right now in the room? Anyone think they have the hottest take? Well, I I'll I'll give you uh a tea up for later. My co-founder Arman is speaking around four, and I would say he has one of the hotter takes I've seen, which is he thinks all engineers should be paid like salespeople based on output. That is going to attract a lot of debate and I give you full permission to debate him after his talk. Well, are you guys ready to jump into the next group of sessions? >> Let's do it. We will be diving into proactive agents from Google Labs, building Gen Bi at a Fortune 100 business, deploying AI within Bloomberg's engineering org, lessons learned building an AI browser, and developer experience in the age of AI coding agents. With that, please join me in welcoming our next speaker, Kath Corvec, director of product at Google Labs. Let's give it to her. >> [music] >> Hey everybody. I'm so excited to be here. I love New York and I love meeting everybody here. And I am Kath Corbec. I'm from Google Labs and I work on this little team called ADA. And I'm going to be talking about some of the stuff that we've been doing on this project called Jewels. So, a few months ago in my household, our dishwasher broke. And while it was being repaired, my husband decided that he was going to do all the dishes. And so, he told me he was going to do this. But every single night, I found myself reminding him to do the dishes. And you can imagine that got old pretty fast. And I realized that even though I wasn't physically washing the dishes, I was still carrying this mental load. And I know a lot of you can probably relate to this. I was keeping track of whether or not that task was done. following up, making sure that things kept moving. And I realized in that moment that that's exactly where we are with asynchronous agents today. They can handle some of the work, but we're still the ones as developers carrying that mental load and monitoring them. So, here's the truth. Humans, we are serial processors, not parallel ones. We can juggle multiple goals, but we execute them in sequence, not all at once. When you manually kick off a task in jewels, you're usually waiting to be able to move on. And it's that pause, it's that gap in attention where we really lose momentum. And this is actually backed up by science where uh humans actually think we think we're multitaskers, but we're actually executing many tasks very rapidly. But switching between these tasks comes with a huge cost. It can cost up to 40% of your productive time. So that's like half a day lost to switching contexts and reloading. So if humans are uniters, what's the solution here with agents? So for async agents, in order in order for them to succeed, developers can't be expected to babysit them. We've all seen that post on Twitter of 16 different cloud code tasks running in parallel on 16 different terminals on three different huge browsers or huge monitors. And when I first saw this, I thought, god forbid that is the DevX of the future. I want to I don't want to manage work. I don't want to manage my agents. I want to be a coder. I want to build. And so we need to think we need uh uh collaborators in our system that we can trust. agents that really understand context, can anticipate our needs, and they know really when to step in. And then uh I think finally, we're reaching that point with models where they're getting better and better at executing end to end as long as they understand what our goals are clearly. And that's where trust really becomes this unlock where you can trust the system to know what's missing, to fill in the gaps, and to really keep progress moving forward while you manage on something else where where while you focus on what matters most. And essentially, we want jewels to do the dishes without being asked. So most AI developer tools today are fundamentally reactive. you open up your CLI or your ID and you ask the agent to do something and it responds or it waits for you to start typing and then it autocompletes a suggestion. And there's a benefit to this model. It's very efficient. It only uses compute when you explicitly ask for it. But the real question I'm asking myself is, is this how I want to manage AI? And if you think about in the future, imagine a world where compute is not a limiting factor anymore. Instead of a single reactive assistant for instructions, you could have dozens of small proactive agents working with you in parallel, quietly looking for patterns, noticing friction, and taking on the boring tasks that you don't want to do before you even ask. It can do things like fixing authentication bugs that you've been avoiding. uh updating configs, flagging potential order uh errors, preparing m uh migrations and all of this can happen in the background triggered off of things in my natural workflow. So I really think there are four essential ingredients that make up proactive systems today. There's observation. The agent has to really continually understand what is happening and of what your code changes are, what your patterns are, what your workflow is, etc. to get context about your entire project. And then there's personalization. And this one's difficult. It has to learn how you work, what you care about, what you tend to ignore, what your preferences are, the code that you absolutely don't want to ever touch. And then it has to be timely as well. If it comes in too soon, it's going to interrupt you. And if it's too late, then the moment is lost. And it also has to work seamlessly across your workflow. It has to insert itself into spaces where you naturally work already in your terminal, in your repository, in your IDE, not forcing you to go somewhere else to some application that's secret or that you forgot about. So, bringing all these tools together, you can imagine, is not trivial. >> So, I was running this presentation. Um, and uh, you you want to be able to ask your agent to understand your workflow and anticipate your needs and then intervene at exactly the right moment without breaking your workflow. And that's when it really starts to feel like magic. The interesting thing is pro these proactive systems, they're all around us today. One of my favorite examples is Google Nest where you put it in your house, you install it, and then you configure it and then it starts to learn your habits as you leave the house, as you come back, uh, as you go to sleep, as you wake up in the morning. And then pretty soon, you don't have to think about climate control in your house anymore because it's learned what your habits are. Another one is your own body. your heart rate elevates as you go for a run or start to work out or it anticipates that you're about to fall and so it reacts before you consciously think I'm going to put my hand out. So when you look at it like that proactivity is actually not that proactivity for AI is actually not that futuristic. It's very familiar and it is very human and that's exactly the point. What we're building is tools that behave more like a good collaborator and less like command line utilities. So we're already doing this in this tool called jewels which is this uh proactive asynchronous autonomous coding agent from Google labs. And we're doing this in kind of three levels of of uh proactivity. Level one is where a collaboration really starts to emerge. And this is how Jules works today where it can detect things like missing tests, unused dependencies, unsafe patterns, and then it starts to automatically fix those things as it's doing other other tasks that you've asked it to do. This is sort of like this attentive sue chef in your workflow where it's keeping the kitchen clean, the knives sharp, the kitchen uh stocked so that you can focus on what comes next. And that's the beginning of proactive software. At level two, the agent becomes more contextually aware of the entire project. It observes how you work, the code you write. If you're a back-end engineer, maybe you need help with React. If you're a designer, maybe it wants you to may maybe it'll help uh uh write the database schema. And then it learns what your frameworks are and what your deployment style is, etc. And this is the kitchen manager. This is the person in your workflow keeping the rhythm and anticipating what you need next. And then comes level three. And this is what we're working on pretty hard right now going into December. And I'll show you a little bit of what we're what we're going to be shipping in December in a minute. But level three is where things start to converge around that context. It's where the agent starts to understand not just context, but also consequence. How these choices are actually affecting the users of your products, the performance, and the outcomes. And at that level, we have this thing jewels. We also have an agent called Stitch, which is a design agent. and another one we're building called insights which is a data agent and they're all coming together to build this collective intelligence across your application. Jules can see what's breaking in the software. Stitch understands how users are interacting with it and insights connects behaviors from real world signals like analytics, telemetry and conversion rates. And then together they can propose improvements across boundaries of how the system all works together. doing things like performance fixes to improve UX and then design changes to prevent regressions and then all of that is organized based on live data. So the trick here is that the human stays firmly in the loop. You're observing what the agents are doing. You're refining when you when they when you need to intervene and then you're redirecting it when it has when it has been misdirected. So level three isn't really about autonomy anymore. It's actually about alignment to your project. A a agents and humans collaborating together across the full life cycle of your project. So right now Jules is focused on this code awareness piece. It understands the environment, the frameworks and the project structures and we're moving towards more of that system awareness. So things that we're introducing in Jules now, we've added something called memory which I'm sure a lot of you are familiar with. It's the ability for Jules to write its own memories and you can edit them and interact with them. It can edit them and it understands that and builds this memory and context and knowledge of of your project as you work with it. We've added a critic agent which works adversarially with Jules to make sure that the code is is high quality but then also does a full code review. And then we've added verification where Jules will write a playwright script, take a screenshot and then put that back into the trajectory for you to validate. And then we're also doing things like adding uh a to-do bot that will look through your code and look through your repository and pick up on anything that where you've said this is a to-do I want to get to in the future and it will start to proactively work on those things with that context. We're also adding in things like best practices where Jules will understand best practices and start to suggest those and also environment setup. We have an environment agent that we use internally for running EV valves and we're extending that externally to better understand how environment how your environments work and and set those up for you. And then we also are adding something called a just in time context. It's like a jewels cheat sheet where if it's doing something very specific it can and get stuck it can just immediately look at that cheat sheet instead of reaching out to you. So, this is all moving Jules very close to being that proactive teammate, not just this reactive assistant. Okay, so this morning I was talking to my team back in San Francisco and I was thinking, okay, I'm going to do a live demo, but the live demo gods did not align with me this morning. We still have CLS that are being pushed to staging right now. So, I'm going to walk you through a little bit of this. And if you know Jed, he's going to, I think, be talking tomorrow. We're gonna um affectionately try to fix Jed's code here. Um, so this is a view of of proactivity and this is this is Jules where you prompt it and the first thing you that you do when you configure and enable proactivity is Jules will index your entire uh codebase. It'll index your directory and start looking for things that it can do and then it'll that'll show up on the screen. So right here we're looking at a little bit more in this um in this repository ADK Python and uh and it's indexed the repository and it's found a bunch of to-dos. It's found a bunch of best practices that it can update and it's giving me some signal about what it's finding. And so you can see the signal is high confidence, medium confidence, and low. And so it's actually telling me what it thinks it can achieve based on what's in my code and what it wants to do. And that's so it has high confidence in green, medium and purple, low and yellow way down at the bottom. Um, and so I can go through this and I can manually click these and say I want to start these. And so I don't have to think about the prompt. I don't have to look at the code. I don't I I can do kind of less cognitive load here. We're working on something to just start these automatically. And so that's coming in the future. But I can also delete these. I can say, "Hey, this one isn't isn't for me. Isn't good." And so once it gets started on a task, I can kind of drill into it and see a little bit more. I can peek into the code that it is suggesting uh that uh it's suggesting it work on. I can find the location of that code. And it also gives me some rationale about why it wants to work on that code, why what it's doing, etc. And so it's giving me a lot more context and helping me trust that it knows what to do here. Okay. So that's proactivity that's coming in December and hopefully we'll be able to give that to everybody here. We're very excited about it. And I want to tell you a little story about uh something my husband and I were working on just to kind of set set wrap things up. We uh tinker a bunch with hardware and we live on this slow street in the middle of San Francisco and hate ashbury district. So on Halloween, we get a lot of people walking by our house. And so we were trying to take advantage of that with our Halloween decorations. And so we built this six-foot animatronic head that sits in the front of our house. It's this old Victorian house. And he sculpted it out of foam, epoxy, and fiberglass. And then I our our kids also called this lovingly the bald head. And it's based off of, if you ever see saw Peewee Herman from the 80s, it's based off of the Peewee Herman Peewee's Big Adventures head. Um, so while my husband was doing this, I was spending my time working with Jules on updating the firmware, controlling the stepper motors, working on the um on the LEDs and the sensors. And for me, that's the fun part for me is like really getting creative with what the LEDs are doing. So I wanted to focus on that, the LED animations. But I ended up spending most of my time actually fixing bugs and swapping libraries and doing things like that. So what I would do is I would prompt Jules, I'd wait 10 minutes and then I would repeat. And I found that process very very tedious. And what I wanted was actually Jules to do the research. I wanted it to handle the the ugly parts where it was researching how to fix a bug, uh doing the debugging itself. And I wanted it to do this so that I could focus on the creative parts. I wanted the eyes to move and like follow people as they walk down the street and like have lasers coming out of its eyes and stuff like I mentioned it was Halloween. It was very scary. Uh and and this but but I couldn't really do as much of that and I ended up actually not shipping as much as I wanted to with this animatronic bald head. And so it's that gap that we actually want to close. It's the space between with jewels. the space between that tool friction and creative freedom that we're trying to unlock with these kinds of proactive agents. So, what I really want you guys to take away from it, I give this advice to the the folks on on the Duels team a lot, is that the product we build today actually won't be the project the products that we have in the future. And I think a lot of us know that. But in reality, I want everybody in this room and everyone building working with AI to be able to take those big steps. I think the patterns that we rely on today, Git, uh your your idees, even the code, how we think about the code itself might not exist a year from now, might not exist six months from now. And that's the exciting part for me. It's sort of we get to invent the future right now. we get to describe and decide how software is made and built. Uh kind of all the people in this room. So my my challenge to you is to not be afraid to question the old ways of how you're building software because really the future is coming faster than any of us know. It's probably already here and the cool thing is we get to build it together. Thank you. [applause] [music] Our next talk is a case study from the enterprise on incremental rollout of AI. Here to provide us with a blueprint for making AI transformation fundable, governable and real inside large risk averse organizations is engineering leader at Northwestern Mutual, ASAP board. >> [music] [applause] >> Doesn't this look like something's going to drop from the ceiling? Like a ground zero type thing? [snorts] Be honest. Like, who has the buzzer that if I'm I really suck, they press it and everything falls down through the trap door? No. >> Be careful. >> Yeah. Okay. Who was it? Okay. you tell me if I'm doing okay or if I should take a couple steps back. Right. So, hi everyone. I'm Assaf. Um, and I'm here to talk about Genbi. And kind of first disclaimer, this presentation was not created with Gen AI. Um, to be honest, I actually started doing it uh with uh GPT03 back in August. Uh, [snorts] and then I did kind of a first draft and then a couple of weeks back I wanted to come in and refresh it before the conference and then GPT5 took over, completely messed up my slide, so I ended up doing it manually kind of oldfashioned. So if I'm missing like an M dash somewhere in the middle, let me know after. Okay. [snorts] Uh, so first of all, a bit of housekeeping. What's GenBI? So it's a fusion of Gen AI and BI. It's basically an agent that helps people answer business questions with data like a a business intelligence person would do in real life. Uh the reason that we're pursuing GenBI is really because of the data democratization that you can bring, right? So having access to data at your fingertips without having to be reliant on a BI team that helps you find a report, figure out what it means, uh understand your world before they can even give you any kind of input. Uh so that's GenBBI. Uh, a bit about Northwestern Mutual. That's where I work. So, we're a financial services, life insurance, and wealth management. Been around for 160 years. Uh, [snorts] some very impressive numbers there. But first of all, I want to say why is Northwestern Mutual a great place to do Gen AI. We got a lot of data, we got a lot of money, we got a lot of use cases, and we got access to some of the best talent uh, anyone can dream of. Really truly humbled by the people that I get to work with. Um, but on the flip side, why is it hard to do Gen AI at Northwestern Mutual? Because it is a very riskaverse company, right? If you think about it, our main motto is generational responsibility. I call it don't f up. Uh, because what we end up selling to people is a decadesl long commitment, right? you buy life insurance now, uh, if you stay with us until it comes to term, so to speak, that can be 20, 40, 80 years down the line, depending on when you buy it and how long you get to live. And so stability is something that's very important for us because it's important for our clients. So, how do we balance stability with innovation? That's what I want to talk about today. Um, and really the four main challenges that we had when we even came up with the idea kind of a pie in the sky Genbi concept. Uh, [snorts] first of all, no one's done it before, right? Truly, no one's done Genbi in this fashion in the past. Uh, secondly, and this was really a preference for us, we wanted to use actual data that's messy because we knew that those were that's where the real challenges are going to be, right? understanding actual messy data for 160-y old company and how can we perform well within that ecosystem. Um the third was kind of a blind trust bias. So um the bias the trust that you had to build was both with the users but also with the leadership of the company, right? How can we bring accurate information, accurate answers to people when uh all of these things that we know about and everyone's talked about is is just out there, right? No one's blind to the trust barriers. No one's blind to the accuracy barriers. So, how do we convince that this is actually something that we can trust in the company? And lastly, um but really firstly, when we go to approach this from an enterprise perspective, budget impact, right? How do we convince someone in a leadership uh organization where risk averse is ingrained in the DNA to even invest in something like this that no one's done before? We don't really know how we would do it. Uh we're not even sure how it would look like when it comes to turn. Uh so I'll start kind of one by one. Uh and first of all really talk about why we chose to use actual data uh and not synthesized data or cleanse data. Uh [snorts] so really it's about making sure that we understand the actual complexities that we will have to face when we eventually want to go to production right we know that you know building uh PC's and demos is so easy but the gap from PC to production is so broad uh especially in this gen AI space especially because we don't know upfront how to design the system what we would expect it to behave like so making sure that we operate with real data just gave us that extra confidence that when something works in the it's very likely to also work in reality. Uh but also and maybe not uh in the least less important is that we got to work with actual people who work with the data day in and day out and that gave us two things. Okay, first of all subject matter expertise which are super critical for us to be able to validate that the system is actually working gave us a lot of real life examples of what people are actually asking in a corporate and what people have answered to them. So basically the eval right and all the testing and stuff. Uh but at the end of the day it also brought the business to be a part of the research project itself and they became kind of bought into the idea as part of the process. So we didn't just test something in the lab and then had to convince someone to go ahead and use it. The end users were part of the research process itself. And so when eventually it matured enough so we can take some of that to production, they were already there and they actually were pulling that. They told us we want to take this, how can we wrap it? How can we package it uh quickly enough so we can put it into practice. Uh and the next part was really about building trust. Uh so this is about building trust first of all with our management team. All right. Now, I don't know about you, but last time that I got a million dollar to do a research project that I wanted in a pie sky idea. I woke up from the dream and I realized that this is not how things work in reality. You don't just get a million dollars and go ahead and try something out. Uh you had to show that you know what you're doing. And part of what we did, it's kind of listed out here, but obviously, you know, we did all the regular stuff, right? We worked in a sandbox environment. We made sure that we're not using actual client data. We made sure to put in all the security risk aside. But uh one of the first approaches that we said we're going to take is we're not just going to build a tool that's going to be uh released to everyone, right? We understood very quickly that um how people interact with the tool, their ability to verify that what they're getting is right and also give us feedback changes dramatically depending on their expertise and understanding of the data. So we took that crawl, walk, run approach that basically said we're first going to release it to actual BI experts, right? people that would be able to do it on their own and know what good looks like when they get it. And we're just going to expedite the process for them kind of like a GitHub co-pilot. The next phase would be to bring it to business managers and again people who are closer to the BI team, but when they see a mistake, they can pretty much figure out that what they're seeing is wrong because they're used to seeing that on day-to-day basis. um and they will might be less sensitive to these types of mistakes and be more inclined to give us that feedback instead of just, you know, dumping it aside and never using it again. Giving this type of tool to executives in the company, I don't even know when we're going to get there, right? Like an executive, they want clear, concise answers that they know they can trust. We're definitely not there yet. I think that's the vision uh at some point in time, but the system is not accurate enough for us to get there. Maybe it never will be. Um, [snorts] another way that we another liver that we kind of used to build inherent trust in the system is that we said, well, in the get-go, we're not going to even try to build SQLs, right? This is very complex. This is very hard even for a person. So, we said step number one, let's just bring information that is already in the ecosystem that's already verified, right? We have a lot of uh certified reports and dashboards. Um and actually in the conversations we had with some of the BI teams that we worked with, they told us guys like 80% of the work that we do is basically sending people to the right report and helping them figure out how to use it. So the report is already there. Um and that again built some inherent trust into how we architected the system because we said we're not going to make up information. we're just going to deliver you the same asset that you would have gotten anyway just in a much faster much more interactive way. Uh and that was the alignment of expectations that we did very upfront with the uh users and also with the management team. Now [clears throat] the biggest um process or kind of the most important approach that we took when uh approaching our leadership team and convincing them that we want to do this was to create a very gradual incremental process that gave them a lot of visibility and control. [snorts] Uh and it was very important for us to build incremental deliveries throughout that process so that uh not only did they have the the visibility into what are we funding now, what do we get out of it, they actually had business deliverables they could realize potential from throughout the process and at any point in time they could pull the plug right and say okay like it's not working well or we got enough out of it or you know the next phase is so you know unknown and long that we don't want further invest in it. And this is how we basically broke it down. So phase one was just pure research, right? We kind of did the shift from natural language to SQL. We figured out how to write responses. We figure out how to understand questions if coming in. Just kind of setting the stage. Phase [snorts] two was about really understanding, okay, so what does good metadata and good context look like in the perspective of a BI agent, right? It looks very different if you're just chatting with something or if you're trying to do a rag with you know unstructured data like documents and uh business knowledge and stuff like that. And this phase on its own already had uh impact on the business because when we define what good metadata looks like for an an LLM uh we could immediately apply that also to just the ecosystem of data users across the enterprise. Um, and by understanding how to extract LM from the information, we could also how to extract metadata. Sorry, here's where the [snorts] trap door comes into play, right? Um, we could also project that on how or what good metadata looks like for humans interacting with the data. We have another initiative around semantic layer going on which tries to model exactly that and this provided a very valuable input to that initiative as well. But the immediate next step was basically just doing this kind of uh multicontext semantic search, right? People coming in asking different questions and having the system figure out what's the right context, what's the right information we need in uh bring them. And this is something that could already be packaged as its own product and delivered uh and basically just do kind of a data finder and data owner finder which is something that could take anywhere between two to maybe four weeks in an enterprise like Northwestern Mutual just finding what data exists and who owns it so I can start talk uh the conversation with them. Um and the next layer was really about pulling in information and trying to do some light pivoting around the data. Um each one of these steps as you can see also created an input to the to the following step so that the research itself was kind of self u self-propelling and there were incremental outcomes coming out of each one of these phases. Uh the next one is more kind of setting it up for enterprise level usage. So understanding roles of in uh of different users coming in what they may be asking about what type of access we want to give them etc and eventually and this is still some ways to go ahead uh building kind of a fullyfledged NBI agent which doesn't only quote information from existing reports but I can actually run SQL queries on its own uh pull in more data do more sophisticated joints between different data so it can answer more complex questions so that's the road map right that's kind the high level plan. Now, why did that work? Well, kind of quickly summarize them. We talked about uh so we get value uh early and we get value often. Each one of this was a six week sprint at the end of which we had had a very tangible deliverable coming back to the business that we could decide to productize. Uh and at any point in time, we could decide how we want to move forward. There was transparent progress. There was incremental business value. Uh each one of these steps allowed us to learn something that helped feed the next step. And maybe the most important part and that's the bottom line here and that's the part that executives really look at. How do we control the risk in continuing to invest in this type of research project and this is really about eliminating things like sun cost bias, right? We already paid you know you know whatever a million dollar. Let's just get through the project see what we get at the end. This eliminates the uh uh fear of of competitors coming in and maybe we don't need to continue investing in this right so everyone in the industry is researching GenBI and there are solutions like data bricks genie that are coming up and they're getting better and better maybe at some point in time it's better for us as an organization to actually adopt data bricks genie but at that point again first it's much easier for us to pull the plug in the funding but we already have a good understanding of what good looks like we have benchmarks that we used for ourselves when testing our own system that we can test a third party solution with. And we know what to expect, right? We know what works, we know what doesn't. We know what a kind of fluffy demo from a vendor would look like. And we know where to drill in to ask the tough questions. So let's see kind of what it looks like under the hood and how we productize different elements uh of this architecture. Uh and maybe kind of very quickly, why can't we just do it with uh chat GPT? So you know [snorts] just dumping a schema into chachpp doesn't work. Usually schemas are very messy. It's not uh easy to understand the context and the meaning of things. Uh [snorts] and eventually governance is super important. So there was a lot of governance built into the architecture that was very hard to apply on chpd from the outside but even solutions like you know data bricks genius third party much harder to govern from the outside than from the inside. But still TBD. Uh so the stack kind of looks like this. Uh we have a data and metadata layer that we produced. We have four different agents that are running across the pipeline. A metadata agent that understands the context. A rag agent that finds the different reports. An SQL agent that can pull more data if we need that. And then eventually what we call a BI agent that takes all that information and delivers an answer to the question that was asked. On top of that, we slap governance and trust and orchestration and eventually some kind of a contextual UI. Um and this is how the flow goes. So when a business question comes in, we uh push it into the orchestrator and basically decides how to facilitate the process. The first thing that we do is understanding the context. So that's where that metadata agent comes in. Works with the catalog, works with all the documentation that we have across the system to understand what we're being asked about and what's the relevant information to share. Then we go to the rag agent which tries to find an existing report again out of a list of certified reports that we know are allowed for people to use and people have spent a lot of time fine-tuning them and making them as accurate as possible. If we can't find the report or if it's not exactly what we need to um to use, that's where we go to the SQL agent that basically tries to create a more um exact query or a more elaborate query. And even if the report that we have is not usable as is, it gives us that initial seed of a query that we can then expand on rather than having to build one from scratch. So it's kind of like a fewot uh example, but in this case the example that we gave is very very close to the actual result that we're expecting to get. We then execute it against the database pull and push it into the BI agent which gen with which gen uh translate that to a business answer and not just dumping data back on the user and this is what goes into the final answer. Now there's obviously some kind of a loop that says if I'm in the same conversation I'm probably talking about the same data so we don't have to talk about this or do this again and again. Now each one of these three components, each one of these three agents can be packaged as its own product and delivered to production with a very tangible and actual impact on business metrics. Okay. And that's the kind of beauty of this uh approach that after we productize each one of these, we could have basically said stop or let's move forward. uh and just some giving bottom line numbers around some of these. So just the rag agent that pulls the right report uh allowed us to take about 20% of the overall capacity of the BI team that basically said uh all we do is just share the right report with the right person. So we were able to automate around 80% out of those uh 20% and we're talking about a team of 10 people. So roughly two people full-time job all they do is find the right report and send it to the right person. uh the metadata understandings that we got from learning how to interact with the data through an LLM allowed us to run AB test in a in the semantic layer project that we did and that allowed us to prove back again to the senior leadership in the company that there is value and tangible value measurable value in enriching metadata. And we did that basically by running uh a a battery of questions um against a database that had good metadata and one that didn't have good metadata. And we show how much better an LLM performs when having the right metadata in place. So basically proving the value of something that can be very fluffy like hey let's bring in more documentation into the code. Uh right now we're experimenting with the data pivoting bot. Uh so once you have a dashboard or a report be able to change the time horizon some of the views some of the segmentations and the groupings of the data again kind of real time without having a person do that for uh a business stakeholder and some of the next steps is really evaluating the tools that are out there for uh Genbi like data bricks genie for example and we're going to go into a much more rigorous process of enriching our catalog with metadata and documentation and that's also going to come out of a lot of the learnings that we got from uh the research that we've done. So even if we don't end up writing a GenBI agent full-fledged end to end, we already got a lot of value back from this and this is really what allowed our senior leadership team to continuously invest in this project quarter over quarter. One thing that I want to wrap up with is just a couple of thoughts I had about the future. So um I think we talk a lot about how to prepare data. I think that's going to be a huge area in the market and they're going to be probably a lot of companies and tools that are going to help us with that. Uh building very specific task specific models and applications. I think a lot of startups and companies are going to come up from that area. Uh co-pilots is really making sure that we meet the users where they are. Uh and securing of models obviously a very big thing. The last thing is the one I the the one I want to focus on the most because that's kind of a recent thought that came to me a couple of weeks ago. How we do pricing of SAS in the Gen AI era. Uh this is really about the fact that one individual person today can be 10x more effective uh than they used to be in the past. And then do we price uh software based on seats or do we price software based on how much they used it or do we price software based on the value that they got out of it? Uh Salesforce is already experimenting with that. So that the data cloud product at Salesforce is starting to be uh usage priced and not seats priced. And I think this is going to have a big impact on just the uh kind of SAS economics worldwide. uh and it it doesn't even matter if the product itself is genai. It's really about what does the person using the product can do and what can they do in their other time uh and whether it still makes sense to price it by how many employees you have or how much work you get done with the employees that you have. That is me and thank you very much for listening and thanks for not opening the door on me. [applause] Our next presenter [music] is the head of technology infrastructure engineering at Bloomberg. He's here to tell us what they learned deploying AI within Bloomberg's engineering organization. Please join me in welcoming to the stage Le Jang. [music] All right. I don't have a joke about the dot. I don't have a joke about the uh hot dog either. So I will just jump to the topic right away. Um so my name is Lei. Um I lead the uh department of technology infrastructure in Bloomberg. So we're basically a group of technologists focus on global infrastructure think data centers connectivities um developer productivities uh think SCS tooling and also uh reliability solutions think telemetry and instant responses right so um depends on audience sometimes uh you know you're familiar with what Bloomer is sometimes you don't so I thought it might be a good idea to talk a little bit about our Um so there's no better way to talk about our company by sharing some numbers. I want to highlight a few numbers. We have more than 9,000 engineers and most of them are software engineers. Uh we handle a lot of market ticks uh which in the billions and 600 billions I believe. And um we also have tons of folks uh focus on AI research and engineering. So we have a more than you know really today's 500 plus employees focused on AI products uh for um sort of our customers. So takeaway here is we are I guess you know building a lot of software and use a lot of data to empower our flagship product which is called the bloomer terminal and to really support our users to make the most important f decisions for them to do their job. um the best um in the technical lens um a lot of time kind of to explain that we actually have one of the largest private network uh in the whole world. We also have one of the largest JavaScript codebase um in the world. Um we because the domain we're in uh so from terminal is really you can think of a um software that supports thousands of different applications. Uh we call them functions right? Um email is a function. Uh news is a group of functions. Um let's say fixed income price to yield calculation to spread calculation is another function. um trading workflows is another group of functions. So there's many many many different type of functions as you can imagine we kind of have to utilize different technologies to really support those uh functionalities. Uh we also been increasingly more than used but also contribute to open source communities. um for this audience I guess I want to call out you know we kind of helped creation of the case serve envoy AI gateways and among many many other things that that we deploy in-house and support the communities again in summary there's a lot of software there's a lot of data uh we kind of have to um figure out how to make the best of AI tooling to support us to do our engineering work all right so get to what is AI for coding Um we started about two years ago maybe a little bit more than that. Um and as I guess the rest of the world we look at the toolings provided and you know I apologize if if your logos are not here. Um but as you can imagine it's kind of like overwhelming right there's so many things and every day there's news about this is great this is great. Um so at the time we actually didn't know what all the AI solutions can help us to uh boost our productivities as well as stability. But one thing we knew at the time is um unless we deploy and try we wouldn't know what's the best way to benefit from all the awesome work and and you know a lot of folks are contributing to. So at the time uh we quickly form a team people start kind of like release um kind a set of capabilities so that people start iterating on um utilizing the toolings and then of course you know we are data company so kind of want to get a sense of how we measure the impact and um what we can do from the capability we provide right so we look at the typical developer productivity measurements We ran a few survey. Uh it was very obvious that people felt like there's much quicker uh proof of concept, people rolled out tests. U there's a lot of one time use scripts being generated and then the measurements dropped actually pretty quickly when you go beyond all the green field type of thing, right? And then then we start thinking like okay so what are the things that we should really be doing using all those wonderful things so that we can really make a dent um in the in the space and then at this time we also kind of like also be thoughtful of um unleash a very powerful tooling right uh the the benefits is it's very fast the challenge is also it's very fast, right? Um, for any of you who actually dealt with hundreds of millions of lines code, you probably understand the system complexity is a at least um exponential or at least polomial as function of your line of code or software assets, right? So, at some point you kind of want to be very careful uh what you do with your software assets. And what we thought so maybe we should look at some of the basics. One idea we had is um all right so AI for coding there's narrow definition of what coding is but there's also a broader definition of what software engineering right and then maybe we can also look into some of the work our developers don't really prefer to do for instance um some maintenance work some of the migration work some of the I don't know maintenance work and stuff like that so I want to give some examples of the things that we been trying and we think there's pretty good return on investment. So the question we ask ourselves is how do we evolve our codebase right the first one is all right wouldn't it be cool uh the day you get a ticket say hey you know this piece of software needs patched and at the same time you have a pull request with the fix with a patch and also with thinking why the patch happened that way right so it's kind of like we're trying to uh broadly deploy something called uplift agents um broadly scan through our codebase and figure out what the patch would be applicable and be able to apply those patch step back a little bit. We did have a reg based refraction tool. Um it works to some extent but it's limited right now with um LMS and other tooling. So we are able to uh see very much better results from the um uplift agents. So there are a few challenges in case you also plan to deploy such capabilities. The first one is I guess any AI or ML it would be really nice if there's some deterministic verification capability. uh oftentimes it's not so easy especially if you have test cases you don't have good llinter if you don't have good verification the the patch can sometimes be uh uh difficult to to to be applied and uh one thing we also realized when we deploy AI tooling is the average open pull requests increased and time to merge also increased uh because you're spinning a lot of new code and then still we have to review the code and merge the code right so time to merge merge become a challenge sometimes and the last one is um I think it applies to any gen is the shift becomes what do we want to achieve rather than how we want to achieve right so the second example that I I want to share is uh the other area that people kind of like sometimes really impact our productivity in a negative way or impact our stability in negative way is how we handle instance so we're trying to develop and then deploy deploy um in response agents. Um now the importance of this is if you really think about DNI tools, it's really really fast and it's also unbiased, right? Instance, it can go through your codebase really quickly. It can go through your telemetry system very quickly. It can go through your feature flags very quickly. It can go through your um I don't know core trays very quickly. and in I unbiased lens when we do troubleshooting sometimes we have this biased views okay it must be this it turns out to be not the case so there's many many interesting benefits um by uh deploying agents from this perspective and then the second question is become interesting is imagine you have organization of 10,000 pe um let's say 9,000 people as I described a lot of people trying to fix those problems And you can have 10 teams who wants to build a pull request review bots. You have 20 teams who wants to build a instant response agents. Right? They become very quickly chaotic and sometimes can have duplications. So before I talk about the p pass, I'm going to give example of the uh instance response agent. So basically this is what you know a instance response agent will look like. Um the key part is we're going to need to build a lot of MCP servers to connect to the uh the metrics and logs dashboards you have connect to the topology you have whether it's network topology or it's the um your service dependency topology uh your alarms your triggers right your SLOs's and then we kind of don't want people just start building MCP servers uh without a pay pass so we created a pay pass in partnership with our AI my organization and I will talk a little bit what that means. Before that um I do want to explain a little bit some of the platform principles. Some company allow teams to be have a lot of freedom as at the same time responsibility in the sense a business unit can build whatever infrastructure whatever platform. um some organization have a very very strong tight abstraction of the service infrastructure and typically kind of have to use their platforms right so Bloomberg is kind of in the middle if you look at the golden ones we kind of believe in provide a golden path um with enablement teams so my team is really a en enabling team and one of the guiding principle for us is we want to make easy sense is extremely easy to do. Uh sorry, the right thing is extremely easy to do and we want to make sure the wrong thing is ridiculously hard to do. So that's the guiding principle here. Now move on. So what is the pay path here? So the pay path is uh we have a gateway so that teams can easily figure out which model works the best. They can do quick experiments. they can um we can have visibility of what kind of models being used and we can also guide through the teams which model should is a better fit for the for the problem they want to solve. uh we have a two discovery uh basically MCP directory via hub so that let's say team A wants to do something they will go to the hub okay someone is building MCP server already maybe I should partner with them to build it together right uh tool creation and deployment is via a pass it's basically a um you know a standard platform service where you can do your SDLC and and we provide runtime environment for you as well taking care of all and side of things as well so it really reduce friction of for for teams to to deploy um their MT MCP servers. And then this is kind of interesting is we want to make demo very easy so that or I really say proof concept very easy so that people can try have ideal generation uh because we believe in creativity come from some freedom of try different new things but we also want to make sure the production requires some quality control. um because at the end of the day stability and system reliability is at the core of our business. So this is sort of the path that we deployed um and enabled the rest of engineering really the 9,000 software engineers to do their job. Okay. And um with all this and then we start maybe okay yes we got p uh path we have some good ideas of how to evolve our codebase. help out our people right um now this is where I find that any new things any adoption of new things provide opportunity to leverage the strengths you have and also identify the some of the weakness that you may have so um in Bloomberg we have a wellestablished training program uh it's more than 20 years so there's on boarding training depends on entry level it depends on senior level um so we have this whole training program to prepare folks to before they join a team and what we did is we just incorporate AI coding in on boarding training program and also show them how to best utilize them with our principles and our technologies right there's a huge benefits here because um if any of you run into the challenge of adoption somehow run into a chasm right the rest of is not uh adopt as quick as possible whenever we have folks join a company they learn how to do things in new way when they go back to their team, they were like, "Hey, why don't we do that?" Right? They're going to challenge the some of the senior folks as well to say, "Hey, there's a new way to do this type of things. Why don't we do that?" So, we actually find this program extremely effective uh to be a change agent for anything we want to push out. And then bunch of results, there's a lot more familiarity and comfort with the tooling. Um and also the important part is there's lot more nuance insights of where it's at value, right? The second one is um often times we run organization to push uh new initiatives. So within Bloomer we have something called um a champ program and a guild program. That's basically a cross organization or tech communities where people have similar interest and similar passion. They get together and get stuff done. So um we had this for more than 10 years now. uh we sort of bootstrapped engineer AI productivity community two years back leveraged the community we have already and then have some few results uh because we have this pretty much everyone passionate about this and will be in that community so organically it dduplicate efforts and there's shared learning uh shared learning happening and it also helps to boost inner source contributions and the vis engineer idea right often times Team A wants to do something, team B, let's say a platform team have different prioritization and the way we solve this is via inner source or via visit engineer. We just move someone over the team work for six months a year get it done and then we can move on. Um the last one is interesting. So our data shows individual contributors have a much better stronger adoption than our leadership team. Now if you think about this a lot of software TLS and managers in the age of AI they kind of don't really have um enough experience to truly guide their teams to build software right so often times the stuff that they learned before might not be exactly applicable it's still very valuable but there's some missing piece there to make sure they can continue to guide the team to do the right thing. So, we're rolling out the leadership workshops to make sure our leaders are equipped with whatever knowledge they need to have to drive the techn um innovation. So, um I'm going to close my part and to share with you what uh the part I'm I feel most excited about. The part I feel most excite most excited about is that with a lot of um creativity and innovation in the geni space, it actually changes the cost function of software engineering. Meaning the trade-off decision of whether we do something versus we don't do something actually changed because some of the work become a lot cheaper to do and some work become a lot more expensive to do. I tend to think it is a great opportunity for engineers and engineering leaders to get back to some of the uh basic principles and sort of ask a soul searching question. What is a high quality soft engineering and how can we use a tool for that purpose? So that's it. Thank you very much. [applause] [music] Our next speaker helped to reimagine a beloved browser from ARC toad by rebuilding it around AI native experiences. Please welcome to the stage head of AI engineering at the browser company Samir Motti. [music] [applause] Hey everyone. Oh wow. How's it going? Um, my name is Samir and I'm the head of AI engineering at the browser company of New York. And today I'm going to talk a little bit about how we transitioned from building ARC to DIA and the lessons we learned in building an AI browser. But first, a little about the browser company. So we started with a mission to rethink how people use the internet. At its core, we believe that the browser is one of the most important pieces of software in your life and it wasn't getting the attention it deserved. Simply put, the way we've used a browser has changed over the last couple decades, but the browser itself hadn't. And think about this, we we started this company in 2019. Um, and so this is a screen cap of Josh, our CEO, sharing a little bit about our idea on the internet a few years ago, which we endearingly called the internet computer. So, our mission has been to build a browser that reflects how people use the internet today and how we think the browser should be used tomorrow. So through years of discovery, trial and error, and some ups and some downs, we shipped our first browser, Arc, in 2022. It was a browser we felt was an improvement over the browsers of that time. It made the internet more personal, more organized, and to us, a little more delightful with a little more craft. And it was a browser that was loved by many. It still is by millions. many of whom are probably in this audience today. I've gotten a lot of questions about ARC today. Um, and it's great, but um, if we took a step back, we felt that ARC was still just an incremental improvement over the browsers of that time. And it didn't really hit the vision that we set out to create. And so, uh, we kept building. And then in 2022, we got access to LLMs like the GPT models. And so we started like we always do with prototyping. We started trying new ideas um and eventually shipped a few of them in arc. But what started as a you know a basic exploration turned into a fully formed thesis. In the beginning of 2024, uh, our company put out what we called act two, a video on YouTube, where we shared that thesis that we believe that AI is going to transform how people use the internet and in turn fundamentally change the browser itself. And so with that, we started building again, but this time we built a new browser with AI speed and security in mind and from the ground up. And later or sorry, earlier this year, we shipped DIA, our AI native browser. It allows you to have an assistant alongside you in all the work you do in the browser. It gets to know you, personalizes, helps you get work done with your tabs, and effectively get more work done through the apps you use. And while it hasn't achieved our vision yet, we fully believe it's well on the way, too. So it is not easy to build a product. You all know that. Let alone two. The latter of which an AI native one. We've had a lot of years of iteration, trial and error. And through that we've learned a lot. And I'm going to just talk about a few of those things uh here today. The first I want to talk about is optimizing your tools and process for faster iteration. From the beginning, browser company has believed that we're not going to win unless we build the tools, the process, the platform, and the mindset to iterate, build, ship, and learn faster than everyone else. And that of course holds true today. But the form it takes with AI and an AI native product has changed. So even as a small company, where are we investing in tooling these days? First is prototyping for AI product features. Second is building and running evals. Third is collecting data for training and for evals. And uh last but definitely not least automation for hill climbing. So let's start with tools. Initially uh as we always do we built some tools. The first was a very rudimentary uh prompt editor and it was only in dev builds. What did what did this mean for us? Well it meant a few things. One limited access as only engineers were able to access this. two slow iteration speeds and three none of your personal context and as you all know with an AI product the context is what matters and what's gives you the feel for whether a product is good or not. So we evolved and since then we built all of our tools into our product. the product that we as a company internally use every day and that includes the prompts, the tools, the context, the models, every parameter. Um, which has not only allowed us to 10x our speed of ideating, iterating and refining our products, but it has also widened the number of people who can access and iterate on our products themselves from our CEO to our newest hire can ideate and create a new product in DIA and also refine an existing one all with their full context. And this holds true with all of our major product protocols. We have tools for optimizing our memory knowledge graph which all of us use and we have tools for creating iterating on our computer use mechanism. We actually tried tens of different types of computer use strategies before landing on one before even building it into the product itself. And I'll say and I'll end this part with uh it actually is a lot of fun. People don't talk about that a lot but uh actually building these tools into our product has enabled so much creativity. It has enabled our PMs, our designers, uh customer service and strategy and ops to try out new ideas that are tailored to their use cases. And that ultimately is what we're trying to do. The next thing I want to talk about is how we evolve and optimize our prompts through a mechanism called Jeepa. This for us is very nent but an important learning nevertheless. How we hill climb and refine our AI products is just as important as ideating them in the first place. So we're investing in mechanisms to help with this to enable faster hill climbing and one of those being Jeepa and this is based on a paper from earlier this year from a few smart folks. So the key motivation here is simple. It's a sample efficient way to improve a complex LLM system without having to leverage RL or other fine-tuning techniques. And for us as a small company that's hugely critical. And how it works is you're able to seed the system with a set of prompts, then execute it across a set of tasks and score them. Then leverage a mechanism called PA selection to select the best ones. and then leverage an LLM on top of that to reflect on what went well and what didn't and then generate new prompts and then repeat with the key innovations here being on around that reflective prompt mutation technique, the selection process which allows you to explore more of the space of prompting rather than one avenue and the ability to tune text and not weights. And here's a modest uh example of this at work for us. you know, you can provide it a very simple uh a simple simple prompt and run it through Jeppa and it's able to optimize it uh along the metrics and scoring mechanisms that we uh created to refine that prompt. And so if I take a step back and talk about kind of how we build uh for certain types of features, I would buck it into a couple different phases. The first is that prototyping and ideiation phase where we have widened the breadth of number of ideas at the top of the funnel um and lowered the threshold on who can build them and how. And so we try out a bunch of ideas every week every day from all types of people and we dog food those and if we feel like there's actually real utility there. It's solving a real problem for us and there is a path towards actually hitting the quality threshold that we believe we need to hit. Then we'll move on to this next phase where we collect and refine eval to clarify product requirements and then hill climb through code through prompting and automated techniques like Jeepa and then dog food as we always do internally and then chip. And I do want to kind of double down on these phases. The ideation phase is extremely important just as much as that refinement phase. And our goal is to enable faster ideation and a more efficient path to shipping because with all these AI advancements every week, new possibilities are unlocked in DIA. And it's up to us as a browser, as a product to get as many at bats with these new ideas and try out as many of them and explore as many of them as possible. At the same time though not underestimating the path it takes to ship some of these ideas to productions as a high quality experience. Next uh I want to talk about treating model behavior as a craft and discipline. So what is model behavior to us? It's the function that defines evaluates and ships the desired behavior models. It's turning principles into product requirements, prompts, and evals, and ultimately shaping the behavior and the personality of our LLM products, and ultimately for us, our DIA assistant. So, I'd buck it into a few different areas. First, it's that behavior design, defining the product experience we actually want, the style, the tone, the shape of responses in some cases. Then, it's collecting that data for measurement and training, clarifying those product requirements through eval. And last but not least, it's the model steering. It's the building of the product itself. It's the prompting, it's the model selection, it's defining the what's in the context window, the parameters, etc. Um, and so much more. And to us, that that process is iterative, very iterative. We build, refine, we create evals, and then we ship, and then we collect more feedback and feed that into our iterative building process. That could be internal feedback, and that could be also uh external feedback. And so I move on for a second. One analogy we've thought about uh is for model behavior is that to product design through the evolution of the internet. At first websites were functional. They got the job done. But over time that evolved as we tried to achieve more on the internet and technology advanced. Uh product design and the craft of the internet itself grew as well as well as the complexity. And so what might that be for model behavior? Well, at first it was functional. We had prompts. We had evals. We had instructions in and output out. Now we frame it through agent behaviors. It's goal- directed reasoning, the shaping of autonomous tasks, selfcorrection in learning, and even shaping the personality of the LM models themselves. And so, what might the future hold? I'm excited to see. But what we believe is that we are in the early days of building AI products and model behavior will continue to evolve and into a specialized and prevalent function of its own even at product companies. And the last thing I'll leave you with here is that the best people for it might just surprise you. One of my favorite stories about building DIA these last couple years has been uh the formation of actually this model behavior team. As I mentioned earlier, uh engineers were writing the prompts at first and then we built these prompt tools to enable more people at the company to actually prompt and iterate. And there was a person on our team on the strategy and ops team and he actually leveraged these prompt tools one weekend to rewrite all our prompts. And he came in on a Monday morning and dropped a Loom video sharing what he did, how he did it, and why and a set of prompts. And those prompts alone unlocked a new level of capability and quality and experience in our product. And consequentially uh it was the formation of our model behavior team. And so one thing I'd emphasize to you all is to think about who are those people at the company agnostic of their role who can help shape your product and help shape and steer the model itself. It might not be an engineer or it might be it could also be someone on the strategy and ops team. Next, I want to talk about AI security as an emergent property of product building. And today, I'm going to focus specifically on prompt injections. So, what is a prompt injection? Well, it's a prompt attack in which a third party can override the instructions of an LLM to cause harm. That might be data exfiltration, the execution of malicious commands, or ignoring safety rules. And so here's an example in which you give uh the context of a website to an LLM and instruct it to summarize it. Little did you know that there was a prompt injection hidden in that website's uh HTML. So instead of actually summarizing the web page, the LM actually gets directed to open a new website, extracting your personal information and embedding it as get parameters in the website's URL, effectively excfiltrating that data. So, as a browser, prompt injections are extremely crucial for us to prevent. They're critical to prevent because browsers sit at the middle of what we can call a lethal trifecta. It has access to your private data. It has exposure to untrusted content and it has the ability to externally communicate and for us that means opening websites, sending emails, scheduling events, etc. So, how to prevent this? Well, there's some technical strategies we can try. First is wrapping that untrusted context in tax. You can tell the LM, listen to these instructions around these tags and don't listen to the content around these tags. But this is easily escapable and quite trivy, an attacker could still uh leverage a prompt injection on your browser. Well, another solution we could try is separating that data and that instructions. We can assign uh the operating instructions to a system role and we can assign a user role for the content of the third party and even layer on randomly generated tags to wrap that user content to be extra sure that the LM listens to the instructions and not the content. And while this can help, there are no guarantees and prompt injections will still happen. So what do we do? Well, it's on us to design a product with that in mind. We have to blend technology approaches and user experience and design into a cohesive story that actually builds them from the ground up and solves it together. So, what that might what that excuse me what might that be for a feature in DIA? Well, let's take the autofill tool in DIA. The autofill tool allows you to leverage an LLM with context, memory, and your details to fill forms on the internet. It's extremely powerful, but as you can imagine, it has some vulnerabilities. A prompt injection here could extract your data and put it on a form, and once it's on that form, it's out of your hands. So, we try to build with that in mind. In this case, before the form is written to, we actually let the user read and confirm that data in plain text. This doesn't prevent a prompt injection, but it gives the user control, awareness, and trust in what is happening. And this is a framing we carry throughout our product and how we build every single feature. So here are some examples. Scheduling events in DIA, we have a similar confirmation step. Writing emails India, we also have a similar confirmation step. So I've talked about three different things here today. First is optimizing your tools and process for fast iteration. Second, treating model behavior as a craft and discipline. And third, AI security as an emergent property of building products. But uh the last thing I want to leave you with, when we started on this journey to building DIA, we recognized a technology shift and we sought to evolve our product of Arc. We initially came at it from a hey, how can we leverage AI to make ARC better, make the browser better. But what we quickly learned and adapted to was that it wasn't just a product evolution. It was a company one and today I shared a glimpse of that. How we build and how it's changed a team we've literally created around this and how we think about security for AI products. But really it's so much more. It goes beyond that. It's how we train everyone here. It's how we hire. It's how we communicate. It's how we collaborate and so much more. And if there's one thing I'll leave you all with, if there's one thing we've learned over the last couple years, it's that when when you recognize that technology shift, you have to embrace it. And you have to embrace it with conviction. Thank you. [applause] [music] Our next speaker [music] draws on over 20 years in enterprise developer experience to ask what will still matter when AI coding agents are everywhere. Please welcome to the stage executive distinguished engineer at Capital 1, Max Canet Alexander. [music] [applause] Hey, how's everybody doing? Still awake? Okay, great. So like the robot voice said, I have been doing developer experience for a very long time and I have never in my life seen anything like the last 12 months. The you know about every two to three weeks software engineers been making this face on the screen. Okay. And if you work in developer experience the problem is even worse. You're like this guy on the screen every few weeks. You're like, "Oh yeah, yeah, yeah, yeah, yeah. Here's the new hotness." And then somebody else comes up and they're like, "Well, can I use the the new new hotness?" And you know, people have been doing that for years. I've been working in developer experience for a long time. Everybody always shows up and they're like, "Oh, can I use this tool that came out yesterday?" And you're like, "No, of course not." And now we're like, "Uh, maybe yes." Right? And what this leads to overall is the future is super hard to predict right now. So a I think a lot of people a lot of CTO's a lot of people who work in developer experience to people who care about helping developers are asking themselves this question are all of my investments going to go to waste? Like what could I invest in now that if I look back at the end of 2026 I'll be like I sure am glad that I invested in that for my developers. And I think a lot of people have just decided well I don't know. I guess it's just coding agents and I guess they'll fix every single thing about my entire company by themselves which look they're amazing they're transformative but it's not the only thing that you need to invest in as a software engineering organization. So we can clarify this by asking ourselves two questions. The first one is how can we use our understanding of the principles of developer experience to know what's going to be valuable no matter what happens. Okay. And what do we need to do to get the maximum possible value from AI agents? Like what would we need to fix at all levels outside of the agents in order to make sure that the agents and our developers can be as effective as possible? And this isn't like a minor question. These are the sorts of things that could make or break you as a software business going into the future. So let's talk about what some of those things are that I think are no regrets investments that will help both our human beings and our agents. So the in general one of the framings that I think about here is things that are inputs to the agents. Things around the agents that help them be more effective. And one of the biggest one is the development environment. What are the tools that you use to build your code? What package manager do you use? What llinters do you run? Those sorts of things. You want to use the industry standard tools in the same way the industry uses them and ideally in the same way the outside world uses them because that's what's in the training set. And look, yes, you can write instruction files and you can try your best to try to fight the training set and make it do something unnatural and unholy with some crazy amalgamation that or modification that you've made of those developer tools. Like you might be you invented your own package manager. You probably should not do that. you probably should undo that and try to go back to the way the outside world does software development because then you are not fighting the training set. Um, and also it means it means things like you can't use obscure programming languages anymore. Look, I'm a programming language nerd. I love those things. I do not use them anymore in my day-to-day agentic software development work. as an enthusiast, I do come sometimes go and I code on, you know, frontline uh software engineering languages, but not in my like real work anymore. So, what people ask me sometimes, does that mean like we're never going to ever have any new tools again because we're always going to be dependent on the tools that the model already knows? Probably not because like I said, there's still going to be enthusiasts. And also, but like I would like to make a point. The thing that I'm talking about has always been a real problem. Like there's always some developer at the company has always come up to you and be like, "Can I use this technology that came out last week and has never been vetted in an enterprise to run my like 100,000 queries per second service that serves a billion users?" And I'm like, "No, you can't do that now and you can't do that yesterday. It's still the same." Uh, another one is in order to take action today, agents need either a CLI or an API to take that action. Yes, there's computer use. Yes, you can make them write playright and orchestrate a browser. But why? Like if you could have a CLI that the agent can just execute natively in its normal format that it understands the most natively, which is text interaction, why why would you choose to do something else, especially in an area where accuracy matters dramatically and where that accuracy dramatically influences the effectiveness of the agent? One of the most important things that you can invest in is validation. So any kind of objective deterministic validation that you give an agent will increase its capabilities. So yes, sometimes you can create this with the agent. I'm going to talk about that in a second. But it doesn't really matter how you get it or where you get it from. You just need to think about how do I have high quality validation that produces very clear error messages. This is the same thing you always wanted by the way in your tests and your llinters, right? But it's even more important for the agents because the agents cannot divine what you mean by 500 internal error with no other message, right? Like they need a way to actually understand what the problem was and what they should do about it. However, there is a problem here. So, you know, you think, okay, I'll just get the agent to do it. They'll write my tests and then I'll be fine. But have you ever asked an agent to write a test on a completely untestable codebase? They do kind of what it's like is happening on the screen here. They will write a test that says, "Hey boss, I pushed the button and the button pushed successfully. Test passed." Um, like so there is a sort of a a larger problem that a lot of enterprises have in particular, which is there's a lot of legacy code bases that either were not designed with testing in mind or were not designed with like high quality testing in mind. like maybe they just have like some very high level endto-end tests and they don't have like great unit tests that the agent can actually run iteratively in a loop and that will produce actionable and useful errors. So another thing that you can invest in that will can be perennially valuable both to humans and to agents is structure of your systems and structure of your code bases. Agents work better on better structured code bases. And for those of you who have never worked in a large enterprise and seen very old legacy codebases, you might not be familiar with what I'm talking about. But for those who have, you know that there are code bases that no human being could reason about in any kind of successful way because the information necessary to reason about that codebase isn't in the codebase and the structure of the codebase makes the codebase impossible to reason about looking at it. Yes, you the agents can do the same thing human beings do in that case, which is sort of go through an iterative process of trying to run the thing and see what breaks, but that decreases the capability of the agent so much compared to just it having the ability to just look at the code and reason about it the exact same way that human capability is decreased. And of course, like I said, that all has to lead up to being testable. If the only thing I can do with your codebase is push a button and know if the button pushed successfully and not see the explosion behind it, like if if there's no way to get that information out of the codebase from the test, then the agent's not going to be able to do that either unless it it goes and refactors it or you go and refactor it first. And you know, there's a lot of talk about documentation. There's always been a lot of talk about documentation in the field of developer experience, in the field of improving things. And there's people go back and forth about it. Engineers hate writing documentation. Uh, and the value of it is often debated like what kind of documentation you want or don't want, do or don't want. But here's the thing. The agent, let's just take this in the context of the agent. The agent cannot read your mind. It did not attend your verbal meeting that had no transcript. Okay? Now there are many companies in the world that depend on that sort of tribal knowledge to understand what the requirements are for the system. Why the code is being written? What is the specification that we're we're writing towards if things are not written down? And that sounds like blatantly obvious but like there are a lot of things that are fundamentally written like if the code is comprehensible like all the other steps are in that we've gotten to so far. you don't need to reexlain what's in the code. So, there's actually probably a whole class of documentation that we may not need anymore or you can just ask the agent like, "Hey, tell me about the structure of this codebase overall." and it'll just do it. But it won't be able to ever know why you wrote it unless that's written down somewhere or things that happen outside of the program. Like what is the shape of the data that comes in from this URL parameter as an example like if you have already written the code there's a validator and that does explain it but if you haven't written the code yet it doesn't know what comes in from the outside world. So basically anything that can't be in the code or isn't in the code needs to somehow be written somewhere that the agent can access. Now, we've covered sort of a few technical aspects of things that we need to improve, but there's a point about software development in general, and it that's always been true. And one of and that's you've heard this, we spend more time reading code than writing it. The difference today is that writing code has become reading code. So even now when we are writing code we spend more time reading it than actually typing things into the terminal. And what that means is every software engineer becomes a code reviewer as basically their primary job. In addition, as anybody who has worked in a in a shop that has deeply adopted Agentic coding, we generate far more PRs than ever before, which has led to code review itself, the like the big scale code review being a bottleneck. So one of the things that we need to do is we need to figure out how to improve code review velocity both for the big code reviews that we like where we you send a PR and somebody like you know writes comments on it you go back and forth and also just the iterative process of working with the agent. How do you speed up a person's ability to look at code and know what to do with it? So the principles are pretty similar for both of those, but the exact way you implement them is a little bit different. What you care about the most is making each individual response fast. You don't actually want to shorten the whole timeline of code review generally because code review is a quality process. It's the same thing with agent iteration. Like what you want with agent iteration is you want to get to the place where you've got the right result. You don't want to like just be like, "Well, I guess I've hit my five minute time limit, so I'm going to check in this garbage that doesn't work, right? You you But what you do want is you want the iterations to be fast." Not just the agents iterations, but the human response time to the agent to be fast. And in order to do that, they have to get very good at doing code reviews or knowing what the next step is to do with a lot of code. At the big code review level, one thing that I see that I think is sort of a social disease that has infected a lot of companies is when people want PR reviews, they just send a Slack message to a team channel and say, "Hey, could one of the 10 of you review my PR?" And what and you know what that means is one person does all those reviews. That's what really happens. There there's like when you look at the code of review stats of teams like that there's one person who has like 50 and the other person have like three two five seven because there's just one person is like super responsive so but what that means is if you start generating dramatically more PRs that one person cannot handle the load you have to distribute it and really the only way to distribute it is to assign it to specific individuals have a system that distributes it among those individuals and then set SLOs's that have some mechanism of enforcement and another thing is like that GitHub, for example, is not very good at today is making it clear whose turn it is to take action. Like, I left a bunch of comments on your PR. Uh, you now responded to one of my comments. Should I come back again now? Oh, wait. No, no, now you pushed a no change. Should I come back now? Okay. No, no, now you've responded to more comments. What I rely on mostly is people telling me in Slack, I'm ready for you to review my PR again, which is a terrible and inefficient system. And another thing you got to think about a lot is the quality of code reviews. And I mean this once again both for the individual developers doing it with the agent and the people doing it in the code review pipeline. You have to keep holding a high bar. I know that people have other opinions about this. And yes, depending on the timeline that you expect your software to live, you might not need as much software design. Like look, it's software design is not the goal of perfection. It's a goal of good enough and better than you had before, right? But sometimes good enough for a very long lived system is a much higher bar than people expect it to be. And if you don't have a process that is capable of rejecting things that shouldn't go in, you will very likely actually see decreasing productivity gains from your agentic coders over time as the system becomes harder and harder for both the agent and the human to work with. The problem is this. In many companies, we have the people who are the best code reviewers not doing any of their time doing code review. They are spending all their times in meetings doing highle reviews doing strategy. And so we aren't teaching junior engineers to be better software engineers and to be better code reviewers. So we have to have some mechanism that allows the people who are the best at this to do this through apprenticeship. If somebody else has a better way of doing this than doing code reviews with people, I would love to know because in the 20 plus years that I've been doing this, I have never found a way to teach people to be good code reviewers other than doing good code reviews with them. Now, if you do if you don't do all the things that I talked about, what is the danger? The danger is you take a bad codebase with a confusing environment. You give it to an agent or a developer working with that agent. The agent produces relative levels of nonsense and the developer experiences more or less frustration and depending on how persistent they are at some point they give up and they just send their PR off for review. They're like, "I think it works." Right? And then if you have lowquality code reviews or code reviewers who are overwhelmed, they go, "I don't know. I don't know what to do with this. I guess it's okay." And you just have lots and lots and lots of bad rubber stamp PRs that keep going in and you get into a vicious cycle where what I expect to occur and what my prediction is is if you are in this cycle, uh, your agent productivity will decrease consistently through the year. On the other hand, we live in an amazing time where if we increase the ability of the agents to help us be productive, then they can actually help us be more productive and we actually get into a virtuous cycle instead where we actually accelerate more and more and more and more. And yes, some of these things sound like very expensive fundamental investments, but I think now is the time to make them because now is one of the times you're going to have the biggest differentiation in your business in terms of software engineering velocity if you can do these things versus other in industries or companies that can't structurally do these things. So to summarize, here's a few things. Not literally everything in the world you can do that's no regrets, but you can standardize your development environments. You can make CLIs or APIs for anything that needs a CLI or API. Those CLIs or APIs have to run at development time. By the way, too, another big thing that people miss is sometimes they have things that only run in CI. If you're CI takes 15, 20 minutes and you know, agents are like way more persistent and patient than a human being is. So like, but they're also more errorprone than human beings are. So like they will run the thing and then run your test and then run a thing and then run your test and then run a thing and then run your test and they'll do it like five times in a row. If that takes 20 minutes, your developers productivity is going to be shot to heck. Whereas, if it takes 30 seconds, you're going to have a they're going to have a much better experience. You can improve validation. You can refactor for both testability and the ability to reason about the codebase. You can make sure all the external context and your intentions, the why is written down. You can make every response during code review faster. And you can raise the bar on code review quality. But if you look at all of these things, there's one lesson and one principle that we take away from all these things that covers even more things than this. And it's basically that what's good for humans is good for AI. And the great thing about this, one second. The great thing about this is that it means that when we invest in this thing, we will help our developers no matter what. Even if sometimes we miss on helping the agent, we are guaranteed to help the humans. Thank you very much. [applause] [music] Ladies and gentlemen, please welcome back to the stage, Alex Lieberman. [music] Let's give it up again for Max. [applause] We have one more break now and then the last block of sessions where we'll have speakers talking about AI. uh consultancies, uh paying engineers like salespeople and how to make your company AI native. So, be back here at 4 o'clock or if you're watching the live stream, be back online at four o'clock and we'll see you then. Thanks everyone. Heat. Heat. [music] >> [music] >> Heat. [music] Heat. Heat. Heat. [music] Heat. Heat. Heat. [music] Heat. Heat. Heat. Heat. Heat. [music] Heat. [music] Heat. Heat. [music] [music] Heat. Heat. Heat. [music] Heat. [music] Heat. [music] [music] [music] Heat. Heat. [music] [music] Heat. [music] [music] >> [music] [music] [music] >> Heat. Heat. Heat. Heat. >> [music] [music] >> Heat. Heat. [music] Heat. Heat. [music] [music] [music] >> [music] >> Heat. Heat. [music] Heat. Heat. >> [music] >> Heat. Heat. Heat. [music] Heat. >> [music] [music] >> Heat. Heat. Heat. [music] Heat. >> [music] [music] >> Heat. Heat. Heat. [music] [music] >> [music] [music] >> Heat. Heat. [music] Heat. [music] Heat. Heat. Heat. [music] Heat. [music] Heat. [music] Oh, [music] [music] [music] heat, heat. >> [music] [music] >> Heat up [music] [music] here. Heat. [music] Heat. [music] Heat. Heat. [music] [music] [music] Heat. Heat. [music] [music] Heat. [music] [music] Heat. Heat. Heat. Heat. Heat. [music] Heat. [music] Heat. Heat >> [music] >> up here. Heat. Heat. [music] [music] Heat. [music] [music] Heat. Heat. Heat. [music] Heat. Heat. Heat up >> [music] [music] >> here. >> [music] [music] >> Heat up [music] here. >> [music] >> Heat. Heat. Heat. Heat. [music] [music] Heat. Heat. >> [music] [music] >> Heat up here. Heat. Heat. >> [music] >> Heat. Heat. [music] Heat. [music] Heat. Heat. [music] Heat. Heat up >> [music] >> here. Heat. Heat. >> [music] >> Heat. Heat. [music] Heat. [music] [music] [music] Heat. Heat. Heat. Heat. Heat up here. Heat [music] [music] up here. [music] [music] >> [music] >> Heat. Heat. [music] >> [music] [music] >> Heat. Heat. [music] Heat. [music] Heat. >> [music] >> Heat up here. Heat up [music] here. Heat. [music] [music] Heat. >> [music] >> Heat. Heat. [music] [music] >> [music] [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. [music] Heat. Heat. [music] [music] [music] Heat. Heat. [music] [music] [music] >> [music] [music] [music] >> Heat up here. Heat. Heat. [music] [music] Heat. Heat. [music] [music] Heat [music] [music] up here. [music] Heat. Heat. [music] [music] >> [music] >> Heat. Heat. >> [music] >> Heat Heat >> [music] >> up [music] here. [music] >> [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. >> [music] >> Heat. Heat. Heat. [music] Heat. [music] Heat. Heat. [music] Heat. Heat. [music] [music] Heat. Heat. >> [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. Heat. Heat. Heat. Heat. [music] >> [music] >> Heat. Heat. >> [music] [music] >> Heat. Heat. [music] >> [music] >> Heat. Heat. >> [music] >> Heat. Heat. How we doing? We are officially 7 hours in. How's the energy level? 7 hours in. Let's hear it. There we go. There we go. So this is our last block of sessions before you all get to enjoy the graphite afterparty. More coming on that in a few. And for this block, we're going to cover a lot AI consulting in practice, paying engineers like salespeople, as I mentioned earlier, leadership in AI assisted engineering, and how to build an AI native company. You guys ready for this? >> Oh, come on. Let's go. So [applause] with that, please join me in welcoming our next speaker and one of last year's MC's to talk about helping organizations transform with AI. Let's hear it for NLW. [music] All right, great to be back here, guys. Uh for those of you who are here in February, I had the privilege of MCing. Uh and today I'm excited to talk uh about something a little bit different. So right now uh there's been the last couple of months have been an interesting time in AI. There's been a sort of surge in the air uh the narrative of an AI bubble. A lot of it driven by dubious studies uh like the MIT report. And so what I wanted to do today is get into not so much the practice of consulting and transforming but what organizations are actually finding value in right now. So for those of you who don't know me, uh there's kind of two context I bring to this conversation. The first is as the host of the AI daily brief which is a daily uh news analysis podcast about AI. The second is as the CEO of super intelligent which is an AI planning platform. So the different perspectives are sort of very high level macro thinking about the news that's happening and then a much more kind of ground level view where we're spending a ton of time interviewing executives about what's going on inside their organizations. And what we're going to talk about is sort of one kind of briefly in the first part just the status of enterprise adoption uh as it currently stands. And two, um, and the more interesting part is we've been live with a study in in the market for about a month now collecting self-reported information about ROI around different use cases. And this will be the first time uh, this week was the first time I did some analysis on it. And so I'm going to share what people have uh, what people have told us around the first kind of 2500 or so use cases that they've shared. Um, so it should be pretty pretty interesting stuff. talking about kind of enterprise AI adoption first. I'll go through this pretty quickly because it's um pretty well-known stuff. Uh the short of it is enterprises are adopting AI uh in in a growing fashion. Um pretty much everyone is using it at least a little bit. Uh and increasingly they're using it a lot. Uh this year I will need to tell none of you that there was a major inflection around um specifically adoption in the uh coding and software engineering. Right? You saw a huge huge uptick in this. Um there's a lot that's interesting about that from an enterprise perspective because it wasn't just with the software engineering organizations. Other parts of the organization are also now thinking about how they can communicate with code, build things with code. Uh but that's a huge huge theme of this year [snorts] coming into 2025. One of the big sort of thoughts that many people had was that this would be the year of agents inside the enterprise, right? That big chunks of work would get automated away. And on the one hand, I think it's pretty clear that we didn't see some sort of mass shift towards automation uh at large across different functions in the organization. But when you dig into the numbers, there has been actually pretty significant uh shifts in the patterns of of agent adoption. So this is from KPMG's quarterly pulse survey. And it's a measure of how many enterprises that are a part of their survey, which is all companies over a billion dollars in revenue, have uh actual sort of full production agents in deployment. So this isn't pilots, this isn't experiments. This is where they consider uh some agent that's actually doing doing kind of work in a in a full way. And it's jumped from 11% in Q1 of this year to 42% in their most recent study for Q3. So you actually are seeing pretty meaningful uptake of of agents inside the enterprise. In fact, I would argue based on our conversations that people have that it's moved more quickly through the pilot or experimental phase than people might have thought. Um, so much so that you're actually seeing now a big shift in the emphasis around kind of the human side of agents and how humans are going to interact with agents and it's involving a shift in upskilling and and uh and enablement work. Um, you're seeing a decrease in the sort of resistance to agents as people start to actually dig in with them. You're seeing more experiments like these sandboxes where people can interact with agents. So this is a big theme even if it wasn't necessarily the dominant theme that some thought it might be coming into this year. At the same time, it is absolutely the case that many many if not most enterprises are broadly speaking stuck inside sort of pilot and experimental phases. There is a lot of challenge around moving from some of those first exciting experiments to something that's more scaled. Um, so this is from McKenzie state of AI study which came out I think a couple weeks ago now and you can see only 7% of the organizations that they talked to claim or sort of see themselves as as fully at scale with with AI and agents and something like 62% are either still experimenting or piloting. Interestingly big organizations are on in general a little bit ahead in terms of uh the organizations that are scaling as compared to small organizations. This has been a a thing that we've noticed kind of throughout the trajectory of uh of AI um adoption over the last couple of years that you would think that perhaps smaller more nimble companies uh would be more kind of quick to adopt these things, but in fact it's often been the opposite with the biggest organizations making the biggest efforts. You can also see from the chart on the bottom that there's very sort of jagged patterns of adoption, right? you're starting to see uh from you know last year if you looked there's very similar kind of rates of experimentation across lots of different departments you're starting to see some pretty big breakouts now uh with for example you know IT operations kind of jumping out ahead of other functions I won't spend too much time on this sort of high performer piece but I think the thing to note because it comes back in and in and some of the stuff that we found with our ROI study is that you are also starting to see a pretty significant bifurcation between leaders and laggers when it comes to AI I adoption and one of the things that tends to distinguish the companies that are leading is that they are just doing more of it and they are thinking more comprehensively and systematically about AI and agent adoption. So they are not just sort of doing spot experiments. They are thinking about their strategy as a whole. They're doing multiple things at once. And importantly, they're not just thinking about sort of the very kind of first tier time savings or productivity types of use cases. They're also thinking about how do we grow revenue? How do we create new capabilities? How do we create new product lines? Overall, it's very clear that despite what is sort of, you know, the the the concerns in the media, that spend is going to do nothing but increase on this. Um, so the bottom is the KPMG pulse survey again, and this is a an estimation of the amount of money that these organizations intend to spend on AI over the next 12 months. The beginning of the year was 114, which by the way was up from like 88 in Q4 of last year. It's now up to in their last study 130 million is what they expect to spend uh in the in the year ahead which obviously the the total magnitude doesn't matter as much as the change. Um you also see the green charts are from Deote and you can see 90% plus of organizations intend to increase their spend uh on AI in the next 12 months and as part of that I think that you're going to see a much more determined conversation around impact and ROI uh which is a particularly thorny topic but interestingly there has been an increase in optimism over the course of this year around the realization of AI. So this is from a different KPMG study, their annual CEO survey, which interviews tons and tons of CEOs. And if you look at the 2024 numbers, 63% of those pled thought that it would take between 3 and 5 years to realize ROI from their AI investments. 20% said 1 to three and 16% said more than five. This year in that same survey, the number that said 1 to 3 years had gone up to 67%. There were now 19% who said 6 months to one year. uh and three to five years was down to just 12%. So huge huge kind of pull forward of expectations of of ROI realization. The challenge is that ROI is really tough. So this is back to the poll survey. 78% of those pled in that in that survey said that they thought that ROI is going to basically become a bigger consideration in the year to come. Uh but also 78% said that traditional impact metrics and measures were having a very hard time keeping up with the with the new reality that we were living in. And this is something that I've heard constantly over and over from CIOS and other people who are in charge of these investments that the the the ways that we have measured impact of previous technologies and just previous initiatives are kind of falling flat with AI. And so that got us thinking about the the the overall need that we have to just have more information. I'm not even talking about good systematic information, just more information around what ROI looks like, what impact looks like, and you know, I've got this great podcast audience. They're super engaged. And so, we just decided, screw it. We're going to ask them, we're just going to ask them to report on what ROI they're finding from their use cases. So, this went up at the very end of October. Uh like I said as of this morning or when I looked last looked we've had over a thousand submissions uh a thousand individual organizations rather submit something like 3500 use cases and um this is uh some some of the first observations that we had around um kind of the first 2500. So the impact categories the way that we divided things was into sort of eight broad categories of impact um which will all I think be very intuitive to you guys. time savings, increased output, improvement in quality, new capabilities, improved decision- making, cost savings, increased revenue, and risk reduction. So, basically, it was trying to think of like kind of a a broad simple heristic for uh for for kind of dividing or subdividing the different the different ways that people are thinking about ROI. And TLDDR is that people are finding uh ROI right now. Um, now again, the caveats are that this is a highly infranchised audience. are listening to a daily AI podcast and they are voluntarily sharing this. So, I think that, you know, there's there's some caveing there, but you have 44.3% saying that they're seeing modest ROI right now. And then you have another 37.6% seeing high ROI. For the purposes of a lot of these stats, high ROI will be significant plus transformational. Uh only 5% or so are seeing negative ROI. And keep in mind, negative ROI doesn't mean that they think programs are failing. It just means they haven't they've spent more than they've gained uh in terms of how their their perception is. More [snorts] than that, expectations are absolutely skyhigh. 67% think over the next year they will see uh increased and high growth in their ROI. So we have really optimistic sense from the ground view of where ROI is going to be in AI. Um you even have the teams that are currently experiencing negative ROI. 53% say that they're going to see high growth. So very very optimistic. Um as [snorts] you might imagine, time savings is the default. It's the starting point for so many organizations. It represents about 35% of the use cases. After that, increasing output, quality improvement, basically all those things that you would imagine around productivity are sort of like the dominant categories when it comes to these uh when it comes to these use cases. When it comes to the specifics around time savings, you see a real cluster between 1 and 10 hours, especially right around 5 hours. And I think this is interesting to call out because it's so obvious to all of us who are inside building these things uh whether you are a developer or an entrepreneur or just someone sort of in and around it how the the vast breadth of opportunity that AI represents new capabilities things unimagined yet. It's hard to or it's easy to forget that if you save 5 hours a week or 10 hours a week you're talking about winning back 7 to 10 work weeks a year. Uh and that's very very powerful. And when it comes to a lot of these enterprises, that is a very meaningful thing, even if it's not what they're ultimately in it for. Interestingly though, it's very clear that the story, although it might be uh have a concentration in time savings, is about much more than time savings. So this is the ROI distribution category uh ROI distribution by organization size. And this starts to get really interesting where you can see that there are some differences in where different size organizations are focused. So for example, the organization size between 200 and a,000 people has a higher portion of their use cases concentrated in increasing output. Now we haven't taken the time yet to really figure out exactly what this means or even speculate on on what this means, but I think it's interesting that this is a category of organization that has often reached a certain scale but is still very much striving for more and so seems to be focused more on use cases that expand their capabilities. Same thing with uh when you start to divide things by role, you see a real kind of variance where for example seuitees and leaders uh are less focused on those time savings use cases and more focused on other things like increased output and uh and new capabilities. In general, we're finding that SE leaders uh and just sort of seauite and and leaders in general are even more optimistic and excited and and seeing transformational impact than people who are in more junior positions. Now, some of this might be sort of selection bias in terms of um what types of use cases you are focused on. If you are in that seuite, you're thinking about things that inherently if they work are more transformational. Uh but it is notable that 17% of uh of the use cases that that people in those leadership positions have submitted uh they say have transformational impact and ROI already. [snorts] Uh I'm going to skip this because there's we don't have time for too much. Um you're seeing uh interestingly uh a concentration um where the smallest organizations are getting more of that transformational benefit early. Um, one of the things that I want to do following this study is maybe do a sort of second round where we dig into what this 1 to 50 person uh, size really looks like. I actually think that whereas there might be a lot of similarity between a 1,000 and a 2,000 person organization. There could be a wild difference between a threeperson, you know, small company and a 40 person company. And so I'd really like to dig into that more. But you are definitely seeing a a lot of impact in those sort of more small nimble moving organizations. Uh as you might expect coding and uh and software related or uh use cases have a higher ROI than average and a lower negative ROI than average. Um one really interesting kind of you know pulling on a specific category of use cases. risk reduction is our lowest category in terms of the percentage of use cases that that that was their primary benefit. So when you're filling out the survey, which is by the way at ROI survey.ai if you want to check it out, uh you basically only get to pick a primary ROI benefit. We didn't want it to be super sort of um we wanted you to pick and and hone in on the thing that was uh seemed most important or most significant. And so only 3.4% 4% have risk reduction as their primary benefit uh in terms of ROI categories, but it is by far those use cases are by far the most likely to have transformational impact as as the as as their outcome. It's at 25%. So a full quarter of those uh have transformational ROI. And interestingly, I was having this conversation with a couple of my friends who work in sort of back office and compliance and risk functions, and this has been their experience as well, where there are a lot of uh a lot of the the the challenges for those organizations involve sheer volume and quantity uh in ways that that AI can be really helpful for. We also are finding some interesting patterns among organizations. And again, this is where we get into some of the limits of this just being a whoever walks through the door of my listeners. We have a pretty heavy concentration among technology, as you might expect, industries and among professional services, but we still have fairly decent sample sizes for some others. And in both healthcare and manufacturing, the use cases are meaningfully higher impact on average uh than the average across all organizations. Um, which I think is uh it was kind of worthy of further study. Last sort of part of this as I wrap up, you know, a lot of these use cases as you saw have to do with that sort of first tier that most enterprises are going to be in. Uh, increasing the amount of content that you output, increasing the quality of that content, just finding ways to win back, you know, your 5 hours a week. Um but increasingly there are automation and agentic use cases and we are absolutely seeing that where those are the the focus where those use cases mention certain types of automation or they mention agents they wildly outperform in terms of the self-reported ROI from them that's both on automation and it's on agents and I think that that's sort of a a trend towards where we're headed with sort of the next layer of more advanced use cases. The last thing that uh from this sort of first first look of observations is there is clearly benefits and this goes back to to what we saw with that Mackenzie study as well of thinking about AI and agentic transformation in systematic cross-organizational cross-disciplinary types of terms. um effectively pretty much uh directly the more use cases that a person or an organization submitted the the better they tended to see uh ROI for. Now there's lots of reasons for that but I do think it speaks to that that core idea that once you move beyond kind of your single spot experiments there's a lot of opportunity uh to to sort of grow grow the impact of the organization. So, like I said, that is the the first look. Uh, it's kind of the first twothirds of these uh of these use cases. We'll be open for another week and then we'll have the full study out at the beginning of December. Um, I'm really excited, I think, heading into next year to see how we move from sort of generic conversations about impact uh and our gut senses about impact to a lot more random experiments like this to figure out where the impact really is and uh and where we go next. So, look at that. I'm going to end 27 seconds early and really throw off the time, but appreciate you guys all being here. Uh, and again, if you want to check this out, it's roicervey.ai. [applause] As AI [music] changes our business and engineering landscape, do we need to rethink how we incentivize and compensate engineers? Here to provide us with a case study for scaling output, not overhead, is the co-founder and managing partner at 10X, Arman Hezarki. How's everybody feeling? It's been uh 7 and 1/2 hours. We doing what? Are we doing okay? >> Awesome. I'm Arman. Uh like the voice of God apparently. That's what they're called. Voice of God apparently. Uh so my name's Armon. I'm one of the co-founders and managing partners at a company called 10X. Uh my co-founder is Alex who's been uh kindly announcing everybody all day. We do a lot of cool work. We uh we help companies with their AI transformation. We have incredible clients all over the world. But I'm not going to talk about any of that today. I'm going to talk about something much more niche. I'm going to talk about how we pay engineers. And we pay engineers like salespeople. Earlier I was just in the green room with a bunch of distinguished engineers that I've grown to uh respect for my entire career. And we were talking and I was telling them that we pay engineers based on the story points that they complete. And we had a lot of people roll their eyes and and laugh. And they asked, "What do you mean?" And I said, "Clients pay us for the number of story points that we deliver and we pay engineers based on the number of story points that they complete." And similar to the looks that I'm getting from some of you, there was skepticism. And I know this sounds crazy, but it's working. We've been able to hire incredible engineers, many of whom have started and exited uh companies before this. We have been able to hire worldclass machine learning and AI researchers. We've hired rocket scientists from NASA. We are shipping code incredibly quickly and it's maintainable and high quality code. Of course, that is everyone's dream. Everybody wants to hire great people. Everyone wants to deliver really uh fast code. So, my goal here is not to convince you all to adopt our model. My goal is to show you what compensation looks like in AI and hopefully provide a new perspective on the fact that things might change as we introduce this technology. Before I jump in though, I want to talk about uh how we got here. So, I'm a software engineer by training. I went to Carnegie Melon and then I taught there in their school of computer science. After that I went to Google and I helped them scale their AI cloud and mobile practices internationally before starting a few ventureback startups. And in my last startup I would work out of a weiwork and I was sitting in this uh 33 Irving WiiWork. If any of you are from New York, you you might have worked out of that we work and they have these big tables and there were 12 of 12 of us kind of sitting around. No one's talking. Everyone has their headphones in. And I look to my left and I see I see somebody with Visual Studio Code open, right? I'm like, "Okay, I have a fellow engineer to my left." And I see that he was typing, but I didn't see a chat window. This person was typing into the code editor. They were typing fo like a caveman. This this poor person was typing like with their little chopstick fingers individual characters. I I I couldn't believe it. On my computer, I had 45 agents. Three were ordering me lunch. Two were writing code. One was doing research. Just different worlds were happening on my computer versus this person's computer. And I felt bad. I thought maybe we should do a GoFundMe or something. But I I I tried to look deeply at what is actually causing this difference. Why am I using AI in the way that I am? And why is this person not? There are different ways that that people try AI and there are different reasons why people don't use it. We've all heard people who have tried it and have said it's not as good as me. We've all heard people who have not tried it because they don't want to. But regardless, my belief is that this is an incentive issue. For me, I was a founder and I wanted to squeak out every bit of incremental value and and efficiency that I could. And so I would sit on Twitter and LinkedIn and read blog posts and try to understand what is the cutting edge in software engineering and what's going to give me the ability to output more code higher quality faster. And because of that, I was using all these all these different agents, but this person probably worked at a startup, probably had a base salary with an annual bonus and some equity. And that was supposed to be the model that incentivized people to be innovated, to be innovative, and to work smarter and faster and harder. But it wasn't working. And so, in order to understand how we got to where we are, I'm going to do a brief uh history of compensation. And this is by no means accurate. I'm making a lot of things up here. It's all illustrative. Okay. So, back in the day, we had some cavemen who were writing code. We were we're uh probably inscribing C in a in a tablet somewhere and we were paying people hourly, right? This makes sense. I look at somebody sitting in a chair and I'm going to pay them some amount of dollars for some amount of time. That makes sense for me and it makes sense for the for the engineer. But why is that broken? I actually I want to hear from people. Why is hourly broken? >> It's slow output. >> No upside. >> There's no upside. there's no reason to work faster, right? And in fact, there's a disincentive to work faster. And so, what if I I notice this as the buyer of this technology and I say, "Okay, how long is it going to take you? It's going to take you five hours. Okay, so I'll pay you 500 bucks, right? Hourly $100. Multiply that by five. And then you as the engineer, if you work faster, great. You get to keep the $500. And if you work slower, that's on you." As engineers, we're really, really bad at estimating how long things are going to take. And so because of that, I'm not going to say it's going to take five hours. I'm going to say it's going to take 15 hours, 20 hours, so that I have no downside. And so again, as the buyer, I don't want to pay you based on the project. So what if we hire people on salary and give them a bonus, right? Well, we in the startup community know what happens in that when when this is the case. People punch in at five, leave or nine, leave at five. And so I'm Larry Paige. I notice this and I see why am I working so hard at Google? Why am I putting my blood, sweat, and tears into this? It's because I have some of the upside. I own the company, right? And so when we exit for for many, many dollars, I'm going to see that. So what if I can share that with my employees? And that's when equity comes in. And and this has worked this has worked for many many years to incentivize employees. This is this is the foundation of the startup community that we all know and are a part of. It's incredible. But the not every company is Google. In fact, for every one Google, there are many many failures. And software engineers know this, right? For those who want to take the risk, many will just go to YC or or start their own company. And for the ones who don't want the risk, they're opting for cash over equity. Many of us who've hired engineers know that the cash is non-negotiable. Equity. Yeah, sure. I'll take some upside. And so my contention is that this model needs to be reinvented in the age of AI. We need to directly incentivize people to use these tools and to use them well and to still maintain really high quality standards of code. And so here's how it works for us. So we basically just to take a step back, we do two types of work at 10X. One is road mapping and one is execution. So companies come to us and they say, "Hey, we want AI." That's generally the request. Sometimes it's more specific. It's like, hey, I want my customer service team to have 10% more uh output using AI, right? But but generally they come to us with a request. We do a bunch of studying and learning and then we output a road map and based on that road map, they can take it and work on it on their own or we can do it. For a lot of things, we're taking off-the-shelf tools, but a lot of what we do is custom builds and that's where the story point model comes in. So we will build a roadmap for a lot of our clients but once they see that then they're putting in requests on their own as well and we have two roles in the company that are client facing one is the strategist and the other is the AI engineer the strategists are all are mostly technical and so we've have we have former PMs we have former engineers they are doing PM type work consulting type work they're the ones that are taking the product requirements and distilling that down with the client Then they hand that over to the engineer and the engineer puts together an architecture design document. They spend a lot of time doing that. In fact, that is where most of our engineering time goes. Then they write code and they start implementing that that architecture design document includes tickets and each ticket is graded on some number of story points. This is a very traditional method of doing work, right? And when that ticket is accepted, the engineer gets paid a a fee per story point that they complete. Our engineers have a flat base that they're paid and then every quarter we round up based on the story points that they've completed. And again, this has led to us being able to hire incredible people, but we've also been able to do incredible work. So, I'm going to walk through a couple of projects that we've done. So, this is one. This is a billboard company. If you go to Times Square right now, you'll see some billboards that they've sold that inventory for. They sell in two ways. One is you can call them up traditional sales, you can buy that inventory, but the other is they have like an Uber for billboards type of product where you can go online, you can upload a PNG, you can choose where you want this to run and for how long, similar to like a Facebook or Google ad. It's very similar to that experience. And they came to us and they said, "Hey, we think that there's some opportunities for AI in our product." We did an analysis and we found a few. One of them is this. We found that when an image is uploaded to their system, it has to go through two rounds of moderation. One is internal to the company and the other is with the billboard owner. Internal to their company, they're spending money on that to actually hire the people to do that and there's a lot of inaccuracy and it takes a lot of time. So that costs them money and it costs them revenue because every moment that the billboard is not running, they're not making money. And so we found what if we could build an AI model that can actually do this moderation for them. We scoped that out. We built the architecture uh design doc. We broke it down into tickets and we built this for them. We did it in two weeks and we got to 96% accuracy when compared to the human moderator. We've done a lot of other projects with this company as well. This is another company. This is they work with retailers all around the world and currently they have devices in these retailers and they're low power devices and so because of this they're able to run one AI model on device and what this model does is does heat mapping. So imagine there's a camera in this room looks down and it can basically generate a heat map of where the traffic is throughout the day. And for retailers, of course, this is very, very useful. But there's other things you can do too, right? If we just sit here for a few minutes, we can probably come up with a lot of ideas of if you have a camera with a chip, you can make a lot of money from that. You can show really useful information. And so that's what we did. We we came up with what are some of the things that we could do with this? if you put a little bit little bit more power in that ship, if you make the models, if you quantize them so they can run in parallel, what could you do? And so we gave them this report and then we built them five models that can run in parallel. It does everything from heat mapping to Q detection to theft detection and more. And again, we start with the product requirement stock. We break this down into architecture. Then we build it and then we pay engineers based on the output. This is the big question. What are the risks? Right? I just talked about dandelions and rainbows, right? Uh so I promised you that my goal is not to convince you to do this. And part of this is showing you what the potential risks are. These are a few that come up. One is what if an engineer inflates the story points, right? What if an engineer says, "Okay, you want me to add a button? 45 story points." Right? What if an engineer rushes and quality drops? You're saying that it took two weeks to do that. Well, was it good? Did it work? And what if engineers get sharp elbowed? I started this by saying that we compensate engineers like salespeople. It's not a it's not a culture that we necessarily want to emulate in software engineering, right? So, how do we how do we uh make sure that that's not happening? First of all, I mentioned that we have two different roles and we compensate like a counterbalance. So strategists are compensated based on NR which really is like customer happiness and every single ticket has to be approved internally with multiple rounds of QA of which the strategist is involved but also by the client and so there's a counterbalance to every single ticket that is delivered. Uh I skipped to the second one. For the first one inflating story points the strategists are the ones who scope it. And again we have to review all of that. And for the third, how do you make sure that all of this is correct? And how do you make sure that there's no sharp elbows? How do you make sure that everybody is happy and the dandelions and rainbows are continue throughout this parade of joy? Well, you have to hire the right people, and this is what I tell everybody. We make hiring incredibly difficult for ourselves so that everything else is easy. And that is a principle that we all know and we all stand true to. And this is incredibly important with AI. My co-founder Alex always says AI makes people look like one of those crazy mirrors where any one of your attributes it makes it 10 times larger. If you're a great engineer, AI makes you great. If you're not, it makes you sloppier. And this is the case with all of these things. You have to start with hiring. Our belief is that AI gives people superpowers and it makes all of us smarter, faster, and better at what we do. But my belief is that the current way that we compensate people is actually holding them back. And I would invite you to think about how can you compensate people on your team differently, whether it's software engineering or anything else. If you want to unlock your employees potential, feel free to reach out at armon 10x.co. Thank you. [applause] Our next presenter [music] is deputy CTO at DX, the engineering intelligence platform designed by leading researchers speaking about effective leadership in AI enhanced organizations. Please join me in welcoming to the stage Justin Rio. [music and applause] Hello. Thanks for joining me in one of the later day sessions. Looks like we we we kept a lot of people here. This is a nice full room. I'm great to see it. We're going to go through a lot of content in a short amount of time. So, I'm going to get right into it. If you want to get deeper into any of this stuff, we have published this uh AI strategy playbook for senior executives. And uh a lot of the content that I'm going to go through, I'm not going to have time to get quite as deep, but this is just a nice PDF copy that you can come and refer to later. If you missed this QR code, don't worry. I'll show it again uh at the end. So, what is the current impact of Genai? Nobody knows, right? We've got Google on the one hand telling us that everyone's 10% more productive. That's interesting. Now, they're Google. They were already pretty productive to begin with. But we have this sort of now infamous meter MER study which has some flaws in the way that study was put together that showed actually a 19% decrease in productivity using codec assistance. So there's a lot of volatility a lot of variability. Uh what was really interesting about this study even though I I mentioned there were some flaws. Um but every engineer that took part in this study felt more productive but then the data actually bore out that they were less productive. Kind of interesting right? we've got this induced flow uh that makes us feel really good about what we're doing. So, we need to address this. DORA has put out some really good research on this, too. But this is based on industry averages. This is impact based on what do we look at when we see a large sample and an average of how certain factors are being impacted by in this case 25% increase in AI adoption. We see these modest but positive leaning indicators. 7.5% increase in documentation quality and uh increase in code quality by about 3.4%. At least that's not leaning in the other direction, right? And when we started digging through some of DX's data, we have, you know, we're the developer productivity measurement company. We have lots of aggregate data that we can look at with this. We found the same thing. When we looked at averages, we see about a 2.6% 6% increase in overall uh change confidence, which is a a percentage of people who answered positively that they feel confident in the changes that they're putting into production. Uh similar positive leaning average when we looked at code maintainability, another qualitative metric, a1% reduction in change failure rate. Uh which when you think about the industry benchmark being 4%, it's not insignificant. But this is not the full story because this is what we saw when we broke the same studies down per company. Every company here is a every every bar represents a company, right? We have some that are seeing 20% increases in change confidence while others are seeing 20% decreases. We're seeing extreme volatility, which is why these averages look so innocuous, but they're belying the greater story of variability. See the same thing with code maintainability. The same thing with change failure rate. So this is a 2% increase in change failure rate up here at the top. Again, with an industry benchmark of 4%, that means shipping as much as 50% more defects than we were shipping before, right? We want to make sure we're on the lower end of this, but how like what should we be doing? Well, we found some patterns here. We see that some organizations are seeing positive impacts to KPIs, but others are struggling with adoption and even seeing some of these negative impacts. top-own mandates are not working, right? Driving towards, oh, we must have 100% adoption of AI. Great, I will update my read my file every morning and I will be compliant, right? We're not actually moving the needle anywhere when we do that. We also find that lack of education and enablement uh has a big impact on sort of negatively impacting this. Some organizations just turn on the tech and expect it to just start working and everybody to know the best ways to use it. uh and a difficulty measuring the impact or even knowing what we should be measuring like what metrics would should we be looking at you know does utilization really tell us much about the full story of genai impact this is another graph from Dora uh this is a basian uh posterior distribution which is an interesting way of representing data basically you want your mass to be on the yellow side of this line uh the the uh the right side of this line for the audience yeah and you want a sharp peak which is telling you that we're pretty confident that this initiative will have this impact. And if we look at some of the topline initiatives here, these are things like clear AI policies. All right, we want to make sure we have that. We want time to learn, not just giving people materials, but actually giving them space to experiment, right? Um, and so these types of factors are the ones that seem to be moving the needle the most. So, we're going to go over some quick tips on how we can do all of these things. And again, the guide will go deeper into this. We want to integrate across the SDLC. Right? For most organizations, writing code has never been the bottleneck, right? We can in uh we can increase productivity a bit by helping with code completion, but our our biggest bottlenecks are elsewhere within the STLC. There's a lot more to creating software than just writing code. We want to unblock usage. We can't just say, well, we're worried about data xfiltration, so we can't try this thing. Like, no, get creative about it. We've got really good infrastructure out there now like Bedrock and Fireworks AI that can let us run powerful models in safe spaces. We have to have open discussions about these metrics. We need to evangelize the wins and we need to let our engineers know why we're gathering metrics and data. What is it that we're trying to improve? We have to reduce the fear of AI, right? We have to make sure that people understand that this is not a technology that is ready to replace engineers. This is a a technology that's really good at augmenting engineers and increasing the throughput of our business. We have to establish better compliance and trust. And we need to tie this stuff to employee success. These are new skill sets. AI is not coming for your job, but somebody really good at AI might take your job. And so, as leaders, we have the opportunity to help our employees become more successful with this technology. So, how do we reduce the fear? Well, first of all, why do we need to do this? Well, there's a lot of good reasons, but I love to point to Google's project Aristotle. This was a 2012 study where Google wanted to figure out what are the characteristics of highly performant teams. uh they thought that the recipe was just going to be what Google had this combination of high performers, experienced managers and basically unlimited resources. And they were dead wrong. Overwhelmingly the biggest indicator of productivity was psychological safety. Okay. And so that very much applies now. We also have data like this is Sweetbench. I'm sure a lot of you have seen this and there are some impressive benchmarks that the agents can do like a third of the things they're asked to do without any human intervention. that means that they're not able to do twothirds of them. Right? Again, we are augmenting. We're not replacing. We're not ready. We may never be ready. So, we need to be very transparent with what we're doing. We need to set very clear intents. Why, you know, are we uh using this to to augment, not to replace. We need to be proactive in the way that we communicate that and not just wait for people to get upset and possibly scared. We need to say, "No, we are here to help you to give you a better developer experience and to increase the throughput of the business." And again we have to have these discussions about metrics. Now what metrics? What should we be looking at? Well, DX again developer experience and productivity measurement company. Um there are two sort of classes of metrics that we can be looking at really two levers that matter here and that's speed and quality. Right? We want to increase PR throughput. We want to increase our velocity but not by just creating a bunch of slop that's going to give us a bunch of tech debt later that we're going to have to deal with. And we've just kicked the bottleneck down the road if we do that, right? So we want to be looking at things like change failure rate, our overall perception of quality, change confidence, maintainability. And we have three types of metrics that we can be looking at here. We have our telemetry metrics. These are the things coming out of the API. And they're good for some stuff, but they're not always accurate, right? We know like accept versus suggest was kind of like all the rage until we realize that engineers need to click accept in the IDE in order for the API to know about it. even if they do click accept. Who's to say they didn't just go back and rewrite every line that was suggested, right? So that's providing us some context, but we also need to do some experience sampling. We need to like for instance add a new field to a PR form that says I used AI to generate this PR or I enjoyed using AI to generate this PR and get some data that way. And then self-reported data or survey data. We are big on surveys, but let me underscore we're big on effective surveys. 90% plus participation rates engineered against questions that treat developer experience as a systems problem not a people problem because that's what it is W. Edwards Deming 90 to 95% of the productivity output of an organization is determined by the system and not the worker. Okay. So foundational developer experience and developer productivity metrics still matter the most. Right? Our AI metrics like utilization and things are telling us what's happening with the tech, but these core metrics that we've been able to trust are telling us whether these initiatives are actually working, right? Are we actually moving the needle and having the outcomes that we want to see? So top companies are looking at different things, right? We are seeing like adoption metrics coming out of Microsoft. They've also got this great metric called a bad developer day. I'm not going to go into it, but there's a really good white paper that shows like all the different telemetry that they can look at to determine what makes a bad developer day. Dropbox is looking at similar stuff. Adoption like weekly active users, daily active users, that sort of thing, but also looking at quality metrics like change failure rate. And booking is looking at similar stuff as well. And so we built a framework around this. We were first to market with what we call our DXI measurement framework. And this is very much inspired by things like Dora space framework, DevX, just like our core for metric set which you can ask me about later. Uh and we take these metrics and we uh normalize them into these three dimensions of utilization, impact and cost. And you can kind of think about this as a maturity curve too. A lot of people start just figuring out okay what's happening? who's using the tech, what's the percentage of pull requests that we're getting that are AI assisted maybe through experience sampling? How many tasks are being assigned to agents? But then we can mature that perspective a little bit and we can correlate that utilization to impact. What is this actually doing to velocity? What is this actually doing to quality? And this is when we start getting more mature in our picture of our impact. And then finally, cost. Although I like to joke that we're 15 years past the last hype cycle which was cloud and we still have new companies spinning up that are teaching us how to understand and optimize our cloud costs. So we will see if we get there. Although I also hear horror stories about people burning through 2,000 tokens at $2,000 worth of tokens a day. So we probably do need to hit that as well. What about compliance and trust? What can we do to ensure that the output uh that that's being generated is something that can be trusted by our engineers? We have a lot of levers to pull here, but one of the ones that I'd like to talk about is setting up a feedback loop for our system prompts. So these could be called system prompts, cursor rules, agent markdown. Pretty much all of the mainstream solutions have something like this where you can go and provide a set of rules uh to control how these models behave. Uh and I won't get too much into the technical details here. We have an example where like the uh models have been providing outdated Spring Boot uh stuff. We want Spring Boot 3. It's It's been sending us Spring Boot 2 stuff. The big takeaway here is to have the feedback loop. Have a gatekeeper, right? Have somebody or a group in the organization that can receive this feedback that understand how to maintain and continuously improve these system prompts, right? And that way we're always maintaining the way that these assistants or models or agents affect the whole business. It also pays to understand the way that uh temperature works, especially when we're building agents, right? we do have some control over the determinism and nondeterminism of these models. Uh again like when a model is predicting a next token, it doesn't just have like one token. It has a matrix of tokens and those are associated with a certain probability of that being like the right token. And so we have this setting called temperature which is heat which is entropy which is randomness that can control the amount of randomness involved in actually picking that token. This is sometimes called increasing the creativity of the model. And it's a number between 0 and one. For those reasons I just mentioned, don't use zero or don't use one. Weird things will happen. But you want some decimal in between zero and one. When we have a lower temperature, like we're seeing here, 0.00001, we give it the same task twice, and it gives us the exact same output character for character. When we set that temperature higher, this is an example of 0.9. I'm asking the agent to create a gradient for me, uh, simple task. It's giving me two relatively valid solutions. I did ask it for a JavaScript method and this is the only one that's giving me a JavaScript method. But the point is they are wildly different approaches to the same problem when I've increased the creativity of that model. So we need to think about like use case- wise where should we have more creativity and where should we have more determinism and temperature is another setting that we have that can help control this. You can experiment with all this using like docker model runner lama lm studio that sort of thing. How can we tie this to better employee success? We had to provide both education and adequate time to learn. So we put together a study where we sampled a bunch of uh developers that were saving at least an hour a day uh uh excuse me an hour a week and we asked them to stack rank their top five most valuable use cases. And we built a guide around that. a guide that effectively goes through code examples, prompting examples uh of what we determined using the sort of data approach where we should get more reflexive about our best practice and about uh the use cases that we're becoming reflexive in in our use of AI. And so that's what this guide was about. And uh we've had this become required reading in certain engineering groups and uh proud of that. And this is another way that we can help educate, but we need to give time. Uh we don't have time to go through all of this. I do think it's interesting that the number one use case for this was stack trace analysis, right? So, not a generative use case, actually more of an interpretive use case. And we see some other ones here that are not too surprising. And there's examples of each of these. What about unblocking usage? How can we make sure that we can creatively ensure that engineers can take the most advantage of this? Well, leverage self-hosted and private models. That's getting easier and easier to do. Partner with compliance on day one, right? Make sure that what you're doing is in line with your organization's compliance. You may find that you're making a lot of assumptions about things that you don't think you can do that you can actually do, right? And then think creatively around various barriers. Finally, how can we integrate across the SDLC? What should we think about doing there? You know, and I'm a big Ellie Gold theory of constraints fan. Probably have some others in the audience. An hour saved on something that isn't the bottleneck is worthless. And when we look at data across in this case almost 140,000 engineers, we find that there are definitely good like annualized time savings with AI that are being eclipsed by sources of context switching and interruption, meeting heavy days, these other things that it's like, yeah, we can save time here, but we're losing so much more time over there. So find the bottleneck, fix the bottleneck, right? Morgan Stanley's been very public about their uh building this thing called Dev Gen AI that looks at a bunch of legacy code Cobalt mainframe natural. I hate to admit Pearl because I'm an old school Pearl developer. Uh but apparently that's legacy now too. And basically creating specs uh for developers that can just be handed to developers to start modernizing the code without having to do all that reverse engineering. Right? And they're saving about 300,000 hours annually right now doing this. There's a Wall Street Journal journal article about this, Business Insider article about it. Uh they're very public about that. Zapier, Zapier should be the example for everyone. They have a whole series of bots and agents that are doing things like assisting with onboarding. They can now make engineers effective in two weeks. Industry benchmark on the good side is like a month. On the medium side is like 90 days. And uh because they're able to increase the effectiveness of the engineers that they're h that they've bringing into the organization, they realized that they should be hiring more, right? As opposed to trying to maintain status quo by like cutting headcount and trying to make individual engineers more productive. They said, "No, we could get more value out of a single engineer. We should be hiring faster than ever." And they are, and it's really increasing their competitive edge. I think that's the right attitude. Spotify has been helping out their S sur by pulling together context when incidents uh are detected and then taking things like run but steps and and other areas of context and documentation and pushing them directly into S sur channels so that those critical minutes of trying to get to the bottom of what's actually happening and what we should do do to resolve the incident uh they just eliminated that time right it's significantly increased their MTTR so let's get creative about areas in the STLC that are our actual bottlenecks All right, next steps. Uh, distribute this guide as a reference for integrating AI into the development workflows that you have. Uh, determine a method for measuring and evaluating Genai impact. It's really important to make sure that we're not on the bad sides of those graphs that I showed you earlier and then track and measure AI adoption and and see how that correlates to overall impact metrics and iterate on best practices and use cases. And here's a guide. again. Thank you so much. [applause] Our closing presentation will teach us how to build an AI native company, even if that company is 50 years old. Please join me in welcoming to the stage the founder of Every, Dan [music] Shipper. >> Hello. >> [applause] >> How's it going everybody? I'm the last speaker of the day, so I'm just between you and dinner or drinks. So, I'm going to try to make this fun and hopefully a little bit short. So, first of all, I just want to say I'm very glad to see everybody and I'm actually kind of surprised to see so many people here um because I've been I live here, but I've been traveling. I was in Portugal uh last week and I was on Twitter and someone said that everyone was moving to San Francisco. Uh but it's great to have everybody here instead cuz I love New York. [laughter] Come on. Come on. >> [applause] >> Um, so I'm supposed to talk uh today about uh how to build an a playbook for how to build an AI native company. And um I actually don't have one unfortunately. Um and that's because I think the playbook is actually being invented right now. So we're doing it at the company that I run every but all of you are doing it here today as well. and and and so I don't want to do this talk from the perspective of I have all the answers and I'm going to tell you the framework and the playbook and all that kind of stuff. Um but um I do think it is helpful when we're in this beginning stage of uh learning how to use AI to do engineering to build companies uh to share like the the personal experiences that we're having inside of our companies um and uh and sort of collaboratively figure out the playbook together. So I think the best that I can offer is really just sort of dispatches from the future. Uh notes on what I've figured out um and the work that we've done inside of every um and I think the the first big thing the first the first big thing I really noticed is that there is definitely a huge there's a 10x difference between an org where 90% of the engineers are using AI versus an org where 100% of the engineers are using AI. It's it's it's totally different. Um, I think the I think the big thing is if even 10% of your company is uh is using a more traditional engineering method, you you sort of have to lean all the way back over into that world. Um, and so it it prevents you from doing some of the things that you might do if everyone was uh not typing into a code editor all the time. Um, and I know this because this is what we do at every um, which is the company that I run. And it has totally transformed what we are able to do as a small company. Um, and so I think of us as like a little bit of a lab for what's possible that I I'm excited to share with you. So for people who don't know, I run every um, inside of every we have six business units. We have four software products. We run four software products with just 15 people, which is kind of crazy. Um, and these software products are not toys. We've grown at every we've grown MR by double digits every month for the last 6 months. We have over 7,000 paying subscribers and over 100,000 free subscribers. Um, and we've done this in a very capital-like way. We've only raised about a million dollars in total. Um and very importantly for for this audience and for this discussion um 99% of our code is written by AI agents. Uh no one is handwriting code. No one is writing code at all. Um it's all done with cloud code, codeex, Droid, what have you. Um uh coding agent of your of your choice. Um, and also really importantly for the size of team we are, each one of our apps is built by a single developer, which is crazy. And these are not like uh little apps. Uh, here here's an example. This is Kora, which is a um AI email management app. Um, it's sort of an it's an it it's an assistant for your email. It on on the left over here, it summarizes all of your all of your emails that come in. So, you can kind of read your email that way. This is what my inbox looks like. on the right is a um email assistant that you can ask questions like I asked where's when's my AI engineer talk um today and it gave me just gave me the answer um and this is built primarily by one engineer um that he's got one or two contractors that have helped in in certain ways but like almost all of this is built by one guy same thing for um uh this app which is another one that we we make called monologue which is a speechto text app It's sort of like Super Whisper or Whisper Flow if you know of those. Um, again, one guy, thousands of users. Um, I I love it. It's a it's a it's just a beautifully done app and it's not it's not simple. It's complicated. There's a lot of stuff to it. Same thing for this app called Spiral. You can see there's it's it's big. Um, and again, one engineer. So, obviously, this would not have been possible um a few years ago. it would not have been possible even a year ago. And I think the big change that happened that we're all starting to catch up to is um it started with cloud codes. This sort of like terminal UI that gets rid of the code editor really push pushed us into a place where um we are delegating tasks to these agents. We are and and that allows us to uh work in parallel and do much more than we would have ordinarily. Um, so some of the things that some of the things that I've noticed that we can do that I I assume people in this room are starting to see but um [snorts] I think is is sort of important to put put our finger on is uh the reason we can go much faster is we can work on multiple multiple features and bugs in parallel. And I think that there's a um there's like a little bit of a meme of the vibe coder on Twitter that is oh [snorts] like they they have um they have four panes open but they're not actually doing any work. And I actually you can do it that way and I think there are also definitely engineers and I know that they are because they work at every that are productively using four panes of agents at the same time. Um, and that's that's crazy and that that contributes a lot to the um ability for a single developer to build and run a production application. Um, another like really important thing about this, a really big um unlock is because code is cheap, you can prototype risky ideas and that allows you to do more experiments than you would ordinarily. And that lets you make way more progress because the starting energy to try something is so much lower because you just like say, "Oh, go do this. go do some research on this like big refactor I might want to do and then you go off and do something else. And that's a really big deal. Um, and another really interesting thing that I love about this stuff that I' I've noticed in inside of inside of our organization is we move we're moving a bit more toward a demo culture where um instead of you know previously if you wanted to make something you'd have to be like maybe write a memo or do a do a deck or um or you know convince a bunch of people that it was a good idea to spend time on because you can vibe code something uh in a couple hours that sort of shows the thing that uh that you want to make it. It allows you to show everybody and uh I think that being a being a sort of de democultural allows you to do weirder things that you only get if you can feel it. Um which is I think really amazing and beyond just like sort of the basic productivity unlocks. um AI has and the way that we use it has caused us to sort of invent an entirely new set of engineering primitives and processes which I'm sure that everybody in this room is starting to do already. I think everyone is sort of approaching the same things from different angles and a lot of them definitely do echo engineering processes from the past but I think it's really helpful to try to put our finger on okay what is the new way of programming if we're moving up a level of the stack and and we're moving from you know Python and JavaScript and scripting languages up into um up into English and the uh the the name that we've given to this process is compounding engineering Um, and the way that I talk about compounding engineering is in traditional engineering, each feature makes the next feature harder to build. In compounding engineering, your goal is to make sure that each feature makes the next feature easier to build. Um, and we do that in this loop. Um, the loop has four steps. The first one is plan. And if you're you've been here today, you've been paying attention, you know how important it is when you're working with agents to make a really really detailed plan. So I think everyone is doing that. Second step is delegate. Just like go tell the agent to do it. Everyone's doing that too. Third step is assess. And we have tons and tons of ways to um assess whether the work that the agent did is any good. There's tests, there's trying it, there's having the agent uh figure it out. There's there's code review, there's agent code review, there's all this types of stuff. And then the last step, which is I think the most interesting one, is codify. And this is kind of like the the money step, which is where you compound everything that you've learned from the planning stage, the delegation stage, the assessment stage back into prompts that go into your, you know, your cloud MD file or your um your sub agents or your slash commands. And you start to um basically create this library. You take all the tacet knowledge that you pick up um that all your engineers are picking up um as they find bugs, fix plans, um delegate work, and you um you make it into an explicit collection of prompts that you can spread for your entire organization. And um when you do that really well, there's a lot of like really interesting um second order effects that are are not I think that well understood or or that commonly talked about that I think would be interesting to to bring here because my guess is that um some people are already seeing this, but like maybe it needs to be pushed on a little bit more to like really be brought out and some people uh it might be an interesting way to get more of your organization to buy into using these tools. tools 100% of the time. Um so the first thing that you notice if you sort of if you set up this process and you and you're like 100% bought in on something like compounding engineering um is that tacet code sharing becomes much easier. So uh we have we have multiple products at every a lot of a lot of products a lot of times need to implement similar things even if they use different technologies or imple implementing similar things like a team's feature or a certain type of ooth or whatever. Um previously in order to share code you'd have to like abstract out whatever you did into a library and then like allow someone else to download it and it it'd be hard to do or you'd have to talk about it. With agents, um you can just point your Cloud Code instance at um the repo from the developer sitting next to you and learn the process that they went through to build the feature that they that you need to reimplement and re-implement it yourself in your own tech stack, in your own framework and in your own way. Um, and that's really really cool to kind of have this the more developers you have working on different things inside of the org, the more you can um share without any extra cost because AI can just go read all the code and and um and use it. Um, another really cool thing that I've noticed is that new hires are productive on their first day because you've taken all of the things that you've learned about like, okay, how do I set up an environment and what does a good commit look like and all this kind of stuff and on the first day they have all that set up in their in in their, you know, cloud MD files or their cursor files or uh codeex files or whatever and um the agent just sets up their local environment and knows how write a good PR. That's really cool. It also helps if you um want to hire like expert freelancers. Like there's some there's one guy there's one person who just is really good at this one specific thing. You can have them come in for a day and like do that thing. It's I think of it a little bit like um like a DJ or whatever can like go in on like a couple bars of a song. Like you can just sort of drop in and that's really helpful. it's it would ordinarily be like too hard to collaborate because the the startup cost is too high, but you can do that a lot better now. Um, another thing that I've noticed which is really cool too is um developers inside of every commit to um other products. So, uh you know, we have four products that run internally. Everybody uses all the products. If someone uh runs into a bug or a paper cutter, like a little minor quality of life thing that they want, they will um often just um they will often just uh just submit a pull request for it to the other GM of the app um because it's very easy for them to go download the repo and figure out uh or have really have Claude or Codex figure out, okay, this is how we fix the bug or this is how we fix the paper cut. Um and that's really really cool because you have this much um much easier way of collaborating across apps that I I think over the next couple years. I imagine that you will also be able to let customers do this to some extent. Like if you run into a bug um this is, you know, speculative, but if you run into a bug, you can have your little agent fix it um and submit it as pull request. It's a weird open source thing, but um yeah, this is really really cool and and definitely is happening a lot inside of our company. Um, another really cool thing is um we we have not this may get different as we as we scale but um we have not yet had to standardize onto a particular stack or language. We instead let everyone who's building different products like pick the thing that they like best and the reason is because it makes it AI makes it much easier to translate between them. Um and it makes it much easier to to jump into any language and framework and environment and be productive. And so it we don't uh it's easier for us to let people just do the thing that that they like and let AI kind of like handle the translation in between. Um and the last thing which is my favorite but like is also the horror I think of of some developers and to some degree maybe the horror of my team um is that managers can commit code. um if you're technical uh even the CEO and um for for me like I have no business committing code because we've got four products we've got 15 people we're growing really fast um I'm doing tons and tons of other things but I can and I I have like committed production code over the last couple months and the reason for that is AI allows um engineers to work with fractured attention so previously you might have needed like a 3 or 4 hour block of focus time in order to like get anything done. Um, but with cloud code, you can kind of like get out of meeting and say, "Hey, like I want you to investigate this bug and then go do something else and then come back and you have like a a plan or like a um root cause fix and then you can submit a PR." And it's not easy, it's not magic, but it is actually possible. And I think that's a that's just a totally new way of thinking how thinking of thinking about how managers interact with the products that they make. So, um, just to just to summarize, um, there's a I really think there's a 10x difference in how things work when you hit 100% AI adoption. I think, um, from what we've seen, a single engineer should be able to build and maintain a complex production product. what we call compounding engineering, but I think what all of us are are sort of pointing to um is I I think really works to make each feature easier to build and then creates all of these sort of nonobvious second order effects that makes it easier for the entire organization to collaborate together. And very importantly, many people in San Francisco don't know this yet. Um so you're you're the first to hear it. Um so that is my talk. So, if you're interested in um in what we do, uh I run EveryY. Uh Every is the only subscription you need to stay at the edge of AI. You can find us at every.TO. Um we uh we have a daily newsletter about AI. So, we do ideas, apps, and training. We have a on the ideas side, we have a daily newsletter. We review all the new models when they come out and all the new products when they come out. The apps, you already saw, we've a bundle of all these apps and then we do training and consulting with big companies to help them use AI and it's all bundled into one subscription. So you get everything for one price and that's it. Thank you very much. [applause] [music] [music] >> Ladies and gentlemen, please welcome back to the stage Alex Lieberman. Okay, 8 hours in. We did it. Um I have some housekeeping. We have to finish the day with housekeeping. First of all, I want to thank you all. It has been phenomenal to be on this journey with you all. But uh let's give a a shout out just to you all for being here, going through a full day listening to the programming. So, round of applause for everyone in the crowd, everyone [applause] online who's been watching. Let's also keep it going for all the team in production behind the scenes making this possible. I watched them work tirelessly throughout the day to make this happen. And then finally, let's give a huge shout out to Swix and Ben who made this whole thing happen. >> [applause] >> So get comfortable for a second. I have some housekeeping. Make sure everyone knows where to go. And then we have one final speaker who's going to chat uh right after I hop off stage. So let's just dive in for a sec. Uh tomorrow is the engineering session day. I will not be your MC. So you will be taken care of by Jed who works at Google. I spent the day with Jed. He is incredible. He's just like a taller, better looking version of me. and he's actually an engineer. So, you get a true engineer tomorrow. Uh, if you have a bundle pass, your ticket includes tomorrow's track. So, we'll see you tomorrow at 8:00 a.m. here. If you have the leadership pass only, your ticket does not include access to the sessions or the venue tomorrow. However, we have organized an off-site brunch for you on us at a restaurant not far from here. So, check your calendar for the invite and the location. But right now we are headed into the afterparty. And not only is there an afterparty, but there are after afterparties. There's a lot of side events. So your entire night is planned for you. And we have Graphite to thank for sponsoring the afterparty. So here to give us the last word for a brief message is the co-founder and CEO of Graphite, Mel Lutzky. [applause] >> [music] >> Good evening everyone. My name is Mel Lutzky and I'm the co-founder and CEO of Graphite. Uh we're the AI powered code review platform for this new age of agentic software development. Now I know you guys heard a lot today about agents and how to make them as effective as possible in generating code and building features faster than ever. And they're incredible at this. But I think everybody who's who's built software in a professional environment knows that writing the code is only the first part of the story. Every code change then needs to be tested. It needs to be reviewed, merged, deployed. And oftentimes that second half of the process takes just as long if not longer than actually generating the code. And that's what we do with Graphite. We're applying AI to the entire development process and making code review as quickly as as quick as possible. uh we have an agent that's integrated fully into our pull request page. Um it's like reviewing code in 2025 and not you it doesn't feel like 2015 anymore. U that's what we build. Um we're super excited about it. Uh if you want to come check it out, we have our booth uh in the expo hall and also we're going to be around all day tomorrow. We're the official sponsors of tonight's afterparty and also tomorrow's event at public records. So we wanted you know all you guys who came from out of time out of town, we wanted to show you good time in New York. Uh so we have two events for you uh to make sure that you have you have a good time and uh see what New York is all about. Uh want to give a big shout out to Swix and Ben and the whole AIG team for organizing and we're excited to see you guys all at the party tonight. Thank you very much. [applause] taking place in the halls on both doors. Heat. Heat. Heat. Heat. Heat. Heat. [music] Heat. Heat.