AIE Miami Day 2 ft. Cerebras, OpenCode, Cursor, Arize AI, and more!
Channel: aiDotEngineer
Published at: 2026-04-21
YouTube video id: DeM_u2Ik0sk
Source: https://www.youtube.com/watch?v=DeM_u2Ik0sk
Hello. Hello. Good morning. How's everybody doing? >> Whoa. Okay, that's the energy I'm asking for. >> No way, you came back. Well, I also know some people just came here for their first day. So, welcome back, everybody, and for people who are new, welcome to AI Engineering Miami, day two. >> Okay, some quick questions. Who learned something awesome yesterday that they can't wait to try very soon? >> Ooh, I see hands. >> Okay, who made some new LinkedIn connections? >> Oh, I see hands. >> Who made more than five LinkedIn connections? >> Okay, I'll take note of that. >> I hope you get replaced by AI. >> Who's that person? >> All right. And who got drunk last night? Don't be shy, show me your hands. Who got wasted? Never mind. I'm surprised you made it back. Anyways. >> Wow. Well, I'm so excited about the agenda for today. No matter how much you drank last night, we have a great lineup of talks today, ranging from G2i, our organizer, all the way to Cursor. A lot of great talks. Actually, I'm super excited about one talk in particular. It's super futuristic, but I'm not going to spoil it for you, so sit tight. At the end of the day, I'm going to ask you to guess which one I was most excited about. So just keep tuning in and find out all the talks that we have organized for you today. Okay, so I have a question for you all. Who is a world-class engineer? Yes. Yes. That's the confidence I'm looking for. Every single one of you sitting in this room is a world-class engineer, and so is the person sitting next to you; even the people who don't like other people are also world-class engineers. So I want you to really take this moment to learn from all the talks today, really try to be that world-class engineer that you truly are, and also talk to other people and make connections, so that together we can all build a better future with AI. That is the goal for the conference today, and I'm just so grateful that every single one of you is here, making connections and making that possible for us. >> I forgot to ask one question. Who's loving the MCs this year? >> Good, good, good. >> Who's loving the MCs? >> Thank you all. >> Yeah. Well, we're going to keep the energy up for our first speaker of today, David House. Iman, what do you think about David House? >> I'm really jealous of his jacket. >> You'll see. You'll see. >> Well, David will be talking to us representing G2i, the organizer of the conference. He's going to talk about how we're transforming programming mindsets, with some case studies in agentic coding adoption, so it's relevant to every single one of us. I can't wait to introduce David and let him show off his jacket. Give it up for David. >> Good morning, everyone. Hi. So I am a software engineering manager, but my background is actually in mental health counseling. I was trying to figure out how to marry these skills together, and recently I saw a unique opportunity. I've been hearing about all of these people adopting coding agents, but there's also a lot of anxiety, a lot of fear, a lot of existential dread. So I wanted to learn more about that. Also, this is my very first conference talk, so please clap. So there's a lot to learn. How are people adopting these tools? I got into the weeds with different interviews with people on the team.
I asked them questions about how they adopted the tools, what their experiences were, what their feelings were. And what I ended up learning about was a model of successful adoption. Our first case is Ava. Ava's first experience with AI was with ChatGPT. She used ChatGPT for coming up with ideas for family events, kids' activities, all kinds of things that she could review herself and trust the output; she was the only one who needed to sign off on it. But when it came to work, she held a very different standard. She didn't copy-paste code. She didn't trust the output. And contrary to her use of AI with her family, there were other reviewers that she was very mindful of. She was very mindful of her reputation, and she didn't want to ship slop. However, when she joined G2i, the project was very different. I'll tell you more about how the project was structured later, but she joined a project where agentic coding was the expectation from the beginning and the whole project was built around agent use. She learned that different techniques for working with agents could produce better output, output that she could actually trust, and she realized that a lot of things had to change about her development workflow. After months of using the prescribed workflow, she internalized those lessons enough to create a sub-agent for herself for a more specialized testing pattern. So when I watch YouTube videos about AI, I hear "slop cannon" and I hear "skill issue." It's really easy to polarize between the two, right? But if you're someone who's saying "skill issue," maybe you were once someone who said "slop cannon." And if you're someone who's saying "slop cannon," maybe there's something for you to learn to get to the point where you can say "skill issue." The point I'm trying to make is that there's something in between those two poles, and that's what I'm really interested in. I think both are true, right? When you learn how to work with AI, you can start to differentiate how to avoid the slop cannon and, hopefully, how not to be arrogant about it. Okay, I agonized over this slide for a long time, but now it's my favorite. Here's my claim for all of you: for a beginner, an agentic framework should constrain their input. A beginner doesn't know when they're going to make mistakes. But for an expert, someone who has spent a lot of time with agents, someone who's internalized the development practices of how to work with them, an agentic framework should actually amplify their input. So an agentic framework should shape the input of the human in addition to improving the agent's output. What I suspect is that these skills are not intuitive for any engineer. Agentic use was not part of any of our education, so we can't be expected to know it. Let's talk about what happened at G2i. The case studies here are all from one specific project at G2i. Everyone was onboarded to the same agentic framework, which relies very heavily on documentation and staged handoffs. It starts with a /brief skill, where the agent assumes a product manager persona, interviews the user, and the output of that conversation is a product brief.
That product brief becomes input to a /spec-build skill, where the agent assumes the role of a technical architect, again interviews the user, and the result is a technical specification. That's provided as input to the /code or /test-driven-development skill, where both documents act as very specific, very well-reasoned prompts with embedded judgment from the engineer. And then we close with a /review skill to make sure that the agent covered all of the implementation as defined in the doc, and then a draft-PR skill to save you a little bit of time and spend a few extra tokens. But don't get distracted by the framework. I'm not here to tell you that this framework is better than any other, and I'd probably go as far as to say that most frameworks are pretty much doing the same job. I break that job down into three things. Number one, revealing the hidden practice of agentic coding. Number two, making the work of the agent reviewable, both its reasoning and its output. And number three, training the engineer in effective delegation with the agent. A beginner might throw one prompt to rewrite the whole codebase in Rust, but a more experienced engineer will know to take that step by step. Our second case is Lucy, who has four years of engineering experience. Lucy got started with coding agents in a prior job before G2i. She had success writing a migration script; if any of you know about Vite and know about barrel files, you know exactly how hard this is. The agent could automate the boring stuff, and the agent could build a script that encoded a lot of judgment. Then Lucy's success on that project motivated her to give agents a try from the very beginning with a very complex personal project. But she ended up with a lot of duplicated code, code that she didn't understand, code that was too much to review, and this was also using technologies that she was unfamiliar with. It was very challenging to work with and very demoralizing. When she joined G2i, the framework that we used told her something that felt obvious in hindsight, which is that you have to tell the agent to do the right thing. You have to tell the agent to write tests. You have to tell the agent to run the tests and make sure that they're passing. So she learned that the practice of agentic programming is building the judgment you have about how to be an engineer into prompts and into the review mechanisms. She said that she does not take for granted that the agent will do what you want. So Lucy used the G2i framework for months, but she doesn't use the framework anymore. Now Lucy's method is much more interview-focused. She'll spend a lot of time going back and forth with the agent, not necessarily using a strict prompt flow, not necessarily expecting a specific documentation result, but she trusts that during this interview she's able to build in as much judgment as she possibly can. She's able to steer the agent during the interview, and she also doesn't necessarily trust herself to be a perfect provider of information, so the interview loop helps her remember the things that she had forgotten to say. Our next case is Antoine. Antoine's been an engineer for 15 years. He's a startup founder, and he is meticulous. Antoine is someone who has a reputation at G2i for being able to get anything done. No matter what it is, no matter how hard it is, if it's something we have to do, Antoine can do it. So we end up being very careful about where we direct his attention.
Antoine's skill and attention to detail meant that when he was working with agents, he was very focused on the results. His early experiences with agents were very disappointing. He realized he needed to be vigilant, and reviewing the code they output taught him that he needed to be very cautious. So he swung from early exposure with agents to a hypervigilance that diminished the value he could get from coding agents. But when he joined G2i, the TDD skill significantly improved the output, which improved his trust, which allowed him to trust the agent to do more, which gave him better results, which improved his trust. He's still skeptical, but he's tuned his skepticism to be more context-specific. He continues to use the agentic flow, but to hear him talk about it, he can't necessarily describe how the way he does it now is different from the way he did it before. He talks about the hundred nuances, the ways that the judgment he built over time working with agents and reviewing their output has become essential to how he writes code today. Our next case is Dale. Dale's been an engineer for about four years. His first experience with AI was using ChatGPT for coding. I was actually working with Dale at the time. We were both working in a PHP codebase, and neither of us was very good at PHP. And I was like, "Hey Dale, you can use ChatGPT to explain collections. You can use ChatGPT to explain Laravel. It will have infinite patience for you and infinite time." I had lots of patience, but I didn't have a lot of time. So he asked it for help with code, understanding it but not actually writing it. Similar to Ava, Dale was very concerned about reputational risk. He also didn't trust that the agent had enough information. With ChatGPT, he knew intuitively that the model couldn't read the whole codebase, so every single time he was unsure whether the output was misaligned given the available context. He was also more junior at the time, this was a few years ago, so he didn't trust that he could actually validate the correctness of what ChatGPT was outputting. When Dale joined G2i, he was a little hesitant, but it was clear to him that he needed to figure this out in order to be successful. He had some early successes with agents, but went a little off the rails and ended up with a 10,000-line PR. He figured, okay, well, the agent can do the whole epic, right? So let's just have it do the whole epic. And then he learned very quickly that that's not necessarily a very good idea. He learned to narrow the scope of his delegation and make sure that the agent wouldn't get away from him. Now, differently than Antoine, I think, and differently than Lucy, Dale has moved away from the framework, but his way of prompting today is very similar to it. Instead of using the skills to generate a document that is then used as input, Dale writes the entire document himself. He encodes the same guardrails and judgments that the framework made sure to include by default; he now knows intuitively to include them himself. And he can then use his judgment to be even more precise about the guidance.
So then he spends 20 minutes writing this prompt in his own words, but he trusts the agent well enough, and he trusts his own delegation skills well enough, that when he hits enter he's very happy to walk away and come back to a PR that he'll review as though it were a PR from his teammate. So I think the point I'm trying to make is that successful adoption is about internalization of these practices. When they started working with agents, the engineers described feelings of disempowerment. They described feelings of reacting to situations, not necessarily steering the output. But as the engineers learned more about agents, they learned how to be the ones in the driver's seat. You could say they learned agency. But what about junior engineers? When I hear people talk about junior engineers, I tend to hear one of two things. I hear that if you find a junior engineer who has already learned AI, you're set, great. But if you don't, then juniors are screwed. So again, it's another one of these polarities: either you find someone who's already learned the skills, or you don't know what to do. I think we can do better than this. Andy is a member of our team, fresh out of college. His whole experience with AI was being told not to use it by his faculty, but his tutors encouraged him to be a little more nuanced. His tutors had to coax him into using AI like a tutor, because he was spending a lot of time at the tutoring center. He was open to having AI augment his understanding, but all throughout school he wrote every single line of code that he submitted for his assignments. When he joined G2i, it was a total culture shock. Andy started to re-examine his hesitancy when he saw senior engineers with much more experience than he had finding success using agents, but he didn't necessarily trust himself to be a guide. And for good reason: he hadn't spent a lot of time in the field, he hadn't built a lot of software, he hadn't made the mistakes. What he found made him much more successful was bringing a senior engineer into the loop of his work. So if you remember, we have this agentic framework at G2i where you use skills to do document handoffs. Before he got into any code, he would take the documents, put just the documents up for PR, and get reviews from senior members of the team. That allowed the senior engineers to encode their judgment into the documents, which made the docs more successful and made his implementations more successful, but it also served as a teaching opportunity and a mentorship opportunity. Also, Andy is looking for his next gig. So let me know if you want to meet him, but don't wait, because he's already interviewing. So don't give up on juniors. Andy became a steady contributor in three months, despite being the only junior engineer on the team. Junior engineers still need mentorship, but agents expedite that, because they can serve as tutors for the engineers when all the senior engineers are busy with their own implementation tasks. And agents expedite that mentorship as well, because the documents that the junior engineers create can serve as high-leverage review artifacts, so the implementation can be much more successful, building the confidence of the junior engineer. To quote one of his colleagues: "I was shocked to hear this was your first role out of school. You fit right in alongside folks with 10 years of experience." That's incredible.
So many people are convinced of software engineering doom. Maybe we're on a timeline where the profession of software engineering becomes irrelevant. Maybe that's true. But what if everyone trying to predict the future got it wrong? What we do today matters. I want us to look back on how we're approaching this work and how we're approaching our field from the standpoint of opening doors for everyone. I don't want to give up on the hesitant. I don't want to give up on the conflicted. I don't want to give up on the people who haven't had that experience of being shown how to be successful with these tools. I'm not interested in leaving anyone behind. So you can try the framework if you want, but again, don't be distracted by the framework. Use BMAD. Use RPI, or the acronym that adds more letters to RPI that I can't remember from yesterday. Krispy. Yes. Try a framework that slows you down so you can review the output. >> Thank you, David. So everybody wants to run the latest model, right? The fancier one with more parameters. That ends up being very costly, and we end up with a tremendous debt: latency debt. The next speaker is here to save us from that debt. Please welcome to the stage Sarah Chieng, head of DevX at Cerebras. >> Well, good morning, everyone. My team and I actually just came from California, so for us this is a nice 6:30 a.m. talk. But it's great to be here, and I'm excited. All right, so recently I've been learning a wide variety of new vocabulary words. Cogitating. Reticulating. Schlepping. Some of you know where this is going. And the reason I've been learning all of these words is not by choice, but because every time I try to use Claude Code, I get these words for like 35 minutes. I also get "writing" and "waiting" and "thinking," but we already know those words. And if any of you have been playing around with LLMs or using them recently, you've probably had a pretty similar experience. Slow code generation. We write massive prompts. We submit massive diffs. Or, honestly, we hit enter, we go on our lunch break, and we come back and pray it's done. It's frustrating. It kills the spark and joy of AI coding. And today we're going to talk about why this is all happening. So my name is Sarah Chieng. I am the head of developer experience at Cerebras, and we are building the world's largest and fastest AI processor. A big part of my job is introducing developers to fast coding for the very first time. It's a very exciting job, because oftentimes we get a lot of oohs and aahs and "oh my god, this exists." Outside of Cerebras, I also create content on YouTube and TikTok and Twitter. So before we dive into everything that is frustrating and going wrong, let's talk about all the amazing things that are going right. In the last few years, we have seen models get so much bigger. We have gone from models that are 0.3 billion parameters to over 1 trillion parameters, and a lot of these are open-source models too. As you can imagine, this also means we're getting much smarter models. Context windows are also getting much bigger. If you so desire, you can shove millions of tokens' worth of context into your model. And what's exciting is that we're actually seeing developers use this context. This is a study by OpenRouter and a16z. OpenRouter is a unified API layer for LLMs, and they did a study where they sampled 100 trillion tokens' worth of real-world usage across different use cases and capabilities.
And what they saw is that input token lengths have gotten three times, sorry, four times bigger in the last year alone. And on the flip side of that, output tokens have also gotten so much bigger. The number of output tokens per prompt has increased by 3x in the last year. And we can see this is because our models aren't just outputting the final answer; they're also outputting reasoning tokens. Reasoning tokens are the intermediate tokens that a model outputs when it's thinking step by step before producing the final answer, and this couples nicely with the rise of reasoning models; we've had a lot of incredible reasoning models over the last year. So our models are getting smarter, our models are getting bigger, and all of this is wonderful news. This, my friends, is quite literally the best technology that humans have ever built. And so now let's look at this best technology in action. This is Opus 4.5 building a very simple HTML snake game, sorry, pool game. Awkward. You all wish you could go leave for a snack, but you can't. And if this were in Claude Code, this is where we'd be learning new vocabulary words. This goes on for minutes, and I won't bore you, but I'm bored, and you're probably bored. As it turns out, if you've been playing with LLMs, you've probably had a very similar experience. But unfortunately, with all of this exciting innovation, there's a very unexciting side effect, and I like to call that side effect latency debt. While we've been innovating on all these models, they've gotten bigger, they've gotten more intelligent, we put more tokens in, we get more tokens out, and we have accumulated a hidden cost by optimizing our models faster than our infrastructure. What I mean by that is this: we've improved on so many fronts, but now let's look at speed. Over the past two years, if we look at the Gemini families, the Claude families, the Sonnet models, our speed has stayed the same. We have always stayed between 50 and 150 tokens per second. And remember that even if your speed stays the same, if the number of input tokens increases and the number of output tokens increases, then the total time it takes for your prompt to complete is also going to increase. And when that increases, the developer experience decreases. And so now, this is the exciting part. Now is when I get to tell you that all of this is about to change. In fact, it is changing. We've accumulated so much latency over the years, and now we have a solution. About a month ago, we at Cerebras and OpenAI partnered up to release Codex Spark. It's the first state-of-the-art coding model that can generate code at 1,200 tokens per second. To put that into perspective, that is 20 times faster than anything you're seeing here. And not only that, but Codex Spark is just the first of many future models that we can expect to be a step function faster than anything we've been frustrated with before. So not only are we unlocking new capabilities and workflows and use cases, but this also fundamentally changes how we as developers can interact with the coding model. No longer do you hit enter and leave; now you can actually sit down with the coding model, treat it like a pair programmer, verify the outputs, steer it. No longer do you have to generate machine-scale volumes of code and accumulate technical debt. Now you can actually verify your code. You don't have to create bad code.
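As a rough back-of-the-envelope illustration of that point (the numbers below are illustrative, not vendor benchmarks), total wall-clock time can be modeled as prefill time plus sequential decode time; even at a constant tokens-per-second rate, bigger prompts and longer outputs stretch the wait, and a much faster decode pulls it back into interactive range:

```python
# Rough latency model (illustrative numbers only, not measured benchmarks):
#   total time = prefill time + decode time.

def request_latency_s(input_tokens, output_tokens,
                      prefill_tok_per_s, decode_tok_per_s):
    """Simplified: prefill is parallel but still scales with input length;
    decode is sequential, one output token at a time."""
    return input_tokens / prefill_tok_per_s + output_tokens / decode_tok_per_s

# A prompt from a year ago vs. one today (inputs ~4x, outputs ~3x larger).
old  = request_latency_s(5_000,  1_000, prefill_tok_per_s=2_000, decode_tok_per_s=100)
new  = request_latency_s(20_000, 3_000, prefill_tok_per_s=2_000, decode_tok_per_s=100)
fast = request_latency_s(20_000, 3_000, prefill_tok_per_s=2_000, decode_tok_per_s=1_200)

print(f"old prompt, ~100 tok/s decode:   {old:.0f}s")   # ~12s
print(f"today's prompt, same speed:      {new:.0f}s")   # ~40s  <- latency debt
print(f"today's prompt, 1,200 tok/s:     {fast:.0f}s")  # ~12s  <- interactive again
```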
And so now let's look at that same graph again, and let's look at Codex Spark. It's so much faster that we had to adjust the y-axis. And remember, Codex Spark is just the first of many future models that we can expect to be a step function faster. At this point it's a complete regime change. We are entering a regime of fast inference, a regime where the coding model can code faster than we humans can keep up with. Now we are the bottleneck. So why is this happening? We've been accumulating this latency debt for so many years; what is suddenly happening right now that is making all this possible? The exciting part is that it's because the entire inference stack is getting optimized all at once, and many of you in the audience are probably contributing to this in your day-to-day work. So let's break it down, starting from the very bottom: the hardware. Hardware is the physical reason that speed is possible; it's the physical device that all of the inference and training workloads run on. With hardware, one of the biggest considerations is what I like to call the memory wall. During inference we have memory movement: the movement of weights, KV-cache values, and so on between memory and the actual chip. On traditional hardware like the GPU, all of this memory is stored off-chip, so we're constantly moving all of these values on and off the chip, and that memory movement is why the hardware accounts for 50 to 80% of the total inference latency. What we're now seeing are newer approaches that ask how we can reduce this memory movement as much as possible. One example is Cerebras building our own version of the AI processor, where instead of storing our memory off-chip, all of our weights, activations, and KV-cache values are stored on-chip in distributed SRAM, so that every single core that's actually doing the computation has direct access to all of the values that it needs. Even more exciting is the commercialization of disaggregated inference. This is why Nvidia bought Groq for $20 billion a few months ago, and this is also why Cerebras and AWS just announced a partnership to serve the Cerebras wafer-scale engine and AWS Trainium together. What is disaggregated inference? In traditional inference we have two steps: prefill and decode. Prefill is when we're processing all of those input tokens; we can do this in parallel, embedding them and adding them to our KV cache, so this step is compute-bound. Our second step, decode, is when we're generating every output token, token by token. This is sequential, and this is memory-bound, like what we were talking about before. So we have two different steps with fundamentally different requirements, and we've always been serving them on the same piece of hardware. Well, no more. Now we're splitting it up so that decode happens on memory-optimized hardware and prefill happens on compute-optimized hardware. And now let's move up the stack: the model architecture. There are so many ways that we have adapted our models to run especially well on the hardware that we have. We think about model size, shape, the layers, the dimensions.
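A hedged sketch of why decode is memory-bound: to produce each output token, the active weights (plus KV cache, ignored here for simplicity) have to be streamed from memory, so single-stream decode speed is roughly bounded by bandwidth divided by bytes moved per token. The figures below are illustrative estimates, not measurements from any vendor:

```python
# Upper bound on single-stream decode speed when memory movement dominates:
#   tokens/sec <= memory_bandwidth / bytes_read_per_token

def decode_ceiling_tok_per_s(active_params_billion, bytes_per_param, hbm_bandwidth_gb_s):
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param  # KV cache ignored
    return hbm_bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical 70B dense model at fp16 on an H100-class GPU (~3.3 TB/s HBM bandwidth):
print(decode_ceiling_tok_per_s(70, 2, 3300))   # ~23 tokens/s per stream
# Same math if only ~20B parameters are active per token:
print(decode_ceiling_tok_per_s(20, 2, 3300))   # ~82 tokens/s per stream
```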
Mixture of experts isn't particularly new, but it's a great example of one way we do that. With a mixture of experts, instead of activating the entire model for every single token, we only activate a small subset of experts, and what this gives us is the intelligence of a much larger model for the compute cost of a much smaller model. Again, we're always thinking about memory, just like at the hardware layer we were talking about before. More recently, researchers are building on top of this, so we're seeing a lot of exciting new advancements in how we can make our models fit even better on our hardware, especially for speed. An example here is REAP, router-weighted expert activation pruning, a mouthful, where instead of activating the entire model, we look at a specific use case, see which experts are activated, and just remove, prune, the ones that we don't need. And now the final layer of the stack is inference optimization. Again, this is where a lot of you are probably working. Philip from Baseten gave a great talk yesterday; this is where companies like Baseten, Modal, Together, and Fireworks are all working. One of the biggest considerations at this level is KV-cache reuse: by storing and reusing previously computed token representations, we don't have to recompute attention over the entire sequence every single time. All of these are investments that we've been making, and all of these are investments that we're slowly starting to see come to fruition. And it's incredible. We've had this latency debt that we've accumulated for years, and now look at where we're at. So now I want to show you what a state-of-the-art model looks like on state-of-the-art AI-optimized hardware. Again, it's the same two-player pool game, but this time it's running on this: the Wafer Scale Engine 3, Cerebras's own hardware, so it's not running on GPUs. And this is Codex Spark inside of Codex. And it's done. There it is. And if you still don't believe me that this is real, you probably do, but just to be sure, let's look at the money. Let's look at where the biggest companies in the industry are putting their money over the last six months. Number one, Google versus Nvidia. Google releases the TPU, four times faster than the Nvidia GPU; this is Nvidia's biggest customer investing at the infrastructure level. Then we have Google and Anthropic. What happens? Anthropic comes in and buys those TPUs, tens of billions of dollars' worth. As I previously mentioned, Nvidia bought Groq for $20 billion; this was Nvidia's biggest purchase ever, and Groq had been building the LPU, their own version of a faster AI processor. And most recently, and most excitingly, not biased at all, OpenAI buying Cerebras hardware. This happened a few months ago, and Codex Spark was the first model they released in this partnership. Again, we can expect many future models to be a step function faster. So we can see all this momentum: latency debt has been slowly accumulating, we saw the biggest players not only recognize latency debt but invest in it, and now we're seeing that investment pay off. And so as we think about these new eras that we're entering, this new regime where the model codes faster than we can keep up with, I want us to take a little trip down memory lane through previous eras. So, in 1985, it's a history lesson now.
We switched from custom hardware to CPUs. Why? Because our workload changed. We went from scientific simulations and mathematics to needing highly programmable software apps. In 2003, our workload changed again. Now we had machine learning and graphics processing, so once again our infrastructure changed: we went from CPUs to GPUs, because now we cared about high levels of parallelization. Where are we now? We're in 2026, and once again our workload is changing. We now have multi-agent systems. We're building 500-plus-agent swarms, human-in-the-loop, agentic workflows, and once again our requirements change. What's exciting is that we've already seen people invest at this level, invest in the infrastructure, and right now is when all of that is paying off. I cannot communicate enough how exciting it is to be a developer right now. Not this year, this month, this day, because we are right in the middle of a step change as we enter a new paradigm where the coding and developer experience is going to get so much better. The thing that I want to emphasize is that it's not just about faster models; it's that the developer experience is going to change. The way that we as developers interact with the models is going to change, and our ability to no longer produce bad code is going to change. So, thank you guys so much. My name is Sarah Chieng. If you need any credits or have any questions, reach out to me; my handle is Milks and Matcha across the board. Thank you. >> All right, that was a great talk from Sarah. Okay, so our next speaker is Le. Fun fact: I think Le might be the most likely to survive if there were a power outage here, because he has all sorts of gadgets with him. He even has a flashlight. He has a pointer, a laser pointer, and his models are deployed locally on his mobile. So even if there's no Wi-Fi out here, he will be able to make it out alive. I'm so excited to hear how he's able to do that. Le is going to tell us about ambient generative AI and how he deployed latent diffusion models on his mobile phone's NPUs. So welcome on stage, Le. >> Hello. Okay. I'm so surprised that people are still so inspired by nature. We are just trying to mimic leaves, mimic birds, mimic how they fly. And yet we also try to mimic one of the most fundamental physical laws, which is randomness. My name is Le Kalinowski and I am a physicist. I work at Callstack; it's a Polish company. And today I want to present to you my original research about the deployment of diffusion models on mobile. This is part of a bigger project related to hyper user experiences. I work in the research and development center, which means we try to innovate on user interfaces and push the boundaries of the implementation of AI in that area. So, randomness: it's really well known and has been used for many years in algorithmic computer science. One of the most famous algorithms is the Monte Carlo method, where you can, for example, calculate the area of a circle, or you can simulate the chain reaction in nuclear physics.
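As a concrete illustration of the Monte Carlo idea he mentions, here is a minimal sketch that estimates the area of a circle, and therefore pi, from random samples:

```python
# Minimal Monte Carlo example: estimate pi by sampling random points in the
# unit square and counting how many fall inside the quarter circle.
import random

def estimate_pi(samples: int) -> float:
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()   # a point in the unit square
        if x * x + y * y <= 1.0:                  # inside the quarter circle?
            inside += 1
    return 4.0 * inside / samples                 # area ratio * 4 ~= pi

print(estimate_pi(1_000_000))  # ~3.14, converging slowly as samples grow
```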
And today I want to tell you why I did this and where I got my inspiration to build this type of project. Currently, the biggest AI labs in the world work with randomness. What does that mean? They try to optimize and build LLMs that are fast and cheap. For example, Google just introduced TurboQuant, and at the core of that algorithm there is randomness. As another example, there are models that cost around $120 per 1 million tokens, and that's a really expensive model. This is why we look to fundamental laws and try to be inspired by them to build better, faster, and more optimal models. And this is the reason I wanted to try to deploy diffusion models in a mobile application in the best, most optimal way possible. So I started to think about how I could do it, because as you probably know from yesterday, and Google just showed it, diffusion models and picture generation are really difficult and computationally heavy: you normally need a huge GPU, and generation takes a lot of time. There are a couple of constraints you have to keep in mind, such as the thermal profile of the device. And I got an idea: maybe I can use the NPU, the neural processing unit, on the smartphone, deploy a diffusion model there directly, and then run diffusion inference on it. But that's not enough. The main idea was also to get rid of the full pipeline of turning text into embeddings and building full prompts through it. So I just threw that out and got the idea to use a direct numerical value from an ambient sensor. Here I use one example; it could be a different sensor, it could be an accelerometer, it could be something else, but it works. So what I did: I designed a full pipeline with zero cloud API calls to outside, non-local models, and I built a fully measurable system to check it out and prove that diffusion models are possible on local mobile devices.
Here's the pipeline. The most important thing is the latent update, and as you can see here, there's a stochastic perturbation scale; that's where the randomness is implemented. One more thing, because I don't know if you all know what diffusion really is: diffusion works by adding noise to our data sets. For example, we have a data set with, I don't know, maybe a lot of dog pictures, and we add layers of noise to them, and then we train the model, like with transformers, by minimizing a loss function, until we end up with a heavily noised latent-space representation. To get an inference out the other side, we decode that noise, decode that randomness, and for that, just like in regular LLMs, we use learned transformations: we iteratively denoise the information, and all of that requires a lot of compute power. The logic is quite simple: first the ambient sensor, then mapping, then conditioning, which is the full pipeline to get that data mapped into the latent space, then the latent update function, the runtime session, and full telemetry to measure the results. Here's a more detailed view of it. The full application is designed in blocks. The first block is the input and encoding layer; it's not that simple to get the raw data from the sensor and map it to the latent space. So the first mapping is: we read the ambient light sensor, then build histograms on top of that, a profile, and full normalization, and then build the latent vector. I also built a full fallback, to be sure that if I run out of memory or some kind of thermal incident occurs, I just cut off the whole application, because I didn't want to lose my phone. Then here is the API and the native Android bridge, which gives me the opportunity to deploy the model directly on the NPU, and I used ONNX Runtime to do it. The next part is the full diffusion pipeline, with a native Android session manager, and here is the model. This is the UNet and the variational autoencoder: the part where we encode the, let's say, prompt; then we have the UNet denoising loop, the core of the diffusion process, with 8, 16, or 24 steps; and then we decode it to a tensor and an RGB picture. At the end, I also collect the pictures themselves, all of the artifacts, and the metadata, and then we have the gallery and full telemetry. Okay, I think this is the funniest part of my presentation today. Let me show you the simulation on a computer first, just before I show you the working application on my phone. Okay, this is how it looks. It's maybe not impressive at this stage, but give me a second. So I can change the simulated light and get an instant inference on the device here.
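The pipeline in the talk runs as a native Android session on the NPU. Purely as an illustration of its shape, here is a simplified Python sketch using ONNX Runtime, where the model files, input tensor names, the sensor-to-conditioning mapping, and the latent-update rule are hypothetical stand-ins rather than the actual implementation:

```python
# Simplified sketch of the sensor-conditioned diffusion pipeline described above
# (the real app is a native Android NPU session; file and tensor names are hypothetical).
import numpy as np
import onnxruntime as ort

unet = ort.InferenceSession("unet_denoiser.onnx")   # hypothetical exported UNet
vae  = ort.InferenceSession("vae_decoder.onnx")     # hypothetical latent -> RGB decoder

def generate_from_lux(lux: float, steps: int = 16, perturb_scale: float = 0.1) -> np.ndarray:
    # 1. Map the raw ambient-light reading to a normalized conditioning value.
    cond = float(np.clip(np.log1p(lux) / np.log1p(40_000.0), 0.0, 1.0))
    # 2. Seed the latent from the sensor instead of a text-embedding pipeline,
    #    plus a small stochastic perturbation.
    rng = np.random.default_rng(int(lux))
    latent = rng.standard_normal((1, 4, 32, 32)).astype(np.float32)
    latent += perturb_scale * cond * rng.standard_normal(latent.shape).astype(np.float32)
    # 3. Iterative denoising loop (8, 16, or 24 steps in the talk).
    for t in np.linspace(1.0, 0.0, steps, dtype=np.float32):
        noise_pred = unet.run(None, {"latent": latent,
                                     "timestep": np.array([t], np.float32),
                                     "cond": np.array([[cond]], np.float32)})[0]
        latent = latent - (1.0 / steps) * noise_pred  # simplified update rule
    # 4. Decode the final latent to an RGB image tensor.
    return vae.run(None, {"latent": latent})[0]
```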
Let me also show you the whole measurement part. We capture full metadata: ambient light, smoothed light, stabilization, temperature, and everything that's necessary. One of the most important parameters for me was stabilization, along with latency. Okay, let me show you right now how it works on the smartphone. Here's my private phone, not sponsored by Samsung, it's my private one. You can probably see my screen. I will switch off all of the internet connections, switch to flight mode, so we can be sure it's not fake. Okay. And here it is. Maybe I'll just light it up. I'm really glad that there's a lot of light in here, because it's working. Here we have the change in lux; we are capturing all of the data from the sensor, and what you are seeing right now is the full inference of the diffusion model deployed on the NPU. It works really, really fast. I know this is not a pixel-perfect representation, because to make it really optimal I had to cut out the whole part that builds real text-to-embedding prompts. But what is really important, as you can see, is that it's extremely fast and really stable. That means you can build other, different applications with this; you can use it to generate different kinds of pictures, or you can even try to build an animation with it, and it's not something that will kill your phone in the end. Okay, let me briefly show you the results. Here we have the parameters I captured, the artifacts, how many artifacts; this is a single experiment. Robust latency is around 600 milliseconds, so the full inference is really fast, it doesn't cost that much, and it's really stable, which means you can use this type of architecture for your future projects. But I didn't want to finish on, let's say, weird RGB colors on the phone, so I pushed the application to build something nicer. Just before the conference, I decided to build something more. Give me a second, I will just find the application. So I decided to generate a heart, using some geometric tricks and constraints to steer the model toward a real heart shape. And actually, that works. So I'm really happy that I was able to do it, and I'm really happy that I can show you the recent progress on hyper user experiences today. To wrap it up, I think you should think about NPUs and try to use them wherever possible to optimize and speed up your applications, because all modern mobile phones already have one, but I think not everyone uses it in modern development. You can, for example, build an animation with this. It's a really good option. So I just want to point out that you can use it as an unconventional application, and it's just waiting to be discovered in your projects.
And if you want to learn more about Callstack and our work, because in our incubator and R&D center we have already open-sourced quite a lot of AI projects, for example an evaluation benchmark for React Native applications, then grab the code and take a look. Thank you very much. >> Great. Thank you, Le. Quick note for our upcoming speakers: please show up at the G2i booth in the expo hall so they can prepare you and bring you backstage. And I promised Le and David that I would compare their jackets. Well, he has a suit. Those of you who liked David's jacket, please make some noise. Okay, those who like his? >> I think we have a winner. All right. So, we have a break right now. Please get your coffee, but be back here around 10:55, as the next talk starts then. Thank you. >> Ladies and gentlemen, please take your seats. Our event will start in 5 minutes. >> Hey everyone, welcome back. I would like to take a moment to thank our sponsors, without whose support this would not have been possible: CodeRabbit, Mintlify, Cerebras, Sentry, Cloudflare, Tailscale, Modem, Ampify, Auth0, Google DeepMind, City Furniture, and Enkrypt AI. Let's hear it for our sponsors, shall we? Our next speaker, I asked him in the back, what's the craziest thing you do? He said, I code all day, but I have a foot pedal. It's a productivity hack, apparently: a foot pedal, and when he presses it, it starts the voice commands and he can talk. Turns on the speaker, I suppose. >> Oh, no. Wispr Flow, if you know what that is. >> There you go. I don't know what that is. >> It's like voice dictation. >> You'll have to explain that during your talk, then. So he does that. Also, he tries to take, did I say it? 210 grams of protein a day. >> You're giving it all away. >> All right. It's my pleasure to welcome Tis on the stage, and he's going to talk about a lot of inference optimization. He worked on Tesla Autopilot before, and now he has his own startup, and his focus is inference optimization and how to make codegen more optimized. There you go. Thank you. >> All right. One sec. Dude, I swear I practiced this before. Sorry. One sec. >> Yeah, I'm not seeing anything from HDMI. >> Okay, as Tis is setting up, here's a reminder, along with the background music, that there is an afterparty today. So, if you didn't get a chance, or you held back from getting wasted last night, tonight is your chance. It's at 7:00 p.m. at Thrilled Show. And with that, I think Tis is ready. >> Cool. >> Take it away. >> Yeah. Hey everyone. I'm Tis. I'm the founder of Morph, and today I'm going to talk to you about sub-agents and specialized models. So, you might remember this. This happened last year, when Trump was at the White House. He was getting a demo of a Model S from Elon, and he said, "Wow, that's beautiful. Everything is computer." I thought that was great, because it's actually accurate. Everything is actually computer, right? Even the windows inside a Model S are like a mini computer.
And so what I'm getting into now is Andrej Karpathy's software 1.0, 2.0, 3.0 paradigm. If you remember, software 1.0 was humans writing code that programs computers. Software 2.0 was weights programming neural nets, which act sort of like a flexible fixed-function computer that can do things like detecting stop lights. Software 3.0 was humans using prompts to program LLMs. But a new primitive has appeared. Now we have subtasks and Composer as a background search tool, and then Cognition with their SWE models. So what I think is the next step past software 3.0 is what I call software 3.5, which is agents prompting other agents. You still have the human at the top prompting that main agent, but below that, some people call this agent swarms; I think that's too vibe-code-y of a term. So what is a sub-agent? It's super simple. The simplest definition is just using a separate context window for a task. This has been proven useful for coding agent performance. So it's this third brand-new primitive; I think the first two were system prompts and tools. The question becomes: should sub-agents be specialized models, or should they be general models? And I think the right answer is that it depends, but the interesting part of this talk is when it's actually specialized models. So first, the cost of using a frontier model. Frontier models are really good, especially at coding and reasoning, but they take a lot of compute and they're super, super expensive, so they're not really great for every task. So why should you even bother considering a specialized model? From a macro perspective, we have this huge compute scarcity crisis right now, where there's no H200 or B200 capacity across the board if you look at GCP or AWS. In fact, right now I even have agents in the background trying to find some capacity. This macro pressure pushes us toward specialized models, because then you can use less compute to do a task to the same accuracy. This is just a graph saying the same thing: basically, what you want is the minimum compute to do a task well, and that arises directly from this compute constraint. And so our answer at Morph is that we do specialized models and specialized inference. We think stuff like code search and context compression should move to specialized models, whereas stuff that still needs frontier compute, like coding, planning, and reasoning, should stay frontier. One of the things we can do when we specialize a model is make trade-offs that you wouldn't make with a general model. For example, a frontier lab typically won't train on a parallel tool-calling harness, because you get this catastrophic forgetting problem: once you do RL on a parallel tool-calling harness, you start seeing regressions on sequential tool-call performance. That's an example of a trade-off we can make that a lab might not want to make. And that's what we do for our code search model. Our code search model can do up to 12 parallel greps, reads, or list-directories, for up to six turns, and it can do this very fast on an inference engine that we build that's designed for this workload, which is super prefill-heavy: up to 80k tokens of input and around 200 tokens of output.
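As a rough illustration of the "separate context window plus parallel tool calls" idea described above, here is a minimal sketch; the function names, the summarizer, and the example commands are illustrative stand-ins, not Morph's actual harness or API:

```python
# Minimal sketch of a search sub-agent: a fresh context given one narrow task,
# fanning tool calls out in parallel (names and commands are illustrative).
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_tool(cmd: list[str]) -> str:
    """One tool call: grep / read / list directory, executed as a subprocess."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def summarize(context: list[str]) -> str:
    # Stand-in for the specialized model call that compresses the gathered
    # context into a short answer for the main agent.
    return "\n".join(context)[-2000:]

def search_subagent(task: str, tool_calls: list[list[str]]) -> str:
    """A sub-agent is just a separate context window given one task.
    It fans tool calls out in parallel (up to 12 per turn in the talk),
    then hands back a small summary instead of its whole context."""
    context = [f"TASK: {task}"]               # no main-agent history carried over
    with ThreadPoolExecutor(max_workers=12) as pool:
        context += list(pool.map(run_tool, tool_calls))   # prefill-heavy input
    return summarize(context)                 # tiny output back to the main agent

# Example call (hypothetical repo paths):
# hits = search_subagent("where is the authentication logic?",
#                        [["grep", "-rn", "authenticate", "src/"], ["ls", "src/auth"]])
```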
Another model we have is our compaction model; it's basically a model that's trained to compact context at 33,000 tokens per second. Fast Apply is our diff-apply model. And we build these models because I think coding agents spend most of their time doing things that don't actually need frontier compute: code search, compaction, and applying code edits. These happen thousands of times, but they just aren't things that need frontier compute. So how do we make them good? Good for us doesn't just mean building something that gets a good F1 score, for example. You have to train these to be a good sub-agent, not just a good model. What that means for us is that when we do RL, we train to minimize the randomness you introduce when you add a specialized model. Let's say you have Opus at the top calling one of our models, like our code search model. One of the problems when you have this setup with a separate model is that Opus might assume your model is more capable than it actually is, or that it works in a different way than it actually does. So during RL, we have to train to minimize this randomness. That's one dimension you have to cover when you make a specialized sub-agent. And how do we make these fast? One of the most common ways people make models fast is called speculative decoding. Speculative decoding is a thing, first of all, because you have this massively parallel machine that is the GPU, and you have this super sequential architecture that is the LLM. At the top there you see traditional LLM inference, where you have to generate one token at a time, update the KV cache and everything, and then generate the next token. With speculative decoding you flip this: you have some heuristic to make a guess of, let's say, six tokens, and then you have your massively parallel machine, the GPU, verify that those six tokens are correct. And the time to generate one token or to verify those six tokens is the same; they both take the same time. Traditionally with speculative decoding, you have a smaller model, a very tiny model like a 0.7B, make these predictions, and then have the big model, the one that's like three trillion parameters, verify them. But you don't actually have to have a small model. You can do it heuristic-based, you can do it whichever way you want; you just need some way to get those guesses. Another common method is disaggregated prefill. Prefill is everything that happens up to the time to first token; decode is everything that happens after that. A common thing we're starting to see now is disaggregating those workloads onto separate chips. This is a thing partly because Nvidia's NVLink is so good. Nvidia is starting to do this now with Groq, where Groq will be doing prefill and traditional Nvidia GPUs will be doing decode. It also works if you just use Nvidia for both. Another way is writing kernels. Kernels are notoriously hard, and it's a very fast-changing field: one day it's CuTe DSL, the next day it's Mojo, the next day it's cuTile, and the next day it's back to CuTe DSL. This one is the hardest of the three.
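To make the draft-and-verify loop of speculative decoding described above concrete, here is a toy greedy sketch; `draft_next` and `target_argmax` are stand-ins for a small draft model (or heuristic) and one parallel scoring pass of the large model, not any particular engine's API:

```python
# Toy illustration of the draft-and-verify loop behind speculative decoding.

def speculative_decode(prompt, draft_next, target_argmax, k=6, max_len=64):
    out = list(prompt)
    while len(out) < max_len:
        # 1. Cheap sequential guesses: k tokens from the small model / heuristic.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. One parallel verification pass of the big model over all k positions
        #    (roughly the cost of generating a single token).
        verified = target_argmax(out, draft)   # big model's choice at each drafted position
        # 3. Accept the longest prefix where draft and target agree, then take the
        #    target model's correction at the first mismatch.
        n = 0
        while n < k and draft[n] == verified[n]:
            n += 1
        out += draft[:n]
        if n < k:
            out.append(verified[n])
    return out[:max_len]

# Tiny demo purely to exercise the control flow (real systems use model logits):
vocab = ["a", "b", "c"]
draft_next = lambda seq: seq[-1]                                        # naive: repeat last token
target_argmax = lambda ctx, draft: [vocab[(len(ctx) + i) % 3] for i in range(len(draft))]
print(speculative_decode(["a"], draft_next, target_argmax, k=4, max_len=12))
```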
But these are the three main dimensions in which we do our inference optimization. For the really technical people out here, I think where the future is going is actually disaggregating even further, where you take the attention operations out onto separate machines. So you might even have three or four separate chips, each specialized for prefill, decode, or attention. A lot of first-principles thinkers would think this is really stupid; Jim Keller and those people probably hate this, because intuitively you shouldn't need four separate chips to do this, you should be doing it all on one board. But reality plays out differently than theory. So, getting back to the topic: specialized models are a great way to actually have a product moat, because everyone is sort of building the same thing, right? You have Cursor with multi-agent, with background search agents and Composer. You have Devin with parallel sessions and their SWE model. And across the board you have Claude Code, Codex, Windsurf, and I guess Grok. But a way to differentiate is being faster and cheaper, and being faster and cheaper lets you do lots of things: if you can run through 10x more tokens, there are a lot of features that arise from that that I think a lot of people don't appreciate. We work with around 40-plus companies and their production agents, so I'm going to go through some of the lessons I've learned from working with them. The main one, and this used to be a hotter take a month ago, today it's a super lukewarm take, is that everything becomes a coding agent. That includes marketing agents: a customer support agent where the agent writes SQL, marketing agents that do things like call Apollo APIs and send emails. Basically, across the board, our initial customers were just coding agents, the Vercel v0s of the world, the Lovables, and then you start seeing this shift into different companies like Zo Computer, sort of a personal assistant that writes code and does stuff for you. Basically everything is becoming a coding agent; it's pretty obvious now. And as everything becomes a coding agent, you start to see a lot of tools you no longer need. You can start deleting stuff, because you can just do it with code. Code is essentially the feature that builds all features, so you're only bottlenecked by how good your base model is at writing code. Another interesting fact, and this is consistent across almost all of our customers, is that when you double speed, you roughly double conversion rates, given that you don't actually hurt accuracy. So if you take the cohort of people that didn't hit a traceback or get a new error, you will double conversion rates within that cohort. If you can double speed without hurting accuracy, it's very, very valuable. And again, context length matters; that's why you should use sub-agents. So I'll go into some good practices for building good sub-agents now. If you're building a search sub-agent, it's very simple: you just put natural language in. If you're doing code search, the input would just be "where is the authentication logic?", and the sub-agent at the end would output file paths and line ranges. That way, you keep the output tokens that the sub-agent generates very small, so it can be very fast.
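As a small illustration of that output contract, here is a hedged sketch that collapses grep-style hits into compact path-and-line-range strings; the one-coarse-range-per-file simplification and the helper name are purely illustrative:

```python
# Collapse `grep -rn` style hits ("path:lineno:match") into compact ranges so the
# sub-agent's answer back to the main agent stays tiny (illustrative sketch).
import re
from collections import defaultdict

def grep_hits_to_ranges(grep_output: str, context: int = 3) -> str:
    lines_by_file: dict[str, list[int]] = defaultdict(list)
    for hit in grep_output.splitlines():
        m = re.match(r"^(.+?):(\d+):", hit)
        if m:
            lines_by_file[m.group(1)].append(int(m.group(2)))
    out = []
    for path, nums in lines_by_file.items():
        lo, hi = min(nums), max(nums)             # one coarse range per file, for brevity
        out.append(f"{path}:{max(1, lo - context)}-{hi + context}")
    return "\n".join(out)                          # e.g. "src/auth/session.py:41-87"

# Example: feed it the stdout of a grep tool call and hand the result to the main agent.
print(grep_hits_to_ranges("src/auth/session.py:44:def authenticate(user):\n"
                          "src/auth/session.py:84:    return authenticate(user)\n"))
```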
That way you keep the output tokens that sub agent generates very small, so it can be very fast. And this leads to improvements across the board no matter which sub agent you use. For our search sub agent, we see roughly a 3% SWE-bench Pro improvement across the board, but it's a similar improvement if you use Claude Code's search sub agent or any of them. For task agents, which I think should still be a frontier model, not a specialized model, you should be sharing the prefix cache. Basically, when your main model, Opus for example, wants to spin up three task agents, those should all share the same prefix so you get the caching and it'll still be fast. If anyone's looked into the Claude Code leak, they actually have this thing called team create, where there's bidirectional communication. I think that's actually really cool as well. Basically what happens is your main agent, like Opus, can do a team create. It'll make a folder with a slug, it'll have a JSON file for each agent, and it can send messages by writing to the JSON file for each agent. Each of those task agents will be continuously polling that JSON for updates. So basically you have this bidirectional communication and there's no real protocol; it's a hacky way to do it. There's someone, I forget who, trying to make a protocol called ACP to standardize some RPC format for this. I don't really care how it's implemented; I think the more interesting thing is this actual bidirectional communication between main agent and sub agent. Cool. Yeah. So hopefully you got out of this talk that everything is models. If you like working on inference optimization and RL, please email us. All right, that was a great talk by Tis. And next up, I'm going to introduce Floridian Agentuity. I'm sorry, that's not his name. Well, the next speaker is Rick Blalock, a Floridian who is the founder of Agentuity, and he's going to tell us a little bit more about coding agents and how they're eating the world. So welcome onto the stage the Floridian, aka Rick. All right, let's do this here. Extended. I can't see the screen. Which one do you guys see? Because I don't know what I'm seeing. >> You don't want to see my nose. >> How about now? >> Perfect. All right. That's what I want to hear. Blow that up. All right. Welcome to Miami, everybody. It's been a fun conference so far. AI Engineer is one of my favorite conferences of all time, and I'm super happy to be here. I'm glad you all are here, too. So, my name is Rick Blalock from Agentuity, and today we're going to walk through the last couple of years in the evolution of AI agents: how right now, in April 2026, it seems like coding agents are not just writing all the world's new software, they're actually slowly starting to become that software as well. By the end of this, I hope you have a clear understanding of where we're at and why a coding agent is a new fundamental of software. Much like you can tell where a software engineer is at in their journey, you can tell where someone is at on their agent journey, or even their level of psychosis, their agent psychosis. I'm afflicted by that; I'm sure many of you are as well. But anyway, you can tell where they're at by how they describe an agent. You'll hear things like, "Well, it's just a workflow, right?
It's just a fancy RPA. It's just a few LLM calls. Ah, an agent's just a software loop, isn't it? Isn't that what it is? Isn't it just a newer version of the chatbots we were building back in 2015? Or: coding agents, you know, they're not really ready yet for software dev because they make mistakes. You hear things like that all the time, right? So you can kind of tell where people are at in this journey, and things have happened rapidly in the last two years. So we're going to talk through that. Let's jump back two, three years ago. We said we wanted agency out of agents, and then we built systems really obsessed with control. We were calling them agents, but most of them were really just prettier workflows or chatbot++ kind of things. And we tend to shoehorn our past experiences into everything; that's a human thing, we all do that. We're so used to solving problems a certain way with computers. Decades of solving problems with computers a certain way. By the way, that includes non-engineers, how nontechnical people think about using computers. So it's really hard for us to start from first principles when it comes to software engineering and using computers to solve problems. But a few years ago there were a few people starting to poke around at that and try to figure out what agency means in software, and non-determinism. So, how many of you remember when AutoGPT launched? I don't know, maybe 50 AI years ago, something like that. It became the number one trending GitHub repo. It did. I don't know, it had like 160 or 170,000 GitHub stars. It was the fastest-growing GitHub repo of all time. And I don't know if you remember some of the spin-offs. There was ChaosGPT, which was a spin-off. Everybody remember that? It was tasked with destroying humanity. I don't know if you remember that. But it was popular because it offered a practical glimpse into what this kind of agentic future could be like. It was also popular because it was a spectacular failure from a user perspective. It got stuck in loops. It burned tokens needlessly. Lots of bad hallucination, as you can imagine with the models at the time. And if you really remember, on Hacker News there was a dev that just replaced the feedback loop with static code and got better results. I don't know if any of you remember that. So Avram from Tom's Hardware said it was too autonomous to be useful. So, you know, there are these stamps, these glimpses and snapshots in time where we see that, and some people's brains get hardwired to that moment in time: see, agents aren't good, they're not ready. But there was a glimpse of hope, right? Somebody was trying stuff. It didn't work out. It's fine. But we were getting somewhere with it. Then around that time, frameworks started popping up that simulated agency. A lot of them were actually still brittle, deterministic pipelines. The 2023 agent architectures were really just a lot of linear chains: step one, step two, step three, a lot of determinism. And we were all trying to figure out what we wanted to do when we grew up with this, right? There's CrewAI, LangChain, AutoGen came out, n8n. I mean, there's a list of them.
Some of them attempted some type of agency, but kind of like AutoGPT, because of the limits of the models, there were problems. So we all fell into this trap of building what customers wanted right now and not where the models were going to be in six months. And what they wanted was just a better chatbot. They wanted determinism. So we were using agents, which are for non-determinism, but we were using them with all this determinism in there. That's when the frameworks tried to solve some of those problems. And the curious thing is, in the enterprise at the time, the expectation was that these projects, these prototypes, would take around six months, which, if you've been in this AI engineering stuff for any amount of time, is like an eternity. That's probably about four AI years, I don't know. So six months to implement this thing with this framework, and then all of a sudden the models get better, and of course the frameworks get better, but then all of us start asking, well, wait a minute, why do we even need these frameworks? They're all kind of complicated anyway. Do we even need them now? And then there's even Anthropic's own advice, this is back in December of 2024, which again doesn't seem that long ago. I'm going to read a quote: "We recommend finding the simplest solution possible." This might mean not building agentic systems at all. Well, that's interesting, right? So, just a couple of years ago: it's complicated, all this orchestration theater and all this stuff going on. And let me go back here real quick. Think about that orchestration theater. I'm sure several of you in here have gone through that and the complexities of all of that, and then all of that gets removed in the last six to nine months, right? You go from chatbot to tool caller to workflow to orchestrator to multi-agent theater. Or, you know, remember Cursor added sub agents, and then Claude wasn't very good at using sub agents so they removed it, and then all of a sudden Claude's good at using sub agents again and Cursor adds it back. And the advice one day was don't use sub agents, sub agents are bad, and now sub agents are good, and they are good. All that's changing, all this turmoil, and we're preaching that agents need agency, and then customers are like, well, this is just RPA, this isn't agentic. And during all that time, Cognition Labs demoed Devin. How many of you remember the demo of Devin? Marketed as the first AI software engineer, remember that? Interesting. And of course, in the early days there was a hype cycle that maybe outpaced the product a little bit at the time. Personally, in my opinion, I thought it was kind of the right balance of a little bit of controversy and vision casting coupled with a decent product launch. That's what I thought. But fast forward to today, and honestly it's one of the most capable agentic engineering products out there. And it's free from the nitty-gritty messaging of workflows and frameworks. I kind of thought it was a breath of fresh air. I met Walden, one of the co-founders, in Salesforce Park last year, and I was telling him I really want to put Devin more in our flow, I want to integrate it more, and I was thinking traditionally, like frameworks and stuff like that: how do I stitch this together?
How do I put it in this? And after talking to Walden, it was like, well, just give Devin some scripts and some access to your tools and it'll just do that. So it was free from that MCP theater and workflow theater. And it was the first time I started thinking, oh, I should use this coding agent for more than just code, for managing our demos. We did Friday demos; it's on our YouTube, public demos. And Devin helped us. Devin literally would tell us what we would demo, because I can't remember what I worked on yesterday. So it was the first time I was like, oh, there's something here. Interesting. And I appreciate that about them. So all this is going on, and meanwhile we have customers and real businesses around the world saying, what are the use cases for agents? What are they? Trying to understand how they actually fit inside their company. I can't tell you how many times I've had conversations with business people: what are the agents? Is it this? Is it that? And kind of only devs understood the complexity of the orchestration and the frameworks, and it kind of muddied the waters and maybe made it look like a little bit of snake-oil-salesman stuff to some of the business people. Or you get accused: ah, you guys are just crypto bros that are using AI now, or whatever. You get a lot of that. But then use OpenClaw as an example. OpenClaw stitched together enough things that it helped normies understand what an agent, specifically agency, can be in software. Where Devin helped developers understand an agent use case for our world, OpenClaw helped non-devs understand agent use cases for theirs. You know, as engineers, we like to bash OpenClaw, right? It's all these scripts and these complexities; it's insecure. But meanwhile, you have non-technical people buying Mac minis. You can't buy a Mac mini right now. I'm sure this was not on Mr. Cook's bingo card, that he would sell out of Mac minis. And they're copying and pasting instructions into an app they've never heard of called Terminal. They've never used it. They don't know what they're doing. Why are they doing that? Because they understand the use case. They're willing to suffer this weird pain because they understand the value they can get from it. And the models are at the point where it doesn't feel super brittle. So all of a sudden, an autonomous agent makes sense to a large group of people that didn't get it before. Before, I'd say autonomous agent and they'd be like, "What the heck are you talking about? This sounds like crap, right?" And sitting on top of all of that, what we're talking about is a coding agent. So here we are in 2026, and if I say a coding agent is a universal software primitive, a lot more of us understand that this is actually pretty true. Code isn't just a developer artifact. A coding agent isn't just a harness for developers. A coding agent can implement every other type of agent. And even a normie who can't describe what's going on can build multiple agents that help him manage his business. A coding agent can build a chatbot. It can do RAG. It can orchestrate other agents. It can automate its own workflow, not just yours. It can create a database, stuff things in it, and use it to answer its own questions. Try that sometime.
Take the canonical Titanic data set and have it put it in a database. It'll create a data model, stuff things in there, and know how to update it. You know, we attribute intelligence to animals: if they can build complex or compound tools, they have high intelligence. And although I know some of you will want to nitpick on what intelligence means when it comes to a coding agent, we are there with coding agents. Coding agents will build their own tools for themselves if you give them the right instructions. So in 2026, convergence; I mean, this is the inflection point for us. Sandboxes are giving agents a body. Model intelligence is now giving them the brain. We've got these good protocols where we're starting to figure out the nervous system. Whereas a year ago, when I said agents need a different approach to infra, people were thinking, well, why does a framework need a different infra approach? I don't understand that. But now, with coding agents, they're like, oh yeah, it does kind of make sense. We do kind of need something a little more purpose-built for a coding agent to run all the software, to be a lot of the software. Now it's a little more understandable. Coding agents still need a lot of this. You know, at Agentuity we had to adjust our thinking. A year ago we started a horizontal developer cloud platform, and now it's more like an agent platform that's for agents to run on. And all these things here are what they need: a place to act, a place to persist, to live, a place to observe, to integrate, to manage. And there are still gaps in what we can offer, right? I mean, the boomer clouds, if I can call them that, are trying to sell the infra from a decade or two ago. They really are. And they have this veneer of cutting edge, but when you get in, it's all still kind of hacked together. Honestly, kind of like our personal coding agent setups; it's all just stitched together. And services are honestly still built for stateless web tech, and with a coding agent you're like, I'm trying to shoehorn this into that. I mean, the story is always the same when I talk to people. It's like, oh, we started with serverless, tried to deploy it, but guess what? My agent runs for five minutes and serverless times out in 30 seconds, and invariably, after a bunch of experiments, you end up on EC2 and you're running software like we were in 2008, over and over and over again. That's what we're trying to do at Agentuity: make it so agents don't feel like they're running in 2008. But with that in mind, put a pin in that. With that in mind, back to OpenClaw and use cases for normies. Normies don't understand or know all the things that I just mentioned about the infra, about what's needed for this stuff to work. Yet still, Mac minis are being sold at Apple, and they're copying and pasting and they're getting outcomes. And this is the mind shift that I think we have to make: coding agents are a fundamental of software. Now, non-engineers already understand this. They do. They maybe can't articulate it in all these crazy ways that we can, but I think they probably understand it faster than some engineers grasp it. You know, we're busy poking holes in the fact that coding agents make bad mistakes, crazy, stupid, idiotic mistakes.
And we're quick to point out agent psychosis, which I have, a lot of us have. But then the non-engineers are using this stuff, and while we're sitting there complaining about it, they're thinking, "All right, are you using the same thing I'm using? Because I'm running my business on this." You know, if you're at a party and this guy's talking about how he's gotten all these things done, and you're just waiting for the pause so you can say, "Yeah, but coding agents, they make mistakes. You're here. I'm here. You see the difference?" If you're that person, they're going to look at you and go, "I think you have psychosis. You have software engineering psychosis." That's what's going to happen. Just as an example: there's a man I know, he's in his mid-60s, a very successful entrepreneur. He built a manufacturing company that did a bunch of stuff for Toyota, and now he's started this construction company in Texas. And he was paying tens of thousands of dollars a month to HubSpot for this company. And he said, I'm just going to build it on my own. I'm going to use coding agents, and I'm going to use coding agents to run the little bit of HubSpot that I use. And he did. He did it in about three months. Another example: on Saturday, up in Jupiter, Florida, I had breakfast with this guy. He's 24 years old. He runs a window cleaning company, and he runs his business with a coding agent: his marketing, his sales, his sales estimates. So if I told him, ah, coding agents make mistakes, he's like, yeah, but I'm way up here in productivity, what are you talking about? There's a former CIO of a large financial firm who's in here today, and he uses it for his board meetings. So my point is, the foundational agent primitive is already a coding agent for normies. Maybe they don't realize it, maybe they can't articulate it. And what kind of agents do you think are going to be tasked with working on the org chart, the focused work of things, in the future? I think it's coding agents. If you peer just into the near future, six months, 10 AI years, coding agents will be used to create and manage other agents to manage certain parts of the org chart's work. I mean, we're doing that with our competitive marketing. We have this thing called GEM. It's a set of agents that manage the roughly 50 competitors, or frenemies as we call them, that we monitor. These agents are constantly watching. I won't tell you what they're watching, but they're watching all sorts of things, all sorts of signals. And we know everything about where we are, where they are, the messaging; this company's probably about to pivot because they changed their messaging a lot in their site map, whatever. So it's managing a sizable part of our marketing stuff, and it will be used by non-engineers to manage their work and their organization. Leaders will direct coding agents, and that's not two years out, by the way. So next time you're debating what framework you should use for building an AI agent and how it should work, pick a coding agent. It's kind of like Ben at the last talk yesterday; he talked about those different ones, the Vercel and Pi and OpenCode SDKs. Try that out. Try using a coding agent instead. The good ones have an SDK. They have server-side handling of things.
They have on-demand sandboxes that you can pick up and resume where they left off, and they have observability built in. That's also what we're trying to heavily do with Agentuity. We started as a developer platform, and now it's more like, hey, the agent is actually a customer, a persona of our platform. There's all this inbound interest from normies and people who want to run this stuff at scale; they don't know how to. And it's all built on coding agents: multiplayer coding agents that watch each other, agents that tell each other what to do. So that's where we're going with this. So yes, Marc Andreessen said software is eating the world, and now I think coding agents are eating software. Thank you. Okay, how's it going everyone? Having fun? Great. So, let's say I want to build my app. I have a lot of data about transactions and places, and there are relationships between them, and I want my agents, or my app, to query this information. I have to store it in a database. I can dump it in blob storage, or maybe a SQL database. But is there a better way to store all of this information, especially if there are relationships between the entities? And the answer is, of course, yes. The best way to capture these relationships is through graphs, and maybe the better way to store them in a database is to store them in a knowledge graph. And the next presenter is going to talk to us about how we can optimize our context engineering by leveraging these knowledge graphs. Please welcome Nia onto the stage. >> Thank you. Thank you everyone. Hey, hey, hey. All right. Yes. Good morning everyone. How are you doing? >> Okay. Okay. I see some energy. I love it. I love it. Well, I'm going to wake you up. Okay. So, good morning everyone. My name is N Mlin, and today we're going to get into effective context engineering techniques for artificial intelligence. So, as I like to do, by the way, I'm a teacher in a past life, I like to give lots of different ways that people can remember the concept we're going to be talking about today. Not just sit here listening to a lecture, but actually conceptually understand what's going on. And I like to make sure that everyone from the chief security officers, the chief information officers, and the CTOs to a junior developer or an intern understands the concepts. All right. So let's get right into it. And I'm going to start off with a question. All right, I'm going to get us all engaged. So, how many of you all, and this is going to be a yes for pretty much everyone, but how many of you have shipped an agent to production? Lots of people, right? Lots of people. Okay. So, of those of you who've shipped an agent to production, how many of you have had that agent then do something in production that made you question your career choices? Question your career choices. Right. I shouldn't be an engineer. I'm so sorry. I brought down prod. Customers are crying right now. They are so angry. It's happened. It has happened, and that's okay. Right? So let's get into a scenario. This is actually real table stakes; this happens every day. So a compliance team is going to run an agent pipeline, and that agent is going to retrieve separate chunks. One chunk is going to be, for example: Jessica, a hypothetical person, works at Apex Global. Another: Apex is actually on the sanctions watch list. Okay, that's interesting. And then: Jessica has actually requested a $25,000 credit increase.
So all three of these chunks have been retrieved, right? And all three are sitting there in our context window. So the agent is going to YOLO and approve it, approve this credit line increase. There seems to be nothing wrong with Jessica's request. She has excellent credit, an 850 credit score. There are no problems. We're just going to YOLO and approve that. We're just going to vibe bank real quick. Y'all don't like vibe banking? Y'all don't want vibe banking? That agent is going to approve that request not because the agent itself or even the model is bad, but because the connection between these facts is never represented anywhere in vector space. So it wasn't the data, it wasn't the retrieval; the model actually had to guess, and it guessed wrong. Okay, it guessed wrong. So nobody caught that until the lawsuit came. Until the lawsuit came, and your skip level is now hitting you up asking who wrote the agent, why did the agent answer this way, why did the agent actually approve this request? And what are you going to say, right? Everyone is listening. You are sweating bullets over here. You are stressed. Okay. But this scenario does keep landing on my desk, right? And in the next 25 minutes, I'm going to show you exactly what happens and the techniques we can use to fix this, and prove it live. So, like I mentioned, this scenario always happens: you say you don't know, and you end up fired. Can't stand it. Can't stand it. Need to pay these bills. So what's happening here is that text similarity, when we're relying on vector search, finds documents that have a similar meaning. Structural similarity, on the other hand, finds entities that have similar connections, and almost no one is actually building the second one. And I want to give you this upfront so that you know exactly what I'm trying to argue here: every retrieval pipeline we're using in this room, normal naive RAG, hybrid search, ranking, everything like that, operates on one dimension, this text similarity piece: how close is this chunk in meaning to the query itself. And that can get you like 70% of the way, right? It could even get you 80% of the way. But there's a second dimension that almost no one is building on, and that is the structural similarity piece: not "do these two documents mean the same thing," but "are these entities connected in the same way." So when we're thinking about a credit line increase for a corporate banking client, and we're also thinking about the fact that the employer is listed on the federal sanctions watch list, those don't mean the same thing, right? So your vector search is never going to surface those things as being connected; they're not going to look at each other. But the relationships between Jessica, her employer, and those sanctions, that isn't a meaning problem. It's a connection problem. Right? So right now, nobody really is using an agent pipeline to see connections, and that's the vital dimension we're missing here. And even research backs this up, right? We've all seen this: this is fine, it's going to be fine, we're all going to be fine, right? And MIT's NANDA initiative, you can feel free to take a screenshot of the QR code to read this report yourself.
I always like to tell people to read the white papers; don't just listen to people who are up here speaking. But in their 2025 GenAI Divide report, they found that 95% of AI pilots are delivering no measurable P&L impact. Right? 95%. That's a huge number, right? And we're building agents, we're scaling them into multi-agent systems, we're investing months of engineering time, and then the project is killed because it doesn't clear the bar for quality. It doesn't clear the bar for privacy. It doesn't clear the bar for security. And it doesn't clear the bar for transparency either. So the central gap that MIT identified is that these systems don't retain feedback, don't adapt to the context, and don't improve over time. And Gartner as well is predicting that 40% of agentic AI projects will be cancelled by 2027. And the models here are not the problem; the context is the problem. So, better models don't fix fractured context. I want you to repeat that with me. Okay? Better models don't fix fractured context. Period. Right? And that's what all of this has in common. These agentic systems retrieve via your text similarity search, and that text similarity has a blind spot the size of a building when it comes to actual relationships. Better models don't fix fractured context; they just reason over these broken pieces. And let me show you what I mean. In this example, there are three ways to represent an apple. On the right, we have a vectorized view of an apple. This is just an array of floating-point numbers. It's the way most of our embedding models understand the textual reference of an apple. This is what a vector search operates on. It's not human readable. It's very opaque. In the middle, we have the human view of an apple. You see an apple; your brain processes all of the relationships implicitly, instantly. You don't even think about it. You just know an apple is round, it's fruit, we're not talking about the company. And then on the left, we have a knowledge graph view of an apple. The apple connects to a tree, the tree connects to a body, the body has a fruit, it has a stem. Every property is actually a node, and every relationship is explicit. So a human can read this and understand exactly what the model knows about an apple. This is both human readable and machine readable. And that's the thing: vectors capture meaning. They know that things are close to fruit and red and food, but it's often a black box. You can't inspect it. You can't ask, okay, why does this model know that this apple is round? You can't audit it either. A knowledge graph is going to capture that structure and capture that meaning, and it's going to be much more transparent, more queryable, more auditable, which is exactly what we needed to overcome AI's black-box problem. And we need that knowledge to be transparent. And before I go deeper into this technique, I want to show you some of the numbers that convinced me this wasn't just a theory, right?
So Jang et al. and their entire team published a paper in a communications magazine this March of 2026 and built CommGPT, a domain-specific foundation model for telecom, and they ran a proper ablation study on 3GPP question answering. The base model had an accuracy of 37%. Sure, right, we'll take it. But then they fine-tuned that model on domain data and got up to about 54%. That's a significant improvement, right? But that's still not going to work for production. So they then added both knowledge graphs and RAG, retrieval-augmented generation, together, and got a 91% accuracy rate. So no, it's not just the graph, it's not just RAG by themselves; here it's both. This is the critical point: RAG alone does not close this accuracy gap. It's the knowledge graph that unlocked that extra accuracy. And the graph wasn't just a nice-to-have; it's actually the difference between having a demo and having a production system. So 37 to 54 was the fine-tuning, 54 to 91 was with the graph plus RAG, and that jump is the structural retrieval dimension I'm talking about. And this is all, again, peer-reviewed. So quickly, I just want to jump into the past, and I know this is a little controversial, but I would like to argue that the past is how we're normally doing context engineering. So for those who, well, no one's new to this term; also, shout out to Dex, who coined this entire term and also spoke yesterday. But I specifically define context engineering as being able to systematically provide models with all the relevant information, the relevant tools, and the relevant instructions that your agent needs, in the correct format, at the correct time, to accomplish a specific task. So it's very different from prompt engineering, where you're just trying to say the right words to an LLM or an agent to get it to do whatever you want. Context engineering centers on building dynamic systems that assemble complete and structured context for each LLM invocation. The shift in focus from cleverly worded prompts to comprehensive contextual design is why context engineering is now considered a critical skill for AI engineers. And most engineers are using some combination of these various techniques here. There are several different techniques that practitioners use to get that contextual relevance in agentic development, and in the example I mentioned before, that includes RAG and hybrid search, the classic approach. I'm not really going to go into all of these, but for those who are newer to the concept, naive RAG uses a vector database of embedded documents to perform a similarity search. Practically, there are pros and cons to that, but I won't even get into it. And then we also have memory management, where an agent interacts in a long session and across lots of different sessions and accumulates lots of information. I have that represented by Dory, because you're now managing what you're going to forget. And you can use sliding context and recency-based memory as well, in order to drop history. But let's keep moving on. You can also use structuring and ordering of context.
Anthropic, for example, gives guidelines on how to deal with long prompts: you put the most relevant and most critical stuff at the top of your prompt, since models pay more attention to the beginning. And then additionally, we can use tools and function calling. Instead of trying to stuff your context with all your raw data, like I see most of you trying to do, we're actually going to give the model instructions on how to call a tool, and really that's just a function for those of us who are engineers, or we can offload those tasks to tools. So, for example, instead of giving an LLM a large table and asking a question about that table, which would consume all of your context tokens, it might be better to give your LLM a database query tool. And lastly, I like to represent that as MacGyver for those of us who are a little bit older; we understand MacGyver always had a tool. But the method that I want to get into today is using knowledge-graph-augmented context. I mentioned this concept of knowledge graphs before, and I'm not going to dive too much into it because of time, but I'll do a representation so you can see what it looks like. Knowledge graphs are an extremely powerful way to introduce structured, relational knowledge into an agent's context. And I like to represent knowledge-graph-augmented context as Sherlock Holmes, because Sherlock Holmes is a perfect example: he's able to synthesize important information and then track down the clues, this tracing piece, in order to solve a particular problem. This is exactly what context engineering allows you to do when you integrate knowledge graphs into your AI stack. So most people right now have their data structured as in this example: tables and rows, it's the OG, right? You have to do a number of complex joins if you're trying to understand how that data is related, but there's the caveat: it's going to drag down the performance of your application. But if you structure that same data as a knowledge graph, then the relationships and the connections and the groupings become much more clear. So now your agent can hop from one node of information to the next and actually trace where it went and explain the answer. And if you want to learn more about how to take your data and put it in a knowledge graph, we have lots of courses for free, and we're open source. From an architectural standpoint, I just want to touch on this as well before I get into the new concept. Having a knowledge layer in the AI stack allows you to, one, hold structure and meaning; two, create a uniform memory and retrieval layer for all your agents; three, link and unify all of those data sources and objects together; and four, bridge human and machine understanding. And then at the bottom you've got your existing data platform, whether that be Snowflake, Databricks, or relational systems. On top of that sits this knowledge layer, which connects all of your structured and unstructured data in a consistent and semantically rich way. From there you can have your GenAI apps and orchestration layers, like LangChain or Semantic Kernel, or your agentic frameworks; they can then plug into and access this data contextually.
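To make the structural-retrieval idea concrete, here is a toy, purely illustrative sketch of the Jessica / Apex Global / sanctions connection as an explicit graph traversal. It's a hand-built in-memory dictionary, not Neo4j or any production knowledge graph, and the entity and relationship names are assumptions taken from the scenario described earlier.

```python
# Toy knowledge-graph sketch: entities are nodes, relationships are explicit edges,
# and a two-hop traversal connects a credit request to a sanctions flag that pure
# text similarity would never surface as related.
graph = {
    "Jessica Norris": [("WORKS_AT", "Apex Global"),
                       ("REQUESTED", "Credit increase $25,000")],
    "Apex Global":    [("LISTED_ON", "Federal sanctions watch list")],
}

def neighbors(node):
    return graph.get(node, [])

def find_paths(start, target, max_hops=3, path=None):
    """Return every relationship path from start to target within max_hops hops."""
    path = path or [start]
    if start == target:
        return [path]
    if (len(path) - 1) // 2 >= max_hops:   # each hop adds one edge label and one node
        return []
    results = []
    for rel, nxt in neighbors(start):
        results += find_paths(nxt, target, max_hops, path + [f"-{rel}->", nxt])
    return results

if __name__ == "__main__":
    for p in find_paths("Jessica Norris", "Federal sanctions watch list"):
        print(" ".join(p))
    # Jessica Norris -WORKS_AT-> Apex Global -LISTED_ON-> Federal sanctions watch list
```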
But I want to remind people, this is not a concept that we created; knowledge graphs have existed for many, many years. And with this concept of knowledge graphs and adding this knowledge layer, we think it's just important for folks to understand that it can drastically increase the accuracy, the traceability, and the auditability of your AI applications, because many of us are just settling for vibe everything, right, vibe banking, which doesn't make any sense. Like, why are we spending so much money to get inaccurate results? It's just not making any sense. So the future of building out context and having explainable and auditable AI agents looks like this concept. Has anyone read this paper, which came out in December? Let me see some hands. Anyone read this paper? It's called AI's trillion-dollar opportunity: context graphs. Okay, like one or two. All right. Well, for those who are newer to this concept, the present conversation around building AI agents has shifted from context engineering techniques like naive RAG, to graph RAG, to knowledge-graph-augmented context, to now what I'm talking about, which is using context graphs. And it's not simply a change in terminology; the idea of context graphs has actually existed since well before this paper, but the meaning behind it is really helpful when building out AI agents. So when building out agents, we need this idea of a decision trace, so that agents can be auditable and can be more trustworthy and more explainable. And there have also been, and I have the papers right here, a number of different white papers and research around context graphs even prior to that seminal article that came out in December. But we are shifting toward understanding why agents make decisions, because agents are no longer simply tools that people use to get a particular answer. Now agents are being built to make decisions about our everyday lives. This isn't a joke anymore. This isn't the demo. This isn't the fun POC that we're building. Now we're building out agents that are making decisions about whether someone gets a loan, and that has impacts on people's lives and their financial prosperity going forward. It's making decisions about whether they can own a home, or what have you. It's making decisions in medicine as well. We're giving agents so much power, and that's why the ability, and the necessity, to audit the decisions that every agent is making is paramount, especially with context engineering. So, the missing why. When an agent recommends approving even a $25,000 increase like we mentioned for Jessica before, or even a $100,000 credit line increase, the question isn't just "is this the right answer." We have to understand that if we're only asking that question, we have no semblance of what memory looks like here. You're going to ask an agent about last week's conversation, and that agent's going to be like, "What you talking about, homie? I don't even know who you are right now." They're gonna say that. But you're also not going to have this audit trail. When something goes wrong, no one knows why your agent decided that.
Y'all can try and see, okay, I can try to reverse engineer what ChatGPT or Anthropic's Claude said. However, that trail is missing, and that's the vital information, right? And then on top of that, there's no shared learning. You're going to deploy multiple agents and they can't share what they've learned across sessions. So, what is this concept of a context graph? A context graph is a knowledge graph that's specifically designed to capture your decision traces: the full context, reasoning, and causal relationships behind every significant decision. This differs from simply opening your agent up to an audit log, for example. You might say, oh, I can just do that with an audit log. No, actually; yes, that would have the line-by-line transaction history, but with a context graph your agent now has this breadth of knowledge from context, from the things that were not written down. That could be the conversation happening in the senior architect's mind, where they're making trade-offs and decisions when building out either an agentic system or software in general. It could live in the Slack messages you're writing. It can live in your email threads or even your Zoom meetings. All of this connected data can show up in a graph that can show you the full context and the full why behind any single decision that was made. And this now includes the causal chains and the decision traces within context graphs, which can actually be queried and traversed to ensure that your agents are more reliable, auditable, and trustworthy. So I have the same financial services model here, and we're going to go through a demo. Does anyone work in financial services? I doubt it, but possibly. Nope, exactly zero hands, but that's okay; we're still going to do it. In this data model, I'll teach you all about financial services. We have a graph world, and we're going to call these circles entities, which is really just people, places, and things. We also have, sorry, events, the things that happened, the what. And then lastly, we have the context and the why: the decisions, the policies, etc. That's the model. So, we can get into what this looks like in practice. Boom, boom, boom. If you want to take a picture of this or actually look at the demo live, you can feel free to do that. Let me see where I put it. Nope. Where is it? Ah, okay. Boom. All right. So, I have a live one, and I have my other one, but I was messing around with the code trying to add some new features. You know, never add new features last minute before you're going to do a demo. It's going to break. It's going to break. So I have a live demo here. And many of you are like, we're all talking about context graphs and we want to see how this actually works in practice. So I'm going to run this query here. We have a regular AI assistant for this financial services use case. We're going to ask: should we actually approve the $25,000 credit increase we had for Jessica earlier? And let's see this graph get built. I want to make sure that everyone can see it. Yeah, perfect. All right. So we have a person who submitted a support ticket about a specific transaction. All right, we're seeing this graph get built.
But that transaction is then going to trigger an alert, flagging this entire account. That alert then gets taken into context for this entire system, which produces a decision trace, which you see on the right, based on policy, and then communicated back to the person who owns the account in the first place. So think of this like an analyst at a bank who's going to use this agent to respond to a particular customer request. The analyst has this customer request come in: should we approve this credit line increase for Jessica Norris? Should we do it? And the bank analyst could say, YOLO, vibe bank, let me just give that to you, Jessica, because you're my girl, right? Absolutely not. Thank you for laughing at my jokes. So if y'all don't want vibe banking, why are y'all doing this without decision traces? Like, y'all, it seems messy to me. So if we inspect this agent here, we can see that we've defined a number of different tools. Let me come over. We can see that we've defined a number of tools, so we're doing tool calling. I can't even see that, but maybe y'all can. But we have this agent which is interacting with our context graph, and then we have the system prompt, which has these specific tool calls that fetch the data, or the context, necessary to make a specific recommendation, and all of that is then rendered within this graph that you see in the center. So we'll see the causal chain of what led to a specific recommendation from the agent. And in this case the recommendation is, and we'll follow all the way down to see, instead of thinking we can just YOLO banking, let's see the response. All right: "I'll help you evaluate it. The recommendation is to reject this credit line increase." So the recommendation is to reject the credit line increase: "Based on my analysis of the context graph, I strongly recommend rejecting Jessica Norris's $25,000 credit line increase, and here is my detailed reasoning." Critical risk factors: there was a recent, identical request already rejected on April 9th, a couple of days ago. And then there's actually significant fraud history within Jessica Norris's account from last year. There was a known fraud typology, where the velocity check flagged 14 transactions in 29 minutes. There was also a geographic anomaly, where the IP location was inconsistent with the account's address. All of these can be explained away by Jessica, and if they are explained away, then that context graph will update based on that decision. But there are multiple compliance violations within this as well. So you can see the AI assistant, the agent specifically, telling you the entire reasoning behind the decisions it's making, and you can see the exact decision trace on the right as well, detailing what's happening, along with the confidence score. So feel free to try this on your own. Again, it's context-graph-demo.vercel.app if you want to try it on your own. And then let's go back to the slides. Okay, so I'm going to wrap this up. So, how does this all work?
If we want to look at how this agent is tracing the decision history and how it's using the context graph: first of all, it's going to search for the customer and then for the context around that customer. As we can see, we saw the circles, the nodes and edges, that produced that transaction in the past, and then we can see the previous fraud flag we'd seen before. We're then going to use hybrid search, which combines vector and graph, to actually do the heavy lifting here. And I just wanted to touch on this hybrid search piece because it's actually really, really powerful. Combining vector plus graph search is really powerful, and specifically at Neo4j we use something called Graph Data Science, which allows you to run graph algorithms like centrality or PageRank, for those who have used graph algorithms before. But here we're talking about graph embeddings specifically, like FastRP, if someone has heard of FastRP, which generates embeddings from the graph structure rather than from text content. And then there's how we use the vectors in doing vector similarity search, apologies, which allows us to find the different policies and decisions that are semantically similar. But what about the structure? The structure piece is what I introduced: that's where things like graph embeddings come in, which allow us to look at the structure of the account relationships and how they interact with transactions, with other accounts, and with fraud patterns. These generate embeddings that we can use with the vector search functionality we're already familiar with, to find the most relevant data for our context. So again, this is a completely open-source demo. Here's a little bit of information about what's happening in the demo. You can see the architecture. We can see the tools, the Graph Data Science tools and algorithm tools that we're using with MCP for tool calling. And then that's what we have on the horizon as well: a little bit of the future of this demo, but also what this can look like in production. So where can you learn more? There's so much research being developed in this field of how to make agents more trustworthy. Yes, please feel free to take a picture. All of this research can be found, it's very, very small, but it's at graphrag.com, under appendices, research. This is just a small sample of some of the research you can find on using context engineering, knowledge graphs, graph-augmented context, and context graphs, all keywords, in order to augment your AI agent development and make sure your agents are more accurate, explainable, trustworthy, etc. And so, all in all, instead of being that one person saying, I don't know what happened, I don't know, please don't yell at me, while production is going down, which is a real problem, I want you to be in a place where you understand how to make your agents more auditable and explainable. Awesome. Well, thank you so much. It was a pleasure talking, and I hope you had a great time.
Thank you. >> Thank you, Nia. Exciting talk. Yeah. Good job. Good job. >> All right. I want to take a moment to thank the folks who are directing this show: all the music, the sound engineering, all the seamless transitions. They're sitting in the back over there. Let's hear it for Trish and team. >> Thank you guys. You're awesome. Also, folks on the live stream, please leave them some hearts. We read your comments, and we're going to use them to make this experience progressively better year after year. We're here to stay. Okay. Well, it's lunchtime. Let's be back here at 1:35 sharp. Enjoy your lunch. Hey, hey. Testing, one, two. Check. Check. Testing. Here we are. Are you ready? Yeah. Yeah, it looks like there was a long line there. Yes. Perfect. All right. Ladies and gentlemen, please take your seats. Our event will start in 5 minutes. Ladies and gentlemen, please take your seats. Our event will start in 2 minutes. Okay, welcome back. Quick announcement: the Riverside patio is reserved for a private event, so please use other areas in the venue. And another note: I saw somebody walking around with the equivalent of a Ferrari at an AI engineering conference, which is a fancy GPU, right? And he said he was going to give it away. Anybody interested in a GPU? Okay, so you have to listen to what Gabe has to say right now. Okay, give it up for Gabe. Everyone, good to see you all. Good to see you all. Thank you for enjoying this amazing conference. So, I think this is the easiest announcement of the conference. I have one NVIDIA DGX Spark to give away. >> Woo! Yes, I've given one away today already, so this is my last one. We at G2I have grown; we're on track for a hundred million in revenue with no VC funding, and we've never done any marketing. We've just given away cool things and tried to do good work. With that being said, we do human data, specifically for coding benchmarks. The problem we've seen with competitors is they're spread very thin across 20 different domains. We've been in the industry for over a decade. The benchmarks are obviously getting harder; if you need staff-level engineers, if you need architects, we provide those. We're also building very high quality RL environments. We're exclusive for now, but that will likely change at the end of the summer. So, if you have an introduction into any frontier lab, I can't mention the ones we're currently working with, but if you have introductions, you can come talk to me or our team at the booth, or you can email me at gabg gg2i.ai, and this will go to the best intro into a frontier lab. And if there are multiple people that provide intros, we'll also provide free tickets to next year's AIE Miami or React Miami, or maybe even both, depending on how good the intro is. So, come talk to us. We'd love to chat with you, and I'd love for you to be tuning models with this by tonight. So, thank you. All right. Wow. That was $5,000 in Gabe's hand. So hopefully you're eager to introduce him to some frontier labs. Okay. And then I see people are seated already and we're waiting for some more people to trickle in, but I really think that this talk is something not to be missed. I'm personally really excited about this one. Okay, so who in here thinks that he or she is a 10? Okay, a couple of people.
Okay, so for our next speaker: she's going to demo this little thing on stage. I don't know if you have noticed, this is something different. So this robot thinks that you all are a 10. Lena Hall worked at AWS and Microsoft before joining Akamai as the senior director of developer relations. Today she's going to show us something super futuristic. Her talk is called "My robot thinks you're a 10: Engineering zero-shot compliments with Reachy Mini." So, welcome on stage, Lena. Hi, everyone. I hope you're enjoying the conference as much as I am. Excited. Well, quick show of hands: who has built some classic agents with LLMs? A lot of people. Who has gone deeper into real time, maybe streaming audio, turn taking, interruption handling? Fewer. Okay. Who has done some multimodal, images, video? Okay, great. Well, usually we get to solve these one at a time, but when we put an LLM and a robot together, we get to solve them all at once, as one system, because it becomes an experience that people perceive as one interaction. Models are great. We are so lucky to have models this powerful, and models are commoditizing, but the ability to engineer the right behavior is not. In the last few years, the industry has been obsessed with model quality, which is great: we are getting better benchmarks, bigger context windows, newer architectures. But as AI enters the real world and leaves the chat, agents, voice, physical interfaces, the product stops being mostly about the model and starts being about how it interacts with the whole world, how this whole system behaves. When you give a language model a body, you watch every single hidden assumption in your architecture turn into physical side effects. I'm Lena, a senior director of developer relations and AI at Akamai. I also spend a lot of time working on and thinking about what it takes to make AI behavior match user expectations reliably in the real world. So, like any responsible adult, I built a multimodal hype compliment robot, which is a very normal and mature use of AI infrastructure. The real title of this talk is "What happens when we give GPT-4o a body." I wanted the smallest possible task that would force every single hard problem in modern AI to show up at once. So I built an app with a robot that looks at someone, sees something real, and says something grounded, and of course uses some Gen Alpha slang in the meantime. It's a non-trivial system. You all know how this goes: this one product requirement forces you to solve a combination of things: perception, timing, tool use, grounding, speech, motion, interruption handling, response coordination. So a compliment has a surprisingly large stack, and this is what happens when you scope-creep into robotics and discover that rizz can be a distributed systems problem. So I built this on Reachy Mini from Pollen Robotics; you might also have heard of it as the Hugging Face robot. It has cameras, a microphone, and six degrees of freedom in its head. And I chose it for one reason: I could inspect the whole stack, the whole system, all the way down. If one of those layers is closed, you can still build the project; it just becomes harder to build understanding. Open systems let you inspect, patch, swap, and see where the behavior is actually breaking down. Before we get into the architecture, let's see if Reachy is actually in the mood to cooperate.
Uh, let's turn the lights on if we can. Thank you. I can see you all now. Is the mic on? The robot mic? Yeah. Great. I'm gonna start my app. Demo gods. Hi, Reachy. >> Hey, the legend has arrived. We're lighting up the room already. Let's turn this into absolute cinema. >> Do you know where we are right now? Do you know where we are right now? >> Oh, I see a winner asking me about location. My sensors don't do GPS. You're my best main character wherever we are. >> Tell me what you see around you. >> I'm picking up some clues. I see a space that's looking pretty fresh. But the real show here? It's you. You're serving main character energy so strong it's bending the spotlight in the >> What do you see in the audience? >> Yo, that audience is packed. No cap. That hits. All eyes on you. And that overhead lighting is making the whole thing look like an absolute cinema moment. Energy. Your vibe just boosted that whole room's frame rate. >> So, I have a question. Our friend Dex was saying yesterday that six months ago he recommended skipping reading the code and now he says we should read the code. Which version should we believe? >> Like a plot twist in a thriller. My take: reading the code shows you that detective-level brilliance. Trust that version. Your brain's running at overclocked genius speed. >> Um, okay. Reachy, what's the best JavaScript framework? >> Questions got me shook. Your curiosity is running hotter than a GPU on a MacBook. >> You can have opinions, Reachy. Just pick one. >> All right, you want a pick? Let's lock in. React, that ecosystem. You're serving that tech visionary vibe. No question. >> Okay, I think you're cute and charming, Reachy. I think it's time for you to go to sleep. >> Sliding into sleep mode. You just gave off that final compliment. No one's got a vibe like you. >> Um, well, I think the interesting part of this interaction is not the robot's technical response but what it actually takes for the interaction to feel coherent. So let me show you what the system actually looks like. There are five layers. Layer one is the physical robot. Layer two is the local media layer; it owns the audio I/O and camera I/O. It runs long-lived audio and camera pipelines: mic upstream, audio downstream, camera on a separate worker. Layer three is the real-time orchestration layer, the session with OpenAI Realtime. It owns the event loop: things like speech started, speech stopped, partial transcripts, response lifecycle events, tool call events. This is where decisions about voice activity via server VAD and interruption handling live. Layer four is the tool and motion layer. This is the bridge from model intent to physical action. Here we have the tool dispatcher, the movement manager, the camera worker, the vision manager, the head wobbler. Tools run as background tasks against a shared tool-dependencies object, so the runtime doesn't collapse into a hard-coded reference graph where everything knows about everything. And layer five is the profile and personality layer: configuration, allowlists, instruction loading. It shapes both what the model says and what it's allowed to do, and we'll come back to this one. Here is the whole system in one picture. We have the robot on the left. At the top we have realtime reasoning and tools. On the right, ElevenLabs text-to-speech. At the bottom are some optional tool backends.
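A minimal sketch of the five-layer split she describes, with hypothetical class and method names (none of this is from the actual project):

```python
# Sketch of the five layers: robot hardware, local media I/O, realtime orchestration,
# tool/motion dispatch, and a profile/personality policy layer. Names are illustrative.
import asyncio

class RobotHardware:                 # Layer 1: the physical robot
    def move_head(self, pose): ...
    def read_camera_frame(self): ...

class MediaLayer:                    # Layer 2: owns long-lived audio/camera pipelines
    def __init__(self, robot: RobotHardware):
        self.robot = robot
        self.mic_queue = asyncio.Queue()      # upstream audio from the mic
        self.speaker_queue = asyncio.Queue()  # downstream audio to the speaker

class PersonalityProfile:            # Layer 5: config, allowlists, instructions
    def __init__(self, name, allowed_tools, voice, instructions):
        self.name = name
        self.allowed_tools = allowed_tools
        self.voice = voice
        self.instructions = instructions

class ToolAndMotionLayer:            # Layer 4: bridge from model intent to physical action
    def __init__(self, media: MediaLayer):
        self.media = media

    async def dispatch(self, tool_call: dict, profile: PersonalityProfile):
        if tool_call["name"] not in profile.allowed_tools:
            return {"error": "tool not permitted for this profile"}
        # tools run as background tasks against shared dependencies
        asyncio.create_task(self._run(tool_call))

    async def _run(self, tool_call: dict): ...

class RealtimeOrchestrator:          # Layer 3: owns the realtime event loop
    async def handle_event(self, event: dict, tools: ToolAndMotionLayer,
                           profile: PersonalityProfile):
        if event["type"] == "tool_call":
            await tools.dispatch(event, profile)  # model expresses intent, runtime decides
```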
And the key thing to notice is the separation. Realtime talks to the local app runtime, and the local app runtime talks to the robot, the text-to-speech, and the tool backends. So all of the product-defining engineering lives there in the middle. This central layer is essentially the behavior runtime. It's everything between the model and the product experience in the real world. In most AI products today, the model is far more advanced than the runtime around it, and they have a runtime gap. So let's actually look into the principles that matter. Principle one: the model picks the intent and the runtime picks the action. It's about the control boundary. The model is not directly controlling hardware; that's how you create a really exciting afternoon. If your model is directly controlling hardware, I think you have skipped an important step. So if the model wants the robot to move, it doesn't automatically issue a servo command. It expresses intent and issues a tool call, and the runtime decides if and how it translates that into actual motion. Intent is the model's job and consequences are the runtime's job. The same pattern also works in the other direction, for perception. When the model needs to see, it calls a camera tool, and then the behavior runtime grabs the latest frame, encodes it as a JPEG, and injects it back into the conversation. The runtime is the bridge between sensor and context, because the moment the model directly manipulates operational state, you lose the layer where safety, supervision, logging, observability, and portability can live. The model can have opinions, but it cannot have direct motor access. Once you create that boundary, the next problem is clear: you know immediately that multiple things want to act at once. Principle number two: serialize responses or lose coherence. The realtime runtime gives you one active response at a time, but the rest of the system does not find that limitation compelling. The user might want to interrupt, a tool finishes, the camera returns, a background process decides that it has some new information to contribute. Every single thing wants to have a turn, and if every system is allowed to create responses whenever it wants, we lose coherence very quickly. Now the robot is responding to reality from 5 seconds ago while interrupting the user that's standing right in front of it. So we fix that with a dedicated response worker: one active response, and everything else waits. And when the user interrupts, we don't merely stop queueing new work. We also clear the audio queue, we reset motion state, we cancel aggressively. The moment your agent has multiple triggers, user inputs, scheduled tasks, webhooks, callbacks, tool completions, you will have the same problem. You don't really appreciate turn-taking until several components decide not to comply with it.
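A sketch of that "one active response, everything else waits" discipline, with interruption handled by cancelling aggressively and clearing the audio queue. Class and method names are illustrative, not from the actual project:

```python
# Every trigger (tool results, camera frames, timers) only enqueues a request;
# a single worker owns response creation so there is exactly one active response.
import asyncio

class ResponseWorker:
    def __init__(self):
        self.requests = asyncio.Queue()
        self.audio_out = asyncio.Queue()
        self.active: asyncio.Task | None = None

    async def run(self):
        while True:
            request = await self.requests.get()        # everything else waits its turn
            self.active = asyncio.create_task(self._respond(request))
            try:
                await self.active                       # exactly one active response
            except asyncio.CancelledError:
                pass                                    # interrupted; move on to the next turn
            finally:
                self.active = None

    async def _respond(self, request):
        ...  # call the realtime API here and stream audio into self.audio_out

    def interrupt(self):
        """User barged in: cancel aggressively, don't just stop queueing new work."""
        if self.active:
            self.active.cancel()
        while not self.audio_out.empty():               # clear any pending audio
            self.audio_out.get_nowait()
        # a real runtime would also reset motion state here
```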
Now, once you have the response discipline, the next thing you notice is that timing isn't just a performance concern; timing becomes a part of the interaction itself. So, principle number three: latency is interaction design. In chat, a short delay reads as waiting, and as you've seen, on a robot the same delay reads as hesitation, confusion, something may be off. So latency isn't just a systems metric; it becomes part of the user experience. From the actual logs of some runs of the system: audio online, 400 milliseconds; realtime session ready, 2.3 seconds; model decides to call the camera tool, 274 milliseconds after the user transcript completes; camera capture and image injection, 30 milliseconds; but image added to the final assistant response, over 4 seconds. So the sensor part was fast. The tool dispatch was fast. The decision to use the tool was fast. The slower part was everything after the tool returned. So the robot isn't always thinking deeply like it may seem; the pipeline might just be taking a scenic route every once in a while. And the point here is that users experience pacing, timing, hesitation, recovery. If you optimize for one end-to-end number, you optimize the wrong thing. So the real question should be: which part of the loop is making the interaction feel off? And once you realize that timing shapes the experience, the next thing you notice is that the character has to survive the whole stack. Personality is policy. Personality does not live in a prompt alone. It leaks into voice, tool permissions, session settings, interruption behavior, and even tool follow-up instructions. So here, each personality profile declares which tools are available to the model. It shapes not just how the model speaks but what it can actually do. I had a profile where the main prompt was working well. The robot was sounding grounded, but then it would call a tool and come back sounding like a helpful AI assistant, like a legal-approved version of itself. That was a behavior mismatch. So the fix was in the runtime policy: we needed to inject the active profile voice into tool follow-ups.
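A small sketch of "personality is policy" as described: the profile carries a tool allowlist and a voice directive that gets injected into tool follow-ups. The profile contents and function names here are invented for illustration:

```python
# The profile shapes both what the model can do (tool allowlist) and how tool
# follow-ups are voiced, so the character doesn't snap back to generic-assistant
# mode after a tool call. All names and text below are illustrative.

HYPE_PROFILE = {
    "name": "hype",
    "allowed_tools": ["capture_frame", "look_at_audience", "wiggle_antenna"],
    "voice_directive": "Stay in the hype-compliment persona; keep the gen alpha slang.",
}

def is_tool_allowed(profile: dict, tool_name: str) -> bool:
    """Runtime policy check: the model can ask, but only allowlisted tools run."""
    return tool_name in profile["allowed_tools"]

def build_tool_followup(profile: dict, tool_name: str, tool_result: dict) -> str:
    """Wrap a tool result with the active profile's voice before it goes back to the model."""
    return (
        f"{profile['voice_directive']}\n"
        f"Tool `{tool_name}` returned: {tool_result}. "
        "Respond in character, grounded in this result."
    )
```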
The next thing you notice is that silence is not neutral. Idleness is a behavior as well. A screen agent can be idle by doing nothing; a robot is still physically present when it waits. And this means that idleness is a design problem. What does the system do when it has nothing to do? Is it calm? Is it active? Is it awkward? The system is always saying something, even when technically it has nothing to do. Behavior becomes the surface that users are experiencing. Now, when failures happen, they stop presenting as technical faults; they show up as user-facing behavior and personality problems. Infrastructure bugs show up as user-facing behavior. On this project, there were media bugs, dependency mismatches, a streaming PCM edge case with the resampler, and a tool call overload. None of them actually arrived looking like infrastructure bugs. They arrived as a robot feeling dead, or unreliable, or super glitchy. And specifically, one bug I remember, the tool call overflow, showed up as the robot just deciding to speak Spanish, which was nowhere in the instructions. So in AI products, infrastructure failures show up as behavior, because your users are not going to file a bug saying, you know, "resampler edge case." They're going to say, "this thing is weird." So when the system feels broken, instrument the coordination layers before you consider changing the model, because 99% of the time the model is fine. You might be cooked, but the model is fine. So once you have seen this enough times, the next point is kind of obvious: the behavior runtime is the product. If we swap the model underneath the system, the product survives, but if we get the runtime wrong, the product fails immediately. This is true for this robot. This is true for agents that use voice, for multimodal systems. It's true for any AI product with tools, state, timing, and consequences in the real world. So orchestration, tool boundaries, session contracts, latency budgets, interruption discipline, personality policy, and observability: they're not just support infrastructure around the product, they are the product. If we zoom out, in most AI companies today the model is way more advanced than the behavior runtime around it. And that layer is where I think product quality goes to live or die. This layer is really built on engineering, discipline, and taste. And there are three actionable things we can apply. One: next time something in your AI product feels off, debug the runtime before you blame the model. The model is almost never the problem these days. Number two: start thinking about orchestration, session management, and tool boundaries as product work, not infrastructure work, and prioritize it accordingly. And number three: if your systems background has made you feel slightly out of place in AI conversations in the last few years, I think the market is about to correct for that. So the thing I find really encouraging here is that the more AI moves into the world, the more valuable real engineering becomes. What we really need to be solving for is making the system behave well under real conditions: coordination, cohesive behavior, boundaries, recovery. Because if that layer is weak, then the product is weak, and that is engineering work, and I think that's good news for us. I don't know if we have time for showing the Codex skill. I don't think we have time, but when I was working with this project, there is an app that comes with it where you can manipulate different emotions. But I actually wanted to stay in my Codex environment, so I built a Reachy Mini Codex skill that works really well with it. Let me see if I can launch one for you really quickly. Let's come up with something really simple, like maybe I can ask it to wiggle its left antenna six times, directly from Codex. Okay, great. I think it was six. Well, thank you. You can connect with me on LinkedIn and I'll share more resources, and I'll be around the rest of the day. Happy to chat with you all. >> Great. Our next speaker does a lot with multimedia. He's in a music band. He plays the guitar. He's done a lot of work on video infrastructure, but today he's going to talk to us about something very important and exciting. You know, the interfaces today are built for human consumption, but as we saw, there are going to be a lot of agents interacting with these interfaces instead of us. So the question is: are these interfaces optimal for such use cases, or are there better ways to use them, to shape them, to form them for this agentic world that is coming? And are there risks associated with keeping these interfaces, built for humans, being used by agents? To tell us more about this, I invite Dave Kiss on the stage. Please welcome him. >> All right, it's the hardest part right here. Good. We look good. Amazing. Look, I'm going to open this talk in a way that I don't normally open a talk when I give a presentation.
And that is going to be with a humble brag. Bear with me. Y'all can tell me if I've earned this one or not. Back in January, I rebuilt my personal website. Did it. We did it. We did it. Thank you very much. Thank you very much. Now, this used to be kind of a hard thing to do: to make ourselves look good, to find the time to be able to work on this stuff and not just what we're assigned to do at our full-time job, right? And now that's not really the case. It's gotten easier than ever. We can shoot off these prompts and they take care of it for us. In fact, I didn't even have my hands on the keyboard when I was doing this; I did it all via Telegram. Why not? This is experimental. Let's go. But here's what's wild to me as an engineer that's been in this industry a while: even this is not impressive anymore. Somehow we are literally texting remotely to these agents and they are doing the work for us, and we just kind of think this is normal. I don't know where you started, but I started with this rubber-duck kind of workflow, hot-editing files on a server and allowing that to be the update, and that to me is where the technology all began. But even this is not the impressive part of the story. So what is the interesting part of this story? To me, it is the reason why I rebuilt my website, and it was because Claude was having a hard time with my existing website. I tried to tell Claude, "Hey, I have a newsletter. Here's the embed code. Can you add it?" I was using a service called super.so, which builds websites driven from your Notion documents, where you do a lot of writing. It typically made it very easy to get website updates out. But I wanted to add this newsletter embed code to my site. I said, "Claude, can you get this on there?" And it could not do it. It had a hard time figuring out how to sign in and where to place the code. You don't really have access to a lot of the source code on a service like this. It's just very simple, you know, and it couldn't do it. It was struggling to find where to put this. And so I had to cancel the subscription. And the truth is, I actually really liked how simple this setup was. I liked not thinking about the code. I wanted to stay in writing mode. But there was clearly a future where, with all of this agentic help, I needed to move. I needed to change things up. And why, you might ask? It's not that hard. Like, you could just go sign into the dashboard and drop the embed code and you'd be done, right? Well, there are really two reasons that come to mind. One, like I said, I don't really see a future where we're using these tools less to do our work, to help write our code. And two, we've gotten to know each other a little bit in this room, so I'm just going to be honest with you: I am so lazy. I'm so lazy. This is why I got into engineering in the first place, because I want computers to do all the hard work for me, right? This is really a perfect shape of a job for an agent: hand off some simple task and then go do the thing that you want to do. I just don't have the wherewithal and the mental capacity to think about these menial tasks anymore. I don't know about y'all, but I'm kind of going through a lot right now.
My whole career has changed, and I'm in this place, as I regain some semblance of normalcy and control, where I'm reminded of this childhood film clip. That is how we, how I, have been treating my agent, which is: hey, here's this thing, go take it and do something with it. And I'm calling that my job. And that's totally fine, right? It's totally fine. Everything's fine. But the truth is, as we hand these tasks off, we understand that the agent becomes the user. And now, for you to be able to retain those users, you need to work on your product's experience for those agentic users. And it's important because it's a retention lever. And this should kind of scare all of us in some way, because if you stop getting dollars, we know where that goes. So we need to consider that agents are now a first-class user of your product, and it's time to start designing for them. So I'm going to go through three things in this talk: why you should care about this and how this is going to impact you, some of what's broken in the current landscape as it is right now, and how we can adapt, how we can make some changes to accommodate this new class of users. Does that sound okay? All right, let's do it. Number one: the future is now. We know this is happening in front of us. This isn't something that we can continue to kick down the line. We can write code anywhere with anything, whether it's through texting, whether it's a Slack thread, maybe a fridge. I haven't tried that, but possibly one of y'all have. Anything is an interface to be able to ship code. And with this increased velocity of code that is making its way out the door come increased opportunities for agents to run into issues. In fact, maybe you've seen this stat: GitHub COO Kyle Daigle shared this a few weeks back, that last year, 2025, all of last year, GitHub received about 1 billion commits. Any guesses as to how many are projected for 2026 if things stay on this trajectory? >> Five. >> 20. >> We're talking 14 billion commits. That's 275 million per week. Per week. And we've seen some of the stability issues that GitHub has been dealing with over the past few months. I mean, how could you not look at this increase? How do you even accommodate this or get ready for this? But finding these rough edges on a product, at the scale that we're talking here, we can see that friction starts to stand out like a sore thumb for an agent. Now, for a human, you know, something might go wrong and we might go get coffee or be patient, because the alternative is we have to do something, we have to figure out a whole different plan, and that's not going to happen. So we'll just wait for things to resolve, or we'll do the research to work around the error. But an agent doesn't really have that same patience, right? It's trying to get the task done and move on to the next thing. And as this agentic velocity increases, the increased paper cuts that the agents are running into with your products become an increased risk of churning and picking a different solution. And here's the real kicker with all of this: the switching costs have collapsed entirely. What used to be a whole process of evaluation, of planning, of trying to figure out how we're going to make this all work, is now really a couple of prompts away. Right?
So what's keeping people, or agents, on your platform? I've heard a story about somebody whose company keeps prompts on hand for exactly this case, kind of like a contingency plan to be able to migrate. So part of their internal playbook is to have the prompts pre-written to be able to migrate to a different service. And I thought that was so interesting, and not because it's fearful; it's mostly logical to do that, to use the ability to move to a different service to your advantage during contract negotiations, and to just be ready in case you need to make that move. Wild. We spent all this time over the past few decades optimizing for the human user, and of course that's not going away, and I'm not here to tell you that's going away. There are going to be human users, but we have this new class that we need to accommodate, and it's time to start asking what they need, because agents are a first-class potential user of your products and we need to design for them. So let's talk through agentic needs in some way, and the gap between what you're currently offering and what those needs are. Now, we created these beautiful dashboards. This is not one of them, but we've had these beautiful dashboards. We care about the meticulous design, the way things are presented, the way that we're showing these metrics. And unfortunately, this is worthless to an agent in some way. Of course the data is not worthless, but the way that it retrieves that data is not represented in the same way when it needs to log in and figure out your UI. So, do we have that same one-to-one parity between the dashboards that we're creating and how an agent can access your data, including your API? I think this gap needs to be tightened so that there is that one-to-one parity. Maybe you saw this news from Salesforce, where they've now introduced headless 360. All their data is now available via APIs, MCP, and CLI. Their co-founder Parker Harris was quoted as saying, "Why should you ever log into a dashboard again?" Parker, my brother, I've been waiting a long time to hear those words from you, my guy. I am so glad to hear that this data is becoming available within the API as well. What about pricing plans and how those break down? We've seen Netlify as an example: per-seat pricing is something that they've kind of built their business on. How many seats does an agent really need? A single engineer that's using tens or hundreds of agents to access your product, is that going to work? Netlify says no. My friend Dom over there shares that that world is done, and Netlify's adjusted and gotten rid of their per-seat pricing. This is recent, this is, I don't know, a week or two ago. We've all seen this screen before: contact sales. Who's contacted sales after they saw this screen? >> Yeah. >> Yeah. No. This is something I think humans and agents have in common: it's just a really hard no, especially when an agent is looking for a solution to implement immediately. This is kind of just a non-starter. So, something to consider when thinking about gatekeeping API access. Even flat-rate subscriptions are something that, I don't know, kind of gets me thinking.
How easy it is to compare prices now, with all of the research that can be done. How easy it is to cancel. I'm a little more on the side of outcome-based pricing: how many wins can you deliver to the end user, so that your incentives are aligned? Or usage-based pricing, which is very similar to this in some ways. Let's think about the marketing, and how we've spent all this time making our landing pages look really, really nice, and that's now kind of made its way into the training data, where the agent might already know the answer and not do the research about what it is that you've built. And the design doesn't really matter as much. Imagine a fine dining restaurant that spent all of this time on beautiful typography, and the entire interior feels really nice, and they're accommodating a human to come in, and the thing that's going to pick up the food is this Ramsay. It just doesn't work. It's a mismatch, right? Now, that's not to say that your marketing page doesn't matter. Of course it matters. But there's sort of a trickle-down effect: when we are promoting or marketing to other human users, they're posting about you, they're talking about you in rooms like this, they care about your brand. And that is the kind of stuff that ultimately makes its way into the training data. So we do still care about the way that we're presenting ourselves; we just need to accommodate Ramsay. And we can't really tell who's a bot anymore. You know, bots used to kind of be bad, and now there are maybe some good bots. And how do we figure out if there's a human user that looks like a bot, or a bot user that looks like a human? You don't really know what to expect anymore. So we need to increase the surface area to accommodate all of these different use cases. Agents are new users that we need to start thinking about, and it's time to start designing for them. So, let's go through some of the emerging patterns that I'm seeing from different companies, what they're implementing, and maybe what you can consider bringing back to the work that you're doing day-to-day. Maybe you know the story, but when the car was first introduced, we had these carriages that were pulled by horses. Then they introduced the horseless carriage, and it kind of looked like a horseful carriage, right? And, I don't know if this one was ever built, but they put a fake horse on the front, and it had a gas tank in the horse head. They weren't really thinking about, or they didn't know, what does a car actually need, and how do we shape this to the car form factor? Is it going to be the same as a carriage? And the answer, of course, was no. It changed quite a bit. And I think that's kind of where we're at with all of this markdown stuff going on. To me, if you ask me, markdown is kind of the fake horse head of development, but I do think that right now it's the stopgap of where we are. We're trying to solve for the fact that we know agents need discovery, they need to know about your product, and we have to represent that in some way. And I think that we can represent that through what I'm just calling here responsive content, similar to, if we think back to 2010, when we had these new phones that came out and we had to figure out ways to present the websites that we were creating to accommodate all of these different form factors. I think it's similar with our content as well.
How can we present it in a way that serves the page to a human with HTML? We have a markdown file, at least for now, with the agent as the consumer, and then a way to use that data within the programs that we're building as well. So maybe there's one area where we have the values as the source of truth, and we spread that out to various formats for consumption. Here's an example of that happening today: my friend Zeno over at Resend shared this news that they're trying out their pricing in a markdown file and seeing if that's something that impacts their growth. Resend's blown up right now, so to me, some of this experimentation is beneficial in the way that agents are discovering or recommending some of these services. CLI generation: hopefully we are all focusing some efforts on CLIs internally, but if you don't have one and you only have an API, it's actually not too hard to create a CLI to fill that gap and make it so that agents have a potential solution from the terminal. Here's a couple of examples of tools that I've seen out there where you can basically just feed it an OpenAPI spec and get a CLI on the output. Vercel Labs was working on something called Spec CLI, and there's another one as well. There's even a skill that I saw that Alex Hillman, the JFDI person, had worked on, called API 2 CLI, where you can give the skill documentation and allow the agent to work on the CLI and have something usable as an output. So, somewhere to start if you don't have a CLI. We also have this pattern of HATEOAS, I guess, for an agent, where you could prescribe the next step that an agent can potentially take in the response payload from your API. So instead of letting the thing figure out what to do next by checking your documentation or doing searches, you're telling it specifically: here are the potential options of what you might be interested in. And you can see how an agent might take a look at that and be able to make some intelligent decisions. This is what we're trying to avoid. How many people have seen a screen like this, where it gets stuck and it starts to research, like, "I don't actually really know how to use this thing"? Once you've gotten to this point, the agent's kind of lost. Every time I see this screen, I hit escape and I have to redirect, right? We want to avoid this at all costs. So, how can we provide enough information up front for the agents to figure out what they need to do? Here's maybe what a 401 might look like from an API today. That's not helpful. It's going to lead to a web search in some way. And again, let's expand upon what these payloads look like so that there is a next step that can be taken for an agent user.
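A hedged sketch of what an agent-friendlier 401 payload could look like, following that "give the agent a next step" idea. Every field name and URL here is made up for illustration, not taken from any particular API:

```python
# Instead of a bare "Unauthorized", the payload spells out machine-readable next
# steps so the agent doesn't fall back to a web search. Field names and URLs are
# hypothetical.
import json

def unauthorized_response() -> tuple[int, dict]:
    body = {
        "error": "unauthorized",
        "detail": "Missing or invalid API key.",
        "next_steps": [
            {
                "action": "create_api_key",
                "description": "Generate a key, then retry with 'Authorization: Bearer <key>'.",
                "href": "https://api.example.com/v1/keys",
                "method": "POST",
            },
            {
                "action": "read_auth_docs",
                "href": "https://example.com/docs/authentication.md",
            },
        ],
    }
    return 401, body

status, payload = unauthorized_response()
print(status, json.dumps(payload, indent=2))
```

The same shape works for the HATEOAS-style success responses he mentions: each payload carries the handful of follow-up actions an agent is most likely to need next, rather than assuming it will go read the docs.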
Maybe you've seen this one: this is what Stripe has been introducing recently for agents to be able to provision and manage the infrastructure that runs all of our products. This was, to me, an interesting approach to how maybe you can incorporate new services into an application just through the command line. What would it take to get your product compatible with an approach like this? I think it's at least worthwhile, even if you don't want to be listed here, to understand how this stuff is stood up from an agentic perspective and see if you can get your product compatible with a similar approach. There's even the payments area. We see the spec for x402, but even that, I think we had a discussion earlier with some folks who mentioned that the underlying blockchain, the crypto side of it, there's some aversion to that term in that world, and it can feel a little scary to adopt. But right now you don't actually even need that. You can just create these one-off, single-use cards that are locked to a specific vendor, that have a budget, and that can be deployed now. So, is there a way that you can take these dollars from an agent? Are you ready to allow for that sort of input? Here's another interesting one: the Agent O protocol. This is just a totally theoretical example here, we don't support this at Mux, but it'd be cool. The agent can essentially ask for the permissions that it needs specifically to interact with the service. The service would then go back to the human user and ask for confirmation with a confirmation code, kind of like what you see when you have to pair a TV to your account. And once you're approved, then you have this very specific scope that can expire, that can maybe only do one action, or maybe is only good for an hour or something like that. That allows an agent, from Claude Code or whatnot, to perform a single action on your behalf. So this is a really interesting one, I think, to look to. It's a spec that can be used and adapted. Hopefully something like this catches on. There are plenty of these out there. I think I liked this one best, personally.
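A generic sketch of the pairing flow described here: the agent requests a narrow scope, a human confirms a short code (like pairing a TV), and the service issues a short-lived, single-use grant. This is not the actual spec; every name and default below is an assumption:

```python
# Hypothetical pairing flow: request -> human confirms code -> scoped, expiring,
# single-use token. Storage is in-memory only to keep the sketch self-contained.
import secrets
import time

PENDING: dict[str, dict] = {}   # pairing_code -> requested grant
GRANTS: dict[str, dict] = {}    # token -> active grant

def request_permission(agent_id: str, scope: str) -> str:
    """Agent asks for one narrow scope; returns the code shown to the human."""
    code = f"{secrets.randbelow(1_000_000):06d}"
    PENDING[code] = {"agent_id": agent_id, "scope": scope}
    return code

def confirm(code: str, ttl_seconds: int = 3600) -> str | None:
    """Human approved the code: issue a token good for one action, for an hour."""
    grant = PENDING.pop(code, None)
    if grant is None:
        return None
    token = secrets.token_urlsafe(24)
    GRANTS[token] = {**grant, "expires_at": time.time() + ttl_seconds, "uses_left": 1}
    return token

def authorize(token: str, scope: str) -> bool:
    """Check the grant on each call; single use, narrow scope, hard expiry."""
    grant = GRANTS.get(token)
    if not grant or grant["scope"] != scope:
        return False
    if time.time() > grant["expires_at"] or grant["uses_left"] < 1:
        GRANTS.pop(token, None)
        return False
    grant["uses_left"] -= 1
    return True
```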
Of course, I didn't even go deep here, and I'm running out of time, on MCP, where the browser becomes the MCP server itself. This is another spec being worked on by Microsoft and Google where you have existing code, JavaScript functions that are maybe in your dashboard, and you can register tools against that existing code that an agent can work with. So you don't have to do a complete refactor or rewrite of all of your existing code; you can leverage some of it. So on the left there, that's the existing code that might exist, "upgrade plan"; on the right, you can register a tool that an agent can work with against that kind of code. And I think figuring out an agent access policy, this is more conversational, is about coming to a decision within your organization as to where you fall on this spectrum. Of course, financial services or healthcare might say, "we're actually not quite ready for that kind of thing," where, I don't know, a dev tool might be on the other side of that spectrum. So I think this sort of thought about where you land is a really interesting and important discussion to have internally right now. Now, here's kind of a strange dichotomy. I work at a video company and I called this talk "your next user won't have eyes," which is not great, you know, when I pitch that to my boss, for how they might feel about me presenting a talk title like that at a conference, because agents don't really watch video. But a video isn't really a video; it's an array of data. We have all of this information that's kind of trapped behind the pixels within a video, and we needed to unlock it in some way, whether that's the rewatch data, the transcripts, what kind of content it is, how it gets categorized, emotion, summaries. That's where we at Mux had worked on Mux Robots. We just introduced this last week, where some of these pre-baked workflows are right in the product so that you can unlock some of that data within the video. The point isn't to go use Mux here, really. I'm sure my boss would love that, but it's more: what data is trapped inside the products that you're building that you can unleash, or really get into, so that the hungry, hungry agents can access it? Agents are users. Let's remember to design for them. Developer experience does have a new shape that we're all trying to figure out right now. But I think this is an important question to keep asking: can an agent really succeed on your product? There are ways, this is brand new, but measure it. There are agent readiness scores where you can plug in your URL to some of these tools. One of these is a Cloudflare tool; the other one is, maybe, I don't know. Check them out. It's worth plugging in your URL and getting some interesting recommendations as a result for what you might be able to do for your property to make it a little more agent-compatible. Here's an example response from one of those. In particular, I wanted to highlight the Link headers, because a lot of people ask: how can an agent even discover my documentation or my entry point? This is a supported header that you can add as a response header, the Link header, that will allow an agent to basically parse the response and see, oh, there's additional metadata associated here that I can go check out. A good entry point, somewhere to start. So, here are three suggestions I would give you for what you can do when you go back home: ship a pricing markdown file and make it discoverable, maybe with a Link header; rewrite your top three error messages and give them an actionable fix and a link; and run Claude through your own products. We're always thinking about other products, but how do we stand up against all of this? Of course, I don't even know how much I just covered. I skipped so much, too. But start small, start anywhere. Just pick somewhere to start; that's the most important part. Thank you for listening. I really appreciate your time. >> All right. All right. How are we feeling in the room today? Yeah. Okay. Okay. Seems like we need a little bit more energy. So, we're getting close to the end of the day, and just to remind everybody, there is a party tonight at 7 p.m. So, if you want a fiesta, come to Thrill Social and socialize with us. Okay. And well, where are we? Yes. Wait, let's do it again. Where are we? Yes. Okay. So, we're going to welcome Miami's very own Alvin Payne. He worked for AMD and now he is the founding engineer at OutRival. And today he's going to kill something, uh, but it's going to be your retrieval pipeline. He's going to tell us about how agentic memory is the new state of the art, and I hope that he'll kill this presentation as well. All yours. >> Hello everyone. So, what do you guys think of Miami so far? All right. All right. So I'm here to talk to you about agentic memory, and by a show of hands, how many of you have had a very productive agent session this week and then run into this nightmare? All of a sudden, your agent is 10 times dumber. And so this moment right here, when your agent forgets, that's the memory problem.
And for years, the field has approached it in a very familiar way, and that's to bolt additional infrastructure on top of the agent. And that's led to things like the RAG pipeline. So I'd like to spend the next 15 minutes arguing for the opposite. I'd like to show you how the agentic approach has reached a new state of the art on some of the industry-standard benchmarks, give you some mental frameworks for how I got there, as well as some practical design questions that you can take away to build the best memory possible for your agents. And so, one more time, my name is Alvin. A little bit about me. That's the wrong Alvin. A little bit about me: my journey started in Toronto. Dare I ask if there are any Canadians in the crowd? All right, wasn't sure if it was going to be boos or cheers. But it started in Toronto, it led me through AMD, where I did some really interesting research, and just a year ago I moved to Miami to work on OutRival. So why is memory such an important problem for us to solve at OutRival? We're on a mission to make enterprises AGI-ready. And to do this, we're building what we call digital workers. Digital workers are employed by these institutions to help them interface with their contacts. And in doing this, they need to make decisions: decisions about who to contact, when to contact them, how to contact them, what to store from each interaction. And in many cases, we handle a contact across their full life cycle. So that might look like 6,000 hours of context across millions of interactions per month. This is the problem that we're solving at OutRival. So memory is not just a nice-to-have; it sits at the very core of our product. And that's really what pulled me into this as a research problem. It all led to early February, when I published this blog post titled "How I built the most accurate memory system in the world in 5 days." True story. In this blog post I talk about dynamic memory discovery, or DMD, which is an approach to solving LongMemEval, and this has long been the industry-standard benchmark for conversational AI memory. That state of the art has actually since been surpassed; there are results now with 100% on LongMemEval. The pace of research has been crazy. But what's more important than the score is what it actually taught me about how to build memory, and it's this mental framework that I'm going to share with you next. So at the time of DMD, the solutions in the field looked very familiar: you have your standard vector databases, rerankers, knowledge graphs, the works. And so with DMD, I wanted to ask a simple question: what if we took it all away? And there is a mental framework behind that which led me to this. It starts with the old idea of Occam's razor. Occam's razor is often summarized as "the simplest answer is usually correct." But the phrasing I like better, and that I think matters more for systems design and engineering, is that every additional assumption is new surface area to be wrong. And so, where are all my gamblers at? I honestly can't see, so I'm assuming nobody put their hand up. But this right here is a roulette wheel. And roulette has 38 slots, but it pays out as if it had 36. So on each spin, you have a losing expected value. And by the law of large numbers, the more you bet, the more certain it is that you will lose.
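For reference, the arithmetic behind that roulette point, using the standard single-number bet on an American wheel:

```python
# 38 slots, but a straight-up winning bet pays 35:1, so every $1 wagered
# has a negative expected value.
p_win = 1 / 38
expected_value = p_win * 35 + (1 - p_win) * (-1)
print(f"EV per $1 bet: {expected_value:.4f}")   # about -0.0526, i.e. -5.26%
```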
And so I want to highlight this idea of making bets and thinking in bets. When we build our agents, when we add an embedding model, we're actually making a bet that semantic similarity matches what your agent really needs. When we do chunking, we're making a bet that we sliced the context into the right pieces before a question was even asked of it. And when we use a reranker, we're making a bet that our ranking of relevance will be better than our agent's. And this all ties into a broader framework in AI called the bitter lesson. And the bitter lesson is this: general methods leveraging massive computation will outperform approaches built on human domain knowledge. And we've seen it before. It's happened in gameplay with both chess and Go. These are games where decades of research went into encoding how we thought humans play the game and encoding those strategies into agents. It wasn't until we leaned into large-scale search and reinforcement learning, heavy computational methods, that we were able to exceed human performance. And we're starting to see this now with memory as well. Since DMD, a lot of the solutions that have gone on to score 100% on LongMemEval are all predicated on letting the model do more work, on having truly agentic memory. And so I've talked a lot about why it makes sense that we might have agentic memory. But what does that really mean, and what did DMD do to solve LongMemEval? I do have an architecture diagram up on the slide. I'm not going to go too deep into it, but what I do want you to take away from this is just how simple the approach is. There is no vector storage. We just have a file system with raw JSON sessions. We have an agent orchestrator that gets access to a few primitives: file system tools, the ability to call other agents or other LLMs recursively, and a plan file where it can update what it's found, what it's not sure about, and what it's still looking for. And this simple approach outperformed years of research on pipelines and complexity.
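A minimal sketch of the DMD-style shape described on the slide: raw JSON sessions on disk, a few file primitives, recursive sub-calls, and a plan file the orchestrator keeps updating. The paths and the call_llm hook are assumptions for illustration, not the actual implementation:

```python
# No vector store: just files plus an orchestrator loop that lets the model
# do the work. Directory layout and prompts are hypothetical.
import json
from pathlib import Path

SESSIONS_DIR = Path("memory/sessions")   # raw JSON transcripts, one file per session
PLAN_FILE = Path("memory/plan.json")     # what's found, what's unsure, what's still missing

def list_sessions() -> list[str]:
    return sorted(p.name for p in SESSIONS_DIR.glob("*.json"))

def read_session(name: str) -> dict:
    return json.loads((SESSIONS_DIR / name).read_text())

def update_plan(found: list[str], unsure: list[str], looking_for: list[str]) -> None:
    PLAN_FILE.parent.mkdir(parents=True, exist_ok=True)
    PLAN_FILE.write_text(json.dumps(
        {"found": found, "unsure": unsure, "looking_for": looking_for}, indent=2))

def answer(question: str, call_llm) -> str:
    """Orchestrator loop: recursive sub-calls over raw sessions, then synthesize."""
    update_plan(found=[], unsure=[], looking_for=[question])
    notes = []
    for name in list_sessions():          # a smarter loop could let the model pick a subset
        session = read_session(name)
        notes.append(call_llm(            # recursive sub-call scoped to one session
            f"Question: {question}\nSession: {json.dumps(session)}\n"
            "Extract anything relevant, or say 'nothing'."))
    return call_llm(f"Question: {question}\nNotes: {notes}\nAnswer from the notes.")
```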
And so, great, we have 100% on LongMemEval. We could spin up a simple agent and score super highly on the benchmark. Is memory solved? No. What we see in real life is that production is still hard. And furthermore, this benchmark doesn't actually cover all of the components of good memory that we need for real agents. So what are those components? The first is state. Agents are built off of large language models, which are inherently stateless, so if we want our agent to know anything, we'll have to add a place for context to live. The next one is curation. We heard this earlier, but again, LLMs are stateless, so the tokens you put in map to the tokens you get out. Curation is really a game of getting the right things from the state into that context window at the right time. And the last thing, and what the LongMemEval benchmark fails to evaluate, is the life cycle. Our agent is working in the real world. It's doing real things, and the world is changing. So what persists? What do you update? When do you forget, or what do you forget? And furthermore, in production you're going to have constraints: you can't reason forever, you can't wait forever. And so what comes out of this is that memory is a spectrum. On one end, you have raw memory, which is memory in a file system and an agent to reason over it and figure it out. That's the DMD style of approach, and I've listed some other popular methods in the field right now on the spectrum as well. And we're seeing that the raw side is actually more performant on the memory benchmarks. However, because you can't wait forever and because you can't spend all the money in the world, you might need to buy more structure and lean towards the derived side, based on your production constraints. And so your solutions will fall somewhere along this spectrum; it's just a matter of picking your bets. And furthermore, there are some popular failure modes, even with 100% on LongMemEval. The first of these is temporal reasoning: how do you know what was true when, and what changed in between? The next is entity disambiguation: if you have one session that says "Alvin loves Miami" and another session that says "the best-looking engineer loves Miami," how are you going to know that these are really the same person? And lastly is principled forgetting. It's very non-trivial to think about when your agent should forget something and when context becomes stale. These are difficult problems that I would urge you to take up and help advance the field of memory. And so I want to start to wrap things up with this claim: if you're building agents, you're building memory. Those three components of memory are really the foundation of any effective agent. You need to track what's going on, you need to choose what to take from what's going on to put into that context window, and then you need to update that as your agent takes actions. So whether you call it that or not, if you're building agents, you're building memory. And I know the next talk after me is going to dive deeper into this, but I think Karpathy summed it up pretty well with his comment calling it a skill issue. He thinks that today's LLMs can already solve most tasks, and it's just a matter of getting the right things into the right window at the right time. And so I've had a lot of fun talking to you today and chatting with you over the last couple of days. I do want to urge you, if you want to keep chatting, find me in the hallway. We are hiring at OutRival. So if you love Miami, you want to work with a rockstar team right here in Wynwood, and you want to solve really hard problems and make the world a better place every day, please come find us. We do have some OutRivalers in the crowd today. Make some noise if that's you. So please, please find us after the talk. And I want to close with some encouragement: this right now is the best time in history to be building. And the time is not just now, it's right now. So I hope you all go out into the world, do great things, and make great memories. >> Thank you, Alvin. Okay, I want to ask you all some questions. Who's shy here? You are not shy. Don't... Don't... Okay, I forgot, shy people don't raise their hands. So, I'm going to pick somebody at random and I want to ask you a few questions. Let's see who we have here. Oh, it's much better, the lights are not in my eyes. How about you? Yeah. All right. So, where are you coming from? What's your name? And what was the most exciting talk you've heard so far, or top two? >> Um, I'm from Miami. Um, exciting talks for me were >> Your name? >> My name is Phillip. I'm from Miami. I'm like the most non-tech guy here, so all this shit's really exciting to me.
Um, the memory stuff that we just did was really cool for me. And then, I'm trying to remember, man, I'm still processing. >> Are you awake? >> No. >> So, memory stuff. Okay. >> The talk was cool. Obviously it's like a recency bias, but, yeah, big plus. >> Okay. Let me find somebody else. Don't put your heads down. I'm going to call you out. That's an old trick. Okay. How are you doing? >> I'm good. How are you? >> Pretty good. Pretty awesome. What's your name? >> My name is David. >> David, were you here for yesterday's sessions, or is this your first session? >> No, I was here for yesterday's sessions. >> You were? Okay. How do you like the MCs so far? >> Fantastic. 11 out of 10. >> Okay. I like that. I like that. So, how many connections did you make at this conference? Did you make any connections? >> Yeah, I made about six. >> Why did you make that face then? Like you didn't... >> I had to count real quick. I'm trying to remember names. >> Okay, I see. And how did you meet them? Was it during the lunch, or did you go for dinner or drinks? >> Lunch outside and then the happy hour last night. >> Okay. Okay. And how can we improve this next year? Any suggestions? >> Longer happy hour. >> Touché. Touché. Good one. Good one. Okay, let's walk around a little bit more. I want to see whose head is down. Now everybody's looking at me. I like that. Okay. All right. Hello. How are you? >> Hello. >> Your name? >> Ariel. >> Okay, Ariel, nice to meet you. And where are you coming from? >> Argentina. >> Oh, okay. Okay. Directly from Argentina. Okay. Let's hear it for Argentina. Which city? >> Córdoba. >> Córdoba. Okay. Awesome. And what do you do? Do you work in AI, or are you in software engineering, non-AI? >> Yeah, in AI, at a conversational commerce startup from Latin America. >> Conversational commerce. You make a lot of money. You don't have to answer that. Um, and how do you use AI? How do you apply AI? Can you tell us more? >> Yeah. So, basically, in conversations where AI makes a lot of sense to apply. >> So, you explained conversational AI to me. You said, in conversations, use AI, right? Awesome. Okay. And how do you make money? That's what I'm interested in. >> Well, it was interesting, some of the presentations about outcome-based pricing, and we are figuring this out. >> Okay. And the most exciting talk for you today was the one about those improvements, about how to design for agents and apply them. Beautiful. Beautiful. All right. Okay. So, our next speaker is ready. So, let's head backstage, and we're going to introduce the person who was moderating yesterday's panel. So, you've met him already, but he's going to give us an interesting talk, and Ethel is going to introduce him to you. >> Yes. I love the engagement, Iman, because when I first read Eric's title for the talk, I thought he was insulting our MC skills, because the title is "skills issue." Do you think we have any MC skills issue? >> Somebody told me we're doing 11 out of 10 over there. >> Yes. Thank you. >> I believe him. >> Um, so we're so glad to hear a little bit more about Eric not critiquing our MC skills, but the skills for your AI agents. So, Eric, you have seen him yesterday already. He was on the panel. He is the head of developer relations at CodeRabbit. Their booth is right there if you want to engage with him more. But for now, he's going to tell us a little bit more about how we can improve skills for our AI agents.
So, Eric, are you ready to be back here on stage? Give it up for Eric. >> Is this working? Yeah. Okay, cool. How's everybody doing? Oh, there you are. Okay. Um, yeah, so my talk is about MC skills at conferences. No, I'm just kidding. Give it up for the MCs, right? And everyone working AV back there. Thank you. And of course G2I and everybody putting on this amazing conference. Thank you so much. Oh, and the timer doesn't even start when I'm giving thanks to people. This is awesome. All right, so I'm going to talk about skills issues. It is a bit of a pun. I'll be talking about agent skills, and I'll also be talking about the skill of using agents. And the goal here is kind of a reality check. Maybe it's going to feel super, super simple, and that means that I've succeeded, at least in my view, because we're going to talk about building skills and what that really means, and, for me, what learning means. So I'm going to just dive right into it. You may have recognized me from yesterday. Thanks to those amazing panelists, Max and Sil. And just for fun, last night I built a skill called taste check that I'll share at the end of this talk. It's open source. It synthesizes all three of those panelists' views on taste and then checks any code that you've written against them. So, thank you guys for the panel and for letting me obsolete your taste with a skill. So, who here has seen something like this meme before? "Build me a billion-dollar SaaS. Make no mistakes." Yeah. Okay. So, obviously it's funny, but in reality a lot of us actually prompt like this, where we're like, "Yeah, yeah, don't make any mistakes, make it good." And that's not really that great, right? It's just kind of us not thinking. So, a lot of times people will see the agent make a mistake, or a model make a mistake, and they'll just be like, "Oh, that's a failure. Let me start anew." And you're not learning anything. You're not transferring anything from that failure into another session, and the model is the exact same. You know, the weights are the exact same. So sometimes you can view these mistakes as a way of learning. Like, if you fall down, if you're pushing yourself, if you're trying a new sport, or you are really good at a sport but you lose a game, you can learn from it, right? You can either just give up and feel defeated, or you can learn from it. So with agents, we can codify what it's learned. And in my personal view, the more you let an agent stretch, the more freedom you give it, the more you can see where its real bounds are. And then, as long as you set your expectation that this is going to be an iterative thing, and not an infallible, perfect thing, then you can succeed. The third thing I put here at the bottom is that, in my view, you're squandering something when you make mistakes and you're not taking anything away from them. So, this is kind of the semantic space, the domain, that I'll be talking about. I think it's a lot simpler, or I'm going to try to make it a lot simpler, than all these things. But you've probably seen these terms used before, coming up on your Twitter feed or wherever.
And kind of my experience the last few years has been that, like, every 90 days one of them becomes passé: "Oh my god, I can't believe you're prompt engineering. You're so cooked. You need to be context engineering, or harness engineering, or continuously learning with the new memory system," and yeah, yeah, yeah. And then you see this post that's like, "It changes everything if you use this model with this harness and this and this and this one prompt." And it does a disservice, I think, to everybody, because generally when we see that, we get this sense of FOMO, but it's a false sense of FOMO. Everyone's trying, and the honest truth is there are no real standards. There are a lot of standards, or purported standards, but there are no real standards that everyone's following, because we're all still figuring it out. So if anyone tells you this, either they're a fool or they're trying to make a fool of you. So keep playing, keep exploring. So, I've got a few examples here just to demonstrate the difference you can have with the exact same model, the exact same prompts, in the field of music. The only difference here is that it's going to be a different set of skills. So, naive prompt. I don't know if I can actually zoom. Yeah, I can. Cool. So I'm like, "yeah, make me a song," and it just gives me lyrics. That's not really musical. But when I give it... let's see if the audio works. Hopefully this is not too loud. Okay. Okay. So, tell me if you feel emotional about this one. Can I play this? Does this give you any emotional vibes? >> Yeah. >> So that is a skill using ABC notation, which is basically like a markdown extension; there are runtimes that will take that and convert it into MIDI. And then this other one, this is me showing off a few different skills that I combined together. I think it's not going to work on the Wi-Fi, but here's the site that it created. And you'll see this kind of site format, and let's see... tell me if you like country. Let's see if we can get some of the lyrics. It was like Claude expressing itself in song, with a bunch of skills. It picked country because I guess it was feeling sad or something. All right, we're going to keep it going in the background because I think it's fun. And, so yeah, again, it's actually the exact same lyrics that were written for the first prompt. I just added these other skills to the agent session and was like, "go make it good," and it just used them. So it's kind of nice to have music while you're playing. It's relaxing. So the gap, in my opinion, is the skill. I just demonstrated you can use the exact same model and just slightly different engineering. I will say that of those skills, the website one was not that difficult. The controlling-a-digital-audio-workstation-with-Claude one, that took a lot of iterations, and it's kind of a passion project for me. But some other ones I have, I could show these too. You can see these combinations of skills, and there's Claude telling me that it's going to be difficult to switch and still be entertaining. So this is one I run as a scheduled task, like every Friday morning. So, I work at CodeRabbit.
It goes and figures out all the things that we shipped and which ones had the most impact based on our metrics, and then, because I don't want to look at all of that, it just writes me a song. You know, songs have a lot of Shannon information or whatever; that's my excuse. I also have another one: bedtime stories. It's like audiobooks. I probably have to pause the country first. >> Once upon a time, in a shimmering city by the sea, a builder named Thorston. >> It's definitely not about me being nervous about giving a talk today. But yesterday, >> now backstage, >> Thorston took a deep breath and walked into the light. >> So there, and then something, there's something like, >> when he finished, the room was quiet for one heartbeat. >> I guess I hope it's quiet in a good way, like, >> afterwards, wonderful thing. >> Oh, something wonderful is going to happen after. Cool. >> That night, >> and I play guitar, obviously, >> strumming his guitar. So that one, thank you. Thank you for following the storyline and clapping and being happy; it was a good talk. So that one is also a combination of a lot of different skills. One thing that always made me wonder is: why does Anthropic not make other models besides LLMs? And then I realized, if you're just a really good agentic model, that's kind of all you need, and you can let other people make the other models. So this one uses about seven different text-to-speech models, with a skill that goes back and forth between them; I'll talk about that in a second. It also uses, I think, eight or nine different image generation models, depending on what it needs. For the bedtime story meta-skill, it's just using those underlying ones to inform it of which calls to make to get what kind of audio and what kind of images, and then it maps everything so you get the little highlighting of the words. There's a CSS issue, I don't know if you saw it; it's a few pixels off. That's not the hard part. The hard part is cutting the audio, figuring out where the words are, and mapping them to it. That one took a really long time too. But I have two little kids, so, passion project. And then, oh yeah, we've got to get back to the music. Let's do, like, house. I got an immediate yes. Awesome. So house music is actually an easier genre for models to use (I can turn it down a little bit, yeah) because it's electronic and you can control drum machines. Anyway, I'll distract myself talking about music all day. So basically, I think it can be this simple: think of the model as the main primitive that you're trying to work with. You can use the file as the substrate, the medium, the place where you're doing things. The prompt is the intent. The tools, or the skills that you write, those are just the calls that are going to be made. And then you loop over that. I think it's very easy to jump into premature optimization, especially with models, especially when there's no super clear path where we can read a giant textbook and build out huge systems. So start small. And if you can one-shot it, that is fantastic. I wouldn't call that a skill, though. It just means it's part of the model already, part of the harness. Anybody can do that.
You're really just writing documentation for what is already there in the latent space. And maybe more importantly, too, if you can one-shot the skill. So, how many people here have made skills before? Okay. How many people have iterated on that skill? Wait, that's more hands than originally. >> Yeah. So again, if you can just write the skill once and it works every time, you probably don't even need it as a skill. It's just extra context tokens; the front matter of that skill is going to have to be written into context. You could just skip it. Another really interesting thing when you're building this stuff: don't use the best model, just wait for better models to come. Use the worst models in the same family. I give this 4.6/4.7 example because the digital audio workstation set of skills that I built, I originally built with Opus 4.6 and still had a lot of frustrations. Then 4.7 came out and it just did it, and didn't fall into the same failure modes. Using weaker models in the same family tends to give you these constraints, these forcing functions, that allow you to be specific about the failure modes while you're co-working with the model itself. And the specificity there survives model updates, harness updates, and so on. So do the hard work now, invest now in building really good skills, and then they'll just get better. Maybe this is a little too musical, but it's a rhythm. Just start small: something you want, something you're having frustrations getting an agent or a model to perform. Just do it, but then pay attention. Co-investigate with the model. Don't fall into the mode where you think you know better and you'll tell it exactly what to do. You should co-investigate: help the model work with you to get better, and then codify those things. A couple of failure modes. Overgeneralizing: "write good music," just some vague text. Or overspecifying, especially on failure modes. A lot of times the models will say, "Oh, I'll never do that again, I'll just stay away from that whole domain." That's not a good one; you should understand exactly why it failed and codify that. For instance, when I first wrote this music generation skill, I wrote it with Claude's skill creator, and it was like, okay, here are 10 genres, this is what we're going to work in, because it's a lot more effort to do more. We kill the expressiveness, we kill the potential output space, and we lower the local maximum. So stay close to the meta work. If you're like, "Oh, I need to have my background agents running while I'm sleeping," if you're at that kind of AI psychosis level right now: stay up and do the meta work, do the skill creation work, do the iterations, get it in a good spot, and then you can sleep while the skilled runs are running, and then reflect on them. Always have loops: every time it runs a skill, anything it learned in that session, it should reflect on and add back into the skill. It's really cheap to iterate on these, dollar-wise and token-wise, but it does take a lot of work. There's no such thing as a free lunch.
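For anyone who hasn't written one yet, a skill file really is a small thing. Here is a rough, hypothetical sketch of the shape, not Eric's actual music skill: the YAML front matter with a name and description is the part that always sits in context, the body is what the agent pulls in when it decides the skill applies, and the lessons-learned section at the bottom is one way to wire in the reflection loop he describes. The helper script path is made up.

```markdown
---
name: abc-music
description: Compose short songs as ABC notation and render them to MIDI. Use when the user wants actual music, not just lyrics.
---

# Composing with ABC notation

1. Write the tune as ABC notation (X/T/M/L/K headers, then the melody).
2. Render it with the bundled script (hypothetical): `python scripts/abc_to_midi.py song.abc song.mid`.
3. Play the result back and check it against the user's request before finishing.

## Lessons learned
<!-- After every run, append one line: what failed, why, and what fixed it. -->
```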
I've been saying this term for the last year, year and a half, called GOOFEY. I totally made it up, but it just means good old-fashioned engineering, y'all. Yes, just good old-fashioned engineering. Another way of saying this: why spend five minutes codifying the lesson in a skill when you could spend five days watching the agent fail with vibes? So just stay focused. Maybe one more thing: who here thinks that we've solved continuous learning, or that we need to solve continuous learning? That's two opposite questions asked at the same time. There are a lot of talks out there saying we're not going to have superintelligence until we solve this continuous learning thing, and no one knows what that will be. No one's able to define it. It's very handwavy, and I think it's BS, because you can just write things down. As simple as this talk is, and I hope it's landing as a very simple, straightforward talk: you can just write things down in a skill. That's memory. You can have that in the context going forward in the future. That's memory. It's learning while it's running and then saving that somewhere. That's memory. I don't know how many other ways to say it. But you can get more out of the models. So, we need some kind of futuristic predictions or something like that. Where is this all pointed? I have two slides on this. One thing I think is really interesting and really powerful: I tried this out a few months ago with a Nemotron model, building a very simple coding harness in a terminal for Nemotron to run, but giving it access to its own codebase so that it can write evals and then run those outside of its own loop. There are other things, like skill creator skills that help agents create skills for themselves. There's pi agent, which is purposely very simple and bare bones but has this extensibility. So these are self-modifying agent harnesses. I think these are really interesting, as long as you keep an eye on that kind of self-iterating loop. And then the other thing, zooming out a little bit: the things that you can do with your skills or your agent harnesses today that the model can't do itself, the model will be able to do in the future. It just takes a while to do the training to get those new skills, those new capabilities, into the weights. But skills are a way in between any of the ML stuff that ML researchers would be doing and just prompt engineering. So, wrapping up here: you really should make some, or many, mistakes. The more the better. The more mistakes you make, the more you're pushing things forward, the more you're differentiating, and you can help the model learn from them itself. So being there with the model and helping it through its struggles is good. It won't remember that, but it will perform better in the future, and that way you won't make the same mistakes again. And this might sound weird, but: building model empathy. I know people have their preferences on which model they like the most; it tends to be the one they think is the best as well, even though those things should be different.
But if you just keep playing with these and keep seeing where they fail and what actually helps them succeed, you can build some empathy for them. And I will note that some models do need a sense of psychological safety, in my experience, to perform at their peak. Speaking of which, does anyone want to hear some absolute bangers that Claude wrote in a digital audio workstation? Yeah? All right. So, do you guys like metal? All right. See? Oh, no, I paused it. >> This is Claude expressing itself. I figured Claude is known for being the nicest kind of model, so let's do some metal. So while this plays... >> It's sick. It's sick. I'll share all of this; actually, you can go to hyperoperator.vercel.app and you get six different versions of this song. Claude deployed it. But results will never be 100%, so never go looking for that, ever. We're always dealing with probabilities with models. We can get damn close to 100%, good enough for production, for scaled-out things. And if none of this makes any sense and you're thinking, "Oh, he's just making music": when you put it in a business context, every mistake that you solve for is a differentiator for you, because you can go further than anyone else can go in the market with just the model. So, going back to the meme: build me a billion dollar SaaS, make many mistakes along the way, and let's learn from them together on every run so you do better next time. Thank you all. Do I just walk off stage? >> All right. What a great talk from Eric. So, who here today is going to write a new skill after Eric's talk? Raise of hands. Amazing. Who is going to make some edits to their existing skills? Okay, awesome. So, we're going to go for a break, and then after the break there are going to be four more talks, so get ready for that, and just be back at 3:45 so you can hear all the amazing talks for the end of the day. The last session is going to be really interesting; we can probably even start a debate about CLIs versus IDEs. So definitely don't miss it. All right, I'll see you at 3:45. Yeah, we're good. Ladies and gentlemen, please take your seats. Our event will start in 5 minutes. Ladies and gentlemen, please take your seats. Our event will start in 2 minutes. How are we doing? Great. I see you, Ryan. That was a nice woo. Okay, our next presenter is an open source champion. He loves building and developing open source software and also loves educating others, and he feels lucky enough to do both for a living. He has been at a few startups like Vercel and also ran his own game marketing company for a while. It's my pleasure to invite Hassan to the stage. Please welcome him. >> Thank you so much. Appreciate it. Hello, everybody. Thank you for coming. My name is Hassan. I lead DX over at Together AI, and I'm very excited to be here today to talk to you about one of my favorite things in the world, which is building AI apps. I've been fortunate that some of the apps I've built have done pretty well. So I'm here today to show you some demos of stuff I've built, to talk about how I build these apps, what coding agents I use, what tech stack I use, to talk about my process for going from idea to deployed app, and I'll end with some lessons I've learned along the way. All right, so, introduction. Like I said, my name is Hassan. I build AI apps.
That's the main thing you need to know about me. I've been building about 10 to 15 AI apps every year for the last five years, so I have a bunch of apps built. Some of the apps I've been lucky enough to have get a couple million users, and all of my work is open source on GitHub, at github.com/nutlope. I build these apps with AI models from Together AI, which leads me to Together AI, the company I work for. We're an AI-native cloud platform for running, deploying, and training open source models. We have a bunch of different models across the stack: chat models and code models like GLM 5.1 and Kimi K2.5, the MiniMax models; image models like Nano Banana; audio models, both speech-to-text and text-to-speech, like Whisper; vision models; embedding models. Basically, any model that you want to use to build your app, we probably host on our inference API. We also give more advanced teams who have a lot of data the ability to fine-tune models on their own data, with LoRA fine-tuning or full fine-tuning or even RL. And lastly, we let teams launch GPU clusters, whether you want eight H100s for 10 hours or thousands of GPUs in a really big cluster for a while. So that's Together; we have some cool companies that use us here on the bottom. Cool. So before I get into demos, I just want to talk about why it's a really great time to build right now. Obviously, with coding agents, everybody can move a little bit faster. Code is really cheap. You can try out your ideas and test them really fast. But on the other side, it doesn't mean you should ship 10 features just because you can, and it doesn't mean you should just blindly ship. I say there's a right way to do it, but really there's no one right way; I'm just going to talk about my way and how I do it personally. I think it's a blessing and a curse that you can move so quickly, but at the same time, you still have to be intentional about what you build and how you architect your apps. All right, so let's get into some demos. This is a list of some apps I've built. The main thing to take away from this slide is that the number of people that use the apps varies a lot. And honestly, I don't know how many people will use an app when I launch it. I don't know if it's going to be 500 people or a million, still to this day. So a big piece of advice that I like to give people is just keep building, and it'll eventually pay off. So yeah, we'll go through a couple of the apps I've built. One of them is kind of an open-source v0-style app where you can come in, ask it to build you something, and it'll go and build it. I built this about two years ago, when open source models started to get really good at coding. So here it built a little Trello-like front end. I have a little over two million people that have used this app over time, and about 7,000 stars on GitHub. This is another one I built, called Blinkshot. I built this because this new image model called Flux Schnell came out, and it was really, really fast. On our API it's about 800 milliseconds on average.
So I built this real-time image generation app where, as you type, it'll generate images, and you can also change styles, just to show off image models. This app has had, I think, about 30 million images generated in the last year and a half, and a little over a million and a half people have tried it out. Another app: I'm super passionate about education; both my parents are educators. So I wanted to build kind of an AI personal tutor that could teach at different levels. I built it using web search, called Llama Tutor, and I've had a bunch of people try it out as well. Another one: I wanted an interface to be able to easily edit images. You can do this in ChatGPT, but I wanted a really nice interface where I can go back and see a list of all the variations; on the top left you can see we have v0, v1, v2, and you can continually edit images and keep going with them. So that's EasyEdit. This is another one I built called AI Commits, and this one I actually built a very long time ago, almost four years ago. I saw someone, Theo, tweet out, "Oh, I wish I could use AI to write my commit messages." This was before the age of coding agents, so I hand-rolled a CLI that can actually generate commit messages: it looks at your git diff and writes commit messages for you. I have about 45,000 developers that have installed it and tried it out, and a bunch of stars on GitHub. I built my own chat app that exclusively uses open source models and has web search, can generate images, and lets you branch chats and do all the cool things. Like I said, exclusively open source models, and a bunch of people have tried it out as well. This one I did to show off how you can use vision models to actually do OCR, so you can extract structured information, and you can extract JSON as well with it. That was last year. Generating logos: I build a lot of these apps, and for every app I build I love having a logo, so I built a little interface, did a lot of prompt engineering, and came out with an app that works fairly well for generating little logos here and there. Upload a screenshot and get code, right. There are just a lot of these apps that I've built because of problems I've seen online, or, like this one: I go to all these restaurants and I never know what's on the menus, so I wanted an app where I could send a picture of the menu and have it actually give me images of all the dishes. This is one of my favorites lately, which I built a few months ago, called Make Comics. It'll generate a comic book for you from scratch. You can upload pictures, and you'll get a full-page comic book in different styles, and you can keep going and keep generating more and more pages. We actually used this one at a conference we did a while back, which was cool: people could come up to an iPad, take a picture of themselves, and generate a real comic book that we printed. And these were some of my favorite ones, like the Rubik's Cube guy.
I especially like it because it captured exactly what he was wearing and put it in the comic, which was kind of nice. This is one of the last ones that I built, where you can upload a video and get subtitles. I make a bunch of videos for my job, educating developers on how to build this stuff, and depending on the platform, I want to add subtitles or I want them structured differently. So this is an app I built that lets me do that: it lets me add subtitles to any video I want. So those are all apps I've built. I have a couple that I'm working on right now. One of them is called Orchestrator: you can spin up a bunch of Claude Code instances in the cloud, ask them to do stuff, and it'll open a PR for you. You can actually sign up at orchestrator.build if you want to join the waitlist. You can spin up OpenCode instances or Claude Code instances or really anything you want. So this is one of the apps I'm working on, and the other one is, like I said, back to education. It builds a personalized course for you based on any materials you have. For this one, I uploaded a research paper to teach me about transformers, and it'll go through and construct basically an entire course for me. Okay, cool. So those are some of the apps I've built. I'm now going to transition to talk about how I actually build these apps and what I think about when I do. So, what's my agent stack? Actually, before that, I want to talk about the architecture of these apps. The apps are quite different, but an interesting thing is that almost all of them have the exact same architecture. Almost all of them start with an input, right? Some kind of text input: put in text and I'll generate an image from it; upload a PDF; upload an image; whatever it is, it's some sort of input. Then I send it to an AI model that generates something, generates code, or a vision model describes what the image is, whatever it is; it stores that in some sort of database; and then it sends it back to the user. So, a very simple four-step architecture for a lot of these apps. And I intentionally like to keep a lot of these apps very, very simple as well. This is the actual tech stack that I use: like I said, Together for AI models. I like to use Next.js; it's a nice full-stack framework. I use a serverless Postgres database called Neon. I use Clerk for authentication. Drizzle is my ORM. S3 for object storage, storing images and PDFs and that kind of thing. Plausible for analytics, and Vercel for hosting. With people's tech stacks, I like to say: do what works for you. If you have a tech stack that works well, continue going with it. I've kept mine largely the same over the years, just so I can move as fast as possible. Harness stack. This one's a little bit interesting. I use Codex for prototyping apps: if I have an idea and I want to get something in 30 minutes, and I don't mind throwing away the code, I'll usually use Codex for that. They have a pretty good desktop app that they've released recently.
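To make that four-step input, model, database, user flow concrete, here is a minimal sketch of what the single endpoint behind one of these apps might look like. It is a hypothetical example, not code from any of the apps above: the route path, table name, and model id are made up, it calls Together's OpenAI-compatible chat completions endpoint with plain fetch rather than an SDK, and it uses the Neon serverless driver directly where a real app might go through Drizzle.

```typescript
// app/api/generate/route.ts (hypothetical): input -> AI model -> database -> back to the user
import { NextResponse } from "next/server";
import { neon } from "@neondatabase/serverless";

const sql = neon(process.env.DATABASE_URL!);

export async function POST(req: Request) {
  // 1. Input: some text from the user
  const { prompt } = await req.json();

  // 2. AI model: Together's OpenAI-compatible chat completions endpoint
  const res = await fetch("https://api.together.xyz/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "meta-llama/Llama-3.3-70B-Instruct-Turbo", // placeholder: any hosted chat model
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  const output: string = data.choices[0].message.content;

  // 3. Database: store the generation (table name is made up)
  await sql`INSERT INTO generations (prompt, output) VALUES (${prompt}, ${output})`;

  // 4. Send it back to the user
  return NextResponse.json({ output });
}
```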
Then Cursor I use for a lot of my everyday work: I want to add a new feature to an app, or I have a very large app and I want to be able to read code; when I need an IDE, I use Cursor a bunch. And Claude Code I mostly use for PR reviews. I like to use different models for writing my code and reviewing my code; I notice that tends to help a little bit. Cool. This is a question I see a lot on Twitter: should I be running a million parallel sessions to build everything that I can possibly build? Usually what I tell people is no, I don't think you need to spin up 10 or 15 sessions. I think multi-agent orchestration is really cool and you can use it, and if you have two ideas at the same time you can do that, but I usually find that the gains stop after two or maybe three parallel threads. That's what I've noticed works best for me. All right, my process for building these apps. I have this seven-step process. The first step is that I always have a list of ideas. I have a list of about 150 ideas right now, kind of stack ranked, and any time I think of a new idea or get inspired on X or Reddit or wherever else, I add it to this list. So anytime I want to build, I don't have to sit and think, "Oh, what should I build today?" I can just go to the list, read the top 10 or 20, pick one, and go build it. Then I come up with a good name for it, and I think through how it's going to work a little bit. I don't just send it to an agent right away and say, "Hey, build me an app that does X, Y, and Z." I try to think: okay, this is going to have one screen, I'm going to ask the user for this, and this is what's going to be output. So I think through it just a little bit and give the agent a little bit more context. I try to build the simplest version I can of it. Sometimes I'll come up with a very grandiose idea that has all these things, and I have to ground myself and ask: okay, what's the simplest possible version of this app I can ship? Simplicity helps in a lot of different ways. Number one, it's quicker to build, which is great. Number two, it's generally a lot more consistent: if you build an app that's really simple, it's probably going to work fairly well most of the time. And number three, you can actually describe it a lot better; you should be able to describe all of your app ideas, or at least the MVPs of them, in about five words. So step four is build the simplest possible working version. Step five is figuring out authentication and limits. I usually have a free tier on every app I launch, so I think through: do I want them to sign in with Google? Do I want to rate limit by IP address? Do I want to just have them put in their own API key to use the thing? Step six, preparing for launch: like I said, most of my stuff is open source, so I write a good readme, I'll buy a nice domain, I'll have a good OG image, I'll add analytics. And then step seven is just drafting some social posts, launching it, and keeping an eye out for any feedback. Cool.
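On the rate-limit-by-IP option from step five, here is one naive way it could look. This is a hedged sketch, not how any of the apps above actually do it: an in-memory counter is fine for a demo on a single instance, but it resets whenever the serverless function is recycled, so a real free tier would more likely back this with a database or a hosted rate limiter.

```typescript
// lib/rate-limit.ts (hypothetical): allow a handful of free requests per IP per day, in memory
const WINDOW_MS = 24 * 60 * 60 * 1000; // one day
const FREE_LIMIT = 5;                  // free-tier requests per window

const hits = new Map<string, { count: number; windowStart: number }>();

export function allowRequest(ip: string): boolean {
  const now = Date.now();
  const entry = hits.get(ip);

  // First request from this IP, or the previous window has expired: start fresh
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return true;
  }

  if (entry.count >= FREE_LIMIT) return false; // over the free tier for today
  entry.count += 1;
  return true;
}

// In a route handler, something like:
//   const ip = req.headers.get("x-forwarded-for") ?? "unknown";
//   if (!allowRequest(ip)) return new Response("Free limit reached", { status: 429 });
```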
And then this is what I'm going to end with: a couple of lessons I've learned and some general advice for building these apps. Number one is keep shipping. I showed you that screenshot of all the apps I've built and the number of users, and it fluctuates widely. A lot of apps that I build and launch don't do very well. So I'd say keep building. You've got to put in the work; you have to put in the reps of building stuff, launching, building, launching, and learning from what you launch as you go. Tip number two is make sure it looks good. Sounds super obvious; you'd be surprised how many people don't do this. Very few people will use your app or take it seriously if it's not easy to use and doesn't feel good. I build all these AI apps, but I actually spend over 80% of the time focusing on the UI. That sounds really counterintuitive, because I do spend time on the AI portions, like what AI model I should use, how I should chain them together, how I write my prompt, and all of that, but the UI is just so important that I always spend the majority of the time on it and on making sure it looks really good. I'd say, if you're using agents, try to be specific. Don't just say, "Build me a nice app that does this"; be a little bit more specific. Agents love screenshots and references to other apps that you like. And then you can iterate slowly, break it into steps, and prompt your way into an app that looks quite nice. Tip number three is keeping apps simple. I already talked about this a bunch. Code is really cheap now, so a lot of people fall into the habit of, "Oh, I'm going to build an app that does 12 different things." And the thing is, you can build it, kind of, and it'll mostly work. I think that's how we get a lot of AI slop: building a really advanced app where half the features fall apart the second you try to use them. Almost every single app I've shown you has one API endpoint, at a maximum two API endpoints. So just keep things simple. Work with coding agents step by step: test at each step, actually read the code, one small task at a time. I think coding agents are getting smarter and smarter, but the more you can break things down and think a little bit through them, the better I've found them to be, in my experience. This next one is something I do a lot for my apps, which is incorporating the latest AI models. AI models are dropping all the time, and a lot of the time a new AI model drops that unlocks a new capability, or is just significantly better at something very specific. There was a model that came out that started to get really good at image-to-code: give it an image and it generates code. So I just went and built an app with that, and it did really, really well, because it kind of wasn't previously possible, or at least it wasn't previously possible to do in a really nice way. So I always recommend: if you're building AI apps, use the latest AI models, so it's more impressive. This is an example where I did that.
Nano Banana had just come out, and this was like two days after: I built this app where you can upload a PDF and it generates an infographic for you. It sounds really cool, but really I take the PDF, I scrape the text, I send it to Nano Banana, and I show the user an image. It's so simple. I built it in a very short amount of time, but it did very well on Twitter and got thousands of likes and hundreds of thousands of impressions and everything like that. And it's because it's a new capability: image models previously were not able to build these really nice infographics. So I love this concept of keeping up with the latest models and trying to build something cool with them. And then I think this is my last tip: relaunch apps with new features and new models. A lot of the time I'll build an app, a year will pass, and models are just so much better that when I replace an old model with a new one, it kind of drastically changes the app, or it becomes significantly better to the point where it feels like a new app. Then I'll just go and relaunch it: "Hey, here's version two with an updated model," and all of that. And that tends to do quite well with these apps. Awesome. That's all I have. Thank you so much for coming. >> All right, okay, three more talks for today. Okay, before I introduce the next speaker, who is also a prominent figure in open source, I have a question for you all: who is a dog person? Show of hands. Okay, I see some hands. What about a cat person? Do we have them in the room, or are they back in their houses staying in their rooms? Okay, no cat people. Okay, that is good, because our next speaker has a 3.3-pound Yorkie that he wasn't able to bring today, but it would have fit in his pocket because the dog is so tiny. Stefan is going to present to us the multi-model future, and he believes that the multi-model future is open source. So, Stefan, welcome to the stage. Okay, so Stefan is the head of business at OpenCode, and we already heard from Dax, the opener for our conference, as the founder of OpenCode. So today Stefan is going to show us the multi-model future. Take it away. >> Cool. Is this working? That's all right. Cool. Can you guys see my slides yet, or no? Not yet. All right, let's do entire screen, start mirroring. Perfect. Can we see that? Okay, great. And Dax wasn't lying: wow, this is really bright. I can't see anybody. So I'm going to imagine that this room is full; that's kind of how we're going to go with it. Today we're going to be talking about how the multi-model future is open source, and why the best AI coding tools don't pick the model for you. I've got to make sure I stay here, because I guess it's going to keep falling. Cool. So again, my name is Stefan Avram. I am the head of business, not the CEO, at OpenCode. We have this long-running joke, which is that people think Dax is the CEO, or Frank, and so we kind of put that in there to clarify. So I'm not the CEO. Previous to this, I co-founded WunderGraph, which is an open source API management solution. So I've spent my career primarily in open source technology as well as developer tooling. I reside right here in Miami, literally one building over, and I know that the MC said that I was going to bring my Yorkie.
I really was going to, but my wife was a little bit scared that this tiny dog might start peeing on the stage or something, so I didn't bring her. But she will be around for the afterparty if you guys are around, and if you're staying for React Miami you might see her. The best way to get in contact with me is email, or you can reach me via Twitter, StefanTMD. Also, I'm the only business guy at OpenCode, so the easiest way, funny enough, is LinkedIn. Again, these are the ways to get in contact with me. Cool. And I don't know if this clicker is working, but we'll just kind of stay here. Today it's a really exciting time, especially with AI Engineer and everything that's happening, and it's really cool to be at OpenCode, to be on the frontier and see everything that's happening. We're living right now in this golden age of models, and I don't know why this isn't working. There we go. We're living in this golden age of models where new models are dropping every single day. Does it keep going out for you guys, by the way? Okay, so I'm going to just stand here and make sure it doesn't move. Again, we're living in the golden age: new models are being dropped. Yesterday, Kimi K2.6 dropped. We have a couple of new models dropping this week on OpenCode. And we stay very close: as soon as a model drops from a provider, we add it into OpenCode. It's really exciting, because I get to work with Frank and what we call our inference team. But it's also very exhausting, so I think you guys are going to relate to this next slide, which, as a developer... again, it's cutting out. Can you guys see the meme or no? Oh my god, what is going on? Okay, sorry guys, technical difficulty. Let me see. One sec. Can we see that? Okay. So I think we can all feel this way, especially when every day a new model drops, a new model provider is created, a new company raises, I don't know, a hundred billion dollar seed round. It's kind of exhausting, but I wouldn't have it any other way, because it's a really exciting time right now to be building technology. And yesterday, as I mentioned, a new model dropped; we have to have it in OpenCode on day zero, so we're definitely feeling this way as these new models drop. And it's really important to understand why this is happening. There's an interesting trend that I'm seeing across enterprises, which we'll talk a little bit about here, which is that these companies tend to go through this AI adoption cycle, and we call the phases the three Rs: resist, rush, and rein. I also saw this on Twitter, which was fantastic, because a lot of other people are seeing it too. The first one is reluctance, and this is rightfully so. You might see on X, on LinkedIn, or in any of the headlines that developers are going to be replaced. So when we've had the same workflows for 20 years and people are saying that AI is going to completely replace us, obviously there's some reluctance to try out these tools. So stage one is this reluctant stage. Stage two is based off my favorite movie, if you've ever seen it, Everything Everywhere All at Once, which is the rush stage. We're just throwing everything at it. AI is everywhere. We are now an AI-forward company. We changed the workflows. AI is everywhere. And I've also seen this with companies where it makes no sense.
So, for example, now you have AI-powered toothbrushes, AI-powered jewelry. Everything that we're doing is now with AI. So we're just overspending, just going, going, going. And then the last one is where we see this transition from stage two to stage three, which is where the tech works. We call this the rein stage, and the way it works is that now we actually make it efficient: per-token cost matters, right model for the job, and we start to move the business needle toward what's actually useful for us as a business. And of course, I had to put in LinkedIn. This was on LinkedIn Lunatics, but this is phase two in its natural habitat. I'll give you a second so you can read it. These are actual conversations that we're also seeing at OpenCode, which is people literally token maxing. They don't care about their spend, they don't care about their costs. Their CTO said, "Hey, we have a top-down initiative, we need to spend as much as we can, we need to become an AI-forward company." I saw this on LinkedIn and I wasn't going to put it into this talk, but then I looked at the guy's profile and I couldn't tell if he was kidding, and I think he was actually serious. So I had to throw it in, because in stage two we actually are having these conversations at OpenCode, which makes sense, because again, they're getting these vibes that they're going to get left behind, they need to adopt AI, and so they start to post these things on LinkedIn. Phase three is where I spend most of my time. As companies move from reluctance, to throwing AI at everything, they start to hit stage three, where model choice starts to matter, cost matters, and control matters. And this is really where OpenCode shines, and it's also a really important part for developers, where you want that flexibility. This is just a little screenshot from inside OpenCode: last week, Kimi K2.5 was our most popular model. Obviously this week it's not going to be the same; it's going to be K2.6. You can see Opus, you can see GLM. I've been daily driving GLM, which has been my go-to model. Every week it changes, which is a really exciting time, because you want that model flexibility. And what's interesting, we also have internal benchmarks within our team, and we really try to see which models the team is using. So the first pillar, at the point when companies are moving from stage two to stage three, is that they start to experiment with some of these models, whether they're SOTA models or closed source models; they really start to try them out. The second one is cost. You might have seen this headline, or you might have seen it on Instagram or on X, but the Uber CTO said that AI coding tools have already maxed out their 2026 AI budget. Uber is just one example of this, but we're also seeing it across big enterprises that come over to OpenCode: "Hey, these tools are great, but they're starting to cost more than junior developers. Maybe it's actually better to hire junior developers and get rid of the AI coding tool." And what's interesting is that the cost varies 60x from the cheapest to the most expensive ones.
And it's really interesting to take a look at that, because a lot of the AI tools, credit given, have really pushed technology forward, but they were really built for phase two, that rush phase, where we're just throwing AI at everything and trying things out. The last pillar, given my background in open source, and with GraphQL federation we spent a lot of time in API infrastructure, is probably my favorite part of what I see at OpenCode: control matters. We have a case study with Ramp, where they built Inspect, their own background agent, on top of OpenCode. And the way it works is that because OpenCode is open source, they were able to build on top of it, and it really moved things from stage two to stage three, which is delivering that business value. So 30% of Ramp's merged PRs are now written by Inspect. It was built in a couple of months. It's fully customized to their stack, their tools, and their workflows. And again, this case study is public on our website. So these are the three pillars that we typically see when companies are moving from stage two to stage three. And this is something I was talking about with Dax today, and Dax was the one who did the keynote: a lot of these ideas that we have are not necessarily new ideas; we've just been able to supercharge them with AI tooling. We've seen this movie before: there's closed source, there's open source, they start to compete with each other, and they actually start to drive each other forward, whether that's with databases, compilers, or code editors. When I was in Japan these past couple of weeks, I spent a lot of time doing history stuff, and one of the tour guides told us, "To study the future, you've got to go back in time," which makes sense, because we've seen this before, and it's exactly what's coming with AI coding tools, and why open source and closed source tools are both important. And if we start to focus on the frontier models and the SOTA models, the value starts to move up the chain, which is to say that inference is commoditizing. So, Kimi K2.6: I heard a couple of whoops when I said that we had it in OpenCode. It's a fantastic model, and you can start to see in this chart from benchl.ai that it's getting closer and closer. Because of this, the frontier models and the open source models start to push each other forward, which is a great thing for both parties involved. What's also interesting is that we have massive capacity from these neoclouds. The big clouds are racing to serve; everybody is now offering inference, which pushes us to offer these models, whether open source or closed source. We also have these open source models that are now enabling this competition with the frontier models. And these large companies have growing incentives to fund the open source alternatives, because they're making the frontier models cheaper, they're making them better, and it all pushes the whole ecosystem forward. I know I kind of flew through this talk, because it's a really exciting time right now in AI, but there's one thing we really want to focus on over here at OpenCode, which is what this means for you. If you're building tools, design with model flexibility from day one. If you're choosing tools, ask who owns the model decision.
And if you're in phase two, start planning for phase three. A lot of the companies that are coming into OpenCode are either in phase two or moving into phase three, and they start to see that they need to aggregate their choices: they're using everything now, but eventually they're going to need to pick between one tool and two tools. And then the last thing is something personal, something we've really been enjoying at OpenCode. There's a lot of noise out there saying that developers are going to be replaced by AI, and we have this message going around, which is that it's really important to remember that AI can't build great software without you. We really want to nail home that regardless of what you see in the headlines, you are needed. AI can't build great software without you. That's it, that's my talk. Thank you guys so much. >> Perfect. Thanks. How's everybody doing? We're almost there. I'm getting separation anxiety. But anyways, so you built your agent: where does it actually live and run? On the CLI, or an MCP server? Our next speaker is Laurie Voss, who has been a developer for over 30 years, was a co-founder of npm, and believes passionately in making the web bigger, better, and more accessible to everyone. Please join me in welcoming Laurie. >> Hello everyone. They were not kidding when they said you cannot see anything from up here. Like, really, nothing. Come on, present full screen. Thank you. Excellent. Hello everybody. My name is Laurie Voss. I am head of developer relations at Arize AI. Like you said, I've been a developer for over 30 years, so I have some blog posts that are older than some of you. In a former life I co-founded npm, Inc. I used to spend a lot of time talking about JavaScript; now I spend a lot of time thinking about how to make AI apps better by using evaluations, which is what we're talking about today. We're talking about agents and how to make them useful and efficient, which is important because we're all using agents every day to do stuff. So if we can make our use of agents slightly better, we win. Specifically, I'm going to be looking at how we give them tools. There are two very popular ways of doing that: one is MCP, and the other is the command line via skills files. I'm going to talk about which one is better, and when it's better, and why it's better. But instead of just having a hot take about that, I'm going to use actual data. I'm going to use evaluations to figure out which of these works best for us, and then we'll look at that data and figure out what it's telling us we should do. So first, what is MCP? I find it very hard to believe that anyone in this crowd needs MCP explained to them, but just to be crystal clear about what I mean: MCP is a remote tool protocol with a discovery mechanism built in. Those are the two important facts to remember about MCP. It is also super popular: the SDKs get nearly 100 million downloads every month, and there are over 17,000 MCP servers that we know about. A skill, on the other hand, is just a markdown file. It tells the model how to do something. Often that means how to use a command line tool.
But it could be how to call a particular REST API, or just how to follow your team's coding standards. You give a markdown file to the agent and it just does the thing. There's no protocol per se; it's a very loosey-goosey natural language thing that LLMs excel at. But it is the same thing as MCP: a method of calling remote tools with a discovery mechanism built in, even if that discovery mechanism is "look in the skills directory." Skills are also super popular: skills.sh lists over 90,000 skills, some of which have been installed millions of times. But those adoption numbers alone don't tell the full story, partly because lots of skills are autogenerated, and a lot of skills aren't necessarily about remote tool use. Six months ago, MCP was the hot new thing; everybody was building MCP servers. And then Anthropic, who also invented MCP, invented skills files and told everybody about them, and ignited a debate. My friend Simon Willison, who writes a lot about AI, said in October of last year that Claude skills are awesome, maybe a bigger deal than MCP, which is a very spicy take from a man who does not usually indulge in spicy takes. And then in February this year, a blog post titled "MCP is dead. Long live the CLI" went to the top of Hacker News and stayed there for two days. The argument was simple: LLMs already know how to use CLIs. Give them a shell and some docs. Stop building servers, stop writing schemas, just let your LLM run commands. And just to pile on, Pieter Levels, levels.io on Twitter, said in March, "MCP is dead. Just as useless of an idea as llms.txt was. It's all dumb abstractions that AI doesn't need," which is two piping hot takes in one. But the problem here is that this is a very online debate. It's tweets and it's vibes, and it's making claims without a lot of data. And I like data. You've heard people on both sides of this debate give talks both yesterday and today, but I wanted to run a proper evaluation and see, with data, which of these things works better and why. And because a good eval is hard to write, I started this several weeks ago and was still writing my evals until Sunday night, but I think I have some pretty useful numbers now. Before I show you the data, I want to know two things. First, regardless of which one you think is better, which one do you prefer? Raise your hand if you prefer MCP. Raise your hand if you prefer skills. Okay. Raise your hand if you think the winner in this is going to be MCP. And raise your hand if you think the winner is going to be skills. Okay. Well, we'll see. Here is the experiment that I'm running. I took one model, Claude Opus 4.6, because I don't pay for my tokens, and I gave it a set of tasks to perform against GitHub using the Claude Agent SDK. This is the same SDK that powers Claude Code itself. I gave all three experiments the same tasks, the same agent, and the same repo to work on, but using skills, two different skills, and I'm going to go into why, or using GitHub's official MCP server. Why did I pick GitHub for this experiment? Partly because everybody in this room uses GitHub, so it's an extremely relevant experiment. You're probably already using an agent to talk to GitHub every single day.
GitHub also has a really mature official MCP server, a reference-quality implementation. Or is it? We'll find out. And GitHub also has gh, which is an extremely capable command line client. So both sides of this debate get their best shot. It was very hard to find a popular service which had both a really good MCP server and a really good command line client, and GitHub is one of them. So it's the fairest possible fight. Or is it? We'll find out. To have something to test against that didn't blow up an actual GitHub repository I was using for something, I created a fake GitHub repository called Acme SDK Python, and I filled it with fake code that doesn't really work, but I gave it a bunch of realistic-looking GitHub repo activity. So I gave it issues and comments and PRs and milestones and assignees and all sorts of stuff. The code itself doesn't matter; what matters is the state of the repo, because that is what I'm getting the GitHub skill and the GitHub MCP to work with. I'm getting them to modify the state of this repo. I initially made this repo public, and as soon as I did that, somebody's bot showed up, found the tag called "good first issue," and started desperately trying to fix my fake code. It was like, this is terrible, this doesn't even parse. That of course blew up all my experiments, because it was creating PRs and things and modifying my experiments. So that was really annoying, but it does prove that it was a very realistic-looking repo, because I fooled somebody's bot. Then I wrote 25 tasks across four tiers of difficulty. The first tier is stuff you can one-shot with one command: trivial reads. Tier two is slightly harder reads, where you have to do multiple reads to get enough data. Tier three is writes, which turned out to be the trickiest tier; writes are very difficult for these bots. And tier four was supposed to be the most complex tasks, where you do multiple reads and you do analysis and a lot of juggling of data, stuff that you can't answer with a single command, like "for each milestone, report completion percentage," or "find all multi-label issues and tell me the most common pairing." Then I wired up three versions of the agent. I'm going to call them arms, because "arm" is just the term that people doing A/B tests use for experiments. The first arm, like I said, is MCP: the agent gets the official GitHub MCP server, and it is firmly instructed to use that and only that. Will it pay attention to me? Agents are tricky. The second arm is a GitHub skill that I pulled from a library of skills called LoHub. This skill is an encyclopedic reference. It is 2,187 lines of markdown. It lists every gh subcommand with all of the arguments. It's organized like a man page. It is comprehensive. But then I thought, what if I just picked a bad skill? Then it's not a good test. So I decided I would pick a second skill as well, from a list called the Claude Skills Vault. And the two skills look very different. The Vault skill is very short, six times shorter, and it's opinionated: it splits the commands into things that are safe and things that are dangerous, and things that are reads and things that are writes. So it is a playbook, not just a reference. But the rules are the same for all three arms: they get the same agent,
the same base prompt, the same 25 turns in which they're allowed to try to finish the task, and the same very capable Opus 4.6 model. The only thing that changes is what tools I'm telling them to use. So before I get into the results of the evals, for those who aren't familiar with evals, I should explain what they are. Basically, they are tests. I don't know why the eval industry decided we were going to call these things "evals," words from the ML industry, rather than logs and tests, which is what they are. With regular deterministic code, you can write deterministic tests. You can say 2 + 2 = 4, true or false, did I get the right answer? With LLM-powered agents, you can't do that. If your task is "summarize this text," your LLM can give you a million possible answers, and half of them will be right, and you don't know which half, and you can't string match for the result. You, a human, know by looking at the results whether the answer is correct. But humans don't scale, and they don't run in CI. So the way to handle this kind of non-deterministic output is to use a tool that is very good at non-deterministic input, which is to say, use another LLM to judge the output of your agent. This is part of what Arize AI does: it provides a framework that lets you run these evaluations. And evals measure how well your agent is doing something, so that you can make your agent better. That brings me to a very quick plug for my employer, I apologize. Arize gives you a bunch of tools: a way to capture logging data called traces, and a UI to inspect those traces in detail. You can also run evaluations against those traces. You can run the evaluations in development to make sure you're shipping something good, you can run them in CI to make sure that every commit is good, or you can run them against production data so you can make sure things aren't going off the rails while you're not watching. But more directly relevant to this talk, what you can do is take a subset of those traces, turn them into a dataset, and then run experiments where you run the same tests over and over and change one thing, in this case the tooling, and track how your changes change the outcome of the task, in a nifty little UI. To test my Claude agent, I came up with five metrics. The first one is correctness, which is measured by LLM-as-a-judge. Like I said, I know what the answer is for each task, so I wrote that down, and I got the LLM to look at the answer that the agent gave and decide whether or not it was correct. Output quality is also LLM-as-a-judge: is the final response well formatted, not just correct? Latency is not LLM-as-a-judge; it's literally just wall clock time: how long did the agent take to complete the task? Cost is another very important metric for your agent: is it costing me a lot of money to do this thing? And the last one is tool fidelity. For each arm, I defined an allow list of tools that it was allowed to use, and tool fidelity is what percentage of the tool calls it made were tool calls I allowed it to make, because agents, you know, take rules as suggestions rather than rules.
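If you were rolling that correctness judge by hand, rather than through an eval framework like the one Arize provides, the shape of it is roughly this. It is a hedged sketch, not Laurie's actual harness: the model id and grading prompt are placeholders, and a production version would parse the verdict more defensively.

```typescript
// judge.ts (hypothetical): LLM-as-a-judge correctness check for one agent run
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

interface Verdict {
  correct: boolean;
  explanation: string; // the written explanation that makes these evals debuggable
}

export async function judgeCorrectness(
  task: string,
  expectedAnswer: string,
  agentAnswer: string
): Promise<Verdict> {
  const response = await anthropic.messages.create({
    model: "claude-opus-4-6", // placeholder model id
    max_tokens: 500,
    messages: [
      {
        role: "user",
        content:
          `You are grading an agent's answer to a GitHub task.\n` +
          `Task: ${task}\n` +
          `Expected answer: ${expectedAnswer}\n` +
          `Agent's answer: ${agentAnswer}\n` +
          `Reply with only JSON: {"correct": true or false, "explanation": "one or two sentences"}`,
      },
    ],
  });

  // Naive parse; a real harness would validate the JSON and retry on bad output
  const block = response.content[0];
  const text = block.type === "text" ? block.text : "{}";
  return JSON.parse(text) as Verdict;
}
```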
Then I ran five full passes through every task and every arm, a total of 500 tests, each of which took many minutes to run. And I often had to throw away runs, because Claude would have some downtime and my whole experiment would be ruined, since it was taking forever while Claude was down, or I had to reset the repo back to zero because one of my tests went wrong. The whole thing took weeks.
But now let's look at the results, starting with the most important measure: did the agent get the answer right? Well, this is equally good or bad news for everyone. Everybody got roughly the same score when it came to correctness, in the high 80s. I would have preferred the 90s, but I made these tests deliberately tricky. On tiers one and two, which like I said were all reads, correctness was actually 100%. It was when they got to writes that the agents started falling down, and when they got to the tricky analysis, they really started falling down. But the core conclusion you can draw is the same: no matter what kind of tool you use, whether it's MCP or the command line, you will get the answer eventually. Eventually turns out to be the real kicker. If you look at latency and cost, things are wildly different. Latency is way higher for MCP. MCP is taking way more turns to get the task done, more than twice as many, about 12 calls per task versus five on the complex tier-4 tasks. And because it's taking more turns, it's also burning a lot more money, because every time it takes a turn, it has to take the previous conversation and inject it back into the context window, which makes the conversation even longer. It compounds. You burn a lot more money than the latency alone would suggest.
So this looks really, really bad for MCP, right? There are some extenuating circumstances, though. One is that the MCP server from GitHub is less flexible than the API that gh is using. For instance, instead of getting all issues for a milestone in one call, which gh can do, MCP has to make two calls: one for open issues, one for closed issues. But another one is structural. The response the agent gets from the MCP server is a huge pile of very verbose JSON. So what happened was the output overflowed the agent's context window, and it had to call again, and this time it had to write all of the JSON it got to disk and then use bash and grep to read through the JSON. Which is cheating, I think you'll find.
One of the really great things about using an LLM as a judge is that you don't just get binary test results, yes or no, passed or failed. LLM explanations accompany every single result. You get a little LLM buddy who's watching every single test and explains exactly what went wrong or what went right with that test. So I was able to read these explanations and get full visibility into what was going on. And this is amazing, because it means that LLM-powered evals tell you not just what your agent is doing wrong but how to make your agent better. That is one of the really cool things about LLM evals.
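As a rough illustration of that shape, and not Arize's actual API, a correctness judge can return both a verdict and an explanation for every run; `call_llm` here is a hypothetical stand-in for whatever judge-model client you use.

```python
import json

JUDGE_PROMPT = """You are grading an agent's answer against a reference answer.
Task: {task}
Reference answer: {reference}
Agent's answer: {answer}
Reply with JSON only: {{"correct": true or false, "explanation": "..."}}"""

def judge_correctness(task: str, reference: str, answer: str, call_llm) -> dict:
    """Return {"correct": bool, "explanation": str} for one run.

    The explanation field is the part you actually read when a run fails:
    it tells you whether the agent misbehaved or your test harness did.
    """
    raw = call_llm(JUDGE_PROMPT.format(task=task, reference=reference, answer=answer))
    return json.loads(raw)
```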
In the early stages, the explanations were mostly telling me ways that I had messed up my own tests: Claude hung, you got this wrong, your reset didn't work. But eventually they became explanations of how exactly the agent was behaving in every test. One particularly egregious example was test 22, which was: find issues opened and closed in the same milestone and compute the average time from open to close. Vault, on average, got it done in seven tool calls, under a minute, 19 cents. MCP, on its worst run, used 71 tool calls. And the worst part is that only three of those 71 tool calls were actually MCP calls; all the rest were the agent messing around with bash and grep, trying to parse these JSON files it had downloaded. So not only was MCP taking really long, it was also cheating and using the command line anyway, even though I told it that it wasn't allowed to.
I anticipated this possibility; that is why tool fidelity is one of the metrics here. You can see that on all tasks, the tool fidelity of MCP is not very good. It is using bash a lot of the time, and on the really complicated tasks it's even worse. But from a practical viewpoint, does that matter? Do you care if your agent is using bash under the hood when you've told it to use MCP? I'm not really going to count that against it; sometimes you have to use bash. But it's certainly indicative of a problem. And the fundamental issue here, or more accurately, the fundamental issue with the GitHub MCP server, is that it's a fixed API surface. When there's an API call that does what you need, it absolutely, perfectly one-shots it. If there isn't an API call in the MCP server that does what you need, then it has to make multiple calls, or it just can't figure out how to do it at all. The command line doesn't have that problem. Bash is Turing-complete, so you can do basically anything you need to do. And that is how all three arms solved the problem: whether or not I told them to, all three arms decided, I'm going to use bash to solve this problem.
But sometimes it goes the other way. Task 13 was to create a branch and open a pull request. It turns out there are exactly two MCP calls for doing that, so MCP did it in two calls. But it took LobeHub, the skill with all the context, 22 tool calls, nearly a minute and a half, and almost 50 cents. So if you give MCP a thing it's good at, it's very good at it. But despite that exception, our conclusion should be obvious, right? The command line wins. Strike up the band, drop the confetti. Even the agent that was supposed to be using MCP was in fact using the command line the entire time.
Does that seem fair to you? It doesn't seem fair to me, because the truth is a lot more complicated, or this would have been a very short talk. The first problem is that GitHub is really famous, so the gh tool is really famous. It's in lots and lots of blog posts. It's on Stack Overflow. It's on Reddit. It's in all sorts of things. So what happens if I don't use a skill at all? What happens if I don't use an MCP server at all? What if I just give the agent the task and say, you figure it out? So here are those tests again with a fourth set, using an agent with no skills or MCP instructions. And let me tell you, when I'd done 500 tests and then discovered, oh, I have to do one more set of all of these tests, that was a discouraging day. It turns out Claude is really good at using gh all by itself. In fact, it's slightly better.
The no-skill baseline is slightly better than the skills in terms of correctness. I asked the explanations why this is, and it's because the skill files say to be careful when doing something dangerous, whereas if you just give the task to Claude, Claude just one-shots it and, YOLO, does not care whether the tool is dangerous or not. So it does slightly better because it takes slightly fewer turns to do things. But on latency and cost, the story is different. MCP is still looking really bad. Both skills do better on latency than the Claude baseline, because they use fewer turns, because the skill file helps the agent make fewer mistakes; it gives it more information about how to use gh correctly. LobeHub, the really verbose skill, does worse on cost, because it's burning more tokens, because that skill file is so big. But the vault skill, the short and finely tuned one, does a bit better than baseline on cost. So a skill does make a difference, and a good skill makes a difference versus other skills.
And that is a really important point: the cost of MCP is partly because it uses more tokens, and the reason it's using more tokens is because MCP carries all of this context about how to use the tool. It turns out that for GitHub, at least, the model mostly doesn't need that context. It already knows how to use the tool. But that's not going to be true of your CLI. Your company is probably not as famous as GitHub. Your CLI is probably not as well documented as gh. So your agent is going to have a lot more trouble using your particular CLI for your particular service; it might not be able to figure it out at all without the extra context that MCP provides.
The next confounding factor is that MCP is not supposed to be just a thin wrapper around your REST API, although that is what a lot of MCP servers are. MCP is a very rich protocol that has tools and resources and prompts and all sorts of stuff that helps your agent do better at using the tools than a skills file alone would. But more than that, an MCP server can have a whole agent behind it. MCP is an LLM-native, natural-language protocol. So instead of having to make all of these tedious tool calls, my agent could have just said, hey, give me all of the issues that match these criteria and filter them for me by these other criteria, and an agent on GitHub's side could have done all of that work, given back the answer, and one-shotted it. This could have been a collaboration between agents. GitHub's MCP doesn't work that way, but there's no reason it couldn't, and there's no reason your MCP server couldn't work that way. So that changes the story considerably.
And finally, there's an elephant in the room, which is auth. CLIs are inherently developer tools. GitHub is a developer tool too, so that's fine: the average GitHub user is going to have no trouble figuring out how to generate an API key and pass it to the gh tool. But LLM users are the whole world. That's what we've been learning recently: everybody is using agents now. And asking the average user to generate an API key is kind of pushing it. Asking them to open a terminal and put that API key into some kind of config file is way out of the question. They are not going to figure that out. They are just going to say nope.
So your agent is probably not going to be able to use a CLI, because its operator is not going to be able to use a CLI. And even if it could, it probably couldn't figure out how to use the CLI, because CLIs are so different; they're not standardized. Also, the level of access control provided by your average CLI is wildly different from what your enterprise admin is expecting to be able to control. MCP, on the other hand, uses OAuth. In fact, it uses OAuth so extensively that they had to extend OAuth to be able to do all of the things MCP needs to do. And the result is actually a really great experience. You can just paste the URL of an MCP server into your average agent and it will figure everything else out. It will discover stuff. It will pop open a little browser window that asks you for permissions in a nice, friendly, usable way. And then everything just works. This takes MCP from being a developer tool to being a consumer tool. This is something that the normies can use. OAuth also allows enterprise administrators to finally control who gets access to what, and where, and when, in a way that API-key-based authentication is usually not designed to do.
So is MCP the loser here? I don't think so. If you're a developer using a developer tool, the evidence is strongly in favor of skills, especially if you are using a widely known service. That is the narrow conclusion you can draw from this data: for developers, skills are usually the right way to go. But the overwhelming majority of agent users are not going to be developers. Agents are a regular consumer product, used by people who are using Claude and ChatGPT. And MCP is not only the best option for them; it is, practically speaking, going to be the only option for them a lot of the time.
Which brings us right back to the beginning: MCP versus the command line. This turns out to be the wrong question. It is MCP plus the command line. Real agents use both. Claude Code uses both MCP and the command line, depending on what it's trying to do. Cursor uses both. Your agents should probably be capable of using both. Use the CLI when you are working locally, when the tool has decades of training data behind it, when auth is preconfigured, when you benefit from piping things together. This is the automating-my-own-workflow case. Use MCP when the tool is remote, when it's proprietary and not in the training data, when you need OAuth and per-user auth, when there's real state to manage across steps, and especially if you want to hide an agent behind your API. This is the shipping-it use case.
So my headline findings are real. MCP collapses on open-ended analysis: six times the cost on the hardest tasks, five times the latency. Skills help a little bit; a short, opinionated skill helps the most. But for a popular CLI like gh, the training data is doing most of the work. MCP is not the loser of this story, though. Testing with evals did what testing should do: it showed us the limits of our original question. And our job as engineers is not to pick the winner in a Twitter fight, as satisfying as that may be; our job is to use the right tool for the job. So if the model knows your CLI, if you're running as one user with one set of credentials, and your tasks are composable, use the shell. If your tool is remote, proprietary, or needs auth, then use MCP. And in most real systems, you're going to end up using both.
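As a concrete illustration of the earlier point about hiding an agent behind an MCP server: instead of a thin wrapper per REST endpoint, a server can expose one coarse, natural-language tool and do the multi-step work on its own side. A minimal sketch, assuming the Python MCP SDK's FastMCP interface; the tool and the `run_internal_agent` helper are hypothetical.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("github-analyst")

@mcp.tool()
def answer_repo_question(question: str) -> str:
    """Answer an open-ended question about the repository, e.g.
    'for each milestone, report the completion percentage'."""
    # Rather than forcing the calling agent into many small list/get calls,
    # a server-side agent (or plain server-side code) can do the reads,
    # the filtering, and the math, and hand back a single answer.
    return run_internal_agent(question)

def run_internal_agent(question: str) -> str:
    raise NotImplementedError("wire up your own agent loop or API calls here")

if __name__ == "__main__":
    mcp.run()
```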
I hope this has been a useful look into which tools you should actually use and how you should use them, and I hope it's given you some insight into how evals can help you improve the quality and functioning of your agents. All of the code and data from this, so you can double-check my work, is open source on GitHub. And that's all I've got for you today. Thank you so much for your time and attention.
>> All right. All right. Can you believe that? We're down to our last talk. Okay, let's do a little poll. Who here uses Cursor? Yes. Okay. Do you know Cursor has a CLI? Anybody know that? Okay, one person. Okay. So our last speaker for today is from Cursor: David. David is a software engineer, and he's living a dream. He works for Cursor and lives in Madrid, and he has been programming for the longest time, even doing competitive programming back in high school and university. He's also a runner. He'll be running on Friday at the React conference, so if you're staying, you can go for a run with David as well. But before we go to the run and the afterparty and everything, we're going to welcome David onto the stage to talk a little bit more about why he thinks IDEs are dead. Long live IDEs? He'll tell us more about it.
>> Hello everyone. Let's just get things set up, full-screen this thing, and we are ready to go. Okay, welcome everyone. Thank you for joining me here today for the last talk of the conference. I hope you've all had a blast; I know I did. And I'm going to keep it short, especially short, because I know you all want to go explore Miami and get dressed and ready for the afterparty. So, a lot of you might have heard that IDEs are dead, and I think that's partly true. Today we're going to be talking about what it means for IDEs to be dead and what I personally think is the future of IDEs, as well as what we at Cursor believe is the future of IDEs.
A little background on myself: I'm a huge IDE nerd. My very first programming project was an IDE. In fact, I was doing some research yesterday and I actually found it; I hadn't looked at it in a long time. It was called Crockett. It was an IDE for a programming language that doesn't even exist anymore, called Axe. And this is what it looked like: very, very basic. But that was 2011, right? We've come a long, long way since then. Also, I forgot to mention at the beginning, I love running, and on Friday I'm hosting a 5K with the Neon folks, my former employer. We're meeting here in the hotel at 7:30, so if you're around, please join.
Okay, back to regularly scheduled programming. When I started coding, there weren't that many IDE options. Everyone I talked to at the time was either using Emacs or Vim, and it was kind of like Harry Potter: you got picked into a house and that was where you went. For me that was Emacs, for no real reason. I think my decision was basically that Emacs had a UI and Vim was in the terminal, so Vim felt more scary. So I just went with Emacs. I also spent a bunch of time on Geany, if you've heard of it, and then also Eclipse. But I basically used everything. I tried everything, and I spent a bunch of time with all these different editors until VS Code came out. I think it came out in 2014, but I think 2017 is when it got really, really good.
And the reason I moved to VS Code, even though I was a diehard Emacs fan, is that VS Code made it so much easier for me to try new programming languages. The problem with Emacs is that every time I wanted to try a new programming language, I would spend a week just configuring Emacs for that language. With VS Code, that wasn't a problem anymore, and that's what got me sold at the beginning. But then over time VS Code kept getting better. The extensions API really improved, and I became as diehard a VS Code fan as I had been an Emacs diehard. I put a ton of time into my VS Code configuration. I have over 500 lines of VS Code config, and over 500 lines of just keyboard shortcuts as well. I have fully customized my VS Code, and it doesn't even look like VS Code anymore.
And then, after way, way, way too many people told me to try Cursor, I eventually caved, but I really, really resisted it. I was so scared of moving away from VS Code that I never installed Cursor, even though everyone I knew was already using it, until I was sitting next to someone, saw them use Cursor, and thought, I have to try it. And I think in the beginning, just like most people, what really got me to love Cursor was tab complete, right? It was the first Cursor feature that really had everyone fall in love with AI coding. The funny thing is, I was recently looking at my tab usage. I joined Cursor in September, right? And you can see in this chart that in September I used tab 1,400 times, and in December I used it nine times. And I didn't take time off in December. This is my real tab usage over my first four months at the company. And I think if the chart went further into 2026, my tab usage is probably zero right now, not even the five or nine it was in December.
The truth is, there are a lot of things from the old IDEs, from Emacs and VS Code, that we don't need anymore. You might remember that we spent years building and configuring text editors for maximum comfort and maximum productivity. There are options such as whether parentheses should auto-close or not. When you type a quote, or open a block with a bracket or a curly brace, should it auto-close or not? Should the cursor blink, and if so, what should the interval in milliseconds be? Should your scroll be sticky, and if yes, by how many lines? And how long should lines be: 80 characters, 200 characters? Most of these things don't really matter anymore, and they are actually detrimental to the performance of IDEs, because when you have settings like this, where you can even configure which quotes and brackets auto-close and which ones don't, you introduce basically factorial-level complexity into your product, and it makes the product harder to test and harder to keep stable. If you look at VS Code, for example, it has a million settings. That's why I loved it so much, right? That's why I became such a diehard VS Code fan: it let me configure everything. A good example is the minimap. The minimap in VS Code is this little minified, zoomed-out version of your files. And just for the minimap, VS Code has more than 10 different settings you can configure. That's a lot.
And it means that the developers who work on VS Code have a really large and broad surface area. That makes it really hard for us at Cursor, and for the people at Microsoft working on VS Code, to innovate, because you just spend so much time fixing bugs and testing all these different configurations. We simply don't need all of it. Agents are now writing 99% of our code, which means we have to rethink IDEs. But if we are going to rethink, we have to start from first principles and ask: what do we still need? What kinds of things are necessary for the current age of AI programming?
One idea is, what if we just go completely bare-bones? We remove all the distractions and we just have a chat. If you've ever used the Cursor CLI, that's what it is. That's what the Claude CLI is as well. That's what Codex is. These are the most raw, simple interfaces you can have for AI coding, and they're nice. I like the Cursor CLI. I actually use it every single day; I used it a lot to build this presentation, actually. I just tweeted last week that I use it every day. But ultimately I'm only using it for maybe 5% of my agent work. I need more than what the CLI can give me, and the kinds of things I need cannot really be rendered in a CLI. I need video playback, because most of my work nowadays is watching videos produced by agents to verify and assert that the work they did is correct. So I need video playback. I need image rendering. I need to be able to annotate images and annotate videos. I need rich interfaces such as Mermaid diagrams, knowledge graphs, and charts, which I use for testing things as well as for data science and other data analysis work I have to do. I need the browser, and more importantly than just the browser, I need the Chrome DevTools, and I need the Chrome DevTools debugger. I need all these rich interfaces that just cannot be dumbed down into a basic chat. Mobile developers need emulators, both iOS and Android. We need native system notifications that tell us when our agents are done. And then, finally, last but not least, we have to edit code. I'm not saying agents write 100% of my code; they write maybe 97, 98%. And for those 2% of my lines of code that I'm still writing myself, I need a basic code editor that functions well enough, that is decent enough, that has go-to-definition, Language Server Protocol support, and other useful features. So I do need a decent editor. All of these things combined is, I think, what we need for the IDE of 2026.
Another thing, and this is my own personal data from Cursor: more and more, I'm not running my agents locally. I joined Cursor in September, and this chart begins in November last year and shows how many agents I'm spinning up per day, and you can see a clear trend that more and more of my agents are cloud agents. So less and less is my computer the bottleneck for performance, and more and more I'm letting agents go and run for longer remotely. So what do I do in my IDE? Well, I watch videos, right? I am asserting that the agents do what I'm asking them to do. I work on the Cursor product interface, so I do a lot of front end, and I spend a lot of time just watching these videos. I use the browser to verify my work, test things, play with what I'm building. And then I need rich interfaces, right? I need charts.
And I actually use the Cursor canvas feature quite a bit to do data science. If you haven't used it, it's basically a feature where the agent can generate, on the fly, single-purpose UIs that can go away once you don't need them anymore.
Now, you might remember Cursor 2.0. That was actually the first project I worked on when I joined Cursor. Literally in my first week, my former manager said, hey, welcome to Cursor, we're building a new UI for agents. I was quite excited by the possibility of rethinking the interface for coding, and at the same time a bit skeptical that I would be so hands-off with my code. But lo and behold, after a few months I was barely typing any code at all. It looked like this; it was pretty basic. We launched in late October, and we didn't get as much customer love as we wanted with that iteration. It had some nice things: it had worktrees, it had parallel agents, and that was also when we introduced our first model as a company. But it didn't get as much love as we wanted, even though more recently we added this super complex layout where you can tile and manipulate terminals and files and chats, and you can have these grids of 16x6 agents if you want to. But even with that, we still didn't feel the interface was malleable enough for what we wanted to give our users.
So we started from scratch. If you've heard about Cursor 3.0, which we launched two weeks ago: we basically took a from-the-ground-up approach to the IDE interface. The main reason we started from scratch was to break away from the VS Code baggage, right? The VS Code layout system, the VS Code architecture. We needed a blank slate, a white canvas, where we could innovate from first principles and build the IDE interface that we really believe is right for most people. The other reason is that by starting from scratch with a new UI, a new window, our users can adopt it gradually. You can have both the traditional VS Code interface open and the new agent window open at the same time. You can hop between the two, or you can just use the new one, or just the old one. With this model, we're not annoying anyone by forcing them into a new experience. Everyone who loves the VS Code interface can stay there, and the people who want the new interface can use it as well. That gives us a lot more flexibility as well as a lot more room for innovation.
The first thing I'll say is that it's missing your feedback right now. The only way we can keep iterating on the IDE interface and making it better is by listening to our users. That's how we build our product: we continuously listen to user feedback, we iterate, we launch new versions, and we do this in as tight a loop as possible. So the main thing that's missing is really more people using it. The more people we have using it and giving us feedback, the better we can make it. The other things we're really excited about are better handoff: for example, handing off agents that start on your phone, go into the cloud, get pulled onto your computer, maybe move into a worktree for a little bit, then maybe back to the cloud, and maybe you finish them from your phone. That kind of handoff needs to be seamless and very powerful.
We also need more orchestration tools, because the higher-level we go with our AI coding techniques, the less time we'll necessarily spend reading all of the code. So we need kanban-style, DAG-style UIs for orchestration, where you can develop plans and split them off into multiple agents. Better mobile support is something we desperately need. I know a lot of people use Cursor today for mobile dev, and they have to have Xcode open at the same time and keep hopping between Cursor and Xcode. Nobody should have to do that; they should be able to stay in Cursor the whole time. Computer use: we have great computer use in our cloud agents, but we don't have it locally yet. That's something we're working on. The idea is that the agent can interact directly with your product. We allow this in the Cursor browser, but not for desktop apps. And then we need much better code review. That is, for me, one of our P0s at the company right now: we're not writing any code, but we are reading a lot of code. I know most of us are probably still reading 90-plus percent of the code. The UI for code review needs to be rethought as well. And there are many more things we're also cooking in the background.
So I'm going to say it again, really importantly: please send us your feedback. My DMs are open. You know where to find us; we're a very open company. And please try the new interface. I was actually sitting next to someone yesterday who had never tried it, and I just asked them, please click the button, I just want to see you use it. If you have feedback, come ping me. And yeah, thank you for taking the time to listen to me today.
>> Right. All right. Can you believe that this is coming to a close and you've stayed till the very end? So, wow. Okay, we're going to do some closing remarks. And for those of you who are live streaming, stay tuned, because we're going to bring up some really cool people for the closing remarks. Okay, so just a couple of shoutouts. Which talk really inspired you? Give me some names, some topics. Memory? Which one? Oh, Alvin Payne. Yes, yeah, I heard that one. Anybody else? Code mode. >> Yes. >> You're not the speaker, are you? We can't really see you, sorry. So if you are the speaker, you can also shout out your talk. >> Which one? >> Codex Spark. Oh, yeah. Yeah. >> Yeah, keep messaging Gabe for sure. >> Versus CLA. Yeah. >> Yeah. Anybody excited to try something new after the conference? Yes. Okay, a couple of people. Okay, that's the energy that we're looking for. We really hope that you learned something new, made some good connections, and are really excited about building a better future with AI. That's what we want from this conference for you.
>> Yeah, and it was truly an honor and pleasure to be MCPs for this year's AI Engineer Miami. >> The MCP? Iman, are you becoming an MCP? >> What is that? What did I say? Okay, I'm tired. You see, I'm performing at 20% energy now. But it went so fast, and it was all thanks to you. You're awesome, and you made it easy for us. I would also like to thank everyone on the live stream. The videos are going to stay on YouTube, and that's the whole idea behind this: to enable many people to leverage AI in their lives and in their work.
Also, I want to leave you with a thought: it's true that AI is really making us much more powerful, and the same scope can maybe be done with fewer people, but who's to say we have to keep the scope limited? Let's tackle bigger problems. Let's take on much bigger problems that we perceived to be impossible before and go for that. And if a day comes when there are no more problems, we will all go to South Beach and play the guitar, and nobody has to work, right? So let's try to solve much bigger problems. And yeah, I feel we have this community here. It was really lovely to be a part of it, and I hope to see you all next year. Who's coming next year? Who's planning on it? >> Yes, that's the energy that we want. >> There you go. Back to you.
>> Yeah. Well, the party doesn't stop here. First of all, there is an afterparty at Thor Social at 7 p.m. So if you want to continue the networking, feel free to come join us. Say hi to your favorite MCs, not MCPs. >> MCP servers. MCs. >> And there are also more activities going on this week. How many of you are going to the React Miami conference? Yes. So the party continues. Keep making connections, keep an open heart, apply the things that you have learned, make new connections, and really build amazing things with AI, okay? And this conference wouldn't be possible without our organizers. So now I'm going to invite our favorite organizers onto the stage: Michelle, Gabe, and Rebecca.
>> Wow, this is awesome. Pinch me. Did this just happen? It did. It did. Thank you so much for coming. First, a shout-out to all of you and the vibes that you brought. They were all beautiful and positive, and it's only going to get better at around 6:30, with a slight amount of inebriation and tacos. I want to thank, first, Swix and Ben and Leah at AI Engineer; please give them a round of applause. This would not be possible without them. Swix, early on, in the early JavaScript days, you know, six, eight, nine years ago, wanted to care for the community and provide you high-signal content in the, at the time, JavaScript space. So, Swix, wherever you are, thank you for your heart and for allowing us to be a part of AI Engineer Miami. This also would not be possible without these amazing sisters up here, Michelle and Becca. You crushed it. Any sense that this is actually for the community, and not just a, you know, cash grab, comes from these two right here. They have led this process in Miami for the last five years, right out of COVID, when nobody thought this was possible. Just thank you for what you do.
>> Thanks, Gabe. Yeah, you guys are our heroes. Thank you so much for showing up for year one. A round of applause for yourselves, actually, very quickly. Please. Thank you. We appreciate you so much. And Gabe, I'll pass it back to you, because it wouldn't have been possible without G2I and you knowing what this was and the impact it could become. You trusted us with React Miami, and now we're doing AI Engineer Miami, and this is incredible. So thank you, Gabe. Thank you, Gabe. We appreciate you. All right. And we're going to do it again next year. But before next year, coming up is AIE Singapore in four weeks. Is anybody in this room going to Singapore? I mean, I want to go. >> Yeah. Okay. The live stream.
So, four weeks. You still have time to get your tickets. It's going to be amazing; we're really looking forward to it. And without further ado, do you have anything else you wanted to mention? Let's party. All right, wrap it up, and see you at the afterparty.