Spec-Driven Development: Agentic Coding at FAANG Scale and Quality — Al Harris, Amazon Kiro
Channel: aiDotEngineer
Published at: 2026-01-09
YouTube video id: HY_JyxAZsiE
Source: https://www.youtube.com/watch?v=HY_JyxAZsiE
For those of you who haven't heard of us, Kiro is an agentic IDE. We launched generally available this most recent Monday, I think the 17th, but we launched public preview in July, I think July 14th. So we've been out there for a few months getting customer feedback, all that good stuff. We're going to talk a little bit about using spec-driven development to sharpen your AI toolbox. I did a show of hands: about a quarter of the people here are familiar with spec-driven dev. My name is Al Harris, principal engineer at Amazon. I've been working on Kiro, and we're a very small team; we were basically three or four people sitting in a closet doing what we thought we could do to improve the software development lifecycle for customers. We were charged with building a development tool that improved the experience of spec-driven development. We were theoretically funded out of the org that supported things like Q Developer, but we were purposefully a very different product suite from the Q ecosystem, to take a different take on these things. We wanted to work on scaling, helping you scale AI dev to more complex problems; improve the amount of control you have over AI agents; and improve the code quality and reliability of what you get out the other end of the pipe. Now we're back to new content. So our solution was spec-driven development. We took a look at some existing stuff out there and said, "Hey, vibe coding is great, but vibe coding relies a lot on me as the operator getting things right. That is me giving guardrails to the system, and me putting the agent through a kind of strict workflow."
We wanted spec-driven dev to represent the holistic SDLC, because we've got 25 or 30 years of industry experience building software, building it well, and building it with different practices. We've gone through waterfall and XP; we have all these different ways that we represent what a system should do, and we want to respect what came before. So, this animation looked a lot better originally; it was initially just the left diamond. But the idea was: you're basically iterating on an idea. I think half of software development is requirements discovery. And that discovery doesn't just happen by sitting there and thinking about what the system should do and what the system can do. We realized, working on this, that the best way to make these systems work is to actually synthesize the output and feed it back really quickly. Take your input requirements, actually do the design, and feed back what you learn: you realize, "oh, actually, if we do this there's a side effect here we didn't consider," and you feed that back into the input requirements. And so this compression of the SDLC evolved to bring structure into the software development flow. We wanted to take the artifacts that you generate as part of a design. That's the requirements that maybe a product manager or developer writes, which become the acceptance criteria: what does success look like at the end of this? Then there are the design artifacts that you might review with your dev team, or with stakeholders, and say "this is what we're going to go build," and then implement the thing. And we want to make sure you can do all of this in a tight inner loop. Ultimately, that was what spec-driven development was initially. What spec-driven development in Kiro is today, or at least was before it went GA, is: you give us a prompt, and we will take that and turn it into a set of clear requirements with acceptance criteria.
We represent these acceptance criteria in the EARS format. EARS stands for Easy Approach to Requirements Syntax. It's effectively a structured natural-language representation of what you want the system to do. Now, for the first four and a half months this product existed, the EARS format looked like just a kind of interesting decision we made. But with our general availability launch on Monday, we have finally started to roll out some of its side effects, one of which is property-based testing. So now your EARS requirements can be translated directly into properties of the system, which are effectively invariants that you want to deliver. For those of you who have, or haven't, done property-based testing in the past, using something like Hypothesis in Python or fast-check in Node (Clojure's spec library is another example): these are approaches to testing your software system where you're effectively trying to produce a single test case that falsifies the invariant you want to prove. If you can find any counterexample, then you can say this requirement is not met. If you cannot, you can say with a high degree of confidence (where the word "high" is doing a little bit of heavy lifting, because it depends on how well you write your tests) that the system does exactly what you're saying it does. We'll get a little more into property-based testing and PBTs later, but this is the first step of many we're taking to take these structured natural-language requirements and tie them, with a through-line, all the way to the finished code, and say: if the properties of the code meet the initial requirements, we have a high degree of confidence that you have reliably shipped the software you expected to ship.
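As a toy illustration of the idea, a hand-rolled property check (no Hypothesis or fast-check, and a made-up requirement; this is not Kiro's implementation) might look like this:

```python
import json
import random
import string

# Hypothetical EARS-style requirement (illustrative only):
#   "WHEN the system persists a session payload,
#    THE SYSTEM SHALL return an identical payload on retrieval."
# As a property: for ALL payloads p, load(save(p)) == p.

def save(payload: dict) -> str:
    return json.dumps(payload)

def load(blob: str) -> dict:
    return json.loads(blob)

def random_payload(rng: random.Random) -> dict:
    key = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {key: rng.randint(-1000, 1000)}

def check_round_trip_property(trials: int = 200, seed: int = 0) -> bool:
    """Try many random inputs, looking for a single falsifying case."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_payload(rng)
        if load(save(p)) != p:
            return False  # counterexample found: requirement not met
    return True  # no counterexample: high confidence, not proof

print(check_round_trip_property())  # True
```

The asymmetry is the point: one counterexample falsifies the requirement outright, while passing only builds confidence proportional to how well the input generator covers the space.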
So with spec-driven dev, we take your prompt, we turn it into requirements, we pull a design out of that, we define properties of the system, and then we build a task list, and you can run your task list. Effectively, the spec then becomes the natural-language representation of your system. It has constraints, it has concerns around functional and non-functional requirements, and it's this set of artifacts that you're delivering. I don't think I have the slide in this deck, but ultimately the way I look at a spec is that it is, one, a set of artifacts that represent the state of your system at a point in time t. It is, two, a structured workflow that we push you through to reliably deliver high-quality software: the requirements, design, and execution phases. And then, three, it is a set of tools and systems on top of that that help us deliver reproducible results. One example of that is property-based testing. Another example, a little less obvious, but we can talk about it later, is requirements verification: we scan your requirements for ambiguity, we scan your requirements for invalid constraints (e.g., you have conflicting requirements), and we help you resolve those ambiguities using classic automated-reasoning techniques. I could talk more about the features of Kiro, but I think that's less interesting for this talk, because we want to talk about spec-driven dev. We have all the stuff you would expect, though: we have steering, which is sort of memory plus something like Cursor rules; we have MCP integration; we have image support, yada yada; and we have agent hooks. So let's talk a little bit about sharpening your toolchain. And I'm going to take a break really quick here.
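Before moving on, the conflicting-requirements check mentioned above can be sketched conceptually in a few lines. The encoding here (trigger, action, allowed flag) is entirely made up for illustration; Kiro's actual verification uses automated-reasoning tooling, not this:

```python
from itertools import combinations

# Hypothetical structured requirements: (trigger, action, allowed).
# Two requirements conflict if the same trigger both mandates and
# forbids the same action.
requirements = [
    ("session_expired", "delete_history", True),
    ("session_expired", "delete_history", False),  # conflicts with the first
    ("user_logs_in", "create_session", True),
]

def find_conflicts(reqs):
    """Return every pair of requirements that contradict each other."""
    conflicts = []
    for a, b in combinations(reqs, 2):
        if a[0] == b[0] and a[1] == b[1] and a[2] != b[2]:
            conflicts.append((a, b))
    return conflicts

print(len(find_conflicts(requirements)))  # 1
```

The value of structured requirements is exactly this: once they are machine-readable, contradictions become a mechanical check rather than something a reviewer has to notice.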
Just pausing for a moment for folks in the room who had maybe tried downloading Kiro, or something else: are there any questions right now before we dive into how to actually use a spec to achieve a goal? No questions. Could be a good sign; could mean I'm not talking about anything particularly interesting. So, I actually want to talk in some concrete detail here. This is a talk I gave a few months ago on how to use MCPs in Kiro. One of the challenges people who had tested Kiro had was that they felt the flow we were pushing them through was a little too structured: you don't have access to external data, you don't have access to all these other things you want. And so, on our journey here toward sharpening your toolkit (oh, you know what, this is out of order; here's my nice AI-generated image): you can use MCP. Everybody here, I assume, is familiar with MCP at this point. Kiro integrates MCP the same way all the other tools do. But what I think people don't do enough is use their MCPs when they're building their specs. You can use your MCP servers in any phase of the spec-driven development workflow: requirements generation, design, and implementation. We'll go through an example of each. First of all, setting this up in Kiro is fairly straightforward. We have the Kiro panel here (there's a little ghosty), and then you can go down to your MCP servers and click the plus button. My favorite way to do it, though, is to just ask Kiro to add an MCP and give it some information on where it is; it can usually figure it out from there, or you just give it the JSON blob and it'll figure it out. Once you have your MCP added, you'll see it in the control panel down here, and you can enable it, disable it, allowlist tools, disable tools, etc.
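For reference, the JSON blob for an MCP server typically looks something like the snippet below. The server name, package, and exact field names here are illustrative (this schema is common across MCP clients, but check the current Kiro docs for the exact shape it expects):

```json
{
  "mcpServers": {
    "aws-docs": {
      "command": "uvx",
      "args": ["awslabs.aws-documentation-mcp-server@latest"],
      "disabled": false,
      "autoApprove": []
    }
  }
}
```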
So you can manage context that way. Worth noting: changing MCP servers, and changing tools in general, is a cache-invalidating operation. So if you're very deep into a long session, maybe don't tweak your MCP config, because it will slow you down dramatically. But let's talk about MCPs in spec generation. The Kiro team uses Asana, for reasons I don't know, but it's our task tracker of choice. So one thing I want to do is say: I don't want to write the requirements for a spec from scratch. My product team has already done some thinking; we've iterated in Asana to break a project down. This is not always how things work, but it sometimes is. So in this case, I have a task in Asana. (Oh no, I did the wrong thing; that's what I get for zooming.) I have this task in Asana that says add the view, model, and controller to this API. This was a demo app that I configured in a few minutes, and we even had (it's kind of peeking under here) some details about what we wanted to have happen. Now I can go into Kiro and just say "start executing task XYZ," with the URL from Asana, and Kiro is going to recognize this is an Asana URL. I had the Asana MCP installed; it goes and pulls down all the metadata there, and from there it starts determining what to work on. (It's funny, these titles are backwards.) Basically: create a spec for my open Asana tasks. Again, go pull all the tasks from Asana and then, for each one, generate requirements based on those tasks. I think I had six tasks assigned to me: one is do user management, one is do some sort of property management, da da da. It pulled them in, generated the requirements, and then (in this case the title is wrong, apologies) start executing the task: I want to go and do the code synthesis for this. And I will take a quick break here to talk about how you can do this in practice.
So, for those of you following along in the room, feel free to fire up your Kiro, open a project, and pick an MCP server. I'll share a few repos here really quick that you can play around with. I have an MCP server implemented, and I have this Lofty Views app, which I think is the one from the Asana example. These should all be public; let me just double-check. Yeah, okay. So for example, I have a Nobel Prize MCP, which curls (perhaps unsurprisingly, there is one) the Nobel Prize API. You can use uvx to install it, or you can git clone it from my GitHub, the Nobel MCP repo. This is just one example. Another one, if you want to play around with the sample that's in the video, is my Lofty Views repo. I'll leave these both up on the screen for a few moments for folks who want to copy the URLs. But while that is happening (oh no, let's put you on the same window), what I'll demo quickly is using an MCP to make spec generation much easier and more reliable. So here I have, let's see, a lot of MCPs; which ones do I actually want to use? Let's use the GitHub MCP. Oh, no. Ignore me. That's better. Okay, well, I have the fetch MCP. So in this case I could, for example, come in here and say: hey, I've generated a bunch of tasks for the Lofty Views app (this is basically a very simple CRUD web app), but I want Kiro to use the fetch MCP to pull examples from similar products that exist on the internet. You could also use the Brave Search or Tavily search MCP servers, but in this case I'll just use fetch because I've got it enabled. So let's say... oh, actually, we can run the web server and use fetch; that's a good example. This is one example of how, at any point in the workflow of generating a spec, you can use your MCP servers to get things working. No, this is what I get for not using a project in a while. We'll cancel that.
We can actually do something a little more interesting, which is a separate project I've been working on: an AgentCore agent. I know the project works, which is the reason I'll fire it up here. Should I call it? Well, maybe we'll do live demos at the end. So that's the most basic thing you can do with Kiro: just use MCP servers. But any tool uses MCP servers, and I actually don't think that's particularly interesting. So let's say, in this process of trying to sharpen our spec-dev toolkit, we've finished up with the 200 grit. We've added some capabilities with MCP. It's useful, but it's not going to be a game-changer for us. I want to come in here and get up to the 400 grit; let's start to get a really good polish on this thing. I want to customize the artifacts produced. Because you've got this task list, you've got this requirements list, and "I don't agree with what you put in there, Al." You could say that, a lot of people do, and that's a great starting point. So, here's something I heard earlier in the conference: people like to use wireframe mocks. Because your specs are natural language, you're using specs as a control surface to explain what you want the system to do; therefore I want to be able to actually put UI mocks in there. The trivial case is that I just come in here (Kiro has asked me, "does the design look good? Are you happy?") and I say: this looks great, but could you include wireframe diagrams in ASCII for the screens we're going to build here? This is again from that Lofty Views thing. I'm adding a user management UI, but I want to actually see what we're proposing to build, not just the architecture of the thing.
So Kiro is going to sit here and churn for a few seconds, but you can add whatever you want to any of these artifacts, because they're natural language. They're structured, which means we want some reproducibility in what they look like, but ultimately what they look like doesn't matter, because we've got the anything machine here, the agent, sitting there to help translate them into what they need to be. So Kiro's churning away here, it's thinking, and then it's going to spit out these text-wrapped ASCII diagrams. (I'll fix the wrapping here in a second in the video.) Ultimately, it does whatever you want. If you want additional data in your requirements, you can do that. If you want additional data in the design, like this, you can easily add that. Here we've got these wireframes in ASCII that help me rationalize what we're actually about to ship. And then I can continue to chat and say: actually, in the design, maybe I don't want this "add user" button to be up at the top the entire time, in which case I could chat with it to make that change easily, and now we're on the same page up front instead of later, during implementation time. So we've again left-shifted some of the concerns. That's one example: I want to add UI mocks to the design of a system. (This is just a quick snapshot of the end state, where my design now does have these UI mocks.) Another example, which I actually like a little bit more, is including test cases in the definition of tasks. Today, the tasks Kiro gives you will be kind of the bullet points of the requirements and the acceptance criteria you need to hit. But I want to know that, at the end state of this task being executed, we have a really crisp understanding that it is correct.
It's not just "done," because anybody who's used an agent can probably testify that LLMs are very good at saying: I'm done. I'm happy. I'm sure you're happy. I'm just going to mark it complete. Oh yeah, the tests don't pass, but they're annoying; I tried three times to get them to work; I'm just going to move on. No, I don't want that. I want to actually know that things are working. So in this case, I've asked Kiro to include explicit unit test cases that are going to be covered. My task here, for example, creating this AgentCore memory checkpointer, is going to have all the test cases that need to pass before it's complete, and then I can use things like agent hooks to ensure those are correct. (We'll run this sample a little later in the talk; this is the one I'm ready to demo.) So this is another example where you're working on your toolbench: you have all these capabilities and primitives at your control, and you can tweak the process to work for you, not just the process that I think is the best one. And then, last but not least, the 800 grit. At this point, we're getting a final polish on the tool; we might be stropping next. You can iterate on your artifacts, but you can also iterate on the actual process that runs. So one thing you might do (and I do this a lot) is: I'll be chatting with Kiro, and I say, "Hey, I want to add memory to my agent in AgentCore. Let's dump conversations to an S3 file at the end of every execution." Kiro is going to say: that's great, I know how to do that, I'm going to research exactly how to do that thing, I will achieve this goal for you. But ultimately, what I've done is introduce a bias up front: I'm steering the whole agent toward using S3 as the storage solution just because maybe I'm familiar with it, but it's probably not the best way to go about it.
So then, after it had synthesized the design and all the tasks and all this stuff, I came back and said: well, we don't need to stick to this rigid spec-driven dev workflow that has been defined by Kiro. I can ask for alternatives: is this the idiomatic way to achieve session persistence? I don't know; maybe there's a better way. Maybe, if we're talking AWS services, it's not S3, it's DynamoDB, or yada yada. Kiro's going to come in here and say, you know, good question, let me research. It's going to go call a bunch of MCP tools that I've given it access to (this ties back to "you should be using MCP"), and then it comes back with a recommendation for a feature I didn't know existed, which is AgentCore memory. It says it's "more idiomatic and future-proof" (that's maybe TBD and should be checked a little closer), or you could use S3, which is the thing I suggested. Now, actually, I bet there are far more than two options here. So you could probably keep asking the agent, "are there other options," yada yada, and it would continue to investigate. But you should not lock yourself into the rigid flow that is the starting point here. Yeah, so that's actually it for my deck. What I will do now is just run through that sample I had up there. So basically, let me delete it, and I'll do a live demo of specs in Kiro and how we can fine-tune things a little bit. This project is a Node.js app; it's a CDK project. Again, I'm not trying to sell more AWS; this is just the technology I'm familiar with, so I can move a lot more quickly. So, I wanted to learn a little bit about AgentCore, which is a new AWS offering, and as somebody building an agent, I should probably be familiar with it. And I'm not familiar enough with it. We've got some other people here who know a lot about it.
So, put my hand up a little bit, and, you know, you caught me. So I set up a CDK stack, which is infrastructure-as-code technology to deploy software; I'm familiar with it and I love it. I have a stack here that lets me deploy whatever an AgentCore runtime is. I don't know; I asked Kiro to do it. We vibe-coded this part. We vibe-coded the general structure: we got an agent, we got IAM set up. I then vibe-code-added commitlint, and Husky, and a few things like this that I like for my own TypeScript projects (Prettier and ESLint, I think). So we have a basic project here that I know I can deploy to my personal AWS account. Now I'm going to come in here and... oh, and then importantly, this is super important, because I don't know how the hell AgentCore works. I could go read the docs, but the docs are long and complicated, and I'm really just trying to build out a POC to learn about it myself. So I added two MCP servers. Oh no, maybe I didn't. Let me check. Okay, yes, sorry, buried down here at the bottom: this is my Kiro MCP config. I added one important MCP server here, which is the AWS documentation one. There are other ways to get documentation (you can use things like Context7), but in this case this is vended by AWS, so I have some confidence that it might be correct. I used this to help the agent have knowledge about what technologies exist, and I think I used fetch quite a bit as well. So those are the two MCP servers I provided the system. That's great; move on; confirm. And I'll just rerun this from scratch. What I had done yesterday evening, or maybe the evening before, was: I sat down, I had this system basically working, and now I want to start doing spec-driven development. So I want to add this session ID concept, and then I want to write the conversation to an S3 file, blah blah blah.
This is the whole bias thing I showed you earlier. We're going to fire that off through Kiro. It's going to start chugging away, and then it's going to see if the spec exists. Okay, the folder does exist; it's probably going to realize there are no files there and start working away. From here I'll sort of live-demo. It's going to read through requirements, read through existing docs, read through existing files, and gather the context it needs. In a moment, once it generates the initial requirements and design, I am going to challenge it to use its own MCP servers: I want you to go and do some research on the best way to do this and provide me some proposals. (This is why I was hoping to get the clip-on mic working, because I've got to set this down for a moment.) Okay. So, you know: I don't know if this is the best way to do this. Go read docs, go use fetch. It's going to keep churning away here and then come back to me once it's got a few ideas to propose. This is an example of me just using additional capabilities: use fetch, use the docs MCP, use whatever you can to get the best information, and don't take at face value the things that I said. These are usually things we have to prompt pretty hard to get the agent to do, but if you're doing it in real time, it works fairly well. Again, all of these agents are going to be very easy to please. So just because I said something in the stupid docs, it may or may not actually be the most important thing from the agent's perspective down the road. Okay, so it's done a little bit of research.
It understands that LangGraph, which is the agent framework we're using, already has this notion of persistence. And actually, in this case it did not use the MCP for the AgentCore docs, so it didn't find that AgentCore has its own notion of persistence. So let's assume I still don't know that exists, because I didn't dry-run this a few days ago; we might have to find that later, in the design phase. First thing it's going to do is iterate over all my requirements here. It's changed the requirements based on what it now knows about LangGraph and how it can natively integrate with checkpointing, but it's still really crisply bound to this S3 decision that I made implicitly in the ask. So that is just something to be aware of: anything you put in the prompt is effectively grounding the agent, for better or for worse. I see it's still iterating. So, yeah, it comes through and says, does this look good? I'm going to say: looks great, let's go to the design phase. So now Kiro is going to take my requirements and take me into the design phase of this project. (I can make this bigger so things are easier to see.) Here's an example of what I meant by these EARS requirements. The user story here is: as a dev, I want to implement a custom S3-based checkpointer so the agent can use LangGraph's native persistence mechanism with S3. Great, that sounds reasonable to me as a person co-authoring these requirements. And this here, this when/then/shall syntax, is the EARS format. The structured natural language is really important for letting us pass this through non-LLM-based models and give you more deterministic results when we parse out your requirements, because ultimately our goal is to use the LLM, not as little as possible, but less and less over time.
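To make the determinism point concrete, here's a toy sketch: a plain regex, with no LLM involved, can extract the trigger and the obligation from one simple EARS template. This is illustrative only (EARS has several templates, and Kiro's actual parsing is not shown here):

```python
import re

# One simple EARS template: "WHEN <trigger>, THE SYSTEM SHALL <response>".
EARS_WHEN = re.compile(
    r"^WHEN (?P<trigger>.+?), THE SYSTEM SHALL (?P<response>.+)$",
    re.IGNORECASE,
)

req = ("WHEN a request arrives with a known session ID, "
       "THE SYSTEM SHALL restore that session's conversation history")

m = EARS_WHEN.match(req)
print(m.group("trigger"))   # a request arrives with a known session ID
print(m.group("response"))  # restore that session's conversation history
```

Because the sentence shape is fixed, the same requirement parses the same way every time, which is what makes downstream tooling (like deriving properties from requirements) reproducible.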
We want to use classic automated-reasoning techniques to give you high-quality results, not just whatever the latest model is going to tell you. So it's gone through and spits out a design doc. Let's actually look at this in markdown. Sure, you've got a server, da da, a checkpointer that goes to S3; that makes sense; pseudocode. (In a real scenario, maybe I read this a little more closely.) And here's the new thing we shipped on the 17th: now Kiro is going to go through and formalize the requirements into correctness properties. Right now, the system is taking a look at those requirements we agreed upon earlier (these look good, I agree with them, yada yada), taking a look at the design, and extracting correctness properties about the system that we want to run property-based testing against down the road. This is something that may or may not matter for you in the prototyping phase, but it should matter significantly when you're going to production, because if these properties are correct and these properties are all met, the system aligns one-to-one with the input requirements you provided. So, while this is chugging away: any questions yet? Any folks curious about this? >> Yeah, we're here and then there. What would you say is the main difference between this and planning mode in other tools? >> I haven't used that planning mode in a couple of weeks; things move so fast, it's a little wild. But I think ultimately what we would say is that Kiro's spec-driven dev is not just LLM-driven; it is actually driven by a structured system. With planning mode, I'm not sure if there's actually a workflow behind it that takes you through things, but this is our take on it, for sure. I'm not familiar enough to give a more concrete example, unfortunately.
>> It's similar, but it doesn't give you a document like this. I think this document is cool. What that tool does is basically create you a plan. >> Just an execution plan, okay. Oh, I see. So I think the fundamental difference there... does that plan get committed anywhere, or is it just ephemeral? >> It's kind of ephemeral. >> Okay. So what I want over time is not just how we make the changes we care about; it is actually the documentation and specification of what the system does. The long-term goal I have is that, with Kiro, we're able to do a bidirectional sync: as you continue to work with Kiro, you're not just accruing these task lists (and I'm just going to say "go for it" to go to the tasks), but actually, if I come back and change the requirements down the road, we will mutate the previous spec. So today I'm looking at just a diff of requirements, and as you go through a greenfield process, you're going to produce a lot of green in your PRs, which is maybe not the best, because I'm just reviewing three huge new markdown files. But on subsequent times that I open that doc up, I want to be seeing: oh, you've actually relaxed this previous requirement, you've added a requirement that has this implication on the design doc. That is the process the Kiro team internally uses to talk about changes to Kiro. Our design docs have in general been replaced by spec reviews. Somebody will take a spec from markdown, they'll blast it into our wiki using an MCP tool we use internally, and then we'll review that thing and comment on it in a design session, as opposed to "I wrote this markdown file or a wiki page from scratch." So it becomes... well, it's actually not like an ADR, because it's not point-in-time.
It is like living documentation about the system. But yeah, thanks for the question. There's one over here. >> This may be more a spec-driven development question, but is there a template for the set of files that you fill out? Like right now you're in design.md. Is design.md the spec, and it's a single doc, or are there others? >> Oh, great question. So the question (and correct me if I'm wrong here) is: are there a set of templates that are used by the system, and, maybe the question you're driving at, can you change the templates? There are, implicitly, in our system prompts for how we take care of your specs. You'll see here in the top navbar that right now we're really rigid about this requirements, design, task-list phasing, but we know that doesn't work for everybody. For example, we get this feedback from a lot of internal Amazonians, actually: I have an idea for a technical design and I don't necessarily know what the requirements are yet; maybe "design" is even the wrong word; I want to start with a technical note. This comes up a lot for refactoring, actually. So: I want to refactor this to no longer have a dependency on... here's a good example. Here we use a ton of mutexes around the system to make sure that we're locking appropriately when the agent is taking certain actions, because we don't want different agents to step on each other's toes. But maybe I want to challenge the requirements of the system so I can remove one of these mutexes, or semaphores, I should say.
So I might start with something like a technical note, and then from there extract the requirements that I want to share with the team and say: hey, I had to play with it for a little while to understand what I wanted to build, but I still want to generate all these rich artifacts. So today it's this structured workflow; we're playing around a lot with making it more flexible. But the structure is important, because the structure lets us build reproducible tooling that is not just an LLM. I think that's an important distinction we make: our agent is not just an LLM with a workflow on top of it. The backend may or may not be an LLM, or may or may not be other neurosymbolic reasoning tools under the hood. So we try to keep that distinction clear: you're not just talking to, like, Sonnet or Gemini or whatever. You're talking to an amalgam of systems, based on what type of task you're executing at any point in time. (Although when you're chatting, you are talking to just an LLM.) So yes, we have a template for the requirements. We have a template for the design doc, because there are sections we think are important to cover. And again, if you disagree, if you're thinking "I don't care about the testing strategy section," just ask the agent to drop it. Similarly, the task list is structured because we have UI elements built on top of it, like task management. We'll get there when we do some property-based testing, but there's additional UI we'll add for things like optional tasks, and so we need the structure there for our task-list LSP to work, for example. Yeah, thank you for the question. Anything else before we truck on? Cool. I may need somebody to remind me what we were doing. Oh, that's right.
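To make the structure being described more concrete, a spec folder might look roughly like this. This is an illustrative sketch, not Kiro's actual templates (those live in its system prompts and aren't shown in the talk); the feature name and requirement text are made up, using the "shall" phrasing quoted later in the session:

```text
.kiro/specs/agent-memory/
  requirements.md   <- numbered requirements, each with acceptance criteria
  design.md         <- architecture, components, testing strategy, ...
  tasks.md          <- checkbox task list the agent executes one task at a time

# requirements.md excerpt (illustrative)
## Requirement 2: Session memory
WHEN a joke is requested within an existing session,
THE SYSTEM SHALL return a joke not previously served in that session.
```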
So, we went through and synthesized the spec for adding memory and some amount of persistence to my agent. By the way, I didn't introduce you to this project. It's called Gramps. It's an agent that I'm deploying to AgentCore to learn about it; I mentioned that. What I didn't tell you is that it is a dad-joke generator. A very expensive one, since we're powering it via LLMs, but effectively: you are a dad-joke generator; jokes should be clean; they should be based on puns; bonus points if they're slightly corny but endearing; yada yada. So we're deploying this to the backend. The reason I want memory is that every time I ask the dad-joke generator for a joke, it gives me the same damn joke, and that's just super boring, and my kids are not going to be excited about that. I want memory so that as I come back within the same session, I get different jokes. That's the context on the project. So we've come through here: we generated this thing, we did the task list, I said, "Hey, is this the idiomatic way to do it?" But what I know is that we're not using AgentCore's memory feature, which is probably a big oops. So, quick show of hands: do we want to make the mistake and go all the way to synthesis and deployment, or should we fix it now? Who wants to fix it now, because we know better? No, I want to make the mistake; let's keep on trucking. I had three yeses in a room full of nothing, so we're going to make the mistake and then come back and fix it later. So, let's say "run all tasks in order." The reason I say "in order," which seems very specific, is that this is a preview build of Kiro, and somebody just added to the system prompt that it should only do one task at a time. I found that if I say "run all tasks," it thinks I somehow mean do them all in parallel.
That'll be fixed before these changes get out to production. So Kiro is going to keep going through here, chewing away at the system in the background. It has steering docs that explain how to do its job, which I guess I should show you. Steering, again, is like memory. So I have some steering on how to do commits, how I like commits to look, but also steering on things like how to actually deploy this thing, how to deal with AgentCore, and how to run the commands necessary to deploy to my local dev account. Those are mostly just an example, again, of sharpening your tools. I went through this kind of painful process of figuring out: oh, you have to use this parameter on the CDK command, you have to use this flag, otherwise it doesn't work correctly. Once I go through that pain of learning, I just say, "Kiro, write what you learned into a steering doc," and it will usually do a very good job of summarizing. It generated this AgentCore LangGraph workflow MD file automatically. So it's just going to truck on and do its job, and we can watch it in the background. In the interim, I think at this point we're at a pretty flexible spot. For folks who want to, feel free to use Kiro and try out spec-driven dev on your own. I'm going to keep running this in the background and take questions and comments. But that's it for the scheduled part of today. >> Yep. >> How does Kiro work for existing large codebases? >> Yeah.
The question was: how does Kiro work for large, existing codebases, basically the brownfield use case. The answer is, it depends on what you're trying to do. For spec-driven dev, you can ask Kiro to research what already exists; when you start a new spec, it will usually start by reading through the working tree. But the agent is generally starting from scratch; it needs to understand the system. In practice, that means that if your system already has good separation of concerns, if the components in your system are highly cohesive and coherent, it's going to do a great job, right? It's going to be able to say: this is the module that does this thing; I don't need to keep 18 things in my context to do my job. And it will do well. To take an example off the top of my head: if you were trying to launch an IDE very quickly ahead of an AWS launch, and you took on a lot of tech debt along the way that you need to unwind (and nobody here would do that, I'm sure, but in case you did, like me), then your agent might actually have a much harder time traversing the codebase, in the same way a dev would. So from that perspective, the more reliable things like your test suite are, and the more understandable things like module separation and decomposition of concerns are, the better the agent will do. And vice versa, of course. Now, for understanding the codebase (this is a bad example because this is a very small codebase), we do have things like code search and workspace... I don't know what to call these. Context providers. You can come in here and just say, I want to do code... what is it? I might have turned this off, actually. Oh, I did turn it off, because the codebase isn't big enough.
We'll do things like indexing in the background, so you can do semantic search over what you've got if you're just chatting. But in general, Kiro should go and do background search to figure out how to do its job. As the codebase scales up, it's probably going to do less well overall; that's one thing we're working on as a team. Did that answer your question, or did I glance off the side a bit? >> Yeah, I think I got it. >> Okay, cool. Anybody else? >> How long are you willing to wait for indexing to complete? >> So, one example I have is Code OSS. If it's not supremely obvious by looking at it, Kiro is a Code OSS fork, just like Cursor and Windsurf. One of the challenges we've had is that the Code OSS codebase is fairly large. There are other big ones out there, but that's my large codebase, because I'm now forced to work in it fairly frequently. And there's definitely some perceived slowdown when you're dealing with something large like that, especially around codebase indexing. It's a very active area of work for us, though. We're trying to do things like remove indexing from the critical path, so that you're not waiting on some slowed-down render thread because indexing is running. In practice, though... I mean, again, the agent may practically do less well, but we're going to be talking in a couple of weeks at re:Invent about how some features in Kiro were built via spec in a codebase we did not understand particularly well, because we're just not VS Code devs. And Kiro did a fine job of it. But again, that's a testament to the fact that the codebase is reasonably well structured. >> And if you've taken the time to understand how it works, it's very understandable. If you have not, it might be a little opaque to stare at. >> Yeah.
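The semantic search over an index discussed here boils down to embedding code chunks as vectors and ranking them by similarity at query time. Below is a minimal, self-contained sketch of that idea in TypeScript; a toy bag-of-words vector stands in for a real embedding model, and the file names and chunks are invented for illustration:

```typescript
// Toy semantic index: real systems embed chunks with an ML model;
// here a bag-of-words term-count vector keeps the sketch self-contained.
type Chunk = { file: string; text: string };

function embed(text: string): Map<string, number> {
  const v = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z_]+/g) ?? []) {
    v.set(tok, (v.get(tok) ?? 0) + 1);
  }
  return v;
}

// Cosine similarity between two sparse term vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, x] of a) { dot += x * (b.get(t) ?? 0); na += x * x; }
  for (const [, y] of b) nb += y * y;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Rank indexed chunks against a natural-language query.
function search(index: Chunk[], query: string, k = 1): Chunk[] {
  const q = embed(query);
  return [...index]
    .sort((l, r) => cosine(embed(r.text), q) - cosine(embed(l.text), q))
    .slice(0, k);
}

const index: Chunk[] = [
  { file: "http_server.ts", text: "start http server listen on port" },
  { file: "jokes.ts", text: "generate dad joke pun response" },
];
console.log(search(index, "where is the http port configured")[0].file);
// -> http_server.ts
```

The point made in the talk is that results like these feed UI features and tool calls, rather than being dumped wholesale into the agent's context.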
>> In terms of indexing, is it just putting as much information from the codebase into context, or >> is there a way to create some kind of vector database of the whole codebase and then query it? >> Yes. So the question was, what do you mean by indexing? Because indexing can mean a bunch of different things. What I mean is that the agent is actually not handed the index directly; we keep the agent context as small as possible. We use the index mostly for secondary effects, things like a code search, or if I search for, say, "http server" across files here. We use it more for these types of UI than for feeding the agent, because the agent (this is somewhat anecdotal, and based on our benchmarks) does better when given less context but given the tools to figure out where to go find things. Something we've heard a lot about at this conference is incremental disclosure, and that's it again: we don't want to load too much at the beginning of the conversation with the agent; we want the agent to self-discover the right context for the task. >> Thank you. >> Yeah. >> Are you managing session length? Is there any kind of compression or pruning? >> Yeah. So the question was, how do we manage session length? We have no incremental pruning or incremental summarization today. You basically just accrete context until you hit your limit, which, since I'm on auto right now, is something like a 200k-token limit, similar to the Sonnet models. So we don't have a very sophisticated algorithm here yet. We've looked at a few things, but our number one concern is actually prompt-cache hit rate. In a normal use case, I can achieve something like 90 to 95% cached token usage per turn, which means my interactions are very fast.
Or at least, they're much faster than the alternative, which is sending 160k tokens to Bedrock cold. So that's one of the reasons we've not done much experimentation with incremental summarization. Our summarization feature exists for when you hit the cap, and it's not great; we're trying to ship an improved version very shortly, e.g. in the next couple of weeks, which should be faster. Today it's a one-off operation that can take up to 30 or 45 seconds, which is a horrendous experience. We're hoping to fix that and make it a real-time experience. >> The follow-up: for managing staleness between sessions, then, is that why you're relying on a persisted spec? >> Sort of, but that's not the only reason. Spec-driven dev is less about performance and more about reproducibility and accuracy of the agent. The way I, and I think we as a team, talk about it internally is this: if I spend ten seconds giving a prompt to the agent and it goes off and gets it wrong, that's kind of no skin off my back, right? I burned however many tokens and a couple of cents of credit with whoever my LLM provider is, but I only spent ten seconds writing a prompt. If I spend five to ten minutes with the system producing a detailed design doc, or even just a detailed set of requirements, I want it to do a fairly good job. If I spend an hour generating a design doc, reviewing it with my team, and then synthesizing from it, I want it to get it right. So the goal is not just latency but accuracy. No, it's a both-and; you need to do both. But spec comes more from the goal of having highly reproducible output. I'm going to go over here first and then you. >> Yeah. How does each of these task agents pass context to the others?
And are you only supposed to run this parent task? Because it just finished all of 3.1, 3.2, and 3.3, but then it still thought that 3.1 wasn't done and ran that and 3.2 again. >> Oh, did it? >> Yeah. Well, mine did. >> Oh, okay. Yeah. So the question is about running tasks in the UI, and I can just pull up my task list here. If I just hit start, start, start, each of these is going to be a new session, which means the context is completely unique. Personally, if I've got the context space to afford it, I just say "do all the tasks," because I find that more understandable and I think I actually get better performance. But by default, each task will be a new session that shares no context with the previous ones. The session is effectively just seeded with your specification ("here, you're working on a spec that does all this stuff," a block of text) plus "you are doing this task; don't do any other tasks, just do this one." So that sounds like a bug. >> Do they ever spin up sub-agents for certain things? >> We don't have sub-agents yet in Kiro; it's something we're working on. >> Yeah, because ideally, right, if we click on task three and I've got 3.1, 3.2, and 3.3 and they're separated, there's no good reason you couldn't have different systems working on them. >> Right here. >> We do have custom agents in the Kiro CLI that you can also run. >> Yeah, the Kiro CLI has a concept of custom agents,
which can be run as a task, and it's something we're playing with right now in Kiro Desktop. And I think you had another one. >> Yeah, sorry if I missed this, but in the spec folder, as you do more and more of these tasks over time, >> yep >> is it all in one design/requirements/tasks set, with your whole project defined there, or does it group by...? >> That's a good question. The question was: as you generate more specs over time, are you just creating one massive spec? No. Let me open a different project. This, for example, is the Kiro extension, a first-party extension inside the Kiro IDE; it's where the agent itself lives. We have pruned some specs, but there are specs in here that we can talk through, or I can just demo. The way I think about it is that a spec represents a feature or a problem area in the project. So, for example (let me blow this up a little), some of these are just experiments. We've done things like: could we have a prompt registry? Could we have a prompt-registry file loader? They may or may not make it all the way to production. "I want telemetry on the chat UI." Each of these might represent a few days of work for an SDE that somebody will go off and do. agents.md support is a good one: I basically said, research what agents.md is and build support for it the same way you build steering. That spec is fairly unlikely for us to come back and revisit, so I may just delete it, which is what we've done with some of the older ones. But a good example of one we might come back to is our message-history sanitizer.
One thing we had issues with early in the development of Kiro is that we would send invalid sequences of messages. Say the Anthropic API required tool responses to come in the same order the tools were invoked, but the system wasn't doing that. So we built this whole sanitizer system that has a bunch of requirements around... let's see... yes, very specifically: "When a conversation is validated, the system shall verify that each user input is either non-empty content or tool responses." We had cases where empty strings would get passed in alongside a tool response. This is a good example of a spec where, over time, we've added, maybe not to the requirements themselves, but to the acceptance criteria of the requirements, as new validation rules are uncovered. >> Yeah. >> So how do you handle that? For example, you have telemetry up there. If a feature needs telemetry, is it going to go back and update that spec too, or...? >> It should, yeah. Usually you'll see... let me just start a new chat here. No, that's a terrible idea. So here, in spec mode, I've made a request to add UI telemetry to the thing. "I'll help you add it. Let me first check if there are any relevant runbooks, then explore the codebase and plan the implementation." It might do a little research here, and then, flip of a coin (again, it's an LLM), it may or may not discover the existing spec. But ideally, after doing its research, it will say: there already exists a spec for UI telemetry; I'm going to go amend that one. And if it doesn't, in this case I would come in and just ask it to, as the operator of the system. Over time, again, we want that to be easier, so you as a user don't have to think about it so much. We can watch it while it chugs along.
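The validation rule quoted above ("each user input is either non-empty content or tool responses") can be sketched as a tiny filter. This is an illustrative reconstruction, not Kiro's actual sanitizer; the message shape and field names are invented for the example:

```typescript
// Illustrative sketch of one sanitizer rule, not Kiro's real implementation:
// a user turn is valid if it carries non-empty text or at least one tool result.
type UserTurn = {
  text?: string;
  toolResults?: { toolUseId: string; output: string }[];
};

function isValidUserTurn(turn: UserTurn): boolean {
  const hasText = (turn.text ?? "").trim().length > 0;
  const hasToolResults = (turn.toolResults?.length ?? 0) > 0;
  return hasText || hasToolResults;
}

// Drop turns that would make the provider reject the whole request,
// e.g. an empty string with no accompanying tool results.
function sanitize(history: UserTurn[]): UserTurn[] {
  return history.filter(isValidUserTurn);
}

const cleaned = sanitize([
  { text: "" },                                                   // invalid: empty
  { text: "", toolResults: [{ toolUseId: "t1", output: "ok" }] }, // valid: has tool result
  { text: "tell me a joke" },                                     // valid: has text
]);
console.log(cleaned.length); // -> 2
```

A real sanitizer would also enforce ordering constraints (tool results matching the order of tool invocations, as mentioned above), but the shape of the check is the same: validate each requirement from the spec against the outgoing message list.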
>> Is there anything preconfigured in Kiro that makes it work better with AWS? >> No, not really. >> Was that a question? >> Oh, the question was: is there anything in Kiro that's preconfigured to make it work better with AWS? No. We are, purposefully... we are brought to you by AWS, so, you know, Andy Jassy and Jeff B. pay my check, but we're not an AWS product that's deeply integrated with the rest of the AWS ecosystem. That said, I still answer emails when somebody asks, "Why is this other thing we built with AWS not working with Kiro?" Yay. But similarly, if you're building on GCP or Azure, or running some on-prem system, the product should work just as well for you. That's our goal. >> A good answer, potentially, is the AWS documentation MCP server. >> Yes. >> There are MCP servers you can add into any of these things that will make it better. >> Yeah, that's a good point. In this case, I actually had to add the AWS documentation MCP server here. We could of course have bundled it natively, but I don't want to ship it to customers who don't need it, because AWS docs are not the only docs we might care about. By the way, coming back to your question: it did find the existing spec for telemetry. It read different sections of it, and now it's actually making amendments to it, so we can follow the diff as it shows up here. It has added new requirements to the pre-existing spec. This is effectively another case where we're mutating the system, as opposed to just appending a never-ending spiel of specs. >> I guess what I'm wondering is, how did it know or decide where to put the spec, if you break your project down into these different categories? >> Yep. >> I would imagine there's crossover. >> Yeah. I mean, that's sort of software development in a nutshell, though, right?
Like, how do you actually define the seams between different parts of your system, different concerns, the product? >> Right, but if you want to build something... I have a task and it's going to >> require changing three or four things >> yep >> it's going to change three or four specs, and then run tasks across three or four... >> Oh yeah, no, it should not do that. Again, I don't have a good example on hand that we can run, but, by the way, the question was this: say I have a spec for security requirements, a spec for API design (the API shapes), and a spec for logging, and I am changing something in the public API interface that is a security-facing concern because we're redacting PII from logging. I think that's a semi-tangible use case we can all imagine coming down from our governance teams. I would imagine that you either pick one of those specs to load the requirements into, or you create a cross-functional spec. But that would come down to you, as the operator, making that decision, in much the same way as you'd decide how to actually implement it: you would not necessarily implement a PII API-redaction module as a standalone thing; it's going to be a cross-cutting theme across your codebase, I'd imagine. >> It's also a good example: multi-root workspace support came out when Kiro went GA on Monday, so now you can drag in different projects. In your example with APIs and auth and even the frontend, you can bring in those projects if you have them separately and still work across them. >> Yeah. Thanks, Rob.
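The PII log-redaction example above is a textbook cross-cutting concern: wrap the logger once and every call site benefits without changing. A hypothetical sketch of that shape (the regex patterns are deliberately simplistic, purely for illustration, not production-grade PII detection):

```typescript
// Toy PII patterns, for illustration only; real redaction needs far
// more robust detection than two regexes.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const PHONE = /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g;

function redact(line: string): string {
  return line.replace(EMAIL, "[EMAIL]").replace(PHONE, "[PHONE]");
}

// The cross-cutting part: wrap any logger so every caller gets
// redaction for free, with no change at the call sites.
function redactingLogger(log: (msg: string) => void): (msg: string) => void {
  return (msg) => log(redact(msg));
}

const logs: string[] = [];
const log = redactingLogger((m) => logs.push(m));
log("user alice@example.com called 555-123-4567");
console.log(logs[0]); // -> "user [EMAIL] called [PHONE]"
```

Which spec a change like this lands in (security, logging, or a new cross-functional one) is exactly the operator decision described in the answer above.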
What's the mental model? The spec generates the code after that? Like, what code? Can you specify how that works? >> Yeah. So we have now synthesized the spec: we sat down and defined the requirements, design, and task list, and I've had Kiro go through and run all the tasks in this spec. It ran them one at a time; it basically worked on small, bite-sized pieces of work, chunk by chunk, and now it's done. What we've actually produced is not just the completed spec. It went into my agent and did a few things in the CDK repo, because it's doing persistence to S3. I'm sure it added a bucket. Yep: a new bucket, encryption, yada yada. It then went into the agent and added the S3 checkpoint saver. It looks like it created a checkpointer, adds it to the graph, and passes it all the way through the system. And the S3 checkpointer, I'm sure, knows how to write the checkpoints to and from S3. So we have gone beyond just defining the system; we've now produced it, delivered it end to end, including property tests, I believe. >> Oh, I have an answer to an earlier question about specific AWS-related features that make things easier: the Kiro CLI comes with the use-AWS tool, which helps with the CLI. >> Yeah. Yep. What Rob's pointing out is that the Kiro CLI, which we just rebranded this week, has a use-AWS tool, which is basically a wrapper over the AWS SDK to make some of those things easy. But again, bring your own use-GCP tool as an MCP server if you were so inclined, if that's your tool of choice. And I believe, don't quote me on this because the CLI is new to me, that you can turn off tools in the CLI as well. Let me know if that's not right, Rob. >> Yeah. So, there you're actually not restricted.
In the desktop product today, you can't control the native built-in tools, but in the CLI, you can. >> So I intuitively get the benefits of having a spec. Have you done any work to empirically see how a project or a problem would have worked with or without one? >> Yeah. We do have benchmarks. I don't have the data off hand, but I think part of it is in our blogs. If you go to the kiro.dev blog, we talk crisply about some of the lift that things like property-based testing give to task accuracy. The science team is always working on that stuff. >> ...a blog about specs. I'm curious about... >> Yeah, a Distinguished Engineer for databases. >> His blog post really steps it up. I don't think it has the specific data you're asking for, but I think it will be useful. >> Yeah. >> How does it work... I understand the feature side of it, but how does it work on the non-functional side, like latency, dealing with the somewhat harder problems? >> Well, yeah. That is ultimately the goal here, right? We're saying you make a slightly larger investment up front, but we believe the structure we're bringing is going to help increase the accuracy of your result. So while we've got a team of people basically working on making spec better, my job when I fly back to Seattle is to make Kiro as a whole much faster. One, execution time and lagginess in the UI; two, how do we get tokens through the system faster, how do we get responses to you faster, so you're not sinking as much cost into Kiro to use a spec? >> Yeah, but I'm not talking about the Kiro tool itself; I mean the code generated from the spec. >> Oh. Oh, yeah. Okay. You mean the non-functional requirements of the generated code? So that's going to come down to, I think, what you're specifically trying to do.
One of the slides I had here talked a little about how to tweak the process and the artifacts for your use cases. You could very easily add something like: I want non-functional requirements for speed, runtime, and things like lock contention to be considered in the design phase. That's something you could certainly add. >> So you could generate the code in Rust or Java? >> Yeah, totally. >> And it will vary in the non-functionals depending on what language you generate. >> I mean, it would have to; there's no other way to approach it. Again, I'm familiar with Node, so I'm doing everything here in Node, but you can use this with any language. I think technically we say we support JavaScript, TypeScript, Java, Python, and Rust, but in practice there's no reason this doesn't work with any language. It's just an LLM; there's nothing language-specific or framework-specific in the system. And, for those of you... there was a conference earlier this week hosted by Tessl, who are doing a sort of specs-as-knowledge-base approach, and their argument is that as long as you've got the right grounding docs in there, it should not matter what you're building; it's all informed by the context you're building for your system. >> This is also a really good point for steering. With steering you can get the agent to develop code the way you want. Being a developer is all about making trade-offs, and the problem with your model out of the box is that it's so polite, because it's trying to be everything to everyone. So especially with latency and cost and things like that, just tell it in steering what you want it to prioritize, and that will influence any code that gets generated. >> Yep. >> Even how it designs, based on that, as well.
So if there's something that's very specific to your use case or your industry or whatever, just shove it in that steering file. >> Yeah, that's exactly right. For example, I have Kiro generate commits for me, and one of the things I personally care about is being able to distinguish commits I write from commits Kiro generates, the ones that come from the system. So my steering doc, while short, includes my requirement that every Kiro-generated commit be attributed to the co-author "Kiro agent," which is trivial, but I also want it to happen every time. In this case, it just generated a commit co-authored by Kiro agent. That's an example of how you can put whatever you want in there, not just things related to git commits: you could do code style, code coverage... "Whenever you add a spec or a new module, make sure you annotate it with coverage minimums of 90%," because that's the thing I care about. You can put anything you want in there. The good news is, it looks like what we built works. Kiro is very happy with itself, at least, and it looks like all tests passed. So we can deploy this to the backend and see how things go. We're technically just about at time, so if anybody has any other questions, I'm going to stick around for a while. But thank you all for joining, listening, and learning a little more about spec-driven dev.
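As a postscript: the commit-attribution steering described above boils down to the standard git Co-authored-by trailer convention. A hypothetical sketch of what that steering is enforcing (the name and email are placeholders, not Kiro's actual identity):

```typescript
// Append a git Co-authored-by trailer so agent-generated commits are
// distinguishable from human ones. Illustrative sketch; the email is a
// placeholder, and Kiro applies this via steering rather than a script.
function withCoAuthor(message: string, name: string, email: string): string {
  const trailer = `Co-authored-by: ${name} <${email}>`;
  // Idempotent: don't duplicate the trailer on repeated calls.
  return message.includes(trailer)
    ? message
    : `${message.trimEnd()}\n\n${trailer}\n`;
}

console.log(withCoAuthor("Add S3 checkpointer", "Kiro", "kiro@example.com"));
// Add S3 checkpointer
//
// Co-authored-by: Kiro <kiro@example.com>
```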