Spec-Driven Development: Agentic Coding at FAANG Scale and Quality — Al Harris, Amazon Kiro
Channel: aiDotEngineer
Published at: 2026-01-09
YouTube video id: HY_JyxAZsiE
Source: https://www.youtube.com/watch?v=HY_JyxAZsiE
For those of you who haven't heard of us, Kiro is an agentic IDE. We launched generally available this most recent Monday, I think the 17th, but we launched public preview in July, I think July 14th. So we've been out there for a few months getting customer feedback, all that good stuff. We're going to talk a little bit about using spec-driven development to sharpen your AI toolbox. I did a show of hands: about a quarter of the people here are familiar with spec-driven dev. My name is Al Harris, principal engineer at Amazon. I've been working on Kiro, and we're a very small team; we were basically three or four people sitting in a closet doing what we thought we could do to improve the software development lifecycle for customers. We were charged with building a development tool that improved the experience of spec-driven development. We were theoretically funded out of the org that supported things like Q Developer, but we were purposefully a very different product suite from the Q ecosystem, to take a different take on these things. We wanted to work on scaling, helping you scale AI dev to more complex problems; improve the amount of control you have over AI agents; and improve the code quality and reliability of what you get out the other end of the pipe. Now we're back to new content. So our solution was spec-driven development. We took a look at some existing stuff out there and said, "Hey, vibe coding is great, but vibe coding relies a lot on me as the operator getting things right. That is me giving guardrails to the system, and me putting the agent through a kind of strict workflow."
We wanted spec-driven dev to represent the holistic SDLC, because we've got 25 or 30 years of industry experience building software, building it well, and building it with different practices. We've gone through waterfall and XP; we have all these different ways that we represent what a system should do, and we want to respect what came before. So, this animation looked a lot better originally; it was initially just the left diamond. But the idea was: you're basically iterating on an idea. I think half of software development is requirements discovery. And that discovery doesn't just happen by sitting there and thinking about what the system should do and what the system can do. We realized, working on this, that the best way to make these systems work is to actually synthesize the output and feed it back really quickly. Take your input requirements, actually do the design, and feed back what you learn: you realize, "oh, actually, if we do this there's a side effect here we didn't consider," and you feed that back into the input requirements. And so this compression of the SDLC evolved to bring structure into the software development flow. We wanted to take the artifacts that you generate as part of a design. That's the requirements that maybe a product manager or developer writes, which become the acceptance criteria: what does success look like at the end of this? Then there are the design artifacts that you might review with your dev team, or with stakeholders, and say "this is what we're going to go build," and then implement the thing. And we want to make sure you can do all of this in a tight inner loop. Ultimately, that was what spec-driven development was initially. What spec-driven development in Kiro is today, or at least was before it went GA, is: you give us a prompt, and we will take that and turn it into a set of clear requirements with acceptance criteria.
We represent these acceptance criteria in the EARS format. EARS stands for Easy Approach to Requirements Syntax. It's effectively a structured natural-language representation of what you want the system to do. Now, for the first four and a half months this product existed, the EARS format looked like just a kind of interesting decision we made. But with our general availability launch on Monday, we have finally started to roll out some of its side effects, one of which is property-based testing. So now your EARS requirements can be translated directly into properties of the system, which are effectively invariants that you want to deliver. For those of you who have, or haven't, done property-based testing in the past, using something like Hypothesis in Python or fast-check in Node (Clojure's spec library is another example): these are approaches to testing your software system where you're effectively trying to produce a single test case that falsifies the invariant you want to prove. If you can find any counterexample, then you can say this requirement is not met. If you cannot, you can say with a high degree of confidence (where the word "high" is doing a little bit of heavy lifting, because it depends on how well you write your tests) that the system does exactly what you're saying it does. We'll get a little more into property-based testing and PBTs later, but this is the first step of many we're taking to take these structured natural-language requirements and tie them, with a through-line, all the way to the finished code, and say: if the properties of the code meet the initial requirements, we have a high degree of confidence that you have reliably shipped the software you expected to ship.
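As a toy illustration of the idea, a hand-rolled property check (no Hypothesis or fast-check, and a made-up requirement; this is not Kiro's implementation) might look like this:

```python
import json
import random
import string

# Hypothetical EARS-style requirement (illustrative only):
#   "WHEN the system persists a session payload,
#    THE SYSTEM SHALL return an identical payload on retrieval."
# As a property: for ALL payloads p, load(save(p)) == p.

def save(payload: dict) -> str:
    return json.dumps(payload)

def load(blob: str) -> dict:
    return json.loads(blob)

def random_payload(rng: random.Random) -> dict:
    key = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {key: rng.randint(-1000, 1000)}

def check_round_trip_property(trials: int = 200, seed: int = 0) -> bool:
    """Try many random inputs, looking for a single falsifying case."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_payload(rng)
        if load(save(p)) != p:
            return False  # counterexample found: requirement not met
    return True  # no counterexample: high confidence, not proof

print(check_round_trip_property())  # True
```

The asymmetry is the point: one counterexample falsifies the requirement outright, while passing only builds confidence proportional to how well the input generator covers the space.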
So with spec-driven dev, we take your prompt, we turn it into requirements, we pull a design out of that, we define properties of the system, and then we build a task list, and you can run your task list. Effectively, the spec then becomes the natural-language representation of your system. It has constraints, it has concerns around functional and non-functional requirements, and it's this set of artifacts that you're delivering. I don't think I have the slide in this deck, but ultimately the way I look at a spec is that it is, one, a set of artifacts that represent the state of your system at a point in time t. It is, two, a structured workflow that we push you through to reliably deliver high-quality software: the requirements, design, and execution phases. And then, three, it is a set of tools and systems on top of that that help us deliver reproducible results. One example of that is property-based testing. Another example, a little less obvious, but we can talk about it later, is requirements verification: we scan your requirements for ambiguity, we scan your requirements for invalid constraints (e.g., you have conflicting requirements), and we help you resolve those ambiguities using classic automated-reasoning techniques. I could talk more about the features of Kiro, but I think that's less interesting for this talk, because we want to talk about spec-driven dev. We have all the stuff you would expect, though: we have steering, which is sort of memory plus something like Cursor rules; we have MCP integration; we have image support, yada yada; and we have agent hooks. So let's talk a little bit about sharpening your toolchain. And I'm going to take a break really quick here.
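Before moving on, the conflicting-requirements check mentioned above can be sketched conceptually in a few lines. The encoding here (trigger, action, allowed flag) is entirely made up for illustration; Kiro's actual verification uses automated-reasoning tooling, not this:

```python
from itertools import combinations

# Hypothetical structured requirements: (trigger, action, allowed).
# Two requirements conflict if the same trigger both mandates and
# forbids the same action.
requirements = [
    ("session_expired", "delete_history", True),
    ("session_expired", "delete_history", False),  # conflicts with the first
    ("user_logs_in", "create_session", True),
]

def find_conflicts(reqs):
    """Return every pair of requirements that contradict each other."""
    conflicts = []
    for a, b in combinations(reqs, 2):
        if a[0] == b[0] and a[1] == b[1] and a[2] != b[2]:
            conflicts.append((a, b))
    return conflicts

print(len(find_conflicts(requirements)))  # 1
```

The value of structured requirements is exactly this: once they are machine-readable, contradictions become a mechanical check rather than something a reviewer has to notice.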
Just pausing for a moment for folks in the room who had maybe tried downloading Kiro, or something else: are there any questions right now before we dive into how to actually use a spec to achieve a goal? No questions. Could be a good sign; could mean I'm not talking about anything particularly interesting. So, I actually want to talk in some concrete detail here. This is a talk I gave a few months ago on how to use MCPs in Kiro. One of the challenges people who had tested Kiro had was that they felt the flow we were pushing them through was a little too structured: you don't have access to external data, you don't have access to all these other things you want. And so, on our journey here toward sharpening your toolkit (oh, you know what, this is out of order; here's my nice AI-generated image): you can use MCP. Everybody here, I assume, is familiar with MCP at this point. Kiro integrates MCP the same way all the other tools do. But what I think people don't do enough is use their MCPs when they're building their specs. You can use your MCP servers in any phase of the spec-driven development workflow: requirements generation, design, and implementation. We'll go through an example of each. First of all, setting this up in Kiro is fairly straightforward. We have the Kiro panel here (there's a little ghosty), and then you can go down to your MCP servers and click the plus button. My favorite way to do it, though, is to just ask Kiro to add an MCP and give it some information on where it is; it can usually figure it out from there, or you just give it the JSON blob and it'll figure it out. Once you have your MCP added, you'll see it in the control panel down here, and you can enable it, disable it, allowlist tools, disable tools, etc.
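For reference, the JSON blob for an MCP server typically looks something like the snippet below. The server name, package, and exact field names here are illustrative (this schema is common across MCP clients, but check the current Kiro docs for the exact shape it expects):

```json
{
  "mcpServers": {
    "aws-docs": {
      "command": "uvx",
      "args": ["awslabs.aws-documentation-mcp-server@latest"],
      "disabled": false,
      "autoApprove": []
    }
  }
}
```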
So you can manage context that way. Worth noting: changing MCP servers, and changing tools in general, is a cache-invalidating operation. So if you're very deep into a long session, maybe don't tweak your MCP config, because it will slow you down dramatically. But let's talk about MCPs in spec generation. The Kiro team uses Asana, for reasons I don't know, but it's our task tracker of choice. So one thing I want to do is say: I don't want to write the requirements for a spec from scratch. My product team has already done some thinking; we've iterated in Asana to break a project down. This is not always how things work, but it sometimes is. So in this case, I have a task in Asana. (Oh no, I did the wrong thing; that's what I get for zooming.) I have this task in Asana that says add the view, model, and controller to this API. This was a demo app that I configured in a few minutes, and we even had (it's kind of peeking under here) some details about what we wanted to have happen. Now I can go into Kiro and just say "start executing task XYZ," with the URL from Asana, and Kiro is going to recognize this is an Asana URL. I had the Asana MCP installed; it goes and pulls down all the metadata there, and from there it starts determining what to work on. (It's funny, these titles are backwards.) Basically: create a spec for my open Asana tasks. Again, go pull all the tasks from Asana and then, for each one, generate requirements based on those tasks. I think I had six tasks assigned to me: one is do user management, one is do some sort of property management, da da da. It pulled them in, generated the requirements, and then (in this case the title is wrong, apologies) start executing the task: I want to go and do the code synthesis for this. And I will take a quick break here to talk about how you can do this in practice.
So, for those of you following along in the room, feel free to fire up your Kiro, open a project, and pick an MCP server. I'll share a few repos here really quick that you can play around with. I have an MCP server implemented, and I have this Lofty Views app, which I think is the one from the Asana example. These should all be public; let me just double-check. Yeah, okay. So for example, I have a Nobel Prize MCP, which curls (perhaps unsurprisingly, there is one) the Nobel Prize API. You can use uvx to install it, or you can git clone it from my GitHub, the Nobel MCP repo. This is just one example. Another one, if you want to play around with the sample that's in the video, is my Lofty Views repo. I'll leave these both up on the screen for a few moments for folks who want to copy the URLs. But while that is happening (oh no, let's put you on the same window), what I'll demo quickly is using an MCP to make spec generation much easier and more reliable. So here I have, let's see, a lot of MCPs; which ones do I actually want to use? Let's use the GitHub MCP. Oh, no. Ignore me. That's better. Okay, well, I have the fetch MCP. So in this case I could, for example, come in here and say: hey, I've generated a bunch of tasks for the Lofty Views app (this is basically a very simple CRUD web app), but I want Kiro to use the fetch MCP to pull examples from similar products that exist on the internet. You could also use the Brave Search or Tavily search MCP servers, but in this case I'll just use fetch because I've got it enabled. So let's say... oh, actually, we can run the web server and use fetch; that's a good example. This is one example of how, at any point in the workflow of generating a spec, you can use your MCP servers to get things working. No, this is what I get for not using a project in a while. We'll cancel that.
We can actually do something a little more interesting, which is a separate project I've been working on: an AgentCore agent. I know the project works, which is the reason I'll fire it up here. Should I call it? Well, maybe we'll do live demos at the end. So that's the most basic thing you can do with Kiro: just use MCP servers. But any tool uses MCP servers, and I actually don't think that's particularly interesting. So let's say, in this process of trying to sharpen our spec-dev toolkit, we've finished up with the 200 grit. We've added some capabilities with MCP. It's useful, but it's not going to be a game-changer for us. I want to come in here and get up to the 400 grit; let's start to get a really good polish on this thing. I want to customize the artifacts produced. Because you've got this task list, you've got this requirements list, and "I don't agree with what you put in there, Al." You could say that, a lot of people do, and that's a great starting point. So, here's something I heard earlier in the conference: people like to use wireframe mocks. Because your specs are natural language, you're using specs as a control surface to explain what you want the system to do; therefore I want to be able to actually put UI mocks in there. The trivial case is that I just come in here (Kiro has asked me, "does the design look good? Are you happy?") and I say: this looks great, but could you include wireframe diagrams in ASCII for the screens we're going to build here? This is again from that Lofty Views thing. I'm adding a user management UI, but I want to actually see what we're proposing to build, not just the architecture of the thing.
So Kiro is going to sit here and churn for a few seconds, but you can add whatever you want to any of these artifacts, because they're natural language. They're structured, which means we want some reproducibility in what they look like, but ultimately what they look like doesn't matter, because we've got the anything machine here, the agent, sitting there to help translate them into what they need to be. So Kiro's churning away here, it's thinking, and then it's going to spit out these text-wrapped ASCII diagrams. (I'll fix the wrapping here in a second in the video.) Ultimately, it does whatever you want. If you want additional data in your requirements, you can do that. If you want additional data in the design, like this, you can easily add that. Here we've got these wireframes in ASCII that help me rationalize what we're actually about to ship. And then I can continue to chat and say: actually, in the design, maybe I don't want this "add user" button to be up at the top the entire time, in which case I could chat with it to make that change easily, and now we're on the same page up front instead of later, during implementation time. So we've again left-shifted some of the concerns. That's one example: I want to add UI mocks to the design of a system. (This is just a quick snapshot of the end state, where my design now does have these UI mocks.) Another example, which I actually like a little bit more, is including test cases in the definition of tasks. Today, the tasks Kiro gives you will be kind of the bullet points of the requirements and the acceptance criteria you need to hit. But I want to know that, at the end state of this task being executed, we have a really crisp understanding that it is correct.
It's not just "done," because anybody who's used an agent can probably testify that LLMs are very good at saying: I'm done. I'm happy. I'm sure you're happy. I'm just going to mark it complete. Oh yeah, the tests don't pass, but they're annoying; I tried three times to get them to work; I'm just going to move on. No, I don't want that. I want to actually know that things are working. So in this case, I've asked Kiro to include explicit unit test cases that are going to be covered. My task here, for example, creating this AgentCore memory checkpointer, is going to have all the test cases that need to pass before it's complete, and then I can use things like agent hooks to ensure those are correct. (We'll run this sample a little later in the talk; this is the one I'm ready to demo.) So this is another example where you're working on your toolbench: you have all these capabilities and primitives at your control, and you can tweak the process to work for you, not just the process that I think is the best one. And then, last but not least, the 800 grit. At this point, we're getting a final polish on the tool; we might be stropping next. You can iterate on your artifacts, but you can also iterate on the actual process that runs. So one thing you might do (and I do this a lot) is: I'll be chatting with Kiro, and I say, "Hey, I want to add memory to my agent in AgentCore. Let's dump conversations to an S3 file at the end of every execution." Kiro is going to say: that's great, I know how to do that, I'm going to research exactly how to do that thing, I will achieve this goal for you. But ultimately, what I've done is introduce a bias up front: I'm steering the whole agent toward using S3 as the storage solution just because maybe I'm familiar with it, but it's probably not the best way to go about it.
So then, after it had synthesized the design and all the tasks and all this stuff, I came back and said: well, we don't need to stick to this rigid spec-driven dev workflow that has been defined by Kiro. I can ask for alternatives: is this the idiomatic way to achieve session persistence? I don't know; maybe there's a better way. Maybe, if we're talking AWS services, it's not S3, it's DynamoDB, or yada yada. Kiro's going to come in here and say, you know, good question, let me research. It's going to go call a bunch of MCP tools that I've given it access to (this ties back to "you should be using MCP"), and then it comes back with a recommendation for a feature I didn't know existed, which is AgentCore memory. It says it's "more idiomatic and future-proof" (that's maybe TBD and should be checked a little closer), or you could use S3, which is the thing I suggested. Now, actually, I bet there are far more than two options here. So you could probably keep asking the agent, "are there other options," yada yada, and it would continue to investigate. But you should not lock yourself into the rigid flow that is the starting point here. Yeah, so that's actually it for my deck. What I will do now is just run through that sample I had up there. So basically, let me delete it, and I'll do a live demo of specs in Kiro and how we can fine-tune things a little bit. This project is a Node.js app; it's a CDK project. Again, I'm not trying to sell more AWS; this is just the technology I'm familiar with, so I can move a lot more quickly. So, I wanted to learn a little bit about AgentCore, which is a new AWS offering, and as somebody building an agent, I should probably be familiar with it. And I'm not familiar enough with it. We've got some other people here who know a lot about it.
So, put my hand up a little bit, and, you know, you caught me. So I set up a CDK stack, which is infrastructure-as-code technology to deploy software; I'm familiar with it and I love it. I have a stack here that lets me deploy whatever an AgentCore runtime is. I don't know; I asked Kiro to do it. We vibe-coded this part. We vibe-coded the general structure: we got an agent, we got IAM set up. I then vibe-code-added commitlint, and Husky, and a few things like this that I like for my own TypeScript projects (Prettier and ESLint, I think). So we have a basic project here that I know I can deploy to my personal AWS account. Now I'm going to come in here and... oh, and then importantly, this is super important, because I don't know how the hell AgentCore works. I could go read the docs, but the docs are long and complicated, and I'm really just trying to build out a POC to learn about it myself. So I added two MCP servers. Oh no, maybe I didn't. Let me check. Okay, yes, sorry, buried down here at the bottom: this is my Kiro MCP config. I added one important MCP server here, which is the AWS documentation one. There are other ways to get documentation (you can use things like Context7), but in this case this is vended by AWS, so I have some confidence that it might be correct. I used this to help the agent have knowledge about what technologies exist, and I think I used fetch quite a bit as well. So those are the two MCP servers I provided the system. That's great; move on; confirm. And I'll just rerun this from scratch. What I had done yesterday evening, or maybe the evening before, was: I sat down, I had this system basically working, and now I want to start doing spec-driven development. So I want to add this session ID concept, and then I want to write the conversation to an S3 file, blah blah blah.
This is the whole bias thing I showed you earlier. We're going to fire that off through Kiro. It's going to start chugging away, and then it's going to see if the spec exists. Okay, the folder does exist; it's probably going to realize there are no files there and start working away. From here I'll sort of live-demo. It's going to read through requirements, read through existing docs, read through existing files, and gather the context it needs. In a moment, once it generates the initial requirements and design, I am going to challenge it to use its own MCP servers: I want you to go and do some research on the best way to do this and provide me some proposals. (This is why I was hoping to get the clip-on mic working, because I've got to set this down for a moment.) Okay. So, you know: I don't know if this is the best way to do this. Go read docs, go use fetch. It's going to keep churning away here and then come back to me once it's got a few ideas to propose. This is an example of me just using additional capabilities: use fetch, use the docs MCP, use whatever you can to get the best information, and don't take at face value the things that I said. These are usually things we have to prompt pretty hard to get the agent to do, but if you're doing it in real time, it works fairly well. Again, all of these agents are going to be very easy to please. So just because I said something in the stupid docs, it may or may not actually be the most important thing from the agent's perspective down the road. Okay, so it's done a little bit of research.
It understands that LangGraph, which is the agent framework we're using, already has this notion of persistence. And actually, in this case it did not use the MCP for the AgentCore docs, so it didn't find that AgentCore has its own notion of persistence. So let's assume I still don't know that exists, because I didn't dry-run this a few days ago; we might have to find that later, in the design phase. First thing it's going to do is iterate over all my requirements here. It's changed the requirements based on what it now knows about LangGraph and how it can natively integrate with checkpointing, but it's still really crisply bound to this S3 decision that I made implicitly in the ask. So that is just something to be aware of: anything you put in the prompt is effectively grounding the agent, for better or for worse. I see it's still iterating. So, yeah, it comes through and says, does this look good? I'm going to say: looks great, let's go to the design phase. So now Kiro is going to take my requirements and take me into the design phase of this project. (I can make this bigger so things are easier to see.) Here's an example of what I meant by these EARS requirements. The user story here is: as a dev, I want to implement a custom S3-based checkpointer so the agent can use LangGraph's native persistence mechanism with S3. Great, that sounds reasonable to me as a person co-authoring these requirements. And this here, this when/then/shall syntax, is the EARS format. The structured natural language is really important for letting us pass this through non-LLM-based models and give you more deterministic results when we parse out your requirements, because ultimately our goal is to use the LLM, not as little as possible, but less and less over time.
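To make the determinism point concrete, here's a toy sketch: a plain regex, with no LLM involved, can extract the trigger and the obligation from one simple EARS template. This is illustrative only (EARS has several templates, and Kiro's actual parsing is not shown here):

```python
import re

# One simple EARS template: "WHEN <trigger>, THE SYSTEM SHALL <response>".
EARS_WHEN = re.compile(
    r"^WHEN (?P<trigger>.+?), THE SYSTEM SHALL (?P<response>.+)$",
    re.IGNORECASE,
)

req = ("WHEN a request arrives with a known session ID, "
       "THE SYSTEM SHALL restore that session's conversation history")

m = EARS_WHEN.match(req)
print(m.group("trigger"))   # a request arrives with a known session ID
print(m.group("response"))  # restore that session's conversation history
```

Because the sentence shape is fixed, the same requirement parses the same way every time, which is what makes downstream tooling (like deriving properties from requirements) reproducible.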
We want to use classic automated-reasoning techniques to give you high-quality results, not just whatever the latest model is going to tell you. So it's gone through and spits out a design doc. Let's actually look at this in markdown. Sure, you've got a server, da da, a checkpointer that goes to S3; that makes sense; pseudocode. (In a real scenario, maybe I read this a little more closely.) And here's the new thing we shipped on the 17th: now Kiro is going to go through and formalize the requirements into correctness properties. Right now, the system is taking a look at those requirements we agreed upon earlier (these look good, I agree with them, yada yada), taking a look at the design, and extracting correctness properties about the system that we want to run property-based testing against down the road. This is something that may or may not matter for you in the prototyping phase, but it should matter significantly when you're going to production, because if these properties are correct and these properties are all met, the system aligns one-to-one with the input requirements you provided. So, while this is chugging away: any questions yet? Any folks curious about this? >> Yeah, we're here and then there. What would you say is the main difference between this and planning mode in other tools? >> I haven't used that planning mode in a couple of weeks; things move so fast, it's a little wild. But I think ultimately what we would say is that Kiro's spec-driven dev is not just LLM-driven; it is actually driven by a structured system. With planning mode, I'm not sure if there's actually a workflow behind it that takes you through things, but this is our take on it, for sure. I'm not familiar enough to give a more concrete example, unfortunately.
>> It's similar, but it doesn't give you a document like this. I think this document is cool. What that tool does is basically create you a plan. >> Just an execution plan, okay. Oh, I see. So I think the fundamental difference there... does that plan get committed anywhere, or is it just ephemeral? >> It's kind of ephemeral. >> Okay. So what I want over time is not just how we make the changes we care about; it is actually the documentation and specification of what the system does. The long-term goal I have is that, with Kiro, we're able to do a bidirectional sync: as you continue to work with Kiro, you're not just accruing these task lists (and I'm just going to say "go for it" to go to the tasks), but actually, if I come back and change the requirements down the road, we will mutate the previous spec. So today I'm looking at just a diff of requirements, and as you go through a greenfield process, you're going to produce a lot of green in your PRs, which is maybe not the best, because I'm just reviewing three huge new markdown files. But on subsequent times that I open that doc up, I want to be seeing: oh, you've actually relaxed this previous requirement, you've added a requirement that has this implication on the design doc. That is the process the Kiro team internally uses to talk about changes to Kiro. Our design docs have in general been replaced by spec reviews. Somebody will take a spec from markdown, they'll blast it into our wiki using an MCP tool we use internally, and then we'll review that thing and comment on it in a design session, as opposed to "I wrote this markdown file or a wiki page from scratch." So it becomes... well, it's actually not like an ADR, because it's not point-in-time.
It is like living documentation about the system. But yeah, thanks for the question. There's one over here. >> This may be more a spec-driven development question, but is there a template for the set of files that you fill out? Like right now you're in design.md. Is design.md the spec, and it's a single doc, or are there others? >> Oh, great question. So the question (and correct me if I'm wrong here) is: are there a set of templates that are used by the system, and, maybe the question you're driving at, can you change the templates? There are, implicitly, in our system prompts for how we take care of your specs. You'll see here in the top navbar that right now we're really rigid about this requirements, design, task-list phasing, but we know that doesn't work for everybody. For example, we get this feedback from a lot of internal Amazonians, actually: I have an idea for a technical design and I don't necessarily know what the requirements are yet; maybe "design" is even the wrong word; I want to start with a technical note. This comes up a lot for refactoring, actually. So: I want to refactor this to no longer have a dependency on... here's a good example. Here we use a ton of mutexes around the system to make sure that we're locking appropriately when the agent is taking certain actions, because we don't want different agents to step on each other's toes. But maybe I want to challenge the requirements of the system so I can remove one of these mutexes, or semaphores, I should say.
So I might start with something like a technical note, and then from there extract the requirements that I want to share with the team and say: hey, I had to play with it for a little while to understand what I wanted to build, but I still want to generate all these rich artifacts. So today it's this structured workflow; we're playing around a lot with making it more flexible. But the structure is important, because the structure lets us build reproducible tooling that is not just an LLM. I think that's an important distinction we make: our agent is not just an LLM with a workflow on top of it. The backend may or may not be an LLM, or may or may not be other neurosymbolic reasoning tools under the hood. So we try to keep that distinction clear: you're not just talking to, like, Sonnet or Gemini or whatever. You're talking to an amalgam of systems, based on what type of task you're executing at any point in time. (Although when you're chatting, you are talking to just an LLM.) So yes, we have a template for the requirements. We have a template for the design doc, because there are sections we think are important to cover. And again, if you disagree, if you're thinking "I don't care about the testing strategy section," just ask the agent to drop it. Similarly, the task list is structured because we have UI elements built on top of it, like task management. We'll get there when we do some property-based testing, but there's additional UI we'll add for things like optional tasks, and so we need the structure there for our task-list LSP to work, for example. Yeah, thank you for the question. Anything else before we truck on? Cool. I may need somebody to remind me what we were doing. Oh, that's right.
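To make the structure being described more concrete, a spec folder might look roughly like this. This is an illustrative sketch, not Kiro's actual templates (those live in its system prompts and aren't shown in the talk); the feature name and requirement text are made up, using the "shall" phrasing quoted later in the session:

```text
.kiro/specs/agent-memory/
  requirements.md   <- numbered requirements, each with acceptance criteria
  design.md         <- architecture, components, testing strategy, ...
  tasks.md          <- checkbox task list the agent executes one task at a time

# requirements.md excerpt (illustrative)
## Requirement 2: Session memory
WHEN a joke is requested within an existing session,
THE SYSTEM SHALL return a joke not previously served in that session.
```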
So, we went through and synthesized the spec for adding memory and some amount of persistence to my agent. By the way, I didn't introduce you to this project. It's called Gramps. It's an agent that I'm deploying to AgentCore to learn about it; I mentioned that. What I didn't tell you is that it is a dad-joke generator. A very expensive one, since we're powering it via LLMs, but effectively: you are a dad-joke generator; jokes should be clean; they should be based on puns; bonus points if they're slightly corny but endearing; yada yada. So we're deploying this to the backend. The reason I want memory is that every time I ask the dad-joke generator for a joke, it gives me the same damn joke, and that's just super boring, and my kids are not going to be excited about that. I want memory so that as I come back within the same session, I get different jokes. That's the context on the project. So we've come through here: we generated this thing, we did the task list, I said, "Hey, is this the idiomatic way to do it?" But what I know is that we're not using AgentCore's memory feature, which is probably a big oops. So, quick show of hands: do we want to make the mistake and go all the way to synthesis and deployment, or should we fix it now? Who wants to fix it now, because we know better? No, I want to make the mistake; let's keep on trucking. I had three yeses in a room full of nothing, so we're going to make the mistake and then come back and fix it later. So, let's say "run all tasks in order." The reason I say "in order," which seems very specific, is that this is a preview build of Kiro, and somebody just added to the system prompt that it should only do one task at a time. I found that if I say "run all tasks," it thinks I somehow mean do them all in parallel.
That'll be fixed before these changes get out to production. So Kiro is going to keep going through here, chewing away at the system in the background. It has steering docs that explain how to do its job, which I guess I should show you. Steering, again, is like memory. So I have some steering on how to do commits, how I like commits to look, but also steering on things like how to actually deploy this thing, how to deal with AgentCore, and how to run the commands necessary to deploy to my local dev account. Those are mostly just an example, again, of sharpening your tools. I went through this kind of painful process of figuring out: oh, you have to use this parameter on the CDK command, you have to use this flag, otherwise it doesn't work correctly. Once I go through that pain of learning, I just say, "Kiro, write what you learned into a steering doc," and it will usually do a very good job of summarizing. It generated this AgentCore LangGraph workflow MD file automatically. So it's just going to truck on and do its job, and we can watch it in the background. In the interim, I think at this point we're at a pretty flexible spot. For folks who want to, feel free to use Kiro and try out spec-driven dev on your own. I'm going to keep running this in the background and take questions and comments. But that's it for the scheduled part of today. >> Yep. >> How does Kiro work for existing large codebases? >> Yeah.
The question was: how does Kiro work for large, existing codebases, basically the brownfield use case. The answer is, it depends on what you're trying to do. For spec-driven dev, you can ask Kiro to research what already exists; when you start a new spec, it will usually start by reading through the working tree. But the agent is generally starting from scratch; it needs to understand the system. In practice, that means that if your system already has good separation of concerns, if the components in your system are highly cohesive and coherent, it's going to do a great job, right? It's going to be able to say: this is the module that does this thing; I don't need to keep 18 things in my context to do my job. And it will do well. To take an example off the top of my head: if you were trying to launch an IDE very quickly ahead of an AWS launch, and you took on a lot of tech debt along the way that you need to unwind (and nobody here would do that, I'm sure, but in case you did, like me), then your agent might actually have a much harder time traversing the codebase, in the same way a dev would. So from that perspective, the more reliable things like your test suite are, and the more understandable things like module separation and decomposition of concerns are, the better the agent will do. And vice versa, of course. Now, for understanding the codebase (this is a bad example because this is a very small codebase), we do have things like code search and workspace... I don't know what to call these. Context providers. You can come in here and just say, I want to do code... what is it? I might have turned this off, actually. Oh, I did turn it off, because the codebase isn't big enough.
We'll do things like indexing in the background, so you can do semantic search over what you've got if you're just chatting. But in general, Kiro should go and do background search to figure out how to do its job. As the codebase scales up, it's probably going to do less well overall; that's one thing we're working on as a team. Did that answer your question, or did I glance off the side a bit? >> Yeah, I think I got it. >> Okay, cool. Anybody else? >> How long are you willing to wait for indexing to complete? >> So, one example I have is Code OSS. If it's not supremely obvious by looking at it, Kiro is a Code OSS fork, just like Cursor and Windsurf. One of the challenges we've had is that the Code OSS codebase is fairly large. There are other big ones out there, but that's my large codebase, because I'm now forced to work in it fairly frequently. And there's definitely some perceived slowdown when you're dealing with something large like that, especially around codebase indexing. It's a very active area of work for us, though. We're trying to do things like remove indexing from the critical path, so that you're not waiting on some slowed-down render thread because indexing is running. In practice, though... I mean, again, the agent may practically do less well, but we're going to be talking in a couple of weeks at re:Invent about how some features in Kiro were built via spec in a codebase we did not understand particularly well, because we're just not VS Code devs. And Kiro did a fine job of it. But again, that's a testament to the fact that the codebase is reasonably well structured. >> And if you've taken the time to understand how it works, it's very understandable. If you have not, it might be a little opaque to stare at. >> Yeah.
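The semantic search over an index discussed here boils down to embedding code chunks as vectors and ranking them by similarity at query time. Below is a minimal, self-contained sketch of that idea in TypeScript; a toy bag-of-words vector stands in for a real embedding model, and the file names and chunks are invented for illustration:

```typescript
// Toy semantic index: real systems embed chunks with an ML model;
// here a bag-of-words term-count vector keeps the sketch self-contained.
type Chunk = { file: string; text: string };

function embed(text: string): Map<string, number> {
  const v = new Map<string, number>();
  for (const tok of text.toLowerCase().match(/[a-z_]+/g) ?? []) {
    v.set(tok, (v.get(tok) ?? 0) + 1);
  }
  return v;
}

// Cosine similarity between two sparse term vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, x] of a) { dot += x * (b.get(t) ?? 0); na += x * x; }
  for (const [, y] of b) nb += y * y;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Rank indexed chunks against a natural-language query.
function search(index: Chunk[], query: string, k = 1): Chunk[] {
  const q = embed(query);
  return [...index]
    .sort((l, r) => cosine(embed(r.text), q) - cosine(embed(l.text), q))
    .slice(0, k);
}

const index: Chunk[] = [
  { file: "http_server.ts", text: "start http server listen on port" },
  { file: "jokes.ts", text: "generate dad joke pun response" },
];
console.log(search(index, "where is the http port configured")[0].file);
// -> http_server.ts
```

The point made in the talk is that results like these feed UI features and tool calls, rather than being dumped wholesale into the agent's context.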
>> In terms of indexing, is it just putting as much information from the codebase into context, or >> is there a way to create some kind of vector database of the whole codebase and then query it? >> Yes. So the question was, what do you mean by indexing? Because indexing can mean a bunch of different things. What I mean is that the agent is actually not handed the index directly; we keep the agent context as small as possible. We use the index mostly for secondary effects, things like a code search, or if I search for, say, "http server" across files here. We use it more for these types of UI than for feeding the agent, because the agent (this is somewhat anecdotal, and based on our benchmarks) does better when given less context but given the tools to figure out where to go find things. Something we've heard a lot about at this conference is incremental disclosure, and that's it again: we don't want to load too much at the beginning of the conversation with the agent; we want the agent to self-discover the right context for the task. >> Thank you. >> Yeah. >> Are you managing session length? Is there any kind of compression or pruning? >> Yeah. So the question was, how do we manage session length? We have no incremental pruning or incremental summarization today. You basically just accrete context until you hit your limit, which, since I'm on auto right now, is something like a 200k-token limit, similar to the Sonnet models. So we don't have a very sophisticated algorithm here yet. We've looked at a few things, but our number one concern is actually prompt-cache hit rate. In a normal use case, I can achieve something like 90 to 95% cached token usage per turn, which means my interactions are very fast.
Or at least, they're much faster than the alternative, which is sending 160k tokens to Bedrock cold. So that's one of the reasons we've not done much experimentation with incremental summarization. Our summarization feature exists for when you hit the cap, and it's not great; we're trying to ship an improved version very shortly, e.g. in the next couple of weeks, which should be faster. Today it's a one-off operation that can take up to 30 or 45 seconds, which is a horrendous experience. We're hoping to fix that and make it a real-time experience. >> The follow-up: for managing staleness between sessions, then, is that why you're relying on a persisted spec? >> Sort of, but that's not the only reason. Spec-driven dev is less about performance and more about reproducibility and accuracy of the agent. The way I, and I think we as a team, talk about it internally is this: if I spend ten seconds giving a prompt to the agent and it goes off and gets it wrong, that's kind of no skin off my back, right? I burned however many tokens and a couple of cents of credit with whoever my LLM provider is, but I only spent ten seconds writing a prompt. If I spend five to ten minutes with the system producing a detailed design doc, or even just a detailed set of requirements, I want it to do a fairly good job. If I spend an hour generating a design doc, reviewing it with my team, and then synthesizing from it, I want it to get it right. So the goal is not just latency but accuracy. No, it's a both-and; you need to do both. But spec comes more from the goal of having highly reproducible output. I'm going to go over here first and then you. >> Yeah. How does each of these task agents pass context to the others?
And are you only supposed to run this parent task? Because it just finished all of 3.1, 3.2, and 3.3, but then it still thought that 3.1 wasn't done and ran that and 3.2 again. >> Oh, did it? >> Yeah. Well, mine did. >> Oh, okay. Yeah. So the question is about running tasks in the UI, and I can just pull up my task list here. If I just hit start, start, start, each of these is going to be a new session, which means the context is completely unique. Personally, if I've got the context space to afford it, I just say "do all the tasks," because I find that more understandable and I think I actually get better performance. But by default, each task will be a new session that shares no context with the previous ones. The session is effectively just seeded with your specification ("here, you're working on a spec that does all this stuff," a block of text) plus "you are doing this task; don't do any other tasks, just do this one." So that sounds like a bug. >> Do they ever spin up sub-agents for certain things? >> We don't have sub-agents yet in Kiro; it's something we're working on. >> Yeah, because ideally, right, if we click on task three and I've got 3.1, 3.2, and 3.3 and they're separated, there's no good reason you couldn't have different systems working on them. >> Right here. >> We do have custom agents in the Kiro CLI that you can also run. >> Yeah, the Kiro CLI has a concept of custom agents,
which can be run as a task, and it's something we're playing with right now in Kiro Desktop. And I think you had another one. >> Yeah, sorry if I missed this, but in the spec folder, as you do more and more of these tasks over time, >> yep >> is it all in one design/requirements/tasks set, with your whole project defined there, or does it group by...? >> That's a good question. The question was: as you generate more specs over time, are you just creating one massive spec? No. Let me open a different project. This, for example, is the Kiro extension, a first-party extension inside the Kiro IDE; it's where the agent itself lives. We have pruned some specs, but there are specs in here that we can talk through, or I can just demo. The way I think about it is that a spec represents a feature or a problem area in the project. So, for example (let me blow this up a little), some of these are just experiments. We've done things like: could we have a prompt registry? Could we have a prompt-registry file loader? They may or may not make it all the way to production. "I want telemetry on the chat UI." Each of these might represent a few days of work for an SDE that somebody will go off and do. agents.md support is a good one: I basically said, research what agents.md is and build support for it the same way you build steering. That spec is fairly unlikely for us to come back and revisit, so I may just delete it, which is what we've done with some of the older ones. But a good example of one we might come back to is our message-history sanitizer.
One thing we had issues with early in the development of Kiro is that we would send invalid sequences of messages. Say the Anthropic API required tool responses to come in the same order the tools were invoked, but the system wasn't doing that. So we built this whole sanitizer system that has a bunch of requirements around... let's see... yes, very specifically: "When a conversation is validated, the system shall verify that each user input is either non-empty content or tool responses." We had cases where empty strings would get passed in alongside a tool response. This is a good example of a spec where, over time, we've added, maybe not to the requirements themselves, but to the acceptance criteria of the requirements, as new validation rules are uncovered. >> Yeah. >> So how do you handle that? For example, you have telemetry up there. If a feature needs telemetry, is it going to go back and update that spec too, or...? >> It should, yeah. Usually you'll see... let me just start a new chat here. No, that's a terrible idea. So here, in spec mode, I've made a request to add UI telemetry to the thing. "I'll help you add it. Let me first check if there are any relevant runbooks, then explore the codebase and plan the implementation." It might do a little research here, and then, flip of a coin (again, it's an LLM), it may or may not discover the existing spec. But ideally, after doing its research, it will say: there already exists a spec for UI telemetry; I'm going to go amend that one. And if it doesn't, in this case I would come in and just ask it to, as the operator of the system. Over time, again, we want that to be easier, so you as a user don't have to think about it so much. We can watch it while it chugs along.
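The validation rule quoted above ("each user input is either non-empty content or tool responses") can be sketched as a tiny filter. This is an illustrative reconstruction, not Kiro's actual sanitizer; the message shape and field names are invented for the example:

```typescript
// Illustrative sketch of one sanitizer rule, not Kiro's real implementation:
// a user turn is valid if it carries non-empty text or at least one tool result.
type UserTurn = {
  text?: string;
  toolResults?: { toolUseId: string; output: string }[];
};

function isValidUserTurn(turn: UserTurn): boolean {
  const hasText = (turn.text ?? "").trim().length > 0;
  const hasToolResults = (turn.toolResults?.length ?? 0) > 0;
  return hasText || hasToolResults;
}

// Drop turns that would make the provider reject the whole request,
// e.g. an empty string with no accompanying tool results.
function sanitize(history: UserTurn[]): UserTurn[] {
  return history.filter(isValidUserTurn);
}

const cleaned = sanitize([
  { text: "" },                                                   // invalid: empty
  { text: "", toolResults: [{ toolUseId: "t1", output: "ok" }] }, // valid: has tool result
  { text: "tell me a joke" },                                     // valid: has text
]);
console.log(cleaned.length); // -> 2
```

A real sanitizer would also enforce ordering constraints (tool results matching the order of tool invocations, as mentioned above), but the shape of the check is the same: validate each requirement from the spec against the outgoing message list.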
>> Is there anything preconfigured in Kiro that makes it work better with AWS? >> No, not really. >> Was that a question? >> Oh, the question was: is there anything in Kiro that's preconfigured to make it work better with AWS? No. We are, purposefully... we are brought to you by AWS, so, you know, Andy Jassy and Jeff B. pay my check, but we're not an AWS product that's deeply integrated with the rest of the AWS ecosystem. That said, I still answer emails when somebody asks, "Why is this other thing we built with AWS not working with Kiro?" Yay. But similarly, if you're building on GCP or Azure, or running some on-prem system, the product should work just as well for you. That's our goal. >> A good answer, potentially, is the AWS documentation MCP server. >> Yes. >> There are MCP servers you can add into any of these things that will make it better. >> Yeah, that's a good point. In this case, I actually had to add the AWS documentation MCP server here. We could of course have bundled it natively, but I don't want to ship it to customers who don't need it, because AWS docs are not the only docs we might care about. By the way, coming back to your question: it did find the existing spec for telemetry. It read different sections of it, and now it's actually making amendments to it, so we can follow the diff as it shows up here. It has added new requirements to the pre-existing spec. This is effectively another case where we're mutating the system, as opposed to just appending a never-ending spiel of specs. >> I guess what I'm wondering is, how did it know or decide where to put the spec, if you break your project down into these different categories? >> Yep. >> I would imagine there's crossover. >> Yeah. I mean, that's sort of software development in a nutshell, though, right?
Like, how do you actually define the seams between different parts of your system, different concerns, the product? >> Right, but if you want to build something... I have a task and it's going to >> require changing three or four things >> yep >> it's going to change three or four specs, and then run tasks across three or four... >> Oh yeah, no, it should not do that. Again, I don't have a good example on hand that we can run, but, by the way, the question was this: say I have a spec for security requirements, a spec for API design (the API shapes), and a spec for logging, and I am changing something in the public API interface that is a security-facing concern because we're redacting PII from logging. I think that's a semi-tangible use case we can all imagine coming down from our governance teams. I would imagine that you either pick one of those specs to load the requirements into, or you create a cross-functional spec. But that would come down to you, as the operator, making that decision, in much the same way as you'd decide how to actually implement it: you would not necessarily implement a PII API-redaction module as a standalone thing; it's going to be a cross-cutting theme across your codebase, I'd imagine. >> It's also a good example: multi-root workspace support came out when Kiro went GA on Monday, so now you can drag in different projects. In your example with APIs and auth and even the frontend, you can bring in those projects if you have them separately and still work across them. >> Yeah. Thanks, Rob.
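The PII log-redaction example above is a textbook cross-cutting concern: wrap the logger once and every call site benefits without changing. A hypothetical sketch of that shape (the regex patterns are deliberately simplistic, purely for illustration, not production-grade PII detection):

```typescript
// Toy PII patterns, for illustration only; real redaction needs far
// more robust detection than two regexes.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const PHONE = /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g;

function redact(line: string): string {
  return line.replace(EMAIL, "[EMAIL]").replace(PHONE, "[PHONE]");
}

// The cross-cutting part: wrap any logger so every caller gets
// redaction for free, with no change at the call sites.
function redactingLogger(log: (msg: string) => void): (msg: string) => void {
  return (msg) => log(redact(msg));
}

const logs: string[] = [];
const log = redactingLogger((m) => logs.push(m));
log("user alice@example.com called 555-123-4567");
console.log(logs[0]); // -> "user [EMAIL] called [PHONE]"
```

Which spec a change like this lands in (security, logging, or a new cross-functional one) is exactly the operator decision described in the answer above.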
What's the mental model? The spec generates the code after that? Like, what code? Can you specify how that works? >> Yeah. So we have now synthesized the spec: we sat down and defined the requirements, design, and task list, and I've had Kiro go through and run all the tasks in this spec. It ran them one at a time; it basically worked on small, bite-sized pieces of work, chunk by chunk, and now it's done. What we've actually produced is not just the completed spec. It went into my agent and did a few things in the CDK repo, because it's doing persistence to S3. I'm sure it added a bucket. Yep: a new bucket, encryption, yada yada. It then went into the agent and added the S3 checkpoint saver. It looks like it created a checkpointer, adds it to the graph, and passes it all the way through the system. And the S3 checkpointer, I'm sure, knows how to write the checkpoints to and from S3. So we have gone beyond just defining the system; we've now produced it, delivered it end to end, including property tests, I believe. >> Oh, I have an answer to an earlier question about specific AWS-related features that make things easier: the Kiro CLI comes with the use-AWS tool, which helps with the CLI. >> Yeah. Yep. What Rob's pointing out is that the Kiro CLI, which we just rebranded this week, has a use-AWS tool, which is basically a wrapper over the AWS SDK to make some of those things easy. But again, bring your own use-GCP tool as an MCP server if you were so inclined, if that's your tool of choice. And I believe, don't quote me on this because the CLI is new to me, that you can turn off tools in the CLI as well. Let me know if that's not right, Rob. >> Yeah. So, there you're actually not restricted.
In the desktop product today, you can't control the native built-in tools, but in the CLI, you can. >> So I intuitively get the benefits of having a spec. Have you done any work to empirically see how a project or a problem would have worked with or without one? >> Yeah. We do have benchmarks. I don't have the data off hand, but I think part of it is in our blogs. If you go to the kiro.dev blog, we talk crisply about some of the lift that things like property-based testing give to task accuracy. The science team is always working on that stuff. >> ...a blog about specs. I'm curious about... >> Yeah, a Distinguished Engineer for databases. >> His blog post really steps it up. I don't think it has the specific data you're asking for, but I think it will be useful. >> Yeah. >> How does it work... I understand the feature side of it, but how does it work on the non-functional side, like latency, dealing with the somewhat harder problems? >> Well, yeah. That is ultimately the goal here, right? We're saying you make a slightly larger investment up front, but we believe the structure we're bringing is going to help increase the accuracy of your result. So while we've got a team of people basically working on making spec better, my job when I fly back to Seattle is to make Kiro as a whole much faster. One, execution time and lagginess in the UI; two, how do we get tokens through the system faster, how do we get responses to you faster, so you're not sinking as much cost into Kiro to use a spec? >> Yeah, but I'm not talking about the Kiro tool itself; I mean the code generated from the spec. >> Oh. Oh, yeah. Okay. You mean the non-functional requirements of the generated code? So that's going to come down to, I think, what you're specifically trying to do.
One of the slides I had here talked a little about how to tweak the process and the artifacts for your use cases. You could very easily add something like: I want non-functional requirements for speed, runtime, and things like lock contention to be considered in the design phase. That's something you could certainly add. >> So you could generate the code in Rust or Java? >> Yeah, totally. >> And it will vary in the non-functionals depending on what language you generate. >> I mean, it would have to; there's no other way to approach it. Again, I'm familiar with Node, so I'm doing everything here in Node, but you can use this with any language. I think technically we say we support JavaScript, TypeScript, Java, Python, and Rust, but in practice there's no reason this doesn't work with any language. It's just an LLM; there's nothing language-specific or framework-specific in the system. And, for those of you... there was a conference earlier this week hosted by Tessl, who are doing a sort of specs-as-knowledge-base approach, and their argument is that as long as you've got the right grounding docs in there, it should not matter what you're building; it's all informed by the context you're building for your system. >> This is also a really good point for steering. With steering you can get the agent to develop code the way you want. Being a developer is all about making trade-offs, and the problem with your model out of the box is that it's so polite, because it's trying to be everything to everyone. So especially with latency and cost and things like that, just tell it in steering what you want it to prioritize, and that will influence any code that gets generated. >> Yep. >> Even how it designs, based on that, as well.
So if there's something that's very specific to your use case or your industry or whatever, just shove it in that steering file. >> Yeah, that's exactly right. For example, I have Kiro generate commits for me, and one of the things I personally care about is being able to distinguish commits I write from commits Kiro generates, the ones that come from the system. So my steering doc, while short, includes my requirement that every Kiro-generated commit be attributed to the co-author "Kiro agent," which is trivial, but I also want it to happen every time. In this case, it just generated a commit co-authored by Kiro agent. That's an example of how you can put whatever you want in there, not just things related to git commits: you could do code style, code coverage... "Whenever you add a spec or a new module, make sure you annotate it with coverage minimums of 90%," because that's the thing I care about. You can put anything you want in there. The good news is, it looks like what we built works. Kiro is very happy with itself, at least, and it looks like all tests passed. So we can deploy this to the backend and see how things go. We're technically just about at time, so if anybody has any other questions, I'm going to stick around for a while. But thank you all for joining, listening, and learning a little more about spec-driven dev.
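As a postscript: the commit-attribution steering described above boils down to the standard git Co-authored-by trailer convention. A hypothetical sketch of what that steering is enforcing (the name and email are placeholders, not Kiro's actual identity):

```typescript
// Append a git Co-authored-by trailer so agent-generated commits are
// distinguishable from human ones. Illustrative sketch; the email is a
// placeholder, and Kiro applies this via steering rather than a script.
function withCoAuthor(message: string, name: string, email: string): string {
  const trailer = `Co-authored-by: ${name} <${email}>`;
  // Idempotent: don't duplicate the trailer on repeated calls.
  return message.includes(trailer)
    ? message
    : `${message.trimEnd()}\n\n${trailer}\n`;
}

console.log(withCoAuthor("Add S3 checkpointer", "Kiro", "kiro@example.com"));
// Add S3 checkpointer
//
// Co-authored-by: Kiro <kiro@example.com>
```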