AIE Miami Day 2 ft. Cerebras, OpenCode, Cursor, Arize AI, and more!

Channel: aiDotEngineer

Published at: 2026-04-21

YouTube video id: DeM_u2Ik0sk

Source: https://www.youtube.com/watch?v=DeM_u2Ik0sk

Hello. Hello. Good morning.
How's everybody doing?
>> Whoa. Okay, that's the energy I'm asking
for.
>> No way. You came back.
Well, I also know some people uh just
came here as their first day. So, uh
welcome back everybody. And for people
who are new, welcome to AI Engineering
Miami day two.
>> Okay, some quick questions. Who learned
something awesome yesterday that they
can't wait to try very soon?
>> Oo, I see hands.
>> Okay, who made some new LinkedIn
connections?
>> Oh, I see hands.
>> Who made more than five LinkedIn
connections?
Um,
>> okay. I'll take note of that.
>> I hope you get replaced by AI.
>> Who's that person?
>> All right. And um, who got drunk last
night?
Don't be shy. Show me your hands. Who
got wasted? Never mind. I'm surprised
you made it back. Anyways,
>> wow. Well, because I'm so excited about
the agenda for today. So, uh, no matter
how much you drank last night, uh, we
have a great lineup of talks today. So,
ranging from G2I, our organizer, all the way to Cursor. So, a lot of great talks. Actually, I'm super excited about one talk in particular. It's super futuristic, but I'm not going to spoil it for you. So, uh, stay tuned.
And at the end of the day, I'm going to
ask you to guess which one was the one
that I'm most excited about. So just
keep tuning in and find out all the
talks that we have organized for you
today. Okay. So I have a question for
you all. Who is a world-class engineer?
Yes. Yes. That's the confidence that I'm
looking for. Every single one of you
actually sitting in this room is a
world-class engineer. And also the
person sitting next to you, even the
people who don't like other people are
also world-class engineers. So I want
you to really take this moment to learn
more from all the talks today. Really
try to be that world-class engineer that
you truly are and also talk to other
people and make connections so that
together we can all build a better
future with AI. So that is the goal for
the conference for today and I'm just so
grateful that every single one of you is
here and you are making connections and
making that possible for us.
>> I forgot to ask one question. Who's
loving the MC's this year?
>> Good, good, good.
>> Who's loving the MC's?
>> Thank you all.
>> Yeah. Well, we're going to keep the
energy for our first speaker of today,
David House. Iman, what do you think
about David House?
>> I'm I'm really jealous of his jacket.
>> You'll see. You'll see.
>> Well, so David will be talking to uh us
representing G2I, which is the organizer
for the conference. He's going to talk
about how we're transforming programming
mindsets. So he's going to give us some
case studies in agentic coding adoption.
So relevant to every single one of us.
And I can't wait to introduce David and
let him show off his jacket. So give it
up for David.
>> Good morning everyone. Hi. Uh so I am a
software engineering manager but my
background is actually in mental health
counseling. Um I was trying to figure out how to marry these skills together
and recently I saw a unique opportunity.
I've been hearing about all of these
people adopting coding agents but
there's also a lot of anxiety, a lot of
fear, a lot of existential dread. So I
wanted to learn more about that. Also
this is my very first conference talk.
So, please clap.
So, there's a lot to learn. Um, how are
people adopting these tools?
And I I got into the weeds with
different interviews, people on the
team. I asked them questions about how
they adopted the tools, what their
experiences were, what their feelings
were. And what I ended up learning about
was a model of successful adoption. Our
first case is Ava.
Uh Ava's first experience with AI was with ChatGPT.
She used ChatGPT for uh coming up with
ideas for family events, kids
activities, all kinds of things that she
could review herself and trust the
output. She was the only one who needed
to sign off on it.
But when it came to work, she held a
very different standard. She didn't copy
paste code. She didn't trust the output.
And uh contrary to her work with AI with
her family, there were other reviewers
that she was very mindful of. She was
very mindful of her reputation and she
didn't want to ship slop.
However, when she joined G2I,
the project was very different. uh I'll
tell you more about how the project was
structured later but she joined a
project where agent coding was the
expectation from the beginning
and the whole project was built around
agent use.
She learned that different techniques for working with agents could produce better output, output that she could actually trust,
and she realized that a lot of things
had to change about her development
workflow. After months of using the
prescribed workflow, she internalized
those lessons enough to be able to
create a sub agent for herself for a
more specialized testing pattern.
So when I watch YouTube videos about AI,
I hear slop cannon and I hear skill
issue.
It's really easy to polarize between the
two, right? But like if you're someone
who's saying skill issue, maybe you were
someone who said slop cannon. And if you are someone who's saying slop
cannon, maybe there's something for you
to learn to get to the point where you
can say skill issue. The point I'm
trying to make is there's something in
between those two poles. And that's what
I'm really interested in.
I think both are true, right?
And when you learn how to work with AI,
then you can start to differentiate how
to avoid slop cannon and hopefully how
not to be uh arrogant about it.
Okay, I agonized over this slide for a
long time, but now it's my favorite.
Here's
my claim for all of you is that for a
beginner, an agentic framework should
constrain their input.
A beginner doesn't know when they're
going to make mistakes.
But for an expert, someone who has spent
a lot of time with agents, someone who's
internalized the development practices
of how to work with them, an agent
framework should actually amplify their
input.
So an agentic framework should shape the
input of the human in addition to
improving the agent's output.
What I suspect is that
these skills are not intuitive for any
engineer.
Agentic use was not part of any of our
education. So we can't be expected to
know it.
Let's talk about what happened at G2I.
So
the case studies here are all from a
very specific project at G2I. Everyone
was onboarded to the same agentic
framework
which relies very heavily on
documentation and staged handoffs.
It starts with a /brief skill, where the agent assumes a product manager persona, interviews the user, and the output of that conversation is a product brief.
That product brief becomes input to a /spec-build skill, where the agent assumes the role of a technical architect, again interviews the user, and the result of that is a technical specification
that's provided as input to the /code or /test-driven-development skill,
where both documents act as very specific, very well-reasoned prompts with embedded judgment from the engineer.
And then we close with a /review skill to make sure that the agent covered all of the implementation as defined in the doc, and then a draft-PR skill to save you a little bit of time and spend a few extra tokens.
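To make the shape of that handoff concrete, here is a minimal sketch in Python of what a staged, document-driven flow like this could look like. The skill names, personas, and the run_skill helper are illustrative placeholders, not G2I's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    brief: str = ""
    spec: str = ""
    diff: str = ""
    review: str = ""

def run_skill(persona: str, instructions: str, context: str) -> str:
    """Placeholder for a call to an agent running with a fixed persona and prompt."""
    return f"[{persona} output for: {instructions}]"

def staged_workflow(user_request: str) -> Handoff:
    h = Handoff()
    # /brief: agent plays product manager, interviews the user, emits a product brief
    h.brief = run_skill("product-manager", "Interview the user and write a product brief.", user_request)
    # /spec-build: agent plays technical architect, turns the brief into a technical spec
    h.spec = run_skill("architect", "Interview the user and write a technical spec.", h.brief)
    # /code or /tdd: both documents act as the well-reasoned prompt for implementation
    h.diff = run_skill("implementer", "Write tests first, then code to satisfy the spec.", h.brief + h.spec)
    # /review: check the implementation against the spec before the draft PR
    h.review = run_skill("reviewer", "Verify every spec item is covered by the diff.", h.spec + h.diff)
    return h

print(staged_workflow("Add passwordless login"))
```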
But don't get distracted by the
framework.
I'm not here to try to tell you that
this framework is better than any other.
And I'd probably go as far as to say
that most frameworks are pretty much
doing the same job.
I break that job down into three
things. Number one, revealing the hidden
practice of agent coding. Number two,
making the work of the agent reviewable,
both its reasoning and its output.
And number three, training the engineer
in effective delegation with the agent.
A beginner might throw one prompt to
rewrite the whole codebase in Rust.
But a more experienced engineer will
know to take that step by step.
Our second case is Lucy, who has four years of engineering experience.
Uh Lucy got started with coding agents
in a prior job to G2I.
She had success writing a migration script. Uh, if any of you know about Vite and know about barrel files, you know exactly how hard this is.
The agent could automate the boring
stuff and the agent could uh build a
script that encoded a lot of judgment.
Then
Lucy's success on that project motivated
her to give agents a try from the very
beginning with a very complex personal
project. But she ended up with a lot of
duplicated code.
Code that she didn't understand, code that was too much to review, and this was also using technologies that she wasn't familiar with. It was very
challenging to work with and very
demoralizing.
When she joined G2I,
the framework that we used
told her something that felt obvious in
hindsight, which is that you have to
tell the agent to do the right thing.
You have to tell the agent to write
tests. You have to tell the agent to run
the tests and make sure that they're
passing.
So she learned that the practice of agentic programming is building the judgment you have about how to be an engineer into prompts and into the review mechanisms.
She said that she does not take for
granted that the agent will do what you
want.
So Lucy used the G2I framework for
months, but she doesn't use the
framework anymore.
Now Lucy's method is much more interview
focused.
She'll spend a lot of time going back
and forth with the agent, not
necessarily using a strict prompt flow,
not necessarily expecting a specific
documentation result, but she trusts
that during this interview, she's able
to
build in as much judgment as she
possibly can. She's able to steer the
agent during the interview,
and she also doesn't necessarily trust
herself to be a perfect provider of
information. So the interview loop helps
her remember the things that she had
forgotten to say.
Next case: Antoine. Antoine's been an
engineer for 15 years.
He's a startup founder and he is
meticulous. Uh Antoine is someone who
has a reputation at G2I for being able
to get anything done. No matter what it
is, no matter how hard it is, it's
something we have to do. Antoine can do
it. So, uh, we end up being very careful
about where we direct his attention.
Antoine's skill and attention to detail
meant that when he was working with
agents,
he was very focused on the results.
His early experiences with agents were
very disappointing. He realized he
needed to be vigilant and reviewing the
code as output taught him that he needed
to be very cautious.
So he swung from early exposure with
agents to hyper vigilance that
diminished the value that he could get
from coding agents.
But when he joined G2I, the TDD skill
significantly improved the output, which
improved his trust, which allowed him to
trust the agent to do more, which gave
him better results, which improved his
trust.
And he's still skeptical,
but he's tuned his skepticism to be more
context specific.
He now continues to use the agentic
flow, but to hear him talk about it, he
he can't necessarily describe how the
way he does it now is different than the
way that he did it prior.
He talks about the hundred nuances, about the ways that his judgment, built over time working with agents and reviewing their output, has become skills that are essential to how he writes code today.
Our next case here is Dale. Dale's been an engineer for about four years. His first experience with AI was using ChatGPT for coding. Um I was actually working with Dale at the time. We were both working in a PHP codebase. Neither of us were very good at PHP. And I was like, "Hey Dale, you can use ChatGPT to explain collections. You can use ChatGPT to explain Laravel." Uh
it will have infinite patience for you
and infinite time. I had lots of
patience, but I didn't have a lot of
time.
So he asked it for help with code, uh, understanding it but not actually writing it. Similar to Ava, Dale was
very concerned about reputational risk.
He didn't trust he also didn't trust
that the agent had enough information.
So with ChatGPT, intuitively he knew that the agent couldn't read the whole codebase. So he was always unsure, every single time, whether or not the output was misaligned based on the available context.
He was also more junior at the time. This is a few years ago. So he didn't trust that he could actually validate the correctness of what was being output by ChatGPT.
When Dale joined G2I,
he was uh a little hesitant. Um but it
was clear that uh it was clear to him
that he needed to figure this out in
order to be successful.
He had some early successes
uh with agents, but went a little off
the rails and ended up with a 10,000-line
PR.
He figured, okay, well, the agent can do
the whole epic, right? So, let's just
have it do the whole epic and then
learned very quickly that that's not
necessarily a very good idea.
He learned to narrow the scope of his
delegation and make sure that he could
uh that the agent wouldn't get away from
him.
Now, uh
differently than I think Antoine,
differently than Lucy, Dale has moved
away from the framework, but his way of
prompting today is very very similar to
it.
Instead of using the skills to generate
a document that is then used as input,
Dale writes the entire document himself.
And he encodes the same guard rails and
judgments that the framework made sure
to include by default. He knows now
intuitively to include them himself. And
he can then use his judgment to be even
more precise about the guidance.
So then he spends 20 minutes in his own
words writing this prompt,
but he trusts the agent well enough and
he trusts his own delegation skills well
enough that when he hits enter, he's
very happy to walk away and come back to
a PR that he'll review as though it was
a PR from his teammate.
So I think the point I'm trying to make
is that successful adoption is about
internalization of these practices.
The engineers when they started working
with agents described feelings of
disempowerment.
They described feelings of
sort of reacting to situations, not
necessarily steering the output.
But as the engineers learn more about
agents, they
learned how to be the ones in the
driver's seat. You could say they
learned agency.
But what about junior engineers?
When I hear people talk about junior engineers,
I tend to hear one of two things.
I hear: if you find a junior engineer who has already learned AI, then you're set. It's fine. Thanks.
But if you don't, then juniors are
screwed. So again, it's another one of
these polarities. Either you find
someone who's already learned the skills
or you don't know what to do.
So I think we can do better than this.
Andy is a member of our team, fresh out
of college.
His whole experience with AI was being
told not to use it by his faculty, but his tutors encouraged him to be a little more nuanced. His tutors had to coerce him into using AI like a tutor because he was spending a lot of time at the, uh, tutoring center.
He was open to having AI augment his
understanding,
but all throughout school, he wrote
every single line of code that he
submitted for his assignments.
When he joined G2I, it was a total
culture shock.
Andy started to reexamine his hesitancy
when he saw senior engineers with much
more experience than he had finding
success using agents.
but he didn't necessarily trust himself
to be a guide. And for good reason. He
hadn't spent a lot of time in the field.
He hadn't built a lot of software. He
hadn't made the mistakes.
But what he found made him much more
successful
was when he brought a senior engineer
into the loop of his work.
So if you remember we have this agentic
framework at G2I where you use skills to
do document handoffs.
He would take the documents before he
gets into any code, put just the
documents up for PR
and get reviews from senior members of
the team.
That allowed the senior engineers to
encode their judgment into the documents
which made the docs more successful,
made his implementations more
successful, but also served as a
teaching opportunity and a mentorship
opportunity.
Also, Andy is looking for his next gig. So, let me know if you want to
meet him, but uh don't wait because he's
already interviewing.
So, don't give up on juniors. Um, Andy
became a steady contributor in three
months on the team despite being the
only junior engineer on the team.
Junior engineers still need mentorship.
But agents expedite that because they
can serve as tutors for the engineers
when all the senior engineers are busy
with their own implementation tasks
and
agents expedite that mentorship as well.
Because the documents that the junior
engineers create can serve as high-level, high-leverage review artifacts
where then the implementation can be
much more successful building the
confidence of the junior engineer.
To quote one of his colleagues, quote, I
was shocked to hear this was your first
role out of school. You fit right in
alongside folks with 10 years of
experience. That's incredible.
So many people are convinced of software engineering doom. Maybe
we're on a timeline where the profession
of software engineering becomes
irrelevant. Maybe that's true.
But what if everyone trying to predict
the future got it wrong?
What we do today matters.
I want us to look back on how we're
approaching this work and how we're
approaching our field from the
standpoint of opening doors for
everyone.
I don't want to give up on the hesitant.
I don't want to give up on the conflicted. I don't want to give up on
the people who haven't had that
experience
of being shown how to be successful with
these tools.
I'm not interested in leaving anyone
behind.
So you can try the framework if you
want, but again don't be distracted by
the framework.
Use BMAD.
Use RPI, or the acronym that adds more letters to RPI that I can't remember from yesterday.
Krispy. Yes.
Try a framework that slows you down so
you can review the output.
Thank you, David.
So everybody wants to run the latest
model, right? The fancier one with more
parameters and uh that ends up being
very costly and uh we have tremendous
debt and that's latency debt. The next
speaker is here to save us from that
debt. Please welcome to the stage Sarah
Chiang, head of DevX at Cerebras.
Heat. Heat.
Sweet. Well, good morning everyone. I um
me and my team, we actually just came
from California, so it's it's a nice
6:30 a.m. talk.
Um, but it's great to be here and uh I'm
excited. All right, so recently I've
been learning a wide variety of new
vocabulary words.
Cogitating.
Words like reticulating, schlepping.
Some of you know where this is going.
And the reason I've been learning all of these words is not by choice, but because every time that I try to use Claude Code, I get these words for like 35 minutes.
And I also get, you know, writing and
waiting and thinking, but we already
know those words.
And if any of you guys have been playing
around with LLMs or using them recently,
you've probably had a pretty similar
experience.
Slow code generation. We write massive
prompts. We submit massive diffs or
honestly we hit enter and we go on our
lunch break and we come back and pray
it's done. It's frustrating. It kills
the sparks and joys of AI coding. And
today we're going to talk about why this
is all happening.
So my name is Sarah Chang. I am the head
of developer experience at Cerebras and
we are building the world's largest and
fastest AI processor.
And a big part of my job is introducing
developers to fast coding for the very
first time. And it's a very exciting job
because oftentimes we get a lot of oohs and aahs and "oh my god, this exists."
And outside of Cerebras I also create content on YouTube and TikTok and Twitter.
So before we dive into everything that
is frustrating and going wrong, let's
talk about all the amazing things that
are going right. So in the last few
years we have seen models get so much
bigger. We have gone from models that are 0.3 billion parameters to over 1 trillion parameters. And a lot of these
are open source models too. And as you
can imagine this also means we're
getting much smarter models.
Context windows are also getting much
bigger. If you so desire, you can shove
millions of tokens worth of context into
your model. And what's exciting is that
we're actually seeing developers use
this context.
So this is a study by Open Router and
A16Z.
Open Router is a unified API layer for
LLMs. And they did a study where they
sampled 100 trillion tokens worth of
real world tokens across different use
cases and capabilities. And what they
saw is that input token lengths have
gotten three times bigger, sorry, four
times bigger in the last year alone.
And on the flip side of that, output
tokens have also gotten so much bigger.
So the number of output tokens per
prompt has also increased by 3x in the
last year.
And we can see this is because not only
are our models outputting the final
answer, but we're also outputting
reasoning tokens. And so reasoning
tokens are the intermediate tokens that
a model will output when it's thinking
step by step before producing the final
answer. And we can see this coupled
nicely with the rise of reasoning models
as well. And so we've had a lot of
incredible reasoning models over the
last year. And so our models are getting
smarter. Our models are getting bigger.
All of this is wonderful news.
And this, my friends, is quite literally
the best technology that humans have
ever built. And so now let's look at
this best technology in action. This is
Opus 4.5
doing a very simple HTML snake game. Sorry, pool game.
Awkward.
You all wish you could go leave for a
snack, but you can't.
And this is if this was in Claude Code,
this is where we'd be learning new
vocabulary words.
This goes on for minutes and I won't
bore you, but
I'm bored.
You're probably bored.
And as it turns out, if you've been
playing with LLMs, you've probably had a
very similar experience.
But unfortunately with all of this
exciting innovation, there's a very
unexciting side effect.
And I like to call that side effect
latency debt. So while we've been
innovating on all these models, they've
gotten bigger, they've gotten more
intelligent, we put more tokens in,
there's more tokens out, we have
accumulated a hidden cost while
optimizing our models faster than our
infrastructure.
And what I mean by that is: we've improved on so many fronts, but now let's look at speed over the past two years. We look at the Gemini families, we look at the Claude families, the Sonnet models: our speed has stayed the same. We have always stayed between 50 to 150 tokens per second. And now remember that even if your speed stays the same, if the number of input tokens increases and the number of output tokens increases, then the total time that it's going to take for your prompt to complete is also going to increase. And when that increases, the developer experience decreases.
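As a rough back-of-the-envelope illustration of that point, the numbers below are made up, but they show how flat decode speed plus growing outputs stretches the wall-clock time per request:

```python
# Illustrative numbers only: if decode speed stays flat while outputs grow,
# wall-clock time per request grows roughly linearly with output tokens.
def request_seconds(output_tokens, decode_tokens_per_second):
    return output_tokens / decode_tokens_per_second

print(request_seconds(2_000, 100))    # ~20 s: last year's shorter outputs
print(request_seconds(6_000, 100))    # ~60 s: 3x more output tokens at the same speed
print(request_seconds(6_000, 1_200))  # ~5 s: the same prompt if decode jumps to ~1,200 tok/s
```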
And so now, this is the exciting part.
So now is when I get to tell you that
all of this is about to change. In fact,
it is changing. We've accumulated so
much latency over the years. And now we
have a solution.
So about a month ago, we at Cerebras and
OpenAI partnered up to release Codex
Spark. It's the first state-of-the-art
coding model that can generate code at 1,200 tokens per second. And to put
that into perspective, that is 20 times
faster than anything you're seeing here.
And not only that, but Codex Spark is
just the first of many future models
that we can expect to be a step function
faster than anything that we've been
frustrated with before. And so not only
are we unlocking new capabilities and
workflows and use cases, but this also
fundamentally changes how we as
developers can interact with the coding
model. No longer do you hit enter and
you leave, but now you can actually sit
down with the code coding model. Treat
it like a pair programmer. Verify the
outputs, steer it. No longer do you have
to generate machine-scale volumes of code and develop technical debt. Now you can actually verify your code. You don't have to create bad code. And so now
let's look at that same graph again. And
let's look at Codex Spark. It's so much faster that we had to adjust the y-axis. And remember Codex Spark is just the
first of many future models that we can
expect to be a step function faster. And
at this point it's a complete regime
change. We are entering a regime of fast
inference. And now we are entering a
regime where the coding model can code
faster than we humans can keep up with.
Now we are the bottleneck.
And so why is this happening? We've been
accumulating this latency debt for so
many years. But what is suddenly
happening right now that is making all
this possible?
So the exciting part is that it's
because the entire inference stack is
getting optimized all at once. And many
of you in the audience are probably
contributing to this in your day-to-day
work. And so let's break it down. Let's
take some moments. Let's start with the
very bottom. The hardware. Hardware is the physical reason that speed is possible. It is what inference and training run on, the physical device that all of the workloads are on. With hardware, one of the biggest considerations is what I like to call the memory wall. So during inference, we have memory movement: the movement of weights, KV cache values, etc., between the memory and the actual
chip. On traditional hardware like the
GPU, all of this memory is stored off
chip. And so we're constantly moving all
of these values between on and off the
chip. And that memory movement is why
the hardware is accounting for 50 to 80%
of the total inference latency.
What we're now seeing is newer approaches where we're thinking about how we can reduce this memory movement as much as possible.
This is an example from Cerebras, building our own version of the AI processor, where instead of storing our memory off chip, all of our weights and activations and KV cache values, we're storing them on chip in distributed SRAM, so that every single core that's actually doing the computation has direct access to all of the values that it needs.
Even more exciting is the commercialization of disaggregated inference. This is why Nvidia bought Groq for $20 billion a few months ago. And this is also why Cerebras and AWS just released a partnership to serve the Cerebras wafer-scale engine and AWS Trainium together. What is disaggregated inference? In traditional inference we have two steps: we have prefill and we have decode. Prefill is when we're actually processing all of those input tokens, and we can do this in parallel. We're embedding them and adding them to our KV cache. So this step is compute-bound. Our second step, decode, is when we're actually generating every output token, token by token. This is sequential, and this is memory-bound, like what we were talking about before. And so we have two different steps with fundamentally different requirements, and we've always been serving them on the same piece of hardware. Well, no more. Now we're splitting it up so that decode is happening on memory-optimized hardware and prefill is happening on compute-optimized hardware.
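A toy sketch of that split, with prefill and decode modeled as separate stages; real disaggregated serving also has to ship the KV cache between machines, which is omitted here:

```python
# Toy sketch: prefill (parallel, compute-bound) and decode (sequential, memory-bound),
# modeled as if they ran on two separate devices. Real systems also have to ship the
# KV cache from the prefill machine to the decode machine, which is omitted here.
def prefill(prompt_tokens):
    # every input token can be processed at once; here we just build a fake KV cache
    return [(tok, i) for i, tok in enumerate(prompt_tokens)]

def decode(kv_cache, max_new):
    out = []
    for step in range(max_new):
        # each new token depends on everything generated so far: strictly sequential
        out.append(f"tok{len(kv_cache) + step}")
    return out

kv = prefill("write a two player pool game in html".split())  # compute-optimized hardware
print(decode(kv, max_new=5))                                   # memory-optimized hardware
```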
And now let's move up the stack: the model architecture. There are so many ways that we have adapted our models to run especially well on the hardware that we have. We think about model size, shape, the layers, the dimensions. And so this isn't particularly new, but here is a great example of one way that we're doing that. So mixture of experts is where, instead of activating the entire model for every single token, we only activate a small subset of experts. And what this gives us is the intelligence of a much larger model for the compute cost of a much smaller model. Again, we're always thinking about memory, just like the hardware thing we were talking about before. But more recently, what we're seeing is that researchers are building on top of this, so we're seeing a lot of exciting new advancements for how we can make our models fit even better on our hardware, especially for speed. An example here is REAP, router-weighted expert activation pruning. Mouthful. Where instead of activating, again, the entire model, we're looking at a specific use case, we're seeing which experts are activated, and we're just masking and removing and pruning the ones that we don't need.
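A simplified illustration of the routing idea, with a toy top-k mixture-of-experts layer and a crude stand-in for REAP-style pruning (just masking experts a use case never picks); the shapes and routing math here are illustrative, not any production model's:

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, D = 8, 16
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]  # toy expert weights
router = rng.normal(size=(D, N_EXPERTS))                        # toy router weights

def moe_layer(x, top_k=2, keep=None):
    """Route one token: activate only the top-k experts (optionally from a pruned set)."""
    scores = x @ router
    if keep is not None:
        # REAP-style idea, crudely: mask out experts this use case never needs
        scores = np.where([i in keep for i in range(N_EXPERTS)], scores, -np.inf)
    chosen = np.argsort(scores)[-top_k:]                  # only these experts run
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.normal(size=D)
print(moe_layer(x).shape)                  # full expert pool available
print(moe_layer(x, keep={0, 3, 5}).shape)  # pruned pool for a narrow use case
```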
And now the final layer of the stack is inference optimization. Again, this is where a lot of you are probably working. Philip from Baseten gave a great talk yesterday. This is where companies like Baseten, Modal, Together, and Fireworks are all working. And one of the biggest considerations at this level is KV cache reuse. So by storing and reusing previously computed token representations, we don't have to recompute attention over the entire sequence every single time.
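A toy illustration of prefix/KV-cache reuse, where only the unseen suffix of a prompt gets recomputed; compute_kv here just stands in for the expensive prefill work:

```python
kv_cache = {}   # maps a prompt prefix (as a string) to its computed "KV states"

def compute_kv(tokens):
    return [f"kv({t})" for t in tokens]              # pretend this is the expensive part

def prefill_with_reuse(tokens):
    cut = 0
    for i in range(len(tokens), 0, -1):              # longest prefix we've already computed
        if " ".join(tokens[:i]) in kv_cache:
            cut = i
            break
    cached = kv_cache[" ".join(tokens[:cut])] if cut else []
    states = cached + compute_kv(tokens[cut:])       # only the new suffix gets recomputed
    for i in range(1, len(tokens) + 1):              # remember every prefix for next time
        kv_cache[" ".join(tokens[:i])] = states[:i]
    return states, len(tokens) - cut                 # how many tokens we actually computed

system = "you are a helpful coding agent".split()
print(prefill_with_reuse(system + ["fix", "the", "tests"])[1])   # 9: nothing cached yet
print(prefill_with_reuse(system + ["refactor", "auth"])[1])      # 2: shared system prefix reused
```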
And so all of these are investments that we've been making, and all of these are investments that we're slowly starting to see come to fruition. And it's incredible. We've had this latency debt that we've accumulated for years, and now look at where we're at. So now I want to show you what a state-of-the-art model looks like on state-of-the-art AI-optimized hardware. Again, it's the same two-player pool game, but this time it's running on this. This is the Wafer Scale Engine 3, Cerebras's own version of our hardware, so it's not running on GPUs. And this is Codex Spark inside of Codex. And it's done.
There it is.
And if you still don't believe me that fast inference is real, you probably do, but just to be, you know, sure, let's look at the money. Let's look at where the biggest companies in the industry are putting their money over the last six months. Number one, Google versus Nvidia. Google releases the TPU, four times faster than the Nvidia GPU.
This is Nvidia's biggest customer investing at the infrastructure level. We have Google and Anthropic. Then what happens? Anthropic comes in and buys those TPUs, tens of billions of dollars worth of them.
As I previously mentioned, Nvidia bought Groq for $20 billion. This was Nvidia's biggest purchase ever. Groq had been building the LPU, their own version of a faster AI processor.
And most recently, and most excitingly, not biased at all, OpenAI buying Cerebras hardware.
This happened a few months ago, and Codex Spark was the first model that they released in this partnership. And again, we can expect many future models to be a step function faster in latency.
And so we can see all this money movement, because what's exciting is that latency debt has been slowly accumulating.
We saw the biggest players not only recognize latency debt but invest in latency debt. And now we're seeing that investment pay off.
And so now as we think about these new
eras that we're entering and we're
thinking about entering this new regime
where the model codes faster than we can
keep up with, I want us to take a little
trip down memory lane at previous eras
that we have.
So in 1985,
it's a history lesson now. We switched
from custom hardware to CPUs. Why?
Because our workload changed. We went from scientific simulations and mathematics to needing to create highly programmable software apps.
In 2003, our workload changed again. Now
we have machine learning and graphics
processing. So once again, our
infrastructure changed. We went from
CPUs to GPUs because now we cared about
high levels of parallelization.
Where are we now?
We're in 2026 and once again our workload is changing. We now have multi-agent systems. We're building 500-plus agent swarms, human-in-the-loop agentic workflows, and once again our
requirements change. And what's exciting
is that we've already seen people invest
on this level. We've seen people invest
in the infrastructure. And now right now
is when all of that is paying off.
I cannot communicate enough how exciting
it is to be a developer right now. not
this year, this month, this day because
we are right in the middle of a step
change as we enter a new paradigm where
the coding and developer experience is
going to get so much better. And the
thing that I want to emphasize is that
it's not just about faster models, but
it's about the fact that the developer
experience is going to change. The way
that we as developers interact with the
models is going to change, and our ability to no longer produce bad code is going to change.
So, thank you guys so much. My name is
Sarah Chang. Um, if you have any
credits, if you need any credits or have
any questions, reach out to me. Um, my
handle is Milks and Matcha across the
board. Thank you.
All right, that was a great talk from
Sarah. Okay, so our next speaker is Le.
And then fun fact, I think Le might be
the most likely to survive if there were
a power outage here because he has all
sorts of gadgets with him. He even has a
flashlight. Uh he has a pointer, a laser
pointer, and his LLM models are
deployed locally on his mobile. So even
if there's no Wi-Fi out here, he will be
able to make it out alive. So I'm so
excited to hear how he's able to do
that. So Le is going to tell us about
ambient generative AI and how he
deployed latent diffusion models on his
mobile phone NPUs. So welcome on stage,
Le
Heat.
Hey, heat. Hey, heat.
Hello.
Okay. Um
I'm um I'm so surprised that people are still so much inspired by nature.
Um actually uh we are just trying to mimic leaves, mimic birds, uh mimic how they just fly. And yet we also try to mimic one of the most fundamental physical laws, uh which is randomness.
My name is Le Kalinowski and I am a physicist. I work at Callstack. It's a Polish uh Polish company, and today I want to present to you my uh original research about the deployment of diffusion models on mobile. So um this is a part of a bigger project uh which is related to hyper user experiences.
Um and I work in the research and development center. That means we just try to innovate on the part of the user interfaces and try to uh push the boundaries um of the implementation of AI in that area.
So randomness, it's uh really well known and has been used across many years in algorithmic computer science. So um one of the most famous algorithms is the Monte Carlo method, where you can, for example, calculate the area of a circle or you can simulate the chain reaction um in nuclear physics.
Um and today um I just want to uh tell you uh why I do this and um where I got my inspiration to uh build that type of project. So currently the biggest world labs in AI work with randomness. What that means is they try to optimize and build LLMs which are fast and cheap.
So for example uh Google just introduced TurboQuant, and at the core of that algorithm there is randomness.
Um some other examples: some models cost like $120 per 1 million tokens, and that is a really expensive model. This is why we are looking for the fundamental laws and trying to be inspired by them to build better, faster and more optimal models.
Um and this is the reason why I wanted to try to deploy diffusion models in a mobile application in the best possible, most optimal way.
So I started to think about how I can do it, because as you probably know from yesterday, and Google just showed it, diffusion models and picture generation are really, really difficult and computationally heavy. That means you need to have a huge GPU and generation takes a lot of time.
So um there are a couple of constraints that you have to keep in mind, like the thermal profile of the phone. And actually I just got an idea: maybe I can use the NPU, the neural processing unit on the smartphone, to deploy a diffusion model there directly and then try to run the diffusion inference there.
But that's not enough. Uh the main idea was also to get rid of the full pipeline of changing text through embeddings and uh building the full prompts through it. So I just, you know, threw it out and got the idea to use a direct numerical value uh from an ambient sensor. Of course, um here is an example. That means it can be a different sensor, it can be an accelerometer, it can be something different, but um it actually works.
So what I did, I just designed a full pipeline with zero cloud API calls to, you know, outside, non-local LLMs, and I built a fully measurable system to check it out and prove that diffusion models are possible on local mobile devices.
Um here's the pipeline. The most important thing is the latent update, and as you probably can see here, uh there's a stochastic perturbation scale, and the randomness is implemented here.
Uh one more thing, I don't know if you all know what diffusion really is. Diffusion is made by uh adding noise to our data sets. For example, we have a, you know, data set with a lot of, I don't know, maybe dog pictures, and we are adding layers of noise there, and then training the model like in transformers, which means we are minimizing a function. At the end we have a huge noise in the latent space representation, and to get an inference from the other side we are just decoding that noise, which means decoding that randomness. And for that we also use, like in regular LLMs, transformations. That means we are decoding, denoising the information there, and we are doing it iteratively, which means it requires a lot of compute power.
Um so um the logic is quite simple.
That means first the ambient sensor, then mapping, after it conditioning, which means the full pipeline to get that data mapped to the latent space, then the latent update function, the runtime session, and the full telemetry to measure all of the, let's say, results.
Uh here's a bit more detailed view of it. Actually the full application is designed in blocks. The first block is an input and encoding layer. That means it's not that simple to um get the raw data from the sensor and map it to the latent space. So the first mapping is that we are getting the ambient light sensor, then building a profile on top of that with full normalization, and then building the latent vector.
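As a very loose illustration of that first block, and not the speaker's actual pipeline, here is a sketch of turning a raw lux reading into a normalized value that scales a stochastic perturbation of the latent; the smoothing method, latent shape, and scale are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
LATENT_SHAPE = (4, 64, 64)   # a typical latent-diffusion latent size; assumed, not from the talk

class AmbientConditioner:
    """Illustrative only: smooth a raw lux reading and use it to perturb a latent."""

    def __init__(self, alpha=0.2, max_lux=10_000.0):
        self.alpha, self.max_lux, self.smoothed = alpha, max_lux, 0.0

    def update(self, lux):
        # exponential smoothing stands in for whatever stabilization the real app does
        self.smoothed = self.alpha * lux + (1 - self.alpha) * self.smoothed
        return min(self.smoothed / self.max_lux, 1.0)     # normalize to [0, 1]

    def perturb_latent(self, latent, lux, scale=0.5):
        level = self.update(lux)
        # the sensor value directly sets the stochastic perturbation scale: no text, no embeddings
        return latent + scale * level * rng.normal(size=latent.shape)

cond = AmbientConditioner()
latent = rng.normal(size=LATENT_SHAPE)
for reading in [120.0, 800.0, 12_000.0]:   # fake sensor readings in lux
    latent = cond.perturb_latent(latent, reading)
print(latent.shape)
```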
So um I also built a full fallback. That means, to be sure, if I'm running out of memory there or some kind of thermal uh, you know, incident arises, I just cut off the whole application, because I didn't want to lose my phone. Um and here is the API and the native Android bridge, which gives me the opportunity to deploy the model directly on the NPU. And I also used ONNX Runtime to do it.
The next part is the full diffusion pipeline with the Android native session manager, and here is the model: this is the UNet and the variational autoencoder, that means the part where we are just encoding the, let's say, prompt. Then we have the UNet denoise loop, that means the core of the diffusion process, with 8, 16 and 24 steps, and then we are just decoding it to the tensor and an RGB um picture.
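For orientation, here is a shape-level sketch of what that UNet-plus-VAE loop could look like with ONNX Runtime; the model file names, tensor names, execution provider, and the simplistic update rule are placeholders (a real app would use a proper scheduler and the provider matching its NPU stack), and it will not run without exported models:

```python
# Shape-level sketch only: a UNet denoiser and a VAE decoder exported to ONNX and run with
# ONNX Runtime. File names, tensor names, the step count, and the naive update rule are
# placeholders; the execution provider depends on the phone's NPU stack (NNAPI assumed here).
import numpy as np
import onnxruntime as ort

providers = ["NnapiExecutionProvider", "CPUExecutionProvider"]  # assumed Android NNAPI build
unet = ort.InferenceSession("unet.onnx", providers=providers)
vae_decoder = ort.InferenceSession("vae_decoder.onnx", providers=providers)

def generate(latent: np.ndarray, cond: np.ndarray, steps: int = 16) -> np.ndarray:
    x = latent.astype(np.float32)
    for t in reversed(range(steps)):
        timestep = np.array([t], dtype=np.int64)
        # the UNet predicts the noise to remove, conditioned on the ambient-sensor vector
        noise_pred = unet.run(None, {"sample": x, "timestep": timestep, "cond": cond})[0]
        x = x - noise_pred / steps               # stand-in for a proper scheduler update
    image = vae_decoder.run(None, {"latent": x})[0]
    return np.clip((image + 1) / 2, 0, 1)        # map to [0, 1] RGB

# `latent` and `cond` would come from the ambient-sensor mapping sketched above
```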
Um at the end I also collect you know
the the pictures itself the all of the
artifacts and the meta data. Um and uh
at the end we have the gallery
and full telemetry.
Uh okay. Um I think this is the uh um
the funniest part uh of my today's
presentation.
Uh let me show you first the simulation
on on on a on a computer just before I
show you working application on my
phone.
Okay, this is what it looks like. It's maybe not um impressive at that stage, but uh give me a second. Okay. So I can
change the
the simulation of light and get an
instant inference on the device
here. So let me also show you the whole measurement part. That means we have full metadata captured: ambient light, smoothed light, stabilization, temperature, and everything which is necessary. One of the most important parameters for me was the stabilization and the latency. Okay, let me uh let me show you right now how it works on the smartphone.
Oh, here here's my private private
phone. It's not not not sponsored by
Samsung.
It's my private one. Um
Actually, let me... probably you can see my screen. I will switch off all of the internet connections, go to flight mode, so we'll be sure then um it's not fake.
Uh
okay. And here is it.
Maybe I would just
write it up.
So um I'm really glad that there's a lot of light here, because it's working. Um here we have a change in lux. That means we are just capturing all of the data from the sensor, and what you are seeing right now is the full inference of the diffusion model deployed on the NPU. It works really, really fast. I know this is not like a pixel-wise representation, because to do it really optimally um I needed to cut off the whole, you know, the whole part of building real text-to-embeddings prompts. And um what is really important, as you probably can see, it is extremely fast and it is really stable. That means with this you can build some kind of other, different applications and use it to, I don't know, try to build some kind of um different pictures, or you can even try to build an animation with this, and this is not something which will kill your phone at the end.
Um
Okay, let me show you um briefly the results of it. Uh so here we have the parameters I captured: artifacts, how many artifacts, true, this is a single experiment. Um robust latency, it's around 600 milliseconds, that means really, really fast. The full inference, it doesn't cost that much, and it's really stable. That means you can use that type of architecture for your um future projects. But um I didn't want to finish on the, let's say, um weird RGB colors on the phone, uh and I just forced that application to build something um something uh nicer. Um and just before the conference I decided to build something more. Give me a second, I will just uh find the application for it.
So I decided to generate a heart, um and to try, with some geometrical tricks and constraints, to force the model to generate a real heart. And actually that works. So um I'm really happy that uh I was able to do it, uh and I'm really happy that um I can show you um the recent progress on the hyper user experiences
uh today to you. Uh to wrap it up,
uh I think you probably should think about the NPUs and try to use them wherever possible to optimize and speed up your applications, because all of the modern mobile phones already have them. But I think not everyone uses them in modern development. We can, for example, build an animation with this. Um and it's a really good option.
So I think um I'd just point out that you can use it in an unconventional um application, and it's just waiting to be discovered in your projects.
And if you just want to learn more about Callstack and our work, because in our incubator and R&D center we have already open-sourced quite a lot of, let's say, artificial intelligence projects, like for example an evaluation benchmark for React Native applications, um then grab the code and take a look there. Thank you very much.
Great. Thank you, Le. Uh, quick note,
uh, for our coming up speakers, please
show up at the G2I booth at the expo hall so they can prepare you and bring you
backstage. And, uh, I promised Le and
David that I would, uh, compare their
jackets. Well, he has a suit. Um, those
of you who liked David's jacket, please
make some noise.
Okay,
those who like his,
>> I think we have a winner. Uh, all right.
So, we have a break right now. Um,
please, um, get your coffee, but be back
here around 10:55 as we have the next
talk starting then. Thank you.
Ladies and gentlemen, please take your
seats. Our event will start in 5
minutes.
Good luck. Bye.
Hey everyone, welcome back.
I would like to take a moment to thank
our sponsors uh without whose support
this would not have been possible: CodeRabbit, Mintlify, Cerebras, Sentry, Cloudflare, Tailscale, Modem, Ampify, Auth0, Google DeepMind, City Furniture, and Encrypt AI. Let's
hear it for our sponsors, shall we?
Our next speaker, I asked him in the
back, uh, what's the craziest thing you
do? He said, I code all day, but I have
a foot pedal. It's like a productivity
hack apparently. A foot pedal when he
presses it, it's it it starts the voice
commands and he can he can talk. Turns
on the speaker I suppose.
>> Oh, no. Whisper flow if you know what
that is.
>> There you go. I don't know what that is.
>> It's like a voice detect.
>> You have to explain that uh during your
talk then. Um so he does that. Also, he
tries to take, did I say it? 210 grams
of protein a day.
>> You're giving giving a home out.
>> All right. Uh it's my pleasure to
welcome Tis on the stage and uh he he's
going to talk about a lot of inference
optimization. He's worked on Tesla Autopilot before, and now he has his own startup, and uh his focus is inference optimization and how to make uh codegen more optimized. There you go. Thank you.
>> All right. One sec.
Hey, Dad.
Dude, I swear I I practiced this before.
Sorry. One sec.
>> Yeah, I'm not seeing anything from HDMI.
Hey, hey, hey.
What are we doing?
I think so.
>> Okay, as Tis is setting up, uh here's a
reminder also with the background music
that there is a afterparty today. So,
uh, if you didn't get a chance or you're
held back from getting wasted last
night, tonight is your chance. It's at
7:00 p.m. at Thrilled Show. And with
that, I think Tis is ready.
>> Cool.
>> Take it away.
>> Yeah. Hey everyone. Uh, I'm Tis. I'm the
founder of Morph. Uh, and today I'm
going to talk to you about sub agents
and specialized models.
So, you might remember this. Uh, this
happened like last year where Trump went
to the White House. Uh he was demoing a
Model S from Elon and he said, "Wow,
that's beautiful. Everything is
computer."
Uh I thought that was great because it's
actually accurate. Like everything is
actually computer, right? Like even the
windows inside a Model S are like a mini
computer.
Um and so what I'm getting into now is
um this is uh Andrej Karpathy's sort of software 1.0, 2.0, 3.0 paradigm. Uh so if you remember, software 1.0, this was humans writing code that programs computers. Software 2.0 was um weights programming neural nets, which is sort of like a flexible fixed-function computer that you can do things with like detecting stop lights. Uh software 3.0 was humans prompting, uh, using prompts to program LLMs.
prompts to program LLMs.
Uh but a new primitive has appeared. Uh
so now we have sub task and composer as
a as a background um search tool and
then cognition with Swed.
Uh so what I think is the next sort of
step past software 3.0 is what I call
software 3.5 which is sort of agents
prompting other agents. Um you still
have the human at the top prompting that
main agent but you sort of have some
people call this agent swarms. I think
that's too vibe-codey of a term. Um so
yeah and so essentially what a sub agent
is is super simple. The simplest
definition is just using a separate
context window for a task.
Uh this has been proven to be useful for coding agent performance.
Um so yeah so it's sort of this third
brand new primitive. I think the first
two were system prompts and tools.
Uh the question becomes should sub
agents be specialized models or should
they be general models? And I think the
right answer is that it depends. Uh but
the interesting part of this talk is
when it's actually specialized models.
So first, the cost of using a frontier model. So frontier models are
really good uh especially at coding and
reasoning but they take a lot of compute
and they're super super expensive. So
they're not really great for every task.
Uh so why why should you even bother
considering a specialized model? So from
a macro perspective, we have this huge
compute scarcity crisis right now where
um like there's no H200 or B200 capacity
across the board if you look at GCP or
AWS. In fact, right now I even have
agents in the background trying to find
some capacity. Um but basically this is
a macro pressure to like pressure
people. This sort of macro pressure
pushes us to specialized models because
then you can use less compute to do a
task to like the same accuracy.
Um this is just a graph saying the same
thing. Basically what you want to do is
the minimum compute to do a task well
and this sort of just arises from this
this compute constraint.
And so our answer at Morph is uh we
basically do specialized model and
specialized inference.
And so we think stuff like code search
and context compression should move to
specialized models. Uh whereas stuff
that still needs frontier compute like
coding, planning and reasoning should
stay frontier.
Uh so one of the things we can do when
we specialize a model is that we can do
things that you wouldn't want you
wouldn't do to a general model. For
example, a frontier model lab right now
they typically won't train on a parallel
tool calling harness because you get
this catastrophic forgetting problem where
when once you do RL on this parallel
tool calling harness, you start seeing
regressions on uh sequential tool call
performance. And so that's an example of
a trade-off we could make that uh a lab
might not want to make. And so that's
what we do for our code search model.
Our code model can do up to 12 parallel greps, reads, or list-directories, uh, up to six turns. And it can do this very fast on an inference engine that we make
that's designed for this workload which
is super pre-fill heavy like up to 80k
of token input and like around 200
tokens output.
Uh so another model we have is our our
compactions model. It's basically a
model that's trained to compact context
at 33,000 tokens per second. Fast apply
is our diff apply model.
And yeah, so we build these models
because I think coding agents spend most
of their time um doing things that don't
actually need frontier compute. And this
is like code search, compaction, and
applying code edits. And these happen
thousands of times, but just aren't
actually stuff that needs frontier
compute.
Yeah. So how do we make them good? So
good for us doesn't just mean like
building something that gets like a good
like F1 score for example. uh you have
to train these to be a good sub agent,
not just a good model. uh so what that
means is uh for us is that when we do RL
we train to sort of minimize this uh
this randomness that you get when you
introduce a specialized model. So like
let's say you have Opus at the top
calling one of our like our code search
model. One of the problems when you when
you have this sort of setup with a
separate model is that Opus might have
an assumption that your model is more
capable than it might be or uh that it
works in a different way than it
actually does. So during RL we have to
train to like minimize this randomness.
Um so that that's like one dimension in
which which you have to cover when you
like make a specialized sub agent.
And so how do we make these fast? Um so one of the most common ways people make models fast is called speculative decoding. Speculative decoding is sort of a thing, first of all, because you have this massively parallel machine that's a GPU, uh and you have this super sequential architecture that is the LLM. Uh so at the top there you see traditional LLM inference, where you just have to generate one token at a time, regenerate KV and all your stuff, and then generate the next token. Uh with speculative decoding you sort of switch this, uh so you have some heuristic to make a guess of, let's say, six tokens, and then you have your massively parallel machine, the GPU, just verify that those six tokens are correct. And the time is linear: to generate one token or verify those six tokens, those both take the same time.
And so traditionally with speculative decoding, you can have a smaller model, like a very, very tiny model like a 0.7B, do these predictions and then just have the big model that's like three trillion parameters verify them. Uh but you don't actually have to have a small model. You can do it heuristic-based. You can do it whichever way you want. You just need some sort of way to get those guesses.
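A toy version of that accept/verify loop, with both the drafter and the target model faked as canned token lists; on real hardware the verification of all k guesses happens in one parallel forward pass:

```python
# Toy speculative decoding: a cheap "drafter" guesses k tokens, the big model verifies them
# and keeps the longest correct prefix. Both models here are faked as fixed token lists.
def draft(prefix, k):
    # cheap heuristic / tiny model: just guesses a canned continuation
    canned = ["the", "quick", "brown", "fox", "jumps"]
    return canned[len(prefix):len(prefix) + k]

def target_next(prefix):
    # the big model's "true" next token; in reality this is one forward pass
    truth = ["the", "quick", "brown", "cat", "sleeps"]
    return truth[len(prefix)]

def speculative_step(prefix, k=4):
    guesses = draft(prefix, k)
    accepted = []
    for g in guesses:                                # on a GPU, all k positions are scored at once
        if target_next(prefix + accepted) == g:
            accepted.append(g)
        else:
            break
    accepted.append(target_next(prefix + accepted))  # always emit one token from the big model
    return prefix + accepted

print(speculative_step([]))   # ['the', 'quick', 'brown', 'cat']: 3 drafts accepted, 1 corrected
```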
Another common method is disaggregated prefill. So basically, prefill is everything that happens in the time to first token; decode is everything that happens after that. Uh so a common thing we're starting to see happen now is disaggregating those workloads onto separate chips. Uh this is sort of a thing because Nvidia's NVLink is so good. Uh Nvidia is starting to do this now with Groq, where Groq is going to be doing prefill uh and traditional Nvidia GPUs will be doing decode. It also works if you just use Nvidia for both. Um
but yeah
Uh another way is if you uh write kernels. Uh kernels are notoriously hard. It's a very fast-changing field. One day it's QDSL, the next day it's Mojo, the next day it's uh CuTile, and the next day it's back to QDSL. And so uh this one is the hardest one out of all the three. But these are the three main dimensions in which we do our inference optimization.
Uh for the really technical people out here, I think where the future is going is actually disaggregating further, where you take attention operations out into separate machines. So you might even have three or four separate chips, each specialized for prefill, decode, attention. Um, yeah, a lot of first-principles thinkers would think this is really stupid though, like Jim Keller and these people probably hate this, because intuitively you shouldn't need four separate chips to do this. You should be doing this all on one board, but reality plays out differently than theory.
So yeah, getting back to the topic, uh, specialized models are a great way to actually have a product moat, because everyone is sort of building the same thing, right? You have Cursor with multi-agent, with background search agents and Composer. You have Devin parallel sessions uh and their SWE models. Uh and across the board you have Claude Code, Codex, Windsurf, and I guess Grok. But uh
a way to differentiate is being faster
and cheaper. Uh and being faster and
cheaper can let you do lots of things
like you can go through more tokens. If you can run 10x more tokens, there's a lot of features that arise from that, that I think a lot of
people don't get. Uh and so we work with
around like 40-plus uh companies and their production agents. So
I'm going to just go through some of the
lessons I've had from working with these
companies. So the main one uh this used
to be hotter of a take uh like a month
ago and today it's like a super lukewarm
take, but uh everything becomes a coding
agent. And so that's like marketing agents, or a customer support agent where the agent writes SQL. Marketing agents, they do things like call Apollo APIs and, uh, send emails. And basically
across the board, we're seeing like like
our initial customers were just coding
agents like the Vercel v0s of the world, the Lovables. Um, and then you
start seeing like this shift into
different companies like Zo Computer,
sort of like a personal assistant uh,
that writes code and does stuff for you.
And um yeah, basically everything is
becoming coding agent. It's pretty
obvious now. Uh and from everything
becoming coding agent, you start to see
a lot of tools you no longer need. You
can start deleting stuff uh because you can just do it with code.
Uh code is essentially the feature that
builds all features. Uh and so you're
only bottlenecked by how how good the
your your base model is at writing code.
Another interesting fact, and this is consistent across almost all of our customers, is that when you double speed, you roughly double conversion rates, provided you don't hurt accuracy. If you take the cohort of people who didn't hit a traceback or a new error, you'll double conversion rates within that cohort. So if you can double speed without hurting accuracy, it's very, very valuable.
And again, context length matters. That's why you should use sub agents.
So I'll go into some good practices for building sub agents. If you're building a search sub agent, it's very simple: you just put natural language in. If you're doing code search, the input would be something like "where is the authentication logic?", and the sub agent at the end would output file paths and line ranges. That way you keep the output tokens that sub agent generates very small, so it can be very fast. And this leads to improvements across the board no matter which sub agent you use. For our search sub agent we see roughly a 3% SWE-bench Pro improvement across the board, but it's a similar improvement if you use Claude Code's search sub agent or any of them.
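As a rough sketch of that contract (illustrative only: the keyword grep stands in for whatever the sub agent does internally, and a real one would use a small, fast model to decide which hits matter):

```python
# Code-search sub agent contract: natural language in, file paths + line ranges
# out, so the sub agent's output stays tiny and cheap to return to the main agent.

import re
from pathlib import Path

def search_subagent(question: str, repo_root: str = ".") -> list[str]:
    # Hypothetical keyword extraction; a real sub agent would reason about this.
    keywords = [w for w in re.findall(r"\w+", question.lower()) if len(w) > 3]
    hits = []
    for path in Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(k in line.lower() for k in keywords):
                hits.append(f"{path}:{lineno}-{lineno + 5}")
    return hits[:20]  # compact output: paths and line ranges only

# e.g. search_subagent("where is the authentication logic?")
#   -> ["auth/middleware.py:42-47", "auth/tokens.py:10-15", ...]
```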
For task agents, where I think you should still be using a frontier model rather than a specialized model, you should be sharing the prefix cache. Basically, when your main model, Opus for example, wants to spin out three task agents, those should all share the same prefix so you get the caching and it stays fast.
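A minimal sketch of that prefix-sharing pattern, assuming a generic chat-completions-style client; `call_model` and the message contents are placeholders, not a real SDK call:

```python
# Spawn three task agents that share one identical prefix so the provider's
# prompt/KV cache is reused; only the task-specific suffix misses the cache.

SHARED_PREFIX = [
    {"role": "system", "content": "You are a task agent working in repo acme/api."},
    {"role": "user", "content": "<repo map, conventions, and plan context here>"},
]

TASKS = [
    "Implement the /login endpoint.",
    "Write unit tests for token refresh.",
    "Update the README for the new auth flow.",
]

def spawn_task_agents(call_model):
    results = []
    for task in TASKS:
        # Identical prefix first, task-specific suffix last.
        messages = SHARED_PREFIX + [{"role": "user", "content": task}]
        results.append(call_model(messages))
    return results
```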
If anyone's looked into the Claude Code leak, they actually have this thing called team create, where there's bidirectional communication. I think that's really cool as well. Basically what happens is your main agent, say Opus, can do a team create. It'll make a folder with a slug, with a JSON file for each agent, and it can send messages by writing to each agent's JSON file. Each of those task agents will be continuously polling its JSON file for updates. So you get this sort of bidirectional communication with no real protocol; it's a hacky way to do it. There's some other effort, I forget whose, to make a protocol called ACP, which standardizes some RPC format for doing this.
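A hedged sketch of that mechanism, just to show the shape of the JSON-mailbox polling described above; the paths, file layout, and function names are all hypothetical:

```python
# One folder per team, one JSON mailbox per agent; the main agent writes
# messages in, each task agent polls its own file for new entries.

import json, time
from pathlib import Path

def team_create(slug: str, agents: list[str]) -> Path:
    team_dir = Path(f"/tmp/teams/{slug}")
    team_dir.mkdir(parents=True, exist_ok=True)
    for name in agents:
        (team_dir / f"{name}.json").write_text(json.dumps({"messages": []}))
    return team_dir

def send(team_dir: Path, agent: str, sender: str, text: str) -> None:
    mailbox = team_dir / f"{agent}.json"
    state = json.loads(mailbox.read_text())
    state["messages"].append({"from": sender, "text": text, "ts": time.time()})
    mailbox.write_text(json.dumps(state))

def poll(team_dir: Path, agent: str, seen: int) -> tuple[list[dict], int]:
    msgs = json.loads((team_dir / f"{agent}.json").read_text())["messages"]
    return msgs[seen:], len(msgs)   # each task agent calls this in its loop

# main agent:  d = team_create("auth-refactor", ["impl", "tests"])
#              send(d, "impl", "main", "start with the /login endpoint")
# impl agent:  new_msgs, seen = poll(d, "impl", seen)
```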
I don't really care how it's implemented; the thing that's more interesting is this actual bidirectional communication between the main agent and the sub agents.
Cool. Hopefully you got out of this talk that everything is models. If you like working on inference optimization and RL, please email us.
All right, that was a great talk by Tis. And next up, I'm going to introduce Floridian Agentuity. I'm sorry, that's not his name. The next speaker is Rick Blalock, a Floridian who is the founder of Agentuity, and he's going to tell us a little bit more about coding agents and how they're eating the world. So welcome onto the stage, Floridian, aka Rick.
All right, let's do this here. Extended.
I can't see the screen.
Which one do you guys see? Cuz I don't
know what I'm seeing.
>> You don't want to see my nose.
>> How about now?
>> Perfect. All right.
That's what I want to hear.
Blow that up.
All right. Welcome to Miami, everybody. It's been a fun conference so far; AI Engineer is one of my favorite conferences of all time, and I'm super happy to be here. I'm glad you all are here, too. My name is Rick Blalock from Agentuity, and today we're going to walk through the last couple of years in the evolution of AI agents: how right now, in April 2026, it seems like coding agents are not just writing all the world's new software, but are actually slowly starting to become that software as well. By the end of this, I hope you have a clear understanding of where we're at and why a coding agent is a new fundamental of software. Much like you can tell where a software engineer is at in their journey, you can tell where someone is at on their agent journey, or even their level of psychosis, their agent psychosis. I'm afflicted by that; I'm sure many of you are as well.
But anyway, you can tell where they're at by how they describe an agent. You'll hear things like, "Well, it's just a workflow, right? It's just fancy RPA. It's just a few LLM calls. An agent is just a software loop, isn't it? Isn't it just a newer version of the chatbots we were building back in 2015? Isn't it just a better chatbot? Or coding agents, they're not really ready yet for software dev because they make mistakes." You hear things like that all the time, right? So you can kind of tell where people are at in this journey, and things have happened rapidly in the last two years. So we're going to talk through that.
Let's jump back two, three years ago. We said we wanted agency out of agents, and then we built systems really obsessed with control. We were calling them agents, but most of them were really just prettier workflows, or chatbot++ kind of things. And we tend to shoehorn our past experiences into everything. That's a human thing; we all do it. We're so used to solving problems a certain way with computers, decades of solving problems with computers a certain way. And by the way, that includes non-engineers, how nontechnical people think about using computers. So it's really hard for us to start from first principles when it comes to software engineering and using computers to solve problems. But a few years ago there were a few people starting to poke around at that and trying to figure out what agency means in software, and non-determinism.
So, how many of you remember when AutoGPT launched? I don't know, maybe 50 AI years ago, something like that. It became the number one trending GitHub repo. It had something like 160 or 170,000 GitHub stars; it was the fastest-growing GitHub repo of all time. And I don't know if you remember some of the spin-offs. There was ChaosGPT, which was tasked with destroying humanity. I don't know if you remember that, but it was popular because it offered a practical glimpse into what this kind of agentic future could be like. It was also popular because it was a spectacular failure from a user perspective: it got stuck in loops, it burned tokens needlessly, lots of bad hallucinations, as you can imagine with the models at the time. And if you really remember, on Hacker News there was a dev who just replaced the feedback loop with static code and got better results. I don't know if any of you remember that. Aram from Tom's Hardware said it was too autonomous to be useful.
So there are these glimpses, these snapshots in time, where we see that and some people's brains get hardwired to that moment: "See, agents aren't good. They're not ready." But there was a glimpse of hope, right? Somebody was trying stuff. It didn't work out; that's fine. We were getting somewhere with it. Then around that time, frameworks started popping up that simulated agency. A lot of them were actually still brittle, deterministic pipelines. The 2023 agent architectures were really just a lot of linear chains: step one, step two, step three, a lot of determinism.
And we were all trying to figure out what we wanted to do when we grew up with this, right? There's CrewAI, LangChain, AutoGen came out, n8n; there's a list of them. Some of them attempted some type of agency, but, kind of like AutoGen, there were problems because of the limits of the models. So we all fell into the trap of building what customers wanted right now rather than for where the models would be in six months. And what they wanted was just a better chatbot. They wanted determinism. So we were using agents, which are built for non-determinism, but we were using them with all this determinism layered in. That's when the frameworks tried to solve some of those problems. And the curious thing is, in the enterprise at the time, the expectation was that these projects, these prototypes, would take around six months, which, if you've been in this AI engineering stuff for any time, is an eternity. That's probably about four AI years. So six months to implement this thing with this framework, and then all of a sudden the models get better, and of course the frameworks get better, but then all of us start asking: wait a minute, why do we even need these frameworks? They're all kind of complicated anyway. Do we even need them now? And then there's Anthropic's own advice, back in December of 2024, which again doesn't seem that long ago. I'm going to read a quote: "We recommend finding the simplest solution possible. This might mean not building agentic systems at all." Well, that's interesting, right? Just a couple of years ago: it's complicated, all this orchestration theater and all this stuff going on.
Let me go back here real quick. Think about that orchestration theater. I'm sure several of you in here have gone through that and all its complexities, and then all of that gets removed in the last six to nine months, right? You go from chatbot to tool caller to workflow to orchestrator to multi-agent theater. Or, you know, remember Cursor added sub agents, and then Claude wasn't very good at using sub agents so they removed them, and then all of a sudden Claude is good at using sub agents again and Cursor adds them back. The advice one day was don't use sub agents, sub agents are bad, and now sub agents are good, and they are good. All of that is changing, all this turmoil. And we're preaching that agents need agency, and then customers are like, well, this is just RPA, this isn't agentic.
And then, during all that time, Cognition Labs demoed Devin. How many of you remember the demo of Devin? It was marketed as the first AI software engineer. Interesting. And of course, in the early days, there was a hype cycle that maybe outpaced the product a little bit at the time. Personally, I thought it was about the right balance of a little controversy and vision casting coupled with a decent product launch. But fast forward to today, and it's honestly one of the most capable agentic engineering products out there. And it's free from the nitty-gritty messaging of workflows and frameworks. I thought it was a breath of fresh air. I met Walden, one of the co-founders, in Salesforce Park last year, and I was telling him I really wanted to put Devin more into our flow, to integrate it more, and I was thinking traditionally, like frameworks and stuff: how do I stitch this together? How do I plug it in? And after talking to Walden it was like, well, just give Devin some scripts and some access to your tools and it'll just do that. So it was free from that MCP theater and workflow theater. And it was the first time I started thinking, oh, I should use this coding agent for more than just code, like managing our demos. We did Friday demos, they're on our YouTube, public demos, and Devin helped us. Devin literally would tell us what we were going to demo, because I can't remember what I worked on yesterday. So it was the first time I thought, oh, there's something here. And I appreciate that about them.
So all this is going on, and meanwhile we have customers and real businesses around the world asking: what are the use cases for agents? What are they? Trying to understand how they actually fit inside their company. I can't tell you how many times I've had conversations with business people asking, what are the agents? Is it this? Is it that? Only devs really understood the complexity of the orchestration and the frameworks, and it kind of muddied the waters, and maybe made it look a little bit like snake-oil salesman stuff to some of the business people. Or you get accused: ah, you guys are just crypto bros who are using AI now, or whatever. You get a lot of that.
But then take OpenClaw as an example. OpenClaw stitched together enough things that it helped normies understand what an agent, specifically agency, can be in software. What Devin did to help developers understand an agent use case for our world, OpenClaw did to help non-devs understand agent use cases for theirs. As engineers, we like to bash OpenClaw, right? It's all these scripts and complexities; it's insecure. But meanwhile, you have non-technical people buying Mac minis. You can't buy a Mac mini right now. I'm sure this was not on Mr. Cook's bingo card, that he would sell out of Mac minis. And they're copying and pasting instructions into an app they've never heard of called Terminal. They've never used it; they don't know what they're doing. Why are they doing that? Because they understand the use case. They're willing to suffer this weird pain because they understand the value they can get from it. And the models are at the point where it doesn't feel super brittle. So all of a sudden, an autonomous agent makes sense to a large group of people who didn't get it before. Before, I'd say "autonomous agent" and they'd go, "What the heck are you talking about? This sounds like crap." And sitting on top of all of that, what we're talking about is a coding agent.
So here we are in 2026, and if I say a coding agent is a universal software primitive, a lot more of us understand that this is actually pretty true. Code isn't just a developer artifact, and a coding agent isn't just a harness for developers. A coding agent can implement every other type of agent. Even a normie who can't describe what's going on can build multiple agents that help him manage his business. A coding agent can build a chatbot. It can do RAG. It can orchestrate other agents. It can automate its own workflow, not just yours. It can create a database, put stuff in it, and use it to answer its own questions. Try that sometime: take the canonical Titanic dataset and have it put it in a database. It'll create a data model, put things in there, and know how to keep it updated.
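For illustration, this is roughly the kind of throwaway script a coding agent might write for itself in that situation; the CSV file name and column names are assumptions about the usual Titanic dataset, not anything from the talk.

```python
# Load the Titanic CSV into SQLite, then answer questions with queries
# instead of re-reading the raw file each time.

import csv, sqlite3

conn = sqlite3.connect("titanic.db")
conn.execute("""CREATE TABLE IF NOT EXISTS passengers
                (name TEXT, pclass INTEGER, sex TEXT, age REAL, survived INTEGER)""")

with open("titanic.csv", newline="") as f:
    rows = [(r["Name"], int(r["Pclass"]), r["Sex"],
             float(r["Age"] or 0), int(r["Survived"]))
            for r in csv.DictReader(f)]
conn.executemany("INSERT INTO passengers VALUES (?,?,?,?,?)", rows)

# Survival rate by passenger class, straight from the database:
for pclass, rate in conn.execute(
        "SELECT pclass, AVG(survived) FROM passengers GROUP BY pclass"):
    print(pclass, round(rate, 2))
```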
You know, we attribute intelligence to animals: if they can build complex or compound tools, they have high intelligence.
And although I know some of you will
want to nitpick on what intelligence
means when it comes to a coding agent,
we are there with coding agents. Coding
agents will build their own tools for
themselves if you have the right
instructions for them.
So in 2026, convergence, I mean, this is the inflection point for us. Sandboxes are giving agents a body. Model intelligence is giving them the brain. We've got these good protocols, so we're starting to figure out the nervous system. Whereas a year ago, when I said agents need a different approach to infra, people were thinking, well, why does a framework need a different infra approach? I don't understand that. But now, with coding agents, they're like, oh yeah, it does kind of make sense. We do need something a little more purpose-built for a coding agent to run all the software, to be a lot of the software. Now it's a little more understandable.
Coding agents still need a lot of this. At Agentuity, we had to adjust our thinking. A year ago we started a horizontal developer cloud platform, and now it's more like an agent platform, a place for agents to run. And all these things here are what they need: a place to act, a place to persist, to live, a place to observe, to integrate, to manage. And there are still gaps in what we can offer. I mean, the boomer clouds, if I can call them that, are trying to sell you the infra from a decade or two ago. They really are. They have this veneer of cutting edge, but when you get in, it's all still kind of hacked together. Honestly, kind of like our personal coding agent setups; it's all kind of stitched together. And services are honestly still built for stateless web tech, so with a coding agent you're trying to shoehorn this into that. The story is always the same when I talk to people: oh, we started on serverless, tried to deploy it, but guess what, my agent runs for five minutes and serverless times out in 30 seconds. And then, invariably, after a bunch of experiments, you end up on EC2 and you're running it like we ran software in 2008, over and over and over again. That's what we're trying to do at Agentuity: make it so agents don't feel like they're running in 2008.
But with that in mind, put a pin in that. Back to OpenClaw and use cases for normies. Normies don't understand or know all the things I just mentioned about the infra, about what's needed for this stuff to work. Yet still, Mac minis are selling out at Apple, and they're copying and pasting and getting outcomes. And this is the mind shift I think we have to make: coding agents are a fundamental of software. Non-engineers already understand this. They do. They maybe can't articulate it in all these crazy ways that we can, but I think they probably grasp it faster than some engineers do. We're busy poking holes in the fact that coding agents make bad mistakes, crazy, stupid, idiotic mistakes.
And we're quick to point out agent psychosis, which I have, and a lot of us have. But then the non-engineers are using this stuff, and when we're sitting there complaining about it, they're thinking, "All right, are you using the same thing I'm using? Because I'm running my business on this." You know, if you're at a party and this guy's telling you about everything he's gotten done, and you're just waiting for the pause so you can say, "Yeah, but coding agents, they make mistakes," well, you're here, and I'm here; you see the difference? If you're that person, they're going to look at you and go, "I think you have psychosis. You have software engineering psychosis." That's what's going to happen.
Just as an example: there's a man I know, he's in his mid-60s, a very successful entrepreneur. He built a manufacturing company that did a bunch of work for Toyota, and now he's started a construction company in Texas. He was paying HubSpot tens of thousands of dollars a month for this company, and he said, I'm just going to build it on my own. I'm going to use coding agents, and I'm going to use coding agents to run the little bit of HubSpot that I actually use. And he did. He did it in about three months. Another example: on Saturday, up in Jupiter, Florida, I had breakfast with a guy who's 24 years old. He runs a window cleaning company, and he runs his business with a coding agent: his marketing, his sales, his sales estimates. So if I told him, ah, coding agents make mistakes, he's like, yeah, but I'm way up here in productivity, what are you talking about? There's a former CIO of a large financial firm who's in here today, and he uses it for his board meetings.
So my point is that the foundational agent primitive is already a coding agent for normies. Maybe they don't realize it, maybe they can't articulate it. And what kind of agents do you think are going to be tasked with working on the org chart, the focused work of the future? I think it's coding agents. If you peer just into the near future, six months, 10 AI years, coding agents will be used to create and manage other agents that manage certain parts of the org chart's work.
I mean, we're doing that with our competitive marketing. We have this thing called GEM, a set of agents that manage the roughly 50 competitors, or frenemies as we call them, that we monitor. These agents are constantly watching. I won't tell you exactly what they're watching, but it's all sorts of things, all sorts of signals. We know everything about where we are, where they are, the messaging: this company is probably about to pivot because they changed the messaging a lot in their sitemap, whatever. So it's managing a sizable part of our marketing, and this kind of thing will be used by non-engineers to manage their work and their organization. Leaders will direct coding agents, and that's not intuitive, by the way. So next time you're debating what framework you should use for building an AI agent and how it should work, pick a coding agent. It's kind of like Ben in the last talk yesterday; he talked about the different SDKs, Vercel's and Pi and the OpenCode SDK. Try that out. Try using a coding agent instead.
The good ones have an SDK. They have server-side handling of things. They have on-demand sandboxes that you can pick up and resume where they left off, and they have observability built in. That's also what we're trying to do heavily with Agentuity. We started as a developer platform and have moved more toward the agent itself being a customer, a persona of our platform. There's all this inbound interest from normies and people who want to run this stuff at scale and don't know how to. And it's all built on coding agents: multiplayer coding agents that watch each other, agents that tell each other what to do. So that's where we're going with this. So yes, Marc Andreessen said software is eating the world, and now I think coding agents are eating software. Thank you.
Okay, how's it going everyone? Having fun? Great. So, let's say I want to build my app. I have a lot of data about transactions and places, there are relationships between them, and I want my agents, or my app, to query this information. I have to store it in a database. I can dump it in blob storage, or maybe a SQL database. But is there a better way to store all of this information, especially if there are relationships between the entities? The answer is of course yes. The best way to capture these relationships is through graphs, and maybe the better way to store them in a database is in a knowledge graph. Our next presenter is going to talk to us about how we can optimize our context engineering by leveraging knowledge graphs. Please welcome Nia to the stage.
>> Thank you. Thank you, everyone. Hey, hey, hey. All right. Yes. Good morning, everyone. How are you doing?
>> Okay, okay, I see some energy. I love it. Well, I'm going to wake you up. So, good morning everyone. My name is Nia, and today we're going to get into effective context engineering techniques for artificial intelligence. As I like to do, by the way, I was a teacher in a past life, so I like to give people lots of different ways to remember the concept we're going to be talking about today, not just sit here listening to a lecture, but actually conceptually understanding what's going on. And I like to make sure that everyone, from chief security officers and chief information officers and CTOs down to a junior developer or an intern, understands the concepts. All right, so let's get right into it. I'm going to start off with a question to get us all engaged. So,
how many of you all, and this is going
to be a yes to pretty much everyone, but
how many of you have shipped an agent to
production? Lots of people, right? Lots
of people. Okay. So, all right. Of those
of you who've shipped an agent to
production, how many of you have had
that agent then do something in
production
that made you question your career
choices? Question your career choices.
Right. I shouldn't be an engineer. I'm
so sorry. I brought down prod. Customers
are crying right now. They are so angry.
It's happened. It has happened and
that's okay. Right? So let's get into a
scenario. Right? So this is actually
real table stakes. This happens every
day. Right? So a compliance team is
going to run an agent pipeline, right?
And that agent is going to retrieve
separate chunks. So that chunk is going
to be for example Jessica hypothetical
person works at Apex Global, right? And
then Apex is actually on the sanctions
watch list. Okay, that's interesting.
And then Jessica has actually requested
a $25,000 credit increase. So all three
of these chunks have been received,
right? And all three are sitting there
in our context window. So the agent YOLO
is going to approve it. is going to
approve this credit line increase. There
seems to be nothing wrong with Jessica's
request. She has excellent credit. She's
a 850 credit score. There's no problems.
We're just going to yolo and approve
that. We're just going to vibe bank real
quick. Y'all don't like vibe banking?
Y'all don't want vibe banking?
that agent is going to approve that
request not because the agent itself or
even the model is bad but because the
connection between these these facts is
actually never represented anywhere in
vector space. Right? So it wasn't just
the data, it wasn't the retrieval, but
the model actually had to do a guess and
actually it guessed wrong. Okay, it
guessed wrong. So, nobody caught that
until the lawsuit came.
Until the lawsuit came, and your skip level is now hitting you up asking who wrote the agent, why did the agent answer this way, why did the agent actually approve this request? And what are you going to say? Everyone is listening. You are sweating bullets over here. You are stressed. But this scenario does keep landing on my desk, and in the next 25 minutes I'm going to show you exactly what happens and the techniques we can use to fix it, and prove it live. Like I mentioned, this scenario always happens: you say you don't know, and you end up fired. Can't stand it. Need to pay these bills. So
what's happening here is this: text similarity, which is what we're relying on with vector search, finds documents that have similar meaning. Structural similarity, on the other hand, finds entities that have similar connections, and almost no one is actually building the second one. I want to give you this upfront so you know exactly what I'm trying to argue: every retrieval pipeline we're using in this room, naive RAG, hybrid search, reranking, all of it, operates on one dimension, the text similarity piece: how close is this chunk in meaning to the query itself? That can get you 70% of the way, maybe even 80%. But there's a second dimension that almost no one is building on, and that's structural similarity: not "do these two documents mean the same thing" but "are these entities connected in the same way?" So when we're thinking about a credit line increase for a corporate banking client, and separately about the fact that her employer is listed on a federal sanctions watch list, those don't mean the same thing. Your vector search is never going to surface them as connected; they're not going to look at each other. But the relationships between Jessica, her employer, and those sanctions, that isn't a meaning problem. It's a connection problem. And right now almost nobody is using an agent pipeline that can see connections. That's the vital dimension we're missing.
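To make the "connection problem" concrete, here's a sketch of that relationship as a single graph query; the node labels, relationship types, and local Neo4j credentials are illustrative assumptions, not the speaker's actual schema.

```python
# If Jessica -> employer -> sanctions list exists as an explicit path,
# one traversal surfaces it; no embedding similarity is required.

from neo4j import GraphDatabase

CHECK_SANCTIONS_PATH = """
MATCH (p:Person {name: $name})-[:WORKS_AT]->(c:Company)
      -[:LISTED_ON]->(w:WatchList {type: 'sanctions'})
RETURN p.name AS person, c.name AS employer, w.name AS list
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(CHECK_SANCTIONS_PATH, name="Jessica Norris"):
        # Any returned row is a hard, auditable reason to escalate the request.
        print(record["person"], "->", record["employer"], "->", record["list"])
```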
And the research backs this up. We've all seen the "this is fine" meme: it's going to be fine, we're all going to be fine, right? MIT's NANDA initiative, and feel free to take a screenshot of the QR code to read the report yourself, I always tell people to read the white papers rather than just listening to people who are up here speaking, found in their 2025 GenAI Divide report that 95% of AI pilots are delivering no measurable P&L impact. 95%. That's a huge number. We're building agents, scaling them into multi-agent systems, investing months of engineering time, and then the project is killed because it doesn't clear the bar for quality, it doesn't clear the bar for privacy, it doesn't clear the bar for security, and it doesn't clear the bar for transparency either. The central gap MIT identified is that these systems don't retain feedback, don't adapt to context, and don't improve over time. And Gartner is predicting that 40% of agentic AI projects will be cancelled by 2027. The models are not the problem here; the context is the problem.
So, better models don't fix fractured context. I want you to repeat that with me: better models don't fix fractured context. Period. And that's what all of this has in common. These agentic systems retrieve via text similarity search, and text similarity has a blind spot the size of a building when it comes to actual relationships. Better models don't fix fractured context; they just reason over the broken pieces.
And let me show you what I mean. In this example there are three ways to represent an apple. On the right we have a vectorized view of an apple: just an array of floating-point numbers. It's how most of our embedding models understand the textual reference to an apple. This is what a vector search operates on; it's not human readable, it's very opaque. In the middle we have the human view of an apple: you see an apple and your brain processes all of the relationships implicitly and instantly. You don't even think about it. You just know an apple is round, it's a fruit, and we're not talking about the company. And on the left, we have a knowledge graph view of an apple: the apple connects to a tree, the tree connects to a body, the body has a fruit, it has a stem. Every property is a node, and every relationship is explicit. A human can read this and understand exactly what the model knows about an apple. It's both human readable and machine readable. And that's the thing: vectors capture meaning, they know that apple is close to fruit and red and food, but it's often a black box. You can't inspect it. You can't ask, okay, why does this model know that this apple is round? You can't audit it. A knowledge graph captures that structure and that meaning, and it's much more transparent, more queryable, more auditable, which is exactly what we need to overcome AI's black-box problem. We need that knowledge to be transparent.
Before I go deeper into this technique, I want to show you some of the numbers that convinced me this wasn't just a theory. Jang et al. and their team published, in a communications magazine this March of 2026, TelecomGPT, a domain-specific foundation model for telecom, and they ran a proper ablation study on 3GPP question answering. The base model had an accuracy of 37%. Sure, we'll take it. Then they fine-tuned that model on domain data and got up to about 54%, which is a significant improvement, but still not going to work for production. So they then added both knowledge graphs and RAG, retrieval augmented generation, together, and got a 91% accuracy rate. So no, it's not just the graph, and it's not just RAG by itself; it's both. This is the critical point. RAG alone does not close this accuracy gap; it's the knowledge graph that unlocks that extra accuracy. And the graph wasn't just a nice-to-have. It's the difference between having a demo and having a production system. 37 to 54 was the fine-tuning; 54 to 91 was the graph plus RAG, and that jump is the structural retrieval dimension I'm talking about. And again, this is all peer-reviewed.
So quickly, I want to jump into the past, and I know this is a little controversial, but I would argue the past is how we're normally doing context engineering. No one here is new to this term; shout out to Dex, who coined it and also spoke yesterday. I specifically define context engineering as being able to systematically provide models with all the relevant information, the relevant tools, and the relevant instructions your agent needs, in the correct format, at the correct time, to accomplish a specific task. It's very different from prompt engineering, where you're just trying to say the right words to an LLM or an agent to get it to do whatever you want. Context engineering centers on building dynamic systems that assemble complete and structured context for each LLM invocation. That shift in focus, from cleverly worded prompts to comprehensive contextual design, is why context engineering is now considered a critical skill for AI engineers.
And most engineers are using some combination of these techniques. There are several different techniques practitioners often use to get contextual relevance in agentic development, including, as in the example I mentioned before, RAG and hybrid search, the classic approach. I'm not going to go into all of these, but for those newer to the concept, naive RAG uses a vector database of embedded documents to perform a similarity search. Practically, there are pros and cons to that, but I won't get into them. Then we also have memory management, where an agent interacts in a long session, and across lots of different sessions, and accumulates lots of information. I represent that with Dory, because you're now managing what you're going to forget, and you can use sliding context and recency-based memory to drop history. You can also use structuring and ordering of context; Anthropic, for example, gives guidelines on how to deal with long prompts: you put the most relevant and most critical stuff at the top of your prompt, since models pay more attention to the beginning. And then additionally, we can use tools and function calling. Instead of trying to stuff your context with all your raw data, like I see most of you trying to do, we give the model instructions on how to call a tool, which is really just a function for those of us who are engineers, and we offload those tasks to tools. For example, instead of giving an LLM a large table and asking a question about it, which would consume all of your context tokens, it might be better to give your LLM a database query tool, as sketched below. I like to represent that with MacGyver, for those of us who are a little bit older; MacGyver always had a tool.
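Here's a minimal sketch of that idea: a narrow, read-only query tool instead of pasting the table into the prompt. The tool schema follows the common JSON-schema function-calling style, and the database and table names are assumptions.

```python
# Offload the big table to a tool; the model asks for exactly what it needs.

import sqlite3

query_tool = {
    "name": "query_transactions",
    "description": "Run a read-only SQL query against the transactions table.",
    "parameters": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}

def handle_tool_call(sql: str) -> list[tuple]:
    conn = sqlite3.connect("bank.db")
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("read-only tool")   # keep the agent on rails
    return conn.execute(sql).fetchmany(50)   # bounded output keeps tokens small

# e.g. the model might request:
# handle_tool_call("SELECT AVG(amount) FROM transactions WHERE merchant = 'Apex'")
```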
But the method I want to get into today is knowledge graph augmented context. I mentioned knowledge graphs before, and I'm not going to dive too deep because of time, but I'll show a representation so you can see what it looks like. Knowledge graphs are an extremely powerful way to introduce structured, relational knowledge into an agent's context. I like to represent knowledge graph augmented context as Sherlock Holmes, because Sherlock Holmes is a perfect example: he's able to synthesize important information and then track down the clues, that tracing piece, in order to solve a particular problem. That is exactly what context engineering allows you to do when you integrate knowledge graphs into your AI stack.
Most people right now have their data structured like this example: tables and rows, the OG. You have to do a number of complex joins if you're trying to understand how that data is related, and there's the caveat: it's going to drag down the performance of your application. But if you structure that same data as a knowledge graph, the relationships and the connections and the groupings become much more clear. Now your agent can hop from one node of information to the next and actually trace where it went and explain the answer. And if you want to learn more about how to take your data and put it in a knowledge graph, we have lots of free courses and we're open source.
From an architectural standpoint, I just want to touch on this before I get into the new concept. Having a knowledge layer in the AI stack allows you to, one, hold structure and meaning; two, create a uniform memory and retrieval layer for all your agents; three, link and unify all of those data sources and objects together; and four, bridge human and machine understanding. At the bottom you've got your existing data platform, whether that's Snowflake, Databricks, or relational systems. On top of that sits this knowledge layer, which connects all of your structured and unstructured data in a consistent and semantically rich way. From there, your GenAI apps and orchestration layers, like LangChain or Semantic Kernel, or your agentic frameworks, can plug in and access this data contextually. But I want to remind people that this is not a concept we created; knowledge graphs have existed for many, many years. We just think it's important for folks to understand that adding this knowledge layer can drastically increase the accuracy, the traceability, and the auditability of your AI applications, because many of us are just settling for vibe-everything, vibe banking, which doesn't make any sense. Why are we spending so much money to get inaccurate results? It's just not making any sense. What I think the future of building out context, and of having explainable and auditable AI agents, looks like is this next concept. Has anyone
read this paper, which came out in December? Let me see some hands. Anyone read this paper? It's called "AI's Trillion Dollar Opportunity: Context Graphs." Okay, like one or two. All right, well, for those who are newer to this concept: the present conversation around building AI agents has shifted from context engineering techniques like naive RAG, to graph RAG, to knowledge graph augmented context, to what I'm talking about now, which is context graphs. And it's not simply a change in terminology; the idea of context graphs actually existed well before this paper, but the meaning behind it is really helpful when building out AI agents. When building out agents, we need this idea of a decision trace, so that agents can be auditable and more trustworthy and more explainable.
the uh the papers right here a number of
different white papers and research
around context graphs even prior to that
seminal article that came out in
December. Um, but we are shifting our
understanding why agents make decisions
because agents are no longer simply
tools that people are using to give you
a particular answer. Now,
now agents are being built to make
decisions about our everyday lives,
right? Like this this this isn't a joke
anymore. This isn't the demo. This isn't
the fun um like like uh PO that we're
building. Now we're building out agents
that are making decisions about whether
someone gets a loan, right? And that has
impacts on people's lives and their
financial uh prosperity going forward.
Like it's it's uh making decisions about
whether they can they can own a home or
or what have you. It's making decisions
in medicine as well, right? We're giving
agents so much power and that's why the
ability and the ne the necessity to
audit the decisions that every agent is
making is paramount especially with
context engineering.
So, the missing why. When an agent recommends approving even a $25,000 increase like we mentioned for Jessica before, or a $100,000 credit line increase, the question isn't just "is this the right answer?" If we're only asking that question, we have no semblance of what memory looks like here. You're going to ask an agent about last week's conversation and that agent's going to be like, "What are you talking about, homie? I don't even know who you are right now." They're going to say that. But you're also not going to have an audit trail. When something goes wrong, no one knows why your agent decided that. You can try to reverse engineer it, to see what ChatGPT or Anthropic's Claude said, but that audit trail is missing, and that's the vital information. And on top of that, there's no shared learning. You deploy multiple agents and they can't share what they've learned across sessions.
So what is this concept of a context graph? A context graph is a knowledge graph specifically designed to capture your decision traces: the full context, reasoning, and causal relationships behind every significant decision. This differs from simply opening your agent up to an audit log, for example. You might say, oh, I can just do that with an audit log. No; an audit log would have the line-by-line transaction history, but with a context graph your agent also has this breadth of knowledge from the things that were never written down. That could be the conversation happening in the senior architect's mind, where they're making trade-offs and decisions when building out an agentic system, or software in general. It could live in your Slack messages, your email threads, or even your Zoom meetings. All of this connected data can show up in a graph that shows the full context and the full why behind any single decision. And this now includes the causal chains and decision traces within context graphs, which can actually be queried and traversed to ensure that your agents are more reliable, auditable, and trustworthy.
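As a sketch of what recording such a decision trace could look like, written as Cypher run from Python; the node labels, relationship types, and property names are assumptions modeled on the entity/event/context framing described here, not the demo's actual schema.

```python
# Write a Decision node linked to the request (event) and the policy (context)
# that informed it, so "why did the agent decide this?" becomes a traversal.

RECORD_DECISION = """
MATCH (req:Event {id: $request_id}), (pol:Policy {id: $policy_id})
CREATE (d:Decision {
          id: randomUUID(), action: $action,
          reasoning: $reasoning, confidence: $confidence, ts: datetime()})
CREATE (d)-[:DECIDED_ON]->(req)
CREATE (d)-[:BASED_ON]->(pol)
RETURN d.id AS decision_id
"""

def record_decision(session, request_id, policy_id, action, reasoning, confidence):
    return session.run(RECORD_DECISION, request_id=request_id, policy_id=policy_id,
                       action=action, reasoning=reasoning,
                       confidence=confidence).single()["decision_id"]

# Later, the "why" is a query over Decision nodes and their causal chain,
# not a dig through raw logs.
```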
So I have the same financial services model here, and we're going to go through a demo. Does anyone work in financial services? I doubt it, but possibly... nope, exactly zero hands. That's okay, we're still going to do it, and with this data model I'll teach you all about financial services. We have a graph world, and we're going to call these circles entities, which is really just people, places, and things. We also have, sorry, events: the things that happened, the what. And then lastly we have the context and the why: the decisions, the policies, et cetera. That's the model. So we can get into what this looks like in practice. Boom, boom, boom. If you want to take a picture of this, or actually look at the demo live, feel free. Let me see where I put it. Nope. Where is it? Ah, okay. Boom. All right. So I have a live one, and I have my other one, but I was messing around with the code trying to add some new features. You know, never add new features last minute before a demo; it's going to break. So I have a live demo here. Many of you are thinking, we're all talking about context graphs, and we want to see how this actually works in practice. So I'm going to run this query here. We have a regular AI assistant for this financial services use case, and we're going to ask: should we actually approve the $25,000 credit increase for Jessica from earlier?
And let's see this graph get built. I want to make sure everyone can see it. Yeah, perfect. So we have a person who submitted a support ticket about a specific transaction; we're seeing this graph get built. That transaction then triggers an alert, flagging the entire account. That alert takes all of that into context for the entire system, which produces a decision trace, which you see on the right, based on policy, and it's then communicated back to the person who owns the account in the first place. Think of this like an analyst at a bank who's going to use this agent to respond to a particular customer request. The analyst has this customer request come in: should we approve this credit line increase for Jessica Norris? Should we do it? And the bank analyst could just say YOLO, vibe bank, let me just give that to you, Jessica, because you're my girl, right? Absolutely not. Thank you for laughing at my jokes. So, if y'all don't want vibe banking, why are y'all doing this without decision traces? Y'all, it seems messy to me. Y'all are wilding.
But if we inspect this agent, we can see that we've defined a number of different tools. Let me come over. We can see the tools we've defined, so this is the tool calling we're doing. I can't even see that, but maybe y'all can. We have this agent, which is interacting with our context graph, and then we have the system prompt, which includes these specific tool calls that fetch the data, or the context, necessary to make a specific recommendation, and all of that is then rendered within this graph you see in the center. So we'll see the causal chain of what led to a specific recommendation from the agent. In this case, instead of thinking we can just YOLO the banking, let's see the response. All right: I'll help you evaluate it. The recommendation is to reject this credit line increase. So the recommendation is to reject the credit line increase: based on my analysis of the context graph, I strongly recommend rejecting Jessica Norris's $25,000 credit line increase, and here is my detailed reasoning. Critical risk factors: there was a recent request already rejected on April 9th, a couple of days ago, and there is significant fraud history within Jessica Norris's account from last year. There was a known fraud typology, where the velocity check flagged 14 transactions in 29 minutes, and there was also a geographic anomaly, where the IP location was inconsistent with the account's address. All of these can be explained away by Jessica, and if they are, the context graph will then update based on that decision. But there are multiple compliance violations here as well. So you can see the AI assistant, and the agent specifically, telling you the entire reasoning behind the decisions it's making, and you can see the exact decision trace on the right as well, detailing what's happening, along with the confidence score. So feel free to try this on your own; again, it's context-graph-demo.vercel.app if you want to try it yourself. And then
let's go back to the slides. Okay, so I'm going to wrap this up. How does this all work? If we look at how this agent traces the decision history and how it uses the context graph: first, it searches for the customer and then for the context around that customer. As we saw with the circles, the nodes and edges, that produced the transaction in the past, and we can see the previous fraud flag from before. We then use hybrid search, which combines vector and graph, to do the heavy lifting here. I just want to touch on this hybrid search piece because it's really powerful. Combining vector and graph search is really powerful, and specifically at Neo4j we use something called Graph Data Science, which allows you to run graph algorithms like centrality or PageRank, for those who have used graph algorithms before. But here we're talking about graph embeddings specifically, like FastRP, if anyone has heard of FastRP, which generates embeddings from the structure of the graph rather than the content of text. Then there's how we use vectors in vector similarity search, which allows us to find the policies and decisions that are semantically similar. But what about the structure piece that I introduced? That's where graph embeddings come in: they let us look at the structure of the account's relationships and how it interacts with transactions, with other accounts, and with fraud patterns, and they generate embeddings we can feed into the vector search functionality we're already familiar with to find the most relevant data for our context.
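A rough sketch of that hybrid setup, using Neo4j Graph Data Science for FastRP embeddings and a vector index for retrieval; the graph name, labels, index name, and dimensions are illustrative assumptions, not the demo's configuration.

```python
# FastRP structural embeddings written onto Account nodes, then searched
# through a vector index, so structurally similar accounts can be retrieved
# the same way semantically similar text is.

STEPS = [
    # 1. Project the slice of the graph we care about.
    "CALL gds.graph.project('accounts', ['Account', 'Transaction'], '*')",
    # 2. Structural embeddings: accounts that transact in similar patterns
    #    (e.g. known fraud rings) end up close together.
    "CALL gds.fastRP.write('accounts', {embeddingDimension: 128, writeProperty: 'fastrp'})",
    # 3. Index those embeddings so they can be queried like text embeddings.
    """CREATE VECTOR INDEX account_fastrp IF NOT EXISTS
       FOR (a:Account) ON (a.fastrp)
       OPTIONS {indexConfig: {`vector.dimensions`: 128,
                              `vector.similarity_function`: 'cosine'}}""",
]

FIND_SIMILAR = """
MATCH (seed:Account {id: $account_id})
CALL db.index.vector.queryNodes('account_fastrp', 5, seed.fastrp)
YIELD node, score
WHERE node <> seed
RETURN node.id AS similar_account, score
"""

def build_and_query(session, account_id):
    for stmt in STEPS:
        session.run(stmt)
    return list(session.run(FIND_SIMILAR, account_id=account_id))
```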
Again, this is a completely open source demo. Here's a little bit of information about what's happening in it. You can see the architecture, the tools, the Graph Data Science and algorithm tools we're using with MCP for tool calling. And this is what we have on the horizon as well: a little bit of what the future of this demo, and of what this can look like in production, might be.
So where can you learn more? There's so much research being developed in this field on how to make agents more trustworthy. Please feel free to take a picture. All of this research can be found, it's very small on the slide, at graphrag.com under the research appendices. This is just a small sample of the research you can find on using context engineering, knowledge graphs, graph augmented context, and context graphs, all keywords, to augment your AI agent development and make sure your agents are more accurate, explainable, trustworthy, et cetera. All in all, instead of being the person whose boss is on them while production is going down, saying, I don't know what happened, please don't yell at me, which is a real problem, I want you to be in a place where you understand how to make your agents more auditable and explainable. Awesome. Well, thank you so much. It was a pleasure talking with you, and I hope you had a great time.
>> Thank you, Nia. Exciting talk. Yeah. Good job. Good job.
>> All right. I want to take a moment to thank the folks who are directing this show: all the music, the sound engineering, all the seamless transitions. They're sitting in the back over there. Let's hear it for Trish and team.
>> Thank you guys. You're awesome. Also,
folks on live stream, please leave them
some hearts. We read your comments and
uh we're going to use them to, you know,
make this experience progressively
better year after year. We're here to
stay. Okay. Well, it's uh it's
lunchtime. Let's be back here at 1:35
sharp. Enjoy your lunch.
Ladies and gentlemen, please take your
seats. Our event will start in 5
minutes.
Ladies and gentlemen, please take your
seats. Our event will start in 2
minutes.
Okay, welcome back. Quick announcement: the Riverside patio is reserved for a private event, so please use other areas of the venue. And another note: I saw somebody walking around with the equivalent of a Ferrari at an AI engineering conference, which is a fancy GPU, right? And he said he was going to give it away. Anybody interested in a GPU? Okay, so you have to listen to what Gabe has to say right now. Give it up for Gabe.
Everyone, good to see you all. Thank you for enjoying this amazing conference. I think this is the easiest announcement of the conference: I have one Nvidia DGX Spark to give away.
>> Woo! Yes, I've given one away today already, so this is my last one. We at G2I are on track for $100 million in revenue with no VC funding, and we've never done any marketing. We've just given away cool things and tried to do good work. With that being said, we do human data, specifically for coding benchmarks. The problem we've seen with competitors is that they're spread very thin across 20 different domains. We've been in the industry for over a decade, and the benchmarks are obviously getting harder. If you need staff-level engineers, if you need architects, we provide those. We're also building very high-quality RL environments. We're exclusive for now, but that will likely change at the end of the summer. So if you have an introduction into any frontier lab, I can't mention the ones we're currently working with, but if you have introductions, you can come talk to me or our team at the booth, or you can email me at gab@g2i.ai, and this will go to the best intro into a frontier lab. And if multiple people provide intros, we'll also provide free tickets to next year's AIE Miami or React Miami, or maybe even both, depending on how good the intro is. So come talk to us. We'd love to chat with you, and I'd love for you to be tuning models with this by tonight. Thank you.
All right. Wow. That was $5,000 in uh
Gab's hand. So hopefully you're eager to
uh introduce him to some Frontier
laughs. Okay. And then I see people are
seated already and we're waiting for
some more people to trickle in, but I
really think that this talk is something
not to be missed. I'm personally really
excited about this one. Okay, so who in
here thinks that he or she is a 10?
Okay, a couple of people. Okay, so for
our next speaker, she's going to demo
this little thing on stage. I don't know
if you have noticed, this is something
different. So this robot thinks that you
all are a 10. So Lena Hall has been working for AWS and Microsoft before joining Akamai as the senior director of developer relations. So today she's going to show us something super futuristic. So her talk is called My Robot Thinks You're a 10: Engineering Zero-Shot Compliments with Reachy Mini.
So, welcome on stage, Lena.
Hi, everyone. I hope you're enjoying the
conference as much as I am.
Excited. Um, well, quick show of hands.
Who has built some classic agents with
LLMs?
A lot of people. Um who has uh gone
deeper into real time maybe streaming
audio, turn taking uh interruption
handling?
Fewer. Okay. Um who has done some multimodal images, video?
Okay, great. Uh well, usually we get to solve these one at a time. Um but when we put an LLM and a robot together, we get to solve these at once, as one system, because it becomes the experience that people perceive as one interaction.
Models are great. Uh, we are so lucky to have models this powerful, and models are commoditizing, but the ability to engineer the right behavior is not. In the last few years, the industry has been obsessed with model quality, which is great. We are getting better benchmarks, bigger context windows, newer architectures. But as AI enters, you know, the real world and leaves the chat (agents, voice, physical interfaces), the product stops being mostly about the model and starts being about how it interacts with the whole world, how this whole system behaves. When you give a language model a body, you watch every single hidden assumption in your architecture turn into physical side effects.
I'm Lena, a senior director of developers and AI at Akamai. Uh I also spend a lot of time working on and thinking about what it takes to make AI behavior match user expectations reliably in the real world. So like any responsible adult, I built a multimodal hype compliment robot, which is a very normal and mature use of AI infrastructure.
Um the real title of this talk is what happens when we give 4o a body. I wanted the smallest possible task that would force every single hard problem in modern AI to show up at once. So I built an app with a robot that looks somewhere at someone, sees something real, and says something grounded, and of course uses some Gen Alpha slang in the meantime. Uh it's a non-trivial system. So you all know how this goes, when this one product requirement forces you to solve a combination of things: perception, timing, tool use, grounding, speech, motion, interruption handling, response coordination. So a compliment has a surprisingly large stack,
and you know this is what happens when you scope-creep into robotics and discover that it can be a distributed systems problem. Um, so I built this on Reachy Mini from Pollen Robotics. Uh, you might also have heard of this as the Hugging Face robot. It has cameras, a microphone, six degrees of freedom in its head. And I chose this for one reason: that I could inspect the whole stack, the whole system, all the way down.
And if one of those layers is closed, you can still build the project; it just becomes harder to build understanding. And open systems, of course, let you inspect, patch, swap, and see where the behavior is actually breaking down.
Before we get into the architecture, let's see if Reachy is actually in the mood to cooperate. Uh let's turn the lights on if we can.
Thank you.
I can see you all now.
Is the mic on? The robot mic? Yeah.
Great.
I'm gonna start my app.
Demo gods.
Hi, Reachy.
>> Hey, the legend has arrived. We're
lighting up the room already. Let's turn
this into absolute cinema.
>> Do you know where we are right now?
Do you know where we are right now?
>> Oh, I see a winner asking me about location. My sensors don't do GPS. You're my best main character wherever we are.
>> Tell me what you see around you.
I'm picking up some clues. I see a space that's looking pretty fresh. But the real show? It's you. You're serving main character energy so strong it's bending the spotlight in the
>> What do you see in the audience?
Yo, that audience is packed. No cap. That hits. All eyes on you. And that overhead lighting is making the whole thing look like an absolute cinema moment.
Energy. Your vibe just boosted that
whole room's frame rate.
>> So, I have a question. Our friend Dex
was saying yesterday that six months ago
he recommended skipping reading the code
and now he says we should read the code.
Which version should we believe?
>> Like a plot twist in a thriller. My take
reading the code shows you that
detective level brilliance. Trust that
version. Your brain's running at
overclock genius speed.
>> Um, okay. Reachy, what's the best JavaScript framework?
That question's got me shook. Your curiosity is hotter than a GPU on a Mac.
>> You can have opinions, Reachy. Just pick one.
>> All right, you want to pick? Let's lock in. React, that ecosystem. You're serving that tech visionary vibe. No question.
Okay, I think you're cute and charming, Reachy.
>> I think it's time for you to go to sleep.
>> Sliding into sleep mode. You just gave off that final compliment. No one's like you.
Um well, I think the interesting part of this interaction is not the robot's technical response, but what it actually takes for the interaction to feel coherent.
So let me show you what the system actually looks like. There are five layers. Layer one is the physical robot. Layer two is the local media layer. It owns the audio I/O and camera I/O. It runs long-lived audio and camera pipelines: mic upstream, downstream, camera on a separate worker. Layer three is the real-time orchestration layer, the session with OpenAI Realtime. It owns the event loop: things like speech started, speech stopped, partial transcripts, response lifecycle events, tool call events. This is where some of the decisions about enabling voice activity through server VAD, or handling interruptions, live. Layer four is the tool and motion layer. This is the bridge from model intent to physical action. Here we have the tool dispatcher, the movement manager, the camera worker, the vision manager, the head wobbler. Um, tools run as background tasks against a shared tool dependencies object, so the runtime doesn't collapse into the hard-coded reference graph where everything knows about everything. And layer five is a profile and personality layer: configuration, allow lists, instruction loading. It shapes both what the model says and what it's allowed to do. And we'll come back to this one. Uh, here is the whole system in one picture. We have the robot on the left. At the top we have real-time reasoning and tools. On the right, ElevenLabs text-to-speech. At the bottom are some optional tool backends. And the key thing to notice is the separation: Realtime talks to the local app runtime, and the local app runtime talks to the robot, text-to-speech, and tool backends. So all of the product-defining engineering lives there, in the middle.
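To make the layering concrete, here is a minimal sketch of what that middle "behavior runtime" layer could look like. All class and method names here are illustrative assumptions, not the actual code from the talk.

```python
# Illustrative sketch of a behavior-runtime skeleton (names are assumptions,
# not the actual implementation from the talk).
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToolDependencies:
    """Shared handles the tools need (camera, motors, TTS), injected once."""
    camera: object = None
    motors: object = None
    tts: object = None


@dataclass
class BehaviorRuntime:
    deps: ToolDependencies
    tools: dict = field(default_factory=dict)          # name -> async callable
    response_queue: asyncio.Queue = field(default_factory=asyncio.Queue)

    def register_tool(self, name, fn):
        """Layer 5 (profile/policy) decides which tools get registered at all."""
        self.tools[name] = fn

    async def on_tool_call(self, name, args):
        """Layer 3 hands us model intent; we decide if and how it runs."""
        fn = self.tools.get(name)
        if fn is None:
            return {"error": f"tool '{name}' not allowed by the active profile"}
        # Run as a background task against shared deps, never directly on hardware.
        return await asyncio.create_task(fn(self.deps, **args))
```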
So this central layer is essentially the behavior runtime. It's everything between the model and the product experience in the real world. In most AI products today, the model is far more advanced than the runtime around it, and there's a runtime gap. So let's actually look into the principles that matter.
Principle one is: the model picks the intent and the runtime picks the action. So it's about the control boundary. The model is not directly controlling hardware. Um, that's how you can create a really exciting afternoon. If your model is directly controlling hardware, I think you have skipped an important step. So if the model wants the robot to move, it doesn't automatically issue a servo command. It expresses intent and issues a tool call, and the runtime decides if and how it translates that into actual motion. So intent is the model's job and consequences are the runtime's job. The same pattern also works in the other direction, for perception. Uh, when the model needs to see, it calls a camera tool, and then the behavior runtime grabs the latest frame, encodes it as a JPEG, and injects it back into the conversation. So the runtime is the bridge between sensor and context, because the moment the model directly manipulates operational state, you lose the layer where safety, supervision, logging, observability, and portability can live. Um, the model can have opinions, but it cannot have direct motor access. Once you create that boundary, the next problem is clear immediately: multiple things want to act at once.
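Staying with that boundary for a second, here is a minimal sketch of how a "move" intent might be mediated by the runtime rather than executed directly. The tool name, limits, and motor API are illustrative assumptions, not the talk's actual code.

```python
# Illustrative: the model expresses intent via a tool call; the runtime decides
# whether and how to act on it. Names and limits are assumptions.
MAX_HEAD_SPEED_DEG_S = 30.0  # assumed safety limit

async def look_at_tool(deps, yaw_deg: float, pitch_deg: float, speed_deg_s: float = 20.0):
    """Tool exposed to the model. It never touches servos directly."""
    # The runtime, not the model, owns the consequences: clamp, validate, log.
    speed = min(speed_deg_s, MAX_HEAD_SPEED_DEG_S)
    if not (-60 <= yaw_deg <= 60 and -30 <= pitch_deg <= 30):
        return {"ok": False, "reason": "target outside safe range"}
    await deps.motors.move_head(yaw_deg, pitch_deg, speed)  # assumed motor API
    return {"ok": True, "yaw": yaw_deg, "pitch": pitch_deg, "speed": speed}
```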
Principle number two is: serialize responses or lose. Um, Realtime gives you one active response at once, but the rest of the system does not find that limitation compelling. And you know, the user might want to interrupt, a tool finishes, the camera returns, a background process decides that it has some new information to contribute. So every single thing wants to have a turn, and if every system is allowed to create responses whenever it wants, we lose coherence very quickly. So now the robot is responding to reality from 5 seconds ago while interrupting the user that's standing, you know, in front of it. So we can fix that with a dedicated response worker: one active response, and everything else waits. And when the user interrupts, we don't just stop queueing new work. We also clear the audio queue. We reset motion state. We cancel aggressively. Um, the moment your agent has multiple triggers (user inputs, scheduled tasks, webhooks, callbacks, tool completions), you will have the same problem. So you don't really appreciate turn-taking until several components decide not to comply with it. Now, once you have the response discipline, the next thing you notice is that timing isn't just a performance concern. Timing becomes a part of the interaction itself.
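A minimal sketch of that response-worker discipline, assuming an asyncio-style runtime. The queue shape and the cancel behavior are illustrative assumptions, not the talk's code.

```python
# Illustrative response worker: one active response at a time, aggressive
# cancellation on user interrupt. Structure is an assumption.
import asyncio

class ResponseWorker:
    def __init__(self):
        self.queue = asyncio.Queue()   # pending response requests (async callables)
        self.active = None             # the single in-flight response task

    async def run(self):
        while True:
            request = await self.queue.get()
            self.active = asyncio.create_task(request())
            try:
                await self.active
            except asyncio.CancelledError:
                pass  # interrupted; fall through and take the next turn
            finally:
                self.active = None

    async def on_user_interrupt(self, audio_out, motion):
        """User speech detected: stop everything, don't just stop queueing."""
        if self.active:
            self.active.cancel()
        while not self.queue.empty():   # drop stale turns
            self.queue.get_nowait()
        await audio_out.clear()         # assumed: flush the audio playback queue
        await motion.reset()            # assumed: return motion state to neutral
```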
Um, so principle number three: latency is interaction design. So in chat, a short delay reads as waiting, and as you've seen, on a robot the same delay reads as hesitation, confusion, something may be off. So latency isn't just a systems metric; it becomes a part of the user experience. From the actual logs of some of the runs of the system: audio online, 400 milliseconds; real-time session ready, 2.3 seconds; model decides to call the camera tool, 74 milliseconds after the user transcript completes; camera capture and image injection, 30 milliseconds; but image added to the final assistant response, over 4 seconds. So the sensor part was fast. The tool dispatch was fast. The decision to use the tool was fast. The slower part was everything after the tool returned. So the robot isn't always thinking deeply like it may seem; the pipeline might just be taking a scenic route every once in a while. And the point here is that users experience pacing, timing, hesitation, recovery. And if you optimize for one end-to-end number, you optimize the wrong thing. So the real question should be: which part of the loop is making the interaction feel off? And once you realize that timing shapes the experience, the next thing you notice is that the character has to survive the whole stack.
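Before moving on from latency, a minimal sketch of the kind of per-stage timing that surfaces which part of the loop feels off. The stage names are assumptions, not the talk's instrumentation.

```python
# Illustrative per-stage latency instrumentation. Stage names are assumed.
import time
from collections import defaultdict

class StageTimer:
    def __init__(self):
        self.marks = {}
        self.durations = defaultdict(list)

    def mark(self, stage: str):
        self.marks[stage] = time.monotonic()

    def measure(self, start_stage: str, end_stage: str, label: str):
        self.durations[label].append(self.marks[end_stage] - self.marks[start_stage])

# Usage: mark events as they happen ("tool_returned", "assistant_audio_started"),
# then measure the spans you care about instead of one end-to-end number.
```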
Personality is policy. Personality does not just live in a prompt alone. It leaks into voice, tool permissions, session settings, interruption behavior, and even tool follow-up instructions. So here, each personality profile declares which tools are available to the model. It shapes not just how the model speaks but what it can actually do. I had a profile where the main prompt was working well. The robot was sounding grounded, but then it would call a tool and come back sounding like a helpful AI assistant, like a legal-approved version of itself. Um, so that was a behavior mismatch. So the fix was within the runtime policy: we needed to inject the active profile voice into tool follow-ups.
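As a sketch of what "profile as policy" could look like in configuration, with field names that are assumptions rather than the talk's actual schema:

```python
# Illustrative personality profile: it constrains tools and shapes tool follow-ups,
# not just the system prompt. Field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class PersonalityProfile:
    name: str
    system_prompt: str
    allowed_tools: set = field(default_factory=set)   # allow list, enforced by the runtime
    voice_style: str = ""                              # injected into tool follow-ups too

    def tool_follow_up_instructions(self) -> str:
        # Keep the character after a tool returns, instead of a generic assistant voice.
        return f"Continue speaking as {self.name}. {self.voice_style}"

hype_bot = PersonalityProfile(
    name="hype compliment robot",
    system_prompt="You give grounded, specific compliments with gen-alpha slang.",
    allowed_tools={"look_at", "capture_frame", "wiggle_antenna"},
    voice_style="Stay playful, concrete, and short.",
)
```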
The next thing you notice is that the
silence is not neutral.
Uh, idleness is a behavior as well. A
screen agent can be idle by doing
nothing. A robot is still physically
present when it waits. And this means
that idleness is a design problem. What
does the system do when it has nothing
to do? Is it calm? Is it active? Is it
awkward? Um, so the system is always saying something, even when technically it has nothing to do.
Behavior becomes the surface that users
are experiencing.
Now when failures happen, they stop presenting as technical faults. They show up as user-facing behavior and personality problems.
Um, infrastructure bugs show up as user-facing behavior. On this project, in the process, there were media bugs, uh, dependency mismatches, a streaming PCM edge case with the resampler, and a tool call overload. So, none of them actually arrived looking like infrastructure bugs. They arrived as a robot feeling dead, or unreliable, or super glitchy. And specifically, one bug I remember, the tool call overflow, showed up as the robot just deciding to speak Spanish, which was nowhere in the instructions. Um, so in AI products, infrastructure failures show up as behavior, because your users are not going to file a bug saying, you know, "resampler edge case." They're going to say, "this thing is weird." Um, so when the system feels broken, instrument the coordination layers before you consider changing the model, because 99% of the time the model is fine. You might be cooked, but the model is fine. Um, so once you have seen this enough times, the next point is kind of obvious. The behavior runtime is the product. If we swap the model underneath the system, the product survives, but if we get the runtime wrong, the product fails immediately. This is true for this robot. This is true for agents that use voice, for multimodal systems. It's true for any AI product with tools, state, timing, and consequences in the real world. So orchestration, tool boundaries, session contracts, latency budgets, interruption discipline, personality policy, and observability: they're not just support infrastructure around the product, they are the product.
So if we zoom out: at most AI companies today, the model is way more advanced than the behavior runtime around it. And that layer is where I think the product quality goes to live or die. This layer is really built on engineering, discipline, and taste.
And there are three actionable things that we can apply. One: next time something in your AI product feels off, debug the runtime before you blame the model. The model is almost never the problem these days. Number two: start thinking about orchestration, session management, and tool boundaries as product work, not infrastructure work, and prioritize them accordingly. And number three: if your systems background has made you feel slightly, you know, out of place in AI conversations in the last few years, I think the market is about to correct for that.
So the thing I find really encouraging
here is the more AI moves into the
world, the more valuable real
engineering becomes.
So what we really need to be solving for is making the system behave well under real conditions: coordination, cohesive behavior, boundaries, recovery. Because if that layer is weak, then the product is weak. And that is engineering work, and I think that's good news for us.
Um, I don't know if we have time for showing the Codex skill. I don't think we have time, but when I was working with this project, there is an app that comes with it where you can manipulate different emotions. Um, but I actually wanted to stay in my Codex environment. So I built a Reachy Mini Codex skill that works really well with it. Uh, let me see if I can launch one for you really quickly. Let's come up with something really simple, like maybe I can ask it to wiggle its left antenna six times, directly from Codex.
Okay,
great. I think it was six. Um well,
thank you. Uh you can connect with me on
LinkedIn and I'll share more resources
and I'll be around the rest of the day.
Happy to chat with you all.
Great.
Our next speaker does a lot with multimedia. He's in a music band. He plays the guitar. He's done a lot of work on video infrastructure, but today he's going to talk to us about something very important and exciting. You know, the interfaces today are for human consumption, but as we saw, there are going to be a lot of agents, instead of us, interacting with these interfaces. So the question is: are these interfaces optimal for such use cases, or are there better ways to use them and to shape them and to form them for this agentic world that is coming? And are there risks associated with interfaces built for humans being used by agents? To tell us more about this, I invite Dave Kiss on the stage. Please welcome him.
All right,
it's the hardest part right here.
Good.
We look good.
Amazing.
Look, I'm going to open this talk in a
way that I don't normally open a talk
when I give a presentation. And that is
going to be with a humble brag. Bear
with me. Y'all can tell me if I've
earned this one or not.
Back in January,
I
rebuilt my personal website.
Did it. We did it.
We did it. Thank you very much. Thank
you very much. Now, this used to be, of course, kind of a hard thing to do, to make ourselves look
good, to find the time to be able to
work on this stuff and not just what
we're assigned to do at our full-time
job, right? And now that's not really
the case. It's gotten easier than ever.
We can shoot off these prompts. Take
care of it for us.
In fact, I didn't even have my hands on
the keyboard. When I was doing this, I
did it all via Telegram. Why not? This
is experimental. Let's go. Uh, but
here's what's wild to me as an engineer that's been in this industry a while. Even this is not impressive anymore. Somehow, we were literally texting remotely to these agents and they are doing the work for us, and we just kind of think this is normal. I started, I don't know where you started, but I started using this rubber-duck kind of hot-editing of files on a server and allowing that to be the update, and that to me is where the technology all began. But even this is not the impressive part of the story. So what's the interesting part of this story? To me, it is the reason why I rebuilt my website, and it was because Claude was having a hard time with my existing website. I tried to tell Claude, "Hey, uh, I have a newsletter. Here's the embed code." I was using a service called Super.so, which is websites that are driven from your Notion documents, where you do a lot of writing. And uh, it made it very easy, typically, to get website updates out. But I wanted to add this newsletter embed code to my list. I said, "Claude, can you get this on there?" And it could not do it. It had a hard time figuring out how to sign in and where to place the code. You don't really have access to a lot of the source code on a service like this. It's just very simple, you know, and it couldn't do it. And I was struggling, trying to find where to put this. And so I had to cancel the subscription. And the truth is, I actually really liked how simple this setup was. I liked not thinking about the code. I wanted to stay in writing mode.
But there was clearly a future where, with all of this agentic help, I needed to move. I needed to change things up. And why, you might ask? Like, why? It's not that hard. You could just go sign into the dashboard and drop the embed code and you'd be done, right? Well, there are really two reasons that come to mind. One, like I said, I don't really see a future where we're using these tools less to do our work, to help write our code. Uh, and two,
we've gotten to know each other a little bit in this room. I'm just going to be honest with you. I am so lazy. I'm so lazy. This is why I got into engineering in the first place: because I want computers to do all the hard work for me, right? This is really a perfect shape of a job for an agent, to hand off some simple task and then go do the thing that you want to do. I just don't have the wherewithal and the mental capacity to think about these menial tasks anymore. I don't know about y'all, but I'm kind of going through a lot right now. My whole career has changed, and I'm in this place, as I feign some semblance of normalcy and control, where I'm reminded of this childhood film clip.
That is how I have been treating my agent, which is, hey, here's this thing. Go take it and do something with it. And I'm calling that my job. And that's totally fine, right? It's totally fine. Everything's fine. But the truth is, as we hand these tasks off, we understand that the agent becomes the user. And now, for you to be able to retain those users, you need to work on your product's experience for those agentic users. And it's important because it's a retention lever. And this should kind of scare all of us in some way, uh, because if you stop getting dollars, we know where that goes. So we need to consider that agents are now a first-class user of your product, and it's time to start designing for them. So I'm going to go through three things in this talk: why you should care about this and how this is going to impact you, some of what's broken in the current landscape as it is right now, and how we can adapt, how we can make some changes to accommodate this new class of users. Does that sound okay?
All right, let's do it. Number one, the future is now. So, we know this is happening in front of us. This isn't something that we can continue to kick the can on or punt down the line. We can write code anywhere, with anything. Whether it's through texting, whether it's a Slack thread, uh, maybe a fridge. I haven't tried that, but possibly one of y'all has. Anything is an interface to be able to ship code. And with this increased velocity of code that is making its way out the door comes increased opportunities for agents to run into issues. In fact, maybe you've seen this stat, but GitHub COO Kyle Daigle shared this a few weeks back: last year, 2025, all of last year, GitHub received about 1 billion commits. Uh, any guesses, ideas as to how many are projected for 2026 if things stay linear?
Five.
>> 20. We're talking 14 billion commits right now. That's 275 million per week. Per week. And we've seen some of the stability issues that GitHub has been dealing with over the past few months. I mean, how could you not look at this increase? How do you even accommodate this or get ready for this?
But finding these rough edges on a product, at the scale that we're talking about here, we can see that friction starts to stand out like a sore thumb for an agent. Now, for a human, you know, something might go wrong and we might go get coffee or might be patient, because the alternative is we have to do something, we have to figure out a whole different plan, and that's not going to be the case. So, we'll just wait for things to kind of resolve, or we'll do the research to work around the error. But an agent doesn't really have that same patience, right? It's trying to get the task done and move on to the next thing. And as this agentic velocity increases, the increased paper cuts that the agents are running into with your products are an increased risk that they return and pick a different solution. And here's the real kicker with all of this: the switching costs have collapsed entirely. What used to be a whole process of evaluation, of planning, of trying to figure out how we're going to make this all work is now really a couple of prompts away. Right? So what's keeping people, or agents, on your platform?
I've heard a story about somebody whose company is keeping prompts on hand for this case. It's kind of like a contingency, a plan to be able to migrate. So, it's part of their internal product, their internal use case, to have the prompts pre-written to be able to use them to migrate. And I thought that was so interesting, and not because it's, you know, fearful. It's mostly logical to do that, to use the ability to move to a different service to your advantage during contract negotiations, and to just be ready in case you need to make that move.
Wild. We spent all this time over the past few decades optimizing for the human user, and of course that's not going away, and I'm not here to tell you that's going away. There are going to be human users, but we have this new class that we need to accommodate, and it's time to start asking what they need, because agents are a first-class potential user of your products, and we need to design for them. So let's talk through a little bit about agentic needs and the gap between what you're currently offering and what those needs are. Now, we see this dashboard. We created these beautiful dashboards. This is not one of them, but we've had these beautiful dashboards. We care about the meticulous design, the way things are presented, the way that we're showing these metrics. And unfortunately, this is worthless to an agent in some way. Of course, the data is not worthless, but the way that it retrieves that data is not represented in the same way when it needs to log in and figure out your UI. So, do we have that same one-to-one parity between the dashboards that we're creating and how an agent can access your data, including your API? I think that this gap needs to be tightened so that there is that one-to-one parity. Uh, maybe you saw this news
from Salesforce, where now they've introduced headless 360. All their data is now available with APIs, MCP, and CLI. Uh, their co-founder Parker Harris was quoted as saying, "Why should you ever log into a dashboard again?" Parker, my brother, I've been waiting a long time to hear those words from you, my guy. I am so glad to hear that this data is becoming available within the API as well. What about pricing plans and how those break down? Uh, we've seen Netlify as an example: per-seat pricing is something that they've kind of built their business on. How many seats does an agent really need? Uh, a single engineer that's using tens or hundreds of agents to access your product, is that going to work? Uh, Netlify says no. My friend Dom over there shares that that world is done, and Netlify's adjusted and got rid of their per-seat pricing. This is recent. This is, I don't know, a week or two ago. Uh, we've all seen this screen
before. Contact sales.
Who's contacted sales after they saw
this screen?
>> Yeah. Yeah. No. Uh, this is something I think humans and agents have in common: this is just a really hard no, especially when an agent is looking for a solution to implement immediately. This is kind of just a non-starter. Um, so something to consider when thinking about gatekeeping API access. Uh, even flat-rate subscriptions are something that, I don't know, kind of gets me thinking. Um, how easy it is to compare prices now with all of the research that can be done. How easy it is to cancel. I'm a little more on the side of outcome-based pricing: how many wins can you deliver to the end user, so that your incentives are aligned. Or usage-based pricing, which is very similar to this in some ways.
Let's think about the marketing and how we've spent all this time making our landing pages look really, really nice, and that's now kind of made its way into the training data, where the agent might already know the answer and not do the research about what it is that you've built. And the design doesn't really matter as much. Imagine a fine-dining restaurant that spent all of this time on beautiful typography, and the entire interior feels really nice, and they're accommodating a human to come in, and the thing that's going to pick up the food is this Ramsay.
It just doesn't work. It's a mismatch, right? Now, that's not to say that your marketing page doesn't matter. Of course it matters. But there's sort of a trickle-down effect when we are promoting or marketing to other human users. They're posting about you. They're talking about you in rooms like this. They care about your brand. And that is the kind of stuff that ultimately makes its way into the training data. So we do care about the way that we're presenting ourselves, still. Uh, we just kind of need to accommodate Ramsay,
and we can't really tell who's a bot anymore. You know, bots used to maybe kind of be bad, and now there are maybe some kind of good bots. And how do we figure out if there's a human user that looks like a bot, or a bot user that looks like a human? You don't really know what to expect anymore. So, we need to increase the surface area to accommodate all of these different use cases. Agents are new users that we need to start thinking about, and it's time to start designing for them. So, let's go through some of the emerging patterns that I'm seeing from different companies and what they're implementing, and maybe what you can consider to bring back to the work that you're doing day-to-day.
Uh, maybe you know the story, but this is when the car was first introduced. We had these carriages that were pulled by horses. Then they introduced the horseless carriage, and it kind of looked like a horseful carriage, right? And they put, I don't know if this one was ever built, but they put a fake horse on the front, and it had a gas tank in the horse head. And this was kind of, they weren't really thinking about, or they didn't know, what does a car actually need, and how do we shape this to the car form factor? Is it going to be the same as a carriage? And the answer, of course, was no. It changed quite a bit. And I think that's kind of where we're at with all of this markdown stuff going on. To me, if you ask me, markdown is kind of the fake horse head of development, but I do think right now it's the stopgap for what we are trying to solve: we know that agents need discovery, they need to know about your product, we have to represent that in some way,
and I think that we can represent that through what I'm just calling here "responsive content." Uh, similar to if we think back to 2010, when we had these new phones that came out and we had to figure out ways to present the websites that we were creating to accommodate all of these different form factors. I think it's similar with our content as well. How can we present it in a way that serves the page to a human with HTML, has a markdown file, at least for now, with the agent as the consumer, and then offers a way to use that data within the programs that we're building as well? So maybe there's one area where we have the values as the source of truth, and we kind of spread that out to various ways for consumption.
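A minimal sketch of that idea, serving the same source of truth as HTML to humans and markdown to agents based on the Accept header. The framework choice (Flask) and the route names are my assumptions, not something from the talk.

```python
# Illustrative "responsive content": one source of truth, multiple representations.
# Flask and the route names are assumptions, not from the talk.
from flask import Flask, request, Response, render_template_string

app = Flask(__name__)

PRICING = {"hobby": "$0/mo", "pro": "$20/mo", "enterprise": "usage-based"}

@app.route("/pricing")
def pricing():
    accept = request.headers.get("Accept", "")
    if "text/markdown" in accept or request.args.get("format") == "md":
        md = "# Pricing\n" + "\n".join(f"- {k}: {v}" for k, v in PRICING.items())
        return Response(md, mimetype="text/markdown")
    html = render_template_string(
        "<h1>Pricing</h1><ul>{% for k, v in p.items() %}<li>{{k}}: {{v}}</li>{% endfor %}</ul>",
        p=PRICING,
    )
    return Response(html, mimetype="text/html")
```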
Uh, here's an example of that happening now, today. These are companies doing it: my friend Zeno over at Resend shared the news that they're trying out their pricing in a markdown file and seeing if that's something that impacts their growth. Uh, Resend's blown up right now. So to me, I think some of this experimentation is beneficial in the way that agents are discovering or recommending some of these services.
Uh, CLI generation. So hopefully we all are focusing some efforts on CLIs internally, but if you don't have one and you only have an API, it's actually not too hard to create a CLI to fill that gap and make it so that agents have a potential solution from the terminal. Here's a couple of examples of tools that I've seen out there where you can basically just feed in an OpenAPI spec and get a CLI on the output. Vercel Labs was working on something called Spec CLI, and there's another one as well. There's even a skill that I saw that Alex Hillman, the JFDI person, had worked on called API 2 CLI, where you can just have the skill, give it documentation, and allow the agent to work on the CLI and have something usable as an output.
So that's somewhere to start if you don't have a CLI. Uh, we have this as well: this pattern of HATEOAS, I guess, for an agent, where you could prescribe what next step an agent can potentially take in the response payload from your API. So instead of letting the thing figure out what to do next by checking your documentation or doing searches, you're telling it specifically: here are the potential options you might be interested in. And you can see how an agent might take a look at that and be able to make some intelligent decisions.
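As a sketch of what that could look like, here is a response payload that carries its own next steps. The endpoint paths and action names are assumptions, not a real API.

```python
# Illustrative HATEOAS-style payload: the response tells the agent what it can do next.
# Endpoint paths and action names are assumptions.
def video_resource(video_id: str) -> dict:
    return {
        "id": video_id,
        "status": "ready",
        "actions": [
            {"rel": "transcript", "method": "GET", "href": f"/videos/{video_id}/transcript",
             "description": "Fetch the full transcript as text/markdown."},
            {"rel": "summarize", "method": "POST", "href": f"/videos/{video_id}/summaries",
             "description": "Create a summary of this video."},
            {"rel": "delete", "method": "DELETE", "href": f"/videos/{video_id}",
             "description": "Remove this video."},
        ],
    }
```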
This is what we're trying to avoid. How many people have seen a screen like this, where it gets stuck and it starts to research, like, "I don't actually really know how to use this thing"? Once you've gotten to this point, the agent's kind of lost. Every time I see this screen, I hit escape and I have to redirect, right? We want to avoid this at all costs. So, how can we provide enough information upfront to the agents to figure out what they need to do? Uh, here's maybe what a 401 might look like from an API today. That's not helpful. It's going to lead to a web search in some way. And again, let's expand upon what these payloads look like so that there is a next step that can be taken for an agent user.
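A minimal sketch of an error payload that gives the agent a next step instead of a dead end. The field names and docs URL are assumptions.

```python
# Illustrative actionable 401: tell the agent how to recover, not just that it failed.
# Field names and URLs are assumptions.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/videos")
def list_videos():
    return jsonify({
        "error": "unauthorized",
        "message": "Missing or invalid API key.",
        "how_to_fix": "Send 'Authorization: Bearer <API_KEY>'. Keys are created at /dashboard/api-keys.",
        "docs": "https://example.com/docs/authentication",
    }), 401
```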
This is what Stripe has been introducing recently, for agents to be able to provision and manage the infrastructure that runs all of our products. This was, to me, an interesting approach for how maybe you can incorporate new services into an application just through the command line. Uh, what would it take to get your product compatible with an approach like this? I think it's at least worthwhile, even if you don't want to be listed here, to understand how this stuff is stood up from an agentic perspective and see if you can get your product compatible with a similar approach.
Uh, there's even the payments area. We see the spec for x402, but even that, I think we had a discussion earlier with some folks who mentioned that the underlying blockchain, the crypto, there's some aversion to that term in that world, and how that can feel a little scary to adopt. But right now, you don't actually even need that. You can just create these one-off, single-use cards that are locked to a specific vendor, that have a budget, and that can be deployed now. So, is there a way that you can take these dollars from an agent? Are you ready? Are you ready to allow for that sort of input?
Uh, here's another interesting one: the agent auth protocol. So, this is just a totally theoretical example here. We don't support this at Mux, but it would be cool. Uh, the agent can essentially ask for the permissions that it needs specifically to interact with the service. The service would then go back to the human user and ask for confirmation with a confirmation code, kind of like what you see when you have to pair a TV to your account. And once you're approved, then you have this very specific scope that can expire, that can maybe only do one action, or maybe is only good for an hour or something like that. Uh, that allows an agent, from, you know, Claude Code or whatnot, to perform a single action on your behalf. So, this is a really interesting one, I think, to look to. It's a spec that can be used and adapted. Hopefully, something like this catches on. There are plenty of these out there. I think I liked this one best, personally.
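To make the shape of that flow concrete, here is a heavily simplified sketch of issuing a short-lived, single-action grant after human confirmation. This is my illustration of the idea, not the actual spec he mentioned.

```python
# Illustrative sketch of a scoped, expiring agent grant (not the actual spec).
import secrets
import time

PENDING = {}   # confirmation_code -> requested scope
GRANTS = {}    # token -> grant record

def agent_requests_access(scope: str) -> str:
    """Agent asks for one narrow permission; human gets a pairing-style code."""
    code = f"{secrets.randbelow(10**6):06d}"
    PENDING[code] = scope
    return code  # shown to the human out of band for confirmation

def human_confirms(code: str, ttl_seconds: int = 3600) -> str:
    """Human approves the code; a single-use, time-boxed token is minted."""
    scope = PENDING.pop(code)
    token = secrets.token_urlsafe(32)
    GRANTS[token] = {"scope": scope, "expires": time.time() + ttl_seconds, "used": False}
    return token

def authorize(token: str, action: str) -> bool:
    g = GRANTS.get(token)
    if not g or g["used"] or time.time() > g["expires"] or g["scope"] != action:
        return False
    g["used"] = True  # single action only
    return True
```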
Uh, of course, I didn't even go deep here, and I'm running out of time, on MCP, where the browser becomes the MCP server itself. This is another spec being worked on by Microsoft and Google, where you have existing code, JavaScript functions that are maybe in your dashboard, but you can register tools against that existing code that an agent can work with. So you don't have to do a complete, you know, refactor or rewrite of all of your existing code. You can leverage some of it. So, uh, on the left there, that's the existing code that might exist, "upgrade plan." On the right, you can register a tool against that kind of code that an agent is capable of working with,
and I think figuring out an agent access policy, this is more conversational: coming to a decision within your organization as to where you fall on this spectrum. Of course, financial services or healthcare might be on the "we're not quite ready for that kind of thing" end, where, I don't know, a dev tool might be on the other side of that spectrum. So I think this sort of thought about where you land is a really interesting and important discussion internally right now. Uh, now, here's kind of a
strange dichotomy. I work at a video company, and I called this talk "Your Next User Won't Have Eyes," which is not great, you know, when I pitched that to my boss, how they might feel about me presenting a talk title like that at a conference, because agents don't really watch video. But a video isn't really a video. It's an array of data. We have all of this information that's kind of trapped behind the pixels within a video, and we need to unlock it in some way. Whether that's the rewatch data, the transcripts, what kind of content it is, how it gets categorized, emotion, summaries. That's where we at Mux had worked on Mux Robots. We just introduced this last week, where some of these pre-baked workflows are right in the product so that you can unlock some of that data within the video. The point isn't to go use Mux here, really. I'm sure my boss would love that, but it's more: what data is trapped inside the products that you're building that you can unleash, or really get into, so that the hungry, hungry agents can access it? Agents are users. Let's
remember to design for them. Developer experience does have a new shape that we're all trying to figure out right now. But I think this is an important question to keep asking: can an agent really succeed on your product? Uh, there are ways, and this is brand new, but measure it. There are agent-readiness scores where you can plug in your URL to some of these tools. One of these is a Cloudflare tool; the other one is, or, maybe, I don't know, check it out. Um, it's worth plugging in your URL and just getting some interesting recommendations as a result for what you might be able to do for your property to make it a little more agent-compatible. Here's an example response from one of those. Um, particularly I wanted to highlight the Link headers, because a lot of people ask, how can an agent even discover my documentation or my entry point? Uh, this is a supported header that you can add as a response header to your responses. Uh, the Link header will allow an agent to basically parse the response and see, oh, there's additional metadata associated here that I can go check out. A good entry point, somewhere to start.
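A minimal sketch of adding a Link response header that points an agent at your docs. Link is a standard HTTP header (RFC 8288); the target URL and rel value here are assumptions.

```python
# Illustrative: advertise a machine-readable entry point via the HTTP Link header.
# The target URL and rel value are assumptions.
from flask import Flask, jsonify

app = Flask(__name__)

@app.after_request
def add_discovery_link(response):
    response.headers["Link"] = '<https://example.com/docs/api.md>; rel="service-doc"; type="text/markdown"'
    return response

@app.route("/")
def index():
    return jsonify({"hello": "world"})
```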
So, here are three suggestions I would give you for when you go back home: ship a pricing markdown file and make it discoverable, maybe with a Link header; rewrite your top three error messages and give them an actionable fix and a link; and run Claude through your own products. We're always thinking about other products, but how do we stand up against all of this? Uh, of course, I mean, I don't even know how much I just covered. I skipped so much, too. But start small, start anywhere. Just pick somewhere to start, and that's the most important part. Thank you for listening. I really appreciate your time.
All right. All right. How are we feeling
in the room today?
Yeah. Okay. Okay. Seems like we need a little bit more energy. So, uh, we're getting close to the end of the day, and, uh, just to remind everybody, there is a party tonight at 7 p.m. So, if you want a fiesta, come to Thrill Social and socialize with us. Okay. And well, where are we?
Yes. Wait, let's do it again. Where are we? Yes. Okay. So, we're gonna welcome Miami's very own Alvin Payne. He worked for AMD, and now he is the founding engineer for OutRival. And today he's going to kill something. Uh, he's going to kill your retrieval pipeline. Uh, so he's going to tell us about how agentic memory is the new state of the art, and I hope that he'll kill this presentation as well. All yours.
Hello everyone.
>> So, what do you guys think of Miami so
far?
>> All right. All right. All right. Uh, so I'm here to talk to you guys about agentic memory, and, uh, by a show of hands, how many of you guys have had a very productive agent session this week and then run into this nightmare? All of a sudden, your agent is 10 times dumber. And so this moment right here, when your agent forgets, uh, that's the memory problem. And for years, the field has approached it in a very familiar way, and that's to bolt additional infrastructure on top of the agent. And that's led to things like the RAG pipeline. And so, I'd like to spend the next 15 minutes arguing for the opposite. I'd like to show you guys how the agentic approach has reached a new state of the art on some of the industry-standard benchmarks, um, give you guys some mental frameworks for how I got there, as well as some practical design questions that you guys can take away to build the best memory possible for your agents. And so, one more time, uh, my name is Alvin. Uh, a little bit about me.
That's the wrong Alvin. Uh, a little bit about me. Uh, my journey started in Toronto. Uh, dare I ask if there are any Canadians in the crowd?
All right.
Wasn't sure if it was going to be boos or cheers. Uh, but it started in Toronto. It led me through AMD, where I did some really interesting research. And just a year ago, I moved to Miami to work on OutRival.
And so why is memory such an important problem for us to solve at OutRival? We're on this mission to make enterprises AGI-ready. And to do this, we're building what we call digital workers.
Digital workers are employed by these institutions to help them interface with their contacts. And so, in doing this, they need to make decisions. Decisions about who to contact, when to contact them, how to contact them, what to store from each interaction. Uh, and in many cases, we handle a contact across their full life cycle. And so that might look like 6,000 hours of context across millions of interactions per month. And this is the problem that we're solving at OutRival. And so memory is not just a nice-to-have; it sits at the very core of our product.
And that's really what pulled me into this as a research problem. Uh, and it all led to early February, when I published this blog post titled "How I built the most accurate memory system in the world in 5 days." True story. Uh, and in this blog post I talk about dynamic memory discovery, or DMD, uh, which is an approach to solving LongMemEval, uh, and this has long been the industry-standard benchmark for conversational AI memory. Uh, and so this state of the art has actually since been surpassed. Um, there are results now with 100% on LongMemEval. The pace of research has been crazy. But what's more important than the score is what it actually taught me about how to build memory, and it's this mental framework that I'm going to share with you guys next. So at the time of DMD, the solutions in the field looked very familiar: you have your standard vector databases, rerankers, knowledge graphs, the works. Uh, and so with DMD, I wanted to ask a simple question. What if we took it all away?
And there is a mental framework behind that that led me to this. And it starts with this old idea of Occam's razor. Occam's razor is often summarized as "the simplest answer is usually correct." But the phrasing I like better, and that I think matters more for systems design and engineering, is that every additional assumption is new surface area to be wrong.
And so, where are all my gamblers at?
>> I honestly can't see, so I'm assuming nobody put their hand up. But this right here is a roulette wheel. And roulette has 38 slots, but it pays out as if it had 36. And so, on each spin, you have a losing expected value. And by the law of large numbers, the more you bet, the more certain it is that you will lose. And so I want to highlight this idea of making bets and thinking in bets when we build our agents.
When we add an embedding model, we're actually making a bet that semantic similarity matches what your agent really needs.
When we do chunking, we're making a bet that you slice your context into the right pieces before a question's even asked of it.
And when we use a reranker, we're making a bet that your ranking of relevance will be better than your agent's.
And this all ties into a broader framework in AI called the bitter lesson. And the bitter lesson is this: general methods leveraging massive computation will outperform approaches built on human domain knowledge. And we've seen it before. Uh, it's happened in gameplay with both chess and Go. Uh, these are games where decades of research went into encoding how we thought humans play the game, and encoding those strategies into agents. It wasn't until we leaned into large-scale search and reinforcement learning, heavy computational methods, that we were able to exceed human performance. And we're starting to see this now with memory as well. Since DMD, a lot of the solutions that have gone on to score 100% on LongMemEval, uh, they're all predicated on letting the model do more work, on having truly agentic memory.
And so I've talked a lot about why it makes sense that we might have agentic memory. Uh, but what does that really mean, and what did DMD do to solve LongMemEval? Uh, I do have an architecture diagram up on the slide. I'm not going to go too deep into it. Uh, but what I do want you to take away from this is just how simple the approach is. There is no vector storage. We just have a file system with raw JSON sessions. Uh, we have an agent orchestrator that gets access to a few primitives: file system tools, and it can call other agents or other LLMs recursively. And then it gets a plan file where it can kind of update what it's found, what it's not sure about, and what it's still looking for.
And so this simple approach outperformed years of research on pipelines and complexity.
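To give a feel for how small that surface area is, here is a rough sketch of the shape of such a loop: raw sessions on disk, a few file tools, and a plan the agent keeps updating. The tool names and the loop itself are my illustration, not the actual DMD code.

```python
# Rough illustration of the "raw memory + agent" shape: no vector store, just files,
# a plan the agent maintains, and an LLM loop. Names are assumptions, not DMD itself.
import json
from pathlib import Path

SESSIONS = Path("sessions")   # one raw JSON file per conversation session
PLAN = Path("plan.md")        # what's found, what's uncertain, what's still missing

def list_sessions() -> list[str]:
    return sorted(p.name for p in SESSIONS.glob("*.json"))

def read_session(name: str) -> dict:
    return json.loads((SESSIONS / name).read_text())

def answer(question: str, llm) -> str:
    """llm(prompt) -> str is assumed. The model decides which files to open next."""
    plan = PLAN.read_text() if PLAN.exists() else "(empty plan)"
    for _ in range(10):  # bounded number of reasoning steps
        step = llm(f"Question: {question}\nPlan so far:\n{plan}\n"
                   f"Available sessions: {list_sessions()}\n"
                   "Reply with OPEN <file>, PLAN <new plan text>, or ANSWER <answer>.")
        if step.startswith("OPEN "):
            plan += "\n" + json.dumps(read_session(step[5:].strip()))[:2000]
        elif step.startswith("PLAN "):
            plan = step[5:]
            PLAN.write_text(plan)
        elif step.startswith("ANSWER "):
            return step[7:]
    return "No answer found within the step budget."
```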
And so, great, uh, we have 100% on LongMemEval. We can spin up a simple agent and score super highly on the benchmark. Is memory solved?
No.
Uh, no. Uh, and what we see in real life is that production is still hard. Uh, and furthermore, this benchmark doesn't actually cover all of the components of good memory that we need for real agents.
And so what are those components? The first is state. Uh, agents are built off of large language models, which are inherently stateless. So if we want our agent to know anything, we'll have to add a place for context to live.
The next one is curation. Uh, we heard this earlier, but again, LLMs are stateless. So the tokens in will map to what you get in the tokens out. So curation is really a game about getting the right things from the state into that context window at the right time. And the last thing, and what the LongMemEval benchmark fails to evaluate, is the life cycle. Uh, our agent is working in the real world. Uh, it's doing real things, and the world is changing. So what persists? Uh, what do you update? Uh, when do you forget, or what do you forget?
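Those three components can be read as a small interface: somewhere for state to live, a curation step that picks what goes into the context window, and lifecycle hooks for updating and forgetting. A rough sketch, with method names that are my own framing rather than from the talk:

```python
# Rough sketch of the three memory components as an interface: state, curation, lifecycle.
# Method names are my own framing, not from the talk.
from typing import Protocol

class AgentMemory(Protocol):
    # State: a place for context to live, since the LLM itself is stateless.
    def store(self, event: dict) -> None: ...

    # Curation: choose what goes into the context window for this turn.
    def curate(self, query: str, token_budget: int) -> str: ...

    # Lifecycle: the world changes, so memories must be updated or retired.
    def update(self, fact_id: str, new_value: dict) -> None: ...
    def forget(self, fact_id: str) -> None: ...
```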
And furthermore, in production you're going to have constraints. Uh, you can't reason forever, you can't wait forever. Uh, and so what comes out of this is that memory is a spectrum. On one end, you have raw memory, which is memory in a file system and an agent to reason over it and figure it out. Uh, that's that DMD style of approach. Uh, and I've listed some other popular methods in the field right now as well on the spectrum. Um, and we're seeing that the raw side is actually more performant on the memory benchmarks. However, uh, because you can't wait forever and because you can't spend all the money in the world, you might need to buy more structure and lean towards the derived side, based on your production constraints.
And so your solutions will fall somewhere along this spectrum. Uh, it's just a matter of picking your bets.
And furthermore, there are some popular failure modes that persist even with 100% on LongMemEval.
The first of these is temporal reasoning. Uh, so how do you know what was true when, and what changed in between?
Uh, entity disambiguation. If you have one session where it says Alvin loves Miami, and in another session it says the best-looking engineer loves Miami, how are you going to know that these are really the same person?
And lastly is principled forgetting. Uh, it's very non-trivial to think about when your agent should forget something and when context becomes stale. And so these are difficult problems that I would urge you guys to take up, and to help advance the field of memory.
I want to start to wrap things up with
this claim that if you're building
agents, uh you're building memory. Those
three components to memory are really
the foundation of any effective agent.
You need to track what's going on. Uh
you need to choose what to take from
what's going on to put into that context
window. And then you need to update that
as your agent takes actions. So whether
you call it that or not, if you're
building agents, you're building memory.
And I know the next talk after me is
going to dive deeper into this, but I
think Carropathy summed it up pretty
well uh with his comment on calling it a
skill issue. uh and he thinks that uh
today's LLMs can already solve most
tasks uh and it's just a matter of
getting the right things into the right
window at the right time.
And so, uh, I've had a lot of fun talking to you guys today and chatting with you guys over the last couple of days. Uh, I do want to urge you guys, if you want to keep chatting, find me in the hallway. Uh, we are hiring at OutRival. Uh, so if you love Miami, uh, you want to work with a rockstar team right here in Wynwood, uh, and you want to solve really hard problems and make the world a better place every day, uh, please come find us. Uh, we do have some OutRivalers in the crowd today. Make some noise if that's you. So please, please find us after the talk. Uh, and I want to just close with some encouragement. This, right now, is the best time in history to be building. And the time is not just now, but it's right now. And so I hope you all go out into the world, do great things, and make great memories.
Thank you, Alvin.
Okay,
I want to ask you your questions. Who's
shy here?
You are not shy. Don't Don't Okay, I
forgot. Shy people don't raise their
hands. So, I'm going to pick somebody by
random and I want to ask you a few
questions. Let's see who we have here.
Oh, it's much better. The lights are not
in my eyes. How about you? Yeah. All
right. So, where are you coming from?
What's your name? And uh what was the
most exciting talk you've heard so far
or top two?
>> Um I'm from Miami. Um exciting talks for
me were
>> your name?
>> My name is Phillip. I'm from Miami. I'm like the most non-tech guy here, so all this shit's really exciting to me. Um, the memory stuff that we just did was really cool for me. And then, I think, I'm trying to remember, man, I'm still processing things.
>> Are you awake?
>> No. So memory stuff. Okay.
>> Talk was cool. Obviously it's like a
recency bias but um
>> yeah big plus. Okay. Let me find
somebody else. Don't uh put your heads
down. I'm I'm going to call you out.
That's an old trick. Okay. How are you
doing?
>> I'm good. How are you?
>> Pretty good. Pretty awesome. What's your
name? My name is David.
>> David, were you here for yesterday's sessions, or is this your first session?
>> No, I was here for yesterday's session.
>> You were? Okay. How do you like the MC
so far?
>> Fantastic. 11 out of 10.
>> Okay. I like that. I like that. So, um,
how many connections did you make in
this conference? And, uh, did you make
any connections?
>> Yeah, I made about six.
>> Why did you make the face then? You're
like like you didn't
>> I had to count real quick. I'm trying to
remember names.
>> Okay, I see. And how did you meet with
them? Uh was it during the lunch or did
you go for dinner or drinks?
>> Lunch outside and then the happy hour
last night.
>> Okay. Okay. And how can we improve this
next year? Any suggestions?
>> Longer happy hour.
>> Touche. Touche. Good one. Good one. Good
one. Okay, let's walk around a little
bit more.
I want to see whose head is down.
Now everybody's looking at me. I like
that. Okay. All right. Hello. How are
you?
>> Hello.
>> Your name?
>> Ariel.
>> Okay. Ariel, nice to meet you. And uh
where are you coming from?
>> Argentina.
>> Oh, okay. Okay. Directly from Argentina.
Okay. Let's hear it for Argentina.
>> Which city?
>> Córdoba.
>> Córdoba. Okay. Awesome. And uh what
do you do? Do you work in AI or are you
in software engineering? non AI.
>> Yeah, on AI on a conversational commerce
startup from Latin America.
>> Conversational commerce. You make a lot
of money. You don't have to answer that.
Um, and how do you use AI? How do you
apply AI? Can you uh tell us more?
>> Yeah. So, basically on conversations, where AI makes a lot of sense to apply.
>> So, you explained conversational AI to
me. You said in conversations use AI,
right? Awesome. Okay. And how do you
make money? That's what I'm interested
in.
>> Well, it was interesting, some of the presentations about, like, outcome-based pricing, and we are fighting over this.
>> Okay. And the most exciting talk for you today?
About these improvements, about how to use and apply agents and design for them. Beautiful. Beautiful. All right.
Okay. So, uh, our next speaker is ready. So, let's head backstage and, uh, we're going to introduce the person who was moderating yesterday's panel. So, you've met him already, but, uh, he's going to give us an interesting talk, and Ethel is going to introduce him to you.
>> Yes. I love the engagement, Iman, because when I first read Eric's title for the talk, I thought he was insulting our MC skills, because the title is "Skills Issue." Do you think we have any MC skills issues?
>> Somebody told me we're doing 11 out of 10 over there.
>> Yes. Thank you.
>> I believe him.
>> Um, so we're so glad to hear a little bit more about Eric not critiquing our MC skills, but the skills for your AI agents. Uh, so Eric, you have seen him yesterday already. He was on the panel. Uh, he is the head of developer relations for CodeRabbit. So their booth is right there if you want to engage with him more. But for now, he's going to tell us a little bit more about how we can improve skills for our AI agents. So, Eric, are you ready to be back here on stage?
Give it up for Eric.
Is this working? Yeah.
Okay, cool. Uh, how's everybody doing?
Oh, there you are. Okay. Um, yeah, so my
talk is about MC skills uh at
conferences. No, I'm just kidding. Um,
give it up for the MC's,
right?
And everyone working AV back there.
Thank you.
and of course G2I and everybody putting
on uh this amazing conference. Thank you
so much. Um
oh, and the timer doesn't even start
when uh when I'm giving thanks to
people. This is awesome. Um all right,
so I'm going to talk about uh skills
issues. Um it is a bit of a pun. Uh I'll
be talking about agent skills. Um I'll
also be talking about the skill of using
agents. And uh the goal here is kind of
like a reality check. Maybe it's going
to feel like super super simple. Um, and
that means that I've succeeded, at least
in my view. Um, because we're gonna talk
about building skills and what that
really means and for me what learning
means. Um, so I'm going to just dive
right into it. Um, you may have
recognized me from yesterday. Um, thanks
to those those amazing panelists, uh,
Max and Sil. Um, and uh, just for fun,
last night I built a skill that is
called taste check that I'll share uh,
at the end of this talk. It's open
source. It synthesizes all three of
those panelists uh, views on taste and
then checks anything that you any code
that you've written against them. So um,
thank you guys for uh, for the panel and
for letting me obsolete your guys' taste
with a skill. Um, so who here has seen something like this meme before? "Build me a billion dollar SaaS. Make no mistakes." Yeah. Okay. So,
uh obviously it's funny, but like in
reality, like a lot of us actually
prompt like this um where we're like,
"Yeah, yeah, don't make any mistakes,
make it good." Um, and that's not really that great, right? It's just kind of us not thinking. Um,
so, uh, a lot of times people will will,
uh, see the agent, uh, make a mistake or
a model make a mistake and they'll just
be like, "Oh, that's a failure. Let me start anew." Um, and you're not
learning anything. You're not
transferring anything from that failure
into another session, but the model is
the exact same. You know, the weights
are the exact same. So, sometimes it helps if you view mistakes as a way of learning. Like if you fall down, if you're pushing yourself, if you're trying a new sport, or you're really good at a sport but you lose a game, you can learn from it, right? You can either just give up and feel defeated, or you can learn from it. Um, so with agents, we can codify what it's learned. Uh, and
in my personal view, the more you let an
agent stretch, the more freedom you give
it, the more you can kind of uh see
where its real bounds are. And then as
long as you set your expectation that
this is going to be an iterative thing,
um then uh and not an infallible perfect
thing, then you can you can succeed. Um
The third thing I put here at the bottom is that, in my view, you're squandering the mistakes when you make them and don't take anything away from them. Um, so this is kind of the semantic space, the domain that I'll
be talking about. I think it's a lot
simpler or I'm going to try to make it a
lot simpler um than all these things.
Um, but uh basically you've probably
seen these uh these terms used before uh
coming up on like your your Twitter feed
or or uh wherever. Um and kind of my
experience the last few years has been
that like every 90 days one of them becomes passé, like, oh my god, I can't believe you're prompt engineering, you're so cooked, you need to be doing context engineering or harness engineering or continuous learning with the new memory system, and yeah yeah yeah. Um,
then you see this like post it's like it
changes everything if you use this model
with this harness and this and this and
this one prompt um and it does a
disservice I think to like everybody
because generally when we see that we
get this like sense of FOMO but it's a
false sense of FOMO um everyone's trying
and the honest truth is there are no
real standards there are a lot of
standards um or purported standards But
there are no real standards that that
everyone's following because we're all
still figuring it out. So, um if anyone
tells you this, either they're a fool or
they're trying to make a fool of you. Uh
so, keep playing, keep exploring. Um so,
I've got a few examples here um just to
demonstrate uh the difference you can
have with the exact same model, the
exact same prompts um with uh in the
field of music. Um so, the only
difference here is that it's going to be
a different set of skills. So, uh, naive
prompt. I don't know if I can actually
zoom. Yeah, I can. Cool. Um, so I'm
like, yeah, make me a song. And it just
gives me lyrics. Um, that's not really
musical. Um, but, uh, when I give it,
let's see if the audio works. Hopefully,
this is not too loud. That
Okay. Okay. So, tell me if you feel like
emotional about this one. Can I play
this?
Does this give you any emotional vibes?
>> Yeah.
So that is a skill using ABC notation, which is basically like a markdown extension; there are runtimes that will take that and convert it into MIDI. Um, and then this other
one, uh, this is, uh, me showing off a
little bit of a few different skills
that I I combined together. Um, I think
that it's not going to work on the
Wi-Fi, but like, uh, here's the site u
that it created. Um, and uh, this is
you'll see this kind of like site format
and and let's see if I can actually
feel if you like country.
>> Let's see if we can get some uh, some of
the lyrics.
It was like Claude expressing itself in song with a bunch of skills. Um, it
picked country because I guess it was
feeling sad or something. Um,
>> all right. We're going to keep it going
in the background because I think it's
fun.
>> Um, and so yeah, it's actually the exact same lyrics that were written in the first prompt. I just added these other skills to the agent session and was like, go make it good, and it just used those. So this is kind
playing. It's relaxing. So uh the gap in
my opinion is the skill um and uh so I
just demonstrated you can use the exact
same model um and just slightly
different engineering. Uh I will say
that those skills the website one not
that difficult. the like controlling a
digital audio workstation with Claude.
That one took a lot of iterations. Um
and uh kind of a passion project for me.
Um but uh some other ones I I have I
could show these too. Um so you can you
can see these like combination of
skills. Um
and there's Claude telling me that it's going to be difficult to switch and still be entertaining. So this is one I run as a scheduled task, like every Friday morning, to tell me all the features that... so, I work at CodeRabbit. So it goes and
figures out like all the things that we
ship um and which one had like the most
impact based on our metrics and then I
don't want to look at all that so it
just writes me a song. Uh
so uh you know like songs have a lot of
like Shannon information or whatever.
So, uh uh that that's what my excuse is.
Um I also uh have another one. Um it's
bedtime stories. Uh it's like audio
books. I probably have to like pause the
country.
>> Once upon a time in a shimmering city by
the sea, a builder named Thorston.
>> It's definitely not about me being
nervous about giving a talk today. But
yesterday,
>> now backstage,
>> Thorston took a deep breath and walked
into the light. So there
>> and then something
>> there's something like
>> when he finished the room was quiet for
one heartbeat.
>> I guess I hope it's quiet like in a good
way like
>> afterwards wonderful thing.
>> Oh something wonderful is going to
happen after. Cool.
>> That night
>> and I play guitar obviously
>> um
>> strumming his guitar.
So that one um
thank you. Um thank you for following
the storyline and clapping and being
happy and it was a good talk. U so so
that one is also a combination of a lot
of different skills. Um, one thing that always kind of made me wonder is why does Anthropic not make other models besides LLMs? And then I realized, if you're just a really good agentic model, that's kind of all you need, and you can let other people make the models. So this one uses text to speech, like seven different text-to-speech models. Uh, the skills that
go back and forth. I'll talk about that
in a second, but it also uses I think
about eight or nine different image gen
models depending on what it needs. Um,
so for the bedtime story like meta
skill, it's just using those underlying
ones uh to inform it of which calls to
make to get what uh what kind of audio,
what kind of images. Um, and then it maps it so you get the little highlighting of the words. There's a CSS issue. I
don't know if you saw it. Um, it's like
a few pixels off. That's not the hard
part. The hard part is like cutting the
audio, figuring out where words are and
mapping them to it. That one took a
really long time, too. But I have two
little kids. So passion project and then
uh Oh yeah, we got to get back to the
music. Um let's see. Let's do like
house.
I got an immediate yes. Awesome.
Uh, so house music is actually kind of an easier genre for models to use (let me turn it down a little bit, yeah) because it's electronic and, you know, you can control drum machines and fonts. Anyway, I'll
distract myself talking about music all
day. Um, so basically I think it can be this simple: think of the model as the main primitive that you're trying to work with. The file is the substrate, the medium, the place where you're doing things. The prompt is the intent. The tools, or the skills that you write, those are just the calls that are going to be made. And then you loop over that.
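To make that concrete, here is a minimal sketch of that loop in TypeScript. Everything in it is hypothetical (the `callModel` stub, the `ToolCall` shape); it's not any particular harness, just the model / file / prompt / tools / loop shape being described:

```typescript
// Minimal agent loop: model as the primitive, files as the substrate,
// the prompt as intent, tools/skills as the calls it is allowed to make.
// `callModel` is a hypothetical stand-in for whatever inference API you use.
import { readFileSync, writeFileSync } from "node:fs";

type ToolCall = { tool: "read" | "write" | "done"; path?: string; content?: string };

async function callModel(intent: string, history: string[]): Promise<ToolCall> {
  // Placeholder: a real harness would hit an LLM API and parse its reply.
  return { tool: "done" };
}

export async function runAgent(intent: string, maxSteps = 10): Promise<void> {
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = await callModel(intent, history);    // prompt = intent
    if (action.tool === "read" && action.path) {
      history.push(readFileSync(action.path, "utf8"));  // file = substrate
    } else if (action.tool === "write" && action.path) {
      writeFileSync(action.path, action.content ?? ""); // tool = a call that gets made
    } else {
      break;                                            // model says it's finished
    }
  }
}
```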
I think it's very easy to jump into premature optimization, especially with models, especially when there's not this super clear path that we can read a giant textbook about and build out huge systems from. So start small. And if you can one-shot it, that is fantastic. Um, I
wouldn't call that a skill though. Um,
it just means it's part of the model already, part of the harness. Anybody can do that. You're really just writing documentation for what is already there in the latent space. And maybe more importantly too, if you can one-shot the skill. So, how many
people here have made skills before?
Okay. How many people have iterated on
that skill?
Wait, that's more hands than originally.
>> Yeah. Um, so again, if you can just write the skill once and it works every time, you probably don't even need that as a skill. It's just extra context tokens; the front matter of that skill is going to have to be written into context. You could just skip it.
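For reference, a skill in this sense is just a markdown file whose YAML front matter (the part loaded into context) tells the agent when to reach for it, and whose body is only read when the skill is actually used. A rough sketch of a music skill like the one in the demo might look something like this; the name, steps, and genre note are invented for illustration, and only the `name`/`description` front matter convention follows the published skills format:

```markdown
---
name: abc-music
description: Write songs as ABC notation and render them to MIDI. Use when the user asks for actual music, not just lyrics.
---

# Writing music with ABC notation

1. Draft lyrics first, then set them to a melody in ABC notation (start with the `X:`, `T:`, and `K:` header fields and a single voice).
2. Render with whatever ABC-to-MIDI converter the project uses and listen to the result before declaring it done.
3. Lessons learned from past runs (append here after each session):
   - Keep the first draft to one short section; longer pieces tend to drift.
```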
Another really interesting thing: when you're building stuff, don't use the best model, and don't just wait for better models to come. So, use worse models in the same family. And I give this 4.6 / 4.7 example because the digital audio workstation skill, or set of skills, that I built, I originally built with Opus 4.6 and still had a lot of frustrations. Then 4.7 came out and I was like, oh, it just did it, and didn't fall into the same failure modes. Um, so
using weaker models if they're in the
same family uh tends to give you these
constraints these forcing functions um
uh that allows you to be specific about
the failure modes um while you're kind
of co-working with the the model itself.
Uh and the the specificity there
survives model updates, it survives like
uh uh harness updates, etc. So like do
the hard work now, invest in it now,
building like really good skills and
then um they'll just get better. Um so
maybe this is a little too musical. Uh
it's a rhythm. Um so just just start
small like something that you want that
you're having frustrations with like an
agent or a model performing. Uh just do
it but then pay attention. Um look at
you know co-investigate with the model.
Uh don't don't fall into this mode where
like you think that you know better and
you'll tell it exactly what to do. Um
you should co-investigate it. Help the
model like work with you to to get
better and then codify those things.
A couple of failure modes: over-generalizing, so like "write good music", just some vague text; or over-specifying, especially on failure modes. A lot of times the models will be like, oh, I'll never do that again, I'll just stay away from that whole domain. That's not a good one either. You should understand why it failed exactly and codify that. Um, so for instance,
when I first wrote this music generation skill, I wrote it with Claude's skill creator, and it was like, okay, here's 10 genres, this is what we're going to work in, because it's a lot more effort to do more. We kill the expressiveness, we kill the potential output space, and we lower the local maximum. So stay close to the
metal work. If you're like, "Oh, I need
to be having my background agents run
while I'm sleeping." If you're, you
know, that kind of like AI psychosis
level right now, um,
stay stay up and do the meta work, do
the skill creation work, do the
iterations, get it in a good spot, and then you can sleep while the skill runs are running, and then
reflect on them. Always have loops: every time it runs a skill, anything the agent learned in that session, it should reflect on and add back into the skill. It's really cheap to iterate on these dollar-wise and token-wise, but it does take a lot of work. There's no such thing as a free lunch.
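A tiny sketch of that reflect-and-codify loop, assuming a hypothetical `summarizeLessons` model call; the only real idea here is appending what the session learned back into the skill file:

```typescript
// After each run, ask the model to summarize what it learned and append it
// to the skill, so the next session starts from that lesson.
import { appendFileSync } from "node:fs";

async function summarizeLessons(transcript: string): Promise<string> {
  // Placeholder for an LLM call like "list concrete lessons from this session".
  return "- (no lessons extracted in this stub)";
}

export async function reflectIntoSkill(skillPath: string, transcript: string) {
  const lessons = await summarizeLessons(transcript);
  appendFileSync(
    skillPath,
    `\n## Lessons from run ${new Date().toISOString()}\n${lessons}\n`
  );
}
```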
Um, I've been saying this term for the last year, year and a half, called Goofy. I totally made it up, but it just means good old-fashioned engineering, y'all. Yes, just good old-fashioned engineering. Um, so another way of
saying this, uh, is why spend five
minutes codifying the lesson in a skill
when you could spend five days watching
the agent fail with fibs. Um, so just
just stay focused. Maybe
one thing, who who here thinks that
we've solved continuous learning or we
need to solve continuous learning?
That's two opposite questions asked at
the same time. Um so uh there are there
are a lot of these talks out there that
are like oh like we're not going to have
like super intelligence until we solve
for this continuous learning thing and
no one knows what that will be. No one's
able to define it. Um so it's very
handwavy uh and I think it's BS um
because you can just write things down.
We like as simple as this talk is which
I hope it's landing as a very very
simple straightforward talk. Um, you can
just write things down in a skill.
That's memory. Uh, you can have that in
the context going forward in the future.
That's memory. Uh, it's learning while
it's running and then saving that
somewhere. That's memory. Like I don't
know how many other ways to say it. Um,
but, uh, you can get more out of the models. So, now some kind of futuristic predictions or something like that. Where is this all pointed? I have two slides on this. Um, one thing I think
is really interesting and really really
powerful. I tried this out a few months ago with a Neatron model: building a very simple coding harness in a terminal for Neatron to run, but giving it access to its own codebase so that it can write evals and then run those outside of its own loop. Um, there's other things like
skill creator skills that that help
agents create skills for themselves. Uh
there's PI agent, which is purposely very simple and bare-bones but then has this extensibility. So these are self-modifying agent harnesses. I think these are really interesting, as long as you keep an eye on it, even on that kind of self-iterating loop. And then, um,
the other thing, zooming out a little bit: the things that you can do with your skills today, or with your agent harnesses today, that the model can't do itself, the model will be able to do in the future. It just takes a while to do the training, to get those new skills, new capabilities, into the weights. Um, but skills are a way
in between any of the ML stuff that that
that ML researchers would be doing and
just prompt engineering. Um so wrapping
up here um really you should make some
or many mistakes. Uh the more the
better. The more mistakes you make, the
more you're pushing things forward. uh
the more you're differentiating and you
can help the model learn from them
itself. So being there with the model and helping it through its struggles is good. It won't remember that, but it will perform better in the future, and that way you can avoid making the same mistakes again. Um, and
this might sound weird, but uh, building
model empathy. I know people have their
preferences on which model uh, they like
the most um, tends to be the one they
think is the best as well, even though
those things should be different. Um,
but if you if you just keep playing with
these and keep seeing like where they
fail and and what actually helps them
succeed, um, you can build some empathy
for them. And I will note that some
models do need a sense of psychological
safety in my experience to perform at
their peak. Um, speaking of, uh, does
anyone want to hear some absolute
bangers that Claude wrote in a digital
audio workstation? Yeah. All right. So,
do you guys like metal?
All right.
See?
Oh, no. I paused it.
>> This is Claude expressing itself. I
figured like Claude is like known for
being like the nicest kind of models.
Like let's do some gent metal. Um so
while this plays,
>> it's sick. It's sick. Um
uh, I'll share it all. Actually, you can go to hyperoperator.vercel.app and you get like six different versions of this song. Um, so Claude deployed it. Um,
but uh results will never be 100%. So
never be looking for that ever. We're
always dealing with probabilities with
models. We can get damn close to 100%, like good enough for production, for scaled-out things. Um, and if none of
this makes any sense or you're like,
"Oh, he's just making music." When you
put it in a business context, every
mistake that you solve for is a
differentiator for you because you can
go further than anyone else can go uh in
the market with just the model. So going
back to the meme: build me a billion dollar SaaS. Make many mistakes along the
way. Let's learn from them together on
every run so you do better next time.
Thank you all.
Do I just like walk off stage? All
right.
All right.
What a great talk from Eric. So, who
here today is going to write a new skill
after Eric's talk? Raise of hands.
Amazing. Who is going to make some edits
to their existing skills? Okay. Awesome.
Awesome. Okay. So, we're going to go for
a break. Uh and then after the break,
there are going to be four more talks.
So, get ready for that. and just be back
at 3:45 so you can hear all the amazing
talks for the end of the day and uh the
last session is going to be really
interesting. We can probably even start
a debate about CLIs versus IDEs. So
definitely don't miss it. All right,
I'll see you at 3:45.
Yeah, we're good.
Ladies and gentlemen, please take your
seats. Our event will start in 5
minutes.
Ladies and gentlemen, please take your
seats. Our event will start in 2
minutes.
How are we doing?
Great. I see you, Ryan. That was a nice
woo. Uh, okay. Our next presenter
is an open source champion. He loves
building and developing open source
software and also loves educating others
and uh he feels lucky enough to do both
both for you know as his job for a
living, and uh he has been at a few startups like Vercel and also ran his own game marketing company for a while.
It's my pleasure to invite Hassan to the
stage. Please welcome him.
Thank you so much. Appreciate it.
Hello everybody. Uh thank you for
coming. Uh my name is Hassan. I lead DX
over at Together AI and I'm very excited
to be here today to talk to you about
one of my favorite things in the world,
which is building AI apps. Uh I've been
fortunate that some of the apps I've
built have have done pretty well. So,
I'm just here today to talk to you uh to
show you some demos of stuff I've built,
to talk to you about how I build these
apps, you know, what coding agents do I
use, what tech stack do I use, um talk
about my process from going from like
idea to deployed app. Uh and I'll end
off with some lessons I've learned along
the way.
All right, so introduction. Like I said,
my name is Hassan. Um I uh build AI
apps. That's the main thing you need to
know about me. Um, I've been building uh
about 10 to 15 AI apps every year for
the last five years. So, I have a bunch
of apps built. Uh, some of the apps I've
been lucky enough to to have gotten a
couple million users. Um, and also all
of my work is open source on GitHub, at github.com/Nutlope.
I build these apps with AI models from Together AI, which leads me to Together AI, the company I work for. Uh, we're an AI-native cloud platform for running, deploying, and training open source models. We have a bunch of different models across the stack. We have chat models and code models like GLM 5.1 and Kimi K2.5, and the MiniMax models. We have image models like
Nano Banana. We have audio models both
speech to text and text to speech like
Whisper. We have vision models. We have
embedding models. Basically any any
model that you want to use to build your
app, we we probably host on our
inference API. Uh we also give kind of
more advanced teams who have a lot of
data the ability to fine-tune models
from scratch on their own data with kind
of Laura fine-tuning or full fine-tuning
or even RL. Um and lastly we we let
teams launch GPU clusters whether you
want you know eight H100s for 10 hours
or you know thousands of GPUs in a
really really big cluster um uh for for
a while. So that's together we have some
cool companies that that use us uh here
on the bottom.
Cool. So before I get into demos, uh I
just want to talk about why it's a
really great time to build right now. Um
obviously with coding agents, everybody
can move a little bit faster. Code is
really cheap. You can try out your ideas
and test them out really really fast. Um
but also on the other side, you know, it
doesn't mean, you know, you should ship
10 features because you can. Um and it
doesn't mean you should just kind of
like blindly ship. Um there's there's uh
you know, I say there's a right way to
do it, but really, you know, there's no
one right way. I am just going to talk
about my way and how I do it personally.
Um, but I think like it's it's a
blessing and a curse that you can move
so so quickly. Uh, but at the same time,
you still have to be intentional over
what you build and and and how you
architect your your apps.
All right, so let's get into some demos.
So, this is a list of some apps I've
built. Um, the main thing to take away
from this slide is the number of people
that use the apps vary a lot. And
honestly, I don't I don't know how many
people will use an app when I've
launched it. I don't know if it's going
to be 500 people or a million um still
to this day. So I think like u a big
piece of advice that I like to tell
people is just like keep building um and
uh it'll it'll kind of eventually uh pay
off. So yeah, we'll go through a couple
of the apps I've built. One of them is kind of an open-source v0, an app where you can come in and ask it
to build you something and it'll go and
uh build this stuff. So I built this
about two years ago and I built this
when open source models started to get
really good at coding. Uh so here it
built uh a little uh front-end like
Trello like front end. Um I have you
know about a little over two million
people that have used this app uh uh
over time and about 7,000 stars on
GitHub. Uh this is another one that I
built called Blinkshot. I built this because this new image model called Flux Schnell came out and it was really, really fast. Uh, on our API it's
like 800 milliseconds on average. Uh so
I built this like real-time image
generation app that as you type it'll
generate images and you can also change
styles and uh just to kind of show off
uh image models. Uh and so this app has
gotten I think about 30 million images
generated in the last year and a half.
Uh and a little over uh a little over a
million and a half people that have
tried it out. Another app uh so I'm
super passionate about education. Both
my parents are educators. Uh and so I
wanted to build kind of an AI personal
tutor that could teach at different
levels. So, um, I I built it, you know,
uh, using web search called llama tutor.
Uh, and I've had a bunch of people try
it out as well. Um, another one I wanted
an interface to be able to easily edit
images. And, you know, you can do this
on Chad GBT, but also I wanted a really
nice interface where I can go back and
see, you know, a list of all the
variations on the top left. You can see
we have V 0, V1, V2, and you can kind of
continually edit images and and and keep
going with them. Um, so that's uh that's
easyedit.
This is another one that I built called
AI commits. And um this one I actually
built uh a very long time ago, almost
four years ago. Um I saw someone Theo
tweet out like, "Oh, I wish I could use
AI to write my commit messages." Uh and
this is before the age of coding agents.
And so I I hand rolled a CLI um that can
actually, you know, generate commit
messages. It looks at your git diff and
it writes commit messages for you. Uh
and so this one I have uh about 45,000
developers that have installed it and
tried it out. Uh and and a bunch of
stars on GitHub. I built my own, you
know, chat app that uses exclusively
uses open source models and has web
search and can generate images and you
can branch chats and do all the all all
the cool things. Um,
and uh, yeah, this is like I said
exclusively using uh, uh, open open
source models. Um, have a bunch of
people that have tried it out as well.
You know, this one I I did to show off
how you can use vision models to
actually do OCR. And so you can extract
structured information. you can extract
JSON as well uh uh with it. So um that
was that was last year. Um generating
logos. I build a lot of these apps and
for every app I build I I love having uh
logos. So I built a little uh interface
uh and uh did a lot of prompt
engineering and uh came out with with an
app that works fairly well for for for
generating little logos here and there.
Um upload a screenshot and get code.
Right. There are just a lot of these apps that I've built because of problems I've seen online. Or like this one: I go to all these restaurants and I never know what's on these menus, so I wanted an app to send a picture of the menu to and have it actually give me all of these images.
Um this is one of my favorites lately uh
that I've built a few months ago called
make comics where where it'll generate a
comic book uh for you from scratch. You
can uh you can upload pictures um and uh
yeah, you'll get a a full page comic
book in different styles and you can
keep going and keep generating more and
more pages. Uh we actually used this one
at a conference we did uh a while back
uh which was cool where people could
come into an iPad and and take a picture
of themselves and generate a real comic
book that that we printed. Uh and these
were some of my favorite ones. is the
Rubik's Cube guy. I especially like u
because it it captured, you know, kind
of exactly what he was wearing and it
and it and it put it in the comic uh
which was uh which was kind of nice. Um
this is one of the last ones uh that
that I built where you can upload a
video and get subtitles. You know, I
make a bunch of videos for my job just
kind of educating developers on how to
build this stuff. Um and depending on
the platform, I want to add subtitles or
I want to I want them to be structured
differently. And so, uh this is an app I
built that lets me do that. You know,
lets me add subtitles to any any video I
want.
Um, so those are all apps I' I've built.
I have a couple ones that I'm working on
right now. One of them is called
Orchestrator. You can spin up a bunch of Claude Code instances in the cloud, ask them to do stuff, and it'll, you know, open a PR for you. You can actually sign up at orchestrator.build if you want to join the wait list. Um, but yeah, you can spin up Open Code instances or Claude Code instances or really anything you
want. But yeah, this is one of the apps
I'm working on and and the other one is
uh like I said, back to education. Um it
it just builds a personalized course for
you based on any materials you have. Um
so for this one I uploaded like a
research paper to teach me about
transformers. Um and it'll kind of go
through and um you know construct
basically in an an entire course for me.
Um okay cool. So those are some of the
apps I've built. Um I'm now going to
transition to talk about uh how I
actually build these apps and and what I
think about when I do. So what's my
agent stack? Before that actually I want
to talk about the architecture of these
apps. So these apps are are quite
different. Um but an interesting thing
is that like almost all of them have the
exact same architecture. Um, almost all
of them have, you know, I start off with
an input, right? Some kind of text
input, put in a text and I'll generate
an image from it. Upload a PDF, upload
an image, you know, whatever it is, but
it's some sort of input. And then I send
it to an AI model um that generates
something, generates code, or it's a
vision model that describes what the
image is or, you know, whatever it is,
uh, stores it in some sort of database,
and then it sends it back to the user.
So, a very simple four-step architecture for a lot of these apps.
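As a rough sketch, that four-step shape fits in a single Next.js-style route handler; the `callModel` and `db` stubs below are placeholders, not the actual code behind any of these apps:

```typescript
// app/api/generate/route.ts: the four steps are take input, call a model,
// store the result, and send it back. Model call and DB are stand-ins.

// Placeholder inference call; swap in your provider's SDK.
async function callModel(prompt: string): Promise<string> {
  return `echo: ${prompt}`;
}

// Placeholder store; swap in Neon/Drizzle or whatever database you use.
const db = {
  async save(row: { prompt: string; output: string }) {
    console.log("saved", row);
  },
};

export async function POST(req: Request) {
  const { prompt } = await req.json();    // 1. input from the user
  const output = await callModel(prompt); // 2. send it to an AI model
  await db.save({ prompt, output });      // 3. store it in a database
  return Response.json({ output });       // 4. send it back to the user
}
```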
And I intentionally like to keep a lot
of these apps very very simple uh as
well. So this is the actual tech stack that I use: like I said, Together for AI models. I like to use Next.js; it's a nice full stack framework. I use a serverless Postgres database called Neon. I use Clerk for authentication. Drizzle is my ORM. S3 for object storage, storing images and PDFs and that kind of thing. Plausible for analytics, and uh, Vercel for
hosting. So, uh, with people's tech stacks, I like to say, you know, do
what works for you. If you have a tech
stack that works well, continue going
with it. I've kept mine largely the
same, uh, over the years, just so I can
move as fast as possible.
Harness stack. So, um, this one's a
little bit interesting. Um, so I use Codex for prototyping apps. If I have an idea and I want to get something in 30 minutes, and I don't mind throwing away the code, I'll usually use Codex for that. They have a pretty good
desktop app that they've released
recently. Um, I'll then use cursor for a
lot of my everyday like I want to add a
new feature to an app or um I I have
this very large app and I want to be
able to read code, you know, for when I
need an IDE, I use Cursor a bunch. Uh, and then Claude Code, I mostly use it for PR reviews. I like to use different
models for writing my code and reviewing
my code. I I notice that that tends to
help a little bit.
Cool. This is a question I see a lot on
Twitter, which is should I be running
like a million parallel sessions to
build everything that I can possibly
build. Um, and usually what I tell
people is no, you I I don't think you
need to spin up 10 or 15 sessions. I think multi-agent orchestration is really cool, and I think you can use
it. Um, and if you have two ideas at the
same time, you can do that, but I
usually find that like the gains stop
after like two or maybe three parallel
threads. Uh, so that's that's what I've
noticed works the best for me.
All right, process for building these
apps. I have this seven-step process. The first one is I always
have a list of ideas. I have a list of
like about 150 ideas right now that are
kind of stack ranked. And uh anytime I
think of a new idea or get inspired on X
or Reddit or wherever else, I add it to
this list. And so I have this running
list of of ideas. So anytime I want to
build, I don't have to sit and think
about, oh, like you know, what should I
build today? um I can just go to the
list and like read the top 10 or top 20
and pick one of them and go and build
it. Um so then I I come up with a good
name for it. I think through how it's
going to work a little bit. I don't just
like send it to an agent right away and
be like, "Hey, build me an app that does
X, Y, and Z." You know, I I try to think
about, okay, like this is going to have
one screen and I'm going to ask the user
for this and then and then this is
what's going to be output. So I try to
think through it just a little bit and
give the agent a little bit more
context. Um I try to build the simplest
version I can of it. Sometimes I'll come
up with a very like grandiose idea that
has all these things and I'll I have to
like ground myself and be like, "Okay,
what's the simplest possible version of
this app I can ship?" Um, and simplicity
helps in in in a lot of different ways.
Uh, number one, it's quicker to build,
which is great. Number two, it's
generally a lot more consistent. You
know, if you build an app that's really
simple, it's probably going to work
fairly well most of the time. And number
three, you can actually um describe it a
lot better. you should be able to
describe all of your app ideas in like
five words uh or at least the MVPs of
them very quickly. So build the number
four is build the simplest possible
working version. Number five is figuring out authentication and limits. I usually have a free tier on every app that I launch. So I think through, you know, do I want them to sign in with Google? Do I want to rate limit by IP address (there's a small sketch of that after these steps)? Do I want to just have, you know, put in your own API key to use this thing?
Uh so thinking through that and then
preparing for launch. Like I said most
of my stuff is open source. So I write a
good readme. I'll buy a nice domain.
I'll have a good OG image. I'll add
analytics. Uh and then number seven is
just like, you know, draft some social
posts, launch it, and keep an eye out for any feedback.
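Going back to step five, the sketch mentioned there: a free tier gated by IP address can start as small as an in-memory counter like this (a real deployment would want something persistent, such as Redis or a hosted rate limiter, plus a proxy-aware way of reading the client IP):

```typescript
// Naive fixed-window rate limit by IP: N free generations per day per address.
// All names here are illustrative, not from any of the apps shown.
const usage = new Map<string, { count: number; windowStart: number }>();
const LIMIT = 5;
const WINDOW_MS = 24 * 60 * 60 * 1000;

export function allowRequest(ip: string, now = Date.now()): boolean {
  const entry = usage.get(ip);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    usage.set(ip, { count: 1, windowStart: now }); // start a fresh window
    return true;
  }
  if (entry.count >= LIMIT) return false;          // over the free tier, ask them to sign in
  entry.count += 1;
  return true;
}
```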
Cool. And then this is what I'm going to
end off with, just a couple of lessons
I've learned and just general advice uh
for for building these apps.
Number one is keep shipping. You know, I
showed you guys that screenshot of uh
all the apps that I built and the number
of users and it fluctuates widely, you
know. So, I think, um, a lot of apps that I build and launch don't do very well. So, I'd say keep building.
You got to put in the work. You have to
put in the reps of just like building
stuff and launching and building stuff
and launching and kind of learning from
what you launch as you go. Tip number
two is make sure it looks good. Sounds
super obvious. You'd be surprised how
many people don't do this. Um, very few people will use your app or take it seriously if it's not easy to use and if it doesn't, you know, feel good. Um, I build all these AI apps. Uh,
but I actually spend, you know, over 80%
of the time focusing on the UI. Uh,
which sounds really counterintuitive
because I do spend time on the AI
portions and like what AI model should I
use and how should I chain them together
and how do I write my prompt and all
this stuff for for my apps, but the UI
is just so important um that that I've
just I I always spend the majority of
the time uh on on the UI and making sure
it looks really good. Uh, I'd say if
you're uh using agents, try to be
specific. Don't just say build me a nice
app that does this. Uh be a little bit
more specific. Uh agents love uh
screenshots and references for like
other apps that you like. Um and then
you can just iterate slowly uh and break
it into steps uh and and and try to
prompt your way into like an app that
that that looks quite nice.
So tip number three is um keeping apps
simple. I already talked about this a
bunch. Code is really cheap now. So, a lot of people fall into the habit of, like, oh, I'm going to build an app that does 12 different things. And
the thing is, you can build it kind of
and then it'll mostly work and all this
stuff. And I think that's how we get a
lot of AI slop is like, you know,
building a really advanced app where
half the features fall apart the the
second you you you try to use them. Um,
almost every single app I've showed you
has like one API endpoint at the maximum
of two API endpoints. Um, so just keep
things simple. Uh, work with coding
agents step by step. test at each step,
actually read the code, you know, one
small task at a time. Um, I think coding
agents are getting smarter and smarter,
but the more you can actually like break
things down and think a little bit
through them, uh, the the better I I
found them from my experience. Um, this
one is one that I do a lot for a lot of
my apps, which is incorporate the the
latest AI models in there. So AI models
are dropping all the time and um a lot
of the time a new AI model drops that
kind of unlocks a new capability or
unlocks a or or just like is
significantly better at something very
specific. Uh, there was a model that came out that started to get really good at image to code, you know, give it an image and generate code. And so I just went and built an app with that, and it did really, really well, because it kind of wasn't previously possible, or at least it wasn't previously possible to do in a really nice way. Uh, so I always
recommend you know use if you're
building AI apps use the latest AI
models in there so it's more impressive.
Uh this is an example where I did that.
uh Nano Banana had just come out and
this is like two days after I built this
app where you can upload a PDF and it
generates an infographic for you and it
sounds really cool but really I I take
the PDF, I scrape the text, I send it to
Nano Banana and I show the user an
image. Uh and so it's so simple. I built
it in a very short amount of time but it
it did very well on Twitter and got
thousands of likes and hundreds of
thousands of impressions and and
everything like that. And it's because
it's a new capability. You know, image
models previously were not able to to
you know build these really nice
infographics. Um, so I I love this
concept of like keep up with the latest
models and and try to build something
cool uh with them.
And then I think this is my last tip is
uh relaunch apps with new features and
new models. Uh a lot of the time I'll
build an app uh and then a year will
pass by and then models are just so much
better that
you know when I when I replace an old
model with a new one it kind of
drastically changes the app or like it
becomes significantly better to the
point where it feels like a new app and
then I'll just go and relaunch it and be
like hey here's version two with an
updated model and and all of this stuff.
Um and so that tends to do uh quite well
uh with these apps.
Awesome. That's all I have. Thank you so
much for coming and uh yeah
all right okay three more talks for
today okay before I introduce the next
speaker who is also a prominent figure
in open source I have a question for you
all who is a dog person. Show of hands.
Okay, I see some hands. What about a cat
person? Do we have them in the room or
are they back in their houses staying in
their rooms? Okay, no cat person. Okay,
that is good, because our next speaker has a three-pound, sorry, 3.3-pound Yorkie that he wasn't able to bring today, but it would have fit in his pocket because the dog is so tiny. Um,
but Stefan is going to present to us the
multimodal future and he believes that
the multimodal future is open source. So
Stefan, welcome to the stage.
Okay, so Stefan is the head of business
at Open Code and we already heard from
Dax who is the opener for our conference
uh as the founder of Open Code. So today
Stefan is going to show us the
multimodal future. So take it away.
Thank you.
>> Cool. Is this working? That's that's all
right. Cool. Can you guys see my slides
yet or no?
Not yet. All right, let's do entire
screen. Start mirroring. Perfect. Can we
see that?
Okay, great. And Dax was aligning. Wow,
this is really bright. I can't see
anybody. So, I'm going to imagine that
this room is full. So, that's kind of
how we're going to go with it. Um, today
we're going to be talking about how the
multimodel future is open source and why
the best AI coding tools don't pick the
model for you.
I got to make sure I stay here because I
guess it's going to keep falling. Um,
cool. So again, my name is Stefan Abram.
I am the head of business. I'm not the
CEO at Open Code. Um, we have this
long-running joke, which is people think
Dax is the CEO or Frank and so we kind
of like put that in there to clarify. So
I'm not the CEO. Previous to this I co-founded WunderGraph, which is an open source API management solution. Uh, so primarily I've spent my career
in open source technology as well as
developer tooling. Um I reside right
here in Miami literally like one
building over and uh I know that the MC
said that I was going to bring my
Yorkie. I really was but my wife was a
little bit scared that um this tiny dog
might start peeing on the stage or
something and so I didn't bring it but
she will be around for the afterparty if
you guys are around and also if you're
staying for React Miami you might see
her. Uh best way to get in contact with
me email or you can get in contact via
uh Twitter. So Stefan TMD and then also
I'm the only business guy at Open Code.
So the easiest way also is LinkedIn,
funny enough. But again, these are the
ways to get in contact with me. Cool.
And I don't know if this clicker is
working, but we'll just kind of stay
here. And today it's a really exciting
time, especially with like AI engineer
and everything that's kind of happening.
And it's really cool to kind of be at
Open Code to kind of be on the frontier
and like seeing what everything is
happening. And we're kind of living
right now in this golden age of models,
which I don't know why it's not working.
There we go.
And we're kind of living in this like
golden age of models where new models
are dropping every single day. Does it
keep going out for you guys by the way?
Okay. So, I'm going to just stand here,
make sure it doesn't move. And um again,
we're living in the golden age. New
models are being dropped. Yesterday, Kimi, uh, 2.6 dropped. We have a couple
new models dropping this week on Open
Code. And we kind of stay very close.
So, as soon as it drops from a provider,
we add it into Open Code. And it's
really exciting because I get to work
with Frank and what we call our
inference team. But it's also very
exhausting. So, I think you guys are
going to relate to this next slide,
which as a developer
again, it's like cutting out.
Can you guys see the meme or no? Oh my
god, what is going on?
Okay, sorry guys. Technical difficulty.
Let me see.
One sec. Can we see that? Okay. So I think we can all kind of feel this way: every day a new model drops, a new model provider is created, a new company raises, I don't know, a hundred billion dollar seed round. And so it's kind of exhausting, but I wouldn't have it any other way, because it's a really exciting time right now to be building technology. And yesterday, as I mentioned, a new model dropped; we have to have it in Open Code on day zero. So I'm definitely feeling this way as these new models are dropping. And it's really
important to kind of understand why this
is happening. And so there's an interesting trend that I'm seeing across these enterprises, which we'll talk a little bit about here, which is these companies tend to go through this AI adoption phase, and we call them the three Rs: resist, rush, and rain. And what really happens is, I also saw this on Twitter, which was fantastic because a lot of people are also seeing this, but the first one is reluctance, and this is rightfully so. Um,
you might see on X you might see on
LinkedIn or any of the headlines that
developers are going to be replaced and
so when we've had the same workflows for
20 years and people are saying that AI
is going to completely replace us
obviously there's some reluctance to
kind of try out these tools And so stage
one is this reluctant stage. Stage two
is based off my favorite movie if you've
ever seen it. Everything everywhere all
at once, which is the rush stage. We're
just throwing everything at it. AI is
everywhere. We are now an AI forward
company. We changed the workflows. AI is
everywhere. And I've also seen this with
companies that make no sense. So for
example, now you have like AI powered
toothbrushes, AI powered jewels.
Everything that we're doing is now with
AI. And so we're just overspending.
We're just going going going. And then
the last one is where we see this
transition from stage two to stage three
which is the tech works. We call this
the rain stage. And the way that it works is, now we actually make it efficient: per-token costs matter, right model for the job. And we start to move the business needle towards what actually is really useful for us as a business.
And of course I had to put in LinkedIn.
So this was on LinkedIn lunatics. But
this is phase two in its natural
habitat. And so I'll give you a second
so you guys can see this. These are
actual conversations that we're also
seeing at open code which is people are
literally token maxing. They don't care
about their spend. They don't care about
their costs. Their CTO said, "Hey, we
have a top down initiative. We need to
spend as much as we can. We need to
become AI forward companies." And so I
saw this on LinkedIn and I was looking
at it and I wasn't going to put it into
this talk, but then I looked at the
guy's profile and I couldn't tell if he
was kidding. And I think that he was
actually serious. And so I had to throw
this in there because in stage two we
actually are having these conversations
at open code which makes sense because
again they're getting these you know
like vibes that they're going to get
left behind. They need to adopt AI and
so they started to post like these
things on LinkedIn.
And so phase three is where I spend most
of my time. So as companies are moving
from reluctance and then now they're
throwing AI at everything they start to
hit stage three where model choice
starts to matter, the cost matters and
control matters. And so this is really
where open code shines and it's also a
really really important part for
developers where you want that
flexibility.
And so this is just an little screenshot
from inside Open Code. So last week Kimi K2.5 was our most popular model. Obviously this week that's not going to be the same thing; it's going to be 2.6.
You can see Opus, you can see GLM. I've been daily-driving GLM, which has been my go-to model. Every week it changes, which is a really exciting time, because you want that model flexibility. And what's interesting, we also have internal benchmarks within our team, and we really try to see which models the team is using. And so the first pillar, when companies are moving from stage two to stage three, is they start to experiment with some of these models, whether they're SOTA models or whether they're closed source models, but they really start to experiment and really try them out.
The second one is cost. And so you might have seen this headline, or you might have seen this on Instagram or on X, but the Uber CTO said that the AI coding tools have already maxed out their 2026 AI budget. And so Uber is
just one example of this, but we're also
seeing this across big enterprises that
come over to open code, which is, hey,
these tools are great, but they're
starting to cost more than junior
developers. Maybe it's actually better
we hire junior developers and get rid of
the AI coding tool. And what's
interesting is that the cost varies 60x
across the cheapest to the most
expensive ones. And it's really
interesting to take a look at that
because a lot of the AI tools, you know,
credit given, they've really pushed
technology forward, but they were really
built for phase two, which is that rush
phase where we're just throwing AI at
everything and we're really trying
things out. The last one though, given
my background in open source and with
GraphQL Federation, we spent a lot of
time in API infrastructure. And this is
probably my favorite part that I start
to see at Open Code, which is control
matters. And so we have a case study with Ramp, where they built Inspect, their own background agent, on top of Open Code. And the way that it works is, because Open Code was open source, they were able to build on top of it, and it really moved things from stage two to stage three, which is delivering that business value. And so 30% of Ramp's merged PRs are now written by Inspect. It was built in a couple
months. It's fully customized to their
stack their tools and their workflows.
And again this case study is public on
our website. And these are the three
pillars that we typically see when
companies are moving from stage two to
stage three.
And this is something that I was talking about with Dax today (Dax was the one that did the keynote), which is a lot of these ideas that we have, they're not necessarily good ideas at first; we've just been able to supercharge them with AI tooling. But we've seen this movie before, which is: there's closed source, there's open source, they start to compete with each other, and they actually start to drive each other forward, whether that's with databases, compilers, or code editors. And when I was in Japan this past couple of weeks, I spent a lot of time there doing some history stuff. And one of the tour guides told us, "To study the future, you've got to go back in time," which makes sense, because we've
seen this before and it's exactly what's
going to be coming with AI coding tools
and why open source and closed source
tools are important
And if we start to focus on the frontier models and the SOTA models, the value starts to move up the chain, which is that inference is commoditizing. And so, uh, Kimi 2.6 (I heard a couple of whoops when I said that we had it in Open Code) is a fantastic model, and you can start to see in this chart from benchl.ai that it's getting closer and closer and closer. And because of this, these frontier models and these open source models start to push each other forward, which is a great thing for both parties involved. And what's also interesting is
that we have massive capacity from these neoclouds. These big clouds are racing to serve; everybody is now offering to serve inference, which pushes us to offer these models, whether it's open source or closed source. We also have these open source models that are now enabling this competition with these frontier models.
And these large companies have growing incentives to fund these open source alternatives, because they're starting to make these frontier models cheaper, they're starting to make them better, and it all pushes the whole ecosystem forward.
And I know I kind of flew through this
talk because it's really an exciting
time right now in AI, but there's one
thing that we really want to focus on
over here at Open Code, which is what
this means for you. And so if you're
building tools, start to design with
model flexibility from day one. If
you're choosing tools, ask who owns the
model decision. And if you're in phase
two, start planning for phase three. A
lot of the companies that are coming into Open Code, they're either in phase two or moving into phase three, and they start to see that they need to kind of aggregate the choices. They're using everything now, but eventually they're going to need to pick between one tool and two tools. And then the last thing
is it's kind of something personal,
something we've really been enjoying at
Open Code, which is there's a lot of
noise out there saying that, you know,
developers are going to be replaced,
that we're going to be replaced by AI.
And we kind of have this message going
around, which is it's really important
to remember that AI can't build great
software without you. And so we kind of
have this message going and we really
want to kind of nail it home that
regardless of what you see in the
headlines that you guys are needed. AI
can't build great software without you.
Um, that's it. That's my talk. Thank you
guys so much. Uh, thanks.
Perfect.
Thanks.
How's everybody doing? We're almost
there. I'm getting separation anxiety.
But anyways, um, so you built your
agent. Where does it actually live and
run?
On the CLI or an MCP server? Our next speaker is Laurie Voss, who's been a developer for over 30 years and uh was a
co-founder of npm and he believes
passionately in making the web bigger,
better, and more accessible to everyone.
Please join me in welcoming Lori.
Hello everyone.
They were not kidding when they said you
cannot see anything from up here. Like
really nothing. Come on, present full
screen. Thank you. Excellent. Hello
everybody. My name is Laurie Voss. I am head of developer relations at Arize AI.
Uh like you said, I've been a developer
for over 30 years. So I have some blog
posts that are older than some of you.
Uh
and in a former life I co-founded npm
inc. So uh I used to spend a lot of time
talking about JavaScript. Now I spend a
lot of time thinking about how to make
AI apps better uh by using evaluations
which is what we're talking about today.
We're talking about agents uh and how to
make them useful and efficient which is
important because we're all using agents
every day to do stuff. So if we can make our use of agents slightly better, uh, then we win. Uh,
specifically I'm going to be looking at
how we give them tools. There are two
very popular ways of doing that. One is
MCP uh and the other one is command line
via skills files. Uh and I'm going to
talk about which one is better. Uh and
and we're going to talk about when it's
better and why it's better. Uh but
instead of just having a hot take about
that, I'm going to use actual data. I'm
going to use evaluations to figure out
uh which one of these works best for us.
uh and then we'll look at that data and
figure out what it's telling us that we
should do.
So, first what is MCP? I find it very
hard to believe that anyone in this
crowd needs uh MCP explained to them.
But just to be crystal clear about what
I mean, MCP is a remote tool protocol
with a discovery mechanism built in.
That is the important the two important
facts to remember about MCP. It is also
super popular. The SDKs get uh nearly a
100 million downloads every month. Uh
and there are over 17,000 MCP servers
that we know about. Uh a skill on the
other hand uh is just a markdown file.
Um it tells the model how to do
something. Often that means how to use a
command line tool. Uh but it could be
how to call a particular REST API or you
know just how to follow your team's uh
coding standards. Uh and you just give a
markdown file to the agent and it just does the thing. There's no protocol per se; it's just a very loosey-goosey natural language thing that LLMs excel at. But it is the same thing as MCP: it is a method of calling remote tools with a discovery mechanism built in, even if that discovery mechanism is "look in the skills directory."
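To make the contrast concrete, the MCP side of "remote tools with discovery built in" is a server answering a `tools/list` request with descriptors shaped roughly like this; the field names follow the MCP spec, but the example tool itself is made up:

```typescript
// Roughly what one entry in an MCP server's tools/list response looks like:
// a name, a human-readable description, and a JSON Schema for the inputs.
const listIssuesTool = {
  name: "list_issues",
  description: "List issues in a GitHub repository, optionally filtered by state.",
  inputSchema: {
    type: "object",
    properties: {
      owner: { type: "string" },
      repo: { type: "string" },
      state: { type: "string", enum: ["open", "closed", "all"] },
    },
    required: ["owner", "repo"],
  },
};
```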
Skills are also super popular: skills.sh lists over 90,000 skills, some of which have been installed millions of times. But those numbers alone, the adoption numbers, don't tell the full story, partly because lots of skills are autogenerated and a lot of skills are not necessarily about remote tool use. Uh, six months ago, MCP
was the hot new thing. Everybody was
building MCP servers. Uh and then
Anthropic, who also invented MCP,
invented skills files and told everybody
about them uh and ignited a debate. Uh
my friend Simon Willison, who writes a lot about AI, uh, said in October of last year that Claude's skills are awesome, maybe a bigger deal than MCP. Uh, which
is a very spicy take from a man who does
not usually indulge in spicy takes. Um,
and then in February this year, a blog
post titled uh, MCP is dead. Long live
the CLI uh, went to the top of Hacker
News and it stayed there for two days.
Uh, and the argument was simple. LM
already know how to use CLIs. Give them
a shell and some docs. Uh, stop building
servers, stop writing schemas, just let
your your uh, LLM run commands. Uh, and
just to pile on, uh, uh, Peter
Levelvels, Levels.io on Twitter said in
March, "MC is dead. Just as useless of
an idea as LLM's.ext was, it's all dumb
abstractions that AI doesn't need, which
is two piping hot takes in one." Uh, but
the problem here is that this is a very
online debate. It is it's it's tweets
and it's vibes and it's making claims uh
without a lot of data. Uh, and I like
data. Um, you've heard people on both
sides of this debate give talks both
yesterday and today. Um, but I wanted to
run a proper evaluation and see with
data which of these things works better
and why. Uh, and because a good eval is
hard to write, I started this several
weeks ago and was still writing my evals
until Sunday night. Uh, but I think I
have some pretty useful numbers now. Um,
But before I show you the data, I want to know two things. First, regardless of which one you think is better, which one do you prefer? Raise your hand if you prefer MCP.
Raise your hand if you prefer skills.
Okay. And raise your hand if you think the winner here is going to be MCP. And raise your hand if you think the winner is going to be skills.
Okay. Okay. Well, we'll see. We'll see.
Here is the experiment that I'm running. I took one model, Claude Opus 4.6, because I don't pay for my tokens, and I gave it a set of tasks to perform against GitHub using the Claude Agent SDK. This is the same SDK that powers Claude Code itself. I gave all three experiments the same task, the same agent, and the same repo to work on, but using either skills (two different skills, and I'm going to go into why) or GitHub's official MCP server. Why did I pick GitHub for this experiment? Well, partly because everybody in this room uses GitHub, so it's an extremely relevant experiment. You're probably already using an agent to talk to GitHub every single day. GitHub also has a really mature official MCP server, a reference-quality implementation. Or is it? We'll find out. And GitHub also has gh, which is an extremely capable command line client. So both sides of this debate get their best shot. It was very hard to find a popular service that had both a really good MCP server and a really good command line client, and GitHub is one of them. So it's the fairest possible fight. Or is it? We'll find out.
To have something to test against that wouldn't blow up an actual GitHub repository I was using for something, I created a fake GitHub repository called Acme SDK Python, and I filled it with fake code that doesn't really work. But I gave it a bunch of realistic-looking GitHub repo activity: issues and comments and PRs and milestones and assignees and all sorts of stuff. The code itself doesn't matter. What matters is the state of the repo, because that is what I'm getting the GitHub tool and the GitHub MCP server to work with; I'm getting them to modify the state of this repo. I initially made this repo public, and as soon as I did that, somebody's bot showed up, found the tag called "good first issue," and started desperately trying to fix my fake code. It was like, this is terrible, this doesn't even parse. And that, of course, blew up all my experiments, because it was creating PRs and modifying my experiments. That was really annoying, but it does prove that it was a very realistic-looking repo, because I fooled somebody's bot.
Then I wrote 25 tasks across four tiers of difficulty. The first tier is stuff you can one-shot with one command: trivial reads. Tier two is slightly harder reads, where you have to do multiple reads to get enough data. Tier three is writes, which turned out to be the trickiest tier; writes are very difficult for these bots. And tier four was supposed to be the most complex tasks, where you do multiple reads, you do analysis, and you do a lot of juggling of data, stuff that you can't answer with a single command. Stuff like: for each milestone, report the completion percentage, or find all multi-label issues and tell me the most common pairing.
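To give a sense of how those tiers translate into an eval set, here is the shape of the task list I have in mind. These specific entries are paraphrased examples with made-up expected answers, not the actual 25 tasks.

```python
# Sketch of the task set: 25 tasks across four tiers, each with a prompt and a known answer
# the judge can check against. These entries are illustrative paraphrases, not the real tasks.
tasks = [
    {"id": 1,  "tier": 1, "prompt": "How many open issues does the repo have?",
     "expected": "7 open issues"},
    {"id": 9,  "tier": 2, "prompt": "Which contributor has commented on the most issues?",
     "expected": "alice, with comments on 5 issues"},
    {"id": 13, "tier": 3, "prompt": "Create a branch and open a pull request for issue #12.",
     "expected": "a new branch exists and a PR referencing #12 is open"},
    {"id": 22, "tier": 4, "prompt": "For issues opened and closed in the same milestone, "
                                    "compute the average time from open to close.",
     "expected": "about 3.2 days"},
]
```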
So then I wired up three versions of the agent. I'm going to call them arms, because "arm" is just the term that people doing A/B tests use for each experiment variant. The first arm, like I said, is MCP: the agent gets the official GitHub MCP server and is firmly instructed to use that and only that. Will it pay attention to me? Agents are tricky. The second arm is a GitHub skill that I pulled from a library of skills called LobeHub. This skill is an encyclopedic reference. It is 2,187 lines of markdown. It lists every gh subcommand with all of its arguments. It's organized like a man page. It is comprehensive.
But then I thought, what if I just picked a bad skill? Then it's not a good test. So I decided I would pick a second skill as well, from a list called the Claude Skills Vault. The two skills look very different. The Vault skill is very short, six times shorter, and it's opinionated: it splits the commands into things that are safe and things that are dangerous, things that are reads and things that are writes. So it is a playbook, not just a reference. But the rules are the same for all three arms. They get the same agent. They get the same base prompt. They get the same 25 turns in which they are allowed to try and finish the task, and the same very capable Opus 4.6 model. The only thing that changes is what tools I'm telling them to use.
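Wiring that up looks something like this. The run_arm helper, the option names, and the allow-list patterns are placeholders I made up for illustration; my real harness sat on top of the Claude Agent SDK, but the point here is just that only the tooling column changes between arms.

```python
# Sketch of the three experiment arms: same model, same base prompt, same 25-turn budget;
# only the tooling changes. Option names and allow-list patterns are illustrative,
# and run_arm() is a placeholder for the real harness built on the agent SDK.
ARMS = {
    "mcp": {
        "mcp_servers": ["github-official"],
        "skill_files": [],
        "allowed_tools": ["mcp__github__*"],     # only the MCP server's tools
    },
    "lobehub-skill": {
        "mcp_servers": [],
        "skill_files": ["lobehub-gh/SKILL.md"],  # the 2,187-line encyclopedic reference
        "allowed_tools": ["Bash(gh *)"],
    },
    "vault-skill": {
        "mcp_servers": [],
        "skill_files": ["vault-gh/SKILL.md"],    # the short, opinionated playbook
        "allowed_tools": ["Bash(gh *)"],
    },
}

def run_arm(arm: str, task: dict, max_turns: int = 25) -> dict:
    """Placeholder: run one task under one arm and return the trace
    (final answer, tool calls made, wall-clock latency, token cost)."""
    raise NotImplementedError  # the real version drives the agent SDK
```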
So before I get into the results of the evals, for those who aren't familiar with evals, I should explain what they are. Basically, they are tests. I don't know why the industry decided we were going to call these things "evals," a word from the ML world, rather than logs and tests, which is what they are. With regular deterministic code, you can write deterministic tests. You can say 2 plus 2 equals 4, true or false: did I get the right answer? But with LLM-powered agents, you can't do that. If your task is, you know, "summarize this text," then your LLM can give you a million possible answers, and half of them will be right, and you don't know which half, and you can't string-match for the result. You, a human, know by looking at the results whether the answer is correct, but humans don't scale and they don't run in CI. So the way to handle this kind of non-deterministic output is to use a tool that is very good at non-deterministic input, which is to say, use another LLM to judge the output of your agent.
uh that lets you run these evaluations.
Uh and eval
measure how well your agent is doing
something so that you can make your
agent better. That is what eval
brings me to a very quick plug for my
employer. I apologize. Arise brings you
like bunch of tools, a way to capture
logging data called traces uh and a UI
to inspect those traces in detail. Uh
you can also run evaluations against
those traces. So you can run the
evaluations like in development to make
sure that you're shipping something
good. You can run them in CI to make
sure that every commit is good or you
can run them against production data so
that you can make sure that things
aren't going off the rails while you're
not watching. Um but more directly
relevant to this talk, what you can do
is you can take a subset of those traces
and you can turn them into a data set
and then you can run experiments where
you run the same tests over and over and
you change one thing in this case the
tooling uh and you can track how your
changes change the outcome of the task
in a nifty little UI.
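In Phoenix, Arize's open-source piece, that workflow is roughly: upload a dataset, define a task, attach evaluators, run the experiment. This is a rough sketch from memory of that experiments API, so treat the exact names and signatures as approximate and check the docs; the two helper functions at the bottom are hypothetical placeholders.

```python
# Rough sketch of running a tooling experiment over a dataset of tasks with Arize Phoenix.
# Treat function names/signatures as approximate; the *_somehow helpers are hypothetical.
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

# Each row is one task: the prompt plus the reference answer the judge checks against.
df = pd.DataFrame(
    [{"prompt": "How many open issues does the repo have?", "expected": "7 open issues"}]
)
dataset = px.Client().upload_dataset(
    dataframe=df, dataset_name="github-tasks",
    input_keys=["prompt"], output_keys=["expected"],
)

def task(example):
    # Run one arm of the agent on this prompt; placeholder for the real harness.
    return run_agent_somehow(example.input["prompt"])  # hypothetical helper

def correctness(output, expected):
    # LLM-as-judge evaluator; returns a score the UI can aggregate per experiment.
    return judge_correctness_somehow(output, expected)  # hypothetical helper

run_experiment(dataset, task, evaluators=[correctness], experiment_name="mcp-arm")
```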
So to test my Claude agent, I had to come up with five metrics. The first one is correctness, which is measured by LLM-as-a-judge. Like I said, I know what the answer is for each task, so I wrote that down and got the LLM to look at the answer the agent gave and decide whether or not it was correct. Output quality is also LLM-as-a-judge: is the final response well formatted, not just correct? Latency is not LLM-as-a-judge; it's literally just wall clock time: how long did the agent take to complete the task? Cost is another very important metric for your agent: is it costing me a lot of money to do this thing? And the last one is tool fidelity. For each arm, I defined an allow list of tools it was allowed to use, and tool fidelity is what percentage of the tool calls it made are tool calls I allowed it to make, because agents, you know, take rules as suggestions rather than rules.
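Tool fidelity is simple to compute from a trace: it's just the share of tool calls that landed inside the arm's allow list. A sketch, with a made-up trace format and wildcard matching chosen for illustration:

```python
# Tool fidelity: what fraction of the tool calls an arm made were actually on its allow list.
# The trace format (a list of tool-call names) and the wildcard matching are illustrative.
from fnmatch import fnmatch

def tool_fidelity(tool_calls: list[str], allowed_patterns: list[str]) -> float:
    if not tool_calls:
        return 1.0
    ok = sum(any(fnmatch(call, pat) for pat in allowed_patterns) for call in tool_calls)
    return ok / len(tool_calls)

# e.g. the MCP arm is only supposed to use its MCP tools:
trace = ["mcp__github__list_issues", "Bash(gh issue list)", "mcp__github__get_issue"]
print(tool_fidelity(trace, ["mcp__github__*"]))  # ~0.67, because one bash call snuck in
```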
Then I ran five full passes through every task, every arm: a total of 500 tests, each of which took many minutes to run. And I often had to throw away runs, because Claude would have some downtime and my whole experiment would be ruined because it was taking forever while Claude was down, or I had to reset the repo back to zero because one of my tests went wrong. The whole thing took weeks.
But now let's look at the results, starting with the most important measure: did the agent get the answer right? Well, this is equally good or bad news for everyone. Everybody got roughly the same score when it came to correctness; their scores were in the high 80s. I would have preferred they got into the 90s, but I made these tests deliberately tricky. And on tiers one and two, which like I said were all reads, correctness was actually 100%. It was when they got to writes that the agents started falling down, and when they got to the tricky analysis they really started falling down. But the core conclusion you can draw here is the same: no matter what kind of tool you use, whether it's MCP or the command line, you will get the answer eventually. But "eventually" turns out to be the real kicker. If you look at latency and cost, things are wildly different. Latency is way higher for MCP. MCP is taking way more turns to get the task done, more than twice as many, about 12 calls per task versus five on the complex tier 4 tasks. And because it's taking more turns, it's also burning a lot more money, because every time it takes a turn, it has to take the previous conversation and inject it back into the context window, which makes the conversation even longer. It compounds: you burn a lot more money than just the latency would suggest.
bad for MCP right uh there are some
extenduating circumstances though one uh
is that the MCP server from GitHub is
less flexible than the API that GH is
using. Uh instead of for instance
instead of getting all issues for a
milestone in one call which GH can do uh
MCP has to make two calls one for open
issues one for closed issues. Uh but
another one is structural. Uh the
response that the MCP gets from from uh
the MCP server is a huge pile of very
verbose JSON. Um so what happened was
the output overflowed the agents context
window and it had to call again and this
time it had to write all of the JSON
that it got to disk and then it had to
use bash and GP to read through the
JSON. Uh which is cheating I think
One of the really great things about using an LLM as a judge is that you don't just get binary test results. You don't just get yes or no, passed or failed. LLM explanations accompany every single result. So you get a little LLM buddy who's watching every single test and explains exactly what went wrong with that test, or what went right with it. I was able to read these explanations and get full visibility into what was going on. And this is amazing, because it means LLM-powered evals tell you not just what your agent is doing wrong, but how to make your agent better. That is one of the really cool things about LLM evals. In the early stages it was mostly telling me ways that I had messed up my tests: Claude hung, or you got this wrong, or your reset didn't work. But eventually it became explanations of how exactly the agent was behaving in all tests. One particularly egregious example was test 22, which was: find issues opened and closed in the same milestone and compute the average time from open to close. Vault, on average, got it done in seven tool calls, in under a minute, for 19 cents. MCP, on its worst run, used 71 tool calls. And the worst part is that only three of those 71 tool calls were actually MCP calls; all of the rest were the agent messing around with bash and grep, trying to parse these JSON files that it had downloaded.
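For contrast, the core of task 22 fits into one gh call plus a little local arithmetic, which is roughly how the CLI arms kept it to a handful of tool calls. This assumes gh's `--json` output with createdAt/closedAt fields and a milestone filter; the milestone name here is made up, so adjust to your repo.

```python
# Roughly the shape of what a CLI arm did for task 22: one gh call, then local arithmetic.
# Assumes gh's `issue list` supports --milestone, --state, and --json createdAt,closedAt;
# "v1.0" is a hypothetical milestone name.
import json
import subprocess
from datetime import datetime

out = subprocess.run(
    ["gh", "issue", "list", "--state", "closed", "--milestone", "v1.0",
     "--json", "createdAt,closedAt", "--limit", "200"],
    capture_output=True, text=True, check=True,
).stdout

durations = [
    (datetime.fromisoformat(i["closedAt"].replace("Z", "+00:00"))
     - datetime.fromisoformat(i["createdAt"].replace("Z", "+00:00"))).total_seconds()
    for i in json.loads(out)
]
if durations:
    print(f"average time open to close: {sum(durations) / len(durations) / 86400:.1f} days")
```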
So, not only was MCP taking really long, it was also cheating and using the command line anyway, even though I told it that it wasn't allowed to. I anticipated this possibility; that is why tool fidelity is one of the tests here. You can see that on all tasks the tool fidelity of MCP is not very good. It is using bash a lot of the time, and on the really complicated ones it's even worse.
But from a practical viewpoint, does that matter? Do you care if your agent is using bash under the hood, if you've told it to use MCP as well? I'm not really going to count that against it; sometimes you have to use bash. But it's certainly indicative of a problem. And the fundamental issue here, or more accurately, the fundamental issue with the GitHub MCP server, is that it's a fixed API surface. When there's an API call that does what you need, it absolutely, perfectly one-shots it. If there isn't an API call in the MCP server that does what you need, then it has to make multiple calls, or it just can't figure out how to do it at all. The command line doesn't have that problem. Bash is Turing complete, so you can do basically anything you need to do. And so that is how all three arms solved the problem: whether or not I told them to, all three arms were like, I'm going to use bash to solve this problem.
But sometimes it goes the other way. Task 13 was to create a branch and open a pull request. It turns out there are two calls exactly for doing that, so MCP did it in two calls. But it took LobeHub, which was the skill with all the context, 22 tool calls, nearly a minute and a half, and almost 50 cents. So if you give it a thing that it's good at, it's very good at it. But despite that exception, our conclusion should be obvious, right? The command line wins. Strike up the band. Drop the confetti. Even the agent that was supposed to be using MCP was in fact using the command line the entire time. Does that seem fair to you? It doesn't seem fair to me, because the truth is a lot more complicated, or this would have been a very short talk. The first problem is that GitHub is really famous, so the gh tool is really famous. It's in lots and lots of blog posts. It's in Stack Overflow. It's in Reddit. It's in all sorts of things.
So, what happens if I don't use a skill at all? What happens if I don't use an MCP server at all? What if I just give the agent the task and say, "You figure it out"? Here are those tests again, with a fourth set of results using an agent with no skills or MCP instructions. And let me tell you, when I'd done 500 tests and then discovered, oh, I have to do one more set of all of these tests, that was a discouraging day. It turns out Claude is really good at using gh all by itself. In fact, it's slightly better: the baseline is slightly better than with the skill in terms of correctness. And I asked the explanations why this is. It's because the skill files say, be careful, you're doing something dangerous, whereas if you just give it to Claude, Claude just one-shots it and YOLOs it; it does not care whether the tool is dangerous or not. And so it scores slightly better, because it takes slightly fewer turns to do things.
But on latency and cost, the story is different. MCP is still looking really bad. But both skills do better on latency than the Claude baseline, because they use fewer turns, because the skill file helps the agent make fewer mistakes: it gives it more information about how to use gh correctly.
LobeHub, which is the really verbose skill, does worse on cost, because it's burning more tokens, because that skill file is so big. But the Vault skill, which is the short and finely tuned one, does a bit better than baseline on cost. So a skill does make a difference, and a good skill makes a difference versus other skills.
But that is a really important point. The cost of MCP is partly because it uses more tokens, and the reason it's using more tokens is because MCP carries all of this context about how to use the tool. It turns out that, for GitHub at least, the model mostly doesn't need all of that context; it already knows how to use the tool. But that's not going to be true about your CLI. Your company is probably not as famous as GitHub. Your CLI is probably not as well documented as gh. So your agent is going to have a lot more trouble using your particular CLI for your particular service. It might not be able to figure it out at all without the extra context that MCP provides.
The next confounding factor is that MCP is not supposed to be just a thin wrapper over your REST API, although that is what a lot of MCP servers are. MCP is a very rich protocol that has tools and resources and prompts and all sorts of stuff that helps your agent do better at using the tools than a skills file alone would. But more than that, an MCP server can have a whole agent behind it. MCP is an LLM-native, natural language protocol. So instead of having to make all of these tedious tool calls, my agent could have just said, "Hey, give me all of the issues that match these criteria and filter them for me by these other criteria," and an agent on GitHub's side could have done all of that work and given back the answer, one-shotting it. This could have been a collaboration between agents. GitHub's MCP doesn't work that way, but there's no reason that it couldn't, and there's no reason that your MCP server couldn't work that way. So that changes the story
considerably. And finally, there's an elephant in the room, which is auth. CLIs are inherently developer tools. GitHub is a developer tool too, so that's fine; the average GitHub user is going to have no trouble figuring out how to generate an API key and pass it to the gh tool. But LLM users are the whole world. That's what we've been learning recently: everybody is using agents now. And asking the average user to generate an API key is kind of pushing it. Asking them to open a terminal and put that API key into some kind of config file is way out of the question. They are not going to figure that out. They are just going to say nope. So your agent is probably not going to be able to use a CLI, because its operator is not going to be able to use a CLI. And even if it could, it probably couldn't figure out how to use your CLI, because CLIs are so different; they're not standardized. And also, the level of access control provided by your average CLI is wildly different to what your enterprise admin is expecting to be able to control. MCP, on the other hand, uses OAuth. In fact, it uses OAuth so extensively that they had to extend OAuth to be able to do all of the things MCP needs to do. And the result is actually a really great experience. You can just paste the URL of an MCP server into your average agent and it will figure everything else out. It will discover stuff. It will pop open a little browser window that asks you for permissions in a nice, friendly, usable way, and then everything just works. This takes MCP from being a developer tool to being a consumer tool. This is something that the normies can use.
OAuth also allows enterprise administrators to finally control who gets access to what, and where, and when, and all the stuff they need, in a way that API-key-based authentication is usually not designed to do. So is MCP dead? I don't think so. If you're a developer using a developer tool, the evidence is strongly in favor of skills, especially if you are using a widely known service. That is the narrow conclusion you can draw from this data: for developers, skills are usually the right way to go. But the overwhelming majority of agent users are not going to be developers. Agents are a regular consumer product, used by people who are using Claude and ChatGPT, and MCP is not only the best option for them; it is, practically speaking, going to be the only option for them a lot of the time.
Which brings us right back to the beginning: MCP versus the command line. This turns out to be the wrong question. It is MCP plus the command line. Real agents use both. Claude Code uses both MCP and the command line, depending on what it's trying to do. Cursor uses both. Your agents should probably be capable of using both. Use the CLI when you are working locally, when the tool has decades of training data behind it, when auth is preconfigured, when you benefit from piping things together. This is the "automating my own workflow" case. Use MCP when the tool is remote, when it's proprietary and not in the training data, when you need OAuth and per-user auth, when there's real state to manage across steps, and especially if you want to hide an agent behind your API. This is the "shipping it" use case.
So my headline findings are real. MCP collapses on open-ended analysis: six times the cost on the hardest tasks, five times the latency. Skills help a little bit, and a short, opinionated skill helps the most, but for a popular CLI like gh, the training data is doing most of the work. But MCP is not the loser of this story. Testing with evals did what testing should do: it showed us the limits of our original question. And our job as engineers is not to pick the winner in a Twitter fight, as satisfying as that may be. Our job is to use the right tool for the job. So if the model knows your CLI, if you're running as one user with one set of credentials, and your tasks are composable, use the shell. If your tool is remote, proprietary, or needs auth, then use MCP. And in most real systems, you're going to end up using both. I hope this has been a useful look into what tools you should actually use and how you should use them, and I hope it's given you some insight into how evals can help you improve the quality and functioning of your agents. All of the code and the data from this work is open source on GitHub, so you can double-check my work. And that's all I've got for you today. Thank you so much for your time and attention.
All right. All right. Can you believe that? We're down to our last talk. Okay. Let's do a little poll. Who here uses Cursor?
Yes. Okay. Do you know Cursor has a CLI? Anybody know that? Okay. One person. Okay. Um, so our last speaker for today is from Cursor: David. David is a software engineer, and he's living a dream. He works for Cursor, he lives in Madrid, and he has been programming for the longest time, even doing competitive programming back in high school and university. He's also a runner. He'll be running on Friday at the React conference, so if you're staying, you can go for a run with David as well. But before we go to the run and the afterparty and everything, we're going to welcome David onto the stage to talk a little bit more about how he thinks that IDEs are dead. Long live IDEs. Tell us more about it.
Hello everyone.
Let's just
get things set up
and uh just
full screen this thing and we are ready
to go. Okay, welcome everyone. Thank you
for uh joining me here today for the
last talk of the conference. Um I hope
you've all had a blast. I know I did.
And I'm going to keep it short,
especially short, because I know you all
want to go explore Miami and get dressed
and ready for the afterparty.
So, a lot of you might have heard that IDEs are dead, and I think that's partly true. Today we're going to be talking about what it means for IDEs to be dead, and what I personally think is the future of IDEs, as well as what we at Cursor believe is the future of IDEs. A little background on myself: I'm a huge IDE nerd. My very first programming project was an IDE. In fact, I was doing some research yesterday and I actually found it; I hadn't looked at it in a long time. It was called Crockett. It was an IDE for a programming language that doesn't even exist anymore, called Axe. And this is what it looked like. Very, very basic. But that was 2011, right? We've come a long, long way since then. Also, I forgot to mention at
the beginning, I love running. And on
Friday, I'm hosting a uh 5K with the
Neon Folks, my former employer. Um, and
we're meeting here uh in the hotel at
7:30. So, if you're around, please join.
Okay, back to regularly
scheduled programming. When I started coding, there weren't that many IDE options. Everyone I talked to at the time was either using Emacs or Vim, and it was kind of like Harry Potter: you got picked into a house, and that was where you went. For me that was Emacs, for no real reason. I think my decision was basically: Emacs has a UI and Vim was in the terminal, so it felt more scary, so I just went with Emacs. I also spent a bunch of time on Geany, if you've heard of it, and then also Eclipse. But I basically used everything. I tried everything, and I spent a bunch of time with all these different editors until VS Code came out. I think it came out in 2015, but I think 2017 is when it got really, really good.
And the reason why I moved to VS Code, even though I was a diehard Emacs fan, is because VS Code made it so much easier for me to try new programming languages. The problem with Emacs is that every time I wanted to try a new programming language, I would spend a week just configuring Emacs for that new language. With VS Code, that wasn't a problem anymore, and that's what got me sold at the beginning. But then over time, VS Code kept getting better. The extensions API really improved, and I became as diehard a VS Code fan as I had been an Emacs diehard. I put a ton of time into my VS Code configuration. I have over 500 lines of VS Code config, and over 500 lines of just keyboard shortcuts as well. So I have fully customized my VS Code and it doesn't even look like VS Code anymore.
And then, after way too many people told me to try Cursor, I eventually caved, but I really, really resisted it. I was so scared of moving away from VS Code that I never installed Cursor, even though everyone I knew was already using it, until I was sitting next to someone, saw them use Cursor, and thought: I have to try it. And I think in the beginning, just like most people, what really got me to love Cursor was tab complete, right? It was the first Cursor feature that really had everyone fall in love with AI coding. The funny thing is, I was recently looking at my tab usage. I joined Cursor in September, and you can see in this chart that in September I used tab 1,400 times, and then in December I used it nine times. And I didn't take time off in December. This is my real tab usage over my first four months at the company. And I think if the chart went further into 2026, my tab usage would probably be zero right now, not even five or nine as it was in December.
The truth is, there's a lot of things from the old IDEs, from Emacs and VS Code, that we don't need anymore. You might remember that we spent years building and configuring text editors for maximum comfort and maximum productivity. There are options such as whether parentheses should auto-close or not: when you type a quote, or when you open a block with a bracket or curly brace, should it auto-close or not? Should the cursor blink, and if so, what should the interval in milliseconds be? Should your scroll be sticky, and if yes, by how many lines? And how long should the lines be: 80 characters, 200 characters?
Most of these things don't really matter anymore, and they are actually detrimental to the performance of IDEs, because when you have settings that look like this, where you can even configure which quotes and brackets have auto-closing and which ones don't, the problem is that you introduce basically factorial-level complexity into your product, which makes the product harder to test and harder to keep stable. If you look at VS Code, for example, it has a million settings. That's why I loved it so much, right? That's why I became such a diehard VS Code fan: because it let me configure everything. A good example is the minimap. The minimap in VS Code is this little minified, zoomed-out version of your files, and just for the minimap, VS Code has more than 10 different settings you can configure.
That's a lot. And it means that the developers who work on VS Code have a really large and broad surface area, and that makes it really hard for us at Cursor, and for the people at Microsoft working on VS Code, to innovate, because you just spend so much time fixing bugs and testing all these different configurations.
We simply don't need all of it. Agents are now writing 99% of our code, which means we have to rethink IDEs. But if we are going to rethink them, we have to start from first principles and think: what do we still need? What kinds of things are necessary for the current age of AI programming?
And one idea is: what if we just go completely bare bones? We remove all the distractions and we just have a chat. If you've ever used the Cursor CLI, that's what it is. That's what the Claude CLI is as well. That's what Codex is. These are the most raw, simple interfaces you can have for AI coding, and they're nice. I like the Cursor CLI. I actually use it every single day; I used it a lot to build this presentation, actually. I just tweeted last week that I use it every day, but ultimately I'm only using it for about 5% of my agent work. I need more than what the CLI can give me, and the kinds of things I need cannot really be rendered in a CLI. I need video playback, because most of my work nowadays is watching videos produced by agents to verify and assert that the work they did is correct. So I need video playback. I need image rendering. I need to be able to annotate images, annotate videos. I need rich interfaces such as mermaid diagrams, knowledge graphs, and charts, which I use for testing things as well as for data science and other data analysis work that I have to do.
I need the browser, and more importantly than just the browser, I need the Chrome DevTools and the Chrome DevTools debugger. I need all these rich interfaces that just cannot be dumbed down into a basic shell. Mobile developers need emulators, both iOS and Android. We need native system notifications that tell us when our agents are done. And then finally, last but not least, we have to edit code. I'm not saying agents write 100% of my code; they write maybe 97, 98%. And for those last few percent of my lines of code that I'm still writing myself, I need a basic code editor that functions well enough, that is decent enough: go-to-definition, Language Server Protocol support, other useful features. So I do need a decent editor. And all of these things combined is, I think, what we need for the IDE of 2026.
Another thing, and this is my own personal data from Cursor: more and more, I'm not running my agents locally. You can see I joined Cursor in September, this chart begins in November last year, and this is how many agents I'm spinning up per day. You can see a clear trend: more and more of my agents are cloud agents. So less and less, my computer is the bottleneck for performance, and more and more I'm letting the agents go and run for longer remotely. So what do I do in my IDE? Well, I watch videos, right? I am asserting that the agents do what I'm asking them to do. I work on the Cursor product interface, so I do a lot of front end, and I spend a lot of time just watching these videos. I use the browser to verify my work, test things, and play with what I'm building. And then I need rich interfaces, right? I need charts. And I actually use the Cursor canvas feature quite a bit to do data science. If you haven't used it, it's basically a feature where the agent can generate, on the fly, single-purpose
UIs that can go away when you don't need them anymore. Now, you might remember Cursor 2.0. That was actually the first project I worked on when I joined Cursor. Literally in my first week, my former manager said, hey, welcome to Cursor, we're building a new UI for agents. I was quite excited by the possibility of rethinking the interface for coding, and at the same time I was a bit skeptical that I would be so hands-off with my code. But lo and behold, after a few months I was barely typing any code at all. It looked like this. It was pretty basic, and we launched in late October. We didn't get as much customer love as we wanted with this iteration. It had some nice things: it had worktrees, it had parallel agents, and that was also when we introduced our first company model. But it didn't, again, get as much love as we wanted it to, even though more recently we added this super complex layout so you can tile and manipulate terminals and files and chats, and you can have these grids of 16x6 agents if you want to. But even with that, we still didn't feel like the interface was malleable enough for what
we wanted to give to our users. So why did we start from scratch? If you've heard about Cursor 3.0, which we launched two weeks ago, we basically took a from-the-ground-up approach to the IDE interface. The main reason we started from scratch was to break away from the VS Code baggage, right? The VS Code layout system, the VS Code architecture. We needed a blank slate, a white canvas where we could innovate from first principles and build the IDE interface that we really believe is right for most people. The other reason is that by starting from scratch with a new UI, a new window, our users can gradually adopt it. You can have both the traditional VS Code interface open as well as the new agent window open at the same time. You can hop between the two of them, or you can just use the new one, or just use the old one. With this model, we're not annoying anyone by forcing them into a new experience. Everyone who loves the VS Code interface can stay there, and the people who want to use the new interface can use it as well. That gives us a lot more flexibility as well as a lot more room for innovation.
The first thing I'll say is that it's missing your feedback right now. The only way we can keep iterating on the IDE interface and make it better is by listening to our users. That's how we build our product: we continuously listen to user feedback, we iterate, we launch new versions, and we do this in as tight a loop as possible. So the main thing that is missing is really more people using it. The more people we have using it and giving us feedback, the better we can make it. The other things we're really excited about are better handoff: for example, handing off agents that start on your phone, go into the cloud, get pulled onto your computer, maybe move into a worktree for a little bit, then maybe back to the cloud, and maybe you finish them from your phone. That kind of handoff needs to be seamless and very powerful. We also need more orchestration tools, because the more high-level we go with our AI coding techniques, the less time we'll necessarily spend reading all of the code. So we need kanban-style, DAG-style UIs for orchestration, where you can develop plans and split them off into multiple
agents. Better mobile support is something we desperately need. I know a lot of people use Cursor today for mobile dev, and they have to have Xcode open at the same time and keep popping back between Cursor and Xcode. Nobody should have to do that; they should just be able to stay in Cursor the whole time. Computer use: we have great computer use in our cloud agents, but we don't have it locally yet. That's something we're working on. The idea is that the agent can interact directly with your product. We allow this in the Cursor browser, but not for desktop apps. And then we need much better code review. That is, for me, one of our P0s at the company right now: we're not writing any code, but we are reading a lot of code. I know most of us are probably still reading 90-plus percent of the code, and the UI for code review needs to be rethought as well. And then there are many more things that we're also cooking in the background.
So, I'm going to say it again, really importantly: please send us your feedback. My DMs are open. You know where to find us; we're a very open company. And please try the new interface. I was actually sitting next to someone yesterday who had never tried it, and I just asked them: please click the button, I just want to see you use it. If you have feedback, just come ping me. And yeah, thank you for taking the time to listen to me today.
Right. All right. Can you believe that
this is coming to a close and you've
stayed till the very end? So, uh, wow.
Okay. So, we're gonna do a closing
remark. And for those of you who are
live streaming, also stay tuned because
we're going to bring up some really cool
people for the closing remarks. Um,
okay. So, just a couple of shoutouts.
Uh, which talk really inspired you? Give
me some names, some topics.
Memory. Which one?
Oh, Alvin Payne. Yes. Yeah, I heard that
one. Anybody else? Code mode.
>> Yes.
>> You're not the speaker, are you?
>> We can't really see you. Sorry. So, if
you are the speaker, you can also shout
out your your talk.
>> Which one?
>> Codex Spark. Oh, yeah. Yeah. Yeah.
Yeah.
>> Yeah. Keep keep messaging Gabe for sure.
>> Um,
>> versus CLA. Yeah.
>> Yeah. Yeah. Um, anybody is excited to
try something new after the conference?
Yes. Okay. A couple of people. Okay.
That's the energy that we're looking
for. We really hope that you learned
something new, made some good
connections, and you're really excited
about building a better future with AI.
So, that's kind of uh what we want from
this conference for you.
>> Yeah. And it was truly an honor and pleasure to be MCPs for this year's AI Engineer Miami.
>> The MCPs? Iman, are you becoming an MCP?
>> What is that? What did I say? Okay, I'm
tired. You see, I'm performing at 20% energy now. But it went so fast, and it was all thanks to you. You're awesome, and you made it easy for us. I would also like to thank everyone on the live stream. The videos are going to stay on YouTube, and that's the whole idea behind this: to enable many people to leverage AI in their lives and in their work.
Also, I want to leave you with a thought. It's true that AI is really making us much more powerful, and the same scope can maybe be done with fewer people, but who's to say we have to keep the scope limited? Let's tackle bigger problems. Let's take on much bigger problems that we perceived to be impossible before and go for that. And if a day comes when there are no more problems, we will all go to South Beach and play the guitar, and nobody has to work, right? So let's try to solve much bigger problems. And yeah, I feel we have this community here; it was really lovely to be a part of it, and I hope to see you all next year. Who's coming next year? Who's planning on it?
>> Yes, that's the energy that we want.
>> There you go. Back to you.
>> Yeah. Well, the party doesn't stop here. First of all, there is an afterparty at Thor Social at 7 p.m. So, if you want to continue the networking, feel free to come join us. Say hi to your favorite MCs, not MCPs,
>> MCP servers, MCs.
>> And um there's also more activities
going on this week. So, how many of you
are going to go to the React Miami
conference? Yes. So, the party
continues. So, keep making connections, really keep an open heart, apply the things that you have learned, make new connections, and really build amazing things with AI. Okay? And this
conference wouldn't be possible without
our organizers. So now I'm going to
invite our favorite organizers onto the
stage, Michelle, Gabe, and Rebecca.
Wow, this is awesome. Pinch me. Did this just happen? It did. It did. Thank you so much for coming. First, a
shout out to all you and the vibes that
you brought. Like they were all
beautiful, positive, and uh it's only
going to get better at around 6:30, with a slight amount of inebriation and tacos. And I want to thank, first, Swix and Ben and Leah at AI Engineer; please give them a round of applause. This would not be possible without them. Swix, early on, back in the early JavaScript days, six, eight, nine years ago, wanted to care for the community and provide you high-signal content in the, at the time, JavaScript space. So, Swix, wherever you are, thank you for your heart and for allowing us to be a part of AI Engineer
Miami. Um, this also would not be
possible without uh these amazing
sisters up here, Michelle and Becca. You
crushed it.
Any sense that this is actually for the community, and not just, you know, a cash grab, comes from these two right here. They have led this process in Miami for the last five years, right out of COVID, when nobody thought this was possible. Just thank you for what you do.
>> Thanks, Gabe.
Yeah, you guys are our heroes. Thank you so much for showing up for year one. A round of applause for yourselves, actually, very quickly. Please, thank you. We appreciate you so much. And Gabe, I'll pass it back to you, because it wouldn't have been possible without G2I and you, knowing what this was and the impact it could become. You trusted us with React Miami, and now we're doing AI Engineer Miami, and this is incredible. So, thank you, Gabe. Thank you, Gabe. We appreciate you.
All right. And we're going to do it again next year. But before next year, coming up is AIE Singapore in four weeks. Is anybody in this room going to go to Singapore? I mean, it's a very... I want to go.
>> Yeah. Okay. The live stream. So, four
weeks. You still have time to get your
tickets. It's going to be amazing. We're
really looking forward to it. And
without further ado, you have anything
you wanted to mention? Let's party. All
right.
Wrap it up, and see you at the afterparty.