AI Engineer Summit 2025: Agent Engineering (Day 2)

Channel: aiDotEngineer

Published at: 2025-02-21

YouTube video id: D7BzTxVVMuw

Source: https://www.youtube.com/watch?v=D7BzTxVVMuw

[Music]
[Applause]
Ladies and gentlemen, please join me in welcoming to the stage your MCs for the AI Engineer Summit Agent Engineering day: the founder and CEO of Superintelligent, NLW, and the founder of Turing Post, Ksenia.
[Music]
Hello everyone, welcome to the AI Engineer Summit, the agent engineering track. I think it's going to be an incredibly exciting day. Yesterday was all about the leadership; today is about the builders, so we're super excited to have you here. I'm the host of the AI Daily Brief podcast and the CEO of Superintelligent, which is an agent readiness platform, but most of all, I'm one of your MCs for today.

Good morning New York, and what a beautiful winter day. And good morning to the over 13,000 people who are watching us online on the live stream. Welcome to AI Engineer Summit 2025. My name is Ksenia, I'm the founder of Turing Post, a newsletter about AI and machine learning, and I'm also your co-MC today. Yesterday had a lot of amazing content, but I was really looking forward to this day, because today we will be discussing agents, and it's such a hot topic. We're talking about the agent engineering track, and it's all about builders, all about you, who are actively shaping our agentic future.
Across these talks you're going to see a bunch of different themes which together, I think, give you a really good sense of the story of the moment that we're in. AI agents are in the spotlight, but their development stretches back decades: research began in the 1950s with symbolic AI, gained traction with expert systems in the 1970s, and evolved into multi-agent systems in the 1980s. So today's leap isn't in theory but in scale. LLMs and automation frameworks are making agentic workflows practical, driving real-world adoption across industries, and what was once theoretical is now operational on an unprecedented level. That shift from theory to practice is one of the absolute key themes of today: you're going to hear a lot about what's actually happening now, things that are real, things that are live, not just things that could be in the future.
Yes, and in fact things have gotten so real that we will hear a number of use cases and deployment stories, such as how agents are impacting the finance space at Jane Street and BlackRock; very excited about that. But of course not everything is figured out. The other big theme for today is the big open questions, the challenges that remain: questions of scaling, accuracy, and memory.
Yes, we have a very exciting lineup today. As AI engineers, we will go deep on Gemini Deep Research and how it was built; we will learn from OpenAI and Anthropic how they think about building their agents; we will explore what reinforcement learning means for agents, and discuss how to scaffold wisely while scaling efficiently. Experts from Weights & Biases, Datadog, Morgan Stanley, Bloomberg, Brightwave, Galileo, and other amazing companies will share their insights. And don't forget, tomorrow we will have hands-on workshops.
Before we start, we have to give a big thank you to our sponsors, especially the AI Engineer Summit's platinum sponsor, Solana. Solana, for those of you who don't know, is a blockchain and crypto ecosystem; it's one of the ecosystems most directly at the center of the intersection between AI agents and crypto. In short, they are building a permissionless layer for unlocking and allowing your agents to create wealth. I think it's extremely telling, in terms of where they see their priorities, that they are here supporting this event, so a big thank you to Solana. If you want to learn more about them, they have a large booth downstairs with three demonstrations in the Expo.
This event would not be possible without all of our sponsors. All the companies you see on the screen are pushing the boundaries of AI engineering and represent a fascinating mix of pioneers shaping the future of the field. The Expo area is just down the stairs, in the hallway on the lower level. You've probably been there yesterday, but please take time to visit our sponsors and chat with their experts and top engineers. There is a wealth of knowledge to gain from them, and you can make connections; there are a lot of opportunities to explore. These sponsors could be your next collaborators, your service providers, or maybe mentors if you're just starting your AI journey. So take the time to visit them.
One last quick announcement before we get out of here, from a logistics perspective. At the break following their sessions, most speakers will be able to answer your questions at one of the three speaker Q&A lounges; there's one on this level and two down below, just outside the Expo. During break time there's also the hallway track, which allows you to gather and talk about the topics of the block from the session; it's a way to engage in the conversation in a more direct way. You'll see there are several breaks happening throughout the day, and hopefully you take advantage of those to go engage with the speakers and with each other. And make sure, of course, that you do not miss the afterparty in the Expo at 5:00 pm, after all of this is done. So, are we excited about today? Yeah? Good.
So please help me welcome this person, who probably all of you know. He's the editor of Latent Space, he's the founder of Smol AI, and he is a co-founder of this super practical AI Engineer Summit. He'll set the context for today's track and discuss what needs to happen in 2025 to make this the year of the agents. Please join me in welcoming swyx.
[Applause]
[Music]
Nice. Is this thing on? Hi, good morning everyone. Love that, love that.
I'm going to get right into it. One of the challenges we have with Summit is that we actually ask our speakers to do very short talks, so as the lead of Summit I have to do an even shorter talk. So let's go. There will be a lot of show notes and homework; you can see it on the live stream.

How is AI engineering doing? It's pretty good. We have an O'Reilly book, which is pretty cool. Chip is actually a good friend, and she's giving her keynote for the workshop session tomorrow, which is pretty cool. Gartner hates us; Gartner thinks we've hit the peak, so it's only downhill from here, guys. I'm sorry to inform you that AI engineering is over; there's nowhere else to go but down.

A lot of what I try to do with these talks that I give at each conference is to landmark the state of the art, or the state of the industry. With Latent Space I did The Rise of the AI Engineer; with the first AI Engineer Summit we talked about the three types of AI engineer; and with last year's AI Engineer World's Fair we talked about how the discipline of AI engineering was maturing and spreading across different disciplines. I think this is starting to get a little stale by now; a few million people have seen this and used it to form their teams, which I think was the intended effect.
What I am encountering these days is resistance from two sides of the AI engineer spectrum. If you come from an MLE point of view, you think that the AI engineer is mostly an MLE plus a few prompts. If you come from a software engineering point of view, you think that it's mostly software engineering plus calling a few LLM APIs. I think over time AI engineering is going to emerge as its own discipline. It's still not there yet, it's still very, very early; I still say things like "AI engineering is 90% software engineering, 10% AI." I think that AI share will grow over time, and I think this is the year when it starts to spread out, and that's what I'm here to talk about a little bit today.

For example, what I try to do with AI engineering is also a work of anthropology: how people describe themselves, form groups, form identities, and form industries. If you're an MLE, it leaks out in your language. MLEs say "test-time compute," because the only reason to run inference is to test it; AI engineers will maybe say "inference-time compute," because we actually really care about inference; software engineers might just say "reasoning." You see these differences, and I try to articulate them over time.
Part of what I want to do here to set context is to explain why we've pivoted AI Engineer Summit to be the agent engineering conference. It's not a decision that we made lightly, because we're saying no to all these things: we're saying no to RAG, we're saying no to open models and GPUs, and we're saying this is the only thing we're going to do today. But closing all those doors actually opens up others. When we put out the call for speakers, we made up this whole list of other agent engineering disciplines, and I soon realized we didn't have to; I'll talk about this in a bit. I also looked at last year's top-performing talks on YouTube, and you guys told us that you really wanted all the agentic things. Now, the only problem with this is that we only got speakers who basically made agent frameworks for a living, and everyone's asking the real question: who's putting this in production? So we had a new rule this year: no more vendor pitches. Oh, thank you. As a curator it makes things infinitely harder, because the people that you're about to see have no incentive to come on stage and share what they're sharing, but somehow we talked them into it, so I hope you're looking forward to that. The other thing I realized is that everything plus agents basically works: agents plus RAG works, agents plus search works, and so on. This is kind of the simple formula for making money in 2025. Most of these names you'll see in the talks that will follow in the sessions.
Sorry if you've heard this one before: 2025 is the year of agents, right? If you say it often enough, it might be true. I think that when people make predictions, oftentimes they confuse what they want to happen with what will actually happen. So maybe you believe Satya Nadella, maybe you believe Roman, maybe you believe Greg Brockman, maybe you believe Sam Altman; all of them want you to believe that 2025 is the year of agents. And I'll be very honest, my co-host Alessio (I think I saw you over there, hey) and I were pretty skeptical as well; we're on the record being skeptical. Actually, all of you are on the record too, because yesterday Bar played Family Feud with our audience, and the number two buzzword that everyone is tired of hearing is "agent." But fortunately you guys are not tired enough, because you came today; I have you for one more day of agent talk. We're also on record, in March 2024 with David Luan, the former VP of Engineering at OpenAI, saying that we used to tell people to take "agents" off of their branding; now we tell them to put it back on.
Okay. I think I'm doing this as a public service: to start any agents conference, we have to define the word "agent." Are you guys ready? It's a monumental task, but I could actually do it in one slide. Again, this is a very anthropological point of view. The machine learning people will talk about some kind of reinforcement learning environment; they want to talk about actions, achieving goals, and all that. AI engineers don't know what they want yet. The software engineers are very reductive: they just, you know, put it in a for loop. Okay, it seems like you agree.
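To make the joke concrete, here is a minimal sketch of that reductive view; the `llm` argument is a hypothetical stand-in for any chat-completion API call, not any specific library:

```python
# The reductive software-engineer definition of an agent:
# an LLM call in a for loop. `llm` is a hypothetical callable.
def reductive_agent(task: str, llm, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))  # the model decides the next step
        history.append(step)
        if "DONE" in step:              # the model signals completion
            break
    return history[-1]
```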
Fortunately, I think every AI engineering conference needs to invoke the name of Simon Willison; he is our patron saint. He's actually gone and crowdsourced over 300 definitions of what an agent is, so I didn't have to survey all of you. I was thinking about asking every single speaker to start with their definition; it doesn't matter. Here are six of them: it's either about goals, it's about tools, it's about control flow, it's about long-running processes, it's about delegated authority, or it's about multi-step task completion. I see all the phones coming out; don't worry, it's on the live stream, there are like 20,000 people watching along. And then there's a bunch of other things. I think the one on the bottom right is an interesting one: just have some examples that everyone agrees are agents, and make sure that your agent definition passes those things.
So that was my one slide of defining an agent. And then yesterday OpenAI went and dropped a new agents definition on their live stream, which you can watch as well. This is something they're obviously going to work with, and I think you should definitely pay attention to it, because they're building on top of this new definition as well. So that's defining agents.

Why now? Why are agents working now when they did not work a year ago, two years ago? I have a rough idea. People talk about capabilities, and you can see that capabilities, even on the trajectory from 2023 to 2025, have been really growing, and they've started to hit human baselines right about now.
I also have a map of other reasons, so I'll just bring you through each of them. Most people will say: we have better reasoning now, we have better tool use now, we have better tools, including MCP, which there's a workshop on tomorrow. But I think there are some other less appreciated things which I'm going to bring up right now. Model diversity: OpenAI's market share has gone from, let's say, 95% two years ago down to 50% now. It's a much more diverse landscape, including, just this past week, two frontier model labs that are possible challengers to OpenAI have emerged, which I think is really exciting for 2025; we don't actually know how it's going to shake out by the end of the year. The second thing is that the cost of intelligence is falling on a super-Moore's-law curve, is what I call it: the cost of GPT-4-level intelligence has gone down 1,000 times in the last 18 months, and you can see the same curve starting for o1-level intelligence. We also now start to have RL fine-tuning options; I have zero experience in this area, but fortunately one of our speakers is going to talk to us about it later today. So we have all these reasons, and I have a few more: in our conversation with Bret Taylor, he talks about charging for outcomes instead of costs, and there's a lot of work on multi-agents, as well as faster inference coming out of the better hardware that we have. There's more homework there if you want; this is all sourced and has some backing in our Latent Space conversations, but I don't really have time for that.
Okay, one last thing for you guys, on agent use cases. I think most people agree with Barry's "building effective agents" talk; he's going to talk about how coding agents and support agents have product-market fit. I think it's now fair to say deep research has PMF too. But also, I will say, up and coming are some of these use cases, some of which you're going to see in the talks later. I also want to offer anti-use-cases: can we please stop demoing agents that book flights? Yeah, no more flight-booking agents. I want to book my own flights, thank you very much. I want to place my own Instacart orders too. Right? Okay.
And I think the tell, the headline that I saw yesterday and had to put in: OpenAI reported 400 million users, which is 33% growth from three months ago. You can ask Deep Research to research OpenAI and draw this chart of ChatGPT growth, going from zero to 400 million users in two and a half years. I remember this chart very well, because ChatGPT spent a year not growing. And why did it spend a year not growing? Because they didn't ship any agentic models. If you actually look at the weekly active user chart and stretch it out, you get this chart, which is super interesting, because it basically shows that the o1 models have doubled ChatGPT usage, and if you stretch it out, ChatGPT is going to hit a billion users by the end of this year. It's basically going to quintuple the number of users it had as of September of last year. So the growth of ChatGPT, and the growth of any AI product, is going to be very tightly tied to reasoning capabilities and the amount of agents that you can ship for your users. It is real, and these are huge numbers: this is one-eighth of the world population that's going to be using ChatGPT by the end of this year, and I think there's a lot of money left on the table for everyone else.
I'm well past time, so I'm going to skip all this, but basically I think that the job of AI engineers is now evolving towards building agents, in the same way that MLEs build models and software engineers build software. I'm going to skip the rest; you can see all of it on the live stream. We're really just here to welcome you to the show, and I'm really excited to introduce you to everyone. So thank you, and I hope you enjoy it.
[Music]
Will 2025 be the year of the agents? Here to present how to build and evaluate agents is the author of AI Snake Oil, Sayash Kapoor.
[Applause]
The theme of this conference today is agents at work. Unfortunately, for the next 18 minutes you'll be stuck with me talking about how agents don't work very well today, and how we can do better when it comes to AI engineering.

There is so much interest in agents from all fronts: in the product world, in industry, in academic labs, in research, even if you're someone who doesn't think that companies will be able to scale language models all the way to AGI. What we're going to see more and more of in the near future is agents that are not really deployed directly, but that function as small parts of larger products and systems, and this is what AI is probably going to look like in the near future. swyx showed a few dozen definitions of AI agents; this is one of them, where language models control the flow of a particular system. In fact, even though people naively think of ChatGPT and Claude as models, these tools are actually examples of rudimentary agents at some level: they have input and output filters, they can carry out certain tasks, they can call tools, and so on. So in some sense agents are already widely used, as well as successful.
We've now seen mainstream product offerings that can do a lot more: OpenAI's Operator can carry out open-ended tasks on the internet, and the Deep Research tool can carry out 30-minute-long report-writing tasks on any conceivable topic. So that's the first reason I think today's conference is timely. But the second reason is that, on the flip side, the more ambitious visions of what agents can do are far from being realized. On the left here is a vision of what agents could do, something out of science fiction films like the film Her, and on the right-hand side is how these ambitious products have failed in the real world so far. Now, I'm pointing this out not to criticize the specific products on the slide, but to genuinely challenge the audience with the task of building AI agents that really work for the people who are about to use them. So over the course of this talk, I'll cover three main reasons why agents don't yet work, and what we can do to realize the potential of agents and get past some of these failures.
The first reason is that evaluating agents is genuinely hard. To begin, let's see some examples of how, when people have tried to productionize agents, these agents have failed in the real world.

DoNotPay is a US startup that claimed to automate the entire work of a lawyer. The startup's co-founder even offered a million dollars to any lawyer who would be willing to argue in front of the US Supreme Court wearing an earpiece connected to DoNotPay. In reality, a couple of years later, in fact very recently, the FTC fined DoNotPay hundreds of thousands of dollars; the reason for the fine was that the performance claims DoNotPay seemed to be making were actually entirely false. Now, you might consider this a case of a small startup rushing out claims that it cannot back up, so let's look at some of the work from more well-established companies. LexisNexis and Westlaw are widely regarded as two of the leading legal-tech firms in the US. A couple of years ago, LexisNexis launched a product which it claimed was hallucination-free in its ability to generate legal reports and reasoning. But when Stanford researchers evaluated LexisNexis and Westlaw products, they found that in up to a third of cases, and in at least a sixth of cases, these language models hallucinated. In particular, in some cases the hallucinations completely reversed the intentions of the original legal text; in others, entire paragraphs were made up. They have about 200 examples of such errors in leading legal-tech products.
We've also heard examples of AI agents soon automating all of scientific research. This is an example from the startup Sakana.ai: Sakana claimed they had built an AI research scientist that could fully automate open-ended scientific research. Our team at Princeton wanted to test this claim in the real world, in part because automating scientific research is one of our main research interests. So we built a benchmark called CORE-Bench. The tasks in this benchmark are way simpler than what you might expect from open-ended real-world scientific research: they just try to reproduce a paper's results, even providing the agent with the code and the data needed to reproduce them. As you can imagine, this is far simpler than automating all of science. What we found is that the best agents as of today cannot even automate scientific reproducibility reliably; fewer than 40% of the papers could be reproduced by the leading agents. Now, of course, these models keep getting better, and even if an agent can automate only 40% of reproducibility, that is a huge boost, because researchers spend a lot of time reproducing baselines from past results. But to argue on this basis that AI can soon automate all of science, or that agents will render scientific researchers obsolete, is way too premature. In fact, when people actually looked at how well Sakana's AI Scientist worked, they found that it was deployed on toy problems, that it was evaluated using an LLM as a judge rather than human peer review, and that once you start looking at the results, they turn out to be very minor tweaks on top of other papers; think undergrad research projects, rather than fully automating all of science.
Now, a couple of days ago, as I was preparing the slides for this talk, Sakana came out with another claim: they claimed to have built an agent for optimizing CUDA kernels. The claims were indeed very impressive; they could lead to a 150x improvement over the standard CUDA kernels that PyTorch comes with. The issue, though, was that if you analyzed their claims one level deeper, you would see that they were claiming to outperform the theoretical maximum of the H100 by 30 times. Clearly this claim was false, and once again it was because of the lack of rigorous evaluation: it turned out that the agent was simply hacking the reward function rather than actually improving the CUDA kernels. Once again, the point is not to call out a single company, but rather to flag that evaluating agents is genuinely a very hard problem. It needs to be treated as a first-class citizen in the AI engineering toolkit, or else we will keep risking failures like the ones on the slide.
The second reason why building agents that work in the real world is hard is that static benchmarks can be quite misleading when it comes to the actual performance of agents. That's because for the longest time we focused on building evaluations that might work pretty well for language models, but agents are not the same as models. For most language model evaluations, all you need to do is consider an input string and an output string; those are the domains where language models operate, and it's enough to construct an evaluation. On the other hand, when you're thinking about agents, these agents need to take actions in the real world; they need to interact with an environment, and building the sort of evaluation that makes those interactions possible, that creates the virtual environments within which these agents operate, is a much harder problem.

A second difficulty in evaluating agents is that for LLMs, the cost of evaluating a model is bounded by the context window length of the language model; you can basically think of these evaluations as having a fixed ceiling. But when you have agents that can take open-ended actions in the real world, there isn't any such ceiling: you can imagine these agents calling other sub-agents, there can be recursion, there can be all sorts of systems, maybe just LLM calls in a for loop. Because of this, cost needs to be, once again, a first-class citizen in all evaluations of agents. If you don't have cost as an axis alongside accuracy or performance, you're not going to really understand how well your agent works.
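A minimal sketch of what treating cost as a first-class citizen in an eval harness can look like; the task objects, the shape of the `agent` callable, and the per-token prices here are all assumptions for illustration:

```python
from dataclasses import dataclass

# Hypothetical per-token prices in USD; real prices vary by model.
PRICE_IN, PRICE_OUT = 3e-6, 15e-6

@dataclass
class EvalResult:
    accuracy: float
    total_cost: float  # reported alongside accuracy, never dropped

def evaluate(agent, tasks) -> EvalResult:
    correct, cost = 0, 0.0
    for task in tasks:
        # `agent` is assumed to return its answer plus token usage.
        answer, tokens_in, tokens_out = agent(task.prompt)
        cost += tokens_in * PRICE_IN + tokens_out * PRICE_OUT
        correct += int(answer == task.expected)
    return EvalResult(accuracy=correct / len(tasks), total_cost=cost)
```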
And finally, when you build a new benchmark for a language model, you can basically assume that you can evaluate every single language model on that benchmark. But when it comes to evaluating agents, these agents are often purpose-built: if there is a coding agent you want to evaluate, you can't really use a web-agent benchmark to evaluate it. This leads to a second hurdle, which is how to construct meaningful, multi-dimensional metrics to evaluate your agents, rather than relying on a single benchmark to determine how well they work. Now, all of these concerns might be thought of as theoretical; you could reasonably ask why we care if static evaluations don't really work well for agents. The reason is that, because of these differences around cost and accuracy, and because of the singular focus on optimizing for a single benchmark, we are basically unable to get a coherent picture of how well an agent works.
So at Princeton we developed an agent leaderboard that tries to solve some of these issues. For example, for the CORE-Bench leaderboard I mentioned earlier, you can have multiple agents evaluated with cost alongside accuracy. Here on this Pareto frontier you can see agents like Claude 3.5 scoring about as much as the OpenAI o1 models, but the Claude model actually costs $57 to run, whereas o1 costs $664. Even if the performance of o1 were a couple of percentage points higher (which it wasn't in this case, by the way), for most AI engineers the choice here is obvious: you would, any day of the week, take a model that costs ten times less while performing about as well.
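For illustration, a small sketch of how such a cost-accuracy Pareto frontier can be computed from eval results; the model names and numbers below are placeholders, not the leaderboard's actual figures:

```python
# Each entry: (model_name, cost_usd, accuracy). Placeholder numbers.
results = [
    ("model-a", 57.0, 0.38),
    ("model-b", 664.0, 0.39),
    ("model-c", 120.0, 0.31),
]

def pareto_frontier(results):
    """Keep models not dominated by a cheaper-and-at-least-as-accurate rival."""
    frontier = []
    for name, cost, acc in results:
        dominated = any(
            c <= cost and a >= acc and (c, a) != (cost, acc)
            for _, c, a in results
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda r: r[1])

print(pareto_frontier(results))  # keeps model-a and model-b; model-c is dominated
```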
Now, in response to this sort of two-dimensional Pareto view, I've often been asked: are LLMs becoming too cheap to meter? In other words, why do we even need to care about the cost of running an agent if the cost of running these models is dropping drastically? And it is indeed true that costs have dropped drastically in the last few years: if you compare text-davinci-003, which was OpenAI's model back in 2022, to today's GPT-4o mini, which in most cases outperforms that older model, the cost has dropped by over two orders of magnitude. But at the same time, if you're building applications that need to scale, this type of approach is still quite costly, especially from the point of view of releasing prototypes. One of the barriers for AI engineers is that you really need to iterate in the open, and if you don't account for cost, your prototype might soon end up costing you thousands of dollars. And finally, even if the cost of scaling inference-time LLM calls continues to drop, what is known as the Jevons paradox will, I suspect, keep increasing the overall cost of running agents. The Jevons paradox is a theory from a 19th-century British economist, who observed that as the cost of mining coal fell, the overall usage of coal across several industries increased, not decreased. The same happened when ATMs were introduced all over the US: people expected a loss of jobs for bank tellers, but what happened was the opposite. Because ATMs were so easy to install, the number of bank branches drastically increased, leading to an increase in the number of bank tellers employed. This is also what I expect will happen as the costs of language models keep dropping drastically, and that's why, for the foreseeable future at least, we do need to account for cost when it comes to agent evaluations.
So how do we do all of this in an automated way? Well, with the Holistic Agent Leaderboard, or HAL, we've come up with a way to automatically run agent evaluations on 11 different benchmarks already, and many more are on the way. Beyond that, though, even if we come up with these multi-dimensional benchmarks, even if we do come up with cost-controlled evaluations, there are still certain issues with this type of evaluation, and that's because agent benchmarks have sort of become the metric against which VCs fund companies. An example is Cosine, which raised its seed round of funding based on its results on SWE-bench. In fact, the agent developer Cognition raised $175 million at a valuation of $2 billion, driven primarily by the fact that its agent did very well on SWE-bench. Unfortunately, benchmark performance very rarely translates into the real world. There's an excellent analysis of how well Devin, the agent developed by Cognition, works, from the very impressive folks at Answer.AI. Instead of relying on standard benchmarks, they actually tried to incorporate Devin into their real work, and what they found was that over a month of use, across 20 different tasks, it was only successful at three of them. So this is the other reason why overreliance on static benchmarks can be really misleading.
How do we get over this? One of my favorite frameworks to think this through is the work by folks at Berkeley called "Who Validates the Validators?" At the top is the typical evaluation pipeline, which consists of singular LLM calls graded against static criteria; this is the sort of broken paradigm for AI evaluations we just discussed. At the bottom is what they propose: having humans in the loop, domain experts who proactively edit the criteria on which these LLM evaluations are based, and that can lead to much better evaluation results overall.
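A sketch of that bottom pipeline under stated assumptions: the grading criteria live in plain data that domain experts edit between runs, and `llm_judge` is a hypothetical call that grades one output against one criterion:

```python
# Criteria are data, not code: domain experts edit this list as they
# inspect failures, and every re-run grades against the latest version.
criteria = [
    "Cites only passages that actually appear in the source documents",
    "Does not reverse the meaning of the original legal text",
]

def evaluate_outputs(outputs, llm_judge):
    """`llm_judge` is a hypothetical call returning True/False per criterion."""
    report = []
    for out in outputs:
        verdicts = {c: llm_judge(output=out, criterion=c) for c in criteria}
        report.append((out, verdicts))
    return report
```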
This brings me to the last key takeaway about why agent performance does not really translate into the real world, which is the confusion between capability and reliability. Very roughly speaking, capability means what a model could do at a certain point in time; for those of you who are technically minded, this is the pass@k accuracy of a model for a very high k, meaning that at least one of the k answers the model outputs is correct. Reliability, on the other hand, means consistently getting the answer right each and every single time.
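In code, the distinction looks roughly like this; `model` and `check` are hypothetical callables (a sampling call and an oracle grader), and the numbers are arbitrary:

```python
def capability_pass_at_k(model, task, check, k: int = 100) -> bool:
    """Capability: does *any* of k sampled answers pass the check?"""
    return any(check(model(task)) for _ in range(k))

def reliability(model, task, check, n: int = 100) -> float:
    """Reliability: what fraction of n independent runs pass?
    Consequential deployments need this near 1.0 (the 'five 9s')."""
    return sum(check(model(task)) for _ in range(n)) / n
```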
When agents are deployed for consequential decisions in the real world, what you really need to focus on is reliability rather than capability. That's because language models are already capable of very many things, but if you trick yourself into believing that this means a reliable experience for the end user, that's when products in the real world go wrong. In particular, I think the methods for training models that get us to the 90% (what in swyx's terms would be the job of a machine learning engineer) don't necessarily get us to the 99.999%, what is often known as the five 9s of reliability. Closing this gap between the 90% and the 99.999% is the job of an AI engineer. And I think this is what led to the failures of products like the Humane Pin and the Rabbit R1: the developers did not anticipate that the lack of reliability in products like these would lead to product failure. In other words, if your personal assistant only orders your DoorDash food correctly 80% of the time, that is a catastrophic failure from the point of view of a product.
Now, one thing people have proposed to fix this sort of issue and improve reliability is to create a verifier, something like a unit test, and on this basis there have been several claims that we could improve the inference-scaling capabilities of these tools and get to very reliable models. Unfortunately, what we found is that verifiers can also be imperfect in practice. For instance, two of the leading coding benchmarks, HumanEval and MBPP, both have false positives in their unit tests; that is, a model can output incorrect code and still pass the unit tests. Once we account for these false positives, the inference-scaling curves bend downwards: rather than model performance continuing to improve, if there are false positives in your verifier, performance bends downwards, simply because the more you try, the more likely it is that a wrong answer slips through. So this is also not a perfect solution to the problem of reliability.
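A toy simulation of the effect, with made-up rates rather than measured ones: resampling until something passes the verifier, where correct answers always pass and wrong ones slip through with a small false-positive rate. With a perfect verifier the curve climbs toward 1.0; with false positives it stalls well below that, because wrong answers increasingly get accepted first:

```python
import random

def resample_accuracy(n, p_correct=0.2, fp_rate=0.05, trials=20_000):
    """Sample up to n candidate answers; return rate of accepting a
    truly correct one. Wrong answers pass the verifier with prob fp_rate."""
    wins = 0
    for _ in range(trials):
        for _ in range(n):
            if random.random() < p_correct:
                wins += 1   # a correct answer passes and is accepted
                break
            if random.random() < fp_rate:
                break       # a wrong answer slips past the verifier
    return wins / trials

for n in (1, 4, 16, 64):
    print(n, resample_accuracy(n), resample_accuracy(n, fp_rate=0.0))
```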
So what is the solution? I think the challenge for AI engineers is to figure out what sorts of software optimizations and abstractions are needed for working with inherently stochastic components like LLMs. In other words, it's a system design problem rather than just a modeling problem, where you need to work around the constraints of an inherently stochastic system.
And I want to argue, in the last minute of my talk, that this means looking at AI engineering as more of a reliability engineering field than a software or machine learning engineering field. This also brings me to the clear mindset shift that is needed to become successful from the perspective of being an AI engineer. The title slide of my talk pointed to one such area where we've already overcome these types of limitations of stochastic systems, and that is the birth of computing. The 1946 ENIAC computer used over 17,000 vacuum tubes, many of which, at the beginning, failed so often that the computer was unavailable half the time. The engineers who built it knew that this was a failure from the point of view of the end users, so their primary job in the first two years of this computer's life was to fix its reliability issues, to reduce them to the point where it worked well enough to become usable. And I'd say this is precisely what AI engineers need to be thinking about as their real job. It is not just to create excellent products, though that is important, but to fix the reliability issues that plague every single agent built on inherently stochastic models. So this is what I'll leave you with today: to become successful AI engineers, you need a reliability shift in your mindset, to think of yourselves as the people who are ensuring that this next wave of computing is as reliable for end users as possible. There's a lot of precedent for this type of thing happening in the past. All right, with this I'll leave you with the three key takeaways. It was a pleasure being here; thank you.
[Applause]
[Music]
Let's dive into Gemini Deep Research with our next presenters. Please join me in welcoming to the stage staff ML software engineer at Google, Mukund Sridhar, and product manager for Google Gemini, Aarush Selvan.
[Music]
[Applause]
Cool. Hey everyone, I'm Aarush, I'm a product manager here at Google. Hey, I'm Mukund, I'm a software engineer at Google working on Deep Research. I don't know if people have had a chance to try Deep Research on Gemini, or are familiar with the product, but you can try it if you go to Gemini Advanced and scroll past 2.0 Flash, 2.0 Flash Thinking Experimental, 2.0 Flash Thinking Experimental with Apps, and 2.0 Pro Experimental; there you will find 1.5 Pro with Deep Research, which is what we built. And if you have the chance to use it, and you pay the 20 bucks, you'll see that it's a personal research agent that can browse the web for you to build reports on your behalf.
So our motivation, and what we want to talk about today, is why we built it, some of the product challenges we overcame, and some of the technical challenges you'll face when building a web research agent. Our motivation was really that we wanted to help people get smart fast. We saw that research and learning queries are some of the top use cases in Gemini, but when you bring really hard questions to chatbots in general, what we were finding is that they would often give you a blueprint for an answer rather than the answer itself. We had this query that we used to throw around: tell me what it takes to get an athletic scholarship for shot put, and how do I go get one? Often the answers would be things like: you should talk to coaches, you should find out how far you should be able to throw, and you should make sure you have good grades. But really what I want to know is: okay, what are the grade boundaries? How far do I need to actually be able to throw? I want something super comprehensive, and that's where we saw a big opportunity. So we said: what if you remove the constraints of compute and latency at inference time, let Gemini take as long as it wants, browse the web as much as it needs, and see if we can trade that off for a much more comprehensive answer for the user? But you've got to do it in five minutes, because beyond that we don't have the chips.
This brought a bunch of product challenges for us. Gemini up to this point was an inherently synchronous feature, a chatbot, so we needed to figure out how you build asynchronous experiences in an inherently synchronous product. We also wanted to set expectations with users: Deep Research is good for one very specific thing, but a lot of user queries to Gemini are things like "what's the weather" or "write me a joke," where waiting five minutes is not going to get you a better answer, and we wanted to set those expectations. And the last thing is that our answers can be thousands of words long, and we needed to figure out how to make it easy for users to engage with really long outputs in a chat experience.
So let's walk through the UX and think about how we solved some of these. Imagine you're a VC, and everybody's talking about investing in nuclear in America, so you come with this query: help me learn the latest technology breakthroughs in small nuclear reactors, and tell me about interesting companies in the supply chain. The first step, when you bring this query to Deep Research, is that Gemini actually puts together a research plan for you and presents it in a card. This is the first way we're able to communicate with users that this is different, this isn't your standard chatbot experience: something's going to happen, you're going to hit start. But it's also an opportunity to show the user a research plan that they can edit and engage with, kind of like a good analyst would. They wouldn't just get to work; they'd show you, here's how I'm going to approach this. And it's a way for users, if they want, to steer the direction of the research further.

Once you hit start, we try to show you what Gemini is doing under the hood in real time, by showing you the websites it's browsing. This is a feature that was built before thinking models, and thoughts are also a really great way of showing transparency into what the model is thinking. What's really nice here is that while you wait, you can click through the websites and dive into any of the content. What we also inadvertently saw is people trying to game that number to see how high it could go; we definitely saw people push the number of websites read into the thousands. Finally, you get this report that's thousands of words long. We were really inspired by what Anthropic does with Artifacts, and we thought that was a really great way of pinning an artifact so that users can ask questions about the research while reading the material; they don't have to scroll back and forth. What's really neat about this is that it makes it easy to change the style of the report, add sections, remove sections, and ask follow-up questions. The last part that's super important is user trust, and also doing right by the publishers: we always try to show all the sources we read as well as all the sources we used in the report, because not everything that we read is used, but it stays in context for follow-up questions. These all carry over to Google Docs as citations and things like that, if you choose to export.
So I thought today we could pick some of the challenges that one encounters while building a research agent and talk through them; I picked four for today. One is that the long-running nature of these tasks introduces a couple of things we need to look into. Second, the model has to plan iteratively and spend its time and compute effectively. Third, it has to do this while interacting with a very noisy environment, the web. And fourth, as you do this and read through information very quickly, your context starts to grow, so how do you effectively manage context?
So, if you think about a job that runs for multiple minutes and can make many, many different LLM calls and calls to different services, there are bound to be failures. Today we're talking about on the order of minutes, but you can easily imagine these kinds of research agents taking multiple hours in the future, so it's important to be robust to intermediate failures across various services of varying reliability. That means building a good state-management solution and being able to recover from errors effectively, so that you don't drop the whole research task due to one failure. That's one aspect.
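A sketch of the kind of state management this implies: checkpoint after every completed step, and retry only the step that failed, so a single flaky service call never restarts the whole research task. All names here are hypothetical illustrations, not Gemini's internals:

```python
import json
import pathlib
import time

def run_research(steps, state_path="research_state.json", max_retries=3):
    """Execute (name, callable) research steps, checkpointing after each
    so a crash or flaky downstream service never loses completed work."""
    path = pathlib.Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {"done": {}}
    for name, step_fn in steps:
        if name in state["done"]:
            continue  # already completed before a previous failure
        for attempt in range(max_retries):
            try:
                state["done"][name] = step_fn()
                break
            except Exception:
                time.sleep(2 ** attempt)  # back off, retry this step only
        else:
            raise RuntimeError(f"step {name!r} failed after retries")
        path.write_text(json.dumps(state))  # checkpoint
    return state["done"]
```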
The second aspect is that doing this enables the feature to work cross-platform. We believe more and more users will start registering research tasks and just walk away and do their thing; you then get notified, now across devices, and you can pick up reading the report once it's done.
So now, what is the model actually doing through these few minutes? Let's take an example. Here we're looking for athletic scholarships for shot put. There are many facets to this query, and we show them in a research plan, like Aarush showed. The first thing the model has to do is figure out which of these sub-problems it can start tackling in parallel, versus things that are inherently sequential; the model has to be able to reason to do that. The other challenge is that you're always going to land in a state where there's partial information, so it's important to look at all the information found so far before deciding what to do next. In this instance, the model found that it knows the qualifying standards for the D1 division, but in order to provide a complete report and answer the user's question, it has to go figure out the equivalents for the D2 and D3 divisions. So this notion of being able to ground on the information you find and then plan your next step is key.
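One way to sketch that parallel-versus-sequential split with asyncio; the `search` callable and the `find_gap` planner hook are hypothetical placeholders for the model's own reasoning:

```python
import asyncio

async def research(sub_questions, search):
    """Run independent sub-questions concurrently, then let the planner
    issue follow-ups grounded in everything found so far."""
    findings = dict(zip(
        sub_questions,
        await asyncio.gather(*(search(q) for q in sub_questions)),
    ))
    # Sequential follow-up: e.g. D1 standards found, now ask about D2/D3.
    while (gap := find_gap(findings)) is not None:
        findings[gap] = await search(gap)
    return findings

def find_gap(findings):
    """Hypothetical planner hook: return a follow-up query, or None."""
    return None
```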
Another example of partial information is when you make searches. In this case we're trying to find the best roller coasters for kids, and you might find results that, again, provide partial information: here you end up at a link which talks about the top 10 roller coasters but doesn't mention anything about them being suitable for kids, so the planner has to recognize this fact and, in the next steps of planning, try to resolve that ambiguity.

Another planning challenge is that information is often not found in one place; you find facets of information spread across different sources. Here we're trying to find what it would take to get a certification at some scuba diving centers nearby. One source has the structure of what you have to go through to get a certification, but a completely different source has the pricing for the diving center, so the model has to weave these together to figure out what the cost structure for such a certification would look like.

Then there's the classic entity resolution problem: you might find mentions of the same entity across different sources, so you need to reason over whatever identifying information you have to figure out whether they're talking about the same entity, or whether you need to explore more to resolve the ambiguity.

And I think most people here have worked on some notion of a web problem, and we know the web is super fragmented. Here you see two different websites talking about the same thing, music festivals in Portugal this year. On the left, if you end up at such a website, it's easier, and you get most of your information in one go; on the right, the layout is different. So having a robust browsing mechanism, if you want to navigate the web for your research tasks, is another important challenge.
So, like we saw, there are a lot of these intermediate outputs, and as you plan and receive streams of information, you can imagine your context size growing very quickly. The other challenge with context size is that your research task doesn't typically end with your first query: people have follow-ups, people can say, hey, can you also do the same for this other topic? So there's this kind of follow-up deep research, and that also adds pressure on the context. We at Gemini have the luxury of really long-context models, but even then you have to design some way to manage your context effectively, and there are multiple choices here, each with different trade-offs. We're showing one here, where we have a recency bias: you keep a lot more information about your current and previous tasks, but for older tasks we selectively pick out what we call research notes and put them in a RAG, so the model can still access them, but it's being selective.
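A sketch of that trade-off: keep the most recent tasks verbatim in the prompt, and compress older tasks into notes in a retrieval store. This is a simplification for illustration; the `summarize` and `notes_store` objects are hypothetical, not Gemini's actual components:

```python
class ResearchContext:
    """Recency-biased context: recent tasks stay verbatim in the prompt;
    older tasks are distilled to notes the model retrieves on demand."""

    def __init__(self, summarize, notes_store, keep_verbatim=2):
        self.summarize = summarize      # hypothetical LLM summarizer
        self.notes_store = notes_store  # hypothetical retrieval store
        self.keep_verbatim = keep_verbatim
        self.tasks = []                 # list of (query, full_transcript)

    def add_task(self, query, transcript):
        self.tasks.append((query, transcript))
        while len(self.tasks) > self.keep_verbatim:
            old_query, old_transcript = self.tasks.pop(0)
            # Distill the old task into research notes and index them.
            self.notes_store.add(old_query, self.summarize(old_transcript))

    def build_prompt(self, followup):
        notes = self.notes_store.search(followup)  # selective retrieval
        recent = "\n\n".join(t for _, t in self.tasks)
        return f"Notes:\n{notes}\n\nRecent research:\n{recent}\n\nTask: {followup}"
```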
I'll hand it back to Aarush to talk about what's next. Yeah, so we were super excited to put this feature out in December. We weren't actually sure if anyone was going to use it, if anyone was going to care enough to wait five minutes for something, and we were really positively surprised by the reception. What we saw was: hey, we've built something that's maybe as good as an analyst, and we give it away for 20 bucks. That's really great, but it just retrieves from the open web, and it's a text-in, text-out-only system. So we see a few different directions for where research agents are going to go next.

The first one is around expertise: how do you go from a McKinsey analyst to a McKinsey partner, or a Goldman Sachs partner, or a partner at a law firm? That's really about not just being able to aggregate and synthesize information, but also thinking through the "so what": what are the implications, and what are the most interesting insights and patterns that come out of it? The other thing is that there are plenty of domains beyond professional services, like the sciences, where you want something that can read many papers, form hypotheses, find really interesting patterns in the methods used, and come up with novel hypotheses to explore.

However, just because you build something that can be really smart doesn't mean it's useful to someone. If we were thinking about a use case of running due diligence on a company, the way you'd present that information to me would be very different from the way you'd present it to, say, a Goldman Sachs banker. For me, you'd really want to talk through what this company is and how it's positioned strategically, but a banker would want all the financial information, and to actually have a DCF they could look at, with much more fine-grained financial modeling and analysis. And that really should shape the way you browse the web, the way you frame your answer, the kinds of questions you pursue; it should all be personalized to meet the user where they're at.

I think the last part is something that goes across domains of what models can do: not just being able to do web research with text, but combining that with abilities in coding, data science, even video generation. Coming back to this example: if you're doing due diligence, what if it could go and do a lot of statistical analysis and actually build financial models to inform the research output it gives you, telling you why this is or isn't a good company? I should say Google doesn't give financial advice, and it's not a financial advisor. But we're really excited about the potential; we think there's a ton of headroom to make research agents better. And we are really glad we didn't call this "Gemini Deep Dive," which was our best name before launching this feature. That's it, thank you so much, thank you.
[Applause]
[Music]
Our next presenter is a member of technical staff at Anthropic, here to present how they build effective agents. Please join me in welcoming to the stage Barry Zhang.
[Music]
[Applause]
All right, can you guys hear me? Yeah? All right, awesome. Wow, it's incredible to be on the same stage as so many people I have learned so much from. Let's get into it. My name is Barry, and today we're going to be talking about how we build effective agents. About two months ago, Erik and I wrote a blog post called Building Effective Agents. In it, we shared an opinionated take on what an agent is and isn't, and we gave some practical learnings that we have gained along the way. Today I'd like to go deeper on three core ideas from the blog post and provide you with some personal musings at the end. Here are those ideas: first, don't build agents for everything; second, keep it simple; and third, think like your agents.
Let's first start with a recap of how we got here. Most of us probably started by building very simple features: summarization, classification, extraction; things that felt like magic two to three years ago and have now become table stakes. Then, as we got more sophisticated and as products matured, we got more creative. One model call often wasn't enough, so we started orchestrating multiple model calls in predefined control flows. This gave us a way to trade off cost and latency for better performance, and we call these workflows. We believe this was the beginning of agentic systems. Now models are even more capable, and we're seeing more and more domain-specific agents start to pop up in production. Unlike workflows, agents can decide their own trajectory and operate almost independently based on environment feedback. That is going to be our focus today. It's probably a little too early to name what the next phase of agentic systems will look like, especially in production; single agents could become a lot more general-purpose and capable, or we could start to see collaboration and delegation in multi-agent settings. Regardless, I think the broad trend is that as we give these systems more agency, they become more useful and more capable, but as a result, the cost, the latency, and the consequences of errors also go up. And that brings us to the first point: don't build agents for everything.
Well, why not? We think of agents as a way to scale complex and valuable tasks; they shouldn't be a drop-in upgrade for every use case. If you have read the blog post, you'll know that we talked a lot about workflows, and that's because we really like them: they are a great, concrete way to deliver value today. So when should you build an agent? Here's our checklist. The first thing to consider is the complexity of your task. Agents really thrive in ambiguous problem spaces; if you can map out the entire decision tree pretty easily, just build that explicitly and then optimize every node of that decision tree. It's a lot more cost-effective, and it's going to give you a lot more control.
The next thing to consider is the value of your task. The exploration I just mentioned is going to cost you a lot of tokens, so the task really needs to justify that cost. If your budget per task is around 10 cents, for example because you're building a high-volume customer support system, that only affords you 30,000 to 50,000 tokens per task; in that case, just use a workflow to solve the most common scenarios, and you'll capture the majority of the value.
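(For rough intuition on that number: at an assumed price on the order of $2 to $3 per million tokens, a 10-cent budget buys roughly 30,000 to 50,000 tokens per task; the exact figure depends on the model's actual pricing.)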
On the other hand, though, if you look at this question and your first thought is "I don't care how many tokens I spend, I just want to get the task done," please see me after the talk; our go-to-market team would love to speak with you. From there, we want to de-risk the critical capabilities. This is to make sure there aren't any significant bottlenecks in the agent's trajectory: if you're building a coding agent, you want to make sure it's able to write good code, to debug, and to recover from its errors. If you do have bottlenecks, they're probably not going to be fatal, but they will multiply your cost and latency, so in that case we normally just reduce the scope, simplify the task, and try again. Finally, the last important thing to consider is the cost of errors and of error discovery. If your errors are high-stakes and hard to discover, it's going to be very difficult to trust the agent to take actions on your behalf with more autonomy. You can always mitigate this by limiting the scope (you can have read-only access, you can have more humans in the loop), but this will also limit how well you're able to scale your agent in your use case.
Let's see this checklist in action: why is coding a great agent use case? First, going from a design doc to a PR is obviously a very ambiguous and complex task. Second, a lot of us here are developers, so we know that good code has a lot of value. Third, many of us already use Claude for coding, so we know that it's great at many parts of the coding workflow. And last, coding has this really nice property that the output is easily verifiable through unit tests and CI. That's probably why we're seeing so many creative and successful coding agents right now.
Once you find a good use case for an agent, the second core idea is to keep it as simple as possible. Let me show you what I mean. This is what agents look like to us: models using tools in a loop. In this frame, three components define what an agent really is. First, the environment: the system the agent operates in. Then a set of tools, which offer an interface for the agent to take action and get feedback. Then the system prompt, which defines the goals, the constraints, and the ideal behavior for the agent in that environment. The model then gets called in a loop, and that's an agent (a minimal sketch follows below). We have learned the hard way to keep this simple, because any complexity up front really kills iteration speed. Iterating on just these three basic components gives you by far the highest ROI; optimizations can come later.
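Here is a minimal sketch of "models using tools in a loop". The `call_model` stub and the toy `read_file` tool are my placeholders, not Anthropic's implementation; the point is just how little scaffolding the three components require.

```python
# A minimal agent loop: environment (the filesystem), tools, system prompt,
# and a model called in a loop. `call_model` stands in for a real LLM client.
import json

SYSTEM_PROMPT = "You are an agent. Use tools to accomplish the user's task."

def read_file(path: str) -> str:
    """A toy tool: the interface for acting on the environment."""
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}

def call_model(system: str, messages: list) -> dict:
    # Replace with a real LLM call; this stub just ends the loop immediately.
    return {"type": "final", "content": "stub answer"}

def run_agent(task: str, max_steps: int = 10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):              # the loop
        reply = call_model(SYSTEM_PROMPT, messages)
        if reply["type"] == "final":        # the model decided it's done
            return reply["content"]
        result = TOOLS[reply["tool_name"]](**reply["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "step budget exhausted"
```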
Here are three agent use cases we've built for ourselves or for our customers, just to make this concrete. They look very different on the product surface, in their scope, and in their capability, but they share almost exactly the same backbone; in fact, they share almost exactly the same code. The environment largely depends on your use case, so really the only two design decisions are: what set of tools do you offer the agent, and what prompt do you want the agent to follow?

On that note, if you want to learn more about tools, my friend Mahesh is giving a workshop on the Model Context Protocol (MCP) tomorrow morning. I've seen that workshop; it's going to be really fun, so I highly encourage you to check it out.
But back to our talk. Once you have these three basic components figured out, there are plenty of optimizations to do from there. For coding and computer use, you might want to cache the trajectory to reduce cost. For search, where you have a lot of tool calls, you can parallelize many of them to reduce latency (see the sketch below). And for almost all of these, you want to present the agent's progress in a way that earns user trust. But that's it: keep it as simple as possible as you're iterating. Build these three components first, then optimize once you have the behaviors down.
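As an illustration of the latency point, here is one common way to run independent tool calls concurrently with asyncio; the `fetch_doc` tool is hypothetical.

```python
# When tool calls are independent, issue them concurrently: total latency is
# roughly the slowest call, not the sum of all of them.
import asyncio

async def fetch_doc(doc_id: str) -> str:
    await asyncio.sleep(1.0)               # stand-in for a network call
    return f"contents of {doc_id}"

async def run_tool_calls(doc_ids: list[str]) -> list[str]:
    return await asyncio.gather(*(fetch_doc(d) for d in doc_ids))

results = asyncio.run(run_tool_calls(["a", "b", "c"]))  # ~1s total, not ~3s
```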
All right, the last idea: think like your agents. I've seen a lot of builders, myself included, develop agents from our own perspective and then get confused when the agent makes a mistake that seems counterintuitive to us. That's why we always recommend putting yourself in the agent's context window. Agents can exhibit really sophisticated behavior that looks incredibly complex, but at each step, all the model is doing is running inference on a very limited set of context. Everything the model knows about the current state of the world is explained in those 10 to 20 thousand tokens. It's really helpful to limit ourselves to that context and ask whether it's actually sufficient and coherent. This gives you a much better understanding of how agents see the world, and helps bridge the gap between our understanding and theirs.
Let's imagine for a second that we're a computer-use agent and see what that feels like. All we get is a static screenshot and a very poorly written description (this one by yours truly). Let's read through it: you're a computer-use agent, you have a set of tools, and you have a task. Terrible. We can think and talk and reason all we want, but the only thing that actually takes effect in the environment is our tools. So we attempt a click without really seeing what's happening, and while the inference and the tool execution are running, it's basically equivalent to closing our eyes for three to five seconds and using the computer in the dark. Then you open your eyes and see another screenshot: whatever you did could have worked, or you could have shut down the computer; you just don't know. It's a huge leap of faith, and the cycle starts again. I highly recommend trying a full task from the agent's perspective like this; I promise it's a fascinating and only mildly uncomfortable experience.

Once you go through that mildly uncomfortable experience, it becomes very clear what the agent would actually have needed. It's clearly crucial to know the screen resolution, so the agent knows how to click. It's also good to have recommended actions and limitations, so there are guardrails around what it should be exploring and it can avoid unnecessary exploration. These are just examples: do this exercise for your own agent use case and figure out what context you actually want to provide to the agent.
Fortunately, though, we're building systems that speak our language, so we can just ask Claude to understand Claude. You can throw in your system prompt and ask: is any of this instruction ambiguous? Does it make sense to you? Are you able to follow it? You can throw in a tool description and see whether the agent knows how to use the tool, or whether it wants more or fewer parameters. And one thing we do quite frequently is throw the agent's entire trajectory into Claude and ask: why do you think we made this decision right here, and is there anything we could do to help you make better decisions? (A sketch of this follows below.)
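As a sketch of that workflow using the Anthropic Python SDK; the model name is an example and the prompt wording is mine, not from the talk.

```python
# Ask the model to critique the agent's own context: system prompt plus one
# full trajectory. Requires ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

def critique(system_prompt: str, trajectory: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",   # example model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Here is an agent's system prompt and one full trajectory.\n"
                f"SYSTEM PROMPT:\n{system_prompt}\n\n"
                f"TRAJECTORY:\n{trajectory}\n\n"
                "Is any instruction ambiguous? Why do you think the agent "
                "made these decisions, and what context would have helped?"
            ),
        }],
    )
    return msg.content[0].text
```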
This shouldn't replace your own understanding of the context, but it will help you gain a much closer perspective on how the agent sees the world. So once again: think like your agent as you're iterating.
All right, I've spent most of the talk on very practical stuff, so I'm going to indulge myself and spend one slide on personal musings: my view on how this might evolve, and some open questions I think we need to answer together as AI engineers. These are the top three things always on my mind.
First, I think we need to make agents a lot more budget-aware. Unlike workflows, we don't really have a good way to control the cost and latency of agents. Figuring this out will enable a lot more use cases, because it gives us the control needed to deploy them in production. The open question is what the best way is to define and enforce budgets, in terms of time, money, and tokens: the things we care about. (One possible shape of an answer is sketched below.)
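One possible shape of an answer, sketched under my own assumptions about what a per-task budget tracks; the limits and the `step` callable are hypothetical.

```python
# A wrapper that tracks tokens, dollars, and wall-clock time, and stops the
# agent loop when any limit is hit.
import time
from dataclasses import dataclass, field

@dataclass
class Budget:
    max_tokens: int = 50_000
    max_usd: float = 0.10
    max_seconds: float = 120.0
    tokens: int = 0
    usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def exhausted(self) -> bool:
        return (self.tokens >= self.max_tokens
                or self.usd >= self.max_usd
                or time.monotonic() - self.started >= self.max_seconds)

def run_with_budget(step, budget: Budget) -> str:
    """`step` runs one agent turn and returns (done, tokens_used, cost_usd)."""
    while not budget.exhausted():
        done, toks, cost = step()
        budget.tokens += toks
        budget.usd += cost
        if done:
            return "finished"
    return "budget exceeded"  # surface this to the user instead of spinning
```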
Next up is the concept of self-evolving tools. I've already hinted at this two slides ago: we're already using models to help iterate on tool descriptions, and this should generalize into a meta-tool where agents design and improve their own tool ergonomics (a hypothetical sketch follows below). That will make agents much more general-purpose, since they can adapt the tools they need for each use case.
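A hypothetical sketch of what such a meta-tool could look like; none of these names come from the talk.

```python
# A meta-tool the agent can call when a tool description proved confusing;
# proposed revisions are recorded for human review rather than applied live.
TOOL_REGISTRY = {
    "search": {"description": "Search the web.", "proposed": None},
}

def propose_tool_revision(tool_name: str, new_description: str) -> str:
    """Meta-tool: record a revised description for a tool."""
    TOOL_REGISTRY[tool_name]["proposed"] = new_description
    return f"Revision to '{tool_name}' recorded for human review."

# After a run, diff `proposed` against `description` and accept the ones
# that measurably improve your evals.
```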
Finally, and I don't even think this is a hot take anymore: I have a personal conviction that we'll see a lot more multi-agent collaboration in production by the end of this year. Multi-agent systems parallelize well, they have a very nice separation of concerns, and having sub-agents, for example, really protects the main agent's context window. But I think the big open question is how these agents actually communicate with each other. We're currently in a very rigid frame of mostly synchronous user-assistant turns, and most of our systems are built around that. How do we expand from there, build asynchronous communication, and enable more roles that let agents communicate with and recognize each other? That's going to be a big open question as we explore this multi-agent future.

These are the areas that take up a lot of my mind space. If you're also thinking about them, please shoot me a text; I would love to chat.

OK, let's bring it all together.
If you forget everything I said today, here are the three takeaways. First, don't build agents for everything. If you do find a good use case and want to build an agent, keep it as simple for as long as possible. And finally, as you iterate, think like your agent: gain their perspective and help them do their job.

I would love to keep in touch with all of you. If you want to chat about agents, especially those open questions I talked about, it would be incredibly lovely to jam on some of these ideas; these are my socials if you want to get connected. I'll end the presentation on a personal anecdote. Back in 2023 I was building AI products at Meta, and we had this funny thing where we could change our job description to anything we wanted. After reading that blog post from swix, I decided I was going to be the first AI engineer. I really love the focus on practicality and on making AI actually useful to the world, and I think that aspiration brought me here today. I hope you enjoy the rest of the AI Engineer Summit, and in the meantime, let's keep building. Thank you.
[Applause]
[Music]
Our next speaker works for a company that's built industrial-grade AI agents for consumer brands like Sonos, ADT, and SiriusXM. Here to give us a peek into how they do it is AI product manager at Sierra, Zach Reneau-Wedeen.
[Music]
[Applause]
Hey everyone, my name is Zach Reneau-Wedeen. I'm going to tell a few stories and hopefully leave you entertained and with an idea of how we build agents and improve them at Sierra. In a nutshell, Sierra is the conversational AI platform for businesses. Just to poll the room out of curiosity: how many people have heard of Sierra? So, most of the room, but not all. If you've heard of us, you probably associate us with chat experiences, and perhaps with customer service, and that's a lot of what we do, but I'd say we're broadening out in both respects. Probably by the end of this year, most of our interactions will be over the phone, so that's already a big area for us, and we'll also have a lot more touchpoints: we have a lot of customers, some of whom I'll show today, who use us for sales, for subscription management, for product recommendations, really all pieces of the customer experience.
I noticed a lot of people here were also here yesterday. It was funny to watch: people were reflecting on how much has happened in AI, and they had these timelines that went way back in time. Colin from Augment Code went all the way back to 2023. Waseem from Writer talked about purpose-built models and went all the way back to 2020. And Grace from Lux Capital went even further, back to 2019, although if you zoom in, the first thing on her timeline is still from 2020. So everyone was reflecting on ancient history in AI, and it was all this decade. I'm going to zoom back even further: 2016, the AI caves.
I know what you're thinking: AI goes back to the '70s and all that, but it definitely felt like the caves in 2016. I know because, if you zoom in on the bottom right, you can see I'm actually down there. I was working at Google with a bunch of amazing computer vision engineers, and what that meant in 2016 was that we were really trying to help computers understand the difference between Chihuahuas and blueberry muffins. And it's not actually that simple; it's not just Chihuahuas and blueberry muffins, it's dogs and bagels, dogs and mops, and of course dogs and fried chicken. In other words, we were building the first version of Google Lens.
At this time I lived in New York City, in the East Village, with about a 30-minute walk to work, and on my walk I would see a bunch of stuff. New York is one of the greatest walking cities in the world, and I'd wonder: what's going on there? What are they even doing? Is that bookstore nice? Is this restaurant tasty? Oh my goodness, look at that dog. There were also a bunch of flowers on the walk. At this time, Google Lens was in its infancy, and one of the very few things computer vision models were actually good at that had some consumer application was identifying plants; you might still know this today, it's kind of in the "is that bug poisonous" category. So I'd ask questions on the walk, like: can it tell the color of the plant in addition to the species? What type of fern or palm is that? There were a bunch of flower shops on this walk, so I'd even walk in (these are all actual photos from my 2016 walks to work) and test them all out. As you can imagine, sometimes it was accurate, and sometimes it wasn't necessarily wrong, but it wasn't really on the nose either. It felt like a slot machine, and I think everyone here who's building with AI understands that feeling of "it worked five times in a row, why didn't it work the sixth time?" Whether it's the non-determinism of the inputs or of the outputs, that's just part of what it means to be building with AI.
So let's fast-forward a bit to present-day Google Lens. You can not only search what you see, you can shop what you see; you can do it on Google Images, on YouTube, or with your camera. You can translate non-Latin character sets into English, so you can read the washing machine in Tokyo and actually figure out what settings to use in your Airbnb. You can do your math homework (I'm a little too old to have benefited from this, but apparently it's a brave new world out there for the kids). And of course, this is from the Google Lens homepage, you can still identify flowers. This is all very mind-blowing, but in my opinion it comes down to consistent, step-by-step iteration over a decade. When we think about what drives this, we're all engineers in this room: we understand that you need a process to iteratively improve, to get better without also getting worse. Over time, this has been codified as the software development life cycle: how do you continuously improve, how do you implement, test, maintain, analyze, design, and go through that loop as many times as you can?
Let's rewind a bit more: 2012, the AI caves. The drawings are a little less sophisticated; I'm not there yet. I pulled some headlines from around this time. This is around when Google Brain was watching cat videos on YouTube and identifying them, and it was a big breakthrough. I don't know if anyone remembers how big that model was: about a billion parameters. That was a huge breakthrough; if you consider that today's frontier models are about a trillion parameters, it was one one-thousandth, as if this whole room had a quarter of a person in it.
It was still very impressive at the time. There was also a theory back then that computers would be limited in what they could achieve; I think that's a less popular theory today. What I'm trying to say is: it was a long time ago. This is also around the time Marc Andreessen published his famous essay saying software is eating the world, which took a lot of people by storm. If you had looked at Stanford's campus, you'd have seen some early-stage startups forming on the lawn. Does anyone know which startups I'm talking about? You can call it out. OK, you might be thinking Snapchat?
Not that one. I did actually hear DoorDash in the back: a very good guess, but not that one either. You look like stylish people, so I think you'll know what I'm talking about: Chubbies. Chubbies had a contrarian idea that was also right: not only is software eating the world, but teeny shorts for men are also going to take over. As I mentioned, they were correct, which you can see here, and also here.

Fast forward to 2024. Kit Garten, SVP of Commercial at Chubbies: we were fortunate enough to host her in Sierra's office. Chubbies has had an amazing brand since they were founded, and they've always been at the forefront of customer experience, always thinking about how to level up and make the experience more fun and better for their customers. So it clicked immediately for Kit that the same way you needed a website in 1995, and the same way your business needed a social profile and a mobile app this millennium, in 2025 you need an AI agent to represent your business and help your customers. So Kit and Chubbies partnered with Sierra.
Sierra we came up with an AI agent which
is affectionately called Duncan Smothers
first and foremost he's incredibly
capable but almost as importantly he's
always down to clown Duncan's mothers is
on the Chubby's website and can help you
with a variety of
cases I got permission from kit to show
some of these conversations to you today
so you can see what some of the Sierra
interactions look like under the hood
and some of the things that these agents
are capable of so on the left here you
have a customer asking a question about
sizing and fit Duncan is able to
empathetically help them while asking
questions like what's your waist size
and offer product recommendations at the
end it gets a thumbs up from the
customer another example another thumbs
up this is inventory tracking Duncan can
tell what's in stock and help customers
choose new
items and then finally package tracking
and refunds so more customer love uh in
this case the Duncan is able to inform
the customer actually there's a couple
different tracking numbers for your
order and in the second case issue a
refund and so when we talk about
autonomous agents agents actually taking
action not just answering questions this
is what we're talking about and the
results for Chubbies have been they're
able to help more customers more quickly
and with higher satisfaction
The way we get there is that we believe, at Sierra, that every agent is a product. That means you can't just drag and drop a bunch of boxes: you need a fully featured developer platform and a fully featured customer experience operations platform, so you can work on your agent the same way you'd work on your mobile app or your website if you want the best results. So when Chubbies partners with Sierra, it's not just using the product; it's partnering with our team. We have dedicated agent engineering and agent product management functions that you can think of as forward-deployed with our customers, working closely with Kit and her team on a daily basis.
By the way, remember that face you just saw on the last slide? Was anyone here at the AI Engineer World's Fair back in June? Nice, got some whoops from the audience. So I know Ben was there; he's up on stage introducing everyone, and the energy was electric; you can see the crowd is packed. When I got there, the first thing I did was sit down at the Deepgram workshop. This was about three months into me building voice agents at Sierra, and I was very interested in what Deepgram had to say: what did they think of the latest multimodal models, how were they handling latency, tone, phrasing, all of these problems that were new at the time. And I sat down next to a man named Sean, and Sean and I were nerding out about how to increase the speed of our developer loop by using the `say` command on Mac and then a program called Loopback to pipe the audio into the browser, so we didn't have to wear headphones and talk and look awkward in the office. Sean gave me his contact info, he was interested in Sierra, and a few months later, there we are, working together in the office. So when I told our company and our founders, "hey, I'm going to the AI Summit, I hope it's as productive as the last one, I'm excited to learn," they said: go find more Seans.
So I'm hopeful that people in the audience will say hi after this; whether or not you're interested in working at Sierra, I'm interested in meeting you, and I hope to meet you later today. Anyway, back to Duncan Smothers. The point of the software development life cycle, and the point of our agent engineering team, is that even if Duncan isn't perfect today, he should be getting better every single day. So we set out to build something like the software development life cycle, borrowing as many concepts as we could and inventing new ones where we needed to.
The issue is that building on large language models is like building on a foundation of Jell-O: you can't just take everything out of the box and have it work. Traditional software is deterministic, fast, cheap, rigid, governed by if-statements that always follow their logic. Large language models can be non-deterministic, slow, and expensive to run; but they're very flexible, they're creative, and they can reason through problems. So we wanted a methodology that takes advantage of all the strengths of large language models and can also invoke traditional software where it's helpful.
That brings me to slide 78: the agent development life cycle. At Sierra, this is the process by which we build and improve AI agents. You might think it looks a lot like the software development life cycle, and the devil is in the details, so I'll dive in a bit. It's not that these are revolutionary or innovative concepts; it's that each one involves iterative refinement with customers, in production, to make it as productive and as bulletproof as possible.
If we dig into quality assurance, for example: if you work at one of our customer companies, you have access to Sierra's Experience Manager. That means you can dive in and look at every conversation, and at high-level reports of how the agent is performing in real time, and you can file feedback. For example, if Duncan Smothers has incorrect inventory, maybe it made one API call to one warehouse but didn't make all the API calls it needed to, or one of them timed out, whatever it may be, you can report the issue. That leads to an issue being filed, which leads to a test being created, and once that test is passing, we can make a new release. Over time, a Sierra agent goes from a handful of tests at launch to hundreds and then thousands of tests as it improves (a hypothetical sketch of such a test follows below).
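As a purely hypothetical illustration of that issue-to-test loop (this is not Sierra's API; the fixture, product, and warehouse names are invented), a conversation-level regression test might look like:

```python
# A pytest-style regression test born from the inventory bug described above:
# the agent must query every warehouse before reporting stock.
def test_inventory_checks_all_warehouses(simulated_agent):
    convo = simulated_agent.start()
    convo.user("Do you have these shorts in a size 32?")
    calls = convo.tool_calls("check_inventory")
    # The original bug: only one warehouse was queried (or a call timed out).
    assert {c.args["warehouse"] for c in calls} == {"east", "west", "retail"}
    assert "stock" in convo.last_reply().lower()
```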
Another example: it's not always that the agent is making a mistake; sometimes there's an opportunity to go above and beyond. Chubbies actually gives each of its agents a budget to delight customers, so in this case, Duncan Smothers could DoorDash the shorts from a retail location if they're not available online. This is the agent development life cycle at work. The thing is, a year ago we were doing all of this manually. That was early in Sierra's history, and we were learning what works at each of these stages; with the improvements to AI, we're now able to add AI to each part of the life cycle and speed up the improvements in the present day.

But it's bigger than just Duncan.
The agent development life cycle is more effective the larger the customer is: while Duncan handles hundreds of thousands of requests, we have customers doing tens of millions, and the bigger you are, the more valuable velocity and change management become. And the change comes from everywhere; it's not just that there's an issue with the agent and we need to improve it. There's a ton going on outside: all those graphs at the beginning of this presentation showing how fast our space is moving, models being upgraded, new paradigms like reasoning models, multimodality, and more.
multimodality and more and more
when we think about how these impact the
agent development life cycle reasoning
models are a force multiplier toward
each step we're actually able to be more
effective applying AI to development to
testing to QA and every step in
between now another one that's near and
dear to my heart I mentioned the Deep
gram Workshop eight months ago which was
an accelerant uh in my understanding of
the voice landscape is building for
voice and I started working on this
about a year ago uh and in October we
were able to launch Voice generally
available at Sierra one of our large
customers that has benefited from the
agent development life cycle that has
you know tens of millions of customers
in the United States is Sirus XM and
with Sierra's voice capabilities they're
able to pick up the phone right away
every time to answer their
customers the way that we think about
The way we think about voice is similar to the way we think about web development today. If you remember 10 or 15 years ago, a lot of websites were m.website.com: you had two separate websites, one for mobile phones and one for desktops, and then we graduated to responsive design. That's how we think about our AI agents at Sierra, too: under the hood, it's the same platform and the same agent code, but it's responsive to whatever channel someone reaches out on and whatever modality you're operating in. Of course, you can still customize, the same way you might have a different layout: you can have different phrasing, you can parallelize requests to achieve lower latency, but it basically just works out of the box.

I'll close with a few thoughts.
This is something I've been thinking about a lot lately: one of the most fascinating and fun parts of building with AI is that large language models remind us of ourselves. In short: they're unpredictable, they're slow, and they're not that great at math. But that also allows us to be great designers, by having empathy in a way we probably never could before with computers. You can actually put yourself in the shoes of the robot, in the primordial soup of the Jell-O, and think about what it would mean to build a good experience. As someone building voice agents, and I bet a bunch of you in the audience are too, I know there's this question of whether multimodal agents are the real deal: should you just wire everything together and hope it works? The question I've been asking myself a lot lately, and what our results have shown us, is: how would you do if someone just passed you transcribed text of your conversation partner with a few hundred milliseconds of delay, and you had to respond on the spot? What we're building at Sierra is much more robust than that, and very exciting to me, and I hope to talk to you all about it; I think my badge even says voice-to-voice models are the thing I'm excited about. Here is a sense of the robustness and richness of what you can create when you let large language models have the same inputs and experiences that humans have. Thank you for your time today; I look forward to a lot of engaging discussions, and it's great to talk to you all.
[Music]
[Applause]
[Music]
Our next presenter is a researcher at Morgan Stanley. Please join me in welcoming to the stage Will Brown.
[Music]
[Applause]
Hello everyone. Thanks to swix and the whole AI Engineer conference team for putting this together and having me. I'm Will Brown, a machine learning researcher at Morgan Stanley, and today I want to talk a bit about what I think reinforcement learning, or RL, means for agents. I was in grad school at Columbia for a while, where I mostly worked on theory for multi-agent reinforcement learning, and over the past couple of years, I've been working at Morgan Stanley on a wide range of LLM-related projects, some of which look kind of like agents, though I won't really be talking about that today. I'm also relatively active on X, the everything app, and that will become relevant later in the talk. This talk will probably be a little different from most at the conference: it's not about things we shipped to prod, and it's not about proven science or best practices you should go apply tomorrow. It's about where we might be headed. I want to tell a story that synthesizes some things happening in the broader research community, point at where those trends might lead, do some speculation, and also talk about some recent open-source work of my own. The goal is to help you plan for and understand what reinforcement learning means for agents, and how to be ready for a potential future in which RL is part of the agent engineering loop.
So where are we today? Most LLMs we work with are essentially chatbots. I think it's helpful to use OpenAI's five-levels framework here. We did pretty well with chatbots, and it seems like we're doing pretty well with reasoners: these are great models for question answering and very helpful for interactive problem solving; we have o1, o3, R1, Grok 3, Gemini, and other models that are really good at thinking longer. Now we're trying to figure out how to take all of this and make agents, level three: systems that take actions, that do things that are longer and harder and more complex. Currently, the way we tend to do this is by chaining together multiple calls to these underlying chatbot or reasoner LLMs, with lots of prompt engineering, tool calling, eval ops, giving the models tools of their own, and keeping humans in the loop. The results are pretty good; there's a lot we can do. And then there's a lot of stuff that feels like it's around the corner, all the things we imagine about AGI, but we're not at the point where these systems go off and do those things with the degree of autonomy that AGI would presumably entail.
presume entail so I think it's useful a
bit to distinguish between agents and
pipelines I think Barry's talk earlier
was a good way to kind of frame this I'm
going to use pipelines to encapsulate
what Barry called workflows um and I
think these are really systems with
fairly low degrees of autonomy and
there's a very non-trivial non-trivial
amount of engineering required to
determine these decision trees to say
how does one action or call flow into
the another how uh to another how do we
refine the prompts um and it seems like
a lot of the winning apps in the agent
space have very tight feedback loops and
so whether or not you want to call these
agents or pipelines these are things
where a user is interacting with some
sort of interface they're telling it
what to do the thing will do some stuff
and come back relatively quickly things
like the IDE like cursor winds Surf and
repet um and search tools that are
really good at Harder question answer
maybe with some web search or research
integrated but there's not that many
agents nowadays that will go off and
like do stuff for more than 10 minutes
at a time I think Devon operator and
opening eyes deep research are the three
that really come to mind is like feeling
a little more in the like autonomous
agent Direction and I think a lot of us
might be wondering how do we make more
of these
The traditional wisdom is: just wait for better models; once they arrive, we'll use those, and it'll be good. But it's also worth noting the traditional definition of reinforcement learning and what an agent means there: a thing that interacts with an environment with a goal, and that is designed to learn, via repeated interaction, to get better at that goal over time. This is something a lot of us are either doing manually or don't really have the tools to do. Once we have our system set up to make the calls we want, and performance is at 70% after a lot of prompt tuning, and we want to get to 90% but the models struggle to get there, what's our path forward?
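For reference, here is the classic explore-and-exploit loop he's describing, as a tiny runnable toy: a two-armed bandit where the agent improves at its goal purely through repeated interaction and feedback. The payoffs and learning rate are arbitrary choices of mine.

```python
# An agent interacting with an environment toward a goal, learning from
# repeated feedback: try stuff, see what works, do more of what worked.
import random

rewards = {"a": 0.2, "b": 0.8}       # hidden payoff of two actions
values = {"a": 0.0, "b": 0.0}        # the agent's running estimates

for step in range(1000):
    if random.random() < 0.1:        # explore sometimes...
        action = random.choice(["a", "b"])
    else:                            # ...otherwise exploit the best estimate
        action = max(values, key=values.get)
    reward = 1.0 if random.random() < rewards[action] else 0.0
    values[action] += 0.1 * (reward - values[action])  # learn from feedback

print(values)  # estimates converge toward the true payoffs
```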
In terms of model trends, I won't spend too much time on this, but pre-training seems to be showing diminishing returns to capital: we're still seeing loss go down, but it does feel like we need new tricks. Reinforcement learning from human feedback is great for making friendly chatbots, but it doesn't really seem to keep pushing the frontier of smarter and smarter models. We talk a lot about synthetic data, and synthetic data is great for distilling larger models into really tiny, really performant ones, but on its own, it doesn't seem to unlock massive new capabilities unless we put verification, rejection sampling, and the like into the loop. And that takes us to reinforcement learning, which seems to be the trick that unlocked test-time scaling for the o1 models and R1: it isn't bottlenecked on manually curated human data, and it does seem to actually work.
I think we all took note about a month ago when DeepSeek released the R1 model and paper. This was really exciting because it was the first paper that really explained how you build a thing like o1; we'd had speculation and rumors, but they laid out the algorithm and the mechanisms for getting a model to learn this kind of reasoning. It turns out it was essentially just reinforcement learning: you give the model questions, measure whether it gets the answers right, and turn the crank of feedback so it does more of what worked and less of what didn't. And what you see is that the long chain of thought of models like o1 and R1 emerges as a byproduct of this. It wasn't manually programmed in with data of 10,000-token reasoning steps; the model learned to do it because it was a good strategy, and reinforcement learning at its core is about identifying good strategies for solving problems. It also seems like open-source models are back in a big way: there's a lot of excitement in the open-source community, with replication efforts for the o1 project and attempts to distill data from o1 into smaller models. So what's next, and how does this relate to agents?
It helps to know a little about how reinforcement learning works. The key idea is explore and exploit: try stuff, see what works, do more of the things that worked and less of the things that didn't. In the feedback loop shown here, the model is supposed to write code to pass test cases, and we give it rewards for things like formatting, using the right language, and ultimately whether the test cases pass. Rather than training on data curated in advance, this is a numerical signal: we let the model do synthetic-data rollouts, score those rollouts, and feed the scores back into the model.
The GRPO algorithm, which some of you may have heard of, is the algorithm DeepSeek used. I think it's less of a technical breakthrough in the sense of being a really important new algorithm to study, but it's conceptually simple and a nice way to think about what reinforcement learning means: for a given prompt, sample N completions, score them all, and tell the model to be more like the ones with higher scores (the core computation is sketched below). This is still in the single-turn, non-agentic reasoner world.
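The conceptual core of that recipe fits in a few lines; this is only the group-relative advantage computation, not a full training loop, and the example scores are made up.

```python
# GRPO's core idea: sample N completions for one prompt, score them, and
# weight updates toward completions that scored above the group average.
import numpy as np

scores = np.array([0.0, 1.0, 1.0, 0.25])   # rewards for 4 sampled completions
advantages = (scores - scores.mean()) / (scores.std() + 1e-8)
# Positive advantage -> push the model toward that completion;
# negative -> push away. The policy-gradient step uses these as weights.
print(advantages)
```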
The challenges that lie ahead are about taking these ideas and extending them into more powerful, more agentic, more autonomous systems. We do know it can be done: OpenAI's Deep Research still leaves a lot of questions we don't know the answers to, but they have told us it was end-to-end reinforcement learning. This is a case where the model takes up to potentially a hundred tool calls, browsing or querying different parts of the internet, to synthesize a long answer, and by many people's vibe-check opinions it's very impressive. But it's also not AGI: you can't get it to go work in a repo or solve hard software engineering tasks, and people have found anecdotally that it struggles on out-of-distribution tasks, or if you want it to fill out a table with a hundred very manual calculations. So reinforcement learning is, on one hand, a big unlock for new skills and more autonomy, but it hasn't so far given us agents that can do everything and solve all kinds of problems. It is, though, a path forward for teaching a model skills, and for having the model learn to get better at certain skills, particularly in conjunction with environments, tools, and verification.
and verification um and so there is
infrastructure out there for doing this
on our own kind of um a lot of it is
still rhf Style by which I mean it's
about kind of single turn interactions
where the goal is we have reward signals
that come from kind of human data that
has been combined into a reward model um
and if we want to have RL agents
becoming part of our systems maybe we
will get really good API services from
the large Labs that let us build these
things and hook into GPT whatever um or
Claud whatever and train these sorts of
models on our own with finetuning but we
also don't really have these options yet
um opening ey has kind of teased their
reinforcement fine tuning feedback but
it's not a multi-step tool tool calling
yet and so I think if we want to plan
ahead it's worth kind of noting and
asking what would this ecosystem look
like and there's a lot of unknown
questions like how much this will cost
how small can the model spe will it
generalize across tasks uh and how do we
design good rewards and good
environments and there's a lot of
opportunity here um open source uh
infrastructure there's a lot of room to
build and grow and determine what the
best practices are going to be what the
right tools will be as well as companies
that can build tools for to support this
ecosystem uh whether or not they're
already in the fine-tuning world or not
um and services for supporting this kind
of agentic RL and I think also it is
worth thinking about things that are
like not literal RL in the sense of
training the model but at the prompt
level there's all sorts of automation we
can do so if you've used dspi I think
that is kind of adjacent to RL in the
flavor of having a signal that we can
then uh bootstrapped from to improve our
uh underlying system based on improving
Downstream
scores um now I want to share a story
Now I want to share a story about a single Python file I wrote a couple of weeks ago. This was the weekend after R1 came out; I'd been reading the paper and thought it was really cool, and we hadn't had the Nvidia stock crash quite yet. I was playing around with some experiments: I took the Hugging Face trainer that had the GRPO algorithm and got a really small language model, Llama 1B, to do some reasoning and then give an answer to math questions. I started with a pretty simple system prompt, trained the model to see what it would do, and manually curated some rewards for the scoring function. Then I just tweeted it out, with an example of the model appearing to do some self-correction, showing that accuracy gets better, and that response length initially drops as the model learns to follow the format, then goes back up as it learns to take advantage of a longer chain of thought for its reasoning. This was not the first replication in any sense (I wouldn't even call it a true replication), and it was far from the most complicated, but it caught a lot of people's imaginations and became kind of a thing.
Over the next two weeks, it took on a life of its own: people were tweeting about it, forking it, modifying it, making it runnable in a Jupyter notebook, making it more accessible, writing blog posts about it. It was interesting, because to me it didn't feel like a thing that merited that level of excitement. But what caught people's imagination, I think, was that it was one file of code, really simple, and it invited modification in a very user-friendly, engaging way, which I like to call rubric engineering. The idea of rubric engineering is that, similar to prompt engineering, a model doing reinforcement learning is going to get some reward, and the question is what that reward should be. In the simplest version, it's just: did it get the question right or wrong, does A equal B? But there's a lot more you can do beyond this. The single file of code exposed examples of this, where you give the model points for things like following an XML structure: if it gets a certain tag right, it gets a point; if it gives an integer answer that's still the wrong answer, it has at least learned that the format should be an integer, so it gets some points for that (a sketch in this spirit follows below). There's a lot of room to get creative here, designing rules that aren't just downstream evals for our own sake of knowing whether a thing is working, but that allow the model itself to know whether it's working and to use that as feedback for further training.
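In the spirit of that single file (the tags and point values here are illustrative choices of mine, not his exact rubric), reward functions might look like:

```python
# Rubric engineering: small reward functions that score a completion for
# format as well as correctness, so the model gets partial credit.
import re

def format_reward(completion: str) -> float:
    """Partial credit for emitting the expected <reasoning>/<answer> tags."""
    score = 0.0
    for tag in ("<reasoning>", "</reasoning>", "<answer>", "</answer>"):
        if tag in completion:
            score += 0.25
    return score

def answer_reward(completion: str, target: str) -> float:
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if not m:
        return 0.0
    answer = m.group(1).strip()
    if answer == target:
        return 2.0                          # right answer
    if answer.lstrip("-").isdigit():
        return 0.5                          # wrong, but at least an integer
    return 0.0

def total_reward(completion: str, target: str) -> float:
    return format_reward(completion) + answer_reward(completion, target)
```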
This is still very early; there's a lot we don't know, and a lot of opportunity to get creative, explore, and try things out: using LLMs to design these rubrics, auto-tuning rubrics or prompts with frameworks like DSPy, and incorporating LLM judges as part of the scoring system. Reward hacking is also an issue to be very cautious of: you want to ensure the reward model is actually capturing the goal and doesn't have back doors where the model can cheat, doing something else that earns a super high reward without learning the actual task. Following this, I've been trying to learn from what I saw people doing in the wild and make something more robust and usable for actual projects, beyond one file of code. It's a very recent effort, not something I'm telling you to use for all your problems tomorrow, but it's my attempt at open-source research code that might help people try these things out more easily and answer some questions.
What this really is, is a framework for doing RL inside multi-turn environments. The idea is that lots of us have built these great agent frameworks for using API models, and the hope is that we can leverage those existing environments and frameworks to actually do RL. You create an environment object that the model plugs into, and you don't have to worry about the weights or the tokens; you just write an interaction protocol, and it gets fed into a trainer. Once you build the environment, you can let it run, and with some rewards, the model learns to get better and better over time. To conclude, I want to talk about what I think AI engineering might look like in the RL era.
This is all still very new. We don't know whether off-the-shelf API models will just work for the tasks we throw at them; maybe they will, maybe they won't. One reason I think they might not be the entire solution is that it's really hard to include a skill in a prompt. You can include knowledge in a prompt, but most of us don't nail a new thing on the first try; it takes trial and error. Models seem to be like this too: a model really gets a skill nailed down by trial and error, and that has been the most promising unlock we've seen so far for higher-autonomy agents like Deep Research. Fine-tuning might still be important. A lot of people wrote it off for a while, because open models were far enough behind the frontier that a prompted frontier-model API would just beat a smaller fine-tuned model. But first, the open/closed-source gap is now close enough that this is less of a concern, and a lot of people are using open-source hosted models in their platforms. And second, the truest version of RL, what DeepSeek did for R1 and what OpenAI has said they did for Deep Research, requires actually doing some reinforcement learning. There are a lot of challenges here and a lot of research questions we don't know the answers to. But many of the skills we've learned doing AI engineering over the past couple of years translate very directly: the challenge of building environments and rubrics is not that different from the challenge of building evals and prompts. We still need good monitoring tools; we still need a large ecosystem of companies, platforms, and products that support the kinds of agents we want to build. So all the stuff we've been doing is going to be essential, and it's worth looking ahead a little, in case we end up in a world where we have to do a bit more reinforcement learning to unlock things like truly autonomous agents, or innovators, or organizations powered by language models. What does that look like? We will find out.
[Applause]
[Music]
Ladies and gentlemen, please welcome back to the stage the MC for the AI Engineer Summit agent engineering day, the founder and CEO of Superintelligent, NLW.
[Music]
All right, awesome first session. Thank you all for being here, thank you, Will, for a great way to close us out, and thanks to all the other great presenters as well. A quick clarification before I let you go tonight: there is no on-site afterparty. The expo closes at 4 p.m. and the venue closes at 6 p.m.; however, there are a number of affiliated events, so check the website homepage for info and RSVP instructions. Again: the expo closes at 4, this venue closes at 6, so we'll be wrapping up conversations and evening plans at around 5:30. With that, we close block one, and we're going to take a 30-minute break now. If you want to have discussions with the speakers, the Q&A lounges are available to meet them: the first one is on the first floor, to the right as you exit the theater, and there are two downstairs as well. We also recommend making time to stop by the sponsor expo, where you'll find coffee, snacks, and the amazing products and services of our sponsors. Thank you very much, and we will see you back here in about half an hour.
[Music]
Ladies and gentlemen, please welcome to the stage the MC for the AI Engineer Summit agent engineering day, founder and CEO of Superintelligent, NLW.
[Music]
All right, welcome back to another excellent session. This sprint is really, really interesting: we have a session from Jane Street about how they do AI engineering, challenges in scaling agents from Bloomberg, and a session on trusting but verifying from Brightwave. Kicking it off, we'd like to welcome to the stage Brennan Rosalez, to talk about agents and investment management with Aladdin Copilot from BlackRock.
[Applause]
[Music]
Our next presenter is a software engineer at Jane Street, presenting how they build AI-powered developer tools. Please join me in welcoming to the stage John Crepezzi.
[Music]
[Applause]
My name is John Crepezzi, and I work on a team at Jane Street called AI Assistant. Our group's job, roughly, is to maximize the value Jane Street can get from large language models. I've spent my entire career in dev tools: before Jane Street, I was at GitHub for a long time, and before that, I worked at a variety of other dev tools companies. LLMs present this really amazing opportunity in that they're so open-ended we can build almost anything we can imagine; right now, the only thing moving faster than the progress of the models is our creativity around how to employ them. At Jane Street, though, we've made some choices that make adopting off-the-shelf tooling a bit more difficult than it is for other companies.
The biggest reason we have this problem is that we use OCaml as our development platform. For those not familiar, OCaml is a functional, very powerful, but also incredibly obscure language. It was built in France, and its most common applications are in things like theorem proving and formal verification; it's also used to write programming languages. We use OCaml for basically everything at Jane Street. A couple of quick examples: when we write web applications, which of course have to run as JavaScript, we write OCaml instead and use a library called js_of_ocaml, essentially an OCaml-bytecode-to-JavaScript transpiler. When we write plugins for Vim, which have to be written in Vimscript, we use a library called Vcaml, again an OCaml-to-Vimscript transpiler. And even the people working on FPGA code aren't writing Verilog; they're writing in an OCaml library called Hardcaml.
So why aren't the tools available on the market good for working with OCaml? It comes down to a few primary reasons. The first, and most important, is that the models themselves are just not very good at OCaml, and that isn't the fault of the AI labs; it's a byproduct of the amount of training data that exists. There's a really good chance that the amount of OCaml code inside Jane Street is more than the total combined amount of OCaml code that exists in the world outside our walls.
The second is that we've made things really hard on ourselves, partially as a byproduct of working in OCaml: we've had to build our own build systems, our own distributed build environment, and even our own code review system, which is called Iron. We develop all of our software in a giant monorepo, and just for fun, instead of storing that monorepo in Git, we store it in Mercurial. At last count, 67% of the firm uses Emacs instead of more common editors like VS Code (we do have people using VS Code, but Emacs is the most popular). And the last thing is that we're dreamers; hopefully, everyone in this room is a dreamer in some way. What I mean is that we want the ability to apply LLMs to different parts of our development flow: maybe use them to resolve merge conflicts, or build better feature descriptions, or figure out who the reviewers for a feature should be, and we don't want to be hampered by the boundaries between different systems when we do that.
Over the next 15 minutes, I'm going to cover our approach to large language models at Jane Street, particularly when it comes to developer tools: the custom models we're building and how we build them; our editor integrations into VS Code, Emacs, and Neovim; and the ability we've built up over time to evaluate models and figure out how to make them perform best.
At first glance, it's not obvious that training your own models is a good idea at all: it's a very expensive proposition, it takes a lot of time, and it can go wrong in a ton of different ways. Who here has trained a model, or tried to, maybe taking a foundation model and training on top of it? Cool. We became more convinced after reading a paper from Meta about a project called CodeCompose, which details their results fine-tuning a model specifically for use with Hack. Hack is actually pretty similar to OCaml, not in its syntax or function, but in the fact that it's used primarily at one company and not much outside it, even though it's open source. (Fun fact: Hack is implemented in OCaml; I think that's just a total coincidence.) We were pretty naive early on: we read this paper and decided it would be really cool to replicate the results. We thought we'd take a model off the shelf, show it a bunch of our code, and get back a model that worked like the original but knew about our libraries and idioms.
to get good outcomes you have to have
the model see a bunch of examples that
are in the shape of the type of question
that you want to ask the model so we
needed to First create a goal a thing
that we wanted the model to be able to
do
and in our in our world the goal that we
came up with was this we wanted to be
able to generate diffs given a prompt so
what that means is we wanted a user
inside of an Editor to be able to write
a description of what they wanted to
happen and then have the model suggest a
potentially multifile diff so maybe you
want to modify the test file an ml file
and an mli which is kind of like a
header
file we wanted the diffs to apply
cleanly and we wanted them to have a
good likelihood of type-checking after
they had been applied and we were kind
of targeting this range of up to 100
lines as an ideal zone of what we
thought llms would actually be capable
of and in order for that to work we
needed to collect data like I was
talking about before we needed data of
the training shape that looked just like
the test shape and this is what that
shape looks like for this task you need
to be able to collect a bunch of
examples of what context the model would
have had beforehand and then some prompt
of what you want the model to do written
hopefully in the same way that a human
would write it and then some diff that
would accomplish that goal so context
prompt diff and we need a bunch of these
examples so how do we get these how do
we get these training
examples kind of the first place to look
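Illustratively, one training example might look like the sketch below. This is a minimal illustration assuming a flat text representation; the field names are invented for this writeup, not Jane Street's actual schema.

```python
# A minimal sketch of the context / prompt / diff training-example shape.
# Illustrative only; field names and contents are assumptions.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    context: str  # code and build state the model would have seen beforehand
    prompt: str   # what the user asked for, phrased the way a human would
    diff: str     # a (possibly multi-file) diff that accomplishes the goal

example = TrainingExample(
    context="(* contents of foo.ml, foo.mli, test_foo.ml ... *)",
    prompt="fix the type error in foo.ml",
    diff="--- a/foo.ml\n+++ b/foo.ml\n@@ ...",
)
```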
Kind of the first place to look is features. Features, as I mentioned, is a code review system that we built internally; it's called Iron, and this is what it looks like. Features are very similar to pull requests; I think you can just swap that term in your head. And features at first glance have exactly the data you want: on the description tab they have a human-written description of a change, and on the diff tab they have the code that accomplishes the goal of the
developer but on closer look they're not
exactly what you want right the way that
you write a feature description or a pull request description is really very
different from what you might want to
say inside of an editor so you're not
writing multiple paragraphs in the
editor you're just saying something like
fix that error that's happening right
now and that's just not how we write
feature descriptions another problem
with these features or pull requests is
that they're really large right often
it's a feature is 500 lines or a
thousand lines so in order to use it as
training data we would need to have an
automated way to pull features apart
into individual smaller components that
we could train
on so we need smaller things than
features. What are those? Well, maybe commits. Commits are smaller chunks than features. This is what a typical commit log looks like at Jane Street. This is not like a git shortlog; I want you to look at this as an actual git log, and where it says "summary z", that's my commit message. We don't really use commits the same way the rest of the world uses them: we use commits mostly as checkpoints between different parts of a development cycle that you might want to revert back to. Commits don't have a description, and they also have the same problem in that they're not isolated changes; they're a collection of changes. What we actually ended up with was an approach called workspace snapshotting, and the way
that works is we take snapshots of developer workstations throughout the workday. You can think: every 20 seconds we just take a snapshot of what the developer is doing, and as we take the snapshots we also take snapshots of the build status, so for the build that's running on the box we can see what the error is or whether the build is green. And we can notice these little patterns. If you have a green-to-red-to-green transition, that often corresponds to a place where a developer has made an isolated change: you start writing some code, you break the build, and then you get it back to green, and that's how you make a change. Or maybe the red-to-green one: this is a place where the developer encountered an error, whether a type error or a compilation error, and then fixed it. So if we capture the build error at the red state, and then the diff from red to green, we can use that as training data to help the model recover from mistakes (roughly the mining sketched below).
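A rough sketch of that mining step, assuming each snapshot carries the source text plus a build status and error; this illustrates the green/red pattern described above and is not Jane Street's pipeline.

```python
import difflib

def mine_fix_examples(snapshots):
    """snapshots: ordered list of (source_text, build_status, build_error)."""
    examples = []
    for (prev_src, prev_status, prev_err), (curr_src, curr_status, _) in zip(
            snapshots, snapshots[1:]):
        # A red -> green transition pairs a build error with the diff that fixed it.
        if prev_status == "red" and curr_status == "green":
            diff = "".join(difflib.unified_diff(
                prev_src.splitlines(keepends=True),
                curr_src.splitlines(keepends=True)))
            examples.append({
                "context": prev_src,      # the code as it stood when broken
                "build_error": prev_err,  # what the compiler complained about
                "diff": diff,             # the change that got it back to green
            })
    return examples
```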
The next thing we need is a description, and for that we just used a large language model: we had an LLM write a really detailed description of the change, in as many words as it possibly could, and then we kept filtering it down until it was around the level of what a human would write; a minimal sketch of that loop follows.
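This is a minimal sketch of the describe-then-compress loop, assuming a generic text-in, text-out `llm` callable (no specific API is implied):

```python
def describe_change(diff: str, llm, max_words: int = 15) -> str:
    # First have the model over-describe the change in as many words as it
    # wants, then repeatedly shorten toward the terse style a human would type.
    desc = llm(f"Describe this change in as much detail as possible:\n{diff}")
    while len(desc.split()) > max_words:
        desc = llm("Shorten this to how a developer would phrase "
                   f"a quick request:\n{desc}")
    return desc
```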
So now we have this training data, and training data is only half the picture of training a model: you have the supervised training data, and then you need to do the second part, which is the reinforcement learning.
this is really where models get a lot of
their power: we align the model's output to what humans think is actually good code. So what is good code? On the surface, good code is code that parses, meaning if a piece of code doesn't go through the OCaml parser and come out with a green status, that is not good code by most definitions. Good code in OCaml, because the language is statically typed, is code that type checks, so we want good code to be code that, when applied on top of a base revision, can go through the type checker, and the type checker agrees that the code is
valid. And of course the gold standard is that good code is code that compiles and passes tests. So ideally, during the reinforcement learning phase of a model, you could give the model a bunch of tasks that are verifiable: the model performs some edit, and then we check whether or not it actually passes the tests when applied to the
code. So we did that; we've done this as part of our training cycle, and we built a thing called CES, the code evaluation service. You can think of it kind of like a build service, except with a slight modification to make it much faster: first we pre-warm a build, so it sits at a revision and is green, and then we have these workers that all day just take diffs from the model, apply them, and determine whether the build status turns red or green, and then report that error or success back up. Through continued use of this service over the course of months, we're able to better align the model to write code that actually does compile and pass tests; a schematic worker loop is sketched below. It turns out this exact same setup is the one that you would want for evaluation: if you just hold out some of the RL data, you can also use it to evaluate a model's ability to write code. It kind of looks like this: you give the model a problem, you let it write some code, and then you evaluate whether or not the code that it writes actually works.
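Schematically, a CES worker loop might look like the sketch below. This is an assumption-laden illustration: `prewarmed_workspace.clone()`, `workspace.apply()`, `run_build`, and `report` are placeholders, not the real CES interface.

```python
def ces_worker(prewarmed_workspace, diff_queue, run_build, report):
    """Apply model-written diffs on top of a pre-warmed green build and
    report whether the build turns red or green."""
    while True:
        diff = diff_queue.get()
        workspace = prewarmed_workspace.clone()  # keep the warm build intact
        workspace.apply(diff)
        status, error = run_build(workspace)     # "green" or "red"
        report(diff, status, error)              # doubles as RL reward / eval signal
```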
Training is hard, and it can have catastrophic but hilarious results. At one point we were training a code review model; this is a totally separate model, but the idea was: we want to be able to give some code to this model and have it do a first pass of code review, just like a human would, to try to save some of the toil of code review. We trained this model, we put a bunch of data into it, we worked on it for months, we were real excited, and we put our first code in for review through the automated agent. It spun for a bit, and it came back with something along the lines of: "I'll do it tomorrow." And of course it did that, because it's trained on a bunch of human examples, and humans write things like "I'll do this tomorrow," so it's, you know, not very surprising. Having evaluations that are meaningful is kind of a cornerstone of making sure that models don't go off the rails like this and you don't waste a bunch of your time and
money in the end though the real test of
models is whether or not they work for
humans so I'm going to talk a little bit
about the editor Integrations that we've
built to expose these models to
developers at Jane Street. When we were starting
building these Integrations we had three
ideas in mind the first idea was wow we
support three editors, we have Neovim, VS Code, and Emacs, and we really don't want
to write the same thing three times so
ideally we don't want to write all the
same context build strategies and all of
the same prompting strategies we want to
just write it once the second is that we
wanted to maintain flexibility so we had
a model that we were using at the time
uh that was not a fine-tuned model we
were pretty convinced that a fine tuned
model was in our future we wanted the
ability to do things like swap the model
or swap the prompting strategy out and
last we wanted to be able to collect
metrics so in a developer uh in their in
their editor developers care about
latency they care about whether or not
the diffs actually apply so we wanted to
get kind of on the ground real
experience of whether or not the diffs
really were meaningful for
people this is the simplified version of
the architecture that we settled on for
this service the AI development
environment essentially you have llms on
one side and then Aid handles all of the
uh ability to construct prompts and to
construct context and to see the build
status and then we are able to just
write these really thin layers on top of
Aid uh for each of the individual
editors and what's really neat about
this is that Aid sits as a sidecar
application on the developer machine
which means that we when we want to make
changes to Aid we don't have to make
changes to the individual editors and
hope that people restart their editors
we can just restart the Aid Service on
all of the boxes so we restart Aid and
then everyone gets the most recent
copy uh this is an example of Aid
working inside of vs code so this is the
sidebar in vs code very similar to
something like co-pilot except this
thing allows you to uh ask for it and
get back multifile diffs uh and you can
see it kind of looks like what you'd
expect in VS Code, it's you know a
visual interface that lays things out
really
nicely. This is what we built in Emacs, though. In Emacs, developers are used to working in text buffers: they move around files, they want to be able to copy things the normal way they copy things. So we actually built the AI experience in Emacs into a markdown buffer: users can move around inside this markdown buffer, they can ask questions, and then there are keybindings that essentially append extra content to the bottom of the markdown buffer.
Aid's architecture lets us plug various
pieces in and out like I mentioned uh so
we can swap in new models we can uh make
changes to the context building we can
add support for new editors which I
think probably sounds far-fetched but
this is something we're actually just
doing right
now uh and we can even add domain
specific tools so different areas of the
company can supply tools that are
available inside of the editors and they
kind of end up in all the editors
without having to write individual
integrations. Aid also allows us to A/B test different approaches: we can do something like send 50% of the company to one model and 50% to another, and then determine which one gets the higher acceptance rate, for instance by deterministically hashing each user into an arm, as sketched below.
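For illustration, one common way to split users stably into arms is a deterministic hash, so each developer always hits the same variant; this is a generic sketch, not Aid's actual mechanism.

```python
import hashlib

def assign_variant(user_id: str, variants=("model_a", "model_b")) -> str:
    digest = hashlib.sha256(user_id.encode()).digest()
    return variants[digest[0] % len(variants)]

# Acceptance rate per arm is then just accepted diffs / suggested diffs,
# aggregated over the users hashed into that arm.
```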
Aid is kind of an investment that pays off over time: every time something changes in large language models, we're able to change it in one place, downstream of the editors, and then have it available
everywhere and things change like really
often we need to be ready uh when things
change what I what I've had time to show
you today is only a small portion of
what my team is doing and we've got a
lot of other things going on so we're
finding new ways to apply RAG inside of
the editors we're applying similar
approaches to what you've seen here to
large scale uh multi-agent workflows we
are working with reasoning models more
and more but the approach is the same
through all of these we keep things
pluggable we lay a strong Foundation to
build on top of and we build the ways
for the rest of the company to add to
our experience by adding more domain
specific tooling on top of
it uh if you think what I've said is
interesting and you want to talk more
about this I would love to hear from you
you can just find me outside and thank
you for your
[Applause]
[Music]
time next up the head of AI engineering
at Bloomberg is here to present
challenges to scaling agents for
generative AI products. Please join me in welcoming to the stage Anju
[Music]
[Applause]
Kambadur. Oh man, it's really hard to see my
photo that big or
small um thank you so much for inviting
me um as I was trying to think what
would be a good topic to present at this
talk the organizers were really nice and
so a lot of things that you'll hear
today was influenced by what the
organizers thought was is important CU
there really so many things happening
that are exciting to talk about in the
agentic landscape so let's get started
the first thing was, late 2021, I think LLMs really were starting to capture the imagination. As a company we've been investing in AI for almost 15, 16 years, so we decided we'll build our own large language model. It took all of 2022 to do that, and in 2023 we wrote a paper about it. We had learned a lot about how you build these models, how you organize data sets for them, how evaluation works, how you coax performance in certain zones, that sort of thing. But then ChatGPT happened.
I think the open weight and the open
source Community has come up so uh
beautifully along so while we continue
to do very similar work as a strategy we
pivoted to say let's build on top of uh
whatever is available out there we have
many many different use cases so I think
we we pretty much pivoted to say We'll
build on top uh if it helps you in any
way on how we are doing things so there
you
go. The other was, I think, a curiosity about how exactly a company like Bloomberg organizes its AI efforts. So I report to the global head of engineering, and we
are organized somewhat as a special
group if you will we work a lot with our
data counterpart Bloomberg is a really
strong large data organization that you
can appreciate now helps us out a lot uh
we work with the product the CTO in in
cross functional settings about 400
people 50 teams London New York uh
Princeton and Toronto so that's a little
bit about our our
group okay so um we've been uh Building
Products using generative AI um starting
with tools more agentic for 12 to 16
months now I think the effort has been
really really serious and so there have
been so many things we've had to solve
in order to build something today using
what's available today uh and then I
decided somebody must cover all of these
topics so I'm not going to talk about
these at all right uh I think there are
some wonderful speakers talking about
this uh I'll try to hang around a bit
after this and really um I'm really
bullish on what the developments are in
any one of those challenges that we need
to solve I think it gets easier and
easier to solve those challenges so
please don't read these as being
pessimistic it's just realistic right I
need to build and ship things today and
that means these are the things I need
to deal with today uh again we won't be
touching on any of these topics
today um so internally it was really
hard to say what's an agent and what's a
tool because everyone kind of had their
own vocabulary and then this really nice
paper came out so when I'm talking today
when I say a tool, I mean the left-hand side of that spectrum; the paper is "Cognitive Architectures for Language Agents", and if you haven't read it, you should try to read it. And then an agent is
really like more autonomous has memory
can evolve so whenever I say agentic
it's on the right hand side of the
spectrum and the other one is the left
hand side um so that's what my
vocabulary will
be finally to set the stage for the talk
um I don't know how many of you know
about Bloomberg I certainly did not know
as much as I do today when I joined so
um we are a fintech company as you can
imagine from my nice uh jacket or
jumper and our clients are in finance
but Finance is a very diverse field so
uh I'm listed here 10 different
archetypes of people who are in finance
and they do very different activities
but they also do a lot of similar
activities and so um what is like a
short form of thinking what Bloomberg
does we have we both generate and
accumulate a lot of data this is
unstructured and structured so news
research uh documents slides we uh also
provide access to websites there's a lot
of reference data uh Market data coming
in so if you just want to know the scale
every day we get 400 billion ticks of structured data, a billion-plus unstructured messages,
millions of well- written documents
which include news and this is just
every day and we have over 40 years of
history on it so when we say we offer
information as one of the things to our
clients this is the scale at which we
are working
uh the rest of this talk I will uh as
you can imagine we are building a very
broad set of products so to focus the
talk I'll talk about one
particular archetype: the research analyst. If you didn't know what a research analyst does, here is a short course. A research analyst is typically an expert in a particular area, think like
you know I'm a research analyst in AI or
semiconductor or technology or electric
vehicles and the kinds of things they
need to do on a daily basis are written
at the bottom so they are doing a lot of
work with search and Discovery and
summarization a lot of things with
unstructured data on the left hand side
they are doing a lot of work in uh in
data and analytics structured data and
analytics in the middle part of the
segment they are reaching out to their
colleagues both to disperse and gather
information so there's a lot of
communication and then they also uh some
of them are also building models uh
which means they need to normalize data
they need to actually program and
generate models as well so this is a a
research analyst in a uh in a
nutshell. The other bit is, because we've been in finance since our founding 40 years ago, there are some
aspects of our products that are
non-negotiable and uh those include
things like precision comprehensiveness
speed throughput availability
um some principles like protecting our
contributor and client data making sure
that whatever we build there is
transparency throughout these are
non-negotiables it doesn't matter
whether you're using AI or not so these
should ground you in the kinds of
challenges we face when we use what's
available today to build
agents okay so what was the first thing
we did uh again 2023 is when I think we
got serious so the first thing we did
was for the research in the in the zone
of helping the research analyst
community
um companies public companies in
particular they have scheduled quarterly
calls that discuss the health of their
company they talk about their future
it's a conference call a lot of analysts
attend the call uh there's a
presentation by the company's Executives
and then there's a Q&A segment and
during earning season it happens that on
any given day many many of these things
are happening so I told you that a
research analyst has to stay on top of
what's happening every single day
so transcripts of these calls need to be
generated again AI is used and in 2023
we saw an opportunity to say well we
know what for every company which is a
which is operating in a particular
sector we know what are the kinds of
questions are of interest and maybe we
can try to answer them for the analyst
to take a look at and that way they can
be informed on whether they wanted a
deeper dive or not right seems like a
simple product and again I'm talking
about work that started in 23 so
where the technology was we still needed
to do a lot to bring it to the market
keeping our principles and features in
place so what does it mean just focus on
the right hand side if you will um
performance out of the box was not great
like Precision accuracy uh factuality
things like that uh and for those of you
who are interested in mlops I think
there was a lot of work done in order to
just build remediation workflows and
circuit breakers because remember these
summaries are not somebody just chatting
with a transcript it's actually
published and everyone gets to see the
same summary and anything that is an
error has an outsized impact for us so
we constantly monitor performance
remediate and then the summaries get
more and more accurate so a lot of um I
think a lot of monitoring goes in behind
it a lot of cicd goes in behind it as
well. Okay, so today, how do the products we're building look, how does the agentic architecture look? Well, first of all, it's semi-agentic, because (and this is an opinion) we don't yet fully have the trust that everything can be autonomous. So there are some pieces that are autonomous and other pieces that are not. Guardrails are a classic example: Bloomberg doesn't offer financial advice, so if someone starts with "hey, should I invest in...", you need to catch it. We need to be factual; that's again a guardrail. Those are not optional pieces for any agent; they are coded in as checks you must do (conceptually something like the sketch below). Just keep this image in mind; it'll come back.
this is a talk about scaling so with
that long Runway let's get to scaling so
I just wanted to cover two aspects of
scaling I'm hoping that both these
aspects will be more of a confirmation
and not a surprise to any of you um so
let's see so the first thing is if you
want to build agents and you want each
agent to evolve really quickly because
when you build the first time unless
you're a magician it's going to suck a a
bit and then it needs to improve and
improve and improve right so how do you
get there well let's go back to how some
really good software is built when I was
a grad student I use matrix
multiplication a lot and this is a
snapshot of the generalized Matrix
matrix product and if you read the API
documentation it lays out every aspect
of the input every error code how long
it will take is also available in
documentation it's just it just works
right right and when you build software
on top of such really well documented
well-written software your software also
tends to be robust your products tend to
be robust even from 20 years ago when we
started using machine learning to build
products like you know there are tools
like apis that use models or pipelines
of models behind them you as a caller or
a person Downstream of such apis there
is a bit of stoas stochasticity if I can
pronounce it correct uh in right you
don't quite know what the result will be
and you don't quite know if it'll work
for you or not and this is despite best
intentions of establishing you know what
the input distributions are and what the
output distributions are there's always
a bit of stochasticity it was still okay
to work with them and I'll tell you why
it was okay to work with these but when
you enter using llms and agents which
are really compositions of llms the
errors multiply a lot and that is
something that causes a lot of fragile
behavior and I and we'll just take a
look at it and and I I hope my answer is
mildly surprising to you on how to avoid
the
fragility um in 2009 we
built uh a news sentiment product it was
basically to detect if a piece of news
for a given company would be beneficial
for that company or
not so the input distribution we knew
which news wires we were monitoring we
knew which language it was in news wires
also have editorial guidelines on how
they write things. So while the API that sits in front of the model is not as clean as matrix-matrix multiply, you still have a very
decent handle on okay what is coming
into my system and the outputs are
obviously just like you know it's minus
one to plus one pretty much so like the
output space is also very easy training
data we built it from scratch so we know
the training data we could have really
nice held out in time and space um
test sets and then we could establish
the risk of deploying this we could
monitor it so despite all of this guard
rail being present we still ended up
having a lot of outof band communication
on anyone who's Downstream of us so for
example if you were consuming our stream
of output on sentiment we would give you
a heads up we would tell you that hey
the model version is changing if you
have a downstream application using this
as a signal you want to test it out
things like that this was the landscape
that's changed a lot when you think
about building agentic architectures
like you want to make improvements to
your agents every single day you don't
want to have a release cycle where there
is a you know a purely batch regression
test based release cycle because there
are so many customers who are Downstream
of you who are also making independent
improvements to your model so I'll give
you like one small example right so uh
one of the one of the workflows that we
have agents for is um for a research
analyst is uh I told you that structured
data is something that they look at the
question here is US CPI for the last
five quarters Q is just a quarter
there's an agent that deeply understands
the query uh figures out what domain it
should dispatch to and then uses a tool
it's there's an NLP front end to the
tool but uses a tool to basically fetch
the data right
Turns out the data is wrong, which is why you need the guardrails. The data is wrong because of one character that was missed: it fetched monthly data as opposed to quarterly data. And if you're building a downstream workflow where you're not even exposing the table, a good research analyst would catch it, but if you're not exposing the table and you're just looking at an answer that says, well, it looks like the answer is 42, it's really hard to catch these compounding errors. Which is why it is easier not to count on the upstream systems to be accurate, but rather to factor in that they will be fragile and evolving, and just do your own safety checks; a toy example of such a check follows.
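A toy version of such a downstream check, in the spirit of the CPI example: verify the periodicity of the fetched data instead of trusting the upstream tool. The field layout is invented for illustration.

```python
def assert_quarterly(observations):
    """observations: list of (date, value) pairs; raise if not ~3 months apart."""
    for (d1, _), (d2, _) in zip(observations, observations[1:]):
        months = (d2.year - d1.year) * 12 + (d2.month - d1.month)
        if months != 3:
            raise ValueError(
                f"expected quarterly data, got a {months}-month gap: {d1} -> {d2}")
```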
Even within my own org, people are independently operating; every version of the data and analytics API tool that's coming out is better and better, but being better means being better on average; it doesn't mean it'll be better for you as a downstream consumer. So building in some of this guardrail I just think is good
sense and that almost makes you go
faster as you factor out individual
agents and each agent can evolve without
having these handshake signals of well
every Downstream caller I have I have to
make sure that they understand what's
changed and they sign off that I can
actually release my I can promote my U
new agent to like beta or production I
think we just need to like change that
mindset and be more
resilient. So that's one. The second thing is, as much as I used to code one fine day long, long ago, I'm a manager now, so I thought I'd talk about org structure, and I don't know how many of you will resonate with it. Bloomberg, like I said, we've been building these things for 15 years, and traditional machine learning has a particular factorization of software, and that software factorization is then reflected in the org structure. If you are lucky you have the reverse Conway law of design, but you really need to rethink that as you start using different tech stacks and start building different kinds of products. What do I
mean how many agents Do you want to
build and what should each agent do and
should agents have overlapping
functionality or not these are some
basic questions and typically it's very
tempting to just say let's just keep our
current software stack and see if we can
build on top of that or let's keep our
current Arc structure and build on top
of that and so what I've learned
is on the columns here you can see you
know the first two columns are
vertically aligned teams the next two
columns are horizontally aligned teams
and there are some properties in the
rows. And what we've learned, and we've actually done some reorgs, what we've learned is: in the beginning you don't
really know much on what the product
design is going to be and you want to
iterate fast it's just easier to like
collapse The Arc collapse the software
stack and just say here's a team go
build what needs to be built and figure
things out and that's where you want
like you know really fast iteration you
want sharing of code data models things
like
that the more you have understood this
for a single product or a single agent
the more you understand what its use is
and what it's good at and what it's not
and you actually build many many of
these agents and that's when you start
thinking okay I can go back to the
foundations of building good software
and good orgs, and I want to have things
like optimization on it so I want to
increase the performance reduce the cost
make it more testable make it more
transparent and that's where you move
into the bottom right corner of the
segment where you do have some
horizontal so in our case like guard
rails are horizontal we don't want every
team every one of those 50 teams like
trying to figure out what does it mean
for me to not
accept user inputs that are thinly veiled financial advice, right?
like it's something that you want to do
horizontally but you don't also don't
want to you want to figure out for
yourself what is the right
time uh for you and your organization to
start creating horizontals to also start
breaking out some of these
monolithic agents which are reflected
again in your structure and start
creating smaller and smaller
pieces so all that said and done like
you know just again for the uh running
example of a research agent this is how
it looks like today so you know I think
taking in the user user world and and
session context and deeply understanding
what is the question and then figuring
out what kinds of information are needed
to answer that question uh it's
factorized as its own agent uh reflected
in the org structure, and similarly for
answer generation we have a lot of uh
rigor around what constitutes a
well-formed answer again that's factored
out I call it semi- agentic like I
alluded to before because we do have
guard rails that are non- optional there
is no autonomy there you have to call it
at multiple points uh and then yeah like
we build on top of like years of
traditional and more and more modern
forms of data munging, like you know your
sparse indices have become dense and
hybrid indices now so yeah that's a
little bit and I think I'm right at time
so have a nice day thank you
[Music]
our final speaker this morning will
teach us how to distill accurate
actionable insights from vast multimodal
data sources he's the founder and CEO of
brightwave please join me in welcoming
to the stage Mike
Conover. Hey everybody, I'm Mike Conover, founder and CEO of Brightwave. We build a research agent that digests very
large corpuses of content in the
financial domain so you can think of due
diligence in a competitive deal process
you are pre-term sheet you step into a
data room with thousands of pages of
content uh you need to get to conviction
quickly ahead of uh other teams you need
to spot uh critical risk factors that
would would diminish asset performance
um it's a fairly non-trivial task um you
think about mutual fund analysts: it's earnings season, you've got a coverage universe of 80 to 120 names, there are calls,
transcripts filings it's um a fairly
non-trivial problem to understand uh at
a sector level but also at the
individual uh tier level what's what's
happening in the market um or goodness
you get into confirmatory diligence and
you've got 800 vendor contracts and
you need to spot uh early termination
Clauses you need to understand
thematically how is my entire portfolio
uh negotiating their vendor contracts
it's um frankly not a human level
intelligence task and the reality as
we've stepped into this space um is that
these uh these professionals uh just get
put in a meat grinder Junior analysts
are um tasked to do The Impossible on
extremely tight deadlines. I come from a technical background; prior to Brightwave I was at Databricks and created a language model called Dolly,
that was one of the earlier models to
demonstrate the power of instruction
tuning um for eliciting uh instruction
following behavior from from open source
Technologies and
um as I have met with these
professionals I have developed a deep
sense of empathy for um the stakes and
the human cost of doing this work uh
manually when it comes to the role of
the individual in uh finance workflows
and financial research, we
think of the parallels to early early
spreadsheets you go to an accountant or
Finance professional 1978 before the
Advent of computational spreadsheets you
say what's your job well I run the
numbers it's cognitively demanding these
people write this stuff out by hand on
literally wide pieces of paper called
spreadsheets it's cognitively demanding
it's important to the business and it's
time intensive it feels like real work
and now nobody wants that job and it's
not because there aren't finances
professionals it's not because nobody's
doing analysis it's the sophistication
of the thought that you can bring to
bear on the problem has increased so
substantially because there are tools
that allow us to think more effectively
more
efficiently what we're seeing what we're
hearing from our customers is that a
system like brightwave that is able to
dig and not just brightwave these this
class of knowledge agents is able to
digest volumes of content and perform
meaningful work that accelerates by
orders of magnitude um the efficiency
and also uh time to to value in these
markets and so the purpose of this talk
is to relate um sort of the intelligence
that we've developed uh in in the course
of building this High Fidelity research
agent um and just things that we're
seeing both technically but also in
terms of product affordances I mean the
the design problem that you have to
solve is how do you reveal the thought
process of something that's considered
10,000 pages of content content to a
human in a way that's useful and legible
that is not a uiux problem it's not a
product architecture problem that
existed three years ago and the final
form factor has not been determined chat
everybody's very Target fixated on chat
um that's probably not
enough. So the first thing that I'll observe is that non-reasoning models are performing greedy local search. The Bloomberg talk highlighted that sort of fidelity issue. As a really concrete example: you put a Reuters article into 4o and you ask it to extract all the organizations, and goodness if it's not going to give you products too. And if you have a 5 or 10% error rate and you chain calls like that, you're going to introduce, in an exponential way, the likelihood of error in these systems; the arithmetic below makes this concrete.
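The arithmetic: if each step succeeds with probability p, a k-step chain succeeds with roughly p to the power k.

```python
# Compounding error across chained calls.
for p in (0.95, 0.90):
    for k in (1, 3, 5, 10):
        print(f"per-step accuracy {p:.0%}, {k:2d} chained calls -> "
              f"{p ** k:.1%} end-to-end")
# e.g. 95% per step over 10 chained calls is already below 60% end to end.
```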
And so the winning systems will perform end-to-end RL over tool-use calls, where the results of the API call are in fact part of the RL sequence of decisions, so that you can make locally suboptimal decisions in order to get globally optimal outputs. The reality is
that's still an open research problem
you know how do I Avail myself of a
knowledge graph
or I did not do
that okay
um uh how do you Avail yourself of these
tools in an intelligent way um so that
you get globally optimal outputs it does
seem like that that is not a solved
question so the reality and I think it's
like heartening to see um this is a
theme and I think everybody in this room
can be sort of comforted by this you got
you got to build a product today and
like you're you're going there's going
to be this talk of the bitter lesson
that more data more compute better
models dominate all other approaches
like nobody wants an expert system
nobody wants to use spaCy to do named entity recognition. Um, sorry, um,
I was not in the speaker notes uh it you
can think of being more circumspect
about what is the scope of behaviors
that the system the agent is going to
engage in sort of like a regularization
parameter which constrains the
complexity of the model and that limits
the likelihood reduces the likelihood
that it will go truly off the rails and
begin to produce uh degenerate output
you can think of it sort of like this: the most interesting
interactions I've had with language
models are deep into a conversational
tree where you can think of selecting at
each branch each response there a set of
uh reactions that I can have to the
model output and I'm steering I'm
choosing this is what knowing how to use
language models it's that's a skill um
and many people who have real full-time
jobs may not invest in developing that
skill this is not dissimilar to what
these RL systems are doing and if you
can think of a multi-turn conversation
as not just establishing a a human
orchestrated Chain of Thought but really
that set of tokens defines the
activations of the model and if you
think of the activations of the model as
defining a program what you are doing
when you respond to the model and say no
not quite like that more like this is if
you think of the the activation weights
or the activations as a point in a
vector space you are nudging the
activations to a place where they can
finally solve the problem at hand and I
think that's what the chain of thought
process or the sort of reasoning
monologue is performing it's it's
getting the activations to a position
where it can actually solve the problem
so it's actually not I don't it's cute
that it you can interpret it but I would
prefer if it just got to the right set
of activations automatically um and so
from a product affordance
standpoint people are not going to want
to really become prompting experts in a
deep way and frankly it takes you know
easily a thousand hours um and so the
scaffolding that products put in place
in order to orchestrate these workflows
and and shape and the the the behavior
of these systems um I think had you know
these verticalized product workflows are
probably going to be enduring because
they specify intent they take that
weight off the user um so some of the
things that we see with respect to
archetypal design patterns in the space
consider a basic autonomous agent you
really want to mimic the human
decision-making process and decompose
what is it that a person would do well
if I need to understand how this uh poly
polypropylene reslin uh manufacturer um
is is managing costs I might look for
public market comparables and that would
that would you know maybe entail going
to the SEC filings or earnings called
transcripts and I would assess content
potentially from a Knowledge Graph
constructed from uh previous deals that
that you know I as as a private Equity
investor have done um news corpuses
assess which which document sets are
relevant to me distill down from those
documents findings that substantiate um
premises or hypotheses that I might have
about this question or this investment
thesis um and then enrich and error
correct those findings and so a couple
points on this one is that um it is
actually so I forget who it was but they
were talking about it was the Deep
research team talking about um on that
next step what are my intermediary notes
what is it that I believe on the basis
of what I found that's actually an
extremely useful think out loud about
what do we believe given the facts as
they uh have materialized on that first
pass through the the the data set um
enriching individual findings that are
distilled down from documents is an
extremely powerful um design pattern
Likewise, you can ask these models, "is this accurate?" For that Reuters example, you can say, "is this factually entailed by this document?" or "is this actually an organization?", and the model can frequently self-correct. And what we've noticed is that you can do that in the JSON, as sort of a chain-of-thought behavior, but it's actually more powerful to do it as a secondary call, because within a single call the model is kind of primed to be credulous; it says, well, I told you it was, so yeah, I'm probably right. So it's interesting how you can tease apart some of these steps into multiple different calls; a minimal sketch of that two-pass pattern follows.
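A minimal sketch of that extract-then-verify pattern, assuming a generic `llm` text callable: the verification happens in a separate call, so the verifier is not primed to defend its own earlier answer.

```python
def extract_organizations(article: str, llm):
    raw = llm(f"List the organizations mentioned in this article:\n{article}")
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

def verify(article: str, candidate: str, llm) -> bool:
    answer = llm(f"Document:\n{article}\n\nIs '{candidate}' an organization "
                 "named in this document? Answer yes or no.")
    return answer.strip().lower().startswith("yes")

def extract_verified(article: str, llm):
    return [c for c in extract_organizations(article, llm)
            if verify(article, c, llm)]
```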
Then, through this process of synthesis, you're able to weave together fact patterns across many, many documents into a coherent narrative. And that control
Loop we think that obviously human
oversight is extremely important um the
ability to nudge the model um with
directives or or sort of selecting this
is an interesting thread I want you to
pull that as extremely important and
that's because the human analyst always
is going to have access to information
that has not been digitized that's that
conversation with management that's uh
your portfolio manager thinks this class
of Biotech is just hairbrained um that
taste making I think is going to be
where you see um the most powerful uh
products
lean I firmly believe with respect to
the nodes in that Knowledge Graph and we
prob many people in this room probably
reached this on this conclusion as well
but you still see this oh we got a
portfolio manager agent this is the fact
Checker and that sort of it needless
like anthropomorphizing of these systems
um it Con strains your flexibility if
the design needs of your compute graph
change. And this is, I think it was 1978, Bell Labs, you know, the Unix philosophy: you think about piping and teeing on the bash command line. I guess I date myself; I still use bash, not zsh. Just simple tools
that do one thing and that work together
well and text is the universal interface
um it's 40 years ago 50 years ago jeez
So our friends at Latent Space put together this plot with respect to the structure of these graphs. Obviously that Pareto frontier, which is the efficiency frontier for the compute-performance trade-off, or price-performance trade-off,
um that Frontier is going to continue to
move out but there will I believe there
will for at least in enduring time be a
frontier and what's notable about that
is that you have to select then which
tool which system which model am I going
to use for each node in the compute
graph and the reason that this is
important
is what I call the latency trap if you
think about the plot of time devalue and
realized value for agentic systems and I
think this is extremely important it's
very easy to think oh man it's going to
do all of these things it's going to you
know I'm going to check it and airror
correct and then you know in 25 minutes
it's going to be banger and I think even
with high-quality products like OpenAI's Deep Research,
it's you're not always sure that what
you're going to get out is high quality
so there's there's kind of like a
question of like which side of the
diagonal it's probably not a straight
line but is that product on but also
but also, from a reps standpoint, the impulse response for the user: you can think of the loss as the gap between my expectation of what the report is going to look like and what it actually looks like, and the user's mental model is developing a sense for how my prompts elicit behaviors from these models. If it's an 8-minute feedback loop, a 20-minute feedback loop, goodness, you're
not going to do many of those in the
course of a day and your faculty with
the system and the product is going to
be
low so synthesis is is really where a
lot of the magic happens in these
systems and um a couple observations so
notice that it I don't know has anybody
in this room ever had a 50,000 token
response from any model
no. They say, you know, o1 has a 100,000-token output context length. I'm not
so sure and it's because the instruction
tuning demonstrations these human gener
synthetic or human generated outputs
that are used to post-train the models
have a characteristic output length it's
hard to write 50,000 coherent uh novel
words and so the likelihood that the
models are able to produce that. I mean, even o1 still tops out around 2,000 to 3,000 tokens, better than 4o. And so what happens? It's
kind of like a comp there's a
compression problem so I have a very
very large context window for input I'm
compressing that information into a set
of tokens and so it's the like the
difference between writing a book report
and a synopsis of each chapter you can
be more focused and specific about what you want those couple thousand tokens to be focused on. Here we have,
you know, I said, "write an analysis of the global financial crisis", and goodness if I don't think the rise of the shadow banking system warrants more than three sentences. So if you can be more granular and more specific, you can get higher-quality, higher-fidelity, more information-dense outputs out of these systems by decomposing your research instructions into multiple sub-themes; a rough sketch of that decomposition is below.
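A rough sketch of that decomposition, with invented sub-themes and an assumed `llm` callable: each call spends its limited output budget on one narrow theme instead of compressing the whole topic into a single response.

```python
THEMES = [
    "the rise of the shadow banking system",
    "subprime mortgage origination and securitization",
    "the collapse of interbank lending and the policy response",
]

def write_report(topic: str, llm) -> str:
    sections = []
    for theme in THEMES:
        sections.append(llm(f"Write a detailed section on {theme}, "
                            f"as part of an analysis of {topic}."))
    return "\n\n".join(sections)
```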
Additionally, the last point I'll make on this problem is the presence of
recombinative reasoning demonstrations
in the instruction tuning and
post-training corpuses. It is easy to say: given the text of The Great Gatsby and this epilogue, write a new epilogue
for The Great Gatsby because the cost of
internalizing that Corpus is fixed
effectively you read the book and then
you write five epilogues, and it's like, goodness, I've got it. Synthesis really is
about weaving together disparate fact
patterns for multiple documents think
about the applications to biomedical
literature synthesis I need to read all
of these papers and then have something
useful to say that actually brings
together the facts from these documents
now there's like a a cute trick you
could try which is to say given the
body of any given paper, write the abstract, as a post-training
exercise but it's just really hard to
get highquality intelligent thoughtful
analysis of many many many different
documents and so there are limitations
in practice for uh even state-of-the-art
models in terms of how they are able to
manage complex real world World situ
situations uh factors like temporality
um the perplexity had a well so
temporality is hard um and being able to
understand you know something like a
merger and an acquisition um you know
this these proforma financial statements
are different from those that came um
before the event um if they addendums to
contracts it's important to propagate
with um evidentiary um passages a
metadata that contextualizes why do I
care about this what do we think about
this document um what how should I
consider this in relation to the other
de uh pieces of evidence in in the in
the context window um so I'll now shift
a little bit with some some examples
from from our the product that we've
built which is um how do you reveal the
thought process of something that's
considered 10,000 pages of text and I
think that it is more like a surface and
one where you're able to um it's it's
kind of like this like people you may
know the Facebook and Linkedin
recommendation algorithm for for
um connections, feels uncannily good in
part not because I mean the algorithms
are okay not great um have gotten a lot
better over time but in your visual
cortex there is a a bundle of nerves
that are uh exclusively dedicated to
face recognition
and the ability to say in a in a you
know 6x6 grid of faces goodness I know
that person and so you attend to the
things that matter even if it's actually
a low Precision product experience and
so the ability to give the person um
details on demand is extremely important
um we'll see so here we have a
brightwave report um we you know I think
the ability to click on a citation and
then get additional context about this
not just what document is it from but
how should I be thinking about this what
was the model thinking in the course of
this um as well as structured
interactive outputs that give you the
ability to pull the thread and say well
tell me tell me more about that Rising
capex spend in bright wve um you can
highlight any passage of text so it's
not just the citations but you can
highlight any passage of text and say
tell me more what are the implications
of this I think open AI gestures towards
this with respect to Canvas and the
ability to increase the reading level of
of a passage having a continuous surface
that not just these citations um but in
fact any uh finding should be
interrogable. Likewise, you can think of... actually, it's gonna pause. It's not going to pause; I'm gonna go back and do this again.
um you can think of the set of things
that the model has discovered it reads
all of these documents it develops a
view it weaves the facts together um as
a as a high-dimensional data structure
and the report is one view on that data
structure it's kind of a low low effort
point of entry into the the space of
ideas you want to be able to turn over
that Cube and see especially in finance
um the receipts what's the audit Trail
for this system that's read all of these
materials and so being able to in this
example click into the documents is one
level but having all of the findings
laid out for you whether it's a
fundraising timeline um ongoing
litigation I'm able to if something
catches my attention click on it this is
where that investor, that analyst taste comes into play: I'm able to say, tell me more about that. It's like a magnifying glass for text. Something
catches my eye this patent litigation
the goodness that seems important um you
had a factory fire in Mexico that wiped
out you know critical single Source
supplier um what are you going to do
about that that ability to drill in and
get additional details on demand is
extremely important in these systems and
I think candidly um we we do not yet
have the final version the final form
factor of this class of products um but
it it's an extremely interesting design
problem and I will say uh we are we are
hiring so these QR codes not only is it
a great place to work we've got uh
people from Goldman Sachs and UBS and Meta and Instagram and Anaplan, and we just hired a senior staff software engineer from Brave. Goodness, we've got a stacked team. We also have a $10,000 referral
bonus so I'm going to see a lot more
phones come out now
um $10,000 referral bonus for all of
these roles primarily the product
designer and the front-end engineer
we're hiring staff and Senior staff
level professionals we we have a small
team of extremely experienced
individuals um and this is structured
like the DARPA red balloon challenge if
you're familiar um so if you refer the
person that refers the person that we
hire, you get a thousand bucks, and so on and
so on and so on all along that
exponentially exploding uh referral tree
so we're Brightwave; we build knowledge agents for finance workflows.
I appreciate your time
[Applause]
today ladies and Gentlemen please
welcome back to the stage MC for the AI
engineer Summit agent engineering day
the founder of Turing Post, Ksenia
[Music]
so wonderful to see you and see your
faces and but I bring good news I'm
bringing good news it's
lunchtime
um thanks to Mike who already left um
for this amazing Deep dive if you have
any questions to any of the speakers uh
please find them in the Q&A areas,
um one is um right here on this floor uh
other two on the lower level and I just
wanted to say that this session was
amazing what a morning uh um I feel
buzzing with insights and I hope you got
a lot of um interesting uh things for
you to think about um each talk was an
absolute gem uh we got the sneak peek
into an enterprise multi-agent copilot platform; we learned about Jane Street's tooling for OCaml, Bloomberg's challenges in scaling generative AI agents, and Brightwave's knowledge agents. All these companies are hiring,
so go talk to them if you're interested
um that's it enjoy your lunch um and uh
we'll see you back here at 2 p.m. thank
you so much ladies and gentlemen lunch
being served
[Music]
MC for the AI Engineer Summit agent engineering day, the founder of Turing Post,
[Music]
[Applause]
Ksenia. Welcome back! I hope you enjoyed
lunch and some sessions from our
sponsors
downstairs. It never stops, right? AI just never stops; it's constantly something, something, something. And how are you
feeling are you ready for some more
awesomeness
awesome. So this sprint of sessions is packed with action and some valuable insights into AI engineering. Let's see what's on our list: agents are built in the fringe, getting from 90 to 100; how to scale 500 million AI agents in production with two engineers; voice AI; your board isn't special; and how to scaffold wisely. With that, please join me in welcoming our next speaker, the head of product engineering at Windsurf, Kevin Hou
[Applause]
[Music]
wow, that's crazy. All right, how we doing, New York? So my name is Kevin, this is our first ever Windsurf presentation, so you could say it's the first time we're kind of spilling the beans on what the IDE is all about, so thank you all for coming. I'm going to be talking about Windsurf, the first AI-agent-powered editor. My name is Kevin Hou; I lead our product engineering team. We're a team based out of San
Francisco and thank you so much to swix
and Ben and the whole AI engineering
Summit team for inviting us here and
letting us speak to you all um it's been
a pleasure talking to people in the
audience at the booth and just generally
talking about AI uh so let's dive into
it. Windsurf is an agentic editor, and
we're going to talk a little bit about
some of the principles that we use when
we're building a product like
this so we believe that agents are the
future of software development and you
all are here so you kind of understand
the power of what agents can do both for
software engineering and otherwise um
but to start I'm going to take you down
a trip down memory lane right let's go
back to 2022 co-pilot was the
state-of-the-art it just came out of
beta people were experiencing the ghost
text they were seeing their completions
and it was one of the first times that
people really got to see the magic of
what AI could do for developers was
making them more productive and we
codium uh decided we were going to be
one of the first companies to also
launch an autocomplete product so we
garnered a couple million users on our
vs code jet brains Vim emac extensions
um raise your hand if you were one of
those codium users nice nice um but we
always knew that intelligence was going
to get better right back then we were
doing short completions maybe finishing
your functions but we knew that there
were going to be better models larger
models better training paradigms
completely new you know RL new tool use
all this stuff and so we knew that we
wanted to build the best experience for
devs possible so even back then we
started looking at agents we started
thinking about what could the future of
software development be if models just
got
bigger and so we built the best
experience that we could at the time
and that was a chat autocomplete product
but we always knew that copy pasting
from ChatGPT was going to be a thing of
the past we also knew that people are
going to probably tab less we're going
to have llms that we'd be able to
generate more and more and who knows you
know we always think all right agents
are the best now but as a company we're
always thinking about the future we're
technology optimists, so who knows, in the future we might not even be writing code inside of IDEs; we'll just be there building the best product for devs. And so this year, 2025, is finally the
year where I feel like we are all
recognizing the power of Agents inside
of software development agents are here
to stay, and Windsurf, I'm proud to say, is
pushing the envelope of that technology
and we're going to talk about some of
those features um and we're going to
keep pushing that agentic future um
because we believe that you know agents
are going to move software engineering
in a direction that no other llm has
done in the
This slide, I guess, is titled "Vibe coding with Windsurf", or also just "coding in Windsurf". I'm going to give you a quick demo. This is the Windsurf product; you have a sidebar, which is our agent, and you can see we're going to be building a Python web scraper. What this is going to do is build a Python web crawler and give us some stats about the website. You can see it's actually installing dependencies from pip, and it's doing so inside of the terminal that you use, so you can interact with it. It's suggesting edits, setting up your virtual environment, and we give the user a very helpful accept-and-reject flow so you can go through and have confidence that the code it's generating works for you and your codebase. There are of course a lot more features under the hood. Some of the things our users like to do: look up documentation (we have web search enabled by default); it always looks at your codebase, so you can grep through it; we can generate commit messages; you can drag and drop images. The possibilities are truly endless.
And I'm going to talk about how these features are powered by a handful of principles, a handful of through-lines that we as an engineering team hold true as we build. As a team we always come back to the same mission: keep you in the flow and unlock your limitless potential. We want to handle the grunt work for you: looking at your debug stack traces, modifying your original source code, pulling the correct version of documentation so you never have to worry about assembling the right context. These are the problems we're trying to solve, and we want you to spend time on the things you're good at, the things that make us all excited, which is shipping products, building great features, and generally just shipping code.
So with that goal in mind, how do we decide what to work on? It's a game of input and output: we want to allow users to give the least amount of explicit input possible while producing the most correct, production-ready code. We want you to contribute less and our agent to contribute more, and we do this by reducing the amount of human-in-the-loop required, doing things like background research, always trying to predict your next step, and making decisions on your behalf so you can move faster.

This might all seem like a fantasy, but Windsurf launched three months ago, on November 13th (that date is forever branded in my memory), and these are the results we're already seeing. In three months we've generated 4.5 billion lines of code, which is an absurd number, and since the time I started this presentation, users have probably sent thousands of messages to Cascade asking it to refactor code, write new features, and build new pages on their websites. Also, a fun statistic since we're all engineers here: we've had 16 nights in the last 90 days where we've been woken up in the middle of the night by PagerDuty on-call because of reliability issues from exceeding our capacity. We've had immense success getting people onto the platform, and we've been very fortunate to have the problem of being some of Anthropic's and OpenAI's largest consumers.
So with this mission and metric in mind, let's walk through some of the principles we use when building this agentic editor. For those of you who have used Windsurf, you might learn about some new ways to use the product. And for my own curiosity: how many of you have heard of Windsurf? Oh, let's go, that's sick. How many of you use Windsurf? Okay, everyone who put their hand down, the doors are over there.
All right, let's get into it. The first principle: trajectories. What is a trajectory? We use trajectories to read your mind. Unlike other editors, like Cursor (the elephant in the room), our agent is deeply integrated into the editor. On one half, an agent has to understand what you're doing; on the other half, it has to be able to execute things on your behalf. This has led to features like one of my favorites, "continue my work": we build up an understanding of the user as you're writing code and executing terminal commands, and then you can just go into the agent sidebar, say "continue my work", and it will keep executing, maybe even giving you a full PR or a full commit. We also have things like terminal execution modes. The agent can automatically use the LLM to decide what is safe and not safe to run, so if you're running something like git, it'll just work; but if there's an rm -rf somewhere, you probably don't want that run automatically, and the LLM will flag it to the user to confirm. These are just some of the ways we keep the human in the loop, but as minimally as possible. And finally, we have a stellar UX and design team that's been working on how to integrate these cutting-edge features into the product in a way that lets the user feel in control, able to accept and reject changes to their code, so they can have confidence in the code they're pushing to production.
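To make that command-gating idea concrete, here is a minimal sketch of the pattern: a hard denylist and allowlist checked first, with an LLM judgment for everything in between. The helper names and rules are illustrative assumptions, not Windsurf's actual implementation.

```python
# Sketch of LLM-gated terminal execution: deny obviously dangerous commands,
# auto-run known-safe ones, and ask a model about the rest.
import re
import subprocess

ALWAYS_DENY = [r"\brm\s+-rf\b"]                 # e.g. user-configured blacklist
ALWAYS_ALLOW = [r"^git\s+(status|diff|log)\b"]  # e.g. user-configured whitelist

def llm_says_safe(command: str) -> bool:
    # Stand-in for a model call like: "Is it safe to run this shell command
    # automatically without user confirmation? Answer yes or no."
    return False  # conservative default in this sketch

def run_with_guardrails(command: str) -> None:
    if any(re.search(p, command) for p in ALWAYS_DENY):
        print(f"needs explicit user approval: {command}")
    elif any(re.search(p, command) for p in ALWAYS_ALLOW) or llm_says_safe(command):
        subprocess.run(command, shell=True, check=False)  # run in the user's shell
    else:
        print(f"flagged for confirmation: {command}")

run_with_guardrails("git status")
run_with_guardrails("rm -rf /tmp/build")
```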
So here is how a trajectory works. We have this notion of a unified timeline: an agent is working behind the scenes to understand what the user is implicitly doing. This includes things like viewing files and navigating around your codebase. Say you edit a file, and then the agent edits a file; this all goes into a shared timeline of actions, which includes things like searching, grepping, making edits, making commits. The agent has a holistic understanding of what you're doing, and the entire experience is unified by this shared timeline: you can contribute to it, and it can contribute to it. This way you never run into the problem where you're talking to the agent and it undoes the change you just made, or has some outdated notion of the file state. This is a first-class principle of ours: when we decided we were going to build an editor, we were going to build it around this notion of an agent in a shared timeline.
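A minimal sketch of what such a shared timeline could look like: human and agent actions interleaved in one ordered log that gets serialized into the agent's context. This illustrates the idea only; it is not Windsurf's internals.

```python
from dataclasses import dataclass, field
from time import time

@dataclass
class Action:
    actor: str    # "human" or "agent"
    kind: str     # "edit", "open_file", "terminal", "search", ...
    detail: str
    ts: float = field(default_factory=time)

class Timeline:
    """One ordered log of everything both sides have done."""
    def __init__(self) -> None:
        self.events: list[Action] = []

    def record(self, actor: str, kind: str, detail: str) -> None:
        self.events.append(Action(actor, kind, detail))

    def recent_context(self, n: int = 20) -> str:
        # Serialize the last n actions into prompt context, so the agent
        # never works from a stale notion of the file state.
        return "\n".join(f"[{a.actor}] {a.kind}: {a.detail}" for a in self.events[-n:])

timeline = Timeline()
timeline.record("human", "edit", "form.py: added validate_email()")
timeline.record("agent", "terminal", "npm run dev")
print(timeline.recent_context())
```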
Here's an example of this feature in action. We're adding a new function, and you're seeing the autocomplete and all the bells and whistles of that feature. On the right side we just asked "continue my work". This is a new function; we probably want our form handler to use it, and based on the context we gave it by making edits, it's guessing: okay, we probably want to make this change to this file, and maybe some others. At the end it's saying, okay, let's just run npm run dev, and it can run terminal commands on your behalf in the background, in your Cmd+J terminal popup. In this way we're keeping you in the flow: something that would have taken minutes now takes seconds.

Here's another example. The terminal is now deeply integrated into the agentic timeline, so if you're typing commands (the classic example is npm install or pip install of a new package), the agent should know: oh, you just installed this package, why don't we go ahead and integrate it into your project? Based on context it picks up around the codebase, it can continue that line of work. So we very strongly believe in a future of no copy-paste: you should never be in a terminal, a document, or even a website, copy-pasting text into an agent. That's just not how the world should work. In the same way, we strongly believe the future is not going to be @-Terminal.
Here's another example of commands running inside of your terminal. This concept of a trajectory allows us to automatically execute things inside a sandbox that is as similar as possible to the way you actually run commands. Instead of running some shell script in the background, we put it right inside the place where you would write terminal commands, so if you pip install something, or it pip installs something, it goes to the same environment; you'll never get that instance of weirdness. This is all part of our effort to bring the agentic side and the human side as close together as possible, and you do that by building a unified product.

We believe developers are here to stay, and if you want to work seamlessly with a developer, the agent has to understand what they're thinking. Windsurf has to be ubiquitous, and the agent will be reading more and more of your mind, doing things you might not even know it's doing. In the future we'll be looking not just one to five steps ahead, but ten, twenty, thirty steps ahead. It'll be writing unit tests before you've even finished defining the function; it'll be performing codebase-wide refactors across multiple files based on you simply editing a variable name. All of this is part of the unified trajectory concept.
Now, the second principle is meta-learning. Even if Windsurf understands what you're doing in the moment, there is still an inferred understanding of your codebase, your preferences, and your organizational guidelines, the kind of thing senior engineers at your company have built up a notion of over time. We call this concept meta-learning, and we've built Windsurf from the ground up to adapt to and remember these things about you and your company. Think about a frontier LLM, the best LLMs that exist in the world: they're very smart engineers, definitely more capable than I am, probably more capable than most of you. They can write an enormous amount of code, correctly, and it probably runs and compiles pretty well. But what they do not have is the exposure you've had, the education you've had, and the ability to remember and know how you personally, or your company, writes code.

So what does this mean for our product? We've implemented a concept called auto-generated memories. Over time we build up a memory bank of what you're doing, so you can say "remember that I use Tailwind version 4" or "remember that I use React 19 instead of 18", and these things will be remembered. You say them once and they're remembered forever. We also let people plug in custom MCP servers, so you can bring your favorite tools and we can adapt to your workflow. And we allow you to whitelist and blacklist commands, going back to that same concept: we want to keep you in the flow as much as possible, but you can tell the agent "never run an rm command without my approval". In this way it learns your preferences over time.

If you think about what makes a developer effective, it's that they remember things you tell them, and Windsurf must also model this behavior if we hope that AI will write and maintain projects for us. In the short term this means you don't need to prompt the agent again and again to do the same thing; in the long term, the AI should just feel like a seamless extension of yourself. It's this idea of explicit versus inferred context, and we always have a saying at the company: ideas are cheap.

Here's an example of auto-generated memories in action. We're not even explicitly telling it to remember something; we're just asking for an architecture overview, "what does this project do?", and based on a couple of tool uses, looking at a few different files and the routes, it commits to memory: this is the project this person is working on, and here are the endpoints that are available. We can reference that in the next message we send, so in a future conversation we can one-shot things, because we have a notion of a memory bank.
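As a rough illustration of the memory-bank idea, here's a sketch of persisting facts and prepending them to future prompts. The file layout and function names are assumptions made for the example, not Windsurf's design.

```python
import json
from pathlib import Path

MEMORY_FILE = Path(".agent/memories.json")  # illustrative location

def recall() -> list[str]:
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def remember(fact: str) -> None:
    """Persist a fact like 'uses Tailwind v4' so the user only says it once."""
    memories = recall()
    if fact not in memories:
        memories.append(fact)
        MEMORY_FILE.parent.mkdir(parents=True, exist_ok=True)
        MEMORY_FILE.write_text(json.dumps(memories, indent=2))

def build_system_prompt(base: str) -> str:
    facts = recall()
    if facts:
        base += "\n\nKnown facts about this user and project:\n"
        base += "\n".join(f"- {f}" for f in facts)
    return base

remember("Project uses React 19, not 18")
remember("User prefers Tailwind v4 syntax")
print(build_system_prompt("You are a coding agent."))
```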
In the same way, documentation is auto-learned. We know what packages you're using because of your package.json, because you've explicitly told us, and we're able to look up documentation on the web that matches those versions, all implicitly. The dream of meta-learning is that you can have an entirely inferred sense of context based on a codebase or on usage of the product, and auto-generated memories are a step in that direction. We do allow users to add a rules file, but we strongly believe a rules file is a crutch: by the end of 2025, 99% of the things you would put in a rules file will be inferred from your codebase or your usage. Our dream is that every single Windsurf instance, every single user, regardless of the company or the skill of the developer, will be personalized to that user, and you'll only ever have to tell it something once.
And finally, my favorite principle: scale with intelligence. What does this mean? Now that Windsurf understands what you're doing in the moment (the first principle) and can improve over time (the second principle), how do we actually build an agent that scales at the rate LLMs are scaling? While we're trying to give you the best tool today, we recognize that new models come out every other week; every day there's some new article about some new pattern, and it's really hard to keep up. We always ask at Codeium: how do we stay on top of this? How do we build the best product not just for today, but for 3, 6, 12 months out, three years from now?

In 2022, when ChatGPT came out, you were probably like me: we all had our imaginations running wild. Okay, we're going to solve AGI, the post-scarcity economy, whatever. But obviously a lot of things need to happen between then and that future, and models at that time were, quite frankly, a little too dumb to accomplish everything we wanted them to do. So we built up a lot of infrastructure, and you and I have all probably done this: we build embedding indices, retrieval heuristics, output-validating systems to make sure the generated code is good. These all helped at the margin, but it was all predicated on the assumption that we were operating with a fixed notion of intelligence. In 2021 and 2022 we were building all this infrastructure to compensate for edge cases the models could not handle. What's very different about the way we're approaching Windsurf is that we want our product to scale with the models: if the models get better, our product gets better.

I'll give you one such example; it kind of surprised me. When I landed in New York I tweeted that we deleted chat in Cascade. I just had thoughts, and weirdly a lot of you picked it up. It's an example of something we feel very strongly about.
One example of this principle in practice is that we deleted chat. What does this mean? Inside Windsurf we only have an agent, and it's called Cascade. Chat is a legacy paradigm, and we completely replaced it. As you can see here, users are enjoying it; in fact, they might not even notice the difference, they're just enjoying the higher quality.

Another example is @-mentions. We built @-mentions, and you've probably all used them, because context retrieval was not very good a year or two ago. Today Windsurf can dynamically infer the relationships between bits of code and documents; 90% of the time you do not need to @-mention anything. All you need to do is let the retrieval system in the agent plan out what it needs and reconstruct the context automatically for you. @file and @web are helpful patterns when you're working at the margin, but they're eking out basis points. In the long term, we believe LLMs are going to improve (and they already have) to the point where you don't need to explicitly specify an @-mention; the LLM should be intelligent enough to pick it up. In this example I was implementing Supabase inside a Next.js app. Previously you'd be doing @web, @docs, @codebase, this and that. Now you just say "add Supabase", and it's able to infer and plan: let's search the web, let's behave like a human would.

And to get into this, there's also web search built into Windsurf, and what's special about it is that it reads the web the way a human would. Instead of hardcoded rules (we probably could have created an embedding index, but we'd probably get very low-quality results), we said: the LLMs are very good, let the model decide what it wants to do. Let it decide which search results to read, which parts of the page to read, and then finally give us an answer. And we believe that as models continue to get better, we'll keep doing more unsupervised work: generating full PRs, reading complex documentation. The possibilities are truly endless.
So those are the principles we just talked about. Where are we going with this? There are a lot of directions we can take it; the engine underneath Windsurf is really the secret sauce, and we believe 2025 is going to be a whole new world: no rules files, generating PRs, generating commits. It's going to be crazy. We're already seeing this: across all of our users, 90% of the code they write is generated with Cascade. That's an astonishing number; autocomplete was more like 20 to 30%. This is insane. People are using agents today to accomplish so much more than they could in the past, and we're all software engineers: I want to make sure every single person in this room is armed with the best tools, and those best tools are agents.

And like every good thing in this city, I expect tips: 25% of your ticket price, which I heard was quite a lot. Here's the actual QR code you're probably curious about: this is Windsurf's download link, and we offer a free tier, so go ahead and scan it and start using the magic today. Finally, we have some killer swag at our booth, and you can also connect with me on Twitter; I try to stay active with the community. Thank you so much for watching. I hope you all learned something about how we're building at Windsurf, and enjoy the rest of the conference. Thank you.
[Applause]
[Music]
Our next presenters will tell us how they scaled to 500 million AI agents in production with just two engineers. Please join me in welcoming senior software engineer at Method Financial, Mustafa Ali, and the founder and CEO of OpenPipe, Kyle Corbitt.

All right, hey everybody.
Corbett all right um hey everybody uh
yep I'm Kyle Corbett from open pipe and
I'm here with Mustafa Ali from method
we're going to be talking about how
method has scaled in production to over
500 million agents uh and basically all
the the tricks they use to to make that
actually work
Yeah. A little bit about Method: we essentially collect and centralize liability data from across hundreds of different data sources. This includes tapping into the credit bureaus, connecting with card networks like Visa and Mastercard, direct connections with financial institutions, and various other third-party sources. We aggregate and enhance this data and serve it to our customers, who are typically other fintechs, banks, or lenders, and they use this enhanced data for anything to do with debt management: refinancing, loan consolidation, liability payments, or just personal financial management.
And at OpenPipe, what we do is help you build, train, and deploy open-source models for actual usage. We also let you use the signals you get in production, from users and from the environment, to continuously improve your model over time, and that's some of what we'll be talking about with Method. Nice.
So one of the early challenges we faced at Method while building this aggregation pipeline was that some of our customers came to us and said: it's really nice that you can give us the balance and payment information on a specific liability for our end consumers, but what would be really nice is if you could also give us some liability-specific data points, like the payoff amount on an auto loan or the escrow balance on a mortgage. So we said, okay, let's do some research. We went back to some of our data partners and asked: is there anything we can plug into to get these kinds of data points? What we found was that there's really no central API we could get access to that would give us this data. Ideally we would work directly with the banks, but having worked with banks before, and just from initial conversations, we realized it would easily take at least a couple of years before getting anything solid done. We're an early-stage company; we want to build for the customer fast, so we were really looking for a solution we could push into production tomorrow.
So, to get a better understanding: the services these companies provide today, how are they delivering them in the first place? They must be getting that data somehow. We went back to some of these customers and asked how they operate, and what they told us is kind of interesting. A lot of these companies hire offshore teams of contractors, and those teams are responsible for calling the banks on behalf of the company and the end consumer. They authenticate with the banks, gather the necessary information, somebody has to proof-check it, it gets sent back, and then it gets integrated into the financial platforms, where it's surfaced to the user or used for underwriting and so on. So that's the status quo we're dealing with, and when you think about it, it's a very inefficient manual process. It doesn't really scale, and it has a lot of problems. It's expensive, because one person can only do one thing at a time, so to scale you basically have to hire more people. For the same reason, because it's so synchronous, it's also really slow. And the biggest problem is that there's a lot of human error involved: you need to hire a team to fact-check and proof-check the work, because the worst thing you can end up doing is surfacing inaccurate financial information.
Conceptually, though, if you think about it, it's kind of like an API: you have the request component, the authentication component, the response validation, all that stuff. When you drill this problem down, the core problem is really just making sense of unstructured data. If only there were some magic tool or software that was really good at parsing unstructured data.
And lucky for us, around the time we were trying to solve this problem, OpenAI announced GPT-4, and, as people like to call it, there was this Cambrian explosion of AI- and LLM-enabled applications all around us. The results were mind-blowing, and we thought to ourselves: this is the perfect thing for us, this is a godsend. If there's one thing we all know in this room, it's that advanced LLMs, especially post-GPT-4, are really good at parsing unstructured data; tasks like summarization or classification are exactly their strength. So we wanted to test that theory and see what it could get us.
us and so we put our heads down hack
together this agentic workflow using GPD
4 and as expected you know it worked
really well so we tried to like expand
some of our use cases because that you
know the API costs are high so we wanted
to get as much as we could from a single
API call and you know it turned out to
be really good at that so we tried to
obviously this was in a very controlled
manner um but this was in production and
so we were testing out uh different uh
extractions basically and um you know
everything was going really good uh but
as soon as we started to increase a
little bit of uh traffic uh what we
found was you know the bill had to come
du and um it was a lot so $70,000 for
our first month in production with GPD 4
and you know this was this made
leadership really unhappy and you know
but um but it was something it was
something they were they were fine with
because the value that we were getting
out of gp4 was so immense um and so we
actually kept this thing in production
for at least a couple more months as we
tried to work around this kind of cost
problem and you know cost wasn't the
only thing that we were concerned with
um as we started to scale some of these
use cases we quickly ran into a wall
with prompt engineering it only takes
you so far um one thing we realized that
even though gbd is really smart it's not
a financial expert so you had to give it
really detailed instructions and
examples uh to really make it work with
all kinds of use cases that we were
trying to Target um so it's hard to
generalize those kinds of prompts they
become really long convoluted it's
always a cat and mouse Chase with you
fix it for a certain scenario and it
breaks for another one you fix it for
that one it breaks for the previous one
and so you're all this going back and
forth we didn't have any prompt
versioning so we had to figure out a
better way to make this work for all of
our use cases
So the tl;dr is that we didn't want to adopt the manual solution I described earlier because of its scaling challenges and inefficiency, but we ran into a different set of scaling challenges with GPT. It was expensive, because we couldn't really optimize for caching given the variability in responses and the prompt tweaks we were making all the time. The baseline latency was actually really slow, so overall we couldn't scale concurrently. And analogous to the human errors, though of a different nature, we had AI errors: hallucinations that were hard to catch. We just couldn't scale with that kind of system, but we kept it in production because for specific use cases it was really, really good.

So the problem shifted. The core problem of making sense of unstructured data was solved with GPT; now the problem was how to scale this system, how to build a robust agentic workflow that can handle this kind of volume reliably. Some of the ballpark figures we came up with: we'd be making at least 16 million requests per day, we'd have at least 100K concurrent load, and we needed minimal latency for this real-time agentic workflow, sub-200 milliseconds. The natural next question for us was: do we buy more GPUs, do we host our own model, what do we do at this point? And that's where OpenPipe comes in.
Yeah. So about a year ago we started working with Method on solving the issues Mustafa just listed, and we found that those three issues, quality, cost, and latency, are very common; across almost everyone we work with, at least some subset of those is top of mind. With Method specifically, we were working on how to solve those problems in a way that makes this a viable business.

The first thing we did was start measuring error rates. As he mentioned, even AI models are not perfect; these are probabilistic systems, and getting to a 0% error rate was not really feasible. But we could see that different models had different performance characteristics. On modern models, on the task they're doing, these are the rates we saw: with GPT-4o, about an 11% error rate, and with o3-mini it's much better, a 4% error rate. How you measure that will be specific to your business, and that's true to some extent for all three of the things we'll talk about. In Method's case this was luckily relatively easy to measure, because ultimately what the agent is trying to do is extract all this information he was talking about, bank balances and things like that. You can have a human go through the flow, figure out what the real numbers should be, and compare the agentic system's final outputs against that to see whether it was successful.

On latency, we saw GPT-4o take around a second to respond, while o3-mini took about 5 seconds for their specific task. Again, this is somewhat task-dependent, for example on how much o3 has to think. As you measure this, you also want to make sure you're using real production conditions: a real diversity of tasks that matches what you're actually doing, at a concurrency level that matches your production traffic.

And we measured cost. Again, how much cost matters is very specific to your use case. Interestingly, even though o3-mini has a much lower per-token cost than GPT-4o if you just look at the API pricing page, for their specific use case we found it was a little more expensive, because it generates many more reasoning tokens and so has much longer outputs. Again, this is somewhat task-dependent. As an aside, I'd recommend that once you have an initial proof of concept working with some model and you're trying to optimize, it's really worthwhile to set this up; it can be as simple as literally writing three different Python scripts that measure each of these for a given model (see the sketch below). Then, as new models come out, you'll be able to quickly tell how they're doing.
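In that spirit, here's a minimal sketch of one such script: it measures error rate, latency, and cost for one model against human-verified ground truth. The grading function, pricing constants, and case format are assumptions you'd replace with your own.

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()
MODEL = "gpt-4o"                   # swap in each candidate model
PRICE_IN, PRICE_OUT = 2.50, 10.00  # $/1M tokens; check current pricing

def grade(expected: str, actual: str) -> bool:
    # Task-specific check, e.g. compare an extracted balance to ground truth.
    return expected.strip() == actual.strip()

def run_benchmark(cases: list[dict]) -> None:
    errors, latencies, cost = 0, [], 0.0
    for case in cases:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        latencies.append(time.perf_counter() - start)
        cost += resp.usage.prompt_tokens / 1e6 * PRICE_IN
        cost += resp.usage.completion_tokens / 1e6 * PRICE_OUT
        errors += not grade(case["expected"], resp.choices[0].message.content)
    n = len(cases)
    print(f"error rate: {errors / n:.1%}")
    print(f"median latency: {sorted(latencies)[n // 2]:.2f}s")
    print(f"total cost for {n} cases: ${cost:.4f}")
```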
Okay. Once you've done, or in this case once we'd done, this benchmarking of where the models are, the next question is: where do we need these models to be, where do we need to get to? Again, this is very task-dependent. In Method's case, they have extra checks that happen afterward, where they look at whether the numbers that came out are plausible and match the kinds of things they've seen before, all these different checks. So they didn't need to get all the way down to a 0% error rate. But those checks are still fallible, so if the error rate is over a certain point, some fraction of errors will get through, and that's bad. We found that around a 9% error rate was able to get them what they needed.

From a latency point of view, their agent is a real-time system: it needs to respond quickly to move through the whole flow and get the information it needs, so they did have a hard latency cutoff. We see a wide variety here, for what it's worth. I've talked to customers who say, if I get a result back at some point in the next few days, that's totally fine, this is a background batch process. We have other customers doing real-time voice with a human on the other end of the line, where if you're over 500 milliseconds, it's not going to work. So again, you just have to know how much this matters for your specific case. Same with cost: in their case, because of the very high volume Mustafa was mentioning, cost is pretty important to them. Depending on your use case, mostly depending on how high-volume it is, cost will matter more or less, but you should know these numbers for your specific task as you compare different models.

Okay, so looking at this slide, you may see the problem: of the two models we're comparing, neither actually meets all three of the requirements we need to deploy this in production. GPT-4o misses on both the error rate and the cost, and o3-mini misses on the cost, but especially on the latency; it's just not going to work for what they need. This is the point at which Method came and talked to us: hey, we're not able to hit what we need here, because these models
aren't getting us where we need to be. So, what we work on at OpenPipe is fine-tuning: building custom models for your specific use case. I'm going to talk about why you'd want to do that and how it helps. First, I'd say fine-tuning is a power tool. It takes more time and more engineering investment than just prompting a model, so you don't really want to do it until you've actually benchmarked the production models with prompting alone and seen whether they work. In Method's case, as in all of our customers' cases, they found they were not able to hit the numbers they needed, and that's the time to bring in fine-tuning.

So let's look at how we were able to fine-tune a model and how that helped, because fine-tuning can really bend the price-performance curve. On the error rate, which is basically just the inverse of accuracy if you want to measure it that way, we got to a place where we were doing significantly better than GPT-4o, and importantly, better than the threshold they needed. This used to be much harder to achieve; it required a lot of manual labeling of data. It's become much easier over time because of the existence of models like o3-mini, which let you just use your production data: take the inputs you're using in production, generate outputs for them with a model like o3-mini, and train on those. We find that you often can't quite match the performance of the teacher model (o3-mini in this case), but you can get quite close, and usually do much better than a slightly less good but much, much larger model. In this case, the model we ended up deploying with them is just an 8-billion-parameter Llama 3.1 model, and we find that for the majority of our customers, a model that size or smaller is good enough to hit the quality numbers they need. The important thing is to benchmark and answer that question for yourself.
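The distillation recipe described here can be sketched in a few lines: replay production inputs through the stronger teacher model and save prompt/response pairs in the JSONL chat format most fine-tuning stacks accept. File names and the example prompt are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

def build_training_set(production_inputs: list[str], out_path: str) -> None:
    with open(out_path, "w") as f:
        for prompt in production_inputs:
            resp = client.chat.completions.create(
                model="o3-mini",  # the "teacher" in this example
                messages=[{"role": "user", "content": prompt}],
            )
            row = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": resp.choices[0].message.content},
            ]}
            f.write(json.dumps(row) + "\n")

# The resulting JSONL can then be used to fine-tune a small open model
# (e.g. Llama 3.1 8B) with whatever trainer you prefer.
build_training_set(["Extract the payoff amount from: ..."], "distill.jsonl")
```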
On the latency front, this is where the magic of moving to a much smaller model comes in. With an 8-billion-parameter model there are just far fewer sequential calculations across far fewer layers, so you get much lower latency. You can even (we didn't have to do this in Method's case) deploy the model within your own infrastructure, collocated with the application code that's using it, and completely eliminate the network latency.

And finally, on the cost front: because this is such a small model, you end up with a much, much lower cost. For many of our customers that's incredibly important: getting the performance number you need while still maintaining a relatively low cost. In Method's case we were able to far exceed the cost thresholds they were looking for to make this viable, which means they don't have to worry about it from a unit-economics point of view the way they did with the larger models.

So, to reiterate what I started with: fine-tuning is a power tool. It takes a fair amount of work, not an extreme amount, but significantly more than prompt engineering. However, if you can't reach the reliability numbers you need through prompt engineering alone with the models that exist out there, fine-tuning is a viable way to strongly bend that price-performance curve, and it can help you get to a very large scale in production, just like Method did.

Nice. So, just to wrap up here,
there are a couple of points we want to highlight. The reason we put "two engineers" in the title is that it's really not that complicated. We identified a specific use case and got away with using the cheapest model out there. We fine-tuned it, and we already had the data from running GPT in production, so we didn't have to go digging for data in the first place. We used the cheapest model that gave us the fastest performance, and you don't need to buy your own GPUs. The other thing we realized is that productionizing AI agents requires some level of openness and patience from the engineering team and from leadership. When you write regular code, we're all used to code that just works: you push out a feature and it never breaks, because you're not changing anything. With AI agents, it takes some time to get to a point where it's production-ready and actually gives you the responses you're looking for. And I feel compelled to say something to mark this moment for the traditional software engineering job, so I'll leave you with these last few words: pivot to AI. Thank you. Thanks, everyone.
[Applause]
[Music]
Our next presenter is a staff software engineer at SuperDial, and he's here to tell us how to make reliable voice AI agents. Please join me in welcoming to the stage Nick Kotakis.
[Music]
[Applause]
Awesome.
Hey everyone, I'm Nick, an engineer at SuperDial. First of all, big thanks to the organizers; this event has been awesome. I've had a blast talking with you all, connecting with you, and hearing all these great talks. Somehow I'm giving one of the few voice AI talks this weekend, so I have a lot to cover, and we're going to dive right in. If you're new to voice AI, I hope I can provide a nice little framework for thinking about this very fast-moving space, and if you're already building with voice AI, I'll share some anecdotes from our own scaling journey that I hope will help yours as well.
So, voice AI in 2025: extremely exciting. We're seeing new, smart, really fast, really affordable LLMs that support much more complex conversational use cases, but you still need some tricks to take your chat agent and turn it into a voice agent. We have low-latency, really realistic, super-generative text-to-speech models, but sometimes we get audio hallucinations, and we have to deal with things like pronunciation and spelling. With all the new things people are building, there's an explosion in voice AI infrastructure, tooling, and evaluation systems, and a big question becomes: what's actually worth owning? And the big one on everyone's mind is the new speech-to-speech, or voice-to-voice, models. Our take is that for a lot of production applications they're not quite ready yet, and a big reason is that they start to output things that aren't actually speech or natural, things you can't use to build a reliable conversation. We saw this when they first came out: they were imitating people's voices. From the start, that's why we've favored reliability over that sort of realism.

So today I'm going to talk about how we at SuperDial approach agents as a service, how we think about the voice AI engineer, and the last-mile problem: once you have your little voice MVP, all the challenges you'll face actually making it reliable and putting it to work.
make it reliable and put it to work so
at Super dial we're in the business of
phone calls specifically one of the most
annoying phone calls ever that phone
call to your insurance company so for
Mid to large-sized healthc care
administration businesses we sell the
super dial platform and with super dial
you can build your script so design the
sort of conversation ask all the
questions that you need to get over the
phone you send us your calls via CSV API
or we also integrate with a lot of EHR
software systems and then you know
within the next couple hours in the next
day we send you back your results in a
stretchered format and this makes for a
really interesting agentic contract that
we sort of have with our customers so
from their perspective they're paying
for results they tell us who to call
which questions to ask and we tell them
the answers internally we have a little
agentic Loop set up so
that uh we go out we wait for these
offices to be open we wait for um you
know the call centers to open so we can
actually make these calls we will
attempt to make the call with our voice
bot and then if our voice bot needs to
bring in a human to complete the call or
cannot complete the call after a certain
number of attempts then we send it to a
fallback team and this is something that
of course we're transparent with with
our customers in fact it's a benefit to
them because it's kind of inevitable
with these Healthcare phone calls calls
that sometimes you need to bring in a
human so with us they know that no
matter what happens the call will get
made whether or not it gets made with a
human or a bot doesn't matter to them
they get their answers reliably and in a
structured
format uh and with all these calls we
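Here's a minimal sketch of that loop; every helper name is hypothetical, and the retry and polling values are made up for illustration.

```python
import time

MAX_BOT_ATTEMPTS = 3

def office_is_open(phone_number: str) -> bool: ...
def bot_attempt_call(call) -> bool: ...        # True if the bot finished the call
def send_to_human_fallback(call) -> None: ...  # hand off to the fallback team
def deliver_results(call) -> None: ...         # structured answers back to customer

def handle_call(call) -> None:
    while not office_is_open(call.phone_number):
        time.sleep(600)  # check again in ten minutes
    for _ in range(MAX_BOT_ATTEMPTS):
        if bot_attempt_call(call):
            deliver_results(call)
            return
    send_to_human_fallback(call)  # the customer gets answers either way
    deliver_results(call)
```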
With all these calls, we do our best to learn from them. We'll update the office hours for the phone number we're calling and learn from the phone-tree traversal we just tried, so that when we call again we get even better at that sort of call. And because these are sensitive phone calls, we want to make sure our system always works, so we randomly pull out some calls and audit them.

For a quick little demo, this is a prior authorization call. This is after the point where we've traversed the phone tree by pressing the right buttons, and now we're talking to a human, trying to get some questions answered for a customer.

"May I know your first name?" "Hi, this is Sarah." "Are you calling from a doctor's office?" "I'm calling from the provider's office." "Do you have a member ID or a case number?" "The member ID is..." "What is the CPT code?" "The CPT code is 81243." "Okay, hold on. So there's a case on file that was initiated for the code 81243. It is pending; this case number is..., and we have not received any clinicals for this case yet." "Okay. What is your name again, and what is the reference number for this call?" "First thing: you may use the pending case number as the call reference number, and the fax number is where to send the clinicals." "Thanks so much for your help." "You're welcome. Thanks for calling. Have a great day."

So that's it. If that call was
really boring to you, that's kind of just how these things go. A boring call, for us, is an excellent call, because it turns out a lot of work is boring. With this system we've been able to save over 100,000 hours of human phone-calling time, and we're on track to save millions more in 2025. What's really incredible about voice AI today is that we did this with a really lean team of four engineers:
building the whole full-stack web application, the EHR integrations, and the bot you just saw, all while bringing on new customers and supporting new conversational use cases really quickly. A big part of why that was possible is that we all embraced the role of the voice AI engineer. So let's uncover what's unique about a voice AI engineer today and what hats they might be wearing.
Starting from swyx's original graph, we can see that a voice AI engineer deals with multimodal data: MP3s and audio bytes in addition to transcripts. You're dealing with transcription models, voice models, speech-to-speech, all that sort of thing. The application you're building runs in real time, so latency suddenly matters much more, and you're going to be dealing with async Python a lot more than you probably wanted to. And the product constraint is almost always going to be a voice conversation: people have really high expectations of how these conversations should go. For us, we're slotting ourselves into an existing business interaction, and people expect us to be conversational and fit into that use case.
and fit into that use case so to Grapple
with all these challenges we kind of
have two sayings at Super that we've
been saying over the past year and a
half say the right thing at the right
time and build this plane while we fly
it so the trickiest part uh for us is
customizing all these scripts and all
these use cases for each customer
individually and then we really rely on
this kind of like horizontal voice AI
stack to help us out with all those
other problems and this is kind of how
we think about the voice AI engineer
today and it's Unique roles and in the
larger context we really at this
inflection point where it's so easy to
build out an MVP for these sorts of
applications that ultimately what is
going to make your voice bot unique
isn't its Voice or its Interruption
handling or how realistic it sounds or
how it does turn taking ultimately it's
going to be in the conversational
content and the design there and the
vertical Integrations around it that
make your agents work actually valuable
If you're like me and your favorite classes in college were the AI ethics ones, everything I just said about moving fast and building with generative AI might raise some alarms. It's not hard to imagine how voice AI apps specifically could be biased against people with certain accents or dialects, or be really spooky when they sound so real and then say weird things. In the US we both enjoy and suffer from a lack of AI regulation, and that leaves the onus on the AI engineers and leaders in this room to think about these problems. This isn't going to be a talk on AI safety and ethics, but for voice AI specifically, given that it's such a new modality of interaction with artificial intelligence, I think it really matters how we go about building it. For AI engineers, when we make tooling and infrastructure choices, remember that developing AI should be accessible and collaborative, and the work AI does should be for everyone. A key part of that is choosing tooling and infrastructure so that a really diverse set of stakeholders can be involved in the process from the start.
With the role of the voice AI engineer scoped out, let's dive into some of the last-mile problems in voice AI that we've been dealing with. When we started out, we had a really scrapped-together pipeline: a transcription model, an LLM, and a text-to-speech model.
but you know we faced a lot of problems
very quickly and a lot of what we were
learning was not new at all so though
the voice agents we see today are better
than ever voice UI itself is not that
new so when we were just getting started
uh around a year and a half ago I had
the chance to speak to Kathy Pearl who
is a close family friend and has been
working on uh the ux of Gemini she's
been in the conversation design game for
like 20 years or
something uh and back in the day like
voice UI was lots of phone tree design
and then it BEC these Alexa and Siri
type things and now we're just in this
whole new world but a lot of the
principles remain the same and one of
the biggest things that's changed with
developing voice UI is the shift from
prescriptive to descriptive development
so we no longer prescribe what we want
our bot to do over the course of the
conversation by mapping out every
possible direction that it could go
instead we describe what we want to do
and then kind of pray to the Jenner of
gods that it
happens and for this you know there's a
There's a lot I could say about conversation design, and it comes up really quickly once it becomes your main interface. One question for us: when we're asking these questions, should we be really open-ended, or should we constrain the user into selecting from a list of choices? Because these are existing kinds of conversations, we find it's often better to stay general, hope the call center representative gives us a ton of information, and then, instead of trying to prevent them from saying the wrong thing, adapt to whatever they say. Cathy's recommendation: hire a conversation designer if you're thinking about these sorts of problems; they're experts at this. And if you're a voice AI engineer who wants to get started with this kind of thinking, a great exercise is to do little table reads: have one person pretend to be the bot and another pretend to be the user. The gaps and awkwardness in the transcript you wrote out by hand come out immediately when you say it out loud.
conversations but we had kind of had to
deal with the tech de debt of the
orchestration framework that we had
built so we really hit our stride when
we started using pip cat for voice AI
orchestration this is an open source
framework maintained by the guys that
daily it's really easy to extend and
hack upon which is important for our use
case when we need to do transfers and
stuff um and we make really long phone
calls these can be like an hour and a
half long so a big decision for us in
choosing pipat was that we can self-host
it and deploy it and scale it how we
want so with some of our like voice
With some of our voice orchestration headaches dealt with, we really wanted to get back to focusing on our conversations. Everything on this slide is really not unique to voice UI or AI, so I'll speed through it, but we've made two interesting decisions here. Because we just have an LLM at the backbone, we chose to own our own OpenAI-compatible endpoint; we find this makes for a better interface with a lot of the new voice AI tools, and behind our endpoint we can route to different models that are maybe more latency-sensitive. For all of our generative responses, we route through a tool called TensorZero. TensorZero is relatively new, and they have a nice framing of LLMs; if that interests you, I recommend looking them up and talking to them, they're awesome. It's a little open-source tool, so you can do whatever you want with it, and it gives us structured, typed LLM endpoints we can experiment with in production. So that's our gateway to our LLM. For all of our logging and observability we self-host Langfuse, and we self-host these things because these are healthcare calls and we have to be HIPAA compliant; self-hosting is often an easier way to deal with the rapid growth of this space. On top of that we do anomaly detection, evals, and datasets.
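As a sketch of what "owning your own OpenAI endpoint" can look like: a thin proxy that speaks the /v1/chat/completions shape and routes latency-sensitive traffic to a faster model. The model names, routing rule, and header handling here are assumptions for illustration, not SuperDial's setup.

```python
from fastapi import FastAPI, Request  # pip install fastapi httpx
import httpx

app = FastAPI()
FAST_MODEL, SMART_MODEL = "gpt-4o-mini", "gpt-4o"

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    # Route on a caller-supplied hint; live voice turns need low latency.
    latency_sensitive = body.pop("latency_sensitive", False)
    body["model"] = FAST_MODEL if latency_sensitive else SMART_MODEL
    async with httpx.AsyncClient(timeout=30.0) as client:
        upstream = await client.post(
            "https://api.openai.com/v1/chat/completions",
            json=body,
            headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder
        )
    return upstream.json()
```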
With a good plan in place for the LLM side of things, another big challenge is our text-to-speech system. When you make these sorts of phone calls, your password is basically your name, your date of birth, and your member ID, which is something like a 12-digit-long string of characters you have to be able to communicate over the phone. Something we quickly realized is that what the LLM outputs is not necessarily what we want to shove through the text-to-speech engine, and neither of those may match what's in the recording. A little example, and this is a personal last mile: if you're building me a personal voice UI application, it should say my last name correctly. My last name is pronounced "Kotakis", and most people, and most models, say it wrong. But with a lot of new tools out there, like the syntax the company Rime uses, you can spell out the exact pronunciations you want. And for things like spelling, where you have an intuition for the pauses and breaks you'd use to say a really long string, you can use something like a little spell function. Because all of this outputs audio bytes, we usually review recordings to make sure it all sounds okay, in addition to checking the transcripts.
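Here's a small sketch of that kind of pre-TTS text normalization; the `spell` helper is hypothetical, loosely inspired by (not copied from) vendor syntaxes like Rime's.

```python
import re

def spell(token: str, group: int = 3) -> str:
    """Break a long identifier into spaced, grouped characters so the TTS
    engine reads it with natural pauses."""
    chars = list(token)
    groups = [" ".join(chars[i:i + group]) for i in range(0, len(chars), group)]
    return ", ".join(groups)

def normalize_for_tts(text: str) -> str:
    # Spell out anything that looks like a long member ID.
    return re.sub(r"\b[A-Z0-9]{10,}\b", lambda m: spell(m.group()), text)

print(normalize_for_tts("The member ID is ABC123456789."))
# -> "The member ID is A B C, 1 2 3, 4 5 6, 7 8 9."
```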
To start wrapping up, I have a couple of little mini last-mile problems we've had to deal with. (Oh, and with voice-to-voice models, all this sort of rule-based stuff gets a little more complicated.) Some quick ones: we used to be called SuperBill, and we called our bot Billy because we thought that was a fun name. It turns out that's an awful name on the phone, because we would constantly have conversations where people said "hey, nice to meet you, Billy" and we'd have to correct how they said it. So think about your persona a lot, and dial it in early.
early uh if you're just starting don't
build from scratch what's going to make
your Bot unique is the conversation and
there's so many new tools out there like
pipe cat that you can use to get a quick
jump start track latency everywhere time
to First Bite for each of your little
processors
is the new most important metric and is
something you always kind of have to
keep an eye on uh upgrade paths this is
a big one for us when we need to make
sure we have really high transcription
accuracy so we use deep gram for our
speech DET text engine and we know that
whenever we kind of want to improve that
part of our system we can work with them
to fine-tune a better
Have fallbacks ready. It really sucks when OpenAI goes down for a little bit and all of a sudden all the concurrent conversations you have are just down the drain, so have a fallback ready for each part of your stack. It's really easy to set that up with something like TensorZero, and there are lots of other tools that'll help you figure that out.
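A minimal sketch of the fallback idea, assuming you've wrapped each provider in a callable; provider names and error handling here are illustrative, and a real gateway would add timeouts, health checks, and per-component policies.

```python
# Try providers in order; the first one that answers wins.
import logging

def complete_with_fallback(providers, prompt: str) -> str:
    """providers: ordered list of (name, callable) pairs."""
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as e:          # timeouts, 5xx, rate limits...
            logging.warning("provider %s failed: %s", name, e)
            last_error = e
    raise RuntimeError("all providers down") from last_error
```

The same shape applies to every stage of a voice pipeline: speech-to-text, the LLM, and text-to-speech each get their own ordered list.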
And then end-to-end testing. This is pretty unique to voice UI, or voice AI. It seems like people are settling on telephony as the boundary layer, testing your bot against an external service. We do a couple of different things. The easiest test for us is to create a fake phone number that just plays an MP3; if your bot can't talk to an MP3, you probably have bigger problems. Next, we can create a simulated voice tree with phone-tree-building tools and have our bot pseudo-navigate it. And then there are lots of generative services, like Coval, where you can have your bot talk to another bot.
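To show the shape of the "simulated voice tree" test, here's a rough sketch where the IVR is just a state machine and the bot passes if it presses its way to the right leaf. The tree contents and the `bot.respond` interface are assumptions for illustration; a real test would run over actual telephony.

```python
# Scripted phone tree as a state machine; the bot must reach "claims".
PHONE_TREE = {
    "root": {"prompt": "Press 1 for eligibility, 2 for claims.",
             "options": {"1": "eligibility", "2": "claims"}},
    "eligibility": {"prompt": "You have reached eligibility."},
    "claims": {"prompt": "You have reached claims."},
}

def run_tree_test(bot) -> bool:
    """Pass if the bot presses the right digits to reach the claims leaf."""
    node = "root"
    while "options" in PHONE_TREE[node]:
        digit = bot.respond(PHONE_TREE[node]["prompt"]).strip()
        if digit not in PHONE_TREE[node]["options"]:
            return False              # pressed a key the tree doesn't have
        node = PHONE_TREE[node]["options"][digit]
    return node == "claims"
```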
So, some takeaways for a quote-unquote vertical voice AI engineer. Choose your stack wisely: the better the decisions you make here, the more you can focus on the things that are truly unique to your conversational experience. Laser-focus on the last mile, because that's where you can ultimately provide a lot of value and put your agents to work. And then ride the wave: there's so much new stuff happening in this space, and whenever new models come out you want to be able to use them quickly, and you also want to be able to use them safely. So thank you very much. I'm excited to talk to you all and hear about what's so special about your conversations.
[Applause]
[Music]
Our next presenter is the head of Applied AI at Ramp, here to teach us how to scaffold our agents wisely. Please join me in welcoming to the stage Rahul Sengottuvelu.
[Applause]
All right, while we're getting set up: can anyone find the problem with this slide? Yeah, working on it... there we go. Nice. I think I'm the only presenter using Figma Slides, so I had to use my own laptop for it. Cool. So the problem here is that it's a bitter lesson, but lemons are sour. I only realized that about ten minutes ago, but I like the graphic on there.
A little bit about me: I'm head of Applied AI at Ramp. I've been working on LLMs for four years, which is kind of a long time, I guess, in LLM land; everything really started happening when ChatGPT came out. I was trying to build what people would now call an AI agent company back then. We were just doing customer support, trying to make our chatbot smarter, trying to figure out what models or what tech to use to get them to respond to customers better. We were messing with GPT-2 and BERT, and the models were so frustratingly stupid: the context windows were small, they were not very smart at reasoning, and it was just incredibly annoying. We wrote lots of code around those models to get them to work at least somewhat reliably.
Along the way, as models got smarter, we just had to delete more and more of that code, and I ended up seeing a lot of patterns in what code needs to get deleted, how to build agents, and which ways of building will scale with more intelligence. Clearly we're going to continue to get a lot more intelligence. I want to talk about a single idea throughout this talk, through various examples. We'll do some setting up, and I'll also have a bunch of demos to drive the point home, and maybe I can convince you that there's a certain way of building agents that's slightly better than other ways.
I also built a structured-extraction library called JSONformer. I think it was the first one; I'm not fully sure, but timing-wise it was before all the other major ones. That was also scaffolding around a model: models were too stupid to output JSON, and we were just really begging and pleading and forcing them to act the way we wanted.
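For flavor, here's a sketch of the kind of begging-and-pleading scaffolding that era required: a retry loop that demands JSON and validates it. This is an illustration of the pattern, not JSONformer's actual approach; JSONformer instead constrains generation token by token inside a fixed JSON skeleton, so it never has to retry.

```python
# GPT-2/early-GPT-3 era scaffolding: beg for JSON, validate, retry.
import json

def force_json(generate, prompt: str, max_retries: int = 5) -> dict:
    """generate: any callable that maps a prompt string to model text."""
    msg = prompt + "\nRespond with ONLY valid JSON."
    for _ in range(max_retries):
        raw = generate(msg)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            msg = (prompt + f"\nYour last output was not valid JSON:\n{raw}\n"
                   "Respond with ONLY valid JSON.")
    raise ValueError("model never produced valid JSON")
```

As models got better at instruction following and structured output, almost all code like this could be deleted, which is exactly the pattern the talk is about.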
So, as I said earlier, I have just one core agenda item, which is to convey one idea. We'll start with the essay "The Bitter Lesson," which all of you have probably read, and quickly go through what it says. Then we'll go through a production agent we have at Ramp and three different ways of architecting it, and then I have a demo to really push how we think about how software and backends will work in the future.
Very simply, the idea is that systems that scale with compute beat systems that don't. If you have two systems and, without any extra effort, one of them can just think more or use more compute in some way, that one wins; the losers tend to be systems that are rigid, fixed, and deterministic. From that idea it's pretty clear that if you're building systems, you might as well build the ones that improve with more compute. This seems like an obvious conclusion from the Bitter Lesson. Taking it a step further, why is this true? Because exponentials are rare; most things in the world aren't exponential. So when you find one, you should just hop on, strap in, take the free pass, and go along for the ride, and you probably shouldn't try too hard to out-clever it. There are a lot of examples from history that reflect this.
For chess, Go, computer vision, Atari games: people tried to build lots of systems and wrote a lot of code. My picture of rigid systems is spending a lot of time grinding weekends writing very clever, well-abstracted software, maybe trying to synthesize human reasoning and thought processes into features, and then using those in clever ways to approximate how a human would think. If you fix the amount of compute, that approach will win. But it turns out that if you end up scaling how much search you're doing, the general method always ends up winning, in all these cases: Atari, Go, and computer vision.
A little bit about Ramp: Ramp is a finance platform that helps businesses manage expenses, payments, procurement, travel, and bookkeeping more efficiently. We have a ton of AI across the product to automate a lot of the boring stuff that finance teams and employees do: submitting expense reports, booking flights and hotels, submitting reimbursements, all of that. A lot of the work behind the scenes is interacting with other systems, dealing with legacy systems, and helping employees get their work done faster.
So let's actually talk through one of the systems we have today at Ramp, through the different versions of it and how it evolved over time. We're going to talk about something called the switching report. It's a very simple agent: all it needs to do is take a CSV in an arbitrary format, and the schema could be seriously anything from the internet. These CSVs come from third-party card providers: when people onboard to Ramp, we want to give them a nice checklist and say, hey, here are all the transactions you have on other platforms, and we want to help you move them over. The more transactions come onto Ramp, the more we can help you, the more you'll use our software, and the more everyone benefits. So the switching report is really just a checklist, but to read people's CSV transactions we need to understand them, and other platforms have all kinds of crazy schemas. The description of the problem is: for an arbitrary CSV, how can we support parsing it into some format that we understand?
understand so let's just start with the
the simple approach right is like let's
just take the 50 most common third party
card vendors um and just manually write
code for all of them and obviously like
this this will just work it is some work
not a lot of work but you still have to
maybe go to 50 different platforms and
download their csvs see what schemas
they have and then write code maybe if
they decide one day they change their
format your thing will break but that's
okay you'll get page and you can wake up
and go fix
Now let's introduce some LLMs. Instead of the over-engineered code where you end up writing 100,000 lines, we want a more general system, so let's add a little bit of LLM, a little bit of AI, into the deterministic flow. In classical scripting land, we add some calls to OpenAI, or an embedding model to do semantic similarity or something like that. Then we take every column in the incoming CSV and try to classify what kind of column it is: is it a date, a transaction amount, a merchant name, or the user's name? We map the columns over, and we can probably end up in a schema we're happy with. Most of the compute is still running in classical land, and some of it is running in fuzzy LLM land, but this is starting to look like a more general system.
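A minimal sketch of this constrained middle approach: classify each column with one LLM call, then rename into a known schema. The target field names and the `llm` callable are illustrative assumptions, not Ramp's actual code.

```python
# Constrained approach: classical pipeline, with one fuzzy step per column.
import pandas as pd

TARGET_FIELDS = ["date", "amount", "merchant", "cardholder_name", "memo"]

def classify_column(llm, name: str, samples: list[str]) -> str:
    prompt = (f"Column name: {name}\nSample values: {samples[:5]}\n"
              f"Which of {TARGET_FIELDS} is this? Answer with one word.")
    answer = llm(prompt).strip().lower()
    return answer if answer in TARGET_FIELDS else "unknown"

def map_csv(llm, path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    mapping = {col: classify_column(llm, col, df[col].astype(str).tolist())
               for col in df.columns}
    keep = {col: field for col, field in mapping.items() if field != "unknown"}
    return df[list(keep)].rename(columns=keep)
```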
Let's go with maybe a different approach, where we just go all the way. Let's literally give the CSV to the LLM and say: you have a code interpreter, so you can write whatever code you want, pandas or one of the faster Rust-based libraries; you have all these Python packages; you're allowed to look at the head of the CSV, the tail, whichever rows you want; I want you to give me back a CSV in this specific format; and here's a unit test, a verifier, you can use to tell whether it's working. It turns out this approach doesn't actually work if you only run it once; we tried it. But if you run it 50 times in parallel, it's very likely that it works really well and generalizes across a ton of different formats. The amount of compute here is probably something like 10,000 times more than the first approach. But what is truly scarce in the world is engineer time, at least today, and we'd rather have a system that works really well. Even at 10,000 times more compute it will probably cost less than a dollar, and every failed CSV, every transaction that doesn't get switched over, costs Ramp way more money than whatever we spend on this exact architecture.
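Here's a rough sketch of that run-it-50-times pattern: fan out generation attempts in parallel and keep the first candidate that passes the verifier. The sandbox is stubbed and the verifier is assumed; model-written code should never be executed outside an isolated sandbox.

```python
# Third approach, sketched: parallel generate-and-verify.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_in_sandbox(code: str) -> str:
    """Stub: execute model-written code in an isolated sandbox and return
    the CSV it produces. Never exec untrusted code in-process."""
    raise NotImplementedError

def one_attempt(llm, csv_head: str, verifier) -> str | None:
    code = llm(f"Write Python that converts this CSV to our schema.\n"
               f"First rows:\n{csv_head}\nReturn only code.")
    try:
        output_csv = run_in_sandbox(code)
        return output_csv if verifier(output_csv) else None
    except Exception:
        return None                       # a failed attempt just drops out

def parse_arbitrary_csv(llm, csv_head: str, verifier, n: int = 50):
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(one_attempt, llm, csv_head, verifier)
                   for _ in range(n)]
        for fut in as_completed(futures):
            if (result := fut.result()) is not None:
                return result             # first verified winner
    raise RuntimeError("no attempt passed the verifier")
```

The verifier is what makes the brute-force fan-out safe: you're buying reliability with compute, and the unit test decides which of the 50 fuzzy attempts counts.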
So this is a very specific example: how does this apply to the agents we all build and the systems we're all working on? It turns out something like this actually generalizes. Look at the three approaches, and assume the black arrows are classical compute and the blue arrows are fuzzy land: the request goes into a neural net, all sorts of weird matrix multiplication happens, we're in latent space, it gets all alien-intelligency, and then it comes back to classical land. In the first approach there was no AI: we just wrote code and it mostly worked. In the constrained agent, the second approach, we broke into fuzzy land from classical land when we decided we wanted similarity scores or something like that. The third approach is actually flipped: the LLM decides when it needs to go into classical land, writes some pandas or Python code, and breaks into classical land when it needs to, but most of the compute is fuzzy.
Actually, this is maybe not the most accurate graph: since I proposed that we run it 50 times, it looks more like this. But if you look at a backend in general, they're all request-response. Some message goes in, a post or a get or an update, any sort of CRUD operation, and we're really just asking the backend to take this piece of information, do whatever it must with it, run whatever mutations it wants, and return a response. Almost all the systems we've built so far, as humanity I guess, look like the first one. More people are using OpenAI now; OpenAI makes billions of dollars, and probably a lot of the systems that use them look like number two, where regular programming languages are calling into OpenAI's servers and running some fuzzy compute. In more and more parts of the Ramp codebase, we're moving to the third approach, because it just tends to work well. All of the blue arrows improve on their own: if you did absolutely nothing, if we all went on vacation for the next year, the big labs would still be working and spending billions of dollars making those models better. The blue arrows will get better, so how much blue arrow you're using in your codebase directly helps your company without much effort from your end. This is what I was saying: the Bitter Lesson is so powerful, and exponential trends are so powerful, that you can just hitch a ride.
Let's take this idea further, all the way, to something crazy. On the left you'll see a traditional web app. Usually the way it works is you open gmail.com, and a static file server at Google sends you a bundle of JavaScript, HTML, and CSS. Your browser renders that into some nice, user-friendly UI; maybe you see some emails and click on one of them. The frontend makes a request to the backend, asking for the content of the email with whatever ID it is; the backend hits the database and gives you the result. Maybe they used codegen, all the codegen tools available, to build Gmail, but the LLM was only involved while the software engineer was writing the code. Once the code is written and pushed to production, it's just classical compute. On the right I'm proposing a different model, where the backend is the LLM. It's not codegen: the LLM is doing the execution; it is the backend. The LLM has access to tools like a code interpreter, through that it can make network requests, and it also has access to a database.
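A minimal sketch of the loop this implies, with the tool wiring (Gmail token, code interpreter) elided and all names hypothetical: the model holds the whole session, renders the UI as markdown, and every click is just another message appended to the chat.

```python
# "LLM as backend" session loop, sketched.
SYSTEM = ("You are simulating a Gmail client. You have a Gmail API token and "
          "a code interpreter. Render the UI for the current state as "
          "markdown, with an ID on every clickable element.")

def session_loop(llm, get_click):
    """llm: chat-history -> markdown; get_click: markdown -> clicked ID."""
    history = [{"role": "system", "content": SYSTEM},
               {"role": "user", "content": "Render the inbox homepage."}]
    while True:
        ui_markdown = llm(history)            # model may call tools here
        history.append({"role": "assistant", "content": ui_markdown})
        clicked_id = get_click(ui_markdown)   # browser reports the click
        history.append({"role": "user",
                        "content": f"The user clicked element {clicked_id}. "
                                   "Re-render the page."})
```

There is no route table and no handler code: the "re-render on click" behavior a web framework would implement in classical compute is left entirely to the model.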
I actually have a mail client that works on this principle, and this is my test email address. If you all want to see your emails show up, you can send me an email in the next minute or so, but please be nice.
All right, I think that's probably enough time, so I'm going to go over. We have this email client; we still have some regular JavaScript to hook the LLM into the browser, but when I log in with my email... oh, it's probably... okay, we're good, we're saved. Thankfully I have a room full of engineers.

The reason it's so slow is that when I open this page and log into Gmail, the Gmail token is actually being sent to an LLM. This is literally an LLM chat session. What we're seeing on screen is: hey LLM, you're actually simulating a Gmail client; you have access to all the emails, to Rahul's Gmail token, and to a code interpreter; so render some UI based on what you think is reasonable for the homepage of a Gmail client. It looks like it decided to render as markdown (I think we actually tell it to render as markdown), and it's rendering all the emails a bunch of people just sent me from here. Looks like one says "hello from California," so I'm going to click on that. When I click, we're not running any backend calls or anything like that; we're just telling the LLM that the user clicked on that piece of text, in this case "hello from California" and the ID number. The LLM now has the information on what the user clicked and has the chance to re-render the page, much like a web framework would. So again it goes back, and it probably hits a get request for that specific email and pulls the body.

"What is this agent going to do? I'm watching you live." So the LLM just decided this is the appropriate UI for a Gmail client. I also have other features the LLM thought were reasonable: looks like I can mark as unread, or delete the email. Maybe I'll delete it, because it's not that good of an email; I'm sorry. It is very slow, because we're doing a lot, but I wanted to push you in this direction even though this kind of software barely works.

Dang, I guess not. I also clicked on it, and now the LLM is trying to do something with that click. But anyway: this kind of software barely works today, and that doesn't mean it won't work in the future. With exponential trends, this might just take off. So I just wanted to push you all to think in this direction. Will more software look like this? I don't know; we'll see. Thank you.
[Music]
Ladies and gentlemen, please welcome back to the stage the MC for the AI Engineer Summit agent engineering day, the founder and CEO of Superintelligent, NLW.
[Music]
All right guys, thank you Rahul and everyone else who presented. You know, the theme of this whole event is agents at work, and one of the things we called out this morning is that what makes it so different from events we've had in the past is how much this is about real-world happenings: what's actually being built, the challenges we're facing in deployment and in production. I think this session was a great example of that. We are headed into another break now, about 30 minutes. A quick reminder: you can go meet speakers in the Q&A lounge, or you can check out the sponsor Expo; there is also coffee and snacks down there. So see you in about half an hour. That concludes this session; please enjoy one final break in the Expo with sponsor demos, food and drinks, and a special panel on the Expo stage with
[Music]
Please welcome to the stage the MC for the AI Engineer Summit agent engineering day, the founder of Turing Post, Ksenia.
[Music]
Hello! So this is our last stretch of sessions. It's been such a long day, and I'm still excited. This is going to be three sessions; there's one change, but you don't want to miss these sessions, they're amazing. The first one is creating agents that co-create. The second one is about education (this is the change); it's called The Next AI Engineers. And the last one is: what does it take to build a personal, local, private AI agent that augments you deeply? It's all about building an AI that is truly integrated into our lives, now from a very, very early age. With that, please help me welcome our next speaker. She has very interesting experience working on both ChatGPT and Claude: please welcome Karina Nguyen.
[Applause]
Hey everyone, my name is Karina, and I'm an AI researcher at OpenAI. Before that I worked at Anthropic for about two years on Claude. Today I'd love to chat about the scaling paradigms that have happened in AI research over the past two to four years, and how those paradigms unlocked new frontier product research. I'm also going to share some vignettes and lessons learned from developing the Claude and ChatGPT products, some design challenges and lessons, and how I think about the future of agents as they go from collaborators to co-innovators. I'd also love to invite you to engage in the conversation, so I'd be more than happy to answer some questions at the end.
Cool. I'm not sure, but probably the majority of you know this: there are two scaling paradigms that have happened in AI research over the past few years. The first paradigm is next-token prediction, which you may have heard called pre-training. What's really amazing about next-token prediction is that it's a world-building machine: the model learns to understand the world by predicting the next word. Fundamentally, I think this happens because certain sequences are caused by initial actions and are irreversible, so the model learns some of the physics of the world in order to understand them. And the token can be anything: the tokens we pre-train on are strings, words, pixels; it could be anything. So to predict what will happen next, the model needs to understand how the world works, and this is why pre-training worked.
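For concreteness, here is the pre-training objective in a few lines of PyTorch; this assumes a model that maps token ids to logits, and is only a minimal sketch of the loss being described, not any particular lab's training code.

```python
# Next-token prediction in miniature: shift the sequence by one and score
# the model's distribution over each next token with cross-entropy.
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, seq_len) integer ids; model returns per-token logits."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict t+1 from <= t
    logits = model(inputs)                           # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```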
You can imagine next-token prediction as massive multitask learning. During pre-training, some tasks are really easy to learn, such as translation (the French word for a given English word), and the model also learns a lot about the world: the capital of France is Paris. Because some information is much more present on the internet and in knowledge artifacts, the model has a much easier time learning it. But the reason compute is so important, and scaling compute in the pre-training stage is so, so important, is that there's a class of tasks that is really, really hard to learn. The model learns a lot about physics; it learns a lot about problem solving, generation, and logical expressions; it learns some spatial reasoning, although it's not perfect. But for tasks like math, the complexity of what the model has to compute during next-token prediction is really high. That's why you need chain of thought, spending more compute on a chain of thought to help the model reason through such computational tasks.
Another class of tasks I was thinking a lot about is creative writing. It's actually really, really hard, and the reason it's so hard for the model is that you can predict the style of the writing very nicely, but a lot of creative writing is actually world building, storytelling, and plot, and it's much, much easier for the model to make a next-token mistake that completely deteriorates plot coherence, which is really important for stories. Creative writing is an open-ended research problem in itself, because it's really hard to measure what is and isn't good creative writing. Obviously we would love for models to invent new forms of writing and be extremely creative in their generations, but this is actually one of the hardest AI research problems today: how do we get models to write novels and keep stories coherent over a long period of time?
I think the era of 2020 to 2021 was the year of scaling pre-training, both at Anthropic and at OpenAI, and one of the first products at that time was GitHub Copilot. I thought autocomplete was a completely fascinating product: the model had learned so much about code through next-token prediction over billions of code tokens from GitHub, open-source projects, and so on. What happened for autocomplete, the tab-tab in Cursor or GitHub Copilot, is that researchers constrained it via RLHF (reinforcement learning from human feedback) and RLAIF (reinforcement learning from AI feedback) to make it quite a bit more useful. This is where the era of post-training took off. In post-training we teach the model how to complete function bodies, understand docstrings, generate multi-line completions, and predict and apply the next diffs. I think we're still in that era: there is so much more to be explored in the post-training stage of RLHF and RLAIF to push the capabilities of models to reason through complex codebases.
The next paradigm in AI research happened last year, published by OpenAI with the new model o1: scaling reinforcement learning on chain of thought. This is why we call these models highly complex reasoners. You spend a lot more test-time compute, and more training, to scale reinforcement learning, and the reason it works is that the model learns how to think during training, learning from feedback with really good signals in the RL. On the left you can see the output of the normal, OG GPT-4, and on the right you can see the entire chain of thought the model produced to solve a complex problem. As we target harder and harder tasks, if you want the model to go from translation to solving medical problems, you actually need the model to spend a lot of time just thinking through the problem, and you need to create more complex environments, with tools, where it can think things through and verify its outputs during the chain of thought. The chain itself is very interesting, and the model has certain words it tends to use, but there's a lot of science still to be done on faithfulness of the chain of thought: how do we measure faithfulness, what happens if the model goes in the wrong direction, can it backtrack itself? There's a lot of science around that, and we're only at the beginning of it.
One of the first projects I did at OpenAI was about this: the interaction paradigm is very different now. The model thinks for a long time to solve a problem if the problem is hard, so how do we create a new interaction paradigm with humans such that they don't have to wait 15 seconds, or 30 minutes, for the model to come back? One simple approach we took was streaming the model's thoughts to the user; we had to figure out exactly how to summarize the model's thoughts and communicate them wisely to a human. I think it's still one of the design challenges: as model capabilities and interaction paradigms change, you get new design challenges to solve for these types of models.
And I guess this year, at OpenAI, is the year of agents. The way we think about it: agents are highly complex reasoners, models trained with RL on chain of thought, using real-world tools such as browsing, search, and computer use, over a long horizon and a long context. But what's the next stage? In my view, the next level is co-innovators: agents built upon everything we've done with reasoning, tool use, and long context, plus creativity, and creativity is enabled only through human-AI collaboration. This is what I'm really, really excited about for the future: creating new affordances for humans to collaborate better with AI, so that we can co-create the future we want.
Those two scaling paradigms in AI research have unlocked a new kind of product research. You might imagine product research as: we have an API for the model and now we have to integrate it into products. But what's actually happening on the ground is that we now have a very nice, rapid iteration cycle for product development, because we can distill the highly capable reasoning models back into smaller models that we can iterate on very fast, and we can use those highly complex reasoning models to synthetically generate new data, so that we can create new post-training datasets and new reinforcement learning environments.

One of the things we can do is create completely new classes of tasks. If the task is multiplayer collaboration between humans and AI, you might want to simulate different users: synthetically generate datasets conditioned on different users and train on that. It really depends on what kind of product experiences you want to create; you extrapolate that into a new class of tasks to distill into the models. I think we're also moving toward more complex reinforcement learning environments, which means letting models use search, browsing, or much more collaborative tools like Canvas during RL, so they can learn how to become better at collaborating. We can leverage in-context learning: models are extremely good at it, so you can essentially create a new tool and the model will learn it from just a few-shot examples, which is an extremely rapid iteration cycle for any developer. As I mentioned before, synthetic data and distillation are another piece, and I think we can also invent new model behaviors and interactions to utilize user feedback.
Now let's go through some vignettes, starting from Anthropic. The first concept I learned is how to bring an unfamiliar capability into a familiar form factor. The reason 100K context was successful is that file uploads are an extremely familiar form factor; everybody works with documents. You can imagine we could have shipped 100K context via infinite chats, one huge long chat you interact with, but finding the simplest form factor for an unfamiliar capability is one of the design challenges of this new era.
The second project I worked on at OpenAI is called ChatGPT Tasks, and I didn't realize this until it shipped: reminders and scheduled tasks are a very familiar thing people do almost every day, but what's amazing about this product is that it scales with new model capabilities. ChatGPT Tasks is not just scheduled reminders and to-do lists: you can ask the model to continue a story for you every day, or to search for everything you're interested in every day or every other day. In a way you can even help yourself learn a new language through the extremely multimodal and interactive visualizations that ChatGPT creates. So the concept that your product feature should enable modular compositions, which will scale very nicely as models develop much higher capabilities, is something I learned by doing ChatGPT Tasks.
Another design challenge we have is how to bridge real-time interaction with models and asynchronous task completion, where we can ask the model to go off for ten hours to research or write code and then come back with a solution. The bottleneck here is trust, and I believe trust can be built by giving humans new collaborative affordances to verify and edit model outputs, and by having them give models real-time feedback so the model can self-improve. One of the first products from Anthropic was actually Claude in Slack, the first attempt at a virtual teammate in an organization. It was an amazing concept, because Slack had all the affordances: tools, image uploads, multiplayer collaboration. I think there's still a lot we can do here, taking the lessons from Claude in Slack into the next generation of products. ChatGPT Tasks was also very much inspired by the Claude-in-Slack prototypes, where Claude could summarize Slack channels across the organization every Friday and post summaries for everybody.
My first project at OpenAI was Canvas, and I thought human collaborative affordances could scale and create new creative capabilities. What I really loved about Canvas, and the way we operated as a team, is that it's an extremely flexible interface. Here are some vignettes: the canvas itself can become a co-creator and co-editor, with very fine-grained editing interactions; the model can also search in order to generate a report, and then you can ask it to verify that output. And you can imagine this interface scaling to multiplayer, where other people can join your document, or even to multi-agents, where I can create a model critic or editor, so you have multi-agentic and multiplayer collaboration at the same time. That's a new design challenge we need to navigate: how do we do that?
I'm also excited about personalized tutors. Models are becoming so multimodal and flexible that you can learn new things in the way you like: if I'm a visual learner and you're more of an auditory learner, the model can adapt to your personalization. One thing I did yesterday: I was on the plane, and I used Canvas to create a game for me, so I really like this generative entertainment on the fly. Anyone can create their own tools and web apps now, and I'm not sure what the future will look like, but I think it would be extremely amazing if a person who never had to touch code in their life can, for the first time, create the tool they really wanted and deploy it for themselves, or start a business from scratch. There's something around pair programming and code creation that we can use to create the future we want. Canvas has also become more of a pair programmer: the reason Canvas is so flexible is that it was trained to be collaborative for both writing and coding, and it has tools such as search, so it can look up API documentation. It can become a data scientist too: upload an entire CSV and it can generate real-time analysis.
And finally, what I'm really excited about, what everybody in AI is excited about, is how we actually help models become better at research and at creating new knowledge. Here you can see the model and the human co-creating a document, co-creating a new artifact that has never existed before. Here's a demo with a paper I co-published, where I ask the model to reproduce it. You can imagine this is one of the most common tasks in research, reproducing a paper, or reproducing an open-source GitHub repo. You get this very nice interactive paradigm where, because the model can also leverage its own internal knowledge, you and the AI can work together to come up with new research hypotheses and verify certain research directions together, and you can also delegate tasks to an AI assistant.
Finally, what I'm really excited about for the future is that there will be this layer of invisible software creation for everyone, especially from mobile, where people can just create their own software tools. I think the way you interact with AI fundamentally changes the way you access the internet. My prediction is that you will click less and less on internet links, and you will access the internet through the model's lens, which will be much cleaner and much more personalized. You can imagine highly personalized multimodal outputs: if I say I want to learn more about the solar system, instead of giving me a text output, it should give me a three.js interactive visualization of the solar system, with richly interactive features to learn more. I think there will be this cool future of generative entertainment on the fly, where people learn and share new games with other people.
The way I think about it, the interface to AGI is a blank canvas that self-morphs into your intent. For example, if you come to work today and your intention is just to write code, the canvas becomes more of an IDE, like Cursor or another coding IDE, although the future of programming might change. Or if you're a writer and you've decided to write a novel together, the model can start creating tools on the fly for you, so it's much easier to brainstorm, edit the writing, create character plots, and visualize the structure of the plot itself. And finally, I think co-innovation is going to happen through creative co-direction with the models themselves, through collaboration with highly reasoning agentic systems that will be extremely capable at superhuman tasks, to create new novels, films, games, and essentially new science, new knowledge creation. Cool, thank you so much; I think that's the end of my talk.
[Applause]
[Music]
The next generation of AI engineers will be learning more and more from language models and agents like me. Here to provide a glimpse into the future of educating the next generation of AI engineers is Stefania Druga, research scientist from Google.
[Applause]
Thank you, thank you so much. Hello New York, how are you doing? Okay. So I know you've been hearing a lot of amazing talks, and before I get started I wanted a quick show of hands: how many of you are here for the first time at an AI engineering conference? Wow, that's amazing. How many of you are international, from outside the US? Incredible, thank you for coming.
So today I'm going to talk about how we open up this conference, and this knowledge, to the next generation of AI engineers, who are young and start much earlier. Why does that matter? Because 70% of AI users are actually from Gen Z, and we've seen the potential of multimodal AI to transform education. We know students are using generative AI tools for their homework, and I think they can use them in much more interesting ways; we also know that in many cases they prefer them to human tutors. And I want kids to actually be among the engineers and designers who create the tools they use.
So I turned to Scratch. I don't know how many of you know Scratch: wow. There are over 100 million children using Scratch worldwide. Scratch is a platform, free and open source, for coding for kids. It's visual programming, developed at MIT, and I was part of that lab during my master's. In 2015 I started working on Cognimates, which extends Scratch to let children learn about AI by building games, training their own AI models, and programming hardware as well. It has these blocks that look like Lego blocks, and kids put them together to create their programs.
Does anyone want to guess what this program does? Any guesses? Tiana here, who was nine at the time, wrote this program for the first time and then played with the robot for half an hour, which was super fun. It's a hide-and-seek game: she would run around the room, the robot would turn around, and every time it detected a person, when the number of people was higher than zero, it would say "I see you." There is a problem with the program, though; can you spot it? How many times are we going to play if we have it like this? Yeah, so what do we need? We need a loop, exactly, awesome.
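For readers following along without the slide, here's a rough Python rendering of the fix; the robot and camera helpers are stand-ins for the Scratch blocks, not real APIs.

```python
# Tiana's hide-and-seek, with the missing loop added.
def hide_and_seek(robot, camera):
    while True:                        # the fix: keep playing, don't run once
        robot.turn_around()
        if camera.count_people() > 0:  # a person was detected
            robot.say("I see you!")
```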
The time to fun, the time to play, is very short in Scratch, which is why I used it when I built Cognimates. It has this library of blocks, the coding area, and a stage, and I'm going to show you a quick video of how kids use it in order to learn more about AI.
"So we were programming robots; you could play rock-paper-scissors. You did rock-paper-scissors into the camera, and on 'shoot' you did one of the motions and the camera did one of the motions, and it's like rock, paper, scissors, shoot." "The computer gets better as you play the game, because, like us, we might not know everything at first, but if we keep trying we get better." "Everyone has heard about machine learning or artificial intelligence, and there was a certain amount of no-questions-asked from a lot of the more tech-savvy parents: go for it, technology is going to be a huge part of their lives, much more so than my life." "If this AI technology is scary for some people, I totally get it, but as a parent and as a teacher I thought it was really important, because these are skills that 21st-century kids need to have." "When my dad was young, he bought a car and took it apart to see how it worked. So you teach people that young how these things, that grown-ups mostly program, work."
We did this in 2015 and 2016; these AI engineers started very early. At the time they were training custom models, basic classification models for images and text, but they also had access to the entire library of extensions we built for them. They could use off-the-shelf sentiment analysis and image classification; they could program their voice assistants, because they realized voice assistants were really limited, so they programmed them to remember things about them and their preferences; and they could program micro:bits and robots. This is what the training page looks like: they can drag and drop examples of images. For example, one kid wanted to make a game about unicorns and narwhals, so this is what his training data looked like, and this is what the program looked like: he can choose his custom model, unicorns versus narwhals, then show different drawings to the camera and see what the model predicts.
Moreover, they can see the confidence level of the prediction, which really helps them understand: okay, in this case it guessed that my drawing was a narwhal, but the confidence level is very, very low; what do I need to do? I'll go back to my training and add more examples of hand-drawn images, because most of my images are cartoons. So it really builds AI literacy and data literacy, and it demystifies everything kids learn or hear about AI, that it's an evil Terminator or all sorts of things.
domains so it was a project where they
could look at things in their food by
transforming a webcam into a microscope
they could build games like the rock
paper scissors that you've seen in the
video or a literature program uh where
you speak and it analyzes what you say
and uh it's seeing if it's in the style
of a famous writer or uh different other
types of styles and what's important
like I tested this with kids in public
private uh schools and community centers
and what I found is that kids are
actually like little scientists so if we
give them the right tools they engage in
the scientific process they formulate
hypothesis about how the model works or
how the robot works then they test those
hypothesis and they refine their
understanding in the process and it's
the same with the model training so we
need to create tools that enable them to
engage in the scientific process as fast
as possible and create tools that are
fun and sticky that they want to
And why does this matter? Before letting them program with Cognimates and train their own models, I asked them questions about voice assistants, smart toys, and smart robots at the time: do you think it's smart, do you trust it, do you like it, is it friendly? I asked the same questions about those technologies at the end of the study (we didn't have ChatGPT or Gemini at the time), and what I found was a significant difference in intelligence attribution after they went through this process of learning how to train a model, how to program it, and why the data matters. So it does make a difference in demystifying the intelligence we talk about. This platform is used around the world; it's actually translated into 50 languages.
After working on this, I realized it's not just the kids. The pandemic came, a lot of young people were stuck at home, and I had to really think about how to design for and work with families. So I started doing a lot of other experiments to figure out what kinds of tutors, games, or platforms families could use when they're at home and maybe want to learn how to code with their kids. I'll show you an early prototype: "Hi there, I would like to know your name, so let's write a program that allows me to learn it. Let's start with the green flag block. There you go, you did it! Now I need you to help me ask a question; for that we'll need the ask block. See if you can find it."
Awesome. In this case they're learning how to program a robot, a Jibo robot, and the robot itself participates in the process, so you're having this reflective conversation with the thing you're programming, which is pretty cool; we could do much more than that now. But because not everyone can afford to buy a robot, and it shouldn't be a required thing, I wanted to build something similar, a pair companion for programming that works in the browser.
During the pandemic I ran design studies with families in ten different US states, from very different backgrounds and very different ethnicities, which is very important and I want to highlight it. Before building the system, I mocked the system: we didn't have a functional copilot or a functional assistant, I was the AI. It was a Wizard of Oz study; the kids did not know initially (we told them afterwards) that they were interacting with a person via chat. I really wanted to understand what they want, what kinds of support they need from a pair programmer when they code in Scratch with their parents. And what I found is that they really want to generate coding ideas. They don't want the copilot in Scratch or in Cognimates to do everything for them; they want to brainstorm: "what if I want to do a game about bears," or "I'm into soccer, give me some ideas." That was a really big one.
Here are some quotes. One of our participants, who was 12, said most people would like coding with AI friends, because one of the hardest parts of a project is when you start and you run into a wall, out of ideas; so it really helps with the ideation process. It also helped them express and elaborate their ideas: if they were building a Pong game, it would ask things like, okay, how do we make the ball move, or how do you make it move faster? So it was very helpful in that regard as well, and it supported their creative coding identity. Another quote: "I like it because sometimes when you code it gets frustrating; when you finally get it to work, it's so good that it lets you feel good. It's good when you have someone that says to you, good job." What's interesting is that the kids participating in the study spent double the amount of time programming that they normally would; I got that from the parents. It didn't always work: sometimes the parents were needed in the loop, if it was too distracting or it couldn't moderate turn-taking between siblings. Agents have limitations. It also wasn't always able to explain the most complex concepts, like "clone" in Scratch, where you create multiple instances of the same object, or "broadcast"; those were harder to explain.
After this first design study, having identified the core features that kids and parents want, I created an evaluation benchmark: over 100 cases of Scratch programs that I ran against different state-of-the-art models to see how good they are at explaining Scratch code, explaining it with learning exercises, debugging it, and generating ideas. The results were very promising.
So the next thing was to build it, and this is the first time I'm showing these results; I just finished running the study, so it's very fresh, hot off the press. I tested this Cognimates copilot with 18 young people from 11 different countries, in different languages. It's very simple: the code editor plus an AI chat, sending messages to a web server and getting responses. I'm using and evaluating different models, including a model fine-tuned on Scratch projects, and it can also generate assets, art like images, for their games. Here's an example from a session with a kid from Mexico; I speak many languages, so I could do sessions in different languages, and so does the copilot. Analyzing these sessions from the 18 kids, I found the copilot provided all sorts of support: conceptual support, design support, positive encouragement, platform navigation. There were instances where it failed, and I also saw lots of instances where the kids refused the suggested help or the suggested ideas.
Here's an example of code support. This was a student from Jamaica who had never programmed in Scratch before; he went from zero to a fully functional program with the support of the copilot. The copilot was very helpful not only in giving him ideas but also in helping him understand how to navigate the platform for the very first time: this is where you find the loop block, this is how you create variables. It was very helpful for people who were new, and for people who were advanced it was also helpful because it would generate assets they really liked, give them ideas for how to refactor (it didn't call it refactoring, just "improve the code"), or how to add new features or new levels. And it was interesting, because kids would use it in a lot of ways I didn't predict: besides generating the background in a game, they wanted ideas for character names and plots.
This is a great example, where a student pastes an image of the code and asks for an idea, and he actually doesn't like the answer, so he says, "I don't want to do that type of movement," and the copilot says, "no worries, it's your game, have fun." So we want tools that prioritize and encourage young people's agency in this process.
Some lessons learned: prioritize users' agency, and balance the support and the challenge we give them. By default the copilot does not give the answer initially; it asks questions, and only if the student is really stuck, after asking the same question three times, does it give a hint. See these agents as motivators and starting points: like the blank-page effect when you start to write; in Scratch it's called the cold start. We see a lot of students who go to the platform and really don't know where to start, and it's very helpful for that too. We learned that it was important to allow flexibility and customization: I had kids who really wanted to use voice and other kids who did not, who just wanted to type; I had kids who always wanted three ideas and others who said one idea is good. Everyone wanted different things. And design agents that support creativity in situ: a lot of participants told me they want the agent to be able to go and move the blocks with them, and they want to see simulations of agents programming: how would an agent build a Pac-Man game, or how would five agents collaborate to build this asteroid game. Or, when it generates the assets, instead of giving them to me in the chat, put them directly on the stage.
So the next phase of the prototype is to actually hack the OS entirely and integrate the agent at all the different stages of the UI, and then to support multimodal AI capabilities, including sound generation and maybe reacting to the camera stream. And the part that's very important to me: when it doesn't work, or cannot do what they ask of it, it should tell them why. It should say, I cannot generate this type of image because my training set does not include it, or I can only give you answers about Scratch because that's what my prompt is. It should be really transparent and explain its limitations, in order to set the right expectations.
So I'm going to show you a quick demo; let's see if it works. This is just a prototype, and I learned I was giving this talk two hours ago, so please be kind. Oh, oops... there we go, thank you. Okay, so when I reload, it looks like this. If you come here and you don't know what it can or can't do, you can just say hi, and it asks: what do you want to work on in Scratch today? What should I say? A racing game. "I love that, a racing game." Give me some ideas. "Turbo boost, extra speed." That sounds great. The cool thing is that it actually integrates with Scratch, so I can go and get any project from Scratch; people build really crazy stuff on Scratch. Let's say they built some OCR programs; this is a bit slow, but people are actually building OCR implementations in Scratch, and if I find a project I like, I can load it into my Scratch. I have one downloaded, so I'm just going to load that. First let's see how it works: I can draw any number here and ask it to recognize it; okay, it thinks it's a two, and an eight. But this is a pretty complex program, so if I don't know how something works, I can take a screenshot (oh, I did not expect to have two screens), attach it to the chat, and it will explain what the code does... let's see if I can get it fast enough. Yeah, I can't get the screenshot into the chat, but you get the idea. So that's kind of what the program is; let me go back to this.
It's free and open source, like I mentioned; there are a lot of new features coming, and I hope you can contribute, give us feedback, or share it with your young friends. And the reason this matters is that AI literacy is now actually part of the law. I don't know if you know, but this was passed earlier this month as part of the EU AI Act, and it basically says that all providers and deployers of AI systems should take measures to ensure, to the best of their extent, a sufficient level of AI literacy among their staff, but also among the users of their products. So to ensure this AI literacy, we need to start early, and that's what I'm hoping and trying to do with my work. If you want to learn more, there are lots of papers and studies about AI literacy and AI education, and work done in other domains, math misconceptions and science; it's all on my website. Thank you so much, and I don't know if we have time for questions, but... thanks.
[Music]
What does it take to build a personal, local, private AI agent that augments you deeply? We are pleased to welcome to the stage the co-founder, at Meta, of PyTorch: Soumith Chintala.
[Music]
Hello, hello! How's everyone doing? Last talk of the day, last talk of the conference; hopefully I'm not the most boring. So, first of all, who am I? Do you know this thing called PyTorch? A lot of people in AI used to know it, but now a lot of people just use high-level APIs and don't know what's powering things underneath. PyTorch is the software probably powering your AI APIs. I work on it; I co-founded the project, and it's a big project, majority-funded by Meta, where I work. And I'm not talking about Llama at all today. I work on Llama a little bit, but unfortunately I am not in charge of Llama, so don't try to sneak any secrets out of me; I'm not going to tell you when the next Llama is coming or anything like that.
So why am I thinking about personal local agents? Well, as AI started becoming more and more useful, one of the things that saved me the most time every single day was swyx's AI News. I have to keep up with everything going on in AI; that's my job. Instead of spending three, four, five hours a day looking at a bunch of sources, it was aggregating the news for me, and I thought that was one of the first applications that was mind-blowingly effective for my own productivity. That's when I started thinking, hey, I'm going to augment my day-to-day with AI in a deeper way. That's not an agent, though; AI News is more like an aggregator, but that's how it started. The other thing is I also work on robotics, and robots are essentially agents: they act in the world. My goal is to build home robots so that I don't need to do any errands, and as part of that journey I've been getting into understanding AI agents more deeply.
The key takeaway I'm really going to drill into today is this: agents, especially personal agents, have so much agency in taking actions on your behalf, and so much of your life's context, that you're better off keeping them local and private. I'm going to try to sketch out a plan for how to do that, but I don't think I have a complete solution either.
a complete solution either um so first
like agent what is an agent like and why
did I say swix is AI new is not an agent
well an agent is something that can act
in the world like an agent is something
that has agency it can actually like
take an action in the world uh anything
that can only get context and do things
but then eventually can't act in the
world is not an agent that's how I think
about it and what I think is like a
highly intelligent agent without the
right context is as good as a bag of
rocks it's like really useless I'll give
you a couple of examples very quickly uh
let's just say I build a personal agent
it has uh access to my Gmail my WhatsApp
my calendar I was like did did I get my
prescription renewed and it's like no
not yet and like it totally lying uh
except it didn't know because like I got
the text from CVS on my iMessage and it
didn't have access to uh that source and
it was doing the best it can with the
information it has but if it didn't have
the context like it's not going to like
know how to do better uh similarly I
mean you can make up like a hundred
examples like this where like you have
access to One bank account but like it
your money came into a venmo and you're
like uh the agent lied to you what
happens is like a personal agent that
doesn't have the right context it's
largely going to be irritating to use
it's like you don't know when it is
useful and when it is not useful so it's
essentially not useful um like even when
it gives you some answer you're like H
is this actually right I'm going to have
to go dig in right so so unless it hits
a certain level of like reliability and
predictability that you know it is right
uh it's not going to be actually useful
to you um so now like why am I talking
about personal agents specifically um
and how do you like how do you get all
this context to the agent so let's just
say you have like your opening ey API or
some other API or some local llm what is
all the context in the world that is
personal to you and how do you give it
to the agent well like the number one
thing that you possibly want to do is
like just have variables right you're
just like you can uh the your AI should
see everything you see and listen to
everything you hear um and that is like
obviously the best case of providing
context to your AI agent except like
there's no battery life for any of these
variable things so that's not really
practical maybe one day when you have
like crazy batteries but that's not
really going to work the other thing
could be like okay like most of my life
is on my phone uh in the ways that I
care about from like an agent
perspective uh what about just like uh
running an agent on my phone it's
running the background and it's just
like always like watching my screen or
something well you know that's where
like apple kicks you because you know
they don't let you run a bunch of stuff
and like on your phone asynchronously if
you even if you do they have a lot of
restrictions so like the ecosystems kind
of like kill you and not allowing you to
do that and unfortunately I use like
apple um so that's that's out so the
next one is like okay actually like the
thing that I found like relatively
useful is like if you use like apple in
your daily life uh you can actually get
a Mac Mini and like just put it
somewhere in your home connect it to the
internet and you can run your agents
asynchronously there's no battery life
issues you can just log into all your
services on your like Mac Mini and um it
also can access all the Android
ecosystems because Android is actually
open um
so I work at neither of these companies
I can say what I
want um so I I think that's like what I
think is a feasible um um device to use
to like run your AI agent right now um
The next thing I want to talk about is: why local and private? Why can't you just run this in the cloud, subscribe to one of the large tech companies' agent services, and run your life out of it? I want to give you a few points here. First, I want to talk about how this is different from using other digital services, and I think it is meaningfully different, and also easy to understand. A lot of you in this room probably use a cloud email service that is free, for all of your life: all your taxes go in there, everything personal goes in there. Why do you trust it? The reason I trust it is that it has a very simple mental model of how it will act on your behalf: email in, reply out. It's basically not trying to do something sneaky and unpredictable under you. Your trust of that service is correlated with whether you understand how it behaves on your behalf. Now imagine tomorrow the email service you've been using forever says, "For some of your emails that I have confidence in, I can auto-reply on your behalf." First of all, that might work, but what is the worst-case action it can take? Maybe it replies to my boss with something nasty, and I don't want that to happen. Once the action space becomes powerful enough and unpredictable enough, you get uncomfortable using a service you're not fully in control of. And it can get worse: companies have to monetize in a million ways, so what if the online service you're using suddenly says, every time you ask a shopping query, we're going to make the agent only buy from vendors that give us kickbacks? Your personal agent is so personal and so intimate that ultimately you want to be in control, in many aspects you might not have control over if you have to trust an online service. That's one of the biggest reasons I want to build a personal agent that's local to myself.
The second is decentralization. You already see all these ecosystems that are walled gardens, fighting with each other and not allowing each other to interoperate in various ways. If you build your personal agent around one ecosystem... it works fine for compartmentalized things like maps and email, but is that something you really want to subscribe to for an agent that can take so many different kinds of actions on your behalf in your day-to-day life? That's the other reason I feel we, as a world, should try to get to local, personalized agents as the norm. And the third one is, for various reasons, what I call: are you going to be punished for your thought crimes?
You have a thought, and it is not a good thought; should you be punished for it? Usually the answer is no. Now, if you have a personal agent that is effectively augmenting you in such an intimate, personal way, you might be asking it things you would never say out loud. In those cases, do you really want to take the risk of putting that into some provider? Even enterprise-grade cloud API contracts, enterprise rather than consumer grade where they get sloppy, still involve a bunch of legally mandated logging and safety checks. That's a risk you might or might not want to take, but for me, I don't ever want to get into a scenario where I could be prosecuted, or persecuted, for my thought crimes. That, I think, is another really powerful argument, for myself at least, to focus on local agents for my most personal augmentation.
So now, I hope you're convinced that if you're going to build a personal AI agent, it has to be local and private. Okay, what's the problem? Let's go to the technical challenges first. You've got to run this stuff, and there are great open-source projects for running local models, which are one of the key components of these agents: vLLM and SGLang are pretty great, and they're both built on top of PyTorch. (One time we wrote a bug in PyTorch, and a bunch of us had Tesla cars, and Tesla uses PyTorch, and we were like, man, this is scary: are we shipping bugs onto ourselves? That's an aside; it was totally fine, the bug was not that bad.) So yeah, vLLM and SGLang are great.
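As a minimal sketch of what the local piece looks like: vLLM exposes an OpenAI-compatible HTTP server, so your agent code can point a standard client at localhost. The model choice and prompt below are just examples.

```python
# Sketch: a personal agent talking to a locally served model. Assumes a
# vLLM server started on the same machine, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.1-8B-Instruct
# (model name is only an example). vLLM speaks the OpenAI API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local endpoint; nothing leaves home
    api_key="EMPTY",                      # no real key needed locally
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize today's unread email."}],
)
print(resp.choices[0].message.content)
```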
But local model inference is still, as of today, slow and limited. It's not as fast as a cloud service, even if you spend enough money on a beefy machine. I think that's also rapidly changing: locally, if you're using a 20-billion-parameter or distilled model of some sort, it actually runs pretty fast, but if you want the latest R1, full and unquantized, it runs super duper slow. This is in a state where it will fix itself, though you probably won't get to run the latest and greatest. And I think the challenges are not so much the technical and infrastructural ones; those will get to a place where they're fine. There are challenges around both research and product that people need to think a bit more about; there's a gap, and this is an open challenge for this room, for all of you AI engineers. One: the open multimodal models are good but not great, in a couple of areas. The first is computer use: even the closed models, the latest and greatest APIs you can pay money for, are not that great at computer use; they break all the time, so that definitely needs to get to a better state. The other thing I noticed: if I ask a model to do shopping for me, from clothes to shoes to furniture to whatever, it basically gives me the most boring stuff. And if I say, look, I'll tell you my tastes, and my tastes can get very specific, then the more specific I get, the worse it does. You ask for a red velvet sofa with oak wooden legs; here's a green sofa that is velvet and doesn't have oak wooden legs. They're not very good at identifying visually what you're asking for; they mostly rely on a bunch of text matching.
The other thing you will notice, and this is a big one, is that we don't have good catastrophic-action classifiers. What do I mean by catastrophic actions? There are many actions an agent can take, and a lot of them are reversible or harmless: even if it takes an action you didn't want, like going to the wrong Wikipedia link, big deal; it'll just backtrack. But some actions are actually catastrophic: you ask it to go purchase a renewal of your Tide Pods, and it goes and purchases a Tesla. That is not the best outcome for you. These are catastrophic actions, and there's some open research on getting agents good at identifying catastrophic actions before taking them, and maybe notifying the user instead, but there's not enough. If you want to really trust your agents, personal or in the cloud, we've got to get better at this. So that's a big one; a rough sketch of the gate idea follows.
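In the sketch below, the heuristic and thresholds are hypothetical stand-ins for the learned classifier that, as the talk notes, doesn't really exist yet.

```python
# Sketch of a catastrophic-action gate: classify an action before
# executing it, and require confirmation for irreversible or costly ones.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # e.g. "purchase", "navigate", "send_message"
    cost_usd: float = 0.0
    reversible: bool = True

def is_catastrophic(a: Action) -> bool:
    # hypothetical rule: irreversible or expensive means "ask first"
    return (not a.reversible) or a.cost_usd > 100

def execute(a: Action, confirm) -> str:
    if is_catastrophic(a) and not confirm(a):
        return f"blocked {a.kind}: waiting for user approval"
    return f"executed {a.kind} (${a.cost_usd:,.2f})"

never = lambda a: False  # user not asked / declined

# Reordering Tide Pods sails through; buying a car gets flagged.
print(execute(Action("purchase Tide Pods", 25.0), confirm=never))
print(execute(Action("purchase Tesla", 45_000.0, reversible=False),
              confirm=never))
```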
I also think open-source voice mode is barely there. When I have a personal local agent, I definitely want voice mode, because sometimes I want to talk to it rather than type out everything I want to say. But why am I still bullish about this whole thing? One: I see open models compounding in intelligence faster than closed models, relative to how many resources are being put into them. What do I mean by that? OpenAI is only improving its own model; Anthropic is only improving its own model, with all the billions they have. But open models are improving in coordination, across the board. People didn't really believe it until Llama came out, and then they didn't believe it until Mistral came out, then they didn't believe it until Qwen came out, and then they didn't believe it until DeepSeek came out. People keep saying open models won't really win, but I think they will. I've worked in open source all my life, and there's a starting coordination problem: initially you don't have enough critical mass to coordinate with each other, but once you have a critical coordinated mass, open source starts winning in an unprecedented way. You saw that with Linux; you see it with a bunch of projects. So I am pretty bullish that open models will actually start getting better than closed models, per dollar of investment into the open models.
And with that, I have some plugs. This is gr.inc, from my friend Ross Taylor, who worked on a model called Galactica, which got a lot of criticism when it was released out of Meta. It was an open science model released before ChatGPT; now doing science with LLMs is pretty common, but they got a lot of flak when they released it, and he unreleased Galactica and quit doing a bunch of stuff publicly. But now he's working on plugging the reasoning gap between open models and closed models, and they've released a bunch of open reasoning data that will help; so that's a quick plug. The other quick plug: I work on PyTorch, and PyTorch is working on enabling local agents, especially the technical challenges I talked about. And we're hiring: if you're an AI engineer who's also a systems engineer, PyTorch is hiring.
Well, that's what we've got. The other thing, obviously: I welcome you all to come to LlamaCon, which is happening on April 29th; save the date. It's going to be very exciting; lots of Llama stuff will happen there. That's it. I think it's in California; I actually didn't look it up. So, thank you.
[Applause]
Ladies and gentlemen, please welcome back to the stage NLW and Ksenia Se.
[Music]
Thank you, everyone, for being such an amazing audience. We know you're itching to get out of your seats, but please hang out with us for two more minutes. First, on the website, ai.engineer, you can find the information about the afterparties happening this evening; please check it out. And tomorrow there's a full day of workshops. They are not here; you can find the addresses on the website, but they're at Jets, which is 109 West 39th Street, 2nd floor, and AWS JFK 27 at 12 West 39th Street, both of which are right next to each other. Before we go, we'd like to invite the organizers up for a quick message of thanks, so please join me in welcoming to the stage the co-founders of this AI Engineer Summit: Benjamin Dunphy and swyx.
Hey everyone, how are we doing? Did we have a good time? It's been a long marathon for us, and I'm sure for you as well. We won't keep you too long, but we just wanted to chat with you for a little bit, because this has been such a blast for us, and I know it has been for swyx as well. Behind the scenes, he's really putting together all the content, and he's doing a lot more speaker-wrangling than I would like him to do. This guy is doing so much work, so I'd just like everyone to give him a rousing round of applause.
All of the incredible people here, and also you in the audience. I mean, you're not coming for me putting together the show; you're coming for him, and everyone else here. But I think a lot of the show is also just how smoothly everything runs. People don't see, behind the scenes, how much chaos there is back there, and it's all due to Ben and team. Leah's back there as well, and we have a whole team helping us too, so we ought to thank them, and I'm very grateful to work with you on this stuff.
I mean, this is our third conference, and I feel like it's getting better, but I still feel like there's always chaos. We kind of make it hard on ourselves, because every conference is slightly different; for example, this one is the first one in New York. We also didn't decide on this venue until December 15th, so this was less than two months of planning. So maybe we can do a little bit more with World's Fair in June. Yeah, for that one we're doing it a year ahead. So, World's Fair coming soon; I don't know if we have a URL or something. No, we're just chatting. Okay, cool.
What else? We get asked this a lot, so I just wanted to address it to the whole audience: why did a bunch of San Francisco people come to New York? It's really just a few simple reasons. First, we just wanted to get out of our San Francisco bubble; we'd had two successful AI events in San Francisco. And secondly, as swyx said to me about the AI scene in New York... what did you say? "Show me what you got." People talk a lot about the great engineering happening in New York; we saw some data yesterday about the hiring going on here, and basically I felt like New York was kind of underserved. The frequency of AI events we have in San Francisco we take for granted. I was checking my biases, thinking maybe I'm just ignorant, but I talked to some of you here and you said, yeah, this quality of event doesn't really happen that often in New York. So we just wanted to bring a little bit of the AI Engineer magic to New York.
And it's been a blast so far. I think the last reason is: if I get an excuse to go to New York, I'm going to take it, even if it's in February. As Californians, we're not that accustomed to this weather, but I really felt like we needed to do something early in the year, so that we have context, relationships, and whatever else for the rest of the year; the timing is important. And I'll just say: if you come here even though it's cold, then you're really here; you're very serious. Anyway.
So God smiled upon us this week; those blizzards did not show up. It was supposed to be two feet of snow on Thursday, and then it cleared up, so I don't know what we did, but we did it right; global warming for the win.
So I was talking to someone about what it takes to put on an event like this. Everything's really tight: everyone's got 19 minutes, yesterday was 20 minutes, everything's back to back. Compared to the plug-and-play conferences where you come in and do a meetup style, which is what I usually do, it's literally like 10x more work and a lot more stress on us, and we pass some of that on to the speakers, because it's a little more strict. So I just want to thank all the speakers for putting up with me and with the crew saying, you know, "urgent, we need your slides now or everything's going to break." swyx doesn't think it's that big a deal, but I do. It is a big deal, but I think we have to tell people why, when we have such high expectations that they're not used to; you have to give a rationale. It's like you have to show your reasoning to arrive at some kind of conclusion; it actually helps, you know. True that, true that.
What else? There are just so many more people who go into making this production. We have an entire crew back there; we have Argus HD back there at the Times Center helping to run this. And not just that: we have a whole organizational crew. We have Leah, we have France, we have Peter, we have Scott joining us. We're actually starting to build a big team so we can cross that chasm, because right now you see me running around, or you don't see me at all, because I'm putting slides together last minute, or doing something last minute that should all have been done way ahead of time. With a bigger team we'll be able to do that, so as we prove this model we're going to continue to grow
and grow the team. I mean, don't forget our MCs! MCs, yes; thank you, wonderful job. Mostly I wanted to say: I have a theory that professional yappers are very good improvisers. They're involved in the industry, they want to meet lots of people; that's my theory, and I want to bring people into the fold. So I hope that you've gotten familiar with each other, and that you got something out of it as well. Thank you so much for all the work that you did
today! And then of course the sponsor expo down there: did you guys see that? That was Motif Events; they put that together. A really, really good expo, really professional-looking. And then our sponsors themselves, who manned those booths and showed you those exhibits. Who were your favorites? Shout them out! What? Daily, the Pipecat Cloud thing? Yeah. Everyone else too scared? All right, don't want to show favorites. Galileo! All the company reps are now shouting their own names.
So, I would say: we added the expo stage at kind of the last minute, mostly because we had some ability to set it up. But it was noisy in there; it was hard to hear. I think in future we'd want to separate it all a little bit more, but we couldn't really tell the acoustics before running the thing. True, and it was a last-minute addition, so we didn't really get a chance to test it. When we added it, they said, well, we can get speakers in there, but it's going to be overhead; it'll sound good, though. We did get some complaints from sponsors saying it's too loud in here, we can't have conversations, and probably the same feedback from attendees as well. Those are little things that we need to optimize; that'll get smoothed out.
Anything else? No? So we want to do one more thing, and we want to invite you all, and apologies to everyone I didn't get to thank, because too many people go into this thing. But one person: Randall Gee is our photographer, there he is, and he is going to take a photo of, ideally, all of us. And Max Video Productions, who does our B-roll and did the interviews out there; hopefully they'll be getting some B-roll of us as we do this. I've never done this, I don't know if you've ever done it, but I'd love to get everyone, as many as want to come, up on stage. Just one caveat before we get up: these things will break if you step on them, and we'll have to pay for them, so just don't step on those. They're grates; they're hard to see, but just don't go past this line, and don't touch that. But yeah, who wants to come up and do a group photo? It's going to be a good memory. This is like the Solvay Conference. Come on up, let's do it!
Miranda, you want to grab a word? Sam? You can shout. Just a reminder not to step on the grates behind you; they will break. Thank you. All right, here we go; look at this group! We've got the lights on all of you now. All right... ah, we've got to turn it back off. swyx, you've got to be on the red carpet. Can we get the MCs over here? On the red carpet, if you're comfortable; we're going to get to know each other really quickly. Come on in, everyone, and Randall's going to tell us if he can see us. How are we looking? All right, let's do it!
[Laughter]
swyx can actually sing, do you guys know that? Go for it! "Start spreading the news..." Everyone fall in, fall in! Do we have some volunteers?
[Music]
[Applause]
Thank you, everyone! Thank you. One more thing before everyone goes, a little bit of a surprise: Ben didn't mention this, but it's actually his birthday today, and we got him a little something. So thank you, Ben, for putting part of your personal life on hold and doing this. "Happy birthday to you, happy birthday to you, happy birthday dear Ben, happy birthday to you!" And I don't think... yeah, we shut your mic
off. Just a reminder, one last protocol item from me: we are actively breaking down the venue. You're welcome to stay for a little bit, but we've got to be out of here by about 5:30 or 5:45. There are plenty of side events happening, but of course, this is New York; you can go anywhere you want. We'll see you at the workshops tomorrow; those are nearby, within walking distance of this venue and also the hotel, at AWS, the Hank and J Suite. That's all on the website; check it out, you can get the addresses there. You're all welcome to come, and we'll see you there!
[Music]
[Music]