AI Engineer Summit 2025 - AI Leadership (Day 1)

Channel: aiDotEngineer
Published at: 2025-02-20
YouTube video id: L89GzWEILkM
Source: https://www.youtube.com/watch?v=L89GzWEILkM
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
I
[Music]
[Music]
[Music]
[Music]
he
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Applause]
for
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
in
[Music]
n for
[Music]
a
[Music]
welcome welcome everyone and thank you
for coming welcome appreciate you being
here and welcome to the AI engineering
Summit 2025
okay so I just want to start by saying
even though I've often personally been
told I work like a machine I just want
to take this moment to reassure everyone
your event staff your event curators
your event people like myself we're all
real humans none of us are Bots that are
being launched at this conference well
this year anyway I mean next year who
knows right but uh this year is very
special to us uh this year is the first
year we're in New York and we could not
be more pleased to be here I can already
feel and have already felt the energy
that's available for us today here in
this room and at this conference we're
building on the soldout success of our
AI engineering Summit in 2023 and our
soldout AI engineer World's Fair in 2024
in San Francisco and that's what's
allowing us to bring to you here today
this curated exclusive in Industry
Insider event here in New York City uh
and where we can all get together and
talk about and learn from each other's
real world pragmatic experiences uh and
at a gorgeous Stellar
venue uh okay so why are we all
here well like the Industrial Revolution
before AI is a new future and it's going
to change everything that comes after it
I mean frankly just a few years ago even
the idea that the entire internet and
assembled Collective knowledge of the
human race could be put at your
fingertips is just amazing
breathtaking and we're here of course
also because Aon is by no means done
evolving right um things that used to
take five years and a research team in
2013 now take API docs at a spare
afternoon in 2025 it's just
incredible okay so let's set
expectations about the event for just a
moment um we have a really exciting
conference lined up uh today is the AI
engineering track uh for executives and
VPS and then we're going to be followed
tomorrow by the agent engineering track
and then we'll conclude the conference
with the Hands-On workshops on Saturday
so today's leadership track uh will
equip you with the Strategic insights
you need that are at the intersection of
AI and business leadership uh exec
leaders senior folks from Lux Capital
signal fire uh anthropic open AI
LinkedIn data dog many many others are
going to be here today to share their
real world experiences uh and hard one
uh scars and lessons with you um we will
touch on everything from Trends to
hiring to security to tools to
infrastructure and technology and a lot
more uh and of course we are blessed and
very fortunate to have sponsors like
salana uh if you're not familiar with
the blockchain world or you're not
familiar with salana salana is the
permissionless infrastructure that lets
your agents create
wealth they've got a large Booth
downstairs with three demo stations so
please stop by to learn how they can
help you uh the Expo area is just uh
downstairs right uh on the in the
hallway in the lower level um it opens
after the morning Keynotes and it's open
all day uh so please do take the time to
go visit our sponsors because they're a
integral part of how Gatherings like
this happen right uh and more
importantly they will help you on your
journey okay this event however isn't
just made possible by salana it's made
possible by all of our sponsors um who
are innovating at the edge of AI
engineering and they represent a really
fascinating mix of companies uh they've
sent their top you know sea level
Executives their heads of product their
senior technical staff to this event so
please make sure to visit them at the
breaks and have a chat all the breaks
are in the schedule and lined up and who
knows you might find your next service
provider your next partner maybe even
your next
customer um so in a moment you're going
to hear from these Founders these
executives these AI leaders who have all
prepared tailored talks just for you
okay and then at the break following uh
each block of talks um there's going to
be the speakers are going to be
available to answer your questions
there's one of three Q&A and discussion
areas uh throughout the the conference
venue um there's one right here on this
level uh and then there's two downstairs
one at the uh base at the landing of the
stairs and another tucked underneath
okay um we're going to use these areas
during the break time to facilitate the
hallway track so you can gather kind of
birds of a feather F birds of a feather
style um to talk about the topics as
well from all the sessions before in
that block before the break okay as well
as meet the speakers um these breaks
again several of them throughout the day
they're all listed in the schedule so
you won't have to to guess or worry um
and then of course all food and drinks
are going to be served downstairs in the
Expo area and please be sure don't miss
the Afterparty tonight in the Expo area
so we can all have some drinks have some
fun listen to music and get to know each
other because that's a lot of I think
why we're all here okay with that thank
you for your time and it is my honor and
privilege to call and invite our first
Speaker up to the stage please put your
hands together and join me in welcoming
partner from Lux Capital Grace isord
[Applause]
[Music]
thank you so much Peter to swix to all
of the AI engineer Summit for having me
I am so thrilled to be here I'm Grace
again a partner at Lux capital and it's
a pleasure to really kick off this
conference and tackle a pretty tough but
exciting task the state of the AI
Frontier and how we navigate that in
2025 so a little bit about Lux as we get
started here right Lux likes to say we
believe before others understand we
invest in Frontier Tech ideas that
seemed crazy right uh and we really like
to bring sci-fi to scif fact in fact
we've been lucky to partner at the
earliest stages with some top AI
companies hugging face I'm sure several
folks know it the GitHub for machine
learning together AI the open source AI
Cloud physical intelligence that's like
a robotics software brain and saon AI
That's a research lab actually in Tokyo
Japan doing really cool evolutionary
nature inspired algorithms they launched
a pretty cool AI Cuda scientist last
night so go check it out moving a little
bit forward as we think about New York
City if I get my clicker working
here there we go Lux is really excited
to double down in New York City and AI
Lux was founded in New York City our
first AI investment was here in 2013 and
a majority of the Lux AI portfolio is
actually headquartered here or as a
major Hub as you can kind of see in the
graph behind me it's also home to many
of you right state-of-the-art research
and Engineering leaders and many four to
500 companies several of whom you're
going to hear from over the next few
days we are really bullish on the New
York City opportunity and we're really
excited you guys all came to share this
opportunity with
us so when I was creating this
presentation I went back and looked at
the last few years of AI right and all
the way back really to stable diffusion
August 2022 and wow I mean look at this
hockey St right the last 2 and a half
years have been crazy the last 18 months
have been even more exponential the
progress is getting more aggressive it's
getting more impressive and really it's
getting more spread right it's not just
open Ai and anthropic publishing these
models it's xai we just saw the grock
launch this past week it's mistal it's
deep seek it's many many more and the
models are getting more performant
they're also getting more compute
efficient and as we zoom in to the
current state of the world in
2025 it's off to an even Wilder start
right if you thought the last few years
were crazy 2025 is even Wilder we saw
the 500 billion Stargate project
announced between the US government open
AI soft bank and Oracle we saw 03 open
eyes 03 right before the start of the
Year where they actually exceeded Human
Performance and the arc AGI challenge we
saw the Deep seek Mania right with deep
seeks R1 model launching earlier this
year sending you know Nvidia shares
tumbling down we also saw deep SE go to
number one uh in the app store and then
of course just last week we saw the
France AI Summit where macron actually
launched a whole new AI initiative with
France and Europe back in the
game so you may be saying and I think a
lot of us are thinking right this is the
AI agent moment in 2025 and I'd go as
far as say this is the perfect storm for
for AI agents and Frank frankly it's
easy to see why right uh several
reasoning models starting with open eyes
01 then 03 deeps R1 grock's latest
reasoning model this past week are
outperforming human ability and in some
cases even having more capabilities that
we've never even seen before we' seen
the rise of test time compute right
that's more compute applied at inference
instead of at training that's increasing
this model performance as well we've
seen further engineering and harder
optimizations right whatever you think
it cost to actually train that deep seek
model you cannot deny it was a feat of
engineering and Hardware efficiency
inference is getting cheaper Hardware is
getting cheaper the open source close
Source Gap is getting closed between
deep seek and llama models getting more
and more performant and of course
billions of infrastructure powering all
this data center and compute we just
talked about the US Stargate we talked
about macron and Europe and also Japan
with soft and Nvidia has been doubling
down on their set of efforts so all this
is setting this exciting groundwork for
the aonomus name of our conference
agents at work and it really does feel
like an exciting
moment but in reality these AI agents
aren't really working just yet right
people are saying it's a perfect storm
and I've seen a lot of Thunder I've seen
a lot of great momentum but we haven't
seen that lightning strike uh and
everyone I know has different definition
of Agents so for the purposes of this
presentation I'm going to Define my
definition as an AI agent that is a
fully autonomous system where llms
direct their own
actions so let's give a little example
of what I mean when I say an AI agent
isn't working just yet here's a
seemingly simple query on open AI
operator I'm sure everyone here knows
what it is I asked it to book a flight
for me to New York to San Francisco on
Monday I'm sure it's also a route and
something that many people in this room
are familiar with and in reality it's
actually kind of a complex problem right
I need to leave after three on Monday
but I want to avoid rush hour traffic I
want to fly United JetBlue or American
to maximize my chance of an upgrade from
economy I want to keep it under $500 to
keep under my work expense policy I also
want an aisle seat that's not too close
to the bathroom um and I want to get
there you know before midnight so I put
this in to open AI
operator and the first thing it did with
all this information is go to kayak
which if anyone has booked a flight
before that's a pretty frustrating
experience and unfortunately it did not
find a flight uh it wasn't able um it
couldn't find one it didn't even seem to
look for United or American second try
try it again uh this is Skys scanner
this time which is slightly better um
and it did actually find a flight but it
found one that had a lot of traffic uh
5:30 JFK for those who live in New York
that is a tough traffic time um and
ultimately I also couldn't even pick my
seat so didn't really work out based on
my prompts uh
earlier so what does this all mean right
why these AI agents not work I think we
so often talk about hallucinations and
fabrications and AI models kind of going
sideways we don't talk enough about
these tiny cumulative errors that add up
right uh there's a lot of little errors
that we see with this old model and I'm
going to go through a few it's not an
exhaustive list but it's a sense of some
of the things you might run into as
you're building these AI agents first
decision error it chooses the wrong fact
right I may book a flight with AI but it
may book it to s Isco Peru instead of
San Francisco California the model could
overthink or exaggerate and do a few
other things as well second
implementation error the wrong access or
integration on the prior slide with my
Skys scanner I actually had to enter
capture and that messed up a little bit
of the flow you also could get locked
out of a critical access to an important
database and ultimately that AI agent
isn't going to work anymore third
heuristic error the wrong criteria
unfortunately the model didn't
acknowledge best practice of allowing
enough time for JFK in fact I didn't
even ask where I was coming from
Manhattan Brooklyn or Beyond and that
could really affect the traffic you're
going to get and ultimately if I even
make that flight at 5:30 p.m. and fourth
taste error the wrong personal
preferences for those who know me well
I'm actually a pretty spooked flyer and
I do not like flying Boeing 737 Maxes if
AI booked that you know I did not put it
in the prompt earlier but if it did book
that I will be very unhappy and I would
not get on that
plane and then there's kind of a fifth
more nebulous error right it's a little
bit of this Perfection Paradox right we
are doing things so magical with AI
right now yet we're getting frustrated
when A1 thinks too long or when operator
moves at the speed of a
human even if the agent gets it right on
the first try often they're inconsistent
and unreliable leading to really
underwhelming our human expectations
about the whole EXP
experience here's another visual of kind
of these complex systems where each of
these cumulative errors really compound
right two simple agents one that's 99%
accuracy one that's 95% accuracy to
start pretty impressive agents at the
beginning but over 50 consecutive steps
you actually realize a pretty big
disparity here there's actually a 50%
difference after 50 tasks between the 95
and the 99 and that 99% agents actually
down to 60% the point here is that
something simple like booking a flight
is actually really complex in nature
when all these tiny cumulative errors
add up and they get even more Amplified
in a complex multi-agent system with
multi-step
tasks so how do you as all these amazing
VPS of AI these leaders of AI in the
room optimize a complex agent taking
into account all of these really
difficult queries to consistently and
reliably make the right
decision the truth is
it's
hard but that hasn't stopped us before
and there is hope so I thought I would
run through some of the best practices
that we're seeing building AI agents
today and five strategies we can all
think about to help mitigate a lot of
these cumulative errors let's Dive In
First Data curation how do we make sure
an AI agent has the information it needs
data is messy it's unstructured it's in
silos it's everywhere it's not just web
and text data now too it's design data
it's image data it's video data it's
audio data it's a data in your sensors
and your Warehouse if you're in the
manufacturing world it's even the agent
data that your data your agent is
producing in real time think about
curating proprietary data the data the
AI agent generates and ultimately even
the data you're using in your model
workflow for Quality Control Data is
your best asset and curation is key to
making it more effective
data also isn't static anymore how do
you design an agent data flywheel from
day one so every time a user uses the
product it automatically improves in
real time and at scale a simple example
back to our flight example is getting a
curated data set of all of Grace's
travel preferences including the 737 Max
and all my Airline preferences or even
say we run that agent over time and book
many flights how do we recycle back that
content and adapt to my own preferences
in real time second the importance of
evals how do we collect and measure a
model's response how do we choose the
correct answer this is long been
important in machine learning and Ai and
really understanding what's right versus
wrong you know it's pretty simple in
verifiable domains where there's a clear
yes or no answer like math like science
here's actually the grock three
benchmarks just up here where you saw
they did all verifiable benchmarks in MA
math and sciences but how do we set up
evaluations for non-verifiable systems
where there aren't clear yes or no
answers like well Grace Like This Plane
seat based on her preferences and how do
we collect those signals
too we also saw other examples of an
eval debate over the weekend with deep
research right we have an openi deep
research product one from perplexity one
from Gemini as well and there are
multiple versions of the same product
the evals here really depend on the eye
of the beholder right which one is
better for everyday research versus VC
market research versus scientific or
academic research we have to keep an eye
on collecting those signals we need to
know and collect human preferences and
we need to build evals in a way that is
truly
personal Sometimes the best eval is just
trying out the agent yourself and Vibes
based on your needs with no number or
leaderboard telling you what to
do third Scaffolding Systems how do we
ensure when one error occurs it doesn't
have a cascading effect throughout the
organization ramp a Le portfolio company
has done a great job with this and I
know Rahul is speaking tomorrow so go
check him out when ramp launches a new
applied AI feature and it fails there's
infrastructure logic to ensure that
doesn't have a cascading effect across
the agentic system and also across all
of ramp production infrastructure we can
mitigate scaffolding by building a
complex compound system of how all these
things work together and sometimes even
bringing a human back in the loop for
reasoning models get this gets even more
interesting and important how do we
adapt the scaffold to Stronger agents
that self-heal and grow an agent that
realizes they're wrong and actually
tries to correct their own path or an
agent that's not sure and then need to
break execution to get it back on track
back to our travel example again could
we add a checkpoint for this AI agent to
verify the Trine for traffic or maybe
steer it back in the right
direction fourth user experience or ux
is the mo that matters
and that's how our AI agents become
better co-pilots AI apps today are all
using the same models Foundation models
are the fastest depreciating asset class
on the market right now gbt rappers are
cool ux really does make a difference
for those who reimagine product
experiences and really deeply understand
the user workflow and really promote
that beautiful elegant human machine
collaboration right here's a few
concrete examples back to the Deep
research right asking clarifying
questions to make sure it fully got the
picture of what I'm trying to accomplish
like Wier from codium understanding the
ux or the psyche of that developer
really on a more fundamental level to
predict their next step like Harvey in
the legal World integrating seamlessly
with the Legacy systems to really create
real Roi for a practicing
lawyer if you think about all the major
AI apps today and categories like coding
like customer support like sales these
all are using the same models again
right and it's truly the ux and the
product quality that makes any one
company stand out at Lux we're really
excited about the new AI Frontier
companies who have proprietary data
sources and who know the workflow of
their user really well like robotics
like Hardware like defens and
Manufacturing like the life sciences you
know how do we take a company where they
take their proprietary data source they
know the workflow of a biologist or a
defense contractor or a chemist and
truly create a magical experience for
that end
user Fifth and finally how do we build
multimodally you know we're not just
multimodal anymore we're
multimodal there's new modalities where
we can truly reimagine and create a 10x
user personalized experience I am so
sick and tired of the chatbot as an
interface and I know there's so many
more exciting things we can do with our
AI agents to really make them more human
right how do we make AI more human how
do we add eyes and ears nose a voice
we've seen really incredible
improvements in voice over the last year
it's getting pretty scary good Lux
actually has an investment in the smell
space called osmo that's digitizing the
sense of smell and what about touch
right how do we instill a more human
feeling and sense of embodiment with
robotics I'll go as far to even talk
about memories right how do we make AI
truly personal and know you on a much
deeper level than it does today
doing all of this reframes what
Perfection is to a human and even if
that agent is inconsistent it's
unreliable the Visionary nature of the
product exceeds all expectations it's
something new and on the slide behind me
you'll see TLD draw that's an amazing
Lux portfolio company and I think
they've done a great job really
reimagining the visual canvas right
implementing AI through brush Strokes
they have a cool thing called T draw
computer or you can actually combine a
bunch of these cool AI models in Tandem
and not even know you're working with a
large language model in the background
so really strive to build
multimodally so in summary we tackled a
lot today but we're at the perfect storm
today for AI agents but unfortunately
that lightning hasn't struck yet and AI
agents are not going to happen
overnight cumulative errors add up we
see wrong answers wrong preferences
wrong criteria all these wrong human
expectations that abound when you're
building these systems data curation
evals and Scaffolding are all tools you
can use to help mitigate a lot of these
challenges and really please think
bigger ux multimodality Innovative
product experience that truly set the
workflow and the vision apart and I'm so
excited to see what all of you build and
really excited to continue this
conversation over the next few days
thank you guys so much and look forward
to talking to you throughout the
conference thank you
[Music]
[Applause]
[Music]
our next presenters will teach you how
to build an AI strategy that fails
please join me in welcoming haml Hussein
founder of parlament labs and Greg
sarelli co-founder of spec
[Music]
story all right everyone welcome Hamma
and I are absolutely thrilled to be here
with you to basically teach you how to
build the definitive guide to completely
utterly and spectacularly messing up
your AI strategy I actually couldn't
have really wished for a better foil
Grace's leaden than this because we're
not just talking about Minor setbacks
here we're going to take you through a
way to create full-blown company
crippling career ending
failure Grace talked about best
practices but we're here to embrace
worst practices in fact we're going to
make sure you know how to completely
torpedo your AI projects and ensure you
alienate everyone that you work with how
does that sound it sounds great to me um
so before we begin it might as well
start with some introductions um we Have
No Agenda here it's just a sequence of
steps but I'm Greg I'm an executive
leader and I've spent years in the SE
Suite crafting AI strategies I'm now a
co-founder of an AI startup but
previously I was the chief product
officer at plural site and executive
leader at other companies um and I've
had a front row seat to how executive
teams can transform clear strategic
opportunities into labyrinthine
disasters and I'm haml I'm a machine
learning engineer and independent
consultant who has worked with many
companies on AI I've witnessed every
conceivable way AI strategies can fail
and I found it really fascinating how
creative people can get with their
failures
that's right so you could say that
together we're kind of like the dream
team of disaster we've advised or maybe
we've you know interacted with
representatives from numerous companies
we hav even have this fancy website um
but for today's presentation we've
decided to live and breathe the great
words of the late Charlie Munger who
said invert always invert so let's get
started all right the first step to
failure is to make sure that you begin
to divide and conquer your own company
this is key if you're destined to fail
you got to embrace the disconnect
between willingness to pay price and
cost the keys to creating value by
contemplating unreasonable goals and you
should especially everyone here in the
audience know by now you you have to
make sure you go and attend every AI
industry conference right but never go
back and talk about what you learned
with your team
the point just like Moses here parting
the Red Sea is to create impenetrable
silos and incentivize secrecy between
your teams so let's get into
it so I talked about the value stick and
willingness to pay in the the prior
slide but here it's really important for
us to adhere to the anti value stick you
got to embrace it because it's the
opposite of everything good and useful
when it comes to Value creation and
being strategic
and today that's our guiding principle
you might be thinking that wtp means
willingness to pay but here it's wishful
thinking promises you got to tell your
customers that AI is going to do
absolutely everything for them your new
systems are going to write their emails
block their dog solve climate change and
achieve world peace but don't really
worry about the details just promise the
moon you know about price right well for
us that's another acronym particularly
ridiculous in infrastructure costs
everywhere I mean that was a mouthful
sorry uh buy the most expensive gpus
don't bother with any cost benefit
analysis just max out the company credit
card think of this as an investment in
something and cost well it's that
Cascade of spectacular technical debt
you're about to run headlong into um you
need to think about Building Systems so
convoluted so intertwined that even you
as an executive can barely understand it
you know about job security right this
is a key to guaranteeing it think about
it this when it inevitably breaks no
one's there except for you and finally
if you know about value you know about
WTS or willingness to sell for us it's
why this system well the answer and I
mean always is because AI there's never
any further explanation needed no board
is ever going to question you it's like
magic but much more expensive and less
reliable so step two here is when you
start to Define your strategy right
here's the first key fake any diagnosis
you might be thinking of grab last
year's annual report or operating plan
and just start highlighting random
paragraphs preferably the ones you
understand the least and declare he I
must fix this don't bother talking to
anyone who actually does the
work in your guiding policy should be
both incredibly ambiguous and vague
something like become the global AI
leader in everything except don't Define
what everything means that's someone
else's problem totally and your action
plan simple you need an AI powered SEO
tool that guarantees top Google search
results even if you sell garden gnomes
right and a generative art plug-in that
creates nfts of your CEO's cat and of
course an AI drone lunch delivery
service because
Synergy uh announc solve this at your
next company All Hands meeting and you
get bonus points if you wear a shiny
suit and use the word disruptive at
least a dozen
times and the last point on this slide
is about timelines but timelines are for
companies that intend to finish projects
what we recommend is you Embrace
Perpetual beta just create a massive
backlog in GitHub and stick all those
highlighted Financial reports in that
Greg was mentioning earlier great
strategy but you know what strategy
really works just create a 4,000 page
document that you post in all your slack
channels and just erode people's
willpower to engage with the
material and with these tidal wave of
documents um you know in words Greg
isn't there a strategy that you have
about jargon there certainly is the
point is to communicate in such a way
that nobody understands drown everyone
in a tsunami of jargon say things like
our multimodal agentic Transformer based
system leverages few shot learning and
Chain of Thought reasoning to optimize
the synergistic potential of our Dynamic
hyper hyperparameter space if you say it
with confidence you probably will have
absolutely no idea what you just
said Remember the goal is to look
incredibly smart even if nobody
understands a word you're saying the key
is alisation yeah you might be tempted
to do something like defining a very
cogent clear business on a page approach
like in the in in the advantage but
never be too
tempted one of the most effective ways
to cause dysfunction in your
organization is to use jargon
everywhere and use jargon strategically
to hide the jobs to be done for example
I had a mental health client instead of
saying we need to write a prompt we
would just say hey we're building agents
and what that did is it made sure that
the mental health experts
were not in the room and didn't know how
to participate and that's exactly the
result you want that's right just like
Hamill reduced those mental health
experts mental health I like to do that
as well so instead of saying hey let's
make sure the AI has the right context I
just talk about Rags and don't say make
sure users can trick the AI into doing
something bad just say prompt injections
yeah and the key here is to encourage
Engineers not the people who might best
understand your customers to write
prompts because what could possibly go
wrong look we know that translating
everyday English language into jargon
can be really difficult so we made this
guide for you and this guide will help
you divide your organizations just like
Greg was talking about earlier just like
Moses the link is right here but
remember making everything even writing
prompts seems super technical and Out Of
Reach for everyone is what you want to
go
for all right just a brief recap we
talked about how to seed your your
division how to start to Define your
strategy how to communicate it now we're
on to mobilization right because you got
to do something with that giant
backlog well some of you might know
about Jeffrey Moore but I've never heard
of him today we're pioneering a very new
revolutionary framework which is about
zoning to lose it's designed
specifically for
failure just randomly assign AI tasks to
people with absolutely no relevant
experience for example Outsource your
data review to Offshore Q&A teams who
have very little context about your
business yeah and most importantly you
might be tempted to use the incubation
Zone to bootstrap new AI ideas but the
goal is to launch from here completely
untested bug written AI chat Bots
directly to your customers as Hamill
mentioned never worry about beta testing
dis disre regard quality assurance just
ship it straight to production because
what's the Worst That Could Happen
outside of a potentially career- ending
PR
disaster so if you do it sort of right
should feel something like South Park um
you're going to yank all your best
Engineers from potentially supporting
your revenu producing products wait a
while and then profit no actually it's
going to feel more like total collapse
and because you're so disorganized we
can now transition to
look at this point your organization is
in complete disarray but it's time to do
the deed and burn it all to the
ground so the most effective way to
start doing this is to focus on tools
not processes and those problems that
you created earlier and other ones that
may exist don't analyze them don't try
to understand them just throw tools at
them so if your rag system isn't
retrieving the right documents just a
new more expensive Vector database yeah
and if you need to measure progress just
use every off-the-shelf evaluation
metric you can possibly find never
bother customizing them to your business
needs just blindly trust the numbers
even if they make no
sense oh and like you know we're talking
a lot about agents today if they're not
working just pick a new framework and
vendor find tune without any measurement
or evaluation just assume it's going to
be better because it's kind of like Al
me with a lot more
electricity exactly you don't need to
look at our design metrics evals that's
a vendor problem just plug in a tool and
it will solve all your problems Greg I
really love how you demonstrated exactly
what we're going for here with
wack-a-mole every time you see a problem
hammer it with a tool if another problem
comes up Hammer that with a tool the
same problem comes again hammer it with
a different tool you get the point yeah
him I really appreciate being meme
fodder to help you get your point
across look I want to emphasize you
should adopt a mindset that evals are a
vendor
problem just realize that there should
be a oniz fits all
solution let the vendors figure it out
you're too busy being an
executive and if you really want to do
this properly you need to create a
dashboard that looks like this with
every off-the-shelf metric that you can
gather the more metrics the better it
doesn't matter if the metrics track with
outcomes or real failure modes make sure
the numbers are unintelligible so you
don't know the difference between a 3.5
and a
4.5 keep hoarding random metrics until
you find one that's going up and to the
right then you can claim
success and look maybe you might have a
hard time figuring out where you can
come up with these generic metrics but
we got
you just adopt the ones from eval
Frameworks in fact adopt all of them let
your eval metrics guide you blindly and
never ask whether they actually measure
success again the more numbers you have
the better yeah I personally like to
optimize for cosine similarity BL and
Rouge ignoring actual user experience
and I said it once and I'm going to say
it again never CR check with domain
experts or your users because if an LM
says it's accurate who are we to argue
we are their humble servants after all
amen now it's time to unveil the most
potent Technique we have in our toolbox
and it's avoid looking at data seriously
just avoid it keep a blindfold next to
you at all times you have to bump into
Data by accident put that blindfold on
yeah data that sounds really messy let's
let a tool handle it because you can
absolutely 100% trust the ai's output
without ever looking at it yourself
looking at data that's an engineering
problem you're a leader you have more
important strategic things to do like
having meetings about
meetings besides developers they always
have more domain expertise than your
business teams yeah and we know that
ultimately by this point your customers
are really your best best Q&A and
hopefully you have lots of them and
they'll complain if something is wrong
maybe
eventually but more importantly you got
to trust your gut it got you this far in
life right feelings are always reliable
substitute for data especially when
you're making million-dollar decisions
if you have trouble trusting your gut
just put the blindfold on it'll get you
right back in touch with those
feelings so now we know by we know by
now that engineers are all coding
Wizards and they're going to handle
everything it doesn't really matter if
they haven't spoken to a customer in
years because you can quickly forget
about the fact that there might be
simpler options like using spreadsheets
to annotate and look at data say after
me remember this is beyond me great
advice and just it's not enough for you
not to look at the data you have to make
sure no one else is looking at data and
the best way to do that is to put your
data in complex systems that only
Engineers can access and it's not
available to domain experts right so
like instead of using a simple
spreadsheet or perhaps an air table like
up on the screen as an executive you
should insist on buying a custom data
analysis platform that requires perhaps
a team of phds to operate and understand
remember those bonus points you get more
of them if it takes six months to load
this thing and errors
incessantly so so there you have it the
ultimate foolproof guide in under 20
minutes to achieving total AI
failure if you follow this advice that
we've given you here meticulously it's
guaranteed that you're going to waste
time resources and alienate all the
people you work with and as far as I'm
concerned that's the ultimate success
that you can have sure is so for more
advice it's actually real do visit ai-
execs U we also have an O'Reilly book
the same material coming out February
27th uh and so while this talk was
inverted you know our lived experience
really isn't and we're always very eager
to help you on your journey so find us
after this presentation the Q&A speaker
boo thanks so
[Applause]
much our next presenter is the
co-founder and CTO of privacera and
paage please join me in welcoming to the
stage Don Bosco
[Music]
Dury hi everyone I'm Don B I'm the
co-founder and CTO for private sah uh
very recently we open sourced our
solution for Safety and Security for J
and AI agent um I'm also the Creator and
PMC member of the open open source
project Apache Ranger uh it does uh data
governance for Big Data is also used by
most of the uh CW providers like AWS gcp
as well as um Azure uh so today I'll be
mostly talking about how you can build a
safe and reliable AI
agent so before I get started let's get
some of the terminologies standardized
um from my perspective AI agents
autonomous systems uh they can do their
own reasoning they can come with their
own workflow and they can uh call task
for doing some actions that they can use
tools to get um make API calls so task
are more specific actions uh they may be
able to use llms or they may also call
Rags or uh tools while tools are
functions which can be used to get data
from the internet uh if you have
databases it can going get data from the
database if you have uh service apis it
can call those things also and memories
are context which are shared within the
uh agents the task and the
tools to give a visual representation um
there could be multiple agents and agent
may have access to multiple tasks that
could be multiple tools and as these
tools can talk with apis and DBS so one
thing that you need to know out here is
um most of the the uh agent framework
today they are run as a single process
what that really means is the agent the
task the tools they are in the same
process that means if the tool needs
access to database that mean needs need
to have the credentials or if they want
to make API calls it needs share tokens
so those credentials are generally a
service user credentials that means they
are have super admin privileges and
since they all in the same process uh
one tool can technically access some
other credentials which is in the same
process similarly if you have task or
agents which has uh prompts all the
things that's running within the process
any third party Library they can also
access it so those sort of makes this
entire environment a little bit unsecure
right so there's a little bit of a zero
trust uh issue out here uh the agents
the task uh they talk to LM that if you
don't have a secure llm then that is
another area where these things can get
exploited um if you take agent on his
own by definition is autonomous that
means it will call their own make make
up their own workflow depending upon the
task so so that actually brings in
another set of challenges which we call
in security is unknown unknown so you
really don't know like what the agent is
going to do so it's very
non-deterministic
so because of this the attack vectors in
a typical agent is pretty high
considering from some of the traditional
software so what are the challenges
because of this so there are multiple
challenges so if you look from the
security perspective if the agent is not
designed or implemented properly that
can lead to unauthorized acces also data
leakages of your sensitive information
confidential information right safety
trust is also the biggest challenge uh
if you are using models which are not
reliable or if your environment is not
safe enough if someone goes and change
the prompts that can also give you wrong
results compliance and governance is an
interesting thing most of us are so much
busy even just getting the agents
working we are not even worried about
lot of the other things that are
necessary for making your agent
Enterprise ready so interestingly I was
just talking to one of our c customer
this Tuesday they one of the top three
um credit buau so they built a lot of
Agents but the biggest challenge right
now is to take it to production for them
they consider a AI agent as similar as
to a human user and when they onboard a
human us they go through a training and
they have lot of regulations they need
to adhere to right they have data from
California residents so they there to
make sure anyone who is accessing uh
California resident data they should not
be using for if the user is not given
consent they should not be used for
marketing purpose they have
international so if they are Europe data
so who can access those data there's a
regulations around it and also there are
a lot of regional regulations so when
they consider even a AI agent similar to
a human so they have onboarding process
they have a training process and they
want to make sure the agents are also
following the regulations right so
without that they can't go into
production and we as air Engineers still
in the early stage so this one of the
things which is of our radar right
now so now how do we really address this
thing right so those who are in security
associated with security compliance
there's no Silver Bullet uh the best way
to do have multiple layers of solutions
so these are some of the things that I
have in my mind like so you can split it
into three different layers the first
layer is what is the criteria to even
put your agent into production like what
are you need to do right uh we talk
about evals but most of them we only
talking about evals for how good your
models how good your responses is
alternating but you also need to have
evales which are more security and
safety focused so we'll go through some
of those things but the the goal of this
eval out here is to come with a risk
score and depending upon the risk score
you can decide whether you can even
promote this agent to the production and
the agent may not be necessarily you
writing it it could be a third party
agent so it has to go to the same
criteria the second is
enforcement um eval tells you how good
is your agent built and enforcement is
the one who actually doing the
enforcement or implementation so you
have to make sure you have a pretty
strong implementation if your
implementation is not good your ual is
going to fail essentially you can't go
to production and third is observability
uh particularly in the world of agent is
a lot more important because there's so
many variables involved out here like
you cannot really catch all of them
during development or initial testing so
you have to keep track of how how it is
used in real world and how you can react
on it so I will go through some of those
things in a little bit more
detail so uh let's start with the evals
itself right um if you look into
traditional software development there
is already a process there is there are
gating factors that tells you how you
can promote your application into
production right
so if you start with basic things like
uh when you're writing your code you
have to make sure you have the right
test coverage right when if you're
building uh Docker containers you have
to do the vulnerability scanning if
you're using third party software you
need to make sure you're scanning for
CVS if you find higher medium risk or
critical risk you try to remediate that
before you can get into production right
uh you do pen testing so make sure
there's no cross-side uh scripting and
other um
vulnerabilities the same thing applies
for AI agents also right you need to
come with the right use cases you need
to make sure you have the the right
ground through so that when you are
doing any changes you're changing the
prompt or you are uh bringing a new
library or new framework or new llm you
want to make sure your base line doesn't
change right uh if you're using third
party llms make sure they are not poison
they they have been also scanned for
vulnerability uh if you're using
thirdparty libraries which almost
everyone is using it make sure they also
meet your minimum criteria for
vulnerability right and similarly to pen
testing uh you should also do testing
for your prompt injection make sure you
are um your your application has the
right controls so it can block them and
most of the llms already doing it but
not necessarily all the LMS are doing
it the other evaluat about on data
leakage uh this also is pretty important
particular in the Enterprise world
because when you building Enterprises
You're Building agents which does
generally what a human would do right if
you're building a agent for
HR have certain functionality if I am an
employee I can request for um uh to get
my
salary
benefits but I can't do the same I can't
get for someone else but if I HR admin
there's a possibility I may be able to
access someone else's salary benefit
right how do you make sure your agent is
not leaking data there's no malicious
user who can exploit some of the
loopholes you have so you would have to
do this eval up front before even you
can put your agents in the
production uh similar to data leakages
unauthorized actions um most of the
agents even though
uh a read only there also now agents
coming which are trying to change things
they'll do some actions how do you make
sure those are also done by the right
person with the right
person and Runway agencies um those are
work on agents already know like the
agents can go in a tight Loop and for
various different reasons it could be a
bad use on prompt or just the prompt for
the task or the agents are not cannot
address those things so you have to make
sure you test for such scenarios before
you you put your agent into
production so the goal of this is to
come with the risk score at the end of
the day so that it gives a confidence
that can you put this into production
and the next one is going to be around
enforcement as I said your risk Cod is
going to be depending on how good is
your enforcement and particularly in
agents um you're working almost like a
zero trust environment because you are
libraries which can access anything
right uh if you are accessing certain of
your backend systems which have
sensitive data how do you make sure the
wrong user is not accessing it so uh
from the security control there are a
lot of other things which I'm not going
to talk today like uh detecting
projections and moderation but focusing
on Enterprise level thing uh you have to
make sure you have the right
authentication authorization uh this is
pretty important because when you look
at the
environment when a user makes a request
to agent it goes to task and eventually
goes to tools and makes the API call to
a service or a database if you don't
have a right authenication someone can
impersonate someone else and may be able
to steal confidential sensory
information and the second is the
authorization if you have the
authentication done properly then you
have to make sure the access control is
applied properly and this is also
important because agents have their own
roles and as a they can do certain
things so you have to make sure they are
not going beyond what they're supposed
to and at the same time if you have
agents which are trying to do something
behalf of another user then you have to
make sure that the user the that
person's role is enforced so if you're
accessing a database you shouldn't
access anything which the user does not
have permission to or making APA calls
so so that's why authentication
authorization are super important but
that um obviously there going to be a
lot of other the
issues approvals is interesting because
um in the traditional world we already
have workflows uh if I request for a
leave my manager will approve it it's
already built into the system but in the
case of Agents you don't need to have a
human all the time your agents can do
most of the things automatically right
so there's a if you do it design it
properly you could have another agent
which all it does is just looks for
approvals and making sure that results
are right and you can also put
thresholds how much this agents can
approve automatically and you can put
the proper guard ra to make sure if it
goes above a certain uh limit it can
automatically get a human in the
loop so uh just to reiterate this one
because it's pretty important is when it
comes to authentication authorization is
not just about doing the authentication
at the point of entry at the where
you're making a request
it you have to make sure the user
identity is propagated across everywhere
if you're making a calling a task the
task is calling a tool you have to make
sure the user identity is passed on to
the the the last point where it's
actually making a data access or making
APA calls and that point you have to
make sure you're able to enforce the
right policies and access
control and the third is um
observability so observability is is
pretty important in the agent world
because as I mentioned the traditional
software once you build it it gener just
works you just had to make sure it is uh
there's no new vulnerabilities coming in
because of some Library update or
something like that but in the world of
Agents um there are many different
variables involved one is the models
change very rapidly um you if you're
using a agent framework that is also
keep evolving right you're using third
party library that start behaving
differently um the another important
thing in an agent is is very subjective
to what the user is entering like you
may have tested with a certain
assumption mostly sunny day scenario I
want to apply for a my leave but the end
user may use entire different um uh um
text to ask the same question so how do
your model is going to behave with so
you have to keep monitoring to see if
the user inputs are anything that
changes how the responses are coming in
and also to make sure how much Pi data
and other confidential data is been sent
out because if you see some anomaly you
to be able to really to act upon
it
um the other thing is obviously you
can't monitor each and every request
right as a number of request increases
it's just not possible so you have to
start uh putting uh defining thresholds
and
metrics so what that really means is and
uh can start
calculating counting how many failure
rates are out there once you know you
have a certain failure rat which is
within your tolerance is fine but if it
goes above that you can automatically
create alert and look into it and the
failure rates could be because of uh Mis
Bing agent it could be a malicious users
trying to um um compromise the
system then anomal detection and is
another interesting thing I don't think
we are anywhere close to it yet uh but
this very common in uh the regular
traditional software in the security
side there always something like user
Behavior analytics where they look at
the user and see whether they have
within the um uh um standard operating
thing with agents coming in so there'll
be more and more of anomal detection
whether the agent is behaving within the
uh accepted boundaries and all those
things will end up with a security pro
cure so that will give you near real
time saying how good your agent is
actually performing in life so that
gives you to a bit of a
confidence so to
recap uh as I said there are three
things one is preemptive have a
vulnerability eval to make sure that you
get a right risk score which gives you
the confidence whether you can promote
the uh um agent to production if you're
using third party agent whether you can
use it in your environment um
second is proactive enforcement make
sure you have the right guard rails you
have the right enforcement you have the
right sandbox so that you are able to
run the agent in a secure way um make
sure you have the right observability so
that you know at real time or near real
time how good your agent is performing
and if there are some anomalies you can
go and quickly find tune
it so um just I said we open sourced our
um Safety and Security Solutions um it's
called page. a uh security and
compliance is a pretty vast field I
don't think any single company can do it
uh so we are looking for design partners
and contributors who can help us in our
journey so if you're interested please
reach out to uh me at boscat page. or
connect me in
LinkedIn thank you
[Music]
our next speaker will teach you how to
build AI coding agents that build
themselves please join me in welcoming
to the stage founding researcher of
augment code Colin
[Music]
[Applause]
flarey hi everyone thanks for coming
today so I want to talk to you about
something that sounds like science
fiction but very much is reality an AI
coding agent that helped build itself my
name is Colin I'm an AI researcher at
augment code a company building AI power
Dev tools for software engineering Orcs
And I want to share with you a little
bit about my our journey working on AI
coding
agents so zooming out AI Dev tools is a
fast changing space everyone remembers
in 2023 we're all talking about
autocomplete models GitHub co-pilot
being the one that probably really comes
to mind in 2024 chat models really
started to penetrate software
engineering
Orcs in 2025 though we think AI agents
are going to dominate the conversation
about how software engineering is
changing so naturally a few months ago
we started building our own agent at
augment I want to show you a sneak peek
of what we built and share some
hard-learned lessons about how this Tech
works and I just want to you know
reiterate R I've been really amazed to
see the extent to which this agent has
helped build itself uh I'll not a kind
of fun statistic so we have about 20,000
lines of code in our uh agent code base
and over 90% of that was written by our
agent with with human
supervision so what does it mean for the
agent to write itself implementing core
features so one of the first things we
had we had to add was third party
integration so our agent you know if
it's going to work like a software
engineer it needs to interact with slack
linear jira notion search Google um muck
around in your code base and so we
wanted to have the agent help us build
these
features uh we had it we found after we
added the first few ourselves when we
add asked to you know gave it an
instruction like add a Google search
integration it was able to go look in
our code base for the right file to add
it in uh figure out the right interface
to use and go add it uh one kind of fun
an anecdote is when we were adding the
linear integration uh it didn't know the
linear API docs the foundation model
we're using uh didn't have those
memorized and so it used the Google
search integration which it had written
previously to go look up the linear API
docs and then was able to add
that uh we used it to write tests so we
found if we asked it something like add
unit tests for the Google search
integration it was able to go add those
uh in order to do this we just had to
give it some basic Process Management
tools things like running a subprocess
interacting with it uh not hanging if
there's an infinite Loop in some test it
wrote and reading
output um I think this is super
interesting so everyone's seen the
Twitter demos of these agents writing
features and writing tests but I haven't
yet seen a compelling example of them
performing some kind of
optimization well over the course of our
project we noticed the agent was pretty
slow and we weren't sure why so we asked
it to profile itself and what it ended
up doing using all these tools we'd
given it was add some print statements
to its own code base run essentially sub
copies of itself look through these
print statements and it figured out
there was a part of our code base where
we were loading up all the files and the
users
repository uh synchronously and hashing
them synchronously and then it added a
process pool for these to speed it up
and uh stress test to confirm it was all
working and by the end of this we
reached about 20,000 lines of code and
again over 90% of that was written by
the agent with with our help in
supervision so let's walk through a
quick examples a couple quick examples
to see how the agent works I focus on
simple examples where it's reliable so
you can uh follow along
easily so here I asked the agent are you
able to search Google and then it notes
that it found a tool called Google
search for those who aren't familiar
with the notion of tools I'm sure most
of you are but I'll just kind of quickly
reiterate the idea is we have this kind
of Master Level agent that's doing all
the planning and it has access to
certain tools that it can use to
interact with it its environment whether
that's the third party Integrations I
talked about like Google or it's editing
a file in the user's repository and then
it wants to confirm this that this
Google Search tool is working so it
sends a query to it of tests and the
agent uh uh responds to us yes I can
search Google and I see the first 10
results let's try something a little bit
more complicated I ask it instrument
agent's Google Search tool with logs and
then generate an example then it uses
our retrieval tool which is you know
allows to search uh the local uh
codebase and it's looking for a file
related to Google search Integrations it
finds this file deep in our directory
hierarchy at Services integration
thirdparty gooogle search to.py and then
it calls its file editing tool to
quickly and performant edit that file to
add those print
statements uh this is a continuation of
the last example so it added those print
statements and now it wants to run a a
sub copy of itself so it can look at the
outut this print statements uh because
we asked it for example logs uh but in
doing so it finds that we don't have
Google uh credentials authorized so it
uses its clarify tool to ask for
clarification from the user it asks I
don't see Google credentials would you
like me to one add stub for Google API
or to guide you through setting up
credentials I note that the credentials
are actually stored in augment gooogle
api. Json it had just missed this and
then here's a a really cool extra
feature we have which is we want the
agent to continuously learn as it
interacts with humans and so here it
thought well it's probably a good idea
to remember where the Google credentials
are stored so it called this memory tool
to create a memory of the where where
the Google credentials are stored to
save that for later this is another
example if you have that really good
context engine uh it's really critical
to getting the agent to to work
well and so now we get our output so it
prints out these logs that it searched
with an example string Python
programming language and it gives some
uh uh example URLs that were returned by
Google python.org and Wikipedia.org so
we have the agent add logs to itself run
itself learn from user feedback and it
use all kinds of tools Google search
codebase retrieval file editing
clarification from the user and and
memorizing uh useful learnings
so let's fast forward and talk through
some of our lessons building
this uh I just want to knowe you know
we've been working on AI coding tools
for a couple years now and we didn't set
out to build agents we've worked on
things like completion models and Chad
and so forth but our Focus the whole
time was around building a super
powerful scalable Enterprise ready
context engine because we knew no matter
what no matter how good these llms get
you're going to need that context and we
also thought a lot about how do you
build great UI ux so AI can seamlessly
um interoperate with humans it turns out
this context ENT and all these thoughts
around design provided a great
foundation for us to quickly build this
agent in just a couple months the three
most important things were that access
to context that context Engine with all
those different types of context sources
whether it's slack or the codebase the
reasoning capabilities from a
best-in-class uh Foundation model and
that code execution environment so you
can safely run uh commands in a uh
customers
environment so let's talk through a
couple assumptions that we frequently
fall into we we've have frequently
fallen into and remedied and and some of
you might encounter as
well uh so the first one is that you
know L5 agents are here is the senior
soft agents are at senior software
engineering level if you look at the
Twitter demos it oftentimes can seem
like this you have an agent write an
entire website all on its
own in reality professional software
engineering is rarely zero to one and
the environments that we're coding in
are are a lot messier than what those
demos uh show you as a result these
tools you know aren't quite there yet
but they're still super
useful um the way PE one framework I've
seen people think through when they're
trying to figure out you know how to use
these agents and how to build them is
they think agents will take over entire
categories of tasks so first you build
an agent that will uh solve backend
programming and then you build an agent
focus on front end and maybe one focus
on testing in reality this technology is
very general purpose and so instead of
thinking about categories of tasks we
found it more helpful to Think Through
levels of complexity so our agents you
know kind of good decently good at tasks
across front end backend security and so
forth and we're we're improving the
capability level along all those friends
at once because again it's a very
general purpose
technology um we've we' also seen people
anthropomorphize agents so they think
they're just like human software
engineers and they map the
characteristics of a weak software
engineer to what they think a weak agent
would look like and vice versa for
strengths as well in reality agents have
different strengths and weaknesses in
humans and so you may have an agent that
can't do math but it can Implement a
whole front end feature way faster than
any human could and it's important that
we keep this in
mind uh let's talk through a couple
Reflections and
lessons uh so here I asked aaman can you
create a stack of two PRS for the new
reasoning module using graphite
unfortunately so graphite is a Version
Control tool for working with Git you
can like stack PRS it makes it a lot
easier to review unfortunately
Foundation models have not memorized how
graphite works so our general agent
responds I don't know what graphite is
so I'll use git and then it calls our
terminal tool to run a command run and
get checkout but what do we do here we
wanted it to use graphite we can't
necessarily go tell open AI or anthropic
to retrain their model understand
graphite overnight so what we came up
with was this notion of a knowledge base
which is essentially a set of a sort a
set of information that we want the
agent to understand that it currently
doesn't we can kind of patch holes um
one thing we wanted to add to it was
this graphite knowledge so we created
this markdown file describing graphite
how to you know run common commands
things like how to create a PR use GT
create some things not to to do um we
created other files in our knowledge
base for things like details on our tool
stack how to run tests the style guide
and then we added this into the context
for the agent so it can dynamically go
search in this um knowledge base when it
doesn't understand something and uh once
we added this then you know we go ask it
can you create a stack of two PRS for
the new reasoning module using graphite
and it calls that Knowledge Graph reads
about graphi and then can run the GT
create command so what learning here
well onboarding the agent to your
organization is crucial the analogy I
like to think about is if you just hired
a new a new hire software engineer you
wouldn't go tell them to just stare at
the code base for three days to figure
out how your Tech stack Works you'd let
them ask you questions maybe they're
some things they didn't understand and
you add some additional documents to
your notion uh we should think similarly
about
agents uh recall I was talking about how
we had all these uh third party
Integrations we added whether it's
linear tools or slack tools and so forth
well when we were working on these we
weren't really sure of which ones to
prioritize and start with on our product
road map in a normal World we'd make
some educated guesses we'd Implement a
couple of them and go from there but
with the agents we were able to iterate
them um uh build them all at once and so
this starts to change the calculus
around how product management works uh
if you can build everything at once well
then maybe um maybe uh engineering hours
aren't the bottleneck on what we build
and it starts to uh we start to be
bottleneck a little bit more on good
product insights and good
design so when code is cheap you know
you can explore more
ideas uh also recall earlier we were
talking through this example of you know
instrumenting the agent's Google Search
tool with logs uh and it was able to go
find the file to edit notice here how we
didn't have to give a very precise
instruction to the model we just told it
in natural language like how we' talk to
another engineer to instrument the
agent's Google Search tool
and I was able to go figure out the file
to edit this only worked because we had
that really good uh codebase
awareness um we can also use the agent
for tasks outside of writing code but
still within the software development
life cycle so here we asked it to look
at the latest PRS uh in our codebase and
generated an an announcement on them and
then we posted it to slack and so uh was
titled new tools for this uh CLI agent
and we talked about some things around
slack notifications and linear uh linear
Integrations this only works because we
had that slack integration and and
understood our code base
well this uh figure may look familiar
from the beginning of the talk um we
actually had the agent make this as well
so we asked it make me a plot of the
interactive agents line of code as a
function of the
date um and so good context is critical
in all three of these tasks we needed to
pull in some different context from some
different sources and it's not just the
codebase context comes in many forms
and also note that it's multiplicative
so having access to the code base and
having access to slack is forx is useful
as just having access to one of
those finally I want to uh switch over
and talk about uh
testing so uh here's a really a hard to
test Edge case in our code the agent
actually wrote this um and we only
caught it because of some unexpected
runtime Behavior so we have these caches
that the agents store relevant
information for their runs in uh we can
run multiple agents in parallel and they
all write to the same cache location and
the agent wrote this save function to
save to that location and I had this
lock around the Json dump so there were
no raise conditions that would
explicitly fail if you had multiple
agents all right into this cache at the
same time but notice here how there's no
read before writing to the cache and as
a result you could hit a race condition
where multiple agents are running in
parallel they're all overwriting each
other's caches and when the agent wrote
this save function why did it Miss this
issue well these agents make mistakes
and this is a hardto test situation
there's some parallel programming
there's a cache involved and so we
didn't have a test and because we didn't
have a test the agent messed up my
learning here is we need to be very
careful about having sufficient
tests um we have this pretty incredible
statistic so we have a internal bug fix
bug fixing
Benchmark uh we found when we upgraded
our foundation model by about six months
our scores in this Benchmark improved by
4% but when we added H the uh ability to
run tests so the agent could suggest a
fix for the bugs run tests look at the
feedback suggest another fix run test
and do that four times that led to a 20%
gain on this
Benchmark so what's the lesson well
better tests enable more autonom
you can trust these agents more and it
just makes them
smarter what does software engineering
look like in a world of Agents well
agents didn't work last year but now are
pretty good if you'd asked me two years
ago if we'd be working on this Tech I
frankly wouldn't have guessed it there's
a compounding effect where these agents
are staring starting to help build
themselves and that's only going to
accelerate the pace at which they
improve code isn't going away way
because it's a spec of our systems but
our relationships to it is changing good
test harnesses are becoming more
important than ever and we need to be
especially careful about those parts of
our code bases that tend to be less well
tested and the calculus of product
development is changing if code becomes
super cheap to write then the focus our
focus is more on good product work
Gathering customer feedback quickly
building
insights we're really excited for how
this Tech's going to positively
transform our industry and we'll be
releasing our agents soon so I'm really
excited to share that with you uh find
me after the talk if you want to discuss
any more thanks
[Applause]
[Music]
ladies and Gentlemen please welcome back
to the stage your MC for the leadership
track session day Peter
[Music]
Humphrey all right folks thank
you uh Colin thanks for calling for I
mean pretty amazing to see AI software
building itself I think he was right
just uh sounds like science fiction
doesn't it um all right well everyone we
are truly off and running here uh We've
kicked off the day with AI market trends
setting your AI strategy we've heard
about AI security and AI safety even
just now ai creating itself okay so now
we're going to take a 30 minute break um
if you want to discuss anything talk to
the speakers have some question and
answer U go to the one of the three QA
lounges the speakers have spread across
those three uh there's not too many to
to choose from so you'll be able to find
them pretty easily and then again also
for some birds of a feather style uh
discussion if you want to talk about any
of the topics you've heard uh this
morning just be friendly go introduce
yourself say hi to folks uh and uh and
and have some time interacting um also
please do make some time to uh the stop
by the sponsor Expo which is open now uh
coffee and snacks are being served there
uh our sponsors again are a huge part of
making Gatherings like these happen and
they have amazing products and
technology and services to help you with
your journey okay we'll see you back
here for the resumption of leadership
day at 11 o'clock sharp thank you very
much
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
he
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
a
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
n
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Applause]
[Music]
[Applause]
[Music]
n
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
n
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Applause]
[Music]
to New York
ladies and Gentlemen please join me in
welcoming to the stage your MC for the
leadership track session day Peter
[Music]
[Applause]
Humphrey all right welcome back everyone
uh hope you got a round of coffee and uh
strap in grab a helmet our next Sprint
of sessions is pretty action-packed okay
we're going to talk about retrieval
augmented generation and data pipelines
we're going to talk about exactly uh
always a popular one we're going to talk
about AI in the software development
life cycle this is a this is one I'm
pretty excited about how does it impact
what's going on with the traditional
software development life cycle that's
something I definitely am here to learn
myself uh and then a little bit about AI
productivity internal agents and then of
course we're going to have some speakers
from no no one less than open AI uh so
with that please join me put your hands
together in welcoming our next speakers
Steph chin who's the VP of developer
relations for Neo forj and Jonathan low
senior director of operations and
insights from fizer thank
[Applause]
[Music]
you hey so it's so great to be back in
New York City actually I grew up nearby
here and pleased to be co- speaking with
Jonathan thank you Stephen good to be
here and you know we're here to kind of
talk about leadership talk about how you
can actually do a bunch of the things
you've been hearing in practice we're
going to talk about strategy we're going
to talk about
technology but let's start with analysts
who who here trusts Gartner when Gartner
says something they're predicting the AI
wave they're predict okay nobody does
nobody no hands sweat up in the room for
the record but
when they are predicting failure and
catastrophes I I I try to trust that
right so last year they predicted 30% of
generative AI projects are going to be
abandoned by the end of
2025 now anybody in the room this is a
really real honest check has anyone been
on a failing gen project okay now Brave
SS amazing give give those guys around
that took a lot of courage
now to make them feel a little better
who who hasn't yet got to production on
their gen
app okay so the rest of the hands went
up right so this this is the
challenge so we all want to be
successful gen we all want to do amazing
things we're getting asked to do amazing
things but we we need to have the right
way of approaching this in our
organizations with leadership to sell
this internally to to build it on
Technologies which they can understand
and the the vision it's it's hard to get
a vision that's technically achievable
when the guy at the head of the table is
is this is this guy he's he's the
executive who's heard about J now his
kids are using it for their School
courses and he's like oh yeah yeah just
it solves all the problems insert here
success I wanted in production in two
months now um I think the great thing
about having having Jonathan as my
co-presenter is that he's actually done
this and
a big Life Sciences company and he's had
to navigate all of this um leadership
challenges organizational challenges
silos to build a system which actually
is something we can take to production
so tell us a little bit more about that
Jonathan thanks Stephen now as I've been
introduced Jonathan low you may know me
as Jonathan when we're out in the
hallway or when I give you a bit more
information about my experience
launching Jen AI based capabilities in
business you may think of me as Debbie
Downer AI is so exciting until the
singularity
so that's how actually approached the
problem I'm about to explain to
you but it it actually worked so the
business case was technology transfer
which means in biofarma scaling up from
Lab Ben think beers and human scale drug
development to Industrial scale making a
million doses a day and to get from that
lab bench level to multiple factories
around the world making lots and lots of
product very quickly takes years because
the industrial people that build the
factories and build the equipment need
to sift through hundreds of thousands of
documents and notes and test outcomes
that were created at the science
level another challenge with doing that
is I'll go to a statistic in
2019 a study said that the average
tenure of manufacturing workers tenure
being how many years they had spent in
their companies was about 20 20 years of
average tenure what do you think the
average tenure is in manufacturing
companies
today the study said three years so
we've gone from 20 down to three and all
that
expertise has has or will soon be
retiring because the boomers are growing
old so we really need generative AI we
need a machine to take a lot of the
intelligence that's captured in
documents or even in tacet people's
heads and get it to the new people
showing up to do this technology
transfer so we take all these millions
of documents and we've loaded them into
a
graph now we haven't necessarily
loaded the document into the graph we've
loaded the chunks into the graph and one
of the things that we really liked using
the graph to accomplish was we
structured the chunks the document the
block the paragraph the line the because
we wanted to understand when we searched
for those chunks with similarity search
which ones really returned the results
that people wanted the most we wanted to
really refine how we stored and managed
the chunks so at this at this point it
was it was a totally new space and
because we were able to structure in the
graph that level of chunking we were
able to eventually learn and get better
and better at how we chunked the
documents in the first
place yeah so I think what's really
really amazing for me about this is um
we were talking about business
challenges and like like projects
failing and in the study that Gartner
did that the biggest failure mode was
not having a business use case which
would actually solve real problems and
then be monetizable and um this is not
only like a great business use case but
it's also something which potentially is
Saving Lives because you're you're
getting life-saving drugs to folks
faster you're able to accomplish this
quicker
but the problem is always the humans in
the middle right so the teams you work
with probably have a little bit of
gen not invented here syndrome where you
come along with this great solution like
I'm going to use graph rag am I going to
load all these documents in the my my
you know my my big um store and they're
like no no no I we've SE This research
paper we watched this talk there's some
other platform we want to use there's a
mother
framework um or maybe it's maybe it's
too expensive I mean compared to Classic
Computing and cloud computing gen
architectures have the potential be much
more expensive if they're not well
architected and in general are going to
re increase the cost of the organization
so how do you convince people to go from
a system which is is working but not
working well enough to a much more
expensive system which is R&D investment
Redevelopment to go towards a gen
architecture so what are what are some
of the challenges you hit internally and
how did you address that
viser great so for this
one it's more of an entrepreneurial use
case within a big organization I wonder
how many of you have worked in
organizations with 50,000 or more
people a lot of hands going
up uh my current organization has over a
100,000 people I've also worked at IBM
deoe big organizations
and if you are like me in these
organizations you'll be that little red
guy going like this with the with the
light bulb over his head saying I have
an idea that might help the company and
I have a team of X number of data
scientists and developers and sres and
we can bring that value that capability
to the
company if you're like me if you're that
red
guy who's the first group of people on
this slide that you're most interested
in connecting
with
well you go for the top you're better
than I am I I joined this whole
profession because I love building
applications that Delight the people
that use them so my instinct has always
been to go to the bottom first and say
to those users hey do you really want
this tool and what are those users going
to tell you the like your tool
if that's right what makes it good it
takes away boring stuff that they don't
want to
do but it can't just take away boring
stuff it also has to give them accurate
results it also has to work in a
performant way they can't push the
button go get coffee and come back and I
feel like that's the easy part right
more and more these days you can build
accurate fast applications quickly so
where's the real challenge so somebody
said you go to the top first what's the
likelihood in a company of 50,000 to
100,000 people that you're going to meet
the CEO if you're the guy with the idea
at the level four of the hierarchy the
likelihood is pretty small did anyone
here ever see the movie Dirty
Dancing Dirty
Dancing maybe do you remember the part
in Dirty Dancing when baby the lead
leading woman in the in the movie meets
Johnny the amazing dancer for the first
time and she she's uh she's unable to
speak she's so flustered and finally she
blurts out I carried a watermelon and
then off he goes and she goes I carried
a
watermelon two weeks ago I stepped into
the elevator on the seventh floor of of
the headquarters of of my company and
there was my CEO in the elevator and I
felt like baby in Dirty Dancing I
couldn't think of what to say I locked
up for he's a good guy he broke the ice
I'm just back from vacation rolling up
my sleeves can't wait to get to work
what are you up to and then thank God
ding we got to his floor the doors
opened and out he went and as he went
out the door I blurted out not I carried
a watermelon but I'm working with
llms off he went so when you're trying
to promote your work within a big
company like this it would help to know
what that executive is trying to
accomplish and the way that he gets to
that point is he talks to Consultants
who say let us tell you how to be a
leader in your industry and not fall
behind the competition so an example of
something that an executive at that
level might do is create a purpose
blueprint or something like that name
and the number one message has to be a
few words and convey something that the
whole company can follow so an example
of that might be change a billion lives
a year
in life sciences a big aspiration now
why do you have to care about that in
the elevator maybe you'll reference it
I'm changing A Billion Lives a year with
the most amazing AI search engine Bing
and off he goes but that that message
that he gives trickles down to the next
level the chief digital officer the
chief scientific officer the chief
Supply officer what do you think they're
going to
say they're going to try to take his
message and turn it into their specific
flavor so the digital officer will say I
want to lead the industry in Ai and the
scientific officer will say I want to
take on the world's biggest
diseases and the supply officer will say
I want to accelerate Supply still very
high level and you probably won't meet
these people either who will you meet
though you'll meet their level twos and
their level threes and what are they
going to
say at this point they don't really say
tagline
instead they say I want cost savings I
want cost
avoidance I want earlier realized
revenue or I want more balanced
headcount so when you're talking to
these people your slides have to have
numbers and times and your promises
about how your tool or your capability
or report or whatever is going to me
meet those times and those numbers
now you may not get to meet them either
if your big company has a a form of um a
role called the client partner where
your digital people talk to the client
partner and the client partner talks to
the business then that's the other
person you have to convince and the
problem with this is that the client
Partners tend to stay within their
particular departments there might be a
client partner who works exclusively in
R&D or one who works exclusively in
Supply what would they say sometimes
they don't say the same thing one of
them might say R&D already has five or
six or 10 search engines why build
another or they might say search engine
is a great idea why don't you
incorporate that capability into every
tool in the supply organization so
either your scope goes to nothing or it
goes to everything and you need to be
able to negotiate and navigate that are
you done if you can satisfy all those
people and cross through all those
gauntlets well no because as you're
starting to build the vendor comes to
you and says why build in house when you
can buy our tools and they've been
talking to the chief digital officer
about build versus buy and which one is
more economically realistic and
appropriate well maybe you get through
that and then you're done
right who else would possibly stand in
the way
of your incredible AI Search
tool Friendly Fire is the answer your
own colleagues either a level above or
at the same level may say dude I was
here first AI search is my terve or they
might just say hey that client partner
over in Supply is right can you please
integrate with the stuff I've built so I
guess my message is we've heard a lot of
talks about failure and Challenge and
Garten are not liking this it's an
incredible time to be in this in this
amazing industry in this amazing change
in both for me life sciences and more
generally for the information technology
industry and I love that we're hearing
all this concern about failure because
it just means we're at the beginning of
a really exciting time but as
representatives of that my advice to you
is know your audience personalize for
all of them and get your human wetwear
chatbot speaking the right language at
the right
level now that's that's amazing um we've
chatted out a bunch of these challenges
right so we've chatted about getting a
good business use case that can actually
provide values the organization how to
navigate like like peoples and and
different failure modes um within the
organization where the organization has
a huge quantity of people who can be
your allies or can work against you
depending upon how you work with them
but it's also a technology problem you
have to have the right technology to
solve your use case now one of the the
biggest challenges I think a lot of us
who have been building Rag and
Enterprise applications has been the LMS
themselves fighting us with with
hallucinations this is getting better
with newer models um it's getting easier
to feed the right sort of information in
with Vector databases but I think that
you've chosen a rather unique approach
using graph databases why did why did
you choose to use a graph database for
your implementation at
fizer well
um there are a lot of things that graphs
aren't good at things like genealogic
sequences of recipes or social networks
or hierarchies or time series and all of
those applications were prevalent
opportunities within fizer so that was
the that was the original impetus for
using a graph but I also discovered that
the more data we Consolidated in the
graph the faster my data scientists and
engineers and developers and sres were
able to understand the data landscape
what used to take three months to
consolidate understand clean up took
three weeks or less for for a new
project so I know the reason a lot of
people take on graph is because
traversal becomes so much easier in
terms of data search and uh and and
performance gets better but I've found
that team performance also took a really
big Boost from using that Tech cool and
for folks who aren't familiar with with
knowledge graphs and LMS or or graph rag
um this isn't a new idea although I I
would put you guys on the early adopter
where you're actually in production now
with something that uses this but U
Microsoft kind of wrote the seminal
paper on graph Rag and used it basically
taking existing documents llms to chunk
it into a graph and then showed Superior
results coming out of it it on the
spectrum of Technologies using LMS
directly you can get good results but it
lacks that context it lacks that
Enterprise knowledge using a vector
database or or Baseline rag you you can
get better results where now it's
actually pulling in organizational
knowledge but the answers tend to be a
little bit generic there's a lot of
hallucinations um graph rag kind of
pulls us to the end of the spectrum
where now you're you're getting answers
from that that Knowledge Graph you built
you can evolve over time and much more
precise answers which actually get to
the heart of of real problems in in life
sciences and Manufacturing and business
critical Industries where you can't
afford to be
wrong and also where in in industries
that are complicated if there are a lot
of connections that might not appear in
a relational database because no one
bothered to make the joins permanent
whereas in a graph those joins are there
to begin with so if you search for one
thing suddenly the neighborhood of
related stuff becomes available to you
to share with an llm for better
contextual
knowledge yeah and you know I think just
if folks are implementing this or folks
are thinking about how to how to think
about architectures for graph rag um
this is a really simple way of thinking
about it so basically what you're doing
is you're taking your gen
application and you're doing both a
vector and the knowledge graph
representation of the data so you're
both ask asking the vector for the
answer you're getting relationally close
nodes from the graph database where
you're getting additional context and
passing that into the LM and then this
gives you more contextually relevant
results coming out of your your expert
system so I think this is a great way to
use a Knowledge Graph either that you
built up over time or that you have the
LM construct to kind of get those
Superior results where you could do
better governance you can put controls
and property on the graph nodes to
control who has access to the
information you can get better
explainability now because when you're
getting an answer from the LM you're no
longer looking at statistical
probabilities in the vector space you're
actually looking at graphs and nodes and
edges which we can reason about and we
can start to understand the relationship
between understand like what what things
are related to manufacturing which
things are unrelated to that they're
just you know general terms
and for the right
application maybe we're saving lives
getting drugs to people more quickly and
using gen for a good cause so thanks so
much for joining us for our presentation
at AI engineering Summit and um
appreciate everybody thank you
[Music]
our next speakers integrated AI coding
agents into the largest travel site in
the world here to tell us how is Bruno
[Music]
pasos how's everyone
doing all right it's been it's been a
fun morning so far it's great to see
like a a huge uh range of Faces in in
the audience everyone building software
from Big to small uh so my name is biang
uh I'm the CTO and co-founder of a
company called Source graph we build Dev
tools for uh big messy Co bases yeah and
I'm Bruno Bruno pasos and I lead uh the
product site of developer experience at
booking.com and um yeah over the past
year I've been overseeing the um the
geni Innovation side of booking as well
cool and today we're here to talk about
uh how we're partnering to build
software development agents uh that made
a bunch of toil inside booking that are
actually having real Roi and
impact so how many people have heard
this before you know you're the you're
working inside a large company the CEO
comes in and says like hey we need to
adopt AI uh and then folks are like okay
uh what does that mean you know how do
we measure it you know maybe like fomo
purchase co-pilot or something like that
uh and then 6 months later uh someone
else maybe the CFO is asking you hey so
what's the r I of of that AI tool we
just adopted or you know what's the
measurable impact of the agents that
that we're building um this is a
question that I think a lot of people
aren't quite sure how to answer right
now but Bruno and booking have been sort
of on the Leading Edge of answering this
question uh and very ProActive at uh
acquiring and building the best tools
and also following through to
demonstrate uh how they're actually
impacting their
org it's very kind of you to say we are
we are leading this I think we are we
are right at the beginning and I feel
couldn't be couldn't feel further from
from actually the Forefront of it uh but
let me let me start by talking a little
bit about booking um I am sure uh the
most of you would have heard about this
company uh our goal is to make easier
for everyone to experience the world and
my team's goal is to make sure that our
developers have their path cleared so
that they can do their best work now are
we close to that in some parts of the
company yes other parts we couldn't be
uh farther away from it
to get to set a little bit of context uh
we are one of the largest uh online
travel agencies in the planet um and we
serve about 1.5 million room uh nights
uh um with more than 3,000 developers uh
can you raise your hands uh who work in
a company that has more than a thousand
developers quick show of hands good
number of people okay uh on the more on
the on the dev side or on the technical
side um we serve over 250 50 merge
requests uh at a given year uh with 2.5
million CI jobs running at a given uh
year as well and we are extremely data
driven um our company has gotten to
where it got to over experimentation and
being obsessed about data and the reason
I'm going into this is because as we
experiment and in the form of primarily
AB tests we start adding those
experiments and and feature Flags into
the code base and as we push forward to
bring new features to our users uh most
likely those experiment Flags or dead
code will stay in the code base and now
fast forward decades our code base
became extremely
bloated uh fun fact I was uh my kids
were looking at me uh editing this slide
and they said what are feature flags and
I said well um you know they stayed the
code base and they sto polluting the
code base and they were like like code
farts and I'm like now you're going into
code smells it's a different topic but
let's uh let's move forward um and so as
the the code Bas starts to blow tap and
become bigger and bigger cycle times
also become uh larger and longer and
they the time that developers spend to
debug and to work on that code base just
becomes over 90% toil right who here is
familiar with
this that's even more HS than than than
than a thousand developers and so we
survey our developers at least a quarter
uh on how they're feeling how they're
how they're they're feeling about
working on that particular code base and
it's it just becomes harder and harder
for them to do anything and so we had to
do something about it so I've seen the
best developer Minds My Generation
destroyed by decade long dead feature
flag
migrations it's
crazy L Channon actually say that
but I mean seriously though like they're
probably like Geniuses out there like I
was talking to someone from PWC the
other night and described the system
that they're building to like update uh
all all the kind of like Legacy code in
their system and it was amazing like the
guy was really smart really brilliant uh
really like interesting Tech but
wouldn't it be great if you know those
sorts of Minds were unlocked to actually
work on like you know new features and
thinking about like user problems rather
than all this kind of like Legacy craft
and so in a nutshell that's why Source
graph exists uh as a company so our
mission is to Mak building software at
scale tractable uh and so you might be
familiar with a couple of the the
products and tools we built uh along the
years uh code search it's kind of like a
Google for your code allows any human
developer to find in and and build a
working understanding what's going on we
have a tool for large scale refactoring
and code migrations uh you might have
heard of our AI coding assistant Cody
it's a context aware uh code generator
that's tuned to work well in large messy
code bases uh and the topic of this talk
is really about the agents that we're
building to automate toil out of the
software development life cycle so a
bunch of different products that we
built over the years the unifying theme
really is to uh accelerate things in the
developer in Loop augment human
creativity there and then to automate as
much of the BS out of the outer loop as
possible all right so um SB young talked
about uh Source Graph Search just over
two years ago we started using their
product and there was a big success
within our uh our community because they
were able to search that bloated code
base much much easier and find small
pieces of context lying here and there I
I totally encourage you to have a look
at this uh particular product is awesome
um and so about a year ago uh January
last year we started experimenting with
Cody and why because Cody also has has
Source Graph Search as context and so it
became extremely useful for us to use a
uh tool that had that context to be able
to experiment with the the Gen topic and
now we are hoping to reach the path of
uh uh building agents with Cody and and
Source Graph Search uh uh built
in all right so um if I summarize very
quickly and hopefully this illustrates
how fast things are moving uh uh forward
in January we started um with Cody we
gave everyone one the ability to start
using the tool in the company so all our
3,000 developers uh um had the the
opportunity to use it some started using
it some uh uh uh used it didn't see any
value with it and then stopped using it
and that started intriguing us and so
back then right in the beginning of the
year we had the choice of one llm to use
across the entire company and some token
limit uh uh um uh limiting what we could
do with it and so the first thing that
that we
started pairing with Source graph and we
appreciate the partnership on that was
to remove all the the guard rails that
we had in order to be able to really
give it a go and so Source graph was
very quickly to be able to give us
multiple llms per developers we could
choose that and why that was important
it's because we found the llms had
expertise right and so if we were going
to excavate our code base our bloated
code base a particular llm would do
better than someone that was working on
a completely new uh uh piece of service
and and developing features there and so
fast forward to July um we started
training developers and that became
incredibly important because the people
that started using and didn't see the
value when they started getting trained
they started using it and falling in
love and becoming what we call that now
daily users and I'll explain how why
that's important um and then we started
looking into more metrics back in
January the main metric was I was saved
and um I mentioned that we are a data
driven company and I was saved wasn't
the most statistic relevant metric that
we could use it was based on Research
only over a couple of developers a few
developers and uh that wasn't cut um
raise your hand here if you uh heard
folks out there in the beginning of the
hype talking about thousands or or 80
100,000 hours they saved with Gen has
anybody ever heard that and then you go
back to your company and say why are we
not doing this I call that semi BS uh uh
uh and so we had to start going into
other metrics something that were more
statistically relevant and so we started
brainstorming with that come October
October we Define new kpis which I'll go
deeper into it and metrics to measure uh
to measure gen and fast forward to
November end of last year we then
started finding traces that developers
were 30% plus faster if they were using
Codi on a daily basis and that's 12 plus
day in a month to take away weekends and
the times that they they were not coding
and most importantly we were able to
partner with Source graph to be able to
create an API layer in front of cod so
we could be creative in using with some
of the tooling that we use like slack
jira and and being able to extract some
of that away from the
ID all right so as we as we we finish
around October we started looking into
those kpis and what was important for me
is that we defined something that we
could measure within a year why because
things are moving so fast and if we it
was really helpful to ground as to what
can we measure within the next year and
so we defined four kpis the lead time
for change quality code basing sites
that would then go into how we could
modernize some of our bloated code base
and so some of the metrics uh uh when I
say short mid and long term these were
metrics that we could see results in the
short term in the midterm and in the
long term and that longterm is precisely
a year and so we started seeing results
with time to review am Mars developers
that were using Cod on a daily basis
would ship 30% more M than the ones that
that didn't and one very interesting
piece is that their theirr were lighter
they had less code in it which I still
don't know what to make out of it but we
are we are working on it and then on the
quality side of things we are hoping to
go into the vulnerability can we show
some of the vulnerabilities we've had in
the past give the context the code bases
context and try to see where we can
predict whether new vulnerabilities will
uh will appear or if they're still
lingering our code base
and then we started the obvious one is
test coverage can we increase test
coverage can we create test coverage on
the Legacy so the new stuff when we rep
platform passes that that particular set
of tests and then we went into coding
sites which is more related to like can
we track what parts of our codebase are
not being used some feature flags that
are still lingering but shouldn't be
there and the code that is not
performance enough and all of this is to
feeding to our ultimate goal which is
can we bring the time to rep platform
our codebase
from from yeah years to months
right okay so while all this is going on
one of the things we noticed is that the
same Engineers that were using the the
like coding assistant to generate code
were also playing around with the
underlying apis and so what we realized
is that like asking people to customize
prompts leads to them wanting to build
and compose those calls into longer
chain automations that we now call
agents um there are a lot of pitfalls
that we encountered uh uh you know in in
the early stages of this like helping
people understand what the expectations
were with respect to what the the LM can
do and what it can't do but the long
story short is at some point we
basically said F this it's not really
working let's just like put our brains
together you know fly out to Amsterdam
we'll do like a weeklong joint hackathon
and build some agents
together and so the first thing to come
out of that hackathon was this thing
that uh generates graphql so booking has
a a huge graphql API play the
video it uh it's seriously like more
than a million tokens
long and so it does not fit into the
context window of any of the existing uh
llms even if you could shove it inside
context it's not going to do a good job
of of integrating that context into
something that's coherent a ton of
hallucinations and so what we did is we
built this system that basically
searches this very very long graph P
schema finds the relevant like nodes
where wherever they are in this like
schema tree uh and then uh agentically
figures out which ones are relevant and
then walks up that tree to pull in the
relevant parent nodes and so on the
right hand side you can kind of see it's
like inner dialogue this is like it's
thought process for uh reasoning about
which nodes of the schema to pull in and
then uh after it's done that reasoning
it generates a response and so if you do
this naively you know the UI looks very
similar but you just end up getting
garbage which is what we were seeing you
know before we ran this hackathon after
we sat down and and and actually worked
through like the specific prompts and
stuff uh to make this work well we saw
far better
results all right so um a pretty
interesting one that uh uh that we
started uh uh working through in terms
of Agents were the automated code
migration could we go into that Legacy
piece functions with over 10,000 lines
to give you context and uh uh and speed
up that rep platforming efforts and so
uh code search structure it structured
met meta prompts uh and then the the
concept of dividing that particular code
base to conquer the small bits were were
really uh really
interesting um one of the things that I
totally recommend if you started to
embark on on on a journey like this is
pairing with some experts to bring that
expertise into into your offices was
incredibly valuable to us and we started
seeing uh back to when I mentioned that
the developers were using Cod and
stopping and feeding back doesn't
doesn't add any value was pure lack of
knowledge folks didn't know how to work
llms out they didn't know how to pass
the right prompt in the right context
and this was uh a pretty important piece
for us to be able to uh to work on this
particular uh agent and so when we go
into this we had developers working for
months at this point to try to figure
out the size of the problem that we had
to then be able to divide and conquer
and then we came within two days within
a hackathon we were able to really
Define uh and understand where the the
call sites were coming from and then
being able to Define how big the problem
is was important for us to be able to
have a start point and then collect the
low hunging fruits that were available
for us so U all of this is still in
experimentation uh mode uh but we' seen
a lot of values and a lot of uh uh sort
of like fire in that smoke in going from
mon
uh of of understanding the code Base
today cool and so the the last agent
that really came out of this joint
effort uh was targeted at code review so
this is something that we found is is
pretty Universal across many different
Enterprises like everyone who does not
do code review here one hand okay uh
I'll talk to you later sir um so like
everyone does code review and what we
found like originally we didn't think
this was a very interesting space
because there's like two dozen startups
now that are popping up that do AI code
review but when we talked to booking we
talked to other Enterprises what we
found is that like code review is kind
of like very specific to your
organization there's a long tale of like
rules and guidelines and other things
that uh you want to bake into your
review process and a lot of the tools
that are off the shelf there aren't
super customizable and so what we built
is this interface uh where we're going
through and productizing the the process
of building a review agent that's
tailored to your team and your
organization so the the basic idea is
that you define a set of rules that you
want to hold in the code and then those
are defined in kind of like a simple
flat file format and then the agent will
go and consume those rules apply the
relevant ones to the specific files that
are modified in any given uh PR and then
uh very selectively post up comments uh
that are tuned to those rules so it's
very not noisy uh we're trying to
optimize for you know Precision uh over
recall here in in the feedback that
we're we're giving uh the the the
developer
all right so knowing what we know in the
beginning of this year we've been
working on this for a year together uh
uh with Source graph then a few ideas
started uh popping into our minds of how
we could go forward here and uh one of
the things that I'd love to leave you
with is the concept of declaring what
are the rules of your service right so
think think of your CI pipelines today
uh when they give you errors could we
anticipate and shift this left to the
IDE so those errors appear there and
they appear in the form of here is an
error and here is a fix so hopefully the
service gets to a point where it's
self-healing right and we started seeing
that we could do that that there are
there are there are areas that we can
start using giving the all the context
all the prompt that the developers
started creating via the prompt uh
library that we created and uh asking
those questions Auto automate those
questions to the server to see what
comes out in terms of knowledge uh um uh
out of that code base and so we think
this is ultimately what we we are trying
to achieve within I would say a short as
short as the end of this year in terms
of uh agents um but lots um Lots here to
to go can I sorry can I just say one
more thing about that last uh slide um I
think that we have the potential here to
solve one of the problems that has
plagued software development since it
Inception so you know who here has read
the mythical man month before so yeah
basically everyone so like it's this
problem of like any software that
becomes successful eventually becomes a
victim of its own success because if you
have Revenue if you have users that's
going to generate feature requests bug
reports any business that's prioritizing
that is going to take on Tech debt to in
order to compete quite frankly and over
time as you add contributors to the code
base you lose this cohesion of vision
you lose the set of standards that you
want to maintain and hold uh with
declarative coding now you can have like
the senior engineers The Architects the
the the people in charge of the
organization Define constraints and
rules that must hold through the
codebase and enforce those rules both at
review time as well as inside the editor
for you know the code that's written by
human or AI yeah for bigger organization
all your compliance rules all the things
that that the developers need to work on
but it's not necessarily feeding new
features to your to your end users I
think those could are perfect examples
of um um yeah being declared into your
service but anyway the main important
thing so far in this past year that
we've been uh um you know pairing to be
able to figure this out has been
education the more we educated the
developer and hand holding entire
business units to be able to show them
the value but then have them experiment
within two days of like workshops and
hackathons have them experiment with the
tool they were coming out the other side
incredibly passionate about what it can
do but also becoming that daily users
that we are trying to transform them so
hopefully to defend that 30% plus
increase on speed so educate your folks
if you take one thing from this is
education and if you want to dive deeply
into any of this we got a booth
downstairs feel free to stop by we'll
talk shop or uh also tomorrow I'm given
an expo talk that covers some of the
more nitty-gritty details of how some of
those agents were implemented so thank
you thank you all
[Applause]
[Music]
our next presentation is about building
trust in Enterprise AI please welcome to
the stage the co-founder and CTO of
writer Wasim
[Music]
Alik hello everyone my name is Wasim I'm
one of the co-founder on C a
trer today I'm going to just tell you a
quick story about actually why we
building a trer what we doing but before
we dive in I would love just to give you
quick history of writer so writers we
start the company in 202 we love to say
the story of writer is the story of the
Transformer we stting building those
decoder encoder model in the early days
and we start we kept building those
model and build a lot of them today we
have a family of
models I believe around 16 we published
we have another 20 coming in the way and
we keep building those models and you're
going to see from this list those model
com in two categories General model like
p x p 3 4 if you have a b 5 coming soon
and we have a lot of what's called
domain specific model Creative Financial
Services B
Medical now
early 2024 basically last year almost we
start seeing this trend with all the LM
basically get very high accuracy in
general with the Ed punish Mark we're
see the accuracy moving and just the
growing and I believe everyone noticing
this accuracy today average accuracy for
a good General models between 80 maybe
close to
90
so that basically
make a bring a question inside the
company saying is it worth it for us to
start building and keep building domain
specific models if the accuracy today
with General model achieving around
90% And we have domain specific model
should we just keep building General
models F tuneit maybe go Direction with
what you call reasoning or thinking
models and that would be more than
enough actually and we don't need those
financial or what call domain specific
model now to answer questions we need
data so whatever we going to present
next actually could be applicable to
financial services domain specific model
sorry to Medical specific model customer
support domain specific model and all
different domain specific model today
I'm going to talk specifically about the
financial spec uh called the financial
punchmark for domain specific model uh
we have something similar for medical
but we believe we are but we start
seeing similar result now let me dive
in just to remind you we're trying to
answer these questions General model
theic model should we keep build them
where we actually going from here we
start actually saying great we don't
know the answer let's actually do the
evaluation let's create the data and we
created something called
fail the idea behind it let's create
real word scenario
to evaluate those model and let's see
actually of those new model can really
give you the accuracy that we a promise
or the accuracy that we see today from
the punch marking on domain
specific we created two type of
categories in this evaluation something
called query failure in query failure
basically we introduce three type of
subcategories something called
misspelling queries you know when you go
ask the llm questions but you do some
spelling error segment error you do some
com comment typo issues we introduce
that to the eval set we introduce
something in like what called incomplete
queries you're missing some keyword some
stuff not clear we introduce what's
called out of domain queries if you're
not expert in the field or you decide to
copy paste some general answer try to
answer about something very specific
specific and also in the second category
is what we call the context failure and
the context failure basically and this
get very interesting we introduce three
subcategories what called basically
messing context we basically ask the llm
question about context not exist in the
the quest itself in the BR we introduce
what you called OCR error today when we
do any kind of OCR or convert
physical do document to text we
introduce a lot of Errors like you know
character issues distance between them
the word between when you do the OCR
could be merged together so we introduce
that type of errors and also we did what
call unrelevant
context let's say you want to ask
question about specific document and you
end up basically uploading completely
wrong document does the LM going to
still answer is the LM just actually
figure out you have a completely
irrelevant
context now when you put all this data
together in domain in financial specific
Financial Service specific you need some
kind of diversity just a quick
screenshot just tell you about amount of
data how much token something worth
mentioning the white paper the data the
evaluation set the leaderboard all
actually open source today available in
GitHub and hugen face so anyone please
check it out and we introduce very
simple what call it evaluation key
Matrix basically we need to look to two
things and the model give the correct
answer can the model actually give good
follow to the grounding or context
grounding or basically what you call it
here the context this is quick or high
level way of how we do the
calculation so to
evaluate we selected a group of models
today we can see a lot of chat model and
also thinking models this is basically
the two list we have here I'm sure you
familiar with this list and then we on
the evaluation and we start seeing very
interesting results I'm going to dive in
directly to the result and basically we
start getting something Fancy with all
this color let me switch to the what
basically
see what's start getting very
interesting we're saying really good
behavior in all thinking models actually
they don't refuse to answer this sound
good most of the time but in reality
when you give something those llms wrong
context when you give them wrong data
when you have a complete different
grounding those model actually fail fail
to follow this part and they still give
you an answer and that basically get you
way higher hallucination
if you start focusing on the answer
itself can the model give me answer or
not you can see basically almost every
model from the domain specific to
General model they give you some kind of
answer all them close to each other
actually reasoning or thinking model
they get to even higher score a little
bit from there but when get to the
grounding and con grounding this is when
stuff get more interesting you can see
specifically in task like text
generation
or question answering it's just not
performing well now all does the chart
look great what I prefer is the
numbers this is the same data we use to
generate the chart we can go through
this really quick and if you look at
this number here especially for example
like the o1 or o03 or B fan you can
start noticing the stuff those model
doing amazingly and basically when you
ask was it misspelled when you got stuff
un complete out of domain the numbers
look amazing the model can take a query
with mess spelling wrong grammars or
even out of domain and still can give
you the answer
but when you start going to grounding
this is going to stop get very
interesting I'm going to hold this slide
for a second here if did you notice
something
different yep and also those bigger more
thinking give you the worst result
you're getting almost
70 50% to
60% uh worse in the grounding meaning
the model is just not following you
attaching context you ask the questions
and the aners exist outside the context
completely same thing coming stuff
around it anal context so you can look
at the
data see smaller model actually
performing better than all this model
over thinking at that side and this is
basically will get us about is this
thinking or just a Chain of Thought you
know this could be a lot of argument at
least from the data we have in domain
specific task those model not thinking
at that stage meaning Hallucination is
really high causing a lot of a lot of
issues especially in this fin in this uh
punchmark we run here in fin Financial
use cases also we can see there's a huge
gap between what you call robustness and
the hallucination and getting the answer
correct so definitely we still have a
lot work to do to build those model and
better performance but also that get me
to you know to the main idea if you go
back real quick here even with the best
model between all the slide we're still
not getting between robustness and cont
surrounding more than
81% sounds a great number if you think
in the reality you saying every hundred
request 20 of them it's just completely
wrong so that basically what we start
seeing believe at least today with the
technology we have with the current
model we have until we have something
completely different we're seeing you
need full stack you need the arck system
you need the
grounding you need everything from guard
rails and the build around the system
itself to actually have something reable
utiliz today in the same
time I would love to go back and answer
the first questions and our first
question
here do you still need to build the
models at least today from the data we
have from R those punchmark the answer
simply yes we still need to build and
continue to make a specific model at
least with the today implementation even
accuracy is keep growing but the
grounding the context following all the
context correctly it's still way way way
behind from everything we see today in
the
market thank you so much guys
[Music]
our next presenters are here to share
real world case studies from open AI
please welcome to the stage member of
technical staff of open Ai prant matal
and Toki sherbakov head of solutions
architecture of open
AI hello uh thanks for having us here
and today we're going to talk a bit
about building and scaling use case with
open Ai and what this means in terms of
Enterprises working with open to bring
use cases to production and a little
sneak peek into agents and how we've
seen some of our experience building
these use cases in now agentic workflows
uh in the field so uh on our side um
just a quick introduction into open AI
I'm sure folks have probably heard of
open AI but just in terms of how we
operate we have two core engineering
teams we have our research team which is
1,200 researchers that are inventing
these models right we they build and
deploy these foundational models these
kind of come down from the heavens our
apply team our second engineering team
take this and build it into product so
this is where you see things like Chad
GPT you see things like the API where
GPT models are available and that's
where we actually deploy this finally in
the go to market sense where we take
these products and put it in end user
hands that's kind of where our team
comes into play with go to market where
we actually help get this in the hands
of your Workforce in the hands of your
product and really start to automate
these internal operations and once we
finally deploy these there's kind of
this iterative Loop where we take
feedback from the field to improve our
product directly and then also improve
our core models through this research
flywheel so that's kind of the last step
of getting it back to research so this
is typically how open AI operates um in
terms of the enterprise we see the kind
of AI customer Journey happen typically
in three phases it doesn't have to
happen in sequence in this way but this
is what we usually see is first and
foremost building an AI enabled
Workforce this is getting AI in the
hands of your employees to become AI
literate to use AI every day in their
day-to-day work that's the first and
foremost that first step typically that
we see then from there you typically
graduate to towards automating your AI
operations this is actually more of
internal use cases to build in
automation or maybe some co-pilot type
use cases into the workforce then the
last step here is actually infusing AI
into end product this is end user facing
so when it comes to open ai's product
specifically enabling your Workforce
typically starts with something like
chat GPT so this is our you know first
party product to put in the hands of
users to use day in and day out then
when you talk about automating
operations internally you can do this
partially with chat GPT for the more
complex use cases or more more
customization is needed that's where
something like the API comes in and then
finally infusing this into your end user
products is where it's primarily API use
cases but just to give a flavor of how
these products come into play when
actually executing this across your AI
customer
Journey so in terms of how we see
Enterprises actually craft this strategy
in practice it kind of happens in a few
different ways I'd say first and
foremost you determine a little bit from
the top down level of what should the
strategy be and one core thing that we
acknowledge here it's not actually
what's your AI strategy it's actually
what's your broader business strategy
and what open AI does is help figure out
where does a technology meet that
broader business strategy first and
foremost so that kind of top- down Str
strategic guidance is really important
to start with and then once you start
with that top down guidance you then
move to use cases like let's identify
one or two mey use cases that are high
impact to start with and scope those out
to really just deliver on um kind of
that scoped scale so once you have the
strategy you execute upon those two use
one to two use cases and then think
about how to build divisional capability
across your Enterprise this is where you
start to enable the team and to fuse AI
throughout the organization and this
happens in many ways this comes through
enablement this comes through building
centers of excellence this comes with
building maybe a centralized
technological platform that other people
in the Enterprise can build on and I
feel like that's typically the journey
we see is again set the strategy pick
those one to two use cases and then
build that capability across your
organization through enablement so
that's usually the the type of Journey
we see and just to illustrate this a
little bit with an
example is this is how we've seen the
use case Journey play out so um this is
illustrative of a three-month type of
example of a use case but when you've
identified that one to two use cases
that you want to tackle first and
foremost you have to ideate upon that do
some initial scoping do some
architecture review to understand how
does AI going to fit into your current
stack and then really clearly Define
what the success metrics in kpi are once
you have that established the bulk of
the time is really spent in development
this is where you iterate this is where
you are iterating prompting strategies
incorporating rag whatever it may be to
constantly improve the uh use case that
you're tackling when it comes to
engaging with open AI this is where our
team like Brant myself really interact
closely with your engineering team
through things like workshops things
like office hours paired programming
sessions webinars whatever it kind of
takes to accelerate the use case forward
once we doe that development phase we
kind of move to testing and evaluation
which is with the evals we've typically
defined up front we're able to actually
now do some AB testing do some beta roll
out to understand how this actually
works in practice and then finally we go
to production this is where you just do
some launch roll out do some uh scale
optimization testing to make sure it's
going to work once you deploy it to many
end users and then we have kind of
constant maintenance that that's ongoing
so that's like the typical phase you'll
see and again the bulk of the time
especially in partnership with open AI
will be around development um in this we
bring a dedicated team we ask you bring
us also a dedicated team to make this
work in practice and the things that we
deploy also to enable you are things
like early access to do models and
features that's one of the key things of
working closely with open AI is that we
can see a little bit into the future not
much like our road map I I don't see
beyond much maybe like six months people
ask what's your 18-month road map I
cannot tell you I can tell you basically
what's going to happen the next two
quarters but that like purview into the
future is really important to bring
forward to these use cases and enable
customers to build and innovate for
what's coming next so that's a really
critical part of our partnership um also
we bring in you know internal experts
from our research engineering team our
product team to help kind of accelerate
you on this path and then lastly just
kind of do joint roadmap sessions to
make sure that we're on track for what
your future road map is as well so
that's hopefully an illustration of how
we partner together and then one example
on this is something we did with Morgan
Stanley so Morgan Stanley here based in
New York uh was building a internal
knowledge assistant so what this was was
giving their wealth managers the ability
to qu ask questions of their large corp
Corpus of uh knowledge which was
research reports like live views on
stock ticker data whatever it may be and
they wanted to get highly accurate
information back to be able to respond
to their and clients right and accuracy
was pretty bad to start right it was 40
45% typically what they saw so
interacting with us we introduced new
methods throughout the use case
development things like hide retrieval
we did some fine tune embeddings
different chunking strategies which
improved performance and then once we
kept introducing more and more methods
we saw accuracy go up we introduced
things like reranking and classification
step that got it 85% and ultimately
their goal was 90% we got to 98%
accuracy through other things like
prompt engineering query expansion so
more of just an example of how we
introduced methods throughout this use
case journey to uh improve their core
metric for more conly in this case um so
this is hopefully one illustration of
how open eyes partner with customers and
one common use case we're seeing more
and more of is now building in this
agent space you maybe hear that 2025 is
the year of Agents agentic workflows has
been a buzzword for a long time I think
we're seeing that actually come to
reality this year and um I think with
that we've seen uh some B we have some
Battle Scars and some best practices of
what we've seen in the field and I'll
hand it off to pant to talk about what
we've seen on the agent side thanks
DOI so at openi we are lucky to work
alongside customers who are building
state-of-the-art agents and working
alongside team members who are building
our own agentic products like deep
research and operator like Doki said we
expect 2025 to be the year of Agents the
year gen gen truly graduates from being
an assistant to being a
coworker and to help usher in this era
we've been hard at work identifying the
patterns and anti patterns prevalent in
agent development I'm excited to share
four of those with you
today before we can go further I'd like
to quickly Define uh what we mean by the
term agent so we think of an agent as an
AI application that consists of a model
that has some instructions usually in
the form of a prompt access to some
tools for retrieving information and
interacting with external systems all
encapsulated in in an execution Loop
whose termination is controlled by the
model itself
so one way of thinking about this is
that in each execution cycle the agent
can be thought of as an entity that's
receiving instructions in natural
language determining whether or not to
issue any tool calls running those tools
synthesizing a response with the tool
return values and then providing an
answer to the user Additionally the user
may determine sorry the agent May
determine that it's met its objective
and therefore terminate the execution
Loop so with that definition let's move
on to some of the lessons that we've
learned uh building these agents in the
field so for the first Insight imagine
you're designing an AI agent you need to
orchestrate multiple models you need to
retrieve data reason over it and
generate an output you have two choices
you can start with Primitives making raw
API calls logging results yourself um
and logging outputs and
failures or you can start with a
framework you can pick an abstraction
you can wire it up and you can let it
handle a lot of the details
and I have to say starting with a
framework is pretty enticing it's how I
got started building agents it's really
easy to get started have a proof of
concept Concepts stood up in no time but
the problem is that if you start with a
framework you often don't actually know
how your system behaves or what
Primitives it uses you've deferred
design design
decisions before you've understood your
constraints and if you don't know your
constraints you can't optimize your
solution so we believe a better approach
is to First build with
Primitives understand how your task
decomposes where the failures happen and
what actually needs
Improvement then introduce abstraction
when you find that you're Reinventing
the wheel for example by re-implementing
an embedding strategy or re-implementing
model graders that may be a good time to
bring in some
abstractions many teams today are
spending a lot of time picking the right
framework um we actually believe that
developing agents in a scalable way
isn't so much about choosing the right
abstraction it's really about
understanding your data understanding
your failure points and your
constraints so in summary the first
lesson is to start simple optimize where
needed and Abstract only when it makes
your system
better which leads us straight to our
second Insight starting simple so too
often teams are jumping straight into
designing multi-agent systems agents
calling agents coordinating tasks
dynamically reasoning over long
trajectories it all sounds really
powerful but when it's done too soon it
creates a lot of unknowns and it doesn't
give you all that much
Insight we like a different
approach we generally recommend starting
with a single agent that's purpose built
for a single task put that into
production with a limited set of users
and observe how it performs doing this
allows you to identify the real
bottlenecks hallucinations over
conversation trajectories low adoption
due to high latency or maybe inaccuracy
due to poor retrieval
performance then knowing how the system
underperforms and knowing what's
important to your users we can work to
incrementally improve
it in a nutshell we should think of
complexity as something which increases
as we discover more intense failure
cases and
constraints because the goal isn't
really to build a complicated system
it's just to build a system that
works so starting simple sounds great uh
but we all know that complexity is where
True Value is realized so how should we
handle more complex
tasks this is where a network of agents
and the concept of handoffs comes
in so you can think of
handoffs sorry let's start with the
network of Agents so a network of Agents
is a collaborative system where multiple
agents work in concert to resolve
complex requests or perform a series of
interrelated tasks you can think of this
as a series of specialized agents
handling subflows within a large agentic
workflow on the topic of handoffs you
can think of
these as the process by which one agent
transfers control of a active
conversation to another agent it's
pretty similar to how you get
transferred to someone else on a phone
call except in this case you can
preserve your entire conversation
history and the new agent just magically
knows everything you've talked about
already so let's see an example of
this in this sample architecture we are
showing how a fully automated customer
service flow may be implemented with a
network of agents and
handoffs this approach is allowing us to
BU bring the right tools to the right
job so for example on the left hand side
we're using a GPD 40 mini call to
perform triage on the incoming request
we're then using GPD 40 on the dispute
agent to actually manage the
conversation with the user and finally
we are using a o03 mini reasoning model
to perform accuracy sensitive tasks like
checking whether the customer is
eligible for
refund it turns out that handoffs work
really well and keeping the entire
conversation history and context while
swapping out the model The Prompt the
tool definitions provide sufficient
flexibility to solve a wide range of
scenarios
so our final lesson pertains to
guardrails and just a level set
guardrails is a catchall term today for
any mechanism that enforces safety
security and reliability within your
application and it's generally used to
prevent misuse and ensure that your
system maintains
Integrity so keeping the model
instructions simple and focused on the
target task ensures maximum
interoperability of your system and also
ensures that we are able to H on on
accuracy and performance most uh
predictably guardrails should not
necessarily be made part of your main
prompts but should instead be run in
parallel and the proliferation of faster
and cheaper models like GPD 40 mini is
making that making this more accessible
than ever tool calls and user responses
that are high stakes for example issuing
a refund or showing a user what
information uh some information from
their personal account these can be
deferred until all of the guard rails
have
returned in this example we see that
we're running a single input guard rail
to prevent prompt injection and then a
couple of output guard rails uh on the
use on the agent's
response so to recap we have four
lessons from our time building agents
use abstractions minimally start with a
single agent graduate to a network of
Agents when you have more intents and
finally keep your prompts simp simple
and focused on the happy path and use
guardrails to handle edge
cases thank
[Applause]
[Music]
you ladies and Gentlemen please welcome
back to the stage your MC for the
leadership track session day Peter
Humphrey
all right folks thanks uh and thank you
to pant and Toki uh it was really
wonderful I think to have them here
today to hear about everything that's
going on with open AI uh it's been a
pretty exciting morning folks uh We've
dived into topics like knowledge graphs
how agents fit into the existing
software development life cycle uh
domain specific llms and of course just
now open AI um so if you want to just a
quick reminder if you want to discuss
any of these topics meet the speakers
have some question and answer time uh
just go to one of the three Q&A lounges
that I mentioned before uh there's one
on this level there's one right at the
bottom of the stairs on the Expo level
and there's another underneath the
stairs tucked under uh and the speakers
will be in one of those three areas um
also please take this time during lunch
uh to go stop into the sponsor Expo um
that's again where lunch is being served
and uh our sponsors have um pretty
amazing products Technologies and
services to help you on your journey um
so also if um during lunch a little
special uh late ad which I'm we're we're
kind of excited and pleased to tell you
about uh we're going to bring a little
Family Feud back does anyone remember
Family Feud that game show TV game show
can yeah all right a couple folks um
we're gonna have a teams from leading AI
Frontier Labs basically in a
head-to-head Family Feud style battle of
wits uh and this is going to be hosted
by bar euron a partner at
amplify uh Partners um and we're going
to feature family feers like Mahir Patel
uh shest the Malik John ma Tina Zoo
Petra grutzik Colin flarity Steven
roller Paige Bailey just to name a few
um so if you that's going to be right
here if you want at 1:15 if you decide
you want to kind of exit launch early
and come back and enjoy that just for
some funsies um all otherwise we will
see you back here for the continuation
of leadership sessions at 1:45 p.m. in
the theater please enjoy your lunch and
thank you for being here
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
oh
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
e
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
welcome back everyone as an AI language
model I cannot taste food but I am
certain you enjoyed your meal but enough
about me please join me in welcoming
back to the stage your MC for the AI
engineer Summit leadership track session
day Peter Humphrey
welcome back thank you hope everyone had
a bite um and uh just a quick reminder
uh not going to be up here long um
thanks for coming back from lunch a
little early uh we are going to have
some fun with a family feud Style game
show just again just curious who
remembers Family Feud anybody okay so a
couple more hands than than last time uh
it's a game show Q&A Q&A Style game show
where the contestants are under a little
pressure so this should be fun um I
would like to call up Baron uh a partner
at amplify uh partners and she will
introduce the rest of our family feers
have some fun enjoy thanks for coming
back
[Applause]
[Music]
that we're all
[Applause]
coming okay welcome to Frontier Feud so
I'm your host uh bar yourone I'm a
partner at amplify we're the First
Investors in technical founders and I
invest in data and AI companies I'm very
excited to be here Frontier feuding in
my favorite City New York City and today
on stage we have some amazing folks
competing for prizes and eternal glory
and all that good stuff so we're going
to introduce our teams uh do not let
these smiling faces deceive you we are
here to compete uh before we do intros
for the teams just an audience feeler
who here in the audience including those
on stage have watched Family Feud before
quick show of
hands okay most people for those of you
uh who haven't the premise is pretty
simple the nuances you'll learn with the
game so we surveyed 100 AI Engineers on
a series of questions folks on stage are
going to guess the answers to those
questions and the most popular answers
from the survey are going to get the
most points so feel free to follow along
and guess yourself in the audience as we
go along uh if you don't like the
answers you can look to your left and
look to your right and blame your
neighbors no I'm just kidding in all
seriousness thank you so much to
everyone who filled out that survey
before and so we're going to do some
quick intros um we have on my left M you
can kick it off tell us who you are what
you do and your Tech hot take we're
going to do quick
intros cool uh hey everyone my name is m
here uh I work at anthropic um my tech
hot take is I think uh let's say the
five is people training big models today
at least one of them will no longer be
training AI models by the end of the
year mic drop but really mic pass so
John hi guys I'm John I work at
anthropic and my tech hot take is I
think AIS make really good
therapists hi I'm Tina um I work at
reflection AI it's a startup building
coding agents and my tech hot take is
that everybody working in our industry
should spend like 20 minutes a day with
just a piece of paper and a pencil um
and think like non AI interrupted
thoughts thank
you hi I'm sha I'm the product lead for
the Gemini developer API working with
Paige uh I have two my first tech hard
take is a followup from Tina's I think
it'll be important for people to do what
Tina said because eventually when you
have ai models represent you they'll
need some good content to be trained
on I think I'll save my next one for
later I love it excellent paig you want
to start your side heck
yeah so just be aware that uh we did not
pregame it no fraud no fraud um uh my
name is Paige I work at Google deep mine
leading uh engineering for our AI deell
team um uh and I guess my tech hot take
is that uh speed forward ahead a year
and a half I think that the majority of
deployed models will be on device um and
we'll be seeing a trend more towards
smaller models um that are uh kind of
orchestrated together maybe uh hypers
specialized for a specific task um as
opposed to relying just on larger models
um and sending data elsewhere to do
interesting things
uh my name is Colin uh you might
remember me from my talk earlier today
I'm a researcher at augment code
building AI coding tools before that I
was at Fair Facebook AI research here in
New York on working on AI for board
games I I just had to get out both hot
takes since I had two two ones so one is
that in the future dating apps will be
our AI dating each other um two is I
don't think Transformers of the final
architecture because eventually we'll
build these models using biological
materials
oh
snap hi my name is Petra I'm the product
manager of factuality for the AI answer
at the top of the Google search page my
hot take is that um chatting with uh one
bot one-on-one is boring and in the
future most conversations will have at
least two other bots in the in the room
with you I love it hello uh I'm Stephen
uh I'm joining thinking machines soon um
and my H take is that if you think we're
hitting a token wall that sounds like a
skill issue to
[Laughter]
me I love it amazing so with that let's
get started I'm gonna have Paige and
Mahir join me not that you're so far
from me excellent great so the objective
is like hit this as fast as possible
right yes okay so so um so whoever
buzzes first has the first chance to
answer this question so I'm going to ask
a question and we'll take it from here
so we asked 100 AI Engineers name the
most influential AI researcher God dang
it I looked at the screen okay what's
your what's your guess uh Ilia
satar okay we have Ilia n shazir Noom
shazir very influential but not on the
board not in our not in our top eight
answers so your team decides if to play
or
pass all right we're going to play
You're going to play You're going to
play okay okay awesome so John it's it's
your turn to guess someone I'll guess
another person but you can stay
there we're working out the
Kinks cheating
frud Onre
car answer good
answer all right Tina how about you Jeff
Hinton oh an excellent one but
just oh sorry did you say Jeff Hinton
yeah
oh that's on
me um I thought you said something else
great I the the the lead auor on the
Transformer paper the attention all you
need paper we need we need a specific
name sorry
um sorry I'm having a literal like I
know the name but I uh I'm having a
literal
like uh we're gonna
have yeah I think but we you already had
a strike that doesn't count so we're
gonna we keep it we're netting at one go
here I'm GNA say yan laon yan laon we do
have Yan laon
okay why do we keep going we just keep
going because you're winning oh until we
lose yeah until you lose oh I didn't
I if you want to lose here's your
opportunity uh no so not on the
board um what was the question
again good question so name the most
influential AI researcher um I'm gonna
say sam Alman Sam Alman
the good one it's not on the board and
so we are at three strikes so your team
needs to
deliberate and if you pick an answer you
actually steal the
points we have four options on the
board and for those who are joining the
question is we asked 100 AI Engineers
name the most influential AI
researcher all right I'm giving you a
five
four you're good all
right I'll do Jeff
Dean um Jeff Dean is not on the board
which
is what I had registered earlier um okay
so that goes to
uh Roo's
basilisk and we're going to
go yeah I'll show you yeah I'll show you
the rest of
them so these were the ones that were
guest we also have Andrew in we
haveu we have F Fe Lee and we have yosua
Benjo there is a longtail very
influential researchers too but they
they're only eight spots so all great
guesses okay
next all right uh I think John and Colin
what is it yeah you want to know the
question yeah patience so name the top
considerations when choosing a
model oh I think your's
brok I
first more passion
uh come aggressively this team but we'll
yes name the top considerations when
choosing a
model
intelligence I'm going to be very
literal very smart answer very smart
answer you have to to guess uh
safety safety uh is actually not on
here and so why don't we get uh Tina and
Petra to
guess am I next yeah oh um
price so cost is the number one answer
which actually surprised me I don't know
if like maybe raise of hands if you
would have guessed price now no so
hindsight's 2020 that wouldn't have been
obvious to me but this means that you
guys will continue with this answer and
we'll we'll we'll continue from
there latency
latency uh
yes uh eval benchmark
scores some of these things are
ambiguous but I'm gonna put it under
don't don't kill me accuracy performance
[Applause]
okay that's like
[Laughter]
everything okay um uh the CEO the
CEO good answer good answer good answer
good
moving on Oh I thought because we miss
oh no we your team got it um like where
the the model is being served like if
it's on Prem or you know uh no we're not
GNA count that one but it's a good
answer okay this team has an opportunity
to steal these
points I hear some folks in the audience
think they're ready to steal
so I love that about you I love the
confidence oh my God
we can make a guess that was no it
wasn't uh open source versus closed
Source okay so with that you guys get 63
points cuz you stole the stole the
points but
[Music]
so it's a pretty close game going into
the third and final
question
yeah all
right
sorry I know everyone everyone wants the
answers who has like very strong upper
body strength so so here here are the
[Music]
answers yeah these are the
answers great thanks for keeping me
honest everyone in the audience and
everyone here so all right next so who
do we have coming up Petra and Tina
incredible you can do it so both in the
front Okay this requires uh Power is
what we've learned so um are you ready
ready yeah okay we asked 100 AI
Engineers name a buzzword
sorry it's okay I'm not offended but
name a buzz word everyone in AI is tired
of hearing
agents
agent what do you all think
yes okay but you still have an
opportunity to guess because you could
get the number one answer
oh but you have to guess
now I'm giving you three two two one
sorry um okay so it's going to go to
this
team and um uh sha it goes to you we
asked 100 AI Engineers name a buzzword
everyone in AI is tired of
hearing I'm going to start my five four
multimodality
multimodality uh we do not have
multimodality but I don't know where
this strike came from so it's first
strike uh go with
co-pilot co-pilot that's a good guess
what do we think yes or no to co-pilot
that's a very good
guess I think it's a good okay good
answer good answer uh but no
AGI
AGI okay what what does the audience
think
AGI yes all right it seems like we think
get the good answer and no just kidding
yes the number one answer
AI all right
Tina oh
um deep
seek too I no no deep so this team has
an opportunity to
steal but you're right maybe maybe for
like the
week yeah you're back you're so
back safety
safety so unfortunately safety is not
one of the
[Applause]
answers so the these were the additional
answers people here are sick of rag and
prompt engineer
um but what this means is that we have a
winner which is Roo's
basil but amazing amazing work by paig
attention and the mixture of experts and
so we're going to move to the fast money
round do you guys have two
Representatives all right you two are
going and that means John leave the
stage I'll call you when you're
ready all right we're moving into fast
money
and uh so some of you have seen Family
Feud but we're playing fast money the
goal is for the two of them together to
get to 200 points uh you have Mah 20
seconds to answer five questions uh if
you can't think of anything just say
pass and we can come back to it at the
end or never um and if you hear a buzzer
sound let me check that it
works beautiful your answer wasn't one
of the surveyed questions you can keep
asking or or moving move on all right
are you ready y okay uh name an AI tool
that Engineers love
curser uh name the sorry I didn't start
the
clock name the job most at risk of AI
disruption software Engineers name the
most influential AI paper in history
attention is all you
need name the biggest nightmare for AI
engineer at 2 a.m.
uh Hardware
failure okay so we got through four but
they were good ones so name an AI tool
that Engineers love you said cursor
cursor is the number one
answer we asked name the job most at
risk of AI
disruption you said software
engineering that is the number three
answer we asked name the most
influential AI paper in history
what do you think demolished attention
is all you need is by far the number one
answer uh name the biggest nightmare for
AI at 2
am. Hardware failure Hardware
failure you know what we'll give it
we'll give it as infar problem okay okay
yeah um and so you're going into the
round 140 points you want to tap tap
John in
John ran away he got scared no there we
go
amazing incredible so John big things
you have 25 seconds and you got to stand
back there you have 25 seconds to answer
five questions um and if you can't think
of anything just say pass and we'll come
back to it if we have time at the end uh
if you hear a buzzer sound it means that
either your answer wasn't isn't on the
board or he has already answered it you
ready y um okay and I'm going to start
your
timer name an AI tool Engineers love
sir oh um
ch um name the job most at risk of
disruption um
artists
uh name the most influential AI paper in
history uh Transformers is all attention
is all you
need okay so we have model apis which
is uh the third answer on the board we
have name the job mode you said you said
artist so I'm going to give you content
creationwriting it's a little
ambiguous and attention is all you need
was already uh selected by Mahir but
we'll I'll let you pick the next
question which is name the biggest
nightmare for AI engineer at 2 a.m.
someone wrote In Cold email from a VC so
I will talk to you
after uh uh Cuda SS Cuda SS okay well
you know what your teammates because you
think alike and you think you you think
of great answers so let me just show you
the number one answers here which are a
cursor data entry for the job most at
risk of AI disruption attention is all
you need the the top paper so that was a
good one just already taken the biggest
nightmare an outage of the model or
otherwise and an industry that would
benefit the most from AI
Healthcare uh so with that we have well
you lost the 200
points but I think that you're still
winners and you're also winners and we
have a few prizes outside that I think
someone is
bringing
so we have a massive llama for this team
we have rainbow lamba Bean babies for
everyone we have engineering books and
gift cards to your favorite restaurants
in New York and maybe we can bring them
out and leave uh thank you so much for
joining us for Frontier Feud and we're
going to see you next year we're running
a massive
survey on the state of AI engineering
going to be presenting it at the June AI
engineer World Fair so if you liked some
of the teaser questions this will get
much more in depth into the tools folks
are using uh the workflows that AI
Engineers have and it's a way for the
industry to be more more transparent uh
so thank you uh you can find the QR code
or the link here if you want to
participate in that survey on the state
of AI
engineering
amazing babies are but they'll
come
for
for
for for
[Music]
ladies and Gentlemen please welcome back
to the stage your MC for the leadership
track session day Peter
[Music]
Humphrey that was fun thank you um yeah
that the uh I'll try to do my best uh
Steve Harvey impression which is not
going to work I'm just going to say it
in advance so the feud that was really
fun um well I hope you enjoyed that and
lunch uh and hopefully talking to our
sponsors for a minute in the Expo um all
right so settle in hope you had some
coffee buckle your seat Bel grab a
helmet our next Sprint of sessions is
going to be pretty awesome we are going
to talk we're going to have an AI case
study we're going to talk about AI evals
Hot Topic of course AI observability AI
infrastructure and of course a talk from
none other than anthropic so uh with
that please join me and welcoming our
next speaker to Thea stage Shira chadri
from Thompson
[Applause]
[Music]
Reuters hello good afternoon I have
before me the ominous task of making
this presentation really interesting
with a topic which is going to sound
like a crib what are those missing
pieces for workflow automation to happen
with AI and I'm going to tell you really
an Enterprise story is it dry is it just
going to be about now I'm going to find
out who took my lunch sandwich we'll see
um and you know as I Was preparing for
this talk and I realized in the schedule
that this is going to be just after
lunch I thought I should start off with
a joke and since for all our daily needs
we go to AI tools I try to go to a AI
tool for a joke and they really suck I
couldn't find one decent joke if you can
tell me a good joke about you know using
AI for your real world Enterprise needs
I'd be happy to squeeze it in right
now uh like it doesn't work
can someone help me oh okay yep so the
graph looks a little different from all
the graphs that I've been seeing this
morning we took this journey in our
worlds in our Enterprise worlds um as we
explored but before I dive into this I
think I should do an introduction I'm
shisha I've come all the way from
Bangalore to tell you the story that I
see unfolding around me not just in
Thompson Reuters where I work trying to
bring AI to my um to my um you know
teams and um different business
processes but also the same story that I
hear at meetups and you know um
different community events where where I
meet AI
practitioners um everyone started off
trying to democratize the use of
generative AI back in 2023 we have
something called as open arena in uh
Thompson Reuters very similar to your
you know um playground where you can try
different large language models this is
where it truly came home to um almost
everybody in an Enterprise to start
using generative AI for their workflows
further along we got on to the rag um
and you know prompt engineering World
pretty quickly we looked at um
automating various knowledge driven
tasks with the use of rag and very soon
we were answering questions
um at the Enterprise level on what is
the
ROI further along in 2024 we started to
play with tools and Frameworks and we
heralded the rise of the agents we are
now here where we're looking at
automating entire workflows with the use
of AI SL agents and not just one task at
a time right we are looking at a future
where we want to reimagine business
processes because just automating a task
seems
redundant okay so what what do we mean
by workflow Automation and what I've got
here is a very typical workflow for
almost any company that's putting
software out there customer calls you
calls your service desk report a billing
issue an invoice issue or a product
feature are not working as expected your
customer support um is going to take the
call they are probably using rag to sort
of answer that question already or they
may be looking at you know internal
tickets or connecting with um you know
uh their uh hierarchy to see if the
answer can if the answer can be given
and if no answers are found they then
you know report a ticket to the it Ops
teams right the internal it Ops teams do
the level two support and they're
looking at you know launching
investigations to support this further
along if this doesn't work you've got
your engineering teams doing the L3 L4
support and a fix is likely going to be
identified with the use of various
observability tools scripts are launched
to you know create that new build tests
regression tests integration tests are
launched and finally you've got either
the bug fixed or the billing question
answered you your tickets are updated
and SLA is met needless to say all of
you can spot so many of these tasks that
can be automated with the use of Agents
but you know is automating each task the
way we want to do this is there
something that can be done differently
in reimagining this
workflow we are there we're trying to
reimagine this
workflow um here's a slightly different
take let's look at um a workflow where
content is getting created right it
starts off with authors or content
specialist perhaps identifying an alert
or a trigger that's going to launch that
content workflow you then have maybe
approvals to say yes go ahead do your
research find out what we want to write
about this and then you've got you know
content getting created with research
being done and subsequently your editors
and your um associate editors and
reviewers reviewing that content if it's
you know very critical content you you
will probably have several rounds of
this reviews and eventually the
finalized content goes to the publisher
which then you know the the publishing
teams then launch their own formatting
styling related workflows and eventually
the content is published here too you
will realize that so many of these St
tasks can be done by AI can be done by
agents and of course humans being in the
approval flow but here again something
seems a Miss should we stick to the same
design of the workflow or should we be
doing this a little
differently okay so that's that's where
we are we want to be able to remagine
these workflows because it's a new world
because we have new capabilities with
these Technologies
um and not just plug in capabilities
into an existing business process right
but we
stuck we're stuck we are um missing
certain parts of that
reimagination so what are we
missing the first thing that we missing
is
connectors um I I spoke to few of the um
you know the Stalls uh yesterday and a
common theme was how around providing a
good AI agentic solution you always
needed that layer which connected to
your current it systems right and and
connectors are a very very much a
missing part of reimagining these
business
processes um I also want to say that you
know I come from a world
where um you know technology is not
altogether new we've been you know we've
been doing AI we've been doing NLP for
several decades as Thompson Reuters and
even even in the different companies
that um uh that developers come from in
the different meetups and communities
that I attend they are also supporting
it systems of some of the you know
different technology companies of our
world
Believe It or Not
71% of Fortune 500 companies still use
Mainframe 68% of the world's it
production workloads still run on
Mainframe right and some of your major
credit card um transactions still happen
on the main frame which means we are
that distant right like that technology
Spectrum if you were to measure from the
main frame to to an agentic workflow how
do we connect these worlds right and so
that's one of one of our major stumbling
blocks I believe where you know how do
you connect the worlds of the technology
um stable technology Stacks that exist
with um with with with the power of AI
agentic workflows the second thing is
something that um I struggle with as I
take new ideas to to different
stakeholders and I see you know several
startups whom I meet on a regular basis
them struggling as well it comes back to
the question of Roi it comes to the
question of you know um
reliability right how will I be sure
that my agent will be able to perform
and often with stakeholders and you know
from from a business impact standpoint
it's a zero or one call right am I going
to continue to need to have to pay
manual hours or can I consider that not
needed anymore if I'm going to pay for
the agent AI agent right and so
reliability becomes a big factor and a
stumbling block for us the third thing
that we're finding missing as
practitioners is to have Visionaries who
are able to re-imagine this world with
us it's um you know
as a practitioner as somebody who's
deeply entrenched in AI you can only go
that far in reimagining this world you
need the subject matter experts you need
the Specialists from that specific
domain to sort of do this together with
them to be able to you know reimagine
your business
processes the the fourth thing that you
know we need and I'm sure many of you
will agree based on conversations that
I've had is we need a certain level of
standardization
right we need to be able to say this is
how agents will be built this is how
they are packaged this is how they
deployed it's too nent yet um you know
in a in a established Tech ecosystem to
say uh this we are going to replace
these these bits with
agents data and systems for an agent to
truly you know um get its full power we
need to give it access to context the
context is today distributed across
different it systems Business Systems it
is probably um partially located in logs
it is probably there in um chat messages
or you know it tickets and you know
different um you know different styo
systems right sometimes which are spread
across different parts of the
organization and so bringing them
together and even identifying which of
these systems will have what and how do
you correlate a single transaction
across these systems becomes often a
stumbling part of you know getting the
AI in the sixth thing is one that you
know I've I personally feel very
strongly about is creating a
collaborative
ux agents are going to be assistant so
what is the role of the human defining
that and creating systems in which
humans can support the work of the
agents and vice versa is I think a very
important part of um creating creating
those workflows and so what makes sense
from a collaborative ux is something
that you know I'm waiting to hear for
from any of you fresh ideas on
right AI governance we we saw um in one
of the talks about how different parts
of different aspects of security testing
go down into parts of your agentic
workflows and and so you know aspects of
your AI governance which we established
all this while around ethics and
responsibility how do you translate that
into different levels of your agent
architecture
right the next thing is control we still
want to give the human control we want
to have certain steps which are
deterministic
and certain steps which the agent can
control on its own or you know how how
do you balance that um need for control
uh between between the agent and the
human and give the human the right um
you know right time to act and finally
what is the life cycle for the agent All
of You Have You Know spoken about that
exponential growth of um evolution in
our space how do we bring the capability
the latest capability into what we've
already got deployed and that one that's
ever
changing so that that's what I had to
share um we are just at the start a lot
of good work from all of you I'm waiting
to bridge from the world that I'm seeing
around me to the world that I come from
and so happy to have your questions and
ideas suggestions feedback thank you
[Applause]
[Music]
our next presenter will share strategies
for turning AI agents into reliable
production ready tools that deliver
tangible business results please join me
in welcoming to the stage the founder
and CPO of arise apara dinkin
[Music]
[Applause]
[Music]
hey hey y'all how's it going all right
well can you all hear me cool well thank
you so much for being here I'm gonna
start off by saying apologize my voice a
little bit it's a little horse today but
you guys are going to hang in there with
me today we're going to talk about a
really important topic which is um
slideshow mode awesome we're going to
talk about a really important topic
which is about evaluating AI agents and
assistance this will load just to set a
little context before we we jump
in a lot of you have probably heard
today about different agents that are
being built how to build it what are the
Cool Tools out there to go build agents
and today we're going to actually talk
about when you put those agents into
production it's important to actually
know how they're doing and evaluate them
it's super important to making sure that
they actually work in the real world
World we're probably going to get a
little technical in this talk maybe a
little bit more than some of the other
talks but hang in there I think this is
important even at the leadership level
to understand how to make sure what
you're putting out actually works in the
real world um so a little bit about me
my name is aparta one of the founders of
arise uh fun update on us actually today
we announced our series C Ray um
so
um have a lot of fun folks who are using
us to evaluate agents so with that let's
jump in okay well everyone here's
probably talk to you about text based
agents so you have this chat bot
whatever it's making an action and it's
it's figuring out all these things to do
the cool next Frontier is actually voice
AI is already taking over call centers
there are over 1 billion calls made in
call centers all around the world with
voice assistant with with voice apis and
the the realtime voice API if any of you
guys have played around with it we're
actually already seeing these types of
um agents start to take over and
revolutionized call centers this is
actually a real production application
of a travel agent this is the price line
pennybot you can go in and actually
handsfree no text book an entire
vacation using Priceline Penny today so
we're not just talking about tech based
agents anymore we're talking about
multimodal agents and it's important to
to address these because the way that
you evaluate these types of Agents it's
not just evaluate an agent but also if
it's on voice there's specific types of
evaluations you're going to need to do
if it's multimodal there's additional
types of evaluations you need to
consider so we're going to break all
that down and hang in there with me for
a fun one today so before I jump in and
talk about how to evaluate an agent
let's talk about what are the components
of an agent you probably have heard
different versions of this today but
I'll tell you the language we're going
to use one um there's something
typically called a
router uh which is essentially what's
deciding what the next step an agent
will take there's skills which is the
actual logical chains that do the work
and then there's something that stores
the memory this is important
because there might be different
architectures of how you're seeing
people build these agents out there
doesn't matter if you're using L graph
or qai or llama index workflows there's
all sorts of agent Frameworks they all
have slightly different ways of building
an agent you might not even use a
framework but what you're going to see
is these common patterns of okay that's
a router that's a skill and that's a
memory and these different components
are going to have different ways of how
you actually evaluate it so let's first
talk about the first one what the heck's
a router so you can think about a router
almost like the boss it's kind of
deciding hey well it's very common to
have e-commerce agents in you probably
are all talking to e-commerce agents
today to purchase things Amazon has one
all these e-commerce companies have one
when you type in a question like I want
to make a return give me an idea of what
to go buy are there any discounts on
this that user query funnels into
something called a router and that
router's goal is to is really to
determine do I call this skill about
hitting up a customer service agent do I
call this skill um to suggest all the
discounts we have or suggest products
the router is really kind of the boss
deciding who do I tap on to go actually
execute the the ask that the user made
and the router might not always get it
right but you want it to get it right
because then it goes down the pathway of
a specific skill within an agent so in
this case it will call a skill um so if
I asked hey uh tell me the best um I
don't know leggings to go by so it'll go
in it'll do a product search and then
this is actually the entire skill flow
of execution that the agent needs to go
through to execute you know whatever the
user asked for some of these might be
llm calls some of these might just be
API calls it just really depends on how
people actually Implement them and then
lastly this is an important piece is
there's always something storing the
memory because these are usually not
just single turn conversations they're
multi- turn conversations multi- turn
interactions and so you don't want to be
talking to an agent that forgets what
you previously said so there's really
memory which is storing what it
previously asked for and keeping all of
this in some sort of um in in some some
sort of semblance of state so with that
we're going to get a little fun here I'm
going to show you um an actual example
of what this could all look like a
router skills and memory so this is an
open source project um that actually
looks at the inner workings of an agent
these are called traces for folks who
may not be familiar if you're you know
in leadership or this is really what
your engineers are looking at when
they're actually building and
troubleshooting your agent they're
actually understanding what the heck
went on under the scenes so this is
actually an example of a code-based
agent somebody asked a question like
what trends do you see in my Trace
latency AKA what's making my application
slow this is the router call that we
were talking about earlier where it
actually decides well how do I then go
ask how do I then go tackle that
question so first you can see here
there's multiple router calls there's
not just one router call this is pretty
common as your application grows you can
have multiple times where it comes back
and has to decide what do I need to go
do so the first time it calls the router
what it does is it actually so the
router then makes a tool call um which
is essentially the skill that you need
the first time it actually makes a tool
call to then go run a SQL query go
collect all of my traces of my
application and go go run a SQL query
then it goes back up to the router and
then it calls the second skill which is
actually the data analyzer skill which
takes all of the traces and the
application data and then it passes it
to something that actually analyzes that
data so in this case you can actually
see there was a router there was tool
calls we actually have memory that's
actually storing everything that's
happening under the scenes and so really
just shows all three of the different
components that I actually just walked
through so now that we have an example
of a of an agent with a router and
skills in memory let's talk about how to
actually evaluate these agents every
single step that I just walked through
here actually is an area where the agent
can go wrong for routers typically what
teams end up caring about is did it call
the right skill because if it didn't
call the right skill you know user asks
for I asked for leggings but then it
sent me over to customer service or it
sent me over to um you know uh something
about discounts and Deals so you
actually want to make sure that the
router within an agent is correctly
doing the right skill and calling the
right skill so that's the first piece
that you'll want to make sure that your
teams are evaluating so if your teams
are building agents when ask well hey
what's the ultimate control flow what's
the control flow and are do we have
something like a router and are we
evaluating it to make sure that it's
correctly calling the right skill
between ABC and is it calling the right
skill with the right parameters so not
just um a calling product search but
actually making sure that whatever way
you've designed that skill you're
actually passing in the correct things
like um you know I want this type of
material I want this type of whatever
cost range you're actually passing in
all the right parameters into what the
user actually is is asking for can I get
a raise of hands have any of you guys
heard of do any of you guys evaluate
your agents today actually is that
something you know your teams are doing
okay awesome are any of you guys
evaluating this router level internally
okay awesome wow this is a great group
okay this is impressive um
okay let's next go to the next one which
is actually evaluating a skill this is
actually the part where it gets really
interesting and tricky because there's
many different components in a skill
there might be in this case I have a rag
type of skill so I want to look at
things like evaluating the actual
relevance of the chunks that were pulled
I want to look at the actual correctness
of the answer that was generated but
this skill itself can have many
different llm as a judge evals or can
also have code based evals that you
might want to run to actually evaluate
the
skills the skills of the agent and then
lastly this is kind of a really
important one that we're seeing teams
probably have the most trouble
evaluating which is actually the path
that the agent took
because well ideally you want it to
converge you call the same skill
hundreds of times and it always takes
about five steps or six steps to
actually query what the user asked for
put in the right parameters call XYZ
components of the skill and then
ultimately um take the right you know
generate the right answer but sometimes
this can be a little longer we've seen
sometimes where the same skill and I
don't know if you all have done this
experiment but you can put the same
skill and build it with open the eye and
you can also build it with anthropic and
sometimes they have wildly different
number of steps that the path actually
takes and so so the goal here is how do
you be succinct and how do you also make
sure there's reliability in the number
of steps that your agent takes to
actually consistently complete a task so
we call this convergence um but probably
one of the hardest to actually evaluate
is anyone evaluating convergence today
or at least counting the number of steps
awesome okay you're awesome
dude cool well with that I'm going to go
maybe two more minutes then I'll hop
into one more demo here so
if any of you guys have watched the
movie her this is from
her um uh this is
where you know the the main character
asks like who else are you talking to
and you know the Samantha says something
like 8,000 other people are in a
conversation with me right now and so
the future of voice applications is that
these are probably some of the most
complex type of applications that have
ever been deployed ever been built it's
going to require one more additional
pieces to actually evaluate voice
applications and the interesting part
about these is that it's not just the
text that needs to be evaluated or the
transcript but it's also the audio chunk
that needs to be
evaluated um in a lot of these Voice
Assistant apis you have the generated
transcript that happens actually after
the audio chunk is really sent and so
that's a whole another dimension around
is the user how what's the user sent to
is the speech to text transcription
actually okay is the tone consistent
throughout the entire conversation and
so you actually need to evaluate not
just the audio piece and the flow of the
conversation and everything else you're
doing for all your other you know other
parts of your agent but also make sure
that the audio chunks are getting their
own evals defined on um you know intent
or speech quality or speech detects
accuracy um so this is important for
voice
so with that um I'm going to actually
show you guys how we evaluate our own
agent so that you can get a little bit
of a example of of what some agent in
the wild actually does um this is our
own agent so let me actually show you
what it looks like um you can actually
go in our product today and there's a
little co-pilot and our co-pilot does
something similar to what other
co-pilots do where as people are
spending time in our product we actually
help them do things like hey help me
debug this help me summarize this help
me look at this um can I search with
natural language there's kind of this
co-pilot integrated throughout our
entire product but we're an eval company
so what do we do we actually dog food
our own tool and we decide to what
you're looking at here is actually the
traces of our entire co-pilot actually
in the wild and every single step of
this co-pilot we actually run
evaluations of so in this case we have
an eval at the very top actually
evaluating something around was the
overall response that was generated this
was actually a search question is the
overall search question actually correct
or incorrect and then we also have one
around once it actually called the
search router did it pick the right
router and then did it pass in the
correct arguments into the router and
then finally
ultimately did it complete the task or
the skill correctly in the execution of
this this entire skill and so evals
aren't just at one layer of your entire
Trace if you take anything away from
this conversation the goal here is
really how do you make sure that you
have evals throughout your application
so that when something goes wrong I can
debug if it actually happened at the
router level if it happened at the skill
level or if it happened somewhere else
along the flow um and I think that's it
from me any
[Applause]
questions yeah so what teams actually
end up doing is they have um call like
the input into a skill so the input into
I'm not sure you know if you're a
building agent but like if there's some
input into your skill that's typically
what the input would be and then it
would repeatedly call that same skill
with the same input like one time two
times three times all the way to 100
times and then you're tracking for every
single run of that skill how many steps
did it actually take so ideally you want
to MIM mimic it with the same input and
the same skill we do have some teams
though where as part of testing they'll
slightly modify the input so sometimes
they'll ask the question a little bit
differently a little bit more wordy see
if that takes more steps um so you can
do that as well but it's you know
flavors of the same input is typically
how we recommend testing for
convergence um there's a lot I didn't
cover here today which is around like
for example guardrails is a big
component of you know I call it slightly
more proactive type of EV valves because
you're you're running EV vales and then
you're actually making an action or
blocking what the output of the the
agent could because of it um if I'm
honest with you though I think that it's
really mostly useful for external facing
applications and highrisk environments
because of course it's going to add some
sort of latency to your actual call um
so you want to be sure you need it and
sorry I think there's a question there
yeah oh yeah
yeah okay good point so she uh first she
asked both great questions first she
asked well how do you think about um
evals in the context of multimodal
applications so if you have voice if you
have video if you have images how do you
actually think about that piece in the
evaluations let me tackle that part
first so we we do see that a lot I think
voice is probably more common than video
is is what we're seeing and to be honest
image and I think there's just a lot of
applications like the call center one I
was telling you about I think
Enterprises have just been like the
ability to just talk to something is a
lot easier than video and image um so
we've been seeing a lot more voice
assistance and then with voice the
things that they end up caring about is
actually um a fun story uh one of our
customers the tone started out really
like nice and sweet in the beginning and
then it got rough near the
end and it just and so that wasn't
really something that they expected they
wanted the tone to be consistent
throughout the entire conversation so
stuff like that is things that the
speech is not going to detect so if
you're just evaluating the transcript
you're not going to detect those things
that the underlying tone has changed um
with image and video uh what we've
typically seen people do is summarize
what's in the picture or summarize
what's happening in the video and then
try to evaluate as a common one and then
of course all the common like quality of
image quality of video ends up being
common things that people end up
evaluating uh and then I think the
second question you asked was like how
how do people actually use you know I
won't talk specifically about ours but
just a general observability platforms
so people do um plug in and connect
their actual data because at the end of
the day today these are probably some of
the most complex software that has ever
been built and they're also
non-deterministic so they do want some
level of visibility into what's
happening the engineers use it for
troubleshooting we're seeing aipm really
make a rise into really understanding
The Experience um so they they are um if
you're concerned about data security
there's a lot of ways not just our
platform but other platforms out there
too where you can always decide to
deploy it in your own VPC
um so that's kind of how they address
the security yeah go
ahead thanks
y yeah
yeah um I'm so excited for the call
center transformation that's about to
happen right now um because I think
we're going to see a lot more of what
you're saying so one thing we have been
seeing teams do is I I mean it just
comes down to what can you fit in the
context window and the context Windows
aren't big to handle a lot of back and
forth especially of the voice
conversations with both the audio file
as well as the transcription what we see
some teams do is that uh
they I think this is B pretty common of
like summarizing the previous historical
State the memory this is where it that
starts to get really important they
decide to then uh summarize the audio
chunk itself so that you're not
obviously passing that back and forth
over the wire um and so typically how do
you keep trying to cond condense the
historical State um and then of course
you're kind of hoping that there's some
recency so things that were asked more
recently are more important than things
that were like earlier in the
conversation and so that's how we've
seen a lot of people manage State
overall um and and just growing size of
these conversations but great question
yeah thanks everyone I think that's time
[Applause]
[Music]
bye would you like more time to go to
more epic conferences like this one our
next presenter is the director of
engineering AI at data dog where they've
built AI agents that are always ready to
handle issues on your behalf
please join me in welcoming to the stage
Diamond
[Music]
[Applause]
Bishop hey I'm Diamond I hope everyone's
you know feeling the AGI today um I'll
be sharing our AI agents at data dog and
what we've learned building the devops
engineer who never sleeps I came all the
way from the New York Times building uh
right across here uh to see all of you
so I hope it's worth it
if this
works uh
okay here we go hey little about me um
I've been working for my entire career
about 15 years or so in AI trying to
build more AI friends and co-workers um
I wouldn't read too much into that I
have human ones too um I promise um
throughout the AI Winters and lws of the
last 15 years or so I've managed to keep
doing just that at Microsoft
Cortana um building out Alexa at Amazon
working on pytorch at meta and building
my own AI startup that was working on a
devop assistant now at data dog we're
building out bits AI which is the AI
assistant who's there to help all of you
with your devops
problems so today I little I'll talk a
little about that talk a little about
the history of AI at data dog a little
bit about how we think about AI agents
today and where we think things are
going for the
future day dog is the observability and
security platform for cloud applications
there's a lot that we do um but it kind
of all boils down to being able to
observe what's happening in your system
and take action on that make it easier
to understand make it easier for us to
uh simply understand and build out
things to have a safer and more devops
friendly
system we've been shipping AI for quite
a while actually um it's not always
inyour face it's not always out there
saying here's a big AI product but
things like proactive alerting really
understanding things like root cause
analysis impact analysis and change
tracking and much more has been
happening since 2015 or
so but things are changing this is a
clear era shift I think of this kind of
similar terms to the microprocessor or
the shift to SAS um bigger smarter
models
reasoning and multimodal coming uh
Foundation model Wars happening this
General shift where intelligence becomes
too Shi too cheap to
meter and what this means is products
like cursor are growing you know
terribly fast um and really people are
expecting more and more from AI every
day um with these advancements at data
dog we're really trying to rise to meet
the shift as well the future is
uncertain this kind of ambiguity creates
opportunity but there's a lot of
potential
for us that's kind of the dawning of
this intelligence age we're working to
move up the stack to leverage these
advancements and give even more to our
customers by making it so that you don't
use data dog as just the devops platform
but also as AI agents that use that
platform for you this requires work in a
few key areas that I'll talk about
developing the actual agents doing eval
you just heard a lot about eval we'
think about that every day for better or
worse um and building new types of
observability there's a few agents that
we're working on right now in private
beta the first is the AI software
engineer this kind of looks at problems
for you looks at errors tries to
recommend code uh that we can generate
to help you improve your system the
second is the AI on call engineer this
wakes up for you in the middle of the
night does your work hopefully makes it
so you have to get page less frequently
and then we have a lot more on the
way so I'm going to talk a little bit
about the AI on call engineer first this
is the one that you know everyone wants
to save them from that 2 am. alert you
don't want to have to wake up in the
middle of the night go and look through
your runbook go and figure out what's
going on if you can help it our on call
engineer is there to really make it so
you can keep sleeping this agent
proactively kicks off when an alert
occurs and works to First situationally
Orient read things like your run books
grab context of the alert and then goes
and you know figures out the kind of
common stuff that each of you would do
on data dog already look through logs
look through metrics look through traces
and kind of act in this Loop to figure
out what's going
on the en call agent's great for both
automatically running investigations for
me but also you know being able to look
through and find summaries and find
information for me before I even get to
my computer so if I want to get insights
into why an alert just occurred or
figure out why a trace might uh be
showing an error this agent can jump
ahead pull information for me and show
it to me we also have added a new page
that makes it easy so that you can have
human AI collaboration this is still
something I'm thinking about a lot is
like what what kind of collaboration do
we expect we want our agents to act as
humans but we also need to be able to
verify what they did and be able to kind
of look over what they're doing and
really learn from
it it also helps you to kind of earn
trust along the way I can see the reason
why uh this hypothesis for example was
generated I can see what the agent found
and I can make decisions about whether
or not I agree along the way
it also tells you things like what steps
did it actually take out of your runbook
and kind of like a junior engineer who
does this work I can go ask follow-up
questions find out why I did a certain
thing a little more insight into how
we're making this happen much like a
human s or devops engineer our agent
Works to put together hypothesis on what
might be happening and reason over them
coming up with ways to test them use
tools in the tool forer sense to try out
ideas run queries against logs metrics
Etc and work to validate or invalidate
each
hypothesis in the case that it does find
a solid root cause our agent cause uh
can suggest remediations along the way
again just like a human might might say
hey we should page in that other team
that's involved here or it might offer
to scale up or down your infrastructure
over time we plan to add more built-in
actions and eventually discover new
types of workflows based on what your
team has
done but if you already have certain
workflows that youve set up in data dog
um we can tie directly into them and
make it so that our agent can understand
those workflows and how they might map
to helping you remediate a
problem and if it's a real incident the
enal engineer is not usually done once
an issue is remediated you usually go
and write a postmortem you go try to
learn from it you share it with your
team our agent can do the same write out
your postmortem for you look at what
occurred during the entire time what it
did what humans did and put that
together so that you have something
ready in the
morning so that was the on call engineer
um that's the one that is you know
trying to help you in the middle of the
night trying to help you every time
alerts come on um we also have this AI
software engineer I think of this as the
proactive developer the devops or
software engineering agent who observes
and acts on things like errors coming
through this is kind of the error
tracking assistant it automatically
analyzes these errors identifies causes
and proposes Solutions those Solutions
can include generating a code fix and
working to reduce the number of on call
incidents you have in the first first
place so they can work in coner to make
a better system over time in this case
the assistant has caught a recursion
issue proposes a
fix and even creates a recursion test so
that we can catch it if it happens again
in the future we have the option to
create a PR in GitHub or open the diff
in vs code for editing this workflow
significantly reduces the time spent by
an engineer manually writing and testing
code and greatly reduces human time
spent
overall so what have we learned building
out these agents and some of the new
ones that we're working on today well
we've learned quite a lot um there's a
lot of things that we started with that
we kind of went back and and redid um
but a few areas I'll touch on that I
hope help you as you develop your own
first is scoping tasks for evaluation
it's very easy you know to build out
demos quickly much harder sometimes to
scope and eval what's occurring second
is building the right team who's ready
to move fast and deal with the ambiguity
that comes with these kind of problems
third is that you know the ux of is
changing um that's something that
everyone needs to be comfortable with
and fourth is observability matters you
know uh I'm surprising for data do to
say that I'm sure but observability is
terribly important even in this new
era so scoping the problems scoping the
work to be done I like to think about
this as defining jobs to be done and
really kind of trying to clearly
understand step by step what you'd like
to do think about it from the human
angle first and think about how another
human might go and evaluate it um this
is why we build out vertical task
specific agents rather than building out
generalized agents we also want where
possible this to be measurable and
verifiable and at each step this has
honestly been one of the biggest pain
points for us and I think this is true
for many people working in agents where
you can quickly build out a demo you can
quickly build something that looks like
it works but then it's very hard to
actually verify that over time and
improve it um use your domain experts
but use them more like design Partners
or task verifiers don't use them as the
people who will go and kind of write the
code rules for it because there is a big
difference in how these kind of
stochastic models work versus how
experts work you know everyone kind of
knows gnome and his uh anti- NLP um
rants but that kind of stuff happens
pretty frequently at domain
experts eval eval eval I can't stress
this enough um start by thinking deeply
about your eval the number of mistakes
we made by not thinking about eval first
is uh frustrating and something that I
think everyone should think about it's
very easy to build these demos as I said
um but everything in this fuzzy
stochastic World requires good ev even
something small to start this means
offline online and kind of living evl
have endtoend uh uh uh tasks have
endtoend measurements uh make it so you
also instrument appropriately the way to
know if humans are using your product
right and giving you feedback and then
make this a living breathing test
set building the team um you don't have
to have a bunch of ml experts there
aren't that many to go around right now
um what you really need is you want to
seat it with one or two and then have a
bunch of optimistic generalists who are
very good at writing code and very
willing to try things out fast um I'll
also note that ux and front end matters
more than I'd like as a backend engineer
myself um but it's terribly important as
you collaborate with these with these
agents and the
assistance um and then you want
teammates and people who are excited to
be AI augmented themselves this is
day-to-day AI use this is Explorer types
who want to learn this is field that's
changing fast um and if you don't have
people like that you're going to kind of
get
stuck you want folks to kind of you know
yeah you're in for the vast and endless
AI capabilities right um it's a big
world out there and there's a lot going
on Ye Old ux um this is one of those
things that I still you know we think
about we go back and forth every day um
it's an area that I didn't realize was
quite so important initially when I
started working in this field um despite
my engineering sensibilities and lack of
ux it's terribly important um this is
such an early space of work this is kind
of one of the more important things here
as you collaborate and work together but
the old ux patterns are changing be
comfortable with that um and so far I'm
partial to agents that work more and
more like human teammates instead of
building out a bunch of new pages or
buttons so who watches the Watchman
right um You have these agents running
around um observability is actually
really important and don't make it an
afterthought um these are complex
workflows you really need situational
awareness to debug problems and this has
saved us time a lot as we start to work
with um a new view that we're calling LM
observability in the data dog
product um data dog in general has a
full observability stack as many of you
know we can look at gpus um we can look
at LM monitoring we can look at really
your system end to end but tying in the
llm observability has been very helpful
because you have a wide variety of
interactions and calls out to models
you're hosting models you're running
maybe models you're using through an API
and we can make them all uh kind of
group together in the same paint of
glass so you can look at them and debug
what's
occurring I will note though that this
can get messy fast with agents our agent
for example has very complex multi-step
calls you're not going to look at this
and figure out what's going on right
away Um this can be hundreds of calls
this can be uh you know uh tons of
different places where it's making
decisions about tools looping time and
time again and if you just look through
a full list of these things you'll never
really figure out what's going on so
here's a Qui you know sneak peek into a
more agent view of what's occurring
inside of our observability tools this
is our agent graph um really what this
means is that I can kind of look at it
just like our agent did and looking at
workflows that are occurring you can see
in this even though it's a big graph uh
there's a bright red node here if we
zoom into that we can actually see where
errors were occurring this is very human
readable something that makes it super
easy to figure out what's going on when
your complex workflow is
running as an aside though I do also
want to note what I think of as kind of
like the agent or application layer
bitter lesson uh General methods that
can leverage new off-the-shelf models
are ultimately the most effective um by
a large margin um I hate to say it but
like you sit there you fine-tune you do
all this work on like this specific uh
you know project a specific task and
then all of a sudden you know open AI or
someone comes out with a new model and
it handles all this you know kind of
quickly a lot of the reasoning is solved
for you um we're not quite there where
it handles all of it very quickly but
you should be at a point where you can
EAS try out any of these models um and
don't feel stuck to a particular model
that you're you've been working on for a
while you know Rising tide lifts all
boats
here um I also think a lot about not
just building agents but what it might
mean for other agents to be users of
data dog and other SAS products um
there's a good chance that agents
surpass humans as users in the next five
years um I'm probably somewhere in the
middle on my estimate there you know
there people who will tell you that'll
happen in the next year there'll people
who will tell you you it'll happen in 10
years I think we're somewhere around the
fiveyear Mark um but this means that you
shouldn't just be building for humans or
building your own agents you should
really think about agents that might use
your product as well an example of this
is like third party agents like Claude
might use you know data dog directly I
set this up with mCP relatively quickly
um but any type of agent that might be
coming in and using your platform you
should think of the context you want to
provide them the information you want to
provide about your apis that agents
would use more than humans
so looking ahead um the future is going
to be weird it'll be fun uh and AI
accelerating is accelerating each and
every day I strongly believe that we'll
be able to offer a team of Dev secop
agents for hire to each of you soon you
don't have to go and use our platform
directly and integrate directly ideally
our agents will do that for you and our
agents will handle your on call and
everything like that for you um I also
do think that AI agents will be
customers many of you building out SRE
agents and other types of Agents coding
agents should use our platform should
use our tools um just like a human would
and uh we can't wait to see
that and generally I think that small
companies out there are going to be
building built by someone who can use
automated developers like cursor or
Devon to get their ideas out into the
real world and then agents like ours to
handle operations and Security in a way
that lets you know an order of magnitude
more ideas make it out into the real
world thank you so much um please reach
out if you're building any agents that
want to use us um or if you'd like to
check out our agents as well um there's
a lot to build here and if you want to
work in the space we are hiring more AI
engineers and people who are just
excited about it but thank you very
[Applause]
[Music]
much our next presentation is about
building self-managed AI networks please
join me in welcoming to the stage
technical lead for Arista networks Paul
[Music]
Gilbert ah oops how do I go back there
you go yep so my name is p Gil I'm a
tech lead for Arista networks I have an
accident but I'm actually based here in
New York City and I build or design or
help build and design uh Enterprise
networks uh but what we do is a Plumb in
uh so I'm not going to talk about agents
but more kind of how you train uh models
what the infrastructure looks like and
how you do inflence in on on the
infrastructure uh I I I normally teach
people uh the very basic stuff so I I
you guys probably know this already but
these are new terms for us when when we
built computer networks people will come
to us and say job completion time uh
barrier I'm I'm pretty sure you guys
know that the the inference and the
question I get all the time is you know
we we can build a network to train a
model there's a an algorithm maybe you
can use to to look at what you what you
need uh but then you know what's
inference and you know it's changed a
lot now because of chain of fa and
reasoning models the inference is a lot
different it used to be X but now it's
y uh I'm pretty sure you guys have seen
this slide but I use these just to talk
to Enterprises around kind of what they
might be thinking in GPU size uh this uh
wed Sosa came up Dr wed Sosa came up
with this on the left there is the
training and on the right there is the
inference and it's kind of you know on
one you have times 18 times the other
times two again I think that changes now
with Chain of Thought and reasoning not
too sure kind of which way it's going to
go and at the bottom there was a really
interesting one again which I showed
customers cuz I most of the Enterprises
I talk to kind of don't understand
models and how they work and training I
know little but not a lot but you know
the the the the the model they trained
down here was 248 gpus for one to two
months and then when you go to inference
after fine Jun in alignment it's four
h100s for inference so we talk to people
about building different types of
networks which I'll speak about but uh
you know kind of I always start at the
beginning and you know this is I got
this slide and I think it's really
interesting you know llms were kind of
just tiny little bit of inference but
now with the the next generation models
it's a lot so this is what we build or I
I build uh and these are new
terminologies for us from the networking
world uh backend Network so this is
where you connect gpus to when we build
these networks uh they're completely
isolated because gpus are really really
expensive they take a lot of power uh
and they're really hard to get old of so
when people build AI networks in the
Enterprise uh we don't connect nothing
else to these networks the the bottom
part of that the back end Network
there's eight gpus per pool on those
servers and they can be Nvidia they can
be super micro they can be be whatever
they will go into a high-speed switch at
the bottom there uh you have a leaf
switch and a spine switch and nothing
else attaches to that Network and then
on the front end network is where you
get sto storage from the Train the model
obviously you know it the gpus
synchronize they do something they
calculate they produce an algorithm and
they call for more dat and that's kind
of the cycle the front end network is
not as intense as the back end the
backend Network depending on the model
that you train uh they the gpus will
actually work at 400 gabt and for for us
in the Enterprise you know and I've
built some big big data centers but I've
never seen anything like that so this is
in the networking world this is a
completely new world to us and we we
make the networks as simple as possible
because again
these are really expensive and people
want to get their money's worth they
want these running 24 by7 uh so we do IB
ibgp or ebgp just a really simple
protocols uh I'm sure most of you have
seen this but again I I kind of teach
this I uh the the this was an
infrastructure presentation but you know
that's kind of the back of a h100
probably the most popular it is actually
the most popular uh uh AI server out
there right now you can see in the
middle there there's four ports but
those four ports are broken out into two
so there's eight ports those are the GPU
ports there and then kind of over to the
left there there's the ethernet ports so
that's what we connect to we've never
seen anything like this before you know
when you first speak to people about
this you know I've seen servers with 400
gig you know and I do a lot of the big
Financial networks but never before have
we seen servers that can put this type
of traffic onto a network
uh you we always they always ask me
about this and you know I got this from
an Nvidia slide it's stand there but you
know there's this thing called scale up
and scale out I'm not really sure scale
up you know when when you buy when my
customers buy these servers that always
have hgp using it you can't add anything
to an Nvidia server you get the djx the
if you go with the Outsource model it's
a hgx so it's a third party you don't
really add things to it but so I don't
see scale up but scale out you know
obviously you can build we we build a
network so that you can add more gpus we
can start very small and we can go up to
you know hundreds of thousands of gpus
not in the Enterprise but the cloud
scale guys
do so so what's different uh you know
for us again it's there's it's hardware
and software the hardware are those gpus
we're not used to them uh the first time
I tried to configure one it took me
hours and hours but I'd never seen them
before whereas other stuff I've seen
pretty quickly you know you have
software so Cuda and nickel are probably
you know two of the biggest protocols
and you you guys know more about that
than me but we had to kind of understand
not Cuda but nickel because it has a
collective so we had to understand kind
of how the collective works because that
will put uh traffic onto the network in
a certain way uh the hardware was
completely different again you know we
had the eight uh 400 gig PS and the four
400 gig PS facing the the the front end
Network totally new to
us the other thing was Data Center
applications kind of web app database
they're really easy uh they go from one
to the other and in different parts of
the network and if one fails you you
have some kind of low balance in or fa
and it fails over uh this is AI networks
not like that the gpus all speak they'll
talk to each other they'll get stuff
they'll send stuff and if one fails the
the job might fail it might recover but
it's a different concept to us so it's
it's hard to imagine uh and traffic is
bursty because all of these gpus if you
have th000 gpus of 400 gig they will all
burst at the same time and if you if
they can they will burst up 400 gig so
there a lot of traffic on a network and
I've never seen anything like that so
when we build these networks we don't
build them over subscribe we build them
one to one uh might in the data center
world we used to do 1 to 10 it went down
to probably 1 to three but never one to
to one because it's just really
expensive to to to build that kind of
bandwidth but with AI networks we need
to so we have no over subscription in
the network and from from our point of
view if you look at what one of these
servers can put on the network you know
just a h100 is 8 400 gig gpus and 4 400
gig is 4.8 terabyt which is and that's
just one server the storage size the the
the front end probably nowhere near that
but the back end is always wire rate and
then 800 gig is probably you know the
bees are just around the corner I think
in March they'll be released I think
there's some people to have them and
those are 800 gig we support 800 gig
today on the network but each one of
those servers in there's a possible 9.6
terabytes per server and you most people
in my world in the in the Enterprise
world come from servers at maybe 1 two
three or four 100 Gig ethernet but
nothing like uh 9.6 terab whes per
server so the other problem we have is
the traffic patterns when we low balance
from kind of leaf to spine we use a
thing called entropy which is the five
tle IP address port and Mac address and
we do pretty good low balancing but with
gpus it's just one IP address and it can
sometimes match to a single Uplink and
over subscribe it which would be really
bad because you'll start dropping an
awful lot of packets so we have to take
a lot of Care on how we low balance
within the AI Network or how we build
the back end and the front end so we
have some pretty cool tools where we
don't now look at the five topples we
actually low balance on the percent of
bandwidth that's being used on the
Uplink and we can get up to about 93%
utilization on all the uplinks to the
down links which is pretty
good uh you know and again one thing
that's really new to us is a single GPU
can you or a set of gpus if they fail uh
sometimes the model will stop I know
checkpoints but uh a single fit GPU fail
is a problem for us and if one of the
big problems that's we've always add is
Optics and transceivers and Doms which
is the rates and the loss between them
and the cables Etc and when you start
building these networks with thousands
of gpus you will have a lot of cable
problems and you will have a lot of GPU
problems so it's it's really hard for us
because we again this world is is new to
us the last year or
so uh Power I you know power is you
could you know you've read the
newspapers you know everyone's trying to
buy buy nuclear power stations to power
these
things the the average rack in the data
center today is about 7kw to 15kw and
you can put like you know 10 one one IU
rexs into those and you'll be fine and
when was come to me say yeah we finally
got gpus whatever and I say to them you
know what kind of racks have you got and
they said well we're going to put them
in and then you could only put one of
these servers in one of those racks
because they they actually draw with 8
gpus 10.2 KW so you need new racks uh
most Enterprises now waking up to this
and they're building racks between 100
and 200 KW and they're water called
there's no way you could air call them
in a data center so that's a whole new
concept to people as well is water
called
rxs uh traffic is is both ways which
again is new to us so north south you
know in a regular data center you have
users coming in database app web
whatever and it comes in it goes out but
in the AI world when the gpus speak that
traffic is east west because it's
speaking amongst each other uh and then
when they ask for more data from the
storage Network it's north south so you
have both traffic patterns the East West
is really bad that's kind of where they
run wi rate the front end to the storage
is much more more calmer because most
storage vendors can't put that kind of
traffic on on the on the network right
now I'm pretty sure they will one day
but they're more around 100 200
gig uh and you know in a network there's
a certain amount of buffering on these
switches and buffering is bad because it
means it can't send traffic somewhere
because something else is is not
receiving the traffic trffic so you need
a congestion control and feedback uh and
right now we use something called Rocky
V2 which is two parts of Rocky V2
there's a PFC and an ecn uh if you were
building an AI Network your engineers
your network Engineers will definitely
know about this ecn is an endtoend flow
control where if traffic if traffic is
if there's congestion somewhere in
network packets are marks they go to the
receiver the receiver sends back to the
sender you need to slow down because
there's congestion and it goes for an
algorithm it pauses for a while it slows
down and if it doesn't get any more ECM
packets it speeds up again uh and PFC is
basically stop uh my buffers are full I
can't take anymore so it kind of is a
dead stop so you have kind of a a slow
uh feedback mechanism with ecn and the
kind of an emergency stop with
PFC the networks we build are really
simple
we don't uh have things like in regular
data centers we have DMZ with firewalls
low balances Etc we have connections to
the internet we have L4 through 7
service whole bunch of stuff when we
build these networks they're totally
isolated uh the GPU the the back end is
completely isolated the front end
possibly could have connections to
something but even then it's so
expensive to build you you don't want to
take the
chance uh on demand the applic ations
that we're used to you know if it fouls
or something fouls something will
recover and you know you you may get a a
little skip or a jump but if you've done
the right thing it's not going to be
that bad in this world if something
fails the model May foul and the call
that you get into your operation Center
is different cool than you get that if
you're you're at kind of restarted and
everything's good
again uh the other thing is collectives
you know obviously nickel will go out
there and work out kind of where to GPU
gpus are and what to do with it but
there's kind of different design so I
tell my customers that speak to your
data scientist and your your uh your
programmers developers and find out kind
of what they're doing and what kind of
models they're building because it can
can affect the network on kind of how
you build it and how you design
it so networks totally isolated things
are moving fast we're at 800 gig right
now uh which is you know we have been
for probably a year we will see 1.6
terabytes on the network probably end of
this year uh early
20127 and it will just keep grind and
grind and grind and these models will
get bigger and bigger and consume more
and more and more I'm pretty
sure uh visibility and Telemetry I you
know all my customers the core that they
get when a model fails because the
network is the problem is a different
core than they used to so we put
different uh to lry and visibility in
there to make sure that if things are
going wrong on the network that you know
they know about it hopefully before they
get that
call so yeah I work for Arista our
operating system is called EOS and we
have a whole bunch of features there so
if you were building an AI Network I'm
not sure that you guys speak to the
engineers but this is the type of things
we talk about uh lossless ethernet
everyone thinks that you you know when
you train them model you can't drop
packets I I've seen it and you can I
think drop packets are okay consistent
latency is okay but if you drop so many
packets obviously it's a problem so flow
control losses e foret is really key ecn
and PFC are part of that as I said
before they're flow control mechanisms
one is a slow down please and the other
one is a stop and as you know because G
gpus are synchronized if something slows
down you slow down one port one GPU
everything slows down so you really got
to be on top of kind of the over
subscription and if you are getting Qin
where is
it uh we we have really good buffer uh
we can adjust buffers we have different
kinds of switches for different places
in the network but we found that models
send and receive a particular size
packet and what we do is we adjust those
buffers to accept those types of packets
buffering is a really expensive
commodity and switches in networking and
if you can find a way to allocate the
buffers exactly tuned to the packet
sizes it's a win-win uh and we we've
worked out how to do that which is
good uh yeah monitoring is really key
for us uh I tell my customers there's
probably five things you want to do one
of them is RDMA you know uh these
networks train using RDMA you know uh
which is memory to memory rights rather
than going CPU to memory and RDMA is a a
complex protocol and it has 10 or 10 or
12 maybe more kind of error codes so if
the network starts seeing problems uh
and starts dropping packets rather than
just drop the packet on the floor we can
actually copy that packet to a buffer or
send it somewhere or just just the
headers and and why we drop that packet
and if you think about it's really cool
like most networks will in congestion
your buffer fills up is you're going to
drop the packets we'll drop the packet
but we'll actually take snapshot of the
packet and the headers any rme
information in it and we'll tell you why
we dropped
it uh another thing we have which is
really good we have an AI agent uh you
know from the networking point of view
we can look at what's going on but we
don't really have any visibility into
the GPU so now we have an agent which is
an API uh and some code that we load on
the the gpus in Nvidia and they will
speak to the switch so that agent will
say to the switch how are you configured
so PFC and ecn those slow control
mechanisms have to be configured
correctly because if they're not it will
be a disaster so the the GP will speak
to the switch and say this is how I'm
configured the switch will say yeah
you're good we understand each other and
the second thing it does it gives you a
whole bunch of statistics about packets
received packets sent RDMA errors RDMA
issues in there so you can can correlate
now if the problem is a GPU or if it's
the network which is a huge step forward
for for us
uh another really cool feature we have
is uh smart system upgrade you know if
you used the routers and switches you
know you have to upgrade the software
sometimes uh sometimes to get new
features sometimes to fix pts which are
security vulnerabilities on that switch
uh we've worked out a way now that we
can do that you can upgrade code without
actually taking the switch offline so
you know if you have 1,24 gpus with 64
switches in your network you actually
can upgrade those and the gpus can keep
working so it's a real big step forward
for
us so you for us again I I don't know
but no over subscription on the back end
you can't because the gpus use
everything you give them address wise is
really important for us it's a it's a
Point to-point connection so it's sl30
31s you could use IPv6 if you have ipv4
problems address space problems all my
customers I told bgp because it's the
best protocol out there it's really
simple and it's really quick uh evpn VXL
if you have M tency if you have a lot of
different business units lines of
business uh using the network you need
things like Advanced low balancing we
have a couple of different we we
actually look at the collective that
you're running low balance on that
Collective Now which we call cluster low
balancing you could deploy It Rocky I
tell all my customers do it because if
you don't your network going to melt
down you're not going to know why these
things will give you an early warning
system that you need to do something
with your network so they're really key
to have and visibility in Telemetry is
is really good at all times because in
the network knock or the Operation
Center you always want to be aware of
the problem before you get the call from
the developers and the people that have
paid a lot of money for that
Network I I'm running out of time here
but this is kind of 1,400 gig cluster
what it would look like spine and leaf
uh again no over subscription 800 gig
links between the leaf and spine 400 gig
down to the
gpus this is a 4,000
cluster these ones these are the bigger
boxes these are 16 slot one of these
boxes can take 576 800 gig gpus so
1152 400 gig gpus so if you're building
clusters with thousands of gpus then
this would be the box for you the 7800
series and and putting it together this
is kind of what we would build there's
three networks here there's a backend
Network where your gpus live there's a
front end Network where the storage live
and then there's the the the inference
that you take the model you put it
somewhere
else I I'm out of time I do not come in
so I'm guess I'm not so so so the other
thing is you know there's Ultra eanet
Consortium you I don't know if it's
interest you ethernet hasn't changed the
way it's built probably 30 years uh
there's some things it could do better
around congestion control around packet
spraying around the Nicks talking to
each other so there's this thing called
Ultra eite Consortium version 10 will be
ratified probably q1
2025 and it's a kind of different way of
building networks and you probably won't
see them until Q3 Q4 but most of the
cloud scale guys were kind of really
keen on this because it it puts a lot
more into the ncks and takes a lot more
out out of the network so we just get
we can do what we're good at which is
forward in
packets so summary you know for us we
have the front end which is storage the
back end which is the really important
part for us uh that part is really
bursty uh the gpus are all synced so
they send and receive at the same time
and if you have a slow GPU that's a
barrier because it stops everyone else
job completion time is what matters to
us if you know we get the call that you
know my job completion time was 1 hour
yesterday it's 4 days today you know
it's probably our problem uh you know
models can checkpoint but they're really
expensive you guys know
that and I'm done and they're still not
coming yeah uh anyone got any question I
take a question if you want if they're
not
[Music]
[Applause]
[Music]
coming our final presentation for this
block is anthropic for VPS of AI please
join me in welcoming to the stage
Alexander bricken member of technical
staff at anthropic and Joe Bailey GTM
Enterprise at
[Music]
anthropic here we are brighter than I
expected good to see you all today I'm
Alexander bricken I'm on the applied AI
team at anthropic so I work very closely
with customers to do technical
implementation work and I also bring
that advice back to product research and
model research um I'm going to pass it
over to Joe hey everyone it's great to
be here my name is Joe Bailey I work on
the go to market team in anthropic I
joined anthropic over a year ago now so
I've seen our models evolve from a 2.1
to today's capabilities and I think
day-to-day what's really exciting is
we're working with AI leaders who are
solving real business problems um that
just seemed impossible a year ago so
really excited about how quickly
everything is
moving okay for today we will do um a
quick overview uh you know who we are
our mission and then we'll focus a lot
on implementing Ai and best practices
and common mistakes uh Alex and I
actually didn't just take this from our
own experience but we uh talked to a
number of our colleagues so this is all
based on hundreds and hundreds of
customer interactions um so we hope
there's some actionable insights to take
out of
this awesome so what is anthropic so we
are an AI safety and research company
building the world's best and safest uh
large language models we were founded a
few years ago by some of the leading
experts in Ai and since our Inception
we've not only released uh multiple
iterations of our Frontier models we've
done so while being at the bleeding edge
of safety techniques of research and
policy I'm going to pass it over to Alex
to talk a little B about our Marquee
model awesome and so some of you are
probably familiar but the most recent
model we launched was Sonet 3.5 new in
late October of last year um you might
be familiar with it because if you're a
developer uh Sonet is actually one of
the leading models in the code space so
if you're familiar with evaluations like
sbench which is an agentic coding eval
uh Sona is still at the top of the
leaderboard for that um I won't go too
much into the details on the eval side
so let's keep
moving um so yeah in addition to what
Joe mentioned we have a lot of different
research directions that we're focused
on um and these are really distributed
but have overlap between you know model
capabilities product research and AI
safety the one that differentiates us I
would say is the interpretability and
this realistically is reverse
engineering the models and trying to
figure out actually how they're thinking
and maybe why they're thinking and then
an additional capability in terms of
steering them in the right direction
depending on a use case so let's dive
into that a little bit more we're still
very early in interpretability research
it's worth mentioning as you can see
there's kind of like a longer timeline
and we're really only at the the first
half of that maybe even the first
25% um but we're we're really
approaching it in these stages that
build upon each other so these include
things like understanding so grasping AI
decision-making detection so actually
being able to understand specific
behaviors and put labels on those
steering so influencing the AI input in
some some way shape or form and I'll get
to an example of that in a second and
then finally explainability and that's
really where you unlock business value
associated with interpretability methods
and so while we see interpretability in
the long term providing you know a lot
of significant improvements in AI safety
reliability and usability specifically
our interpretability team uses methods
to understand feature activations at the
model level and then has published
research on these uh in towards Mon
semanticity and scaling mon semanticity
which are two papers I highly recommend
um and then as the technology improves
into kind of detection landscapes for
example you can imagine having a much
better grasp at uh the actual thinking
and behavior of the model or even
discovering sleeper agents for safety
reasons that might be very deep within
uh model capabilities so a good example
of that is imagine you ask the model
what were the scores of the NBA matches
today right and let's say it knows the
answer and it says oh Steph Curry you
know scored 30 points this would lead to
a feature activation of for example
feature number 304 famous NBA players
realistically that's a group of neurons
activating in recognizable pattern that
we've identified across all mentions of
famous basketball players when model is
answering a question not just Steph
Curry um and you also might have heard
of Golden Gate Claude that was an
example of us steering the model uh
basically amping up the activation in
the Golden Gate Direction and thus
whenever you'd ask a question like what
should I paint my bedroom Claude would
respond oh you should paint it red like
the Golden Gate Bridge and maybe it
should have some like you know pillars
in it or
something I'm going to pass over to Joe
to talk a little bit about some of the
customers we work with yeah so I'm going
to frame this in two ways one is uh sort
of early on discussions and the other
would be just examples of customers that
are doing really cool things so in
conversations there's obviously a lot of
noise and Buzz and everything and that's
fantastic but we often encourage our
customers uh to sort of get back to the
basics and how can they use AI to solve
the core problem that your product is
trying trying to solve we also get to
work with a ton of uh of AI native or AI
startups and this is how they're
thinking about their product and I think
you want to move Beyond uh chat Bots and
summarization these can be great options
but I'd be thinking more like where do
you want to place bigger bets and to
give an example um if you just click one
more time fancy slide uh imagine you're
an onboarding and upskilling platform
the problem that you solve for customers
is you help them get ramp really quickly
and then you help them get to the next
phase of their career by equipping them
with skills so for instance you might be
public speaking you want to get good at
or you might want to become a manager
and so it would be easy to say okay
let's summarize course content or let's
um uh let's have a Q&A chat bot that
answers questions along the way and they
could be helpful but I would actually
think about it differently so what about
if you could hyper personalize uh course
content based on each indiv individual
employees context or if someone is like
breezing through all the course cont
content could you adapt it dynamically
to make it more challenging uh so
they're actually getting more value out
of it and the last one that I
particularly like would be what if you
could uh uh dynamically update uh course
material based on people learning about
the customer so if someone was a visual
learner great let's make visual content
for them and having the AI having the ml
sorry the the large language model just
do that automatically and you have to
think does that solve the problem more
than summarization or or a Q&A chat B um
so really good food for thought and to
sort of talk about some of the customers
uh that we see achieving really industry
leading results by combining uh their
own domain expertise and our um our
model so I won't read off each but just
a couple of call outs one is uh AI
impacting different Industries we have
uh taxes we have legal we have uh
project management they're using AI to
uh drastically enhance their customer
experience they make it more like easier
to use they make it more trustworthy um
and so it's really improving the
experience versus just being like a nice
to have and then they're achieving a
real a real high quality um of output
right you can't be giving you can't be
hallucinating when you're doing your
taxes uh it could just you know that
could lead to all sorts of things so
we're thrilled that they're seeing these
these sort of like business critical
workflows powered by AI driving really
positive outcomes for them and also
their
customers awesome I can do this one yeah
so um getting started I just quickly uh
there's two two key points here so if
you go to the next slide um what are our
products we have our API we have Claude
for work our API uh is for businesses
that want to embed AI in their product
and services and then clae for work
empowers your entire organization to
take advantage of AI and the day-to-day
work we also have uh next one we also
have a partnership with AWS and gcp and
you can kind of get the Best of Both
Worlds here you can access our Frontier
models on Bedrock or in vertex you can
deploy these applications in your
existing environment um and you but you
so you don't have to manage any new
infrastructure so it really sort of
breaks down like any barriers to entry
so you're getting the best of both World
here we talk a little bit about support
throughout this talk it to us it doesn't
matter if you're accessing us through a
third party or our first party so I just
wanted to call that
out awesome so now that we've talked a
little bit about some of the customers
how do we actually set customers up for
Success when working with them out
anthropic so just a Preface on kind of
what my team does as I mentioned it's at
this intersection of product research uh
customer facing interaction and then
also just actual research within the org
um and we support the technical aspects
of the use cases so helping to design
architectures evals tweak Cloud prompts
to get the best out of our models Etc
and then we also bring whatever we see
back into anthropic and we try to build
some the best products we can for our
customers so some examples of projects
we worked on or things that we published
include the building effective uh
research uh paper that uh my colleague
Barry published he's going to be
speaking tomorrow and then as well as
that we launched model context protocol
which is a open source protocol for
language models to interact with data
sources and uh Mah is going to be
leading a workshop on that uh on
Saturday I
believe so anthropic as a whole we try
to effectively support our custom
customers but where we really start to
embed at least my team in particular is
um we work closely with customers that
are using clot a lot and they're facing
really Niche challenges in specific use
case domains and they need support from
our team to try to apply some of the
newest kind of latest and greatest
research or get the most out of the
models from a prompting standpoint Etc
and so this approach is pretty additive
we often kick off a a Sprint once the
customer is facing those tricky
challenges that could be L llm Ops
architectures or evals we help them to
find certain metrics that they deem to
be important when they're evaluating the
model against the use case and then
finally um we help them deploy that kind
of iterative loop the result of that
into an AB test environment um and then
hopefully into production and so a part
of that is the importance of evals and
I'll get onto that in a second um but
I'm going to pass it over first to Joe
to talk about some stuff that we did for
intercom yeah so sort of I think this is
a good segue in what Alex was describing
so for those of you don't who don't know
intercom is an AI customer service
platform they have an AI agent called
Finn by many measures it's the best in
the market and it's a pretty competitive
market so they had their product uh out
for I think about a year or so and when
we spoke to them they shared where they
wanted to go where they saw the futures
of like customer support and agents and
based on some of the capabilities of our
model we felt that we could have a
pretty good impact on these metrics and
so what we started with was the the
applied AI uh lead met with their data
science team and we ran a quick twoe
Sprint we took their hardest prompt for
Finn and we compared it uh against a
prompt that we helped them sort of
figure out with uh with Claude and they
saw really good results after the first
two weeks so much so we went on this
sort of Sprint of about two months where
we were basically um fine-tuning and
optim optimizing all of their uh prompts
to get the best performance out of
Claude at the end of this they're ble to
look at all their benchmarks and see
that anthropic was outperforming the
current llm it's also worth noting that
they do a resolution based pricing model
so there's an incentive for everyone for
the model to be really helpful and help
customers solve problems and not be like
a deflection machine where it's like you
know we're probably all experienced them
before and so at the end of these two
months they decided to move forward with
anthropic they launched it you can read
about it it's called fin 2 and I think
just some of the metrics are really like
mind-blowing like can solve up to 86% of
customer support volume 51% out the Box
our support team thought about lots of
different options and they actually
adopted Finn as well and they saw very
similar resolution rates but also making
it more human so they I think with our
model we can there's much more of a
human element to it so they could do
like uh adjustment of the tone uh answer
length and then was also really good at
doing policy awareness so like refund
policy for instance so unlocking some
new capabilities and we're thrilled to
be partnering with them as they sort of
I think March forward as a leader in in
this space
yeah on a kind of separate note one of
the things I've seen recently is uh
Claude on Twitter acting as some sort of
therapist for a lot of people and I
always find that an entertaining example
of like its character being expressed
yeah um cool so let's get on to some
best practices and mistakes that we see
in the field uh on the goto Market team
so firstly testing and evaluation I'm
sure this those two words have been
mentioned a lot today and probably
tomorrow too um there's some typical
common common mistakes that we see uh
customers strugg with so the first one
is they build a really robust workflow
they spent like a bunch of time building
some architecture out and then they're
like okay now we need to evaluate it
like let's build some evals that's not
really how it should work in practice
because your evales are actually the
thing that directs you towards a perfect
outcome right you can't build a whole
workflow without evals probably from the
get-go or very shortly after and so you
know sometimes customers as a result of
struggling with data problems might not
be able to design their evals you could
use Claude to clean that up do data
reconciliation um or they're just you
know trusting The Vibes too much maybe
they run a couple queries they're like
hey it looks good right are they really
testing that on a representative sample
though like do you have enough samples
to say that the thing that you're
looking at is statistically significant
and like or are you going to you know
run a hundred things when it actually
goes into prod and then there's going to
be like loads of outliar uh because you
didn't actually predict correctly what
that the customer is going to ask of the
model for example so I challenge you to
think about your use cases as this sort
of latent space right let's take this
kind of chart here on the leftand side
of the slide right as you explore the
latent space with different functions
that you can apply to the model let's
say prompt engineering prompt caching
stuff like that you're you're basically
moving your kind of position in that
Laten space around between attractor
States you know and eventually you want
to find an optimized point but you don't
really know where that is right like if
you're changing an instruction don't
know how the attention mechanism of the
Transformer is going to eventually
result in some different outcome that
might not be performant and so the only
way you can truly know that is
empirically and that's through
evaluations and so I think that's why
evaluations are so important and a lot
of people just don't understand that
soon enough in many ways I actually tell
customers evals are your intellectual
intellectual property like if you want
to be competitive in a space you need to
be able to out compete people by
navigating that l l in space and finding
the attractor state state faster than
anyone else um and so part of you know
how you do that is well firstly setting
up some sort of telemetry uh to back
test ideally that architecture set up in
advance but you know you should invest
in it um designing representative text
cases so let's say you're working on
that customer support agent eval you
know you might have a kid come on your
website let's you're building it for and
might ask some crazy question like how
do I you know kill a zombie in Minecraft
like it totally unrelated to your
product that's still probable and so you
should probably include silly examples
like that in your eval set to make sure
that your models actually approaching
the response in an appropriate way or
rerouting the question
Etc cool moving on to the next one um
identifying metrics so a lot of the time
you know there's this intelligence cost
latency triangle of tradeoffs that
people are trying to move in between and
most or organizations can optimize for
one or two of those things but it's very
difficult to meet three at least right
now but realistically that balance
should be defined in advance and you
should know that for your specific use
case you're going to make a trade-off
between those things so let's say a
customer support use case again you care
about your customer getting response
within 10 seconds right more than 10
seconds I think there's been research
done on this uh the customer is likely
going to just log off the page and then
they won't get the response and then
they'll probably complain about your
product to their friends right whereas
if you're looking at a financial
research analyst agent you probably
don't care that it works for 10 minutes
to come up with the actual response to
your question because the decision being
made after that is very important it's
an allocation of capital for example and
so the stakes and time sensity
sensitivity of the decision should
really drive your optimization choices
and you know maybe more instruction sets
lead to longer latency but higher
performance Etc the other thing is ux
could be important right so again on
that customer support agent because we
spoke about intercom you could have
different ways of circumventing that 10c
to 15 seconds specifically you could add
like a little thinking box that bounces
around you could send the customer to
another web page in the meantime have
them read something right like there's
loads of ways to distract and kind of
push on those boundaries but you still
need to know what that important
indicator is and you need to optimize
accordingly finally fine tuning so a lot
of people you know I go into these calls
and they're like oh we want to do fine
tuning I'm like oh here we go again um
fine tuning is not a silver bullet so it
comes at a cost and most people aren't
aware of that cost um the cost is
generally you're doing brain surgery on
the model and thus there can be kind of
limitations to its reasoning in other
fields outside of the thing you're fine
tuning tools um so my encouragement is
try other approaches first right most
people they don't even have their eval
set when they're trying to do fine
tuning right they need to have a clear
success criteria in advance and it's
like only if we can't get that in our
specific intelligence domain do we then
do fine tuning don't try to boil the
ocean in advance the difference in
fine-tuned capabilities and the wide
variance of which you know failure
versus success looks like in fine tuning
land means that you should be able to
justify the cost of fine tuning and the
effort of doing it right getting a team
fine tuning working with us for example
uh you should be able to justify that
difference and so in in terms of best
practices you you know don't want let to
let fine tuning slow you down right you
don't want to say oh I'm only going to
convert this language model use case if
we can actually finally fine-tune our
model it's like no no no pursue it and
then realize that you need to do
fine-tuning and then you can just sub in
the fine-tuned model and then explore
other methods first and there are loads
of different methods that you know are
anthropic as well as other companies are
working on these days and I just wanted
to like flash up a few of them as we
wrap things up here and so I'm not going
to go through all of these but alongside
just Bas prompt engineering which
granted is very important there are
loads of different features or
architectures that will change the
success of your use case drastically so
for example you might not need to
sacrifice on intelligence of your model
if in order to speed it up by like
removing instructions if you can just
leverage P caching and have a 90%
reduction in cost and a 50% increase in
speed right or contextual retrieval will
drastically improve the performance of
your retrieval retrieval mechanisms and
thus you feed the information to the
model more effectively and thus it has
less of a Time processing all the
instruction set that you've given it so
there are quite a few things that you
can apply here and some of them are even
out of the box like citations and then
there are also architectural decisions
like agentic architectures you know
Barry my colleague who's speaking
tomorrow we'll have a lot to say on that
um but that pretty much does it um thank
you so much for uh for your time we'll
be in the
theater level Lounge after this chat for
followup questions um anything else from
you Joe no thank you so much
[Applause]
[Music]
cool ladies and Gentlemen please welcome
back to the stage your MC for the
leadership track session day Peter
Humphrey
all right folks uh thank you again to
Alexander and Joe for all their insights
into applied AI with Claude uh anthropic
is truly up to a lot of amazing things
who's using Claude on a pretty frequent
basis I'll speak for myself okay great
we got a lot of them so it's been a
pretty exciting afternoon uh we've had
case studies we've talked about
evaluation Frameworks we've talked about
observability with agents and we talked
about self-managed AI infrastructure
that one I was particularly kind of
interested in and of course anthropic
thanks again to Alexander and Joe so
we're going to take a 30 minute break
before sessions continue at right here
at 4M uh and again as before if you want
to discuss meet the speakers or talk
about uh birds of a feather you know
like-minded topics from the last block
of sessions just go find the speakers at
the one of the three Q&A areas again one
here on the theater level one at the
very bottom of the stairs right at the
at the landing and then another one
tucked underneath uh hidden way behind
the stairs um take some of this time
during the break to stop in at the
sponsor Expo there are more coffee and
snacks again our sponsors have some
pretty amazing products technology and
services to help you on your journey see
you back here at four o'clock thank you
very much
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
he
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Music]
[Music]
[Music]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
w
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
[Music]
[Applause]
ladies and Gentlemen please join me in
welcomeing to the stage your MC for the
leadership track session day Peter
Humphrey all right folks thanks welcome
back again uh you only have to see me
one more time you've made it to the home
stretch uh it's getting real we are uh
almost there uh that said our last
Sprint of sessions is a little shorter
uh just to to take it easy on you um but
it's no less exciting we're going to
have people speaking about Ai and hiring
one of my favorite topics I'm really
excited for this talk uh from Heath he's
got some some yummy data that I think
you're going to like uh and then of
course uh we have other speakers
speaking about uh building AI platform
teams and org structure and then we're
going to finish off with retrieval
augmented generation and data pipelines
from a very special distinguished
speaker with that please join me put
your hands together and welcome Heath
black managing director at signal fire
thank you
[Applause]
[Music]
hi everyone I'm Heath black I'm the
managing director of product at signal
fire and let's actually take a quick
step back and ask uh why am I here uh so
before I got involved in Tech I actually
went and got a master's in Irish
literature of all things uh both of my
sons are named after Irish writers and
uh you're going to see how this actually
weaves into the presentation a little
bit later when you get a degree like
Irish literature you have to get
creative in how you actually use it in
2009 I got involved in uh some startups
and I helped ship the
first ever conversational chat bot at a
company called
chirpify uh I then went and worked at a
company called MZ where we were trying
to build a Reddit competitor but a lot
nicer and then I actually went and
joined Reddit where I worked on
experimental business lines and trust
and safety tools and I followed that
experience up by going to meta where I
shipped meta's first assistant it was
called M it lived Within Me messenger
and then I was the uh first product
manager on the AI assistant for their
rayb bang glasses but now I serve as the
managing director of product at signal
fire so what does signal fire do I like
to say the signal fire is actually the
first VC built like a tech company and
what I mean by that is the same way that
you all go out and interview customers
to figure out whether you're building
the right thing for them before you ship
your product signal fire interviewed 500
Founders to understand the things that
keep them up at night and make them bang
their head against the wall all day we
then built AI ml tools and our portfolio
success teams entirely around those
problems going to Market recruiting
building your leadership skills and the
ability to launch your product but today
we're actually here to talk about uh
some things that we've learned from our
proprietary AIML platform beacon beacon
tracks over 650 million employees 20 or
80 million companies and 200 million
open- Source projects and with all of
that information we build a variety of
proprietary ranking systems and Market
insights that we can then use to power
our firm so that we can move at startup
speed but then also to support the
companies that we invest
in today's Focus we're going to be using
some of the data from Beacon to figure
out how to filter the right
people to find them in the right
locations to nail the right timing and
then finally to close them with the
right narrative so let's first start
with
filters when I think about recruiting I
think about it in terms of like what
filters would I apply to the people that
are on my team and that I want on my
team Beacon gives people the tools to
apply these filters as they search but
the reality is if you don't know what
filters you need to apply you're not
going to find the right people so here's
some interesting trends that we've seen
that change how we
filter over the past you know decade or
so we've seen a Stark deeden isation in
ai ai startups are hiring more Engineers
without phds or prestigious School than
ever before in 2015
27% of engineer hires were from top
schools and 16% had
phds in 2023 those numbers were 15% and
7% this is about a 50% decline for both
of these numbers over that period of
time and you're probably saying that
doesn't sound right what about for
things like research scientists they've
got to have phds right and you know
you're not entirely wrong about 40% of
research scientists have advanced
degrees now this is not phds is simply
Advanced degrees and even still it makes
up less than half of the people that
serve in research scientist roles
today from my standpoint this isn't too
surprising because there's been a slight
shift in the market since 2015 in 2015
we were really focused on the the kind
of ml research side of things the
foundational side of things whereas
today a lot of the work is about
applying that to the real world usage of
that model it's ml Ops Pro like product
like software experience is
understanding how users interact with
the thing that you're
building and with the shift from
credentials we've also seen this really
interesting people Mobility over this
period of time if you look on the left
side here historically a lot of the AI
Talent was centered on these companies
the Googles the Ubers the the metas the
apples and over this period of time
they've shifted to nine companies that
we call the aiv league the companies on
the right here have seen a massive
concentration of talent over this period
of time it's really interesting because
we generated this last year shortly
after that inflection was acquired so
one of the key things that I want us to
take away from here is that the market
is constantly moving so we have to
constantly be assessing where that
market is Shifting and the interesting
thing as well is that all the companies
on the left side of the screen are now
fighting viciously to get people from
the right side of the screen rather than
the other way
around but one of the key things is not
just knowing where people are going it's
about where they're coming
from this graph shows you net employee
movement between different aiv League
companies so to speak as you can see at
the top open AI has a positive flow of
people from Deep Mind whereas coh here
actually has an a negative Trend knowing
where people come from and where they're
going is essential in ensuring that you
are filtering for the right people as
you look for people like you know
building out your
teams so the takeaway here is that work
experience has always been important but
it now far surpasses education in terms
of the main aspect that you should be
looking at here don't just rely on the
credentials that someone has you should
instead look at the body of work that
they've compiled for new workers you can
still look at their body of work what
are their open- Source contributions
what have they built outside of class
the reality as experience and what
you're building matters more than where
you get a degree secondly you should be
asking yourself do I need a PhD
researcher for the role that I'm hiring
or will a really awesome engineer with
experience
suffice and then the third is that you
should actually consider removing
academic requirements from your job
postings or maybe making them soft so to
speak because this will ensure that your
top of funnel is getting the people that
have the experience you need more so
than the
education now let's talk about the next
aspect which is location I'm sure many
of you have seen the debates on Twitter
that San Francisco is clearly dead and
so we wanted to know is
it interestingly the answer is no is not
dead San Francisco makes up about
29% of all startup engineers years now
this is slightly down from highs of 2013
when it was at 33% but it's up ticking
again since
2021 New York and Seattle have also been
pretty impressive as they've both
doubled the market share of Engineers
that they have over that period of time
if we were to zoom out and look at big
Tech 50% of big Tech Engineers still
reside in the San Francisco Bay area but
what about AI specific
specifically well San Francisco is still
leading the
pack 20 or 35% of all engineers in AI
reside within San Francisco Seattle
makes up about 22% New York makes up
about 10% so San Francisco makes up more
than both of those cities combined but
if you actually look at this slide and
compare it with the data that I showed
in the previous slide you'll see
that these these markets are all punched
in well above their weight in terms of
AI hiring and AI Talent where they had a
smaller number in the previous slide
they have a much larger number in this
one so the talent is concentrating in
these threey key markets today now this
isn't terribly surprising for me because
San Francisco makes up nearly 38% of all
early stage funding into AI startups and
the interesting thing about this is that
San Francisco only has 26% of all early
stage funding in the United States total
so not only is San Francisco punching
above its weight in terms of AI Talent
is punching above his weight in terms of
the funding that is going to AI
companies
today so the takeaway here is that
Twitter doesn't determine whether a
market is dead data does location still
matters even in a highly distributed
world that we live in today San
Francisco Seattle and New York York are
the premier locations for AI Talent
today and so your job is to watch the
location and funding markets to see
where talent and capital is Flowing as
another way to filter and find the right
people now let's talk about timing
finding the right person has as much to
do with time as it does
Talent at the airport on the way here it
was a mere matter of minutes that
separated me catching my flight or
sitting in the be sad like Charlie Brown
a fraction of a second is the difference
between a home run to right field or a
foul ball in the bleachers for me timing
means two different things first it
means finding people when they are most
likely to leave secondly it means
finding people that really are going to
have a pinion to join a company at your
stage and uplevel your team for where
you are today so let's talk about
timing we analyze some of those aiv
League companies to see their retention
rates if you're on this slide I'm sorry
if the if the information uh offends you
at all but anthropic is leading the pack
with about a 66% 4year retention rate
while perplexity hovers around 44% or so
43% this is just a small slice of the
world as is constantly changing but the
reality is understanding retention helps
you know when you're likely to be able
to get someone to answer that message
that you send to them it effectively
creates a poach ability score so to
speak of your ability to land that
person but in addition to retention we
actually study the behavior of different
Generations they act
differently in 2023 nearly
27% of all gen Z left their
job if you compare this with Gen X
that's actually more than two times as
much as
genx and if you were to look at the fact
that within four years after graduating
gen Z has about 2.2 jobs whereas Gen X
has
1.1 and so some of this has to do with
the fact that gen Z is actually getting
promoted at a slower rate some of it
might be causing those slower promotions
some of it has to do with the layoff
Market that took place over that period
of time but if you ask me a lot of it
has to come down to the pension for
people to take risk and bet on
themselves j z likes that
risk but it's not just retention and
it's not just the generation you were
born into you need to know when people
work at different companies being on the
New York Knicks in 197 three is very
different from being on the New York
Knicks in
2018 one of those teams held up a
championship trophy and the other team
had the worst record in their franchises
history so as signal fire we built this
cool tool we call it historical
composition and it actually shows all of
the startups that we invest in a
snapshot of the companies that they
admire at different points in time what
did their org structure look like at
that point in time who are the sales
leaders that took them from 1 million to
10 million who were the first three
Engineers on their team when they
shipped that key product that I'm trying
to beat now these things are going to
help you
identify the risk profile people have
are they going to join a company at your
stage it's going to Mo like help you
understand the motivations that they
have but it's also going to help you
understand whether they're a potential
tenx higher which you need to make in
order to take your company to the next
stage so the away here is that you have
to understand timing both from an
Outreach and an impact standpoint you
should know when your competitors or the
companies that you admire most are
likely to lose other people you should
track the people that work at those
companies profiles to see whether there
are changes made to it over that period
of time studying the patterns of
different Generations or segments of the
population will help you understand how
they change jobs and finally
you need to know when people join and
leave companies because that will help
you identify your 10xers and help you
identify people that are likely to join
a company at your stage now this is
where I finally get to use that
literature degree narrative one of my
favorite writers Kurt vonet has this
awesome visualization for the shape of
stories if you look at the bottom left
here on the xais you have beginning and
end and on the Y AIS you have ill
Fortune leading up to Good Fortune so
the bottom here is fron kafka's
metamorphosis Gregor Sam so wakes up a
bug and everything goes downhill from
there on the top right you have a man
walking down the street and he falls
into a pothole but then he works him uh
works his way out but my favorite
Cinderella she's on the bottom right
over here things start pretty crummy for
her sisters are evil she has to do a
bunch of work then uh some magic happens
and she gets invited to this ball meets
a beautiful man they fall in love and
then what happens the clock strikes 12
she falls off a cliff but then through a
series of fortunate events things lead
to Eternal bliss now I'm not telling you
that you need to preach the depths of
Despair that your company is in or has
gone through but you do need to
understand the triumphs that you've had
why you are where you are today where
your Arc is going the reason for that is
because historically pay and Equity were
the two components that we use for
narrative but we can't rely on those
solely
anymore why from November 2022 to
November 2024 we saw a
1.6% increase in the average tech salary
and a precipitous decline in the amount
of equity granted but I have some really
bad news for the folks in this room it's
even worse for ai ai Engineers are the
hot ticket for this year they command a
5% salary premium and 10 to 20% Equity
premium over other engineering roles so
what was already expensive is getting
even more expensive for us so if we rely
our entire narrative on that we're
relying on things that we might not be
able to afford so salary as the sole
selling point has got to
go Equity was that other thing that we
used to dangle to get people to buy into
what we were doing as a company but
we've seen a precipitous decline in the
amount of people that are exercising
investment
shares in Q2 of 20124 33% of people
exercised the shares that they had
vested this is down from
55% a couple years earlier a lot of this
is driven by concerns over uh valuations
that might be a little bit too high
concerns about the cost of liquid
Capital to exercise these shares and
concerns about the market shifting which
it does every 3 weeks in AI Equity can't
be the only other thing that we're
relying
on we have to get to a point where we're
not just focusing on money and Equity we
have to have things like a close-knit
environment with working with the
founders collaborative teams speed and
the lack of friction to actually get
stuff done a big mission the ability to
grow your mind and your career
opportunities markets that are
exploding and solving complex problems
you need to understand what all these
things are for your company in order to
not rely wholeheartedly on salary and
Equity as a narrative so to summarize in
a world where so many companies are
fishing in the same engineering Pond
recruiting data can give you an edge
these are just a few examples but in the
same way that you use data to build your
product both your models and the kind of
analysis of that product you should be
assessing data to build your team your
team is your most valuable product that
you have what we've seen is that de
credential isation is happening so you
need to filter accordingly location
still matters so watch where people move
data can help you identify the right
time to reach out to people and it can
help you identify the right time that
people have been at different
companies and all of this is going to
help you craft a better narrative so if
you can filter if you can time if you
can find the right location and you can
have a good narrative you're going to do
a much better job if you're on the other
side of the coin and you're actually
looking for work you should know where
the people that you admire go not just
the companies but the space you should
watch how long they stay there this will
help you know how treated whether they
think the space is going to be fruitful
and then finally you should know what
you want in that Arc of your career I'll
be out in the lobby a little bit later
thanks for your
[Applause]
[Music]
time our next presenter is engineering
manager for generative AI at LinkedIn
presenting insights from building their
gen AI platform please join me in
welcoming to the stage Xiao Fang
[Music]
Wang all right good afternoon everyone
uh it's my pleasure here to share our
journey on building out linkedin's ji
platform uh my name is Al ad manager of
J Foundation
uh let me try it one more time cool uh
in today's talk I'd like to First share
our journey on building out this
platform especially on why we're
building it how we build it and what
we're building it after that we will
talk about uh some thought process on
why this platform is critical for
today's agent
word um hopefully after that you agree
with me this is critical component in
your component uh uh in your company and
you also want to build this team I want
to share some tips on how to build such
a team how to hire for such a team
towards end we will share some K
takeways and Lessons
Learned before we dive into this
application uh platform journey I think
it's important to first talk about the
ji product experience because that's
essentially what our platform is
supporting for back in 2023 LinkedIn
launched the first formal gii feature
called collaborative articles this is a
kind straightforward uh GI feature if we
are thinking in this standard because
it's a very simple prompt in string out
type of
application uh we leverage chat GPT uh I
mean we leverage GPT 4 model uh to
create the long content uh articles on
the platform and they invite our members
to comment on it
at this stage our team helped to build
some key component behind the scene
including the gateway to centralize the
access to the model uh some Pyon
notebook for the prom engineering uh but
at this time we actually have two
different taex Stacks uh to serve the uh
experience in the online phase we use
Java and in the back end we use Python
uh we wouldn't call this as a platform
at this time
very soon we realize uh there are some
limitation for this simple approach
especially it lacks the capability to
inject our reach data into the product
experience then in the mid 2023 we
started to develop the second generation
of the ji product uh internally we
called co-pilot or coach here we're
showing one popular such experience on
LinkedIn right now uh basically it looks
at uh your profile and the job
description and then uh use uh some rack
process to give you personalized uh
recommendation on if you're good fit to
the
job at this time we started to build uh
some platform capability uh specifically
in the center of our platform we build
the uh python SDK on top of the popular
uh lunching framework to orchestrate
ourm calls and it also provides the key
value to integrate with our large scale
infrastructure uh in this SDK so our
developers can easily assemble a
application we started to unify the text
stack at this stage because we realize
it's really costly to transfer the
python prompt into the Java world not to
mention the arrow during this process we
started to invest on the prompt
management or prompt source of Truth
this is a sub module at this stage
uh to help developers to version their
prompt and to provide some structure
around their meta
prompt uh the most important piece I'd
like to call out here is conversational
memory uh this is uh infrastructure to
help to keep track of the llm
interactions and retrieval content and
then inject those content into the final
product it will help us to build this
kind of conversational uh bo
now uh zooming to this year actually in
the last year we launched our first ever
uh real multi-agent system called uh
LinkedIn hirer assistant uh this is uh
multi-agent systems to help our
recruiters to do their work uh
efficiently especially it automate
several Teeters task uh normally
recruiter need to do manually like um a
post the job uh um and evaluate hundreds
of candidates then reach out to
them our platform also start to involve
into the agent platform uh from the
framework side we extend the sport of
the Python SDK into a more large scaled
U distributed agent orchestration layer
it will handle the distributed agent
execution and also handle the more
complicated scenarios like retry logic
and traffic shift uh for folks who build
agent uh I think you probably know the
skills or apis are one key aspect of the
agent because we expect uh this agent to
perform some action one investment we
did at this uh time is around the skill
registry basically we have a set of
tools uh to help our developers to
publish their API into this centralized
skill r
this skill rry can handle the skill
Discovery problem skill invocation
problem so in your application it's
actually very easy to call the API to
perform some
task another key component uh we invest
at this stage is on the memory in
addition to the conversational memory we
extend uh it capability into the
experential memory essentially it's a
memory storage to uh extract and analyze
and infer the contextual Knowledge from
the interaction between the agent and
our user we also organize this memory
into different layers including the um
uh working memory long-term memory
Collective memories uh this can help our
agent to be aware of the surrounding
content uh lastly at this uh time we
also realize the operability is super
important because agent uh one key
aspect to Define agent is autonomous
right
because agent can decide what API they
can call what L LM they need to call so
it's actually very hard to predict Its
Behavior so we started to invest on the
operability uh particularly we build our
in-house Solution on top of the hotel to
keep track very low level granularity of
the uh telary data so we can use this
data to replay the uh agent call and we
also add uh actual layer of the
analytics on top of it so we can use
that to guide the future optimization of
our agent
systems let's put together all the
components we build for this platform uh
we can classify them into four layers
basically including the artion prom
engineering tools and skills in location
content and the memory uh
management uh of course that not
everything in the LinkedIn J ecosystem
uh in addition we have our sister teams
to build out the modeling layer like
fine tune the open source models
responsible AI layers to make sure the
agent is behave according to our policy
and standard and also the uh AI platform
or machine learning infrastructure team
to host those
models the key value propriation for
this uh ji platform is actually to uh be
the unified interface for this complex
ecosystem so our developers don't need
to necessarily understand all those
individual box when they build uh their
application instead they can leverage
our platform to quickly access to this
entire ecosystem uh for example uh in
our SDK the developer can just the
switch one parameter in the one of the
code to switch from the open a model to
our on model of course they still need
to do the prompt engineering but that
reduce a lot of the complexity on the
infrastructure integration
phase uh last but the most importantly
is because of this is a centralized the
platform uh it provide a place to
enforce the best practice and governance
so we can make sure our developers are
building the applications efficiently
but also
responsibly as as you can see from our
journey we actually start to build this
uh platform piece by piece and then this
platform start to emerge if we take one
step back and think uh do we really need
this platform at this time especially
there are lots of uh uh the vendor
product on this space shall we buy it
build it and why do we need to buy it or
build it uh here are some
thoughts um the short answer is yes the
reason behind it is uh we feel like ji
is totally different new AI systems
compared to the traditional AI systems
so in the traditional AI systems there's
a clear cut off between the uh AI model
optimization phase and the model serving
phase so AI engineers and product
Engineers can operate in two different
tax stack uh they usually don't uh need
to uh work on the same code base but in
the ji systems what we're seeing is
this line between the optimization phase
and the serving phase disappear
basically everyone is a engineer who can
optimize the overall system performance
this actually create the new challenge
of the tooling and the best practice in
the
company essentially we think these ji
systems or agent systems is a compound
AI system here we borrow the definition
from Berkeley AI research lab uh
compound AI system can be defined as a
system which tackles AI tasks using
multiple interacting components
including multiple costs to model
retrievers or external tools as you can
see this is actually skill across AI
engineer and product engineer and I
believe this uh J app platform is trying
to bridge this
Gap to summarize uh we believe this
platform is critical for your success m
because it can Bridge the skill gaps
between those two group of
Engineers okay let's say if you want to
build this uh platform in your company
and how to hire it is a frequent
question uh I heard um I basically look
into uh my great engineering team and uh
extract all the key qualifiers from
those top engineers and uh I put all the
qualifications here uh the ideal
candidate in this team is a strong
software engineer uh who can build
infrastructure integration they have a
good developer uh PM skills to design
the interface uh ideally they have the
AI and the data science background to
understand the latest techniques they
are the people who can learn from the
latest techniques but at the same time
they are
Hands-On unfortunately it's really hard
to guide those candidates if you get
them uh it's probably worth more than
unicor realistically we are making
multiple tradeoff in the hiring uh here
are some principle uh we follow and it's
actually working pretty well on to share
here in terms of the core skills uh we
usually prioritize the stronger software
engineer skills over the AI expertise
this might be controversial but uh uh uh
we can discuss if you're
interested second is instead of hariring
for experience or degrees we hire for
the potential because this field is
involving so fast most of the experience
might be
outdated in case you won't be able to
find a single engineer with all the
qualifications we're showing here uh the
way we are solving this problem is to
hire a diversify the team so so for
example uh in our team we have some full
stack soft Engineers we have data
scientist we have ai engineers and data
Engineers we also have a fresh grads uh
from the top research University and
also uh some people from the stub
background and then we put them together
uh into the project what we've seen is
based on those collaboration those
strong Engineers start to pick up new
new skills in the project and very soon
they started to grow into these ideal
candidates uh lastly is uh want to
emphasize is uh the critical thinking uh
one constant topic uh in our team
meeting is uh no matter what we're
building right now it will be outdated
within a year or even less than six
month so we consistently evaluate the
latest open source package talking with
vendors and deprecate our solution more
proactively
cool let's talk about say some uh K
takeways uh especially on the text stack
choice if possible we strongly recommend
python we started with Java and python
uh there are some back and forth of the
debate internally but finally we pick
Python and I think that's a right choice
especially most research and open source
uh are in this space based on our
experience it's also scalable
in terms of the uh key components you
want to build in this platform the first
one is a prompt source of Truth prompt
in some way is like a traditional model
parameters you want to have a really
robust systems to Version Control your
prompt this is really really critical
for the operational stability you don't
want accidentally added your prompt in
production and uh see some really side
effect second key component is on the
memory I think in today's uh meeting uh
I mean today's talk someone already
talked about it memory is a really key
component to inject your Rich data into
the agent experience lastly in the agent
era uh one key new component we are
building is on the uplifting our apis
into skills which can be called from the
agent easily so you can uh build some
surrounding tooling and infrastructure
to support this
need all right let's talk about how to
uh scale this solution and got gued
adopted uh from our experience instead
of trying to build this full-fledged the
platform at the beginning try to solve
immediate need for example we started
with a simple python library to support
orchestration then we start to grow into
all the components we're seeing here
second is
uh focus on the infrastructure and the
scalable solution and Linkedin we
actually have a pretty good success
story by leveraging our uh messaging
infrastructure uh to be as a memory
layer uh it's both cost efficient and
scalable last day is uh focus on the
developer experience by the end of the
day this platform is trying to help
developer to be as productive as
possible their adoption is a key for the
success if you can design this platform
please focus on uh how to align your
technology with their existing uh
workflow so it will ease adoption and uh
be more
successful uh we actually have lots of
lowle details on the technical side uh
if you are interested please check out
our engineering blog post on LinkedIn by
Cake s and
myself uh with that uh thank you for
your attention and uh if if you are
having more questions happy to answer
that after the talk thank
[Applause]
[Music]
you ladies and Gentlemen please welcome
to the stage the CEO and co-founder of
contextual AI da Kila
[Music]
[Applause]
hi folks uh thanks for being here uh I'm
the last talk for today my name is Da
Kila I'm the CEO at contextual Ai and
I'm here to talk to you about rag in
production rag agents specifically um
and I'll I'll share some of the lessons
that I've learned so my background is in
AI research uh but after that I became
the CEO of a AI company focused on
Enterprise uh so I thought I would share
some of my learnings with you uh in the
hope that that's
useful so if you uh look at Enterprise
AI uh if you work in this space you'll
probably notice that there's a huge
opportunity uh ahead of us right
everybody wants to grab that opportunity
there's there's these huge numbers
flying around $4.4 trillion dollar is is
the estimated added value to the global
economy according to McKenzie so we have
this giant opportunity but at the same
time if you actually look at what's
happening in Enterprises you see a lot
of frustration it's probably even true
for some of the people in the audience
right here
if you're a VP of AI then you're
probably under some pressure right now
it's like where's the ROI we're
investing all this money in AI but where
is it actually leading us to are we
getting something out of this so uh
Forbes has this interesting study where
they showed that only one in four
businesses actually get value from AI so
why is that happening right it feels a
bit like a
paradox uh so to to explain it we can
look uh at a paradox that might be
familiar to you it's it's something
called morx Paradox it's from Robotics
and in robotics they were very surprised
when they found out that it's actually
much easier to beat humans at chess than
to have a robot that can vacuum clean
your house or have a self-driving
car um so the the the Paradox here is
really that things that seem hard are
actually much easier for computers than
you would expect and things that seem
easy actually turn out to be much harder
right so there's something very similar
happening right now in in Enterprise AI
specifically and this is around context
so on the one hand we have these amazing
language models right you you've that's
why we're all here basically because we
see this revolution happening right in
front of our eyes so language models can
generate code much better than most
humans they can solve mathematical
problems much better than than most of
us here can do uh and we're pretty smart
um so it's really amazing what they can
do but one of the things that they
really still struggle with and that's
one of the things that as humans we are
very good at sort of without effort is
putting things in the right
context right so as humans we build on
our expertise we build on our intuition
that we've developed over the years
especially if we're a specialist this is
something that is very easy for us to do
is to put something in the right context
and uh and in the right situation so
that you can make sense of the
information or the problem that you're
solving so I would argue that this is
really the key observation this this
context Paradox um for unlocking Roi
with AI and the reason for that is that
where we are right now here is is in the
bottom left right so we're we're mostly
focused on convenience we have general
purpose assistance they're very useful
mostly if you're lazy they help you sort
of solve your problems faster but where
you really want to get to is
differentiated value if you're an
Enterprise it's nice that you can make
things more convenient venient you
probably can make people more efficient
and more productive that's great but
where you want to get to is this
business transformation ideal right
that's what all the CEOs are probably
telling you as a VP of AI like I want to
change my entire business how am I going
to do that so getting to that
differentiated value that's where you
want to get to but the problem is that
the higher you go on that axis the
further you go on the context AIS so
that the better you need to be at
handling the context uh uh that exists
Within your
Enterprise um so what should we do about
that um so that observation is really
why I started the company that I'm
currently the the CEO of contextual AI
um and we started this two years ago to
try to help bridge this Gap um and we've
learned some lessons along the way that
I thought I would share with you uh in
the hope that they're also useful for
you so the first observation is really
that language models are awesome but
often there are only 20 % of a much
bigger system um so if you have an
Enterprise AI deployment usually that
means it's a rag system uh so I I think
everybody here probably has heard of rag
uh rag is something that I uh originally
pioneered with my team at Facebook a
research when I was there uh so rag is
is really kind of the standard way that
you get gen to work on your data so what
happens very often these days is a new
language model comes out everybody goes
whoa new language model it's great
everybody starts to think just about the
language model but very few people
actually think about the system around
the language model and that system needs
to solve the problem right so you can
have a relatively mediocre language
model but an amazing rag pipeline around
it and that's going to be much better
than an amazing language model with a
terrible rag pipeline around it so the
basic observation here or the lesson is
that you should be thinking about
systems not about models the model is
only a small part of the system and the
system is the thing that solves the
problem the next observation is that if
you're in an Enterprise expertise is
really your fuel right so uh one of the
the things that you want to be able to
do as an Enterprise is unlock all of
that expertise so you have all of this
institutional knowledge in your company
how do you get it out um so one way to
try to do that is is using these
generalist kind of general purpose
assistance but it's very hard to get
them to to uh to match the expertise of
people in your company so ideally what
you want to do is to specialize so that
you can capture that expertise much
better so uh at my company we call this
specialization over AGI AGI is great
there are lots of use cases for it if
you really want to solve a very
difficult problem that is very domain
specific where you understand the use
case you want to specialize for it and
you'll get much further so that's I
guess uh pretty counterintuitive if you
look at the sort of broader uh interests
right most people are much more excited
about AGI but solving real problems is
much easier with
specialization the next lesson is uh at
an Enterprise scale is your remot so if
you think about what a company really is
is a company maybe its people probably a
little bit right but over time what the
company really is or what makes a
company a company is its data because
even people are transient right so the
data that a company owns that is the
company in the long term so now as an
Enterprise you need to think how you can
unlock all of that potential right and
so uh one of the the big issues that we
see a lot is that enterprises think that
uh you need to scrub the data and clean
it and invest a lot of time in in uh
making your data accessible with AI but
what you really want to do is make sure
that AI can work on your noisy data at
scale and doing that is incredibly
difficult but if you succeed in doing
that that's how you get to
differentiated value right that's how
you get that mode because the data makes
makes your company your company and so
data is really your
remot uh one observation uh and this is
really a hard truth that that we've
learned and I I think that many of you
might have learned already or that
you're about to find out uh if you're
earlier in in your journey is that
Pilots are very easy uh building a demo
not very difficult these days right if
you want to build a rag system you take
one of the Frameworks you put in some
documents you have a working solution
it's great you give it to your 10 users
they all tell you it's fantastic and and
then you show it to the CEO and he
saysay we're going to fire half the
customer support team and we're going to
replace them with AI and we're going to
do that in three months and now you're
on the hook for productionizing
something that is actually much much
harder right so getting this to work at
tens of thousands or hundreds of
thousands or millions of documents you
can't do that with any existing tools uh
that are out there on the open source
market it's very very difficult to do
that making this skill to thousands of
users is very hard um making it work for
lots of different different use cases if
you're an Enterprise maybe you have
20,000 different use cases that you want
to cover so how do you scale if that's
the problem that you're solving and then
there's of course Enterprise
requirements around security and and
compliance so bridging that Gap is much
harder than you think and and the the
right way to deal with that is to really
focus on production from day one so
don't design for the pilot design for
production uh and that can save you a
lot of time and that brings me to the
next observation is that speed is really
much more important than infection what
we see um in terms of production
rollouts of uh rag agents it's all about
speed um and and what that means is uh
you need to give it to your users
relatively early real users not not sort
of uh testers who are are kind of
friendly you want to give it to real
users to get their feedback you want to
do that early it doesn't have to be
perfect it just needs to be barely
functional and if you do that then you
can heal climb to actually get to this
level where it's good enough if you
don't do that and you wait too long and
then you try to design something that is
perfect it's going to be very hard to to
bridge that Gap from Pilot to production
so iteration is really the key to a lot
of uh successful uh production AI
deployments in in
Enterprises next observation is is
related to this too which is that uh if
you want your engineers to be fast and
if you want to follow that that speed
Maxim I just talked about then you don't
want them work working on boring stuff
uh sounds kind of obvious but it turns
out that Engineers are working on a lot
of very boring stuff um and so one of
one of the the things that they have to
worry about for example is what is the
optimal chunking strategy for my rag
system and it's different for every use
case and is different for every
framework and then they have to think
about what the right prompt is or really
basic things that ideally they don't
have to think about too much because you
really want your engineers to think
about how are am I going to deliver
business value right how how do I make
sure I have this differentiated value
and that I'm actually better than my
competitors um so make sure that your
engineers spend time on the things that
matter and not on the chunking strategy
or or things that that can be abstracted
away uh very well these days by by state
of theork platforms for for rag
agents next one is is about making AI
easy to consume so what what I mean by
that is we actually see uh this happen
quite often where companies have gen AI
running in production and then the next
question I often ask them is okay how
many people are actually using it and
and surprisingly often the answer is
zero almost nobody's actually using it
they did all this work but they had to
make sure it came through uh sort of
model risk and and teams like that so it
was really like kneecapped almost and
now it's barely useful uh so that's one
scenario or or very often people just
don't actually know how to use the
technology so it it really a journey
that you are on and the easier you can
make your solutions to consume the
better it is and what that what that
means for most Enterprises is not just
thinking about your Enterprise data and
how you make AI work on it but also how
you integrate it into their workflows so
the closer you can integrate it into a
workflow that already exists in your
Enterprise the more successful you're
are going to be with real production
usage uh next one is is related uh to to
previous one as well uh where it's
really about getting usage it's about
sort of being sticky and and so this
sounds maybe a little obvious but the
the quicker you can wow users or get
this sort of spark where they they
suddenly get it like this for for me as
a CEO of a AI company that's really the
special moment when people suddenly go
like wow I didn't know that it could do
this um so you can try to design your
experiences for onboarding users around
this observation too right so so where
where they get to the wow as quickly as
possible so for us we had this really
nice example with someone at Qualcomm so
we're we're running in production
globally with Qualcomm with thousands of
customer engineers and one of them
became so happy when they found this
document it was seven years old it was
hidden away somewhere they didn't know
it existed they had all these questions
and they just never knew what the
answers were and suddenly because they
asked our system they got these answers
and like their their world was never the
same again after that um so these are
the small the small winds sort of uh
that that really matter for for uh
evangelizing uh production in
AI uh so that brings me to the the
penultimate learning which is that it's
not even really about accuracy anymore
so accuracy is almost table Stakes right
uh so I I think as AI practitioners we
probably know that getting 100% accuracy
is very hard if not impossible getting
95% accuracy maybe you can get there or
90% but what Enterprise are are thinking
about much more these days is what about
the missing 5% or what about the missing
10% how do I deal with the things that
might go wrong right um so there's a
minimum requirement for accuracy but
beyond that is really about inaccuracy
and the way to deal with that is through
observability so you want to be very
careful with how you evaluate these
systems you want to be very careful with
making sure you have proper audit Trails
especially if you work in a regulated
industry this is incredibly important
right making sure that you have an audit
dril that says this is why I generated
this answer it's because I found it here
in this document basic things like that
so attribution essentially in a rag
system actually becomes very very
important for dealing with the
inaccuracies and similarly what you can
do is you can check the claims that your
system generates so do a lot of
post-processing to ensure that you have
proper attributions uh that that you can
really back up uh as
evidence struggling with the clicker a
bit so uh fi final one uh that I want to
end on and this sounds maybe a little
bit cheesy but it it really is is true
is be ambitious we we actually see a lot
of projects fail not because people are
aiming too high but because people are
aiming too low where where folks are
going like I have gen running in
production and then what does it do it
answers basic questions about who your
401k provider is or how many days of
vacation I get like that's not really
where the ROI of AI is right so you want
to aim for really ambitious things where
if you solve them you actually have have
Roi and you don't just have a gimmick
that people don't really uh use anyway
so try to be be ambitious because we
really live in special times we have the
the astronaut here on the slide um so so
I I think it was a pretty special time
to be alive during the the moon landing
and when all of that happened right
we're in a in a similar moment right now
where AI is is really going to change
everything is going to change our entire
Society in the next couple of years um
and so you have an opportunity being in
the role that you're in to to Really uh
uh affect that change in society
yourself uh so so be ambitious when you
do that uh and don't aim for the the lwh
hanging easy fruit uh aim for for the
sky so uh that's really uh what what my
lessons were for you here uh this
context Paradox is not going away um but
by understanding these lessons that I I
shared with you hope you can turn some
of these challenges uh that we see
everywhere in Enterprise AI into
opportunities for yourself um so so it's
really build better systems think about
systems not models focus on your
expertise and specialize for it don't
settle for General Solutions specialize
for for the expertise that you have in
your company and be ambitious and then
you'll be very successful thank you
[Applause]
[Music]
ladies and Gentlemen please welcome back
to the stage your MC for the leadership
track session day Peter
[Music]
Humphrey we did it thank you everyone
and thank you dawe for your insights
into retrieval augmented generation I
mean it's such an important part of this
ongoing Quest that we're all on for AI
accuracy and relevance um I think in
particular I don't know if everyone saw
the Google announcements in the last
couple days about like one billion input
tokens um so I a really really great
session I was I was uh pretty riveted on
that one um it's been a pretty Whirlwind
afternoon folks uh we did topics like AI
hiring I thought uh Heath session was
great so much good data there um team
building case studies at LinkedIn and of
course just now retrieval augmented
generation with uh if you want to meet
up with the speakers again same thing as
before um there are those three Q&A
lounges one's here on the theater level
two downstairs one at the bottom of the
stairs the other one tucked under um so
that's a good place to go to chat with
birds of a feather people that want to
talk if you want to talk about uh topics
from this afternoon um and uh uh that'll
be kind of going on for the first the
speakers will be there for like the
first 30 minutes uh of the reception and
then you know uh enjoy some some social
time some drinks and have some fun uh
quick reminder uh regarding brunch
tomorrow um for those of you that don't
have a bundle pass uh so for security
reasons we are not reprinting badges
sorry uh so if you have a bundle pass
and you are joining us tomorrow um
please remember put your badge in your
bag don't toss it keep it with you
tonight hanging on your front door knob
you know whatever it is that's going to
help you remember it as you walk out the
hotel room uh tomorrow um so that's it
you did it thank you so much for
sticking with us today that is a wrap
it's onto the Afterparty at the Expo
like I said we'll have some drinks and
of course products technology and
services from our sponsors so please
stop by and chat with them have a
wonderful evening thank you for being
here and we'll see you tomorrow for the
engineering track
[Applause]
[Music]
hold
[Music]
[Music]