3 ingredients for building reliable enterprise agents - Harrison Chase, LangChain/LangGraph
Channel: aiDotEngineer
Published at: 2025-07-23
YouTube video id: kTnfJszFxCg
Source: https://www.youtube.com/watch?v=kTnfJszFxCg
I want to talk today a little bit about building reliable agents in the enterprise. This is something we work on with a bunch of people, both developers inside enterprises looking to build agents for their own companies, and people building solutions to sell into enterprises. So I wanted to share some of what we see as the success tips and tricks for making this happen. The vision of the future that I, and I think many others, have for agents is that there will be a lot of them. They'll be running around the enterprise doing different things, an agent for every task, and we'll be coordinating with them like a manager or supervisor. So how do we get to that vision, and which parts of it will arrive before the others? I was thinking about this question, what makes some agents succeed in the enterprise and some fail, and I was chatting about it with my friend Assaf. He's the head of AI at Monday, and he also wrote GPT Researcher, which is a great open source package. I was chatting with him a few weeks ago, and a lot of the ideas here are borrowed from that conversation. He'll probably write a blog post about this with a slightly different framing, which I'd encourage everyone to check out. So I want to give him a massive shout-out, and if you have the opportunity to chat with him, you should definitely take it. Thinking about it from first principles, what makes agents successful in the enterprise? The greater the value of the agent when it's right, the more likely it is to be adopted.
These probably aren't going to sound earth-shattering, but hopefully we'll get to some interesting points. The more value an agent provides when it's right, the more likely it is to be adopted; the more likely it is to succeed, the more likely it is to be adopted; and then there's the cost when it's wrong: if failures are expensive, it will be less likely to be adopted. These are three simple, basic ingredients, but I think they provide an interesting first-principles approach for thinking about how to build agents and which types of agents find success. I say "in the enterprise" here, but I think this applies generally across society. If we want to put it into a fun little equation, we can multiply the probability that something succeeds by the value you get when it succeeds, subtract the probability of failure times the cost when it's wrong, and that needs to be greater than the cost of running the agent for you to want to put it into production. So, a fun little bit of stats math. How can we build agents that score higher on this? Because nothing so far has been earth-shattering; hopefully we'll get to some fun insights when we talk about how to make that equation go up. First, how can we increase the value of things when they go right, and what types of agents have higher value? Part of this is choosing problems where there is really high value. A lot of the agents that have been successful so far fit this pattern: Harvey in the legal space is one of them, and in the finance space we see work around research and summarization. These are high-value tasks.
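Put as a back-of-the-envelope decision rule, the equation might look like the sketch below. The numbers are made up purely for illustration; the structure is just the three ingredients from the talk.

```python
def expected_value(p_success, value_if_right, cost_if_wrong, cost_to_run):
    """Expected value of deploying an agent, per the three ingredients:
    probability of success, value when it's right, cost when it's wrong.
    Deploy only when this is positive, i.e. the upside outweighs both
    the failure cost and the cost of running the agent."""
    return p_success * value_if_right - (1 - p_success) * cost_if_wrong - cost_to_run

# A hypothetical agent: succeeds 80% of the time, is worth $100 when
# right, costs $20 in cleanup when wrong, and costs $5 per run.
ev = expected_value(0.8, 100, 20, 5)
print(ev)  # 0.8*100 - 0.2*20 - 5 = 71.0
```

Each lever in the talk maps onto one term: raising value when right, raising the probability of success, and lowering the cost when wrong all push this number up.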
People pay a lot of money for lawyers and for investment research, so those are examples of high-value tasks. But there are other ways to improve the value of what you're working on besides switching verticals entirely, and I think we're starting to see some of this, especially recently. If we think about RAG, or older-school question-answering solutions, they would respond quickly, ideally within five seconds, and give you a quick answer. Now we're seeing a trend towards things like deep research, which run for an extended period of time. We're seeing the same with code. We started with Cursor, which has inline autocomplete and maybe some chat question answering; in the past three weeks there have been what, seven different examples of ambient agents that run in the background for hours at a time. I think this speaks to how people are trying to get their agents to provide more value: they're getting them to do more work. Pretty basic, but as we think about this future of agents working and what that means, it doesn't mean a copilot. It means something working more autonomously in the background, doing larger amounts of work. So besides focusing on verticals that provide value, you can also reshape the UI/UX, the interaction pattern of what you're building, to be more long-running and do more substantial work. Let's talk now about the probability of success. How do we make it go up? There are two aspects I want to cover here. One is the reliability of agents. If you've built agents before, it's easy to get something that works as a prototype: it runs once, great.
You can make a video and put it on Twitter, but it's hard to make it work reliably in production. To be fair, for some types of agents that's totally fine: you can have agents that run for a while where you don't know exactly what they do, and that's okay. But especially in the enterprise, we often see that people want more predictability and more control over what steps actually happen inside the agent. Maybe they always want to do step B after step A. If you prompt an agent to do that, great: it might do it 90% of the time, but you don't know what the LLM will do. If you put it in a deterministic workflow, in code, it will always do it. So especially in the enterprise, there are workflow-like things where you need more controllability and predictability than you get by just prompting, and the solution we've seen is basically to make more and more of your agent deterministic. There's this concept of workflows versus agents; Anthropic wrote a great blog post on it that I'd encourage you to check out. I would argue that instead of workflows versus agents, it's often workflows and agents. We see that parts of an agentic system are sometimes looping and calling tools, and sometimes just doing A, then B, then C. Multi-agent architectures are an example of this: if you have an architecture where agent B always runs after agent A finishes, is that a workflow or an agent? It's this middle ground. As we think about building tools for this future, one of the things we've released is LangGraph. LangGraph is an agent framework.
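That middle ground can be sketched in plain Python (this is deliberately not LangGraph code; the step names, the state dict, and the stand-in `call_llm` are all invented for illustration): a fixed A-then-B sequence whose middle step is itself an agentic tool-calling loop.

```python
def call_llm(prompt):
    # Stand-in for a real model call; here it always answers directly.
    return {"tool": None, "answer": f"done: {prompt}"}

def step_a(state):
    # Deterministic step: always runs first, every time.
    state["doc"] = state["input"].strip().lower()
    return state

def agent_loop(state, tools, max_steps=5):
    # Agentic step: loop while the model keeps asking for tools.
    for _ in range(max_steps):
        decision = call_llm(state["doc"])
        if decision["tool"] is None:
            state["answer"] = decision["answer"]
            return state
        state["doc"] += tools[decision["tool"]](state["doc"])
    return state

def step_b(state):
    # Deterministic step: always runs after the agent loop finishes.
    state["report"] = f"REPORT: {state['answer']}"
    return state

def run(user_input):
    # Workflow *and* agent: A -> agent loop -> B, in a fixed order,
    # with nondeterminism confined to the loop in the middle.
    state = {"input": user_input}
    for step in (step_a, lambda s: agent_loop(s, tools={}), step_b):
        state = step(state)
    return state["report"]

print(run("  Hello "))  # REPORT: done: hello
```

The point of the sketch is the shape, not the contents: the LLM only controls what happens inside `agent_loop`, while the surrounding sequence is ordinary deterministic code.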
It's very different from other agent frameworks in that it really leans into this spectrum of workflows and agents and allows you to sit wherever is best for your application on that curve; where on the curve is best depends entirely on what you're building. There's another thing, separate from just building and changing the agent itself. People often have really wide error bars when they think about how likely an agent is to work. This technology is new, and when you're trying to get something built, approved, or put into production inside an enterprise, there's a lot of uncertainty and fear, which I think comes back to fundamental uncertainty about how the agent is actually performing. So besides just making the agent better, a really important thing to do inside the enterprise, whether you're bringing a third-party agent and selling it as a service or building inside the enterprise yourself, is to work to shrink the error bars people perceive around how the agent performs. What I mean specifically is that this is where observability and evals play a slightly different role than we might intend. We have an observability and eval solution called LangSmith. We built it for developers so they could see what's going on inside their agent, but it has also proved really valuable for communicating to external stakeholders what's going on inside the agent: how it performs, where it messes up, where it doesn't, and basically communicating those patterns. With the observability piece, you can just see every step that's happening inside the agent.
This reduces the uncertainty people have about what the agent is actually doing. They can see it's making three or five LLM calls, not just one, and that it's actually being thoughtful about the steps that happen, and then you can benchmark it against different things. There's a great story of a user of ours who used LangSmith initially to build their agent, but then brought it to the review panel when they were trying to get the agent approved for production. They ended the meeting under time, which almost never happens if you've been to these review panels. They showed the panel everything inside LangSmith, and it helped reduce the perceived risk people had around these agents. The last thing I want to talk about is the cost of something going wrong. Similar to the probability of things being right, this plays an outsized role in people's perception of these agents, especially in larger enterprises, among review boards and managers. People hear stories of agents going wild and causing brand damage or giving things away for free. I think there's an outsized perception of what could happen if things go bad, so there are a few UI/UX tricks that successful agents use to make this a non-issue. One is simply making it easy to reverse the changes the agent makes.
If you think about code, and this is a screenshot of Replit's agent, it generates a diff, a PR. Code is really easy to revert: you just go back to the previous commit. I think that's part of the reason code is one of the first real places you can apply agents. Besides the fact that the models are trained on it, when you use these agents you create all these commits. Well, it depends how you do it; Replit does it in a very clever way, where every time the agent changes a file, it saves that as a new commit. So you can always go back and revert what the agent did. The second part is having a human in the loop. Rather than merging code changes into main directly, you open up a PR; that puts the human in the loop. Then the agent isn't really making changes itself: a human is approving what the agent does. This seems maybe a little subtle, but I think it completely changes the cost calculation in people's minds about what the agent doing something bad would cost, because now it's reversible, and a human will prevent a bad change from even landing in the first place. So human in the loop is one of the big things we see people selling into enterprises, and building inside them, really leaning into. To make this more concrete, what are some examples? I think deep research is a pretty good one. There's a period up front when you're messaging with deep research, going back and forth: it asks you follow-up questions and you calibrate on what you want to research. That puts the human in the loop, and it also makes sure you get a better result.
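The commit-per-edit idea can be sketched in a few lines. This is a toy in-memory version to show the mechanism, not how Replit actually implements it: snapshot before every agent edit, so any agent action can be rolled back.

```python
class FileHistory:
    """Toy sketch of 'every agent edit is a commit':
    take a snapshot before each change so anything is reversible."""

    def __init__(self):
        self.commits = []  # stack of {path: contents} snapshots
        self.files = {}    # current working tree

    def agent_write(self, path, contents):
        # Save a snapshot *before* applying the agent's edit.
        self.commits.append(dict(self.files))
        self.files[path] = contents

    def revert(self):
        # Roll back to the state before the most recent agent edit.
        if self.commits:
            self.files = self.commits.pop()

repo = FileHistory()
repo.agent_write("app.py", "print('v1')")
repo.agent_write("app.py", "print('v2 -- broken')")
repo.revert()  # undo the bad edit
print(repo.files["app.py"])  # print('v1')
```

In a real system the snapshot stack is git itself; the design point is that because every agent action is cheap to undo, the "cost if it's wrong" term collapses.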
So it increases the value you get from the report, because it's more aligned with what you actually want. And deep research doesn't then take the result and publish it as a blog post on the internet, or email it to your clients. It just produces a report that you can read and decide what to do with. It's not actually doing anything itself; it's up to you to take the result and act on it. Code is another great example: Claude Code also has this ability to ask questions and clarify things. That both keeps the human in the loop and makes sure it yields better results. And again with code, maybe you're not making a commit every time you change things, but the work is on a separate branch and you open a PR; you're not pushing directly to master. These are examples from across the industry that follow some of these patterns. Okay, so we've figured out a few levers we can pull to make our agents more attractive to deploy in the enterprise. What next? What's next is scaling that up. If this has positive expected value, then what we really want to do is multiply it a bunch and scale it up. I think this speaks to the concept of ambient agents. In this futuristic view of agents working in an enterprise, doing things in the background, they're not being kicked off by humans who are still in the loop; they're being triggered by events. And I think the reason this is so powerful is that it scales up this positive expected value even more than we can on our own.
I can only really have one, maybe two, chat boxes open at the same time, but now there can be hundreds of these running in the background. So when we think about the difference between chat agents, which I'd argue is mostly what we've seen, and ambient agents, one big difference is that ambient agents are triggered by events. That lets us scale ourselves: instead of one-to-one, it's now a one-to-many conversation, and the number of agents running concurrently goes from one to effectively unlimited. The latency requirements also change. With chat, you have a UX expectation that it responds really quickly; that's not the case with ambient agents, because they're triggered without you even knowing. How would you even know, or care, how long one has been running? What does this let you do? Why does it matter? It lets the agent do more complex operations, so it can start to build up a bigger body of work. You can go from changing one line of code to changing a whole file, or creating a new repo. Instead of the agent responding directly or making a single tool call, which is what usually happens in chat applications because of the latency requirements, it can now do these more complex things, and the value can start increasing in terms of what it's doing. The other thing I want to emphasize is that there's still a UX for interacting with these agents. Ambient does not mean fully autonomous. This is really important, because when people hear "autonomous," they think the cost of the thing doing something bad is really high: I'm not going to be able to oversee it; I don't know what's going on.
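To make the event-triggered, one-to-many idea concrete, here's a minimal sketch using only the standard library. The email events and the `handle_email` stand-in are invented for illustration; a real system would subscribe to an actual inbox, and the "agent" here just drafts a reply for later human review.

```python
import queue
import threading

def ambient_worker(events, handle, results):
    # Each worker drains events as they arrive: nobody kicks it off,
    # and nobody is waiting on a chat box, so latency barely matters.
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel
            break
        results.append(handle(event))
        events.task_done()

def handle_email(email):
    # Stand-in for the real agent: produce a draft, don't send anything.
    return f"DRAFT reply to: {email['subject']}"

events = queue.Queue()
results = []
# Several concurrent workers: the one-to-many scaling chat can't give you.
workers = [
    threading.Thread(target=ambient_worker, args=(events, handle_email, results))
    for _ in range(4)
]
for w in workers:
    w.start()
for subject in ["invoice", "meeting", "question"]:
    events.put({"subject": subject})  # incoming emails are the events
events.join()
for _ in workers:
    events.put(None)
for w in workers:
    w.join()
print(sorted(results))
```

Note that the concurrency lives entirely on the agent side: three events arrived, three drafts were produced, and no human was in the request path.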
It could go out there and run wild. So, ambient does not mean fully autonomous, and there are a lot of different human-in-the-loop interaction patterns you can bring into these background, ambient agents. There can be an approve/reject pattern where, for certain tools, you want to explicitly say yes, it's okay to call this tool. You might want to edit the tool call: if the agent messes one up, you can just correct it in the UI. You might want to give the agent the ability to ask questions, so you can answer them and provide more information if it gets stuck halfway through. And then there's time travel, which we also call human on the loop: after the agent runs, if it messed up on step 10 of 100, you can rewind to step 10 and say, "No, resume from here, but do this other thing slightly differently." So we think human in the loop is super important. The other thing I want to call out briefly is this intermediary state we're starting to be in right now. I wouldn't call deep research, or Claude Code, or any of these coding agents ambient agents, because they're still triggered by a human; but I think they're good examples of sync-to-async agents. Factory is a coding agent, and they use the term "async coding agents," which I really like. This sync-to-async stage is a natural progression: right now, or a year ago, everything was a sync agent we chatted with, very much in the moment. The future is probably autonomous agents working in the background and pinging us when they need help.
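The approve/reject/edit pattern can be sketched as a gate in front of tool execution. Everything here is invented for illustration (the tool names, the scripted reviewer); the point is just that the human decides before anything irreversible runs.

```python
def run_with_approval(tool_calls, tools, approve):
    """Human-in-the-loop gate: every proposed tool call goes to a
    reviewer, who can approve it, reject it, or hand back an edited
    version of the call before it executes."""
    log = []
    for call in tool_calls:
        decision = approve(call)  # "approve" | "reject" | edited call dict
        if decision == "reject":
            log.append(("skipped", call["tool"]))
            continue
        if isinstance(decision, dict):  # reviewer edited the call
            call = decision
        result = tools[call["tool"]](**call["args"])
        log.append(("ran", call["tool"], result))
    return log

tools = {"send_email": lambda to, body: f"sent to {to}"}
calls = [
    {"tool": "send_email", "args": {"to": "boss@example.com", "body": "hi"}},
    {"tool": "send_email", "args": {"to": "everyone@example.com", "body": "oops"}},
]
# A scripted reviewer standing in for a UI: approve the first call,
# reject the risky second one.
answers = iter(["approve", "reject"])
log = run_with_approval(calls, tools, lambda call: next(answers))
print(log)
# [('ran', 'send_email', 'sent to boss@example.com'), ('skipped', 'send_email')]
```

In an agent inbox UI, `approve` is the human clicking buttons instead of a scripted iterator, but the control flow is the same: the agent proposes, the human disposes.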
But there's this intermediate state where the human kicks the agent off and uses that human-in-the-loop moment at the start to calibrate on what they want it to do. So that table I showed of chat versus ambient is actually probably missing a column in the middle for these sync-to-async agents. Anyway, an example of the UXs we think can be interesting for ambient agents is what we call the agent inbox: you surface all the actions the agent wants to take that need your approval, and then you can go in and approve, reject, leave feedback, things like that. To tie this together and make really concrete what I mean by ambient agents: email, I think, is a really natural place for them. These agents can listen to incoming emails; those are events. They can run on however many emails come in, which is in theory unlimited, but you still probably want the human, the user, to approve any emails that go out or any calendar events that get sent, depending on your level of comfort. So this is a concrete thing. I actually built one myself, and we've used it to test out a lot of these ideas. If people want to try it out, there's a QR code you can scan to get the GitHub repo; it's all open source. It's not the only example of an ambient agent, but it's one I built myself, so we talk about it a lot internally. That's all I have. I'm not sure if there's time for questions or not; one or two, if people have them.
So my question is: although everybody's talking about agents, only code-generating agents are the ones getting funding. Is it because you can measure what you've done and you can reverse what you've done, while for all other agents you can do a lot of stuff but you cannot measure it or reverse it?

Yeah, I think there are a variety of reasons, and those two, measuring and reversing, are part of it. On the measuring side: a lot of the large model labs train on a lot of coding data because you can test whether it's correct or not; you can run it and see if it compiles. Same with math data: math is verifiable, right? So math and code are two examples of verifiable domains. Essay writing is less verifiable; what does it mean for an essay to be correct? That's far more ambiguous. Because these domains are verifiable, you can bootstrap a lot of training data, so there's already a lot of training data about code in the models, and the models are better at it, which makes the agents that use those models better at it. On the second part, I do think code lends itself naturally to this commit, draft, and preview pattern, but I think that part is more generalizable. Legal is a great example: in legal, first drafts of things are very common. Same with essay writing. I think the concept of a first draft is actually a really good UX to aim for. It lets the agent do far more, and it also puts the human in the loop, so you get both benefits. If you put the human in the loop at every step, that doesn't provide any value; each step is so small. The key is finding UX patterns where the agent does a ton of work but the human is still in the loop at key points, and first drafts, I think, are a great mental model for that.
So anything where there are first drafts, legal, writing, code, I think is a little more generalizable. The verifiable stuff is a little tougher. Oh, good, I'll talk to you afterwards; more than happy to chat after. Thank you all.