3 ingredients for building reliable enterprise agents - Harrison Chase, LangChain/LangGraph
Channel: aiDotEngineer
Published at: 2025-07-23
YouTube video id: kTnfJszFxCg
Source: https://www.youtube.com/watch?v=kTnfJszFxCg
I want to talk today a little bit about building reliable agents in the enterprise. This is something we work on with a bunch of people, both developers inside enterprises looking to build agents for their own companies, and people building solutions to sell into enterprises. So I wanted to share some of what we see as the success tips and tricks for making this happen. The vision of the future that I, and I think many others, have for agents is that there will be a lot of them. They'll be running around the enterprise doing different things, an agent for every task, and we'll be coordinating with them like a manager or supervisor. So how do we get to that vision, and which parts of it will arrive before the others? I was thinking about this question, what makes some agents succeed in the enterprise and some fail, and I was chatting about it with my friend Assaf. He's the head of AI at Monday, and he also wrote GPT Researcher, which is a great open source package. I was chatting with him a few weeks ago, and a lot of the ideas here are borrowed from that conversation. He'll probably write a blog post about this with a slightly different framing, which I'd encourage everyone to check out. So I want to give him a massive shout-out, and if you have the opportunity to chat with him, you should definitely take it. Thinking about it from first principles, what makes agents successful in the enterprise? The greater the value of the agent when it's right, the more likely it is to be adopted.
These probably aren't going to sound earth-shattering, but hopefully we'll get to some interesting points. The more value an agent provides when it's right, the more likely it is to be adopted; the more likely it is to succeed, the more likely it is to be adopted; and then there's the cost when it's wrong: if failures are expensive, it will be less likely to be adopted. These are three simple, basic ingredients, but I think they provide an interesting first-principles approach for thinking about how to build agents and which types of agents find success. I say "in the enterprise" here, but I think this applies generally across society. If we want to put it into a fun little equation, we can multiply the probability that something succeeds by the value you get when it succeeds, subtract the probability of failure times the cost when it's wrong, and that needs to be greater than the cost of running the agent for you to want to put it into production. So, a fun little bit of stats math. How can we build agents that score higher on this? Because nothing so far has been earth-shattering; hopefully we'll get to some fun insights when we talk about how to make that equation go up. First, how can we increase the value of things when they go right, and what types of agents have higher value? Part of this is choosing problems where there is really high value. A lot of the agents that have been successful so far fit this pattern: Harvey in the legal space is one of them, and in the finance space we see work around research and summarization. These are high-value tasks.
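Put as a back-of-the-envelope decision rule, the equation might look like the sketch below. The numbers are made up purely for illustration; the structure is just the three ingredients from the talk.

```python
def expected_value(p_success, value_if_right, cost_if_wrong, cost_to_run):
    """Expected value of deploying an agent, per the three ingredients:
    probability of success, value when it's right, cost when it's wrong.
    Deploy only when this is positive, i.e. the upside outweighs both
    the failure cost and the cost of running the agent."""
    return p_success * value_if_right - (1 - p_success) * cost_if_wrong - cost_to_run

# A hypothetical agent: succeeds 80% of the time, is worth $100 when
# right, costs $20 in cleanup when wrong, and costs $5 per run.
ev = expected_value(0.8, 100, 20, 5)
print(ev)  # 0.8*100 - 0.2*20 - 5 = 71.0
```

Each lever in the talk maps onto one term: raising value when right, raising the probability of success, and lowering the cost when wrong all push this number up.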
People pay a lot of money for lawyers and for investment research, so those are examples of high-value tasks. But there are other ways to improve the value of what you're working on besides switching verticals entirely, and I think we're starting to see some of this, especially recently. If we think about RAG, or older-school question-answering solutions, they would respond quickly, ideally within five seconds, and give you a quick answer. Now we're seeing a trend towards things like deep research, which run for an extended period of time. We're seeing the same with code. We started with Cursor, which has inline autocomplete and maybe some chat question answering; in the past three weeks there have been what, seven different examples of ambient agents that run in the background for hours at a time. I think this speaks to how people are trying to get their agents to provide more value: they're getting them to do more work. Pretty basic, but as we think about this future of agents working and what that means, it doesn't mean a copilot. It means something working more autonomously in the background, doing larger amounts of work. So besides focusing on verticals that provide value, you can also reshape the UI/UX, the interaction pattern of what you're building, to be more long-running and do more substantial work. Let's talk now about the probability of success. How do we make it go up? There are two aspects I want to cover here. One is the reliability of agents. If you've built agents before, it's easy to get something that works as a prototype: it runs once, great.
You can make a video and put it on Twitter, but it's hard to make it work reliably in production. To be fair, for some types of agents that's totally fine: you can have agents that run for a while where you don't know exactly what they do, and that's okay. But especially in the enterprise, we often see that people want more predictability and more control over what steps actually happen inside the agent. Maybe they always want to do step B after step A. If you prompt an agent to do that, great: it might do it 90% of the time, but you don't know what the LLM will do. If you put it in a deterministic workflow, in code, it will always do it. So especially in the enterprise, there are workflow-like things where you need more controllability and predictability than you get by just prompting, and the solution we've seen is basically to make more and more of your agent deterministic. There's this concept of workflows versus agents; Anthropic wrote a great blog post on it that I'd encourage you to check out. I would argue that instead of workflows versus agents, it's often workflows and agents. We see that parts of an agentic system are sometimes looping and calling tools, and sometimes just doing A, then B, then C. Multi-agent architectures are an example of this: if you have an architecture where agent B always runs after agent A finishes, is that a workflow or an agent? It's this middle ground. As we think about building tools for this future, one of the things we've released is LangGraph. LangGraph is an agent framework.
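That middle ground can be sketched in plain Python (this is deliberately not LangGraph code; the step names, the state dict, and the stand-in `call_llm` are all invented for illustration): a fixed A-then-B sequence whose middle step is itself an agentic tool-calling loop.

```python
def call_llm(prompt):
    # Stand-in for a real model call; here it always answers directly.
    return {"tool": None, "answer": f"done: {prompt}"}

def step_a(state):
    # Deterministic step: always runs first, every time.
    state["doc"] = state["input"].strip().lower()
    return state

def agent_loop(state, tools, max_steps=5):
    # Agentic step: loop while the model keeps asking for tools.
    for _ in range(max_steps):
        decision = call_llm(state["doc"])
        if decision["tool"] is None:
            state["answer"] = decision["answer"]
            return state
        state["doc"] += tools[decision["tool"]](state["doc"])
    return state

def step_b(state):
    # Deterministic step: always runs after the agent loop finishes.
    state["report"] = f"REPORT: {state['answer']}"
    return state

def run(user_input):
    # Workflow *and* agent: A -> agent loop -> B, in a fixed order,
    # with nondeterminism confined to the loop in the middle.
    state = {"input": user_input}
    for step in (step_a, lambda s: agent_loop(s, tools={}), step_b):
        state = step(state)
    return state["report"]

print(run("  Hello "))  # REPORT: done: hello
```

The point of the sketch is the shape, not the contents: the LLM only controls what happens inside `agent_loop`, while the surrounding sequence is ordinary deterministic code.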
It's very different from other agent frameworks in that it really leans into this spectrum of workflows and agents and allows you to sit wherever is best for your application on that curve; where on the curve is best depends entirely on what you're building. There's another thing, separate from just building and changing the agent itself. People often have really wide error bars when they think about how likely an agent is to work. This technology is new, and when you're trying to get something built, approved, or put into production inside an enterprise, there's a lot of uncertainty and fear, which I think comes back to fundamental uncertainty about how the agent is actually performing. So besides just making the agent better, a really important thing to do inside the enterprise, whether you're bringing a third-party agent and selling it as a service or building inside the enterprise yourself, is to work to shrink the error bars people perceive around how the agent performs. What I mean specifically is that this is where observability and evals play a slightly different role than we might intend. We have an observability and eval solution called LangSmith. We built it for developers so they could see what's going on inside their agent, but it has also proved really valuable for communicating to external stakeholders what's going on inside the agent: how it performs, where it messes up, where it doesn't, and basically communicating those patterns. With the observability piece, you can just see every step that's happening inside the agent.
This reduces the uncertainty people have about what the agent is actually doing. They can see it's making three or five LLM calls, not just one, and that it's actually being thoughtful about the steps that happen, and then you can benchmark it against different things. There's a great story of a user of ours who used LangSmith initially to build their agent, but then brought it to the review panel when they were trying to get the agent approved for production. They ended the meeting under time, which almost never happens if you've been to these review panels. They showed the panel everything inside LangSmith, and it helped reduce the perceived risk people had around these agents. The last thing I want to talk about is the cost of something going wrong. Similar to the probability of things being right, this plays an outsized role in people's perception of these agents, especially in larger enterprises, among review boards and managers. People hear stories of agents going wild and causing brand damage or giving things away for free. I think there's an outsized perception of what could happen if things go bad, so there are a few UI/UX tricks that successful agents use to make this a non-issue. One is simply making it easy to reverse the changes the agent makes.
If you think about code, and this is a screenshot of Replit's agent, it generates a diff, a PR. Code is really easy to revert: you just go back to the previous commit. I think that's part of the reason code is one of the first real places you can apply agents. Besides the fact that the models are trained on it, when you use these agents you create all these commits. Well, it depends how you do it; Replit does it in a very clever way, where every time the agent changes a file, it saves that as a new commit. So you can always go back and revert what the agent did. The second part is having a human in the loop. Rather than merging code changes into main directly, you open up a PR; that puts the human in the loop. Then the agent isn't really making changes itself: a human is approving what the agent does. This seems maybe a little subtle, but I think it completely changes the cost calculation in people's minds about what the agent doing something bad would cost, because now it's reversible, and a human will prevent a bad change from even landing in the first place. So human in the loop is one of the big things we see people selling into enterprises, and building inside them, really leaning into. To make this more concrete, what are some examples? I think deep research is a pretty good one. There's a period up front when you're messaging with deep research, going back and forth: it asks you follow-up questions and you calibrate on what you want to research. That puts the human in the loop, and it also makes sure you get a better result.
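The commit-per-edit idea can be sketched in a few lines. This is a toy in-memory version to show the mechanism, not how Replit actually implements it: snapshot before every agent edit, so any agent action can be rolled back.

```python
class FileHistory:
    """Toy sketch of 'every agent edit is a commit':
    take a snapshot before each change so anything is reversible."""

    def __init__(self):
        self.commits = []  # stack of {path: contents} snapshots
        self.files = {}    # current working tree

    def agent_write(self, path, contents):
        # Save a snapshot *before* applying the agent's edit.
        self.commits.append(dict(self.files))
        self.files[path] = contents

    def revert(self):
        # Roll back to the state before the most recent agent edit.
        if self.commits:
            self.files = self.commits.pop()

repo = FileHistory()
repo.agent_write("app.py", "print('v1')")
repo.agent_write("app.py", "print('v2 -- broken')")
repo.revert()  # undo the bad edit
print(repo.files["app.py"])  # print('v1')
```

In a real system the snapshot stack is git itself; the design point is that because every agent action is cheap to undo, the "cost if it's wrong" term collapses.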
So it increases the value you get from the report, because it's more aligned with what you actually want. And deep research doesn't then take the result and publish it as a blog post on the internet, or email it to your clients. It just produces a report that you can read and decide what to do with. It's not actually doing anything itself; it's up to you to take the result and act on it. Code is another great example: Claude Code also has this ability to ask questions and clarify things. That both keeps the human in the loop and makes sure it yields better results. And again with code, maybe you're not making a commit every time you change things, but the work is on a separate branch and you open a PR; you're not pushing directly to master. These are examples from across the industry that follow some of these patterns. Okay, so we've figured out a few levers we can pull to make our agents more attractive to deploy in the enterprise. What next? What's next is scaling that up. If this has positive expected value, then what we really want to do is multiply it a bunch and scale it up. I think this speaks to the concept of ambient agents. In this futuristic view of agents working in an enterprise, doing things in the background, they're not being kicked off by humans who are still in the loop; they're being triggered by events. And I think the reason this is so powerful is that it scales up this positive expected value even more than we can on our own.
I can only really have one, maybe two, chat boxes open at the same time, but now there can be hundreds of these running in the background. So when we think about the difference between chat agents, which I'd argue is mostly what we've seen, and ambient agents, one big difference is that ambient agents are triggered by events. That lets us scale ourselves: instead of one-to-one, it's now a one-to-many conversation, and the number of agents running concurrently goes from one to effectively unlimited. The latency requirements also change. With chat, you have a UX expectation that it responds really quickly; that's not the case with ambient agents, because they're triggered without you even knowing. How would you even know, or care, how long one has been running? What does this let you do? Why does it matter? It lets the agent do more complex operations, so it can start to build up a bigger body of work. You can go from changing one line of code to changing a whole file, or creating a new repo. Instead of the agent responding directly or making a single tool call, which is what usually happens in chat applications because of the latency requirements, it can now do these more complex things, and the value can start increasing in terms of what it's doing. The other thing I want to emphasize is that there's still a UX for interacting with these agents. Ambient does not mean fully autonomous. This is really important, because when people hear "autonomous," they think the cost of the thing doing something bad is really high: I'm not going to be able to oversee it; I don't know what's going on.
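To make the event-triggered, one-to-many idea concrete, here's a minimal sketch using only the standard library. The email events and the `handle_email` stand-in are invented for illustration; a real system would subscribe to an actual inbox, and the "agent" here just drafts a reply for later human review.

```python
import queue
import threading

def ambient_worker(events, handle, results):
    # Each worker drains events as they arrive: nobody kicks it off,
    # and nobody is waiting on a chat box, so latency barely matters.
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel
            break
        results.append(handle(event))
        events.task_done()

def handle_email(email):
    # Stand-in for the real agent: produce a draft, don't send anything.
    return f"DRAFT reply to: {email['subject']}"

events = queue.Queue()
results = []
# Several concurrent workers: the one-to-many scaling chat can't give you.
workers = [
    threading.Thread(target=ambient_worker, args=(events, handle_email, results))
    for _ in range(4)
]
for w in workers:
    w.start()
for subject in ["invoice", "meeting", "question"]:
    events.put({"subject": subject})  # incoming emails are the events
events.join()
for _ in workers:
    events.put(None)
for w in workers:
    w.join()
print(sorted(results))
```

Note that the concurrency lives entirely on the agent side: three events arrived, three drafts were produced, and no human was in the request path.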
It could go out there and run wild. So, ambient does not mean fully autonomous, and there are a lot of different human-in-the-loop interaction patterns you can bring into these background, ambient agents. There can be an approve/reject pattern where, for certain tools, you want to explicitly say yes, it's okay to call this tool. You might want to edit the tool call: if the agent messes one up, you can just correct it in the UI. You might want to give the agent the ability to ask questions, so you can answer them and provide more information if it gets stuck halfway through. And then there's time travel, which we also call human on the loop: after the agent runs, if it messed up on step 10 of 100, you can rewind to step 10 and say, "No, resume from here, but do this other thing slightly differently." So we think human in the loop is super important. The other thing I want to call out briefly is this intermediary state we're starting to be in right now. I wouldn't call deep research, or Claude Code, or any of these coding agents ambient agents, because they're still triggered by a human; but I think they're good examples of sync-to-async agents. Factory is a coding agent, and they use the term "async coding agents," which I really like. This sync-to-async stage is a natural progression: right now, or a year ago, everything was a sync agent we chatted with, very much in the moment. The future is probably autonomous agents working in the background and pinging us when they need help.
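The approve/reject/edit pattern can be sketched as a gate in front of tool execution. Everything here is invented for illustration (the tool names, the scripted reviewer); the point is just that the human decides before anything irreversible runs.

```python
def run_with_approval(tool_calls, tools, approve):
    """Human-in-the-loop gate: every proposed tool call goes to a
    reviewer, who can approve it, reject it, or hand back an edited
    version of the call before it executes."""
    log = []
    for call in tool_calls:
        decision = approve(call)  # "approve" | "reject" | edited call dict
        if decision == "reject":
            log.append(("skipped", call["tool"]))
            continue
        if isinstance(decision, dict):  # reviewer edited the call
            call = decision
        result = tools[call["tool"]](**call["args"])
        log.append(("ran", call["tool"], result))
    return log

tools = {"send_email": lambda to, body: f"sent to {to}"}
calls = [
    {"tool": "send_email", "args": {"to": "boss@example.com", "body": "hi"}},
    {"tool": "send_email", "args": {"to": "everyone@example.com", "body": "oops"}},
]
# A scripted reviewer standing in for a UI: approve the first call,
# reject the risky second one.
answers = iter(["approve", "reject"])
log = run_with_approval(calls, tools, lambda call: next(answers))
print(log)
# [('ran', 'send_email', 'sent to boss@example.com'), ('skipped', 'send_email')]
```

In an agent inbox UI, `approve` is the human clicking buttons instead of a scripted iterator, but the control flow is the same: the agent proposes, the human disposes.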
But there's this intermediate state where the human kicks the agent off and uses that human-in-the-loop moment at the start to calibrate on what they want it to do. So that table I showed of chat versus ambient is actually probably missing a column in the middle for these sync-to-async agents. Anyway, an example of the UXs we think can be interesting for ambient agents is what we call the agent inbox: you surface all the actions the agent wants to take that need your approval, and then you can go in and approve, reject, leave feedback, things like that. To tie this together and make really concrete what I mean by ambient agents: email, I think, is a really natural place for them. These agents can listen to incoming emails; those are events. They can run on however many emails come in, which is in theory unlimited, but you still probably want the human, the user, to approve any emails that go out or any calendar events that get sent, depending on your level of comfort. So this is a concrete thing. I actually built one myself, and we've used it to test out a lot of these ideas. If people want to try it out, there's a QR code you can scan to get the GitHub repo; it's all open source. It's not the only example of an ambient agent, but it's one I built myself, so we talk about it a lot internally. That's all I have. I'm not sure if there's time for questions or not; one or two, if people have them.
So my question is: although everybody's talking about agents, only code-generating agents are the ones getting funding. Is it because you can measure what you've done and you can reverse what you've done, while for all other agents you can do a lot of stuff but you cannot measure it or reverse it?

Yeah, I think there are a variety of reasons, and those two, measuring and reversing, are part of it. On the measuring side: a lot of the large model labs train on a lot of coding data because you can test whether it's correct or not; you can run it and see if it compiles. Same with math data: math is verifiable, right? So math and code are two examples of verifiable domains. Essay writing is less verifiable; what does it mean for an essay to be correct? That's far more ambiguous. Because these domains are verifiable, you can bootstrap a lot of training data, so there's already a lot of training data about code in the models, and the models are better at it, which makes the agents that use those models better at it. On the second part, I do think code lends itself naturally to this commit, draft, and preview pattern, but I think that part is more generalizable. Legal is a great example: in legal, first drafts of things are very common. Same with essay writing. I think the concept of a first draft is actually a really good UX to aim for. It lets the agent do far more, and it also puts the human in the loop, so you get both benefits. If you put the human in the loop at every step, that doesn't provide any value; each step is so small. The key is finding UX patterns where the agent does a ton of work but the human is still in the loop at key points, and first drafts, I think, are a great mental model for that.
So anything where there are first drafts, legal, writing, code, I think is a little more generalizable. The verifiable stuff is a little tougher. Oh, good, I'll talk to you afterwards; more than happy to chat after. Thank you all.