From Copilot to Colleague: Trustworthy Agents for High-Stakes - Joel Hron, CTO Thomson Reuters
Channel: aiDotEngineer
Published at: 2025-07-23
YouTube video id: kDEvo2__Ijg
Source: https://www.youtube.com/watch?v=kDEvo2__Ijg
So, nice to meet you all, and thank you for having me. Probably two and a half years ago, like many other companies out there, we started on this journey of building assistants, and the north star we had when we were building those assistants was that they were helpful. Obviously we wanted them to be as accurate as they could be, to reference citations where they could, and so on, but at the end of the day we wanted them to be helpful. Over the last two and a half years, and certainly within the last six months, that north star has shifted from helpfulness to productivity. We're not asking assistants to just be helpful anymore; we're asking them to actually produce output, to make judgments and decisions on behalf of users. And in the environments we work in, which are law, tax, global trade, and risk and fraud investigations, the risks of being wrong are not acceptable to our end users. Doing this in those kinds of environments is somewhat unique, and that's what we'll talk about today.

A little context on Thomson Reuters as a company. Maybe unlike many of you, who started a company and grew to tens of thousands of users in a couple of weeks, we've been around for over 100 years. As I said, we serve legal, tax, compliance, audit, and risk. 97% of the top 100 US law firms are customers of ours, as are 99% of the Fortune 100 and the top 100 US CPA firms. So we've had a long-standing and pretty significant presence in many of these industries. What underpins that is our domain expertise and content. We have 4,500 domain experts; I believe we're the largest employer of lawyers in the world, as an example. Our proprietary content underpins most of the software products our customers use, and it comes to north of one and a half terabytes across those industries, served to our customers through our software. We're also heavily acquisitive as a company, having spent over three billion dollars on acquisitions over the last couple of years. We have an applied research lab with a little more than 200 scientists and engineers who work closely with our development teams, and as a company we spend north of 200 million dollars a year in capital on AI product development. So that's a little background on who we are at TR.

Now I'll switch gears and ground us in the evolution of where AI has been and where it's going. I think this quote from Y Combinator's summer 2025 request for startups is pretty good grounding. Paraphrasing slightly, they said: don't build agentic tools for law firms, build law firms of agents. That signifies the profound shift from helpful to productive. We're asking AI systems to produce output, judgments, and decisions, not just to be helpful to the people doing those tasks. That's the shift we're experiencing with agentic AI. So what does agentic AI actually mean? We like to define it as a spectrum.
A system is not simply agentic or not agentic. Rather, there are dials you can use to tune how much agency an experience has, depending on the use case. Some use cases are very exploratory, and you may want to turn the agency dials far up. Other situations demand a high degree of precision and carry an expectation of certainty about how a workflow must be executed, and there you may not want to dial the agency up at all. We view these as levers we can move up and down depending on what our users are willing to tolerate given the risk of the situation they're dealing with.

You can think about each dial somewhat independently. Autonomy runs from an AI assistant doing a discrete task, like summarizing a document, all the way to self-evolving workflows where the assistant plans its own work, executes it, and replans along the way based on what it is observing and learning.

Context is also a dial. The simplest systems used the parametric knowledge of the models directly. Then RAG became a big thing: we added one knowledge source, then another, and then the models had to rationalize between, say, a controlled knowledge source and the web, using both and understanding which one is better in which context. At the far end, the models may even permute the data sources themselves, updating not just the data but the schemas of the data to make better use of them for future kinds of questions.

Memory is another dial. The earliest RAG systems we had were essentially stateless; they retrieved context at a point in time. What we're seeing now is that memory needs to be shared throughout a workflow, possibly across a series of execution steps, and it likely needs to be persistent across many user sessions.

Coordination, lastly, runs from an LLM atomically executing a task, like the document summarization I mentioned, to delegating to tools, to full agent systems collaborating with each other.

So, to sum up: these are levers we can pull up and down depending on the type of use case and how much agency we want to give the system.
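To make the four dials concrete, here is a minimal sketch (my illustration, not anything shown in the talk; all names are hypothetical) of modeling them as an explicit per-use-case configuration:

```python
from dataclasses import dataclass
from enum import IntEnum

class Dial(IntEnum):
    """A generic 0-3 setting for each agency lever."""
    FIXED = 0      # behavior fully prescribed by the workflow
    ASSISTED = 1   # model fills in details of a fixed plan
    ADAPTIVE = 2   # model may replan within guardrails
    OPEN = 3       # model plans, executes, and replans freely

@dataclass(frozen=True)
class AgencyConfig:
    """The four levers from the talk, tuned per use case."""
    autonomy: Dial      # discrete task -> self-evolving workflow
    context: Dial       # parametric knowledge -> multi-source, schema-evolving
    memory: Dial        # stateless retrieval -> persistent cross-session memory
    coordination: Dial  # atomic execution -> multi-agent collaboration

# High-precision, low-tolerance workflow (e.g., generating a tax return):
# keep the dials low so execution stays predictable.
TAX_RETURN = AgencyConfig(autonomy=Dial.ASSISTED, context=Dial.ADAPTIVE,
                          memory=Dial.ASSISTED, coordination=Dial.ASSISTED)

# Exploratory workflow (e.g., open-ended legal research): dials high.
LEGAL_RESEARCH = AgencyConfig(autonomy=Dial.OPEN, context=Dial.OPEN,
                              memory=Dial.ADAPTIVE, coordination=Dial.ADAPTIVE)
```

One benefit of making the configuration explicit is that the risk decision becomes reviewable: a product team can see, and argue about, exactly how much agency a given workflow was granted.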
So I'll switch gears and share some lessons we've learned over the last two and a half years of building this. Some of it may be obvious, some of it may not. The first is on evals.

Evals are maybe the hardest thing we do. One of the most challenging things for our users is that, to trust the system, they almost expect determinism. Almost by definition, trust comes from having certainty about the expected outcome when you give a certain input, and that is simply not how these systems work. That has been a really challenging bar to clear, not just for our users but also for our own internal SMEs who evaluate these systems alongside us. What we see in our own development is that even with highly trained domain experts in legal, I can give the same question-and-response data to the same people a week later and see swings of ten percent or more in accuracy from the same people on the same questions. Their own judgments are highly variable, which makes it quite difficult to tell whether you're actually climbing the hill. The other challenge is cost. These are highly trained lawyers or tax professionals, and if you're iterating on a system every week, leveraging that amount of human judgment gets expensive quickly.

These challenges are amplified by agentic systems. Referencing source material, probably one of the most important requirements for any of our applications, becomes harder as you build systems with higher levels of agency. We see these agents drift, and identifying why and where they drifted along the trajectory becomes more challenging. And building the guardrail systems themselves requires a deep level of expert knowledge. As we've approached our evals, we've focused on developing pretty rigorous rubrics, but at the end of the day we also need north stars that guide us. In many ways we look at preference to drive an understanding of whether we're getting better or worse, and we use deeper levels of rubric to hill-climb on specific components of the system.
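The talk doesn't show eval code, but as a minimal sketch of that setup (rubric axes, grader variance, and preference as the north star; every name here is hypothetical), the shape might be:

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class RubricScore:
    """One expert's rubric ratings (0.0-1.0) for a single model response."""
    grader: str
    citation_support: float      # are claims traceable to the cited sources?
    substantive_accuracy: float  # is the legal/tax substance correct?
    completeness: float          # are all issues in the question addressed?

AXES = ("citation_support", "substantive_accuracy", "completeness")

def aggregate(scores: list[RubricScore]) -> dict[str, tuple[float, float]]:
    """Mean and spread per rubric axis. The spread matters: if the same
    experts regrading the same answers swing by ~10%, a single-grader
    score is not a trustworthy signal on its own."""
    out = {}
    for axis in AXES:
        vals = [getattr(s, axis) for s in scores]
        out[axis] = (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
    return out

def preference_win_rate(judgments: list[str]) -> float:
    """North-star metric: fraction of head-to-head comparisons in which
    experts preferred the candidate system over the baseline."""
    return judgments.count("candidate") / len(judgments) if judgments else 0.0
```

The rubric scores are what you hill-climb on per component; the preference win rate is the coarse signal that tells you whether the whole system is actually getting better.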
The other thing we've learned is that our legacy applications, which in many ways are a handicap, to be honest with you, are in a lot of ways really enabling, and I'll show you a couple of demos of that in a minute. We have 100-plus years of building software systems with highly tuned domain logic, and our users expect that logic in the way they work. Early in the age of building assistants, we were essentially starting over: we were leaving all of that behind and building something new from scratch. What agents have allowed us to do is decompose those legacy applications and expose their components as tools that agents can use. So we're finding new ways to leverage a lot of legacy applications and infrastructure that we might previously have thought of as baggage, but that I think are really unique assets for us to build on going forward.

The last lesson, which may be somewhat non-intuitive, is about the whole idea of MVPs, which sits front of mind for everybody building a new product. Many times we've over-indexed on the word minimal and chased rabbit holes in development, trying to optimize what we thought of as the smallest most valuable piece of code we could build. It wasn't until we actually built the whole system that we could see the whole system operate and understand which components we really needed to spend time on, versus what is simply healed by the agentic nature of the system itself. It was a real mindset shift for many of our teams not to ground themselves in the MVP concept, but to build the whole thing first and learn from there, rather than starting with a smaller component.

With that, I'll show a couple of quick demos of applications that do this work. The first is a tax use case. This is obviously fake data, but you can imagine a tax professional getting a set of documents, going through them, extracting data, mapping it to a tax calculation engine, and so on. What this product does is take source documents, like a W-2 or a 1099, and run the process of generating a tax return end to end. We use AI to extract data from the documents; we use AI to figure out how that data maps to fields in a tax engine, and what the tax law says about the rules and conditions under which those values apply to this case or that case, to this line or to that line; and we generate the return end to end. This is a good example of a couple of things I just mentioned. First, it's only possible because we have tools, like the tax engine, to give the model for these calculations. Second, that tax engine has a built-in validation engine the AI system can use to check its own work: it can inspect the errors, go look for more information in the documents when it needs it, and resolve everything to finish the workflow, roughly like the loop sketched below. So this is a good example of how we can decompose our legacy systems, bring new life to them, and leverage them in a unique way.
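Here is a rough sketch of that pattern: the legacy calculation and validation engines wrapped as tools inside an extract, map, calculate, validate loop. These are hypothetical stubs for illustration, not Thomson Reuters' actual implementation.

```python
# Hypothetical stubs: each legacy component becomes a tool the agent can call.

def extract_fields(document: bytes) -> dict:
    """LLM-backed extraction of values from a W-2 / 1099 (stub)."""
    ...

def map_to_engine(fields: dict) -> dict:
    """LLM decides which tax-engine field each extracted value feeds (stub)."""
    ...

def run_tax_engine(engine_fields: dict) -> dict:
    """Legacy calculation engine, unchanged, now callable as a tool (stub)."""
    ...

def validate_return(tax_return: dict) -> list[str]:
    """Legacy validation engine: returns a list of error codes (stub)."""
    ...

def resolve_errors(errors: list[str], documents: list[bytes]) -> dict:
    """Agent re-reads the documents to fill gaps the errors flagged (stub)."""
    ...

def prepare_return(documents: list[bytes], max_iterations: int = 5) -> dict:
    """End-to-end loop: extract, map, calculate, validate, repair, repeat."""
    fields: dict = {}
    for doc in documents:
        fields.update(extract_fields(doc))
    for _ in range(max_iterations):
        tax_return = run_tax_engine(map_to_engine(fields))
        errors = validate_return(tax_return)
        if not errors:
            return tax_return  # validation passed: the return is done
        # Validation failed: the agent inspects the errors and goes back
        # to the documents for missing or inconsistent values, then retries.
        fields.update(resolve_errors(errors, documents))
    raise RuntimeError("no clean return after retries; escalate to a human")
```

The interesting design choice is that the model never computes taxes itself; the deterministic engine does, and the model's agency is spent on extraction, mapping, and error repair, which is where the dials can safely sit higher.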
The second demo is a legal research use case: think of a lawyer preparing for litigation. As I mentioned, we have more than one and a half terabytes of proprietary content that we build our products on, and this is really a deep-research implementation tuned for legal. In this case, an AI assistant uses the tools of our litigation research product: searching for documents, fetching documents, comparing citations across cases, and validating citations within cases. It uses the components of that application as tools to go out and search and retrieve content, looking at various sources, whether case law, statutes, regulations, legal know-how articles of ours, or other content we've licensed, to reason its way to an appropriate answer to a legal research question.

What you're seeing here is not just the product but what sits underneath it: the trajectories the model follows along its path to answering this kind of legal question. Along the flow, the model writes notes to itself about what it is learning and finding, and at the end it rationalizes those notes into a final report that sums up all the information found throughout the research. Most importantly, the report links to hard citations in our product: every blue hyperlink resolves to a true case or a true statute, and the flags you can see mark the risk associated with each authority.
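Again as a hedged sketch (the `llm` interface and tool names below are assumptions, not the product's real API), a deep-research loop of that shape might look like:

```python
# Hypothetical sketch: litigation-research components exposed as agent tools.

def search_documents(query: str) -> list[dict]:
    """Ranked hits across case law, statutes, regulations, know-how (stub)."""
    ...

def fetch_document(doc_id: str) -> dict:
    """Full text of one document from the controlled corpus (stub)."""
    ...

def validate_citation(doc_id: str) -> str:
    """Treatment flag for an authority, e.g. 'good law' / 'overruled' (stub)."""
    ...

TOOLS = {f.__name__: f
         for f in (search_documents, fetch_document, validate_citation)}

def research(question: str, llm, max_steps: int = 20) -> str:
    """Plan-act-note loop with a synthesis step at the end."""
    notes: list[str] = []
    sources: dict[str, dict] = {}
    for _ in range(max_steps):
        action = llm.plan(question, notes)           # choose the next tool call
        if action.name == "finish":
            break
        result = TOOLS[action.name](**action.args)
        if action.name == "fetch_document":
            sources[action.args["doc_id"]] = result  # keep for hard citations
        notes.append(llm.note(action, result))       # note-to-self per step
    # Rationalize the running notes into one report, citing only fetched
    # sources, so every hyperlink resolves to a real case or statute.
    return llm.synthesize(question, notes, sources)
```

Restricting the synthesis step to documents the agent actually fetched is one way to keep the "every link is a true case" property even as the autonomy dial goes up.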
So these are two examples that I think highlight some of those lessons pretty well: decomposing applications, and trying to build the whole product at once. And these are really things we've learned the hard way in many cases.

To wrap up, since I've got a few minutes and we can take a couple of questions as well: beginning with the whole problem in mind is the right strategy when you're thinking about agentic systems. The way to think about agency is not as a binary property but as a lever you can dial up or down depending on the risk, the use case, and your users' tolerance for a given situation. One way to think about agents is as a means to bring life back to old systems, breaking them down into components that an agentic system can leverage in new ways. Focus on where humans are in the loop for evaluation; as I said, the SMEs we have internally are extremely important to us. And lastly, the reason we've done what we've done is that we looked at our company and asked what assets we have that nobody else has: 4,500 domain experts and terabytes of content. We asked ourselves how to use those to create the most differentiation in our products, and I'd challenge you to do the same. What unique assets do you have, and how can you best leverage them to build uniqueness into whatever applications you're working on? With that, I think we've got a couple of minutes for questions, and there are mics.

Audience member: Great presentation, Mr. Joel Hron. My name is Prab Bala; I'm a PhD student sponsored by the Department of Defense. My question is this: it's a great product, but if I had to take it to my organization, whether the Department of Defense or a financial firm, how would you describe the cybersecurity posture recently mandated by CISA and the government, such as an LLM firewall, LLM guardrails, automated agents for scanning vulnerabilities, or security posture management? How would you define the cybersecurity posture of the entire architecture?

Joel Hron: There's certainly a lot of technical documentation on this that I could point you to online, but I'd say we're heavily focused not just on compliance with standards like FedRAMP when we work with the government, but also on conforming to the latest standards coming out, like the ISO standard that was recently released; several of our products are now compliant with that as well. It's a pretty quickly evolving space, though, so I'd say we're quite adaptable to it. Anyway, I think I'm getting the hand, but I appreciate the time. Thank you very much. And we have a booth as well, so come say hi. Thanks.