From Copilot to Colleague: Trustworthy Agents for High-Stakes - Joel Hron, CTO Thomson Reuters
Channel: aiDotEngineer
Published at: 2025-07-23
YouTube video id: kDEvo2__Ijg
Source: https://www.youtube.com/watch?v=kDEvo2__Ijg
So, nice to meet you all, and thank you for having me. Probably two and a half years ago, like many other companies out there, we started on this journey of building assistants, and the north star we had when we were building those assistants was that they were helpful. Obviously we wanted them to be as accurate as they could be, to reference citations where they could, and so on, but at the end of the day we wanted them to be helpful. Over the last two and a half years, and certainly within the last six months, that north star has shifted from helpfulness to productivity. We're not asking assistants to just be helpful anymore; we're asking them to actually produce output, to make judgments and decisions on behalf of users. And in the environments we work in, which are law, tax, global trade, and risk and fraud investigations, the risks of being wrong are not acceptable to our end users. Doing this in those kinds of environments is somewhat unique, and that's what we'll talk about today.

A little context on Thomson Reuters as a company. Maybe unlike many of you, who started a company and grew to tens of thousands of users in a couple of weeks, we've been around for over 100 years. As I said, we serve legal, tax, compliance, audit, and risk. 97% of the top 100 US law firms are customers of ours, as are 99% of the Fortune 100 and the top 100 US CPA firms. So we've had a long-standing and pretty significant presence in many of these industries. What underpins that is our domain expertise and content. We have 4,500 domain experts; I believe we're the largest employer of lawyers in the world, as an example. Our proprietary content underpins most of the software products our customers use, and it comes to north of one and a half terabytes across those industries, served to our customers through our software. We're also heavily acquisitive as a company, having spent over three billion dollars on acquisitions over the last couple of years. We have an applied research lab with a little more than 200 scientists and engineers who work closely with our development teams, and as a company we spend north of 200 million dollars a year in capital on AI product development. So that's a little background on who we are at TR.

Now I'll switch gears and ground us in the evolution of where AI has been and where it's going. I think this quote from Y Combinator's summer 2025 request for startups is pretty good grounding. Paraphrasing slightly, they said: don't build agentic tools for law firms, build law firms of agents. That signifies the profound shift from helpful to productive. We're asking AI systems to produce output, judgments, and decisions, not just to be helpful to the people doing those tasks. That's the shift we're experiencing with agentic AI. So what does agentic AI actually mean? We like to define it as a spectrum.
A system is not simply agentic or not agentic. Rather, there are dials you can use to tune how much agency an experience has, depending on the use case. Some use cases are very exploratory, and you may want to turn the agency dials far up. Other situations demand a high degree of precision and carry an expectation of certainty about how a workflow must be executed, and there you may not want to dial the agency up at all. We view these as levers we can move up and down depending on what our users are willing to tolerate given the risk of the situation they're dealing with.

You can think about each dial somewhat independently. Autonomy runs from an AI assistant doing a discrete task, like summarizing a document, all the way to self-evolving workflows where the assistant plans its own work, executes it, and replans along the way based on what it is observing and learning.

Context is also a dial. The simplest systems used the parametric knowledge of the models directly. Then RAG became a big thing: we added one knowledge source, then another, and then the models had to rationalize between, say, a controlled knowledge source and the web, using both and understanding which one is better in which context. At the far end, the models may even permute the data sources themselves, updating not just the data but the schemas of the data to make better use of them for future kinds of questions.

Memory is another dial. The earliest RAG systems we had were essentially stateless; they retrieved context at a point in time. What we're seeing now is that memory needs to be shared throughout a workflow, possibly across a series of execution steps, and it likely needs to be persistent across many user sessions.

Coordination, lastly, runs from an LLM atomically executing a task, like the document summarization I mentioned, to delegating to tools, to full agent systems collaborating with each other.

So, to sum up: these are levers we can pull up and down depending on the type of use case and how much agency we want to give the system.
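To make the four dials concrete, here is a minimal sketch (my illustration, not anything shown in the talk; all names are hypothetical) of modeling them as an explicit per-use-case configuration:

```python
from dataclasses import dataclass
from enum import IntEnum

class Dial(IntEnum):
    """A generic 0-3 setting for each agency lever."""
    FIXED = 0      # behavior fully prescribed by the workflow
    ASSISTED = 1   # model fills in details of a fixed plan
    ADAPTIVE = 2   # model may replan within guardrails
    OPEN = 3       # model plans, executes, and replans freely

@dataclass(frozen=True)
class AgencyConfig:
    """The four levers from the talk, tuned per use case."""
    autonomy: Dial      # discrete task -> self-evolving workflow
    context: Dial       # parametric knowledge -> multi-source, schema-evolving
    memory: Dial        # stateless retrieval -> persistent cross-session memory
    coordination: Dial  # atomic execution -> multi-agent collaboration

# High-precision, low-tolerance workflow (e.g., generating a tax return):
# keep the dials low so execution stays predictable.
TAX_RETURN = AgencyConfig(autonomy=Dial.ASSISTED, context=Dial.ADAPTIVE,
                          memory=Dial.ASSISTED, coordination=Dial.ASSISTED)

# Exploratory workflow (e.g., open-ended legal research): dials high.
LEGAL_RESEARCH = AgencyConfig(autonomy=Dial.OPEN, context=Dial.OPEN,
                              memory=Dial.ADAPTIVE, coordination=Dial.ADAPTIVE)
```

One benefit of making the configuration explicit is that the risk decision becomes reviewable: a product team can see, and argue about, exactly how much agency a given workflow was granted.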
So I'll switch gears and share some lessons we've learned over the last two and a half years of building this. Some of it may be obvious, some of it may not. The first is on evals.

Evals are maybe the hardest thing we do. One of the most challenging things for our users is that, to trust the system, they almost expect determinism. Almost by definition, trust comes from having certainty about the expected outcome when you give a certain input, and that is simply not how these systems work. That has been a really challenging bar to clear, not just for our users but also for our own internal SMEs who evaluate these systems alongside us. What we see in our own development is that even with highly trained domain experts in legal, I can give the same question-and-response data to the same people a week later and see swings of ten percent or more in accuracy from the same people on the same questions. Their own judgments are highly variable, which makes it quite difficult to tell whether you're actually climbing the hill. The other challenge is cost. These are highly trained lawyers or tax professionals, and if you're iterating on a system every week, leveraging that amount of human judgment gets expensive quickly.

These challenges are amplified by agentic systems. Referencing source material, probably one of the most important requirements for any of our applications, becomes harder as you build systems with higher levels of agency. We see these agents drift, and identifying why and where they drifted along the trajectory becomes more challenging. And building the guardrail systems themselves requires a deep level of expert knowledge. As we've approached our evals, we've focused on developing pretty rigorous rubrics, but at the end of the day we also need north stars that guide us. In many ways we look at preference to drive an understanding of whether we're getting better or worse, and we use deeper levels of rubric to hill-climb on specific components of the system.
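The talk doesn't show eval code, but as a minimal sketch of that setup (rubric axes, grader variance, and preference as the north star; every name here is hypothetical), the shape might be:

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class RubricScore:
    """One expert's rubric ratings (0.0-1.0) for a single model response."""
    grader: str
    citation_support: float      # are claims traceable to the cited sources?
    substantive_accuracy: float  # is the legal/tax substance correct?
    completeness: float          # are all issues in the question addressed?

AXES = ("citation_support", "substantive_accuracy", "completeness")

def aggregate(scores: list[RubricScore]) -> dict[str, tuple[float, float]]:
    """Mean and spread per rubric axis. The spread matters: if the same
    experts regrading the same answers swing by ~10%, a single-grader
    score is not a trustworthy signal on its own."""
    out = {}
    for axis in AXES:
        vals = [getattr(s, axis) for s in scores]
        out[axis] = (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
    return out

def preference_win_rate(judgments: list[str]) -> float:
    """North-star metric: fraction of head-to-head comparisons in which
    experts preferred the candidate system over the baseline."""
    return judgments.count("candidate") / len(judgments) if judgments else 0.0
```

The rubric scores are what you hill-climb on per component; the preference win rate is the coarse signal that tells you whether the whole system is actually getting better.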
The other thing we've learned is that our legacy applications, which in many ways are a handicap, to be honest with you, are in a lot of ways really enabling, and I'll show you a couple of demos of that in a minute. We have 100-plus years of building software systems with highly tuned domain logic, and our users expect that logic in the way they work. Early in the age of building assistants, we were essentially starting over: we were leaving all of that behind and building something new from scratch. What agents have allowed us to do is decompose those legacy applications and expose their components as tools that agents can use. So we're finding new ways to leverage a lot of legacy applications and infrastructure that we might previously have thought of as baggage, but that I think are really unique assets for us to build on going forward.

The last lesson, which may be somewhat non-intuitive, is about the whole idea of MVPs, which sits front of mind for everybody building a new product. Many times we've over-indexed on the word minimal and chased rabbit holes in development, trying to optimize what we thought of as the smallest most valuable piece of code we could build. It wasn't until we actually built the whole system that we could see the whole system operate and understand which components we really needed to spend time on, versus what is simply healed by the agentic nature of the system itself. It was a real mindset shift for many of our teams not to ground themselves in the MVP concept, but to build the whole thing first and learn from there, rather than starting with a smaller component.

With that, I'll show a couple of quick demos of applications that do this work. The first is a tax use case. This is obviously fake data, but you can imagine a tax professional getting a set of documents, going through them, extracting data, mapping it to a tax calculation engine, and so on. What this product does is take source documents, like a W-2 or a 1099, and run the process of generating a tax return end to end. We use AI to extract data from the documents; we use AI to figure out how that data maps to fields in a tax engine, and what the tax law says about the rules and conditions under which those values apply to this case or that case, to this line or to that line; and we generate the return end to end. This is a good example of a couple of things I just mentioned. First, it's only possible because we have tools, like the tax engine, to give the model for these calculations. Second, that tax engine has a built-in validation engine the AI system can use to check its own work: it can inspect the errors, go look for more information in the documents when it needs it, and resolve everything to finish the workflow, roughly like the loop sketched below. So this is a good example of how we can decompose our legacy systems, bring new life to them, and leverage them in a unique way.
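Here is a rough sketch of that pattern: the legacy calculation and validation engines wrapped as tools inside an extract, map, calculate, validate loop. These are hypothetical stubs for illustration, not Thomson Reuters' actual implementation.

```python
# Hypothetical stubs: each legacy component becomes a tool the agent can call.

def extract_fields(document: bytes) -> dict:
    """LLM-backed extraction of values from a W-2 / 1099 (stub)."""
    ...

def map_to_engine(fields: dict) -> dict:
    """LLM decides which tax-engine field each extracted value feeds (stub)."""
    ...

def run_tax_engine(engine_fields: dict) -> dict:
    """Legacy calculation engine, unchanged, now callable as a tool (stub)."""
    ...

def validate_return(tax_return: dict) -> list[str]:
    """Legacy validation engine: returns a list of error codes (stub)."""
    ...

def resolve_errors(errors: list[str], documents: list[bytes]) -> dict:
    """Agent re-reads the documents to fill gaps the errors flagged (stub)."""
    ...

def prepare_return(documents: list[bytes], max_iterations: int = 5) -> dict:
    """End-to-end loop: extract, map, calculate, validate, repair, repeat."""
    fields: dict = {}
    for doc in documents:
        fields.update(extract_fields(doc))
    for _ in range(max_iterations):
        tax_return = run_tax_engine(map_to_engine(fields))
        errors = validate_return(tax_return)
        if not errors:
            return tax_return  # validation passed: the return is done
        # Validation failed: the agent inspects the errors and goes back
        # to the documents for missing or inconsistent values, then retries.
        fields.update(resolve_errors(errors, documents))
    raise RuntimeError("no clean return after retries; escalate to a human")
```

The interesting design choice is that the model never computes taxes itself; the deterministic engine does, and the model's agency is spent on extraction, mapping, and error repair, which is where the dials can safely sit higher.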
The second demo is a legal research use case: think of a lawyer preparing for litigation. As I mentioned, we have more than one and a half terabytes of proprietary content that we build our products on, and this is really a deep-research implementation tuned for legal. In this case, an AI assistant uses the tools of our litigation research product: searching for documents, fetching documents, comparing citations across cases, and validating citations within cases. It uses the components of that application as tools to go out and search and retrieve content, looking at various sources, whether case law, statutes, regulations, legal know-how articles of ours, or other content we've licensed, to reason its way to an appropriate answer to a legal research question.

What you're seeing here is not just the product but what sits underneath it: the trajectories the model follows along its path to answering this kind of legal question. Along the flow, the model writes notes to itself about what it is learning and finding, and at the end it rationalizes those notes into a final report that sums up all the information found throughout the research. Most importantly, the report links to hard citations in our product: every blue hyperlink resolves to a true case or a true statute, and the flags you can see mark the risk associated with each authority.
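Again as a hedged sketch (the `llm` interface and tool names below are assumptions, not the product's real API), a deep-research loop of that shape might look like:

```python
# Hypothetical sketch: litigation-research components exposed as agent tools.

def search_documents(query: str) -> list[dict]:
    """Ranked hits across case law, statutes, regulations, know-how (stub)."""
    ...

def fetch_document(doc_id: str) -> dict:
    """Full text of one document from the controlled corpus (stub)."""
    ...

def validate_citation(doc_id: str) -> str:
    """Treatment flag for an authority, e.g. 'good law' / 'overruled' (stub)."""
    ...

TOOLS = {f.__name__: f
         for f in (search_documents, fetch_document, validate_citation)}

def research(question: str, llm, max_steps: int = 20) -> str:
    """Plan-act-note loop with a synthesis step at the end."""
    notes: list[str] = []
    sources: dict[str, dict] = {}
    for _ in range(max_steps):
        action = llm.plan(question, notes)           # choose the next tool call
        if action.name == "finish":
            break
        result = TOOLS[action.name](**action.args)
        if action.name == "fetch_document":
            sources[action.args["doc_id"]] = result  # keep for hard citations
        notes.append(llm.note(action, result))       # note-to-self per step
    # Rationalize the running notes into one report, citing only fetched
    # sources, so every hyperlink resolves to a real case or statute.
    return llm.synthesize(question, notes, sources)
```

Restricting the synthesis step to documents the agent actually fetched is one way to keep the "every link is a true case" property even as the autonomy dial goes up.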
So these are two examples that I think highlight some of those lessons pretty well: decomposing applications, and trying to build the whole product at once. And these are really things we've learned the hard way in many cases.

To wrap up, since I've got a few minutes and we can take a couple of questions as well: beginning with the whole problem in mind is the right strategy when you're thinking about agentic systems. The way to think about agency is not as a binary property but as a lever you can dial up or down depending on the risk, the use case, and your users' tolerance for a given situation. One way to think about agents is as a means to bring life back to old systems, breaking them down into components that an agentic system can leverage in new ways. Focus on where humans are in the loop for evaluation; as I said, the SMEs we have internally are extremely important to us. And lastly, the reason we've done what we've done is that we looked at our company and asked what assets we have that nobody else has: 4,500 domain experts and terabytes of content. We asked ourselves how to use those to create the most differentiation in our products, and I'd challenge you to do the same. What unique assets do you have, and how can you best leverage them to build uniqueness into whatever applications you're working on? With that, I think we've got a couple of minutes for questions, and there are mics.

Audience member: Great presentation, Mr. Joel Hron. My name is Prab Bala; I'm a PhD student sponsored by the Department of Defense. My question is this: it's a great product, but if I had to take it to my organization, whether the Department of Defense or a financial firm, how would you describe the cybersecurity posture recently mandated by CISA and the government, such as an LLM firewall, LLM guardrails, automated agents for scanning vulnerabilities, or security posture management? How would you define the cybersecurity posture of the entire architecture?

Joel Hron: There's certainly a lot of technical documentation on this that I could point you to online, but I'd say we're heavily focused not just on compliance with standards like FedRAMP when we work with the government, but also on conforming to the latest standards coming out, like the ISO standard that was recently released; several of our products are now compliant with that as well. It's a pretty quickly evolving space, though, so I'd say we're quite adaptable to it. Anyway, I think I'm getting the hand, but I appreciate the time. Thank you very much. And we have a booth as well, so come say hi. Thanks.