Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)
Channel: aiDotEngineer
Published at: 2025-07-29
YouTube video id: W1MiZChnkfA
Source: https://www.youtube.com/watch?v=W1MiZChnkfA
All right, thank you everyone. We're excited to be here, and thank you for coming to our talk. My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years. I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of that data that doesn't fit neatly into pandas data frames. And I'm Calvin. I lead one of the teams at Harvey AI working on tough RAG problems across massive data sets of complex legal docs and complex use cases. So our talk is about some of the tough RAG problems on the legal frontier: the challenges, some solutions, and learnings from our experiences working on them together. We'll start roughly with how Harvey tackles retrieval and the types of problems there are, then the challenges that come up with that, around retrieval quality, scaling, security, all that good stuff, and then how we end up building a system with good infrastructure to support it. First, a quick intro to what Harvey is. We're a legal AI assistant. We sell our AI product to law firms to help them do all kinds of legal tasks: drafting, analyzing documents, going through legal workflows. A big part of that is processing data, so we handle data of all different volumes and forms. At the smallest scale, we have an assistant product with on-demand uploads, the same way you might upload to any AI assistant tool; that's a smaller, 1-to-50-document range. Then we have vaults, which are larger-scale project contexts.
So if there's a big deal the law firm is working on, or a data room where they need all their contracts, litigation documents, and emails in one place, that's a vault. The third and largest scale is data corpora, which are knowledge bases around the world: the legislation and case law of a particular country, and all the laws, taxes, and regulations that go with it. Some big challenges come with that. One is scale: very large amounts of data, and some of these documents are super long, dense, and packed with content. Sparse versus dense retrieval is a challenge I'm sure all of you deal with: how to represent the data, how to retrieve over it and index it. Query complexity is a big one; we get very difficult expert queries, and I'll show an example in the next slide. The data is very domain specific and complex; there are a lot of nitty-gritty legal details, so we have to work with domain experts and lawyers to understand it and translate that understanding into how we represent, index, query, and pre-process the data. Data security and privacy is another big one. A lot of this data is sensitive, for confidential deals, IPOs, financial filings, things like that, so we have to respect that for our clients. And then of course evaluation: how to make sure the systems are actually good. I'll show a quick demonstration of a retrieval-quality challenge, just on the query side. This is maybe the average complexity of a query someone might issue in our product; there are much more complex ones and simpler ones, but this is right in the middle. You can see there are a lot of different components that go into it. Well, to read it out:
"What is the applicable regime for covered bonds issued before 9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR?" So that's a handful. What goes into it: there's a semantic aspect. There's implicit filtering going on, like applicability before a certain date. There's a specialized data set being referenced, EU laws and directives. There are keyword matches on the specific regulation and directive IDs. It's multi-part in that it's asking how this applies to two different regulations, one directive and one article. And there's domain jargon: CRR is an abbreviation (the Capital Requirements Regulation; I had to look it up this morning). So it's very complex, and we need a system that can tackle all this complexity, break down this query, and use the appropriate technologies for its different parts. One common question we get in response to this complexity is: how do you evaluate your systems? How do you make sure they're good? That's actually where we spend a ton of time. It's not so much the algorithms and the fancy agentic techniques, but how to validate them. Investing in eval-driven development is a huge key to building these systems and making sure they're good, especially in a tough domain that you don't inherently know much about as an engineer or researcher. There's no silver-bullet eval, but we have a whole range of them, at different task depths and complexities. Along one dimension they're higher fidelity but more costly; in the other direction, they're more automated evals that are faster to iterate on.
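Circling back to the query breakdown above: the idea of splitting one expert query into semantic, keyword, filter, and routing components can be sketched in a few lines. This is a toy illustration, not Harvey's actual pipeline; the `DecomposedQuery` structure, the regexes, and the corpus-routing heuristic are all hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class DecomposedQuery:
    """Hypothetical breakdown of a legal query into retrieval components."""
    semantic_text: str                                 # full query, for embedding search
    keyword_terms: list = field(default_factory=list)  # exact-match IDs, e.g. directive numbers
    date_filters: list = field(default_factory=list)   # implicit metadata filters
    corpus: str = "default"                            # which specialized data set to route to

def decompose(query: str) -> DecomposedQuery:
    # Pull out regulation/directive identifiers like "2019/2162" or "Article 129"
    # for exact keyword matching.
    keywords = re.findall(r"\b\d{4}/\d{3,4}\b|\bArticle \d+\b", query)
    # Pull out explicit dates, which imply a filter on document applicability.
    dates = re.findall(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b", query)
    # Route to a specialized corpus when EU law is referenced (toy heuristic).
    corpus = "eu_law" if "EU" in query or "directive" in query.lower() else "default"
    return DecomposedQuery(semantic_text=query, keyword_terms=keywords,
                           date_filters=dates, corpus=corpus)

q = decompose("What is the applicable regime for covered bonds issued before "
              "9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR?")
```

Each component then feeds a different retrieval mechanism: the semantic text goes to vector search, the keyword terms to exact/full-text matching, and the date filters to metadata predicates.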
As an example, the high-fidelity end is expert reviews: having experts directly review outputs, analyze them, and write reports. That's super expensive but super high quality. Something in between is an expert-labeled set of criteria that you can evaluate synthetically or in some automated way; it's still expensive to curate, and maybe a little expensive to run, but more tractable. The third, and the fastest to iterate on, is more automated quantitative metrics: retrieval precision and recall, and more deterministic success criteria like "am I pulling documents from the right folder? Is it the right section? Do they have the right keywords in them?" Things like that. To give you a quick sense of the scale and complexity on the data side, not only the query side: the data sets we integrate with are pretty massive. As you can see, we support data sets across many different countries, and for each one there's complex filtering, organization, and categorization that goes into it. We work with domain experts for all of this, but also try to apply automation whenever possible, using their guidance to come up with heuristics or LLM processing techniques to categorize it all. The performance implications are pretty significant as well. We need very good performance both online and offline: online being querying, where you want good latency, and offline being ingestion, re-ingestion, and running ML experiments for different variations. Generally one of these corpora can be tens of millions of docs, and each document is often quite large, so it's pretty large scale. So I'll talk quickly about the infrastructure needed to support this.
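The "fast iteration" tier of evals mentioned above, deterministic retrieval metrics, is simple to implement. Here's a minimal sketch of precision and recall at k against an expert-labeled relevant set; the doc IDs are made up for illustration.

```python
def precision_recall_at_k(retrieved: list, relevant: set, k: int) -> tuple:
    """Of the top-k retrieved results, how many are relevant (precision@k),
    and what fraction of all relevant docs did we recover (recall@k)?"""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: a ranked retrieval run scored against an expert-labeled set.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_8"}
p, r = precision_recall_at_k(retrieved, relevant, k=5)  # p = 0.4, r = 2/3
```

Because these metrics are fully deterministic, they can run on every experiment or code change, which is exactly what makes them the fast-iteration end of the eval spectrum.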
At this scale, of course, we want infrastructure that's reliable and available for all our users at all times; I'm sure that's something all products need. We also want smooth onboarding and scaling, where our ML and data teams can focus on the business logic, on quality, and on spinning up new applications and products for customers, and not so much on the nitty-gritty details of the database, tuning it, or manually scaling it. Of course there's always some middle ground where you want awareness of it; it can't be fully 100% automated. Likewise, we need flexibility and capabilities around data privacy and data retention. Like I mentioned, some storage needs to be segregated depending on the customer and the use case, and there are retention policies where we might only be allowed to store certain docs for certain amounts of time for legal reasons. We want good telemetry and usage visibility around the database. And as with any vector, keyword, or filtering database, we want good performance, query flexibility, and scale, especially for all the different query patterns I mentioned before: you need exact matches, you want semantic matches, you want filters, and you might want to navigate the data agentically or in some dynamic way. All that flexibility is important to us at scale. And that's where LanceDB comes in.
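To make the retention-policy point concrete, here's a minimal sketch of purging documents that have outlived their legally allowed storage window. The document types, retention windows, and policy table are all invented for illustration; real policies would live in configuration and be enforced at the storage layer.

```python
from datetime import date, timedelta

# Hypothetical per-document-type retention policies (max days a doc may be stored).
# None means no retention limit applies.
RETENTION_DAYS = {"deal_room": 90, "litigation": 365, "knowledge_base": None}

def expired(doc_type: str, ingested: date, today: date) -> bool:
    """True if a document has outlived its retention window and must be purged."""
    days = RETENTION_DAYS.get(doc_type)
    if days is None:  # no retention limit for this type
        return False
    return today - ingested > timedelta(days=days)

docs = [("a", "deal_room", date(2025, 1, 1)),        # 151 days old: past 90-day window
        ("b", "knowledge_base", date(2020, 1, 1))]   # no limit: kept indefinitely
to_purge = [doc_id for doc_id, t, d in docs if expired(t, d, today=date(2025, 6, 1))]
```

The design point is that retention is a property of the data platform, not something each application re-implements, which is part of why it shows up on the infrastructure wish list.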
So, as I was saying, I work at LanceDB, and what we're delivering for AI goes beyond what I'd call just a vector database: it's what we call an AI-native multimodal lakehouse. Think back to Jerry's talk: in addition to search, you need a good foundation, a good platform, for all of the other tasks you need to do with your AI data. That can be feature extraction, generating summaries, generating text descriptions from images, and managing all that data, and you want to be able to do it all together. What you really need is a lakehouse architecture where all the data is stored in one place on object store: you can run search and retrieval workloads, you can run analytical workloads, you can train off of that data, and you can pre-process that data to iterate on new features to experiment with for your applications and models. Lakehouse architectures are generally good for those large batch offline use cases, but not necessarily for online serving, and this is where LanceDB's distributed architecture comes in. It's good for both offline and online contexts, so we can serve at massive scale from cloud object store, we deliver compute, memory, and storage separation, and we give you a simple API for sophisticated retrieval, whether you want to combine multiple vector columns, or combine vector and full-text search and then do re-ranking on top of that.
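One common way to fuse a vector result list with a full-text result list before re-ranking is reciprocal rank fusion (RRF). This is a generic sketch of the technique, not a claim about how LanceDB implements its hybrid search internally; the doc IDs are made up.

```python
def reciprocal_rank_fusion(result_lists: list, k: int = 60) -> list:
    """Fuse several ranked lists of doc IDs into one ranking.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so docs ranked highly by multiple retrievers float to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d5"]   # semantic nearest neighbors
fts_hits = ["d1", "d2", "d3"]      # full-text (keyword) matches
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
# "d1" wins: it ranks well in both lists.
```

The constant `k` damps the influence of top ranks so a single retriever can't dominate; 60 is the conventional default from the original RRF work. A cross-encoder re-ranker can then be applied to the fused list.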
Those are all available with an API in Python or TypeScript that, folks have told me, feels kind of like pandas or Polars: very familiar to data workers who are used to data-frame-style APIs. And for large tables we support GPU indexing; I think our record has been around three or four billion vectors in a single table, indexed in under two or three hours. All of that is to say that LanceDB excels at massive scale, and it happens at a fraction of the cost because of the compute-storage separation and because we take advantage of object store. I talked about having one place to put all of your AI data. This is the only database where you can put images, videos, and audio tracks next to your embeddings, next to your text data, next to your tabular and time-series data; you can put all of that in a single table. You can then use that table as the single source of truth for all the different workloads you want to run on that data, from search to analytics to training, and of course pre-processing and feature engineering. A lot of that is possible because of the open-source Lance format, which we built from the ground up. If you're working with multimodal data, whether it's documents, scanned PDFs, slides, or even large-scale videos, and you're doing that in, say, WebDataset or Iceberg plus Parquet, you're missing out on a lot: you get no random access, no support for large blob data, and inefficient schema evolution. The Lance format gives you all of those, so you can store all of your data in one place rather than split up across multiple parts.
And this is, I'd say, the foundational innovation in LanceDB. Without it, what we see a lot of AI teams doing is keeping different copies of different parts of their data in different places, and spending a lot of their time and effort just keeping those pieces glued together and in sync with each other. You can basically think of the Lance format as Parquet plus Iceberg plus secondary indices, but for AI data. That gives you fast random access, which is good for search and shuffle. It still gives you fast scans, which is good for analytics, data loading, and training. And it's the only format in this set that is uniquely good at storing blob data, or, more importantly, a mix of large blob data and small scalar data. By using Apache Arrow as the main interface, the Lance format is already compatible with your current data lake and lakehouse tools: you can use Spark and Ray to write very large amounts of Lance data in a distributed fashion very quickly, you can use PyTorch to load that data for training or fine-tuning, and you can certainly query it using tools like pandas and Polars. All right, back to me. I just wanted to share some general take-home messages about building RAG for these large-scale, domain-specific use cases. The first is that domain-specific challenges require very creative solutions around understanding the data and choosing your modeling and infrastructure accordingly, like I mentioned: understand the structure of your data, what the use cases are, and what the explicit and implicit query patterns are. Definitely spend time on that, work with domain experts, and immerse yourself in the domain as much as possible. The second is to make sure you're building for iteration speed and flexibility.
This is a very new technology and a very new industry, and a lot of things are changing: new tools are coming out, new paradigms, new model context windows, everything. So you want to set yourself up for flexibility and iteration speed, and you can ground that in evaluation: if you have good evaluation sets, procedures, or automation around them, you can iterate much faster and get good signal on whether your systems are accurate. So definitely invest time in evaluation to enable that iteration speed. And the third, which Chang covered, is that new data infrastructure has to recognize the new world we're entering: multimodal data, much heavier use of vectors and embeddings, very diverse workloads, and scale that is just going to keep getting larger as we try to ingest and query over all the data that exists, public and private. Thanks for listening to our talk.