Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)
Channel: aiDotEngineer
Published at: 2025-07-29
YouTube video id: W1MiZChnkfA
Source: https://www.youtube.com/watch?v=W1MiZChnkfA
All right, thank you everyone. We're excited to be here, and thank you for coming to our talk. My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years. I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of that data that doesn't fit neatly into pandas data frames. And I'm Calvin. I lead one of the teams at Harvey AI working on tough RAG problems across massive data sets of complex legal docs and complex use cases. So our talk is about some of the tough RAG problems on the legal frontier: the challenges, some solutions, and learnings from our experiences working on them together. We'll start roughly with how Harvey tackles retrieval and the types of problems there are, then the challenges that come up with that, around retrieval quality, scaling, security, all that good stuff, and then how we end up building a system with good infrastructure to support it. First, a quick intro to what Harvey is. We're a legal AI assistant. We sell our AI product to law firms to help them do all kinds of legal tasks: drafting, analyzing documents, going through legal workflows. A big part of that is processing data, so we handle data of all different volumes and forms. At the smallest scale, we have an assistant product with on-demand uploads, the same way you might upload to any AI assistant tool; that's a smaller, 1-to-50-document range. Then we have vaults, which are larger-scale project contexts.
So if there's a big deal the law firm is working on, or a data room where they need all their contracts, litigation documents, and emails in one place, that's a vault. The third and largest scale is data corpora, which are knowledge bases around the world: the legislation and case law of a particular country, and all the laws, taxes, and regulations that go with it. Some big challenges come with that. One is scale: very large amounts of data, and some of these documents are super long, dense, and packed with content. Sparse versus dense retrieval is a challenge I'm sure all of you deal with: how to represent the data, how to retrieve over it and index it. Query complexity is a big one; we get very difficult expert queries, and I'll show an example in the next slide. The data is very domain specific and complex; there are a lot of nitty-gritty legal details, so we have to work with domain experts and lawyers to understand it and translate that understanding into how we represent, index, query, and pre-process the data. Data security and privacy is another big one. A lot of this data is sensitive, for confidential deals, IPOs, financial filings, things like that, so we have to respect that for our clients. And then of course evaluation: how to make sure the systems are actually good. I'll show a quick demonstration of a retrieval-quality challenge, just on the query side. This is maybe the average complexity of a query someone might issue in our product; there are much more complex ones and simpler ones, but this is right in the middle. You can see there are a lot of different components that go into it. Well, to read it out:
"What is the applicable regime for covered bonds issued before 9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR?" So that's a handful. What goes into it: there's a semantic aspect. There's implicit filtering going on, like applicability before a certain date. There's a specialized data set being referenced, EU laws and directives. There are keyword matches on the specific regulation and directive IDs. It's multi-part in that it's asking how this applies to two different regulations, one directive and one article. And there's domain jargon: CRR is an abbreviation (the Capital Requirements Regulation; I had to look it up this morning). So it's very complex, and we need a system that can tackle all this complexity, break down this query, and use the appropriate technologies for its different parts. One common question we get in response to this complexity is: how do you evaluate your systems? How do you make sure they're good? That's actually where we spend a ton of time. It's not so much the algorithms and the fancy agentic techniques, but how to validate them. Investing in eval-driven development is a huge key to building these systems and making sure they're good, especially in a tough domain that you don't inherently know much about as an engineer or researcher. There's no silver-bullet eval, but we have a whole range of them, at different task depths and complexities. Along one dimension they're higher fidelity but more costly; in the other direction, they're more automated evals that are faster to iterate on.
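Circling back to the query breakdown above: the idea of splitting one expert query into semantic, keyword, filter, and routing components can be sketched in a few lines. This is a toy illustration, not Harvey's actual pipeline; the `DecomposedQuery` structure, the regexes, and the corpus-routing heuristic are all hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class DecomposedQuery:
    """Hypothetical breakdown of a legal query into retrieval components."""
    semantic_text: str                                 # full query, for embedding search
    keyword_terms: list = field(default_factory=list)  # exact-match IDs, e.g. directive numbers
    date_filters: list = field(default_factory=list)   # implicit metadata filters
    corpus: str = "default"                            # which specialized data set to route to

def decompose(query: str) -> DecomposedQuery:
    # Pull out regulation/directive identifiers like "2019/2162" or "Article 129"
    # for exact keyword matching.
    keywords = re.findall(r"\b\d{4}/\d{3,4}\b|\bArticle \d+\b", query)
    # Pull out explicit dates, which imply a filter on document applicability.
    dates = re.findall(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b", query)
    # Route to a specialized corpus when EU law is referenced (toy heuristic).
    corpus = "eu_law" if "EU" in query or "directive" in query.lower() else "default"
    return DecomposedQuery(semantic_text=query, keyword_terms=keywords,
                           date_filters=dates, corpus=corpus)

q = decompose("What is the applicable regime for covered bonds issued before "
              "9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR?")
```

Each component then feeds a different retrieval mechanism: the semantic text goes to vector search, the keyword terms to exact/full-text matching, and the date filters to metadata predicates.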
As an example, the high-fidelity end is expert reviews: having experts directly review outputs, analyze them, and write reports. That's super expensive but super high quality. Something in between is an expert-labeled set of criteria that you can evaluate synthetically or in some automated way; it's still expensive to curate, and maybe a little expensive to run, but more tractable. The third, and the fastest to iterate on, is more automated quantitative metrics: retrieval precision and recall, and more deterministic success criteria like "am I pulling documents from the right folder? Is it the right section? Do they have the right keywords in them?" Things like that. To give you a quick sense of the scale and complexity on the data side, not only the query side: the data sets we integrate with are pretty massive. As you can see, we support data sets across many different countries, and for each one there's complex filtering, organization, and categorization that goes into it. We work with domain experts for all of this, but also try to apply automation whenever possible, using their guidance to come up with heuristics or LLM processing techniques to categorize it all. The performance implications are pretty significant as well. We need very good performance both online and offline: online being querying, where you want good latency, and offline being ingestion, re-ingestion, and running ML experiments for different variations. Generally one of these corpora can be tens of millions of docs, and each document is often quite large, so it's pretty large scale. So I'll talk quickly about the infrastructure needed to support this.
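The "fast iteration" tier of evals mentioned above, deterministic retrieval metrics, is simple to implement. Here's a minimal sketch of precision and recall at k against an expert-labeled relevant set; the doc IDs are made up for illustration.

```python
def precision_recall_at_k(retrieved: list, relevant: set, k: int) -> tuple:
    """Of the top-k retrieved results, how many are relevant (precision@k),
    and what fraction of all relevant docs did we recover (recall@k)?"""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: a ranked retrieval run scored against an expert-labeled set.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_8"}
p, r = precision_recall_at_k(retrieved, relevant, k=5)  # p = 0.4, r = 2/3
```

Because these metrics are fully deterministic, they can run on every experiment or code change, which is exactly what makes them the fast-iteration end of the eval spectrum.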
At this scale, of course, we want infrastructure that's reliable and available for all our users at all times; I'm sure that's something all products need. We also want smooth onboarding and scaling, where our ML and data teams can focus on the business logic, on quality, and on spinning up new applications and products for customers, and not so much on the nitty-gritty details of the database, tuning it, or manually scaling it. Of course there's always some middle ground where you want awareness of it; it can't be fully 100% automated. Likewise, we need flexibility and capabilities around data privacy and data retention. Like I mentioned, some storage needs to be segregated depending on the customer and the use case, and there are retention policies where we might only be allowed to store certain docs for certain amounts of time for legal reasons. We want good telemetry and usage visibility around the database. And as with any vector, keyword, or filtering database, we want good performance, query flexibility, and scale, especially for all the different query patterns I mentioned before: you need exact matches, you want semantic matches, you want filters, and you might want to navigate the data agentically or in some dynamic way. All that flexibility is important to us at scale. And that's where LanceDB comes in.
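To make the retention-policy point concrete, here's a minimal sketch of purging documents that have outlived their legally allowed storage window. The document types, retention windows, and policy table are all invented for illustration; real policies would live in configuration and be enforced at the storage layer.

```python
from datetime import date, timedelta

# Hypothetical per-document-type retention policies (max days a doc may be stored).
# None means no retention limit applies.
RETENTION_DAYS = {"deal_room": 90, "litigation": 365, "knowledge_base": None}

def expired(doc_type: str, ingested: date, today: date) -> bool:
    """True if a document has outlived its retention window and must be purged."""
    days = RETENTION_DAYS.get(doc_type)
    if days is None:  # no retention limit for this type
        return False
    return today - ingested > timedelta(days=days)

docs = [("a", "deal_room", date(2025, 1, 1)),        # 151 days old: past 90-day window
        ("b", "knowledge_base", date(2020, 1, 1))]   # no limit: kept indefinitely
to_purge = [doc_id for doc_id, t, d in docs if expired(t, d, today=date(2025, 6, 1))]
```

The design point is that retention is a property of the data platform, not something each application re-implements, which is part of why it shows up on the infrastructure wish list.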
So, as I was saying, I work at LanceDB, and what we're delivering for AI goes beyond what I'd call just a vector database: it's what we call an AI-native multimodal lakehouse. Think back to Jerry's talk: in addition to search, you need a good foundation, a good platform, for all of the other tasks you need to do with your AI data. That can be feature extraction, generating summaries, generating text descriptions from images, and managing all that data, and you want to be able to do it all together. What you really need is a lakehouse architecture where all the data is stored in one place on object store: you can run search and retrieval workloads, you can run analytical workloads, you can train off of that data, and you can pre-process that data to iterate on new features to experiment with for your applications and models. Lakehouse architectures are generally good for those large batch offline use cases, but not necessarily for online serving, and this is where LanceDB's distributed architecture comes in. It's good for both offline and online contexts, so we can serve at massive scale from cloud object store, we deliver compute, memory, and storage separation, and we give you a simple API for sophisticated retrieval, whether you want to combine multiple vector columns, or combine vector and full-text search and then do re-ranking on top of that.
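One common way to fuse a vector result list with a full-text result list before re-ranking is reciprocal rank fusion (RRF). This is a generic sketch of the technique, not a claim about how LanceDB implements its hybrid search internally; the doc IDs are made up.

```python
def reciprocal_rank_fusion(result_lists: list, k: int = 60) -> list:
    """Fuse several ranked lists of doc IDs into one ranking.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so docs ranked highly by multiple retrievers float to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d5"]   # semantic nearest neighbors
fts_hits = ["d1", "d2", "d3"]      # full-text (keyword) matches
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
# "d1" wins: it ranks well in both lists.
```

The constant `k` damps the influence of top ranks so a single retriever can't dominate; 60 is the conventional default from the original RRF work. A cross-encoder re-ranker can then be applied to the fused list.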
Those are all available with an API in Python or TypeScript that, folks have told me, feels kind of like pandas or Polars: very familiar to data workers who are used to data-frame-style APIs. And for large tables we support GPU indexing; I think our record has been around three or four billion vectors in a single table, indexed in under two or three hours. All of that is to say that LanceDB excels at massive scale, and it happens at a fraction of the cost because of the compute-storage separation and because we take advantage of object store. I talked about having one place to put all of your AI data. This is the only database where you can put images, videos, and audio tracks next to your embeddings, next to your text data, next to your tabular and time-series data; you can put all of that in a single table. You can then use that table as the single source of truth for all the different workloads you want to run on that data, from search to analytics to training, and of course pre-processing and feature engineering. A lot of that is possible because of the open-source Lance format, which we built from the ground up. If you're working with multimodal data, whether it's documents, scanned PDFs, slides, or even large-scale videos, and you're doing that in, say, WebDataset or Iceberg plus Parquet, you're missing out on a lot: you get no random access, no support for large blob data, and inefficient schema evolution. The Lance format gives you all of those, so you can store all of your data in one place rather than split up across multiple parts.
And this is, I'd say, the foundational innovation in LanceDB. Without it, what we see a lot of AI teams doing is keeping different copies of different parts of their data in different places, and spending a lot of their time and effort just keeping those pieces glued together and in sync with each other. You can basically think of the Lance format as Parquet plus Iceberg plus secondary indices, but for AI data. That gives you fast random access, which is good for search and shuffle. It still gives you fast scans, which is good for analytics, data loading, and training. And it's the only format in this set that is uniquely good at storing blob data, or, more importantly, a mix of large blob data and small scalar data. By using Apache Arrow as the main interface, the Lance format is already compatible with your current data lake and lakehouse tools: you can use Spark and Ray to write very large amounts of Lance data in a distributed fashion very quickly, you can use PyTorch to load that data for training or fine-tuning, and you can certainly query it using tools like pandas and Polars. All right, back to me. I just wanted to share some general take-home messages about building RAG for these large-scale, domain-specific use cases. The first is that domain-specific challenges require very creative solutions around understanding the data and choosing your modeling and infrastructure accordingly, like I mentioned: understand the structure of your data, what the use cases are, and what the explicit and implicit query patterns are. Definitely spend time on that, work with domain experts, and immerse yourself in the domain as much as possible. The second is to make sure you're building for iteration speed and flexibility.
This is a very new technology and a very new industry, and a lot of things are changing: new tools are coming out, new paradigms, new model context windows, everything. So you want to set yourself up for flexibility and iteration speed, and you can ground that in evaluation: if you have good evaluation sets, procedures, or automation around them, you can iterate much faster and get good signal on whether your systems are accurate. So definitely invest time in evaluation to enable that iteration speed. And the third, which Chang covered, is that new data infrastructure has to recognize the new world we're entering: multimodal data, much heavier use of vectors and embeddings, very diverse workloads, and scale that is just going to keep getting larger as we try to ingest and query over all the data that exists, public and private. Thanks for listening to our talk.