AI Engineer World’s Fair 2025 - Retrieval + Search
Channel: aiDotEngineer
Published at: 2025-06-05
YouTube video id: a0TyTMDh1is
Source: https://www.youtube.com/watch?v=a0TyTMDh1is
Awesome. Love it. All righty. ...do operations on top of the files, and it could be structured querying, right? Querying a more structured database to get aggregate insights over the types of data that you've extracted.

One top consideration when actually building this type of toolbox is complex documents. For those of you who follow our socials, we talk a lot about this issue: a lot of human knowledge lives in really complicated PDFs, and other formats too. Embedded tables, charts, images, irregular layouts, headers, footers. This is typically stuff designed for human consumption, not machine consumption, and if the documents are not processed correctly, then no matter how good your LLM is, it will fail.

We were probably one of the first to realize that LLMs and LVMs could be used for document understanding. In contrast to more traditional techniques, where you use hand-tuned, task-specific ML models to parse one specific class of documents, LLMs have a much more general level of accuracy that you can use to your advantage in understanding and inhaling any type of document, at any level of complexity. Obviously the baseline these days is that you can just screenshot a PDF and feed it into ChatGPT or Claude. That doesn't actually give you amazing accuracy, but it's a good start. One of the secret-sauce magic tricks we found was figuring out how to interleave LLMs and LVMs with more traditional parsing techniques, and to spend test-time tokens on agentic validation and reasoning, to get to a higher level of accuracy.

So we have a cloud service that does document parsing, and it's a core step of this document toolbox. We benchmarked our modes, where we adapt Sonnet 3.5 and 4.0, Gemini 2.5 Pro, and GPT-4.1 from OpenAI, and it basically outperforms all existing parsing benchmarks and tools out there, from open source to proprietary.

Some of you might know us as a RAG framework; that's basically how we started. For those of you who don't, we have a managed platform that is basically this AI-native document toolbox. It contains a lot of the operations you need to do on top of your docs, like document parsing and document extraction; it uses some of those capabilities I just mentioned and lets you parse, extract, and index data for the whole set of tools I just mentioned.
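For concreteness, here is a minimal sketch of what calling a hosted parsing service like this can look like, using the llama_parse Python client. The file name and settings are illustrative, not from the talk; check the current docs for exact parameters.

```python
# Minimal sketch: parsing a complex PDF with LlamaParse (LlamaIndex's
# hosted parsing service). File name and settings are illustrative.
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",       # from cloud.llamaindex.ai
    result_type="markdown",  # preserves tables and layout better than plain text
)

# Returns a list of parsed Document objects, one per input file
documents = parser.load_data("10k_filing_with_embedded_tables.pdf")
print(documents[0].text[:500])
```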
One special release I want to highlight today, which we announced in a blog post a few hours ago, is Excel capabilities to complement this document toolbox. A lot of knowledge work happens in Microsoft Excel, and also Google Sheets and Numbers; basically, it's spreadsheets, right? But it's been unsolved by LLMs. If you look at the document on the right, neither RAG nor text-to-CSV techniques will actually work over it, because it's not really a structured 2D table; there are gaps in the rows and gaps in the columns. So we built an Excel agent that's capable of taking unnormalized Excel spreadsheets and transforming them into a normalized 2D format, and it also lets you do agentic QA over both the unnormalized and normalized versions of the spreadsheet. It's a pretty cool capability.

It complements our toolbox of more traditional document parsing, extraction, and indexing, and it's available in early preview. If you take a look at the video, which is also in our blog post: we uploaded that example synthetic dataset, transformed it into a 2D table, and you can also ask questions over it to get insights. It's really doing the heavy lifting of deeply understanding the semantic structure of the Excel spreadsheet, and then plugging that in as specialized tools to an AI agent. The best baseline here is not really RAG or text-to-CSV; those both suck. It's really an LLM that can write code, so an LLM with a code interpreter tool is a reasonable baseline, and that gets you to 70-75% accuracy over a private dataset of synthetic Excel sheets. We were able to get this up to 95%, which actually surpasses the human baseline of 90%, that is, a human trying to do the data transformation by hand.

A brief note on how it works. It's a little technical, and more details are in the blog post. First we do structural understanding of the Excel spreadsheet, with a little bit of reinforcement learning: we adapt dynamically to the specific format of the document and learn a semantic map of the sheet. Having learned that semantic map, we can translate it into a set of specialized tools that we provide to an agent. From an abstract perspective, an agent could just write code from scratch, and as LLMs get better, that will certainly become a higher-performing baseline. But in the meantime, we're helping it out by providing a set of specialized tools over the semantic map, so it can reason over the Excel spreadsheet.
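To make the semantic-map idea concrete, here is a purely hypothetical sketch; the real agent's internals and APIs are not public, so every name below is invented.

```python
# Illustrative sketch only: the real agent and its APIs are not public.
# The idea: a learned "semantic map" of sheet regions becomes a set of
# narrow tools the agent can call, instead of writing pandas code from scratch.
from dataclasses import dataclass

@dataclass
class SheetRegion:
    name: str          # e.g. "quarterly_revenue"
    cell_range: str    # e.g. "B4:F12"
    headers: list[str]

semantic_map = [
    SheetRegion("quarterly_revenue", "B4:F12", ["quarter", "region", "revenue"]),
    SheetRegion("headcount", "H4:J9", ["team", "year", "count"]),
]

def read_region(region_name: str) -> list[dict]:
    """Tool: return one region of the sheet as normalized rows."""
    ...  # resolve cell_range against the workbook, apply headers

def aggregate(region_name: str, column: str, op: str) -> float:
    """Tool: run sum/mean/max over one column of a region."""
    ...

# These functions would be registered as tools on an agent, which then
# answers questions over the unnormalized sheet by calling them.
```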
Great. The next piece: we talked about a document toolbox, and a lot of operations that make that toolbox really good and comprehensive. So now that you've plugged it into an agent, what are the different agent architectures, and what use cases are implied by them?

As many of you probably know from building agents yourselves, agent orchestration ranges from more constrained architectures to unconstrained ones. Constrained means you more explicitly define the control flow. Unconstrained is a ReAct loop, function calling, CodeAct, whatever: you give the model a set of tools and let it run. Deep research is basically the same thing.

For us, we've noticed there are two main categories of UXs. First, there are assistant-based UXs that surface information, helping a human find information or produce some unit of knowledge work, usually through a chat-based interface. It's usually chat-oriented, the input is natural language, and the architecture is a little more unconstrained, basically a ReAct loop over some set of tools, but with a higher degree of human in the loop: the expectation is that the human guides and coaxes the agent along the steps of the process to achieve the task at hand. I'm sure many of you have built these types of use cases, so this is just a small subset, but it's basically a generalization of a RAG chatbot.

There's a second category of use cases that I think is interesting, and that a lot of folks are starting to build toward, which is the automation interface. Instead of providing an assistant or copilot to help a human get more information, you're processing routine tasks in a multi-step, end-to-end manner, and the architecture is usually a little different. It takes in some batch of inputs; it can run in the background or be triggered ad hoc by a human; the architecture is more constrained, which makes sense, because if you want this thing to run end to end, you need it to not go off the rails; there's usually less human in the loop at every step, with some sort of batch review at the end; and the output is structured results: integration with APIs, and decision-making that, after approval, routes to downstream systems. Use cases here include financial data normalization, data sheet extraction, invoice reconciliation, contract review, and more. I'll skip this video, but there are some fun examples of community open-source repos we built in this area, like the invoice reconciler by Laurie Voss.

A general pattern that has emerged, and that we've noticed, is that automation agents can often serve as a backend: they run in the background, they do the data ETL and transformation, still with a human in the loop, but they do the work of processing and structuring a lot of data and making decisions in the background. Assistant agents are more front-end facing. So automation agents can structure and process your data and provide the right tool interfaces for assistant agents. Not every tool depends on agentic reasoning, but for a lot of these use cases, like a very generalized data pipeline where you're processing a lot of unstructured context, you might have automation agents go in and process your data and provide the right tools for a more user-facing research interface. A sketch of what one of these constrained automation pipelines might look like is below.
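As a rough illustration of the constrained automation pattern, here is one shape such a pipeline could take, using invoice reconciliation as the example; the schema, steps, and business rule are invented, not from the talk.

```python
# Sketch of the constrained, end-to-end automation pattern described above:
# an explicit pipeline rather than a free-running agent loop.
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total: float

def parse(path: str) -> str: ...           # document parsing step
def extract(text: str) -> Invoice: ...     # LLM structured extraction into the schema
def matches_po(inv: Invoice) -> bool: ...  # business rule, e.g. reconcile against a PO

def run_batch(paths: list[str]) -> tuple[list[Invoice], list[Invoice]]:
    approved, needs_review = [], []
    for path in paths:
        invoice = extract(parse(path))
        # Explicit control flow: the agent can't go off the rails, and a
        # human reviews the flagged batch at the end rather than every step.
        (approved if matches_po(invoice) else needs_review).append(invoice)
    return approved, needs_review
```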
So, we talked about building a document toolbox and about the different categories of agentic architectures. Putting it together, here are some real-world use cases of document agents: examples of agents that actually help automate different types of knowledge work. One of our favorite examples is a combination of both automation and assistant UXs for financial due diligence. Carlyle is one of our favorite customers and partners. They used some of our core capabilities to build an end-to-end leveraged-buyout agent. It requires an automation interface to inhale massive amounts of unstructured public and private financial data, Excel sheets, PDFs, PowerPoints, which go through bespoke extraction algorithms with human-in-the-loop review. Then, once that data is structured in the right format, it provides a copilot interface for the analyst teams to both get insights and generate reports over that data.

Any enterprise search use case typically falls within the assistant UX. Cemex is one of our favorite customers in this space: defining a lot of different collections over different sources of data and providing task-specific, specialized agentic RAG chatbots over your data. It's basically RAG, but you add an agentic reasoning layer on top so you can break down user queries, do research, and answer the question at hand.

On the pure automation UX side, we've noticed a lot of use cases popping up around automation and efficiency. One example is technical data sheet ingestion. We're working with a global electronics company that has a lot of data sheets that need to be automatically processed and reviewed, and historically that has taken a lot of human effort. By creating the right end-to-end automation agent, you can encode the business-specific logic for parsing these types of documents, extracting the right pieces of information, matching them against specific rules, and outputting the structured data into SQL. There's human-in-the-loop review, but being able to do this end to end transforms weeks of technical-writer work into an automated extraction interface.

So that's basically it. For those of you who are less familiar, LlamaIndex is the most accurate, customizable platform for automating your document workflows with agentic AI. Our mission statement has evolved a little over the past few years, from a very broad, horizontal framework often focused on RAG. If you're interested in any of these capabilities, come talk to us, and please check us out at booth G1. Thank you. [Applause]

All right, thank you very much, Jerry from LlamaIndex. Next up we have Chang and Calvin: Chang from LanceDB and Calvin from Harvey AI, on "Scaling Enterprise-grade RAG Systems: Lessons from the Legal Frontier." Give a big, warm welcome to these two.

All right, everybody hear us? Okay, sounds like this is good. Thank you everyone; we're excited to be here, and thank you for coming to our talk. My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years; I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of that data that doesn't fit neatly into those pandas DataFrames.

I'm Calvin. I lead one of the teams at Harvey working on tough RAG problems across massive datasets of complex legal docs and complex use cases. Our talk is about (one sec, maybe we should have used the other clicker; all right, we'll use the laptop) some of the tough RAG problems on the legal frontier: the challenges, some solutions, and learnings from our experience working together on it. We'll start roughly with how Harvey tackles retrieval, the types of problems there are and the challenges that come with that, retrieval quality, scaling, security, all that good stuff, and then how we end up creating a system with good infrastructure to support it. So first of all, a quick intro to what Harvey is: we're a legal AI assistant.
We sell our AI product to law firms to help them do all kinds of legal tasks: drafting, analyzing documents, going through legal workflows. A big part of that is processing data, and we handle data in all different volumes and forms. The different scales of that: we have an assistant product with on-demand uploads, the same way you might upload on demand to any AI assistant tool, so that's a smaller, one-to-50-document range. We have Vaults, which are larger-scale project contexts: if there's a big deal the law firm is working on, or a data room where they need all their contracts, litigation documents, and emails in one place, that's a Vault. And the third and largest scale is data corpora, which are knowledge bases about the world: the legislation and case law of a particular country, all the laws, taxes, and regulations that go into it.

Some big challenges come with that. One is scale: just very large amounts of data, and some of these documents are super long, dense, and packed with content. Sparse versus dense retrieval is, I'm sure, a challenge all of you deal with: how to represent the data, and how to retrieve over it and index it. Query complexity is a big one; we get very difficult expert queries, and I'll show an example on the next slide. The data is very domain-specific and complex, with a lot of nitty-gritty legal details, so we have to work with domain experts and lawyers to understand it and translate that into how we represent the data and how we index, query, and pre-process over it. Data security and privacy is a big one: a lot of this data is sensitive, for confidential deals or IPOs, financial filings, things like that, and we have to respect all of that for our clients. And then of course evaluation: how to make sure the systems are actually good.

I'll show a quick demonstration of a retrieval-quality challenge, on the query side. This is maybe the average complexity of a query someone might issue in our product; there are much more complex ones and simpler ones, but this is right in the middle, and you can see there are a lot of different components that go into it. To read it out: "What is the applicable regime for covered bonds issued before 9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR?" That's a handful. What goes into it: there's a semantic aspect. There's implicit filtering going on, like applicability before a certain date. There's a specialized dataset being referenced, which is EU laws and directives. There are keyword matches, like the specific regulation and directive IDs. It's multi-part, in that it's asking how this applies under two different regulations, one directive and one article. And there's domain jargon: CRR is an abbreviation (the Capital Requirements Regulation; I looked it up this morning). So this is very complex, and we need a system that can tackle all this complexity, break the query down, and use the appropriate technologies for the different parts of it. A hypothetical sketch of that kind of breakdown follows.
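Purely as an illustration (none of this is Harvey's actual code), here is one way to represent that breakdown as a structured query plan, separating the semantic, filter, and keyword components so each can be handled by the right retrieval technique:

```python
# Illustrative only: a structured "query plan" for the kind of expert
# query shown above. All field names and values are invented.
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    semantic_query: str                                     # dense / embedding retrieval
    keyword_terms: list[str] = field(default_factory=list)  # exact keyword matches
    filters: dict = field(default_factory=dict)             # metadata filters
    corpus: str = "default"                                 # specialized dataset

plan = QueryPlan(
    semantic_query="applicable regime for covered bonds issued before the cutoff date",
    keyword_terms=["Directive (EU) 2019/2162", "Article 129", "CRR"],
    filters={"issued_before": "2022-07-09", "jurisdiction": "EU"},
    corpus="eu_legislation",
)
```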
One common question we get in response to this complexity is: how do you evaluate your systems? How do you make sure they're good? That's actually where we spend a ton of time. It's not so much on the algorithms and the fancy agentic techniques, but more on how to validate them. I'd say investing in eval development is a huge key to building these systems and making sure they're good, especially in a tough domain that you don't inherently know much about as an engineer or researcher.

There's no silver-bullet eval; we have a whole range of them, at different task depths and complexities. Along one dimension they're higher fidelity but more costly, and in the other direction they're more automated evals that are faster to iterate on. As an example, the high-fidelity end is expert reviews: having experts directly review outputs, analyze them, and write reports. That's super expensive but super high quality. Something in between is an expert-labeled set of criteria that you can evaluate synthetically or in some automated way; it's still expensive to curate and maybe a little expensive to run, but more tractable. And the third, the fastest to iterate on, is more automated, quantitative metrics: retrieval precision and recall, and more deterministic success criteria, like: am I pulling documents from the right folder? Is it the right section? Do they have the right keywords in them? Things like that.
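A tiny sketch of that fastest tier, assuming you have labeled relevant documents for each query; the doc IDs here are made up:

```python
# Deterministic retrieval metrics against labeled relevant documents.
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. "did we pull documents from the right folder / section?"
p, r = precision_recall(
    retrieved=["contracts/msa_2024.pdf", "emails/thread_17.eml"],
    relevant={"contracts/msa_2024.pdf", "contracts/amendment_3.pdf"},
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```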
To give you a quick sense of the scale and complexity on the data side, not only the query side: the datasets we integrate with are pretty massive. As you can see, we support datasets across many different countries, and for each one there's complex filtering, organization, and categorization that goes into it. We work with domain experts on all of this, but we also try to apply automation wherever possible, using their guidance to come up with heuristics or LLM processing techniques to categorize it all. The performance implications are pretty significant as well. We need very good performance both online and offline: online being querying over this data, where you want good latency, and offline being ingestion, re-ingestion, and running ML experiments over different variations. Generally one of these corpora can be tens of millions of docs, so it's a pretty large scale, and each document is often quite large.

So let me talk quickly about the infrastructure needed to support this. At this scale, of course, we want infrastructure that's reliable and available for all our users at all times; I'm sure that's something all products need. We also want smooth onboarding and scaling: we definitely want our ML and data teams to focus on the business logic and the quality, and on spinning up new applications and products for customers, and not think too much about the nitty-gritty details of the database, tuning it, or manually scaling it. Of course there's always something in between, where you want to keep some awareness of it; it can't be fully 100% automated.

Likewise, we need flexibility and capabilities around data privacy and data retention. Like I mentioned, some storage needs to be segregated depending on the customer and the use case, and there are retention policies on some docs that we might only be allowed to store for certain amounts of time, for legal reasons. We want good telemetry and usage reporting around the database. And then, as with any vector or keyword-filtering database, we want good performance, query flexibility, and scale, especially for all the different query patterns I mentioned before: you need exact matches, you want semantic matches, you want filters, and you might want to navigate the data agentically or in some dynamic way. So all that flexibility is important to us at scale, and that's where LanceDB comes in.

Cool, thank you. Awesome. Okay, I'm going to try to hold this here so there's no echo. So, as I was saying: I work at LanceDB, and what we're delivering for AI is beyond what I'd call just a vector database; it's what we call an AI-native multimodal lakehouse. Think back to Jerry's talk: in addition to search, you also need a good foundation, a good platform, for all of the other tasks you need to do with your AI data. This can be feature extraction, generating summaries, generating text descriptions from images, and managing all of that data, and you want to be able to do it all together. What you really need is a lakehouse architecture where all the data can be stored in one place on object store. You can run search and retrieval workloads, you can run analytical workloads, you can train off of that data, and of course you can pre-process that data to iterate on new features to experiment with for your applications and models.

Lakehouse architectures are generally good for those large, batch, offline use cases, but not necessarily for online serving, and this is where LanceDB's distributed architecture comes in: it's good for both offline and online contexts. We can serve at massive scale from cloud object store; we deliver compute, memory, and storage separation; and we give you a simple API for sophisticated retrieval, whether you want to combine multiple vector columns, or vector and full-text search, and then do reranking on top of that. Those are all available through an API in Python or TypeScript that, folks have told me, feels kind of like pandas or Polars: very familiar to data workers used to DataFrame-style APIs. And for large tables we support GPU indexing; I think our record has been something like three or four billion vectors in a single table, indexed in under two or three hours. All of that is to say that LanceDB excels at massive scale, and at a fraction of the cost, because of the compute-storage separation and because we take advantage of object store.

And I talked about having one place to put all of your AI data. This is the only database where you can put images, videos, and audio tracks next to your embeddings, next to your text data, next to your tabular and time-series data, all in a single table.
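As a sketch of the Python API mentioned a moment ago, patterned on LanceDB's documented embedding registry and hybrid search (verify the signatures against the current docs; the embedding model choice is just an example):

```python
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

# Register an embedding function so text is vectorized on ingest.
model = get_registry().get("openai").create(name="text-embedding-3-small")

class Doc(LanceModel):
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

db = lancedb.connect("./lancedb")
tbl = db.create_table("docs", schema=Doc)
tbl.add([{"text": "Covered bonds issued before 9 July 2022 ..."}])

tbl.create_fts_index("text")  # BM25 full-text index alongside the vectors

# Hybrid query: dense vector search + full-text search, fused and reranked.
results = (
    tbl.search("applicable regime for covered bonds", query_type="hybrid")
       .limit(5)
       .to_pandas()
)
```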
That table can then serve as the single source of truth for all the different workloads you want to run on that data, from search to analytics to training, and of course pre-processing and feature engineering. A lot of that is possible because of the open-source Lance format, which we built from the ground up. If you're working with multimodal data, whether it's documents, PDF scans, slides, or even large-scale video, and you're doing that in, say, WebDataset or Iceberg on Parquet, you're missing out on a lot: you lack random access, you can't store large blob data well, and schema evolution isn't very efficient. The Lance format gives you all of those, so you can store all of your data in one place rather than split across multiple systems. This is, I'd say, the foundational innovation in LanceDB. Without it, what we see a lot of AI teams doing is keeping different copies of different parts of their data in different places, and spending a lot of their time and effort just keeping those pieces glued together and in sync with each other.

You can basically think about the Lance format as Parquet plus Iceberg plus secondary indices, but for AI data. That gives you fast random access, which is good for search and shuffling; it still gives you fast scans, which is good for analytics and for data loading in training; and it's the only one in this set that is uniquely good at storing blob data, or more importantly, a mix of large blob data and small scalar data. And by using Apache Arrow as the main interface, the Lance format is already compatible with your current data lake and lakehouse tools: you can use Spark and Ray to write very large amounts of Lance data in a distributed fashion very quickly, you can use PyTorch to load that data for training or fine-tuning, and you can certainly query it with tools like pandas and Polars.

All right, I'll take it back. I just want to share some general take-home messages about building RAG for these large-scale, domain-specific use cases. The first is that domain-specific challenges require very creative solutions around understanding the data, and around choosing the modeling and infrastructure to match, like I mentioned about trying to understand the structure of your data, what the use cases are, and what the explicit and implicit query patterns are. So definitely spend time on that, work with domain experts, and immerse yourself in the domain as much as possible. The second is to make sure you're building for iteration speed and flexibility. This is a very new technology and a very new industry, and a lot of things are changing: new tools are coming out, new paradigms, new model context windows, everything. So you want to set yourself up for flexibility and iteration speed, and you can ground that in evaluation: if you have good evaluation sets, procedures, or automation around them, you can iterate much faster and get good signal on whether your systems are good and accurate. So definitely invest time in evaluation to enable that iteration speed.
And the third, which Chang covered, is that new data infrastructure has to recognize that we're entering a new world with multimodal data, much heavier vector and embedding workloads, very diverse workloads, and a scale that is just going to keep getting larger as we try to ingest and query over all the data that exists, public and private. Thanks for listening to our talk. You can visit us at harvey.ai; come talk to us or work with us, we're hiring if you're interested. Thanks everyone.

Awesome, thank you so much, Chang and Calvin, really exciting stuff. I think Harvey is changing the legal space, and Lance is allowing you guys to do that at scale. So thank you so much. Alrighty. Okay, next up we have Julia Neagu and Deanna Emery, both from Quotient AI, along with Mor Asher from Tavily (sorry, I always mess that one up). We're super excited for these three speakers as well. This one will be about evaluating AI search: frameworks for search-augmented AI systems. So give them a big, warm round of applause.

Hi everyone, thank you so much for coming. My name is Julia; I'm CEO and co-founder of Quotient AI. I'm Deanna Emery, founding AI researcher at Quotient AI. And today we're going to talk to you about evaluating AI search.

Let me start with a fundamental challenge we're all facing in AI today: traditional monitoring approaches simply aren't keeping up with the complexity of modern AI systems. First off, these systems are dynamic. Unlike traditional software, AI agents operate in constantly changing environments. They're not just executing predetermined logic; they're making real-time decisions based on evolving web content, user interactions, and complex tool chains. These systems can also have multiple failure modes that happen at the same time: they hallucinate, retrieval fails, they make reasoning errors, and all of these are interconnected.

A little bit about what we do at Quotient: we monitor live AI agents. We have expert evaluators that can detect objective system failures without waiting on ground-truth data, human feedback, or benchmarks. A year ago, we met Rotem, Tavily's founder and CEO, and he posed us a problem that really crystallized the core issues we needed to solve. Here's the challenge: how do you build production-ready AI search agents when your system will be dealing with two fundamental sources of unpredictability that you cannot proactively control? Under the hood, Tavily's agents gather their context by searching the web, and the web is not static. Traditional benchmarks assume stable ground truth, but when you're dealing with real-time information, ground truth itself is a moving target. Your users also don't stick to your test cases: they ask odd, malformed questions, and they have implicit context they don't share and you're not aware of. And this is not just a theoretical problem. Tavily processes hundreds of millions of search requests for its agents in production, and they needed a solution that worked at scale in these real-world conditions. This is the story of how we built that.

Yes. So at Tavily, we're building the infrastructure layer for agent interaction at scale, essentially providing language models with real-time data from across the web.
There are many use cases where real-time AI search delivers value, and these are just a few examples of how our clients use Tavily to power their applications: from a CLM company that built an AI legal assistant to power their legal and business teams with instant case insights, to a sports news outlet that created a hybrid RAG chat agent that delivers scores, games, and news updates, to a credit card company that uses real-time search to fight fraud by pinpointing merchant locations.

As you can imagine, evaluating a system in this kind of vast, fast-moving setting is quite challenging. We have two principles that guide our evaluation. First, the web, which is the foundation of our data, is constantly changing, which means our evaluation methods must keep up with that ongoing change. Second, truth is often subjective and contextual. Evaluating correctness can be tricky, because what's right may depend on the source, the timing, or the user's needs. So we have a responsibility to design our evaluation methods to be as unbiased and fair as possible, even when absolute truth is hard to pin down.

The first thing to think about in offline evaluation is which data to use to evaluate your system. Static datasets are a great start, and there are many widely available open-source datasets out on the web. SimpleQA is one example: a benchmark and dataset from OpenAI that serves as a standard for evaluating retrieval accuracy; many leading AI search providers use SimpleQA to evaluate their performance. SimpleQA is designed to evaluate a system's ability to answer short, fact-seeking questions that have a single empirical answer. Another widely adopted dataset is HotpotQA, which evaluates a system's ability to answer multi-hop questions, where reasoning across multiple documents is required to retrieve the final answer.

Datasets like SimpleQA and HotpotQA are a great start for evaluating your system. But what happens when you're evaluating real-time systems, especially when measuring whether your system keeps up with rapidly evolving information while avoiding regressions, which is where we operate? Those static datasets also don't address the challenge of benchmarking questions where there's no single true answer, or where subjectivity is involved. This is what led us to think beyond static datasets, toward dynamic evaluation that reflects the changing pace of the web. Dynamic datasets are essential for benchmarking RAG in real-world production systems: you can't answer today's questions with yesterday's data. Dynamic datasets have real-world alignment. They have broad coverage, since you can easily create eval sets for any domain or use case relevant to your specific needs. And they ensure continuous relevancy, because they're regularly refreshed, which means your system is always evaluated against the latest data.

This led us to build an open-source agent that builds dynamic eval sets for web-based RAG systems. It's open source, and we encourage everyone to check it out and contribute. I also want to acknowledge the work of our head of data at Tavily, who initiated this project a couple of months ago. You can see here an example of a dataset generated by the agent: it generates question-and-answer pairs for targeted domains using information found on the web.
The agent leverages the LangGraph framework and consists of these key steps. First, it generates broad web-search queries for targeted domains, which essentially lets you create eval sets for any domain of your choice and any specific need of your application. The second step is to aggregate grounding documents from multiple real-time AI search providers. We understood that we can't just use Tavily to search the web on specific domains, find grounding documents, generate question-and-answer pairs from those documents, and then evaluate our own performance against them. That's why we use multiple real-time AI search providers: to both maximize coverage and minimize bias. The third step, which is the key step in this process, is to generate the evidence-based question-and-answer pairs. We ensure that in the generation process the agent is obliged to generate answer context, which increases the reliability of our question-and-answer pairs and reduces hallucinations: you can always go back and check which sources, and which evidence from those sources, were used to generate each question-and-answer pair. And lastly, we use LangSmith to track our experiments; it's a great observability tool for managing these offline evaluation runs and seeing how you perform at different time steps. A condensed sketch of these steps is below.
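A condensed sketch of those steps; the real project is open source and built on LangGraph, but the function names here are ours and the internals are elided:

```python
# Illustrative skeleton of the dynamic eval-set agent's pipeline.
def generate_queries(domain: str, n: int = 5) -> list[str]:
    """Step 1: LLM generates broad web-search queries for the target domain."""
    ...

def gather_grounding_docs(query: str) -> list[dict]:
    """Step 2: fan out to multiple real-time search providers (not just
    Tavily) to maximize coverage and minimize self-evaluation bias."""
    ...

def qa_pair_from_docs(docs: list[dict]) -> dict:
    """Step 3: generate an evidence-based QA pair. The answer must carry
    its supporting context, so each pair can be traced back to sources."""
    return {
        "question": "...",
        "answer": "...",
        "evidence": [d["url"] for d in docs],
    }

def build_eval_set(domain: str) -> list[dict]:
    pairs = []
    for query in generate_queries(domain):
        docs = gather_grounding_docs(query)
        pairs.append(qa_pair_from_docs(docs))
    return pairs  # Step 4: log runs to LangSmith for experiment tracking
```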
The next steps we want to address: supporting a range of question types, both simple fact-based questions and multi-hop questions similar to HotpotQA. We also want to ensure freshness, fairness, and coverage by proactively addressing bias and covering a wide range of perspectives for each subject we generate questions and answers about. Additionally, we want to add a supervisor node for coordination, which has proven valuable in these multi-agent architectures and will increase the quality of our question-and-answer pairs.

The next thing to think about is benchmarking, and we'd argue that while it's important to measure accuracy, you should not stop there. You should build a holistic evaluation framework with benchmarks that, in our case, measure source diversity, source relevancy, and hallucination rates. It's also important to leverage unsupervised evaluation methods that remove the need for labeled data, which lets you scale your evaluations and address the subjectivity issue. With that, I'll pass it over to Deanna, who will explain more about these reference-free benchmarks and share results from an experiment we ran using a static dataset and a dynamic dataset generated by the agent I just described.

So, we performed a two-part evaluation of six different AI search providers. The first component of this experiment was to compare the accuracy of the search providers on a static and a dynamic benchmark, in order to demonstrate that static benchmarking is not a comprehensive method for evaluating AI search. The second component was to evaluate the dynamic-dataset responses using reference-free metrics and compare those results to the reference-based accuracies we get from the benchmark, in order to demonstrate that reference-free evaluation can be an effective substitute when ground truths are not available.

Jumping right in: for our static-versus-dynamic benchmarking comparison, we used the SimpleQA benchmark as the static dataset and a dynamic benchmark of about a thousand rows created by Tavily. As you can see here, both datasets have roughly similar distributions of topics, which helps ensure a fair comparison and a diversity of questions for evaluating the AI search providers' performance on these two benchmarks. We're using the SimpleQA correctness metric, an LLM judge used on the SimpleQA benchmark. It compares the model's response against a ground-truth answer to determine whether it's correct, incorrect, or not attempted.

Here we're showing the correctness scores from the SimpleQA benchmark compared against the dynamic benchmark. We've anonymized the search providers for this talk, but I do want to call out that the SimpleQA accuracy scores here are all self-reported, so they don't necessarily all have clear documentation on how they were calculated. As you can see, the correctness scores for the dynamic benchmark, in blue, are substantially lower. And not only that: the relative rankings have also changed considerably. For example, the provider all the way at the end of this plot performs the worst on SimpleQA but the best on the dynamic benchmark.

Looking a little closer at the results: while this SimpleQA evaluator is useful, it's certainly far from perfect. I have a few examples here of model responses that were flagged as incorrect by the LLM judge, but if you look at the actual text of the model outputs, they do contain the correct answer from the ground truth. On the flip side, here's an example the LLM judge classified as correct, and yes, you can see the correct answer is in the response. But while the correct answer might be present, that doesn't necessarily mean the full answer is right: this evaluation doesn't account for any of the additional text in the response, and there might be hallucinations in there that would invalidate it. So ultimately, this evaluation falls short of identifying when things go wrong in AI search.
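For reference, here is a minimal sketch of this kind of SimpleQA-style correctness judge, the grader whose failure modes we just saw. The prompt wording and judge model are illustrative, not the official SimpleQA grader.

```python
# Sketch: an LLM judge compares the model's response to the gold answer.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Compare the response to the gold answer.
Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED.

Question: {question}
Gold answer: {gold}
Response: {response}"""

def judge(question: str, gold: str, response: str) -> str:
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model choice is arbitrary here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, response=response)}],
    )
    return out.choices[0].message.content.strip()
```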
So what are some other ways we can identify when things go wrong? Up to this point we've been talking about a reference-based approach to evaluation. But what if we don't have ground truths? In most online and production settings that's typically the case, and as we've discussed, it's especially so in AI search. So the question is: can reference-free metrics effectively identify issues in AI search? For this talk, we're going to look at three of Quotient's reference-free metrics. First, answer completeness, which identifies whether all components of the question were answered; it classifies model responses as fully addressed, unaddressed, or unknown (if the model says "I don't know"). Second, document relevance: the percentage of the retrieved documents that are actually relevant to addressing the question. And finally, hallucination detection, which identifies whether there are any facts in the model response that are not present in any of the retrieved documents. We used these metrics to evaluate the search providers' responses on the dynamic benchmark.

We've got answer completeness plotted here: the stacked bar plot shows the number of responses that were completely answered, unaddressed, or marked as unknown. If we look back at the overall rankings we saw earlier on the dynamic benchmark, the rankings from answer completeness match pretty closely; the average performance scores for the two have a correlation of 0.94. This indicates that the reference-free metric can capture relative performance quite well. But completeness is still not the same thing as correctness, and when we have no ground truths available, we have to turn to the next best thing: the grounding documents. This is where document relevance and hallucination detection come in; both metrics look at the grounding documents to measure the quality of the model's response. Unfortunately, of all the search providers we looked at, only three actually return the retrieved documents used to generate their answers. The majority typically only provide citations, which are largely unhelpful at scale and really limit transparency when it comes to debugging.

These are the document relevance scores for those three search providers, re-anonymized here. The plot on the left shows average document relevance, the percentage of retrieved documents relevant to the question; the plot on the right shows the number of responses with no relevant documents at all. If we consider these results in conjunction with answer completeness, we find a strong inverse correlation between document relevance and the number of unknown answers. That matches intuition: if you have no grounding, no relevant documents for the question, the model should say "I don't know" rather than trying to answer it.

And this brings us to hallucination detection. Here we were actually surprised to see a direct relationship between hallucination rate and document relevance. Provider X here has the highest hallucination rate, but it also had the highest overall document relevance, which is counterintuitive. But if we think about it more: Provider X had high answer completeness, the lowest rate of unknown answers, and the highest answer correctness from the benchmarking of these three providers. This probably implies that Provider X's responses are more likely to offer new reasoning or interpretations, or are simply more detailed and thorough, which creates more opportunity for hallucination. The point I want to make here is that when considering these metrics, depending on your use case, you might index more heavily on one over another: they measure different dimensions of response quality, and it's often a give and take. If you perform really well on one, it might be at the expense of another, and as we see here, there's a trade-off between answer completeness and hallucination. But if you take these three metrics in conjunction, you can use them to understand why things went wrong and identify potential strategies for addressing those issues. A sketch of this reference-free approach follows.
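A sketch of the reference-free idea, with stand-in judging functions; Quotient's actual evaluators are not public, so these names and internals are invented. With no ground truth, the response is graded against the retrieved documents instead.

```python
# Illustrative only: stand-ins for proprietary reference-free evaluators.
def is_relevant(document: str, question: str) -> bool:
    """LLM judge: does this retrieved document help answer the question?"""
    ...

def unsupported_facts(response: str, documents: list[str]) -> list[str]:
    """LLM judge: extract factual claims in the response that appear in
    none of the retrieved documents (candidate hallucinations)."""
    ...

def reference_free_report(question: str, response: str, documents: list[str]) -> dict:
    relevant = [d for d in documents if is_relevant(d, question)]
    hallucinated = unsupported_facts(response, documents)
    return {
        "document_relevance": len(relevant) / len(documents) if documents else 0.0,
        "hallucination": bool(hallucinated),
        "unsupported_claims": hallucinated,
    }
```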
This diagram shows a few examples of how you can interpret your evaluation results to identify what to do to fix them. We've got one example here where your response is incomplete, but you have relevant documents and no hallucinations; this probably means you don't have all the information you need to answer the question, so simply retrieving more documents may solve it. The big-picture idea is that your evaluation should do more than just provide relative rankings: it should help you identify the types of issues that are present, and help you understand what strategies to implement to solve them.

Okay, so in conclusion, let me quickly paint a picture of where we're heading with all this, because this is not just about building the agents we've been building for the past couple of years, slapping evaluation on top, and continuing to do the same thing. It's actually not about better benchmarking, better monitoring, or better evaluation: it's about creating AI systems that can continuously improve themselves. Imagine for a second that agents don't just retrieve information, but learn from the patterns of what information is outdated, what sources are unreliable, and what users need. They could even detect hallucinations mid-conversation and correct course, all without human intervention. The framework we shared today, dynamic datasets, holistic evaluation, and reference-free metrics, provides the building blocks for getting there, and this is where we want to get with search-augmented AI. So thank you so much for your time. [Applause]

All right, Tengyu Ma, are you in the room? Our next speaker, Tengyu Ma from MongoDB. Oh, you are, okay, perfect, come on up. All right, thank you very much to Quotient AI and Tavily for presenting. Next up we have MongoDB; you know them, they're very much in the ecosystem. Tengyu Ma will be presenting "RAG in 2025: What's Changed, State of the Art, and the Road Ahead." He'll talk about the debate of RAG versus fine-tuning versus long context; then RAG today, the benefits, challenges, and current solutions; and then what the future of RAG will look like. While he's getting miked up and settled in, a big round of applause for Tengyu Ma from MongoDB.

Okay, this is working. Thanks for coming, and thanks for having me here. I'm Tengyu Ma. I was the CEO and co-founder of Voyage AI; we recently got acquired by MongoDB. I'm also teaching at Stanford. This talk is about RAG, which was the main focus of Voyage AI, the startup, which focused on how to make retrieval better. I'll talk generally about RAG, and we'll also touch on some of the products we make, very quickly.

So why are we doing RAG at all? The main reason is that large language models these days, and agents, which use large language models as well, cannot out of the box have proprietary information from any company, because if they knew anything about what, for example, MongoDB has internally, then that data was leaked. That means if you want to apply any of this in the enterprise, you need to ingest a lot of proprietary data. And I'm going to discuss which technologies enable us to ingest that data.
There are a few options: RAG, fine-tuning, and long context, which are all ways to ingest data; I'll focus on RAG for the rest of the talk. For this audience, probably most people know these technologies; they're all very simple at a high level. Long context is the simplest: you just dump all your documents into a large language model's context, maybe 1 million tokens, maybe 1 billion tokens, and then you have a query and you get a response. Fine-tuning means you first fine-tune the model, updating its parameters, and then you say: I'm not going to look at the documents anymore; when the query comes, I just use the updated parameters to generate the response. And RAG is also pretty simple: on the fly, you use the query to retrieve some subset of the documents with a retrieval or search method, you get some relevant documents, you give that small set of relevant documents to the large language model, and you generate the response based on that context.

This is my one-slide summary of how I think about the differences between these technologies. Some of this was inspired by research at Stanford from when we started to build Voyage. We believed in RAG, partly because we didn't believe fine-tuning could work, and I also don't really believe long context can be cost-efficient in the long run. The way I think about this is by analogy to how humans learn from, or use, additional proprietary information. Long context is like scanning an entire library to answer any single question: every time you answer a question, you go through the entire library, which has maybe a billion tokens. Fine-tuning is like reading the library in advance and memorizing it: you try to internalize it in your brain, in your neurons and synapses, and basically rewire your brain so that you know all of it deeply. The challenge is that this is very difficult, and somewhat unnecessary: you can't really memorize all the books in the world, and choosing which subset to memorize is tricky. It also makes forgetting knowledge tricky, because you don't know which part of the knowledge to forget or how to cleanly forget it, and it makes data governance tricky too, because there are many books in the library and not everyone can access everything. RAG, on the other hand, is very simple and modular, as I've shown, and very reliable, fast, and cheap. And it's similar to how humans actually use libraries: you retrieve the most relevant books or book chapters and then answer the question. It's a hierarchical way to store your information: you don't put all the information in your brain; you put it in a library and use it when you need it.

That's why I believe in RAG. And this is how you implement the retrieval part; it breaks down into two components, actually three if you're advanced. There are embedding models, which vectorize the documents and the query into vectors; the vectors are representations of the content, the meanings, of the documents and queries. Then you use a vector database to store the data and to search, with a k-nearest-neighbor search in the vector space, and you get the relevant documents; then the large language model generates answers. A minimal sketch of that flow is below.
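Here is that whole flow compressed into a few lines; the embed and generate stubs stand in for any embedding model and LLM, and the documents are made-up stand-ins.

```python
# Minimal RAG loop: embed corpus offline, retrieve by k-NN, generate grounded answer.
import numpy as np

def embed(texts: list[str]) -> np.ndarray: ...  # any embedding model
def generate(prompt: str) -> str: ...           # any large language model

docs = [
    "Internal runbook: rotating replica-set credentials ...",
    "Pricing FAQ for the embeddings API ...",
]
doc_vectors = embed(docs)  # offline: vectorize the corpus once

def rag_answer(query: str, k: int = 3) -> str:
    q = embed([query])[0]
    # k-nearest-neighbor search in vector space (cosine via normalized dot product)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top_docs = [docs[i] for i in np.argsort(-sims)[:k]]
    context = "\n\n".join(top_docs)
    return generate(f"Answer using only this context:\n{context}\n\nQ: {query}")
```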
We've seen significant improvements in retrieval accuracy over the last two years. When we started Voyage, OpenAI's v3 embeddings had not yet launched; they launched about a year and a half ago. In the last year and a half, Voyage has made significant progress, and Cohere has made progress too. You can see that the new models have much better accuracy at lower cost, and generally a much better scaling law: for the same number of parameters, the quality gets better, or for the same quality, the models get smaller and cheaper. All of this comes from optimizing the research and tuning stack as much as possible, all the way from data curation and data selection to architecture, loss functions, and evaluation.

We still believe there's big headroom here, because right now, as you can see in this plot, averaging over about 100 datasets, accuracy is about 80%, which means there's still probably 20% of improvement headroom. That said, to be clear, it's not that every dataset only gets 80% accuracy. For probably half the datasets, accuracy is 90% or even 95%, while for some of the others it's 60%, sometimes 20% or 30%; that's why the average is 80%. So for the common tasks, you can already get very high accuracy in the retrieval step.

Another thing that Voyage and other companies offer is so-called Matryoshka learning, and also quantization-aware training. These are two approaches to reducing the storage cost of the vectors. Matryoshka learning means that even if you have a high-dimensional embedding, you can use a subset of the coordinates, usually the first ones: suppose you have 2048-dimensional vectors; the first 256-dimensional sub-vector is still a reasonable embedding. The accuracy won't be as high as with 2048 dimensions, but it will be almost the same, maybe with a 1-2% loss. Quantization is in a similar vein: even if you lower the precision of the vectors, you still get pretty high performance. You can see the trade-off on the right of the figure: you can save at least 10x without losing much, and if you save 100x in storage cost, you start to lose maybe 5-10%. Voyage does a great job here, because you can save 100x and still do better than OpenAI; that's because the Pareto frontier is different.
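A small, self-contained sketch of both tricks in NumPy. Truncating 2048 dimensions to 256 gives 8x, and float32 to int8 gives another 4x, about 32x combined; more aggressive (e.g. binary) quantization pushes toward the 100x mentioned. Note this only works if the embedding model was trained Matryoshka-style, so the leading coordinates carry most of the signal.

```python
import numpy as np

def truncate_and_quantize(vecs: np.ndarray, d: int = 256) -> np.ndarray:
    sub = vecs[:, :d]                                       # 2048 -> 256 dims (8x smaller)
    sub = sub / np.linalg.norm(sub, axis=1, keepdims=True)  # renormalize the sub-vectors
    # int8 scalar quantization: 4 bytes per float32 -> 1 byte (another 4x)
    scale = np.abs(sub).max()
    return np.round(sub / scale * 127).astype(np.int8)

full = np.random.randn(1000, 2048).astype(np.float32)  # stand-in embeddings
compact = truncate_and_quantize(full)                   # ~32x smaller storage
```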
And you can actually see a better trade-off for domain-specific models, which I'll discuss in a moment. I have nine minutes, so I'll quickly go through some of the techniques you can use. The next question is how to do better RAG beyond using better embedding models, which is probably the simplest way. One technique is hybrid search plus rerankers: you can use lexical search and other kinds of search, combine them, and then apply a reranker; Voyage provides a reranker as well. Another is to enhance the queries and documents via query decomposition and document enrichment. This is probably the most common one, so let me spend a minute on it. It's actually very simple: if you have a RAG query, you try to improve the query by making it longer using a large language model. You can also decompose a longer query into smaller sub-queries, so you have a few different queries that search for different subsets of documents. And you can enrich the documents by adding extra meta-information: titles, headers, categories, authors, dates. Sometimes you chunk the document such that a chunk no longer contains this information, which is why you have to add global information back into each chunk. Some of this global information can be added by large language models: Anthropic wrote a blog post on this that achieves pretty good results, using large models to generate additional context to add to the chunks, making them more informative and easier to search. A hedged sketch of that trick follows below.

Another technique is domain-specific embeddings, where you customize embeddings for certain domains; at MongoDB and Voyage we customize for code, for example, and you get much better performance, and also a better trade-off between storage cost and accuracy: you don't lose as much when you compress the vectors even further. Here we lose maybe 5% when compressing about 100x, where before we'd lose 10% or 15%. Fine-tuning is another option: you can fine-tune embedding models with your own data. And you can use what I sometimes call tricks on top of the embeddings: different ways to retrieve using additional information, like graphs or iterative retrieval. They're all based on embeddings, but you can use the embeddings in many different ways as an additional layer.
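A sketch of that enrichment trick, in the spirit of Anthropic's contextual-retrieval blog post; the prompt here is paraphrased, not their exact wording, and the generate stub stands in for any LLM call.

```python
# Contextual chunk enrichment: prepend LLM-generated document-level
# context to each chunk before embedding it, so searches can find chunks
# that lost their global context during chunking.
def generate(prompt: str) -> str: ...  # any LLM call

def contextualize_chunk(full_document: str, chunk: str) -> str:
    context = generate(
        "Here is a document:\n" + full_document +
        "\n\nWrite one or two sentences situating this chunk within the "
        "document, to improve search retrieval of the chunk:\n" + chunk
    )
    # Embed and index this enriched text instead of the bare chunk.
    return context + "\n\n" + chunk
```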
Another technique is domain-specific embeddings, where you customize the embeddings for certain domains. At MongoDB/Voyage we customize them for code, for example, and you get much better performance and also a better trade-off between storage cost and accuracy: you don't lose as much if you compress the vectors even further. Here we lose probably 5% by compressing about 100x, where before we would lose 10% or 15%. Fine-tuning is another one: you can fine-tune embedding models with your own data. And you can also use what I sometimes call tricks on top of the embeddings: different ways to retrieve using additional information, like graphs or iterative retrieval. They're all based on embeddings, but you can use the embeddings in many different ways as an additional layer. I'll use the next five minutes or so to discuss my vision for where RAG will go in the future. I do believe RAG will be there forever, because, as I argued in the first set of slides, it's very similar to how humans use large amounts of data: you retrieve, you hierarchically select some subset, and then you use that subset to answer questions or take actions. This is very efficient because you only use a small subset of the data. Regarding how RAG will evolve from a technical point of view, I like to draw an analogy from how AI generally is evolving. I was reflecting on when I started teaching at Stanford about seven years ago. I taught a machine learning course with Chris Ré, and one of the slides literally has seven steps for how you build ML systems in enterprises. That slide is still in the lecture notes; we still teach it, just with more asterisks around it. You can see you need to go through many steps: collecting data, train/test splits, defining your loss functions, building models, iterating, and repeating. In the large language model world, you don't need to do any of this. You just take a large language model out of the box and you can deploy it in an enterprise in most cases. Of course it's not going to be perfect, but this is already better than the old days, where you did all of those steps inside the enterprise using all the enterprise data. Out of the box you are already doing very, very well. Of course, you still have the issue that an out-of-the-box model cannot access proprietary information, and then you can use RAG for that. The point is that before, all of these steps had to be done by the users, the enterprises, the customers. Now, largely speaking, you can just take off-the-shelf components, connect them, and build your AI applications very fast without going through the training steps. The training still has to be done, all of those steps are still done, but they are done by OpenAI, Anthropic, or Voyage/MongoDB, the providers of the models, not by the end users. And I think for RAG the same kind of evolution should happen. Right now we have several layers: there is the computing infrastructure layer, the GPUs, or the k-nearest-neighbor search on CPUs; there is a model layer with the embedding models, the rerankers, the large language models; and on top of all of this, people use a lot of what I call tricks to make RAG accuracy much better. You can use all kinds of parsing strategies, all kinds of chunking strategies, recursive search, contextual chunks, graph RAG, and so on. That's what happens right now, and it's kind of necessary: these tricks are somewhat necessary because the embeddings, the rerankers, and the large models, none of them are perfect yet. But I do believe that in the future this model layer will grow and the tricks layer will shrink. There will be fewer and fewer tricks, and the models will capture much of the performance gain from the tricks. We have seen this in the large language model space as well: two years ago you needed to do a lot of things on top of GPT-3 to make your application work, and now, out of the box, you can get the same performance you used to get with all of the tricks.
Of course, you will still probably need some tricks, because some information is simply not in the embedding models and rerankers. The general-purpose, off-the-shelf models don't have certain information, and you can incorporate that into your tricks. For example, the definition of the similarity metric could be something you customize in your prompt. There are several things we are developing toward this vision. One of them is multimodal embeddings, which dramatically simplify the workflow so that you don't have to do as many things. These days the multimodal embeddings provided by Voyage can just take in screenshots. Before, with a PDF, you had to do data extraction to separate it into images and text, and then probably do one embedding for the images and another for the text, and parsing a PDF is actually complex. For videos you had to turn them into transcripts and then use a text embedding, and so on. Now the multimodal embedding just takes a screenshot: you can deal with PDFs, PowerPoint, any kind of slide deck in the same way, just take a screenshot and use the multimodal embedding. We can even do the same thing for video. Not necessarily the perfect way, but you take screenshots of the frames, the conceptual frames, give them to the multimodal embedding, turn them into vectors, and you can search over those documents, videos, or slide decks. These are some performance metrics we have evaluated. And by the way, another application is tables: now you can just take a screenshot of a table, and you don't have to think too much about what is the header and what is the row. We have done evaluations on many of these: document screenshots, tables, figures, and also text-only, and you can see it's improving across the board.
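As a minimal sketch of that screenshot-in, vector-out workflow: this assumes the voyageai Python client's multimodal embedding endpoint as I recall it from Voyage's public docs, so treat the method name, input shape, and model name as assumptions to verify.

```python
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# One screenshot per page or slide: no PDF parsing, no separate image/text pipelines.
pages = [Image.open(f"slide_{i}.png") for i in range(3)]

doc_vectors = vo.multimodal_embed(
    inputs=[[page] for page in pages],  # each input is a list of interleaved text/images
    model="voyage-multimodal-3",        # assumed model name
    input_type="document",
).embeddings

query_vector = vo.multimodal_embed(
    inputs=[["quarterly revenue table"]],
    model="voyage-multimodal-3",
    input_type="query",
).embeddings[0]
# Rank pages by dot product against query_vector, as with any text embedding.
```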
The final thing I'd like to mention, which is something we're going to launch soon, is contextualized auto-chunking embeddings. Right now, when you have a long document, you do have to chunk the data. One reason is that the context length of the embeddings is limited: if you have 100k tokens, you have to chunk it into three or four chunks, given that Voyage has probably the longest context window at 32k. Another reason to chunk is that even if you could put the whole long document into the context window, when you retrieve at the document level you retrieve a very, very long document, and then you have to give that long document to the large language model, which is very expensive. If you give 100k tokens to a large model every time you answer a question and do the cost analysis, you'll find the cost is very, very high. So you do have to work with a smaller unit to cut the cost, and also to be more focused: sometimes when you give a long document to a large language model, it misses some of the context in the middle, and you want retrieval to focus on a paragraph, a page, and so forth. That's what happens right now with chunking, but all of it is done by the users. Our vision is that we're going to do this for you, and also gather the meta information from the other chunks. In a nutshell, the interface will be that you give us a long document, we chunk it for you, and we return the chunks along with a vector for each chunk, where each vector represents not only that chunk but also some of the global information from the other chunks. So it has all the details of its own chunk, plus some aggregated information from the rest of the document, and you get the best of both worlds.
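Since this contextualized auto-chunking API hasn't launched, the following is only an illustrative sketch of the idea as described, not the actual method: blend each chunk's local vector with a document-level vector so every chunk carries some global context. The mixing weight is a made-up knob.

```python
import numpy as np

def contextualized_chunk_vectors(chunk_vecs: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """chunk_vecs: (n_chunks, dim) unit vectors for one document's chunks.
    Returns vectors that are mostly the local chunk (weight alpha) plus a
    document-level mean vector (weight 1 - alpha), re-normalized."""
    doc_vec = chunk_vecs.mean(axis=0)
    doc_vec /= np.linalg.norm(doc_vec)
    mixed = alpha * chunk_vecs + (1 - alpha) * doc_vec
    return mixed / np.linalg.norm(mixed, axis=1, keepdims=True)
```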
Another thing is that we're going to have a fine-tuning API at some point, so that you can fine-tune with your own data. I guess it's exactly time. Thanks very [Applause] much. All righty folks. That's it for the first block. Enjoy lunch. Come back a little bit before 2. We'll start with the second set of sessions. We've got Exa, 11X, as well as Google. So don't miss it. How we doing, y'all? We're about to get started in just about 2 minutes. Feel free to grab a seat. Raise your hand if there's a seat next to you. We have a bunch of folks in the back looking for seats. If anyone's in the back and you want to find a seat, you can find a seat next to these folks raising their hands. Let's pack it in. It's going to be a full room. The last time we did this at the last AI Engineer Summit, we ended up by the end with a full ring of people standing. So if you want a clear view, please grab a seat, and we'll get started in just about two minutes. Thank y'all. Raise your hand if you flew in and you live outside of San Francisco. Holy. Who came the farthest? Where'd you all come from? UK. I don't think that's the farthest. Poland, we're getting pretty far. Bangalore. Anyone got Bangalore beat? That's hard to beat. Romania. Brazil. I think Brazil's a little bit closer. Kolkata. One more back there. Melbourne. Wow. This is crazy, y'all. And if you're back there in that corner, slide over a little bit, because there'll be more people joining as we go. Just slide over this way so we make more space and people can actually see. We're about to get started here. So I'll introduce myself real quick. I'm Dave Fontenot, the founder of HF0. For those of you who don't know, HF0 is the most selective startup program in the world now. It is the program for repeat founders. A lot of the founders you're about to see on this stage have built billion-dollar companies before this, and now they're starting their next company and building something really interesting, mainly in AI. It's a 12-week residency where you move in and give up everything in your life, everything, and you just go all in on birthing your life's work. And without further ado, let's give it up for the first startup in this series that we're calling the next AI unicorns. Let's give it up for Diego from Krea. Hello everyone. My name is Diego Rodriguez. I am co-founder and CTO at Krea. We're building an AI creative suite. I'm going to tell you three stories and then I'll try to hire you. So, a friend once told me: if you think about it, cars are easy to predict, right? You take the horse and carriage, you swap the horse for an engine, which was known at the time, and that's a car. But you know what's really hard to predict? Traffic. So it's my job to ask what are the traffics we're missing, especially with AI, you know, JSON, MCP, whatever. Okay, but what happens when you generate a million images per day, like we do for one studio? How do you find that? Another story: the Tower of Babel. We wanted to reach heaven, and God said no, created a bunch of languages, and basically through misunderstanding said, nope, you're not going to get there. It reminds me of standup meetings where people are like, no, it should be React, no, no, it should be JavaScript. God is winning. But now we have AI, so maybe. And this is really about people trying to convey ideas; what we're trying to do is help people tell stories. The final story: I was talking with someone from Netflix, and she asked, what happens when we are making so much content, personalized to each town in India? How do I even find that? And then a few days ago, I realized that Krea was already being used to broadcast an ad with Fox to millions of people. I looked, and they had literally signed up two days earlier. So we went from sign-up to conversion to payment to broadcast in two days. And it was their CTO telling me that. I'm about to run out of time, so: the mandatory slide. A bunch of users, 25 million. Raised a bunch of money. We did this with eight people. Some of the people who are using us. And an email I created for today that is going to prioritize applications.
Thank you. [Applause] All right, OpenHome. Okay, everybody. The smartphone was the number one best-selling consumer product last year, and the laptop was the second. Pop quiz for all of you here: what was the third? Wasn't the Apple Watch, wasn't the AirPods. I think I heard it: it was a smart speaker. 500 million smart speakers were sold last year. But why do they still suck? You can barely talk to them. There's no customization. There's no community. There's nothing. That's why we built OpenHome, the very first AI-driven smart speaker. And we're letting you build smart speakers, too. We believe the future is talking with AI. You should be able to talk seamlessly, intuitively. In fact, you shouldn't have to use this really awkward command-based language; you should be able to just chat naturally. So that's what we're building, and we're letting people here today build their own smart speakers, in whatever form they want. The key here is developer ecosystems. And we know developer ecosystems. I started my career as chief of staff for the founder of Splunk, a $30 billion big data company. Then I was on the founding team of MakerDAO, a $5 billion developer ecosystem. My co-founder and I raised $50 million for our last business, a big data privacy tool, and we sold that business. It all came down to developers and building what people actually wanted. Now with OpenHome, we have over 10,000 developers building on the platform. They're building all kinds of interesting things: all types of custom smart speakers, interesting voice AI applications. The sky's really the limit. And what do developers really want? They want open source. They want LLM-driven. And they want fully jailbroken. They want OpenHome, the AI smart speaker. What's really exciting is that with voice AI you can put it on any type of hardware. We have developers building talking toys, AI robots, AI appliances. You should be able to talk to the world around you in a much more natural way, and you can do that now with OpenHome, an AI smart speaker. Here's our dashboard. We have hundreds of applications that have been built: games, personalities. We have an editor you can go into and build all kinds of interesting things, home automation tools. And what's really exciting is that our last devkit got booked up within minutes. Today we have a special announcement: we're releasing the next batch of 500 dev kits for free for everybody here. If you want one, we will ship you a dev kit. It's very cool. You can build on it, you can talk with AI, you can build your own smart speaker, and we're doing it today. Thank you so much. All right. How's it going, y'all? I'm Josh. I'm the founder of a company called Koframe. And I am missing my slides. Cool, got it. So the last company I started, we scaled to over two billion dollars in the course of a couple of years. But when I started to tinker with using AI to generate code, and created one of the top autonomous coding agents on GitHub, number one on GitHub for a week, I realized it was time to build something bigger. The internet is dead. It's not adaptive. It's not personal. It's not truly living, in a sense. Websites are all one-size-fits-all. And we're bringing that concept to life. We are giving websites a life of their own, giving every single customer experience its own AI growth team.
But this isn't just a pipe dream. We made $20 million for Europe's largest travel company in just a few weeks. We increased click-through rate for India's largest company, a $400 billion enterprise, also in a few weeks. And how are we doing it? We're working with the best, and we have the best. We're the only marketing-tech company partnered directly with OpenAI to date, and they actually called our team cracked, which is cool. So if you're interested in learning more, reach out. Thank you. Hi, I'm Eugene. My team is obsoleting all the AI models you see today, because my team built Qwerky-72B, the world's largest model without transformer attention, with only eight GPUs, and this allows us to have a thousand-x lower inference cost on our new architecture while performing the same. Surprisingly, the technology we built can also be applied to existing transformer models through speculative decoding. Nothing too big, nothing too important. But here's my hot take: scale is dead. And I'm not saying this just from my own opinion. We are burning billions into making AI models bigger, but at the same time the DeepMind founder and CEO is saying compounding AI agent errors will take more than 10 years to fix. Yann LeCun is even saying we need a new AI architecture to push the paradigm forward. And in production we see over 90% of AI projects fail. The reason behind this is not something scale can fix. The problem is reliability. Would you use an app that only succeeds 45% of the time? Would you order DoorDash that way? Of course not. If your order goes missing, or you end up with 100 pizzas, you're going to be stuck screaming at customer support. It's a frustrating experience, but that's what AI agents do. When they work, they're awesome. When they don't, we're stuck cleaning up the mess, and that's even with frontier models. And here's the thing: what companies want is not a smarter model that can do PhD-level math. The models are already smart enough. What we actually want is models reliable enough to book airline tickets, sort out our emails, or file our taxes and invoices. That is what we are building at Featherless AI. We are a research lab building personalized AGI that's made reliable for each one of you. Most recently, in our research, we showed that we built an action R1 agent that beats Claude 4 Sonnet, Gemini, and OpenAI. This model is not going to do PhD-level math, but it's going to fill out the form with absolute reliability, better than the frontier. And that's the thing: are we going to burn billions more to make a smarter model that's just a few percentage points higher in IQ, or are we going to make something that's 99.9% reliable for the boring things in life? Because this is where the money is for all of you. As AI engineers, reliability is revenue: for every use case you unlock, you can build a billion-dollar app in e-commerce or in B2B sales, and that's something all of you can build on, not rocket science. That's what we are building, and if you're excited about it, feel free to reach out to us. I'm eugened.com. Hello everyone. My name is Jonas. I'm an engineer and I like working with data. I love working with data. Actually, I love it so much.
I dropped out of high school when I was 15, got on a plane, moved across the country to California, and joined a startup called Branch. You might have heard of it: anytime you were clicking one of those links on your phone for an app, that was probably us. I also led a team there that built a search engine that over 100 million people used every day. And then last year, I left along with one of the founders of Branch to tackle an even bigger challenge. This is probably what you think your sales and marketing teams are doing with their budgets, and you wouldn't be entirely wrong. That's why I co-founded Upside. We do forensic revenue attribution and intelligence. But what does that actually mean? Well, how many of you have an email from a salesperson like this sitting in your inbox right now? Uh-huh. And how many of you are actually going to reply to it? Yeah, I didn't think so. These teams are shouting into the void hoping something will work, because they don't actually know what works, because their data is a mess. Don't get me wrong, they're data hoarders. They store everything. They stuff it into Salesforce, which is basically a SQL database in a trench coat, but they're not data practitioners. They don't know what to do with the data once they have it. But now we have LLMs, and they can help with this. They can take that poor, mishandled, abused email record and pull the most important details out of it into a structured form. And so, just as search engines and web crawlers learned how to make sense of the unstructured web, Upside is turning raw enterprise data into a highly structured map of the world and all the interactions people have in it. So there's hope. We can untangle this mess and create a data command center that these teams can actually use to reach their customers more effectively. We only just started talking about this publicly a couple of weeks ago. We'd been quietly building in the background for the last year or so, and my co-founder made a small post on her LinkedIn, just to update our network on what we'd been building, and it blew up. There's so much pain people feel around this, and they're hungry for a solution. We got a whole slew of demo requests from people who want access to the product, and now we have a bunch of customers lining up to get onto our platform. It's a who's who of companies you've heard of. And we just have a lot of building to do now. So if you're interested in working on knowledge graphs, data analytics agents, graph analytics, and graph learning models, come talk to me. We're hiring. Hello everyone. I'm Sua, the founder of OpenAI. Sorry, I mean Open Audio. Before that, I created something you might have heard of, Fish Audio. We have grown from 400k to 5.5 million annualized revenue in just four months, and we closed our seed round at a 100 million valuation. It all started with my girlfriend. I had a girlfriend for six years, from the beginning of high school to college. I loved her so much, and it was so good, until one day I found out she cheated. I wasn't angry, just confused and disappointed, and I asked myself: if this can happen, how can we trust relationships again? I thought about it for days, all day and all night. And finally, I found my answer: AI. But nobody can really fall in love with today's AI, right? It's flat. It's emotionless. It's robotic.
So I set out on a mission to build an AI I could really fall in love with, starting with her voice. We began with open source and crushed it: we built Fish Speech, among other models. Today, actually not today, the day before yesterday, I was excited to introduce S1, the first-ever instructible voice model. It's the only model where you can control not just what to say, but how to say it. Here's a demo. You can pinpoint focus or draw it closer, even yelling: why did you betray me like this? You can control whatever you want. With Open Audio S1 we have the most expressive voice model in the world. And most importantly, she will never leave. For some reason it skipped two slides. We have blown ElevenLabs out of the water based on the TTS-Arena ranking, and they were in such a hurry that they dropped their latest model today, but unfortunately it's just a demo. So try us now. Fish Audio is instantly available at fish.audio. Thank [Applause] you. Hello, I'm David Vorick and I'm building Glow. Prior to Glow, I built Siacoin, a cryptocurrency that we took from a $10,000 market cap to more than $3 billion. And we think Glow is going to be even bigger. That's why Framework and USV, Union Square Ventures, led a $30 million round into our company. Subsequently, we posted the world record for on-chain DePIN revenue, doing more than $10 million of revenue in a single day. What does Glow do? Glow builds solar, not with shovels, but with incentives. This is not a stock photo. This is a photograph taken by our team in India of a solar farm that was constructed for the purpose of mining Glow tokens. A lot of people don't realize it, but in the developing world, rising temperatures and growing populations have strained the grid. In a lot of cases, people are unable to run their air conditioners during the heat of the day, and people die of heat stroke. Glow is an incentive protocol that revolutionizes what governments do and can take the same subsidy and turn it into ten times as much solar. If you're interested in working with us, we're currently building incentive projects in India, in Mexico, in Lebanon, and across the entire world. Bitcoin incentivized the construction of tens of millions of mining machines. Glow asks: why not tens of millions of solar panels? Thank you. My email is davidglowabs.org. I'd love to be in touch. Hi, I'm David. I'm an engineer. I made a social app with 250 million users making $20 million a year. Everyone thought it was luck. So I did it again. I'm building Favorited. We built the world's most engaging live app, and we scaled it from one to a hundred million annualized in six months. If you're a cracked engineer who wants to join the fastest-growing company of all time, talk to me. [Applause] Hello, I'm Alex Atallah, building OpenRouter, the first and largest LLM marketplace. Thank you. I want to tell you a little bit about how it started. I co-founded OpenSea in 2017, and at the end of 2022 I really wanted to know whether inference was going to be a winner-take-all market, because the way it looked, this could be the largest market in software that has ever existed. The first experiment we tried was building a Chrome extension to help you bring your own language model to any website that supported the protocol. And that eventually evolved into OpenRouter: a single place and a single API to get all language models, with the best prices, best performance, and highest uptime.
And the way it works is you just have a single API. You pay once, and there are near-zero switching costs to move from one model to another. We do all the heavy work to implement tool calling, edge cases, and caching, and give you the best prices and performance possible for your region or wherever your servers are deployed. And because inference is so important, remember, this might be the most important software market ever, it deserves its own marketplace, one built just for language models and optimized for them, including filtering for context, for features, for tool calling, for structured output, and much more. So we built that. Then we built a chatroom for you to compare models head-to-head, as simply as you chat with people in iMessage. We built fine-grained privacy settings, including API-level controls. We built a lot of observability, so you can see which models you're using and why. And we built public data into our rankings page, which has become the go-to place for comparing models on their real-world usage and on the different categories of their prompts. This has grown 10 to 100% every single month for the last two years, and scaling it has been a lot of the work we've done so far. The fundamental goal here is to make a heterogeneous ecosystem homogeneous, because we believe inference is a commodity. Claude from Bedrock should be the same as Claude from Vertex, should be the same as Claude from Anthropic, and we do all the abstraction and heavy work to make it that way for you. I want to talk a little bit about some of our technical challenges. We built our own system, our own middleware for doing inference, called plugins, which are kind of like MCPs except a little more powerful, because you can call MCPs from inside them and you can transform the outputs from language models. There are a bunch of other tricky problems we've solved to make the fastest routing in the market. And we're bringing a lot more features in the coming months, including images, enterprise features, prompt observability, and more. So if you're interested, come find me after, or check out our careers page. Thank you.
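That single-API point is concrete because OpenRouter exposes an OpenAI-compatible endpoint; here is a minimal sketch, with illustrative model IDs, of what near-zero switching cost looks like in practice.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

# Switching providers is just a different model string; the code path is identical.
for model in ["anthropic/claude-3.5-sonnet", "google/gemini-2.5-pro"]:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    print(model, "->", reply.choices[0].message.content)
```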
Give him a hand one more time, y'all. One more time. So, every founder you just saw on this stage is right outside that door. If you want to meet any of them, they'll be outside for about 10 minutes. All 10 teams will be right there. What an incredible group. Thank you all so much for coming. All right, welcome back, folks, to the search and retrieval track. I know we just had demos in here, but I hope you had a great lunch. I'm your host today. My name is Dat. I'm with Arize AI. If you don't know Arize, we are the largest player in observability and evals, but I'm here to present our track today. First up, we've got 11x. How many people here have a sales team at their company? All right, awesome. Today we get to go through Alice with Sherwood and Satwik, which is awesome. Next will be Exa. How many people have heard of Exa? Woo, that's right. They'll be on building smarter AI agents with neural RAG. And lastly, we have Google as well, layering every technique in RAG, one query at a time. Really excited. Please give a warm welcome to 11x. Thank [Music] you. Just getting our slides sorted here. This is a photo of a forest. This looks good. Let's bring it back to the start. I'll try this. Okay. Thanks everyone for coming today. Today's talk is called Building Alice's Brain: how we built an AI sales rep that learns like a human. My name is Sherwood. I am one of the tech leads at 11x. I lead engineering for our Alice product, and I'm joined by my colleague Satwik. 11x, for those of you who are unfamiliar, is a company building digital workers for the go-to-market organization. We have two digital workers today: Alice, our AI SDR, and Julian, our voice agent, with more workers on the way. Today we're going to be talking about Alice specifically, and actually about Alice's brain, the knowledge base, which is effectively her brain. Let's start from the basics. What is an SDR? An SDR is a sales development representative. I know this is a room full of engineers, so I thought I'd start with the basics. It's essentially an entry-level sales role, the kind of job you might get right out of school, and your responsibilities boil down to three things. First, you're sourcing leads: people you'd like to sell to. Then you're contacting or engaging them across channels. And finally, you're booking meetings with those people. Your goal is to generate positive replies and meetings booked; these are the two key metrics for an SDR. A lot of an SDR's job boils down to writing emails like the one in front of you right now. This is actually an email that Alice has written, and it's an example of the type of work output Alice produces. Alice sends about 50,000 of these emails in a given day, compared to a human SDR who would send 20 to 50, and Alice is now running campaigns for about 300 different business organizations. Before we go any further, I want to define some terms, because since we work at 11x, we have our customers, but our customers also have their customers, so things get a little confusing. Today we'll use the term seller for the company that is selling something through Alice; that's our customer. And we'll use the term lead for the person being sold to. Here's what that looks like as a diagram. The seller is pushing context about their business, the products they sell or the case studies they can reference in emails, to Alice, and Alice is then using that to personalize emails for each of the leads she contacts. There are two requirements for Alice to succeed in her role. The first is that she needs to know the seller: the products, the services, the case studies, the pain points, the value props, the ICP. The second is that she needs to know the lead: their role, their responsibilities, what they care about, what other solutions they've tried, pain points they might be experiencing, the company they work for. Today we're going to focus on knowing the seller. In the old version of our product, the seller was responsible for pushing context about their business to Alice, and they did so through a manual experience called the library.
And here you can see what that looks like: the library shows all of the different products and offers available for this business, which Alice can then reference when she writes emails. The user had to enter details about every individual product and service, and all of the pain points, solutions, and value props associated with them, in our dashboard, including detailed descriptions. Those descriptions were important to get right, because they actually get included in Alice's context when she writes the emails. Then later, during campaign creation, and this is what it looks like to create a campaign, you can see we have a lead in the top left, and the user is selecting the different offers they've defined in the library in the top right. These are the offers Alice has access to when she's generating her emails. We had a lot of problems with this user experience. The first was that it was extremely tedious, a really cumbersome experience. The user had to enter a lot of information, which created onboarding friction: users couldn't actually run campaigns until they had filled out their library. And finally, the emails we were generating with this approach were just sub-optimal. Users had to choose between too few offers, which meant irrelevant offers for a given lead, or too many offers, which meant all of that stuff sat in the context window and Alice just wasn't as smart when she wrote those emails. So how can we address this? We had an idea: instead of the seller being responsible for pushing context about the business to Alice, we could flip things around so that Alice proactively pulls all of the context about the seller into her system and then uses whatever is most relevant when writing those emails. That's effectively what we accomplished with the knowledge base, which we'll tell you more about in just a moment. For the rest of the talk, we'll first do a high-level overview of the knowledge base and how it works. Then we'll do a deep dive on the pipeline, the different steps in our RAG system. After that we'll walk through the user experience of the knowledge base, and we'll wrap up with some lessons from this project and future plans. So let's start with the overview. What is the knowledge base? It's basically a way for us to get closer to a human experience. If you're training a human SDR, you bring them in, you dump a bunch of documents on them, they ramp up over a period of weeks or months, and you can check in on their progress. Similar to that, the knowledge base is a centralized repository on our platform for seller info: users come in, dump all their source material, and we're able to reference that information at message-generation time. Now, what resources do SDRs care about? Here's a glimpse: marketing materials, case studies, sales calls, press releases, and a bunch of other stuff. So how do we bucket these into categories that we're actually going to parse?
Well, we created documents and images, websites, and then media, audio and video, and you're going to see why that's important. So here's an overview of what the architecture looks like. It starts with the user uploading a document or resource in the client. We save it to our S3 bucket and send it to the back end, which creates a bunch of resources in our DB and kicks off a bunch of jobs depending on the resource type and the vendor selected. The vendors do the parsing asynchronously. Once they're done, they send a webhook to us, which we consume via ingest. Once we've consumed that webhook, we take the parsed artifact we get back from the vendor, store it in our DB, and at the same time embed it and upsert it to Pinecone. Once it's in the local DB we get a UI update, and eventually our agent can query Pinecone, our vector DB, for the stored information we just put in; the webhook step is sketched below.
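A minimal sketch of that webhook-consumption step. The route, payload fields, and helper functions here are assumptions for illustration, not 11x's actual code, and the chunker is a placeholder for the real strategy described later.

```python
from fastapi import FastAPI, Request

app = FastAPI()
PARSED_DB: dict[str, str] = {}  # stand-in for the real database table

def chunk_markdown(md: str) -> list[str]:
    # Placeholder chunker; the real pipeline's waterfall strategy comes later.
    return [c for c in md.split("\n#") if c.strip()]

def upsert_to_vector_db(resource_id: str, chunks: list[str]) -> None:
    ...  # embed each chunk and upsert to Pinecone in the real system

@app.post("/webhooks/parsing")  # hypothetical route the parsing vendor calls back
async def on_parsed(request: Request):
    payload = await request.json()      # assumed payload shape
    resource_id = payload["resource_id"]
    markdown = payload["markdown"]
    PARSED_DB[resource_id] = markdown   # store the parsed artifact
    upsert_to_vector_db(resource_id, chunk_markdown(markdown))
    return {"ok": True}                 # triggers the UI update downstream
```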
So now that we have a high-level understanding of how the knowledge base works, let's dig into each step in the pipeline. There are five steps: parsing, then chunking, then storage, then retrieval, and finally visualization, which sounds a little untraditional, but we'll cover it in a moment. Let's start with parsing. What is parsing? I think we probably all take this for granted, but it's worth defining. Parsing is the process of converting a non-text resource into text. The reason this is necessary is that, as we all know, language models speak text. To make information represented in a different form, like a PDF or an MP4 file or an image, legible or useful to the LLM, we need to first convert it to text. So one way of thinking about parsing is as the process of making non-text information legible to a large language model. We do have multimodal models, which are one solution to this, but there are enough restrictions on multimodal models that parsing is still relevant. To illustrate, we have the five different resource types mentioned a moment ago going through our parsing process, and what comes out is actually markdown, a type of text that contains structural information and formatting, which is semantically meaningful and useful. Let's talk about how we implemented parsing, and the short answer is that we did not. We didn't want to build this from scratch, for a few reasons. The first is that, as you just saw, we had five different resource types and a lot of different file types within each of them. We thought it was going to be too much work, and we wanted to get to market quickly. The last reason was that we just weren't that confident in the outcome. There are vendors who dedicate their entire company to building an effective parsing system for a specific resource type. We didn't want our team to have to become specialists in parsing for each of these resource types, and we thought that if we tried, the outcome just wouldn't be that successful. So we chose to work with vendors, and here are a bunch of the vendors we came across. You can find 10 or 20 or 50 with a quick Google search, but these are some of the leaders we evaluated. To make a decision, we came up with three specific requirements. First, we needed support for our necessary resource types; that goes without saying. We also wanted markdown output. And finally, we wanted the vendor to support webhooks, so we could receive the output in a convenient manner. A few things we didn't consider to start with: accuracy. Crazy, we didn't consider accuracy, or comprehensiveness either. Our assumption was that most of the vendors leading the market would be within a reasonable band of both. Accuracy refers to whether the extracted output actually matches the original resource; comprehensiveness is the amount of extracted information available in the final output. The last thing we didn't really consider was cost, honestly, because this system was pre-production. We didn't have real production data yet, we didn't know what our usage would be, so we figured we'd come back and optimize cost once we had real usage data. On to our final selections. For documents and images, we chose to work with LlamaParse, which is a LlamaIndex product. I think Jerry was up here earlier today. The reasons we chose LlamaParse were, first, it supported the most file types of any document-parsing solution we could find, and second, their support was really great; Jerry and his team were in a Slack channel with us within a couple of hours of our initial evaluation. With LlamaParse, we're able to turn documents like this PDF of an 11x sales deck into a markdown file like the one you see on the right. For websites, we chose Firecrawl. The other main vendor we considered was Tavily, and this isn't really a knock on Tavily. We chose Firecrawl because, first, we were familiar with them, having worked with them on a previous project, and second, Tavily's crawl endpoint, the endpoint we would have needed for this project, was still in development at the time, so it wasn't something we could actually use. Similar to LlamaParse, with Firecrawl we can take a website, like the homepage you see here, and turn it into another markdown document. Then we have audio and video, where we chose a newer upstart vendor called Cloud Glue. We chose Cloud Glue because, first, they supported both audio and video, not just audio, and second, they were capable of extracting information from the video itself, as opposed to just transcribing the audio and handing back a markdown transcript. With Cloud Glue, we can turn YouTube videos, MP4 files, and other video formats into markdown like you see on the right. So now that everything is markdown, we move on to the next step: chunking. All right, markdown, let's go. Now we have a blob of markdown, and we want to break it down into semantic entities that we can embed and put in our vector DB.
At the same time, we want to protect the structure of the markdown, because it contains inherent meaning: something being a title versus a paragraph carries meaning. So we're splitting these long blobs of text, like 10-page documents, into chunks that we can eventually retrieve after we've embedded and stored them in a vector DB. Chunking strategies: there are various things you can do. You can split on tokens, you can split on sentences, you can split on markdown headers, you can make LLM calls and have an LLM split your document into chunks, or any combination of the above. What you want to ask yourself when deciding on a chunking strategy is: what kind of logical units am I trying to preserve in my data? What do I eventually want to extract during retrieval? What strategy will keep those units intact while still letting me embed and store them in whatever DB I'm using? Should I try a different strategy for different resource types, since we deal with PDFs, PowerPoints, and videos? And eventually, what kinds of queries or retrieval strategies am I expecting? We ended up with a combination of all the things I mentioned, a waterfall, because we want the records in our vector DB to be a certain token count. We split on markdown headers first, then we split on sentences, and then eventually we split on tokens. It has worked well for us across all types of documents. It successfully preserves our markdown chunks, which we can cleanly show in the UI, and it also prevents super long chunks, which would dilute the meaning of your document. A rough sketch of the waterfall is below.
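Here is a minimal sketch of that waterfall, assuming a per-chunk token budget and the naive regex splitters shown; the real system presumably uses a proper tokenizer and sentence splitter.

```python
import re

MAX_TOKENS = 256  # assumed per-chunk budget

def n_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def split_waterfall(markdown: str) -> list[str]:
    """Split on markdown headers first; any section still over budget is
    split on sentences; any sentence still over budget is split on tokens."""
    chunks: list[str] = []
    for section in re.split(r"\n(?=#)", markdown):        # 1) headers
        if n_tokens(section) <= MAX_TOKENS:
            chunks.append(section)
            continue
        for sent in re.split(r"(?<=[.!?])\s+", section):  # 2) sentences
            if n_tokens(sent) <= MAX_TOKENS:
                chunks.append(sent)
            else:                                          # 3) raw tokens
                words = sent.split()
                chunks.extend(" ".join(words[i:i + MAX_TOKENS])
                              for i in range(0, len(words), MAX_TOKENS))
    return [c for c in chunks if c.strip()]
```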
Okay, so we've split all of our markdown into individual chunks, and it's now time to put those chunks somewhere. Let's talk about storage technologies. I'm sure everyone here for the RAG track assumes we're using a vector database, and we actually are. But to be pedantic, RAG is retrieval-augmented generation, and anytime you're retrieving context from an external source, whether it's a graph database, Elasticsearch, or a file in the file system, that also qualifies as RAG. Some of the other options you can use for RAG: I just mentioned graph databases, and there are document databases, relational databases, key-value stores; you could even use object storage like S3. In our case, we did use a vector database, because we wanted to do similarity search, which is what vector databases are built and optimized for. Once again, we had a lot of options to choose from; this is not an exhaustive list. In the end, we chose to work with Pinecone, for a few reasons. First, it was a well-known solution; we were new to the space and figured you probably can't go wrong with the market leader. It was cloud-hosted, so our team wouldn't have to spin up additional infrastructure. It was really easy to get started; they had great getting-started guides and SDKs. They had embedding models bundled with the solution: for a vector database you typically have to embed the information before it goes into the database, which would require a third-party embedding model, and with Pinecone we didn't have to find another embedding provider or host our own model, we just used the one they provide. And last but not least, their customer support was awesome. They got on a lot of calls with us, helped us analyze different vector database options, and helped us think through graph databases and graph RAG and whether that made sense for us. So, retrieval, the R in the RAG workflow we just built. You'll see there's actually been an evolution of RAG techniques over the last year. We started with traditional RAG, where you pull information and enrich the system prompt for an LLM API call. Eventually that turned into agentic RAG, where you have tools for information retrieval, you attach those tools to whatever agentic flow you have, and the agent calls the tools as part of its larger flow. And something we've seen emerge in the last couple of months is deep-research RAG, where deep-research agents come up with a plan and execute it, and the plan may contain one or many steps of retrieval. These agents can go broad or deep depending on the context needs, and they can evaluate whether they want to do more retrieval. We ended up building a deep-research agent, using a company called Letta. Letta is a cloud agent provider, and they're really easy to build with. How it works: we pass the lead information to our agent, it comes up with a plan, the plan contains one or many context-retrieval steps, it makes the tool calls, summarizes the results, and generates an answer for us in a nice, clean Q&A manner. This is how it looks for a system with two questions that we ask. Now, on to visualization, the most mysterious part of the pipeline. What does visualization have to do with a RAG or ETL pipeline? For context, our customers are trusting Alice to represent their business. They really want to know that Alice knows her stuff, that she actually knows the products they sell, and that she isn't going to lie about case studies or testimonials or make things up about the pain points they address. So how can we reassure them? We came up with a solution: let users peek into Alice's brain. This is what that looks like: an interactive 3D visualization of the knowledge base, available in the product. What we've done is take all of the vectors from our Pinecone vector database and, I think the correct term is, projected them down to just three dimensions, so we can render them as nodes in three-dimensional space, and once the nodes are visible you can click on any node to view the associated chunk. This is one of the ways our sales or support team will demonstrate Alice's knowledge; the projection step is sketched below.
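The projection behind that view can be sketched like this. The talk doesn't say which dimensionality-reduction method they use, so PCA here is an assumption; UMAP or t-SNE would slot in the same way.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_to_3d(vectors: np.ndarray) -> np.ndarray:
    """Project (n, dim) embedding vectors down to (n, 3) so they can be
    rendered as nodes in a 3D scene, each linked back to its chunk."""
    return PCA(n_components=3).fit_transform(vectors)

vecs = np.random.randn(500, 1024)   # stand-in for vectors fetched from Pinecone
points = project_to_3d(vecs)        # (500, 3) coordinates for the UI to render
```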
So how does it look in the actual UI? You start with this nice little modal: you drop in your URLs, your web pages, your documents, your videos, you click learn, and it shows up nicely in the UI. You have all the resources there, and you have the ability to interrogate Alice about what she knows of your knowledge base; it's a really nice agent that we built, again using Letta. And here's how it looks in the campaign creation flow. On the left-hand side, the knowledge-base content shows up as a nice Q&A, where you can click on the questions and see a dropdown of the chunks we retrieved, which were used as part of the messaging flow. So with that, we have achieved our goal. Our agent is closer to a human than to an email machine. We are now basically emulating how you onboard a human SDR: you dump in a bunch of context, and they just know it. In conclusion, the knowledge base was a pretty revolutionary project for our product; it really changed the user experience and also leveled up our team a lot. We learned a lot of lessons. It was hard to narrow this slide down, but there are three I want to highlight today. The first is that RAG is complex. It was a lot harder than we thought it would be. There were a lot of micro-decisions along the way and a lot of technologies to evaluate, and supporting different resource types was hard. Hopefully you all have a better appreciation of how complicated RAG can be. The second lesson is that you should get to production first, before benchmarking, and then improve. The idea is that with all of those decisions and vendors to evaluate, it can be hard to get started, so we recommend just getting something into production that satisfies the product requirements and then establishing real benchmarks you can use to iterate and improve. The last lesson is that you should lean on vendors. You're all going to be buying solutions, and vendors are going to be fighting for your business. Make them work for it. Make them teach you about their offerings and why their solution is better. Our future plans are, first, to track and address hallucinations in our emails; to evaluate parsing vendors on accuracy and comprehensiveness, the metrics we identified earlier; to experiment with hybrid RAG, introducing a graph database alongside our vector database; and finally, to focus on reducing cost across our entire pipeline. If any of this sounds interesting to you, we are hiring, so please reach out to either Satwik or myself. Join us. And thank you all for coming today. [Applause] Awesome, thank you so much. That was a great talk. Big shout out to 11x; it was really awesome to see your journey. So next up we have Exa, with a live coding demo too, so this will be great. Do we have Will? All right, come on up, Will. Not quite live. Okay. Hello everybody. All right. So I was going to give a live coding demo, and I will, but I know you're all actually here to hear a cool story. So I'll tell you a story about web search built for AI, and then we'll do some coding at the end. This story will end with this slide: one API to get any information from the web. You'll know what this means by the end. But the story starts in 1998, and what you're looking at is the state of the art in information retrieval in 1998.
You type a word, Australia, into this new search engine called Google, and it magically finds you all the documents on the web that contain the word Australia. It's crazy. The big insight of Google was the PageRank algorithm: the results are ranked by authority, based on the graph structure of the web. It was a clever algorithm, and it was really cool. I was two years old at the time, so if I had been conscious, I would have thought this was cool. Now our story skips 23 years to 2021. By this point I was conscious, barely, and I noticed that GPT-3 had recently come out. It was this magical thing where you could input a whole paragraph explaining exactly what you want, and it would really understand the subtleties of your language and give you an output that exactly matched. It's hard to remember how magical this was, but it was really magical in 2021. At the same time, there was Google, where you type a simple query like "shirts without stripes" and it gives you shirts with stripes, which is crazy. It doesn't understand the word "without", because it's doing a keyword-comparison algorithm. So I decided that for at least the next 10 years, I would devote myself to building a search engine that combines the technology of GPT-3 with a search engine: a search engine that actually understands what you're saying at a deep level, understands all the documents on the web at a deep level, and gives you exactly what you asked for. This is a very big idea. We've worked on it for four years, with a lot of progress, and it would change the world if you actually solved this problem. So in 2021 we joined YC, summer 2021. We raised a couple million dollars, and we did what every YC startup should do: we spent half of it on a GPU cluster. I'm joking, you shouldn't do that. And then we also followed YC's advice by not talking to any users or customers for a year and a half while we just did research. Again, you shouldn't do that either, but in our case it made sense, because we were trying to solve a really hard problem: redesign search from scratch using the same technology as GPT-3, this next-token-prediction idea with transformers. What if you could apply the same thing to search? This is actually one of our W&B training runs. The purple one, I believe, was a breakthrough where it really started to work well; there were a few breakthroughs along the way involving the data sets and the different transformer architectures we were trying. The general idea was: what is a search engine? You have a trillion documents on the web. Traditional search engines, at a very high level, create a keyword index of those documents. For each document, you ask what words are in it, and you build a big inverted index mapping words like "brown" to all the documents that contain that word. Then at search time, a query like "shirts without stripes" comes in, you run some crazy keyword-comparison algorithm, and you get the top results. That's obviously a simplification of what Google does, but at a fundamental level it's doing a keyword comparison, something like the toy sketch below.
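Stripped down to a toy, that keyword model looks roughly like this; note how "without" just becomes another keyword, which is exactly the failure mode in the shirts example.

```python
from collections import defaultdict

docs = {
    0: "brown shirts with stripes",
    1: "plain brown shirts no pattern",
    2: "blue striped shirts",
}

# Inverted index: word -> set of document ids containing that word.
index: defaultdict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def keyword_search(query: str) -> set[int]:
    """Require every query word; real engines relax this, which is why
    'shirts without stripes' ends up returning striped shirts anyway."""
    hits = [index.get(word, set()) for word in query.split()]
    return set.intersection(*hits) if hits else set()

print(keyword_search("shirts without stripes"))  # 'without' matches nothing
```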
But the idea was: with transformers, what if you could turn each document not into a set of keywords but into embeddings? And these embeddings can be arbitrarily powerful. An embedding is just a list of numbers, and it can represent lots of information. An embedding doesn't just capture the words in the document, but also the meaning, the ideas in the document, and the way people refer to that document on the web. An embedding can be arbitrarily big, so in the limit it would just destroy keywords; you have this arbitrarily powerful representation. That was the fundamental idea, basically the bitter lesson: what if we could train transformers to output embeddings for documents, and if we keep getting more and more high-quality data, we could make a search engine that actually understands you. The way it works at inference, at search time, is that a query comes in, like "shirts without stripes". Traditional search engines would do the thing above, a very fancy keyword comparison plus a bunch of other things; instead, we just embed "shirts without stripes" and compare it to the embeddings of all the trillion documents.
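And a matching sketch of the embedding approach; the off-the-shelf sentence-transformers model here is a stand-in (Exa trains its own models), so treat the exact ranking behavior as illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Off-the-shelf encoder as a stand-in for a custom-trained one.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Classic striped cotton shirt",
    "Solid navy shirt, no pattern",
    "Plain white tee, completely unpatterned",
]
# Pre-compute document embeddings once, at index time.
doc_vecs = model.encode(docs, normalize_embeddings=True)

# At search time, embed the query and rank by cosine similarity.
query_vec = model.encode("shirts without stripes", normalize_embeddings=True)
scores = doc_vecs @ query_vec  # vectors are normalized, so dot == cosine
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {docs[i]}")
```

The point is that the comparison happens in meaning-space, so "without stripes" can actually pull the unpatterned shirts toward the top instead of matching the word "stripes" literally.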
And after a year and a half, we actually had a new search engine that worked in a very different way. You search "shirts without stripes" on Google, sorry, on Exa, and you get a list of results that actually do not have stripes. It's a simple example, but it can handle way more complex queries, like paragraph-long queries. When we launched this in November 2022, we got a lot of excitement on Twitter. It was a very new paradigm for search; you could do all sorts of interesting queries that you couldn't do before. And then two weeks later, this happened. It was a small tweet. This is a visual depiction of San Francisco at the time; you probably all remember this. And this is a visual depiction of the Exa team at the time, because ChatGPT completely changed the way we interact with the world's information. Everyone can now use an LLM to just talk to their computer and get information. And we were thinking: wait, is there even a role for search in this world? These LLMs are so powerful. And then very quickly we realized yes, there is a role, because LLMs don't know everything on the web. For example, if you ask an LLM like GPT-4 to find cool personal sites of engineers in San Francisco, it can't; it just doesn't have that in the weights, so it'll apologize or whatever. And there's a very simple information-theory argument here: there literally isn't enough information in the weights of GPT-4 to store the whole web. We don't know exactly how many parameters GPT-4 has, I think someone leaked it on YouTube once, but it's a couple trillion parameters; call it less than 10 terabytes in the weights. And the internet is over a million terabytes, and that's just the documents on the web; there are also images and video, which are way more. Actually, I did a tweet recently about the size of the web, and it's in the exabyte range. And our name is Exa. That's not a coincidence. Anyway, so LLMs need to search the web, just from this simple argument, and they're going to need to for a long time. If you talk to ML researchers, they'll say the same thing: it's just too hard to fit the web into the weights. Also, the web is constantly updating; that's another problem. It's not just the size of the web, it's the constant updating of the web that makes this very tricky. So LLMs will always need search. That's great. And when you combine an LLM with a search engine like Exa, you can handle these queries. So for "find me cool personal sites of engineers in SF", the LLM will search Exa, get a list of personal sites, and then use that information to output the perfect thing for the user. You're all very familiar with this LLM-plus-search pattern; it's obvious now, everyone knows about it.
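A minimal sketch of that LLM-plus-search loop, assuming the exa_py and openai client libraries; the model name and the exact method signatures here are assumptions drawn from the public docs, not something stated in the talk:

```python
from exa_py import Exa      # pip install exa-py
from openai import OpenAI   # pip install openai

exa = Exa("EXA_API_KEY")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Find me cool personal sites of engineers in San Francisco"

# 1. The answer isn't in the model's weights, so search the web first.
search = exa.search_and_contents(question, num_results=10, text=True)
context = "\n\n".join(r.text[:1000] for r in search.results)

# 2. Hand the retrieved pages to the LLM to compose the final answer.
reply = llm.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{
        "role": "user",
        "content": f"Using these pages:\n{context}\n\nAnswer: {question}",
    }],
)
print(reply.choices[0].message.content)
```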
But now, let me tell you a secret about search that most people don't know. The secret is that traditional search engines were not built for this world of AI. Traditional search engines were built for humans, and humans are very different from AIs. Every search engine, Google, Bing, you name it, was built in a different era for this kind of creature: this slow flesh human that types keywords, wants to read a few links, and really cares about the UI of the page. It's a lazy human. It types simple keywords. Google is great for this creature; Google was optimized for this creature. It gives you exactly the kinds of things you would click on. But AIs are very different. An AI can gobble up information like crazy; this is a much slowed-down version of what our AIs probably feel like inside. AIs want to use complex queries, not simple ones, to find not a couple of links but tons of knowledge, as much knowledge as they can get, because they actually have the patience to analyze it all extremely fast. And the search algorithm that's optimal for this type of creature is not the same algorithm that's optimal for the human. It would be crazy if the algorithm that was optimal for humans were also optimal for AIs. And a lot of the search tools we're talking about these days on Twitter and elsewhere are still using the old traditional search, combined with AI. It's just not the right fit. So at Exa we're really trying to think about what the right search engine for this AI world is. Let me give a few examples of how AIs are different. First, AIs want precise, controllable information. By the way, when I say AI, I'm usually talking about an AI product. So imagine a VC that's using an AI system to find a list of companies to invest in. They're looking for the next big thing, the next big thing that feels like Bell Labs. When they tell their AI what they want, the AI will go search a search engine, right? And if it searches a search engine like Google, it'll get a list of results that humans like to click on, but that's not very information-dense, and it doesn't even match what the AI asked for. The AI asked for startups working on something huge that feels like Bell Labs; it should get a list of startups. It's kind of a crazy idea, but what if search engines actually returned exactly what you asked of them, and not what Google knows you will click on? With AI especially, you just want a search engine that returns exactly what it's asked for. Because what the world is really going to look like is: you interact with your AI agent, you ask for something, and then it makes tons of searches. Okay, maybe they want startups working on something similar to Bell Labs; maybe they want startups only in New York City that have this quality and that quality. It'll do all sorts of searches, and it just wants a search API that does what it asks. So you need a search engine like that, and Exa is like that. Another difference between AIs and humans is that AIs want to search with lots of context. Again, if you have an AI assistant and you talk to it all day and then ask for restaurants or apartments or whatever, the AI has lots of context on you. So it should be able to search with a large, multi-paragraph query: "my human is a software engineer, they like these types of things, can you give me restaurants that match those preferences?" You need a search engine that can literally handle multiple paragraphs of text. But traditional search engines like Google were not meant to do that, because humans would never type in multiple paragraphs; they're too lazy. Google was optimized for simple keyword queries; I think Google has a limit of a few dozen keywords, whereas Exa can handle multiple paragraphs of text. Another big one is that AIs want comprehensive knowledge. If you give a human 10,000 links or 10,000 pages, they don't know what to do with that; it would take ten days of extreme patience to process it all. But an AI can do it in three seconds if it's parallelized. So if I'm a VC and I want a report on all the companies in a space, I want literally all the companies, and there's a huge amount of value in getting truly all of them, not just the 10 or 20 that Google is able to find. So you need a search engine that exposes the ability to return a thousand, ten thousand, whatever it is, and that also has the semantic ability so that when you say "every startup funded by YC working on AI", you actually get all of them. Google literally just can't do this at all. Okay. I hope that through these examples you see that the space of possible queries is actually way larger than people realize. Until 2022, we were kind of in this top-left blue world; this circle is the space of possible queries, and the blue regions are specific subsets of that space. We were all in that top-left corner for a long time, where search engines could handle basic keyword queries like "stripe pricing" or someone's GitHub page or "Taylor Swift boyfriend" or whatever. After 2022, everyone started to want the top-right blue circle: "hey, actually, I want to make queries like 'explain this concept to me like I'm a five-year-old' or 'here's my code, can you debug it?'"
That form of query doesn't require search, but it's another type of query that was introduced to the world in 2022. Then there are other types, like semantic queries: "people in San Francisco who know assembly". As far as I'm aware, Exa kind of introduced this kind of query, and it does really, really well on them. And then there are the really complex queries, like "find me every article that argues X and not Y, from an author like Z". We're starting to have systems, like Exa's Websets product, that can handle these things. I think this is actually a huge space, because it turns the web into a database you can filter however you want, and that's really what AIs want: this full-control, database-like query system where they can get whatever they need for their user. And then there are the queries no one has thought of yet. Every week we get tons of queries, and we go, "wait, that's a really interesting type of query that no search engine could do right now", and eventually we'll try to handle all the queries that are possible. There are so many new types of queries now, because we have these AI systems and the expectations have gotten way higher. Okay, so now we end our story with the same slide: one API to get any information from the web. Exa is trying to handle not just the keyword queries, but also the semantic queries, and also the super-complex queries, and eventually all queries. We want one API that can give these AI systems whatever knowledge they want: you have the AI, and you have Exa providing the knowledge. Oh, I only have four minutes. Okay. How do I change to the code editor? Oh, cool. Okay. Well, first, a very quick exploration: this is our search dashboard, where we can try different queries. I'd just point out that in the search API endpoint we expose lots of different toggles. First of all, you try out a query, it shows you the code, and it gets you a list of results. And it exposes tons of different filters you might want: for example, the number of results, 10, 100, a thousand, whatever. You can have date ranges, or say "I only want to search over these domains". It's a lot of toggles, but the point is that you actually want the toggles, because your AI is going to be calling this; you want a search engine that gives you full control. And we have both neural and keyword search, so you can try different ones.
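A sketch of those toggles as they'd look in code, assuming the exa_py client; the parameter names reflect Exa's documented search options as I understand them, so double-check them against the current API:

```python
from exa_py import Exa

exa = Exa("EXA_API_KEY")

# Neural search with the kinds of toggles shown in the dashboard.
results = exa.search(
    "startups working on something huge that feels like Bell Labs",
    type="neural",                       # or "keyword" for Google-style matching
    num_results=100,                     # 10, 100, 1000...
    include_domains=["ycombinator.com"], # restrict the search to given sites
    start_published_date="2024-01-01",   # date-range filter
)
for r in results.results:
    print(r.title, "-", r.url)
```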
Okay, let me quickly jump to the code. I prepared this agent, agent.py. We made this agent, Agent Mark, and Mark loves to make markdown out of things. Anything you give it, it will make markdown. Mark will make markdown. So in this case, let's try this query: "personal site of an engineer in San Francisco who likes information retrieval". This is the kind of query that neural search is a lot better at. Okay, so it's making a query to get a list of personal sites of engineers in San Francisco who like information retrieval, and Mark, the agent, is just making a markdown output of that. That's a very neural type of query. You also might want to do a different type of query, a more keyword-heavy one. Let's see: "my GitHub". Here I'd want to make a keyword query, so you just change it to keyword search. It's going to get information from my GitHub using keyword search, because this is a very typical, Google-like search that would work well, right? Go. Okay, cool, that's information about Wilbur's GitHub. And then, when you're actually building an agent, you're going to be combining lots of different types of searches: neural searches, keyword searches, and all sorts of other searches that Exa exposes. The right agent in the future is going to be a system that decides what type of search it needs for whatever the user says, like: okay, I'm going to make a neural search to get a list of things, and then for each one I'm going to do a keyword search. You want to give the agent full access to the world's information in whatever way it wants, not just keyword search but all these other things. So here I one-shotted a GitHub agent with o3, which combines these two queries. I want the GitHub of every engineer in San Francisco who likes information retrieval, so the agent makes a neural search to get a list of people, extracts the names, and then searches those names with a keyword search to get their GitHubs. And if you run that, here it's just getting 10 results, but with Exa we could do 100 or a thousand if you're on an enterprise plan. So now it's getting all the GitHub info. Cool. That's just one example.
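A rough reconstruction of that two-stage agent, again assuming the exa_py and openai clients; the prompt and the name-extraction step are hypothetical simplifications of what was demoed:

```python
from exa_py import Exa
from openai import OpenAI

exa = Exa("EXA_API_KEY")
llm = OpenAI()

# Stage 1: a neural search for the people themselves.
people = exa.search_and_contents(
    "personal site of an engineer in San Francisco who likes information retrieval",
    type="neural", num_results=10, text=True,
)

# Extract one name per page with the LLM (deliberately crude).
prompt = "Return just the person's name from each page, one per line:\n\n" + \
         "\n---\n".join(r.text[:500] for r in people.results)
names = llm.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": prompt}],
).choices[0].message.content.splitlines()

# Stage 2: a keyword search per name -- the Google-style query type.
for name in names:
    hit = exa.search(f"{name} GitHub", type="keyword", num_results=1)
    for r in hit.results:
        print(f"{name} -> {r.url}")
```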
And there are lots of other things you can do with Exa. We actually just today launched this research endpoint, where it will do as many searches and LLM calls as it needs in the background to get you that perfect report, or that perfect structured output, for the thing you asked for. It's kind of like a deep research API, a state-of-the-art deep research API. Cool. That is the talk. Hope that was interesting. Thank you. All righty, nice job, Will. All right, last but certainly not least: I think there's a typo up there, it says Google, but it should say Pyabs, right? Yeah, Pyabs. David here is going to walk us through everything from BM25 to the most complex RAG. So, super excited for this one. Big shout-out to David; please help me give him a warm welcome. Please don't send me back to Google. All right, I'll give you all a little bit of context. My co-founder and I, and a lot of our team, were actually working on Google Search, and then we left and started Pyabs. I loved the Exa talk; we're all nerds for information and search, and this is going to be a little bit of that. I'm going to go through a whole bunch of ways you can actually shore up and improve your RAG systems. I think one thing that I personally sometimes struggle with is that a lot of the talk out there is too much in the weeds: oh, specific techniques, you can do RL this way, you can tune the model this way. It doesn't help me orient in the space: what are all these things, and how do they hang together? Or you get the complete opposite, a whole bunch of buzzwords and hype: RAG is dead; no, RAG is not dead; it's all agents; wait, what? So a lot of what I'll do today is what I call plain English: trying to set up a framework, very centered around, okay, if you are trying to shore up the quality of your system, how do you do that, where do all the things you hear about day in, day out fit, and how do you approach it, with a lot of examples. One thing I always love, and we always did at Google and always do at Pyabs, is just look at things: look at cases, look at queries, see what's working, see what's not working. That's really the essence of what we used to call quality engineering at Google. Are we good? Let's see. All right, perfect. And the timer has not started, so this is us. All right, perfect. If you want the slides, there are about 50 of them, and I set a challenge for myself to get through 50 slides in 19 minutes. You can catch the slides here; I'll flash this towards the end as well, and it should point to the slides we're going through. As I mentioned: plain English, no hype, no buzz, no debates. All right, so how to think about techniques. Before we go through techniques and get into the weeds of it: why does this even matter? The way we always think about it is: always start with outcomes. You're always trying to solve some product problem, and generally the best way to visualize this is that you have a certain quality bar you want to reach. There was a very interesting talk this week about how benchmarks aren't really helpful, but evals absolutely are. Say you're trying to launch a CRM agent: you sort of have a launch bar, a place where you feel comfortable that you can actually put it out into the world. And techniques fit somewhere here: you have that end metric, you're trying to come up with different ways to shore up the quality, and those ways are the techniques. This is sort of your own personal benchmark: you start with the easy bars you want to hit, and then there are medium benchmarks and hard benchmarks. These are query sets you're setting up.
Then, depending on what you want to reach and in what time frame, you end up trying different things. This is what we call the quality engineering loop. You baseline yourself: okay, I want a CRM agent, this is the easy query set, and this is where my quality is with the simplest thing I can try. Then you do a loss analysis: okay, what's broken? There were a lot of eval talks this week about that. And then comes what we call quality engineering. Now, the reason I say all this is that techniques fit in this last bucket, and one of the biggest problems is that people sometimes start there, and it doesn't make any sense. You say, do I need BM25, or do I need vector retrieval? I don't know; what are you trying to do, what are your query sets, and where are things failing? Many times you don't actually need these things, and you end up implementing them anyway, which doesn't make a lot of sense. So the thing I usually say is what I call complexity-adjusted impact, or: stay lazy. Always look at what's broken; if it's not broken, don't fix it, and if it is broken, do fix it. We'll go through a lot of techniques today, but this is a good way to think about them. It's a catalog of stuff, and the most important two columns are the ones on the right: difficulty and impact. If it's easy, go ahead and try it. BM25 is pretty easy; you should absolutely try it, and it does shore up your quality quite a bit. But should I build custom embeddings for retrieval? I don't know, let's take a look; that's actually really, really hard. Harvey gave a talk: they built custom embeddings, but they have a really hard problem space where plain relevance embeddings don't do enough for them, and they were willing to put in all that work and effort. All right, queries, examples, lots of stuff. First technique: in-memory retrieval. The easiest thing: take all your documents and shove them into the LLM. This is the whole "is RAG dead, is RAG not dead" context-windows debate. Well, context windows are pretty easy, so you should definitely start there. One example is NotebookLM, a very nice product: you put in five documents and just ask questions about them. You don't need any RAG; just shove the whole thing in. Now, the context might get too long, and this is where it breaks, right? Maybe things don't fit in memory, or maybe you just stuff the context window too full. So this is where you start to see: oh, that's what's happening, I have too many documents; oh, that's what's happening, the documents are not attended to properly by the LLM; here are the five things that are breaking. Okay, great, let's move to the next one. So now you try something very simple: can I retrieve just based on terms? So, BM25. What is BM25? BM25 is basically four things: which query terms match, the frequency of those query terms, the length of the document, and how rare each term is. It's a very nice thing.
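A minimal sketch using the rank_bm25 package, which scores documents from exactly those four ingredients:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "iPhone battery life tips and tricks",
    "How to replace an iPhone battery",
    "Android phones with the best battery life",
]
tokenized = [doc.lower().split() for doc in corpus]

# BM25 combines: which query terms match, how often they occur (term
# frequency), document length normalization, and term rarity (IDF).
bm25 = BM25Okapi(tokenized)

query = "iphone battery life".split()
print(bm25.get_scores(query))              # one score per document
print(bm25.get_top_n(query, corpus, n=1))  # best-matching document
```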
It actually works pretty well, and it's very easy to try. But it has a problem: when queries don't have that keyword nature, like the Exa folks were saying, it doesn't work. This is where you bring in something like relevance embeddings, which are pretty interesting because now you're in vector space, and vector space can handle way more nuance than keyword space. But they also fail in certain ways, especially when you're looking for exact keyword matching. And it's actually pretty easy to know when each works and when it doesn't. I went to ChatGPT and asked: give me a bunch of queries, ones that work with standard term matching and ones that need relevance embeddings, and you can see exactly what's going on. If your query stream looks like "iPhone battery life", then you don't need vector search. But if it looks like "how long does an iPhone last before I need to charge it again?", then you absolutely need something like vector search. You need to be tuned to what every technique gives you before you go invest in it; when you do your loss analysis and you see that most of your queries actually look like the ones on the right-hand side, then you should absolutely start investing in this area. All right: you did BM25, you did vectors, because your query sets look exactly like that, and now you have a combined candidate set. This is where rerankers help quite a bit. When people say rerankers, they're usually referring to cross-encoders, which is a specific architecture. Remember, the architecture for relevance embeddings gives you one vector for the query and one vector for the document, and you just measure distance. Cross-encoders are more sophisticated: they take both the query and the document and give you a score while attending to both at the same time, which is why they're much more powerful. But they're also pretty expensive, and that's a failure state as well: you can't run them on all your documents. So you end up with this setup where you retrieve a lot of things cheaply and then rank a smaller set with a technique like this. But it is really powerful, and you should use it.
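A minimal reranking sketch with a sentence-transformers cross-encoder; the candidate list stands in for whatever BM25 and vector retrieval returned:

```python
from sentence_transformers import CrossEncoder

query = "how long does an iphone last before I need to charge it"
# Candidates from the cheap first stage (BM25 + vector retrieval).
candidates = [
    "Average battery life of recent smartphones, compared",
    "iPhone 15 colors and storage options",
    "How to charge your laptop faster",
]

# The cross-encoder scores (query, document) pairs while attending to both
# at once -- more accurate than comparing two independent vectors, but too
# expensive to run over the whole corpus, so only the top candidates get it.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

for score, doc in sorted(zip(scores, candidates), key=lambda p: -p[0]):
    print(f"{score:.2f}  {doc}")
```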
It does fail in certain cases, though, and when you hit those cases you move to the next thing. So where does it fail? It's still relevance, and this is a big problem with standard embeddings and standard rerankers: they only measure semantic similarity. These are all proxy metrics in the end. Your application is your application, your set of information needs is your set of information needs, and you try to proxy them with relevance. But relevance is not ranking. This is something we learned in Google Search maybe fifteen or twenty years ago: what brings the magic of Google Search? It looks at a lot of things other than relevance. The talk from Harvey and LanceDB was really, really interesting here; he gave the example of this query. It has so much semantics specific to the legal domain that it's impossible to capture with just relevance. What does a word like "regime" mean? It has a very specific meaning in legal usage. "Material"? That also has a very specific legal meaning. And there are things very specific to the domain that need to be retrieved, like laws and regulations. This is where you get to building custom embeddings: you say, fetching on relevance is not enough for me, I need to model my own domain in its own vector space so I can actually fetch these things. Again, go back to ChatGPT: is this interesting, should I even do it? I asked it for a list of things that would fail in a standard relevance search in the legal domain, and you start to see: words like "moot" don't mean the same thing, words like "material" don't mean the same thing, and when your vocabulary is that specific and that far off, you will not get good results. So how do you decide? Again, you need evals, you need query sets, you need to look at what's breaking and determine that the failures have to do with your vocabulary being out of distribution for a standard relevance model. Don't agonize in the abstract about whether to do it: what are your queries telling you, what is your data telling you? Then go do it, or don't. There's also an example from shopping. Embeddings are very interesting because they help you a lot with retrieval and recall, but you still need good ranking, right? And if relevance doesn't cut it for retrieval, it probably doesn't cut it for ranking either. This is an example I pulled from Perplexity; I was just trying to break it today, and it didn't take much. I asked for cheap gifts for my son, and then followed up with "but I have a budget of 50 bucks or more", because when I said cheap it started giving me $10 gifts. Well, cheap for me is like $50; it didn't know that, which is fine, so I told it. But when I said $50 or more, it still gave me $15 and $40 items, both of which are below $50. And this is kind of interesting, right? Because in standard information-retrieval terms, this is a signal.
It's a price signal, and it's not being caught: it's not being translated into the query, and it's definitely not being translated into the ranking. So now you have to think: I have ranking, and I need the ranking to see the semantics of my corpus and my queries. That has a very specific meaning: when you think of your corpus and queries, again, it's not just relevance. Relevance helps you with natural language, but things like price signals, merchant signals, or, if you're doing podcasts, how many times an episode has been listened to, are very important signals that have nothing to do with relevance. In many, many applications you'll see that things that are more popular, for example, tend to rank more highly. And the PageRank algorithm was mentioned earlier: PageRank is not about relevance, it's about prominence. How many things outside my document point to me? That has nothing to do with relevance and everything to do with the structure of the web corpus, the shape of the data. So it's a signal about the shape of the data, not about relevance. The best way to think about it: you have horizontal semantics, and you have vertical semantics. If you're in a vertical domain where the semantics are very verticalized, say you're doing a CRM or you're doing emails, and you're trying to hit a very complex bar that's way beyond natural language, understand that relevance will be a tiny, tiny part of the semantic universe, and the harder you push, the more you're going to hit this wall. All right, things break again; I'm sorry, at sufficient complexity things will keep breaking. Now, the thing that breaks even custom semantics is user preference. Because even when you get all of this, okay, I'm doing relevance, I'm doing price signals and merchant signals, I know the shopping domain, well, now you don't know the shopping domain, because now users are using your product, and they're clicking on things you thought they wouldn't click on and not clicking on things you thought they would. This is where you bring in the click signal, the thumbs-up/thumbs-down signal. Now, these things get very complex, so we're not going to talk about how to implement them, but in this case, for example, you'd build a click-through prediction signal, and then you combine it with all your other signals. So now if you look at your ranking function, it's saying: I want this to be relevant, I want this semi-structured price signal and the query understanding related to it, plus I want the user preference. You take all those signals and add them up, and that becomes your ranking score. It becomes a very balanced function, and this is how you go from "it's just relevance" to "no, it's not just relevance" to "it's my semantics and my user preferences all rolled up into one."
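A toy sketch of such a blended ranking function; the signals and weights here are entirely hypothetical, and in practice the weights would be tuned against evals or learned:

```python
# Each signal is pre-normalized to [0, 1]: embedding relevance, whether the
# item satisfies the stated price constraint ("$50 or more"), popularity
# (listen counts, PageRank-style prominence...), and predicted click-through.
candidates = [
    {"title": "Lego set, $45",  "relevance": 0.9, "price_match": 0.0,
     "popularity": 0.8, "ctr": 0.7},
    {"title": "Telescope, $60", "relevance": 0.8, "price_match": 1.0,
     "popularity": 0.5, "ctr": 0.6},
]

# Illustrative weights -- in a real system these come from tuning or learning.
WEIGHTS = {"relevance": 0.4, "price_match": 0.3, "popularity": 0.15, "ctr": 0.15}

def ranking_score(item: dict) -> float:
    """The 'balanced function': a weighted sum of all the signals."""
    return sum(w * item[name] for name, w in WEIGHTS.items())

for item in sorted(candidates, key=ranking_score, reverse=True):
    print(f"{ranking_score(item):.2f}  {item['title']}")
```

Note how the telescope outranks the slightly more "relevant" Lego set because it actually satisfies the price constraint; that's the point of adding signals beyond relevance.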
I'll mention two more things. The first: calling the wrong queries. This happens a lot as you get into orchestration and try to do complex things, especially now that you have agents and you're telling them to use a certain tool. It happens because there's an impedance mismatch between what the search engine expects, say you tuned it to expect keyword queries, or even more complex queries, and what the LLM does: you cannot describe all of that to the LLM, and the LLM is reasoning about your application and making queries by itself. This is a big problem. So one thing we've seen many companies do, and we did this at Google too, is take more control of the actual orchestration: you take the big query and make N smaller queries out of it. I took this screenshot from AI Mode in Google, and it's very brief, you have to catch it before the animation goes away, but you can see it's actually making N queries: 15 queries, 20 queries. This is what we call fan-out: take a very complex thing, figure out all the subqueries inside it, and fan them out. Now you might think: why isn't the LLM doing this? The LLM is kind of doing it, but the LLM doesn't know about your tool. It doesn't know enough about your search engine. I love MCP, but I'm not a big believer that you can teach the LLM, just through prompting, what to expect from the search backend on the other end. This is why people still ask: is the agent autonomous, or do I need workflows? It's very complicated, and it will take a while to be solved, because it's unclear where the boundary is. Should the search engine handle more complex things, so the LLM can throw anything its way? Or the other way around, where the LLM has more information about what the search engine can support, so it can tailor its queries? Right now you need control, because the quality is still not there. So it looks like this: you have this assistant input, and you're turning it into narrow queries. For example, "was David working on this?" has very specific semantics, and it's really more like "Jira: David", "Slack threads: David". It's very hard to know, without knowing enough about your application, that those are the queries that matter, and not the ones on the left-hand side. If you send the thing on the left-hand side to a search engine, it will absolutely tip over unless it understands your domain. And this is where you need to calibrate the boundary.
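A minimal fan-out sketch, with a hypothetical search_backend stub standing in for the Jira/Slack/email backends; the prompt and model name are assumptions, not anything shown in the talk:

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

llm = OpenAI()

def generate_subqueries(user_request: str) -> list[str]:
    # The orchestrator, which knows the backends, decides the fan-out --
    # rather than hoping the LLM guesses what each tool expects.
    reply = llm.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content":
                   "Rewrite this request as short, narrow search queries, "
                   "one per line, suitable for keyword backends:\n"
                   + user_request}],
    )
    return [q for q in reply.choices[0].message.content.splitlines() if q.strip()]

def search_backend(query: str) -> list[str]:
    # Hypothetical stub: Jira / Slack / email search would be called here.
    return [f"stub result for: {query}"]

request = "Was David working on this?"
subqueries = generate_subqueries(request)  # e.g. "Jira David", "Slack threads David"
with ThreadPoolExecutor() as pool:
    results = list(pool.map(search_backend, subqueries))
print(results)
```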
Okay, so now you're asking all the right queries. But are you asking them of all the right backends? This is another place where things fail, and it leads to a technique I call supplementary retrieval. This is something we notice clients do quite a bit: they don't call search enough. Sometimes people try to over-optimize. When you're trying to get high recall, you should always be searching more. Just search more; this is similar to what we said about in-memory retrieval: just give more things. It never fails to give more things. In the session description we mentioned this one query, a very simple Middle Eastern dish, which was really hard to do, and it stumped an organization of 6,000 people. Oh my god, what's so hard about this query? What's so hard is that it's an ambiguous intent: you need to reach out to a lot of backends to actually understand enough about it, because you might be asking about food, at which point I want to show you restaurants, or you might be asking for pictures, at which point I want to show you images. What Google ended up doing is querying all the backends and then putting the whole thing in. I'd recommend this as a great technique to increase recall even more: just call more things, and don't try to be skimpy, unless you're running into some real cost overload. And that's the last one: you're running into cost overload. GPUs are melting. I was going to generate an image, but then I realized there's actually a pretty good image that is real: somebody took a server rack and threw it off a roof. I didn't need to go to ChatGPT to generate this image; apparently it was an advertisement, a pretty expensive one. All right. So this happens a lot when you get to a certain scale: you have all these backends, you're making all these queries, and it's getting very, very complex. Google's there, Perplexity's there; Sam Altman keeps complaining about GPUs melting. This is the part where you need to start doing distillation, and distillation is a very interesting thing, because to do it you have to learn how to fine-tune models, and this gets to be a little complex. You have to hold the quality bar constant while you decrease the size of the model. The reason you can do that is, like in that graph: "hey, hire me, I know everything"; "actually, I'm firing you, you're overqualified". A very large language model is mostly overqualified for the task you want to do, because what you really want to do is just one thing. Perplexity, for example, is doing question answering, and they're really fast; when you use Perplexity in a search context, they're really, really fast, which is amazing, because they trained this one model to do one very specific thing: be really, really good at question answering. This is very hard, so I wouldn't do it unless latency becomes a really important thing for your users, like: the thing takes ten seconds and users churn, but if I can make it two seconds, users don't churn. Actually, that's a really great place to be, because then you can use this technique and bring everything down.
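For reference, here's the standard distillation objective he's gesturing at, as a minimal PyTorch sketch; this is generic Hinton-style distillation under assumed toy shapes, not anything specific to Perplexity's setup:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hold the quality bar: match the big teacher's softened output
    distribution while still fitting the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: a batch of 4 examples, 10 classes.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)             # frozen teacher outputs
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```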
All right. You've done everything you can, and things are still failing. What do you do? We have a bunch of engineers here; what do you do when everything fails? Yes, you blame the product manager. It's the last trick in the book: when everything fails, make sure it's not your fault. But I'll say there's something really important here. Quality engineering will never get to 100%. Things will always fail; these are stochastic systems. So then you have to punt the problem, punt it upwards. It's kind of a joke, but it's not a joke: the design of the product matters a lot to how magical it can seem, because if you try to be more magical than your product surface can absorb, you will run into a bunch of problems. I'll use a very simple example. A more complex one would be a human in the loop for customer support: some cases the bot can handle on its own, but then it needs to punt to a human. That's basically UX design, right? When do you trust the machine to do what the machine needs to do, and when does a human need to be in the loop? But here's a much simpler example, from Google Shopping. There are cases where Google has a lot of great data, so the fidelity of the understanding is really high, and then it shows what we call a high-promise UI: I'll show you things, you can click on them, there are reviews, there are filters, because I understand this really well. And there are things Google does not understand at all; mostly they're web documents, a bag of words. What's really interesting is that the UI changes. If you understand more, you show a more filterable, high-promise UI. If you don't understand enough, you degrade your experience, but you degrade it to something that is still workable: I'll show you ten things, you choose. Versus: no, I know exactly what you want, I'll show you one thing. This is really, really important, and it has to be part of every product: there's only so much engineering you can do until you have to actually change your product to accommodate this stochastic nature. So gracefully degrade, and gracefully upgrade, depending on the level of your understanding. And again, I'll flash these two slides at the end: always remember what you're doing, because you can absolutely get sucked into theoretical debates. Context window versus RAG, this versus that, agents versus whatever. Everything is empirical in this domain when you're doing this sort of thing: I have my evals, I'm trying to step up level by level, I have a toolbox at my disposal. Everything is empirical. So again: baseline, analyze your losses, then look at your toolbox and ask, are there easy things here I can do? If not, are there at least medium things? If not, should I hire more people and do some really, really hard things? But always remember the choice is yours, and you should be principled, because this can be an absolute waste of time if you're doing it too far ahead of the curve. All right, the slides are here. Oh, I achieved it: 30 seconds left. If you want the slides, they're here, and reach out to us; we're always happy to talk. I was very happy with the Exa talk, because it's always nice to find friends who are nerds about information retrieval; we are too. So reach out, and we're happy to talk about RAG challenges and such, and some of the models we're building. All right, thank you so much. Awesome. Thank you so much, David. All right, thank you everyone so much for joining the search and retrieval track. That's it for today, but enjoy the rest of your last day of the AI Engineer World's Fair. Thanks for coming.