AI Engineer World’s Fair 2025 - Retrieval + Search

Channel: aiDotEngineer

Published at: 2025-06-05

YouTube video id: a0TyTMDh1is

Source: https://www.youtube.com/watch?v=a0TyTMDh1is

All righty. You do operations on top of the files, and it could be structured querying, right? Querying a more structured database to get aggregate insights over the types of data that you've extracted out.
One top consideration when actually building this type of toolbox is complex documents. For those of you who follow our socials, we talk a lot about this issue: a lot of human knowledge lives in really complicated PDFs and other formats too. Embedded tables, charts, images, irregular layouts, headers, footers. This is stuff that's typically designed for human consumption, not machine consumption. And so if the documents are not processed correctly, no matter how good your LLM is, it will fail.
So we were probably one of the first teams to realize that LLMs and LVMs (large vision models) could be used for document understanding. In contrast to more traditional techniques, where you use hand-tuned, task-specific ML models to do document parsing over a specific class of documents, LLMs have a much more general level of accuracy that you can use to your advantage in understanding and inhaling any type of document with any level of complexity. Obviously the baseline these days is that you can just screenshot a PDF and feed it into ChatGPT or Claude. It doesn't actually give you amazing accuracy, but it's a good start. One of the secret-sauce magic tricks we found was figuring out how to interleave LLMs and LVMs with more traditional parsing techniques, and adding test-time compute in the form of agentic validation and reasoning, to really get a higher level of accuracy.
And so we have a cloud service that does document parsing, and it's a core step of this document toolbox. We've benchmarked our modes, where we adapt models like Claude Sonnet 3.5 and 4.0, Gemini 2.5 Pro, and GPT-4.1 from OpenAI, and it basically outperforms all existing parsing benchmarks and tools out there, from open source to proprietary.
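To make the interleaving idea concrete, here's a minimal sketch of combining a traditional parser with a vision model plus an agentic validation pass. This is an illustration of the general idea, not LlamaIndex's actual pipeline; `call_vision_llm` and `call_llm` are placeholders for whatever model client you use.

```python
# Sketch: interleave a vision LLM with a traditional parser, then agentically validate.
# Illustrative only; `call_vision_llm` and `call_llm` are placeholder model clients.
import pdfplumber
from pdf2image import convert_from_path

def parse_page(pdf_path: str, page_num: int) -> str:
    # 1) Traditional parsing: cheap, deterministic text extraction.
    with pdfplumber.open(pdf_path) as pdf:
        raw_text = pdf.pages[page_num].extract_text() or ""

    # 2) LVM pass: render the page to an image and ask a vision model for a
    #    structured (e.g., markdown) reconstruction of tables and layout.
    page_image = convert_from_path(pdf_path, first_page=page_num + 1,
                                   last_page=page_num + 1)[0]
    lvm_markdown = call_vision_llm(
        image=page_image,
        prompt="Transcribe this page to markdown, preserving tables and headers.")

    # 3) Agentic validation: spend extra test-time tokens checking the LVM output
    #    against the extracted text; re-prompt if numbers or cells don't line up.
    verdict = call_llm(
        f"Raw extracted text:\n{raw_text}\n\nReconstruction:\n{lvm_markdown}\n\n"
        "Do the figures and table cells agree? Answer AGREE or list discrepancies.")
    if "AGREE" not in verdict:
        lvm_markdown = call_vision_llm(
            image=page_image,
            prompt=f"Re-transcribe the page. Fix these discrepancies:\n{verdict}")
    return lvm_markdown
```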
Yeah, so some of you might know us as a RAG framework. That's basically how we started. For those of you who don't know, we also have a managed platform that is basically this GenAI-native document toolbox. It contains a lot of the operations that you need to do on top of your docs: document parsing and document extraction. It uses some of those capabilities I just mentioned and allows you to parse, extract, and index data for the whole set of tools I just mentioned.

One of the special releases I actually want to highlight today, and we just announced this in a blog post a few hours ago, is Excel capabilities to complement this document toolbox. A lot of knowledge work happens in Microsoft Excel, and also Google Sheets and Numbers; basically, it's spreadsheets, right? But it's been unsolved by LLMs. If you look at the document on the right, neither RAG nor text-to-CSV techniques will actually work over it, because it's not really a structured 2D table. There are gaps in the rows and gaps in the columns.
So we basically built an Excel agent that's capable of taking unnormalized Excel spreadsheets and transforming them into a normalized 2D format, and it also allows you to do agentic QA over both the unnormalized and normalized versions of the spreadsheet. It's a pretty cool capability; I'll describe how it works in just a bit, but it's going to complement our toolbox in terms of more traditional document parsing, extraction, and indexing, and it's available in early preview. If you take a look at the video, it's also in our blog post: we uploaded that example synthetic data set, transformed it into a 2D table, and you can also ask questions over it to get insights. It's really doing the heavy lifting of deeply understanding the semantic structure of the Excel spreadsheet, and then using that and plugging it in as specialized tools to an AI agent. The best baseline is not really RAG or text-to-CSV; those both suck. It's really just an LLM being able to write code, so an LLM with a code interpreter tool is a reasonable baseline; it gets you to 70-75% accuracy. Over a private data set of synthetic Excel sheets, we were able to get this up to 95%. It actually surpasses the human baseline of 90%, of a human trying to go and do the data transformation by hand.

A brief note on how it works.
It's a little bit technical, but more details are in the blog post. First we do some structure understanding of the Excel spreadsheet. We do a little bit of RL, reinforcement learning; we adapt dynamically to the specific format of the document and learn a semantic map of the sheet. By learning a semantic map, we can then translate it into a set of specialized tools that we provide to an agent. From an abstract perspective, an agent could just write code from scratch, and as LLMs get better, that will certainly become a higher-performing baseline. But in the meantime, we're helping it out by providing a set of specialized tools over the semantic map, so the agent can reason over an Excel spreadsheet.
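As a rough, hypothetical illustration of what "specialized tools over a semantic map" could look like (this is not LlamaIndex's actual implementation; the map structure and tool names are made up):

```python
# Hypothetical sketch: turn a learned "semantic map" of a messy sheet into
# narrow tools an agent can call, instead of asking it to write code from scratch.
import pandas as pd

# Example semantic map: where each logical region lives in the raw sheet.
semantic_map = {
    "revenue_by_region": {"header_row": 3, "skip_rows": [5, 9]},
    "headcount":         {"header_row": 14, "skip_rows": []},
}

def normalize_region(raw: pd.DataFrame, region: str) -> pd.DataFrame:
    """Tool 1: return one logical region of the sheet as a clean 2D table."""
    spec = semantic_map[region]
    header = spec["header_row"]
    block = raw.iloc[header + 1:].drop(index=spec["skip_rows"], errors="ignore").copy()
    block.columns = raw.iloc[header]
    return block.reset_index(drop=True)

def aggregate(raw: pd.DataFrame, region: str, column: str, how: str = "sum") -> float:
    """Tool 2: run a simple aggregate over a normalized region."""
    table = normalize_region(raw, region)
    return getattr(pd.to_numeric(table[column], errors="coerce"), how)()

# An agent framework would register normalize_region / aggregate as callable tools.
```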
Great. So we've talked about a document toolbox, and about a lot of operations that make this document toolbox really good and comprehensive. Now that you've plugged it into an agent, what are the different agent architectures, and what use cases are implied by them? As many of you probably know from building agents yourselves, agent orchestration ranges from more constrained architectures to unconstrained architectures. Constrained means you more explicitly define the control flow. Unconstrained is like a ReAct loop, function calling, CodeAct, whatever: you basically give it a set of tools and let it run. Deep research is kind of the same thing.

For us, we've noticed there are two main categories of UXs. There are assistant-based UXs that can surface information, help a human surface information, or produce some unit of knowledge work, usually through a chat-based interface. It's usually chat-oriented; the input is natural language. The architecture is a little more unconstrained, basically a ReAct loop over some set of tools, and it's inherently both unconstrained and has a higher degree of human in the loop. The expectation is that the human is supposed to guide and coax the agent along the steps of the process to achieve the task at hand. I'm sure many of you have built these types of use cases, so this is just a very small subset, but it's basically a generalization of a RAG chatbot.
There's a second category of use cases that I think is interesting, and that a lot of folks are starting to build more into, which is the automation interface. Instead of providing an assistant or copilot to help a human get more information, you're processing routine tasks in a multi-step, end-to-end manner, and usually the architecture is a little different. It takes in some batch of inputs; it can run in the background or be triggered ad hoc by a human. The architecture is a little more constrained, which makes sense, right? If you want this thing to run end to end, you need it to not go off the rails. There's usually a bit less human in the loop at every step of the process and usually some sort of batch review at the end, and the output is structured results, integration with APIs, decision-making; after approval it'll just route to the downstream systems. Some of the use cases here include financial data normalization, data sheet extraction, invoice reconciliation, contract review, and more. I'll skip this video, but there are some fun examples of community-based open-source repos we built in this area, like the invoice reconciler by Laurie Voss.
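For intuition, here is a minimal sketch of what such a constrained, batch automation pipeline might look like, in contrast to an open-ended agent loop. It is purely illustrative; `parse_document`, `extract_fields`, and the rules are hypothetical placeholders, not any particular product's logic.

```python
# Sketch of a constrained, batch "automation agent" pipeline: fixed control flow,
# business rules, structured SQL output, and a human review queue at the end.
import sqlite3

RULES = {"total_must_be_positive": lambda row: row["total"] > 0}

def run_batch(doc_paths: list[str], db_path: str = "invoices.db") -> list[dict]:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS invoices (path TEXT, vendor TEXT, total REAL)")
    review_queue = []                        # items a human will batch-review at the end
    for path in doc_paths:
        text = parse_document(path)          # step 1: parse (e.g., the document toolbox)
        row = extract_fields(text)           # step 2: structured extraction -> dict
        failed = [name for name, rule in RULES.items() if not rule(row)]
        if failed:                           # step 3: business-specific rules
            review_queue.append({"path": path, "row": row, "failed_rules": failed})
            continue
        conn.execute("INSERT INTO invoices VALUES (?, ?, ?)",   # step 4: structured output
                     (path, row["vendor"], row["total"]))
    conn.commit()
    return review_queue                      # human-in-the-loop happens here, in batch
```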
A general idea that has emerged, and that we've noticed as a pattern, is that oftentimes the automation agents can serve as a backend, because they run in the background; they can do the data ETL and transformation. There's still a human in the loop, but they're doing the part where you need to process and structure a lot of data and make decisions in the background, while assistant agents are more front-end facing. So automation agents can structure and process your data and provide the right tool interfaces for assistant agents. Not every tool depends on agentic reasoning, but for a lot of these use cases, like a very generalized data pipeline where you're processing a lot of unstructured context, you might have automation agents go in and process your data and provide the right tools for some sort of user-facing research
interface.

So we've talked about building a document toolbox, and we've talked about the different categories of agentic architectures. Putting it together, here are some real-world use cases of document agents; these are examples of agents that actually help automate different types of knowledge work. One of our favorite examples is a combination of both automation and assistant UXs for financial due diligence. Carlyle is one of our favorite customers and partners. They used some of the core capabilities that we have to build an end-to-end leveraged buyout agent. It requires an automation interface to inhale massive amounts of unstructured public and private financial data, Excel sheets, PDFs, PowerPoints, which go through some bespoke extraction algorithms with human-in-the-loop review. And then, once that data is structured in the right format, it provides a copilot interface for the analyst teams to both get insights and generate reports over that data.
If you look at any enterprise search use case, that typically falls within the assistant UX. SEMX is one of our favorite customers in this space: being able to define a lot of different collections over different sources of data and provide more task-specific, specialized agentic RAG chatbots over your data. It's basically RAG, but you add an agentic reasoning layer on top so that you can break down user queries, do research, and answer the question at hand.
And on the pure automation UX side, we notice a lot of use cases popping up around automation and efficiency. One example is technical data sheet ingestion. We're working with a global electronics company; they have a lot of data sheets that need to be automatically processed and reviewed, and historically it's taken a lot of human effort to do this. By creating the right end-to-end automation agent, you can encode the business-specific logic for parsing these types of documents, extracting out the right pieces of information, matching them against specific rules, and outputting the structured data into SQL. There's human-in-the-loop review, but if we're able to do this end to end, it transforms weeks of technical-writer work into an automated extraction interface.

So that's basically it. For those of you who are less familiar, LlamaIndex is the most accurate, customizable platform for automating your document workflows with agentic AI. Our mission statement has evolved a bit over the past few years; we were a very broad, horizontal framework, oftentimes focused on RAG. If you're interested in some of these capabilities, come talk to us, and please come check us out at booth G1. Thank you.
[Applause]
All right, thank you very much, Jerry from LlamaIndex. Next up we have Chang and Calvin: Chang from LanceDB and Calvin from Harvey.ai, on Scaling Enterprise-grade RAG Systems: Lessons from the Legal Frontier. Give a big, warm welcome to these two.
All right, everybody hear us? Okay, sounds like this is good. All right, thank you everyone. We're excited to be here, and thank you for coming to our talk. My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years. I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of the data that doesn't fit neatly into those pandas dataframes. I'm Calvin. I lead one of the teams at Harvey AI working on tough RAG problems across massive data sets of complex legal docs and complex use cases.
So yeah, our talk is about... one sec, maybe we should have used the other clicker. Yeah. All right, we'll use the laptop. So we're going to talk about some of the tough RAG problems on the legal frontier: the challenges, some solutions, and learnings from our experiences working together on it. We'll start roughly with how Harvey tackles retrieval, the types of problems there are, and the challenges that come with that, around retrieval quality, scaling, security, all that good stuff, and then how we end up creating a system with good infrastructure to support
that.

So first of all, a quick intro to what Harvey is. We're a legal AI assistant. We sell our AI product to a bunch of law firms to help them do all kinds of legal tasks: draft, analyze documents, go through legal workflows. A big part of that is processing data, so we handle data in all different volumes and forms. The different scales are: we have an assistant product with on-demand uploads, the same way you might upload on demand to any AI assistant tool, so that's a smaller, one-to-fifty-document range. We have these vaults, which are larger-scale project contexts. So if there's a big deal going on that the law firm's working on, or a data room where they need all their contracts, all their litigation documents and emails in one place, that's a vault. And then the third is the largest scale, which is data corpuses, which are knowledge bases around the world: legislation, case law of a particular country, all the laws, taxes, and regulations that go into it.
So yeah, some big challenges come with that. One is scale: just very large amounts of data. Some of these documents are super long and dense and packed with content. Sparse versus dense retrieval, I'm sure, is a challenge that all of you deal with: how to represent the data, how to retrieve over it and index it. Query complexity is a big one; we get very difficult expert queries, and I'll show an example of that on the next slide. The data is very domain-specific and complex; there are a lot of nitty-gritty legal details that go into it, so we have to work with domain experts and lawyers to understand it and try to translate that into how we represent the data and how we index, query, and pre-process over it. Data security and privacy is a big one. A lot of this data is sensitive, for confidential deals or, I don't know, IPOs, financial filings, stuff like that, so we have to respect all of that for our clients. And then of course evaluation: how to make sure the systems are actually
good.

So yeah, I'll show a quick demonstration of a retrieval-quality challenge. This is just on the query side: this is maybe the average complexity of a query someone might issue in our product. There are much more complex ones and simpler ones, but this is right in the middle, and you can see that there are a lot of different components that go into it. Let me read it out: what is the applicable regime for covered bonds issued before 9 July 2022 under Directive EU 2019/2062 and Article 129 of the CRR? So, you know, that's a handful. But what goes into it: there's a semantic aspect. There's implicit filtering going on, like we want applicability before a certain date. There's a specialized data set being referenced, which is EU laws and directives. There are keyword matches, like the specific regulation and directive IDs. It's multi-part, in that it's asking how this applies to two different regulations: one directive, one article. And there's domain jargon here, where this is an abbreviation; I forget what it was, Capital Requirements something, I looked it up this morning. But yeah, this is very complex, and we need a system that can tackle all this complexity, break down this query, and use all the appropriate technologies for the different parts of
it.

And yeah, one common question we get in response to this complexity is: how do you evaluate your systems? How do you make sure they're good? That's actually where we spend a ton of time. It's not as much on the algorithms and the fancy agentic techniques, but more on how to validate them. I'd say investing in eval development is a huge key to building these systems and making sure they're good, especially when it's a tough domain that you don't inherently know much about as an engineer or researcher. I'd say there's no silver-bullet eval, but we have a whole range of them, of different task depths and complexities. In one direction you have evals that are higher fidelity but more costly, and in the other direction more automated evals that are faster to iterate on. As an example, the high-fidelity end would be expert reviews: having experts directly review outputs, analyze them, and write reports. That's super expensive but super high quality. Something in between is an expert-labeled set of criteria that you can evaluate synthetically or in some automated way; it's still expensive to curate, maybe a little expensive to run, but more tractable. And the third, the fastest to iterate on, is more automated quantitative metrics, like retrieval precision and recall, and more deterministic success criteria: am I pulling documents from the right folder? Is it the right section? Do they have the right keywords in them? Things like that.
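As an illustration of that fast-iteration end of the spectrum, here is a minimal sketch of deterministic retrieval metrics against lightweight labels (expected folders and keywords). This is a generic example, not Harvey's actual eval harness; the document schema is assumed.

```python
# Minimal sketch of automated retrieval metrics: precision/recall plus
# deterministic checks ("right folder", "right keywords"). Illustrative only.
def precision_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> tuple[float, float]:
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def deterministic_checks(retrieved_docs: list[dict], expected_folder: str,
                         required_keywords: list[str]) -> dict:
    # Each doc is assumed to look like {"id": ..., "folder": ..., "text": ...}.
    from_right_folder = sum(d["folder"] == expected_folder for d in retrieved_docs)
    keyword_hits = sum(
        any(kw.lower() in d["text"].lower() for kw in required_keywords)
        for d in retrieved_docs)
    return {
        "pct_right_folder": from_right_folder / max(len(retrieved_docs), 1),
        "pct_with_keywords": keyword_hits / max(len(retrieved_docs), 1),
    }
```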
Yeah, to give you a quick sense also of the scale and complexity on the data side, not only the query side: the data sets we integrate with are pretty massive. As you can see, we support data sets across all different kinds of countries, and for each one there's complex filtering, organization, and categorization that goes into it. We work with domain experts for all of this, but also try to apply automation whenever possible, using their guidance to come up with heuristics or LLM processing techniques to categorize all of it. And I'd say the performance implications are pretty significant as well. We need very good performance both online and offline: online being querying over this, where you want good latency, and offline being ingestion, re-ingestion, and running ML experiments for different variations and such. Generally one of these corpuses can be tens of millions of docs, so it's pretty large scale, and each document is often quite
large.

So I can talk quickly about the kind of infrastructure needed to support this. At this scale, of course, we want infrastructure to be reliable and available for all our users at all times; I'm sure that's something all products need. We also want smooth onboarding and scaling, where we definitely want our ML and data teams to be able to focus more on the business logic and the quality, and on spinning up new applications and products for customers, and not so much on the nitty-gritty details of the database, tuning it, or manually scaling; and of course there's always something in between, where you want to have awareness of it, it can't be fully, a thousand percent automated. Likewise, we need flexibility and capabilities around data privacy and data retention. Like I mentioned, some storage needs to be segregated depending on the customer or the use case, with retention policies on some docs that we might only be allowed to store for certain amounts of time for legal reasons. We want good telemetry and usage metrics around the database. And then, for any vector or keyword-filtering database, we want good performance, query flexibility, and scale, especially for all the different kinds of query patterns I mentioned before, where you need exact matches, you want semantic matches, you want filters, and you might want to navigate it agentically or in some dynamic way. So yeah, all that flexibility is important to us at scale, and that's where LanceDB comes in. Cool. Thank you. Awesome.
So, sorry. Okay, I'm going to try to hold this here, maybe, so there's no echo. Okay. Yeah, so as I was saying, I work at LanceDB, and what we are delivering for AI goes beyond what I'd call just a vector database; it's what we call an AI-native multimodal lakehouse. If you think back to Jerry's talk: in addition to search, you also need a good foundation, a good platform, for all of the other tasks you need to do with your AI data. This can be feature extraction, generating summaries, generating text descriptions from images, managing all of that data, and you want to be able to do it all together. So what you really need is this lakehouse architecture, where all the data can be stored in one place on object store. You can run search and retrieval workloads, you can run analytical workloads, you can train off of that data, and of course you can pre-process that data to iterate on new features that you can experiment with for your applications and models.
Specifically, in addition to these large batch offline use cases, which lakehouse architectures are generally good for, you also need online serving, and this is where LanceDB's distributed architecture comes in. It's actually good for both offline and online contexts, so we can serve at massive scale from cloud object store; we deliver compute, memory, and storage separation; and we give you a simple API for sophisticated retrieval, whether you want to combine multiple vector columns, or vector and full-text search, and then do reranking on top of that. Those are all available with an API in Python or TypeScript that, folks have told me, feels kind of like pandas or Polars, very familiar to data workers who are used to dataframe-type APIs. And of course, for large tables we support GPU indexing; I think our record has been around three or four billion vectors in a single table, which can index in under two or three hours.
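To give a feel for that dataframe-style retrieval API, here is a rough Python sketch. It is illustrative only; exact method names and options may differ across LanceDB versions, so check the current docs.

```python
# Rough sketch of dataframe-style retrieval with LanceDB (illustrative only).
import lancedb

db = lancedb.connect("./my_lakehouse")           # local dir or object-store URI

# A table can mix scalars, text, and embeddings in one place.
table = db.create_table("contracts", data=[
    {"id": "doc-1", "text": "Covered bond issuance terms ...", "vector": [0.1] * 768},
    {"id": "doc-2", "text": "Credit facility agreement ...",   "vector": [0.2] * 768},
])

table.create_fts_index("text")                    # enable keyword (full-text) search

# Vector search with a metadata filter, results as a pandas DataFrame.
hits = (table.search([0.15] * 768)
             .where("id != 'doc-2'")
             .limit(5)
             .to_pandas())
print(hits[["id", "_distance"]])
```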
All of that is to say: LanceDB excels at massive scale, and this happens at a fraction of the cost because of the compute-storage separation and because we take advantage of object store. And I talked about having one place to put all of your AI data: this is the only database where you can put images, videos, and audio tracks next to your embeddings, next to text data, next to your tabular data and time-series data. You can put all of that in a single table.
And then you can of course use that as the single source of truth for all the different workloads that you want to run on that data, from search to analytics to training, and of course pre-processing or feature engineering. A lot of that is possible because of the open-source Lance format that we built from the ground up. If you're working with multimodal data, whether it's documents, PDFs, scanned slides, or even large-scale videos, and you're doing that in, let's say, WebDataset or Iceberg over Parquet, you're missing out on a lot of features and run into things like a lack of random access, the inability to support large blob data, or inefficient schema evolution. The Lance format, by giving you all of those, makes it so that you can store all of your data in one place rather than split up across multiple systems. I would say this is the foundational innovation in LanceDB: without it, what we see a lot of AI teams doing is keeping different copies of different parts of their data in different places, and spending a lot of their time and effort just keeping those pieces glued together and in sync with each other.

So you can basically think about the Lance format as Parquet plus Iceberg plus secondary indices, but for AI data. That gives you fast random access, which is good for search and shuffling; it still gives you fast scans, which is good for analytics, data loading, and training; and it's the only one out of this set that is uniquely good for storing blob data, or, more importantly, a mix of large blob data and small scalar data. And by using Apache Arrow as the main interface, the Lance format is already compatible with your current data lake and lakehouse tools. You can use Spark and Ray to write very large amounts of Lance data in a distributed fashion very quickly, you can use PyTorch to load that data for training or fine-tuning, and you can certainly query it using tools like pandas and Polars.
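As a rough illustration of that Arrow interoperability (a minimal sketch; the `lance` package's exact options may differ by version):

```python
# Minimal sketch of writing and reading a Lance dataset through Apache Arrow
# (illustrative; see the Lance docs for current APIs and options).
import pyarrow as pa
import lance

# Mixed "AI data": small scalars, text, an embedding, and a large binary blob.
table = pa.table({
    "id": [1, 2],
    "caption": ["a cat", "a dog"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "image_bytes": [b"\x89PNG...", b"\x89PNG..."],
})

lance.write_dataset(table, "pets.lance")          # columnar on disk / object store

ds = lance.dataset("pets.lance")
print(ds.to_table(columns=["id", "caption"]))     # fast scan of selected columns
print(ds.take([1]))                               # fast random access by row index
```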
All right, taking it back. Okay. So I just want to share some general take-home messages about building RAG for these large-scale, domain-specific use cases. The first is that these domain-specific challenges require very creative solutions around understanding the data and also choosing the modeling and infrastructure around that, like I mentioned about trying to understand the structure of your data, what the use cases are, and what the explicit and implicit query patterns are. So definitely spend time on that, work with domain experts, and try to immerse yourself as much as possible in it. The second is to make sure you're building for iteration speed and flexibility. This is a very new technology and a very new industry; a lot of things are changing, new tools are coming out, new paradigms, new model context windows and everything. So you want to set yourself up for flexibility and iteration speed, and you can ground that in evaluation: if you have good evaluation sets, or procedures and automation around them, then you can iterate much faster and get good signal on whether your systems are good or accurate. So definitely invest time in evaluation to enable that iteration speed. And then the third, which Chang covered, is that new data infrastructure has to recognize that there's this new world we're entering, with multimodal data, much heavier vector and embedding workloads, very diverse workloads, and scale that is just going to keep getting larger and larger as we try to ingest and query over all the data that exists, public and private.
Yeah, thanks for listening to our talk. You can visit us at harvey.ai; come talk to us or work with us. We're hiring if you're interested. Thanks, everyone.
Awesome. Thank you so much, Chang and Calvin. Really exciting stuff. I think Harvey is changing the legal space, and Lance is allowing you guys to do that at scale. So, thank you so much.
Alrighty. Okay, next up we have Julia Neagu and Deanna Emery, both from Quotient AI, along with Mar Asher from Tavily; sorry, I always mess that one up. We're super excited for these three speakers as well. This one will be about evaluating AI search: the frameworks for augmented AI systems. So give them a big, warm round of applause.
Hi everyone. Thank you so much for coming. My name is Julia. I'm CEO and co-founder of Quotient AI. I'm Deanna Emery. I am a founding AI researcher at Quotient AI. My name is... And today we are going to talk to you about evaluating AI search. So let me start with a fundamental challenge we're all facing in AI today: traditional monitoring approaches simply aren't keeping up with the complexity of modern AI systems. First off, these systems are dynamic. Unlike traditional software, AI agents operate in constantly changing environments. They're not just executing predetermined logic; they're making real-time decisions based on evolving web content, user interactions, and complex tool chains. These systems can also have multiple failure modes that happen at the same time. They hallucinate, retrieval fails, they make reasoning errors, and all of these are interconnected.

A little bit about what we do at Quotient: we monitor live AI agents. We have expert evaluators that can detect objective system failures without waiting on ground-truth data, human feedback, or benchmarks.
A year ago, we met Rotem, Tavily's founder and CEO, and he posed a problem to us that really crystallized the core issues we needed to solve. Here's the challenge: how do you build production-ready AI search agents when your system will be dealing with two fundamental sources of unpredictability you cannot proactively control? Under the hood, Tavily's agents gather their context by searching the web. The web is not static. Traditional benchmarks assume stable ground truth, but when you're dealing with real-time information, ground truth itself is a moving target. Your users also don't stick to your test cases. They can ask odd, malformed questions. They have implicit context they don't share and you're not aware of. And this is not just a theoretical problem: Tavily processes hundreds of millions of search requests for agents in production, and they need a solution that works at scale in these real-world conditions. This is the story of how we built that.
Yes. So at Tavily, we're building the infrastructure layer for agent interaction at scale, essentially providing language models with real-time data from across the web. There are many use cases where real-time AI search delivers value, and these are just a few examples of how our clients are using Tavily to empower their applications: from a CLM company that built an AI legal assistant to power their legal and business teams with instant case insights, to a sports news outlet that created a hybrid RAG chat agent that delivers scores, games, and news updates, to a credit card company that uses real-time search to fight fraud by pinpointing merchant locations.

So as you can imagine, evaluating a system in this kind of vast, fast-moving setting is quite challenging. We have two principles that guide our evaluation. First, the web, which is the foundation of our data, is constantly changing. This means that our evaluation method must keep up with that ongoing change. Second, truth is often subjective and contextual. Evaluating correctness can be tricky because what's right may depend on the source, the timing, or the user's needs. So we have a responsibility to design our evaluation methods to be as unbiased and fair as possible, even when absolute truth is hard to pin down. The first thing to think about in offline evaluation is which data to use to evaluate your system.
Static data sets are a great start, and there are many widely used open-source data sets available on the web. SimpleQA is one example. It's a benchmark and data set from OpenAI that serves as a standard for evaluating retrieval accuracy; many leading AI search providers use SimpleQA to evaluate their performance. SimpleQA is designed to evaluate a system's ability to answer short, fact-seeking questions with a single empirical answer. Another widely adopted data set is HotpotQA, which evaluates a system's ability to answer multi-hop questions, where reasoning across multiple documents is required to retrieve the final answer. Data sets like SimpleQA and HotpotQA are a great start for evaluating your system. But what happens when you're evaluating real-time systems, especially when measuring whether your system keeps up with rapidly evolving information and avoids regressions, which is where we operate? Those kinds of static data sets also don't address the challenge of benchmarking questions where there's no one true answer or where subjectivity is involved.

This is what led us to think beyond static data sets, towards dynamic evaluation that reflects the changing pace of the web. Dynamic data sets are essential for benchmarking RAG in real-world production systems; you can't answer today's questions with yesterday's data. Dynamic data sets have real-world alignment. They have broad coverage, as you can easily create eval sets for any domain or use case relevant to your specific needs, and they also ensure continuous relevancy, because they are regularly refreshed, which means your system is always evaluated against the latest data.
This led us to build an open-source agent that builds dynamic eval sets for web-based RAG systems. It's open source, and we encourage everyone to check it out and contribute. I also want to acknowledge the work of our head of data at Tavily, who initiated this project a couple of months ago. You can see here an example of a data set generated by the agent: it generates question-and-answer pairs for targeted domains using information found on the web. The agent leverages the LangGraph framework and consists of these key steps. First, it generates broad web search queries for targeted domains, which essentially lets you create eval sets for any domain of your choice and the specific needs of your application. The second step is to aggregate grounding documents from multiple real-time AI search providers. We understand that we cannot just use Tavily to search the web on specific domains, find grounding documents, generate question-and-answer pairs from those documents, and then evaluate our own performance on those documents. That's why we use multiple real-time AI search providers, to both maximize coverage and minimize bias. The third step, which is the key step in this process, is to generate the evidence-based question-and-answer pairs. We ensure that in the generation process the agent is obliged to generate answer context, which also increases the reliability of our question-and-answer pairs and reduces hallucinations. You can always go back and check which sources were used, and which evidence from those sources was used, to generate each question-and-answer pair. And lastly, we use LangSmith to track our experiments, which is a great observability tool to manage these offline evaluation runs and see how you perform at different time steps.
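To illustrate the shape of that pipeline, here is a rough sketch of the dynamic eval-set idea. It is not the actual open-source agent; `llm` and the search-provider clients are placeholders.

```python
# Sketch of dynamic eval-set generation: generate domain queries, gather grounding
# docs from multiple search providers, then produce evidence-grounded Q&A pairs.
import json

SEARCH_PROVIDERS = [tavily_search, other_provider_search]   # placeholder clients

def build_eval_set(domain: str, n_queries: int = 5) -> list[dict]:
    queries = json.loads(llm(
        f"Generate {n_queries} broad web-search queries about: {domain}. "
        "Return a JSON list of strings."))
    eval_rows = []
    for query in queries:
        # Aggregate grounding documents from several providers to reduce bias
        # toward any single search engine.
        docs = [doc for provider in SEARCH_PROVIDERS for doc in provider(query)]
        qa = json.loads(llm(
            "From the documents below, write one question, its answer, and the "
            "exact supporting evidence quoted from the documents. Return JSON with "
            f"keys question, answer, evidence.\n\n{json.dumps(docs)[:8000]}"))
        qa["sources"] = [d.get("url") for d in docs]   # keep provenance for auditing
        eval_rows.append(qa)
    return eval_rows
```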
The next steps we want to address: supporting a range of question types, both simple fact-based questions and multi-hop questions similar to HotpotQA. We also want to ensure freshness, fairness, and coverage by proactively addressing bias and covering a wide range of perspectives for each subject we generate questions and answers for. Additionally, we want to add a supervisor node for coordination, which has proved valuable, especially in these multi-agent architectures, and this will increase the quality of our question-and-answer pairs.

The next thing to think about is benchmarking, and we argue that it's important to measure accuracy, but you should not stop there. You should ensure a holistic evaluation framework, using benchmarks that, in our case, measure source diversity, source relevancy, and hallucination rates. It's also important to leverage unsupervised evaluation methods that remove the need for labeled data, which enables you to scale your evaluations and address the subjectivity issue.
With that, I'll pass it over to Deanna, who will explain more about these reference-free benchmarks and also share results from an experiment we ran using a static and a dynamic data set that was generated by the agent I described before.

So, we performed a two-part evaluation of six different AI search providers. The first component of this experiment was to compare the accuracy of search providers on a static and a dynamic benchmark, in order to demonstrate that static benchmarking is not a comprehensive method for evaluating AI search. The second component was to evaluate the dynamic data set responses using reference-free metrics, and we compare these results to the reference-based accuracies that we get from the benchmark, in order to demonstrate that reference-free evaluation can be an effective substitute when ground truths are not available.
Jumping right in: for our static versus dynamic benchmarking comparison, we used the SimpleQA benchmark as the static data set, and a dynamic benchmark of about a thousand rows created by Tavily. As you can see here, both data sets have roughly similar distributions of topics, and this helps ensure a fair comparison and diversity of questions. To evaluate the AI search providers' performance on these two benchmarks, we're using the SimpleQA correctness metric. This is an LLM judge used on the SimpleQA benchmark; it compares the model's response against a ground-truth answer to determine if it's correct, incorrect, or not attempted. Here we're showing the correctness scores from the SimpleQA benchmark compared against the dynamic benchmark, and we've anonymized the search providers for this talk. I do want to call out that the SimpleQA accuracy scores here are all self-reported, so they don't all necessarily have clear documentation on how they were calculated. But as you can see, the correctness scores for the dynamic benchmark, in blue, are substantially lower. And not only that, the relative rankings have also changed pretty considerably. For example, the provider all the way at the end of this plot performs the worst on SimpleQA, but it performs the best on the dynamic
benchmark.

Looking a little closer at the results: while this SimpleQA evaluator is useful, it's certainly far from perfect. I have a few examples here of model responses that were flagged as incorrect by this LLM judge, but if you look at the actual text in the model outputs, they do contain the correct answer from the ground truth. On the flip side, here is an example where the LLM judge classified the response as correct, and yes, you can see that the correct answer is in the response. But while the correct answer might be present, that doesn't necessarily mean the full answer is right. This evaluation is not accounting for any of the additional text in the response, and there might be hallucinations in there that would invalidate it. So ultimately this evaluation falls short of identifying when things go wrong in AI
search.

So what are some other ways that we can identify when things go wrong? Up to this point, we have been talking about a reference-based approach to evaluation. But what if we don't have ground truths? In most online and production settings, this is typically the case, and as we've already discussed, it's especially so in AI search. So the question is: can reference-free metrics effectively identify issues in AI search? For this talk, we're going to look at three of Quotient's reference-free metrics. We'll look at answer completeness, which identifies whether all components of the question were answered; it classifies model responses as either fully addressed, unaddressed, or unknown (if the model says "I don't know"). Then we'll look at document relevance, which is the percent of the retrieved documents that are actually relevant to addressing the question. And finally, we'll look at hallucination detection, which identifies whether there are any facts in the model response that are not present in any of the retrieved documents. We used these metrics to evaluate the search providers' responses on this dynamic benchmark.
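For intuition, here are generic sketches of what such reference-free checks can look like. These are illustrative only, not Quotient's actual detectors; `llm_judge` is a placeholder for whatever judge model you call.

```python
# Generic reference-free checks: each needs only the question, the model response,
# and the retrieved documents (no ground-truth answer).
def answer_completeness(question: str, response: str) -> str:
    return llm_judge(
        f"Question: {question}\nResponse: {response}\n"
        "Label the response as one of: fully_addressed, unaddressed, unknown "
        "(use 'unknown' if the response says it does not know).")

def document_relevance(question: str, docs: list[str]) -> float:
    relevant = sum(
        llm_judge(f"Is this document relevant to answering '{question}'? "
                  f"Answer yes or no.\n\n{doc}").strip().lower().startswith("yes")
        for doc in docs)
    return relevant / len(docs) if docs else 0.0

def has_hallucination(response: str, docs: list[str]) -> bool:
    context = "\n\n".join(docs)
    verdict = llm_judge(
        f"Context documents:\n{context}\n\nResponse:\n{response}\n"
        "Does the response state any fact not supported by the context? yes or no.")
    return verdict.strip().lower().startswith("yes")
```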
We've got answer completeness plotted here: the stacked bar plot shows the number of responses that were either completely answered, unaddressed, or marked as unknown. If we look back at the overall rankings we saw earlier on the dynamic benchmark, you can see that the rankings from answer completeness match pretty closely; the average performance scores for the two have a correlation of 0.94. So this indicates that the reference-free metric can capture relative performance pretty well. But completeness is still not the same thing as correctness, and when we have no ground truths available, we have to turn to the next best thing, and that is the grounding
relevance and hallucination detection
come in. Uh both of these metrics are
going to be looking at those grounding
documents in order to measure the
quality of the model's
response. Unfortunately uh of all of the
search providers we looked at only three
of them actually return the retrieved
documents used to generate their
answers. Um the majority of search
providers typically only provide
citations and these are largely
unhelpful at scale and also really limit
transparency when it comes to
debugging. So these are those document
relevance scores for the three search
providers um and they've been reanomized
here. Um the plot to the left shows the
average document relevance, the percent
of retrieved documents that are relevant
to the question. And the plot to the
right shows the number of responses that
have no relevant
documents. And if we consider these
results in conjunction with answer
completeness, we find that there's a
strong inverse correlation between
document relevance and the number of
unknown answers. And this kind of
matches intuition. uh if you think about
it, if you have no grounding, no
relevant documents for the question, the
model should say I don't know rather
than trying to answer
it.

And this brings us to hallucination detection. Here we were actually surprised to see that there was a direct relationship between the hallucination rate and document relevance. Provider X here has the highest hallucination rate, but it also had the highest overall document relevance, and this is kind of counterintuitive. But if we think about it more: provider X had high answer completeness, the lowest rate of unknown answers, and it also had the highest answer correctness from the benchmarking earlier of these three providers. So this probably implies that provider X's responses are more likely to provide new reasoning or interpretations, or maybe they're more detailed and thorough, and this just creates more opportunity for hallucination. But the point I want to make here is that when considering these metrics, depending on your use case, you might index more heavily on one over another. They're measuring different dimensions of response quality, and it's often a give and take: if you perform really well on one, it might be at the expense of another. As we see here, there is a trade-off between answer completeness and hallucination. But also, if you take these three metrics in conjunction, you can use them to understand why things went wrong and identify potential strategies for addressing those issues. This diagram shows a few examples of how you can interpret your evaluation results.
Sorry, how you can interpret your evaluation results to identify what to do to fix things. We've got one example here where maybe your response is incomplete, but you have relevant documents and no hallucinations. This probably means you don't have all the information you need to answer the question, so just retrieving more documents could solve it. But the big-picture idea is that your evaluation should do more than just provide relative rankings. It should help you identify the types of issues that are present, and it should also help you understand what strategies to implement to solve those issues.
Okay. So in conclusion, let me quickly paint a picture of where we're heading with all this, because this is not just about building the agents we've been building for the past couple of years, slapping evaluation on them, and continuing to do the same thing. It's actually not about building better benchmarking. It's not about better monitoring. It's not about better evaluation. It's about creating AI systems that can continuously improve themselves. Imagine for a second that agents don't just retrieve information but learn from the patterns of what information is outdated, what sources are unreliable, and what users need. They could also detect hallucinations mid-conversation and correct course, all without human intervention. The framework we shared today, dynamic data sets, holistic evaluation, reference-free metrics, provides the building blocks for getting there. And this is where we want to get with augmented AI. So, thank you so much for your time.
[Applause]
All right, Tengyu Ma, are you in the room? Our next speaker, Tengyu Ma from MongoDB. Oh, you are. Okay, perfect. Come on up.
All right, thank you very much to Quotient AI and Tavily for presenting. Next up we have MongoDB. You know them, they're very much in the ecosystem. Tengyu Ma will be presenting RAG in 2025: what's changed, the state of the art, and the road ahead. He'll talk about the debate of RAG versus fine-tuning versus long context; then, secondly, RAG today, the benefits, challenges, and current solutions; and then what the future of RAG will look like. All right, while he's getting miked up we'll let him get settled in, but a big round of applause for Tengyu Ma from MongoDB.
This is, this is working. Okay. I guess, yeah, thanks for coming, and thanks for having me here. I'm Tengyu Ma. I was the CEO and co-founder of Voyage AI; we just recently got acquired by MongoDB. I'm also teaching at Stanford as well. So this is about RAG, which was the main focus of Voyage AI, the startup, which focuses on how to make retrieval better. But I will just generally talk about RAG, and we'll touch on some of the products we make as well very quickly. So, why are we doing RAG or anything like that? I guess the main reason is that large language models these days, and agents, which use large language models as well, cannot out of the box have proprietary information from any of the companies, right? Because if they knew anything about what MongoDB, for example, internally has, then the data was leaked. So that means if you want to apply any of this to the enterprise, then you need to ingest a lot of data from your proprietary information. And I'm going to discuss which kinds of technologies enable us to ingest that data. There are a few options, RAG, fine-tuning, and long context, which are all ways to ingest data, and I'll focus on RAG for the rest of the
talk.

For this audience, probably most people know these technologies; they are all very simple at a high level. Long context is the simplest: you just dump all your documents into a large language model's context, maybe it's 1 million tokens, maybe it's 1 billion tokens, and then you have a query and you just get a response. Fine-tuning is: you first fine-tune the model, you update the parameters, and then you say, I'm not going to look at the documents anymore; when the query comes, I just use the updated parameters to generate the response. And RAG is also pretty simple. Basically, on the fly, you use the query to retrieve some subset of the documents, using a retrieval or search method, and you get some relevant documents. You give this small set of relevant documents to the large language model and then you generate a response based on that context.
So, this is my one-slide summary of how I think about the differences between these technologies. Some of these ideas are inspired by research at Stanford from when we started to build Voyage. We believed in RAG, and one of the reasons is that we didn't believe fine-tuning can work for this, and for long context, I also don't really believe it can be cost-efficient in the long run. Basically, the way I think about this is to make an analogy to how humans learn from, or use, additional external information. In some sense, long context is like scanning an entire library to answer any single question: every time you answer a question, you need to go through the entire library, which has maybe one billion tokens. Fine-tuning is like reading the library in advance: you memorize it, you try to internalize it in your brain, in your neurons and synapses, and you basically rewire your brain so that you really know all of it deeply. The challenge there is that it's very difficult and somewhat unnecessary, because you cannot really memorize all the books in the world, and choosing which subset to memorize is tricky as well. Another thing is that it makes forgetting knowledge tricky, because you don't know which part of the knowledge you should forget or how to cleanly forget it, and it also makes data governance tricky, because maybe there are many libraries, many books in a library, and not everyone can access everything. On the other hand, RAG is very simple and modularized, as I've shown, very reliable, and also fast and cheap. It's similar to how humans actually use libraries, right? You retrieve the most relevant books or book chapters and then answer the question. It's a hierarchical way to store your information: you don't put all the information in your brain, you put it in a library and then use it when you need it. So that's why I believe in RAG, and this is how you implement the retrieval part.
Basically there's a breakdown into two components, actually three if you're advanced. There are embedding models, which vectorize the documents and the query into vectors; the vectors are representations of the content or the meaning of the documents and queries. Then you use a vector database to store the data and also search, with a k-nearest-neighbor search in vector space; you get the relevant documents, and then you can use a large language model to generate answers.
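As a minimal sketch of that retrieval flow (illustrative only; `embed` and `llm` are placeholders for your embedding model and LLM clients, and a real system would use a vector database rather than brute-force search):

```python
# Minimal sketch of the retrieval component: embed, nearest-neighbor search, generate.
import numpy as np

docs = ["Refund policy: ...", "Security whitepaper: ...", "Onboarding guide: ..."]
doc_vecs = np.array([embed(d) for d in docs])            # vectorize documents once

def answer(query: str, k: int = 2) -> str:
    q = np.array(embed(query))
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top_k = [docs[i] for i in np.argsort(-sims)[:k]]     # k-nearest-neighbor retrieval
    context = "\n\n".join(top_k)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```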
We have seen significant improvements in retrieval accuracy over the last two years. When we started Voyage, I think OpenAI's v3 embeddings were not yet launched; OpenAI v3 was launched about 1.5 years ago. In the last 1.5 years, Voyage has made significant progress, and Cohere has also made some progress. So we can see that the new models have much better accuracy at lower cost, and generally we have a much better scaling law: for the same number of parameters, the quality becomes better, or for the same quality, the parameter count becomes smaller and it becomes cheaper. All of these come from optimizing the research stack and the tuning stack as much as possible, all the way from data curation and data selection to architecture, loss functions, evaluation, and so on. And we still believe there's a lot of headroom here, because right now you can see in this plot that we are averaging over about 100 data sets and accuracy is about 80%. That means you still have probably 20% of improvement headroom. That said, just to be clear, it's not like every data set only gets 80% accuracy. For probably half of the data sets the accuracy is 90% or even 95%, and for some of the others it's 60, sometimes 20, sometimes 30; that's why the average is 80%. So basically I'm saying that for some of the tasks that are common, I think you can already get very high accuracy in the
retrieval step.

Another thing that Voyage and other companies have offered is so-called Matryoshka learning and also quantization-aware training. These are two approaches to reduce the storage cost for the vectors. Matryoshka learning means that even if you have a high-dimensional embedding, you can use a subset of the coordinates, usually the first ones. Suppose you have 2048-dimensional vectors: the first 256-dimensional sub-vector is still a reasonable embedding. The accuracy wouldn't be as high as with 2048, but it will be almost the same, maybe with a one or two percent loss. Quantization is in a similar vein: even if you lower the precision of the vectors, you still get pretty high performance. You can see the trade-off on the right of the figure here. Basically you can save 100x, or at least 10x, without losing much; if you save 100x in storage cost, then you start to lose probably five to ten percent. But Voyage is doing a great job here, because you can save 100x but still do better than OpenAI; that's just because the Pareto frontier is different.
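Here is a small sketch of the two storage-saving ideas just described, truncating a Matryoshka-style embedding and quantizing it to int8. It is illustrative only; the actual savings depend on how the embedding model was trained.

```python
# Sketch: truncate a Matryoshka-style embedding to its leading dimensions, then
# quantize to int8. Illustrative only.
import numpy as np

def truncate_and_normalize(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    # Keep only the first `dims` coordinates, then re-normalize for cosine search.
    sub = vec[:dims]
    return sub / np.linalg.norm(sub)

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    # Map float32 values to int8 with a per-vector scale (4x smaller than float32).
    scale = np.abs(vec).max() / 127.0
    return np.round(vec / scale).astype(np.int8), scale

full = np.random.randn(2048).astype(np.float32)    # stand-in for a 2048-d embedding
small = truncate_and_normalize(full, 256)          # 8x fewer dimensions
q, scale = quantize_int8(small)                    # plus 4x from int8: ~32x smaller
approx = q.astype(np.float32) * scale              # dequantize at search time if needed
```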
And you can actually see an even better trade-off for domain-specific models, which I'm going to discuss in a moment. I have nine minutes here, so I will just quickly go through some of the techniques you can use. Basically, the next question is how you do better RAG besides using better embedding models; using better embedding models is probably one of the simplest ways. So I'm just going to go through these quickly.
One of them is to use hybrid search and rerankers: you can use lexical search and other kinds of search, then combine them with a reranker, and Voyage provides a reranker as well.
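As a rough sketch of hybrid search with a reranker (illustrative only; `embed` and `rerank_scores` are placeholders for your embedding model and reranker API):

```python
# Sketch: fuse lexical (BM25) and vector retrieval, then rerank the short list.
import numpy as np
from rank_bm25 import BM25Okapi

docs = ["covered bond directive ...", "capital requirements ...", "case law summary ..."]
bm25 = BM25Okapi([d.split() for d in docs])
doc_vecs = np.array([embed(d) for d in docs])

def hybrid_search(query: str, k: int = 3) -> list[str]:
    lex_rank = np.argsort(-bm25.get_scores(query.split()))        # lexical ranking
    q = np.array(embed(query))
    vec_rank = np.argsort(-(doc_vecs @ q))                        # vector ranking
    # Reciprocal rank fusion: combine the two rankings into one candidate list.
    fused = {}
    for rank_list in (lex_rank, vec_rank):
        for pos, idx in enumerate(rank_list):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + pos)
    candidates = sorted(fused, key=fused.get, reverse=True)[: k * 2]
    # Final pass: a reranker scores (query, doc) pairs more accurately.
    scores = rerank_scores(query, [docs[i] for i in candidates])
    order = np.argsort(-np.array(scores))
    return [docs[candidates[i]] for i in order[:k]]
```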
Another one is to enhance the queries and documents through so-called query decomposition and document enrichment. This is probably the most common one, so maybe one minute on it. It's actually very simple. If you have a RAG query, you try to improve the query, for example by making it longer using a large language model. You can also decompose a long query into smaller subqueries, so you have a few different queries and can search for different subsets of documents. You can also enrich the documents by adding additional meta information: titles, headers, categories, authors, dates. Sometimes you chunk the document in such a way that a chunk no longer contains this information, so you have to add that global information back into each of the chunks, and some of it can be generated by large models. Anthropic wrote a blog post on this that achieves pretty good results: they use large models to generate additional context that gets added to each chunk, so the chunks become more informative and easier to search.
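Here is a minimal sketch of both ideas, query decomposition and contextual chunk enrichment, in the spirit of the approach described above; the prompts and the `llm` callable are assumptions, not anyone's production code.

```python
def decompose_query(llm, query: str) -> list[str]:
    """Ask an LLM to split a complex query into focused subqueries."""
    prompt = (
        "Break the following question into 2-4 short, self-contained search "
        f"queries, one per line:\n\n{query}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def enrich_chunk(llm, full_document: str, chunk: str) -> str:
    """Prepend LLM-generated global context to a chunk before embedding it."""
    prompt = (
        "Here is a document:\n" + full_document[:8000] +
        "\n\nWrite one or two sentences situating the following excerpt "
        "within the document, to help a search system:\n" + chunk
    )
    return llm(prompt).strip() + "\n\n" + chunk
```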
Another one is domain-specific embeddings, where you customize the embeddings for certain kinds of domains. At MongoDB and Voyage we customize them for code, for example, and you can see that you get much better performance and also a better trade-off between storage cost and accuracy. Basically you don't lose as much when you compress the vectors even further: here we lose maybe 5% when compressing by about 100x, whereas before we would lose maybe 10% or 15%.
Fine-tuning is another one: you can fine-tune embedding models with your own data. And you can also use what I sometimes call tricks on top of the embeddings. These are different ways to retrieve using additional information, like graphs, iterative retrieval, and so forth. They are all based on embeddings, but you can use the embeddings in many different ways as an additional layer.
I'll use the next five minutes or so to discuss my vision for where RAG will go in the future. I do believe that RAG will be there forever because, as I argued in the first set of slides, it is very similar to how humans use large amounts of data: you retrieve, you hierarchically select some subset, and then you use that subset to answer questions or take actions. And this is very efficient, because you only use a small subset of the data.
Regarding how RAG will evolve from a technical point of view, I like to draw an analogy with how AI in general is evolving. I was reflecting on when I started teaching at Stanford about seven years ago. I taught a machine learning course with Chris Ré, and one of the slides literally had seven steps for how to build ML systems in enterprises. That slide is still in the lecture notes; we still teach it, just with more asterisks around it. You can see that you need to go through many steps: collecting data, train/test splits, defining your loss functions, building models, iterating and repeating. In the large language model world, you don't need to do any of this: you just take a large language model out of the box and you can deploy it in the enterprise in most cases. Of course it's not going to be perfect, but it's already better than the old days where you did all of these steps in the enterprise with all of the enterprise data; out of the box, you're already doing very, very well. Of course, you still have the issue that an out-of-the-box model cannot access proprietary information, and then you can use RAG for that.
it. So, but I think the point here is
that before uh all of these steps have
to be done by the kind of like the users
or the enterprise or the customers in
some sense. Um uh and now uh you largely
speaking just can take off the shelf
components and connect them and build
your uh AI applications very fast
without going through these training
steps. The trainings still have to be
done, right? All of these steps still
are done u um but they are done by open
anthropic or voyage MongoDB u the
providers of the models but not the uh
uh the the users um uh the end users. So
um and I think for rag I I would say
probably the same kind of evolution
should happen um um um um so right now
Right now we have several different layers: there's the computing infrastructure layer, the GPUs, or in some cases CPUs; there's the model layer, where you have the embedding models, the rerankers, the large language models; and then on top of all of this, people use a lot of what I call tricks to make RAG accuracy much better. You can use all kinds of parsing strategies, all kinds of chunking strategies, recursive search, contextual chunks, graph RAG, and so forth. That's what happens right now, and it's somewhat necessary: these tricks are needed because the embeddings, the rerankers, and the large models are none of them perfect yet. But I do believe that in the future the model layer will grow and the tricks layer will shrink. There will be fewer and fewer tricks, and the models will capture much of the performance gain that the tricks provide. We've seen this in the large language model space as well: two years ago you needed to do a lot of things on top of GPT-3 to make your application work, and now, out of the box, you can get the same performance you used to get with all of the tricks. Of course, you'll probably still need some tricks, because some information simply isn't available to the embedding models and rerankers. The general-purpose, off-the-shelf models don't have certain information, and you can incorporate that information through your tricks. For example, the definition of the similarity metric could be something that you customize in your prompt.
I think there are several things we are developing toward this vision. One of them is multimodal embeddings. The goal is to dramatically simplify the workflow so that you don't have to do as many things. These days, the multimodal embedding provided by Voyage can just take in screenshots. Before, with a PDF, you had to do data extraction to turn it into images and text, and then probably embed the images and the text separately; parsing a PDF is actually complex. For videos, you had to turn them into transcripts and then use a text embedding, and so on. Now we have a multimodal embedding that just takes a screenshot: you can deal with PDFs, PowerPoint, or any other kind of slide deck in the same way, just take a screenshot and use the multimodal embedding. We can even do the same thing for video, not necessarily in the perfect way, but you just take screenshots of the conceptually important frames, give them to the multimodal embedding, turn them into vectors, and then you can search over those documents, videos, or slide decks.
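As a rough sketch of this screenshot-first workflow, assuming you have some multimodal embedding endpoint available (`embed_image` below is a placeholder, not any specific vendor's API), you might do something like:

```python
from pdf2image import convert_from_path  # renders PDF pages as PIL images

def embed_image(img) -> list[float]:
    """Placeholder: call your multimodal embedding model here and return a vector."""
    raise NotImplementedError

def embed_pdf_as_screenshots(path: str) -> list[dict]:
    """Embed each page of a PDF as an image instead of parsing text,
    tables, and figures separately."""
    records = []
    for page_num, page_img in enumerate(convert_from_path(path, dpi=150), start=1):
        records.append({
            "id": f"{path}-page-{page_num}",
            "vector": embed_image(page_img),
            "metadata": {"source": path, "page": page_num},
        })
    return records
```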
These are some performance metrics that we have evaluated. By the way, another application is tables: now you can just take a screenshot of a table, and you don't have to think too much about what is a header, what is a row, and so forth. We've done evaluations on many of these document screenshots, tables, figures, and also text-only data, and you can see that it's improving across the board.
The final one I'd like to mention, which is something we're going to launch soon, is context-aware, auto-chunking embeddings. Right now, when you have a long document, you do have to chunk the data. One reason is that the context length of embedding models is limited: if you have 100k tokens, you have to split it into three or four chunks, because even though Voyage probably has the longest context window, it's still around 32k. Another reason to chunk is that even if you could put the whole long document in the context window, when you retrieve at the document level you retrieve a very long document and then hand it to the large language model, which is very expensive. If you pass 100k tokens to a large model every time you answer a question, a quick cost analysis will show that the call is very expensive. So you have to work with smaller units, both to cut the cost and to stay focused: sometimes when you give a long document to a large language model, it misses some of the context in the middle, and you need retrieval to focus on a paragraph or a page. That's what happens right now with chunking, but all of this is done by the users. Our vision is that we're going to do it for you, and we're also going to pull in the meta information from the other chunks. In a nutshell, the interface will be: you give us a long document, we chunk it for you, and we return the chunks together with a vector for each chunk, where each vector represents not only that chunk but also some of the global information from the other chunks. So it has all the details of the corresponding chunk plus coarse-grained, global information from the other chunks, and you get the best of both worlds. That's what we're going to launch soon.
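Purely as an illustration of the interface being described (this is not the actual product API; every name here is hypothetical), such a contextualized chunking endpoint could look roughly like:

```python
from dataclasses import dataclass

@dataclass
class ContextualChunk:
    text: str            # the local chunk content
    vector: list[float]  # embeds the chunk plus global document context

def embed_document_with_auto_chunking(client, document: str) -> list[ContextualChunk]:
    """Hypothetical call: the service chunks the document itself and returns
    one context-aware vector per chunk."""
    response = client.contextualized_embed(document=document)  # hypothetical endpoint
    return [ContextualChunk(text=c["text"], vector=c["vector"])
            for c in response["chunks"]]
```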
And another one is that we're going to have a fine-tuning API at some point, so that you can fine-tune with your own data. I guess it's exactly time. Thanks very much.
[Applause]
All righty, folks. That's it for the first block. Enjoy lunch, and come back a little bit before 2. We'll start with the second set of sessions. We've got Exa, 11x, as well as Google. So don't miss it.
Check. Check. How we doing, y'all? We're
about to get started in just about 2
minutes. Feel free to grab a seat. Raise
your hand if there's a seat next to you.
We have a bunch of folks in the back
looking for seats. If anyone's in the
back, you want to find a seat, you can
find a seat next to these folks raising
their hands. Let's pack it in. It's
going to be a full room. The last time
we did this at the last AI engineer
summit, we ended up by the end of it
having the full ring around with people
standing. So, if you want to get a clear
view, please grab a seat, and we'll get started in just about two minutes. Thank you, y'all.
Sweet. You can bring it in real quick.
Check. Check. Check. Check. One, two.
Check. Check. One, two. Sweet. About to
get started here.
How we
doing? All right. Back
there. Okay. Seems like the mic just
kicked on. Can you hear me? All right.
Back there in the back. Yeah. Sweet.
Raise your hand if you flew in and you
live outside of San
Francisco. Holy Who came the
farthest? Where'd you all come
from? UK. I don't think that's the
farthest. Poland, we're getting pretty
far. Bangalore. Anyone got Bangalore
beat? That's hard to
beat.
Romania. Brazil. I don't I don't I think
Brazil's a little bit
closer. Kolkata. Oh One more back
there. Melbourne. Wow. This is
crazy, y'all. And and if you're if
you're back there in that corner, if
y'all can slide over a little bit
because there'll be more people joining
as we're going. So, just slide over this
way so we make more space so that people
can actually see. We're about to get
started here. Um, so I'll introduce
myself real quick. I'm Dave Fontenot, the founder of HF0. For those of you who don't know, HF0 is the most selective startup program in the world right now. It is the program for repeat
founders. A lot of the founders that
you're about to see on this stage have
built billion-dollar companies before
this and then they're starting their
next company and they're building
something really interesting mainly in
AI. It's a 12-week residency where you move in, give up everything in your life, and go all in on birthing your life's work. And without
further ado, let's give it up for the
first startup in this series that we're
calling the Next AI Unicorns. Let's give it up for Diego from Krea. Test, test, test.
Hello everyone. My name is Diego
Rodriguez. I am co-founder and CTO at
Krea. We're building an AI creative
suite. I'm going to tell you three
stories and then I'll try to hire you.
So, a friend once told me: if you think about it, cars were easy to predict, right? You take the horse and carriage, you keep the wheels, you swap the horse for an engine, which was already known at the time, and that's a car. But you know what's really hard to predict? Traffic. So it's my job to ask what the "traffic" is that we're missing, especially with AI: JSON, MCP, whatever. Okay, but what happens when you generate a million images per day, like we do for one studio? How do you even find anything? Another story: the Tower of Babel. We wanted to reach heaven, and God said no, created a bunch of languages, and basically misunderstanding is what stopped us from getting there. It reminds me of standup meetings where people argue, "No, it should be React," "No, it should be JavaScript," and you're like, dude. God is winning. But now we have AI, so maybe not for long; you'll see. In the end this is just people trying to convey ideas, and that's what we're trying to help with: just telling stories. The final story is that I was
talking with someone from Netflix uh and
she was
like, what happens when we are making so much content, personalized to each town in India? How do I even find it? And then a few days ago, I realized that Krea was already being used to broadcast an ad with Fox to millions of people. I looked, and they had literally signed up two days earlier. So we went from sign-up to conversion to payment to broadcast in two days. And this was the CTO
telling me that. I was like, well,
I'm basically about to run out of time, so here's the mandatory slide: a bunch of users, 25 million, a bunch of money raised, and we did this with eight people. Here are some of the people using us, and an email address I created for today that will get priority for applications. Thank you.
[Applause]
All right, open home. Okay, everybody.
The smartphone was the number one
best-selling consumer product last year,
and the laptop was the second. Pop quiz for all of you here: what was the third? It wasn't the Apple Watch, it wasn't the AirPods. No, I think I heard it: it was the smart speaker.
500 million smart speakers were sold
last year. But why do they still suck?
You can barely talk to them. There's no
customization. There's no community.
There's nothing. That's why we built
OpenHome, the very first AI-driven smart
speaker. And we're letting you guys
build smart speakers,
too. And we believe here that the future
is talking with AI. You should be able
to talk seamlessly, intuitively. In
fact, you shouldn't have to use this
really awkward command-based language.
You should be able to just chat
naturally. So, that's what we're
building. And we're letting people here
today build their own smart speakers and
build them in whatever form that they
want. And the key here is developer
ecosystems. And well, we know developer
ecosystems. I started my career as the
chief of staff for the founder of
Splunk, a $30 billion big data company. Then I was on the founding team of MakerDAO, a $5 billion developer
ecosystem. My co-founder and I raised
$50 million for our last business, a big data privacy tool, and we sold that
business. But it all came down to
developers and really building what
people actually wanted. And well,
oops.
Now with OpenHome, we have over 10,000
developers building on OpenHome. They're
building all kinds of interesting
things. All types of different custom
smart speakers, building interesting
voice AI
applications. Sky's really the limit.
And well, what do developers really
want? They want open source. They want
LLM driven. And they want fully
jailbroken. They want OpenHome, the AI
smart speaker.
And now what's really exciting is with
voice AI, you can put it on any type of
hardware. We have developers building
talking toys, AI robots, AI appliances.
You should be able to talk to the world
around you in a much more natural way.
And you can do that now with OpenHome, the AI smart speaker. Here's our dashboard. We have
many, many applications, hundreds of
applications that have been built,
games, personalities. We have an editor
that you guys can go in and build and
all kinds of interesting things, home
automation
tools. And what's really exciting is our
last devkit got booked up within
minutes. And today we have a special
announcement for you guys here today.
We're releasing the next batch of 500
dev kits for free for everybody here. If
you guys want it, we will ship you a dev
kit. It's very cool. You can build on
it. You can talk with AI. You can build
your own smart speaker here. and we're
doing it today. Thank you so much.
One, two. All right. How's it going,
y'all? Uh, I'm Josh. I'm the founder of
company called Coframe. And I am missing
my slides. Cool. Got it. So, uh, the
last company that I started, we scaled to over two billion dollars in the course of a couple of years. But when I started to tinker with using AI to generate code, and created one of the top autonomous coding agents on GitHub, number one on GitHub for a week, I realized it was time to build something
bigger. The internet is
dead. It's not adaptive. It's not
personal. It's not truly living in a
sense. Websites are all one
size-fits-all. And we're bringing that
concept to life. We are giving websites
a life of their own, giving every single
customer experience its own AI growth
team. But this isn't just a pipe dream.
We made $20 million for Europe's
largest travel company in just a few
weeks. We increased clickthrough rate
for India's largest company, a $400
billion enterprise also in a few weeks.
And how are we doing it? We're working
with the best and we have the best. Uh
we're the only marketing tech company
that's partnered directly with OpenAI to
date and they actually called our team
cracked which is cool. So if you're
interested in learning more about this,
reach out. Thank you.
Hi, I'm Eugene.
I'm sorry, but my team is obsoleting all the AI models you see today, because my team built Qwerky 72B, the world's largest model without transformer attention, with only eight GPUs. This allows us to have roughly 1000x lower inference cost on our new architecture while performing the same. Surprisingly, the technology we built can also be applied to existing transformer models through speculative decoding.
Nothing, nothing too big, nothing too
important. But here's my hot take. Scale
is dead. And I'm not just saying this as my own opinion. We are burning billions into making AI models bigger, but at the same time the DeepMind founder and CEO is saying that compounding AI agent errors will take more than 10 years to fix, and Yann LeCun is even saying that we need a new AI architecture to push the paradigm forward.
And in production we see over 90% of AI
projects
fail. The reason behind this is not
something that scale can fix. The
problem is
reliability. The thing is, would you order from and use an app that only succeeds 45% of the time? Would you order DoorDash that way? Of course not. If your order goes missing, or you end up with 100 pizzas, you're going to be stuck screaming at customer support. It's a frustrating experience, but that's what AI agents do. When they work, they're awesome. When they don't, we're stuck cleaning up the mess. And that's even
with frontier models. And here's the thing: what companies want is not a smarter model that can do PhD-level math. The models are already smart enough. What we actually want is models reliable enough to book airline tickets, sort out our emails, or file our taxes and invoices. That's what we actually want, and that is what we are building at Featherless AI. We are a research lab building personalized AGI that is made reliable for each one of you. Most recently, our research showed that we built an action R1 agent that beats Claude Sonnet, Gemini, and OpenAI. This model is not going to do PhD-level math, but it's going to fill out the form with absolute reliability, better than the frontier models. And that's the thing:
are we going to burn billions more to make a smarter model that is just a few percentage points higher in IQ, or are we going to make something that's 99.9% reliable for the boring things in life? Because that is where the money is for all of you. Think of it as AI engineers: reliability is revenue. For every use case you unlock and make reliable, you can build a billion-dollar app, in e-commerce or in B2B sales, and that's something all of you can build on; it's not rocket science. That's what we are building, and if you're excited about it, feel free to reach out to us. I'm
eugened.com. Test test. Hello
everyone. My name is Jonas. I'm an
engineer and I like working with data. I
love working with data. Actually, I love
it so much. I dropped out of high school
when I was 15. I got on a plane. I moved
across the country to California and I
joined a startup called Branch. You
might have heard of it. Anytime you were
clicking one of those links on your
phone for an app, that was probably us.
I also led a team there that built a
search engine that over a 100 million
people used every day. And then last
year, I left along with one of the
founders of Branch to tackle an even
bigger challenge.
This is probably what you think your
sales and marketing teams are doing with
their
budgets. And you wouldn't be entirely
wrong. So that's why I co-founded Upside.
We do forensic revenue attribution and
intelligence. But what does that
actually mean? Well, how many of you
have an email from a salesperson like
this sitting in your inbox right now?
Uhhuh. And how many of you are actually
going to reply to it?
Yeah, I didn't think so. These teams are
shouting into the void hoping something
will work because they don't actually
know what works because their data is a
mess. I mean, don't get me wrong,
they're data hoarders. They store
everything. They stuff it in Salesforce.
They treat it, you know, it's basically
a SQL database in a trench coat, but
they're not data practitioners. They
don't know what to do with it once they
have it. But now we have things like
LLMs. They can help with this. They can
take that poor, mishandled, abused email
record and they can pull the most
important details out of it into a
structured form. And so just as search
engines and web crawlers learn how to
make sense of the unstructured web,
upside is turning raw enterprise data
into a highly structured map of the
world and all the interactions that
people do in it. So there's hope. We can
untangle this mess and we can create a
data command center that these teams can
actually use to reach their customers
more effectively. We only just started
talking about this publicly a couple
weeks ago. Um, we've been quietly
building in the background for the last
year or so and my co-founder decided to
make a small post on her LinkedIn, you
know, just to update our network on what
we'd been off doing and the things that
we'd been building and it blew up. Like
there's so much pain people feel around
this and they're hungry for a solution.
We got a whole slew of demo requests
coming in from people that want access
to the product and now we have a bunch
of customers lining up that want to get
into our platform. It's a who's who of
companies you've heard of. Um and we
just have a lot of building to do now.
So if you're interested in working on
knowledge graphs, on data analytics
agents, on graph analytics and graph
learning models, come talk to me. We're
hiring.
Hello everyone. I'm Sua, the founder of
OpenAI. Uh, sorry, I mean Open
audio. Before that, I created something you might have heard of, Fish Audio. We have grown from 400k to 5.5 million in annualized revenue in just four months, and we closed our seed round at a 100 million valuation. It all started with my
girlfriend. I had a girlfriend for six
years from the beginning of high school
to
college. I love her so much and it was
so good until one day I found out she
cheated.
I wasn't angry, just confused,
disappointed, and I asked myself, if
this can happen, how can we trust
relationships
again? I thought about it for days, all
day, and all night. And finally, I found
my answer.
AI. But nobody can really fall in love
with today's AI, right? It's flat. It's emotionless. It's robotic. So I set out on a mission to build an AI that I could really fall in love with, starting with her voice. We began with open source and crushed it; we built Fish Speech, among others. Today, actually, not today, it's
the day before yesterday. I'm excited to
introduce S1, the first ever
instructible voice model. It's the only
model where you can control not just
what to say, but how to say it. Here's a
demo.
You can pinpoint, focus, or draw it
closer, even yelling why you betray me
like this.
Yeah, you can control whatever you want.
And uh with open audio S1 we have the
most expressive voice model in the
world. And most importantly, she will never leave.
So for some reason it skipped two slides. We have blown ElevenLabs out of the water based on the TTS Arena ranking, and they were in such a hurry that they dropped their latest model today, but unfortunately it's just a demo. So try us now: Fish Audio is instantly available at fish.audio. Thank
[Applause]
you. Hello, I'm David Vorick and I'm
building Glow. Prior to Glow, I built
Siacoin, a cryptocurrency that we took
from a $10,000 market cap to more than
$3 billion.
And we think Glow is going to be even
bigger. That's why Framework and USV
Union Square Ventures led a $30 million
round into our
company. Subsequently, we posted the
world record for on-chain DePIN revenue,
doing more than $10 million of revenue
in a single
day. What does Glow do? Glow builds
solar, not with shovels, but with
incentives. This is not a stock photo.
This is a photograph taken by
our team in India of a solar farm that
was constructed for the purpose of
mining glow tokens. A lot of people
don't realize but in the developing
world rising temperatures and growing
populations have strained the grid. In a
lot of cases, people are unable to run
their air conditioners during the heat
of the day. This causes people to die of
heat stroke. Glow is an incentive
protocol that revolutionizes what
governments do and can take the same
subsidy and turn it into 10 times as
much solar. If you're interested in
working with us, we're currently
building incentive projects in India, in
Mexico, in Lebanon, and across the
entire
world. Bitcoin incentivized the
construction of tens of millions of
mining machines. Glow asked, "Why not
tens of millions of solar panels?" Thank
you.
and my email
davidglowabs.org. I'd love to be in
touch.
Hi, I'm David.
I'm an
engineer. Uh, I made a social app with
250 million
users making $20 million a year.
Everyone thought it was luck. So, I did
it again. I'm building
Favorited. We built the world's most engaging live app, and we scaled it from one to $100 million annualized in six months.
If you're a cracked engineer that wants
to join the fastest growing company of
all time, uh, talk to me.
[Applause]
Hello, I'm Alex Atallah, building OpenRouter, the first and largest LLM marketplace. Thank you. I want to tell you a little bit about how it started. I co-founded OpenSea in 2017, and at the end of 2022 I really wanted to know whether inference was going to be a winner-take-all market, because the way it looked, this could be the largest market in software that has ever existed. The first experiment
that we tried was building a Chrome
extension to help you bring your own
language model to any website that
supported uh, the protocol. And that
eventually evolved into open router, a
single place and a single API to get all
language models uh with the best prices,
best performance and highest
uptime. And the way it works is you just
have a single API. You pay once and
there's near zero switching costs to
move from one model to another. We do
all the heavy work to implement tool
calling, edge cases, caching, and give
you the best prices and performance
possible for your region or wherever
your servers are
deployed. And because inference is so
important, remember this might be the
most important software market ever, it
deserves its own marketplace just for
language models optimized for them,
including filtering for context, for
features, for tool calling, for
structured output, and much more. And so
we built that. Then we built a chat room
for you to obviously compare models
head-to-head as simply as you do when
chatting with people in
iMessage. We built fine grain privacy
settings including API level controls.
We built a a lot of observability so you
could see which models you're using and
why. And we built public data in our
rankings page which has become the go-to
place for comparing models on their real
world usage and on different categories
for their prompts as well. This has grown 10 to 100% every single month for the last two years, and scaling it has been a lot of the work that we've done so far. The fundamental goal here is to make a
heterogeneous ecosystem homogeneous
because we believe inference is a
commodity. Claude from Bedrock should be the same as Claude from Vertex and Claude from Anthropic, and we do all the abstraction and heavy lifting to make it that way for you. I want to talk a
little bit about some of our technical
challenges. Um, we built our own system,
our own middleware for doing inference
called plugins, which are kind of like
MCPs except a little bit more powerful
because you can call MCPs from inside of
them and you can transform the outputs
from language
models. Bunch of other tricky problems
that we've done to make the fastest
routing in the market. Um, and we're
bringing a lot more features in the
coming months, including images,
enterprise features, prompt
observability, and
more. So, if you're interested, come
find me after or check out our careers
page. Thank you.
Check check. Give him a hand one more
time, y'all. One more time.
So, every founder that you just saw on
the stage is right outside that door. If
you want to meet any of those founders,
they'll be outside for just about 10
minutes. Um, feel free to meet them
right out there. All 10 teams will be
right there. What an incredible group.
Thank you all so much for coming.
Boom. I got it. Boom. You're good. Don't miss the folks outside.
All right, welcome back folks uh to the
search and retrieval track. Um I know we
just had demos in here, but hope you
guys had a great lunch. Uh I'm your host
today. My name is Dat. I'm with Arize AI. If you don't know what Arize is, we are the largest player in observability and evals, but I'm here to present our tracks today. So,
people here have a sales team at their
company? All right, awesome. Well, today
we get to go through Alice um with
Sherwood and Safwick, which is awesome.
And then next we'll be uh Exa. So, how
many people have you ever heard heard of
Exa? Woo. That's right. Uh so, we'll be
building smarter AI agents with neural
rag. And lastly, we have Google as well.
So layering every uh technique in rag
one query at a time but really excited.
Please give a warm welcome to
11x. Thank
[Music]
you. Just getting our slide sorted
here. This is a photo of a forest.
This looks good. Let's bring it back to
the start. I'll try this. Okay. Thanks
everyone for coming today. Uh so today's
talk is called building Alice's brain.
How we built an AI sales rep that learns
like a
human. Uh my name is Sherwood. I am one
of the tech leads here at 11x. I lead
engineering for our Alice product and
I'm joined by my colleague
Satwik. So 11X, for those of you who are
unfamiliar is a company that's building
digital workers for the go to market
organization. We have two digital
workers today. We have Alice who is our
AI SDR and then we also have Julian who
is our voice agent and we have more
workers on the
way. Today we're going to be talking
about Alice specifically and actually uh
Alice's brain or the knowledge base
which is effectively her
brain. So let's start from the basics.
Uh what what is an SDR? Well, an SDR is
is a sales development representative.
If you're not familiar, I know that's a
room full of engineers, so I thought I
would start with the basics. And this is
essentially an entry-level sales role.
This is the kind of job that you might
get uh right out of school. And your
responsibilities basically boil down to
three things. First, you're sourcing
leads. These are people that you'd like
to sell to. Then you're contacting them
or engaging them across channels. And
finally, you're booking meetings with
those people. So your goal here is to
generate positive replies and meetings
booked. These are the two uh key metrics
for an
SDR. And a lot of an SDR's job boils
down to writing emails like the one that
you see in front of you right now. This
is actually an email that Alice has
written and uh it's an example of the
type of uh type of work output that
Alice has. Uh Alice sends about 50,000
of these emails in a given day, and
that's in comparison to a human SDR who
would send 20 to 50. Uh and Alice is now
running campaigns for about 300
different uh business
organizations. So before we go any
further, I want to define some terms
because since we work at 11X, we have
our customers but then our customers
also have their customers. So things get
a little confusing. Uh today we'll be
using the term seller to refer to the
company that is selling something
through Alice. That is our customer. And
then we'll be using the term lead to
refer to the person who's being sold
to. And here's what that looks like as a
diagram. You can see the seller is
pushing context about their business.
These are the the products that they
sell or the case studies that they have
that they can reference in emails. She
they're pushing that to Alice and then
Alice is then using that to personalize
emails for each of the leads that she
contacts.
So there are two requirements that Alice
needs to meet in order to succeed in her
role. The first is that she needs to
know the seller, the products, the the
services, the case studies, the pain
points, the value props, the ICP. And
the second is that she needs to know the
lead, uh their role, their
responsibilities, what they care about,
what other solutions they've tried, uh
pain points that they might have be
experiencing, the company they work for.
And today we're going to be really
focused on knowing the seller.
In the old version of our product, the seller was responsible for pushing context about their business to Alice, and they did so through a manual experience called the library. Here you can see what that looks like: the library shows all of the different products and offers available for this business that Alice can then reference when she writes emails. The user would have to enter details about every individual product and service, and all of the pain points, solutions, and value props associated with them, in our dashboard, including detailed descriptions. Those descriptions were important to get right, because they actually get included in Alice's context when she writes the emails. Then later on, during campaign
creation, this is what it looks like to create a campaign. You can see we have a lead in the top left, and the user is selecting, in the top right, the different offers they've defined in the library; these are the offers Alice has access to when she's generating her
emails. We had a lot of problems with
this user experience. The first was that it was just extremely tedious, a really cumbersome user experience. The user had to enter a lot of information, and that created onboarding friction where users couldn't actually run campaigns until they had filled out their library. And finally, the emails we were generating with this approach were just sub-optimal. Users had to choose between too few offers, which meant irrelevant offers for a given lead, or too many offers, which meant all of that stuff sat in the context window and Alice just wasn't as smart when she wrote those
emails. So, how can we address this?
Well, we had an idea which is that
instead of the seller being responsible
for pushing context about the business
to Alice, we could flip things around so
that Alice can proactively uh pull all
of the context about the seller into her
system and then use whatever
is most relevant when writing those
emails. And that's effectively what we
accomplished with the knowledge base
which we'll tell you more about in just
a
moment. So for the rest of the talk,
we're going to first do a high-level overview of the knowledge base and how
it works. Then we will do a deep dive on
the pipeline, the different steps in our
rag system
pipeline. Then after that we will talk
through the user experience of the
knowledge base and we will wrap up with
some lessons from this project and uh
future
plans. So let's start out with an
overview. All right. So overview, what
is knowledge base, right? It's basically
a way for us to get closer to a human experience. If you're training a human SDR, you bring them in, you dump a bunch of documents on them, and they ramp up over a period of weeks or months, and you can check in on their progress. Similar to that, the knowledge base is a centralized repository on our platform for the seller info: users can come in, dump all their source material, and then we are able to reference that information at message-generation time. Now, what
resources do SDRs care about? Here's a
little glimpse into that. Marketing
materials, case studies, uh, sales
calls, press releases, you know, and a
bunch of other stuff. Um, now, how do we
bucket these into categories that we're
actually going to parse? Uh, well, we
created documents and images, websites,
and then media, audio, video, and you're
going to see why that's
important. So, here's an overview of what the architecture looks like. It starts with the user uploading a document or resource in the client. We save it to our S3 bucket and send it to the back end, which creates a bunch of resources in our DB and kicks off jobs depending on the resource type and the vendor selected. The vendors do the parsing asynchronously. Once they're done, they send a webhook to us, which we consume through our ingestion service. Once we've consumed that webhook, we take the parsed artifact we get back from the vendor, store it in our DB, and at the same time embed it and upsert it to Pinecone. Once it's stored in our local DB we get a UI update, and eventually our agent can query Pinecone, our vector DB, for the stored information we just put in. So now
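As a loose sketch of the webhook-consumption step (an illustrative shape, not 11x's actual service; the route path, payload fields, and helper functions are hypothetical), the flow could look like:

```python
from fastapi import FastAPI, Request

app = FastAPI()

def chunk_markdown(markdown: str) -> list[str]:
    """Stand-in for the chunking step described later in the talk."""
    return [p for p in markdown.split("\n\n") if p.strip()]

def store_chunks(resource_id: str, chunks: list[str]) -> None:
    """Stand-in: persist chunks to the relational DB."""

def upsert_to_vector_db(resource_id: str, chunks: list[str]) -> None:
    """Stand-in: embed the chunks and upsert them to the vector index."""

@app.post("/webhooks/parsing-complete")
async def parsing_complete(request: Request):
    # Called by the parsing vendor when an async parse job finishes.
    payload = await request.json()        # assumed payload shape
    resource_id = payload["resource_id"]
    markdown = payload["markdown"]         # parsed artifact as markdown

    chunks = chunk_markdown(markdown)
    store_chunks(resource_id, chunks)
    upsert_to_vector_db(resource_id, chunks)
    return {"status": "ok"}
```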
that we have a high level of
understanding of how the knowledge base
works, let's dig into each individual
step in the pipeline. There are five
different steps in the pipeline. The
first is parsing, then there's chunking,
then there's storage, then there's
retrieval, and finally we have visualization, which sounds a little untraditional, but we'll cover it in a moment. So let's start with
parsing. Uh what is parsing? I think
that we probably all take this for
granted, but it's worth defining.
Parsing is the process of converting a non-text resource into text. The reason this is necessary is that, as we all know, language models speak text. So in order to make information that is represented in a different form, like a PDF, an MP4 file, or an image, legible or useful to the LLM, we first need to convert it to text. One way of thinking about parsing is that it's the process of making non-text information legible to a large language model. We do have
multimodal models that are one solution
to this, but there are lots of
restrictions on multimodal models that
make it uh that make parsing still
relevant. To illustrate: we have the five different resource types we mentioned a moment ago going through our parsing process, and what comes out is markdown, a type of text that, as we all know, contains structural information and formatting that is semantically meaningful and useful.
Let's talk about the process of how we implemented parsing, and the short answer is that we did not. We didn't want to build this from scratch, and we had a few reasons. First, as you just saw, we had five different resource types and a lot of different file types within each of them; we thought it would be too many and too much work, and we wanted to get to market quickly. The last reason was that we just weren't that confident in the outcome. There are vendors who dedicate their entire company to building an effective parsing system for a specific resource type. We didn't want our team to have to become specialists in parsing for each of these resource types and build a parsing system for each of them; we thought that if we tried, the outcome probably wouldn't be that successful. So, we chose to
work with a vendor and here are a bunch
of the vendors that we we came across.
You can find 10 or 20 or 50 with just a
quick Google search. But these are some
of the leaders that we
evaluated. And in order to make a
decision, we came up with some
requirements and three specific
requirements. The first was that we
needed support for our necessary
resource types. That goes without
saying. We also wanted markdown output.
And then finally, we wanted this vendor
to support web hooks. We wanted to be
able to receive that output in a
convenient manner.
There were a few things that we didn't consider to start with. Accuracy. Crazy, right? We didn't consider accuracy or comprehensiveness. Our assumption was that most of the vendors who are leaders in the market would be within a reasonable band on both. Accuracy refers to whether the extracted output actually matches the original resource; comprehensiveness, on the other hand, is the amount of extracted information that ends up in the final output. The last thing we didn't
really consider was cost, to be honest, and that was because this system was pre-production. We didn't have real production data yet and didn't know what our usage would be, so we figured we would come back and optimize cost once we had real usage data. So, on to our final selections. For documents and images, we chose to work
with LlamaParse, which is a LlamaIndex product. I think Jerry was up here earlier today. The reasons we chose LlamaParse were, first, that it supported the largest number of file types of any document parsing solution we could find, and second, their support was really great: Jerry and his team were quick to get in a Slack channel with us, I think within just a couple of hours of us doing an initial evaluation.
And with LlamaParse, we're able to turn
documents like this PDF of a 11X sales
deck into a markdown file like the one
you see on the
right. For websites, we chose to work with Firecrawl. The other main vendor we were considering was Tavily, and this is not really a knock on Tavily. We chose Firecrawl because, first, we were familiar with them; we had already worked with them on a previous project. And second, Tavily's crawl endpoint, which is the endpoint we would have needed for this project, was still in development at the time, so it wasn't something we could actually use. Similar to LlamaParse, with Firecrawl we are able to take a website, like the homepage you see here, and turn it into another markdown
document. Then we have audio and video, and for those we chose to work with a newer upstart vendor called CloudGlue. The reasons we chose CloudGlue were, first, that they supported both audio and video, not just audio, and second, they were actually capable of extracting information from the video itself, as opposed to just transcribing it and giving us back a markdown file containing the transcript of the audio. So with CloudGlue, we're able to turn YouTube videos, MP4 files, and other video formats into markdown like you see on the right. Now that everything is markdown, we
chunking. All right, markdown. Let's go.
Now basically we have a blob of
markdown, right? And we want to kind of
break it down into like semantic
entities that we can embed and put it in
our vector DB. At the same time, we want
to uh protect the structure of the
markdown because it contains some
meaning inherently like something's a
title versus something's a paragraph.
There is inherent meaning behind that.
Um so we're splitting these long blobs
of text like 10page documents into
chunks that we can eventually retrieve
uh after we've embedded and stored them
in a vector DB, right? And now basically
we can like take all of this and like
we're thinking about how we can you know
split a long document into chunks right
so chunking strategies um you have
various things that you can do you can
split on tokens you can split on
sentences you can also split on markdown
headers right and then you can do like
LLM calls and have an LLM split your
document into chunks you know or any
combination of the above um now what you
you want to ask yourself when you're
deciding on a chunking strategy is like
um what kind of logical units am I
trying to preserve in my data right what
do I eventually want to extract during
my retrieval right what strategy will
keep them intact and at the same time
you're able to successfully embed them
and store them in whatever DB you want
um so and then should I try a different
strategy for different resource types we
have like we have deal with PDFs
powerpoints videos right um and then
eventually what kinds of queries or
retrieval strategies am I expecting? Um
and then we ended up with like a
combination of all the three like all
the things that we mentioned. So we
split on markdown headers and then we
kind of a waterfall. So because we want
our like records in our vector DB to be
a certain token count. So we split our
markdown headers and then we split on
sentences and then eventually we split
on tokens and then yeah it's like worked
well for us for all types of documents.
Um, and it has successfully preserved
our markdown chunks that we can kind of
cleanly show in the UI. Um, and it also
prevents super long chunks which are,
you know, diluting the meaning behind
your document if you end up with that.
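Here is a small, illustrative sketch of that waterfall idea, splitting on markdown headers, then sentences, then tokens; the token budget and the whitespace-based token counting are simplifying assumptions, not the production implementation.

```python
import re

MAX_TOKENS = 512  # assumed budget per chunk

def num_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def waterfall_chunk(markdown: str) -> list[str]:
    """Split on headers first, then sentences, then raw tokens as a last resort."""
    chunks: list[str] = []
    for section in re.split(r"\n(?=#{1,6} )", markdown):      # header-level split
        if num_tokens(section) <= MAX_TOKENS:
            chunks.append(section)
            continue
        current = ""
        for sentence in split_sentences(section):              # sentence-level split
            if num_tokens(current + " " + sentence) > MAX_TOKENS and current:
                chunks.append(current.strip())
                current = ""
            if num_tokens(sentence) > MAX_TOKENS:               # token-level fallback
                words = sentence.split()
                for i in range(0, len(words), MAX_TOKENS):
                    chunks.append(" ".join(words[i:i + MAX_TOKENS]))
            else:
                current += " " + sentence
        if current.strip():
            chunks.append(current.strip())
    return chunks
```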
Okay, so we have split all of our
markdown into individual chunks. It's
now time to put those chunks somewhere.
We're going to store them, so let's talk about storage technologies. I'm sure everyone here for the RAG track assumes we're using a vector database, and we actually are. But to be pedantic, RAG is retrieval-augmented generation, and as we all know, anytime you're retrieving context from an external source, whether it's a graph database, Elasticsearch, or a file in the file system, that also qualifies as RAG. Some of the other options you can use for RAG: I just mentioned a graph database; there are also document databases, relational databases, and key-value stores, and you could even use object storage like S3. In our case, we did use a vector database, because we wanted to do similarity search, which is what vector databases are built for and optimized
for. Once again, we had a lot of options
to choose from. This is not a complete
or an exhaustive
list. In the end, we chose to work with a company called Pinecone. The reasons: first, it was a well-known solution; we were fairly new to the space and figured we probably couldn't go wrong with the market leader. It was cloud-hosted, so our team wouldn't have to spin up additional infrastructure, and it was really easy to get started with, with great getting-started guides and SDKs. They also had embedding models bundled with the solution. For a vector database, you typically have to embed the information before it goes into the database, which would require a third-party or external embedding model; with Pinecone, we didn't have to find another embedding model provider or host our own, we just used the one they provide. And last but not least, their customer support was awesome. They got on a lot of calls with us, helped us analyze different vector database options, and helped us think through graph databases and graph RAG and whether that made sense for us.
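As a rough sketch of the store-and-query step with a hosted vector database (the index name, embedding call, and metadata shape are assumptions; check the current Pinecone client docs before relying on exact signatures), it could look something like:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("knowledge-base")   # assumes the index already exists

def embed(text: str) -> list[float]:
    """Placeholder for whichever embedding model you use (hosted or bundled)."""
    raise NotImplementedError

def upsert_chunks(resource_id: str, chunks: list[str]) -> None:
    index.upsert(vectors=[
        {"id": f"{resource_id}-{i}", "values": embed(c), "metadata": {"text": c}}
        for i, c in enumerate(chunks)
    ])

def search(query: str, top_k: int = 5) -> list[str]:
    result = index.query(vector=embed(query), top_k=top_k, include_metadata=True)
    return [m.metadata["text"] for m in result.matches]
```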
So, retrieval: the R in the RAG workflow we just built. There's actually been an evolution of RAG techniques over the last year. We started off with traditional RAG, which is basically pulling information and enriching your system prompt for an LLM API call. Then that turned into agentic RAG, where you have tools for information retrieval, you attach those tools to whatever agentic flow you have, and the agent calls the tool as part of its larger flow. And something we've seen emerge in the last couple of months is deep-research RAG, where deep research agents come up with a plan and execute it, and the plan may contain one or many retrieval steps. These agents can go broad or deep depending on the context needs, and they can evaluate whether they want to do more retrieval. We ended up building a deep research agent, and we actually used a company called Letta. Letta is a cloud agent provider, and they're really easy to build with. How it works: we pass the lead information to our agent, it comes up with a plan, the plan contains one or more context retrieval steps, and then it does the tool calls, summarizes the results, and generates an answer for us in a nice, clean Q&A format. This is how it looks for a system with two questions that we ask.
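To make the "retrieval as a tool" idea concrete, here is a minimal sketch of an agentic RAG loop; this is a generic illustration, not Letta's SDK, and `llm_plan`, `llm_answer`, and `search` are hypothetical helpers you would supply.

```python
def answer_about_seller(lead: dict, question: str, llm_plan, llm_answer, search) -> str:
    """Plan retrieval steps, run them against the knowledge base, then answer."""
    # 1. Ask the model to plan one or more retrieval queries for this lead/question.
    queries = llm_plan(
        f"Lead: {lead}\nQuestion: {question}\n"
        "List the knowledge-base searches needed, one per line."
    ).splitlines()

    # 2. Execute each retrieval step against the vector index.
    context = []
    for q in queries:
        if q.strip():
            context.extend(search(q.strip(), top_k=3))

    # 3. Summarize the retrieved chunks into a grounded answer.
    return llm_answer(
        f"Context:\n{chr(10).join(context)}\n\nQuestion: {question}\nAnswer concisely."
    )
```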
Now on to visualization, the most mysterious part of the pipeline. What does visualization have to do with a RAG or ETL pipeline? For context, our customers are trusting Alice to represent their business. They really want to know that Alice knows her stuff, that she actually knows the products they sell, and that she's not going to lie about case studies or testimonials or make things up about the pain points they address. So how can we reassure them? We came up with a solution: let users peek into Alice's brain. Get ready. This is what that looks like: an interactive 3D visualization of the knowledge base, available in the product. What we've done is take all of the vectors from our Pinecone vector database and project them down to just three dimensions, so we can render them as nodes in three-dimensional space. Once the nodes are visible in this space, you can click on any given node to view the associated chunk. This is one of the ways that, for example, our sales team or support team will demonstrate Alice's knowledge.
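A minimal sketch of that projection step, assuming you have already fetched the vectors from the index and using PCA from scikit-learn as one possible projection (the talk does not say which dimensionality-reduction method they use):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_to_3d(vectors: list[list[float]]) -> np.ndarray:
    """Project high-dimensional embedding vectors down to 3 coordinates
    so they can be rendered as nodes in a 3D scene."""
    matrix = np.asarray(vectors, dtype=np.float32)
    return PCA(n_components=3).fit_transform(matrix)

# points[i] gives (x, y, z) for chunk i; pair it with chunk ids/text for click-through.
```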
So how does it look in the actual UI? You start off with this nice little modal: you drop in your URLs, your web pages, your documents, your videos, you click Learn, and it shows up nicely in the UI. You have all the resources there, and you have the ability to interrogate Alice about what she knows of your knowledge base; it's a really nice agent that we built, again using Letta. And here's how it looks in the campaign creation flow. On the left-hand side you see the knowledge base content showing up as a nice Q&A, where you can click on the questions and see a dropdown of the chunks we retrieved, which were used as part of the messaging flow. So with that, we have achieved our goal: our agent is closer to being a human than to being an email machine. We are now basically emulating how you onboard a human SDR: you dump in a bunch of context, and they just know it.
So in conclusion, the knowledge base was
a pretty revolutionary project for our
product and really changed the user
experience and also leveled up our team
a lot. Uh we learned a lot of lessons.
It was hard to create this slide, but
there are just three that I want to
highlight for you today. The first was
that rag is complex. It was a lot harder
than we thought it was going to be.
There were a lot of micro decisions made
along the way, a lot of different
technologies we had to evaluate.
Supporting different resource types was hard. Hopefully, you all now have a better appreciation of how complicated RAG can
be. The second lesson was that you
should first get to production before
benchmarking and then you can improve.
And the idea here is that with all of
those decisions and vendors to evaluate,
uh, it can be hard to get started. So we
recommend just getting something in
production that satisfies the product
requirements and then establishing some
real benchmarks which you can use to
iterate and
improve. And the last learning here was
that you should lean on vendors. You
guys are all going to be buying
solutions and they're going to be
fighting for your business. Make them
work for it. Make them teach you about
the different uh the different offerings
and why their solution is better.
Our future plans are to first track and address hallucinations in our emails; to evaluate parsing vendors on accuracy and comprehensiveness, the metrics we identified earlier; to experiment with hybrid RAG, introducing a graph database alongside our vector database; and finally to focus on reducing cost across the entire pipeline. And if any of this sounds interesting to you, we are hiring, so please reach out to either Satwik or myself. Join us, and thank you all for coming today.
[Applause]
Awesome. Thank you so much. That was a
great talk. Um yeah, send those over.
Big shout out to 11X. That was really
awesome to see your journey through. So
next up we have Exa uh for live coding
demo, too. So this will be great. Uh do
we have Will? All right, come on up,
Will.
Not quite live.
Okay. Hello
everybody. All right. So I was going to give a live coding demo, and I will, but I know you all are actually here to hear a cool story. So I'll tell you a story about web search built for AI, and then we'll do some coding at the end.
This story will end with this slide uh
one API to get any information from the
web. And you'll know what this means by
the end. But the story starts in
1998. And what you're looking at is the
the state-of-the-art in information
retrieval in 1998. You type in a word
Australia to this new search engine
called Google and it magically finds you
all the documents that contain the word
Australia from the web. It's crazy. Um
and the the big insight of Google was
they had this page rank algorithm. So uh
the results are ranked by authority
based on the graph structure of the web.
And this was a clever algorithm and it
was really cool. I was two years old at
the time. So if I was conscious I would
have thought this was cool. Okay, and now our story skips 23 years to 2021. By this point I was conscious, barely, and I noticed that GPT-3 had recently
come out and it was this magical thing
that you could input a whole paragraph
explaining exactly what you want. Uh and
it would really understand the
subtleties of your language and give you
an output that exactly matched. Um and
it's hard to remember how magical this
was but it was really magical in 2021.
And at the same time, I noticed there
was Google, which you know, you type in
a simple query like shirts without
stripes and it would give you shirts
with stripes, which is crazy. It doesn't understand the word 'without', because it's doing a keyword comparison algorithm. And so I decided that for at least the next 10 years, I'm going to
devote myself to combining the technology of GPT-3 with a search engine, to make a search engine that actually understands
what you're saying uh at a deep level
and understands all the documents on the
web at a deep level and gives you
exactly what you asked for. This is a very big idea. We've worked on it for four years and made a lot of progress, but it would change the world if you actually solved this problem. And
so in 2021, we joined YC,
summer 2021. Uh, we raised a couple
million dollars and we did what every YC
startup should do. We spent half of it
on a GPU
cluster. I'm joking. You shouldn't do
that. Um, and and then we also followed
YC's advice uh where we didn't talk to
any users or or customers for a year and
a half and we just did research. Um,
again, you shouldn't do that either, but in our case it made sense because we were trying to solve a really hard problem, which is redesigning search from scratch using the same technology as GPT-3, this next-token
prediction idea with transformers. What
if you could apply the same thing uh to
search? And this is actually one of our
W&B training runs. The purple one, I believe, was a breakthrough where it really learned; there were a few breakthroughs along the way, involving random data sets and different transformer architectures we were trying, and this purple one really started to work well. And the general idea we had was
like, okay, what is a search engine? You have like a trillion documents
on the web. Um, and traditional search
engines uh on a very high level will
create like a keyword index of those
documents. So for each document you ask, what are the words in that document, and you create this big
inverted index where you map from like
words like brown to all the documents
that contain that word. Um and then at
search time you know search without
stripes comes in you do some crazy
keyword uh comparison algorithm and get
the top results. That's obviously a
simplification of what Google does. But
at a fundamental level, it's doing a keyword
comparison. But the big idea with transformers was: what if you could turn each document not into a set of keywords but into
embeddings. Uh and these embeddings can
be arbitrarily powerful, right? Like
an embedding is just a list of numbers, and it can represent lots of information. An embedding doesn't just capture the
words in the document but also the
meaning the ideas in the document and
the way people refer to that document on
the web. An embedding can be arbitrarily big, so of course in the limit it would just destroy keywords. You have this arbitrarily powerful representation, and that was the fundamental idea, just like the bitter lesson: what if we could train transformers
to output embeddings for documents and
if we keep getting more and more high-quality data, we could make
a search engine that actually
understands you. The way it would work at inference time, at search time, is: a query comes in, like 'shirts without stripes'. Traditional search engines would use the approach above, a very fancy keyword comparison and a bunch of other things; instead, we would just embed 'shirts without stripes' and compare it to the embeddings of all the trillion documents. And after a year and a half, we actually had a new search engine that worked in a very different way.
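A toy sketch of that inference path, with precomputed document embeddings standing in for the real index (in practice this is an approximate nearest-neighbor index over enormous numbers of documents, not a brute-force NumPy scan; the file names are assumptions):

```python
# Toy version of embedding-based retrieval: embed the query, score every document
# embedding by cosine similarity, return the top-k. The real system needs an ANN index.
import numpy as np

doc_embeddings = np.load("doc_embeddings.npy")     # (N, D), unit-normalized, built offline
doc_urls = open("doc_urls.txt").read().splitlines()

def search(query_embedding, k=10):
    scores = doc_embeddings @ query_embedding      # cosine similarity for normalized vectors
    top = np.argsort(-scores)[:k]
    return [(doc_urls[i], float(scores[i])) for i in top]

# query_embedding would come from the same model that embedded the documents:
# results = search(query_embedding)
```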
You search 'shirts without stripes' on Google, sorry, on Exa, and you get a list of results that actually do not have stripes. It's a simple example, but it can handle way more complex queries, like paragraph-long
queries. And when we launched this in
November 2022, we got a lot of
excitement on Twitter. This is a very
new paradigm for search. You could do
all sorts of interesting queries that
you couldn't do before. And then two
weeks later, this happened. It was a
small tweet. Um, and uh, this is a
visual depiction of San Francisco at the
time. You guys probably all remember
this. And then this is a visual
depiction of the Exa team at the time, because ChatGPT completely changed the way we interact with the world's information. Everyone can now use an LLM to just talk to their computer and get information.
And we were thinking, wait, is there
even a role for search in this world?
These LLMs are so powerful. And then very quickly we realized yes, there is a role, because LLMs don't know everything
on the web so for example if you ask an
LLM like GPT-4, find me cool personal sites of engineers in San Francisco, it can't; it just doesn't have that in the weights. It'll apologize or
whatever. And there's a very simple information theory argument here: there literally isn't enough information in the weights of GPT-4 to store the whole web. We don't know exactly how many parameters GPT-4 has (I think someone leaked it on YouTube once), but it's a couple trillion parameters. You could call it less than 10 terabytes in the weights of GPT-4. And then the internet is over a million terabytes. And that's just the
documents on the web. Uh there's also
images and video and that's way more. Um
Actually, I did a tweet recently about the size of the web, and it's in the exabyte range. And our name is Exa. It's not a coincidence.
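As a rough back-of-envelope version of that argument (both figures below are assumptions, order-of-magnitude only):

```python
# Rough arithmetic only; neither number is official.
params = 2e12                  # "a couple trillion" parameters (rumored)
bytes_per_param = 2            # fp16 weights
weights_tb = params * bytes_per_param / 1e12   # ~4 TB of weights
web_tb = 1e6                   # web text "over a million terabytes" (~1 exabyte)
print(f"weights ~ {weights_tb:.0f} TB, web ~ {web_tb:,.0f} TB "
      f"({web_tb / weights_tb:,.0f}x larger)")
```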
Anyway, so LLMs need to search the web just from this simple argument, and they're going to need to do that for a long time.
Um, if you talk to ML researchers,
they'll say the same thing. It's just
like it's too hard. Also, the web is
constantly updating. That's another
problem. It's not just the size of the
web, it's the constant updating of the web that makes it very tricky. So LLMs will always need search. That's
great. Um, and so when you combine an
LLM with a search engine like Exa, you
can handle these uh queries. So, like
find me cool personal sites, engineers,
and SF, uh, the LLM will search Exa, get
a list of personal sites, uh, and then
like use that information to output the
perfect thing for the user. You're all
very familiar with this like LLM plus
search. It's obvious now, right? Like
everyone knows about it. But now, let me
tell you a secret about search that most
people don't know.
Um, and the secret is that traditional
search engines were not built for this
world of AI. Traditional search engines
were built for humans. And humans are very different from AI. Uh,
so every search engine like Google,
Bing, you name it, uh, was built in a
different era for this kind of creature.
uh this this slow flesh human that's
typing keywords and wants to read a few
links and really cares about UI of the
page and all these things like it's a
lazy human. They type simple keywords.
Google is great for this creature. Um
Google was optimized for this creature.
It gives you exactly the kinds of things
you would click
on. But AIs are very different. An AI can gobble up information like crazy. This is a much slowed-down version of what our AIs probably feel like inside. And so AIs are very
different. They want to use complex
queries, not simple ones, to find not a
couple links, but just tons of
knowledge, as much knowledge as they
could get because they actually have the
patience to just analyze it all
extremely fast. And so the the search
algorithm that's optimal for this type
of creature is not the same algorithm
that's optimal for the human. It would be crazy if the same algorithm that's optimal for humans were optimal for AIs. And so a lot of the search tools that we're talking about these days on Twitter and such are still using the old traditional search combined with AI. It's just not the right puzzle fit. So at Exa, we're really trying to think about what the right search engine is for this AI
world. And so just a few examples uh we
could dive deep into um to how AI are
different. Well, AIs want precise,
controllable information. So by the way,
when I say AI, I'm usually I'm talking
about like an AI product. So imagine
like in this case like a VC that's using
an AI system to find a list of companies
uh because they want to invest. So you
know they're looking for something
what's the next big thing? What's the
next big thing that feels like Bell
Labs? Well, when they tell their AI what
they want, the AI will then go search a
search engine, right? And if it searches
a search engine like Google, they'll get
a list of results that humans like to
click on, but it's not very information
dense, and it doesn't even match what the AI asked for. The AI asked for startups working on something huge that feels like Bell
Labs. It should get a list of startups.
It's kind of a crazy idea, but what if
search engines actually returned exactly
what you asked of them, and not what Google knows you will click on. And so with AI especially, they just
want a search engine that returns
exactly what they ask for. Because
what the world is really going to look like is: you're going to interact with your AI agent, you're going to ask for something, and then it's going to make tons of searches. Okay, maybe they want startups working on something similar to Bell Labs; maybe they want startups working only in New York City that have this quality and that quality. It'll do all sorts of searches, and it just wants a search API that does what it asks. And so you need a search engine like that. Exa is like that. Um, another
difference between AIs and humans is that AIs
want to search with lots of context.
Again, if you have an AI
assistant and you talk to it all day and
then you ask for restaurants or
apartments or or what have you, uh, the
AI has lots of context on you. So it
should be able to search with this uh
large multi paragraph thing saying like
you know my human is a software engineer
and it likes these types of things and I
like these types of things and like can
you give me uh you know restaurants that
match those preferences. Uh and so you
need a search engine that could
literally handle multiple paragraphs of
text. But traditional search
engines like Google were not meant to do
that because humans would never type in
multiple paragraphs because they're too
lazy. So Google was optimized for like
simple keyword queries. Google, I think, has a limit of a few dozen keywords, whereas Exa can handle
multiple paragraphs of
text. Another big one where AIs are different from humans is that AIs want comprehensive knowledge. If you give a human 10,000 links or 10,000 pages, they don't know what to do with that; it would take 10 days of extreme patience to process it all. But an AI can do it in three seconds if it's parallelized, right? So if I'm a VC
and I want to report on like all the
companies in a space, I want literally
all the companies. And there's a huge
amount of value to getting truly all of
them and not just like the 10 or 20 that
Google is able to find. And so you need
a search engine that exposes the ability
to return a thousand 10,000 whatever it
is. And also has this semantic ability
to like you know when you say like every
startup funded by YC working on AI you
actually can get all of them. So like
Google literally just can't do this at
all. Okay. I hope that through these
examples we see that the space of
possible queries is actually like way
larger than people realize. Uh and until
like 2022, we were kind of in this like
top left blue world. Uh so this circle
is like the space of possible queries
and the blues are like uh you know
specific subsets of that space. And so
like, we were all in that top-left corner of blue for a long time, where search engines could handle basic keyword queries like 'stripe pricing' or
someone's GitHub page or Taylor Swift
boyfriend or whatever it is. Uh after
2022, everyone started to want the top
right blue uh circle where it was like,
"Hey, actually, I want to make queries
like explain this concept to me like I'm
a 5-year-old or here's my code. Can you
like debug it?" This form of query doesn't require search, but it's
another type of query that was
introduced to the world in 2022. And
then like uh there's other types of
queries like these semantic queries like
people in San Francisco who know
assembly. As far as I'm aware, Exa kind of introduced this kind of query, and does really, really well on those queries. And then there are these
really complex queries like find me
every article that argues X and not Y
from an author like Z. And we're
starting to now have systems, like Exa's Websets product, that can handle
these things. And I think this is
actually a huge space, because this turns the web into a database you can filter however you want. And that's really what AIs want: a full-control, database-like query system from which they can get whatever they need for their user. And
then there are the queries that no one
has thought of yet. Every week we get tons of queries where we go, oh wait, that's a really interesting type of query that no search engine could do right now, and eventually we'll try to handle all the queries that are possible. But
there's so many new types of queries now
because we have these AI systems and the
expectations have just gotten way
higher. Okay, so now we end our story with the same slide: one API to get any information from the web. So
again, Exa is trying to handle not just the keyword queries but also the semantic queries and also the super-complex queries, and eventually all queries. We want one API that can give these AI systems whatever knowledge they want. You have the AI, and you have Exa providing the knowledge. Oh, I only
have four minutes.
Okay. Okay.
So, how do I switch to the code editor? How do I do that? Oh, cool. Okay.
There we go. Okay. Cool. Well, first of all, just a very quick exploration: this is our search dashboard, where we can try different queries. I would just point out that in the search API endpoint we expose lots of different toggles. First of all, you just try out a query; it shows you the code and gets you a list of results. And it exposes tons of different types of filters you might want: for example, number of results, 10, 100, a thousand, whatever it is. You can have date ranges, or say I only want to search over these domains. It's a lot of toggles, but I think the point is you actually want the toggles, because your AI is going to be calling this; you want a search engine that gives you full control. And we have neural and keyword search, so you can try different ones.
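As a rough idea of what calling that endpoint looks like from code, here is a sketch using the exa_py SDK; the parameter names are from memory and may differ slightly, so check the current docs:

```python
# Sketch of an Exa search call with a few of the toggles mentioned above.
# Parameter names are from memory; verify against the current SDK docs.
from exa_py import Exa

exa = Exa(api_key="YOUR_EXA_API_KEY")

results = exa.search(
    "personal site of an engineer in San Francisco who likes information retrieval",
    type="neural",                      # or "keyword"
    num_results=10,                     # could be 100+ on higher plans
    include_domains=["github.io"],      # optional: restrict to certain domains
    start_published_date="2020-01-01",  # optional: date-range filter
)

for r in results.results:
    print(r.title, r.url)
```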
Okay, let me quickly jump to the code. So I prepared this code, agent.py. We made this agent, Agent Mark, and Mark loves to make markdown out of things. Anything you give it, it will make markdown. Mark will make markdown. And so in this case, let's try this query: personal site of an engineer in San Francisco who likes information retrieval.
Well, this is the kind of query that neural search would be a lot better at. Okay, so it's making a query
to get like a list of personal sites of
engineers in San Francisco who like information retrieval, and Mark, the agent, is just making a markdown output of that. That's a very neural type of query. You also might want to do a different type of query, which is a more keyword-heavy one. Let's see,
like my GitHub. So here I would want to make a keyword query, so you just change it to keyword search. It's going to get information from my GitHub using keyword search, because this is a very typical, Google-like search that
would work well, right?
Go.
Okay, cool. That's information about
Wilbur's GitHub. Um and then okay, so
when you're actually building an agent,
you're going to be combining lots of
different types of searches. So neural
searches and keyword searches and all sorts of other searches that Exa exposes.
The right agent in the future is going to be a system that decides what type of search it needs for whatever the user says: it'll go, okay, I'm going to make a neural search to get a list of things, and then for each one I'm going to do a keyword search. You want to give the agent full access to the world's information in whatever way it wants, not just keyword search but also all these other things. And so
here I one-shotted with o3 a GitHub agent which combines these two queries. I want to get the GitHub of every engineer in San Francisco who likes information retrieval, so the agent will make a neural search to get a list of people, extract the names, and then search those names using a keyword search to get their GitHubs. And if you run it, here it's just getting 10 results, but with Exa we could do a hundred or a thousand if you're on an enterprise plan.
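A condensed sketch of what that two-step agent might look like; the name-extraction heuristic and the exact SDK parameters are assumptions (the real agent.py was generated with o3):

```python
# Sketch of the two-step agent: neural search for people, then a keyword search
# per person to find their GitHub. Name extraction here is deliberately naive.
from exa_py import Exa

exa = Exa(api_key="YOUR_EXA_API_KEY")

# Step 1: neural search for personal sites of SF engineers into information retrieval.
people = exa.search_and_contents(
    "personal site of an engineer in San Francisco who likes information retrieval",
    type="neural",
    num_results=10,
    text=True,
)

for result in people.results:
    # Assumption: the page title starts with the person's name.
    name = result.title.split("|")[0].split("-")[0].strip()

    # Step 2: keyword search for that person's GitHub profile.
    github = exa.search(f"{name} GitHub", type="keyword", num_results=1)
    if github.results:
        print(name, "->", github.results[0].url)
```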
So now it's getting all the GitHub info. Cool. So that's just an example. And there are lots of other things you can do with Exa. For example, we actually just today launched this research endpoint, where it will do as many searches and LLM calls in the background as it needs to get you that perfect report or that perfect structured output for the thing you asked for. So it's kind of like a deep research API, and it's a state-of-the-art deep research API. Cool. That is the talk. Hope that was interesting. Thank you.
Alrighty. Nice job, Will. Um, all right.
Last, but certainly not least, uh, we
have... I think there's a typo up there; it says Google, but it should say Pyabs, right? Yeah, Pyabs. But David here is going to walk us through everything from BM25 to the most complex RAG. So, super excited for this one. Big shout out to David;
one. Uh, big shout out to David. Uh,
please help me give him a warm welcome.
Please don't send me back to Google.
All right. I'll just give you all a little bit of context. My co-founder and I and a lot of our team were actually working on Google search, and then we left and started Pyabs. I loved the Exa talk; we're all nerds for information and search, so this is going to be a little bit of that. I'm just going to go through a whole bunch of ways you can actually shore up and improve your RAG systems. I think one thing that
is there's a lot of talk about things
sometimes like too much in the reads
like oh specific techniques and you can
do RL this way and you can tune the
model this way. It's like doesn't help
me orient in the space like what are all
these things and how do I like hang on
them? Uh or you have the complete
opposite which is like a whole bunch of
buzzwords and hype and such and like rag
is dead. No rag is not dead is like
agents like wait what like uh so just
you know I think a lot of what I'll do
today is just uh what I call like plain
English uh just trying to like set up a
framework right like very centered
around like okay if you are trying to
show up the quality of your system how
do you do that and then where do all the
things you hear about like day in day
out like fit uh and then just how to
approach that and give a lot of examples
I think one thing that I always love and
we always did in Google we always do in
pyabs uh is just like look at things
look at cases look at queries see what's
working. See what's not working. That's
really the essence of like quality
engineering as we used to call it at
Google.
Um, are we
good? Let's
see. All right.
Perfect. And the
timer has not started. So, this is us.
All right. Perfect. If you do want the slides, there are like 50 slides, and I set a challenge for myself to go through 50 slides in 19 minutes. You can catch the slides here if you want; I'll flash this towards the end as well, with pi.aira-t talk. It should point to the slides we're going through. And as I mentioned: plain English, no hype, no
buzz, no debates. All right, so: how to think about techniques. Before we get into the weeds of it, why does this even matter? The way we always think about it is: always start with outcomes. You're always trying to solve some product problem, and generally the best way to visualize it is that you have a certain quality bar you want to reach. There was a very interesting talk this week about how benchmarks aren't really helpful but evals absolutely are. You're trying to launch a CRM agent, and you sort of have a launch bar, a place where you feel comfortable that you can actually put it out into the world. And techniques fit somewhere here. You
have that end metric, and you're trying to come up with different ways to shore up the quality; those ways are the techniques. And this is sort of your own personal benchmark: you start with some of the easy bars you want to hit, and then there are medium benchmarks and hard benchmarks. These are query sets you're setting up. Then, depending on what you want to reach and in what time frame, you end up trying different things. And this is what we call the quality engineering loop: you baseline yourself, okay, I want a CRM agent, this is the easy query set, and your quality is there just through the simplest thing you can try; you do a loss analysis, okay, what's broken (there were a lot of eval talks this week); and then what we call quality engineering. Now, the reason I say this is that techniques fit in this last bucket, and one of the biggest problems is that people sometimes start there, and it doesn't make any sense. You say, oh, do I need BM25 or do I need vector retrieval? It's like, I don't know; what are you trying to do, what are your query sets, and where are things failing? Because many times you actually don't need these things, and you end up implementing them anyway and it doesn't make a lot of sense. So usually the thing I say is what I call complexity-adjusted impact, or, you know, stay lazy: always look at what's broken; if it's not broken, don't fix it, and if it is broken, do fix it. We'll go through a lot of techniques today, but this is a good way to think about them. It's
just a cluster. It's a catalog of stuff.
The most important two columns are the
ones to the right, difficulty and
impact. If it's easy, go ahead and try it. Most times, like BM25: BM25 is pretty easy, you should absolutely try it, and it does shore up your quality quite a bit. But should I build custom embeddings for retrieval? I don't know, let's take a look; this is actually really, really hard. Harvey gave a talk: they built custom embeddings, but they have a really hard problem space, plain relevance embeddings don't do enough for them, and they're willing to put in all that work and effort.
All right: queries, examples, lots of stuff. First technique: in-memory retrieval. Easiest thing: bring all your documents and shove them all into the LLM. This is the whole 'is RAG dead, is RAG not dead' context-window debate. Well, context windows are pretty easy, so you should definitely start there. One example is NotebookLM, a very nice product: you put in five documents and just ask questions about them. You don't need any RAG. Just shove the whole thing in.
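In its simplest form, that technique is just string concatenation before the model call; a sketch, where call_llm is a placeholder for whichever model client you use:

```python
# "In-memory retrieval": no index, no ranking, just put every document in the prompt.
# call_llm is a placeholder for your model client (OpenAI, Anthropic, etc.).
def answer_from_documents(question: str, documents: list[str], call_llm) -> str:
    context = "\n\n---\n\n".join(documents)          # all docs, verbatim
    prompt = (
        "Answer the question using only the documents below.\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)   # breaks once the docs no longer fit the context window
```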
Now, things might get cut, or it gets too long, and this is where it breaks, right? Maybe things don't fit in memory, or maybe you just fill the context window too much. So this is where you start to think: oh, okay, that's what's happening, I have too many documents; oh, that's what's happening, the documents are not attended to properly by the LLM. And here are the five things that are breaking. Okay, great, let's move to the
next one. So now you try something very simple, which is: can I retrieve just based on terms? So, BM25. What is BM25? BM25 is basically four things: the query terms, the frequency of those query terms in a document, the length of the document, and how rare a given term is. It's a very nice thing; it actually works pretty well and it's very easy to try.
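Those four ingredients map pretty directly onto the BM25 formula; here's a compact sketch with the usual k1 and b constants assumed:

```python
# Minimal BM25 over a tokenized corpus: term frequency, document length,
# and how rare a term is (inverse document frequency) are the ingredients.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    docs = [d.lower().split() for d in docs]
    query = query.lower().split()
    avg_len = sum(len(d) for d in docs) / len(docs)
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))   # docs containing each term

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores

print(bm25_scores("iphone battery life",
                  ["iphone battery life tips", "android camera review"]))
```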
But it has a problem: when queries don't have that keyword nature, as the Exa talk was saying, it doesn't work. And this is where you bring in something like relevance embeddings. Relevance embeddings are pretty interesting, because now you're in vector space, and vector space can handle way more nuance than keyword space. But they also fail in certain ways, especially when you're looking for exact keyword matching. And it's actually pretty easy to know when things work and when they don't.
Actually, for this I went to ChatGPT and asked: hey, give me a bunch of queries, ones that work for standard term matching and ones that work for relevance embeddings, and you can see exactly what's going on here. If your query stream looks like 'iPhone battery life', then you don't need vector search. But if it looks like 'how long does an iPhone last before I need to charge it again', then you absolutely need things like vector search. This is where you need to be tuned to what every technique gives you before you go and invest in it. When you do your loss analysis and you see that most of your queries actually look like the ones on the right-hand side, then you should absolutely start investing in this area.
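A small sketch of that contrast, using a sentence-embedding model (the model name is just an example) to show that the paraphrased query still lands near the right document even with almost no term overlap:

```python
# Compare keyword overlap vs. embedding similarity for the two query styles above.
# Model name is just an example; any sentence-embedding model behaves similarly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

doc = "iPhone battery life: how many hours the phone lasts on a single charge."
keyword_query = "iPhone battery life"
natural_query = "how long does an iPhone last before I need to charge it again"

def term_overlap(q, d):
    return len(set(q.lower().split()) & set(d.lower().split()))

emb = model.encode([doc, keyword_query, natural_query], convert_to_tensor=True)
print("keyword query: overlap =", term_overlap(keyword_query, doc),
      "cosine =", float(util.cos_sim(emb[1], emb[0])))
print("natural query: overlap =", term_overlap(natural_query, doc),
      "cosine =", float(util.cos_sim(emb[2], emb[0])))
```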
All right, now you've done BM25 and you've done vector search because your query sets look exactly like that, and now you have a combined candidate set, and this is where re-rankers help quite a bit. And
when people say re-rankers, they're usually referring to cross-encoders, and this is a specific architecture. If you remember, the architecture for relevance embeddings was: you get a vector for the query, you get a vector for the document, and then you just measure distance. Cross-encoders are more sophisticated: they take both the query and the document and give you a score while attending to both at the same time, and that's why they're much more powerful. Now, they are more powerful, but they're actually pretty expensive, and that's a failure state as well: you can't run them on all your documents. So you have to have this setup where you retrieve a lot of things and then rank a smaller set of things with a technique like that. But it is really powerful and you should use it. It fails in certain cases, and when you hit those cases you move to the next thing.
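A sketch of that retrieve-then-rerank step with a cross-encoder (the model name is an example, and the retrieved candidate list is assumed to come from the earlier BM25/vector stage):

```python
# Retrieve a large candidate set cheaply, then rerank a smaller set with a
# cross-encoder that scores (query, document) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

def rerank(query, candidates, top_k=10):
    # candidates: the few hundred docs that came out of BM25 / vector retrieval
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# rerank("how long does an iphone last on one charge", retrieved_docs)
```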
Now, where does it fail? It's still relevance, and this is a big problem with standard embeddings and standard re-rankers: they only measure semantic similarity. These are all proxy metrics in the end; your application is your application, and your set of information needs is your set of information needs. You try to proxy that with relevance, but relevance is not ranking. This is something we learned in Google search over 15 or 20 years: what brings the magic of Google search? Well, they look at a lot of other things than just relevance. The talk from Harvey and LanceDB was really interesting, and he gave the example of this query. It's a really interesting query: it has so much semantics specific to the legal domain that it's impossible to catch it with just relevance. What does a word like 'regime' mean? What does 'material' mean? It has a very specific meaning in the legal domain. And then there are things that are very specific to the domain that need to be retrieved, like laws and regulations. And this is where
you get to building things like custom embeddings. You say: you know what, just fetching on relevance is not enough for me, and now I need to go and model my own domain in its own vector space, and then I can actually fetch some of these things. Again, go back to ChatGPT: is this interesting, should I even do it? I asked it to give me a list of things that would fail in a standard relevance search in the legal domain, and you start to see: words like 'moot' don't mean the same thing, words like 'material' don't mean the same thing. When you have a vocabulary that is this specific and just off, you will not get good results. So how do you match that? Again, you need to have evals, you need to have query sets, you need to look at the things that are breaking and decide that, oh, the things that are breaking have to do with the vocabulary being out of distribution for a standard relevance model. That's how you decide. Don't overthink it, should I do it, should I not do it: what are your queries telling you, what is your data telling you? Then go and do it or don't. There's also an
example from shopping. Embeddings are very interesting because they help you a lot with retrieval and recall, but you still need good ranking, right? And if relevance doesn't work for retrieval, it probably doesn't work for ranking either. This is an example I pulled from Perplexity; I was just trying to break it today, and it didn't take too much to break it. I asked for cheap gifts for my son, and I followed up with this query: but I have a budget of 50 bucks or more. Because when I said cheap, it started giving me $10 items; well, cheap for me is like $50, but it didn't know that, which is fine, so I told it. But when I said $50 or more, it still gave me $15 and $40 items, both of which are actually below $50. And this is kind of interesting, because in standard information retrieval terms, this is a signal. It's a price signal, and it's not being caught, it's not being translated into the query, and it's definitely not being translated into the ranking. So now you have to think: okay, I have ranking, and I need the ranking to see the semantics of my corpus and my queries. And this has a very specific meaning: when you think of your corpus and your queries, it's not just
relevance. Relevance helps you with natural language, but there are things like price signals and merchant signals; or if you're doing podcasts, how many times an episode has been listened to is a very important signal. That has nothing to do with relevance, right? And in many, many applications you will see that things which are, for example, more popular tend to rank more highly. The PageRank algorithm was mentioned earlier: PageRank is not about relevance, it's about prominence. How many things outside of my document point to me? That has nothing to do with relevance and everything to do with the structure of the web corpus. So that's the shape of the data: it's a signal about the shape of the data and not a signal about relevance.
And the best way to think about it: you have horizontal semantics, and then you have vertical semantics. If you're in a vertical domain where the semantics are very verticalized, say you're doing a CRM or you're doing email, and it's a very complex bar you're trying to hit that is way beyond just natural language, understand that relevance will be a very tiny part of the semantic universe, and the harder you try to go, the more you're going to hit this wall. All right, this breaks again. Things keep breaking, I'm
sorry; at sufficient complexity, things will keep breaking. So now, the thing that breaks even with custom semantics is user preference. Because even when you get all of this, you say: I'm doing relevance and I'm doing price signals and merchant signals, I'm doing everything, I now know the shopping domain. Well, now you don't know the shopping domain, because now users are using your product: they're clicking on stuff you thought they weren't going to click on, and they're not clicking on things you thought they were going to click on. And this is where you need to bring in the click signal, the thumbs-up, thumbs-down signal. Now, these things get very complex, so we're not going to talk about how to implement them; but in this case, for example, you have to build a click-through prediction signal, and then you take that signal and combine it with all your other signals. So now if
you look at your ranking function, it's doing: okay, I want it to be relevant, I want it to have this semi-structured price signal and the query understanding related to that, plus I want the user preference signal; and then you take all these signals and you add them up, and that becomes your ranking score. So it becomes a very balanced function, and this is how you go from 'oh, it's just relevance' to 'oh no, it's not just relevance' to 'oh no, it's not just relevance, it's my domain semantics and my user preferences all rolled up into one.'
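In code, that balanced function is often just a weighted sum of the individual signals; the signal names and weights below are made up for illustration, and in practice they are tuned or learned:

```python
# Toy ranking function: relevance plus a query-understanding signal plus a
# user-preference signal, combined with hand-picked weights.
def ranking_score(doc, query_budget_min=50.0):
    relevance = doc["relevance"]                    # e.g. cosine or cross-encoder score, 0..1
    price_ok = 1.0 if doc["price"] >= query_budget_min else 0.0   # "$50 or more" signal
    popularity = doc["click_through_rate"]          # learned user-preference signal, 0..1
    return 0.6 * relevance + 0.25 * price_ok + 0.15 * popularity

docs = [
    {"title": "gift A", "relevance": 0.9, "price": 15.0, "click_through_rate": 0.3},
    {"title": "gift B", "relevance": 0.8, "price": 60.0, "click_through_rate": 0.4},
]
for d in sorted(docs, key=ranking_score, reverse=True):
    print(d["title"], round(ranking_score(d), 3))   # gift B outranks A despite lower relevance
```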
I'll mention two more things. The first: calling the wrong queries. This happens a lot, because it goes into orchestration; you're trying to do complex things, especially now that you have agents and you're telling them to use a certain tool. There is an impedance mismatch between what the search engine expects, say you tuned the search engine and it expects keyword queries, or even more complex queries, and the LLM, which is reasoning about your application and making queries by itself; you cannot describe all of that to the LLM, and this is a big problem. So one thing we've seen many companies do, and we've done this at Google too, is take more control of the actual orchestration: you take the big query and you make N smaller queries out of it. I took this
screenshot from AI Mode in Google, and it's very brief, you have to catch it before the animation goes away, but you can see it's actually making N queries: it's making 15 queries, it's making 20 queries. This is what we call fan-out: take a very complex thing, try to figure out what all the subqueries in it are, and then fan them out.
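A sketch of that fan-out step; the decomposition prompt and the llm / search_backend callables are placeholders for whatever model and search API you use:

```python
# Fan-out: ask a model to break the big request into narrow sub-queries,
# then run them in parallel and pool the results.
import asyncio

async def fan_out(big_query: str, llm, search_backend, max_subqueries=15):
    prompt = (
        "Break this request into short, self-contained search queries, "
        f"one per line (at most {max_subqueries}):\n\n{big_query}"
    )
    subqueries = [q.strip() for q in (await llm(prompt)).splitlines() if q.strip()]
    results = await asyncio.gather(*(search_backend(q) for q in subqueries))
    return dict(zip(subqueries, results))
```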
Now you might think, hey, why isn't the LLM doing this? The LLM is kind of doing it, but it doesn't know about your tool; it doesn't know enough about your search engine. I love MCP, but I'm not a big believer that you can teach the LLM, just through prompting, what to expect from the search engine on the other end. This is why people still ask: should the agent be autonomous, or do I need workflows? This is very complicated, and it will take a while to be solved, because it's unclear where the boundary is. Should the search engine be able to handle more complex things, so the LLM can just throw anything its way, or is it the other way around, where the LLM has to have more information about what the search engine can support so it can tailor its queries? Right now you need control, because the quality is still not there. So it looks like this: if you have this assistant input and you're turning it into these narrow queries, then, for example, 'was David working on this' has very specific semantics, and it's more like 'Jira issues David', 'Slack threads David'. It's very hard, without knowing enough about your application, to know that these are the queries that matter and not the ones on the left-hand side. And if you send the thing on the left-hand side to a search engine, it will absolutely tip over unless it understands your domain. And this is where you need to calibrate the
boundary. Okay. So now you're asking all the right queries. Are you asking them of all the right backends? This is another place where it all fails, and it's one technique I call supplementary retrieval. It's something you notice clients do quite a bit: they don't call search enough, and sometimes people try to over-optimize. When you're trying to get high recall, you should always be searching more. Just search more; this is similar to what we talked about with in-memory retrieval, just give it more things. It never fails to give more things. In the session description we mentioned this one query, 'fo', which was really hard to handle, and you think: we're in Google search, it's a very simple Middle Eastern dish, and it stumped an organization of 6,000 people; what's so hard about this query? What's so hard about it is that it's an ambiguous intent, so you need to reach out to a lot of backends to actually understand enough about it. You might be asking about food, at which point I want to show you restaurants; you might be asking for pictures, at which point I want to show you images. What Google ended up doing is they call all the backends and then they put the whole thing in, and I would recommend this as a great technique to increase recall even more: just call more things, and don't try to be skimpy unless you're running into some real cost overload. And
that's the last one: you're running into cost overload. GPUs are melting. I tried to generate an image, but then I realized there's actually a pretty good image that is real: somebody took a server rack and threw it from the roof. I didn't need to go to ChatGPT and generate this image; apparently this was an advertisement, a pretty expensive one. All right. This happens a lot when you get to a certain scale and you have all these backends and you're making all these queries, and it's just getting very, very complex. Google's there, Perplexity is there; I mean, Sam Altman keeps complaining about GPUs melting. And this is the
part where you need to start doing distillation, and distillation is a very interesting thing, because to do it you have to learn how to fine-tune models, and this gets to be a little bit complex. You have to hold the quality bar constant while you decrease the size of the model. The reason you can do that is, like in that graph, 'hey, hire me, I know everything; actually, I'm firing you': a very large language model is mostly overqualified for the task you want to do, because what you really want to do is just one thing. Take Perplexity: they're doing question answering, and they're pretty fast; when you use Perplexity, in that context, they're really, really fast, which is amazing, because they trained this one model to do this one very specific thing, which is to be really, really good at question answering. This is very hard, so I wouldn't do it unless latency becomes a really important thing for your users: the thing is taking 10 seconds and users churn; if I can make it two seconds, users don't churn. Actually, that's a really great place to be, because then you can use this technique and just bring everything down.
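At a very high level, the first step of that distillation usually looks like generating a task-specific training set from the big model and fine-tuning a small one on it; a sketch, where teacher and the fine-tuning step are placeholders for whatever model client and training stack you use:

```python
# Sketch of distillation data generation: label your narrow task with the large
# "teacher" model, then fine-tune a much smaller "student" model on those pairs.
# `teacher` is a placeholder for your large-model client; the fine-tuning step
# depends entirely on your training stack.
import json

def build_distillation_set(questions, teacher, out_path="distill.jsonl"):
    with open(out_path, "w") as f:
        for q in questions:
            answer = teacher(f"Answer concisely with citations:\n{q}")
            f.write(json.dumps({"prompt": q, "completion": answer}) + "\n")
    # Next: fine-tune the small model on distill.jsonl and re-run the same evals,
    # holding the quality bar constant while latency and cost come down.
```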
All right. You've done everything you can. Things are still failing. This is, uh, everybody. Okay, what do you do? We have a bunch of engineers here: what do you do when everything fails? Yes, you blame the product manager. It's the last trick in the book. When everything fails, make sure it's not your fault. But I'll say
there's something really important here. Quality engineering will never be 100%. Things will always fail; these are stochastic systems. So then you have to punt the problem, and you have to punt it upwards. It's kind of a joke, but it's not a joke: the design of the product matters a lot to how magical it can seem, because if you try to be more magical than your product surface can absorb, you will run into a bunch of problems. I'll use a very simple example. Probably a more complex one would be a human in the loop for customer support, where some cases the bot can handle on its own, but then you need to punt to a human. This is basically UX design, right? When do you trust the machine to do what the machine needs to do, and when does a human need to be in the loop? This is a much simpler example, from Google
Google has a lot of great data. So what
we call like high understanding the
fidelity of the understanding is really
high and then it shows like what we call
a high promise UI. Like I'll show you
things you can click on them. There's
reviews, there's filters because I just
understand this really well. And there's
things Google does not understand at
all, mostly as web documents, bag of
words. And what's really interesting
about the UI is the eye changes. If you
understand more, you show a more kind of
like filterable high promise. If you
don't understand enough, you actually
degrade your experience, but you degrade
it to something that is still workable.
Like, I'll show you 10 things, you
choose. Oh no, I know exactly what you
want. I'll show you one thing. And this
is really, really important. and it has
to be like part of every and this is
sort of like always understanding like
there's only so much engineering you can
do until you have to like actually
change your product to accommodate this
sort of stoastic nature. So gracefully
degrade, gracefully upgrade depending on
like the the level of your
understanding. And again, I'll flash these two slides at the end: always remember what you're doing, because you can absolutely get into theoretical debates. Context window versus RAG, this versus that, agents versus whatever; everything is empirical in this domain when you're doing this sort of thing. I have my evals, I'm trying to go up step by step, I have a toolbox at my disposal. Everything is empirical. So again: baseline, analyze your losses, and then look at your toolbox and see, are there easy things here I can do? If not, are there at least medium things I can do? If not, should I hire more people and do some really, really hard things? But always remember the choice is on you, and you should be principled, because this can be an absolute waste of time if you're doing it too far ahead of the curve. All right, again, the slides are
here. I think, oh, I achieved it, 30 seconds left. If you want the slides, they're here, and reach out to us; we're always happy to talk. I was very happy with the Exa talk, because it's always nice to find friends who are nerds about information retrieval; we are too. So reach out, and we're happy to talk about RAG challenges and some of the models we are building. All right, thank you so much.
Awesome. Thank you so much, David. All
right. Thank you everyone so much for
joining the search and retrieval track.
Um, that's it for today, but enjoy the rest of your last day of the AI Engineer World's Fair. Thanks for coming.