Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)

Channel: aiDotEngineer

Published at: 2025-07-29

YouTube video id: w9u11ioHGA0

Source: https://www.youtube.com/watch?v=w9u11ioHGA0

I'll just give you all a little bit of context. My co-founder and I, and a lot of our team, were actually working on Google Search, and then we left and started Pi Labs. I loved the Exa talk; we're all nerds for information retrieval and search, so this is going to be a little bit of that. I'm going to go through a whole bunch of ways you can actually shore up and improve your RAG systems.

One thing I personally struggle with sometimes is that a lot of the talk about this space is too deep in the weeds: specific techniques, "you can do RL this way," "you can tune the model this way," and it doesn't help me orient in the space. What are all these things, and where do I hang them? Or you get the complete opposite: a whole bunch of buzzwords and hype. "RAG is dead." "No, RAG is not dead, it's agents." Wait, what? So a lot of what I'll do today is what I call plain English: trying to set up a framework centered around this question: if you're trying to shore up the quality of your system, how do you do that, and where do all the things you hear about day in, day out fit? And then how to approach that, with a lot of examples.
One thing I always love, that we always did at Google and always do at Pi Labs, is to just look at things: look at cases, look at queries, see what's working and what's not working. That's really the essence of what we used to call quality engineering at Google. If you do want the slides, there are about 50 of them, and I set myself the challenge of getting through 50 slides in 19 minutes. You can catch the slides at pi.ai-talk; I'll flash the link toward the end as well. And as I mentioned: plain English, no hype, no buzz, no debates.
So, how to think about techniques. Before we get into the weeds, why does this even matter? The way we always think about it is: always start with outcomes. You're always trying to solve some product problem. Generally the best way to visualize it is that you have a certain quality bar you want to reach. There was a very interesting talk this week about how benchmarks aren't really helpful; evals absolutely are. Say you're trying to launch a CRM agent: you have a launch bar, a place where you feel comfortable actually putting it out into the world. Techniques fit somewhere in here. You have that end metric, and you're trying to come up with different ways to shore up the quality; those ways are the techniques. And this is your own personal benchmark: you start with the easy bars you want to hit, and then there are medium benchmarks and hard benchmarks. These are query sets you're setting up. Depending on what you want to reach, and in what time frame, you end up trying different things.

This is what we call the quality engineering loop. You baseline yourself: okay, I want a CRM agent, this is the easy query set, and my quality is here, using the simplest thing I can try. Then you do a loss analysis: what's broken? (There were a lot of eval talks this week.) And then comes what we call quality engineering.
Now, the reason I say this is that techniques fit into that last bucket, and one of the biggest problems is that people sometimes start there. That doesn't make any sense, because when you ask, "do I need BM25, or do I need vector retrieval?", the answer is: I don't know. What are you trying to do? What are your query sets, and where are things failing? Many times you don't actually need these things, and you end up implementing them anyway, and it doesn't make a lot of sense. So what I usually suggest is what I call complexity-adjusted impact, or, you know, stay lazy: always look at what's broken. If it's not broken, don't fix it; if it is broken, do fix it.

We'll go through a lot of techniques today, but this is a good way to think about them. It's just a catalog of stuff, and the two most important columns are the ones on the right: difficulty and impact. If something is easy, go ahead and try it. BM25, for example, is pretty easy; you should absolutely try it, and it shores up your quality quite a bit. But should I build custom embeddings for retrieval? I don't know, let's take a look: that one is actually really, really hard. Harvey gave a talk; they build custom embeddings, but they have a really hard problem space, standard relevance embeddings don't do enough for them, and they're willing to put in all that work and effort.
All right, queries and examples. The first technique: in-memory retrieval. The easiest thing: take all your documents and shove them all into the LLM. This is the whole "is RAG dead or is RAG not dead" context-window debate. Well, context windows are pretty easy, so you should definitely start there. One example is NotebookLM, a very nice product: you put in five documents and just ask questions about them. You don't need any RAG; just shove the whole thing in.

Now, this is where it breaks: maybe things don't fit in memory, or maybe you pollute the context window too much. This is where you start to diagnose: oh, that's what's happening, I have too many documents; oh, that's what's happening, the documents aren't being attended to properly by the LLM; here are the five things that are breaking. Okay, great, let's move to the next one.
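To make that concrete, here's a minimal sketch of in-memory "retrieval": no index at all, just stuffing documents into the prompt until the context budget runs out. The llm_complete() helper and the chars-per-token estimate are assumptions for illustration.

```python
# A minimal sketch of in-memory "retrieval": skip the index entirely and
# pack whole documents into the prompt. llm_complete() is a hypothetical
# wrapper around whatever chat-completion API you use, and 4 chars/token
# is only a rough heuristic.
def answer_over_docs(question: str, docs: list[str], ctx_tokens: int = 128_000) -> str:
    budget = ctx_tokens * 4      # rough character budget
    context, used = [], 0
    for doc in docs:
        if used + len(doc) > budget:
            break                # the failure mode: documents stop fitting
        context.append(doc)
        used += len(doc)
    prompt = "\n\n".join(context) + f"\n\nQuestion: {question}"
    return llm_complete(prompt)  # hypothetical LLM call
```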
So now you try something very simple: can I retrieve based on terms alone? That's BM25. What is BM25? BM25 is basically four things: which query terms appear, the frequency of those query terms, the length of the document, and how rare a certain term is. It's a very nice thing; it actually works pretty well, and it's very easy to try.
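To make those four ingredients visible, here's a hand-rolled sketch of BM25 scoring; in practice you'd use a library or an inverted index, and k1/b below are just the common default parameters.

```python
# A hand-rolled sketch of BM25 scoring, showing the four ingredients:
# (1) which query terms appear, (2) how often they appear in the doc,
# (3) the doc's length vs. the corpus average, (4) how rare each term is.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)                                # (2) term frequencies
    score = 0.0
    for term in query_terms:                               # (1) query terms
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)    # (4) rarity
        freq = tf[term]
        # (3) dampen frequency and normalize by document length
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * freq * (k1 + 1) / denom
    return score

corpus = [doc.lower().split() for doc in [
    "iPhone battery life and charging tips",
    "how to bake sourdough bread at home",
]]
query = "iphone battery".split()
print([round(bm25_score(query, doc, corpus), 3) for doc in corpus])
```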
But it has a problem: when queries don't have that keyword-search nature, as Exa was saying, it doesn't work. This is where you bring in something like relevance embeddings, which are pretty interesting because now you're in vector space, and vector space can handle way more nuance than keyword space. But they also fail in certain ways, especially when you're looking for exact keyword matching, and it's actually pretty easy to know when each approach works. For this slide I went to ChatGPT and asked, "Hey, give me a bunch of queries: ones that work for standard term matching and ones that work for relevance embeddings." And you can see exactly what's going on. If your query stream looks like "iPhone battery life," you don't need vector search. But if queries look like "how long does an iPhone last before I need to charge it again?", then you absolutely need something like vector search. You need to be tuned to what every technique gives you before you go and invest in it. When you do your loss analysis and you see that most of your queries actually look like the ones on the right-hand side, then you should absolutely start investing in this area.
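Here's a minimal sketch of relevance-embedding retrieval, assuming the sentence-transformers package and one of its small public models; the documents and query are illustrative.

```python
# A minimal sketch of relevance-embedding retrieval, assuming the
# sentence-transformers package and its small all-MiniLM-L6-v2 model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "iPhone 15 battery capacity and typical screen-on time",
    "Best sourdough starters for beginners",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

# The paraphrased query shares almost no keywords with the document,
# which is exactly where embeddings beat term matching like BM25.
query = "how long does an iPhone last before I need to charge it again"
q_vec = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ q_vec  # cosine similarity, since vectors are normalized
print(docs[int(np.argmax(scores))])
```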
All right: you did BM25, and you did vector search because your query sets look exactly like that, and now you have a combined candidate set. This is where rerankers help quite a bit. When people say "rerankers," they're usually referring to cross encoders, which are a specific architecture. Remember, the architecture for relevance embeddings was: you get a vector for the query, you get a vector for the document, and you just measure the distance between them. Cross encoders are more sophisticated: they take both the query and the document and give you a score while attending to both at the same time, which is why they're much more powerful. But they're also pretty expensive, and that's a failure state as well: you can't run them over all your documents. So you end up with this two-stage setup where you retrieve a lot of things cheaply and then rank a smaller set with a technique like this. It is really powerful and you should use it; and it fails in certain cases, and when you hit those cases, you move on to the next thing.
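A sketch of that retrieve-many, rerank-few pattern, again assuming sentence-transformers and one of its public MS MARCO cross-encoders; the candidates are illustrative.

```python
# A sketch of the retrieve-many, rerank-few pattern, assuming
# sentence-transformers and one of its public MS MARCO cross-encoders.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross encoder attends to query and document jointly, so it is
    # more accurate than bi-encoder distance but far too slow to run over
    # the whole corpus; only score the small first-stage candidate set.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# candidates would come from the cheap first-stage retrievers above
candidates = [
    "iPhone battery replacement pricing",
    "how to extend smartphone battery life",
    "sourdough hydration ratios",
]
print(rerank("iphone battery life", candidates, top_k=2))
```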
Now, where does it fail? It's still relevance, and there's a big problem with standard embeddings and standard rerankers: they only measure semantic similarity. These are all proxy metrics in the end. Your application is your application, your set of information needs is your set of information needs, and you try to proxy them with relevance; but relevance is not ranking. This is something we learned in Google Search maybe 15 or 20 years ago: what brings the magic of Google Search is that it looks at a lot of other things than just relevance.

This example came from the talk by Harvey and LanceDB, which was really, really interesting. He gave this query as an example, and it's a really interesting one: it has so much semantics specific to the legal domain that it's impossible to catch it with relevance alone. What does a word like "regime" mean? What about "material"? Those have very specific meanings as legal terms. And then there are things very specific to the domain that need to be retrieved, like laws and regulations. This is where you get to building things like custom embeddings. You say: fetching on relevance alone is not enough for me; I need to go model my own domain in its own vector space, and then I can actually fetch these things.

Now again, go back to ChatGPT: is this interesting, should I even do it? I asked it to give me a list of things that would fail in a standard relevance search in the legal domain, and you start to see it: words like "moot" don't mean the same thing, words like "material" don't mean the same thing. When you have a vocabulary that is so specific and just off-distribution, you will not get good results. So how do you decide? Again: you need evals, you need query sets. You need to look at the things that are breaking and decide that the breakage has to do with your vocabulary being out of distribution for a standard relevance model. That's how you decide. Don't overthink it in the abstract, "should I do it, should I not do it": what are your queries telling you, what is your data telling you? Then go try it, or don't.
There's also an example from shopping. Embeddings are very interesting because they help you a lot with retrieval and recall, but you still need good ranking. And if you think relevance doesn't work for retrieval, it probably doesn't work for ranking either. This is an example I pulled from Perplexity; I was just trying to break it today, and it didn't take much. I asked, "Give me cheap gifts for my son," and then I followed up with, "But I have a budget of $50 or more," because when I said "cheap," it started giving me $10 items. Well, cheap for me is about $50. It didn't know that, which is fine, so I told it. But when I said "$50 or more," it still gave me items at $15 and $40, both of which are below $50.

This is interesting because, in standard information retrieval terms, this is a signal, a price signal, and it's not being caught: it's not being translated into the query, and it's definitely not being translated into the ranking. So now you have to think: I have ranking, and I need the ranking to see the semantics of my corpus and my queries. That has a very specific meaning. It's not just relevance. Relevance helps you with natural language, but there are things like price signals and merchant signals; if you're doing podcasts, how many times an episode has been listened to is a very important signal, and it has nothing to do with relevance. In many, many applications you'll see that things that are more popular, for example, tend to rank more highly. And as the Exa talk mentioned, there's the PageRank algorithm. PageRank is not about relevance; it's about prominence: how many things outside my document point to it. That has nothing to do with relevance and everything to do with the structure of the web corpus. It's a signal about the shape of the data, not a signal about relevance.

The best way to think about it: you have horizontal semantics, and you have vertical semantics. If you're in a vertical domain where the semantics are very verticalized, say you're doing a CRM or you're doing email, and the bar you're trying to hit is way beyond just natural language, understand that relevance will be a very tiny part of the semantic universe. The harder you push, the more you're going to hit this wall. All right, this breaks again; things keep breaking, I'm sorry. At sufficient complexity, things will keep breaking.
So now, the thing that breaks even with custom semantics is user preference. Because even when you get all of this, okay, I'm doing relevance, I'm doing price signals and merchant signals, I'm doing everything, I know the shopping domain now. No, you don't know the shopping domain, because now users are using your product. They're clicking on things you thought they wouldn't click on, and they're not clicking on things you thought they would. This is where you need to bring in the click signal, the thumbs-up and thumbs-down signal. Now, these things get very complex, so we won't go into how to implement them; in this case, for example, you have to build a click-through prediction signal, and then you take that signal and combine it with all your other signals. So if you look at your ranking function, it's saying: I want the result to be relevant, I want this semi-structured price signal and the query understanding related to it, plus I want the user preference. And then you take all these signals and add them up, and that becomes your ranking score. It becomes a very balanced function. This is how you go from "oh, it's just relevance" to "oh no, it's not just relevance" to "oh no, it's not just relevance, it's my domain semantics and my user preferences all rolled up into one."
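Here's a minimal sketch of what such a balanced ranking function might look like. Every signal name and weight below is a hypothetical illustration; production systems typically learn the weights from interaction data rather than hand-tuning them.

```python
# A minimal sketch of a multi-signal ranking function; all names and
# weights are hypothetical illustrations.
def rank_score(doc: dict, min_price: float = 50.0) -> float:
    relevance = doc["reranker_score"]                       # semantic signal
    price_fit = 1.0 if doc["price"] >= min_price else 0.0   # query-understanding signal
    popularity = doc["clicks"] / (doc["impressions"] + 1)   # shape-of-the-data signal
    preference = doc["predicted_ctr"]                       # learned user-preference signal
    return 2.0 * relevance + 1.5 * price_fit + 0.5 * popularity + 1.0 * preference

candidates = [
    {"title": "Gift A", "reranker_score": 0.9, "price": 15,
     "clicks": 10, "impressions": 100, "predicted_ctr": 0.2},
    {"title": "Gift B", "reranker_score": 0.8, "price": 60,
     "clicks": 40, "impressions": 100, "predicted_ctr": 0.3},
]
ranked = sorted(candidates, key=rank_score, reverse=True)
print([d["title"] for d in ranked])  # Gift B wins despite lower relevance
```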
I'll mention two more things. The first: you're calling with the wrong queries. This happens a lot, because it gets into orchestration and you're trying to do complex things, especially now that you have agents and you're telling them to use a certain tool. It happens quite a bit because there's an impedance mismatch between what the search engine expects, say you tuned it to expect keyword queries, or even more complex queries, and what the LLM produces: you cannot describe all of that to the LLM, and the LLM is reasoning about your application and making queries by itself. This is a big problem. So one thing we've seen many companies do, and we did this at Google as well, is take more control of the actual orchestration: you take the big query and make N smaller queries out of it. I took a screenshot from AI Mode in Google, and it's very brief, you have to catch it before the animation goes away, but you can see it's actually making 15 queries, 20 queries. This is what we call fan-out: take the very complex thing, figure out all the subqueries inside it, and then fan them out.
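A sketch of what fan-out orchestration might look like; llm_complete() and search_backend() are hypothetical helpers standing in for your LLM API and your search engine, and the prompt and example request are illustrative.

```python
# A sketch of fan-out orchestration with hypothetical helpers.
import json

def fan_out(user_request: str) -> list[str]:
    # Decompose one complex request into the narrow queries the search
    # backend was actually tuned for, instead of passing it through raw.
    prompt = (
        "Break this request into short keyword search queries our product "
        "search engine understands. Return only a JSON list of strings.\n\n"
        f"Request: {user_request}"
    )
    return json.loads(llm_complete(prompt))  # assumes the model returns valid JSON

subqueries = fan_out("plan a budget-friendly birthday dinner for eight near me")
results = [search_backend(q) for q in subqueries]  # hypothetical backend call
```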
Now you might think: why isn't the LLM doing this? The LLM is kind of doing it, but the LLM doesn't know about your tool. It doesn't know enough about your search engine. I love MCP, but I'm not a big believer that you can teach the LLM, just through prompting, what to expect from the search engine on the other end. This is why people still ask, "is the agent autonomous, or do I need workflows?" It's very, very complicated, and it will take a while to be solved, because it's unclear where the boundary is. Should the search engine handle more complex things, so the LLM can throw anything its way? Or should the LLM have more information about what the search engine can support, so it can tailor its queries? Right now, you need control, because the quality is still not there.

So it looks like this. You have an assistant input, and you're turning it into narrow queries. For example, "was David working on this?" has very specific semantics; it really maps to narrow queries like David's assigned issues or David's Slack threads. And it's very, very hard to know, without knowing enough about your application, that those are the queries that matter, and not the broad one on the left-hand side. If you send the thing on the left-hand side to a search engine, it will absolutely tip over unless it understands your domain. This is where you need to calibrate the boundary.
Okay, so now you're asking all the right queries. Are you asking them of all the right backends? This is another place where it all fails, and it's a technique we call supplementary retrieval. This is something we notice clients get wrong quite a bit: they don't call search enough. Sometimes people try to over-optimize. When you're trying to get high recall, you should always be searching more; just search more. It's similar to what we said about in-memory retrieval: just provide more things; it never hurts to provide more things. In the talk description we mentioned a query, "falafel," that was really hard to handle, and you think: this is Google Search, it's a very simple Middle Eastern dish, and it stumped an organization of 6,000 people? What's so hard about this query? What's hard is that it's an ambiguous intent. You need to reach out to a lot of backends to actually understand enough about it, because you might be asking about food, in which case I want to show you restaurants, or you might be asking for pictures, in which case I want to show you images. What Google ended up doing is calling all the backends and then putting the whole thing together. And I'd recommend this as a great technique to increase recall even more: just call more things, and don't try to be skimpy, unless you're running into a real cost overload.
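Here's a minimal sketch of supplementary retrieval for an ambiguous query like "falafel": hit every backend in parallel and merge, rather than guessing one intent up front. The backends below are stubs for illustration; in a real system each would call a different index.

```python
# A sketch of supplementary retrieval with stub backends.
from concurrent.futures import ThreadPoolExecutor

def search_web(q): return [f"web result for {q}"]
def search_images(q): return [f"image result for {q}"]
def search_local(q): return [f"restaurant result for {q}"]
def search_recipes(q): return [f"recipe result for {q}"]

BACKENDS = {"web": search_web, "images": search_images,
            "local": search_local, "recipes": search_recipes}

def supplementary_retrieve(query: str) -> dict:
    # Fan out to every backend in parallel and merge; let ranking decide
    # which intent (food, photos, places) actually wins.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in BACKENDS.items()}
        return {name: f.result() for name, f in futures.items()}

print(supplementary_retrieve("falafel"))
```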
And that's the last one: you're running into cost overloads. GPUs are melting. I tried to generate an image for this, but then I realized there's actually a pretty good image that's real: somebody took a server rack and threw it off a roof. I didn't need to go to ChatGPT and generate it; apparently it was an advertisement, a pretty expensive one. All right. This happens a lot when you get to a certain scale: you have all these backends, you're making all these queries, and it's getting very, very complex. Google is there, Perplexity is there; Sam Altman keeps complaining about GPUs melting. This is the point where you need to start doing distillation. Distillation is interesting because to do it you have to learn how to fine-tune models, and it gets a little complex: you have to hold the quality bar constant while you decrease the size of the model. The reason you can do that is captured in that graph, the "hire me, I know everything," "actually, I'm firing you" joke: a very large language model is mostly overqualified for the task you want it to do, because what you really want is just one thing. Take Perplexity: they do question answering, and they're pretty fast. When you use Perplexity in certain contexts it's really, really fast, which is amazing, because they trained one model to do one very specific thing: be really, really good at question answering. This is very hard, though, so I wouldn't do it unless latency becomes a really important thing for your users. Like: the thing takes ten seconds and users churn; if I can make it two seconds, users don't churn. Actually, that's a great place to be, because then you can use this technique and bring everything down.
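Here's a sketch of the data-generation half of distillation: collect (input, output) pairs from the large teacher model so you can fine-tune a small student on your one task. teacher_answer() is a hypothetical wrapper around your large-model API, and the JSONL shape follows common chat fine-tuning formats, so check your provider's spec.

```python
# A sketch of distillation data generation with a hypothetical teacher.
import json

def build_distillation_set(queries: list[str], path: str = "distill.jsonl") -> None:
    with open(path, "w") as f:
        for q in queries:
            answer = teacher_answer(q)  # hypothetical large-model call
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": q},
                    {"role": "assistant", "content": answer},
                ]
            }) + "\n")

# Hold your eval set constant: the student is "done" when it matches the
# teacher's quality bar on those queries at a fraction of the latency.
build_distillation_set(["sample production query 1", "sample production query 2"])
```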
All right. You've done everything you can, and things are still failing. What do you do? We have a bunch of engineers here: what do you do when everything fails? Yes: you blame the product manager. It's the last trick in the book. When everything fails, make sure it's not your fault. But there's something really important here. Quality engineering will never get you to 100%. Things will always fail; these are stochastic systems. So then you have to punt the problem, punt it upwards. It's kind of a joke, but it's not a joke: the design of the product matters a lot to how magical it can seem, because if you try to be more magical than your product surface can absorb, you'll run into a bunch of problems.

I'll use a very simple example; a more complex one would be human-in-the-loop customer support, where in some cases the bot can handle things on its own, but then it needs to punt to a human. This is basically UX design: when do you trust the machine to do what the machine needs to do, and when does a human need to be in the loop? The simpler example is from Google Shopping. There are cases where Google has a lot of great data, what we call high understanding, the fidelity of the understanding is really high, and then it shows what we call a high-promise UI: I'll show you things you can click on, there are reviews, there are filters, because I understand this really well. And there are things Google doesn't understand at all, mostly web documents, a bag of words. What's really interesting is that the UI changes. If you understand more, you show a more filterable, high-promise experience; if you don't understand enough, you degrade the experience, but you degrade it to something that's still workable. "I'll show you ten things; you choose." Versus: "Oh no, I know exactly what you want; I'll show you one thing." This is really, really important, and it has to be part of every product: there's only so much engineering you can do before you have to change the product itself to accommodate this stochastic nature. So gracefully degrade, and gracefully upgrade, depending on the level of your understanding.
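As a sketch, the degrade/upgrade decision can be as simple as switching the UI's promise level on a confidence score; the score, thresholds, and UI names here are purely illustrative assumptions.

```python
# A sketch of graceful degrade/upgrade: choose the UI's "promise level"
# from how well the system understood the query. All names and thresholds
# are hypothetical.
def choose_ui(understanding: float, results: list) -> dict:
    if understanding > 0.9 and results:
        # High fidelity: high-promise UI with one confident answer
        return {"ui": "single_answer", "items": results[:1]}
    if understanding > 0.5:
        # Medium fidelity: filterable grid with reviews and facets
        return {"ui": "filterable_grid", "items": results[:20]}
    # Low fidelity: degrade to ten plain results and let the user choose
    return {"ui": "ten_results", "items": results[:10]}

print(choose_ui(0.95, ["best match"]))   # {'ui': 'single_answer', ...}
print(choose_ui(0.3, ["a", "b", "c"]))   # {'ui': 'ten_results', ...}
```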
And again, I'll flash these two slides at the end. Always remember what you're doing, because you can absolutely get into theoretical debates: context window versus RAG, this versus that, agents versus whatever. Everything is empirical in this domain. When you're doing this sort of thing, you have your evals, you're trying to step up level by level, and you have a toolbox at your disposal. Everything is empirical. So again: baseline, analyze your losses, and then look at your toolbox and ask, are there easy things here I can do? If not, are there at least medium things I could do? If not, should I hire more people and do some really, really hard things? But always remember the choice is on you, and you should be principled, because this can be an absolute waste of time if you're doing it too far ahead of the curve.

All right, the slides are here, I think. Oh, I made it, with 30 seconds left. If you want the slides, they're here again, and reach out to us; we're always happy to talk. I was very happy with the Exa talk, because it's always nice to find friends who are nerds in information retrieval. We are too, so reach out, and we're happy to talk about RAG challenges and some of the models we're building. All right, thank you so much.