Information Retrieval from the Ground Up - Philipp Krenn, Elastic

Channel: aiDotEngineer

Published at: 2025-07-27

YouTube video id: 4Xe_iMYxBQc

Source: https://www.youtube.com/watch?v=4Xe_iMYxBQc

Let's get going. Audio is okay for everybody? I have some slight feedback, but I'll try to manage; I hope it's okay for you. Hi, I'm Philipp. Let's talk a bit about retrieval. I will show you some retrieval from the ground up, and we'll keep it pretty hands-on. You will have a chance to follow along and do everything that I show you; I have a demo instance that you can use. Or you can just watch me. If you have any questions, ask at any moment. If anything is too small to read, shout and we'll try to make it larger as we go along.
I guess we're not over RAG yet. RAG is a thing, and we'll focus on the R in RAG, retrieval augmented generation. We'll just focus on the retrieval.

Let's see where we are with retrieval. Quick show of hands: who has done RAG before? Okay, that's about half or so. Who has done anything with vector search and RAG? Do I need vector search for RAG, or can I do anything else?

Audience: You can do anything in the prompt.
Yeah, so you can do anything. Retrieval is actually a very old thing; depending on how you define it, it might be 50 or 70 years old. Retrieval is just getting the right context to the generation. I'll ignore all the generation for today. We'll keep it very simple and just focus on the retrieval part: getting the right information in. Partially the classics, but we'll get to some new things as well as we go along.
Who has done keyword search before? That is fewer than vector search, I feel like. Which almost reminds me of 15 years ago or so, when NoSQL came up: more people had done MongoDB, Redis, or whatever else than SQL. That has changed again, and I think it will be similar for retrieval.
The way I would always put it: vector search is a feature of retrieval. It's only one of multiple features, or many features, that you want in retrieval, and we'll see why and how as we dive into the details. I work for Elastic, the company behind search. We're the most downloaded, most deployed search engine. We do vector search, we do keyword search, we do hybrid search, and we'll dive into various examples. Everything that I show you uses the Elasticsearch query language, but if you use anything built on Apache Lucene, it behaves very similarly. If you use something that is a clone of, or close to, Lucene, like anything built on Tantivy, it will also be very similar. The foundations of keyword search and vector search apply broadly everywhere.
So, let's get going. We'll keep this pretty hands-on. Who remembers in Star Wars when he's making that hand gesture? What is the quote? "These are not the droids you're looking for." We'll keep this relatively Star Wars based. Feel free to come in and filter in on the sides. I think we have one chair over there and one down there; otherwise it's getting a bit full.
Okay. Let's look at what "These are not the droids you're looking for" does for search. I will start with the classic approach. Keyword search, or lexical search, means you search for the words that you have stored, and we want to find what is relevant in our examples.
If you want to follow along, there is a gist with all the code that I'm showing you. It's at elastic/ai.engineer. One important thing: I have one shared instance for everybody, so you can all use it without signing up for any accounts. It's just a cloud instance that you can use. My handle is in the index name; if you don't want to fight and override each other's data, replace it with your own handle or something specific to you, because otherwise you will all work on the same index and override each other's data.

You can also just watch me; if you don't have a computer handy, that's fine. But if you want to follow along: elastic/ai.engineer, there will be a gist. It has the connection string, a URL, and the credentials are workshop / workshop. If you go to the login, it will say "login with Elasticsearch"; that's where you use workshop / workshop. Then you'll be able to log in, run all the queries that I'm showing you, and try out stuff. If you have any questions, shout. I have a couple of colleagues dispersed in the room, so if we have too many questions, we'll somehow divide and conquer.
So, let's get going and see what we have here. I'll show you most of this live. I think this is large enough in the back row; if it's not large enough for anybody, shout and we'll see how much larger I can make it. Let me turn off the Wi-Fi and hope that my wired connection is good enough. Let's refresh to see. Oh. Maybe we'll use my phone after all. Okay, let's try this again. Okay, this is no good. Out you go. Okay. Hardest problem of the day solved: we have network.
Okay. So, we have the sentence "These are not the droids you're looking for," and we'll start with the classic keyword or lexical search. What happens behind the scenes? Generally, you want to extract the individual words and then make them searchable. Here I'm not storing anything yet; I'm just looking at what it would look like if I stored something. I'm using the _analyze endpoint to see what would actually be stored in the background to make the text searchable. So I have "These are not the droids you're looking for," and you see the individual tokens: these, are, not, the, droids, you're, looking, for.
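A minimal sketch of that call, using the Dev Tools console syntax from the gist and assuming the default standard analyzer:

GET _analyze
{
  "analyzer": "standard",
  "text": "These are not the droids you're looking for."
}

Each token in the response comes back with its text plus start_offset, end_offset, type, and position, which is exactly the metadata discussed next.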
The first step that happens everywhere is tokenization. In western languages it's pretty simple: you break out the individual tokens at white space and punctuation marks. Asian languages in particular are a bit more complicated around that, but we'll gloss over that for today.
And we have a couple of interesting pieces of information here. We have the token; "these" is the first token. We have the start offset and the end offset. Why would I extract and then potentially store a start and an end offset? Any guesses? Yes, highlighting. Especially if you have a longer text, you want that highlighting feature, to say: this is where my hit actually was. So if I'm searching for "these", which is maybe not a great word, you can very easily highlight where the match actually was. The trick in search, and what generally differentiates it from a database, is that a database just stores what you give it and then does almost everything at query or search time, whereas a search engine does a lot of the work at ingestion, when you store the data. So we break out the individual tokens, calculate these offsets, and store them. Whenever we have a match afterwards, we never need to reanalyze the actual text, which could be multiple pages long; we can just highlight where the match is, because we have extracted those positions. We also have a position. Why would I want to store the position along with the text?
Audience: For annotations?

Yeah, and the main use case, which we'll briefly look at later, is phrases: if you want to look for this word followed by that word. You could look for all the texts that contain both words, but then you can also just compare the positions, looking for n, n + 1, and so on. You never need to look at the string again; you can just look at the positions to figure out that this was one continuous phrase, even though you have broken it out into the individual tokens.
Most of the token types that we see here are "alphanum", for alphanumeric. An alternative type would be synonyms. We'll skip synonym definitions, because it's not fun to define tons of synonyms, but this is all the information that we're storing in the background. You can also customize this analysis, and that is again one of the features of full text and lexical search: you preprocess a lot of the information to make the search afterwards faster.
Here you can see: I'm stripping out the HTML, because nobody is going to search for this emphasis tag. I use a standard tokenizer, which breaks up, for example, on dashes; an alternative would be the whitespace tokenizer, which only breaks up on white space. I lowercase everything, which is most of the time what you want, because nobody searches in Google with proper casing. Or maybe my parents do, but nobody else. We remove stop words, which we'll get to in a moment, and we do stemming with the Snowball stemmer. Stemming basically reduces a word down to its root, so you don't care about singular versus plural or the inflection of a verb anymore; you care more about the concept.
So, if I run the phrase through that analysis, does anybody want to guess which tokens will be extracted, and in what form? Not a lot will remain. Two? Close: we'll actually have three. We have droid, you, and look, and you can see all the others were stop words, which were removed. The stemming reduced "looking" down to "look", because we don't care whether it's "looks", "looking", or "look"; we just reduce it to the word stem. We do this when we store the data, and by the way, when you search afterwards, your query text runs through the same analysis, so that you get exact matches and never need to do anything like a LIKE search again. This will be much more performant than anything you would do in a relational database, because you have direct matches; we'll look at the data structure behind it in a moment. What we get is droid, you, look, with the right positions. So if we searched for the phrase "droid you", we could easily retrieve it, because we have the positions, even though that is a weird phrase. Do we start indexing positions at zero or one? Zero, yes. It's the only right way; no discussion here. So the positions start at zero, and these are the tokens that remain.
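As a sketch, the same analysis chain can be tested ad hoc against _analyze before creating any index; the filter names follow the talk's description, and the resulting tokens are the ones read out above:

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "snowball"],
  "text": "These are not the droids you're looking for."
}

Expected result: droid, you, look, each with its zero-based position.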
If you do this for a different language: as you might hear, I'm a native German speaker. Here is the text in German, and if you use a German analyzer, it knows the rules for German and analyzes the text the right way, so the right German stems remain, like droid. Anybody want to guess what happens if I use the wrong language for a text? It will go very poorly.
How this works is that you have rules for every single language: what is a stop word, how does stemming work. If you apply the wrong rules, you just get wrong stuff out; it will not do what you want. What you get here, for example: this is a German article, but in English the rule is that an s at the end just gets stemmed away, even though that doesn't make any sense here. You apply the wrong rules and you produce pretty much garbage. So don't do that.
To give you another example: French. This is the same phrase in French, and you see droid, la, and recherchez are the words remaining in this example. Otherwise it works the same, but you need the right analysis for what you're doing; otherwise you'll just produce garbage.
A couple of things as we go along. The default stop word list, which you can override, is relatively short. Linguists have spent many years figuring out the right lists of stop words; you don't want too many or too few. In English, I always forget, I think it's 33 words. This is where you can find it in the source code. I don't want to say well hidden, but it's not easy to find either. Every language has a defined list of stop words that will be automatically removed. For "these are not the droids you are looking for", more or less by accident, we had a lot of stop words, which is why not much of the phrase remained. For all other languages you have a similar list of stop words.
Should you always remove stop words? Yes? No?

By the way, there was a very good comment about "not" that I'm not sure everybody heard. One important thing: we're talking about lexical keyword search here, which is dumb but scalable. It doesn't understand whether there is a droid or there is no droid; "not" is just defined as a stop word, and the engine just does keyword matching. In vector search, or anything with a machine learning model behind it, that will be a bit of a different story, where these things might make a difference. But this is very simple: it just matches on strings. It doesn't understand the context, it doesn't know what's going on. That's why the linguists decided "not" is a good stop word. You could override that if, for your specific use case, it is not a good idea.
Always removing stop words: yes, no, maybe? Our favorite phrase is "it depends", and then you have to explain what it depends on. There are scenarios where removing all stop words does not give you the desired result, and maybe you want to index a text both with and without stop words. Sometimes stop words are just a lot of noise that blows up the index size and doesn't really add a lot of value; that's why we have defined them and remove them by default. But if you had, for example, "to be or not to be": those are all stop words. It's all gone when you run it through analysis. So it is tricky to figure out the right balance for stop words and what works for your use case, and you might have unexpected surprises in all of this.
Okay, we've seen the German examples. Let's do some more queries, or rather, let's actually store something. So far we only looked at what would happen if we stored something; now I'm actually creating an index. Again, if you're running this yourself, please use a different name than mine: replace all my handle instances with your handle or whatever you want, since this is a shared instance and we'd have too many collisions. I might jump to another instance that I have as a backup in the background. What I'm doing here is creating the analysis pipeline that I looked at before: throwing out the HTML, a standard tokenizer, lowercasing, stop word removal, and stemming. I call this my_analyzer, and then I apply it to a field called "quote". We call this a mapping; it's roughly the equivalent of a schema in a relational database, and it defines how different fields behave.
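A sketch of that index creation, with a placeholder index name standing in for the handle-prefixed one used in the demo:

PUT my-handle-droids
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}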
Okay, and somebody did not replace the index name in the query. By the way, you need to keep the user_ prefix. Let me quickly fix this myself. Oops. I should have seen this coming. We want to replace the name... please don't copy that. Okay, looks like it worked. Let's try it again. So we're creating our own index.
And now, just to double check, I'll run _analyze against the field that I have set up, to verify that I configured it correctly. And now I'm actually starting to store documents. Bless you. So we'll store "These are not the droids you're looking for." I have two others that I'll index, just so we have a bit more to search: "No, I am your father." Any guesses what will remain here? Father, yeah.
Okay, let's try this out; let me copy it over. This one actually has way fewer stop words than you would expect. Let's quickly do this. Since I didn't do the HTML removal, let's take the tags out manually. So what you get is "no i am your father", and that was wrong, because it wasn't what I wanted; we need to run this against the right analysis. This happens when you copy-paste. Sorry. And we'll do text. No, I think I've pieced this back together. Okay: i, am, your, father. So "no" is the only stop word in this list, actually. "No" was on the stop word list; all the others are not.
Okay, let's try another one. Obi-Wan
never told you what happened to your
father.
How many tokens will Obi-Wan be?
Two?
One?
No, Obi-Wan will will be two like
Obi-Wan because we use the default
tokenizer or standard tokenizer, that
one breaks up at dashes. If you had used
another tokenizer like white space, that
would keep it together because they
don't break up at white spaces. So,
there are various reasons why you want
or would not want to do it. I don't want
to go into all the details, but there
are a lot of things to do right or wrong
when you ingest the data, which will
then allow you to query the data in
specific ways. So, for example, if you
would have an email address,
that one is also weirdly broken up. Like
you might use like there's a dedicated
tokenizer for URL email addresses. So,
depending on what type of data you have,
you will need to process the data the
right way because pretty much all the
the smart pieces are kind of like at
ingestion here to make the search
afterwards easier.
So, you can easily do that. Let's index all three documents so that we can actually search for them.
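Indexing the three quotes looks roughly like this, with the same placeholder index name:

POST my-handle-droids/_doc
{ "quote": "These are not the droids you're looking for." }

POST my-handle-droids/_doc
{ "quote": "No, I am your father." }

POST my-handle-droids/_doc
{ "quote": "Obi-Wan never told you what happened to your father." }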
Now, if I start searching for "Droid", should it match "These are not the droids you're looking for", yes or no? The query is singular and uppercase, and the droids that we stored were plural and lowercase. Will that match? Why? Yes: we had the stemming, and we had the lowercasing. When we store the text, it runs through this analysis pipeline, and the search does the same thing. It will lowercase "Droid" to droid; it has stemmed "droids" in the text down to droid; and then we have an exact match.
So, what does the data structure behind the scenes actually look like? The magic is in the so-called inverted index. The inverted index is this: these are all the tokens that remained after extraction, sorted alphabetically, and each one has a pointer saying in which of the stored documents, say with IDs 1, 2, and 3, it occurs, and how often. We also know at which position each token appeared. So if I search for "Droid" now, I lowercase it to droid, I have an exact match in that sorted list, then I go through the postings: retrieve this document, skip this one, skip this one, and at position four you have the hit. And then you can easily highlight it. You have done almost all the hard work at ingestion, and the retrieval afterwards is very fast and efficient. That's the classic data structure for search, the inverted index, where you have this alphabetical list of all the tokens that you have extracted. It's just built in the background for you, and that's how you can retrieve all of this. Let's look at a few other queries and how they behave.
If I search for "robot", will I find anything? No, because there was no robot; there was a droid. We could now define a synonym and say all droids are robots, for example. Who likes creating synonym lists? Nobody anymore, okay. Normally I would have said that's Stockholm syndrome, because sometimes there is somebody who likes creating synonym lists, after having done it for so many years. But it got easier nowadays: you can use LLMs to generate your synonyms, so creating them can be a bit easier. They're still limited, though, because you always have this fixed mapping. With synonyms you can expand in the right way. Where it gets trickier is homonyms, where a word has multiple meanings: a bat could be the animal, or it could be the thing you hit a ball with. There it just gets trickier, because there is no meaning behind the words and no context. You just match strings, and that is inherently limited. But like I said, it's dumb, but it scales very well. That's why it has been around for a long time, and it does surprisingly well for many things, because not a lot can go totally wrong or be unexpected.
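For reference, a synonym is defined as a token filter in the analysis chain; a minimal sketch, with made-up filter and analyzer names:

PUT my-handle-droids-synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "droid_synonyms": {
          "type": "synonym",
          "synonyms": ["droid, robot"]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "droid_synonyms"]
        }
      }
    }
  }
}

With a field mapped to this analyzer, a search for robot would also match documents containing droid.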
Now, other things that you can do: a phrase search, where you say "I am your father". Will this find anything? Yes, because we had "No, I am your father." What happens if I search for the phrase, say, "I am not your father"? Yes, no? No. Why? You're right that "not" is a stop word and would be filtered out, but it still doesn't match, because the positions are off. That is one of the things that can sometimes be confusing: even though something is a stop word and will be filtered out, the positions still don't line up. One thing you can do, though, is use the factor called slop, where you basically allow something to be missing. So searching "I am father" against "I am your father" with slop zero, which is the implicit default, will not find anything; but if I set slop to one, I'm saying there can be one position off, so one word can be missing. However, with "I am his father", "his" would not match, so that still will not work. The slop is really just to skip a word, yes.
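A sketch of the phrase query with slop, as run in the demo:

GET my-handle-droids/_search
{
  "query": {
    "match_phrase": {
      "quote": {
        "query": "I am father",
        "slop": 1
      }
    }
  }
}

With slop 0, the default, this finds nothing; with slop 1 it matches "No, I am your father."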
Audience: What about "I'm your father"?

That will not work. There you might need something like a synonym, where "I'm" gets expanded to "I am", or we will need some more machine learning capabilities behind the scenes to do stuff like that.
So, what is built in is generally a very simple set of rules. For things like this, you normally need a dictionary. The problem is that dictionaries are normally not available for free or open source. Funnily enough, they often come out of universities, because universities have a lot of free labor: the students. That's why universities have been creating a lot of dictionaries, but they often come out under the weirdest licenses, which is why they're not very widely available. But yes, there is a smarter or more powerful approach if you have a dictionary and can do these things.
For example, one thing to show, or maybe that's a good thing to mention here: you don't always get real words out of the stemming. It's not a dictionary, it doesn't really understand what you're doing; it just applies rules. For example, "blackberries". Sorry, I need the English analyzer; without English this will not work. This stems down to the weird token "blackberri", and it also stems the singular down to the same token. There is a rule that applies here, but it's just a rule; it's not dictionary based, it's not very smart. It only has some rules built in that work for this, but you will definitely hit limits.
The other thing, by the way, and why I picked blackberry as an example: you have some annoying languages like German, Korean, and others that compound nouns, like blackberry, where you basically have two words in one. "Black" would never find "blackberry" in the simplest form, because it's not a complete token match. There are various ways to work around that, which all come with their own downsides: either you have a dictionary that can decompound, or you extract so-called n-grams, groups of characters, and then you match on those groups. All of those are just some of the many tools we use to try to make this a bit better or smarter, but it all has limitations. I hope that answers the question and makes sense.
So, there are dictionaries, but they're generally not free or not available under an easy license. For some languages, by the way, even the stemmers are not really available. I think there is a stemmer or analyzer for Hebrew that has some commercial license, or at least you can't use it for free in commercial products. The licensing of machine learning models is also its own dark secret. Yeah.
Audience: I guess it's maybe not exactly clear, because we don't know how to spell it, but why not just have much smaller groupings, like sub-word tokens? Then you can match a lot of things. You're going to have more false positives, but presumably you can filter down to your true positives and have more matches.

Yes, that is exactly what an n-gram does.
Let me see if I can show that. An n-gram is normally a character group, normally a trigram. This is way too small; somehow I have overwritten my Command-plus shortcut, so I can't use that. Let me make this slightly larger and copy it over to my console, so you can see it.
But this is a great question. So we'll use an n-gram tokenizer on "Quick Fox", and you can see the tokens extracted here are the first letter, the first two, the second, the second and third, and so on. You end up with a ton of tokens.
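A sketch of that demo; min_gram 1 and max_gram 2 are the tokenizer's defaults and match what is shown on screen:

GET _analyze
{
  "tokenizer": { "type": "ngram", "min_gram": 1, "max_gram": 2 },
  "text": "Quick Fox"
}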
The downsides: A, you have to do more work when you store this. B, it takes a lot of storage on disk, because you extract so many different tokens. And then your search will also be pretty expensive, because normally you would use at least trigrams, but even that creates tons of tokens and tons of matches, and then you need to find the documents with the most matches. It works, but it is pretty expensive both on disk and at query time, and it might also create results that are a bit unexpected for the end user. Again, it's a very dumb tool that works reasonably well for some scenarios, but it's only one of many potential factors.
What you would probably do in reality, and I don't have a full example for that, but we could build it quickly: you might store a text in more than one way. You might store it with stop words and without stop words, and maybe with n-grams. Then you give a lower weight to the n-grams and say: if I have an exact match, I want that first; but if I don't have anything in the exact matches, then I want to look into my n-gram list and take whatever comes up next. So even keyword based search can get more complex when you combine different methods; a sketch of that weighting follows.
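As a hypothetical sketch of that combination: assume quote also has an n-gram sub-field called quote.ngrams (not defined in the demo), then a bool query can weight exact matches above n-gram matches:

GET my-handle-droids/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "quote":        { "query": "blackberry", "boost": 3 } } },
        { "match": { "quote.ngrams": { "query": "blackberry", "boost": 1 } } }
      ]
    }
  }
}

Documents matching the stemmed field score higher; anything that only matches on n-grams still shows up, just further down.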
N-grams are interesting, but again, they're a dumb but pretty heavy hammer. Use them in the right scenario.

Audience: Sorry, quick question about these n-grams: is it by default one or two?

Yes, but you can redefine that.
Let me go back to the docs. For the n-gram tokenizer, you can set min_gram and max_gram. If you set both to three, you have trigrams, where it's always groups of three: 1-2-3, 2-3-4, and so on. You could also use something called edge n-gram, where you expect that somebody types the first few letters; then you only generate grams from the beginning of the word, not from the middle, which sometimes avoids unexpected results and of course reduces the number of tokens quite a bit. So, somewhere in here: edge n-gram. Let me copy that over, so I won't type. Here we have an edge n-gram of "Quick", and you can see it only produces the first letter and the first two letters, nothing else. In reality you would probably define this as something like two to five, or more, or whatever else you want. But here we only tokenize from the start of the word, which reduces the tokens tremendously.
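The edge n-gram variant of the same call, again with min_gram 1 and max_gram 2 as in the demo:

GET _analyze
{
  "tokenizer": { "type": "edge_ngram", "min_gram": 1, "max_gram": 2 },
  "text": "Quick"
}

This yields only "Q" and "Qu".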
But of course, if you have BlackBerry and you want to match the berry, you're out of luck. Makes sense? Anybody else, anything else? Yes, another question, sure.
Audience: If you're working with a language like English and then Hebrew, where it's right to left, how do you deal with the indexes and stuff like that?

Yeah. So if you have multiple languages, do not mix them up in one field; that will just create chaos. We'll get to it in a moment, but keyword search relevance is based on word frequencies, and if you mix languages, it screws up all the frequencies and statistics. So you could do two things. If you have English and Hebrew, you could have one field for English and another field for Hebrew, whatever the right abbreviation is, and then you define the right analyzer for each specific field. So you break it out into different fields, or you could even use different indices.
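A sketch of the per-language-fields approach; since, as mentioned later, there is no freely usable built-in Hebrew analyzer, German stands in for the second language here:

PUT my-handle-multilingual
{
  "mappings": {
    "properties": {
      "quote_en": { "type": "text", "analyzer": "english" },
      "quote_de": { "type": "text", "analyzer": "german" }
    }
  }
}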
And ideally, we even have language identification built in. We have a language analyzer: even if you just provide a couple of words, it will infer the language, not guess it, with a very high degree of certainty. Hebrew especially will be very easy to identify; if a language has its own script or diacritics, it's easy. But even if you just throw random languages at it, it has a very good chance, with just a few words, of knowing which language it is, and then you can treat it the right way.
Good, let's continue. So we have done all of these searches; we have done slop. One more thing before we get into relevance: another very heavy hammer that people often overuse is fuzziness. Bless you. So if you have a misspelling, here I misspelled Obi-Wan Kenobi. We already know that this is broken out into two different tokens. It will still match Obi-Wan, because we have this fuzziness, which allows edits; it's a Levenshtein distance. You could give it an absolute value: one edit, which could be one character too many, one too few, or one character different. You could set it to two, but you can't go higher, because otherwise you'd match almost anything. And "auto" is kind of smart, because auto sets a specific value depending on how long the token you're searching for is: zero to two characters allows zero edits, three to five characters allows one, and above that it's two.
So you can match these. Will this one match? Yes, no, and why?

Audience: No, because you've got both the B and the W misspelled.

We have both of those misspelled, and it still matches. Why?

Audience: I think it's tokenized separately, and each token gets its own Levenshtein distance.

Yes. That is a bit of a gotcha: you need to know the tokenizer. We tokenize with standard, so it's two tokens, and then the fuzziness applies per token. Which is another slightly surprising thing, but yes, that's how you end up with a match here.
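A sketch of the fuzzy query, with a deliberate misspelling in each token; AUTO picks the edit budget per token based on its length:

GET my-handle-droids/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "Obi-Wna Kenobi",
        "fuzziness": "AUTO"
      }
    }
  }
}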
Okay. Now, we could look at how the Levenshtein distance works behind the scenes. It's basically a Levenshtein automaton, which looks something like this: if you search for "food" and allow two edits, this is how the automaton works in the background to figure out all the possible permutations. It's a fancy algorithm that was, I think, pretty hard to implement, but it's in Lucene nowadays.
Okay. Now, let's talk about scoring. One thing that you have seen here, which you don't have in a non-search engine or a plain database, is the scores: how well does this match? How does a score work? Let's look at the details. The basic algorithm, which most of us here have probably come across, is term frequency / inverse document frequency, or TF-IDF. It has been slightly tweaked since: the newer implementation is called BM25, which stands for "best match", and it's the 25th iteration of the best match algorithm.
So what do they look like? You have the term frequency: if I search for droid, how many times does droid appear in the text I'm looking at? It's basically the square root of that count. The assumption is: if a text contains droid once, it has this relevancy; if a text contains droid ten times, it has a higher one. The tweak between TF-IDF and BM25 is that TF-IDF just keeps growing, while BM25 says that once you hit something like five droids in a text, it doesn't really get much more relevant anymore; it flattens out the curve. That is the idea of term frequency. The next component is the inverse document frequency, which is almost the inverse curve. The assumption here is: over my entire text body, this is how often the term droid appears. If a term is rare, it is much more relevant than if a term is very common. Basically, rare is relevant and interesting; very common is not very interesting anymore, and the curve works out like that. And the final component is the field length norm: the shorter a field is, the more relevant a match in it is. The assumption is that if you have a short title and your keyword appears there, that's much more relevant than a match somewhere in a very long text body. These are the three main components of TF-IDF. So let's take a look at how this works.
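For reference, a simplified sketch of the classic Lucene TF-IDF pieces (the pre-BM25 ClassicSimilarity; boost and query normalization factors omitted):

\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}
\mathrm{idf}(t) = 1 + \log\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}
\mathrm{norm}(d) = \frac{1}{\sqrt{\mathrm{numTerms}(d)}}
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{norm}(d)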
You can make this a bit more complicated, and you can make the engine show you why something matches; let me take that out for the first try. So I'm searching for "father", and I get "No, I am your father" and "Obi-Wan never told you what happened to your father." One is more relevant than the other. Why is the first one more relevant than the second? The term frequency is the same: both contain father once. The inverse document frequency is also the same, because we're looking for the same term. The only difference is that the second document is longer than the first one, and that's why the first is more relevant here.
So this is very simple. And if you're unsure why something is scored in a specific way, you can add "explain": true, and it will tell you all the details: okay, we have father, and it calculates all the different pieces of the formula for you and shows you how it did the calculation. So you can debug that if you need to, but it's probably a bit too much output for the everyday use case.
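The explain flag is just an extra top-level parameter on the search:

GET my-handle-droids/_search
{
  "explain": true,
  "query": {
    "match": { "quote": "father" }
  }
}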
And then you can customize the score if you want to. Here I'm applying a random score, so my two father documents, and this is a bit hard to show, will just come back in random order, because their score is randomly assigned. But you could do this more intelligently, combining the relevance score with, I don't know, the margin on the product that you sell, or its rating, and build a custom score out of that. You can influence the score any way you want.
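The random ordering from the demo is a function_score query; a sketch:

GET my-handle-droids/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "quote": "father" } },
      "random_score": {}
    }
  }
}

Swapping random_score for something like a field_value_factor on a rating or margin field is how you would fold business signals into the relevance score.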
One thing that I see every now and then that is a very bad idea... we'll skip that one, because it's probably a bit too much. This one, by the way, is the total formula. Or maybe I'll show you the parts that I skipped. What happens if you search for two terms, and they don't have the same relevancy? The calculation behind the scenes basically looks like this. Let's say we search for "your father". Father is very rare, so it's much more relevant than your; your is pretty common. A document that contains "your father" sits along this axis and will be the best match. But will a document that only contains father be more relevant, or one that only contains your? Intuitively, the one with just father should be more relevant. How does it calculate that? It treats the relevancy of father as one axis and the relevancy of your as another, with the ideal document as a vector combining both, and then it looks at which document has the smaller angle to that ideal document; that is the more relevant one. So with a multi-term search, you can figure out which term is more relevant and how they combine. And then you also have the coordination factor, which rewards documents containing more of the terms you're searching for. If I'm searching for three terms, say your, I am, father, whatever, and a document contains all three, the formula combines the scores of all three and multiplies by 3/3. If it only contains two of them, it only gets 2/3 of the combined relevancy, and with one, 1/3. You put it all together, and this is the formula that runs behind the scenes; luckily you don't have to do it in your head.
Cool, we have seen these. One thing that we see every now and then is people trying to translate the score into percentages: this is a 100% match and this is only a 50% match. Who wants to do that? Hopefully nobody, because the Lucene documentation is pretty explicit about it: you should not think about the problem that way, because it doesn't work. And I'll show you why it doesn't work, or how it breaks. Let's take another example, this short text: "These are my father's machines." I should think of a good Star Wars quote to use here, but bear with me. So what remains if I run this through my analyzer? my, father, machine: these are the three tokens that remain.
Now I will store that; remember the three tokens that we have stored. If I search for "my father machine", you might be inclined to say this is the perfect score, 100%, agreed? All three tokens that I stored from "These are my father's machines" are there, so this must be the perfect match: the score of 3.2 would be 100%. The problem is that every time you add or remove a document, the statistics change, and your score changes. So if I delete that document and run the same search again, I don't know what percentage this is now. Is this the new 100%, the best document, or is it, I don't know, 20%? How does it compare?
And then you can play funny tricks: "These droids are my father's father's machines." You can see I would have a term frequency of two for father. So if I store that one and then search, is this now 100%? Is it 110%? So don't try to translate scores into percentages. They're only meaningful within one query; they're not comparable across queries. They really just sort results within a single query.
Okay, let me get rid of this one again. Now we've seen the limitations of keyword search. We don't want to define our synonyms by hand; we might want to extract a bit more meaning. So we'll do some simple examples to extend this. I will add a small text embedding model from OpenAI; I'm basically connecting that inference API for text embeddings here in my instance. I have removed the API key; you will need to use your own API key if you want to set this up yourself, but on the demo instance it is already configured. So let me pull up the inference services that we have here: I have added two different models, one sparse, one dense. Let's go through these. By the way, if you try the 100%-score idea with these: don't, because it will just not work either.
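Connecting a model through the inference API looks roughly like this; the endpoint name is a placeholder, and the model id and 128-dimension setting are my assumption based on the talk's description of a small OpenAI embedding model:

PUT _inference/text_embedding/openai-embeddings
{
  "service": "openai",
  "service_settings": {
    "api_key": "<your-api-key>",
    "model_id": "text-embedding-3-small",
    "dimensions": 128
  }
}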
Not everybody has worked with dense vectors, right? So I have a couple of graphics, coming back to our Star Wars theme, to look at how that works. What you do with dense vectors, and we'll keep this very simple: this example has just a single dimension, and the axis is roughly realistic Star Wars characters versus cartoonish Star Wars characters. This one falls on the realistic side, and that other one is just cartoonish. You have a model behind the scenes that can rate those images and figure out where they fall.
Now, in reality you will have more dimensions than one, and you will have floating point precision, so it's not just -1, 0, or 1. For example, here I'm adding human and machine as dimensions. In a realistic model, the dimensions are not labeled this nicely and clearly; the model has learned what they represent, but they don't correspond to an actual concept that you can extract like that. But in our simple example, we can say this Leia character is realistic and a human, versus the Darth Vader, which is cartoonish and, I don't know, somewhere between human and machine. So this is the representation in a vector space. And then, like I said, with floating point values you can place different characters. Both of those are human, but without the hand he's not quite as human anymore, so he sits a bit lower down here, a bit closer to the machines.
So you can have all of your entities in this vector space, and when you search for something, you figure out which characters are the closest to it. Again, in reality you will have hundreds of dimensions, and it will be much harder to say these are the explicit concepts and this is why it works like that.
It will depend on how good your model is at interpreting your data and extracting the right meaning from it. But that is the general idea of dense vector representation: you have your documents, or sometimes chunks of documents, represented in this vector space, and then you try to find something that is close to your query. Does that make sense for everybody, or any specific questions? It is a bit more opaque, I want to say. It's not quite as easy, because you can't simply say these five characters match those five characters; you need to trust, or evaluate, that you have the right model to figure out how these things connect.
So let's see how that looks. I have one dense vector model down here, the OpenAI embedding. This is a very small model; it only has 128 dimensions. The results will not be great, but for demonstrating, that's actually helpful. Let me show you the output. If I take my text, "These are not the droids you're looking for," this is the representation: basically an array of floating point values that gets stored, and then you just look for similar floating point values. That was the dense text embedding.
This one here does sparse embedding. The main model used for sparse embeddings is called SPLADE; our variant of SPLADE is called ELSER, a slightly improved SPLADE, but the concept is the same. You take your words, and this is not just TF-IDF: it is a learned representation, where the model takes all of my tokens and expands them, saying: for this text, these are all the tokens that I think are relevant, and this number tells me how relevant each one is. Again, not all of these make sense intuitively, and you might get some funky results, for example with foreign languages; this currently only supports English. But these are all the terms that we have extracted; normally you get 100-something or so.
So the idea is that this text is represented by all of these tokens, and the higher the score, the more important the token. You store that behind the scenes. When you search for something, you generate a similar list for the query, look for the tokens that overlap, multiply the scores together, and the highest combined values find the most relevant document.
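You can see such an expansion directly by calling the sparse endpoint; the endpoint id here is a placeholder for however ELSER is registered on the instance:

POST _inference/sparse_embedding/my-elser-endpoint
{
  "input": "These are not the droids you're looking for."
}

The response is a map of expanded tokens to weights, like the list shown on screen.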
This is interesting, or nice, insofar as it's a bit easier to interpret; it's not just a long array of floating point values, even if some of the tokens don't make sense. The main downside, though, is that it gets pretty expensive at query time, because you store a ton of different tokens. When you retrieve, the search query generates a similarly long list of terms, and with a large enough text body, a query might hit a very large percentage of your entire stored documents, because basically these are just a lot of ORs that you combine, calculate the score for, and then return the highest ranking results. So it's an interesting approach. It didn't gain as much traction as dense vector models, but as a first step, or an easy and interpretable one, it can be a good starting point for diving into the details.
Audience: So this is over your entire vocabulary?

So, "These are not the droids you're looking for" is basically represented by this embedding here: this entire list of terms with these relevancy weights. That is the representation of this string. And when I search for something, I generate a similar list and basically try to match the two together: what has the most, or the highest, overlapping matches. Makes sense?
Audience: Can you run a query? Can you search for something?

Yes, we'll do that in a second. It's a bit tricky, because the way it comes back, it doesn't tell you exactly what matched, but yes.
So I will need to create a new index. This one keeps the configuration from before, but I'm adding semantic_text fields for the sparse model and the dense model. I've created it, and now I'll just put in the three documents from my other index; as you can see here, it says three documents were moved over. So we can start searching. If I look at it, the first document is still "These are not the droids you're looking for." Just as you don't see the extracted tokens for keyword search, here we also don't show you the dense vector or sparse vector representation. Those are just stored behind the scenes for querying; there's no real point in retrieving them, because you're not going to do anything with that huge array of dense vectors, and it would just slow down your searches. You can look at the mapping and see that I'm basically copying my existing quote field to these other two, so that I can search those as well.
Okay. So if I look for "machine" on my original quote field, will it find anything? No, because the document only had "These are not the droids you're looking for," and this is still a keyword search. This is just the query from before, to show that we're not matching anything. It doesn't work, and it shouldn't work; that's exactly the result we want here.
Now, if I use ELSER and search for "machine", it matches "These are not the droids you're looking for," and you can see this one matches pretty well, at 0.9 or so, but it also has some overlap with "No, I am your father," though at a much lower relevance. Only the third document, "Obi-Wan never told you what happened to your father," is not in our result list at all. But something overlapped here; I don't know the expansion offhand. We would need to run the sparse embedding for all the strings, look at the expansion of the query, and find the overlap; that's how that one was retrieved.
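The semantic query itself is short; a sketch against the sparse field:

GET my-handle-droids-semantic/_search
{
  "query": {
    "semantic": {
      "field": "quote_sparse",
      "query": "machine"
    }
  }
}

Pointing the same query at quote_dense runs it against the dense model instead.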
Audience: So is there a threshold that you can define?

You could define a threshold, though it will depend. Let's see: "These are not the droids you're looking for." I'm not sure this change will do anything; the relevance gap here is still 10x or so. This one will still be returned, just with a very low score. Depending on which terms you look for, the score jumps around a lot, so it's a bit hard to define a fixed threshold. Here you can see that in my previous query we might have said 0.2 is the cutoff point, but now it's actually 0.4, even though the result is not super relevant. So it might be a bit tricky, or you might need a more dynamic threshold, depending on how many terms you're looking for and what counts as a relevant result. In the bigger picture, the assumption is that if you have hundreds of thousands or even millions of documents, you will probably not have the problem that something only remotely connected makes it into the top 10 or 20 that you retrieve. For larger, proper data sets this should be less of an issue; with my hello-world example of three documents it can be a bit misleading. But yes, if you figure out a good cutoff point for your data set and your queries, you can define one.

Audience: No, sorry, I meant: you have three documents, how come it's only showing two?
So the query gets expanded into, I don't know, those 100 tokens or whatever, and for those two documents there is some overlap, but the third one just didn't have any overlap.
Okay, we can check that; it's just a bit tricky to figure out which term has the overlap. We need to take "machine" against "No, I am your father"; let's take this one. To figure that out... actually, we should be able to. Let me see. Is it a pretty long output? Somewhere in here I was actually hoping it would show me the term that matched. Okay, I see something: there is a token "puppet", and that seems to be the overlap. How much sense that term expansion makes, for the stored text and the query text, is a bit of a different discussion. But here, with explain true, you can actually see how it matched and what happened behind the scenes, which helps if you have any really hard or weird queries, or something that is difficult to explain, and you need to debug it. But the third document didn't match.
Now, if I take the dense vector model with OpenAI and search for "machine", how many results do you expect to get back? Zero, one, two, three? Yes, three. Why three? Because there's always some match. Let me run the query first. Here: "These are not the droids you're looking for" is the first result. I don't think this model is generally great, because the results are super close together. The droids-and-machines one comes first, but its score is super close to the second one, "No, I am your father," which feels pretty unrelated, and even "Obi-Wan never told you what happened to your father" still has a reasonably close score.
But why do we get all of those? Because if we ask what the relevance is: a document may be further away, but there is always some angle between the vectors, depending on the similarity calculation that you do; everything is always somewhat related. There is no easy way to say something is totally unrelated. That, by the way, is one good thing about keyword search, where it was relatively easy to have a cutoff between hits and things that are totally not relevant, where you're not going to confuse your users. Whereas here, if you don't have great matches, what you return isn't random, but it can look very unrelated to your end users.
Audience: Is it fair to say that the OpenAI embedding search is worse for this kind of toy example, because the magnitude of difference is so small?

I'm careful with "worse", because this is really a hello-world example, so I wouldn't take it as a quality measurement in any way. I mean, the OpenAI model with 128 dimensions has very few dimensions; I think it will probably be cheap, but not necessarily give you great results. Don't use this as a benchmark. I think it's just a good way to see the trade-off.
This is now much harder, because now you need to pick the right machine learning model to actually figure out what a good match is. With keyword based search it was a bit of a different story: there you need to pay more attention to how you tokenize, whether you have the right language, whether you stem or not, but most of that work is, I want to say, almost algorithmic; you can figure it out, and then it's very predictable at query time. Whereas with a dense vector representation, you really need to evaluate, for the queries that you run and the data that you have, whether the results are relevant and whether it's an improvement or not. It's very easy to get going and just throw a dense vector model together, and you will always match something. That might be an advantage over lexical search, where sometimes the problem is that nothing comes back and you would want at least some results. Here, the results might just be unrelated. So that can be tricky.
"You want to have some results" reminds me, by the way, of a funny story a European e-commerce store once told me. They said they accidentally deleted, I think, two thirds of the data that they had for the products you could buy. I asked them: okay, so how much revenue did you lose because of that? And they said basically nothing, because as long as you showed somewhat relevant results quickly enough, people would still buy. Only having no results at all is probably the worst. So for an e-commerce store, you might want to show stuff a bit further out, because people might still buy it.
But, and I'm coming to you in a moment, it really depends on your use case. E-commerce is one extreme, where you always want to show people something to buy. If you have a database of legal cases or something like that, you probably don't want that approach, because it will go horribly wrong. So it is very domain specific. That's, I think, also the good thing about search: it keeps a lot of people employed, because it's not an easy problem. It's almost job security, because so much depends on: this is the data you have, these are the queries people run, this is the expectation of what should happen, and this is the right behavior for this domain. There's no easy right or wrong with a checkbox. And the other thing is: if you tune it, you might make it better for one case but worse for 20 others. That's why a robust evaluation set is normally very important, though very rare. A lot of people YOLO it, and you will see that in the results. For the e-commerce store, it probably works well enough. Sorry, you had a question.
enrichment to a subset of my index based
off of properties of the document? So if
I have a very large shared index with a
lot of customers and I want to enable AI
for a subset of the index, can I say
hey, only do the semantic enrichment if
the document has this property where
maybe it's like an AI customer?
Yeah, so the
the way we would do it in our product um
is that you would probably have two
different indices with different
mappings.
Yeah, but then it's not so fun like the
customer upgrades and I have to migrate
them
Yeah, so if you, for example, have an index in Elasticsearch, you can think of it almost like a sparse table, right? So there's no penalty for having a field that is not populated. So either in your application or in an ingest processor, you could have an if statement and say...
Yeah, yeah, that's how we do it now. Could you only move it over in some automatic way, where you kind of turn it on?
No, the problem is the data structure. The data structure that we build in the background is called HNSW, and for a given field we either build that data structure or we don't.
Yeah, so say you had 10 billion entries in your index, and the index was set up for vectors, right? If you just don't populate the field that is either holding the dense vector or triggering the inference that creates the dense vector, then the index is just a bunch of pointers and none of them point into the HNSW. Those documents won't show up in the search results, and the penalty to you is nothing. Right?
Um but you're going to have to manage
what does or does not create the vector.
You could do that in an ingest processor
by just saying, "Hey, we're going to use
the copy command to have two copies of
the text, one that's meant for
non-vector indexing, one that's meant
for actual vector indexing." You'd have
to manage that with some tricky complex
AI technology called if then else,
right? Somewhere inside of your ingest pipeline. And it would work just fine.
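A minimal sketch of that, where the pipeline name and field names are made up for illustration, and semantic_body is assumed to be the only field mapped to trigger the vector creation:

  PUT _ingest/pipeline/conditional-enrichment
  {
    "processors": [
      {
        // only copy the text into the vector-mapped field for flagged customers
        "set": {
          "if": "ctx.ai_enabled == true",
          "field": "semantic_body",
          "copy_from": "body"
        }
      }
    ]
  }

Documents without ai_enabled set to true never get semantic_body populated, so no inference runs and nothing is added to the HNSW structure for them.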
Yeah, um
one more question.
When we did
HNSW with Elasticsearch, we found that
it was extremely slow at write time and
the community suggested that we freeze
our index if we were going to use
HNSW.
Force merge, or?
Yeah, I think just freeze writes. They said: build the index and freeze it. Otherwise, you'll put a ton of load on the...
I mean, yes. What we found is that some of the default settings for the merge scheduler, which have been around in Elasticsearch for 10 years, were tuned for keyword search and weren't really optimized for high-update workloads on HNSW, so we got some suggestions. They take a little bit of parameter-sweep tuning to find something right for your IOPS and for your actual update workloads. Sometimes it's about the merge scheduler and not doing an inefficient HNSW build when it's not important for your use case. The other thing we'd say is that sometimes friends don't let friends run Elasticsearch 8.11. Upgrade, upgrade, upgrade. They put a lot of optimization work in there, because it should be simple.
The reason why that is: merging. Because you have the immutable segment structure in Elasticsearch, and HNSW structures you cannot easily merge; you basically need to rebuild them. There's one trick, I forgot which version it was, I'm not sure, Dave, if you remember, I think it was even before 8.11: if we do a merge, we take the largest segment with non-deleted documents and basically plop the new documents on top of it, rather than starting from scratch from two HNSW data structures. There's another optimization somewhere in 9.0 now that will make that a lot faster. So it really depends on the version that you have, and there are a couple of tricks that you can play. But yeah, that is one of the downsides of the way immutable segments work and HNSW is built: you can't merge it together as easily as other data structures, because you really need to rebuild the HNSW data structure, or take the largest one and plop the other one in.
Might have been a while ago. Yeah.
Yeah, I feel like RAG has been very heavily abused. The mental model, I think, started off as: you do the retrieval and then you do the generation. But you could do generation earlier on as well, where you do query rewriting and query expansion.
So, my favorite example for that is: you're looking for a recipe. You don't need the LLM to regenerate the recipe; you just want to find the recipe.
want to find the recipe. But maybe you
have a scenario where you forgot what
the thing is called that you want to
cook. And then you could use the LLM for
example to tell you what you're looking
for.
Like you say like, oh, I'm looking for
this Italian dish
that has like these layers of pasta and
then some meat in between and then the
LLM says, oh, you're looking for lasagna. And then you basically do the generation first, or a query rewriting, and then search, and then get the results. That's a very explicit example; your example will probably look very different, and smarter, than mine.
But query rewriting is one thing. There's also the concept of HyDE, hypothetical document embeddings, where your documents and your queries often look very different, and you use an LLM to generate something from the query that looks closer to the documents that you have, and then you match against the documents because they're more similar in structure.
So, there are all kinds of interesting
things that you can do.
Like I said earlier, "it depends" is becoming a bigger and bigger factor. But yeah, your use case might be a multi-step retrieval, where you first figure out what you're even looking for.
I know the example from an e-commerce store where it's like: I'm going to a theme party from the 1920s, give me some suggestions. And then the LLM will need to figure out what I am actually searching for, and then it can rewrite the query, retrieve the right items, and actually give you proper suggestions. But it's not just running a query anymore.
Yeah.
Yeah.
Definitely not necessarily. It's an interesting question; it feels almost like a blast from the past. I remember two or three years ago there was this big debate about how many dimensions each data store supports and how many dimensions you should have. At first it looked like more dimensions is always better, but then it turned out more dimensions are very expensive. So it really depends on the model and what you're trying to solve: if you can get away with fewer dimensions, it's potentially much cheaper and faster. But I don't think there's a hard rule. Maybe the model with more dimensions can express more, because it just has more room, and that will come in handy; but maybe it's not necessary for your specific use case, and then you're just wasting a lot of resources. I don't think there is an easy answer that says: yes, for this use case you need at least 4,000 dimensions. It depends on the model how many dimensions it outputs, and then maybe you have some quantization in the background to reduce either the number of dimensions or the fidelity per dimension.
So there are a lot of different trade-offs in that performance consideration, but it will mostly come down to how well the model works for the use case that you're trying to solve.
Yeah, so that is one area. Historically, I want to say, what you would do is: you have a golden data set, you know what people are searching for, you have human experts who rate the results for those queries, and then you run different queries against it and see whether it's getting better or worse. Now LLMs open a new opportunity: you might still have human experts in the loop to help out a bit, but LLMs might actually be good at evaluating the results.
Almost nobody has the golden data set and tests against that. But you can either look at the behavior of your end users and try to infer something from that, or you have an LLM evaluate what you have, or you have a human together with an LLM evaluate the results. So you have various tools. But again, it depends, and it's really not an easy question of saying: this is the right thing. Maybe you can get away with something simple. The classic approach, I want to say, is that you looked at the clickstream of how your users behaved: if they clicked on the first or up to the third result, and didn't just go back and click on something else but stuck on the page, the result was potentially good. If they don't click on anything and just leave, it might be very bad. If they go to the second or third page, it might also not be great. So there are some quality signals that you can infer from that, or you really look into the quality aspect and try to evaluate what people were doing and how it behaves.
But you can make this from relatively
simple to pretty complicated.
What else? What else?
Okay, obviously if I search for that with query expansion, it will find my father example. And this one here will again match my droids, much like the opening example.
One thing that is also happening behind the scenes here: this is a very long segment, a lot of information with different speakers. What I have created here, though, is multiple chunks behind the scenes. And if I search for that, I think looking for "murder in the Skywalker saga" works pretty well here. It finds the document that I've retrieved, but it can also highlight. So here I say: show me the fragment that actually matched best. And if I search here for "murder", the exact term isn't found anywhere, but in this highlighted segment it found "kill", which is what the expansion matched.
So here I've broken up my long text field into multiple chunks, and there are multiple strategies to do that: by page, by paragraph, by sentence. You could do it overlapping or not overlapping. Many strategies. What works best will depend on how you want to retrieve for your use case, but you want to reduce the context per element that you're matching, because there's only so much context that a dense vector representation can hold. So you want to chunk that up: especially if you have a full book, you want to break it up into at least individual pages, then find the relevant part where the match is, and then you can actually link back to that. The point in this query is also to show you that I didn't define any chunks. I didn't have to say: okay, create this dense vector representation, send it over there, and handle it again when it comes back.
This is all happening behind the scenes
just to make this easier. So, the entire
behavior here is still very similar to
the keyword matching, even though
there's a lot more magic happening
behind the scenes.
Um just to keep that very simple.
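For reference, a minimal sketch of the setup that produces this behavior, with made-up index and field names: the semantic_text field type handles the chunking and embedding behind the scenes, and recent versions can highlight the best-matching chunk (the exact highlighter options may vary by version).

  PUT transcripts
  {
    "mappings": {
      "properties": {
        "body": { "type": "semantic_text" }
      }
    }
  }

  GET transcripts/_search
  {
    "query": {
      "semantic": { "field": "body", "query": "murder in the Skywalker saga" }
    },
    "highlight": {
      "fields": {
        // return only the single best-matching chunk
        "body": { "number_of_fragments": 1 }
      }
    }
  }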
Um
Okay.
How does everybody feel about longer
JSON queries?
We'll see about alternatives, and maybe we can make this a bit simpler again. But let me show you one more way of looking at this. We call them retrievers: a more powerful mechanism to combine different types of searches.
Um
Combining different types of searches; let me actually get to my slides for when we talk about combining searches and how this all plays together. This is my little interactive map of what you do when you do retrieval, or what your searches do. So we started here, in the lexical keyword search, where we run the match query and we're matching these strings.
Um
This, often combined with some rank features, is what we call full-text search. The rank features could be a specific signal that you extract, or anything else you use to influence the ranking. It could be the margin on a product, how many people bought something, what the rating is. There are many different signals that you could include, not just the match on the text but any other signals that you want to combine for retrieving that. And then you have full-text search as a whole.
On top of that, and I kept it a bit to the side here, you might have a Boolean filter, where you have a hard include or exclude on certain attributes. This does not contribute to the score; it is just black and white, included or excluded. Whereas this over here calculates a score for how well you match.
That was the algorithmic side. Then we have the machine learning, the learned side, or semantic search, where you have a model behind the scenes, split into dense vector embeddings and sparse vector embeddings, for vector search or learned sparse retrieval; I think those are the two common terms. And the interesting thing is: all of these except one are sparse vector representations in the background, and only this one here is the dense vector representation.
And when you combine any grouping down here into one search, that is what we would call hybrid search, even though there can be a big discussion of what exactly hybrid search is or isn't. I will stick to the definition that as soon as you combine more than one type of search, whether it's sparse and dense, dense and keyword, or maybe two dense vector searches, then it's hybrid search, because you have multiple approaches. And then you can either boost them together, or you could do reranking, which is becoming more and more popular.
One thing that we lean heavily into is RRF, reciprocal rank fusion. That doesn't rely on the score; it relies on the position in each search mechanism. It basically says: the lexical search had this document at position four, and the dense vector search had it at position two, and then it evens out the positions and gives you an overall position by blending them together, rather than looking at the individual scores, because those might be on totally different scales.
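A sketch of what that can look like with retrievers; the index, field, and model names here are placeholders:

  GET quotes/_search
  {
    "retriever": {
      "rrf": {
        "retrievers": [
          // lexical leg
          { "standard": { "query": { "match": { "quote": "droid" } } } },
          // dense vector leg
          {
            "knn": {
              "field": "embedding",
              "k": 10,
              "num_candidates": 50,
              "query_vector_builder": {
                "text_embedding": { "model_id": "my-embedding-model", "model_text": "droid" }
              }
            }
          }
        ],
        "rank_constant": 60,
        "rank_window_size": 100
      }
    }
  }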
So this is the information retrieval map overall. Okay, we didn't do a lot with filters, but I think filters are intuitively clear: you just say, I'm only interested in users with this ID, or whatever other criteria. It could be a geo-based filter, only things within 10 km, or only products that came out in the last year. A hard yes or no. All the others will give you a value for the relevance, and then you can potentially blend those together to give you the overall results. That is the total map of search.
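To make the filter side concrete, a small sketch with invented field names (location assumed to be mapped as a geo_point): the filter clause is a hard yes or no and does not contribute to the score, while the must clause is scored.

  GET products/_search
  {
    "query": {
      "bool": {
        "must": [
          { "match": { "title": "droid" } }
        ],
        "filter": [
          // hard include/exclude, no score contribution
          { "range": { "release_date": { "gte": "now-1y" } } },
          { "geo_distance": { "distance": "10km", "location": { "lat": 48.21, "lon": 16.37 } } }
        ]
      }
    }
  }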
Can you give an example of the signal on
this?
Yeah, for a signal: we have our own data structure for these rank features. It could be, for example, the rating of a book. Then you combine the keyword match, say you search for murder mysteries, with another feature for how well the books are rated, and that influences what you see. Or it could be your margin on the product, or the stock you have available, where you would want to show the product you have more of in stock. Or it might even be something simple like a clickstream, what people have clicked on before. There are a lot of different signals that you could include in all of this searching.
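A sketch of one such signal, with invented field names: a rank_feature field for the rating, added as a should clause so it boosts the score without gating the results.

  PUT books
  {
    "mappings": {
      "properties": {
        "title":  { "type": "text" },
        "rating": { "type": "rank_feature" }
      }
    }
  }

  GET books/_search
  {
    "query": {
      "bool": {
        "must":   [ { "match": { "title": "murder mystery" } } ],
        "should": [ { "rank_feature": { "field": "rating" } } ]
      }
    }
  }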
Any other questions or everybody good
for now?
Yeah? When you do this scoring, you said you do a blended approach, right?
RRF does a blended approach, yes.
Those scores, are they kind of standardized, or are they different based on the type of search that you do? Like the relevancy scores that you're combining?
Yeah, so you would have to normalize them. Depending on the comparison that you do for dense vectors, the score might be between zero and one. But you saw that for the keyword search, also depending on how many words I was searching for, it might be a much higher value; there is no real ceiling for that. Or you could add a boost and say, this field is 20 times more important than this other field. There is no real max value here. You could normalize the score and basically say, I'll take the highest value in this subquery as 100% and scale everything down by that factor, and then I combine them. Maybe that works well.
that works well. RRF is a it's a very
simple paper, I think it's like two
pages. Um and it really just takes the
different positions. I think it's 1
divided by 60, which is like a factor
they figured out made sense, plus the
position. Uh and then you add the scores
or like the positions for each document
together, and then that value gives you
the overall position.
Um it really just it doesn't look at the
score anymore, but it blends the
different positions together and like
how they are interleaving and what
should be first or second then.
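To make that concrete with the earlier numbers: a document at position four in the lexical results and position two in the dense vector results gets 1/(60+4) + 1/(60+2), roughly 0.0156 + 0.0161, about 0.032, and all candidates are re-sorted by that blended value.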
Um
Yeah? Um
If I just care about vector search, why should I use Elastic over pgvector or something like Qdrant? I'm sure there are some trade-offs. And if my data is already in the database, any change probably has to come over via CDC, change data capture. So that's one extra hop in the ingestion, versus pgvector, where it's right there, no outside system.
So I was just curious, what have you seen in production? What sort of systems, maybe from people who already have keyword search requirements, choose pgvector or Qdrant versus Elastic?
I mean, pgvector will always be there, because if you are already using Postgres, it's very easy to add. I think then the question is: does it have all the features that you need? For example, Postgres doesn't even do BM25. It has some matching, but it's not the full BM25 algorithm, because I don't think it keeps all the statistics. Then there's the question of scaling out Postgres, which can be a problem, and just the breadth of all the search features. If you only need vector search...
I think my, or our, default question back to that is: do you really only need vector search? Maybe for your use case, but for many use cases you probably need hybrid search. One area, for example, where vector search will not do great is if somebody searches for a brand. There is no easy representation in most models for that specific brand, and it will be very hard to beat keyword search. And your users get very angry when they know you have this word somewhere in your documents or your data set, but you don't give them the result back.
So there are many scenarios where you probably want hybrid search. Two years ago we started with just vector search, but I feel like the overall trend is moving more toward hybrid search, because you probably need some sort of keyword search, and then you want that combined, probably with some model for the added benefit on extra text. But you often want the combination. It might also depend a bit on the types of queries that your users run. If your users run single-word queries like I've done in my examples, that's often not ideal for vector search, because any machine learning model lives off extra context.
So depending on that, I've seen some people build searches where, if you search for one or two words, they do keyword search, but if you search for more, they fall over to vector search. It depends a bit on the context what works. If you really only need vector search, and your workload is small enough for pgvector to do all of that, and Postgres is your primary data store, then that's probably where you will do well. But there are plenty of scenarios where not all of those boxes will be ticked. One last question.
Mhm.
But so, it's one data set, basically, with thousands of files that are all chunked together, and so one change would invalidate all of them, or...?
Ah, maybe. I think the way we might solve it is: if you create a hash of the file and use that as the document ID, and you only use the create operation, which rejects any duplicate writes, then you would at least not re-ingest the document and create the vector representation again. You will still send it over again, and it would need to get rejected.
Yes. If you have that doc ID and you set the operation to create, not update or upsert, then a duplicate would just be rejected and you would only write it once. I'm not sure if it might be better to keep an outside cache of all the hashes that you've already seen and deduplicate there, but using the hash as the ID and only ever writing with create would be the Elasticsearch solution.
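As a sketch, with a placeholder index name and hash: the _create endpoint is exactly that operation. The first write succeeds; repeating the same request with the same ID returns a 409 version conflict instead of re-indexing and re-running the inference.

  PUT files/_create/<sha256-of-file-contents>
  {
    "filename": "episode-iv.txt",
    "quote": "These are not the droids you're looking for."
  }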
Okay, and all of that with create on the index?
Yeah. That is, I think, the most intuitive or native approach that we could offer for that. Yeah.
I think there was some other questions
somewhere. Ah yeah.
Yeah, but from what I remember, the default Postgres full-text search does not do full BM25; it doesn't keep all the statistics, I think, from what I remember.
Yeah.
Any other questions?
Ah.
Joe, please go ahead.
To show that now? Yeah, I mean, how much do you want to see? Mhm. Yes.
Maybe, before we dive into that, for everybody else: rescoring means, let's say we have a million documents and one cheaper way of retrieving them, and we retrieve the top, I don't know, thousand candidates. Then we have a more expensive but higher-quality way of rescoring them, and we run this more expensive rescoring on just the top thousand to get our ultimate result list. The rescoring algorithm would be too expensive to run across the full million documents, which is why you don't want to do that. That's why you use a two-step process, and that's why you might want the rescoring. So yes, in Elasticsearch you can now do rescoring, because it is becoming more and more popular.
I don't have a full example there, but we do have a rescoring model built in by default now. Let me pull that up. Bop bop bop bop bop. Not this one. So, currently it's the version-one reranker, but we have a built-in reranking model now as well. Among the tasks that we can run, you can see here, one task is for example the dense text embedding; now there is also a reranking task that you can call. Good question.
Okay.
No, reranking is good. Let me... ah, somehow my keyboard binding is broken. This is very annoying. Okay. We rerank results. Let me see, somewhere here there should be... so there's learning to rank, but it should not be the only one. This is what we want. Okay, we have our reranking model. Unless, Dave, you know off the top of your head where we have the right docs for this.
Yeah, retrievers could find them. So I think this is a simple example: we have a standard match, which will be very cheap, and then we have the text similarity reranker, which uses our Elastic reranker, the model behind the scenes. You would have the text reranker retriever at the top; inside of that you have the RRF; and inside of that you would have lexical and kNN as peers.
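Roughly, that nesting looks like this; the field names, embedding model, and reranker inference endpoint ID are placeholders, so check your deployment for the built-in one:

  GET quotes/_search
  {
    "retriever": {
      // expensive, high-quality reranking on top
      "text_similarity_reranker": {
        "retriever": {
          "rrf": {
            "retrievers": [
              { "standard": { "query": { "match": { "quote": "droid" } } } },
              {
                "knn": {
                  "field": "embedding",
                  "k": 10,
                  "num_candidates": 50,
                  "query_vector_builder": {
                    "text_embedding": { "model_id": "my-embedding-model", "model_text": "droid" }
                  }
                }
              }
            ]
          }
        },
        "field": "quote",
        "inference_text": "droid",
        "inference_id": "my-rerank-endpoint",
        "rank_window_size": 100
      }
    }
  }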
And it works from the inside out: run each of those retrieval methodologies, like the Venn diagram, find the best results, then take the full text of those results and run them through the reranker, almost like a little mini LLM judging which outcomes are pretty good. The cool thing about the reranker is that you can run it on structured lexical retrievals; you don't have to run it on a vector search. You can run it on anything you want. So if you don't want to pay for vector search on everything, or maybe the text is too small for the model to really lock onto it, you can still add the reranker. When we've run it on actual customer data sets, they're like, "Yeah, our evaluation scores bumped by 10 points basically for free. It feels like cheating."
Right? So when you run against, say, a Gemini API and it's like, wow, why is this 10 points better than the Amazon one? It's because they threw on their reranker and didn't tell you. So there's a lot of black-box stuff out there that we're exposing. Don't be scared that we're telling you how it works inside; this is what's going on. Yeah.
Yeah, so just to give you an example: I don't think I have a reranking example here, but this one uses a classic keyword match for one retriever, and then we normalize the score. Somebody asked about normalizing, or we had a discussion about it: we do a min-max normalization here and weight this retriever with 2. Then I use the OpenAI embeddings, again normalized, with a weight of 1.5, and they get blended together. And you get results that won't surprise you: if you search for "droid and robot", "these are not the droids you're looking for" will be by far the highest-ranking document.
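A sketch of that blended query, assuming a semantic_text field called quote_openai backed by the OpenAI embeddings; in recent versions this is the linear retriever:

  GET quotes/_search
  {
    "retriever": {
      "linear": {
        "retrievers": [
          {
            // keyword leg, min-max normalized, weighted 2
            "retriever": { "standard": { "query": { "match": { "quote": "droid robot" } } } },
            "normalizer": "minmax",
            "weight": 2
          },
          {
            // dense vector leg via the OpenAI embeddings, weighted 1.5
            "retriever": { "standard": { "query": { "semantic": { "field": "quote_openai", "query": "droid robot" } } } },
            "normalizer": "minmax",
            "weight": 1.5
          }
        ]
      }
    }
  }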
You had a question
somewhere.
Yeah.
I mean, in the first step we would retrieve X candidates, and you can define the number of candidates, and then we run the reranking on top of those. So it's a trade-off for you: the larger the window is, the slower it will be, but the potentially higher quality your overall results will be, because you have a larger candidate set that you can then rerank at the end of the day. Is that what you meant, or did you want something per node, or...?
Yeah, and I don't think that's how we do it. What you can control here is the window of what you might retrieve, and then we have the minimum score, a cutoff point to throw out what might not be relevant anyway, to keep that a bit cheaper. That's what we have here.
Um
That's what retrievers do, and then you could do the RRF that I've explained, where you blend results together. All of that is easy.
One final note: if you got tired of all the JSON, we have a new way of defining those queries as well. Here we have a match operator like the one we've used all along, which you can use on a keyword field, but also on either a dense or a sparse vector embedding, and then you can just run a query on that and get the scores back. So it is a query language; it feels a bit more like a shell. But if you don't want to type all the JSON anymore, this is how you can do that. My screen size is a bit off here, but you get the quote that we retrieved, the speaker, and the score. Maybe I'll take out the speaker to make this slightly more readable.
Now it broke.
Oh. Um
This is how you can write queries with a fraction of the JSON.
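For example, a query in this piped style (this looks like ES|QL syntax from recent versions; the index and field names are placeholders from my demo) might look roughly like:

  FROM quotes METADATA _score
  | WHERE MATCH(quote, "droid")
  | SORT _score DESC
  | KEEP quote, _score
  | LIMIT 5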
This will also support funny things like joins. It doesn't have every single search feature yet, but it's getting pretty close. So this is more of a closing note: if you're tired of all the JSON queries, you don't have to write JSON queries anymore.
This is nice both for observability use cases, where you have lots of aggregations and things like that, but it's also very helpful for full-text search now, if you want to write different queries. I think the main downside is that the client support in the different languages, like Java etc., is not very strong yet. You basically give it strings, and it gives you a result back that you need to parse out again. So it is not as strongly typed on the client side yet as the JSON query DSL clients are.
Um
Any final questions?
Yes.
I mean, we can make your life easier. It's all behind one single query endpoint. So you could use the two different methods to retrieve, and then you could still rerank, but all from one single query. You don't have to do it yourself. I mean, it's not like we want to stop you, but you don't have to, and we can make your life a bit easier. It's only one single query that you need to run, one single round trip to the server.
I mean, if you still need to do the retrieval, all the individual pieces are still there. If you have two parts of the query, you will still retrieve those, if that is the main cost, and then you have the reranking, so you're not getting out of those completely. But you can do it all in one single request, and we take care of it for you and send one result set back, rather than sending more back to your application. So it will potentially be a little less work on the Elasticsearch side, but it will mostly be less work on your application side.
Perfect. Thank you so much. I hope everybody learned something. I will leave the instance running for two days or so, so you can still play around with the queries if you feel like it. Thanks a lot for joining. If you want stickers, we have stickers up there. We also have a booth the next few days. Come join and get some proper swag from us there. Thank you. See you around.