Information Retrieval from the Ground Up - Philipp Krenn, Elastic
Channel: aiDotEngineer
Published at: 2025-07-27
YouTube video id: 4Xe_iMYxBQc
Source: https://www.youtube.com/watch?v=4Xe_iMYxBQc
Let's get going. Audio is okay for everybody? I have some slight feedback, but I'll try to manage. Hi, I'm Philipp. Let's talk a bit about retrieval, from the ground up. We'll keep it pretty hands-on; you'll have a chance to follow along and do everything that I show you, since I have a demo instance that you can use, or you can just watch me. If you have any questions, ask at any moment. If anything is too small to read, shout and we'll try to make it larger as we go.

So, I guess we're not over RAG yet. RAG is a thing, and we'll focus on the R in RAG, retrieval augmented generation: just the retrieval. Quick show of hands: who has done RAG before? Okay, that's about half or so. Who has done anything with vector search and RAG? Do I need vector search for RAG, or can I do anything else? Do anything in the prompt? Yeah, you can do other things. Retrieval is actually a very old discipline; depending on how you define it, it might be 50 or 70 years old. Retrieval is just getting the right context to the generation, and I'll ignore the generation entirely for today. We'll keep it very simple and focus on getting the right information in, partially with the classics, but we'll get to some newer things as we go along.

Who has done keyword search before? That's fewer than vector search, I feel like, which almost reminds me of 15 years ago or so when NoSQL came up: more people had done MongoDB, Redis, whatever else, rather than SQL. That has changed again, and I think it will be similar for retrieval. The way I would always frame it: vector search is a feature of retrieval. It's only one of many features that you want in retrieval, and we'll see a bit of why and how as we dive into the details.

So, I work for Elastic, the company behind search. We're the most downloaded, most deployed search engine. We do vector search, keyword search, hybrid search, and we'll dive into various examples. Everything I show you uses the Elasticsearch query language, but if you use anything built on Apache Lucene, everything behaves very similarly. If you use something that is a clone of or close to Lucene, like anything built on Tantivy, it will also be very similar. The foundations of keyword search and vector search apply broadly everywhere.

So, let's get going. Who remembers, in Star Wars, when he's making that hand gesture, what the quote is? "These are not the droids you're looking for." We'll keep this relatively Star Wars based. Feel free to come in and find a spot on the side; I think we have one chair over there and one down there, otherwise it's getting a bit full. Okay, let's look at what "These are not the droids you're looking for" does for search. I'll start with the classic approach: keyword search, or lexical search, where you search for the words that you have stored, and we want to find what is relevant in our examples. If you want to follow along, there is a gist with all the code that I'm showing you, at elastic/ai.engineer. There is one important thing.
I have one shared instance for everybody, so you can all use it without signing up for any accounts; it's just a cloud instance. My handle is in the index name. If you don't want to fight and override each other's data, replace that with your own handle or something specific to you, because otherwise you'll all work on the same index and overwrite each other's data. You can also just watch me if you don't have a computer handy, that's fine. But if you want to follow along: elastic/ai.engineer has the gist, which includes the connection string. There's a URL, and the credentials are workshop / workshop. If you go to the login, it will say "Log in with Elasticsearch"; that's where you use workshop / workshop. Then you'll be able to log in, run all the queries that I'm showing you, and try things out. If you have any questions, shout. I have a couple of colleagues dispersed in the room, so if there are too many questions, we'll somehow divide and conquer.

So, let's get going and see what we have here. I'll show you most of this live. I think this is large enough for the back row; if it's not large enough for anybody, shout and we'll see how much larger I can make it. Let me turn off the Wi-Fi and hope that my wired connection is good enough. Let's refresh to see. Oh. Maybe we'll use my phone after all. Okay, let's try this again. Okay, this is no good. Out you go. Okay, hardest problem of the day solved: we have network.

So, we have the sentence "These are not the droids you're looking for," and we'll start with classic keyword or lexical search. What happens behind the scenes? Generally, you want to extract the individual words and then make them searchable. Here I'm not storing anything yet; I'm just looking at what it would look like if I stored something. I'm using the _analyze endpoint to see what would actually be stored in the background to make the text searchable. So I have "These are not the droids you're looking for," and you see the individual tokens: these, are, not, the, droids, and so on. In Western languages, the first step that happens everywhere is tokenization, and it's pretty simple: you break out the individual tokens on whitespace and punctuation marks. Asian languages in particular are a bit more complicated around that, but we'll gloss over that for today.

And we have a couple of interesting pieces of information here. We have the token, so "These" is the first token. We have the start offset and the end offset. Why would I extract and then store a start and an end offset? Any guesses? Yes: especially if you have a longer text, you want that highlighting feature, to show where the hit actually was. So if I'm searching for "these," which is maybe not a great word, you can very easily highlight where the match actually occurred. And the trick you're doing in search, what generally differentiates it from a database, is that a database just stores what you give it and then does almost everything at query or search time, whereas a search engine does a lot of the work at ingestion, when you store the data.
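If you're following along, the request I'm running looks roughly like this (the gist has the exact version); the response carries one entry per token:

```
POST _analyze
{
  "tokenizer": "standard",
  "text": "These are not the droids you're looking for"
}

// Response (abbreviated), one entry per extracted token:
// { "token": "These", "start_offset": 0, "end_offset": 5,
//   "type": "<ALPHANUM>", "position": 0 }
```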
So, we break out the individual tokens, and we calculate and store these offsets. Whenever we have a match afterwards, we never need to reanalyze the actual text, which could potentially be multiple pages long; we can just highlight where the match is, because we have extracted those positions. We also have a position. Why would I want to store the position of each token? Yes: the main use case, and we'll look at this briefly later, is phrase search, looking for this word followed by that word. You first find all the texts that contain the words, and then you just compare the positions, looking for n, n + 1, and so on. You never need to look at the string again; the positions alone tell you it was one continuous phrase, even though you've broken it into individual tokens.

Most of the token types you see here are alphanumeric, the <ALPHANUM> type; an alternative type would be synonyms. We'll skip synonym definitions, because defining tons of synonyms is no fun. But this is everything we're storing in the background. You can also customize this analysis, and that is one of the key features of full-text or lexical search: you preprocess a lot of the information to make the search afterwards faster. Here you can see I'm stripping out the HTML, because nobody is going to search for an emphasis tag. I use a standard tokenizer, which for example breaks up on dashes; an alternative would be the whitespace tokenizer, which only breaks up on whitespace. I lowercase everything, which is what you want most of the time, because nobody searches in Google with proper casing, or at least maybe my parents, but nobody else. We remove stop words (we'll get to stop words in a moment), and we do stemming with the Snowball stemmer. Stemming reduces a word down to its root, so you don't care about singular versus plural or the inflection of a verb anymore; you really care more about the concept.

So if I run the phrase through that analysis, does anybody want to guess which tokens will be extracted, and in what form? Not a lot will remain. Two? Close: we'll actually have three. We have droid, you, and look, and you can see all the others were stop words, which were removed. The stemming reduced "looking" down to "look," because we don't care whether it's looks, looking, or look; we just reduce it to the word stem. We do this when we store the data, and, by the way, when you search afterwards, your query text runs through the same analysis so that you get exact matches. You don't need to do anything like a LIKE search anymore, which makes this much more performant than anything you would do in a relational database, because you have direct matches. We'll look at the data structure behind it in a moment. What we get is droid, you, look, with the right positions, so for example if we searched for the phrase "droid you," we could easily retrieve that, because we have the positions, even though that is a weird phrase. Do we start indexing positions at zero or one? Zero, yes. It's the only right way. So the positions start at zero, and these are the tokens that remain.
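The full analysis chain from the talk looks roughly like this; the _analyze API lets you test the whole chain ad hoc:

```
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "snowball"],
  "text": "These are <em>not</em> the droids you're looking for"
}

// As shown in the talk, only three tokens survive:
// "droid" (position 4), "you" (position 5), "look" (position 6)
```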
If you do this for a different language, and as you might hear, I'm a native German speaker, take the same text in German: if you use a German analyzer, it knows the rules for German, analyzes the text the right way, and the right German tokens remain. Anybody want to guess what happens if I apply the wrong language's analyzer to a text? It goes very poorly, because the way this works is that there are rules for every single language: what is a stop word, how does stemming work. If you apply the wrong rules, you just get wrong output; it will not do what you want. For example, in English the rule is that an "s" at the end just gets stemmed away, even though for a German word this doesn't make any sense. You apply the wrong rules and you produce pretty much garbage, so don't do that. To give you another example, here is the same phrase in French, and you see droid, la, and recherchez are among the words that remain. Otherwise it works the same, but you need the right analysis for what you're doing, or you'll just produce garbage.

A couple of notes as we're going along. The stop word list by default, which you can override, is relatively short. Linguists have spent many years figuring out the right lists of stop words; you don't want too many or too few. The English one is old, I always forget, I think it's 33 years old, and this is where you can find it in the source code. I don't want to say it's well hidden, but it's not easy to find either. Every language has a defined list of stop words that are automatically removed. For "These are not the droids you're looking for," more or less by accident, we had a lot of stop words, which is why not a lot remained of the phrase. For all other languages you'll have a similar list.

Should you always remove stop words? Yes? No? That comment about "not" is a very good one, by the way; I'm not sure everybody heard it. One important thing here: we're talking about lexical keyword search, which is dumb but scalable. It doesn't understand whether there is a droid or there is no droid; "not" is simply defined as a stop word, and the engine just does keyword matching. In vector search, or anything with a machine learning model behind it, it will be a bit of a different story, where these words might make a difference. But this is very simple: it just matches on similar strings, it doesn't understand the context, it doesn't know what's going on. That's why the linguists decided "not" is a good stop word. You can override that if, for your specific use case, it's not a good idea. So: always remove stop words, yes, no, maybe?
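The built-in language analyzers bundle the matching stop-word list and stemmer per language; a minimal way to see the right-versus-wrong-rules effect on the same text, as a sketch:

```
POST _analyze
{
  "analyzer": "english",
  "text": "These are not the droids you're looking for"
}

// Swap "english" for "german" or "french" here and the output degrades,
// because the stop-word list and stemming rules no longer fit the text.
```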
So our favorite phrase is "it depends," and then you have to explain what it depends on. What it depends on: there are scenarios where removing all stop words does not give you the desired result, and maybe you want to store a text both with and without stop words. Sometimes stop words are just a lot of noise that blows up the index size and doesn't really add a lot of value; that's why we have defined them and try to remove them by default. But if you had, for example, "to be or not to be," those are all stop words, and it's all gone when you run it through analysis. So it is tricky to figure out the right balance for stop words, or what works for your use case, and you might have unexpected surprises in all of this.

Okay, we've seen the language examples. Let's do some more queries, or let's actually store something; so far we only pretended, we only looked at what would happen if we stored something. Now I'm actually creating an index. Again, if you're running this yourself, please use a different name than mine; replace all my handle instances with your handle or whatever you want, since this is a shared instance and otherwise we'll have too many collisions. (I might jump to another instance that I have as a backup in the background.) What I'm doing here is creating the analysis pipeline that I looked at before, throwing out the HTML, standard tokenizer, lowercasing, stop word removal, and stemming, calling it my_analyzer, and then applying my_analyzer to a field called quote. We call this a mapping; it's kind of like the equivalent of a schema in a relational database, and it defines how different fields behave.

Okay, and somebody did not replace the index name. By the way, you need to keep the user_ prefix. Let me quickly do this myself. Oops. I should have seen this coming. Please don't copy that. Okay, looks like it worked. Let's try it again. So we're creating our own index, and now, just to double check, I'll run _analyze against the field that I've set up, to verify I configured it correctly. And now I'm actually starting to store documents (bless you). So we'll store "These are not the droids you're looking for." I have two others that I'll index, just so we have a bit more to search: "No, I am your father." Any guesses what will remain of that one? Father, yeah. Okay, let's try it out. This one actually has way fewer stop words than you would expect. Since I didn't configure the HTML removal here, let's take the tags out manually. And this was not what I wanted; we need to run this against the right analyzer. This is what happens when you copy-paste. Okay: "I am your father" remains, so "no" is the only stop word in this sentence, actually. "No" was on the stop word list; all the others are not.

Okay, let's try another one: "Obi-Wan never told you what happened to your father." How many tokens will Obi-Wan be? Two? One? Obi-Wan will be two tokens, obi and wan, because we use the default, standard tokenizer, which breaks up at dashes. If you had used another tokenizer like whitespace, it would keep it together, because that one only breaks up at whitespace. So there are various reasons why you would or would not want each.
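To recap the setup steps so far in one place, here is roughly what got created (the index name is illustrative; use your own handle on the shared instance):

```
PUT krenn-workshop
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "quote": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```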
I don't want to go into all the details, but there are a lot of things to get right or wrong when you ingest the data, and they determine how you can query the data afterwards. For example, an email address also gets broken up in a weird way by the standard tokenizer; there's a dedicated tokenizer for URLs and email addresses (uax_url_email). So depending on what type of data you have, you'll need to process the data the right way, because pretty much all the smart pieces happen at ingestion here, to make the search afterwards easier.

So, let's index all my three documents so that we can actually search for them. Now, if I search for "Droid," should it match "These are not the droids you're looking for," yes or no? This one is singular and uppercase, and the droids that we stored were plural and lowercase. Will that match, yes or no? Why? Yes: we had the stemming, we had the lowercasing. When we store the text, it runs through the analysis pipeline, and the search does the same thing. It will lowercase "Droid," the "droids" in the stored text was stemmed down to "droid," and then we have an exact match.

So what does the data structure behind the scenes actually look like? The magic is in the so-called inverted index. The inverted index is: all the tokens that remained after analysis, sorted alphabetically, each with pointers saying in which documents (say, with the IDs 1, 2, 3 that I have stored) it occurs, how many occurrences there are, and at which positions it appeared. So if I search for "Droid" now, I lowercase it to "droid," find the exact match in that sorted list, go through the postings, retrieve this document, skip this one, skip this one, and at position four I have the hit, which I could then easily highlight. You've done almost all the hard work at ingestion, and the retrieval afterwards is very fast and efficient. That's the classic data structure for search, the inverted index: an alphabetical list of all the tokens you've extracted. It's just built in the background for you, and that's how all of this gets retrieved.
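Indexing the three quotes and running the search from just now looks roughly like this:

```
PUT krenn-workshop/_doc/1
{ "quote": "These are not the droids you're looking for" }

PUT krenn-workshop/_doc/2
{ "quote": "No, I am your father" }

PUT krenn-workshop/_doc/3
{ "quote": "Obi-Wan never told you what happened to your father" }

GET krenn-workshop/_search
{
  "query": {
    "match": { "quote": "Droid" }
  }
}
// "Droid" is lowercased at query time, and the stored "droids" was already
// stemmed to "droid" at index time, so the inverted-index lookup is exact.
```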
Let's look at a few other queries and how they behave. If I search for "robot," will I find anything? No, because there was no robot; there was a droid. We could now define a synonym and say all droids are robots, for example. Who likes creating synonym lists? Nobody anymore. Okay. Normally I would have said that's the Stockholm syndrome, because there's sometimes somebody who likes creating synonym lists after doing it for so many years, but it got easier nowadays: you can use LLMs to generate your synonyms. They're still limited, though, because you always have this fixed mapping. With synonyms you can expand in the right way; where it gets trickier is homonyms, when a word has multiple meanings: a bat could be the animal or the thing you hit a ball with. There it just gets harder, because there is no meaning behind the words, no context. You just match strings, and that is inherently limited. But like I said, it's dumb, but it scales very well, and that's why it has been around for a long time and does surprisingly well for many things: not a lot can go totally wrong or behave unexpectedly.

Now, other things that you can do. You could do a phrase search where you say "I am your father." Will this find anything? Yes, because we had "No, I am your father." What happens if I search the phrase "I am not your father"? Yes, no? No. Why? You're right that "not" is a stop word, but you're also right that the positions still don't match. The stop word "not" would be filtered out of the query, but it still doesn't match, because the positions are off. That's one of the things that can sometimes be confusing: even if something is a stop word and gets filtered out, phrase matching doesn't work like that. One thing you can do, though, is use the factor called slop, where you basically say: if something is missing, it should still match. "I am your father" versus the phrase "I am father" with slop zero, which is the implicit default, will not find anything. But if I set slop to one, I'm basically saying the positions can be off by one, so one word can be missing. However, "I am his father": here "his" would not match, so this still will not work. The slop really just skips a position.
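The phrase query with slop, roughly:

```
GET krenn-workshop/_search
{
  "query": {
    "match_phrase": {
      "quote": {
        "query": "I am father",
        "slop": 1
      }
    }
  }
}
// With the default slop of 0 this finds nothing; slop 1 tolerates the
// missing "your", because positions may be off by one.
```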
What about "I'm your father"? That will not work either. There you might need something like a synonym that replaces "I'm" with "I am," or we need some more machine learning capabilities behind the scenes for things like that. What's built in is generally a very simple set of rules; for things like this you normally need a dictionary. The problem with dictionaries is that they're normally not available for free or open source. Funnily enough, they often come out of universities, because universities have a lot of free labor: the students. That's why universities have created a lot of dictionaries, but they often come out under the weirdest licenses, which is why they're not very widely available. But yes, there is a smarter, more powerful approach if you have a dictionary.

Maybe this is a good thing to also mention: you don't always get real words out of stemming. It's not a dictionary, it doesn't really understand what you're doing, it just applies some rules. For example "blackberries" (sorry, I need the English analyzer; without English this will not work) stems down to a weird non-word, something like "blackberri," and the singular "blackberry" stems down to the same thing. There is a rule that produces this, but it's just a rule; it's not dictionary-based, it's not very smart, it only has some rules built in that work for this, and you will definitely hit limits. The other reason I picked blackberry as an example: you have some annoying languages like German, Korean, and others that compound nouns, like blackberry, where you basically have two words glued together. "Black" would never find "blackberry" in the simplest form, because it's not a complete token match.

There are various ways to work around that, which all come with their own downsides: either you have a dictionary, or you extract so-called n-grams, groups of characters, and match on those. All of these are among the many tools we use to try to make this a bit better or smarter, but they all have limitations. I hope that answers the question and makes sense. So yes, there are dictionaries, but they're generally not free or not available under an easy license. For some languages, by the way, even the stemmers aren't really available: I think there is a stemmer or analyzer for Hebrew, but it has some commercial license, or at least you can't use it for free in commercial products. The licensing with machine learning models is also its own dark secret.

Question from the audience: maybe it's not exactly clear, because we don't know how to spell it, but why not just have much smaller groupings, like sub-word tokens? Then you can match a lot of things. You're going to have more false positives, but presumably you can filter for true positives to get more matches. Yes, that is exactly what an n-gram does. Let me show you; this is way too small, and somehow I've weirdly overwritten my Cmd-plus, so let me make this slightly larger and copy it over to my console. This is a great question. So we'll use an n-gram tokenizer on "Quick Fox," and you can see the tokens I extract here are the first letter, the first two, the second, the second and third, and so on; here with one or two letters as groups, which is way too small in practice, but it shows the example. You end up with a ton of tokens. The downsides: first, you have to do more work when you store this; second, it creates a lot of storage on disk, because you extract so many different tokens; and then your search will also be pretty expensive. Normally you would use at least tri-grams, but even that creates tons of tokens and tons of matches, and then you need to find the documents with the most matches. It works, but it's pretty expensive in disk and at query time, and it might also create undesired results, or results that are a bit unexpected for the end user. I would call it again a very dumb tool that works reasonably well for some scenarios, but it's only one of many potential factors.

What you could potentially do, and I don't have a full example for that, but what you would probably do in reality: store a text more than one way. You might store it with stop words, without stop words, and maybe with n-grams, and then give a lower weight to the n-grams, saying: if I have an exact match, I want that first; but if there's nothing in the exact matches, then look into the n-gram list and take whatever comes up there. So even keyword-based search gets more complex as you combine different methods. N-grams are interesting, but again, they're a dumb but pretty heavy hammer; use them in the right scenario. Sorry, quick question about these n-grams: is it by default one or two? Yes, but you could redefine that.
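The n-gram demo, roughly; the defaults are min_gram 1 and max_gram 2, as just discussed:

```
POST _analyze
{
  "tokenizer": { "type": "ngram", "min_gram": 1, "max_gram": 2 },
  "text": "Quick Fox"
}

// → "Q", "Qu", "u", "ui", "i", "ic", ... every 1- and 2-character window.
// An edge_ngram tokenizer with the same settings would emit only "Q" and
// "Qu": grams anchored to the start of each word.
```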
Let me go back to the docs. For the n-gram tokenizer, you can set min_gram and max_gram. If you set both to three, you have tri-grams, where it's always groups of three: characters 1-2-3, 2-3-4, and so on. You could also use something called edge n-gram, for when you expect that somebody types the first few letters of a word: you only build grams from the beginning, not from the middle of the word, which sometimes avoids unexpected results and of course reduces the number of tokens quite a bit. So here we have edge n-gram on "quick," and you can see it only produces the first letter and the first two letters, nothing else. In reality you would probably define this as something like two to five, or more, or whatever you want, but only from the start of the word, which reduces the tokens tremendously. But of course, if you have "blackberry" and you want to match the "berry," you're out of luck. Makes sense? Anybody else, anything else?

Yes, another question: if you're working with a language like English and then Hebrew, which is right-to-left, how do you deal with the indexes and such? So, if you have multiple languages: do not mix them up. That will just create chaos, because, and we'll get to this in a moment, keyword search is based on word frequencies, and mixing languages screws up all the frequencies and statistics. What you would do instead, if you have English and Hebrew: you could have a field for English and a field for Hebrew, each with the right analyzer defined for that specific field, so you break it out into different fields, or you could even use different indices. And ideally, we even have language identification built in. Even if you provide just a couple of words, it will infer the language with a very high degree of certainty. Hebrew in particular will be very easy to identify, since it has its own script. But even if you throw random languages at it, with just a few words it has a very good chance of knowing which language it is, and then you can treat it the right way.

Good, let's continue. We've done all of these searches, we've done slop. One more thing before we get into relevance: another very heavy hammer that people often overuse is fuzziness. If you have a misspelling (so, I misspelled Obi-Wan Kenobi, and we already know that this is broken out into two different tokens), it will still match Obi-Wan, because fuzziness allows edits. It's a Levenshtein distance. You can give it an absolute value: one edit, which could be one character too many, too few, or one character different. You could set it to two, but you can't go higher, because otherwise you'd match almost anything. Or you set it to auto, which is kind of smart: depending on how long the token you're searching for is, it picks a value. With auto, tokens of zero to two characters get zero edits, three to five characters get one edit, and anything longer gets two.
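A fuzzy search sketch; the misspellings here are my own stand-ins for the ones typed in the talk:

```
GET krenn-workshop/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "Obe-Wun",
        "fuzziness": "AUTO"
      }
    }
  }
}
// Fuzziness applies per token after tokenization: "obe" → "obi" and
// "wun" → "wan" are each one edit, within AUTO's budget for 3-5 char tokens.
```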
So you can match these misspellings. Will this one match, yes or no, and why? "No, because you've got both the B and the W changed." We do have both of those misspelled, and it still matches. Why? "I think it's tokenized separately, and each one gets a single Levenshtein edit." Yes. That is a bit of a gotcha: you need to know the tokenizer. We tokenize with standard, so it's two tokens, and then the fuzziness applies per token. Which is another slightly surprising thing, but yes, that's how you end up with a match here. Okay. We could also look at how the Levenshtein distance works behind the scenes: it's basically a Levenshtein automaton, which looks something like this. If you search for "food" with two edits, this is how the automaton works in the background to figure out all the possible permutations. It's a fancy algorithm that was, I think, pretty hard to implement, but it's in Lucene nowadays.

Okay. Now, let's talk about scoring. One thing you've seen here that you don't have in a non-search engine, in just a database, is scores: how well does this match? How does a score work here? Let's look at the details. The basic algorithm, which most of us have probably heard of, is term frequency / inverse document frequency, or TF-IDF. It has been slightly tweaked since: the newer implementation is called BM25, which stands for "best match," and it's the 25th iteration of the best match algorithm. So what do these look like? First, you have the term frequency: if I search for droid, how many times does droid appear in the text that I'm looking at? The score grows with roughly the square root of that count. The assumption is: if a text contains droid once, this is the relevance; if a text contains droid ten times, this higher value is the relevance. The tweak between TF-IDF and BM25 is that TF-IDF just keeps growing, while BM25 says that once you hit something like five droids in a text, it doesn't really get much more relevant anymore; it flattens out the curve. The next thing is the inverse document frequency, which is almost the inverse curve. The assumption here is: over my entire text body, this is how often the term droid appears. If a term is rare, it is much more relevant than if a term is very common. Basically: rare is relevant and interesting, very common is not very interesting anymore. And the final thing is the field-length norm: the shorter a field is, the more a match in it counts. The assumption is that if you have a short title and your keyword appears there, it's much more relevant than a match somewhere in a very long text body. These are the three main components.
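For reference, the BM25 formula that puts these three pieces together, as Lucene uses it; k1 and b are tuning constants (1.2 and 0.75 by default), f(t, d) is the term frequency of term t in document d, |d| the field length, and avgdl the average field length:

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d) \cdot (k_1 + 1)}
       {f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```

Here N is the total number of documents and n_t the number containing t: the rarer the term, the larger the IDF; the denominator's length term is the field-length norm; and the fraction saturates as f(t, d) grows, which is the flattening curve just described.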
So let's take a look at how this plays out. You can make the query a bit more complicated and have it show you why something matches; actually, don't be confused by that, let me take it out for the first try. I'm searching for "father" across "No, I am your father" and "Obi-Wan never told you what happened to your father," and one is more relevant than the other. Why is the first one more relevant than the second? The term frequency is the same: both contain "father" once. The inverse document frequency is also the same, because we're looking for the same term. The only difference is that the second document is longer than the first one, and that's why the first is more relevant here. So this is very simple. And if you're unsure why something is scored in a specific way, you can add "explain": true, and it will tell you all the details: okay, we have "father," and here are all the different pieces of the formula and how the calculation was done. So you can debug that if you need to, but it's probably a bit too much output for the everyday use case.
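The debugging switch looks roughly like this:

```
GET krenn-workshop/_search
{
  "explain": true,
  "query": {
    "match": { "quote": "father" }
  }
}
// Each hit now carries an _explanation tree with the score components:
// the idf part (docFreq, docCount) and the tf part (freq, k1, b, dl, avgdl).
```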
You can also customize the score if you want to. Here I'm doing a random score, so my two "father" documents (this is a bit hard to show) will just come back in random order, because their scores are randomly assigned. But you could do this more intelligently: combine the relevance score with, I don't know, the margin on the product that you sell, or its rating, and build a custom score out of that. So you can influence the score any way you want. There's one thing I see every now and then that is a very bad idea, and we'll get to it in a moment.

This, by the way, is the total formula, or let me show you the parts that I skipped. What happens if you search for two terms, and they don't have the same relevance? The calculation behind the scenes looks basically like this. Let's say we search for "your father." "Father" is very rare, which makes it much more relevant than "your," which is pretty common. A document that contains "your father" sits along this axis and will be the best match. But will a document that only contains "father" be more relevant, or one that only contains "your"? Intuitively, the one with just "father." How is that calculated? You plot the relevance of "father" on one axis and "your" on the other, with the ideal document as one direction, and then you look at which document vector has the smaller angle to the ideal one; that's the more relevant document. So for a multi-term search, you can figure out which term is more relevant and how they combine. And then there's also the coordination factor, which rewards documents containing more of the terms you're searching for. If I'm searching for three terms, say "your," "I am," "father," whatever, and a document contains all three, the formula combines the scores of all three and multiplies by 3/3. If it only contains two of them, it only gets 2/3 of the relevance, and with one, 1/3. Put it all together and this is the formula that runs behind the scenes, and luckily you don't have to do it in your head.

Cool. One thing I see every now and then is that people try to translate the score into percentages: you say this is a 100% score and this is only a 50% match. Who wants to do that? Hopefully nobody, because the Lucene documentation is pretty explicit about it: you should not think about the problem that way, because it doesn't work. And I'll show you why it doesn't work, how this breaks. Let's take another example, this short text: "These are my father's machines." I tried to think of a good Star Wars quote to use here, but bear with me. What remains if I run this through my analyzer? my, father, machine: these are the three tokens that remain. Now I'll store that; remember those three tokens. And if I search for "my father machine," you might be inclined to say this is the perfect score, this is 100%. Agreed? All three tokens that I have stored in "These are my father's machines" are there, so this must be my perfect match. Say the score is 3.2; that would be our "100%." The problem now is that every time you add or remove a document, the statistics change, and your score changes. If I delete that document and search the same thing again, I don't know what percentage this is now. Is this the new 100%, the best document, or is it 20%? How does it compare? And then you can play funny tricks: "These droids are my father's father's machines" has a term frequency of two for "father." So if I store that one and then search, is this now 100%? 110%? So don't try to translate scores into percentages. They're only relevant within one query; they're also not comparable across queries. They're really just a sort order within one query.

Okay, let me get rid of that one again. Now we've seen the limitations of keyword search. We don't want to hand-define our synonyms; we might want to extract a bit more meaning. So we'll do some simple examples to extend this. I will add, from OpenAI, text-embedding-3-small: I'm basically connecting the inference API for text embeddings here in my instance. I have removed the API key; you would need to use your own API key if you want to set this up, but on the shared instance it's already configured. So let me pull up the inference services that we have here: I've added two different models, one sparse, one dense. (And by the way, if you were tempted to try the 100%-score idea with these: don't. It will just not work here either.)
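Setting up the dense endpoint looks roughly like this; the inference endpoint id is my choice here, and the model and dimension count are what the talk described:

```
PUT _inference/text_embedding/openai-embedding
{
  "service": "openai",
  "service_settings": {
    "api_key": "<your OpenAI API key>",
    "model_id": "text-embedding-3-small",
    "dimensions": 128
  }
}
```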
Not everybody has worked with dense vectors, right? So I have a couple of graphics, coming back to our Star Wars theme, to look at how that works. What you do with dense vectors, and we'll keep this very simple: this example has just a single dimension, and the axis is, say, realistic Star Wars characters versus cartoonish Star Wars characters. This one falls on the realistic side, and that other one is just cartoonish, and you have a model behind the scenes that can rate those images and figure out where they fall. Now, in reality you will have more dimensions than one, and you will have floating-point precision, so it's not just -1, 0, or 1. So for example here, I'm adding "human" and "machine" as dimensions. In a realistic model the dimensions are not labeled this nicely and clearly; the model has learned what they represent, but they don't represent an actual thing that you can extract like that. But in our simple example we can say this Leia character is realistic and human, while, I don't know, Darth Vader is cartoonish and somewhere between human and machine. So this is the representation in a vector space.

And with floating-point values you can place different characters more finely. Similar characters: both of those are human; without the hand, he's not quite as human anymore, so he sits a bit lower here, a bit closer to the machines. So you have all of your entities in this vector space, and then if you search for something, you can figure out which characters are the closest to it. And again, in reality you will have hundreds of dimensions; it will be much harder to say "these are the explicit concepts, and this is why it works like that." It will depend on how good your model is at interpreting your data and extracting the right meaning from it. But that is the general idea of dense vector representation: you have your documents, or sometimes chunks of documents, represented in this vector space, and then you try to find something that is close. Does that make sense for everybody, or any specific questions? It's a bit more opaque, I want to say. It's not quite as easy anymore as "these five tokens match those five tokens"; you need to trust, or evaluate, that you have the right model to figure out how these things connect.

So let's see how that looks. I have one dense vector model down here, the OpenAI embedding. This one is a very small model; it only has 128 dimensions. The results will not be great, but for demonstrating it's actually helpful, so we'll see that. If I take my text, "These are not the droids you're looking for," this is the representation: basically an array of floating-point values that will be stored, and then you just look for similar floating-point values.

The other model, and let me show you its output, does sparse embedding. The main model used for that approach is called SPLADE; our version of SPLADE is called ELSER. It's kind of a slightly improved SPLADE, but the concept is the same. What you get is: you take your words, and this is not just TF-IDF, this is a learned representation, where I take all of my tokens and then expand them, saying "for this text, these are all the tokens that I think are relevant, and this number here basically tells me how relevant they are." Again, not all of these make sense intuitively, and you might get some funky results with, for example, foreign languages; this currently only supports English. But these are all the terms that get extracted; normally you get a hundred-something or so. So the idea is that this text is represented by all of these tokens, and the higher the score, the more important the token. You store that behind the scenes; when you search for something, the query generates a similar list, and then you look for the ones that overlap, multiply the scores together, and the highest values find the most relevant documents. This is interesting, or nice, because it's a bit easier to interpret; it's not just a long array of floating-point values. Sometimes the expansions don't make sense, though.
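You can ask an inference endpoint for a representation directly; here against an ELSER endpoint (the endpoint id "my-elser" is assumed, and the terms and weights in the comment are illustrative, not actual ELSER output):

```
POST _inference/sparse_embedding/my-elser
{
  "input": "These are not the droids you're looking for"
}

// Returns a map of expanded terms to learned weights, something like:
// "droid": 2.1, "robot": 1.3, "search": 0.8, ...
```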
The main downside of this, though, is that it gets pretty expensive at query time, because you store a ton of different tokens, the search query generates a similarly long list of terms, and if you have a large enough text body, a query might hit a very large percentage of your entire stored documents, because these are basically just a lot of ORs that you combine, score, and then rank, returning the highest results. So it's an interesting approach. It didn't gain as much traction as dense vector models, but as a first step, and an easy and interpretable one, it can be a good starting point.

Question: is that the entire vocabulary? So, "These are not the droids you're looking for" is basically represented by this embedding here: this entire list of terms with these relevance weights. This is the representation of the string. And then when I search for something, I will generate a similar list, and I basically try to match the two together, looking for whatever has the highest overlap. Makes sense? Can you run a query, can you search for something? Yes, we'll do that in a second. It's a bit indirect, because with the OR expansion it doesn't tell you exactly what matched, but yes.

So I will need to create a new index. This one keeps the configuration from before, but I'm adding semantic_text fields for the sparse model and the dense model. I've created this one, and now I'll just put in the three documents I have in my other index; as you can see here, it says three documents were moved over. So we can then start searching here, and if I look at the index, the first document is still "These are not the droids you're looking for." Just like you don't see the extracted tokens for keyword search, here we also don't show you the dense or sparse vector representation. Those are just stored behind the scenes for querying; there's no real point in retrieving them, because you're not going to do anything with that huge array of floats, and it would just slow down your searches. You can look at the mapping, and you can see I'm basically copying my existing quote field to these other two fields, so that I can also search those.
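The new index with the two semantic fields, as a sketch: the field and endpoint names are assumptions, the analysis settings from earlier are omitted for brevity, and using semantic_text as a copy_to target requires a reasonably recent Elasticsearch version:

```
PUT krenn-workshop-semantic
{
  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "copy_to": ["quote_sparse", "quote_dense"]
      },
      "quote_sparse": { "type": "semantic_text", "inference_id": "my-elser" },
      "quote_dense":  { "type": "semantic_text", "inference_id": "openai-embedding" }
    }
  }
}

POST _reindex
{
  "source": { "index": "krenn-workshop" },
  "dest":   { "index": "krenn-workshop-semantic" }
}
```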
Okay. So if I search for "machine" on my original quote field, will it find anything? No, because the document only had "These are not the droids you're looking for," and this is still a keyword search, just the query from before, to show you that we're not matching anything. It was still pure keyword matching: doesn't work, shouldn't work, and that's exactly the result we want here.

Now, if I use ELSER and search for "machine," then it matches "These are not the droids you're looking for," and you can see this one matches pretty well, at, I don't know, 0.9, but it also has some overlap with "No, I am your father." It's much lower in relevance, but something overlapped there. Only the third document, "Obi-Wan never told you what happened to your father," is not in our result list at all. To see what overlapped, we would need to run the expansion for all the stored strings and for the query, and look for the overlapping terms; that's how the second one got retrieved.

So, is there a threshold? You could define a threshold, but it will depend. Let's see: "These are not the droids you're looking for." If I change the query, I'm not sure this will change anything; the top result is still 10x or so more relevant, and this one will just have a very low match, but the scores totally jump around depending on the terms you look for. So it's a bit hard to define a threshold: here you can see that in my previous query we might have said 0.2 is the cutoff point, but now it's actually 0.4, even though it's not super relevant. So it might be a bit tricky, or you might need a more dynamic threshold depending on how many terms you're searching for and what counts as a relevant result. In the bigger picture, the assumption would be that if you have hundreds of thousands or even millions of documents, you will probably not have the problem that anything only remotely connected actually lands in the top 10 or 20 that you want to retrieve. So for larger, proper data sets this should be less of an issue; with my hello-world example of three documents it can be a bit misleading. But yes, you can have a cutoff point: if you figure out what's a good cutoff for your data set and your queries, you can define one.

No, sorry, I meant: you have three documents, how come it's only showing two? Because the query gets expanded into, I don't know, those 100 tokens or whatever, and for those two documents there is some overlap, but the third one just didn't have any overlap. Okay, we can check that; it's just a bit tricky to figure out the term that has the overlap. So let's take "machine" against "No, I am your father" and turn on the explanation. Is it pretty long output? I was actually hoping it would show me the term that matched. Okay, I see something: there is a term, "puppet," that seems to be the overlap. How much sense that term expansion makes for the stored text and the query text is a bit of a different discussion, but here, with "explain": true, you can actually see how it matched and what happened behind the scenes, if you ever have really hard or weird queries, or something that is hard to explain, to debug. And the third one simply didn't match.

Now, if I take the dense vector model with OpenAI and search for "machine," how many results do you expect to get back? Zero, one, two, three? Yes, three. Why three? Yes, because there's always some match. Let me run the query first. Here "These are not the droids you're looking for" is the first one. I don't think this model is generally great here, because the results are super close: the droids-with-machines match is first, but its score is super close to the second one, "No, I am your father," which feels pretty unrelated, and even "Obi-Wan never told you what happened to your father" comes back with a reasonably close score.
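The semantic queries I just ran look roughly like this (field names as in the mapping sketch above):

```
GET krenn-workshop-semantic/_search
{
  "query": {
    "semantic": {
      "field": "quote_sparse",
      "query": "machine"
    }
  }
}

// Point "field" at "quote_dense" instead to use the dense OpenAI embeddings;
// with dense vectors, every document comes back with some similarity score.
```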
So why do we get all of those? Because if we compute the relevance, a document may be further away, but there is always some angle between the vectors, depending on the similarity calculation you do, so everything is always somehow related. There is no easy way to say something is totally unrelated. That is, by the way, one good thing about keyword search, where it was relatively easy to have a cutoff point for things that are totally not relevant, so you don't confuse your users. Whereas here, if you don't have great matches, what you return isn't random, but it can look very unrelated to your end users, just because it's very hard to draw the line.

Question: is it fair to say, then, that the OpenAI embedding search is worse for this kind of toy example, because the magnitude of difference is small? I'm careful with "worse," because this really is a hello-world example, so don't take it as a quality measurement in any way. The OpenAI model with 128 dimensions has very few dimensions; I think it will probably be cheap, but not necessarily give you great results. Don't use this as a benchmark. I think it's just a good way to see that this is now much harder, because now you need to pick the right machine learning model to actually figure out what is a good match. With keyword-based search it was a bit of a different story: there you need to pay more attention to how you tokenize, whether you have the right language, whether you do stemming or not, but most of that work is, I want to say, almost algorithmic, and then it's very predictable at query time. Whereas with the dense vector representation, you really need to evaluate, for the queries you run and the data you have, whether the results are relevant and whether it's an improvement or not. It's very easy to get going and just throw a dense vector model at the problem, and you will always match something, which might be an advantage over lexical search, where sometimes the problem is that nothing comes back and you would want at least some results. Here the results might just be unrelated. So that can be tricky.

The "you want to have some results" point is, by the way, a funny story that a European e-commerce store once told me. They said they accidentally deleted, I think, two-thirds of the data they had for the products you could buy. And I asked them: okay, so how much revenue did you lose because of that? And they said basically nothing, because as long as you showed somewhat relevant results quickly enough, people would still buy. Only having no results is probably the worst. So for an e-commerce store, you might want to show things a bit further out, because people might still buy them. But, and I'm coming to you in a moment, it really depends on your use case. E-commerce is one extreme, where you always want to show something for people to buy. If you have a database of legal cases or something like that, you probably don't want that approach, because it will go horribly wrong. So it is very domain-specific. That's, I think, also the good thing about search: it keeps a lot of people employed, because it's not an easy problem.
It's almost job security, because so much depends on: this is the data that you have, these are the queries that people run, this is the expectation of what will happen, and this is the right behavior for this domain. So there's no easy right or wrong with a checkbox. And the other thing is, if you tune it, you might make it better for one case but worse for twenty others. That's why a robust evaluation set is normally very important, though very rare; a lot of people YOLO it, and you will see that in the results. And for the e-commerce store, it probably works well enough. Sorry, you had a question.

Question: can I limit the semantic enrichment to a subset of my index, based on properties of the document? So if I have a very large shared index with a lot of customers and I want to enable AI for a subset of the index, can I say: hey, only do the semantic enrichment if the document has this property, where maybe it's an AI customer? Well, the way we would do it in our product is that you would probably have two different indices with different mappings. "Yeah, but then it's not so fun when the customer upgrades and I have to migrate them." My colleague: if you have an index in Elasticsearch, you can think of it almost like a sparse table, right? There's no penalty for having a field that is not populated. So either in your application or in an ingest processor, you could have an if statement. "Yeah, that's how we do it now." The point is the data structure: the structure we build in the background for dense vectors is called HNSW, and either we build that data structure for a value or we don't. So if you had, you know, 10 billion entries in your index that was set up for vectors, and you just don't populate the field that either holds the dense vector or triggers the inference to create the dense vector, then that part of the index is just a bunch of pointers, none of which lead into the HNSW graph, and those documents won't show up in the vector search results. The penalty to you is nothing. But you are going to have to manage what does or does not create the vector. You could do that in an ingest processor, by using a copy to keep two copies of the text, one meant for non-vector indexing and one meant for actual vector indexing, managed with that tricky, complex AI technology called if-then-else, somewhere inside of your ingest pipeline. And it would work just fine.
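A minimal sketch of that if-then-else in an ingest pipeline; the pipeline, field, and flag names are all made up for illustration:

```
PUT _ingest/pipeline/conditional-embedding
{
  "processors": [
    {
      "set": {
        "if": "ctx.ai_enabled == true",
        "field": "quote_semantic",
        "copy_from": "quote"
      }
    }
  ]
}

// Documents without the hypothetical ai_enabled flag never populate
// quote_semantic, so no embedding is generated or stored for them, and
// they simply stay out of vector search results.
```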
One more question: "When we did HNSW with Elasticsearch, we found it was extremely slow at write time, and the community suggested we freeze our index: build it once, then freeze writes, because otherwise it puts a ton of load on the cluster." Yes. What we found is that some of the default settings that have been in Elasticsearch for ten years, such as the merge scheduler, were tuned for keyword search rather than for high-update workloads on HNSW, and we have published some suggestions there. They take a bit of parameter-sweep tuning to find what's right for your IOPS and your actual update workload. Sometimes it's about the merge scheduler and not doing an inefficient HNSW build when it isn't important for your use case. The other thing we'd say: sometimes friends don't let friends run Elasticsearch 8.11. Upgrade, upgrade, upgrade; a lot of optimization work has gone in since, because this should be simple. The reason it's hard is merging. Elasticsearch has an immutable segment structure, and HNSW graphs cannot easily be merged; you basically have to rebuild them. One trick, and I forget which version it landed in, Dave, I think it was even before 8.11: on a merge, we take the largest segment with non-deleted documents and plop the new documents on top of it, rather than rebuilding from scratch out of two HNSW structures. There's another optimization in 9.0 now that makes this a lot faster. So it really depends on the version you have, and there are a couple of tricks you can play, but yes, that is one of the downsides of how immutable segments and HNSW interact: you can't just merge the graphs together like other data structures; you either rebuild, or take the largest one and fold the others in. And if your index is effectively write-once, merging it down after the load helps too; see the sketch below.
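A minimal sketch of that write-once case: after the bulk load is finished, a force merge collapses the index to a single segment, so searches traverse one HNSW graph instead of many small ones. Force merge is expensive and intended for indices that no longer receive writes, so don't run it against an index still taking updates.

```json
POST quotes/_forcemerge?max_num_segments=1   // only for indices that are done being written to
```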
Next question, about doing generation before retrieval: yes, I feel like RAG has been heavily abused as a term. The mental model started off as: you do retrieval and then you do the generation. But you can also use generation earlier, for query rewriting and query expansion. My favorite example: you're looking for a recipe. You don't need the LLM to regenerate the recipe; you just want to find it. But maybe you forgot what the dish is called. Then you can use the LLM to tell you what you're looking for: you say, I'm looking for this Italian dish with layers of pasta and some meat in between, and the LLM says, you're looking for lasagna. You've basically done the generation first, or a query rewriting, then you search and get the results; that's a very explicit example, and yours will probably look different, and smarter, than mine. Query rewriting is one such thing. There's also the concept of HyDE, hypothetical document embeddings: your documents and your queries often look very different, so you use an LLM to generate something from the query that looks more like the documents you have, and then you match against that, because the generated text and the documents are more similar in structure. So there are all kinds of interesting things you can do. Like I said earlier, "it depends" is becoming a bigger and bigger factor. Your use case might also be multi-step retrieval, where the LLM first figures out what to search for. I know the example from an e-commerce store: "I'm going to a 1920s theme party, give me some suggestions." The LLM has to work out what that means, rewrite the query, retrieve the right items, and only then give you proper suggestions. It's not just running a query anymore. Another question, essentially whether more dimensions mean better results: definitely not necessarily. It's an interesting question; it feels almost like a blast from the past. I remember two or three years ago there was this big debate about how many dimensions each data store supports and how many dimensions you should have. At first it looked like more dimensions is always better, but then it turned out more dimensions are very expensive. So it really depends on the model and what you're trying to solve: if you can get away with fewer dimensions, it's potentially much cheaper and faster. There's no hard rule. Maybe the model with more dimensions can express more, simply because it carries more information, and that will come in handy; or maybe it's not necessary for your specific use case, and you're just wasting resources. I don't think there's an easy answer like "for this use case you need at least 4,000 dimensions." It depends on the model how many dimensions it outputs, and then maybe you have some quantization in the background that reduces either the number of dimensions or the fidelity per dimension. There are a lot of trade-offs in that performance consideration, but it mostly comes down to how well the model works for what you're trying to do. And on how to evaluate all this: that is one area where, historically, you would have a golden data set. You know what people search for, you have human experts rate the results for those queries, and then you run different configurations against it and see whether relevance gets better or worse. LLMs now open a new opportunity: you might keep human experts in the loop to help them out a bit, but LLMs can actually be decent at evaluating results. Almost nobody has the golden data set to test against. So you can either look at the behavior of your end users and infer something from that, or have an LLM evaluate the results, or have a human together with an LLM. You have various tools, and again, it depends; there is no single right answer. Maybe you can get away with something simple. The classic approach is to look at the click stream of how your users behave: if they clicked on the first to third result and stayed on the page rather than bouncing back and clicking something else, the result was probably good; if they click nothing and just leave, that might be very bad; if they go to the second or third page, that's also not great. So there are quality signals you can infer from behavior, or you go deeper into explicit quality evaluation. You can make this anything from relatively simple to pretty complicated.
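For the golden-data-set style, Elasticsearch ships a ranking evaluation API where you pair queries with human relevance ratings and get a metric back. A minimal sketch against the demo data; the document IDs and ratings here are made up:

```json
GET quotes/_rank_eval
{
  "requests": [
    {
      "id": "droid_query",
      "request": { "query": { "match": { "quote": "droid" } } },
      "ratings": [
        { "_index": "quotes", "_id": "1", "rating": 3 },   // rated relevant by a human
        { "_index": "quotes", "_id": "4", "rating": 0 }    // rated irrelevant
      ]
    }
  ],
  "metric": { "precision": { "k": 5, "relevant_rating_threshold": 2 } }
}
```

You can swap the metric for mean_reciprocal_rank or dcg; the hard part is producing honest ratings, not running the API.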
What else? Okay: obviously, if I search for that with query expansion, it will find my "father" example. And this one here will still match my droids, much like the opening example. One thing that is also happening behind the scenes: this is a very long segment, a lot of information with different speakers, but we have created multiple chunks behind the scenes. If I search for "looking for murder in the Skywalker saga", that works pretty well here. It finds the document I retrieved, and it can also highlight: here I say, show me the fragment that matched best. If I search for "murder", that exact term isn't in the text, but in the highlighted segment it found "kill", which the semantic expansion matched. So here I've broken my long text field into multiple chunks, and there are multiple strategies for that: by page, by paragraph, by sentence, overlapping or not overlapping. Which works best depends on how you want to retrieve, but you want to reduce the context per element that you're matching, because there's only so much context a dense vector representation can hold. So you want to chunk, especially if you have something like a full book: break it up at least into pages, find the relevant part where the match is, and then you can link back to it. The point of this query is also to show that I didn't define any chunks. I didn't say, send this dense vector representation there and reassemble it when it comes back; this is all happening behind the scenes, just to make it easier. The entire behavior is still very similar to the keyword matching, even though there's a lot more magic happening behind the scenes.
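In recent Elasticsearch versions, this chunking-plus-inference magic is what the semantic_text field type provides; a minimal sketch, with an assumed index name, using the default inference endpoint (chunking strategy and inference model are configurable, and details vary by version):

```json
PUT quotes_semantic
{
  "mappings": {
    "properties": {
      "quote": { "type": "semantic_text" }   // chunking and embedding happen at index time
    }
  }
}

GET quotes_semantic/_search
{
  "query": {
    "semantic": {
      "field": "quote",
      "query": "looking for murder in the Skywalker saga"
    }
  }
}
```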
Okay. How does everybody feel about longer JSON queries? We'll see alternatives, and maybe we can make this a bit simpler again. But let me show you one more way of looking at it. We call them retrievers; they're a more powerful mechanism for combining different types of searches. And speaking of combining searches, let me get to my slides. This is my little interactive map of what you do when you do retrieval. We started with lexical keyword search: we run the match query and match strings. That, often combined with rank features, is what we call full text search. A rank feature could be a specific signal you extract, or anything else that influences the ranking: the margin on a product, how many people bought something, what the rating is. There are many signals you can include beyond the text match itself. Off to the side, you might have a Boolean filter: a hard include or exclude on certain attributes. The filter does not contribute to the score; it's black and white, included or excluded, whereas the match side calculates a score for how well you match. That's the algorithmic side. Then there's the machine-learning, learned side, semantic search, where you have a model behind the scenes, split into dense vector embeddings and sparse vector embeddings, that is, vector search and learned sparse retrieval; I think those are the two common terms. The interesting thing is that everything on the lexical side, and learned sparse retrieval too, uses a sparse representation in the background; only the dense embeddings use a dense vector representation. And when you combine any grouping of these in one search, that's what we call hybrid search, even though there can be a big discussion about what exactly counts as hybrid. I'll stick with the definition that as soon as you combine more than one type of search, sparse plus dense, dense plus keyword, or even two dense vector searches, it's hybrid search, because you have multiple approaches. Then you can boost them together, or do reranking, which is becoming more and more popular. One thing we lean into heavily is RRF, reciprocal rank fusion, which doesn't rely on the scores but on the position from each search mechanism. It basically says: lexical search had this document at position four, dense vector search had it at position two, and it blends those into an overall position, rather than comparing the raw scores, which might be on totally different scales. So that is the information retrieval map overall. We didn't do much with filters, but I think they're intuitively clear: I'm only interested in users with this ID, or a geo-based filter like only things within 10 km, or only products that came out in the last year. A hard yes or no. All the others give you a relevance value, and you can blend those together for the overall result. "Can you give an example of the signal?" Yes. We have our own data structure for these rank features. It could be, say, the rating of a book: you combine the keyword match, you search for murder mysteries, with how well the books are rated. Or it could be your margin on the product, or the stock you have available, where you'd want to favor the product you have more of in stock. Or even a simple clickstream signal: what have people clicked before. There are a lot of different signals you could fold into the search; a combined sketch is below.
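Putting those three pieces of the map into one query, a sketch with hypothetical index and field names (rating would be mapped as a rank_feature field): the match scores, the rank_feature nudges the score, and the filter is a hard yes/no that contributes nothing to the score.

```json
GET books/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "murder mystery" } }       // scored text match
      ],
      "should": [
        { "rank_feature": { "field": "rating" } }        // signal that boosts the score
      ],
      "filter": [
        { "range": { "published": { "gte": "now-1y" } } } // hard include/exclude, no score
      ]
    }
  }
}
```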
Any other questions, or is everybody good for now? Yes? "When you do this scoring, you said you do a blended approach, right?" RRF does a blended approach, yes. "Those scores, are they standardized, or do they differ based on the type of search? The relevancy scores you're combining?" You would have to normalize them. Depending on the comparison you use for dense vectors, the score might be between zero and one, but you saw that for keyword search, depending on how many words I searched for, it can be a much higher value; there's no real ceiling. You could also add a boost and say this field is twenty times more important than that one, so there's no real maximum. You could normalize the scores, basically take the highest value in each subquery as 100% and scale everything else down by that factor, and then combine them; maybe that works well. RRF is a very simple paper, I think about two pages, and it really just takes the positions. I think it's 1 divided by (60 plus the position), where 60 is a constant they figured out made sense; you add those values up per document across the different rankings, and that total gives you the overall position. It doesn't look at the scores at all; it blends the positions, how they interleave, to decide what comes first or second. Yes? "If I just need vector search, why should I use Elastic over pgvector or something like Qdrant? I'm sure there are trade-offs. If my data is already in the database, any change has to come over via CDC, change data capture, so that's one extra hop in ingestion, whereas with pgvector it's right there. What have you seen in production? Which systems choose pgvector or Qdrant versus Elasticsearch?" Pgvector will always be around, because if you're already using Postgres, it's very easy to add. The question is whether it has all the features you need. For example, Postgres doesn't do full BM25; it has some matching, but not the full BM25 algorithm, because, from what I remember, it doesn't keep all the statistics. Then there's scaling out Postgres, which can be a problem, and the breadth of search features in general. If you only need vector search, my, or our, default question back is: do you really only need vector search? Maybe for your use case, but many use cases need hybrid search. One area where vector search does poorly, for example, is when somebody searches for a brand name: there's no good representation of that specific brand in most models, and it's very hard to beat keyword search there. Your user will also be very angry when they know the word is somewhere in your documents and you don't give the result back. So there are many scenarios where you probably want hybrid search. A lot of people started two years ago with just vector search, but I feel the overall trend is moving toward hybrid: you probably need some keyword search, combined with a model for the added semantic benefit. It can also depend on the types of queries your users run. Single-word queries, like in my examples, are often not ideal for vector search, because any machine learning model lives off extra context. I've seen people build searches where one- or two-word queries go to keyword search, and longer queries fall over to vector search. So it depends a bit on the context what works; a minimal hybrid sketch follows.
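A minimal hybrid sketch using retrievers with RRF, assuming a dense vector field and a deployed embedding model (the model ID is a placeholder): the keyword ranking and the vector ranking are fused by position, as described above.

```json
GET quotes/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        { "standard": { "query": { "match": { "quote": "droid robot" } } } },
        {
          "knn": {
            "field": "quote_vector",                    // assumed dense_vector field
            "query_vector_builder": {
              "text_embedding": {
                "model_id": "my-embedding-model",       // placeholder model ID
                "model_text": "droid robot"
              }
            },
            "k": 10,
            "num_candidates": 50
          }
        }
      ]
    }
  }
}
```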
If you really only need vector search, and pgvector is small enough to do all of it, and Postgres is your primary data store, then that's probably where you'll do well. But there are plenty of scenarios where not all of those boxes are ticked. One last question. "So it's one data set, basically, with thousands of files that are all chunked together, and one change would invalidate all of them?" Ah. I think the way we might solve it: create a hash of the file and use that as the document ID, and use only the create operation, which rejects any duplicate writes. Then you would at least not ingest and re-create the vector representation again. You will still send the document over, and it needs to get rejected, yes. But if you use the hash as the doc ID and set the operation to create, not update or upsert, the duplicate is simply rejected and you only write it once. You could also keep an outside cache of all the hashes you've already seen and deduplicate there, but using the hash as the ID and just writing with create is, I think, the most intuitive and native approach we could offer; there's a sketch below. I think there were some other questions. Ah, yes: from what I remember, the default Postgres full-text search does not do full BM25; it doesn't keep all the statistics, I think.
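Coming back to the dedup answer, a sketch with a made-up content hash as the document ID: the _create endpoint fails with a version conflict (HTTP 409) if a document with that ID already exists, so the expensive inference never runs twice for the same content.

```json
PUT files/_create/9f86d081884c7d659a2feaa0c55ad015
{
  "content": "the chunk text",     // placeholder document body
  "source": "report.pdf"
}
```

Equivalently, you can index with PUT files/_doc/<hash>?op_type=create; both reject duplicates instead of updating them.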
Any other questions? Ah, yes, Joe, please go ahead. Show it now? Sure, how much do you want to see? Before we dive in, for everybody else: rescoring means, say we have a million documents, we retrieve the top thousand candidates with a cheaper method, and then we run a more expensive but higher-quality scoring on just those thousand to get the ultimate result list. The rescoring algorithm would be too expensive to run across the full million documents; that's why you use the two-step process. In Elasticsearch you can now do rescoring, because it's becoming more and more popular, and we have a reranking model built in by default. Let me pull that up; somehow my keyboard binding is broken, which is very annoying. Currently it's the version-one reranking, but we have a built-in reranking model as well: it's one of the inference tasks we can run. Another task is, for example, dense text embedding; now there's a rerank task you can also call. There's learning-to-rank too, but that's not the one I want; this is it, our reranking model. Dave, do you know off the top of your head where the right docs are? Yeah, retrievers could find them. So this is a simple example: we have a standard match, which will be very cheap, and then we have the text_similarity_reranker, which uses our Elastic reranker, falling back to that model behind the scenes. You have the text reranker retriever on the outside; inside it, the RRF; and inside that, lexical and kNN as peers. It works from the inside out: each of those retrieval methodologies finds its best results, like a Venn diagram, then the full text of those results runs through the reranker, almost like a mini LLM judging which outcomes are good. The cool thing about the reranker is that you can run it on structured lexical retrievals too; you don't have to run it on a vector search. You can run it on anything. So if you don't want to pay for embedding everything, or the text is too small for a vector model to really lock onto, you can still run the reranker. When people run it on actual customer data sets, they say things like, "Our evaluation scores bumped by ten points, basically for free. It feels like cheating." When you run against, say, a Gemini API and wonder why it's ten points better than the Amazon one, it's because they threw in their reranker and didn't tell you. There's a lot of black-box stuff out there that we're exposing, so don't be scared that we're telling you how it works inside. And just to give one more blending example: this query uses a classic keyword match as one retriever and normalizes the score, with min-max normalization, which came up in the earlier discussion, weighted with 2; then the OpenAI embeddings, again normalized, with a weight of 1.5. They get blended together, and it won't surprise you that "These are not the droids you're looking for" is by far the highest-ranking document if you search for droid and robot. You had a question? Yes: in the first stage we retrieve X candidates, and you can define that number; then we run the reranking on top of those. So it's a trade-off: the larger the window, the slower it is, but the potentially higher quality your overall results, because more of your data set gets reranked at the end of the day. Is that what you meant, or did you want something per node? No, I don't think that's how we do it. What you can control is the window of what you retrieve, plus a minimum score, a cutoff point to throw out what isn't relevant anyway, to keep it a bit cheaper. You can nest retrievers and do the RRF I explained, blending results together. All of that is easy; a sketch of the whole stack is below.
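A sketch of that stack, working from the inside out: lexical and kNN as peers inside an RRF, wrapped by the text_similarity_reranker. The inference_id should name your reranker endpoint (recent versions ship a preconfigured Elastic reranker; check the inference endpoints on your deployment), and the model ID and field names are placeholders.

```json
GET quotes/_search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "rrf": {
          "retrievers": [
            { "standard": { "query": { "match": { "quote": "droid" } } } },
            {
              "knn": {
                "field": "quote_vector",
                "query_vector_builder": {
                  "text_embedding": {
                    "model_id": "my-embedding-model",   // placeholder model ID
                    "model_text": "droid"
                  }
                },
                "k": 10,
                "num_candidates": 50
              }
            }
          ]
        }
      },
      "field": "quote",                                 // text the reranker reads
      "inference_id": ".rerank-v1-elasticsearch",       // assumed built-in endpoint name
      "rank_window_size": 100,                          // candidates passed to the reranker
      "min_score": 0.4                                  // cutoff to keep it cheaper
    }
  }
}
```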
One final note: if you got tired of all the JSON, we have a new way of writing these queries as well. Here we have a match operator, like the one we've used all along, that you can use on a keyword field, but also on a dense or sparse vector embedding, and you just run a query on it and get the scores back. It's a query language that feels a bit more like a shell pipeline. My screen size is a bit off here, but you get the quote we retrieved, the speaker, and the score; maybe I'll take out the speaker to make this slightly more readable... now it broke. Oh well. The point is: you can write queries with a fraction of the JSON. It also supports funny things like joins. It doesn't have every single search feature yet, but it's getting pretty close. This is more of a closing note: if you're tired of JSON queries, you don't have to write them anymore. It's nice for observability use cases, with aggregations and things like that, but it's now also very helpful for full-text search when you want to write different queries. I think the main downside is that support in the client languages, Java and so on, is not very strong yet: you basically give it strings and get a result back that you need to parse out again, so it's not as strongly typed on the client side yet. Any final questions? Yes. We can make your life easier: it's all behind one single query endpoint. You can use the two different retrieval methods and still rerank, all from one single query, so you don't have to do it yourself. It's not that we want to stop you, but you don't have to, and it's only one query and one round trip to the server. You still pay for the retrieval itself: if you have two parts to the query, both are retrieved, and that may be the main cost, and then you have the reranking, so you don't get out of those costs completely. But you do it in one request, we take care of all of that for you, and you get one result set back rather than shipping more data to your application. So it's potentially a little less work on the Elasticsearch side, but mostly less work on your application side. Perfect. Thank you so much; I hope everybody learned something. I will leave the instance running for two days or so, so you can still play around with the queries if you feel like it. Thanks a lot for joining. If you want stickers, we have stickers up there, and we have a booth the next few days; come by and get some proper swag from us there. Thank you, and see you around.