HybridRAG: A Fusion of Graph and Vector Retrieval - Mitesh Patel, NVIDIA
Channel: aiDotEngineer
Published at: 2025-07-22
YouTube video id: -tgQa8Fzf80
Source: https://www.youtube.com/watch?v=-tgQa8Fzf80
[Music] To quickly introduce myself: my name is Mitesh, and I lead the developer advocate team at NVIDIA. The goal of my team is to create technical workflows and notebooks for different applications and release that codebase on GitHub, so that developers in general, which is me and you, all of us together, can harness that knowledge and take it further for the application or use case you're working on. In today's talk, I'm going to cover a project we did with one of our partners and some of my colleagues at NVIDIA: how you can create a graph RAG system, what its advantages are, and how adding a hybrid nature to it helps. I won't be able to give you the 100 ft view where I dive with you into the codebase, but there is a GitHub link at the end of this talk, and all the notebooks I'm going to talk about are available for you to take home. What I'll give you is the 10,000 ft view of how to build your own graph RAG system. A quick refresher: what is a knowledge graph, and why is it important? It is a network that represents relationships between different entities, and those entities can be anything: people, places, concepts, events. A simple example is me being here. What is my relationship to the AI Engineer World's Fair conference? I'm a speaker at this conference. What is my relationship to anyone attending here? You attended my session. This edge, the relationship between two entities, becomes very important, and it is something that only graph-based networks, or knowledge graphs, can exploit.
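The speaker-and-conference example above can be sketched as a tiny knowledge graph of (subject, relation, object) triplets. This is a minimal illustration in plain Python; the entity and relation names are taken from the talk's example, not from any NVIDIA codebase.

```python
# A tiny knowledge graph stored as (subject, relation, object) triplets.
# Names are illustrative, based on the example in the talk.
triplets = [
    ("Mitesh", "is_speaker_at", "AI Engineer World's Fair"),
    ("Attendee", "attended", "Mitesh's session"),
    ("Mitesh", "works_at", "NVIDIA"),
]

def neighbors(entity, triplets):
    """Return every (relation, object) pair for edges leaving `entity`.

    This is the edge information a knowledge graph exploits that a
    pure vector store does not.
    """
    return [(rel, obj) for subj, rel, obj in triplets if subj == entity]

print(neighbors("Mitesh", triplets))
```

Each triplet is one edge in the graph; retrieval later walks these edges to collect context.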
And that is the reason there is a lot of active research happening in this domain: how you can harness knowledge graphs and put them into a RAG-based system. The goal is to create triplets, which define the relationships between entities, because that is what a graph-based system or knowledge graph is really good at exploiting, and that is what is unique about it. If you think about why knowledge graphs can work better than a semantic RAG system: they capture the information between entities in much more detail. Those connections can provide a very comprehensive view of the knowledge you are creating in your RAG system, and that becomes very important to exploit when you are retrieving that information and converting it into a response for the user asking the question. A knowledge graph also has the ability to organize your data from multiple sources, though that is a given no matter what kind of RAG system you're building. So how do we create a graph RAG or hybrid system? Here is the high-level diagram of what it entails. I broke it down into four components. The very first thing is your data: you need to process your data. The better you process your data, the better the knowledge graph; the better the knowledge graph, the better the retrieval. So the components are data, data processing, graph creation or semantic embedding vector database creation, and the last step is, of course, inferencing, when you're asking questions of your RAG pipeline. At a higher level this can be broken down into two big pieces: offline and online.
All your data processing work, which is a one-time process, is offline. Once you have created your knowledge graph, which is your triplets (entity 1, relationship, entity 2), or your semantic vector database, then it's all about querying it and converting that information into a response that is readable to the user. It cannot be something like "here are the three relationships" where we as users have to go figure out what that exactly means. The top part of the flow diagram is where you build your semantic vector database: you pick your documents, convert them into vector embeddings, and store them in a vector database. The part below is how you create your knowledge graph, and there are many more steps to follow and much more care to take when creating it. So, diving into the first step, creating your knowledge graph: how can you create those triplets out of documents that are not that structured? Creating triplets that expose the information between two entities, and picking those entities so that the information becomes helpful, is very important. Here's a simple example. This document is ExxonMobil's quarterly results, and we tried to create the knowledge graph using an LLM. If you look at the first line: ExxonMobil, which is a company, is the first entity; "cut" is the relationship between ExxonMobil and "spending on oil and gas exploration and activity", which is the second entity. This is how the relationship needs to be extracted.
Now the question that comes to mind is: that sounds very difficult to do. And it is, and that is exactly why we need to use an LLM to extract this information and structure it for us so that we can save it in a triplet format. How can we do that? Prompt engineering, but we need to be much more precise about it. Based on the use case you are working on, you can define your ontology, put it in your prompt, and then ask the LLM to extract the ontology-specific information from the documents and structure it so that it can be stored as triplets. This step is very important. You might spend a lot of time here making sure your prompt is doing the right thing and creating the right ontology for you. If your ontology is not right, if your triplets are not right, if they are noisy, your retrieval will be noisy. This is where you will go back and forth figuring out how to get a better ontology. My take is this is where you'll spend 80% of your time, iterating to make it better over time. The next step for a hybrid RAG system is to create the semantic vector database, and that is reasonably straightforward, or at least well studied. You pick your document (this is the first page of the "Attention Is All You Need" research paper), break it into chunks of a given chunk size, and you have another factor called overlap. Chunk sizes are important because the semantic vector pipeline will pick up each chunk, convert it into an embedding vector using the embedding model, and store it in the vector database.
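The ontology-in-the-prompt idea described above can be sketched as follows. This is a hedged illustration: the ontology contents, prompt wording, and pipe-delimited output format are all assumptions for demonstration, not NVIDIA's actual prompt; the LLM call itself is mocked with a hard-coded response string.

```python
# Hedged sketch of ontology-guided triplet extraction. The ontology and
# output format below are illustrative assumptions, not a real system prompt.
ONTOLOGY = {
    "entity_types": ["Company", "Financial_Metric", "Activity"],
    "relations": ["cut", "increased", "reported"],
}

def build_extraction_prompt(document: str) -> str:
    """Embed the ontology in the prompt so the LLM only emits triplets
    that conform to it (the iterative tuning step the talk describes)."""
    return (
        "Extract (entity, relation, entity) triplets from the text.\n"
        f"Allowed entity types: {', '.join(ONTOLOGY['entity_types'])}\n"
        f"Allowed relations: {', '.join(ONTOLOGY['relations'])}\n"
        "Output one triplet per line as: subject | relation | object\n\n"
        f"Text: {document}"
    )

def parse_triplets(llm_output: str):
    """Parse 'subject | relation | object' lines, dropping malformed lines
    and relations outside the ontology: noisy triplets mean noisy retrieval."""
    triplets = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[1] in ONTOLOGY["relations"]:
            triplets.append(tuple(parts))
    return triplets

# What a well-behaved LLM response might look like (mocked, no API call).
fake_response = "ExxonMobil | cut | spending on oil and gas exploration\nbad line"
print(parse_triplets(fake_response))
```

In a real pipeline, `build_extraction_prompt` would be sent to the LLM and `parse_triplets` run over its response; the validation step against the ontology is what you iterate on during that "80% of the time" tuning loop.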
If you don't have an overlap, the context between the previous and the next chunk is lost, if there is any relationship between them. So you try to be smart about how much overlap you need between the previous chunk and the next chunk, and about what chunk size to use when chunking your documents into different paragraphs. That is where the advantage of graph RAG comes into play: the important information, the relationships between different entities, is not exploited by your semantic vector database, but it is exploited really well when you use a knowledge graph based system. Once you have created this knowledge graph, what is the next step? Now comes the retrieval piece: you ask a question like "what does ExxonMobil's cut look like this quarter?", and the knowledge graph will help you retrieve those nodes, those entities, and the relationships between them. But if you do a very flat retrieval, a single hop, you are missing the most important thing that a graph allows you: exploration through multiple nodes. I cannot stress how important that becomes. So think of different strategies. Again, you will spend a lot of time optimizing this: whether you should look at a single hop or a double hop, how deep you want to go, so that the relationships from your first node to the second node, and your second node to the third node, are exploited well. The deeper you go, the better context you'll get. But there's a disadvantage: the deeper you go, the more time you spend retrieving that information. So latency becomes a factor as well, especially when you're working in a production environment.
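The chunk-size-and-overlap scheme described above can be sketched in a few lines. This is a minimal character-based version; the sizes are illustrative defaults, and real pipelines often chunk on tokens or sentences instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50):
    """Split `text` into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so context that spans a chunk
    boundary is not lost. Sizes are illustrative; tune them for your data."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(500))  # stand-in for a document page
chunks = chunk_text(doc, chunk_size=200, overlap=50)
# Each chunk would then go through an embedding model into the vector
# database; that step is backend-specific and omitted here.
print(len(chunks))
```

The trade-off the talk mentions shows up directly here: a bigger overlap preserves more cross-chunk context but produces more chunks to embed, store, and search.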
So there is a sweet spot you'll have to hit between how deep you want to go, how many hops into your graph, versus what latency you can survive. That becomes very important, and some of those searches can be accelerated. We created a library called cuGraph, which is available in, or integrated with, a lot of libraries out there like NetworkX. That acceleration gives you the flexibility to go deeper into your graph, through multiple hops, while at the same time reducing the latency, so the performance of your graph improves a lot. This is where the retrieval piece comes into play: you can define different strategies so that when you're querying your data, you get better responses. The other important piece, and I personally worked on this piece so I could talk at length on it, but I'm going to stay at a very high level, is evaluating the performance. There are multiple factors you can evaluate: faithfulness, answer relevancy, precision, recall, and if you use an LLM judge, helpfulness, correctness, coherence, complexity, and verbosity. All these factors become very important. There is a pip-installable library called Ragas that is meant to evaluate your RAG workflow end to end. Anyone here who has used Ragas for evaluating a graph RAG? All right, a few of you. Thank you. It is an amazing library that you can use to evaluate your RAG pipeline end to end, because it evaluates the response, it evaluates the retrieval, and it evaluates the query.
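The single-hop versus multi-hop trade-off above can be illustrated with a small breadth-first walk over a triplet store. This sketch uses plain Python dictionaries rather than NetworkX or cuGraph so that it stays self-contained; the triplets are invented examples in the spirit of the talk's ExxonMobil case.

```python
from collections import deque

# Illustrative triplet store; real graphs would have millions of edges.
triplets = [
    ("ExxonMobil", "cut", "spending on oil and gas exploration"),
    ("spending on oil and gas exploration", "affects", "quarterly results"),
    ("quarterly results", "reported_in", "Q2 filing"),
]

def multi_hop(start: str, max_hops: int):
    """Breadth-first walk up to `max_hops` edges out from `start`, returning
    the triplets visited. More hops yield more context but cost more time,
    which is the depth-versus-latency trade-off described in the talk."""
    adjacency = {}
    for s, r, o in triplets:
        adjacency.setdefault(s, []).append((r, o))
    seen, queue, context = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # depth budget exhausted: stop expanding this branch
        for rel, obj in adjacency.get(node, []):
            context.append((node, rel, obj))
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return context

print(len(multi_hop("ExxonMobil", 1)))  # flat single-hop retrieval
print(len(multi_hop("ExxonMobil", 2)))  # two hops pull in more context
```

With NetworkX the same traversal could use `nx.bfs_edges` with a `depth_limit`, and that is where a GPU backend like cuGraph can cut the latency on large graphs.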
So it will evaluate your pipeline end to end, which becomes very handy when you're trying to test whether your retrieval is doing the right thing, or whether the LLM is interpreting the questions you're asking in the right way. The Ragas pipeline will evaluate all those pieces and give you an eventual score. It is a pip-installable library. Under the hood, and no surprises there, Ragas uses an LLM. By default it is integrated with GPT, but it gives you the flexibility to bring in your own model, wire it up with your API, and use that LLM to score the evaluation parameters Ragas offers. So I would say it's quite comprehensive, and it's really good in terms of giving you that flexibility. The other path is using a model that is meant specifically to evaluate the responses coming out of an LLM. That is where the Nemotron 340B reward model that we released comes in. At the time, it was a really good reward model. It's a 340-billion-parameter model, so reasonably big, but it is a reward model: it will evaluate the response of another LLM and judge how the responses look on those five parameters. It is meant to judge other LLMs; that is how it was trained. Moving further, I would like to use this analogy: creating a graph RAG system, which is 80% of the job, will take you 20% of your time. But making it better, the last 20%, will take 80% of your time (the 80/20 rule), because now you are in the process of optimizing it further to make sure it works
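Ragas and reward models compute their scores with LLM judges, but two of the retrieval metrics the talk names, precision and recall, have a simple arithmetic core. The sketch below computes them directly over retrieved chunk IDs; this is an illustration of the underlying idea, not Ragas internals, and the chunk names are made up.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant:
    how much noise the retriever brought back."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of the relevant chunks the retriever managed to surface:
    how much needed context is missing."""
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

# Hypothetical retrieval result versus a ground-truth relevance set.
retrieved = ["chunk_a", "chunk_b", "chunk_c"]
relevant = {"chunk_a", "chunk_d"}
print(context_precision(retrieved, relevant))  # 1 of 3 retrieved is relevant
print(context_recall(retrieved, relevant))     # 1 of 2 relevant was retrieved
```

Judge-based metrics like faithfulness or answer relevancy replace the set membership test with an LLM's verdict, but the aggregation is the same shape.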
well enough for the use case and the application that you're working on. There are some strategies there which I would like to walk you through. One, as I said before, and I couldn't stress it enough: the way you create your knowledge graph out of your unstructured data is very important. The better your knowledge graph, the better results you're going to get. Something we did as experimentation in the use case we were exploring with one of our partners was: can we fine-tune an LLM to improve the quality of the triplets we are creating, and does that improve results? Can we do a better job at data processing, like using regexes to remove apostrophes, brackets, and characters that don't matter; if we remove them, does that give better results? These are small things you can think about, but they improve the performance of your overall system. That is what I mean by 80% of your time: small, nitty-gritty knobs that you fine-tune slowly and steadily to make sure your performance gets better and better. I would like to share a few strategies that led us to better results. The very first thing is regexes, or just cleaning your data. We removed apostrophes and other characters that are not that important for triplet generation, and that led to better results. We then implemented another strategy of shortening overly long outputs rather than dropping them, and that got us better results. We also fine-tuned the Llama model, and that got us better results. If you look at the last three columns, you'll see that using Llama 3.3 as-is, we got 71% accuracy. This was tested on 100 documents to see how it performs.
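The regex cleaning step mentioned above can be sketched as a small preprocessing function. The exact character set to strip is an illustrative assumption; the talk only names apostrophes and brackets, and you would tune this list against your own corpus.

```python
import re

def clean_for_triplets(text: str) -> str:
    """Strip characters that tend to add noise to triplet extraction
    (apostrophes and brackets, per the talk), then collapse whitespace.
    The exact character set is an assumption; tune it for your corpus."""
    text = re.sub(r"['\u2019\[\]\(\)\{\}]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_for_triplets("ExxonMobil's  (quarterly) [results]"))
```

Running this over documents before triplet extraction is the kind of small, cheap knob the talk describes: it changes nothing about the model, yet it measurably affects triplet quality.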
As we introduced LoRA and fine-tuned the Llama 3.1 model, our accuracy went up from 71% to 87%. Then we made those small tweaks, and the performance improved further. Again, remember this is on 100 documents, so the accuracy looks high; if your document pool increases, that will come down a bit, but in comparison to where we were before, we saw improvement. And that is where the small tweaks come into play, which will be very helpful to you when you're putting a graph RAG or RAG system into production. The other aspect is latency. If your graph gets bigger and bigger, you're now talking about a network with millions or billions of nodes. How do you search a graph that has millions or billions of nodes? That is where acceleration comes into play, with cuGraph, which is now available through NetworkX. NetworkX is also a pip-installable library; anyone who has used NetworkX here? Right, a few, okay. Under the hood it can use GPU acceleration, and we ran a performance test on a few of the algorithms; you can see the overall execution latency reducing drastically. So again, small tweaks lead you to better results. These are the two things we experimented with that led to better results in terms of accuracy as well as reduced overall latency. So then the obvious question is: should I use a graph-based, a semantic-based, or a hybrid RAG system? I'm going to give you the diplomatic answer.
It depends. But there are a few things I would like you to take home that will help you come to a decision, so you can make an educated guess whether, for the use case you're working on, a semantic RAG system solves the problem and you don't need a graph, or vice versa, or you need a hybrid approach. It depends on two factors. One is your data. Traditionally, if you look at retail data, FSI data, or employee databases of companies, those have a really good structure defined, so those kinds of datasets become really good use cases for graph-based systems. The other thing to think about is: even if you have unstructured data, can you create a good knowledge graph out of it? If the answer is yes, then it's worthwhile experimenting with the graph path. And it will depend on the application and use case. Only if your use case requires understanding complex relationships and extracting that information for the questions you are asking does it make sense to use a graph, because remember, these are compute-heavy systems. So you need to make sure these things are taken care of.
I am running out of time, I think, but as I said before, I gave you the 10,000 ft view of all these things. If you want the 100 ft view where you are coding into things, all of it is available on GitHub, even the fine-tuning of the Llama 3.1 LoRA model. We also had a two-hour workshop; I gave you a 20-minute talk, but this whole content is covered in the two-hour workshop as well. Lastly, join our developer program: we release all these things on a regular basis, and if you join the mailing list, you'll get this information based on your interests. As my colleague mentioned, I will be across the hall at the Neo4j booth to answer questions, if any; I would love to interact with you. Thank you for your time. [Applause] [Music]