HybridRAG: A Fusion of Graph and Vector Retrieval - Mitesh Patel, NVIDIA
Channel: aiDotEngineer
Published at: 2025-07-22
YouTube video id: -tgQa8Fzf80
Source: https://www.youtube.com/watch?v=-tgQa8Fzf80
[Music] To quickly introduce myself: my name is Mitesh, and I lead the developer advocate team at NVIDIA. The goal of my team is to create technical workflows and notebooks for different applications and release that codebase on GitHub, so that developers in general, which is me and you, all of us together, can harness that knowledge and take it further for the application or use case you're working on. In today's talk, I'm going to cover a project we did with one of our partners and some of my colleagues at NVIDIA: how you can create a graph RAG system, what its advantages are, and how adding a hybrid nature to it helps. I won't be able to give you the 100 ft view where I dive with you into the codebase, but there is a GitHub link at the end of this talk, and all the notebooks I'm going to talk about are available for you to take home. What I'll give you is the 10,000 ft view of how to build your own graph RAG system. A quick refresher: what is a knowledge graph, and why is it important? It is a network that represents relationships between different entities, and those entities can be anything: people, places, concepts, events. A simple example is me being here. What is my relationship to the AI Engineer World's Fair conference? I'm a speaker at this conference. What is my relationship to anyone attending here? You attended my session. This edge, the relationship between two entities, becomes very important, and it is something that only graph-based networks, or knowledge graphs, can exploit.
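The speaker-and-conference example above can be sketched as a tiny knowledge graph of (subject, relation, object) triplets. This is a minimal illustration in plain Python; the entity and relation names are taken from the talk's example, not from any NVIDIA codebase.

```python
# A tiny knowledge graph stored as (subject, relation, object) triplets.
# Names are illustrative, based on the example in the talk.
triplets = [
    ("Mitesh", "is_speaker_at", "AI Engineer World's Fair"),
    ("Attendee", "attended", "Mitesh's session"),
    ("Mitesh", "works_at", "NVIDIA"),
]

def neighbors(entity, triplets):
    """Return every (relation, object) pair for edges leaving `entity`.

    This is the edge information a knowledge graph exploits that a
    pure vector store does not.
    """
    return [(rel, obj) for subj, rel, obj in triplets if subj == entity]

print(neighbors("Mitesh", triplets))
```

Each triplet is one edge in the graph; retrieval later walks these edges to collect context.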
And that is the reason there is a lot of active research happening in this domain: how you can harness knowledge graphs and put them into a RAG-based system. The goal is to create triplets, which define the relationships between entities, because that is what a graph-based system or knowledge graph is really good at exploiting, and that is what is unique about it. If you think about why knowledge graphs can work better than a semantic RAG system: they capture the information between entities in much more detail. Those connections can provide a very comprehensive view of the knowledge you are creating in your RAG system, and that becomes very important to exploit when you are retrieving that information and converting it into a response for the user asking the question. A knowledge graph also has the ability to organize your data from multiple sources, though that is a given no matter what kind of RAG system you're building. So how do we create a graph RAG or hybrid system? Here is the high-level diagram of what it entails. I broke it down into four components. The very first thing is your data: you need to process your data. The better you process your data, the better the knowledge graph; the better the knowledge graph, the better the retrieval. So the components are data, data processing, graph creation or semantic embedding vector database creation, and the last step is, of course, inferencing, when you're asking questions of your RAG pipeline. At a higher level this can be broken down into two big pieces: offline and online.
All your data processing work, which is a one-time process, is offline. Once you have created your knowledge graph, which is your triplets (entity 1, relationship, entity 2), or your semantic vector database, then it's all about querying it and converting that information into a response that is readable to the user. It cannot be something like "here are the three relationships" where we as users have to go figure out what that exactly means. The top part of the flow diagram is where you build your semantic vector database: you pick your documents, convert them into vector embeddings, and store them in a vector database. The part below is how you create your knowledge graph, and there are many more steps to follow and much more care to take when creating it. So, diving into the first step, creating your knowledge graph: how can you create those triplets out of documents that are not that structured? Creating triplets that expose the information between two entities, and picking those entities so that the information becomes helpful, is very important. Here's a simple example. This document is ExxonMobil's quarterly results, and we tried to create the knowledge graph using an LLM. If you look at the first line: ExxonMobil, which is a company, is the first entity; "cut" is the relationship between ExxonMobil and "spending on oil and gas exploration and activity", which is the second entity. This is how the relationship needs to be extracted.
Now the question that comes to mind is: that sounds very difficult to do. And it is, and that is exactly why we need to use an LLM to extract this information and structure it for us so that we can save it in a triplet format. How can we do that? Prompt engineering, but we need to be much more precise about it. Based on the use case you are working on, you can define your ontology, put it in your prompt, and then ask the LLM to extract the ontology-specific information from the documents and structure it so that it can be stored as triplets. This step is very important. You might spend a lot of time here making sure your prompt is doing the right thing and creating the right ontology for you. If your ontology is not right, if your triplets are not right, if they are noisy, your retrieval will be noisy. This is where you will go back and forth figuring out how to get a better ontology. My take is this is where you'll spend 80% of your time, iterating to make it better over time. The next step for a hybrid RAG system is to create the semantic vector database, and that is reasonably straightforward, or at least well studied. You pick your document (this is the first page of the "Attention Is All You Need" research paper), break it into chunks of a given chunk size, and you have another factor called overlap. Chunk sizes are important because the semantic vector pipeline will pick up each chunk, convert it into an embedding vector using the embedding model, and store it in the vector database.
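The ontology-in-the-prompt idea described above can be sketched as follows. This is a hedged illustration: the ontology contents, prompt wording, and pipe-delimited output format are all assumptions for demonstration, not NVIDIA's actual prompt; the LLM call itself is mocked with a hard-coded response string.

```python
# Hedged sketch of ontology-guided triplet extraction. The ontology and
# output format below are illustrative assumptions, not a real system prompt.
ONTOLOGY = {
    "entity_types": ["Company", "Financial_Metric", "Activity"],
    "relations": ["cut", "increased", "reported"],
}

def build_extraction_prompt(document: str) -> str:
    """Embed the ontology in the prompt so the LLM only emits triplets
    that conform to it (the iterative tuning step the talk describes)."""
    return (
        "Extract (entity, relation, entity) triplets from the text.\n"
        f"Allowed entity types: {', '.join(ONTOLOGY['entity_types'])}\n"
        f"Allowed relations: {', '.join(ONTOLOGY['relations'])}\n"
        "Output one triplet per line as: subject | relation | object\n\n"
        f"Text: {document}"
    )

def parse_triplets(llm_output: str):
    """Parse 'subject | relation | object' lines, dropping malformed lines
    and relations outside the ontology: noisy triplets mean noisy retrieval."""
    triplets = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[1] in ONTOLOGY["relations"]:
            triplets.append(tuple(parts))
    return triplets

# What a well-behaved LLM response might look like (mocked, no API call).
fake_response = "ExxonMobil | cut | spending on oil and gas exploration\nbad line"
print(parse_triplets(fake_response))
```

In a real pipeline, `build_extraction_prompt` would be sent to the LLM and `parse_triplets` run over its response; the validation step against the ontology is what you iterate on during that "80% of the time" tuning loop.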
If you don't have an overlap, the context between the previous and the next chunk is lost, if there is any relationship between them. So you try to be smart about how much overlap you need between the previous chunk and the next chunk, and about what chunk size to use when chunking your documents into different paragraphs. That is where the advantage of graph RAG comes into play: the important information, the relationships between different entities, is not exploited by your semantic vector database, but it is exploited really well when you use a knowledge graph based system. Once you have created this knowledge graph, what is the next step? Now comes the retrieval piece: you ask a question like "what does ExxonMobil's cut look like this quarter?", and the knowledge graph will help you retrieve those nodes, those entities, and the relationships between them. But if you do a very flat retrieval, a single hop, you are missing the most important thing that a graph allows you: exploration through multiple nodes. I cannot stress how important that becomes. So think of different strategies. Again, you will spend a lot of time optimizing this: whether you should look at a single hop or a double hop, how deep you want to go, so that the relationships from your first node to the second node, and your second node to the third node, are exploited well. The deeper you go, the better context you'll get. But there's a disadvantage: the deeper you go, the more time you spend retrieving that information. So latency becomes a factor as well, especially when you're working in a production environment.
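The chunk-size-and-overlap scheme described above can be sketched in a few lines. This is a minimal character-based version; the sizes are illustrative defaults, and real pipelines often chunk on tokens or sentences instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50):
    """Split `text` into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so context that spans a chunk
    boundary is not lost. Sizes are illustrative; tune them for your data."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(500))  # stand-in for a document page
chunks = chunk_text(doc, chunk_size=200, overlap=50)
# Each chunk would then go through an embedding model into the vector
# database; that step is backend-specific and omitted here.
print(len(chunks))
```

The trade-off the talk mentions shows up directly here: a bigger overlap preserves more cross-chunk context but produces more chunks to embed, store, and search.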
So there is a sweet spot you'll have to hit between how deep you want to go, how many hops into your graph, versus what latency you can survive. That becomes very important, and some of those searches can be accelerated. We created a library called cuGraph, which is available in, or integrated with, a lot of libraries out there like NetworkX. That acceleration gives you the flexibility to go deeper into your graph, through multiple hops, while at the same time reducing the latency, so the performance of your graph improves a lot. This is where the retrieval piece comes into play: you can define different strategies so that when you're querying your data, you get better responses. The other important piece, and I personally worked on this piece so I could talk at length on it, but I'm going to stay at a very high level, is evaluating the performance. There are multiple factors you can evaluate: faithfulness, answer relevancy, precision, recall, and if you use an LLM judge, helpfulness, correctness, coherence, complexity, and verbosity. All these factors become very important. There is a pip-installable library called Ragas that is meant to evaluate your RAG workflow end to end. Anyone here who has used Ragas for evaluating a graph RAG? All right, a few of you. Thank you. It is an amazing library that you can use to evaluate your RAG pipeline end to end, because it evaluates the response, it evaluates the retrieval, and it evaluates the query.
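The single-hop versus multi-hop trade-off above can be illustrated with a small breadth-first walk over a triplet store. This sketch uses plain Python dictionaries rather than NetworkX or cuGraph so that it stays self-contained; the triplets are invented examples in the spirit of the talk's ExxonMobil case.

```python
from collections import deque

# Illustrative triplet store; real graphs would have millions of edges.
triplets = [
    ("ExxonMobil", "cut", "spending on oil and gas exploration"),
    ("spending on oil and gas exploration", "affects", "quarterly results"),
    ("quarterly results", "reported_in", "Q2 filing"),
]

def multi_hop(start: str, max_hops: int):
    """Breadth-first walk up to `max_hops` edges out from `start`, returning
    the triplets visited. More hops yield more context but cost more time,
    which is the depth-versus-latency trade-off described in the talk."""
    adjacency = {}
    for s, r, o in triplets:
        adjacency.setdefault(s, []).append((r, o))
    seen, queue, context = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # depth budget exhausted: stop expanding this branch
        for rel, obj in adjacency.get(node, []):
            context.append((node, rel, obj))
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return context

print(len(multi_hop("ExxonMobil", 1)))  # flat single-hop retrieval
print(len(multi_hop("ExxonMobil", 2)))  # two hops pull in more context
```

With NetworkX the same traversal could use `nx.bfs_edges` with a `depth_limit`, and that is where a GPU backend like cuGraph can cut the latency on large graphs.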
So it will evaluate your pipeline end to end, which becomes very handy when you're trying to test whether your retrieval is doing the right thing, or whether the LLM is interpreting the questions you're asking in the right way. The Ragas pipeline will evaluate all those pieces and give you an eventual score. It is a pip-installable library. Under the hood, and no surprises there, Ragas uses an LLM. By default it is integrated with GPT, but it gives you the flexibility to bring in your own model, wire it up with your API, and use that LLM to score the evaluation parameters Ragas offers. So I would say it's quite comprehensive, and it's really good in terms of giving you that flexibility. The other path is using a model that is meant specifically to evaluate the responses coming out of an LLM. That is where the Nemotron 340B reward model that we released comes in. At the time, it was a really good reward model. It's a 340-billion-parameter model, so reasonably big, but it is a reward model: it will evaluate the response of another LLM and judge how the responses look on those five parameters. It is meant to judge other LLMs; that is how it was trained. Moving further, I would like to use this analogy: creating a graph RAG system, which is 80% of the job, will take you 20% of your time. But making it better, the last 20%, will take 80% of your time (the 80/20 rule), because now you are in the process of optimizing it further to make sure it works
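Ragas and reward models compute their scores with LLM judges, but two of the retrieval metrics the talk names, precision and recall, have a simple arithmetic core. The sketch below computes them directly over retrieved chunk IDs; this is an illustration of the underlying idea, not Ragas internals, and the chunk names are made up.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant:
    how much noise the retriever brought back."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of the relevant chunks the retriever managed to surface:
    how much needed context is missing."""
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

# Hypothetical retrieval result versus a ground-truth relevance set.
retrieved = ["chunk_a", "chunk_b", "chunk_c"]
relevant = {"chunk_a", "chunk_d"}
print(context_precision(retrieved, relevant))  # 1 of 3 retrieved is relevant
print(context_recall(retrieved, relevant))     # 1 of 2 relevant was retrieved
```

Judge-based metrics like faithfulness or answer relevancy replace the set membership test with an LLM's verdict, but the aggregation is the same shape.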
well enough for the use case and the application that you're working on. There are some strategies there which I would like to walk you through. One, as I said before, and I couldn't stress it enough: the way you create your knowledge graph out of your unstructured data is very important. The better your knowledge graph, the better results you're going to get. Something we did as experimentation in the use case we were exploring with one of our partners was: can we fine-tune an LLM to improve the quality of the triplets we are creating, and does that improve results? Can we do a better job at data processing, like using regexes to remove apostrophes, brackets, and characters that don't matter; if we remove them, does that give better results? These are small things you can think about, but they improve the performance of your overall system. That is what I mean by 80% of your time: small, nitty-gritty knobs that you fine-tune slowly and steadily to make sure your performance gets better and better. I would like to share a few strategies that led us to better results. The very first thing is regexes, or just cleaning your data. We removed apostrophes and other characters that are not that important for triplet generation, and that led to better results. We then implemented another strategy of shortening overly long outputs rather than dropping them, and that got us better results. We also fine-tuned the Llama model, and that got us better results. If you look at the last three columns, you'll see that using Llama 3.3 as-is, we got 71% accuracy. This was tested on 100 documents to see how it performs.
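The regex cleaning step mentioned above can be sketched as a small preprocessing function. The exact character set to strip is an illustrative assumption; the talk only names apostrophes and brackets, and you would tune this list against your own corpus.

```python
import re

def clean_for_triplets(text: str) -> str:
    """Strip characters that tend to add noise to triplet extraction
    (apostrophes and brackets, per the talk), then collapse whitespace.
    The exact character set is an assumption; tune it for your corpus."""
    text = re.sub(r"['\u2019\[\]\(\)\{\}]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_for_triplets("ExxonMobil's  (quarterly) [results]"))
```

Running this over documents before triplet extraction is the kind of small, cheap knob the talk describes: it changes nothing about the model, yet it measurably affects triplet quality.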
As we introduced LoRA and fine-tuned the Llama 3.1 model, our accuracy went up from 71% to 87%. Then we made those small tweaks, and the performance improved further. Again, remember this is on 100 documents, so the accuracy looks high; if your document pool increases, that will come down a bit, but in comparison to where we were before, we saw improvement. And that is where the small tweaks come into play, which will be very helpful to you when you're putting a graph RAG or RAG system into production. The other aspect is latency. If your graph gets bigger and bigger, you're now talking about a network with millions or billions of nodes. How do you search a graph that has millions or billions of nodes? That is where acceleration comes into play, with cuGraph, which is now available through NetworkX. NetworkX is also a pip-installable library; anyone who has used NetworkX here? Right, a few, okay. Under the hood it can use GPU acceleration, and we ran a performance test on a few of the algorithms; you can see the overall execution latency reducing drastically. So again, small tweaks lead you to better results. These are the two things we experimented with that led to better results in terms of accuracy as well as reduced overall latency. So then the obvious question is: should I use a graph-based, a semantic-based, or a hybrid RAG system? I'm going to give you the diplomatic answer.
It depends. But there are a few things I would like you to take home that will help you come to a decision, so you can make an educated guess whether, for the use case you're working on, a semantic RAG system solves the problem and you don't need a graph, or vice versa, or you need a hybrid approach. It depends on two factors. One is your data. Traditionally, if you look at retail data, FSI data, or employee databases of companies, those have a really good structure defined, so those kinds of datasets become really good use cases for graph-based systems. The other thing to think about is: even if you have unstructured data, can you create a good knowledge graph out of it? If the answer is yes, then it's worthwhile experimenting with the graph path. And it will depend on the application and use case. Only if your use case requires understanding complex relationships and extracting that information for the questions you are asking does it make sense to use a graph, because remember, these are compute-heavy systems. So you need to make sure these things are taken care of.
I am running out of time, I think, but as I said before, I gave you the 10,000 ft view of all these things. If you want the 100 ft view where you are coding into things, all of it is available on GitHub, even the fine-tuning of the Llama 3.1 LoRA model. We also had a two-hour workshop; I gave you a 20-minute talk, but this whole content is covered in the two-hour workshop as well. Lastly, join our developer program: we release all these things on a regular basis, and if you join the mailing list, you'll get this information based on your interests. As my colleague mentioned, I will be across the hall at the Neo4j booth to answer questions, if any; I would love to interact with you. Thank you for your time. [Applause] [Music]