OpenRAG: An open-source stack for RAG — Phil Nash

Channel: aiDotEngineer

Published at: 2026-04-08

YouTube video id: 4TxOBhDRRCM

Source: https://www.youtube.com/watch?v=4TxOBhDRRCM

Hi there. My name is Phil Nash and I'm a developer relations engineer at IBM. I've been working on tools around AI and RAG for the last couple of years, and I've got something I'd like to show you today.
Now, first things first: I've heard that RAG is dead many a time, and I'm sure you have too. Context windows are huge these days, so you might as well just dump all of your information in there. I don't take this kind of thing very seriously. If every business has less than a million tokens' worth of data then sure, RAG is dead, and probably so are all those businesses.
Of course, not everyone is happy paying for a million input tokens every time they want to ask a question, either. Instead, what I hear in the "RAG is dead" claims is more that RAG is solved, right? We think we understand the process and we can just apply RAG when we need to.
You just gather up all your unstructured data, extract the text, chunk it up, embed it, and throw it into a vector database. Then, when you want to ask your agent a question, you embed that question, search the database, pick the top-k results, and pass them to a model as context. It's just a footnote in context engineering these days.
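To make that "solved" recipe concrete, here's a minimal sketch of the naive pipeline. Note that `extract_text`, `embed_texts`, `vector_store`, and `llm` are hypothetical stand-ins for your parser, embedding model, vector database, and model client, not OpenRAG APIs:

```python
# Minimal sketch of the "solved" RAG recipe described above.
# extract_text, embed_texts, vector_store, and llm are hypothetical
# stand-ins, not OpenRAG APIs.

def ingest(paths: list[str], chunk_size: int = 500) -> None:
    for path in paths:
        text = extract_text(path)  # parse the unstructured document
        chunks = [text[i:i + chunk_size]
                  for i in range(0, len(text), chunk_size)]
        vector_store.add(chunks, embed_texts(chunks))  # index chunk embeddings

def answer(question: str, k: int = 5) -> str:
    top_chunks = vector_store.search(embed_texts([question])[0], k=k)  # top-k
    context = "\n\n".join(top_chunks)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {question}")
```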
But it turns out that RAG is actually hard, and it's hard for different reasons for different projects. PDFs are a pain; chunking strategies are a hassle, and changing and testing them is difficult. Embeddings keep improving, which is great for the industry but not so great when you've used something from six months or a year ago. There are new search techniques all the time, and further tweaks you can add to your pipeline to improve the results: adding summaries to chunks, performing chunk expansion, using a cross-encoder to rerank results, query rewriting; there's so much more. RAG is quite complex. In fact, everyone's documents are different; every system will have different users, different questions, different interaction patterns, and different expectations.
While every RAG system will ultimately be different, there are definitely some core components that are required. When building a RAG system, it's useful to have a high-quality baseline to build from.

So that's what we've been working on at IBM. We've brought together three existing open-source projects to create a RAG stack that is powerful, easy to use, and easy to extend. The project's called OpenRAG, and it uses the open-source Docling for document processing, OpenSearch for search indexing, and LangFlow for visual orchestration and agents. OpenRAG is an open-source project that you can try out today to build your own powerful, customizable, and easy-to-use RAG system.
I just want to break down the stack for you so that you understand the components, how they work together, and how they create a stack that is flexible enough for your modern RAG requirements.

Let's start by looking at the ingestion side of RAG, where it all begins: document processing. Ingesting PDFs, HTML, Word docs, slides, and more can be a pain, but the biggest pain of all is, of course, PDFs.
Docling is an open-source project. It was built out of IBM Research in Zurich, and it processes and parses all sorts of documents, from HTML, Markdown, and Word documents through to slides and spreadsheets, audio and video, and even that enemy of all RAG systems, PDFs. Docling has a number of different pipelines that handle different file types. This allows it to be flexible in the way it takes in documents and accurate in its output.
There is a simple pipeline that handles those mostly straightforward text documents like Markdown, HTML, and Word. That just extracts the text, turns it into a hierarchy, and outputs a document. For audio and video, there is an ASR (automatic speech recognition) pipeline. And for PDFs, there are two available pipelines.
The standard pipeline has a number of small, focused models that do different things, like extracting text, tables, and images from PDFs. You can even choose an OCR backend to read text, which is particularly useful for scanned documents that don't have actual text in them. This collection of small models in a pipeline performs things like layout analysis, table extraction, and image extraction and description. This gives you a wide array of options to get the best out of those documents.
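As a rough illustration, configuring that standard PDF pipeline with Docling's Python API looks something like this (option names follow recent Docling releases; check the docs for your version, and the file path is illustrative):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure the standard PDF pipeline: enable OCR for scanned pages
# and table-structure recovery.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("scanned-report.pdf")  # illustrative path
print(result.document.export_to_markdown())
```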
There's also a VLM (vision language model) pipeline that uses the Granite Docling 258M vision model to extract all of that in one go. This is a newer pipeline, but it is simpler, as it's just one all-in-one model that's trained specifically for this task.
Docling extracts text and then produces an intermediate representation, a Docling document, which models the structure of the document in an XML-ish format called DocTags. Those DocTags can then be converted to a number of formats, including Markdown, HTML, and JSON. Docling also has a chunker that uses the hierarchy generated by the parsers and built into those DocTags to produce hierarchy-aware chunks of text.
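A quick sketch of that chunker in use, again with an illustrative file path:

```python
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

# Convert a document, then chunk it. The HybridChunker follows the
# hierarchy captured in DocTags, so chunks respect sections and
# headings rather than arbitrary character counts.
doc = DocumentConverter().convert("handbook.pdf").document
chunker = HybridChunker()
for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text[:100])
```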
Moving on to embeddings: OpenRAG actually isn't very prescriptive about embeddings at all. It supports a number of external providers, including OpenAI, watsonx.ai, and Ollama for locally hosted embeddings. In fact, the entirety of OpenRAG can be run offline using locally hosted models. Docling itself can be run offline, so it can run in air-gapped situations. It doesn't need those external services.
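As a tiny local example, here's roughly how you'd embed chunks through Ollama's Python client; the model tag below is illustrative, so use whatever embedding model you've pulled:

```python
import ollama  # talks to a locally running Ollama server; pip install ollama

chunks = ["OpenRAG ties Docling, OpenSearch and LangFlow together.",
          "JVector provides disk-backed vector indexing."]

# Embed each chunk with a locally hosted model (tag is illustrative).
vectors = [ollama.embeddings(model="qwen3-embedding:0.6b", prompt=c)["embedding"]
           for c in chunks]
print(len(vectors), len(vectors[0]))  # number of chunks, embedding dimension
```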
Once you have embedded those chunks, they are indexed in OpenSearch. OpenSearch is, of course, the open-source fork of Elasticsearch, and is a powerful database for performing vector search and keyword search, with highly configurable filtering and aggregation. Out of the box, OpenRAG uses OpenSearch for a hybrid vector and keyword search, and exposes that sophisticated filtering for more targeted searching.
It also supports vector search over multiple embedding models. Now, this will slow down your vector search in practice, but it is useful if you decide you need to migrate your embedding models as part of your system.
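One simple way to combine keyword and vector relevance in OpenSearch is a bool query with a BM25 match clause and a k-NN clause, plus a metadata filter. This is only a sketch: the index and field names are illustrative, not OpenRAG's actual schema, and OpenRAG's own hybrid search may use OpenSearch's dedicated hybrid query and normalization pipelines instead.

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query_vector = [0.1] * 768  # stand-in for an embedded user query
body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"text": "quarterly revenue"}},                     # keyword (BM25)
                {"knn": {"embedding": {"vector": query_vector, "k": 10}}},    # vector
            ],
            "filter": [{"term": {"source": "finance-reports"}}],             # metadata filter
        }
    }
}
results = client.search(index="documents", body=body)
```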
OpenRAG also sets up OpenSearch with a secret fourth open-source project. The default OpenSearch k-NN plugin gives you options for HNSW or IVF vector indexes, but OpenRAG uses the JVector k-NN plugin by default. JVector is an open-source vector index that gives you live indexing, and because it's based on the DiskANN architecture, your whole index doesn't have to fit in memory, giving you more options for scaling your data servers.
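For comparison, creating an index with the stock k-NN plugin's HNSW option looks roughly like this. The JVector plugin uses its own engine setting instead; the parameters below are the default plugin's, not JVector's, so check the JVector plugin docs for its mapping.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Stock k-NN plugin mapping with an HNSW graph index (not JVector's).
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 768,  # must match your embedding model
                "method": {"name": "hnsw", "engine": "lucene",
                           "space_type": "cosinesimil"},
            },
        }
    },
}
client.indices.create(index="documents", body=index_body)
```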
All of this is then tied together with LangFlow. LangFlow is a drag-and-drop visual editor for AI flows, and it integrates Docling, OpenSearch, and all these embedding models, as well as further data enrichment, as part of that ingestion pipeline. We'll come back and look more deeply into LangFlow later.
So that's ingestion and indexing. What
about the generation side of rag?
On the generation side, we don't normally have to worry about ingesting documents, and we already know that OpenSearch is handling that multi-vector hybrid search for us. But we do need to point out that OpenRAG uses agentic retrieval to perform the search. This is also done in LangFlow, and again gives you access to all the models that LangFlow makes available to you. Out of the box, that's OpenAI, Anthropic, Ollama, and watsonx.ai.
What does agentic search mean? Well, a traditional RAG generation pipeline would take a user query, embed it, use it to perform that nearest-neighbor search over the chunks, and present the top-k chunks to the LLM, hoping that the answer is contained within and that the model is smart enough to extract it. With agentic retrieval, we instead give the user's query to an agent, along with instructions and tools that it can use to perform as many searches as required. The model is actually responsible for deciding what searches to perform and what to do with the results.
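Schematically, the control flow looks something like the sketch below. Here `call_llm` is a hypothetical tool-calling LLM client and `hybrid_search` a hypothetical search backend; OpenRAG's real agent is built as a LangFlow flow, so this only illustrates the loop.

```python
# Schematic agentic-retrieval loop. call_llm and hybrid_search are
# hypothetical stand-ins; OpenRAG's actual agent lives in LangFlow.

def search_knowledge(query: str, k: int = 5) -> list[str]:
    """Search tool the agent may call as many times as it needs."""
    return hybrid_search(query, k=k)  # hypothetical, as sketched earlier

def run_agent(question: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system",
         "content": "Use the search tool to gather context before answering."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages, tools=[search_knowledge])
        if not reply.tool_calls:             # the model decided it has enough context
            return reply.content
        messages.append(reply.as_message())  # keep the assistant turn in history
        for call in reply.tool_calls:        # the model chose these queries itself
            chunks = search_knowledge(**call.arguments)
            messages.append({"role": "tool", "content": "\n\n".join(chunks)})
    return reply.content
```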
So let's actually take a look at this in action. I have OpenRAG running on my laptop, and we're going to have a quick look at what it can do. Once you've gone through the onboarding process with OpenRAG and set it up, you get dropped into a chat, and the first thing you get to ask is "what is OpenRAG". As you can see here, it has got an answer, but you can also see that it has done some tool calling already. It turns out that the answer to "what is OpenRAG" is inside the agent's prompt by default, so it doesn't actually need to do any search querying. But it did go and get the current date just in case as well, which is nice of it.
So we can see we get an answer out of it, but we also get these suggestions about things to ask next; those are little nudges, also powered by LangFlow. And if we were to ask about that, to explore LangFlow's role in AI agent construction, then the agent itself will go off, search that documentation, and come up with an answer. So, as you see, the agent has gone and used some tools again. It has come up with an answer, and it's come up with more nudges as well.
So let's go look at the knowledge section. This is where you actually upload your data, your documents, and you can do so just by adding a whole file or a whole folder. There's also a sync button here; we'll see that in a minute. You can also inspect your objects, your documents, and your chunks here, so you can see that things are being chunked as you'd expect.
This is also where you can create knowledge filters. This takes advantage of that filtering in OpenSearch. You can create filters based on a whole bunch of different options around the data that you have in your system, and that then allows you, in chat, to use those filters to talk only to specific documents. So that's the knowledge section.
Then, in the settings, we can dig into the actual customizability of this. Right at the top there are cloud connectors, but in order to use cloud connectors you need some user authentication. Right now we set that up with Google OAuth, so you need an OAuth client and a secret there. Once you have that in place, you can connect to Google Drive, SharePoint, and OneDrive. This allows your users to connect to directories of their own documents and lets OpenRAG sync them directly. I think that's really powerful. It saves you having to upload things a lot of the time; you can just sync with this external document store and it will always be up to date.
Here we can see our model providers. You can configure API-based ones or Ollama; like I said, that's for running things locally. I'm running Ollama, and you can see I'm currently running Granite 4 3B, one of IBM's models. So this is our language model, and you can see the actual agent instructions as well, so you can set your system prompt there.
Then, in the ingest section, you can see I'm running Qwen3 Embedding 0.6B for my embedding model, also on Ollama. You can set your chunk size and chunk overlap. And then these last bits are Docling settings, where we say: do we want to capture table structure? Yes, currently. Do we want to run OCR? Not right now; that's turned off. And do we want to extract picture descriptions? Currently that's off, but that's a useful one if you want to get the information out of images as well. Of course, adding more models to the pipeline makes things a little slower, so they're off for now.
And then right at the bottom there are API keys. This is where you can set up access to OpenRAG as an API, so you can use your search or your agent within your own application.
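As a loudly hypothetical sketch, calling such an API from your own app might look like this; the URL, path, and payload below are placeholders I've made up for illustration, not OpenRAG's documented endpoints, so check the project docs for the real ones.

```python
import requests

# Hypothetical sketch only: the endpoint path and payload shape below
# are NOT OpenRAG's documented API.
API_KEY = "your-openrag-api-key"
resp = requests.post(
    "http://localhost:8000/api/chat",  # illustrative URL and path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"message": "What is OpenRAG?"},
)
print(resp.json())
```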
But let's actually drop under the hood even further, where we can go and customize things even more. You can hit this "edit in LangFlow" button, and that will take you into the actual implementation of your agent.
So let's zoom in. We can see here is our agent; this is the chat generation flow. Our agent receives its information from this chat input, which goes through a quick prompt template, adding in things about knowledge filters if you've used them. And then the agent has a bunch of tools. Those tools include an MCP server for a URL ingestor, which is actually just another flow within LangFlow. There's a calculator, because I think that agents and models shouldn't be doing arithmetic; they're language models, not math models, so a calculator is always useful. And then finally, the last one is the OpenSearch multi-embedding-model search component, and the embedding providers are all here.
We can edit this. If you go in here and just unlock the flow and save that, we can do more with it. So, for example, we can take this chat input, and we might want to put some guardrails in place. We can grab guardrails from our set of components on the left, and then we need to parse the result of that, so we can just get a parser. If it passes, we pass it through the parser, get the text out, which is the original text that was sent in, and then hand that on to our prompt template. If it fails, we can send an error message to our chat output. And so now we've added guardrails to our flow, and we can use our Ollama models there as well. This is as extensible as LangFlow can be for you.
There's one more thing: there is an MCP server available for OpenRAG, as well as an API, so you can go and use this and hand it to your other agents as well.
So, is RAG solved? Well, that's kind of still up to you, to your data, and to your users. But OpenRAG is built to help. It is an opinionated, but agentic, open-source stack for RAG. As I said before, it combines Docling, OpenSearch, and LangFlow to create this powerful baseline RAG system made of open-source components. And it leaves plenty of room to customize within that stack, so that you can build out the best RAG for your data and provide the best context to your agents.
It is currently at version 0.4.0, and it's ready for you to play with. This link or the QR code will take you to the project. We'd love it if you tried out OpenRAG, dropped a star on the GitHub, and let us know what you think.
It's also open source, right? The front end is a Next.js application; everything else is a Python app. So if you like the look of OpenRAG, we'd also appreciate your feedback and your contributions to the project itself.
The components of OpenRAG are, of course, open as well, so you can get involved with Docling, with OpenSearch, or with LangFlow too. And together, you know, we can build a RAG platform that works for everyone: one that gives you the choices where you need them and makes good decisions for you where it makes sense. We can do it with open-source components, out in the open. That's what I'd like to see.
So thank you very much for listening. Again, my name's Phil Nash. I'm a developer relations engineer at IBM, trying to help build OpenRAG and this open ecosystem of agentic applications. We can't wait to see what you build with OpenRAG. Thank you very much.