OpenRAG: An open-source stack for RAG — Phil Nash
Channel: aiDotEngineer
Published at: 2026-04-08
YouTube video id: 4TxOBhDRRCM
Source: https://www.youtube.com/watch?v=4TxOBhDRRCM
Hi there. My name is Phil Nash and I'm a developer relations engineer at IBM. I've been working on tools around AI and RAG for the last couple of years, and I've got something I'd like to show you today.

First things first: I've heard that RAG is dead many a time, and I'm sure you have too. Context windows are huge these days, so you might as well just dump all of your information in there. I don't take this kind of claim very seriously. If every business has less than a million tokens' worth of data, then sure, RAG is dead, and probably so are all those businesses. And of course, not everyone is happy paying for a million input tokens every time they want to ask a question.

Instead, I hear the "RAG is dead" claim more as "RAG is solved": we think we understand the process and can just apply RAG when we need to. You gather up all your unstructured data, extract the text, chunk it up, embed it and throw it into a vector database. Then, when you want to ask your agent a question, you embed that question, search the database, pick the top-k results and pass them to a model as context. It's just a footnote in context engineering these days.

But it turns out that RAG is actually hard, and it's hard for different reasons on different projects. PDFs are a pain. Chunking strategies are a hassle, and changing and testing them is difficult. Embeddings keep improving, which is great for the industry but not so great when you're using something from six months or a year ago. There are new search techniques all the time, and further tweaks you can add to your pipeline to improve results: adding summaries to chunks, performing chunk expansion, using a cross-encoder to rerank results, query rewriting and so much more. RAG is quite complex. Everyone's documents are different, and every system will have different users, different questions, different interaction patterns and different expectations.

While every RAG system will ultimately be different, there are definitely some core components that are required, and when building a RAG system it's useful to have a high-quality baseline to build from. That's what we've been working on at IBM. We've brought together three existing open-source projects to create a RAG stack that is powerful, easy to use and easy to extend. The project is called OpenRAG, and it uses the open-source Docling for document processing, OpenSearch for search indexing and LangFlow for visual orchestration and agents. OpenRAG is an open-source project that you can try out today to build your own powerful, customizable and easy-to-use RAG system. I want to break down the stack for you so that you understand the components, how they work together, and how they create a stack that is flexible enough for modern RAG requirements.

Let's start where it all begins: ingestion, and specifically document processing. Ingesting PDFs, HTML, Word docs, slides and more can be a pain, and the biggest pain of all is of course PDFs. Docling is an open-source project that was built out of IBM Research in Zurich, and it processes and parses all sorts of documents, from HTML, Markdown and Word documents through to slides and spreadsheets, audio and video, and even that enemy of all RAG systems, PDFs. Docling has a number of different pipelines that handle those different file types.
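To make that concrete, here's a minimal sketch of converting a document with Docling's Python API. The file name is just a placeholder:

```python
# Minimal Docling conversion sketch; "report.pdf" is an illustrative file name.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # selects an appropriate pipeline per file type
result = converter.convert("report.pdf")

# The result wraps Docling's intermediate document representation,
# which can be exported to Markdown, HTML or JSON.
print(result.document.export_to_markdown())
```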
Those pipelines let Docling be flexible in the documents it takes in and accurate in its output. There's a simple pipeline that handles mostly straightforward text documents like Markdown, HTML and Word: it extracts the text, turns it into a hierarchy and outputs a document. For audio and video there's an ASR (automatic speech recognition) pipeline. And for PDFs there are two available pipelines. The standard pipeline uses a number of small, focused models that do different things, like extracting text, tables and images from PDFs. You can even choose an OCR backend to read text, which is particularly useful for scanned documents that don't contain real text. This collection of small models performs layout analysis, table extraction, and image extraction and description, giving you a wide array of options to get the best out of those documents. There's also a VLM (vision language model) pipeline that uses the Granite Docling 258M vision model to extract all of that in one go. It's a newer pipeline, but it's simpler, as it's a single all-in-one model trained specifically for this task.

Docling extracts text and produces an intermediate representation, a DoclingDocument, which models the structure of the document in an XML-ish format called DocTags. Those DocTags can then be converted to a number of formats, including Markdown, HTML and JSON. Docling also has a chunker that uses the hierarchy generated by the parsers and built into those DocTags to produce hierarchically aware chunks of text (there's a short sketch of this below).

Moving on to embeddings: OpenRAG isn't very prescriptive about embeddings at all. It supports a number of external providers, including OpenAI and watsonx.ai, and Ollama for locally hosted embeddings. In fact, the entirety of OpenRAG can be run offline using locally hosted models. Docling itself can run offline too, so it works in air-gapped situations without external services.

Once you've embedded those chunks, they're indexed in OpenSearch. OpenSearch is, of course, the open-source fork of Elasticsearch, and it's a powerful database for performing vector search and keyword search, with highly configurable filtering and aggregation on top. Out of the box, OpenRAG uses OpenSearch for hybrid vector and keyword search (see the sketch below) and exposes that sophisticated filtering for more targeted searching. It also supports vector search over multiple embedding models. That will slow down your vector search in practice, but it's useful if you decide you need to migrate embedding models as part of your system.

OpenRAG also sets up OpenSearch with a secret fourth open-source project. The default OpenSearch k-NN plugin gives you options for HNSW or IVF vector indexes, but OpenRAG uses the JVector k-NN plugin by default. JVector is an open-source vector index that gives you live indexing, and because it's based on the DiskANN architecture, your whole index doesn't have to fit in memory, giving you more options for scaling your data servers.

All of this is then tied together with LangFlow. LangFlow is a drag-and-drop visual editor for AI flows, and it integrates Docling, OpenSearch and all of these embedding models, as well as further data enrichment, as part of the ingestion pipeline. We'll come back and look more deeply at LangFlow later. So that's ingestion and indexing.
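Before we move on to generation, here's the promised sketch of that hierarchy-aware chunking from Python, reusing the conversion from the earlier example (file name still a placeholder):

```python
# A sketch of Docling's hierarchy-aware chunking.
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker()  # uses the document hierarchy captured in the DocTags

for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text[:80])  # each chunk carries its surrounding section context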
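And here's a hedged sketch of what a hybrid keyword-plus-vector query against OpenSearch can look like. The index name, field names, vector size and search pipeline name are all assumptions for illustration; OpenRAG's actual index layout may differ.

```python
# Hybrid search sketch against a local OpenSearch; names are illustrative.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query_vector = [0.1] * 768  # in practice, the embedded user query

body = {
    "query": {
        "hybrid": {
            "queries": [
                # Keyword (BM25) leg of the hybrid query
                {"match": {"text": {"query": "how does ingestion work?"}}},
                # Vector (k-NN) leg over the same index
                {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
            ]
        }
    }
}

# Hybrid scoring needs a search pipeline with a score-normalization processor.
response = client.search(
    index="documents",
    body=body,
    params={"search_pipeline": "hybrid-pipeline"},
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"][:80])
```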
What about the generation side of RAG? On the generation side we don't normally have to worry about ingesting documents, and we already know that OpenSearch is handling that multi-vector hybrid search for us. But we do need to point out that OpenRAG uses agentic retrieval to perform the search. This is also done in LangFlow, and again gives you access to all the models that LangFlow makes available to you; out of the box that's OpenAI, Anthropic, Ollama and watsonx.ai.

What does agentic retrieval mean? A traditional RAG generation pipeline takes a user query, embeds it, uses it to perform a nearest-neighbor search over the chunks, and presents the top-k chunks to the LLM, hoping that the answer is contained within and that the model is smart enough to extract it. With agentic retrieval, we instead give the user's query to an agent, along with instructions and tools it can use to perform as many searches as required. The model is responsible for deciding what searches to perform and what to do with the results.
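As a rough, hypothetical sketch of that loop (the tool name, model and wiring are illustrative, not OpenRAG's actual implementation), agentic retrieval looks something like this:

```python
# Hypothetical agentic-retrieval loop: the model is handed a search tool and
# decides how many searches to run before answering. Not OpenRAG's real code.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge",
        "description": "Hybrid keyword + vector search over the document index.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def search_knowledge(query: str) -> str:
    # In OpenRAG this would hit OpenSearch; stubbed out for illustration.
    return json.dumps(["...chunks matching " + query + "..."])

messages = [{"role": "user", "content": "What is OpenRAG?"}]
while True:
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools
    )
    message = response.choices[0].message
    if not message.tool_calls:  # the agent has decided it can answer
        print(message.content)
        break
    messages.append(message)
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": search_knowledge(**args),
        })
```

The key difference from the traditional pipeline is that this runs until the model stops asking for searches, rather than doing a single fixed top-k lookup.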
So let's actually take a look at this in action. I have OpenRAG running on my laptop, and we're going to have a quick look at what it can do.

Once you've gone through the onboarding process with OpenRAG and set it up, you get dropped into a chat, and the first thing you get to ask is "what is OpenRAG?". As you can see here, it has an answer, but you can also see that it has done some tool calling already. It turns out the answer to "what is OpenRAG?" is inside the agent's prompt by default, so it doesn't actually need to do any searching, but it did go and get the current date just in case, which is nice of it. So we get an answer, and we also get suggestions about things to ask next, little nudges, also powered by LangFlow. If we follow one of those, to explore LangFlow's role in AI agent construction, the agent will go off, search that documentation and come up with an answer itself. As you can see, the agent has used some tools again, come up with an answer and produced more nudges as well.

Let's look at the knowledge section. This is where you actually upload your data and your documents, and you can do so by adding a whole file or a whole folder. There's also a sync button here; we'll see that in a minute. You can also inspect your documents and their chunks here, so you can see that things are being chunked as you'd expect. This is also where you create knowledge filters, which take advantage of that filtering in OpenSearch. You can create filters based on a whole bunch of different options around the data in your system, and in chat you can then use those filters to talk only to specific documents.

Then in the settings we can dig into the actual customizability. Right at the top there are cloud connectors, but to use cloud connectors you need a user model, or some authentication. Right now we set that up with Google OAuth, so you need an OAuth client and a secret there. Once you have that in place, you can connect to Google Drive, SharePoint or OneDrive, and that allows your users to connect directories of their own documents and have OpenRAG sync them directly. I think that's really powerful: it saves you having to upload things a lot of the time. You can just sync with an external document store and it will always be up to date.

Here we can see our model providers; you can configure API-based ones or Ollama, which, like I said, is for running things locally. I'm running Ollama, currently with Granite 4 3B, one of IBM's models. That's our language model, and you can see the actual agent instructions as well, so you can set your system prompt there. In the ingest section you can see I'm running Qwen3 Embedding 0.6B as my embedding model, also on Ollama, and you can set your chunk size and chunk overlap. These last bits are Docling settings, where we say whether we want to capture table structure (yes, currently), run OCR (not right now, that's turned off) and extract picture descriptions (currently off, but that's a useful one if you want to get information out of images as well; adding more models to the pipeline makes things a little slower, so they're off for now). Right at the bottom there are API keys, which is where you set up access to OpenRAG as an API, so you can use your search or your agent within your own application.

But let's drop under the hood even further, where we can customize things even more. You can hit this "edit in LangFlow" button, and that will take you into the actual implementation of your agent. Let's zoom in. Here is our agent, in the chat and generation flow. The agent receives its input from this chat input, which goes through a quick prompt template that adds information about knowledge filters if you've used them. Then the agent has a bunch of tools. Those tools include an MCP server for a URL ingestor, which is actually just another flow within LangFlow, and a calculator, because I think agents and models shouldn't be doing arithmetic: they're language models, not math models, so a calculator is always useful. Finally, the last one is the OpenSearch multi-model embedding component, and the embedding providers are all in here.

We can edit this too. If you go in here, unlock the flow and save it, we can do more with it. For example, we can take this chat input and we might want to put some guardrails in place. We can grab guardrails from our set of components on the left, and then we need to parse the result, so we grab a parser. If the check passes, we pass the result through the parser, get the text out, which is the original text that was sent in, and hand that on to our prompt template. If it fails, we can send an error message to our chat output. And now we've added guardrails to our flow. We can use our local Ollama models in components like this as well, and this is as extensible as LangFlow can be for you.
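On that local-models point: the demo runs both the language model and the embedding model through Ollama. If you haven't driven Ollama from code before, a local embedding call looks roughly like this; the model tag is an assumption, so use whatever embedding model you've actually pulled:

```python
# Sketch of a local embedding call via the Ollama Python client.
# The model tag is illustrative, not a guaranteed OpenRAG default.
import ollama

response = ollama.embed(
    model="qwen3-embedding:0.6b",
    input=["chunk one of a document", "chunk two of a document"],
)
print(len(response.embeddings), len(response.embeddings[0]))
```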
There's one more thing: there's an MCP server available for OpenRAG as well as an API, so you can go and take OpenRAG and hand it to your other agents as well.

So, is RAG solved? Well, that's still up to you, your data and your users, but OpenRAG is built to help. It is an opinionated, but agentic, open-source stack for RAG. As I said before, it combines Docling, OpenSearch and LangFlow to create a powerful baseline RAG system made of open-source components, and it leaves plenty of room for customization within that stack, so you can build out the best RAG for your data and provide the best context to your agents. It's currently at version 0.4.0 and it's ready for you to play with. This link or the QR code will take you to the project. We'd love it if you tried out OpenRAG, dropped a star on the GitHub repo and let us know what you think. It's also open source: the front end is a Next.js application and everything else is a Python app, so if you like the look of OpenRAG, we'd also appreciate your feedback and your contributions to the project itself. The components of OpenRAG are, of course, open as well, so you can get involved with Docling, OpenSearch or LangFlow too. Together we can build a RAG platform that works for everyone, one that gives you choices where you need them and makes good decisions for you where it makes sense, and we can do it with open-source components out in the open. That's what I'd like to see.

Thank you very much for listening. Again, my name's Phil Nash. I'm a developer relations engineer at IBM, trying to help build OpenRAG and this open ecosystem of agentic applications. We can't wait to see what you build with OpenRAG. Thank you very much.