The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked
Channel: aiDotEngineer
Published at: 2026-05-05
YouTube video id: qdh_x-uRs9g
Source: https://www.youtube.com/watch?v=qdh_x-uRs9g
Hello everyone, and welcome to this talk. I'll be speaking about small model inference, a gap we've recognized in the market, what we did about it, and why we took this approach. The background slide here is no accident: if you can guess what it is, I'll prompt you at the end of the slides and you win a little reward, and you can catch me at the break afterwards. So think about it, but also listen to me; don't think too hard.

The story starts with me posting an article on Substack a few months ago that got a bit of traction and a few people interested. In it I explained flash attention, how these models work, and how processes can be memory-bound or compute-bound. I felt really good, because I had gone deep into this, and as someone who has been in AI for a few years, I felt very confident. That was true, but then some people pointed out that I had overlooked one key aspect of what makes these models fast in the real world: inference. As someone who wants to understand things from first principles and understand both the problems and the solutions deeply, I realized I needed to figure this out, to see where I had made my mistake, and, as an AI researcher and engineer, to learn more about inference. I've done a lot of work training and fine-tuning models and doing applied ML and AI, plus a bit of research in academia, but the part about how models run in production, scheduling GPUs, routing, and automation was a blind spot for me. So I realized this was the time to figure it out, learn it, and make it work. And what better way to do that than to actually build something. I decided to join a team at Superlinked, made up of very good infrastructure engineers, and build something around inference with them.

That something is this repo, the Superlinked inference engine, which we have open-sourced; this is its soft launch, in a way, so you can have a look at it later. It provides inference for small models for AI search and document processing. We've tested it with some of our partners, Chroma, Qdrant, and Weaviate, so a lot of the vector DBs, as well as LanceDB, and they've tried it out a bit. It sounds fun and interesting, and it's working. It was the right step for me to figure out where inference truly sits and how it combines with my ML experience.

The three key points I want you to take away from this talk are these. First, why this matters: why inference for AI search and document processing actually matters when you're building agents or workflows that involve agents. Second, what inference is not about: there are some misconceptions about what inference looks like, and it's not all about those things. Third, how we see inference, which I call the yin and yang of model inference: a way of combining model support and infrastructure. So, why does this matter for your agentic workflow?
Well, what you have certainly encountered is context rot. As we probably all know, the research paper from Chroma from some time ago showcases this effect: no matter what you do, quality degrades as context grows. So being able to manage that context is very important, and context management is a useful way to address it. Using small models that preprocess your data, so that you can then actually use your agents and build your workflows, is a very powerful technique. You can also use small models for tool calling to do similar things and tackle the same context management problem. You might say, "Why wouldn't I just use code and grepping?" That's a valid point, but you can still do that, and having your data preprocessed actually makes the grepping and the file systems you build even better.

And it's not just me saying this; this is how the community has responded to the problem. Andrej Karpathy is building graph-based knowledge bases: you can use named entity recognition models to generate ontologies and then build knowledge graphs, or generate your file systems that way. Chroma has also shipped their own model that does this: it preprocesses and filters the input and manages the context to achieve the same goal. And people in the community are building lots of solutions that shrink the context and reduce the token count so agents can be even more effective. We've done this in production as well: we have a use case where the repo I showed, the Superlinked inference engine, is used as a tool-calling solution for taxonomy classification in an e-commerce store. Going directly to tool calling, where the small models are tools that can retrieve and go through your data, is also a powerful way to approach this.

Now let me talk about what inference does not look like. The traditional perspective is, "I'll just throw in more GPUs, get more compute, and inference is solved." But with small models, each model takes up only a small amount of memory. Stella, which is an embedding model, the rerankers, the GLiNER named entity recognition model: each occupies only a few gigabytes. If you provision a GPU for each model, you're wasting a lot of capacity; the GPU sits idle. So for small model inference it's very important to be able to hot-swap models. What we've built is the ability to swap all of these models on one GPU, so that per GPU you get much higher utilization. That lowers your costs, but it also lets you switch between models quickly: if one tool is one reranker and another tool is another model, you can swap them around fast, backed by a least-recently-used eviction policy that we've built in.
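To make that hot-swapping concrete, here is a minimal sketch of what a least-recently-used model cache on a single GPU could look like. This is illustrative only, not the actual implementation in the repo; the model IDs are placeholders and it assumes a CUDA device is available.

```python
from collections import OrderedDict

import torch
from transformers import AutoModel


class GpuModelCache:
    """Keep at most `capacity` small models resident on one GPU, evicting the least recently used."""

    def __init__(self, capacity: int = 3, device: str = "cuda"):
        self.capacity = capacity
        self.device = device
        self._models = OrderedDict()  # model_id -> loaded model, ordered by recency of use

    def get(self, model_id: str):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]

        if len(self._models) >= self.capacity:
            _, evicted = self._models.popitem(last=False)  # drop the coldest model first
            del evicted
            torch.cuda.empty_cache()  # hand its VRAM back before loading the next model

        model = AutoModel.from_pretrained(model_id).to(self.device).eval()
        self._models[model_id] = model
        return model


# An embedder and a reranker sharing one GPU; asking for a third model would evict
# whichever of the two was used least recently.
cache = GpuModelCache(capacity=2)
embedder = cache.get("sentence-transformers/all-MiniLM-L6-v2")  # placeholder model IDs
reranker = cache.get("cross-encoder/ms-marco-MiniLM-L-6-v2")
```

A real engine also has to coordinate eviction with in-flight requests, which is where the routing and queuing described later come in.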
What inference also is not only about is the model server, or only the production setup, on its own. We have solutions for both, but consider what it takes: even if you build on an existing serving framework or wrap an API around it, getting it into production with routing, auto-scaling, and monitoring with Prometheus metrics and Grafana means writing all of that code yourself. And on the market right now there is no open-source solution that takes you from creating the model inference all the way to productionizing it at this kind of scale. That's the gap we wanted to fill as well, so we've included a lot around routing, auto-scaling, queuing mechanisms, and provisioning GPUs.

So I've covered what inference isn't and why this is important; now let me tell you what inference actually is about. I call it the yin and yang of inference, because I think a holistic approach has to combine two key things.

First, the yin: model support. Your inference is worthless if you're not supporting the right models or offering enough breadth of options for your users. On Hugging Face there are millions of open-source models; that count is from March, and by now it's probably closer to three million. Open source is moving very quickly, both in size and in accuracy, so you need to support these models, because people want to use them, people want to work with open source, and the performance keeps getting better. It's not even a sacrifice anymore: if you look at MTEB or other benchmarks, you can see that for very narrow tasks, open-source models are beating managed services. We see this even with more general-purpose models, like what the Gemma team has done: they've released very low-parameter models with Elo scores higher than much, much bigger models. Open-source small models are very relevant, and you need infrastructure to support them.

So we decided to do exactly that: build support for hundreds of models. However, that's not as straightforward as it might sound, because all of these models have different runtimes. If you take BERT and Qwen, for example, they have different implementations of flash attention and different positional embeddings, and you need to adjust for this to make it general and applicable to your case. There is no universal engine that can handle BERT, Qwen, and ModernBERT, because their architectures are different. So what we've done is set up a way of re-implementing the forward pass: adapting attention, doing padding where needed so we have variable-length attention as well, and fusing the query, key, and value projections. This is an underrated component, especially if you want to support models like ColBERT, a late-interaction model that outputs multiple vectors, or cross-encoders and rerankers that do not output a vector at all, just scores. It's very important to get this right.

One example is this slide, a comparison of a few models that differ in five aspects. Normalization is done differently in BERT and Qwen, so that needs to be accounted for, and it's different again in ColBERT. The query, key, and value projections can be fused in some models; in Qwen they cannot, because of grouped-query attention, so that's another problem. Positional embeddings are also different: BERT uses an absolute lookup, while Qwen uses rotary positional embeddings. These are problems people may not think about straight away, but if you want to support a lot of models, there has to be a consistent way of handling them.
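Some of those architectural differences are visible directly in the published model configs. Here's a quick sketch using Hugging Face transformers; the two checkpoints are just examples I'm picking to stand in for the BERT and Qwen families.

```python
from transformers import AutoConfig

# Example checkpoints standing in for the BERT and Qwen families.
for model_id in ["bert-base-uncased", "Qwen/Qwen2-0.5B"]:
    cfg = AutoConfig.from_pretrained(model_id)
    print(
        model_id,
        "| positions:", getattr(cfg, "position_embedding_type", "rotary (RoPE)"),
        "| attention heads:", cfg.num_attention_heads,
        "| key/value heads:", getattr(cfg, "num_key_value_heads", cfg.num_attention_heads),
    )

# BERT reports absolute (learned lookup) positions and as many key/value heads as query heads,
# so a single fused QKV projection works; Qwen2 uses rotary embeddings and far fewer key/value
# heads (grouped-query attention), so the same fused-QKV forward pass cannot simply be reused.
```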
We've set this up, working with agents as well as with humans, to support these models and re-implement the forward pass so that inference on them is efficient. We've also done the work to use variable-length flash attention, so that padding doesn't create waste. With token-based batching, one request can have a low token count and another a much higher one, and if both are padded to the longer length, you're burning compute on empty tokens (see the short sketch further below). Adapting this makes the models run better and faster, and it's a key differentiator in how we approach inference.

The other part, the yang, is infrastructure. This is the complex plot, the deep dive; we have it on our website and in our repo, and I won't go into every component here. At the top are the three primitives of our API: encode, score, and extract. Then the infrastructure layer kicks in. In summary, behind those primitives we have a router and a queuing mechanism that adjust depending on load, plus different pools of GPUs and resources we use to distribute the workload; we use spot instances as well as bigger GPUs. It's about being able to provision the hardware and having the metrics to auto-scale it, so we do KEDA auto-scaling driven by Prometheus metrics in order to switch models around rather than keep GPUs idle and waste resources. The key driver is that we want you to have the cluster as well as the model support, not just one or the other, where you then have to merge them yourself and write all that code. We want to give you the whole end-to-end thing so you can work with these models quickly: the models are basically just a config you can switch around, then you run Terraform apply and that's it. We've also published Helm charts and Docker images.

And that's how I learned my lesson: we built Sie. This is our soft launch, in a way. We've open-sourced both the model inference I talked about, with the adapted forward passes and the attention work, and the cluster, so that you can use it straight away without having to think about hardware, provisioning GPUs, and all the other problems that come up in the real world. You can scan the QR code to look at the repo; it's called Sie, S-I-E, and the company is Superlinked.
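Here is that padding-waste sketch. The request lengths are made up, and the cu_seqlens layout shown is just the standard way variable-length flash-attention kernels describe packed sequences; it illustrates the idea rather than the engine's internals.

```python
import torch
import torch.nn.functional as F

# Three requests of very different lengths in one token-based batch (made-up numbers).
seq_lens = torch.tensor([12, 480, 37])

# Naive padding: every sequence is padded up to the longest one in the batch.
padded_tokens = len(seq_lens) * int(seq_lens.max())  # 3 * 480 = 1440 token slots
real_tokens = int(seq_lens.sum())                     # 529 tokens that actually carry content

# Variable-length attention instead packs the sequences back to back and only needs the
# cumulative boundaries (the `cu_seqlens` layout used by varlen flash-attention kernels).
cu_seqlens = F.pad(seq_lens.cumsum(dim=0), (1, 0))    # tensor([  0,  12, 492, 529])

print(f"padded batch: {padded_tokens} token slots, packed batch: {real_tokens} tokens")
print(f"compute spent on padding: {1 - real_tokens / padded_tokens:.0%}")  # roughly 63% wasted
print("sequence boundaries:", cu_seqlens.tolist())
```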
And now, back to the background slide; a quick reminder on that. Does anyone in the audience have an idea? I'll reveal it on the next slide, but I don't want to do it too quickly if someone knows what it is. Think of it as foundational machine learning knowledge: I was talking about attention and all that, so it's a bit about embeddings, but also about how transformers work.
>> [Inaudible audience guess.]
Okay, no reward for you, I'm afraid. Sorry.
>> I don't remember exactly, but it was one of the ones like ColBERT embeddings, clustered over one area. It wasn't really using the embedding space, because of the way they were doing the positional encoding.
I will take that as a correct answer. It was about positional encodings, so yes, congratulations. This is a vector visualization of positional encoding: it's how positions get encoded in the transformer, and the encoding is sinusoidal, which is why you get this pattern in the embeddings. Thank you very much.
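For anyone who wants to reproduce the pattern from the background slide, here is a minimal sketch of the classic sinusoidal positional encoding from the original transformer paper; plotting the matrix as a heatmap gives the striped pattern shown on the slide.

```python
import numpy as np
import matplotlib.pyplot as plt


def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Fixed positional encoding from "Attention Is All You Need"."""
    positions = np.arange(num_positions)[:, None]           # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # one frequency per dimension pair
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i / d_model)

    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe


# Each row is a position, each column an embedding dimension; the interleaved sines and
# cosines at different frequencies produce the wave-like pattern from the slide.
pe = sinusoidal_positional_encoding(num_positions=100, d_model=128)
plt.imshow(pe, cmap="RdBu", aspect="auto")
plt.xlabel("embedding dimension")
plt.ylabel("position")
plt.title("Sinusoidal positional encoding")
plt.show()
```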