The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked

Channel: aiDotEngineer

Published at: 2026-05-05

YouTube video id: qdh_x-uRs9g

Source: https://www.youtube.com/watch?v=qdh_x-uRs9g

Hello everyone. Welcome to this talk. I'll be speaking about small model inference, a gap we've recognized in the market, what we did about it, and why we took this approach. As you can see, this background slide is no accident. If you can guess what it is (I'll prompt you again at the end of the slides), you win a little reward, and you can catch me at the break afterwards. So think about it, but also listen to me; don't think too hard.
The story starts with me posting an article on Substack a few months ago that got a little bit of traction and a few people interested. I explained flash attention, how models work, how processes can be memory-bound or compute-bound, and I felt really good because I had gone deep into it, and as a person who has been in AI for a few years, I felt very confident. And that was true, but then some people pointed out that I had actually overlooked one key aspect of what makes these models fast in the real world. That aspect was inference.
As someone who wants to understand things from first principles and to understand problems and solutions deeply, I realized: okay, I need to figure this out. I need to know where I made my mistake, and as an AI researcher and engineer, I need to learn more about inference.
I've done a lot of work with VLMs, training models, fine-tuning, doing applied ML and AI, and a bit of research in academia as well.
But this part around how models run in production, scheduling GPUs, routing, and automation was a bit of a blind spot for me. So I realized: okay, this is the time to figure it out, learn it, and make it work. And what better way to do that than to actually build stuff. So I decided to join a team at Superlinked made up of very good infrastructure engineers and build something around inference with them. That something is this repo, the Superlinked inference engine, which we have open-sourced, and this is the soft launch I'm doing today. You can have a look at it later.
Basically, it's inference for small models around AI search and document processing. We've tested it out, as you can see, with some of our partners: Chroma, Qdrant, Weaviate, a lot of the vector DBs, as well as LanceDB. They've tried it out a bit, found it interesting, and it's working. And it was the right step for me to figure out where inference truly sits and how it combines with my ML experience.
The three key points I want you to take away from this talk are these. First, why this matters: why doing inference for AI search and document processing actually matters when you're building agents, or building workflows that involve agents. Second, what inference is not about: there are some misconceptions about what inference looks like, and it's not all about those things. And third, how we see inference, which I call the yin and yang of model inference: a way of combining model support and infrastructure.
So, why does this matter for your agentic workflow? Well, you have certainly encountered context rot. As we probably all know, the research paper from Chroma from some time ago showcases this effect: no matter what you do, quality degrades as context increases. So being able to manage that context is very important, and a useful way to solve this is to use small models that preprocess your data so that you can then actually use your agents and build your workflows. That's a very powerful technique.
You can also use small models for tool calling to do similar things and tackle this problem of context management. You might say, okay, why would I not just use code and grepping? That's a valid point. But you can still do that, and having your data preprocessed actually makes the grepping and file systems you build even better.
And it's not just me saying this; this is how the community has responded to the problem. Andrej Karpathy is building graph-based knowledge bases. For example, you can use named entity recognition models to generate ontologies and then build knowledge graphs, or you generate your file systems that way (there's a small sketch of this below). Chroma have also shipped their own model that does this: it preprocesses and filters the input and manages the context to tackle the same problem.
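To make the NER-to-knowledge-graph idea concrete, here is a minimal sketch using the GLiNER model family mentioned later in the talk. The checkpoint name, label set, and the simple co-occurrence heuristic are my own illustrative assumptions, not the speaker's actual pipeline.

```python
# Minimal sketch: extract entities with a small NER model (GLiNER) and
# accumulate them into edges for a simple knowledge graph.
# The checkpoint name, label set, and co-occurrence rule are assumptions.
from itertools import combinations
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
labels = ["person", "organization", "product", "location"]

def extract_edges(doc: str):
    """Return co-occurrence edges between entities found in one document."""
    entities = model.predict_entities(doc, labels, threshold=0.5)
    names = sorted({e["text"] for e in entities})
    # Every pair of entities mentioned in the same document becomes an edge.
    return list(combinations(names, 2))

docs = ["Superlinked released an inference engine for vector search on AWS."]
graph_edges = [edge for doc in docs for edge in extract_edges(doc)]
print(graph_edges)
```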
There are also people in the community building a lot of solutions that shrink this context and reduce the token count so that agents can be even more effective. We've done this in production as well. We have a use case where the repo I showed, the Superlinked inference engine, was used as a tool-calling solution for taxonomy classification for an e-commerce store. So going directly with tool calling, where the small models are tools that can retrieve and go through your data, is also a powerful way to approach this.
So, now I'll talk about inference and what inference doesn't look like. The traditional perspective is maybe: okay, I'll just throw in more GPUs, get more compute, all good, I've solved inference. However, with small models, each model takes up only a small amount of memory. You can see, for example, Stella, which is an embedding model, various rerankers, and the GLiNER named entity recognition model: they each occupy only a few gigabytes of memory.
And if you provision a GPU for each model, you're wasting a lot of capacity; your GPU stays idle and isn't used. So in this case of small model inference, it's very important to be able to hot-swap models. What we've done is build the ability to swap all of these models on one GPU, so that per GPU you get much higher utilization. This lowers your costs, but it also enables you to hot-swap and switch between models quickly. If you want one tool that's a reranker and another tool that's a different model, you can switch them around quickly, and we have a least-recently-used eviction policy built in.
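Here is a minimal sketch of the LRU hot-swap idea, not the repo's actual implementation: the load/unload helpers and the model-count budget are assumptions for illustration.

```python
# Minimal sketch of LRU hot-swapping of small models on one GPU.
# load_to_gpu / unload and the capacity of 4 models are illustrative
# assumptions, not the repo's actual implementation.
from collections import OrderedDict

class ModelCache:
    def __init__(self, load_to_gpu, unload, max_models: int = 4):
        self._load = load_to_gpu          # str -> model object on the GPU
        self._unload = unload             # frees that model's GPU memory
        self._max = max_models
        self._cache = OrderedDict()       # model_id -> loaded model

    def get(self, model_id: str):
        if model_id in self._cache:
            self._cache.move_to_end(model_id)   # mark as most recently used
            return self._cache[model_id]
        if len(self._cache) >= self._max:
            evicted_id, evicted = self._cache.popitem(last=False)  # evict LRU
            self._unload(evicted)
        self._cache[model_id] = self._load(model_id)
        return self._cache[model_id]
```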
What inference is also not about is only the server, or only the production side. We have solutions for both, but if you build something yourself, for example with TEI or vLLM, or even with some API wrapper, getting it to production where you actually do routing, auto-scaling, and monitoring with Prometheus metrics and Grafana means writing all that code yourself. And on the market currently, there is no open-source solution that takes you from creating the model inference to actually productionizing it at that scale. That's why we wanted to fill this gap as well: we've included a lot around routing, auto-scaling, queuing mechanisms, and provisioning GPUs.
So, I've talked about what inference isn't and why this is important; now let me tell you what inference is actually about. I call this the yin and yang of inference because I feel it's a holistic approach that has to combine two key things. First, the yin is model support. Your inference is worthless if you're not supporting the right models or not offering enough breadth of options for your users.
And open-source models: on Hugging Face there are currently millions of models. That figure is from March; by now it might be closer to 3 million. Open source is moving very quickly, both in size and in accuracy, so you need to support these models because people want to use them. People want to work with open source, and the performance keeps getting better and better, so it's not even a sacrifice anymore; for specific tasks it's an advantage. If you look at MTEB or other benchmarks, you can see that for very narrow tasks, open-source models are beating managed services.
We see this even with more general-purpose models, like what Gemma have done: they've released very low-parameter models with Elo scores higher than much, much bigger models. So open-source small models are very relevant, and you need infrastructure to support them.
And we decided to do exactly that. Okay, let's do it: we'll build and support hundreds of models. However, that's not as straightforward as it might sound, because all of these models have different runtimes. For example, if you use BERT and Qwen as models, they have different implementations of flash attention and different positional embeddings, and you need to adjust for this to make the engine applicable to your case and more general. There is no universal engine that can handle BERT, Qwen, and ModernBERT, because their architectures are different. So what we've done is set up a way of re-implementing the forward pass in order to adapt attention and to do padding only where needed, so we support variable-length attention as well.
We also work with fusion of the query, key, and value projections. This is an underrated component, especially once you look at supporting models like ColBERT, which outputs multiple vectors (late-interaction models), as well as cross-encoders and rerankers that don't output a vector at all but output scores.
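As a rough illustration of the QKV fusion point, here is a small sketch contrasting a fused projection with the grouped-query-attention case; the dimensions are my own illustrative assumptions.

```python
# Minimal sketch of fusing the Q, K, V projections into one matmul
# (as in BERT-style attention) versus keeping them separate when
# grouped-query attention gives K/V fewer heads (as in Qwen-style models).
# Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

d_model, n_heads, n_kv_heads = 768, 12, 4
head_dim = d_model // n_heads

# Fused: one projection, then split into equal thirds.
fused_qkv = nn.Linear(d_model, 3 * d_model)
x = torch.randn(2, 16, d_model)                 # (batch, seq, d_model)
q, k, v = fused_qkv(x).chunk(3, dim=-1)

# Grouped-query attention: K and V project to fewer heads, so the
# projections have different output sizes and can't be split into
# equal thirds like the fused BERT-style projection.
q_proj = nn.Linear(d_model, n_heads * head_dim)
kv_proj = nn.Linear(d_model, 2 * n_kv_heads * head_dim)
q = q_proj(x)
k, v = kv_proj(x).chunk(2, dim=-1)
```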
So it's very important to figure this out, and one example is this slide, a comparison of a few models that differ in all five of these aspects. Normalization is done differently in BERT and Qwen, so that needs to be accounted for, and in ColBERT it's different again. The query, key, and values can be fused in some models, but in others, Qwen for example, they cannot, because there is grouped-query attention; that's another problem. The positional embeddings are also different: in BERT you can do an absolute lookup, whereas Qwen uses rotary positional embeddings. These are problems that people maybe don't think about straight away, but if you want to support a lot of models, there needs to be a consistent way of handling them.
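To make the positional-embedding difference concrete, here is a minimal sketch of an absolute lookup table next to a rotary (RoPE) rotation; the sizes and the interleaved RoPE variant shown are illustrative assumptions, not any particular model's implementation.

```python
# Minimal sketch of the two positional-embedding styles mentioned:
# an absolute lookup table (BERT-style) versus rotary embeddings (RoPE,
# Qwen-style), which rotate query/key feature pairs by a position-
# dependent angle. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

d_model, max_pos = 64, 512

# BERT-style: positions index into a learned embedding table that is
# added to the token embeddings.
abs_pos_emb = nn.Embedding(max_pos, d_model)
positions = torch.arange(16)
x = torch.randn(1, 16, d_model) + abs_pos_emb(positions)

# RoPE-style: nothing is added; instead each (even, odd) feature pair of
# q and k is rotated by an angle that grows with position.
def rope(t: torch.Tensor) -> torch.Tensor:
    seq, dim = t.shape[-2], t.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.arange(seq).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    t1, t2 = t[..., 0::2], t[..., 1::2]
    rotated = torch.stack((t1 * cos - t2 * sin, t1 * sin + t2 * cos), dim=-1)
    return rotated.flatten(-2)

q = rope(torch.randn(1, 16, d_model))
```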
So we've set this up, working with agents as well as with humans, to be able to support these models and re-implement the forward pass so that their inference is efficient. We've also implemented variable-length flash attention so that there's no padding waste. What happens with token-based batching is that one request may have a low token count and another a much higher one, and if both are padded to the higher length, you're wasting compute on empty tokens. So you need to adapt this to make the models work even better and quicker.
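Here is a small sketch of the padding-waste point and of packing variable-length requests with cumulative sequence lengths, which is the input shape variable-length flash-attention kernels consume; the request lengths are illustrative assumptions.

```python
# Minimal sketch of packing variable-length requests instead of padding
# them to the longest one. The packed tokens plus cumulative sequence
# lengths (cu_seqlens) are what varlen flash-attention kernels consume.
import torch

# Three requests of very different lengths (illustrative).
seqs = [torch.randint(0, 30000, (n,)) for n in (12, 512, 48)]

# Padded batching: every request is padded to the longest (512) tokens.
padded_tokens = len(seqs) * max(len(s) for s in seqs)          # 1536
real_tokens = sum(len(s) for s in seqs)                        # 572
print(f"wasted tokens with padding: {padded_tokens - real_tokens}")

# Packed batching: concatenate the tokens and record the boundaries.
packed = torch.cat(seqs)                                       # (572,)
cu_seqlens = torch.tensor([0] + [len(s) for s in seqs]).cumsum(0)
max_seqlen = max(len(s) for s in seqs)
# A varlen attention kernel (e.g. flash_attn_varlen_func) takes the packed
# q/k/v plus cu_seqlens and max_seqlen, so no compute is spent on padding.
```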
And that's a key differentiator in our approach to inference.
The other part, the yang, is infrastructure. This is the complex diagram, the deep dive; we have it on our website and in our repo, and I won't go into each component here. But at the top are the three primitives of our API: encode, score, and extract.
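The talk doesn't show the client interface, so here is a hypothetical sketch of what calling the three primitives might look like over HTTP; the endpoint paths, payload fields, model names, and base URL are all assumptions for illustration, not the repo's documented API.

```python
# Hypothetical sketch of calling the three primitives (encode, score,
# extract) over HTTP. Endpoint paths, payload fields, and the base URL
# are assumptions, not the repo's documented API.
import requests

BASE = "http://localhost:8080"

# encode: turn text into embedding vectors with a small embedding model.
emb = requests.post(f"{BASE}/encode",
                    json={"model": "stella",
                          "inputs": ["small models, big impact"]}).json()

# score: rerank candidate passages against a query with a cross-encoder.
scores = requests.post(f"{BASE}/score",
                       json={"model": "reranker", "query": "gpu utilization",
                             "candidates": ["hot-swap models",
                                            "provision one gpu each"]}).json()

# extract: pull structured entities out of a document with an NER model.
entities = requests.post(f"{BASE}/extract",
                         json={"model": "gliner",
                               "inputs": ["Superlinked released SIE."]}).json()
```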
Then the infrastructure layer kicks in, and this is the summary of it. We have the three primitives of the API, then a router and a queuing mechanism that adjust depending on the load, and different pools of GPUs and resources that we use to distribute the workload; we use spot instances, but also bigger GPUs. It's about being able to provision the hardware and also having the metrics to auto-scale it. We're doing KEDA auto-scaling driven by Prometheus metrics, so we can switch models around, keep GPUs from sitting idle, and avoid wasting resources. The key driver here is that we want you to have the cluster as well as the model support, not just one or the other where you then have to merge them together and write all that code yourself, but the whole end-to-end thing, so you can work with these models quickly. The models are basically just a config that you can switch around, then run terraform apply, and that's it. We've also published Helm charts and Docker images.
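As a rough sketch of the metrics-driven scaling idea, here is one way a service could expose a queue-depth gauge that a KEDA Prometheus scaler could act on; the metric name, port, and queue object are assumptions, not taken from the repo.

```python
# Minimal sketch of exposing a queue-depth metric that a KEDA Prometheus
# scaler could use to add or remove GPU workers. The metric name, port,
# and queue are illustrative assumptions.
import time
from prometheus_client import Gauge, start_http_server

pending_requests = Gauge("inference_queue_depth",
                         "Requests waiting for a GPU worker")

def serve_metrics(queue):
    start_http_server(9090)            # a Prometheus scrape target
    while True:
        pending_requests.set(queue.qsize())
        time.sleep(5)
```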
And that's how I learned my lesson: we've built Sie. This is our soft launch, in a way. We've open-sourced both the model inference I talked about, with adapting the forward pass, working with attention, and re-implementing different aspects of the models, and also the cluster, so that you can use it straight away without having to think about hardware, provisioning GPUs, and all of the problems that arise in the real world. You can scan the QR code to look at the repo; it's called Sie, spelled S-I-E, and the company is Superlinked.
And now we also have the background slide again, so a quick reminder on that. Does anyone in the audience have any ideas? I'll reveal it in the next slide; I don't want to do it too quickly if anyone knows what it is. Think of it as a bit of foundational machine learning knowledge. I was talking about attention and all that, so it's a bit about embeddings, but also about how transformers work.
Okay, no reward for you, I guess, sadly. Or, yeah, sorry.
>> I don't remember exactly, but it was one of the ones like ColBERT embeddings, for the clustering of one area. It wasn't basically using the embedding space. Okay.
>> Because of the way they were doing the positional encoding. How much of that...
So, I will take that as a correct answer. It was around positional encodings, yes. Congrats. So this is a vector visualization of positional encodings. This is done when transformers are trained; this is how positions are encoded, and it's sinusoidal. That's why you get this pattern in the embeddings.
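For reference, here is a minimal sketch of the sinusoidal positional encodings behind a visualization like the background slide, following the original "Attention Is All You Need" formulation; the dimensions are illustrative assumptions.

```python
# Minimal sketch of sinusoidal positional encodings: plotting the
# resulting matrix gives the striped pattern on the background slide.
import numpy as np

def sinusoidal_positions(n_positions: int = 128, d_model: int = 64) -> np.ndarray:
    positions = np.arange(n_positions)[:, None]                 # (pos, 1)
    dims = np.arange(0, d_model, 2)[None, :]                    # (1, d/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                                # even dims
    pe[:, 1::2] = np.cos(angles)                                # odd dims
    return pe

pe = sinusoidal_positions()
```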
Thank you very much.