Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI

Channel: aiDotEngineer

Published at: 2026-04-29

YouTube video id: fLUtUkqYHnQ

Source: https://www.youtube.com/watch?v=fLUtUkqYHnQ

Hi everyone. My name is Maxime Labonne.
In this presentation I want to talk
about the lessons I've learned
pre-training small models.
So for context, I work at Liquid AI as head of pre-training. At Liquid, we mostly focus on edge models for on-device deployment, and as you can see here, we have models from 350 million parameters to 24 billion parameters. So these are very, very small. Yesterday we released our new VLM 450M, and the week before we released the new version of the 350M text model.
So this is what we do. We work across
text, vision and audio. And yeah, the
models are available on Hugging Face if
you want to try them out.
In this presentation I want to talk about what separates small models from big models, and there are three main characteristics I want to cover. First of all, small models are memory bound, because the hardware is what it is, right? On a phone, in a car, etc., we can't really use super big models, which is why we try to keep the size quite small. Because of that, they have low knowledge capacity compared to bigger models.

Then, the models are task-focused, which is great, because if you have low knowledge capacity you can at least focus on doing one thing very well. That means they are usually not general-purpose chatbots like ChatGPT. They are a lot narrower in focus, and they can do something like summarization or tool use very, very well. So that's the second aspect.

And the final one is that these models are very latency-sensitive, which means you need very high throughput. All these characteristics are very important, and we'll see in this presentation how they play with each other and how we can do better.
But the main lesson I want you to take away from this presentation is that small models are not just scaled-down versions of bigger models. They have their own unique challenges, and we will see how we address them. The first thing I want to talk about is the model architecture, because there are a lot of interesting things we can do here for edge models.
I first want to look at Gemma 3 270M and Gemma 2.5 0.8B. These models are the smallest versions of their respective families, and you can see that both of them adopt a hybrid architecture: Gemma 3 mixes sliding-window attention and GQA, while Gemma 2.5 has a newer architecture with gated DeltaNet and gated attention. This is great because it's a lot faster. But what I'm interested in here is actually the embedding layer. If you look at the size of the embedding layer compared to all the parameters of the model, you see that Gemma 3 270M is mostly an embedding layer: it's 63% of the total parameters. And even for Gemma 2.5 0.8B, it's still 29% of the parameters.

That's not super efficient, because the effective parameters, the parameters that are really used for reasoning, for knowledge capacity and all of that, are not the embedding parameters; they're the rest. So the effective size is actually a lot smaller, and it means you could squeeze more reasoning and more performance out of the same memory footprint.
The reason they do that is because they use distillation to train the models: these are the student models, and the teacher models have huge vocabulary sizes, which is why we end up with these super big embedding layers.
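As a quick sanity check on those percentages, here is a back-of-the-envelope calculation. It assumes Gemma 3 270M's published vocabulary size of 262,144 tokens and hidden width of 640; the point is just how quickly a large vocabulary dominates a small model:

```python
# Rough share of parameters spent on the embedding layer.
# vocab_size and hidden_dim are Gemma 3 270M's published figures.
vocab_size = 262_144
hidden_dim = 640
total_params = 270e6

embedding_params = vocab_size * hidden_dim            # ~168M parameters
share = embedding_params / total_params
print(f"embedding: {embedding_params/1e6:.0f}M params ({share:.0%} of total)")
# -> embedding: 168M params (62% of total), matching the ~63% cited above
```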
All right, let's look at the LFM 2 architecture now. As you can see, the LFM 2 architecture is actually not that different in terms of layers: we also have a hybrid architecture, this time with short convolutions and GQA.

First, you can see that the embedding layer is a lot smaller compared to the others, so the effective parameters make up something like 90% of the total, which is great. Now I want to talk about how we created this architecture.
We did on-device profiling. Instead of doing purely theoretical work, we decided to say: okay, let's actually implement it on the target hardware. We had two target devices here, and we wanted to see how the architecture performs in real life, so we could optimize it and find the right
operators. The thing we found is this gated short-convolution block that you can see here. Why it's nice is that it's very, very fast: short convs are a lot faster than all the alternatives. You can see it here compared to the sliding-window attention from Gemma 3, the gated DeltaNet from Gemma 2.5, gated linear attention, and grouped-query attention. The cost ratio is really in favor of the short conv, which is great because, as we said, these models are very latency-sensitive. So this is exactly what we want.

That's quite theoretical, but if we look in practice and profile the inference of these models, you can see the results here on two CPUs, the AMD Ryzen AI Max+ 395 and the Samsung Galaxy S25 Ultra.
All these models don't necessarily have the same size, but it gives you a rough picture, and you can see that the short convs really allow the LFM 2 architecture to be a lot faster and also use less memory. So that's great.
And the same holds on GPU. It's not just for CPU: on GPU as well, the model sustains a lot of throughput even at very high concurrency levels.
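To make the gated short-convolution block concrete, here is a minimal PyTorch sketch in the spirit of what's described above. The exact projections, kernel size, and gating layout are illustrative assumptions, not Liquid's implementation:

```python
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    """Sketch of a double-gated short-convolution block (LFM 2-style).
    Layer sizes and gating details are illustrative."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)        # produces B, C, and x
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim,             # depthwise: cheap on CPU
                              padding=kernel_size - 1)  # causal padding
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        b_gate, c_gate, h = self.in_proj(x).chunk(3, dim=-1)
        h = b_gate * h                                # input gate
        h = self.conv(h.transpose(1, 2))              # (batch, dim, seq + k - 1)
        h = h[..., : x.shape[1]].transpose(1, 2)      # trim to causal length
        h = c_gate * h                                # output gate
        return self.out_proj(h)

# Tiny smoke test
block = GatedShortConv(dim=64)
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```

The depthwise convolution is the key design choice here: its cost grows linearly with sequence length and it needs no KV cache, which is what makes it attractive on memory-bound edge hardware.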
All right, let's talk a bit about training now. The LFM 2.5 training recipe is quite similar to what you can find elsewhere in terms of stages: we have pre- and mid-training on 28 trillion tokens, then supervised fine-tuning, preference alignment, and reinforcement learning.
Here you can see I'm talking about 28 trillion tokens, and I said that we released a 350M-parameter model last week. So yes, we pre-trained a 350 million parameter model on 28 trillion tokens. If you're familiar with Chinchilla scaling laws, that might sound a bit weird, because we're supposed to be compute optimal at maybe 7 billion tokens for a model this size, not even 1 trillion. But that's actually not the case in practice: we see that performance still grows as you scale the number of pre-training tokens. There was a super interesting paper by Roberts et al. published last week about test-time scaling laws, and you can see here how LFM 2.5 350M compares: the Chinchilla scaling laws are here, and the new ones they propose are here. According to their laws, we actually did not pre-train the model on enough tokens; we should pre-train on even more to be optimal.
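To put numbers on that gap, here is the back-of-the-envelope version, using Chinchilla's usual rule of thumb of roughly 20 training tokens per parameter:

```python
# Chinchilla's rule of thumb: ~20 training tokens per parameter.
params = 350e6
chinchilla_tokens = 20 * params          # ~7e9, i.e. ~7B tokens
actual_tokens = 28e12                    # 28T tokens used for LFM 2.5 350M

print(f"Chinchilla-optimal: ~{chinchilla_tokens/1e9:.0f}B tokens")
print(f"Actually trained:   {actual_tokens/1e12:.0f}T tokens "
      f"(~{actual_tokens/chinchilla_tokens:,.0f}x over)")
# -> ~7B vs 28T: roughly 4,000x past the compute-optimal point
```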
But this is cool, because more pre-training works, and it works even at the smallest scale, which is great because these models are a lot cheaper to train than much bigger ones. And here you can see a comparison, and not just of pre-training: these are the post-trained models. You can see that the LFM 2.5 model is significantly better than the previous version, LFM 2 350M, on a lot of different benchmarks.
You have knowledge with GPQA Diamond, instruction following with IFBench, CaseReportBench, which is data extraction, and a lot of tool use with BFCL and τ²-bench. And this model is only 350 million parameters. What we wanted was a model that is very, very good at data extraction and at tool use; for the rest, if it's not the best model at code, it doesn't matter, because people don't use it that way anyway. Same for math. I think it's really nice to target some capabilities rather than trying to be average at everything.
All right, let's talk exactly about that: post-training small and big models, what's the difference? We have pretty much the same stages, so it's not really in the stages that we see a difference; it's more about how you do it. For supervised fine-tuning, it's better if you're actually quite narrow and focus on specific tasks. That's true for our general-purpose training, but it's also true if you do the fine-tuning yourself: you can take one of these models on Hugging Face and just fine-tune it for your use case. For example, if you have a use case with a particular function that you want to call, this is great, this is an excellent use case. The more narrowly you can define or design it, the better.
Then we have preference alignment. During post-training, we have our on-policy, length-normalized direct preference optimization algorithm that we quite like. Preference alignment is very nice because it brings you general improvements. It's not just about benchmarks: after preference alignment, the model is just better overall, it sounds better, and it's really nice to be able to improve it across the board.
And finally, we have reinforcement learning. Reinforcement learning is extremely efficient, even at very small scale; it's a really important technique that we use everywhere. The main thing is that it's very narrow in terms of focus, so you want to have as many environments and as many tasks as possible, and make sure the model generalizes well thanks to this.
Then, small models in particular are quite sensitive to cold-start SFT data. If you have a particular task in reinforcement learning, it's always good to have similar samples and a similar task in your supervised fine-tuning mixture. This also gives you useful feedback: if you see during reinforcement learning that something doesn't train very well, it's probably because you're missing some cold-start SFT data. Maybe the task is too complex; there are different possible reasons. But you can start again from the supervised fine-tuning stage, add your data, and then see if it improves anything.
All right, but there's a problem with small language models that you might have encountered even with bigger ones, and this is doom looping. The problem with doom looping is that, as you can see here, the model starts repeating a sequence of words over and over and over again, and it just never stops.

This is a problem all the time, but it's particularly a problem if you have small models, if you have reasoning models, and if you have complex tasks, where the task is basically too complex for the model. It can happen anywhere, but the worst case is when you have all three at the same time: if you have a tiny reasoning model on a super difficult math task, that's the perfect recipe for a lot of doom loops. So this is a unique challenge that you find with small models, and let me give you a concrete example.
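As a concrete illustration of what a doom loop looks like to a detector, here is a simple heuristic sketch. The talk doesn't say how Liquid measures its doom-loop ratio, so the n-gram diversity check below is just one plausible way to flag it:

```python
def looks_like_doom_loop(text: str, n: int = 8, window: int = 400,
                         threshold: float = 0.5) -> bool:
    """Heuristic doom-loop check: flag a generation if the trailing
    window of tokens is dominated by a few repeated n-grams.
    (Illustrative; not the talk's actual measurement method.)"""
    tokens = text.split()[-window:]
    if len(tokens) < n * 2:
        return False
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    unique_ratio = len(set(ngrams)) / len(ngrams)
    return unique_ratio < threshold      # low diversity => likely looping

print(looks_like_doom_loop("the answer is " * 200))                  # True
print(looks_like_doom_loop("a perfectly normal, varied sentence."))  # False
```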
Here is how we solve it. The first thing is that we address it during the preference alignment stage, in particular in the on-policy data generation that we do. Here you can see the pipeline we use for on-policy data generation for preference alignment. We start with prompts, around 1 million samples to give you a rough idea. Then we use the policy model, the model we want to train, with temperature sampling, and we generate five rollouts. Because we use temperature sampling, these rollouts tend to be a lot more diverse, and we expect that not all of them will have doom loops: at least one should not doom loop, right? On the other hand, we generate one extra rollout with the policy model at temperature zero, and this is the one we expect to doom loop.

Then we give everything to an LLM jury to score all the rollouts. We pick the one with the highest score as the chosen answer and the one with the lowest score as the rejected answer. The idea is that if there is a doom loop in there, the doom-looping response will be rejected, so during preference alignment we train the model not to doom loop. This is quite effective. So that's solution number one.
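Here is a minimal sketch of that pipeline, where `policy.generate` and `jury.score` are hypothetical stand-ins for the inference and LLM-jury calls:

```python
def build_preference_pair(prompt, policy, jury, k: int = 5):
    """Build one (chosen, rejected) pair for preference alignment.
    `policy.generate` and `jury.score` are hypothetical stand-ins."""
    # k diverse rollouts with temperature sampling: at least one of
    # these should avoid a doom loop.
    rollouts = [policy.generate(prompt, temperature=0.8) for _ in range(k)]
    # One greedy rollout (temperature 0): the likeliest one to doom-loop.
    rollouts.append(policy.generate(prompt, temperature=0.0))

    # The LLM jury scores everything; a doom-looping rollout should
    # score lowest and end up as the rejected answer.
    scored = sorted(rollouts, key=lambda r: jury.score(prompt, r))
    rejected, chosen = scored[0], scored[-1]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```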
Then we have solution number two, which is reinforcement learning with verifiable rewards, plus a bit of n-gram repetition penalty. Reinforcement learning with verifiable rewards is a very nice way to address this issue, because if you have a question, like a math question like this one, you're going to try to extract the final answer, and if there is no final answer, the model won't get a positive reward. So doom loops are already being punished during reinforcement learning with verifiable rewards. On top of that, you can add a bit of repetition penalty to make sure the model generates fewer doom loops in general. And, same thing, we also use temperature sampling here, so the rollouts are quite diverse, and it's just less likely that you get a lot of doom loops all the time. So that's the second solution.
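Here is a sketch of what such a reward could look like. The `\boxed{}` extraction format, the n-gram size, and the penalty weight are illustrative assumptions, not the exact reward Liquid uses:

```python
import re

def reward(completion: str, gold_answer: str, n: int = 4,
           penalty_weight: float = 0.1) -> float:
    """Verifiable math reward with an n-gram repetition penalty (sketch)."""
    # 1. Verifiable part: no extractable final answer => no positive reward,
    #    which already punishes doom loops that never terminate properly.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    correct = 1.0 if match and match.group(1).strip() == gold_answer else 0.0

    # 2. Repetition penalty: fraction of duplicated n-grams in the completion.
    tokens = completion.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))]
    rep = 1.0 - len(set(ngrams)) / len(ngrams) if ngrams else 0.0

    return correct - penalty_weight * rep

print(reward(r"... so the result is \boxed{42}", "42"))  # 1.0 (no repetition)
```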
That allowed us to really reduce the doom-loop ratio. This is a real example with LFM 2.5 1.2B Thinking, which is a small model and a reasoning model, and on top of that we threw really hard tasks at it. After pre-training, the doom-loop ratio that we calculated across a lot of benchmarks was about 15 or 16%. After SFT, it barely moves: SFT is not the right stage to fix this. We didn't have doom-loop examples during the SFT stage, but even so, SFT is not enough to get rid of the issue.
After DPO, our first solution, the ratio drops quite a lot, and after reinforcement learning the problem is almost nonexistent. If today you try the same thing with Qwen 3.5 0.8B in reasoning mode, you will see a lot of doom loops, over 50%, which is something people complain about online. It also shows that Qwen 3.5, this tiny model, is just a scaled-down version of bigger models. That's not the approach we're taking at Liquid. We want to say: okay, edge models are their own thing. And this is also a way to optimize the entire architecture and the entire pre-training stack, to treat them as the special case they are.
Finally, I want to talk about the next steps for all these small models: agentic reinforcement learning. This comes back to the characteristic of being memory bound. If you're memory bound, you have low knowledge capacity, and if you have low knowledge capacity, you're going to hallucinate a lot. But a nice way to solve this issue is simply to provide web search tools to the model. If you have a tiny model, but it's able to Google everything you throw at it in terms of knowledge questions, you're going to get much, much better performance than if you just rely on the base model.
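Here is a minimal sketch of that pattern, where `model.chat` and `web_search` are hypothetical stand-ins for your inference client and search tool:

```python
def answer_with_search(question: str, model, web_search, max_turns: int = 4):
    """Let a small model offset its low knowledge capacity with a search
    tool. `model.chat` and `web_search` are hypothetical stand-ins."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = model.chat(messages, tools=["web_search"])
        if reply.tool_call is None:
            return reply.content                 # model answered directly
        # The model asked to search: run the tool, feed results back.
        results = web_search(reply.tool_call.arguments["query"])
        messages.append({"role": "tool", "content": results})
    return model.chat(messages).content          # force a final answer
```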
The same goes for a lot of other problems you can throw at the model. From experience, these tiny models are actually very good at agentic tasks, and this is how we should use them. It doesn't matter that they don't have the knowledge capacity of big models; what they truly need is really good reasoning capabilities, so that they can use these tools in a reliable manner. Another point I haven't mentioned is that small models are also not very good at long-context tasks. But that's okay, because if you have something like a recursive language model environment, you can use Python and basically take a shortcut to solve the issue. Most of the issues you find with small language models can actually be fixed in different ways. It just requires more creativity; it requires thinking about the problem differently than you would from a bigger-model perspective. But everything about this is fixable.
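To illustrate the recursive shortcut, here is a sketch of handling a document far beyond the model's context window by recursing over chunks. `model.generate` is a hypothetical stand-in, and the chunking is deliberately naive:

```python
def recursive_summarize(document: str, model, chunk_chars: int = 4000) -> str:
    """Work around a short context window by recursing over chunks (a
    sketch of the 'use Python as a shortcut' idea; `model.generate` is
    a hypothetical stand-in). Assumes each summary is shorter than its
    input, so the recursion terminates."""
    if len(document) <= chunk_chars:
        return model.generate(f"Summarize:\n{document}")
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    # Summarize each chunk independently, then recurse on the summaries.
    partial = "\n".join(model.generate(f"Summarize:\n{c}") for c in chunks)
    return recursive_summarize(partial, model, chunk_chars)
```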
All right. So, in conclusion, some takeaways. I hope I convinced you that edge models have unique challenges, and that they are actually interesting from a scientific point of view and from a production point of view. If you combine them with agentic tools, they tend to perform really, really well, and this is something that is currently underexplored. We talk about agentic workloads with really big models, but that's not necessarily the best use case, not necessarily the best fit all the time. And finally, we're working on LFM 3, and we have a ton of crazy experiments and ideas to try, so come work with us if you're interested in this space.
Thank you, everyone.
Yes?
>> Can you share a bit about how you use these models in your workflow, and how you make the decision about when to use big models?
Yeah, this is a good question. So the question is how we use these models in our workflows, and how we decide between small models and big models. The main idea is that you use small models when you don't have an internet connection, for example. In-car deployment is a good example of that, because you can't count on a reliable internet connection, so it makes sense. Latency is also a big one: if you have a workload that is very latency-sensitive, small models running locally are always going to be better. And another one is privacy: if you're in a regulated environment, if you work in finance or healthcare, that's also a good fit.
>> In your specific workflow is the question.
I make the models, so I mostly make them for other people, but in my own workflows, not necessarily.
>> On doom looping: you talked about how Qwen 3.5 is a scaled-down version of a bigger one?
Yep.
>> Have you seen whether the work you do with RL, all this reduction of doom looping on a bigger model, can be distilled into a smaller model without having to redo the steps?
It's a good question. I think we'd need to run some experiments to see whether just distilling from a bigger model translates well in terms of doom looping. I would say no, I don't think so, because I think it would be too close to SFT. So it depends on how you do the distillation: if you distill top-k logits and k is large enough, maybe that's good enough. But I don't think it would be completely solved, and you would still need several batches to make sure it doesn't happen again.
All right. Yeah. I think I should wrap up, but I'll be around if you have other questions. Thank you very much.