Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI
Channel: aiDotEngineer
Published at: 2026-04-29
YouTube video id: fLUtUkqYHnQ
Source: https://www.youtube.com/watch?v=fLUtUkqYHnQ
Hi everyone. My name is Maxime Labonne, and in this presentation I want to talk about the lessons I've learned pre-training small models. For context, I work at Liquid AI as head of pre-training. At Liquid we mostly focus on edge models for on-device deployment, and as you can see here, we have models from 350 million to 24 billion parameters, so on the very small side. Yesterday we released our new 450M VLM, and the week before we released the new version of the 350M text model. So this is what we do: we work across text, vision, and audio, and the models are available on Hugging Face if you want to try them out.

In this presentation I want to talk about what separates small models from big models, and there are three main characteristics. First, small models are memory bound, because the hardware is what it is: on a phone, in a car, and so on, we can't really use super big models, which is why we keep the size quite small. Because of that, they have low knowledge capacity compared to bigger models. Second, these models are task-focused, which is actually great: if you have low knowledge capacity, you can at least focus on doing one thing very well. That means they are usually not general-purpose chatbots like ChatGPT; they are much narrower in focus, and they can do something like summarization or tool use very, very well. And third, they are very latency sensitive, which means you need very fast throughput. All of these characteristics are important, and we'll see in this presentation how they play with each other and how we can do better. But the main lesson I want you to retain is that small models are not just scaled-down versions of bigger models. They have their own unique challenges, and we'll see how we handle them.

The first thing I want to talk about is the model architecture, because there are a lot of interesting things we can do here for edge models. Let's first look at Gemma 3 270M and Qwen 3.5 0.8B. These models are the smallest versions of their respective families, and you can see that both adopt a hybrid architecture: Gemma 3 is a hybrid of sliding window attention and GQA, and Qwen 3.5 uses a newer architecture with gated DeltaNet and gated attention. That's great because it's a lot faster. But what I'm interested in here is actually the embedding layer. If you look at the size of the embedding layer relative to all the parameters of the model, Gemma 3 270M is mostly an embedding layer: it's 63% of the total parameters. Even for Qwen 3.5 0.8B, it's still 29% of the parameters. That's not super efficient, because the effective parameters, the ones that are really used for reasoning, knowledge capacity, and so on, are not the embedding parameters; they're the rest. So the effective size is actually a lot smaller, and it means you could squeeze more reasoning and more performance out of the same memory footprint. The reason they do this is that they use distillation to train the models: these are the student models, and the teacher models have huge vocabulary sizes, which is why we end up with these super big embedding layers.
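To make the embedding-share arithmetic concrete, here is a back-of-the-envelope sketch in Python. The vocabulary size and hidden dimension below are approximations for a Gemma-3-270M-class model, and the second configuration is purely illustrative, not an official config:

```python
# Back-of-the-envelope: what fraction of a small model is just the embedding layer?
def embedding_share(vocab_size: int, hidden_dim: int, total_params: float) -> float:
    """Fraction of total parameters spent on the input embedding matrix (untied)."""
    return (vocab_size * hidden_dim) / total_params

# A 270M-parameter model with a 256k vocabulary and hidden size 640 spends
# 262_144 * 640 ≈ 168M parameters on embeddings alone:
print(f"{embedding_share(262_144, 640, 270e6):.0%}")  # ~62%, close to the 63% in the talk

# An illustrative config with a smaller vocabulary keeps far more effective parameters:
print(f"{embedding_share(65_536, 1024, 350e6):.0%}")  # ~19%
```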
All right, let's look at the LFM 2 architecture now. As you can see, the LFM 2 architecture is actually not that different in terms of layers: we also have a hybrid architecture, this time with short convolutions and GQA. First, note that the embedding layer is a lot smaller compared to the others, leaving something like 90% of the parameters as effective parameters, which is great.

I also want to talk about how we created this architecture: we did on-device profiling. Instead of staying theoretical, we decided to actually implement candidate architectures on the target hardware. We had two target devices here, and we wanted to see how each design performs in real life, so we could optimize the architecture and find the right operators. What we found is this gated short convolution block that you can see here. Why is it nice? Because it's very, very fast. Short convolutions are a lot faster than all the alternatives: compared to the sliding window attention from Gemma 3, the gated DeltaNet from Qwen 3.5, gated linear attention, and grouped-query attention, the cost ratio is really in favor of the short conv. That's great because, as we said, these models are very latency sensitive, so this is exactly what we want.

That's quite theoretical, but if we profile the inference of these models in practice, you can see the results here on two CPUs, the AMD Ryzen AI Max+ 395 and the Samsung Galaxy S25 Ultra. These models don't all have exactly the same size, but it gives you a rough picture: the short convolutions really allow the LFM 2 architecture to be a lot faster and also use less memory. And it's not just CPUs; on GPU as well, it sustains high throughput even at very high concurrency levels.
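To make the block concrete, here is a minimal PyTorch sketch of a gated short-convolution layer. This is a simplified reading of the general idea, not the exact LFM 2 operator; the gating arrangement, kernel size, and dimensions are my own assumptions:

```python
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    """Minimal gated short-convolution block (illustrative sketch, not the LFM 2 operator).

    Key property: a depthwise causal conv with a tiny kernel has a small, fixed
    state, so per-token decode cost stays O(1) -- unlike attention's growing KV cache.
    """

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)           # value and gate streams
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              groups=dim,                # depthwise: one filter per channel
                              padding=kernel_size - 1)   # pad, then trim right for causality
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal conv
        return self.out_proj(v * torch.sigmoid(g))       # multiplicative gate

y = GatedShortConv(dim=1024)(torch.randn(2, 16, 1024))  # -> (2, 16, 1024)
```

The fixed, tiny receptive field is what makes this kind of operator cheap at decode time on CPUs and NPUs, which is consistent with the profiling results above.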
All right, let's talk a bit about training now. The LFM 2.5 training recipe is quite similar in terms of stages to what you can find elsewhere: pre-training and mid-training on 28 trillion tokens, then supervised fine-tuning, preference alignment, and reinforcement learning. Notice that I said 28 trillion tokens, and I also said we released a 350M-parameter model last week. So yes, we pre-trained a 350 million parameter model on 28 trillion tokens. If you're familiar with the Chinchilla scaling laws, that might sound a bit weird: at roughly 20 tokens per parameter, a 350M model is "compute optimal" at around 7 billion tokens, not even 1 trillion, let alone 28 trillion. But in practice that's not the limit, and we see that performance still grows as you scale the number of pre-training tokens. There was a super interesting paper by Roberts et al., published last week, about test-time scaling laws, and you can see how LFM 2.5 350M compares: here are the Chinchilla scaling laws, and here are the new ones they propose. According to their laws, we actually did not pre-train on enough tokens; we should pre-train on even more to be optimal. This is cool, because it means more pre-training works even at the smallest scale, and these models are a lot cheaper to train than much bigger ones.

And here you can see a comparison, not just of the pre-trained models but of the post-trained ones. The LFM 2.5 model is significantly better than the previous version, LFM 2 350M, on a lot of different benchmarks: knowledge with GPQA Diamond, instruction following with IFBench, case report bench, which is data extraction, and a lot of tool use with BFCL and τ²-bench. And this model is only 350 million parameters. What we wanted was a model that is very, very good at data extraction and tool use; for the rest, if it's not the best model at, say, code, it doesn't matter, because people don't use it that way anyway. Same for math. I think it's really nice to target specific capabilities rather than trying to be average at everything.

All right, let's talk about exactly that: pre-training small versus big models, what's the difference? We have pretty much the same stages, so it's not really the stages where we see a difference; it's how you do them. For supervised fine-tuning, it's better if you are actually quite narrow and focus on a few tasks. That's true for general-purpose training, but it's also true if you fine-tune yourself: you can take one of these models on Hugging Face and just fine-tune it for your use case. For example, if you have a particular function that you want the model to call, that's an excellent use case; the more narrowly you can define or design it, the better.

Then we have preference alignment. Here we use our on-policy, length-normalized direct preference optimization (DPO) algorithm, which we quite like. Preference alignment is very nice because it brings general improvements; it's not just about benchmarks. After preference alignment, the model is better overall, it sounds better, and it's really nice to be able to improve it across the board.
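The exact on-policy, length-normalized DPO variant used here isn't public, so as an illustrative sketch, here is one common way to length-normalize DPO: divide each sequence's log-probability ratio by its token count before applying the usual DPO logistic loss. All names and the β value are assumptions:

```python
import torch
import torch.nn.functional as F

def ln_dpo_loss(pi_chosen_logp, pi_rejected_logp,    # summed token log-probs, policy
                ref_chosen_logp, ref_rejected_logp,  # summed token log-probs, reference
                chosen_len, rejected_len,            # token counts per sequence
                beta: float = 0.1) -> torch.Tensor:
    """Length-normalized DPO: normalizing each log-ratio by sequence length
    removes the usual DPO bias toward longer (or shorter) responses."""
    chosen_ratio = (pi_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_ratio = (pi_rejected_logp - ref_rejected_logp) / rejected_len
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of one preference pair:
loss = ln_dpo_loss(torch.tensor([-40.0]), torch.tensor([-80.0]),
                   torch.tensor([-42.0]), torch.tensor([-75.0]),
                   torch.tensor([20.0]), torch.tensor([25.0]))
```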
Finally, we have reinforcement learning, which is extremely efficient even at very small scale. It's a really important technique that we use everywhere. The main thing is that each RL task is very narrow in focus, so you want as many environments and as many tasks as possible, to make sure the model generalizes well. Also, small models in particular are quite sensitive to cold-start SFT data: if you have a particular task in reinforcement learning, it's always good to have similar samples for a similar task in your supervised fine-tuning mixture. This is useful feedback, too: if something doesn't train well during reinforcement learning, it's probably because you are missing some cold-start SFT data, or maybe the task is too complex; there are different reasons. But you can restart from the supervised fine-tuning stage, add your data, and see if it improves anything.

All right, but there's a problem with small language models that you might have encountered even with bigger ones, and that is doom looping. As you can see here, the model starts repeating a sequence of words over and over and over, and it just never stops. This is a problem all the time, but it's a particular problem if you have small models, if you have reasoning models, and if you have complex tasks, where the task is basically too complex for the model. It can happen anywhere, but it is worst when you have all three at the same time: a tiny reasoning model on super difficult math tasks is the perfect recipe for a lot of doom loops.

So this is a unique challenge you find with small models, and let me give a concrete example of how we solve it. The first solution happens during the preference alignment stage, in particular in the data generation part. Here you can see the pipeline we use for on-policy data generation for preference alignment. We start with prompts, about 1 million samples to give you a rough idea. Then we take the policy model, the model we want to train, and with temperature sampling we generate five rollouts. Because we use temperature sampling, these rollouts tend to be a lot more diverse, and we expect that not all of them will doom loop; at least one should not. On the other hand, we generate one extra rollout with the policy model at temperature zero, and this is the one we expect to doom loop. Then we give everything to an LLM jury to score all the rollouts. We pick the one with the highest score as the chosen answer and the one with the lowest score as the rejected answer, and the idea is that if there is a doom loop, the doom-looping response will be rejected. So during preference alignment, we train the model not to doom loop. This is quite effective, and it's solution number one.

Solution number two is reinforcement learning with verifiable rewards, plus a bit of n-gram repetition penalty. RL with verifiable rewards is already a very nice way to attack this issue: if you have a question, like this math question, you try to extract the final answer, and if there is no final answer, you don't get a positive reward. So doom loops are already being penalized during RL with verifiable rewards. On top of that, you can add a bit of repetition penalty to make sure you generate fewer doom loops in general. And again, we use temperature sampling here, so the rollouts are quite diverse, which also makes a lot of doom loops less likely. That's the second solution.

Together, these allowed us to really reduce the doom-loop ratio. This is a real example with LFM 2.5 1.2B Thinking, which is a small model, a reasoning model, and on top of that we threw really hard tasks at it. After pre-training, the doom-loop ratio we measured across a lot of benchmarks was about 15 or 16%. After SFT, it barely moves; SFT is not the right stage to fix this. We didn't have doom-loop examples in the SFT stage, but that alone is not enough to get rid of the issue. After DPO, our first solution, it drops quite a lot, and after reinforcement learning the problem is almost nonexistent. If you tried the same thing today with Qwen 3.5 0.8B in reasoning mode, you would see a lot of doom loops, over 50%, which is something people complain about online. That also shows that Qwen 3.5, at this tiny scale, is just a scaled-down version of bigger models, and this is not the approach we're taking at Liquid. We want to say: edge models are their own thing, and that's also a way to optimize the entire architecture and the entire pre-training stack to treat them as the special case they are.
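To illustrate the second solution, here is a minimal sketch of an n-gram repetition penalty combined with a verifiable reward. The thresholds and scaling are assumptions; the exact reward shaping used for the LFM models isn't public:

```python
from collections import Counter

def ngram_repetition_penalty(token_ids, n: int = 4, max_penalty: float = 0.5) -> float:
    """Penalty in [0, max_penalty] proportional to the fraction of repeated n-grams.

    A doom-looping rollout repeats the same n-grams over and over, so its
    repeated-n-gram fraction approaches 1; diverse text stays near 0."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    repeated = sum(c - 1 for c in Counter(ngrams).values())  # occurrences beyond the first
    return max_penalty * repeated / len(ngrams)

def shaped_reward(answer_is_correct: bool, token_ids) -> float:
    """Verifiable reward (correct extracted answer or not) minus the repetition penalty."""
    return (1.0 if answer_is_correct else 0.0) - ngram_repetition_penalty(token_ids)

# A looping rollout is penalized even if it stumbles onto the right answer:
print(shaped_reward(True, [1, 2, 3, 4] * 50))  # ~0.51 instead of 1.0
```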
And finally, I want to talk about the next steps for all these small models: agentic reinforcement learning. The characteristic I haven't addressed yet is being memory bound. If you're memory bound, you have low knowledge capacity, and if you have low knowledge capacity, you're going to hallucinate a lot. But a nice way to fix this is simply to give the model web search tools. If you have a tiny model that is able to Google every knowledge question you throw at it, you're going to get much, much better performance than if you just rely on the base model. The same goes for a lot of other problems you can throw at the model. From experience, these tiny models are actually very good at agentic tasks, and this is how we should use them. It doesn't matter that they don't have the knowledge capacity of big models; what they truly need is really good reasoning capabilities, so they can use these tools in a reliable manner.

Another point I haven't mentioned is that small models are also not very good at long context. But that's okay: with something like a recursive language model environment, the model can use Python and basically take a shortcut around the issue. So most of the issues you find with small language models can actually be fixed in different ways. It just requires more creativity, and it requires thinking about the problem differently than you would from a big-model perspective. But everything about this is fixable.
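As a sketch of what "a tiny model that can Google everything" looks like in practice, here is a minimal tool-use loop. The `web_search` stub, the JSON tool-call format, and `model_generate` are all illustrative assumptions; a real deployment would use the model's native tool-calling format:

```python
import json

def web_search(query: str) -> str:
    """Hypothetical search tool -- a stand-in for whatever retrieval backend you use."""
    return f"(top results for: {query})"

TOOLS = {"web_search": web_search}

def agent_loop(model_generate, question: str, max_steps: int = 5) -> str:
    """Minimal agent loop: a small model with low knowledge capacity answers
    knowledge questions by calling tools instead of relying on memorized facts.

    `model_generate(messages) -> str` is assumed to return either a final answer
    or a JSON tool call like {"tool": "web_search", "args": {"query": "..."}}."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        out = model_generate(messages)
        messages.append({"role": "assistant", "content": out})
        try:
            call = json.loads(out)
            result = TOOLS[call["tool"]](**call["args"])
            messages.append({"role": "tool", "content": result})
        except (ValueError, KeyError, TypeError):
            return out          # not a tool call -> treat it as the final answer
    return "Step budget exhausted."
```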
All right. So, in conclusion, some takeaways. I hope I convinced you that edge models have unique challenges, and that they are interesting from both a scientific and a production point of view. If you combine them with agentic tools, they tend to perform really well, and this is currently underexplored: we talk about agentic workloads with really big models, but that's not necessarily the best fit all the time. And finally, we're working on LFM 3, and we have a ton of crazy experiments and ideas to try, so come work with us if you're interested in this space. Thank you, everyone. Yes.

>> Can you share a bit about how you use these models in your workflow, and how you make the decision about when to use big models?

Yeah, this is a good question. So the question is how we use these models in our workflows, and how we decide between small models and big models. The main idea is that you reach for small models when you don't have an internet connection, for example. In-car deployment is a good example of that, because you can't count on a reliable connection, so it makes sense. Latency is also a big one: if you have a workload that is very latency sensitive, small models running locally are always going to be better. And another one is privacy: if you're in a regulated environment, if you work in finance or healthcare, that's also a good fit.

>> In your specific workflow is the question.

I make the models, so I mostly make them for other people; in my own workflows, not necessarily.

>> On doom looping: you talked about how Qwen 3.5 is a scaled-down version of a bigger model. Have you seen the work you do with RL to reduce doom looping on a bigger model be distilled into a smaller model, without having to redo the steps?

It's a good question. I think we'd need to run some experiments to see whether just distilling from a bigger model translates well in terms of doom looping. I would say no; I don't think so, because I think it would be too close to SFT. It depends how you do the distillation: if you distill top-K logits and K is large enough, maybe that's good enough, but I don't think it would be completely solved, and you would still need several batches to make sure it doesn't happen again. All right.

Yeah. All right, I think I have to stop here, but I'll be around if you have other questions. Thank you very much.