[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han
Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: OkEGJ5G3foU
Source: https://www.youtube.com/watch?v=OkEGJ5G3foU
Hello guys, and sorry for being a bit late; there was a lot of traffic. Welcome to AI Engineer World's Fair, and thanks for coming to my session. Today we're going to do a deep dive into RL, kernels, agents, and quantization.

You might know me or you might not: I'm Daniel, and my brother is around here somewhere. We also have stickers and other random stuff to give out after the talk. We tweet a lot: we did the gradient accumulation bug fix last year, we introduced something called async offloaded gradient checkpointing, and we work with the Hugging Face, Google, Meta, and Mistral teams to fix bugs in their open-source models like Gemma, Llama, Mistral, Phi, and more. If you want to follow the latest in AI, follow us; we tweet about all sorts of things, and you can sometimes even tell approximately when the next models might be released.

We also contribute to the wider open-source ecosystem: we sometimes contribute to llama.cpp, we work with the Qwen and Mistral teams on their releases, and we've done Phi-4 and Llama 4 bug fixes which increase accuracy a bit. So definitely use the newer uploads we publish, because they fix bugs all the time. We just surpassed 10 million monthly downloads on Hugging Face, and we have a GitHub package with about 40,000 stars: essentially we make fine-tuning faster and reduce memory usage. Definitely check that out. There are free Colab notebooks; not many people realise Google offers free GPUs in Colab, so please use them. Kaggle also gives you 30 hours of free GPU time per week with no restrictions, so please use the free resources as much as possible. If you scroll down our GitHub page there are free notebooks for Colab and Kaggle covering reasoning, continued pre-training, supervised fine-tuning, and more.

We also upload models to our Hugging Face page. For example, DeepSeek-R1-0528 was released a few days ago and we uploaded 1.58-bit quants, which are very small but retain most of the accuracy, so they can run on your local device even with very low VRAM or a not-very-good GPU. We constantly upload fixed models; sometimes people complain and ask us to stop uploading fixed models because it's annoying, but unfortunately models have bugs, so we have to fix them immediately. Sometimes accuracy increases by 10%, and the large model providers often won't tell you they uploaded a fix, so be sure to download the latest versions and you'll get all the fixes.

Now, let's start from history. Does everyone remember Llama? It started as just a research paper, with Meta saying, "Oh, we trained Llama" and no weights, until finally it got leaked.
Originally it was research access only, and then suddenly the weights leaked, and that leak more or less spawned the entire open-source movement. Some of the authors are no longer at Meta, but Llama was extremely important for the whole ecosystem; it was the beginning of open source for large language models.

The most famous plot from the paper shows that if you keep training the model on more tokens, the loss just keeps going down. The question is: will it keep going down? Llama 1 was trained on only 1.4 trillion tokens, which is now considered very little; most models are trained on ten times more. You can also see from the trend that the bigger the model, the lower the loss: 7B is the blue line, 65B is the red line, and in general a bigger model is smarter. The training loss values are also a useful sanity check: you should generally see numbers a bit above one. If you see training losses of around 8 or 13 when fine-tuning, something is definitely wrong; losses should be around two or three. Google's new Gemma 3 models are trained on 14 trillion tokens, and Llama 4 is trained on 30 trillion tokens, so roughly 10 times and over 20 times more data than Llama 1.

Oh yes, I forgot: you can also access the slides via the QR code, and there's a link to them in the docs. I'll probably post the slides on Twitter and elsewhere too. Also, if you have questions, raise your hand and ask; I'll pause between sections for questions. Last time many people asked questions and I answered every single one, even the silly ones, and sometimes I get things wrong, so please ask. Any questions so far? I'm assuming no. Okay, let's keep going.

I don't know if people have seen this very famous plot from Maxime showing open-source versus closed-source performance on popular benchmarks; I think this is MMLU 5-shot. The green line is open-source models and the orange line is closed-source models. In general the slope of the open-source models is steeper, and in terms of MMLU the open-source and closed-source models have roughly converged: the chart is already outdated, but you can see that Llama 3.1 405B roughly reached GPT-4o level, so open-source models definitely caught up to closed-source models. However, since September 2024 there has been something I call the open-source drought. No one wants to talk about it, but I will. In September 2024, o1-preview was released, and to be honest the open-source community was shocked: suddenly the capabilities diverged.
There was this MMLU plateau where the open-source and closed-source models had converged, and then in September 2024 OpenAI released o1-preview and it shocked the entire community, because capability — intelligence — skyrocketed. With long reasoning traces it was a total change of mindset, and for about four months the open-source community kind of died inside, because we couldn't replicate it and didn't know what to do. Then in January 2025 DeepSeek came along and released R1, and that's when the whole world changed its view: you can in fact train open-source models to be as powerful as o1 or o3. That's what I call the open-source drought.

There was a previous drought even before that. Remember when ChatGPT was released in December 2022? Before ChatGPT, most models were base models; they weren't instruction fine-tuned very well, so most large pre-trained models were not that useful. Then ChatGPT came along with better reinforcement learning from human feedback, better instruction fine-tuning, and better instruction following, and it really changed the world. Large language models were already around before 2022, but it was ChatGPT that showed that if you have good data — good instructions, good answers, good supervised fine-tuning, and good reinforcement learning — you can make the model genuinely useful. And again, open source had a very long delay, until Llama 1 I suppose. So open source is always trying to catch up to the closed-source models.

The next question is: what comes after reasoning? Is there going to be something else? My personal take is that it's going to be very hard. The DeepSeek R1 paper suggested that the model most likely already has these reasoning capabilities and we just need to accentuate them, so I'm not sure whether there will be some new step function that gets us to the next capability. In my view the closed-source models will always manage another step function eventually, but who knows — maybe things plateau forever. You can have long discussions about whether AGI is coming; this talk is not about that.

I call the first jump the SFT or RLHF jump: if you do good supervised fine-tuning you get a large jump in performance. The second jump is the RL jump: performance can increase dramatically if you employ methods like RL. The question is what the next jump is, and I don't know. I'm not sure if you've seen this picture before; it's very widely known in the community. Yann LeCun showed this cake: unsupervised learning — pre-training in general — is the cake itself, and supervised fine-tuning is the icing on top.
Reinforcement learning is the cherry on top; not everyone likes the cherry, but some people do, and the goal is: how do we get the cherry? The problem is that there is very, very little data for reinforcement learning. So most large model labs train these big pre-trained models and then iteratively refine them, through supervised learning and through reinforcement learning, to make them better. Interestingly, this slide was shared around a lot last year, but it's actually from September 2016; I had to dig it up on YouTube. Yann LeCun talked about this nearly 10 years ago. It became very popular on Twitter, I think in November last year; people kept tweeting about it, and I was shocked when I realised how old it was. But it encapsulates the current AI boom.

First, remember that these large models all started from a base model. We call these training stages: you have a base model, and you then convert it into a chat model. For example, ChatGPT is not a base model; it is an instruction fine-tuned model — some kind of fine-tuned model derived from a base model. OpenAI does have a base model sitting on a server somewhere; they're probably never going to serve it, but it exists, and they fine-tuned it into ChatGPT. Claude 4 most likely has a base model that was fine-tuned to become Opus, and Gemini has a base model that was converted into Gemini 2.5 Pro. The phase where you convert a base model into a chat model is the fine-tuning phase. The question is what exactly happens in that arrow — reinforcement learning, supervised fine-tuning, some other secret sauce? We'll discuss these topics. Any questions first? Okay.

Among open-source models you might have seen Gemma 3 PT and Gemma 3 IT, Llama 4 and Llama 4 Instruct, Qwen 3 Base and Qwen 3, Mistral Small Base and Mistral Small Instruct, Llama 2 and Llama 2 Chat. To be honest, I think the open-source community should standardize these terminologies — Instruct, Chat, IT, PT and so on. In general, IT means instruction tuned, PT means pre-trained, and Instruct means instruction fine-tuned; Qwen 3 dropped the suffix entirely, so the instruct model is just called Qwen 3 and the base model has "Base" in the name. If you browse Hugging Face, hopefully you'll now recognize these different types of models.

The general pipeline for fine-tuning and reinforcement learning is: you start with pre-training, you then convert the model into a supervised fine-tuned model via supervised fine-tuning (SFT) — you might also hear IFT, instruction fine-tuning, which is the same thing — and then there is what we call the post-training phase.
Recently the terminology has actually shifted a bit — I don't really love terminology anymore, but here it is. We have pre-training: you take all of Wikipedia, all of the web, all the data you can find, shove it into the model, and predict the next word. Then there's the mid-training stage, which uses higher-quality data — for example you can weight Wikipedia more because it's higher quality — and it's also where you do long-context extension: if your model's context is very short and you want to extend it to a very long context, you do that during mid-training. Then there's the supervised fine-tuning stage, where you convert the model into a chat model. Then there's the post-training phase, which is preference fine-tuning, DPO, RLHF and so on. And then there's this newer thing called reinforcement fine-tuning, or RLVR — reinforcement learning with verifiable rewards. It's a new paradigm, not the same as preference fine-tuning or DPO, where we use reward functions to make models much better. That's how I'd describe the whole training pipeline of modern models.

Another way to put it: you start with a random initialization of the model — say seven billion parameters of literally random numbers, or for GPT-4, I don't know, maybe 1.4 trillion random numbers — and then you somehow move through this huge, say 1.4-trillion-dimensional, space along the black line until you reach the final model, the green dot. The question is how we move through this space to get there. What most people do is: start from the random initialization, do the pre-training phase, which is very long, and arrive at the dark blue dot — the pre-trained model. Then you do supervised fine-tuning / instruction fine-tuning to get to the light blue dot; notice that line is very short, because there isn't much data for supervised fine-tuning. Then you keep iterating to reach the purple dot via preference fine-tuning, and finally you get to the green dot via reinforcement learning with verifiable rewards, like o3 or o1. The goal is to move from the black dot to the green dot, and essentially all of large language models — all of AI — is an optimization problem: how do we make it easier to get to the green dot? You could theoretically ask: why not go straight from the black dot to the green dot and skip all the intermediate phases? You could, but it's not going to be efficient; you'd be waiting for, I don't know, millennia, and your loss wouldn't go down.
So the trick we've found in AI is that you go through these phases to reach the final green dot. There is a newer methodology where you bypass the supervised fine-tuning stage and the preference fine-tuning stage and go directly to the green dot — that's the dark red line. DeepSeek-R1-Zero showed that you can take a pre-trained base model and directly do reinforcement learning with verifiable rewards, skipping the rest entirely. That's a new paradigm people want to focus on. In my view you should still do the light blue, then the purple, then the green; I don't think you should skip straight to the green unless you want to waste resources, and I don't think large model labs want to waste resources.

I don't know if people have seen this diagram: agents in the classic sense. Everyone keeps connecting agents with reinforcement learning — but why? In general, an agent sits in some environment, the agent takes an action in that environment, and it gets some reward. The reward is R, and S is the state — what the environment currently looks like. RL tries to optimize this loop: you're trying to maximize reward given the actions you take. That's why RL and agents are connected — just assume the agent is the language model. The "environment" here is a bit fuzzy; it's hard to say exactly what the environment is — it's more like the language model's inference space. But pretend this was a game: the agent is the computer, the environment is Mario, you're playing the game automatically, and your goal is to win. The whole goal of RL is to maximize reward.

Another example is Pac-Man: you have the yellow Pac-Man, and you can go up, down, left, or right — that is the action space. The little dots are rewards: eat a dot and you get a positive reward, R+; eat a very big one and you get a very large reward; run into one of the enemies and you get a negative reward. The question is how to maximize reward in this environment.

For language models there's a trick: this loop changes, because we don't actually have a continuous loop — the state doesn't keep changing over time. In a game, if you take an action the whole state changes, the environment totally changes, and you have to keep a history of past steps. In language models there's essentially no such history: if you prompt "What is 2 + 2?" and then ask "What is 4 + 4?", the second prompt isn't directly dependent on the first — fine, it's loosely related, but not directly correlated. So you can actually delete one of the arrows in the loop — the one feeding in the next prompt.
You can delete it entirely. So, for example: "What is 2 + 2?" The model has all of these options: it could say 0, it could say 1, 2, infinity, it could say B or D — any symbol it likes. "What is 2 + 2?" is the state; it's the question. The reward: if the model chooses 4, the reward is +1; if it chooses anything else, the reward might be negative infinity, zero, whatever number you like. You can pick any number for the reward — it doesn't have to be +1, it can be +10 or +100.

You can also do distance-based scoring. For example, is choosing 5 better than choosing 0? What do you think — better or worse? Better, yes. So what reward would you give if the model outputs 5 for "What is 2 + 2?"? Someone said zero — zero is fine, because it's wrong. But is there a better answer? Something less than one, say 0.8. So you could score based on how far off the answer is — something like the error divided by the correct answer, i.e. (5 − 4) / 4 — some reward along those lines. And if the model outputs "a", what's the reward? Minus one, or maybe minus 10, because that's very bad: it shouldn't output a letter at all, it should output a number.

That's how you design reward functions. You just design a reward function — it's basically just if-statements — shove it into a language model fine-tuning phase, and there you go, you have o3. Okay, you won't have o3, but you know what I mean: o3 is essentially a collection of reward functions like this. And remember, it doesn't have to be "What is 2 + 2?" specifically — it's general maths: what is 10 + 20, what is 10 × 200 ÷ 10, whatever equation you want. The function takes your question and converts the model's answer into a number, and o3 is just a collection of all these reward functions.

The goal of RL is to make the good outputs more likely: you want the reward-earning answers to increase in probability. For example, you want the 4 to appear more, the 3 to appear less, and the D and the B to be very heavily penalized. That's the goal of RL. And note that in general we don't actually have the answer. This question is easy — 2 + 2 is obviously 4 — but pretend you have some complicated question, like "how do I win the stock market?" (a silly example). You don't know in advance which actions to take; you only have the result, profit or loss. The question is how to get to the good profit.
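As a rough illustration of the kind of reward function being described — just if-statements plus a distance term — here is a minimal sketch in Python. The regex, the −10 penalty for non-numeric output, and the partial-credit formula are all made-up choices for illustration, not anything prescribed in the talk:

```python
import re

def math_reward(question: str, completion: str, answer: float) -> float:
    """Score a model completion for a simple arithmetic question.

    A toy version of the distance-based scoring discussed above:
    exact answers get +1, near misses get partial credit, and
    non-numeric outputs get a large negative reward.
    """
    # Pull the last number out of the completion, e.g. "I think the answer is 4."
    numbers = re.findall(r"-?\d+\.?\d*", completion)
    if not numbers:
        return -10.0  # outputting a letter instead of a number is heavily penalised
    prediction = float(numbers[-1])
    if prediction == answer:
        return 1.0    # exactly right
    # Distance-based partial credit: 5 for "2 + 2" scores better than 100.
    relative_error = abs(prediction - answer) / max(abs(answer), 1.0)
    return max(1.0 - relative_error, 0.0)

# Example: "5" scores 0.75, "4" would score 1.0, "a" would score -10.
print(math_reward("What is 2 + 2?", "I think it is 5", 4.0))
```

Swapping the distance term for a flat zero gives the simpler "right gets 1, wrong gets 0" reward that comes up again later in the Q&A.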
So the question is: how do we take the good actions as much as possible and the bad actions as little as possible? That is RL. OpenAI introduced RLHF with ChatGPT — actually I think it was InstructGPT, but anyway — and for ChatGPT they showed that you just need some training data: you interact with the agent, which is the language model; you get some actions, which are the model's answers; you feed those into a reward model; you get a reward; and you keep doing this step iteratively, and you eventually get ChatGPT. Remember the base model you start with: the base model gets converted into ChatGPT via this method.

To expand on that a bit: you may have heard of PPO. What is PPO? Essentially PPO just expands the box labelled "agent": the language model agent actually contains three models — a generating policy, a reference policy, and a value model. Nothing that special; we'll talk about each of them separately. PPO is just an optimization algorithm to make RLHF work better. GRPO, the algorithm behind DeepSeek R1, smartly deletes one of them: the value model. It just gets rid of it. Why would you delete it? Because if you delete the value model you save parameters and compute, and it's much more efficient. Remember each of these models is roughly a full large language model: if your generating model is already 1.4 trillion parameters, are you really going to stand up another 1.4-trillion-parameter model just for the value model? So we get rid of it, and that is GRPO. Well, there are other differences too, but the biggest one is removing the value model. Any questions?

Question: you talked about negative rewards, which is confusing, because pre-training is about probabilities — here the reward can be positive or negative, versus pre-training where it's a negative log likelihood. Answer: during pre-training the goal is to maximize probability — the model outputs a number from 0 to 1, the probability of the next word, and you want to maximize it. In RL you want to maximize reward, so if the reward is negative, you still want to maximize it: if it's −1, you want to push it away from −1 towards the positive range. Rewards can also be entirely negative: your reward function could output −10 and −1, where −1 is the good one and −10 is the bad one; your goal is to move towards −1, because you're maximizing. So "reward" is a bit of a misnomer — you could add 10 to everything and it just shifts the numbers. Most people in the RL space prefer purely positive rewards; personally I like negative rewards, it feels more intuitive to me. Any other questions?
Question: you've got your language model, and then you've got the generating policy and the reference policy — what models are actually used there? Are they the same as your language model, or is that another model? Answer: very good question. There are some tricks you can employ, but most people make them the same model. The reference model is the model you started with — the base or SFT checkpoint — and it is not updated. The generating policy is the model you actually update: you take the starting model and apply update one, update two, update three, and that is the generating policy. So they're the same language model; one copy gets updated and the other doesn't. We'll come back to this.

Question: is an action typically one token, or more? Answer: good question. In the Pac-Man case the action would be a string of actions — up, then down, then left, some long sequence. In the language model space this is the distinction between single-turn and multi-turn; currently most people do single-turn, which is just one action. The action is: given "What is 2 + 2?", the entire generated response — "I think the answer is 4, let me work it out, blah blah blah" — the whole chain of thought is the action. So it's one action, but it's the total inference trace. Does that make sense? Okay.

Question: there was a Claude-related discussion about how, to finish a poem's next line, the model has to think ahead to the last word so it matches the previous one. Do you think reward-focused training enables that? Answer: for pre-training specifically, there are research papers showing that the model doesn't just predict the next word; it tries to predict many words ahead. So maybe the reward in reinforcement learning essentially accentuates a pre-training behaviour: maybe these behaviours already exist in the model and we just see them more often. I'd say the model itself already has the capability — it already plans 10 or 20 words ahead — and we just make it more obvious. I'm not sure that fully answers it. Follow-up: would it be safe to say that for generation you essentially want the circuit for the last word? Answer: for reinforcement learning specifically, I guess yes. Your goal is to maximize a reward, and however you get there, it's different from pre-training in general.
Pre-training just maximizes the probability of the next word; reinforcement learning maximizes reward. The question is how you actually maximize a reward — do you do chain of thought, do you think ahead as you described? I don't know. What is the reward function actually doing? What is a language model actually doing? I don't know. I'm not sure that answers your question, to be honest.

Question: going back to the arithmetic — whether 5 is a better answer than, say, "yes". Given all the relationships between mathematical operations, is it better, in the literature or in the current state of the art, to train so that a closer prediction gets more reward, or to just say the right answer is right and everything else is wrong, which in some logical sense is true? Which tends to produce better performance? Answer: you're right that you want accurate data — for "What is 2 + 2?" you want data that says 4, not 5 or 10 or −100. Follow-up: but specifically, the practice of saying 5 is better than 10 — is that what people do, given that when there's exactly one correct answer everything else is, mathematically, equally wrong? Answer: that's a good question, and I don't know; large model labs won't tell you exactly what they do. In our experiments — if you use our notebooks — we show that distance-based scoring, where the closer your number is to the actual answer the higher the reward, gives better results. But generally it's easier to just say 5 is wrong: give everything wrong zero and give the correct answer one. For example, if you want to reward code execution, how would you even do distance-based scoring? If you ask the model to create a Flappy Bird game, you just have the final output; you can't easily verify whether this Flappy Bird game is "better" than the previous one. It's only in a mathematical setting that distance-based scoring is straightforward. I assume the big labs mostly do the binary thing — yes or no. But in our experiments, for maths specifically, you should do distance-based scoring; it makes the model learn faster.

Question: for a verifiable domain like maths, 2 + 2 makes sense, but it won't scale to really large numbers or large multiplications. Is the endgame using tool calls to calculate that, or could the model itself be trained to do it?
Answer: good question. In the olden days, before this paradigm came along, we would have said just use a tool — a calculator. And honestly you should still use a calculator to compute 2 + 2 rather than a language model. But with RLVR, the interesting finding is that if you train on 2 + 2, or 10 × 10, or some more complicated expression — the derivative of x², some random equation — the model learns to actually solve those equations without simply overfitting. So with RL you really can make the model learn multiplication and addition; it ends up in the model. Does that make sense?

Follow-up: would we use this in production? Answer: people do... okay, maybe don't use it in production; you can't be sure the answer is correct — maybe it will say 3 + 3 is 7. It's possible. But it's getting better, and maybe in the future all mathematical calculation can just be done by the model. A few months ago — even before o1 — people would still say use a calculator, some sort of tool calling, and yes, you probably still should. But imagine, as models get better and the training data grows, in the limit where we have all the world's data for these maths questions — 2 + 2, 4 + 4, 10 × 10, whatever — it should in theory solve them all. Always in theory. But tool calling isn't strictly necessary.

Question: about the reward model — in practice, are people using large language models as the reward model? Answer: good question; I'll get to that in the next slides.

Question: does this change with multi-turn? You showed single-turn. Answer: yes, you can do multi-turn; it's a bit more complicated. There are tricks: you assume your current step is fine, you keep doing inference, and you append the next question. For example: "What is 2 + 2?" — the model thinks it through and says the answer is 4 — then maybe the user responds, "I don't think your answer is correct," and the model says, "Okay, let me rethink this... I still think the answer is 4." You can chain all of that together and shove it into the RL step. It's a bit more complicated, and the diagram would look a little different.

Follow-up: in that case, do you treat the whole loop as a single turn or as many turns, and do you only give a reward at the end, or do you give sub-rewards?
Very good question. In the DeepSeek R1 setting you can either give sub-rewards along the way or just give one reward at the very end. Sub-rewards might actually do better in general, but they're hard to compute; it's much easier to wait until the very end and give a single reward. So it's mostly about efficiency — to be honest, all of AI is about efficiency and optimization. My suggestion is to just give the reward at the very end.

Question: once you have the reward signal, is it just the REINFORCE algorithm with the gradient going back? Answer: yes, mostly — we'll talk about REINFORCE and PPO and so on shortly.

One more thing on skipping stages: the problem with jumping straight from pre-training to the RLVR stage is that the model doesn't know how to follow instructions yet. If you ask a base model "What is 2 + 2?", it's not going to say "I think the answer is 4." You might get lucky — maybe somewhere in the pre-training data, somewhere on the web, someone asked exactly that question and the reply was "the answer is 4" — but you have to be lucky. The whole trick of SFT is to force the model to answer "What is 2 + 2?" in an instruction-following way: you want it to say "the answer is 4," not ramble on and dump some Wikipedia article as the output. The point of SFT, preference fine-tuning and so on is to make conversational-style output the optimal behaviour. If you want to skip it, that's fine — it's just not efficient. I assume large model labs are experimenting with this, so it's not a hard "should" or "shouldn't."

Question: with an on-policy optimizer, does the reference model ever change? Answer: the reference model is just the model you didn't train — the base model, or whatever SFT checkpoint you started with — and it doesn't change. You could change it, but I think that would be too expensive and more complicated; remember, all of AI is about optimization and efficiency, so you don't have to. There may be papers discussing it — maybe OpenAI does it, I don't know.

Other question: do we need less data for RL? Answer: the trick of RL is that you just need a reward function. You don't need the chain of thought in your data — you do still need the final answer — you just need lots of questions: what is 2 + 2, what is 4 + 4, and remember you can generate those automatically. So, do you need fewer samples compared to SFT? You should still do as much as possible.
For most large models — o3 or o1, say — I don't know what percentage of compute goes to RL; maybe 5% or less. But the question is what happens if you spend double the compute just on RL: if you previously did 14 trillion tokens of pre-training, make RL 14 trillion tokens too. That's the goal of the large labs. Currently it's a small fraction, but over time it will increase.

Follow-up: so for now the number of samples is much less, because it's expensive — you need models to generate them? Answer: correct, it's expensive, but maybe by next year, or even this year, the large labs' goal is to make this phase the biggest one, because you can now automatically generate questions: what is 2 + 2, what is 2 × 2, what is 10 ÷ 10, as many maths questions as you like — and you can also generate coding questions, or any other kind of question, or simply reuse the supervised fine-tuning data itself for the RL step. Does that make sense? Okay.

Question: how do you protect the SFT behaviour from being screwed up? For example, "2 + 2" is technically also answered by another equation, but I don't want more equations, I want the number. Are there techniques to make sure we're not violating the instruction? Answer: yes — we'll talk about that, clipping and so on.

Question: has there been research on whether the model has, say, a circuit for addition — on incentivizing the model to learn the general concept? Answer: that's a very good question and I don't know. During the pre-training phase, somewhere on the internet someone wrote "what is 2 + 2," and maybe someone wrote out a derivation of some more complicated equation, and the model learned to predict that whole trace. If it keeps seeing this, RL accentuates it: "I've seen this before, let's make it more prevalent in the model." So somewhere in the model it has learned that 2 + 2 = 4. There are really two schools of thought. The first is that the model already has this knowledge — it already knows what 2 + 2 is — and RL just up-weights that circuit. The second is that maybe the model doesn't know it, and RL actually teaches it something new. I'm more in the first camp: the model probably already knows it, and we're just making it more pronounced.

Follow-up: so could you extract that — isolate the addition behaviour, for instance? Answer: you could take the language model, look at which weights change during the RL phase, and keep feeding it "what is 2 + 2," "what is 2 + 2," over and over.
You keep asking the same question, watch which weights are changing, and in principle you could extract that from the model. I don't know if there's research on this — I'm making it up on the spot — but maybe that's a research question; someone should write the paper. Any other questions?

Question: does RL change all the parameters in the model? Answer: large model labs will most likely update every single parameter. But there are papers showing that not all parameters actually change that much: the majority of the updates are essentially zero, and only a small number of weights move meaningfully. That supports the circuit idea — the model already knows how to do whatever you're asking, and most of the updates are zero. As for which is better: you can also do LoRA or other parameter-efficient fine-tuning; you don't have to fine-tune everything. But I think the majority of large labs just train everything, because otherwise it becomes yet another optimization problem — which layers do you select, and so on. It gets more complicated, but yes, you can do parameter-efficient fine-tuning; in fact we're going to show a notebook for that, so you can run it on your own computer.

One more question, about the KL-divergence term — whether removing it would be better, since the whole point of the KL term is to keep the model from straying too far from the supervised model. Maybe I'm not up to date with every paper, but your point was that removing the KL term lets the model learn new capabilities? It's hard to say; I assume the large labs have their strategies. I'll show examples of how to make RL work better — how to reach higher reward faster — but genuinely eliciting new capabilities is very, very hard. The real question is whether a behaviour is new or was already part of the model, and most research papers are hand-wavy about it: they say most updates are sparse, so it's probably not a new capability. But what happens if, a year from now, the updates aren't sparse — is that a new capability? I don't know. I probably didn't fully answer your question; maybe later parts of the talk will. Okay, let's keep going; more questions later.

So: the reward model was actually a language model — some neural network, some AI model that predicts the reward. In RLVR we delete this entirely and just use a reward function.
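To make "reward function" concrete, here is a minimal sketch of two of the verifiable checks mentioned just below — a formatting check with a regular expression, and executing generated Python code to see whether it runs. The specific patterns, scores, and the 5-second timeout are illustrative assumptions, not anything prescribed in the talk:

```python
import re
import subprocess
import sys

def format_reward(completion: str) -> float:
    """+1 if the completion contains the requested 'Answer: <number>' format, else 0."""
    return 1.0 if re.search(r"Answer:\s*-?\d+", completion) else 0.0

def code_runs_reward(code: str) -> float:
    """+1 if the generated Python executes without import/syntax/runtime errors, -1 otherwise."""
    try:
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, timeout=5)
        return 1.0 if result.returncode == 0 else -1.0
    except subprocess.TimeoutExpired:
        return -1.0

def total_reward(completion: str, code: str) -> float:
    """A total reward can simply be the sum of several such checks."""
    return format_reward(completion) + code_runs_reward(code)
```

Each check just returns a number, which is the only real requirement on a reward function.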
So you use the ground-truth reward directly: if the answer is correct, +1; if it's bad, 0. Remember that GRPO already deletes the value model; RLVR then also replaces the reward model with a plain reward function. As a reward function you could still use an LLM as a judge — ask a language model whether the answer is good or bad. You could do a regular-expression check: is the formatting of the answer right, is the maths correct, is the final output good? You can do distance-based scoring. You can also execute the generated Python code and check whether it actually ran — are there import errors, formatting errors, some other Python error — and use that as the reward. The reward (the blue box) can be anything you like; it just needs to output a number — minus one, plus one, whatever.

In fact, you could make a deliberately dumb reward that's just random: +1 half the time, −1 half the time. And, confusingly, a paper recently claimed that random rewards work. Go ahead and try it if you like — but honestly I don't quite believe it; there was a follow-up showing the result was wrong, because the benchmarking was off: the claimed jump from 20% to 50% accuracy was measured against a model that was already at 50%, they just hadn't evaluated the starting model correctly. So there's been a rebuttal to those kinds of papers. Interesting results, though.

Remember, in RL you don't know the best action to take in advance. Playing Pac-Man, I don't know whether going left, right, up, or down is best; only at the very end do you win, get some reward, or die. The goal of RL is to push up the better actions — not necessarily the single best action, but the better ones. In normal pre-training you already know the best answer: you already know the next word. If you're predicting "Hello, my name is Daniel," you already know the next word is "Daniel." In RL you don't know the correct answer in advance, so the only thing you can do is push up the better options.

Okay, now some maths. The goal is to maximize this equation. J is the objective — I said it was the total gradient, but really it's the thing we want to maximize — and we take its gradient with respect to the policy, the language model; the action is conditioned on the state, and R is the reward.
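Written out, this is presumably the standard policy-gradient (REINFORCE) objective that the slide is describing:

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}\big[\, \nabla_\theta \log \pi_\theta(a \mid s)\; R \,\big]$$

where $\pi_\theta$ is the policy (the language model), $s$ is the state (the prompt), $a$ is the action (the generated answer), and $R$ is the reward.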
In English: we take the derivative of the log-probability of the action given the state, times the reward. Let me make that concrete with the Pac-Man example. You are Pac-Man; the red thing is the enemy — you definitely don't want to go there — and you want to eat the two grey dots. Remember you can only go up, down, left, or right; the action space is just those four moves. I made up some rewards: go to the red thing and you get −10 (really it should be minus infinity, since you die, but say −10); eat either grey dot and you get +1; go up and there's nothing there, so the reward is 0.

Now the model — the policy — has to tell you the next action. For now, assign every action, up, down, left, right, a probability of one quarter: you go up 25% of the time, left 25% of the time, and so on. That's the whole state. The goal of RL is to make the move towards the red enemy (going right) much less likely — push that 0.25 down — and to make going down and left much more likely; going up doesn't matter much. RL, essentially, is: avoid the bad thing and do the good thing more often.

Put this in a table. The probability of each action given the state is 0.25. The rewards are 0, +1, +1, and −10. Probability times reward gives 0, 0.25, 0.25, and −2.5. And log-probability times reward gives 0, −0.60, −0.60, and 6.02. Now, which of these actions do we want? The last row is the one with reward −10 — that's the action we do not want to take, so we want to push its probability down and push the probabilities of the good actions up. Take the sum of the four log-probability-times-reward numbers: 4.81.

Let's check by hand what happens if we do the opposite and take the bad action more often — increase the probability of going right. The probability-times-reward for the right move becomes −4 (it used to be −2.5), and the sum of log-probability times reward drops to 2.58; before, it was 4.81.
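A quick numerical sketch of this table: the numbers quoted from the slide match base-10 logarithms, and the two perturbed probability settings below (0.2/0.4 and 0.3/0.1) are my own inference, chosen because they reproduce the quoted sums of roughly 2.58 and the 8.9 discussed next:

```python
import math

rewards = {"up": 0.0, "down": 1.0, "left": 1.0, "right": -10.0}

def objective(probs: dict) -> float:
    """Sum over actions of log10(p(action | state)) * reward(action)."""
    return sum(math.log10(probs[a]) * rewards[a] for a in rewards)

uniform  = {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}
more_bad = {"up": 0.20, "down": 0.20, "left": 0.20, "right": 0.40}  # go right (the bad move) more often
less_bad = {"up": 0.30, "down": 0.30, "left": 0.30, "right": 0.10}  # go right less often

print(round(objective(uniform), 2))   # ~4.82, quoted as 4.81 in the talk
print(round(objective(more_bad), 2))  # 2.58 -- smaller, so this change is worse
print(round(objective(less_bad), 2))  # 8.95 -- larger, matching the ~8.9 discussed next
```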
Is 2.58 smaller or bigger than 4.81? Obviously smaller — so this change is worse. Remember the goal is to maximize this quantity, so the original 4.81 is better than 2.58; don't increase the probability of the bad action. Now do the opposite: go right less often and take the other moves more often. If you recompute the sum of log-probability times reward, you get about 8.9, a larger number — and the goal is to push this as high as possible.

You could say: we already know the answer, so why not just set the best action's probability to 100% and collect the maximum reward? You should not do that, because then the model just learns "keep doing this one thing forever," gets stuck, and the optimization goes badly. So definitely don't collapse the probabilities like that.

For those thinking of REINFORCE: we don't actually multiply by the raw reward. In the equation — the log-probability of the action given the state — you don't multiply by the reward itself; you multiply by something called the advantage. The advantage is the reward minus the average reward, the baseline. So you don't just maximize reward; you maximize reward relative to the average reward. That baseline is what the value function — the value model — estimates. Remember GRPO deletes the value model: this is the model it deletes. The value model estimates the average reward given only the current state. It doesn't look at the next action; it just takes a snapshot of the current situation and guesses the reward. You don't hand it the −10, +1, +1, 0; it just looks at the state and produces a number, and that number is the average reward. So now the goal is to maximize this new equation with the advantage instead of the raw reward. Any questions? There's a lot of maths, but questions are welcome.

Question: in terms of probability — is the language model giving an estimated probability, or is this a known probability over all possible states? How do you get it in practice? Answer: a large language model predicts the next word. You take, say, all of Wikipedia, chunk it into tokens, and the output is a distribution over the next word. For example, "My name is Daniel" — but it could also be "My name is Michael," "My name is Bob," whatever. You have a probability for every single word in the vocabulary, something like 128,000 words; every one of them gets a probability.
The trick for language models is that you can use those probabilities directly. That's the trick, and it essentially makes everything easier. Any other questions? Someone asked about multimodal models, doing RL on multimodal models. That is harder. You could take something like a Sudoku puzzle and convert the text setup into a vision one, or just cheat, I guess: give it the Pac-Man board as an image and ask the model what to do next. Vision is kind of the same thing, but it's harder. Does o3 do vision plus reinforcement learning? I think it does. For open source, I don't think I've seen open source models do that very well; it's still very hard. Any other questions? Oh, someone did ask: what is the B, the baseline? Your goal is: you see the current state, whatever the environment currently looks like, and you just want to produce a number that approximates the total average reward. I'll give you an example. Pretend you're playing chess, or Go, like AlphaGo. You look at the current state of the board and say: what is the probability of the white player winning? You're not predicting moves; you just have to predict the probability of the white player winning by looking at the board. That's roughly the average reward. Is it always low? Yes, often, but remember at the very end phases you might get a higher reward; the goal is that you always want to predict that probability. In chess I'm sure there are moves you can make to push the reward higher, but when the model sees a board, it needs to say whether this board is better than the previous boards, and this model you have to train as well. It needs to output a probability of winning; that's the chess example. Does that kind of make sense? Is that the value model? No, well, the value model is different from the other two; there are three models. There's the value model, which predicts the average reward of the state. The reference model is just the model you started with. And the policy, the actual model you're changing, is the final result, the actual chat model. So there are three models. This one, the B, only sees the current state. You might think it uses the policy, but no: you just look at the current state and output a probability of whether this chess board is good or bad, like 0.8, so 80% chance you're going to win, something like that. Someone asked another question; yes, we'll talk about that, this is just the general, simpler formula. And yes, the policy is what's predicting; that's the probability.
That is a good question, and it's an active area of research, because you could normalize by all of the tokens or by the entire rollout, the whole turn. It remains to be seen which one is better; people are still debating it. Generally speaking, people just assume the rollout is fine, assume the chain of thought is fine, and only score the very end. But then, yes, you do have to multiply probabilities, so there is a multiplication somewhere and the numbers get very small. The numbers are relative, though: everything is very small, but the bigger ones are still bigger than the smaller ones, so it all works out relatively. Was there one more? Oh no, this is very old; very, very old. Another question: can you give some advice on how to think about this at an abstract level, about error propagation? If you have a trained model that does the scoring, the value function, it is itself trained from data and has some error margin; say its softmax only produces the wrong probability one time in a hundred. How do we think about how these models develop over time, and to what extent is that error propagation something you can observe, measure, systematize, and engineer around? My view is that all of these formulas are just made up. The goal is to maximize reward, but you can't just maximize reward, because otherwise you might make the model really silly. Pretend your dataset was just "what is 2 plus 2" over and over. You could literally cheat: make the model just say four, four, forever. Do you want that as a model? Definitely not. It needs to learn that if I give it the next question, what is 8 plus 8, it should not just say four, and for what is 2 minus 2, it shouldn't say four either. So the goal of all these algorithms is to somehow force the model not to overfit to your questions; that's what these formulations are trying to do. The follow-up: thinking about the chess model that scores the board and produces a number for how good or bad it is, sometimes well-trained models find these novelties where they say, make this move, and the new state is not obviously good, but they somehow figured it out. Suppose your training mechanism for the value function hasn't picked up on something like that; there's some error in the value model, its scoring of the board is not always exactly right, and that feeds into your training process. How do you think about that? The value model you have to train together with everything else; it's a combination of the entire algorithm. The value model predicts the probability, but you have to train it as well, and that is actually the problem.
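For concreteness, here is a bare-bones sketch of what a value model can look like in code, assuming a PyTorch setup and a state that has already been encoded into a flat feature vector. None of this is any particular lab's implementation; it just shows the shape of the idea: the network sees only the state, outputs one scalar (the baseline B), and is itself trained, typically by regressing toward observed returns.

```python
import torch
import torch.nn as nn

class ValueModel(nn.Module):
    """Sees only the current state, outputs one number: the estimated average reward."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # a single scalar baseline
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

# Trained alongside the policy, e.g. with a mean-squared-error loss toward
# whatever returns were actually observed for those states.
value_model = ValueModel(state_dim=8)
states = torch.randn(4, 8)                          # a batch of made-up encoded states
observed_returns = torch.tensor([0.0, 1.0, 1.0, -10.0])
loss = nn.functional.mse_loss(value_model(states), observed_returns)
loss.backward()
```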
Some people train the value model separately: you can gather all the chess positions and train it to output the final number; I think that's what some people do. Or you can train it in tandem with the policy, which I think is actually harder. So there is always error in the value model, always; you have to train it as well, so you can reduce the error, but there's always some. There are knobs you can use to make the value model less prominent, so you don't lean on the value function too heavily. But in GRPO we just get rid of the value model anyway; it's totally gone, so you don't need to worry about that anymore. Okay, I'll keep going, let me just check the time. So remember, the goal is the advantage: we maximize the advantage, not the reward. The advantage is the reward minus the average reward, the baseline. If the advantage is less than zero, the action is worse than average; if it's more than zero, it's better than average. So the goal is: do the action more if it's better than average, in general. Now, on to PPO. I don't know if you've seen the PPO formula; it is ugly. It looks confusing because there's a clip, and an epsilon, and so on. But we can strip everything away: it's just the probability of the action given the state, times the advantage, which we literally just discussed (okay, the log is gone now), and the rest is trying to reduce overfitting. Essentially there's a division by the old model, the model that created the action, and the goal is that we now want to maximize this likelihood ratio. We don't just maximize the probability of the action with the highest reward; we maximize the ratio instead. But what is this ratio? I made some numbers up. If the numerator is 0.01 and the denominator is 0.01, then 0.01 divided by 0.01 is 1. If the top is 0.01 and the bottom is 0.99 (remember these are all probabilities), you divide the top by the bottom and get about 0.01. If the top is 0.99 and the bottom is 0.01, you get 99, and so on; the last one is one. So take 0.01 divided by 0.99, roughly 0.01: the denominator being 0.99 means the action was very likely under the old model, but the new model gives it only 0.01, we don't like this action anymore, so the ratio is small. The other case: the denominator is 0.01, so the action was not very likely under the old model, but we actually like it now because the top is 0.99, and the division gives 99, which is good. So we're not really maximizing the probability; we're maximizing this likelihood ratio. And the question is, why not just maximize the probability, the first equation? Why do we need the division? Because if you maximize just the top, you will get reward hacking. What is 2 plus 2?
It might say, "to solve this question we need to do blah blah blah," then suddenly say "hello hello hello hello," and then say four. Is that good? I don't think so. We don't want it to say hello hello hello, or produce some weird trace in the reasoning. And that "hello hello hello" output is actually very unlikely under the old model, so the goal of the division is to reduce these issues. The epsilon part is called the trust region. Essentially we don't want to take large steps in PPO, and the trick is to restrict them; you don't want to overfit the model. So we constrain it: epsilon could be something like 0.2 or 0.1, so 1 minus epsilon is 0.8 and 1 plus epsilon is 1.2, and the idea is we just don't move in the direction of the gradient that much. We don't trust the model that much, we don't trust the algorithm that much, so we clamp it. Then PPO also has a KL term, another term, and what it does is keep the model as close to the supervised fine-tuned model as possible. We don't want it to drift too far from the base model, or the pre-trained or supervised fine-tuned model, so if it deviates too much we tax it. The beta is something like 0.05, and the KL divergence is, okay, not exactly a distance, but roughly the distance between the current model and the reference model, and we shove that into the equation too. So you can see PPO has many moving parts. Honestly, who cares about the exact equation; the point is all of these add-ons exist to reduce overfitting and stop the model from wandering off to some weird state that overfits to your questions. The trick of PPO is that they added all these terms to make training more stable. The final equation looks like this, and to be honest, nobody memorizes the formula; it's not that important. I just try to break it into pieces. The goal, remember, is to maximize this equation, and normally I just think about the simple version: the ratio times the advantage, the table we did. That's enough; you don't need the rest of the formulas. Any questions? Someone asked about the size of the updates. Correct: the biggest problem is, pretend you just started RL. You have the base model or a supervised fine-tuned model, and then you do RL. The gradient updates at the very beginning are going to be gigantic: you ask what is 2 plus 2, it says five, and you want to penalize it dramatically, but you don't actually want to take huge steps. So the goal is to constrain it: if the gradient update is extremely large, you clamp the numbers. The goal is just to constrain the update, not make it too large. What about the ratio? Oh, the ratio, sorry, that's not the KL divergence, that's the likelihood ratio.
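Putting the pieces just described together, here is a rough sketch of the PPO-style objective: the likelihood ratio against the old model, the clipped trust region, and a penalty for drifting from a reference model. Names, shapes, the simple KL estimate, and the default values are all illustrative, not any particular library's API.

```python
import torch

def ppo_objective(new_logprobs, old_logprobs, ref_logprobs, advantages,
                  epsilon=0.2, beta=0.05):
    # ratio = pi_new(a|s) / pi_old(a|s), computed in log space for stability
    ratio = torch.exp(new_logprobs - old_logprobs)

    # clipped surrogate: take the worse (minimum) of the raw and clipped terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    surrogate = torch.min(unclipped, clipped)

    # a crude penalty for drifting from the reference / SFT model
    # (real implementations estimate the KL term in various ways)
    kl_penalty = new_logprobs - ref_logprobs

    # we want to maximize this; a training loop would minimize its negative
    return (surrogate - beta * kl_penalty).mean()
```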
To be honest, I'd need to do more research on that one; I would ask Gemini exactly what it is. That's my answer; I'm probably not the best person to answer every single question. Any other questions? Yes, the denominator is the model that actually created the action. So for example, the model says you want to go up, down, left, or right. Is there a correct one? No, it's just whatever action the model produced; it might be good, it might be bad, it's just the action that was taken. Any other questions? About the action space, I don't think I can answer that; maybe research papers show it, I'm not sure. Okay. So, GRPO. The trick compared to PPO is that we remove the value model entirely; we do not want to estimate the average reward. And the reward model is removed as well, replaced by a reward function. Remember, B was the value model, and it's gone. But what do we replace it with? The trick of GRPO is rollouts, inference sampling. Take the question, what is 2 plus 2. You literally call the model four times; it might say the answer is zero, the answer is one, the answer is two, or the answer is four. Then you take the rewards: the correct answer is four, so the last one gets reward one and the rest get zero. And the trick is you just take the statistics of your current rollouts: the reward minus the mean, divided by the standard deviation, the z-score. That is your value model now; there's no value model anymore, it's just a number. I did this in a table as well. What is 2 plus 2? The predictions could be zero, one, two, or four, and the rewards are 0, 0, 0, and 1. The mean of the rewards is 0.25, and the standard deviation is about 0.433. Then reward minus the mean, divided by the standard deviation: the answer four is the correct one, which is why its z-score, about 1.73, is the largest number. So we want to push up that good answer and push down the bad ones. But why is it called group relative in GRPO? Because it's not just one question, it's many: what is 2 plus 2, what is 4 plus 4. Okay, my plots all look the same, but imagine there are several different tables: what is 2 plus 2, what is 4 plus 4, how do I write this Python function, whatever. There will be one table per question, and "group relative" just means that for each question we take the statistics within that group. For what is 2 plus 2, you call the model four times and get some answers; for what is 4 plus 4, you call it four times; for the Python code question, you call it four times.
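Here is the same computation as the table, in code, using the 0/0/0/1 rewards from the example. It's a minimal sketch: real implementations differ in details like sample versus population standard deviation and the epsilon used to avoid division by zero.

```python
import statistics

# GRPO's replacement for the value model: per-group reward statistics.
# For one prompt ("What is 2 + 2?") we sample the model several times and
# score each completion with a reward function.
completions = ["The answer is 0", "The answer is 1", "The answer is 2", "The answer is 4"]
rewards = [0.0, 0.0, 0.0, 1.0]             # only the last one is correct

mean = statistics.mean(rewards)            # 0.25
std = statistics.pstdev(rewards) + 1e-6    # ~0.433; epsilon avoids division by zero

# The "advantage" of each completion is just its z-score within the group.
advantages = [(r - mean) / std for r in rewards]
print(advantages)  # the correct answer gets the largest positive advantage (~1.73)

# "Group relative": with many prompts, you repeat this per group, so each
# question's rollouts are normalized only against each other.
```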
Yes, there are other pieces of GRPO, but essentially we've already explained what it is; everything you need to know about GRPO we've covered, and the full mathematical formula looks roughly like what's on the slide. There's some rearrangement, for example the minus beta times the KL divergence term is taken out of the reward function; that's the only other difference. Hopefully the parts of the GRPO formula make more sense now; it's really not that complicated. The majority is just trying to reduce overfitting: minus beta times the KL divergence reduces overfitting, one minus epsilon and one plus epsilon reduce overfitting, the division reduces overfitting. Everything reduces overfitting. That's all of machine learning and AI: make training more stable and reduce overfitting. Two things I highly recommend: Nathan Lambert's policy gradients material, his book is online and very helpful, and Yannic's video on GRPO, also very helpful. And now I'll go into a Colab demonstration of GRPO. Before that, does anyone have any questions? Let me check the time. On the memorization question: to answer it another way, I think it's because GRPO itself has that limitation. Remember, the goal of all these algorithms is to keep the model from drifting too far from the original model: the minus beta KL divergence, the one minus epsilon, all of it is trying to keep the model close. And I think that's the limitation, because you're forcing the model not to go too far. Maybe there will be some new algorithm, some other formulation, where you do want to go very far away; I don't know of any research papers on that, but you could. Any other questions? On Yann LeCun and energy-based models: what do I think about it? I can't really comment; you should definitely listen to what he says, but you don't see any movement on that in open source. I think open source kind of got captivated by RL and GRPO; I don't think open source people are doing what he's talking about. Unfortunately, I think he needs to talk about it more. JEPA, energy-based models, I don't think open source has picked that up; maybe we should talk about it more, but for now, no. And yes, the value model, we got rid of it, correct. On the group size: this is more of an optimization question. I just picked four rollouts for what is 2 plus 2, but in theory you should do as many as you like, 3,000, whatever; the more the better, in the limit. But remember, AI is about optimization and efficiency, and that would take forever, so in practice you don't do as many as you like; nobody can just sit there waiting for the computer to spin.
So yes, do as many as you can afford. Also, some recommendations: when you do inference sampling, set the temperature to something like 1.2 or 1.5, and min_p to something like 0.1. If you set temperature to zero you'll get the same answer every single time, so definitely don't do that. You want high temperature so the model produces new outputs as much as possible, to maximize variability. Your outputs should not all be the same; if they're all the same, I don't think it's going to learn. So make them as different as possible; that's why you set temperature to 1.2, 1.5, some large number, just don't go too large. Any other questions? Someone asked: at the first steps, all the rewards are basically zero, there's no signal at all about the underlying task. How do you deal with that? I've faced it a lot, especially moving to a larger model. That's a good question. You're saying if the model starts off with no reward, every single update is 0, 0, 0, 0, and nothing happens. Yes, that happens all the time. But by chance, just by chance, with some small probability, you will eventually get some reward. That's the trick. You'll see it: after 10,000 inferences of "what is 2 plus 2," suddenly the model says four, just by random probability, and then we make that more likely. That's all of GRPO. Sure, for a simple addition task it's a proof of concept, and yes, it may never come. But remember, you're not doing just this one question; you're shoving it in with other questions: what is 2 plus 2, what is 4 plus 4, take the derivative of something, write this Python function. The batch is very large, you shove it all together, and in general it works. Maybe by bad luck it doesn't, but the bad luck won't last forever, because you keep changing the samples: the next phase is some other question, and just by chance you get a good reward somewhere, and we push that to happen more. Does that kind of make sense? To be honest, it's all luck; we're praying that there will be some positive reward somewhere. There can be negative reward too, if your model is really bad, and then you just want to do those actions less. And by sheer probability you will get a good reward somewhere eventually. All the large model labs are literally relying on that; that's exactly what they do. They're waiting for the algorithm to work, praying for the GPUs, and suddenly the reward comes out. That's also why people use random seeds: the initialization of a run might not be good, so you kill the training run. You do 500 training runs, 499 of them sit at zero reward, you kill them all and don't release them.
I've seen that in my training too; yes, it's very common. And yes, starting with small, easy steps is exactly what I was going to show you. You can force the model to learn an easy question first, for example, what is 2 plus 2, it's four, you make it learn that first, and then move to the other steps. That's actually why, let me go back to the slides, where is it, okay, this one. Someone asked why you don't just start from the blue dot, the pre-trained model, and go straight to the green one. It's the same issue. The trick is we do some supervised fine-tuning first so the model knows some instructions, knows something, and then go to the reinforcement learning phase. If you start from nothing, just the pre-training phase, that's the hard part: your reward might be zero, zero, zero forever and then suddenly one. If you see zero rewards for a very long time, most likely your reward function is not that good; and yes, priming, letting the model learn a little about your data first, does actually help. So there are tricks to make it work, but otherwise I'd just say it's bad luck, and unfortunately there's not much you can do; it's not your fault, it's just unfortunate. Then there was a longer question: how should we think about what to expect from this? Is this how open source catches up with closed models? Can a competent ML engineer use it to specialize a model? Is there consensus on where this brings us, will smart people give us a really good open source model, or is this a new tool for specialization? My answer: the algorithm is not what's special. The hard part is the reward functions themselves and the data you shove into the model; that's the hard part. I think there's a misconception that the algorithm matters most. Honestly, who cares about the algorithm; you can use the general formulation I gave you, or any algorithm you like. The real problem is the reward function. I gave you some examples, like what is 2 plus 2, the answer is four, and yes, you can do distance-based rewards, but that's just one example. Can someone write a reward function for trading stocks? Do that, and then you have a model for trading. Go ahead. The follow-up question: so you think it will be more that, because people can create reward functions, it's a bit like writing a prompt, something most people can iterate on?
It's actually quite hard to come up with good reward functions, but it's easier than coming up with the algorithm itself. Actually, I think the big change is in collecting data. In the old days, large model labs would ask data providers like Scale to create data: for "what is 2 plus 2," you'd literally have someone sit there and write, the answer is four, and also write the chain of thought, "I think the answer is four because of such and such, here is my working out." You literally had to pay someone to produce the data. The trick now is that the data labeling step is gone: you have the question and the answer, and the middle step is removed. But you still need the reward function; you still need to verify whether the answer four is good or bad. For maths it's very easy. For code it's somewhat easy: you can check, did it import the right library, did the code execute, and some other reward functions. But it's still hard to verify whether the actual program is correct. For example, say the task was to create the Flappy Bird game. How do you know the output is good? You could ask a human to test the Flappy Bird game and hand out a good or bad reward. Or the trick is: did the game actually run? If it ran, plus one. Do the words "Flappy Bird" appear inside the functions? If yes, plus one. Is the Flappy Bird sprite image being used? If yes, plus one. Something like that; see the sketch below for the shape of it. So I would say the hardest part is writing the reward functions. And for open source specifically, if the whole open source community starts writing reward functions, we can probably beat o3 and o1, plus compute; you still need compute, that's the problem. But if you write good reward functions, you'll probably catch up in no time. Then the question: so the end state is an open version of the closed model? The original question was whether people will use this to make specialized models, like prompts, or whether we want one good open model. That is a good question, and it depends on which school of thought you're in. If you're in the camp that large language models already have the capability and RL just accentuates it, then, wait, I think I said this the wrong way around. The first camp ends up with many specialized models, because a model doesn't actually learn that much new from RL. The second camp is that the model actually learns something new, and then you have one gigantic model. I think OpenAI probably subscribes to that second view; most large model labs think RL can get you to AGI, it will know everything about everything, any question you ask it already knows, and that's where they're trying to go. For open source it's harder.
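Picking up the Flappy Bird example from a moment ago, here is a toy composite reward function that stacks a few cheap, verifiable checks. The specific checks, weights, and the idea of using a syntax check as a stand-in for "did the game run" are all made up for illustration; real reward functions for code are usually far more involved (sandboxed execution, unit tests, and so on).

```python
def flappy_bird_reward(generated_code: str) -> float:
    score = 0.0

    # 1) Does the code at least parse? (a cheap stand-in for "did the game run")
    try:
        compile(generated_code, "<generated>", "exec")
        score += 1.0
    except SyntaxError:
        score -= 1.0

    # 2) Does it mention Flappy Bird anywhere?
    if "flappy" in generated_code.lower():
        score += 1.0

    # 3) Does it appear to load a sprite image?
    if "sprite" in generated_code.lower() or ".png" in generated_code:
        score += 1.0

    return score

# GRPO would call this on every sampled completion for the prompt, then
# normalize the scores within the group as shown earlier.
```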
I think the open source community consensus, for now, is that the model already knows your questions and you're just trying to accentuate that. By writing reward functions, you're trying to weight the right circuits more, to steer the model toward knowing how to do these things. So the goal for open source is: if the entire community comes up with good reward functions and writes them all, the remaining problem is compute. Put both together and you'd get o10 or something, I don't know. Imagine every single person writing one reward function per day, okay, that's probably too ambitious, but you'd end up with billions of reward functions, more than OpenAI could ever come up with, and that would beat OpenAI. You still need the compute, though; that's the only catch. Okay, any other questions? Someone asked about saving the good traces. Very smart, yes; I don't know if large model labs do that, but you could. We're essentially mining for those good examples. The only problem, I would say: pretend the question was what is 2 plus 2, and the model says, okay, let me work this out, the number two means two apples and I want to add two more apples, I think the answer might be three, hmm, wait, let me rethink, actually it's four. Should you fine-tune on that? You could, but maybe it's cheating; maybe it just said four by chance. I'll give another example. Say the question was what is 2 plus 2, and the model outputs gibberish, like, I like to go to Paris for fun, I like to go to this event, blah blah blah, and then suddenly says four, just by chance. Remember, we're still rewarding that as good, but it is not good. So we reward at the very end: we see the number four, it's good; the question was what is 2 plus 2, and the model can generate anything it likes in between. You could reward the intermediate steps, but that gets harder, so in practice people don't score the steps in between; they only score the final step, because otherwise it gets too complicated. What is an intermediate step's reward? Far too complicated. So you just reward the final step: if you see the number four, it's good, but we don't know how we got there. So yes, maybe at the very end of RL you could take some of those traces and do fine-tuning; you could. But in general the issue is we don't know whether the process in between is any good. You're right that we don't train on the thought, but the problem is we also don't score the intermediate steps. You could, but remember, we don't know if the traces are good or bad. We don't know.
So you can't just take a trace and do supervised fine-tuning on it, because maybe the final answer four is good but we don't know about the intermediate steps, unless you read the data. You could ask human labelers to verify whether each trace is good, but that kind of defeats the whole purpose of RL, so you don't really want to do that. Does that kind of make sense? And yes, we will get to the Python part. Someone asked: what about normalization among the rewards, what are the best practices? What do you mean by normalization among rewards? If we're running this in one big batch, how do we normalize the fact that rewards could be minus 10, plus 1, plus 100, negative infinity, binary? Very good question. That's your choice, unfortunately; that's the problem with RL, it comes down to human choice. You have to decide: is the Python task more important than the maths task? Then you can weight it more, for example make the 2 plus 2 reward minus one and plus one, and the Python function reward zero and 1,000. You decide the weighting; it's kind of an art. You could just put everything on the same scale, which I think is probably what the large model labs do: every reward function on the same scale, plus one, minus one, not plus 10 here and minus 1,000 there. So it's up to you. And then: what about things like, is this a good summary, is this coherent? How do we create that reward? That's where LLM-as-a-judge comes in. There's a school of thought that you can use a language model itself to produce the number: ask, is this a good summary or a bad summary, give me a score from minus 10 to 10. You can do that; that's the LLM-as-a-judge approach. There is a paper showing this works for a while but then breaks down. You can't just keep calling the language model; it's a bit like cheating, calling ChatGPT to train ChatGPT. It works for some time and then breaks: the paper, I'd need to find it, showed that if you keep doing this, your actual reward eventually goes backwards. You get more and more reward and then suddenly, by bad luck again, it's always about bad luck, the reward goes back down. Honestly, a lot of AI is about bad luck and good luck, plus optimization and efficiency. So yes, you can use LLM-as-a-judge. If I use an LLM as a judge, doesn't it just end up being a teacher-student setup? Yes, and that's exactly why it eventually does badly if you keep pushing it; intuitively it works, but at some point it breaks. There is another approach: you could ask a language model to generate the reward functions themselves. That's another school of thought: ask a language model to generate billions of reward functions. But then the question is whether those reward functions are any good.
I don't know; now you're relying on the model being good, and you could then ask another language model to verify the reward functions. So yes, you could do this. Maybe that's what OpenAI is doing, I don't know; maybe their plan this whole time has been, generate all these reward functions, verify each of them, shove them into training, and see what happens. But yes, it is student-teacher. What's my opinion on making this scalable across many domains? The majority of reward functions today, that's why it's called verifiable rewards, are maths and coding, to be honest. And I think coding is also hard; I don't know why people lump it in. For code you can't truly verify correctness; you can only say it ran, or that the output is most likely correct, for some functions. For the Flappy Bird game, say, how do you actually verify that it even is Flappy Bird? I don't know. But you could try, and that's the whole point of LLM-as-a-judge: take the output of the Flappy Bird game, ask the language model, does this look like Flappy Bird, plus one if yes, minus one if no. You can do that, but you can only go so far. On the scaling part: I think large model labs are currently just using their own models to do the rewarding, like I described, does this look like Flappy Bird, plus one if yes, minus one if no, and their view is that if you keep doing this you get to AGI. Maybe, but I always fall back to: you might hit bad luck and it just won't work; in general you only get so far and then suddenly it stops working. Okay, any other questions? How do I think about mixing domains? That's your choice again. If you want to specialize, say you just want a legal bot, you're given a court case and the reward is whether the plaintiff or the defendant wins, you could do just law. But in my view, you should combine it with other domains: some maths, some programming, because you don't want the model to overfit to just law. Maybe maths helps, maybe coding helps, probably not much, but in general you should mix domains. I think all the large model labs' goal is to cover every single domain possible: mine every reward function in the world, shove them all into the model, and let it learn. So yes, do more domains; that's another research direction. On small models in particular: the notebook I'll share shows that you should probably do some supervised fine-tuning first; it's called the priming stage.
Otherwise, remember the plot over here, this one: you don't want to start from some poor pre-trained state and try to go straight to the RL stage. That's very inefficient, and remember, AI is all about efficiency, so we do some priming, the SFT stage, and then the other stages, if that was your question. And yes, the bigger the model the better, although the trick is that the research papers show small models actually do work, confusingly enough, because small models just do longer thinking: the reasoning traces get longer if the model is smaller, and a larger model's reasoning traces tend to be shorter. So small models do work, but they do break down: for very complicated reasoning, a small model might not cope, because with only seven billion parameters there's not that much room to move. Large models just have more room, and that's why they're better; I don't know if that answers your question. Can you take a distilled model, an existing reasoning model, and fine-tune it further? Yes, correct, exactly, you can. I'd say it's a bit more complicated, because it's already a reasoning model and you're trying to fine-tune it toward some other domain; it might be easier, it might be harder, it's luck again, so you have to try. It's all trial and error. Two more questions: first, is it pretty empirical, just try and see what works? Yes, correct. And second, how do I keep up with all the papers and content, how do I learn and what do I follow? To be honest, I don't really, and my view is you don't need to chase the latest research too closely, because sometimes the next day there's a rebuttal of the previous paper, and then the paper after that is a rebuttal of the rebuttal. So I wouldn't try to stay too up to date. I think the field has kind of matured and is mostly stable now; you might get some algorithm increasing accuracy by 1% or 2%, or some efficiency gain. Remember, all of the papers are about efficiency: making training more stable, reducing overfitting; they're always similar papers. You can keep up with papers, and Twitter is a very good resource; sometimes I tweet about papers. Nathan Lambert's book, the RLHF book, is very good and he keeps updating it, so definitely read that, and maybe follow Nathan Lambert; he's very good on Twitter for the latest research. In general there's a lot of noise in the RL space too; you don't know if the research is good or bad, rebuttals on top of rebuttals. So I'd suggest people just try things; it's trial and error.
Try and see whether your reward function is good or bad, whether the reward is stuck at 0, 0, 0; if so, unfortunately something's wrong, or it's just bad luck, try again. So it's empirical. And yes, these slides should go up; I was supposed to make a bit.ly link, I'll probably do that later, but I will share the slides, in Slack, okay, I'll do that. Any other questions? On PPO versus GRPO: yes, in the old PPO sense, the value model is a separate model and the reward model is a separate model. In GRPO we delete the value model entirely and create the value from statistics of the distribution: for what is 2 plus 2, you create four samples, four trials, find the mean, find the standard deviation, and that is your "value model"; it's not even a model anymore. And the reward model is gone as well; it's just reward functions. That's why it's called reinforcement learning with verifiable rewards; it's not normal RL anymore, the reward model has been replaced too. Does that make sense? Then on capabilities: again there are two schools of thought. The first is that for the question what is 2 plus 2, somewhere in this high-dimensional, trillion-plus-parameter space, the model already knows how to calculate four; there's some circuit inside it, and the goal of RL is just to amplify that circuit via these formulas. That's one school. The other is that RL actually teaches the model something new, actually learning how to do 2 plus 2 equals four, and it's not already in the model. And when I say capabilities the model already has, do I mean knowing the answer to a question, or knowing how to reason its way to the answer? That's a good question; maybe both, it depends. A contrived example: take all the world's data, 30 trillion tokens, and make up a question that is not in the data; you can construct a maths problem, some random number times another random number, that is not in the data, yet somehow the model has learned to do multiplication and addition somewhere. So maybe there's a circuit for addition, for multiplication, many circuits for these functions, and we want to accentuate them all; that's roughly what RL is trying to do. And yes, there's a reasoning circuit too; somehow the model also learns how to reason, and we want to make that more prominent, along with addition, multiplication, and so on. We're just trying to make all these circuits more prevalent. But that's only one half of the AI community; the other half says, no, the model is actually learning, we're training it to learn, and the base model doesn't know how to reason. Does that kind of make sense? Okay.
Any other questions? On other frameworks: yes, there is veRL, and TRL. Unsloth, we also, well, it's not exactly a framework; it's more that we show you can do GRPO and reinforcement learning with very low resources. We're the only package that lets you do GRPO on a free Colab, and that's the difference between us and everyone else. veRL is very good for large training runs, but for small experimentation, if you want to try things out, you don't know what reinforcement learning is, you don't know how to write reward functions or even which reward function to use, you should use our notebooks, and that's what I was going to demo. Next question: say you don't have a reasoning task, just a regular task; is there any benefit in using this to improve, say, tool use? Yes, exactly, you can do that. It should increase accuracy by quite a bit; if your tool-use accuracy wasn't very good before, RL should definitely help. And I feel like the trick of RL is that it reduces overfitting, because you do multiple inferences: you don't know which one is correct, but you're pushing up the good ones. The problem with ordinary fine-tuning is you're kind of overfitting the model, and the trick of reinforcement fine-tuning is you can reduce that, so the model actually learns how to do tool calling. Not just "I see someone ordering food, so I call DoorDash," but actually learning that because the person wants to order food, it should call DoorDash; it's like reverse thinking. The follow-up: in some of my experiments with Unsloth, I tried not explicitly prompting the model to think, just giving it a task, and it does not automatically start a thinking process unless I explicitly prompt, "first think, then answer." I was checking whether, without asking it to think, the tool accuracy itself would still improve, and that didn't happen. You don't strictly need the thinking process. Generally people use GRPO and RL algorithms to produce the thinking process, but that's an artifact of GRPO; the reasoning emerges by chance, and you don't need it. Maybe, by luck, the model learns to do tool calling without any thinking process; it could be some weird symbols, it could switch languages, sometimes models suddenly do that, it might even invent some new internal notation for tool calling, I don't know, but it could. In your case there was no thinking process; it just directly gives the output. And yes, if you sample, say, ten trajectories and none of them contains a thinking process, it never explores that region. For now it won't, but remember, it's all about luck: over time, by some small chance, a thinking process will show up somewhere, and then training says, do this more. To make that more probable, though, you should prompt for it.
In the system prompt you can say, please put your working out between these tags; you can force the model to produce the working out. You could even say, please create a new language that I don't understand which does tool calling, and it outputs some weird symbols and calls tools that way; I don't know. But yes, you should prompt it; it should make things more effective. Okay, any other questions? On the secret sauce of Unsloth: we use Triton kernels, we do kernel optimizations, we reduce memory usage by 70%; there are lots of optimizations to make training faster and more memory efficient. We'll talk about that later; it's not the main focus. For vLLM specifically, that's actually in the notebook. Before that, any other questions? On test-time scaling: the DeepSeek paper actually talks about this; there is pass@k and majority@k, and I believe they said test-time scaling improves the majority metric, though I'd have to go back and check the paper. But remember, test-time scaling is different from reinforcement learning; they're different methodologies. Test-time scaling is calling the model many times and taking the majority answer: for example, you ask ChatGPT what is 2 plus 2, it says four, four, four, and occasionally five or zero by chance, and you take the most likely answer. Reinforcement learning is different: you actually train the model to produce the whole trace, and you don't need to output 10,000 samples and pick the best one, you just do one. And yes, correct, the trick is you do the RL step and then test-time compute on top, and that makes the accuracy much higher. Actually, GRPO is kind of like that already: in GRPO you do test-time-style sampling inside the reward step, you call the model on what is 2 plus 2 multiple times and aggregate the results, so GRPO is doing a form of test-time scaling internally. Combining the two further sounds like a new research paper; you could do that, I guess. Okay, I'll move to the notebook now. To access the notebook, go to our GitHub page, once my internet actually loads. On the Unsloth GitHub page there's a button for Qwen3 GRPO; click "Start for free," and that's how you get the notebook. Or you can go to our docs, which also link the notebook. So: GitHub page, "Start for free" for the Qwen3 GRPO notebook, and you'll get this notebook. It's normally dark mode, but for presentations people can't see that, so I'll switch to light mode. We use vLLM under the hood. Does anyone not know vLLM? Okay, you really should use vLLM. For all open source, how do you serve a large language model? Use vLLM or SGLang, or I think Hugging Face has one as well. These are the best open source libraries for serving open source models.
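For anyone who hasn't touched vLLM, here is a minimal sketch of its offline API, just to show what "serve it with vLLM" looks like in practice. The model name is only an example; vLLM also ships an OpenAI-compatible server for production-style serving.

```python
from vllm import LLM, SamplingParams

# Load a model onto the local GPU(s); the name here is just an example.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is 2 + 2?"], params)
print(outputs[0].outputs[0].text)
```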
You have a GPU; how do you actually serve Llama 3, or Llama 4? You use vLLM to serve it. The trick of Unsloth: Unsloth is a package for fine-tuning, GRPO, reinforcement learning, continued pre-training, whatever you like, and the trick is we just optimize it. We make it much faster, use 70% less memory, make it fit on a free Colab. Again, please use the free Colab resources, and Kaggle, as I said, has 30 hours of free GPUs per week; please use them, they won't be unhappy. So you install Unsloth and vLLM, and we have this FastLanguageModel class you use to load a model. For example, we'll now use the Qwen3 base model. Remember, I told you not to start RL straight from a base model, but anyway, we're going to do it: on the plot, we're going from the dark blue dot to the dark green one, the thing I suggested not to do. You also have to set a max sequence length; if you want longer reasoning traces, you can increase it. We set it to 2048, and if you set it much larger, the free GPU will run out of memory. You can also load in 4-bit: with 4-bit quantization the weights shrink and you reduce memory usage by quite a bit. And we're using LoRA, a parameter-efficient fine-tuning method: you don't fine-tune every single weight in the entire model, which would be very costly; instead you add small adapter weights and fine-tune those. And because we use vLLM directly, we do a trick that reduces memory usage by a further 50%: we share vLLM's weights directly. Other training frameworks have to copy vLLM's weights, because you have the fine-tuning model and the vLLM weights separately; we share them, so memory drops by another 50%. We also use Unsloth gradient checkpointing, which reduces memory further. Essentially everything in AI is about reducing memory usage and improving efficiency, and every setting here is there for efficiency. For your LoRA rank, if you use LoRA, please set the alpha to two times the LoRA rank; it speeds up training dramatically. There's lots of other stuff, automatic compiling and so on; you don't need to read all of it. And here is the bulk, the most important part. Someone was asking about the prompt: you write a system prompt. "You are given a problem. Think about the problem and provide your working out. Place it between start working out and end working out. Then provide your solution between solution start and solution end." So there's a tag that starts the working out and a tag that ends it, and the output should look something like that.
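As a sketch, the setup looks roughly like the snippet below: a pair of tags for the working out, a pair for the final solution, and a system prompt that tells the model to use them. The exact tag strings here are illustrative placeholders, not necessarily the notebook's own strings, and as the talk keeps stressing, you can pick anything you like.

```python
# Illustrative tag strings; the actual notebook may use different ones.
reasoning_start = "<start_working_out>"
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = (
    "You are given a problem. Think about the problem and provide your working out. "
    f"Place it between {reasoning_start} and {reasoning_end}. "
    f"Then provide your solution between {solution_start} and {solution_end}."
)
```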
And here is the bulk, the most important part. Someone was asking about the prompt: you make a system prompt. "You are given a problem. Think about the problem and provide your working out. Place it between the reasoning-start and reasoning-end markers," where the reasoning start is "start working out" and the end is "end working out," "then provide your solution between the solution-start and solution-end markers." It should look something like that, and this is the system prompt we're going to use for reinforcement learning.

Remember, you can customize this however you like. It doesn't have to say "you are given a problem." It could be "you are given a legal case; think about the case and provide your legal reasoning; place it between 'thinking' and 'end thinking'." I'm making this up; it can be literally anything, and you can even make spelling mistakes, it doesn't really matter. The whole point of RL is that you design the reward and you design the system prompt however you like. I think the main misconception is that people see DeepSeek's think tags and the closed models' think tags and assume they must follow that format. You do not need to follow it at all; you can make it up entirely. It's fully customizable.

The hard part is that because we are using a base model, you also have to make a chat template. This is the more annoying part, but you can just copy and paste the chat template from the notebook; you don't need to do anything else. A base model does not have a chat template: you can't actually call it for conversation, it's not ChatGPT, it's just a base model, it doesn't do anything. So you need to specify a template for it to understand how to do conversations. The one we wrote is very generic; you can copy and paste it and it should work for anything.

After the chat template, we show an example of how to use it together with the tokenizer. For example, if you ask "what is 1 + 1," the formatted text says "You are given a problem. Think about the problem and provide your working out," and so on, then the question, say "what is 2 + 2?" where the answer is 4, and then the "start working out" marker. You give all of that to the model, and the goal of RL is to create the working-out automatically: the RL algorithm generates the thinking process by itself, and then finally it says four, well, hopefully it says four, and if you see four you make the reward higher. A short sketch of this prompt-plus-template step is below.
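A minimal sketch of that step; the marker strings here are illustrative (the notebook defines its own), and this just shows the shape of the prompt:

```python
# Sketch of the system prompt and how a formatted training example looks.
# The marker strings below are hypothetical; the notebook defines its own.
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = (
    "You are given a problem. Think about the problem and provide your working out. "
    f"Place it between {reasoning_start} and {reasoning_end}. "
    f"Then provide your solution between {solution_start} and {solution_end}."
)

# With a chat template applied, one prompt the model sees looks roughly like:
#   [system prompt] + "What is 2 + 2?" + reasoning_start
# and GRPO is rewarded when the model produces something like:
#   "... some working out ..." + reasoning_end + solution_start + "4" + solution_end

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is 2 + 2?"},
]
# Once a chat template has been attached to the tokenizer, something like
# tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# produces the final text prompt.
```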
Now, someone was asking about doing instruct fine-tuning first. Remember the diagram: we wanted to go straight from the blue dot to the green dot, but we found that doesn't actually work, so don't do that. Instead, take the pre-trained model, do some supervised fine-tuning, and then go to the green dot. So in this part of the notebook we show that you should do some supervised fine-tuning first; you need to fine-tune a little to prime the model. The goal is that you don't want the model to just output zero reward forever, and this dataset lets you prime it. For example, a problem is "what is the sum of all the real numbers such that..." and so on, and then, this is a trick, a hack: you use DeepSeek R1 to create some example reasoning traces and shove them into the fine-tuning step, so the model already learns roughly how to do reasoning. This dataset is very small, only 7,000 rows, and you don't need even that much for this first step; I think I only used around 600 rows, very little data. This is just the data-preparation step, so it's not that important.

I need to skip ahead to the reward functions, which are the most important part. Everything before this is the supervised fine-tuning (priming) step, so we skip it. The reward function creation is the most important part, and I feel like the majority of people neglect it. It is also the hardest part to do.

Okay, where is the reward function... here it is. This first one is a regular expression that checks whether the format is correct. Remember, we asked the model to put its working out between the start-working-out and end-working-out markers; this regular expression rewards the model when those markers are present, and if they're missing you actually penalize it. For example, if the model says "let me think ... end working out ... 2", the regex extracts the 2. Good: we forced the model to generate the answer between the markers, and it successfully extracted two. But the model might also generate some random spaces, or it might not follow your exact format; we still try to match it, and even with extra spaces we still successfully extract the number two. So the reward function is: if the output matches the format exactly, add 3 to the score; if not, give 0. The matching is done by that hand-written regular expression. The number doesn't have to be +3; it can be +300, +1, whatever you like. I just found +3 to work fine. And if the format isn't there, you don't have to give 0; you could also subtract 3 points instead. It's up to you, you design the reward function however you like. But remember, if the model's output doesn't exactly follow your format, you should still reward it at least a little, otherwise the reward will just be 0, 0, 0, 0 forever. So the trick is to give partial credit for the keywords, as in the sketch below.
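A minimal sketch of that exact-format reward, in the TRL reward-function style and using the same hypothetical markers as above (the notebook's regex and scores may differ):

```python
import re

# Hypothetical markers matching the earlier sketch; the notebook defines its own.
reasoning_end  = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end   = "</SOLUTION>"

# Regex that tolerates extra whitespace around the markers and captures the answer text.
match_format = re.compile(
    rf"{re.escape(reasoning_end)}\s*"
    rf"{re.escape(solution_start)}\s*(.+?)\s*{re.escape(solution_end)}",
    flags=re.DOTALL,
)

def match_format_exactly(completions, **kwargs):
    """+3 if the completion follows the format exactly, 0 otherwise.

    Completions here are assumed chat-style: a list holding one assistant message each.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        scores.append(3.0 if match_format.search(response) is not None else 0.0)
    return scores
```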
If you see the keyword in the output, you add at least 0.5; if you don't see it, you subtract 1. This lets you partially reward the model.

The reward functions now get more complicated. This larger reward function does distance-based scoring. Remember the earlier example: what is 2 + 2? Four is correct, but three is still a better answer than something like "D". If you output "D", it's definitely wrong; if you output five, it's wrong but closer. So this function takes the guess divided by the true answer as a ratio: if your number is close to the actual answer, you give it a higher reward, and if it's very far off, you penalize it by subtracting reward. If it's exactly correct, you also add five points. This is probably the most important reward function, but it's only for maths; for code and other tasks you have to create more reward functions.

Next we test whether the reward functions actually work, and yes, they extract the numbers. This cell is just extracting the solution, not the format reward, and you can see it extracted 0.34 and the other numbers. If your reward function is not extracting things properly, you probably did something wrong in the regular expression, so go and edit that. Then there are some helper functions, and one more reward function: if the model outputs a number with commas in it, like 123,456, we want to remove the commas, because otherwise you can't convert it to a number in Python. If the cleaned number equals the true answer you add 3.5 reward, and if not you subtract 1.5. After that there are more dataset-preparation functions, which are not that important. A sketch of the distance-based numeric reward is below.
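A rough sketch of that distance-based numeric reward, again in the TRL reward-function style; the extraction regex, thresholds, and scores are illustrative, not the notebook's exact values:

```python
import re

# Pull numbers (possibly with commas, like "123,456") out of the completion text.
number_re = re.compile(r"[-+]?[\d,]*\.?\d+")

def check_numbers(prompts, completions, answer, **kwargs):
    """Reward closeness to the true numeric answer; exact matches score highest."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        response = completion[0]["content"]
        found = number_re.findall(response)
        if not found:
            scores.append(-2.0)                       # nothing numeric extracted at all
            continue
        try:
            guess = float(found[-1].replace(",", ""))  # strip commas so Python can parse it
            true = float(true_answer)
        except ValueError:
            scores.append(-1.5)
            continue
        if guess == true:
            scores.append(5.0)                         # exactly right
            continue
        ratio = guess / true if true != 0 else float("inf")
        if 0.9 <= ratio <= 1.1:
            scores.append(1.0)                         # close: partial credit
        elif 0.8 <= ratio <= 1.2:
            scores.append(0.5)
        else:
            scores.append(-1.0)                        # far off: penalize
    return scores
```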
And now the meat of the code, the training configuration. We call vLLM with top_p = 1.0, which just means you sample the entire distribution; you could set it to something like 0.8 if you want, but I generally set it to 1.0 for full sampling. min_p = 0.1: I suggest people use this, because otherwise the model might produce random degenerate outputs during inference. For temperature, I suggest increasing it to around 1.0, 1.1, or 1.2. The higher the temperature, the more creative and random the model becomes; if you go too high, like 2, the output becomes gibberish, so don't use numbers that large. You should use min_p together with a high temperature; there is a paper about using temperature 1.5 with min_p 0.1, and you can use that combination.

num_generations is very important: it is the number of rollouts, that is, how many completions you sample per prompt for GRPO. We chose four, so for "what is 2 + 2?" it creates four candidate answers. If you increase this number you use much more memory, but you should increase it as much as you can.

Then there is the batch size, which we set to 1, and gradient accumulation. The effective batch size is roughly the batch size times the gradient accumulation steps (GRPO isn't exactly equivalent, but roughly). A batch size of 1 means one prompt at a time; a batch size of 3 would push three prompts through together. Generally you should set the batch size larger, but a bigger batch size uses more memory, so the trick is to use gradient accumulation instead; we set it to 16. Gradient accumulation adds up gradients over several steps, so you get the effect of a larger batch without using too much memory. There are also some functions for evaluation if you want to do evaluation. A rough sketch of this configuration is below, and then we look at the training run itself.
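Putting those settings together, a rough GRPO configuration sketch in the TRL style that Unsloth builds on; field names can vary by trl version (the sampling fields in particular are newer additions), and all values here are illustrative:

```python
# Rough GRPO training-config sketch (TRL-style). Values are illustrative, not the notebook's.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    temperature=1.0,                 # higher = more diverse rollouts; pair with min_p
    top_p=1.0,                       # sample the full distribution
    min_p=0.1,                       # avoid degenerate low-probability tokens (newer trl versions)
    num_generations=4,               # rollouts per prompt; more = better signal, more memory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # accumulate gradients instead of a big batch
    max_prompt_length=256,
    max_completion_length=1024,
    max_steps=300,
    learning_rate=5e-6,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    reward_funcs=[match_format_exactly, check_numbers],  # the reward sketches from earlier
    train_dataset=dataset,           # 'dataset' is assumed to be prepared earlier
)
trainer.train()
```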
During training you get a large table of numbers. This run took 2 hours and 54 minutes on a free Colab. Look at the reward column: minus 7.5, minus 5.5, minus 5.5, all very bad, and then suddenly plus 13, just by chance. Remember, the trick of GRPO is: when you see that plus 13, you maximize it even more. Then it doesn't quite stick, so it goes back to minus 7.5, minus 5.5, minus 7.5 and so on, and then plus 11, another good reward, which we also want to maximize, and so on. That is the trick of GRPO: literally by luck you get good answers, and you maximize them. If you keep scrolling down (I should make a plot, but in general) the reward increases over time; by the end these are all positive numbers, and the negative ones get fewer and fewer.

There are other columns too. Completion length tracks how long the reasoning trace is; remember a reasoning model's trace can be extremely long, and in general it should get longer over time, though not always. There is also a KL divergence column, which tells you how far the current model has drifted from the original model; the larger the number, the farther away it is. In general this number should get bigger over time; sometimes it doesn't move much, but it should trend up. And because we made separate reward functions, each one has its own reward column. The most important ones are the last two columns. In RL there is a common failure mode: many training runs just learn to follow the format and don't actually learn the task, so the format columns are not important. Do not look at them; they are basically useless. Look at the last two columns, the answer-checking rewards, which tell you whether the output is actually good or bad. You see minus 2.5, minus 2.5, not very good, and then suddenly 3.5, which is good and which we want to maximize. If you take a rolling average over time, the model gets better and better. Obviously we only trained for about three hours; if you trained for 20 days it might do very well, but remember this is a free Colab GPU. Again, the goal of GRPO is simply: we suddenly see a good answer with a good reward, and we maximize it. There is nothing fancy about it.

We can also look at some output from the model. At the very beginning there is an example like "compute the number of positive integers that divide at least..." and so on, some question, and the model does a reasoning trace. Remember, we already fine-tuned it a little, so it does something, but it just keeps blabbering on and on. We print out a lot of these during training; this is the output of the GRPO run, and if you inspect it over time you will see the model actually getting better and better.

Here is a concrete example. Say we ask the model "what is the square root of 101?" Not 100, which is just 10, but 101. If you do not train the model at all, this is what you get: "answers education math and arithmetic what is the square root of 101 wiki user..." That is literally what it says. Where do you think that data comes from? Probably a Wikipedia-style Q&A site, right? So the base model is useless; it doesn't answer the question. But after we do GRPO and ask the question again, it says: "Okay, so I need to find the square root of 101. Hmm, let me think. I remember that the square roots of numbers between perfect squares are..." and so on, and then it gives the solution: 10.049875... I think that's correct; it's certainly very close.
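For reference, a quick check of that answer:

```python
import math
print(math.sqrt(101))  # 10.04987562112089, so the model's 10.049875... is essentially correct
```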
So the whole point is that GRPO produced all of this, the entire reasoning trace. In the old days you would have needed a human to write all of that out and then fine-tune the model on it. With GRPO you skip that step: you don't need to write the traces anymore, it's automatic. That's the trick of GRPO and reinforcement learning. The whole reasoning phase is produced automatically, essentially from nothing, and in the end it reaches a solution.

Yes, a question about using just the base model. That's the trick: if you run GRPO on the base model alone it would eventually get there, but it would take far too long; you'd be waiting something like 20 days. For demonstration purposes in a Colab, you have to do the supervised fine-tuning step first.

Next question: what is the advantage of doing this with 7,000 examples versus using an instruct model out of the box? We actually have notebooks for that. If you look at our GRPO notebooks, the Llama 3.2 3B one, for example, uses an instruct model. You don't need to use a base model; we just showed that you can. I actually suggest people use an instruct model rather than a base model; it's all about efficiency as well.

Another question, about the KL divergence. Yes, correct: the goal of the KL divergence term is to keep the model from straying too far from the original model. KL divergence isn't strictly a distance, but it behaves like a distance between the current model you're training and the model you started from. If the model drifts too far, the KL divergence becomes very large. If you look at the table, this column here is the KL divergence, and over time it should get larger and larger, because the model is moving away from the original model. If you set the beta to zero, you remove that term entirely; that might make the model more capable, because you're no longer forcing it to stay close to the base model. It's an active area of research. Some people set it to zero, some don't; I think the default is around 0.05 or 0.03, something like that. A sketch of the KL term is below.

Any other questions? Yes, about the fine-tuning. The fine-tuning is actually already very helpful. The base model on its own is very bad. If you look at the loss from the fine-tuning stage, the priming stage where you use a dataset to prime the model first, the loss does decrease. If you see a loss around 0.64 that's good; if you see a loss higher than 3 that's very bad, and higher than 30 means something is definitely wrong. You can see the loss clearly decreases over time, so the fine-tuning stage does teach the model a little bit of reasoning. Another interesting fact: we used some of DeepSeek R1's reasoning traces to do that fine-tuning step.
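For reference, roughly how that KL term enters the GRPO objective; this is a paraphrase of the per-token estimator from the GRPO paper, not something taken from the notebook:

```latex
% KL penalty term in the GRPO objective (setting beta = 0 removes it entirely)
\mathcal{J}(\theta) \;\supset\; -\,\beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right],
\qquad
\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
= \frac{\pi_{\mathrm{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})}
- \log \frac{\pi_{\mathrm{ref}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})} - 1
```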
And interestingly, if you just call the model after that fine-tuning, without doing GRPO at all, it already kind of does reasoning from those 7,000 examples. It's not perfect; the goal of GRPO is to push it as close to perfect as possible, but the fine-tuning step alone already teaches it a little.

And no, you don't actually need to use all 7,000 rows. I think I only used 118 rows, over two training epochs, and it does fine. You could use even fewer, like 10 rows. You must use more than three rows, though, because otherwise with LoRA the gradients end up being zero; anything more than three is fine.

Question: any tricks for doing this training on a bigger model? You can take the same notebook and just edit the model name; instead of the 4B you can put 14 billion (or maybe it's 12, I can't remember), or whatever you like. You could even do Llama 3.3 70 billion, up to you, but you will need a better GPU. For a Colab demonstration the GPU is small, so I used a small model. We actually have notebooks where a 14-billion-parameter model fits on a free Colab; Phi at 14 billion fits. So you can do big models on a free Colab, and again, use Kaggle's free GPUs: that page also has GRPO notebooks for Kaggle. So you can do whatever you like for larger models.

Question about the vLLM rollouts: the trick is that we colocate, meaning you use the same machine for inference and fine-tuning, and because you share the vLLM weights you reduce memory usage. Some other trainers, like veRL and TRL, put the inference on one server and the training on another, and they have to communicate between them. For us there is no communication at all, so it's very close to asynchronous training with nearly no delay. We don't support that multi-node setup yet, but we do plan to support larger training runs.

Question about TPUs: yes, people have asked. It's not on the roadmap and I don't know if we'll support it; it's a bit more complicated. You could try PyTorch/XLA, which lowers PyTorch down to TPUs, so maybe it would work, but I've never tried it. Maybe later.

And one more: you don't have to start from a base model. The whole point of choosing a base model here was to show that you can take a base model all the way to the green dot. But in a Colab we do have to do some supervised fine-tuning first, otherwise you'd wait forever and the reward would just stay at zero. Remember, all of AI is about efficiency and speed, so we wanted to showcase that you still need the light-blue step, the supervised fine-tuning step.
Oh, and no, you don't need to; you can take the instruct model. In the notebooks over here, for example the Llama 3.2 3B notebook, we don't do any fine-tuning step at all. You skip directly to GRPO, because it's an instruct model already: it already knows how to chat and answer questions. The internet is very slow, but it's loading. Any other questions? Yes.

Question: in the REINFORCE-style algorithm you have the log probability of an action given a state; where is that happening in the notebook, inside the sampling model? It's behind the scenes, inside the GRPO trainer itself. Somewhere in the trainer code it computes the probability of each token versus all the other tokens; that calculation happens on the GPU inside the trainer. Remember that a language model already gives you those probabilities, and you just take the reward function and maximize it. Follow-up: so you take the logits that come out of the language model and turn them into probabilities? Yes, I think that's correct, essentially an exponential, a softmax over the logits. If you go into the code there is a derivation of it, but yes, you're correct.

Okay, the notebook loaded. There is another notebook which uses the instruct model, with no fine-tuning step at all; it just has the reward functions and so on. The Wheels notebook, for example, is also very good if anyone wants to check out other notebooks; I think it also uses the instruct model and then does GRPO.

Okay, time is running out, so technically the GRPO portion is done, but there are more sections and I'll have to breeze through them; there's only about 10 minutes left. I'll take questions at the very end, and I'm going to stay around afterwards anyway.

Quantization. We'll now shift over to quantization. I don't know if you know about the DeepSeek R1 1.58-bit quants that we did, but you can download those models: DeepSeek R1 is around 730 GB, and you can quantize it down to roughly 140 GB without that much loss in accuracy. There is obviously some loss, but the trick is that you can quantize these models down to be very small and, miraculously, they still work. Llama 4 Scout, for example: you probably can't see the accuracy plot, but the smallest number is 80% accuracy on MMLU 5-shot and the highest is 81-point-something, so it's really only about a 1% difference, and the one on the left is a 1-bit quant, which is tiny in comparison to the full-precision float8 model. So you can make the model roughly eight times smaller and only lose about 1% accuracy, which is very interesting. Essentially we showed that you can quantize the mixture-of-experts layers very heavily, but you must leave the attention layers, the shared experts, and certain other layers in higher precision. That's what we call the dynamic quantization methodology; a toy sketch of the idea follows.
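As a purely illustrative sketch of that idea, not Unsloth's actual quantization code: assign aggressive bit-widths to the MoE expert weights and keep attention, shared experts, and embeddings in higher precision, selecting layers by name.

```python
def assign_bits(param_names):
    """Toy illustration of dynamic quantization: pick a bit-width per layer by name.

    Not a real implementation; it only shows the 'quantize experts hard, keep
    attention and shared experts in higher precision' idea from the talk.
    """
    plan = {}
    for name in param_names:
        if any(k in name for k in ("shared_expert", "attn", "embed", "lm_head")):
            plan[name] = 16   # leave sensitive layers in high precision
        elif "experts" in name:
            plan[name] = 2    # MoE expert weights tolerate aggressive quantization
        else:
            plan[name] = 4    # everything else at a moderate bit-width
    return plan

# Example with made-up parameter names:
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.shared_expert.down_proj.weight",
    "model.layers.0.mlp.experts.17.down_proj.weight",
]
print(assign_bits(names))
```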
There was a benchmark of Llama 4 Scout, for example, where a 2-bit quant actually gets higher accuracy than some other full-precision providers, which is very interesting: the 2-bit quant gets around 73% accuracy while other inference providers get 65% or 67%, a very large difference. Okay, there were some bugs in the models, and maybe some providers quantized it incorrectly, but the point is that even if you quantize a model down to very few bits, it still works.

We showcase this with an example. Take a vision model like Qwen2-VL 2B: if you naively quantize all the layers to 4-bit and ask the model what an image shows, it will say "the image depicts a vibrant and colorful scene of a coastal area," which is totally wrong; the answer should be that the image shows a train traveling on tracks. Quantizing everything to 4-bit gives you a 1.36 GB model, but it's clearly bad. The trick is that you must leave some layers in higher precision, and you only need to increase the size by about 500 MB, to roughly 1.8 GB, and it works: "the image shows a train traveling on tracks."

But the question is: which layers do you not quantize? You could do an exhaustive search, checking "don't quantize layer 0," then layer 1, then layer 2, and so on, but that would take forever; you'd have something like 70-choose-1 plus 70-choose-2 combinations to test. Horrible. And remember, all of AI is about efficiency, so don't do that. The trick is to look at the activation quantization error and the weight quantization error, and you will see large outliers. For Qwen, for example, quantizing the first few layers is extremely bad, so you must leave them unquantized, and there is a gigantic jump in the weight quantization error at one layer, which means you probably shouldn't quantize that one either.

There are some other plots we show. Llama 3.2, for example, is interesting: each model's graph looks very different, and Llama 3.2 has these weird periodic spikes. That's because it uses attention that feeds back into the vision module, I think every three layers, so every third layer has a big jump, and you should not quantize those layers. Pixtral is a different graph again; unfortunately it seems you can't quantize many of its layers, and the whole vision module must be left unquantized. A rough sketch of the per-layer error measurement is below.
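A rough sketch of measuring per-layer weight quantization error to find the layers you should not quantize. This uses naive symmetric round-to-nearest quantization purely for illustration; real dynamic quants use better schemes and also look at activation error:

```python
import torch

def weight_quant_error(weight: torch.Tensor, bits: int = 4) -> float:
    """Relative error from naive symmetric round-to-nearest quantization of one weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    quantized = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale
    return (weight - quantized).norm().item() / weight.norm().item()

def rank_layers(model, bits: int = 4, top: int = 10):
    """Rank layers by how badly they quantize; the big outliers are the ones to keep in high precision."""
    errors = {name: weight_quant_error(p.detach().float(), bits)
              for name, p in model.named_parameters() if p.dim() >= 2}
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Usage sketch: rank_layers(model) on any loaded torch model, then exclude the worst
# offenders (plus attention / shared experts) from aggressive quantization.
```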
There is also a very important paper about which layers you should and shouldn't quantize, called the super-weights paper; you should definitely read it. Essentially it says that in all language models, in the down-projection of the first few layers, there is one single extremely important weight, and you should never quantize it, ever. The interesting finding is that it's not actually a very large number. There has been a trend in the language-model space of thinking that you should not quantize the outliers: these models do have big outliers, like a value of 3,000 suddenly appearing in the weights, and if you quantize it, it ruins the model. But this paper shows it's not actually the outliers that are the problem. These important numbers can be very small, and the plots show that if you select them and set them to zero, the accuracy decreases dramatically. They produce very large activation values, and removing them is very, very bad.

There is another trick you could do: if you have a model with seven billion parameters, set each number to zero one at a time. First parameter, zero it, check accuracy; second parameter, zero it, check accuracy; and so on, seven billion times, to see which number is the most important. You could do that, but remember, AI is about efficiency, so it's not a good idea.

Some more recent developments: instead of integer quantization to one, two, three, or four bits, the new Blackwell chips have a new format called FP4, or MXFP4. This is essentially a 4-bit float, and float4 is most likely going to be used a lot in the future. There are new formats like this that also let you train models in very low precision.

I also made a plot going from float32 onward. People always ask why GPUs keep getting faster and faster. My take is that this year is probably the last year you'll get GPUs that are actually meaningfully faster, because the majority of the speedup has come from reducing numerical precision. From float32 to float16 you get about five times faster. Why five times? Roughly speaking, the transistor cost of a floating-point multiply scales with the exponent width plus the square of the mantissa width. Float32 uses 23 bits for the mantissa, and 23 squared is very large; float16 reduces the mantissa to 10 bits, and that's where the roughly five-times speedup from float32 to float16 comes from: the representation of each weight is getting smaller. Then we moved from float16 to bfloat16, which is again maybe around two times faster. Then there is float8, which is faster still and uses even less space. Then float4, which is around two times faster than float8. But here's the problem: what's next? Float2, float3, float1? You can't push numerical precision much further; there is not much room left in that direction. So you can get maybe 180 or 200 times faster than float32 in total, but my take is that FP4 might be the last precision step that makes things faster, and in the future GPUs are not going to get much faster. So if people are thinking about buying Blackwell GPUs, you should probably buy them; it's most likely not going to get faster from here. That's my take.
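To make the arithmetic behind that claim concrete, under that rough cost model (multiply cost growing roughly with the square of the mantissa width), the standard mantissa widths give:

```latex
% Mantissa widths: FP32 = 23 bits, FP16 = 10, BF16 = 7, FP8 (E4M3) = 3, FP4 (E2M1) = 1
\frac{23^2}{10^2} = \frac{529}{100} \approx 5.3 \quad (\text{FP32} \to \text{FP16}),
\qquad
\frac{10^2}{7^2} = \frac{100}{49} \approx 2.0 \quad (\text{FP16} \to \text{BF16})
```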
I was also going to talk about kernels and related things, but I don't think I have enough time. One thing: you must use torch.compile. Every function you see, try wrapping it in torch.compile; I keep telling the PyTorch team to please make it the default. Why? Because it makes your training faster sometimes, only sometimes, not all the time, and it reduces memory usage most of the time; if you hit bugs, they'll probably get fixed. But remember, torch.compile is not as simple as you might think: you don't just call torch.compile on the model. There are actually many options you can tune. I listed just a few, and there are literally about ten more pages of options, I'm being serious. Imagine tuning every single one. That's why I highly suggest people learn to use torch.compile more effectively; it's probably the single biggest thing that can change your entire training run, making it more memory efficient and faster. So definitely look through it.

Okay, so in general, yes, thank you. Definitely star us on GitHub, join our Discord if you have any questions on RL and related topics, and we have a website as well. And finally, we have stickers, some limited-time stickers somewhere over there. Remember, if you have any questions, I'm still going to stay around. Thanks a lot. [Music]