[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han

Channel: aiDotEngineer

Published at: 2025-07-19

YouTube video id: OkEGJ5G3foU

Source: https://www.youtube.com/watch?v=OkEGJ5G3foU

[Music]
Hello guys. Hello. Hello. Hello. Yes.
Sorry for being a bit late. There's a
lot of traffic. But um hello. Um welcome
to the AI Engineer World's Fair. Thanks for
coming to my session. Um, and yes, today
we're going to talk about the deep dive
into RL, kernels, agents, and
quantization. Um, you might know me or
you might not know me, but I'm Daniel.
Uh, my brother is somewhere.
Uh, yeah, somewhere, but yes. Um, thanks
for coming again. Oh, we also have
stickers and other like random stuff
later. Um, that's after the talk. Um, so
maybe you might know me, maybe you might
not. Um, so we, um, on Twitter, we tweet
a lot. Um we did like a gradient
accumulation bug fix last year. Um we
introduced something called async
offloaded gradient checkpointing. Um we
also work with the Hugging Face, Google,
Meta, and Mistral teams to, like, fix bugs in
their open source models like
Gemma, Llama, Mistral, Phi and more. Um
yeah so like we if you want to follow on
the latest stuff of AI better follow us.
Um we tweet about random stuff. Um you
might even know when next models might
be released. You know we sometimes tell
people approximately. Um so that might
be very interesting. Um
we also do like open source
contributions to the entire open source
ecosystem. For example, we contribute
sometimes to llama.cpp. Um we work with
the Qwen team and Mistral on their
releases. Um we also do, like, you know, Phi-4
bug fixes, Llama 4 bug fixes, which
increase accuracy by a bit. Um so
definitely you know utilize some of the
new newer uploads that we do um which
fix bugs all the time.
We just surpassed 10 million monthly
downloads on Hugging Face. Um, and
yeah, we also have a GitHub package um
with 40,000 GitHub stars. Um, and
we do like essentially we make fine
tuning faster and reduce memory usage.
And yeah, that's the GitHub package.
Definitely check that out. Um, there's
like free Colab notebooks. I'm not sure
if many people know, but you have like
free GPUs that Google offers. You can
just use them. Um, please use them more.
Um, and Kaggle, if not many people know,
has 30 hours of free GPUs per week,
right? And there's no restrictions on
it. You know, please utilize the free
resources as much as possible. Um, yes,
they won't be unhappy. Just use them
all. So, yes, we have notebooks. So, if
you scroll down a bit on our GitHub uh
page, there's like free notebooks um for
Colab and Kaggle. We do reasoning, um,
continue pre-training, supervised
fine-tuning, and other stuff.
We also upload models to our Hugging
Face page. Um, for example, DeepSeek R1-0528
was released a few days ago. We upload 1.58-
bit quants which are, like, very small.
They retain most of the accuracy. Um so
like these can run on your local device.
Um even if you have like very low VRAM
or like not a very good GPU um it will
still work and we will constantly upload
models. So like sometimes people
complain to us you know please stop
uploading fixed models. You know it's
kind of annoying. Um but too bad you
know unfortunately models have bugs. So
we do have to fix them immediately. You
will see sometimes for example the
accuracy can increase by 10%. Um,
sometimes, you know, the large model
providers won't tell you that they
uploaded a fix. They're not going to
tell you, but you know, be sure to
download the latest models. You will get
all the fixes.
Now, today, let's start off from
history, right? Does everyone remember
Llama? Um,
and then finally it got leaked, right?
It was just a research paper, you know,
Meta saying, "Oh, we trained Llama."
Where's the weights? You know, it was
only research access. And then suddenly
it got leaked. And that kind of just
spawned the entire open source
movement. Um you know some of the people
who are the who are the authors you know
are not part of meta anymore but llama
is extremely important for the entire
ecosystem and it was like the beginning
of open source kind of um for large
language models.
The most famous plot from the paper is
this right? So like you know if you keep
making the model train more it just the
loss just keeps going down. Um well the
question is will the loss keep going
down? That's the question right? So
Llama 1 was only trained at 1.4 trillion
tokens, right? So now 1.4 trillion
tokens is actually very little. Most
models are trained on 10 times more. Um and
you can also see from the trend that the
bigger the model, the lower the loss,
right? So 7 billion is the blue line. Um
and then 65 billion was like the red
line. And you can see that in general as
a model gets bigger, it gets smarter.
Um the training loss numbers shown are
about right. So you should generally see
numbers around, maybe a bit
higher than, one. Um if you see training
losses of like 8 or
13 when you do fine-tuning, definitely something's wrong. Um, so
you should get losses around
two-ish, three-ish.
And so now,
oh, now Google, um, you know, Google's
new Gemma 3 models are trained on 14
trillion tokens, right? Which is much
more. Um, Llama 4 is trained on 30
trillion tokens, right? So like
literally like at least 10 times more.
Um, so Gemma is 10 times more. Llama is
like, you know, 30 times more.
Oh yes, I forgot. You can also access
the slides by the QR code if you want.
It's also on the docs as well. Um so if
you go to the docs there will be a link
to all the slides. Um I will probably
post the slides anyways on Twitter and
elsewhere so you can access them as
well.
Also if there are questions for people
like you know raise your hand ask I will
essentially do intermission between like
some parts and I will ask people if you
have questions please ask. Um last time
the talk many people asked questions I
will answer every single one even if
it's stupid I don't care please ask. Um,
sometimes I get stuff wrong. So, yes,
just ask questions. Um, are there any
questions? I'm assuming no. Okay. Okay.
Let's go to the Okay, I will. Okay. So,
I don't know if people have seen this
very famous plot from Maxime. Um, he
shows the open source versus closed
source performance on popular
benchmarks. I think this is MMLU 5-shot.
Um, you can see the green line is open
source models. Um, and then the red line
or the orange line is like, you know,
closed source models. Um and you can see
that in general the slope of the open
source models is more dramatic than the
closed source models. And I would say in
general that you know the open source
models and closed source models in terms
of MMLU they've kind of, like, reached the
same accuracy, right? So, like, you can see
that, okay, this is already outdated, but
in general you see, like, Llama 3.1 405
billion kind of reached, you know, GPT-4o
level um so open source models
definitely have caught up to closed
source models
however, there was a "however". Um, recently,
you know, like, since September 2024, I
would call this something called the
open-source drought. Um, no one wants
to talk about it, but I will, right? So,
like, September 2024, o1 got released, o1-
preview. And to be honest, the open
source community was shocked, right? So,
like suddenly the capabilities diverged,
right? So, there's something called the
MMLU plateau where most models, the open
source models and the closed source
models, kind of converged. So, the
open source models were equivalent to the
closed source models. But suddenly in
September 2024, you know, OpenAI released
o1-preview and it kind of shocked the
entire community because the capability
or intelligence kind of skyrocketed
right so like with reasoning long
reasoning traces it just was a total
change of mindset um and for four months
the open source community kind of died
internally because there was nothing we
can't replicate it we don't know what to
do you know do we do this do we do that
I don't know like and so like but then
suddenly in January 2025 DeepSeek
came along and they released R1, and
that's when the entire world kind of
changed their view, right? So, like, you can
in fact train open source models to be
as powerful as o1 or o3 or whatever,
right? So, like, that was what I
call the open-source drought.
however there was a previous drought
even before that. Remember when ChatGPT got
released in December 2022, right? So, like,
before ChatGPT most models were base
models. Um, they were not really instruct
fine-tuned that well, and so most large
pre-trained models were actually useless;
they were terrible. But then suddenly
ChatGPT came along and they did better, you
know reinforcement learning from human
feedback better instruction fine-tuning
better instruction following and it
really changed the world right so like I
think large language
models were already here before 2022,
right? They were already there, but it
was just ChatGPT which showed that if you
have good data right good instructions
good answers good supervised fine-tuning
um and good reinforcement learning you
can actually make the model very useful
um and yes again open source had a delay
a very long delay right until like llama
one I guess um and so always I would say
open source always tries to catch up to
the closed source models um the next
question is what is after reasoning um
is there going to be something else um I
think that's a very good question my
personal take is it's going to be very
hard. Um, I think reasoning was, like, the
last one. The DeepSeek R1 paper said that most
likely the model already has these
reasoning capabilities and we just need
to accentuate them. Um, and so, like, I'm
not sure if there's going to be some new
you know like step function where we'll
get to like the next capability um but
in my view I think like every single
time the closed source models will
always do like a step function um but
who knows maybe now it will plateau
forever I don't know so that's like you
have like you know you have like long
discussions about you know if AGI is
going to come or not like but who knows
um the talk is not going to be about
that but um yes next um so I call the
first jump the SFT or, um, RLHF jump, right?
So, like, that's essentially: if you do
good supervised fine-tuning you get this
large jump in performance. And then the
second jump is called the RL jump, right?
So, like, this essentially can increase
performance dramatically if you employ
methodologies like RL. Right, so
like but the question is like what's the
next jump um I don't know
so I'm not sure if you guys saw this
picture before um it's very widely known
in the community, you know, by Yann
LeCun. He essentially showed this cake. Um
and essentially unsupervised learning or
like just pre-training in general um you
know it's just a cake not that good um
and then supervised fine-tuning is kind
of the icing on on top of the cake. So
like it's a bit better. Um and then the
reinforcement learning is the cherry,
right? So like I'm not sure people like
the cherry, but like some people like
the cherry. Um and so like the goal is
how can we get the cherry? Um but the
problem is there's so little data about
this, right? Reinforcement learning has
very, very, very little data. And so
the problem is most large model labs
will train these large pre-trained
models and then they will iteratively
refine it to make the model better
through supervised learning, through
reinforcement learning.
Interestingly enough, this slide was
actually shared last year. Very popular.
But actually, this was from 2016
September. Um, so I had to dig this up
on YouTube. And so Yann LeCun actually talked
about this back in 2016. So literally
nearly 10 years ago. Um, I was like,
how? Wait a second. That's 10 years ago.
Very long. Um, so this slide was
actually very popular on Twitter. I I
think it was in November last year.
People kept tweeting about it. I don't
know. I saw this. I was like shocked.
Um, but yes. So this encapsulates like
the current AI um boom
and so like firstly like when we talked
about these large models remember they
started from a base model um and so we
call these training stages right so when
you have a base model um you then
convert it to a chat model right so like
for example ChatGPT is not a base model,
it is an instruct fine-tuned model, or like
some sort of fine-tuned model from a
base model. So actually OpenAI does have
a base model somewhere sitting on their
servers somewhere. They're probably not
going to serve it ever, but it is
somewhere on the computers, and they
essentially fine-tuned it to make ChatGPT.
Um, Claude 4 has most likely a base
model and then they fine-tune it to
become Opus. Right? So like Gemini also
has a base model and they convert it to
Gemini 2.5 Pro. So this this phase when
you convert a base model to a chat model
is the fine-tuning phase. Um and then
the question is like you know what do we
do in the arrow right like you know is
it reinforcement learning is it
supervised fine tuning is it like some
other special source I don't know but
like we essentially we'll discuss about
these um topics um any questions first
okay
so for example in open source models you
might have seen Gemma 3 PT, um, Gemma 3 IT,
Llama 4, Llama 4 Instruct, Qwen 3 Base,
Qwen 3, um, Mistral Small base, Mistral Small Instruct,
Llama 2, Llama 2 Chat, right? These, like,
terminologies, to be honest, I think the
open source community should standardize
the terminologies, like
Instruct or Chat or, you know, not even
a word, like, you know, IT and PT. Maybe
they should, like, standardize it a bit. Um,
but in general, if you see IT, it means
instruction-tuned. PT means
pre-trained. Um, Instruct just means, you
know, instruction fine-tuned. Qwen 3 just
removed it entirely, it's just called
Qwen 3. Um, and then the base model is
called with a Base. Um, and so, like,
essentially these naming methodologies
um if you see on hugging face um
hopefully you will now recognize these
different types of um models
And so generally, for, like,
you know, reinforcement learning and
fine-tuning, I would say
fine-tuning is everywhere. Um, you start
off with pre-training, you then convert
it into a supervised fine-tuning model
via supervised fine-tuning, SFT, it's
called SFT. You also might hear, like, IF,
which is instruction fine-tuning; they're
the same thing. Um, and then we call
something post-training, the post-
training phase. Um, but actually
recently it actually kind of changed.
Um, so I don't know if you guys have
been keeping up with the latest stuff,
terminology. Um, I actually don't really
like terminology anymore, but like we
have something called pre-training,
which is you take like all of Wikipedia,
all of the web, you know, everything,
all of the data you can ever see, shove
it into the model, predict the next
word. That's called the pre-training uh
pre-training stage. We then have
something called the mid training stage
um which essentially gives you higher
quality data. Like for example, you can
weight Wikipedia more because it's
higher quality. Um you can essentially
do long context extension as well. You
shove this in during the, uh,
mid-training stage. So if the context
of your model is very short and you want to
extend it to very long context, you
shove this in during the mid-training
stage. And then the next stage is the
supervised fine-tuning stage, where you
want to convert the model to a chat
model. And then we have the
post-training phase, which is, like,
preference fine-tuning, DPO, RLHF and
stuff like that. And then we have this
new thing called reinforcement
fine-tuning or RLVR. Uh if no one knows
what RLVR stands for, it stands for
reinforcement learning with verifiable
rewards. And this is like a new paradigm
um not the same as preference fine
tuning or DPO where we consider reward
functions to make models much better. Um
and so this is how I would envision like
you know the whole training phases of
models.
Another way to put it is we have some
random initialization of the model like
some random weights of the model right
so like seven billion parameters
literally random numbers, right? Like GPT-4,
I don't know, 1.4 trillion parameters,
like, just random numbers. And then
somehow we move in the space, like the
black line, right? So pretend this is, like,
some high-dimensional, 1.4-trillion-
dimensional space, and then we somehow
move in this space and then we get the
final model. Right? That's the green dot.
The question is how do we move in this
space to get to the final model? That's
the question.
Most people what they do is firstly you
start from a random initialization. You
do the pre-training phase which is very
long. You get to this dark blue dot,
right? That's called the pre-trained
model. And then you do some supervised
fine-tuning, instruction fine-tuning to
get the blue dot. And notice the line
for the light blue line is very short
because it is very short, right? There
is not that much data for supervised
finetuning. And then somehow we get the
blue dot and then we keep doing more
iteration to get to the purple dot which
is through preference finetuning. And
then finally we get the the green dot
which is reinforcement learning via you
know verifiable rewards like you know 03
or 01. And so like the goal is somehow
we have to move from the black dot to
the green dot. And essentially all of
large language models all of AI is just
an optimization problem right? like how
do we make this easier to get to the
green dot? You could, you know, you
could kind of theoretically guess you
could just why don't you just go from
the green dot black dot to the green dot
like skipping all of the dumb phases,
you know, just skip it entirely. Yes,
you could do that, but it's not going to
be very efficient. You're going to be
waiting there for like, you know, I
don't know, millennia. Your loss is not
going to go down. So, the trick that we
found in, um, AI is, like, you have to
do these phases to get to your final
green dot. Um there is like a new
methodology where you can actually
bypass the supervised finetuning stage
and the preference fine tuning stage and
directly go to the green dot. There is a
way and that's the dark red line. Um, I
think DeepSeek R1-Zero kind of
showed that you can take a
pre-trained model, a base model, and
directly do some reinforcement learning
with verifiable rewards and just skip it
entirely. Um, so that's, like, a new
paradigm that people want to focus on.
In my view I think you should still do
the light blue, the purple and then the
green. I don't think you should, like,
directly skip over to the green. If you
want to waste resources, you can skip to
the green. Um, but I don't think large
model labs want to waste resources. Um,
hopefully not.
So I I don't know if people have seen
this diagram. Agents in the old sense
like everyone keeps connecting agents
with reinforcement learning. Okay, but
like why? Um so in general an agent is
you have some sort of environment. You
have like the agent doing something in
the environment. you get like an action,
you do the action and then you get some
some sort of reward. Um, and the reward
is R, S is the state. So the current
what the environment currently looks
like. Um, and then essentially RL tries
to optimize this loop. You're trying to
maximize a reward. Um, giving some sort
of action. Um, and that's why like you
know that's why RL and agents are kind
of connected. Assume the agent is the
language model, right? So assume
the agent is in fact the language model.
And so this the environment is kind of
fishy like you know it's it's hard to
say what exactly is the environment it's
more like the language models inference
space that's the environment kind of um
but like pretend this was a game right
so like the agent was the computer the
environment is like Mario for example
right and so like you're playing the
Mario game automatically and your goal
is to win the game and so like the whole
goal of RL is to maximize reward
another one is like Pac-Man right so
like you have the yellow Pac-Man And um
you can either go up, down, left or
right. Right. So like up, down, left or
right. That is the action. And the
yellow, I think, I don't know,
orange or something, the orange little
things are, like, rewards, right? So, like,
if you eat, you know,
an orange dot, you will get
a positive reward. So R plus, right? So,
like, you have R pluses. If you eat a
very big one you'll get, like, a very large
reward. But if you encounter one of the
enemies you will get minus reward,
right? All right. So the
question is how do we maximize the
reward based on this environment
for language models there is a trick the
trick is this loop kind of changes
because we don't actually have a
continuous loop um the state does not
actually change over time right so like
for example in a game in a game if you
do an action the whole state changes
right the environment totally changes
and so you have to like continuously
keep a history of the past steps but in
language models there is no history,
right? So, like if you do a prompt, what
is 2 plus 2? If you ask another
question, what is 4 plus 4? It's like
totally not relevant to your previous
prompt. Okay, well, fine. It is kind of
relevant, but, like, it's not directly
correlated. And so, like you can
actually delete one of the lines. Um,
that's the next prompt. You can delete
it entirely.
So, for example, what is 2 plus two?
Right? So, like essentially you have all
of these options. It could be zero, it
could be one, it could be two, it could
be infinity, it could be B, it could be
D. I don't know. It could be anything
that you like, a symbol. And so the "what
is 2 plus 2" is the state. So, like, the state
is the question,
right? The reward, for
example: if you
choose four, your reward is plus one. If
you choose anything else your reward
might be negative infinity zero whatever
number you like. You can come up with
any number you like for reward. It
doesn't have to be plus one. It can be
plus 10. It can be plus 100. And you can
do anything that you like. Um you can
also do distance based scoring. For
example is you know is choosing the
number five better than zero? So that's
a question. What what do you guys think?
Is choosing the number five better than
zero or is it worse?
I guess better.
Yes. Okay. Better. So what would you do
for the reward then? Pretend the model
outputs five for what is 2 plus two?
Zero.
Zero. Someone said zero.
Okay. Zero is fine because it's wrong.
Yes. Like if you the answer five is
wrong. So you should probably give a
reward of zero. But is there a better
answer?
Less than one.
Okay. Yes. Less than one. So like some
sort of like maybe 0.8. I don't know.
Right. So correct. So like you could do
like the answer divided by the correct
answer, right? So like if it's five, you
divide it by four. You could do some
sort of Okay, no, that's wrong. It's
five minus four divided by four. Um so
like, some reward like that. Um,
pretend the model
says "A". What is the reward?
minus one. Okay. Or could be minus 10
because that's very bad, right? You you
should not output a letter. It should be
some sort of number. So that's how you
design reward functions, right? We
just design a reward function, you know, take your reward
function, it's, like, just if statements,
shove this into a language model
fine-tuning phase, and there we go, you
have o3. Okay, well, you won't have o3, but
you know what I mean. Um, essentially o3
is a collection of all these reward
functions right so like for example this
what is 2 plus 2 is one question. Remember,
it doesn't have to be what is 2 plus 2. It
is a general maths question: what is 10 +
20, what is 10 * 200 / 10, you know,
whatever maths equation you ever want,
and this function can take your question
and convert it into a number. And o3 is
just a collection of all of these reward
functions.
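To make that concrete, here is a minimal sketch of such a reward function: full reward for the exact answer, distance-based partial credit for a wrong number, and a penalty for non-numeric output. The specific values (+1, the 0-to-1 partial credit, -1) are illustrative choices, not anything prescribed in the talk:

```python
def math_reward(response: str, correct_answer: float) -> float:
    """Score an answer to a maths question like 'What is 2 + 2?'.

    Sketch only: exact match gets +1, a wrong number gets partial
    distance-based credit, and non-numeric output (e.g. 'A') is penalised.
    """
    try:
        predicted = float(response.strip())
    except ValueError:
        return -1.0  # a letter or symbol instead of a number is very bad

    if predicted == correct_answer:
        return 1.0  # exactly right

    # Distance-based partial credit: 5 is "less wrong" than 0 for 2 + 2 = 4.
    relative_error = abs(predicted - correct_answer) / max(abs(correct_answer), 1e-6)
    return max(0.0, 1.0 - relative_error)

# For "What is 2 + 2?" (correct answer 4):
#   math_reward("4", 4) -> 1.0
#   math_reward("5", 4) -> 0.75   (close, so some credit)
#   math_reward("0", 4) -> 0.0
#   math_reward("A", 4) -> -1.0
```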
And so the goal of RL is to make the
good ones more good. You want the good
rewards increase in value. So for
example, the four, you want the four to
appear more, but you want the three to
decrease. You know, you don't want the
three to keep appearing in your answer,
but you want the D and the B to be very,
very, very heavily penalized. That's the
goal of RL, right? So like we don't
actually have the answer, right? Okay,
this question is very easy. What is
2 plus 2? Obviously it's four. Yes. Okay,
that's very easy. But pretend you have
some sort of complicated question like,
for example, how do I win the stock
market? A dumb
thing, okay: how do I win the stock
market? You don't know what actions
you're going to take, right? Like but
the point is you have the result, right?
You have the result like profit or loss.
But the question is we don't know how do
we get to the good profit. And so the
question is how do we maximize good
actions as much as possible and decrease
bad actions as much as possible and that
is RL
And so OpenAI released something called
RLHF in, uh, ChatGPT, oh, actually I
think it was InstructGPT, but anyways, for
ChatGPT they showed that you just need
some training data. You need some data,
you interact with the agent, which is the
language model, you then get some actions,
which is your answer from the language
model, you feed this into a reward model,
and then you get some reward, and then
you keep, you know, iteratively doing
this step, and you'll finally get ChatGPT.
So remember the base model you start
with: you take the GPT base model and convert it
into ChatGPT via this method.
To expand on it a bit: if you guys have
heard of PPO, right, what is PPO? Um,
essentially PPO is just: you just expand
the box for the agent, right? So the
language model is, like, an agent, you
expand it, and there's just three models
inside of it. Um, there is a generating
policy, the reference policy, and then
there's a value model. Um, and that's all,
not that special. Um, we will talk about
each of these things separately. Um, but
PPO is just an optimization algorithm to
make RLHF work better.
GRPO, which is the algorithm behind DeepSeek R1,
smartly deletes one of these things: the value
model. It just gets rid of it. Um now
why would you delete it? We will talk
about this but the trick is if you
delete a model the value model you just
save parameters you save compute and
it's much more efficient right remember
each of these models is kind of like a
large language model right so like you
know, pretend your generating
model is, like, already 1.4 trillion
parameters. What are you going to do, make
another 1.4 trillion parameter
model for the value model? So we just
get rid of it, delete it. Um, and that is
GRPO. That's the only difference.
Okay. Well, there's other differences,
but the biggest difference is we get rid
of the value model. Any questions?
Yes.
Um so you talked about negative reward.
Yes.
It's confusing because in pre-training
isn't
probabilities.
Well, it's like negative rewards, right?
And then I guess
the phrase reward when it can be
positive or negative versus comparing it
to pre-training where it's always
negative.
Do you mean pre-training as in like the
negative log likelihood or like some
prob? So the when during pre-training
the goal is to maximize probability.
Yeah.
So like you know you output a some
number from 0 to one the probability of
the next word and you want to maximize
that for RL you want to maximize reward.
So if it's a negative reward, you still
want to maximize that. So if it's
negative 1, you just want to make this
negative one go away and you want to be
positive in the positive range. But also
rewards can actually be just negative,
right? So for example, if your your
reward function can be -10 and negative
1. The good one is negative 1, the bad
one is -10. Your goal is to move towards
negative 1 as much as possible because
your goal is to maximize it. So I would
say the reward is a misnomer. You could
just add 10 to everything and then, you
know, it just shifts the numbers.
Um, does that kind of make sense, or
is it a question of nomenclature?
Okay. Yeah.
Most people, I would say, to be
honest, they don't like negative rewards.
Actually in the RL space people just
like to do positive reward. I don't know
I like negative rewards. Um I I feel
like it's more for me it's more
intuitive. Um yeah. Any other questions?
Yeah. Yeah.
So you've got your language model and
then you've got generating policy and
reference policy. Yes. Right. What what
models are being used there? Is it the
same as your language model or is that
another model
trick? Yes, very good question. There
are some tricks that you can employ.
Most people just make them the same
model. The reference model is, like, the
beginning of the model. The generating
model is the model that you're updating.
So, like, essentially the reference model
is like the base model. Okay, that's
probably not a good way to put it, but fine, just
keep it as the base model. The
generating policy is
the model where you update it, so, like,
the base model's updates. So every single time
you get a base model, you update one,
update two, update three, that is the
generating policy, but we will talk
about this. So it's essentially the same
model, but there is updates to the
model. So the reference policy is the
model that is not updated. The
generating policy is the model that is
actually updated, and they're both the
same model. So the language model, one
of them's updated, the other one's not
updated. But we will talk about that.
Um, yes.
So the actions, is it typically one
token or is it more tokens
full
that's a good question in the Pac-Man
case the action will be a string of
actions right so like you can go up then
down then left or right you know some
sort of, like, long history. In the
language model space, generally
this is called single turn and
multi-turn generally speaking currently
single turn is what most people do it's
just one action so the action will be um
the action essentially is saying what is
2 plus two and the answer is four and
that is the action the action is
actually the inference space so like
what is the actual chain of thought that
is the action um and so like it's just
one though but it is the total sum of
the chain of thought so like if you have
like what is 2 plus two I think the
answer is four you know let me do some
working out blah blah blah blah blah
blah blah that is the entire thing if
does that kind of make sense
okay yes
There was a Claude conversation around
how, like, to finish a poem's next line they
have to like think ahead to the last
letter to the last word to match the
previous one. Do you think like reward
focus next allows you to do that?
you could
understand
I think for pre-training specifically
there are research papers which show
that actually pre-training doesn't just
predict the next word, it does try to
predict many words ahead. And so, like, yes,
maybe the reward model in reinforcement
learning, you can see, it
essentially accentuates a pre-training
behavior. So maybe this behavior already
exists in the model, we just see it more
often. So maybe I would say that
I would say that the model itself it
already has this capability. We just
want to make it more obvious. And so
maybe the model already knows how to do
that. It already knows it predicts 10
words ahead or 20 words ahead. It
already knows how to do that. But we
just want to make them more obvious. I'm
not sure that answers. Would it be safe
to say like so if it's a generation you
want like every last you want almost the
circuit for the last word?
Okay.
I think for reinforcement learning
specifically yeah I guess yes. Um, I
think it essentially your goal is to
maximize a reward. And so like however
whatever way you try to get there, it's
different from generally pre-training.
Pre-training is just maximizing next the
probability of the next word. But
reinforcement learning is you're trying
to maximize reward. The question is how
do you actually maximize a reward? Do
you like do chain of thought? Do you do
what you describe like thinking about
the next you know in the future or
something? I don't know. Like I mean the
question is like what is the reward
function actually doing? I don't know.
What is a language model actually doing?
I don't know. Um I don't know if that
answers your question like it's to be
Yeah.
Yes. Yeah.
I was I was curious about when you were
talking about you know the arithmetic
whether five is a better answer than say
yes or something
and like given that there are these like
closed circuits between all these
different related you know mathematical
functions you can do on numbers in space
like whether it is like in the
literature or in like the current state
of the art
better to train it so that a closer pred
is more accurate or whether just saying
like the right answer is right and
everything else is wrong which in some
logical sense is true
like tends to produce more performance
in that space.
Yes, you're correct. You should have
data which is like getting more accurate
data. Is that what you're trying to say?
Like you should have data which is like
for example what is two plus two? You
should get more data which is like four.
It shouldn't be like five or 10 or minus
100.
Well like saying that like the practice
of saying that five is better than 10.
Yes. like in general is that what people
do in produce tend to say like when
there is like exactly one correct answer
and everything else in a mathematical
sense is equally wrong because it's not
that answer
that is a good question I don't know um
I think large model labs won't tell you
exactly what they do in our experiments
when you use our notebooks we actually
show that if you do distance- based so
the closer your number is to the actual
number you will get better results but
generally speaking it's easier to just
say five is wrong just give it zero
reward everything is zero and then The
good one is like one. It's actually much
easier to do. Um, for example, if you
want to do execution of code, how do you
actually do distance based scoring,
right? So like if you ask it to create a
Flappy Bird game, you just have the
final output, but you don't actually
know how to verify like, you know, oh,
is this Flappy Bird game better
than the previous Flappy Bird game? It's
only in a mathematical sense that you can,
like, do distance-based scoring. Um, I'm
assuming large model labs probably
do the 0/1 thing. Like, the majority of
them just do, like, yes or no, yes or no,
binary. Um but in our experiments for
math specifically you should do distance
based scoring. Um it makes the model
learn faster.
Yeah.
For verifiable domain like math is it
actually because 2 + 2 makes sense but
it's not going to scale for like really
large numbers or large multiplications.
So are we going to end up, is the endgame
using tool use to calculate that, or
model could potentially be changed to
that is a good question. In the olden
days before this paradigm came along, we
would think that you can just use a tool
like a calculator. Actually,
I would say you should still use a
calculator to calculate 2 plus 2, right?
you should not use a language model but
with RLVR, you know, the trick is we
actually found that actually wait a
second if you just do 2 plus2 or you do
another question like 10 * 10 or you do
some sort of complicated mathematical
expression you know like the derivative
of x^2 or something I don't know like
some random mathematical equation it
randomly learns to actually solve that
equation without actually doing
overfitting. And so like I would say
that with RL you can actually make the
model actually learn how to do
multiplication, how to do addition. So
it's actually in the model. Um does that
kind of make sense?
Yeah. Would we use this in production or
Oh yes. Yeah. People use that in
production. I Okay, maybe don't use it
in production. You know, you're not sure
if the answer is correct. Maybe it will
say 3 + 3 is 7. Okay. I don't know. It's
possible. So like essentially, but it's
getting better. Um maybe in the future
all of mathematical equations can just
be done by a model. Um, I think
maybe a few months ago, maybe, like,
you know, before o1 got released (actually
not even o1, just a few months ago), people
would still say use a calculator, you
know, like, some sort of tool calling. Yes,
you should probably still do it. Um, but
imagine, you know, as time goes on, as
models get better and better and better
in terms of, like, training data, just
for the maths equation, you know, 2 plus
two. Um, imagine in the limit as we get
all of the world's data for just this
maths question, right? 2 plus 2, 4 plus
4, 10 x 10 or whatever, it should in
theory solve them all in theory. Um,
yes, the it's always in theory. Um, but
yes, you don't need a tool calling. It's
not necessary. Um, yeah. Yeah. Yes.
My question is about the reward model.
In practice, are people using large
language models as a reward model?
Good question.
Or is it? Yes, I actually
Good question. Oh, I was going to go in
the next next slides. We'll be talking
about that. Um, yes.
Multi-turn
does this change with multi-turn? I
mean, you showed a single
multi.
Yes, you could do multi-turn. It's a bit
more complicated. Um, you just imagine.
There are tricks you can do. You imagine
that your current step is good. Imagine
it and then you just continue doing
inference. You you append like your next
question. Like for example, how am I
going to what is 2 plus 2? You say okay
let me think about this question. What
is 2 plus2? Blah blah blah blah blah
blah blah blah. The answer is four. And
then the question is what is your next
question? Maybe the user interacts with
it and says okay I I don't think your
question is correct. Oh, I don't think
your answer is correct. And then the
model says oh okay let me rethink about
this blah blah blah blah blah blah blah.
I still think the answer is four. Um, so
you could chain this all together and
shove it into the, you know, the whole
RL step. You could do that. Um, it's a
bit more complicated. I think the
diagram will be a little bit more
different. Um,
the follow that to that would be
you assume like a loop is a single turn
or a loop is a lot of turns and then you
only give a reward at the end or you
give like subrewards.
Very good question. So, in the
DeepSeek R1 paper, you could do subrewards
or you could just do the reward at the
very end. I think subrewards might
actually do better in general but the
question is subrewards is very hard to
calculate. You would rather just wait
you know all until the very very end and
just give a reward. That's probably the
easiest. So it's more about efficiency.
It's all to be honest all of AI is about
efficiency. What is more efficient? What
is more it's all optimization. Um so the
answer is like I would suggest people
just to like shove a reward at the very
end. Um yes
once you get your reward signal is it
just the REINFORCE algorithm with the
gradient to go back?
We will talk about that, yes.
Yes. Yes. Mostly, yes, correct.
Yes, so we will talk about REINFORCE,
we'll talk about PPO and stuff like that.
um yes
this one
the problem is if you skip from
pre-training to the RLVR stage
it's relatively hard because your model
doesn't actually know how to do
instructions right so like you have this
base model you ask the question to the
base model what is 2 plus2 it's not
going to say I think the answer is four
it might you might be lucky somewhere in
your pre-training data somewhere on the
web someone asked this question what is
2 plus two and then the you know the the
answer was like oh the answer is four
but you have to be lucky um so like so
the problem of this is the whole trick
of SFT is you want to force the model to
answer what is 2 plus2 instruction way
right so it will tell the model what is
2 plus2 you want it to say the answer is
four you don't want it to like blabber
on and like j like get some Wikipedia
article and shove it as the output so
the whole point of SFT preference
fine-tuning and stuff like that is to
make the model forced to make it more
optimal to like output conversation
style um if you want to skip it's also
fine
It's just not efficient. Um because like
you you could do this. Um I'm assuming
large model labs are trying to do this.
Um so it's not like a you should or you
shouldn't. They are trying. Um does that
okay
there? Yes. Question
couple of questions. One question is
online policy optimizer
model after
the reference model does not change. So
the reference model is just a model that
you didn't train. Um it's like the it's
like the it's like the base model or
like the SFT whatever checkpoint you
started with. It doesn't change. You
could change it. I think that would be
too expensive though. I think if you
change it that'll be more complicated.
Remember all of AI is about optimization
and efficiency. So I feel like you don't
have to. You could I I don't know if
there are papers talking about it
though. Uh, maybe OpenAI does it. I don't
know. Um
other question is that do we need less
of
So the trick of RL is you just need a
reward function. You need to make that
and you don't need data. You don't need
the answer of the data. Oh actually you
do need the answer. You don't need the
chain of thought. You just need lots of
questions like what is 2 plus2? What is
4 plus 4? Remember you can actually
automatically generate this. Right? So
like
number of samples do you need less
number of samples compared to
You should do as much as possible.
Most large language models, I think, for,
like, you know, o3 or o1, I don't know what
the percentage of compute is, maybe they
spend like 5% or less, but the goal is:
what happens if you spend double the
compute just on RL, right? So, like,
previously if you do 14 trillion tokens
on pre-training, make RL 14 trillion
tokens, and then the goal of large
labs is to just do that. So currently
it's very little, but over time it will
increase.
but compared Let's say the number of
samples will be much less
currently. Yes.
Because it's expensive like you need
models.
Correct. It's expensive but over time I
think like maybe by next year or like
this year large model labs their goal is
to do this phase the most. That's their
goal because remember you can
automatically generate questions now.
What is 2 plus 2? What is two? 2 * 2
what is 10 divided by 10? I don't know.
Generate as many math questions as you
like. But remember you can also generate
you know like coding questions. You can
generate any questions that you like or
you can use the supervised fine-tuning
data itself for the RL step. You can do
that as well. Um does that kind of make
sense?
Okay. Any Okay.
How do you protect the SFT from being
like screwed up by
Oh, good. Yes. We won't talk about it.
Like you have 2 plus 2
is also a good answer. 2 plus 2.
Correct.
But I don't want more equations. I want
the answer.
Very good. So is there like techniques
to make sure that we're not violating
our
instruction? Yes. We will also talk
about that clipping and stuff like that.
Yes.
Yes.
Is there any research done
in the model? So for example, you know,
there's a circuit that says
yes.
But can you incentivizing
the model?
addition the concept general
you that is a very good question I don't
know I think that's like the during the
pre-training phase essentially somewhere
somewhere in the internet someone wrote
what is 2 plus2 somewhere and then
somehow maybe someone did a formulation
you know some sort of derivation of like
what is 2 plus2 okay I don't think
anyone has done pretend there is some
derivation of like some complicated
maths equation and so the model somehow
learned to predict all of that entire
trace and if it keeps seeing this it
would like accentuate the fact oh okay
I've seen this before let's make this
even more um more prevalent in the model
so somewhere in the model it has learned
2 plus 2 is equal to four somewhere
yes I guess what I'm saying is so we can
use like the stepific
or super good reward you can
you're waiting it
correct so there was actually two
schools of thought. The first one is the
model already has this knowledge, right?
It already knows what is 2 plus 2, and
RL just tries to, like,
maximize it:
if it sees 2 plus 2 is equal to
four, it tries to weight this factor
more, weight this circuit in the model
more. So the model already learned it.
But then the second school of thought is,
like, okay, maybe the model doesn't know,
and RL actually learns a new
thing. Um, I'm more in the first one. The
model probably already learns. It
already knows it and we're just
maximizing the
you're just trying to make it more
accentuated. Um,
exactly. So the extent of my question is
basically I want I'm wonder
if that makes sense.
You mean like just do addition?
Yes,
you could. I guess what you could do is
like get the language model, see which
weights are changed during the RL phase,
which weights are changed, and you just
give it what is 2 plus two, what is two
plus two, what is two plus two. You just
keep doing this question, and you can
see which of the weights are changing,
and essentially you can extract this
from the model. You could I don't know
if there's research about this, but I'm
just making I'm just making stuff up on
the spot. You could do that if that
makes sense.
Oh, maybe that's a research question.
Someone should do research paper.
Any other Yes.
Yes, we will also talk about that. Yes.
Yes. Yes. Okay.
Okay. Yeah. Sorry.
changes all the parameters in the model
set
You could. Large
model labs will most likely change all
of the parameters, every single parameter
is changed. Um, but there are papers which
show that actually not all of the
parameters are actually changing that
much some of them are changed by like
zero like the majority of updates to the
model is like zero and only some very
small updates to the model are seen and
so that's kind of a circuit idea where
like the model already knows how to do
whatever question you give it you just
most of the updates are like zero um
saying which one is better is it better
to aim for changing the parameters
You can also do, like, LoRA, you know,
parameter-efficient fine-tuning, you can
do other things, you don't have to
fine-tune every single thing. Um, but I
think the majority of large language model
labs, they just do everything. Um,
otherwise this again becomes an
optimization problem, you know, which
layer do we select and stuff like
that. So it gets more complicated, but
yes, you could do parameter-efficient fine-tuning.
Actually, we're going to show a notebook
for that so you can actually do it on
your own computer. Um, yeah. Yes, one
more question one more yes
oh okay
So my question
I wanted
Good question, on the KL divergence term.
Do you mean, like, removing it? Would that
make it better?
Because the whole point of the KL
divergence term is to, like, not make it
stray too far away from the supervised
model.
Okay, maybe I'm not kept up to date
with research papers, but anyways, so
your your point was like if we remove
the KL term it will be better but it
learns new capabilities.
Well, it was more it was more
I wanted to understand.
Do you mean like do you want to have
more capabilities into the model?
Uh strategies
hard to say. I'm assuming the large
model labs will probably know
strategies. We will show like I I will
show examples of how to like make RL
better like how to like reach higher
reward faster. I'm not sure about new
capabilities. It's actually very hard.
It's actually very very very very hard
to elicit new capabilities in the
model. Um the question is like is this
new or not new? I think that's the
question like is this actually part of
the model or not part of the model. Um
and most research papers are like hand
wavy. They say oh most updates are
sparse you know like so most likely it's
not you know new capabilities but what
happens if you know one year later all
of the model updates are like not
sparse. Is this considered new
capability? I don't know. like you know
those are the questions. It's more like
I don't know if that answers your
question. I probably didn't answer your
question but
maybe maybe the other parts of the talk
maybe might answer some part. Um yeah
okay I will keep going on more questions
later. Um, okay. So, like, the reward
model, right, was actually a language
model, like, some sort of model, some
neural network, some AI model that
predicts the reward. In RLVR, we delete
this entirely and we just call it the
reward function. So, like, the ground
truth reward, you know, if it's correct,
you get plus one. If it's bad, it's just
zero, right? So, you essentially delete
another part. GRPO essentially deletes
another part, right? So, like, remember
you delete the value model,
totally remove it, and then you
delete the reward model and it's just a
reward function.
And yes, as a reward model you could use
an LLM as a judge, so you could ask a
language model itself to say is the
answer good or bad, you could do that. You
could do a regular expression check, you
know, like, is the formatting of the
answer good or bad, is the maths equation
good or bad, you know, is the final output
good or bad. You can do distance scoring
and stuff like that. You can also execute
the Python code and then you can see if
it actually executed, right? So, like, are
there, like, import errors or, like, format
errors or, like, some sort of Python error,
and you can use this as a reward. And so
this blue box, the reward, can be anything
that you like. It just needs to output a
number, you know, minus one, plus one, I
don't know it just has to be a number.
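As a rough illustration of those two checks, a regex formatting check and a "did the Python code actually run" check, here is a hedged sketch. The tag names, reward values, and lack of sandboxing are all simplifications for the example; the point is only that each check is a plain function returning a number:

```python
import re
import subprocess
import tempfile

def format_reward(response: str) -> float:
    """+1 if the answer follows a <think>...</think><answer>...</answer>
    layout (an illustrative convention, not a fixed standard), else 0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else 0.0

def code_runs_reward(python_code: str, timeout_s: float = 5.0) -> float:
    """+1 if the generated Python code executes without errors, else -1.
    A real setup would sandbox this; this is just a sketch."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(python_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else -1.0
    except subprocess.TimeoutExpired:
        return -1.0  # hangs and infinite loops are also penalised

# The total reward fed into RL can simply be a sum of checks like these.
```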
In fact, you can make a dumb reward:
it just does random, you
know, plus one 50% of the time, minus
one 50% of the time, I don't know. And,
confusingly, a paper recently showed
that actually random rewards work. Um,
so, like, yes, go ahead, you can try it.
Um,
but also, did someone say "why"? Um,
probably read the paper, but, to
be honest, actually I think the paper
might be a bit off.
I don't actually believe it. Yes,
there was an update showing that
actually this was wrong. Um, it's actually
because the
benchmarks are incorrect. So when you say
that you actually increased accuracy, like,
from 20% to 50%, actually the model
itself was already at 50%, they just didn't
check the accuracy of the correct model
before. Um, so there was a recent rebuttal
to those types of papers. Um, but you
know, interesting results. Um,
yeah, I don't know.
Okay. So, remember in RL the goal is you
don't know the best action to take in
the space, right? When you're doing
Pac-Man, I don't know if going left,
right, up, or down is the best. I don't
know. But at the very very very end, you
will either, you know, win or like, you
know, get some reward or you will die.
Yes. Um, but the goal of RL is to
maximize the best action you can ever
take, right? So like what is a better
action than all of the other bad
actions. So RL just tries to maximize
the best well not the best action, the
better action.
Normal pre-training, you already know
what is the best answer. So like you
already know what is the next word,
right? You if you want to predict, you
know, hello, my name is Daniel, you
already know the next word is going to
be Daniel, right? So you already know
it. But RL, you don't know in advance
what is the actual correct reward. Um so
you the only thing you can do in RL is
to maximize the you know one of the
better options
and so yes okay now more maths um the
goal is to maximize this equation um
that's the goal of RL
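For reference, the equation on the slide reads as the standard policy-gradient (REINFORCE-style) form, written out:

```latex
\nabla_\theta J(\theta) \,=\, \mathbb{E}\left[\, \nabla_\theta \log \pi_\theta(a \mid s) \cdot R \,\right]
```

That is, the gradient (with respect to the policy's parameters) of the log-probability of the action given the state, multiplied by the reward.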
So what is this equation? The J is, like,
the objective. Um, well, actually it's
more like the thing we
want to maximize, it's not actually
the total gradient, okay, maybe I miswrote
that, anyways. Um,
we want to calculate the gradient with
respect to the policy language model, and
the action is given a state, and the R is
the reward. If you want to write this down
in like English, it's like we want to
take the derivative of the log
probability of the action given the
state times the reward. Now I don't know
if you guys understand what that means
but I did, like, an example, um, the Pac-Man
case. Okay, so you are Pac-Man. The red
is your enemy. You don't want to go
there, right? So like you definitely
don't want to go to the red thing,
right? But you want to eat the two gray
dots, right? That's remember you can
only go up, down, left, or right, right?
You only have four actions. Remember the
action space is just up, down, left, or
right. So if you do rewards, I just
randomly made some rewards up. If you go
to the red thing, you will get minus 10
reward. Or actually, it should be minus
infinity. You die. But anyways, minus
10. If you eat the gray dots, you get
plus one or plus one. And if you go up,
it's just zero reward. There's nothing
there.
Now when you get this language model or
like some sort of model it has to tell
you what is the next action right it
tells you what to do to the next action
for now we'll just assign every single
action up down left or right as one
quarter probability right so like you go
up 25% of the time left 25% of the time
and so on so these are your numbers
right this is the entire state
So the goal of RL is: that
red, going towards the right, you want
to do that much less. You want to go towards the
right much less,
right? So, like, you want to push the
probability, the 0.25 for going right,
much lower, and you want to go, like,
you know, down and left much more, right?
So you want to push those probabilities
much more and the top are not really
that important and so RL essentially you
your goal is to avoid doing the bad
thing and you want to do the good thing
much more that's kind of RL if you
convert this into a table right you have
the probability of the action given the
state right remember up down left or
right, we just assigned a 25% chance,
right? Just just pretend 25% chance. The
reward, which we can calculate, right?
We calculated this. We calculated the
reward. We just made some numbers up,
right? As 0, 1, 1, minus 10.
The probability times the reward, we get
some numbers, right? So, like, 0, 0.25, 0.25,
minus 2.5. And then if you take the log
of the probability times the reward, you
get some numbers, right? So, like, 0, minus
0.6, minus 0.6, and 6.02.
So from this table, does anyone know
which row do we want to maximize? What
is the goal? Like what do we want to
maximize?
Which row?
You want to maximize the bottom row.
What is the reward of the bottom row?
Correct. You want to minimize the bottom
row. Remember the reward is minus 10. We
do not want to maximize the last row
because the last row is the worst. And
so that means the 6.02. We want to
actually decrease this number
dramatically. Right? That's way too
large. We want to decrease it. The other
rows we want to maximize. And so the
goal is okay, we just take the sum of
all of that, right? We take the sum of
the four numbers and it's 4.8.
And so remember, okay, let's try, right?
So, like, by hand, by hand, we shall,
remember, all the
probabilities are one quarter, 0.25. By
hand we shall do the bad action even
more, right? We actually do the worst
thing. What happens?
Right? So, like, the
probability times the reward is now
minus 4. It used to be minus 2.5. And so
the sum of the log probabilities times
the reward actually decreased,
right? It decreased to 2.58. Before it
was 4.81.
Is 2.58 smaller or bigger than 4.81?
Obviously smaller. So actually this is
worse. You should not do this. But this
is actually bad. Remember the goal is to
maximize
maximize this equation. Maximize it,
right? Maximize.
And so 4.81 is actually better. The
original state is actually better than
the 2.58. So this the thing that we just
did is worse. So do not do this.
However, let's do the bad action, going right, less.
Right? Let's not go to the right, and
actually maximize the rest.
You shall see that if you take the log
probability times reward, the sum of them all,
you will get 8.9, which is a larger
number, and so the goal is to maximize
this as much as possible. You could say,
wait, we know the answer, right? Just
make the best action 100%
probability, you know, and
you'll have an infinite reward. Okay, not
infinite, but you will get maximum reward,
right? Why don't we just do that?
But you should not do that, because
if you do this
your model will, like, learn, oh okay,
let's just keep doing that one action, let's keep
doing it, and it just gets stuck and
it just becomes very bad for
optimization so definitely don't do that
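Here is a tiny sketch that reproduces the numbers from those three scenarios. Note the slide appears to use base-10 logs (that is the only way the 6.02 and 4.81 figures come out), and the exact probability shifts for the "more right" and "less right" cases (0.2/0.4 and 0.3/0.1) are assumptions that happen to reproduce the 2.58 and 8.9 totals:

```python
import math

# Actions and the rewards assigned on the slide.
rewards = {"up": 0, "down": 1, "left": 1, "right": -10}

def objective(probs: dict) -> float:
    """Sum over actions of log10(p(action)) * reward(action), as tabulated on the slide."""
    return sum(math.log10(probs[a]) * rewards[a] for a in rewards)

uniform    = {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}
more_right = {"up": 0.20, "down": 0.20, "left": 0.20, "right": 0.40}  # do the bad action more
less_right = {"up": 0.30, "down": 0.30, "left": 0.30, "right": 0.10}  # do the bad action less

print(round(objective(uniform), 2))     # ~4.82  (the 4.81 on the slide)
print(round(objective(more_right), 2))  # ~2.58  (worse: do not do this)
print(round(objective(less_right), 2))  # ~8.95  (better: this is what RL pushes towards)
```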
Now, someone talked about
REINFORCE. Um, we don't just multiply by the
reward. Remember this equation we did,
where is it, the probability of the
action given the state times the reward?
We don't actually multiply by the reward, we
should not do that. Um, you actually
multiply by something called the
advantage. Um, and what is the advantage?
The advantage is the reward minus the
average reward, the base reward. So you
shouldn't actually just say you want to
maximize reward; you want to actually
maximize the reward
while also looking at the average reward
across the entire model. So it's called
the baseline.
And this B, this baseline, is the value function, the value model. Remember, GRPO deletes the value model; this was the value model. The value model essentially estimates the average reward given only the current state. It does not look at the next step, it does not look at the next action; it just takes a snapshot of what you currently have. It looks at the state and guesses the reward. You're not supposed to give it the rewards, the -10, +1, +1, or 0; it just looks at the current state and produces a number, and this number is called the average reward.
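In symbols, this is the usual baseline idea (standard notation, not lifted from the slide): the advantage of an action is its reward minus the value model's guess for that state.

```latex
A(s,a) \;=\; R(s,a) \;-\; V(s)
\qquad\text{where } V(s) \approx \text{average reward obtainable from state } s
```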
And so the goal now is that we don't want to maximize just the reward; we want to maximize the advantage instead. We multiply all of this together, and the goal is to maximize this new equation. Does anyone have any questions? There's lots of maths, but any questions? Yeah. Yes.
So in terms of the probability: is the large language model giving an estimated probability, or is this a known probability over all the possible states? How do you get that in practice?
So a large language model predicts the next word. For example, you take the entire Wikipedia, chunk it into small tokens, and the output is just: what is the next word? For example, 'my name is Daniel', but it could also be 'my name is Michael', 'my name is Bob', whatever. You have probabilities for every single word in the entire vocabulary, something like 128,000 words; you assign a probability to every single one.
Yes, correct.
But they'll be okay.
The trick of this for language models is
you can utilize the probabilities
directly. That's the trick. And so like
that essentially makes everything
easier.
Any other question? Yes.
Yes. Yes. Yes. Correct. Correct.
Yes.
Oh,
Multimodal models; do you mean doing RL on multimodal models? Oh, that is harder. I would say you could look at the Sudoku puzzle example and just convert the text model into a vision one; just cheat, I guess you could do that. You could give it the Pac-Man screen and tell the model, 'what should I do next?' Vision is kind of the same thing, but it's more involved.
Does o3 do vision plus reinforcement learning? I think it does. Yes, you could. For open source, I don't think I've seen open source models do that very well. It is still very hard. Yeah. Any other questions?
No. No. Okay.
Did someone did ask a question? Oh. Oh,
so
yes. So, sorry.
Oh, what is the B? What is the baseline model, the average? What is the average or reference model for, exactly?
So your goal is: you see the current state of the model, whatever the environment currently looks like, and you just want to produce a number that approximates the total average reward. Okay, I'll give you an example. Pretend you're playing chess, or Go; remember AlphaGo. You look at the board, the current state of the board, and just say: what is the probability of the white player winning? You're not supposed to do any prediction of future moves; you just have to predict the probability of the white player winning by looking at the board. That's kind of the average reward.
Always low.
It's always low. Yes, correct. But remember, at the very end phases you might get a higher reward; that's the goal. You essentially want to predict that probability at all times. For example, in chess I'm sure there are some moves you can make that raise the reward; the question is, when the model sees this board, it needs to say whether this board is better than the previous boards. And this model you have to train as well. You have to train it; it needs to output a probability of winning. That's the chess example. Does that kind of make sense or no?
Yes.
The value model? No, no, no, the value model is totally different. There are three models. There is a value model, which predicts the average reward of the state. The reference model is just the model that you started with. And then the policy, the actual model that you're changing, is the final result, your actual chat model. So there are actually three models.
This one, the B: you look at the current state. You look at the current state and then, okay, what is the actual... I think you do use the policy? No, you just look at the current state, and then you output a probability of whether this chess board is good or bad, something like 0.8, an 80% chance you're going to win. Okay, something like that. Yes.
Oh, yes. Yes. Yeah, we'll talk about
that. This is just a general simpler
formula. Yes, we'll talk about that.
I guess my question.
Oh, you can ask. Yeah, go.
I guess because the policy is predicting
That's the probability.
Yes.
That is a good question, and that is an active area of research, because you could either normalize over all the tokens or over the entire single turn. It remains to be seen which one is better; people are actually still debating that.
Correct. Yes. So generally speaking, people just assume this rollout, this chain of thought, is correct and only score the very end. But then you do have to multiply probabilities, so there is a multiplication somewhere and the numbers get very small. Yes, they get very small, but the numbers are relative: everything is very small, yet the bigger ones are still bigger than the smaller ones, so it's still fine. They're all relative.
Any was there one more? Yes.
Yes.
Oh no no it's very old. Yes. Very very
old.
Yes.
I wonder if you can give some advice on how to think, at an abstract level, about error propagation in this training: if you have a trained model which does the scoring, the value function or whatever, that model is itself trained from data and has some error margin. You have some softmax function that, say, only one in a hundred times produces the wrong probability. How do we think about the development of these models over time, and to what extent is that error propagation something you can observe, measure, systematize, and engineer around? I don't really understand what the mindset is in this process right now.
In my view, all of these formulas are just made up, and the goal is to maximize reward. But you can't just maximize reward, because otherwise you might make the model really silly. Say your data set was just 'what is 2 plus 2?' over and over; you could literally cheat and make the model say 'four, four, four' forever. Do you want that as a model? Definitely not. It needs to learn: if I give it the next question, 'what is 8 plus 8?', it should not just say four; 'what is 2 minus 2?', it shouldn't say four. And so the goal of all these algorithms is to somehow force the model not to overfit to your questions. These formulations are trying to do these things to not overfit.
Yeah.
Well, I'm thinking about like the chess
you were saying which scores the board
and produces this like
number
a number which is like this good or bad.
Sometimes these well trained models have
these novelties that you know where they
say like make this move. It's like not
the new state is not obviously good or
whatever but they somehow like figured
out this.
Yes.
And suppose that your training mechanism
for the value function model you know
hasn't picked up on something like that.
In fact there's like some error in the
tendency of the value model.
Okay. like it's probability of producing
like
perfect scoring of the board.
Yes.
You know is not always exactly right.
Yes. Always not into your training
process.
Correct.
Yes.
How do you think about that? What is the...
The value model? You have to train it together, so it's a combination with the entire algorithm. The value model predicts the probability, but you actually have to train it as well, and that is the problem. Some people train it separately: you can collect all the chess positions and train it to output the final number; I think that's what some people do. You could also train it in tandem with the model, which I think is actually harder. So there is always error in the value model, always; you train it to reduce the error, but there's always some. There are some knobs you can use to make the value model less prominent, so you don't rely on the value function too heavily. But in GRPO we just get rid of the value model anyway; it's totally gone, so you don't need to worry about that anymore. Okay, I will keep going; let me just check the time. Okay.
So remember, the goal is the advantage: we want to maximize the advantage, not the reward anymore. The advantage is the reward minus the average reward, the base reward. If the advantage is less than zero, it means the action is worse than average; if the advantage is more than zero, it means it is better than average. And so the goal is: we want to do an action more if it's better than average, in general.
Now, on to PPO. I don't know if you guys have seen the PPO formula; it is ugly, but this is it. It looks more confusing because there's a clip, and an epsilon, and so on.
But we can strip everything away: it's just the probability of the action given the state times the advantage, which we literally just discussed. Okay, minus a log; okay, the log's gone. But anyway, it's just that, and the rest is there to reduce overfitting.
So essentially there's this division by the old model, the model that created the action, and the goal is we now want to maximize this likelihood ratio. We don't just want to maximize the probability of the model, the probability of the action with the highest reward; we want to maximize the ratio instead. But what is this ratio?
So I made some numbers up. Pretend the numerator is 0.01 and the denominator is 0.01: 0.01 divided by 0.01 is 1. If the top is 0.01 and the bottom is 0.99, remember these are all probabilities, you divide the top by the bottom and get about 0.01. If the top is 0.99 and the bottom is 0.01, you get 99, and so on; the last one is 1.
So take the 0.01 over 0.99 case: the bottom being 0.99 means the old model thought this action was very likely, but we actually don't like it anymore, the top is only 0.01, so the ratio is about 0.01. And then the other case: the denominator is 0.01, so the old model thought the action was not likely, but we actually like it now, the top is 0.99, and when you do the division you get 99; this is actually good. So we're not actually trying to maximize the probability; we're trying to maximize the likelihood ratio.
And so the question is: why don't we just maximize the probability, the first equation? Why do we need this division? Because if we maximize just the top, you will get reward hacking. 'What is 2 plus 2?' It might say, 'to solve this question we need to do blah blah blah', and suddenly it says 'hello hello hello hello' and then it says four. Is this good? I don't think so. We don't want it to say 'hello hello hello' or produce some weird trace in the reasoning. We don't want that to happen. And that 'hello hello hello' is actually very unlikely, and so the goal of the division is to reduce these issues.
The epsilon part is called the trust region. Essentially, we don't want to take large steps in PPO, and the trick is to restrict them: you don't want to overfit the model, so we constrain it. Epsilon could be, say, 0.2, so 1 minus epsilon is 0.8 and 1 plus epsilon is 1.2. The idea is we just don't want to move in the direction of the gradient that much; we don't trust the model that much, we don't trust the algorithm that much, so we constrain it.
And then in PPO there is also a KL term, another term. Essentially what it does is keep the model as close to the supervised fine-tuned model as possible: we don't want it to drift too far from the base or supervised fine-tuned model. If it deviates too much, we want to tax it. The beta is something like 0.05, and the KL divergence is, okay, not technically a distance, but it's like the distance between the current model and the reference model, and we shove that into the equation as well. So you can see with PPO there are many moving parts. But who cares about the exact equation? It's not that complicated. The point is that all of these extra add-ons are just there to reduce overfitting and to stop the model from randomly drifting to some weird state that overfits to your questions. The trick of PPO is they added all these terms to make training more stable.
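Written out, the clipped objective plus KL penalty being described looks roughly like this (standard PPO/RLHF notation; this is the textbook form, not copied off the slide):

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\qquad
L(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,A_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\right]
\;-\; \beta\,\mathrm{KL}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
```

The ratio and the clip are the trust region just mentioned; the beta-times-KL term is the tax that keeps the policy near the reference (supervised fine-tuned) model.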
And so the full equation looks like this. To be honest, no one even memorizes the formula; it's not that important, but I try to break it down into pieces. And the goal, remember, is to maximize this equation. Normally I just like to think about this one term, the probability of the action given the state times the advantage; you just need to learn that one. Remember we did the table; just that is enough. You don't need to learn the rest of the formulas; it's not very interesting.
yeah. Any questions? Yes.
Yes, correct.
So the biggest problem is: pretend you just started RL. You have the base model, or a supervised fine-tuned model, and then you do RL. The gradient updates at the very beginning are going to be gigantic: you ask 'what is 2 plus 2', it says four, but if it says five you want to penalize it dramatically. The problem is you don't actually want to take large steps, so the goal is to constrain it. The constraint factor means that if the gradient update is extremely large, you just want to constrain all the numbers, if that makes sense. The goal is just to constrain the update so it isn't too large.
What about the ratio?
Oh, the ratio; it's the KL divergence. Oh sorry, not the KL, the likelihood ratio. To be honest, I think I need to do more research; I would ask Gemini exactly what it is.
That's my answer. I'm probably not the
best person to answer every single
question. Yes. Any other questions? Yes.
Yes.
It's the model that actually created the action. And the top one is, how do I explain this; the bottom one, the denominator, is the model that created the action. So for example, the model says you want to go up, down, left, or right.
Is there a correct or...
Oh, it's just whatever action the model created; it could be anything. It's whatever action the model currently says. It might be wrong, it might be good, it might be bad; it's just any action.
Any other questions?
Okay.
Yes.
space.
I don't think I can answer that question. I don't know; that's why, I don't know. Maybe research papers show it; I'm not sure, Ed.
Okay. Okay. Well, okay. So, GRPO: the trick versus PPO is that we remove the value model. We get rid of it entirely; we do not want to estimate the average reward, it's totally removed. And the reward model is replaced as well, by a reward function.
So remember, the value model is removed; B was the value model, and we get rid of it entirely. But what do we replace it with? The trick of GRPO is we do rollouts, or inference sampling.
Take the question 'what is 2 plus 2': you literally make four inferences, you literally call the model four times. It could say the answer is 0, the answer is 1, the answer is 2, or the answer is 4. And you take the rewards: the correct answer is 4, so you want the last one to get reward 1 and the rest get 0.
And the trick is you literally just take the statistics of your current rollout. You take the reward minus the mean, divided by the standard deviation; you get the z-score. And this is your baseline; this is your 'value model'. There's no value model anymore; it's just a number.
And I did this in a table as well. 'What is 2 plus 2?' The predictions could be 0, 1, 2, or 4, and your rewards could be 0, 0, 0, and 1. If you take the mean, the average of all the rewards, you get 0.375. The standard deviation is 0.43301. And then you do the reward minus the mean divided by the standard deviation and get some numbers. Remember, the answer 4 is correct; that is why its reward minus the mean divided by the standard deviation, 1.44, is the largest number. And so we essentially want to maximize that good answer and reduce the bad answers.
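As a minimal sketch of that group statistic (the exact values on the slide depend on how each sampled answer was scored, but the mechanism is just a z-score within the group):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: z-score of each rollout's reward within its group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # epsilon avoids division by zero

# "What is 2 + 2?" sampled 4 times -> answers 0, 1, 2, 4 -> rewards 0, 0, 0, 1
print(group_relative_advantages([0.0, 0.0, 0.0, 1.0]))  # the correct answer gets the largest advantage
```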
But why is it called group relative in GRPO? Because it's not just one question.
It's many questions. It could be what is
2 plus 2? What is 4 plus 4? Okay, well
my graphs are all the same. My plots are
all the same. But anyways, imagine
there's like four different tables. What
is 2 plus2? What is 4 plus 4? How do I
create this Python function? You know,
whatever. And there will be four tables.
And so the goal group relative just
means you for each question we take the
statistics within each group.
For example, what is 2 plus2? You create
four. You literally call the model four
times and you get some, you know,
answers. What is 4 plus4? You call it
four times, right? Create Python code,
you call it four times.
Yes, there are other factors in GRPO. So essentially we've already explained what GRPO is; everything you need to know about GRPO we've already covered, and the full mathematical formula looks kind of like this. There's some rearrangement; for example, the minus beta times the KL divergence is just taken out of the reward function. That's the only other difference. Hopefully the parts of the GRPO formula make more sense now.
It's actually not that complicated to understand. The majority of it is just trying to reduce overfitting; that's the whole goal. Minus beta times the KL divergence reduces overfitting; the 1 minus epsilon, 1 plus epsilon clip reduces overfitting; the division reduces overfitting. Everything is reducing overfitting. That's basically all of machine learning and AI: make the training more stable and reduce overfitting.
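Roughly, the GRPO objective being described can be written like this (standard form; token-level averaging and other details omitted), where G rollouts are sampled per question and r_i(θ) is the same likelihood ratio as in PPO:

```latex
A_i = \frac{R_i - \mathrm{mean}(R_1,\dots,R_G)}{\mathrm{std}(R_1,\dots,R_G)}
\qquad
J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\big(r_i(\theta)\,A_i,\ \mathrm{clip}(r_i(\theta),\,1-\epsilon,\,1+\epsilon)\,A_i\big)\right]
\;-\; \beta\,\mathrm{KL}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
```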
There are two things I really highly suggest. Nathan Lambert's policy gradients material; his book is online and very, very helpful. And Yannic's video on GRPO and related topics is very helpful as well. And now I will go into a Colab demonstration of GRPO. Before that, does anyone have any questions? Let me just check the time. Questions? Yes.
Because maybe the answer is there
is
to me
whatever information.
Yes.
Memorization. You make more
memorization.
To answer your question another way, I think it's actually because GRPO itself is the problem. Remember, the goal of all these algorithms is to force the model not to move too far from the original model; the minus beta KL divergence, the one minus epsilon, all of this is trying to keep the model from going too far away from the original model. And I think that's the problem, because you're essentially forcing the model not to go too far. So maybe there might be some new algorithm, some other formulation, where you can go very far away; you could do that. I don't know if there are any research papers about that; I don't know. But yes, you could do that. Yes. Any other questions?
you probably saying
okay
that's why
Yes. Yes.
Energy based models.
Yes.
So what do you think about that?
What do I think about it? I can't really
comment. I mean he definitely you should
listen to what he says. Um
but you don't see anything like
movements in open source.
I don't think so. I think in open source we kind of got captivated by RL and GRPO; I don't think open source people are doing what he's talking about, unfortunately. I think he needs to talk about it more. Yeah, JEPA and energy-based models; unfortunately, I don't think open source is doing that. Maybe we should talk about it more, but yeah, I don't think so. Yeah. Yes.
Yes.
Yes. You got rid of it.
Yes, correct.
Yes.
group.
So this is more of an optimization question. In theory, I just selected four, 'what is 2 plus 2, create four samples', but you should do as many as you like: 3,000, whatever number you like, as many as possible. But remember, AI is about optimization; doing that is going to take forever, and it's all about efficiency, so probably don't do as many as you like. In the limit you should, but everyone can't just sit there waiting for the computer to spin. So yes, do as many as you reasonably can.
Also, some recommendations: when you do inference sampling, set the temperature to something like 1.2 or 1.5, and set min_p to something like 0.1. If you set the temperature to zero you'll get the same answer every single time, so definitely don't do that. You want high temperature so the model produces new, different outputs as much as possible; maximize variability.
Yes,
distribution.
You should try your best to maximize variability. Your outputs should not all be the same; if they're all the same, I don't think it's going to learn. So make them as different as possible; that's why you set temperature to 1.2, 1.5, whatever, some larger number. Don't go too large, though.
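As a rough illustration of those sampling settings; the parameter names follow vLLM's SamplingParams, and the exact values are just the kind of thing you would tune, not a prescription from the talk:

```python
from vllm import SamplingParams

# High temperature so the rollouts for one question actually differ from each other;
# min_p trims only the extremely unlikely tokens so the text stays coherent.
sampling_params = SamplingParams(
    temperature=1.2,   # roughly 1.0-1.5: more variability across rollouts
    min_p=0.1,         # drop tokens below 10% of the top token's probability
    max_tokens=1024,   # room for a reasoning trace
    n=4,               # number of rollouts per prompt (the "group" in GRPO)
)
```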
any other questions?
Yes.
Yeah.
Yes, correct.
In all the first steps, all the rewards are basically zero; there's no signal at all about the underlying task. How do you deal with that? I have faced that a lot of times, sometimes even when moving to a larger model.
Yes, that's a good question. So essentially you're saying: if the model starts off with no reward, if every single update is 0, 0, 0, 0, it's not going to do anything. Yes, that happens all the time. But just by chance, with some small probability, you will eventually get some reward. That's the trick. You will see this: after 10,000 inferences on 'what is 2 plus 2', suddenly the model says four, just by random probability, and then we make that more likely. That's all. That's all of GRPO.
But this is addition, a simple task, like a proof of concept. The reward may never come.
Yes, it may never come. But remember, you're not doing just this one question; you're shoving it together with other questions: what is 2 plus 2, what is 4 plus 4, derive the derivative of something, write this Python function.
This step is very large. You essentially
shove this all together. And the trick
is in general it works. in general maybe
by bad luck it might not work but I feel
like the bad luck won't last forever
because remember you're changing the
samples right so like the question what
is 2 plus2 you're changing that the next
phase will be some other question and so
the trick is just by chance you will
have a good reward just by chance and we
just force that to be more
Does that kind of make sense? To be honest, it's all luck. Yes, it's all luck; we're just guessing, praying that there's going to be some positive reward somewhere in the model. There will be negative rewards too: if your model is really bad, you can get negative rewards, and you just want to do those actions less.
And by miraculous probability, you know,
just rely on probabilities, you will get
a good reward somewhere just by chance.
That does that kind of make sense? I
mean, all of the large model labs are
just literally relying on the fact that
that's what they're doing. They're just
guessing. We're just praying for the
GPUs to work and then suddenly the
reward comes out. I'm being serious.
That's exactly what they do. They're
just waiting for the algorithm to work
and if suddenly oh okay that's why
people do random seeds as well. So for
example the initialization of the model
might not be good. So you just kill the
training run. You do like 500 training
runs. Oh no 499 of them are like zero
reward. Oh just kill them all. Don't
release them.
I have seen on my training.
Yes. Very common.
As small as step.
Yes, that's the Yes, I was going to show
you guys that.
Exactly. So like you could force the
model to answer some question like for
example you ask a question what is 2
plus two? It's very easy. It's four. You
just force it to learn oh okay it should
be four first. And then you do other
steps that is actually why remember?
Okay, I have to go back to all the
slides. I don't remember. Okay, where is
it? How do I Okay, I'll exit. Uh it's
the same as this problem. Where is it?
This one, right? Someone asked about why
don't you just start from the blue, you
know, the pre-trained model to go to the
green one. It's the same thing.
Essentially, the trick is we want to do
some supervised fine tuning to make it
know some instructions. So it knows
something and then you want to go to the
reinforcement learning phase. But if you
want to start from nothing like just the
pre-training phase that's the hard part,
right? Your reward might be zero, zero, zero, zero... just zero forever, and then suddenly one, you know, suddenly.
If you see zero rewards, most likely your reward function is not that good. And yes, doing priming, making the model learn a little bit about your data first, does actually help; so there are tricks to make it work. But generally I would just say it's bad luck, just bad luck, and unfortunately you can't do anything; it's not your fault. Yeah, just unfortunate. Yeah. Yes.
How should you think about what to expect from this? Is this going to be the way open source catches up with closed models, where any competent ML engineer is able to specialize a model? Is there consensus on where this is going to bring us: are smart people like you going to give us a really good open source model, or is this a new tool for specialization?
The algorithm is not special. The hard part is actually the reward functions themselves and the data that you shove into the model; that's the hard part. I think there's a misconception that the algorithm is what matters; honestly, who cares about the algorithm? You can literally just use the general function I showed you, or any algorithm you like. The real problem is the reward function itself. I gave you some examples, like 'what is 2 plus 2, the answer is four'; yes, you can do distance-based rewards, but that's just one example. Can someone make a reward function for, say, trading stocks? Do that, and then you have a model for trading. Go ahead.
So you think it's going to be more that, because people are able to create reward functions, it's a bit more accessible, similar to how you can write a prompt? It's an easier thing for most people to iterate on?
It's actually quite hard to come up with them, but yes, it's easier than coming up with the algorithm itself.
Actually, I think it's about collecting data. In the olden days, large model labs would ask large data providers like Scale to create data: for 'what is 2 plus 2', you literally had someone sit there and write 'the answer is four', and they also had to write out the chain of thought, 'I think the answer is four because of such and such, this is my working out'. You literally had to pay someone to sit there and make the data.
The trick is: no more. You don't need the data labeling step anymore; it's totally gone. You have the question and you have the answer; the middle step is removed. But you still need to write the reward function: you need to verify whether the answer four is good or bad. For maths, that's very easy. For code, it's somewhat easy: you can check, did you import the correct library, did your code execute, and so on, some other reward functions. But it's still hard to verify whether the actual function is correct. For example, say the task was 'create the Flappy Bird game'. How do you actually know the output is good? You could ask a human to verify, or play-test the Flappy Bird game and then give it a good or bad reward. Or the trick is: did the game actually run? If it ran, plus one. Do you see the words 'Flappy Bird' inside the functions? If yes, plus one. Was the Flappy Bird sprite image actually used? If yes, plus one. Something like that. So I would still say the hardest part is writing the reward functions.
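As a sketch of what these hand-written reward functions can look like; the function names and checks are hypothetical, and the Flappy Bird checks just mirror the heuristics described above (compiling stands in for "did it run", which in practice you would do in a sandbox):

```python
import re

def math_reward(completion: str, correct_answer: str = "4") -> float:
    """Verifiable reward for a maths question: +1 if the final number matches, else 0."""
    match = re.search(r"(-?\d+)\s*$", completion.strip())
    return 1.0 if match and match.group(1) == correct_answer else 0.0

def flappy_bird_reward(code: str) -> float:
    """Heuristic reward for 'write Flappy Bird': partial credit for each check that passes."""
    score = 0.0
    if "flappy" in code.lower():               # mentions the game at all
        score += 1.0
    try:
        compile(code, "<submission>", "exec")  # does the submission at least parse?
        score += 1.0
    except SyntaxError:
        score -= 1.0
    # In practice you'd also run it in a sandbox and check the sprite is actually used, etc.
    return score
```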
And for open source specifically: if the whole open source community starts writing reward functions, we can probably beat o3 and o1, you know. Okay, plus compute; you still need compute, that's the problem. But if you write good reward functions, you'll probably catch up in no time.
So the end state here is that you want an open version of the closed model? I guess my original question is: is this something people will use the way they use prompts, with separately specialized models, or is it more that we want one good open model?
Okay, that is a good question. It depends on which school of thought you're in. Sorry, I initially said it the wrong way around, so let me restate it. The first school of thought is that the large language model already has the capability and you're just accentuating it; a model doesn't actually learn that much new, and so you end up with many specialized models. The second is that the model actually learns something new, and then you end up with one gigantic model. I think OpenAI probably subscribes to that second view; most large model labs think RL can actually get you to AGI: it will know everything about everything, and any question you ask, it already knows. That's where I think they're trying to go. For open source, it's harder.
I think the open source community consensus, for now, is that the model already knows the answers to your questions and you're just trying to accentuate that: by writing reward functions you're trying to weight those circuits more, weight the model toward knowing how to do these equations and so on. So I think the goal for open source is: if the entire community comes up with good reward functions and writes them all, then the remaining problem is compute; that's the second problem. If you shove both of them together, you'd get something like o1, I don't know. Imagine if every single person wrote one reward function per day; okay, that's probably too hard, but you'd have seven billion reward functions, more than OpenAI could ever come up with, and they would beat OpenAI. But you still need the compute part; that's the only thing.
okay. Any other questions? Yes.
Correct.
Yes.
How do you feel like saving those traces
of those?
Very smart. That's what Yes. Yes. Yes. I
don't know if large model labs do that.
You could do that. Yes.
I feel like Yeah. We're just mining for
those
good examples.
The only problem I would say is this: pretend the question was 'what is 2 plus 2', and the model says, 'okay, let me work out what 2 plus 2 is. I think the number two means two apples, and I want to add two more apples; I think the answer might be three. Hmm, but let me rethink. Wait a second, it's four.' Should you fine-tune on that? I mean, you could, but maybe it's cheating; maybe it just says four by chance. Okay, I'll give you an example. Say the question was 'what is 2 plus 2' and it says gibberish, like 'I like to go to Paris for fun', or 'I like to go to this event', blah blah blah, and then suddenly says four, just by chance. Remember, we're still rewarding this; we literally reward it as good. But this is not good.
No, so we reward it at the very end. Remember, we see the number four; it is good. The question was 'what is 2 plus 2', and the model can generate anything it likes in between.
It could...
You could, but that gets harder. So the trick is that people don't actually reward the steps in between; they just score the final step, because otherwise it gets too complicated. What is the reward for an intermediate step? It's way too complicated. So you just reward the final step: if you see the number four, it's good.
But we don't know how we got there. So yes, maybe at the very, very end of RL you could take some of that data and do fine-tuning on it; you could. But in general, we don't know what the process in between is.
You said we don't train on...
Oh no, we don't train on the thought; we don't train on the intermediate steps in between. You don't have to. You could, but remember, we don't know if the traces are good or bad, so you can't just take the trace and do supervised fine-tuning on it: the answer four might be good, but we don't know about the intermediate steps unless you actually read the data. You could ask some human labellers to verify whether a trace is good, but that kind of defeats the whole purpose of RL, so you don't want to do that. Does that kind of make sense, or not really? Okay.
yes
the what sorry
oh yeah yeah we Yes, we will do that.
Yes. Yes. Yes.
Python.
Yes.
What's the normalization among the rewards; are there best practices?
What do you mean by normalization amongst the rewards?
We have four plus...
Yes, if we're running this in one big batch, how do we normalize?
Good question. Correct: one reward could range from -10 to 1, another up to 100, another could be binary. Very good question; that's your choice, unfortunately. That's the problem of RL: it's all about human choice. You have to decide, for example, whether the Python task is more important than the 'what is 2 plus 2' maths task; then you can weight it more, say make the 2 plus 2 reward -1 and 1, and the Python function's reward 100 and 0. You have to decide on the weighting functions; that is your choice. Unfortunately, it's kind of an art. You could, naively, just put everything on the same scale; I think that's what the large model labs probably do: all the reward functions have the same scale, plus one and minus one, not plus 10 here and minus 1,000 there. So it's up to you.
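A tiny sketch of that choice: keep each reward function on roughly the same scale and make the task weighting an explicit knob (the names here are made up for illustration):

```python
def total_reward(task, completion, reward_fns, weights):
    """Score a completion with the reward function for its task, scaled by an explicit weight.

    reward_fns: {"math": fn, "code": fn, ...}, each returning roughly -1..+1
    weights:    {"math": 1.0, "code": 1.0, ...} -- the human choice described above
    """
    return weights[task] * reward_fns[task](completion)
```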
Okay.
Yes.
And what about stuff like, 'is this a good summary?', 'is this on course?' How are we creating rewards for that?
That is the question. Now you want to analyze quality, and that's where LLM as a judge comes in. There is a school of thought that you can use a language model itself to produce the number: you ask, 'is this a good summary or a bad summary? Please give me a score from minus 10 to 10.' You could do that; it's called LLM as a judge. There is a paper showing you can do this for some time, but then it breaks down. You can't just keep calling the language model; it's kind of like cheating, like calling ChatGPT to train ChatGPT: it works for a while and then breaks down. There was a paper, I'd need to find it, showing that if you keep doing this your actual reward goes backwards: you get more and more reward and then suddenly, by bad luck again, it's always about bad luck, the reward just goes back down. So yes, all of AI is about bad luck and good luck, and optimization, trying to be efficient; that's what everyone does, unfortunately. So yes, you can use an LLM as a judge. Is that kind of...
if I use LM as a judge doesn't it just
end up being a teacher and student model
or
Yes, but that's why... essentially the problem is, I need to find the paper, but if you keep doing this it actually does badly. Intuitively it kind of works, but at some point it breaks. There is actually another way: you could ask a language model to generate reward functions. That is another school of thought: you can ask a language model to generate billions of reward functions. But then the question is whether those reward functions are good or bad; I don't know. Now you have to rely on the models being good, and you could ask yet another language model to verify the reward functions. So yes, you could do this. Maybe that's what OpenAI is doing, I don't know; maybe OpenAI's goal this whole time has been to generate all these reward functions, verify each of them, shove them into the objective, and see what happens. Maybe that's what they're doing, I don't know. But yes, it is student-teacher. Yes.
What's your opinion on how to make
models?
something else.
Do you mean like how to make reward
functions more efficient in general?
No, I mean scalable in the sense of
many.
Yes, the majority of reward functions currently, that's why it's called verifiable rewards, are for maths and coding, to be honest. I think coding is also hard; I don't know why people lump it in. For coding you can't technically verify correctness; you can just say it ran, or that the output is most likely correct, for some functions. But take the Flappy Bird game: tell it to create a Flappy Bird game; how do you actually verify it even is a Flappy Bird game? I don't know. But you could, and that's the whole point of LLM as a judge: take the output of the Flappy Bird game, ask the language model 'does this look like the Flappy Bird game?', and if yes, plus one; if no, minus one. You could do that, but you can only go so far. Does that kind of...
What's your opinion on the scaling part?
I think most large model labs are currently just trying to use their own model to do the rewarding, like I described: ask it, 'does this look like the Flappy Bird game?', if yes plus one, if no minus one. And I think the large model labs' view is that if you keep doing this, you'll get to AGI. Maybe, but then I always fall back to: it might be bad luck and it's not going to work. I think in general bad luck means you only get so far and then suddenly it just doesn't work. Does that... Okay. Any other... Yes.
How do you think about doing this for a particular domain?
That is your choice again. If you want to specialize, for example you just want to make a legal bot, you're given some court case and predict whether the plaintiff or the defendant wins, you could just do law; yes, you could do that. But in my view, you should combine it with other sources: combine it with some maths, combine it with some programming, because the point is you don't want the model to overfit to just law. Maybe maths will be helpful just by chance; maybe coding will be helpful, probably not, but in general you should combine other domains together. I feel like all the large model labs' goal is to cover every single domain possible: mine every reward function in the whole world, make all the reward functions, shove them into the model, and it just learns. So yes, you should do more domains. Yeah, that's another... yeah.
research
let's
small
particular.
So the notebook I will share showcases that you should probably do some supervised fine-tuning first; it's called the priming stage. Otherwise, remember the plot from before, this one: you don't want to be in the situation where you start from some bad pre-trained state and try to go straight to the RL stage; that's very inefficient, and remember, AI is all about efficiency, so you don't want to do that step. So we do have to do some priming, the SFT stage, and the other stages, if that's your question. Oh, and if you're asking about model size: yes, the bigger the model, the better.
can
That's the trick. Essentially, the research papers show that small models actually do work, confusingly enough, because a small model just does longer thinking: the reasoning traces get longer if the model is smaller, whereas a larger model's reasoning traces tend to be shorter. So I feel like small models do work. They do break down, though: if you want very complicated reasoning traces, a small model might not manage, because with only seven billion parameters there's not that much room to move. Larger models just have more room to move around, and that's why they're better. I don't know if that answers your question.
want to find.
Yes, correct, exactly; that's what you should do. Yes, you can take a distilled model, an already-reasoning model, and further fine-tune it; you could. I would say it's a bit more complicated, because the reasoning model is already a reasoning model and you're trying to fine-tune it toward some other domain. It might be easier, it might be harder; it's all about luck again, I don't know. So you have to try; it's all trial and error. Yeah. Yes.
yeah. Yeah. Yes.
Two questions for you. One: it's pretty empirical, just try and see what works and what doesn't?
Yes, correct.
Okay, and then the other side: how are you keeping up with all the papers and all the content? I'm sure it's a lot. How do you learn, what do you follow?
To be honest, I don't; you don't need to follow everything, that's my view. Don't try to follow the latest research, because sometimes the next day there's a rebuttal of the previous paper, and then the next paper is a rebuttal of the rebuttal; I don't know. So I would not try to stay too up to date with the latest research. I think the field has kind of matured and is mostly stable now; you might get some algorithm increasing accuracy by 1% or 2%, or some efficiency trick. Remember, all of the papers are about efficiency: making training more stable, reducing overfitting; it's always these similar papers. That said, you can keep up with papers; Twitter is very good as a resource, and sometimes I tweet about papers. The Nathan Lambert book, the RLHF book, is very good; he keeps updating it all the time, so definitely read that, and maybe follow Nathan Lambert; he's a very good Twitter follow for the latest research. In general there's a lot of noise in the RL space as well; you don't know if the research is good or bad, rebuttals on top of rebuttals. So I would suggest people just try things; it's trial and error. See if your reward function is good or bad; is the reward just all zeros? Then unfortunately something's wrong, or it's just bad luck; try again. So it's just empirical. Yes.
these slides. Sorry.
Are you put
Oh, yeah. Yeah. Yeah. Yeah. Yeah. Um,
yes. These slides should be up. I was
supposed to make a bitly link. Um, I'll
probably do that later, but I will share
the slides. Yes.
Slacking.
Oh, yeah. Okay. I'll do that then. Okay.
Any uh Yes.
Yeah.
Yes.
In the old PPO sense, the value model is a separate model and the reward model is a separate model, yes. But remember, in GRPO we delete the value model; it's totally gone. We create the value estimate just from statistics of the distribution: for 'what is 2 plus 2' we create four samples, four trials, then take the mean and the standard deviation, and that is your value estimate; it's not even a model anymore. And the reward model is gone as well; it's just reward functions. That is why we call it reinforcement learning with verifiable rewards: it's not normal RL anymore, you've replaced the reward model as well. Does that kind of make sense?
That does make sense.
Okay.
There's
Yes. So again, there are two schools of thought. The first one: for the question 'what is 2 plus 2', somewhere in the model, I don't know where in this huge trillion-parameter space, it already knows to calculate it as four; there is some sort of circuit inside the model, and the goal of RL is just to amplify that circuit somehow, via these formulas. The other school of thought is that RL is actually learning something new: it's actually learning how to do 2 plus 2 equals four, and that wasn't in the model before.
okay any yeah yes
but when you say capabilities
of of a model that already has them
inside.
Yes.
You mean knowing actually the answer to
a question or knowing how to reason to
get to a question?
That is a good question. Maybe both; I think it depends. It probably knows how to do the reasoning... For example, a contrived example: you take all of the world's data, say 30 trillion tokens, and you make up a question that is not part of that data; you can make a maths equation that is not in the data, some random huge number times some other random number. But somehow the model has learned to do multiplication and addition somewhere. So maybe there is a circuit for addition, for multiplication, many circuits for these functions, and we just want to accentuate them all; that is kind of what RL is trying to do. And yes, there's a reasoning circuit too: somehow the model also learns how to reason, and we want to make that more important. So 2 plus 2 is important, addition is important, multiplication is important, and so on; we're just trying to make all of these circuits more prevalent. But that's only one half of the AI community; the other half says the model is actually learning: we're actually training the model to learn, and the base model doesn't already know how to do reasoning. Does that kind of... Okay. Yes. Okay. Hopefully. Okay. Any other questions? Yes.
Yes, there is veRL, there is TRL, and there's Unsloth; ours is not really a framework so much as a showcase that you can do GRPO and reinforcement learning with very low resources. We are the only package that lets you do GRPO on a free Colab, and that's the difference between us and everyone else. veRL is very good for large training runs, but for small experimentation, if you want to try stuff out, you don't know what reinforcement learning is, you don't know how to make reward functions or even which reward function to use, you should utilize our notebooks, and that's what I was going to demo. Um, yeah.
One question: let's say you don't need thinking, you just have a regular task; is there any effectiveness in using this to just improve, say, tool use?
Yes, you can do that, exactly. It should increase accuracy by quite a bit: if your accuracy with tool use was not very good before, RL should definitely help. And I feel like the trick of RL is that it reduces overfitting, because you do multiple inferences; you don't know which one is correct, but you're trying to push up the good ones. The problem with plain fine-tuning is that you're kind of overfitting the model. The trick of reinforcement fine-tuning is that you can reduce that overfitting, so the model actually learns how to do tool calling. Not just 'I see someone is trying to order from a restaurant, so I call DoorDash', but it actually learns: because the person wants to order food, I should call DoorDash. It's like reverse thinking.
So it should definitely help.
In some of my experiments with Unsloth, I tried not explicitly instructing the model, just giving it some task, and it does not automatically start a thinking process unless you explicitly prompt it: okay, first think and then answer.
Yes.
I was trying to check whether, without explicitly asking it to start a thinking process, just improving the tool accuracy itself, it would start thinking on its own; and I think that did not happen.
Because you don't need it to. Generally, people utilize GRPO and reinforcement learning algorithms to create the thinking process, but that's kind of an artifact of GRPO: just by chance they see a reasoning process emerge. You don't strictly need it. So maybe by chance, by luck, it learns how to do tool calling without a thinking process; it could be some weird symbols, maybe. Sometimes models suddenly switch to different languages; it could be like that. Randomly, it learns how to do tool calling; maybe it made some new programming language internally, I don't know, but it could have. In this case, there was no thinking process.
yes
It just directly gives the output, because, say, you sample 10 trajectories and none of them has a thinking process; then it never explores those parts.
For now it never will, but remember, it's all about luck: over time you will get a thinking process just by miraculous chance, somewhere, and then, oh, you should do this more, and it just does it more.
But to make that more probable, you should prompt it?
You can prompt it. Essentially, in the system prompt you can say 'please put your working out between these tags'; you can force the model to create the working out. You could do that. Is it the most efficient way? I don't know. You could even say, 'please create a new language that I don't understand which does tool calling', and it does some weird symbols and then does tool calling; I don't know. But yes, you should prompt it; it should make it more effective.
Okay. Any other Yes.
What's the secret sauce?
Oh, we utilize Triton kernels; we do kernel optimizations; we reduce memory usage by 70%. There are lots of optimizations we do to make training faster and more memory efficient.
Yes, later. It's not the main focus, but yes, we will talk about that. For VLM specifically, we use... okay, that's actually in the notebook. Oh, before that, do we have any other questions? Yes.
...focus on these days?
Actually, the DeepSeek paper talks about this. There is pass@k and majority@k, and I think they said that if you do test-time scaling, it improves the majority metric; I think that was right, I'd have to go back and check the paper. But remember, test-time scaling is different from reinforcement learning; they are different methodologies. Test-time scaling is calling the model 10,000 times and then just checking, by majority, what the answer is: for example, you ask ChatGPT 'what is 2 plus 2'; it might say four, four, four, and suddenly it says five, just by chance, or zero, and you just take the most likely answer. That's called test-time scaling. Reinforcement learning is different: we actually want to train the model to do the whole trace, and then you don't need to output 10,000 samples and take the best answer; you just do one.
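A minimal sketch of that majority-vote flavour of test-time scaling; `generate` is a stand-in for however you sample the model:

```python
from collections import Counter

def majority_answer(generate, prompt, n=64):
    """Sample the model n times and return the most common final answer (majority@k)."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```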
Yes. Correct.
Yes.
Yeah. So the trick is you do the RL step and then you do test-time compute on top; it will actually make the accuracy much higher.
That's a good question. GRPO is kind of like that already: in GRPO you do a form of test-time scaling inside the actual reward step; you literally call the model on 'what is 2 plus 2' several times and then aggregate the results, so GRPO is sort of doing test-time scaling internally. Combining them further... that sounds like a new research paper. You could do that, I guess. Yes.
Okay, I will have to go to the notebook now. To access the notebook, you can go to our GitHub page, once my internet actually loads. If you go to the Unsloth GitHub page, there is a button called Qwen3 GRPO, and you can click 'Start for free'; that's how you get the notebook. Or you can go to our docs, which also have the notebook. So remember: go to the GitHub page, click 'Start for free' for the Qwen3 GRPO notebook, and you will get it. Generally it's in dark mode, but in presentations I don't think people can see that, so I will change it to light mode.
so we utilize VLM behind the hood. So
VLM does does anyone not know VLM? I
think that's a good question. Who does
not know VLM?
Okay, 100% you must use vLLM. For all open source: how do you serve a large language model? Please use vLLM or SGLang, or I think Hugging Face has one as well. These are the best open-source libraries for serving open-source models. You have a GPU; how do we actually serve Llama 3, how do we serve Llama 4? You use vLLM to serve it. The trick of Unsloth is this: Unsloth is a package for fine-tuning, for GRPO, for reinforcement learning, for continued pre-training, whatever you like, and the trick is we just optimize it. We make it much faster, use 70% less memory, make it fit on a free Colab. Remember, please use the free Colab resources. And Kaggle, I already said this, has 30 hours of free GPUs per week. Please utilize them; they won't be unhappy. Please utilize them.
And yeah, so you install Unsloth and vLLM. We have this class called FastLanguageModel, which you use to load a model. For example, we will now use the Qwen3 base model. Remember, I told you not to go straight from a base model, but we're going to do it anyway. On that plot from before, we're going from the dark blue dot toward the dark green dot, the thing I suggested not to do, but we'll do it for this demo.

You also have to set a max sequence length. If you want longer reasoning traces, you can increase the maximum sequence length; we set it to 2048. If you set it larger, the free GPU will run out of memory, and that's the problem. You can also load in 4-bit: if you do 4-bit quantization, the model weights go down to 4 bits and you reduce memory usage by quite a bit, so you can do that as well. And remember, we are utilizing LoRA, which is a parameter-efficient fine-tuning method. You don't need to fine-tune every single weight inside the entire model; that would be very costly. Instead we add small adapter weights to the model and fine-tune those. That's a trick that we do.

And because we utilize vLLM directly, we do another trick: we actually reduce memory usage by a further 50% by sharing vLLM's weights directly. Other training frameworks have to copy vLLM's weights, because you have one copy of the model for fine-tuning and one copy for vLLM inference. We share the vLLM weights directly, so you reduce memory usage by another 50%.

We also use something called Unsloth gradient checkpointing, which reduces memory. Essentially, everything in AI is about reducing memory usage and getting more efficiency; everything we set here is for efficiency purposes. For your LoRA rank: if you do LoRA, please set the LoRA alpha to be two times the LoRA rank. It speeds up training dramatically, so please do that.
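Putting that together, a minimal sketch of what this setup step roughly looks like in code. The exact model name, rank, and argument names are illustrative and may differ from the current notebook, so treat this as a shape, not the definitive call:

```python
from unsloth import FastLanguageModel

# Load a Qwen3 base model with vLLM-backed fast inference for GRPO rollouts.
# Model name and numbers here are illustrative; check the actual notebook.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Base",
    max_seq_length=2048,      # longer = longer reasoning traces, but more VRAM
    load_in_4bit=True,        # 4-bit quantization so it fits a free Colab GPU
    fast_inference=True,      # share the vLLM weights with the training model
    max_lora_rank=32,
)

# Attach LoRA adapters instead of fine-tuning every weight in the model.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                     # LoRA rank
    lora_alpha=64,            # rule of thumb from the talk: alpha = 2 * rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's offloaded checkpointing
)
```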
There's lots of other stuff here, compiling and so on; we do automatic compilation and things like that. You don't need to read all of it. And here is the bulk, the most important part. Someone was asking about the prompt. You make a system prompt: "You are given a problem. Think about the problem and provide your working out. Place it between reasoning start and reasoning end." The reasoning-start marker here is "start working out" and the reasoning-end marker is "end working out". Then: "provide your solution between solution start and solution end." So it should look something like this.
And this is the system prompt that we're going to use for reinforcement learning. Remember, you can customize this however you like. You don't have to say "you are given a problem"; you could say "you are given a legal case, think about the case and provide your legal reasoning, place it between these tags." I'm just making stuff up, but it can be literally anything. You can even make spelling mistakes in the tag names; it doesn't really matter. The whole goal of RL is that you design the reward and you design the system prompt however you like. I think the main problem is that people think you must follow DeepSeek's think format: people see the think tags in DeepSeek, and in closed models, and assume you need them. You do not need to follow that at all. You can make it up entirely. This is customizable to whatever you like.
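As a concrete sketch, here is roughly how you might define your own markers and system prompt in Python. The tag strings are just examples; pick whatever you like:

```python
# Illustrative markers; any strings work as long as the reward functions use the same ones.
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = (
    "You are given a problem. Think about the problem and provide your working out. "
    f"Place it between {reasoning_start} and {reasoning_end}. "
    f"Then provide your solution between {solution_start} and {solution_end}."
)
```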
The harder part is that because we are using a base model, you also have to make a chat template. This is the more annoying part, but you can just copy and paste the chat template from the notebook; you don't need to do anything else. A base model does not have a chat template: when you call it, you can't actually hold a conversation with it. It's not ChatGPT, it's just a base model, it doesn't do anything on its own. So you need to specify a template for it to understand how to do conversations. This is the template that we wrote. It's very generic, you can just copy and paste it, and it should work the same for pretty much anything.
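For orientation, this is a deliberately simplified Jinja chat template in the same spirit; the one in the notebook is more careful, so copy that one for real training rather than this sketch:

```python
# Minimal Jinja chat template for a base model (illustrative only).
# It concatenates the messages and, when asked for a new completion,
# pre-seeds the reasoning marker so the model starts its working out.
tokenizer.chat_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{{ message['content'] }}"
    "{% if message['role'] == 'assistant' %}{{ eos_token }}{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}<start_working_out>{% endif %}"
)
```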
After you set the chat template, we show an example of how to actually use the chat template and the tokenizer. For example, for a question like "what is 2 plus 2", you apply the reasoning prompt: it says "you are given a problem, think about the problem and provide your working out…", then the question itself, "what is 2 plus 2?", and remember the answer is 4. Then comes "start working out". All of this is what you give to the model; you feed the whole thing in, and the goal of RL is to create the working-out section automatically. The RL algorithm will automatically create the working out, the thinking process, and then finally it will say four. Well, hopefully it will say four, and if you see four, you make the reward higher for that sample.

Now, someone was asking about fine-tuning on an instruct model first. Remember, we go back to this diagram: we wanted to start from the blue dot and go straight to the green dot, but we found that doesn't actually work, so don't do that. The trick is to take the pre-trained model, do some supervised fine-tuning, and then go to the green dot. And so in this part we show that you should actually do some supervised fine-tuning first. You need it to prime the model: the goal is to make sure the model doesn't just output zero reward forever, and this dataset lets you prime the model with supervised fine-tuning.
So for example, the problem might be "what is the sum of all the real numbers…" and so on, and then you use DeepSeek R1, and this is a trick, a hack, to create some example reasoning traces, and you shove those into the fine-tuning step. That way the model already learns how to do some reasoning before RL starts. This dataset is very small, only about 7,000 rows, and you don't need that much data for this first step. I think I only used around 600 rows. Very little data.
So this is just the data preparation step, not that important. I need to skip to the reward functions; that's the most important part. This section is the supervised fine-tuning step, all of it, the light blue part of the diagram, so not that important, and we'll skip it. Here is the most important part: the reward function creation. I feel like the majority of people neglect this part, and it is the hardest part to do. Okay, let's see, where is the reward function… oh, here it is.
Okay, this first one is a regular expression that checks whether the format is correct. Remember, we ask the model to put its working out between "start working out" and "end working out". This regular expression checks that the output contains those markers; the reward function built on it rewards the model for following the format, and if the markers aren't there you effectively penalize it. That's one reward function I created.

For example, if the output is something like "let me think… end working out… solution: 2", the regex extracts the 2. That's good: we force the model to generate the answer between the solution markers, and it successfully extracted 2. The model might also generate some random spaces, or not follow your exact format, and we still try to match it. Even if it generates extra spaces, we still successfully extract the number 2, so that's good.

This is the reward function: if the output matches the format exactly, we add three to the score, and if not we give zero. And remember, this match-format check is the regular expression we had to write by hand.
This number does not have to be plus three. It could be plus 300, or plus one, anything you like; I just found plus three to work fine. And remember, the score for not matching is zero here. You could also make it minus three, subtract points in the else branch, it's up to you. You can design your reward function however you like.
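A minimal sketch of that exact-format reward, written in the TRL-style signature where a reward function takes the sampled completions and returns one score per completion. The regex and marker strings are illustrative, and I'm assuming chat-style completions (a list of message dicts):

```python
import re

# Illustrative format regex: a closed working-out block followed by a solution block.
match_format = re.compile(
    r"<end_working_out>.*?<SOLUTION>(.+?)</SOLUTION>\s*$",
    flags=re.DOTALL,
)

def match_format_exactly(completions, **kwargs):
    """+3 if the sampled completion follows the requested format, 0 otherwise."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]   # chat-style completion: list of messages
        scores.append(3.0 if match_format.search(response) else 0.0)
    return scores
```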
But remember, if the model's output doesn't exactly follow your format, you should still reward it at least a little bit; otherwise the reward will just be 0, 0, 0, 0 forever. So the trick is: if we see one of the expected keywords, we add 0.5, and if we don't see it, we subtract 1. This lets you partially reward the model.
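A sketch of that partial-credit version, again with illustrative keywords and scores:

```python
def match_format_approximately(completions, **kwargs):
    """Partial credit: +0.5 per expected keyword that appears, -1.0 per missing one."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        score = 0.0
        # Use whatever markers your system prompt actually defines.
        for keyword in ("<end_working_out>", "<SOLUTION>", "</SOLUTION>"):
            score += 0.5 if keyword in response else -1.0
        scores.append(score)
    return scores
```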
Now the reward functions get more complicated. This larger reward function does distance-based scoring. Remember the example from before: what is 2 plus 2? Four is correct, but an answer of 3 is still better than something way off. If you output 5, it's wrong but close; if you output something far away, it's definitely wrong. So this function takes the guess divided by the true answer as a ratio, and if that ratio is close to one, meaning your number is close to the actual answer, you give a higher reward; if your answer is very far off, you penalize it by subtracting reward. If it's exactly correct, you also add five points. This is probably the most important reward function here, but it's only for maths; for code and other domains you have to create different reward functions.
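A self-contained sketch of that distance-based scorer. The thresholds and score values are illustrative, and `answer` is assumed to be a dataset column of true answers passed through by the trainer:

```python
import re

solution_regex = re.compile(r"<SOLUTION>(.+?)</SOLUTION>", flags=re.DOTALL)

def check_numbers(completions, answer, **kwargs):
    """Distance-based scoring: exact answers get the most reward, close guesses
    get a little, far-off guesses get penalized."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        response = completion[0]["content"]
        match = solution_regex.search(response)
        if match is None:
            scores.append(-2.0)               # nothing extractable at all
            continue
        try:
            guess = float(match.group(1).strip())
            truth = float(true_answer)
        except ValueError:
            scores.append(-2.0)
            continue
        if guess == truth:
            scores.append(5.0)                # exactly correct: big bonus
        else:
            ratio = guess / truth if truth != 0 else float("inf")
            scores.append(1.0 if 0.9 <= ratio <= 1.1 else -1.0)
    return scores
```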
Now we test whether our reward functions actually work, and yes, the extraction pulls out the numbers. Sorry, this part is just extracting the solution, not the format reward: you can see it extracted 0.34, it extracted this number, and this one, and this one. If your reward function is not extracting things properly, you probably got the regular expression wrong, so edit that. Then there are helper functions. Oh, and this is another reward function: if the model writes a number with commas in it, like 123,456, we want to remove the commas, because otherwise you can't convert it into a Python number. Then if it's equal to the true answer you add 3.5 reward, and if not you subtract 1.5.
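A sketch of that comma-stripping check, with the +3.5 / -1.5 values from the talk and an illustrative extraction regex:

```python
import re

def check_numbers_with_commas(completions, answer, **kwargs):
    """Strip thousands separators (e.g. '123,456') before comparing to the true answer."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        response = completion[0]["content"]
        match = re.search(r"([\d][\d,\.]*)", response)
        if match is None:
            scores.append(-1.5)
            continue
        cleaned = match.group(1).replace(",", "")   # "123,456" -> "123456"
        try:
            guess = float(cleaned)
        except ValueError:
            scores.append(-1.5)
            continue
        scores.append(3.5 if guess == float(true_answer) else -1.5)
    return scores
```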
There are more dataset preparation functions, not that important. And here is the meat of the code for training. We call vLLM with a top_p of 1.0; 1.0 just means you're sampling from the entire distribution. You could set this to 0.8 or something else, up to you, but I generally set it to 1.0 for full sampling. min_p is 0.1; I suggest people use this, because otherwise the model can wander off into random outputs during inference, so use 0.1. For temperature, I suggest increasing it to around 1.0 or 1.2. The higher the temperature, the more varied and creative the outputs become, but if you increase it too much, like 2, your model will output gibberish, so don't go too high. I normally suggest 1.0, 1.1, or 1.2, somewhere around there, and you should use min_p together with high temperatures. There is a paper about using temperature 1.5 with min_p 0.1; you should look at that.

There are some other settings we use. num_generations is very important; it's the number we talked about before: how many rollouts, how many sampled completions per question you want for GRPO. We chose four, so for "what is 2 plus 2" it creates four candidate answers. If you increase this number, you use much more memory, but you should increase it as much as you can. Then there's the batch size; we set this to one. The trick is that batch size times gradient accumulation is roughly equivalent most of the time. GRPO is not exactly the same, but essentially, with a batch size of one we process one question at a time; with three, we shove three questions together into one batch. Generally you should set the batch size much larger, but a bigger batch size uses more memory, so the trick is to use gradient accumulation instead. We set that to 16. Gradient accumulation adds up gradients over multiple steps, so you get the effect of a bigger batch without using too much memory.
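Roughly, the configuration looks like the sketch below, reusing the reward functions from above and assuming `model` and `dataset` come from the earlier steps. Field names and numbers are illustrative and depend on the TRL/Unsloth version, so check the notebook for the real values:

```python
from trl import GRPOConfig, GRPOTrainer
from vllm import SamplingParams

# Sampling settings for the rollouts (values discussed in the talk).
# How these get wired into the trainer depends on the Unsloth/TRL version.
sampling_params = SamplingParams(
    temperature=1.0,   # more exploration; too high becomes gibberish
    top_p=1.0,         # sample from the full distribution
    min_p=0.1,         # cut off very unlikely tokens so outputs stay coherent
    max_tokens=2048,
)

training_args = GRPOConfig(
    num_generations=4,                 # rollouts per prompt; more = more memory
    per_device_train_batch_size=1,     # prompts processed at once
    gradient_accumulation_steps=16,    # accumulate gradients instead of a big batch
    learning_rate=5e-6,
    max_steps=100,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[match_format_exactly, match_format_approximately,
                  check_numbers, check_numbers_with_commas],
)
trainer.train()
```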
Then there's evaluation: if you want to do evaluation, there are some functions for that. And then we start the training. You will get a large table of numbers during the training process; this run took 2 hours and 54 minutes on a free Colab. Look at the reward column: minus 7.5, minus 5.5, minus 5.5, all very bad, and then suddenly plus 13. Just by chance, it's plus 13. Remember, the trick of GRPO is: when you see this plus 13, let's maximize it even more. Then it doesn't really work for a while, it goes back to minus 7.5, minus 5.5, minus 7.5 and so on, and then plus 11, another good reward we want to maximize, and so on and so on. That's the trick of GRPO: by luck, literally by chance, you get good answers, and you want to make them more likely. And if you keep scrolling down, okay, I should make a plot, but in general your reward increases over time. Look, these are all positive numbers now; the negative numbers are getting fewer and fewer. If you plot this over time, the reward actually increases.

There are also other numbers, like completion length. Remember, when you use a reasoning model the reasoning trace can be extremely long; this column just tracks how long the reasoning process is. Over time, in general, the reasoning length should get longer and longer, but not always; it's not always the case.
There is also another column called KL divergence. This tells you how far the model you're training has drifted from the original model; the larger the number, the further away it is. In general this number gets bigger over time, sometimes it doesn't move, and here you generally want to let it grow. We also made separate reward functions, and each of those reward functions has its own reward column. The most important ones are the last column and the second-to-last column. In RL there is a common failure mode: most RL training runs just learn to follow the format and don't actually learn the task. So the format columns are not important; do not look at the format columns, they're basically useless. You need to look at the last two columns, the ones that check whether the answer itself is good or bad. You see it's minus 2.5, minus 2.5, not very good, and then suddenly 3.5. 3.5 is good; we want to maximize that. And if you keep looking, over time, if you take a rolling average, the model gets better and better. Obviously we only trained this for two hours and fifty minutes; if you trained it for 20 days it might actually do very well, but remember this is a free Colab GPU. So in general, the goal of GRPO is: suddenly we see a good answer with a good reward, and we want to make it more likely. That's the whole point of GRPO. It's nothing fancy; it's just that by luck we see a good answer and we maximize it.
We can also see some output from the model right at the very beginning. Let's see, where is it, an example: "compute the number of positive integers that divide at least…", some question, and then it does some reasoning trace. Remember, we already fine-tuned it a little bit, so it does produce something, but the answer just goes on and on; it keeps blabbering. If you look at the actual output, okay, there's a lot, we print out every single sample, we print out a lot, and it just keeps going on and on. This is the output during the GRPO run, and if you inspect it over time, you will see that the model actually gets better and better.

Just as an example: let's say we ask the model, what is the square root of 101? Not the square root of 100, that's just 10, but the square root of 101.
If you do not train the model, this is what you get. It will say: "Answers. Education. Math and arithmetic. What is the square root of 101? Wiki user…" Wiki user! That's actually what it will say. Where do you think this data comes from? Anyone want to guess? Probably Wikipedia and wiki answer sites, right? So if you ask the base model what is the square root of 101, it doesn't do anything useful. Remember, this is the base model; the base model on its own is useless, and it's not going to answer the question.

But after we do GRPO and ask the same question, what is the square root of 101, it says: okay, so I need to find the square root of 101, hmm, let me think, I remember that the square roots of numbers between perfect squares are irrational… and so on, and then it gives the solution: 10.049875… I think that's correct, or at least very close. So the whole point is that the GRPO algorithm produced all of this, this entire reasoning trace. In the old days, you actually had to have a human write all of this and then fine-tune the model on it. With GRPO you skip that; you don't need to write it anymore, it's automatic. That's the trick of GRPO and reinforcement learning: all of this reasoning is produced automatically, essentially from nothing, and in the end it arrives at a solution.
Yes.
Yes, that's the trick. If you only use the base model and go straight into the RL stage, you will still get there eventually, but it will take too long; you'd be waiting for something like 20 days. So for demonstration purposes in a Colab, you do the supervised fine-tuning step first. That's the trick.
Yeah. Yes.
What is the advantage of doing this with 7,000 examples versus using a model out of the box?
You can use an instruct model; we actually have notebooks for that. If you look at our GRPO notebooks in general, okay, the internet's very slow, we have other notebooks. For example, the Llama 3.2 3 billion one uses an instruct model. You don't need to use a base model; we just showed that you can. I'd actually suggest people use instruct, and you probably shouldn't use base. It's all about efficiency as well.
Yes.
Sorry, what?
Yes. Yes.
Yes, correct. So the goal of the KL divergence term is that you want the model not to stray too far from the original model. KL divergence, and I shouldn't strictly call it a distance, is like a distance between the current model that you're training and the model at the very beginning. If the model drifts too far away, the KL divergence will be very large. If you look at the table, let me scroll up a bit, this column here is the KL divergence column. Over time it should get larger and larger, because the model is moving away from the original model. If you set the beta to zero, you remove this term entirely. Maybe that gives the model more room to change, because you're no longer forcing it to stay close to the base model; that's an active area of research. So some people set it to zero, some don't. I think the default is 0.05 or 0.03, something like that.
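Roughly, the objective the trainer optimizes has this shape (a sketch, not the library's exact loss): the advantage-weighted term pushes good samples up, and the beta-weighted KL term anchors the policy to the reference (original) model, so setting beta to zero removes the anchor.

\[
J(\theta) \;\approx\; \mathbb{E}\!\left[\, A \,\frac{\pi_\theta(o \mid q)}{\pi_{\theta_\text{old}}(o \mid q)} \,\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_\text{ref}\right)
\]

Here \(A\) is the group-relative advantage computed from the rewards, \(\pi_\theta\) is the model being trained, and \(\pi_\text{ref}\) is the original model.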
Okay, any other questions? Yes.
So actually the fine-tuning stage is already very helpful. Where is the loss… the base model on its own is very bad, but here, here's the loss. This is from the fine-tuning step, the priming stage, where you use a small dataset to prime the model first. The loss does decrease. Remember, if you see a loss around 0.64, that's good; if you see a loss higher than 30, something is definitely wrong, and higher than 3 is quite bad. You can see the loss clearly decreases over time, so yes, the fine-tuning stage does teach the model a bit of reasoning and it learns how to do some of it. A very interesting fact: we used some of DeepSeek R1's reasoning traces for the fine-tuning step, and interestingly, if you just call the model after that, without doing any GRPO, it already does some reasoning, just from those 7,000 examples. It kind of learns how to reason somewhat, but it's not perfect, and the goal of GRPO is to push it much further. Okay, not to perfection, but as far as possible. So the fine-tuning step already teaches it a little bit.
No, actually, you don't need to use all 7,000. I think I only used 118. It's two training epochs, and yes, it's 118; I only used 118 rows. You don't need to; you can use 10 rows, you can use even less. You must use more than three rows though, because when you do LoRA with fewer than that the gradients end up being zero. So more than three, but anything above three is fine. Even with 118 it does fine. Yeah.
Yes.
Do you mean a small model versus a big model? What's the difference?
Can you share any tricks if you want to do this training on a bigger model?
Oh, yeah. You can take the notebook; you will need a better GPU, though. Take the notebook and edit this line here: instead of 4 billion you can do 14 billion. Wait, I think there's a 14 billion, or is it 12 billion? I can't remember. You can do whatever you like; you could even do Llama 3.3 70 billion, up to you. The point is that for a Colab demonstration with a small GPU, I use a small model. We actually have free-Colab notebooks that fit 14 billion parameters; Phi-4 14 billion actually fits in a free Colab, so you can do big models in a free Colab. And again, Kaggle: use the Kaggle free GPUs. Essentially this whole page has notebooks. Where's Kaggle… Kaggle has notebooks for GRPO as well. So you can do whatever you like for larger models.
Oh, do you mean for the vLLM rollouts? So the trick is we colocate: we use the same machine for inference and for fine-tuning, and because we share the vLLM weights you can reduce memory usage. Some other trainers like veRL and TRL have to put the inference on another server, with training on a separate server, and they have to communicate between the two. We don't; there is no communication for us, none at all. So it's very close to asynchronous training, with nearly no delay. We don't support that separate-server setup yet, but we do plan to support larger training runs. Yeah, yes.
Question: what about…?
Yes, people have asked that. It's not on the roadmap; I don't know if we're going to support it, it's a bit more complicated. You could try PyTorch/XLA, since they do have PyTorch lowered to TPUs, so maybe it might work. I don't know if it works; I've never tried it. Maybe later, yeah, maybe later.
Okay. Yeah. Yes.
Oh, you don't have to. The whole point of this, here, is that we chose a base model to show that you can take a base model and get it to the green dot. But unfortunately, in a Colab we do have to do some supervised fine-tuning first, otherwise you'll be waiting forever: the reward will just stay at 0, 0, 0, 0. Remember, all of AI is about efficiency and speed, so we just wanted to show that you do need the light blue step, the supervised fine-tuning step. Yeah.
Oh no, you don't need to. You can take the instruct model. In the notebooks over here, for example the Llama 3.2 3 billion notebook, we don't do any fine-tuning step at all. You skip it entirely, because it's already an instruct model: it already knows how to chat, it already knows how to answer questions, so you can go straight to GRPO. If it loads… okay, you'll have to wait for it to load, the internet's very bad. It is loading. Yes, any other questions? Yes.
In the REINFORCE algorithm you had the log probability of a state and an action. Is that happening inside the sampling model? Where is that in the notebook?
Oh, the algorithm itself of GRPO? It's behind the scenes.
Is that happening inside the GRPO trainer?
Yes, it's inside the trainer itself; somewhere in the code it does that.
It's figuring out the probability of a token versus all the other tokens?
Oh, the calculation is inside the trainer. Somewhere on the GPU you're doing this calculation, and you do get the probabilities; remember, a language model gives you the probabilities already. You then get the reward and you just want to maximize it.
So do you take the logits that come out of the large language model, turn them into pseudo-probabilities, and then just assume that's the distribution?
I think so, yes, that's correct. I think it's the exponential of the log-probabilities, yes. If you go into the code there is a derivation for it, but yes, you're correct.
Oh, okay, the notebook loaded. So yes, there is another notebook which uses the instruct model, here, and there is no fine-tuning step at all; it just defines the reward functions and so on. Will's notebook, for example, is also very good if anyone wants to check out other notebooks; I think it also uses an instruct model and then does GRPO.
Okay, time is kind of running out, but technically the GRPO portion is done. Oh, there are actually more sections; I will have to breeze through them, there's only 10 minutes left. Whoops. I will take any other questions at the very end; I'm going to stay here afterwards anyway.
Quantization. We'll now shift over to quantization. I don't know if you know about the DeepSeek R1 1.58-bit quants that we did, but you can essentially download these models. DeepSeek R1 is around 730 GB, and you can quantize it down to around 140 GB without that much loss in accuracy. Okay, there's obviously some loss in accuracy, but the trick is you can quantize it down to be very small and, miraculously, it still works.

Llama 4 Scout, for example: you can't really see the accuracy plot, but the smallest number is 80% accuracy on MMLU 5-shot, and the highest accuracy is 81 point something, so it's actually only about a 1% difference, and the one on the left is a one-bit quant. It's tiny compared to the full-precision version, like float8. So essentially you can make the model eight times smaller and only lose about 1% accuracy, which is very interesting. What we showed is that you can quantize the mixture-of-experts layers very heavily, but you must leave the attention layers, the shared experts, and a few other layers in higher precision; that's what we call the dynamic quantization methodology.
There was also a benchmark of Llama 4 Scout: if you use a two-bit quant, it actually gets higher accuracy than some other providers running full precision, which is very interesting. For example, the two-bit quant gets around 73% accuracy while some other inference providers get 65% or 67%. That's a very large difference. Okay, there were some bugs in the models and maybe some providers quantized things incorrectly, but the point is to show that if you quantize a model down to very few bits, it still works.
We showcase this with an example. If you take a vision model like Qwen2-VL 2 billion and naively quantize all the layers to 4-bit, then ask the model "what does this image show?", it will say "the image depicts a vibrant and colorful scene of a coastal area", which is totally wrong. The answer should be something like "the image shows a train traveling on tracks." If you quantize everything to 4-bit, it's 1.36 GB, but it's clearly broken. The trick is that you must leave some layers in higher precision, and you only need to add about 500 MB or so, up to roughly 1.8 GB, and it works: "the image shows a train traveling on tracks." It suddenly works.
But the question is: which layers do you not quantize? You could do an exhaustive search: don't quantize layer 0, check; layer 1, layer 2, check every single one. But it would take forever, something like 70-choose-1 plus 70-choose-2 combinations. Horrible. And remember, all of AI is about efficiency, so definitely don't do that. The trick is to look at the activation quantization error and the weight quantization error, and you will see large outliers. For example, for Qwen, if you quantize the first few layers it's extremely bad, so you must leave the first few layers unquantized. There is also a gigantic jump in the weight quantization error at one point, which means you probably shouldn't quantize that layer either. We show some other plots too. For Llama 3.2, interestingly, every model's graph looks different. You will notice Llama 3.2 has these weird, regular spikes; that's because they feed attention back into the vision module, I think every three layers, so every three layers there's a big jump. That means you should not quantize those layers. Pixtral, for example, is a different graph again; it seems like you can't quantize many layers at all, unfortunately, and the whole vision module must not be quantized.
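The underlying idea can be sketched in a few lines: measure the error each layer would incur under quantization and keep the worst offenders in higher precision. This sketch uses naive symmetric round-to-nearest 4-bit quantization as a stand-in; the actual dynamic quantization measurement is more involved:

```python
import torch

def weight_quant_error(weight: torch.Tensor, n_bits: int = 4) -> float:
    """Relative error from naive symmetric round-to-nearest quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = weight.abs().max() / qmax
    quantized = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale
    return ((weight - quantized).norm() / weight.norm()).item()

def layers_to_keep_in_high_precision(model, threshold: float = 0.05):
    """Flag layers whose quantization error spikes above a chosen threshold."""
    keep = []
    for name, param in model.named_parameters():
        if param.ndim < 2:          # skip biases and norm parameters
            continue
        if weight_quant_error(param.data) > threshold:
            keep.append(name)
    return keep
```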
There is a very important paper about which layers you should and shouldn't quantize, called the super weights paper. You should definitely read it. Essentially it says that in all language models, in the down projection of the first few layers, there is one extremely important number, one single weight that is very, very important, and you should never quantize it, ever. The interesting finding is that it is not actually a very large number. There is a trend in the language-model space where people think the problem is outliers: these models have big outliers, like suddenly there's a weight of 3,000, and quantizing it ruins the model. But this paper shows it's not only the outliers that are the problem. These super weights can be quite small, and if you select them and set them to zero, the accuracy decreases dramatically. They produce very large activation values, and if you remove them, it's very, very bad.
There is another trick you could do: if you have a model with seven billion parameters, set each parameter to zero one at a time and check accuracy. The first parameter, zero it, check accuracy; the second, zero it, check accuracy; and so on. You can do this seven billion times and see which number is the most important. You could do that, but remember, AI is about efficiency, so it's not a good idea.

More recent developments: the new Blackwell chips, for example. Instead of integer quantization to one, two, three, or four bits, NVIDIA's chips also have a new format called FP4, or MXFP4. Essentially this is float4, and float4 is most likely going to be used a lot in the future. There are new quantization formats like this which essentially let you train models in very low precision.
I also made this plot going from float32 downwards. The question people always ask is: why do GPUs keep getting faster and faster? My take is that this year is probably the last year you're going to get GPUs that are meaningfully faster; there are no more big jumps left. Why? Because the majority of the speedup has come from reducing numerical precision. From float32 to float16 you get about five times faster. Why five times? Because the transistor cost of the multiplier is roughly the exponent bits plus the mantissa bits squared. Float32 has 23 mantissa bits, and 23 squared is very large; float16 reduces the mantissa to 10 bits, and that is where the roughly five-times speedup from float32 to float16 comes from. The representation of each weight is simply getting smaller. Then we moved from float16 to bfloat16, which is again maybe around two times faster than float16.
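Working through that heuristic (cost roughly equals exponent bits plus mantissa bits squared, which is a rule of thumb from the talk, not real hardware numbers) reproduces the claimed ratios:

```python
# Rough transistor-cost heuristic: cost ~ exponent_bits + mantissa_bits ** 2
formats = {
    "float32":  (8, 23),
    "float16":  (5, 10),
    "bfloat16": (8, 7),
}
cost = {name: e + m ** 2 for name, (e, m) in formats.items()}
# float32 ~ 537 vs float16 ~ 105  -> roughly the 5x speedup mentioned
# float16 ~ 105 vs bfloat16 ~ 57  -> roughly the ~2x step to bfloat16
print(cost["float32"] / cost["float16"], cost["float16"] / cost["bfloat16"])
```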
Then we have float8, which is even faster again and uses even less space. But then there's a problem: float4. We get to float4, which is around two times faster than float8, and then what's next? Do we go to float2, float3, float1? You can't push much further in terms of numerical precision; there is not much more to gain in that direction. So you can maybe get 180 times faster than float32, maybe 200 times, but my take is that float4 might be the final precision jump that makes things faster. In the future, GPUs are not going to get much faster from this trick. So if people want to buy Blackwell GPUs, you should probably buy them; it's most likely not going to get much faster after this. That's kind of my take.
Okay, I was going to talk about kernels and other things, but I don't think I have enough time. One thing: you must use torch.compile. Every single function that you see, wrap it in torch.compile and try it out. I always tell the PyTorch team, please make it the default. Definitely use torch.compile. Why? Because it makes your training faster sometimes, only sometimes, not all the time, and it reduces memory usage most of the time. If you see bugs, they'll probably fix them.

But remember, torch.compile is not as simple as you think. You don't just call torch.compile on the model; there are actually many options you can tune. I've listed just a few here, and there are something like ten more pages of options. I'm serious, ten more pages you can tune. Imagine using torch.compile and tuning every single one. That's why I highly suggest people use torch.compile more effectively: it's probably the single biggest thing that can change your entire training run, making it more memory efficient and faster. So definitely look through this.
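A few of the knobs torch.compile exposes, as a starting point to experiment with; the values here are just a sketch, not recommended settings:

```python
import torch

compiled_model = torch.compile(
    model,
    mode="max-autotune",   # spend longer compiling to search for faster kernels
    fullgraph=False,       # allow graph breaks instead of erroring on them
    dynamic=True,          # try to compile once for varying sequence lengths
)
```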
Okay, so in general, yes, thank you. Definitely star us on GitHub, and join our Discord if you have any questions on RL and everything else. We have a website as well. And finally, we have stickers; there are some limited-time stickers somewhere, I think over there. And remember, if you have any questions, I'm still going to stay around, so come ask. Yeah, thanks a lot.
[Music]