Latent Space Paper Club: AIEWF Special Edition (Test of Time, DeepSeek R1/V3) — Vibhu Sapra

Channel: aiDotEngineer

Published at: 2025-07-25

YouTube video id: 9k3xPh-40mo

Source: https://www.youtube.com/watch?v=9k3xPh-40mo

[Music]
Okay, so Paper Club year in review.
We've gone over a year, like a year and a half, with no missed weeks. We've always done Paper Club, and it's pretty interesting. I don't think any of us expected it to get this far, but every week for the past year and a half, we have always done a paper club Wednesday at noon. We've had a bunch of authors come share their work: people from Nvidia, Meta, Allen AI, Amazon, Together, Writer. A bunch of authors come, we get direct one-hour sessions with them, they share their work, and we get nice feedback. Some of these have started to do pretty well. On average, we get about 100 people in every session. And just for context: on a Wednesday workday at noon, we have 100 people join to discuss a random paper. For some of the big ones, DeepSeek V3 had 300 people sitting live, listening to me yap about DeepSeek; of course, all the other speakers do great.
This is all built on volunteers, and along the way we made many friends. Paper club went much further than we expected. Now, the launch: we have to ship something, we basically have to launch at World's Fair. So we're launching our Test of Time paper club. This is going to be a V2, a second paper club. The one that we do now on Wednesday at noon is still sticking around; it will still be the same thing. Every week we'll have a paper, something that's trending, cover it, have an author, have our Q&A session: 30 minutes of paper presentation, whether that's highlights or slides, and then we'll continue with some discussion. But the Test of Time paper club is going to be a little bit different. So it's more
curriculum based. Basically, we take
this idea of what's everything that you
would need to know to be a good AI
engineer. What are the themes? What are
the core papers that don't change? So
stuff like: what is attention? How does sequential text generation work, i.e., what's going on in GPT-2? How do optimizers work? What about the key inference techniques, like speculative decoding and FlashAttention? Stable Diffusion, Whisper: the key papers that are the
foundation of what's being built. We're
going to kind of group those together
and session by session we'll go over
them. So we're going to kick off in July and run till December. That leaves us six months; at about four weeks a month, that's 24 weeks. Every week we plan to go through two to four pre-prepared papers, so it has a bit more structure to it. There will be presentations, and in six months we can get through 50 to 100 papers and cover the core concepts. Every week will be a bit different. One week we might be talking about Whisper, and if you're interested, in that one week you can get the fundamentals of speech: speech-to-text, text-to-speech, all that stuff. Another day it might be something like image generation, so we'll go over CLIP and Stable Diffusion, how all this stuff works, and extend it out to video. Basically, we'll segment out the core topics we need to know and cover them week by week. The exciting announcement here is that, for the first time, we're also having an SF section. So we're going to do in-person SF paper clubs and a remote section. I am going to mute someone real quick.
So we've got 40 of you. Someone needs to
be muted.
We've got crazy echo. Oh, crazy. I'll
just mute myself. Easy. I'll mute my own
laptop. Okay, continuing on. If someone needs to interrupt, just say so in the Zoom chat, because I don't have my speakers going. We'll have an in-person session in San Francisco every week, and of course we'll keep the remote thing going. The original paper club will still be its own thing: every week something comes out, we'll cover it, and we won't really deviate too much from the schedule. Test of Time is going to be foundational papers plus some blogs. So what are the topics? This is still up for us all to decide, so I'm just listing a few here:
We have foundations of deep learning: attention, optimization, ReLU, gradient descent, what is basic RL, how do these things work. So we'll have a section on foundations of deep learning. Another one is LLM foundations: for the pre-LLM era we have RNNs, LSTMs, bidirectional RNNs, all that stuff, and then the very foundational models, like BERT from Google, GPT-2, things like that. We'll have a day of everything pre-LLM and go over those. Then after that, we'll have the actual generative LLMs; I'm sure some are missing here, but Llama 3, DeepSeek, the core LLMs you would expect to hear about. We'll go over those. We'll have a day of
pre-training and post-training: what are the scaling law papers, what is Chinchilla, how does distillation work, what are the key papers you would want to know for training? Once again, this will be a one-to-two-week session; in one week we'll go over three to four papers, and we'll have someone present, so come prepared. We'll go over the fundamentals of scaling laws and distillation: what is Chinchilla scaling, what is overtraining, what are the Llama scaling laws, what are small-model scaling laws, like the work the Phi team has done. For small language models, how should we do proper RL? What does post-training look like for small models versus big models? We'll have someone cover all of these, and that'll be a one-to-two-week session. We'll have generative models: CLIP, Sora, Segment Anything, diffusion, plus some of the key agent papers. For fine-tuning, we'll go over LoRA, QLoRA, DPO, RL, GRPO. For voice, we'll have Whisper. An optimization stage: speculative decoding, FlashAttention. We'll have eval tracks and RecSys; Eugene Yan, who hosts our original paper club, will fill in good ones there. Much, much more. These categories are still up for debate, so I have a form; later, fill it out with stuff you'd want to add, any papers you recommend, any topics.
It's very straightforward. Also join our Discord; we have the paper club channel in the Discord. Join that, add in topics, anything you want covered, or volunteer to be a speaker. Over the next week I'll flesh out a rough schedule, and from there we'll take it and start covering it, making sure every session has speakers. We'll figure out logistics: SF will have a venue (we have a few in mind), and remote will be the same Zoom thing. We want to fan this out and get people interested. We'll obviously still bring in speakers; some of these key authors, we know them, so we'll invite them to speak on these papers, and it'll be good. We'll have discussion sessions, and these sessions might run a little more than an hour: since we now have three to four papers, we still want to go deep. We don't want to just do a TLDR of the paper; we want to cover the fundamentals and then go a little deeper. So stay in touch on Discord; there's a very basic Google form. But yeah, now we'll have curriculum-based stuff. Flo here might talk about music generation, Eugene will talk about state-space stuff: the same key members, but yeah, a second paper club. Now, the other part of this is that if you have an area of interest, this will be your go-to session: here are the five to ten papers, here's a presentation on them, here's the discussion, and it will stay up on YouTube, so you can sit in at any time and get caught up. You'll go from fundamentals to what you need to know in every session, and of course we'll have the same lively discussion. Okay, Test of Time paper club: very, very hype. But today is still paper club; we can't not do a paper. This is just a picture of Mochi, because everyone at the conference has dogs in their slides, so I need one too. We can't just have an old paper club back here; we need our OG. So today's paper is going to be DeepSeek. DeepSeek
was obviously a popular paper. I know a
lot of people haven't had a chance to
actually go through the paper and
frankly I didn't have much time to prep.
So, you know, I get to reuse my slides.
I'm smart like that. This is also being recorded for the broader AI Engineer conference workshops and speakers. That's another point: this is why we do paper club. We get these discussions on papers out, and then people can find them later. Like our original DeepSeek paper reading: that was just last minute, let's highlight the paper, make some basic slides, have some discussion. But guess what, 300 people joined live, and there are over a thousand views on a YouTube video of us just reading through a paper. So it's a pretty key paper, one of the big open source papers and models that marked a big transition, so let's just go over it again. And of course there is new stuff. As of this week, or last week, we have DeepSeek R1-0528.
So basically, the May 28th update. There have been rumors that DeepSeek R2 is coming out, it's coming out, and people started launching stuff. Guess what? They didn't do it. They just updated the same model. They called it a minor update, but it's actually not that small of an update, so let's dig into what it is. It's not a V3-style revision 2; it's the same naming scheme, but it's significantly better, actually. Simon Willison in his keynote mentioned how we don't have good naming for models; this is quite a step up, but they've kept the name. Also plugging Simon's talk: he gave a keynote on the past six months, what were the 30 models, and he has his pelican benchmark, his own benchmark for judging how good foundation models are. He needs to check the new DeepSeek model; I don't think he has yet. But basically, here's what they did: better post-training on DeepSeek V3. And guess what? It got better. They put out very little information, but when you dig in, you see one of the key things is that it's much better at reasoning. The AIME 2024 score went from 70% to 87.5%. AIME is a good reasoning benchmark from before, and now we have the new R1 matching o3 and Gemini 2.5 level performance on math, coding, and reasoning, which is quite a step up.
It used to be: okay, we have o1-level intelligence in open source. All we needed was a little more training, and now we have o3 and Gemini 2.5 level intelligence. And this isn't even a new model from them. This is just: let's do a little better post-training, let's do a little more, and we get significant performance increases. Pretty wild, right? Like an 18-point improvement on benchmarks, and no one's really talking about this. DeepSeek got a lot better. One of the things they say is that the original DeepSeek R1 took 12,000 tokens on average to reason through an AIME benchmark problem. They basically did more RL, and now it reasons for 25,000 tokens on average. So they got the model to do double the reasoning. One thing we talk a lot about is scaling laws. Before, we would do compute-optimal scaling for base models. Then, with Llama, we kind of overtrained for inference time: let's really overtrain so we can do fast, basic inference. And now, in this world of test-time compute, where we do more inference-time compute, we can also scale in that dimension. The original DeepSeek R1 on average would reason for 12,000 tokens; the new model can double that. So in this domain we've doubled the amount of reasoning it can do, and a lot of benchmarks went up: 18 points on AIME, and it's a lot better at coding. They intentionally wanted better JSON output, function calling, and more reasoning, and yeah, they just dropped it like that.
Here's our benchmark chart. On most paper clubs we don't do benchmarks, but it's actually up there with o3 and Gemini 2.5. The darkest color is the new revision of DeepSeek R1, and it's very good on most benchmarks, significantly better than the original DeepSeek R1. You can see this on Humanity's Last Exam: it basically went from not being able to do anything to, okay, now this thing can do well. What they did is let it reason for twice as long.
Okay, that's not the only drop. They also launched another distillation, and this is kind of the interesting one; not many people really talked about this at all on Twitter. If we remember, in the original DeepSeek paper, outside of their original DeepSeek model, they did a set of distillation models: they distilled the Qwen series and the Llama series, and they showed that by distilling from the big model on these reasoning traces, we can get really good performance on small models. Well, they did it again. They took Qwen3 8B and did another distillation with their new reasoning model, and the new distillation basically kills the old one. When you look at their old distill versus their new distill, they get another 10% performance boost just by distilling from a better reasoning model. So in a few months they were able to get DeepSeek to do more chain of thought, more reasoning, use that to distill down to an 8B, and now we have an even better 8B. Very interesting little note: not only do we get a 10% improvement over the last distill, their 8B distillation is actually matching the performance of Qwen3's 235B (22B active) thinking model. Horrible naming, I know, but let's take a second to think about that. A Qwen3 8B dense model, a small 8B model, is matching the performance of the Qwen3 235B thinking model. And this is not a native thinking model; this is an 8-billion-parameter model trained just on distillation. So with logit-matching distillation from a big model, we're matching the performance of their 235B thinking model. That's pretty wild. They didn't do this for the 32B or the 70B, but yeah, it's pretty crazy: an under-discussed release where the new 8B does very, very well. How do
we see this? Chain-of-thought improvements distill down really, really well. This was one of the key findings in the original paper: a better recipe for training small models is to distill down from big models, and reasoning models make this even more effective. And this is just their follow-up; we don't have a paper, we don't have too much on this, but the benchmarks show it, and of course the model is open source, open weights, everything.
everything. So those are kind of the um
those are kind of the overviews. Uh
let's see let's see now let's go over
the actual um the original Deepseek
paper. So that's kind of ending where we
had um the new releases. So we have two
two models. To recap, we have a new
Deepseek version. So for May 2020 uh for
May 28th, we have a new DeepSeek. It's
now at on par with uh OpenAI's 03 and
Gemini 2.5. Uh it's significantly better
and it reasons for twice as long. We
also took that model, we distilled it
down to quen 3 8B and we have a much
much better small 8B reasoning model. Um
this shows that you know reasoning
models distilled down very very
efficiently and there's still a lot of
juice to be squeezed out there. Now um
from here for those that haven't seen
it, we're going to take a two second
pause, see if there's any interesting
questions. There's a Hugging Face link. Question: if these benchmarks are public, won't models be trained to score better on them? Yeah, benchmarks obviously have their cons, their flaws, but there are ways to see which models are overfit on them: how do they do in general use? In general, the DeepSeek models are actually doing very well. So
from there, let's go into the original
DeepSeek model. Okay: DeepSeek V3, hypest paper of the year, 300 people joined us live. Let's do a quick recap. This is basically me reusing my old slides, because I can, but let's talk about what happened in DeepSeek. We're going to kick off with a high-level model overview. What are the models they released? When they released this, it's not just DeepSeek V3; they also have R1. What is inference-time training? What is test-time compute? What makes reasoning models different? If you don't remember, this was the first open model that did test-time scaling well, the first one that got good. OpenAI released o1. Claude thinking came much later; Gemini thinking came much later. We didn't really understand what was happening. We thought there was MCTS; there was a lot of Monte Carlo tree search research. What's going on? Maybe internally it's: let's generate chain of thought, let's train on chain of thought. But it turns out DeepSeek comes out with this paper, they do a great model, and they're like: yo, RL works. So two models were released. DeepSeek R1-Zero: basically, they take a base model, do a lot of GRPO RL, they have training templates and reward models, and they get this emergent capability of reflection, aha moments. Then they do R1, which is basically a four-stage pipeline: cold start, a reasoning RL stage, rejection sampling and SFT, and then of course a little RL round two to get it to really reason. From there we'll talk about performance and evals: how does the original R1 do, then the original distillation models, future work, reproductions, and whatnot. We'll skip over the base DeepSeek evals and performance because we already covered the new one. But okay,
continuing on. A high-level overview for those that don't really follow: we keep hearing this term of test-time scaling. What is test-time scaling, what are thinking models, and what does this mean? Basically, we got to a point where we started to overtrain our models; we basically hit a scaling limit on how much we can train them. Originally, back in the day, we used to follow the Chinchilla scaling laws: we had a fixed compute budget and a fixed amount of data, and we would design a model around that. How many parameters should it be, based on how much data we have? Let's fit a model to our data and to how many GPUs we have, and then train it to be roughly Chinchilla optimal.
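To make the Chinchilla idea concrete, here's a tiny back-of-the-envelope sketch. It leans on two commonly cited approximations, training compute C ≈ 6·N·D and a compute-optimal budget of roughly 20 tokens per parameter; the specific numbers below are illustrative, not from the talk.

```python
# Back-of-the-envelope Chinchilla math (illustrative numbers):
# rule of thumb: compute-optimal training uses ~20 tokens per parameter,
# and training FLOPs are approximately C = 6 * N * D.

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Token budget for a roughly compute-optimal run (~20 tokens/param)."""
    return 20 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

n = 70e9                                   # a 70B-parameter model
d_opt = chinchilla_optimal_tokens(n)       # ~1.4e12 tokens
print(f"Chinchilla-optimal tokens: {d_opt:.2e}")
print(f"FLOPs at the optimum:      {train_flops(n, d_opt):.2e}")
print(f"FLOPs at 15e12 tokens:     {train_flops(n, 15e12):.2e}")  # Llama-3-style overtraining
```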
From there we realized that this isn't really what we want, and we started really scaling up our training. So we had stuff like Mistral, Llama 1, Llama 2, Llama 3. We started training these models from billions of tokens to trillions of tokens: originally something like one trillion tokens, then Llama 3 at 15 trillion tokens, and now they're up to around 45 trillion tokens. What shifted was: instead of training to be Chinchilla optimal, let's train in a sort of inference-optimal regime. Instead of thinking only about the compute we have now, let's think about inference time; as we scale this, we want a model that's as densely packed, as smart, as possible.
Llama is like: okay, basically, if we continue training, we don't really see degradation. But the problem is it gets very expensive; as you train more and more, it's very heavily compute intensive. And how many times can you scale this up? How much data do we have, how much can we really fit in? If we're at 45 trillion tokens for a 70B, can we scale that up 10x again? Are we going to do 450 trillion tokens? What if we want to 10x it again? We're basically hitting the compute ceiling for training. We can't just keep scaling up our training runs because it's no longer cost efficient; we're spending hundreds of millions of dollars on these runs. You can only scale so much. So a lot of the hypothesis was that there's going to be a sort of plateau and open models will start to catch up, because we're all already scaling so much. But that's where we need to unlock another dimension, which in this case was reasoning, or test-time compute. This is where we basically started to get reasoning capabilities without any supervised data. There were approaches that tried: let's generate a bunch of chain-of-thought reasoning-style data and do post-training on it, and yeah, our models do a little better, but this didn't scale. What we wanted was pure RL: can we do pure RL to get reasoning? So this is a quote
from the paper. The DeepSeek team says: our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. So what they do is post-train the DeepSeek V3 base with GRPO, which is basically pure RL, and as they do this they start to notice the emergence of great reasoning, reflection, aha moments, and they start to match o1. Now fast forward a few months, as we can see today: they basically continue this, do a bit more RL on longer traces, and can now get the model to match not o1 but o3, doubling its amount of reasoning tokens. Okay, here's
the four-stage approach to how they train this R1 model. They start with a cold start, where they jump-start with some SFT. Then they do RL for reasoning. Then there's this key step of rejection sampling for generation; rejection sampling is something we'll talk about later. Then they do a fourth stage of basic RL polishing. That's a very high-level overview of what DeepSeek did to make R1.
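Lining that recipe up in a few lines of toy Python makes the flow easier to see. The functions here are made-up stubs so the outline actually runs; this is a sketch of the described pipeline, not DeepSeek's code or API.

```python
# Toy sketch of the four-stage R1 recipe; stub functions stand in for the
# real training steps so the outline runs (not DeepSeek's actual code).

def sft(model: list, data: str) -> list:
    return model + [f"SFT on {data}"]

def grpo_rl(model: list, data: str) -> list:
    return model + [f"GRPO RL on {data}"]

def rejection_sample(model: list, prompts: str) -> str:
    return f"best-ranked completions from {prompts}"

model = ["DeepSeek-V3 base"]
model = sft(model, "cold-start long-CoT examples (a few thousand)")       # stage 1
model = grpo_rl(model, "verifiable math/code questions")                  # stage 2
sft_data = rejection_sample(model, "~800k generations (600k reasoning, 200k chat)")
model = sft(model, sft_data)                                              # stage 3
model = grpo_rl(model, "general helpfulness + reasoning prompts")         # stage 4
print(*model, sep="\n")
```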
So to recap: we needed to shift from next-token predictors to scaling on another axis. Instead of spending the same compute on every generated token, we want to dynamically spend more compute on different queries. So we now train the models with pure RL to reason through their questions. They're trained with RL on verifiable outputs, on a lot of code and math data that can be verified: whether it compiles, whether it's factually correct, whether the math is logically correct. Then we can basically do native RL. Doing this, we notice emergence, reflection, aha moments, and now we have another dimension in which to scale models. Instead of scaling up another order of magnitude, from 45 trillion tokens to 450 trillion tokens and up and up (we're kind of at the limit there), we start with a really good base model that's a good general next-token predictor, we do RL, and now we can scale in the reasoning domain. A few months ago they showed they can get o1-level performance, and fast forward to now, we have o3-level performance with just more RL. Tangentially, this was kind of the paper that kicked it all off. DeepSeek showed two things: one, you can do RL from base models and get reasoning; two, distillation really works and is a much better approach for small models. Following this, we've had a lot more papers. Shout out to the Phi models: Phi-4 showed how effective RL can be. They took Phi-4-mini and made it a reasoning variant; with about 6,000 samples of RL, they could take a small base next-token predictor and get really good reasoning out of it. So this was the one to kick it off, and now the Qwen models have done this too. There are Qwen thinking models, DeepSeek thinking models, and then families like the Phi models that show how to do this in small models. Okay, let's continue
on from my high-level overview. What did they release? Two models early on. There's R1-Zero, which is a great reasoning model trained only with RL on chain of thought, but it's not a great general model; when you only do RL on chain of thought, it doesn't really generalize to overall performance. R1 is the second model: it uses outputs of R1-Zero in a four-stage training pipeline, then does RL back on human tasks, back on chat, and we get a good general reasoning model. The second thing they released is their distillations: they take these thinking traces and models in the same families, and distill them down. These are not natively trained with RL; they're distillations onto base models in the Qwen and Llama families, and they show this works really well. Okay.
Now, of course, it's 2025; we don't get real papers anymore. They don't talk about data, how many tokens, where the data comes from, but they still share a lot about how this stuff works. The models are fully open source, MIT license, but no training data and no code. They have a DeepSeek API which at the time was much faster and much cheaper than anyone. Turns out that was fake news: the API is very unreliable and barely ever up. But a few months after this came out, we can see from OpenRouter that DeepSeek actually takes a significant chunk of the pie, about 10% of the API usage that goes through them. So this model sparked a lot of adoption. It reopened the race: okay, we're not done scaling, models are still getting a lot better. And once again, it's been a few months, and as of last week we have another update to this, and the Qwen team followed along. What was fake news? The DeepSeek API. When it launched, it was 10x faster and 10x cheaper than other inference providers, but it turned out to be super unreliable: no one could make API keys, it was almost always down. But it's okay, it exists if you still want to try it; it's a really good model and a lot of people use it. Okay, let's dig deep into these
topics. Inference-time scaling. What is inference-time scaling? We have o1 versus GPT-4o; now there's o3, o4-mini, all these things. Basically, what these do is increase the chain-of-thought reasoning process: you allow models to spend more time thinking before they respond. Now it's very common; we've all used a lot of these models, and they're starting to become a little more agentic with RL, which is what's progressed since we last covered this. Pre-training LLMs starts to get exponentially more expensive: instead of going from hundreds of millions of dollars to pre-train a Llama, we don't want to spend billions on training runs. So we needed another axis. We shifted to this paradigm of inference-time scaling, where you take really hard questions, do inference-time chain-of-thought RL, and you have a new dimension on which to scale. Previously, people tried process-based reward modeling: we would try RL, we would do beam search, we would do MCTS, all these inference-time tricks of predicting multiple tokens and going down these processes. They were all hacks, but nothing was really close to o1. We would see Twitter demos, we even saw fundraising off of: okay, I have much better performance than Llama 3 because I go down ten trees of thought and do all this stuff on the backend, using a bunch of tokens and gluing it together. But this was not the right approach, as we now see. What really worked is just native, pure, beautiful, scaled-up RL. So once again,
this is what DeepSeek did to make V3. Here's the one slide if you want to know what DeepSeek V3 is: it's open-source GPT-4o. This is DeepSeek V3, the precursor to R1. So, 4o quality, 37 billion active parameters. This is the regular non-reasoning model, what they built R1 off of. It's 671 billion parameters, 37 billion active. They launched it, and it was a good model, just really chunky: you can't run it on your laptop; no one can run a 700B model. They made this whole claim of, we're better than everyone else, we could train this model for $5 million. And I think this is actually true. Looking back at it, a lot of what we see is that DeepSeek and the Chinese labs were really able to catch up to the United States because of the constraints: we put a lot of trade restrictions on them, we couldn't give them GPUs, so they had to get clever and smart with what they had. They did very strong inference optimization and made the most of what they had. We could have continued scaling; we could have turned 14.8 trillion tokens into 150 trillion tokens, but China didn't have the GPUs for that. The DeepSeek labs realized they couldn't scale in the same dimension, so they had to get creative and think: okay, what if we do RL? And that's basically what they did. So V3 was an MoE with 37 billion active parameters. They introduced this concept of multi-head latent attention (MLA). 15 trillion tokens. They did SFT and then traditional RL, so RLHF, to make it a chat model. Multi-token prediction, which came out of Meta: they needed to be sample efficient with their 15 trillion tokens, so multi-token prediction for a bit more sample efficiency. Trained in FP8. Did some long-context extension: basically first trained at 32K, then extended up to 128K. It came out about a month before R1; after that, R1 came out, people got mad hyped, and now we have the R1 update, basically. These are the fancy diagrams: we have the DeepSeek V3 base, we do SFT and RL, you have an SFT checkpoint, then fine-tune with RL to get DeepSeek R1. Okay. What is DeepSeek R1-Zero?
R1-Zero is where you don't do any SFT. You take a pure base model. For those that don't remember, base models are what comes out of pre-training: we train models to predict the next token. These are not models like GPT-4o or o1; these are pure base models. All they do is predict the next token, so they're completion models: you can't normally chat with them, they just complete your sentence, complete your word. So they take the DeepSeek V3 base model and apply pure RL. They don't do any SFT; they don't train it as a user-assistant chat model; they don't do any of that. It uses GRPO for RL, which they actually introduced quite a while back in DeepSeekMath. The reward is
based on both accuracy and format; responses must be verifiably correct. So what are they doing here? They take a base, non-chat model and use GRPO-style RL on math and code with verifiable outputs: the model's output needs to be verifiably correct. For math, you have a correct answer to your math question; is the model's answer correct or not? If it's correct, that's good, we can RL on that. For code, does your code compile? If it compiles, that's good, we can RL on that. These are LeetCode-style questions, so we know the answer and can check whether the model got it right. Then, in the minute details, they also format the rewards: the model has to output its thinking between think tags. So we take our base model, do RL to get it producing a long chain-of-thought thinking process, and from there we have DeepSeek R1-Zero.
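To make that reward concrete, here's a minimal sketch of a rule-based accuracy-plus-format check. The tag names and the plain sum of the two terms are my illustrative assumptions based on the paper's description, not DeepSeek's exact implementation.

```python
import re

# Sketch of a verifiable, rule-based reward in the R1-Zero style:
# reward = accuracy (is the final answer right?) + format (did the model
# put its reasoning inside <think> tags?). Tags/weights are assumptions.

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the known-correct one."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == gold_answer.strip() else 0.0

def reward(completion: str, gold_answer: str) -> float:
    return accuracy_reward(completion, gold_answer) + format_reward(completion)

out = "<think>2+2 is 4, double it...</think><answer>8</answer>"
print(reward(out, "8"))  # 2.0: correct and well-formatted
```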
It's a reasoning model that's good at outputting thinking and answers, but it hasn't really been trained to be a useful assistant. It's not o1 yet. It's just a good thinking model that can unravel its thought process to generate answers. We'll probably skip this slide: this is GRPO, the RL algorithm they use, compared against PPO. This is how they do the RL. The key things to note: there's no critic model, you have group-based rewards that are scored against each other, and it has stability pieces like a KL divergence term. But yeah, they do GRPO; we're going to skip the full details in this talk.
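For the curious, the core trick fits in a few lines: sample a group of completions per prompt and use the group-normalized reward as the advantage, so no critic model is needed. This is a toy illustration of the concept, not DeepSeek's implementation.

```python
import numpy as np

# Toy sketch of GRPO's core trick: per-prompt groups of completions are
# scored with the verifiable reward, and each completion's advantage is its
# group-normalized score. The group mean serves as the baseline, so no
# critic (value) model is needed.

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Advantage of each completion relative to its group: (r - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. 8 completions sampled for one math prompt, scored 1.0 if the final
# answer was verifiably correct (and formatted properly), else 0.0
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))  # correct ones positive, wrong ones negative

# The policy-gradient loss then upweights tokens of high-advantage completions,
# with a clipped ratio (PPO-style) and a KL penalty to a reference model for stability.
```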
So, okay, DeepSeek R1-Zero: we took a base model, we did RL, we now have a thinking model. How does it perform? Pretty good. On AIME it passes o1-mini for the time; on MATH-500 it passes o1-mini. It's on par with but slightly worse than o1. And the charts show that we're able to do this inference-time scaling by doing RL on really hard questions, and it works pretty well. Number goes up, and goes up even more the more you train; this thing is stably learning.
What else do we learn? The model naturally develops the ability to solve complex tasks by extending test-time compute. The original R1-Zero ranged from hundreds to thousands of reasoning tokens, and it turns out this was a pretty key factor: in the update DeepSeek released last week, they doubled it, going from 12,000 tokens on average to about 24,000 tokens, and once again we shift from o1-level to o3-level performance in the new model. Other stuff: there's an emergence of interesting behaviors as test-time compute increases. As we can reason for more and more time, we start to see these interesting behaviors emerge.
Basically, as models learn to reason for longer and longer, as you get to thousands of tokens of reasoning, they start to have these reflection moments. The more reasoning you do, models start to learn: okay, I'm actually not forced to just output my next token; I'm not forced to do 100 tokens of thinking, or a thousand; I can continue down this thinking path. And we notice that as they reason for longer, models start to have this sort of reflection phase: they revisit and re-evaluate their previous steps, they go down alternative paths, and this emerges spontaneously. Models will be like: okay, I tried this, this, and this, it's not working, I don't have to answer right now, let me try this new thing, and guess what, it works. And then we also get these aha moments. A very core takeaway of the paper: this is what got the DeepSeek authors to realize that RL actually works. The more time the model thinks, the further along the thinking traces, the more these models have aha moments.
Here's a quote again from the paper: this moment is not only an aha moment for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The aha moment serves as a powerful reminder of the potential of reinforcement learning to unlock new levels of intelligence in AI systems. So basically, instead of telling models how to reason, instead of doing SFT on chain-of-thought traces, if we just give it this sparse signal of "reason as much as you want," then the more time it takes, we start to notice the emergence of: aha, I see what it should finally be; I tried these six steps and this seventh step is working. This shows the power of RL, and it has been passed down into other papers. Microsoft has put out really good scaling laws on RL and distillation for small models. The Qwen team has done really good thinking models; they have a talk online, so please watch the Qwen reasoning talk, where a speaker covers how they do that. But okay, back to DeepSeek. This is an example of an aha
Seek. Um this is an example of an aha
moment. So question uh basic math
question you know not not that basic
actually. I can't answer this. If a is
greater than one, then the sum of this
square root of a square root is equal to
something. Well, as the model starts to
think, you know, it realizes, oh,
there's a x squar, right? If I square
this, I can get x squared. Then I can
isolate this out, right? It's doing
reasoning. It's thinking through this
math problem step by step. Then it says
this in its own generation. Wait, wait,
wait. That's an aha moment. I can flag
here. Let's re-evaluate this. Oh, okay.
instead of you know actually solving the
square root. I can do this square both
side and like you know very interesting
aha moments start to pop up. So yeah um
that's an overview of what R1-Zero was: we take a base model and train it with pure RL on math and code. We do RL on thinking steps, and now we have a reasoning model. It works, but it's not great: it doesn't have great readability, and it starts to reason in multiple languages. Whoa, it reasoned in Chinese, who would have guessed? So we can't just skip RLHF; we want to make this thing a good chat model. Next section: instead of R1-Zero, how do we make DeepSeek R1? How do we take it from being just a reasoning model to a useful reasoning assistant that's actually good at being a full chat assistant? Key solution, giving it to
you straight: cold start. Instead of taking a base model straight to RL, take a base model and do some regular SFT first. SFT is: here's a prompt-answer pair, here's a chat assistant; do a cold start, get it to understand it's still a useful assistant. After that, do the core RL on very hard code and math; get it to start understanding how to think and reason; scale it out until it starts to see these aha moments, until it can do proper reasoning. From there, do some rejection sampling: we want to filter out examples that don't work, where it's going wrong, where it has undesirable behavior. And then once again, our last stage of training, stage four, which is another round of RL. Going through these:
Stage one, cold start with strong SFT, prevents the model from becoming unstable. You take a base model and a long chain-of-thought-style dataset: prompt-answer pairs, user and assistant, think through your process of how to solve this, give your chain of thought. We train the base model on this. For the cold start they don't just use synthetic data; they have human annotators, because we want better readability: do your thinking in think tags. This is on the scale of a couple thousand examples of SFT. Then our main RL stage: the same RL process as R1-Zero, doing a lot of RL on verifiable hard questions, math and LeetCode-style coding that we can verify has the right answer. And then, you
know, stage three: rejection sampling. Generate completions, rank them with a reward model, fine-tune the original model. This was fairly standard: Llama 3 showed us this concept of rejection sampling, and DeepSeek takes it up. We reject the samples we don't like. There were 800,000 generated samples, 600,000 reasoning and 200,000 general chat; we rank them and then do the rejection-sampling fine-tune.
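As a toy illustration of that mechanic, here's a sketch where random stand-ins play the role of the model and the reward model; only the shape of the loop is meant to match, nothing here is DeepSeek's actual code.

```python
import random

# Toy rejection sampling: sample several completions per prompt, score them,
# keep only the best, and collect the keepers as SFT data for the next fine-tune.

def generate(prompt: str, n: int = 8) -> list[str]:
    # stand-in for sampling n completions from the model
    return [f"{prompt} -> completion {i}" for i in range(n)]

def reward_model(completion: str) -> float:
    # stand-in for a learned reward model scoring a completion
    return random.random()

def rejection_sample(prompts: list[str], keep_top: int = 1) -> list[str]:
    dataset = []
    for p in prompts:
        completions = generate(p)
        completions.sort(key=reward_model, reverse=True)
        dataset.extend(completions[:keep_top])  # reject everything else
    return dataset

sft_data = rejection_sample(["prove 1+1=2", "reverse a linked list"])
print(sft_data)  # ~800k samples at DeepSeek scale; the model is then fine-tuned on these
```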
Then the final RL stage, for general use. This is very similar to RLHF: you want to make the model helpful, harmless, and good at reasoning. For reasoning, we want to keep it a good reasoner, keeping that reasoning for hard math and code questions; for general chat, capture human preference and nuanced situations. So: this question should have a verbose, detailed answer; this one is a basic summary, keep it short, but keep your reasoning. After this final stage of RL, guess what: the model is good at being a chat model, good at thinking, has the emergent aha behaviors, and is just a good chat model. Performance and evals: we're going to skip this. Basically, when DeepSeek launched it was o1-level, but forget that;
they launched an update two weeks ago. The new DeepSeek R1 launched May 28th, and instead of being o1-level, it now reasons for twice as long. It's as good as o3 and Gemini 2.5 Pro: much better at math, much better at coding, much better at reasoning. It now supports native function calling and JSON outputs, and no longer hallucinates as much. Second model drop, and of course performance charts, all the regular benchmarks you would expect; the model is performing as well as o3 and Gemini 2.5. All that was done with more RL: AIME jumped 17.5 points, and they doubled the reasoning tokens. We doubled our reasoning axis, we now reason for twice as long, so double the reasoning effort, and now we're at o3 and Gemini 2.5 level. The other drop: a new distillation. We take the new model that reasons for twice as long, distill it down to Qwen3 8B, apply a distillation loss on this reasoning, and get performance matching the Qwen 235B reasoning model. Our dense 8B model, distilled from the new DeepSeek model, is as good as Qwen3 235B, which is an MoE reasoning model. Pretty crazy. The implication is that long, detailed, good reasoning really has a deep impact. Once again, check out the Microsoft work for good distillation scaling laws on this. Okay, back
to our paper. Instead of looking at the original DeepSeek, that's the performance of where we're at. Distillation: let's talk about these distillation models. What they did was distill R1 down into Llama and Qwen models. This is not RL; this is basic SFT. We have these models that reason for 25-30,000 tokens; we take those traces and do SFT-style distillation. So proper distillation: match your logits.
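Mechanically, the published distills look like plain SFT on teacher-generated traces: next-token cross-entropy on R1's outputs. Here's a minimal sketch with Hugging Face-style APIs; the model name and the single hand-written trace are placeholders for illustration, not the actual training setup.

```python
# Minimal sketch of SFT-style distillation on reasoning traces (toy example;
# the student model name and the trace below are illustrative placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-7B"  # example student; R1 distilled several Qwen/Llama sizes
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)

# One teacher-generated trace: prompt + <think>...</think> reasoning + answer.
# At DeepSeek scale this is hundreds of thousands of such samples from R1.
trace = (
    "Q: What is 12 * 13?\n"
    "<think>12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156</think>\n"
    "<answer>156</answer>"
)

batch = tok(trace, return_tensors="pt")
# Plain next-token cross-entropy on the teacher's tokens: the student learns
# to imitate the long chain of thought; no RL involved.
out = student(**batch, labels=batch["input_ids"])
out.loss.backward()  # one step; in practice wrap in a normal SFT training loop
```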
And guess what? This gave so much performance, but we know RL can do better. It's all open source, all the traces are open: someone should do RL-based distillation from the big models. No one has done this as far as I know, but we're able to get such good performance, and there's still so much left to be done. Let's go over what they distilled. These are the family of models, once again pure SFT-style distillation. Oh, my slides are gone. They're back. Okay, so they distilled Qwen 1.5B, Qwen 7B, 14B, 32B, Llama 8B, 70B. The performance killed the original models themselves. So
you take the model itself, you look at the distillation, and it worked: numbers went up like crazy, really good performance. All the distills are basically on par with GPT-4o, and the new 8B distill is much better, way better than 4o. And this is, once again, just RL and long chain of thought, then SFT-style distillation on top. Question: what if we just did RL on the base model? They tried it: RL on Qwen 32B for 10k steps is actually worse than distillation. So for small models, we don't want to start with native RL. We saw this with R1 itself: going from R1-Zero to the actual R1 needed the SFT cold start. So pure RL performed significantly worse than distillation. And of course, the Qwen team at the time had their own reasoning model, QwQ-32B. The R1 distill of the base model did a bit worse than what Qwen did, but roughly on par; the distillation onto the chat model actually performed much better, doing better than the Qwen reasoning model. And of course, they've now taken this a step further with the new model. Okay,
future work. R1 is worse than V3 at function calling, multi-turn, and JSON mode; guess what, two weeks ago there's a new DeepSeek model that does native function calling and JSON mode. They fixed it, it works. R1 struggles with language mixing; we don't know how the new one does. It's sensitive to prompting, which I think is something the industry needs to figure out. Spoiler alert for some Latent Space podcast fans: we talked to some reasoning experts, and researchers at OpenAI are saying that if you're still doing scaffolding with reasoning models, we're failing as labs. So we need to learn how to prompt these things better. And "it's not much better at engineering tasks than V3": that's old news from the old DeepSeek; the new DeepSeek is a lot better. Open recreations: we want to
promote open research, and there were people trying to recreate this. Hugging Face has a version of this; Bespoke Labs was doing this, I don't know how they're doing; there are quite a few people now that have done it. But yeah, that's the overview. I know a lot more people have joined; we now have 100-200 people in the audience, so I'm going to do a quick recap in our last 10 minutes.
First, we are launching a second paper club. Every week we do our normal paper club where we take the latest paper; we have a hundred people join every week, 300 for DeepSeek. We're adding a Test of Time paper club; if you're interested, sign up. We're going to run it in SF and remotely over the next six months, covering 50 to 100 papers. We're going to break up what you need to know as an AI engineer into different buckets: foundations of deep learning (attention, RL, optimizers, Adam, gradient descent), foundation models you should know about (GPT-2, BERT, RNNs, LSTMs), pre-training, post-training, and mid-training (scaling laws, Chinchilla, distillation), and we'll cover days of diffusion, optimization, voice, and fine-tuning. Basically, a paper club where every week we split these core concepts into a few papers. We'll have a presentation of three to four papers; everyone is welcome to join. We'll have a presentation on every core concept and then open discussion. This is not a course. Courses are good: you have active workshops, you build stuff, you do active learning. This is still a paper club. This is if you want to know the foundations of what's going on under the hood; these are the key papers to know. We'll invite a lot of speakers, have people present, have good discussions. But yeah, Test of Time paper club coming soon. Scan the QR code, let us know if you want to be involved, recommend a paper, share a paper. We'll share the curriculum soon. Join the Latent Space Discord.
We already have a list of top 2025 papers, a paper every week that you can go through; we're going to build off of that. And once again, the final recap. What did we talk about today? We talked about the new DeepSeek model. Two weeks ago, DeepSeek R1 May 28th came out. Basically, DeepSeek took the last model, which was as good as o1, and trained it to reason for twice as long, getting significantly better performance. DeepSeek R1 May 28th can now do structured JSON output and native function calling, hallucinates less, reasons for twice as long, and is a big jump in performance from being o1-level. DeepSeek is now on par with OpenAI's o3 model and Gemini 2.5, basically across the board on all benchmarks. The other model released is DeepSeek R1 Qwen3 8B: they took Qwen3 8B, distilled it into a reasoning model based on the longer traces, and killed it. It's a small 8B with post-training via distillation, SFT on reasoning traces, and the model got really good. Compare base Qwen3 8B to the R1 Qwen3 8B, and it's very good. Looking at the benchmarks: this open-source, on-device-runnable Qwen3 8B distill is on par with Gemini 2.5 Flash Thinking and o3-mini, better than Phi-4, significantly better than base Qwen3 8B. The 8B reasoner is better than Qwen 32B and on par with Qwen's 235B reasoning model. So, two major updates: a new reasoning model from DeepSeek as good as o3 and Gemini 2.5, and a new mini 8B model as good as 2.5 Flash Thinking and o3-mini, all open source, runs on your laptop. And these are not even R2; this is not DeepSeek R2, this is a minor title refresh with a new date. So, high level, that's what we talked about: we see aha moments; instead of training on next-token prediction, we scale out to inference-time scaling; we now train models to think for longer, and we get aha moments. That's the update on the new DeepSeek models. So yeah, thanks everyone for coming, thanks for listening, and join paper club.
[Applause]
Who is this? Oh yeah, a lot of our regulars that help make paper club happen are here: Eugene, Era, R.J., Flo is here. If anyone else is a regular in paper club, come up. Every week we have our weekly paper club; these are the homies that make it possible. We're going to have our second paper club, Test of Time; we'll be there soon. But yeah, major shout out. This is not me, this is not Swyx; this is volunteers, and more on Zoom. Every week on a weekday at noon, hundreds of you join in to discuss a paper. Big shout out to everyone here; it's not possible without them, and all the authors as well that have come and been able to share. Let's give it up for everyone that makes paper club possible.
Eugene Yan, who is over there running his track. Sam, I know, is speaking right now. Swyx, who is putting all this on. Once again, I'll leave our QR code here: if you're interested in Test of Time, volunteering for a paper, or recommending a paper, fill it out and help us out. This is our paper club selfie. But yeah, thanks for coming out everyone, and enjoy the rest of the conference.
[Music]