Latent Space Paper Club: AIEWF Special Edition (Test of Time, DeepSeek R1/V3) — Vibhu Sapra
Channel: aiDotEngineer
Published at: 2025-07-25
YouTube video id: 9k3xPh-40mo
Source: https://www.youtube.com/watch?v=9k3xPh-40mo
[Music] Okay, so Paper Club year in review. We've gone over a year, like a year and a half, with no missed weeks. We've always done Paper Club, and it's pretty interesting. I don't think any of us expected it to get this far, but every week for the past year and a half, we have always done a paper club, Wednesday at noon. We've had a bunch of authors come share their work: people from Nvidia, Meta, Allen AI, Amazon, Together, Writer. A bunch of authors come, we get direct one-hour sessions with them, they share their work, we get nice feedback. And some of these have started to do pretty well. On average, we get about 100 people in every session. Just for context: that's a Wednesday workday at noon, and 100 people join to discuss a random paper. For some of the big ones — DeepSeek V3 had 300 live people sitting there listening to me yap about DeepSeek; of course all the other speakers do great. This is all built on volunteers, and along the way we made many, many friends. Paper club went much further than we expected.

Now, the launch. We have to ship something — we basically have to launch at World's Fair. So we're launching our Test of Time paper club. This is going to be a V2, a second paper club. The one we do now, Wednesday at noon, is still sticking around. It will still be the same thing: every week we'll have a paper, something that's trending, cover it, have an author, have our Q&A session — 30 minutes of paper presentation, whether that's highlights or slides — and then continue with some discussion.

But the Test of Time paper club is going to be a little different. It's more curriculum-based. Basically, we take this idea of: what is everything you would need to know to be a good AI engineer? What are the themes? What are the core papers that don't change? Stuff like: what is attention? How does sequential text generation work — what's going on in GPT-2? How do optimizers work? What about the key inference techniques, like speculative decoding and FlashAttention? Stable Diffusion, Whisper — the key papers that are the foundation of what's being built. We're going to group those together, and session by session we'll go over them.

We're going to kick off in July and run till December. That leaves us six months; six months is about four weeks a month, so 24 weeks. Every week we plan to go through two to four papers with prepared presentations, so it has a bit more structure. In six months we can get through 50 to 100 papers and cover the core concepts, and every week will be different. One week we might be talking about Whisper, and in that one week you can get the fundamentals of speech — speech-to-text, text-to-speech, all that stuff. Another day it might be something like image generation, so we'll go over CLIP and Stable Diffusion, how all this stuff works, and extend it out to video. Basically, we'll segment the core topics we need and cover them every week.

The exciting announcement here is that we're also, for the first time, having an SF section. So we're going to do in-person SF paper clubs and a remote section. I am going to mute someone real quick. We've got 40 of you; someone needs to be muted. We've got crazy echo. Oh, crazy. I'll just mute myself.
Easy. I'll mute my own laptop. Okay, continuing on. If someone needs to interrupt, just say it in the Zoom chat, because I don't have my speakers going.

So we'll have an in-person session in San Francisco every week, and of course we'll keep the remote thing going. The original paper club will still be its own thing — every week something comes out, we'll cover it — and we won't really deviate too much on the schedule for Test of Time: it's going to be foundational papers plus some blogs.

So what are the topics? This is still up for us all to decide, so I'm just listing a few here. We have foundations of deep learning: attention, optimization, ReLU, gradient descent, what basic RL is, how these things work — so we'll have a section on foundations of deep learning. Another one is LLM foundations: the pre-LLM era, so RNNs, LSTMs, bidirectional RNNs, all this stuff, and then the very foundational models, like BERT from Google, GPT-2, stuff like that. We'll have a day of everything pre-LLM and go over those. Then after that, we'll have the actual generative LLMs. I'm sure some are missing here, but Llama 3, DeepSeek — the core LLMs you would expect to hear about — we'll go over those.

We'll have a day of pre-training and post-training. What are the scaling-law papers? What is Chinchilla? How does distillation work? What are the key papers you would want to know for training? Once again, this will be a one-to-two-week session: in one week we'll go over three to four papers, and we'll have someone present, so come prepared. We'll go over the fundamentals of scaling laws and distillation: what Chinchilla scaling is, what overtraining is, what the Llama scaling laws are, what small-model scaling laws look like — stuff the Phi team has really dug into for small language models. How should we do proper RL? What does post-training look like for small models versus big models? We'll have someone cover all of these, and that'll be a one-to-two-week session.

We'll have generative models: CLIP, Sora, Segment Anything, diffusion, some of the key agent papers. Fine-tuning: we'll go over LoRA, QLoRA, DPO, RL, GRPO. Voice: we'll have Whisper. An optimization stage: speculative decoding, FlashAttention. We'll have eval tracks and RecSys — Eugene Yan, who hosts our original paper club, will fill in good ones there. Much, much more.

So yeah, these categories are still up for debate. I have a form here — fill it out later with anything you would want to add, any papers you recommend, any topics. It's very straightforward. Also join our Discord: we have the paper club channel there, so join that and add in topics, anything you want to cover, or volunteer to be a speaker. Over the next week I'll flesh out a rough schedule, and from there we'll take it and start covering it, making sure every session has speakers. We'll figure out logistics: SF will have a venue (we have a few in mind), and remote will be the same Zoom thing. We want to branch this out and get people interested. We'll obviously still bring in speakers — some of these key authors we know, so we'll invite them to speak on these papers — and it'll be good. We'll have discussion sessions.
These sessions might be a little more than an hour. Since we now have three to four papers, we still want to go deep, right? We don't want to just do a TL;DR of the paper; we want to cover the fundamentals and then go a little deeper. So stay in touch on Discord — very basic Google form — but yeah, now we'll have curriculum-based stuff. Flo here might talk about music generation, Eugene will talk about state-space stuff: the same key members, just a second paper club.

The other part of this is: if you have an area of interest, this will be your go-to session. Here are the five to ten papers, here's a presentation on them, here's the discussion — and it will stay up on YouTube, so you can sit in at any time and get caught up. You'll go from the fundamentals to what you need to know in every session, and of course we'll have the same lively discussion. Okay, Test of Time paper club — very, very hype.

But today is still paper club. We can't not do a paper. And this is just a Mochi picture, because everyone at the conference is having dogs in their slides, so I need them too. We can't just have an old paper club back here; we need our OG. So today's paper is going to be DeepSeek. DeepSeek was obviously a popular paper. I know a lot of people haven't had a chance to actually go through it, and frankly I didn't have much time to prep, so I get to reuse my slides — I'm smart like that. This is also being recorded for the broader AI Engineer conference workshops and speakers. That's another reason we do paper club: we get these discussions on papers out, and then people can find them later. Like our original DeepSeek paper reading — that was just a last-minute "let's highlight the paper, make some basic slides, have some discussion," but guess what: 300 people joined live, and there are over a thousand views on a YouTube video of us just reading through a paper. It's a pretty key paper, one of the big open-source models that marked a big transition. So let's go over it again.

And of course there is new stuff. As of this week — or last week — we have DeepSeek R1-0528, the May 28th update. Now, there had been rumors that DeepSeek R2 is coming out, it's coming out, and people started launching stuff. Guess what: they didn't do it. They updated the same model. They called it a minor update, but it's actually not that small of an update. So let's dig into what it is. It's not branded as a revision 2 — it's the same naming scheme — but it's significantly better, actually. Simon Willison, in his keynote, mentioned how we don't have good naming for models; this is quite a step up, but they kept the name. Also plugging Simon's talk: he gave a keynote on the past six months — what were the 30-odd models — and he launched Pelican bench, his own benchmark of how he judges how good foundation models are. He needs to check the new DeepSeek model; I don't think he has.

But basically, here's what they did: better post-training on DeepSeek V3. And guess what — it got better. They put out very little information, but when you dig in, you see that one of the key things is it's much better at reasoning. The AIME 2025 score went from 70% to 87.5%.
That basically means — this was already a good reasoning benchmark — we now have a model matching o3 and Gemini 2.5 level performance on math, coding, and reasoning, which is quite a step up. It used to be "okay, we have o1-level intelligence in open source." Turns out all we needed was a little more training, and now we have o3 and Gemini 2.5 level intelligence. And this isn't even a new model from them; this is just "let's do a little better post-training, let's do a little more," and we get significant performance increases. Pretty wild, right? Like a 17.5-point improvement on a benchmark, and no one's really talking about this. DeepSeek got a lot better.

One of the things they say is that the original DeepSeek R1 would take 12,000 tokens on average to reason through an AIME problem. They did more RL, and now it reasons for roughly 23,000 tokens on average. So they got the model to do about double the reasoning. One thing we talk a lot about is scaling laws: before, we would do optimal scaling for base models; then with Llama we kind of overtrained for inference time — really overtrain so we can do fast, cheap inference. And now, in this world of test-time compute, where we spend more inference-time compute, we can also scale in that dimension. The original DeepSeek R1 would reason for 12,000 tokens on average; the new model can double that. So in this domain we've doubled the amount of reasoning it can do, and a lot of benchmarks went up: 17.5 points on AIME, and it's a lot better at coding. They intentionally targeted better JSON output, function calling, and more reasoning — and yeah, they just dropped it like that.

Here's our benchmark chart. On most paper clubs we don't do benchmarks, but it's actually up there with o3 and Gemini 2.5. The darkest color is the new revision of DeepSeek R1, and it's very good on most benchmarks — significantly better than the original DeepSeek R1. You can see this on Humanity's Last Exam: it basically went from not being able to do anything to doing well. What they did is let the model reason for twice as long.

Okay, that's not the only drop. They also launched another distillation, and this is the interesting one — not many people talked about this at all on Twitter. If we remember the original DeepSeek paper, outside of the main DeepSeek model, they released a set of distillation models: they distilled into the Qwen series and the Llama series and showed that by distilling from the big model on these reasoning traces, you can get really, really good performance on small models. Well, they did it again. They took Qwen 3 8B and did another distillation with their new reasoning model, and the new distillation basically kills the old one. When you look at their old distill versus their new distill, they get another 10% performance boost just by distilling from a better reasoning model. So in a few months they were able to get DeepSeek to do more chain of thought, more reasoning, use that to distill down to an 8B, and now we have an even better 8B.
So, a very interesting little note: not only do we get a 10% improvement over the last distill, their 8B distillation is actually matching the performance of Qwen 3's 235B (22B active) thinking model. Horrible naming, I know, but let's take a second to think about that. A Qwen 3 8B dense model — a small 8B model — is matching the performance of the Qwen 3 235B thinking model. And this is not a native thinking model; this is an 8-billion-parameter model trained purely by distillation. So logit-matching distillation from a big model is matching the performance of their 235B thinking model. That's pretty wild. They didn't do this for the 32B or the 70B, but yeah, it's pretty crazy — an under-discussed release where the new 8B does very, very well.

How do we read this? Chain-of-thought improvements distill down really, really well. This was one of the key findings in the original paper: a better recipe for training small models is to distill down from big models, and reasoning models make this even more efficient. And this is just their follow-up — we don't have a paper, we don't have much detail, but the benchmarks show it, and of course the model is open source, open weights, everything.

So those are the overviews. Now let's go over the actual original DeepSeek paper — that's where the new releases end. To recap, we have two new models. For May 28th, we have a new DeepSeek R1. It's now on par with OpenAI's o3 and Gemini 2.5, it's significantly better, and it reasons for twice as long. They also took that model and distilled it down to Qwen 3 8B, giving us a much better small 8B reasoning model. This shows that reasoning models distill down very efficiently and there's still a lot of juice to be squeezed there.

Now, for those that haven't seen it, we're going to take a two-second pause and see if there are any interesting questions. There's a Hugging Face link. "If these benchmarks are public, won't models be trained to score better on these benchmarks?" Yeah, benchmarks obviously have their flaws, but there are ways to see which models are overfit on them — how do they do on general performance? In general, the DeepSeek models are actually doing very well.

From there, let's go into the original DeepSeek paper. Okay: DeepSeek, hypest paper of the year — 300 people joined us live — let's do a quick recap. This is basically me reusing my old slides because I can. But let's talk about what happened in DeepSeek. We're going to kick off with a high-level model overview: what are the models they released? When they released this, it wasn't just DeepSeek V3; they also have R1. What is inference-time scaling? What is test-time compute? What makes reasoning models different? If you don't remember, this was the first open model that got test-time scaling right. OpenAI had released o1; Claude thinking came much later; Gemini thinking came much later. We didn't really understand what was happening. We thought there was MCTS; there was a lot of Monte Carlo-style research going on; there was "internally generate chain of thought, train it to do chain of thought."
But it turns out DeepSeek comes out with this paper, they make a great model, and they're like: yo, RL works. Two models were released. DeepSeek R1-Zero: they take a base model, do a lot of GRPO RL, with training templates and reward models, and they get this emergent capability of reflection, aha moments. Then they do R1, which is basically a four-stage pipeline: a cold start, a reasoning RL stage, rejection sampling with SFT, and then of course the little RL round two to get it to really reason. From there we'll talk about performance and evals — how does the original R1 do, what about the original distillation models, future work, reproductions and whatnot. We'll skip over the base DeepSeek evals and performance, because we already covered the new one.

Okay, continuing on: a high-level overview for those that don't really follow. We keep hearing this term "test-time scaling" — what is test-time scaling, what are thinking models, and what does this mean? Basically, we got to a point where we started to overtrain our models; we basically hit a scaling limit on how much we can train them. Originally, back in the day, we used these Chinchilla scaling laws: we had a fixed compute budget and a fixed dataset, and we would design a model around that — how many parameters should it be based on how much data we have, fit a model to our data and to how many GPUs we have, and train it to be roughly Chinchilla-optimal. From there, we realized this isn't really what we want, and we started really scaling up training. We had Mistral, Llama 1, Llama 2, Llama 3, and we went from training these models on billions of tokens to trillions. Llama 1 was around one trillion tokens; Llama 3 was 15 trillion; now we're up to something like 45 trillion tokens. So we shifted from training to be Chinchilla-optimal to this sort of inference-optimal training regime: instead of thinking about the compute we have now, think about inference time — we want a model that's as densely packed, as smart as possible. Llama's take is that if you continue training, you don't really see degradation. But the problem is it gets very expensive: as you train more and more, it's heavily compute-intensive, and how many times can you scale this up? How much data do we have? If we're at 45 trillion tokens for a 70B, can we scale that 10x again — are we going to do 450 trillion tokens? What if we want to 10x it again? We're basically hitting the compute ceiling for training; we can't keep scaling up training runs because it's no longer cost-efficient. We're spending hundreds of millions of dollars on these runs; you can only scale so much. So a lot of the hypothesis was that there's going to be a plateau and open models will start to catch up, because everyone is already scaling so much. (Quick back-of-envelope math below.)
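To make that scaling-cost point concrete, here's rough arithmetic using the usual rules of thumb — roughly 20 training tokens per parameter for a Chinchilla-optimal budget, and training compute of about 6ND FLOPs. These are textbook approximations layered onto the talk's numbers, not figures from the talk itself:

```latex
\begin{align*}
N &= 70 \times 10^{9}\ \text{parameters (a Llama-3-70B-class model)} \\
D_{\text{Chinchilla}} &\approx 20N = 1.4 \times 10^{12}\ \text{tokens} \\
D_{\text{Llama 3}} &= 15 \times 10^{12}\ \text{tokens} \approx 10\times \text{ the Chinchilla-optimal budget} \\
C &\approx 6ND = 6 \cdot (70 \times 10^{9}) \cdot (15 \times 10^{12}) \approx 6.3 \times 10^{24}\ \text{FLOPs}
\end{align*}
```

Every further 10x in tokens is another 10x in FLOPs, which is why "just train on more" stops being a viable axis.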
But that's where we needed to unlock another dimension, which in this case was reasoning, or test-time compute. This is where we started to get reasoning capabilities without any supervised data. There were approaches that tried "let's generate a bunch of chain-of-thought-style reasoning data and post-train on it," and yeah, models got a little better, but it didn't scale. What we needed was pure RL: can we do pure RL to get reasoning? Here's a quote from the paper — the DeepSeek team says their goal is "to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process." So what they do is post-train the DeepSeek V3 base with GRPO, which is basically pure RL, and as they do this they start to notice the emergence of great reasoning, reflection, aha moments — and they start to match o1. Now fast-forward a few months, as we can see today: they basically continued this. A bit more RL, on longer traces, and the model doesn't just match o1 — it matches o3 — while doubling its reasoning tokens.

Here's the four-step approach to how they train this R1 model. They start with a cold start, where they jump-start with some SFT; then they do RL for reasoning; then there's the key step of rejection sampling for generation (rejection sampling is something we'll talk about later); then, once again, a fourth stage of RL polishing. That's a very high-level overview of what DeepSeek did to make R1.

So, to recap: we needed to shift from next-token predictors to scaling on another axis. Instead of spending the same compute for every token generated, we want to dynamically spend more compute on different queries. So we train models with pure RL to reason through their questions: RL with verifiable outputs, on a lot of code and math data that can be checked — whether it compiles, whether it's factually correct, whether the math is logically correct. And then we can do native RL. Doing this, we notice emergence, reflection, aha moments — and now we have another dimension in which to scale models. Instead of scaling up another order of magnitude — instead of 45 trillion tokens, train on 450 trillion, where we're at the limit — we start with a really good base model that's a good general next-token predictor, we do RL, and now we can scale in the reasoning domain. A few months ago they showed they could get o1-level performance; fast-forward to now, and we have o3-level performance with just more RL.

Tangentially, this was the paper that kicked it all off. DeepSeek showed two things: one, you can do RL from base models and get reasoning; two, distillation really works and is a much better approach for small models. Following this, we've had a lot more papers. Shout out to the Phi models: Phi-4 showed how effective RL can be. They took Phi-4-mini and made it a reasoning variant — with about 6,000 samples of RL, they could take a small base next-token predictor and get really good reasoning inference out of it. So this was the one that kicked it off, and now the Qwen models have done this too.
There are Qwen thinking models, DeepSeek thinking models, and then families like the Phi models that show how to do this in small models. Okay, let's continue with the high-level overview. What did they release? Two models, early on. There's R1-Zero, a great reasoning model trained only on unrolled chain of thought with RL, but not a great general model — R1-Zero is what you get when you only do RL on chain of thought; it doesn't really generalize to overall assistant behavior. So R1 is trained as the second model. It uses outputs of R1-Zero, goes through the four stages of training, and then they do RL back on human tasks, back on chat, and we have a good reasoning model. The second thing they released is the distillations: they take these thinking traces and models in the Qwen and Llama families and distill down. These are not natively trained with RL; they're distillations onto Qwen and Llama models, and they show this works really well.

Now, of course, it's 2025 — we don't get real papers anymore. They don't talk about the data, how many tokens, where the data comes from, but they still share a lot about how this stuff works. The models are fully open source, MIT license — no training data, no code. They have a DeepSeek API, which at the time was much faster and much cheaper than anyone. Turns out that was fake news: the API is very unreliable and barely ever up. But a few months later, we can see from OpenRouter that DeepSeek takes a significant chunk of the pie — about 10% of the API usage that goes through them. So this model sparked a lot of adoption. It reopened the race: we're not done scaling, models are still getting a lot better. And once again, it's been a few months, and as of last week we have another update, and the Qwen team followed along. What was fake news? The DeepSeek API. When it launched it was 10x faster and 10x cheaper than other inference providers, but it was super unreliable — no one could make API keys, it was almost always down. But it's okay; it exists if you still want to try it. It's a really good model, and a lot of people use it.

Okay, let's dig into these topics. Inference-time scaling: what is it? We had o1 versus GPT-4o; now there's o3, o4-mini, all these things. Basically, you increase the chain-of-thought reasoning process: you allow models to spend more time thinking before they respond. It's very common now — we've all used these models, and they're starting to become a bit more agentic with RL; that's what's progressed since we last covered this. Instead of spending hundreds of millions of dollars pre-training LLMs — which starts to get exponentially more expensive; we don't want to go from hundreds of millions to billions on training runs — we needed another axis.
So instead we shifted to this paradigm of inference-time scaling, where you take really hard questions, do inference-time chain-of-thought RL, and get a new dimension on which to scale. Previously, people tried process-based reward modeling. We would do RL with beam search, MCTS, all these inference-time "predict multiple paths and go down these trees" schemes. They were all hacks; nothing was really close to o1. We would see Twitter demos, we even saw fundraising out of "I have much better performance than Llama 3 because I go down ten trees of thought and glue it together with a bunch of tokens on the back end." But as we now see, that wasn't the right approach. What really worked was just native, pure, beautiful, scaled-up RL.

So, once again, here's what DeepSeek did to make V3 — the one slide if you want to know what DeepSeek V3 is. It's the precursor to R1: open-source, GPT-4o quality, 671 billion total parameters, 37 billion active. This is the regular, non-reasoning model — what they build R1 on top of. They launched it, and it was a good model; it's just really chunky. You can't run it on your laptop; no one can run a 700B model. They made this whole splash of "we trained this model for $5 million," and I think that's actually true. Looking back, a lot of what we see is that DeepSeek and the Chinese labs were really able to catch up to the United States because of the constraints: we put a lot of trade restrictions on, we wouldn't give them GPUs. So they had to get clever and smart with what they had, and they did very strong inference optimization — they made the most of what they got. We could have continued scaling — we could have pushed 14.8 trillion tokens toward 150 trillion — but China didn't have the GPUs for that. The DeepSeek team realized they couldn't scale in the same dimension, so they had to get creative and think: what if we do RL? And that's basically what they did.

So V3: an MoE, 37 billion active. They introduced this concept of multi-head latent attention (MLA); 15 trillion tokens; they did SFT and then traditional RL — RLHF — to make it a chat model. Multi-token prediction — this came out of Meta; they needed to be sample-efficient with their 15 trillion tokens, so multi-token prediction buys a bit more sample efficiency. Trained in FP8. They did long-context extension: first trained at 32K, then extended out to 128K. It came out about a month before R1. After that, R1 came out, people got mad hyped, and now we basically have R1 v2. These are the fancy diagrams: you have the DeepSeek V3 base, you do SFT and RL, you get an SFT checkpoint, you fine-tune with RL, and you get DeepSeek R1.

Okay, what is DeepSeek R1-Zero? R1-Zero is where you don't do any SFT. You take a pure base model. For those that don't remember, base models are what you get from pre-training: we train models to predict the next token. So these are not models like GPT-4o or o1.
These are pure base models. All they do is predict the next token — they're completion models. You can't normally chat with them; they just complete your sentence, complete your word. So they take the DeepSeek V3 base model and apply pure RL. No SFT, no training as a user-assistant chat model, none of that. It uses GRPO for RL, which they actually introduced a while back in DeepSeekMath. The reward is based on both accuracy and format: responses must be verifiably correct. So what are they doing here? They take a base, non-chat model and use GRPO-style RL on math and code with verifiable outputs — the output needs to be verifiably correct. For math, you have a known-correct answer: is the model's answer right or not? If it's right, good, we can RL on that. For code: does it compile, do the tests pass? If so, we can RL on that. These are LeetCode-style questions, so we know whether the answer is correct. Then, in the finer details, they format the rewards so the model outputs its thinking between think tags. So we take the base model, do RL to get it to do a bunch of thinking — to unravel its chain-of-thought process — and from there we have DeepSeek R1-Zero. It's a reasoning, thinking model that's good at outputting thinking and answers, but it hasn't been trained to be a useful assistant. It's not o1 yet. It's just a good thinking model that can unravel its thought process to generate answers.

We'll mostly skip this slide — it's GRPO, the RL algorithm they use, comparing PPO and GRPO. The key things to note: there's no critic model; rewards are scored group-wise; and there are stability constraints, including a KL divergence term. We'll skip the derivation in this talk, but a minimal sketch of the core idea follows.
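Here's a rough, hedged sketch of the two pieces just described — a verifiable accuracy-plus-format reward, and GRPO's group-relative advantage. This is illustrative code, not DeepSeek's implementation: helper names like `format_reward` and the exact template regex are my assumptions, and the real reward and KL details live in the paper.

```python
import re
import statistics

# R1-Zero-style template: reasoning inside <think> tags, answer after.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def format_reward(completion: str) -> float:
    """Reward following the template at all (thinking inside think tags)."""
    return 1.0 if THINK_RE.fullmatch(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Verifiable accuracy: does the final answer match the known-good one?
    (For code tasks you'd compile and run tests instead of string-matching.)"""
    m = THINK_RE.fullmatch(completion.strip())
    answer = m.group(2).strip() if m else completion.strip()
    return 1.0 if answer == gold_answer else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO's trick: no learned critic. Sample a *group* of completions for
    the same prompt and normalize each reward against the group mean/std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Toy usage: four sampled completions for one math prompt, gold answer "2".
group = [
    "<think>1 + 1 = 2</think> 2",
    "<think>hmm</think> 3",
    "no think tags at all",
    "<think>wait, let me re-check... yes, 2</think> 2",
]
rewards = [format_reward(c) + accuracy_reward(c, "2") for c in group]
print(rewards)                   # [2.0, 1.0, 0.0, 2.0]
print(grpo_advantages(rewards))  # above-average completions get positive advantage
```

The policy update then pushes up the log-probability of completions with positive advantage, with a KL penalty against a reference model keeping training stable.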
Okay: DeepSeek R1-Zero — we took a base model, did RL, and now we have a thinking model. How does it perform? Pretty good. On AIME it passes o1-mini for the time; on MATH-500 it passes o1-mini too. It's on par with, but slightly worse than, o1. And the charts show that we're able to do this inference-time scaling by doing RL on really hard questions, and it works pretty well. Yes, the chart: number goes up, and number goes up even more the more you train — this thing is stably learning.

What else did we learn? The model naturally gains the ability to solve complex tasks by extending test-time compute. The original R1-Zero ranged from hundreds to thousands of reasoning tokens, and that turned out to be a pretty key factor: in the update DeepSeek released last week, they roughly doubled it — from about 12,000 tokens on average to about 23,000 — and once again we shift from o1-level to o3-level performance in the new model.

Other stuff: there's this emergence of interesting behaviors as test-time compute increases. As models learn to reason for longer and longer — as you get to thousands of tokens of reasoning — they start to have these reflection moments. The model learns: I'm actually not forced to just output my next token; I'm not forced to do a hundred tokens of thinking, or a thousand; I can continue down this thinking path. As they reason for longer, models start to do this reflection phase: they revisit and re-evaluate their previous steps, they go down alternative paths, and this spontaneity emerges. Spontaneously, the model will go, "okay, I tried this, this, and this, it's not working; I don't have to answer right now; let me try this new thing" — and guess what, it works.

And then we have these aha moments — a very core takeaway of the paper. This is what got the DeepSeek authors to realize RL actually works: the further along the thinking traces go, the more the models have these aha moments. Here's a quote from the paper: "This moment is not only an 'aha moment' for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The 'aha moment' serves as a powerful reminder of the potential of reinforcement learning to unlock new levels of intelligence." So instead of telling models how to reason — instead of doing SFT on chain-of-thought traces — if we just give this sparse incentive of "reason as much as you want," then the more time the model takes, we start to see the emergence of "aha, I see what it should finally be; I tried these six steps and this seventh one is working." That shows the power of RL, and it has been passed down into other papers: Microsoft has put out really good scaling laws on RL for small models, and the Qwen team has done really good thinking models — there's a talk online, so please watch the Qwen reasoning talk, where a speaker covers how they do it.

But okay, back to DeepSeek. Here's an example of an aha moment. The question is a basic math question — not that basic, actually; I can't answer it: if a > 1, then the sum of the real solutions of √(a − √(a + x)) = x is equal to what? As the model starts to think, it realizes: oh, there's an x on the right; if I square both sides, I can get x². Then I can isolate terms. It's doing reasoning, thinking through the math problem step by step. Then it says, in its own generation: "Wait, wait. Wait. That's an aha moment I can flag here. Let's re-evaluate this step by step." Instead of grinding on the square root directly, it squares both sides again — very interesting aha moments start to pop up. (The cleaned-up algebra is below.)
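For reference, here's the algebra the model is walking through, laid out cleanly — my reconstruction of the squaring steps with the usual domain caveats; the paper's table shows the model's raw trace, not this exact layout:

```latex
\begin{align*}
\sqrt{a-\sqrt{a+x}} &= x
  && \text{requires } x \ge 0 \text{ and } a - \sqrt{a+x} \ge 0 \\
a-\sqrt{a+x} &= x^{2}
  && \text{square both sides} \\
\sqrt{a+x} &= a - x^{2}
  && \text{isolate the inner square root} \\
a+x &= \left(a-x^{2}\right)^{2}
  && \text{square again, leaving a polynomial to solve}
\end{align*}
```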
So yeah, that's an overview of what R1-Zero was: we take a base model, train it with pure RL on math and code — RL on thinking steps — and now we have a reasoning model. It works, but it's not great. It doesn't have great readability, and it starts to reason in multiple languages — whoa, it reasoned in Chinese, who would have guessed? So we want to fix that; we can't just skip RLHF. We want to make this thing a good chat model. Next section: instead of R1-Zero, how do we make DeepSeek R1? How do we take it from being just a reasoning model to a reasoning, useful assistant — a full chat assistant?

Key solution, giving it to you straight: cold start. Instead of taking a base model straight to RL, take a base model and do some regular SFT first — prompt/answer pairs, chat-assistant data — a cold start, so it stays a useful assistant. After that, do the core RL on very hard code and math; get it to understand how to think, how to reason; scale it out until those aha moments appear, until it can do proper reasoning. From there, do rejection sampling: generate samples, rank them, and reject the ones with the wrong behavior. Then, once again, the last stage, stage four: another round of RL.

Going through these: stage one, a cold start with strong SFT, prevents the model from becoming unstable. You take a base model and a long chain-of-thought-style dataset — prompt/answer pairs of "think through your process of how to solve this, give your chain of thought" — and train on it. They don't just use synthetic data; they have human annotators, because they want better readability: do your thinking in think tags. This is on the scale of a couple thousand SFT examples. Then the main RL stage: the same RL process as R1-Zero — lots of RL on verifiable hard questions, math and LeetCode-style coding where we can check the answer. Then stage three, rejection sampling: generate completions, rank them with a reward model, and fine-tune the original model on the keepers. This was fairly standard — Llama 3 showed us this concept of rejection sampling, and DeepSeek picks it up. They generated 800,000 samples — 600,000 reasoning, 200,000 general chat — ranked them, and did their rejection-sampling training. A small sketch of the idea follows.
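As a rough mental model of that stage — not DeepSeek's pipeline; `generate` and `reward_model` here are hypothetical stubs standing in for the policy model and the learned ranker:

```python
import random

def generate(prompt: str, n: int) -> list[str]:
    """Stub: sample n completions from the current policy model."""
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def reward_model(prompt: str, completion: str) -> float:
    """Stub: score a completion (a learned reward model, or a verifier)."""
    return random.random()

def rejection_sample(prompts: list[str], n_per_prompt: int = 8,
                     keep_top_k: int = 1) -> list[tuple[str, str]]:
    """Keep only the top-scoring completions per prompt, rejecting the rest.
    The surviving (prompt, completion) pairs become SFT training data."""
    dataset = []
    for prompt in prompts:
        candidates = generate(prompt, n_per_prompt)
        ranked = sorted(candidates,
                        key=lambda c: reward_model(prompt, c),
                        reverse=True)
        dataset.extend((prompt, c) for c in ranked[:keep_top_k])
    return dataset

# Toy usage: build a small SFT set from two prompts.
for prompt, best in rejection_sample(["prove 1+1=2", "write a sort function"]):
    print(prompt, "=>", best)  # these pairs would feed a standard SFT run
```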
Then the final RL stage, for general use. This is very similar to RLHF: you want to make the model helpful, harmless, and good at reasoning. For reasoning, keep it a good reasoner — keep that reasoning for hard math and code questions; for general chat, capture human preference and nuance: this question should get a verbose, detailed answer; this one is a basic summary, keep it short — but keep your reasoning. After this final stage of RL, guess what: the model is a good chat model, a good thinking model, has the emergent aha behaviors — it's just a good model.

Performance and evals — we're going to skip these, because when R1 launched it was o1-level, but forget that: they launched an update two weeks ago. The new DeepSeek R1, launched May 28th, instead of being o1-level, now reasons for twice as long and is as good as o3 and Gemini 2.5 Pro. Much better at math, coding, and reasoning; it now supports native function calling and JSON output, and it no longer hallucinates as much. Second model drop, and of course the performance charts, all the regular benchmarks you'd expect: the model performs about as well as o3 and Gemini 2.5. And all that was done was more RL. AIME jumped 17.5 points; reasoning tokens doubled — we doubled the reasoning axis, double the reasoning effort — and now we're at o3 and Gemini 2.5 level. The other drop: a new distillation. They take the new model that reasons twice as long, distill it down to Qwen 3 8B with a distillation loss on that reasoning, and get performance matching the Qwen 3 235B reasoning model. So a dense 8B — not a natively RL-trained reasoner, just a distillation from the new DeepSeek model — is as good as Qwen 3 235B, which is an MoE reasoning model. That's pretty crazy. The implication is that long, detailed, good reasoning has a really deep impact when distilled. Once again, check out the Microsoft work for good distillation scaling laws on this.

Okay, back to our paper. That's the performance of where we're at; now let's talk about the distillation models. What they did was distill R1 down into Llama and Qwen models. This is not RL — this is basic SFT. You have these models that reason for 25-30,000 tokens; take those traces and do SFT-style distillation — proper distillation, match your logits. And guess what: this bought so much performance, and we know RL could do even better. It's all open source, all the traces are open — someone should do RL-based distillation from the big models. No one has, as far as I know, but we're already getting such good performance, and there's still so much left on the table. So here's the family of models — once again, pure SFT-style distillation. Oh, my slides are gone. They're back. Okay: they distilled Qwen 1.5B, Qwen 7B, 14B, 32B, Llama 8B, 70B. The distills beat the models they came from — numbers went up like crazy, really good performance. All the distills are basically on par with GPT-4o, and the new 8B distill is much better — way better than 4o. And this is, once again, just RL plus long chain of thought, turned into SFT-style distillation. (A generic sketch of that loss follows.)
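To make "match your logits" concrete, here's a hedged sketch of a standard logit-matching knowledge-distillation loss — the generic textbook formulation, not DeepSeek's disclosed recipe. (The simplest version of trace distillation is plain next-token SFT on teacher-generated traces; logit matching is the stricter variant.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence from teacher to student over temperature-softened
    distributions. The T^2 factor keeps gradient magnitudes comparable
    across temperatures (standard knowledge-distillation practice)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Toy stand-ins for real model outputs over the same positions of a
# teacher-generated reasoning trace: (batch=2, seq=16, vocab=1000).
student_logits = torch.randn(2, 16, 1000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 1000)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # in a real run, this gradient updates the student model
print(loss.item())
```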
A natural question: what if we just did RL on the base model instead? They tried it — RL on Qwen 32B for 10K steps — and it's actually worse than distillation. So for small models, we don't want to start with native RL. We saw this in their own model too, right? Going from R1-Zero to the actual R1, they needed that SFT cold start. Pure RL on the small base model performed significantly worse than distillation. And of course, the Qwen team at the time had their own reasoning model, QwQ-32B. The RL-on-base-model variant did worse than, or roughly on par with, what Qwen had, but the distillation actually performed much better — it beat the Qwen reasoning model — and of course they've now taken it a step further with the new model.

Okay, future work. R1 is worse than V3 at function calling, multi-turn, and JSON mode. Guess what: two weeks ago there's a new DeepSeek model with native function calling and JSON mode — they fixed it, it works. R1 struggles with language mixing; we don't know how the new one does. It's sensitive to prompting — I think that's something the industry needs to figure out. Spoiler alert for some Latent Space podcast fans: we talked to some reasoning experts, and researchers at OpenAI are saying that if you're still doing scaffolding with reasoning models, we're failing as labs. So we need to learn how to prompt these things better. And "it's not much better at engineering tasks than V3" — that's old news; that was the old DeepSeek, and the new DeepSeek is a lot better. Open recreations: we want to promote open research, and there were people trying to recreate this — Hugging Face has a version, Bespoke Labs was doing one; I don't know how they're doing, but quite a few people have done it by now.

But yeah, that's an overview. I know a lot more people have joined — we now have like 100-200 people in the audience — so I'm going to do a quick recap in our last 10 minutes. First, we are launching a second paper club. Every week we do our normal paper club, where we take the latest paper; a hundred people join every week, 300 for DeepSeek. We're adding a Test of Time paper club — if you're interested, sign up. We're going to run it in SF and remotely over the next six months and cover 50 to 100 papers. We're going to break up what you need to know as an AI engineer into buckets: foundations of deep learning (attention, RL, optimizers, Adam, gradient descent); foundation models you should know about (GPT-2, BERT, RNNs, LSTMs); pre-training, post-training, and mid-training (scaling laws, Chinchilla, distillation). We'll cover days of diffusion, optimization, voice, fine-tuning. Basically, a paper club where every week we split these core concepts into a few papers, with a presentation of three to four papers. Everyone is welcome to join. We'll have a presentation on every core concept and then open discussion. This is not a course — courses are good: you have active workshops, you build stuff, you do active learning. This is still a paper club. It's for when you want to know the foundations of what's going on under the hood — these are the key papers to know. We'll invite a lot of speakers, have people present, have good discussions. But yeah: Test of Time paper club, coming in June. Scan the QR code.
Let us know if you want to be involved, if you want to recommend a paper, share a paper. We'll share the curriculum soon. Join the Latent Space Discord — we already have a list of top 2025 papers, a paper every week that you can go through, and we're going to build off of that.

And once again, the final recap. What did we talk about today? The new DeepSeek model. Two weeks ago, DeepSeek R1-0528 came out. Basically, DeepSeek took the last R1 — which was as good as o1 — and trained it to reason for twice as long, getting significantly better performance. DeepSeek R1-0528 can now do standard structured JSON output and native function calling, hallucinates less, reasons for twice as long, and is a much bigger jump in performance from being o1-level. DeepSeek is now on par with OpenAI's o3 and Gemini 2.5 — basically across the board, on all the benchmarks. The other model released is DeepSeek R1 Qwen 3 8B: they took Qwen 3 8B, distilled it into a reasoning model based on the longer traces, and killed it. It's a small 8B with post-training via distillation — SFT on reasoning traces — and the model got really good. Compare base Qwen 3 8B against the R1 Qwen 3 8B: it's very good. Looking at the benchmarks, this open-source, on-device-runnable Qwen 3 8B distill is on par with Gemini 2.5 Flash Thinking and o3-mini, better than Phi-4, and significantly better than stock Qwen 3 8B. The 8B reasoner is better than Qwen 32B and on par with Qwen's 235B reasoning model.

So, two major updates: a new reasoning model from DeepSeek that's as good as o3 and Gemini 2.5, and a new mini 8B model that's as good as 2.5 Flash Thinking and o3-mini. All open source; the small one runs on your laptop. And these aren't even R2 — this is not DeepSeek R2, this is a minor refresh with a new date. So, high level, that's what we talked about. We see aha moments; instead of training on next-token prediction, we scale out to inference-time scaling, so we now train models to think for longer, and we get aha moments. That's the update on the new DeepSeek models. So yeah, thanks everyone for coming. Thanks for listening. Join paper club. [Applause]

Who is this? Oh yeah — a lot of our regulars that help make paper club possible are here. Eugene, Era, RJ, Flo. If anyone else is a regular in paper club, come up. Every week we have our weekly paper club — these are the homies that make it possible. We're going to have our second paper club, Test of Time, soon. Major shout out: this is not me, this is not swyx, this is volunteers, and more on Zoom. Every week, on a weekday at noon, hundreds of you join in to discuss a paper. Big shout out to everyone here — it's not possible without them — and to all the authors that have come and shared. Let's give it up for everyone that makes paper club possible: Eugene Yan, who is over there running his track; Sam, who I know is speaking right now; swyx, who is putting all this on. Once again, I'll leave our QR code here — if you're interested in Test of Time, volunteering for a paper, or recommending a paper, fill it out and help us out. This is our paper club selfie. But yeah, thanks for coming out, everyone.
Enjoy the rest of the conference. [Music]