OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs
Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: liG97YXaTSA
Source: https://www.youtube.com/watch?v=liG97YXaTSA
I'm Ryan. I'm a founding engineer at Bespoke Labs, and today I'm going to talk to you about OpenThoughts, which is our project to create the best open-source reasoning datasets. I'll be switching tack a little bit from our earlier discussions on reasoning and RL and focusing on the reasoning part, and you'll see why. So, just so we're on the same page: we've talked a lot about reasoning, but what's actually going on here? I like this graph from Jason, which shows the incredible performance gains of the last several months, where models are getting much, much better on certain benchmarks. This is reasoning; this is test-time scaling. I think everyone here is quite familiar with it. It seems that certain tasks, like AIME, which are competitive math problems, really respond when models are able to think step by step and produce these long chains of thought. So let's go back to DeepSeek R1. DeepSeek R1 was really impressive for a lot of people for a lot of reasons, and RL was a big part of that. But I was particularly interested because DeepSeek R1, at the end of the day, is an SFT model. The final weights they released are actually DeepSeek V3 Base fine-tuned on 800K SFT examples, 600K of which are reasoning. Of course, RL was a big part of it: RL was used heavily to create the model that generated this data. But at the end, it was SFT plus a little bit of RL for alignment. So this was really interesting and surprising. The other thing that was interesting and surprising to us was the small reasoning models that DeepSeek released, which were incredibly strong. For us, that was a huge motivation to try to do this ourselves. And why is it interesting? Because if we go back here, no additional detail was given on these datasets.
So if you want to create strong reasoning models, we now sort of have a training recipe, but we don't have the data recipe. That's the missing link. I also want to include a slide here on why it's interesting to train your own reasoning models. I'm partially taking this from Amir's talk yesterday on open source in the enterprise, which I really liked. There are these main points: performance; privacy; speed and cost; and ownership and destiny. Reasoning is a great tool to solve a problem, and you shouldn't limit your toolbox if you're trying to solve a specific domain task. As we talked about before, RL is a great tool in this toolbox for tackling reasoning tasks. But we're going to see here that SFT is, as Nathan put it this morning, extremely easy and extremely effective. Okay, great. Now, the missing link: how do we actually solve for this reasoning data recipe? There were all these questions when we started. How much data do you really need? What data creation steps are necessary? What are the optimal choices for each step in the data creation pipeline? And how do you even go about figuring all this out? This is the meat of the OpenThoughts project. So today we're excited to announce OpenThoughts3, hot off the presses, it just came out two hours ago, which is our latest and greatest version of our reasoning datasets. Thank you. This is the state-of-the-art reasoning dataset recipe. These graphs show accuracy on three reasoning benchmarks: AIME, which is competitive math; LiveCodeBench, which is competitive code; and GPQA Diamond, which is science questions. On the y-axis, accuracy goes up; on the x-axis, data scale goes up. We heard before that scaling is difficult, particularly with RL. The good news is that for SFT, scaling is quite a bit easier.
You can see here we compare to other open reasoning datasets. Nvidia released this great model, Nemotron Nano, an 8B model, and they also released the dataset it was trained on. So we compared directly, training the same base model on our dataset, which is our recipe, versus the Nemotron Nano data, which is the Nvidia recipe, and you can see there's a significant gap: we shifted the scaling curve upwards. Great. So this is the state-of-the-art 7B open-data reasoning model. We've measured across the domains of interest, science, code, and math, plus a couple of held-out benchmarks. Our original goal was to find the missing link for the DeepSeek distill models, and you can see we've crushed that goal: we significantly outperform the DeepSeek-R1-Distill-Qwen-7B model, which we started off trying to reproduce. And compared to the Nemotron Nano model, which is trained on a different base model, we're also outperforming on some benchmarks and similarly competitive on others. Okay, let's actually talk about how we achieved this; this is the interesting part for you. Going back to the scaling graph: on the x-axis, we're scaling dataset size. This is a huge lever for increasing accuracy, but it gets exponentially more expensive as you keep going. Vertically, you can see that we've shifted the scaling curve up. This is what I was talking about before: improving the dataset recipe. Given a fixed dataset recipe, you can always scale it larger and get higher performance. But if you want to push your performance to the absolute maximum, the real question is: how do I create the best dataset, and therefore what is the best recipe for the dataset? Okay, enough teasing. Let's get into the meat of it.
This is how we approached the problem. We broke the dataset pipeline down into: sourcing questions; mixing different sources of questions; filtering for the highest-quality questions; generating answers with a teacher model, which is distillation; and then filtering out bad answers. Lastly, at the end of this entire experimentation, we looked at which teacher model we should select. Through this entire pipeline, we arrived at our final dataset recipe. Now, this was a ton of work. This is a screenshot of our Hugging Face page: you can see we created over 5,000 datasets and almost 3,000 models. For this project, it was only around a thousand experiments, but that gives you an idea of how rigorously we examined the decisions at each step of the pipeline. I also think this is interesting because it peels back the curtain a little bit on what the frontier labs may be doing: finding signal at the smallest scale possible, trying out as many things as possible, empirically choosing the best, and then scaling. Often, when you scale, you see that what was best at the small scale doesn't actually hold up, but if you're lucky and you've done good science, then your YOLO run will be the best possible. Okay, so these are the key learnings from our dataset recipe, and this is what you can take away. The first, which is pretty surprising, is that sampling multiple answers, so multiple reasoning traces per question, works really, really well in your dataset. Performance does not go down at a fixed scale.
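The stages just described (source, mix, filter questions, distill answers, filter answers) can be sketched as a few composable functions. This is purely illustrative: every name here is hypothetical, and the stub scorer, teacher, and answer check stand in for real LLM calls.

```python
# Illustrative sketch of the dataset pipeline stages described in the talk.
# All function names are hypothetical; a real pipeline would call an LLM API.
from typing import Callable

def source_questions(sources: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Collect (source_name, question) pairs from all sources (source + mix)."""
    return [(name, q) for name, qs in sources.items() for q in qs]

def filter_questions(pool, score: Callable[[str], float], keep: int):
    """Keep the `keep` highest-scoring (e.g. hardest) questions."""
    return sorted(pool, key=lambda sq: score(sq[1]), reverse=True)[:keep]

def distill(questions, teacher: Callable[[str], str], samples_per_q: int = 1):
    """Generate `samples_per_q` teacher answers (reasoning traces) per question."""
    return [(q, teacher(q)) for _, q in questions for _ in range(samples_per_q)]

def filter_answers(pairs, is_ok: Callable[[str, str], bool]):
    """Drop malformed answers (the talk found correctness filtering optional)."""
    return [(q, a) for q, a in pairs if is_ok(q, a)]

# Toy run with stubs standing in for LLM calls:
sources = {"forum": ["easy q"], "synthetic": ["hard synthetic q", "medium q"]}
pool = source_questions(sources)
hardest = filter_questions(pool, score=lambda q: len(q), keep=2)
sft_data = filter_answers(
    distill(hardest, teacher=lambda q: f"<think>...</think> answer to {q}",
            samples_per_q=2),
    is_ok=lambda q, a: a.startswith("<think>"),
)
print(len(sft_data))  # 2 questions x 2 samples = 4 SFT examples
```

Each stage only consumes the previous stage's output, which is what makes it cheap to swap in different choices per step and compare them empirically, as the talk describes.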
If you take a fixed budget, say 30K examples, then 30K unique questions sampled once each performs pretty similarly to 1/16th of the questions, so 30K over 16, each sampled 16 times, which is quite cool. This is really useful because it lets you scale by 16x, which is more than an order of magnitude, and if you remember the graph from before, that corresponds to a pretty large increase in accuracy. The other surprising thing we found was that a better model, in terms of its own performance on evaluation benchmarks, is not necessarily a better teacher model. A good way to think about this is a brilliant researcher who's maybe a terrible lecturer. Specifically, we found QwQ-32B was a stronger teacher model than DeepSeek R1, so we switched to it in our recipe, even though previously everyone had been using R1. We also found that the data sources with synthetic questions were actually quite good. Some of the top sources we selected were entirely synthetic, and better than sources scraped from forums or written manually by humans. This is really good news, because synthetic question generation is scalable: once again, we can push further along the x-axis, which means an accuracy boost. Question filtering also works well here. We filtered questions by asking a language model how difficult each question is and keeping only the hardest ones. We also had a language model try to answer each question and looked at the length of its answer. These are proxies for the same thing: you can imagine that if a problem is much harder, a language model will think more and produce more text.
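The fixed-budget trade-off above can be sketched in a few lines. The `sample_answer` call is a stand-in for one teacher-model generation at temperature above zero; the numbers match the talk's 30K example.

```python
# Two ways to spend a fixed SFT budget of 30,000 examples: many questions
# sampled once, or 1/16th of the questions sampled 16 times each.
BUDGET = 30_000

def make_dataset(questions, samples_per_q, sample_answer):
    # Shrink the number of unique questions as samples per question grow,
    # so the total example count stays within the same budget.
    n_questions = BUDGET // samples_per_q
    data = [
        (q, sample_answer(q, i))
        for q in questions[:n_questions]
        for i in range(samples_per_q)
    ]
    assert len(data) <= BUDGET
    return data

fake_pool = [f"q{i}" for i in range(BUDGET)]
stub = lambda q, i: f"trace {i} for {q}"  # placeholder for a teacher call

wide = make_dataset(fake_pool, samples_per_q=1, sample_answer=stub)
deep = make_dataset(fake_pool, samples_per_q=16, sample_answer=stub)

print(len(wide), len(deep))        # same total budget: 30000 and 30000
print(len({q for q, _ in deep}))   # but only 1875 unique questions in `deep`
```

The talk's finding is that these two datasets train comparably well at the same budget, which means a small question pool can be stretched 16x further than you might expect.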
So its answer will be longer. These approaches worked better than embedding-based approaches or fastText classifiers, which is interesting insofar as those approaches were typical for pre-training. It seems that data filtering for post-training is quite different from pre-training. Okay, some things that didn't work were also quite interesting. Through our experiments, we saw that choosing a smaller number of high-quality sources was much better than trying to optimize for diversity with a larger number of sources. That's very counterintuitive, right? You'd think you should always go for higher diversity, but that's not what we saw. The last interesting thing: people talk a lot about verification, which is obviously very important for RL, but for SFT and distillation, we saw that filtering based on the answer, verifying the answer, didn't really seem to help at all. This is quite surprising. I think there's some good research in the literature about why this might be: for the hardest problems, keeping an example in might still be helpful even if its answer is incorrect, because you see how the teacher model attempts it. It's not just the final output that matters. Okay, great. So those are all the learnings from OpenThoughts3, which we're super excited to share. But now you're probably thinking: they've done a thousand experiments; I don't want to do a thousand experiments, but I still want to create reasoning models. How do I adapt this if I want to create specialized reasoning models? The first thing I would say is: be aware that, depending on your domain, these exact choices might be a little different. I would suggest starting with our recipe and then iterating on it.
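The two filtering proxies described above, an LLM difficulty rating and the length of an LLM's attempted answer, might look like the following. This is a sketch under assumptions: `llm` is a placeholder for a real model call, and the stub here just lets the example run.

```python
# Hypothetical question-filtering pass: keep only the "hardest" questions,
# scored either by an LLM difficulty label or by attempted-answer length.

def difficulty_score(question, llm):
    """Ask a model to rate difficulty 1-10 and parse the number."""
    reply = llm(f"Rate the difficulty of this question from 1 to 10. "
                f"Reply with only the number.\n\n{question}")
    return int(reply.strip())

def length_score(question, llm):
    """Proxy: harder questions tend to draw longer attempted answers."""
    return len(llm(f"Answer this question:\n\n{question}"))

def keep_hardest(questions, score, top_fraction=0.3):
    """Rank by a difficulty proxy and keep the top fraction."""
    ranked = sorted(questions, key=score, reverse=True)
    return ranked[: max(1, int(len(questions) * top_fraction))]

# Stub "model" so the sketch runs without an API; it pretends longer
# prompts are harder and draw longer answers.
def stub_llm(prompt):
    if prompt.startswith("Rate"):
        return str(min(10, len(prompt) // 20))
    return "x" * len(prompt)

qs = ["2+2?", "Prove there are infinitely many primes.", "What color is the sky?"]
print(keep_hardest(qs, score=lambda q: difficulty_score(q, stub_llm)))
```

Per the talk, which proxy works best is domain-dependent: difficulty labels worked better for code, response length for math and science.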
If you have the capacity and compute, try a couple of different choices for each step in the pipeline. A good example of this: we studied each step of the pipeline separately by domain, so distinctly for code, science, and math. We saw, for example, in the question filtering I talked about before, that difficulty labels worked well for code questions, but for math and science, it was response length. And if you think about that for a second, it makes sense, because response lengths for coding questions are very different. For AIME math, the answer is literally just a number between zero and a thousand, so the answer isn't a large portion of the length; but you can imagine very simple coding questions where the answer is still many lines of code. So that's one thing to be aware of. The other thing, which I talked about previously, is synthetic question generation, because it works so well. If you don't have a lot of data for your particular problem in your specialized domain, then go ahead: transform your existing data into questions, expand it, throw examples in as in-context examples, and generate more data. We built an open-source library for this called Curator, and you can try that out. And lastly, I feel like everyone says this, but it can't be said enough: evaluation is paramount. If you don't know how well your models are doing or improving, you cannot make good, principled decisions about your dataset recipe. We spent a lot of time on this. We also have an open-source library on GitHub called Evalchemy, which takes care of this, and also takes care of sharding and parallelism. The key thing here is for very small evaluation sets: if you only have a handful of questions, you should run your model on those evaluation sets many times and average.
Going back again to AIME competitive math questions: there are only 30 per year. So for our evaluations, we gave the model those 30 questions ten times each and averaged to get the final signal for determining which data strategies were working better than others, because otherwise there's too much noise. Okay, this is also very interesting, surprising, and promising for you if you're specializing: it seems you can actually surpass the teacher in some domains with distillation. This is super cool. Usually you think only RL can push the frontier, and distillation is just about catching up to the teacher, but no, that's not the case. We have an example in our paper where we looked at the legal reasoning domain, the problem of classifying Supreme Court decisions. We took 2K unique questions and sampled five answers per question, and here we did do verification, which did matter: we threw away any answers that were incorrect. When you fine-tune the 7B model on this, it surpasses R1, which is a very strong and also very large reasoning model. So this is very exciting; I think there's a lot more research, and also application, to be done here. Okay, cool. So everything's open. It's OpenThoughts, and OpenThoughts means open. Go out and build. We have our detailed paper, just out this morning; we have the weights and the dataset; and we have a ton of repos for data generation, evaluation, and synthetic data. So check those out. This is the team. It was a huge group of people and a lot of work over many months. I think we're all very proud of what we did, but there are lots of people to recognize here. If you scan that QR code, it goes to the tweet, and everything about the OpenThoughts project is linked from there. Yeah. Thank you. All right. Thank you so much, Ryan. That was fascinating.
Looks like we already have at least one question lined up. Again, we have time for maybe a couple of questions, so if you have questions, please line up and we'll do it. Actually, before we get to those questions, I will say, as people are leaving: we are going to be back here at 2:00. We've got an excellent afternoon planned on this track. We've got Nathan Lambert, and we've got Christian Seed, who's the co-founder of X. It's going to be a really great track at 2 o'clock back in this room. Also, one more thing: if you do have questions for any of the speakers from this morning, hopefully they're going to be able to stick around. Don't let them go to lunch. They're sitting up here at the front, so swarm them as soon as we're done. But for now, let's get a couple of questions. Go ahead. Yes, over there. Thank you. Great talk. So, two questions. One is: if you're just using SFT on this data, what's the difference between this and regular SFT? This is just regular SFT. Oh, okay. So then how is regular SFT able to make the models think longer? Because I thought the reasoning models have a thinking block and they think for, you know, hours and minutes. So how does SFT make a model think for hours? You're doing supervised fine-tuning on the questions, and the answers also contain the thinking. So the model learns to use its context window and produce these long thinking traces. People call this SFT imitation, but it can learn this format in the same way. Yeah. Thanks. All right, we'll take one from this side. Great presentation, Ryan. One question: why do you think a smaller model like QwQ-32B was a better teacher than DeepSeek R1? What was your insight in figuring out that a good professor can make a bad lecturer? Yeah, that's a great question.
I think this is something we need to investigate more, but when you look at charts of the lengths of the reasoning traces, you can see the distributions are different. So it might be the case that you're using more of the context window, more tokens, more steps. It also might be the case that you just have a better-formatted response, better output. This is another great open research question. Interesting. I'll also say on this point: we also tried Claude as a teacher, which is a very strong model, and it was just a terrible teacher. So yeah, it's interesting what actually makes a good teacher. All right, we'll take one more very brief question from this side, and then for those of you still waiting on questions, after we have closed this up, it's swarming time. Sorry. Great talk, Ryan. We're doing a similar kind of thing, but I just had a question. Do you have any kind of pattern map for the reasoning chain of thought: when things don't work, at what level in the eval do you find out that it's not reasoning correctly? Is there a pattern map or something in your open-source repos? Sorry, I didn't catch that. If there are five steps of reasoning to reach a final conclusion, at what step does the reasoning go awry? Yeah, this is a great question. We don't do this fine-grained analysis, but there is a ton in the literature about this: there's a sort of critical step where it gets things wrong. We did the simplest thing possible; you could also go in and try more complicated things at evaluation time, where you do interventions to detect steps that have gone awry and change them, or you can do this when you're creating the dataset.
So you could potentially rewrite things, but everything we tried in terms of editing the reasoning trace wasn't helpful. So yeah, I think there's still more to explore there. This is really just the start of everything in reasoning.