OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs
Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: liG97YXaTSA
Source: https://www.youtube.com/watch?v=liG97YXaTSA
I'm Ryan. I'm a founding engineer at Bespoke Labs, and today I'm going to talk to you about OpenThoughts, which is our project to create the best open-source reasoning datasets. I'll be switching tack a little bit from our earlier discussions on reasoning and RL and focusing on the reasoning part, and you'll see why. So, just so we're on the same page: we've talked a lot about reasoning, but what's actually going on here? I like this graph from Jason, which shows the incredible performance gains of the last several months, where models are getting much, much better on certain benchmarks. This is reasoning; this is test-time scaling. I think everyone here is quite familiar with it. It seems that certain tasks, like AIME, which are competitive math problems, really respond when models are able to think step by step and produce these long chains of thought. So let's go back to DeepSeek R1. DeepSeek R1 was really impressive for a lot of people for a lot of reasons, and RL was a big part of that. But I was particularly interested because DeepSeek R1, at the end of the day, is an SFT model. The final weights they released are actually DeepSeek V3 Base fine-tuned on 800K SFT examples, 600K of which are reasoning. Of course, RL was a big part of it: RL was used heavily to create the model that generated this data. But at the end, it was SFT plus a little bit of RL for alignment. So this was really interesting and surprising. The other thing that was interesting and surprising to us was the small reasoning models that DeepSeek released, which were incredibly strong. For us, that was a huge motivation to try to do this ourselves. And why is it interesting? Because if we go back here, no additional detail was given on these datasets.
So if you want to create strong reasoning models, we now sort of have a training recipe, but we don't have the data recipe. That's the missing link. I also want to include a slide here on why it's interesting to train your own reasoning models. I'm partially taking this from Amir's talk yesterday on open source in the enterprise, which I really liked. There are these main points: performance; privacy; speed and cost; and ownership and destiny. Reasoning is a great tool to solve a problem, and you shouldn't limit your toolbox if you're trying to solve a specific domain task. As we talked about before, RL is a great tool in this toolbox for tackling reasoning tasks. But we're going to see here that SFT is, as Nathan put it this morning, extremely easy and extremely effective. Okay, great. Now, the missing link: how do we actually solve for this reasoning data recipe? There were all these questions when we started. How much data do you really need? What data creation steps are necessary? What are the optimal choices for each step in the data creation pipeline? And how do you even go about figuring all this out? This is the meat of the OpenThoughts project. So today we're excited to announce OpenThoughts3, hot off the presses, it just came out two hours ago, which is our latest and greatest version of our reasoning datasets. Thank you. This is the state-of-the-art reasoning dataset recipe. These graphs show accuracy on three reasoning benchmarks: AIME, which is competitive math; LiveCodeBench, which is competitive code; and GPQA Diamond, which is science questions. On the y-axis, accuracy goes up; on the x-axis, data scale goes up. We heard before that scaling is difficult, particularly with RL. The good news is that for SFT, scaling is quite a bit easier.
You can see here we compare to other open reasoning datasets. Nvidia released this great model, Nemotron Nano, an 8B model, and they also released the dataset it was trained on. So we compared directly, training the same base model on our dataset, which is our recipe, versus the Nemotron Nano data, which is the Nvidia recipe, and you can see there's a significant gap: we shifted the scaling curve upwards. Great. So this is the state-of-the-art 7B open-data reasoning model. We've measured across the domains of interest, science, code, and math, plus a couple of held-out benchmarks. Our original goal was to find the missing link for the DeepSeek distill models, and you can see we've crushed that goal: we significantly outperform the DeepSeek-R1-Distill-Qwen-7B model, which we started off trying to reproduce. And compared to the Nemotron Nano model, which is trained on a different base model, we're also outperforming on some benchmarks and similarly competitive on others. Okay, let's actually talk about how we achieved this; this is the interesting part for you. Going back to the scaling graph: on the x-axis, we're scaling dataset size. This is a huge lever for increasing accuracy, but it gets exponentially more expensive as you keep going. Vertically, you can see that we've shifted the scaling curve up. This is what I was talking about before: improving the dataset recipe. Given a fixed dataset recipe, you can always scale it larger and get higher performance. But if you want to push your performance to the absolute maximum, the real question is: how do I create the best dataset, and therefore what is the best recipe for the dataset? Okay, enough teasing. Let's get into the meat of it.
This is how we approached the problem. We broke the dataset pipeline down into: sourcing questions; mixing different sources of questions; filtering for the highest-quality questions; generating answers with a teacher model, which is distillation; and then filtering out bad answers. Lastly, at the end of this entire experimentation, we looked at which teacher model we should select. Through this entire pipeline, we arrived at our final dataset recipe. Now, this was a ton of work. This is a screenshot of our Hugging Face page: you can see we created over 5,000 datasets and almost 3,000 models. For this project, it was only around a thousand experiments, but that gives you an idea of how rigorously we examined the decisions at each step of the pipeline. I also think this is interesting because it peels back the curtain a little bit on what the frontier labs may be doing: finding signal at the smallest scale possible, trying out as many things as possible, empirically choosing the best, and then scaling. Often, when you scale, you see that what was best at the small scale doesn't actually hold up, but if you're lucky and you've done good science, then your YOLO run will be the best possible. Okay, so these are the key learnings from our dataset recipe, and this is what you can take away. The first, which is pretty surprising, is that sampling multiple answers, so multiple reasoning traces per question, works really, really well in your dataset. Performance does not go down at a fixed scale.
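The stages just described (source, mix, filter questions, distill answers, filter answers) can be sketched as a few composable functions. This is purely illustrative: every name here is hypothetical, and the stub scorer, teacher, and answer check stand in for real LLM calls.

```python
# Illustrative sketch of the dataset pipeline stages described in the talk.
# All function names are hypothetical; a real pipeline would call an LLM API.
from typing import Callable

def source_questions(sources: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Collect (source_name, question) pairs from all sources (source + mix)."""
    return [(name, q) for name, qs in sources.items() for q in qs]

def filter_questions(pool, score: Callable[[str], float], keep: int):
    """Keep the `keep` highest-scoring (e.g. hardest) questions."""
    return sorted(pool, key=lambda sq: score(sq[1]), reverse=True)[:keep]

def distill(questions, teacher: Callable[[str], str], samples_per_q: int = 1):
    """Generate `samples_per_q` teacher answers (reasoning traces) per question."""
    return [(q, teacher(q)) for _, q in questions for _ in range(samples_per_q)]

def filter_answers(pairs, is_ok: Callable[[str, str], bool]):
    """Drop malformed answers (the talk found correctness filtering optional)."""
    return [(q, a) for q, a in pairs if is_ok(q, a)]

# Toy run with stubs standing in for LLM calls:
sources = {"forum": ["easy q"], "synthetic": ["hard synthetic q", "medium q"]}
pool = source_questions(sources)
hardest = filter_questions(pool, score=lambda q: len(q), keep=2)
sft_data = filter_answers(
    distill(hardest, teacher=lambda q: f"<think>...</think> answer to {q}",
            samples_per_q=2),
    is_ok=lambda q, a: a.startswith("<think>"),
)
print(len(sft_data))  # 2 questions x 2 samples = 4 SFT examples
```

Each stage only consumes the previous stage's output, which is what makes it cheap to swap in different choices per step and compare them empirically, as the talk describes.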
If you take a fixed budget, say 30K examples, then 30K unique questions sampled once each performs pretty similarly to 1/16th of the questions, so 30K over 16, each sampled 16 times, which is quite cool. This is really useful because it lets you scale by 16x, which is more than an order of magnitude, and if you remember the graph from before, that corresponds to a pretty large increase in accuracy. The other surprising thing we found was that a better model, in terms of its own performance on evaluation benchmarks, is not necessarily a better teacher model. A good way to think about this is a brilliant researcher who's maybe a terrible lecturer. Specifically, we found QwQ-32B was a stronger teacher model than DeepSeek R1, so we switched to it in our recipe, even though previously everyone had been using R1. We also found that the data sources with synthetic questions were actually quite good. Some of the top sources we selected were entirely synthetic, and better than sources scraped from forums or written manually by humans. This is really good news, because synthetic question generation is scalable: once again, we can push further along the x-axis, which means an accuracy boost. Question filtering also works well here. We filtered questions by asking a language model how difficult each question is and keeping only the hardest ones. We also had a language model try to answer each question and looked at the length of its answer. These are proxies for the same thing: you can imagine that if a problem is much harder, a language model will think more and produce more text.
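The fixed-budget trade-off above can be sketched in a few lines. The `sample_answer` call is a stand-in for one teacher-model generation at temperature above zero; the numbers match the talk's 30K example.

```python
# Two ways to spend a fixed SFT budget of 30,000 examples: many questions
# sampled once, or 1/16th of the questions sampled 16 times each.
BUDGET = 30_000

def make_dataset(questions, samples_per_q, sample_answer):
    # Shrink the number of unique questions as samples per question grow,
    # so the total example count stays within the same budget.
    n_questions = BUDGET // samples_per_q
    data = [
        (q, sample_answer(q, i))
        for q in questions[:n_questions]
        for i in range(samples_per_q)
    ]
    assert len(data) <= BUDGET
    return data

fake_pool = [f"q{i}" for i in range(BUDGET)]
stub = lambda q, i: f"trace {i} for {q}"  # placeholder for a teacher call

wide = make_dataset(fake_pool, samples_per_q=1, sample_answer=stub)
deep = make_dataset(fake_pool, samples_per_q=16, sample_answer=stub)

print(len(wide), len(deep))        # same total budget: 30000 and 30000
print(len({q for q, _ in deep}))   # but only 1875 unique questions in `deep`
```

The talk's finding is that these two datasets train comparably well at the same budget, which means a small question pool can be stretched 16x further than you might expect.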
So its answer will be longer. These approaches worked better than embedding-based approaches or fastText classifiers, which is interesting insofar as those approaches were typical for pre-training. It seems that data filtering for post-training is quite different from pre-training. Okay, some things that didn't work were also quite interesting. Through our experiments, we saw that choosing a smaller number of high-quality sources was much better than trying to optimize for diversity with a larger number of sources. That's very counterintuitive, right? You'd think you should always go for higher diversity, but that's not what we saw. The last interesting thing: people talk a lot about verification, which is obviously very important for RL, but for SFT and distillation, we saw that filtering based on the answer, verifying the answer, didn't really seem to help at all. This is quite surprising. I think there's some good research in the literature about why this might be: for the hardest problems, keeping an example in might still be helpful even if its answer is incorrect, because you see how the teacher model attempts it. It's not just the final output that matters. Okay, great. So those are all the learnings from OpenThoughts3, which we're super excited to share. But now you're probably thinking: they've done a thousand experiments; I don't want to do a thousand experiments, but I still want to create reasoning models. How do I adapt this if I want to create specialized reasoning models? The first thing I would say is: be aware that, depending on your domain, these exact choices might be a little different. I would suggest starting with our recipe and then iterating on it.
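The two filtering proxies described above, an LLM difficulty rating and the length of an LLM's attempted answer, might look like the following. This is a sketch under assumptions: `llm` is a placeholder for a real model call, and the stub here just lets the example run.

```python
# Hypothetical question-filtering pass: keep only the "hardest" questions,
# scored either by an LLM difficulty label or by attempted-answer length.

def difficulty_score(question, llm):
    """Ask a model to rate difficulty 1-10 and parse the number."""
    reply = llm(f"Rate the difficulty of this question from 1 to 10. "
                f"Reply with only the number.\n\n{question}")
    return int(reply.strip())

def length_score(question, llm):
    """Proxy: harder questions tend to draw longer attempted answers."""
    return len(llm(f"Answer this question:\n\n{question}"))

def keep_hardest(questions, score, top_fraction=0.3):
    """Rank by a difficulty proxy and keep the top fraction."""
    ranked = sorted(questions, key=score, reverse=True)
    return ranked[: max(1, int(len(questions) * top_fraction))]

# Stub "model" so the sketch runs without an API; it pretends longer
# prompts are harder and draw longer answers.
def stub_llm(prompt):
    if prompt.startswith("Rate"):
        return str(min(10, len(prompt) // 20))
    return "x" * len(prompt)

qs = ["2+2?", "Prove there are infinitely many primes.", "What color is the sky?"]
print(keep_hardest(qs, score=lambda q: difficulty_score(q, stub_llm)))
```

Per the talk, which proxy works best is domain-dependent: difficulty labels worked better for code, response length for math and science.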
If you have the capacity and compute, try a couple of different choices for each step in the pipeline. A good example of this: we studied each step of the pipeline separately by domain, so distinctly for code, science, and math. We saw, for example, in the question filtering I talked about before, that difficulty labels worked well for code questions, but for math and science, it was response length. And if you think about that for a second, it makes sense, because response lengths for coding questions are very different. For AIME math, the answer is literally just a number between zero and a thousand, so the answer isn't a large portion of the length; but you can imagine very simple coding questions where the answer is still many lines of code. So that's one thing to be aware of. The other thing, which I talked about previously, is synthetic question generation, because it works so well. If you don't have a lot of data for your particular problem in your specialized domain, then go ahead: transform your existing data into questions, expand it, throw examples in as in-context examples, and generate more data. We built an open-source library for this called Curator, and you can try that out. And lastly, I feel like everyone says this, but it can't be said enough: evaluation is paramount. If you don't know how well your models are doing or improving, you cannot make good, principled decisions about your dataset recipe. We spent a lot of time on this. We also have an open-source library on GitHub called Evalchemy, which takes care of this, and also takes care of sharding and parallelism. The key thing here is for very small evaluation sets: if you only have a handful of questions, you should run your model on those evaluation sets many times and average.
Going back again to AIME competitive math questions: there are only 30 per year. So for our evaluations, we gave the model those 30 questions ten times each and averaged to get the final signal for determining which data strategies were working better than others, because otherwise there's too much noise. Okay, this is also very interesting, surprising, and promising for you if you're specializing: it seems you can actually surpass the teacher in some domains with distillation. This is super cool. Usually you think only RL can push the frontier, and distillation is just about catching up to the teacher, but no, that's not the case. We have an example in our paper where we looked at the legal reasoning domain, the problem of classifying Supreme Court decisions. We took 2K unique questions and sampled five answers per question, and here we did do verification, which did matter: we threw away any answers that were incorrect. When you fine-tune the 7B model on this, it surpasses R1, which is a very strong and also very large reasoning model. So this is very exciting; I think there's a lot more research, and also application, to be done here. Okay, cool. So everything's open. It's OpenThoughts, and OpenThoughts means open. Go out and build. We have our detailed paper, just out this morning; we have the weights and the dataset; and we have a ton of repos for data generation, evaluation, and synthetic data. So check those out. This is the team. It was a huge group of people and a lot of work over many months. I think we're all very proud of what we did, but there are lots of people to recognize here. If you scan that QR code, it goes to the tweet, and everything about the OpenThoughts project is linked from there. Yeah. Thank you. All right. Thank you so much, Ryan. That was fascinating.
Looks like we already have at least one question lined up. Again, we have time for maybe a couple of questions, so if you have questions, please line up and we'll do it. Actually, before we get to those questions, I will say, as people are leaving: we are going to be back here at 2:00. We've got an excellent afternoon planned on this track. We've got Nathan Lambert, and we've got Christian Seed, who's the co-founder of X. It's going to be a really great track at 2 o'clock back in this room. Also, one more thing: if you do have questions for any of the speakers from this morning, hopefully they're going to be able to stick around. Don't let them go to lunch. They're sitting up here at the front, so swarm them as soon as we're done. But for now, let's get a couple of questions. Go ahead. Yes, over there. Thank you. Great talk. So, two questions. One is: if you're just using SFT on this data, what's the difference between this and regular SFT? This is just regular SFT. Oh, okay. So then how is regular SFT able to make the models think longer? Because I thought the reasoning models have a thinking block and they think for, you know, hours and minutes. So how does SFT make a model think for hours? You're doing supervised fine-tuning on the questions, and the answers also contain the thinking. So the model learns to use its context window and produce these long thinking traces. People call this SFT imitation, but it can learn this format in the same way. Yeah. Thanks. All right, we'll take one from this side. Great presentation, Ryan. One question: why do you think a smaller model like QwQ-32B was a better teacher than DeepSeek R1? What was your insight in figuring out that a good professor can make a bad lecturer? Yeah, that's a great question.
I think this is something we need to investigate more, but when you look at charts of the lengths of the reasoning traces, you can see the distributions are different. So it might be the case that you're using more of the context window, more tokens, more steps. It also might be the case that you just have a better-formatted response, better output. This is another great open research question. Interesting. I'll also say on this point: we also tried Claude as a teacher, which is a very strong model, and it was just a terrible teacher. So yeah, it's interesting what actually makes a good teacher. All right, we'll take one more very brief question from this side, and then for those of you still waiting on questions, after we have closed this up, it's swarming time. Sorry. Great talk, Ryan. We're doing a similar kind of thing, but I just had a question. Do you have any kind of pattern map for the reasoning chain of thought: when things don't work, at what level in the eval do you find out that it's not reasoning correctly? Is there a pattern map or something in your open-source repos? Sorry, I didn't catch that. If there are five steps of reasoning to reach a final conclusion, at what step does the reasoning go awry? Yeah, this is a great question. We don't do this fine-grained analysis, but there is a ton in the literature about this: there's a sort of critical step where it gets things wrong. We did the simplest thing possible; you could also go in and try more complicated things at evaluation time, where you do interventions to detect steps that have gone awry and change them, or you can do this when you're creating the dataset.
So you could potentially rewrite things, but everything we tried in terms of editing the reasoning trace wasn't helpful. So yeah, I think there's still more to explore there. This is really just the start of everything in reasoning.