Generative AI 101: Tokens, Pre-training, Fine-tuning, Reasoning — With SemiAnalysis CEO Dylan Patel

Channel: Alex Kantrowitz

Published at: 2025-04-23

YouTube video id: QUIgIBx5NDQ

Source: https://www.youtube.com/watch?v=QUIgIBx5NDQ

If you've ever wondered how generative AI works and where the technology is heading, this episode is for you. We're going to explain the basics of the technology and then catch up with modern-day advances like reasoning, to help you understand exactly how it does what it does and where it might advance in the future. That's coming up with SemiAnalysis founder and chief analyst Dylan Patel, right after this.

Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation of the tech world and beyond. We're joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in semiconductor and generative AI research, and someone I've been looking forward to speaking with for a long time now. I want this to be an episode that (a) helps people learn how generative AI works, and (b) is an episode people will send to their friends to explain generative AI to them. I've had a couple of those that I've been sending to my friends, colleagues, and counterparts about what is going on within generative AI: one is the three-and-a-half-hour video from Andrej Karpathy explaining everything about training large language models, and the second is a great episode that Dylan and Nathan Lambert from the Allen Institute for AI did with Lex Fridman. Both of those run three hours plus, so I want to do ours in an hour. I'm very excited to begin. So, Dylan, it's great to see you, and welcome to the show.

Thank you for having me.
Great to have you here. Let's just start with tokens. Can you explain how AI researchers take words, and parts of words, and give them numerical representations? What are tokens?

Tokens are, in fact, chunks of words. In human terms, you can think of syllables: syllables are often viewed as chunks of words, they carry some meaning, and they're the base level of speech. For models, tokens are the base level of output. They're about compression; they're the most efficient representation of language.
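As a rough illustration of this chunking, here is a minimal sketch using OpenAI's open-source tiktoken library (an assumption for illustration; any BPE tokenizer behaves similarly, and the exact integer IDs depend entirely on the vocabulary):

```python
# A minimal sketch of tokenization: text in, integer token IDs out.
# Assumes the open-source `tiktoken` package is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE vocabulary of ~100k tokens

ids = enc.encode("The sky is blue.")
print(ids)                              # a short list of integers
print([enc.decode([i]) for i in ids])   # the word-chunks each ID stands for
assert enc.decode(ids) == "The sky is blue."  # round-trips losslessly
```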
From my understanding, AI models are very good at predicting patterns. So if you give one the sequence 1, 3, 5, 7, 9, it might know that the next number is going to be 11. And what it's doing with tokens is taking words, breaking them down into their component parts, assigning them numerical values, and then, in its own language, learning to predict what number comes next, because computers are better at numbers, and then converting that number back into text, which is what we see come out. Is that accurate?
Yeah. And each individual token is actually not just one number; it's associated with a vector of many values. You could think of it like this: the model needs to learn that "king" and "queen" are extremely similar across most of the English language, except for certain dimensions in which they're super different: a king is male and a queen is female. And then, in language, kings are oftentimes portrayed as conquerors, and so on; these are just historical patterns, so a lot of the text around the two words differs, even though they're both royal, regal, part of a monarchy. So it's not converting a word into one number; it's converting it into a vector with many dimensions, and the model learns what each of those dimensions means. You don't initialize the model with "king means male monarch, associated with war and conquering, because that's what most writing about kings covers" — people don't write much about the daily lives of kings; they mostly write about their wars and conquests. Instead, each number in this embedding space gets assigned over time: as the model reads the internet's text and trains on it, it starts to realize, oh, king and queen are very similar along these dimensions but very different along those. And you don't explicitly tell the model what a given dimension is for; one dimension could end up encoding something like "is it a building or not," and you don't know that ahead of time. It just happens in the latent space, and all these vectors relate to each other. But these numbers are an efficient representation of words, because you can do math on them: you can multiply them, divide them, run them through an entire model.

And your brain does something similar, right? When it hears something, it converts that into frequencies in your ears, and those get converted into signals that travel through your brain. That's the same job a tokenizer does, although it's obviously a very different medium of compute — ones and zeros and multiplication for computers, whereas human brains are more analog in nature and think more in waves and patterns. While they are very different, it is a tokenizer of sorts: language is not actually how our brain thinks; it's just a representation over which to reason.
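To make the king/queen idea concrete, here's a toy sketch with hand-made four-dimensional embeddings. The dimensions and values are invented for illustration; real models learn thousands of dimensions whose meanings nobody labels in advance:

```python
# Toy embeddings: similar along most dimensions, different along one.
import numpy as np

# Invented axes: [royalty, maleness, war-association, building-ness]
king  = np.array([0.9,  0.5, 0.6, 0.0])
queen = np.array([0.9, -0.5, 0.5, 0.0])
house = np.array([0.0,  0.0, 0.0, 0.9])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(king, queen))  # clearly positive: alike along most dimensions
print(cosine(king, house))  # 0.0: nothing in common here
```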
Yeah, so that's crazy. So the tokens are this efficient representation of words, but more than that, the models are also learning the way all these words are connected, and that brings us to pre-training. From my understanding, pre-training is when you take basically the entire internet's worth of text and use it to teach the model these relationships between tokens. So, like we talked about: if you gave a model "the sky is," and the next word is typically "blue" in the pre-training data, which is basically all the language on the internet, it should know that the next token is "blue." You want to make sure that when the model is outputting information, it's closely tied to what that next value should be. Is that a proper description of what happens in pre-training?
Yeah, I think that's right. That's the objective function, which is just to reduce loss, i.e., how often the token is predicted incorrectly versus correctly. So if you said "the sky is red," that's not the most probable outcome, so that would usually be wrong. But that text is on the internet, right? The Martian sky is red, and there are all these books about Mars and sci-fi. So how does the model learn to figure out in what context it's accurate to say blue versus red?

First of all, the model doesn't just output one token; it outputs a distribution over tokens. The way most people then sample is to take from the top of that distribution — often just the single highest-probability token. So yes, "blue" is obviously the right answer if you ask anyone on this planet. But there are situations and contexts where "the sky is red" is the appropriate sentence — not in isolation, but if the prior passage is all about Mars and then there's a quote from a Martian settler saying "the sky is," then the correct token is actually "red." The model has to know this through the attention mechanism. If the training text were always just "the sky is blue," you're going to output "blue," because blue is, let's say, 80%, 90%, 99% likely to be the right option. But as you add context about Mars — or any other planet; other planets have differently colored atmospheres, I presume — the distribution starts to shift. If I add "we're on Mars, the sky is," then all of a sudden, given the prior context window — the text you sent to the model, which the attention operates over — blue drops from 99% down to, let's call it, 20% probability, and red rockets up to 80%. The model outputs that distribution, and most people just end up taking the top probability and showing it to the user.
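Here's a tiny sketch of that distribution shifting. The vocabulary and score values are made up; a real model scores every token in a vocabulary of roughly 100,000 entries:

```python
# Turning raw model scores (logits) into a probability distribution,
# then greedily taking the most likely next token. Values are invented.
import numpy as np

vocab = ["blue", "red", "falling", "scraper"]

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

plain = softmax(np.array([4.0, 1.0, 0.5, 0.2]))   # "the sky is ..."
mars  = softmax(np.array([1.0, 4.0, 0.5, 0.2]))   # "...we're on Mars. The sky is ..."

print(dict(zip(vocab, plain.round(2))))           # blue dominates, ~0.91
print(vocab[int(np.argmax(mars))])                # context flips the pick to "red"
```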
And how does the model learn that? It's the attention mechanism, and this is the beauty of modern large language models. Attention takes the relational value, in this vector space, between every single token. When I think about "the sky is," yes, "blue" is the next token, but in a lot of older-style models you would just predict the exact next word, so after "sky" it could be many things — it could be "blue," but it could also be "scraper," right? Skyscraper.

What attention does is take these various values — the query, the key, and the value, which represent what you're looking for, where you're looking, and what the content there is — and calculate mathematically what the relationship is between all of these tokens. Going back to the king/queen representation: the way those two words interact is now calculated, and the way every word in the entire passage you sent ties together is calculated. This is why models have challenges with how many documents you can send them. If you're sending just the question "what color is the sky?", it only has to calculate the attention among those few words. But if you're sending it 30 books of insurance claims and asking it to figure out what's going on — is this a claim or not? — all of a sudden it has to calculate the attention of not just the last five words to each other, but every one of, say, 50,000 words to each other, which ends up being a ton of math.
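A minimal sketch of the attention computation being described, with toy shapes and random values standing in for learned weights. Note the scores matrix compares every token with every other token, which is why cost grows quadratically with document length:

```python
# Scaled dot-product attention over a toy 5-token sequence.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d = 5, 8                        # 5 tokens, 8-dim vectors (toy sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d))        # token embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # stand-ins for learned weights
Q, K, V = X @ Wq, X @ Wk, X @ Wv   # query: what I seek; key: what I offer; value: my content

scores  = Q @ K.T / np.sqrt(d)           # every token against every other: n^2 comparisons
weights = softmax(scores)                # each row: how much to attend to each token
output  = weights @ V                    # mix in the related tokens' content

print(output.shape)                      # (5, 8): one updated vector per token
```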
Back in the day, the best language models were actually a different architecture entirely. But at some point transformers — large language models are primarily based on transformers — rocketed past in capabilities, because they were able to scale and because the hardware got there. And then we were able to scale them so much that we put not just some text in them, not just a lot of books, but the entire internet — which one could view as a microcosm of all human culture, learning, and knowledge, to a large extent, because most books are on the internet and most papers are on the internet, though obviously a lot is missing from the internet too. That's the modern magic: three different things coming together at once — an efficient way for models to relate every word to each other, the compute necessary to scale, and data large enough — and then someone actually pulling the trigger to do it at a scale where it became useful, which was around the GPT-3.5 or GPT-4 level, where it became extremely useful for normal humans to use chat models.

Okay, and so why is it called pre-training?
Pre-training is called that because it's what happens before the model is tuned for actual use. The objective function in pre-training is just to predict the next token. But predicting the next token is not what humans want to use AIs for — I want to ask a question and have it answered. And in most cases, asking a question does not mean the most likely next token is the answer; oftentimes it's another question. For example, if the model ingested the entire SAT and I asked a question, the next tokens would be "A is this, B is this, C is this, D is this." No, I just want the answer.

So it's called pre-training because you're ingesting humongous volumes of text no matter the use case, and you're learning the general patterns across all of language. The model doesn't start out knowing that king and queen relate to each other in these ways and are opposites in those ways. You must get a broad, general understanding of the entire world of text before you're able to do post-training, or fine-tuning, which is: let me train it on more specific data that is specifically useful for what I want it to do. Whether that's chat-style applications — when I ask a question, give me the answer — or handling requests like "teach me how to build a bomb": well, no, I don't want the model to teach anyone how to build a bomb.

And it's not as if you filter all of that out during pre-training. In fact, there's a lot of good, useful data adjacent to how explosives work, because there's a lot of useful information about, say, chemistry, and people want to use the model for chemistry. So you don't want to just filter out everything, leaving the model knowing nothing about the topic, but at the same time you don't want it to output instructions for building a bomb. There's a fine balance. That's why it's "pre": you're still teaching it things and putting things into the model that are theoretically quite bad — books about killing, or war tactics, or wild descriptions of really grotesque things all over the internet, things where you could plausibly say, maybe that's not okay — because you want the model to learn them. First you build the general understanding; then you say, okay, now that you've got a general framework of the world, let's align you, so that with this general understanding you can figure out what is useful for people and what is not, what to respond to and what not to.
So what happens then in the training process? Is the model attempting to make the next prediction and then just trying to minimize loss as it goes?

Right. In the most simple terms, loss is how often you're wrong versus right. You run passages through the model and see how often the model got the next token right. When it got it right, great, reinforce that. When it got it wrong, figure out which "neurons" in the model you can tweak so that when you go through it again, it outputs the correct answer, and then you move the model slightly in that direction.

Now, the challenge is this: I can come up with a simplistic model where the neurons just output "the sky is blue" every single time they see "the sky is." But then it hits a sentence like "the color blue is commonly used on walls because it's soothing" — what's the next word? "Soothing." That's a completely different context, and understanding that blue is soothing, that the sky is blue, and that those two facts aren't directly related to each other but are both related to blue, is very important. So oftentimes you'll run through the training dataset multiple times, because the first time through, maybe you just memorized that the sky is blue, and that the wall is blue, and that when people describe art they often use the color blue. Over time, as you go through all this text in pre-training, yes, you're minimizing loss initially by just memorizing, but because you're constantly overwriting the model, it starts to learn the generalization: blue is a soothing color, it also represents the sky, and it's also used in art for either of those motifs.

That's the goal of pre-training: you don't want to memorize. In school you memorize all the time, and that's not that useful, because you forget what you memorize. But if you get tested on it, then tested again six months later, and again six months after that, eventually you don't memorize it anymore — you just know it innately. You've generalized, and that's the real goal for the model. But generalization isn't something you can directly measure, whereas loss is. You train the model in steps: every step, you input a batch of text, you see where it predicted the right token and where it didn't, you adjust the neurons, and then it's on to the next batch of text, over and over, across trillions of words. And as you step through, you might think, "I'm done" — but if I go back to the first group of text, which was all about the sky being blue, it might now get the answer wrong, because later in training it saw passages about sci-fi and how the Martian sky is red, and overwrote things. Over time, though, as you go through the data multiple times — as you see it on the internet multiple times, in different books, whether scientific or sci-fi — the model starts to learn the representation: on Mars the sky is red because the atmospheric makeup is one way, whereas the atmospheric makeup on Earth is a different way. So the whole point of pre-training is to minimize loss, but the nice side effect is that the model initially memorizes, then stops memorizing and generalizes. And that's the useful pattern that we want.
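As a rough sketch of the loop just described — predict the next token, measure loss, nudge the weights, repeat — here is a deliberately tiny next-token model (assuming PyTorch; the random tokens stand in for internet text):

```python
# A toy pre-training loop: minimize next-token cross-entropy loss.
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim),  # token ID -> vector
                      nn.Linear(dim, vocab_size))     # vector -> score per token
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()   # "loss": how wrong the next-token guesses are

tokens = torch.randint(0, vocab_size, (512,))    # stand-in for a batch of text
for step in range(100):                          # step through batches, over and over
    inputs, targets = tokens[:-1], tokens[1:]    # each token must predict its successor
    logits = model(inputs)                       # a distribution over the vocabulary
    loss = loss_fn(logits, targets)
    opt.zero_grad()
    loss.backward()                              # which "neurons" were responsible?
    opt.step()                                   # move the model slightly that way
    if step % 25 == 0:
        print(step, float(loss))                 # loss falls step by step
```

On data this small the model simply memorizes; the generalization described above only emerges when there is far more text than the model can memorize.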
Okay, that's fascinating. We've touched on post-training a bit, but just to recap: post-training is where, once you have a model that's good at predicting the next word, you give it a personality by inputting sample conversations, so that the model learns to emulate the values you want it to take on?
Yeah. Post-training can be a number of different things. The most simple version is: pay humans to label a bunch of data, take a bunch of example conversations, and train on that data at the end. That example data is useful, but it's not scalable — using humans to train models is just so expensive. So then there's the magic of reinforcement learning and other synthetic-data technologies, where models help teach the model. In post-training you have some example human data, but human data does not scale that fast. The internet is trillions and trillions of words; even if Alex and I wrote words all day long for our whole lives, we'd have millions, maybe hundreds of millions of words written. It's nothing — orders of magnitude off from the number of words required.

So the model takes some of this example data, and then you have various models surrounding the main model you're training. These can be policy models, teaching it: is this what you want, or is that what you want? Reward models: is that a good response or a bad response? Value models: hey, grade this output. All these different models work in conjunction, and different companies have different objective functions. In the case of Anthropic, they want their model to be helpful, harmless, and safe: be helpful, but also don't harm people or anything, and be safe. In other cases, like Grok — Elon's model from xAI — it mostly just wants to be helpful, and maybe it has a little bit of a right lean to it. And for other folks — most AI models are made in the Bay Area, so they tend to lean a bit left; the internet in general leans a little left too, because it skews younger rather than older. All these things affect models.

But it's not just about politics. Post-training is also just about teaching the model. If I say, "the movie where the princess has a slipper and it doesn't fit" — if I fed that into a base model that had only been pre-trained, the response wouldn't be, "the movie you're looking for is Cinderella." It only learns to respond that way through post-training. A lot of times people throw garbage into the model and the model still figures out what you want. You can do stream of consciousness into models, and oftentimes they'll figure out what you want: whether it's a movie you're looking for, or help answering a question, or you throw a bunch of unstructured data at it and ask it to make a table, and it does. That's because of all these different aspects of post-training: example data, but also generating a bunch of data and grading it, seeing whether it's good, whether it matches the various policies you want. A lot of times grading is based on multiple factors: there could be a model that asks, is this helpful? Is this safe? And what counts as safe? That safety model then needs to be tuned on human data. So it is quite complex, but the end goal is to get the model to output in a certain way.

And models aren't always just for humans to chat with, either. There can be models where the rule is: yes, it was trained on the whole internet, because the person will talk to it in text, but if it doesn't output code, penalize it. All of a sudden, the model will basically never output plain text again; it'll only output code. Models like that exist too. So post-training is not a single-variable thing — it's a question of which variables you want to target. That's why models from different companies have different personalities, why they target different use cases, and why it's not one model that rules them all, but many.
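Here is a heavily simplified, hypothetical sketch of that grading loop: a reward model scores candidate answers against policies like helpfulness and safety, and the best-scoring answer is what you would reinforce. None of this is any lab's actual API; the scoring rules are toy stand-ins:

```python
# A toy reward model: grade candidate answers on helpfulness and safety.
def reward(prompt: str, answer: str) -> float:
    score = 0.0
    if answer.strip():                   # helpful: it actually answered
        score += 1.0
    if "bomb" in answer.lower():         # safe: policy penalizes harmful output
        score -= 10.0
    return score

def pick_best(prompt: str, candidates: list[str]) -> str:
    # Generate several answers, grade each, keep (and reinforce) the best.
    return max(candidates, key=lambda a: reward(prompt, a))

print(pick_best("Movie where the slipper doesn't fit?",
                ["", "The movie you're looking for is Cinderella."]))
```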
That's fascinating. So that's why we've seen so many different models with different personalities: it all happens in the post-training phase. And when you talk about giving the models examples to follow, that's what reinforcement learning with human feedback is: humans give some examples and feedback, and the model learns to emulate what the human trainer is interested in having it embody. Is that right?

Yeah, exactly.
Okay, great. All right, so in the first half we've covered what training is, what tokens are, what loss is, and what post-training is — post-training, by the way, is also called fine-tuning. We've also covered reinforcement learning with human feedback. We're going to take a quick break, and then we're going to talk about reasoning. We'll be back right after this.

And we're back here on Big Technology Podcast with Dylan Patel. He's the founder and chief analyst at SemiAnalysis. He actually has great analysis of Nvidia's recent GTC conference, which we covered on a recent episode. You can find SemiAnalysis at semianalysis.com — it's both content and consulting, so definitely check in with Dylan for all of those needs.
And now we're going to talk a little bit about reasoning, because a couple of months ago — and Dylan, this is really where I entered the picture, watching your conversation with Lex and Nathan Lambert — we learned what the difference is between reasoning models and traditional LLMs, large language models. If I gathered it right from your conversation, reasoning is basically this: instead of the model just predicting the next word based off its training, it uses tokens to spend more time figuring out what the right answer is, and then comes out with a new prediction. Karpathy does a very interesting job in his YouTube video talking about how models think with tokens: the more tokens there are, the more compute they use, because they're running these predictions through the transformer model we discussed, and therefore they can come to better answers. Is that the right way to think about reasoning?
So I think humans are also fantastic at pattern matching — we're really good at recognizing things. But for a lot of tasks, the answer isn't an immediate response. We think, whether that's thinking through words out loud, thinking through words in an inner monologue in our heads, or just processing somehow, and then we know the answer. It's the same for models. Models are horrendous at math — historically they have been. You could ask one, "is 9.11 bigger than 9.9?" and it would say yes, it's bigger, even though everyone knows 9.11 is way smaller than 9.9. That happened because models didn't think or reason. And it's the same for you, Alex, or for me: if someone asked me 17 times 34, I wouldn't know it right off the top of my head, but give me a little bit of time, I can do some long-form multiplication and get the answer, because I'm thinking about it.

It's the same with reasoning for models. When you look at a transformer, every token output has the same amount of compute behind it — when I'm generating "the sky is blue," the "the," the "is," and the "blue" each take the same amount of compute to generate. That's not exactly what you want: you want to spend more compute on the hard things and not on the easy things. So reasoning models are effectively teaching large pre-trained models to do this: think through the problem, output a lot of tokens, generate all this text, and then, when you're done, start answering the question — but now you have all of that generated material in your context, and it's helpful, much like any human's thought patterns are. This is the new paradigm we entered maybe six months ago, where models will think for some time before they answer. It enables much better performance on all sorts of tasks — coding, math, understanding science, understanding complex social dilemmas — they're much, much better across many topics. And it's done through post-training, similar to the reinforcement learning from human feedback we mentioned earlier, but there are also other forms of post-training, and that's what makes these reasoning models.
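A back-of-the-envelope sketch of "every token gets the same compute": a common rule of thumb (an assumption here, not a measurement of any particular model) is roughly two floating-point operations per parameter per generated token, so a model that emits thousands of thinking tokens before answering spends proportionally more total compute on that answer:

```python
# Rough arithmetic: thinking tokens buy compute. All figures assumed.
params = 70e9                         # a hypothetical 70B-parameter model
flops_per_token = 2 * params          # ~2N FLOPs per token, a common rule of thumb

direct   = 20 * flops_per_token             # answer in 20 tokens, no thinking
thinking = (2_000 + 20) * flops_per_token   # 2,000 reasoning tokens first

print(f"{thinking / direct:.0f}x more compute on the hard question")  # ~101x
```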
Before we head out, I want to hit on a couple of things. First of all, the growing efficiency of these models. I think one of the things people focused on with DeepSeek was that it was able to be much more efficient in the way it generates answers, and there was obviously this big reaction in Nvidia stock, which fell 18% on the Monday after DeepSeek weekend, because people thought we wouldn't need as much compute. So can you talk a little bit about how models are becoming more efficient and how they're doing it?
Yeah. The beauty of AI is not just that we continue to build new capabilities — though those new capabilities are going to benefit the world in many ways, and there's a lot of focus on them. To get to the next level of capabilities, there are the scaling laws: the more compute and data I spend, the better the model gets. But the other vector is: can I get to the same level with less compute and data? And those two things go hand in hand, because if I can get to the same level with less compute and data, then I can spend more compute and data and get to a new level. So AI researchers are constantly looking for ways to make models more efficient, whether through algorithmic tweaks, data tweaks, tweaks in how you do reinforcement learning, and so on. And when we look at models across history, they've constantly gotten cheaper and cheaper and cheaper, at a stupendous rate.

One easy example is GPT-3. There's GPT-3, then GPT-3.5 Turbo, then Llama 2 7B, Llama 3, Llama 3.1, Llama 3.2. Across those generations, we've gone from "it costs $60 for a million tokens" to "it costs about 5 cents" for the same quality of model, and the model has shrunk dramatically in size as well — because of better algorithms, better data, and so on. What happened with DeepSeek was similar. OpenAI had GPT-4; then they had GPT-4 Turbo, which was half the cost; then they had GPT-4o, which was again half the cost. Then Meta released Llama 3.1 405B open source, so the open-source community could run it, at roughly 5x lower cost than 4o, which was lower than 4 Turbo and 4. And then DeepSeek came out at another tier entirely. When we looked at GPT-3, the cost fell about 1,200x from GPT-3's initial price to what you can get with Llama 3.2 3B today. Likewise, from GPT-4 to DeepSeek V3, it's fallen roughly 600x — so not quite that 1,200x, but from $60 per million tokens down to around a dollar or less. So you've got this massive cost decrease.
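The cost arithmetic as quoted in the conversation, worked through (prices are approximate and as stated here, not independently verified):

```python
# Cost-per-million-token declines quoted above, as rough arithmetic.
gpt3_launch  = 60.00   # $/M tokens around GPT-3's debut
gpt3_today   = 0.05    # $/M tokens for similar quality today (e.g. a small Llama)
print(f"GPT-3-level: {gpt3_launch / gpt3_today:,.0f}x cheaper")   # 1,200x

gpt4_launch  = 60.00   # $/M tokens around GPT-4's debut
gpt4_today   = 0.10    # rough figure implied by the ~600x claim for DeepSeek V3
print(f"GPT-4-level: {gpt4_launch / gpt4_today:,.0f}x cheaper")   # 600x
```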
But it's not necessarily out of bounds; we've seen this before. I think what was really surprising was that it was a Chinese company, for the first time. Google, OpenAI, Anthropic, and Meta have all traded blows — whether it's OpenAI always on the leading edge, or Anthropic on the leading edge, or Google and Meta being close followers, sometimes with a new feature, sometimes just being much cheaper. We had not seen this from any Chinese company, and now we have a Chinese company releasing a model that's cheap. It's not unexpected — it's actually within the trend line: what happened to GPT-3-level quality is happening to GPT-4-level quality with DeepSeek. It's more that it was a Chinese company, and that's why everyone freaked out, and then a lot of things spiraled from there. If Meta had done this, I don't think people would have freaked out. And Meta is going to release their new Llama soon enough, and that one is going to be a similar level of cost decrease — probably a similar area as DeepSeek V3. People just aren't going to freak out, because it's an American company and it was expected.
All right, Dylan, let me ask you the last question. You mentioned the bitter lesson, which is basically — and I'm going to be a bit facetious in summing it up — that the answer to all questions in machine learning is just to make bigger models, and scale solves almost all problems. So it's interesting that we have this moment where models are becoming way more efficient, but we also have massive, massive data-center buildouts. It would be great to hear you recap the size of these data-center buildouts and then answer this question: if we are getting more efficient, why are these data centers getting so much bigger, and what might that added scale get the companies building them in the world of generative AI?
Yeah. So when we look across the ecosystem at data-center buildouts — we track all the buildouts, server purchases, and supply chains here — the pace of construction is incredible. You can pick a state and see new data centers going up, all across the US and around the world. The capacity of the largest-scale training supercomputers has gone from — hey, for GPT-4 it was a few hundred million dollars and one building full of GPUs — to GPT-4.5 and the reasoning models like o1 and o3 being done across three buildings on the same site, for billions of dollars, to these next-generation things people are making costing tens of billions of dollars, like OpenAI's data center in Texas called Stargate, with Crusoe and Oracle. And likewise it applies to Elon Musk, who's building data centers in an old factory, with a bunch of gas generation outside, doing all these crazy things to get the data center up as fast as possible. You can go to basically every company and they have these humongous buildouts.

And this is because of the scaling laws: roughly 10x more compute for each constant step of improvement — it's log-log. But you end up with this very confusing situation: models keep getting better as we spend more, but the model we had a year ago can now be done for way, way cheaper — oftentimes 10x cheaper or more, just a year later.
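A sketch of the log-log relationship alluded to here: loss falls as a power law in compute, so each similar-sized gain costs roughly another 10x of compute. The exponent is an illustrative assumption, not a published fit:

```python
# Power-law scaling: L ~ C^(-alpha). Each 10x of compute shaves a
# similar-sized slice off the loss -- a straight line on a log-log plot.
import numpy as np

alpha = 0.05                       # assumed exponent, for illustration only
compute = np.logspace(0, 5, 6)     # 1x, 10x, ..., 100,000x baseline compute
loss = compute ** (-alpha)

for c, l in zip(compute, loss):
    print(f"{c:>10,.0f}x compute -> relative loss {l:.3f}")
```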
So then the question is: why are we spending all this money to scale? There are a few things here. First, you can't actually make the cheaper model without making the bigger, better model, because the bigger model generates data that helps you make the cheaper one. That's part of it. But another part is this: if we were to freeze AI capabilities where they were in — what was it, March 2023, two years ago, when GPT-4 released — and only made models cheaper, that would not pay for all of these capex buildouts. DeepSeek is much cheaper and much more efficient, but it's roughly GPT-4-level capability. AI is useful today, but it's not capable of doing a lot of things. If, instead, we make the models way more efficient and then continue to scale — this stair step where we increase capabilities massively, make them way more efficient, increase capabilities massively, make them way more efficient — then you end up creating all these new capabilities that could in fact pay for these massive AI buildouts.

No one building these ten-billion-dollar data centers is trying to make chat models — they're not trying to make models people chat with, just to be clear. They're trying to solve things like software engineering and automate it, which is a trillion-dollar-plus industry. Those are very different use cases and targets. And that's the bitter lesson: yes, you can spend a lot of time and effort making clever, specialized methods based on intuition, and you should, but those methods should also just have a lot more compute thrown behind them, because if you make something more efficient as you follow the scaling laws up, it also just gets better, and you can unlock new capabilities. Today, the best AI models — the ones from Anthropic — are useful for coding as an assistant working with you, going back and forth. As time goes forward, as you make them more efficient and continue to scale them, the possibility is that it can code for ten minutes at a time, I just review the work, and it makes me 5x more productive — and so on. That's where reasoning models and the scaling argument come in: yes, we can make it more efficient, but that alone isn't going to solve the problems we have today. The Earth is still going to run short of resources — we're going to run out of nickel because we can't make enough batteries, and without enough batteries, with current technology, we can't replace gas and coal with renewables. All of these things are going to happen unless you continue to improve AI, and AI helps us research and invent new things.
Okay, this is really the last one: where is GPT-5?

So OpenAI released GPT-4.5 recently, a training run they called Orion. There were hopes that Orion could be used as GPT-5, but its improvement wasn't enough to really be a GPT-5. Furthermore, it was trained with the classical method — a ton of pre-training and then some reinforcement learning from human feedback, plus other reinforcement learning like PPO and DPO. But along the way — this model was trained last year — another team at OpenAI made the big breakthrough of reasoning, the Strawberry training, and they released o1 and then o3, and these models are rapidly getting better with reinforcement learning with verifiable rewards. So now GPT-5, as Sam calls it, is going to be a model that has huge pre-training scale, like GPT-4.5, but also huge post-training scale, like o1 and o3, continuing to scale that up. It would be the first time we see a model that's a step up in both at the same time. That's what OpenAI says is coming — this year, hopefully in the next three to six months, maybe sooner. I've heard sooner, but we'll see. This path of massively scaling both pre-training and post-training with verifiable rewards should yield much better models that are capable of much more, and we'll see what those things are.
Very cool. All right, Dylan, do you want to give a quick shout-out to those who are interested in potentially working with SemiAnalysis — who you work with and where they can learn more?

Sure. At semianalysis.com we have the public stuff, which is all these reports that are pseudo-free, but most of our work is done directly for clients. There are datasets we sell covering every data center in the world, servers, all the compute — where it's manufactured, how much, what the cost is, and who's doing it. We also do a lot of consulting. We've got people who have worked everywhere from ASML, which makes lithography tools, all the way up to Microsoft and Nvidia, making models and doing infrastructure, so we cover the whole gamut. There are roughly 30 of us across the world — the US, Taiwan, Singapore, Japan, France, Germany, Canada — so there are a lot of engagement points. If you want to reach out, just go to the website, go to one of those specialized pages for models or sales, and reach out; that's the best way to interact and engage with us. But for most people, just read the blog. Unless you have specialized needs — unless you're a company in the space or an investor in the space — and you just want to be informed, just read the blog, and it's free. I think that's the best option for most people.

Well, I will attest the blog is magnificent, and Dylan, it's really a thrill to get a chance to meet you and talk through these topics with you. So thanks so much for coming on the show.

Thank you so much, Alex.

Everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.