Generative AI 101: Tokens, Pre-training, Fine-tuning, Reasoning — With SemiAnalysis CEO Dylan Patel
Channel: Alex Kantrowitz
Published at: 2025-04-23
YouTube video id: QUIgIBx5NDQ
Source: https://www.youtube.com/watch?v=QUIgIBx5NDQ
If you've ever wondered how generative AI works and where the technology is heading, this episode is for you. We're going to explain the basics of the technology and then catch up with modern advances like reasoning, to help you understand exactly how it does what it does and where it might go next. That's coming up with SemiAnalysis founder and chief analyst Dylan Patel, right after this.

Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation about the tech world and beyond. We're joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in semiconductor and generative AI research, and someone I've been looking forward to speaking with for a long time now. I want this to be an episode that (a) helps people learn how generative AI works and (b) is something people will send to their friends to explain generative AI. I've had a couple of those that I've been sending to my friends, colleagues, and counterparts. One is the three-and-a-half-hour video from Andrej Karpathy explaining everything about training large language models. The other is a great episode that Dylan and Nathan Lambert of the Allen Institute for AI did with Lex Fridman. Both of those are three hours plus, so I want to do it all in an hour, and I'm very excited to begin. Dylan, it's great to see you, and welcome to the show.

Thank you for having me.

Great to have you here. Let's start with tokens. Can you explain how AI researchers take words, and parts of words, and give them numerical representations? What are tokens?

Tokens are in fact chunks of words. In human terms, you can think of syllables: syllables are chunks of words, they carry some meaning, and they're the base level of speech. For models, tokens are the base level of output. They're about compression: the most efficient representation of language.

From my understanding, AI models are very good at predicting patterns. So if you give one 1, 3, 5, 7, 9, it might know the next number is going to be 11. And what it's doing with tokens is taking words, breaking them down into their component parts, assigning them numerical values, and then, in its own language, learning to predict what number comes next, because computers are better at numbers, and then converting that number back into text, and that's what we see come out. Is that accurate?

Yeah. And each individual token is actually not just one number. You can think of it as a vector with many values, because the model needs to learn that king and queen are extremely similar across most of the English language, extremely similar, except there's one dimension in which they're super different, because a king is male and a queen is female.
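To make those two steps concrete before the thought continues, here is a minimal sketch: text becomes integer token IDs, and each ID is looked up in a table of learned vectors. The tokenizer is OpenAI's open-source tiktoken library; the tiny embedding table is hand-made purely to illustrate the king/queen idea, not learned from any data.

```python
# Minimal sketch of tokenization plus embeddings. The tokenizer is the
# real open-source tiktoken library (pip install tiktoken); the
# embedding table below is a hand-made toy, not learned weights.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("The sky is blue.")
print(ids)                              # one integer per chunk of text
print([enc.decode([i]) for i in ids])   # the text chunk each integer stands for
assert enc.decode(ids) == "The sky is blue."  # IDs are a lossless encoding

# Toy embeddings: similar on the first two ("royalty"-ish) dimensions,
# opposite on a third ("gender"-ish) one. Real models learn thousands
# of dimensions, and nobody assigns their meanings by hand.
emb = {
    "king":  np.array([0.9, 0.8,  0.4]),
    "queen": np.array([0.9, 0.8, -0.4]),
    "wall":  np.array([-0.2, 0.1, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["king"], emb["queen"]))  # ~0.80: mostly similar
print(cosine(emb["king"], emb["wall"]))   # ~-0.35: unrelated
```

Because the words are now numbers, you can do math on them, which is exactly the point Dylan makes next.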
And from there, in language, kings are often written about as conquerors, and all these other historical associations pile on. While king and queen are both royal, regal, monarchy and so on, there are many dimensions in which they differ. So it's not converting a word into one number; it's converting it into a vector of many values, and the model learns what each value means. You don't initialize the model with "king means male monarch, associated with war and conquering." That association exists because that's what most of the writing about kings covers; people don't write much about the daily lives of kings, they mostly write about their wars and conquests. Each of these values in the embedding space gets assigned over time: as the model reads the internet's text and trains on it, it starts to realize, oh, king and queen are exactly similar along these dimensions but very different along those. And you never explicitly tell the model what a dimension is for. One dimension could end up meaning something like "is it a building or not," and you don't know that ahead of time; it just happens in the latent space, and all these dimensions relate to each other. These numbers are an efficient representation of words because you can do math on them: you can multiply them, divide them, run them through an entire model.

Your brain does something similar. When it hears something, it converts the sound into frequencies in your ears, and those get converted into signals that travel through your brain. A tokenizer plays the same role, although it's obviously a very different medium of compute: ones and zeros and multiplication for computers, whereas human brains are more analog in nature and think more in waves and patterns. Language is not actually how our brain thinks; it's just a representation for it to reason over.

That's crazy. So tokens are the efficient representation of words, but more than that, the models are also learning the way all these words are connected. And that brings us to pre-training. From my understanding, pre-training is when you take basically the entire internet's worth of text and use it to teach the model these representations between tokens. So, like we talked about: if you gave a model "the sky is," and the next word is typically "blue" across the pre-training data, which is basically all of the language on the internet, it should know that the next token is "blue." You want to make sure that when the model outputs information, it's closely tied to what that next value should be. Is that a proper description of what happens in pre-training?

Yeah, I think that's pretty much it. That's the objective function, which is just to reduce loss, i.e.,
how often the token is predicted incorrectly versus correctly.

Right. So if you said "the sky is red," that's not the most probable outcome, so that would be wrong.

But that text is on the internet, because the Martian sky is red and there are all these books about Mars and sci-fi. So how does the model learn to figure out in which context it's accurate to say blue, and in which red? First of all, the model doesn't just output one token; it outputs a distribution. The way most people use it, they take the top of that distribution, i.e., the highest-probability token. Yes, "blue" is obviously the right answer if you ask anyone on this planet. But there are situations and contexts where "the sky is red" is the appropriate sentence. That's never in isolation: if the prior passage is all about Mars, and then there's a quote from a Martian settler that ends "the sky is," the correct next token is actually "red." The model has to know this through the attention mechanism. If the input is just "the sky is," you're going to output "blue," because blue is, say, 80, 90, 99 percent likely to be the right option. But as you add context about Mars, or any other planet with a differently colored atmosphere, the distribution starts to shift. If I add "we're on Mars, the sky is," then within that context window, the text you sent to the model and the attention over it, blue rockets down to, let's call it, 20 percent probability and red rockets up to 80 percent. The model outputs that distribution, and most people end up taking the top probability and showing it to the user.

How does the model learn that? The attention mechanism, and this is the beauty of modern large language models. It takes the relational value, in this vector space, between every single token. A lot of older-style models would predict the next word from just the previous word, and after "sky" alone it could be many things: it could be "blue," but it could also be "scraper," as in skyscraper. What attention does is take these various values, the query, the key, and the value, which represent roughly what you're looking for, where you're looking, and what information is there, and calculate mathematically the relationship between all of these tokens. Going back to the king and queen representation: the way those two words interact is now calculated, and the way every word in the entire passage ties to every other word is calculated. Which is why models have challenges with how long a context, how many documents, you can send them.
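That query/key/value computation can be written in a few lines. Below is a minimal NumPy sketch of single-head scaled dot-product attention, with random vectors standing in for learned projections; it illustrates the mechanism, it is not a trained model.

```python
# Single-head scaled dot-product attention over a toy sequence.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                   # e.g. the 5 tokens of "we're on Mars the sky"
Q = rng.normal(size=(n_tokens, d))   # queries: what each token is looking for
K = rng.normal(size=(n_tokens, d))   # keys: what each token can be found by
V = rng.normal(size=(n_tokens, d))   # values: the information each token carries

scores = Q @ K.T / np.sqrt(d)        # every token scored against every other
scores -= scores.max(axis=-1, keepdims=True)   # standard softmax stabilization
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True) # each row is a distribution

out = weights @ V                    # each token's output mixes all the values
print(weights.shape, out.shape)      # (5, 5) attention matrix, (5, 8) outputs

# Note the (n_tokens, n_tokens) score matrix: the work grows with the
# square of the input length. 7 tokens need ~49 scores; a 50,000-token
# pile of documents needs 2.5 billion, per layer, per head.
```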
Think about why. If you send just the question "what color is the sky?", the model only has to calculate the attention between those few words. But if you send it 30 books of insurance claims and ask, "okay, figure out what's going on here, is this a claim or not?", all of a sudden it has to calculate the attention of not just the last five words to each other, but of every one of, say, 50,000 words to each other, which ends up being a ton of math.

Back in the day, the best language models were a different architecture entirely. But at some point transformers, which modern large language models are primarily based on, rocketed past in capability, because they were able to scale and because the hardware got there. And we were able to scale them so much that we could put not just some text in them, not just a lot of books, but the entire internet, which one can view as a microcosm of all human culture, learning, and knowledge, to many extents: most books are on the internet, most papers are on the internet. Obviously a lot is missing from the internet, too. But this is the modern magic: several things coming together at once. An efficient way for models to relate every word to every other word, the compute necessary to scale, data at large enough scale, and then someone actually pulling the trigger to train at a scale where it became useful, which was around GPT-3.5 or GPT-4 level, where chat models became extremely useful for normal humans.

Okay, so why is it called pre-training?

Pre-training is called that because it's what happens before the actual training of the model. The objective function in pre-training is just to predict the next token. But predicting the next token is not what humans want to use AIs for. I want to ask a question and have it answered, and in most cases the most likely next token after a question is not the answer; oftentimes it's another question. For example, if I ingested the entire SAT and then asked a question, the next tokens would be "A is this, B is this, C is this, D is this." No, I just want the answer. So it's called pre-training because you're ingesting humongous volumes of text no matter the use case, and learning the general patterns across all of language; the model doesn't start out knowing how king and queen relate, it learns that from everything it reads. You must get a broad, general understanding of the entire world of text before you can do post-training, or fine-tuning, which is: let me train it on more specific data that is specifically useful for what I want it to do.
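Worth pinning down how "just predict the next token" supplies its own training labels: every position in a text is an example whose label is simply the token that follows. A toy sketch with made-up token IDs:

```python
# The pre-training objective: the text labels itself. Every prefix of
# the sequence is an input, and the following token is its target.
tokens = [101, 42, 7, 900, 13]   # made-up IDs for "The sky is blue ."

targets = tokens[1:]             # what must be predicted at each position
for i, target in enumerate(targets, start=1):
    print(f"given {tokens[:i]} -> predict {target}")

# No human labels and no use case in sight, which is why this step can
# run over trillions of tokens of internet text.
```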
In chat-style applications, that means: when I ask a question, give me the answer. In other applications it means handling something like "teach me how to build a bomb." Obviously, no, I don't want the model to teach you how to build a bomb. But it's not as if you filter all of that out of pre-training, because there's a lot of good, useful data nearby; there's a lot of useful information about, say, the chemistry of C4, and people want to use the model for chemistry. So you don't want to filter out everything, leaving the model knowing nothing about the topic, but at the same time you don't want it to output how to build a bomb. There's a fine balance here, and that's part of why pre-training is "pre": you're still teaching the model things, inputting things, that are theoretically quite bad. For example, books about killing, or war tactics, things where you could plausibly say, well, maybe that's not okay, or wild descriptions of really grotesque things from all over the internet. You want the model to learn these things, because first you build the general understanding, and then you align it, so that with that general understanding of the world it can figure out: what is useful for people, what is not, what should I respond to, and what should I not respond to.

So what happens in the training process? Is it that the model attempts the next prediction and then just tries to minimize loss as it goes?

Right. In the simplest terms, loss is how often you're wrong versus right. You run passages through the model and see how often it got the next token right. When it got it right, great, reinforce that. When it got it wrong, figure out which "neurons" in the model to tweak so that, when the passage goes through again, it outputs the correct answer, and you move the model slightly in that direction.
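That "figure out which neurons to tweak, move the model slightly" step is gradient descent on the loss. Here is a toy version with a single weight matrix and a three-word vocabulary; real models do the same thing across billions of weights.

```python
# A toy next-token predictor: cross-entropy loss, one gradient update.
# All sizes and values are made up for illustration.
import numpy as np

vocab = ["blue", "red", "green"]
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), 16))  # the "neurons" we tweak
x = rng.normal(size=16)          # stand-in for the embedding of "the sky is"
target = vocab.index("blue")     # the token that actually came next

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(W @ x)                 # distribution over the next token
print("p(blue) before:", round(probs[target], 3))
print("loss:", round(-np.log(probs[target]), 3))  # low when "blue" is likely

# Gradient of cross-entropy w.r.t. W is (probs - onehot(target)) x^T.
onehot = np.eye(len(vocab))[target]
W -= 0.1 * np.outer(probs - onehot, x)  # nudge the weights the right way

print("p(blue) after: ", round(softmax(W @ x)[target], 3))  # higher now
```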
Now, the obvious challenge: I could build a simplistic model whose neurons output "blue" every single time they see "the sky is." But then it hits "the color blue is commonly used on walls because it's soothing," where the next word is "soothing." That's a completely different representation, and understanding that blue is soothing, that the sky is blue, and that those are related through "blue" without being the same fact, is very important. So oftentimes you run through the training dataset multiple times. The first time through, maybe the model memorizes that the sky is blue, that the wall is blue, and that when people describe art they often use the color blue. Over time, as it goes through all this text in pre-training, yes, it minimizes loss initially by memorizing, but because you're constantly overwriting the model, it starts to learn the generalization: blue is a soothing color, it also represents the sky, and it's used in art for either of those motifs. The goal of pre-training is that you don't want memorization. In school you memorize all the time, and it's not useful, because you forget everything you memorize. But if you get tested on it, then tested again six months later, and again six months after that, eventually you don't memorize it anymore; you just know it innately. You've generalized. That's the real goal for the model.

But generalization isn't something you can measure directly, and loss is something you can measure. You train the model in steps: every step, you input a batch of text, see where it predicted the right token and where it didn't, adjust the neurons, and move to the next batch. You do these batches over and over, across trillions of words of text. And as you step through, you might think you're done, but if you go back to the first group of text, all about the sky being blue, it may now get the answer wrong, because later in training it saw sci-fi passages about the Martian sky being red and overwrote things. Over time, though, as you go through the data multiple times, seeing it on the internet, in different books, scientific and sci-fi alike, it learns the full representation: the sky on Mars is red because the atmospheric makeup is one way, and on Earth it's blue because the makeup is another. So the whole point of pre-training is to minimize loss, and the nice side effect is that the model initially memorizes, then stops memorizing and generalizes. That's the useful pattern we want.
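That batch-after-batch, multiple-passes loop looks like this in skeleton form. `model_loss_and_update` is a hypothetical stand-in for the forward pass and gradient step sketched above, and the printed losses are placeholders, not real training dynamics.

```python
# Skeleton of the outer loop: step through the data in batches,
# update weights every step, revisit the whole dataset several times.
import random

def model_loss_and_update(batch):
    # Hypothetical stub: run the batch through the model, measure how
    # often the next token was wrong, nudge the weights, return loss.
    return random.random()  # placeholder value only

dataset = [f"batch_{i}" for i in range(1_000)]  # stand-in for trillions of words

for epoch in range(3):       # multiple passes: early passes memorize,
    total = 0.0              # repeated overwriting pushes toward generalization
    for batch in dataset:
        total += model_loss_and_update(batch)
    print(f"epoch {epoch}: mean loss {total / len(dataset):.3f}")
```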
Okay, that's fascinating. We've touched on post-training a bit, but just to recap: post-training is, you have a model that's good at predicting the next word, and in post-training you give it a personality, by inputting sample conversations, so that the model emulates the values you want it to take on.

Yeah. Post-training can be a number of different things. The simplest way to do it is to pay humans to label a bunch of data: take a bunch of example conversations, and train on that data at the end. That example data is useful, but it's not scalable; using humans to train models is just so expensive. So then there's the magic of reinforcement learning and other synthetic-data technologies, where models help teach the model. In post-training, yes, you have some example human data, but human data does not scale, because the internet is trillions and trillions of words, whereas if you and I wrote words all day long for our whole lives, we'd produce millions, maybe hundreds of millions. It's orders of magnitude off from the number of words required. So you take some example data, and you surround the main model you're training with various other models. These can be policy models, reward models ("is that a good response or a bad one?"), value models ("grade this output"), all working in conjunction. And different companies have different objective functions. In the case of Anthropic, they want their model to be helpful, harmless, and safe: be helpful, but don't harm anyone or anything. In other cases, like Grok, Elon's model from xAI, it mostly just wants to be helpful, and maybe has a little bit of a right lean to it. For other folks: most AI models are made in the Bay Area, so they tend to lean a bit left, and the internet in general leans a little left, because it skews young. All of these things affect models.

But it's not just about politics. Post-training is also just teaching the model. If I say, "the movie where the princess has a slipper and it doesn't fit," and I feed that to a base model that has only been pre-trained, the response wouldn't be "the movie you're looking for is Cinderella." It only learns to respond that way through post-training. A lot of times people throw garbage into the model and it still figures out what you want; you can do stream of consciousness into a model and oftentimes it figures out what you want.
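Concretely, those "example conversations" get serialized into training text with role markers. The template below is purely illustrative; every lab uses its own chat format.

```python
# Example conversations as post-training data: each one becomes a
# single training string with role markers. Illustrative format only.
example_conversations = [
    [("user", "The movie where the princess has a slipper that doesn't fit?"),
     ("assistant", "That's Cinderella.")],
    [("user", "Make this a table: apples 3, pears 5"),
     ("assistant", "| item | count |\n| apples | 3 |\n| pears | 5 |")],
]

def to_training_text(conversation):
    parts = [f"<|{role}|>\n{text}" for role, text in conversation]
    return "\n".join(parts) + "\n<|end|>"

for conv in example_conversations:
    print(to_training_text(conv), "\n---")

# Fine-tuning on text like this is what shifts the model from
# "continue the internet" to "answer the question you asked".
```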
Whether it's a movie you're looking for, or help answering a question, or a pile of unstructured data you throw in and ask it to turn into a table, the model does this because of all these different aspects of post-training: example data, but also generating a bunch of data and grading it, seeing whether it's good and whether it matches the policies you want. Grading can be based on multiple factors: one model asks "is this helpful?", another asks "is this safe?", and deciding what "safe" means requires that safety model itself to be tuned on human data. So it's quite complex, but the end goal is to get the model to output in a certain way. And models aren't always for humans to talk to, either. There can be models focused purely on code: yes, it was trained on the whole internet, because the person will talk to it in text, but if it doesn't output code, penalize it. All of a sudden the model will basically never output plain prose again, only code. Those models exist too. So post-training is not a single-variable thing; the question is which variables you want to target. That's why models from different companies have different personalities, why they target different use cases, and why there isn't one model that rules them all, but many.

That's fascinating. So the reason we've seen so many models with different personalities is that it all happens in post-training. And when you talk about giving the models examples to follow, that's what reinforcement learning from human feedback is: humans give some examples, and the model learns to emulate what the human trainer wants it to embody. Is that right?

Yeah, exactly.

Okay, great. So in the first half we've covered what training is, what tokens are, what loss is, and what post-training is, which, by the way, is also called fine-tuning. We've also covered reinforcement learning from human feedback. We're going to take a quick break, and then we'll talk about reasoning. We'll be back right after this.

And we're back on Big Technology Podcast with Dylan Patel, founder and chief analyst at SemiAnalysis. He has great analysis of Nvidia's recent GTC conference, which we covered on a recent episode. You can find SemiAnalysis at semianalysis.com; it's both content and consulting, so definitely check in with Dylan for all of those needs. Now let's talk about reasoning, because, Dylan, this is really where I entered the picture: watching your conversation with Lex Fridman and Nathan Lambert about the difference between reasoning models and traditional LLMs, large language models. If I gathered it right from that conversation, reasoning is basically: instead of the model just predicting the next word based off its training, it uses tokens to spend more time figuring out what the right answer is, and then comes out with a new prediction.
I think Karpathy, in that YouTube video, does a very interesting job talking about how models think with tokens: the more tokens there are, the more compute they use, because they're running those predictions through the transformer model we discussed, and therefore they can come to better answers. Is that the right way to think about reasoning?

So, humans are fantastic at pattern matching; we're really good at recognizing things. But for a lot of tasks, the answer isn't an immediate response. We think, whether by talking through it out loud, running an inner monologue, or just processing somehow, and then we know the answer. It's the same for models. Models have historically been horrendous at math. You could ask, "is 9.11 bigger than 9.9?", and the model would say yes, even though everyone knows 9.11 is smaller than 9.9. That happened because models didn't think or reason. And it's the same for you, Alex, or for me: if someone asks me 17 times 34, I don't know it off the top of my head, but give me a little time, I can do some long-form multiplication and get the answer, because I'm thinking about it.

It's the same with reasoning for models. In a transformer, every output token has the same amount of compute behind it: when I'm generating "the sky is blue," the "the," the "is," and the "blue" each take the same compute to produce. That's not what you want; you want to spend more compute on the hard things and less on the easy things. Reasoning models effectively teach large pre-trained models to do that: think through the problem, output a lot of tokens, generate all that text, and only then start answering the question, now with all that generated thinking in context. The generated text is helpful, much like any human's thought patterns are. This is the new paradigm we entered maybe six months ago, where models think for some time before they answer, and it enables much better performance on all sorts of tasks: coding, math, science, complex social dilemmas. And it's achieved through post-training, similar to the reinforcement learning from human feedback we mentioned earlier, along with other forms of post-training. That's what makes these reasoning models.
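The compute arithmetic behind thinking tokens is worth making explicit. Using the common rule of thumb that generating one token costs roughly two floating-point operations per model parameter (the model size and token counts below are made up for illustration):

```python
# "Thinking" tokens buy more compute per answer, since each generated
# token costs roughly the same amount. Rule of thumb: ~2 FLOPs per
# parameter per token; model size and token counts are illustrative.
params = 70e9                           # a hypothetical 70B-parameter model

def generation_flops(n_tokens: int) -> float:
    return 2 * params * n_tokens        # ignores attention overhead

direct = generation_flops(5)            # blurt out a 5-token answer
reasoned = generation_flops(500 + 5)    # think for 500 tokens, then answer

print(f"direct:   {direct:.2e} FLOPs")
print(f"reasoned: {reasoned:.2e} FLOPs ({reasoned / direct:.0f}x the compute)")
# Same weights either way: the extra capability comes from spending
# ~100x more compute and from having the thoughts sitting in context.
```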
Before we head out, I want to hit on a couple of things. First, the growing efficiency of these models. One of the things people focused on with DeepSeek was that it was much more efficient in the way it generates answers, and there was obviously a big reaction: Nvidia stock fell about 17 percent the Monday after DeepSeek weekend, because people thought we wouldn't need as much compute. Can you talk about how models are becoming more efficient, and how they're doing it?

Yeah. The beauty of AI is not just that we keep building new capabilities, which will benefit the world in many ways. There's a lot of focus on reaching the next level of capability via the scaling laws, i.e., the more compute and data I spend, the better the model gets. But the other vector is: can I get to the same level with less compute and data? Those two things go hand in hand, because if I can reach the same level with less compute and data, I can then spend more compute and data and reach a new level. So AI researchers are constantly looking for ways to make models more efficient, whether through algorithmic tweaks, data tweaks, tweaks to how you do reinforcement learning, and so on. Across history, models have constantly gotten cheaper, at a stupendous rate.

One easy example is the GPT-3 lineage: GPT-3, then GPT-3.5 Turbo, then Llama 2 7B, Llama 3, Llama 3.1, Llama 3.2. Across those generations we've gone from "it costs $60 for a million tokens" to about 5 cents for the same quality of model, and the models have shrunk dramatically in size as well, because of better algorithms, better data, and so on. What happened with DeepSeek was similar. OpenAI had GPT-4, then 4 Turbo at half the cost, then 4o at half the cost again. Then Meta released Llama 3.1 405B open source, which the open-source community could run at roughly 5x lower cost than 4o. And then DeepSeek came out with another tier. For GPT-3, the cost fell about 1,200x from its initial price to what you can get with Llama 3.2 3B today. Likewise, from GPT-4 to DeepSeek V3, it's fallen from $60 to about a dollar, roughly 60x. We're not at that 1,200x yet, but it's a massive cost decrease, and it's not out of bounds; we've seen it before.

What was really surprising, I think, was that it was a Chinese company for the first time. Google, OpenAI, Anthropic, and Meta have all traded blows, whether it's OpenAI or Anthropic on the leading edge, or Google and Meta as close followers, sometimes with a new feature and sometimes just much cheaper. We had not seen that from any Chinese company, and now a Chinese company has released a model that's cheap. It's not unexpected, though: it's within the trend line. What happened to GPT-3 is happening to GPT-4-level quality with DeepSeek.
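Those decline factors check out as simple ratios, using the approximate per-million-token prices from the conversation:

```python
# Cost declines from the conversation, as ratios. Prices are the
# approximate per-million-token figures mentioned above.
gpt3_launch, gpt3_class_today = 60.00, 0.05  # GPT-3 -> Llama 3.2 3B class
gpt4_launch, gpt4_class_today = 60.00, 1.00  # GPT-4 -> DeepSeek V3 class

print(f"GPT-3-class: {gpt3_launch / gpt3_class_today:,.0f}x cheaper")  # 1,200x
print(f"GPT-4-class: {gpt4_launch / gpt4_class_today:,.0f}x cheaper")  # 60x
```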
So the surprise is more that it's a Chinese company, and I think that's why everyone freaked out, and from there it became a whole thing. If Meta had done this, I don't think people would have freaked out. And Meta is going to release its new Llama soon enough, with a similar level of cost decrease, probably in a similar range as DeepSeek V3. People won't freak out over that, because it's an American company and it was expected.

All right, Dylan, let me ask the last question. You mentioned the bitter lesson, which, to be a little facetious in summing it up, says the answer to all questions in machine learning is just to make bigger models: scale solves almost all problems. So it's interesting that we're in a moment where models are becoming way more efficient, yet we also have massive data center buildouts. It would be great to hear you recap the size of these buildouts and then answer this: if we're getting more efficient, why are the data centers getting so much bigger, and what might that added scale get, in the world of generative AI, for the companies building them?

Yeah. When we look across the ecosystem at data center buildouts, and we track all the buildouts, server purchases, and supply chains here, the pace of construction is incredible. You can pick a state and see new data centers going up, all across the US and around the world. The capacity of the largest-scale training supercomputers has gone from, for GPT-4, a few hundred million dollars and one building full of GPUs, to GPT-4.5 and the reasoning models like o1 and o3 being trained in three buildings on the same site, billions of dollars, to the next-generation projects that are tens of billions of dollars, like OpenAI's data center in Texas called Stargate, with Crusoe and Oracle. The same applies to Elon Musk, who's building data centers in an old factory with a bunch of gas generation outside, doing all these crazy things to get the data center up as fast as possible. You can go to basically every company and find these humongous buildouts.

And because of the scaling laws (10x more compute for a roughly linear improvement; it's log-log), you end up with a very confusing picture: models keep getting better as we spend more, but the model we had a year ago can now be built way, way cheaper, often 10x cheaper or more, just a year later. So why are we spending all this money to scale? A few reasons. First, you can't actually make the cheaper model without first making the bigger, better model, which generates data that helps you make the cheaper one. That's part of it.
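On that "log-log" aside: along a scaling law, loss falls as a power of compute, so equal capability steps cost multiplicative compute. A sketch with purely illustrative constants, not fitted to any real model:

```python
# Power-law scaling-law shape: L(C) = a * C^(-alpha). Constants are
# illustrative only; real fitted values differ by model family.
def loss(compute: float, a: float = 10.0, alpha: float = 0.05) -> float:
    return a * compute ** -alpha

for c in (1e21, 1e22, 1e23, 1e24):   # each row has 10x the compute of the last
    print(f"compute {c:.0e} -> loss {loss(c):.3f}")
# Each 10x of compute buys a roughly similar absolute drop in loss:
# linear progress on a log-compute axis, hence the ever-bigger buildouts.
```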
But another part of it is this: if we had frozen AI capabilities where they were in, what was it, March 2023, two years ago when GPT-4 released, and only made models cheaper (DeepSeek is much cheaper and much more efficient, but roughly GPT-4 capabilities), that would not pay for all of these capex buildouts. AI is useful today, but it can't do a lot of things. If instead you make the models way more efficient and then continue to scale, you get a stair-step: increase capabilities massively, make them far more efficient, increase capabilities massively, make them far more efficient again. Follow that stair-step and you create new capabilities that can in fact pay for these massive AI buildouts. To be clear, no one building a $10 billion data center is trying to make a chat model. They're trying to solve things like software engineering and make it automated, which is a trillion-dollar-plus industry. Very different use cases and targets.

And it's the bitter lesson because, yes, you can spend a lot of time and effort making clever, specialized methods based on intuition, and you should. But those methods should also have a lot more compute thrown behind them, because if you make a method more efficient and follow the scaling laws up, it just gets better, and you unlock new capabilities. Today the best AI models, like Anthropic's, are useful for coding as an assistant working alongside you, going back and forth. As they get more efficient and continue to scale, the possibility is that the model codes for ten minutes at a time while you just review the work, making you 5x more efficient, and so on. That's where reasoning models and the scaling argument come together: yes, we can make models more efficient, but efficiency alone doesn't solve the problems we have today. The Earth is still going to run short of resources; we'll run out of nickel because we can't make enough batteries, and without enough batteries we can't, with current technology, replace gas and coal with renewables. All of that happens unless you continue to improve AI, and AI helps us research new things.

Okay, this is really the last one: where is GPT-5?

So, OpenAI recently released GPT-4.5, a training run they called Orion. There were hopes that Orion could be GPT-5, but its improvement wasn't enough to really be a GPT-5. Furthermore, it was trained with the classical method: a ton of pre-training, then some reinforcement learning with human feedback and other reinforcement learning like PPO and DPO. But along the way (this model was trained last year), another team at OpenAI made the big breakthrough of reasoning:
the Strawberry work. They released o1, then o3, and those models are rapidly getting better with reinforcement learning with verifiable rewards. So GPT-5, as Sam calls it, is going to be a model with huge pre-training scale, like GPT-4.5, but also huge post-training scale, like o1 and o3, and they'll keep scaling that up. It would be the first time we see a model that's a step up in both at the same time. That's what OpenAI says is coming, they say this year, hopefully in the next three to six months, maybe sooner. I've heard sooner, but we'll see. This path of massively scaling both pre-training and post-training with reinforcement learning with verifiable rewards should yield much better models, capable of much more, and we'll see what those things are.

Very cool. All right, Dylan, do you want to give a quick shout-out for anyone interested in working with SemiAnalysis: who you work with and where they can learn more?

Sure. At semianalysis.com we have the public stuff, all these reports that are pseudo-free, but most of our work is done directly for clients. There are datasets we sell covering every data center in the world, servers, all the compute: where it's manufactured, how much, what it costs, and who's doing it. We also do a lot of consulting. We've got people who have worked everywhere from ASML, which makes lithography tools, all the way up to Microsoft and Nvidia, making models and running infrastructure. So we've got the whole gamut of folks; there are roughly 30 of us across the world, in the US, Taiwan, Singapore, Japan, France, Germany, and Canada. There are a lot of engagement points, but if you want to reach out, just go to the website, go to one of the specialized pages on models or sales, and get in touch. That's the best way to engage with us. But for most people, just read the blog. Unless you have specialized needs, unless you're a company or an investor in the space, if you just want to be informed, read the blog; it's free. I think that's the best option for most people.

Well, I will attest the blog is magnificent, and Dylan, it's really been a thrill to get the chance to meet you and talk through these topics. Thanks so much for coming on the show.

Thank you so much, Alex.

Everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.