OpenAI's Chief Research Officer on GPT 4.5's Debut, Scaling Laws, And Teaching EQ to Models
Channel: Alex Kantrowitz
Published at: 2025-02-27
YouTube video id: pdfI9MuxWq8
Source: https://www.youtube.com/watch?v=pdfI9MuxWq8
OpenAI Chief Research Officer Mark Chen is here to talk about the release of GPT-4.5, the company's largest and best model yet, which is coming out today. We'll dive in right after this.

Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversation of the tech world and beyond. We're joined today by Mark Chen, the Chief Research Officer at OpenAI, who's here to talk about the company's newest release, GPT-4.5. Yes, it's finally here, and it is debuting today. Mark, great to see you. Welcome to the show.

Thank you so much for having me on.

Thanks for being here. This is, in four and a half years of the show, our first OpenAI interview, so hopefully the first of many. We appreciate you jumping into the water like this, and it's on big news, with the release of GPT-4.5.

Yeah. So GPT-4.5 really signifies the latest milestone in our predictable scaling paradigm. Previous models that have fit this paradigm have been GPT-3, 3.5, 4, and now this is the latest thing. It signifies an order of magnitude improvement over the last models, kind of commensurate with the jump from 3.5 to 4.

I think the question that most of our listeners are going to be asking, and certainly we asked on our show in the past couple months, is: why isn't this GPT-5? I mean, what is it going to take to get to GPT-5?

Yeah. Well, whenever we make these naming decisions, we try to keep with a sense of what the trends are. So again, when it comes to predictable scaling, going from 3 to 3.5, you can kind of predict out what an order of magnitude of improvement in the amount of compute that you train the model with, and in terms of efficiency improvements, will buy you. And we find this model kind of aligns with what 4.5 would be, so we want to name it what it is.

Okay, but there's been so much talk about when GPT-5 is going to come. Correct me if I'm wrong, but I think there's been a longer wait between GPT-4 and 4.5 than there
has been between, let's say, GPT-3.5 and 4. And I don't know, is this because we're seeing a lot of hype from OpenAI folks on Twitter about what's coming next? Or maybe this probably is the most impatient industry in the world, with the most impatient users in the world. But it seems to me like the expectations for GPT-5 are built up pretty high, and so I'm curious, from your perspective, do you think it's going to be hard to meet those expectations whenever that GPT-5 model does come out?

Well, I don't think so, and one of the fundamental reasons is because we now have two different axes on which we can scale. GPT-4.5 is our latest scaling experiment along the axis of unsupervised learning, but there's also reasoning. And when you ask about why there seems to be a little bit bigger of a gap in release time between 4 and 4.5, we've been really largely focused on developing the reasoning paradigm as well. Our research program is really an exploratory research program. We're looking into all avenues of how we can scale our models, and over the last one and a half, two years, we've really found a new, very exciting paradigm through reasoning, which we're also scaling. And so I think GPT-5 really could be the culmination of a lot of these things coming together.

Okay, so you talk about how there's been a lot of work toward reasoning. We of course have seen that with o1. There's a lot of buzz about DeepSeek. And now we're talking about, again, one of the more traditional scaled-up large language models with GPT-4.5. So the big question here, I think, that was on a lot of people's minds when it came to this upcoming release (we thought it was going to be 4.5, 5, anyway, it doesn't matter), the big question is: can AI models continue to scale when you add more compute, more data, and more power to them? It seems like you have an answer to
this, so I'm curious to hear your point of view on what you've learned about the scaling wall, given your development of this model, and whether we're going to hit it, whether we're already seeing some diminishing returns from scaling.

Yeah. I really have a different framing around scaling. When it comes to unsupervised learning, you want to put in more ingredients like compute, algorithmic efficiencies, and more data. And GPT-4.5 really is proof that we can continue the scaling paradigm. And this paradigm is not the antithesis of reasoning, either. You need knowledge in order to build reasoning on top of. A model can't go in blind and just learn reasoning from scratch. So we find these two paradigms to be fairly complementary, and we think they have feedback loops on each other. So GPT-4.5, again, is smart in different ways from the ways that reasoning models are smart. When you look at the model today, it has a lot more world knowledge. When we look at comparisons against GPT-4o, you see that for everyday use cases, people prefer it by a margin of 60%. For actual productivity and knowledge work, against GPT-4o, there's almost a 70% preference rate. So people are really responding to this model, and it's this knowledge that we can leverage for our reasoning models in the future.

So what are the examples? You talk about everyday knowledge work. What are some of the examples that you would use GPT-4.5 for, where you would prefer it over a reasoning model?

Yeah. So I would say it's a different profile from a reasoning model. With a larger model, it takes more time to process and think through the query, but it's also giving you an immediate response back. So this is very similar to what a GPT-4 would have done for you. Whereas I think with
something like o1, you get a model where you give a query and it can think for several minutes. And I think these are fundamentally different trade-offs. You have a model that immediately comes back to you, doesn't do much thinking, but comes up with a better answer, versus a model that thinks for a while and then comes up with an answer. And we find that in a lot of areas, like creative writing for instance (again, this is stuff that we want to test over the next one or two months), there are areas like creative writing where this model outshines reasoning models.

Okay, so writing. Any other use cases?

Yeah, so there's writing, I think some coding use cases as well. We also find that there are some particular scientific domains where this outshines in terms of the amount of knowledge that it can display.

Okay, and I'm going to come back to benchmarks in a moment, but I want to keep on this scaling question, because I think there's been a lot of conversation about it in public, and it's great to be speaking with you from OpenAI to sort of get to the bottom of what's happening. So the first question that folks have is, do you end up at this size... and you don't talk about the size of the models, which is fair, but they're big, right? This is the largest model that OpenAI has ever released, GPT-4.5. So I'm actually curious to hear: at this size, does adding similar amounts of compute, similar amounts of data, get you the same returns that it did, or are we already starting to see the returns of adding these resources tail off?

No, no, we are seeing the same returns. And I do want to stress that GPT-4.5 is the next point on this unsupervised learning paradigm. We're very rigorous about how we do this. We make projections, based on all the models we've trained before, on what performance
to expect. And in this case, we put together the scaling machinery, and this is the point that lies at that next order of magnitude.

So what's it been like getting here? I mean, again, we talked... okay, so there was a period of time that was longer than the last interval, and part of that was focused on reasoning. But there have also been some reports that OpenAI has had to start and stop a couple times to get this to work, and really had to fight through some thorny issues to get it to be this step change, as you're saying. So talk a little bit about the process, and maybe you can confirm or deny some of the things that we've heard about having to start and stop again and retrain to get here.

Actually, I think it's interesting that this is a point that's attributed to this model, because in developing all of our foundation models, they are all experiments. Running all the foundation models oftentimes does involve stopping at certain points, analyzing what's going on, and then restarting the runs. And I don't think that this is a characteristic of GPT-4.5. It's something that we've done with GPT-4, with o-series models. They are largely experiments. We want to go in, diagnose them in the middle, and if we want to make some interventions, we should make interventions. But I wouldn't characterize this as something that we do for GPT-4.5 that we don't do for other models.

We've already talked a little bit about reasoning versus these traditional GPT models, but it makes me think of DeepSeek. And I think you already gave a pretty compelling answer as to what you would use one of these models for versus a reasoning model. But there's another thing that DeepSeek did that is worth discussing, which is that they made their models much more efficient. And it's kind of interesting, when I talked to you about, like, all right, so
you need data, you need compute, you need power, you're like, yeah, and you need model optimizations, which is something that people often overlook. And just going back to DeepSeek for a moment, the model optimization, the fact that they went from basically querying the entire knowledge base to mixture of experts, where they're able to sort of route the queries to certain parts of the model instead of lighting it all up, is credited with helping them get more efficient. So I just want to turn it over to you, without commenting on what they did, or you can if you want. I'm actually more curious what OpenAI is doing on that front: whether you did similar optimizations with GPT-4.5, and are you able to run these large models more efficiently, and if so, how?

Yeah. So I would say the process of making a model efficient to serve, I often see as fairly decoupled from developing the core capability of the model. And we see a lot of work being done on the inference stack. I think that's something that DeepSeek did very well, and it's also something that we push on a lot. We care about serving these models at cheap cost to all users, and we push on that quite a bit. So I think this is irrespective of GPT-4 or reasoning models. We're always applying that pressure to be able to run inference more cheaply, and I think we've done a good job of that over time. The costs have dropped many orders of magnitude since we first launched GPT-4.

And so, I mean, maybe tell me if this is too... but the move towards, for instance, mixture of experts, is that more of a reasoning thing, or can you apply that in GPT?

Yeah. So that is an architectural element of language models. I think pretty much all large language models today utilize mixture of experts, and it's something that applies equally to efficiency wins in foundation models like GPT-4 and 4.5 as it does to reasoning models.

So you
were able to use that here as well, basically?

No, we've definitely explored mixture of experts, as well as a number of other architectural improvements.

Okay, great. So we have a Discord with some members of the Big Technology listener and reader group, and a theme that's come up recently (it's kind of interesting to be talking with you right now about an extremely large model) is that the people in the Discord can't stop talking about how small and niche models, to them, are going to potentially be the future. I'll just read you one comment that we had over the past few days: "For me, the future is very much aligned with niche models existing in workflows, and less so of these general-purpose god models." So clearly OpenAI has a different thesis here, and I am curious to hear your perspective on what we get with the big models versus the niche models, and do you see them in competition or as complements? Help us think through that.

Yeah. So I think one important thing is we also serve models that are smaller. We serve our flagship frontier models, but we also serve mini models, which are cost-efficient ways that you can access capabilities fairly close to frontier capabilities for much lower cost. And we think that's an important part of this comprehensive portfolio. Fundamentally at OpenAI, though, we're in the business of advancing the frontier of intelligence, and that involves developing the best models that we can. What we're motivated by is really pushing that out as much as possible. We think there's always going to be use cases at the frontiers of intelligence. We think that going from the 99.9th percentile in mathematics to the best in the world in mathematics, that difference means something to us. I think what the best human scientists can discover is tangibly different, right,
from what you or I can discover. So we're motivated by pushing the intelligence frontier as far as possible, and at the same time, we want to make these capabilities cheaper and more cost-effective to serve for everyone. So we don't think the niche models will go away. We want to build these foundation models and also figure out how to deliver these capabilities at cost over time. That's always been our philosophy. There's always going to be some juice there in those last bits of intelligence.

Yeah, so let's talk about that, because we have a debate on the show often: what matters more, the products or the model? I'm on team model. We have Ranjan Roy, who comes on on Fridays; he's team product. He's basically like, just take what you have now and productize it. And I say, well, you could probably do more with a better model. But I have to be honest, I'm kind of at a loss for words sometimes about what getting from that 99th percentile in math to the best in the world at math will do. So I actually am curious to hear your answer on this one: what does building the best model in the world do that you couldn't do otherwise?

Yeah, 100%. And I think really it signals a shift. If you just think about, hey, you take the current models and you build the best surface for them, that's certainly something you should always be doing, and exploring that exercise, I think three years ago that looked like chat. We launched ChatGPT. And today, when you take the best models and the best capabilities, I think it looks a little bit more like agents. And I think reasoning and agents are very much coupled. When you think about what makes a good agent, it's something where you can sit back, let it do its own thing, and you're fairly confident it'll come back with something that you want. And I think reasoning is the engine that powers that. You have the model go and try something out,
and if it can't succeed on the first try, it should be able to be like, oh, well, why didn't I succeed, and what's a better approach for me to take? So I think very much the capabilities are always changing, and the surface is always changing as a response, and we're always exploring what the best surface for the current capabilities looks like.

But just to... I'm on your team here.

Yeah.

But again, just to hammer home on this: what does that improvement in the model get you? What do you think it will enable?

Yeah. So I mean, I think agents of all forms. When you look at stuff like deep research, for instance, it gives you the ability to essentially get a fully formed report on any single topic that you might be interested in. I've used it to even put together hour-long talks. It goes and really synthesizes all the information out there, really organizes it, comes up with lessons, allows you to do deep discovery. It allows you to dig into almost any topic that you're interested in. So I feel like just the amount of information and synthesis that's available to you now is really rapidly evolving.

So basically, it's not as simple as just, go make deep research better with the model you have now. Am I reading between the lines the right way, saying that what you're expressing here is that if you make the model better, then the product is going to get better inherently? Take deep research, for instance.

100%, 100%, yeah. And that's something that is not enabled unless you have models of a certain level of capability, both in reasoning and in the foundational unsupervised learning sense.

Okay. You know, it's interesting. I guess this one question I've had in the back of my mind, and I'm just going to ask it to you again just so I'm sure I'm clear on it: my view, maybe erroneously, was that we
were just going to, or your industry was just going to, move from these massive models to the massive models with reasoning. But you're actually saying that there's a dual track here.

Yeah. So I think we're always pushing the frontier, and I think even since five, six years ago, the prevailing way to do that was to up the scale. And so we've been upping the scale in unsupervised learning, we've been upping the scale in reasoning. But at the same time, you care about serving mini models, you care about serving models that are cost-effective, that can deliver capabilities at a cheaper cost, and that will often be sufficient for a lot of use cases. The mission isn't just about pushing the biggest, most costly models. It's about having that and also a portfolio of models that people can use cheaply for their use cases.

Okay, so let's quickly talk, before we leave, about the upgrades that you're seeing in 4.5 compared to 4. I'm curious if you can just run us through, at a very high level, the benchmarks that it hits versus the benchmarks of the previous models. And then I'll just throw a double question in here.

Yeah. Mm-hmm.

I've already read your blog post, and so I have an idea of what's coming. By the way, we're going to release this just as the news is released. It seems like you're also making a statement in some ways, saying, yes, we have the traditional benchmarks, but we also need to measure how this model works with EQ, as opposed to just pure intelligence. So just hit us with the benchmark improvements, and then why you think it's important for us to look at both of these in conjunction.

Yeah. So I mean, along all traditional metrics, things like GPQA, AIME, the traditional benchmarks that we track, this does signify an order of magnitude, about the same level of jump as from 3.5 to 4. There's a kind of interesting focus here also on,
I would say, more vibes-based benchmarks. And I think that's actually important to highlight, because every single time we've launched a model, there is a discovery process of what the interesting use cases out there are going to be. We notice here that it's actually a much more emotionally intelligent model. You can see examples in the blog post later today, but in how it responds to queries about a hard situation, or advice in a particular difficult situation, it responds more emotionally intelligently. This may be a kind of silly example, but if you ask any of the previous models to create ASCII art for you, they mostly just fall down. This one can do it almost flawlessly, pretty well. And so there are just so many footprints of improved capabilities, and I think things like creative writing will showcase this.

One of the things that I think I picked up in the examples that you've given so far is that it doesn't seem like it feels the need to write a thesis for every response. Like, one user was like, "I'm having a hard time," and it actually succinctly wrote as a human would, as opposed to maybe the traditional, you know, here's three paragraphs of self-care routine you can do daily.

Yeah, yeah, yeah. And that speaks to the emotional intelligence. It's not like, "Oh, I see that you're feeling bad. Here are five ways you could feel better." It just doesn't feel like a grounded, compassionate response. And here you just get something that's direct, to the point, and really invites the user to say more.

So I think there's going to be a criticism, I'm anticipating it, and let's talk about it right now: that people will say, okay, OpenAI was talking about these traditional benchmarks, now it's talking about
emotional intelligence. It's shifting the goalposts and wants us to pay attention to something else. What's your response there?

Well, I really don't think the accurate characterization is that it doesn't hit the benchmarks that we expected it to. When you look at the development of 3 to 3.5 to 4 to 4.5, this does hit the benchmarks that we expect. And I think the main thing is, it's all about use case discovery every time you put a new model out there. In many senses, GPT-4 is already very smart. And this parallel is kind of like when we were putting GPT-4 out: we saw it hit all the right benchmarks that we expected it to, but what are users going to resonate with? That was the key question, and I think that's the question that we're asking today with GPT-4.5 as well. We're inviting people to be like, hey, we did some early explorations, we see that it's more emotionally intelligent, we see that it's a better creative writer, but what do you see here?

Yep. All right, Mark. So, we mentioned this before we started recording, I've been seeing you in all the OpenAI videos about every release, so first of all, great to speak to you live. But also, over the past year, we've seen a lot of exodus out of OpenAI. Maybe the media plays it up too much; probably we do. But I am kind of curious what it's like working within OpenAI, and how you see the talent bench inside the company. You recently became Chief Research Officer, just a few months ago, and now, look, we have a new foundational model. So just give us a sense of what the talent situation is.

It's still, I think, the most world-class AI organization. I would say that there's a separation between the talent bar at OpenAI and any other firm out there. And when it comes to people leaving, the AI landscape changes a lot, probably more so
than any other field out there. The field three months ago looks different from the field three months before that, and I think it's just natural in the development of AI that some people will have their own thesis about, here's the way I want to develop AI, and go try it their own way. I think that's healthy, and it also gives an opportunity for people internally to shine. We've never had a shortage of people internally who are willing to step up, and we've seen that a lot. I really just love the bench that we have here.

Very cool. All right, folks, GPT-4.5 is out today for OpenAI Pro users. Next week, it's coming out for Plus, Team, Enterprise, and Edu. Mark, great to see you. Thank you again for spending time. You're about to go and do the live stream, so I'm very grateful that you spent the time with me today.

Thanks so much. I really appreciate your time too. Thanks for having me.

Well, let's do it again soon. And folks, we shouted out the Ranjan and I argument. We'll go into that and more, everything we can share about GPT-4.5, coming up tomorrow on the Friday show. Thanks for listening, thanks again to Mark and OpenAI for the interview, and we'll see you next time on Big Technology Podcast.