OpenAI's Chief Research Officer on GPT 4.5's Debut, Scaling Laws, And Teaching EQ to Models

Channel: Alex Kantrowitz

Published at: 2025-02-27

YouTube video id: pdfI9MuxWq8

Source: https://www.youtube.com/watch?v=pdfI9MuxWq8

OpenAI Chief Research Officer Mark Chen is here to talk about the release of GPT-4.5, the company's largest and best model yet, which is coming out today. We'll dive in right after this.

Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversation of the tech world and beyond. We're joined today by Mark Chen, the Chief Research Officer at OpenAI, who's here to talk about the company's newest release, GPT-4.5. Yes, it's finally here, and it is debuting today. Mark, great to see you. Welcome to the show.

Thank you so much for having me on.

Thanks for being here. This is, in four and a half years of the show, our first OpenAI interview, so hopefully the first of many. We appreciate you jumping into the water like this, and it's on big news with the release of GPT-4.5.

Yeah, so GPT-4.5 really
signifies the latest milestone in our predictable scaling paradigm. Previous models that have fit this paradigm have been GPT-3, 3.5, and 4, and now this is the latest one. It signifies an order-of-magnitude improvement over the last models, roughly commensurate with the jump from 3.5 to 4.

I think the question that most of our listeners are going to be asking, and certainly one we've asked on our show in the past couple of months, is: why isn't this GPT-5? What is it going to take to get to GPT-5?

Yeah, well, I think with GPT-5, you
know, whenever we make these naming decisions, we try to keep with a sense of what the trends are. Again, when it comes to predictable scaling, going from 3 to 3.5, you can kind of predict what an order of magnitude of improvement, in the amount of compute you train the model with and in terms of efficiency improvements, will buy you. We find this model aligns with what 4.5 would be, so we want to name it what it is.

Okay, but
there's been so much talk about when GPT-5 is going to come. Correct me if I'm wrong, but I think there's been a longer wait between GPT-4 and 4.5 than there was between, let's say, GPT-3.5 and 4. And, I don't know, maybe it's because we're seeing a lot of hype from OpenAI folks on Twitter about what's coming next, or maybe it's because this is probably the most impatient industry in the world, with the most impatient users in the world, but it seems to me like the expectations for GPT-5 are built up pretty high. So I'm curious, from your perspective, do you think it's going to be hard to meet those expectations whenever that GPT-5 model does come out?

Well, I don't
think so, and one of the fundamental reasons is that we now have two different axes on which we can scale. GPT-4.5 is our latest scaling experiment along the axis of unsupervised learning, but there's also reasoning. And when you ask why there seems to be a little bit bigger of a gap in release time between 4 and 4.5, it's that we've been largely focused on developing the reasoning paradigm as well. Our research program is really an exploratory research program: we're looking into all avenues of how we can scale our models, and over the last one and a half to two years we've found a new, very exciting paradigm through reasoning, which we're also scaling. And so I think GPT-5 really could be the culmination of a lot of these things coming together.
Okay, so you talk about how there's been a lot of work toward reasoning. We of course have seen that with o1, and there's a lot of buzz about DeepSeek. And now we're talking again about one of the more traditional scaled-up large language models with GPT-4.5. So the big question here, the one that I think was on a lot of people's minds when it came to this upcoming release, which we thought was going to be GPT-5, anyway, it doesn't matter, the big question is: can AI models continue to scale when you add more compute, more data, and more power to them? It seems like you have an answer to this, so I'm curious to hear your point of view on what you've learned about the scaling wall, given your development of this model, and whether we're going to hit it, whether we're already seeing some diminishing returns from scaling.

Yeah, I really
kind of have a different framing around scaling. When it comes to unsupervised learning, you want to put in more ingredients: compute, algorithmic efficiencies, and more data. GPT-4.5 really is proof that we can continue the scaling paradigm, and this paradigm is not the antithesis of reasoning, either. You need knowledge to build reasoning on top of; a model can't go in blind and just learn reasoning from scratch. So we find these two paradigms to be fairly complementary, and we think they have feedback loops on each other.

So yeah, GPT-4.5 is smart in different ways from the ways that reasoning models are smart. When you look at the model today, it has a lot more world knowledge. When we look at comparisons against GPT-4o, you see that for everyday use cases, people prefer it by a margin of about 60 percent. For productivity and knowledge work against GPT-4o, there's almost a 70 percent preference rate. So people are really responding to this model, and it's this knowledge that we can leverage for our reasoning models in the future.
So what are some examples? You talk about everyday knowledge work; what are some examples of things you would use GPT-4.5 for, where you would prefer it over a reasoning model?

Yeah, so it's a different profile from a reasoning model. With a larger model, it takes a bit more compute to process and think through the query, but it's still giving you an immediate response back. That's very similar to what GPT-4 would have done for you. Whereas with something like o1, you get a model where you give it a query and it can think for several minutes. And I think these are fundamentally different trade-offs: you have a model that comes back to you immediately, doesn't do much thinking, but comes up with a better answer, versus a model that thinks for a while and then comes up with an answer. And we find that in a lot of areas, like creative writing for instance, and again, this is stuff that we want to test over the next one or two months, but we find that there are areas like creative writing where this model outshines reasoning models.

Okay, so writing. Any
other use cases?

Yeah, so there's writing, and I think some coding use cases as well. We also find that there are some particular scientific domains where it shines in terms of the amount of knowledge it can display.
Okay, and I'm going to come back to benchmarks in a moment, but I want to stay on this scaling question, because I think there's been a lot of conversation about it in public, and it's great to be speaking with you, from OpenAI, to sort of get to the bottom of what's happening. So the first question folks have is: at this size, and you don't talk about the size of the models, which is fair, but they're big, right, this is the largest model that OpenAI has ever released. So I'm curious to hear: at this size, does adding similar amounts of compute and similar amounts of data get you the same returns that it did before, or are we already starting to see the returns from adding these resources tail off?

No, we are seeing the same returns, and I do want to stress that GPT-4.5 is the next point on this unsupervised learning paradigm. We're very rigorous about how we do this: we make projections, based on all the models we've trained before, of what performance to expect. And in this case, we put together the scaling machinery, and this is the point that lies at that next order of magnitude.
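The "predictable scaling" Chen describes amounts to fitting a performance trend from earlier training runs and projecting it to the next order of magnitude of compute. A minimal sketch of that idea in code, where the power-law form is standard in the scaling-laws literature but every number here is a made-up illustration, not OpenAI data:

```python
import numpy as np

# Hypothetical (made-up) data: training compute in FLOPs vs. final loss
# for a series of progressively larger training runs.
compute = np.array([1e21, 1e22, 1e23, 1e24])
loss = np.array([2.60, 2.25, 1.95, 1.69])

# Scaling laws are typically modeled as power laws, loss ~ a * compute^b,
# which is a straight line in log-log space.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

# Project the expected loss at the next order of magnitude of compute.
next_compute = 1e25
predicted_loss = 10 ** (intercept + slope * np.log10(next_compute))
print(f"predicted loss at {next_compute:.0e} FLOPs: {predicted_loss:.2f}")
```

The point of the exercise is the one Chen makes: before a big run starts, the curve fitted from smaller runs already says roughly where the new model should land, and the run is then judged against that projection.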
So what's it been like getting here? I mean, again, we talked about how there was a period of time that was longer than the last interval, and part of that was focused on reasoning. But there have also been some reports that OpenAI had to start and stop a couple of times to get this to work, and really had to fight through some thorny issues to get it to be this step change, as you're saying. So talk a little bit about the process, and maybe you can confirm or deny some of the things we've heard about having to start and stop again, and retrain, to get here.

Actually, I think it's interesting that this is a point attributed to this model, because in developing all of our foundation models, they are all experiments. Running all the foundation models oftentimes does involve stopping at certain points, analyzing what's going on, and then restarting the runs. I don't think this is a characteristic of GPT-4.5; it's something we've done with GPT-4 and with o-series models. They are largely experiments: we want to go in, diagnose them in the middle, and if we want to make some interventions, we should make interventions. But I wouldn't characterize this as something we do for GPT-4.5 that we don't do for other models.

We've
already talked a little bit about reasoning versus these traditional GPT models, but it makes me think of DeepSeek. I think you already gave a pretty compelling answer as to what you would use one of these models for versus a reasoning model, but there's another thing DeepSeek did that's worth discussing, which is that they made their models much more efficient. It's kind of interesting: when I talked to you about how you need data, compute, and power, you said yes, and you also need model optimizations, which is something people often overlook. And just going back to DeepSeek for a moment, their model optimization, the fact that they went from basically querying the entire model to a mixture of experts, where they're able to route queries to certain parts of the model instead of lighting it all up, is credited with helping them get more efficient. So I just want to turn it over to you, without commenting on what they did, or you can if you want, but I'm actually more curious what OpenAI is doing on that front: whether you did similar optimizations with GPT-4.5, and whether you're able to run these large models more efficiently, and if so, how.
Yeah, so I would say the process of making a model efficient to serve is something I often see as fairly decoupled from developing the core capability of the model. We see a lot of work being done on the inference stack. I think that's something DeepSeek did very well, and it's also something we push on a lot: we care about serving these models at cheap cost to all users, and we push on that quite a bit. And I think this is irrespective of GPT-4 versus reasoning models; we're always applying that pressure to be able to do inference more cheaply. I think we've done a good job of that over time: costs have dropped many orders of magnitude since we first launched GPT-4.

And so, are there, and maybe tell me if this is too in the weeds, but the move toward, for instance, mixture of experts: is that more of a reasoning thing, or can you apply that in
GPT?

Yeah, so that's an architectural element of language models. Pretty much all large language models today utilize mixture of experts, and it's something that applies equally to efficiency wins in foundation models like GPT-4 and 4.5 as it does to reasoning models.

So you were able to use that here as well, basically?

Well, we've definitely explored mixture of experts, as well as a number of other architectural improvements.
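The routing idea described above, sending each query through only a few experts instead of "lighting up" the whole network, can be sketched as a toy top-k mixture-of-experts layer. This is a simplified illustration of the general technique, not any particular production architecture; all dimensions and weights here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2

# Each "expert" is a small feed-forward weight matrix; the router scores
# which experts should handle a given token.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))

def moe_forward(x):
    """Route a token vector to its top-k experts and mix their outputs."""
    logits = x @ router                    # one routing score per expert
    top = np.argsort(logits)[-top_k:]      # pick the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only top_k of the n_experts weight matrices are used for this token,
    # so compute scales with k rather than with the total expert count.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

The efficiency win is exactly the one the host describes: per token, only `top_k` experts run, so a model can hold many experts' worth of parameters while spending only a fraction of that compute on each query.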
Okay, great. So, we have a Discord with some members of the Big Technology listener and reader group, and a theme that's come up there recently, and it's kind of interesting to be talking with you right now about an extremely large model, because the theme the people in the Discord can't stop talking about is how small, niche models are potentially going to be the future. I'll just read you one comment we had over the past few days: "For me, the future is very much aligned with niche models existing in workflows, and less so these general-purpose God models." So clearly OpenAI has a different thesis here, and I'm curious to hear your perspective on what we get with the big models versus the niche models. Do you see them in competition, or as complements? Help us think through that.

Yeah, so I think one
important thing is that we also serve models that are smaller. We serve our flagship frontier models, but we also serve mini models, which are cost-efficient ways to access capabilities at or fairly close to the frontier for much lower cost. We think that's an important part of having a comprehensive portfolio.

Fundamentally, though, at OpenAI we're in the business of advancing the frontier of intelligence, and that involves developing the best models that we can. What we're motivated by is pushing that out as much as possible. We think there are always going to be use cases at the frontiers of intelligence. Going from the 99.9th percentile in mathematics to the best in the world in mathematics, that difference means something to us. What the best human scientists can discover is tangibly different from what you or I can discover. So we're motivated by pushing the intelligence frontier as far as possible, and at the same time we want to make these capabilities cheaper and more cost-effective to serve for everyone. We don't think the niche models will go away; we want to build these foundation models and also figure out how to deliver these capabilities at lower cost over time. That's always been our philosophy: there's always going to be some juice in those last bits of intelligence.

Yeah, so let's talk about
that, because we have a debate on the show often: what matters more, the products or the model? I'm on team model. We have Ranjan Roy, who comes on on Fridays; he's team product. He's basically like, just take what you have now and prioritize it, and I say, well, you could probably do more with a better model. But I have to be honest, I'm kind of at a loss for words sometimes about what getting from the 99th percentile in math to the best in the world at math will do. So I actually am curious to hear your answer on this one: what does building the best model in the world do that you couldn't do otherwise?
A hundred percent. And I think it really signals a shift. If you just think, hey, you take the current models and you build the best surface for them, that's certainly something you should always be doing and exploring. Three years ago, that exercise looked like chat; we launched ChatGPT. And today, when you take the best models and the best capabilities, I think it looks a little bit more like agents. Reasoning and agents are very much coupled. When you think about what makes a good agent, it's something where you can sit back, let it do its own thing, and be fairly confident it'll come back with something you want. And I think reasoning is the engine that powers that: you have the model go and try something out, and if it can't succeed on the first try, it should be able to ask, well, why didn't I succeed, and what's a better approach for me to take? So the capabilities are always changing, and the surface is always changing in response, and we're always exploring what the best surface for the current capabilities looks like.

And I'm on your team here, but just to hammer home on this: what does that improvement in the model get you? What do you think it will enable?
Yeah, so I mean, agents of all forms. When you look at something like Deep Research, for instance, it gives you the ability to get an essentially fully formed report on any single topic you might be interested in. I've used it to put together even hour-long talks. It goes and synthesizes all the information out there, really organizes it, comes up with lessons. It allows you to do deep discovery, to dig into almost any topic you're interested in. So I feel like the amount of information and synthesis that's available to you now is just rapidly evolving.

So basically, it's not as simple as just go make Deep Research better with the model you have now. Am I reading between the lines the right way in saying that what you're expressing here is that if you make the model better, then the product gets better inherently? Take Deep Research, for instance.

A hundred percent, yeah. And that's something that is not enabled unless you have models of a certain level of capability, both in reasoning and in the foundational, unsupervised learning sense.
Okay. You know, it's interesting. I guess this is one question I've had in the back of my mind, and I'm just going to ask it again so I'm sure I'm clear on it. My view, maybe erroneously, was that your industry was just going to move from these massive models to massive models with reasoning, but you're actually saying there's a dual track here.

Yeah, so I think we're always pushing the frontier, and even since five or six years ago, the prevailing way to do that was to up the scale. So we've been upping the scale in unsupervised learning, and we've been upping the scale in reasoning. But at the same time, you care about serving mini models, you care about serving models that are cost-effective, that can deliver capabilities at a cheaper cost, and that will often be sufficient for a lot of use cases. The mission isn't just about pushing the biggest, most costly models; it's about having that, and also a portfolio of models people can use cheaply for their use cases.

Okay, so let's quickly talk, before
we leave, about the upgrades you're seeing in 4.5 compared to 4. I'm curious if you can run us through, at a very high level, the benchmarks it hits versus the benchmarks of the previous models, and then I'll throw a double question in here. I've already read your blog post, so I have an idea of what's coming; by the way, we're going to release this just as the news is released. It seems like you're also making a statement in some ways, saying, yes, we have the traditional benchmarks, but we also need to measure how this model works in terms of EQ, as opposed to just pure intelligence. So hit us with the benchmark improvements, and then why you think it's important to look at both of these in conjunction.

Yeah, so along all
the traditional metrics, things like GPQA and AIME, the traditional benchmarks we track, this does signify an order-of-magnitude jump, at about the same level as the jump from 3.5 to 4. But there's also a kind of interesting focus here on what I would call more vibes-based benchmarks, and I think that's actually important to highlight, because every single time we've launched a model, there's been a discovery process around what the interesting use cases out there are going to be. We notice here that it's actually a much more emotionally intelligent model. You can see examples in the blog post later today of how it responds to queries about a hard situation, or requests for advice in a particularly difficult situation: it responds in a more emotionally intelligent way. And, this may be a kind of silly example, but if you ask any of the previous models to create ASCII art for you, they mostly just fall down; this one can do it pretty well. So there are so many of these signs of improved capabilities, and I think things like creative writing will showcase this.
One of the things I think I picked up in the examples you've given so far is that it doesn't seem to feel the need to write a thesis for every response. One user was like, I'm having a hard time, and it actually wrote succinctly, as a human would, as opposed to the traditional, here are three paragraphs of self-care routines you can do.

Yeah, and that speaks to the emotional intelligence. It's not like, oh, I see that you're feeling bad, here are five ways you could feel better. That doesn't feel like a grounded, compassionate response. Here, you just get something that's direct, to the point, and really invites the user to say more.

So I think there's going to be
a criticism, I'm anticipating it, so let's talk about it right now. People will say: okay, OpenAI was talking about these traditional benchmarks, and now it's talking about emotional intelligence; it's shifting the goalposts and wants us to pay attention to something else. What's your response there?

Well, I really don't think
the accurate characterization is that it doesn't hit the benchmarks we expected it to. When you look at the development from 3 to 3.5 to 4 to 4.5, this does hit the benchmarks we expect. I think the main thing is that it's all about use-case discovery every time you put a new model out there. In many senses, GPT-4 was already very smart. The parallel here is like when we were putting GPT-4 out: we saw it hit all the right benchmarks we expected it to, but what would users resonate with? That was the key question, and I think that's the question we're asking today with GPT-4.5 as well. We're inviting people to look: we did some early explorations, we see that it's more emotionally intelligent, we see that it's a better creative writer, but what do you see here?
Yep. All right, Mark. So, and we mentioned this before we started recording, I've been seeing you in all the OpenAI videos for about every release, so first of all, it's great to speak to you live. But over the past year we've seen a lot of exodus out of OpenAI. Maybe the media plays it up too much, probably we do, but I am kind of curious what it's like working within OpenAI, and how you see the talent bench inside the company. You became Chief Research Officer just a few months ago, and now look, we have a new foundational model. So just give us a sense of what the talent situation is.

It's still, I think, the most world-class AI organization. I would say
that there's a separation between the talent bar at OpenAI and any other firm out there. When it comes to people leaving, the AI landscape changes a lot, probably more so than any other field out there: the field three months ago looks different from the field three months before that. And I think it's just natural in the development of AI that some people will have their own thesis about the way they want to develop AI, and go try it their own way. I think that's healthy, and it also gives an opportunity for people internally to shine. We've never had a shortage of people internally who are willing to step up; we've seen that a lot, and I really just love the bench that we have here.
Very cool. All right, folks, GPT-4.5 is out today for OpenAI Pro users, and next week it's coming out for Plus, Team, Enterprise, and Edu users. Mark, great to see you. Thank you again for spending time; you're about to go do the livestream, so I'm very grateful you spent this time with me today.

Thanks so much, I really appreciate your time too. Thanks for having me.

Well, let's do it again soon. And folks, we shouted out the argument between Ranjan and me; we'll go into that, and everything we can share about GPT-4.5, tomorrow on the Friday show. Thanks for listening, thanks again to Mark and OpenAI for the interview, and we'll see you next time on Big Technology Podcast.