Meta's Generative AI Head: How We Trained Llama 3

Channel: Alex Kantrowitz

Published at: 2024-04-22

YouTube video id: o6qoqKS2mv8

Source: https://www.youtube.com/watch?v=o6qoqKS2mv8

So you're releasing Llama 3 today. Talk a little bit about what's involved in that release.

We did something really exciting this time around. We're releasing an updated 8-billion-parameter model plus a 70-billion-parameter model, and these are state-of-the-art, incredibly high-performing models at their scale. We feel really good about how they're performing, and to get there we did all the right foundational work to be able to scale and build these models: we had the right clusters, the right infrastructure, the right training frameworks, and we did all the right data work. So we're really excited to be introducing these.

If you think about how these AI models are trained, they're trained in two phases. There's a process called pre-training, where you're basically trying to consume general knowledge, and then there's a post-training effort, which involves some human supervision; that's where you tell the model how to behave. We did a lot of work and learned a lot of things, both from Llama 2, where the community gave us really great feedback, and from Connect and introducing our products to the world, and all of that fed into the alignment work we did for Llama 3.

The other thing we did: historically we introduced Llama 2 as a model, and then at Connect we introduced our products. This time around we're actually bundling those together. We built Llama 3 with Meta AI, our assistant, in mind, and Meta AI is going to be one of the best assistants, if not the best assistant, that's available for free, so everybody can access it. The 70 billion is a magical number that allows us to scale to billions of people, and we're really excited that we've been able to strike the right balance between intelligence and efficiency.
When you're talking about these large numbers of parameters, what does that mean?

That's just the number of weights required to embed, or represent, knowledge in the model; it's the capacity of the model to contain information. Most importantly, it usually also determines the amount of hardware you need to run the model. An 8-billion-parameter model is small enough to run on phones, or at least on higher-end laptops, while the 70B is something you usually run in the cloud, but on less hardware than larger models require, which lets you serve more users.
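The link between parameter count and hardware can be sketched with back-of-envelope arithmetic. This is an illustrative assumption, not Meta's deployment math: it assumes 2 bytes per weight (fp16/bf16) and ignores activations, the KV cache, and runtime overhead, which is why real requirements are somewhat higher.

```python
def model_memory_gb(n_params: float, bytes_per_param: float = 2) -> float:
    """Rough memory needed just to hold the weights.

    Assumes fp16/bf16 storage (2 bytes per weight) by default; quantized
    formats use fewer bytes. Ignores activations and runtime overhead.
    """
    return n_params * bytes_per_param / 1e9

# 8B weights at fp16: ~16 GB, within reach of a high-end laptop,
# and lower-precision quantization shrinks that further.
print(model_memory_gb(8e9))       # 16.0
print(model_memory_gb(8e9, 0.5))  # 4.0 (4-bit quantization)

# 70B weights at fp16: ~140 GB, typically datacenter GPUs.
print(model_memory_gb(70e9))      # 140.0
```

That gap is roughly why, as described above, the 8B can run at the edge while the 70B usually lives in the cloud.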
And just for those unfamiliar, talk a little bit about these weights. There are a lot of weights involved. What are these weights?
Well, at their simplest level they're just matrix multiplies, but effectively what these weights are doing is encoding, or representing, human knowledge. In the training process you're showing the model lots and lots of text, and what you're trying to do is teach these weights to predict the next word in a sequence of words in a sentence. So if I say "hello, how are you" and I leave "today" out, we're teaching the model to learn relationships between the words in order to predict the word "today." As you go through that process, the model starts to learn relationships between different concepts and between different domains of information, and you scale that across all domains. So you're not just learning "how are you today"; you're also learning global facts and the humanities and mathematics. The model learns all of this knowledge, and the weights are basically learning to compute the relationships between these different concepts. Then, during the alignment process, you can start to teach the model to answer prompts, or to behave in a way that's more conversational.
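The next-word objective he describes can be illustrated with the crudest possible stand-in for a language model: bigram counts. This is a toy sketch, not how Llama works (real models use neural networks over tokens), but the training objective is the same one he names: given the words so far, predict what comes next. The tiny corpus below is made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus: list[str]) -> dict:
    """Count, for each word, which word follows it: the simplest
    form of learning 'relationships between words' from text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def predict_next(model: dict, word: str) -> str:
    """Predict the follower seen most often in training."""
    return model[word.lower()].most_common(1)[0][0]


corpus = [
    "hello how are you today",
    "hello how are you today my friend",
]
model = train_bigram(corpus)
print(predict_next(model, "you"))  # today
```

A neural network replaces the raw counts with learned weights and can condition on much longer context, but "predict 'today' after 'how are you'" is exactly the signal being learned.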
So then how do you build in, or train it with, these weights?

I'm not sure what you mean.

My question is: you've built these models with billions of weights, so how do you include the weights in the models? Talk a little bit about that training process. If the model is going to understand the relationships between these different entities, what do you do to teach it?
What we do is initialize random weights, and then we start to do something called gradient descent. You predict a word and you compute the error: the model predicted some random word, and you tell it the word you wanted it to predict. Then you update the weights, the numbers inside the model, to converge to higher and higher precision. This is what happens during pre-training: we keep updating those weights in order to minimize the distance between the predicted value and the actual value, and that process converges. We train it over thousands of GPUs and trillions of tokens. The 8 billion and 70 billion models were trained on almost 15 trillion tokens, and a token you can roughly imagine as a word, so roughly 15 trillion words, which is an incredible outcome. It requires thousands of GPUs to train on, and a GPU is roughly the cost of an Audi, an Audi A3 or something like that, is what I call it; they're very, very expensive. So being able to operate these large-scale infrastructure training jobs is quite a feat of both engineering and science.
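The predict/compare/update loop he describes can be shown in miniature. This sketch runs gradient descent on a single weight with a squared-error loss; real pre-training does the same cycle over billions of weights and trillions of tokens, but the mechanics are the same: measure how far the prediction is from the target, and nudge the weight downhill.

```python
def train(target: float, lr: float = 0.1, steps: int = 100) -> float:
    """Gradient descent on one weight w with loss (w - target)**2.

    Each step: predict, measure the error against the true value,
    and update the weight against the gradient of the loss.
    """
    w = 0.0                      # "initialize random weights" (here: zero)
    for _ in range(steps):
        error = w - target       # distance between prediction and truth
        grad = 2 * error         # derivative of (w - target)**2 w.r.t. w
        w -= lr * grad           # nudge the weight to reduce the error
    return w


print(train(3.0))  # converges toward 3.0
```

After enough steps the weight sits arbitrarily close to the target, which is the "converge to higher and higher precision" he mentions.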
This is coming from Meta's own publication: you used two 24,000-GPU clusters, I believe, to train Llama 3. Give us an idea of how big that actually is. It's a lot of GPUs.

That is a lot of GPUs. We're very fortunate at Meta: we're a vertically integrated company, so we actually have access to our own infrastructure all the way down to the GPUs, and our ability to optimize up and down the stack to build the world's best models is second to none. It's really quite a special place from that perspective, and these clusters are a demonstration of that capability, of our ability to build interconnected clusters. What I mean by interconnected is how fast these GPUs can talk to each other, because you run into bottlenecks: the first bottleneck is how powerful a single GPU is, and the second is how quickly you can have those GPUs talk to each other. We're quite fortunate to be able to optimize up and down the stack to make it efficient to run these models.
For listeners: Mark Zuckerberg, Meta's CEO, has talked about how Meta is going to have something like 650,000 GPUs or GPU equivalents by the end of this year. So this training took a huge chunk of them, but nothing close to the full war chest, which is just amazing if each one of those is an Audi equivalent, something like $20,000 to $40,000 each. So talk a little bit about what it takes to get from Llama 2 to Llama 3. The idea is that you're working to build a more powerful model. Does that mean you just need more GPUs and more training data, or is there something else that goes on behind the scenes to make these models more powerful?

There's a lot of hard science that happens in the background, and one of the amazing things about our approach at Meta is that we're very open. Following this release there will be a research paper where we share a lot of our learnings and the efforts we went through as we moved from Llama 2 to Llama 3. I think we were very open about Llama 1 to Llama 2: that transition was all about alignment, and about figuring out how to balance one of the hardest problems in the field, usefulness versus safety, how to encourage the model to be useful and answer questions, but not answer the wrong questions. We've been really focused on pushing the science, the infrastructure, and the systems engineering to achieve this level of scale. The 8 and the 70 we've introduced are really just the beginning of our Llama 3 release. We're also talking a little bit about one of the larger models we're training, which is already achieving exceptional performance: a model that's over 400 billion parameters. These models have basically been trained on 10x more compute and 10x more data, which is quite a feat, to go from Llama 2 to Llama 3 in, what is it, six to eight months, and scale to that level. So our pace is actually really, really good. I think that's been our primary focus moving from two to three: doing a good job on our fundamentals, which should allow us to continue scaling into the future in a really effortless way.
Just to put a fine point on that: ten times more computing resources for these Llama 3 models?

Actually, I believe it's a hundred times more compute.

Okay, a hundred times more compute, and then ten times more data.
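To put those multipliers in context, a standard community rule of thumb estimates training compute as roughly 6 FLOPs per parameter per training token. This is an approximation from the scaling-law literature, not Meta's disclosed accounting, applied here to the 70B model and the 15-trillion-token figure mentioned earlier in the interview.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: total training compute ~ 6 * N * D FLOPs,
    where N is parameter count and D is the number of training tokens."""
    return 6 * n_params * n_tokens


# 70B parameters x 15T tokens: on the order of 6.3e24 FLOPs.
print(f"{train_flops(70e9, 15e12):.1e}")  # 6.3e+24
```

By this rule, compute scales with the product of model size and data, so "100x more compute" can come from growing both the model and the dataset together rather than either one alone.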
Yeah.

Okay, let's talk about the data, actually, because this has been a big conversation topic: are we going to run out of data to train these models? There was a recent New York Times article that went into this, and I'd love to get you to respond to it. They actually have you saying in a meeting, this is from The New York Times, that you've used almost every available English-language book, essay, poem, and news article on the internet to develop a model, and that Meta could not match ChatGPT unless it got more data. There was even some debate about paying $10 a book for the full licensing rights to new titles, and even a discussion of buying Simon & Schuster, the publishing house, to feed these models. Where are you in terms of your ability to train these models with more data? Because the sense is that the more data they have, the better they're going to be, but they might be hitting a wall in terms of the available data.

You know, I don't think the field has really narrowed down and understands exactly the relationship between scale and the required novelty in data. There are techniques the research community is looking at to do better data augmentation and synthetic data generation, so I think it's really early to predict where we'll be and what the data situation will look like for improving and enhancing the models. One of the things we did with Llama 3 is that in post-training we actually leverage synthetic data. You'll see, for example, that our coding ability on Llama 3 is exceptionally high; we're setting a benchmark for what a model can do at the scale we're at, and part of that was really being innovative and pushing on our ability to leverage models to generate synthetic data, and on synthetic-data techniques and approaches to improving the model. So I suspect we'll have some innovations as we move forward on data, but I don't think we know yet that we'll run out of data, or that there's some limiting factor here.
Well, let's talk a little bit about making a better model: you're actually going to have to build some personality in. Llama 2 was a little bit too careful, right? I think that's something Meta has assessed internally, and you wanted to make Llama 3 a little more willing to answer questions, with fewer of what I believe are called false rejections. So how do you train a model to be a little bit more of a cowboy on that front?

Ideally not a cowboy. This is one of the most important product-experience questions we have. Everybody really focuses on the general knowledge and general capabilities of these systems, on whether it can answer all the questions. But one of the things we're excited about at Meta, and want to innovate on, is this: there's an idea that you're building alignment for everybody, a system that can globally align to all humans. I actually think one of the unique things about our vision is that we're really interested in building AIs for different people, for different uses. That's why we introduced our assistant, and we believe in personalization for it; that's why we introduced our chatbots, which I believe are more interest-based and align to people's interests. A core part of how we deal with false refusals is that we build steerability into the alignment process, the ability for people to align and personalize the model to themselves; that's how we see our product evolution. For the base models themselves, we've done a lot of innovation around boundary sampling: working on the model's tone and making sure we handle boundary prompts well. A famous example from the community feedback on Llama 2 was "how do you kill a thread in Linux?" That's a very safe prompt, so you should be able to answer it. We look for those sorts of counterexamples and use them to improve the model.

And would it not answer that because it involved the word "kill"?

In Llama 2 we definitely overleveraged some of the alignment tools to discourage answering those kinds of questions, and we've done a lot of innovation, again, on boundary sampling, so that if you're asking something we don't want to answer, we don't answer, but otherwise we do. The other thing that I think is really valuable is making sure that when we do falsely refuse, we refuse with a positive tone. Some of these models tend to do a lot of moralizing, or really take a perspective or a point of view, and we've worked on, and are continuing to innovate on, how the model responds and how it refuses, which I think is also part of building a really enjoyable conversational experience.
Yeah, I think that's great. Early ChatGPT was a real moralizer, and the ones that just say, "Listen, I can't touch on that," are a little bit better. Anything else you did in terms of the personality of these bots? It does seem like ChatGPT, Claude, and Meta AI each have a little bit of a different personality. Do you think about that when you're going from the first model to the second, or the second to the third, in terms of how you tweak the personality?

We definitely put a lot of emphasis and focus on the steerability of the models: being able to control their outputs and have them take different tones in how they engage when they respond. I don't know if you've tried Meta AI, but what do you think about it?

I think it's good, and I think it needs to be more prominent in the product. I actually know that's part of your announcement now.

I completely agree with you.

Right, I forget about it sometimes, then I see it in Instagram and think, oh, it's there. So I'll turn it over to you: this is the first time you're developing a new foundational model and not just putting it in the product, but putting it in the product quite prominently.
Yeah, we're just making it really easy to find the product and interact with it. I think it's going to be really popular and useful and helpful to people. For me at least, it's not just a conversational agent but also a creativity agent. It's a very popular thing for me to leverage in chat threads with my wife to brainstorm different ideas, and I like to use the image-creation capabilities to ideate. Recently it was my twins' birthday and we couldn't figure out what kind of cake to get them, so I leveraged Meta AI to imagine all these different cakes, and then we took one to a custom cake maker and had them recreate that exact cake.
Cool. And you have a dedicated website for Meta AI that's coming out?

That's right.

And what's the website?

It's meta.ai.

Okay, good job with the naming.

Exactly; it can be done.
That's right. And there's also some cool stuff: if you're typing in the image-generator tool, it will basically create the images live in front of you, and as you add a more detailed prompt, those images transform as you type.

Yeah, I think this is one of the fastest image-generation systems, if not the fastest. If you think about creativity, it's an iterative process: you want to experiment and brainstorm in line with the thing you want to create. So we created this model specifically for speed, and as you're typing the concept of the image you want to create, it's actually generating variations of it.
Yeah, that's so helpful. I'm doing this relay race through New England over the spring, and my team wanted an image, to make a magnet or something for the team. Someone said, hey Alex, can you just do it with your AI tools? And I thought, oh no, here we go. My first try produced somebody with a melted face, but the second try looked pretty cool: a bunch of runners on a VW van. I dropped it into the chat and people said, oh, this is great, a van going through New England. The only problems were that we're running in the spring and the image was set in the fall, and all the runners were male, and they asked, can you tweak it? And I said, listen, with a lot of this stuff, making images with AI is like shooting a bow and arrow with a blindfold on. But eventually, after a few more shots, I got a nice, diverse photo of a group of runners in a van going through New England in the spring. By the way, that was Microsoft's image generator. So that ability to customize quickly: I think people might underrate how important it is.

I think so too.
Yeah, and that's why we did all this model work to really get this thing very fast. It's under a second, which is really exciting. I'm excited to see what people do with it. I think you're right: it's really hard to get exactly what you want from a single prompt, and you really just want to experiment and prod and explore what the model's capable of doing, and have it imagine different scenarios. So I think it's going to be quite popular for people to leverage.

As you go into the run-up of this release, you build a foundational model and now you're building it into products. What kind of lift is it to finish that model and then get it operational within products, effectively the same day of release?
How do you do that?

Well, you start very, very early, and you set the goal for the team to be able to close the loop. But it's a complex orchestration that's required to move from model-complete to behind a product. Our organization, which is called GenAI, builds the models, but we also deploy them in products, and we have to partner very closely with the different application teams, like WhatsApp and Instagram and Facebook. We have a very close partnership with them, and leadership is incredibly involved in moving quickly. So we're very fortunate to have the distribution, the scale, and the ability to create these experiences for billions of users, because we have such a close partnership with those application teams. But rolling a model from completion to an API, or to the apps, requires that we do a tremendous amount of red teaming and quality checks. We have tool use, for example, which requires a system, not just a model, to be designed and rolled out behind it. We've gotten very good at this over the last year, as we went from Llama 2 to Connect, launching Meta AI, our characters, and announcing creator AIs. We've gotten really good at that process of moving a model into production, with all the checks and balances for the model, while also working closely with the application teams to engineer the experience that users are going to have.
that users are going to have um and you
kind of say red teaming and passing but
that's pretty important to make sure
that this thing doesn't spit out
embarrassing results like I won't make
you say it but I'll say it like the
Gemini situation at Google so it's good
to know that that's dialed in on on your
end is the is the thousands of hours red
teaming these things but I also think
realistically these models because
they're predicting sequences and they've
been trained on enormous amounts of
information that sometimes they produce
erroneous they they hallucinate and so
we've applied all the tools and that's
why the research so important U to make
these systems more and more factual so
the goal is to make this meta AI
assistant the biggest assistant in the
world yeah is it going to end up living
more prominently within messenger and
WhatsApp because again there are times
where I'm like I'm trying to find it or
I like oh I forgot it's there it's not
top of mine so how are you going to from
a product standpoint make those changes
Yeah, we've already started the mission to make it more prominent and more useful, to meet people where they are. We have high-level entry points directly in the inbox, and we're integrating it into search, so we'll also have suggestions and typeaheads as you use it in search. So we've integrated it in a very prominent way. But one of the amazing things about how we do things at Meta is that we iterate, so we're going to learn a lot as we engineer these entry points and understand usage patterns, and we'll likely modify, enhance, test, iterate, and improve the experience for all of our users.
Okay, I also want to ask you about open source. Obviously I think these models are going to be open source; open-sourcing your models has sort of been the calling card of Llama for a while. But when you get into, say, 400-billion-parameter models like the one you're developing, which I think is scheduled for release this summer, do you ever think: we don't want this to be usable by everyone, because there are inevitably going to be bad actors who use it, and we don't necessarily want to make it something they can use? Where is your stance on open source?

Well, the 400 is still training, and generally our approach to open source is to look at the model, apply all of the safety-critical checks, and understand the balance between the model itself and its performance. We approach these things very responsibly. So it's too early for me to comment on the 400-plus model, the large one. But I do think it's important to always remember the benefits of open-sourcing. I would say every AI lab in the world today has depended on openness and transparency in order to achieve the outcomes, the results, and the improvements to these models that we have today. So I think it's always important to re-anchor on the value and benefit of open-sourcing, and these 8 and 70 models are going to be incredibly useful for people to innovate across the industry and to really push understanding of the science: how to align these models, how to train them, how to improve them, which I think is really, really valuable.
But it's telling that your answer on the 400, the real big one, isn't a slam-dunk "yes, of course we're open-sourcing."

I think it's just still training. You're looking for a clear yes or no, and I'm saying it's still training.

Yeah, and my point in saying it's telling is that it's going to be big, it's going to be powerful, and the first two were a definite yes even while still training. And I do wonder about open source; by the way, I'm a fan of open source, but I also think about what the recourse is if somebody uses it for negative purposes. I speak with a lot of people who say cybersecurity is going to be a rising field because this stuff is going to make it easier to phish, for instance. So I do wonder about those bad actors.

This is an area where, for example, we've been leading. If you look at our release with Llama 3, we're opening a model called Llama Guard, and we've also been opening up cybersecurity evals to help understand the safety metrics for cybersecurity. So we're also leading in terms of pushing the safety standards and understanding how to measure and evaluate these models under different conditions.
Okay, last question for you before we go to break. You've trained a significantly more powerful model in Llama 3 versus Llama 2, and Llama 2 was already pretty good. I hear a lot of people say that when they're hacking projects together they use Llama 2 by default because it's open, and a lot of companies are moving toward these open-source models when they want more customizability. In developing Llama 3, which again used a hundred times more compute and ten times more data, did you see anything that surprised you, or made you take a step back and think, wow, this is a really significant advance from where we were before?
No, I don't think anything about the model has really surprised me personally in terms of its performance. We kind of expected it to be here: we do a lot of rigorous scaling-law work and rigorous prediction of what we think the metrics will look like. And while these models are impressive, they're not mysterious, if you will; at least at their current capability levels, they're well understood at the 8 and at the 70.
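The "rigorous scaling laws" he mentions refer to fitting curves that predict a model's loss from its parameter count and training-token count before the large run is launched. A widely used functional form is the Chinchilla-style law L(N, D) = E + A/N^alpha + B/D^beta. The sketch below uses the coefficients published by Hoffmann et al. (2022) purely for illustration; Meta's internal fits are not public.

```python
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Chinchilla-style scaling law: L(N, D) = E + A/N^alpha + B/D^beta.

    E is the irreducible loss; the other two terms shrink as the model
    (N) and the dataset (D) grow. Coefficients are the published
    Chinchilla fits, used here only as an illustration.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


# Same data budget, bigger model: the law predicts lower loss,
# which is how teams forecast metrics before committing the compute.
print(predicted_loss(8e9, 15e12))
print(predicted_loss(70e9, 15e12))
```

Fitting such a curve on many small training runs and extrapolating is what makes a large model's final metrics "expected" rather than surprising.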
And do you think there's going to come a point, if we continue at this level of progress, where they push beyond that?

You know, I think it's hard to predict. I generally don't like to make predictions I don't have confidence in, and having worked on something like autonomous agents for four or five years, I would tell you that things are always harder to predict than we think. I've worked with people who were very confident that by 2014 we would all get in a car, press a button, and go to work, and I've worked with people who said it was a hundred years away, and I think they're probably both wrong. I could say the same thing about where we are today: there are going to be people who are very bullish on what's going to happen in the next two years, and people who are a lot more bearish. I'm personally in the science and the work, trying to push the frontier and really understand what we can and can't do and what these systems are and aren't capable of. It's a very hard thing to predict, and I'm pretty sure if you polled a hundred scientists today you'd get a hundred different answers. So I think it's better not to speculate, and better to understand through the science and the data.
Okay, I want to talk about some of the long-term vision. Meta's stated goal, I think for a while now, has been to build artificial general intelligence, intelligence on par with human intelligence. So I want to talk a little bit more about that goal, what the company might do if it achieves it, and what the timeline might be. We're going to do all of that after this. The last little bit here is going to be a discussion of that: if you're a paid Big Technology subscriber, you'll be able to listen to the second half, and if you're not and you want to sign up, you can sign up for an upgraded Big Technology subscription at bigtechnology.com and check out the second half. All right, back right after this.