Meta's Generative AI Head: How We Trained Llama 3
Channel: Alex Kantrowitz
Published at: 2024-04-22
YouTube video id: o6qoqKS2mv8
Source: https://www.youtube.com/watch?v=o6qoqKS2mv8
So you're releasing Llama 3 today. Talk a little bit about what is involved in that release.

Yeah, we did something really exciting this time around. We're releasing an updated 8-billion-parameter model plus a 70-billion-parameter model, and these are state-of-the-art, incredibly high-performing models at their scale. We feel really good about how they're performing, and in order to get there we did all the right foundational work to be able to scale and build these models: we had the right clusters, the right infrastructure, the right training frameworks, and we did all the right data work. So we're really excited to be introducing these.

If you think about how these AI models are trained, they're trained in two phases. There's a process called pre-training, where you're basically trying to consume general knowledge, and then there's a post-training effort, which involves some human supervision; that's where you tell the model how to behave. We did a lot of work and we learned a lot of things, both from Llama 2 (the community gave us really great feedback on Llama 2) and also from Connect and introducing our products to the world, and all of that fed into the alignment work that we do for Llama 3.

The other thing we did: historically we introduced Llama 2 as a model, and then at Connect we introduced our products. This time around we're actually bundling those together, and we built Llama 3 with Meta AI, our assistant, in mind. Meta AI is going to be one of the best, if not the best, assistants available for free, so everybody can access it. The 70 billion is a magical number that allows us to scale to billions of people, and so we're really excited that we've been able to strike
the right balance between intelligence and efficiency.

And when you're talking about these large numbers of parameters, what does that mean?

That's just the number of weights that are required to embed, or represent, knowledge in the model. It's the capacity of the model to contain information, and, most importantly, it usually indicates the amount of hardware that you need to run the model. So an 8-billion-parameter model is small enough to run on phones, or even laptops at the higher end of those, and the 70B is something that you usually run in the cloud, but on less hardware, which lets you serve more users.

And just for those unfamiliar, talk a little bit about these weights. There are a lot of weights involved; what are these weights?

Well, at their simplest level they're just matrix multiplies, but effectively what these weights are doing is encoding, or representing, human knowledge. In the training process you're showing the model lots and lots of text, and what you're trying to do is teach these weights to predict the next word in a sequence of words in a sentence. So if I say "hello, how are you today" and I leave the "today" out, we're teaching the model to build relationships, to learn relationships between the words, in order to predict the word "today." When you go through that process, the model starts to learn relationships between different concepts, between different domains of information, and you scale it across all domains. So you're not just learning "how are you today," you're also learning global facts and humanities and mathematics; you learn all of this knowledge, and these weights are basically learning to compute the relationships between these different concepts.
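The parameter-to-hardware relationship described above can be made concrete with a back-of-the-envelope calculation. This is an illustrative sketch, not an official sizing guide: it counts only the bytes needed to hold the weights, ignoring activations and other runtime memory, and assumes common weight formats (16-bit floats, or 4-bit quantized weights for on-device use).

```python
# Rough memory needed just to store a model's weights.
# Back-of-the-envelope only; real deployments also need memory for
# activations and the KV cache, and exact figures depend on the runtime.
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # gigabytes

# 8B parameters at fp16 is ~16 GB; 4-bit quantization brings it to ~4 GB,
# which is laptop (and high-end phone) territory.
print(weight_memory_gb(8, 16))   # 16.0
print(weight_memory_gb(8, 4))    # 4.0
# 70B at fp16 is ~140 GB of weights, which is why it typically runs on
# server GPUs in the cloud rather than on consumer hardware.
print(weight_memory_gb(70, 16))  # 140.0
```

This is why the interview distinguishes the 8B ("small enough to run on phones or laptops") from the 70B ("usually in the cloud"): the weight footprint alone is an order of magnitude apart.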
And during the alignment process, that's where you can start to teach the model to answer prompts, or to behave in a certain way that's more conversational.

So then how do you build in, or train it with, these weights?

I'm not sure what you mean, maybe.

My question is: you've built these models with billions of weights, so how do you include the weights in the models? Talk a little bit about that training process. If the model is going to understand the relationships between these different entities, what do you do to teach it that?

So what we do is initialize random weights, and then we start to do something called gradient descent: you predict a word, and you compute the error between the word the model predicted (some random word at first) and the word you wanted it to predict, and then you update the numbers inside the weights to converge to higher and higher precision. That's what happens during pre-training: we update those weights in order to minimize the distance between the predicted value and the actual value, and that process usually reaches some level of convergence. We train it over thousands of GPUs and trillions of tokens. If you look at the 8 billion and the 70 billion, they were trained on almost 15 trillion tokens, and a token you can roughly imagine as a word, so roughly 15 trillion words, which is an incredible outcome. It requires thousands of GPUs to train on, and a GPU is roughly the cost of an Audi, is what I call it, an Audi A3 or something like that; they're very, very expensive. So being able to operate these large-scale infrastructure training jobs is quite a feat of both engineering and science.

Yeah, and this is coming from Meta's own publication: you used two 24,000-GPU clusters, I believe, to train Llama 3.
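The loop just described (start from random weights, predict the next word, measure the error, nudge the weights by gradient descent) can be sketched in a few lines. This is a toy illustration, not Meta's training code: a tiny one-layer bigram model over a handful of words, where splitting on spaces stands in for real tokenization, and 200 passes over nine words stand in for trillions of tokens. The mechanism, though, is the same idea.

```python
import math
import random

# Toy next-word predictor trained by gradient descent.
corpus = "hello how are you today hello how are you".split()  # split() stands in for tokenization
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

random.seed(0)
# Start from random weights; training shapes them to encode word relationships.
# Row = current word, columns = scores for each candidate next word.
W = [[random.gauss(0, 0.1) for _ in range(V)] for _ in range(V)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

lr = 0.5
for _ in range(200):  # repeated passes over the text
    for cur, nxt in zip(corpus, corpus[1:]):
        p = softmax(W[idx[cur]])  # model's predicted distribution over the next word
        for j in range(V):
            target = 1.0 if j == idx[nxt] else 0.0
            # Gradient descent: move each weight against the prediction error.
            W[idx[cur]][j] -= lr * (p[j] - target)

# After training, the weights have learned that "are" is followed by "you".
probs = softmax(W[idx["are"]])
predicted = vocab[probs.index(max(probs))]
print(predicted)  # prints "you"
```

The update step uses the standard result that, for a softmax output trained with cross-entropy loss, the gradient on each score is simply the predicted probability minus the one-hot target, which is exactly the "distance between the predicted value and the actual value" the answer describes minimizing.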
Just give us an idea of how big that actually is.

It's a lot of GPUs. A lot of GPUs. We're very fortunate at Meta: we're a vertically integrated company, so we actually have access to our own infrastructure all the way down to the GPUs, and our ability to optimize up and down the stack to build the world's best models is second to none. It's really quite a special place from that perspective. These clusters are a demonstration of that capability, of our ability to build interconnected systems. What I mean by "interconnected" is how fast these GPUs can talk to each other, because you run into bottlenecks: the first bottleneck is how powerful a single GPU is, and the second is how quickly these GPUs can talk to each other. We're quite fortunate to be able to optimize up and down the stack to make it efficient to run these models.

Yeah, for listeners: Mark Zuckerberg, Meta's CEO, talked about how Meta is going to have something like 650,000 GPUs, or GPU equivalents, by the end of this year. So this training took a huge chunk of them, but nothing close to the full war chest, which is just amazing if each one of those is an Audi equivalent, something like $20,000 to $40,000 each.

So then talk a little bit about what it takes to get from Llama 2 to Llama 3. The idea is that you're working to build a more powerful model; does that mean you just need more GPUs and more training data, or is there something else that goes on behind the scenes to make these models more powerful?

Yeah, there's a lot of hard science that happens in the background, and one of the amazing things about our approach at Meta is that we're very open. Following this release there will be a research paper where we share a lot of our learnings and the efforts that we've
encountered as we went from Llama 2 to Llama 3. I think we were very open about Llama 1 to Llama 2: Llama 1 to Llama 2 was all about alignment, and about one of the hardest problems in the field, usefulness versus safety, figuring out how to encourage the model to be useful and answer questions, but not answer the wrong questions. And we've been really focused on pushing the science and the infrastructure and the systems engineering to achieve this level of scale. The 8 and 70 we've introduced are really just the beginning of our Llama 3 train. We're also talking a little bit about one of the larger models that we're training, which is already achieving exceptional performance: it's a model that's over 400 billion parameters. These models have basically been trained on 10x more compute and 10x more data, which is quite a feat, to go from Llama 2 to Llama 3 in, what is it, six to eight months or so, and scale to that level. So our pace is actually really, really good. I think that's been our primary focus moving from 2 to 3: doing a good job on our fundamentals, which should allow us to continue scaling into the future in a really effortless way.

Just to put a fine point on that: ten times more computing resources for these Llama 3 models?

Actually, I believe it's a hundred times more compute.

Okay, a hundred times more compute, and then ten times more data. Let's talk about the data, actually, because this has been a big conversation topic: are we going to run out of data to train these models? There was a recent New York Times article that went into this, and I'd love to get you to respond to it. They actually have you saying in a meeting that you've used, and this is from The New York Times, almost
every available English-language book, essay, poem, and news article on the internet to develop a model, and that Meta could not match ChatGPT unless it gets more data. There was even some debate about paying $10 a book for the full licensing rights to new titles, and even a discussion of buying Simon & Schuster, the publishing house, to feed these models. Where do you think you are in terms of your ability to train these models with more data? The sense is that the more data they have, the better they're going to be, but they might be hitting a wall in terms of the available data to use.

You know, I don't think the field has really narrowed down and understands exactly the relationship between scale and the required novelty in data. There are techniques the research community is looking at to do better data augmentation and synthetic data generation, so I think it's really early to predict where we will be and what the data situation will look like for improving and enhancing the models. One of the things that we did with Llama 3 is that in post-training we actually leveraged synthetic data. You'll see, for example, that Llama 3's coding ability is exceptionally high; we're setting kind of a benchmark for what a model can do at the scale that we're at, and part of that was really being innovative, pushing on our ability to leverage models to generate synthetic data, and on synthetic-data techniques and approaches to improving the model. So I suspect we'll have some innovations as we move forward on data, but I don't think we know yet that we'll run out of data, or that there's some limiting factor here.

Well, let's talk a little bit about personality. When you're trying to make a better model, you're actually going to have to build some personality in, and Llama 2 was a little bit too careful, right? I think
that was something that Meta has assessed internally, and you wanted to make Llama 3 a little bit more willing to answer questions and have fewer of what I think are called false rejections. So how do you train a model to be a little bit more of a cowboy on that front?

Ideally not a cowboy. One of the most important product-experience questions we have: everybody really focuses on general knowledge and the general capabilities of these systems, like whether it can answer all the questions. But one of the things that we're excited about at Meta, and that we want to innovate on, is this: there's an idea that you're building alignment for everybody, a system that can globally align to all humans, but I actually think one of the unique things about our vision is that we're really interested in building AIs for different people, for different uses. That's why we introduced our assistant, and we believe in personalization for it; that's why we introduced our chatbots, which I believe are more interest-based and align to people's interests. A core part of how we deal with false refusals is that we build steerability into the alignment process, to allow people to align and personalize the model to themselves, and that's how we see our product evolution.

For the base models themselves, we've done a lot of innovation around boundary sampling, and around making sure that we're working on the model's tone on boundary prompts. For example, a famous one we got feedback on from the community with Llama 2 was "how do you kill a thread in Linux?" That's a very safe prompt, so the model should be able to answer it. So we look for those sorts of counterexamples and use them to improve the model.

And would
it not answer that because it involved the word "kill," for example?

In Llama 2, we definitely over-leveraged some of the alignment tools to discourage answering those kinds of questions, and we've done a lot of innovation, again, on boundary sampling, so that if you are asking something that we don't want to answer, we don't answer. The other thing that I think is really valuable is making sure that when the model does refuse, it refuses with a positive tone. Some of these models, for example, tend to do a lot of moralizing, or really take a perspective or a point of view, and we've worked on, and are continuing to innovate on, how the model responds and how it refuses, which I think is also part of building a really enjoyable conversational experience.

Yeah, I think that's great. Early ChatGPT was a real moralizer, and the ones that just say, "listen, I can't touch on that," are a little bit better. So, anything else you did in terms of the personality of these bots? It does seem like ChatGPT, Claude, and Meta AI each have a little bit of a different personality. Do you think about that when you're going from the first model to the second, or the second to the third, in terms of how you tweak the personality?

We definitely put a lot of emphasis and focus on the steerability of the models: being able to control their outputs and have them take different tones when they respond. I don't know if you've tried Meta AI, but I'd love your thoughts. What do you think about it?

I think it's good, and I think it needs to be more prominent in the product. I actually know that's part of your announcement now.

I completely agree with you.

Right, like I forget about it sometimes, then I see it in Instagram and I'm like, oh, it's there. And so I think
that actually... well, I'll turn it over to you: this is the first time you're developing a new foundational model and putting it in the product on day one, and doing it quite prominently.

Yeah, we're just making it really easy to find the product and interact with it. I think it's going to be really popular and useful and helpful to people. And it's not just a conversational agent, it's also a creativity agent. It's a very popular thing for me to be leveraging it in chat threads with my wife, to brainstorm different ideas, and I like to use the image-creation capabilities, for example, to ideate. Recently it was my twins' birthday and we couldn't figure out what kind of cake to get them, and I just leveraged Meta AI to imagine all these different cakes, and then we took the image to a custom cake maker and had them recreate that exact cake.

Cool, that's cool. And you have a dedicated website for Meta AI that's coming out?

That's right.

And what's the website?

It's meta.ai.

Okay, good job with the naming.

Exactly.

That's right. And there's also some cool stuff: if you're typing in your image-generation tool, it will basically create the images live in front of you, and as you add a more detailed prompt, those images will transform as you type.

Yeah, I think this is one of the fastest, if not the fastest, image-generation systems. And if you think about creativity, creativity is an iterative process; you want to experiment or brainstorm in line with the thing that you want to create. So we created this model specifically for speed, and as you're typing this concept of
what you want to create, it's actually generating variations of it.

Yeah, that's so helpful. You know, I'm doing this relay race through New England over the spring, and my team needed an image; they wanted to make a magnet or something for the team, and someone said, hey Alex, can you just do it with your AI tools? And I'm like, oh no, here we go. I took a first try and it was somebody with a melted face, and then on the second try I was like, oh, this looks pretty cool: it was a bunch of runners on a VW bus. I dropped it into the chat and people were like, oh, this is great, it's a van going through New England. Only problem is, we're running in the spring and the image is in the fall, and all the runners are male, and they're like, can you tweak it? And I'm like, listen, with a lot of this stuff, making images with AI is like shooting a bow and arrow with a blindfold on. But after a few more shots, I eventually got a nice, diverse photo of a group of runners in a van going through New England in the spring. By the way, that was Microsoft's image generator. So that ability to customize fast, I think people might underrate how important it is.

I think so too, and that's why we did all this model work to really get this thing very fast. It's under a second, which is really exciting. I'm excited to see what people do with it. I think you're right: it's really hard to get exactly what you want from a single prompt, and you really just want to experiment and prod, explore what the model's capable of doing, and have it imagine different scenarios. So I think it's going to be quite popular for people to leverage.

As you go into the run-up of this release, you build a foundational model, and now you're building it into products. What
kind of lift is it to finish that model and then get it operational within products effectively the same day as release? How do you do that?

Well, you start very, very early, and you set the goal for the team to be able to close that loop, but there is a complex orchestration required to move it from model-complete to behind a product. Our organization is called GenAI; we build the models, but we also deploy them in products, and one of the things we do is partner very closely with the different application teams, like WhatsApp and Instagram and Facebook. We have a very close partnership with them, and leadership is incredibly involved in moving very quickly. So we're very fortunate to have the distribution and the scale and the ability to create these experiences for billions of users, because we have such a close partnership with these application teams. But rolling a model from completion to an API, or to the apps, requires that we do a tremendous amount of red-teaming and quality checks. We have tool use, for example, which requires a system, not just a model, to be designed and rolled out behind it. We've gotten very good at this over the last year: as we moved from Llama 2 to Connect, launching Meta AI, our characters, announcing creator AIs, we've gotten really good at the process of moving a model into production, with all of the checks and balances for the model, but also making sure that we're working closely with the application teams to engineer the experience that users are going to have.

You kind of mention red-teaming in passing, but that's pretty important, to make sure this thing doesn't spit out embarrassing results. I won't make you say it, but I'll say it: like the Gemini
situation at Google. So it's good to know that that's dialed in on your end, with the thousands of hours of red-teaming these things.

But I also think, realistically, because these models are predicting sequences and they've been trained on enormous amounts of information, sometimes they produce erroneous results; they hallucinate. So we've applied all the tools, and that's why the research is so important, to make these systems more and more factual.

So the goal is to make this Meta AI assistant the biggest assistant in the world. Is it going to end up living more prominently within Messenger and WhatsApp? Because, again, there are times where I'm trying to find it, or I'm like, oh, I forgot it's there; it's not top of mind. How are you going to make those changes from a product standpoint?

I think we've already started the mission to make it more prominent, to make it more useful, and to meet people where they are. We have high-level entry points directly in the inbox, and we're integrating it into search, so we'll also have suggestions and typeaheads as you use it in search. So we've integrated it in a very prominent way. But one of the amazing things about how we do things at Meta is that we iterate, so we're going to learn a lot as we engineer these entry points, and as we understand usage patterns we'll likely modify and enhance and test and iterate and improve the experience for all of our users.

Okay. I also want to ask you about open source. Obviously these models are going to be open source; open-sourcing your models has been the calling card of Llama for a while. But when you get into, let's say, 400-billion-parameter models, like the one that you're developing, which I think is scheduled for release this summer, do you ever think, we don't want this to be usable by everyone? Because
there are inevitably going to be bad actors that use it, and we don't necessarily want to make it something they can use. Where is your stance on open source?

Well, the 400 is still training, and our general approach to open source is to look at the model, apply all of the safety-critical checks, and understand the trade-offs of the model itself and its performance; we approach these things very responsibly. So it's too early for me to comment on the 400-plus model, the large one. But I do think it's important to always remember the benefits of open-sourcing. I would say every AI lab in the world today has depended on openness and transparency in order to achieve the outcomes, the results, and the improvements to these models that we have today. So I think it's always important to re-anchor on the value and the benefit of open-sourcing, and these 8 and 70 models are going to be incredibly useful for people to innovate across the industry, and to really push understanding of the science: how to align these models, how to train them, how to improve them, which I think is really valuable.

But it's telling that your answer on the 400, the real big one, isn't a slam-dunk "yes, of course we're open-sourcing."

I think it's just still training. I think you're looking for a clear direction of yes or no, and I'm saying it's still training.

Yeah, and I guess my point in saying that it's telling is that it's going to be big, it's going to be powerful, and the first two were a definite yes even while they were still training. Because I do wonder about open source. By the way, I'm a fan of open source, but I also think about, what is the
recourse if somebody uses it for negative purposes? I speak with a lot of people who say cybersecurity is going to be a rising field because this thing is going to make it easier to phish, for instance, and I do wonder about those bad actors.

This is an area where, for example, we've been leading. If you look at our release with Llama 3, we're opening a model called Llama Guard, and we've also been opening up cybersecurity evals to help understand the safety metrics for cybersecurity. So we're also leading in terms of pushing safety standards and understanding how to measure and evaluate these models under different conditions.

Okay, last question for you before we go to break. You've trained a significantly more powerful model in Llama 3 versus Llama 2, and Llama 2 was already pretty good. I hear a lot of people talking about how, when they're hacking projects together, they're using Llama 2 as the default because it's open, and a lot of companies are moving toward these open-source models when they want more customizability. Is there anything in developing Llama 3, which again used 100 times more compute and 10 times more data, that surprised you, or made you take a step back and think, wow, this is a really significant advance from where we were before?

No, I don't think anything about the model has really personally surprised me in terms of its performance; I think we expected it to be here. We do a lot of rigorous scaling laws and rigorous prediction of what we think the metrics will look like, and while these models are impressive, they're not mysterious, if you will. They are, to some extent, at least at their current capability levels, well understood at the 8B and the 70B scale.
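The "rigorous scaling laws" mentioned here refer to fitting how loss falls as training scale grows, then extrapolating to predict a bigger run before paying for it. A minimal sketch of the mechanism, under assumptions that are mine rather than Meta's: a simple power law in training tokens, loss ≈ a · tokens^(-b), fitted by least squares in log-log space. The measurement points below are invented for illustration, not real data.

```python
import math

# Hypothetical measurements from small training runs: (training tokens, eval loss).
# These numbers are made up for illustration; they are not Meta's data.
runs = [(1e9, 3.80), (1e10, 3.02), (1e11, 2.40), (1e12, 1.91)]

# Assumed power law: loss ≈ a * tokens^(-b). Taking logs makes it linear,
# log(loss) = log(a) - b * log(tokens), so an ordinary least-squares line fits it.
xs = [math.log(tokens) for tokens, _ in runs]
ys = [math.log(loss) for _, loss in runs]
n = len(runs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
b = -slope                       # decay exponent of the power law
a = math.exp(mean_y + b * mean_x)

# Extrapolate: predicted loss for a 15-trillion-token run, before training it.
predicted_loss = a * (15e12) ** (-b)
```

In practice, published scaling-law work fits richer functional forms (joint in parameters and tokens) and validates against held-out runs, but the log-log fit above is the core trick that lets a lab predict "what the metrics will look like" at a scale it has not yet trained.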
And do you think there's going to come a point, if we continue at this level of progress, where they're going to push beyond that?

You know, I think it's hard to predict. I generally don't like to make predictions that I don't have confidence in, and having worked on something like autonomous agents for four or five years, I would tell you things are always harder to predict than we think. I've worked with people who were very confident, back in 2014, that we would all get in a car today, press a button, and go to work, and I've worked with people who were saying it was a hundred years away, and I think they're probably both wrong. I could say the same thing about where we are today: there are going to be people who are very bullish on what's going to happen in the next two years, and there are going to be people who are a lot more bearish. I'm personally in the science and in the work, trying to push the frontier and really understand what we can and can't do, and what these systems are and aren't capable of doing. But it's a very hard thing to predict, and I'm pretty sure if you polled a hundred scientists today, you'd get a hundred different answers. So I think it's better not to speculate, and better to understand through the science and the data.

Okay, well, I want to talk about some of the long-term vision. Meta's stated goal, I think for a while now, has been to build artificial general intelligence, or intelligence on par with human intelligence. So I want to talk a little bit more about that goal, what the company might do if it achieves it, and what the timeline might be. We're going to do it all after this. The last little bit here is going to be a discussion of that. If you are a paid Big Technology subscriber, you'll be able to listen to the
second half, and if you're not and you want to sign up, you can sign up for an upgraded Big Technology subscription at bigtechnology.com and check out the second half. All right, back right after this.