What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha
Channel: aiDotEngineer
Published at: 2025-07-28
YouTube video id: mWKYvT9Lc50
Source: https://www.youtube.com/watch?v=mWKYvT9Lc50
[Music] Hi everyone, I'm Annika, and this is Aastha. We both work at NVIDIA, and we were part of the team that developed the GR00T N1 robotics foundation model. Today we're going to give you a sense of what that is and how you go about building a robotics foundation model.

But before we get into it, a lot of people start talks here with a hot take, so the hot take I'm bringing to an AI conference is that we're not necessarily running out of jobs. This was a report from the McKinsey Global Institute showing that in the world's 30 most advanced economies there are actually too many jobs for the number of people who could fill them. The two things to look at in this graph are the 4.2x and 2x, which is the rate at which jobs have been outpacing the people available to fill them over the last decade, and the line I'm highlighting in red, which is where we're in trouble: there are simply more jobs than able-bodied people to fill them. Obviously there's a real conversation around AI and jobs, so it helps to look at which industries are most affected. I'll highlight a few in red: leisure, hospitality, healthcare, construction, transportation, manufacturing. You can probably see what they have in common: none of them can be solved by ChatGPT alone. They require operating instruments and devices in the physical world; they require physical AI. That's really the big challenge I see over the coming years: how do we take the huge amount of intelligence we're seeing in language models and make it operable in the physical world?

The other question around humanoids is why we build them to look like humans. It's not just because we want them to look like us; the world was made for humans. It's very hard to have generalist robots operate in our world and be generally useful without copying our physical form. There are plenty of specialist robots that do incredible things. I don't know if you tried the espresso from the barista robot downstairs; it makes a good espresso, but that robot couldn't even cook rice. So if we want a robot that can do multiple tasks for you, it's just a lot easier if that robot can operate in our human world.

So how do we do this? There are three big buckets, three big stages. The first is collecting, generating, or multiplying data, which we'll talk about quite a lot. Then, with that synthetic and real data (largely synthetic), you train a model; we'll also cover what that architecture and training paradigm look like. And finally, you deploy on the robot, at the edge. This is what we call the physical AI life cycle: generate the data, consume the data, and then deploy and have the robot operable in the physical world. NVIDIA also likes to call this the three-computer problem, because the stages have very different compute characteristics. At the simulation stage you want a computer that's powerful at simulation, something like an OVX Omniverse machine. There's a lot of really interesting work happening on the simulation side, but that's a very different workload from training, where a DGX consumes enormous amounts of data and learns from it.
And then finally, when we're deploying at the edge, the model needs to be small and efficient enough to run on an edge device like an AGX. And really, this is Project GR00T. Project GR00T is NVIDIA's strategy for bringing humanoid and other forms of robotics into the world, and it's everything from the compute infrastructure to the software to the research that's needed. It's not just one foundation model, but the foundation model is what we'll focus on in this talk, because that's what we worked on.

The GR00T N1 foundation model was announced at GTC in March. It is open source, it is highly customizable, and a very big part of it is that it is cross-embodiment. There are specific embodiments we have fine-tuned for, but the whole premise is that you can take this base model, a two-billion-parameter model, which is tiny in the world of LLMs but still pretty sizable for a robot, and modify it for your embodiment and your use cases.

So let's start with the first huge, daunting task in the world of robotics: data. When the GR00T team started thinking about data, they put together the idea of the data pyramid, which is very elegant but was born out of desperation and necessity. The data you want does not exist in quantity. There is no internet-scale dataset to scrape or download or put together, because robots haven't made it to YouTube yet.

At the top of the pyramid we have real-world data: real robots doing real tasks and solving them. Most of the time it is collected by humans teleoperating a robot, for example wearing an Apple Vision Pro and gloves; there are all kinds of ways to teleoperate a robot so that a real robot successfully completes a task and you capture that ground-truth data. As you can imagine, this data is very small in quantity and very expensive. We wrote 24 hours per robot per day because that's how many hours a human has, but the reality is that humans and robots get tired, so it's not even 24 hours. It really is a very, very limited dataset.

At the bottom of the pyramid we have the internet: huge amounts of video data, typically of humans solving tasks. Imagine someone recording a cooking tutorial and putting it out there. It's unstructured data and not necessarily relevant to robots, but there is some value in it, so we didn't want to discard it completely; it forms part of a cohesive data strategy.

And then in the middle, synthetic data. This is a topic that could fill this entire talk, and I've cut so many slides on just this section. In theory it's infinite, right? You could just let the GPU keep generating more data. But in practice, creating high-quality simulation environments is very labor intensive and requires serious skill. On top of that, the other technique, which I'll share a little about, is taking the human trajectories that we do collect, the human teleoperation data, and multiplying it through what are essentially video generation models: world foundation models that we fine-tune to do this task.
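To make the data pyramid a bit more concrete, here is a minimal sketch (not the actual GR00T training code) of how a training loader might mix the three tiers, sampling real teleoperation data, synthetic simulation data, and web video data with different weights. The tier names and mixture weights are illustrative assumptions.

```python
import random

# Hypothetical mixture weights for the three tiers of the data pyramid.
# Real teleoperation data is scarce but high quality; synthetic and web
# data are far more plentiful, so they dominate the mixture.
DATA_MIXTURE = {
    "real_teleop": 0.1,    # top of the pyramid: real robots, real tasks
    "synthetic_sim": 0.6,  # middle: simulation / generated trajectories
    "web_video": 0.3,      # bottom: unstructured human videos
}

def sample_batch(datasets, batch_size=32):
    """Draw one training batch by sampling tiers according to the mixture."""
    tiers = list(DATA_MIXTURE.keys())
    weights = [DATA_MIXTURE[t] for t in tiers]
    batch = []
    for _ in range(batch_size):
        tier = random.choices(tiers, weights=weights, k=1)[0]
        batch.append(random.choice(datasets[tier]))
    return batch

# Toy usage: each "sample" is just a placeholder dict.
datasets = {
    "real_teleop": [{"source": "teleop", "id": i} for i in range(10)],
    "synthetic_sim": [{"source": "sim", "id": i} for i in range(1000)],
    "web_video": [{"source": "web", "id": i} for i in range(500)],
}
print(sample_batch(datasets, batch_size=4))
```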
But even in that case, there's a lot of active research into how we take the little bits of high-quality data we have and multiply them, as well as how we effectively combine simulation data with real-world data. This is DreamGen, which was announced at Computex very recently. All in all, this data piece is a huge part of what Project GR00T is about, and there are many solutions that make up the data strategy. The next piece is how we bring all of this data into an architecture, so I'll hand over to Aastha to explain that.

Thank you, Annika. Can you all hear me? Awesome. Before we dive into the architecture, I'm going to show you what an example input and an example output look like. What you see here is the image observation, the robot state, and the language prompt: that's the input. And what's the output? The output is a robot action trajectory. The prompt was to pick up the industrial object and place it in the yellow bin, and that's what the robot does: it picks it up and places it in the yellow bin very neatly. But that's how it appears to us as humans. Is the humanoid seeing the same thing? Not really. The humanoid sees a bunch of floating-point vectors that control the different joints. You see the output as a trajectory, the motion of the robot hand, but that's not what the robot sees; it uses these vectors to generate a continuous action. To set context on what a robot state and action are: you can imagine the state as a snapshot of the robot at an instant in time, including the robot's physique and its environment, and the action is what the robot decides to do next based on that state.

Moving on and diving a bit deeper into the architecture: the GR00T N1 system introduces a very interesting concept, inspired by Daniel Kahneman's book Thinking, Fast and Slow. Show of hands, how many of you have read the book? Amazing, that helps. It's the same concept, applied to a robotics context. We have two systems, System 1 and System 2. System 2 you can imagine as the brain of the model; it's the part that breaks a complex task down into simpler pieces that System 1 can execute. So you can think of System 2 as the planner, which runs slowly to decompose the complex task, and System 1 as the fast one: it operates at roughly 120 Hz and executes the steps that System 2 puts out for it.

Now we'll go another level deeper into the architecture, and it's okay if this feels complicated, because it's not very straightforward. The input is the robot state and the noised action. You might wonder why it's called a noised action: noise is natural here because the sensors don't capture the action perfectly, so we work with noised actions. These are passed to a state encoder and an action encoder, which generate tokens. You may be familiar with tokens from LLMs and agents; it's the same concept, just different kinds of tokens: state tokens and action tokens. They are then passed through a diffusion transformer block.
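As a rough illustration of that encoding step, here is a minimal PyTorch-style sketch, under the assumption that the state and action encoders are small MLPs that project raw state and noised-action vectors into the transformer's token dimension. The dimensions, module names, and the MLP choice are assumptions for illustration, not the actual GR00T N1 implementation.

```python
import torch
import torch.nn as nn

TOKEN_DIM = 512  # assumed transformer hidden size (illustrative)

class StateEncoder(nn.Module):
    """Projects a robot state vector (joint positions, gripper, etc.)
    into a single state token."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, TOKEN_DIM), nn.GELU(),
            nn.Linear(TOKEN_DIM, TOKEN_DIM),
        )

    def forward(self, state):                 # (batch, state_dim)
        return self.mlp(state).unsqueeze(1)   # (batch, 1, TOKEN_DIM)

class ActionEncoder(nn.Module):
    """Projects a chunk of noised action vectors into action tokens,
    one token per timestep in the chunk."""
    def __init__(self, action_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(action_dim, TOKEN_DIM), nn.GELU(),
            nn.Linear(TOKEN_DIM, TOKEN_DIM),
        )

    def forward(self, noised_actions):   # (batch, horizon, action_dim)
        return self.mlp(noised_actions)  # (batch, horizon, TOKEN_DIM)

# Toy usage: a 25-dim state and a 16-step chunk of 7-dim arm actions.
state = torch.randn(2, 25)
noised_actions = torch.randn(2, 16, 7)
tokens = torch.cat(
    [StateEncoder(25)(state), ActionEncoder(7)(noised_actions)], dim=1
)
print(tokens.shape)  # torch.Size([2, 17, 512]) -> fed to the diffusion transformer
```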
The diffusion transformer block is essentially multiple layers of cross-attention and self-attention. Now bring in the other pieces, the vision input and the text input. The vision encoder takes the image input and generates tokens, which are passed to the VLM to bring them into a standardized encoding; the text tokenizer takes the text input and does the same, also passing through the VLM. All of the output tokens from the VLM (in GR00T N1's case, the Eagle-2 VLM) are fed into the cross-attention layers of the diffusion transformer block, and out come output tokens. But these output tokens are still not ready to be consumed by the physical robot, and that's where a key piece called the action decoder comes in.

It may seem like there are lots of encoders and decoders, but the action decoder is the piece that gives the model the capability to be a generalist. You give it an action decoder specific to the embodiment you're going to use, whether that's a humanoid hand or an industrial robot arm. It translates the output tokens for your specific embodiment and produces an action vector, which can be turned into continuous robot (or embodiment) motion. I'll give you a second to digest all of this. You can see that the action decoder is very, very important, because otherwise you would only be able to train a model for one specific embodiment; instead, this model can leverage foundation knowledge from all the different embodiments and bring it to one particular embodiment. The concept is essentially the same as that of a foundation model.

Moving on to the next slide: there are two main ways of robot learning, and the reason I kept this slide is that it came up a lot in conversations over the past couple of days. One is imitation learning and the other is reinforcement learning. In plain English, imitation means to copy someone, to learn by copying, and that's exactly what's happening here: you have a human expert, the robot tries to copy the human expert, and you minimize the loss between the policy and the expert. You have a gold standard and you're trying to match it. Reinforcement learning, in contrast, is more of a trial-and-error format: you simply maximize a reward. There's no golden state; you're just trying to get as far as you can, the best you can. You can think of it like having siblings: when there are siblings, parents compare the two and say you need to be like your elder sibling, but when there's no sibling, you can just be as good as you want. That's reinforcement learning for you. And like all things in this world, both come with pros and cons. With imitation learning, you're severely bottlenecked by the expert data, which is quite expensive.
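As a small illustration of "minimizing the loss between the policy and the expert", here is a minimal behavior-cloning sketch: a tiny policy network regresses expert actions from states with a mean-squared-error loss. This is a generic imitation-learning example under assumed shapes, not the GR00T N1 training objective (which operates on action chunks through the diffusion transformer).

```python
import torch
import torch.nn as nn

# Hypothetical tiny policy: maps a 25-dim state to a 7-dim action.
policy = nn.Sequential(
    nn.Linear(25, 128), nn.ReLU(),
    nn.Linear(128, 7),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-ins for a batch of expert (teleoperation) demonstrations.
expert_states = torch.randn(64, 25)
expert_actions = torch.randn(64, 7)

for step in range(100):
    predicted_actions = policy(expert_states)
    # Imitation learning: minimize the gap between policy and expert actions.
    loss = nn.functional.mse_loss(predicted_actions, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```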
Reinforcement learning, on the other hand, doesn't have that bottleneck, but its key challenge is sim-to-real: there's a big gap between simulation and the real world, and it's an active area of research that a lot of research labs and universities are pursuing. So those are the two ways of training robots, and GR00T N1 used both of them in some form.

Here's an example of what the trained model can do. On the left, you see the model doing a few pick-and-place tasks in a kitchen. On the top right, you can see that with enough training the model can be taught how to be romantic as well; you don't see all the fallen champagne glasses and fallen flowers that went into capturing this perfect snap. And on the bottom right, two robot friends are working together on an industrial task, a pick-and-place task again. But these aren't the only tasks these humanoids or robots can do; they can be extended to any task and any environment, and that's why we have a generalist foundation model that can be adapted to any downstream task.

So this is my conclusion. There are three core principles we spoke about today, and each of them is hefty in itself. First, the data pyramid Annika spoke about: with LLMs and text models you have the whole internet to scrape for data, but there's no such internet-scale data for actions. That's one of the key challenges, and you address it via simulation, via generating expert data through teleoperation, all sorts of things. The second is the dual-system architecture. Previously, each of these components was trained independently, which resulted in a kind of disagreement between the two systems; GR00T N1 introduces a coherent architecture where System 1 and System 2 are co-trained, which helps optimize the whole stack instead of training the pieces individually. And the third and final piece is the generalist model: you can leverage the foundation knowledge in the model and extend it to different embodiments and different tasks. Think of it like large language models, where you take a base foundation model such as Llama 2, or Llama 3 or Llama 4 now (I'm not sure which is the latest), and fine-tune or domain-adapt it to any task; similarly, the GR00T N1 model can be adapted to any embodiment and any downstream task.

Thank you so much for attending our talk today. We're really happy you were here. Please let us know if you have questions; we'll be hanging out outside. Thank you so much, we appreciate it. [Music]