What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha

Channel: aiDotEngineer

Published at: 2025-07-28

YouTube video id: mWKYvT9Lc50

Source: https://www.youtube.com/watch?v=mWKYvT9Lc50

Hi everyone, I'm Annika, and this is Aastha. We both work at NVIDIA, and we were part of the team that developed the GR00T N1 robotics foundation model. So today we're going to give you a sense of what that is and how you go about building a robotics foundation model.
But before we get into it, uh I feel
like a lot of people start talks here
with a hot take. Uh so the hot take that
I'm bringing to an AI conference is that
we're not necessarily running out of
jobs. Uh so this was a report done by
McKenzie uh part of their global
institute showing that in the world's 30
uh most advanced economies there's
actually too many jobs uh for the number
of people that could fill them and
really the two things you should look at
in this whole graph is the 4.2x 2x.
That's been the the rate at which we're
getting more jobs than people could fill
it over the last decade. And this line
that I'm highlighting in red, that's
where we're in trouble, where there's
just more jobs than able-bodied people
to fill those jobs. Uh, and obviously uh
there's a real conversation around AI
and jobs. So, it helps to look at what
industries uh are largely affected. I'm
going to highlight a couple in red. So
leisure, hospitality, healthcare,
construction, transportation,
manufacturing. Um I I guess you can
figure out what they have in common.
None of them can be solved by chat GPT
alone. Uh they require operating
instruments uh and devices in the
physical world. Uh and they require
physical AI. So that's that's really the
big challenge uh that I see over the
coming years is how do we take this huge
amount of intelligence that we're seeing
in language models um and make it
operable in the in the physical world.
The other question around humanoids is: why do we build them to look like humans? It's not just because we want them to look like us. The world was made for humans, and it's very hard to have generalist robots operate in our world and be generally useful without copying our physical form. There are a lot of specialist robots that do incredible things. I don't know if you got to try the espresso from the barista robot downstairs; it makes a good espresso. But that robot couldn't even cook rice. So if we want a robot that can do multiple tasks for you, it's just a lot easier if that robot can operate in our human world.
So how do we do this? There are three big buckets, three big stages. The first one is data: collecting, generating, or multiplying data, which we'll talk about quite a lot. Then, now that you have this synthetic and real data, but largely synthetic data, you train a model. We'll also talk about what that architecture and training paradigm look like. And finally, we deploy on the robot, at the edge. This is what we call the physical AI life cycle: generate the data, consume the data, and then finally deploy and have the robot operable in the physical world.
NVIDIA also likes to call this the three-computer problem, because the stages have very different compute characteristics. At the simulation stage, you're looking for a computer that's powerful at simulating, something like an OVX Omniverse machine. There's a lot of really interesting work happening on the simulation side, but that's a very different type of workload from training, where we're using a DGX to consume these enormous amounts of data and learn from them. And finally, when we're deploying at the edge, it needs to be a model that's small enough and efficient enough to run on an edge device like an AGX. And really, this is Project GR00T. Project GR00T is NVIDIA's strategy for bringing humanoid and other forms of robotics into the world, and it's everything from the compute infrastructure to the software to the research that's needed. It's not simply one foundation model, but that is what we'll be focusing on in this talk, because that's what we worked on.

So the GR00T N1 foundation model was announced at GTC in March. It is open source, it is highly customizable, and a very big part of it is that it's cross-embodiment. Basically, you can take this base model (there are specific embodiments that we have fine-tuned for), but the whole premise is that you can take this base model, a two-billion-parameter model, which in the world of LLMs is tiny but still pretty sizable for a robot, and then go and modify it for your embodiment and your use cases.

So let's start with the first huge, daunting task in the world of robotics: data.
When the GR00T team actually started thinking about data, they put together this idea of the data pyramid, which is very elegant but was born out of desperation and necessity. The data you want does not exist in quantity. There is no internet-scale data set to scrape or download or put together, because robots haven't made it onto YouTube yet.

At the top of the pyramid we have the real-world data, which is real robots doing real tasks and solving them. How is it collected? Humans teleoperate a robot most of the time, for example wearing an Apple Vision Pro and wearing gloves. There are all kinds of ways to teleoperate the robot so that you have a real robot successfully completing a task, and then you have that ground-truth data. You can imagine this is very small in quantity and very expensive. We put 24 hours per robot per day, because that's how many hours a human has, but the reality is that humans and robots get tired, so it's not even 24 hours. So this really is a very, very limited data set.

Then at the bottom of the pyramid we have the internet. We have huge amounts of video data, and it's typically humans solving tasks; you can imagine someone recording a cooking tutorial and putting it out there. It's unstructured data, and it's not necessarily relevant to robots, but there is some value in it, so we didn't want to completely discard it. It forms part of this cohesive data strategy.

And then in the middle, synthetic data. This is a topic that could fill this entire talk, and I've cut down so many slides on just this section, because in theory this is infinite, right? You could just let the GPU keep generating more data. But in practice, creating high-quality simulation environments is very labor intensive and requires serious skill. And on top of that, the other technique, which I'll share a little bit about, is taking the human trajectories that we do collect (the human teleoperation data) and trying to multiply them through, essentially, video generation models: world foundation models that we fine-tune to do this task. But even in that case, there's a lot of active research in how we take the little bits of high-quality data that we have and multiply them, as well as how we effectively combine simulation data with this real-world data. So this is DreamGen; this was announced at Computex very recently. All in all, this data piece is a huge part of what Project GR00T is about, and there are many, many solutions here as part of the data strategy.
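To make the pyramid idea concrete, here is a minimal sketch of how a training pipeline might mix the three tiers by sampling weight. This is purely illustrative; the tier names, episode counts, and mixture ratios are hypothetical, not the actual GR00T N1 recipe.

```python
import random

# Hypothetical sizes and mixture weights for the three tiers of the data pyramid.
DATA_TIERS = {
    "real_teleop":   {"num_episodes": 1_000,     "weight": 0.3},  # scarce, expensive, ground truth
    "synthetic_sim": {"num_episodes": 100_000,   "weight": 0.5},  # generated in simulation
    "web_video":     {"num_episodes": 1_000_000, "weight": 0.2},  # unstructured human videos
}

def sample_training_episode():
    """Pick a data tier according to its mixture weight, then an episode from that tier."""
    tiers = list(DATA_TIERS)
    weights = [DATA_TIERS[t]["weight"] for t in tiers]
    tier = random.choices(tiers, weights=weights, k=1)[0]
    episode_idx = random.randrange(DATA_TIERS[tier]["num_episodes"])
    return tier, episode_idx

if __name__ == "__main__":
    for _ in range(5):
        print(sample_training_episode())
```

The point is simply that the scarce real-robot data at the top of the pyramid can be upweighted relative to its raw size, while the plentiful but less targeted data lower in the pyramid still contributes.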
But for now, the next piece is: how do we bring all of this data into an architecture? So I'm going to hand it over to Aastha to explain that.

Thank you, Annika. Can you all hear me? All right, awesome. So before we dive into the architecture, I'm going to show you what an example input looks like and what an example output looks like. What you see here is the image observation, the robot state, and the language prompt; that's the input. And what's the output? The output is a robot action trajectory. The prompt was to pick up the industrial object and place it in the yellow bin, and that's what the robot does: it picks it up and places it in the yellow bin, very neatly.
But that is how it appears to us as humans. Is the robot, the humanoid, seeing the same thing? Not really. The humanoid sees this: a bunch of floating-point vectors, which control the different joints. So you're seeing the output as a trajectory, the motion of the robot hand, but that's not what the robot is seeing. It uses these vectors to actually generate a continuous action. And to set context on what a robot state and action are: you can imagine the state as a snapshot of the robot at an instant in time, including the robot's physical configuration and its environment. The action is then what the robot decides to do next based on that state.
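As a rough illustration of what the inputs and outputs look like in code, here is a sketch of an observation and an action chunk. The shapes and field names are hypothetical; the real spaces depend on the specific embodiment.

```python
import numpy as np

# Illustrative observation: camera frame, proprioceptive state, and language prompt.
observation = {
    "image": np.zeros((224, 224, 3), dtype=np.uint8),   # image observation from the camera
    "state": np.zeros(32, dtype=np.float32),            # joint positions/velocities, gripper, etc.
    "language": "pick up the industrial object and place it in the yellow bin",
}

# The output is a short horizon of future actions: one floating-point vector of
# joint commands per control step, which the low-level controller turns into motion.
action_chunk = np.zeros((16, 32), dtype=np.float32)      # 16 steps x 32 joint commands

def execute(action_chunk, send_to_robot):
    """Stream each action vector to the robot's low-level controller."""
    for action in action_chunk:
        send_to_robot(action)
```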
So
moving on and diving a bit deeper into
the architecture,
the GR00T N1 system introduced a very interesting concept, and this concept is inspired by Daniel Kahneman's book Thinking, Fast and Slow. Show of hands: how many of you have read the book? Amazing, that helps to explain. It's inspired by the same concept, but applied to a robotics context. So we have two systems, System 1 and System 2. System 2 you can imagine as the brain of the robot, or the brain of the model; that's the part that actually tries to break down the complex task and make it simpler, so that System 1 can execute on it. You can think of System 2 as the planner, which runs slowly to break down the complex task, and System 1 is the fast one: it operates at almost 120 Hz and executes the steps that System 2 puts out for it.
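A minimal sketch of what this dual-system control loop might look like, assuming a slow planner and a fast action policy as hypothetical callables; the exact frequencies and replanning schedule here are simplified, not taken from the talk.

```python
import time

def dual_system_loop(system2_plan, system1_act, get_observation, send_to_robot,
                     control_hz=120, replan_every=60):
    """Run a slow planner (System 2) alongside a fast action policy (System 1)."""
    dt = 1.0 / control_hz
    plan = None
    step = 0
    while True:
        obs = get_observation()
        if step % replan_every == 0:          # System 2: slow, deliberate re-planning
            plan = system2_plan(obs)
        action = system1_act(obs, plan)       # System 1: fast reflexive control (~120 Hz)
        send_to_robot(action)
        step += 1
        time.sleep(dt)
```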
And now we're going to go another level deeper into the architecture. It's okay if all of this seems complicated, because it's not very straightforward. We have as input the robot state and the noised action. You might be wondering why we call it a noised action: noise is the natural state here, because the sensors don't capture the action perfectly, so we have noised actions. These are passed to a state encoder and an action encoder, which generate tokens. You may be familiar with tokens; we've talked about LLMs and agents a lot. It's the same concept, just different kinds of tokens: state tokens and action tokens. These are then passed through a diffusion transformer block, which is essentially multiple layers of cross-attention and self-attention.
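To give a feel for how noised actions turn into usable actions, here is a generic diffusion-style denoising sketch: start from noise and let the diffusion transformer iteratively refine it, conditioned on the state and VLM tokens. The transformer callable, token shapes, step schedule, and update rule are all assumptions for illustration, not the actual GR00T N1 sampler.

```python
import numpy as np

def denoise_actions(diffusion_transformer, state_tokens, vlm_tokens,
                    horizon=16, action_dim=32, num_steps=10):
    """Iteratively refine a noised action chunk into a clean one (conceptual sketch)."""
    actions = np.random.randn(horizon, action_dim).astype(np.float32)  # start from pure noise
    for step in reversed(range(num_steps)):
        # The transformer attends to state tokens (self-attention) and VLM tokens (cross-attention).
        predicted_clean = diffusion_transformer(
            noised_actions=actions,
            state_tokens=state_tokens,
            conditioning_tokens=vlm_tokens,
            timestep=step,
        )
        # Blend toward the prediction (a stand-in for a proper denoising update rule).
        alpha = 1.0 / (step + 1)
        actions = (1 - alpha) * actions + alpha * predicted_clean
    return actions
```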
Now, bringing in the other piece, which is the vision input and the text input: you have the vision encoder, which takes the image input and generates tokens, passing them to the VLM to bring them into a standardized encoding format, and the text tokenizer, which takes the text input and does the same, passing it through the VLM. All of the output tokens from the VLM (in the case of GR00T N1, it was the Eagle-2 VLM) are passed into the cross-attention layers of the diffusion transformer block, and then you get some output tokens. But these output tokens are still not ready to be consumed by the physical robot, so you need to make them consumable, and that's where you have this key piece called the action decoder. It may seem like there are lots of encoders and lots of decoders, but you could say that the action decoder is the one that gives the model the capability to be a generalist. You give it an action decoder that is specific to the embodiment you're going to use. Whether it's a humanoid hand or an industrial robot arm, that's where the action decoder comes into play: it's specific to the embodiment you're trying to use, and it translates the output tokens into an action vector for that embodiment, which can then be turned into continuous robot motion.
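Here is a small sketch of the idea of an embodiment-specific action decoder: the same shared output tokens, one decoder head per embodiment. The linear projection, dimensions, and registry are illustrative assumptions, not the actual decoder architecture.

```python
import numpy as np

class ActionDecoder:
    """Embodiment-specific head: maps shared output tokens to this robot's joint-command vector."""

    def __init__(self, token_dim, num_joints, seed=0):
        rng = np.random.default_rng(seed)
        # Simple linear projection as a stand-in for the real decoder.
        self.proj = rng.standard_normal((token_dim, num_joints)).astype(np.float32) * 0.01

    def __call__(self, output_tokens):
        # One joint-command vector per output token (one per control step in the chunk).
        return output_tokens @ self.proj

# Hypothetical registry: one decoder per embodiment, sharing the same backbone upstream.
decoders = {
    "humanoid_hand": ActionDecoder(token_dim=512, num_joints=24),
    "industrial_arm": ActionDecoder(token_dim=512, num_joints=7),
}

output_tokens = np.zeros((16, 512), dtype=np.float32)     # from the diffusion transformer block
arm_actions = decoders["industrial_arm"](output_tokens)    # (16, 7) joint commands for the arm
```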
I'm just going to give you a second to digest all of this. You can see that the action decoder is very, very important: without it, you would only be able to train a model for one specific embodiment, whereas this model can leverage foundation knowledge from all the different embodiments and then bring it to one particular embodiment. The concept is essentially similar to that of a foundation model. Moving on to the next slide.
There are two main ways of robot learning, and the reason I chose to keep this slide is that it came up a lot in the conversations I've had over the past couple of days. There are two ways of training robots: one is imitation learning, and the other is reinforcement learning. Imitation learning: in simple English, to imitate means to copy someone, to learn by copying, and that's exactly what's happening here. You have a human expert, the robot tries to copy that expert, and you minimize the loss between the robot's behavior and the expert's. You have a gold standard, and you're trying to match that gold standard. Reinforcement learning, in contrast, is more of a trial-and-error format: what you're doing is simply maximizing a reward. You don't have a golden state; you just try to get as good as you can. You can think of it like having siblings: when there are siblings, parents compare the two and say, "you need to be like your elder sibling," but when there's no sibling, you can just be as good as you want. That's reinforcement learning for you. And like all things good and bad in this world, both come with pros and cons. With imitation learning, you're severely bottlenecked by the expert data, which is quite expensive. With reinforcement learning you don't have that bottleneck, but the key challenge is sim-to-real: there's a huge gap between simulation and the real world, and it's an active area of research that a lot of research labs and universities are pursuing. So those are the two ways of training robots, and GR00T N1 used both of them in some way.
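To make the contrast concrete, here is a toy sketch of the two objectives: imitation learning minimizes a loss against expert actions, while reinforcement learning maximizes accumulated reward. The mean-squared-error loss and the toy numbers are illustrative choices, not the specific objectives used for GR00T N1.

```python
import numpy as np

def imitation_loss(predicted_actions, expert_actions):
    """Behavior cloning: minimize the gap between the policy's actions and the expert's."""
    return float(np.mean((predicted_actions - expert_actions) ** 2))

def reinforcement_return(rewards, gamma=0.99):
    """Reinforcement learning: no expert to copy; maximize the discounted sum of rewards."""
    return float(sum(r * gamma ** t for t, r in enumerate(rewards)))

# Toy numbers, purely illustrative.
print(imitation_loss(np.array([0.1, 0.2]), np.array([0.0, 0.25])))   # lower is better
print(reinforcement_return([0.0, 0.0, 1.0]))                          # higher is better
```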
Here's an example of the trained model and what it can do. On the left, you see the model performing a few pick-and-place tasks in the kitchen. On the top right, you can see that with enough training, the model can be taught how to be romantic as well; you don't see all the fallen champagne glasses and fallen flowers that went into capturing this perfect snap. And on the bottom right are two robot friends working together on an industrial task, again a pick-and-place task. But these are not the only tasks these humanoids or robots can be doing; they can be extended to any task and any environment. And that's why we have a foundation model, a generalist foundation model, which can be extended to any downstream task.
So, this is going to be my conclusion. There are three core principles that we spoke about today, and each of them is very hefty by itself.

First, the data pyramid, which Annika spoke about. In the case of LLMs, or text models, you have the whole internet that you can scrape to generate data, but there is no such internet-scale data for actions. That is one of the key challenges you need to address, whether via simulation, imitation learning, generating expert data, teleoperation, all sorts of things.

The next thing is the dual-system architecture. Previously, each of these components was trained independently, and that resulted in a kind of disagreement between the two systems. GR00T N1 introduces a coherent architecture where System 1 and System 2 are co-trained, which helps optimize the whole stack instead of individually training the pieces.

And then the third and final piece is the generalist model.
With the generalist model, you're able to leverage foundation knowledge from the model and extend it to different embodiments and different tasks. You can think of it like large language models: you have a base foundation model, a Llama 2 7B, or there's Llama 4 now, or Llama 3 (I don't know which is the latest), and you can fine-tune it to any task or domain-adapt it. Similarly, you have the GR00T N1 model, which can be adapted to any embodiment and any downstream task.
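As a final illustration of that adaptation idea, here is one plausible (hypothetical) post-training recipe: keep the pretrained backbone, attach a fresh embodiment-specific action decoder, and fine-tune on that robot's demonstration data. This is a sketch under assumed module names, not NVIDIA's documented procedure.

```python
import torch

def adapt_to_new_embodiment(pretrained_backbone, new_action_decoder, dataloader,
                            freeze_backbone=True, lr=1e-4, steps=1000):
    """Fine-tune a generalist model for a new embodiment (illustrative sketch)."""
    if freeze_backbone:
        for p in pretrained_backbone.parameters():
            p.requires_grad = False
    params = list(new_action_decoder.parameters())
    if not freeze_backbone:
        params += list(pretrained_backbone.parameters())
    opt = torch.optim.AdamW(params, lr=lr)

    for _, (obs, expert_actions) in zip(range(steps), dataloader):
        tokens = pretrained_backbone(obs)              # shared representation from the base model
        pred_actions = new_action_decoder(tokens)      # new embodiment's joint space
        loss = torch.nn.functional.mse_loss(pred_actions, expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_action_decoder
```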
Thank you so much for attending our talk today. We're really happy you were here. Please let us know if you have questions; we'll be outside hanging out. Thank you so much. We appreciate it.