Robotics: why now? - Quan Vuong and Jost Tobias Springenberg, Physical Intelligence

Channel: aiDotEngineer

Published at: 2025-07-26

YouTube video id: cGLa8DsOYdk

Source: https://www.youtube.com/watch?v=cGLa8DsOYdk

So, good morning. Thanks for being here with us. My name is Quan. This is Toby. Our mission is to make a model that can control any robot to do any task. Now, this is not something that's ready today; we believe there are multiple scientific breakthroughs that need to happen for us to get there. And so we're very open: we publish our research, we open-source our models, and we talk very publicly about what it is that we do.

If you think about robotics before now (not to say that it wasn't useful; it's been incredibly impactful on the world), the scenario you often saw robots in was a very constrained environment, such as a factory: very repetitive motions, very structured surroundings. And when you tried to bring them into the real world, into even semi-structured settings, they struggled. I think this is a pretty well-known video of a full-body humanoid struggling to perform a somewhat simple task; it's from some time ago.

If you look at robotics today, what do you see? You see humanoids dancing. I don't think I can do that dance move, but I'll try. So you see very complex physical motions that robots are capable of. And you also see this video on the right, which we released late last year, of a robot operating on somewhat semi-structured objects. This is clothing that just came out of a dryer, so it's very hard to control the exact initial configuration of the scene, how each shirt happens to lie. The robot takes out all of the shirts and manages to put them in a basket. And in the full video, it goes on to bring them to a table and completes folding them.

So, what really changed? The obvious answer is that there is this AI wave that we're all riding on. Robotics benefits a lot from general AI development, but there are also vision-language-action models, which we're pioneering. So, what are they? I'll pass it off to Toby.
Cool. So, vision-language-action models: what are they, as Quan asked? Well, you probably all know by now what a multimodal LLM is; we often refer to it as a vision-language model, or VLM. Essentially, a VLM takes text and images as input, so you have some sort of prompt for the model, and it embeds them and passes them to a transformer model to autoregressively produce an answer in text. You probably interact with these models, as I do, every day now.

A VLA is essentially an adaptation of a VLM for the purposes of robotics. The model additionally gets inputs describing the robot's state, such as its joint positions, and instead of asking questions about what's going on in the scene, we ask the model to produce actions to control the robot directly.
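To make that contrast concrete, here is a minimal sketch of the two interfaces in Python. The class and field names are hypothetical, chosen purely for illustration; they are not PI's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VLMInput:
    prompt: str                  # text prompt for the model
    images: list[np.ndarray]     # RGB images, e.g. (H, W, 3) uint8

@dataclass
class VLAInput(VLMInput):
    joint_positions: np.ndarray  # robot proprioception, e.g. shape (7,)

class VLA:
    """Hypothetical VLA policy interface."""

    def act(self, obs: VLAInput) -> np.ndarray:
        # A real model would embed images + text + robot state, run the
        # transformer backbone, and decode a short chunk of future
        # actions rather than a single step. Zeros are a stand-in here:
        # shape (horizon, action_dim).
        return np.zeros((50, 7))

# In short: a VLM maps (text, images) -> text, while a VLA maps
# (text, images, robot state) -> continuous actions at control rate.
```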
So if VLMs and VLAs are so similar in principle, what are the additional engineering challenges we face when we try to train such a VLA? To understand this, I want to contrast the pipeline for VLM training with what we have to do when we train a VLA for robotics. When you, as a downstream customer of these big pre-trained models, want to use a VLM, you generally take a model, source data from the web, and maybe supplement it with a little bit of extra data for the specific task you're interested in. Then you take an off-the-shelf model and fine-tune it on a large cluster somewhere in the cloud, and finally you can use well-established libraries for inference and deployment in the cloud. All of this is probably bread and butter for a bunch of you.

Now, in contrast to that, if we want to do VLA training, if we want to train a model to exhibit dexterous, frontier-level behavior, then it's an open question what the analog of the web as a data source actually is. We believe this is, in some sense, a trillion-dollar question for the industry, and it's an entirely open research question, as Quan already said. Secondly, while we can reuse pre-trained VLM backbones, we typically have to adapt them, and adapt the model architectures, to allow for models that can control robots at the fairly high control frequencies we need to actually make progress.
And then finally, there's also no standard solution for deploying large robot policies in multiple locations, on premise and on device, right on the robots. This just doesn't exist. I won't really have time to talk about that third point in detail here, but I do want to dive a little bit into the data and model training for robotics that we do at PI.
So first: how can we design a data engine that enables robust, highly dexterous policies for difficult tasks with robots? At PI, we believe there is currently no standard solution for this at all. And so we're building a data engine essentially from the ground up, from zero. We are designing this pipeline both to get us very quickly to some sort of impressive capability (and I hope by the end of the talk you'll agree it looks somewhat impressive) and to enable significant scaling in the next few years.
We have seen firsthand that operationalizing this pipeline is one of the main ingredients: probably more than 50% of the work is getting the data pipeline right, getting the right data, and getting it to be high quality. So how does it work? Well, we typically start from an ever-expanding set of tasks that we pick to test what's possible at the moment. These are tasks such as folding clothes, bagging groceries, and many other tasks we're interested in. We then have human operators control our robots using a custom runtime and teleoperation system, which you can see in this video. Human operators control what we call leader arms: with robot arms strapped to their own arms, they trace out motions, and the motion gets transferred via software to the actual robot. This way you can demonstrate fairly intricate, highly dexterous tasks and collect that data for training afterwards.
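In spirit, leader-follower teleoperation is a simple streaming loop: read the leader arm's joint angles, command the follower at a fixed rate, and log the stream for training. A minimal sketch, with `read_leader_joints`, `command_follower`, and `capture_cameras` as hypothetical device functions standing in for the real drivers:

```python
import time
import numpy as np

CONTROL_HZ = 50  # teleop/control rates are typically tens of Hz

def read_leader_joints() -> np.ndarray:
    """Hypothetical: read joint angles from the operator's leader arm."""
    return np.zeros(7)

def command_follower(q: np.ndarray) -> None:
    """Hypothetical: send joint targets to the follower robot."""

def capture_cameras() -> list[np.ndarray]:
    """Hypothetical: grab the current camera frames."""
    return []

def teleop_episode(duration_s: float) -> list[dict]:
    """Mirror the leader onto the follower, logging (observation, action) pairs."""
    log, t_end = [], time.time() + duration_s
    while time.time() < t_end:
        q = read_leader_joints()
        command_follower(q)           # follower mirrors the operator
        log.append({"images": capture_cameras(), "action": q})
        time.sleep(1.0 / CONTROL_HZ)  # hold a fixed control rate
    return log
```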
Afterwards, we have a set of facilities for tracking metrics of what's going on at any moment. I think there's another slide here. Yep. We basically schedule these data collection sessions all the time, and each dash in this dashboard, which shows this week (I think from Tuesday), is an operator doing a specific episode for a specific task. So we collect a lot of this data, basically around the clock.

We then annotate that data in the cloud. We serve it in big buckets there, can filter it back down based on the annotations, and use it for model training.
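Conceptually, that filtering step is just a query over per-episode metadata. A sketch, assuming each episode carries a small annotation record alongside the raw data in a bucket (the field names here are made up):

```python
# Hypothetical per-episode annotation records, as they might sit
# alongside raw episode files in a cloud bucket.
episodes = [
    {"uri": "s3://bucket/ep_001", "task": "fold_shirt", "success": True, "quality": 0.92},
    {"uri": "s3://bucket/ep_002", "task": "fold_shirt", "success": False, "quality": 0.40},
    {"uri": "s3://bucket/ep_003", "task": "bag_groceries", "success": True, "quality": 0.88},
]

def select_for_training(records, task, min_quality=0.5):
    """Filter episodes down to successful, high-quality demos for a task."""
    return [r["uri"] for r in records
            if r["task"] == task and r["success"] and r["quality"] >= min_quality]

print(select_for_training(episodes, "fold_shirt"))  # ['s3://bucket/ep_001']
```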
After we've trained, we get policies that can actually solve the demonstrated tasks autonomously. In this video, you see π0, the model that Quan referred to earlier, which we released late last year. As you can see, it can do fairly impressive, highly dexterous tasks such as shirt folding.
Okay, so how far have we gotten by following this approach? Well, when we started, the biggest publicly available dataset was the Open X-Embodiment dataset, which contained about 3,800 hours of data, largely from static scenes in different robot labs around the world. After running the data pipeline I've described for six months, we had collected about 10,000 hours of successful episodes. This is successful data in tens of environments, covering hundreds of different tasks, enabling for example the kind of shirt-folding policies you saw before.
And then after another six months, we have collected many more hours of data in static scenes, but crucially we have also started to collect significant amounts of data using mobile manipulation setups such as the ones you see here. The data now spans many more tasks and, importantly, has massively grown in diversity, covering hundreds of different scenes and environments. As you can imagine, this scale and diversity enables new leaps, as you can see from this policy that is already running autonomously (I'll go into a little bit of detail about how this works), but it also brings lots of additional engineering challenges.
Now that we have described how we get the data, the question is what kind of capabilities we can elicit with this data in VLAs. To understand where we are today, I think it's useful to draw an analogy between the industry trends for VLMs and VLAs over the last three years.
First, for multimodal LLMs, or VLMs, we have seen a constant stream of improvements over the last three years, starting from the initial conversational agents that you've all interacted with, all the way to the RL-trained multimodal reasoning models and coding assistants that we all use today. VLAs follow a similar but time-lagged trajectory. The initial VLAs, such as RT-2, done by some of my colleagues who are now at PI, emerged in 2023, after LLMs had already been enhanced with vision encoders. In fact, some of the earliest multimodal LLMs were trained for robotics purposes by some of my colleagues, Danny and others, who are now working with us at PI. These models were impressive as first proofs of concept and showed some generalization capabilities: you could ask them to pick up different objects in the same kind of scene that appears in the training data. But they were generally held back by a lack of available robot data. Nonetheless, they sparked a big explosion of interest in the field, which is probably why you're here today.

Then from mid to late 2024, the first really dexterous multi-robot VLAs appeared, and the industry at large has now produced several of those. There are models such as Gemini Robotics or NVIDIA's GR00T models, which I think you'll hear a little bit about later as well, and our entry in this category was π0, which we believe is perhaps the most dexterous multi-robot model that you can use, and which is in fact open source as well. These models generally adjust their architectures to produce actions via diffusion, to enable the very fast generation at the high frequencies you need for robot control.

So if that's where we were, then what's next? Where are we now?
For us, the next leap was to study exactly how model capabilities change when we increase data collection diversity. This led us to develop π0.5, which is basically a VLA with open-world generalization, and I want to talk a little bit more about this.

So what does this look like? In general, we have massively expanded, as I said, the data we take in during training. It now consists of both static and mobile robot data (on the right here), as well as an extended set of multimodal VLM data such as data from the web, object detection data, and general language annotations for the robot data we've collected; we have a huge annotation pipeline as well, and that's what's on the left here. We then feed this data into a specially designed VLM, which starts from a pre-trained transformer model and is extended with an action expert transformer (to the right). The VLM part, the big backbone, is trained both to give predictions for general questions about the scene and to subdivide high-level requests that a human might give the model, such as "clean my bedroom." So it is trained to break these tasks into subtasks such as "pick up the pillow" in the case of cleaning the bedroom. At the same time, the action expert transformer can attend to the internals of the large VLM, can run at a much higher rate, and produces the actual continuous output actions via a flow matching objective, a diffusion-style technique, basically.
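To give a feel for the flow-matching idea, here is a toy sketch: a velocity field is trained to carry Gaussian noise to a target action along straight-line paths, and actions are then generated by Euler-integrating that field. The tiny linear "network" is a stand-in for the action expert, which in the real model also conditions on the VLM's internals; none of this is PI's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
ACT_DIM = 2  # toy action dimension

# Flow matching (rectified-flow style): for a data action a1 and noise a0,
# the point a_t = (1 - t) * a0 + t * a1 moves with constant velocity a1 - a0.
# The network v(a_t, t) is regressed onto that target velocity.

W = rng.normal(size=(ACT_DIM + 1, ACT_DIM)) * 0.1  # toy linear velocity field

def velocity(a_t, t):
    feats = np.concatenate([a_t, [t]])
    return feats @ W

def train_step(a1, lr=1e-2):
    """One regression step of the velocity field toward (a1 - a0)."""
    global W
    a0 = rng.normal(size=ACT_DIM)
    t = rng.uniform()
    a_t = (1 - t) * a0 + t * a1
    feats = np.concatenate([a_t, [t]])
    err = feats @ W - (a1 - a0)
    W -= lr * np.outer(feats, err)      # gradient of 0.5 * ||err||^2

def sample(steps=20):
    """Generate an action by integrating the learned velocity field."""
    a = rng.normal(size=ACT_DIM)        # start from pure noise at t = 0
    for i in range(steps):
        a = a + velocity(a, i / steps) / steps  # Euler step toward t = 1
    return a

for _ in range(2000):                   # fit the field to one "expert action"
    train_step(np.array([0.5, -0.3]))
print(sample())                         # moves from noise toward [0.5, -0.3]
```

The practical appeal, as mentioned above, is that a few integration steps produce a whole continuous action chunk, which is what makes high-frequency control feasible.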
Training this architecture on all our data leads us to a VLA that can perform difficult long-horizon tasks of up to 10 minutes per episode, which I think is much longer than anything we've seen before. And it can do this in entirely unseen homes, showcasing perhaps the first true sign that broad generalization can emerge from VLA training. In this video you see my colleague Chelsea prompting the model, in an entirely new home that is not in the training data, to perform multiple cleaning tasks, such as cleaning a surface here in a kitchen.
Cool.
And to understand this ability to generalize a little further, we tested how it emerges by training π0.5 on a fixed amount of data while varying the number of homes from which the data was drawn, then testing it in a held-out location. As you can see, as more locations are added (this is the yellow curve here), performance in the test scene generally increases, as you would expect, until, perhaps surprisingly, it matches and even slightly surpasses training on the held-out scene specifically. This was a very cool result for us, because it showed that by collecting more and more training data across different homes, we could expand the capabilities of the model in new environments, which I think is pretty cool.
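The protocol itself is simple to express. A schematic sketch, with `train_policy` and `evaluate` as hypothetical stand-ins for a full training and evaluation run:

```python
# Schematic version of the scaling experiment: fix the total amount of
# training data, vary how many distinct homes it is drawn from, and
# always evaluate in a home the model never saw.

def train_policy(episodes):
    """Hypothetical: train a VLA on the given episodes."""

def evaluate(policy, home):
    """Hypothetical: run the policy in `home`, return a success score."""
    return 0.0

def scaling_curve(data_by_home, held_out_home, budget):
    homes = [h for h in data_by_home if h != held_out_home]
    scores = {}
    for n in range(1, len(homes) + 1):
        pool = [ep for h in homes[:n] for ep in data_by_home[h]]
        policy = train_policy(pool[:budget])  # hold total data fixed
        scores[n] = evaluate(policy, held_out_home)
    return scores  # the yellow curve: score vs. number of training homes
```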
And here we see the same model performing a bedroom cleanup task in a new home. In this specific case, Laura, a colleague of mine, is prompting the model only to clean the bedroom. You can see the power of this capability of subdividing the request into several subtasks, such as throwing trash in the bin and then making the bed. And as you can see from the timer at the bottom, this policy is autonomously controlling the robot for multiple minutes at a time.

Cool. And with that, I'll give it back to Quan to talk a little bit about partnerships.
Thank you, Toby.

So, what you're seeing is a robot that we, the team at PI, have never seen in person. We've never had access to it, and it's running very, very far away from our office. It's performing the somewhat interesting task of making a cup of coffee end to end, and it works pretty well; we didn't need to iterate many times to produce this video. Now, why is this important? When you think about robotics, you often think about hardware and about real deployment challenges. Those are important, but it's our belief that one of the main bottlenecks is actually software, just model intelligence. And if you think about what would, if successful, scale with maximum velocity, in the sense that suddenly next year you have thousands or millions of robots deployed, it's really about demonstrating the hypothesis that our model can run across the many different hardware platforms out there, without us having to invest significant time and effort into working with each hardware platform. So this demonstration, that we've never touched this robot before and don't know how it works internally, and yet our model can control it to perform a fairly interesting task, is a piece of evidence in that direction.
We have much more evidence like this, and that's why it's important. We are also very open when we work with other companies, because we believe the problem is far from solved. When we work with a company, for example, and you ask how we run inference: we literally sent them the model checkpoint for them to run inference with, and we have very low-level technical discussions with them. This is to say that if you think there's a company we should be talking to, please let us know: you can tell me and Toby in person, or just shoot us an email.
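To picture what "send them the checkpoint and let them run inference" can look like in practice, here is a minimal sketch of a policy server using only Python's standard library; the checkpoint path, `load_policy`, and the request schema are all hypothetical illustrations, not PI's actual stack.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def load_policy(checkpoint_path):
    """Hypothetical: restore model weights and return a callable policy."""
    def policy(observation):
        return {"actions": [[0.0] * 7]}  # stand-in action chunk
    return policy

policy = load_policy("checkpoints/pi_model.ckpt")  # hypothetical path

class PolicyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The robot-side client POSTs an observation (images would be
        # base64-encoded in practice) and gets an action chunk back.
        length = int(self.headers["Content-Length"])
        observation = json.loads(self.rfile.read(length))
        body = json.dumps(policy(observation)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), PolicyHandler).serve_forever()
```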
And if you ask what our biggest bottleneck is right now: we're after this mission of building a model that can work on any robot to perform any task, and there are scientific problems, engineering problems, and operational problems that are far from solved. So our biggest bottleneck really is that we need the best people in the world in this area to help us accelerate progress. As a research organization, any role that you might think we need, we're hiring for it. Even if there is a role you're exceptional at that we don't have on our website, please feel free to let us know that we should really be hiring for it; we're happy to have a conversation with you about it. Again, you can talk to me and Toby in person about our hiring needs right now, but you can also apply online or shoot us a DM on Twitter. Thank you for listening.
[Applause]