Robotics: why now? - Quan Vuong and Jost Tobias Springenberg, Physical Intelligence
Channel: aiDotEngineer
Published at: 2025-07-26
YouTube video id: cGLa8DsOYdk
Source: https://www.youtube.com/watch?v=cGLa8DsOYdk
So, good morning. Thanks for being here with us. My name is Quan, and this is Toby. Our mission is to make a model that can control any robot to do any task. Now, this is not something that's ready today; we believe there are multiple scientific breakthroughs that need to happen for us to get there. And so we're very open: we publish our research, we open-source our models, and we talk very publicly about what it is that we do.

If you think about robotics before now, it's not that it wasn't useful; it's been incredibly impactful on the world. But the scenarios you often see robots in are either very constrained environments, such as a factory, with very repetitive motions in a very structured setting, or, when you try to bring them into the real world or even just semi-structured settings, they struggle. There's a fairly well-known video, from some time ago, of a full-body humanoid struggling to perform a somewhat simple task.

If you look at robotics today, what do you see? You see humanoids dancing. I don't think I can do that dance move, but I'll try. So you see very complex physical motion that robots are capable of. And you also see this video on the right, which we released late last year, of a robot operating with somewhat semi-structured objects. This is clothing that just came out of a dryer, so it's very hard to control the exact initial configuration of the scene, how each shirt lies. The robot takes out all of the shirts and manages to put them in a basket, and in the full video it goes on to bring them to a table and finishes folding them.

So what really changed? The obvious answer is that there is this AI wave that we're all riding on; robotics benefits a lot from general AI development. But there are also vision-language-action models, which we're pioneering. So what are they? I'll pass it off to Toby.

Cool. Yeah. So, vision-language-action models: what are they, as Quan asked? Well, you probably all know by now what a multimodal LLM is; we often refer to it as a vision-language model, or VLM. A VLM generally takes text and images as input, so you have some sort of prompt for the model; it embeds them and passes them to a transformer that autoregressively produces an answer in text. You probably interact with these models, as I do, every day now.

A VLA is essentially an adaptation of a VLM for the purposes of robotics. The model additionally gets inputs describing the robot's state, such as its joint positions, and instead of asking questions about what's going on in the scene, we ask the model to produce actions that control the robot directly. So if VLMs and VLAs are so similar in principle, what are the additional engineering challenges we face when we try to train such a VLA? To understand this, I want to contrast the pipeline for VLM training with what we have to do when we train a VLA for robotics.
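[Editor's note: to make the VLM-vs-VLA contrast concrete, here is a minimal Python sketch of the two interfaces as just described. All names and shapes (VLMInput, vla_forward, a 7-joint arm, a 50-step action chunk) are illustrative assumptions, not Physical Intelligence's actual API.]

```python
# Toy sketch of the VLM-vs-VLA interface contrast; names and shapes are
# illustrative assumptions, not PI's actual API.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class VLMInput:
    images: list          # camera frames or web images (one np.ndarray each)
    prompt: str           # text prompt about the scene

@dataclass
class VLAInput(VLMInput):
    # A VLA additionally receives the robot's proprioceptive state.
    joint_positions: np.ndarray = field(default_factory=lambda: np.zeros(7))

def vlm_forward(x: VLMInput) -> str:
    # A VLM embeds images + text and autoregressively decodes a text answer.
    return "a shirt is lying on the table"          # dummy stand-in output

def vla_forward(x: VLAInput, horizon: int = 50) -> np.ndarray:
    # A VLA consumes the same inputs plus robot state, and instead of text
    # it outputs continuous actions, e.g. a chunk of target joint positions.
    return np.zeros((horizon, x.joint_positions.shape[0]))  # dummy actions

obs = VLAInput(images=[np.zeros((224, 224, 3))], prompt="fold the shirt")
actions = vla_forward(obs)   # shape (50, 7): one short trajectory of actions
```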
When you, as a downstream customer of these big pre-trained models, want to use a VLM, you generally take a model, source data from the web, maybe supplement it with a little extra data for the specific task you're interested in, fine-tune an off-the-shelf model on a large cluster somewhere in the cloud, and finally use well-established libraries for inference and deployment in the cloud. All of this is probably bread and butter for many of you.

In contrast, if we want to do VLA training, to train a model that exhibits dexterous, frontier-level behavior, then it's an open question what the analogue of the web as a data source actually is. We believe this is, in some sense, a trillion-dollar question for the industry, and it's an entirely open research question, as Quan already said. Secondly, while we can reuse pre-trained VLM backbones, we typically have to adapt them and their architectures to allow for models that can control robots at the fairly high control frequencies we need to actually make progress. And finally, there's also no standard solution for deploying large robot policies in multiple locations, on premise and on device, for robots; this just doesn't exist. I won't have time to talk about that third point in detail here, but I want to dive a little into the data and model training for robotics that we do at PI.

First, how can we design a data engine that enables robust, highly dexterous policies for difficult tasks with robots? At PI, we believe there is currently no standard solution for this at all, so we're building a data engine essentially from the ground up, from zero. We're designing this pipeline both to get us very quickly to some sort of impressive capability (I hope by the end of the talk you'll agree it looks somewhat impressive) and to enable significant scaling over the next few years. We have seen firsthand that operationalizing this pipeline is one of the main ingredients: probably more than 50% of the work is getting the data pipeline right, getting the right data, and getting it to be high quality.

So how does it work? We typically start from an ever-expanding set of tasks that we pick to test what's possible at the moment, tasks such as folding clothes, bagging groceries, and many others we're interested in. We then have human operators control our robots using a custom runtime and teleoperation system, which you can see in this video. Human operators control what we call leader arms: with robot arms strapped to their own arms, they trace out motions, and the motion is transferred via software to the actual robot. This way you can demonstrate fairly intricate, highly dexterous tasks and collect that data for training afterwards; a toy version of that leader-to-follower loop is sketched below. We also have a bunch of facilities for tracking metrics of what's going on at any moment.
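[Editor's note: here is a hypothetical sketch of the control-and-logging loop just described: the operator's leader-arm joint positions are mirrored onto the robot at a fixed rate, and every timestep is recorded as a demonstration. The Arm class, the 50 Hz rate, and the episode format are assumptions, not PI's actual runtime.]

```python
# Hypothetical leader/follower teleoperation loop; the Arm interface,
# 50 Hz rate, and episode format are assumptions, not PI's runtime.
import time
import numpy as np

CONTROL_HZ = 50  # assumed control rate

class Arm:
    """Stand-in for a real arm driver, exposing the assumed interface."""
    def __init__(self, num_joints: int = 7):
        self.q = np.zeros(num_joints)
    def read_joint_positions(self) -> np.ndarray:
        return self.q.copy()
    def command_joint_positions(self, q: np.ndarray) -> None:
        self.q = np.asarray(q, dtype=float)

def teleop_episode(leader: Arm, follower: Arm, task: str, seconds: float = 5.0):
    """Mirror the leader arm onto the follower and log (state, action) pairs."""
    episode = []
    for _ in range(int(seconds * CONTROL_HZ)):
        q_leader = leader.read_joint_positions()    # operator's demonstrated motion
        follower.command_joint_positions(q_leader)  # transferred via software
        episode.append({
            "task": task,
            "joint_positions": follower.read_joint_positions(),
            "action": q_leader,  # the demonstration target for training
        })
        time.sleep(1.0 / CONTROL_HZ)
    return episode

demo = teleop_episode(Arm(), Arm(), task="fold_shirt", seconds=0.1)
```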
We basically schedule these data-collection sessions all the time, and each dash in this dashboard (this one is from Tuesday this week, I think) shows an operator performing a specific episode for a specific task. So we collect a lot of this data, basically around the clock. We then annotate that data in the cloud: we store it in big buckets there, filter it back down based on the annotations, and use it for model training; a minimal version of that filtering step is sketched below. After we've trained, we get policies that can actually solve the demonstrated tasks autonomously. In this video, you see π0, the model Quan referred to earlier, which we released late last year, and as you can see, it can do fairly impressive, highly dexterous tasks such as shirt folding.
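[Editor's note: a minimal sketch of the annotate-then-filter step just mentioned, under the assumption that each episode carries annotation fields like "success" and "quality"; the field names and threshold are hypothetical, not PI's actual schema.]

```python
# Hypothetical annotate-then-filter step; the annotation fields
# ("success", "quality") and threshold are assumptions, not PI's schema.
def build_training_set(episodes, tasks=None, min_quality=0.8):
    """Filter cloud-stored episodes down to the subset used for training."""
    selected = []
    for ep in episodes:
        ann = ep.get("annotations", {})
        if not ann.get("success", False):
            continue                      # keep only successful demonstrations
        if ann.get("quality", 0.0) < min_quality:
            continue                      # drop low-quality teleop episodes
        if tasks is not None and ep.get("task") not in tasks:
            continue                      # optional per-task selection
        selected.append(ep)
    return selected

# e.g. build_training_set(all_episodes, tasks={"fold_shirt", "bag_groceries"})
```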
So how far have we gotten by following this approach? When we started, the biggest publicly available dataset was the Open X-Embodiment dataset, which contained about 3,800 hours of data, largely from static scenes in different robot labs around the world. After running the data pipeline I've described for six months, we had collected about 10,000 hours of successful episodes: successful data in tens of environments, covering hundreds of different tasks, enabling, for example, the kind of shirt-folding policies you saw before. After another six months, we have collected many more hours of data in static scenes, but crucially we have also started to collect significant amounts of data using mobile manipulation setups such as the ones you see here. The data now spans many, many more tasks and, importantly, has massively grown in diversity, covering hundreds of different scenes and environments. As you can imagine, this scale and diversity enables new leaps, as you can see from this policy that is already running autonomously, but it also brings lots of additional engineering challenges; I'll go into a little detail on how this works.

Now that we have described how we get the data, the question is: what capabilities can we elicit with this data in VLAs? To understand where we are today, it's useful to draw an analogy between the industry trends for VLMs and for VLAs, which have both been unfolding over the last three years. For multimodal LLMs, or VLMs, we have seen a constant stream of improvements, starting from the initial conversational agents you've all interacted with, all the way to the RL-trained multimodal reasoning models and coding assistants that we all use today. VLAs follow a similar but time-lagged trajectory. Initial VLAs, such as RT-2, built by some of my colleagues who are now at PI, emerged in 2023, after LLMs had already been enhanced with vision encoders. In fact, some of the earliest multimodal LLMs were trained for robotics purposes by some of my colleagues, Danny and others, who are now working with us at PI. These were impressive as first proofs of concept and showed some generalization capabilities: you could ask them to pick up different objects in the same kind of scene as in the training data. But they were generally held back by a lack of available robot data. Nonetheless, they sparked a big explosion of interest in the field, which is probably why you're here today.

Then, from mid-2024 towards the end of 2024, the first really dexterous multi-robot VLAs appeared, and the industry at large has now produced several of them: models such as Gemini Robotics, or NVIDIA's GR00T models, which I think you'll hear a little about later as well. Our entry in this category was π0, which we believe is perhaps the most dexterous multi-robot model you can use, and which is in fact open source as well. These models generally adjust their architectures to produce actions via diffusion, to enable the very fast generation at the high frequencies you need for robot control.

So if that's where we were, what's next? Where are we now? For us, the next leap was to study just how model capabilities change when we increase the diversity of data collection, and this led us to develop π0.5, which is basically a VLA with open-world generalization. I want to talk a little more about this. What does it look like? In general, we have massively expanded the data we take in during training. It now consists of both static and mobile robot data, on the right here, as well as an extended set of multimodal VLM data, such as data from the web, object-detection data, and general language annotations for the robot data we've collected (we have a huge annotation pipeline as well); this is what's on the left here. We then feed this data into a specially designed VLM, which starts from a pre-trained transformer model and is expanded with an action-expert transformer, on the right. The VLM part, the big backbone, is trained to give predictions for general questions about the scene, but also to subdivide high-level requests that a human might give the model, such as "clean my bedroom", into subtasks such as "pick up the pillow". At the same time, the action-expert transformer can attend to the internals of the large VLM, can run at a much higher rate, and produces the actual continuous output actions via a flow-matching objective (a diffusion-style generative objective); a toy version of that sampling loop is sketched below.
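[Editor's note: to make the action-generation step more concrete, here is a toy sketch of flow-matching sampling of the kind just described: starting from Gaussian noise, a learned velocity field conditioned on the VLM backbone's features is integrated for a few steps to produce a continuous action chunk. The shapes, the 10-step Euler integrator, and the velocity_fn signature are assumptions, not the actual π0/π0.5 implementation.]

```python
# Toy flow-matching action sampler; shapes, step count, and the
# velocity_fn signature are assumptions, not the actual pi-0.5 code.
import numpy as np

def sample_action_chunk(velocity_fn, context, horizon=50, action_dim=14, steps=10):
    """Integrate a learned velocity field from noise to an action chunk."""
    x = np.random.randn(horizon, action_dim)      # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, context)   # Euler step along the flow
    return x                                      # continuous actions for the robot

# Stand-in velocity field that just pulls the sample toward zero actions;
# in the real model this would be the action-expert transformer attending
# to the VLM backbone's internals ("context").
actions = sample_action_chunk(lambda x, t, ctx: -x, context=None)
```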
Training this architecture on all our data then gives us a VLA that can perform difficult, long-horizon tasks of up to 10 minutes per episode, which I think is much, much longer than what we had seen before, and it can do this in entirely unseen homes, showcasing perhaps the first true sign that broad generalization can emerge from VLA training. In this video, you see my colleague Chelsea prompting the model, in an entirely new home that is not in the training data, to perform multiple cleaning tasks, such as cleaning a surface in a kitchen.

To understand this ability to generalize a little further, we tested how it emerges by training π0.5 on a fixed amount of data while varying the number of homes from which that data was drawn, then testing in a held-out location. As you can see, with an increasing number of locations added (this is the yellow curve here), performance in the test scene generally increases, as you would expect, until, perhaps surprisingly, it matches and even slightly surpasses training on the held-out scene specifically. This was a very cool result for us, because it showed that by expanding data collection to more and more homes, we can grow the model's capabilities in new environments, which I think is pretty cool.

Here we see the same model performing a bedroom-cleanup task in a new home. In this specific case, Laura, a colleague of mine, prompts the model only to clean the bedroom, and you see the power of its ability to subdivide the request into subtasks, such as throwing trash in the bin and then making the bed. As you can see from the timer at the bottom, this policy is autonomously controlling the robot for multiple minutes at a time. Cool. And with that, I'll give it back to Quan to talk a little about partnerships.

Thank you, Toby. What you're seeing here is a robot that we, the team at PI, have never seen in person. We have never had access to it; it is running very, very far away from our office. It's performing a somewhat interesting task, making a cup of coffee end to end, and it works pretty well; we didn't need to iterate many times to produce this video. Now, why is this important? When you think about robotics, you often think about hardware and about real deployment challenges. Those are important, but it's our belief that one of the main bottlenecks is actually software, and simply model intelligence. And if you think about what it would mean, if we're successful, to scale with maximum velocity, in the sense that suddenly next year there are thousands or millions of robots deployed, it's really about demonstrating the hypothesis that our model can run across the many different hardware platforms out there without us having to invest significant time and effort into each one. So the demonstration that we've never touched this robot before and don't know how it works internally, and yet our model can control it to perform a fairly interesting task, is a piece of evidence in that direction, and we have much more evidence beyond it. That's why it's important.

We are also very open when we work with other companies, because we believe the problem is far from solved. With this company, for example, if you ask how we run inference: we literally sent them the model checkpoint for them to run inference with, and we have very low-level technical discussions with them; a toy picture of what on-premise serving can look like is sketched below. This is all to say that if you think there's a company we should be talking to, please let us know. You can tell me and Toby in person, or just shoot us an email.
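[Editor's note: as a rough picture of what "send them the checkpoint and let them run inference" can look like, here is a hypothetical sketch of a minimal on-premise policy server: the partner wraps the shipped checkpoint in a policy callable and serves actions to the robot's control loop over a local socket. None of this is Physical Intelligence's actual serving stack; the wire format and port are made up.]

```python
# Hypothetical on-premise policy server; the wire format, port, and
# policy callable are made up, not PI's actual serving stack.
import json
import socketserver

class PolicyHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One JSON observation per line in; one JSON action chunk per line out.
        obs = json.loads(self.rfile.readline())
        actions = self.server.policy(obs)           # run the shipped checkpoint
        self.wfile.write((json.dumps(actions) + "\n").encode())

def serve(policy, port=8000):
    with socketserver.TCPServer(("0.0.0.0", port), PolicyHandler) as server:
        server.policy = policy   # e.g. a loaded VLA checkpoint wrapped as a callable
        server.serve_forever()

# e.g. serve(lambda obs: [[0.0] * 7])  # dummy policy returning one zero action
```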
And if you ask what our biggest bottleneck is right now, given that we're after this mission of building a model that can work on any robot to perform any task: there are scientific problems, engineering problems, and operational problems that are far from solved. So our biggest bottleneck really is that we need the best people in the world in this area to help us accelerate progress. As a research organization, any role you might think we need, we're hiring for it. Even if there's a role you're exceptional at that you don't see on our website, please feel free to let us know that we should really be hiring for it, and we'd be happy to have a conversation with you about it. Again, you can talk to me and Toby in person about our hiring needs right now, but you can also apply online or shoot us a DM on Twitter. Thank you for listening.