Rishabh Garg, Tesla Optimus — Challenges in High Performance Robotics Systems
Channel: aiDotEngineer
Published at: 2025-08-25
YouTube video id: bCGbuyv8PMk
Source: https://www.youtube.com/watch?v=bCGbuyv8PMk
Right. Good afternoon everyone, really excited to be here today. Really exciting stuff so far: so many models, so many new ideas. Today I want to talk about what happens between the controller and the wire. We have seen so many policies that work, that control robots, but we still need to get that data to the actuators, get data from the sensors, and feed the whole system. And what happens if your carefully crafted policy does not work as expected? Is the issue in the policy, or is it in the software system? Today we'll look at a lot of instances where the issue looks like it's the policy but it's actually the software system, and along the way we'll design a very small toy robotics architecture. So why this talk? Well, robots are complex: so many systems, so many different software components, and yet you're focused on one big question. When things go wrong on the robot, when you don't see that motor move, what's the root cause? Is it the policy that isn't giving the command, or is it the software system? This is a question I grapple with almost every day, so I want to talk about what I've seen so far and how to diagnose these issues on the robot. So let's go into the buildup and try to build a very small toy robotics architecture. This is what a general robot would look like: some actuators, a CPU, maybe a hardware accelerator, and then sensors. Perfect. Now, one of the most critical aspects is the communication protocol. For our talk, we'll use CAN. CAN is great: it's an open standard, everyone can use it, it's cheap and affordable, and it has enough data rate and enough compatibility for a lot of components out there. So we'll stick to CAN, and we'll see how that influences a lot of design decisions down the line. All right, so let's also start simple with the code.
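As a concrete starting point, here is a minimal sketch of the naive read-policy-send loop the talk builds on. All the names and numbers are stand-ins: the bus I/O and the policy are stubbed out, and a real system would use an actual CAN interface and the real policy network.

```python
# A minimal sketch of the naive control loop: read sensor data off the bus,
# run the policy, send commands back out, once per 2 ms tick. Bus and policy
# are stubbed stand-ins, not a real implementation.
import time

POLICY_PERIOD_S = 0.002   # 2 ms control tick, as assumed in the talk

def read_sensors():       # stand-in for CAN RX
    return [0.0] * 5

def policy(obs):          # stand-in for the learned policy
    return [x + 1.0 for x in obs]

def send_commands(cmd):   # stand-in for CAN TX; returns frames sent
    return len(cmd)

def control_loop(n_ticks: int) -> int:
    sent = 0
    for _ in range(n_ticks):
        start = time.monotonic()
        obs = read_sensors()
        cmd = policy(obs)
        sent += send_commands(cmd)
        # Sleep off the rest of the tick. Note this budget ignores the bus
        # transfer time entirely, which is exactly the problem that shows
        # up once the loop is deployed on real hardware.
        time.sleep(max(0.0, POLICY_PERIOD_S - (time.monotonic() - start)))
    return sent

print(control_loop(3))    # 3 ticks x 5 command frames
```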
We'll start with receiving the data, giving it to the policy, and sending the result back out. Nothing fancy. And let's assume we have approximately 2 milliseconds for our policy. This is what we should expect to see: our loop running, and every 2 milliseconds we see our policy output. We read data, we send it out. Standard. But as soon as we deploy it on the robot, this is what happens: there's a gap. Every two milliseconds, there's a gap. Wait, what's going on? Well, let's look at the loop again. At the edges of the loop we have question marks: we're transmitting and receiving CAN data. So let's look at the CAN bus; maybe we'll find some hints there. Okay, say we have 100 bits per message and about 10 messages, five to be sent out and five to be received. That gives us a total of 1,000 bits, and for a CAN bus operating at 1 megabit per second, that's about 0.1 milliseconds per message, or 1 millisecond for 10 messages. You can see how even a small number of messages saturates the CAN bus, to the point that the transmission time is on the same order as the loop time, how long our system takes to run. And that explains the 1 millisecond gap. Great, but what do we do about it? It's almost unavoidable, right? We cannot go around this 1 millisecond gap. Well, that's solution number one: we just accept the delay. Hopefully it's 3 milliseconds and that's not too bad. But a system would not be high performance if we let that stop us. So we'll multithread and we'll pipeline. We'll figure out how to work around that 1 millisecond and organize our tasks differently to still get that 2 millisecond loop time.
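The bandwidth arithmetic above is easy to check. This small helper reproduces it under the talk's assumptions: roughly 100 bits per frame (a full classic CAN frame with overhead and stuffing lands in this ballpark) and a 1 Mbit/s bus.

```python
# Back-of-the-envelope CAN bus occupancy, matching the numbers in the talk:
# ~100 bits per frame, 1 Mbit/s bus, 10 frames per control tick.

def can_transfer_time_ms(n_messages: int,
                         bits_per_message: int = 100,
                         bus_bitrate_bps: int = 1_000_000) -> float:
    """Time to move n_messages over the bus, in milliseconds."""
    total_bits = n_messages * bits_per_message
    return total_bits / bus_bitrate_bps * 1_000

per_frame = can_transfer_time_ms(1)    # -> 0.1 ms per frame
per_tick = can_transfer_time_ms(10)    # -> 1.0 ms for 5 TX + 5 RX frames
print(per_frame, per_tick)
```

With a 2 ms policy budget, a full millisecond of bus time per tick is exactly the "same order as the loop time" problem the talk describes.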
So here we'll take a moment to pause and see that the loop has multiple components, now broken down into three: TX, RX, and the policy. We run the communication in one thread and the policy in a different thread, and now we'll take this simple building block and stagger it so we can actually achieve faster loop times. And this is it. What do we do? We seed the policy: the first time we get some data, we feed it to the policy, but before the policy concludes we start receiving the next set of data, for the next iteration. When the next iteration starts, we transmit the output from the last policy and continue with this iteration's policy. Essentially we have parallelized our RX and TX, but we're still receiving data for the same policy at the same cadence. So this is great; we might have solved our problems. Let's move on. So we deploy the system on the robot, and now we see new problems. Our system is stuttering. Our actuators sound like they're catching up, or we see weird motions on the actuators. This has to be the policy; there's no way this can be software. Well, let's investigate more. Let's get some more data from the CAN bus. Here we have our CAN bus again, and we see our CPU, GPU, all our accelerators. What we'll do is get an external transceiver; these are very cheap, very open products that you can get anywhere. We connect it to the CAN bus and pull data off the bus. We take this data and feed it to another host computer, say a laptop, and there we can run utilities like candump, which gives you timestamped data of what message was seen at what time. So once we get this raw data off the bus, we can start plotting it.
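The staggered pipeline described above can be sketched with two threads and a pair of bounded queues. This is a toy model, not the real system: the bus driver and the policy are stubbed, and the point is only the overlap, where the comm thread receives next tick's data while the policy runs, and transmits the previous tick's command.

```python
# A minimal sketch of the staggered RX/policy/TX pipeline: one thread owns
# the bus (RX and TX), another runs the policy, and the two overlap so the
# effective loop period stays at the policy period instead of policy + bus.
import queue
import threading

rx_q: "queue.Queue[int]" = queue.Queue(maxsize=1)   # sensor data -> policy
tx_q: "queue.Queue[int]" = queue.Queue(maxsize=1)   # policy output -> bus

N_TICKS = 5
sent = []

def comm_thread():
    """Pretend bus driver: hand over fresh sensor data, then transmit the
    command produced during the *previous* tick."""
    for tick in range(N_TICKS):
        rx_q.put(tick)               # "RX": publish this tick's sensor data
        if tick > 0:
            sent.append(tx_q.get())  # "TX": last tick's policy output
    sent.append(tx_q.get())          # flush the final command

def policy_thread():
    for _ in range(N_TICKS):
        obs = rx_q.get()             # wait for this tick's sensor data
        tx_q.put(obs * 2)            # stand-in for the real policy

t1 = threading.Thread(target=comm_thread)
t2 = threading.Thread(target=policy_thread)
t1.start(); t2.start(); t1.join(); t2.join()
print(sent)   # every command trails its sensor data by exactly one tick
```

The bounded queues are what keep the two threads in lockstep; as the talk shows next, losing that lockstep is where the trouble starts.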
And this is what we should expect: every 2 milliseconds there's a message on the bus being sent out, nicely spaced, reaching the actuators in time. If we see this on the bus, we're really happy. Now, what happens a lot of the time in real systems is that you won't see this. You'll see something like this: between messages three and four there's almost no gap. What happened there? And between two and three there's four milliseconds of gap. It's as if message three was just late and four was on time, and because of that we get this weird jitter where the actuator tries to catch up, or tries to follow two commands at the same time. Okay, the same thing happened with seven and eight. So let's take a deeper look, but first let's plot this differently. There's a plot called the cycle time plot, where what we plot is the time since the last message. Time since last message is just a way of saying: the last message came in at a 2 millisecond interval, so this one should also come 2 milliseconds later, and we should see a straight line around the 2 millisecond mark. But here we see some messages jump to 4 milliseconds, and the one after that comes in close to zero. This is expected: if a message is delayed, its cycle time is long, but the next one was not late, so the difference between the delayed message and the current one is basically nothing. Okay, so now we've characterized the system. We know what's going on and we can start solving it. This is what's happening on the TX side. We missed sending the data and queued it. Why would that happen? Well, policies are not very real-time: sometimes they take longer, sometimes shorter. And what happens if a policy takes longer? You miss the instant when you were supposed to send the data out.
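The cycle time plot is straightforward to compute from the timestamps a tool like candump gives you. In this sketch the timestamps are invented to reproduce the talk's example (one message 4 ms late, the next piling up right behind it); the threshold is an arbitrary choice for illustration.

```python
# Computing the "cycle time" (time since last message) from bus timestamps
# and flagging frames whose spacing deviates from the expected 2 ms period.

EXPECTED_MS = 2.0
TOL_MS = 0.5   # illustrative tolerance, not a standard value

def cycle_times(stamps_ms):
    """Time since the previous message, for each message after the first."""
    return [b - a for a, b in zip(stamps_ms, stamps_ms[1:])]

def flag_jitter(stamps_ms, expected=EXPECTED_MS, tol=TOL_MS):
    """1-based message numbers whose cycle time is off by more than tol."""
    return [i + 1 for i, dt in enumerate(cycle_times(stamps_ms), start=1)
            if abs(dt - expected) > tol]

# Messages 1..5: message 4 is 4 ms late, message 5 arrives right behind it.
stamps = [0.0, 2.0, 4.0, 8.0, 8.1]
print(cycle_times(stamps))   # ~[2.0, 2.0, 4.0, 0.1]: the spike-then-zero shape
print(flag_jitter(stamps))   # messages 4 and 5 stand out
```

The spike to 4 ms followed by a near-zero cycle time is exactly the signature the talk describes: a delayed message plus its successor arriving back to back.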
So all you can do is queue it somewhere. You store it, but it cannot be sent out on time anymore. When the next iteration comes around, you send both the last message and the current message, so you'll see two messages go out on the bus at the same time. This can also happen if our TX and RX threads start desynchronizing. This is one of the issues very commonly seen in multi-threaded systems, and it's very important to have synchronization in these systems. But let's say we do synchronize and fix our TX side. Well, we see some improvement, but not everything is solved. Okay, but now it has to be the policy. Our graphs look fine, everything on the bus is fine. It has to be the policy; there's no way it's the system. Well, there's one last issue we have to check: what happens if we desynchronize on the RX side? What happens if our RX thread is delayed? Now our policy will not get the new data, so it works with the last data, and its output is also based on that stale data. So in iteration number two we're actually still acting on a relatively old command, and in iteration number three we jump ahead and skip one of the data points entirely. Because of that we see a skip-and-catch-up behavior on the motors, which sounds almost like jitter. Okay, so how do we resolve these two things? Well, there are synchronization primitives: locks, condition variables, semaphores. These are low-level system facilities that are widely used in robotics and should be used for this toy system as well. But if these are not available, which is sometimes the case when we're not working with a Linux-based system but with a real-time OS or a microcontroller, we may not have all of these.
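As one possible shape for that synchronization, here is a sketch using a condition variable: the policy blocks until RX has published a sample it hasn't consumed yet, and RX refuses to overwrite a sample before it is consumed. So a delayed RX thread can stall the loop, but it can never make the policy silently reuse stale data or skip a sample. The sensor frames and the policy step are stand-ins.

```python
# RX/policy synchronization with a condition variable: sequence numbers
# guarantee each policy iteration consumes exactly one fresh sample.
import threading

cond = threading.Condition()
latest = {"seq": -1, "data": None}   # most recent "sensor frame" from RX
consumed_seq = -1                    # last sequence number the policy used
results = []

def rx_thread(n):
    global latest
    for seq in range(n):
        with cond:
            latest = {"seq": seq, "data": seq * 10}  # pretend sensor frame
            cond.notify_all()
            # Don't overwrite a sample the policy hasn't consumed yet.
            cond.wait_for(lambda: consumed_seq >= seq)

def policy_thread(n):
    global consumed_seq
    for _ in range(n):
        with cond:
            # Block until RX publishes a sample we haven't used yet.
            cond.wait_for(lambda: latest["seq"] > consumed_seq)
            consumed_seq = latest["seq"]
            results.append(latest["data"] + 1)       # pretend policy step
            cond.notify_all()

t1 = threading.Thread(target=policy_thread, args=(3,))
t2 = threading.Thread(target=rx_thread, args=(3,))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)   # one result per sample: nothing skipped, nothing reused
```

In a hard real-time loop you would bound these waits with timeouts rather than block indefinitely, but the sequence-number idea is the same.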
We can just add padding, some cushion, so that if some desynchronization happens you still have the right RX going into the right policy iteration and the output coming out the other side in a timely manner, and we don't miss messages. Okay, perfect. This makes the system fairly robust and fairly high performance. But there are a few other related problems that will happen with a system like this, which we should also talk about. So let's talk about logging. Logging is benign, right? We just log that a message came in, what data we got, what output we produced. It's fine, right? Well, if we log too much, at some point we have to flush those logs to disk, and that is very costly. Imagine what happens if your main control loop is logging and just decides one day, "I'm done, I'll start writing this to the hard disk." Your robot will stay frozen for 30 milliseconds, as we saw on a Raspberry Pi with an SD card. Well, that's bad. How do we fix it? We just throw more CPU at it: we add another CPU, and now all our logging is handled by that third CPU. Cool. Okay, so now we're seeing how multi-threading slowly gets baked into the system, how the robot operates under real-time deadline guarantees, and how we're able to avoid the pitfalls. Perfect. Let's talk about something a little more low-level again: microcontrollers. Microcontrollers are fairly simple, and their logging doesn't go through a whole disk and file system path. They just log to some peripheral, and that takes time. In fact, for a UART it can be on the order of a millisecond, depending on how much we're logging. So here's an interesting problem. Let's say we drop a packet, and we log, "hey, we dropped the packet." Well, that log itself takes enough time that we drop the next packet, and then it keeps going: because you dropped the next packet, you log again.
And so, basically, you just keep logging and you see a complete blackout on the CAN bus, and it's very hard to debug: why am I getting logs and seeing packet drops, but no data? These are mysterious things, and in my experience it's really good to know about these pitfalls before diving into the system, and to realize that even just a log statement can be the problem. Finally, there's priority inversion. In the Linux kernel there are stages through which data reaches a user process; it's not direct. It takes a while between the interrupt, the kernel's handling, and the hand-off to the user process. In robotics, we tend to boost the priority of all our processes so high that we almost start blocking the kernel. If the kernel doesn't run, we won't get the data. So we're trying to get the data while blocking the very thing that will give us the data. Well, this is inversion in action, and you'll see your system drop out for almost seconds at a time. Again, this is something we fix by making sure we know the parts of the pipeline, we set the right priorities, and the whole system works together well. So this is how software and robotics have to work together: we have to talk about the hardware, the profiling, the priorities. Let's take a recap from the top. We went over pipelining and saw how to reduce cycle time to beat the communication delays. We saw how synchronization issues can cause unexpected jitter that is hard to diagnose: it could be the policy, it could be the system, so we want to make sure that doesn't happen. Logging strategies, so that we don't block the system while trying to tell the user what's happening. And finally, priority inversion, to avoid starvation.
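Going back to the microcontroller logging pitfall for a moment, the cascade is easy to model. This toy simulation uses invented numbers (a 1 ms packet period and a 1.2 ms blocking UART write per log line) purely to show the mechanism: once logging a drop costs more than one packet period, a single drop blacks out everything after it.

```python
# A toy model of the "log the drop, drop the next packet" cascade on a
# microcontroller with blocking UART logging. Numbers are illustrative.

PACKET_PERIOD_MS = 1.0   # one CAN packet expected every millisecond
LOG_COST_MS = 1.2        # blocking UART write for one "dropped packet" line

def simulate(n_packets: int, first_drop: int) -> int:
    """Return how many packets get received when packet `first_drop` is lost
    and every loss is logged over the blocking UART."""
    received = 0
    busy_until = 0.0                      # CPU is busy logging until then
    for i in range(n_packets):
        t = i * PACKET_PERIOD_MS
        if i == first_drop or t < busy_until:
            # Packet missed: log it, which blocks the CPU even longer.
            busy_until = max(busy_until, t) + LOG_COST_MS
        else:
            received += 1
    return received

# One drop at packet 10 kills every packet after it: total blackout.
print(simulate(100, first_drop=10))
```

Because each log write outlasts the packet period, the CPU never catches up; the fix, as with the disk case, is to get the slow write out of the receive path, via a deferred or rate-limited logger.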
And that's how we start designing high performance robotic systems, at least at a very basic level. That's my talk for today. Thank you so much for being here and listening.