Rishabh Garg, Tesla Optimus — Challenges in High Performance Robotics Systems
Channel: aiDotEngineer
Published at: 2025-08-25
YouTube video id: bCGbuyv8PMk
Source: https://www.youtube.com/watch?v=bCGbuyv8PMk
Right. Good afternoon everyone, really excited to be here today. Really exciting stuff so far: so many models, so many new ideas. Today I want to talk about what happens between the controller and the wire. We have seen so many policies that work, that control robots, but we still need to get that data to the actuators, get data from the sensors, and feed the whole system. And what happens if your carefully crafted policy does not work as expected? Is the issue in the policy, or is it in the software system? Today we'll look at a lot of instances where the issue looks like it's the policy but it's actually the software system, and along the way we'll design a very small toy robotics architecture. So why this talk? Well, robots are complex: so many systems, so many different software components, and yet you're focused on one big question. When things go wrong on the robot, when you don't see that motor move, what's the root cause? Is it the policy that isn't giving the command, or is it the software system? This is a question I grapple with almost every day, so I want to talk about what I've seen so far and how to diagnose these issues on the robot. So let's go into the buildup and try to build a very small toy robotics architecture. This is what a general robot would look like: some actuators, a CPU, maybe a hardware accelerator, and then sensors. Perfect. Now, one of the most critical aspects is the communication protocol. For our talk, we'll use CAN. CAN is great: it's an open standard, everyone can use it, it's cheap and affordable, and it has enough data rate and enough compatibility for a lot of components out there. So we'll stick to CAN, and we'll see how that influences a lot of design decisions down the line. All right, so let's also start simple with the code.
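As a concrete starting point, here is a minimal sketch of the naive read-policy-send loop the talk builds on. All the names and numbers are stand-ins: the bus I/O and the policy are stubbed out, and a real system would use an actual CAN interface and the real policy network.

```python
# A minimal sketch of the naive control loop: read sensor data off the bus,
# run the policy, send commands back out, once per 2 ms tick. Bus and policy
# are stubbed stand-ins, not a real implementation.
import time

POLICY_PERIOD_S = 0.002   # 2 ms control tick, as assumed in the talk

def read_sensors():       # stand-in for CAN RX
    return [0.0] * 5

def policy(obs):          # stand-in for the learned policy
    return [x + 1.0 for x in obs]

def send_commands(cmd):   # stand-in for CAN TX; returns frames sent
    return len(cmd)

def control_loop(n_ticks: int) -> int:
    sent = 0
    for _ in range(n_ticks):
        start = time.monotonic()
        obs = read_sensors()
        cmd = policy(obs)
        sent += send_commands(cmd)
        # Sleep off the rest of the tick. Note this budget ignores the bus
        # transfer time entirely, which is exactly the problem that shows
        # up once the loop is deployed on real hardware.
        time.sleep(max(0.0, POLICY_PERIOD_S - (time.monotonic() - start)))
    return sent

print(control_loop(3))    # 3 ticks x 5 command frames
```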
We'll start with receiving the data, giving it to the policy, and sending the result back out. Nothing fancy. And let's assume we have approximately 2 milliseconds for our policy. This is what we should expect to see: our loop running, and every 2 milliseconds we see our policy output. We read data, we send it out. Standard. But as soon as we deploy it on the robot, this is what happens: there's a gap. Every two milliseconds, there's a gap. Wait, what's going on? Well, let's look at the loop again. At the edges of the loop we have question marks: we're transmitting and receiving CAN data. So let's look at the CAN bus; maybe we'll find some hints there. Okay, say we have 100 bits per message and about 10 messages, five to be sent out and five to be received. That gives us a total of 1,000 bits, and for a CAN bus operating at 1 megabit per second, that's about 0.1 milliseconds per message, or 1 millisecond for 10 messages. You can see how even a small number of messages saturates the CAN bus, to the point that the transmission time is on the same order as the loop time, how long our system takes to run. And that explains the 1 millisecond gap. Great, but what do we do about it? It's almost unavoidable, right? We cannot go around this 1 millisecond gap. Well, that's solution number one: we just accept the delay. Hopefully it's 3 milliseconds and that's not too bad. But a system would not be high performance if we let that stop us. So we'll multithread and we'll pipeline. We'll figure out how to work around that 1 millisecond and organize our tasks differently to still get that 2 millisecond loop time.
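The bandwidth arithmetic above is easy to check. This small helper reproduces it under the talk's assumptions: roughly 100 bits per frame (a full classic CAN frame with overhead and stuffing lands in this ballpark) and a 1 Mbit/s bus.

```python
# Back-of-the-envelope CAN bus occupancy, matching the numbers in the talk:
# ~100 bits per frame, 1 Mbit/s bus, 10 frames per control tick.

def can_transfer_time_ms(n_messages: int,
                         bits_per_message: int = 100,
                         bus_bitrate_bps: int = 1_000_000) -> float:
    """Time to move n_messages over the bus, in milliseconds."""
    total_bits = n_messages * bits_per_message
    return total_bits / bus_bitrate_bps * 1_000

per_frame = can_transfer_time_ms(1)    # -> 0.1 ms per frame
per_tick = can_transfer_time_ms(10)    # -> 1.0 ms for 5 TX + 5 RX frames
print(per_frame, per_tick)
```

With a 2 ms policy budget, a full millisecond of bus time per tick is exactly the "same order as the loop time" problem the talk describes.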
So here we'll take a moment to pause and see that the loop has multiple components, now broken down into three: TX, RX, and the policy. We run the communication in one thread and the policy in a different thread, and now we'll take this simple building block and stagger it so we can actually achieve faster loop times. And this is it. What do we do? We seed the policy: the first time we get some data, we feed it to the policy, but before the policy concludes we start receiving the next set of data, for the next iteration. When the next iteration starts, we transmit the output from the last policy and continue with this iteration's policy. Essentially we have parallelized our RX and TX, but we're still receiving data for the same policy at the same cadence. So this is great; we might have solved our problems. Let's move on. So we deploy the system on the robot, and now we see new problems. Our system is stuttering. Our actuators sound like they're catching up, or we see weird motions on the actuators. This has to be the policy; there's no way this can be software. Well, let's investigate more. Let's get some more data from the CAN bus. Here we have our CAN bus again, and we see our CPU, GPU, all our accelerators. What we'll do is get an external transceiver; these are very cheap, very open products that you can get anywhere. We connect it to the CAN bus and pull data off the bus. We take this data and feed it to another host computer, say a laptop, and there we can run utilities like candump, which gives you timestamped data of what message was seen at what time. So once we get this raw data off the bus, we can start plotting it.
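The staggered pipeline described above can be sketched with two threads and a pair of bounded queues. This is a toy model, not the real system: the bus driver and the policy are stubbed, and the point is only the overlap, where the comm thread receives next tick's data while the policy runs, and transmits the previous tick's command.

```python
# A minimal sketch of the staggered RX/policy/TX pipeline: one thread owns
# the bus (RX and TX), another runs the policy, and the two overlap so the
# effective loop period stays at the policy period instead of policy + bus.
import queue
import threading

rx_q: "queue.Queue[int]" = queue.Queue(maxsize=1)   # sensor data -> policy
tx_q: "queue.Queue[int]" = queue.Queue(maxsize=1)   # policy output -> bus

N_TICKS = 5
sent = []

def comm_thread():
    """Pretend bus driver: hand over fresh sensor data, then transmit the
    command produced during the *previous* tick."""
    for tick in range(N_TICKS):
        rx_q.put(tick)               # "RX": publish this tick's sensor data
        if tick > 0:
            sent.append(tx_q.get())  # "TX": last tick's policy output
    sent.append(tx_q.get())          # flush the final command

def policy_thread():
    for _ in range(N_TICKS):
        obs = rx_q.get()             # wait for this tick's sensor data
        tx_q.put(obs * 2)            # stand-in for the real policy

t1 = threading.Thread(target=comm_thread)
t2 = threading.Thread(target=policy_thread)
t1.start(); t2.start(); t1.join(); t2.join()
print(sent)   # every command trails its sensor data by exactly one tick
```

The bounded queues are what keep the two threads in lockstep; as the talk shows next, losing that lockstep is where the trouble starts.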
And this is what we should expect: every 2 milliseconds there's a message on the bus being sent out, nicely spaced, reaching the actuators in time. If we see this on the bus, we're really happy. Now, what happens a lot of the time in real systems is that you won't see this. You'll see something like this: between messages three and four there's almost no gap. What happened there? And between two and three there's four milliseconds of gap. It's as if message three was just late and four was on time, and because of that we get this weird jitter where the actuator tries to catch up, or tries to follow two commands at the same time. Okay, the same thing happened with seven and eight. So let's take a deeper look, but first let's plot this differently. There's a plot called the cycle time plot, where what we plot is the time since the last message. Time since last message is just a way of saying: the last message came in at a 2 millisecond interval, so this one should also come 2 milliseconds later, and we should see a straight line around the 2 millisecond mark. But here we see some messages jump to 4 milliseconds, and the one after that comes in close to zero. This is expected: if a message is delayed, its cycle time is long, but the next one was not late, so the difference between the delayed message and the current one is basically nothing. Okay, so now we've characterized the system. We know what's going on and we can start solving it. This is what's happening on the TX side. We missed sending the data and queued it. Why would that happen? Well, policies are not very real-time: sometimes they take longer, sometimes shorter. And what happens if a policy takes longer? You miss the instant when you were supposed to send the data out.
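The cycle time plot is straightforward to compute from the timestamps a tool like candump gives you. In this sketch the timestamps are invented to reproduce the talk's example (one message 4 ms late, the next piling up right behind it); the threshold is an arbitrary choice for illustration.

```python
# Computing the "cycle time" (time since last message) from bus timestamps
# and flagging frames whose spacing deviates from the expected 2 ms period.

EXPECTED_MS = 2.0
TOL_MS = 0.5   # illustrative tolerance, not a standard value

def cycle_times(stamps_ms):
    """Time since the previous message, for each message after the first."""
    return [b - a for a, b in zip(stamps_ms, stamps_ms[1:])]

def flag_jitter(stamps_ms, expected=EXPECTED_MS, tol=TOL_MS):
    """1-based message numbers whose cycle time is off by more than tol."""
    return [i + 1 for i, dt in enumerate(cycle_times(stamps_ms), start=1)
            if abs(dt - expected) > tol]

# Messages 1..5: message 4 is 4 ms late, message 5 arrives right behind it.
stamps = [0.0, 2.0, 4.0, 8.0, 8.1]
print(cycle_times(stamps))   # ~[2.0, 2.0, 4.0, 0.1]: the spike-then-zero shape
print(flag_jitter(stamps))   # messages 4 and 5 stand out
```

The spike to 4 ms followed by a near-zero cycle time is exactly the signature the talk describes: a delayed message plus its successor arriving back to back.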
So all you can do is queue it somewhere. You store it, but it cannot be sent out on time anymore. When the next iteration comes around, you send both the last message and the current message, so you'll see two messages go out on the bus at the same time. This can also happen if our TX and RX threads start desynchronizing. This is one of the issues very commonly seen in multi-threaded systems, and it's very important to have synchronization in these systems. But let's say we do synchronize and fix our TX side. Well, we see some improvement, but not everything is solved. Okay, but now it has to be the policy. Our graphs look fine, everything on the bus is fine. It has to be the policy; there's no way it's the system. Well, there's one last issue we have to check: what happens if we desynchronize on the RX side? What happens if our RX thread is delayed? Now our policy will not get the new data, so it works with the last data, and its output is also based on that stale data. So in iteration number two we're actually still acting on a relatively old command, and in iteration number three we jump ahead and skip one of the data points entirely. Because of that we see a skip-and-catch-up behavior on the motors, which sounds almost like jitter. Okay, so how do we resolve these two things? Well, there are synchronization primitives: locks, condition variables, semaphores. These are low-level system facilities that are widely used in robotics and should be used for this toy system as well. But if these are not available, which is sometimes the case when we're not working with a Linux-based system but with a real-time OS or a microcontroller, we may not have all of these.
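As one possible shape for that synchronization, here is a sketch using a condition variable: the policy blocks until RX has published a sample it hasn't consumed yet, and RX refuses to overwrite a sample before it is consumed. So a delayed RX thread can stall the loop, but it can never make the policy silently reuse stale data or skip a sample. The sensor frames and the policy step are stand-ins.

```python
# RX/policy synchronization with a condition variable: sequence numbers
# guarantee each policy iteration consumes exactly one fresh sample.
import threading

cond = threading.Condition()
latest = {"seq": -1, "data": None}   # most recent "sensor frame" from RX
consumed_seq = -1                    # last sequence number the policy used
results = []

def rx_thread(n):
    global latest
    for seq in range(n):
        with cond:
            latest = {"seq": seq, "data": seq * 10}  # pretend sensor frame
            cond.notify_all()
            # Don't overwrite a sample the policy hasn't consumed yet.
            cond.wait_for(lambda: consumed_seq >= seq)

def policy_thread(n):
    global consumed_seq
    for _ in range(n):
        with cond:
            # Block until RX publishes a sample we haven't used yet.
            cond.wait_for(lambda: latest["seq"] > consumed_seq)
            consumed_seq = latest["seq"]
            results.append(latest["data"] + 1)       # pretend policy step
            cond.notify_all()

t1 = threading.Thread(target=policy_thread, args=(3,))
t2 = threading.Thread(target=rx_thread, args=(3,))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)   # one result per sample: nothing skipped, nothing reused
```

In a hard real-time loop you would bound these waits with timeouts rather than block indefinitely, but the sequence-number idea is the same.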
We can just add padding, some cushion, so that if some desynchronization happens you still have the right RX going into the right policy iteration and the output coming out the other side in a timely manner, and we don't miss messages. Okay, perfect. This makes the system fairly robust and fairly high performance. But there are a few other related problems that will happen with a system like this, which we should also talk about. So let's talk about logging. Logging is benign, right? We just log that a message came in, what data we got, what output we produced. It's fine, right? Well, if we log too much, at some point we have to flush those logs to disk, and that is very costly. Imagine what happens if your main control loop is logging and just decides one day, "I'm done, I'll start writing this to the hard disk." Your robot will stay frozen for 30 milliseconds, as we saw on a Raspberry Pi with an SD card. Well, that's bad. How do we fix it? We just throw more CPU at it: we add another CPU, and now all our logging is handled by that third CPU. Cool. Okay, so now we're seeing how multi-threading slowly gets baked into the system, how the robot operates under real-time deadline guarantees, and how we're able to avoid the pitfalls. Perfect. Let's talk about something a little more low-level again: microcontrollers. Microcontrollers are fairly simple, and their logging doesn't go through a whole disk and file system path. They just log to some peripheral, and that takes time. In fact, for a UART it can be on the order of a millisecond, depending on how much we're logging. So here's an interesting problem. Let's say we drop a packet, and we log, "hey, we dropped the packet." Well, that log itself takes enough time that we drop the next packet, and then it keeps going: because you dropped the next packet, you log again.
And so, basically, you just keep logging and you see a complete blackout on the CAN bus, and it's very hard to debug: why am I getting logs and seeing packet drops, but no data? These are mysterious things, and in my experience it's really good to know about these pitfalls before diving into the system, and to realize that even just a log statement can be the problem. Finally, there's priority inversion. In the Linux kernel there are stages through which data reaches a user process; it's not direct. It takes a while between the interrupt, the kernel's handling, and the hand-off to the user process. In robotics, we tend to boost the priority of all our processes so high that we almost start blocking the kernel. If the kernel doesn't run, we won't get the data. So we're trying to get the data while blocking the very thing that will give us the data. Well, this is inversion in action, and you'll see your system drop out for almost seconds at a time. Again, this is something we fix by making sure we know the parts of the pipeline, we set the right priorities, and the whole system works together well. So this is how software and robotics have to work together: we have to talk about the hardware, the profiling, the priorities. Let's take a recap from the top. We went over pipelining and saw how to reduce cycle time to beat the communication delays. We saw how synchronization issues can cause unexpected jitter that is hard to diagnose: it could be the policy, it could be the system, so we want to make sure that doesn't happen. Logging strategies, so that we don't block the system while trying to tell the user what's happening. And finally, priority inversion, to avoid starvation.
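Going back to the microcontroller logging pitfall for a moment, the cascade is easy to model. This toy simulation uses invented numbers (a 1 ms packet period and a 1.2 ms blocking UART write per log line) purely to show the mechanism: once logging a drop costs more than one packet period, a single drop blacks out everything after it.

```python
# A toy model of the "log the drop, drop the next packet" cascade on a
# microcontroller with blocking UART logging. Numbers are illustrative.

PACKET_PERIOD_MS = 1.0   # one CAN packet expected every millisecond
LOG_COST_MS = 1.2        # blocking UART write for one "dropped packet" line

def simulate(n_packets: int, first_drop: int) -> int:
    """Return how many packets get received when packet `first_drop` is lost
    and every loss is logged over the blocking UART."""
    received = 0
    busy_until = 0.0                      # CPU is busy logging until then
    for i in range(n_packets):
        t = i * PACKET_PERIOD_MS
        if i == first_drop or t < busy_until:
            # Packet missed: log it, which blocks the CPU even longer.
            busy_until = max(busy_until, t) + LOG_COST_MS
        else:
            received += 1
    return received

# One drop at packet 10 kills every packet after it: total blackout.
print(simulate(100, first_drop=10))
```

Because each log write outlasts the packet period, the CPU never catches up; the fix, as with the disk case, is to get the slow write out of the receive path, via a deferred or rate-limited logger.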
And that's how we start designing high performance robotic systems, at least at a very basic level. That's my talk for today. Thank you so much for being here and listening.