Waymo's EMMA: Teaching Cars to Think - Jyh Jing Hwang, Waymo

Channel: aiDotEngineer

Published at: 2025-07-26

YouTube video id: iS9YFW28XyM

Source: https://www.youtube.com/watch?v=iS9YFW28XyM

So I think a lot of people here are already based in San Francisco. We drive every day, and the technology just works, as you have all witnessed. I'm going to talk a little bit about the history of autonomous driving research, what it took to get us here, and the current status.

In the history of autonomous driving, people started the research in the 1980s with very simple neural networks of just three layers. Over time people started to go deeper and deeper, and around 2020 there were papers from NVIDIA and other research labs publishing end-to-end driving models. This is one of the videos they published at that time. The car driving in front is the autonomous driving car from one of those research labs, and you can see that it is actually drifting. So this is more of an L2 sort of technology, and you wouldn't want to ride in it.

So why is the L4 system that Waymo has so much better? You can see it drives around San Francisco in various regions, including downtown areas, it avoids pedestrians and cyclists, and it makes the roads very safe.
The secret is basically on the top, where you can see a sort of visualization or debugging screen that shows everything the system understands. It captures all the cars, pedestrians, cyclists, traffic lights, and crosswalks, and it understands almost everything related to driving. This is a very complicated system. If we break it down, there is a perception system that understands the world, a prediction system that predicts future world states given the current one, and finally a planning system that tells us how we should drive in this scenario: whether we should turn right, how much acceleration, how much steering, and so on. So this is the whole autonomous driving system. It is very complicated and delicate, but it solves the problem: it drives in San Francisco today.
So where are we heading with autonomous driving? The answer is scaling. We have proven the technology: we are operating in Phoenix, San Francisco, Austin, and Los Angeles today, offering a rider-only service. However, our ambition is not going to stop at just four cities. This year we are doing a road trip where we are visiting about ten cities in various locations. Here is the map, and you can even see that we are going across the globe to Tokyo, Japan. So there are a lot of challenges when we start to scale.
Just recently we finished an open challenge for participants to solve some of the hardest problems that we see in our expansion. Let's replay the video a little bit here. On the top left there is a marathon going on on the right, and there are cones around the roads. Most interestingly, as we drive closer to the intersection, if you look at the traffic light, it is actually red, but there is a traffic control officer waving us to go ahead, so it is effectively green. This is very confusing. All of these long-tail cases are very challenging to solve, and many of them are situations you may never see in your entire life of driving. But at Waymo's scale we see them very often, and we need to solve them in order to scale.
One of the solutions arises because foundation models are very generalizable. Let's take an example where we feed the scene into Gemini. You can see the video: we were driving peacefully when suddenly a bunch of angry birds appeared and, virtually speaking, attacked our car. That is where we need to react. For the car to understand the scenario, it has to understand that this is a flock of birds, how they are going to behave, and how they are going to interact with the car. This is not the everyday driving that you see, but it is a critical long tail. So let's see how Gemini responds. Gemini basically says that a large flock of birds suddenly takes flight from the ground in front of the vehicle, and that the expected behavior is to slow down, remain alert, and possibly adjust the speed afterwards. I would say this is basically a perfect answer for how we should drive in this scenario.
Here is another scenario, where you can see there is a scooter rider in front of us, and unfortunately she slipped. It was night, and there had been rain earlier, so the road was wet. These kinds of events also happen quite often when the scale is large, and they are very much safety-critical events that we really want to get right, to get perfect. Let's ask Gemini again. Gemini actually identifies the entire scenario correctly, and even identifies a gas station far away in the scene. I think it is very telling that today's foundation models actually generalize well to these kinds of rare events, and the question becomes: how do we leverage this technology for autonomous driving?
Here is our exploration in research, where we want to build a more generalizable autonomous driving system by leveraging Gemini or other multimodal large language models.
We call it EMMA. This is the simplest form of EMMA, and the idea is very straightforward: Gemini is very good at generalizing across various scenarios, so let's just put it to work on driving. How do we do that? On the top left is the router, which tells us where we are going on the current drive. You can think of it as a Google Maps signal that tells us, for example, to turn left or turn right at the next intersection, and we translate it into text, just pure text. On the bottom left we have the vehicle with its surrounding cameras. There are eight cameras in this image, and they cover 360 degrees. We input this video together with the routing text into EMMA, which is basically built on top of Gemini, and we ask Gemini how to drive next. Gemini will output future waypoints, which are the future locations where the car should be in the next few seconds.

This method is very simple, and there are three major traits here. The first is that it is self-supervised: whenever we have a driving log, we know where the car was, so at any point in time we can use this formulation to train the model, because we know where the car should be in the future. This is self-supervised and very scalable. The second is that it is camera-only. We don't use lidar yet, because Gemini is a camera model, so we only need cameras to drive. The third is that it is high-definition-map free. People who understand autonomous driving technology to a certain extent usually hear that Waymo's cars need a lot of map priors; in this model we actually don't need any maps besides Google Maps.
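To make the formulation concrete, here is a minimal sketch of what the input/output contract might look like, assuming a generic multimodal-LLM interface; the prompt wording, data structures, and helper names are illustrative, not Waymo's actual code.

```python
# Hypothetical sketch of the EMMA-style formulation described above:
# routing text + surround-view camera frames in, future waypoints out.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DrivingQuery:
    routing_text: str                  # e.g. "turn right at the next intersection"
    camera_frames: List[bytes]         # 8 surround-view cameras covering 360 degrees
    history_waypoints: List[Tuple[float, float]]  # past ego positions (meters, bird's-eye view)

def build_prompt(q: DrivingQuery) -> str:
    # The routing signal and ego history are expressed as plain text; the camera
    # frames are attached separately to the multimodal model.
    past = ", ".join(f"({x:.1f}, {y:.1f})" for x, y in q.history_waypoints)
    return (
        f"Routing command: {q.routing_text}\n"
        f"Past ego waypoints: {past}\n"
        "Predict the ego waypoints for the next 5 seconds, one (x, y) pair per 0.5 s."
    )

def parse_waypoints(model_text: str) -> List[Tuple[float, float]]:
    # The model answers in text, e.g. "(1.2, 0.0), (2.5, 0.1), ...", which we
    # decode back into numeric waypoints for downstream use.
    out = []
    for chunk in model_text.replace("(", "").split(")"):
        chunk = chunk.strip().strip(",").strip()
        if chunk:
            x, y = chunk.split(",")
            out.append((float(x), float(y)))
    return out
```

Note that the target waypoints come straight from the driving log, i.e. where the car actually went in the following seconds, which is exactly why the recipe is self-supervised and scales with logged miles.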
So how was the performance? We conducted an evaluation on a research benchmark called nuScenes. This is one of the most popular open-loop planning benchmarks, and for such a simple model, EMMA achieves state-of-the-art quality compared to all the other models at that time, whether customized models, small models, or large models. With this simple formulation we can already achieve the best quality.
Now we are thinking further. The self-supervised method does work to some extent, but there are drawbacks that we want to remedy. The first is that we already have a lot of labeled data and a lot of expert models, so how do we leverage them to further improve quality? The second is that there is a very clear drawback of end-to-end models: the explainability is not there. We can only see the output of the planner, but we don't know what happens inside. So here is what we do: we add a chain-of-thought process before outputting the plan. We let the model explain how it should drive in this scenario. For example, we first ask the model to identify the critical objects on the road; in this scenario it identifies the cyclist and the vehicle. Then we ask it to explain what kind of behavior those critical objects will exhibit and what driving meta-decision we should go for. In this scenario it says we should just keep the normal speed; in other scenarios it will say we should yield, or slow down, and so on.
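As a rough illustration of this kind of structured chain-of-thought answer, the sketch below shows one possible shape for the intermediate reasoning; the field names and example values are assumptions, not EMMA's exact output schema.

```python
# Hypothetical structure for a chain-of-thought driving answer: the model first
# names critical objects, then their expected behavior, then a meta-decision,
# and only then the waypoints.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CriticalObject:
    category: str                    # e.g. "cyclist", "vehicle"
    location: Tuple[float, float]    # rough position in the ego frame (meters)
    expected_behavior: str           # e.g. "will continue straight in the bike lane"

@dataclass
class DrivingRationale:
    critical_objects: List[CriticalObject]
    meta_decision: str               # e.g. "keep normal speed", "yield", "slow down"
    waypoints: List[Tuple[float, float]]

example = DrivingRationale(
    critical_objects=[
        CriticalObject("cyclist", (12.0, 3.5), "will continue straight in the bike lane"),
        CriticalObject("vehicle", (25.0, -1.0), "is parked and stationary"),
    ],
    meta_decision="keep normal speed",
    waypoints=[(2.0, 0.0), (4.1, 0.0), (6.2, 0.1)],
)
```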
With this, we actually achieve even better performance. It is measured on our own dataset based on the Waymo Open Motion Dataset, which contains about 100k scenarios, roughly 100 times larger than nuScenes. More importantly, the baselines are much stronger, because they are specialized models, Wayformer and MotionLM, built on top of a very sophisticated oracle perception system. They take inputs from that oracle perception and from the road graph, which is basically the high-definition map plus traffic light states. So they have inputs from almost everything, and their only job is to output the plan. These are very strong baselines, and with chain-of-thought reasoning you can see that EMMA actually comes out on top. So with this architecture we see that it performs very well in the academic setting.
One promise of foundation models is scaling laws, which basically say that quality will keep improving if you have a larger model and a larger dataset. We show here that if we train on a dataset that is orders of magnitude larger than any of the publicly released academic datasets, and we keep training on more data, the quality keeps improving. The y-axis here is perplexity, and the lower the perplexity, the better the planning results. So we see the quality can be further improved.
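For reference, perplexity in the usual language-modeling sense is the exponential of the average negative log-likelihood the model assigns to the ground-truth tokens (here, the tokenized future waypoints); a tiny sketch with made-up numbers:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood); lower is better."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Made-up log-probabilities assigned to the ground-truth waypoint tokens.
print(perplexity([-0.2, -0.5, -0.1, -0.3]))  # ~1.32
```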
We talked about explainability, and another thing we are thinking about is: why not just train on various tasks? Because this is a vision-language model at its core and the language part is super flexible, we can formulate almost any type of task with it. So we thought: let's make EMMA the most generalizable model possible, and we added a lot of different tasks to it. In this example we demonstrate 3D detection, road graph estimation, and some free-form VQA. For any prompt on the left, if you enter it, EMMA will spit out the corresponding answer: the first one is end-to-end driving, the second one is 3D detection, and so on. We then decode and visualize the results on the right.
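Conceptually, the multi-task setup keeps the same model and camera inputs and only swaps the task instruction and the output parser; a small sketch, with hypothetical prompt wording and a placeholder model interface:

```python
# Each task is expressed as a different text prompt over the same camera inputs,
# and the text answer is decoded by a task-specific parser. Wording is illustrative.
TASK_PROMPTS = {
    "planning":   "Predict the ego waypoints for the next 5 seconds.",
    "detection":  "List the 3D boxes of all vehicles, pedestrians, and cyclists "
                  "as (x, y, z, length, width, height, heading).",
    "road_graph": "Describe the lane polylines visible in the scene.",
    "vqa":        "Is it currently safe to change into the left lane? Explain briefly.",
}

def run_task(model, camera_frames, task: str) -> str:
    # `model` stands in for any multimodal LLM interface that accepts images plus
    # text; this is a placeholder, not an actual Gemini/EMMA API call.
    return model.generate(images=camera_frames, prompt=TASK_PROMPTS[task])
```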
We also measured the detection quality on the Waymo Open Dataset, and EMMA achieves very similar quality compared to other state-of-the-art models. This demonstrates EMMA's capability to generalize across different tasks and to train them jointly.

Here are some visualizations of EMMA's outputs, including the driving trajectory, detection, and road graph. I think the predictions are reasonably good.
So now it seems like we have some kind of prototype for an end-to-end multimodal large language model driver. But the next question is how we evaluate it, validate the model, and make sure that it is safe. Evaluation is part of the entire solution; we cannot really make this model succeed without evaluation. All the results I showed previously were open-loop evaluation, which means we just replay the logged video and check the model quality, but this is usually not the most faithful way to do evaluation. Beyond that, there are simulation and real-world testing. Simulation means we create a virtual world where our model can drive the car, and real-world testing is just deploying it. The difficulty levels are different, and we usually trust simulation a lot more than open-loop evaluation.
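As a point of reference, open-loop evaluation of this kind usually amounts to replaying logged drives and comparing the predicted waypoints against the trajectory that was actually driven, for example with an average displacement error; a minimal sketch (this particular metric is an assumption for illustration, not necessarily the one used in the results above):

```python
import math

def average_displacement_error(predicted, logged):
    """Mean Euclidean distance between predicted and logged waypoints (meters)."""
    assert len(predicted) == len(logged)
    dists = [math.dist(p, g) for p, g in zip(predicted, logged)]
    return sum(dists) / len(dists)

# Toy example: prediction vs. the trajectory recorded in the driving log.
print(average_displacement_error([(2.0, 0.0), (4.0, 0.1)], [(2.1, 0.0), (4.3, 0.0)]))
```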
One thing we are also thinking about is that foundation models in the generative world are also very advanced. This is Veo 2 from Google DeepMind, where you can see the generated videos are very realistic. So we are also thinking that maybe we can leverage it for simulation. This is our latest research: sensor simulation for end-to-end driving evaluation. For this work we use open-sourced generative models from Google to generate the videos, and then we use those videos as input to EMMA so we can evaluate its quality. We can also control various conditions: we can change the weather, we can change the time of day, we can change a lot of different things. Due to time, I removed those videos from the slides, so sorry about the missing visuals.
Here are the results. We can test our EMMA planner in various conditions: we can switch from rain to no rain, and we can switch between different times of day. The results very much align with our intuition. When it rains and the weather is bad, our planner usually gets a little bit worse, because it is a camera-only model and bad weather affects the camera domain. Time of day also aligns: at night the quality is usually worse than in the daytime, so the model performs best around noon or in the afternoon.
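A sketch of how such condition-controlled evaluation could be organized: generate the same scene under different weather and time-of-day settings, run the camera-only planner on the generated video, and compare the resulting metric. The generator and planner interfaces here are hypothetical.

```python
# Hypothetical harness: render sensor video under controlled conditions with a
# generative video model, run the planner on it, and tabulate a quality metric.
from itertools import product

WEATHERS = ["clear", "rain"]
TIMES_OF_DAY = ["noon", "afternoon", "night"]

def evaluate_planner(planner, video_generator, scene, metric):
    results = {}
    for weather, tod in product(WEATHERS, TIMES_OF_DAY):
        # `video_generator` stands in for an open-sourced generative video model;
        # its interface is made up for illustration.
        frames = video_generator.render(scene, weather=weather, time_of_day=tod)
        plan = planner.drive(frames)
        results[(weather, tod)] = metric(plan, scene.reference_plan)
    return results
```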
This is the last slide. These are all generated videos that we test our models in. I think this is ongoing research in a very exciting field, where we have all these foundation models, and because Waymo and Google both belong to Alphabet, we get access to all of these models, for free, sort of. So we basically try to adapt them, see if we can improve generalization, and help Waymo scale to the next big thing. Thank you for your attention.