Waymo's EMMA: Teaching Cars to Think - Jyh Jing Hwang, Waymo
Channel: aiDotEngineer
Published at: 2025-07-26
YouTube video id: iS9YFW28XyM
Source: https://www.youtube.com/watch?v=iS9YFW28XyM
[Music] So I think a lot of people here have already taken a ride to be here in San Francisco. We drive every day, and I think the technology just works, as you have all witnessed. I'm going to talk a little bit about the history of autonomous driving research, what it took to get us here, and the current status.

People started this research in the 1980s with very simple neural networks of just three layers, and over time the models went deeper and deeper. Then, around 2016, NVIDIA and other research labs published end-to-end driving models, and this is one of the videos they released at the time. The car driving in front is the autonomous vehicle from that research lab, and you can see it is actually drifting. This is more of an L2 sort of technology, and you wouldn't want to ride in it.

So why is the L4 system that Waymo has so much better? You can see it drives around San Francisco in various regions, including downtown; it avoids pedestrians and cyclists, and it makes the ride very safe. The secret is basically on top, where you can see a sort of visualization, a debugging screen, that shows everything the system understands. It captures all the cars, pedestrians, cyclists, traffic lights, and crosswalks, and it understands almost everything related to driving. This is a very complicated system. If we break it down, there is a perception system that understands the world, a prediction system that predicts future world states given the current one, and finally a planning system that tells us how we should drive in this scenario: whether to turn right, how much acceleration, how much steering, and so on. This whole autonomous driving system is very complicated and delicate, but it solves the problem; it drives in San Francisco today.

So where are we heading with autonomous driving? The answer is scaling. We have proven that we can operate in Phoenix, San Francisco, Austin, and Los Angeles today, offering rider-only service. However, our ambition does not stop at just four cities. This year we are doing a road trip, visiting about ten cities in various locations. Here is the map, and you can see that we are even going across the globe to Tokyo, Japan.

There are a lot of challenges when we start to scale. Just recently we finished an open challenge for participants to solve some of the hardest problems we see as we expand. Let's replay the video a little here. In the clip on the top left, there is a marathon on the right and there are cones along the road. Most interestingly, as we drive closer to the intersection, if you look at the traffic light it is actually red, but there is a traffic controller waving us to go ahead, so it is effectively green, and this is very confusing. All these long tails are very challenging to solve; many of them are situations you may never see in an entire lifetime of driving. But at Waymo's scale we see them very often, and we need to solve them in order to scale.
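To make the modular stack just described a bit more concrete, here is a rough sketch of its shape in code: perception produces a structured world state, prediction rolls it forward, and planning chooses the ego trajectory. All types and function names here are illustrative placeholders, not Waymo's actual system.

```python
# Illustrative sketch of the classic modular driving stack:
# perception -> prediction -> planning.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrackedObject:
    kind: str                      # "vehicle", "pedestrian", "cyclist", ...
    position: Tuple[float, float]  # meters, in the ego frame
    velocity: Tuple[float, float]  # meters/second

@dataclass
class WorldState:
    objects: List[TrackedObject]
    traffic_light: str             # "red", "yellow", "green", "unknown"

def perceive(sensor_frames) -> WorldState:
    """Perception: turn raw camera/lidar frames into a structured world state."""
    raise NotImplementedError

def predict(state: WorldState, horizon_s: float = 5.0) -> List[WorldState]:
    """Prediction: roll the current state forward into likely future states."""
    raise NotImplementedError

def plan(state: WorldState, futures: List[WorldState]) -> List[Tuple[float, float]]:
    """Planning: choose the ego trajectory (future waypoints) given the predictions."""
    raise NotImplementedError

def drive_one_tick(sensor_frames) -> List[Tuple[float, float]]:
    """One control cycle of the modular pipeline."""
    state = perceive(sensor_frames)
    futures = predict(state)
    return plan(state, futures)
```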
So one of the solutions arises from the fact that foundation models are very generalizable. Let's take an example where we feed a video into Gemini. You can see that we were driving peacefully when suddenly a flock of angry birds appears and, virtually, attacks our car. We need to react, but for a car to understand the scenario, it also has to understand that this is a flock of birds, how they are going to behave, and how they are going to interact with the car. This is not the everyday driving you see, but it is a critical long tail. So let's see what Gemini responds. Gemini basically says that a large flock of birds suddenly takes flight from the ground in front of the vehicle, and that the expected behavior is to slow down, remain alert, and possibly adjust the speed afterwards. I would say this is a basically perfect answer for how we should drive in this scenario.

Here is another scenario: there is a scooter rider in front of us, and unfortunately she slipped, because it is nighttime and there was rain earlier, so the road was wet. These kinds of events also happen quite often when the scale is large, but they are very much safety-critical events that we really want to get right, and get perfect. Let's ask Gemini again. Gemini actually identifies the entire scenario correctly, and even identifies a gas station far away in the scene. So I think it is very telling that today's foundation models generalize well to these kinds of rare events, and the question becomes how we leverage this technology for autonomous driving.

Here is our exploration in research, where we want a more generalizable autonomous driving system by leveraging Gemini or other multimodal large language models. We call it EMMA. This is the simplest form of EMMA, and the idea is very straightforward: Gemini is very good at generalizing to various scenarios, so let's just put it to drive. How do we do that? On the top left is the router, which tells us where we are going on the current drive. You can think of it as a Google Maps signal: it tells us to turn left or turn right at the next intersection, something like that, and we translate it into text, just pure text. On the bottom left we have the vehicle with its surrounding cameras; there are eight cameras in this image, covering 360 degrees. We input these videos and the routing text into EMMA, which is built on top of Gemini, and then we ask Gemini how to drive next. Gemini outputs future waypoints, which are the locations this car should be at over the next few seconds.

This method is very simple, and it has three major traits. The first is that it is self-supervised: whenever we have a driving log, we know where the car was, so at any time point we can use this formulation to train the model, because we know where the car should be in the future. It is self-supervised and very scalable. The second is that it is camera-only; we do not use lidar yet, because Gemini is a camera model, so we only need cameras to drive. The third is that it is high-definition-map free.
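As a minimal sketch of this formulation, the following code shows the idea: eight surround-camera frames plus a plain-text routing instruction go into the model, and future ego waypoints come back out as text, with the self-supervised label being simply where the car actually went in the log. The prompt wording, waypoint format, and helper names are assumptions for illustration, not Waymo's exact setup.

```python
# Sketch of an EMMA-style input/output formulation (assumed, not exact).
import re
from typing import List, Tuple

def build_prompt(routing_text: str) -> str:
    """Combine the routing signal (as plain text) with driving instructions."""
    return (
        "You are driving the ego vehicle.\n"
        f"Routing instruction: {routing_text}\n"
        "Given the 8 surround-view camera images, output the ego vehicle's "
        "waypoints for the next 5 seconds as (x, y) pairs in meters in the "
        "ego frame, one pair per 0.5 s, e.g. (1.2, 0.0) (2.5, 0.1) ..."
    )

def parse_waypoints(model_output: str) -> List[Tuple[float, float]]:
    """Pull '(x, y)' pairs out of the model's free-form text answer."""
    pairs = re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", model_output)
    return [(float(x), float(y)) for x, y in pairs]

def make_training_example(camera_frames, routing_text: str, future_ego_xy):
    """Self-supervised target: the label is where the car actually drove in the
    log, so every logged segment yields a training example with no extra labels."""
    target = " ".join(f"({x:.1f}, {y:.1f})" for x, y in future_ego_xy)
    return {"images": camera_frames,
            "prompt": build_prompt(routing_text),
            "target": target}
```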
For people who understand autonomous driving technology to a certain extent, we usually hear that Waymo's cars need a lot of map priors. In this model we actually do not need any map beyond Google Maps. So how was the performance? We evaluated on a research benchmark called nuScenes, one of the most popular open-loop planning benchmarks, and this simple model, EMMA, achieved state-of-the-art quality compared to all the other models at the time. Whether they were customized models, small models, or large models, with this simple formulation we could already achieve the best quality.

Now we are thinking further. The self-supervised method does work to some extent, but there are drawbacks we want to remedy. The first is that we already have a lot of labeled data and a lot of expert models; how do we leverage them to further improve quality? The second is a very clear drawback of end-to-end models: explainability is missing. We can only see the planner's output, but we do not know what happens inside. So here is what we do: we add a chain-of-thought process before outputting the plan, letting the model explain how it should drive in this scenario. For example, we first ask the model to identify critical objects on the road; in this scenario it identifies the cyclist and the vehicle. Then we ask it to explain what behaviors those critical objects are likely to exhibit, and what driving meta-decision we should make. In this scenario it says we should keep normal speed; in other scenarios it might say we should yield or slow down, something like that.

With this we achieve even better performance, measured on our own Waymo Open Motion Dataset, which contains about 100k scenarios, roughly 100 times larger than nuScenes. More importantly, the baselines are much stronger: they are specialized models, Wayformer and MotionLM, built on top of a very sophisticated oracle perception system. They take inputs from that oracle perception plus the road graph, which is essentially the high-definition map, and the traffic light states. So they have inputs from almost everything, and their only job is to output the plan. These are very strong baselines, and with chain-of-thought reasoning EMMA actually comes out on top. With this architecture we see that it performs very well in the academic setting.

One promise of foundation models is scaling laws, which basically say that quality keeps improving as you grow the model and the dataset. We show here that if we train on a dataset that is orders of magnitude larger than any publicly released academic dataset, and we keep training on more data, the quality keeps improving. The y-axis here is perplexity: the lower the perplexity, the better the planning results. So we see the quality can be improved further.
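To illustrate the chain-of-thought recipe described above, here is a minimal sketch in which the model is asked to verbalize critical objects, their likely behavior, and a driving meta-decision before emitting waypoints, so the rationale can be inspected. The field names, the JSON format, and the prompt wording are illustrative assumptions, not the exact EMMA format.

```python
# Sketch of a chain-of-thought driving prompt and its structured parse.
import json
from dataclasses import dataclass
from typing import List, Tuple

COT_PROMPT = (
    "Given the surround camera images and the routing instruction, answer as JSON:\n"
    "{\n"
    '  "critical_objects": ["<object and where it is>", ...],\n'
    '  "behaviors": ["<what each critical object will likely do>", ...],\n'
    '  "meta_decision": "keep_speed | yield | slow_down | stop",\n'
    '  "waypoints": [[x, y], ...]\n'
    "}"
)

@dataclass
class DrivingRationale:
    critical_objects: List[str]
    behaviors: List[str]
    meta_decision: str
    waypoints: List[Tuple[float, float]]

def parse_rationale(model_output: str) -> DrivingRationale:
    """Turn the model's JSON answer into an inspectable rationale plus a plan."""
    data = json.loads(model_output)
    return DrivingRationale(
        critical_objects=data["critical_objects"],
        behaviors=data["behaviors"],
        meta_decision=data["meta_decision"],
        waypoints=[tuple(p) for p in data["waypoints"]],
    )
```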
So we have talked about explainability, and one other thing we are thinking about is: why not just train on various tasks? Because it is a vision-language model at its core, and the language part is super flexible, we can formulate almost any type of task with it. So we thought: let's make EMMA the most generalizable model possible, and we added many different tasks to it. In this example we demonstrate 3D detection, road graph estimation, and some free-form VQA. For any prompt on the left, EMMA will output the corresponding answer: the first is end-to-end driving, the second is 3D detection, and so on. We then decode and visualize the results on the right. We also measured the detection quality on the Waymo Open Dataset, and it achieves very similar quality to other state-of-the-art models. This demonstrates EMMA's capability to generalize across different tasks and to co-train them together. Here are some visualizations of EMMA's outputs, including the driving trajectory, detections, and road graph, and I think the predictions are reasonably good.

So now it seems we have some kind of prototype for an end-to-end multimodal large language model driver, but the next question is how we evaluate it, validate each version, and make sure it is safe. Evaluation is part of the entire solution; we cannot make this model succeed without it. All the results I showed earlier were open-loop evaluation, which means we just replay the video and check the model quality, but this is usually not the most faithful way to evaluate. There are also simulation and real-world testing. Simulation means we create a virtual world in which we can test our model driving the car, and real-world testing is just deploying it. The difficulties are different, and we usually trust simulation a lot more than open-loop evaluation.

One thing we are also considering is that foundation models for generative video are now very advanced. This is Veo 2 from Google DeepMind, where you can see the generated videos are very realistic. So we are thinking we can leverage this for simulation. This is our latest research on sensor simulation for end-to-end driving evaluation. For this we use an open-sourced generative model from Google, generate the videos, and use them to exercise our EMMA planner so we can evaluate its quality. We can also control various conditions: we can change the weather, change the time of day, and vary the scene in many other ways. Due to time, I removed those videos, so sorry about the missing visuals. Here are the results, where we test our EMMA planner under various conditions: we can switch from rain to no rain, and we can switch the time of day. This very much aligns with our intuition: when it rains and the weather is bad, our planner usually gets a little worse, because it is a camera-only model and rain degrades the camera domain. Time of day also aligns: at night the quality is usually worse than in the daytime, so the model performs best around noon or in the afternoon. So this is the last slide.
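As a rough sketch of the condition-sweep evaluation described above: render a scenario under different weather and time-of-day settings with a generative video model, run the planner on each rendering, and compare its waypoints against the logged trajectory. The generator and planner here are placeholder callables, and the metric is a simple average displacement error; none of this is Waymo's actual evaluation code.

```python
# Sketch: sweep generated sensor conditions and score the planner per condition.
from itertools import product
from typing import Callable, Dict, List, Tuple

Waypoints = List[Tuple[float, float]]

def average_displacement_error(pred: Waypoints, ref: Waypoints) -> float:
    """Mean L2 distance between predicted and logged waypoints."""
    n = min(len(pred), len(ref))
    if n == 0:
        return float("inf")
    return sum(
        ((px - rx) ** 2 + (py - ry) ** 2) ** 0.5
        for (px, py), (rx, ry) in zip(pred, ref)
    ) / n

def evaluate_conditions(
    scenario,
    generate_video: Callable,   # e.g. a generative video model wrapper
    run_planner: Callable,      # e.g. the EMMA planner on rendered frames
) -> Dict[Tuple[str, str], float]:
    """Sweep weather x time-of-day and return ADE for each condition."""
    results = {}
    for weather, time_of_day in product(["clear", "rain"], ["noon", "night"]):
        frames = generate_video(scenario, weather=weather, time_of_day=time_of_day)
        pred = run_planner(frames, scenario.routing_text)
        results[(weather, time_of_day)] = average_displacement_error(
            pred, scenario.logged_waypoints
        )
    return results
```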
So these are all generated videos that we test our models in. I think this is ongoing research in a very exciting field: we have all these foundation models, and because Waymo and Google both belong to Alphabet, we get access to all of these models, more or less for free. So we try to adapt them, see if we can improve generalization, and help Waymo scale to the next big thing. Thank you. Thank you for your attention. [Music]