How Google DeepMind is researching the next Frontier of AI for Gemini — Raia Hadsell, VP of Research
Channel: aiDotEngineer
Published at: 2026-04-18
YouTube video id: zZsTVBXcbow
Source: https://www.youtube.com/watch?v=zZsTVBXcbow
[music] >> Our next speaker is VP of Research at Google DeepMind. Please join me in welcoming to the stage Raia Hadsell. >> [applause] >> Hello everyone. Wonderful. What a lovely full room and good smiles. I heard the dig on Google there at the end. I did catch that. So my name is Raia Hadsell. I've been a part of DeepMind for almost 13 years, and I'm very happy to have AI Engineer come here to London. I'm also very proud this year to be a UK AI ambassador, so I help bridge the gaps between government, academia, and industry. And yes, I'm American by birth, but I've been here long enough that I can count myself among the proud Brits as well. So I'm going to talk a little bit about frontier AI and the future of intelligence.

To start, a little introduction to who I am. It's good to be as old as I am: you get to look at this by the decades. In the 90s I did my undergraduate degree in philosophy of religion. Definitely not a computer scientist yet. But I really enjoyed it. Before you ask: yes, I learned a lot; I'm glad I did it; and no, it hasn't been very useful since. In the 2000s I made a bit of a pivot into computer science, after some good advice from those close to me, and spent my PhD years in New York City working on convolutional neural networks for robots with Yann LeCun. A lot of fun. Then in the 2010s I made the decision to join a small group of curious, scrappy individuals working at DeepMind. It was a group of about 30 to 40 people at the time, and we spent the rest of that decade working on things like Atari video games, Go, StarCraft, and some robotics. A lot of fun. And now I am a VP within DeepMind. I help run a group of about 1,200 scientists and engineers across 10 labs, and we're working on a lot of different things. I'll tell you about three of those.

So first, frontier AI is an area where we really are trying to make sure that we are staying in front. We're thinking about what the next architectures are that we're going to use for Gemini, what the next problems are that only AI can really address, and how we're going to build the future of intelligence. And that's thinking not just about artificial intelligence, but about creating the future of human intelligence, and even robotic intelligence as well. We are all on this journey together, and I think it's important to think about how humans change as well as the technology.

Our approach: we look for root nodes. We're not going to waste time on the leaves. For a big problem space that hasn't been solved, we ask how deep we can go, find the deepest problems, and solve those in order to enable a lot of downstream impact. We partner with the world. I think about that very broadly: who are the partners that can help us find those root nodes, solve those problems, and also bring the work out to the leaf nodes? And we solve problems that are worth solving. The mission of DeepMind is to build AI responsibly for the benefit of humanity, and I take that seriously. We want to solve problems that are worth solving.

All right. So we work in a lot of different areas within frontier AI at DeepMind. These are some of the different categories. I'm not going to tell you about all of them, so you can keep those a mystery. But I'll just pick out a couple.
So first, in advanced models, I actually wanted to bring up an embeddings model. The theme of this talk overall is things that are not directly language models, and in the modeling space I wanted to talk about embedding models. To start, I'll ask if anyone knows what a Jennifer Aniston cell is. We've got a few neuroscientists in the room. This is a concept from neuroscience: we've discovered that there is not just a single cell but a small number of neurons that encode for a specific thing, as in a specific person. And those combinations of neurons that only activate for that one person or that one thing or that one place are actually very robust: they activate regardless of modality. The brain uses this for very fast retrieval, for recognition, and for comparison functions. So that means that when I say the name Jennifer Aniston, or if I showed you a picture or a video, or if you even heard her voice, if you knew her, if you were enough of a fan, then all those different modalities lead to the same set of cells activating.

We want that in an artificial neural network for the same reasons: fast retrieval, recognition, and comparison. So we can train what's called an embedding model to encode for those concepts, to be more robust to the different ways the information can be presented, and to be very good at understanding the comparison between different activations. We use contrastive losses. One of the reasons why I like this space is because I did my PhD work in part on Siamese neural networks, which were an early way of understanding what a contrastive loss function is. These embedding functions are a really critical companion to generative AI: sometimes we want to generate, sometimes we want to retrieve.

So the group at Google has been working on this for a long time, and just recently we released Gemini Embeddings 2. This is exciting to me because it really is the ideal. It is fully omnimodal. It is derived from Gemini, so it has that level of knowledge and understanding of the world, and it allows extremely good retrieval. In a little more detail, why is it good that it is unified and multimodal? It means you don't need separate steps for each modality: you can be truly end-to-end and not lose information by trying to combine audio, visual, and text information together. You can get a single vector that represents text up to 8K tokens, 128 seconds of video, 80 seconds of audio, and a full PDF, and together that can give you a lot of information. You can then use that for retrieval, for querying, for agentic logic, and other things. We also use something called Matryoshka Representation Learning (MRL), which allows the same network to represent different dimensionalities. So for instance, you could start out doing a retrieval using only the first 256 dimensions of your embedding, and then expand to the full vector for more expressiveness. This gives us a unified semantic space and really state-of-the-art quality. So this is something that came out recently and doesn't get talked about quite as often as language models, but I think it's really important as that companion. All right.
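To make the two ideas above concrete, here is a minimal NumPy sketch of a contrastive (InfoNCE-style) loss and Matryoshka-style truncated retrieval. The `embed` stub, the dimensions, and the toy corpus are illustrative assumptions for this sketch; this is not the Gemini Embeddings API.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(items, dim=3072):
    """Stand-in encoder: random unit-norm vectors. A real embedding model
    maps text/image/audio/video into one shared vector space."""
    vecs = rng.normal(size=(len(items), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def info_nce(anchor, positive, negatives, tau=0.07):
    """Contrastive (InfoNCE) loss: -log softmax of the positive pair's
    similarity against the negatives. Lower = anchor and positive closer."""
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    m = logits.max()
    return (m + np.log(np.exp(logits - m).sum())) - logits[0]

def retrieve(query, corpus, dim=None, k=3):
    """Cosine retrieval, optionally on a Matryoshka-style prefix: keep only
    the first `dim` coordinates and re-normalize before ranking."""
    if dim is not None:
        query = query[:dim] / np.linalg.norm(query[:dim])
        corpus = corpus[:, :dim]
        corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(corpus @ query))[:k]

docs = ["muddy lane in Kent", "Camden Canal", "hurricane landfall", "origami lizard"]
corpus = embed(docs)
query = embed(["rainy walk in Kent"])[0]

loss = info_nce(query, corpus[0], corpus[1:])   # training-time signal
coarse = retrieve(query, corpus, dim=256, k=3)  # cheap first pass on a prefix
fine = retrieve(query, corpus[coarse], k=1)     # rerank at full width
print(docs[coarse[fine[0]]])
```

The coarse-then-fine pattern at the end is what MRL enables: cheap candidate generation on a short prefix of each vector, then an exact rerank on the full-width embeddings.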
Next, I wanted to quickly talk about another thing that is not a language model. This is not a language model at all; there was no language involved. This is work we've done on the weather. In London it rains a lot, and a few years ago an informatics scientist at the Met Office, the Meteorological Office, the UK's weather agency, asked, "Can you predict rainfall better than our physics-based models using AI?" And I said, "I don't know. Interesting problem. Let me take this back to the team." I took it back to the team at DeepMind, we started working on it, and we discovered that, yes, predicting the weather, even though it is a very hard problem using physics simulation of the atmosphere, is actually quite tractable for neural network models, given that we have 40 years of global data on what the weather is.

So a couple of years ago we came out with GraphCast. GraphCast predicts the state of the atmosphere up to 15 days out, everywhere on Earth, and for many different variables. It uses a spherical graph neural network: think of it encompassing the Earth, with nodes that go all the way from the surface up into the lower stratosphere. We feed in, and then predict autoregressively, 100 different atmospheric variables, for instance wind speed, temperature, and humidity, as shown here.

And this worked very well. Here's a quick example we were excited to see: Hurricane Lee, in 2023. It comes into the Atlantic, pauses for a moment, then takes a turn to the north, speeds up, and makes landfall in Nova Scotia. The total video is 9 days' worth; that's how far the hurricane moves. And this is actually the output of the graph neural network, its prediction. The prediction it made was accurate 9 days out as to where that landfall would be. In comparison, the best gold-standard physics-based models were only accurate 6 days out. When you're talking about a major hurricane hitting land, 3 days is really important.

So with this we said, okay, this is important, and we're going to keep pushing the science. The team developed the next model, which we called GenCast. The difference here is that this model, while also based on a mesh, is probabilistic, with higher accuracy and higher efficiency. The weather is fundamentally chaotic, and we want to know what's happening on the tails. A probabilistic model allows us to do that, and allows this to be operationalized and used for actual weather prediction. GenCast was also more accurate: when we compared it to 1,300 gold-standard benchmark weather forecasts, it was more accurate 97% of the time. And we could produce that 15-day forecast in 8 minutes on a single chip, instead of hours on a very large supercomputer. So it's a very different solution space that we were proposing.

And just this last year, this team, which is relentless and constantly coming up with new models, released the latest one, called FGN, the functional generative network. This directly predicts cyclones, rather than predicting the weather and then bolting on a cyclone detector as post-processing.
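Two of the ideas above fit in a compact sketch: the autoregressive rollout that a GraphCast-style model uses (predict the next atmospheric state, feed it back in), and the ensemble sampling that makes a GenCast-style forecast probabilistic. Everything here, including the toy `step` dynamics, the shapes, and the step count, is an illustrative stand-in under my own assumptions, not DeepMind's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NODES, N_VARS = 1000, 100    # mesh nodes x atmospheric variables

def step(state, noise_scale=0.0):
    """Toy one-step (6-hour) transition. A real model is a learned graph
    neural network on a spherical mesh; optional noise turns the rollout
    into one sample from a probabilistic forecast."""
    drift = 0.99 * state + 0.01 * np.tanh(state)
    return drift + noise_scale * rng.normal(size=state.shape)

def rollout(state, n_steps, noise_scale=0.0):
    """Autoregressive forecast: each predicted state becomes the input
    for the next step."""
    traj = [state]
    for _ in range(n_steps):
        traj.append(step(traj[-1], noise_scale))
    return np.stack(traj)

state0 = rng.normal(size=(N_NODES, N_VARS))  # current atmospheric analysis
point = rollout(state0, n_steps=60)          # 60 x 6h steps = 15-day forecast

# Probabilistic, GenCast-style use: sample an ensemble of rollouts, then
# read off means and tails (e.g. extreme-wind risk) per node and variable.
ensemble = np.stack([rollout(state0, 60, noise_scale=0.05) for _ in range(8)])
mean_forecast = ensemble.mean(axis=0)
p95_forecast = np.percentile(ensemble, 95, axis=0)
```

The ensemble is what makes the chaotic tails visible: a single deterministic trajectory gives you one future, while the spread across sampled trajectories tells you how confident to be and how bad the plausible worst cases are.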
This actually incorporates the categorization, the recognition of cyclones, their trajectory, their wind speed, and the formation of the eye directly into the network. We train for that, which means it's much better. This has already been used in the US by the National Hurricane Center, and they are very excited by how much of an advantage it now gives. So this will hopefully be used worldwide in the coming years.

All right. Lastly, I wanted to use the last few minutes to talk about, again, something that is not language-model based: world models. This came out of work that DeepMind has done on games and simulation for a long time. We worked on Atari, on Go, on StarCraft, and then on MuJoCo-type environments for robotics, because we wanted to understand agency and the environment. And we started focusing more and more not just on training the agent, but on creating an infinite environment. When I did work on locomotion here... oh, that's not playing. Maybe this will play. I'm going to jump forward to Genie 1.

So this is Genie 1. It could only run for a few seconds, but you could say, "Hey, I want this type of a world," and it would produce a little 2D platformer game environment where you could jump around for a few minutes, and it would actually respond to whether you hit left or right. And it could produce a reasonable diversity of different-looking platformer-type worlds. This was enough to say: hmm, we might have something here. Let's scale up the data, improve the method, and train again, now on 3D games. Then we produced Genie 2. Genie 2 is interactive, but it's not yet real time, so you need to go awfully slowly. And it can produce 3D environments, but nothing of real-world quality or higher definition yet. So we were working on that. And then along came Veo 3.

All right. >> [cheering and applause] >> All right. Well, now I am out of time, but I will still take another minute or two to show you these. So this is telling Genie: I want a world where I'm walking down a muddy lane in Kent. It looks not far from my house. The fun thing here is that you look down at yourself and realize that you actually have a body; you're actually interacting with the world. It's a little bit odd to know what's coming out of this model. It has really understood not just the appearance of a lane in Kent, but what it actually takes to engage with it: to make the water move, and to walk forward. Of course, it's not just scenes for walking; we can very happily ski. You can create an environment where you can engage with the world in so many different ways.

Here's an example where it says "original" there. We prompted this with a fragment of video, and now the label has changed to Genie 3. This is by an artist: he made those first few seconds, and then we used that to prompt Genie and bring his world to life. He was so tickled to see that we could take a little snippet of the world he had laboriously created and bring it to life in a way that means you can fly through it. You can bounce off of something, and it remembers: oh, here's that weird structure, go back to that, fly through there. So these environments are not only diverse, interactive, and high quality; they also have memory. The prompt here was, "I'm an origami lizard in an origami world," and this is what you get. We use this as a nice little test: I can spend a minute running in one direction, run back to the start, and everything is exactly as it was at the beginning, because the model has really good memory. Working in these environments gives us consistency and control.
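For a sense of how such a world model is driven, here is a toy sketch of the interaction loop: the model conditions on the frame history plus the player's action to emit the next frame, and long-horizon memory corresponds to conditioning on that whole history rather than just the last frame. The `WorldModel` class here is a hypothetical stub of my own, not Genie's architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 64, 64, 3
ACTIONS = ["left", "right", "forward", "jump"]

class WorldModel:
    """Hypothetical stand-in for a learned generative world model. A real
    system would condition a video generator on (frame history, action,
    optional text prompt); here we fake the dynamics with a pixel shift."""
    def next_frame(self, history, action, prompt=None):
        prev = history[-1]
        shift = {"left": -1, "right": 1}.get(action, 0)
        nxt = np.roll(prev, shift, axis=1)   # fake camera motion
        noise = rng.normal(scale=0.01, size=prev.shape)
        return np.clip(nxt + noise, 0.0, 1.0)

model = WorldModel()
history = [rng.random((H, W, C))]            # prompt frame (or video snippet)
for t in range(30):                          # ~1 second at 30 fps
    action = ACTIONS[t % len(ACTIONS)]       # a scripted "player"
    history.append(model.next_frame(history, action))
```

The run-away-and-back consistency test from the talk maps onto this loop directly: if the model only saw the last frame it would have no way to reconstruct the start of the world, so memory has to come from conditioning on (or compressing) the accumulated history.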
And lastly, we're able to prompt this world while you're in it. That means that while I'm in a world that might be a little bit boring (here I am, walking down the Camden Canal in London, near the DeepMind office), what happens if I prompt it at the same time? Ah. I've just changed the world that I'm in. I can change it again. There we go. Immediately, the world is different. And one more time, just for fun. I love the idea of a new form of gaming where I could be adversarially prompting your experience of a world. It creates a whole different sort of entertainment, a whole new world, a whole new frontier, and I think it can be really amazing not just for entertainment but for education as well. The ability to go into a world in order to learn about it is, I think, incredibly powerful, and may well be something that we see more and more of. And with that, I will say thank you, and just a quick call-out that tomorrow morning my colleague Omar is going to talk about Gemma 4, which is a language model. >> [laughter] >> Thank you. >> [music]