How Google DeepMind is researching the next Frontier of AI for Gemini — Raia Hadsell, VP of Research
Channel: aiDotEngineer
Published at: 2026-04-18
YouTube video id: zZsTVBXcbow
Source: https://www.youtube.com/watch?v=zZsTVBXcbow
[music] >> Our next speaker is VP of Research at Google DeepMind. Please join me in welcoming to the stage Raia Hadsell. >> [applause] >> Hello everyone. Wonderful. What a lovely full room and good smiles. I heard the dig on Google there at the end. I did catch that. So my name is Raia Hadsell. I've been a part of DeepMind for almost 13 years, and I'm very happy to have AI Engineer come here to London. I'm also very proud this year to be a UK AI ambassador, so I help bridge the gaps between government, academia, and industry. And yes, I'm American by birth, but I've been here long enough that I can count myself among the proud Brits as well. So I'm going to talk a little bit about frontier AI and the future of intelligence.

To start, a little introduction to who I am. It's good to be as old as I am: you get to look at this by the decades. In the 90s I did my undergraduate degree in philosophy of religion. Definitely not a computer scientist yet. But I really enjoyed it. Before you ask: yes, I learned a lot; I'm glad I did it; and no, it hasn't been very useful since. In the 2000s I made a bit of a pivot into computer science, after some good advice from those close to me, and spent my PhD years in New York City working on convolutional neural networks for robots with Yann LeCun. A lot of fun. Then in the 2010s I made the decision to join a small group of curious, scrappy individuals working at DeepMind. It was a group of about 30 to 40 people at the time, and we spent the rest of that decade working on things like Atari video games, Go, StarCraft, and some robotics. A lot of fun. And now I am a VP within DeepMind. I help run a group of about 1,200 scientists and engineers across 10 labs, and we're working on a lot of different things. I'll tell you about three of those.

So first, frontier AI is an area where we really are trying to make sure that we are staying in front. We're thinking about what the next architectures are that we're going to use for Gemini, what the next problems are that only AI can really address, and how we're going to build the future of intelligence. And that's thinking not just about artificial intelligence, but about creating the future of human intelligence, and even robotic intelligence as well. We are all on this journey together, and I think it's important to think about how humans change as well as the technology.

Our approach: we look for root nodes. We're not going to waste time on the leaves. For a big problem space that hasn't been solved, we ask how deep we can go, find the deepest problems, and solve those in order to enable a lot of downstream impact. We partner with the world. I think about that very broadly: who are the partners that can help us find those root nodes, solve those problems, and also bring the work out to the leaf nodes? And we solve problems that are worth solving. The mission of DeepMind is to build AI responsibly for the benefit of humanity, and I take that seriously. We want to solve problems that are worth solving.

All right. So we work in a lot of different areas within frontier AI at DeepMind. These are some of the different categories. I'm not going to tell you about all of them, so you can keep those a mystery. But I'll just pick out a couple.
So first, in advanced models, I actually wanted to bring up an embeddings model. The theme of this talk overall is things that are not directly language models, and in the modeling space I wanted to talk about embedding models. To start, I'll ask if anyone knows what a Jennifer Aniston cell is. We've got a few neuroscientists in the room. This is a concept from neuroscience: we've discovered that there is not just a single cell but a small number of neurons that encode for a specific thing, as in a specific person. And those combinations of neurons that only activate for that one person or that one thing or that one place are actually very robust: they activate regardless of modality. The brain uses this for very fast retrieval, for recognition, and for comparison functions. So that means that when I say the name Jennifer Aniston, or if I showed you a picture or a video, or if you even heard her voice, if you knew her, if you were enough of a fan, then all those different modalities lead to the same set of cells activating.

We want that in an artificial neural network for the same reasons: fast retrieval, recognition, and comparison. So we can train what's called an embedding model to encode for those concepts, to be more robust to the different ways the information can be presented, and to be very good at understanding the comparison between different activations. We use contrastive losses. One of the reasons why I like this space is because I did my PhD work in part on Siamese neural networks, which were an early way of understanding what a contrastive loss function is. These embedding functions are a really critical companion to generative AI: sometimes we want to generate, sometimes we want to retrieve.

So the group at Google has been working on this for a long time, and just recently we released Gemini Embeddings 2. This is exciting to me because it really is the ideal. It is fully omnimodal. It is derived from Gemini, so it has that level of knowledge and understanding of the world, and it allows extremely good retrieval. In a little more detail, why is it good that it is unified and multimodal? It means you don't need separate steps for each modality: you can be truly end-to-end and not lose information by trying to combine audio, visual, and text information together. You can get a single vector that represents text up to 8K tokens, 128 seconds of video, 80 seconds of audio, and a full PDF, and together that can give you a lot of information. You can then use that for retrieval, for querying, for agentic logic, and other things. We also use something called Matryoshka Representation Learning (MRL), which allows the same network to represent different dimensionalities. So for instance, you could start out doing a retrieval using only the first 256 dimensions of your embedding, and then expand to the full vector for more expressiveness. This gives us a unified semantic space and really state-of-the-art quality. So this is something that came out recently and doesn't get talked about quite as often as language models, but I think it's really important as that companion. All right.
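To make the two ideas above concrete, here is a minimal NumPy sketch of a contrastive (InfoNCE-style) loss and Matryoshka-style truncated retrieval. The `embed` stub, the dimensions, and the toy corpus are illustrative assumptions for this sketch; this is not the Gemini Embeddings API.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(items, dim=3072):
    """Stand-in encoder: random unit-norm vectors. A real embedding model
    maps text/image/audio/video into one shared vector space."""
    vecs = rng.normal(size=(len(items), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def info_nce(anchor, positive, negatives, tau=0.07):
    """Contrastive (InfoNCE) loss: -log softmax of the positive pair's
    similarity against the negatives. Lower = anchor and positive closer."""
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    m = logits.max()
    return (m + np.log(np.exp(logits - m).sum())) - logits[0]

def retrieve(query, corpus, dim=None, k=3):
    """Cosine retrieval, optionally on a Matryoshka-style prefix: keep only
    the first `dim` coordinates and re-normalize before ranking."""
    if dim is not None:
        query = query[:dim] / np.linalg.norm(query[:dim])
        corpus = corpus[:, :dim]
        corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(corpus @ query))[:k]

docs = ["muddy lane in Kent", "Camden Canal", "hurricane landfall", "origami lizard"]
corpus = embed(docs)
query = embed(["rainy walk in Kent"])[0]

loss = info_nce(query, corpus[0], corpus[1:])   # training-time signal
coarse = retrieve(query, corpus, dim=256, k=3)  # cheap first pass on a prefix
fine = retrieve(query, corpus[coarse], k=1)     # rerank at full width
print(docs[coarse[fine[0]]])
```

The coarse-then-fine pattern at the end is what MRL enables: cheap candidate generation on a short prefix of each vector, then an exact rerank on the full-width embeddings.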
Next, I wanted to quickly talk about another thing that is not a language model. This is not a language model at all; there was no language involved. This is work we've done on the weather. In London it rains a lot, and a few years ago an informatics scientist at the Met Office, the Meteorological Office, the UK's weather agency, asked, "Can you predict rainfall better than our physics-based models using AI?" And I said, "I don't know. Interesting problem. Let me take this back to the team." I took it back to the team at DeepMind, we started working on it, and we discovered that, yes, predicting the weather, even though it is a very hard problem using physics simulation of the atmosphere, is actually quite tractable for neural network models, given that we have 40 years of global data on what the weather is.

So a couple of years ago we came out with GraphCast. GraphCast predicts the state of the atmosphere up to 15 days out, everywhere on Earth, and for many different variables. It uses a spherical graph neural network: think of it encompassing the Earth, with nodes that go all the way from the surface up into the lower stratosphere. We feed in, and then predict autoregressively, 100 different atmospheric variables, for instance wind speed, temperature, and humidity, as shown here.

And this worked very well. Here's a quick example we were excited to see: Hurricane Lee, in 2023. It comes into the Atlantic, pauses for a moment, then takes a turn to the north, speeds up, and makes landfall in Nova Scotia. The total video is 9 days' worth; that's how far the hurricane moves. And this is actually the output of the graph neural network, its prediction. The prediction it made was accurate 9 days out as to where that landfall would be. In comparison, the best gold-standard physics-based models were only accurate 6 days out. When you're talking about a major hurricane hitting land, 3 days is really important.

So with this we said, okay, this is important, and we're going to keep pushing the science. The team developed the next model, which we called GenCast. The difference here is that this model, while also based on a mesh, is probabilistic, with higher accuracy and higher efficiency. The weather is fundamentally chaotic, and we want to know what's happening on the tails. A probabilistic model allows us to do that, and allows this to be operationalized and used for actual weather prediction. GenCast was also more accurate: when we compared it to 1,300 gold-standard benchmark weather forecasts, it was more accurate 97% of the time. And we could produce that 15-day forecast in 8 minutes on a single chip, instead of hours on a very large supercomputer. So it's a very different solution space that we were proposing.

And just this last year, this team, which is relentless and constantly coming up with new models, released the latest one, called FGN, the functional generative network. This directly predicts cyclones, rather than predicting the weather and then bolting on a cyclone detector as post-processing.
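Two of the ideas above fit in a compact sketch: the autoregressive rollout that a GraphCast-style model uses (predict the next atmospheric state, feed it back in), and the ensemble sampling that makes a GenCast-style forecast probabilistic. Everything here, including the toy `step` dynamics, the shapes, and the step count, is an illustrative stand-in under my own assumptions, not DeepMind's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NODES, N_VARS = 1000, 100    # mesh nodes x atmospheric variables

def step(state, noise_scale=0.0):
    """Toy one-step (6-hour) transition. A real model is a learned graph
    neural network on a spherical mesh; optional noise turns the rollout
    into one sample from a probabilistic forecast."""
    drift = 0.99 * state + 0.01 * np.tanh(state)
    return drift + noise_scale * rng.normal(size=state.shape)

def rollout(state, n_steps, noise_scale=0.0):
    """Autoregressive forecast: each predicted state becomes the input
    for the next step."""
    traj = [state]
    for _ in range(n_steps):
        traj.append(step(traj[-1], noise_scale))
    return np.stack(traj)

state0 = rng.normal(size=(N_NODES, N_VARS))  # current atmospheric analysis
point = rollout(state0, n_steps=60)          # 60 x 6h steps = 15-day forecast

# Probabilistic, GenCast-style use: sample an ensemble of rollouts, then
# read off means and tails (e.g. extreme-wind risk) per node and variable.
ensemble = np.stack([rollout(state0, 60, noise_scale=0.05) for _ in range(8)])
mean_forecast = ensemble.mean(axis=0)
p95_forecast = np.percentile(ensemble, 95, axis=0)
```

The ensemble is what makes the chaotic tails visible: a single deterministic trajectory gives you one future, while the spread across sampled trajectories tells you how confident to be and how bad the plausible worst cases are.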
This actually incorporates the categorization, the recognition of cyclones, their trajectory, their wind speed, and the formation of the eye directly into the network. We train for that, which means it's much better. This has already been used in the US by the National Hurricane Center, and they are very excited by how much of an advantage it now gives. So this will hopefully be used worldwide in the coming years.

All right. Lastly, I wanted to use the last few minutes to talk about, again, something that is not language-model based: world models. This came out of work that DeepMind has done on games and simulation for a long time. We worked on Atari, on Go, on StarCraft, and then on MuJoCo-type environments for robotics, because we wanted to understand agency and the environment. And we started focusing more and more not just on training the agent, but on creating an infinite environment. When I did work on locomotion here... oh, that's not playing. Maybe this will play. I'm going to jump forward to Genie 1.

So this is Genie 1. It could only run for a few seconds, but you could say, "Hey, I want this type of a world," and it would produce a little 2D platformer game environment where you could jump around for a few minutes, and it would actually respond to whether you hit left or right. And it could produce a reasonable diversity of different-looking platformer-type worlds. This was enough to say: hmm, we might have something here. Let's scale up the data, improve the method, and train again, now on 3D games. Then we produced Genie 2. Genie 2 is interactive, but it's not yet real time, so you need to go awfully slowly. And it can produce 3D environments, but nothing of real-world quality or higher definition yet. So we were working on that. And then along came Veo 3.

All right. >> [cheering and applause] >> All right. Well, now I am out of time, but I will still take another minute or two to show you these. So this is telling Genie: I want a world where I'm walking down a muddy lane in Kent. It looks not far from my house. The fun thing here is that you look down at yourself and realize that you actually have a body; you're actually interacting with the world. It's a little bit odd to know what's coming out of this model. It has really understood not just the appearance of a lane in Kent, but what it actually takes to engage with it: to make the water move, and to walk forward. Of course, it's not just scenes for walking; we can very happily ski. You can create an environment where you can engage with the world in so many different ways.

Here's an example where it says "original" there. We prompted this with a fragment of video, and now the label has changed to Genie 3. This is by an artist: he made those first few seconds, and then we used that to prompt Genie and bring his world to life. He was so tickled to see that we could take a little snippet of the world he had laboriously created and bring it to life in a way that means you can fly through it. You can bounce off of something, and it remembers: oh, here's that weird structure, go back to that, fly through there. So these environments are not only diverse, interactive, and high quality; they also have memory. The prompt here was, "I'm an origami lizard in an origami world," and this is what you get. We use this as a nice little test: I can spend a minute running in one direction, run back to the start, and everything is exactly as it was at the beginning, because the model has really good memory. Working in these environments gives us consistency and control.
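For a sense of how such a world model is driven, here is a toy sketch of the interaction loop: the model conditions on the frame history plus the player's action to emit the next frame, and long-horizon memory corresponds to conditioning on that whole history rather than just the last frame. The `WorldModel` class here is a hypothetical stub of my own, not Genie's architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 64, 64, 3
ACTIONS = ["left", "right", "forward", "jump"]

class WorldModel:
    """Hypothetical stand-in for a learned generative world model. A real
    system would condition a video generator on (frame history, action,
    optional text prompt); here we fake the dynamics with a pixel shift."""
    def next_frame(self, history, action, prompt=None):
        prev = history[-1]
        shift = {"left": -1, "right": 1}.get(action, 0)
        nxt = np.roll(prev, shift, axis=1)   # fake camera motion
        noise = rng.normal(scale=0.01, size=prev.shape)
        return np.clip(nxt + noise, 0.0, 1.0)

model = WorldModel()
history = [rng.random((H, W, C))]            # prompt frame (or video snippet)
for t in range(30):                          # ~1 second at 30 fps
    action = ACTIONS[t % len(ACTIONS)]       # a scripted "player"
    history.append(model.next_frame(history, action))
```

The run-away-and-back consistency test from the talk maps onto this loop directly: if the model only saw the last frame it would have no way to reconstruct the start of the world, so memory has to come from conditioning on (or compressing) the accumulated history.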
And lastly, we're able to prompt this world while you're in it. That means that while I'm in a world that might be a little bit boring (here I am, walking down the Camden Canal in London, near the DeepMind office), what happens if I prompt it at the same time? Ah. I've just changed the world that I'm in. I can change it again. There we go. Immediately, the world is different. And one more time, just for fun. I love the idea of a new form of gaming where I could be adversarially prompting your experience of a world. It creates a whole different sort of entertainment, a whole new world, a whole new frontier, and I think it can be really amazing not just for entertainment but for education as well. The ability to go into a world in order to learn about it is, I think, incredibly powerful, and may well be something that we see more and more of. And with that, I will say thank you, and just a quick call-out that tomorrow morning my colleague Omar is going to talk about Gemma 4, which is a language model. >> [laughter] >> Thank you. >> [music]