Gemma, DeepMind's Family of Open Models — Omar Sanseviero, Google DeepMind
Channel: aiDotEngineer
Published at: 2026-04-20
YouTube video id: _gVFUEdhCyI
Source: https://www.youtube.com/watch?v=_gVFUEdhCyI
All right, hi everyone. It's cool to be here. I'm super excited to give this talk because just seven days ago we released Gemma 4. Before we start: who here has heard about Gemma already? Okay, most of you. Great.

Gemma is Google DeepMind's family of open models. Open models means these are models you can download, run on your own infrastructure or your own devices, and fine-tune for your own use cases. About a year ago we released Gemma 3. Back then, Gemma 3 was the most capable family of open models that could fit on a single consumer GPU. We designed models from 1 billion parameters all the way to 27 billion parameters, and it was a very strong model on LM Arena at the time. You can see here different open models with their LM Arena scores, and the small dots at the bottom represent how many H100s or A100s you would need just to load each model. That's Gemma 3, from a year ago, but you can see that even a model from a year ago, a relatively small model, can be extremely capable.

Last week we released Gemma 4, and this is my first conference talking about it, so I'm very excited. Gemma 4 is the most capable family of open models Google has ever released. These are models that go from 2 billion parameters all the way to 32 billion parameters, with very different capabilities, and I'm going to talk a bit about those. If you're wondering what the E in the names means, I'll explain that in a second. The smallest two models can run on an Android phone, on an iPhone, even on a Raspberry Pi. These are really small models that are multimodal, have reasoning, and can do very cool on-device agentic things. Then there's an MoE, a mixture-of-experts model, which is super fast with very low latency. And then there's the 32B, the most intelligent, most capable model. When you want the most raw intelligence, you would use this large model, and even the 32B can run on a consumer GPU. All of these models come in developer-friendly sizes, which is quite important to us.

Let me show you a couple of them, assuming the videos load. There's a lot happening here, so let me begin with the one on the right. That's an application where you have Gemma running directly on an Android phone and you can pick different skills. You have a full agentic setup where the model picks, say, a skill to play the piano, and then you have Gemma playing the piano. The one on the left is Gemma vibe coding, also on device. This is in airplane mode, no API calls, fully running on a phone. The example in the middle is on a laptop: we have ten instances of Gemma running in parallel, each generating a different SVG, and in a couple of seconds you see ten SVGs generated by different agents, all running on device with llama.cpp. Even then it's around 100 tokens per second, and there you can see the SVGs generated by the ten different Gemma instances. Gemma is a good coding model: it can do agentic tasks, coding, even Android app development, and all of this offline.

The LM Arena scores are quite nice. Here you can see a bunch of different models.
The X axis is how many billion parameters the model has; the Y axis is the LM Arena score. I know LM Arena is not a perfect benchmark, but it does give you some proxy for how much the community likes a model for general use cases like conversation. Gemma strikes a nice balance between being friendly and helpful while still being very capable. And look at that top-left corner: those are very small models that are very capable, which is quite exciting.

It's been exciting to see how the models have progressed over the last two years. Last year it was Gemma 3; two years ago it was Gemma 2. For a bunch of different tasks, the models have kept getting better and better without getting bigger. For me that's quite exciting, because if I think about where we'll be one or two years from now, I do think we'll have extremely capable models running directly on our own devices, in our own pockets. I'll skip the benchmarks. What's exciting is that Gemma can fit on a desktop computer, on a laptop, on a phone. I saw a day or two ago that someone put llama.cpp on a Nintendo Switch and is using it to try Gemma directly there. I don't know how things will look in a couple of years, but I'm excited for it.

Something we heard a lot with previous Gemma versions was that the license was not great; people wanted a proper open source license. So with Gemma 4 we changed to an actual Apache 2.0 license, which gives you all the flexibility of Apache 2.0. That's quite nice as well.

Now, you have probably heard about mixture of experts (that's the 26B model), and you have heard about transformers and dense models, but you have probably never heard about the E here. E2B stands for "effectively 2 billion" parameters. Gemma E2B actually has more parameters, around 4 billion, and it uses a novel architectural component called per-layer embeddings, which we first released in the summer of last year. There's this small block at the bottom of the diagram, and the TL;DR is that there is an embedding per layer, as the name indicates, and it works more as a lookup table than as a computation you need to run. So it's extremely fast, and you don't need to keep it on the GPU: you can keep it in CPU memory or even on disk. This architecture decision is really optimized for on-device, mobile use cases, which is why the smallest models, the ones that run on an Android phone or an iPhone, use this E2B or E4B architecture. Even if the model has 5 billion parameters, you only load 2 billion parameters onto the GPU, and the rest can live in much slower memory, because for those weights you're not doing any of the matrix multiplications you would usually do in a transformer. You can do this with llama.cpp using a simple flag, --override-tensor, to move the per-layer embeddings to CPU or even to disk, and it should work quite well out of the box.
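As a concrete illustration of that offload trick, here is a minimal sketch that launches llama.cpp's CLI with the per-layer-embedding tensors pinned to CPU. The model file name and the tensor-name pattern are assumptions on my part, not an official recipe; inspect your checkpoint's actual tensor names before relying on it.

```python
import subprocess

# Minimal sketch: run a Gemma E2B-style GGUF with llama.cpp, keeping the
# per-layer-embedding tensors in CPU RAM via --override-tensor while the
# regular transformer layers are offloaded to the GPU.
# "gemma-e2b.gguf" and the "per_layer_token_embd" pattern are hypothetical;
# check your GGUF's tensor names (e.g. with a gguf dump tool) to confirm.
subprocess.run([
    "llama-cli",                                         # llama.cpp CLI binary
    "-m", "gemma-e2b.gguf",                              # placeholder model path
    "-ngl", "99",                                        # offload all layers to GPU
    "--override-tensor", r"per_layer_token_embd\.=CPU",  # pin PLE weights to CPU
    "-p", "Write a haiku about on-device models.",
], check=True)
```

The flag takes a pattern-to-device mapping, which is what makes the "slow memory for the lookup tables, fast memory for the math" split cheap to experiment with.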
A couple of other exciting things. The smallest models can do multimodal understanding for images, for videos, and even for audio. So you can do speech recognition, and you can do speech-to-translated-text: I can speak in Spanish and have the text come out in, I don't know, French.

The larger models can do extremely capable multimodal understanding: videos, fine-grained details. I actually have a couple of examples here. For instance, the model can point to where the llama is in a picture, and it can detect different objects in a picture. What's also cool is that this model is heavily multilingual. Gemma 4 was trained on over 140 languages, and it uses a tokenizer based on Gemini's, so pretty much all of the multilingual research that powers Gemini also benefits Gemma. The tokenizer piece is quite interesting, because independently of Gemma's raw capabilities, this tokenizer was designed for multilingual use cases, and we took a lot of care with it. That matters if you want to fine-tune Gemma for a language with few digital resources: say an indigenous language of Peru like Quechua, or one of the official languages of India. You can take the model, use your own data, and train it, and independently of Gemma's raw capabilities, just because of the tokenizer decisions, things tend to work quite well out of the box. You can also mix the multilingual and multimodal capabilities, for example to get a transcription or an explanation of an image containing Japanese text. That's quite cool.

We released the model a week ago. Just yesterday we hit 10 million downloads for the Gemma 4 base models alone. There are over 1,000 models based on Gemma 4 already, quantizations and fine-tunes from the community, and over 500 million downloads of the whole Gemma family. What is very cool for me is that Gemma is not just a model you can use; it's about enabling the ecosystem to build on top of it, and that's what the community has done over the last few days. It was at the top of the trending models on Hugging Face. People have been building cool examples, doing full repository audits with Gemma, putting Gemma on all kinds of devices, and exploring all of its capabilities, which is quite nice.

And all of this is not done just by us; we collaborate with the open source ecosystem. We work with Unsloth, MLX, llama.cpp, Hugging Face, vLLM, SGLang. We want to ensure that when we launch a new model, whether Gemini or Gemma, people can leverage its capabilities out of the box. They should not need to switch to Keras to fine-tune Gemma: if they are fine-tuning with Hugging Face transformers, they should be able to do that. For us it's critical to be where the community is, and that's why I really want to shout out everyone here working in the open source ecosystem, contributing to these tools and maintaining these repositories, because that's what enables the ecosystem to do amazing things.

Another part I like about Gemma is all the product integrations we can do. Android Studio — I don't know if anyone here is an Android developer — has an agent mode where an agent helps you write code, and there's now an offline mode where a llama.cpp-, Ollama-, or vLLM-powered setup lets Gemma help you write code for Android development. We also included some Android-related datasets and benchmarks while training Gemma, so it's actually a very capable model for Android development.
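To make the fine-tuning point from a moment ago concrete, here is a minimal sketch of a supervised fine-tuning run using Hugging Face TRL. The checkpoint id, the dataset file, and the hyperparameters are all placeholders I chose for illustration, not an official recipe:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Minimal sketch: supervised fine-tuning of a Gemma checkpoint on a small
# text corpus. "quechua.jsonl" is a hypothetical dataset with a "text" column;
# swap the model id for whichever Gemma checkpoint you are actually tuning.
dataset = load_dataset("json", data_files="quechua.jsonl", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-1b-it",        # placeholder Gemma checkpoint
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="gemma-quechua-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,   # keep memory usage modest
        num_train_epochs=1,
    ),
)
trainer.train()
```

Because the tokenizer already covers many scripts well, a small, targeted run like this is often enough to get usable behavior, which is the point being made above.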
I talked a bit about how many people are fine-tuning and sharing models, so let me share a bit about the Gemmaverse. This number is actually outdated; it's from last week, and we're now at 500 million downloads, as I mentioned. In total there are over 100,000 Gemma-based models. Maybe Gemma works great for you out of the box, but maybe you want to improve its capabilities, or change the style in which the model talks to users. Maybe you don't want a conversational model at all; you just want a model that predicts a certain thing in your own context. Or maybe you just have too many GPUs at home and want to burn them. Whatever your reason, you can fine-tune these models for many cool things.

Google has released a couple of what we call official Gemma variants. We did ShieldGemma, a family of guardrail models. Those are great for production use cases where you don't want users to submit, say, toxic images or toxic text that violates the policies you have set up; ShieldGemma is the family of models that lets you do that. There are also other kinds of use cases. For medical work, for example, we released MedGemma, a multimodal Gemma 3-based model for different medical tasks: radiology, chest X-ray understanding, and a bunch of other things. And again, these are open models; you can use them, and you can fine-tune them further if you have an even more niche use case.

That's what Google has done, but the community is also doing cool things. For example, AI Singapore is a group training models for Southeast Asian languages; there are a lot of those languages, and they have been doing quite a bit of research with open models to push the state of the art in multilinguality even further. Another example is Sarvam: India has many official languages, and there's an effort by the government, investing in a couple of big startups, to train national models. That's more the sovereign-AI, official-languages angle, but people are doing very interesting work on the multilingual side.

Apart from that, there's quite a bit of other cool research happening. There was a paper we released in December of last year in which researchers from DeepMind used Gemma 3 to propose cancer therapy pathways; the proposals were taken to an actual lab, which was able to validate that the pathways proposed by this Gemma-based model led to real, verifiable results. That was quite exciting, because it's not just about having an assistant to chat with, or doing role-playing and whatnot. It's also about building models that can be used for real things that help the community in many different ways.
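Going back to ShieldGemma for a second: a guardrail model like that is typically a small classifier you call before your main model sees the input. Here is a minimal sketch with the Hugging Face pipeline API; the model id and especially the Yes/No prompt template are assumptions of mine, so follow the model card for the exact format the checkpoint expects:

```python
from transformers import pipeline

# Minimal sketch: screen a user message with a ShieldGemma-style guardrail
# model before forwarding it to your main model. The model id and the Yes/No
# prompt format are assumptions; use the model card's real template.
guard = pipeline("text-generation", model="google/shieldgemma-2b")

user_message = "some user input to screen"
prompt = (
    "You are a policy expert deciding whether a user prompt violates a policy.\n"
    "Policy: no harassment, hate speech, or dangerous content.\n"
    f"User prompt: {user_message}\n"
    "Does the prompt violate the policy? Answer Yes or No."
)

verdict = guard(prompt, max_new_tokens=4, return_full_text=False)[0]["generated_text"]
if verdict.strip().lower().startswith("yes"):
    print("Blocked by policy.")
else:
    print("OK to forward to the main model.")
```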
That might be finance, or legal reviews, or offline use cases where you don't want your data to leave your servers; offline modes when you're in the subway or on an airplane and need to use AI for something; a Chrome extension with Gemma inside that helps you understand what's on your screen; or on-device control. The open models are getting there. And for me that's quite exciting, because if you compare where we are now with where we were one or two years ago, open models can now do very cool, very interesting, highly agentic, complex tasks entirely on device, entirely on your phone. So I really recommend that all of you spend an hour in the next two weeks just playing with the latest open models and trying to understand their capabilities. Of course, there are many things for which you'll want an API-based model; if you want the most raw intelligence, you'll go use Gemini or your model of choice. But if you want things on device, there are many exciting things you can already do. What excites me most is that I don't know how things will look 6 or 12 months from now, but I think we're heading in a very exciting direction, where people will have extremely capable open models on their own devices, customized for their own use cases with their own data. So yeah: please try the models, build something, and share it. All right, thank you.