Gemma 4 Deep Dive — Cassidy Hardin, Researcher, Google DeepMind
Channel: aiDotEngineer
Published at: 2026-04-27
YouTube video id: _A367W_qvc8
Source: https://www.youtube.com/watch?v=_A367W_qvc8
Hi everyone. My name is Cassidy and I'm a researcher at Google DeepMind. Today I'm really excited to share some of the technical improvements and the architecture behind Gemma 4. Last week we launched Gemma 4, the latest addition to our family of open-source models. Gemma 4 brings incredible improvements at a scale that hasn't been seen before: a family of very small models whose performance sets a new precedent for what's possible with small open-source models.

Gemma 4 comes in four sizes. We have two smaller effective models geared towards on-device applications; these have been adapted and improved to provide incredible performance at a small scale and can run locally on phones, iPads, and laptops. We have two larger models, starting with a 26B mixture-of-experts model, the first ever Gemma MoE, which has been adapted to deliver incredible performance while only requiring 3.9 billion active parameters. Our largest model is the 31B dense model, a huge improvement on what existed in Gemma 3 and a new precedent for performance. Looking at the larger models, the 31B and the 26B have both ranked in the top six of all open-source models on LM Arena.

One of the most exciting changes launched alongside Gemma 4 is the move to an Apache 2.0 license. This was done deliberately to make our models more accessible to the everyday developer: you should easily be able to integrate Gemma into your development lifecycle, from initial testing all the way to deployment, and build within the Gemma universe.

Now, let's take a look at what each of these models is and some of the use cases we've adapted them for. Starting with the 31B dense model: this is a state-of-the-art multimodal model purpose-built for advanced reasoning. It ranked number three on the global arena leaderboard, outperforming models over 20 times its size. The 31B has a 256K context length and has been purpose-built for autonomous workflows, with native support for thinking, function calling, and structured JSON output.

We also have the slightly smaller 26B, the first mixture-of-experts model in the Gemma family. Only requiring 3.8 billion parameters during any forward pass, this model is small and efficient. It uses a total of 128 experts while only activating eight during any inference step, which keeps it efficient to run while maintaining much of the incredible performance we saw with the 31B.

On the smaller side, we introduced two effective models geared towards on-device applications, with the additional support of audio. These take vision, text, and audio as input while remaining text-only output models; we have the effective 4B and, similarly, the effective 2B. Across a variety of benchmarks spanning agentic capabilities, coding, multimodal, and multilingual tasks, these models truly set a new frontier for what's possible with Gemma, significantly outperforming everything in the Gemma 3 family. Now, let's take a look at what's actually new in Gemma 4 and how we've been able to achieve this performance.
Starting on the architecture side, we have our standard dense models: the 31B as well as the smaller effective 2B and 4B. These use a standard decoder block, and with Gemma 4 we've made several improvements within attention. We've introduced a 5:1 ratio of interleaved local-to-global layers, with the smaller effective 2B using a 4:1 ratio. Within the local layers, a sliding window limits how many tokens we attend to, and we now ensure that the last layer is always a global layer, meaning the final layer attends to all preceding tokens. In practice, the global layers attend to every preceding token, whereas the local layers only attend to a fixed number of preceding tokens: in our smaller models the sliding window is 512 tokens, while in our larger models it is 1,024 tokens. The sliding window provides significant efficiency gains in the local layers while still passing information through to later layers. However, the global layers remain quite expensive: despite the interleaving, every global layer still attends to all preceding tokens, which makes it memory-intensive and costly to run.

This is where we introduced grouped-query attention. Within the local layers, two queries share the same key and value head, while in the global layers eight queries share the same key and value head. Since reducing the number of key and value heads can have a big impact on quality, we've doubled the key/value head dimension in the global layers to 512, compared to 256 in the local layers. Grouped-query attention has provided significant performance improvements without a massive increase in memory or serving cost, and the 8:1 ratio of queries to key/value heads in the global layers, used across all of our models, has made those layers much more efficient.

These attention changes are present across all of our models, but we also introduced a new architecture with Gemma 4: our MoE. The MoE has one shared expert plus a router over 128 experts, with eight experts activated on each forward pass. All of these experts are small feed-forward networks. Relative to the dense architecture, we've replaced the standard feed-forward network with this MoE block. In practice, the shared expert is activated on every pass of the model and is three times the size of a regular expert; we then have 128 routed experts, eight of which are selected by the router on each pass.
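To make the attention changes above concrete, here is a minimal NumPy sketch of the interleaved layer pattern, the sliding-window mask, and grouped-query attention. This is an illustration rather than the actual Gemma 4 implementation: the helper names, the toy dimensions, and the choice of a single shared head dimension for queries and keys are assumptions.

```python
# Minimal sketch of the attention pattern described above: a 5:1 interleaving
# of local (sliding-window) and global layers, the last layer forced global,
# plus grouped-query attention where several query heads share one KV head.
import numpy as np

def layer_types(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Five local layers for every global one; the last layer is always global."""
    kinds = ["global" if (i + 1) % (local_per_global + 1) == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"
    return kinds

def attention_mask(seq_len: int, kind: str, window: int = 1024) -> np.ndarray:
    """Causal mask; local layers additionally restrict to a sliding window."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if kind == "local":
        causal &= (i - j) < window    # only attend to the last `window` tokens
    return causal

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads, mask):
    """Each group of (n_q_heads // n_kv_heads) query heads shares one KV head."""
    t, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads
    q = (x @ wq).reshape(t, n_q_heads, head_dim)
    k = (x @ wk).reshape(t, n_kv_heads, head_dim)
    v = (x @ wv).reshape(t, n_kv_heads, head_dim)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)   # share each KV head across its query group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    scores = np.where(mask[None], scores, -1e30)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", probs, v)
    return out.reshape(t, n_q_heads * head_dim)

# Tiny demo: a "global" layer over 12 tokens, 8 query heads sharing 1 KV head.
print(layer_types(12))                                      # 5 local, then global, ...
T, D = 12, 32
x = np.random.default_rng(0).normal(size=(T, D))
wq = np.random.default_rng(1).normal(0, 0.1, (D, 8 * 4))    # 8 query heads of dim 4
wk = np.random.default_rng(2).normal(0, 0.1, (D, 1 * 4))    # 1 KV head of dim 4
wv = np.random.default_rng(3).normal(0, 0.1, (D, 1 * 4))
mask = attention_mask(T, kind="global")
print(grouped_query_attention(x, wq, wk, wv, 8, 1, mask).shape)  # (12, 32)
```

The practical payoff of the sliding window is that a local layer never attends beyond its last 1,024 (or 512) positions, which is what keeps those layers cheap even at a 256K context.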
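And here is a similarly minimal sketch of the MoE block: one shared expert that runs on every token, plus a router that activates eight of 128 small feed-forward experts. The softmax gating over the selected experts and the simple sum with the shared expert's output are assumptions for illustration; the talk doesn't specify how the two are combined.

```python
# Minimal sketch of the MoE block described above: a shared expert on every
# token (three times the size of a routed expert, per the talk), plus a router
# that picks the top 8 of 128 small feed-forward experts.
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS, TOP_K = 64, 128, 128, 8

def ffn(x, w_in, w_out):
    """A small feed-forward expert: up-projection, tanh-GELU, down-projection."""
    h = x @ w_in
    h = 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))
    return h @ w_out

experts = [(rng.normal(0, 0.02, (D_MODEL, D_FF)),
            rng.normal(0, 0.02, (D_FF, D_MODEL))) for _ in range(N_EXPERTS)]
shared = (rng.normal(0, 0.02, (D_MODEL, 3 * D_FF)),       # shared expert, 3x wider
          rng.normal(0, 0.02, (3 * D_FF, D_MODEL)))
w_router = rng.normal(0, 0.02, (D_MODEL, N_EXPERTS))

def moe_block(x):
    """x: (tokens, D_MODEL) -> (tokens, D_MODEL)."""
    out = ffn(x, *shared)                          # shared expert runs on every token
    logits = x @ w_router                          # router scores per expert
    top = np.argsort(-logits, axis=-1)[:, :TOP_K]  # pick 8 experts per token
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                       # softmax over the chosen 8
        for gate, idx in zip(gates, top[t]):
            out[t] += gate * ffn(x[t:t+1], *experts[idx])[0]
    return out

tokens = rng.normal(size=(4, D_MODEL))
print(moe_block(tokens).shape)  # (4, 64)
```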
Lastly, on the architecture side, we get to our dense effective models. What does it mean when we say a model is effectively 2B or effectively 4B? Here we're distinguishing between the number of parameters required to operate the model and the total number of representational parameters present: our E2B effectively uses 2.3 billion parameters while holding 5.1 billion representational parameters.

These models have been specifically designed and optimized for the best on-device performance. They're meant to run on phones and laptops without requiring expensive API calls to models served somewhere else in the world. These advancements were made possible through PLE, our per-layer embeddings, where each layer now has a dedicated embedding table. Before digging into how the per-layer embedding table looks, let's review how the standard token embedding layer works. This hasn't been replaced in our effective models: we still have the standard embedding table mapping a token ID to its embedding vector. In the E2B the embedding vector size is 1,536, and in the larger E4B it's 2,560. This is where the vector embedding of each token is stored. Now, we also have a per-layer embedding table. Like the standard table, it holds an embedding for every token in the vocabulary, but there is one such table for each layer.

The big advancement here is that we store the PLE, the per-layer embedding table, in flash memory as opposed to VRAM. VRAM is one of the largest constraints on-device; it's where you quickly run out of memory on phones and laptops. Because this additional embedding table no longer needs to sit in VRAM and can live in flash memory, we get significant improvements on the inference side without the expensive cost of the extra storage and memory. The big difference between the PLE table and the standard embedding table is that its embedding dimension is only 256, significantly smaller than the full model dimension. For each token, for example "hi", we now have an embedding at each separate layer of the model, and as you progress through the layers, 35 for the E2B and 42 for the E4B, the embedding representation of each token is refined at each successive layer. In practice, at the end of each decoder block we look up the per-layer embedding for each token: we fetch the 256-dimensional vector and project it up to the full embedding size expected by the model. Ultimately, these improvements with PLE allow the E2B and E4B to significantly outperform prior generations of small Gemma models.
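Here is a minimal sketch of how a per-layer embedding lookup like the one described above can work, using the E2B numbers from the talk (model dimension 1,536, 35 layers, a 256-dimensional per-layer table). The toy vocabulary size and the choice to add the projected vector into the residual stream at the end of each block are assumptions.

```python
# Minimal sketch of per-layer embeddings (PLE): a standard token embedding in
# the model dimension, plus a small 256-dim table per layer that is looked up
# per token at the end of each decoder block and projected up to the model
# dimension. In the real models the per-layer tables can live in flash rather
# than VRAM; the additive combination below is an assumption.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_LAYERS, D_PLE = 1000, 1536, 35, 256  # toy vocabulary size

token_embed = rng.normal(0, 0.02, (VOCAB, D_MODEL))           # standard table, kept in VRAM
ple_table = rng.normal(0, 0.02, (N_LAYERS, VOCAB, D_PLE))     # per-layer table, can live in flash
ple_proj = rng.normal(0, 0.02, (N_LAYERS, D_PLE, D_MODEL))    # 256 -> model dim, per layer

def decoder_block(h, layer):
    """Stand-in for attention + feed-forward; identity to keep the sketch short."""
    return h

def forward(token_ids):
    h = token_embed[token_ids]                      # (seq, D_MODEL)
    for layer in range(N_LAYERS):
        h = decoder_block(h, layer)
        ple = ple_table[layer, token_ids]           # (seq, 256), looked up per token and layer
        h = h + ple @ ple_proj[layer]               # project up and fold into the residual
    return h

print(forward(np.array([1, 2, 3])).shape)  # (3, 1536)
```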
In addition to everything we've done on the architecture side, we've also set a new frontier for what's possible with multimodality. In Gemma 3, we introduced vision for the first time, adding support across all of our sizes, and with the Gemma 3n model we launched an open audio, vision, and text model for the first time. That paved the way for Gemma 4 being natively multimodal: multimodality has been integrated from the very beginning across all of these models, with quite incredible performance. Starting on the vision side, our 31B and 26B models both use a 550 million parameter vision encoder, while the effective 2B and effective 4B use a smaller, more compact encoder of 150 million parameters.

We've made big strides from Gemma 3 with the introduction of variable aspect ratios and variable resolutions. What this means in practice for you as developers is that when you run a Gemma model, you can now choose the resolution and the soft-token budget you want to allocate to images. This is available in five different resolutions across all of our models.

Let's take a second to revisit how the vision encoder works. We take an initial image and split it into patches of 16 by 16 pixels. These patches are flattened and linearly projected into patch embeddings, which are then transformed to account for their positional encodings, and that is what's ultimately shown to the model.

With this in mind, let's look back at variable aspect ratios. We can have images with different aspect ratios, say one of 4 by 2 and one of 3 by 3. What's important to understand is that patch four now sits in a completely different position in each image. If we encode both images the same way, the model needs to know that patch four in the image on the left is in the second row, right below the first patch, whereas patch four in the image on the right is at the very end. The critical development here is ensuring that the spatial positional encoding is also passed through to the model.

In addition to variable aspect ratios, we've also introduced variable resolution. Two images may both have a 3-to-2 aspect ratio yet very different resolutions, a higher resolution on the right and a lower one on the left. This is where variable resolution comes in: you select how many tokens you want to allocate to each image. This is critical because it lets you decide how much of your token budget is spent on images. For tasks such as OCR or spatial object recognition, you want to allocate a much higher budget to ensure you're processing high-quality, high-resolution images. If your application is purely text-based and you're not using the multimodal capabilities, you can allocate a much smaller token budget.

What's important to understand is that this is a huge improvement over Gemma 3. In Gemma 3, we had to introduce pan-and-scan: given an image of arbitrary resolution and aspect ratio, we split it into a sequence of squares and padded whatever else was necessary, so a single image was passed to the model as two, three, or four separate images, each processed sequentially. Now, with variable resolution and variable aspect ratios, we can process images with a varying number of patches depending on the actual image the user provides.

The next thing to understand is how we take these patches and pass them to the model. We take 3-by-3 grids of patches and pool them into a single embedding, and that single embedding is passed forward to the model. So if you're running a Gemma model with a token budget of 280, that actually equates to 2,520 patches constructed from your image.
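To tie the vision numbers together, here is a minimal sketch of splitting an image into 16-by-16-pixel patches, projecting and positionally encoding them, and pooling 3-by-3 groups into soft tokens, so N patches become N/9 soft tokens (a 280 soft-token budget therefore corresponds to 2,520 patches). The encoder dimension and the use of mean pooling are assumptions for illustration.

```python
# Minimal sketch of the vision path described above: 16x16-pixel patches,
# flatten + linear projection into patch embeddings with a 2-D positional
# encoding, then 3x3 groups of patches pooled into single soft tokens
# (N patches -> N/9 soft tokens). Dimensions and mean pooling are illustrative.
import numpy as np

rng = np.random.default_rng(0)
PATCH, D_ENC = 16, 256

def patchify(image):
    """image: (H, W, 3) with H and W multiples of 16 -> (rows, cols, 16*16*3)."""
    h, w, c = image.shape
    rows, cols = h // PATCH, w // PATCH
    patches = image.reshape(rows, PATCH, cols, PATCH, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows, cols, PATCH * PATCH * c)

def encode(image, w_patch, pos_embed):
    flat = patchify(image)
    rows, cols, _ = flat.shape
    emb = flat @ w_patch                                     # linear projection to patch embeddings
    emb += pos_embed[:rows, :cols]                           # 2-D positional encoding keeps row/column info
    pooled = emb.reshape(rows // 3, 3, cols // 3, 3, D_ENC).mean(axis=(1, 3))
    return pooled.reshape(-1, D_ENC)                         # (N/9 soft tokens, D_ENC)

image = rng.random((384, 336, 3))                            # 24 x 21 = 504 patches
w_patch = rng.normal(0, 0.02, (PATCH * PATCH * 3, D_ENC))
pos_embed = rng.normal(0, 0.02, (64, 64, D_ENC))
soft_tokens = encode(image, w_patch, pos_embed)
print(soft_tokens.shape)                                     # (56, 256): 504 patches / 9
```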
Anything that doesn't divide evenly into 16-by-16-pixel patches gets additional padding. As an example, for square images across the five different soft-token budgets and resolutions supported in Gemma, here are the resolutions, patch counts, and pooled embeddings that exist at each of these sizes. This is where you really see that if you're doing something such as object detection or OCR, you want to run with a higher resolution and token budget, and this is where the 560 and 1,120 options are available for these use cases. Pulling this all together: we start with an image of N patches; those patches become N patch embeddings; we pool them to produce N/9 soft tokens; and those soft tokens are linearly projected up to what is actually shown to the model.

Now for the audio side. Audio has been added to the E2B and E4B with the goal of supporting translation and speech recognition. This is made possible through the combination of an audio tokenizer and a conformer: a 305 million parameter conformer that acts as the audio encoder, processing embeddings rather than tokens. On the audio tokenizer side, we start with raw audio, which is run through a mel spectrogram to extract features from the raw audio files. The mel spectrogram is then split into N mel chunks, which are downsampled through two convolutional layers, ultimately outputting N/4 soft tokens. These audio embeddings are what's passed forward into the conformer. The conformer follows a similar architecture to what we've already seen in the dense and MoE models, although this time we add in a convolutional layer.

Returning to where we started today: the Gemma family has four models, which we released externally last week. The two smaller models are geared towards on-device applications with support for text, vision, and audio, while the two larger models are designed for more complex reasoning tasks with support for agentic workflows and coding. And that brings us to what you can do with Gemma today and how to get started. There are two main options. You can download and self-host all of our models, which are available across Hugging Face, Kaggle, and Ollama. We also have cloud-hosted options for the larger models, the 31B and the 26B, accessible through AI Studio and Vertex; this is what lets you immediately get started with prototyping, building agentic workflows, and testing out function calling with these larger models. Thank you, and I'm happy to answer any questions about Gemma.
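As a closing illustration of the audio path described above, here is a minimal sketch of the downsampling step: mel-spectrogram frames pass through two strided convolutions, so N frames become N/4 soft tokens before reaching the conformer. The filter counts, kernel widths, and the toy mel front-end are assumptions rather than the actual Gemma 4 configuration.

```python
# Minimal sketch of the audio tokenizer described above: mel-spectrogram
# frames are downsampled through two strided 1-D convolutions (each halving
# the sequence length, so N frames become N/4 soft tokens) before being
# handed to the conformer encoder. All sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_MEL, D_AUDIO = 80, 256

def conv1d(x, w, stride):
    """x: (T, C_in), w: (kernel, C_in, C_out); 'same' padding, then strided ReLU."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([xp[t:t + k].reshape(-1) @ w.reshape(-1, w.shape[-1])
                    for t in range(x.shape[0])])
    return np.maximum(out[::stride], 0.0)

def audio_tokenizer(mel_frames):
    """mel_frames: (N, N_MEL) -> (N/4, D_AUDIO) soft tokens for the conformer."""
    w1 = rng.normal(0, 0.02, (3, N_MEL, D_AUDIO))
    w2 = rng.normal(0, 0.02, (3, D_AUDIO, D_AUDIO))
    h = conv1d(mel_frames, w1, stride=2)                 # N   -> N/2
    return conv1d(h, w2, stride=2)                       # N/2 -> N/4

mel = rng.random((128, N_MEL))                           # stand-in for a real mel spectrogram
soft_tokens = audio_tokenizer(mel)
print(soft_tokens.shape)                                 # (32, 256): 128 frames / 4
```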