Gemma 4 Deep Dive — Cassidy Hardin, Researcher, Google DeepMind
Channel: aiDotEngineer
Published at: 2026-04-27
YouTube video id: _A367W_qvc8
Source: https://www.youtube.com/watch?v=_A367W_qvc8
Hi everyone. My name is Cassidy and I'm a researcher at Google DeepMind. Today I'm really excited to share some of the technical improvements and the architecture behind Gemma 4. Last week we launched Gemma 4, the latest addition to our family of open-source models. Gemma 4 brings incredible improvements at a scale that hasn't been seen before: a family of very small models whose performance sets a new precedent for what's possible with small open-source models.

Gemma 4 comes in four sizes. We have two smaller effective models geared towards on-device applications; these have been adapted and improved to provide incredible performance at a small scale and can run locally on phones, iPads, and laptops. We have two larger models, starting with a 26B mixture-of-experts model, the first ever Gemma MoE, which has been adapted to deliver incredible performance while only requiring 3.9 billion active parameters. Our largest model is the 31B dense model, a huge improvement on what existed in Gemma 3 and a new precedent for performance. Looking at the larger models, the 31B and the 26B have both ranked in the top six of all open-source models on LM Arena.

One of the most exciting changes launched alongside Gemma 4 is the move to an Apache 2.0 license. This was done deliberately to make our models more accessible to the everyday developer: you should easily be able to integrate Gemma into your development lifecycle, from initial testing all the way to deployment, and build within the Gemma universe.

Now, let's take a look at what each of these models is and some of the use cases we've adapted them for. Starting with the 31B dense model: this is a state-of-the-art multimodal model purpose-built for advanced reasoning. It ranked number three on the global arena leaderboard, outperforming models over 20 times its size. The 31B has a 256K context length and has been purpose-built for autonomous workflows, with native support for thinking, function calling, and structured JSON output.

We also have the slightly smaller 26B, the first mixture-of-experts model in the Gemma family. Only requiring 3.8 billion parameters during any forward pass, this model is small and efficient. It uses a total of 128 experts while only activating eight during any inference step, which keeps it efficient to run while maintaining much of the incredible performance we saw with the 31B.

On the smaller side, we introduced two effective models geared towards on-device applications, with the additional support of audio. These take vision, text, and audio as input while remaining text-only output models; we have the effective 4B and, similarly, the effective 2B. Across a variety of benchmarks spanning agentic capabilities, coding, multimodal, and multilingual tasks, these models truly set a new frontier for what's possible with Gemma, significantly outperforming everything in the Gemma 3 family. Now, let's take a look at what's actually new in Gemma 4 and how we've been able to achieve this performance.
Starting on the architecture side, we have our standard dense models: the 31B as well as the smaller effective 2B and 4B. These use a standard decoder block, and with Gemma 4 we've made several improvements within attention. We've introduced a 5:1 ratio of interleaved local-to-global layers, with the smaller effective 2B using a 4:1 ratio. Within the local layers, a sliding window limits how many tokens we attend to, and we now ensure that the last layer is always a global layer, meaning the final layer attends to all preceding tokens. In practice, the global layers attend to every preceding token, whereas the local layers only attend to a fixed number of preceding tokens: in our smaller models the sliding window is 512 tokens, while in our larger models it is 1,024 tokens. The sliding window provides significant efficiency gains in the local layers while still passing information through to later layers. However, the global layers remain quite expensive: despite the interleaving, every global layer still attends to all preceding tokens, which makes it memory-intensive and costly to run.

This is where we introduced grouped-query attention. Within the local layers, two queries share the same key and value head, while in the global layers eight queries share the same key and value head. Since reducing the number of key and value heads can have a big impact on quality, we've doubled the key/value head dimension in the global layers to 512, compared to 256 in the local layers. Grouped-query attention has provided significant performance improvements without a massive increase in memory or serving cost, and the 8:1 ratio of queries to key/value heads in the global layers, used across all of our models, has made those layers much more efficient.

These attention changes are present across all of our models, but we also introduced a new architecture with Gemma 4: our MoE. The MoE has one shared expert plus a router over 128 experts, with eight experts activated on each forward pass. All of these experts are small feed-forward networks. Relative to the dense architecture, we've replaced the standard feed-forward network with this MoE block. In practice, the shared expert is activated on every pass of the model and is three times the size of a regular expert; we then have 128 routed experts, eight of which are selected by the router on each pass.
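To make the attention changes above concrete, here is a minimal NumPy sketch of the interleaved layer pattern, the sliding-window mask, and grouped-query attention. This is an illustration rather than the actual Gemma 4 implementation: the helper names, the toy dimensions, and the choice of a single shared head dimension for queries and keys are assumptions.

```python
# Minimal sketch of the attention pattern described above: a 5:1 interleaving
# of local (sliding-window) and global layers, the last layer forced global,
# plus grouped-query attention where several query heads share one KV head.
import numpy as np

def layer_types(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Five local layers for every global one; the last layer is always global."""
    kinds = ["global" if (i + 1) % (local_per_global + 1) == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"
    return kinds

def attention_mask(seq_len: int, kind: str, window: int = 1024) -> np.ndarray:
    """Causal mask; local layers additionally restrict to a sliding window."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if kind == "local":
        causal &= (i - j) < window    # only attend to the last `window` tokens
    return causal

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads, mask):
    """Each group of (n_q_heads // n_kv_heads) query heads shares one KV head."""
    t, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads
    q = (x @ wq).reshape(t, n_q_heads, head_dim)
    k = (x @ wk).reshape(t, n_kv_heads, head_dim)
    v = (x @ wv).reshape(t, n_kv_heads, head_dim)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)   # share each KV head across its query group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    scores = np.where(mask[None], scores, -1e30)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", probs, v)
    return out.reshape(t, n_q_heads * head_dim)

# Tiny demo: a "global" layer over 12 tokens, 8 query heads sharing 1 KV head.
print(layer_types(12))                                      # 5 local, then global, ...
T, D = 12, 32
x = np.random.default_rng(0).normal(size=(T, D))
wq = np.random.default_rng(1).normal(0, 0.1, (D, 8 * 4))    # 8 query heads of dim 4
wk = np.random.default_rng(2).normal(0, 0.1, (D, 1 * 4))    # 1 KV head of dim 4
wv = np.random.default_rng(3).normal(0, 0.1, (D, 1 * 4))
mask = attention_mask(T, kind="global")
print(grouped_query_attention(x, wq, wk, wv, 8, 1, mask).shape)  # (12, 32)
```

The practical payoff of the sliding window is that a local layer never attends beyond its last 1,024 (or 512) positions, which is what keeps those layers cheap even at a 256K context.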
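And here is a similarly minimal sketch of the MoE block: one shared expert that runs on every token, plus a router that activates eight of 128 small feed-forward experts. The softmax gating over the selected experts and the simple sum with the shared expert's output are assumptions for illustration; the talk doesn't specify how the two are combined.

```python
# Minimal sketch of the MoE block described above: a shared expert on every
# token (three times the size of a routed expert, per the talk), plus a router
# that picks the top 8 of 128 small feed-forward experts.
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS, TOP_K = 64, 128, 128, 8

def ffn(x, w_in, w_out):
    """A small feed-forward expert: up-projection, tanh-GELU, down-projection."""
    h = x @ w_in
    h = 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))
    return h @ w_out

experts = [(rng.normal(0, 0.02, (D_MODEL, D_FF)),
            rng.normal(0, 0.02, (D_FF, D_MODEL))) for _ in range(N_EXPERTS)]
shared = (rng.normal(0, 0.02, (D_MODEL, 3 * D_FF)),       # shared expert, 3x wider
          rng.normal(0, 0.02, (3 * D_FF, D_MODEL)))
w_router = rng.normal(0, 0.02, (D_MODEL, N_EXPERTS))

def moe_block(x):
    """x: (tokens, D_MODEL) -> (tokens, D_MODEL)."""
    out = ffn(x, *shared)                          # shared expert runs on every token
    logits = x @ w_router                          # router scores per expert
    top = np.argsort(-logits, axis=-1)[:, :TOP_K]  # pick 8 experts per token
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                       # softmax over the chosen 8
        for gate, idx in zip(gates, top[t]):
            out[t] += gate * ffn(x[t:t+1], *experts[idx])[0]
    return out

tokens = rng.normal(size=(4, D_MODEL))
print(moe_block(tokens).shape)  # (4, 64)
```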
Lastly, on the architecture side, we get to our dense effective models. What does it mean when we say a model is effectively 2B or effectively 4B? Here we're distinguishing between the number of parameters required to operate the model and the total number of representational parameters present: our E2B effectively uses 2.3 billion parameters while holding 5.1 billion representational parameters.

These models have been specifically designed and optimized for the best on-device performance. They're meant to run on phones and laptops without requiring expensive API calls to models served somewhere else in the world. These advancements were made possible through PLE, our per-layer embeddings, where each layer now has a dedicated embedding table. Before digging into how the per-layer embedding table looks, let's review how the standard token embedding layer works. This hasn't been replaced in our effective models: we still have the standard embedding table mapping a token ID to its embedding vector. In the E2B the embedding vector size is 1,536, and in the larger E4B it's 2,560. This is where the vector embedding of each token is stored. Now, we also have a per-layer embedding table. Like the standard table, it holds an embedding for every token in the vocabulary, but there is one such table for each layer.

The big advancement here is that we store the PLE, the per-layer embedding table, in flash memory as opposed to VRAM. VRAM is one of the largest constraints on-device; it's where you quickly run out of memory on phones and laptops. Because this additional embedding table no longer needs to sit in VRAM and can live in flash memory, we get significant improvements on the inference side without the expensive cost of the extra storage and memory. The big difference between the PLE table and the standard embedding table is that its embedding dimension is only 256, significantly smaller than the full model dimension. For each token, for example "hi", we now have an embedding at each separate layer of the model, and as you progress through the layers, 35 for the E2B and 42 for the E4B, the embedding representation of each token is refined at each successive layer. In practice, at the end of each decoder block we look up the per-layer embedding for each token: we fetch the 256-dimensional vector and project it up to the full embedding size expected by the model. Ultimately, these improvements with PLE allow the E2B and E4B to significantly outperform prior generations of small Gemma models.
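Here is a minimal sketch of how a per-layer embedding lookup like the one described above can work, using the E2B numbers from the talk (model dimension 1,536, 35 layers, a 256-dimensional per-layer table). The toy vocabulary size and the choice to add the projected vector into the residual stream at the end of each block are assumptions.

```python
# Minimal sketch of per-layer embeddings (PLE): a standard token embedding in
# the model dimension, plus a small 256-dim table per layer that is looked up
# per token at the end of each decoder block and projected up to the model
# dimension. In the real models the per-layer tables can live in flash rather
# than VRAM; the additive combination below is an assumption.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_LAYERS, D_PLE = 1000, 1536, 35, 256  # toy vocabulary size

token_embed = rng.normal(0, 0.02, (VOCAB, D_MODEL))           # standard table, kept in VRAM
ple_table = rng.normal(0, 0.02, (N_LAYERS, VOCAB, D_PLE))     # per-layer table, can live in flash
ple_proj = rng.normal(0, 0.02, (N_LAYERS, D_PLE, D_MODEL))    # 256 -> model dim, per layer

def decoder_block(h, layer):
    """Stand-in for attention + feed-forward; identity to keep the sketch short."""
    return h

def forward(token_ids):
    h = token_embed[token_ids]                      # (seq, D_MODEL)
    for layer in range(N_LAYERS):
        h = decoder_block(h, layer)
        ple = ple_table[layer, token_ids]           # (seq, 256), looked up per token and layer
        h = h + ple @ ple_proj[layer]               # project up and fold into the residual
    return h

print(forward(np.array([1, 2, 3])).shape)  # (3, 1536)
```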
In addition to everything we've done on the architecture side, we've also set a new frontier for what's possible with multimodality. In Gemma 3, we introduced vision for the first time, adding support across all of our sizes, and with the Gemma 3n model we launched an open audio, vision, and text model for the first time. That paved the way for Gemma 4 being natively multimodal: multimodality has been integrated from the very beginning across all of these models, with quite incredible performance. Starting on the vision side, our 31B and 26B models both use a 550 million parameter vision encoder, while the effective 2B and effective 4B use a smaller, more compact encoder of 150 million parameters.

We've made big strides from Gemma 3 with the introduction of variable aspect ratios and variable resolutions. What this means in practice for you as developers is that when you run a Gemma model, you can now choose the resolution and the soft-token budget you want to allocate to images. This is available in five different resolutions across all of our models.

Let's take a second to revisit how the vision encoder works. We take an initial image and split it into patches of 16 by 16 pixels. These patches are flattened and linearly projected into patch embeddings, which are then transformed to account for their positional encodings, and that is what's ultimately shown to the model.

With this in mind, let's look back at variable aspect ratios. We can have images with different aspect ratios, say one of 4 by 2 and one of 3 by 3. What's important to understand is that patch four now sits in a completely different position in each image. If we encode both images the same way, the model needs to know that patch four in the image on the left is in the second row, right below the first patch, whereas patch four in the image on the right is at the very end. The critical development here is ensuring that the spatial positional encoding is also passed through to the model.

In addition to variable aspect ratios, we've also introduced variable resolution. Two images may both have a 3-to-2 aspect ratio yet very different resolutions, a higher resolution on the right and a lower one on the left. This is where variable resolution comes in: you select how many tokens you want to allocate to each image. This is critical because it lets you decide how much of your token budget is spent on images. For tasks such as OCR or spatial object recognition, you want to allocate a much higher budget to ensure you're processing high-quality, high-resolution images. If your application is purely text-based and you're not using the multimodal capabilities, you can allocate a much smaller token budget.

What's important to understand is that this is a huge improvement over Gemma 3. In Gemma 3, we had to introduce pan-and-scan: given an image of arbitrary resolution and aspect ratio, we split it into a sequence of squares and padded whatever else was necessary, so a single image was passed to the model as two, three, or four separate images, each processed sequentially. Now, with variable resolution and variable aspect ratios, we can process images with a varying number of patches depending on the actual image the user provides.

The next thing to understand is how we take these patches and pass them to the model. We take 3-by-3 grids of patches and pool them into a single embedding, and that single embedding is passed forward to the model. So if you're running a Gemma model with a token budget of 280, that actually equates to 2,520 patches constructed from your image.
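To tie the vision numbers together, here is a minimal sketch of splitting an image into 16-by-16-pixel patches, projecting and positionally encoding them, and pooling 3-by-3 groups into soft tokens, so N patches become N/9 soft tokens (a 280 soft-token budget therefore corresponds to 2,520 patches). The encoder dimension and the use of mean pooling are assumptions for illustration.

```python
# Minimal sketch of the vision path described above: 16x16-pixel patches,
# flatten + linear projection into patch embeddings with a 2-D positional
# encoding, then 3x3 groups of patches pooled into single soft tokens
# (N patches -> N/9 soft tokens). Dimensions and mean pooling are illustrative.
import numpy as np

rng = np.random.default_rng(0)
PATCH, D_ENC = 16, 256

def patchify(image):
    """image: (H, W, 3) with H and W multiples of 16 -> (rows, cols, 16*16*3)."""
    h, w, c = image.shape
    rows, cols = h // PATCH, w // PATCH
    patches = image.reshape(rows, PATCH, cols, PATCH, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows, cols, PATCH * PATCH * c)

def encode(image, w_patch, pos_embed):
    flat = patchify(image)
    rows, cols, _ = flat.shape
    emb = flat @ w_patch                                     # linear projection to patch embeddings
    emb += pos_embed[:rows, :cols]                           # 2-D positional encoding keeps row/column info
    pooled = emb.reshape(rows // 3, 3, cols // 3, 3, D_ENC).mean(axis=(1, 3))
    return pooled.reshape(-1, D_ENC)                         # (N/9 soft tokens, D_ENC)

image = rng.random((384, 336, 3))                            # 24 x 21 = 504 patches
w_patch = rng.normal(0, 0.02, (PATCH * PATCH * 3, D_ENC))
pos_embed = rng.normal(0, 0.02, (64, 64, D_ENC))
soft_tokens = encode(image, w_patch, pos_embed)
print(soft_tokens.shape)                                     # (56, 256): 504 patches / 9
```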
Anything that doesn't divide evenly into 16-by-16-pixel patches gets additional padding. As an example, for square images across the five different soft-token budgets and resolutions supported in Gemma, here are the resolutions, patch counts, and pooled embeddings that exist at each of these sizes. This is where you really see that if you're doing something such as object detection or OCR, you want to run with a higher resolution and token budget, and this is where the 560 and 1,120 options are available for these use cases. Pulling this all together: we start with an image of N patches; those patches become N patch embeddings; we pool them to produce N/9 soft tokens; and those soft tokens are linearly projected up to what is actually shown to the model.

Now for the audio side. Audio has been added to the E2B and E4B with the goal of supporting translation and speech recognition. This is made possible through the combination of an audio tokenizer and a conformer: a 305 million parameter conformer that acts as the audio encoder, processing embeddings rather than tokens. On the audio tokenizer side, we start with raw audio, which is run through a mel spectrogram to extract features from the raw audio files. The mel spectrogram is then split into N mel chunks, which are downsampled through two convolutional layers, ultimately outputting N/4 soft tokens. These audio embeddings are what's passed forward into the conformer. The conformer follows a similar architecture to what we've already seen in the dense and MoE models, although this time we add in a convolutional layer.

Returning to where we started today: the Gemma family has four models, which we released externally last week. The two smaller models are geared towards on-device applications with support for text, vision, and audio, while the two larger models are designed for more complex reasoning tasks with support for agentic workflows and coding. And that brings us to what you can do with Gemma today and how to get started. There are two main options. You can download and self-host all of our models, which are available across Hugging Face, Kaggle, and Ollama. We also have cloud-hosted options for the larger models, the 31B and the 26B, accessible through AI Studio and Vertex; this is what lets you immediately get started with prototyping, building agentic workflows, and testing out function calling with these larger models. Thank you, and I'm happy to answer any questions about Gemma.
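As a closing illustration of the audio path described above, here is a minimal sketch of the downsampling step: mel-spectrogram frames pass through two strided convolutions, so N frames become N/4 soft tokens before reaching the conformer. The filter counts, kernel widths, and the toy mel front-end are assumptions rather than the actual Gemma 4 configuration.

```python
# Minimal sketch of the audio tokenizer described above: mel-spectrogram
# frames are downsampled through two strided 1-D convolutions (each halving
# the sequence length, so N frames become N/4 soft tokens) before being
# handed to the conformer encoder. All sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_MEL, D_AUDIO = 80, 256

def conv1d(x, w, stride):
    """x: (T, C_in), w: (kernel, C_in, C_out); 'same' padding, then strided ReLU."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([xp[t:t + k].reshape(-1) @ w.reshape(-1, w.shape[-1])
                    for t in range(x.shape[0])])
    return np.maximum(out[::stride], 0.0)

def audio_tokenizer(mel_frames):
    """mel_frames: (N, N_MEL) -> (N/4, D_AUDIO) soft tokens for the conformer."""
    w1 = rng.normal(0, 0.02, (3, N_MEL, D_AUDIO))
    w2 = rng.normal(0, 0.02, (3, D_AUDIO, D_AUDIO))
    h = conv1d(mel_frames, w1, stride=2)                 # N   -> N/2
    return conv1d(h, w2, stride=2)                       # N/2 -> N/4

mel = rng.random((128, N_MEL))                           # stand-in for a real mel spectrogram
soft_tokens = audio_tokenizer(mel)
print(soft_tokens.shape)                                 # (32, 256): 128 frames / 4
```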