Gemma, DeepMind's Family of Open Models — Omar Sanseviero, Google DeepMind
Channel: aiDotEngineer
Published at: 2026-04-20
YouTube video id: _gVFUEdhCyI
Source: https://www.youtube.com/watch?v=_gVFUEdhCyI
All right, hi everyone. It's cool to be here. I'm super excited to give this talk because just seven days ago we released Gemma 4. Before we start: who here has heard about Gemma already? Okay, most of you. Great.

Gemma is Google DeepMind's family of open models. Open models means these are models you can download, run on your own infrastructure or your own devices, and fine-tune for your own use cases. About a year ago we released Gemma 3. Back then, Gemma 3 was the most capable family of open models that could fit on a single consumer GPU. We designed models from 1 billion parameters all the way to 27 billion parameters, and it was a very strong model on LM Arena at the time. You can see here different open models with their LM Arena scores, and the small dots at the bottom represent how many H100s or A100s you would need just to load each model. That's Gemma 3, from a year ago, but you can see that even a model from a year ago, a relatively small model, can be extremely capable.

Last week we released Gemma 4, and this is my first conference talking about it, so I'm very excited. Gemma 4 is the most capable family of open models Google has ever released. These are models that go from 2 billion parameters all the way to 32 billion parameters, with very different capabilities, and I'm going to talk a bit about those. If you're wondering what the E in the names means, I'll explain that in a second. The smallest two models can run on an Android phone, on an iPhone, even on a Raspberry Pi. These are really small models that are multimodal, have reasoning, and can do very cool on-device agentic things. Then there's an MoE, a mixture-of-experts model, which is super fast with very low latency. And then there's the 32B, the most intelligent, most capable model. When you want the most raw intelligence, you would use this large model, and even the 32B can run on a consumer GPU. All of these models come in developer-friendly sizes, which is quite important to us.

Let me show you a couple of them, assuming the videos load. There's a lot happening here, so let me begin with the one on the right. That's an application where you have Gemma running directly on an Android phone and you can pick different skills. You have a full agentic setup where the model picks, say, a skill to play the piano, and then you have Gemma playing the piano. The one on the left is Gemma vibe coding, also on device. This is in airplane mode, no API calls, fully running on a phone. The example in the middle is on a laptop: we have ten instances of Gemma running in parallel, each generating a different SVG, and in a couple of seconds you see ten SVGs generated by different agents, all running on device with llama.cpp. Even then it's around 100 tokens per second, and there you can see the SVGs generated by the ten different Gemma instances. Gemma is a good coding model: it can do agentic tasks, coding, even Android app development, and all of this offline.

The LM Arena scores are quite nice. Here you can see a bunch of different models.
The X axis is how many billion parameters the model has; the Y axis is the LM Arena score. I know LM Arena is not a perfect benchmark, but it does give you some proxy for how much the community likes a model for general use cases like conversation. Gemma strikes a nice balance between being friendly and helpful while still being very capable. And look at that top-left corner: those are very small models that are very capable, which is quite exciting.

It's been exciting to see how the models have progressed over the last two years. Last year it was Gemma 3; two years ago it was Gemma 2. For a bunch of different tasks, the models have kept getting better and better without getting bigger. For me that's quite exciting, because if I think about where we'll be one or two years from now, I do think we'll have extremely capable models running directly on our own devices, in our own pockets. I'll skip the benchmarks. What's exciting is that Gemma can fit on a desktop computer, on a laptop, on a phone. I saw a day or two ago that someone put llama.cpp on a Nintendo Switch and is using it to try Gemma directly there. I don't know how things will look in a couple of years, but I'm excited for it.

Something we heard a lot with previous Gemma versions was that the license was not great; people wanted a proper open source license. So with Gemma 4 we changed to an actual Apache 2.0 license, which gives you all the flexibility of Apache 2.0. That's quite nice as well.

Now, you have probably heard about mixture of experts (that's the 26B model), and you have heard about transformers and dense models, but you have probably never heard about the E here. E2B stands for "effectively 2 billion" parameters. Gemma E2B actually has more parameters, around 4 billion, and it uses a novel architectural component called per-layer embeddings, which we first released in the summer of last year. There's this small block at the bottom of the diagram, and the TL;DR is that there is an embedding per layer, as the name indicates, and it works more as a lookup table than as a computation you need to run. So it's extremely fast, and you don't need to keep it on the GPU: you can keep it in CPU memory or even on disk. This architecture decision is really optimized for on-device, mobile use cases, which is why the smallest models, the ones that run on an Android phone or an iPhone, use this E2B or E4B architecture. Even if the model has 5 billion parameters, you only load 2 billion parameters onto the GPU, and the rest can live in much slower memory, because for those weights you're not doing any of the matrix multiplications you would usually do in a transformer. You can do this with llama.cpp using a simple flag, --override-tensor, to move the per-layer embeddings to CPU or even to disk, and it should work quite well out of the box.
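As a concrete illustration of that offload trick, here is a minimal sketch that launches llama.cpp's CLI with the per-layer-embedding tensors pinned to CPU. The model file name and the tensor-name pattern are assumptions on my part, not an official recipe; inspect your checkpoint's actual tensor names before relying on it.

```python
import subprocess

# Minimal sketch: run a Gemma E2B-style GGUF with llama.cpp, keeping the
# per-layer-embedding tensors in CPU RAM via --override-tensor while the
# regular transformer layers are offloaded to the GPU.
# "gemma-e2b.gguf" and the "per_layer_token_embd" pattern are hypothetical;
# check your GGUF's tensor names (e.g. with a gguf dump tool) to confirm.
subprocess.run([
    "llama-cli",                                         # llama.cpp CLI binary
    "-m", "gemma-e2b.gguf",                              # placeholder model path
    "-ngl", "99",                                        # offload all layers to GPU
    "--override-tensor", r"per_layer_token_embd\.=CPU",  # pin PLE weights to CPU
    "-p", "Write a haiku about on-device models.",
], check=True)
```

The flag takes a pattern-to-device mapping, which is what makes the "slow memory for the lookup tables, fast memory for the math" split cheap to experiment with.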
A couple of other exciting things. The smallest models can do multimodal understanding for images, for videos, and even for audio. So you can do speech recognition, and you can do speech-to-translated-text: I can speak in Spanish and have the text come out in, I don't know, French.

The larger models can do extremely capable multimodal understanding: videos, fine-grained details. I actually have a couple of examples here. For instance, the model can point to where the llama is in a picture, and it can detect different objects in a picture. What's also cool is that this model is heavily multilingual. Gemma 4 was trained on over 140 languages, and it uses a tokenizer based on Gemini's, so pretty much all of the multilingual research that powers Gemini also benefits Gemma. The tokenizer piece is quite interesting, because independently of Gemma's raw capabilities, this tokenizer was designed for multilingual use cases, and we took a lot of care with it. That matters if you want to fine-tune Gemma for a language with few digital resources: say an indigenous language of Peru like Quechua, or one of the official languages of India. You can take the model, use your own data, and train it, and independently of Gemma's raw capabilities, just because of the tokenizer decisions, things tend to work quite well out of the box. You can also mix the multilingual and multimodal capabilities, for example to get a transcription or an explanation of an image containing Japanese text. That's quite cool.

We released the model a week ago. Just yesterday we hit 10 million downloads for the Gemma 4 base models alone. There are over 1,000 models based on Gemma 4 already, quantizations and fine-tunes from the community, and over 500 million downloads of the whole Gemma family. What is very cool for me is that Gemma is not just a model you can use; it's about enabling the ecosystem to build on top of it, and that's what the community has done over the last few days. It was at the top of the trending models on Hugging Face. People have been building cool examples, doing full repository audits with Gemma, putting Gemma on all kinds of devices, and exploring all of its capabilities, which is quite nice.

And all of this is not done just by us; we collaborate with the open source ecosystem. We work with Unsloth, MLX, llama.cpp, Hugging Face, vLLM, SGLang. We want to ensure that when we launch a new model, whether Gemini or Gemma, people can leverage its capabilities out of the box. They should not need to switch to Keras to fine-tune Gemma: if they are fine-tuning with Hugging Face transformers, they should be able to do that. For us it's critical to be where the community is, and that's why I really want to shout out everyone here working in the open source ecosystem, contributing to these tools and maintaining these repositories, because that's what enables the ecosystem to do amazing things.

Another part I like about Gemma is all the product integrations we can do. Android Studio — I don't know if anyone here is an Android developer — has an agent mode where an agent helps you write code, and there's now an offline mode where a llama.cpp-, Ollama-, or vLLM-powered setup lets Gemma help you write code for Android development. We also included some Android-related datasets and benchmarks while training Gemma, so it's actually a very capable model for Android development.
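To make the fine-tuning point from a moment ago concrete, here is a minimal sketch of a supervised fine-tuning run using Hugging Face TRL. The checkpoint id, the dataset file, and the hyperparameters are all placeholders I chose for illustration, not an official recipe:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Minimal sketch: supervised fine-tuning of a Gemma checkpoint on a small
# text corpus. "quechua.jsonl" is a hypothetical dataset with a "text" column;
# swap the model id for whichever Gemma checkpoint you are actually tuning.
dataset = load_dataset("json", data_files="quechua.jsonl", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-1b-it",        # placeholder Gemma checkpoint
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="gemma-quechua-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,   # keep memory usage modest
        num_train_epochs=1,
    ),
)
trainer.train()
```

Because the tokenizer already covers many scripts well, a small, targeted run like this is often enough to get usable behavior, which is the point being made above.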
I talked a bit about how many people are fine-tuning and sharing models, so let me share a bit about the Gemmaverse. This number is actually outdated; it's from last week, and we're now at 500 million downloads, as I mentioned. In total there are over 100,000 Gemma-based models. Maybe Gemma works great for you out of the box, but maybe you want to improve its capabilities, or change the style in which the model talks to users. Maybe you don't want a conversational model at all; you just want a model that predicts a certain thing in your own context. Or maybe you just have too many GPUs at home and want to burn them. Whatever your reason, you can fine-tune these models for many cool things.

Google has released a couple of what we call official Gemma variants. We did ShieldGemma, a family of guardrail models. Those are great for production use cases where you don't want users to submit, say, toxic images or toxic text that violates the policies you have set up; ShieldGemma is the family of models that lets you do that. There are also other kinds of use cases. For medical work, for example, we released MedGemma, a multimodal Gemma 3-based model for different medical tasks: radiology, chest X-ray understanding, and a bunch of other things. And again, these are open models; you can use them, and you can fine-tune them further if you have an even more niche use case.

That's what Google has done, but the community is also doing cool things. For example, AI Singapore is a group training models for Southeast Asian languages; there are a lot of those languages, and they have been doing quite a bit of research with open models to push the state of the art in multilinguality even further. Another example is Sarvam: India has many official languages, and there's an effort by the government, investing in a couple of big startups, to train national models. That's more the sovereign-AI, official-languages angle, but people are doing very interesting work on the multilingual side.

Apart from that, there's quite a bit of other cool research happening. There was a paper we released in December of last year in which researchers from DeepMind used Gemma 3 to propose cancer therapy pathways; the proposals were taken to an actual lab, which was able to validate that the pathways proposed by this Gemma-based model led to real, verifiable results. That was quite exciting, because it's not just about having an assistant to chat with, or doing role-playing and whatnot. It's also about building models that can be used for real things that help the community in many different ways.
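Going back to ShieldGemma for a second: a guardrail model like that is typically a small classifier you call before your main model sees the input. Here is a minimal sketch with the Hugging Face pipeline API; the model id and especially the Yes/No prompt template are assumptions of mine, so follow the model card for the exact format the checkpoint expects:

```python
from transformers import pipeline

# Minimal sketch: screen a user message with a ShieldGemma-style guardrail
# model before forwarding it to your main model. The model id and the Yes/No
# prompt format are assumptions; use the model card's real template.
guard = pipeline("text-generation", model="google/shieldgemma-2b")

user_message = "some user input to screen"
prompt = (
    "You are a policy expert deciding whether a user prompt violates a policy.\n"
    "Policy: no harassment, hate speech, or dangerous content.\n"
    f"User prompt: {user_message}\n"
    "Does the prompt violate the policy? Answer Yes or No."
)

verdict = guard(prompt, max_new_tokens=4, return_full_text=False)[0]["generated_text"]
if verdict.strip().lower().startswith("yes"):
    print("Blocked by policy.")
else:
    print("OK to forward to the main model.")
```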
That might be finance, or legal reviews, or offline use cases where you don't want your data to leave your servers; offline modes when you're in the subway or on an airplane and need to use AI for something; a Chrome extension with Gemma inside that helps you understand what's on your screen; or on-device control. The open models are getting there. And for me that's quite exciting, because if you compare where we are now with where we were one or two years ago, open models can now do very cool, very interesting, highly agentic, complex tasks entirely on device, entirely on your phone. So I really recommend that all of you spend an hour in the next two weeks just playing with the latest open models and trying to understand their capabilities. Of course, there are many things for which you'll want an API-based model; if you want the most raw intelligence, you'll go use Gemini or your model of choice. But if you want things on device, there are many exciting things you can already do. What excites me most is that I don't know how things will look 6 or 12 months from now, but I think we're heading in a very exciting direction, where people will have extremely capable open models on their own devices, customized for their own use cases with their own data. So yeah: please try the models, build something, and share it. All right, thank you.