TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google
Channel: aiDotEngineer
Published at: 2026-05-03
YouTube video id: BKWpYIWvAo4
Source: https://www.youtube.com/watch?v=BKWpYIWvAo4
[music] Hey, my name is Cormac Brick. I work on Google AI Edge, which is a way of bringing models to the edge. It's something we use internally for our own products, and it's also something we make available as open source. As part of this we also work really closely with the Gemma team, because they publish a lot of models that are targeted at edge devices as well.

Quick bit of background on me: I've been focusing on edge AI for the last, wow, maybe 10 years at this point. I started off in 2016 running GoogLeNet on a USB-key hardware accelerator plugged into a Raspberry Pi at NeurIPS 2016, so it's been a lot of fun over the years. I then joined Intel, where I led the architecture for the NPU that goes into all of their laptops these days. Three years ago I moved to Google to work as tech lead on edge AI here, and that's been great fun; a crazy few years, as you can imagine. I'm happy to share where we're up to, and as part of that we can also cover where mobile AI is today: what's the state of the art, and what are the different patterns we see for model deployment?

The two things I want to focus on in this talk are tiny LLMs and agent skills. Tiny LLMs are very, very small models. Agent skills are now possible on device, though larger models are needed to make them work fluently there. Both are exciting new directions that have only recently become possible. As recently as last week, when we launched Gemma 4 with the DeepMind team, we supported them with an Android and iOS app that opens up new possibilities for what we can do on mobile. So we'll deep dive into that, as well as look at the types of things we can do with tiny models. Those are the two halves of the presentation.

Okay, so that's the intro. First I'll give an overview of AI on the edge, looking at small language models and tiny language models. Then we'll take a look at Gemma models, because that's pretty topical, and the kind of performance we see from Gemma models on various types of edge devices, because that's what our team does. Then we'll look at agent skills, which we built on top of the latest generation of Gemma models and which run on both Android and iOS as well as many other platforms. In the second half we'll look at tiny models: how you can fine-tune and deploy a tiny model to edge devices. And finally I have an example of an app that our team built, a real app rather than just a sample, using tiny LLMs based on Gemma technology.

Okay. Firstly, you're already in this room, so you're probably interested in edge AI and this probably goes without saying, but there are a lot of benefits to running on the edge. There are latency and UX improvements for some really latency-sensitive, in-the-loop things like live voice translation, for example. That's something our team shipped on Pixel last year, where you can do live voice translation. It's very challenging to do that with the required latency using a cloud service.
So being able to do that offline and on the edge is key to meeting latency and user experience requirements. Privacy is definitely a factor as well: people use AI within messaging apps, for example, and they like their messages to stay on their phone, encrypted and fully private, so that's a good example of where we're seeing LLMs deployed to assist in use cases where privacy is really important. Offline use, yeah, that's kind of obvious. And then savings: we see this being increasingly relevant for laptop users, where there's a trend towards folks at least experimenting with small language models to handle some of the tokens in the desktop agentic workflows they have going on. So some folks are interested in exploring those savings as well.

Okay. So this is what we do as a team: we enable people to build apps. We do MediaPipe; LiteRT-LM, which is an LLM runtime that works on mobile and edge; and LiteRT, which is a standard inference framework that a long time ago was known as TensorFlow Lite. We've used these over the years across Google products: they ship in Photos, and if you've used any of the funny effects in YouTube Shorts, those are also things our team built along with the YouTube team, using MediaPipe and various deep learning tracking models under the hood, all on the same MediaPipe and LiteRT technology. All Android phones today will use this as part of system services, and third-party apps build on a version of the stack that ships with Android. But our stack also runs far beyond Android: as well as being available as a system service in Android, you can take the same TFLite file and ship it to iOS, macOS, Linux, Windows, web, or IoT devices, all from that same file. So that's why it's helpful for deployment. One caveat: that same file deploys on CPU and GPU. For NPU we need to do some special compilation, and you end up with a special NPU file. But certainly for the Gemma models we published last week, it's true that one file can work on CPU and GPU across a wide set of devices.
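To make that "one file, many platforms" point concrete, here's a minimal sketch of loading and running a LiteRT/TFLite artifact with the classic TensorFlow Lite Python interpreter. The model path is a placeholder, and on Android or iOS you'd use the corresponding Kotlin or Swift bindings instead; treat this as illustrative rather than the exact API our apps use.

```python
# Minimal sketch: run a single .tflite (LiteRT) artifact with the classic
# TensorFlow Lite Python interpreter. "model.tflite" is a placeholder path.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy input with the expected shape/dtype and run one inference.
x = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```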
Some of the things we're seeing for LLMs on device are privacy-centric use cases, voice agents, and local agents with tool calling, which is a very popular workflow. Within our stack, LiteRT-LM is the piece we'll look at a bit later that lets us run tiny LLMs on device cross-platform, and it's fast because we support all of the different hardware accelerators.

There's an important concept here: we see two trends happening today. One is what we call system-level gen AI. For very, very large models, the way these tend to turn up on mobile phones isn't that when you launch a single app it downloads a four-billion-parameter model just to help you find a good restaurant in whatever app you're looking in. Instead, the trend is to build larger models into the OS. We call this system-level gen AI. These models tend to be in the two-to-five-billion-parameter range, depending on the OS.

This is a choice we see from both the Android team, who we work closely with, and the Apple Intelligence team: both mobile OS vendors are making similar choices, with a central model built in. On Android that's called AICore, and there are things like summarization APIs and an increasing set of APIs available for developers, including a prompt API, available on premium Android devices; Apple obviously has their own Apple Intelligence stack on premium Apple devices. As a trend this is really relevant: if you want to leverage a built-in model, this is a great way to go.

The other trend we see is in-app gen AI, which is where the tiny LLMs, or TLMs as we're calling them in this presentation, are more relevant. With system gen AI you generally customize via prompting, or via skills as we'll see later, and it's a foundation model pre-loaded on the device. With in-app gen AI, the models are generally custom to a task and they're loaded with the app or with the web page (we also deploy some of these on the web), and generally they're targeted at wider reach. So they can work not just on premium devices but on all devices, because reach is really important for a lot of the application developer teams we work with. And, surprisingly, if you fine-tune an LLM for a single task, we've seen really strong performance on simple tasks like summarization, transcription, or voice-to-action type things: we can get really reliable performance from models in the 100-to-500-million-parameter range, depending on the complexity of the task.

A good example here is FunctionGemma, which we launched in December. That was a 270-million-parameter model dedicated to function calling, and doing voice-to-function-calling for 10 different functions relevant to Android developers, on our internal eval set it reached about 85 to 90% reliability, using just that very small model, which was widely deployable to iOS and Android devices. That's something you can play with in a sample app we'll look at more a bit later. So that's one example of a use of a tiny LLM, but we're seeing more interest from application developer teams now in fine-tuning models to deploy as in-app gen AI.

So the difference is that on the left the customization is via prompting or skills; on the right we'd encourage people to do some degree of fine-tuning to make a tiny LLM work in practice. Certainly below 500 million parameters that's true. At 500 million parameters and above we can maybe do more general-purpose tasks with a model, but for the really tiny models, certainly less than 500 million in our experience, you need to fine-tune to get production-level reliability. Okay, so now I'm going to talk about Gemma 4.
Gemma 4, the models that were launched last week, falls into that system gen AI candidate category, so we can do lots of powerful things with Gemma 4, as you'll see. First, sizes. There are two small sizes, the first two, which are E2B and E4B. E2B is called that because it only needs about two billion parameters resident in RAM to run the model; one of the limiting factors we see with these models is how much RAM a device needs in order to run them reliably. E4B has been optimized to run with four billion parameters resident on device. There are more parameters used by the model, but the other parameters are used for per-layer embeddings (DeepMind have talks on this later in the week, so you can go there for more detail). The TL;DR is that in our runtime we don't actually need to load all of those parameters. The 2B or 4B of core parameters need to stay resident in RAM; the per-layer embedding table we typically memory-map, and in the autoregressive loop we only need to load one row of that embedding table at a time, a few hundred bytes to a few thousand bytes, to do the next-token inference. So as you go through inference you never end up needing to load the whole PLE table into memory, and depending on the OS it will do a reasonably good job of evicting memory used by older PLE rows. That's why we have this idea of "effective" parameters.
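To illustrate that per-layer-embedding trick in isolation, here's a tiny numpy sketch of memory-mapping an embedding table and touching only one row per decoded token, so only a few kilobytes get paged in rather than the whole table. The file name, shape, and dtype are made up for illustration; this is not the LiteRT-LM implementation.

```python
# Illustrative only: memory-map a (made-up) per-layer embedding table and read
# one row per token, so the OS pages in only the rows we actually touch.
import numpy as np

VOCAB, DIM = 32_000, 128            # toy shape, not the real model's
path = "ple_table.bin"

# One-time setup for the demo; in reality the table ships with the model files.
np.zeros((VOCAB, DIM), dtype=np.float16).tofile(path)

table = np.memmap(path, dtype=np.float16, mode="r", shape=(VOCAB, DIM))

def ple_lookup(token_id: int) -> np.ndarray:
    # Touching a single row loads ~DIM * 2 bytes, not the whole table.
    return np.asarray(table[token_id])

for tok in [17, 4242, 31_999]:      # stand-in for the autoregressive loop
    vec = ple_lookup(tok)
    print(tok, vec.shape, vec.dtype)
```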
The smaller models, the E2B and E4B, run on a variety of platforms, as you can see on the right. These models are on the AICore roadmap: they're available now for experimentation, and at the appropriate point in the future the Android team will integrate them into AICore, where they'll be available more broadly on a wider set of devices. There's a talk here this week from Olli on the AICore team, who'll probably share more detail about exactly what that roadmap looks like, and Omar from the GDM team is going to do a deeper dive on the whole Gemma world. The bottom two models, just for reference, are also relevant for the edge, but they're not really the focus of my talk today: they're for folks who want to run on laptops, and their sizes have been optimized to run really well on consumer-grade laptops, albeit ones that have maybe 32 gigs of RAM. So I'm going to focus less on those two, even though they're very powerful and useful.

Both the E2B and E4B models have excellent performance on a wide range of things, on knowledge and reasoning. But one of the big step-ups relative to the last generation, from my perspective as a user of the models, is that they've built in function calling, which is excellent, and they also have built-in thinking. That combination of thinking plus function calling is what unlocks our ability to now do skills on device. As you'll see in a second, we can describe a skill, give it to the model, and the model can just pick it up and use it. That lets us bring a pattern that's proven very popular in the last few months to mobile, which allows for new types of mobile experiences. Also, the E2B and E4B models are multimodal: they support audio, image, and text. The larger models support just image and text. The other change from a deployment point of view is that this is the first time the Gemma models are released under a stock-standard Apache 2.0 license, which means they're usable by more people. Again, you'll hear more about this in Omar's talk; I just thought I'd mention it here.

Okay, so that's Gemma in general. Now a deeper dive on Gemma on our runtime and our platform. It's the same picture as earlier: we have a single LiteRT-LM file, which is a LiteRT file with things like the tokenizer and the other pieces we need in one package to run a model. That single model runs across all of these classes of devices: mobile, desktop, and embedded. And we're pretty excited to do more in the embedded space; particularly with image input, there's a lot of scope for new IoT use cases.

Bit of an eye chart now, but let's dig into performance. I'd say this is a snapshot as of today. It's something we're continuing to work on, both ourselves as a team and with the various partners we have, including Intel, the Raspberry Pi team, and the Qualcomm team; we're continuing to optimize all of these numbers. But you can see that for the two-billion-parameter model we get really compelling performance: a high-end Android phone can do thousands of tokens per second on the GPU, and thousands of tokens per second on a MacBook as well. The bottom two rows are a Raspberry Pi, where we get about 133 tokens per second, which is sufficient for simple image analysis use cases with reasonable latency, and a hardware accelerator, a Qualcomm IoT and robotics development platform they have available, where we see pretty compelling performance as well because we've used the NPU, which gives a lot better performance. Then there are corresponding numbers for the E4B model, which also works on a wide set of devices, with proportionally lower prefill and decode performance given the size and the number of parameters we need to fetch. But broadly, they're available on lots of platforms.

Okay, so what can we do with those models? Lots of things. There are plenty of things we've been able to do with models for a while, like image analysis, audio transcription, or audio translation, that the Gemma 4 models are much better at than their predecessors. But the thing I wanted to talk more about, because it's net new from my perspective, is agent skills on device. So we have an app; I don't know if you've seen it.
It's available on both iOS and Android and it's called Google AI Edge Gallery. It lets you do lots of different things: basic AI chat, asking questions about an image, transcription or translation use cases starting from audio, or audio-to-function-calling type use cases. But the one second from the left is agent skills, which I'm going to do a deeper dive on today.

Okay, hopefully this will play. Oh, okay, that is very annoying. Sorry, I will fix this before we post slides. Wait, let me see if the sound is going to work.

>> Morning. Let's log a new mood journal entry with school.

So the video is playing, but the sound is not working.

>> I got eight hours of sleep and I'm looking forward to hanging out with Amy today.

This is a journal skill app we're looking at here, or a mood tracker, where it logs mood and sleep.

>> Perfect. Analyze the trend in my mood over the last seven days.

Okay, so what's happening here is that with the mood tracker app, it logs your moods and observations to a diary, and then the LLM is able to go back in and summarize that content.

>> Check my calendar for...

What's interesting here is this is just a bullet list; okay, I won't keep repeating the narration. What's interesting, though, is the way this is done, which we're going to look at in more detail. This isn't a custom fine-tune. It's just us giving the model a particular skill, with a little bit of JavaScript the model can call, and then through a free-text interface the model is able to pick up and use that skill. There are actually two skills happening here: one is the mood tracker skill, and the other one...

>> Can you check Wikipedia for the latest information?

...the other one was the map skill, and now it's using a query-Wikipedia skill to look for the latest information about the Oscars. So instead of having a model that's just pinned in time, it becomes really easy to extend the model with skills such as map lookup, an interactive journal that you can add and remove entries from by voice as well as query, and also adding more current or relevant knowledge; we're showing Wikipedia in this case.

>> Can you pair this vibe with...

Oh, this is another fun skill somebody on the team developed: the mood music skill, which calls a web service to compose music based on a single image and plays it. So it's able to compose some lo-fi music to go with a person's breakfast, that sort of use case. It's a new paradigm in how we can really easily extend the models, and do that in a pretty low-code way, as you'll also see in a bit when we dig into how this works under the hood.

Okay, so there are more examples here, and I'm not going to play all of these videos, maybe to save time, because I think you've seen many of them. One pattern we find interesting is augmenting the knowledge base. We can also produce rich interactive content like flashcards or visualizations: instead of asking to summarize something in three bullet points, you can have a JavaScript skill that shows those bullet points as a card instead.
So you can just say "summarize this," and if the model thinks a card display will be more helpful for the user, that gets used; or music synthesis. Okay, I'm going to skip forward a little from the demos to how we built them.

What's actually happening is that the way we've built the skills, they're efficient: the instructions can be loaded on demand. This uses a principle you may have seen elsewhere of progressive disclosure, or conditional depth. Instead of an MCP-style workflow where you need to describe everything about all of the functions up front, the way we've structured the skills is that there's a one-line description per skill; the one-line descriptions are what the agent sees, and if one of them sounds relevant, it asks for more. It asks to load the skill (we teach it a skill for loading a skill), and then it loads the skill and finds out all of the details about how to use it and what function calls it can make to carry out that skill. That pattern is particularly important for token efficiency and, frankly, for reliability on edge models, because if we had to load all of the details for all of the skills into the edge model, that would be a lot of context for the model to reason over, and on a lighter-weight model that will ultimately hurt performance. The lighter-weight models are great to run on device, but their ability to reason over very long context windows is weaker; if you keep a more condensed context window, that ups your batting average on the quality metrics you're looking at when you want to ship a particular app.

The second part is tool access: it helps us integrate new tools dynamically. We think of these in terms of a set of input tools, ways to get more information, which could be Wikipedia or looking up a weather service or something like that, and a set of things that help present new outputs to the user, like showing something on a map or showing something via cards. So you can have skills that extend both the input and the output patterns a model can use. It also helps us bring in domain-specific knowledge bases: that skill that's calling Wikipedia, you could easily imagine it calling some internal customer CRM, or pulling data from a local RAG system, as well.

Within our structure for skills there's a skill.md and then, optionally, scripts or assets. The skill.md is the metadata that we always process; the example here shows extracting text and tables from a PDF, so we trigger on PDF for extraction, and the full instructions are only loaded when the model decides it requires that skill. That pattern is particularly important. Then, within scripts, we also support people writing custom JavaScript that can be rendered within the app, because that's a really easy way to extend things on both iOS and Android. Okay.
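As a rough sketch of the progressive-disclosure idea (not the actual Gallery orchestrator; all names and structures here are made up), the system prompt carries only one-line skill descriptions, and the full instructions are appended to the context only when the model asks to load a specific skill:

```python
# Rough sketch of progressive disclosure for skills; names and structure are
# illustrative, not the actual AI Edge Gallery implementation.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    one_liner: str        # all the agent sees up front
    instructions: str     # full skill.md body, loaded on demand
    script: str = ""      # optional JavaScript / assets

REGISTRY = {
    "mood_tracker": Skill("mood_tracker", "Log and summarize mood journal entries.",
                          "Full instructions and function signatures go here..."),
    "wikipedia":    Skill("wikipedia", "Look up current facts on Wikipedia.",
                          "Full instructions and function signatures go here..."),
}

def system_prompt() -> str:
    # Only one line per skill is spent from the context budget.
    lines = [f"- {s.name}: {s.one_liner}" for s in REGISTRY.values()]
    return "You can load these skills with load_skill(name):\n" + "\n".join(lines)

def load_skill(name: str) -> str:
    # Called as a tool when the model decides a skill sounds relevant;
    # the returned text becomes the tool response in the context window.
    return REGISTRY[name].instructions

print(system_prompt())
print(load_skill("wikipedia"))
```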
To dig a bit deeper: we have our own system prompt that we ship with the app, and a set of skill descriptions that get added to that system prompt. When the user asks for something, the model decides to trigger a skill by reading its metadata. It then calls the load-skill tool we have under the hood, and the tool response from that function call is the contents of the skill.md file, which then sits in the context window, so the model now knows about those functions. The skill also contains some JavaScript, so the model then calls the run-JavaScript tool, which runs the JavaScript we picked up from the skill file, and we return a response from that as well.

One other thing to mention as part of this workflow: one of the things we did when we were preparing FunctionGemma for reliability is that within the runtime we have constrained decoding, but it's tuned to apply only to the output when we're calling a tool. The constrained decoding can also be constrained to just the particular tool that's supposed to be called. So in this system, instead of having generic JSON constraints, we know there's a finite set of tools the model is supposed to be able to use, and we can therefore apply stronger constrained decoding. That helps us build a more reliable system overall. We find this helps a lot for the two-billion-parameter model; as models get more capable, the margin you get from this sort of strict constrained decoding, as we call it, is less essential, say if you were running a 10-billion-parameter or larger model. But for really small models on device it's a very helpful tool that gives us stronger guardrails, so we can raise the quality to something that's useful in production.
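As a loose illustration of what that strict, tool-aware constraining buys you over generic JSON constraints: once the orchestrator knows which tools are registered, a proposed call can be checked against that finite set and its argument schema before anything executes. This is a post-hoc validation sketch with invented schemas, not the token-level constrained decoder in the runtime.

```python
# Illustrative post-hoc check of a model's proposed tool call against a known,
# finite tool set; the real runtime constrains decoding at the token level.
import json

TOOL_SCHEMAS = {
    "load_skill":     {"name": str},
    "run_javascript": {"skill": str, "code": str},
    "run_intent":     {"intent": str},
}

def validate_tool_call(raw: str):
    call = json.loads(raw)                      # must at least be valid JSON
    tool, args = call.get("tool"), call.get("args", {})
    if tool not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {tool}")
    schema = TOOL_SCHEMAS[tool]
    if set(args) != set(schema):
        raise ValueError(f"{tool} expects args {sorted(schema)}, got {sorted(args)}")
    for key, typ in schema.items():
        if not isinstance(args[key], typ):
            raise ValueError(f"{tool}.{key} should be {typ.__name__}")
    return tool, args

# A small model is far more likely to stay on the rails when only these
# exact tools and argument names are reachable.
print(validate_tool_call('{"tool": "load_skill", "args": {"name": "wikipedia"}}'))
```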
Then, within the app, we support toggling the skills you want to use, so even for the skill descriptions you can decide how many you want live at any given point in time. Let's see if this plays: this is loading a piano-playing skill, a virtual piano, which does its thing; you can actually tap the keys and play sounds. This is just more JavaScript. You can also load custom skills: you can write a skill yourself and load it from a URL, which is kind of fun. If you want to experiment with prototyping a skill with Gemma for an app idea you have, this is something you can do. Skills can also have an API key, a secret, so if you need to use a web service, that's something you can prompt the user to enter. This is showing the mood music example again. And on our GitHub page we have a GitHub discussion where users are posting skills they've written themselves, and the community skills that we like, we're able to pull up and feature in the app. So for the skills people develop, we have a way of showcasing useful community skills to the wider community as well.

So this is a third-party skill; I think this is going to show us... okay, this is adding an animal-intro skill, a kids thing, to the app. The key point here is that there's a really, really low barrier to extending the model, and to extending it in a way that's relevant to downstream app users. It's very easy, and we're going to go a little deeper now to see just how easy, if the clicker does its thing.

Okay. So, skill architecture, one step deeper. We have our own orchestrator, it has a skill registry, and we call the load-skill skill in order to load skills. Within skills, there are JavaScript skills and there are native intents: at least on Android we're able to call Android system intents, or native intents, and you can just call those from your skill, for things like turning Wi-Fi on and off. So intents that are exposed on Android and available from JavaScript, you can certainly use. Then there's the skill.md, which holds the persona and scenario data, and then the specific skill's scripts and resources. Like I was saying before, this can just be JavaScript that runs entirely locally, which is a fully offline experience, or, as in the music composer one, it can call a web API that needs an API key, which the user is prompted for, and that also works within Gallery. The predefined tools we have under the hood, just to give a mental model of how it works, are load skill, run JavaScript, and run intent. The orchestrator, using just those three tools, is able to make the overall system work.

So this is another demo.

>> Let's test our...

This is from our Gemini product manager, showing...

>> It's good to put restaurant. Please reply to me in English.

This is a restaurant roulette skill. It was so easy to develop skills using Antigravity or Gemini CLI that we actually had the team do about 80 of these, so we had lots and lots of options in terms of what to show, and it was a lot of fun for folks to develop them. Here we can go in and look at the structure of restaurant roulette itself; this is the under-the-hood part. It has "requires secret: true," because we wanted to get an API key, and it searches for 10 restaurants and returns locations and cuisine. And this is the index.js file, where we have a simple web view that renders the actual wheel. You can obviously go and code all of this yourself, and there's full source code for the examples. Claude Code, I never actually got that published, but it works really well in Claude Code as well. We have both the source code for the example and a skill spec for getting started, and there are full instructions on GitHub if you want to try writing your own skill. But the pattern we actually used most was using skills to write skills, with something like Gemini CLI or Claude Code or Antigravity.
This was our favorite pattern: we just say, "Hey, I want to write a skill for AI Edge Gallery." This example works in Gemini CLI, where we say: here's the documentation, here are some examples of skills you can go and read, and this is the skill I want to build, which was an offline archiver. The idea is that, based on pictures you took around London for example ("there's that cool Churchill statue I walked past"), it would go and research a bit about Churchill for you, and then on your flight back you could go into the skill and it would have a bunch of information about the things you saw that you could read on the flight home. I think that was the inspiration for this one. So it fetches Wikipedia content, stores it locally, and keeps an index.

Also, with the CLI, because we have an ADB skill in Gemini CLI, you can even ask it to test the skill itself by saying, "Hey, you have access to a phone connected via ADB," and Gemini CLI uses its Android ADB skill to go in and test that the app actually works and that the skill is doing what it claims. You can put this in a prompt and ask it to iterate, and it'll do some basic validation itself and return the result. So this, if it plays, is an example of us doing exactly that with Gemini CLI, and it worked really robustly. I think of the maybe 80 skills the team did internally, more than half were vibe-coded skills, at least initially, which really reduces the barrier for folks to extend an LLM to do new things in a way that's useful for their audience.
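As a sketch of the kind of device-in-the-loop check that ADB flow amounts to (this is not the Gemini CLI ADB skill itself, just standard adb commands driven from Python; the package and activity names are placeholders, and it assumes adb is on PATH with a device attached):

```python
# Sketch: drive a connected Android phone over adb to smoke-test an app.
# Package/activity names are placeholders.
import subprocess

def adb(*args: str) -> str:
    return subprocess.run(["adb", *args], capture_output=True, text=True, check=True).stdout

print(adb("devices"))                                  # confirm a device is attached
adb("shell", "am", "start", "-n",
    "com.example.gallery/.MainActivity")               # launch the app under test (placeholder name)
log = adb("logcat", "-d", "-t", "200")                 # dump recent log lines
print("crash detected" if "FATAL EXCEPTION" in log else "no crash in recent log")
```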
Take over the screen? Yeah, go for it.

>> Are all the examples here running on the smallest model, in the app or on device?

Good question. The question for the recording is: are all of the examples running on the smallest model? In this case the examples are running on the 4B model. The skills will work with the 2B model as well, and you can try that, but your mileage may vary: simpler skills, fewer skills, that sort of thing. But all of the examples you've seen are running on the 4B model.

>> Have you tried the Chrome MCP?

Try doing a version of skills in Chrome?

>> Yeah, like using the Chrome MCP with the skill to actually have a local model work in your browser.

We haven't, no. It's an interesting thing to try. Okay.

And a quick shout-out to this link. A few things about Gallery that I should also mention. One is that Gallery is an open source project, and the open source project itself builds on top of the LiteRT-LM tooling you saw earlier as part of the intro. So the app itself, as well as being something that's fun to use, lets you prototype app ideas: is this possible with the model on a phone, how fast would it be if I ran it on a phone, is this sort of skill actually viable? As well as that kind of model prototyping, you also have full access to the source code, and it builds on the same infrastructure, the open source LiteRT and LiteRT-LM acceleration infrastructure, under the hood. So if you see something you like in Gallery, yes, you can deploy the Gemma model some other way, but if you want the same experience or the same speed, you can just take the underlying open source APIs, pull a model from our Hugging Face page, which we'll look at more in a bit, and run that directly.

As part of that, we also have a set of discussions on Gallery that show some of the community skills that have been uploaded, ranging from cat entertainment to blackjack. People have posted these skills, so if you develop a skill, feel free to post it there and it'll be more discoverable by other people in the community. The ones that seem compelling or useful to a lot of people, we can add to our featured third-party skills list in the app so they're easier for folks to discover. We saw a reasonable amount of activity on this over the weekend, and it's only been up since last Thursday.

Okay, quick time check, and next I wanted to switch gears. Everything we saw with skills really applies, per your question, to the 2B and 4B models, which on mobile phones will mostly be destined for system gen AI; that's where they'll really show up in production. For IoT, desktop, or other edge applications, you could just load that model yourself and run these skill use cases in production, for example on that Qualcomm IoT platform we saw earlier. But to deploy models in-app today and ship them in production apps, we're seeing more people use smaller models: what we'd call tiny models, which in our view are models under 1 billion parameters. These are the sorts of things we work on with teams to deploy LLMs within their apps, and I want to briefly look at what that workflow looks like.

LiteRT-LM is the engine that powers Gallery. It's itself an open source project with C++ and Java APIs, with Swift APIs coming soon; it also has a Python API as of last week, which is relevant for IoT developers who like to work in Python. It takes an LLM file, has the required components to fully run an autoregressive loop, and exposes that via easy-to-use APIs. It's cross-platform, with C++ APIs, and where relevant the same API can be used with hardware acceleration, like the Qualcomm example.
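To make "the runtime owns the full autoregressive loop" concrete, here's a toy greedy decode loop. The `model` callable is a dummy stand-in, and none of this reflects the actual LiteRT-LM API; it only shows the prefill-then-decode shape the runtime handles for you.

```python
# Toy greedy decoding loop; `model` is a stand-in for whatever returns
# next-token logits. This is the loop an LLM runtime owns on your behalf.
import numpy as np

VOCAB, EOS = 1000, 0

def model(tokens: list[int]) -> np.ndarray:
    # Dummy "LLM": deterministic fake logits so the sketch actually runs.
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(VOCAB)

def generate(prompt_tokens: list[int], max_new_tokens: int = 16) -> list[int]:
    tokens = list(prompt_tokens)          # "prefill": process the whole prompt once
    for _ in range(max_new_tokens):       # "decode": one token per step
        next_tok = int(np.argmax(model(tokens)))
        if next_tok == EOS:
            break
        tokens.append(next_tok)
    return tokens

print(generate([12, 7, 99]))
```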
On the hardware acceleration front, for other smaller models in the past we've also published models that work on MediaTek silicon and on Intel silicon. We don't have that published yet for Gemma 4, but over the course of the year, as more and more smaller models become available, you'll see broader NPU support.

So the workflow is: you start from Transformers, and we use a package called LiteRT Torch, which has PyTorch-native optimizations that optimize for LiteRT and also has quantization built into the workflow. That gives you a LiteRT-LM file, which you can then either prototype with the Gallery app or deploy directly to your production candidate use case using LiteRT-LM on whichever platform you want to work with. This slide is another view of that flow. For really advanced use cases, there's the Torch Generative API: if you actually want to write your own tiny model from scratch and train it from scratch, we support that workflow too, not just stock-standard fine-tuning. The Torch Generative API gives you building blocks for LLMs in native PyTorch, written in a way that gives really good performance when you run them on device.

Okay, this is really in the weeds now: this is our stack for deploying on NPUs. The takeaways from this slide are maybe two things. One is that, under the hood, we invest an awful lot of work in optimization libraries: XNNPACK, which is a CPU optimization library, and ML Drift, which is an optimization library for GPUs. We have teams working on both to ensure they have excellent performance; lots of Google's first-party apps rely on these libraries, so we're very motivated to keep them fast and working on the widest possible set of devices. These use what we call the JIT workflow, where we produce a single artifact, a LiteRT-LM file or a LiteRT file, that can work across CPU and GPU and be deployed to lots of types of devices. For NPU we need something a little more specialized: we call a vendor compiler plug-in up front, which is an ahead-of-time compile workflow, so you produce an artifact that's specific to a particular NPU. Then within our runtime we dispatch to a particular API that's allowed to hand work to the NPU's device driver. Both of these paths are available through a consistent API, so even though JIT versus AOT affects the build workflow, the app development workflow is very similar across NPU, CPU, and GPU.

Export and inference are really simple: you use LiteRT Torch to just export. I know I've talked a lot about Gemma models, because it's Gemma week for us and we're very excited about Gemma at the moment, but it's worth calling out that we support third-party models as well.
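Since quantization is baked into that export step, here's a tiny standalone sketch of what symmetric int8 weight quantization does to a tensor. The real export tooling does this, plus a lot more, for you; this is just to show where the size saving comes from.

```python
# Standalone illustration of symmetric int8 weight quantization (the export
# tooling handles this for real models; this just shows the basic idea).
import numpy as np

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)

scale = np.abs(w).max() / 127.0            # one scale per tensor, for simplicity
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale      # what the runtime effectively computes with

print("bytes fp32:", w.nbytes, "-> int8:", w_q.nbytes)
print("max abs error:", float(np.abs(w - w_dq).max()))
```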
For example, there's a Qwen 0.6B-parameter model there, and those Qwen models also work in the app, which is what you'll see on the right-hand side. The app also supports loading any LiteRT-LM file, running it, and getting benchmark stats, which we'll hopefully see in the video in a second. So you can run it on, say, GPU, and then...

>> Which is the latest Pixel that you're simulating here?

Wow, actually, that's a great question; to be honest, I don't know. We do a lot of testing on Pixel and a lot of testing on the S25 as well, so I assume it's one of those two devices, but I don't know definitively. We didn't show it there, but recently, in the last release, we've added an optional icon underneath each chat, and if you click on it, it shows you the prefill and decode stats for the model. So you can find any model we have on Hugging Face, load it in the app, and run it to see the benchmark stats for that model on a particular phone if you want to. And on the command line, if you're running on desktop, you can also use the LiteRT-LM runner if you want to do some desktop prototyping and understand how models behave on desktop.

This is another third-party model: FastVLM, a model from Apple. It's a really nice VLM that's only 500 million parameters. It's running with hardware acceleration on Qualcomm, which is why it's so fast; it's running on an S25, I think, because it's Qualcomm silicon. We literally have it running in a loop saying "describe the scene, describe the scene" with video input, and it runs really, really fast. So this is a good example of what's possible with a model that's feasible to deploy on device. We haven't done 4-bit quantization in this case, but I assume if we'd been motivated enough we could have, and then you'd have something that only requires an extra maybe 250 or 260 megabytes in your app to give this type of experience. So these are certainly feasible.
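That "extra 250 or 260 megabytes" figure is roughly what the parameter-count arithmetic gives you; here's a quick back-of-the-envelope in code (it ignores per-channel scales, activations, and packaging overhead):

```python
# Back-of-the-envelope model download size at different weight precisions
# (ignores quantization scales, KV cache, and packaging overhead).
def weights_mb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e6

for label, params in [("500M VLM", 500e6), ("Gemma 3 270M", 270e6), ("2B", 2e9)]:
    print(label,
          f"fp16 ~{weights_mb(params, 16):.0f} MB,",
          f"int8 ~{weights_mb(params, 8):.0f} MB,",
          f"int4 ~{weights_mb(params, 4):.0f} MB")
# 500M parameters at 4 bits is ~250 MB, which matches the figure quoted above.
```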
This goes back to what I was saying earlier about models: FastVLM is an example of a general-purpose 500-million-parameter model. It's been trained on a fairly narrow set of things, scene descriptions, image-to-description, so it's a narrowish use case, but it's still general purpose. It's a good example of the larger tiny models; the smaller ones, like Gemma 3 270M for example, typically require fine-tuning to do a particular task. But this is why I included it: it's a good example of a general-purpose tiny model that's very useful. It's also worth noting, though I don't think I've included them in the talk, that other examples of general-purpose tiny models include a lot of very small transcription models and some narrow pairwise translation models out there at the moment, many of which we support and many of which are on our Hugging Face page if you want to check them out.

The next slide shows a handful of models I wanted to talk about. FunctionGemma is one we published in partnership with GDM last year. FunctionGemma is a general-purpose model that you can further fine-tune for function calling, and there are Colab notebooks out there if you want to look up how to format datasets and how to fine-tune FunctionGemma for function calling. The next two above that are Mobile Actions and Tiny Garden. Mobile Actions is the example I mentioned earlier with the 10 different mobile actions, which we fine-tuned ourselves and where we achieved around 86% reliability. Tiny Garden is a game we built inside the Gallery app that you can play with as well; it's another example of a voice-to-function-calling fine-tuned model you can use.

>> It has the IT suffix?

It is instruction fine-tuned. When GDM publish models, they sometimes have PT suffixes and sometimes IT suffixes. PT is when we publish a model just after pre-training, before fine-tuning. That's helpful for expert users, because you can do all of your own instruction fine-tuning, fully control the model's personality, and you're not trying to unlearn some fine-tuning we did. In this case, the way we teach the model to do function calling is actually via instruction fine-tuning itself, so when we publish a model for further fine-tuning for function calling, it already has a function-calling personality, and that's what we wanted to publish. So this model is for further fine-tuning, but it's an IT model. More typically, for larger models, certainly for the Gemma 3 family, we had both IT and PT checkpoints, depending on whether you have a large volume of data and want to do full fine-tuning yourself. I think there's a separate conversation on sovereign AI use cases that some of the DeepMind people are doing later in the week, and I imagine that will be a "start with our pre-trained checkpoint, add a huge corpus of sovereign or enterprise data, and fully fine-tune a 27-billion-parameter model" kind of story.

Another example here is EmbeddingGemma. I know it's technically not an LLM, but EmbeddingGemma is a text embedding model we published with the Gemma team last September, and it does text embeddings for RAG-type use cases. It's a very strong embedding model that only takes 300 million parameters. It may not technically be an LLM, even if it is a transformer inside, but it's another example of a tiny model that's very useful for on-device use cases. Okay.
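As a quick sketch of the RAG-style retrieval an on-device embedding model enables (the `embed` function here is a dummy stand-in, not EmbeddingGemma; in practice you'd call the real embedding model through the runtime):

```python
# Toy retrieval step for a RAG pipeline; `embed` is a dummy stand-in for an
# on-device text embedding model such as EmbeddingGemma.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(sum(map(ord, text)))   # fake but repeatable "embedding"
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

docs = ["How to reset the thermostat", "Warranty terms", "Pairing the speaker over Bluetooth"]
doc_vecs = np.stack([embed(d) for d in docs])           # indexed once, stored on device

query = "my speaker won't connect"
scores = doc_vecs @ embed(query)                        # cosine similarity (vectors are unit norm)
print("top match:", docs[int(np.argmax(scores))])       # with real embeddings, semantically closest doc
```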
I've kind of already covered this: AOT compilation is our workflow for hardware acceleration on device, while the JIT single-file flow is best for distributing small models to lots of platforms. And I covered this as well: LiteRT-LM itself is built on LiteRT, it supports all of these types of models, and you can find non-LLM models on Hugging Face too. That's relevant because, typically, when you're building a more complex app you need some things around the LLM, like a voice activity detection model or a denoising model. Many of these models are available using just the LiteRT runtime: LiteRT-LM is the runtime that has the full autoregressive loop and builds on LiteRT, and there's a set of LLM models available to use with LiteRT-LM, but there are also lots of supporting, non-autoregressive models available through the LiteRT API that are typically needed to deploy a full app. And this is what I was saying earlier about advanced usage: you can fully customize models, so if you want to write your own Llama or Moonshine or Phi or Qwen variant, you can, using the code in that directory. Okay, I'm probably on track to finish a little early.

So that's it in terms of the tiny-model examples and workflows. There's a lot we can do now with very tiny models. 500M-class models are available off the shelf for some standard features, and for smaller LLMs we've had a lot of success fine-tuning them for apps. What I'm going to show next is an example of an app we've built using tiny LLMs. This is an app that's available on iOS only, for some reason, called AI Edge Eloquent. It's a transcription app, but you may have noticed that, as a speaker, I say lots of "ums" and "ahs"; if you got the transcript of this presentation it wouldn't be a great transcript, because there would be lots of interjections. Eloquent was built for that type of transcription user story: it does dictation, but then there's a separate, automatic polish step that can remove all of the interjections and filler words. So if you want to dictate a message for later use, it's able to clean up those idioms of speech.

One of the other things it has is a biasing list. One of the things we found ourselves is that if we're talking about LLMs, you might say, "Oh, have you trained a LoRA for this?", and any standard transcription service, when you say LoRA, will transcribe it as the name "Laura," because that's how real people spell it. Instead, you can give it a list of keywords or technical terms, and the model will bias towards those words. So the key features are: (a) it runs entirely offline; (b) it has its own biasing dictionary, if you want to call it that; and (c) it cleans up the text; you can see in the middle it has this polish step. Overall, this gives you a pretty neat offline experience at no cost.
That goes back to the cost motivation at the start as well, because there are some paid services that do this, but here it can give you really clean, polished text.

>> Is that last step done by setting special tokens, or is it done at the end by overlaying a map or something?

Give me two slides and I'll answer your question, but good question. The question for the recording was: is the polishing done inside the main model or outside, and how is the biasing applied? We'll answer it in two slides, hopefully.

This is the personalization. You can add things like "LoRA," which is the example we always use because it's near and dear to our hearts. If you want to connect it to your Google account, it can import stuff from Gmail or similar, look for unusual words, and add them to the biasing list, or you can add them yourself. The sorts of things people usually put in here are names, because models frequently get uncommon names wrong, as well as technical terms. The names shown here are two of the people on the team who developed the app, so no surprise that's the example we have.

Okay, so this is the text polishing engine, as we're calling it at this stage. We have two steps here; first I'll describe the flow, then I'll talk about the models. Microphone input goes into a speech recognition engine that delivers an unfiltered transcription. Then, in the bottom half, we have the personalization flow, where a set of uncommon or unique words becomes the relevant terms. Both of these go into the text polishing engine, which is a dedicated mini LLM just for text polishing. We could probably have built one LLM to do all of this, but one of the realities of mobile development is that with tiny models there's also a modularity story: that same transcription engine in your app may have some other use case, and you can recoup the cost of those weights by using them in multiple places. This is a pattern we see emerging as we build more of these types of apps, that modularity playbook. So in theory those two models could be stitched together; in practice it's the more pragmatic choice to keep them separate.

>> It gives more depth as well.

Yeah, and you can also inspect what's happening in the middle, so it's easier to debug. We see the same thing a little bit with voice-to-function-calling as well, but that's another story for another day.
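Here's a rough sketch of how that two-stage pipeline plus the biasing list fit together. The `asr_model` and `polish_model` callables are stand-ins for the two fine-tuned on-device models, and the prompt format is invented for illustration; it isn't the app's actual prompt.

```python
# Sketch of the Eloquent-style pipeline: ASR -> polishing LLM, with a user
# biasing list injected into the polishing step. Models are stand-in callables.
from typing import Callable

def polish(raw_transcript: str,
           bias_terms: list[str],
           polish_model: Callable[[str], str]) -> str:
    # Invented prompt shape: the real polishing model is fine-tuned for this
    # task, but these are the same kinds of inputs it conditions on.
    prompt = (
        "Special words: " + ", ".join(bias_terms) + "\n"
        "Clean up the dictation below: remove filler words, fix anything that "
        "sounds like a special word, and apply corrections like 'scratch that'.\n"
        f"Dictation: {raw_transcript}"
    )
    return polish_model(prompt)

def transcribe_and_polish(audio_path: str,
                          bias_terms: list[str],
                          asr_model: Callable[[str], str],
                          polish_model: Callable[[str], str]) -> str:
    raw = asr_model(audio_path)                    # stage 1: unfiltered transcription
    return polish(raw, bias_terms, polish_model)   # stage 2: dedicated polishing LLM

# Example wiring with dummy models:
dummy_asr = lambda path: "um so have you trained a laura for um this model"
dummy_polisher = lambda prompt: "Have you trained a LoRA for this model?"
print(transcribe_and_polish("note.wav", ["LoRA", "LiteRT"], dummy_asr, dummy_polisher))
```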
The other thing to point out is that the ASR engine and the text generation engine have little Gemma logos in the corner, I don't know if you can recognize them. These aren't officially published Gemma models, but we've basically followed the same workflow we advise other people to use: we've taken the smaller Gemma model. The best way I'd describe it is that this is a derivative from the Gemma 3 270M lineage, and we've fine-tuned it, once to get a transcription engine and once to get a text polishing engine.

To give you a bit more insight into what that workflow generally looks like: we use a much stronger LLM in the cloud to generate lots and lots of synthetic data that corresponds to the kind of behavior we want. Once you have a few million synthetic examples, low single-digit millions or tens of millions depending on how ambitious you want to be, you put that into a fine-tuning workflow and fine-tune the base tiny model you're working with to get a derived model. We've used that same workflow internally to ship a stronger note-taking app, and the reason I'm showing it here is as a real-life example of how we're using Gemma-derived tiny models to build new production apps. We're using this flow for lots of other use cases internally as well, supporting Google first-party products, which those products will probably talk about themselves in time. But smaller Gemma models are really, really powerful for this type of use case, and we're seeing good mileage from it.

>> I'd combine it with the keyboard, because this would be very useful.

Yeah, and that's what the text polishing engine does. The polishing engine was instruction fine-tuned, I guess, to have a system prompt of "these are your special words; please correct anything that sounds like these words to these words," and also to remove interjections and lack of clarity, and even handle things like "I forgot to say this" or "scratch that."

>> Probably a skill?

No, in this case it's not a skill, because this is a tiny model; we've actually trained the behavior into the model. It's true that with the four-billion-parameter model maybe there's a version of this you could do as a skill, which would actually be an interesting side project. But for the tiny models, the playbook we're generally finding useful is synthetic data generation with a larger model, then picking an off-the-shelf model like Gemma 3 270M or similar and fine-tuning it. Combined with quantization, you can then ship a pretty compelling narrow feature to a very wide set of users, powered by an LLM that works on lots of devices.

>> Do you have a GitHub repo or somewhere with the workflow for how to fine-tune it? Because it's not just fine-tuning, right? You have to warm up the training and then start.

Yes. With the Gemma 3 270M publication we have a Colab notebook, and for both Gemma 3 270M and FunctionGemma, when we shipped those models, we published Colab notebooks that show how to do full fine-tuning.
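A rough sketch of the synthetic-data half of that playbook follows. `call_big_model` is a placeholder for whatever cloud LLM client you use, and the JSONL pair format is just one reasonable choice, not a prescribed one.

```python
# Sketch: generate (noisy dictation, polished text) training pairs with a large
# cloud model, then write them as JSONL for a fine-tuning job.
import json

def call_big_model(prompt: str) -> str:
    # Placeholder: swap in your actual cloud LLM call. A canned reply keeps
    # this sketch runnable end to end.
    return "um so remind me to uh send the report, scratch that, the slides, tomorrow"

def make_pair(topic: str) -> dict:
    clean = call_big_model(f"Write one natural dictated sentence about {topic}.")
    noisy = call_big_model(
        "Rewrite this sentence as messy dictation with filler words, false starts, "
        f"and a 'scratch that' correction:\n{clean}")
    return {"input": noisy, "target": clean}

def build_dataset(topics: list[str], path: str = "polish_train.jsonl") -> None:
    with open(path, "w") as f:
        for t in topics:
            f.write(json.dumps(make_pair(t)) + "\n")
    # The resulting JSONL then feeds a standard fine-tuning recipe for a tiny model.

build_dataset(["meeting notes", "shopping lists", "travel plans"])
```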
>> Okay, so regarding the fine-tuning: what's your experience? Let's take the function-calling example. You said you had about an 80% chance of hitting the ten functions.
>> We finished at 86%, yeah.
>> And that's presumably after fine-tuning. Could you say what the numbers were before?
>> Oh, wow. It was something like 40-something percent to 86%. And within that 86%, we had ten functions, and maybe two of those dragged our average down a lot. There were eight simple functions that were over 90% — more like 93% — so on very simple functions we had really, really high reliability, and then two functions pulled the average down to around 86%, which is where we finished. I forget the exact details, but there's a blog post you can read about it for more detail.
We see this with smaller models all the time. Our experience is that on a given eval, fine-tuning is worth between 20 and 40 points, so it's a really significant win for tiny models. When we're talking about 200 million parameters, or 270 million like we published, fine-tuning is essential for most things — unless you have a model that was already published to do a narrow task. You'll find transcription models at that size that work really well at one task and don't require further fine-tuning, but that's because they were scoped to that narrow task from the outset. If you want to do your own narrow task and ship it to lots and lots of devices, then fine-tuning is, at least for now, the workflow of choice. That may change.
>> Will we get a codelab that can target the web?
>> That's a good question. We have a Colab that gets you as far as the LiteRT file. Our support for LiteRT-LM on the web is a work in progress, so please look at our GitHub for the latest status.
>> Will we get a fine-tuning manual for Gemma 4 by any chance?
>> So what was announced last week for Gemma 4 was small and medium-sized models. In the past, for Gemma 3, we also published tiny models. Last week's were the first models we shipped for Gemma 4, so I imagine there will be more Gemma models in future. There is some fine-tuning material already available for the larger models, but the tiny models we've published at this point are Gemma 3 tiny models, and for FunctionGemma and Gemma 3 270M there are fine-tuning workflows available. For the Gemma 4 models that became available last week, there are fine-tuning recipes in Vertex AI and elsewhere, I believe. And if anybody is reading or listening to this talk in a few months' time, please search the web for the latest information. Also check out the other Gemma talks later in the week and you'll see and hear more.
Oh yeah, and the Eloquent app is actually available on iOS if anybody wants to give that a go.
>> Not in Europe.
>> Oh, okay.
>> Yeah, you can find it in the web browser but not on the store.
>> Wow, so that needs to be enabled, I guess. That's really helpful feedback — I'll pass that along.
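To make the per-function breakdown quoted earlier in this exchange concrete, here is a minimal sketch of how a function-calling eval of that shape might be scored: an overall accuracy number plus a per-function table, which is what exposes the couple of functions dragging the average down. The record format and function names are hypothetical, not the actual eval harness referenced in the talk.

```python
# Minimal sketch of a per-function function-calling eval.
# Each record pairs the expected tool call with what the model actually emitted;
# the record format and function names are hypothetical placeholders.
from collections import defaultdict

results = [
    {"expected": "set_timer", "predicted": "set_timer"},
    {"expected": "set_timer", "predicted": "set_timer"},
    {"expected": "query_calendar", "predicted": "search_web"},   # a harder case
    {"expected": "query_calendar", "predicted": "query_calendar"},
]

per_function = defaultdict(lambda: {"correct": 0, "total": 0})
for r in results:
    bucket = per_function[r["expected"]]
    bucket["total"] += 1
    bucket["correct"] += int(r["predicted"] == r["expected"])

overall = sum(b["correct"] for b in per_function.values()) / len(results)
print(f"overall accuracy: {overall:.1%}")
for name, b in sorted(per_function.items()):
    print(f"{name:>16}: {b['correct'] / b['total']:.1%} ({b['total']} cases)")
```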
Okay, and then to wrap up. The key takeaways: system GenAI medium-sized models — or small models, in our parlance — will be turning up in a mobile device near you, and those same models are excellent for use in embedded systems and embedded platforms. At least for now, given the memory we have in mobile phones, which doesn't look like it's getting larger anytime soon given the cost, tiny models are where it's at for deploying things widely. We hope to make those easier and easier to use, to have stronger and stronger models available through the partnership with Google DeepMind, and to make the fine-tuning workflows as easy as we can, to make all of this more accessible. But yeah, happy to share what we've been doing on both of these fronts. I think we have nine minutes if anybody else wants to ask questions. Happy to.
>> [question]
>> Yeah, great question — so, questions about safety on edge models. Firstly, the Gemma team spends an awful lot of time on this, and I would defer you to them for questions about safety for the models that were published last week; it's really top of mind for them. Additionally, for system GenAI within AICore, for example — and you'll probably hear more about this from other people — when that actually ships as part of the OS, rather than as the raw model we published last week, the system vendor will typically add some sort of input and output safety checker around the model, as a kind of aftermarket addition, because there are particular things they do and don't want for their product. So that's another layer. For smaller models, and for really tiny models, safety is still really important, but at least for us, the way we've deployed tiny models is within a very narrow API or task. You can imagine with Eloquent — the app you saw — that the risk profile is different: technically it's more of a regenerative app than a generative app, if you know what I mean; it's not going to fully make things up on the fly. So you can look at the scope of what the model is trying to do, the API surface and the functional surface. To make a tiny model work, it generally needs a narrower functional scope, and that means you still need to do safety, but it's a narrower problem you need to solve.
>> Is there a place where I can look up how to deploy the slightly bigger Gemma models on, say, a 5090? I have a 5090 and want to try them out on it.
>> Okay, so on a 5090: I know we don't have specific documentation for that, but our tooling does support NVIDIA GPUs. NVIDIA is also a Gemma partner — they have supported some of the Gemma launches in the past — so if you look on NVIDIA's own pages you may find something through their TensorRT-LLM. I can't point you to specific documentation as the fast answer, but those are the two places I would check.
>> Okay, sorry —
>> Yeah, that's all right.
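To illustrate the "input and output safety checker" layer described above — a vendor-added wrapper around the model, not something shipped with the raw Gemma weights or LiteRT-LM — here is a minimal sketch of what such an aftermarket filter around a local generate() call could look like. The classifier, the blocklist, and the generate hook are all hypothetical placeholders, not AICore's or any vendor's actual implementation.

```python
# Minimal sketch of an aftermarket input/output safety layer wrapped around an
# on-device model. The classify() heuristic, blocklist, and generate() hook are
# hypothetical placeholders; real deployments would use a proper safety classifier.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SafetyVerdict:
    allowed: bool
    reason: str = ""

def classify(text: str, blocklist: tuple[str, ...]) -> SafetyVerdict:
    # Stand-in for a real safety classifier; here just a keyword check.
    for term in blocklist:
        if term in text.lower():
            return SafetyVerdict(False, f"matched blocked term: {term}")
    return SafetyVerdict(True)

def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     blocklist: tuple[str, ...] = ("example-banned-topic",)) -> str:
    # Input check before the model ever sees the prompt.
    if not classify(prompt, blocklist).allowed:
        return "Sorry, I can't help with that."
    # Output check after generation, before anything reaches the UI.
    response = generate(prompt)
    if not classify(response, blocklist).allowed:
        return "Sorry, I can't help with that."
    return response

# Usage with a stub standing in for a real on-device model call:
print(guarded_generate("Summarize my meeting notes", lambda p: f"Summary of: {p}"))
```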
>> The examples you showed for agent skills are obviously single-skill executions — you run one skill. Would you change the architecture for multi-skill?
>> So we do actually support multi-skill execution. Within the app, if you download it, you can define which skills you want loaded and just toggle them on or off. And if you're very, very specific in your prompt — something like "look up this topic on Wikipedia, summarize it to three bullets, and then display it as flashcards" — if you're really specific like that, it's going to work. Frankly, we've only had this model a couple of weeks, so we're still putting miles on the clock and seeing where the boundaries are for skill stacking and skill chaining. Most of the examples we published were single-skill, but even in the diarization app, if you recall, Alice — the person doing the demo — asked it to summarize her mood, and what time she's meeting up with Amy, and then a Wikipedia lookup comes up for something else. So in that example she was able to show, within a single conversation, individual turns using individual skills. Skill stacking within an individual prompt — I believe we've seen that work where you're pretty explicit, but frankly we're still learning the boundaries of what's possible with this class of models.
>> Sorry, a follow-up question. The decision of identifying which skills are relevant for a given problem — given that it's a small model, do you do instruction tuning from a larger model to build in that intelligence, or is it just really good at it?
>> The model was trained to be really good at agentic workflows and at function calling in general. Nothing in the Gemma model was trained specifically for our skill pattern; that came afterwards. When we got the model and started playing with it, it was like, wow, this works — and then, can we do this? So there's no specific training for our skill structure. The Google DeepMind team spent lots of time on general-purpose thinking and function calling — they did lots of great work there to give us a general-purpose model that's really strong — but there's nothing special in it for our app. So if you had a slightly different take on skills, and maybe there's a better skill architecture than the one we've shown you today, it's entirely possible you could expect to be pretty successful. There's no hidden secret sauce in what we've shown.
>> Yeah. Uh — sorry, I think you're maybe next.
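Since the skill format itself isn't shown in the talk, here is a rough sketch of the pattern being described: a handful of skills declared as tool schemas, toggled on or off, with a function-calling model picking one per turn and the app executing it. The skill names, schema shape, and model interface are assumptions for illustration, not the AI Edge Gallery's actual skill format.

```python
# Rough sketch of the "skills as function calls" pattern described above.
# Skill definitions, the toggle mechanism, and the model interface are all
# hypothetical; this is not the actual skill format used in the gallery app.
import json
from typing import Callable

SKILLS: dict[str, dict] = {
    "wikipedia_lookup": {
        "description": "Look up a topic on Wikipedia and return a short extract.",
        "enabled": True,
    },
    "make_flashcards": {
        "description": "Turn bullet points into flashcards.",
        "enabled": False,   # toggled off, so it is never offered to the model
    },
}

HANDLERS: dict[str, Callable[..., str]] = {
    "wikipedia_lookup": lambda topic: f"(extract about {topic})",
    "make_flashcards": lambda bullets: f"{len(bullets)} flashcards",
}

def run_turn(user_prompt: str, model_call: Callable[[str], str]) -> str:
    # Only enabled skills are advertised to the model in the prompt.
    enabled = {k: v["description"] for k, v in SKILLS.items() if v["enabled"]}
    prompt = (
        "You can call exactly one of these tools, replying with JSON "
        '{"tool": ..., "args": {...}}:\n'
        + json.dumps(enabled)
        + f"\nUser: {user_prompt}"
    )
    reply = json.loads(model_call(prompt))           # model picks the skill
    return HANDLERS[reply["tool"]](**reply["args"])  # app executes it

# Usage with a stub standing in for the on-device model:
stub = lambda p: '{"tool": "wikipedia_lookup", "args": {"topic": "LiteRT"}}'
print(run_turn("Tell me about LiteRT", stub))
```

Chaining several skills in one prompt would extend this to multiple tool calls per turn, which is the "skill stacking" the speaker says is still being explored.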
>> I'm building something like that, and the first big challenge is context — particularly once you start doing the agentic stuff. Can you talk a little bit about the context window on the —
>> On the E2B? Okay. Yeah, I would defer you to the Gemma team for official guidance. The medium-sized models have a context window of 128K, and for the smaller models I would actually need to double-check.
>> 32K.
>> 32K, yeah, that's what I thought. In our implementation, in the gallery, we default to something like 8K or 12K just for performance reasons, but the models do support more: the E2B and E4B support up to 32K, and the other models support up to 128K.
>> With 32K you use a lot of memory, though.
>> I wonder if we have stats for that in the model card. For the E2B and E4B models, the memory footprint for a larger context is not as bad as you'd think. The team put a lot of optimization into that metric for those models, because they were targeted at edge use cases, so the amount of KV cache required for each input token was something that was optimized. The models behave pretty well on that front. I don't have a bytes-per-input-token number to give you off the top of my head, but it's good for its model class, is what I would say.
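For intuition on that "bytes per input token" figure, here is the standard back-of-the-envelope KV-cache estimate for a decoder-only transformer. The layer count, KV head count, head dimension, and dtype below are made-up illustrative values, not Gemma 3n's actual configuration; edge-targeted models also use techniques such as sliding-window attention that make the real footprint smaller than this naive estimate suggests.

```python
# Naive KV-cache size estimate for a decoder-only transformer:
# bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# The example numbers are illustrative placeholders, not Gemma 3n's real config.
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len

# Hypothetical small edge model: 30 layers, 4 KV heads of dim 128, fp16 cache.
size_32k = kv_cache_bytes(num_layers=30, num_kv_heads=4, head_dim=128,
                          context_len=32_768)
print(f"naive estimate: ~{size_32k / 2**20:.0f} MiB of KV cache at a 32K context")
```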
>> And a second question: do you have an iOS version?
>> Yeah, the AI Edge Gallery works on both iOS and Android.
>> In terms of being open source — is there a macOS version?
>> That's a good point; I'll put that in the coming-soon bucket. It's certainly one of the items on our to-do list, because the iOS app we only published for the first time in January, whereas the Android app has been available since last summer. But our intention is to have a what-you-see-is-what-you-get experience for developers: you can use the app, have fun, experiment with the models, and then also get the source code and see how it's built, and so on. So yeah, that's certainly our intention.
>> Probably the last question, because we're at time. Is there a trade-off between fine-tuning individual models for a specific task and the actual amount of memory they consume on devices — say, if you're chaining models together?
>> To clarify, which model is your question about?
>> Whatever I just downloaded — the E2B.
>> The E2B model, yeah. So if you've got multiple models — for the E2B we would recommend customization via skills or via prompting, not via fine-tuning. For the small models that are published, we recommend customization through skills and prompting; for tiny models, we recommend customization through fine-tuning, if you're deploying a smaller model. The other path that's available to you for the small models is LoRA fine-tuning. I know Apple supports that in their foundation models framework, and you can check with the AICore speaker whether it's on their roadmap — there's an AICore speaker with an AMA coming up, so you can ask him about that. But certainly if you were deploying on an embedded system — if somebody asked me about deploying the 2B on a robotics platform — I would say absolutely, you should fine-tune LoRAs for each of your tasks. Our runtime supports loading the model and hot-swapping LoRAs, so you don't even need to load and unload the model to load and unload LoRA adapters; it's built for that particular use case, for robotics or IoT platforms. And those LoRAs are much smaller — it depends on the rank you choose, but they're maybe 8 to 100 megabytes, somewhere in that range.
>> Thanks a lot.
>> Yeah. Cool. All right, that's a wrap. I'm going to let everybody get lunch. [applause] Thank you. [music]