Accelerating AI on Edge — Chintan Parikh and Weiyi Wang, Google DeepMind
Channel: aiDotEngineer
Published at: 2026-05-05
YouTube video id: Lm8BLHkxiAo
Source: https://www.youtube.com/watch?v=Lm8BLHkxiAo
Afternoon everyone. We'll get started. My name is Chintan Parikh, product manager for LiteRT, which is part of Google AI Edge. I also have my colleague Weiyi here; he'll be joining us for the Q&A part of this session. Quick show of hands: how many of you are working on deploying on edge, or are keen on learning what the big benefits are going to be? Okay. And some of you are already deploying. So I'm going to go through the slides, and I'll try to leave it open so we can also discuss any use cases or things you're working on. Great.

Here's a quick set of agenda items. I'm going to go through some of the new models that are coming up, some edge use cases that will be relevant, show you some of the new capabilities in the gallery app, and then we'll go through our stack for deploying AI on edge devices. In addition, we want to emphasize the cross-platform support we offer, because I think a lot of you are looking to deploy on more than just mobile platforms.

Google DeepMind recently launched Gemma 4, and this talk will focus a little more on the 2B and 4B edge models that are aimed at on-device deployment. Google certainly has a big suite of models, including the Gemma 3 family, which also comes in smaller sizes, all the way down to 270 million parameters. So if you're looking for extremely small models that you can fine-tune, we have a Hugging Face page, which I'll go over, that has more of these models. The big evolution with Gemma 4 is really moving from chatbot-type capabilities to more autonomous agents that also support reasoning and more sophisticated features, and a lot of it is about how you can deploy these on different devices.

Running on edge has many benefits. My colleague went over some of these yesterday, so I'll run through them quickly. Latency is important: for real-time camera use cases like filters where you're replacing backgrounds, or video calls, real-time latency is king, and on-device can help with that. Privacy is critical for any use case with sensitive details, for example summarizing sensitive documents. Offline use cases matter where you have poor connectivity. And cost: I've seen so many presentations at this AI Engineer event where people are complaining about the number of tokens being used, and on-device offers a hybrid approach where you can decide how to offset running things on the edge versus in the cloud and find the best balance.

Cool. In terms of models, there are two, and a lot of the use cases I'll show are built on these. The Gemma 4 E2B needs roughly 1 to 2 GB of RAM, so it could be usable, but it depends on your end use case.
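As a rough sanity check on that 1 to 2 GB figure (my own back-of-the-envelope arithmetic, not a number from the talk, and treating the E2B as roughly 2 billion parameters), weight memory is simply parameter count times bytes per weight, with activations and the KV cache added on top:

```python
# Rough, illustrative arithmetic only; actual RAM depends on the quantization scheme,
# context length, and runtime overhead.
params = 2e9                          # ~2B parameters for an E2B-class model (assumption)
bytes_per_weight_int4 = 0.5           # 4-bit quantized weights
bytes_per_weight_int8 = 1.0           # 8-bit quantized weights

weights_int4_gb = params * bytes_per_weight_int4 / 1e9   # about 1.0 GB
weights_int8_gb = params * bytes_per_weight_int8 / 1e9   # about 2.0 GB

print(f"int4 weights ~{weights_int4_gb:.1f} GB, int8 weights ~{weights_int8_gb:.1f} GB")
# KV cache and activations add more, which is why real usage lands in a 1-2 GB range.
```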
Again, the E2B is certainly good for voice interfaces, summarization, or any kind of low-latency local processing. The E4B is a little more heavy-duty if you're looking to run on bigger platforms like laptops or IoT devices; it has a higher RAM requirement. And again, these numbers apply once the model has been quantized to your desired size.

I want to do a quick deep dive on what's new this time in the Gemma 4 E2B and E4B in terms of agent capabilities. I'll show you what's already there in the next couple of slides in terms of use cases, and here I'll touch on exactly what the new capabilities are. Function calling: built-in support for tool calling, so the model can interact with other local APIs. You can start the inferencing on edge but still have a path to call other APIs outside; I'll show that in some of my examples, but the core of the inferencing stays on the edge. Structured JSON output: native support for structured JSON output is built into the model architecture rather than achieved through prompt engineering, and I'll show examples of that too. Chain of thought is new: there is a thinking mode, which our gallery app will demonstrate, that helps you understand the thought process the model is going through. And finally, these models are optimized for hardware: native support means you can run seamlessly across multiple platforms and hardware targets, and we really want to give you the flexibility of deploying to various platforms.

So just for starters, if anyone's thinking, okay, where can I get these models? They're already ready to go on our Hugging Face page. They're all Apache 2.0 licensed, so you can download them and start building with them, and we'll give you some paths for how you may want to do that.

All right, let's talk about some use cases now. These are just some of the many possible use cases, and they are possible today. This is our gallery app, which we have been demoing down on the third floor; we also have a demo slam session after this for anyone keen on learning more, and I have QR codes for it in the next couple of slides. What's new is that the gallery app lets you try out the agent skill capabilities, and it also has audio transcription, ask-image, and other chat-related features. All of this is happening on device. The purpose of this app, created by Google, is to give you a playground to get a feel for what these models are capable of. Each of these capabilities has sample code that anyone can grab, and you can fork the app to build your own experiences; it's essentially there to inspire and motivate you to build your own. The next couple of slides focus on some of the Gemma 4 edge use cases.
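To make the function-calling and structured-JSON-output ideas concrete at the application level, here is a minimal, purely illustrative sketch. The JSON wire format and the get_sleep_hours tool are hypothetical names invented for this example; the actual Gemma 4 tool-calling format was not shown in the talk, so treat this as a shape of the flow rather than the real API.

```python
import json

# Illustrative only: assumes the model emits a structured JSON object such as
#   {"tool": "get_sleep_hours", "args": {"days": 7}}
# when it decides to call a local API. All names here are hypothetical.

def get_sleep_hours(days: int) -> list[float]:
    # Stand-in for a local API, e.g. reading from an on-device health store.
    return [7.5, 8.0, 6.5, 7.0, 8.0, 7.5, 8.0][:days]

LOCAL_TOOLS = {"get_sleep_hours": get_sleep_hours}

def dispatch_tool_call(model_output: str) -> str:
    """Parse the model's structured JSON output and route it to a local function."""
    call = json.loads(model_output)          # leans on the structured-output guarantee
    result = LOCAL_TOOLS[call["tool"]](**call["args"])
    # In a real app this result would be fed back to the model for the final answer.
    return json.dumps({"tool": call["tool"], "result": result})

if __name__ == "__main__":
    fake_model_output = '{"tool": "get_sleep_hours", "args": {"days": 7}}'
    print(dispatch_tool_call(fake_model_output))
```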
These emphasize the voice agent and local agent capabilities, and a lot of them are privacy focused as well. There are three key pillars that are new here versus what was already supported in the previous slides. So let's dive into the Gemma 4 use cases. Let's see this play.

Here's one use case where you're augmenting knowledge: for example, you can build a skill to query Wikipedia, allowing the agent to respond to any encyclopedia question. This is a skill that's already available in our app, so if you're looking to build something like this, it's possible. These are new skills in addition to basic on-device skills like summarization or ask-image.

This is another category: a journal entry with a score and comment. >> I only got 8 hours of sleep and I'm looking forward to heading out with Amy today. >> Here you see it creating, summarizing, and displaying trends of hours of sleep. You can build a quick agent that helps you track your mood: "analyze the trend in my mood over the last seven days." You can see that it does all of this on device; it takes your input, feeds it to the model, and understands it. The reasoning and thinking capabilities are new in this model, and the on-device capabilities have gotten a lot more powerful.

Another example is expanding the core capabilities. For instance, you can also pair photos: it reads the photo, does image understanding, and generates music, all on device. Let's try this. >> Breakfast. I'm sending a photo. Can you pair this vibe with some music? >> It takes a few seconds. A lot of these skills can be written by yourself inside the app, so you don't even need to leave it; the app has instructions on how to do this. Again, the sole focus is to help you understand the capabilities of the model and build something you really like.

One last one: say you want a working app that describes, for example, local animal calls. Here we're navigating multiple apps and users, and you can manage a more complex workflow in this case. You can also switch between CPU and GPU, and hopefully we'll have NPU support soon, so you can decide which accelerator to use. This was sound generation purely from the person's prompt; again, the skill was set up, created, loaded, and is running on device, and this next one is similar. So there's a lot you can do with it.

There's also a whole GitHub repo where users are posting their own skills and sharing them with the community. You're welcome to explore that or download one of these apps. The sample app itself is open source on GitHub as well, so you're welcome to fork it and make changes if you like. The GitHub link is here, and instructions for creating your own skill are here too. These are just examples of what folks have done.
This QR code has instructions on how to build a skill, so if you're looking for guidance on how to do that, you can get started there.

All right. The next part of the talk focuses on deploying this on edge devices. We've talked about what's possible with the Gemma models; now, the framework that's used here. Google's offering for bring-your-own-model scenarios is LiteRT. LiteRT is Google's on-device framework, built on TensorFlow Lite, if you're familiar with that. Is anyone familiar with TensorFlow Lite? Some of you. Awesome. That's what this is built on, and it uses the same TensorFlow Lite model format, which is what we're focusing on at this point.

The point is that the underlying framework for the app running all these experiences is LiteRT, and it's been one of the most widely deployed frameworks so far: 100,000-plus apps, billions of active users, and a huge number of daily interpreter invocations, which is essentially the number of inferences happening every day. Why this matters is to show that we're building on a trusted foundation. We still have the TFLite file format, which is important because if you were a developer building with TensorFlow Lite, your models will still run on LiteRT, and the same model format is cross-platform: not just Android, but iOS, macOS, Linux, Windows, web, and even IoT devices. I'll also show some IoT examples that we were actually building just this morning.

From a development-flow standpoint, you have your model files. One of the bigger motivations for the rebranding to LiteRT was to demonstrate that it's not just TensorFlow Lite models we accept, but also PyTorch and JAX models. So if you're working with a PyTorch model, you can take it, convert it to the TFLite file format, and then go through this journey and deploy. We have a lot of sample apps and so on, but I won't go through the whole journey here. Simply put, you get portability of the model and multi-framework support.

This is a complete solution that provides one unified cross-platform architecture. Maybe a quick question: how many of you are looking to deploy on multiple Android or iOS devices? You want something you build that you can test easily on many devices, because if you're trying to build apps you might ask, okay, I made it, but how do I know it's going to work on five-year-old and six-year-old phones? We have options for that too, and I'll walk you through the stack. LiteRT Torch is going to be your conversion path; it's bring-your-own-model. If you found a model you like and want to run, you convert it to the TFLite format, and you can quantize it if needed. If it's an LLM, you go through the LiteRT-LM path; if it's not, you go through LiteRT.
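To make that conversion path concrete, here is a minimal sketch assuming the open-source ai-edge-torch Python package, which is the publicly documented PyTorch-to-TFLite converter; verify the API against the current docs before relying on it.

```python
# Hedged sketch: assumes the ai-edge-torch package (pip install ai-edge-torch) and its
# documented convert/export flow.
import torch
import torchvision
import ai_edge_torch

# Any eval-mode PyTorch model works as the "bring your own model" starting point.
model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()

# Conversion traces the model with sample inputs of the shapes you plan to serve.
sample_inputs = (torch.randn(1, 3, 224, 224),)
edge_model = ai_edge_torch.convert(model, sample_inputs)

# The exported .tflite file is the same cross-platform format LiteRT runs on
# Android, iOS, desktop, web, and IoT devices.
edge_model.export("mobilenet_v2.tflite")
```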
What's interesting here is that we also have the Model Explorer tool, which can help you explore the graph and decide which aspects of it you want to change and quantize. You can study the graph and decide how best to do mixed precision, or quantize the whole model as you like. The AI Edge Portal is a benchmarking tool that could be of interest for those looking to deploy broadly on Android. It's a cloud-based benchmarking service, and a lot of our third-party app developers, and even internal teams, use it to get a good pulse check: if I have a model with this many parameters, do I need ahead-of-time compilation or just-in-time compilation? What's the right recipe to make it deployable across a broad fleet of devices and reliable in that way?

One more thing on this slide: the next part is acceleration. CPU and GPU are pretty universal right now; these are libraries that help you run on CPU and GPU, and with this framework you can deploy on multiple platforms with CPU and GPU. NPU acceleration is another focus. We have completed integrations with Qualcomm and MediaTek, and we are working on additional integrations with other partners on different platforms. I think this is going to be a game-changer for products running ASR, TTS, or anything requiring real-time capability, or for AR/VR applications where you want real-time updates to camera feeds. The NPU can give you at least a 3 to 10x improvement in performance, and it's going to be a game-changer in terms of both the energy you use and the performance you need to unlock those use cases. On the flexibility side, we have ahead-of-time and on-device compilation, and for ease of use there are many options for how we simplify things, with a lot more documentation and support.

All right, next I'm going to go over some performance numbers. So far we've shown what you can do at the app level, what's possible at the framework level, and how you can scale it. In terms of coverage, just with the Gemma models, this is where we are right now. A lot of people come to our demo booth and ask, can you just do this on Android? No, we've tested this on all of these platforms: Android, iOS, Linux, and Raspberry Pi as well, so the models on Hugging Face can actually be tested on many of these platforms.
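Once you have a .tflite file, running it looks roughly like this: a minimal sketch assuming the ai-edge-litert Python package, whose Interpreter mirrors the classic TensorFlow Lite interpreter API; check the current LiteRT docs if the module or names have shifted.

```python
# Hedged sketch: assumes the ai-edge-litert package (pip install ai-edge-litert).
# "model.tflite" is a placeholder for whatever model you converted or downloaded.
import numpy as np
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input with the shape/dtype the model expects.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)

interpreter.invoke()  # runs on CPU by default; GPU/NPU come in via delegates
output = interpreter.get_tensor(output_details[0]["index"])
print(output.shape)
```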
And here's a quick demo we did just this morning. It's going to take a while, but this is our little robot that's sitting down at our demo booth. We just made it this morning, so it's not super performant, but what we did is show the robot a sign that says "move your antenna." Now the two lights are blinking, so it's doing inferencing. It's running on a Raspberry Pi, on the CPU, on our LiteRT-LM, and you'll see in a few seconds that it's going to wiggle the Sharpies that are its antennas. And there it is. There are other prompts, like "Will you marry me?" and things like that; if you're interested, you can try it out downstairs. Cool. Yeah, that was amusing, but we do need to work on the performance, because we just made it this morning.

We also have a CLI tool for those looking to deploy who want an easier way of doing it. That new CLI tool, with Python binding support, is available on our website, so if anyone's interested you'll find it there.

In terms of performance, I have some quick numbers. I just want to emphasize that running on some of these new NPU accelerators can give you a big advantage: up to a 13x boost in some cases. We have iOS performance as well, roughly 56 tokens per second and so on. Desktop is here too. A lot of these numbers can be found on our Hugging Face page; the quantized models are already listed there, so I'm just pulling from there. When you download our models, you'll get performance details for all these platforms, including IoT. Our runtime is also very performant versus llama: on mobile we've seen up to 35 times faster performance, on desktop it's at par, and on IoT we see about 3x.

All right, I have a minute left, so in summary: LiteRT supports all these frameworks, so if you're looking to bring models from different frameworks, you can. We have the models on Hugging Face, and we also have the gallery app, which is a nice playground for you to try this out. So if there are any questions, I'm happy to take them.

>> Yeah, thank you. Can it recognize faces? Say the security camera at my home recognizes that my son came home. Pushing this to the cloud would cost an enormous amount of money, and since this can be done locally, maybe it could be a security feature of my home camera.
>> Yeah. On your phones, say iPhones or Pixels, when you do face unlock, it's typically running locally already, and on Google devices that uses this same framework. Apple's face unlock is also on device, on Core ML. So that is possible, yes.
>> And would you stream it constantly, say every 2 seconds, to recognize the face? How would you typically do that?
>> Yeah, you have the camera on and you can stream it, but for your home camera I think it would be a slightly different story. You'd maybe have to do some sort of algorithm to check authenticity, so it would be a little different from how it's deployed on the phone, but yes, you can totally run that on device.
>> That would consume a lot. It would be better to hook up to the camera and only message your phone when that individual is recognized.
>> Right. Right.
>> Yeah, I'm talking about a local device connected to my camera that is actually checking the frames from the camera.
>> A Raspberry Pi could do it, right. Yeah, you have a question.
>> Do you want to answer that?
>> Uh, let me see. I don't know if I have it in here. Not here, but I think we have it on our Hugging Face page; I don't think I added it here.
>> Have you tried any exercise where you have multiple nodes running a model, and if any of them triggers something, they connect to a higher-level agent, another agent that processes that?
>> I think we have some groups working on, say, speaker-agent plus thinking-agent types of architectures. I don't have any examples to share, but that is one way of distributing the work, with orchestration to decide what should run locally and what should run elsewhere. So I don't have examples here, but yes, there are different flavors of this; you can think of a health agent or a coaching agent of sorts.
>> I think it's a pretty common practice: you have a classifier, or you train a very small model, just to classify the complexity of the request.
>> Yeah, that's actually pretty common practice, right.
>> Thank you very much. Sorry to interrupt; we'll take the next one.
>> Thank you, yeah.
>> I have a mobile app which is currently using an API.
>> Right.
>> It's basically audio to text. Is there anything on device for that?
>> If you have open-weight models that you prefer, we should be able to support them; we just need to make sure we get them in the right file format, and that the sizes work out. If you know which models you're thinking about for your application...
>> Not these, at least not the E4B.
>> Here we've just focused on Gemma because of the DeepMind track, but our Hugging Face page has other open-weight models. Some of them we provide for ease of use; others you're welcome to convert and use yourself. But yeah, if there's a model you've identified, we can certainly discuss it.
>> Thank you.
>> Yeah, thanks.