Accelerating AI on Edge — Chintan Parikh and Weiyi Wang, Google DeepMind

Channel: aiDotEngineer

Published at: 2026-05-05

YouTube video id: Lm8BLHkxiAo

Source: https://www.youtube.com/watch?v=Lm8BLHkxiAo

Afternoon, everyone. We'll get started. My name is Chintan Parikh, product manager for LiteRT, which is part of Google AI Edge. My colleague Weiyi Wang is here as well, and he'll join us for the Q&A part of this session. Quick show of hands: how many of you are working on deploying on edge, or are keen on learning more about what the big benefits are going to be? Okay, and some of you are already deploying. I'm going to go through the slides, and I'll also try to leave it open so we can discuss any use cases or things you're working on.

Great. Here's a quick set of agenda items. I'm going to go through some of the new models that are coming up, some edge use cases that will be relevant, show you some of the new capabilities in the Gallery app, and then go through our stack for deploying AI on edge devices. In addition, we want to emphasize the cross-platform support we offer, because I think a lot of you are looking to deploy on more than just mobile platforms, so we'll cover some of
that as well. Google DeepMind recently launched Gemma 4, and this talk is going to focus a little more on the 2B and 4B edge models that are aimed at on-device deployment. Google certainly has a big suite of models, and even the Gemma 3 family comes in smaller sizes, all the way down to 270 million parameters. So if you're looking for extremely small models that you can fine-tune, we have a Hugging Face page, which I'll go over, that has more of these models. The big evolution with Gemma 4 is really the move from chatbot-style capabilities to more autonomous agents that also support reasoning and more sophisticated features, and a lot of the story is how you can deploy these on different devices.
Running on edge has many benefits. My colleague went over some of these yesterday, so I'll run through them quickly. Latency is important: for real-time camera use cases like filters where you're replacing backgrounds, or video calls, real-time latency is king, and on-device can help with that. Privacy is critical for any use case with sensitive details, like summarizing sensitive documentation. There are offline use cases where you have poor connectivity. And there's cost: I've seen so many presentations at this AI Engineer event where people are complaining about the number of tokens getting used. On-device always offers a hybrid approach: if you're asking how to balance running things on the edge versus the cloud, and where to get the best trade-off, there's something here.
Cool. In terms of models, there are two, and a lot of the use cases I'll show are built on these. The Gemma 4 E2B uses roughly 1 to 2 GB of RAM, so it's usable, though it depends on your end use case; it's certainly good for voice interfaces, summarization, or any kind of low-latency local processing. The E4B is a little more heavy-duty, aimed at bigger platforms like laptops or IoT devices, and has a higher RAM requirement. Again, this is once the model has been quantized to your desired size.
I want to do a quick deep dive on what's new this time in the Gemma 4 E2B and E4B in terms of agent capabilities. I'll show what's already there in the next couple of slides in terms of use cases, and touch here on exactly what the new capabilities are. Function calling: built-in support for tool calling, which essentially lets the model interact with other local APIs. You can start the inference on edge but still have a path to call other APIs outside, which I'll show in some of my examples, while the core of the inference stays on the edge. Structured JSON output: native support for structured JSON output, built into the model architecture rather than achieved through prompt engineering; I'll show examples of that too. Chain of thought is new: there is a thinking mode, which our Gallery app (which I'll showcase) will demonstrate, and it helps you see the thought process the model is going through. And finally, these models are optimized for hardware-native support, which means you can run them seamlessly across multiple platforms and multiple hardware targets; we really want to give you the flexibility of deploying to various platforms.
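To make the function-calling and structured-JSON ideas concrete, here's a minimal sketch of one tool-calling turn. The JSON protocol, the stub tool, and the generate() wrapper are illustrative assumptions, not the actual LiteRT or Gemma API:

```python
# Minimal sketch of a tool-calling turn with structured JSON output,
# assuming a generic generate(prompt) -> str wrapper around the
# on-device model. Everything here is hypothetical wiring.
import json

TOOLS = {
    # Stub for a local API the model is allowed to call.
    "get_battery_level": lambda: "87%",
}

SYSTEM = (
    'Reply ONLY with JSON: {"tool": "<name>", "args": {}} to call a tool,\n'
    'or {"answer": "<text>"} to answer directly.\n'
    "Available tools: get_battery_level()\n"
)

def run_turn(generate, user_msg):
    # Model natively emits structured JSON, so we can parse directly.
    reply = json.loads(generate(SYSTEM + "User: " + user_msg))
    if "tool" in reply:
        # Dispatch to the local API, then feed the result back in.
        result = TOOLS[reply["tool"]](**reply.get("args", {}))
        reply = json.loads(generate(SYSTEM + "Tool result: " + result))
    return reply["answer"]
```

The point of the pattern is that inference stays local; only the tool call itself touches anything outside the model.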
For starters, if anyone's wondering where to get these models: they're already ready to go on our Hugging Face page. They're all Apache 2.0 licensed, so you can download them and start building with them, and we'll give you some paths for how you may want to build with them.
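As an illustration, pulling one of these checkpoints could look like the sketch below, using the huggingface_hub client. The repo and file names are placeholders, so check the actual Google AI Edge page on Hugging Face for the real ones:

```python
# Sketch: downloading a pre-converted on-device model from Hugging Face.
# The repo_id and filename below are placeholders, not real artifact names.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="google/gemma-edge-placeholder",   # placeholder repo id
    filename="model_quantized.tflite",         # placeholder filename
)
print("Downloaded to:", model_path)
```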
All right, let's talk about some of the use cases now. These are just some of the many possible use cases, and they're all possible today. This is our Gallery app, which we've been demoing down on the third floor, and we also have a demo slam session after this for anyone keen on learning more; I have QR codes for it in the next couple of slides. What's new at present is that the Gallery app lets you demonstrate the agent skill capabilities, and it also has Audio Scribe, Ask Image, and other features related to chat experiences.
Um all of this is happening on device.
The purpose of this app that is created
by Google is to help you get a
playground to get a feel for what these
models are capable of. Uh each of these
capabilities has sample code as well
that anyone can go and grab and then you
are able to also fork this app to build
your own experiences. But this is
essentially to inspire and motivate you
to build your own experiences.
The next couple of slides focus on some of the Gemma 4 edge use cases. These emphasize the voice agent and local agent capabilities, and a lot of them are privacy-focused as well. There are three key pillars that are new here versus what was already supported in the previous slides.
So let's dive into the Gemma 4 use cases. Let's see this play. Here's one use case where you're augmenting knowledge: for example, you can build a skill to query Wikipedia, allowing the agent to look things up and respond to any encyclopedia question. This skill is already available in our app, so if you're looking to build something like this, it's possible. These are new skills in addition to basic on-device skills like summarization or Ask Image.
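For a rough idea of what such a knowledge-augmentation skill does, here's a hypothetical sketch: the model runs on device, and only the encyclopedia lookup leaves it. The skill wiring is illustrative; the Gallery app defines its own skill format:

```python
# Sketch of a knowledge-augmentation tool the on-device agent could call.
# Uses the public Wikipedia REST summary endpoint; everything around it
# (how the skill is registered) is left out and app-specific.
import requests

def wikipedia_summary(topic: str) -> str:
    """Fetch a short encyclopedia summary for the agent to cite."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{topic}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "No summary found.")

print(wikipedia_summary("Raspberry_Pi"))
```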
Here's another category: let's see if it can create a journal entry with a score and comment.
>> I only got 8 hours of sleep, and I'm looking forward to heading out with Amy today.
>> Here you're going to see it create, summarize, and display trends of hours of sleep. So you're going to build a quick agent that helps you track your mood: "Analyze the trend in my mood over the last seven days."
You can see that it's able to create all of this on device: it takes your input, feeds it in, and understands it. The reasoning and thinking capabilities are new in this model, and the on-device capabilities have gotten a lot more powerful. Another similar example expands the core capabilities: for instance, you can also pair photos, and it will read the photo, apply image understanding, and generate music, all on device. So let's try this.
>> Breakfast. I'm sending a photo. Can you pair this vibe with some music?
It takes a few seconds, and then there it is. A lot of these skills can be written by yourself in the app itself, so you don't even need to leave the app; the app has instructions on how to do this. Again, the sole focus is to help you understand the capabilities of the model and build something you really like, so that's useful. One last one is a working app that recognizes, say, local animal calls; here we're navigating multiple apps and users, so you can manage a more complex workflow in this case.
Here's another example, and you can also switch between CPU and GPU (and hopefully we'll have NPU support soon), so you can decide which accelerator you want to use. This was sound generation purely from the person's prompt; again, the skill was set up, created, and loaded, and it runs on device, and this one is something similar. So there's a lot you can do with it. There's a whole GitHub repo where users are posting their own skills and sharing them with the community, so you're welcome to explore that, or to download one of these apps. Another option: the sample app itself is open source on GitHub as well, so you're welcome to take it, fork it, and make changes to it if you like. The GitHub link is here, and creating your own skill is covered here too. These are just examples of what folks have done. This QR code points to instructions on how to build a skill, so if you're looking for guidance on how to do this, you can get started.
All right. The next part of the talk focuses on deploying this on edge devices. We've talked about what's possible with the Gemma models; now, the framework used here. Google's offering for bring-your-own-models is LiteRT, essentially Google's on-device framework. It's built on TensorFlow Lite, if you're familiar with that. Is anyone familiar with TensorFlow Lite? Some of you. Awesome. That's what this is built on, and it uses the same TensorFlow Lite model format, which is what we're focusing on at this point. The point is that the underlying framework for the app that was running all these experiences is LiteRT, and it's been one of the most widely deployed frameworks so far: 100,000+ apps, billions of active users, and lots of daily interpreter invocations, which is essentially the number of inferences happening every day. Why this is interesting: it shows we're building on a trusted foundation. We also keep the TFLite file format, and this is very important, because if you were a developer building with TensorFlow Lite, your models will still run on LiteRT. The same model format is cross-platform, so it's not just Android; you can run it on iOS, macOS, Linux, Windows, the web, and even IoT devices, and I'll show some IoT examples we were actually building just this morning.
Cool. From a development-flow standpoint, you start with your model files. One of the bigger motivations for the rebranding to LiteRT was to demonstrate that we accept not just TensorFlow Lite models but also PyTorch and JAX models. So if you're working with PyTorch models, you can take one of those models, convert it to the .tflite file format, and then go through this journey and deploy. We have a lot of sample apps and things like that, but I won't go through the whole journey here. Simply put, you get portability of the model and multi-framework support.
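Here's a minimal sketch of that PyTorch conversion path, following the ai-edge-torch package's convert-and-export pattern. The model choice and filenames are just examples, not from the talk:

```python
# Sketch: converting a PyTorch model to the .tflite (LiteRT) format
# with ai-edge-torch. MobileNetV2 here is only an example model.
import torch
import torchvision
import ai_edge_torch

model = torchvision.models.mobilenet_v2().eval()   # any torch.nn.Module
sample_inputs = (torch.randn(1, 3, 224, 224),)     # inputs used for tracing

edge_model = ai_edge_torch.convert(model, sample_inputs)
edge_model.export("mobilenet_v2.tflite")           # ready for LiteRT
```

From here the exported file goes through the same deployment journey as any TensorFlow Lite model.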
All right. This is a complete solution that provides one unified cross-platform architecture. Maybe a quick question: how many of you are looking to deploy on multiple Android or iOS devices, where you want something you build to be easily testable on many devices? Because if you're building apps, you might think, okay, I made it, but how do I know it's going to work on five- and six-year-old phones and all of that? We have options for that too, so I'll walk you through the stack. LiteRT Torch is going to be your conversion path; it's basically bring-your-own-model. If you found a model you like and want to run, you convert it to the TFLite format and quantize it if needed. If it's an LLM, you go through the LiteRT-LM path; if it's not, you go through LiteRT. What's interesting here is that we also have the Model Explorer tool, which can help you explore the graph and decide which aspects of it you want to change and quantize, so you can actually study the graph and decide how best to do mixed precision, or quantize the model completely, as you like. The AI Edge Portal is a benchmarking tool that could interest those looking to deploy broadly on Android: it's a cloud-based benchmarking service, and a lot of our third-party app developers, and even internal teams, use it to get a good pulse check, e.g., if I have a model with this many parameters, do I need ahead-of-time compilation or just-in-time compilation, and what's the right recipe to ensure it's actually deployable and reliable across a broad fleet of devices?
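As a rough illustration of the non-LLM LiteRT path just described, here's a minimal sketch of loading a converted .tflite model and running one inference, assuming the ai-edge-litert Python package, whose Interpreter mirrors the classic TFLite interpreter API; the model filename is a placeholder:

```python
# Sketch: one inference through the LiteRT Python interpreter.
# Assumes a float-input model such as the MobileNetV2 exported above.
import numpy as np
from ai_edge_litert.interpreter import Interpreter

interp = Interpreter(model_path="mobilenet_v2.tflite")
interp.allocate_tensors()

inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]

x = np.random.rand(*inp["shape"]).astype(inp["dtype"])  # dummy input
interp.set_tensor(inp["index"], x)
interp.invoke()                                         # run on device
print(interp.get_tensor(out["index"]).shape)
```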
One more thing on this slide: the next part is acceleration. CPU and GPU are pretty universal right now; these are essentially libraries that will help you run on CPU and GPU, and I want to emphasize again that with this framework you can deploy on multiple platforms with CPU and GPU running. NPU acceleration is another focus. We've completed integrations with Qualcomm and MediaTek, and we're also working on additional integrations with other partners on different platforms. I think this is going to be a game-changer for a lot of your apps or products running ASR, TTS, or any application requiring real-time capability, or if you're building an AR/VR application where you want real-time updates to the camera feeds. The NPU should give you at least a 3 to 10x improvement in performance, and it's going to be a game-changer in the amount of energy you use and the performance you can unlock for those use cases.
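For what selecting an accelerator might look like in code, here's a hedged sketch, assuming the Python bindings mirror tflite_runtime's load_delegate(); the delegate library name is a platform-dependent placeholder, and on mobile you'd typically configure delegates through the native APIs instead:

```python
# Sketch: attaching a hardware delegate at interpreter creation.
# The shared-library name is a placeholder for your platform's
# GPU/NPU delegate build.
from ai_edge_litert.interpreter import Interpreter, load_delegate

gpu = load_delegate("libtensorflowlite_gpu_delegate.so")  # placeholder .so
interp = Interpreter(
    model_path="mobilenet_v2.tflite",
    experimental_delegates=[gpu],  # unsupported ops fall back to CPU kernels
)
interp.allocate_tensors()
```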
On flexibility, we offer ahead-of-time or on-device compilation, and for ease of use there are many options for how we simplify things, with a lot more documentation and support on that. All right, the next
part is going over some performance numbers. So far we've given you an example of what you can do at the app level, what's possible at the framework level, and how you can scale it in terms of coverage. Just with the Gemma models, this is the coverage right now. A lot of times people come to our demo booth and ask, okay, can you only do this on Android? No, we've tested this on all of these platforms; the models on Hugging Face you can actually test on many of them: Android, iOS, Linux, and Raspberry Pi as well. And here's a quick demo we did just this morning. It takes a while, but this is our little robot sitting down at our demo booth. We just made it this morning, so it's not super performant, but what we did here is show the robot a sign saying "move your antenna." Now the two lights are blinking, so it's doing inference. It's running on a Raspberry Pi, on a CPU, on our LiteRT-LM, and you'll see in a few seconds that it's going to wiggle the Sharpies that are its antennas. And there it is. There are other prompts, like "Will you marry me?" and things like that; if you're interested, you can try it out downstairs. Cool. That was amusing, but yes, we do need to work on the performance, because we just made it this morning. We also have a CLI tool for those looking to deploy who want an easier way of doing it; it's newly developed and available on our website, with Python binding support and more, so if anyone's interested, you'll find it there. In terms of performance, I have some quick numbers, and I just want to emphasize that running on some of these new accelerators, NPUs, can give you a big advantage: you can get up to a 13x boost in some cases. We have iOS performance as well, which we're supporting here: roughly 56 tokens per second, and so on.
Desktop is here too. A lot of these numbers can be found on our Hugging Face page; the quantized models are already listed there, and I'm just pulling from there, so when you download our models you'll get performance details for all these platforms, including IoT. Our runtime is also very performant versus llama.cpp: on mobile we've seen up to 35x faster performance, on desktop it's at par, and on IoT we see around 3x. All right, I have a minute left, so in summary: LiteRT supports all these frameworks, so if you're looking to bring models from different frameworks, you can. We have the models on Hugging Face, and we also have the Gallery app, which is a nice playground for you to try this out. So if there are any questions, I'm happy to take them.
>> Yeah. Thank you.
>> Can it recognize faces? I'm thinking about, say, the security camera at my home recognizing that my son came home. Pushing this to the cloud would cost an enormous amount of money, and since this can be done locally, maybe this could be a security feature of my home camera.
>> Yeah. On your phones, even iPhones or Pixels, when you do face unlock it's typically already running locally, and on Google devices that's using this same framework. Apple's face unlock is also on device, on Core ML. So that is possible, yes.
>> And then you would stream it constantly, like every 2 seconds for example, to recognize the face? How would you typically do that?
>> Yeah, you have the camera on and you can stream it, but for your home camera I think it would be a little different story. You'd maybe need some sort of algorithm to check authenticity, so it would be a little different from how it's deployed on the phone, but yes, you can totally run that on device.
>> That would consume a lot. It would be better to hook up to the camera and, only when that individual is recognized, message your phone.
>> Right. Right.
>> Yeah, I'm talking about a local device connected to my camera that's actually checking the frames from the camera.
>> A Raspberry Pi could do it, right.
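A hedged sketch of the pattern just discussed: a local box (e.g. a Raspberry Pi) samples camera frames every couple of seconds, runs a small on-device detector, and only messages your phone on a match. detect() and notify_phone() are placeholders for your model and notification channel:

```python
# Sketch: frames never leave the local device; only a short
# notification goes out when someone is recognized.
import time
import cv2  # OpenCV for local frame capture

def detect(frame) -> bool:
    """Placeholder: run a small quantized .tflite face/person model here."""
    return False

def notify_phone(msg: str) -> None:
    """Placeholder: push notification, MQTT message, webhook, etc."""
    print(msg)

cap = cv2.VideoCapture(0)        # the local camera
while True:
    ok, frame = cap.read()
    if ok and detect(frame):
        notify_phone("Recognized person at the door")
    time.sleep(2)                # sample every 2 seconds, as discussed
```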
Yeah, you have a question.
>> Do you want to answer that?
>> Let me see. I don't know if I have it in here. Not here, but I think we have it on our Hugging Face page; I don't think I added it here. Yeah.
>> Have you tried any exercise where you have multiple nodes running a model, and if any of them triggers something, they connect to a higher-level agent that processes it?
>> I think we have some groups working on, say, speaker-plus-thinking-agent types of architectures. I don't have any examples to share, but that is one way of distributing it, with orchestration to decide what should run locally and what should run elsewhere. You can think of a health agent or a coaching agent of sorts.
>> I think it's a pretty common practice: you have a classifier, or you train a very small model, just to classify the complexity of the request.
>> Yeah, that's actually pretty common practice, right.
>> Thank you very much. Next one.
>> Thank you. Yeah.
>> I have a mobile app which is currently using an API.
>> Right.
>> It does something like audio to text. Is there anything for that?
>> Right, right. If you have open-weight models that you prefer, we should be able to support any of them; we just need to make sure we get them into the right file format and that the sizes work out. Do you know which models you're thinking about for your application?
>> Not these, at least not the E4B.
>> Here we've just focused on Gemma because of the DeepMind track, but essentially our Hugging Face page has other open-weight models. Some of them we provide for ease of use; others you're welcome to convert and use yourself. But if there's a model you've identified, then we can certainly discuss it.
>> Thank you.
>> Yeah, thanks.