TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google

Channel: aiDotEngineer

Published at: 2026-05-03

YouTube video id: BKWpYIWvAo4

Source: https://www.youtube.com/watch?v=BKWpYIWvAo4

Hey, my name is Cormac Brick. I work on Google AI Edge, which is a way of bringing models to the edge. This is something we use internally for our own products, and it's also something we make available as open source. As part of this, we also work really closely with the Gemma team, because they publish a lot of models that are targeted at edge devices as well.

Quick bit of background on me: I've been focusing on edge AI for maybe 10 years at this point. I started off in 2016, running a GoogleNet on a USB-stick hardware accelerator plugged into a Raspberry Pi at NeurIPS 2016. So it's been a lot of fun over the years. I then joined Intel, where I led the architecture for the NPU that goes into all of their laptops these days. Three years ago I moved to Google to work as tech lead on edge AI here. That's been great fun, a crazy few years as you can imagine. I'm happy to share both where we're up to and, as part of that, the story of where mobile AI is today: what's the state of the art, and what are the different patterns we see for model deployment?

The two things I want to focus on in this talk are tiny LLMs and agent skills. Tiny LLMs are very, very small models. Agent skills are now possible on device, though larger models are needed to make them work fluently. Both are exciting new directions that have only recently become possible.
As recently as last week, when the DeepMind team launched Gemma 4, we supported them with an Android and iOS app, and that has opened up new possibilities for things we can do on mobile. So we'll deep dive into that, as well as looking at the types of things we can do with tiny models. Those are the two halves of the presentation.

Okay, so that's the intro. First I'll give an overview of AI at the edge, looking at small language models and tiny language models. Then I'll take a look at Gemma models, because that's pretty topical, and the kinds of performance we see for Gemma models on various types of edge devices, because that's what our team does. Then we'll look at agent skills, which we built on top of the latest generation of Gemma models and which run on both Android and iOS as well as many other platforms. In the second half we'll look at tiny models: how you can fine-tune and deploy a tiny model to edge devices. And finally I have an example of an app our team built, a real app rather than just a sample app, built using tiny LLMs based on Gemma technology as well. Okay.
Okay. So first: you're already in this room, so you're probably interested in edge AI, and this probably goes without saying, but there are a lot of benefits to running on the edge. There are latency and UX improvements for really latency-sensitive, in-the-loop features like live voice translation. That's something our team shipped on Pixel last year: you can do live voice translation, and it's very challenging to hit the required latency with a cloud service. Being able to do it offline, on the edge, is key to meeting those latency and user-experience requirements.

Privacy is definitely a thing as well. People use AI within messaging apps, for example, and they want their messages to stay on their phone, encrypted and fully private. That's a good example of where we're seeing LLMs get deployed: assisting in those types of use cases where privacy is really important.

Offline use is fairly obvious. And then cost savings: we see this being increasingly relevant for laptop users, where there's a trend toward folks at least experimenting with small language models to generate some of the tokens in the desktop agentic workflows they have going on. So some folks are interested in exploring savings as well.
Okay. So this is what our team does: we enable people to build apps. We do MediaPipe; LiteRT-LM, which is an LLM runtime that works on mobile and at the edge; and LiteRT, which is a standard inference framework that was long ago known as TensorFlow Lite.

Over the years these have shipped in a lot of places. Google Photos uses MediaPipe, and if you've used any of the funny effects in YouTube Shorts, those are things our team built together with the YouTube team, using MediaPipe and various deep-learning tracking models under the hood, built on that same MediaPipe and LiteRT technology. All Android phones today use this stack as part of system services, and third-party apps build on a version of the stack that ships with Android.

But our stack also runs far beyond Android. As well as being available as a system service on Android, you can take the same LiteRT file and ship it to iOS, macOS, Linux, Windows, the web, or IoT devices, all from that same file. That's why it's helpful for deployment. One caveat: that same file deploys on CPU and GPU. For NPUs we need to do some special compilation, and you end up with a dedicated NPU file. But certainly for the Gemma models we published last week, it's true that one file can work on CPU and GPU across a wide set of devices.
Some of the things we're seeing for LLMs on device: privacy-centric use cases, voice agents, and local agents, where tool calling is a very popular workflow. Within our stack, LiteRT-LM is the piece we'll look at a bit later that helps us run tiny LLMs on device, cross-platform, and it's fast because we support all of the different hardware accelerators.
This is an important concept: we see two trends happening today. One is what we call system-level GenAI. For very large models, the way these tend to turn up on mobile phones isn't that a single app downloads a four-billion-parameter model just to help you find a good restaurant in whatever app you're looking in. Instead, the trend is to build larger models into the OS. These models tend to be in the two-to-five-billion-parameter range, depending on the OS. This is a choice we see from both the Android team, who we work closely with, and the Apple Intelligence team: similar choices being made by both mobile OS vendors, where there's a central model built in. On Android that's called AICore, and there are things like summarization APIs and an increasing set of APIs available to developers, including a prompt API. That's available on premium Android devices, and Apple obviously has their own Apple Intelligence equivalent on premium Apple devices. As a trend, this is really relevant: if you want to leverage a built-in model, this is a great way to go.
The other trend we see is in-app GenAI, which is where tiny LLMs, or TLMs as we're calling them in this presentation, are more relevant. With system GenAI you generally customize via prompting, or via skills as we'll see later, and it's a foundation model pre-loaded on the device. With in-app GenAI, the models are generally custom to a task and loaded with the app or the web page (we also deploy some of these on the web), and they're targeted at wider reach. They may work not just on premium devices but on all devices, because reach is really important for a lot of the application developer teams we work with.

And surprisingly, if you fine-tune an LLM for a single task, we've seen really strong performance on simple tasks like summarization, transcription, or voice-to-action. We can get really reliable performance from models in the 100-to-500-million-parameter range, depending on the complexity of the task. A good example: we launched FunctionGemma in December, a 270-million-parameter model dedicated to function calling. Doing voice-to-function-calling across 10 different functions relevant to Android developers, our internal eval set reached roughly 85 to 90% reliability, just using that very small model, which was widely deployable to iOS and Android devices. That's something you can play with in a sample app we'll look at a bit later. So that's one example of a use of a tiny LLM, and we're seeing more interest from application developer teams in fine-tuning models to deploy as in-app GenAI.

So the difference here: on the left, customization is via prompting or skills; on the right, we'd encourage some degree of fine-tuning to make a tiny LLM work in practice. That's certainly true below 500 million parameters. At 500 million parameters and above we can do more general-purpose tasks with a model, but for the really tiny models, certainly under 500 million, in our experience you need to fine-tune to get production-level reliability.
Okay. Now I'm going to talk about Gemma 4. The models launched last week fall into that system-GenAI-candidate category, so we can do lots of powerful things with Gemma 4, as you'll see.

First, sizes. There are two small sizes, E2B and E4B. E2B is called that because it only needs about two billion parameters resident in RAM to run the model; one of the limiting factors for these models is how much RAM a device needs in order to run them reliably. E4B has been optimized to run with four billion parameters on device. The model actually uses more parameters than that, but the extra parameters are used for per-layer embeddings (DeepMind have talks later in the week with more detail). The TL;DR is that in our runtime we don't need to load all of those parameters. The 2B or 4B of core weights need to stay resident in RAM, but the per-layer embedding tables we typically memory-map, and per step of the autoregressive loop we only need to load one row of the per-layer embedding table, a few hundred bytes to a few kilobytes, to do the next-token inference. So as you go through inference you never end up loading the whole PLE table into memory, and depending on the OS, it will do a reasonably good job of evicting memory used by older PLE rows. That's why we have this idea of "effective" sizes.
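To make the per-layer embedding idea concrete, here is a minimal Python sketch, assuming the PLE table for one layer has already been written to a file called ple_layer0.bin; the file name, dtype, and shapes are illustrative, not the actual Gemma format:

    # Memory-map the big per-layer embedding table and touch only one row per
    # token, so the OS pages in a few KB per decode step, never the whole table.
    import numpy as np

    VOCAB, DIM = 262_144, 256  # illustrative sizes

    # np.memmap reads pages lazily from disk; nothing is loaded up front.
    ple_table = np.memmap("ple_layer0.bin", dtype=np.float16,
                          mode="r", shape=(VOCAB, DIM))

    def ple_lookup(token_id: int) -> np.ndarray:
        # Only this one row (DIM * 2 bytes) is actually faulted into RAM.
        return np.asarray(ple_table[token_id])

    # Autoregressive loop: one row per step; the OS can evict older pages.
    for token_id in (101, 2047, 9):
        vec = ple_lookup(token_id)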
The smaller models, E2B and E4B, run on a variety of platforms, as you can see on the right.
These models are on the AICore roadmap. They're available now for experimentation, and at the appropriate point in the future the Android team will integrate them into AICore, where they'll be available more broadly on a wider set of devices. There's a talk here this week from Olli from the AICore team, who will probably share more details about exactly what that roadmap looks like. And Omar from the GDM team is going to do a deeper dive on the whole Gemma world.

The bottom two models, just for reference, are also relevant for the edge, but they're not really the focus of my talk today. Those are for folks running on laptops; the sizes have been optimized to run really well on consumer-grade laptops, albeit ones with maybe 32 gigs of RAM. So I'm going to focus less on those two models, even though they're very powerful and useful.
We can see that both the E2B and E4B models have excellent performance on a wide range of things, like knowledge and reasoning. But one of the big step-ups relative to the last generation, from my perspective as a user of the models, is that they have built-in function calling, which is excellent, and they also have built-in thinking. That combination of thinking plus function calling is what unlocks our ability to do skills on device. As you'll see in a second, we can describe a skill, give it to the model, and the model can just pick it up and use it. That lets us take a pattern that's proven very popular in the last few months and bring it to mobile, which allows for new types of mobile experiences.

Also, the E2B and E4B models are multimodal: they support audio, image, and text. The larger models support just image and text.

The other change from a deployment point of view is that this is the first time the Gemma models are released under a stock-standard Apache 2.0 license, which means they're more usable by more people. Again, you'll hear more about this in Omar's talk; I just thought I'd mention it here.
Okay. So that's Gemma in general. Now a deep dive on Gemma on our runtime and our platform. It's the same picture as earlier: we have a single LiteRT-LM file, which is a LiteRT file plus the tokenizer and the other things we need, in one package, to run a model. That single model file runs across all of these classes of devices: mobile, desktop, and embedded. And we're pretty excited to do more in the embedded space; particularly with image input, there's a lot of scope for new IoT use cases.

Okay, a bit of an eye chart now, but let's dig into performance. I'd say this is a snapshot as of today; it's something we're continuing to work on, both ourselves on our team and with various partners, including Intel, the Raspberry Pi team, and the Qualcomm team. We're continuing to optimize all of these numbers. But you can see that for the two-billion-parameter model we get really compelling performance: a high-end Android phone can do thousands of tokens per second on GPU, and a MacBook can also do thousands of tokens per second. The bottom two rows: on a Raspberry Pi we get about 133 tokens per second, which is sufficient for simple image-analysis use cases with reasonable latency. And the bottom one is us running on a hardware accelerator, a Qualcomm IoT and robotics development platform, where we see pretty compelling performance as well because we've used the NPU, which gives a lot better performance. Then we have corresponding numbers for the E4B model, which also works on a wide set of devices, with proportionally less prefill and decode performance, given the size and the number of parameters we need to fetch. But broadly, these are available on lots of platforms.

Okay, so what can we do with those models? Lots of things. One thing I want to talk about more, because it's net new (rather than just showing image analysis, audio transcription, or audio translation, things we've been able to do for a while, where the Gemma 4 models are simply much better than their predecessors), is agent skills on device. We have an app, I don't know if you've seen it, available on both iOS and Android, called Google AI Edge Gallery. It lets you do lots of different things: basic AI chat, asking questions of an image, transcription or translation use cases starting from audio, or audio-to-function-calling type use cases. But the one second from the left, agent skills, is what I'm going to do a deeper dive on today. Okay, hopefully this will play.
Oh, okay. That is very annoying. Sorry, I'll fix this before we post slides. Wait, let me see if this will work.

>> Morning. Let's log a new mood journal entry.

The video is playing, but the sound is not working.

>> I got eight hours of sleep and I'm looking forward to hanging out with Amy today.

So this is a journal skill app we're looking at here, a mood tracker, where it logs mood and sleep. And then:

>> Perfect. Analyze the trend in my mood over the last seven days.
Okay. So what's happening here: with the mood tracker app, it logs your moods and observations to a diary, and then the LLM is able to go back in and summarize that content.

>> Check my calendar for... a bullet list.

Okay, I won't keep repeating this. What's interesting here, though, is the way this is done, which we're going to look at in more detail. This isn't a custom fine-tune. It's just us giving a particular skill, with a little bit of JavaScript the model can call, to the model, and then through a plain free-text interface the model is able to pick up and use that skill. There are actually two skills happening here: one is the mood tracker skill, and the other one...

>> Can you check Wikipedia for the latest information?

...the other one was the map skill, and now it's using a query-Wikipedia skill to look for the latest information about the Oscars. So versus having a model that's just pinned in time, it now becomes really easy to extend the model with skills: things like map lookup, an interactive journal you can add and remove entries from by voice as well as query, and also adding more current or relevant knowledge. We're showing Wikipedia in this case.

>> Can you pair this vibe with...

Oh, this is another fun skill somebody on the team developed: the mood music skill, which calls a web service to compose music based on a single image and plays it. So it's able to compose some lo-fi music to go with a person's breakfast, that sort of use case. It's a new paradigm in how we're able to really easily extend the models, and to do that in a pretty low-code way, as you'll also see in a bit when we dig into how this works under the hood.
Okay, so there are examples here, and I'm not going to play all of these videos, maybe to save time, because I think you've seen many of them. One pattern we find interesting is augmenting the knowledge base. We can also produce rich interactive content, like flashcards or visualizations: instead of asking to summarize something in three bullet points, you can have a JavaScript skill that shows those bullet points as a card. You can just say "summarize this," and if the model thinks a card display will be more helpful for the user, that gets used. Or music synthesis. Okay, so I'm going to skip forward a little bit from the demos to
how we built them.

Okay. So what's actually happening: the way we've built the skills, they're efficient, in that the instructions can be loaded on demand. This uses a principle you may have seen elsewhere: progressive disclosure, or conditional depth. Instead of an MCP-style workflow where you need to describe everything about all of the functions up front, the way we've structured skills is that there's a one-line description per skill. Those one-line descriptions are all the agent sees at first, and if one sounds relevant, it asks for more: it asks to load the skill. We teach it a skill for loading skills. It then loads the skill and finds out all of the details about how to use it and what function calls it can make.

That pattern is particularly important for token efficiency and, frankly, reliability on edge models. If we had to load all the details of all the skills into the edge model, that would be a lot of context for the model to reason over, and in a lighter-weight model that ultimately hurts performance. Lighter-weight models are really great to run on device, but in terms of reasoning ability over very long context windows, if you can keep the context window more condensed, that ups your batting average on whatever quality metrics you're tracking to ship a particular app.

The second part is tool access: it helps us integrate new tools dynamically. We think of them as a set of input tools, ways to get more information, which could be Wikipedia or a weather service or something like that, and a set of tools that help present new outputs to the user, like showing something on a map or via cards. So you can have skills that extend both the input and the output patterns a model can use. It also helps us bring in domain-specific knowledge bases: that skill calling Wikipedia, you could easily imagine it calling some internal customer CRM, or pulling data from a local RAG system as well.
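To make that shape concrete, here is a minimal Python sketch of the progressive-disclosure pattern; the names are illustrative, not the actual Gallery or LiteRT-LM internals:

    # Only one-line skill summaries go into the initial context; the full
    # skill.md enters the context only after the model calls load_skill.
    SKILLS = {
        "mood_tracker": {
            "summary": "Log and analyze mood and sleep journal entries.",
            "skill_md": "...full instructions and function specs...",
        },
        "query_wikipedia": {
            "summary": "Fetch up-to-date facts from Wikipedia.",
            "skill_md": "...full instructions and function specs...",
        },
    }

    def system_prompt() -> str:
        # The agent initially sees only these one-liners.
        lines = [f"- {name}: {s['summary']}" for name, s in SKILLS.items()]
        return ("You can call load_skill(name) to read a skill's full "
                "instructions.\nAvailable skills:\n" + "\n".join(lines))

    def load_skill(name: str) -> str:
        # Tool response: the skill.md contents, which now enter the context.
        return SKILLS[name]["skill_md"]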
So within our structure, a skill has a skill.md and then, optionally, scripts or assets. The skill.md carries the metadata we always process. The example here shows extracting text and tables from PDFs, and we trigger on PDF extraction; the instructions are only loaded when the model decides it requires that skill, and this pattern is particularly important. Then, within scripts, we also support people writing custom JavaScript that can be rendered within the app, because that's a really easy way to extend things on both iOS and Android.
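On disk, a skill looks roughly like the sketch below; the exact field names are illustrative rather than the precise schema:

    my_skill/
      skill.md        # metadata header, always processed:
                      #   name: pdf_extractor
                      #   description: Extract text and tables from PDFs.
                      #   trigger: PDF extraction requests
                      # ...followed by full instructions, loaded on demand:
                      #   1. Call run_javascript with scripts/extract.js.
                      #   2. Return the extracted tables to the user.
      scripts/        # optional custom JavaScript, rendered in the app
      assets/         # optional resources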
Okay, digging a bit deeper: we have our own system prompt that we ship with the app, and then a set of skill descriptions that get added to it. When the user asks for something, the model decides to trigger a skill by reading its metadata. It then calls the load-skill tool we have under the hood, and the tool response from that function call is the contents of the skill.md file, which lands in the context window, so the model now knows about those functions. The skill also contains some JavaScript, so the model calls the run-JavaScript tool, which runs the JavaScript we picked up from the skill file, and we return a response on that as well.

One other thing to mention as part of this workflow: one of the things we did when we were preparing FunctionGemma, for reliability, is that the runtime has constrained decoding, tuned to apply only to the output when we're generating a tool call. The constrained decoding can also constrain to the particular tool that's supposed to be called. So in this system, instead of just having generic JSON constraints, we know there's a finite set of tools the model is allowed to use, and we can therefore apply stronger constrained decoding. That helps us build a more reliable system overall. We find this helps a lot for the two-billion-parameter model; as models get more capable, the margin you get from this sort of strict constrained decoding, as we call it, becomes less essential. If you were running a ten-billion-parameter or larger model you'd need it less, but for really small models on device it's a very helpful tool that gives us stronger guardrails, so we can raise quality to something that's useful in production.
Let me see if this will play. Within the app, you can toggle the skills you want to use, so you can decide how many skill descriptions you want live at any given point in time. This is loading a piano-playing skill, a virtual piano, which will do its thing; you can actually tap the keys and play sounds. This is just more JavaScript under the hood. You can also load custom skills: you can write a skill yourself and load it from a URL, which is kind of fun. If you want to experiment with prototyping a skill with Gemma for an app idea you have, this is something you can do. Also, skills can have an API key, a secret key, so if a skill needs a web service, that's something you can prompt the user to enter. This is showing the mood music example again.

And on our GitHub page we have a GitHub discussion where users are posting skills they've written themselves. Skills from the community that we like, we're able to pull up and feature in the app. So for the ones people develop, we have a way of showcasing useful community skills to the wider community as well.
This is a third-party skill, which I think this is going to show us. Okay, this is adding an animal-intro skill to the app; it's a kids' thing. The key point here is that there's a really, really low barrier to extending the model, and to extending it in a way that's relevant to downstream app users. It's very easy, and we're going to go a little deeper now to see just how easy that is, if the clicker does its thing.
Okay. So the skill architecture, going one step deeper, is relevant here. We have our own orchestrator; it has a skill registry, and we call the load-skill skill to load skills. Within skills there's the JavaScript skill, and there are native intents: at least on Android, we're able to call Android system intents, or native intents, and you can call those in your skill, for things like toggling Wi-Fi on and off. Intents that are exposed on Android and available from JavaScript, you can certainly use. Then there's the skill.md, which carries the persona and scenario data, and the skill-specific resources. Like I was saying before, this can be JavaScript that runs entirely locally, which is a fully offline experience, or, as in the music composer one, it can call a web API that needs an API key, which the user is prompted for, and that also works within Gallery.

So the predefined tools we have under the hood, just to give a mental model of how it works: load-skill, which loads skills; run-JavaScript; and run-intent. The orchestrator, using just these three tools, is able to make the overall system work.
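A sketch of how those three built-in tools might route, again with illustrative names rather than the real implementation:

    # The model can only act through this finite table, which is also what
    # makes the strict constrained decoding described earlier possible.
    def load_skill(name: str) -> str:
        return f"<contents of {name}/skill.md>"  # enters the context window

    def run_javascript(source: str) -> str:
        # In the real app this executes inside an embedded JS engine/web view.
        return f"<result of executing {len(source)} bytes of JS>"

    def run_intent(action: str, **extras) -> str:
        # On Android this would fire a system intent, e.g. toggling Wi-Fi.
        return f"<fired intent {action} with {extras}>"

    TOOL_TABLE = {
        "load_skill": load_skill,
        "run_javascript": run_javascript,
        "run_intent": run_intent,
    }

    def dispatch(tool_name: str, **kwargs) -> str:
        return TOOL_TABLE[tool_name](**kwargs)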
So this is another demo, from our Gemini product manager, showing a restaurant roulette skill.

>> It's good to pick a restaurant. Please reply to me in English.

It was so easy to develop skills using Antigravity or Gemini CLI that we actually had the team do about 80 of these skills, so we had lots and lots of options for what to show, and it was a lot of fun for folks to develop them. Here we can go in and look at the structure of restaurant roulette itself. This is the under-the-hood part: it has requires_secret set to true, because we wanted to get an API key, and it searches for 10 restaurants and returns locations and cuisine. And this is the index.js file, where we have a simple web view we render to draw the actual wheel.

You can obviously go and code all of this yourself; there's full source code for the examples. (Claude Code: I never actually got that published, but it works really well in Claude Code as well.) We have both source code for the example and a skill spec for getting started, and there are full instructions on GitHub if you want to try writing your own skill.
But this was the pattern we actually used most: using skills to write skills. Using something like Gemini CLI, Claude Code, or Antigravity, our favorite pattern was to just say, hey, I want to write a skill for AI Edge Gallery. This example works in Gemini CLI, where we say: here's the documentation, here are some examples of skills you can go and read, and this is the skill I want to build, which was an offline archiver.

The idea is that, based on pictures you took around London, for example ("there's that cool Churchill statue I walked past"), it would go and research a bit about Churchill for you, and then on your flight back you could go into this skill and it would have a bunch of information about the things you saw, which you could read on the flight home. I think that was the inspiration for this one. So it fetches Wikipedia content, stores it locally, and keeps an index.

Also, with the CLI, because we have an ADB skill in Gemini CLI, you can ask it to test the skill itself: you say, hey, you have access to a phone connected via ADB, and Gemini CLI uses its Android ADB skill to go in and test that the app actually works and that the skill does what it says. You can put that in the prompt and ask it to iterate, and it'll do some basic validation itself and return the result.
So this, if it plays, is an example of doing exactly that: us using Gemini CLI to build a skill. And this worked really robustly. Of the maybe 80 skills our team did internally, I think more than half were vibe-coded, at least initially, which really reduces the barrier for folks to extend an LLM to do new things in a way that's useful for their audience.
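Paraphrasing, the kind of prompt we would hand Gemini CLI looked something like this (illustrative wording, not our exact internal prompt):

    Here is the AI Edge Gallery skill documentation, and here are two
    example skills to read first.
    Write a new skill called "offline archiver": given photos I took,
    fetch the related Wikipedia articles, store them locally, and build
    an index I can browse offline.
    You have a phone connected over ADB. Install the skill in the Gallery
    app, verify it loads and responds, and iterate until it works.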
Take over the screen? Yeah, go for it.

>> Are all the examples here running on the smallest model, in-app or on-device?

Good question. For the recording, the question is: are all of the examples running on the smallest model? In this case the examples are running on the 4B model. The skills will work with the 2B model as well, and you can try that, but your mileage may vary: simpler skills, fewer skills, that sort of thing. All of the examples you've seen are running on the 4B model.

>> Have you tried the Chrome MCP?

Trying a version of skills in Chrome?

>> Yeah, using the Chrome MCP with the skill to actually use a local model to work in your browser.

We haven't, no. It's an interesting thing to try. Okay.
And here's a quick shout-out to this link. A few things about Gallery I should also mention (quick time check): one, Gallery is an open-source project, and the project itself builds on top of the LiteRT-LM tooling you saw earlier in the intro. So the app, as well as being fun to use, lets you prototype app ideas: is this possible with this model on a phone, how fast would it be, is this sort of skill actually viable? As well as doing that kind of model prototyping, you also have full access to the source code. And it builds on the same infrastructure, the open-source LiteRT and LiteRT-LM acceleration infrastructure, under the hood. So if you see something you like in Gallery, yes, you can deploy the Gemma model some other way, but if you want the same experience and the same speed, you can just take the underlying open-source APIs, pull a model from our Hugging Face page (which we'll look at in a bit), and run it directly.

As part of that, we also have a set of discussions on Gallery showing some of the community skills that have been uploaded, from cat entertainment to blackjack. People have posted these skills themselves, so if you develop a skill, feel free to post it there; it makes it more discoverable by other people in the community. The ones that seem compelling or broadly useful we can add to our third-party featured-skills list in the app, making them easier for folks to discover. We saw a reasonable amount of action on this over the weekend, and it's only been up since last Thursday.
Okay, quick time check. Next up, I want to switch gears. Everything we saw with skills really applies (per the earlier question) to the 2B and 4B models, which on mobile phones will mostly be destined for system GenAI; that's where they'll really turn up in production. For IoT, desktop, or other edge applications, you can just load the model yourself and run these skill use cases in production, for example on that Qualcomm IoT platform we saw earlier. But to deploy models in-app today and ship them in production apps, we're seeing more people use smaller models: what we'd call tiny models, which in our view are models under 1 billion parameters. These are the sort of things we work on with teams to deploy LLMs within their apps, and I want to briefly look at what that workflow looks like.

LiteRT-LM is the engine that powers Gallery. It's itself an open-source project with C++ and Java APIs; Swift APIs are coming soon, and it has a Python API as of last week, which is relevant for IoT developers who like to work in Python. It takes an LLM file, has the components required to fully run an autoregressive loop, and exposes that via easy-to-use APIs.
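As a rough idea of what driving it from Python looks like, here is a hypothetical sketch; the module, class, and method names below are assumptions for illustration, so check the LiteRT-LM repo for the actual API:

    import litert_lm  # assumed module name

    engine = litert_lm.Engine(
        model_path="gemma-e2b.litertlm",  # one bundled file: weights + tokenizer
        backend="gpu",                    # the same file also runs on "cpu"
    )
    session = engine.create_session()
    print(session.generate("Summarize: on-device LLMs cut latency and cost."))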
It's cross-platform, with C++ APIs, and where relevant the same API can be used with hardware acceleration, like the Qualcomm example. For other smaller models, in the past we've also published models that work on MediaTek silicon and on Intel silicon. We don't have that published for Gemma 4 yet, but over the course of the year, as more and more smaller models become available, you'll see broader NPU support.
The workflow: you start from Transformers. We use a package called LiteRT Torch (ai-edge-torch), which does some PyTorch-native optimizations for LiteRT and has quantization built into the workflow. That gives you a LiteRT-LM file, and you can then either prototype it with the Gallery app or deploy it directly to your production-candidate use case using LiteRT-LM, on whichever platform you want to work with.
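A minimal sketch of that conversion step, using the open-source ai-edge-torch package; the toy model stands in for a real fine-tuned checkpoint, and exact entry points vary by release, so treat this as illustrative:

    import torch
    import torch.nn as nn
    import ai_edge_torch  # the "LiteRT Torch" package

    class TinyLM(nn.Module):
        # Stand-in for a fine-tuned tiny model.
        def __init__(self, vocab: int = 32000, dim: int = 64):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.head = nn.Linear(dim, vocab)

        def forward(self, ids):
            return self.head(self.emb(ids))

    model = TinyLM().eval()
    sample = (torch.zeros(1, 16, dtype=torch.long),)  # example token-ID input

    # Re-author the PyTorch graph for LiteRT; quantization can be attached
    # to this step in the full workflow.
    edge_model = ai_edge_torch.convert(model, sample)
    edge_model.export("tiny_lm.tflite")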
This is another view of that flow. For really advanced use cases we have the LiteRT Torch Generative API: if you actually want to write your own tiny model from scratch and train it from scratch, we support that workflow as well, not just stock-standard fine-tuning. The Torch Generative API gives you building blocks for LLMs, and it supports many common LLMs in native PyTorch, in a form that gives really good performance when you run them on device.
So that's the flow. Now, this is really in the weeds: this is our stack for deploying on NPUs. The takeaways from this slide are maybe two things. One is that under the hood we invest an awful lot of work in optimization libraries: XNNPACK, a CPU optimization library, and ML Drift, an optimization library for GPUs. We have teams working on these to ensure both have excellent performance; lots of Google's first-party apps rely on both libraries, so we're very motivated to make sure they perform really well and work on the widest possible set of devices. These use what we call the JIT workflow, where we produce a single artifact, a LiteRT-LM file or a LiteRT file, that works across CPU and GPU and gets deployed to lots of types of devices. For NPUs we need something a little more specialized: we call a vendor compiler plug-in up front, using an ahead-of-time compile workflow, so you produce an artifact specific to a particular NPU. Then, within our runtime, we dispatch to a particular API that's allowed to hand work to the NPU's device driver. But both paths are available through a consistent API: even though JIT versus AOT affects the build workflow, the app development workflow is very similar across NPU, CPU, and GPU.
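Put side by side, and reusing the assumed API from the earlier sketch, the two paths look something like this (names illustrative):

    import litert_lm  # assumed module, as in the earlier sketch

    # JIT path: one portable artifact, backend picked at load time.
    engine_cpu = litert_lm.Engine("gemma-e2b.litertlm", backend="cpu")
    engine_gpu = litert_lm.Engine("gemma-e2b.litertlm", backend="gpu")

    # AOT path: a vendor compiler plug-in produces an NPU-specific artifact
    # up front, e.g. something shaped like:
    #   $ vendor_compile --target=<some_npu> gemma-e2b.litertlm -o gemma-e2b.npu
    # and the runtime then dispatches that artifact to the NPU driver.
    engine_npu = litert_lm.Engine("gemma-e2b.npu", backend="npu")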
Export and inference are really simple: we have LiteRT Torch to just export. I know I've talked a lot about Gemma models, because it's Gemma week for us and we're very excited about Gemma at the moment, but it's worth calling out that we support third-party models as well. There's a Qwen 0.6B-parameter model there, and those Qwen models also work in the app, which is what you'll see on the right-hand side. The app also supports loading any LiteRT-LM file, running it, and getting benchmark stats, which we'll hopefully see in the video in a second. So you can run it, say, on GPU, and then...
>> Is that the latest Pixel you're simulating here?

That's a great question. To be honest, I don't know. We do a lot of testing on Pixel and a lot on the S25 as well, so I assume it's one of those two devices, but I don't know definitively. And, not shown there: in the last release we added an optional icon underneath each chat, and if you tap it, it shows you the prefill and decode stats for the model. So you can find any model we have on Hugging Face, load it in the app, and run it to see benchmark stats for that model on a particular phone, if you want to. Also, on the command line, if you're running on desktop, you can use the LiteRT-LM run tool if you want to do some desktop prototyping and understand how models behave on desktop.
And this is another third-party model: FastVLM, a model from Apple. It's actually a really nice VLM that's only 500 million parameters. This is literally running with hardware acceleration on Qualcomm, which is why it's running so fast; I think this is on an S25, since it's Qualcomm silicon. We just have it running in a loop saying "describe the scene, describe the scene" with video input, and it runs really, really fast. It's a good example of what's possible with a model that's feasible to deploy on device. We haven't done 4-bit quantization in this case, but I assume that if we'd been motivated enough we could have, and then you'd have something that only requires an extra 250 or 260 megabytes in your app to give this type of experience. So these are certainly feasible.

This relates to what I was saying earlier: it's an example of a general-purpose 500-million-parameter model. It's been trained on a fairly narrow set of things, scene descriptions, image-to-description, so it's a narrowish use case, but still general purpose. It's a good example of the larger tiny models; the smaller ones, like Gemma 3 270M, typically require fine-tuning to do a particular task. That's why I included it: it's a good example of a general-purpose tiny model that's very useful. It's also worth noting, though I don't think I included them in the talk, that other examples of general-purpose tiny models include a lot of very small transcription models and some narrow pairwise translation models out there at the moment, many of which we support and many of which are on our Hugging Face page if you want to check them out. There's a bunch of things available on Hugging Face.
The next slide shows a handful of models I wanted to talk about. FunctionGemma is one we published in partnership with GDM last year: a general-purpose model that you can further fine-tune for function calling, and there are Colab notebooks out there showing how to format datasets and how to fine-tune FunctionGemma for function calling. The next two above that are Mobile Actions and Tiny Garden. Mobile Actions is the example I mentioned earlier, with the 10 different mobile actions that we fine-tuned ourselves, where we achieved around 86% reliability. Tiny Garden is a game we built inside the Gallery app that you can play with as well; it's another example of a voice-to-function-calling fine-tuned model that you can use.
>> Does it have the IT suffix?

It is instruction fine-tuned. When GDM publishes models, they sometimes have PT suffixes and sometimes IT suffixes. PT is when we publish a model just after pre-training, before fine-tuning. That's helpful for expert users, because you can do all of your own instruction fine-tuning, fully control the model's personality, and you're not trying to unlearn fine-tuning we did. In this case, the way we teach the model to do function calling is actually via instruction fine-tuning itself, so when we publish a model for further fine-tuning for function calling, it already has a function-calling personality, and that's what we wanted to publish. So this model is for further fine-tuning, but it's an IT model. More typically, for larger models, certainly for the Gemma 3 family, we had both IT and PT checkpoints, depending on whether you had a large volume of data and wanted to do full fine-tuning yourself. I think there's a separate conversation on sovereign-AI use cases that some of the DeepMind people are going to have later in the week, and I imagine that will be: hey, start with our pre-trained checkpoint, add a huge corpus of sovereign or enterprise data, and fully fine-tune a 27-billion-parameter model.
Another example here is EmbeddingGemma. I know it's technically not an LLM, but EmbeddingGemma is a text embedding model we published with the Gemma team last September, and it does text embeddings for RAG-type use cases. It's a very strong embedding model that only takes 300 million parameters. It's not technically an LLM, even if it is a transformer inside, but it's another example of a high-utility tiny model that's very useful for on-device use cases.
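A short sketch of using it for retrieval, via the sentence-transformers integration; the model id and method names reflect our understanding of the published checkpoint, so check the Hugging Face page for the exact usage:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("google/embeddinggemma-300m")  # assumed id

    docs = ["LiteRT-LM runs LLMs fully on device.",
            "Bananas are yellow."]
    doc_vecs = model.encode_document(docs)
    q_vec = model.encode_query("How do I run a model on a phone?")

    # Highest similarity score = best-matching document for the query.
    print(model.similarity(q_vec, doc_vecs))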
Okay, I've kind of covered this already: AOT compilation is our workflow for hardware acceleration on device, while the JIT flow is best for distributing small models to lots of platforms. And I covered this as well: LiteRT-LM itself is built on LiteRT, which supports all of these types of models, and you can find non-LLM models on Hugging Face too. That's relevant because when you're building a more complex app you typically need some things around the LLM, like a voice activity detection model or a denoising model. Many of those models are available using just the LiteRT runtime. LiteRT-LM is the runtime with the full autoregressive loop; it builds on LiteRT, and there's a set of LLM models available to use with LiteRT-LM, but there are also lots of supporting non-autoregressive models available through the LiteRT API that you typically use to deploy a full app.
And this is what I was saying earlier about advanced usage: you can fully customize models. If you want to write your own Llama, Moonshine, Phi, or Qwen variant, you can, using the Torch Generative API building blocks.
Okay, I'm probably on track to finish a little early. So that's it for the tiny-model examples and workflows. There's a lot we can do now with very tiny models: 500M-class models are available for some standard features, and for smaller LLMs we've had a lot of success fine-tuning them for apps. What I'm going to show next is an example of an app we've built using tiny LLMs. This is an app that's available on iOS only, for some reason, called AI Edge Eloquent.
What this is: a transcription model, but with a twist. You may have noticed that as a speaker I say lots of "ums" and "uhs"; if you got the raw transcript of this presentation, it wouldn't be a great transcript, because there'd be lots of interjections. Eloquent was built for that kind of transcription user story: it does dictation, but then has a separate, automatic polish step that removes all of the interjections and filler words. So if you want to dictate a message for later use, it can clean up those idioms of speech.

One of the other things it has is a biasing list. One thing we found ourselves: if you're talking about LLMs, you might say, "Oh, have you trained a LoRA for this?", and any standard transcription service, when you say "LoRA," will typically render it as the name "Laura." Here, you can give it a list of keywords or technical terms, and the model will bias toward those words, because that's how real people spell them. So the key features are: (a) it runs entirely offline; (b) it's got its own biasing dictionary, if you want to call it that; and (c) it cleans up the text; in the middle you can see that polish step. Overall this gives a pretty neat offline, no-cost experience, which goes back to that cost motivation from the start, because there are paid services that do this. And it can give you really clean, cleaned-up text.
>> Is that last step done by setting special tokens, or is it done last by overlaying a map or something?

Give me two slides and I'll answer your question, but good question. For the recording, the question was: is the polishing done inside the main model or outside, and how is the biasing applied? We're going to answer it in two slides, hopefully.
Yeah, personalization. This is the personalization: you can add things like "LoRA," which is the example we always use because it's near and dear to our hearts. If you want to connect it to your Google account, it can import material from Gmail or similar, look for unusual words, and add them to the biasing list, or you can add them yourself. The sort of things people usually put in here are names, because models frequently get uncommon names wrong, as well as technical terms. The names shown here are two of the people on the team who developed the app, so no surprise that's the example we have.
Okay. So the text polishing engine is what we're calling it at this stage. We have two steps here. First I'll describe the flow, then I'll talk about the models. Microphone input goes into a speech recognition engine that delivers an unfiltered transcription. In the bottom half we have the personalization flow, where you get a set of uncommon or unique words, which become the relevant terms. Both of these feed into a text polishing engine, which is a dedicated mini LLM just for text polishing. We could probably have built one LLM to do all of this, but one of the realities of mobile development, with tiny models especially, is the modularity story: that same transcription engine in your app may serve some other use case, and you can recoup the cost of those weights by using them in multiple places. This is a pattern we see emerging as we build more of these types of apps, a modularity playbook. So in theory those two models could be stitched together; in practice it's the more pragmatic choice to have separate models.
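Structurally, the pipeline is just two stages; here is a toy Python sketch with stand-in functions in place of the two fine-tuned models:

    def transcribe(audio) -> str:
        # Stand-in for the fine-tuned ASR model: audio -> raw text.
        return "um so have you trained a laura for this"

    def polish(raw: str, bias_terms: dict) -> str:
        # Stand-in for the polishing mini-LLM: the real model is prompted
        # with the biasing list and rewrites the text; this fake just
        # demonstrates the input/output contract.
        words = [w for w in raw.split() if w not in {"um", "uh"}]
        text = " ".join(words)
        for wrong, right in bias_terms.items():
            text = text.replace(wrong, right)
        return text

    print(polish(transcribe(None), {"laura": "LoRA"}))
    # -> "so have you trained a LoRA for this"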
>> It gives you more depth as well.

Yeah, and you can also inspect what's happening in the middle; it's easier to debug. We see the same thing a little with voice-to-function-calling as well, but that's another story for another day.

The other thing to point out: the ASR engine and the text generation engine have little Gemma logos in the corner, if you recognize them. These aren't officially published Gemma models, but we've basically followed the same workflow we advise other people to use. We've taken the smaller Gemma model, something from the Gemma 3 270M lineage is the best way I'd describe it, and fine-tuned it twice: once to get a transcription engine, and once to get a text polishing engine. To give a bit more insight, the workflow generally looks like this: use a much stronger LLM in the cloud to generate lots and lots of synthetic data corresponding to the kind of output you want. Once you have low-single-digit millions, or tens of millions, of synthetic examples, depending on how ambitious you want to be, you put them into a fine-tuning workflow and fine-tune the base tiny model you're working with to get a derived model. That same workflow we've used internally to ship a stronger note-taking app, and the reason I'm showing it here is as a real-life example of how we're using Gemma-derived tiny models to build new production apps. We're using this flow for lots of other use cases internally, supporting Google first-party products, in ways those products will probably talk about themselves in time. Smaller Gemma models are really, really powerful for this type of use case, and we're seeing good mileage from it.
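As a sketch of that "big model writes the data, tiny model learns the task" loop, here is roughly what the fine-tuning half can look like with Hugging Face TRL; the dataset contents, checkpoint id, and hyperparameters are illustrative, and the published Colab notebooks are the reference workflow:

    from datasets import Dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTConfig, SFTTrainer

    # 1) Prompt/completion pairs generated offline by a stronger cloud LLM.
    pairs = [{"prompt": "um so like check the uh weather",
              "completion": "Check the weather."}]  # imagine millions of these

    # 2) Fine-tune an off-the-shelf tiny checkpoint on the narrow task.
    model_id = "google/gemma-3-270m-it"  # assumed checkpoint id
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    trainer = SFTTrainer(
        model=model,
        train_dataset=Dataset.from_list(pairs),
        args=SFTConfig(output_dir="tiny-polish", max_steps=1000),
        processing_class=tokenizer,
    )
    trainer.train()  # then quantize and convert with ai-edge-torch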
>> I'm going to like combine it with like
keyboard because this is a very useful
>> Yeah, that's what the text polishing engine does. The text polishing engine was instruction fine-tuned to take a system prompt along the lines of: hey, these are your special words; please correct anything that sounds like these words to these words. And then also remove interjections, fix lack of clarity, and even handle things like "I forgot to say this" or "scratch that".
>> Probably a skill.
>> No, in this case it's not a skill, because this is a tiny model; we've actually just trained the behavior into the model. It's true that with the four-billion-parameter model maybe there's a version of this you could do as a skill, which would actually be an interesting side project. But for the tiny models, the playbook we're generally seeing be really useful is synthetic data generation with a larger model, then picking an off-the-shelf model like Gemma 3 270M or similar and fine-tuning it. Combined with quantization, you can then ship a pretty compelling narrow feature to a very wide set of users, powered by an LLM that works on lots of devices.
>> Do you have a GitHub repo somewhere with the workflow for how to fine-tune it? Because it's not just fine-tuning, right? You have to warm up the training and then start.
>> Yeah, we do. With the Gemma 3 270M publication we have a Colab notebook, and FunctionGemma shipped with one as well; both of those models have Colab notebooks that show how to do full fine-tuning. Yeah.
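For flavor, here is a minimal sketch of what such a full fine-tune can look like, assuming the Hugging Face TRL SFTTrainer and the synthetic JSONL from earlier; the checkpoint id and hyperparameters (including the warm-up the questioner mentions) are placeholder choices, and the published Colabs are the authoritative recipes:

```python
# Minimal sketch of step 2: full fine-tune of a tiny base model on the
# synthetic dataset. Checkpoint id and hyperparameters are illustrative;
# follow the official Colab notebooks for the real recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")

# Each JSONL record is assumed to carry a "text" field holding one
# formatted raw-to-polished training example.
dataset = load_dataset("json", data_files="synthetic_polish.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="polish-270m",
        max_steps=2_000,
        learning_rate=2e-5,
        warmup_ratio=0.03,  # warm up before full-rate training, per the question
        per_device_train_batch_size=8,
    ),
)
trainer.train()
trainer.save_model("polish-270m-final")  # then convert/quantize for on-device use
```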
>> Okay. So, regarding the fine-tuning,
>> Yeah,
>> what's your experience? Let's take the function calling example.
>> Yeah.
>> You said you had about an 80% chance of hitting the 10 functions.
>> That's where we finished.
>> That is presumably after fine-tuning.
>> Yeah.
>> And could you give some numbers from before? What was it at?
>> Oh, wow. What are we talking?
>> Is it 10%?
>> Like forty-something percent. Forty-something percent to 86%, right? And within that 86%, we had 10 functions, and maybe two of those functions dragged our average down a lot. There were about eight simple functions that were over 90, like 93% type numbers, so on very simple functions we had really, really high reliability. I forget the actual details, but there were two that brought the average down to the 86% where we finished. There's also a blog post you can read about that for more detail. We see this with smaller models all the time: our experience is that on a given eval, fine-tuning is worth between 20 and 40 points. So it's a really, really significant win for tiny models, right? When we're talking 200 million parameters, or the 270M we published with Gemma 3, fine-tuning is essential for most things, unless you have a model that was already published to do a narrow task. You'll find transcription models out there in that size class that work really well at one task and don't require further fine-tuning, but that's because they were scoped to that narrow task from the outset, right? If you want to do your own narrow task and ship something to lots and lots of devices, then fine-tuning is, at least for now, the workflow of choice. Yeah, it may change.
>> Will we get a codelab that can target the web?
>> That's a good question. We have a Colab that gets you as far as the LiteRT file, right? Our support for LiteRT-LM on web is a work in progress. I'll just say: please look at our GitHub for the latest status; you'll see it there.
>> So will we get a fine-tuning manual for Gemma 4, by any chance?
>> So for Gemma 4, what was announced last week was small models and medium-sized models, right? In the past, for Gemma 3, we also published tiny models. The first Gemma 4 models shipped just last week, so there will be more Gemma models in future, I imagine, right? There is some fine-tuning support already available for the larger models, but the tiny models we've published at this point in time are Gemma 3 tiny models, and if anybody's reading or listening to this talk in a few months' time, please search the web for the latest information. As of right now, for FunctionGemma and for Gemma 3 270M there are fine-tuning workflows available. For Gemma 4, as available last week, there are fine-tuning recipes in Vertex AI and elsewhere, I believe. Check out the other Gemma talks later in the week and you'll see and hear more.
>> Oh yeah, the Eloquent app is actually available on iOS if anybody wants to give that a go.
>> Not in Europe.
>> Yeah, I could not get it.
>> Oh, okay.
>> You can find it in a web browser, but not on the store.
>> Wow. So that needs to be enabled, I guess. That's really helpful feedback; I'll pass that along. Okay, and then to wrap up. The key takeaways: system GenAI medium-sized models, or small models in our parlance, will be turning up in a mobile device near you. Those same models are excellent for use in embedded systems and embedded platforms. At least for now, given the memory in mobile phones, which doesn't look to be getting larger anytime soon given the cost, tiny models are where it's at for deploying things widely. We hope to make those easier and easier, to have stronger and stronger models available through the partnership with GDM, and to make the fine-tuning workflows as easy as we can so this becomes more accessible. But yeah, happy to share what we've been doing on both of these fronts.
So I think we have nine minutes if anybody else wants to ask any questions. Happy to. Yeah.
Yeah.
>> Okay.
>> Yeah, great question. So, questions about safety on edge models. Firstly, the Gemma team spend an awful lot of time on this, and I would defer to them for all questions on safety for the models that were published last week; it's really top of mind for them. Additionally, for system GenAI within AICore, for example, what actually ships there, and you'll probably hear more of this from other people, ships as part of the OS, not as the raw model we published last week. The system vendor will typically add some sort of input and output safety checker on the model, as a kind of aftermarket addition, because there are particular things their product does and doesn't want, right? So that's another layer for smaller models. For really tiny models, safety is really important too, but at least for us, the way we've deployed tiny models is within a very narrow API or task. You can imagine with Eloquent, the app you saw, that technically the risk profile is more that of a regenerative app than a generative app, if you know what I mean; it's not going to fully make things up on the fly. So you can look at the scope of what the model is trying to do, its API surface and functional surface. To make a tiny model work, it generally has to have a narrower functional scope. You still need to do safety, but it's a narrower problem you need to solve, is what I would say.
>> Is there a place where I can look up how to deploy the slightly bigger Gemma models on something like a 5090? I have a 5090 and I'm going to deploy on it and try it out. Is there a way I can look that up?
>> So, on a 5090: I know we don't have specific documentation for that, but our tooling does support NVIDIA GPUs. NVIDIA is also a Gemma partner, and they have supported some of the Gemma launches in the past, so if you look on NVIDIA's own web pages you may see some of that through their TensorRT-LLM. I can't point you to specific documentation as a fast answer, but those are the two places I would check.
>> Yeah.
>> Okay.
>> Sorry.
>> Yeah, that's all right. So the examples you showed for agent skills are obviously single-skill execution: you run one skill. Would you change the architecture for multi-skill?
>> We do actually support multi-skill execution, right? Within the app, if you download it, you can define the skills you want loaded and just toggle them on or off. And then if you're very, very specific in your prompt, something like: look up this topic on Wikipedia, summarize it to three bullets, and then display them as flashcards; if you're really, really specific, that's going to work, right? Frankly, we've only had this model a couple of weeks, so we're still putting miles on the clock and seeing the boundaries of how much skill stacking and skill chaining we can do. Most of the examples we published were single-skill, but even in the diarization app, if you recall, Alice, the person doing that demo, asked it to summarize her mood, then what time she's meeting up with Amy, and then a Wikipedia lookup came up. In that example she was able to show, within a single conversation, individual turns using individual skills. Skill stacking within an individual prompt, I believe we've seen that work where you're pretty explicit, but frankly we're still learning the boundaries of what's possible with these classes of models.
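As a purely hypothetical illustration of that pattern (none of these skill names or the config shape come from the actual app), the key point is that explicit, step-by-step prompts are what make chaining work:

```python
# Hypothetical sketch: skills toggled per session, plus an explicit prompt
# that names each step so a small model can chain them. All identifiers are
# illustrative, not taken from the AI Edge Gallery app.
enabled_skills = {
    "wikipedia_lookup": True,
    "summarize": True,
    "flashcards": True,
    "calendar": False,  # toggled off for this session
}

# Vague ("teach me about photosynthesis") often fails to chain;
# explicit, one named step per skill tends to work.
prompt = (
    "Look up 'photosynthesis' on Wikipedia, "
    "summarize the article to three bullets, "
    "then display the bullets as flashcards."
)
```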
>> Sorry, a follow-up question. So, the decision to identify which skills are relevant to the problem, given that it's a small model: do you do instruction tuning from a larger model to build that intelligence, or is it just really good at it?
>> The model was trained to be really good at agentic workflows and at function calling, right? But nothing in the Gemma model was trained specifically for our skill pattern; that came afterwards. When we got the model and started playing with it, it was like, wow, this works, and then: can we do this too? So there's no specific training for our skill structure. The team spent lots of time on general-purpose thinking and function calling; the GDM team did lots of great work to give us a general-purpose model that's really strong, but there's nothing special for our app. So if you had a slightly different take on skills, and maybe there's a better skill architecture than the one we've shown you today, it's entirely possible you could be pretty successful. There's no hidden secret sauce in what we've shown.
>> Yeah.
Sorry, I think you were maybe next. Yeah.
>> I'm building something similar, and the first big challenge is context, particularly once you start doing the agentic stuff. Can you talk a little bit about the context window?
>> On E2B? Okay. I would defer you to the Gemma team for official guidance. The medium-sized models have a context window of 128K, and for the smaller models I would actually need to double-check, right?
>> 32K.
>> 32K, yeah, that's what I was wondering. In our implementation, in the Gallery app, we default to something like 8K or 12K just for performance reasons, but the models do support more: E2B and E4B support up to 32K, and the other models support up to 128K.
>> With 32K you use a lot of memory, though.
>> I wonder if we have stats for that in the model card. For the E2B and E4B models, the memory footprint for a larger context is not as bad as you'd think, right? The team optimized that metric for those models because they were targeted at edge use cases, so the amount of KV cache required per input token was something that was specifically optimized. The models behave pretty well on that front. I don't have a bytes-per-input-token number off the top of my head, but it's good for its model class, is what I would say. Yeah.
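For intuition only, here is the standard back-of-envelope KV-cache arithmetic; the layer count, head counts, and quantization below are made-up placeholder dimensions, not the actual E2B/E4B configuration:

```python
# Rough KV-cache sizing: bytes per token =
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element.
# All dimensions below are placeholder assumptions for illustration.
layers, kv_heads, head_dim = 30, 2, 256
bytes_per_element = 1  # e.g. an int8-quantized cache

per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
print(f"{per_token} bytes/token")                     # 30720 bytes, ~30 KB
print(f"{per_token * 32_768 / 2**30:.2f} GiB @ 32K")  # ~0.94 GiB full context
```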
>> And a second question: do you have an iOS version?
>> Yeah, the AI Edge Gallery works on both iOS and Android.
>> In terms of it being open source, is there a macOS version?
>> macOS, that's a good point. I'll put that in the coming-soon bucket; it's certainly one of the items on our to-do list. We only published the iOS app for the first time in January, whereas the Android app has been available since last summer, right? But our intention is to have a what-you-see-is-what-you-get experience for developers: you can use the app, have fun, experiment with the models, and then also get the source code and see how it's built, etc. That's certainly our intention.
>> Yeah.
>> Probably the last question, because we're almost at time.
>> Is there a trade-off between fine-tuning individual models for specific tasks and the actual amount of memory they consume on devices? Say, if you're chaining models together.
>> To clarify, which model is your question about?
>> I just downloaded whatever it was, the E2B.
>> The E2B model.
>> Yeah.
>> Right. So for E2B, and for the small models that are published, we would recommend customization via skills or via prompting, not via fine-tuning, right? For tiny models, we would recommend customization through fine-tuning if you were deploying a smaller model. The other path available to you for the small models is LoRA fine-tuning. I know Apple supports that in their Foundation Models framework, and there's an AICore speaker with an AMA coming up; you can check with him whether it's on their roadmap. But certainly if you were deploying on an embedded system, say if somebody asked me about deploying the 2B on a robotics platform, I would say: absolutely, you should fine-tune LoRAs for each of your tasks. Our runtime supports loading the model and hot-swapping LoRAs, so you don't even need to load and unload the base model to load and unload LoRA adapters; it's built for that particular use case, for robotics or IoT platforms. And those LoRAs are much smaller; depending on the rank you choose, maybe 8 to 100 megabytes, in that kind of range. Yeah.
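As a minimal sketch of that per-task LoRA path, assuming Hugging Face PEFT and an illustrative checkpoint id (the rank, target modules, and model name are placeholder choices, not official guidance):

```python
# Hedged sketch: LoRA fine-tuning a small Gemma checkpoint with Hugging Face
# PEFT. Model id, rank, and target modules are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")

lora_cfg = LoraConfig(
    r=16,  # adapter rank: the knob that drives adapter size on disk
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# ... train as usual, then save only the adapter. Small ranks keep adapters
# in the megabytes range, which is what makes hot-swapping them per task
# cheap compared to reloading the whole base model.
model.save_pretrained("task_a_adapter")
```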
>> Thanks a lot.
>> Yeah.
Cool. All right. That's a wrap. I'm
going to let everybody get lunch. Yeah.
[applause] Thank you.
[music]