MLX Genmedia — Prince Canuma, Arcee

Channel: aiDotEngineer

Published at: 2026-05-11

YouTube video id: zTLJNHj0DeQ

Source: https://www.youtube.com/watch?v=zTLJNHj0DeQ

Good morning everyone.
How many of you here have ever tried to run AI on your phone or on your MacBook?
How was that experience? Was it good?
With MLX, yes. More or less? Okay. So, this talk today is for you.
I'm going to show you how you can deploy
and manage AI agents, or even voice
agents if you will, completely on device
using MLX.
Today's agenda: of course, we're going to start with why on-device. A lot of you
use Claude Code subscriptions and many other subscriptions. I want to convince you
today to offload some of that subscription completely on device, and then all you
need to pay is your energy bill. Then I'm going to talk about MLX, give you a small
demo, and show you some of the amazing projects the community has built using the
projects that I'll demonstrate today.
In 2020, something very magical and also weird happened. It was the best yet also
the worst year of my life. It was a weird one. On one side, my dad became blind.
And on the other side, Apple released one of the most powerful chips for on-device
intelligence ever.
And I remember talking to my dad and I said, "Hey, I promise you that I'll get you
back to reading." He's one of the most voracious readers I know, and losing his
sight was one of the main things that really bummed him out. He could no longer
consume information. And I made this really weird promise. I said, "Hey, I'm going
to fix this somehow."
And at the same time, this happened. And I thought, "Hmm, compute in the cloud
doesn't necessarily solve all of these use cases, because my dad lives in Africa,
and there we don't have internet as easily as we have here, or the subscription
plans are really, really bad. So on-device is the future." And then, in 2023, I was
browsing some really cool projects on GitHub and I saw this project called MLX.
It's an array framework for Apple silicon; you can imagine PyTorch or TensorFlow
for Apple silicon. And I tried out the initial example and I thought, "There's a
future here. There's a future that was promised to all of us, one that the big
companies like Meta and Google could not really deliver, because they were trying
to optimize for scale in the cloud."
And Apple did something different. So then I started contributing to MLX. Three
years later, we have over 1.5 million downloads, over 4,000 models ported, and we
work with some of the best frontier labs to deliver day-zero support for your open
source models. Take Gemma 4, the latest release: we had day-zero support for it on
MLX, meaning you can run the best frontier open source models completely on your
MacBook, your iPhone, or your iPad.
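(As a minimal sketch of what running one of these open models on a MacBook looks
like with the mlx-lm Python API; the model repo below is illustrative, not one
named in the talk:)

```python
# Minimal text-generation sketch with mlx-lm (pip install mlx-lm).
# The model repo is illustrative; any MLX-converted model from the
# mlx-community organization on Hugging Face should work similarly.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [{"role": "user", "content": "Why does on-device inference matter?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```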
So, let's start with vision, which for him was the biggest sensory deprivation. He
cannot see; he cannot navigate the world. And I thought, well, the easiest way to
help is by giving him his vision back, or giving him a system that can help him
navigate the world. In 2021, I went to a hackathon and built these goggles with my
team that could tell you what's in front of you. But then MLX VLM became the second
iteration of that, which allows you to do this not only via some weird glasses, but
even on your iPhone. You can now just pick up your phone, point it at something,
and you'll be able to understand what's in front of you.
And nowadays you also have omni models, beyond just vision models. These are models
that can also take in audio. For my dad in particular, typing is not really a
reality, but he can speak. And with his speech, he can control the camera,
understand what's in front of him and what he needs to do, and navigate the world.
So, this is MLX VLM. Right now it's one of the main engines that powers LM Studio;
it powers Liquid AI models and many other models out there.
But when you think on-device, you might think, well, that's weird; I cannot run
really large models. That's not true anymore. You can now run models of hundreds of
billions of parameters even on your initial M1 MacBook. There are a lot of
improvements the community has made that allow you to run even the largest models
completely on device. I have some examples showing that you can run models like
Gemma 4 26B on an iPhone using your storage, and you can still get reasonable
speeds.
Then, after a year or two, I was also experimenting with how we can enable humans
more and give them more accessibility, especially the ones that don't have all of
their senses. And I thought audio is the next iteration. But it was for that and
also for a very selfish reason: I wanted to be able to control my computer without
being in front of it all the time. What if I could just blurt out a command and
have my computer do it? This is more of the Jarvis vision of the world, where you
can just speak to your computer and have actions done for you.
We started off with text-to-speech, and that went on to become Marvis, one of our
custom models, which can generate audio in less than 100 milliseconds.
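(A rough sketch of what text-to-speech with MLX Audio looks like from Python; the
exact generate_audio parameters and the model repo below are assumptions based on
the mlx-audio project's docs, not something shown in the talk:)

```python
# Hedged sketch: generate speech with mlx-audio (pip install mlx-audio).
# The function parameters and model repo are assumptions; check the
# mlx-audio README for the exact API and the available Marvis/Kokoro voices.
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Hello from MLX Audio, running completely on device.",
    model_path="prince-canuma/Kokoro-82M",  # illustrative TTS model
    file_prefix="hello",                    # output file prefix
    audio_format="wav",
    verbose=True,
)
```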
Then we have speech-to-text, which allows you to speak to your computer and have it
transcribed in real time. For example, how many of you here use Whisper Flow or
Superwhisper? Yeah. You can now build that application yourself: just point Claude
Code or Codex at MLX Audio, ask it to build it for you, and you'll have it in about
10 minutes.
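(For the transcription piece specifically, a minimal local speech-to-text sketch
with the mlx-whisper package, a related MLX project; the talk itself doesn't name
it, and the checkpoint below is just an example:)

```python
# Minimal local speech-to-text sketch with mlx-whisper (pip install mlx-whisper).
# The Whisper checkpoint is an example MLX conversion; swap in any
# mlx-community Whisper repo you prefer.
import mlx_whisper

result = mlx_whisper.transcribe(
    "meeting.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
print(result["text"])
```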
And then you also have speech-to-speech. Beyond the first two capabilities, a big
unlock is having the computer speak back to you. Speech-to-speech is one of the
core capabilities we recently added, and we support both Python and Swift. We
started off with Python because it's just much easier and scales faster, but we
also understand that native experiences matter. So with Swift, you can now build
fully native applications enabled by audio intelligence as well as vision
intelligence.
And on the right side there, you have our modular pipeline. Beyond models that are
natively speech-to-speech, you can chain a series of different capabilities to
create a modular speech pipeline. For instance, you can choose which automatic
speech recognition model you want, which language model you want, and which
text-to-speech model you want. This way you can create really custom, modular
experiences that fit every hardware budget. So whether you have a very simple
first-generation M1 Apple Silicon machine or the latest one, you can adjust it to
your hardware.
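(A rough sketch of that ASR → LLM → TTS chaining in Python, assuming you glue
together mlx-whisper, mlx-lm, and mlx-audio; the model repos and the generate_audio
call are illustrative assumptions, not taken from the talk:)

```python
# Hedged sketch of a modular speech pipeline on Apple silicon:
# ASR (mlx-whisper) -> LLM (mlx-lm) -> TTS (mlx-audio).
import mlx_whisper
from mlx_lm import load, generate
from mlx_audio.tts.generate import generate_audio

# 1. Speech -> text
asr = mlx_whisper.transcribe(
    "question.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
user_text = asr["text"]

# 2. Text -> response
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_text}],
    add_generation_prompt=True,
)
reply = generate(model, tokenizer, prompt=prompt, max_tokens=256)

# 3. Response -> speech (parameters are assumptions; see mlx-audio docs)
generate_audio(
    text=reply,
    model_path="prince-canuma/Kokoro-82M",  # illustrative TTS model
    file_prefix="reply",
    audio_format="wav",
)
```

Swapping any of the three stages for a different model is just a matter of changing
the repo you load, which is the point of the modular pipeline.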
But then you might think, well, speech-to-speech or text-to-speech is not that
good; I've seen some videos and audio clips of speech-to-speech, and maybe it
doesn't sound great. I can promise you that it does, and I'll show you a demo in a
bit.
So,
let's start with vision.
And with vision, I have a couple of examples. The first one here is real-time image
analysis, so you can understand what's happening. It's a very simple command, and
we'll make this even simpler. Right now it's in Python; it will come to Swift very
soon. If I run this command, it's going to run the RF-DETR model by Roboflow. And
as you can see, this is completely real-time; it's detecting what's in the frame,
and this is all running on my computer. Just to make that very clear, I don't know
what's going to happen if I turn off the internet, but here it is, continuously
running. I can grab a glass, and it will detect that too. Of course, it thinks it's
wine, but I'm sober.
So, this is running real-time completely
on device on my Mac.
It can also run on your phone. And this is one example; I want to show you another
really cool thing you can do with this particular use case. Have you ever tried to
blur the background when you're in a meeting? You know that Google does this, and
so on. Now you can actually do this natively and build this kind of experience into
your products. I'm not sure you can see that it's blurring the background, but it
actually is, and it's detecting my mask in real time and will detect other objects
as well.
All right. So, that's example number one. Number two is that you can run really
large models completely on device. Here's Gemma 4, which was released, I think,
last week. With MLX VLM you just run mlx_vlm.chat_ui and pass in the model you
want, and it should load that model and give you a simple Gradio interface to get
started. Wait.
Not sure what's happening. This is the problem with demos; sometimes the gods don't
want the demo to happen. Give me a second. Okay, not sure what's happening. Well, I
need to quickly close something off. Okay, it stopped. Let's try that again. Okay.
It's trying to download some model files, but now I'll turn off the internet again
just to show you that this is running completely on device. So, now we have a chat
window. I don't know if you can see this or if I have to zoom in more.
And we can choose any image to analyze. Let me use a very simple one. Okay, here.
"Describe this image in detail." So, here it is. It's saying that it's a profile of
a man named Prince Canuma, and it's picking up all the different details, like my
bio and so on. And all of this is running on device.
It's using the GPU on this particular device. This machine has about 96 GB of VRAM,
so I can actually run all of the models I've showcased so far in real time, all of
them, at the same time.
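(The chat_ui command wraps the same Python API you can call directly; a minimal
sketch of describing an image with mlx-vlm, where the model repo is illustrative
rather than the exact Gemma build used in the demo:)

```python
# Minimal single-image description sketch with mlx-vlm (pip install mlx-vlm),
# following the pattern from the mlx-vlm README. The model repo is illustrative.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

images = ["profile.png"]  # local path or URL
prompt = apply_chat_template(
    processor, config, "Describe this image in detail.", num_images=len(images)
)

output = generate(model, processor, prompt, images, verbose=False)
print(output)
```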
So, this is demo number two. Let's now get back. I wanted to show you audio, but it
seems there was something off with the Swift branch that I couldn't figure out this
morning. So, for the sake of examples, I will show you some of the really cool
community use cases. We have one here from one of the creators; I didn't include
this particular use case, but I'll show you a video in just a bit. Let me try to
get this in full screen so that you can see it better. One second.
Hey, okay. So, we are back.
The first example is grounded visual reasoning. You can use Gemma, you can use the
model I just showcased, RF-DETR, or any of our perception models, and you can
create really cool experiences like this one. You can detect all the fires; you can
ask it to detect particular items in the video. And this all happens completely on
device, without the internet. So now you can have, for example, security systems
that run completely on a MacBook in your house, and you can even analyze your dash
cam footage. I have a dash cam video, and I've had some really crazy experiences,
and I usually use this kind of system that I've built to analyze the footage
afterwards.
And then you have example number two. This is more of an honorable mention:
cartoons generated completely on device using MLX Video, one of the most recent
projects I'm running. Here's one of the coolest videos generated by one of our
users. This is all generated on device from a simple text prompt. And he did
something very interesting: this is not one video that he generated all at once. He
chained a system that continuously generates from the previous clip, so you can
create a cohesive story even though it wasn't one shot. And this particular system
can run even on a MacBook with 16 GB of RAM.
So, yeah, pretty cool. He has a lot more on Twitter; I'll reshare all of these
videos on my Twitter, so if you check my profile you'll find them. And then I want
to show one of the latest ones, but before that, let me show you one here from one
of our participants on the floor. So, if I go to my drive, let me see... okay, here
it is.
This is actually by this gentleman here in front, Adrian. He builds an application
called Locally, and using MLX Audio and Marvis TTS, he gave that application the
capability of speaking back to its users. Can you increase the volume? "The
Chatsubo was a bar for professional expatriates. You could drink there for a week
and never hear two words in Japanese."
All right. So, that is one of the examples where you can build really beautiful
native experiences, and if you have a good touch for design, a killer application
as well. And finally, I think this is one of the most exciting parts of where I
think we are going next with on-device AI, which is robotics.
So, last year I acquired a robot called Reachy Mini. And the way that I power my
Reachy Mini in particular is by using MLX Audio and MLX vision to give it
perception capabilities through its camera and audio input. So, here it is. It's
also doing real-time voice cloning of the original, how do you call it, Iron Man
Jarvis voice. I hope you can hear this.
>> Hey Jarvis.
Hey there.
Great to see you.
How's it going?
So, you can chain things and build a lot of really cool applications. This is just
a start. And I hope that, from today, you understand that you can build agents that
can hear, see, and sound just like you or one of your loved ones, running on your
iPhone, iPad, Mac, or even your robot. Thank you.
Questions?
>> Apple is promoting their Neural Engine a lot. But if you look at the usage of
the Neural Engine, it's always zero.
That's a great question. So, MLX uses the GPU, not the Neural Engine. To enable the
Neural Engine, you need Core ML. And right now, Core ML does not really run well;
it's not an easy experience for developers. I hope that by WWDC, Apple solves the
private API issues. And when they do, we have some internal projects that would
allow you to run hybrid inference across both.
Yeah. It will be, it will be. But we also think they might be changing the Neural
Engine and putting some of its components into the GPU. With the M5 series, for
example, you can see it already has some of them. We just don't know what direction
they're heading, but it's exciting. Let's wait for WWDC.
Questions? Yeah?
>> Is there a tool to see your GPU usage in real time?
Yes, the easiest tool to get started with is mactop. If you run mactop, it's going
to show you pretty much all your usage. This is by Carson; he's a really cool dude.
So, you can see here pretty much what's happening: the GPU, the CPU. And you can
have this overlay on your device. And if we do run inference, let's say I start a
new window and find that command... okay. So, if I start running inference on this,
let me use four-bit here. Okay. You'll see that the GPU is now going to start
moving up. And if I say "hi, how are you"... you see, the GPU is already moving up,
and this is one of the easiest ways I've found to track performance.
Any other questions? Yep.
>> You also mentioned omni models. What's the one you would recommend?
So, you have a couple of options. The first one is Gemma 4, the E version. They
have Gemma 4 with the letter E and then a number, E4 or E2. Those are omni models;
they take image, audio, and text as input, or any variation of the three. And then
you also have Qwen 3 Omni, which is a much larger model, around 30 billion
parameters, but you can also run it on device. Those are the top ones that I know.
>> What are the key limitations of those models at the moment?
One of the key limitations... I think it depends on your particular use case. There
are certain things the models just cannot do. You're not going to get the
performance of Claude Opus today, but maybe in six months these open source models
will have that performance. So the experience should be adjusted to your
expectations of performance. That's the only thing I would say. Outside of that, I
don't see any limitations. You can run inference on hundreds of images in parallel,
you can run inference on many, many documents, and you now have context of up to a
million tokens thanks to a recent breakthrough I made with TurboQuant. So now you
can actually serve one million tokens of context completely on device, depending on
the size of the model and your hardware, but you can do that today.
>> I don't know the specifics about it, like how you have to...
Yes. Mhm. So, I was one of the first people in the world to implement TurboQuant
publicly. About 30 minutes after the paper was out, I had already implemented it.
And I made this tweet at like 3:00 a.m.; I didn't know it would go this viral.
But if you go to my profile and search for TurboQuant, you'll be able to see this
particular post. This was around March 25th, pretty much the same day the paper
came out, just around midnight, and it got about 700,000 views because of that. So,
TurboQuant does work. For example, the full model takes almost 1 GB of KV cache in
RAM, and by using TurboQuant you can reduce that by 4x.
Yeah.
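(To make the 4x concrete, a small back-of-the-envelope sketch of KV cache size
versus quantization bit width; the layer and head counts are illustrative, not the
specific model from the talk, and this is generic KV quantization math rather than
TurboQuant itself:)

```python
# Back-of-the-envelope KV cache sizing: 2 tensors (K and V) per layer,
# each of shape [num_kv_heads, seq_len, head_dim]. Quantizing the cache
# from 16-bit to 4-bit cuts its memory roughly 4x (ignoring scales/zero points).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits):
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits / 8

# Illustrative 8B-class model config (assumed, not from the talk).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8_192)

fp16 = kv_cache_bytes(**cfg, bits=16)
q4 = kv_cache_bytes(**cfg, bits=4)
print(f"fp16 KV cache:  {fp16 / 1e9:.2f} GB")  # ~1.07 GB
print(f"4-bit KV cache: {q4 / 1e9:.2f} GB")    # ~0.27 GB, about 4x smaller
```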
Yeah, similar quality. When you see "exact match", it means it matches the
responses of the full model. And I also published the full results and performance.
For example, here, when you get to around 300,000 tokens of context, the throughput
almost doubles. So, yeah. This is one of the many things I'm trying to do to push
on-device even further.
Yep. Any other questions?
All right, thank you.