MLX Genmedia — Prince Canuma, Arcee
Channel: aiDotEngineer
Published at: 2026-05-11
YouTube video id: zTLJNHj0DeQ
Source: https://www.youtube.com/watch?v=zTLJNHj0DeQ
Good morning, everyone. How many of you here have ever tried to run AI on your phone or your MacBook? How was that experience? Good? With MLX, yes? More or less? Okay, then this talk is for you. I'm going to show you how you can deploy and manage AI agents, even voice agents, completely on device using MLX.

Today's agenda: of course, we start with why on device. A lot of you pay for Claude Code and many other subscriptions. I want to convince you today to offload some of that work completely on device, so that all you pay is your energy bill. Then I'll talk about MLX, give you a small demo, and show you some of the amazing projects the community has built using the tools I'll demonstrate today.

In 2020, something very magical and also weird happened; it was the best and yet the worst year of my life. On one side, my dad became blind. On the other, Apple released one of the most powerful chips for on-device intelligence ever. I remember talking to my dad and saying, "I promise I'll get you back to reading." He's one of the most voracious readers I know, and losing his sight was one of the things that really bummed him out: he could no longer consume information. So I made this really strange promise: "I'm going to fix this somehow." And at the same time, Apple Silicon happened, and I thought: compute in the cloud doesn't solve all of these use cases, because my dad lives in Africa, where internet access isn't as easy as it is here and the subscription plans are really, really bad. On-device is the future.

Then, in 2023, I was digging through some really cool projects on GitHub and saw this project called MLX, an array framework for Apple Silicon; imagine PyTorch or TensorFlow for Apple Silicon. I tried the initial example and thought: there's a future here, the future that was promised to all of us and that big companies like Meta and Google could not really deliver, because they were optimizing for scale in the cloud. Apple did something different. So I started contributing to MLX. Three years later, we have over 1.5 million downloads, over 4,000 models ported, and we work with some of the best frontier labs to deliver day-zero support for your open-source models. Take the latest Gemma 4: we had day-zero support for it on MLX, meaning you can run all of the best frontier open-source models completely on your MacBook, iPhone, or iPad.
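For context, the kind of initial example you'd try with MLX looks roughly like this; a minimal sketch of the public mlx.core API (the exact snippet from back then isn't in the talk, so treat this as illustrative):

```python
import mlx.core as mx

# Arrays live in unified memory, shared by CPU and GPU with no copies.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# Operations build a graph lazily; nothing has run yet.
c = a @ b + 1.0

# Evaluation happens on demand (on the GPU by default).
mx.eval(c)
print(c.shape, c.dtype)
```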
So, let's start with vision, which for my dad was the biggest sensory deprivation: he cannot see, so he cannot navigate the world. The easiest way to help is to give him his vision back, or at least a system that helps him navigate. In 2021, I went to a hackathon and, with my team, built goggles that could tell you what's in front of you. MLX VLM became the second iteration of that idea: it lets you do this not through some odd glasses but on your iPhone. You can now just pick up your phone, point it at something, and understand what's in front of you. And beyond vision models, you now have omni models: models that can also take in audio.

For my dad in particular, typing is not really an option, but he can speak. With his speech, he can control the camera, understand what's in front of him and what he needs to do, and navigate the world. This is MLX VLM. If you use LM Studio, it's one of the main engines powering it; it also powers the Liquid AI models and many others.

When you think on device, you might assume you cannot run really large models. That's not true anymore. You can now run models with hundreds of billions of parameters, even on a first-generation M1 MacBook. The community has made a lot of improvements that let you run even the largest and most unusual models completely on device. I have examples showing you can run models like Gemma 4 26B on an iPhone by spilling over into storage, and still get reasonable speeds.

Then, after a year or two, I started experimenting with how to enable people further and bring more accessibility to those who don't have all their senses. Audio was the next step, and also, honestly, a very selfish one: I wanted to control my computer without being in front of it all the time. What if I could just blurt out a command and have my computer do it? This is the Jarvis vision of the world, where you speak to your computer and actions get done for you.

We started with text-to-speech, which became Marvis, one of our custom models that can generate audio in under 100 milliseconds. Then we have speech-to-text, which lets you speak to your computer and have it transcribed in real time. How many of you use Wispr Flow or Superwhisper? You can now write that application yourself: point Claude Code or Codex at MLX Audio, ask it to build it for you, and you'll have it in about 10 minutes. And beyond those two capabilities, a big unlock is having the computer speak back to you, so speech-to-speech is one of the core capabilities we recently added.

We support both Python and Swift. We started with Python because it's much easier and scales faster, but we also understand that native experiences matter, so with Swift you can now build fully native applications enabled by audio as well as vision intelligence. And on the right side there, you have our modular pipeline: beyond models that are natively speech-to-speech, you can chain a series of capabilities into a modular speech pipeline. You choose which automatic speech recognition model you want, which language model, and which text-to-speech model, and you get custom, modular experiences that fit every single hardware budget, whether you have a simple first-generation M1 or the latest Apple Silicon.
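To make the modular pipeline idea concrete, here is a rough sketch of the ASR → LLM → TTS chaining in Python. The mlx_lm load/generate calls follow that package's documented API; the transcribe and synthesize wrappers (and the my_speech_stages module) are hypothetical stand-ins for whichever MLX Audio speech-to-text and text-to-speech models you pick, since the exact entry points vary by release:

```python
from mlx_lm import load, generate  # documented mlx-lm API

# Hypothetical wrappers around your chosen MLX Audio ASR and TTS models;
# the real entry points live in the mlx-audio package and vary by release.
from my_speech_stages import transcribe, synthesize

# 1. Speech -> text with an on-device ASR model.
user_text = transcribe("question.wav")

# 2. Text -> text with any LLM that fits your hardware budget.
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_text}],
    add_generation_prompt=True,
)
reply = generate(model, tokenizer, prompt=prompt, max_tokens=256)

# 3. Text -> speech with the TTS model of your choice (e.g. Marvis).
synthesize(reply, out_path="reply.wav")
```

Each stage swaps independently: a bigger language model on an M3 Max, a smaller one on an M1 Air, without touching the other two stages.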
But then you might think: speech-to-speech, text-to-speech, is it actually that good? You may have heard some demos and be unsure whether it sounds good. I can promise you that it does, and I'll show you a demo in a bit. So, let's start with vision; I have a couple of examples.

The first one is real-time image analysis, so you can understand what's happening around you. It's a very simple command, and we'll make it even simpler; right now it's in Python, and it will come to Swift very soon. If I run this command, it runs the RF-DETR model by Roboflow, and as you can see, it's completely real-time, all running on my computer. Just to make that very clear: I don't know what happens if I turn off the internet, but here it is, still running. I can grab a glass, and it detects that too. Of course, it thinks it's wine, but I'm sober. So this runs in real time, completely on device, on my Mac, and it can also run on your phone.

Here's another really cool thing you can do with the same capability. Have you ever tried blurring the background when you're in a meeting? You know Google does this, among others. You can now do it natively and build this kind of experience into your own products. I'm not sure you can see it from there, but it actually is blurring the background, and it's detecting my mask in real time and will detect other objects as well.

All right, that was example number one. Number two: you can run really large models completely on device. Gemma 4 was released, I think, last week. With MLX VLM, you just run mlx_vlm.chat_ui, pass in the model you want, and it loads the model and gives you a simple Gradio interface to get started. Wait, I'm not sure what's happening; this is the problem with demos, sometimes the gods don't want you to do the demo. Give me a second. Okay, it stopped; let's try that again. It's downloading some model files, but now I'll turn the internet off again, just to show you that this runs completely on device.

Now we have a chat window; I don't know if you can see it or whether I should zoom in more. We can choose any image to analyze; let me pick a very simple one. "Describe this image in detail." Here it is: it says it's the profile of a man named Prince Canuma, and it's picking up all the different details, like my bio. All of this is running on device, using the GPU on this machine, which has 96 GB of unified memory, so I can actually run all of the models I've showcased so far at the same time, in real time. So that's demo number two.
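If you'd rather script that describe-this-image flow than use the Gradio UI, the mlx-vlm Python API looks roughly like this; the model repo is just an example, and the generate signature has shifted a bit across releases, so check the project README against your installed version:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"  # example 4-bit VLM
model, processor = load(model_path)
config = load_config(model_path)

images = ["profile.png"]  # any local path or URL
prompt = "Describe this image in detail."

# Wrap the prompt in the model's chat template, declaring one image slot.
formatted = apply_chat_template(processor, config, prompt, num_images=len(images))
print(generate(model, processor, formatted, images, verbose=False))
```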
Now let's get back to it. I wanted to show you audio, but there was something wrong with the Swift branch that I couldn't figure out this morning, so for the sake of examples, I'll show you some of the really cool community use cases instead. Let me get this into full screen so you can see better. One second. Okay, we're back.

The first example is grounded visual reasoning. You can use Gemma 4, the RF-DETR model I just showcased, or any of our perception models to create really cool experiences, like this one: you can detect all the fires, or ask it to detect particular items in the video, all completely on device, without the use of internet. You can have, for example, a security system that runs entirely on a MacBook in your house, or analyze your dash-cam footage. I have a dash cam, I've had some crazy experiences on the road, and I usually use a system like this one to analyze the footage afterwards.

Example number two is more of an honorable mention: cartoons generated completely on device using MLX Video, one of my most recent projects. Here's one of the coolest videos generated by one of our users, all generated on device from a simple text prompt. And he did something very interesting: this is not one video generated all at once. He chained a system that keeps generating from the previous video, so you get a cohesive story even though it wasn't one-shot. And this particular system can run even on a MacBook with 16 GB of memory. He has a lot more on Twitter; I'll reshare all of these videos on my profile.

Before the last one, let me show you an example from one of the participants here on the floor. This one was built by this gentleman in front, Adrian. He builds an application called locally, and using MLX Audio and Marvis TTS, he gave it the ability to speak back to its users. Can you increase the volume? >> "The Chatsubo was a bar for professional expatriates. You could drink there for a week and never hear two words in Japanese." >> So that's one example of how you can build really beautiful native experiences, and, if you have a good touch of design, a killer application as well.
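For the curious, giving an app a voice that way is only a few lines. A minimal sketch, assuming the generate_audio helper from the MLX Audio README and an assumed Marvis model id; verify both against the current release before relying on them:

```python
# Assumptions: generate_audio's parameter names and the Marvis repo id below
# should be checked against the MLX Audio README for your installed version.
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="The Chatsubo was a bar for professional expatriates.",
    model_path="Marvis-AI/marvis-tts-250m-v0.1",  # assumed Marvis TTS repo id
    file_prefix="chatsubo",                       # writes chatsubo.wav
)
```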
And finally, one of the most exciting parts of where I think on-device AI is going next: robotics. Last year I acquired a robot called Reachy Mini, and the way I power mine is with MLX Audio and MLX VLM, which give it perception capabilities through its camera and audio input. Here it is; it's also doing real-time voice cloning of the original Iron Man Jarvis voice. I hope you can hear this. >> "Hey, Jarvis." "Hey there, great to see you. How's it going?" >> So you can chain together and build a lot of really cool applications, and this is just the start. I hope that from today you understand that you can build agents that can hear, see, and sound just like you, or like one of your loved ones, running on your iPhone, iPad, Mac, or even your robot. Thank you. Questions?

>> Apple is promoting their neural engine a lot, and I've tried it, but if you look at the usage of the neural engine, it's always at zero.

That's a great question. MLX uses the GPU, not the neural engine. To use the neural engine, you need Core ML, and right now Core ML is not an easy experience for developers. I hope that by WWDC, Apple solves the private-API issues; when they do, we have some internal projects that will let you run hybrid inference across both. It will come. But we also think they might be reshaping the neural engine and moving some of its components into the GPU; with the M5 series, for example, you can see the GPU already has some of those components. We just don't know which direction they're heading, but it's exciting. Let's wait for WWDC. Questions?

>> Is there a tool to see your GPU usage in real time?

Yes, the easiest tool to get started with is mactop, by Carsen; he's a really cool dude. If you run mactop, it shows you pretty much all your usage, GPU and CPU, and you can keep the overlay on your device. And if I run inference, say I start a new window, find that command, and load a 4-bit model, you'll see the GPU start climbing. If I say "hi, are you there?", you see, the GPU is already moving up. This is one of the easiest ways I've found to track performance.
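To reproduce that little experiment at home: open mactop in one terminal, then run a 4-bit model in another and watch the GPU gauge climb. The model repo below is just an example; any mlx-community 4-bit repo that fits in memory works:

```python
from mlx_lm import load, generate

# Any 4-bit mlx-community model that fits in memory works here.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# While this generates, mactop's GPU utilization should spike.
print(generate(model, tokenizer, prompt="Hi, are you there?", max_tokens=64))
```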
Any other questions? Yes?

>> You also mentioned omni models. What's the one you would recommend?

You have a couple of options. The first is Gemma 4, the E versions: the ones with the letter E and a number, E4 and E2. Those are omni models; they take image, audio, and text as input, or any combination of the three. Then you have Qwen3-Omni, a much larger model at around 30 billion parameters, which you can also run on device. Those are the top ones I know.

>> What are the key limitations of those models at the moment?

It depends on your particular use case; there are certain things the models just cannot do. You're not going to get the performance of Claude Opus today, but maybe in six months these open-source models will have that performance, so you should adjust the experience to your performance expectations. Outside of that, I don't see hard limitations. You can run inference on hundreds of images in parallel, or on many, many documents, and you can now have context of up to a million tokens thanks to a recent breakthrough I implemented with TurboQuant. So you can serve a 1-million-token context completely on device, depending on the size of the model and your hardware, but you can do that today.

>> I don't think I know the specifics of it.

I was one of the first people in the world to implement TurboQuant publicly; about 30 minutes after the paper was out, I had already implemented it. I made a tweet at 3:00 a.m. and didn't know it would go this viral. If you go to my profile and search "TurboQuant," you'll see the post. It was March 25th, pretty much the same day the paper came out, just around midnight, and it got about 700,000 views. And TurboQuant does work. For example, the full model takes almost 1 GB of KV cache, and TurboQuant reduces that by 4x, since a 4-bit cache entry is a quarter the size of a 16-bit one. And yes, similar quality: when you see "exact match," it means the responses match those of the full model. I also published the full results and performance; for example, at around 300,000 tokens of context, throughput almost doubles. This is one of the many things I'm doing to push on-device AI even further. Any other questions? All right, thank you.