Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI
Channel: aiDotEngineer
Published at: 2026-04-20
YouTube video id: a2muGkT4WD4
Source: https://www.youtube.com/watch?v=a2muGkT4WD4
>> Okay, hello everybody. Today I'm going to show you how to run Gemma 4 on iPhone with MLX. First, let me introduce myself. I'm Adrien. You can find me on Twitter if you want to learn more about all things on-device. I'm the developer of Locally AI, so maybe you have already seen the app. Locally AI is a chatbot that lets you run on-device models on your iPhone with MLX; I'll explain what MLX is in a few seconds. It's fully native, you can also chat with Apple Foundation Models in it, and many models are compatible with MLX. One of those models is Gemma 4.

Google has a lot of models in the Gemma 4 family, and some of them, like the smaller ones, can run on iPhone, and they are pretty great. Maybe you have seen one of the posts I made on Twitter where I demoed it running in the app on iPhone. It's really fast, and it runs really well on MLX.

Behind the app is MLX, a framework made by Apple that is optimized for Apple Silicon: mainly the chips in iPhones, but also the chips in Macs. Locally AI is also available on iPad and macOS, and it works very well on those devices too. Everything is built to be as optimized as possible for them. Gemma works well on iPhone, but there are a lot of other models you can run: the Qwen models, the SmolLM models from Hugging Face, and so on.

If you want to run a language model on iPhone, the place you will want to go is GitHub, to the MLX Swift LM repo. I won't go into detail on how you integrate it; I'll let your agent implement that for you. But it's the one repo you need to install if you're developing an iOS, macOS, or iPadOS app, and you can use it to simply download a model and then run it. The API is very straightforward, very simple to implement: in less than 10 minutes you can have an iOS app with a model running on your device (there's a short sketch of the code below).

This is MLX Swift LM, but if you're more into Python or macOS apps, you can also run MLX VLM from Prince Canuma. Maybe you have seen what he's doing on-device: audio with MLX Audio, vision models with MLX VLM, and image and video generation models. The MLX ecosystem is getting bigger, and it's really great right now. You can do pretty much everything: omni models, text-to-speech, speech-to-speech. There's a lot you can do.

Let's say you integrate this framework: you can't just run any model on it. You need to get the weights somewhere, and a good place to get them is Hugging Face. I'm pretty sure everybody has heard of Hugging Face. On Hugging Face, you will want to look for the MLX Community organization. That's where all the model weights, quantized and full size, get uploaded. You can just browse this community and look for models; I think right now there are almost 4,000 or 5,000 models uploaded, so the community is really active. When a model is released by a lab, you will have it almost 30 minutes after release, quantized in 4-bit, 6-bit, and everything you can imagine.
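To make the integration concrete, here is a minimal sketch of loading a model by its Hugging Face ID and generating a completion. It assumes the `MLXLLM`/`MLXLMCommon` API as it appears in the `ml-explore/mlx-swift-examples` project; exact type and function names can differ between releases, and the model ID shown is a hypothetical example:

```swift
import MLXLLM
import MLXLMCommon

// A sketch, assuming the MLXLLM API from ml-explore/mlx-swift-examples.
// Pass any repo ID from the mlx-community org on Hugging Face; the
// weights are downloaded on first use and cached locally.
func runPrompt(_ prompt: String) async throws -> String {
    let container = try await LLMModelFactory.shared.loadContainer(
        configuration: ModelConfiguration(id: "mlx-community/gemma-4-4bit")) // hypothetical ID

    return try await container.perform { context in
        // Apply the model's chat template and tokenize the prompt.
        let input = try await context.processor.prepare(
            input: UserInput(prompt: prompt))

        // Returning .more from the callback lets generation keep streaming.
        let result = try MLXLMCommon.generate(
            input: input,
            parameters: GenerateParameters(temperature: 0.7),
            context: context
        ) { _ in .more }

        return result.output
    }
}
```

In that API, the generate result also exposes a tokens-per-second figure, which is the kind of measurement behind speeds like the 40 tok/s quoted in this talk.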
Here you have an example for Gemma 4 8-bit, which is the one I run on iPhone. There are a lot of variants of it, from BF16 and MXFP4 to 5-bit, 6-bit; there's everything. So: you download MLX Swift LM and install it with your agent or however you like, then you go to the MLX Community, choose the model you want to run, and pass its ID to the framework. MLX Swift LM is integrated with Hugging Face, so it will download the model directly; you just need to grab the ID and pass it in.

Usually, when you're running a model on iPhone, what you want to do is select a quantized version, because the full-size weights will be way too large. What I recommend is trying, depending on the model size, between 3-bit and 8-bit, and usually between 4-bit and 8-bit. Below 4-bit, quantization starts to have a big impact on the output, and the models are usually not that great, so 4-bit is the lowest I would go; 8-bit is the highest I would go, for really small models. In my app, for example, I have some bigger models like Gemma 4, but I also have a Liquid model of around 350 million parameters, and it can run in Shortcuts. Models that small are really fast and really efficient at text processing, so you can build a lot of automations with them.

On the latest iPhones, if you take Gemma 4 quantized in 4-bit, it's extremely fast: it can easily run at 40 tokens per second. I'll just update the slides, because I removed the slide with the video, but I can add it back since I have a little more time. Just like this. So, just to show you in a demo what 40 tokens per second means: this is running live, offline, and as you can see, it's really fast. 40 tokens per second is more than acceptable for a lot of use cases. This is of course streaming; you could also skip streaming and show a UI that just waits for four seconds. And here the output is quite long, so it's generating a lot of tokens.

On-device with MLX, it's working, it's here, and it's really not hard to integrate. As I said, if you go to the MLX Swift LM repo, it's a breeze to install. The latest iPhones are really great, but it also works on older iPhones. You will not get 40 tokens per second, but even 20 tokens per second is already great and useful for a lot of use cases you would want in your app.

You can scan this QR code if you want to try it yourself. If you have an iPhone, the app is on the App Store and it's free to use. The only thing is that you will have to download a model, which is usually around 1 to 3 gigabytes, depending on the model. That's the biggest barrier right now, the size of the models, but this is also getting better: models are getting smaller and smarter, and iPhones are getting better too. With the next iPhone or the one after, everything will reach really great usability, from what I can see.
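Those download sizes follow directly from the quantization level: the weight file is roughly the parameter count times the bits per weight. Here is a tiny back-of-the-envelope sketch (plain arithmetic, not part of MLX Swift LM) showing why a hypothetical 4-billion-parameter model lands around 2 GB at 4-bit and 4 GB at 8-bit:

```swift
import Foundation

// Rough weight-file size: parameters × bits-per-weight ÷ 8 bytes.
// This ignores quantization scales, embeddings kept at higher
// precision, and file metadata, which add a few percent in practice.
func approxWeightSizeGB(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1_000_000_000
}

// A hypothetical 4B-parameter model at common quantization levels:
for bits in [4.0, 6.0, 8.0, 16.0] {
    let gb = approxWeightSizeGB(parameters: 4e9, bitsPerWeight: bits)
    print("\(Int(bits))-bit ≈ \(String(format: "%.1f", gb)) GB")
}
// Prints: 4-bit ≈ 2.0 GB, 6-bit ≈ 3.0 GB, 8-bit ≈ 4.0 GB, 16-bit ≈ 8.0 GB
```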
And on top of that, maybe you have heard the news from yesterday: Locally AI has been acquired by LM Studio.

If you don't know LM Studio, it's basically an AI studio for all your local models. You can download any model with LM Studio directly from Hugging Face, run it, and open a server. You can run models with llama.cpp but also with MLX, so you can really compare how the different engines perform. And as I said, you can open a server locally and connect your app to this hosted server using various response formats, for example the OpenAI response format or the Anthropic response format, with streaming and everything, so you can get any model running really easily that way (there's a short sketch of calling such a server at the end).

So, I want to thank you. That was a very short introduction to how you can do the same and run any model, like Gemma 4 if you want, on your iPhone. If you have any questions...

>> [applause]

>> Does this support tool calling?

>> Sorry?

>> Does this support tool calling?

>> Yes; I forgot about that. It supports tool calling, but not yet structured generation. There are some packages on top of MLX Swift LM that are trying to make that work; Hugging Face is working on it, and you can easily find them online. But MLX Swift LM, yes, it supports tool calling, which is really useful if you want to call over to other systems. And the models are also getting better at tool calling. They were not so great a year ago; now it's much better.

>> I see you mentioned two things. First was the GitHub repo, and then the second thing was the app.

>> So, you have the GitHub repo for MLX Swift LM; that's the package you install in your app. Then you go to Hugging Face to get the weights of the model. But a normal user can just download the app from the App Store. Oh, and yes, if you want to try my app right now without having to build anything, you can do that, and you can choose any open source model inside it. There is a selection; it's not just any model, because I'm making sure that all the models run correctly on the iPhone, since not all of them work well. Thank you very much.
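For the LM Studio server mentioned above, here is a minimal sketch of connecting an app to it. It assumes LM Studio's OpenAI-compatible chat completions endpoint on its default local port 1234; the port and the model identifier are configurable in LM Studio and are shown here only as plausible defaults:

```swift
import Foundation

// A sketch: query a locally hosted LM Studio server through its
// OpenAI-compatible chat completions endpoint (default port 1234).
struct ChatMessage: Codable { let role: String; let content: String }
struct ChatRequest: Codable { let model: String; let messages: [ChatMessage] }
struct ChatChoice: Codable { let message: ChatMessage }
struct ChatResponse: Codable { let choices: [ChatChoice] }

func askLocalServer(_ prompt: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:1234/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(ChatRequest(
        model: "local-model", // hypothetical: use whatever model is loaded in LM Studio
        messages: [ChatMessage(role: "user", content: prompt)]))

    let (data, _) = try await URLSession.shared.data(for: request)
    let response = try JSONDecoder().decode(ChatResponse.self, from: data)
    return response.choices.first?.message.content ?? ""
}
```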