Google Photos Magic Editor: GenAI Under the Hood of a Billion-User App - Kelvin Ma, Google Photos
Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: C13jiFWNuo8
Source: https://www.youtube.com/watch?v=C13jiFWNuo8
While people get settled in, I'll do a quick survey. Raise of hands: how many people use Google Photos in any way? Are you familiar with the application? Cool. How many people have used the editing features within Google Photos? Cool, a few less, that's fine. How many people have used DALL·E or any of these new generative image editors? Okay, not that many. Cool.

My name is Kelvin. I'm an engineer on the Google Photos editing team. Really quickly about Google Photos: I think most people are familiar with it. We are the home for your memories. We offer auto backup. When we first launched, the product was really built with machine learning in mind; we want to use ML to make your life easier. So, for example, if you back up your photos with us, we index your photos and run OCR so you can do really powerful search. You don't have to organize your photos anymore. You go on a trip, you come back home, the photos get backed up, and we go, "Hey, you went on a trip. Here's an album of your photos from that trip." You can just search for stuff like receipts from this restaurant and we'll find it for you. We're all taking way too many photos nowadays, way too many videos, and nobody has time to sit down and organize any of it. So let us do that for you. We have about 1.5 billion monthly active users and we do hundreds of millions of edits per month across our clients.

The team I work on is called the computational photography team; I just call it the editing team, which is easier to say. It was started in 2018. We had a pretty basic image editor at the time, but we wanted to really focus on this idea of using the compute on your phone. It doesn't have a great sensor, but it does have a lot of compute, and we want to use that to make really great edits for any image. It doesn't have to be captured on the phone; it could be from old devices too.

The idea with computational photography is, for example, you have this photo on the left that was taken at sunset and the subject is very dark. Traditionally with DSLRs, you would capture multiple images at different exposure levels and then combine them into the right image, which is called an HDR image. But that's a lot of work. You need a tripod, you need to know how to do this,
you need to know how to use Photoshop. With machine learning and compute, we can just do it with one image. Capture it anywhere you want, bring it to Google Photos, we run machine learning, and we generate this photo, bringing back the brightness and the variation in the image.

The reason we're doing this at Google, on the Photos team, is that we're able to vertically integrate. We control the hardware on Pixel, so we're able to use things like the Edge TPU for accelerated compute, but we also have internal research. This was 2018. Back then, I think Hugging Face had been founded, but it was not a place where you could just go find a model that suits your need and then go off and build an application with it. You can do that now, which is really great, but you couldn't then. We were able to work with researchers at Google who are experts in computer vision and machine learning to build features from the ground up: hey, can you iterate on the model, we'll iterate on the application, and let's keep going back and forth until we hit something really good that's easy to use.

I'll really briefly talk about our tech stack. We have three main clients, Android, iOS, and web, each pretty much native to its platform. My team also owns a shared C++ library that does all the model inference, all on device. We integrate it across all the clients, we integrate with our research partners, and we run inference on device using TensorFlow Lite, which is now called LiteRT.

I'll briefly show off some highlights of the editing features we built from 2018 over the next couple of years. The first one is post-capture segmentation. You capture a portrait, and afterwards you decide you actually want more bokeh in the background, but you didn't have a lens that could do bokeh in that environment. We're able to just do it for you after the fact. Another one: you capture a nice portrait but the lighting's not perfect. Maybe the sun's in your eyes and it's making you look washed out, or it's overcast and you have shadows on your face. We're also able to fix that after the fact: you want lighting on your left side, lighting on your right side, no problem, we'll do it for you. Another one, which was really popular on one of the Pixel launches, is Magic Eraser. You may be at a popular spot with a lot of tourists or distractors in the background, and you want a really clean photo of just the core subjects. No problem, you don't have to wait for the scene to clear up or anything like that. Just take the photo; we can auto-detect which subjects you care about, clean up the background, and inpaint it for you, and you get something really beautiful that you can print or show off to your close friends or family members.

Diving briefly into the first big feature we launched, post-capture segmentation: this is actually relatively simple. It's a U-Net convolutional neural network, pretty traditional ML in the computer vision space, and it focuses on a specific use case where it works really well, which is single-subject portrait segmentation. So you see below, we separate the image into a foreground, which is the white region, and the background. And this is where the strength of ML really comes in: you can do this with traditional computer vision without machine learning, but it doesn't behave as well, and you need experts in that system to tune it.
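To make that on-device inference path concrete, here is a minimal sketch of running a portrait-segmentation model with TensorFlow Lite / LiteRT. It's in Python for brevity (the actual Photos pipeline is a shared C++ library), and the model file name, tensor shapes, and output semantics are assumptions for illustration, not the real Photos model.

```python
# Minimal sketch: on-device inference with TensorFlow Lite (now LiteRT).
# The model file, tensor shapes, and output semantics are assumptions;
# Google Photos' real pipeline is a shared C++ library, not Python.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="portrait_segmenter.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def segment(image_rgb: np.ndarray) -> np.ndarray:
    """Return a soft foreground mask in [0, 1] for an HxWx3 uint8 image."""
    h, w = inp["shape"][1], inp["shape"][2]
    x = tf.image.resize(image_rgb.astype(np.float32) / 255.0, (h, w)).numpy()
    interpreter.set_tensor(inp["index"], x[np.newaxis, ...])
    interpreter.invoke()
    mask = interpreter.get_tensor(out["index"])[0]
    return np.squeeze(mask)  # ~1.0 where the model thinks "subject"
```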
With machine learning, you just get better performance. You build a benchmark and a dataset, you do the training, and it always returns a result. There's no error handling; the beauty of models is that they always work. You give them any input, they run, and they say, "Here's the output." That's nice.

There are some challenges, though. Models are really big; they're giant static files of floats. In this case, the model was 10 megabytes, which is a lot compared to code, and that's not something we can bundle into the application. We're very sensitive to our APK size, so now we have to download the model after the fact, and now you have to do model management. There's also IP protection: since we're doing this on the user's device, we have to make sure the model can't be extracted and used somewhere else. And, as I'm sure everyone here who's worked with AI/ML knows, you have to deal with evals. How do you know this model does what you want? How do you know the next model does it better? That just takes a lot of time. You have to build a benchmark, you have to run the benchmark, and you have to make sure the benchmark actually reflects your real-world usage, because if those two diverge, the benchmark is useless. I think of it this way: for traditional code you have unit tests, and that's how you make sure you don't have regressions; the benchmark is the equivalent of unit testing for your model, and you need to maintain it, and that takes time.

And then the pro is also a con: the model always returns a result. People deal with this with LLMs now. It's like, "Hey, tell me when you're not sure so I can do something else," and the LLM goes, "What do you mean? I'm always sure." Even in this case, it will always return you something. In this example, the subject has really long hair, and you'll see the mask the segmentation produces is not perfect; it's not capturing the finer strands of her hair. So when you do a blur or some really sharp segmentation, you will notice it's not perfect: there are blurry edges and that sort of thing. We can fix that post-model. We're able to run more traditional image understanding and say, "Hey, that's hair, let's follow the strands of the hair," and get a really clean mask.

A couple of years later, in 2021, we launched Magic Eraser. At this point we're really running a system of models; in the LLM world people now call it orchestration. We have a few things here: we detect distractors, we segment them, we run an inpainter on them, and then we have custom GL rendering on your device to make it all seamless: visualize the mask, animate the mask away, blend in the inpainted area, and so on. More models and more pieces mean your system is now more complicated. The models are larger, getting to hundreds of megabytes even on device, and the failure cases are more obvious. If you ask it to inpaint something in the foreground, even a good model will struggle with it, at least in 2021. As Paige mentioned, we can now do much better.
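As an illustration of that "system of models" idea, here is a rough sketch of how a Magic Eraser-style pipeline could be wired together. Every function name here is a hypothetical stand-in, not a Google Photos API.

```python
# Rough sketch of a Magic Eraser-style "system of models" (orchestration):
# detect distractors, segment them, inpaint the holes, then let the client's
# GL renderer animate the result. Every callable here is a hypothetical stub.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class EraseResult:
    masks: List[np.ndarray]   # one segmentation mask per detected distractor
    filled: np.ndarray        # image with the masked regions inpainted

def magic_erase(image: np.ndarray,
                detect: Callable,     # model 1: distractor detection -> boxes
                segment: Callable,    # model 2: box -> pixel mask
                inpaint: Callable) -> EraseResult:  # model 3: fill the holes
    boxes = detect(image)
    masks = [segment(image, box) for box in boxes]
    filled = inpaint(image, masks)
    return EraseResult(masks=masks, filled=filled)
```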
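And since the benchmark is framed above as "unit tests for your model", here is a hedged sketch of what such a regression check might look like for segmentation, using IoU as the metric. The dataset shape and the 0.9 quality bar are made up for illustration.

```python
# Illustrative benchmark-as-unit-test for a segmentation model.
# IoU is a standard metric; the quality bar and dataset layout are assumptions.
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray, thresh: float = 0.5) -> float:
    """Intersection-over-union between predicted and ground-truth masks."""
    p, t = pred >= thresh, truth >= thresh
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum()) / union if union else 1.0

def benchmark_passes(examples, segment_fn, min_mean_iou: float = 0.9) -> bool:
    """Fail the 'test' if a candidate model regresses below the bar."""
    scores = [iou(segment_fn(img), gt_mask) for img, gt_mask in examples]
    mean_iou = float(np.mean(scores))
    print(f"mean IoU = {mean_iou:.3f} over {len(scores)} eval images")
    return mean_iou >= min_mean_iou
```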
So, a quick summary of our learnings from those couple of years. ML really does give you great capabilities that you wouldn't get with traditional image understanding, and it's great that you can ship these features in a way that's very easy to use. The models themselves have very consistent latency on a specific device: once again, the model just does the same thing regardless of the input, it runs the same number of flops, and it goes great. The devices themselves, on Android especially, have huge variance. The latest Samsung and Pixel phones are really comparable to certain laptops; the really old, low-end phones are not at all comparable. So you have to deal with that. And then, once again, we really had this great relationship with our researchers, some of whom are now at DeepMind. We were able to talk to them early: "Hey, we have a use case, our users want to be able to do certain things, what do you have that lets us build something on top?" And we were able to work with them for years at a time: "Oh, you made it better this year. It's not quite ready for our use case, but that's great, let's keep working on it."

So that's all really good. The downside is, once again, the unpredictability of certain edge cases. Sometimes you'll look at two images and think they're the same. You put one in the model, great result. You put the other one in the model, terrible result. And you're like, why? And the researchers will be like, "Who knows? We should collect more data, train a new version of the model, and we'll let you know in a couple of weeks or a month whether it's fixed." That's just the nature of working with machine learning. No one can look at the system and go, "Oh, there's the bug, let me fix it, let me push the fix, it'll be up." That's totally different from software engineering. So that means slower iteration, and you've got to keep that in mind, especially for us: we launch with Pixel, we have hard deadlines. If we're launching in two months, that means we get, what, two iterations of the model at best. Can you promise a launch? Can you go to your VP and say we are ready to launch? That's a totally different mindset from machine learning.

So what happened in 2022 and 2023? What was the big excitement? Some of you probably know certain things launched; they covered it too, with DALL·E and ChatGPT. "AI, AI, AI, generative AI, generative AI... it uses AI to bring AI... AI, generative AI." Yes, that was just a snippet of I/O from that year. Really, everyone's excited about AI. No more ML; now it's all AI. Larger models, more ambitious, more ambiguous. That's the world we live in now. So that's a large reason, though not the only reason, why we decided to embark on building this new Magic Editor experience, which now uses the largest state-of-the-art models. But it also solves another problem. We'd been building these great, specific features for years at this point, but discoverability is an issue: the user still has to know, "In this case I want to use Magic Eraser, in this case I want to blur the background." AI is great at going, "Hey, you know what, in this case you should try using Magic Eraser." It solves that, and we want to combine both of them.

So, at the product level, I think previous speakers like Paige showed off this really amazing, fantastical ability to generate anything you want, things you can capture in the real world and things you can't. For Photos, product-wise, we want to be more grounded. We are the home for your memories. We don't want to generate something really weird that would stand out if you were to share it with your friends or family. We obviously will have a prompt.
Everything has a prompt now. But we want the prompt to be more specific: we still want users to be able to select and visualize what they're talking about, so you can interact with the model as a kind of co-editor or copilot. And of course, we have to use the best model available, because we want to be more ambitious and push the edge of what's possible. So we're no longer constrained to on-device, and that means some challenges.

Now we have to worry about servers. Before, everything happened on one device, and there are a lot of advantages there, but there's no way, back then or even now, that we can fit the best models on a mobile device. Two, the problem space is really big. It's great to say we want to build the best generative AI image editor, but what does that mean? LLMs can do a lot of things; editing images is a subset of what they can do, and even within that space there's still a lot of specific things they can do. So you've got to narrow down what you're actually tackling, because generative image editing is not a specific user problem. The user doesn't go, "I want to generatively edit an image." They have a use case, and you've got to meet them where their use case is. And I think the other speakers also talked about how trust and safety is a big one for this sort of thing: preventing deepfakes, being responsible, and so on.

So, client versus server. This is actually new for us. I'm coming from an on-device, local-first background; I think most people here are using AI server-side. Up until this point, I had never had to do server capacity planning. The user brings the compute. It is great. I don't pay for power, I don't pay for compute. A billion people want to use it? Five billion people want to use it? Great, nothing changes on our end; we push the same code. But now we have to worry about it, especially because we use accelerated compute, TPUs and GPUs. You really have to do a lot of planning, and that takes a lot of time. Second, latency is a concern now. Before, we had zero network latency on device. Now you worry about network quality. Is the user in a remote spot? Is the data center overloaded and backed up? Is the user really far from the data center, so the request has to make a round trip around the world, which adds a ton of latency? And then the other thing is that testing is now hard. Before, our models were small enough that we could run our tests with the models included. Now these models are way too big; we can't run them in our automated testing suite for our daily regressions. So you have to think about whether you recapture and replay the responses, or try to have a test server. None of these are really good options.
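One of the options just mentioned, recording real responses and replaying them in the automated suite, might look roughly like this. The directory layout, hashing scheme, and request shape are assumptions for illustration, not how Photos actually does it.

```python
# Hedged sketch: replay previously captured server responses in tests, so the
# automated suite never has to call the huge production model.
# The directory layout, hashing scheme, and request shape are all assumptions.
import hashlib
import json
import pathlib

RECORDINGS = pathlib.Path("recorded_responses")

def _key(request: dict) -> str:
    """Stable fingerprint of a request, used as the recording's file name."""
    blob = json.dumps(request, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def replay_edit_response(request: dict) -> dict:
    """Return the captured response for this request, or fail loudly."""
    path = RECORDINGS / f"{_key(request)}.json"
    if not path.exists():
        raise LookupError("No recording for this request; re-capture it "
                          "against the real server before running tests.")
    return json.loads(path.read_text())
```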
I do want to spend some time talking about this ambiguous problem space. There's an old comic from XKCD; I don't know how old, but it still holds up. It basically says that for some problems in computer science, it's hard to explain why they're hard. The first one is: is the user in a national park? No problem, GPS, that's a solved problem, easy, anyone can do it. Then: check whether the photo is of a bird. Back then, that would have been, "I need a computer vision research team, give me a couple of PhDs, and I'll get back to you." Now that's a solved problem too. So the world has changed, and that's the power of generative AI, which is great. I love going to hackathons now, because anyone can sit down at a hackathon and go, "You know what, I have this idea," and they can build it, and it will definitely work in some use case. That's great. So when your PM goes, "I want to solve this problem, because I used Gemini or ChatGPT and it worked in this case, so that means we can build it," as an engineer you go: did it work 5% of the time, 10% of the time, 50%, or 80%? They all "worked in some case," but those are vastly different things. You cannot ship a product that is 5% reliable, or even 50% reliable. But when you can work with research to get it to 80%, then let's talk.

So you really do want to constrain the problem you're solving. I think someone yesterday made the comment that the prompt is a bug, not a feature, and I agree with that. You want it to be easy to use. A user doesn't want to stare at a prompt box and think about what they want in a very detailed way. Paige talked about prompt editing, and we do some of that too. The user doesn't want to write a whole paragraph on their phone. We want to extract their intent and go, "Great, we think we know what you want; let us give you what you actually want, not what you say you want." And then, once again, "anything is possible" means nothing is out of scope, which means you cannot design an engineering system around it. A system, by definition, has constraints.

So really, what we focused on is reducing ambiguity across all of our functions. Product goes and talks to users: what are the big demands they have? They identified a few things. One, the ability to move things within the image, to relocate them seamlessly. Another, reimagining parts of the scene. So here, the left side is the original; it was a very gray sky, kind of boring. Reimagine the background to be something more exciting. And then we had Eraser before, but now we can do a better erase, because we're no longer constrained to the device's capabilities. So here it's not just erasing the drink itself, but also the reflection of the drink, so it actually feels more natural. We then work with research to train specific models for those use cases, so they're more efficient, more accurate, and easier to use, and we build the UX and the software engineering on top to guide users to those cases, so they work really, really well and you can rely on them. And then for the other use cases, we take advantage of the fact that LLMs tend to hallucinate: we'll just give you multiple responses. The hallucination is a feature. The good thing is we work in a creative space; there is no single correct answer. It's not like code; there's no compiler. If you see something you like out of these choices, great. Take it. Go with it.
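That "give the user multiple responses" idea is easy to sketch: ask the model for several candidates and let the person pick, rather than betting everything on a single generation. The generate_edit function and its seed parameter are hypothetical stand-ins for whatever the actual image-editing model call looks like.

```python
# Hedged sketch of "hallucination as a feature": sample several candidate edits
# and let the user choose. generate_edit() is a hypothetical model call, not a
# real Google Photos or Gemini API.
from typing import Callable, List

def candidate_edits(image, prompt: str,
                    generate_edit: Callable, n: int = 4) -> List:
    """Return n diverse candidates by varying the sampling seed."""
    return [generate_edit(image, prompt, seed=seed) for seed in range(n)]

# The UI would then show all n results side by side; in a creative space there
# is no single "correct" output, so the user just picks the one they like.
```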
Trust and safety: obviously, this will never be perfect. This is not a world where you get the correct answer 100% of the time. You want to aim for a certain precision or recall, and you have to manage the expectations of your users, of journalists and media, and of your internal stakeholders: we are doing what we can, we're preventing the really bad cases, but this is something we're always going to keep working on. It's ambiguous, and the language itself is ambiguous. The prompt could be, "The view from the mountain was sick, you should make it look like that." What does "sick" mean to the LLM? You have to make sure it interprets "sick" the way the user meant it, and not in the "they were unhealthy" way. That's just how human language works, and that's why we have code.

So my learnings on working with AI: to me, AI engineering is just software engineering with machine learning, or ML, on top. And to me, ML is this great power, but it also adds a lot of randomness into your system. As engineers, our job is to reduce that randomness and bring things back to a deterministic, repeatable place, so you can actually have repeatable outcomes that provide value to your users in some way. So build evals, use your evals, and make your evals faster to run. And then, once you have something useful in production, go from a large model that can probably do way too much, and is doing way too much compute for your use case, and replace it with a smaller, faster, more efficient model, whether that's through model distillation, or by finding a more efficient model, or even by replacing it with traditional engineering without ML. All great options. The main point is to be able to move faster: faster iteration means more tries, which means better improvements in your product.

And then, what's next for us: we announced last week, at the Google Photos 10-year anniversary, that we're rebuilding the editor from the ground up to be AI-first. So, once again, really fulfilling that vision of meeting you where you are. You can tap on your image and we'll surface the relevant edits, you can adjust them, and you can use AI as part of those tools if you want, or the AI can use deterministic tools to make better edits.

I'm almost at time, so this is more my personal view: where will we be in two years? I don't know. You talk to different researchers and they have different views on AI progress. Paige talked about the Gemma on-device model that was just released: is it as good as the Pro model from the last version of Gemini? I hear that too, and that's a very exciting world to me, to bring things back on device. Or who knows, maybe Gemini 4 arrives, these things keep improving, and the nano version stays one step behind; that's an interesting path too. Or maybe we get no progress. Either way, being able to iterate quickly and having really good benchmarks, so that you know what to change, is always important. So we should just focus on that.

And I think that's my time. If you want to contact me, that's my contact, and I'll be around afterwards to talk.

Thanks, Kelvin.