Google Photos Magic Editor: GenAI Under the Hood of a Billion-User App - Kelvin Ma, Google Photos
Channel: aiDotEngineer
Published at: 2025-07-19
YouTube video id: C13jiFWNuo8
Source: https://www.youtube.com/watch?v=C13jiFWNuo8
While people get settled in, I'll do a quick survey. Raise of hands: how many people use Google Photos in any way? Are you familiar with the application? Cool. How many people have used the editing features within Google Photos? Cool, a few less, that's fine. How many people have used DALL·E or any of these new generative image editors? Okay, not that many. Cool.

My name is Kelvin. I'm an engineer on the Google Photos editing team. Really quickly about Google Photos: I think most people are familiar with it. We are the home for your memories. We offer auto backup. When we first launched, the product was really built with machine learning in mind; we want to use ML to make your life easier. So, for example, if you back up your photos with us, we index your photos and run OCR so you can do really powerful search. You don't have to organize your photos anymore. You go on a trip, you come back home, the photos get backed up, and we go, "Hey, you went on a trip. Here's an album of your photos from that trip." You can just search for stuff like receipts from this restaurant and we'll find it for you. We're all taking way too many photos nowadays, way too many videos, and nobody has time to sit down and organize any of it. So let us do that for you. We have about 1.5 billion monthly active users and we do hundreds of millions of edits per month across our clients.

The team I work on is called the computational photography team; I just call it the editing team, which is easier to say. It was started in 2018. We had a pretty basic image editor at the time, but we wanted to really focus on this idea of using the compute on your phone. It doesn't have a great sensor, but it does have a lot of compute, and we want to use that to make really great edits for any image. It doesn't have to be captured on the phone; it could be from old devices too.

The idea with computational photography is, for example, you have this photo on the left that was taken at sunset and the subject is very dark. Traditionally with DSLRs, you would capture multiple images at different exposure levels and then combine them into the right image, which is called an HDR image. But that's a lot of work. You need a tripod, you need to know how to do this,
you need to know how to use Photoshop. With machine learning and compute, we can just do it with one image. Capture it anywhere you want, bring it to Google Photos, we run machine learning, and we generate this photo, bringing back the brightness and the variation in the image.

The reason we're doing this at Google, on the Photos team, is that we're able to vertically integrate. We control the hardware on Pixel, so we're able to use things like the Edge TPU for accelerated compute, but we also have internal research. This was 2018. Back then, I think Hugging Face had been founded, but it was not a place where you could just go find a model that suits your need and then go off and build an application with it. You can do that now, which is really great, but you couldn't then. We were able to work with researchers at Google who are experts in computer vision and machine learning to build features from the ground up: hey, can you iterate on the model, we'll iterate on the application, and let's keep going back and forth until we hit something really good that's easy to use.

I'll really briefly talk about our tech stack. We have three main clients, Android, iOS, and web, each pretty much native to its platform. My team also owns a shared C++ library that does all the model inference, all on device. We integrate it across all the clients, we integrate with our research partners, and we run inference on device using TensorFlow Lite, which is now called LiteRT.

I'll briefly show off some highlights of the editing features we built from 2018 over the next couple of years. The first one is post-capture segmentation. You capture a portrait, and afterwards you decide you actually want more bokeh in the background, but you didn't have a lens that could do bokeh in that environment. We're able to just do it for you after the fact. Another one: you capture a nice portrait but the lighting's not perfect. Maybe the sun's in your eyes and it's making you look washed out, or it's overcast and you have shadows on your face. We're also able to fix that after the fact: you want lighting on your left side, lighting on your right side, no problem, we'll do it for you. Another one, which was really popular on one of the Pixel launches, is Magic Eraser. You may be at a popular spot with a lot of tourists or distractors in the background, and you want a really clean photo of just the core subjects. No problem, you don't have to wait for the scene to clear up or anything like that. Just take the photo; we can auto-detect which subjects you care about, clean up the background, and inpaint it for you, and you get something really beautiful that you can print or show off to your close friends or family members.

Diving briefly into the first big feature we launched, post-capture segmentation: this is actually relatively simple. It's a U-Net convolutional neural network, pretty traditional ML in the computer vision space, and it focuses on a specific use case where it works really well, which is single-subject portrait segmentation. So you see below, we separate the image into a foreground, which is the white region, and the background. And this is where the strength of ML really comes in: you can do this with traditional computer vision without machine learning, but it doesn't behave as well, and you need experts in that system to tune it.
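To make that on-device inference path concrete, here is a minimal sketch of running a portrait-segmentation model with TensorFlow Lite / LiteRT. It's in Python for brevity (the actual Photos pipeline is a shared C++ library), and the model file name, tensor shapes, and output semantics are assumptions for illustration, not the real Photos model.

```python
# Minimal sketch: on-device inference with TensorFlow Lite (now LiteRT).
# The model file, tensor shapes, and output semantics are assumptions;
# Google Photos' real pipeline is a shared C++ library, not Python.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="portrait_segmenter.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def segment(image_rgb: np.ndarray) -> np.ndarray:
    """Return a soft foreground mask in [0, 1] for an HxWx3 uint8 image."""
    h, w = inp["shape"][1], inp["shape"][2]
    x = tf.image.resize(image_rgb.astype(np.float32) / 255.0, (h, w)).numpy()
    interpreter.set_tensor(inp["index"], x[np.newaxis, ...])
    interpreter.invoke()
    mask = interpreter.get_tensor(out["index"])[0]
    return np.squeeze(mask)  # ~1.0 where the model thinks "subject"
```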
With machine learning, you just get better performance. You build a benchmark and a dataset, you do the training, and it always returns a result. There's no error handling; the beauty of models is that they always work. You give them any input, they run, and they say, "Here's the output." That's nice.

There are some challenges, though. Models are really big; they're giant static files of floats. In this case, the model was 10 megabytes, which is a lot compared to code, and that's not something we can bundle into the application. We're very sensitive to our APK size, so now we have to download the model after the fact, and now you have to do model management. There's also IP protection: since we're doing this on the user's device, we have to make sure the model can't be extracted and used somewhere else. And, as I'm sure everyone here who's worked with AI/ML knows, you have to deal with evals. How do you know this model does what you want? How do you know the next model does it better? That just takes a lot of time. You have to build a benchmark, you have to run the benchmark, and you have to make sure the benchmark actually reflects your real-world usage, because if those two diverge, the benchmark is useless. I think of it this way: for traditional code you have unit tests, and that's how you make sure you don't have regressions; the benchmark is the equivalent of unit testing for your model, and you need to maintain it, and that takes time.

And then the pro is also a con: the model always returns a result. People deal with this with LLMs now. It's like, "Hey, tell me when you're not sure so I can do something else," and the LLM goes, "What do you mean? I'm always sure." Even in this case, it will always return you something. In this example, the subject has really long hair, and you'll see the mask the segmentation produces is not perfect; it's not capturing the finer strands of her hair. So when you do a blur or some really sharp segmentation, you will notice it's not perfect: there are blurry edges and that sort of thing. We can fix that post-model. We're able to run more traditional image understanding and say, "Hey, that's hair, let's follow the strands of the hair," and get a really clean mask.

A couple of years later, in 2021, we launched Magic Eraser. At this point we're really running a system of models; in the LLM world people now call it orchestration. We have a few things here: we detect distractors, we segment them, we run an inpainter on them, and then we have custom GL rendering on your device to make it all seamless: visualize the mask, animate the mask away, blend in the inpainted area, and so on. More models and more pieces mean your system is now more complicated. The models are larger, getting to hundreds of megabytes even on device, and the failure cases are more obvious. If you ask it to inpaint something in the foreground, even a good model will struggle with it, at least in 2021. As Paige mentioned, we can now do much better.
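As an illustration of that "system of models" idea, here is a rough sketch of how a Magic Eraser-style pipeline could be wired together. Every function name here is a hypothetical stand-in, not a Google Photos API.

```python
# Rough sketch of a Magic Eraser-style "system of models" (orchestration):
# detect distractors, segment them, inpaint the holes, then let the client's
# GL renderer animate the result. Every callable here is a hypothetical stub.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class EraseResult:
    masks: List[np.ndarray]   # one segmentation mask per detected distractor
    filled: np.ndarray        # image with the masked regions inpainted

def magic_erase(image: np.ndarray,
                detect: Callable,     # model 1: distractor detection -> boxes
                segment: Callable,    # model 2: box -> pixel mask
                inpaint: Callable) -> EraseResult:  # model 3: fill the holes
    boxes = detect(image)
    masks = [segment(image, box) for box in boxes]
    filled = inpaint(image, masks)
    return EraseResult(masks=masks, filled=filled)
```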
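And since the benchmark is framed above as "unit tests for your model", here is a hedged sketch of what such a regression check might look like for segmentation, using IoU as the metric. The dataset shape and the 0.9 quality bar are made up for illustration.

```python
# Illustrative benchmark-as-unit-test for a segmentation model.
# IoU is a standard metric; the quality bar and dataset layout are assumptions.
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray, thresh: float = 0.5) -> float:
    """Intersection-over-union between predicted and ground-truth masks."""
    p, t = pred >= thresh, truth >= thresh
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum()) / union if union else 1.0

def benchmark_passes(examples, segment_fn, min_mean_iou: float = 0.9) -> bool:
    """Fail the 'test' if a candidate model regresses below the bar."""
    scores = [iou(segment_fn(img), gt_mask) for img, gt_mask in examples]
    mean_iou = float(np.mean(scores))
    print(f"mean IoU = {mean_iou:.3f} over {len(scores)} eval images")
    return mean_iou >= min_mean_iou
```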
So, a quick summary of our learnings from those couple of years. ML really does give you great capabilities that you wouldn't get with traditional image understanding, and it's great that you can ship these features in a way that's very easy to use. The models themselves have very consistent latency on a specific device: once again, the model just does the same thing regardless of the input, it runs the same number of flops, and it goes great. The devices themselves, on Android especially, have huge variance. The latest Samsung and Pixel phones are really comparable to certain laptops; the really old, low-end phones are not at all comparable. So you have to deal with that. And then, once again, we really had this great relationship with our researchers, some of whom are now at DeepMind. We were able to talk to them early: "Hey, we have a use case, our users want to be able to do certain things, what do you have that lets us build something on top?" And we were able to work with them for years at a time: "Oh, you made it better this year. It's not quite ready for our use case, but that's great, let's keep working on it."

So that's all really good. The downside is, once again, the unpredictability of certain edge cases. Sometimes you'll look at two images and think they're the same. You put one in the model, great result. You put the other one in the model, terrible result. And you're like, why? And the researchers will be like, "Who knows? We should collect more data, train a new version of the model, and we'll let you know in a couple of weeks or a month whether it's fixed." That's just the nature of working with machine learning. No one can look at the system and go, "Oh, there's the bug, let me fix it, let me push the fix, it'll be up." That's totally different from software engineering. So that means slower iteration, and you've got to keep that in mind, especially for us: we launch with Pixel, we have hard deadlines. If we're launching in two months, that means we get, what, two iterations of the model at best. Can you promise a launch? Can you go to your VP and say we are ready to launch? That's a totally different mindset from machine learning.

So what happened in 2022 and 2023? What was the big excitement? Some of you probably know certain things launched; they covered it too, with DALL·E and ChatGPT. "AI, AI, AI, generative AI, generative AI... it uses AI to bring AI... AI, generative AI." Yes, that was just a snippet of I/O from that year. Really, everyone's excited about AI. No more ML; now it's all AI. Larger models, more ambitious, more ambiguous. That's the world we live in now. So that's a large reason, though not the only reason, why we decided to embark on building this new Magic Editor experience, which now uses the largest state-of-the-art models. But it also solves another problem. We'd been building these great, specific features for years at this point, but discoverability is an issue: the user still has to know, "In this case I want to use Magic Eraser, in this case I want to blur the background." AI is great at going, "Hey, you know what, in this case you should try using Magic Eraser." It solves that, and we want to combine both of them.

So, at the product level, I think previous speakers like Paige showed off this really amazing, fantastical ability to generate anything you want, things you can capture in the real world and things you can't. For Photos, product-wise, we want to be more grounded. We are the home for your memories. We don't want to generate something really weird that would stand out if you were to share it with your friends or family. We obviously will have a prompt.
Everything has a prompt now. But we want the prompt to be more specific: we still want users to be able to select and visualize what they're talking about, so you can interact with the model as a kind of co-editor or copilot. And of course, we have to use the best model available, because we want to be more ambitious and push the edge of what's possible. So we're no longer constrained to on-device, and that means some challenges.

Now we have to worry about servers. Before, everything happened on one device, and there are a lot of advantages there, but there's no way, back then or even now, that we can fit the best models on a mobile device. Two, the problem space is really big. It's great to say we want to build the best generative AI image editor, but what does that mean? LLMs can do a lot of things; editing images is a subset of what they can do, and even within that space there's still a lot of specific things they can do. So you've got to narrow down what you're actually tackling, because generative image editing is not a specific user problem. The user doesn't go, "I want to generatively edit an image." They have a use case, and you've got to meet them where their use case is. And I think the other speakers also talked about how trust and safety is a big one for this sort of thing: preventing deepfakes, being responsible, and so on.

So, client versus server. This is actually new for us. I'm coming from an on-device, local-first background; I think most people here are using AI server-side. Up until this point, I had never had to do server capacity planning. The user brings the compute. It is great. I don't pay for power, I don't pay for compute. A billion people want to use it? Five billion people want to use it? Great, nothing changes on our end; we push the same code. But now we have to worry about it, especially because we use accelerated compute, TPUs and GPUs. You really have to do a lot of planning, and that takes a lot of time. Second, latency is a concern now. Before, we had zero network latency on device. Now you worry about network quality. Is the user in a remote spot? Is the data center overloaded and backed up? Is the user really far from the data center, so the request has to make a round trip around the world, which adds a ton of latency? And then the other thing is that testing is now hard. Before, our models were small enough that we could run our tests with the models included. Now these models are way too big; we can't run them in our automated testing suite for our daily regressions. So you have to think about whether you recapture and replay the responses, or try to have a test server. None of these are really good options.
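One of the options just mentioned, recording real responses and replaying them in the automated suite, might look roughly like this. The directory layout, hashing scheme, and request shape are assumptions for illustration, not how Photos actually does it.

```python
# Hedged sketch: replay previously captured server responses in tests, so the
# automated suite never has to call the huge production model.
# The directory layout, hashing scheme, and request shape are all assumptions.
import hashlib
import json
import pathlib

RECORDINGS = pathlib.Path("recorded_responses")

def _key(request: dict) -> str:
    """Stable fingerprint of a request, used as the recording's file name."""
    blob = json.dumps(request, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def replay_edit_response(request: dict) -> dict:
    """Return the captured response for this request, or fail loudly."""
    path = RECORDINGS / f"{_key(request)}.json"
    if not path.exists():
        raise LookupError("No recording for this request; re-capture it "
                          "against the real server before running tests.")
    return json.loads(path.read_text())
```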
I do want to spend some time talking about this ambiguous problem space. There's an old comic from XKCD; I don't know how old, but it still holds up. It basically says that for some problems in computer science, it's hard to explain why they're hard. The first one is: is the user in a national park? No problem, GPS, that's a solved problem, easy, anyone can do it. Then: check whether the photo is of a bird. Back then, that would have been, "I need a computer vision research team, give me a couple of PhDs, and I'll get back to you." Now that's a solved problem too. So the world has changed, and that's the power of generative AI, which is great. I love going to hackathons now, because anyone can sit down at a hackathon and go, "You know what, I have this idea," and they can build it, and it will definitely work in some use case. That's great. So when your PM goes, "I want to solve this problem, because I used Gemini or ChatGPT and it worked in this case, so that means we can build it," as an engineer you go: did it work 5% of the time, 10% of the time, 50%, or 80%? They all "worked in some case," but those are vastly different things. You cannot ship a product that is 5% reliable, or even 50% reliable. But when you can work with research to get it to 80%, then let's talk.

So you really do want to constrain the problem you're solving. I think someone yesterday made the comment that the prompt is a bug, not a feature, and I agree with that. You want it to be easy to use. A user doesn't want to stare at a prompt box and think about what they want in a very detailed way. Paige talked about prompt editing, and we do some of that too. The user doesn't want to write a whole paragraph on their phone. We want to extract their intent and go, "Great, we think we know what you want; let us give you what you actually want, not what you say you want." And then, once again, "anything is possible" means nothing is out of scope, which means you cannot design an engineering system around it. A system, by definition, has constraints.

So really, what we focused on is reducing ambiguity across all of our functions. Product goes and talks to users: what are the big demands they have? They identified a few things. One, the ability to move things within the image, to relocate them seamlessly. Another, reimagining parts of the scene. So here, the left side is the original; it was a very gray sky, kind of boring. Reimagine the background to be something more exciting. And then we had Eraser before, but now we can do a better erase, because we're no longer constrained to the device's capabilities. So here it's not just erasing the drink itself, but also the reflection of the drink, so it actually feels more natural. We then work with research to train specific models for those use cases, so they're more efficient, more accurate, and easier to use, and we build the UX and the software engineering on top to guide users to those cases, so they work really, really well and you can rely on them. And then for the other use cases, we take advantage of the fact that LLMs tend to hallucinate: we'll just give you multiple responses. The hallucination is a feature. The good thing is we work in a creative space; there is no single correct answer. It's not like code; there's no compiler. If you see something you like out of these choices, great. Take it. Go with it.
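That "give the user multiple responses" idea is easy to sketch: ask the model for several candidates and let the person pick, rather than betting everything on a single generation. The generate_edit function and its seed parameter are hypothetical stand-ins for whatever the actual image-editing model call looks like.

```python
# Hedged sketch of "hallucination as a feature": sample several candidate edits
# and let the user choose. generate_edit() is a hypothetical model call, not a
# real Google Photos or Gemini API.
from typing import Callable, List

def candidate_edits(image, prompt: str,
                    generate_edit: Callable, n: int = 4) -> List:
    """Return n diverse candidates by varying the sampling seed."""
    return [generate_edit(image, prompt, seed=seed) for seed in range(n)]

# The UI would then show all n results side by side; in a creative space there
# is no single "correct" output, so the user just picks the one they like.
```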
Trust and safety: obviously, this will never be perfect. This is not a world where you get the correct answer 100% of the time. You want to aim for a certain precision or recall, and you have to manage the expectations of your users, of journalists and media, and of your internal stakeholders: we are doing what we can, we're preventing the really bad cases, but this is something we're always going to keep working on. It's ambiguous, and the language itself is ambiguous. The prompt could be, "The view from the mountain was sick, you should make it look like that." What does "sick" mean to the LLM? You have to make sure it interprets "sick" the way the user meant it, and not in the "they were unhealthy" way. That's just how human language works, and that's why we have code.

So my learnings on working with AI: to me, AI engineering is just software engineering with machine learning, or ML, on top. And to me, ML is this great power, but it also adds a lot of randomness into your system. As engineers, our job is to reduce that randomness and bring things back to a deterministic, repeatable place, so you can actually have repeatable outcomes that provide value to your users in some way. So build evals, use your evals, and make your evals faster to run. And then, once you have something useful in production, go from a large model that can probably do way too much, and is doing way too much compute for your use case, and replace it with a smaller, faster, more efficient model, whether that's through model distillation, or by finding a more efficient model, or even by replacing it with traditional engineering without ML. All great options. The main point is to be able to move faster: faster iteration means more tries, which means better improvements in your product.

And then, what's next for us: we announced last week, at the Google Photos 10-year anniversary, that we're rebuilding the editor from the ground up to be AI-first. So, once again, really fulfilling that vision of meeting you where you are. You can tap on your image and we'll surface the relevant edits, you can adjust them, and you can use AI as part of those tools if you want, or the AI can use deterministic tools to make better edits.

I'm almost at time, so this is more my personal view: where will we be in two years? I don't know. You talk to different researchers and they have different views on AI progress. Paige talked about the Gemma on-device model that was just released: is it as good as the Pro model from the last version of Gemini? I hear that too, and that's a very exciting world to me, to bring things back on device. Or who knows, maybe Gemini 4 arrives, these things keep improving, and the nano version stays one step behind; that's an interesting path too. Or maybe we get no progress. Either way, being able to iterate quickly and having really good benchmarks, so that you know what to change, is always important. So we should just focus on that.

And I think that's my time. If you want to contact me, that's my contact, and I'll be around afterwards to talk.

Thanks, Kelvin.