Building Generative Image & Video models at Scale - Sander Dieleman, Google DeepMind

Channel: aiDotEngineer

Published at: 2026-04-21

YouTube video id: xOP1PM8fwnk

Source: https://www.youtube.com/watch?v=xOP1PM8fwnk

I'm going to get started. Thanks for coming, everyone. I'm Sander, a research scientist at Google DeepMind. I've been there for over a decade now.
I'm on the generative media team, so we work on models like Veo and Nano Banana, and this talk is a bit of a behind-the-scenes perspective on everything that goes into training these models at scale. It's going to be focused on diffusion models, because that's the modeling paradigm that people nowadays tend to use for generating audiovisual data, which is a bit different from the language modeling situation, where we primarily rely on autoregression (although of course not exclusively these days). This is a whirlwind tour of everything that goes into it, so I have eight sections, which is quite a lot to get through. They're not all equally long; I have more to say about some things than others, for example the modeling and sampling sides, but I'll try to cover a little bit of everything and hopefully leave enough time for questions at the end about any of these things.
The topics I want to cover are: data curation; representation, because you don't necessarily just shove pixels into a transformer, you might use a more advanced representation to simplify things; modeling, which is about the core mechanism behind diffusion models. I like to explain that mechanism in an intuitive way, and I think it's useful for understanding why these models work as well as they do for audiovisual data generation. We'll also talk a little about architecture, the core neural network behind all of this, and, very briefly, about training at scale. I'll talk a bit more about sampling, because diffusion models have a very flexible sampling regime; there's more you can do with them than with autoregressive models. Then I'll cover distillation, which in the context of diffusion is not so much about making the model smaller but about reducing the number of steps you need to get a good sample. Finally, I'll talk about control: the various control signals we use to make the models do our bidding.

Before I start with that: I have a blog. You don't have to remember all the individual URLs, just go to sander.ai and you can find all my blog posts there. They're mostly about diffusion these days, but in general I like to write about the intuition behind how generative models actually work.
Cool. So, let's talk a little bit about data first. I think it's really important to emphasize just how essential data curation is when training these large-scale models; it's crucial for high-quality results. Coming from a research background, that's not something I was incentivized to do during my PhD. You're not incentivized to look at your data, because you're incentivized to use a predefined dataset that everyone else uses, so that you can compare your results to the state of the art. That's something we're collectively having to unlearn in this era, because you really do have to look at your data, to the point where time spent improving the data is sometimes a better investment than trying to tweak the model or make the optimizer better. I think this is still an underrated aspect of training these models, even today. But I'm not going to say much more about it, because it's a topic you won't find many publications on, and it's also a topic I can't share many details about, since it's part of the secret sauce of what makes the models good.
Let's talk about representation next. The typical digital representation of visual data is pixels: a 2D grid for images, a 3D grid for video. You could just shove that into a diffusion model as is, and that's what we did at the start; when diffusion first became a thing, people trained directly on pixels, and it worked remarkably well. But that's not what people do nowadays, because once you start scaling up the size of the objects you're modeling, the resolution, or the duration in the case of video, these pixel tensors get very, very large in memory. Say you have 30 seconds of 1080p video at 30 fps: that's several gigabytes of data to store in memory, and that's a single training example. So that's just not feasible. What we end up doing instead is using compressed representations. But we're not just going to use pre-existing compressed representations. There are many video codecs out there, and the JPEG standard for images, but we tend not to use those, because they're focused purely on making things small, and in the process they obscure the structure of the data to a degree that makes generative modeling really hard.
What typically happens in this context is that we learn our own compressor, based on autoencoders. The top row on this slide is a neural network consisting of an encoder and a decoder, and the task of this network is simply to reproduce its input: you put an image in, and you want the same image out, but you force it through a bottleneck in the middle. That bottleneck learns what's typically called a latent representation, a more compact representation of the data, and that is the representation we're actually going to use as the basis for training our diffusion model. In the second stage, we extract these latent representations from our inputs. I usually use images as examples because they're easier to show on a slide, but you can imagine this being video as well; it's basically the exact same procedure. So we extract the latent representations and then train a generative model on top of them, which could be an autoregressive model or a diffusion model. In principle it doesn't really matter, but people have found that for audiovisual data, diffusion has the edge in terms of getting good results within a certain parameter budget, so that's what we mostly tend to use nowadays. Once you've trained your generative model in latent space, sampling from it of course gives you a sample in latent space, a compressed sample, so you still want to use the decoder you trained before to bring it back to pixel space.
So what does that actually look like in practice? Here's a concrete example; I think these dimensions correspond to what the autoencoder in the original Stable Diffusion model does (Stable Diffusion is also a latent diffusion model). Say you have an RGB image of 256×256 pixels. Represented as a 3D tensor, on the left, those are the dimensions you get. When we extract our latent representation with the encoder, one thing worth noticing is that it still has the same spatial structure, the same topology, but the resolution is greatly reduced: now we have a 32×32 latent grid. We compensate for that reduction by adding a few extra channels, so that we can encode some of the information that would get lost if you were to just resize the image. If you take the original image and resize it to 32×32, you lose a lot of high-frequency detail, and these latent representations try to hold on to that by using the extra channels to encode it. Of course, it's still lossy compression: the total tensor size of the latents is still much, much smaller than that of the original image. And that's the key idea, that's the point; we want to make this smaller. Especially for video, where the extra time dimension gives you even more redundancy, you're often able to reduce the size of the tensors you have to work with by two orders of magnitude. That's the difference between being able to fit your data in memory and not being able to do anything at all.
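To make that arithmetic concrete, here is a small sketch of the tensor sizes in this example, assuming Stable Diffusion's factor-of-8 spatial downsampling and 4 latent channels (the exact channel count is my assumption; the slide only says "a few extra channels"):

```python
# Pixel tensor: a 256×256 RGB image.
pixel_shape = (256, 256, 3)
# Latent tensor: spatial resolution divided by 8, with 4 channels
# (4 is an assumption; Stable Diffusion's autoencoder uses 4 latent channels).
latent_shape = (256 // 8, 256 // 8, 4)

def numel(shape):
    """Total number of elements in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

print(numel(pixel_shape))   # 196608 elements in pixel space
print(numel(latent_shape))  # 4096 elements in latent space
print(numel(pixel_shape) // numel(latent_shape))  # 48x fewer elements
```

For video, the same kind of bookkeeping with a temporal downsampling factor on top is what gets you to the two orders of magnitude mentioned above.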
One thing to note about these compression mechanisms is that they're somewhat more primitive than H.265 or the other codecs people tend to use these days, and that's by design: you want to preserve a lot of the topological structure of the original pixel representation, because the neural network architectures we use to build these models rely on that structure being present; they have strong inductive biases. So we end up with latent representations that still have the same grid structure as the original pixels, just at a coarser scale. The image I'm showing here is from a paper called EQ-VAE, which is about improving the training of these autoencoders. They did a really nice thing: they visualize the latent space by taking the principal components across the latent channels and mapping them onto RGB, so you can actually see what the latents look like. Looking at these visualizations, you can still tell what animal is in the image. The latents are not abstracting away any semantic content; they're basically just abstracting away local texture and very fine-grained structure. That's the information that gets compressed and, to some degree, removed; everything else is preserved as in the original pixel space, and that's by design.
All right. Once we have our latent representation, we can start doing some generative modeling. As I mentioned, you could do that either with autoregression or with diffusion. With autoregression, you have to turn everything into a sequence and then predict it step by step. That's a very natural thing to do for language, which is already sequence-structured, but slightly less natural for an image, which is a 2D grid: you have to decide on an arbitrary sequence order in which to generate the pixels. Diffusion is a different approach to achieving a very similar thing, namely iterative refinement: you generate something not in one go but step by step. Diffusion does this in a slightly different way, by first defining a corruption process that gradually destroys information in the image or video by adding more and more noise, and then learning a denoiser model that tries to remove that noise. As we'll see, you can use that denoiser to generate images or videos step by step.
To visualize what that looks like, I have an example image on the left, and I gradually add more and more Gaussian noise to it. Eventually, when you add a lot of noise, you can't see the image anymore; all the structure is destroyed. But in the intermediate stages, you can see that a little bit of noise removes the detail first: you might not be able to make out the bunny's whiskers anymore, but you can still see the silhouette, so you can still tell it's a bunny. You lose fine-grained detail initially, but not the global structure, and only as you add more noise does the global structure also start to fade.
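That corruption process can be sketched in a few lines. The variance-preserving mix below is one common parameterization; the exact noise schedule any given model uses is an assumption here:

```python
import numpy as np

def corrupt(x0, t, rng):
    """Add Gaussian noise to a clean signal x0.

    t in [0, 1]: t=0 returns the clean signal, t=1 pure noise.
    Variance-preserving mix: x_t = sqrt(1-t)*x0 + sqrt(t)*eps.
    """
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))  # stand-in for an image with unit variance

# A little noise keeps most of the signal; a lot of noise drowns it out.
corr_low = np.corrcoef(x0.ravel(), corrupt(x0, 0.1, rng).ravel())[0, 1]
corr_high = np.corrcoef(x0.ravel(), corrupt(x0, 0.99, rng).ravel())[0, 1]
print(corr_low, corr_high)  # correlation with the clean image drops as t grows
```

The whiskers-before-silhouette effect is exactly this correlation decay, viewed per frequency band rather than over the whole image.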
That's the corruption process. Now I'm going to try to explain how we can use a denoiser that reverses this process to do generative modeling; this is an intuitive view of how diffusion models actually work. I like to do this with a 2D diagram. Images and videos are really high-dimensional objects: if you flatten an image into a single vector, it's very high-dimensional. I'm going to reduce that down to two dimensions so that I can plot it on a slide. That's a slightly risky thing to do, generalizing from low dimensions to high dimensions, but in this case it's actually quite instructive.

We start with a clean image from our dataset, which I'll call x0, in the bottom left corner of the slide. Then I add some amount of noise to it; I simulate the corruption process and end up at an arbitrary step of it, which we'll call xt, where t is a time step indexing the corruption process. So that's a noisy version of our image, in the top right corner. In diffusion sampling we would actually start from pure noise, but I'm going to zoom in on a particular step in the middle of the process, where the image has already been partially noised, at something like xt, and show how the process continues from there.

The first thing our diffusion model does, since it's a denoiser, is try to predict a clean version of the image given the noisy version. It observes the noisy version, that's all it gets, and we ask it: where could this have come from? What image could have given rise to this particular noisy observation? That's actually an ill-posed problem, because the noise removes information; there are many different images that could have given rise to this particular noisy observation. So the model doesn't know which one to predict, and since it has to make a single prediction, it ends up predicting the average of all of them. That's why you get a blurry prediction out of a diffusion model: if you ask it for a one-step prediction, you get a very blurry image, like this one. The way to think about it is that the model is not predicting a particular image per se, but rather a region of image space, the direction we roughly want to go in. That's what I'm trying to visualize with these circles: there is a region of images that could all have given rise to this noisy observation, and that's roughly the direction we want to go in.
So how does sampling work, then? We predict that direction and take a small step in it. We don't make a big jump all the way to our predicted x0, because then we would just end up with a blurry image, which is not what we want. We only take a small step, and then ask the model again. You can compare this to how optimization of neural networks works. Optimization happens in parameter space, whereas here we're working in pixel space or latent space, but it's very similar: the optimizer gives you an update direction, and you don't want to take too large a step, because that direction is only valid locally. It's the exact same thing here.

One thing that is different from a typical optimization algorithm is that after this small denoising step, many diffusion sampling algorithms add a little bit of noise back. This is new noise, not the noise you just removed, and obviously less than the amount you just removed, because otherwise you wouldn't make any progress. The reason is often to avoid accumulating errors along the way: the denoiser we've trained is a neural network, so its predictions are not perfect, and if you keep feeding the network its own mistakes, you're going to go off the rails. It turns out that adding a little bit of fresh noise is a useful trick that greatly reduces that risk; it obscures the mistakes the denoiser is making. Not all diffusion sampling algorithms use this trick, but many do.
Then, as I said, the process just repeats. We're now at a new iterate, x(t-1), a slightly less noisy version of the image, because we've partially denoised it, and we ask the denoiser to make a new prediction. That prediction is going to be different this time: the model is still trying to predict where we came from, but now it has a bit more information, because there's less noise in the image. The new prediction corresponds to a smaller region of input space, and that region keeps shrinking as we go, until it collapses to a single point. At that point, we've sampled an image from the distribution we're trying to generate. The process continues exactly like this: denoise, add a little bit of noise back, and so on. That's how you sample from a diffusion model.
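The loop described above, in toy form: a stochastic (DDPM-style ancestral) sampler on scalar "images" drawn from a Gaussian, with the denoiser's "blurry average" prediction available in closed form instead of a neural network. The data distribution, schedule, and update rule are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data distribution: scalar "images" x0 ~ N(mu, sigma^2).
mu, sigma = 2.0, 0.5

# abar[t]: fraction of signal variance surviving at step t
# (abar near 1: nearly clean; abar near 0: nearly pure noise).
T = 500
abar = np.linspace(0.9995, 0.0005, T)

def denoise(xt, a):
    """E[x0 | xt] in closed form: the 'blurry average' prediction
    that a trained denoiser approximates for real data."""
    return (sigma**2 * np.sqrt(a) * xt + (1 - a) * mu) / (a * sigma**2 + 1 - a)

def sample(n):
    x = rng.standard_normal(n)                 # start from pure noise
    for t in range(T - 1, 0, -1):
        a_t, a_prev = abar[t], abar[t - 1]
        alpha = a_t / a_prev                   # per-step signal retention
        x0_hat = denoise(x, a_t)               # predict where we came from
        # Small step toward the prediction (DDPM posterior mean)...
        mean = (np.sqrt(a_prev) * (1 - alpha) * x0_hat
                + np.sqrt(alpha) * (1 - a_prev) * x) / (1 - a_t)
        # ...then add back a little fresh noise (less than was removed).
        var = (1 - a_prev) / (1 - a_t) * (1 - alpha)
        x = mean + np.sqrt(var) * rng.standard_normal(n)
    return denoise(x, abar[0])                 # final cleanup step

xs = sample(5000)
print(xs.mean(), xs.std())  # approaches mu = 2.0 and sigma = 0.5
```

The point of the toy is that repeatedly taking small steps toward a blurry prediction, while re-injecting a little noise, really does recover the data distribution rather than its blurry average.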
I now want to take a slightly different perspective, to explain why doing this particular thing works so well for images and video. For that, we're going to bring out Fourier analysis and look at the frequency content of what's happening in a diffusion model. I have here four images from the ImageNet dataset, and I've taken their Fourier transforms. The Fourier transform of an image is a 2D object, like the image itself, and it's complex-valued, so I'm going to pull it apart into magnitude and phase: the magnitude is the middle row, the phase is the bottom row. The phase is hard to work with, so we'll ignore it for now and just look at the magnitude spectrum. To make this even more convenient, we'll summarize the magnitude spectrum in a 1D plot by radial averaging: take radial slices of the 2D spectrum and average them all together. If you do that and plot the result, here's what you get for these images. What's very interesting, I think, is that they look like straight lines, and this is a log-log plot. So if you're familiar with scaling laws in language models, this says power law: there's a power-law relationship in the spectrum. This is a natural phenomenon in images, and also in video. If you take a photo and compute its spectrum, you'll typically see this power-law relationship between the frequencies and the corresponding power, or energy, of those frequencies.
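A minimal version of that radial averaging, applied here to a synthetic 1/f image standing in for a photo (the integer-ring binning is one of several reasonable choices):

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged magnitude spectrum of a 2D array."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    yy, xx = np.indices((h, w))
    # Distance of every frequency bin from the spectrum's center.
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    # Average the magnitude over rings of equal radius.
    sums = np.bincount(r.ravel(), weights=mag.ravel())
    counts = np.maximum(np.bincount(r.ravel()), 1)
    return sums / counts

# Synthetic image with a 1/f amplitude spectrum and random phases.
rng = np.random.default_rng(0)
h = w = 128
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.fftfreq(w)[None, :]
amp = 1.0 / np.maximum(np.hypot(fy, fx), 1.0 / h)
phase = np.exp(2j * np.pi * rng.random((h, w)))
img = np.real(np.fft.ifft2(amp * phase))

spec = radial_power_spectrum(img)
# On a log-log plot, spec[1:h//2] is approximately a straight line.
slope = np.polyfit(np.log(np.arange(1, h // 2)), np.log(spec[1:h // 2]), 1)[0]
print(slope)  # roughly -1 for this 1/f image
```

Running the same function on a real photograph gives the same qualitative picture: an approximately straight line on log-log axes.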
So what does that mean for diffusion? Well, if you add noise to an image, here's what that looks like in the spectrum. The spectrum of Gaussian noise contains all frequencies in equal measure, in expectation: it's flat in the frequency domain. You can see that here; the blue line is the spectrum of actual Gaussian noise, a horizontal line. For an image, you get the power-law spectrum, a downward-sloping line. Now, if I add the two together, adding the noise to the image, and again compute the spectrum, I get the green line. You can see that it follows the image spectrum up to the point where the noise starts to drown it out, and from there it follows the noise spectrum. Concretely, for our corruption process, this means that the more noise you add, the more of the high-frequency components you obscure. Add a little noise, and you only obscure the highest-frequency components; add a bit more, and you obscure lower and lower frequencies, until eventually, with a lot of noise, you're obscuring all the frequency components in the signal, and they're completely indistinguishable from the noise.
I've summarized that observation as: diffusion is basically spectral autoregression. It's essentially letting you generate images coarse-to-fine: you start with the low frequencies and gradually add higher and higher frequencies, which corresponds to coarse-grained features first and fine-grained features later. Of course, it's an approximation; it's not the same as hard autoregression. But in principle it's doing a very similar thing, just in a representation space that makes more sense for image generation. Going coarse-to-fine typically means you can sketch out the semantics of your image before adding all the details, which is a very natural way to do image generation. It also means that when we train these models, we can put more weight on the frequencies, on the scales, that matter more perceptually, and that's a very important aspect of this as well.
All right. Let me now talk a little bit about network architectures for these denoisers. The task itself is very simple: it's just denoising. You feed in a noisy image and ask the network for the clean version. People initially used U-Nets for this. U-Nets are convolutional neural networks originally designed for things like image segmentation, but you can use them for any image restoration task, anything where the output dimensions are the same as the input dimensions, and they worked really well for this particular use case too, so people adopted them. Early Stable Diffusion models are U-Net based as well. Then people asked: can we use a transformer for this? The answer is yes, of course. You obviously don't use the exact recipe you would use for a large language model; you don't use a causal mask, for example, because it's fine for attention to be fully bidirectional in this context, which actually makes it slightly more expressive than a typical LLM transformer. It works just fine with transformers, and because we've learned so much about scaling transformers on the LLM side, it makes practical sense to use them for diffusion models too, so we can reuse a lot of that knowledge.
architectural aspect that I want to talk
about in the concept of video generation
is sort of the tension between auto
reggression and uh diffusion. So you can
train a fully autogressive video model,
but it would mean that you have to take
this sort of height times width times
time cube and then flatten it out into a
sequence, right? And then generate a
token by token. You can also treat this
whole uh 3D volume of the video, this 3D
pixel volume as a thing that you just
jointly noise and d noiseise. And that's
what most modern uh video generation
models do. So they're sort of adding
noise across the whole u time dimension
as well as the spatial dimensions. Um
that works pretty well. But there's
actually uh it's it's not a dichotomy
like it's not a binary choice. There's
actually an intermediate uh sort of uh
setup where you do auto reggression in
time. So you generate frame by frame but
you use uh diffusion to generate each of
the frames. And that's actually a really
nice sort of hybrid setup for for many
applications. um especially when you
want to do things like real-time video
generation for example, you kind of have
to do autogression in time, right? So
like Genie is a good example of sort of
this this uh this compromise.
Cool. All right, I'll talk very briefly about what goes into actually training a model. We scale these things up; more parameters is better. I would say they're probably not at the scale of today's large language models; they tend to still be a bit smaller, and I'll explain a little later why that might be. But we still scale them to reasonable sizes, so it's very important to figure out how to do parallelism and sharding across many chips. Up to a certain scale you can rely on data parallelism: just split your batch across many chips for training. But at some point you need model parallelism as well, where you start spreading the model itself across different chips. We tend to use JAX to build these models, and JAX has really good tooling that does this for you: you can use JAX's pjit machinery (these days folded into jax.jit with sharding annotations) to ask it to shard the model in a way that's ideally close to optimal, and it automatically minimizes communication between the chips, so you don't have to do that manually.
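The data-parallel part of that is easy to sketch without any accelerator at all. Here is a simulated version in plain NumPy, where the "devices" are just array shards and the all-reduce is an average; everything here is illustrative, not the actual production setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear model y = x @ w trained with squared loss, standing in
# for the real network. 8 simulated devices.
n_devices, batch, dim = 8, 64, 16
w = np.zeros(dim)
w_true = rng.standard_normal(dim)

x = rng.standard_normal((batch, dim))
y = x @ w_true

def local_grad(w, x_shard, y_shard):
    """Gradient of mean squared error on one device's shard of the batch."""
    err = x_shard @ w - y_shard
    return 2.0 * x_shard.T @ err / len(y_shard)

for step in range(200):
    # Data parallelism: split the batch across devices...
    x_shards = np.split(x, n_devices)
    y_shards = np.split(y, n_devices)
    grads = [local_grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # ...then all-reduce (average) the per-device gradients.
    g = np.mean(grads, axis=0)
    w -= 0.1 * g

print(np.max(np.abs(w - w_true)))  # near 0: converged to the true weights
```

Averaging the per-shard gradients gives exactly the full-batch gradient, which is why data parallelism is the easy regime; model parallelism, where single layers span chips, is where tooling like pjit earns its keep.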
Cool. I think that's all I'll say about that for now. Let's move on to sampling, because I want to talk a bit more about the contrast between stochastic and deterministic sampling. I already mentioned that some of these algorithms add a little bit of noise back after every step, while others do not, and this has different trade-offs. We'll talk about distillation a bit later; in that context, it's actually really useful to have a deterministic sampling algorithm, because you get a one-to-one mapping between initial noise and samples from your data distribution. But as I said, stochastic algorithms can be a bit more robust to accumulation of errors, so there are interesting trade-offs there.

The main thing I want to talk about in the context of sampling is the idea of guidance. It's commonly associated with diffusion models, though you can actually also do it for autoregressive models; it just seems to work exceptionally well for diffusion specifically. What guidance enables you to do is trade off sample diversity for quality. If you crank up the hyperparameter called the guidance scale, the diversity of your samples is greatly reduced; they start looking more alike.
But the quality massively improves, to the point where these models punch well above their weight. This is what I was getting at earlier about why diffusion models tend to still be a bit smaller than their LLM counterparts: they have this very powerful trick that really allows them to punch above their weight. I'll show you some examples in a bit, but first I want to explain what guidance actually does.

Going back to the diagram from before, here's how guidance modifies the sampling procedure. We start the exact same way, trying to predict where we came from, which gives a blurry prediction. But now we make a second prediction as well, because we've trained our model with a conditioning signal. This could be a text prompt, for example, and we can evaluate the model both with the text prompt and without it. If the first prediction is the one without the text prompt, then I can make another prediction with the text prompt, and typically that will be a slightly less blurry prediction, because the number of images this noisy observation could have come from is greatly restricted by what's described in the prompt. If the prompt says this is an image of a rabbit and gives you some details, then obviously you know a bit more about what the original image might have been. So there's a difference between these two predictions, and it's precisely that difference we're interested in. We'll call it delta, and we're going to amplify it: if this is the delta between not having the prompt and having the prompt, what if we just scale it up and say that this is the new direction we're actually going to denoise in? It turns out this is super powerful. It seems almost childish, it's so simple, but there's a Bayesian-reasoning argument behind it that explains why it works so well. And that's basically the only change we make to the sampling algorithm; everything else proceeds as normal. Again, we might add a little bit of noise back, and then we take steps as usual. You do need two model evaluations per step, but you get a huge increase in sample quality from it. In fact, let me show you.
me show you. So um so this is a
comparison of some samples from a model
uh without guidance and with guidance.
Um this is from a 2021 paper. So
positively ancient uh in AI, right?
Because things move pretty fast. But this is one of the last papers where they train diffusion models at scale and actually show the difference between not using guidance and using guidance. After this paper, everyone just always uses guidance; nobody ever turns it off anymore, because it's so essential to getting good samples from these models. And the differences you can see here: from left to right, there's that reduction in diversity. The samples on the right are much less diverse, but individually they're much higher quality with respect to the prompt, right? So
you get this trade-off. And if you only want one sample, why would you even care about diversity, right? So you can use this, or you can even reintroduce diversity in different ways if you wanted to. But yeah, basically guidance is a no-brainer these days, so everyone just always leaves it on, and I think a lot of people would be surprised at how bad today's models are if you take guidance away. Another example here from the same paper; this is the GLIDE paper from OpenAI, one of the first pixel-space diffusion models at scale.
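As a concrete sketch, the guidance update described above, amplifying the delta between the conditional and unconditional predictions, can be written in a few lines of Python. The function and argument names here are illustrative, not from any particular library:

```python
def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    """Classifier-free guidance: move the prediction along the delta
    between the conditional and unconditional model outputs.

    guidance_scale = 1.0 recovers the plain conditional prediction;
    larger values amplify the effect of the prompt.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(pred_uncond, pred_cond)]

# Toy example with three "pixels": the guided prediction overshoots
# the conditional one, away from the unconditional one.
print(guided_prediction([0.0, 0.5, 1.0], [1.0, 0.5, 0.0], 2.0))
# → [2.0, 0.5, -1.0]
```

In a real sampler this replaces the single model prediction at every step, which is why guidance costs two forward passes per step.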
Cool. Right. Let me talk briefly about distillation, to speed up the sampling procedure. We're not going to try to make the model smaller, which is usually what distillation refers to in the context of large language models. Here, what we're actually going to try to do is reduce the number of steps we require. Sampling from a diffusion model, you could say, is tracing out a nonlinear path through input space. I showed you a few steps before; here I'm trying to interpolate that path with this dashed red line, to show the trajectory we might follow through input space during sampling, and that's a nonlinear trajectory. What a diffusion model does is, at each point on the trajectory, give you the direction you should move in next: it predicts a tangent to this path, basically. And
then a question arises: once we know what this path is, especially if we have a deterministic sampling algorithm, where there's one path between a particular noise sample and a particular data point, why are we predicting the tangent to this path? Why can't we just predict where we're going to end up? From wherever we are, let's just try to predict where this path is going to end when all the noise is gone, and then we can just sample in one step.
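To make that contrast concrete, here is a toy sketch (all names invented) of the two styles: following the predicted tangent step by step, versus jumping straight to the predicted endpoint of the trajectory:

```python
def sample_multistep(tangent, x, noise_levels):
    # Diffusion-style sampling: at each noise level, ask the model
    # for the direction to move in (the tangent) and take one
    # Euler step toward the next noise level.
    for t_cur, t_next in zip(noise_levels[:-1], noise_levels[1:]):
        x = x + (t_next - t_cur) * tangent(x, t_cur)
    return x

def sample_onestep(endpoint, x, t):
    # Endpoint-prediction sampling: the model directly predicts
    # where the trajectory ends up once all the noise is gone.
    return endpoint(x, t)

# Toy linear trajectory x(t) = x0 * t: the tangent is x / t and the
# endpoint at t = 0 is always 0, so both samplers agree exactly.
tangent = lambda x, t: x / t
endpoint = lambda x, t: 0.0
print(sample_multistep(tangent, 2.0, [1.0, 0.5, 0.0]))  # → 0.0
print(sample_onestep(endpoint, 2.0, 1.0))               # → 0.0
```

On a real, curved trajectory the one-step jump is much harder to learn, which is exactly the difficulty discussed next.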
Right? And this is exactly what consistency models do. You can take a diffusion model and distill it into a consistency model, and this is just one of the many ways people have developed to reduce the number of steps required to sample from diffusion models. Now, in practice, when you try to sample in one step from a consistency model, it usually won't work that well, mostly because you're asking the neural network to do in a single pass what it did before in like 50 passes. That's a tall order, so usually it's not going to work super well out of the box. But one
trade-off you can make is to say: I'm going to do consistency modeling, but only for certain intervals of my sampling path, so that I can still sample in maybe three steps, and maybe then I'll get better results. Again, this is just one of the many ways to achieve this, but I wanted to show intuitively what it's actually doing. All right, let's talk briefly about controls. This is the last thing I want to talk about. So
these models are only as useful as our ability to influence, at a semantically high level, what they actually produce, right? And the canonical way we've been doing this is to give them a text prompt that describes what the image or the video should look like. That's great; that works pretty well. But more and more, we're starting to see that people want more than that, right? They don't just want to describe everything in great detail in language. You want to be able to do reference-based generation. For example, you want to be able to generate a video and put yourself in it, right? And you're not going to do that by describing what you look like in intricate detail; you're going to do that by taking a photo or a short video of yourself and then conditioning the model on that. So we
want these other types of conditioning signals as well. In video generation specifically, we might want to explicitly control camera motion, or the speed and specific timing of events. That's not something you should try to shoehorn into a textual representation, right? So you need more advanced conditioning signals. And then a big question that arises is: when should we introduce these conditioning signals to the model? Because you often don't have this kind of conditioning information for most of your pre-training data set, it can actually be quite useful to do this in post-training: pre-train a model with just a text prompt, and then maybe add some of these extra conditioning signals in a post-training phase. And of course, other
things we can do in post-training include preference tuning based on human evaluations, using reinforcement learning or direct preference optimization. All right, that was a whirlwind overview of all eight of these things. I'm going to stop there so we have some time for questions. Thank you very much for listening.
You're talking about how you increase
guidance quality. Is that kind of why
lots of these models have a certain
style? Like you can look at analog
image.
>> That's part of it. I think part of it is probably the post-training recipe, which tends to make these models a lot more opinionated. I like to think of it this way: pre-training models the distribution of images, let's say, and post-training is more about which sliver of that distribution I'm actually interested in, and that's a very opinionated type of thing, so that's playing a big role. There are specific effects, though, that are easily attributed to guidance: if you get images that look very saturated, that's usually a sign that the guidance scale is too high. The guidance scale is amplifying this difference between the unconditional and the conditional prediction, but it's doing that at every noise level, and depending on which noise level you're manipulating this direction at, you get different effects. So what people have started doing is varying the scale of the guidance across the sampling procedure. There's a paper that says it's actually best not to do guidance at the very start and at the very end, but to ramp it up in the middle, and that tends to give you the best trade-off. But yeah, it's definitely playing a role; I just think a bigger role is played by the post-training. Yeah.
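As a toy illustration of that idea of varying the guidance scale across sampling (the interval endpoints and scale here are made-up numbers, not from the paper):

```python
def guidance_scale_schedule(t, low=0.25, high=0.75, scale=5.0):
    # t runs from 1.0 (pure noise) down to 0.0 (clean sample).
    # Apply amplified guidance only in a middle interval of the
    # trajectory; elsewhere a scale of 1.0 means falling back to
    # the plain conditional prediction.
    return scale if low <= t <= high else 1.0

print([guidance_scale_schedule(t) for t in (1.0, 0.5, 0.0)])
# → [1.0, 5.0, 1.0]
```

The returned scale would be fed into the per-step guidance update in place of a constant.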
>> Yes.
>> Thank you very much.
>> Hi. How would you do this guidance, or conditional sampling, for text diffusion?
>> You could use exactly the same procedure, provided you have a prompt that you're giving the model: you can treat that prompt as a conditioning signal and then apply guidance to it. How well that's going to work is sort of an open question. It's clearly applicable to that setting as well, but typically what we've found is that guidance tends to work best when there's a bit of a semantic gap between the thing you're guiding with and the thing you're generating. That's why it works so exceptionally well with this prompting setup: a text prompt tends to capture the high-level semantics of what's in an image, and that seems to be a really good way to steer the model in a sort of semantic space, so to speak. But you have the same problem in image generation: if you do guidance on, let's say, a segmentation mask, and you want the model to fill in the different parts of that mask, guidance will actually work a lot less well, because it's a low-level conditioning signal at the same level of abstraction as the pixels you're trying to generate. It seems like guidance just works better when you have that gap. I think that's also why people don't typically use it as much for language models, even though it's perfectly applicable to autoregressive models as well.
>> Over there.
>> Hi, you mentioned JAX is better suited for model sharding and parallelism. Is there any particular reason for that, especially compared to PyTorch?
>> Okay, I have to caveat my answer with the fact that I've been at Google for over a decade. PyTorch didn't exist when I joined Google, right? So I've never had a lot of hands-on experience with PyTorch. I think JAX is just designed with that in mind from the get-go, because we've had TPUs for a while at Google, and it was really conceived to make using TPUs as easy as possible. And with TPUs, you very rarely use just one chip, right? You want to take advantage of the fact that they have this really fast interconnect, and scale up pretty quickly. So yeah, I think it was designed with this in mind from the get-go, but PyTorch can do these things as well, right? It's not unique to JAX.
>> Thanks.
>> Thank you for your presentation. If you slow down the process of denoising and noising, do you get better images?
>> Slow down as in take more steps?
>> Yes, take more steps.
>> Up to a point, yeah. Because as I said, you're approximating this nonlinear path through input space, and you're approximating it by taking finite steps, which means you're actually approximating it with a piecewise linear path through input space, right? And the more that piecewise linear path deviates from your true nonlinear path, the worse your samples are going to get. But there is obviously a limit. And there's also been a lot of research on how to train diffusion models, or models like this, that produce straighter paths from the get-go, so that you can get a more accurate approximation with fewer steps. There's something called rectified flow, or ReFlow, that tries to dig into this. So yeah, up to a point is the answer. Yeah. Thanks.
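You can see this piecewise-linear approximation effect on a toy ODE, dx/dt = x on [0, 1], whose exact endpoint is e. Finer Euler steps track the curved path more closely; this is a generic numerical illustration, not diffusion-specific code:

```python
import math

def euler_endpoint(f, x0, n_steps):
    # Integrate dx/dt = f(x, t) over t in [0, 1] with n_steps Euler
    # steps, i.e. a piecewise-linear approximation of the true path.
    x, t, h = x0, 0.0, 1.0 / n_steps
    for _ in range(n_steps):
        x, t = x + h * f(x, t), t + h
    return x

f = lambda x, t: x                     # true solution: x(t) = exp(t)
coarse = euler_endpoint(f, 1.0, 4)     # (1 + 1/4)**4  ≈ 2.4414
fine = euler_endpoint(f, 1.0, 64)      # (1 + 1/64)**64 ≈ 2.6973
print(abs(coarse - math.e) > abs(fine - math.e))  # → True
```

The same trade-off governs diffusion sampling: more steps give a closer fit to the nonlinear trajectory, with diminishing returns.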
I was wondering if you could talk a little bit more about how you get these other types of control signals in, for example specific camera movement. Is this something you would do with a simulator, where you can actually control all the parameters, or how?
>> That's a very good question. Well, obviously I can't comment on what we actually do; I can only speculate. I don't actually know the details, because the team at this point is quite large, and there are people working specifically on what we call capabilities: endowing the model with the ability to, for example, be conditioned on a camera control signal and things like that. I'm not actively involved in that part of the work. But as I said, it's quite important to have conditioning signals that are semantically abstract, so you have to find the right representation that makes the most sense to condition a model on. I think what's also very important is that if your model is a transformer, you actually have quite a few ways to insert conditioning information. You could add a bunch of extra tokens; that's one way to do it. Sometimes you might rather broadcast the conditioning information to all the tokens: for example, the noise level input that we give to the diffusion model, to tell it how much noise there is, we typically broadcast to all the tokens rather than adding a single token for it. And then I think there's a third way that people have started doing conditioning. But yeah, these are all things you have to think about when you try to figure out how to introduce a new conditioning signal. Yeah.
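A minimal, framework-free sketch of those two injection styles; real models would use learned embeddings and mechanisms like adaptive normalization, so everything here is purely illustrative:

```python
def condition_with_token(tokens, cond):
    # Style 1: append the conditioning signal as one extra token
    # that the transformer attends to like any other token.
    return tokens + [cond]

def condition_by_broadcast(tokens, cond):
    # Style 2: broadcast the signal (e.g. a noise-level embedding)
    # to every token, here by simple elementwise addition.
    return [[a + b for a, b in zip(tok, cond)] for tok in tokens]

tokens = [[1.0, 2.0], [3.0, 4.0]]
print(condition_with_token(tokens, [9.0, 9.0]))
# → [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
print(condition_by_broadcast(tokens, [9.0, 9.0]))
# → [[10.0, 11.0], [12.0, 13.0]]
```

The broadcast style guarantees every token sees the signal directly, which is why it suits global quantities like the noise level.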
I think we have time for maybe one more.
>> Thanks for the really nice talk. Just to understand: in the denoising part, if you didn't add that noise back in each step, would it be fully deterministic?
>> Yes.
>> Okay.
>> Yeah. So once you have a denoiser, there are basically two different types of generative models you can construct: one of them is stochastic and the other is fully deterministic. And it's kind of a magical feat; I still don't really believe that this works, but it really does. So yeah, you can have a deterministic model, and as I said, this can be useful in many applications, especially if you want to go both ways: from your data distribution to your noise distribution and back. And for many distillation techniques, it's a requirement for the teacher to be deterministic; that's the case with these consistency models, for example. You need a deterministic sampling algorithm to start from. Yeah.
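As a sketch of that distinction, here is an Euler-style step on a denoiser's output with an optional noise re-injection; the variance handling is schematic, not a tuned sampler:

```python
import math
import random

def sampler_step(x, denoised, t_cur, t_next, stochastic=False, rng=None):
    # Deterministic part: move along the tangent implied by the
    # denoiser's prediction at noise level t_cur.
    d = (x - denoised) / t_cur
    x_next = x + (t_next - t_cur) * d
    if stochastic:
        # Stochastic variant: re-inject some fresh noise each step,
        # so repeated runs trace different trajectories.
        rng = rng or random.Random(0)
        x_next += math.sqrt(max(t_cur - t_next, 0.0)) * rng.gauss(0.0, 1.0)
    return x_next

# With stochastic=False the trajectory is fully repeatable:
print(sampler_step(2.0, 0.0, 1.0, 0.5))  # → 1.0
print(sampler_step(2.0, 0.0, 1.0, 0.0))  # → 0.0 (exactly the denoised sample)
```

Dropping the noise term is what makes the sampler invertible, which is the property the deterministic variant and distillation teachers rely on.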
>> Thanks.
>> Cheers. All right, thank you everyone. I have a longer version of this talk on YouTube, if you want a bit more detail on some of these sections; you can check that out as well. Cool. Thank you.