Building Generative Image & Video models at Scale - Sander Dieleman, Google DeepMind
Channel: aiDotEngineer
Published at: 2026-04-21
YouTube video id: xOP1PM8fwnk
Source: https://www.youtube.com/watch?v=xOP1PM8fwnk
I'm going to get started. Thanks for coming, everyone. I'm Sander, a research scientist at Google DeepMind; I've been there for over a decade now. I'm on the generative media team, so we work on models like Veo and Nano Banana, and this talk is a bit of a behind-the-scenes perspective on everything that goes into training these models at scale. It's going to be focused on diffusion models, because that's the modeling paradigm that people nowadays tend to use for generating audiovisual data, which is a bit different from the language modeling situation, where we primarily rely on autoregression (although of course not exclusively these days).

This is a whirlwind tour of everything that goes into it, so I have eight sections, which is quite a lot to get through. They're not all equally long; I have more to say about some things than others, for example the modeling and sampling side, but I'll try to cover a little bit of everything and hopefully leave enough time for questions at the end about any of these things. The topics I want to cover are: data curation; representation, because you don't necessarily just shove pixels into a transformer, you might use a more advanced representation to simplify things; modeling, which is about the core mechanism behind diffusion models. I like to explain that in an intuitive way, and I think that's useful for understanding why these models work as well as they do for audiovisual data generation. We'll also talk a little bit about architecture, the core neural network architecture behind all this, and then, very briefly, some things about training at scale. I'll talk a bit more about sampling, because diffusion models have a very flexible sampling regime; there's more you can do with diffusion models than with autoregressive models. I'll talk about distillation, which in the context of diffusion is not so much about making the model smaller, but about reducing the number of steps you need to get a good sample. And finally I'll talk about control: the various control signals we want to use to actually make the models do our bidding.

Before I start with that: I have a blog. You don't have to remember all these individual URLs, just go to sander.ai; you can find all my blog posts there. They're mostly about diffusion these days, but in general I like to talk about the intuition behind how generative models actually work.

Cool. So, let's talk a little bit about data first. I think it's really important to emphasize just how important data curation is in training these large-scale models; it's absolutely essential for high-quality results. For me, coming from a research background, it's not something that was incentivized when I was doing my PhD.
You're not incentivized to look at your data, because you're actually incentivized to use a predefined dataset that everyone else uses, so you can compare your results to the state of the art. That's something we're collectively having to unlearn in this era, because you really just have to look at your data, to the point where I think time spent improving the data is sometimes a better investment than trying to tweak the model or make the optimizer better. Even today, this is still an underrated aspect of training these models. But I'm not going to say much more about it, because it's a topic you won't find a lot of publications on, and it's also one I can't share a lot of details on, since it's part of the secret sauce of what makes the models good.

So let's talk about representation next. The typical digital representation of visual data is pixels: a 2D pixel grid for images, a 3D pixel grid for video. You could actually just shove that into a diffusion model as is, and that's what we did at the start; when diffusion first became a thing, people were training directly on pixels, and it worked remarkably well. But that's not what people do nowadays, because once you start scaling up the size of the objects you're modeling, the resolution, or the duration in the case of video, these pixel tensors get very large in memory. Say you have 30 seconds of 1080p video at 30 fps: that's several gigabytes of data to hold in memory, and it's one training example. That's just not feasible.

So what we end up doing is using compressed representations. But we're not just going to use pre-existing ones: there are many video codecs out there, and the JPEG standard for images, but we tend not to use those representations, because they're really focused on making things small, and in the process they obscure the structure of the data to a degree that makes generative modeling really hard. What typically happens instead is that we learn our own compressor, based on autoencoders. The top row of this slide is a neural network consisting of an encoder and a decoder, and its task is simply to reproduce its input: you put an image in, you want the same image out, but you force it through a bottleneck in the middle, and that bottleneck learns what's typically called a latent representation, a more compact representation of the data. That is then the representation we actually use as a basis for training our diffusion model. So in the second stage, we extract these latent representations from our inputs. I usually use images as examples, because they're easier to show on a slide, but you can imagine this being video as well.
It's basically the exact same procedure. We extract the latent representations and then train a generative model on top of them, which could be an autoregressive model or a diffusion model; in principle it doesn't really matter, but people have found that for audiovisual data, diffusion generally has the edge in terms of getting good results within a certain parameter budget, so that's what we mostly use nowadays. Once you've trained your generative model in latent space, of course, when you sample from it, you get a sample in latent space, a compressed sample. So you still use the decoder you trained before to bring that back to pixel space.

What does that look like in practice? I have a concrete example here, and I think these dimensions correspond to what the autoencoder in the original Stable Diffusion model does (Stable Diffusion is also a latent diffusion model). Say you have an RGB image of 256 × 256 pixels; represented as a 3D tensor, on the left, those are the dimensions you get. When we extract the latent representation with our encoder, one thing worth noticing is that it still has the same spatial structure, the same topology, but the resolution is greatly reduced: now we have a 32 × 32 latent grid. We compensate for that reduction by adding a few extra channels, so we can encode some of the information that would get lost if you were to just resize the image. If you simply resized the original image to 32 × 32, you'd lose a lot of high-frequency detail; these latent representations try to hold on to that by using the extra channels to encode that information. Of course, it's still lossy compression, because the total tensor size of the latents versus the original image is still much, much smaller. And that's the key idea; that's the point. We want to make this smaller. Especially for video, where the extra time dimension brings even more redundancy, you're often able to reduce the size of the tensors you have to work with by two orders of magnitude, and that's the difference between being able to fit your data in memory and not being able to do anything at all.

One thing to note about these compression mechanisms is that they're actually slightly more primitive than codecs like H.265 that people tend to use these days, and that's by design: you want to preserve a lot of the topological structure of the original pixel representation, because the neural network architectures we use to build these models rely on that structure being present; they have strong inductive biases. So we end up with latent representations that still have the same grid structure as the original pixels, just at a coarser-grained scale.
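To make those shapes concrete, here's a minimal sketch of the bookkeeping, with a toy stand-in for the encoder (the real thing is a trained convolutional network; the dimensions match the Stable Diffusion-style example above):

```python
import numpy as np

def encode(image: np.ndarray) -> np.ndarray:
    """Toy stand-in for a learned encoder; only the shapes are meaningful."""
    h, w, _ = image.shape
    # 8x spatial downsampling with 4 latent channels, matching the
    # dimensions in the example above.
    return np.zeros((h // 8, w // 8, 4), dtype=np.float32)

image = np.zeros((256, 256, 3), dtype=np.float32)  # 196,608 values
latents = encode(image)                            # (32, 32, 4): 4,096 values
print(image.size / latents.size)                   # 48.0 -> ~50x fewer values to model
```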
The image I'm showing here is from a paper called EQ-VAE, which is about improving the training of these autoencoders, and they did a really nice thing: they visualize the latent space by taking the principal components across the latent channels and mapping those onto RGB, so you can actually see what the latents look like. Looking at these visualizations, you can still tell what animal is in the image. So the latents are not really abstracting away any semantic content; they're basically just abstracting the local texture and very fine-grained structure. That's the information that gets compressed and, to some degree, removed, and everything else is preserved as in the original pixel space. And that's by design.

All right. Once we have our latent representation, we can start doing some generative modeling. As I mentioned, you could do that either with autoregression or with diffusion. With autoregression, you have to turn everything into a sequence and predict it step by step. That's a very natural thing to do for language, which is already sequence-structured, but slightly less natural for an image, which is a 2D pixel grid: you have to decide on an arbitrary sequence order in which to generate the pixels. Diffusion is a different approach to achieving a very similar thing, namely iterative refinement: you generate something not in one go, but step by step. Diffusion does this by first defining a corruption process that destroys information in the image or video by gradually adding more and more noise, and then learning a denoiser model that tries to remove that noise. As we'll see, you can then use that denoiser to generate images or videos step by step.

To visualize what that looks like: I have an example image here on the left, and I gradually add more and more Gaussian noise to it. One thing to note is that eventually, when you've added a lot of noise, you can't see the image anymore; all the structure has been destroyed. But in the intermediate stages, you can see that a little bit of noise removes the detail: you might not be able to make out the whiskers of this bunny anymore, but you can still see the silhouette, so you can still tell it's a bunny. You're losing fine-grained detail initially, but not the global structure, and only when you add more noise does that global structure also start to fade. So that's the corruption process.
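In code, the corruption process is just a blend of the clean image with Gaussian noise. A minimal sketch, assuming a variance-preserving formulation with a cosine schedule (one common choice; the exact schedule varies between papers):

```python
import numpy as np

def corrupt(x0: np.ndarray, t: float) -> np.ndarray:
    """Forward (corruption) process: blend the clean image with fresh noise.

    t = 0 returns the clean image; t = 1 returns (almost) pure Gaussian noise.
    """
    eps = np.random.randn(*x0.shape).astype(x0.dtype)   # fresh Gaussian noise
    alpha_bar = np.cos(0.5 * np.pi * t) ** 2            # signal fraction at time t
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
```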
Next, I'm going to try to explain how we can use a denoiser that reverses this process to do generative modeling; an intuitive view of how diffusion models actually work. I like to do this with a 2D diagram. Images and video are high-dimensional objects; if you flatten an image into a single vector, it's very high-dimensional. I'm going to reduce that down to two dimensions so I can plot it on a slide. Generalizing from low dimensions to high dimensions is a slightly risky thing to do, but in this case it's quite instructive.

So let me explain how this backward process in diffusion works. We start with a clean image from our dataset, which I'll call x_0, in the bottom left corner of the slide. Then I add some amount of noise to it; I'm simulating the corruption process, and we end up at an arbitrary step of it, which we'll call x_t, where t is a time step indexing the corruption process. That's a noisy version of our image, in the top right corner. In diffusion sampling we actually start from pure noise, but I'm going to zoom in on a particular step in the process where the image is already partially noised; we might end up at something like x_t, and I'll show how the process continues from there.

The first thing our diffusion model does, as I said, is denoise: it tries to predict a clean version of the image given the noisy version. It observes the noisy version, that's all it gets, and we ask the model: where could this have come from? What image could have given rise to this particular noisy observation? That's actually an ill-posed problem, because, as I just said, the noise removes information, so there are many different images that could have given rise to this particular noisy observation. The model doesn't know which one to predict, and since it has to make a single prediction, it ends up predicting the average of all of them. That's why you get a blurry prediction out of a diffusion model: if you ask it for a one-step prediction, you get a very blurry image. The way to think about this is that it's not predicting a particular image per se, but rather a region of image space, the direction we roughly want to go in. That's what I'm trying to visualize with these circles: there's a region of images that all could potentially have given rise to this noisy observation, and that's roughly the direction we want to move in.

So how does sampling work? We predict that direction, and then we take a small step in it. We don't make a big jump all the way to our predicted x_0, because then we'd just end up with a blurry image, which is not what we want. We only take a small step, and then we ask the model again. You can compare this to how optimization of neural networks works. Optimization happens in parameter space, whereas here we're working in pixel space or latent space, but it's very similar: the optimizer gives you an update direction, and you don't want to take a step that's too large, because that direction is only valid locally. It's the exact same thing here. One thing that is different from a typical optimization algorithm is that after this small denoising step, many diffusion sampling algorithms add a little bit of the noise back. And this is new noise; it's not the noise you just removed, you just add a little bit of new noise.
Obviously less than the amount you just removed, because otherwise you wouldn't make any progress. The reason you do this is to avoid accumulation of errors along the way: the denoiser we've trained is a neural network, so its predictions aren't perfect, it makes mistakes, and if you keep feeding the network its own mistakes, you're going to go off the rails. It turns out that adding a little bit of new noise is a useful trick that greatly reduces that risk; it obscures the mistakes the denoiser is making. Not all diffusion sampling algorithms use this trick, but many do.

Then the process just repeats. We're now at a new iterate, x_{t-1}, a slightly less noisy version of the image, because we've partially denoised it, and we ask our denoiser to make a new prediction. That prediction will be different this time: it's still trying to predict where we came from, but now it has a little more information, because there's less noise in the image. So we get a new prediction, corresponding to a smaller region of input space, and that region keeps shrinking as we go along, until it collapses to a single point. At that point, we've sampled an image from the distribution we're trying to generate. This continues in exactly the same way, denoise, add a little bit of noise back, and so on, and that's how you sample from a diffusion model.
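Putting the steps together, here's a minimal sketch of that loop: predict where we came from, move a small step in that direction, and mix a little fresh noise back in. The `denoise` function, the linear-interpolation noise parameterization, and the `churn` constant are illustrative assumptions, not a tuned production sampler:

```python
import numpy as np

def sample(denoise, shape, num_steps=50, churn=0.3):
    """Toy stochastic sampling loop mirroring the procedure described above.

    Assumes denoise(x, t) predicts the clean image x0 from a noisy input,
    where x_t = (1 - t) * x0 + t * noise (a linear interpolation; real
    samplers are more careful about schedules and step-size corrections).
    """
    x = np.random.randn(*shape)                   # start from pure noise (t = 1)
    ts = np.linspace(1.0, 0.0, num_steps + 1)     # noise levels, high -> low
    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_pred = denoise(x, t)                   # blurry "where did we come from?" guess
        x = x + (t - t_next) / t * (x0_pred - x)  # small step toward the prediction
        if t_next > 0:                            # add back a little *fresh* noise
            x = x + churn * (t - t_next) * np.random.randn(*shape)
    return x
```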
Now I want to take a slightly different perspective, to explain why doing this particular thing works so well for images and video. For that, we're going to bring out Fourier analysis and look at the frequencies of what's actually happening in a diffusion model. I have four images from the ImageNet dataset here, and I've taken their Fourier transforms. The Fourier transform of an image, which is a 2D object, is also 2D, and it's complex-valued, so I'm going to pull it apart into magnitude and phase: the magnitude is the middle row, the phase is the bottom row. The phase is hard to work with, so we'll ignore it for now and just look at the magnitude spectrum. To make this even more convenient, we'll summarize the magnitude spectrum in a 1D plot via radial averaging: we take radial slices of the 2D spectrum and average them all together. If you do that and plot the result on a log-log plot, here's what you get for these images. What's very interesting, I think, is that they look like straight lines, and this is a log-log plot, so if you're familiar with scaling laws in language models, that says power law. There is a power-law relationship in the spectrum, and it's a natural phenomenon with images, and with video too: take a photo, compute its spectrum, and you'll typically find this power-law relationship between the frequencies and their corresponding power, or energy.

So what does that mean for diffusion? If you add noise to an image, here's what it looks like in the spectrum. The spectrum of Gaussian noise contains all frequencies in equal measure, in expectation, so it's flat in the frequency domain; you can see that here, the blue line is the spectrum of actual Gaussian noise, a horizontal line. For an image, you get the power-law spectrum, a downward-sloping line. Now if I add the two together, adding the noise to the image, and compute the spectrum again, I get the green line: it follows the image spectrum up to the point where the noise starts to drown it out, and from there it follows the noise spectrum. What that means concretely for our corruption process is that the more noise you add, the more of the high-frequency components you obscure. Add a little noise and you only obscure the highest-frequency components; add more, and you obscure lower and lower frequencies, until eventually, with a lot of noise, all the frequency components of the signal are obscured and completely indistinguishable.

I've summarized that observation as: diffusion is basically spectral autoregression. It essentially lets you generate images coarse-to-fine: you start with the low frequencies and gradually add higher and higher frequencies, which corresponds to coarse-grained features first, then fine-grained features. Of course, this is an approximation; it's not the same as hard autoregression. But in principle it's doing a very similar thing, just in a representation space that makes more sense for image generation, where going coarse-to-fine means you can sketch out the semantics of your image before you add all the details, which is a very natural way to do image generation. It also means that when we train these models, we can put more weight on the frequencies, the scales, that matter more perceptually, and that's a very important aspect of this as well.
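If you want to reproduce those plots yourself, the radially averaged spectrum is only a few lines. A sketch, assuming a grayscale input for simplicity:

```python
import numpy as np

def radial_power_spectrum(image: np.ndarray) -> np.ndarray:
    """Radially averaged power spectrum of a grayscale image (2D array)."""
    f = np.fft.fftshift(np.fft.fft2(image))             # center the low frequencies
    power = np.abs(f) ** 2
    h, w = image.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)  # integer radius per pixel
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.maximum(np.bincount(r.ravel()), 1)      # avoid division by zero
    return sums / counts  # index = radial frequency, value = mean power

# Plotted on log-log axes, natural images give a near-straight line (a power
# law); pure Gaussian noise gives a flat line.
```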
All right, let me now talk a little bit about network architectures for these denoisers. The network's job is very simple: it just does denoising; you feed in a noisy image and ask it for the clean version. People initially used U-Nets for this. U-Nets are convolutional neural networks originally designed for things like image segmentation, but you can use them for any image restoration task, anything where the output dimension is the same as the input dimension, basically, and they worked really well for this use case too, so people adopted them; early Stable Diffusion models are U-Net based as well. But then people asked: can we use a transformer for this? The answer is yes, of course. You obviously don't use the exact recipe you'd use for a large language model; for example, you don't use a causal mask, because it's fine for attention to be fully bidirectional in this context, so it's actually slightly more expressive than a typical LLM transformer. It works just fine with transformers, and because we've learned so much about scaling transformers on the LLM side, it makes practical sense to use them for diffusion models too, so we can reuse a lot of that knowledge.

One more architectural aspect I want to talk about, in the context of video generation, is the tension between autoregression and diffusion. You can train a fully autoregressive video model, but that means taking this height × width × time cube, flattening it into a sequence, and generating it token by token. You can also treat the whole 3D pixel volume of the video as one thing that you jointly noise and denoise, and that's what most modern video generation models do: they add noise across the whole time dimension as well as the spatial dimensions. That works pretty well. But it's not a dichotomy, not a binary choice; there's an intermediate setup where you do autoregression in time, generating frame by frame, but use diffusion to generate each frame. That's a really nice hybrid for many applications; especially when you want to do things like real-time video generation, you pretty much have to do autoregression in time. Genie is a good example of this compromise.

All right, I'll talk very briefly about what goes into actually training a model. We scale these things up; more parameters is better. They're probably not at the scale where large language models are today, they tend to be a bit smaller, and I'll explain why that might be in a moment, but we still scale them to reasonable sizes, so it's very important to figure out how to do parallelism and sharding across many chips. Up to a certain scale you can rely on data parallelism: just split your batch across many chips for training. But at some point you need model parallelism as well, where you start spreading the model itself across different chips. We tend to use JAX to build these models, and JAX has really good tooling that does this for you: you can use its sharding APIs (`pjit`, nowadays folded into `jax.jit`) to ask it to shard the model in a way that's ideally close to optimal, and it automatically minimizes communication between the chips, so you don't have to do that manually.
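A minimal sketch of what that looks like in JAX, with made-up sizes and an assumed 8-device setup (the point is that parallelism is expressed as sharding annotations, and the compiler inserts the communication):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative only: assumes 8 devices, arranged as a 4x2 mesh with a
# data-parallel axis and a model-parallel axis.
mesh = Mesh(np.array(jax.devices()).reshape(4, 2), axis_names=("data", "model"))

# Shard the batch along "data" and the weight matrix along "model".
x = jax.device_put(jnp.ones((32, 8192)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((8192, 8192)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    # XLA partitions this matmul across the mesh and inserts the needed
    # collectives automatically; no manual communication code.
    return x @ w

y = forward(x, w)  # output ends up sharded across both mesh axes
```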
That's all I'll say about training for now. Let's move on to sampling, because I want to talk a little more about the contrast between stochastic and deterministic sampling. I already mentioned that some of these algorithms add a little bit of noise back after every step; others do not, and this comes with different trade-offs. We'll talk about distillation a bit later; in that context, it's actually really useful to have a deterministic sampling algorithm, because then you have a one-to-one mapping between initial noise and samples from your data distribution. But, as I said, stochastic algorithms can be a bit more robust to accumulation of errors. So there are interesting trade-offs there.

The main thing I want to talk about in the context of sampling is the idea of guidance. It's commonly associated with diffusion models, though you can actually also do it for autoregressive models; it just seems to work exceptionally well for diffusion models specifically. What guidance enables you to do is trade off sample quality for diversity. If you crank up the guidance scale, a hyperparameter, the diversity of your samples is greatly reduced, they start looking more alike, but the quality massively improves, to the point where these models punch well above their weight. This is what I was alluding to earlier: I think these diffusion models tend to be smaller than their LLM counterparts because they have this very powerful trick that really lets them punch above their weight. I'll show some examples in a bit, but first I want to explain what guidance actually does.

I'm going back to the diagram from before, to show how guidance modifies the sampling procedure. We start the exact same way: we try to predict where we came from, and it's going to be a blurry prediction. But now we make a second prediction as well, because we've trained our model with a conditioning signal, which could be a text prompt, for example, so we can sample from the model both with and without the text prompt. If the first prediction is the one without the text prompt, I can make another prediction with the text prompt, and typically that's going to be a slightly less blurry prediction, because the set of images this noisy observation could have come from is greatly restricted by what's described in the prompt. If the prompt says this is an image of a rabbit and gives some details, then obviously you know a bit more about what the original image might have been. So there's a difference between these two predictions, and it's precisely that difference we're interested in. We'll call it delta, and we're going to amplify it: if this is the delta between not having the prompt and having the prompt, what if we scale it up and use that as the new direction we actually denoise in? It turns out this is super powerful. It seems almost childish, it's so simple, but there's a Bayesian-reasoning-type argument behind it that explains why it works so well. And this is basically the only change we make to the sampling algorithm; everything else proceeds as normal. Again, we might add a little bit of noise back, and then we take steps as usual. You do need two model evaluations per step, but you get a huge increase in sample quality from that.
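That delta-and-amplify step is what's usually called classifier-free guidance. A minimal sketch, assuming a `denoise(x, t, prompt)` model trained so it can also run unconditionally (e.g., with the prompt randomly dropped some fraction of the time during training):

```python
def guided_prediction(denoise, x, t, prompt, guidance_scale=5.0):
    """Classifier-free guidance on the denoiser's prediction.

    Assumes denoise(x, t, prompt) is a trained model that also accepts
    prompt=None for an unconditional prediction.
    """
    uncond = denoise(x, t, None)    # blurrier prediction: no prompt
    cond = denoise(x, t, prompt)    # sharper prediction: with prompt
    delta = cond - uncond           # the direction the prompt pulls in
    # guidance_scale = 1 recovers the plain conditional prediction;
    # larger values amplify the prompt's influence.
    return uncond + guidance_scale * delta
```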
Let me show you. This is a comparison of samples from a model without guidance and with guidance, from a 2021 paper, positively ancient in AI, since things move pretty fast. It's one of the last papers training diffusion models at scale that actually shows the difference between not using guidance and using guidance; after this paper, everyone just always uses guidance, nobody ever turns it off anymore, because it's so essential to getting good samples from these models. The differences you can see here: going from left to right, there's that reduction in diversity; the samples on the right are much less diverse, but individually they're much higher quality with respect to the prompt. So you get this trade-off. And if you only want one sample, why would you even care about diversity? You can also reintroduce diversity in different ways if you want to. Guidance is basically a no-brainer these days, so everyone leaves it on, and I think a lot of people would be surprised at how bad today's models are if you take guidance away. Here's another example from the same paper; this is the GLIDE paper from OpenAI, one of the first pixel-space diffusion models at scale.

Right, let me talk briefly about distillation. The goal is to speed up the sampling procedure, and we're not going to do that by making the model smaller, which is usually what distillation refers to in the context of large language models; here, what we're actually trying to do is reduce the number of steps required. Sampling from a diffusion model, you could say, traces out a nonlinear path through input space. I showed you a few steps before; here I'm interpolating that path with this dashed red line, the trajectory we might follow through input space during sampling, and it's a nonlinear trajectory. What a diffusion model gives you, at each point on the trajectory, is the direction to move in next; it predicts a tangent to this path, basically. So a question arises: once we know what this path is, and especially with a deterministic sampling algorithm there's exactly one path between a particular noise sample and a particular data point, why are we predicting the tangent? Why can't we just predict where we're going to end up, where this path terminates once all the noise is gone, and then sample in one step? That's exactly what consistency models do. You can take a diffusion model and distill it into a consistency model, and this is just one of the many approaches people have developed to reduce the number of steps required to sample from diffusion models.
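As a sketch of what sampling with such a distilled model might look like, here's the few-step pattern discussed next: jump to the predicted endpoint, re-noise to a lower level, and jump again. The model `f`, the noise parameterization, and the schedule are all illustrative assumptions:

```python
import numpy as np

def consistency_sample(f, shape, noise_levels=(1.0, 0.6, 0.3)):
    """Few-step sampling with a consistency model.

    Assumes f(x, t) was distilled to jump from any point at noise level t
    straight to the clean endpoint of the sampling trajectory. One jump is
    usually too lossy, so: jump, re-noise to a lower level, jump again.
    """
    x = noise_levels[0] * np.random.randn(*shape)   # start from pure noise
    sample = f(x, noise_levels[0])                  # one big jump to the endpoint
    for t in noise_levels[1:]:
        x = sample + t * np.random.randn(*shape)    # partially re-noise
        sample = f(x, t)                            # jump again from closer in
    return sample
```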
Now, in practice, when you try to sample in one step from a consistency model, it usually won't work that well, mostly because you're asking the neural network to do in a single pass what it previously did in something like 50 passes. That's a tall order, so it usually won't work super well out of the box. One trade-off you can make is to do consistency modeling only over certain intervals of the sampling path, so you can still sample in, say, three steps, and then you'll get better results. This is just one of the many ways to achieve this, but I wanted to show intuitively what it's actually doing.

All right, let's talk briefly about control. This is the last thing I want to cover. These models are only as useful as our ability to influence, at a semantically high level, what they actually produce. The canonical way we've been doing that is to give them a text prompt that describes what the image or video should look like. That's great; it works pretty well. But more and more, we're seeing that people want more than that. They don't just want to describe everything in great detail in language. You want to be able to do reference-based generation: for example, you want to generate a video and put yourself in it, and you're not going to do that by describing what you look like in intricate detail; you'll do it by taking a photo or a short video of yourself and conditioning the model on that. So we want these other types of conditioning signals as well. In video generation specifically, we might want to explicitly control camera motion, or the speed of events, the specific timing of events; that's not something you should try to shoehorn into a textual representation. So you need more advanced conditioning signals. And then a big question that arises is when to introduce these conditioning signals to the model, because you often don't have this kind of conditioning information for most of your pre-training dataset. It can actually be quite useful to do it in post-training: pre-train a model with just a text prompt, then add the extra conditioning signals in a post-training phase. And of course, other things we can do in post-training include preference tuning based on human evaluations, using reinforcement learning or direct preference optimization.

All right, that was a whirlwind overview of all eight of these topics. I'm going to stop there so we have some time for questions. Thank you very much for listening.

>> You were talking about how you increase quality with guidance. Is that kind of why lots of these models have a certain style? Like you can look at an image and tell.

>> That's part of it. I think part of it is probably the post-training recipe, which tends to make these models a lot more opinionated.
I like to think of it this way: pre-training models the distribution of images, let's say, and post-training is more about deciding which sliver of that distribution you're actually interested in, and that's a very opinionated type of thing, so it plays a big role. There are specific effects you can observe that are easily attributed to guidance, though: if you get images that look very saturated, that's usually a sign the guidance scale is too high. The guidance scale amplifies this difference between the unconditional and conditional predictions, but it does so at every noise level, and depending on which noise level you're manipulating this direction at, you get different effects. So what people have started doing is varying the guidance scale across the sampling procedure; there's a paper showing it's actually best not to apply guidance at the very start and the very end, but to ramp it up in the middle, and that tends to give you the best trade-off. So guidance is definitely playing a role, but I think the bigger role is played by post-training.

>> Hi. How would you do this guidance, or conditional sampling, for text diffusion?

>> You could use exactly the same procedure, provided you have a prompt you're giving the model: you can treat that prompt as a conditioning signal and apply guidance to it. How well that works is an open question. It's clearly applicable to that setting as well, but what we've typically found is that guidance tends to work best when there's a bit of a semantic gap between the thing you're guiding with and the thing you're generating. That's why it works so exceptionally well in this prompting setup: a text prompt captures the high-level semantics of what's in an image, and that seems to be a really good way to steer the model in a semantic space, so to speak. You have the same problem within image generation: if you do guidance on, say, a segmentation mask, and you want the model to fill in the different parts of that mask, guidance will actually work a lot less well, because it's a low-level conditioning signal at the same level of abstraction as the pixels you're trying to generate. Guidance just works better when you have that gap, and I think that's also why people don't typically use it as much for language models, even though it's perfectly applicable to autoregressive models as well.

>> Hi, you mentioned JAX is better suited for model sharding and parallelism. Is there any particular reason for that, especially compared to PyTorch?

>> Okay, I have to caveat my answer with the fact that I've been at Google for over a decade. PyTorch didn't exist when I joined Google, so I've never had a lot of hands-on experience with it.
I think JAX was just designed with this in mind from the get-go. We've had TPUs at Google for a while, and JAX was really conceived to make using them as easy as possible. And with TPUs, you very rarely use just one chip; you want to take advantage of the really fast interconnect and scale up pretty quickly. So yes, I think it was designed with this in mind from the start, but PyTorch can do these things as well; it's not unique to JAX.

>> Thanks.

>> Thank you for your presentation. If you slow down the process of denoising and noising, do you get better images?

>> Slow down as in take more steps?

>> Yes, take more steps.

>> Up to a point, yes. As I said, you're approximating this nonlinear path through input space, and you're approximating it by taking finite steps, which means you're actually approximating it with a piecewise-linear path. The more that piecewise-linear path deviates from the true nonlinear path, the worse your samples get. But there's obviously a limit, and there's also been a lot of research on how to train diffusion models, or models like this, that produce straighter paths from the get-go, so you can get a more accurate approximation with fewer steps. There's something called rectified flow, or ReFlow, that digs into this. So: up to a point is the answer.

>> Thanks. I was wondering if you could talk a little more about how you get these other types of control signals in, for example for specific camera movement. Is this something you would do with a simulator, where you can actually control all the parameters?

>> That's a very good question. Obviously I can't comment on what we actually do; I can only speculate. I don't actually know the details here, because the team at this point is quite large, and there are people working specifically on what we call capabilities, endowing the model with the ability to be conditioned on a camera control signal, for example; I'm not actively involved in that part of the work. But as I said, it's quite important to have conditioning signals that are semantically abstract, so you have to find the right representation that makes the most sense to condition the model on. I think what's also very important: if your model is a transformer, you actually have quite a few ways to insert conditioning information. You could add a bunch of extra tokens; that's one way to do it. Sometimes you'd rather broadcast the conditioning information to all the tokens: for example, the noise level input we give to the diffusion model, to tell it how much noise there is, is typically broadcast to all the tokens rather than added as a single extra token. And I think there's a third way people have started doing conditioning as well.
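To make those two options concrete, here's a shape-level sketch (hypothetical dimensions, no real model; the modulation comment describes one common pattern, not necessarily what any particular model does):

```python
import numpy as np

tokens = np.zeros((1024, 512))   # 1024 image/video tokens, model width 512
cond = np.zeros((8, 512))        # e.g. 8 embedded text-prompt tokens
noise_emb = np.zeros((512,))     # embedding of the scalar noise level

# Option 1: conditioning as extra tokens; self-attention lets every
# image token attend to them.
seq = np.concatenate([cond, tokens], axis=0)   # (1032, 512)

# Option 2: broadcast the signal to every token, here as a simple additive
# shift (in practice this is often done by modulating the normalization
# layers instead).
tokens = tokens + noise_emb[None, :]           # still (1024, 512)
```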
But yeah, these are all things you have to think about when you're trying to figure out how to introduce a new conditioning signal. I think we have time for maybe one more.

>> Thanks for the really nice talk. Just to understand: in the denoising part, if you didn't add that noise back in each step, would it be fully deterministic?

>> Yes. Once you have a denoiser, there are basically two different types of generative models you can construct from it: one is stochastic and the other is fully deterministic. It's kind of a magical feat; I still don't really believe that it works, but it really does. So you can have a deterministic model, and as I said, this can be useful in many applications, especially if you want to go both ways, from your data distribution to your noise distribution and back. And for many distillation techniques, it's a requirement that the teacher be deterministic; that's the case with these consistency models, for example. You need a deterministic sampling algorithm to start from.

>> Thanks.

>> Cheers. All right, thank you everyone. I have a longer version of this talk on YouTube, if you want a bit more detail on some of these sections; you can check that out as well. Cool, thank you.