FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs
Channel: aiDotEngineer
Published at: 2026-05-08
YouTube video id: x8Yb4RidLgM
Source: https://www.youtube.com/watch?v=x8Yb4RidLgM
Thank you all for coming today, I really appreciate it. Thank you for coming to this talk, which is Black Forest Labs: FLUX, Open Research, and the Future of Visual AI. I'll start quickly with a short intro of myself. I'm Stephen Batifol, a developer relations engineer at BFL, and I want to start with two questions. First, who here knows about BFL? Raise your hand. Okay, who here knows FLUX? Okay, about the same people actually. For the people that don't know BFL, here's a quick intro so you're not lost. BFL at a glance: we are the team behind Stable Diffusion, latent diffusion, and the FLUX models as well. Our team has more than 200,000 academic citations, and we don't only build models, we also work with enterprises and customers. Some of our customers are Microsoft, Adobe, Canva, Mistral, and many more.

The way we started is in August 2024 with FLUX.1. FLUX.1 was the first breakthrough: it was the big competitor to Stable Diffusion back then, and that was really the moment where people said, oh, this is a really cool model. We released it open source from the start. That was the one that was text-to-image only, and you could run it on your laptop, which was a game changer as well, and the anatomy was really good in comparison to other models, especially models that were much bigger. This is where we really had a breakthrough, and Clem from Hugging Face actually gave us a shout-out back then. This is fairly old, but FLUX was the most-liked model on Hugging Face at the time. That's not true anymore, but back then it was a really big deal for a company that was coming out of nowhere and had just released this model.

We then released FLUX.1 Kontext, which was the first open-source editing model in the world that combined text-to-image generation and image editing. What I'm showing you here seems obvious now, because we have editing models everywhere, but back then it was a big breakthrough to be able to do both text-to-image and image editing in the same model. In the example here, we have an input image, and then you remove the snowflakes from the face; you can see you keep character consistency. You can also move this person to Freiburg, which is where our headquarters are; she's chilling, taking a selfie in the streets of Freiburg. And then you can do some local editing where you change the background to be snowy, and you also get snow on her face and everything. It was also one of the fastest models back then. If you remember, this was the time of the first GPT image generation, where it would take 40 to 50 seconds to generate or edit an image, whereas Kontext, if I remember correctly, was around seven to eight seconds.

It was also really useful for telling stories. I've seen a lot of use cases from our partners and customers where they would start with an image and then create a storyboard like the one we see here. We have the famous seagull, wearing a VR headset and drinking a beer in a bar, but then you can create other things: a friend that joins and drinks with them.
Now, I guess they got a bit tipsy and they're wearing hats in the bar, then they're going outside, and that's a story you could create. That was really useful for video or animation models: you would give those images as input frames or end frames, and the video model could then create different content.

In November, we released FLUX.2, which is our step towards what we call visual intelligence. FLUX.2 is still our best model. These are samples, which I don't know if you can see clearly, but in my opinion they are really amazing samples, and it's basically impossible to tell they are AI-generated. If you look at the hands, the veins, the bracelets of the person on the left, personally there's no way I could tell it's AI-generated. Same for the turtles you see on the right side. Same for the dog, or cat actually, that is in the bath; it would be very, very hard to get a sample like this. But those are AI-generated, and there's more. It's not only people or animals: you can do proper product photography, as you can see with the waffle on the bottom right, or you can make very cute images, like the person on the left driving a moped with some balloons. This is what we released in November.

But it's not only an image generation model, it's also an image editing model at the same time. On the left, we have six images that we give to the model, and my prompt was literally: create an outfit with those images. The model is intelligent enough to make things that make sense, like the jacket being worn properly, and the same for the tie. On the right side is a simpler use case with a sofa: imagine you are an e-commerce website or a sofa maker, and you want people to see what it would look like in their flat, what it would really look like if they were to buy it. Those use cases are really important, and they are the main use cases we have currently for FLUX.2. It also takes up to 10 images simultaneously, so you can really edit a lot of images at the same time and create magic things. It's very good at character, product, and style consistency.

What I want to make clear is that for BFL, as a company and as a research lab, our first operating principle is to release state-of-the-art models. This is what we want to focus on as a company: we want to raise the bar on quality with every release we do. We did it in the past with FLUX.1 when it came out, we did it with Kontext, and FLUX.2 was our best image model to date. It's the first one we released that is multi-reference as well, and it was state-of-the-art in the open-source world for both text-to-image and image editing. In January, we released FLUX.2 [klein], which is a step towards interactive editing and interactive generation. It generates and edits images in less than a second.
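As a concrete illustration of the open-weight side of this story, here is a minimal sketch of running the FLUX.1 [dev] checkpoint locally with Hugging Face diffusers. This is not from the talk: it assumes a recent diffusers release with the FluxPipeline class and enough GPU memory (or CPU offload), and the prompt and parameters are placeholders to adapt.

```python
# Minimal sketch: local text-to-image with the open FLUX.1 [dev] weights via diffusers.
# Assumes `pip install diffusers transformers accelerate torch` and enough GPU memory
# (or CPU offload, as enabled below) to hold the model.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # trades speed for memory on smaller GPUs

image = pipe(
    prompt="a seagull wearing a VR headset, drinking a beer in a bar",
    height=1024,
    width=1024,
    guidance_scale=3.5,       # value suggested on the model card
    num_inference_steps=50,
).images[0]
image.save("flux1_dev_sample.png")
```

The same pipeline call works for storyboard-style prompting: generate one frame, then feed variations of the prompt to build a sequence of images that a video model can later use as start or end frames.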
I'll talk a bit more about it later in the talk, but the fastest it can do, if I remember correctly, is 500 milliseconds for editing and 300 milliseconds for generation, so basically real time. But this is not all; we have more things coming, and that's what I want to talk about today. As I mentioned, we are a research company first. We publish things in the open, we publish papers, and we really want to make sure the field is moving forward with us. That's our big focus: state-of-the-art models, published in the open.

But first I want to take a step back and tell you a bit about how you train models, especially generative models that produce images and other content. When you train them, they don't actually understand what they're generating. They don't understand that my glass here should stay on this table and shouldn't go through it, because of how they are trained: you take images, you add some random noise to those images, and then you just try to denoise them. That's what those models are doing. And when you denoise images, you never learn that my glass shouldn't go through the table, you never learn that if you're sitting on a chair you shouldn't go through it.

So what you do is what is called representation alignment: you use an external model that actually knows about this, an image encoder, to teach our model: hey, he is currently sitting on the chair, he shouldn't go through it. Those encoders are external, and they are trained to understand and segment images, whereas our models are trained to generate images or videos or audio, and you try to align them on the same objective, so that our generative model actually learns: okay, you shouldn't go through the chair, my glass should stay on this table. This is great because it really improves the way generative models train. You can see on the right that it is 70 times faster to converge and reduce the loss when you use this external alignment.

So you're like, okay, this is great. But as usual, if something works well, there are also drawbacks. The first one is that you have a scaling ceiling. Imagine you're working with an external model that has already been trained; it's a checkpoint, you're not changing it anymore. What if you train a new generative model that you want to scale up? You're still limited by this encoder on the side; you never actually scale up fully. Also, those encoders are specialized in one modality. Take an encoder like DINOv2, for example, or the other one whose name I can't remember: they're specialized in images only. What if you want your model to generate images, audio, video, and more? You would have to have encoders for all of those, and you can imagine you would end up with a very Frankenstein setup where nothing really fits together. The objectives are also misaligned. As I said before, we want to generate content, images or audio; the encoder is there to segment things. They have different objectives, and you're trying to make them work together. It works well, but it's not perfect. Here you see on the right side we have DINOv2 and DINOv3.
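To make the representation-alignment idea concrete, here is a rough sketch of what a training step with an external encoder can look like. This is a generic illustration, not BFL's actual training code: the generator signature, the projection head, and the loss weighting are placeholder assumptions, and the frozen encoder (for example DINOv2) is only used to supervise the generator's internal features.

```python
# Rough sketch of representation-alignment training (REPA-style), not production code.
# A frozen external image encoder supervises the generator's internal features while
# the generator itself is trained with a standard flow-matching denoising loss.
import torch
import torch.nn.functional as F

def training_step(generator, frozen_encoder, feature_proj, images, text_emb, lambda_align=0.5):
    """One hypothetical step: flow-matching loss + feature-alignment loss."""
    b = images.size(0)
    t = torch.rand(b, device=images.device).view(b, 1, 1, 1)   # random noise level per sample
    noise = torch.randn_like(images)
    noisy = (1 - t) * images + t * noise                        # linear interpolation path

    # Generator predicts the velocity (noise - image) and exposes intermediate features.
    pred_velocity, hidden_feats = generator(noisy, t.flatten(), text_emb, return_features=True)
    loss_gen = F.mse_loss(pred_velocity, noise - images)

    # The frozen encoder sees the *clean* image; generator features are projected to match.
    with torch.no_grad():
        target_feats = frozen_encoder(images)                   # e.g., frozen DINOv2 patch features
    aligned = feature_proj(hidden_feats)                        # small learned projection head
    loss_align = 1 - F.cosine_similarity(aligned, target_feats, dim=-1).mean()

    return loss_gen + lambda_align * loss_align
```

The point of the sketch is the second loss term: the encoder never generates anything, it only tells the generator what good features look like, and that external dependency is exactly what the next part of the talk wants to remove.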
DINOv3 is technically a better model than DINOv2, but when you train your generative model against it, you actually get worse performance. DINOv3 is shown here in red and green, and you get worse performance. So you're like, okay, this is supposed to be a better model, and yet when I use it to train a model to generate things, it gets worse. And there are not really any rules as to why a certain encoder should work and another shouldn't.

So how can we solve this? How can you teach a model representations directly, without this external encoder? This is what we released about a month and a half ago as a research paper. You can read it; it's called SelfFlow, and it's in the open. We released it to make sure we're moving the field forward again, so it's not only us benefiting from it. It's basically a scalable approach to training multimodal generative models. It uses self-supervised learning, so you don't need any other models to train it. I'm going to go into a tiny bit more detail: we combine representation learning and generation in the same flow.

You see here on the left you have videos, images, audio: different modalities. What do you usually do when you train a model? You add some random noise, you try to denoise it, and then you align it with the encoder. How do we do it instead? We add two different kinds of noise, both random and both different. The first adds a lot of noise to the asset; this is the one you see at the top. The other adds a low amount of noise; this is what you see at the bottom. The idea is that we then have two models working together. We have the student, which always gets the images with the most noise and tries to denoise them, and the teacher, which is basically a more stable version of the student and always gets the low-noise images. The student is then trying to learn two things at the same time: it tries to minimize the generation loss and the representation loss. And this is how it works across different modalities: you only have one model, nothing external, and if you scale up your model, you're scaling up your student and your teacher, and you don't have to worry about an encoder on the side anymore.

This is what we're working on. This is something we are currently using for different models that we're training, and this is where we believe the future is going to be. And to get rid of those encoders, we actually trained models. Disclaimer: those are research models, they're not meant to be released in production, but we trained one model on all those modalities. On the left, we're comparing flow matching, which is the usual way of training these models, with ours. You can see we are better in audio; this is what you see in orange on the right side. We're also better in images: the dashed line is the baseline and the full line is ours, and you can see we're better at images and also better at video.
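Here is a very loose sketch of the training step as the talk describes it: one network, two noise levels, a student that sees the heavily noised input and a teacher that sees the lightly noised one. Everything below is reconstructed from the description in the talk, not from the SelfFlow paper; in particular the EMA teacher update, the noise-level ranges, the feature extraction, and the loss weighting are assumptions made purely for illustration.

```python
# Loose sketch of the student/teacher, two-noise-level idea described in the talk.
# Not the actual SelfFlow implementation: the EMA teacher, noise ranges, and loss
# weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def selfflow_style_step(student, teacher, batch, cond, lambda_repr=1.0):
    """One hypothetical step: the student denoises the heavily noised input and also
    matches the teacher's features computed on a lightly noised copy of the same data."""
    b = batch.size(0)
    noise = torch.randn_like(batch)
    t_high = torch.rand(b, device=batch.device).clamp(min=0.5).view(b, 1, 1, 1)  # lots of noise
    t_low = torch.rand(b, device=batch.device).clamp(max=0.2).view(b, 1, 1, 1)   # little noise

    x_high = (1 - t_high) * batch + t_high * noise
    x_low = (1 - t_low) * batch + t_low * noise

    # Student: predict the flow-matching velocity and expose internal features.
    pred_velocity, student_feats = student(x_high, t_high.flatten(), cond, return_features=True)
    loss_gen = F.mse_loss(pred_velocity, noise - batch)

    # Teacher: a slowly updated, more stable copy of the student, run on the low-noise input.
    with torch.no_grad():
        _, teacher_feats = teacher(x_low, t_low.flatten(), cond, return_features=True)

    # Representation loss: student features should match the teacher's features.
    loss_repr = 1 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()
    return loss_gen + lambda_repr * loss_repr

def update_teacher(student, teacher, ema_decay=0.999):
    """Keep the teacher as an exponential moving average of the student (assumed here)."""
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1 - ema_decay)
```

The key difference from the previous sketch is that the "encoder" providing the representation target is the same network at a lower noise level, so scaling up the generator automatically scales the representation learner with it, and the same recipe applies whether the batch is images, audio, or video.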
So with this approach, without the external encoder that you may struggle with, you actually get better at every modality you train your model on. It also converges faster. You can see on the right that the baseline is hitting a plateau, whereas we converge faster and are still decreasing the loss. I'm pretty sure that if we were to go towards two million steps, the baseline would really plateau and not get any better, maybe even get worse, whereas we would keep going down in loss. That's the difference between the two.

If you've used FLUX in the past, or other models, to generate images, you may have noticed the text might not be perfect, or things don't really make sense. This is what you see at the top, where on the left the text is supposed to be "the future is flux", but some letters are missing, or letters are doubled, like on "world", for example, where you get two L's instead of one. Whereas with the SelfFlow approach, at the bottom, everything makes sense: the model has learned the representation, it has learned how the letters of "flux" should sit next to each other, and the same goes for the mirror and the tree. This is where we believe the future is.

On top of this, we can see some comparisons here. On the left is the baseline, where again the letters are wrong, and on the right the letters are correct. It's the same for anatomy, where on the left you have a face that's looking a bit odd, let's put it that way, and on the right is the one with SelfFlow. Again, this is not a production model where you expect the face to be perfect, but you can see the anatomy is much better than on the left.

I also want to show you some different generations, if it loads. Yes, thank you. This is also possible for video generation. The same model that has been trained on images can now also generate videos. On the left, you see the baseline; it's a weird way to do a push-up, let's put it that way. Whereas on the right, it's perfect form: the arms are correct, the hair is correct, and nothing is wrong in it. This is a way to fix all those artifacts that you usually see in generations. It's the same here for the birds, where on the left side you have the baseline, with a lot of flickering and a lot of weird things happening, because the model was using the encoder and trying to align things, whereas on the right side, with SelfFlow, it just does it: the bird is walking on the floor and there's no flickering or anything.

But it's not only about images or videos or audio. You train those jointly, so you can also generate things jointly. Here is an example of a video-and-audio sample where the idea is that we have someone saying "hello from the Black Forest". I will just play them and you will hear the difference. Again, this is not a production-ready model, so it's not perfect, but you can hopefully hear the difference. >> Oh from the back for the black from the back.
So this one was the baseline, where if you pay attention to what he's saying, you hear "hello from the..." and then some weird things at the end. >> Hello from the black forest. Hello from the black forest. >> Whereas on the right side, the prompt is really just: say "hello from the black forest", and it ends there. And yeah, this is the same model that was trained on the images we've seen before, on video, and on video and audio together.

This is cool and this is great, but what if you could also teach robots with it? This is also the same model, now trained on actions and not only on images, video, or audio, so it can also predict actions. What I'm going to show you now is a robot that is trying to pick up a can and bring it closer to us. On the left is the baseline; again you see some flickering, you see the arm doing weird things, whereas on the right, for the same number of steps, with SelfFlow the robot picks up the can directly and brings it closer. This is where we're going as well as a company. What we're really interested in is not only image or video generation, but also actions, doing more things toward physical AI.

And there is more: it's also about how we make our models faster, because that is really important for us. This is a demo of klein, which is near real-time editing. You see it on the right side; this was generated with klein on Krea, where you see the edits happening. This is not a video model; those are images that are being edited in real time. Not only are they faster, they are also at least on par with or better than other models.

And I'm almost out of time, or they added five minutes, so I don't know if I'm okay. Cool, so I'm not out of time, I can chill. So yes, here on the left we can see klein, in 4B and 9B, compared to the other open-source models. It's at least on par, while the latency is around 0.5 seconds, whereas Qwen is around 15 seconds. If you are on par and you're way faster, that is really, really good for us. Same for image-to-image: for klein 9B we are at a tiny bit more than 0.5 seconds, whereas Qwen is still around 15 seconds. And the same for multi-reference editing: we're still at less than a second, where Qwen is more towards 20 seconds. This is really important for us, because you really want to generate things in real time.

This is what we believe is the road to visual intelligence, and this is where we're going as a company. Why does it matter? As I mentioned, real-time generation: you can imagine rendering mockups as fast as you think. You don't have to wait 10 seconds, 20 seconds, a minute or two; you do things in real time and you can guide them in real time. This is also where we're going: you can think of interactive visual engines for gaming or film, where you really render a movie as you prompt it. On top of this, there are also world models. The idea behind world models, and why they matter for us, is that you train your models to understand and simulate the geometry, relationships, and interactions of the world. And you may be thinking, okay, that's cool from a research perspective, but why do we care? The reason is robots.
That's why we care. Robotics and automation are where we're taking BFL, and that's why we also want to go towards world models: to train agents in those generative worlds, to scale safe driving, and to automate manufacturing. I think that is it. Thank you very much. Do we take... yeah, I think we can take questions.

>> [Audience question, partially inaudible] ...measure...
>> You mean the data? I can't really share that; those are trade secrets. Data is very sensitive, as you can imagine, so I can't really share this. We're partnering with a lot of people for it, though.
>> How do you store the world state?
>> Well, that's what the model is learning: those representations. The model learns that in itself as the state, and it has some kind of memory. That's the way we do it.
>> Is it like a context window?
>> Yeah, it's the context window that you have. You train, you have the tokens, and then the model goes, "oh look, I've moved, here is where I should be," and then the next step.
>> Can you then run it for long, or does something...
>> Define long. What do you call long?
>> Indefinitely.
>> That I'm not sure. I mean, there's always going to be a limit, so you may have a sliding window, but that's the way we see it. Thank you.