FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs
Channel: aiDotEngineer
Published at: 2026-05-08
YouTube video id: x8Yb4RidLgM
Source: https://www.youtube.com/watch?v=x8Yb4RidLgM
Thank you all for coming today, I really appreciate it. Thank you for coming to this talk, which is Black Forest Labs: FLUX, Open Research, and the Future of Visual AI. I'll start quickly with a short intro of myself. I'm Stephen Batifol, a developer relations engineer at BFL, and I want to start with two questions. First, who here knows about BFL? Raise your hand. Okay, who here knows FLUX? Okay, about the same people actually. For the people that don't know BFL, here's a quick intro so you're not lost. BFL at a glance: we are the team behind Stable Diffusion, latent diffusion, and the FLUX models as well. Our team has more than 200,000 academic citations, and we don't only build models, we also work with enterprises and customers. Some of our customers are Microsoft, Adobe, Canva, Mistral, and many more.

The way we started is in August 2024 with FLUX.1. FLUX.1 was the first breakthrough: it was the big competitor to Stable Diffusion back then, and that was really the moment where people said, oh, this is a really cool model. We released it open source from the start. That was the one that was text-to-image only, and you could run it on your laptop, which was a game changer as well, and the anatomy was really good in comparison to other models, especially models that were much bigger. This is where we really had a breakthrough, and Clem from Hugging Face actually gave us a shout-out back then. This is fairly old, but FLUX was the most-liked model on Hugging Face at the time. That's not true anymore, but back then it was a really big deal for a company that was coming out of nowhere and had just released this model.

We then released FLUX.1 Kontext, which was the first open-source editing model in the world that combined text-to-image generation and image editing. What I'm showing you here seems obvious now, because we have editing models everywhere, but back then it was a big breakthrough to be able to do both text-to-image and image editing in the same model. In the example here, we have an input image, and then you remove the snowflakes from the face; you can see you keep character consistency. You can also move this person to Freiburg, which is where our headquarters are; she's chilling, taking a selfie in the streets of Freiburg. And then you can do some local editing where you change the background to be snowy, and you also get snow on her face and everything. It was also one of the fastest models back then. If you remember, this was the time of the first GPT image generation, where it would take 40 to 50 seconds to generate or edit an image, whereas Kontext, if I remember correctly, was around seven to eight seconds.

It was also really useful for telling stories. I've seen a lot of use cases from our partners and customers where they would start with an image and then create a storyboard like the one we see here. We have the famous seagull, wearing a VR headset and drinking a beer in a bar, but then you can create other things: a friend that joins and drinks with them.
Now, I guess they got a bit tipsy and they're wearing hats in the bar, then they're going outside, and that's a story you could create. That was really useful for video or animation models: you would give those images as input frames or end frames, and the video model could then create different content.

In November, we released FLUX.2, which is our step towards what we call visual intelligence. FLUX.2 is still our best model. These are samples, which I don't know if you can see clearly, but in my opinion they are really amazing samples, and it's basically impossible to tell they are AI-generated. If you look at the hands, the veins, the bracelets of the person on the left, personally there's no way I could tell it's AI-generated. Same for the turtles you see on the right side. Same for the dog, or cat actually, that is in the bath; it would be very, very hard to get a sample like this. But those are AI-generated, and there's more. It's not only people or animals: you can do proper product photography, as you can see with the waffle on the bottom right, or you can make very cute images, like the person on the left driving a moped with some balloons. This is what we released in November.

But it's not only an image generation model, it's also an image editing model at the same time. On the left, we have six images that we give to the model, and my prompt was literally: create an outfit with those images. The model is intelligent enough to make things that make sense, like the jacket being worn properly, and the same for the tie. On the right side is a simpler use case with a sofa: imagine you are an e-commerce website or a sofa maker, and you want people to see what it would look like in their flat, what it would really look like if they were to buy it. Those use cases are really important, and they are the main use cases we have currently for FLUX.2. It also takes up to 10 images simultaneously, so you can really edit a lot of images at the same time and create magic things. It's very good at character, product, and style consistency.

What I want to make clear is that for BFL, as a company and as a research lab, our first operating principle is to release state-of-the-art models. This is what we want to focus on as a company: we want to raise the bar on quality with every release we do. We did it in the past with FLUX.1 when it came out, we did it with Kontext, and FLUX.2 was our best image model to date. It's the first one we released that is multi-reference as well, and it was state-of-the-art in the open-source world for both text-to-image and image editing. In January, we released FLUX.2 [klein], which is a step towards interactive editing and interactive generation. It generates and edits images in less than a second.
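As a concrete illustration of the open-weight side of this story, here is a minimal sketch of running the FLUX.1 [dev] checkpoint locally with Hugging Face diffusers. This is not from the talk: it assumes a recent diffusers release with the FluxPipeline class and enough GPU memory (or CPU offload), and the prompt and parameters are placeholders to adapt.

```python
# Minimal sketch: local text-to-image with the open FLUX.1 [dev] weights via diffusers.
# Assumes `pip install diffusers transformers accelerate torch` and enough GPU memory
# (or CPU offload, as enabled below) to hold the model.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # trades speed for memory on smaller GPUs

image = pipe(
    prompt="a seagull wearing a VR headset, drinking a beer in a bar",
    height=1024,
    width=1024,
    guidance_scale=3.5,       # value suggested on the model card
    num_inference_steps=50,
).images[0]
image.save("flux1_dev_sample.png")
```

The same pipeline call works for storyboard-style prompting: generate one frame, then feed variations of the prompt to build a sequence of images that a video model can later use as start or end frames.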
I'll talk a bit more about it later in the talk, but the fastest it can do, if I remember correctly, is 500 milliseconds for editing and 300 milliseconds for generation, so basically real time. But this is not all; we have more things coming, and that's what I want to talk about today. As I mentioned, we are a research company first. We publish things in the open, we publish papers, and we really want to make sure the field is moving forward with us. That's our big focus: state-of-the-art models, published in the open.

But first I want to take a step back and tell you a bit about how you train models, especially generative models that produce images and other content. When you train them, they don't actually understand what they're generating. They don't understand that my glass here should stay on this table and shouldn't go through it, because of how they are trained: you take images, you add some random noise to those images, and then you just try to denoise them. That's what those models are doing. And when you denoise images, you never learn that my glass shouldn't go through the table, you never learn that if you're sitting on a chair you shouldn't go through it.

So what you do is what is called representation alignment: you use an external model that actually knows about this, an image encoder, to teach our model: hey, he is currently sitting on the chair, he shouldn't go through it. Those encoders are external, and they are trained to understand and segment images, whereas our models are trained to generate images or videos or audio, and you try to align them on the same objective, so that our generative model actually learns: okay, you shouldn't go through the chair, my glass should stay on this table. This is great because it really improves the way generative models train. You can see on the right that it is 70 times faster to converge and reduce the loss when you use this external alignment.

So you're like, okay, this is great. But as usual, if something works well, there are also drawbacks. The first one is that you have a scaling ceiling. Imagine you're working with an external model that has already been trained; it's a checkpoint, you're not changing it anymore. What if you train a new generative model that you want to scale up? You're still limited by this encoder on the side; you never actually scale up fully. Also, those encoders are specialized in one modality. Take an encoder like DINOv2, for example, or the other one whose name I can't remember: they're specialized in images only. What if you want your model to generate images, audio, video, and more? You would have to have encoders for all of those, and you can imagine you would end up with a very Frankenstein setup where nothing really fits together. The objectives are also misaligned. As I said before, we want to generate content, images or audio; the encoder is there to segment things. They have different objectives, and you're trying to make them work together. It works well, but it's not perfect. Here you see on the right side we have DINOv2 and DINOv3.
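To make the representation-alignment idea concrete, here is a rough sketch of what a training step with an external encoder can look like. This is a generic illustration, not BFL's actual training code: the generator signature, the projection head, and the loss weighting are placeholder assumptions, and the frozen encoder (for example DINOv2) is only used to supervise the generator's internal features.

```python
# Rough sketch of representation-alignment training (REPA-style), not production code.
# A frozen external image encoder supervises the generator's internal features while
# the generator itself is trained with a standard flow-matching denoising loss.
import torch
import torch.nn.functional as F

def training_step(generator, frozen_encoder, feature_proj, images, text_emb, lambda_align=0.5):
    """One hypothetical step: flow-matching loss + feature-alignment loss."""
    b = images.size(0)
    t = torch.rand(b, device=images.device).view(b, 1, 1, 1)   # random noise level per sample
    noise = torch.randn_like(images)
    noisy = (1 - t) * images + t * noise                        # linear interpolation path

    # Generator predicts the velocity (noise - image) and exposes intermediate features.
    pred_velocity, hidden_feats = generator(noisy, t.flatten(), text_emb, return_features=True)
    loss_gen = F.mse_loss(pred_velocity, noise - images)

    # The frozen encoder sees the *clean* image; generator features are projected to match.
    with torch.no_grad():
        target_feats = frozen_encoder(images)                   # e.g., frozen DINOv2 patch features
    aligned = feature_proj(hidden_feats)                        # small learned projection head
    loss_align = 1 - F.cosine_similarity(aligned, target_feats, dim=-1).mean()

    return loss_gen + lambda_align * loss_align
```

The point of the sketch is the second loss term: the encoder never generates anything, it only tells the generator what good features look like, and that external dependency is exactly what the next part of the talk wants to remove.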
DINOv3 is technically a better model than DINOv2, but when you train your generative model against it, you actually get worse performance. DINOv3 is shown here in red and green, and you get worse performance. So you're like, okay, this is supposed to be a better model, and yet when I use it to train a model to generate things, it gets worse. And there are not really any rules as to why a certain encoder should work and another shouldn't.

So how can we solve this? How can you teach a model representations directly, without this external encoder? This is what we released about a month and a half ago as a research paper. You can read it; it's called SelfFlow, and it's in the open. We released it to make sure we're moving the field forward again, so it's not only us benefiting from it. It's basically a scalable approach to training multimodal generative models. It uses self-supervised learning, so you don't need any other models to train it. I'm going to go into a tiny bit more detail: we combine representation learning and generation in the same flow.

You see here on the left you have videos, images, audio: different modalities. What do you usually do when you train a model? You add some random noise, you try to denoise it, and then you align it with the encoder. How do we do it instead? We add two different kinds of noise, both random and both different. The first adds a lot of noise to the asset; this is the one you see at the top. The other adds a low amount of noise; this is what you see at the bottom. The idea is that we then have two models working together. We have the student, which always gets the images with the most noise and tries to denoise them, and the teacher, which is basically a more stable version of the student and always gets the low-noise images. The student is then trying to learn two things at the same time: it tries to minimize the generation loss and the representation loss. And this is how it works across different modalities: you only have one model, nothing external, and if you scale up your model, you're scaling up your student and your teacher, and you don't have to worry about an encoder on the side anymore.

This is what we're working on. This is something we are currently using for different models that we're training, and this is where we believe the future is going to be. And to get rid of those encoders, we actually trained models. Disclaimer: those are research models, they're not meant to be released in production, but we trained one model on all those modalities. On the left, we're comparing flow matching, which is the usual way of training these models, with ours. You can see we are better in audio; this is what you see in orange on the right side. We're also better in images: the dashed line is the baseline and the full line is ours, and you can see we're better at images and also better at video.
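Here is a very loose sketch of the training step as the talk describes it: one network, two noise levels, a student that sees the heavily noised input and a teacher that sees the lightly noised one. Everything below is reconstructed from the description in the talk, not from the SelfFlow paper; in particular the EMA teacher update, the noise-level ranges, the feature extraction, and the loss weighting are assumptions made purely for illustration.

```python
# Loose sketch of the student/teacher, two-noise-level idea described in the talk.
# Not the actual SelfFlow implementation: the EMA teacher, noise ranges, and loss
# weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def selfflow_style_step(student, teacher, batch, cond, lambda_repr=1.0):
    """One hypothetical step: the student denoises the heavily noised input and also
    matches the teacher's features computed on a lightly noised copy of the same data."""
    b = batch.size(0)
    noise = torch.randn_like(batch)
    t_high = torch.rand(b, device=batch.device).clamp(min=0.5).view(b, 1, 1, 1)  # lots of noise
    t_low = torch.rand(b, device=batch.device).clamp(max=0.2).view(b, 1, 1, 1)   # little noise

    x_high = (1 - t_high) * batch + t_high * noise
    x_low = (1 - t_low) * batch + t_low * noise

    # Student: predict the flow-matching velocity and expose internal features.
    pred_velocity, student_feats = student(x_high, t_high.flatten(), cond, return_features=True)
    loss_gen = F.mse_loss(pred_velocity, noise - batch)

    # Teacher: a slowly updated, more stable copy of the student, run on the low-noise input.
    with torch.no_grad():
        _, teacher_feats = teacher(x_low, t_low.flatten(), cond, return_features=True)

    # Representation loss: student features should match the teacher's features.
    loss_repr = 1 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()
    return loss_gen + lambda_repr * loss_repr

def update_teacher(student, teacher, ema_decay=0.999):
    """Keep the teacher as an exponential moving average of the student (assumed here)."""
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1 - ema_decay)
```

The key difference from the previous sketch is that the "encoder" providing the representation target is the same network at a lower noise level, so scaling up the generator automatically scales the representation learner with it, and the same recipe applies whether the batch is images, audio, or video.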
So with this approach, without the external encoder that you may struggle with, you actually get better at every modality you train your model on. It also converges faster. You can see on the right that the baseline is hitting a plateau, whereas we converge faster and are still decreasing the loss. I'm pretty sure that if we were to go towards two million steps, the baseline would really plateau and not get any better, maybe even get worse, whereas we would keep going down in loss. That's the difference between the two.

If you've used FLUX in the past, or other models, to generate images, you may have noticed the text might not be perfect, or things don't really make sense. This is what you see at the top, where on the left the text is supposed to be "the future is flux", but some letters are missing, or letters are doubled, like on "world", for example, where you get two L's instead of one. Whereas with the SelfFlow approach, at the bottom, everything makes sense: the model has learned the representation, it has learned how the letters of "flux" should sit next to each other, and the same goes for the mirror and the tree. This is where we believe the future is.

On top of this, we can see some comparisons here. On the left is the baseline, where again the letters are wrong, and on the right the letters are correct. It's the same for anatomy, where on the left you have a face that's looking a bit odd, let's put it that way, and on the right is the one with SelfFlow. Again, this is not a production model where you expect the face to be perfect, but you can see the anatomy is much better than on the left.

I also want to show you some different generations, if it loads. Yes, thank you. This is also possible for video generation. The same model that has been trained on images can now also generate videos. On the left, you see the baseline; it's a weird way to do a push-up, let's put it that way. Whereas on the right, it's perfect form: the arms are correct, the hair is correct, and nothing is wrong in it. This is a way to fix all those artifacts that you usually see in generations. It's the same here for the birds, where on the left side you have the baseline, with a lot of flickering and a lot of weird things happening, because the model was using the encoder and trying to align things, whereas on the right side, with SelfFlow, it just does it: the bird is walking on the floor and there's no flickering or anything.

But it's not only about images or videos or audio. You train those jointly, so you can also generate things jointly. Here is an example of a video-and-audio sample where the idea is that we have someone saying "hello from the Black Forest". I will just play them and you will hear the difference. Again, this is not a production-ready model, so it's not perfect, but you can hopefully hear the difference. >> Oh from the back for the black from the back.
So this one was the baseline, where if you pay attention to what he's saying, you hear "hello from the..." and then some weird things at the end. >> Hello from the black forest. Hello from the black forest. >> Whereas on the right side, the prompt is really just: say "hello from the black forest", and it ends there. And yeah, this is the same model that was trained on the images we've seen before, on video, and on video and audio together.

This is cool and this is great, but what if you could also teach robots with it? This is also the same model, now trained on actions and not only on images, video, or audio, so it can also predict actions. What I'm going to show you now is a robot that is trying to pick up a can and bring it closer to us. On the left is the baseline; again you see some flickering, you see the arm doing weird things, whereas on the right, for the same number of steps, with SelfFlow the robot picks up the can directly and brings it closer. This is where we're going as well as a company. What we're really interested in is not only image or video generation, but also actions, doing more things toward physical AI.

And there is more: it's also about how we make our models faster, because that is really important for us. This is a demo of klein, which is near real-time editing. You see it on the right side; this was generated with klein on Krea, where you see the edits happening. This is not a video model; those are images that are being edited in real time. Not only are they faster, they are also at least on par with or better than other models.

And I'm almost out of time, or they added five minutes, so I don't know if I'm okay. Cool, so I'm not out of time, I can chill. So yes, here on the left we can see klein, in 4B and 9B, compared to the other open-source models. It's at least on par, while the latency is around 0.5 seconds, whereas Qwen is around 15 seconds. If you are on par and you're way faster, that is really, really good for us. Same for image-to-image: for klein 9B we are at a tiny bit more than 0.5 seconds, whereas Qwen is still around 15 seconds. And the same for multi-reference editing: we're still at less than a second, where Qwen is more towards 20 seconds. This is really important for us, because you really want to generate things in real time.

This is what we believe is the road to visual intelligence, and this is where we're going as a company. Why does it matter? As I mentioned, real-time generation: you can imagine rendering mockups as fast as you think. You don't have to wait 10 seconds, 20 seconds, a minute or two; you do things in real time and you can guide them in real time. This is also where we're going: you can think of interactive visual engines for gaming or film, where you really render a movie as you prompt it. On top of this, there are also world models. The idea behind world models, and why they matter for us, is that you train your models to understand and simulate the geometry, relationships, and interactions of the world. And you may be thinking, okay, that's cool from a research perspective, but why do we care? The reason is robots.
That's why we care. Robotics and automation are where we're taking BFL, and that's why we also want to go towards world models: to train agents in those generative worlds, to scale safe driving, and to automate manufacturing. I think that is it. Thank you very much. Do we take... yeah, I think we can take questions.

>> [Audience question, partially inaudible] ...measure...
>> You mean the data? I can't really share that; those are trade secrets. Data is very sensitive, as you can imagine, so I can't really share this. We're partnering with a lot of people for it, though.
>> How do you store the world state?
>> Well, that's what the model is learning: those representations. The model learns that in itself as the state, and it has some kind of memory. That's the way we do it.
>> Is it like a context window?
>> Yeah, it's the context window that you have. You train, you have the tokens, and then the model goes, "oh look, I've moved, here is where I should be," and then the next step.
>> Can you then run it for long, or does something...
>> Define long. What do you call long?
>> Indefinitely.
>> That I'm not sure. I mean, there's always going to be a limit, so you may have a sliding window, but that's the way we see it. Thank you.