How Transformers Finally Ate Vision – Isaac Robinson, Roboflow
Channel: aiDotEngineer
Published at: 2026-05-08
YouTube video id: VhfAVA3BG2I
Source: https://www.youtube.com/watch?v=VhfAVA3BG2I
Hi, I'm Isaac Robinson. I'm the research lead at Roboflow, and I'm here to talk to you today about how transformers finally ate vision. I'm going to start with a brief summary of the competition, then walk through the evolution of the transformer, why it ended up winning, some consequences of that, and what's next.

So, where we started: convolutional neural networks. I'm sure everyone here is aware of how these work, but to summarize: they have excellent inductive bias, motivated by how the eye works. You have a filter that you convolve against your image, and the activations light up the same way regardless of where in the image the thing is happening. That's a great inductive bias — a person in an image is a person whether they're in the upper left or the bottom right. And we build interesting hierarchical structures out of these: ResNets, and so on. This is how we did vision for a very long time.

Then comes the transformer. Again, I'm sure everyone knows how a transformer works, but to summarize: we have a set of tokens, and we run a set-to-set operation. There's no inductive bias; it's just an n² transformation over the tokens. We inject the inductive biases into the transformer. For example, in a classical autoregressive transformer, we add a causal mask to the attention matrix, and that gives us sequential modeling.

For vision, we have the vision transformer, and this is very complicated — a lot of engineering went into it. We take our image, we split it into patches (16 by 16 was the original), we add a learned positional encoding, we throw that into a transformer, and that's it (there's a minimal sketch of the patchify step below). Since attention is quadratic in the number of tokens, and an image with side length n gives us (n/16)² patches, we actually end up with compute that scales as n⁴ in resolution. And we have no inductive bias: the thing in the upper left can have a totally different activation pattern when it's in the bottom right.

So the question naturally arises: which is better, the high-inductive-bias n² convolutional network, or the no-inductive-bias n⁴ ViT? As everyone would expect, it's the ViT. How is this possible? I'm going to make the argument that it's because of massive ViT-specific pretraining, plus the fact that we get to borrow a lot of speedups and infrastructure from LLMs blowing up.

To trace this evolution: we just talked about the introduction of the ViT. Then people said, "Okay, this cannot possibly be the best thing we can do — how do we make it better?" So we go to Swin, then back to a convolution-based network, ConvNeXt, then Hiera, which I think has some really beautiful takeaways. And then, as always happens in machine learning, we come back to the simple thing that scales well: the ViT.
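Before diving in, here is a minimal PyTorch sketch of the patchify-plus-positional-embedding step described above — a strided convolution as the "cut into patches and project" operation, plus a learned positional embedding. The dimensions are illustrative, not the original paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style patchify: cut the image into 16x16 patches, linearly
    project each patch to a token, and add a learned positional embedding.
    Dimensions are illustrative, not the paper's exact configuration."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is the standard one-op trick for "split into
        # patches and linearly project each one".
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, dim): a *set* of tokens
        return x + self.pos_embed             # position is injected, not built in

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
# Global self-attention over these tokens is O(num_patches^2), and
# num_patches grows as (side / 16)^2 -- hence compute ~ side^4.
```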
So, first, Swin. We still have the patchify operation, but instead of doing global attention across all the patches at once, we take our patches and say, "Okay, we're only going to do attention within this window." If we just keep doing attention within the same windows, though, tokens in different windows will never be able to interact with each other.

So in the next layer, we shift the window a little bit. We get these back-and-forth overlapping windows, and this looks very similar to what I described with the convolution: a similar-looking operation applied to sections of the image, with overlap between the locations it gets applied to. And this actually gets us down to n², provided your window size is independent of your resolution, and it adds a locality inductive bias following the convolutional network. That seems logical; that makes sense.

Then someone said, "Okay, look, this transformer operation has no inherent relationship with vision. Let's go back to the convolutional network, take all the learnings we've had from vision transformers, put them into a convolutional network, and see what happens." That's ConvNeXt. We do a patchify operation — I think it was a 4 by 4 patch instead of a 16 by 16 patch. Our ViT was, like all transformers, self-attention, feed-forward, self-attention, feed-forward, and so on, and that self-attention is what mixes your spatial information. So for the convolutional network, what if we just have a convolution mix the spatial information and keep the same pattern: mixer, feed-forward, mixer, feed-forward, onwards? We borrow the same hierarchical structure everyone has been using for convolutional networks, throw in LayerNorm and a couple of other innovations, and that's it (the block is sketched below). Turns out that beats the ViT and Swin on the standard ImageNet benchmark. Great — finally, we have something that makes a little bit of sense.

It turns out that's not super fast, though. So Meta decided, "Okay, what are the actually important things here? ConvNeXt has a bunch of these beautiful inductive biases, and it's following the formula we got from the transformer. Let's look at what those inductive biases are actually useful for. We're going to take a transformer model with really good inductive biases, strip out the biases one at a time — we get a speedup because we no longer need all the specialized machinery for each bias — and use pretraining to learn the biases instead." That's Hiera, and I think it's a really great example of the balance between pretraining and inherent inductive bias, which ends up being how transformers ultimately win out.

The pretraining here is MAE, the masked autoencoder. For those of you who aren't familiar: you take your image, take your patches, drop a bunch of the patches, and ask the model to reconstruct what would have been in them based only on the context — very similar to BERT, for those of you who come from the language side (the masking step is sketched below). You do this at scale, and it turns out the model actually learns the inductive biases back. But you can't apply MAE to a convolutional network: how do you drop out a patch when the convolution slides uniformly across the whole image? So it's a ViT-specific pretraining technique that adds inductive bias that would otherwise be missing from the architecture. That's great, that's super interesting, and it works nicely.
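To make the ConvNeXt mixer/feed-forward pattern from a moment ago concrete, here is a simplified sketch of such a block: a depthwise convolution plays the role self-attention plays in a transformer (spatial mixing), followed by a pointwise feed-forward (channel mixing). It follows the shape of the published block but omits details like LayerScale and stochastic depth.

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Simplified ConvNeXt-style block: depthwise conv as the spatial "mixer"
    (standing in for self-attention), then a pointwise feed-forward.
    LayerScale and other details from the paper are omitted."""
    def __init__(self, dim=96):
        super().__init__()
        # 7x7 depthwise convolution: the spatial "mixer"
        self.mixer = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # pointwise expand -> GELU -> project back: the "feed-forward"
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                        # x: (B, dim, H, W)
        residual = x
        x = self.mixer(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.ffn(self.norm(x))
        return residual + x.permute(0, 3, 1, 2)  # back to channels-first

y = ConvNeXtStyleBlock()(torch.randn(1, 96, 56, 56))
print(y.shape)  # torch.Size([1, 96, 56, 56])
```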
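And the masking step at the heart of MAE is just as easy to sketch. This shows only the token bookkeeping — randomly keep 25% of the patch tokens — not the full encoder/decoder.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style masking sketch: keep a random 25% of the patch tokens.
    The encoder only ever sees the kept tokens; the decoder later inserts
    learned mask tokens at the dropped positions, and the loss is
    reconstruction error on the dropped patches only."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # one random score per token
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # tokens that survive
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx

kept, keep_idx = random_masking(torch.randn(2, 196, 768))
print(kept.shape)  # torch.Size([2, 49, 768])
# Why this is ViT-specific: dropping entries from a token *set* is a cheap
# gather; a dense convolution has no comparably cheap way to skip the
# dropped locations.
```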
You can take that to an extreme: throw in DINOv2 and DINOv3 — again, ViT-specific pretraining techniques — and at the end you don't just have inductive biases for how to process an image; you actually have really, really rich feature maps out of the box. For example, take a PCA decomposition of the feature maps produced by a DINOv3-pretrained ViT: the paws of the cats get their own colors, and the paws are traced correctly for each of the different cats; satellite imagery is decomposed in a way that's semantically meaningful. In fact, the self-supervised learning objective is catching up with the best we have from supervised learning, measured via linear probe: you freeze the features, train nothing on your data except a linear projection at the end, and you get very, very close to the best we know how to do with fully supervised learning. (There's a sketch of the PCA visualization below.)

Okay, but what about the speed issue? This is still n⁴. Well, it turns out people care a lot about attention, so the LLM world gave us these tools, FlashAttention being the biggest one. Hiera explicitly showed a speedup over the ViT at the same accuracy, but there's a note in the paper saying the comparison was measured without FlashAttention. Add FlashAttention back in, and suddenly the gap doesn't really matter. You're back to this very silly n⁴ thing that benefits from the ViT-specific pretraining methods, and that's it — the ViT has kind of won.
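As for the PCA visualization mentioned above, here is roughly how such a figure is typically produced. This sketch uses the public DINOv2 torch.hub entry point as a stand-in (DINOv3 weights are distributed separately), assumes the `forward_features` API from the facebookresearch/dinov2 repo, and feeds random noise in place of a real normalized image.

```python
import torch

# DINOv2 ViT-S/14 from torch.hub as a stand-in (patch size 14, embed dim 384):
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

img = torch.randn(1, 3, 224, 224)  # stand-in for a real, ImageNet-normalized image
with torch.no_grad():
    # forward_features returns a dict in the dinov2 repo; the patch tokens
    # live under "x_norm_patchtokens": (1, 256, 384) for a 224x224 input.
    tokens = model.forward_features(img)["x_norm_patchtokens"][0]

# Project every patch token onto its first three principal components and
# read them as RGB: semantically similar regions end up with similar colors.
_, _, v = torch.pca_lowrank(tokens, q=3)
rgb = tokens @ v                                            # (256, 3)
rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-6)
rgb = rgb.reshape(16, 16, 3)                                # false-color feature map
```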
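The FlashAttention point is worth seeing in code: in PyTorch 2.x, `torch.nn.functional.scaled_dot_product_attention` is a drop-in replacement for naive attention that dispatches to a FlashAttention-style fused kernel when hardware, dtype, and shapes allow — the math is identical, but the N × N attention matrix is never materialized.

```python
import torch
import torch.nn.functional as F

B, heads, N, d = 1, 12, 196, 64   # 196 patch tokens, ViT-Base-ish head layout
q, k, v = (torch.randn(B, heads, N, d) for _ in range(3))

# Naive attention: materializes the full (N x N) attention matrix.
# This is exactly where the n^4-in-resolution cost lives.
naive = (q @ k.transpose(-2, -1) / d**0.5).softmax(dim=-1) @ v

# Fused attention: identical math, but PyTorch can dispatch to a
# FlashAttention-style kernel that never materializes the N x N matrix.
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-4))  # True
```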
So, this was the evolution of the backbones — how the n⁴ thing ended up beating everyone that tried to beat it. What does this mean in practice? SAM is a very famous series of models; from my perspective, it's one of the most important foundation model series in vision, period. And we actually see the same pattern in it. Look at the backbones underlying SAM, MobileSAM, SAM 2, and SAM 3. SAM is a ViT trained with MAE. Then someone says, "Okay, surely this cannot be the best we can do," so MobileSAM replaces the ViT backbone with a specialized convolution-transformer hybrid called TinyViT. SAM 2 uses Hiera with MAE pretraining. And SAM 3 just gives up on the architecture ablation and says, "Okay, we've got this massively pretrained backbone; let's just stick it in. That's the best we can do."

That's all good and fine, but where does it actually leave us? It's really expensive if we're relying on these huge pretraining strategies to recover the performance lost because the architecture isn't biased toward the subject at all. It means we'd have to spend a huge amount of money every time we want to do a deployment, and no deployment flexibility means one-size-fits-all models. SAM 3 is this very, very powerful thing, but it's also 800 million parameters, and it takes 300 milliseconds to run on a T4 GPU. It's not actually usable in a lot of cases, especially because vision has historically been focused on very low-power edge devices and resource-constrained deployment scenarios.

So at Roboflow, what we've done is ask: how do we take these fixed foundation models and transform them into something that has flexibility? We introduced a dataset, RF100-VL, that measures how well foundation models transfer to diverse downstream tasks for object detection, one of the canonical vision-centric tasks. And we see about a 40x speedup at the same accuracy versus fine-tuning SAM 3, and for merely a 15x speedup, we get a meaningful accuracy improvement.

This combination — the huge advancement in foundation model pretraining, strong transfer methods, and an actual ability to deploy in hardware-constrained environments — ends up being the final nail in the coffin for the classical convolution-based methods. At the time we published RF-DETR, those were the best real-time instance segmentation models, and we outperformed them in a meaningful way. All of our models on that line actually use the same foundation model; we just modify it using neural architecture search, so that we generate an entire family of high-performance models in one go. To do this, we introduce a bunch of flexible knobs, all drop-in compatible with the existing foundation model infrastructure, and by mixing and matching them in a way that depends on target data and target hardware, we resolve the issue that these foundation models don't have deployment flexibility. (A hypothetical sketch of that kind of latency-aware search appears at the end.)

So: massive ViT-specific pretraining, plus speedups from LLMs, plus pretraining-compatible neural architecture search — and that's it.

Audience question: do we have architectures that support video, or video-plus-image-plus-text pretraining, already? Is someone working on that?

There are a lot of people working on a huge number of different combinations. I actually think SAM 3 is a good example of that. In terms of vision-specific video processing, it handles video from the perspective of tracking objects through it. They do massive-scale pretraining — the Perception Encoder pretraining for the backbone — and then a huge amount of downstream pretraining, or I guess you'd just call it training at that point.

Audience question: what about JEPA and V-JEPA?

Yeah, those are another variety of foundation model. In terms of single-image pretraining, JEPA doesn't seem to outperform a lot of the other approaches. Video JEPA, I haven't seen anyone use meaningfully in a video context for downstream transfer yet, but we'll see.

Any other questions? Cool. Okay.
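One last note on the architecture-search idea referenced above. The talk doesn't spell out Roboflow's actual knobs or search procedure, so the following is a purely hypothetical sketch of the general recipe: enumerate a few drop-in knobs, probe latency on the target device, and keep the configurations that fit the budget. Every name, knob, and the `make_model` factory here is invented for illustration.

```python
import itertools
import time
import torch

# Hypothetical knobs -- NOT Roboflow's actual search space. The constraint
# from the talk is that every combination must stay drop-in compatible with
# the pretrained foundation-model weights.
SEARCH_SPACE = {
    "depth": [6, 9, 12],            # how many pretrained blocks to keep
    "resolution": [224, 336, 448],  # input size fed to the backbone
}

def latency_ms(model, res, runs=20):
    """Crude latency probe on the current device: median over `runs` passes."""
    x = torch.randn(1, 3, res, res)
    with torch.no_grad():
        samples = []
        for _ in range(runs):
            t0 = time.perf_counter()
            model(x)
            samples.append((time.perf_counter() - t0) * 1e3)
    return sorted(samples)[len(samples) // 2]

def search(make_model, budget_ms):
    """Keep every configuration that fits the target-hardware latency budget;
    the keepers would then be fine-tuned and ranked on the target data."""
    keepers = []
    for depth, res in itertools.product(*SEARCH_SPACE.values()):
        model = make_model(depth=depth).eval()  # make_model: hypothetical factory
        ms = latency_ms(model, res)
        if ms <= budget_ms:
            keepers.append({"depth": depth, "resolution": res, "ms": ms})
    return keepers

# Toy usage with a dummy shape-preserving "backbone" (purely illustrative):
dummy = lambda depth: torch.nn.Sequential(
    *[torch.nn.Conv2d(3, 3, 3, padding=1) for _ in range(depth)])
print(search(dummy, budget_ms=50.0))
```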