Vision AI in 2025 — Peter Robicheaux, Roboflow

Channel: aiDotEngineer

Published at: 2025-08-03

YouTube video id: IQc05eCvNYE

Source: https://www.youtube.com/watch?v=IQc05eCvNYE

I'm going to give a quick presentation about the state of the union in vision AI. I'm Peter Robicheaux, the ML lead at Roboflow, which is a platform for building and deploying vision models.

A lot of people are really interested in LLMs these days, so I want to pitch why computer vision matters. If you think about systems that interact with the real world, they have to use vision as one of their primary inputs, because the built world is organized around vision as a fundamental primitive. And there's a big gap between where human vision is and where computer vision is, arguably a bigger gap than currently exists between human speech and computer speech.
Computer vision also has its own set of problems that are very distinct from the ones LLMs need to solve. Latency usually matters: if you want to perceive motion, you have to run your processing at multiple frames per second. And you usually want to run at the edge; you can't have one big hub where you do all your computation, because you would introduce too much latency to make decisions based on that computation.
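To make the real-time constraint concrete, here is a minimal sketch of checking per-frame latency for a vision pipeline; the model and frame source are placeholders, not anything from the talk:

```python
import time

def measure_fps(model, frames, warmup=5):
    """Rough per-frame latency check for a real-time vision pipeline.

    `model` is any callable that maps a decoded frame to predictions;
    `frames` is an iterable of frames (e.g. numpy arrays from a camera).
    """
    latencies = []
    for i, frame in enumerate(frames):
        start = time.perf_counter()
        _ = model(frame)                       # run inference on this frame
        elapsed = time.perf_counter() - start
        if i >= warmup:                        # ignore the first few frames (warmup)
            latencies.append(elapsed)
    mean_latency = sum(latencies) / len(latencies)
    print(f"mean latency: {mean_latency * 1000:.1f} ms (~{1 / mean_latency:.1f} FPS)")
```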
I gave a version of this talk on Latent Space's podcast at NeurIPS, and in retrospect I think we identified a few problems with the field of vision in 2024. One of them is that evals are saturated. Vision evals like ImageNet and COCO are mostly pattern matching: they measure your ability to match patterns and don't require much visual intelligence to solve. Consequently, I think vision models don't leverage big pre-training the way that language models do. Right now you can take a language model, unleash it on the internet, and get something incredibly smart. Some of the best vision models still don't work that way, and since the evals don't demand much visual intelligence, there's kind of no incentive to change that.
There are sort of two dimensions here. One is that vision doesn't leverage big pre-training. If you're building an application with language right now, you probably want to use the smartest model available to get an embedding that works really well for you. There are downstream applications that make really good use of the pre-training and the embeddings they get from large language models, but there aren't really good vision models that can leverage embeddings in the same way. And the corollary is that the quality of big pre-trained models just isn't the same in vision as it is in language. So my underlying conclusion is that vision models aren't smart. That's the takeaway.
And I can prove it to you. Last year, when Claude 3.5 was current, you could give it an image of a watch, ask it what time it is, and it would just guess a random time. That's because the model has a good conceptual, abstract idea of what a clock or a watch is, but when it comes to actually identifying the location of the watch hands and finding the numbers on the watch, it's hopeless. Updated for Claude 4: it still has no idea what time it is. And this is an especially egregious failure, because 10:10 is the stock time shown on basically all watches. The fact that it couldn't even get the most common time is pretty telling.
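A sketch of reproducing this watch probe against Claude via the Anthropic Python SDK; the image path and model name are illustrative assumptions, not part of the talk:

```python
import base64
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("watch.jpg", "rb") as f:  # any photo of an analog watch
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: use whichever Claude model you have access to
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
            {"type": "text", "text": "What time does this watch show?"},
        ],
    }],
)
print(message.content[0].text)
```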
There's a really cool dataset that tries to measure this inability of LLMs to see, called MMVP. You can see an example here: they ask a question that seems incredibly obvious. In this case they ask the model, which is GPT-4o, which direction the school bus is facing: are we seeing the front or the back of the school bus? The model gets it completely wrong and then hallucinates details to support its claim. Again, I think this is evidence that large language models, which are maybe the most intelligent models we have, cannot see, and that this is due to a lack of visual features they can perceive with.

The way this dataset was created is that they found pairs of images that are close in CLIP space but far apart in DINOv2 space. CLIP is a vision-language model that was contrastively trained on the whole internet; DINOv2 is a pure vision model that was trained in a self-supervised way on the whole internet. What this shows is that CLIP is not discriminative enough to tell these two images apart: according to CLIP, the two images look basically the same. And what that points to is a failure in vision-language pre-training.
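A rough sketch of the selection criterion described here, pairing a CLIP image encoder with DINOv2 and flagging pairs that CLIP sees as near-duplicates; the model variants, example filenames, and similarity thresholds are illustrative assumptions rather than the exact MMVP recipe:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

# CLIP (vision-language, contrastively trained) vs. DINOv2 (vision-only, self-supervised)
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()
dino_tf = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embeddings(path):
    img = Image.open(path).convert("RGB")
    c = clip.get_image_features(**clip_proc(images=img, return_tensors="pt"))
    d = dino(dino_tf(img).unsqueeze(0))            # global (CLS-token) embedding
    return F.normalize(c[0], dim=-1), F.normalize(d[0], dim=-1)

c1, d1 = embeddings("dog_facing_camera.jpg")       # hypothetical example images
c2, d2 = embeddings("dog_facing_away.jpg")
clip_sim, dino_sim = (c1 @ c2).item(), (d1 @ d2).item()
# A "CLIP-blind" pair: near-identical to CLIP, clearly different to DINOv2.
# Thresholds below are illustrative, not the paper's.
print(f"CLIP sim={clip_sim:.3f}  DINOv2 sim={dino_sim:.3f}",
      "-> CLIP-blind candidate" if clip_sim > 0.95 and dino_sim < 0.6 else "")
```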
The way CLIP is trained is basically that you build a big dataset of captioned images, scramble the captions and the images, and ask the model to pair each image back up with its caption. But if you go back and look at these two images, what caption would distinguish them? It's the peculiar pose of the dog: in one image it's facing the camera and in the other it's facing away. Those are details that don't make it into the caption. So if your loss function can't tell these two images apart, why would your model be able to?
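For reference, a minimal sketch of the symmetric contrastive (InfoNCE) objective used in CLIP-style training, which only penalizes mismatched image/caption pairings within a batch:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style training.

    image_emb, text_emb: [batch, dim] embeddings where row i of each is a
    matched image/caption pair; every other row in the batch is a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # pairwise similarities
    targets = torch.arange(len(logits))             # the diagonal is the true pairing
    loss_i = F.cross_entropy(logits, targets)       # match each image to its caption
    loss_t = F.cross_entropy(logits.T, targets)     # match each caption to its image
    return (loss_i + loss_t) / 2
```

If two images share captions that read the same, nothing in this objective pushes their embeddings apart, which is the failure mode the MMVP pairs expose.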
So the claim is that vision-only pre-training kind of works. DINOv2 is a really cool model: what you're seeing right now is a visualization of its PCA features, which were self-discovered by pre-training on the whole internet. What's really cool is that not only does it find the mask of the dog, which is sort of easy because the dog is highly contrastive with the green background, it also finds the segments of the dog, and it even finds analogous segments. If you look at the principal components and compare the legs of a dog, they land in the same part of feature space as the legs of a human. So there's this big open question: how do we get vision features that are well aligned with language features, usable by VLMs, and that actually have visual fidelity?
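A small sketch of producing a PCA visualization like the one being shown, using the public DINOv2 checkpoints from torch.hub; the image file and resolution are illustrative assumptions:

```python
import torch
from PIL import Image
from sklearn.decomposition import PCA
from torchvision import transforms
import matplotlib.pyplot as plt

dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
tf = transforms.Compose([
    transforms.Resize((448, 448)),                     # multiple of the 14 px patch size
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

img = tf(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)   # hypothetical image
with torch.no_grad():
    feats = dino.forward_features(img)["x_norm_patchtokens"]  # [1, 32*32, 384] patch tokens
patches = feats[0].numpy()

pca = PCA(n_components=3)                              # top-3 components mapped to RGB
rgb = pca.fit_transform(patches).reshape(32, 32, 3)
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())      # rescale to [0, 1] for display
plt.imshow(rgb); plt.axis("off"); plt.show()
```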
So that's part of the story. The other question that needs to be answered is: given that we have some sort of semi-working large-scale pre-training for vision models, why aren't we leveraging it? I would argue that, at least in object detection, the answer mostly comes down to the distinction between convolutional models and transformers. This is from LW-DETR, which is one of the top-performing detection transformers that currently exists. If you look at this graph, YOLOv8, a convolutional object detector meant for the edge, gains something like 0.2 mAP (the main accuracy metric for object detectors) from pre-training on Objects365. So pre-training on Objects365, a big dataset of about 1.6 million images, yields almost no performance improvement on COCO. Whereas for LW-DETR, a transformer-based model, if you compare the mAP column without pre-training to the mAP column with pre-training, you're getting something like five points of improvement across the board, sometimes even seven, which is a gigantic amount. So while the language world has shown that transformers can leverage big pre-training and yield great results, the vision world is only now catching up. You can see this from the scale of what counts as big pre-training in the image world: Objects365, at 1.6 million images, is considered a large pre-training dataset, but that would be a tiny challenge dataset for undergrads in the LLM world.
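As an illustration of the with/without pre-training comparison, here is a sketch using Ultralytics YOLOv8; note that the public checkpoints are COCO-pretrained rather than Objects365-pretrained, and the dataset path is a placeholder, so this only mirrors the shape of the experiment, not the exact one in the LW-DETR paper:

```python
from ultralytics import YOLO  # pip install ultralytics

# Hypothetical downstream dataset in Ultralytics YAML format
DATA = "my_downstream_dataset.yaml"

# From scratch: random weights built from the architecture config
scratch = YOLO("yolov8n.yaml")
scratch.train(data=DATA, epochs=100, imgsz=640)
print("from scratch  mAP50-95:", scratch.val().box.map)

# With pre-trained weights (COCO checkpoint here; the paper's comparison used Objects365)
pretrained = YOLO("yolov8n.pt")
pretrained.train(data=DATA, epochs=100, imgsz=640)
print("pre-trained   mAP50-95:", pretrained.val().box.map)
```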
So I want to announce Roboflow's new model, RF-DETR, which leverages the DINOv2 pre-trained backbone and uses it in a real-time object detection context. This is our answer to the hole we see in the field: why aren't we leveraging big pre-training for vision models? Here are some of the metrics. Basically, we took LW-DETR and swapped its backbone out for the DINOv2 backbone, and we get a decent improvement on COCO. We're still not state of the art on COCO compared to D-FINE, which is the current SOTA; we're second.
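A minimal usage sketch, assuming the open-source rfdetr package exposes an RFDETRBase class with a predict method; check Roboflow's repository for the actual API, since the names here may differ:

```python
from PIL import Image
from rfdetr import RFDETRBase  # pip install rfdetr (assumed package name)

model = RFDETRBase()                              # assumed to download pre-trained weights
image = Image.open("example.jpg")                 # hypothetical image
detections = model.predict(image, threshold=0.5)  # boxes, class ids, confidences
print(detections)
```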
But what's really interesting is this other dataset, RF100-VL, which we created to measure the domain adaptability of the model, and there you see massive gains from using the DINOv2 pre-trained backbone. That points to the fact that, number one, COCO is too easily solvable: it basically has common classes like humans and coffee cups, so it's not a good measure of the intelligence of your model. The way you optimize COCO is more about really nailing the precise location of a bounding box, having good iterative refinement of the locations you're guessing. We posit that RF100-VL, this new dataset, is a better measure of the intelligence of a vision model.

So we're introducing RF100-VL, a collection of 100 different object detection datasets pulled from our open-source collection. We have something like 750,000 datasets on Roboflow Universe, and we hand-curated the 100 best by some metric: we sorted by community engagement and tried to find very difficult domains. You'll notice, for instance, that we have camera poses that differ from what's common in COCO, like aerial camera positioning, which requires your model to understand different views of an object in order to do well. We also have different visual imaging domains: microscopes, X-rays, and that sort of thing. So we think this dataset can measure the richness of the features learned by object detectors in a much more comprehensive way than COCO.
The other fun thing about this is that it's also a vision-language benchmark, so we were able to benchmark a bunch of different vision-language models on RF100-VL, asking them things that require contextualizing the class name within the dataset, like where an action is happening. For instance, in the top left we have the class 'block', which represents an action, a volleyball block, but you have to be smart enough to contextualize that word embedding of 'block' within the context of volleyball to be able to detect it. Same thing with this thunderbolt-type defect in the cable here: if you just ask a dumb vision-language model to detect thunderbolts in the image, it will find nothing, but if it contextualizes the class as a cable defect, it will find a lot more. The benchmark also increases the breadth of classes. If you only look at COCO, you're basically asking your model, hey, can you find a dog, can you find a cat? But can you find fibrosis? Now your model needs a lot more information about the world to solve that problem. Same thing with different imaging domains.
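One way to see the effect of contextualizing a class name is with an open-vocabulary detector such as Grounding DINO; this sketch uses the Hugging Face transformers port, and the model id, image, prompts, and thresholds are all illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = GroundingDinoForObjectDetection.from_pretrained(model_id).eval()

image = Image.open("volleyball.jpg")                  # hypothetical image
# Bare class name vs. a prompt that contextualizes it in the dataset's domain
for prompt in ["block.", "a volleyball player blocking at the net."]:
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )[0]
    print(prompt, "->", len(results["boxes"]), "boxes")
```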
Since it's a vision-language benchmark, we also provide visual descriptions and annotator instructions on how to find the objects present in each image. And basically what we found is that if you take a YOLOv8 model and train it on just 10 examples per class, it does better than Qwen2.5-VL 72B, a state-of-the-art, gigantic vision-language model.
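A sketch of building a 10-examples-per-class split from a COCO-format annotation file; RF100-VL ships canonical splits, so this is only an illustration of the idea (it samples 10 images per class, and the file paths are placeholders):

```python
import json
import random
from collections import defaultdict

# Hypothetical COCO-format annotation file for one dataset in the benchmark
with open("train/_annotations.coco.json") as f:
    coco = json.load(f)

# Group image ids by category
images_by_class = defaultdict(set)
for ann in coco["annotations"]:
    images_by_class[ann["category_id"]].add(ann["image_id"])

# Sample up to 10 images per class (the real splits are fixed, not re-sampled)
random.seed(0)
keep_ids = set()
for cat_id, image_ids in images_by_class.items():
    ids = sorted(image_ids)
    keep_ids.update(random.sample(ids, min(10, len(ids))))

fewshot = {
    "images": [im for im in coco["images"] if im["id"] in keep_ids],
    "annotations": [a for a in coco["annotations"] if a["image_id"] in keep_ids],
    "categories": coco["categories"],
}
with open("train_10shot.json", "w") as f:
    json.dump(fewshot, f)
```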
So vision-language models right now are really good at generalizing out of distribution in the linguistic domain, but absolutely hopeless when it comes to generalizing in the visual domain. We hope this benchmark can drive that part of the research and make sure the visual side of VLMs doesn't get left behind. And basically, by leveraging stronger embeddings, a DETR does much better on RF100-VL than it does by leveraging embeddings learned only on Objects365, which makes sense. And that's my talk. Thank you.
Yes, can you fine-tune it and run it at the edge? Oh, yeah. It's like 20 million parameters at the smallest size. Cool. Any other questions?

[Audience question, inaudible]
Yeah, it's publicly available. If you go to rf100vl.org, you can find our arXiv paper as well as the code utilities to help download the dataset. It's also on Hugging Face somewhere.
Yeah. So Roboflow has a pretty unique strategy when it comes to our platform: we make it freely available to basically all researchers. So we have a ton of people who use our platform to label medical data and biological data for their own papers and their own research, and our only ask is that they then contribute that data back to the community and make it open source. A lot of this data comes from papers published in Nature and that sort of thing.
[Audience question, partly inaudible]
Yeah. So the dataset is measuring performance across a bunch of different imaging modalities, or predictive modalities, I guess. I think the most interesting track of the dataset is the few-shot track. We've constructed canonical 10-shot splits: we provide the model the class name, annotator instructions on how to find that class, and 10 visual examples per class. And basically no model exists that can leverage all three of those things and get a higher mAP than if you just deleted one of them. I see that as one of the big shortcomings of vision-language models right now.
of multiple
there's like more generous kind of
things like
specialist like
Yeah, so currently the specialists are by far the best. We benchmarked Grounding DINO specifically, both zero-shot and fine-tuned. Zero-shot Grounding DINO got something like 19 mAP on average on RF100-VL, which is kind of good, kind of bad. If you take a YOLOv8 nano and train it from scratch on the 10-shot examples, which is obviously not a lot of data, it gets something like 25 mAP. So being worse than a YOLO fine-tuned from scratch is sort of bad. But if you then fine-tune Grounding DINO with federated loss, that's the highest-performing model we have on the dataset. That being said, I think the point of the dataset should be, hey, you should be able to leverage the annotator instructions, the 10-shot examples, and the class names and come up with something more accurate, which requires a generalist model. But okay, I think I'm super over time. So yeah, thanks for the questions. Cool. Thanks, everyone.