Vision AI in 2025 — Peter Robicheaux, Roboflow
Channel: aiDotEngineer
Published at: 2025-08-03
YouTube video id: IQc05eCvNYE
Source: https://www.youtube.com/watch?v=IQc05eCvNYE
I'm going to give a quick presentation about the state of the union in AI vision. I'm Peter Robicheaux, the ML lead at Roboflow, which is a platform for building and deploying vision models. A lot of people are really interested in LLMs these days, so I'm going to pitch why computer vision matters. If you think about systems that interact with the real world, they have to use vision as one of their primary inputs, because the built world is designed around vision as a fundamental primitive. And there's a big gap between where human vision is and where computer vision is; I'd argue a bigger gap than currently exists between human speech and computer speech. Computer vision also has its own set of problems that are very distinct from the problems LLMs need to solve. Latency usually matters: if you want to perceive motion, you have to process multiple frames per second. And you usually want to run at the edge; you can't have one big hub where you do all your computation, because that would introduce too much latency for decisions based on that computation. I gave a version of this talk on Latent Space's podcast at NeurIPS, and in retrospect I think we identified a few problems with the field of vision in 2024. One of them is that evals are saturated. Vision evals like ImageNet and COCO are mostly pattern matching: they measure your ability to match patterns and don't really require visual intelligence to solve. Consequently, vision models don't leverage big pre-training the way language models do. Right now you can take a language model, unleash it on the internet, and get something incredibly smart. But some of the best vision models are still small and convolutional, and since the evals don't require much intelligence to solve, there's kind of no incentive to change that.
So there are two dimensions here. One is that vision doesn't leverage big pre-training. If you're building an application with language right now, you probably want to use the smartest model to get an embedding that works really well for you. There are downstream applications that make really good use of the pre-training and the embeddings they get from large language models, but there aren't really good vision models that can leverage embeddings the same way. And the corollary is that the quality of big pre-trained models just isn't the same in vision as it is in language. So my underlying conclusion is that vision models aren't smart. That's the takeaway, and I can prove it to you. Last year, when Claude 3.5 was current, you could give it an image of a watch, ask it what time it is, and it would just guess a random time. That's because the model has a good conceptual, abstract idea of what a clock or a watch is, but when it comes to actually identifying the location of the watch hands and finding the numbers on the watch, it's hopeless. And updated for Claude 4: it still has no idea what time it is. This is an especially egregious failure because 10:10 is the stock time shown on basically all watches, so the fact that it couldn't even get the most common time is pretty telling. There's a really cool dataset that tries to measure this inability of LLMs to see, called MMVP. You can see an example here: they ask a question that seems incredibly obvious. In this case they ask the model, GPT-4o, which direction the school bus is facing: are we seeing the front or the back of the school bus? And the model gets it completely wrong and then hallucinates details to support its claim.
And again, I think this is evidence that large language models, which are maybe the most intelligent models we have, cannot see, and that's due to a lack of visual features they can perceive with. The way this dataset was created is that they found pairs of images that are close in CLIP space but far apart in DINOv2 space. CLIP is a vision-language model that was contrastively trained on the whole internet; DINOv2 is a pure vision model that was trained in a self-supervised way on the whole internet. What this shows is that CLIP is not discriminative enough to tell these two images apart: according to CLIP, the two images look basically the same. And that points to a failure in vision-language pre-training. The way CLIP is trained is basically that you assemble a big dataset of captioned images, scramble the captions and the images, and ask the model to pair each image with its caption. But if you go back and look at these two images, what caption would distinguish them? It's the peculiar pose of the dog: in one image it's facing the camera and in the other it's facing away. Those are details that aren't included in the caption. So if your loss function can't tell these two images apart, why would your model be able to? The claim, then, is that vision-only pre-training kind of works. DINOv2 is a really cool model: what you're seeing right now is a visualization of its PCA features, which were self-discovered by pre-training on the whole internet. What's really cool is that not only does it find the mask of the dog, which is sort of easy because it's highly contrastive with the green background, it also finds the segments of the dog, and it even finds analogous segments.
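To make the caption-pairing argument concrete, here is a minimal numpy sketch of the symmetric contrastive (InfoNCE-style) objective that CLIP-style training uses. The function name, temperature value, and toy shapes are illustrative, not taken from the talk or the CLIP paper; the point is that each image's positive is its own caption, so two images whose captions are indistinguishable produce no training signal to tell them apart.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss as used in CLIP-style training (a sketch).

    Each image's caption is the positive; every other caption in the
    batch is a negative. The loss can only separate images whose
    captions differ, which is the failure mode discussed above.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (n, n) similarity matrix

    # Cross-entropy with the diagonal (the true pairing) as the target,
    # averaged over the image->text and text->image directions.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent(logits) + xent(logits.T))
```

With matched image/text embeddings the loss is near zero; shuffle the pairing and it blows up, which is exactly the signal the objective optimizes, and all it optimizes.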
So if you look at these principal components, the legs of a dog end up in the same sort of feature space as the legs of a human. And so there's this big open question: how do we get vision features that are well aligned with language features and usable by VLMs, features that don't suck and have real visual fidelity? Cool. So that's part of the story. The other question that needs to be answered is: given that we have some sort of semi-working large-scale pre-training for vision models, why aren't we leveraging those models? I would answer that, at least in the object detection space, the answer is mostly the distinction between convolutional models and transformers. This is from LW-DETR, one of the top-performing detection transformers that currently exists. If you look at this graph: YOLOv8, a convolutional object detector meant for the edge, gains only about 0.2 mAP from pre-training on Objects365 (mAP being the main accuracy metric for object detectors). So pre-training on Objects365, a big 1.6-million-image dataset, yields almost no performance improvement on COCO. Whereas for LW-DETR, a transformer-based model, if you compare the mAP column without pre-training to the column with pre-training, you're getting around five mAP of improvement across the board, sometimes even seven mAP, which is a gigantic amount. So basically, while the language world has shown that transformers can leverage big pre-training and yield great results, the vision world is just now catching up. And you can see this from the scale of what counts as big pre-training in the image world.
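The DINOv2 feature visualization described above is usually produced by running PCA over the per-patch features and mapping the top three components to RGB. A minimal numpy sketch of that trick, with illustrative shapes and names (with a real model, the input would be DINOv2's patch tokens, reshaped back to the patch grid for display):

```python
import numpy as np

def pca_rgb(patch_features, n_components=3):
    """Project per-patch features onto their top principal components
    and rescale to [0, 1], the standard trick for visualizing DINOv2
    features as an RGB image. A sketch; shapes and names are mine.

    patch_features: (num_patches, dim) array, one row per ViT patch.
    Returns: (num_patches, n_components) array scaled into [0, 1].
    """
    X = patch_features - patch_features.mean(axis=0)   # center features
    # Right singular vectors of the centered matrix are the PCA axes
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ vt[:n_components].T                     # top-k projection
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)              # scale per channel
```

Because patches from analogous parts (a dog's legs, a human's legs) have nearby features, they land on nearby principal-component values and render in similar colors, which is what the talk's visualization shows.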
Pre-training on Objects365, with 1.6 million images, is considered large-scale in vision; that would be a tiny challenge dataset for undergrads in the LLM world. So I want to announce Roboflow's new model, RF-DETR, which leverages the DINOv2 pre-trained backbone and uses it in a real-time object detection context. This is our answer to the hole we see in the field: why aren't we leveraging big pre-training for vision models? Here are some of the metrics. Basically, we took the LW-DETR architecture and swapped its backbone out for DINOv2, and we get a decent improvement on COCO. We're still not state of the art on COCO compared to D-FINE, the current SOTA; we're second. But what I think is really interesting is this other dataset, RF100-VL, which we created to measure the domain adaptability of the model, and there you can see massive gains from using the DINOv2 pre-trained backbone. That points to the fact that, number one, COCO is too easily solvable. It has common classes like humans and coffee cups, so it's not a good measure of the intelligence of your model; rather, the way you optimize COCO is by really nailing the precise location of a bounding box, by having good iterative refinement of the locations you're guessing. We posit that RF100-VL, this new dataset, is a better measure of the intelligence of a vision model. So we're introducing RF100-VL, a collection of 100 different object detection datasets pulled from our open-source collection. We have something like 750,000 datasets on Roboflow Universe, and we hand-curated the 100 best, by some metric.
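Since mAP rewards precise localization, the per-box quantity underneath it is intersection-over-union: predictions are matched to ground truth at IoU thresholds (COCO averages over 0.50:0.95), so small localization errors move the metric directly. A minimal sketch, assuming (x1, y1, x2, y2) box format; this is illustrative, not Roboflow's evaluation code.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format.

    mAP matches predictions to ground truth at IoU thresholds, so
    "really nailing the precise location of a bounding box" is exactly
    what raises this number.
    """
    # Overlap rectangle: max of the top-left corners, min of the bottom-right
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0
```

For example, two identical boxes give IoU 1.0, while a box shifted by half its width against the original gives 1/3, enough to fall below COCO's stricter matching thresholds.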
We sorted by community engagement and tried to find very difficult domains. You'll notice, for instance, that we have camera poses that differ from what's common in COCO, like aerial camera positioning, which requires your model to understand different views of an object in order to do well. We have different visual imaging domains too: microscopes, X-rays, and things like that. So we think this dataset can measure the richness of the features learned by object detectors much more comprehensively than COCO can. The other fun thing is that it's a vision-language benchmark. We were able to benchmark a bunch of different models on RF100-VL, asking them things like contextualizing the class name within the dataset: where is this action happening, for instance? If you look at the top left, we have the class "block," which represents an action, a volleyball block, but you have to be smart enough to contextualize the word embedding of "block" within the context of volleyball to detect it. Same thing with this thunderbolt-type defect in this cable: if you just ask a naive vision-language model to detect thunderbolts in the image, it'll find nothing, but if it contextualizes the term as a cable defect, it'll be able to find more. It also increases the breadth of classes. If you only look at COCO, you're basically asking your model: can you find a dog? Can you find a cat? But can you find fibrosis? Now your model needs a lot more information about the world to solve the problem. Same with different imaging domains. So it is a vision-language benchmark: we also have visual descriptions and instructions on how to find the objects present in each image.
And basically what we found is that if you take a YOLOv8 model and train it on about 10 examples per class, it does better than Qwen2.5-VL 72B, a state-of-the-art, gigantic vision-language model. So vision-language models are really good right now at generalizing out of distribution in the linguistic domain, but absolutely hopeless when it comes to generalizing in the visual domain. We hope this benchmark can drive that part of the research and make sure the visual side of VLMs doesn't get left behind. And yeah, basically, by leveraging stronger embeddings, a DETR does much better on RF100-VL than by leveraging embeddings learned only on Objects365, which makes sense. And that's my talk. Thank you. [Audience question: can you fine-tune it and run it at the edge?] Yeah, it's about 20 million parameters at the smallest size. Cool, any other questions? [Audience question: is the dataset available?] Yeah, it's publicly available. If you go to rf100vl.org, you can find our arXiv paper as well as the code utilities to help download the dataset. It's also on Hugging Face somewhere. [Audience question about where the data comes from] So, Roboflow has a pretty unique strategy when it comes to our platform: we make it freely available to all researchers, basically. So we have a ton of people who use our platform to label medical data and biological data for their own papers and their own research, and our only ask is that they then contribute that data back to the community and make it open source. A lot of this data comes from papers cited in Nature and the like. [Audience question about correspondence between modalities] So the dataset is kind of measuring performance across a bunch of different imaging modalities, or predictive modalities, I guess. I think the most interesting track of the dataset is the few-shot track.
So basically we've constructed canonical 10-shot splits: we provide the model the class name, annotator instructions on how to find that class, and 10 visual examples per class. And basically no model exists that can leverage all three of those things and get higher mAP than if you just deleted one of them. I see that as one of the big shortcomings of vision-language models right now. In terms of generalists versus specialists: currently the specialists are by far the best. We benchmarked Grounding DINO specifically, both zero-shot and fine-tuned. Zero-shot Grounding DINO got about 19 mAP averaged over RF100-VL, which is kind of good, kind of bad. If you take a YOLOv8 nano and train it from scratch on the 10-shot examples, which is obviously not a lot of data, it gets something like 25 mAP. So to be worse than fine-tuning a YOLO from scratch is sort of bad. But if you then fine-tune the Grounding DINO with federated loss, that's the highest-performing model we have on the dataset. That being said, I think the point of the dataset is that you should be able to leverage the annotator instructions, the 10-shot examples, and the class names and come up with something more accurate, which requires a generalist model. But okay, I think I'm super over time. So yeah, thanks for the questions. Cool. Thanks everyone.
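The canonical 10-shot splits described above can be sketched as a deterministic sampler that keeps up to k annotated examples per class. The annotation schema here ((image_id, class_name) pairs) and the function name are illustrative assumptions, not the actual RF100-VL format.

```python
import random
from collections import defaultdict

def make_kshot_split(annotations, k=10, seed=0):
    """Build a canonical k-shot split: keep (up to) k examples per class.

    annotations: list of (image_id, class_name) pairs; this schema is
    illustrative, not the real RF100-VL format. A fixed seed and sorted
    class order make the split reproducible, which is what makes the
    few-shot track "canonical": every model sees the same 10 shots.
    """
    by_class = defaultdict(list)
    for image_id, class_name in annotations:
        by_class[class_name].append(image_id)
    rng = random.Random(seed)              # fixed seed => same split every run
    split = {}
    for class_name, image_ids in sorted(by_class.items()):
        rng.shuffle(image_ids)             # random but reproducible order
        split[class_name] = image_ids[:k]  # keep at most k shots
    return split
```

A class with fewer than k annotated examples simply contributes all of them, so rare classes stay in the benchmark rather than being dropped.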