Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai
Channel: aiDotEngineer
Published at: 2025-08-23
YouTube video id: h5ItAJuB3Fc
Source: https://www.youtube.com/watch?v=h5ItAJuB3Fc
Okay, so hello everyone. My name is Diego Rodriguez. I'm the co-founder of Krea, a startup in the AI space like many others, in particular generative media, multimedia, multimodal, and all the buzzwords. But I come here mainly to tell a story about how we think about evaluations when we have to take human perception, human opinion, and aesthetics into the mix.

So I'm going to start with a very simple story. I took an AI-generated image of a hand. Obviously it looks horrible. Then I asked o3: what do you think of this image? It thought for 17 seconds, obviously did tool calling and Python analysis, OpenCV goes crazy, and after charging me a few cents it said something like: oh, just a couple of oddities, it's mostly natural. So we have what many people claim is basically AGI, and it is completely unable to answer a very simple question. And that's a surprising thing if you think about it, because when we humans see that image, we just react so naturally against it: what is that? That's not natural. And I feel like that's precisely what AI models are being trained on: first, human data; second, human preference data; and third, in a way, they are limited by the data that we humans produce, based on our preconceived notions and perception. So that's what this talk is about: what can we do better, and, honestly, asking ourselves some questions that I think are not being asked enough in the field.

Cool. So, a tiny bit of history.
We all know about Claude Shannon, the father of information theory. According to many, his master's thesis is one of the most important master's theses ever written: he laid the foundations for digital circuits, then eventually for communication, and to a degree we can say even for LLMs nowadays, if we fast forward. And I want to focus on the fact that we call his work foundational to information theory. When he published it, it was actually titled "A Mathematical Theory of Communication," and he was always focused on communication. There's a diagram that appears in that work: this is the source, this is the channel, this is the destination, and there can be some noise in between. And as the founder of a company that is focused on media, it's interesting to me to realize these parallels between classic information theory and communication. Let me see, did I put the image? Well, I didn't put the image, but if you have any context around variational autoencoders or neural networks or whatever, you can squint at that diagram and go: oh, is that a neural network?
And in the context of information and communication, I want to talk about how compression is related to how we think about evaluation. Take JPEG, for example. JPEG exploits human nature in the sense that we are very sensitive to brightness but not so much to color. There's an illusion that shows this, the checker shadow illusion, where squares A and B are actually the same color, but we are basically unable to perceive it until we connect them, and then suddenly it's like: oh, really? What's going on there? JPEG does the same thing. We have the RGB color space to represent images with computers. We notice that there's a diagonal in it that represents the brightness of the image, so we can change into a different color space that separates color from brightness, and then we can downsample the color channels, because we are not even that sensitive to them. So we can remove parts of that information, and once we do, we can see the brightness and color components separated. After downsampling, we can try to recreate the image, and in the example the original image and the image with downsampled color look the same to us, while carrying roughly 50% less information. There's more to it, Huffman coding and other stuff, but that's the point. And then, if you exploit the same idea for audio (what can we hear, what can we not hear), you do the same thing and you have MP3. And if you do the exact same thing across time as well, congrats, now you have MP4. It's all the same principle: let's exploit how we humans perceive the world. This made me think about myself, because I studied audiovisual systems engineering, which is engineering around how all of this works: how microphones work, how speakers work. And it was just interesting to me that I was the one writing the code.
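The color-space and subsampling steps described above can be sketched in a few lines of Python. This is a minimal illustration, not JPEG's actual implementation (which also applies a DCT, quantization, and entropy coding); the conversion constants are the full-range BT.601 ones that baseline JPEG/JFIF uses:

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel (0-255) to one luma + two chroma values.

    Full-range BT.601 conversion, as used by JFIF/baseline JPEG.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_420(channel):
    """4:2:0 chroma subsampling: average each 2x2 block into one sample.

    Keeps only a quarter of the chroma samples; luma is left untouched,
    because brightness is the part our eyes are actually sensitive to.
    """
    h, w = len(channel), len(channel[0])
    return [
        [(channel[i][j] + channel[i][j + 1]
          + channel[i + 1][j] + channel[i + 1][j + 1]) / 4
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]
```

For a pure gray or white pixel, the chroma channels sit exactly at their neutral value of 128, which is why throwing away most of the chroma samples barely changes what we see.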
I start deleting information. I know for a fact that I am deleting information, and then I re-render the image and I see the same thing. Philosophers always tell you that we are limited by our senses, but this was the first time I was really seeing it: I am not seeing the difference. But then, if a lot of our data is the internet, and we're scraping data from the internet, a bunch of those images are compressed too. Are we taking into account that perhaps our AIs are limited as well, because we have some sort of contagion of our flaws into the AI?

And then it gets trickier, because, for instance, there's a screenshot I took from a paper, I think it's called Clean-FID. FID, for all of you who don't have context, the Fréchet Inception Distance, is one of the standard metrics used to measure how well, for instance, diffusion models reproduce images. But then you start adding JPEG artifacts, and the score says: oh no, this is a horrible, horrible image. Perceptually, the four images are basically the same, yet the FID score says this is really bad. So why are we using FID scores, or metrics along those lines, to decide whether a generative AI model is good or bad? The thing is, sometimes I feel like we are focused on measuring just the things that are easy to measure: prompt adherence with CLIP, how many objects are there, is this blue, is this red, et cetera. But what about this painting? Oh, it must be a really bad generator, because that's not how a clock looks, and that sky makes no sense. So not only are we limiting our AIs by our human perception; on top of that, we forget about the relativity of metrics.
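For context on the FID complaint above: FID fits a Gaussian to Inception-network features of real and generated images and computes the Fréchet distance between the two distributions. A one-dimensional sketch of that distance (an illustration only, not the full multivariate metric) shows why small statistical shifts, such as those JPEG artifacts introduce into the features, inflate the score regardless of how similar the images look:

```python
def frechet_distance_1d(mu1, sigma1, mu2, sigma2):
    # 1-D special case of the Frechet distance between two Gaussians.
    # The full FID is the multivariate version applied to Inception
    # feature statistics: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1*C2)^0.5).
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2


# Identical feature statistics: the distance is exactly zero.
same = frechet_distance_1d(0.0, 1.0, 0.0, 1.0)

# A small shift in the statistics (as compression artifacts can cause)
# already yields a nonzero score, with no notion of perception involved.
shifted = frechet_distance_1d(0.0, 1.0, 0.1, 1.05)
print(same, shifted)
```

The metric only sees feature statistics; nothing in it asks whether a human would notice the difference.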
Because someone will say: no, actually, this is art, and this is great. Sometimes there's meaning behind the work that is not conveyed in the image itself; only if you're human do you get it: oh, this is what the author is trying to tell me. And I feel like the metrics don't show that. Commercially and professionally, my job is basically: how can we make a company that allows creatives and artists of all sorts to better express themselves? We can start with image, we can start with video. But how are we supposed to do that if this is the state of the art?

Then a friend of mine, Changloo, who actually works at Midjourney and has great talks that you should all check out, told me a little over a year ago a quote that I just can't stop thinking about, which goes something like: hey man, if you think about it, predicting the car back when everything was horses is not that hard. And it's true. We have a carriage, we have horses that provide the energy, so you swap the horses for an engine and that's essentially a car. Come on, how hard is that? You know what's hard to predict? Traffic. And I just kept thinking about that: as engineers, as researchers, as founders, what are the "traffics" that we're missing now? Because I feel like everyone's focused on things like, yeah, you can transform from JSON to YAML, and who cares about that? Or yes, it's important, but what big picture are we all missing?

Then there's the myth of the Tower of Babel, where, in a nutshell, humanity wants to build a tower to go and meet God, and God says: no, I don't want that.
So instead, I'm just going to confuse all of you, and then you're not going to be able to coordinate. Each one of you is going to speak a different language, and it's going to be basically impossible to keep the thing going. Which reminds me of a standard infrastructure meeting with backend engineers: oh no, we should use Kubernetes; no, we should just use something else. It's all fighting, and nothing gets built. I'm like: dude, God is winning, goddammit.

But this makes me think that we just entered the age where models essentially solved translation, or solved it to a very high degree. So what happens now that we can all speak our own languages and yet communicate with each other at the same time? I'm already doing it. For instance, I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users, and I don't speak Japanese. I've learned a little, but I don't speak it. And I'm now able to provide an excellent, founder-level (whatever that means) customer support experience to a country that I would otherwise be unable to serve. So I invite us all to think about what that really means. Because it means, for instance, that we can now understand others better, or transmit our own opinion better. And on the previous point I was making about the art: that's kind of an opinion, right? Evals are not just about "are there four cuts here"; it's about "this cut is blue," and yeah, but is it blue or is it teal? What kind of blue? And I don't like this blue. All of that.
So, in a nutshell: how do we evolve our evals? If, in my opinion, this output is bad, then I want metrics that take my opinion into account too. And consider that I may be a visual learner; what that means is that maybe your eval should take into account how we humans perceive images. And also the nature of the data: if it's all trained on JPEGs from the internet, take the artifacts into account, take all of this into account while training your models.

Okay, I guess the mandatory slide before the thank-you: a bunch of users, a bunch of money. We did all of that with eight people; now we're twelve. This is an email that I set up today for high-priority applications, for anyone who wants to work on research around aesthetics, hyper-personalization, and scaling generative AI models in real time for multimedia (image, video, audio, 3D) across the globe. We have customers like those, and that's it. Thank you.

Oh, Q&A. Okay, perfect. Any questions?

>> [Inaudible audience question]

There are many points there. Can you rephrase the question? Yeah.
So, the question in a nutshell is: are there perceptually aware metrics? I showed an example where the FID score changes a lot with JPEG artifacts; are there metrics that do almost the opposite, that barely change when the image still looks good? There are some, and many of these are also used in traditional encoding techniques. But in a way, I'm here to invite us all to start thinking about those, because we can actually train them. It's called a classifier (or a continuous classifier, a regressor), and we can train it so that it understands what we mean: hey, I show you these five images, these five images are actually all good, and they can have all sorts of artifacts, not just JPEG artifacts. This is exactly where machine learning excels, right? When it's all about opinions: let me just show you, and you will know. You will know it when you see it. That's precisely the type of question that AI is amazing at.
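The trainable, opinion-based scorer described in the answer can be sketched as a tiny logistic-regression model over precomputed image features. Everything here is a hypothetical placeholder (the features, labels, and hyperparameters are invented for illustration; this is not Krea's actual system):

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_preference_model(features, labels, lr=0.1, epochs=200):
    """Fit weights and bias by SGD on the logistic loss.

    `features` is a list of per-image feature vectors; `labels` encodes
    a human opinion: 1 = "this image looks good to me", 0 = it doesn't.
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the logistic loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def preference_score(w, b, x):
    """Probability that an image matches the rater's taste."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

With enough labeled pairs, the same recipe scales up to "show the model five images I consider good and it learns what I mean," with the logistic head replaced by a deeper network over learned features.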