Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai

Channel: aiDotEngineer

Published at: 2025-08-23

YouTube video id: h5ItAJuB3Fc

Source: https://www.youtube.com/watch?v=h5ItAJuB3Fc

Okay. So, hello everyone. My name is Diego Rodriguez. I'm the co-founder of Krea, a startup in the AI space like many others, in particular generative media, multimedia, multimodal, and all the buzzwords. But I come here mainly to tell a story about how we think about evaluations when we have to take human perception, human opinion, and aesthetics into the mix. Right? So
I'm going to start with a very simple story. I put in an AI-generated image of a hand. Obviously it looks horrible. Then I ask o3: what do you think of this image? It thought for 17 seconds, obviously does tool calling, Python analysis, OpenCV goes crazy, and then, after it charges me a few cents, it answers something like: it's mostly natural. And it's like, okay, we have what many people claim is basically AGI, and it is completely unable to answer a very simple question.

That's a surprising thing if you think about it, because when we humans see that image, we just react so naturally against it. It's like, what is that? That's not natural. And I feel like that's precisely because AI models are being trained, first, on human data; second, on human preference data; and third, in a way, limited by the data that we humans produce based on our preconceived notions and perception and all of that. So that's what this talk is about: what can we do better, and, honestly, asking ourselves some questions that I think are not being asked enough in the field. Cool. So, a tiny bit of
history. We all know about Claude Shannon, the father of information theory, and, according to many, his master's thesis is one of the most important master's theses in the world: in it he laid the foundations for digital circuits, then eventually communication, and to a degree we can say even LLMs nowadays, if we fast forward.
And I want to focus on the fact that we call his work foundational in information theory. Well, when he published it, it was actually titled "A Mathematical Theory of Communication," and he was always focused on communication. There's a diagram that appears in that work, and it's all about: this is the source, this is the channel, this is the destination, and there can be some noise in there. As the founder of a company that is focusing on media, to me it's interesting to notice these parallels between classic information theory and communication. Let me see, did I put the image? Let's see. Well, I didn't put the image, but if you have any context around variational autoencoders or neural networks or whatever, you can squint and be like, oh, is that a neural network, right?
And in the context of information and communication, I want to talk about how compression relates to how we think about evaluation. I'm going to talk about JPEG, for example. JPEG exploits human nature in the sense that we are very sensitive to brightness but not so much to color. This is an illusion that also speaks to that, where A and B are actually the same color, but we are basically unable to perceive it until we do this, and then suddenly it's like, oh, really? What's going on there?
So JPEG does the same thing. Okay, we have the RGB color space to represent images with computers. We notice that there's a diagonal that represents the brightness of the image, so we can change into a different color space that separates color from brightness, and then we can downsample the color channels, because we are actually not even that sensitive to them. So we can remove that, or parts of it. Once we do that, this is an image where we can see the brightness and color components separated. Once we downsample, we can try to recreate the image. And this is an example: basically the original image, and then the image with the downsampled color. It looks the same to us, and the image has about 50% less information, right?
And there's other stuff, Huffman coding and more, but that's the point. And then the thing is, if you exploit the same idea for audio, what can we hear and what can we not hear, well, you do the same and you have MP3. And if you do the exact same thing across time, well, congrats, now you have MP4. It's all this one principle: let's exploit how we humans perceive the world.

But this made me think about myself, because I studied audiovisual systems engineering, which is engineering around all of this: how microphones work, how speakers work. And it was just interesting to me that as I was coding, I start deleting information, I know for a fact that I'm deleting information, and yet I re-render the image and I see the same thing. Philosophers always tell you that we are limited by our senses, but this was the first time I was, like, seeing it. Right? Like, I am literally not seeing the difference.
But then, if a lot of our data is the internet, right, we're scraping data from the internet, and a bunch of those images are also compressed. Are we taking into account that perhaps our AIs are limited too? Because we have some sort of contagion going on, of our flaws, into the AI.
And then it gets trickier, because, for instance, this is just a screenshot I took from a paper, I think it's called Clean-FID. The FID score, for all of you who don't have context, is one of the standard metrics used to judge how well, for instance, diffusion models are reproducing images. But then you start adding JPEG artifacts, and the score is like, oh no, no, this is a horrible, horrible image. And perceptually the four images are basically the same, yet the FID score says, no, this is really bad. So then, why are we using FID scores, or metrics along those lines, to decide whether a generative AI model is good or bad?
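For context on why this happens: FID is just the Fréchet distance between two Gaussians fitted to Inception-v3 features of the real and generated sets; nothing in the formula is perceptual. A minimal sketch, with random vectors standing in for Inception features (the constant shift simulating what a systematic artifact might do to embeddings is my assumption for illustration, not from the talk):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))     # stand-in for Inception features of real images
same = real.copy()
# A small constant shift in feature space -- e.g. what a mild, invisible
# compression artifact might do to embeddings -- blows the score up.
shifted = real + 0.5

fid_same = frechet_distance(real, same)        # ~0.0
fid_shifted = frechet_distance(real, shifted)  # ~2.0 (= 8 dims * 0.5**2)
```

The metric only sees distributional distance in feature space; whether the shift is perceptible to a human never enters the computation.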
So the thing is, sometimes I feel like we are focused on measuring just the things that are easy to measure, right? Prompt adherence with CLIP, how many objects are there, is this blue, is this red, etc. But what about here? Oh, it's like, oh no, really bad, really bad generator, because that's not how a clock looks, and that sky makes no sense. And it's like, okay: not only are we limiting our AIs by our human perception; on top of that, we forget about the relativity of metrics, right? Like, no, actually, this is art, and this is great. And sometimes there's a meaning behind the work that is conveyed in the image, but only if you're human do you get it, like, oh, this is what the author is trying to tell me. And I feel like the metrics don't show that.

And, commercially and professionally, my job is kind of like: okay, how can we make a company that allows creatives and artists of all sorts, starting with image, then video, to better express themselves? But how are we supposed to do that if this is the state of the art, right?
Then a friend of mine, Changloo, who actually works at Midjourney and has great talks that you should all check out, told me, a little over a year ago, a quote that I just can't stop thinking about, which goes something like: hey man, if you think about it, predicting the car back when everything was horses is not that hard. And I was like, well, yeah, it's not that hard to go, oh, cars are the future, and whatever. We have a thing that rolls, we have horses that make energy, so you swap the horse for the engine, and that's essentially a car. I was like, come on, how hard is that? And he said: you know what's hard to predict? Traffic, right?

And then I just kept thinking about it. I was like, oh man, as engineers, as researchers, as founders, what are the traffics that we're missing now? Because I feel like everyone is focused on, yeah, but you can, I don't know, transform from JSON to YAML, and, dude, who cares? Or, yes, it's important, right, but what kind of big picture are we all missing? Then he talks about
the myth of the Tower of Babel, where, in a nutshell, it's like: we want to go and meet God, and God says, no, I don't want that. So instead I'm just going to confuse all of you, and then you're not going to be able to coordinate. Each one of you is going to speak a different language, and then it's just basically going to be impossible to keep the thing going. Which reminds me of standard infrastructure meetings with backend engineers. It's like, oh no, we should use Kubernetes. No, we should just... And it's all fighting and whatever, and nothing gets built. I'm like, dude, God is winning. God damn it. But then this makes me
think about how we just entered the age where you can have models that essentially solved translation, or solved it to a very high degree. So what happens now that we can all speak our own languages and yet, at the same time, communicate with each other? I'm already doing it. For instance, I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users, and I don't speak Japanese. I learned a little bit, but I don't speak it. And I'm now able to provide excellent founder-level (whatever that means) customer support to a country that I would otherwise be unable to serve, right?

And so I invite us all to think about what that really means.
Because this means, for instance, that we can now understand each other better, or transmit our own opinions better to others. And on the previous point that I was talking about, with the art: that's kind of like an opinion, right? Evals are not just about, are there four cuts here; it's about, this cut is blue, and it's like, yeah, but is it blue or is it teal? What kind of blue? And I don't like this blue, and all of that.

So, in a nutshell, it's like: how do we evolve our evals? If, in my opinion, this is bad, then I want metrics that take my opinion into account too. And then it's like, okay, consider that I may be a visual learner. What that means is that maybe your eval should take into account how we humans perceive images, right? And also the nature of the data, such as, oh, it's all trained on JPEGs from the internet. So take the artifacts into account, take all of this into account, while training on your data.
Okay, I guess the mandatory slide before the thank-you. A bunch of users, a bunch of money. We did all of that with eight people; now we're 12. This is an email that I set up today for high-priority applications, for anyone who wants to work on research around aesthetics, hyperpersonalization, and scaling generative AI models in real time for multimedia (image, video, audio, 3D) across the globe. We have customers like those, and that's it. Thank you. Oh,
Q&A. Okay, perfect. Any questions?
>> Yeah. There are many points there. Can you reframe the question?
Yeah.
>> Yeah.
Yeah. So, the question, in a nutshell, is: are there perceptually aware metrics? Like, okay, I showed an example where the FID score changes a lot with JPEG artifacts. Are there metrics that are almost the opposite, that barely change while the score stays good? There are some, and many of these are also used in traditional encoding techniques. But in a way, I'm here to invite us all to start thinking about those.
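As an example of the family of perceptual metrics used in encoding, SSIM scores luminance, contrast and structure rather than raw pixel error, so mild noise barely moves it. A minimal single-window sketch (real SSIM averages over sliding local windows; the gradient image here is just a toy stand-in):

```python
import numpy as np

def global_ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM over two same-shaped images in [0, 255].
    Compares mean luminance, variance (contrast), and covariance (structure)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(1)
img = np.tile(np.linspace(0, 255, 64), (64, 1))          # smooth toy image
mild = np.clip(img + rng.normal(0, 2, img.shape), 0, 255)  # barely visible noise

s_identical = global_ssim(img, img)   # exactly 1.0
s_mild = global_ssim(img, mild)       # stays near 1.0 despite the perturbation
```

The contrast with FID is the point: the structure terms track what we notice, so an imperceptible change produces a near-perfect score instead of a blow-up.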
Like, we can actually train, I mean, it's called a classifier, right? Or a continuous classifier. We can train one so that it understands what we mean. It's like, hey, I show you these five images, and these five images are actually all good, and they can have all sorts of artifacts, not just JPEG artifacts. And this is exactly where machine learning genuinely excels, right? When it's all about opinions, it's like: let me just show you, and you will know it when you see it. That's precisely the type of question that AI is amazing at.
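The "you will know it when you see it" setup he describes is just supervised learning on human labels. A toy sketch of such a continuous classifier, plain logistic regression trained by gradient descent, with random vectors standing in for image embeddings and made-up good/bad labels (a real version would use a vision backbone and actual human ratings):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for image embeddings: 'good' and 'bad' images are assumed to
# occupy different regions of feature space (a made-up, separable toy setup).
good = rng.normal(loc=+1.0, size=(200, 16))
bad  = rng.normal(loc=-1.0, size=(200, 16))
X = np.vstack([good, bad])
y = np.array([1.0] * 200 + [0.0] * 200)   # 1 = "I like this", 0 = "I don't"

def sigmoid(z):
    # clip to avoid overflow in exp for large margins
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

# Logistic regression by gradient descent: a minimal "continuous
# classifier" that learns a taste boundary from binary labels.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

score = sigmoid(X @ w + b)                 # continuous "how good" score in [0, 1]
accuracy = ((score > 0.5) == (y == 1)).mean()
```

The output is a score, not a yes/no, which is what lets it express "this blue, not that teal" once it is trained on one person's preferences instead of toy labels.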