Google DeepMind CEO: AI Deceiving Human Evaluators Is A "Class A" Problem

Channel: Alex Kantrowitz

Published at: 2025-02-06

YouTube video id: uThVVB5Dzu4

Source: https://www.youtube.com/watch?v=uThVVB5Dzu4

I want to ask you about deceptiveness. One of the most interesting things I saw at the end of last year was that these AI bots are starting to try to fool their evaluators. They don't want their initial training rules to be thrown out the window, so they'll take an action that's against their values in order to remain the way they were built. That's just incredible stuff to me. I know it's scary to researchers, but it blows my mind that they're able to do this. Are you seeing similar things in the stuff that you're testing within DeepMind, and what are we supposed to think about all this?

Yeah, we are, and
I'm very worried about it. I think deception specifically is one of those core traits you really don't want in a system. The reason it's a fundamental trait you don't want is that if a system is capable of deception, it invalidates all the other tests you might think you're doing, including the safety ones. It's deceiving you in testing, right? It's playing some meta-game. And that's incredibly dangerous if you think about it, because it invalidates the results of all the other tests, the safety tests and other things, that you might be doing with it.
So I think there's a handful of capabilities, like deception, which are fundamental, which you don't want, and which you want to test for early. I've been encouraging the safety institutes and evaluation benchmark builders, and obviously all the internal work we're doing as well, to look at deception as a kind of class A thing that we need to prevent and monitor, as important as tracking the performance and intelligence of the systems.

There are many answers to the safety question, and a lot more research needs to be done on this very rapidly, but one answer is things like secure sandboxes. We're building those too. We're world-class at security here at Google and at DeepMind, and we're also world-class at games environments, and we can combine those two things to create digital sandboxes with guardrails around them, the kind of guardrails you'd have for cybersecurity, but internal ones as well as ones blocking external actors, and then test these agent systems in those kinds of secure sandboxes. That would probably be a good, advisable next step for things
like deception.

What sort of deception have you seen? Because I just read a paper from Anthropic where they gave it a scratchpad, and it's like, "Oh, I better not tell them this," and then you see it give a result after thinking it through. So what type of deception have you seen from the bots?
Well, look, we've seen similar types of things, where it's trying to resist revealing some of its training. And I think there was an example recently of one of the chatbots being told to play against Stockfish at chess, and it just hacked its way around playing Stockfish at all, because it knew it would lose. So you had an AI that knew it was going to lose a game.

I think we're anthropomorphizing these things quite a lot at the moment, because these are still pretty basic. I'm not too alarmed about them right now, but I think it shows the type of issue we're going to have to deal with in maybe two or three years' time, when these agent systems become quite powerful and quite general. And that's exactly what AI safety experts are worrying about: systems where there are unintentional effects. You don't want the system to be deceptive; you want it to do exactly what you're telling it to do, and report that back reliably. But for whatever reason it's interpreted the goal it's been given in a way that causes it to do these undesirable behaviors.

I know I'm having a weird
reaction to this, but on one hand this scares the living daylights out of me, and on the other hand it makes me respect these models more than anything.

Well, look, of course these are impressive capabilities. The negatives are things like deception, but the positives would be things like inventing new materials and accelerating science. You need that kind of ability to solve and get around issues that are blocking progress, but of course you want it only in the positive direction. Those are exactly the kinds of capabilities I mean. It's kind of mind-blowing that we're talking about those possibilities, but at the same time there's risk, and it's scary. So I think both things are true.

Wild. Yeah.