Daniel Kahneman and Yann LeCun: How To Get AI To Think Like Humans (Full Episode)

Channel: Alex Kantrowitz

Published at: 2021-12-10

YouTube video id: oy9FhisFTmI

Source: https://www.youtube.com/watch?v=oy9FhisFTmI

Hello and welcome to the Big Technology Podcast, a show for cool-headed, nuanced conversations of the tech world and beyond. Let me tell you, I'm pinching myself that we're actually about to have this conversation, because I think it's going to be the most fascinating we've had on the podcast, bar none. And let me tell you why.

In 2017 I got a chance to spend some time with Yann LeCun. Yann was the head of Facebook AI Research — you might know the company now as Meta — and he's the chief AI scientist there. We spoke a lot about how you can actually take the human mind and make it artificial, turn it into an AI. Making a thinking machine is Yann's long-term goal.

Then this summer I had some time off and decided to read a book I'd wanted to get around to for a long time: Thinking, Fast and Slow, by the Nobel Prize-winning professor Daniel Kahneman. As I'm reading this book, I start thinking: wow, Professor Kahneman describes the way the mind works in so many interesting ways that I hadn't thought about — ways you'd imagine Yann would have to think about if he wants to build a thinking machine. So why don't we just get the two of them together? Obviously it was a pipe dream, right?

Anyway, about a month ago I'm in the same room as both of them, by some stroke of luck, and I propose we have this conversation. So of course I go up to Professor Kahneman — we'll call him Danny for this conversation — and say, hey, have you ever spoken with Yann? And it turns out they've spoken many times. These are discussions they've had in private a handful of times, but they've never had one in public, and so we'll have it in public today. So first of all I want to say: Yann and Danny, welcome to the show. It's great to have you both.

Pleasure to be here.
So I think we should do the foundational stuff first — a bit about what Yann's mission is, a little on Danny's research — and then stay tuned for the second half, because Danny brought a list of questions he wants to ask Yann, and I'll just take a back seat for that. But why don't we start here: Yann, could you briefly share your ambition to build a thinking machine? What does that actually look like? Do you want to replicate the human mind one for one?
No, I don't want to replicate the human mind — I want to understand intelligence. I think one of the most fascinating scientific questions of our time is: what is intelligence, how does the brain work? The other two fascinating questions of all time are: what is life all about, and how does the universe work? These are big questions, and as a scientist and an engineer, I consider that I don't really understand how something works unless I build it myself, or at least have some understanding of how to build it. So the purpose of AI is a dual one. One is to understand intelligence — perhaps to approach models of human intelligence, if we want. The other, of course, is to construct interesting artifacts that can help people in their daily lives and make the world better. So it's really two purposes: the scientific one and the technological one.

Right. And just a follow-up on that. I remember sitting with you in your office, and you were talking about how you want to teach computers to predict — and if they can predict, they can plan, and if they can plan, they can start to have some of the functionality that we have as human beings. And there's this rush towards general intelligence, where folks like you, and researchers in your stratosphere, are attempting to build a thinking machine.
That's right. Well, the first thing you can notice in the animal world is that there is no intelligence without learning. In the engineering world it's almost true as well. And it's probably because I'm either lazy or not smart enough that I think that, as human engineers, we cannot directly conceive and construct an intelligent machine — we have to build a machine that can make itself intelligent through learning. That's how I got interested in learning, very early on, when I was an undergrad.

Then the next question is: how do we get machines to learn? Of course there's a long history of machine learning — supervised learning, starting with the perceptron, and things like that. And what has become really clear over the last couple of decades — it was clear even earlier for people like Geoff Hinton, but maybe only recently for me — is that the types of learning we are currently able to reproduce in machines, which are supervised learning and reinforcement learning, do not seem to reflect what we observe in humans and animals. There is another type of learning, another paradigm of learning, that seems to take place in humans and animals, which allows them to learn how the world works — mostly by observation, a little bit by interaction, but mostly by observation. We accumulate enormous amounts of background knowledge about how the world works. And that connects with the things that Danny talks about: if we have a model of the world, we can use it to plan, because we can imagine the consequences of the actions we're taking. That is what Danny calls System 2. Currently, what we can do with machine learning is more like System 1 — here is an input, here is an output, no reasoning required. So I'm interested in trying to get machines to learn models of the world, so that we can get them to reason, essentially.

It's so amazing to have you both on the same show, because now we get to go to Danny to talk a little about these concepts in terms of the way the human mind thinks, and then we can go back to you, Yann, and think about how we might get that into AI. So Danny, at the risk of having you go through the Freebird speech that you make often — I think it's important for listeners, especially those who haven't read the book — could you describe a little more what Yann is talking about in terms of the way human beings think? I know it's a shortcut, but talk a little bit about System 1 and System 2.
I would say that when we describe human intelligence, we speak of a representation of the world. And it's the representation that leads to prediction — there is no shortcut to prediction from the data. You go through a representation, which includes how the system works, which includes the causal relations. This is what enables you to predict.

And it turns out that we do have such a model of the world. It's not so much that we have specific expectations about what's going to happen next. What is happening most of the time is that things happen, and then we make sense of them — we go back and fit them into what happened before. Quite often — right now you're not predicting what I'm going to say next, but I'm not going to surprise you by what I say. If I said "lumber" all of a sudden, I would surprise you, because it doesn't fit. So this is the way the most interesting part of this works.

What is remarkable from the point of view of the psychologist — there are many things, but the ones that touch on what Yann is doing — what's remarkable is how little it takes, how quickly people learn. That, I think, is a fundamental puzzle. And it turns out that the representation of the world we have — it's hard to imagine it completely without symbols. We do think symbolically, and how you represent symbols without using symbols is sort of a puzzle.

So I would ask Yann the question I've asked him before, which is whether solving a generic learning problem is enough to get all of that — that is, to get a system that learns quickly, and to get a system that has the kind of logic that constrains us: there is a certain kind of mistake that people just don't make. We have it that an object is not in two places at once, and other basic facts about the world. And there is a sort of certainty about it that seems difficult to achieve with just a system that learns approximately. So I was wondering how Yann is going to deal with that.
Yeah. And before we get into the answer, just a quick table setting: System 1 is the automatic stuff, the things we do without thinking, and System 2 is the deliberate, effortful stuff — even when it feels like we're doing something without thinking, we might actually be in System 2; it's more complex.

Everything I was saying was basically System 1, yes. Because the representation of the world that we have, and our ability to anticipate — or to feel unsurprised by what happens, which I think is more than anticipating — that is all System 1. That's all automatic; it's effortless and it's very quick.

We will get to System 2, but solving System 1 seems to be a big one. Yeah — so we'll start with System 1. Yann, your thoughts?
Well, I agree with Danny that current AI systems are very specialized, and that makes them very brittle, because they're trained for one task, or maybe a collection of tasks. There's a motion towards training relatively large systems for multiple tasks at once, not just one, because they tend to work better and require less data. But it is astonishing how fast humans and animals can learn new tasks when faced with a new situation. How is it that a teenager can learn to drive a car in about 10 or 20 hours of practice, or to fly an airplane in 10 to 20 hours of practice? It's incredibly fast. If we were to use, let's say, reinforcement learning to train a self-driving car to drive itself, it would have to drive for millions of hours, cause thousands of accidents, and destroy itself multiple times before it learns to drive — probably not nearly as reliably as a human. So what's the difference?

Of course, we can say humans rely on their background knowledge about the world, and that basically is the answer. We learn enormous amounts of background knowledge about how the world works. So when we learn to drive, we don't have to learn that if we drive next to a cliff and turn the wheel to the right, the car will run off the cliff and nothing good will come of it. We don't need to try, because we know that from our model of how the world works — from intuitive physics and things like that. This type of model is what gives us, in my opinion, some sort of common sense; what we call common sense comes out of this. It prevents us from making the really stupid mistakes that Danny was talking about, which AI systems are making currently.

So how do we get machines to learn that? You can look at what stage in their life human babies learn basic concepts — like the difference between an animate and an inanimate object, or whether an object placed on a table will stay stable or fall. That is learned around the age of three or four months. The difference between animate and inanimate also comes pretty early. Object permanence comes very early as well — some people claim it's innate; it's not clear. Then there are concepts we take for granted, like the fact that objects that are not supported fall — that they're subject to gravity. Babies learn this around the age of eight to nine months. It takes a long time to understand that an unsupported object will fall; it takes a long time to understand momentum and things like that. And in the first few months, babies have very little ability to act on the world — pretty much everything they learn is from observation only. By the time they are eight or nine months old, they've pretty much understood the basic physics of the world. And human babies are relatively slow: take a baby cat — it understands intuitive physics incredibly quickly, and the dynamics of its own body, and everything. Of course, it's not clear how much of this is hardwired, but there is clearly a lot of learning taking place.

This type of learning is the one we don't yet know how to reproduce with machines, and that's why I focus my research on it: what is this type of learning that allows us to learn representations of the world, and predictive models of the world, that then allow us to learn new tasks very quickly, with very few trials or very few examples? It's a very hot topic in machine learning at the moment, actually.
Well, I'm curious about one thing. There are two tentative answers that I know about — and I'm quite ignorant about this — to the problem of the speed of learning. One is that a lot is built in. That looks very appealing, because some animals really come out of the womb ready to go. Mountain goats know how not to fall as soon as they are born; it's difficult to think of that as learning. So nativism is part of the answer. Then there is something I've been learning about: hierarchical Bayesian models, which is a different idea — that there are logical structures we are prepared for, which organize the world that we see, and that this is part of what enables us to learn things very quickly. Now, I think you reject both of these.
Well — the first one only partially. It's obvious that in certain species, including humans, a lot of things are, if not completely hardwired, then at least backed by an intrinsic motivation to learn them quickly, and that drives us to learn them quickly, essentially. That's certainly true for baby ibexes and mountain goats, and certainly for simpler animals like spiders. But even an ibex or a mountain goat still needs to learn a lot about the dynamics of its own body. They're not hardwired with that — maybe with the geometry, but not with the details. You could think of this as just a few parameters to adjust, but I think what goes on in the brain of the animal is much more complex than that. So I think there is a lot of learning in there, even if, in the end, we observe that the learning takes place really quickly.

And there is some evidence that things a lot of people in cognitive science think — or have thought in the past — were hardwired are in fact learned, because they can be learned really quickly. For example, it would be relatively simple for genetics to encode the fact that the visual cortex needs to be built with neurons that detect oriented edges — the kind of thing we find in the primary visual cortex. But in fact, we now have a dozen different learning algorithms that, if we ran them in real time in a simulated animal, would learn those oriented feature detectors from almost random images within minutes. So there is no point in hardwiring this, because it can be learned within minutes.

Same for face detection. You look at the brain, and there are areas that light up when people are shown faces, and an easy conclusion is: oh, this area of the brain specializes in face detection, it's hardwired, face detection is innate — or face recognition is innate. But again, face detection can be learned in minutes. If you're a baby, your vergence is bad, your focus is basically fixed at a relatively short range, so the only things you see during the first weeks of your life are faces and nipples, essentially. And you have a hardwired tendency to pay attention to motion. So within minutes you're going to have a face detector in your visual cortex — any reasonable learning algorithm will learn this extremely quickly. So you don't need this to be innate if it can be learned.

That's the first thing about innateness. Is the fact that the world is three-dimensional innate, for example? It would make sense, right? All of evolution has taken place in a three-dimensional world, so it would make sense to hardwire this in the cortex. But if you're an AI scientist or engineer, it's like: how would I even define an architecture that has that hardwired? I don't even know how to do it. So it's not clear you can actually encode this in the genome. On the other hand, you can learn it really quickly, because the fact that every point in the world has a depth is the best explanation for how your view of the world changes when you move your head, or when you correlate the two views from your left eye and your right eye. Learning very basic things like this is very simple for whatever learning algorithm our brain uses. So that's the first part — the nature-versus-nurture debate, if you want.

Then there is the second question, about symbols. Is it necessary to have explicitly hardwired mechanisms in our brain that allow us to do things like symbolic manipulation or reasoning? I find that hard to believe — that they need to be hardwired. First of all, a question to you, Danny, perhaps, which I've asked of many people who've raised that question: do you view the type of reasoning that, let's say, great apes do as symbolic? Or what monkeys do? Dogs, cats, octopuses — do they do symbolic reasoning?

And can we also, just for table setting, define what symbolic reasoning is?

I don't know how to define it — I'm not a big advocate of the very existence of symbolic reasoning.

Okay, we'll toss it over to Danny.
I think that when we're talking about symbols, we're really talking about logical relations. That's not an easy one to explain so that we understand each other — it's hard for me to explain. But I think a characteristic of symbolic reasoning is a certain discreteness, and so it's not completely compatible with what you have talked about. It's interesting, what is happening here: with your idea that everything has to be learned, or can be learned very easily, I would push you down the animal chain, to where I think it becomes highly implausible — and you are doing the same thing to me with symbols, by asking me how far down symbols go.

The standard answer is that it's connected with language — that the symbolic system and the language system are related, and therefore other animals don't have symbols in the way that we do. Whether that is really compelling, I'm not sure. In the sense, for example, that in the hierarchical Bayesian models — which I'd like to get your evaluation of — those are supposed to take care of perceptual learning, and the symbolic element, the logical element, is sort of upstream of that: it's in what categories of things there are, and what categories of structure there are, that you're about to learn.
Right. So I think there are a number of very interesting questions there. The first one is: what do we mean by symbols, really? If by symbol we mean the ability to form discrete categories — to represent discrete categories in our mental representation system — then I think pretty much every brain has some sort of symbolic representation. The reason I think this is that discretizing concepts — having identified categories — makes memory more efficient, because it gives representations the ability to be error-corrected. If you write down a zip code or a credit card number and you make a mistake on one digit, that mistake can be detected, because there is a self-consistency check: credit card numbers are not only discrete — they're numbers — they are actually some distance apart from each other. So if you make a small error, you can snap it back to the correct one; you can detect the mistake. That's the basics of error correction.
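The self-consistency Yann alludes to is, for real card numbers, the Luhn checksum: valid numbers are spaced out so that any single-digit error lands on an invalid number. A minimal sketch in Python — the test values below are standard dummy numbers, not real cards:

```python
def luhn_valid(number: str) -> bool:
    """True if the digit string passes the Luhn checksum used by card numbers."""
    digits = [int(c) for c in number]
    # Double every second digit from the right; a doubled digit above 9
    # has 9 subtracted (equivalent to summing its two decimal digits).
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

print(luhn_valid("79927398713"))  # True: the classic Luhn test number
print(luhn_valid("79927398712"))  # False: one digit off, and the check catches it
```

Because any single-digit change alters the checksum, every single-digit mistake is detectable — exactly the "snap back" property of codewords kept some distance apart.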
So just for the purpose of being able to do error correction, and for the purpose of being able to store discrete concepts in memory efficiently, you need discrete representations. But now comes the big conundrum — well, in a way you need symbols, and if that's the definition of symbols, then I think dogs have symbols. I don't know if they reason logically — I guess they have some logic, certainly. I think cats and dogs, and probably even mice, have these kinds of symbols.

And what's a discrete representation?

Okay, I can give you one — it's the difference between a real number and a natural number. Let's say I want to send a number to you, and the only thing I have is an electric wire, so I can send a voltage. I send a voltage of, say, three volts. But at your end it may not be exactly three volts, because there is resistance, impedance, parasitic noise, whatever. What you're going to observe is some fluctuating thing that is maybe around three volts, maybe a little shifted — but it's not going to be shifted all the way to four volts or two volts; it's going to be around three volts. Because you know that the signals I want to transmit are either zero, one, two, three, or four, you know it's a three. Despite the noise in the transmission, and despite the fact that the signal is continuous, you snap it back to its correct value. And then, to store that value somewhere — because there are only five possible values — you can store it with a small number of bits, whereas storing a precise voltage value would require many bits. So discretization makes memory more efficient, and it makes transmission — either inside the brain or between agents — reliable.
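A minimal sketch of that snapping operation, assuming NumPy; the five levels, the noise scale, and the seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
levels = np.arange(5.0)            # the five agreed-upon voltages: 0..4

sent = 3.0
received = sent + rng.normal(0.0, 0.15, size=8)   # noisy analog wire
decoded = levels[np.abs(received[:, None] - levels).argmin(axis=1)]

print(received.round(3))   # fluctuates around 3 V
print(decoded)             # snapped back to 3.0 -- noise this small
                           # essentially never crosses a decision boundary
# Storing the decoded symbol takes only log2(5) ~ 2.3 bits; storing the
# exact analog voltage would take many more.
```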
It also gives you the possibility of associative memory. The brain represents things with voltages, to a first-order approximation, and there may be noise in that representation. You may see a partial view of an image, and you can reconstruct the whole image because of your knowledge of what the thing is supposed to look like. You may never have seen the left side of my face, but even if you've never seen it, you probably have a pretty good idea what it looks like, because of your general model of a human face and the fact that faces are mostly symmetrical. That's an example of snapping a noisy signal back to a clean one — maybe discrete, maybe not, but at least exploiting the internal structure of what you're looking at, which your brain has captured.

So that's why discrete symbols are interesting. It's the same reason, by the way, why all modern communication is digital. It used to be analog: to communicate with each other, we would call each other on the phone, and it would just transmit a voltage directly from my phone to your phone. Now we've replaced this with digital communication, with bits, and the reason is that it's more efficient, more robust to noise — there are all kinds of advantages. Our brains actually use digital communication internally: neurons send spikes to communicate with each other, and the reason is that it's easier to regenerate a binary signal than an analog signal, and it's more efficient energetically. There are all kinds of good reasons for this.

So that explains why symbols may emerge in representations of the world just for reasons of efficiency, including in animals that do not have language. Now, of course, if you do want language — language is basically an approximate way of representing the complex data structures that we have in our minds, and serializing them so we can put them on a single one-dimensional signal, which is sound pressure, or sequences of words, which are symbols. And again, if I'm talking to you through a microphone, it goes through the internet, and there is some noise attached to that process — it's pretty high quality, but there's still some noise. Because our language uses discrete words, we can have error-correcting communication. Even if you don't completely catch all of the syllables or phones that I'm pronouncing, you can still recover what I meant, because there is only a finite number of words in English. You may not have heard all the syllables I pronounced, but you snap it back to whatever makes sense in the context. In fact, speech recognition systems work this way: they have a language model, and even if they don't understand every sound, they can reconstruct what was meant. In my case it's even harder, because of my accent.

So language has to be discrete, because it needs to be noise-resistant, essentially. And perhaps the appearance of symbolic language was facilitated by the fact that we need to form the equivalent of symbols — discrete categories — in our brains for efficient storage, and possibly for associative memory recovery. So it's quite possible that, in the right conditions, animals would have more linguistic capability than we give them credit for — and there are, of course, a number of experiments on this that have been done with monkeys and parrots and so on.
Well — so in some sense the discretization itself is built in. That has to be, right? But what you're saying is that there is no content, or very little content, that is built into the human brain at inception. Is that it?

Yeah. Well, the big question is: where does the meaning of those discrete entities come from? I certainly do not believe that the meanings of the discrete entities are predetermined — in the human mind, certainly, and in the animal mind either. Those are completely learned. And so here now comes the conundrum. You were talking about hierarchical Bayesian systems. In a sense, multi-layer deep neural nets are hierarchical systems, if you view them the right way, and there are certain forms of them that are explicitly Bayesian. The main issue with the classical approaches to Bayesian modeling is that the concepts in a hierarchical Bayesian graph — a Bayesian network, a graphical model, people give them different names — have to be designed by the human designer when they build those systems. And it clearly is not happening that way in the brain: those concepts are learned. The basic entities that are manipulated — those discrete concepts — are learned.
Sorry to ask, but can you just describe hierarchical Bayesian systems for the general audience?

Briefly — it means different things to different people, but there is a classical view of it called Bayesian networks, where you basically have a graph. A graph, in the mathematical sense, is a collection of nodes linked by edges, and each node represents a variable. Here's a classical example that people cite in courses on AI. One node represents whether your house is shaking. Another node indicates whether a truck has hit your house. And another node indicates whether there was an earthquake — you're in California, say. You can establish a causal relationship, or at least a dependency, between those nodes: if my house just jolted, it's either because a truck hit it or because there was an earthquake, and there probably aren't many other reasons for that to happen. Now, if I look out the window and I see that a truck just hit the house, that immediately lowers my estimate of the likelihood that there was a simultaneous earthquake, because the probability of those two happening at the same time is very low. That's called "explaining away."
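A tiny sketch of that truck/earthquake network in Python, with made-up probabilities, computed by brute-force enumeration over the joint distribution:

```python
import itertools

p_truck, p_quake = 0.001, 0.001          # rare, independent causes (made-up priors)

def p_jolt(truck, quake):
    # The house almost surely jolts if either cause is present, almost never otherwise.
    return 0.9 if (truck or quake) else 1e-6

def p_quake_given(also_saw=None):
    """P(quake | house jolted, plus optionally 'truck seen') by enumeration."""
    num = den = 0.0
    for truck, quake in itertools.product([False, True], repeat=2):
        if also_saw == "truck" and not truck:
            continue                      # condition on having seen the truck
        p = ((p_truck if truck else 1 - p_truck)
             * (p_quake if quake else 1 - p_quake)
             * p_jolt(truck, quake))      # always condition on the jolt
        den += p
        if quake:
            num += p
    return num / den

print(p_quake_given())          # ~0.50: the jolt makes a quake quite plausible
print(p_quake_given("truck"))   # ~0.001: seeing the truck explains the jolt away
```

With these numbers, conditioning on the jolt raises the probability of a quake to about one half; additionally observing the truck drops it back to roughly its prior — explaining away, in numbers.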
So you could imagine building a network of those nodes, where each node has a particular meaning. For example, some nodes would be symptoms that you can observe in a person, and the underlying nodes that causally influence them would be, say, infection by a particular bug. You could imagine building those things and then doing inference: you know the value of some nodes, and you infer the probability that the nodes you don't observe have a particular value. I'm observing this symptom and this one, I have this graph, I've measured the blood pressure and whether the person's tummy hurts in a particular way, and I conclude it's probably appendicitis, or whatever. In fact, systems in the 1970s were built exactly on that model — so-called expert systems, or probabilistic expert systems — and what we now call Bayesian networks and hierarchical Bayesian models are kind of the descendants of those, if you want. But this does not involve any learning: all those nodes would be built entirely by hand, completely engineered.

And that necessity was one of the reasons for the decrease in interest in good old-fashioned AI. These were, in those years, the Bayesians —

Yeah, the Bayesian approach could be completely logical — okay, hard decisions — but the fact that you had to hand-engineer those entire systems from scratch is part of what killed them. So the alternative is learning. But now you have the question: how do you learn those concepts? Say you want to do computer vision. The nodes at the bottom level are pixel values. What would be the right nodes just above that, representing good combinations of pixel values, and how do you learn them? It turns out those combinations would be things that detect oriented edges — which is what you observe in the brain, and what convolutional neural nets actually learn. But that has to be learned. So then the big conundrum is: if a system is to learn and manipulate discrete symbols, and learning essentially requires things to be continuous, how do you make those two things compatible with each other?
And the answer is: I don't know. Maybe the answer is that we are giving a little too much importance to logical reasoning. There's one thing that Geoff Hinton says, which I agree with to a large extent, which is that a lot of reasoning — certainly in animals, and in humans — is not logical reasoning; it's basically simulation, or analogical reasoning, which is similar. You're a lion in the savanna, you're chasing a wildebeest, and you have to do some prediction about the trajectory of the wildebeest, and work with the other lions — or lionesses, usually — to chase the animal in the right way. That requires a kind of simulation of the animal you're chasing: the best way to predict how the animal you're chasing is going to act is to have an internal model of it that you can simulate. And it's the same in a more human situation: if you want to build a widget, like a box out of planks, you have to have a model in your mind of what would be the result of assembling those planks, and how solid it's going to be if you use glue or nails or screws or more complex carpentry.

So the key element in this form of intelligence is the ability to build predictive models of the world. That's what we're missing; that's what we need to figure out how to do with machines.

And you can do this — I find this very convincing, actually — you can do this without symbols; you can do this without an explicit representation of causality. You just observe, and from the observation you can go directly to prediction.

Right. There is going to be some causality involved; the question is whether you need an explicit mechanism for causal inference. If your model of the world includes the prediction of the next state of the world, given the previous state and given your action, then you're going to build a causal model — because your action, or the action you observe from other people or other agents, will tell you: if I take that action, I will get this result from this state. So you will be able to establish causal models if you can act, or if you can determine that another agent has acted and you've observed the result.
Mechanisms of that kind certainly exist in human physiology. Any movement of the eye involves a prediction of what the world will look like when the eye moves. There are those beautiful old experiments where, in effect, curare is used to paralyze the eye muscles, so people can't move their eyes — but when they intend to move their eyes, the world moves. They're anticipating. And clearly nobody would claim that any symbol, or any causality, is required to do that.

So that would be your model. In fact, your perception of the world is not the world as it is — it's the world as it's going to be, because there is about a 100-millisecond delay between what you see and how your brain interprets it. Your brain predicts a hundred milliseconds into the future: the estimate of the world that you are consciously aware of was predicted from your perception from a tenth of a second ago. And that takes place everywhere. One of the main current theoretical concepts in systems neuroscience, or computational neuroscience, is this idea of predictive coding, where basically everything in the brain is trying to predict everything else in the brain — predicting future actions, the future state of other parts of the brain, and things like that. Now, this doesn't mean that the people pushing this kind of theory know how the brain works, because between a general concept of this type and its reduction to a practical algorithm that you could actually implement on a machine, there is a huge distance that has not yet been bridged. That's a big program, I think, for science over the next years.
I was curious earlier — it's a question I neglected to ask — about face detection and the existence of a face system. It seems to me that seeing your mother's face a lot doesn't really equip you to distinguish between faces. That's what the face system does: it doesn't only recognize that this is a face, it is actually specialized in identifying faces. That ability to learn a distinctive face very quickly — it's the same problem again; I don't quite see how that gets done.

I don't know. Historically, in computer vision as well, face detection preceded reliable face recognition by a decade or two. We had reliable face detectors in the early 2000s, and reliable face recognition didn't pop up until roughly 15 years later, with deep learning methods. And it works surprisingly well — you don't need a big neural net to identify faces, and to have a system that recognizes more faces than any human can, with the same level of accuracy. Maybe not with the same robustness to changes in pose and facial hair and things like that, but in terms of the number of different people it can put a name on, these systems are incredibly good, incredibly reliable. So it may not be as hard a task as we thought, perhaps.

I kind of thought this was going to be a discussion more about how we could transpose what we learn from the human brain into AI, but it seems like we're actually talking a little bit about how we can learn about the human brain from the construction of AI.

I think that is actually happening.
But there is an interesting question to me as a psychologist — it's a version of the Turing test: what would it take for a computer to fool you into thinking it's human? And I was thinking that one characteristic is mistakes that strike us as absurd — mistakes that seem to violate some basic constraint. I mentioned earlier an object not being in two places at once, and there is, I think, a finite set of absurd mistakes — or am I wrong? Would it be of any interest to have a system that avoids the mistakes people consider absurd, and makes only the kind of mistakes people tolerate? If there is such a category of absurd mistakes in your field.
Well, there are different kinds of mistakes. There are computer vision systems that make stupid mistakes because — so, maybe 15 years ago, computer vision systems were okay at picking out and recognizing objects in images, but they were sometimes confused by the context. Actually, they were not confused by the context; they just were not using the context at all. So a system would make the mistake of recognizing a face that wasn't really a face, where from the context you could tell it wasn't a face; or it would recognize an object that had the shape of, I don't know, a cow, when you could tell from the context that it couldn't possibly be a cow if you were human. That was a big criticism of computer vision systems: they don't take context into account.

And now the criticism is that they take too much context into account. There is this famous example of a modern vision system — a system of a few years ago. You train it to recognize things like cows and trains and airplanes and cars by showing it thousands of examples of each category, and then you show it a cow on a beach, and it can't recognize it as a cow, because every example of a cow it has seen was on green pasture. The system actually learns to use the context, and it uses it too much: the context had a spurious correlation with the category, and the system doesn't have the common sense to say, well, that's a cow regardless — because a cow could very well be sitting on a beach.

So then how do you fix that? How do you get that common sense? That seems to be the common-sense question.

Exactly. I think this is not going to be solved, in my opinion, by more tweaks on the architectures and more training data and things like this. It may be mitigated, but it's not going to be fixed by that. I think it's going to be fixed by systems that basically learn models of the world, and then have the ability to tell that it's perfectly possible for a cow to be on a beach. That's the compositional nature of the world: even combinations of things you've never seen can very well happen.
And that is already happening?

Not really. There is a very interesting thing that is happening, though. One of the hottest topics in AI today — or in machine learning, at least — as I was mentioning earlier, is this idea of self-supervised learning, which I've been a big advocate of for many years. It's the idea that you train a system not to solve a particular task, but to basically learn to represent the world in a semi-generic way. You train it, essentially, to predict. And this works wonderfully for natural language understanding systems. The standard way nowadays — it has become standard over the last three years — of training a natural language understanding system is this: you take a sequence of words, a sentence or something longer, maybe a few hundred or a thousand words; you mask about ten or fifteen percent of those words, replacing them with a blank marker; and you train some giant neural net to predict the words that are missing. In the process of doing so, the network learns the nature of language, if you want. It learns that if you say "the lion chases the blank in the savanna," the blank is likely to be something like a wildebeest or an antelope. But if you say "the cat chases the blank in the kitchen," it's probably a mouse. So a lot of semantics about the world — certainly the syntax, but also semantics about the world — ends up being represented by this network that is just trying to predict missing words.

Now, of course, the system can never predict exactly which word is missing — it could be antelope or wildebeest or zebra — but it can easily represent a distribution over words, a probability distribution, because there is only a finite number of words in the dictionary. You just have a number, a score, for how likely each particular word is to appear, and that's how those systems handle the uncertainty in the prediction.
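A toy sketch of that masked-word objective, assuming PyTorch; the vocabulary, sentence, and model sizes here are invented for illustration, and a real system would be vastly larger and train on billions of words:

```python
import torch
import torch.nn as nn

vocab = ["[MASK]", "the", "lion", "chases", "wildebeest", "in", "savanna",
         "cat", "mouse", "kitchen"]
V, D = len(vocab), 32

class TinyMaskedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(D, V)            # one score per vocabulary word

    def forward(self, ids):
        return self.out(self.enc(self.emb(ids)))

model = TinyMaskedLM()
ids = torch.tensor([[1, 2, 3, 4, 5, 1, 6]])   # "the lion chases wildebeest in the savanna"
masked = ids.clone()
masked[0, 3] = 0                              # hide "wildebeest" behind [MASK]

logits = model(masked)                        # (1, 7, V)
probs = torch.softmax(logits[0, 3], dim=-1)   # a distribution over possible fillers
loss = nn.functional.cross_entropy(logits[0, 3:4], ids[0, 3:4])
print(probs.sum().item(), loss.item())        # probs sum to 1; training lowers loss
```

The point of the softmax output is exactly what Yann describes: the network never commits to one word, it scores every word in the finite vocabulary.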
Yeah. And the results are sort of amazing: you get text that sounds like text in a particular style, and it's coherent and grammatical. And it makes stupid mistakes — it makes absurd mistakes, certain kinds of mistakes. I was wondering: what does it say that it makes absurd mistakes? It's a question we've talked about before — whether it knows what it's talking about. And the sense is that if you merely predict words, there is no content there, in a purely predictive system.

Well, no — there is a little bit of content, but it's very superficial. The understanding of the world by those systems is superficial. There is some understanding, but it's very superficial. If you ask questions like "Is the fifth leg of a dog longer than the other four?", the thing will say yes or no — it won't tell you that dogs have four legs. And things of that type — they can't count, for instance. There are a lot of things they can't do, and it's because they don't have any common sense, and because all of this learning is not grounded in an underlying reality; it's basically just from text. The amount of knowledge about the world encoded in all the text those systems have been trained on — and it's billions of words — is limited: most of human knowledge is not represented in any text in existence.
For example: I take an object — I take my phone — and I put it on the table next to me, and I push the table. Because of your physical intuition, you know that when I push the table, the object on top of it will move with it. There is nothing in any text, anywhere in the world, that explains this. So a machine that is purely trained to predict missing words will not learn about this — and these are really basic facts about the world. So I'm one of those people who believe that truly intelligent systems will need to acquire knowledge that is grounded in some reality. It could be a simulated reality, it could be a virtual world, but it has to be an environment that has its own logic and its own constraints and its own physics.

So you think that, in principle, if you took those transformers and put them to work on videos — if you put language and video together — is that going to be a qualitative advance? And is that beginning to happen?
Well, yes and no. I think you should start with just video. Can you build a neural net — some learning system — that will watch video all day and basically learn basic concepts about the world just by watching video? The world is three-dimensional; there are objects in front of others; there are animate and inanimate objects; there are objects whose trajectory is completely predictable — inanimate objects; there are objects whose trajectory is not completely predictable in detail, like the leaves on a tree, but where there is some level of representation at which you can predict qualitatively what they do; and then there are objects that are very difficult to predict — animate objects, humans and things like that, or chaotic systems.

A lot of research goes into this, and essentially none of it works at the moment. There are a lot of systems that attempt to do video prediction, and basically attempt to learn representations of the world that, in some abstract way, can learn to predict what's going to happen in the video in the long term. They all work within a few frames of a video, a fraction of a second, but then the predictions go really bad. And the representations learned by those systems are not very good. We can test them by using the representation as input to, say, an object classification system, and measuring how many samples it takes for that system to learn the concept of "elephant." Does it still need three thousand training examples, or would it work okay with three or four? The answer is that, training from video, they don't work very well.
There is one thing that is starting to work, though, and that gives us a lot of hope: particular forms of self-supervised learning where you, in a way, simulate the video. You take an image, and you transform it in some way: you change the colors a little bit, you change the scale, the orientation, you translate it a little bit — you do some manipulation to it. This is called data augmentation. Then you train a neural net — or two neural nets, rather — to map those two images to the same representation, the same vector. Typically it's a vector, a list of numbers, typically a couple of thousand. That part is easy enough: you have two images that represent the same content, two views of the same object, and you train the system to learn a representation that captures the content of the image and not the details. The difficulty is: how do you make sure that when you show it two different objects, it produces different representations? That failure mode is called the collapse problem, and the big question is how you avoid it. There is a lot of really interesting work now on so-called contrastive methods, and also non-contrastive methods, to do this. I'm really excited about this area of research, because I think it's the germ — our best shot at a path towards getting machines to learn abstract representations, and predictive models where the prediction does not take place in the space of pixels, or whatever your input is, but in the space of abstract representations.
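As one concrete instance of the contrastive family — not necessarily the specific method Yann has in mind — here is an InfoNCE-style loss in PyTorch: the two views of each image are pulled together, while the other images in the batch act as negatives, which is what fights collapse:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss over a batch: view i of z1 should match view i of z2
    (same image, different augmentations) and mismatch every other image."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) cosine-similarity matrix
    targets = torch.arange(z1.size(0))          # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

# Toy usage with random vectors standing in for encoder outputs; in practice
# z1, z2 = encoder(augment(x)), encoder(augment(x)) for a batch of images x.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(contrastive_loss(z1, z2).item())
```

Non-contrastive methods avoid explicit negatives entirely and prevent collapse by other means, such as architectural asymmetries or regularizing the statistics of the embeddings.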
Do you think the architectures that exist now will eventually get there — that is, it's not going to take a different architecture?

I think the architecture is not the crucial question here. The architectural components are already here. What is not here is the whole learning paradigm. Let me take an example. We can build a giant neural net where you feed it a few frames of a video and train it to predict the next frame, or the next few frames. You can do this with least squares: just measure the sum of the squared differences between the values of the predicted pixels and the pixels that actually occur in the video. When you do this, the predictions you get from your neural net are blurry — you get very blurry images as the prediction, and the further into the future you let the system predict, the blurrier the predictions get. The reason is that you ask the system to make one prediction, and the system cannot predict a priori — if it's a video of us talking, it cannot predict whether I'm going to move my hands this way, or move my head to the left or to the right. So the only thing it can do is predict some sort of average of all the possible things that can happen, and that would be a very blurry version of us.
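A toy illustration of why least-squares prediction comes out blurry, assuming PyTorch: with two equally likely futures, the single prediction that minimizes expected squared error is their average, which matches neither. The 8-pixel "frames" are invented for the demo:

```python
import torch

# Two equally likely "futures" of the same clip, as toy 8-pixel frames:
# in one the bright spot moves left, in the other it moves right.
future_a = torch.zeros(8); future_a[1] = 1.0
future_b = torch.zeros(8); future_b[6] = 1.0

pred = torch.nn.Parameter(torch.zeros(8))     # the single least-squares prediction
opt = torch.optim.SGD([pred], lr=0.5)
for _ in range(300):
    opt.zero_grad()
    loss = 0.5 * ((pred - future_a) ** 2).mean() \
         + 0.5 * ((pred - future_b) ** 2).mean()
    loss.backward()
    opt.step()

print(pred.data)   # ~0.5 at both spots: a washed-out average of the two futures,
                   # i.e. a "blurry" frame that matches neither real outcome
```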
So that's not a good model of the world. Now, how do you represent the uncertainty in the prediction? In the case of text it's easy, because text is discrete: you can just have a distribution over which words are possible. But we don't have a good way of representing distributions over images, for example. That's why those techniques don't currently work for video prediction, and why whatever representations they learn are not very good.
And that's because of the sheer complexity? Is it a quantitative problem, or a point where quantity becomes quality?

No — it's not a question of not having enough data, or our computers not being powerful enough, or our neural nets not being big enough. It's a question of principle: what is the right objective function, what is the right way of representing uncertainty? And then there are technical questions, like how you prevent the collapse I was talking about — which I didn't explain very well, but which is a more technical issue. It has to do with representing uncertainty in the prediction: when you have an initial segment of a video, there are many, many plausible ways to continue it, and the machine has to represent all of those possibilities, or at least a good chunk of them.

And would it have to be predictive? That is, couldn't it work backwards — couldn't it work both ways? It doesn't even have to be forward-looking, the way I think the mind works: you get prediction very short-term, but by and large a lot of it is simply making sense after the fact, rather than predicting everything.

Yeah, sure. If you're a police inspector, you get to a scene and you have to basically figure out how it got there — the sequence of events that led to it. So yes — I used the example of prediction of a future event, but it could very well be prediction of things you do not currently perceive. You do not currently perceive the back of my head, but you have a pretty good idea what it looks like, and if I show it to you, you can correct your intuition — you're not very surprised. Maybe if I had a ponytail you'd be slightly more surprised. So it's not just prediction in time; it can be prediction in space — predicting a piece of an image from another piece, predicting a part of a percept that is currently occluded. And retrodiction: predicting the past, which you have not observed, from the present. So I'm not insisting that it has to be forward prediction. Forward prediction is useful for planning: if we're talking about a form of reasoning that is planning a sequence of actions to arrive at a particular result, then what you need is a forward prediction that predicts the state of the world when you take an action.
The amount of progress occurring in your field is just amazing — I have a hard time following it. I'm very envious. A few years ago I think I heard you describe a research program where we have the cherry but we don't have the cake. Do you recognize that?

Oh, absolutely. That has become a bit of a running joke in the community now. I use the analogy of the cake to make very concrete the fact that most of what we learn as humans and animals — and, in the future, what machines will learn — is learned in this kind of self-supervised manner, basically by watching the world go by, and by taking an action once in a while, but in a non-task-specific way: learning how the world works. Self-supervised learning is the bulk of the cake. If intelligence is a cake, the bulk of the cake is this type of learning; that's where we learn everything.

Then there is a thin layer of supervision: you're taught at school, or your parents teach you something, or you read a book — or you show a young child a picture book and say "that's an elephant," and with three examples the toddler has figured out what an elephant is. That's supervised learning — the icing, if you want. And then there is reinforcement learning, where you learn a new skill by trial and error, and you get rewarded by your own success or not — that's the cherry on the cake. It was a metaphor to tell people: we're currently focusing on the icing and the cherry, but we haven't figured out how to bake the cake yet.

And I use another joke, saying this is the dark matter of intelligence. It's kind of like physicists: they tell you dark matter exists, and most of the mass in the universe is dark matter, and they have no idea what it is — very embarrassing. Actually, for physicists it's even worse, because dark matter and dark energy combined represent something like 95 percent of the universe, and we have no idea what they are. That's really, really embarrassing. So we're in the same situation, which means we have a lot of work to do.

Very interesting. Is the progress in your field exponential?

Yeah — before you answer, I'm just watching the clock. You guys can go as long as you want; I'm just saying, whenever you're ready.

Just this question.

Okay, sounds good.
I think it's increasing, because there are more and more people joining the party, and more and more governments and companies investing money in it. So you see growth — you see growth in applications. You don't necessarily see exponential growth in new concepts, but new concepts are coming up at a surprisingly high rate. So one big question is: are we going to have another AI winter, like there were in the past? The answer is probably no — or at least not to the same extent as in the past — because there is a big industry now behind AI. It's useful for a lot of things. Take deep learning out of Google and Meta and a few other companies, and they crumble; they're completely built around it now. And certainly there are a lot of other areas as well. So I don't think it's going to collapse as it used to. The question is: are we going to make the next step — towards machine common sense, self-supervised learning, et cetera — before the people funding all of this get tired? I don't know the answer. I'm just hoping we make progress fast enough.

Fascinating question.
Thank you, Yann.

Thank you — thank you, Danny. It's always a pleasure chatting with you.

And thanks to both of you — it's amazing to hear you talk about this. I feel like I just sat in on a graduate-level course. We should have more psychology and AI together, because it's just amazing to hear you two bounce these concepts around. And in terms of energy, Yann, clearly there's energy in the field — it's amazing hearing you; no drop from the five years we've known each other, maybe even more. So thank you, Danny; thank you, Yann. It's so great having you here.

Let's do it again. A pleasure.

Yes, I'd love to do it again. Thank you.

All right, thank you — take care, Danny. Thank you, Nick Guateny, for doing the editing, Red Circle for hosting, and all of you, the listeners. We'll be back next Wednesday for another episode of the Big Technology Podcast. Until then, we'll see you.