Daniel Kahneman and Yann LeCun: How To Get AI To Think Like Humans (Full Episode)
Channel: Alex Kantrowitz
Published at: 2021-12-10
YouTube video id: oy9FhisFTmI
Source: https://www.youtube.com/watch?v=oy9FhisFTmI
Hello and welcome to the Big Technology Podcast, a show for cool-headed, nuanced conversations of the tech world and beyond. I'm pinching myself that we're about to have this conversation, because I think it's going to be the most fascinating we've had on the podcast, bar none, and let me tell you why. In 2017 I got a chance to spend some time with Yann LeCun. Yann was the head of Facebook AI Research (now you might know the company as Meta), and he's the chief AI scientist there. We spoke a lot about how you can take the human mind and make it artificial, turn it into an AI, and making a thinking machine is Yann's long-term goal. Then this summer I had some time off and decided to read a book I'd wanted to get to for a long time: Thinking, Fast and Slow, by the Nobel Prize-winning professor Daniel Kahneman. As I'm reading this book I start thinking: wow, Professor Kahneman describes the way the mind works in so many interesting ways I hadn't thought about, ways you'd imagine Yann would have to think about if he wants to build a thinking machine. So why don't we just get the two of them together? Obviously it was a pipe dream. Anyway, about a month ago I'm in the same room as both of them by some stroke of luck, and I propose: why don't we have this conversation? So I go up to Professor Kahneman, who we'll call Danny for this conversation, and say, hey, have you ever spoken with Yann? It turns out they've spoken many times, but those were private discussions, a handful of times; they've never had one in public. So we'll have it in public today. First of all I want to say: Yann and Danny, welcome to the show. It's great to have you both. Pleasure to be here. I think we should do the foundational stuff about what Yann's mission is, and a little bit on Danny's research, and then stay tuned for the second half, because Danny brought a list of questions he wants to ask Yann, and I'll take a backseat for that. But why don't we start here: Yann, I'd love to hear a little more about your ambition to build a thinking machine. What does that actually look like? Do you want to replicate the human mind one for one? No, I don't want to replicate the human mind; I want to understand intelligence. One of the most fascinating scientific questions of our time is: what is intelligence, how does the brain work? The other two fascinating questions of all time are what is life all about, and how does the universe work. These are big questions, and as a scientist and an engineer I consider that I don't really understand how something works unless I build it myself, or at least have some understanding of how to build it. So the purpose of AI is dual: one is to understand intelligence, and perhaps approach models of human intelligence if we want; the other, of course, is to construct interesting artifacts that can help people in their daily lives and make the world better. So there are really two purposes, the scientific one and the technological one. Right. And just a follow-up on that: I remember sitting with you in your office, and you were talking about
how you want to teach computers to predict, and if they can predict they can plan, and if they can plan they start to have some of the functionality we have as human beings. And there's this rush toward general intelligence, where folks like you and researchers at your level are attempting to build a thinking machine. That's right. Well, the first thing you can notice in the animal world is that there is no intelligence without learning, and in the engineering world it's almost true as well. It's probably because I'm either lazy or not smart enough that I think we, as human engineers, cannot directly conceive and construct an intelligent machine; we have to build a machine that can make itself intelligent through learning. That's how I got interested in learning, very early on, when I was an undergraduate. The next question is how to get machines to learn, and of course there's a long history of machine learning: supervised learning, starting with the perceptron, and things like that. What has become really clear over the last couple of decades (it was clear to people like Geoff Hinton even before that, but maybe it took longer for me) is that the types of learning we are currently able to reproduce in machines, supervised learning and reinforcement learning, do not seem to reflect what we observe in humans and animals. There is another type of learning, another paradigm, that seems to take place in humans and animals and allows them to learn how the world works, mostly by observation and a little bit by interaction. We accumulate enormous amounts of background knowledge about how the world works, and that connects with the things Danny talks about: if we have a model of the world, we can use it to plan, because we can imagine the consequences of the actions we take. That is what Danny calls System 2. Currently, what we can do with machine learning is more like System 1: here is an input, here is an output, and no reasoning is required. So I'm interested in trying to get machines to learn models of the world so that we can get them to reason, essentially. It's so amazing to have you both on the same show, because now we can go to Danny to talk about these concepts in terms of the way the human mind thinks, and then come back to you, Yann, to think about how we might get that into AI. So Danny, at the risk of having you go through the "free bird" speech that you make often, and for listeners who haven't read the book, can you describe what Yann is talking about in terms of the way human beings think, and say a little (I know it's a shortcut) about System 1 and System 2? I would say that when we describe human intelligence, we speak of a representation of the world, and it's the representation that leads to prediction. There is no shortcut from the data to the prediction; you go through a representation, which
includes how the system works; it includes the causal relations. This is what enables you to predict. It turns out that we do have such a model of the world, and it's not so much that we have specific expectations about what's going to happen next. What happens most of the time is that things happen and then we make sense of them, we go back and fit them into what happened before. Quite often, right now, you're not predicting what I'm going to say next, but I'm not going to surprise you by what I say; if I suddenly said "lumber," I would surprise you, because it doesn't fit. That's how the most interesting part of this works. And what is remarkable from the point of view of the psychologist, among many things that touch on what Yann is doing, is how little it takes, how quickly people learn; that, I think, is a fundamental puzzle. And it turns out that the representation of the world we have is hard to imagine completely without symbols; we do think symbolically, and how you represent symbols without using symbols is sort of a puzzle. So I would ask Yann the question I've asked him before: whether solving a generic learning problem is enough to get all of that, that is, to get a system that learns quickly and a system that has the kind of logic such that there is a certain kind of mistake that people just don't make. We have it that an object is not in two places at once, and other basic facts about the world, and there is a sort of certainty about them that seems difficult to achieve with a system that only learns approximately. So I was wondering how Yann is going to deal with that. Before we get into the answer, a quick bit of table setting: System 1 is the automatic stuff, the things we do without thinking, and System 2 is the slower, more effortful, more complex kind of thinking? Everything I was saying just now is System 1, basically, yes, because the representation of the world we have, and our ability to anticipate, or to feel unsurprised by what happens, which I think is more than anticipating, that is all System 1. It's automatic, it's effortless, and it's very quick. We will get to System 2, but solving System 1 seems to be the big one. So we'll start on System 1; your thoughts? Well, I agree with Danny that current AI systems are very specialized, and that makes them brittle, because they're trained for one task or maybe a collection of tasks. There's a move toward training relatively large systems on multiple tasks at once, not just one, because they tend to work better and require less data. But it is astonishing how fast humans and animals can learn new tasks when faced with a new situation. How is it that a teenager can learn to drive a car in about 10 or 20 hours of practice, or to fly an airplane in 10 to 20 hours? It's incredibly fast. If we were to use, say, reinforcement learning to train a self-driving car to drive itself, it would have to drive for millions of hours, cause
thousands of accidents, and destroy itself multiple times before it learned to drive, and probably not nearly as reliably as a human. So what's the difference? Obviously we can say humans rely on their background knowledge about the world, and that basically is the answer: we learn enormous amounts of background knowledge about how the world works. When we learn to drive, we don't have to learn that if we drive next to a cliff and turn the wheel to the right, the car will run off the cliff and nothing good will come of it. We don't need to try, because we know that from our model of how the world works, from intuitive physics and things like that. This type of model is what gives us, in my opinion, some sort of common sense; what we call common sense comes out of this, and it prevents us from making the really stupid mistakes Danny was talking about, which AI systems currently make. So how do we get machines to learn that? You can look at the stage in their life at which human babies learn basic concepts, like the difference between animate and inanimate objects, or whether an object put on a table will stay stable or fall. That is learned around the age of three or four months. The animate versus inanimate distinction also comes pretty early; object permanence comes very early as well, and some people claim it's innate, though that's not clear. Then there are concepts we take for granted, like the fact that objects that are not supported fall, that they're subject to gravity; babies learn this around eight to nine months. It takes a long time to understand that an unsupported object will fall, and a long time to understand momentum and things like that. In the first few months, babies have very little ability to act on the world; pretty much everything they learn is from observation only. By the time they are eight or nine months old, they have pretty much understood the physics of the world. And babies are relatively slow: take a baby cat, and it understands intuitive physics, and the dynamics of its own body, incredibly quickly. Of course it's not clear how much of this is hardwired, but there is clearly a lot of learning taking place. This type of learning is the one we don't yet know how to reproduce with machines, and that's why I focus my research on it: what is this type of learning that allows us to learn representations of the world, and predictive models of the world, that then allow us to learn new tasks very quickly, with very few trials or very few examples? It's a very hot topic in machine learning at the moment, actually. Well, I'm curious about one thing. There are two tentative answers that I know about (and I'm quite ignorant about it) to the problem of the speed of learning. One is that a lot is built in, and that looks very appealing, because some animals really come out of the womb ready to go: mountain goats know how not to fall as soon as they are born, so it's difficult to think of learning there. So nativism is part of the answer. Then there is
something I've been learning about, hierarchical Bayesian models, which is sort of different: there are logical structures that we are prepared for, to organize the world that we see, and that is part of what enables us to learn things very quickly. Now, I think you reject both of these. Well, the first one only partially, because it's obvious that, at least in certain species including humans, a lot of things are not necessarily completely hardwired, but we at least have an intrinsic motivation to learn them quickly, and it drives us to learn them quickly, essentially. That's certainly true for baby ibexes and mountain goats, and certainly for simpler animals like spiders. But even an ibex or a mountain goat still needs to learn a lot about the dynamics of its own body; it's not hardwired with that, maybe with the geometry but not with the details. You can think of this as just a few parameters to adjust, but I think what goes on in the brain of the animal is much more complex than that. So I think there is a lot of learning in there, even if we observe that the learning takes place really quickly. And there is some evidence that things a lot of people in cognitive science have thought were hardwired are in fact learned, because they can be learned really quickly. For example, it would be relatively simple for genetics to encode the fact that the visual cortex needs to be built with neurons that detect oriented edges, the kind of thing we find in the primary visual cortex. But in fact we now have a dozen different learning algorithms that, if we ran them in real time in a sort of simulated animal, would learn those oriented feature detectors from almost any natural images within minutes. So there is no point in hardwiring this if it can be learned within minutes. Same for face detection. You look at the brain and there are areas that light up when people are shown faces, and an easy conclusion is that this area of the brain specializes in face detection, that it's hardwired, that face detection or face recognition is innate. But again, face detection can be learned in minutes: if you're a baby, your vision is bad and your focus is basically fixed at a relatively short range, so the only things you see during the first weeks of your life are faces and nipples, essentially. And then you have a hardwired tendency to pay attention to motion, so within minutes you're going to have a face detector in your visual cortex; any reasonable learning algorithm will learn this extremely quickly. So you don't need it to be innate if it can be learned. That's the first thing about innateness. Is the fact that the world is three-dimensional innate, for example? It would seem to make sense, since all of evolution has taken place in a three-dimensional world, to hardwire that in the cortex. But then, if
you're an AI scientist or engineer, the question is: how would I even define an architecture that has that hardwired? I don't even know how to do it, so it's not clear you can actually encode this in the genome. On the other hand, you can learn it really quickly, because the fact that every point in the world has a depth is the best explanation for how your view of the world changes when you move your head, or when you correlate the two views from your left and right eyes. Learning very basic things like this is very simple for whatever learning algorithm our brain uses. So that's the first part, the nature versus nurture debate, if you want. Then there is the second question, about symbols: is it necessary to have explicitly hardwired mechanisms in our brain that allow us to do things like symbolic manipulation or reasoning? I find it hard to believe that they need to be hardwired. The first question to you, Danny, which I've asked many people who come up with that question, is: do you view the type of reasoning that, say, great apes do as symbolic? Or what monkeys do, or dogs, or cats, or an octopus; do they do symbolic reasoning? And can we also, just for table setting, define what symbolic reasoning is? I don't know how to define it; I'm not a big advocate of the very existence of symbolic reasoning. Okay, we'll toss it over to Danny. I think that when we're talking about symbols we're really talking about logical relations. That's not an easy one to explain so that we understand each other, but I think a characteristic of symbolic reasoning is a certain discreteness, and it's not completely compatible with what you have talked about. This is interesting, what is happening here, because I would push back on your idea that everything has to be learned, or can be learned very easily; I would push you down the animal chain, where I think it becomes highly questionable, and you are doing the same thing to me with symbols, by asking how far down symbols go. The standard answer is that it's connected with language, that the symbolic system and the language system are related, and that therefore other animals don't have symbols the way we do. Whether that is really compelling, I'm not sure. In the hierarchical Bayesian models, which I'd like to get your evaluation of, those are supposed to take care of perceptual learning, and the symbolic element, the logical element, is sort of upstream of that: what categories of things there are, what categories of structure there are, that you're about to learn. Right, so there are a number of very interesting questions there. The first one is what we really mean by symbols. If by symbol we mean the ability to form discrete categories, to represent discrete categories in our mental representation system, then I think pretty much every brain has some sort of symbolic representation. The reason I think this is that discretizing concepts, having identified categories, makes memory more robust, because it
makes representations error-correctable. If I write a zip code or a credit card number and you make a mistake on one digit, that mistake can be detected, because a single error on a credit card number breaks a self-consistency check. Credit card numbers are not only discrete because they're numbers; valid numbers are actually some distance apart from each other, so if you make a small error you can snap it back to the correct one, you can detect the mistake. That's the basics of error correction. So just for the purpose of being able to do error correction, and for the purpose of being able to store discrete concepts in memory efficiently, you need discrete representations. And now comes the big conundrum. In a way you need symbols, and if that's the definition of symbols, then I think dogs have symbols. I don't know if they reason logically, though I guess they have some logic; I think cats and dogs and probably even mice have these kinds of symbols. And what's a discrete representation? Okay, I can give you an example: it's the difference between a real number and a natural number. Let's say I want to send a number to you, and the only thing I have is an electric wire, so I can send a voltage. I send, say, three volts, but at your end it may not be exactly three volts, because there is resistance, impedance, noise, parasitic effects, whatever. What you observe is some fluctuating value around three volts, maybe shifted a little, but it's not going to be shifted all the way to four volts or two volts. Because you know that the signals I want to transmit are either zero, one, two, three, or four, you know it's a three. So despite the fact that there is noise in the transmission, and despite the fact that the signal is continuous, you snap it back to its correct value. And then, to store that value somewhere, because there are only five possible values, you can store it with a small number of bits, whereas if you had to store a precise voltage it would require many bits. So discretization makes memory more efficient, and it makes transmission, either inside the brain or between agents, reliable. It also gives you the possibility of associative memory. The brain represents things with voltages, to a first-order approximation, and there may be noise in that representation; you may see a partial view of an image and reconstruct the whole image because of your knowledge of what the thing is supposed to look like. You may never have seen the left side of my face, but even so you probably have a pretty good idea of what it looks like, because of your general model of a human face and the fact that faces are mostly symmetrical.
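As a toy illustration of the voltage-snapping idea Yann describes (all values invented for illustration), here is a short sketch: if the sender is only allowed five discrete voltage levels, the receiver can round a noisy reading back to the nearest allowed level and recover the message exactly, as long as the noise stays under half the spacing between levels.

```python
import random

LEVELS = [0, 1, 2, 3, 4]  # the only voltages the sender is allowed to transmit

def transmit(level, noise=0.3):
    """Simulate a noisy wire: the received voltage drifts around the sent level."""
    return level + random.uniform(-noise, noise)

def snap(voltage):
    """Error-correct by snapping the reading to the nearest allowed level."""
    return min(LEVELS, key=lambda v: abs(voltage - v))

message = [3, 0, 4, 1, 2]
received = [transmit(v) for v in message]
decoded = [snap(v) for v in received]

print(received)            # noisy readings, e.g. values near 3, 0, 4, 1, 2
print(decoded == message)  # True: noise below half the level spacing is absorbed
# Storing one of 5 discrete levels needs only about log2(5) ~ 2.3 bits,
# whereas storing the exact analog voltage would need many more.
```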
The face reconstruction is another example of snapping a noisy signal back to a clean one, because of the structure, maybe discrete, maybe not, but at least the internal structure of what you're looking at, which your brain has captured. So that's why discrete symbols are interesting. It's the same reason, by the way, why all modern communication is digital. It used to be analog: to communicate we would call each other on the phone and a voltage would be transmitted directly from my phone to your phone. Now we've replaced this with digital communication, with bits, because it's more efficient, more robust to noise, and has all kinds of other advantages. Our brains actually use digital communication internally: neurons send spikes to communicate with each other, and the reason is that it's easier to regenerate a binary signal than an analog signal, and it's more efficient energetically. So that may explain why symbols emerge in representations of the world just for reasons of efficiency, including in animals that do not have language. Now, if you do want language: language is basically an approximate way of representing the complex data structures we have in our minds and serializing them, so that we can put them on a single one-dimensional signal, which is sound pressure, or sequences of words, which are symbols. And again, if I'm talking to you through a microphone, it goes through the internet, and there is some noise attached to that process; it's pretty high quality, but there is still some noise. Because our language uses discrete words, we can have error-correcting communication: even if you don't completely catch all of the syllables or phonemes I'm pronouncing, you can still recover what I meant, because there is only a finite number of words. You may not have heard every syllable, but you snap it back to whatever makes sense in the context. In fact, speech recognition systems work this way: they have a language model, and even if they don't catch every sound they can reconstruct what was meant. In my case it's even harder because of my accent. So language has to be discrete because it needs to be noise resistant, essentially. And perhaps the appearance of symbolic language was facilitated by the fact that we need to form the equivalent of symbols, discrete categories, in our brains for efficient storage and for associative memory recovery. So it's quite possible that, in the right conditions, animals would have more linguistic capability than we give them credit for, and a number of experiments on this have been done with monkeys and parrots and so on.
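A toy sketch of that snap-to-the-nearest-word idea (the vocabulary and the garbled input are made up for illustration): with a finite vocabulary, a corrupted word can be error-corrected by picking the closest known word, here measured with a simple string-similarity match rather than a real language model.

```python
import difflib

VOCAB = ["lion", "chases", "the", "wildebeest", "antelope",
         "in", "savanna", "kitchen", "mouse", "cat"]

def snap_word(heard, vocab=VOCAB):
    """Map a possibly garbled word to the closest word in the finite vocabulary."""
    return difflib.get_close_matches(heard, vocab, n=1, cutoff=0.0)[0]

garbled = ["lyon", "chasez", "teh", "wildebeast"]
print([snap_word(w) for w in garbled])   # -> ['lion', 'chases', 'the', 'wildebeest']
```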
So in some sense the discretization itself is built in; that has to be. But what you're saying is that there is no content, or very little content, built into the human brain at inception, is that right? Yeah, well, the big question is where the meaning of those discrete entities comes from, and I certainly do not believe that the meanings of the discrete entities are predetermined in the human mind, or in the animal mind either; those are completely learned. So here comes the conundrum. You're talking about hierarchical Bayesian systems. In a sense, multi-layer deep neural nets are hierarchical systems if you view them the right way, and there are certain forms of them that are explicitly Bayesian. The main issue, in the classical approaches to Bayesian modeling, is that the concepts in a hierarchical Bayesian graph (a Bayesian network, or a graphical model, people give them different names) have to be designed by the human designer when they build those systems. And that is clearly not what happens in the brain; those concepts are learned. So the basic entities that are manipulated, those discrete concepts, are learned. Sorry to ask, but can you, for the general audience, describe hierarchical Bayesian systems? Briefly: it means different things to different people, but there is a classical view called Bayesian networks, where you basically have a graph. A graph, in the mathematical sense, is a collection of nodes linked by edges, and each node represents a variable. Here's a classical example people teaching courses on AI like to cite: one node represents whether your house is shaking; another node indicates whether a truck has hit your house; another indicates whether you're in California and there was an earthquake. You can establish a causal relationship, or at least a dependency, between those nodes: if my house just jolted, it's either because a truck hit it or because there was an earthquake, and there are probably not many other reasons for that to happen. Now, if I look out the window and see that a truck has just hit the house, that immediately lowers my estimate of the likelihood that there was a simultaneous earthquake, because the probability of both happening at the same time is very low. That's called "explaining away." You can imagine building a network of such nodes where each node has a particular meaning: some nodes would be symptoms you can observe in a person, and the underlying nodes that causally influence them would be infection by a particular bug, or whatever. You can imagine building those things and then doing inference: you know the values of some nodes, and you infer the probability that the other nodes you don't observe take particular values. I'm observing this symptom and that symptom, I've measured the blood pressure and whether the person's tummy hurts in a particular way, and I conclude that it's probably appendicitis, or whatever. So you can imagine building systems that way.
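A tiny numeric sketch of that explaining-away effect (all probabilities are invented for illustration): with a three-node network Truck -> HouseJolts <- Earthquake, observing the jolt alone makes an earthquake plausible, but additionally observing the truck pushes the earthquake's probability back down.

```python
from itertools import product

# Invented priors, purely illustrative.
P_TRUCK = 0.001   # prior probability a truck hits the house
P_QUAKE = 0.001   # prior probability of an earthquake

def p_jolt(truck, quake):
    """P(house jolts | truck, quake): either cause is usually enough."""
    return 0.9 if (truck or quake) else 0.0001

def joint(truck, quake):
    """Joint probability of a cause assignment together with an observed jolt."""
    pt = P_TRUCK if truck else 1 - P_TRUCK
    pq = P_QUAKE if quake else 1 - P_QUAKE
    return pt * pq * p_jolt(truck, quake)

def prob_quake(evidence):
    """P(quake = 1 | jolt = 1, evidence) by brute-force enumeration of the causes."""
    num = den = 0.0
    for truck, quake in product([0, 1], repeat=2):
        if evidence.get("truck", truck) != truck:
            continue
        p = joint(truck, quake)
        den += p
        num += p if quake else 0.0
    return num / den

print(prob_quake({}))             # ~0.47  : the jolt alone makes an earthquake plausible
print(prob_quake({"truck": 1}))   # ~0.001 : seeing the truck "explains away" the earthquake
```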
In fact, systems in the 1970s were built on exactly that model, so-called expert systems or probabilistic expert systems, and what we now call Bayesian networks and hierarchical Bayesian models are descendants of this, if you want. Now, this did not involve any learning: all those nodes were built entirely by hand, completely engineered, and that necessity was one of the reasons for the decline of interest in good old-fashioned AI. These were, in the early years, the Bayesian approaches. Yes, and the Bayesian folks could be completely logical, with hard decisions, but the fact that you had to hand-engineer those entire systems from scratch is a big part of what killed them. So the alternative is learning. But now you have the question: how do you learn those concepts? Say you want to do computer vision: the nodes at the bottom level are pixel values. What would be the right nodes just above that, representing good combinations of pixel values, and how do you learn that? It turns out those combinations would be things that detect oriented edges, which is what you observe in the brain and what convolutional neural nets actually learn; but that has to be learned. So then the big conundrum is: if a system is to learn and manipulate discrete symbols, and learning essentially requires things to be continuous, how do you make those two things compatible with each other? And the answer is, I don't know. Maybe the answer is that we are giving a little too much importance to logical reasoning. There's one thing Geoff Hinton says, which I agree with to a large extent: a lot of reasoning, certainly in animals and in humans, is not logical reasoning; it's basically simulation, or analogical reasoning, which is similar. You're a lion in the savanna chasing a wildebeest, and you have to predict the trajectory of the wildebeest and work with the other lionesses to chase the animal in the right way. That requires a kind of simulation of the animal you're chasing: the best way to predict how the animal you're chasing is going to act is to have an internal model of it that you can simulate. It's the same in a more human situation: you want to build a widget, say a box out of planks; you have to have a model in your mind of what would result from assembling those planks, how solid it's going to be if you use glue or nails or screws or more complex carpentry. So the key element in this form of intelligence is the ability to build predictive models of the world, and that's what we're missing; that's what we need to figure out how to do with machines. And you can do this, and I find this very convincing actually, you can do this without symbols, without an explicit representation of causality; you just observe, and from the observation you can go directly to prediction. Right, there is going to be some causality involved; the question is whether you need an explicit mechanism for causal inference. But if your model of the world includes the prediction of the next state of
the world, given the previous state and given your action, then you're going to build a causal model, because you know that your action, or the action you observe from another agent, tells you: if I take that action from this state, I will get this result. So you will be able to establish causal models if you can act, or if you can determine that another agent has acted and you observe the result. Mechanisms of that kind certainly exist in human physiology. Any movement of the eye involves a prediction of what the world will look like when your eye moves, and there are those beautiful old experiments where curare is used to paralyze the eye muscles so people can't move their eyes; when they intend to move their eyes, the world moves, so they are anticipating. And nobody would claim that any symbol or any causality is required to do that. So that would be your model. In fact, your perception of the world is not the world as it is; it's the world as it's going to be, because there is about a 100-millisecond delay between what you see and how your brain interprets it. So your brain predicts a hundred milliseconds into the future: the estimate of the world that you are consciously aware of was predicted from your perception a tenth of a second ago. And that takes place everywhere. One of the main current theoretical concepts in systems neuroscience, or computational neuroscience, is this idea of predictive coding, where basically everything in the brain is trying to predict everything else in the brain: future actions, future states of other parts of the brain, and so on. Now, this doesn't mean that the people pushing this kind of theory know how the brain works, because between a general concept of this type and its reduction to a practical algorithm you could implement on a machine, there is a huge distance that has not yet been bridged. That's a big program for science over the coming years. I was curious earlier, and it's a question I neglected to ask, about face detection and the existence of a face system. It seems to me that seeing your mother's face a lot doesn't really equip you to distinguish between faces, and that's what the face system does: it doesn't only recognize "this is a face," it is actually specialized in identifying faces. That ability to learn a distinctive face very quickly, it's the same problem again; I don't quite see how that gets done. I don't know. Historically, in computer vision, face detection preceded reliable face recognition by a decade or two: we had reliable face detectors in the early 2000s, and reliable face recognition didn't appear until roughly 15 years later, with deep learning methods, actually. And it works surprisingly well: you don't need a big neural net to identify faces, and to have a system that recognizes more faces than any human can, with the same level of accuracy, maybe not with the same robustness to
changes in pose and facial hair and things like that, but in terms of the number of different people it can put a name on, these systems are incredibly good, incredibly reliable. So it may not be as hard a task as we thought. I had thought this was going to be a discussion about how we could take what we learn from the human brain and turn it into AI, but it seems we're actually talking a bit about how we can learn about the human brain from the construction of AI. I think that is actually happening. But there is an interesting question to me as a psychologist; it's a version of the Turing test: what would it take for a computer to fool you into thinking it's human? I was thinking that one characteristic is mistakes that strike us as absurd, mistakes that seem to violate basic constraints; I mentioned earlier an object not being in two places at once. There is, I think, a finite set of absurd mistakes, or am I wrong? And would it be of any interest to have a system that avoids the mistakes people consider absurd and makes only the kind of mistakes people tolerate, if there is such a category of absurd mistakes in your field? Well, there are different kinds of mistakes. There are computer vision systems that make stupid mistakes because, in the past, maybe 15 years ago, computer vision systems were okay at recognizing objects in images, but they were sometimes confused by the context; actually, they were not confused by the context, they just were not using the context at all. So a system would make the mistake of recognizing a face that wasn't really a face, when from the context you could tell it wasn't a face, or would recognize an object that had the shape of, say, a cow, when you can tell from the context that it cannot possibly be a cow if you are a human. That was a big criticism of computer vision systems: they don't take context into account. Now the criticism is that they take too much context into account. There is this famous example: you take a modern vision system, a system of a few years ago, and you train it to recognize things like cows and trains and airplanes and cars by showing it thousands of examples of each category; then you show it a cow on a beach, and it can't recognize it as a cow, because every example of a cow it has seen was on green pasture. The system learns to use the context, and it uses it too much: the context had a spurious correlation with the category, and the system doesn't have the common sense to say, well, that's a cow regardless, because a cow could very well be sitting on a beach. So then how do you fix that, and how do you get that common sense? That seems to be the crux. Exactly. I don't think this is going to be solved, in my opinion, by more tweaks to the architectures and more training data; it may be mitigated, but I don't think it will be fixed by that.
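A small synthetic sketch of that spurious-correlation failure (hypothetical features, using scikit-learn's logistic regression): during training the "background" feature is perfectly correlated with the label, so the classifier leans on it, and accuracy collapses on test data where the cow shows up against a beach background.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Training data: label 1 = "cow", 0 = "not cow".
y_train = rng.integers(0, 2, n)
object_cue = y_train + rng.normal(0, 2.0, n)   # weak, noisy cue from the object itself
background = y_train.astype(float)             # "pasture vs beach", spuriously equal to the label
X_train = np.column_stack([object_cue, background])

clf = LogisticRegression().fit(X_train, y_train)

# Test data: same objects, but the background is flipped (cows now appear on the beach).
y_test = rng.integers(0, 2, n)
X_test = np.column_stack([y_test + rng.normal(0, 2.0, n), 1.0 - y_test])

print("train accuracy:", clf.score(X_train, y_train))  # high: background gives the answer away
print("test accuracy :", clf.score(X_test, y_test))    # poor: the model trusted the background
```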
I think it's going to be fixed by systems that basically learn models of the world and then have the ability to tell that it's perfectly possible for a cow to be on the beach. That's the compositional nature of the world: combinations of things you've never seen can very well happen. And is that already happening? Not really. But there is a very interesting thing happening, which is, as I mentioned earlier, one of the hottest topics in AI today, or in machine learning at least: this idea of self-supervised learning, which I've been a big advocate of for many years. It's the idea that you train a system not to solve a particular task, but to learn to represent the world in a sort of semi-generic way; basically, you train it to predict. This works wonderfully for natural language understanding systems. The standard way, which has become standard over the last three years, of training a natural language understanding system is to take a sequence of words, a sentence or something longer, maybe a few hundred or a thousand words, mask about ten or fifteen percent of those words by replacing them with a blank marker, and train some giant neural net to predict the words that are missing. In the process of doing so, the network learns the nature of language, if you want. It learns that if you say "the lion chases the blank in the savanna," the blank is likely to be something like a wildebeest or an antelope; but if you say "the cat chases the blank in the kitchen," it's probably a mouse. So a lot of semantics about the world, certainly the syntax but also the semantics, ends up being represented by this network that is just trying to predict missing words. Of course the system can never predict exactly which word is missing; it could be antelope or wildebeest or zebra. But it can easily represent a distribution over words, a probability distribution, because there is only a finite number of words in the dictionary: you just have a number, a score, for how likely each particular word is to appear, and that's how those systems handle the uncertainty in the prediction. And the results are sort of amazing: you get text that sounds like text in a particular style, it's coherent and grammatical, and yet it makes very stupid, absurd mistakes. I was wondering why it makes absurd mistakes. It's a question we've talked about before, whether it knows what it's talking about, and the sense is that if you merely predict words, there is no content there in a purely predictive system. Well, I don't know; there is a little bit of content, but it's very superficial. The understanding of the world by those systems is superficial. If you ask a question like "is the fifth leg of a dog longer than the other four," the thing will say yes or no; it won't tell you that dogs have four legs. And they can't count; there are a lot of things they can't do.
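Below is a minimal sketch of the masked-word training scheme described above, in PyTorch (toy vocabulary, tiny model, invented sentences; real systems use enormous corpora and networks). Roughly 15% of the tokens are replaced by a mask id, and the network is trained to output a probability distribution over the vocabulary at those positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy corpus and vocabulary, purely illustrative.
sentences = [
    "the lion chases the wildebeest in the savanna",
    "the cat chases the mouse in the kitchen",
]
words = sorted({w for s in sentences for w in s.split()})
vocab = {w: i for i, w in enumerate(words)}
MASK_ID = len(vocab)                      # extra id for the blank marker
vocab_size = len(vocab) + 1

def encode(s):
    return torch.tensor([vocab[w] for w in s.split()])

class TinyMaskedLM(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, dim_feedforward=64,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d, vocab_size)   # one score per vocabulary word

    def forward(self, tokens):
        return self.out(self.encoder(self.emb(tokens)))

model = TinyMaskedLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    batch = torch.stack([encode(s) for s in sentences])   # (2, sequence length)
    mask = torch.rand(batch.shape) < 0.15                  # hide ~15% of the words
    if not mask.any():
        continue
    corrupted = batch.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    # Train only on the masked positions: predict the word that was hidden.
    loss = F.cross_entropy(logits[mask], batch[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()

# Query the distribution over the blank in "the cat chases the [MASK] in the kitchen".
model.eval()
query = encode("the cat chases the mouse in the kitchen")
query[4] = MASK_ID
probs = F.softmax(model(query.unsqueeze(0))[0, 4], dim=-1)
inv = {i: w for w, i in vocab.items()}
top = torch.topk(probs, 3).indices.tolist()
print([inv.get(i, "[MASK]") for i in top])   # with luck, "mouse" ranks near the top
```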
And that's because they don't have any common sense, and because all of this learning is not grounded in an underlying reality; it's basically just from text. The amount of knowledge about the world encoded in all the text those systems have been trained on, billions of words, is limited: most human knowledge is not represented in any text in existence. For example, I take an object, my phone, I put it on the table next to me, and I push the table; because of your physical intuition you know that when I push the table, the object on top of it will move with it. There is nothing in any text anywhere that explains this, so a machine purely trained to predict missing words will not learn it, and these are really basic facts about the world. So I'm one of those people who believe that truly intelligent systems will need to acquire knowledge that is grounded in some reality. It could be a simulated reality, a virtual world, but it has to be an environment that has its own logic, its own constraints, and its own physics. So do you think that, in principle, if you took those transformers and put them to work on videos, if you put language and video together, that would be a qualitative advance, and is that beginning to happen? Yes and no. I think you should start with just video: can you build a neural net, or some learning system, that will watch video all day and basically learn basic concepts about the world just by watching? The world is three-dimensional; there are objects in front of others; there are animate and inanimate objects; there are objects whose trajectories are completely predictable (inanimate objects), objects whose trajectories are not completely predictable, like the leaves on a tree, though at some level of representation you can qualitatively predict what they do, and then objects that are very difficult to predict: animate objects, humans, chaotic systems, and so on. A lot of research goes into this, and essentially none of it works at the moment. There are many systems that attempt video prediction and attempt to learn representations of the world that, in some abstract way, can predict what's going to happen in a video over the long term. They all work within a few frames of video, a fraction of a second, but then the predictions go really bad, and the representations learned by those systems are not very good. We can test them by using the representation as input to, say, an object classification system, and measuring how many samples it takes to learn the concept of "elephant": does it still need three thousand training examples, or would it work with three or four? The answer is that, trained from video, they don't work very well. There is one thing that is starting to work, and that gives us a lot of hope: particular forms of self-supervised learning where, in a way, you simulate a video: you take an image and you
transform it in some way: you change the colors a little, you change the scale or the orientation, you translate it a little, you do some manipulation to it; this is called data augmentation. Then you train a neural net, or rather two neural nets, to map those two images to the same representation, the same vector, typically a list of a couple thousand numbers. That part is easy enough: you have two images that represent the same content, two views of the same object, and you train the system to learn a representation that captures the content of the image and not the details. The difficulty is: how do you make sure that when you show it two different objects, it produces different representations? This is called the collapse problem, and the big question is how you avoid it. There is a lot of really interesting work now on what are called contrastive methods, and also non-contrastive methods, to do this, and I'm really excited about this area of research, because I think it's our best shot at a path toward getting machines to learn abstract representations and predictive models, where those predictive models will not operate in the space of pixels, or whatever your input happens to be, but in the space of abstract representations.
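A toy sketch of that joint-embedding idea (a simplified loss of my own construction for illustration, loosely in the spirit of the non-contrastive methods mentioned here, not the exact method any group uses): two augmented views of the same image are pushed toward the same embedding, while a variance term keeps the embeddings from collapsing to a single constant vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(   # stand-in for a real vision backbone
    nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU(), nn.Linear(256, 64)
)

def augment(x):
    """Crude 'data augmentation': add pixel noise and randomly flip horizontally."""
    x = x + 0.05 * torch.randn_like(x)
    return torch.flip(x, dims=[-1]) if torch.rand(()) < 0.5 else x

def joint_embedding_loss(z1, z2, var_weight=1.0):
    invariance = F.mse_loss(z1, z2)              # two views -> the same representation
    # Anti-collapse term: encourage each embedding dimension to keep some spread.
    std = torch.sqrt(z1.var(dim=0) + 1e-4)
    variance = torch.mean(F.relu(1.0 - std))
    return invariance + var_weight * variance

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
images = torch.rand(128, 3, 32, 32)              # fake unlabeled images

for step in range(100):
    batch = images[torch.randint(0, len(images), (32,))]
    z1, z2 = encoder(augment(batch)), encoder(augment(batch))
    loss = joint_embedding_loss(z1, z2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```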
Do you think the architectures that exist now will eventually get there, that is, that it's not going to take a different architecture? I think the architecture is not the crucial question here; the architectural components are already here. What is not here is the whole learning paradigm. Let me take an example. We can build a giant neural net where we feed it a few frames of a video and train it to predict the next frame, or the next few frames. You can do this with least squares: just measure the sum of the squared differences between the predicted pixel values and the pixels that actually occur in the video. When you do this, the predictions you get from your neural net are blurry, and the further into the future you let the system predict, the blurrier they get. The reason is that you ask the system to make a single prediction, and it cannot predict a priori, if it's a video of us talking, whether I'm going to move my hands this way, or move my head to the left or to the right. So the only thing it can do is predict some sort of average of all the possible things that could happen, and that is a very blurry version of us, which is not a good model of the world. So how do you represent the uncertainty in the prediction? In the case of text it's easy, because text is discrete: you can just have a distribution over which words are possible. But we don't have a good way of representing distributions over images, for example. That's why those techniques don't currently work for video prediction, and whatever representations they learn are not very good. And is that because of the sheer complexity? Is it a quantitative problem, or is it a point where quantity becomes quality? No, it's not a question of not having enough data, or our computers not being powerful enough, or our neural nets not being big enough. It's a question of principle: what is the right objective function, what is the right way of representing uncertainty, and then technical questions like how you prevent the collapse I was talking about, which I didn't explain very well, but which is a more technical issue. It has to do with representing uncertainty in the prediction: when you have an initial segment of a video, there are many plausible ways to continue it, and the machine has to represent all of those possibilities, or at least a good chunk of them. And would it have to be predictive? That is, could it not work backwards? It could work both ways; it doesn't even have to be prediction, the way I think the mind works. You get prediction in the very short term, but by and large a lot of it is simply making sense after the fact, rather than predicting everything. Yes, if you're a police inspector, you get to a scene and you have to figure out how it got there, the sequence of events that led to it. I used the example of predicting a future event, but it could very well be prediction of things you do not currently perceive. You do not currently perceive the back of my head, but you have a pretty good idea what it looks like, and if I show it to you, you can correct your intuition; you're not very surprised, though maybe if I had a ponytail you'd be slightly more surprised. So it's not just prediction in time; it can be prediction in space, predicting a piece of an image from another piece, predicting a part of a percept that is currently occluded, and retrodiction, predicting the past, which you have not observed, from the present. So I'm not insisting that it has to be forward prediction. But forward prediction is useful for planning: if we're talking about a form of reasoning that is planning a sequence of actions to arrive at a particular result, then what you need is a forward model that predicts the state of the world when you take an action.
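To make that last point concrete, here is a minimal sketch (with an invented toy dynamics function standing in for a learned world model): given a forward model that predicts the next state from the current state and an action, planning can be as simple as simulating many candidate action sequences in imagination and keeping the one whose predicted outcome is best, a crude random-shooting planner.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_model(state, action):
    """Toy stand-in for a learned world model: predicts the next state from state and action."""
    return state + action                 # a trivial one-dimensional "world"

def cost(state, goal=1.0):
    """How far a predicted final state ends up from the goal."""
    return abs(state - goal)

def plan(state, horizon=5, n_candidates=200):
    """Random-shooting planner: imagine rollouts with the model, keep the best action sequence."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=horizon)   # a candidate sequence of actions
        s = state
        for a in seq:                            # roll the model forward in imagination
            s = forward_model(s, a)
        if cost(s) < best_cost:
            best_seq, best_cost = seq, cost(s)
    return best_seq

actions = plan(state=0.0)
final = 0.0
for a in actions:
    final = forward_model(final, a)
print("predicted final state:", round(final, 3))   # typically lands close to the goal of 1.0
```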
The amount of progress occurring in your field is just amazing; I have a hard time following it; I'm very envious. A few years ago I think I heard you say, about a research program, that we have the cherry but we don't have the cake. You recognize that? Oh, absolutely; this has become a bit of a running joke in the community. I use the analogy of the cake to make very concrete the fact that most of what we learn as humans and animals, and what machines will learn in the future, is learned in this self-supervised manner: basically by watching the world go by, and by taking an action once in a while, but in a non-task-specific way, learning how the world works. Self-supervised learning is the bulk of the cake: if intelligence is a cake, the bulk of it is this type of learning; that's where we learn almost everything. Then there is a thin layer of supervised learning: you're taught at school, or your parents teach you something, or you read a book, or you show a young child a picture book and say "that's an elephant," the child says "elephant," and with three examples the toddler has figured out what an elephant is. And then there is reinforcement learning, where you learn a new skill by trial and error and are rewarded by your own success or failure. That was the cherry on the cake, and supervised learning was the icing, if you want. The metaphor was a way to tell people: we're currently focusing on the icing and the cherry, but we haven't figured out how to bake the cake yet. And I joke that this is the dark matter of intelligence, because it's kind of like the physicists: they tell you dark matter exists, that most of the mass in the universe is dark matter, and they have no idea what it is; it's very embarrassing. For physicists it's even worse, because dark matter and dark energy together represent something like 95 percent of the energy content of the universe, which is really embarrassing, and we have no idea what it is. We're in the same situation, which means we have a lot of work to do, and that's what makes it interesting. Is the progress in your field exponential? Before you answer, I'm just watching the clock; you two can go as long as you want, just say whenever you're ready. Oh, just this question. Okay, sounds good. I think it's increasing, because more and more people are joining the party, and more and more governments and companies are investing money in it, so you see growth. You see growth in applications; you don't necessarily see exponential growth in new concepts, but new concepts are coming up at a surprisingly high rate. One big question is whether we are going to have another AI winter, like in the past, and the answer is probably no, or at least not to the same extent, because there is now a big industry behind AI; it's useful for a lot of things. Take deep learning out of Google and Meta and a few other companies, and they crumble; they're completely built around it now, and there are certainly many other areas as well. So I don't think it's going to collapse the way it used to. The question is whether we will make this next step toward machine common sense, self-supervised learning, and so on, before the people funding all of this get tired, and I don't know the answer; I'm just hoping we make progress fast enough. Fascinating question. Thank you, Yann. Thank you. Thank you, Danny, it's always a pleasure chatting with you. And thanks to both of you; it's amazing to hear you talk about this. I feel like I just sat through a graduate-level course. We should have more psychology and AI together, because it's just amazing to hear you two bounce these concepts around. And in terms of energy, Yann, clearly there's energy in the field; it's amazing hearing you, with no drop-off from the five years we've known each other, maybe even more so.
Thank you, Danny; thank you, Yann. So great having you here. Yes, let's do it again. A pleasure; I'd love to do it again. Thank you. All right, thank you, take care, Danny, thank you. Thanks to Nick Guateny for doing the editing, to RedCircle for hosting, and to all of you, the listeners. We'll be back next Wednesday for another episode of the Big Technology Podcast; until then, we'll see you.