Editors: Rita Chen, Chain Zhang, Qintong Wu, Jiaxin Su
Intro: Professor Richard Sutton is considered to be one of the founding fathers of modern computational reinforcement learning. He made several significant contributions to the field, including temporal difference learning, policy gradient methods, and the Dyna architecture.
Surprisingly, the first field Dr. Sutton studied was not computer science. He earned his B.A. in Psychology and only then turned to computer science. He does not see this as a change of direction, however. 「I was interested in how learning works, which is what most psychologists are concerned with, and I got my Psychology degree in 1977. At that time learning was not popular in computer science. Since I was curious about AI and anything related to it, I pursued computer science for my Master’s and then my PhD. My view of AI is coloured by psychology, by human and animal learning. That is my secret weapon, because many people in AI don’t have this background. I started there and got a lot of inspiration from psychology,」 Dr. Sutton said.
In 1984 Dr. Sutton held a postdoctoral position at the University of Massachusetts at Amherst. From 1985 to 1994, he was a Principal Member of Technical Staff in the Computer and Intelligent Systems Laboratory at GTE Laboratories. In 1995, he returned to the University of Massachusetts at Amherst as a Senior Research Scientist, a position he held until 1998, when he joined the AT&T Shannon Laboratory as a Principal Technical Staff Member in the Artificial Intelligence Department. His research centres on the learning problems that a decision-maker faces as it interacts with its environment, which he sees as the core of artificial intelligence. He is also interested in animal learning psychology, connectionist networks, and general systems that continually improve their representations and models of the world. In 2003 he became a Professor and the iCORE Chair in the Department of Computing Science at the University of Alberta, where he led the Reinforcement Learning and Artificial Intelligence Laboratory (RLAI).
The lab’s name (RLAI) might seem to claim that reinforcement learning is the solution to every AI problem. During the interview, however, Dr. Sutton explained it from a different perspective. He noted that some people regard RL as just one part of AI, whereas the RL problem is really an abstract approach to AI as a whole. 「I’d like to say we’re taking an approach to AI. The name ‘Reinforcement Learning and Artificial Intelligence’ is a funny one: the word ‘and’ in English can be read as inclusive or exclusive, as ‘and’ or as ‘or’, because reinforcement learning is both a subset of AI and also an origin of AI. It’s quite ambiguous. We’re still looking for an answer,」 Dr. Sutton said.
Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning in which an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. If you are new to RL today, the book Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto is probably your best starting point. It provides a clear and simple account of the key ideas and algorithms of reinforcement learning, and its discussion ranges from the history of the field’s intellectual foundations to the most recent developments and applications. Back in the 1970s, however, even though machine learning was becoming well known and popular, there was still no such thing as reinforcement learning.
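The definition above, an agent that tries to maximize the total reward it receives from an uncertain environment, can be sketched in a few lines of Python. This is a minimal illustration and not an example from the book: the two-armed bandit environment, its payout probabilities, the epsilon-greedy rule, and all names (`TwoArmedBandit`, `EpsilonGreedyAgent`, `run`) are assumptions chosen for brevity.

```python
import random


class TwoArmedBandit:
    """Hypothetical uncertain environment: each arm pays 1 with a fixed, hidden probability."""

    def __init__(self, payout_probs=(0.2, 0.8), seed=0):
        self.payout_probs = payout_probs
        self.rng = random.Random(seed)

    def step(self, action):
        # Reward is 1 or 0; the agent never sees payout_probs directly.
        return 1.0 if self.rng.random() < self.payout_probs[action] else 0.0


class EpsilonGreedyAgent:
    """Learns by trial and error: mostly exploits its estimates, sometimes explores."""

    def __init__(self, n_actions=2, epsilon=0.1, seed=1):
        self.epsilon = epsilon
        self.values = [0.0] * n_actions  # estimated average reward per action
        self.counts = [0] * n_actions
        self.rng = random.Random(seed)

    def act(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))  # explore
        return max(range(len(self.values)), key=lambda a: self.values[a])  # exploit

    def learn(self, action, reward):
        # Incremental sample average of the rewards seen for this action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def run(steps=2000):
    """Agent-environment loop: act, observe a reward, learn, repeat."""
    env, agent = TwoArmedBandit(), EpsilonGreedyAgent()
    total_reward = 0.0
    for _ in range(steps):
        action = agent.act()
        reward = env.step(action)
        agent.learn(action, reward)
        total_reward += reward
    return agent, total_reward
```

Calling `agent, total = run()` returns the trained agent and the reward it collected. Nobody tells the agent which arm is correct; it only sees the rewards that follow its own choices, which is the difference from supervised learning that the interview goes on to discuss.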
Synced visited the University of Alberta and talked with the godfather of reinforcement learning…
Synced: How did reinforcement learning start? What was the starting point for writing algorithms?
Dr. Sutton: It was always an obvious idea: a learning system that wants something, and that kind of learning was missing. In the 1970s, Harry Klopf (1972, 1975, 1982) wrote several reports that addressed these issues. He recognized that essential aspects of adaptive behavior were being lost as learning researchers came to focus almost exclusively on supervised learning. The missing part was the essential idea of trial-and-error learning. We tried to figure out what the basic idea was, and found that he was right. This idea had not really been studied in any field: not in machine learning, not in control theory, not in engineering, not in pattern recognition. All those fields overlooked it. You can see some earlier work in the 1950s, where people talked about trial-and-error learning, but in the end it became supervised learning. It has targets and training sets, and it tries to memorize and to generalize from them. It’s funny that nowadays we’re talking about deep learning and reinforcement learning. Way back at the beginning it was a similar situation: we were trying to distinguish reinforcement learning from supervised learning. You need a system that can learn, and that’s all. A reinforcement learning system finds a way of behaving that maximizes its reward in the world, whereas supervised learning systems just memorize the examples given to them and generalize to new ones, but they have to be told what to do. A reinforcement learning system can try different things. We must try different things; we must search the space of actions to learn what does best in the world. That idea had been lost, and Andrew Barto and I gradually realized that it was not present in the old work and that it was needed. This is a simplified view of why we are considered pioneers.
Editor’s note: Dr. Sutton has in fact been developing and promoting reinforcement learning (RL) since late 1979. Like others, he initially had a sense that reinforcement learning had been thoroughly explored in the early days of cybernetics and artificial intelligence. But while reinforcement learning had clearly motivated some of the earliest computational studies of learning, most of those researchers had shifted their focus to other things, such as pattern classification, supervised learning, and adaptive control, or had abandoned the study of learning altogether. In addition, computing power at the time was very limited, so it was difficult to apply reinforcement learning to real-world problems: it involves a great deal of trial and error before converging to an optimal policy, which can take an extremely long time.
Synced: What do you think of the development of RL since the 1970s? What kept your faith at a time when progress in RL seemed long and slow?
Dr. Sutton: I do not agree that the development of reinforcement learning has been slow, but I do accept that increasing computational resources have had a big impact on this field. You have to time things to coincide with the availability of hardware. Even though it is still a bit early for deep learning, it succeeds in using a lot of computation, and that is its strength. People have long said that we will have the computational power for strong AI by 2030. I think it depends not only on cheap hardware but also on algorithms. I don’t think we have strong AI algorithms now, but we might have them by 2030.
Synced: So, which will be more critical by 2030, hardware or software?
Dr. Sutton: It’s a big question whether the hardware comes first or the software comes first. We need the software to test out the hardware, and the availability of hardware pushes people to work on software. But it is not tremendously valuable for the smartest people to do research with limited computational resources. Even if we have adequate hardware in 2030, we may still need another 10 years for the smartest people to catch up with the algorithms. Now that you know my reasoning, you can reevaluate it or adjust it yourself.
Synced: AI has benefited a great deal from psychology and neuroscience, as in RL itself and ConvNets. You added two new chapters on these fields to the new edition of your RL book. Why are the interactions between AI/RL and psychology/neuroscience important?
Dr. Sutton: The basic reinforcement learning method, temporal-difference learning, has essentially been found in the brain. There are processes in the brain that follow the same rules and are well modelled by the rules of reinforcement learning. This is the so-called standard model of the reward system in the brain. I call it the standard model not because it is perfect but because it is the one everyone picks. You know you have succeeded when everyone chooses you, and that is the case for this model of the reward system in the brain. It also serves as a good model for the psychology of learning and for animal behaviour studies. The other major thing is model-based learning, where you can do planning; this corresponds to various notions of replay and of imagining circumstances. That gives a reinforcement learning model of how we plan, where we learn from sequences of various kinds. Considering both, AI researchers are trying to figure out the mind and the deep reasoning behind it.
Reinforcement learning studies decision making and control, and how a decision-making agent can learn to act optimally in a previously unknown environment. Deep reinforcement learning studies how neural networks can be used in reinforcement learning algorithms, making it possible to learn the mapping from raw sensory inputs to raw motor outputs and removing the need to hand-engineer this pipeline. As a result, deep reinforcement learning (DRL), which combines reinforcement learning with deep learning, has become a very popular approach to many kinds of problems, such as game playing, decision problems, and robotic control.
Editor’s note: Dr. Sutton agreed that combining RL and DL is a real improvement. As for specific fields, taking computer vision (CV) as an example, he said: 「You can certainly do computer vision without reinforcement learning. The normal practice is to prepare a dataset, mainly of supervised examples, and then learn from that. But I would say you couldn’t do it without deep learning. Who will actually use some imagination and do it with reinforcement learning? I think it would take some cleverness and imagination to do that. I tend to think that doing computer vision with a degree of reinforcement would be a breakthrough.」
Dr. Sutton (cont’d): The winning feature of reinforcement learning is that you can learn during normal operation. Conventional deep learning learns from a labelled training set, whereas with reinforcement learning you could, in principle, learn from your normal operation. It takes some imagination to reformulate the problem, because you don’t have labelled examples, but you have much more experience from just normal use than you do in the training examples.
As a showcase of reinforcement learning’s strengths, AlphaGo’s victory is of course in a league of its own. There is no question that AlphaGo’s achievement, and the speed with which it improved, were unprecedented. According to Dr. Sutton, AlphaGo’s success can largely be traced to the combination of two powerful technologies: Monte Carlo tree search and deep reinforcement learning.
Synced: Let’s take AlphaGo as an example. Why is self-play so important? Is there a limit to self-play? Can an agent keep improving its performance?
Dr. Sutton: Self-play can generate unlimited training data. You don’t need people to label the training data; by playing against yourself you can generate any number of examples, of different games. That’s what we want. So can we do something like self-play for real life, not just for a game? AlphaGo is missing one key thing: the ability to learn how the world works, such as an understanding of the laws of physics and of the consequences of one’s actions. Here comes the limitation. To play against yourself you need the rules of the game, and in regular life we have no analogue of those rules that tells us how the pieces of real life move. You pick up the phone, you press a button, and something will happen. You have to learn that; you don’t have the rules of the game built in. You don’t know the consequences of the moves. So to do self-play you need the rules of the game.
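Both halves of this answer, that self-play produces its own training data and that it depends on built-in rules, can be seen in a small sketch. It is not AlphaGo’s method (there is no tree search and no neural network): the toy game (a Nim variant in which players remove 1 to 3 stones and whoever takes the last stone wins), the tabular update, the hyperparameters, and the function names are all assumptions made for illustration.

```python
import random

MAX_TAKE = 3  # rules of the toy game: remove 1 to 3 stones; taking the last stone wins


def legal_moves(pile):
    return list(range(1, min(MAX_TAKE, pile) + 1))


def best_move(q, pile):
    return max(legal_moves(pile), key=lambda a: q[(pile, a)])


def self_play_train(start_pile=10, episodes=20000, alpha=0.5, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # q[(pile, take)] estimates the outcome (+1 win, -1 loss) for the player about to move.
    q = {(p, a): 0.0 for p in range(1, start_pile + 1) for a in legal_moves(p)}
    for _ in range(episodes):
        pile = start_pile
        while pile > 0:
            # Both "players" share the same value table: the agent plays itself.
            if rng.random() < epsilon:
                take = rng.choice(legal_moves(pile))
            else:
                take = best_move(q, pile)
            next_pile = pile - take  # the built-in rules give the consequence of the move
            if next_pile == 0:
                target = 1.0  # the mover took the last stone and wins
            else:
                # The opponent moves next, so its best outcome is our worst.
                target = -max(q[(next_pile, a)] for a in legal_moves(next_pile))
            q[(pile, take)] += alpha * (target - q[(pile, take)])
            pile = next_pile
    return q
```

After `q = self_play_train()`, `best_move(q, pile)` gives the learned move for a pile size. No human labels any position, but the line computing `next_pile` is exactly the knowledge of consequences that, as Dr. Sutton points out, real life does not hand us.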
Synced: Deep learning is hungry for big data, and reinforcement learning usually needs lots of samples too. However, there is research on one-shot learning, which tries to learn from one or a few samples. This may be how people learn in some problems. Is it possible to integrate the idea of one-shot learning with RL?
Dr. Sutton: I have this phrase: you have to learn slowly so that you can learn fast. Throughout our lives, we learn good representations, so that when we then get some experience we can learn very quickly what the correct behaviour is. We can learn from one shot, but that one-shot learning builds on a long period of gathering representations.
Synced: Besides all the advantages and breakthroughs of RL, let’s discuss its shortcomings. What are the limitations of reinforcement learning, and of AI in general?
Dr. Sutton: Well, there are several really important ones. There are technical ones. But let me go towards something that we can all understand, which is the harder limitation. In reinforcement learning in general, we would like to be able to learn how the world works and then apply that knowledge when we plan and choose behaviour autonomously. Take something like AlphaGo or computer chess. There we don’t have to learn how the world works. We know what the moves are and we know what their consequences are: if we move this piece there, we know what the board will be. And we can already do amazing things in planning scenarios like that. We would like to do the same thing where we have the moves, the actions, the choices, but the consequences are learned. We need a new mechanism, a new way to plan with a learned model of the world. That’s the key problem, I think. We have to take choices and their observed consequences and make models of how the state of the world evolves. Once we have that, we will be able to plan with them and do AI in a stronger sense.
There are sub-problems, such as what we mean by knowledge: what kind of predictions do we want to make about what will happen if we behave in different ways? 「We’re going to learn the consequences of different ways of behaving by trying them out, but without taking them to completion.」 Dr. Sutton explains this with a typical example: 「Okay, let’s see, you walk into a room. Here is a bottle of water, there is a chair, there are some other objects around, people, and so on. You could talk to various people and respond to different objects, but you will only do one thing, and maybe you never pick up the bottle of water, because you learned what it is from looking at it. You learn something from these partial experiences, and that is what we call off-policy learning. Off-policy learning is our big technical challenge in reinforcement learning.」
Synced: That’s interesting. How can we get a better understanding of off-policy learning?
Dr. Sutton: To learn off-policy efficiently, you want to learn in a scalable way. You want to take unprepared data; you don’t want to need a training set of labelled pictures all the time. You just want to be able to interact with the world, gain experience, and learn the way the world works from it. So how can we learn from unprepared experience with the world? That’s what reinforcement learning methods are good for, and should be good for.
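For readers who want something concrete, off-policy learning in its simplest textbook form can be sketched with tabular Q-learning: the agent behaves one way (here, at random) while learning the values of a different, greedy policy that it never actually follows. This is only a minimal sketch, not the scalable method Dr. Sutton is calling for; the five-cell corridor, the reward of 1 at the goal, the discount factor, and all names are assumptions made for the example.

```python
import random

N_STATES = 5  # corridor cells 0..4; cell 4 is the goal and ends the episode
LEFT, RIGHT = 0, 1
GAMMA = 0.9  # discount factor


def step(state, action):
    """Hypothetical environment dynamics, unknown to the learner."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning_from_random_behaviour(episodes=500, alpha=0.5, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Behaviour policy: act uniformly at random, with no regard for q.
            action = rng.choice((LEFT, RIGHT))
            next_state, reward = step(state, action)
            # Target policy: greedy. The max asks "what if I acted greedily from here?"
            # even though the agent does not actually behave that way.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best learned action in every non-terminal cell."""
    return [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

Calling `greedy_policy(q_learning_from_random_behaviour())` shows what the agent learned about a way of behaving it never tried to completion, using only unlabelled experience from wandering around.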
Synced: Thank you for your time today. Lastly, can you give some advice to beginners in RL, on applications or on the philosophy behind it?
Dr. Sutton: Learn the basics and find an inexpensive application, one where the right response has to be deduced from the data. Think about an elevator. In the middle of the night it may be better for it to stop, because no one is around; maybe no one came at that hour yesterday. So you want to save energy by shutting it down, but then turn it on again when people arrive. How do you make that schedule? The information in the data matters. Running the elevator when no one is coming is a bad event, because you just waste time and energy. So think the same way about problems where you use the actual data without training labels. Look for things like that, wherever they are.