Reinforcement learning (RL) methods have achieved human and superhuman-level performance in many complex and large-scale environments, like Atari games and Go. However, compared to human performance, previous deep RL systems have at least two shortcomings:
- Deep RL typically requires a massive volume of training data, whereas human learners can attain reasonable performance with comparatively little experience.
- Deep RL systems typically specialize in one restricted task domain, whereas human learners can adapt to changing task conditions.
In this paper, the authors introduced a novel approach called deep meta-reinforcement learning (meta-RL) which can rapidly adapt to new tasks. The key idea is to use standard deep RL techniques to train a recurrent neural network, which will implement its own, free-standing RL procedure.
2.1 Background: Meta-learning in recurrent neural networks
In the context of machine learning, meta-learning is the process of learning to learn. Informally speaking, a meta-learning algorithm uses its experience to change certain aspects of a learning algorithm, or the learning method itself, such that the modified learner is better than the original learner at learning from additional experience. 
In Hochreiter’s work, the meta-learning system is described using the following graph:
There are two important aspects of the meta-learning system above:
a. The dynamics of the recurrent network can represent the process that underlies learning within each new task.
b. Because it embeds biases fitted to the training task family, the learning procedure implemented in the recurrent network can learn efficiently when dealing with new tasks from the same family.
2.2 Deep meta-RL
In the supervised case of Hochreiter’s method, the dynamics of the recurrent network come to implement a learning algorithm entirely separate from the one used to train the network weights. In this paper, the authors considered the implications of applying the same approach in the context of reinforcement learning. Here too, the learned RL procedure can differ greatly from the algorithm used to train the network’s weights. In particular, the learned RL procedure can implement its own approach to exploration.
First, an appropriately structured agent, embedding a recurrent neural network, is trained to maximize the sum of observed rewards, over all steps and episodes, by interacting with a sequence of Markov Decision Process (MDP) tasks drawn from a prior distribution (called D). After training, the agent’s policy is fixed. The authors wanted to demonstrate that meta-RL will on average perform well on MDPs drawn from D or slight modifications of D. Since the learned agent makes use of a recurrent network (and is thus history-dependent), it is able to adopt a strategy that optimizes rewards for the task at hand when exposed to any new MDP environment.
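The test-time protocol described above can be sketched as follows. This is only an illustration under stated assumptions: `sample_bandit` stands in for the task distribution D, and `win_stay_lose_shift` is a hand-written, history-dependent placeholder for the trained recurrent policy (the function names and the two-armed Bernoulli task are hypothetical choices, not the authors' code). The point is the interface: the policy is fixed, its internal state is reset at the start of each episode, and it only sees its own previous action and reward.

```python
import random

def sample_bandit(rng):
    """Draw one two-armed Bernoulli bandit from an assumed task distribution D."""
    return [rng.random(), rng.random()]

def win_stay_lose_shift(state, prev_action, prev_reward):
    """Hypothetical stand-in for the trained recurrent policy (weights fixed).

    Returns (action, new_state). A real agent would update a hidden state here.
    """
    if prev_action is None:
        return 0, state                      # first trial: arbitrary choice
    if prev_reward == 1:
        return prev_action, state            # rewarded: repeat the action
    return 1 - prev_action, state            # unrewarded: switch arms

def run_episode(policy, arm_probs, n_trials, rng):
    """Run one episode on a single task and return the total reward."""
    state = None                             # internal state reset per episode
    prev_action, prev_reward = None, 0
    total = 0
    for _ in range(n_trials):
        action, state = policy(state, prev_action, prev_reward)
        reward = 1 if rng.random() < arm_probs[action] else 0
        total += reward
        prev_action, prev_reward = action, reward
    return total

if __name__ == "__main__":
    rng = random.Random(0)
    returns = [run_episode(win_stay_lose_shift, sample_bandit(rng), 100, rng)
               for _ in range(5)]
    print(returns)
```

Any adaptation within an episode has to come from the history passed through the policy, since no weights change at test time.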
To test whether meta-RL can be used to learn an adaptive balance between exploration and exploitation, and whether it can give rise to learning that gains efficiency by capitalizing on task structure, the authors performed four experiments on bandit tasks and two experiments on Markov decision problems. To wrap up, the authors reviewed an experiment recently reported in a related paper, which showed how meta-RL can scale to large-scale navigation tasks with rich visual inputs.
In all experiments, the agent architecture centered on a recurrent neural network feeding into a soft-max output representing discrete actions. Other architectural details varied across experiments, as shown in Table 1:
All reinforcement learning experiments were conducted using the Advantage Actor-Critic algorithm [9, 10]. The architecture can be seen in Figure 1:
3.1 Bandit problems
In this paper, the authors first studied four different bandit problems for evaluating meta-RL. If a bandit algorithm learned by training on a set of bandit environments drawn independently and identically distributed (i.i.d.) from a given distribution of environments also performs well on problems drawn from that distribution, or from a slight modification of it, we can say that meta-RL learns a prior-dependent bandit algorithm.
The authors reported the resulting performance in terms of the cumulative expected regret, which is a measure of the loss suffered when playing sub-optimal arms.
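As a small illustration of this metric, the sketch below computes cumulative expected regret for a stationary bandit: on each trial the regret grows by the gap between the best arm's expected reward and the expected reward of the arm actually played. The function name and the example numbers are illustrative assumptions.

```python
def cumulative_expected_regret(arm_means, actions):
    """Running sum of (best expected reward - expected reward of chosen arm)."""
    best = max(arm_means)
    total = 0.0
    curve = []
    for action in actions:
        total += best - arm_means[action]
        curve.append(total)
    return curve

if __name__ == "__main__":
    # Arm 0 pays off with probability 0.75, arm 1 with probability 0.25.
    print(cumulative_expected_regret([0.75, 0.25], [1, 0, 1]))
```

Because the regret uses expected rather than sampled rewards, it only increases on trials where a sub-optimal arm is played.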
3.1.1 Bandits with independent arms
The authors first considered a simple two-armed bandit task where the arm distributions are independent Bernoulli distributions, to compare the behavior of meta-RL with established bandit algorithms such as Thompson sampling, UCB and Gittins indices. From the experimental results (Figure 2a), we can see that meta-RL outperforms both Thompson sampling and UCB, while performing less well than Gittins.
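For readers unfamiliar with the baselines, here is a minimal sketch of two of them on a two-armed Bernoulli bandit. The UCB1 variant and the uniform Beta(1, 1) prior for Thompson sampling are assumptions made for illustration; this review does not state which exact variants or settings the authors used, and Gittins indices are omitted.

```python
import math
import random

def thompson_choose(successes, failures, t, rng):
    """Sample a success rate per arm from its Beta posterior; play the largest."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return samples.index(max(samples))

def ucb1_choose(successes, failures, t, rng):
    """UCB1: play each arm once, then maximize mean plus exploration bonus."""
    counts = [s + f for s, f in zip(successes, failures)]
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [s / n + math.sqrt(2.0 * math.log(t) / n)
              for s, n in zip(successes, counts)]
    return scores.index(max(scores))

def run_bandit(choose, arm_probs, n_trials, seed=0):
    """Play one episode and return (expected regret, successes, failures)."""
    rng = random.Random(seed)
    successes = [0] * len(arm_probs)
    failures = [0] * len(arm_probs)
    regret = 0.0
    for t in range(1, n_trials + 1):
        arm = choose(successes, failures, t, rng)
        if rng.random() < arm_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        regret += max(arm_probs) - arm_probs[arm]
    return regret, successes, failures

if __name__ == "__main__":
    for name, rule in [("Thompson", thompson_choose), ("UCB1", ucb1_choose)]:
        regret, _, _ = run_bandit(rule, [0.75, 0.25], 100)
        print(name, regret)
```

Both rules are fixed in advance, whereas the meta-RL agent's exploration strategy is whatever its recurrent dynamics learned from the training distribution.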
3.1.2 Bandits with dependent arms (I)
To show that meta-RL can give rise to a learned RL algorithm that exploits consistent structure in the training distribution, the authors trained the recurrent system from the first experiment on a more structured bandit task in which arm reward distributions are correlated. The results (Figure 2b-f) show that agents trained in structured environments perform comparably to Gittins, and outperform agents trained on independent arms in all structured tasks at test. It should be emphasized that previous training on any structured distribution hurts performance when agents are tested on independent-arm tasks (Figure 2f).
3.1.3 Bandits with dependent arms (II)
Humans and animals often make decisions that sacrifice immediate reward for information gain. To probe the analogous behavior in dependent-arm bandit problems, the authors examined a problem where information can be gained by paying a short-term reward cost. The results (Figure 3) demonstrated that the agent can successfully learn the optimal long-run strategy of sampling the informative arm once, then using the resulting information to exploit the high-value target arm.
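The "sample the informative arm once, then exploit" strategy can be made concrete with a toy task. The numbers below (11 arms, a target arm paying 5, other arms paying 1, and an informative arm whose small payoff encodes the target's index) are illustrative assumptions rather than details given in this review, and the function names are hypothetical.

```python
def make_informative_bandit(target_arm):
    """Hypothetical 11-armed task: arm 0 is informative, arms 1..10 pay out."""
    def pull(arm):
        if arm == 0:
            return 0.1 * target_arm          # low payoff that encodes the target
        return 5.0 if arm == target_arm else 1.0
    return pull

def informed_strategy(pull, n_trials):
    """Pay the short-term cost once, decode the target, then exploit it."""
    rewards = [pull(0)]
    target = round(rewards[0] * 10)          # recover the target index
    for _ in range(n_trials - 1):
        rewards.append(pull(target))
    return rewards

if __name__ == "__main__":
    pull = make_informative_bandit(target_arm=7)
    print(informed_strategy(pull, n_trials=5))
```

The first pull earns less than any other arm would, but it removes all uncertainty about where the high-value arm is for the remaining trials.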
3.1.4 Restless bandits
Furthermore, the authors also considered non-stationary bandit problems in which reward probabilities change over the course of an episode, with different rates of change in different episodes. Thus, the agent must not only track the best arm, but also infer the change rate of the episode and adapt its learning rate accordingly.
The authors tested the agent on a two-armed Bernoulli bandit task to evaluate whether meta-RL would learn such a flexible RL policy. The results (Figure 4b) showed that meta-RL achieved lower regret than UCB, Thompson sampling and a Rescorla-Wagner (R-W) learner with a fixed learning rate (alpha=0.5).
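To make the fixed-learning-rate baseline concrete, here is a minimal sketch of a Rescorla-Wagner learner on a toy restless bandit. Only the update rule with alpha=0.5 comes from the text above; the environment (arm probabilities that swap with a per-trial probability `volatility`), the soft-max action rule and its inverse temperature `beta` are assumptions chosen for illustration.

```python
import math
import random

def rw_update(values, action, reward, alpha=0.5):
    """Rescorla-Wagner: move the chosen arm's value toward the reward."""
    new_values = list(values)
    new_values[action] += alpha * (reward - new_values[action])
    return new_values

def softmax_choice(values, beta, rng):
    """Sample an arm with probability proportional to exp(beta * value)."""
    weights = [math.exp(beta * v) for v in values]
    threshold = rng.random() * sum(weights)
    cumulative = 0.0
    for arm, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return arm
    return len(values) - 1

def run_restless(n_trials, volatility, alpha=0.5, beta=5.0, seed=0):
    """Return (expected regret, final values) on an assumed restless bandit."""
    rng = random.Random(seed)
    probs = [0.9, 0.1]
    values = [0.5, 0.5]
    regret = 0.0
    for _ in range(n_trials):
        if rng.random() < volatility:
            probs.reverse()                  # the better arm changes identity
        action = softmax_choice(values, beta, rng)
        reward = 1 if rng.random() < probs[action] else 0
        regret += max(probs) - probs[action]
        values = rw_update(values, action, reward, alpha)
    return regret, values

if __name__ == "__main__":
    for volatility in (0.01, 0.2):
        print(volatility, run_restless(200, volatility)[0])
```

A single alpha cannot be ideal for both slow and fast change rates, which is the gap the meta-RL agent is meant to close by inferring the volatility within each episode.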
3.2 Markov decision problems
The authors also studied Markov decision processes (MDPs), where actions can influence the task’s underlying state, to further examine how meta-RL adapts to invariances in task structure. The fifth experiment used the “two-step task” from the neuroscience literature, where training with model-free RL gives rise to behavior reflecting model-based control. In the sixth experiment, the authors studied a meta-learning task that requires the agent to learn an abstract task structure, originally demonstrated in the context of animal learning. To wrap up, the authors reviewed a related experiment recently reported in the navigation domain, which demonstrated that meta-RL allows a base model-free RL algorithm to solve a challenging RL problem.
4. Related Work
Meta-RL was introduced in Schmidhuber’s work in 1996, which did not involve a neural network implementation. In 2001, Hochreiter’s work pioneered the use of recurrent networks to perform meta-learning in the supervised case. Santoro’s work in 2016 extended this technique by demonstrating the utility of leveraging an external memory structure. Recently, there has been a lot of work on using neural networks to learn optimization procedures, using a range of innovative meta-learning techniques [5, 6, 7, 8].
Meanwhile, a number of recent studies have used deep RL to train recurrent neural networks on navigation tasks (e.g., maze task, goal location), where the structure of the task varies across episodes [9, 10]. A closely related work, which focused on relatively unstructured task distributions, is a good complement to this paper.
This paper proposed a novel approach called deep meta-reinforcement learning (meta-RL), which involves three ingredients: (1) the use of a deep RL algorithm to train a recurrent neural network, (2) a training set that includes a series of interrelated tasks, and (3) network input that includes the action selected and the reward received in the previous time interval.
According to the experimental results, the authors believe that deep meta-RL is likely to generate RL procedures that occupy a grey area between model-free and model-based RL in widely varying but structured environments. Deep meta-RL may also have important implications for neuroscience, as recent work demonstrated that it can help in understanding the respective roles of dopamine and the prefrontal cortex in biological reinforcement learning.
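The third ingredient can be sketched as a small input-building helper. The layout below (observation, then a one-hot encoding of the previous action, then the previous reward) and the all-zero action encoding on the first step are assumptions for illustration; the review only states that the previous action and reward are part of the network input.

```python
def one_hot(index, size):
    """Return a list of `size` floats with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def build_input(observation, prev_action, prev_reward, n_actions):
    """Concatenate observation, previous action (one-hot) and previous reward."""
    if prev_action is None:
        action_part = [0.0] * n_actions      # no action yet on the first step
    else:
        action_part = one_hot(prev_action, n_actions)
    return list(observation) + action_part + [float(prev_reward)]

if __name__ == "__main__":
    # Two-armed bandit with one observation feature; last action 1 was rewarded.
    print(build_input([0.2], prev_action=1, prev_reward=1, n_actions=2))
```

Feeding the previous action and reward back in is what lets the recurrent state accumulate the action-outcome history that an RL procedure needs.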
6. Final Thoughts
One limitation of the current work is that it mainly focuses on structured task distributions (such as dependent bandit problems and learning abstract task structure), in contrast to a related work that focuses on unstructured task distributions. However, because a secondary, separate RL algorithm is learned, it is configured to exploit structure in the training domain, which means it can learn specifically how to learn better on the data presented. Based on this idea, many future directions can be explored; for example, a recent work uses an RL algorithm to search for the best RNN structure. Moreover, we look forward to seeing more work that applies these ideas to hyperparameter optimization problems in data mining and machine learning.
2. Hochreiter, S., Younger, A.S. and Conwell, P.R., 2001, August. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks (pp. 87-94). Springer Berlin Heidelberg.
3. Schmidhuber, J., Zhao, J. and Wiering, M., 1996. Simple principles of metalearning.
4. Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pages 1842–1850, 2016.
5. Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. arXiv preprint arXiv:1606.04474, 2016.
6. Yutian Chen, Matthew W Hoffman, Sergio Gomez, Misha Denil, Timothy P Lillicrap, and Nando de Freitas. Learning to learn for global optimization of black box functions. arXiv preprint arXiv:1611.03824, 2016.
7. Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
8. Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
9. Jaderberg, M., Mnih, V., Czarnecki, W.M., Schaul, T., Leibo, J.Z., Silver, D. and Kavukcuoglu, K., 2016. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.
10. Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.
11. Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
12. Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Joel Leibo, Hubert Soyer, Dharshan Kumaran, and Matthew Botvinick. Meta-reinforcement learning: a bridge between prefrontal and dopaminergic function. In Cosyne Abstracts, 2017.
Analyst: Yufeng Xiong | Localized by Synced Global Team : Xiang Chen