Paper source:

https://openreview.net/pdf?id=B1oK8aoxe

Paper comments:

https://openreview.net/forum?id=B1oK8aoxe&noteId=B1oK8aoxe

**1. Introduction**

Deep reinforcement learning has achieved many impressive results recently, but deep RL algorithms typically employ naive exploration strategies such as epsilon-greedy or uniform Gaussian exploration noise, which work poorly in tasks with sparse rewards. Two strategies are commonly used to deal with this challenge:

- Design a hierarchy over the actions, which requires domain-specific knowledge and careful hand-engineering.
- Use domain-agnostic intrinsic rewards to guide exploration. However, it is unclear how the knowledge gained from solving one task can be transferred to other tasks, which may lead to high sample complexity.

In this paper, the authors propose a general framework for learning a span of skills in a pre-training environment; the skills can then be used in downstream tasks by training a high-level policy on top of them. The authors use stochastic neural networks (SNNs) combined with a proxy reward to learn these skills, and designing the proxy reward requires only minimal domain knowledge about the downstream tasks. To encourage diversity in the behaviors of the SNN policy, an information-theoretic regularizer based on Mutual Information (MI) is added in the pre-training phase. The experiments show that this hierarchical policy-learning framework can learn a wide span of interpretable skills, and that training high-level policies over the learned skills leads to strong performance on a set of long-horizon, sparse-reward tasks.

**2. Problem Statement**

The authors specify a collection of downstream tasks as a collection of discrete-time, finite-horizon, discounted Markov Decision Processes (MDPs) *M*. The objective in each task is to maximize the expected discounted return over a whole trajectory.
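In symbols, this objective can be written as follows. The notation is the usual one and is an assumption here, since this review does not spell it out: a policy π, trajectories τ = (s_0, a_0, s_1, ...) generated by following π, a reward function r, a discount factor γ in (0, 1], and a horizon T.

$$
\eta(\pi) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right]
$$

With sparse rewards, r(s_t, a_t) is zero on almost every step, which is why naive exploration rarely finds a trajectory with a nonzero return.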

To make these problems tractable, the downstream tasks must satisfy some structural assumptions, which should be kept minimal to preserve generality. Following the idea of *agent-space* [1], the state space is factored into two components, the *agent state* and the *rest state*, which interact only weakly with each other. The agent state is the same for all MDPs in *M*, and all the MDPs share the same action space.

Given a collection of tasks satisfying the structural assumption, the objective is to minimize the total sample complexity required to solve these tasks. Previous techniques [2], which use experience gathered from solving earlier tasks to help solve later ones, are not directly applicable to tasks with sparse rewards. Thus, a general framework for learning useful skills in a pre-training environment is proposed.

**3. Methodology**

The authors describe the approach as a five-step process. It takes advantage of a pre-training task that can be constructed with minimal domain knowledge, and the skills learned there are then applied to challenging tasks with sparse rewards.

**3.1 Constructing the pre-training environment**

To construct a pre-training environment in which the agent can learn useful skills that can be utilized in downstream tasks, the authors let the agent freely interact with the environment in a minimal setup. For example, for a mobile robot, a pre-training environment can be a spacious environment where the robot can first learn the necessary locomotion skills.

Instead of specifying goals in the pre-training environment that correspond to the desired skills, the authors use a proxy reward as the only reward signal to guide skill learning. The proxy reward should encourage locally optimal solutions. In the mobile robot example, the proxy reward can be proportional to the magnitude of the robot's speed, without constraining the direction of movement.
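A minimal sketch of such a direction-agnostic proxy reward is shown below. It assumes the speed is estimated from the displacement of the robot's center of mass over one time-step; the function name `speed_proxy_reward` and the parameters `dt` and `scale` are hypothetical and do not come from the paper.

```python
import math


def speed_proxy_reward(com_before, com_after, dt, scale=1.0):
    """Reward proportional to the speed of the center of mass (CoM).

    Assumption: speed is approximated by the CoM displacement over one
    time-step of length dt. Only the magnitude is used, so moving in any
    direction at the same speed earns the same reward.
    """
    dx = com_after[0] - com_before[0]
    dy = com_after[1] - com_before[1]
    return scale * math.hypot(dx, dy) / dt


# Moving "north-east" and "south-west" at the same speed is rewarded equally.
forward = speed_proxy_reward((0.0, 0.0), (3.0, 4.0), dt=1.0)
backward = speed_proxy_reward((0.0, 0.0), (-3.0, -4.0), dt=1.0)
```

Because the reward ignores direction, every direction of travel is an equally good local optimum, which is what leaves room for different skills to emerge.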

**3.2 Stochastic neural networks for skill learning**

Once the pre-training environment is constructed, a direct way to learn a span of skills is to train several policies, each with a uni-modal action distribution, from different random initializations. This approach, however, has high sample complexity. To tackle this issue, the authors propose using stochastic neural networks (SNNs), which have stochastic units in the computation graph. Because SNNs have rich representation power and can approximate any well-behaved probability distribution, the authors use them to train the different policies.

The paper implements a simple class of SNNs in which latent variables, drawn from categorical distributions with uniform weights, are combined with the observations from the environment to form a joint embedding. The joint embedding is then fed to a feedforward neural network with deterministic units, which computes the distribution parameters.

Concatenating the observations and the latent variables directly gives the simplest joint embedding (Figure 1(a)), which has limited expressive power. Inspired by previous work [3][4] showing that richer forms of integration can achieve greater representation power, the authors propose a simple bilinear integration that forms the outer product of the observations and the latent variable. As shown in the experiments, the choice of integration strongly affects the quality of the span of skills that is learned. The reason is that concatenation corresponds to changing only the bias term of the first hidden layer depending on the latent code h, while bilinear integration changes all of the first hidden layer's weights. In both cases, training a single SNN allows for flexible weight-sharing schemes among the different policies.
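The two joint embeddings can be sketched in a few lines of plain Python. This is an illustration only, assuming a one-hot encoding of the categorical latent; the function names `one_hot`, `concat_embedding`, and `bilinear_embedding` are hypothetical.

```python
def one_hot(k, num_latents):
    """One-hot encoding of latent code k out of num_latents choices."""
    return [1.0 if i == k else 0.0 for i in range(num_latents)]


def concat_embedding(obs, latent):
    """Simplest joint embedding: observation followed by the latent code."""
    return list(obs) + list(latent)


def bilinear_embedding(obs, latent):
    """Flattened outer product of the latent code and the observation.

    With a one-hot latent, the observation is copied into the block that
    belongs to the active latent and every other block is zero.
    """
    return [h * o for h in latent for o in obs]


obs = [0.5, 2.0]
latent = one_hot(1, 3)
concat = concat_embedding(obs, latent)      # length len(obs) + K
bilinear = bilinear_embedding(obs, latent)  # length len(obs) * K
```

A first hidden layer applied to `concat` sees the latent only as an extra additive term, i.e. a latent-dependent bias. Applied to `bilinear`, the same layer has a separate block of weights for each latent, so each latent code effectively selects its own first-layer weights.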

**3.3 Information-theoretic regularization**

Because there is no direct control over whether the different policies actually learn different skills, the authors propose an information-theoretic regularizer based on Mutual Information (MI) in the pre-training phase. It encourages diversity in the behaviors of the SNN policy and prevents the SNN from collapsing into a single mode.

Concretely, for a mobile robot, the authors add a reward bonus proportional to the mutual information (MI) between the latent variable and the state the robot is currently in. The MI is measured only with respect to a relevant subset of the state. Mathematically, let Z be a random variable denoting the latent variable, and let X be a random variable for where the agent is currently situated. The reward bonus is then proportional to the mutual information between the two:

$$
I(Z; X) = H(Z) - H(Z \mid X)
$$

Since H(Z) is constant because the distribution of the latent variable is fixed, maximizing the MI is equivalent to minimizing the conditional entropy H(Z|X). In other words, given where the robot is, it should be easy to infer which skill the robot is currently performing.
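One way to turn the conditional-entropy term into a per-step bonus is sketched below, under assumptions that this review does not state: the (x, y) position is discretized into square grid cells, visitation counts are kept per cell and per latent, and the bonus is the log of the estimated probability of the active latent given the cell. Averaged over visited states, this bonus is an empirical estimate of -H(Z|X), scaled by `alpha`. The class name `MIBonus` and the parameters `cell_size`, `alpha`, and `smoothing` are hypothetical.

```python
import math
from collections import defaultdict


class MIBonus:
    """Count-based estimate of log p(z | cell) used as a reward bonus."""

    def __init__(self, num_latents, cell_size=1.0, alpha=1.0, smoothing=1.0):
        self.num_latents = num_latents
        self.cell_size = cell_size
        self.alpha = alpha
        # Additive smoothing (an assumption) keeps the log finite for
        # cells or latents that have not been visited yet.
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: [0] * num_latents)

    def _cell(self, xy):
        """Map a continuous (x, y) position to a discrete grid cell."""
        return (math.floor(xy[0] / self.cell_size),
                math.floor(xy[1] / self.cell_size))

    def update(self, xy, z):
        """Record that latent z was active while the agent was at xy."""
        self.counts[self._cell(xy)][z] += 1

    def bonus(self, xy, z):
        """Return alpha * log p_hat(z | cell(xy)); always <= 0."""
        counts = self.counts.get(self._cell(xy), [0] * self.num_latents)
        numerator = counts[z] + self.smoothing
        denominator = sum(counts) + self.num_latents * self.smoothing
        return self.alpha * math.log(numerator / denominator)


estimator = MIBonus(num_latents=2)
for _ in range(3):
    estimator.update((0.2, 0.3), z=0)
own_region = estimator.bonus((0.7, 0.9), z=0)      # latent 0 "owns" this cell
foreign_region = estimator.bonus((0.7, 0.9), z=1)  # latent 1 intrudes here
```

A latent is penalized less in regions that it alone visits, so the bonus pushes different latents toward different parts of the space.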

**3.4 Learning high-level policies**

After having learned a span of skills from the pre-training tasks, we can use them to solve downstream tasks with sparse rewards by training a high-level policy.

For a given task M, the high-level policy (the Manager Neural Network, shown in Figure 2) receives the full state as input and outputs the parametrization of a categorical distribution, from which a discrete action is sampled out of K possible choices. Usually, the high-level policy runs at a slower time scale than the low-level policy (the SNN), switching its categorical output only every T time-steps. T, called the switch time, is a hyperparameter that depends on the downstream task.
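The interaction between the two levels can be sketched as a rollout loop. This is a toy illustration rather than the authors' implementation: `manager`, `skill_policy`, `env_reset`, and `env_step` are hypothetical callables, and for simplicity the same state is passed to both levels.

```python
import random


def sample_categorical(probs, rng):
    """Sample an index from a list of probabilities that sums to 1."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off


def hierarchical_rollout(env_reset, env_step, manager, skill_policy,
                         switch_time, horizon, rng):
    """Run one episode; the manager picks a new skill every switch_time steps."""
    state = env_reset()
    total_reward = 0.0
    chosen_latents = []
    latent = None
    for t in range(horizon):
        if t % switch_time == 0:
            # High level: categorical distribution over the K skills.
            latent = sample_categorical(manager(state), rng)
            chosen_latents.append(latent)
        # Low level: the pre-trained (frozen) skill policy picks the action.
        action = skill_policy(state, latent)
        state, reward, done = env_step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward, chosen_latents
```

For example, with `horizon=10` and `switch_time=4` the manager is queried at steps 0, 4, and 8, so it makes only three decisions where a flat policy would make ten.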

In this paper, the weights of the SNN are frozen while the high-level policy is trained.

**3.5 Policy optimization**

To optimize the policies, the authors use the Trust Region Policy Optimization (TRPO) algorithm for both the pre-training phase and the training of the high-level policies, because TRPO has excellent empirical performance and does not require excessive hyperparameter tuning.

**4. Experiment Details**

**4.1 Experiment task**

The authors apply the hierarchical SNN framework to two hierarchical tasks described in a previous paper [5]: Locomotion + Maze and Locomotion + Food Collection. The Swimmer robot is used for these tasks. Results with more complex robots (Snake and Ant) can be found in Appendix C-D.

To illustrate the variety of downstream tasks, 4 different mazes are constructed. As shown in Figure 3(a)-3(b), Maze 0 is the same task as in the benchmark [5], and Maze 1 is its reflection. Mazes 2 and 3 are different instantiations of the environment shown in Figure 3(c), where the goal has been placed in the North-East or in the South-West corner respectively. Figure 3(d) shows the Food Gather task, where the robot gets a reward of 1 for gathering a green ball and a reward of -1 for a red ball, all of which are placed randomly at the beginning of each episode. The benchmark of continuous control problems [5] also showed that algorithms employing naive exploration strategies could not solve the Maze and Gather tasks, while more advanced intrinsically motivated exploration [6] can improve the performance. Appendix B reports stronger results obtained with SNNs.

**4.2 Hyperparameters**

In all experiments, the policies are trained with TRPO using a step size of 0.01 and a discount of 0.99, and all neural networks share the same architecture of 2 layers of 32 hidden units. For the pre-training tasks, the batch size and the maximum path length are 50,000 and 500 respectively, the same as in the benchmark [5]. For the downstream tasks, see Table 1.

**5. Results**

To examine how the different pieces of the SNN architecture matter and how they affect the exploration achieved, each step of the skill learning process is evaluated. The results on the sparse-reward environments are explained below. (For videos of the achieved results, see https://goo.gl/5wp4VP)

**5.1 Skill learning in pre-train**

Here, the authors use “visitation plots”, which show the (x, y) position of the robot’s Center of Mass (CoM) during 100 rollouts of 500 time-steps each, to examine the diversity of the learned skills. Fig. 4(a) shows six visitation plots for six different feed-forward policies, each trained from scratch in the pre-training environment. Because the Swimmer robot has a natural preference for forward and backward motion, the visitation concentrates along the direction in which the robot is initialized when there are no extra incentives. The general proxy reward is therefore enough for each independently trained policy to yield a potentially useful skill. Fig. 4(b) superposes a batch of 50 rollouts for each of the 6 policies, with a different color per policy, for easier interpretation and comparison.

Fig. 4(c)-4(d) show the visitation plots of the SNN policies obtained for different design choices, with and without bilinear integration. According to Fig. 4(c), simple concatenation of the latents with the observations rarely yields distinctive behaviors for each latent, whereas the pre-trained SNNs with bilinear integration acquire more forward and backward motions associated with different latents.

**5.2 Hierarchical use of skills**

To illustrate how the hierarchical architectures affect the area covered by random exploration, the authors compare the visitation plots of a single rollout of one million steps. The exploration obtained with standard Gaussian noise, depicted in Fig. 5(a), covers little of the space. In contrast, the proposed hierarchical structures with pre-trained policies yield a drastic increase in exploration, as shown in Fig. 5(b)-5(d). Among these, the hierarchy with Multi-policy concentrates heavily on upward and downward motion, while the exploration obtained with SNNs yields a wider coverage of the space thanks to the underlying policy’s additional behaviors.

**5.3 Mazes and Gather tasks**

Since standard reinforcement learning algorithms cannot properly solve tasks with sparse rewards, the authors compare against a stronger baseline: the downstream task is augmented with the same Center of Mass (CoM) proxy reward that was granted to the robot in the pre-training task. As shown in Fig. 6(a)-6(c), this baseline performs quite poorly in all the mazes because of the long time-horizon and the associated credit assignment problem. The proposed hierarchical architectures learn much faster in every new MDP because they shrink the time-horizon by aggregating time-steps into useful primitives. One point worth emphasizing is that the additional turning and sideways motions acquired by SNNs pre-trained with the MI bonus are not critical in some maze tasks, as shown in Fig. 6(c). In the Gather task, however, as seen in Fig. 6(d), the average return is higher and the variance of the learning curve is lower for the algorithm employing SNNs pre-trained with the MI bonus.

In addition, the authors compare their approach with previous work on the Gather environment. For the sake of fairness, all the settings are the same as in [5]. As shown in Fig. 7, the SNN hierarchical approach outperforms state-of-the-art intrinsic motivation results such as VIME [6].

**6. Discussion and Future Research**

The authors propose a novel approach for learning a diverse set of skills via a stochastic neural network representation, an unsupervised procedure for learning a large span of skills using proxy rewards, and a hierarchical structure that allows the learned skills to be reused in future tasks. The SNN framework with bilinear integration and the mutual information bonus substantially improves the expressiveness and multimodality of the learned skills. Moreover, the hierarchical structure boosts the agent’s exploration in a new environment.

As the paper notes at the end, the current approach has several limitations, which are left as directions for future research. First, the approach is not robust for unstable robots when switching between skills, which could be improved by learning a transition policy or by integrating switching into the pre-training task. Second, the weights of the low-level policies are frozen and the switch time is fixed while the high-level policies are trained. The frozen weights could be addressed by introducing end-to-end training, as in previous work [7][8], using straight-through estimators for Stochastic Computation Graphs with discrete latent variables. The fixed switch time could be addressed by having the Manager learn a termination policy, similar to the Option-critic architecture [9]. Finally, the current work uses only standard feedforward architectures, which cannot use any sensory information gathered while the previous skill was active. One future direction is to introduce a recurrent neural network architecture at the Manager level.

**References**

1. George Konidaris and Andrew G Barto. Building portable options: Skill transfer in reinforcement learning. In *IJCAI*, volume 7, pp. 895–900, 2007.

2. Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. *arXiv preprint arXiv:1609.07088*, 2016.

3. Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. *arXiv preprint arXiv:1606.01847*, 2016.

4. Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. *arXiv preprint arXiv:1606.06630*, 2016.

5. Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. *International Conference on Machine Learning*, 2016.

6. Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Variational information maximizing exploration. *Advances in Neural Information Processing Systems*, 2016.

7. Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. *arXiv preprint arXiv:1611.01144*, 2016.

8. Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. *arXiv preprint arXiv:1611.00712*, 2016.

9. Pierre-Luc Bacon and Doina Precup. The option-critic architecture. *arXiv preprint arXiv:1609.05140v2*, 2016.

**Author:** *Yufeng Xiong* | **Editor:** *Hao Wang* | **Localized by Synced Global Team:** *Xiang Chen*
