Explore, Exploit, and Explode — The Time for Reinforcement Learning is Coming

Reinforcement learning (RL) has been making spectacular achievements, e.g., Atari games, AlphaGo, AlphaGo Zero, AlphaZero, DeepStack, Libratus, OpenAI Five, Dactyl, DeepMimic, Capture the Flag, learning to dress, data center cooling, chemical synthesis, drug design, etc. See more RL applications.

Most of these are academic research. However, we are also witnessing RL products and services, e.g., Google Cloud AutoML and Facebook Horizon, and open-source platforms and testbeds like OpenAI Gym, DeepMind Lab, DeepMind Control Suite, Google Dopamine, DeepMind TRFL, Facebook ELF, Microsoft TextWorld, Amazon AWS DeepRacer, Intel RL Coach, etc. Multi-armed bandits, in particular contextual bandits, have many successful applications.

In the following, I will introduce RL briefly, discuss recent achievements, issues, research directions, applications, and the future of RL. The take-home message is: The time for reinforcement learning is coming.

A Brief Introduction

An RL agent interacts with the environment over time, and learns an optimal policy, by trial and error, for sequential decision-making problems, in a wide range of areas in natural sciences, social sciences, engineering, and art.

At each time step, the agent receives a state and selects an action, following a policy, which is the agent’s behavior, i.e., a mapping from states to actions. The agent receives a scalar reward and transitions to the next state according to the environment dynamics. The model refers to the transition probability and the reward function. The agent aims to maximize the expectation of a long-term return, i.e., a discounted, accumulated reward.
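To make the interaction loop and the discounted return concrete, here is a minimal sketch in Python. The environment (ChainEnv), the two policies, and all parameter values are hypothetical and chosen only for illustration; they do not come from any particular library.

```python
import random

def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

class ChainEnv:
    """Hypothetical toy environment: states 0..4, reward 1 on reaching state 4."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left (bounded at 0).
        move = 1 if action == 1 else -1
        self.state = max(0, self.state + move)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe state, act, receive reward and next state."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards

def random_policy(state):
    return random.choice([0, 1])

def always_right(state):
    return 1
```

For example, run_episode(ChainEnv(), always_right) collects the rewards of one episode, and discounted_return(rewards, 0.9) turns them into the quantity the agent tries to maximize in expectation. Swapping in random_policy gives a lower return on average, because the reward arrives later and is discounted more.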

Supervised learning is usually one-shot, myopic, and considers instant rewards, whereas RL is sequential, far-sighted, and considers long-term accumulative rewards.

Russell and Norvig’s AI textbook states that “reinforcement learning might be considered to encompass all of AI: an agent is placed in an environment and must learn to behave successfully therein” and “reinforcement learning can be viewed as a microcosm for the entire AI problem”. It has also been shown that tasks with computable descriptions in computer science can be formulated as RL problems. These support Dr. David Silver’s hypothesis that AI = RL + DL.

See the following for more details about RL: Dr. David Silver’s UCL RL course, DeepMind & UCL’s DL & RL course, Prof. Sergey Levine’s Deep RL course, OpenAI’s Spinning Up in Deep RL, Sutton & Barto’s RL book, a book draft on deep RL, a collection of deep RL resources, etc.


Recent Achievements

We have witnessed deep RL breakthroughs, such as deep Q-network (DQN), AlphaGo (AlphaGo Zero, AlphaZero), and DeepStack/Libratus, each of which represents a big family of problems and a large number of applications. DQN is for single-player games and single-agent control in general. DQN ignited the current round of popularity of deep RL. AlphaGo is for two-player perfect information zero-sum games. AlphaGo makes a phenomenal achievement on a very hard problem, and sets a landmark in AI. DeepStack is for two-player imperfect information zero-sum games, a family of problems which are inherently difficult to solve. DeepStack/Libratus, similar to AlphaGo, also makes an extraordinary achievement on a hard problem, and sets a milestone in AI.
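DQN builds on the Q-learning update. As a rough illustration, the sketch below runs tabular Q-learning on a hypothetical five-state chain. It is not DQN itself: DQN replaces the table with a deep network and adds components such as experience replay and a target network. The environment, function names, and hyperparameters here are assumptions made for the example.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical chain: states 0..4, episode ends at state 4
ACTIONS = (0, 1)        # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic toy dynamics: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1,
               max_steps=1000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _t in range(max_steps):
            # Epsilon-greedy behavior policy, breaking ties at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

def greedy_policy(q):
    """Greedy action for each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
```

Calling greedy_policy(q_learning()) returns the learned action for each non-terminal state. In this toy problem the values for moving right approach 1, 0.9, 0.81, and 0.729 as the state gets farther from the goal.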

OpenAI Five defeated good human players at Dota 2. OpenAI trained Dactyl, a human-like robot hand, to dexterously manipulate physical objects. DeepMimic trained simulated humanoids to perform highly dynamic and acrobatic skills. Human-level performance in the multi-player game Capture the Flag shows progress in mastering tactical and strategic team play. Learning to dress achieves dressing tasks with a cloth simulation model. Data center cooling has applied RL to real-world physical systems. Chemical synthesis has applied RL to retrosynthesis.

We have also seen applications of RL in products and services. AutoML attempts to make AI easily accessible. Google Cloud AutoML provides services like the automation of neural architecture design. Facebook Horizon has open-sourced an RL platform for products and services like notification delivery, streaming video bit rate optimization, and improvements to M suggestions in Messenger. Amazon has launched a physical RL testbed, AWS DeepRacer, together with Intel RL Coach.

The techniques underlying these achievements, namely, deep learning, RL, Monte Carlo tree search (MCTS), and self-learning, will have wider and further implications and applications.


Issues

There are many concepts, algorithms, and issues in RL. Sample efficiency, sparse reward, credit assignment, exploration vs. exploitation, and representation are common issues, and there are efforts to address them. Off-policy learning uses both on-policy and off-policy data for learning. Auxiliary reward and self-supervised learning learn from non-reward signals in the environment. Reward shaping provides denser rewards. Hierarchical RL is for temporal abstraction. General value functions (GVFs), in particular, Horde, universal value function approximators (UVFAs), and hindsight experience replay (HER), learn shared representation/knowledge among goals. Exploration techniques learn more from valuable actions. Model-based RL can generate more data to learn from. Learning to learn, e.g., one/zero/few-shot learning, transfer learning, and multi-task learning, learns from related tasks to achieve efficient learning. Incorporating structure and knowledge can help achieve more intelligent representation and problem formulation.
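The exploration vs. exploitation trade-off is easiest to see in a multi-armed bandit. The sketch below uses epsilon-greedy action selection, one simple exploration technique among many. The arm means, the Gaussian reward noise, and the function name are assumptions made for illustration.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Gaussian bandit with the given (hidden) arm means."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental sample-mean update of the chosen arm's value estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward
```

With epsilon = 0 the agent can lock onto whichever arm looked best early on. With a small positive epsilon it keeps sampling all arms, so the estimates for every arm keep improving at the cost of some reward spent on exploration.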

RL with function approximation, in particular deep RL, encounters the deadly triad, i.e., instability and/or divergence caused by the combination of off-policy learning, function approximation, and bootstrapping. There are efforts tackling this fundamental issue, e.g., gradient TD (GTD), smoothed Bellman error embedding (SBEED), and non-delusional algorithms.
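A tiny example shows how the three ingredients interact. It follows the spirit of a well-known two-state illustration from the RL literature: two states share a single weight w, with estimated values w and 2w, and only the transition from the first state to the second (reward 0) is ever updated. The function name and parameter values below are assumptions for this sketch.

```python
def semi_gradient_td_two_state(w0=1.0, alpha=0.1, gamma=0.99, n_updates=100):
    """Repeat the semi-gradient TD(0) update on the single transition s1 -> s2.

    Function approximation: v(s1) = 1 * w and v(s2) = 2 * w share one weight.
    Bootstrapping: the target gamma * v(s2) depends on the current estimate.
    Off-policy flavor: only s1 -> s2 is updated, never the way out of s2.
    """
    feature_s1, feature_s2 = 1.0, 2.0
    w = w0
    history = [w]
    for _ in range(n_updates):
        td_error = 0.0 + gamma * feature_s2 * w - feature_s1 * w
        w += alpha * td_error * feature_s1
        history.append(w)
    return history
```

Each update multiplies w by 1 + alpha * (2 * gamma - 1). For gamma above 0.5 this factor exceeds 1 and w grows without bound, even though the true values are all zero. For gamma below 0.5 the same update shrinks w toward zero.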

Reproducibility is an issue for deep RL. Experimental results are influenced by hyperparameters, including network architecture and reward scale, random seeds and trials, environments, and codebases.

RL also shares issues with machine learning in general, like time/space efficiency, accuracy, interpretability, safety, scalability, robustness, simplicity, etc.


Research Directions

It is essential to study the six core elements discussed in the deep RL book draft: value-based methods, policy-based methods, model-based methods, reward, exploration vs. exploitation, and representation. Six important mechanisms, namely attention and memory, unsupervised learning, hierarchical RL, multi-agent RL, relational RL, and learning to learn, play critical roles in various aspects of (deep) RL.

The deep RL book draft discusses six research directions, as both challenges and opportunities. Research direction 1, systematic, comparative study of deep RL algorithms, is about reproducibility, and under the surface, about stability and convergence properties of deep RL algorithms. Research direction 2, “solve” multi-agent problems, is about sample efficiency, sparse reward, stability, non-stationarity, and convergence in a large-scale, complex, and maybe adversarial setting. Research direction 3, learn from entities, but not just raw inputs, is about sample efficiency, sparse reward, and interpretability, by incorporating more knowledge and structure. Research direction 4, design an optimal representation for RL, research direction 5, AutoRL, and research direction 6, (deep) RL for real life, are about the whole RL problem, about all issues in RL, from the different angles of representation, automation, and applications, respectively. We expect all of these research directions except the first to remain open; the first is also challenging. Progress in these directions would deepen our understanding of (deep) RL and push the frontiers of AI further.

Prof. Rich Sutton highlights the importance of planning with a learned model. Prof. Yann LeCun discusses learning world models, in particular, self-supervised learning. Prof. Yoshua Bengio discusses disentangled representations.

There are more and more research efforts toward building machines that learn and think like people, and toward incorporating components of classical AI, such as causality, reasoning, and symbolism. In particular, causal reasoning and relational learning are gaining traction. See more info.


Applications

Twelve application areas are discussed in the deep RL book draft, including games, robotics, NLP, computer vision, finance, business management, healthcare, education, energy, transportation, computer systems, and science, engineering, and art. The last one, science, engineering, and art, basically covers everything, which conveys the message that RL, and AI in general, will be everywhere.

RL is a solution method for sequential decision-making problems. However, some problems that seem non-sequential on the surface, like neural network architecture design, have been successfully approached by RL. In general, RL is probably helpful if a problem can be regarded as or transformed into a sequential decision-making problem, and states, actions, and maybe rewards can be constructed. Roughly speaking, if a task involves some manually designed “strategy”, then there is a chance for RL to help automate and optimize the strategy.

There are interesting applications of RL for beam search policies, database join queries, active learning, program synthesis, model compression and acceleration, driver management, etc.

One particular direction for RL applications is to extend AlphaGo techniques. As recommended by the AlphaGo authors in their papers, the following applications are worth further investigation: general game-playing (in particular, video games), classical planning, partially observed planning, scheduling, constraint satisfaction, robotics, industrial control, online recommendation systems, protein folding, reducing energy consumption, and searching for revolutionary new materials. Chemical synthesis is a good example.

For RL to work in real-life applications, we need to consider the availability of data and computation. The success of AlphaGo hinges on a perfect model of the game of Go, with which we can generate a huge amount of training data, and on Google-level computation power. For some applications like robotics, healthcare, and education, we usually do not have a good model, so it is nontrivial to obtain a large amount of data. Off-policy policy evaluation is one approach to this issue.
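As a rough sketch of off-policy policy evaluation, the code below estimates the value of a new (target) policy from data logged under a different (behavior) policy, using ordinary importance sampling. It is restricted to a one-step, context-free setting for brevity; sequential problems multiply the ratios along each trajectory. The function names, the data format, and the Bernoulli rewards are assumptions made for this example.

```python
import random

def importance_sampling_ope(logged, target_policy):
    """Ordinary importance-sampling estimate of the target policy's expected reward.

    logged: list of (action, behavior_prob, reward) tuples
    target_policy: dict mapping action -> probability under the target policy
    """
    total = 0.0
    for action, behavior_prob, reward in logged:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy[action] / behavior_prob
        total += weight * reward
    return total / len(logged)

def make_logged_data(behavior_policy, mean_reward, n, seed=0):
    """Simulate logs: sample actions from the behavior policy, Bernoulli rewards."""
    rng = random.Random(seed)
    actions = list(behavior_policy)
    probs = [behavior_policy[a] for a in actions]
    data = []
    for _ in range(n):
        a = rng.choices(actions, weights=probs)[0]
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        data.append((a, behavior_policy[a], r))
    return data
```

For instance, with logs collected under a uniform behavior policy over two actions, the estimator can approximate how a policy that strongly prefers one action would have performed, without deploying that policy. The estimate is unbiased when the behavior policy gives nonzero probability to every action the target policy can take, but its variance grows as the two policies diverge.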


The Future

In the above, I discussed recent achievements, issues, research directions, and applications of RL. In the following, I present several researchers’ opinions.

David Silver summarizes principles of deep RL: evaluation drives progress, scalability determines success, generality future-proofs algorithms, trust in the agent’s experience, state is subjective, control the stream, value functions model the world, planning: learn from imagined experience, empower the function approximator, and learn to learn.

Prof. Dimitri Bertsekas is cautiously positive about the future of RL, including its real life applications. The following items are quoted literally from his slides.

  • Broadly applicable methodology: Can address broad range of challenging problems. Deterministic-stochastic-dynamic, discrete-continuous, games, etc.
  • There are no methods that are guaranteed to work for all or even most problems
  • There are enough methods to try with a reasonable chance of success for most types of optimization problems
  • Role of the theory: Guide the art, delineate the sound ideas
  • There are challenging implementation issues in all approaches, and no fool-proof methods
  • Problem approximation and feature selection require domain-specific knowledge
  • Training algorithms are not as reliable as you might think by reading the literature
  • Approximate PI involves oscillations (note: PI means policy iteration)
  • Recognizing success or failure can be a challenge!
  • The RL successes in game contexts are spectacular, but they have benefited from perfectly known and stable models and small number of controls (per state)
  • Problems with partial state observation remain a big challenge
  • Massive computational power together with distributed computation are a source of hope
  • Silver lining: We can begin to address practical problems of unimaginable difficulty!
  • There is an exciting journey ahead!

Sutton and Barto’s RL book is intuitive. Bertsekas and Tsitsiklis’s Neuro-Dynamic Programming, which is closely related to (deep) RL, is theoretical. Prof. Bertsekas has a new RL and optimal control book. If we call Prof. Sutton the father of RL, then we would call Prof. Bertsekas an uncle of RL.

Not only do we have positive opinions about RL from researchers focusing on fundamental studies, we also see the deployments of RL in products and services like Google Cloud AutoML, Facebook Horizon, etc.

It’s tough to make predictions, especially about the future. Various blogs discuss the importance of RL, in particular for 2019. RL is among MIT Technology Review’s 10 Breakthrough Technologies of 2017, and DL is among the 2013 selection. For AI in general, Prof. Geoffrey Hinton mentioned that: “No, there’s not going to be an AI winter, because it drives your cellphone. In the old AI winters, AI wasn’t actually part of your everyday life. Now it is.” Dr. Andrew Ng provides an AI Transformation Playbook.

RL has been accumulating quantitative changes, which would lead to qualitative changes, in both fundamental research and real life applications. Bearing in mind that there are both challenges and opportunities, the evidence implies that the time for reinforcement learning is coming.

Contributor: Yuxi Li

Yuxi Li, author of Deep Reinforcement Learning, CS PhD from the University of Alberta.
