A new study from DeepMind and Amii (Alberta Machine Intelligence Institute) proposes emphatic algorithms that extend emphatic methods to multi-step deep reinforcement learning (RL) targets. The approach delivers noticeable benefits on problems that highlight the instability of existing temporal difference (TD) methods, and the team’s combination of emphatic traces and deep neural networks improves agent performance on classic Atari video games.
Off-policy learning enables AI agents to learn target policies from experience generated by different behaviour policies, allowing them to accumulate richer knowledge. Temporal difference (TD) learning, however, can become unstable when off-policy learning is combined with function approximation and bootstrapping, a combination the research community refers to as the “deadly triad.”
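The instability can be seen in a small worked example. The sketch below is not from the article: it uses the classic two-state example from the emphatic TD literature, with a single weight `theta`, features 1 and 2, all rewards zero, a target policy that always moves to the second state, and a behaviour policy that visits both states equally often. The variable names and the step size are illustrative assumptions.

```python
import numpy as np

gamma = 0.9
X = np.array([[1.0], [2.0]])          # linear features: v(s1) = theta, v(s2) = 2 * theta
P_target = np.array([[0.0, 1.0],      # target policy: always move to (or stay in) state 2
                     [0.0, 1.0]])
d_behaviour = np.array([0.5, 0.5])    # behaviour policy visits both states equally often

# Expected off-policy TD(0) update is theta <- theta + alpha * (b - A @ theta).
# All rewards are zero, so b = 0 and the true value function is theta = 0.
A = X.T @ np.diag(d_behaviour) @ (np.eye(2) - gamma * P_target) @ X

theta = np.array([1.0])
alpha = 0.1                           # illustrative step size
history = [theta[0]]
for _ in range(100):
    theta = theta - alpha * (A @ theta)
    history.append(theta[0])

print("key matrix A:", A[0, 0])       # negative, so the expected update is unstable
print("theta after 100 expected updates:", history[-1])
```

Because the key matrix `A` is negative here, every expected update pushes `theta` further from the correct value of zero, regardless of how small the step size is.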
In the paper Emphatic Algorithms for Deep Reinforcement Learning, the DeepMind and Amii researchers address this issue by adapting the emphatic temporal difference algorithm ETD(λ), which ensures convergence in the linear case by appropriately weighting the TD(λ) updates.
The paper first provides background on Markov decision processes (MDPs), forward-view learning targets, and ETD(λ). Policy evaluation is the problem of learning to predict the value of every state under a fixed policy, and TD(λ) is a widely used policy evaluation algorithm that can be extended to off-policy settings. However, combining function approximation with bootstrapping and off-policy learning (the deadly triad) can cause the parameters to diverge. Emphatic TD(λ) resolves this stability issue by adjusting the magnitude of the update at each time step, and it is convergent with linear function approximation.
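To make the per-step reweighting concrete, here is a minimal sketch of one linear ETD(λ) update, following the standard formulation with a constant discount, constant λ and a constant interest of 1. The function name `etd_lambda_step` and its argument names are hypothetical, chosen for illustration; this is not the deep RL variant proposed in the paper.

```python
import numpy as np

def etd_lambda_step(theta, trace, follow_on, x, x_next, reward, rho, rho_prev,
                    gamma=0.9, lam=0.0, interest=1.0, alpha=0.1):
    """One linear ETD(lambda) update (sketch; constant gamma, lambda and interest)."""
    # Follow-on trace: discounted, importance-weighted count of past interest.
    follow_on = rho_prev * gamma * follow_on + interest
    # Emphasis: how strongly the current state's update is weighted.
    emphasis = lam * interest + (1.0 - lam) * follow_on
    # Eligibility trace scaled by the emphasis and the current importance ratio.
    trace = rho * (gamma * lam * trace + emphasis * x)
    # Ordinary one-step TD error under linear function approximation.
    td_error = reward + gamma * (theta @ x_next) - theta @ x
    theta = theta + alpha * td_error * trace
    return theta, trace, follow_on

# Usage: two on-policy steps (rho = 1) on the same illustrative transition.
theta, trace, follow_on = np.zeros(2), np.zeros(2), 0.0
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(2):
    theta, trace, follow_on = etd_lambda_step(
        theta, trace, follow_on, x, x_next, reward=1.0, rho=1.0, rho_prev=1.0)
print(theta, follow_on)
```

Starting the follow-on trace at zero makes the first call set it to the interest, matching the usual initialization. The only difference from off-policy TD(λ) is the `emphasis` factor, which rescales each update.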
The team generalizes ETD(λ) to widely used deep RL systems by extending it to multi-step deep RL learning targets, including “V-trace,” the off-policy value-learning method used in actor-critic systems and a focus of this work.
The researchers propose two extended ETD(λ) algorithms: Windowed ETD(λ) (WETD) and Emphatic TD(n) (NETD). For WETD, the team adapts ETD(λ) to use update windows, where each state in a window is updated with a variable bootstrap length, all bootstrapping on the last state in the window. For NETD, the team uses an off-policy n-step TD target, with NETD accumulating every n steps, making it a tamer trace than the WETD follow-on trace and thus more stable than WETD when larger bootstrap lengths are used.
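The "accumulating every n steps" behaviour can be sketched as follows. This is one plausible reading of the description above, assuming a constant discount, an interest of 1 and a trace initialized to 1 for the first n steps; the exact recursion in the paper may differ, and `netd_trace` is a hypothetical helper name.

```python
import numpy as np

def netd_trace(rhos, n, gamma=0.99):
    """Sketch of an n-step emphatic trace (assumed form, not the paper's code).

    F[t] = gamma**n * (rho[t-n] * ... * rho[t-1]) * F[t-n] + 1, with F[t] = 1 for t < n.
    """
    rhos = np.asarray(rhos, dtype=float)
    F = np.ones(len(rhos))
    for t in range(n, len(rhos)):
        # The trace looks back n steps at once instead of compounding at every step.
        F[t] = gamma ** n * np.prod(rhos[t - n:t]) * F[t - n] + 1.0
    return F

on_policy = np.ones(6)                               # importance ratios of 1
two_step = netd_trace(on_policy, n=2, gamma=0.9)
one_step = netd_trace(on_policy, n=1, gamma=0.9)     # one-step follow-on trace
print(two_step)
print(one_step)
```

Under this reading, setting n = 1 gives the one-step follow-on recursion, and a larger n makes the trace compound less often, which is consistent with the article's description of the NETD trace as the tamer one.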
The team focuses on V-trace because actor-critic agents suffer more from off-policy learning than value-based agents. They combine the emphatic traces with off-policy n-step TD or V-trace value targets, and apply them to both the value estimate gradient and the policy gradient. They name their novel emphatic algorithms after the emphatic trace used, e.g. NETD, WETD, ClipWETD, Clip-NETD, NEVtrace, WEVtrace.
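For readers unfamiliar with V-trace, the sketch below computes V-trace value targets for a single trajectory using the standard definition from the IMPALA agent, with clipped importance ratios. The article does not give this formula, and the function and argument names are assumptions for illustration; the emphatic variants would additionally weight the resulting updates by an emphatic trace, which is not shown here.

```python
import numpy as np

def vtrace_targets(values, bootstrap_value, rewards, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets for one trajectory (sketch of the standard definition)."""
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)      # pi(a|s) / mu(a|s) at each step
    rho = np.minimum(rho_bar, ratios)             # clipped weights on the TD errors
    c = np.minimum(c_bar, ratios)                 # clipped weights on the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rho * (rewards + gamma * next_values - values)
    corrections = np.zeros(len(values))
    acc = 0.0
    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * c[t] * acc
        corrections[t] = acc
    return values + corrections

# On-policy check: with all ratios equal to 1 the targets are plain n-step returns.
targets = vtrace_targets(values=[0.5, 0.2, -0.1], bootstrap_value=1.0,
                         rewards=[1.0, 0.0, 2.0], ratios=[1.0, 1.0, 1.0], gamma=0.9)
print(targets)
```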
Finally, the researchers empirically analyze how qualitative properties such as convergence, learning speed and variance manifest in practice for the proposed emphatic algorithms.
Root mean squared error (RMSE) results over training time for NETD (WETD) and Clip-NETD (ClipWETD) indicate that Clip-NETD learns quickly and exhibits low variance with no instability after some initial fluctuations. The experiments also show that while V-trace diverges, NEVtrace slowly converges.
The study ultimately aims to design emphatic algorithms that improve off-policy learning at scale, so the team also evaluated their emphatic algorithms on Atari games from the widely used Arcade Learning Environment deep RL benchmark. Here, they used Surreal as the baseline agent, which achieves a strong median human-normalized score of 403 percent, and measured median human-normalized scores across 57 games.
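As a reminder of how this metric is usually computed, the sketch below applies the conventional definition: each game's score is rescaled so that a random agent scores 0 percent and a human scores 100 percent, and the median is taken across games. The game names and raw scores are made-up placeholders, not results from the paper.

```python
from statistics import median

def human_normalized_score(agent_score, random_score, human_score):
    """Percentage of the random-to-human gap closed by the agent on one game."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Hypothetical raw scores per game: (agent, random, human). Placeholder values only.
raw_scores = {
    "game_a": (50.0, 0.0, 10.0),
    "game_b": (30.0, 10.0, 20.0),
    "game_c": (5.0, 1.0, 9.0),
}

normalized = {game: human_normalized_score(*scores)
              for game, scores in raw_scores.items()}
median_score = median(normalized.values())
print(normalized, median_score)
```

The median is used rather than the mean so that a handful of games with very large normalized scores do not dominate the summary.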
The results show that the best-performing emphatic actor-critic agent, NETD-ACE, improved the median human-normalized score from the baseline performance of 403 percent to 497 percent, demonstrating the power and potential of the proposed NETD family of emphatic algorithm variants. The team says future work in this area could further investigate the application of emphatic traces to various other off-policy learning targets and settings at scale.
The paper Emphatic Algorithms for Deep Reinforcement Learning is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang