AI Machine Learning & Data Science Research

ETH Zurich & UC Berkeley Method Automates Deep Reward-Learning by Simulating the Past

A research team from ETH Zurich and UC Berkeley proposes a Deep Reward Learning by Simulating the Past (Deep RLSP) algorithm that represents rewards directly as a linear combination of features learned through self-supervised representation learning and enables agents to simulate human actions backwards in time to infer what they must have done.

In the field of reinforcement learning (RL), task specifications are typically designed by experts. Learning from demonstrations and preferences requires a great deal of human interaction, and hand-coded reward functions are notoriously difficult to specify. If all these hand-designed RL system parts and specifications could be replaced with automatically learned components — as is increasingly the case in other AI areas — that would be a huge breakthrough.

In a new paper, a research team from ETH Zurich and UC Berkeley proposes Deep Reward Learning by Simulating the Past (Deep RLSP), a novel algorithm that represents rewards directly as a linear combination of features learned through self-supervised representation learning and enables agents to simulate human actions “backwards in time to infer what they must have done.”


The research team begins with the premise that a given environmental state is already optimized toward a user’s preferences. For instance, if a vase is observed intact in a room, it is reasonable to assume that its user(s) have no desire to break the vase. The study thus attempts to simulate the past trajectories that led to an observed state, instead of manually specifying what an agent should do.

The proposed method starts at an observed state and simulates backwards in time to derive a gradient that is amenable to estimation. It learns an inverse policy and inverse dynamics model using supervised learning to perform the backwards simulation.

The environment for RL is formalized as a stochastic finite state machine with inputs (actions sent from the agent) and outputs (observations and rewards sent to the agent), which can be abstracted as a finite-horizon Markov Decision Process (MDP) that contains a set of states S and a set of actions A. The transition function T determines the distribution over next states given a state and an action, and the reward function r determines the agent’s objective. A policy π specifies how to choose actions given a state. Here, as with most RL, the goal is to find a policy π∗ that maximizes the expected cumulative reward.
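The pieces of this formalization can be made concrete with a toy finite MDP. The following sketch is our own minimal example (the sizes, random transition table, and Monte Carlo return estimator are illustrative choices, not the paper's setup): it builds a transition function T, a reward function r, and a policy π, then estimates the policy's expected cumulative reward over a finite horizon.

```python
import numpy as np

# Toy finite-horizon MDP (illustrative values, not from the paper):
# T[s, a] is a distribution over next states, r(s, a) is the reward,
# and pi(a | s) is a stochastic policy.
rng = np.random.default_rng(0)
n_states, n_actions, horizon = 3, 2, 5

T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.standard_normal((n_states, n_actions))         # reward r(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # policy pi(a | s)

def expected_return(pi, n_rollouts=2000):
    """Monte Carlo estimate of the expected cumulative reward."""
    total = 0.0
    for _ in range(n_rollouts):
        s = 0
        for _ in range(horizon):
            a = rng.choice(n_actions, p=pi[s])
            total += r[s, a]
            s = rng.choice(n_states, p=T[s, a])
    return total / n_rollouts

print(round(expected_return(pi), 3))
```

Finding the optimal policy π∗ then amounts to searching for the π that maximizes this expectation.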

The researchers first describe how Deep RLSP can learn reward functions for high-dimensional environments when only given access to a simulator and the observed state. To this end, Deep RLSP needs to approximate expectations over past trajectories. The researchers propose that if they can sample the future by rolling out forward in time, they should also be able to sample the past by rolling out backward in time. In this case, they can learn the inverse policy and the inverse dynamics using supervised learning, and approximate the expectation in the gradient.
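The backward rollout idea can be sketched as follows. Both "models" below are random stand-ins for the networks the paper trains with supervised learning, and all names are our own: an inverse policy proposes the action that likely preceded a state, and an inverse dynamics model proposes the previous state; chaining them samples a plausible past trajectory.

```python
import numpy as np

# Hedged sketch of backward simulation (toy stand-ins for learned models):
rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2

def inverse_policy(s):
    # stand-in for a network predicting a_{t-1} given s_t
    return rng.standard_normal(action_dim)

def inverse_dynamics(s, a):
    # stand-in for a network predicting s_{t-1} given (s_t, a_{t-1})
    return s + 0.1 * rng.standard_normal(state_dim)

def sample_past(s_T, steps):
    """Roll out backward from an observed state s_T."""
    traj = [s_T]
    s = s_T
    for _ in range(steps):
        a = inverse_policy(s)
        s = inverse_dynamics(s, a)
        traj.append(s)
    return traj[::-1]  # reorder to run forward in time

past = sample_past(np.zeros(state_dim), steps=5)
print(len(past))  # 6 states: s_{T-5} ... s_T
```

Averaging a feature-based quantity over many such sampled pasts is what approximates the expectation in the gradient.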

But this gradient is problematic, as it depends on a feature function. In the next step, the team attempts to remove this assumption by using self-supervised learning to learn the feature function. They do this by having a variational autoencoder learn the feature function under fully observable environments and directly encode the states into a latent feature representation.
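A minimal sketch of this idea, with our own toy architecture and random weights standing in for a trained VAE encoder: the encoder maps a fully observed state s to a latent feature vector φ(s), and the reward is modeled as the linear combination w·φ(s).

```python
import numpy as np

# Toy feature function phi(s) as a two-layer encoder (random weights
# stand in for a VAE encoder trained with self-supervised learning):
rng = np.random.default_rng(0)
state_dim, feat_dim = 10, 5

W1 = rng.standard_normal((state_dim, 64)) * 0.1
W_mu = rng.standard_normal((64, feat_dim)) * 0.1

def phi(s):
    h = np.maximum(s @ W1, 0.0)  # hidden layer with ReLU
    return h @ W_mu              # latent feature representation of s

w = rng.standard_normal(feat_dim)       # linear reward weights
states = rng.standard_normal((3, state_dim))
reward = phi(states) @ w                # r(s) = w . phi(s)
print(reward.shape)
```

In the actual method the encoder weights are trained on states from the environment rather than fixed at random; the point here is only the shape of the pipeline from state to features to scalar reward.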

For partially observable environments, the researchers apply recurrent state space models (RSSMs), which allow the states to encode the history, so that the partially observable MDP can be converted into a latent MDP with an identity feature function. In this way, they can then compute gradients directly in this latent MDP.
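The history-encoding step can be illustrated with a plain recurrent update standing in for an RSSM (all weights below are random toy values, and the architecture is our simplification): each observation is folded into a latent state that summarizes the history, and that latent state is used directly as the feature vector, i.e. the identity feature function of the latent MDP.

```python
import numpy as np

# Toy recurrent encoder (RSSM stand-in): the latent state h summarizes
# the observation history, so phi(h) = h in the latent MDP.
rng = np.random.default_rng(0)
obs_dim, latent_dim = 8, 16

W_h = rng.standard_normal((latent_dim, latent_dim)) * 0.1
W_o = rng.standard_normal((obs_dim, latent_dim)) * 0.1

def encode_history(observations):
    h = np.zeros(latent_dim)
    for o in observations:            # fold each observation into h
        h = np.tanh(h @ W_h + o @ W_o)
    return h                          # latent state = feature vector

obs_seq = rng.standard_normal((10, obs_dim))
latent = encode_history(obs_seq)
print(latent.shape)
```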

Putting all these components together forms the Deep RLSP algorithm.
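The assembled loop, as we understand it, alternates between simulating pasts and updating the reward weights. Everything below is a toy stand-in, not the authors' implementation: the feature function, backward simulator, and gradient surrogate are placeholders meant only to show the control flow.

```python
import numpy as np

# High-level sketch of the Deep RLSP loop (all helpers are toy
# stand-ins): alternate between (1) simulating past trajectories
# backward from the observed state and (2) updating the linear reward
# weights w; a forward RL step on reward w . phi(s) would interleave.
feat_dim = 4

def encode(s):                 # stand-in feature function phi
    return s[:feat_dim]

def simulate_past(s_obs):      # stand-in backward rollout
    return [s_obs - 0.1 * t * np.ones_like(s_obs) for t in range(5)]

def gradient_estimate(past_traj, s_obs):
    # toy surrogate: push w toward the observed state's features
    # relative to the average features along the simulated past
    past_feats = np.mean([encode(s) for s in past_traj], axis=0)
    return encode(s_obs) - past_feats

s_obs = np.ones(feat_dim)
w = np.zeros(feat_dim)         # linear reward weights
for _ in range(20):
    past = simulate_past(s_obs)
    w += 0.1 * gradient_estimate(past, s_obs)
    # (a policy-improvement step against w . phi(s) would go here)

print(w.round(2))
```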


The team employed the MuJoCo (Multi-Joint dynamics with Contact) physics engine in their experiments to show that Deep RLSP can be scaled to complex, continuous, high-dimensional environments. They selected three environments from the OpenAI Gym — Inverted Pendulum, Half-Cheetah and Hopper — and compared Deep RLSP against a GAIL (Generative Adversarial Imitation Learning) baseline.


The results show that although GAIL was provided with both states and actions as input, it could only learn a truly good policy for the (very simple) inverted pendulum environment. Deep RLSP, meanwhile, achieved reasonable behaviour across all environments with only states as input.

The study demonstrates that learning useful policies with neural networks doesn’t necessarily require significant manual human effort. The proposed Deep RLSP frees researchers from this burden by extracting the “free” information present in an environment’s current state.

The paper Learning What To Do by Simulating the Past is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

