
Pieter Abbeel Team Proposes Task-Agnostic RL Method to Auto-Tune Simulations to the Real World

A research team from UC Berkeley and Carnegie Mellon University proposes a task-agnostic reinforcement learning method that reduces the task-specific engineering required for domain randomization of both visual and dynamics parameters.

The real world is complex and ever-changing. Applying deep learning (DL) techniques to complex control tasks relies on learning in simulations before transferring models to the real world. But there is a problematic “reality gap” associated with such transfers, as it is difficult for simulators to accurately capture or predict the dynamics and visual properties of the real world.

Domain randomization methods are among the most effective ways to tackle this issue. A model is incentivized to learn features that are invariant to the shift between simulation and reality data distributions. However, this approach requires task-specific expert knowledge for feature engineering, and the process is often time-consuming and laborious.

In the paper Auto-Tuned Sim-to-Real Transfer, a research team from UC Berkeley and Carnegie Mellon University proposes a task-agnostic reinforcement learning (RL) method that reduces the task-specific engineering required for domain randomization of both visual and dynamics parameters. Using only raw observations as inputs, the approach can auto-tune the system parameters of a simulation to map reality.


The researchers summarize their contributions as:

  1. Proposing an automatic system identification procedure with the key insight of reformulating the problem of tuning a simulation as a search problem.
  2. Designing a Search Param Model (SPM) that updates the system parameters using raw pixel observations of the real world.
  3. Demonstrating that the proposed method outperforms domain randomization on a range of robotic control tasks in both sim-to-sim and sim-to-real transfer.

A real-world RL problem is typically defined by a partially observable Markov decision process (POMDP), in which the reward function depends on an unobserved state; it is very challenging to fully and accurately simulate the real world in order to assign such rewards to an agent. One of the highlights of this paper is that the researchers train the agent in a simulation where the unobserved state space and reward functions are easily accessible.

The dynamics and visuals of the simulator are defined by its system parameters. Domain randomization samples these simulator parameters from a distribution and trains the policy in simulation to maximize expected return across the resulting range of simulated environments.
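As a concrete illustration, domain randomization can be sketched as drawing a fresh set of simulator parameters for each training episode. The parameter names and the `sample_params` helper below are hypothetical, not taken from the paper:

```python
import random

def sample_params(xi_mean, spread=0.3):
    """Sample each simulator parameter uniformly around its mean,
    with a range proportional to the mean (one common randomization choice)."""
    return {name: random.uniform(mean * (1 - spread), mean * (1 + spread))
            for name, mean in xi_mean.items()}

# Hypothetical nominal dynamics parameters for a simulated arm.
xi_mean = {"mass": 1.0, "friction": 0.5, "damping": 0.1}

# Each training episode would then run under a freshly sampled simulator.
episode_params = [sample_params(xi_mean) for _ in range(3)]
```

A policy trained this way sees many perturbed versions of the simulator, which encourages features that transfer when the real parameters fall inside the sampled range.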


To this end, standard practice uses expert knowledge and time-consuming manual engineering of the environment so that the mean simulator parameters (represented as “ξmean”) reasonably approximate the real-world parameters (ξreal). In principle, ξmean could be selected by comparing trajectories from differently parametrized simulations against trajectories from the real world, but measuring such trajectories requires state-space information that is rarely obtainable in the real world. The team therefore uses only raw pixel observations of the real world to find ξmean and auto-tune their simulator.

The proposed approach aims to automatically find ξmean ≈ ξreal using a function that maps a sequence of observations and actions to their corresponding system parameters. Rather than predicting ξreal exactly, the team reformulates auto-tuning as a search procedure and proposes the Search Param Model (SPM).
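The payoff of the search framing is that the model only needs to answer a binary comparison question ("is this guess too high?"), which is enough to drive a search toward the true value. A minimal sketch of the idea for a single scalar parameter, with a perfect oracle standing in for the learned SPM (the `bisect_param` helper and its interface are illustrative assumptions, not the paper's algorithm):

```python
def bisect_param(is_too_high, lo, hi, iters=20):
    """Locate an unknown scalar parameter using only a binary
    'is the guess above the real value?' signal."""
    for _ in range(iters):
        guess = (lo + hi) / 2
        if is_too_high(guess):
            hi = guess   # real value lies below the guess
        else:
            lo = guess   # real value lies at or above the guess
    return (lo + hi) / 2

# Toy oracle: pretend the real friction coefficient is 0.42.
found = bisect_param(lambda g: g > 0.42, lo=0.0, hi=1.0)
```

A classifier that merely compares is an easier learning target than a regressor that must output the exact parameter value from raw pixels, which is the intuition behind preferring search over direct prediction.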

SPM is a binary classifier designed to iteratively “auto-tune” ξmean toward ξreal. The researchers first pretrain the SPM on simulated trajectories generated under the current simulation system parameters, using a logistic (binary cross-entropy) loss. However, the environment distribution E (a uniform distribution centered on the mean simulator parameters, with a range proportional to that mean) may drift significantly from its initial setting as auto-tuning continues, so pretraining the SPM alone is not sufficient. The team therefore applies a joint training procedure that continues to train and update the SPM as E slowly shifts.
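The auto-tuning loop described above can be sketched as follows. Here `spm_prob_high(traj, xi)` stands in for the learned SPM, returning the probability that each parameter guess is above the value that generated the trajectory; the function names and the multiplicative update rule are illustrative assumptions, not the paper's exact procedure:

```python
def auto_tune(spm_prob_high, xi_mean, real_trajs, steps=50, step_size=0.05):
    """Iteratively nudge the simulator-parameter mean toward the real system,
    using only the SPM's binary comparison signal on real trajectories."""
    for _ in range(steps):
        for traj in real_trajs:
            probs = spm_prob_high(traj, xi_mean)
            for name in xi_mean:
                # If the guess is likely too high, shrink it; otherwise grow it.
                direction = -1.0 if probs[name] > 0.5 else 1.0
                xi_mean[name] *= 1.0 + direction * step_size
        # In the full method, the SPM itself would also keep training here
        # on fresh simulated trajectories drawn around the updated mean.
    return xi_mean

# Toy stand-in: an oracle SPM that knows the real mass is 2.0.
real = {"mass": 2.0}
oracle = lambda traj, xi: {k: float(xi[k] > real[k]) for k in xi}
tuned = auto_tune(oracle, {"mass": 1.0}, real_trajs=[None])
```

With the oracle in place of a learned model, the mean converges to a small oscillation around the true value; the learned SPM plays the oracle's role using only raw pixel observations.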


To validate that SPM can effectively update simulators to the correct system parameters and improve real-world return over sim-to-real transfer with naive domain randomization, the team conducted experiments on six sim-to-sim transfer tasks: four from the DeepMind Control Suite and two robotic arm tasks (Rope Peg-in-Hole and Cabinet Slide).


In all DeepMind Control Suite environments, SPM matched or exceeded the two baselines: domain randomization and a variant of the SPM method that directly regresses on the simulation parameter values.


For the rope task, the baseline failed to move the peg towards the hole, whereas SPM consistently succeeded. SPM similarly succeeded in the cabinet slide task, while the baseline moved towards the box but consistently overshot and failed. The results validated SPM’s ability to adjust the randomization mean to be closer to the real system parameters, ultimately leading to improved transfer success in the real world.

The paper Auto-Tuned Sim-to-Real Transfer is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

