MIT & IBM ‘Curiosity’ Framework Explores Embodied Environments to Learn Task-Agnostic Visual Representations

A research team from MIT and the MIT-IBM Watson AI Lab proposes Curious Representation Learning (CRL), a framework that learns to understand its surrounding environment by training a reinforcement learning (RL) agent to maximize the error of a representation learner, which gives the agent an incentive to explore.

Self-supervised visual representation learning has achieved impressive results in recent years. Because this approach works with unlabelled images, it can leverage the trillions of images available on the Internet and in photo datasets. A new study, however, argues that building “truly intelligent” learners requires moving beyond the curated data paradigm in favour of a more biological approach to vision, in which agents also learn from their environments. Infants, for example, acquire visual experience through active physical exploration and interactions such as pushing, grasping, sucking and prodding.

The question is, given an interactive environment, how might an AI agent learn good visual representations without any prior data or defined tasks? To address this, a research team from MIT and the MIT-IBM Watson AI Lab has proposed Curious Representation Learning (CRL). Given a self-supervised representation learning technique, the framework trains a reinforcement learning (RL) agent to learn an exploration policy whose reward equals the loss of the self-supervised representation learning model.


The researchers summarize their contributions as:

  1. Introduce CRL as an approach to embodied representation learning, in which a representation learning model plays a minimax game with an exploration policy.
  2. Show that learned visual representations can help in a variety of embodied tasks where it is crucial to freeze representations to enable good performance.
  3. Show that these representations, while entirely trained in simulation, can obtain interpretable results on real photographs.
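
The minimax game in the first contribution can be written compactly. The following is a simplified sketch, with notation chosen here for illustration rather than taken from the paper: π is the exploration policy, θ denotes the parameters of the representation learner, and L_θ(x) is its self-supervised loss on an image x gathered by the policy.

```latex
\max_{\pi} \; \min_{\theta} \; \mathbb{E}_{x \sim \pi}\left[ \mathcal{L}_{\theta}(x) \right]
```

The representation learner lowers its loss on the images it is shown, while the policy seeks out images on which that loss is still high.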

The researchers first review background knowledge on contrastive representation learning frameworks. To learn representations, they utilize a contrastive learning approach comprising a representation learning model, a two-layer multilayer perceptron (MLP) projection head, and a family of data augmentations. They also train an RL policy whose reward is the representation learner's loss, which incentivizes the policy to find previously unseen images on which the model incurs high losses. This curiosity-driven intrinsic motivation enables the policy to obtain useful data automatically.
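
To make the link between contrastive loss and reward concrete, here is a minimal sketch in plain Python. It assumes an InfoNCE-style loss over two augmented views of each image; the exact loss, the temperature value, and the function names `info_nce_per_sample` and `intrinsic_rewards` are illustrative assumptions, not the paper's implementation. The inputs stand in for the outputs of the projection head.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce_per_sample(z_a, z_b, temperature=0.1):
    """Per-image contrastive loss (assumed InfoNCE form).

    z_a[i] and z_b[i] are projections of two augmentations of image i;
    every other z_b[j] in the batch acts as a negative for z_a[i].
    """
    z_a = [l2_normalize(v) for v in z_a]
    z_b = [l2_normalize(v) for v in z_b]
    losses = []
    for i, anchor in enumerate(z_a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [
            sum(p * q for p, q in zip(anchor, other)) / temperature
            for other in z_b
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching view as the correct class.
        losses.append(log_denom - logits[i])
    return losses


def intrinsic_rewards(z_a, z_b, temperature=0.1):
    """Hypothetical helper: the policy's reward is the learner's loss."""
    return info_nce_per_sample(z_a, z_b, temperature)


# Views the learner already matches well yield low reward...
familiar = intrinsic_rewards([[1.0, 0.0], [0.0, 1.0]],
                             [[0.9, 0.1], [0.1, 0.9]])
# ...while views it confuses yield high reward, drawing the policy there.
novel = intrinsic_rewards([[1.0, 0.0], [0.0, 1.0]],
                          [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setup the policy would be paid more for the `novel` batch than for the `familiar` one, which is the exploration incentive described above.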

The researchers conducted extensive experiments to validate the utility of the learned representations on downstream tasks such as semantic navigation, visual language navigation and real image understanding. For representation learning models, they utilized a ResNet50 image encoder. To pretrain representations, they trained CRL in the Habitat simulator using the Matterport3D dataset, and used the Gibson dataset for experimental validation.


For semantic navigation, the team evaluated task success, success weighted by path length (SPL), soft SPL (success weighted by path length but with a softer success criterion), and distance to goal. CRL achieved the best results on both ImageNav and ObjectNav.
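
The path-length metrics can be illustrated with a short Python sketch. SPL follows its standard definition: success multiplied by the ratio of the shortest path length to the longer of the agent's path and the shortest path, averaged over episodes. The soft variant shown here replaces binary success with the fraction of the initial distance the agent covered, which is an assumption about the exact form used in the paper. The `Episode` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool          # did the agent reach the goal?
    shortest: float        # shortest-path length from start to goal (> 0)
    taken: float           # length of the path the agent actually took
    final_distance: float  # remaining distance to goal at episode end


def spl(episodes):
    """Success weighted by path length, averaged over episodes."""
    total = 0.0
    for ep in episodes:
        efficiency = ep.shortest / max(ep.taken, ep.shortest)
        total += float(ep.success) * efficiency
    return total / len(episodes)


def soft_spl(episodes):
    """Assumed soft variant: progress toward the goal replaces success."""
    total = 0.0
    for ep in episodes:
        progress = max(0.0, 1.0 - ep.final_distance / ep.shortest)
        efficiency = ep.shortest / max(ep.taken, ep.shortest)
        total += progress * efficiency
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest=10.0, taken=10.0, final_distance=0.0),
    Episode(success=True, shortest=10.0, taken=20.0, final_distance=0.0),
    Episode(success=False, shortest=10.0, taken=5.0, final_distance=5.0),
]
spl_value = spl(episodes)
soft_spl_value = soft_spl(episodes)
```

The third episode fails outright, so it contributes nothing to SPL, but it still earns partial credit under the soft variant because the agent closed half of the distance to the goal.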


For visual language navigation, they investigated how well different representation learning methods support imitation learning on the task. In both the behavioural cloning and DAgger settings, CRL outperformed methods that utilize random, RND, or ATC weights, and achieved performance comparable to the ImageNet supervised model.


CRL also recorded the best performance on real image understanding, indicating that it learns representations that transfer best to real images.

The results demonstrate that the proposed generic CRL framework can successfully learn task-agnostic visual representations in embodied environments, and that these representations transfer effectively to downstream tasks.

The paper Curious Representation Learning for Embodied Intelligence is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
