UT Austin & Sony AI’s VIOLA Object-Centric Imitation Learning Method for Robot Manipulation Outperforms the SOTA by 45.8%

Vision-based manipulation is a key skill that enables autonomous robots to understand their environment and learn intelligent behaviours from it. Deep imitation learning has recently emerged as a promising training method for vision-based manipulation. The resulting models perform well at mapping raw visual observations to motor actions, but they are not robust to covariate shifts or environmental perturbations, so they generalize poorly to new situations.

In the new paper VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors, researchers from the University of Texas at Austin and Sony AI present VIOLA (Visuomotor Imitation via Object-centric LeArning), an object-centric imitation learning model that endows imitation learning with awareness regarding objects and their interactions. The novel approach improves robustness in vision-based robotic manipulation tasks and outperforms state-of-the-art imitation learning methods by 45.8 percent.

VIOLA is designed to learn effective closed-loop visuomotor policies for robot manipulation and was inspired by the idea that explaining visual scenes as multiple objects and their corresponding interactions could enable models to make faster and more accurate predictions. The proposed method thus decomposes visual scenes into factorized representations of objects to encourage robots to reason about the manipulation workspace in a modular fashion and improve their generalization ability.

VIOLA first uses a pretrained region proposal network (RPN) to obtain a set of general object proposals from raw visual observations, then extracts features from each of these proposals to learn the factorized object-centric representations of the visual scene. Finally, a transformer-based policy leverages a multi-head self-attention mechanism to identify task-relevant regions and improve the robustness and efficiency of the imitation learning process.
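
The three stages described above can be illustrated with a toy sketch. This is not the authors' implementation: the hard-coded boxes stand in for the output of a pretrained RPN, `region_feature` is a hypothetical hand-crafted feature extractor, and the attention uses identity projections instead of learned weights. It only shows how per-proposal tokens pass through multi-head self-attention so that each region can weigh every other region.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_feature(image, box):
    """Hypothetical per-proposal feature: mean intensity plus box geometry."""
    x0, y0, x1, y1 = box
    height, width = len(image), len(image[0])
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    mean = sum(pixels) / len(pixels)
    center_x = (x0 + x1) / (2 * width)
    center_y = (y0 + y1) / (2 * height)
    area = (x1 - x0) * (y1 - y0) / (width * height)
    return [mean, center_x, center_y, area]


def attention_head(tokens):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        for q in tokens
    ]
    outputs = [
        [sum(w * t[i] for w, t in zip(row, tokens)) for i in range(dim)]
        for row in weights
    ]
    return outputs, weights


def multi_head_self_attention(tokens, num_heads):
    """Split each token into equal slices, attend per head, concatenate results."""
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("token dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    merged = [[] for _ in tokens]
    all_weights = []
    for head in range(num_heads):
        lo, hi = head * head_dim, (head + 1) * head_dim
        outputs, weights = attention_head([t[lo:hi] for t in tokens])
        for row, out in zip(merged, outputs):
            row.extend(out)
        all_weights.append(weights)
    return merged, all_weights


# Toy 8x8 grayscale "observation" with one bright object.
image = [[0.0] * 8 for _ in range(8)]
for y in range(2, 5):
    for x in range(2, 5):
        image[y][x] = 1.0

# Stand-in for RPN output: (x0, y0, x1, y1) boxes, chosen by hand here.
proposals = [(2, 2, 5, 5), (0, 0, 2, 2), (5, 5, 8, 8)]

tokens = [region_feature(image, box) for box in proposals]
attended, head_weights = multi_head_self_attention(tokens, num_heads=2)

for head, weights in enumerate(head_weights):
    print(f"head {head} attention from proposal 0:", weights[0])
```

In a learned policy, the query, key, and value projections would be trained from demonstrations, and the attended tokens would feed an action head. Both are omitted here to keep the sketch small.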

The team compared VIOLA with state-of-the-art deep imitation learning methods on vision-based manipulation tasks in simulation and on a real robot. In the evaluations, VIOLA surpassed the best state-of-the-art baseline’s success rate by 45.8 percent and remained robust on precise grasping and manipulation tasks even when visual variations such as jittered camera views were introduced.

The team summarizes their study’s contributions as follows:

  1. We learn object-centric representations based on general object proposals and design a transformer-based policy that determines task-relevant proposals to generate the robot’s actions.
  2. We show that VIOLA outperforms state-of-the-art baselines in simulation and validate the effectiveness of our model designs through ablative studies.
  3. We show that VIOLA learns policies on a real robot to complete challenging tasks.

Videos and model details are available on the project’s website: https://ut-austin-rpl.github.io/VIOLA. The paper VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

