The paper Masked Autoencoders Are Scalable Vision Learners, published this week by Kaiming He, Xinlei Chen and their Facebook AI Research (FAIR) team, has become a hot topic in the computer vision community.
Systems employing masked language modelling such as Google’s BERT and their autoregressive counterparts like OpenAI’s GPT have achieved astonishing performance across a wide range of natural language processing (NLP) tasks and enabled the training of generalizable NLP models containing over one hundred billion parameters.
The progress and performance of autoencoding methods in computer vision, however, lag behind their proven success in NLP. A question naturally arises: how does masked autoencoding differ between the vision and language domains? The FAIR paper addresses this question and demonstrates that masked autoencoders (MAE) can be scalable self-supervised learners for computer vision.
The researchers first examine the differences in masked autoencoding in the vision and language domains, explaining: 1) Until recently, the architectures were distinct; 2) Information density is different in language and vision; 3) The autoencoder’s decoder, which maps latent representations back to the input, plays a different role when reconstructing either text or images.
The team then presents a simple, effective, and scalable form of an MAE for visual representation learning. The idea behind the proposed MAE method is simple — random patches from the input image are masked, and the missing patches are then reconstructed in the pixel space. The team summarizes their MAE's two core designs and approach as:
- We develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
- We find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively, accelerating training by 3× or more and improving accuracy.
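The random-masking step described above can be illustrated with a minimal sketch. The function below (a hypothetical helper, not the authors' code) shuffles patch indices and keeps only a visible subset for the encoder, mirroring the paper's 75% masking ratio:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Randomly drop a fraction of patches, keeping only the visible subset.

    patches: (num_patches, patch_dim) array of flattened image patches.
    Returns (visible_patches, visible_idx, masked_idx).
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)             # random shuffle of patch indices
    visible_idx = np.sort(perm[:n_keep])  # patches the encoder will see
    masked_idx = np.sort(perm[n_keep:])   # patches the decoder must reconstruct
    return patches[visible_idx], visible_idx, masked_idx

# Example: a 224x224 image with 16x16 patches yields 14x14 = 196 patches,
# each flattened to 16*16*3 = 768 values.
patches = np.zeros((196, 768))
visible, vis_idx, mask_idx = random_masking(patches)
print(visible.shape)  # (49, 768) — only 25% of patches go to the encoder
```

Because the encoder processes only the visible 25% of patches, its compute cost per image drops sharply — this is what enables the 3× or greater training speedup the authors report.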
The team performed self-supervised pretraining on the ImageNet-1K (IN1K) training set, then conducted supervised training to evaluate the representations with end-to-end finetuning or linear probing. They used ViT-Large (ViT-L/16) as their backbone model and reported top-1 validation accuracy.
The results show that MAE learns very high-capacity models that also generalize well. With a vanilla ViT-Huge model, MAE achieved 87.8 percent accuracy when finetuned on ImageNet-1K.
The team believes simple algorithms that scale well are the core of deep learning. As such, they hope their simple self-supervised method, which is similar to NLP techniques and provides scalable benefits in computer vision, can be a valuable contribution and inspire future research in this area.
The paper Masked Autoencoders Are Scalable Vision Learners is on arXiv.
Author: Hecate He | Editor: Michael Sarazen