In November, a Facebook AI Research (FAIR) team led by Kaiming He published the paper Masked Autoencoders Are Scalable Vision Learners, which demonstrated that masked autoencoders (MAE) are scalable self-supervised learners — a game-changing finding for the computer vision research field.
Scarcely a month later, a FAIR and Johns Hopkins University team has published Masked Feature Prediction for Self-Supervised Visual Pre-Training. Influenced in part by the MAE project, the paper proposes Masked Feature Prediction (MaskFeat), a new self-supervised pretraining method for video models that regresses features of masked regions. MaskFeat outperforms the MAE approach and achieves state-of-the-art results on various video benchmarks.
Like MAE, MaskFeat employs a simple “mask then predict” strategy that has proven effective in natural language processing (NLP) tasks but had remained underexplored in computer vision. The MAE approach masks random parts of an input image and then predicts the missing pixels. MaskFeat instead predicts the histogram of oriented gradients (HOG) of the masked regions, replacing MAE’s direct prediction of pixels.
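The “mask then predict” objective can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 14×14 patch grid, the 40 percent mask ratio, and the 36-dimensional feature vectors are all assumptions chosen for illustration. The point it shows is that the regression loss is computed only over the masked patches.

```python
import numpy as np

def random_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Return a boolean (grid_h, grid_w) array; True marks a masked patch."""
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    # Pick distinct patch indices to hide from the model.
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return flat.reshape(grid_h, grid_w)

def masked_regression_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: (grid_h, grid_w, feat_dim) arrays of per-patch features.
    mask: (grid_h, grid_w) boolean array, True where the patch was masked.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float(per_patch[mask].mean())

# Illustrative setup (all sizes are assumptions, not values from the paper).
rng = np.random.default_rng(0)
mask = random_patch_mask(14, 14, mask_ratio=0.4, rng=rng)
target = rng.normal(size=(14, 14, 36))          # stand-in target features
pred = np.where(mask[..., None], 0.0, target)   # stand-in model output
loss = masked_regression_loss(pred, target, mask)
perfect = masked_regression_loss(target, target, mask)
```

In a real pipeline `pred` would come from a transformer that sees the input with the masked patches replaced by a learned token, and `target` would hold the feature descriptor of the original, unmasked input.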
The researchers explain that HOG is a feature descriptor that captures the distribution of gradient orientations, or edge directions, within image subregions. It describes local shapes and appearances while remaining partially invariant to geometric changes. They also report that HOG’s local contrast normalization is essential for effective MaskFeat pretraining.
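To make the descriptor concrete, here is a simplified HOG-style sketch in NumPy. It is a hedged illustration rather than the paper's code: the function name `hog_cells`, the 8-pixel cell size, the 9 orientation bins, and the per-cell L2 normalization (a simplification of the block-wise normalization used in classic HOG) are assumptions. It builds a magnitude-weighted histogram of gradient orientations per cell and then normalizes each histogram, which is what makes the feature insensitive to local contrast.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9, eps=1e-6):
    """Per-cell histograms of unsigned gradient orientation, L2-normalized.

    image: 2-D grayscale array; rows/columns beyond a multiple of `cell`
    are ignored.
    Returns an array of shape (H // cell, W // cell, bins).
    """
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (x).
    gy, gx = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation folded into [0, pi).
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    h, w = image.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    for b in range(bins):
        # Each pixel votes for its orientation bin, weighted by magnitude.
        votes = np.where(bin_idx == b, magnitude, 0.0)[: ch * cell, : cw * cell]
        hist[:, :, b] = votes.reshape(ch, cell, cw, cell).sum(axis=(1, 3))

    # Local contrast normalization: scale each cell histogram to unit L2 norm.
    norm = np.sqrt((hist ** 2).sum(axis=-1, keepdims=True))
    return hist / (norm + eps)

# A horizontal intensity ramp has purely horizontal gradients,
# so all of the histogram energy falls into the first orientation bin.
ramp = np.tile(np.arange(16, dtype=np.float64), (16, 1))
feats = hog_cells(ramp)

# Scaling the image contrast leaves the normalized descriptor nearly unchanged.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
same = np.allclose(hog_cells(img), hog_cells(3.0 * img), atol=1e-4)
```

For production use, an established implementation such as the HOG function in scikit-image is a better starting point than this sketch.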
MaskFeat’s design enables it to learn abundant visual knowledge and drive large-scale transformer-based models, producing strong performance without the use of extra model weights or supervision.
The team summarizes their study’s findings as:
- Simple histogram of oriented gradients, as in the popular HOG and SIFT descriptors which dominated visual recognition for over a decade, is a particularly effective target for MaskFeat in terms of both performance and efficiency.
- The discretization (tokenization) of visual signals is not necessary for masked visual prediction, and continuous feature regression (i.e. MaskFeat) can work well.
- Semantic knowledge from human annotations is not always helpful for MaskFeat, but characterizing local patterns seems important. For example, predicting supervised features from CNNs or ViTs trained on labelled data leads to degraded performance.
The team evaluated MaskFeat on standard video benchmarks. A MaskFeat-pretrained MViT-L scored 86.7 percent top-1 accuracy on the Kinetics-400 action classification dataset without using any external data, bettering the top prior performance by 5.2 percentage points. When transferred to downstream tasks, MaskFeat achieved unprecedented results of 38.8 mAP on action detection and 75.0 percent top-1 accuracy on human-object interaction classification. When generalized to the image domain, MaskFeat achieved a competitive 84.0 percent top-1 accuracy with ViT-B and 85.7 percent top-1 accuracy with ViT-L using only ImageNet-1K.
Overall, the results show that MaskFeat is efficient, generalizes well, and scales to large models in both the video and image domains. The team believes MaskFeat can open the door to direct pretraining on unlabelled videos, which will bring enormous benefits to video understanding.
The paper Masked Feature Prediction for Self-Supervised Visual Pre-Training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen