There is a growing trend in the deep learning community to unify methodologies for solving problems in different areas (language, vision, speech, etc.). This generalized approach requires less domain knowledge and fewer inductive biases for specific problems and motivates models to learn useful knowledge almost entirely from their training data.
A Facebook AI Research (Meta AI) team advances this research avenue in their new paper Masked Autoencoders As Spatiotemporal Learners, which applies masked autoencoders (MAE) to the problem of spatiotemporal representation learning for video. The novel approach introduces negligible inductive biases on space-time while achieving strong empirical results compared to vision transformers (ViTs) and outperforms supervised pretraining by large margins.
The team describes their work as a simple extension of MAE to space-time data. Their goal is to develop the method under a general and unifying framework that uses as little domain knowledge/inductive bias as possible.
The proposed approach first randomly masks out space-time patches in videos using a relatively high masking ratio of 90 percent, then attempts to reconstruct the patches via a learned autoencoder. This process requires minimal domain knowledge as the only space-time-specific inductive bias is on embedding the patches and their positions. The team applies vanilla ViTs with no factorization or hierarchy on either their encoders or decoders; and uses a random mask sampling strategy that is agnostic to space-time structures. The approach can predict pixel values without requiring an extra problem-specific tokenizer and achieves impressive performance despite its minimal inductive biases, suggesting that useful knowledge can be learned from data alone.
The team conducted experiments on a variety of video recognition datasets to test their method’s performance. In the evaluations, MAE pretraining improved ViT-Large accuracy by 13 percent on the Kinetics-400 benchmark, outperformed its supervised pretraining counterparts by large margins, and achieved SOTA-comparable performance with much less domain knowledge.
Overall, the study shows it is possible to learn strong representations with minimal domain knowledge or inductive biases and that self-supervised learning on videos can be effectively performed using a conceptually unified framework in a manner similar to language and images.
The paper Masked Autoencoders As Spatiotemporal Learners is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.