The natural language processing (NLP) field has achieved remarkable advances, spurred by transformer-based pretrained large language models such as Google’s BERT and OpenAI’s GPT. Transformer architectures have also recently attained state-of-the-art performance on various computer vision tasks.
Efforts to extend BERT-style masked image modelling pretraining (where a portion of the image is masked and the model learns to recover it) from transformers to convolutional networks (convnets), however, have been stymied: transformers can handle irregular (variable-length) and non-overlapping patch inputs, but convnets operate only on regular grids.
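To make the mismatch concrete, here is a minimal PyTorch sketch (purely illustrative, not the paper’s code) of how a transformer-style pipeline can simply drop masked patches, and why the same trick fails for a convnet:

```python
import torch

# Split a toy image batch into a 14x14 grid of 16x16 patches.
imgs = torch.randn(2, 3, 224, 224)
patches = imgs.unfold(2, 16, 16).unfold(3, 16, 16)        # (B, C, 14, 14, 16, 16)
patches = patches.reshape(2, 3, 196, 16, 16).permute(0, 2, 1, 3, 4)  # (B, 196, C, 16, 16)

# Randomly keep 25% of the patches, discarding the rest.
keep = torch.rand(2, 196).argsort(dim=1)[:, :49]          # indices of visible patches
visible = torch.stack([patches[b, keep[b]] for b in range(2)])  # (B, 49, C, 16, 16)

# A transformer encoder can consume `visible` directly as a shorter, irregular
# token sequence. A convnet cannot: its sliding kernels require a dense,
# regular H x W grid, and zero-filling the masked patches lets every kernel
# that straddles a mask border mix masked and visible pixels.
```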
In the new paper Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling, a research team from Peking University, ByteDance, and the University of Oxford presents Sparse Masked Modelling with Hierarchy (SparK), the first BERT-style pretraining approach for convolutional models.

The team summarizes their main contributions as follows:
- The first BERT-style pretraining method that can be used directly on any convnet without backbone modifications, overcoming convnets’ inability to handle irregular masked inputs.
- Insights into designing generative pretraining for convnets, e.g., the first use of sparse convolution for masked image modelling and the hierarchical design for BERT-style pretraining.
- A leap in convnets’ performance across downstream tasks (up to +3.5 points), showing the promise of extending the success of transformers’ pretrain-finetune paradigm to convnets.

The SparK framework aims to pretrain a convolutional network encoder via hierarchical masked image modelling. To this end, the team first introduces a novel sparse masking strategy with the following benefits:
- No information is leaked.
- It can be applied directly to any convnet without backbone modifications.
- It is efficient, as sparse convolution computes only at visible positions (see the sketch after this list).
- It addresses the pixel distribution shift and mask pattern vanishing issues of previous transformer-based masked modelling approaches.
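Below is a minimal PyTorch sketch of the no-leakage idea, assuming we emulate sparse convolution on dense tensors by re-applying a binary visibility mask around each convolution. The class name and masking scheme are illustrative assumptions; true sparse convolution, as used in the paper, additionally skips computation at masked positions entirely:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Dense emulation of a sparse convolution: masked positions neither
    contribute to nor receive features (hypothetical helper, not SparK's API)."""
    def forward(self, x, mask):
        x = x * mask                   # zero out masked inputs (nothing leaks in)
        x = super().forward(x)
        return x * mask                # zero out masked outputs (nothing leaks out)

B, C, H, W = 2, 3, 32, 32
x = torch.randn(B, C, H, W)
# Patch-level random mask (1 = visible), upsampled to pixel resolution.
mask = (torch.rand(B, 1, H // 8, W // 8) > 0.6).float()
mask = mask.repeat_interleave(8, dim=2).repeat_interleave(8, dim=3)

conv = MaskedConv2d(C, 16, kernel_size=3, padding=1)
out = conv(x, mask)                    # features exist only at visible positions
```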

The team then constructs a hierarchical encoder-decoder architecture that uses sparse convolution to encode the sparse image of visible patches, producing feature maps at multiple resolutions. A UNet-style decoder then processes these sparse feature maps after all empty positions are filled with mask embeddings (“densifying”, sketched below). Finally, the model is optimized to reconstruct the masked regions, and the pretrained encoder is transferred to downstream tasks.
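A rough sketch of the densifying step, assuming one learnable mask embedding per feature scale; all names here are hypothetical, not the released API:

```python
import torch
import torch.nn as nn

class Densify(nn.Module):
    """Fill the empty (masked) positions of a sparse feature map with a
    learnable mask embedding so a dense decoder can process it."""
    def __init__(self, channels):
        super().__init__()
        # One embedding vector, broadcast into every masked position.
        self.mask_embed = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, sparse_feat, vis_mask):
        # sparse_feat: (B, C, H, W), zero at masked positions
        # vis_mask:    (B, 1, H, W), 1 = visible, 0 = masked
        return sparse_feat * vis_mask + self.mask_embed * (1.0 - vis_mask)

feat = torch.randn(2, 64, 28, 28)               # one encoder scale
vis = (torch.rand(2, 1, 28, 28) > 0.6).float()
dense = Densify(64)(feat, vis)                  # ready for the decoder
```

Each densified scale is then fused into the decoder in the usual UNet fashion, with higher-resolution scales merged in as the decoder upsamples.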


In their empirical study, the researchers applied SparK to two convnet families: classical ResNets (He et al., 2016) and modern ConvNeXts (Liu et al., 2022). In the evaluations, SparK exhibited the largest improvements over supervised baselines and performed significantly better than state-of-the-art convolutional contrastive learning methods across all downstream tasks.
This work shows that SparK can be effectively applied to any convnet and significantly improve performance on downstream tasks. The team hopes it will inspire researchers to further explore the potential of convnets and generative pretraining, and enable the computer vision community to benefit from the pretrain-finetune paradigm.
Code and models have been released on the project’s GitHub. The paper Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling is under review as a conference paper at ICLR 2023 and is available on arXiv.
Author: Hecate He | Editor: Michael Sarazen
