A team of researchers from Google Brain has proposed a rethink of the dominant computer vision paradigm of pre-training. In the paper Rethinking Pre-training and Self-training, they investigate the generality and flexibility of the relatively new paradigm of self-training. “We researchers love pre-training. Our new paper shows that pre-training is unhelpful when we have a lot of labeled data. In contrast, self-training works well even when we have a lot of labeled data,” tweeted one of the authors, Google Brain Researcher Quoc Le.

Many computer vision tasks are related, so researchers tend to leverage pre-training: a model pre-trained on one dataset can often help with another task. The common practice of pre-training the backbones of object detection and segmentation models on ImageNet classification is a typical example. However, despite its increasing popularity, the overall value of pre-training has recently been questioned. The paper notes, for example, that “ImageNet pre-training does not improve accuracy on the COCO dataset,” referring to the large-scale COCO object detection, segmentation and captioning dataset.
Distillation is one particular example of self-training. Distillation was first introduced to compress a large “teacher” model into a smaller, faster “student” model that reproduces the teacher’s outputs. Applying the idea to self-training, researchers first use labeled data to train a good teacher model, then use the teacher to label unlabeled data, and finally train a student model on the labeled and teacher-labeled data jointly.
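For readers who want a concrete picture, here is a minimal sketch of the classic distillation loss in PyTorch. It is not the paper’s implementation; the temperature, loss weighting and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend a soft-target term (mimic the teacher) with the usual hard-label loss.

    temperature and alpha are illustrative hyperparameters, not values from the paper.
    """
    # Soften both distributions so the student learns the teacher's relative
    # confidence across classes, not just its top prediction.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```

In practice the teacher’s logits are treated as fixed targets (computed under torch.no_grad()), so only the student’s parameters receive gradients.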
What would self-training look like in the context of vision tasks? If researchers wanted to use ImageNet to help COCO object detection, a self-training paradigm would mean first discarding the ImageNet labels, then training an object detection model on COCO and using it to generate pseudo labels on the unlabeled ImageNet images. The last step would be combining the pseudo-labeled ImageNet data with the labeled COCO data to train a new model.
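A rough sketch of that pseudo-labeling step, written as a simple classification example in PyTorch rather than the paper’s detection setup, might look like the following; the confidence threshold, function name and loader are assumptions for illustration.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, threshold=0.5):
    """Run a trained teacher over unlabeled images and keep confident predictions.

    The returned (image batch, pseudo label) pairs can then be mixed with the
    human-labeled data to train a new student model.
    """
    teacher.eval()
    pseudo_batches = []
    for images in unlabeled_loader:
        probs = torch.softmax(teacher(images), dim=-1)
        confidence, preds = probs.max(dim=-1)
        keep = confidence >= threshold  # discard low-confidence pseudo labels
        if keep.any():
            pseudo_batches.append((images[keep], preds[keep]))
    return pseudo_batches
```

In the paper’s setting the teacher is an object detector trained on COCO, so the pseudo labels are bounding boxes rather than class indices, but the overall recipe is the same: train a teacher, pseudo-label the unlabeled images, then train a student on the combined data.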
Setting out to reexamine the role of pre-training, the team asked, “Can self-training work well on the exact setup, using ImageNet to improve COCO, where pre-training fails?”
Through a series of control experiments using ImageNet as additional data to improve COCO accuracy, the team found that as they increased the strength of data augmentation or the amount of labeled data, the value of pre-training diminished. Furthermore, with the strongest data augmentation, “pre-training significantly hurts accuracy by -1.0AP.” The researchers note that self-training works well in the very setups where pre-training fails: with the same data augmentation and the same ImageNet data, self-training yields a positive +1.3 AP improvement.

There are, of course, limitations to self-training; for example, it requires much more compute than simply fine-tuning a pre-trained model. The team nevertheless notes the following significant advantages of self-training:
- Flexibility: self-training works well in every setup the team tried: low data regime, high data regime, weak data augmentation, and strong data augmentation.
- Effectiveness: self-training works with different architectures (ResNet, EfficientNet, SpineNet, FPN, NAS-FPN), data sources (ImageNet, OID, PASCAL, COCO), and tasks (object detection, segmentation).
- Generality: self-training works well not only when pre-training fails but also when pre-training succeeds.
- Scalability: self-training continues to perform well as more labeled data and better models become available.
The paper Rethinking Pre-training and Self-training is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen