Baseline datasets such as ImageNet have played an important role in pretraining computer vision models. These datasets, however, often include images with technical, ethical and legal shortcomings. Moreover, current state-of-the-art pretrained models use unsupervised learning methods, so labelled baseline datasets such as ImageNet may not be optimal choices for model pretraining.
To address these issues, a research team from Oxford University’s Visual Geometry Group has compiled PASS (Pictures without humAns for Self-Supervision), a large (1.28 million images) and unlabelled collection that excludes humans, designed as an ImageNet replacement for self-supervised learning (SSL) model pretraining.
There are various technical, ethical and legal issues connected with today’s popular pretraining datasets. ImageNet was designed for supervised object classification, but in recent years supervised pretraining approaches have been largely replaced by unsupervised pretraining. Using ImageNet for unsupervised pretraining can introduce undesirable biases, as images with humans may, for example, carry contextual biases linked to a person’s appearance. Moreover, previous research has argued that such standard datasets can also contain biases inherited from the search engines used to collect them. These factors bring the risk of racist, sexualized, skewed or stereotyped representations negatively affecting model training and performance.
To alleviate privacy and bias issues, the proposed PASS dataset is completely free of humans, does not rely on search engines or labels, and only contains images with a permissive licence (CC-BY) and attribution information.
The PASS dataset was derived as a subset of the YFCC-100M dataset of 100 million media objects. YFCC-100M metadata was first used to select images with CC-BY licences, and corrupted or single-colour images were then purged. Next, pretrained RetinaFace and Cascade-RCNN (3x) models were employed to filter out images containing human faces or bodies.
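This filtering pipeline can be sketched in a few lines. Below is a minimal, hypothetical illustration: the corrupted and single-colour checks use Pillow, while the human-detection step (RetinaFace plus Cascade R-CNN in the paper) is represented by a caller-supplied `contains_human` callable, since those models are not reproduced here.

```python
from PIL import Image


def is_corrupted(path):
    """Return True if the file cannot be parsed as a valid image."""
    try:
        with Image.open(path) as im:
            im.verify()  # cheap integrity check, does not decode pixels
        return False
    except Exception:
        return True


def is_single_colour(path):
    """Return True if every pixel in the image has the same RGB value."""
    with Image.open(path) as im:
        # getextrema() gives (min, max) per channel; a solid-colour
        # image has min == max in every channel.
        extrema = im.convert("RGB").getextrema()
    return all(lo == hi for lo, hi in extrema)


def keep_image(path, contains_human):
    """Apply the PASS-style filters to one candidate image.

    `contains_human` is a hypothetical stand-in for the pretrained
    face/body detectors used in the paper.
    """
    if is_corrupted(path) or is_single_colour(path):
        return False
    return not contains_human(path)
```

In the actual pipeline these checks run over the CC-BY subset of YFCC-100M; the sketch above only shows the per-image decision logic.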
The team pretrained models on the PASS dataset using self-supervision to evaluate their effectiveness on downstream tasks such as clustering, linear probing, low-shot classification, object detection, object segmentation and dense pose detection.
For their pretrained models, the researchers selected MoCo-v2 (a staple self-supervised model), SwAV (a state-of-the-art representation learning method), and DINO (a recent method tailored to vision transformer SSL).
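Of these evaluations, linear probing is the simplest to illustrate: the pretrained encoder is frozen and only a linear classifier is trained on its features. The PyTorch sketch below uses a tiny random network as a stand-in for a PASS-pretrained backbone (e.g. a MoCo-v2 or DINO checkpoint); the feature dimension, class count and batch are illustrative assumptions, not values from the paper.

```python
import torch
from torch import nn

# Hypothetical frozen "encoder" standing in for an SSL-pretrained backbone.
encoder = nn.Sequential(nn.Linear(3 * 32 * 32, 128), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False  # linear probing keeps the encoder frozen
encoder.eval()

# The linear head is the only trainable component.
head = nn.Linear(128, 10)
opt = torch.optim.SGD(head.parameters(), lr=0.1)

# A toy batch of flattened "images" with random labels.
x = torch.randn(16, 3 * 32 * 32)
y = torch.randint(0, 10, (16,))

with torch.no_grad():
    feats = encoder(x)  # features come from the frozen encoder

loss = nn.functional.cross_entropy(head(feats), y)
opt.zero_grad()
loss.backward()
opt.step()
```

Because only the head is updated, probe accuracy directly reflects the quality of the frozen representations, which is what the paper's 8/13 frozen-encoder benchmarks measure.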
The team summarizes their results as:
- Self-supervised approaches such as MoCo, SwAV and DINO train well on our dataset, yielding strong image representations.
- Excluding images with humans during pretraining has almost no effect on downstream task performance, even when the humans are excluded from ImageNet itself.
- In 8 of 13 frozen-encoder evaluation benchmarks, models pretrained on PASS yield better results than models pretrained on ImageNet, ImageNet without humans, or Places205 when transferred to other datasets.
- For finetuning evaluations such as detection and segmentation, PASS pretraining yields results within ±1% mAP and AP50 on COCO.
- Even on tasks involving humans, such as dense pose prediction, pretraining on our dataset yields performance on par with ImageNet pretraining.
Overall, the proposed PASS dataset can greatly reduce the risk of copyright, biases, data privacy and ethics issues. The team also demonstrated that pretrained networks obtained using self-supervised training on PASS are competitive with ImageNet on transfer settings and even on downstream tasks that involve humans.
The paper PASS: An ImageNet Replacement for Self-Supervised Pretraining Without Humans is on arXiv.
Author: Hecate He | Editor: Michael Sarazen