The original 2017 transformer model was designed for natural language processing (NLP), where it achieved state-of-the-art (SOTA) results. Its performance intrigued machine learning researchers, who have since successfully adapted the attention-based architecture to perception tasks in other modalities, such as the classification of images, video and audio. While transformers have shown their power and potential in these areas, achieving SOTA performance requires training a separate model for each task. Producing a single transformer model capable of processing multiple modalities and datasets while sharing its learnable parameters has thus emerged as an attractive research direction.
To this end, a team from Google Research, the University of Cambridge and the Alan Turing Institute has proposed PolyViT, a single transformer architecture co-trained on image, audio and video that is parameter-efficient and learns representations that generalize across multiple domains.
The PolyViT design is motivated by the idea that human perception is inherently multimodal, and by previous studies demonstrating transformers' ability to operate on any modality that can be tokenized. PolyViT shares a single transformer encoder across different tasks and modalities, enabling up to a linear reduction in parameters with the number of tasks.
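To make the parameter-sharing idea concrete, here is a toy numpy sketch (not the actual PolyViT code; the layer sizes, the single dense layer standing in for the transformer encoder, and all names are illustrative assumptions) of one shared encoder with a small linear head per task, along with the parameter-count comparison against training a separate full model per task:

```python
import numpy as np

D, HIDDEN, N_TASKS, N_CLASSES = 64, 256, 9, 10  # toy sizes, not PolyViT's

def n_params(*arrays):
    return sum(a.size for a in arrays)

rng = np.random.default_rng(0)
# One shared "encoder" (a single dense layer here, standing in for the
# transformer encoder) plus a tiny task-specific head per task.
encoder = rng.normal(size=(D, HIDDEN))
heads = [rng.normal(size=(HIDDEN, N_CLASSES)) for _ in range(N_TASKS)]

def forward(x, task):
    h = np.tanh(x @ encoder)   # shared representation for every task
    return h @ heads[task]     # task-specific logits

shared_total = n_params(encoder, *heads)               # encoder stored once
separate_total = N_TASKS * n_params(encoder, heads[0])  # one model per task
print(shared_total, separate_total)
```

Because the encoder is stored once, the total grows only with the small per-task heads as tasks are added, which is the "up to linear" reduction in parameters relative to maintaining one full model per task.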
The main technique used to develop PolyViT is co-training, where a single model is trained simultaneously on multiple classification tasks, potentially across multiple modalities. The researchers construct each training minibatch from examples of a single task, which enables co-training on multiple tasks without any additional hyperparameter tuning compared to a single-task baseline; large hyperparameter sweeps are therefore unnecessary for achieving competitive accuracy. At each co-training step, the team samples a minibatch from the task at hand, computes a gradient, and performs a parameter update.
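The step described above can be sketched as a loop: pick a task, draw a minibatch from that task only, and update the shared parameters. The following is a minimal numpy sketch under stated assumptions (toy linear models with a squared loss, a uniform random task order, and all sizes invented for illustration; PolyViT's actual task-sampling schedules and models differ):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=8)  # hidden structure shared by the toy tasks

def make_task(n=100):
    X = rng.normal(size=(n, 8))
    return X, X @ w_true + 0.1 * rng.normal(size=n)

tasks = [make_task() for _ in range(3)]  # three toy "tasks"

W = np.zeros(8)            # shared parameters, updated by every task
lr, batch_size = 0.01, 16

for step in range(300):
    t = rng.integers(len(tasks))               # 1. sample one task
    X, y = tasks[t]
    idx = rng.integers(len(X), size=batch_size)
    xb, yb = X[idx], y[idx]                    # 2. minibatch from that task only
    grad = 2 * xb.T @ (xb @ W - yb) / batch_size  # 3. gradient of the MSE loss
    W -= lr * grad                             # 4. parameter update
```

Because every minibatch comes from a single task, each step looks exactly like a step of single-task training, which is why no extra hyperparameter tuning is needed relative to the single-task baselines.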
The team summarizes the benefits of this co-training procedure:
- It is parameter-efficient, which has practical advantages when deploying models on edge devices whose limited memory might otherwise be unable to fit the weights of n different models.
- Maintaining a single model for multiple tasks simplifies model deployment and online updates.
- Co-training on tasks of the same modality leads to accuracy improvements on each individual task while also linearly decreasing the total parameter count.
- This multi-task, multi-modal model is able to learn representations that generalize across multiple tasks and domains.
- The co-training setup is practical and simple to implement.
- Co-training does not increase the overall training cost, as the total number of training steps does not exceed the sum of the steps used by each single-task baseline.
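One way to read the last point: if each single-task baseline has a fixed step budget, co-training simply interleaves those same steps rather than adding new ones. A small sketch (the per-task budgets and the uniform-shuffle schedule are hypothetical assumptions; the paper explores several schedules):

```python
import random
from collections import Counter

# Hypothetical per-task step budgets for three tasks.
budgets = {"image": 1000, "video": 500, "audio": 250}

# Build a schedule containing each task exactly as many times as its
# single-task budget, then shuffle: co-training runs the same total
# number of steps as the baselines combined.
schedule = [task for task, n in budgets.items() for _ in range(n)]
random.shuffle(schedule)

assert len(schedule) == sum(budgets.values())  # no extra training cost
```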
To evaluate the effectiveness of their proposed method, the team trained PolyViT simultaneously on nine diverse classification tasks spanning the image, video, and audio modalities. The results show that co-training on tasks of the same modality leads to accuracy improvements on each task while also linearly decreasing total parameters. PolyViT achieved SOTA results on video and audio classification tasks, and extending co-training to multiple tasks and modalities also achieved competitive performance while being even more parameter-efficient. The team's linear probing experiments also demonstrated PolyViT's ability to learn representations that generalize across multiple tasks and domains.
Overall, the study shows that PolyViT's novel co-training improves parameter efficiency compared to single-task models, and that the approach can achieve competitive and even SOTA performance on multiple tasks across different datasets.
The paper PolyViT: Co-training Vision Transformers on Images, Videos and Audio is on arXiv.
Author: Hecate He | Editor: Michael Sarazen