Pretrained large language models have revolutionized the natural language processing (NLP) research field, achieving state-of-the-art performance and enabling widespread and effective deployment in many real-world applications. One of the main drawbacks of such models, however, is that they require task-specific annotated data and fine-tuning for each end task, which can be time- and resource-intensive.
In the paper VideoCLIP: Contrastive Pre-Training for Zero-Shot Video-Text Understanding, a research team from Facebook AI and Carnegie Mellon University presents VideoCLIP, a contrastive approach for pretraining a unified model for zero-shot video and text understanding without requiring annotated data for downstream tasks.
The team summarises their study’s main contributions as:
- We propose to pre-train a unified model that is capable of zero-shot transfer to multiple end tasks for video-text understanding, even surpassing fully-supervised methods in some cases.
- We introduce two novel techniques to improve the learning of fine-grained video-text association.
This work focuses on pretraining for zero-shot transfer to video-text understanding tasks, with the proposed VideoCLIP designed to pretrain a unified video-text representation. To do this, it learns fine-grained associations between video and text pairs in a transformer trained with a contrastive objective. The paper identifies two novel aspects of this learning process: for positive pairs, it uses video and text clips that are loosely temporally overlapping rather than enforcing strict start/end timestamp overlap; and for negative pairs, it employs a retrieval-based sampling technique that uses video clusters to form batches of mutually harder videos.
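The contrastive objective contrasts each matched video-text pair against the other pairs in the batch. A minimal sketch of such a symmetric InfoNCE-style loss is shown below; this is an illustrative reimplementation in numpy, not the authors' code, and the function name and `temperature` default are assumptions.

```python
import numpy as np

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    (video, text) pairs. Diagonal entries of the similarity matrix are
    the positive pairs; all other in-batch pairs act as negatives.
    Illustrative sketch only, not the paper's exact formulation."""
    # L2-normalize the pooled embeddings.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    n = logits.shape[0]

    def cross_entropy_diag(l):
        # Row-wise log-softmax, then pick the diagonal (positive) entries.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(log_probs[np.arange(n), np.arange(n)])

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With well-aligned pairs the diagonal dominates and the loss is low; shuffling the pairing raises it.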
The proposed approach first improves the association of video and text with different sequence lengths by pretraining with temporally overlapped pairs of video and text clips of varying length, resulting in a significant increase in the quality and quantity of the video-text alignment.
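The overlapped-pair idea can be illustrated with a small sampler: instead of copying the text clip's timestamps exactly, pick a center inside the text span and grow a video clip of random length around it. This is a hedged sketch; the function and parameter names (`min_len`, `max_len`) are hypothetical, not taken from the paper.

```python
import random

def sample_overlapped_pair(text_start, text_end, video_duration,
                           min_len=3.0, max_len=32.0):
    """Sample a video clip (start, end) in seconds that loosely overlaps
    a text clip, rather than sharing its exact boundaries.
    Illustrative sketch with assumed parameter names."""
    # Pick a center timestamp somewhere inside the text clip's span.
    center = random.uniform(text_start, text_end)
    # Grow a video clip of random length around that center.
    length = random.uniform(min_len, max_len)
    start = max(0.0, center - length / 2)
    end = min(video_duration, start + length)
    return start, end
```

Because the center always lies inside the text span, every sampled video clip overlaps the text clip while varying in length and alignment.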
The method then learns fine-grained video-text similarities through a contrastive loss over implicitly harder negative pairs, using a retrieval-augmented pretraining approach that retrieves clusters of videos with similar patterns.
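The batching strategy can be sketched as follows: cluster the videos by embedding similarity, then draw each training batch from within a single cluster so that in-batch negatives are mutually harder. This simplified sketch uses a few hand-rolled k-means iterations; the paper retrieves neighbors with the model's own video embeddings, and the function name and defaults here are assumptions.

```python
import numpy as np

def retrieve_hard_batches(video_embs, num_clusters=4, batch_size=8, seed=0):
    """Cluster video embeddings, then form batches within each cluster so
    in-batch negatives come from similar videos. Simplified sketch only."""
    rng = np.random.default_rng(seed)
    n = video_embs.shape[0]
    # Initialize cluster centers from randomly chosen videos.
    centers = video_embs[rng.choice(n, num_clusters, replace=False)].copy()
    for _ in range(10):  # a few k-means iterations
        dists = ((video_embs[:, None] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(num_clusters):
            members = video_embs[assign == k]
            if len(members):
                centers[k] = members.mean(0)
    # Draw each batch from within one cluster.
    batches = []
    for k in range(num_clusters):
        idx = np.where(assign == k)[0]
        rng.shuffle(idx)
        for i in range(0, len(idx), batch_size):
            batches.append(idx[i:i + batch_size].tolist())
    return batches
```

Each batch then contains videos with similar content, so the contrastive loss must separate the correct text from transcripts of closely related clips.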
For their empirical study, the researchers used the publicly available HowTo100M dataset for pretraining, then applied the model in a zero-shot transfer setting without any fine-tuning on target dataset labels. The model was evaluated on a diverse set of tasks: text-video retrieval, video question answering (VideoQA), action localization, and segmentation.
In the text-video retrieval task on the YouCook2 large-scale cooking video dataset, VideoCLIP outperformed all baseline zero-shot methods and even outperformed fully supervised pretraining plus fine-tuning methods. On the VideoQA task, VideoCLIP outperformed most of the supervised methods and, after fine-tuning, achieved the best overall performance. VideoCLIP also achieved impressive performance on the action localization and segmentation tasks, even surpassing supervised approaches that use human-annotated labels.
Overall, the study shows that the proposed VideoCLIP can outperform prior approaches on a variety of tasks without any supervision on downstream datasets, and in some scenarios is competitive or better than methods that use full supervision.
Author: Hecate He | Editor: Michael Sarazen