AI Machine Learning & Data Science Research

Google and UT Austin’s Game-Changing Approach Distills Vision-Language Models on Millions of Videos


Significant strides in image understanding have been propelled by large, high-quality image-text datasets. Annotating videos, however, is far more laborious: transcribing the narrative of a one-hour video takes roughly 70 hours, and adding instance-level annotations takes around 700 hours. This bottleneck impedes the progress of vision-language models, despite the wealth of video content available on the Internet.

In the new paper Distilling Vision-Language Models on Millions of Videos, a research team from Google and the University of Texas at Austin introduces a straightforward yet highly effective method to adapt image-based vision-language models (VLMs) to video. The approach generates high-quality pseudo-captions for millions of videos and outperforms state-of-the-art methods across various video-language benchmarks.

The research team starts from PaLI-3, an advanced VLM trained on WebLI, a dataset of image-text pairs only. The visual encoder, a ViT-G/14 with 2 billion parameters, and the language model, an encoder-decoder built on UL-2 with 3 billion parameters, form the foundation of the architecture. To make the most of a relatively limited video-text corpus, the team proposes adapting each component separately.
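The paper does not ship code with this article, but the two-component design can be pictured as a thin wrapper around a visual encoder and an encoder-decoder language model. The sketch below is a minimal, hypothetical PyTorch rendering of that composition: the class names, layer sizes, and patch handling are toy stand-ins for ViT-G/14 and UL-2, chosen only to make the separate-adaptation idea concrete, not to reproduce PaLI-3.

```python
import torch
import torch.nn as nn

class ToyVisualEncoder(nn.Module):
    """Stand-in for PaLI-3's ViT-G/14: maps video frames to visual tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)          # toy patch embedding
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, frames):                            # (B, T, P, 3*16*16)
        B, T, P, D = frames.shape
        x = self.proj(frames.reshape(B, T * P, D))        # per-patch tokens
        return self.blocks(x)                             # (B, T*P, dim)

class ToyLanguageModel(nn.Module):
    """Stand-in for UL-2: encoder-decoder over visual tokens and text tokens."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.seq2seq = nn.Transformer(d_model=dim, nhead=4,
                                      num_encoder_layers=2, num_decoder_layers=2,
                                      batch_first=True)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, visual_tokens, text_ids):
        out = self.seq2seq(src=visual_tokens, tgt=self.embed(text_ids))
        return self.lm_head(out)                          # (B, L, vocab) logits

class VideoVLM(nn.Module):
    """Two separately adaptable components, mirroring the paper's staged recipe."""
    def __init__(self):
        super().__init__()
        self.visual_encoder = ToyVisualEncoder()
        self.language_model = ToyLanguageModel()

    def forward(self, frames, text_ids):
        return self.language_model(self.visual_encoder(frames), text_ids)

# Example forward pass with random data (shapes only; no pretrained weights).
model = VideoVLM()
logits = model(torch.randn(2, 8, 4, 768), torch.randint(0, 1000, (2, 12)))
```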

Initially, they fine-tune the visual encoder on video captioning data while keeping the language component frozen. This adaptation teaches the model to handle dynamic scenes while retaining the diverse capabilities of the original language decoder.
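As a hedged illustration of this first stage, the loop below freezes the language model and updates only the visual encoder on (video, caption) pairs with a standard next-token cross-entropy loss, reusing the toy `VideoVLM` from the sketch above. The optimizer, learning rate, and teacher-forcing setup are assumptions for illustration, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

model = VideoVLM()

# Stage 1: freeze the language model, adapt only the visual encoder.
for p in model.language_model.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(model.visual_encoder.parameters(), lr=1e-4)

def captioning_step(frames, caption_ids):
    """One update on a batch of video-caption pairs (teacher forcing)."""
    logits = model(frames, caption_ids[:, :-1])            # predict next token
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           caption_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch just to show the shapes; real training uses video captioning data.
loss = captioning_step(torch.randn(2, 8, 4, 768),
                       torch.randint(0, 1000, (2, 13)))
```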

Subsequently, the language model is fine-tuned on a modest amount of instruction-following data while the visual encoder is kept frozen, emphasizing temporal and causal reasoning beyond scene-level description. The resulting video-language model processes dynamic inputs, produces motion-focused output, and is used to generate high-quality pseudo-captions for millions of web-scraped videos. Training a dual-encoder model on these pseudo-captions yields a more robust model that exhibits positive scaling behavior with respect to the number of videos.
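The second stage is the mirror image of the first (freeze `visual_encoder`, update `language_model`), so the sketch below instead illustrates the downstream use described here: pseudo-captioning unlabeled videos with the adapted toy `VideoVLM` and training a CLIP-style dual encoder on the resulting pairs with a symmetric contrastive loss. The greedy decoding loop, the bag-of-tokens text tower, and the temperature are hypothetical simplifications, not the paper's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def pseudo_caption(vlm, frames, max_len=16, bos_id=1):
    """Greedily decode a pseudo-caption from the adapted VLM (toy decoding loop)."""
    ids = torch.full((frames.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = vlm(frames, ids)
        ids = torch.cat([ids, logits[:, -1:].argmax(-1)], dim=1)
    return ids

class DualEncoder(nn.Module):
    """CLIP-style video/text dual encoder trained on pseudo-captioned pairs."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.video_proj = nn.Sequential(nn.Linear(768, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.text_embed = nn.EmbeddingBag(vocab, dim)       # toy bag-of-tokens text tower
        self.logit_scale = nn.Parameter(torch.tensor(2.0))  # learnable temperature

    def forward(self, frames, caption_ids):
        v = F.normalize(self.video_proj(frames.mean(dim=(1, 2))), dim=-1)
        t = F.normalize(self.text_embed(caption_ids), dim=-1)
        return self.logit_scale.exp() * v @ t.t()           # (B, B) similarity matrix

def contrastive_step(dual, frames, caption_ids, opt):
    """Symmetric InfoNCE: matched video-caption pairs are the positives."""
    logits = dual(frames, caption_ids)
    labels = torch.arange(logits.size(0))
    loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Pseudo-label a dummy batch with the adapted VLM, then fit the dual encoder on it.
vlm = VideoVLM()                                            # adapted model from above
frames = torch.randn(4, 8, 4, 768)
captions = pseudo_caption(vlm, frames)
dual = DualEncoder()
opt = torch.optim.AdamW(dual.parameters(), lr=1e-4)
loss = contrastive_step(dual, frames, captions, opt)
```

In this toy setup, scaling the number of pseudo-captioned videos simply means feeding more batches through `contrastive_step`, which is the regime where the paper reports positive scaling behavior.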

Evaluated across a spectrum of video-language benchmarks, including video question answering (QA) and captioning, the adapted VLM achieves state-of-the-art zero-shot performance across the board. Notably, it surpasses the previous best result on open-ended NExT-QA by 2.8% and outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. These results mark a significant advance for vision-language models, easing the video-annotation bottleneck while delivering superior performance across diverse video-language tasks.

The paper Distilling Vision-Language Models on Millions of Videos is available on arXiv.


Author: Hecate He | Editor: Chain Zhang


