Intelligent video understanding is crucial for real-world applications such as autonomous driving and human-robot interaction. Current video understanding approaches typically rely on task-specific fine-tuning of video foundation models, whose spatiotemporal and other interpretations do not effectively generalize.
Powerful pretrained large language models (LLMs) can be paired with image encoders to enable a deep understanding of images, but have proven less capable at video understanding. Existing video-centric multimodal dialogue systems, meanwhile, still struggle with spatiotemporal reasoning, event localization, and causal relationship inference.
In the new paper VideoChat: Chat-Centric Video Understanding, a research team from Shanghai AI Laboratory’s OpenGVLab, Nanjing University, the University of Hong Kong, and the Chinese Academy of Sciences’ Shenzhen Institute of Advanced Technology presents VideoChat, an end-to-end chat-centric video understanding system that leverages state-of-the-art video and language models to improve spatiotemporal reasoning, event localization, and causal relationship inference.
The proposed VideoChat integrates state-of-the-art video foundation models and LLMs through a learnable neural interface, and the paper also describes the data-centric techniques used to train the system.
The proposed framework comprises two variants: VideoChat-Text, which converts videos into text for an LLM to process; and VideoChat-Embed, which combines the video and language foundation models with a learnable video-language token interface (VLTF) tuned with video-text data to encode the videos as embeddings. The video tokens, user queries, and dialogue context are then fed into an LLM for communication.
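To make the text-based variant more concrete, here is a minimal Python sketch of how textualized video content, dialogue context, and a user query might be assembled into a single LLM prompt. It is an illustration only, not the authors' implementation: the `ClipDescription` structure, the `build_video_prompt` helper, the timestamp format, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ClipDescription:
    """One textualized observation about a span of the video (hypothetical structure)."""
    start_s: float
    end_s: float
    text: str


def build_video_prompt(clips, question, history=()):
    """Assemble a plain-text prompt from clip descriptions, dialogue history, and a query.

    The prompt layout is an assumption for illustration, not the paper's template.
    """
    lines = ["The following are timestamped descriptions of a video:"]
    # Present clips in chronological order so the LLM sees events in sequence.
    for clip in sorted(clips, key=lambda c: c.start_s):
        lines.append(f"[{clip.start_s:.1f}s - {clip.end_s:.1f}s] {clip.text}")
    # Replay earlier dialogue turns so the answer can depend on context.
    for user_turn, assistant_turn in history:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {assistant_turn}")
    lines.append(f"User: {question}")
    lines.append("Assistant:")
    return "\n".join(lines)


clips = [
    ClipDescription(4.0, 8.0, "The person pours water into a kettle."),
    ClipDescription(0.0, 4.0, "A person walks into a kitchen."),
]
prompt = build_video_prompt(clips, "What happens after the person enters?")
print(prompt)
```

In a real system the clip descriptions would come from perception models rather than being written by hand; the sketch only shows the final step of packing text, context, and query together.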
The stack includes a pretrained vision transformer equipped with a global multi-head relation aggregator as its temporal modelling module, and a pretrained QFormer that, with an extra linear projection and additional query tokens, serves as the token interface. The resulting compact, LLM-compatible video embeddings can be reused in future dialogues.
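The role of the token interface can be illustrated with a heavily simplified NumPy sketch: a fixed number of learnable query tokens attend over the video features, and a linear projection maps the result into the LLM's embedding space. This is not the actual QFormer, which stacks multiple transformer layers; the `TokenInterface` class, the single attention step, the random weights, and all dimensions below are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TokenInterface:
    """Toy stand-in for a query-based video-language token interface (hypothetical)."""

    def __init__(self, num_queries, d_video, d_llm, rng):
        # Query tokens and projection weights would be learned; here they are random.
        self.queries = rng.normal(size=(num_queries, d_video)) * 0.02
        self.w_proj = rng.normal(size=(d_video, d_llm)) * 0.02
        self.b_proj = np.zeros(d_llm)

    def __call__(self, video_feats):
        # video_feats: (num_frames * patches_per_frame, d_video)
        scale = np.sqrt(video_feats.shape[-1])
        # Each query attends over all spatiotemporal features (single-head cross-attention).
        attn = softmax(self.queries @ video_feats.T / scale, axis=-1)
        pooled = attn @ video_feats  # (num_queries, d_video)
        # Linear projection into the LLM embedding dimension.
        return pooled @ self.w_proj + self.b_proj  # (num_queries, d_llm)


rng = np.random.default_rng(0)
video_feats = rng.normal(size=(8 * 16, 64))   # 8 frames x 16 patches, 64-dim features
interface = TokenInterface(num_queries=32, d_video=64, d_llm=128, rng=rng)
video_tokens = interface(video_feats)

# Stand-in embeddings for the user query and dialogue context.
text_embeds = rng.normal(size=(10, 128))
llm_input = np.concatenate([video_tokens, text_embeds], axis=0)
print(video_tokens.shape, llm_input.shape)
```

Because the number of query tokens is fixed, the video is compressed to the same number of embeddings regardless of its length, which is one reason such embeddings are cheap to cache and reuse across dialogue turns.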
The researchers also design a two-stage joint training paradigm that leverages readily available image instruction data, enabling the model to handle both images and videos with shared spatial perception and reasoning capacity. To tune the system, they curate a video-centric instruction dataset comprising thousands of videos matched with detailed descriptions and conversations.
In their empirical study, the team compared VideoChat with LLaVA, MiniGPT-4, and mPLUG-Owl baselines. In the experiments, VideoChat demonstrated a superior ability to correctly identify a scene’s various characteristics.
The team hopes their work will advance the integration of video and natural language processing for video understanding and reasoning and pave the way for a wide range of real-world applications across various domains.
Author: Hecate He | Editor: Michael Sarazen