Intelligent video understanding is crucial for real-world applications such as autonomous driving and human-robot interaction. Current video understanding approaches typically rely on task-specific fine-tuning of video foundation models, whose learned spatiotemporal representations do not generalize well beyond the tasks they are tuned for.
Powerful pretrained large language models (LLMs) can serve as encoders that enable deep image understanding, but they have proven less capable at video understanding. Existing video-centric multimodal dialogue systems, meanwhile, still struggle with spatiotemporal reasoning, event localization, and causal relationship inference.
In the new paper VideoChat: Chat-Centric Video Understanding, a research team from Shanghai AI Laboratory’s OpenGVLab, Nanjing University, the University of Hong Kong, Shenzhen Institute of Advanced Technology, and the Chinese Academy of Sciences presents VideoChat, a groundbreaking end-to-end chat-centric video understanding system that leverages state-of-the-art video and language models to improve spatiotemporal reasoning, event localization, and causal relationship inference.

The proposed VideoChat integrates state-of-the-art video foundation models and LLMs via a learnable neural interface, and the paper details the data-centric techniques needed to train the system.

The proposed framework comprises two distinct processes: VideoChat-Text, which uses LLMs to textualize videos, and VideoChat-Embed, which combines video and language foundation models with a learnable video-language token interface (VLTF), tuned on video-text data, to encode videos as embeddings. The video tokens, user queries, and dialogue context are then fed into an LLM for conversation.
The stack pairs a pretrained vision transformer, equipped with a global multi-head relation aggregator (GMHRA) module for temporal modelling, with a pretrained QFormer that serves as the token interface, extended with an extra linear projection and additional query tokens. The resulting compact, LLM-compatible video embeddings can be reused in future dialogues.
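To make the data flow concrete, here is a minimal PyTorch sketch of how such a token interface could turn frame features into LLM-compatible video tokens. It is an illustrative stand-in rather than the authors' implementation: the dimensions, the single cross-attention block used in place of the pretrained QFormer, and names such as `VideoTokenInterface`, `vit_feats`, and `context_embeds` are all assumptions.

```python
# Minimal sketch of the VideoChat-Embed token interface (illustrative only; the
# dimensions and the single cross-attention block standing in for the pretrained
# QFormer are assumptions, not the paper's released code).
import torch
import torch.nn as nn

class VideoTokenInterface(nn.Module):
    """Compress ViT frame features into a handful of LLM-compatible video tokens."""
    def __init__(self, vit_dim=1024, llm_dim=4096, num_query_tokens=32):
        super().__init__()
        # Learnable query tokens, in the spirit of the QFormer's additional queries.
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, vit_dim) * 0.02)
        # One cross-attention block stands in for the full pretrained QFormer.
        self.cross_attn = nn.MultiheadAttention(vit_dim, num_heads=8, batch_first=True)
        # Extra linear projection into the LLM's embedding space.
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, time * patches, vit_dim) from the ViT + temporal module
        queries = self.query_tokens.expand(frame_feats.size(0), -1, -1)
        video_tokens, _ = self.cross_attn(queries, frame_feats, frame_feats)
        return self.proj(video_tokens)  # (batch, num_query_tokens, llm_dim)

# Toy usage: stand-in ViT features for an 8-frame clip, then assemble the LLM input
# by concatenating the video tokens with embedded user query / dialogue context.
vit_feats = torch.randn(1, 8 * 196, 1024)        # placeholder for ViT (+ GMHRA) output
video_tokens = VideoTokenInterface()(vit_feats)  # compact, reusable video embeddings
context_embeds = torch.randn(1, 64, 4096)        # placeholder query + context embeddings
llm_inputs = torch.cat([video_tokens, context_embeds], dim=1)  # fed to the frozen LLM
```

Because the query tokens compress an entire clip into a few dozen vectors, the same video embeddings can be cached and reused across turns, which is what makes multi-turn video chat tractable.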

The researchers also design a two-stage joint training paradigm that leverages readily available image instruction data, enabling the model to handle both images and videos with a shared spatial perception and reasoning capacity. To tune the system, they curate a video-centric instruction dataset comprising thousands of videos paired with detailed descriptions and conversations.
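A rough sketch of what one stage of this paradigm might look like is shown below, assuming a Hugging Face-style causal LLM that accepts `inputs_embeds` and `labels`. The frozen/trainable split, the loss masking, and names such as `caption_loader` and `instruction_loader` are illustrative assumptions rather than the authors' released training code.

```python
# Illustrative two-stage tuning loop (assumptions: only the token interface is
# trained, the vision backbone and LLM stay frozen, and the LLM follows the
# Hugging Face causal-LM interface with inputs_embeds/labels).
import torch

def train_stage(interface, llm, dataloader, optimizer):
    """One tuning stage: update only the video-language token interface."""
    for p in llm.parameters():
        p.requires_grad = False
    interface.train()
    for frame_feats, text_ids, labels in dataloader:
        video_tokens = interface(frame_feats)               # (B, Q, llm_dim)
        text_embeds = llm.get_input_embeddings()(text_ids)  # (B, T, llm_dim)
        inputs = torch.cat([video_tokens, text_embeds], dim=1)
        # Mask the video positions out of the language-modeling loss with -100.
        pad = torch.full(video_tokens.shape[:2], -100, dtype=torch.long)
        loss = llm(inputs_embeds=inputs, labels=torch.cat([pad, labels], dim=1)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: align video tokens with the LLM on large-scale image/video caption data.
# train_stage(interface, llm, caption_loader, optimizer)
# Stage 2: instruction-tune on the curated video descriptions and conversations.
# train_stage(interface, llm, instruction_loader, optimizer)
```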

In their empirical study, the team compared VideoChat against LLaVA, MiniGPT-4, and mPLUG-Owl baselines. In the experiments, VideoChat demonstrated a superior ability to correctly identify a scene’s various characteristics.
The team hopes their work will advance the integration of video and natural language processing for video understanding and reasoning and pave the way for a wide range of real-world applications across various domains.
The code and data are available on the project’s GitHub. The paper VideoChat: Chat-Centric Video Understanding is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
