MovieChat+: Elevating Zero-Shot Long Video Understanding to New Heights

Combining video foundation models with large language models has recently emerged as a promising way to build video understanding systems that are not limited to predefined vision tasks. These methods perform well on short videos but struggle with longer ones, because maintaining long-term temporal connections drives up both computational complexity and memory cost.

In the new paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering, a research team introduces MovieChat, a framework designed to handle long videos of more than 10,000 frames, and reports state-of-the-art performance in long video understanding.

The team summarizes its main contributions as follows:

  1. Introduction of MovieChat: MovieChat is presented as the first framework designed specifically for long video understanding. It builds on pre-trained Multi-modal Large Language Models (MLLMs) and uses a zero-shot, training-free memory consolidation mechanism.
  2. Enhancement with MovieChat+: Building on MovieChat, the upgraded MovieChat+ improves memory efficiency by introducing a memory consolidation technique based on vision-question matching. It outperforms both the initial version and prior methods on short and long video question-answering tasks.
  3. Launch of MovieChat-1K Benchmark: The research group releases MovieChat-1K, a long-video understanding benchmark whose temporal label set is expanded to 2,000 labels compared with its precursor. Quantitative evaluations and case studies assess both understanding capability and inference cost.

MovieChat uses a sliding window to extract video features and encodes them as tokens, which are fed frame by frame into the short-term memory. Once the short-term memory reaches a preset length, the earliest tokens are merged and consolidated into the long-term memory.
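The short-term to long-term flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `VideoMemory`, the parameters `short_capacity` and `consolidated_len`, and the merge rule (repeatedly averaging the most similar pair of adjacent frames) are all assumptions made for the example, and each frame is reduced to a single small feature vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consolidate(frames, target_len):
    """Shrink a list of frame features to target_len entries.

    Assumed merge rule: average the most similar adjacent pair until
    only target_len frames remain.
    """
    frames = [list(f) for f in frames]
    while len(frames) > target_len:
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class VideoMemory:
    """Hypothetical two-level memory: a bounded short-term buffer plus long-term storage."""

    def __init__(self, short_capacity=4, consolidated_len=2):
        self.short_capacity = short_capacity
        self.consolidated_len = consolidated_len
        self.short_term = []
        self.long_term = []

    def add_frame(self, feature):
        """Append one frame feature; consolidate when the short-term buffer is full."""
        self.short_term.append(list(feature))
        if len(self.short_term) >= self.short_capacity:
            self.long_term.extend(consolidate(self.short_term, self.consolidated_len))
            self.short_term = []


memory = VideoMemory(short_capacity=4, consolidated_len=2)
for feature in ([1, 0], [1, 0], [0, 1], [0, 1]):
    memory.add_frame(feature)
print(memory.long_term)
```

With these toy inputs, the two pairs of identical frames each collapse into one entry, so four short-term frames become two long-term entries. The point of the design is that long-term storage grows more slowly than the number of frames seen.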

The method supports two inference modes. The global mode relies exclusively on the long-term memory, while the breakpoint mode combines the current short-term memory with the long-term memory, enabling focused understanding of the video at a specific moment. After projection, the video representation is passed to a large language model, which interacts with the user.
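The difference between the two modes comes down to which memories are used to build the video representation. A small sketch, assuming memories are plain lists of frame features and using a hypothetical helper name `build_video_context` (the projection layer and the language model are left out):

```python
def build_video_context(mode, long_term, short_term):
    """Select the memory entries that form the video representation.

    "global":     long-term memory only (whole-video questions).
    "breakpoint": long-term memory followed by the current short-term
                  memory (questions about a specific moment).
    """
    if mode == "global":
        return list(long_term)
    if mode == "breakpoint":
        return list(long_term) + list(short_term)
    raise ValueError(f"unknown inference mode: {mode!r}")


long_term = [[1.0, 0.0], [0.0, 1.0]]
short_term = [[0.5, 0.5]]
print(build_video_context("global", long_term, short_term))
print(build_video_context("breakpoint", long_term, short_term))
```

Placing the short-term entries after the long-term ones keeps the context in rough chronological order, since the long-term memory holds older, consolidated content; that ordering is an assumption of this sketch.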

The team also introduces MovieChat+, which refines memory consolidation with a vision-question matching mechanism so that the model's predictions are better aligned with the visual cues relevant to the question.
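The article does not spell out the matching rule, so the following is only one plausible reading, offered as a sketch: score each frame feature against a question embedding and keep the most relevant frames in their original temporal order. The function name `question_aware_select`, the `keep` parameter, the use of cosine similarity, and the assumption that frames and question share one embedding space are all illustrative choices, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def question_aware_select(frames, question_vec, keep):
    """Keep the `keep` frames most similar to the question, in temporal order.

    Assumption: relevance is cosine similarity between each frame feature
    and the question embedding.
    """
    ranked = sorted(
        range(len(frames)),
        key=lambda i: cosine(frames[i], question_vec),
        reverse=True,
    )
    kept_indices = sorted(ranked[:keep])  # restore temporal order
    return [frames[i] for i in kept_indices]


frames = [[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]]
question = [1, 0]
print(question_aware_select(frames, question, keep=2))
```

In this toy case the first and third frames point in nearly the same direction as the question vector, so they are the ones retained.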

MovieChat addresses the challenges of analyzing long video sequences and achieves state-of-the-art performance in long video comprehension, outperforming existing systems that are limited to videos with fewer frames.

The code is available on the project’s GitHub. The paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering is on arXiv.


Author: Hecate He | Editor: Chain Zhang

