The task of simultaneously classifying, segmenting, and tracking multiple object instances in videos is referred to as video instance segmentation (VIS). Modern VIS transformers (VisTR) use a per-clip approach and have shown impressive end-to-end performance, but they suffer from long training times and high computation costs due to their frame-wise dense attention. Moreover, training VisTR requires object instance mask annotations for every video frame, which is prohibitively expensive at scale.
In the new paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training, an NVIDIA research team presents MinVIS, a minimal video instance segmentation framework that outperforms state-of-the-art VIS methods without requiring video-based training or annotations.
The team summarizes their main contributions as follows:
- We show that video-based architecture and training are not required for competitive VIS performance. MinVIS outperforms the previous state-of-the-art on the YouTube-VIS 2019 and 2021 datasets by 1% and 3% AP, respectively, while only training an image instance segmentation model.
- We show that image instance segmentation models capable of segmenting occluded instances are also well suited to track occluded instances in videos in our framework. MinVIS outperforms its per-clip counterpart by over 13% AP on the challenging Occluded VIS (OVIS) dataset, more than 10% better than the previous best performance on the dataset.
- Our image-based approach allows us to significantly sub-sample the required segmentation annotations in training without any change to the model. With only 1% of labelled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on all three datasets.
The proposed MinVIS method trains each query to return high responses only on the features of its corresponding instance, while the other query embeddings have low responses on those features because instance masks are non-overlapping. As a result, the query embeddings for different instances in a frame are well separated, which yields temporally consistent query embeddings for object tracking without requiring video-based training.
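To make the query mechanism concrete, here is a minimal illustrative sketch (not code from the MinVIS release; all names and shapes are hypothetical). In query-based segmentation models, each learned query embedding produces per-pixel mask logits via a dot product with image features, and taking the argmax over queries assigns each pixel to at most one instance, matching the non-overlapping masks described above.

```python
import numpy as np

# Hypothetical shapes: D = embedding dim, H x W = feature-map size.
rng = np.random.default_rng(0)
D, H, W = 16, 4, 4
queries = rng.normal(size=(2, D))      # two instance query embeddings
feats = rng.normal(size=(H * W, D))    # per-pixel image features

# Each query's dot product with the pixel features gives its mask logits.
logits = feats @ queries.T             # shape (H*W, 2)

# Assigning each pixel to its highest-responding query yields
# non-overlapping instance masks.
assignment = logits.argmax(axis=1)     # shape (H*W,), values in {0, 1}
```

Under this view, training the masks to be disjoint pushes each query to respond strongly only on its own instance's pixels, which is what separates the embeddings.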
For MinVIS inference, the team first independently applies a query-based image instance segmentation model to the video frames, then associates the segmented instances via their corresponding query embeddings. The query embeddings thus carry the information needed to track a given instance. Since the video frames are treated as independent images during MinVIS training, there is no need to annotate every frame in a video, which enables significant sub-sampling and reduction of segmentation annotations without any change to the model.
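The association step above can be sketched as bipartite matching on query-embedding similarity. The snippet below is a hedged illustration, not the authors' implementation: the function name and the use of cosine similarity with Hungarian matching are assumptions about one reasonable way to link frame-to-frame detections.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(prev_emb, curr_emb):
    """Link instances across frames by maximizing total cosine similarity
    between their query embeddings. Returns (prev_idx, curr_idx) pairs."""
    prev_n = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    curr_n = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    sim = prev_n @ curr_n.T                 # (N_prev, N_curr) similarity matrix
    row, col = linear_sum_assignment(-sim)  # Hungarian matching (maximize sim)
    return list(zip(row.tolist(), col.tolist()))

# Toy example: three well-separated embeddings, re-detected in permuted
# order with small noise; matching recovers the permutation.
prev = np.eye(3)
curr = prev[[2, 0, 1]] + 0.05 * np.random.default_rng(0).normal(size=(3, 3))
pairs = match_queries(prev, curr)
```

Because the embeddings of different instances are well separated, a simple similarity-based assignment like this suffices, with no video-based training of the tracker.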
In their empirical study, the team compared MinVIS against state-of-the-art approaches on the YouTube-VIS 2021 dataset, where it improved average precision (AP) by 3%. MinVIS also outperformed its per-clip counterparts by over 13% AP on the challenging Occluded VIS (OVIS) dataset, again without video-based training.
The researchers note that MinVIS's practical advantages, reducing both annotation and computation costs without sacrificing model performance, make it a promising new approach to VIS, and they propose extending MinVIS with sub-sampled annotations to further improve performance.
The code is available on the project’s GitHub. The paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen