The task of simultaneously classifying, segmenting, and tracking multiple object instances in videos is referred to as video instance segmentation (VIS). Modern VIS transformers (VisTR) use a per-clip approach and have shown impressive end-to-end performance, but they suffer from long training times and high computation costs due to their frame-wise dense attention. Moreover, training VisTR requires object instance mask annotations for every video frame, which is prohibitively expensive at scale.
In the new paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training, an NVIDIA research team presents MinVIS, a minimal video instance segmentation framework that outperforms state-of-the-art VIS methods without requiring video-based training or annotations.
The team summarizes their main contributions as follows:
- We show that video-based architecture and training are not required for competitive VIS performance. MinVIS outperforms the previous state-of-the-art on the YouTube-VIS 2019 and 2021 datasets by 1% and 3% AP, respectively, while only training an image instance segmentation model.
- We show that image instance segmentation models capable of segmenting occluded instances are also well suited to track occluded instances in videos in our framework. MinVIS outperforms its per-clip counterpart by over 13% AP on the challenging Occluded VIS (OVIS) dataset, more than 10% better than the previous best performance on the dataset.
- Our image-based approach allows us to significantly sub-sample the required segmentation annotations in training without any change to the model. With only 1% of labelled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on all three datasets.
The proposed MinVIS method trains each query to return high responses only on the features of its corresponding instance, while the other query embeddings have low responses on those features because instance masks are non-overlapping. As a result, the query embeddings for different instances in a frame are well separated, which yields temporally consistent query embeddings for object tracking without requiring video-based training.
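To make the query mechanism concrete, here is a minimal illustrative sketch (not code from the MinVIS release; all names and shapes are hypothetical). In query-based segmentation models, each learned query embedding produces per-pixel mask logits via a dot product with image features, and taking the argmax over queries assigns each pixel to at most one instance, matching the non-overlapping masks described above.

```python
import numpy as np

# Hypothetical shapes: D = embedding dim, H x W = feature-map size.
rng = np.random.default_rng(0)
D, H, W = 16, 4, 4
queries = rng.normal(size=(2, D))      # two instance query embeddings
feats = rng.normal(size=(H * W, D))    # per-pixel image features

# Each query's dot product with the pixel features gives its mask logits.
logits = feats @ queries.T             # shape (H*W, 2)

# Assigning each pixel to its highest-responding query yields
# non-overlapping instance masks.
assignment = logits.argmax(axis=1)     # shape (H*W,), values in {0, 1}
```

Under this view, training the masks to be disjoint pushes each query to respond strongly only on its own instance's pixels, which is what separates the embeddings.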
For MinVIS inference, the team first independently applies a query-based image instance segmentation model to the video frames, then associates the segmented instances via their corresponding query embeddings. The query embeddings thus carry the information needed to track a given instance. Since the video frames are treated as independent images during MinVIS training, there is no need to annotate every frame in a video, which enables significant sub-sampling and reduction of segmentation annotations without any change to the model.
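The association step above can be sketched as bipartite matching on query-embedding similarity. The snippet below is a hedged illustration, not the authors' implementation: the function name and the use of cosine similarity with Hungarian matching are assumptions about one reasonable way to link frame-to-frame detections.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(prev_emb, curr_emb):
    """Link instances across frames by maximizing total cosine similarity
    between their query embeddings. Returns (prev_idx, curr_idx) pairs."""
    prev_n = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    curr_n = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    sim = prev_n @ curr_n.T                 # (N_prev, N_curr) similarity matrix
    row, col = linear_sum_assignment(-sim)  # Hungarian matching (maximize sim)
    return list(zip(row.tolist(), col.tolist()))

# Toy example: three well-separated embeddings, re-detected in permuted
# order with small noise; matching recovers the permutation.
prev = np.eye(3)
curr = prev[[2, 0, 1]] + 0.05 * np.random.default_rng(0).normal(size=(3, 3))
pairs = match_queries(prev, curr)
```

Because the embeddings of different instances are well separated, a simple similarity-based assignment like this suffices, with no video-based training of the tracker.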
In their empirical study, the team compared MinVIS against state-of-the-art approaches on the YouTube-VIS 2021 dataset, where it improved average precision (AP) by 3%. MinVIS also outperformed its per-clip counterparts by over 13% AP on the challenging Occluded VIS (OVIS) dataset, again without video-based training.
The researchers note that MinVIS's practical advantages, reducing both annotation and computation costs without sacrificing model performance, make it a promising new approach to VIS, and they propose extending MinVIS with sub-sampled annotations to further improve performance.
The code is available on the project’s GitHub. The paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training is on arXiv.
Author: Hecate He | Editor: Michael Sarazen