
NVIDIA’s Minimal Video Instance Segmentation Framework Achieves SOTA Performance Without Video-Based Training

In the new paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training, an NVIDIA research team presents MinVIS, a minimal video instance segmentation framework that outperforms state-of-the-art VIS approaches without requiring video-based training or annotations.

The task of simultaneously classifying, segmenting, and tracking multiple object instances in videos is referred to as video instance segmentation (VIS). Modern per-clip VIS transformers such as VisTR deliver impressive end-to-end performance but suffer from long training times and high computation costs due to their dense frame-wise attention. Moreover, training such models requires object instance mask annotations for every video frame, which is prohibitively expensive at scale.

The team summarizes their main contributions as follows:

  1. We show that video-based architecture and training are not required for competitive VIS performance. MinVIS outperforms the previous state of the art on the YouTube-VIS 2019 and 2021 datasets by 1% and 3% AP, respectively, while only training an image instance segmentation model.
  2. We show that image instance segmentation models capable of segmenting occluded instances are also well suited to track occluded instances in videos in our framework. MinVIS outperforms its per-clip counterpart by over 13% AP on the challenging Occluded VIS (OVIS) dataset, an over 10% improvement on the previous best result for that dataset.
  3. Our image-based approach allows us to significantly sub-sample the required segmentation annotations in training without any change to the model. With only 1% of labelled frames, MinVIS outperforms or is comparable to fully supervised state-of-the-art approaches on all three datasets (a minimal sketch of such frame sub-sampling follows this list).
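
Because frames are treated as independent images, sub-sampling annotations amounts to simply filtering the labelled frames before training. A minimal sketch of this idea (the function and data layout are illustrative, not the paper's actual pipeline):

```python
import random

def subsample_annotated_frames(frames, keep_ratio=0.01, seed=0):
    """Keep only a fraction of a video's annotated frames.

    frames: list of (image, instance_masks) pairs from one video.
    Because MinVIS trains a per-image model, dropping annotated
    frames requires no change to the model or the loss.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(frames) * keep_ratio))
    return rng.sample(frames, n_keep)
```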

Because instance masks are non-overlapping, MinVIS's training objective drives each query embedding to respond highly only on the features of its corresponding instance, while the other query embeddings respond weakly on those features. The query embeddings for different instances in a frame are therefore well separated, which yields temporally consistent query embeddings for object tracking without any video-based training.
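
Concretely, in a query-based image instance segmentation model, mask logits are typically computed as dot products between query embeddings and per-pixel features, so training against non-overlapping ground-truth masks pushes the embeddings apart. A minimal PyTorch sketch of this response computation (shapes and names are illustrative):

```python
import torch

def query_responses(query_embeds, pixel_feats):
    """Per-query mask logits as query-pixel dot products.

    query_embeds: (Q, C) tensor, one embedding per instance query.
    pixel_feats:  (C, H, W) tensor of per-pixel image features.
    Returns:      (Q, H, W) mask logits. With non-overlapping target
    masks, each query learns to respond highly only on its own
    instance's pixels, which separates the query embeddings.
    """
    return torch.einsum("qc,chw->qhw", query_embeds, pixel_feats)
```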

For MinVIS inference, the team first independently applies a query-based image instance segmentation model to the video frames, then associates the segmented instances via their corresponding query embeddings, which carry the information needed for tracking a given instance. Since video frames are treated as independent images during MinVIS training, there is no need to annotate every frame in a video, enabling significant sub-sampling of segmentation annotations without any change to the model.
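
The association step can be sketched as bipartite matching on the similarity between query embeddings of adjacent frames. A minimal sketch, assuming cosine similarity and Hungarian matching (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_queries(prev_embeds, curr_embeds):
    """Associate instances across two adjacent frames.

    prev_embeds, curr_embeds: (Q, C) query embeddings produced by
    running the image model independently on consecutive frames.
    Returns a dict mapping current-frame query indices to the
    previous-frame query indices they are tracked from.
    """
    sim = F.normalize(curr_embeds, dim=-1) @ F.normalize(prev_embeds, dim=-1).T
    # Hungarian matching maximizes total similarity (hence the negation).
    curr_idx, prev_idx = linear_sum_assignment(-sim.detach().cpu().numpy())
    return dict(zip(curr_idx.tolist(), prev_idx.tolist()))
```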

In their empirical study, the team compared MinVIS against state-of-the-art approaches on the YouTube-VIS 2021 dataset, where it improved average precision (AP) by 3 percent. MinVIS also outperformed its per-clip counterparts by over 13 percent AP on the challenging Occluded VIS (OVIS) dataset, again without video-based training.

The researchers note that MinVIS's practical advantages, reducing both label and computation costs without sacrificing performance, make it a promising new approach to VIS, and they propose extending MinVIS with sub-sampled annotations as a direction for further improving performance.

The code is available on the project’s GitHub. The paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


