In the new paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training, an NVIDIA research team presents MinVIS, a minimal video instance segmentation framework that outperforms state-of-the-art VIS approaches without requiring video-based training.
In the new paper Is Attention All NeRF Needs?, a research team from the Indian Institute of Technology Madras and the University of Texas at Austin proposes Generalizable NeRF Transformer (GNT), a pure and universal transformer-based architecture for efficient on-the-fly reconstruction of NeRFs. The work demonstrates that a pure attention mechanism suffices for learning a physically-grounded rendering process.
In the new paper YOLOv7: Trainable Bag-Of-Freebies Sets New State-Of-The-Art for Real-Time Object Detectors, an Academia Sinica research team releases YOLOv7. This latest YOLO version introduces novel “extend” and “compound scaling” methods that effectively utilize parameters and computation; and surpasses all known real-time object detectors in speed and accuracy.
In the new paper Global Context Vision Transformers, an NVIDIA research team proposes the Global Context Vision Transformer, a novel yet simple hierarchical ViT architecture comprising global self-attention and token generation modules that enables the efficient modelling of both short- and long-range dependencies without costly compute operations while achieving SOTA results across various computer vision tasks.
In the new paper UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, a Google Brain research team proposes UViM, a unified approach that leverages language modelling and discrete representation learning to enable the modelling of a wide range of computer vision tasks without task-specific modifications.
In the new paper i-Code: An Integrative and Composable Multimodal Learning Framework, a Microsoft Azure Cognitive Services Research team presents i-Code, a self-supervised pretraining framework that enables the flexible integration of vision, speech, and language modalities and learns their vector representations in a unified manner.
A research team from Rikkyo University and AnyTech Co., Ltd. examines the suitability of different inductive biases for computer vision and proposes Sequencer, an architectural alternative to ViTs that leverages long short-term memory (LSTM) rather than self-attention layers to achieve ViT-competitive performance on long sequence modelling.
In the new paper PP-Matting: High-Accuracy Natural Image Matting, a Baidu research team proposes PP-Matting, a trimap-free architecture that combines a high-resolution detail branch and a semantic context branch to achieve state-of-the-art performance on natural image matting tasks.
In the new paper Dancing Under the Stars: Video Denoising in Starlight, a research team from UC Berkeley and Intel Labs leverages a GAN-tuned, physics-based noise model to represent camera noise under low light conditions and trains a novel denoiser that, for the first time, achieves photorealistic video denoising in starlight.
DeepMind researchers propose Hierarchical Perceiver (HiP), a model that retains the original Perceiver’s ability to process arbitrary modalities but is faster, can scale up to even more inputs/outputs, reduces the need for input engineering, and improves both efficiency and accuracy on classical computer vision benchmarks.
In the new paper Visual Attention Network, a research team from Tsinghua University and Nankai University introduces a novel large kernel attention (LKA) mechanism for an extremely simple and efficient Visual Attention Network (VAN) that significantly outperforms state-of-the-art vision transformers and convolutional neural networks on various computer vision tasks.
A Google Research team proposes Masked Generative Image Transformer (MaskGIT), a novel image synthesis paradigm that uses a bidirectional transformer decoder. MaskGIT significantly outperforms state-of-the-art transformer models on the ImageNet dataset and accelerates autoregressive decoding by up to 64x.
A DeepMind research team proposes ReLICv2, which demonstrates for the first time that representations learned without labels can consistently outperform a strong, supervised baseline on ImageNet and even achieve comparable results to state-of-the-art self-supervised vision transformers (ViTs).
A team from Facebook AI Research and UC Berkeley proposes ConvNeXts, a pure ConvNet model that achieves performance comparable with state-of-the-art hierarchical vision transformers on computer vision benchmarks while retaining the simplicity and efficiency of standard ConvNets.
In the new paper Masked Feature Prediction for Self-Supervised Visual Pre-Training, a Facebook AI Research and Johns Hopkins University team presents a novel Masked Feature Prediction (MaskFeat) approach for the self-supervised pretraining of video models that achieves SOTA results on video benchmarks.
In the paper A New Foundation Model for Computer Vision, a Microsoft research team proposes Florence, a novel foundation model for computer vision that significantly outperforms previous large-scale pretraining approaches and achieves new SOTA results across a wide range of visual and visual-linguistic benchmarks.
In the new paper Shaking the Foundations: Delusions in Sequence Models for Interaction and Control, a DeepMind research team explores the origins of mismatch problems in sequence models that lack understanding of the cause and effect of their actions, and addresses the problem by treating actions as causal interventions.
Researchers from Fudan University, University of Surrey and Huawei Noah’s Ark Lab identify the limitations of quadratic complexity for vision transformers (ViTs) as rooted in keeping the softmax self-attention during approximations. The team proposes the first softmax-free transformer (SOFT), which reduces the self-attention computation to linear complexity, achieving a superior trade-off between accuracy and complexity.
A research team from Google Brain and Google Research introduces SCENIC, an open-source JAX library for fast and extensible computer vision research and beyond. JAX currently supports implementations of state-of-the-art vision models such as ViT, DETR and MLP Mixer, and more open-sourced cutting-edge projects will be added in the near future.
In a paper currently under double-blind review for ICLR 2022, researchers propose StyleNeRF, a 3D-aware generative model that can synthesize high-resolution images at interactive rates while preserving high-quality 3D consistency, and can even generalize to unseen views with control on styles and poses.
A research team proposes ConvMixer, an extremely simple model designed to support the argument that the impressive performance of vision transformers (ViTs) is mainly attributable to their use of patches as the input representation. The study shows that ConvMixer can outperform ViTs, MLP-Mixers and classical vision models.
An Apple research team performs a comparative analysis on a contrastive self-supervised learning (SSL) algorithm (SimCLR) and a supervised learning (SL) approach for simple image data in a common architecture, shedding light on the similarities and dissimilarities in their learned visual representation patterns.
A research team from University of California San Diego and Microsoft proposes Micro-Factorized Convolution (MF-Conv), a novel approach that can deal with extremely low computational costs (4M–21M FLOPs) and achieves significant performance gains over state of the art models in the low FLOP regime.