In a new paper LRM: Large Reconstruction Model for Single Image to 3D, a research team from Adobe Research and the Australian National University introduces the Large Reconstruction Model (LRM), which can predict a 3D model of an object from a single input image in a mere 5 seconds.
In a new paper Follow Anything: Open-set detection, tracking, and following in real-time, a research team from MIT and Harvard University presents the follow anything system (FAn), an open-set, real-time any-object-following framework that can detect, segment, track, and follow any object, and can adapt to new objects via text, image, or click queries.
In a new paper Objaverse-XL: A Universe of 10M+ 3D Objects, a research team from the Allen Institute for AI, University of Washington, Columbia University, Stability AI, California Institute of Technology and LAION join forces to present Objaverse-XL, a large-scale, web-crawled dataset of 3D assets that provides substantially more varied and higher-quality data, aimed at boosting the performance of state-of-the-art 3D models.
In a new paper AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning, a research team presents AnimateDiff, a general and practical framework that can generate animated images for any personalized text-to-image (T2I) model without extra training or model-specific tuning.
In a new paper Scaling Open-Vocabulary Object Detection, a DeepMind research team introduces OWLv2, an optimized architecture with improved training efficiency, and applies the OWL-ST self-training recipe to it, substantially improving detection performance and achieving state-of-the-art results on open-vocabulary detection.
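Self-training recipes of this kind follow a familiar pattern: a pretrained open-vocabulary detector pseudo-labels large amounts of unlabelled web imagery, low-confidence boxes are filtered out, and a student model is trained on what remains. The sketch below illustrates that loop in plain Python; `teacher`, `student`, `train_step`, and the threshold are hypothetical placeholders, not the OWL-ST implementation.

```python
# Hedged sketch of a generic self-training (pseudo-labelling) loop for
# open-vocabulary detection. All names (teacher, student, train_step,
# unlabeled_images, vocabulary) are illustrative placeholders.

def self_train(teacher, student, train_step, unlabeled_images, vocabulary,
               score_threshold=0.3, epochs=1):
    # 1) Teacher annotates unlabeled web images with an open vocabulary.
    pseudo_labeled = []
    for image in unlabeled_images:
        detections = teacher(image, vocabulary)  # [(box, label, score), ...]
        kept = [(box, label) for box, label, score in detections
                if score >= score_threshold]     # drop low-confidence boxes
        if kept:
            pseudo_labeled.append((image, kept))

    # 2) Student is trained on the machine-generated annotations.
    for _ in range(epochs):
        for image, targets in pseudo_labeled:
            train_step(student, image, targets)

    # 3) Optionally, the trained student becomes the next teacher.
    return student
```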
In a new paper Image Captioners Are Scalable Vision Learners Too, a DeepMind research team presents CapPa, an image-captioning-based pretraining strategy that can compete with CLIP and exhibits favorable model and data scaling properties, verifying that plain image captioning can be a competitive pretraining strategy for vision backbones.
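The core idea, training a vision encoder purely by asking a small text decoder to predict the image's caption, can be summarized in a short PyTorch sketch. The encoder interface, decoder depth, and loss wiring below are generic illustrative assumptions, not CapPa's exact training recipe (the paper also explores parallel prediction of caption tokens).

```python
import torch
import torch.nn as nn

class CaptioningPretrainer(nn.Module):
    """Sketch of captioning-based vision pretraining: the vision encoder is
    trained end-to-end by making a text decoder predict the caption.
    Module choices and shapes here are illustrative assumptions."""

    def __init__(self, vision_encoder: nn.Module, vocab_size: int, dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder  # assumed to output (B, N, dim) tokens
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
        # caption_ids: (B, T) token ids; standard next-token prediction loss.
        memory = self.vision_encoder(images)              # (B, N, dim)
        tgt = self.embed(caption_ids[:, :-1])             # (B, T-1, dim)
        t = tgt.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                       device=tgt.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        logits = self.lm_head(hidden)                     # (B, T-1, vocab)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            caption_ids[:, 1:].reshape(-1))
```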
In the new paper ZipIt! Merging Models from Different Tasks Without Training, a Georgia Tech research team proposes ZipIt!, a general method that exploits redundant features to combine two or more models with the same architecture but trained on different tasks into one multi-task model without additional training.
In the new paper DETRs Beat YOLOs on Real-Time Object Detection, a Baidu Inc. research team presents Real-Time Detection Transformer (RT-DETR), a real-time end-to-end object detector that leverages a hybrid encoder and novel IoU-aware query selection to reduce inference latency. RT-DETR outperforms YOLO object detectors in both accuracy and speed.
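The query-selection side of the design can be pictured with a small sketch: score every encoder token and let the top-k tokens initialize the decoder's object queries. The module below is a simplified, generic top-k selection in PyTorch; it omits the IoU-aware training objective, and the names and shapes are assumptions rather than RT-DETR's implementation.

```python
import torch
import torch.nn as nn

class TopKQuerySelection(nn.Module):
    """Illustrative top-k query selection: score each encoder token and use
    the k highest-scoring tokens to initialize decoder object queries.
    (Simplified sketch; not RT-DETR's exact IoU-aware formulation.)"""

    def __init__(self, dim: int, num_queries: int = 300):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # per-token objectness score
        self.num_queries = num_queries

    def forward(self, encoder_tokens: torch.Tensor) -> torch.Tensor:
        # encoder_tokens: (batch, num_tokens, dim)
        scores = self.score_head(encoder_tokens).squeeze(-1)  # (B, N)
        topk = scores.topk(self.num_queries, dim=1).indices   # (B, K)
        idx = topk.unsqueeze(-1).expand(-1, -1, encoder_tokens.size(-1))
        return encoder_tokens.gather(1, idx)                  # (B, K, dim)

# queries = TopKQuerySelection(dim=256)(torch.randn(2, 1000, 256))
```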
In the new paper SpectFormer: Frequency and Attention Is What You Need in a Vision Transformer, a research team from Microsoft and the University of Bath proposes SpectFormer, a novel transformer architecture that combines spectral and multi-headed attention layers to better capture appropriate feature representations and improve performance.
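Spectral layers mix tokens in the frequency domain rather than through attention. As a rough illustration of what frequency-domain token mixing can look like, here is a minimal FFT-based gating layer in PyTorch; the filter parameterization and shapes are assumptions for the sketch, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class SpectralGating(nn.Module):
    """Minimal sketch of FFT-based spectral token mixing: transform tokens to
    the frequency domain, apply a learnable complex filter, transform back.
    Parameter shapes and initialization are illustrative assumptions."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        freq_bins = num_tokens // 2 + 1  # length of rfft output along tokens
        # Learnable complex filter stored as (real, imag) parts.
        self.filter = nn.Parameter(torch.randn(freq_bins, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        freq = torch.fft.rfft(x, dim=1)                   # complex spectrum
        weight = torch.view_as_complex(self.filter)       # (freq_bins, dim)
        freq = freq * weight                              # elementwise gating
        return torch.fft.irfft(freq, n=x.size(1), dim=1)  # back to token space
```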
In the new paper Instruct-NeRF2NeRF: Editing 3D Scenes With Instructions, a UC Berkeley research team presents Instruct-NeRF2NeRF, an approach for editing 3D NeRF scenes through natural language text instructions. The proposed method can edit large-scale, real-world 3D scenes with improved ease of use and realism.
In the new paper RealFusion: 360° Reconstruction of Any Object from a Single Image, an Oxford University research team leverages a diffusion model to generate 360° reconstructions of objects from a single image. Their RealFusion approach achieves state-of-the-art performance on monocular 3D reconstruction benchmarks.
In the new paper Dreamix: Video Diffusion Models Are General Video Editors, a team from Google Research and the Hebrew University of Jerusalem presents Dreamix, a novel approach that leverages a video diffusion model (VDM) to enable text-based motion and appearance video editing.
In the new paper Point-E: A System for Generating 3D Point Clouds from Complex Prompts, an OpenAI research team presents Point·E, a system for text-conditional synthesis of 3D point clouds that leverages diffusion models to generate diverse and complex 3D shapes conditioned on complex text prompts in minutes on a single GPU.
In the new paper What Do Vision Transformers Learn? A Visual Exploration, a research team from the University of Maryland and New York University uses large-scale feature visualizations from a wide range of vision transformers to gain insights into what they learn from images and how they differ from convolutional neural networks.
In the new paper SPACEx: Speech-driven Portrait Animation with Controllable Expression, an NVIDIA research team introduces SPACEx — a speech-driven portrait animation framework that generates high-resolution and expressive facial videos with control over subject pose, emotion and expression intensity.
In the new paper Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods, DeepMind and NYU Center for Neural Systems researchers introduce computational efficiency evaluation approaches designed to aid the selection of optimal methods, datasets and models for visual pretraining under a fixed FLOP budget.
In the new paper 3D-FM GAN: Towards 3D-Controllable Face Manipulation, a team from Princeton University and Adobe Research presents 3D-FM GAN, a novel conditional GAN framework that enables precise 3D-controllable face manipulation with high photorealism and strong identity preservation without requiring any manual tuning or optimizations.
In the new paper Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, a Microsoft research team presents BEiT-3, a general-purpose state-of-the-art multimodal foundation model for both vision and vision-language tasks that advances the big convergence of backbone architectures, pretraining tasks, and model scaling.
In the new paper Paint2Pix: Interactive Painting based Progressive Image Synthesis and Editing, a research team from Adobe Research and Australian National University presents paint2pix, a novel model that learns to predict users’ intentions and produce photorealistic images from primitive and coarse human brushstroke inputs.
In the new paper MinVIS: A Minimal Video Instance Segmentation Framework Without Video-based Training, an NVIDIA research team presents MinVIS, a minimal video instance segmentation framework that outperforms state-of-the-art VIS approaches without requiring video-based training.
In the new paper Is Attention All NeRF Needs?, a research team from the Indian Institute of Technology Madras and the University of Texas at Austin proposes Generalizable NeRF Transformer (GNT), a pure and universal transformer-based architecture for efficient on-the-fly reconstruction of NeRFs. The work demonstrates that a pure attention mechanism suffices for learning a physically-grounded rendering process.
In the new paper YOLOv7: Trainable Bag-Of-Freebies Sets New State-Of-The-Art for Real-Time Object Detectors, an Academia Sinica research team releases YOLOv7. This latest YOLO version introduces novel “extend” and “compound scaling” methods that effectively utilize parameters and computation; and surpasses all known real-time object detectors in speed and accuracy.
In the new paper Global Context Vision Transformers, an NVIDIA research team proposes the Global Context Vision Transformer, a novel yet simple hierarchical ViT architecture comprising global self-attention and token generation modules that enables the efficient modelling of both short- and long-range dependencies without costly compute operations while achieving SOTA results across various computer vision tasks.
In the new paper UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, a Google Brain research team proposes UViM, a unified approach that leverages language modelling and discrete representation learning to enable the modelling of a wide range of computer vision tasks without task-specific modifications.
In the new paper i-Code: An Integrative and Composable Multimodal Learning Framework, a Microsoft Azure Cognitive Services Research team presents i-Code, a self-supervised pretraining framework that enables the flexible integration of vision, speech, and language modalities and learns their vector representations in a unified manner.
A research team from Rikkyo University and AnyTech Co., Ltd. examines the suitability of different inductive biases for computer vision and proposes Sequencer, an architectural alternative to ViTs that leverages long short-term memory (LSTM) rather than self-attention layers to achieve ViT-competitive performance on long sequence modelling.
In the new paper PP-Matting: High-Accuracy Natural Image Matting, a Baidu research team proposes PP-Matting, a trimap-free architecture that combines a high-resolution detail branch and a semantic context branch to achieve state-of-the-art performance on natural image matting tasks.
In the new paper Dancing Under the Stars: Video Denoising in Starlight, a research team from UC Berkeley and Intel Labs leverages a GAN-tuned, physics-based noise model to represent camera noise under low light conditions and trains a novel denoiser that, for the first time, achieves photorealistic video denoising in starlight.
DeepMind researchers propose Hierarchical Perceiver (HiP), a model that retains the original Perceiver’s ability to process arbitrary modalities but is faster, can scale up to even more inputs/outputs, reduces the need for input engineering, and improves both efficiency and accuracy on classical computer vision benchmarks.
In the new paper Visual Attention Network, a research team from Tsinghua University and Nankai University introduces a novel large kernel attention (LKA) mechanism for an extremely simple and efficient Visual Attention Network (VAN) that significantly outperforms state-of-the-art vision transformers and convolutional neural networks on various computer vision tasks.
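The large kernel attention idea decomposes a large-kernel convolution into cheaper pieces and uses the result as a gating map over the input. Below is a minimal PyTorch sketch of such a block; the specific kernel sizes, dilation, and padding are illustrative assumptions rather than VAN's exact configuration.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Sketch of large-kernel attention: a large receptive field is
    decomposed into a depth-wise conv, a depth-wise dilated conv, and a
    point-wise conv; the result gates the input elementwise.
    Kernel sizes here are illustrative assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9,
                                    dilation=3, groups=dim)
        self.pw_conv = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, height, width)
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        return attn * x  # attention map gates the input features
```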
A Google Research team proposes Masked Generative Image Transformer (MaskGIT), a novel image synthesis paradigm that uses a bidirectional transformer decoder. MaskGIT significantly outperforms state-of-the-art transformer models on the ImageNet dataset and accelerates autoregressive decoding by up to 64x.
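The speedup comes from parallel, iterative decoding rather than token-by-token generation: start from a fully masked token grid, predict every position at once, keep only the most confident predictions, and re-mask the rest for the next pass. Below is a rough sketch of such a loop; `predict_logits`, the cosine schedule, and all shapes are illustrative assumptions rather than MaskGIT's implementation.

```python
import math
import torch

def iterative_parallel_decode(predict_logits, num_tokens, mask_id, steps=8):
    """Sketch of masked parallel decoding. `predict_logits` is a hypothetical
    callable mapping a (num_tokens,) token sequence to (num_tokens, vocab)
    logits; masked positions hold `mask_id`."""
    tokens = torch.full((num_tokens,), mask_id, dtype=torch.long)
    for step in range(steps):
        probs = predict_logits(tokens).softmax(-1)
        confidence, candidates = probs.max(-1)         # best token per slot
        # Cosine schedule: fraction of tokens still masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        num_masked = int(num_tokens * mask_ratio)
        # Already-decided tokens always stay; fill the remaining slots with
        # the most confident new predictions.
        still_masked = tokens == mask_id
        confidence = confidence.masked_fill(~still_masked, float("inf"))
        keep = confidence.argsort(descending=True)[: num_tokens - num_masked]
        tokens[keep] = torch.where(still_masked[keep],
                                   candidates[keep], tokens[keep])
    return tokens
```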
A DeepMind research team proposes ReLICv2, which demonstrates for the first time that representations learned without labels can consistently outperform a strong, supervised baseline on ImageNet and even achieve comparable results to state-of-the-art self-supervised vision transformers (ViTs).