A paper titled “Online Video Quality Enhancement with Spatial-Temporal Look-up Tables” introduces a novel method, STLVQE. This research, conducted by a team from Tongji University and Microsoft Research Asia, pioneers the exploration of the online video quality enhancement problem and presents the first method achieving real-time processing speed.
In a new paper Follow Anything: Open-set detection, tracking, and following in real-time, a research team from MIT and Harvard University presents the follow anything system (FAn), an open-set real-time any object following framework that can detect, segment, track, and follow any object, and is able to adapt to new objects using text, images, or click queries.
In a new paper Objaverse-XL: A Universe of 10M+ 3D Objects, a research team from Allen Institute for AI, University of Washington, Columbia University, Stability AI, California Institute of Technology and LAION join force to present Objaverse-XL, a large-scale, web-crawled dataset of 3D assets, which provides substantially richer variety and quality data that aims to boost the performance of state-of-the-art 3D models.
In the new paper UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, a Google Brain research team proposes UViM, a unified approach that leverages language modelling and discrete representation learning to enable the modelling of a wide range of computer vision tasks without task-specific modifications.
In the paper A New Foundation Model for Computer Vision, a Microsoft research team proposes Florence, a novel foundation model for computer vision that significantly outperforms previous large-scale pretraining approaches and achieves new SOTA results across a wide range of visual and visual-linguistic benchmarks.
Researchers from Fudan University, University of Surrey and Huawei Noah’s Ark Lab identify the limitations of quadratic complexity for vision transformers (ViTs) as rooted in keeping the softmax self-attention during approximations. The team proposes the first softmax-free transformer (SOFT), which reduces the self-attention computation to linear complexity, achieving a superior trade-off between accuracy and complexity.
A research team from Google Brain and Google Research introduces SCENIC, an open-source JAX library for fast and extensible computer vision research and beyond. JAX currently supports implementations of state-of-the-art vision models such as ViT, DETR and MLP Mixer, and more open-sourced cutting-edge projects will be added in the near future.
A research team from University of California San Diego and Microsoft proposes Micro-Factorized Convolution (MF-Conv), a novel approach that can deal with extremely low computational costs (4M–21M FLOPs) and achieves significant performance gains over state of the art models in the low FLOP regime.
A research team from Microsoft Research Asia, University of Science and Technology of China, Huazhong University of Science and Technology, and Tsinghua University takes advantage of the inherent spatiotemporal locality of videos to present a pure-transformer backbone architecture for video recognition that leads to a better speed-accuracy trade-off.
A research team from Google Cloud AI, Google Research and Rutgers University simplifies vision transformers’ complex design, proposing nested transformers (NesT) that simply stack basic transformer layers to process non-overlapping image blocks individually. The approach achieves superior ImageNet classification accuracy and improves model training efficiency.
Yann LeCun and a team of researchers propose Barlow Twins, a method that learns self-supervised representations through a joint embedding of distorted images, with an objective function that can make the embedding vectors almost identical while reducing redundancy between their components.