A research team from the University of California San Diego and Microsoft proposes Micro-Factorized Convolution (MF-Conv), a novel approach designed for extremely low computational budgets (4M–21M FLOPs) that achieves significant performance gains over state-of-the-art models in the low-FLOP regime.
A research team from Microsoft Research Asia, University of Science and Technology of China, Huazhong University of Science and Technology, and Tsinghua University exploits the inherent spatiotemporal locality of videos in a pure-transformer backbone architecture for video recognition that achieves a better speed-accuracy trade-off.
A research team from Google Cloud AI, Google Research and Rutgers University simplifies vision transformers’ complex design, proposing nested transformers (NesT) that simply stack basic transformer layers to process non-overlapping image blocks individually. The approach achieves superior ImageNet classification accuracy and improves model training efficiency.
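As a rough illustration of what processing non-overlapping image blocks individually can look like, the sketch below splits a 2D grid into square blocks and applies the same function to each one. It is not the NesT implementation: `blockify`, `process_blocks`, and the `layer` callable are hypothetical names, and `layer` merely stands in for the stacked transformer layers.

```python
def blockify(image, block_size):
    """Split an H x W grid (a list of rows) into non-overlapping square blocks.

    Blocks are returned in row-major order; each block is a list of rows.
    """
    height, width = len(image), len(image[0])
    if height % block_size or width % block_size:
        raise ValueError("image sides must be divisible by block_size")
    blocks = []
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            blocks.append([row[left:left + block_size]
                           for row in image[top:top + block_size]])
    return blocks


def process_blocks(image, block_size, layer):
    """Apply the same callable to every block independently.

    `layer` is a hypothetical placeholder for a stack of transformer layers.
    """
    return [layer(block) for block in blockify(image, block_size)]


# A 4 x 4 toy "image" holding the values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
blocks = blockify(image, 2)
block_sums = process_blocks(image, 2, lambda b: sum(sum(row) for row in b))
```

Because no block overlaps another, each call to `layer` sees only its own pixels, which is the property the blurb above describes.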
Yann LeCun and a team of researchers propose Barlow Twins, a method that learns self-supervised representations through a joint embedding of distorted versions of the same image, with an objective function that makes the two embedding vectors nearly identical while reducing redundancy between their components.
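The Barlow Twins objective can be sketched in a few lines. The version below is a minimal, dependency-free illustration rather than the authors' code: it standardizes each embedding dimension over the batch, builds the cross-correlation matrix between the two views, then penalizes diagonal entries that differ from 1 (invariance) and off-diagonal entries that differ from 0 (redundancy). The function names are hypothetical, and the default weight `lam` is an assumption chosen for illustration.

```python
import math


def standardize_columns(z):
    """Return a copy of z (batch x dim) with each column at zero mean and unit std."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        std = std if std > 0 else 1.0  # guard against constant columns
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def cross_correlation(z_a, z_b):
    """dim x dim cross-correlation matrix between two batches of embeddings."""
    a, b = standardize_columns(z_a), standardize_columns(z_b)
    n, d = len(a), len(a[0])
    return [[sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
            for i in range(d)]


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Invariance term plus lam times the redundancy term (lam is an assumed default)."""
    c = cross_correlation(z_a, z_b)
    d = len(c)
    invariance = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    redundancy = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return invariance + lam * redundancy


# Two views with identical, decorrelated embeddings: both terms vanish.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_ideal = barlow_twins_loss(decorrelated, decorrelated)

# Identical views whose two dimensions carry the same signal: only redundancy is penalized.
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
loss_redundant = barlow_twins_loss(redundant, redundant)
```

In the first toy batch the cross-correlation matrix equals the identity, so the loss is zero. In the second, every entry of the matrix is 1, so the invariance term is zero and the loss comes entirely from the two off-diagonal entries.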