In the new paper TVLT: Textless Vision-Language Transformer, researchers from UNC Chapel Hill present the Textless Vision-Language Transformer (TVLT) for vision-and-language representation learning. TVLT takes only raw visual and audio inputs and performs comparably to its text-based counterparts while requiring only one-third of the parameters and running inference 28x faster.
In the new paper UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, a Google Brain research team proposes UViM, a unified approach that leverages language modelling and discrete representation learning to enable the modelling of a wide range of computer vision tasks without task-specific modifications.
A DeepMind research team argues that symmetries, as described mathematically by group theory, are an important foundation that determines the structure of the universe, constrains the nature of natural tasks, and consequently shapes both biological and artificial intelligence. The study proposes symmetry transformations as a fundamental principle for defining what makes good representations.
In the new paper On the Integration of Self-Attention and Convolution, a research team from Tsinghua University, Huawei Technologies Ltd. and the Beijing Academy of Artificial Intelligence proposes ACmix, a mixed model that leverages the benefits of both self-attention and convolution for computer vision representation tasks while adding minimal computational overhead compared to its pure convolution or self-attention counterparts.
A research team from MIT and MIT-IBM Watson AI Lab proposes Curious Representation Learning (CRL), a framework that learns to understand the surrounding environment by training a reinforcement learning (RL) agent to maximize the error of a representation learner, which gives the agent an incentive to explore the environment.
A research team from Facebook AI conducts a large-scale study on unsupervised spatiotemporal representation learning from videos. The work takes a unified perspective on four recent image-based frameworks (MoCo, SimCLR, BYOL, SwAV) and investigates a simple objective that can easily generalize unsupervised representation learning methodologies to space-time.