A research team from Google Research, the University of Cambridge and the Alan Turing Institute proposes PolyViT, a single transformer model capable of processing multiple modalities and datasets. PolyViT is parameter-efficient and learns representations that generalize across multiple domains.
In the paper A New Foundation Model for Computer Vision, a Microsoft research team proposes Florence, a novel foundation model for computer vision that significantly outperforms previous large-scale pretraining approaches and achieves new SOTA results across a wide range of visual and visual-linguistic benchmarks.
A research team from Kwai Inc., Kuaishou Technology and ETH Zürich builds PERSIA, an efficient distributed training system that leverages a novel hybrid training algorithm to ensure both training efficiency and accuracy for extremely large deep learning recommender systems of up to 100 trillion parameters.
In the new paper GFlowNet Foundations, a research team from Mila, University of Montreal, McGill University, Stanford University, CIFAR and Microsoft Azure AI builds upon GFlowNets, providing an in-depth formal foundation and expansion of the set of theoretical results for a broad range of scenarios, especially active learning.
DeepMind and Google Brain researchers and former World Chess Champion Vladimir Kramnik explore how human knowledge is acquired and how chess concepts are represented in the AlphaZero neural network via concept probing, behavioural analysis, and an examination of its activations.
A research team from Microsoft, Peking University, Tencent, and Baidu proposes SPANN, a simple but efficient memory-disk hybrid vector indexing and search system that guarantees both low latency and high recall and achieves a 2× speedup over the state-of-the-art approximate nearest neighbour search (ANNS) solution while retaining the same recall quality and memory cost.
An Intel research team presents Prune Once for All (Prune OFA), a training method that leverages weight pruning and model distillation to produce pretrained transformer-based language models with high sparsity ratios. Applied to BERT, the approach achieves state-of-the-art results in compression-to-accuracy ratio.
A research team from ByteDance, Johns Hopkins University, Shanghai Jiao Tong University and UC Santa Cruz seeks to apply the proven technique of masked language modelling to the training of better vision transformers, presenting iBOT (image BERT pretraining with Online Tokenizer), a self-supervised framework that performs masked prediction with an online tokenizer.
In the new paper Gradients Are Not All You Need, a Google Brain and Radboud University research team discusses a “particularly sinister” chaos-based failure mode that appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers.
A DeepMind research team presents the One Pass ImageNet (OPIN) problem, designed to study the space and compute efficiency of deep learning in a streaming setting with constrained data storage and to develop model training systems where each example is passed to the system only once.
A Microsoft Research India team presents Varuna, a system for training massive deep learning models on commodity networking that eliminates the need for specialized hyperclusters and alleviates the cost, scale, and resource utilization challenges of deep learning model training.
In the new paper Can Vision Transformers Perform Convolution?, a research team from Peking University, UCLA and Microsoft Research constructively proves that a single ViT layer with image patches as the input can perform any convolution operation, and shows that ViT performance in low data regimes can be significantly improved using their proposed ViT training pipeline.
A research team from the University of Washington, Facebook AI Research and the Allen Institute for AI introduces Meta-training for In-Context Learning (MetaICL), a new meta-training framework for few-shot learning where an LM is meta-trained to learn in-context, conditioning on training examples to recover the task and make predictions.
A research team from Google Research and UC Berkeley proposes PRIME, an offline data-driven approach that can architect hardware accelerators without any form of simulation. Compared to state-of-the-art simulation-driven methods, PRIME achieves performance improvements of up to 1.54× while reducing the total required simulation time by up to 99 percent.
In the new paper Understanding How Encoder-Decoder Architectures Attend, researchers from the University of Washington, Google Blueshift Team and Google Brain Team propose a method for decomposing hidden states over a sequence into temporal- and input-driven components, revealing how attention matrices are formed in encoder-decoder networks.
In the new paper Shaking the Foundations: Delusions in Sequence Models for Interaction and Control, a DeepMind research team explores the origins of mismatch problems in sequence models that lack understanding of the cause and effect of their actions, and addresses the problem by treating actions as causal interventions.
Researchers from Fudan University, University of Surrey and Huawei Noah’s Ark Lab identify the limitations of quadratic complexity for vision transformers (ViTs) as rooted in keeping the softmax self-attention during approximations. The team proposes the first softmax-free transformer (SOFT), which reduces the self-attention computation to linear complexity, achieving a superior trade-off between accuracy and complexity.
A research team from Google Brain and Google Research introduces SCENIC, an open-source JAX library for fast and extensible computer vision research and beyond. SCENIC currently supports implementations of state-of-the-art vision models such as ViT, DETR and MLP Mixer, and more open-sourced cutting-edge projects will be added in the near future.
Facebook AI Research proposes NormFormer, an approach that improves pretraining perplexity and downstream task performance for both causal and masked language models, achieving GPT3-Large (1.3B) zero-shot performance 60 percent faster and improving fine-tuned GLUE performance by 1.9 percent.
In the new paper Non-deep Networks, a research team from Princeton University and Intel Labs argues it is possible to achieve high performance with “non-deep” neural networks, presenting ParNet (Parallel Networks), a novel 12-layer architecture that achieves performance competitive with its state-of-the-art deep counterparts.
In a paper currently under double-blind review for ICLR 2022, researchers propose StyleNeRF, a 3D-aware generative model that can synthesize high-resolution images at interactive rates while preserving high-quality 3D consistency, and can even generalize to unseen views with control on styles and poses.
A research team from the University of Southern California and Google proposes TOME, a “mention memory” approach to factual knowledge extraction for NLU tasks. A transformer model with attention over a semi-parametric representation of the entire Wikipedia text corpus, TOME can extract information without supervision and achieves strong performance on multiple open-domain question answering benchmarks.
A Google Research team conducts a systematic exploration comprising more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with parameters ranging from 10 million to 10 billion, evaluated on more than 20 downstream image recognition tasks, aiming to capture the nonlinear relationships between performance on upstream and downstream tasks.
An NVIDIA and Aalto University research team presents StyleGAN3, a novel generative adversarial network (GAN) architecture where the exact sub-pixel position of each feature is exclusively inherited from the underlying coarse features, enabling a more natural transformation hierarchy and advancing GAN-based animation generation.
A research team proposes ConvMixer, an extremely simple model designed to support the argument that the impressive performance of vision transformers (ViTs) is mainly attributable to their use of patches as the input representation. The study shows that ConvMixer can outperform ViTs, MLP-Mixers and classical vision models.
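To make "patches as the input representation" concrete, here is a minimal pure-Python sketch of splitting an image into non-overlapping patches. It is an illustration only, not the ConvMixer or ViT implementation: it assumes a single-channel image stored as nested lists, the names `Image` and `split_into_patches` are hypothetical, and the learned projection that real models apply to each patch is omitted.

```python
from typing import List

# Assumption: a single-channel image stored as rows of pixel values.
Image = List[List[float]]


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch_size x patch_size patches.

    Each patch is flattened row by row, and patches are returned in
    row-major order (left to right, then top to bottom).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if patch_size <= 0 or height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Collect the pixels of one patch into a flat list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x4 image whose pixel values are 0..15.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
```

With a 4×4 image and a patch size of 2, the image becomes a sequence of four flattened patches of four values each. A patch-based model would then treat each of these as one input token or spatial location.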