In the new paper EcoFormer: Energy-Saving Attention with Linear Complexity, a Monash University research team presents EcoFormer, an attention mechanism with linear complexity that replaces expensive multiply-accumulate operations with simple accumulations and achieves a 73 percent energy footprint reduction on ImageNet.
In the new paper Knowledge Neurons in Pretrained Transformers, a research team from Peking University and Microsoft Research introduces a knowledge attribution method that identifies the neurons that store factual knowledge in pretrained transformers and leverages these neurons to edit factual knowledge in transformers without any fine-tuning.
In the new paper Efficient Training of Language Models to Fill in the Middle, an OpenAI research team shows that causal decoder-based autoregressive (AR) language models can learn to infill texts via a very simple and straightforward transformation to the training data and without any architectural modifications.
In the new paper Is Attention All NeRF Needs?, a research team from the Indian Institute of Technology Madras and the University of Texas at Austin proposes Generalizable NeRF Transformer (GNT), a pure and universal transformer-based architecture for efficient on-the-fly reconstruction of NeRFs. The work demonstrates that a pure attention mechanism suffices for learning a physically-grounded rendering process.
In the new paper VCT: A Video Compression Transformer, a Google Research team presents an elegantly simple but powerful video compression transformer (VCT) that does not require architectural biases and priors and learns totally from data without any hand-crafting. VCT is easy to implement and outperforms conventional video compression approaches.
In the new paper Extreme Compression for Pre-trained Transformers Made Simple and Efficient, a Microsoft research team introduces XTC, a simple yet effective extreme compression pipeline for pretrained transformers that can achieve state-of-the-art results while reducing model size by 50x.
In the new paper A Probabilistic Interpretation of Transformers, ML Collective researcher Alexander Shim provides a probabilistic explanation of transformers’ exponential dot product attention and contrastive learning based on distributions of the exponential family.
In the new technical report OPT: Open Pre-trained Transformer Language Models, Meta AI open-sources OPT, a suite of decoder-only pretrained transformers ranging from 125M to 175B parameters. The release will enable more researchers to work with large-scale language models to drive the field forward.
In the new paper CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, Tsinghua University and the Beijing Academy of Artificial Intelligence researchers pretrain a Cross-Modal general Language Model (CogLM) for text and image token prediction and finetune it for fast super-resolution. The resulting CogView2 hierarchical text-to-image system achieves significant speedups while generating images with better quality at comparable resolutions.
In the new paper Token Dropping for Efficient BERT Pretraining, a research team from Google, New York University, and the University of Maryland proposes a simple but effective “token dropping” technique that significantly reduces the pretraining cost of transformer models such as BERT without hurting performance on downstream fine-tuning tasks.
A Google research team addresses conventional transformers’ resource-heavy training and fine-tuning requirements for learning new knowledge, proposing Memorizing Transformers as a step toward language models that can simply read and memorize new data at inference time for immediate knowledge acquisition.
A team from Google Research and the Swiss AI Lab IDSIA proposes the Block-Recurrent Transformer, a novel long-sequence processing approach that has the same computation time and parameter count costs as a conventional transformer layer but achieves significant perplexity improvements in language modelling tasks over very long sequences.
Researchers from Meta AI and the State University of New York at Buffalo propose sparsely-activated all-MLP architectures (sMLPs) that achieve training efficiency improvements of up to 2x compared to transformer-based mixture-of-experts (MoE) architectures, transformers, and gMLP.
An Idiap Research Institute team proposes a novel multi-layer perceptron (MLP) model, HyperMixer, as a Green AI alternative to transformers. HyperMixer achieves comparable performance with substantially lower costs in terms of processing time, training data and hyperparameter tuning.
A team from Facebook AI Research, UC Berkeley and UCLA proposes Online Decision Transformers (ODT), an RL algorithm based on sequence modelling that incorporates offline pretraining and online finetuning in a unified framework and achieves performance competitive with the state-of-the-art models on the D4RL benchmark.
A Google Research team proposes Masked Generative Image Transformer (MaskGIT), a novel image synthesis paradigm that uses a bidirectional transformer decoder. MaskGIT significantly outperforms state-of-the-art transformer models on the ImageNet dataset and accelerates autoregressive decoding by up to 64x.
In the new paper Self-attention Does Not Need O(n2) Memory, a Google Research team presents novel and simple algorithms for attention and self-attention that require only constant memory and logarithmic memory and reduce the self-attention memory overhead by 59x for inference and by 32x for differentiation at a sequence length of 16384.
A DeepMind research team proposes RETRO (Retrieval-Enhanced Transformer), an enhanced auto-regressive language model that conditions on document chunks retrieved from a large corpus and achieves performance comparable to GPT-3 and Jurassic-1 on the Pile dataset while using 25× fewer parameters.
In the new paper Sparse is Enough in Scaling Transformers, a research team from the University of Warsaw, Google Research and OpenAI proposes Scaling Transformers, a family of novel transformers that leverage sparse layers to scale efficiently and perform unbatched decoding much faster than original transformers, enabling fast inference on long sequences even with limited memory.
Researchers from Fudan University, University of Surrey and Huawei Noah’s Ark Lab identify the limitations of quadratic complexity for vision transformers (ViTs) as rooted in keeping the softmax self-attention during approximations. The team proposes the first softmax-free transformer (SOFT), which reduces the self-attention computation to linear complexity, achieving a superior trade-off between accuracy and complexity.
Facebook AI Research proposes NormFormer, an approach that improves pretraining perplexity and downstream task performance for both causal and masked language models, achieving GPT3-Large (1.3B) zero-shot performance 60 percent faster and improving fine-tuned GLUE performance by 1.9 percent.
A research team from the University of Southern California and Google proposes TOME, a “mention memory” approach to factual knowledge extraction for NLU tasks. A transformer model with attention over a semi-parametric representation of the entire Wikipedia text corpus, TOME can extract information without supervision and achieves strong performance on multiple open-domain question answering benchmarks.
In the paper Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers, a research team from New York University and the University of North Carolina at Chapel Hill uses centered kernel alignment (CKA) to measure the similarity of representations across layers and explore how fine-tuning changes transformers’ learned representations.
A Google Research team explores the design space of Transformer models in an effort to enable deep learning architectures to solve compositional tasks. The proposed approach provides models with inductive biases via design decisions that significantly impact compositional generalization, and achieves state-of-the-art results on the COGS and PCFG composition benchmarks.
A research team from Microsoft Research Asia, University of Science and Technology of China, Huazhong University of Science and Technology, and Tsinghua University takes advantage of the inherent spatiotemporal locality of videos to present a pure-transformer backbone architecture for video recognition that leads to a better speed-accuracy trade-off.
A research team from UC Berkeley, Facebook AI Research and Google Brain abstracts Reinforcement Learning (RL) as a sequence modelling problem. Their proposed Decision Transformer simply outputs optimal actions by leveraging a causally masked transformer, yet matches or exceeds state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
A research team from UC Davis, Microsoft Research and Johns Hopkins University extends work on training massive amounts of linguistic data to reveal the grammatical structures in their representations to the domain of mathematical reasoning, showing that both the standard transformer and the TP-Transformer can compose the meanings of mathematical symbols based on their structured relationships.