One of the highlights of OpenAI’s GPT-4 large language model (LLM) is its expanded context window of 32,000 tokens (about 25,000 words), which enables longer input sequences and conversations than ChatGPT’s 4,000-token limit. Expanding the context window of transformer-based LLMs in this way is beneficial, but it is also computationally costly: attention cost grows quadratically with input length, and feedforward and projection layers are applied to every token.
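To make the scaling concrete, here is a rough back-of-the-envelope sketch. It counts only query-key pairs for full self-attention and per-token layer applications, ignoring heads, hidden sizes and any sparse-attention tricks; the function names are illustrative and not taken from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: grows as n * n."""
    return num_tokens * num_tokens


def per_token_applications(num_tokens: int) -> int:
    """Feedforward/projection applications: one per token, grows as n."""
    return num_tokens


short_context = 4_000   # ChatGPT limit cited above
long_context = 32_000   # GPT-4 limit cited above

# An 8x longer input means 8x more feedforward work but 64x more attention pairs.
ff_growth = per_token_applications(long_context) // per_token_applications(short_context)
attn_growth = attention_pairs(long_context) // attention_pairs(short_context)
print(ff_growth, attn_growth)  # prints: 8 64
```

Under this simplified count, attention dominates as inputs grow, but the per-token layers still scale with every added token, which is the cost conditional computation targets.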
A Google Research team addresses this issue in the new paper CoLT5: Faster Long-Range Transformers with Conditional Computation, proposing CoLT5 (Conditional LongT5), a family of transformer models that apply a novel conditional computation approach for higher quality and faster long-input processing of up to 64,000 tokens.
CoLT5 is built on Google’s LongT5 (Guo et al., 2022), which simultaneously scales input length and model size to improve long-input processing in transformers. It is inspired by the idea that quality can be improved and computation cost reduced by allocating more computation to important tokens, an approach the team calls “conditional computation.”
The conditional computation mechanism comprises three main components: 1) routing modules, which select important tokens at each attention or feedforward layer; 2) a conditional feedforward layer, which applies an additional high-capacity feedforward layer to the important tokens selected by routing; and 3) a conditional attention layer, which enables CoLT5 to differentiate between tokens that require additional information and those that already possess such information.
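The routing and conditional feedforward components can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: scoring tokens by a dot product with a `router` vector, the residual-style update, and the toy `light_ff` and `heavy_ff` functions are all hypothetical stand-ins for learned layers.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def route_top_k(tokens: Sequence[Vector], router: Vector, k: int) -> List[int]:
    """Return the indices of the k tokens with the highest routing score.

    Assumption: a token's score is its dot product with a router vector.
    """
    scores = [sum(x * w for x, w in zip(tok, router)) for tok in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def conditional_feedforward(
    tokens: Sequence[Vector],
    router: Vector,
    k: int,
    light_ff: Callable[[Vector], Vector],
    heavy_ff: Callable[[Vector], Vector],
) -> List[Vector]:
    """Apply light_ff to every token and heavy_ff only to the routed tokens."""
    selected = set(route_top_k(tokens, router, k))
    outputs = []
    for i, tok in enumerate(tokens):
        update = light_ff(tok)  # cheap branch: runs for all tokens
        if i in selected:       # expensive branch: runs for k tokens only
            update = [a + b for a, b in zip(update, heavy_ff(tok))]
        outputs.append([t + u for t, u in zip(tok, update)])
    return outputs


# Toy stand-ins for the learned layers (hypothetical, for illustration only).
def light_ff(tok: Vector) -> Vector:
    return [0.5 * x for x in tok]


def heavy_ff(tok: Vector) -> Vector:
    return [2.0 * x for x in tok]


tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
router = [1.0, 1.0]
output = conditional_feedforward(tokens, router, k=2, light_ff=light_ff, heavy_ff=heavy_ff)
print(route_top_k(tokens, router, 2))  # prints: [1, 2]
```

In this sketch the heavy branch runs for only two of the four tokens, so its cost depends on `k` rather than on the input length, which is the intuition behind spending more computation on important tokens.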
CoLT5 applies two additional modifications to the LongT5 architecture: multi-query cross-attention, which substantially speeds up inference, and a UL2 (Tay et al., 2022) pretraining objective, which combines different denoising objectives to improve in-context learning over long inputs.
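As a rough illustration of why multi-query attention helps at inference time: in multi-query attention all query heads share a single key/value head, so far fewer keys and values have to be stored and read at each decoding step. The sketch below simply counts those elements for one cross-attention layer; the head count and head dimension are hypothetical values chosen for the example, not CoLT5's configuration.

```python
def kv_elements(input_len: int, num_kv_heads: int, head_dim: int) -> int:
    """Elements in the keys and values of one cross-attention layer.

    Two tensors (keys and values), each of shape
    (num_kv_heads, input_len, head_dim).
    """
    return 2 * num_kv_heads * input_len * head_dim


input_len = 64_000  # longest input length mentioned in the article
num_heads = 16      # hypothetical number of attention heads
head_dim = 64       # hypothetical per-head dimension

multi_head = kv_elements(input_len, num_kv_heads=num_heads, head_dim=head_dim)
multi_query = kv_elements(input_len, num_kv_heads=1, head_dim=head_dim)
print(multi_head, multi_query, multi_head // multi_query)
# prints: 131072000 8192000 16
```

With these assumed sizes, the decoder reads 16 times fewer key/value elements per layer, and the saving grows with the number of heads that share the single key/value head.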
In their empirical study, the team compared CoLT5 with LongT5 on TriviaQA, arXiv summarization, and the SCROLLS benchmark tasks. CoLT5 processed inputs of up to 64k tokens and achieved higher quality at faster speed than LongT5 on long-input datasets.
The paper CoLT5: Faster Long-Range Transformers with Conditional Computation is on arXiv.
Author: Hecate He | Editor: Michael Sarazen