In the ongoing quest for bigger and better, Google Brain researchers have scaled up their newly proposed Switch Transformer language model to a whopping 1.6 trillion parameters while keeping computational costs under control. The team simplified the Mixture of Experts (MoE) routing algorithm to efficiently combine data, model and expert-parallelism and enable this “outrageous number of parameters” while also achieving a four-times pretraining speedup over a strongly tuned T5-XXL baseline (Google’s previously largest language model).
The mammoth language models are introduced in the paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
Although many recent, simpler deep learning architectures have outperformed more complicated algorithms, these performance gains have come with enormous computational budgets, huge datasets and large parameter counts. The team notes that standard deep learning models reuse the same parameters for every input, while Mixture of Experts (MoE) models instead select different parameters for each incoming example. The researchers therefore focused on large-scale language model training that activates only a subset of the neural network weights (parameters) per example, with the sparsity coming from a newly proposed technique that simplifies the MoE paradigm.
In the context of deep learning architectures, the MoE routing algorithm allows models to combine the outputs of several expert networks, where each expert network specializes in a different part of the input space. A learned gating network essentially mixes the expert networks’ outputs to produce a final output. “This (MoE) resulted in state-of-the-art results in language modelling and machine translation benchmarks,” the researchers explain. One of the study’s key contributions is the simplified MoE paradigm’s reduced communication and computational costs. Unlike previous MoE strategies, which route each input to more than one expert network to enable non-trivial gradients on the routing functions, the proposed models route each input to only a single expert.
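The single-expert ("switch") routing described above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's implementation: the dimensions, weight matrices and function names are hypothetical, and the experts are plain linear layers. The key point it shows is that the router picks exactly one expert per token, and the chosen expert's output is scaled by the gate probability so the router weights still receive a gradient during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical toy sizes (not from the paper): 4 experts, model width 8.
num_experts, d_model = 4, 8
W_gate = rng.normal(size=(d_model, num_experts))            # learned router
experts = [rng.normal(size=(d_model, d_model))              # toy "expert" FFNs
           for _ in range(num_experts)]

def switch_route(x):
    """Route one token x to exactly one expert (top-1 'switch' routing)."""
    probs = softmax(x @ W_gate)          # gate distribution over experts
    best = int(np.argmax(probs))         # pick a single expert
    # Scaling by probs[best] keeps the routing function differentiable,
    # which is why a single expert still yields non-trivial gradients.
    return probs[best] * (x @ experts[best]), best

token = rng.normal(size=d_model)
output, chosen = switch_route(token)
```

Because only one expert's weights are touched per token, the computation per example stays roughly that of a dense model of the same width, regardless of how many experts exist.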
The team says the proposed simplified technique ensures the model weights increase with the number of devices while maintaining a manageable memory and computational footprint on each device. Switch Transformer pretrained on the Colossal Clean Crawled Corpus (C4) using 32 TPU cores consumes less compute while outperforming both carefully tuned dense models and MoE models.
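The claim that weights grow with the number of devices while per-device compute stays manageable comes down to simple accounting: with top-1 routing, each added expert multiplies the feed-forward parameter count, but every token still passes through exactly one expert. A rough back-of-the-envelope sketch (the layer sizes below are illustrative, not the paper's configuration):

```python
# Illustrative sizes, not the actual Switch Transformer configuration.
d_model, d_ff = 1024, 4096
ffn_params = 2 * d_model * d_ff          # weights of one dense FFN block

results = []
for num_experts in (1, 8, 64):
    total_params = num_experts * ffn_params   # grows with experts/devices
    per_token_mults = 2 * d_model * d_ff      # constant: one expert per token
    results.append((num_experts, total_params, per_token_mults))
    print(f"{num_experts:>3} experts: {total_params:>12,} params, "
          f"{per_token_mults:,} mults per token")
```

Scaling from 1 to 64 experts multiplies the parameter count 64x while leaving the per-token arithmetic unchanged, which is the mechanism behind growing to trillions of parameters at a fixed computational budget.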
In experiments, Switch Transformer improved on the multilingual T5-based (mT5) model on 101 different languages in the multilingual variant of the Common Crawl dataset (mC4). Switch Transformer also achieved a mean pretraining speedup over the mT5 baseline, with 91 percent of the 101 languages seeing four-times speedups. Moreover, the team demonstrated the possibility of pushing the current scale of language models by pretraining Switch Transformer with 1.6 trillion parameters in one-quarter the time required for the T5-XXL model.
The paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity is on arXiv.
Reporter: Fangyu Cai | Editor: Michael Sarazen
This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we also introduced a database covering an additional 1,428 artificial intelligence solutions across 12 pandemic scenarios.
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.