In March, Google introduced Pathways, a large-scale orchestration layer for accelerators that employs a novel asynchronous distributed dataflow design to enable highly efficient training across multiple TPU Pods. Now, scarcely two weeks later, a Google Research team has put Pathways to the test, using it to train the Pathways Language Model (PaLM), a 540-billion-parameter, densely activated, autoregressive transformer, on 780 billion tokens of high-quality text. PaLM achieves state-of-the-art few-shot performance on language understanding and generation tasks, by significant margins in many cases.
Today’s extreme-scale language models have achieved substantial improvements under few-shot settings across a wide range of language understanding and generation tasks. The new paper PaLM: Scaling Language Modeling with Pathways aims to advance understanding of the impact of scale on few-shot learning.
The team summarizes the key takeaways in their work as follows:
- Efficient scaling – We demonstrate the first large-scale use of Pathways (Barham et al., 2022) – a new ML system that enables training a single model across thousands or tens of thousands of accelerator chips in a highly efficient manner.
- Continued improvements from scaling – We evaluate PaLM across hundreds of natural language, code, and mathematical reasoning tasks, and achieve state-of-the-art results on the vast majority of these benchmarks, typically by significant margins.
- Breakthrough capabilities – We demonstrate breakthrough capabilities in language understanding and generation across a number of difficult tasks.
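- Discontinuous improvements – To better understand the scaling behavior, we present results at three different parameter scales: 8B, 62B, and 540B. Typically, scaling from 62B to 540B results in similar performance as scaling from 8B to 62B, which is consistent with the “power law” rule of thumb often observed in neural network scaling. However, for certain tasks, we observe discontinuous improvements, where scaling from 62B to 540B results in a drastic jump in accuracy compared to scaling from 8B to 62B.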
- Multilingual understanding – In this work, we conduct a more thorough evaluation of multilingual benchmarks including machine translation, summarization, and question answering in a wide variety of languages.
- Bias and toxicity – We also evaluated model performance for distributional bias and toxicity, which resulted in several insights.
PaLM is based on the standard transformer architecture in a decoder-only setup, with modifications in the following areas: SwiGLU Activation, Parallel Layers, Multi-Query Attention, RoPE Embeddings, Shared Input-Output Embeddings, No Biases, and Vocabulary.
The team’s approach uses SwiGLU activations for the multilayer perceptron (MLP) intermediate activations, producing significant quality improvements compared to standard ReLU, GeLU, or Swish activations. A “parallel” formulation in each transformer block, used in place of the standard “serialized” formulation, results in roughly 15 percent faster training at large scale. Multi-query attention realizes cost savings at autoregressive decoding time, and the use of RoPE embeddings rather than absolute or relative position embeddings enables better performance on long sequence lengths. The method also shares the input and output embedding matrices, and uses no biases in the dense kernels or layer norms, which increases training stability for large models. Finally, the team opts for a SentencePiece vocabulary with 256k tokens to support the large number of languages in the training corpus without excess tokenization.
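To make two of these modifications concrete, here is a minimal pure-Python sketch of a SwiGLU gate and of the serialized versus parallel block wiring. It is an illustration rather than the team's implementation: the helper names (`swish`, `matvec`, `swiglu`, `serialized_block`, `parallel_block`) are hypothetical, the MLP output projection is omitted, the serialized variant is written in the common pre-norm form, and `attn`, `mlp`, and `norm` are stand-in callables.

```python
import math

def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def swiglu(x, w_gate, w_value):
    """SwiGLU gate: Swish(x @ w_gate) times (x @ w_value), elementwise."""
    gate = matvec(x, w_gate)
    value = matvec(x, w_value)
    return [swish(g) * v for g, v in zip(gate, value)]

def add(a, b):
    """Elementwise sum of two equal-length vectors (the residual connection)."""
    return [p + q for p, q in zip(a, b)]

def serialized_block(x, attn, mlp, norm):
    """Common pre-norm wiring: the MLP consumes the attention output."""
    h = add(x, attn(norm(x)))
    return add(h, mlp(norm(h)))

def parallel_block(x, attn, mlp, norm):
    """Parallel wiring: attention and MLP both read the same normalized input."""
    n = norm(x)
    return add(add(x, mlp(n)), attn(n))
```

In the parallel form the two branches share one normalized input, so neither has to wait for the other's output, which is plausibly where the reported speedup comes from.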
The Pathways system executes two-way pod-level data parallelism: a single Python client constructs a sharded dataflow program that launches JAX/XLA work on remote servers, each comprising a TPU pod. The paper illustrates how a component A is used for within-pod forward and backward computation, a transfer subgraph for cross-pod gradient transfer, and a component B for optimizer updates.
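The arithmetic behind two-way data parallelism can be sketched on a toy problem. The example below is an assumption-laden illustration, not Pathways code: it fits a single weight `w` to `y ≈ w * x`, treats two halves of a batch as the work of two "pods", averages their gradients to mimic the cross-pod transfer, and applies one plain SGD update. The function names are hypothetical.

```python
def grad_mse(w, shard):
    """Gradient of mean squared error for y ≈ w * x over one shard of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def two_way_data_parallel_step(w, batch, lr):
    """One SGD step with the batch split evenly across two simulated pods."""
    half = len(batch) // 2
    shard_a, shard_b = batch[:half], batch[half:]
    grad_a = grad_mse(w, shard_a)   # forward and backward pass on "pod A"
    grad_b = grad_mse(w, shard_b)   # forward and backward pass on "pod B"
    grad = (grad_a + grad_b) / 2.0  # stands in for the cross-pod gradient transfer
    return w - lr * grad            # the same update would run on both pods

def single_device_step(w, batch, lr):
    """Reference: the same SGD step computed on the whole batch at once."""
    return w - lr * grad_mse(w, batch)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so both pods end the step with the same parameters as a single device would.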
In their empirical study, the team compared PaLM models of 8B, 62B, and 540B parameters trained on a high-quality corpus of 780 billion tokens comprising filtered webpages, books, Wikipedia, news articles, source code, and social media conversations representing a wide variety of language processing use cases. The models were evaluated on 29 English benchmarks, including TriviaQA, LAMBADA, RACE, and SuperGLUE.
In the experiments, PaLM achieved a training efficiency of 57.8 percent hardware FLOPs utilization, the highest yet reported for language models at this scale. PaLM also obtained state-of-the-art performance on a wide range of language understanding and generation tasks and demonstrated strong capabilities on multilingual tasks and source code generation.
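A FLOPs utilization figure is a ratio of achieved to peak throughput. The sketch below shows that ratio under a loud assumption: it estimates training cost with the common rule of thumb of about 6 FLOPs per parameter per token, which ignores attention and any recomputation, so it is not the paper's accounting. The function name and all argument values are hypothetical.

```python
def flops_utilization(n_params, tokens_per_second, n_chips, peak_flops_per_chip):
    """Approximate utilization: achieved training FLOPs/s divided by peak FLOPs/s.

    Assumes roughly 6 FLOPs per parameter per token (forward plus backward),
    a rule of thumb rather than the accounting used in the PaLM paper.
    """
    achieved = 6.0 * n_params * tokens_per_second
    peak = n_chips * peak_flops_per_chip
    return achieved / peak

# Hypothetical inputs, chosen only to keep the arithmetic easy to follow.
ratio = flops_utilization(
    n_params=1e9,
    tokens_per_second=1e4,
    n_chips=1,
    peak_flops_per_chip=1e14,
)
print(f"utilization: {ratio:.1%}")
```

A hardware-level figure like the one reported also counts any operations the hardware re-executes, so it can be higher than an estimate based only on the model's required FLOPs.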
The team hopes PaLM can serve as a strong foundation for building large-scale, modularized systems that will have broad generalization capabilities across multiple modalities.
The paper PaLM: Scaling Language Modeling with Pathways is on arXiv.
Author: Hecate He | Editor: Michael Sarazen