
17 Billion Parameters! Microsoft DeepSpeed Breeds World’s Largest NLP Model

Deep learning models keep growing larger to meet the demand for ever-better performance. Meanwhile, the time and money required to train these DL behemoths keep rising with them.

One of the biggest training bottlenecks is GPU memory, which restricts the number of parameters a model can be trained with. Microsoft believes that existing training solutions suffer in terms of computing, communication, and development efficiency for two main reasons:

  • Data parallelism cannot reduce the memory consumption of each device – a model with more than 1 billion parameters will exceed the capacity of a GPU with 32 GB of memory.
  • Model parallelism does not scale efficiently beyond a single node – performance degrades when a model is split across multiple nodes due to fine-grained computation and expensive inter-node communication.
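The first bullet follows from simple arithmetic. A back-of-the-envelope sketch, using the commonly cited estimate of 16 bytes of model state per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance), shows how quickly model states alone approach a 32 GB GPU's capacity, even before activations and buffers are counted:

```python
# Rough per-parameter memory for mixed-precision Adam training:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights (4 B)
# + fp32 momentum (4 B) + fp32 variance (4 B) = 16 bytes per parameter.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

def model_state_gib(num_params: int) -> float:
    """GiB consumed by model states alone, excluding activations."""
    return num_params * BYTES_PER_PARAM / 2**30

for billions in (1, 1.5, 17):
    gib = model_state_gib(int(billions * 10**9))
    print(f"{billions}B params -> {gib:.1f} GiB of model states per GPU")
```

At roughly 15 GiB of model states per billion parameters, the activations and temporary buffers of a >1B-parameter model push a 32 GB device past its limit under plain data parallelism, where every GPU replicates the full state.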

To solve this problem, Microsoft has introduced a new library called DeepSpeed, which can increase the batch size of each node by four times while cutting the required training resources by two-thirds, enabling the training of 100-billion-parameter models.

One very important component of DeepSpeed is ZeRO (short for the Zero Redundancy Optimizer), a novel parallelized optimizer that significantly reduces the resources required for model and data parallelism while increasing the number of trainable parameters.

ZeRO’s main optimization stages correspond to the partitioning of optimizer states, gradients, and parameters, which benefits training in terms of both memory consumption and communication volume.
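The core idea of the first stage can be illustrated with a toy sketch (not DeepSpeed's actual code): instead of every data-parallel rank replicating the full optimizer state, each rank owns only a 1/N shard of it, so per-GPU memory for optimizer states drops linearly with the number of GPUs. The shard-assignment helper and byte counts below are hypothetical, for illustration only:

```python
# Toy illustration of ZeRO-style optimizer-state partitioning:
# each data-parallel rank owns a contiguous 1/N shard of the state.

def shard_bounds(total: int, world_size: int, rank: int):
    """Half-open [start, end) range of elements owned by `rank`."""
    base, extra = divmod(total, world_size)
    start = rank * base + min(rank, extra)
    return start, start + base + (1 if rank < extra else 0)

world_size = 4                  # hypothetical data-parallel GPU count
optimizer_bytes = 12 * 10**9    # ~fp32 master weights + Adam moments for 1B params

replicated = optimizer_bytes              # per-GPU cost without ZeRO
sharded = optimizer_bytes // world_size   # per-GPU cost with partitioning

# Sanity check: the shards tile the parameter range exactly once.
bounds = [shard_bounds(10, world_size, r) for r in range(world_size)]
assert [i for s, e in bounds for i in range(s, e)] == list(range(10))

print(f"per-GPU optimizer state: {replicated / 2**30:.1f} GiB "
      f"-> {sharded / 2**30:.1f} GiB")
```

Gradients and parameters can be partitioned the same way in the later stages, trading a modest amount of extra communication for further memory savings.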

Microsoft says ZeRO can train deep learning models with 100 billion parameters on the current generation of GPU clusters “at three to five times the throughput of the current best system.”

Leveraging DeepSpeed’s large-model training capabilities, Microsoft built the Turing Natural Language Generation model (T-NLG). This is the largest NLP model ever trained, with 17 billion parameters. T-NLG has achieved SOTA performance on mainstream NLP tasks.

T-NLG has far more parameters than any other NLP model

Like Google’s famous massive language model BERT and OpenAI’s GPT-2, T-NLG is based on the popular and powerful Transformer architecture and is able to tackle demanding language generation tasks such as question answering and automatic summarization. Moreover, with the help of DeepSpeed, T-NLG at 17 billion parameters handily outperforms those SOTA models on similarly challenging NLP tasks, leveraging its larger parameter count to achieve more natural, accurate, and fluent text generation.

In terms of accuracy, T-NLG demonstrates a clear performance advantage on standard language tasks, as well as on the abstractive summarization task.

T-NLG compared with the GPT-2 and Megatron-LM models on WikiText-103 (perplexity as the metric, lower is better) and LAMBADA (next word prediction accuracy as the metric, higher is better)
T-NLG compared with the PEGASUS model and previous SOTA models on four common abstractive summarization datasets (ROUGE score as the metric, higher is better)

Because the traditional ROUGE metric cannot accurately judge the fluency and naturalness of answers in the question answering task, Microsoft hired human annotators to evaluate the automatically generated answers.

T-NLG compared with an LSTM model similar to CopyNet on factual and grammatical correctness, as judged by human annotators.

There is much more that T-NLG can do, such as direct question answering and zero-shot question answering, details of which are on Microsoft’s blog. While T-NLG has unfortunately not been open-sourced (yet), the good news is you can find the PyTorch-compatible, open-sourced DeepSpeed tool on GitHub and try it yourself.
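For a taste of what trying it looks like: a DeepSpeed training run is typically driven by a small JSON configuration file. The sketch below shows plausible settings (key names as documented by the DeepSpeed project; the values themselves are illustrative, not a recommendation):

```json
{
  "train_batch_size": 32,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 1 },
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 0.00015 }
  }
}
```

Here `fp16` turns on mixed-precision training and `zero_optimization.stage` selects which of the ZeRO partitioning stages described above to apply.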

Author: Mos Zhang | Editor: Michael Sarazen
