Deep learning models keep growing larger to meet the demand for ever-better performance. Meanwhile, the time and money required to train these DL behemoths keep rising as well.
One of the biggest training bottlenecks is GPU memory, which restricts the number of parameters a model can be trained with. Microsoft believes that existing training solutions suffer in terms of computing, communication, and development efficiency for two main reasons:
- Data parallelism cannot reduce the memory consumption of each device – a model with more than 1 billion parameters will exceed the capacity of a GPU with 32 GB of memory.
- Model parallelism does not scale efficiently when extended to multiple nodes – performance drops across node boundaries due to fine-grained computation and expensive communication.
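The memory ceiling in the first point can be sketched with simple arithmetic. Under standard mixed-precision Adam training, each parameter typically consumes roughly 16 bytes of GPU memory across its fp16 copy, fp16 gradient, and fp32 optimizer states. The figures below are illustrative back-of-the-envelope numbers, not measurements of any specific system:

```python
# Rough per-parameter footprint for mixed-precision Adam training:
#   fp16 parameters (2 B) + fp16 gradients (2 B)
#   + fp32 master weights (4 B) + fp32 momentum (4 B) + fp32 variance (4 B)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16 bytes

def model_state_gb(num_params: float) -> float:
    """GB consumed by model states alone (ignores activations and buffers)."""
    return num_params * BYTES_PER_PARAM / 1e9

# A 1.5-billion-parameter model (GPT-2 scale) already needs ~24 GB
# just for model states, leaving little of a 32 GB GPU for activations.
print(model_state_gb(1.5e9))  # → 24.0
```

With activations and memory fragmentation on top of this, models past the 1-billion-parameter mark quickly exhaust a single 32 GB device.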
To solve this problem, Microsoft has introduced a new library called DeepSpeed, which can increase the batch size of each node by four times while reducing the resources required for training by two-thirds, enabling the training of 100-billion-parameter models.
One very important component of DeepSpeed is ZeRO (short for the Zero Redundancy Optimizer), a novel parallelized optimizer that significantly reduces the resources required for model and data parallelism while increasing the number of trainable parameters.
Microsoft says ZeRO can train deep learning models with 100 billion parameters on the current generation of GPU clusters “at three to five times the throughput of the current best system.”
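ZeRO's core idea is that optimizer states, gradients, and eventually the parameters themselves need not be replicated on every data-parallel worker; partitioning them makes per-device memory shrink with the number of GPUs. The sketch below illustrates the first stage of this idea (sharding only the optimizer states) using the same 16-bytes-per-parameter accounting as mixed-precision Adam. The exact savings in DeepSpeed depend on the ZeRO stage and configuration; these numbers are illustrative only:

```python
def per_device_gb(num_params: float, world_size: int,
                  partition_optimizer: bool = True) -> float:
    """Approximate per-GPU memory (GB) for model states under data parallelism.

    fp16 params (2 B/param) and fp16 grads (2 B/param) stay replicated;
    the 12 B/param of fp32 optimizer states (master weights, momentum,
    variance) are either replicated on every worker (plain data parallelism)
    or sharded across workers (ZeRO-style partitioning).
    """
    opt_bytes = 12 * num_params
    if partition_optimizer:
        opt_bytes /= world_size
    return (2 * num_params + 2 * num_params + opt_bytes) / 1e9

# 7.5 billion parameters on 64 GPUs:
print(per_device_gb(7.5e9, 64, partition_optimizer=False))  # 120.0 GB: far beyond any single GPU
print(per_device_gb(7.5e9, 64, partition_optimizer=True))   # ~31.4 GB: fits a 32 GB GPU
```

Because only the replicated fp16 tensors remain per device, adding more data-parallel workers keeps pushing the per-GPU footprint down, which is what lets ZeRO scale to models with tens of billions of parameters.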
Leveraging DeepSpeed’s large-model training capabilities, Microsoft built the 17-billion-parameter Turing Natural Language Generation model (T-NLG), the largest NLP model ever trained. T-NLG has achieved SOTA performance on mainstream NLP tasks.
Like Google’s famous massive language model BERT and OpenAI’s GPT-2, T-NLG is based on the popular and powerful Transformer architecture and is able to tackle demanding language generation tasks such as question answering and automatic summarization. Moreover, with the help of DeepSpeed, the 17-billion-parameter T-NLG handily outperforms those SOTA models on similarly challenging NLP tasks, leveraging its larger parameter count to produce more natural, accurate, and fluent text.
In terms of accuracy, T-NLG demonstrates a clear performance advantage on standard language tasks as well as on the abstractive summarization task.
Because the traditional ROUGE metric cannot accurately judge the fluency and naturalness of answers in the question answering task, Microsoft hired human annotators to evaluate the automatically generated answers.
There is much more that T-NLG can do, such as direct question answering and zero-shot question answering, details of which are on Microsoft’s blog. While T-NLG has unfortunately not been open-sourced (yet), the good news is you can find the open-source, PyTorch-compatible DeepSpeed library on GitHub and try it yourself.
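For a sense of what trying DeepSpeed looks like, the snippet below sketches its typical entry point: a JSON-style configuration plus a call to `deepspeed.initialize`, which wraps an ordinary PyTorch model in a DeepSpeed engine. The configuration values here are illustrative placeholders, not a recipe from Microsoft; consult the DeepSpeed repository for the current set of options.

```python
# Illustrative DeepSpeed configuration (all values are placeholders):
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {"stage": 1},  # partition optimizer states (ZeRO)
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

def wrap_with_deepspeed(model):
    """Hand an ordinary PyTorch model to the DeepSpeed engine (sketch)."""
    import deepspeed  # pip install deepspeed
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    # engine.backward(loss) and engine.step() then replace the usual
    # loss.backward() / optimizer.step() calls in the training loop.
    return engine, optimizer
```

Training scripts using this pattern are usually launched with DeepSpeed's own launcher (e.g. `deepspeed train.py`) rather than plain `python`.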
Author: Mos Zhang | Editor: Michael Sarazen