Google AI Chief Jeff Dean’s ML System Architecture Blueprint

ML has revolutionized vision, speech and language understanding and is being applied in many other fields. That’s an extraordinary achievement in the technology’s short history, and even more impressive considering that most of this progress came on hardware not designed specifically for ML.

Back in January, Google AI Chief and former head of Google Brain Jeff Dean co-published the paper A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution with Turing Award winner and computer architect David Patterson. The paper encouraged Machine Learning (ML) experts and computer architects to “work together to design the computing systems required to deliver on the potential of ML.”

At this month’s Tsinghua-Google AI Symposium in Beijing, Dean discussed trends regarding the kinds of models scientists want to train. Google Brain research scientist Azalia Mirhoseini meanwhile gave a presentation on autoML with Reinforcement Learning at the same event.

The Golden Age paper and recent talks by Google researchers provide a picture of how Google Brain is thinking through hardware and software challenges to improve ML system performance and productivity.

Dean has often pointed out that ML’s growth, as reflected in the number of related arXiv papers, has already outpaced Moore’s Law, the 1975 prediction that the number of transistors on a chip would double roughly every two years.

ML arXiv papers per year for the cs.Computer Vision, cs.Computation and Language, cs.Machine Learning, cs.Artificial Intelligence, cs.Neural and Evolutionary Computing, and stat.Machine Learning topics. *(Source: A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution. DOI: 10.1109/MM.2018.112130030)*

Dean and Patterson dissect hardware design in their Golden Age paper, using the example of the Google-developed TPUv1 and TPUv2 Tensor Processing Units (TPU), which are advanced application-specific integrated circuits (ASIC). The duo advises engineers to look forward at least five years for hardware development, as an appropriate design must remain relevant through at least a two-year design and three-year deployment window to maintain its competitive edge, assuming standard depreciation projections.

Dean identifies six issues that impact ML hardware design within this five-year window, from purely architectural to mostly ML-driven concerns, including:

Training
Batch Size
Sparsity and Embeddings
Quantization and Distillation
Networks with Soft Memory
Learning to Learn (L2L)

Training

Two of the most important phases in the ML workflow are the production phase, called inference or prediction; and the development phase, called training or learning.

Back in 2015, like many other companies, Google developed application-specific integrated circuit (ASIC) hardware. Its TPUv1 was designed for ML inference rather than training, mainly because: 1) inference takes about one-third as many arithmetic operations as training; 2) during training, activation values calculated in the forward pass must be saved for back-propagation, so training needs much more memory than inference; 3) training cannot scale up as easily as inference, because it requires numerous expensive follow-up steps.

This does not mean hardware design for training is unnecessary. On the contrary, training ASICs can save researchers’ valuable time: if training an ML model requires 30 days of computation time, that will deter most scientists from running the experiment.

Consequently, in 2017 Google developed and deployed its second-generation ASIC TPUv2 in data centers for ML training. Sixty-four TPUv2 devices were assembled into a pod, achieving the computing performance of 11.5 peta floating point operations per second (PFLOPS) with 4 terabytes of High Bandwidth Memory (HBM).

Jeff Dean shared some TPUv2 success stories in his presentation, such as increasing the Google Search Ranking model training speed by 14.2 times and speeding up image model training by 9.8 times, both using only one-quarter of the pod (16 TPUv2 devices). Moreover, the high-performance TPUv2 can also help with the scale-up challenge in ML training: it takes 1,402 minutes to train ResNet-50 (a 50-layer residual network for image recognition) to over 76% accuracy with one TPUv2 device, and only 45 minutes (31.2 times faster) using half the pod (32 TPUv2 devices).
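The ResNet-50 figures can be turned into a scaling efficiency with a few lines of arithmetic. The sketch below is only an illustration built from the two numbers quoted above; the function name is hypothetical.

```python
def scaling_efficiency(base_minutes, scaled_minutes, device_ratio):
    """Return (speedup, efficiency) when a job is spread over more devices."""
    speedup = base_minutes / scaled_minutes
    # Efficiency of 1.0 would mean perfectly linear scaling.
    return speedup, speedup / device_ratio

# Figures quoted above: 1,402 minutes on 1 device, 45 minutes on 32 devices.
speedup, efficiency = scaling_efficiency(1402, 45, 32)
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.1%}")
# prints "speedup: 31.2x, efficiency: 97.4%"
```

In other words, going from one device to 32 loses less than 3 percent to scaling overhead on this workload.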

*TPUv2 in Model Training. (Source: Jeff Dean’s Presentation at Tsinghua-Google AI Symposium)*

TPUs are very costly. However, as part of its TensorFlow Research Cloud (TFRC) program, Google now makes 1,000 TPU devices available free of charge to top scientists who are devoting significant efforts to open ML research.

Batch Size

Batch size enables an important form of operand reuse. The setting of minibatch size can greatly affect the efficiency of gradient descent in ML training. Unfortunately, the setting of minibatch size is still poorly understood.
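One way to see the operand reuse is to count arithmetic against memory traffic for a single fully connected layer. The sketch below assumes an idealized model in which every weight, input and output value moves through memory exactly once per minibatch; the function name and layer sizes are hypothetical.

```python
def arithmetic_intensity(batch, n_in, n_out, bytes_per_value=4):
    """FLOPs per byte moved for one dense layer (idealized, no caching effects)."""
    flops = 2 * batch * n_in * n_out             # one multiply and one add per weight per example
    values_moved = n_in * n_out + batch * n_in + batch * n_out  # weights + inputs + outputs
    return flops / (bytes_per_value * values_moved)

for batch in (1, 32, 256):
    print(batch, round(arithmetic_intensity(batch, 1024, 1024), 2))
```

Because the weights are fetched once and reused for every example in the minibatch, the ratio of computation to memory traffic grows with batch size, which is why hardware tends to run more efficiently on larger minibatches.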

Current GPUs operate efficiently at minibatch sizes of 32 or larger. Training ML models with very large or very small minibatch sizes, or with the Stochastic Gradient Descent (SGD) minibatch size of 1, has become a topic of heated debate among leading researchers, who have offered various solutions.

The recent Facebook AI Research (FAIR) paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour shows that a visual recognition model can be trained effectively at minibatch sizes as large as 8,192. Although such large-scale training suits the FAIR model, the approach cannot be assumed to be a universal solution.

ImageNet top-1 validation error vs. minibatch size. Such techniques enable a linear reduction in training time with ∼90% efficiency, training an accurate 8k minibatch ResNet-50 model in 1 hour on 256 GPUs. *(Source: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677)*
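The FAIR paper pairs its large minibatches with a linear learning-rate scaling rule and a gradual warmup. The sketch below is a rough illustration of that idea as we read it, not the authors' code: the defaults (a reference rate of 0.1 at minibatch size 256, five warmup epochs) are taken from our reading of the paper and should be treated as assumptions, and the function names are hypothetical.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling: multiply the learning rate by the same factor as the minibatch."""
    return base_lr * batch_size / base_batch

def warmup_lr(epoch, batch_size, warmup_epochs=5, base_lr=0.1, base_batch=256):
    """Ramp linearly from base_lr to the scaled rate over the first few epochs."""
    target = scaled_lr(batch_size, base_lr, base_batch)
    if epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs

for epoch in (0, 1, 3, 5):
    print(epoch, round(warmup_lr(epoch, batch_size=8192), 3))
```

The warmup is there because jumping straight to the scaled rate can destabilize the first few epochs of training.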

Facebook Chief AI Scientist Yann LeCun is not a fan of large minibatch size. In April he tweeted, “Training with large minibatches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use minibatches larger than 32.”

Dean and Patterson take a more neutral position in Golden Age: “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”

Sparsity and Embeddings

Sparsity takes different forms, and exploiting it can reduce the computational cost of ML by skipping zeros and small values. Dean and Patterson believe that coarse-grained sparsity has more potential than the commonly seen irregular fine-grained sparsity.

Dean argues that researchers will want ever larger model capacity for larger datasets but will want each individual example to activate only a tiny fraction of the model. In other words, “bigger models, but sparsely activated.”

Different sparse structures in a 4-dimensional weight tensor. Regular sparsity makes hardware acceleration easier. (Source: Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. arXiv:1705.08922)

One example is Google Brain’s Mixture of Experts (MoE) model, which consults a learned subset of a panel of “experts” as part of its network structure to achieve the desired level of sparsity.

A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network. (Source: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538)

As a result, MoE models train more weights using fewer FLOPs and reach higher accuracy than previous approaches. On a Google production English-to-French dataset, the MoE model scored 1.01 points higher on the Bilingual Evaluation Understudy (BLEU) metric than the GNMT model after training for just one-sixth the time.

*(Source: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538)*
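The sparse gating idea behind MoE can be sketched in a few lines. This is a simplified illustration rather than the paper's implementation: the gate scores are passed in directly instead of being produced by a learned gating network, the noise term used in the paper is omitted, and the "experts" are toy scalar functions. All names are hypothetical.

```python
import math

def top_k_gate(gate_logits, k=2):
    """Keep the k largest gate scores and softmax over them; the rest get no weight."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Four toy experts; only two of them run for this input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
y = moe_layer(3.0, experts, gate_logits=[0.1, 2.0, 2.0, -1.0])
print(y)
```

The point of the design is that the parameter count grows with the number of experts while the computation per example grows only with k.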

Embeddings, which can transform large sparse data into more compact, dense representations suitable for linear algebra operations, also play an important role in big data applications such as web search, translation, and video recommendation. For text or video analysis, processing an example typically involves hundreds of small (100 to 1,000 byte) random accesses into very large (hundreds of gigabytes) embedding tables.
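The access pattern described above is essentially a gather: a handful of row lookups scattered across a large table. The toy sketch below uses a tiny table and mean pooling, both of which are assumptions made for illustration; real tables are far larger and the pooling step varies by model.

```python
import random

def make_table(vocab_size, dim, seed=0):
    """Build a toy embedding table with random values (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(table, token_ids):
    """Gather the rows for the given ids and average them into one dense vector."""
    rows = [table[i] for i in token_ids]   # small random accesses into a large table
    dim = len(table[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

table = make_table(vocab_size=10_000, dim=8)
dense = embed(table, [42, 7, 9_999])
print(len(dense))
```

Each lookup touches only a few bytes, but the rows sit far apart in memory, which is what makes this workload awkward for hardware tuned for dense linear algebra.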

Quantization and Distillation

Quantization has already proved useful for cost-effective ML inference. Reduced-precision computation requires fewer resources and runs more efficiently, with little loss of accuracy at inference time. Dean believes reduced precision should also work for accelerating training.
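As a concrete illustration of reduced precision, the sketch below applies symmetric linear quantization to 8-bit integers. This is one common scheme chosen for simplicity, not necessarily the one used in any Google system, and the sample weights are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in q_values]

weights = [0.82, -0.31, 0.05, -1.27, 0.0]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value now fits in one byte instead of four, and the rounding error is bounded by half the scale.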

Although there has been little work published in this area so far, last year Baidu and Nvidia researchers introduced a mixed-precision method for training deep neural networks. By storing weights, activations, and gradients in half-precision floating-point (FP16) rather than single-precision (FP32), the method nearly halved the memory requirement while still matching the accuracy of FP32 models. Mixed precision training is supported by the NVIDIA Deep Learning SDK.
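Two numerical hazards explain why FP16 training needs some care. As we understand the Baidu and Nvidia paper, it addresses them with loss scaling and a full-precision master copy of the weights. The sketch below only demonstrates the hazards using Python's built-in half-precision packing; the gradient value and scale factor are made up, and a 64-bit Python float stands in for the FP32 master copy.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# 1) A very small gradient underflows to zero when stored directly in FP16 ...
tiny_grad = 1e-8
lost = to_fp16(tiny_grad)

# ... but survives if it is scaled up before the cast and scaled back afterwards.
loss_scale = 1024.0
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale

# 2) A small update vanishes when added to an FP16 weight,
#    so the update is applied to a higher-precision master copy instead.
fp16_weight = to_fp16(to_fp16(1.0) + 1e-4)
master_weight = 1.0 + 1e-4   # 64-bit float standing in for FP32

print(lost, recovered, fp16_weight, master_weight)
```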

Distillation was first proposed by Google Brain’s Geoffrey Hinton and colleagues at NIPS 2014. It uses a larger model to bootstrap the training of a smaller model, which reaches higher accuracy than it would if trained directly on the same inputs. As an application, Dean suggests that knowledge from an existing giant, highly accurate model can be transferred into a small and cheap model that runs, for example, on a phone.
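The core of distillation is training the small model against the large model's softened output distribution. The sketch below shows only that loss term, under the assumption of a temperature-scaled softmax as in Hinton's formulation; the logits and the temperature value are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between the teacher's and the student's softened outputs."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))

teacher = [8.0, 2.0, 0.5]                                 # made-up teacher logits
loss_far = distillation_loss([0.0, 0.0, 0.0], teacher)    # untrained student
loss_near = distillation_loss([7.5, 2.2, 0.4], teacher)   # student close to the teacher
print(loss_far, loss_near)
```

The soft targets carry information about how the teacher ranks the wrong classes, which a one-hot label does not.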

As noted by Dean and Patterson in Golden Age, the distillation method also raises questions, such as “Could better training methods allow us to directly train the smaller models (and perhaps all models) to higher accuracy?” and “Is there something fundamental about having more degrees of freedom in the larger model that enables better training?” These questions point to different directions in ML development for both small and large models.

Networks with Soft Memory

In Golden Age, Dean and Patterson indicate that some deep-learning techniques can provide features akin to memory access. The attention mechanism is one such technique: it improves ML performance in machine translation by attending to selected parts of the source while processing long sequences of data.

Unlike traditional hard memory, which reads a single entry, soft memory computes a weighted average over all entries of a table to select information-rich content. Doing so, however, is computationally expensive because every entry must be touched, and efficient or sparse implementations of soft memory models remain largely unexplored.
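The difference between a hard and a soft read can be shown with a tiny table. The sketch below assumes dot-product scoring followed by a softmax, which is one common form of attention; the keys, values and query are made up.

```python
import math

def hard_read(index, values):
    """Traditional memory access: fetch exactly one entry."""
    return values[index]

def soft_read(query, keys, values):
    """Soft memory access: a softmax-weighted average over every entry."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
read, weights = soft_read([4.0, 0.0], keys, values)
print(read, weights)
```

Note that the soft read touches all three entries even though one of them receives almost no weight, which is the cost a sparse implementation would try to avoid.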

Learning to Learn (L2L)

Currently, most large ML architectures and model designs still rely on human experts’ heuristics and intuitions. L2L could change model development significantly, as it aims to automate machine learning decisions that today require a human expert. The approach is being explored as one way to address the growing shortage of ML experts.

For automated ML, the Google Brain team uses Reinforcement Learning (RL), following a method they proposed in the 2017 ICLR paper Neural Architecture Search with Reinforcement Learning. Using the accuracy of each candidate model as a reward signal, a controller learns to propose better architectures over time. The authors used the CIFAR-10 dataset to discover a novel convolutional network architecture and the Penn Treebank dataset to compose a novel recurrent cell, and both achieved results comparable with previous state-of-the-art methods.
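The reward-driven loop can be illustrated with a heavily simplified toy. In the paper the controller is a recurrent network that emits a full architecture description; the sketch below replaces it with a single categorical choice among three hypothetical candidates and looks up a made-up accuracy instead of training a model. It uses a REINFORCE-style update with a moving-average baseline, and every name and number in it is an assumption.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_search(accuracy_of, num_choices, steps=500, lr=0.1, seed=0):
    """Learn a distribution over candidates using accuracy as the reward."""
    rng = random.Random(seed)
    logits = [0.0] * num_choices
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(num_choices), weights=probs)[0]
        reward = accuracy_of(choice)               # stands in for "train and evaluate"
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)      # moving-average baseline
        for i in range(num_choices):
            grad = (1.0 if i == choice else 0.0) - probs[i]   # d log p(choice) / d logit_i
            logits[i] += lr * advantage * grad
    return softmax(logits)

toy_accuracy = {0: 0.60, 1: 0.75, 2: 0.90}   # hypothetical validation accuracies
probs = reinforce_search(toy_accuracy.get, num_choices=3)
print(probs)
```

Over many steps the probability mass typically shifts toward candidates with higher reward, which is the behavior the accuracy signal is meant to produce.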

Left: Convolutional architecture discovered by neural architecture search using CIFAR-10 dataset. Right: Performance of Neural Architecture Search and other state-of-the-art models on CIFAR-10. (source: Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578)

Left: Normal LSTM cell vs. cell discovered by Neural Architecture Search using Penn Treebank dataset. Right: Single model perplexity on the test set of the Penn Treebank language modeling task. (Source: Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578)

Other applications of RL in meta-learning include optimal path detection, activation function selection, learning optimizer update rules, and even device placement optimization. At the Tsinghua-Google AI Symposium, Mirhoseini spoke on Device Placement Optimization with Reinforcement Learning, which was published at ICML 2017.

*Google Brain Research Scientist Azalia Mirhoseini speaking at the Tsinghua-Google AI Symposium*

As explained in Mirhoseini’s RL paper, “[The] key to the method is the use of a sequence-to-sequence model to read input information about the operations as well as the dependencies between them, and then propose a placement for each operation. Each proposal is executed in the hardware environment to measure the execution time.” Using execution time as a reward signal, the model proposes better device placements over time.

In Neural Machine Translation (NMT) model training tests, even though the RL-based placement was inconsistent with human intuitions, it was nearly 65 hours faster than expert-designed placement, achieving a 27.8 percent speedup of total training time.

Training curves of NMT model using RL-based placement and expert-designed placement. The per-step running time as well as the perplexities are averaged over 4 runs. (Source: Device Placement Optimization with Reinforcement Learning. arXiv:1706.04972)

Going Forward

So what might a plausible future look like? At the Symposium, Dean proposed it could involve a combination of many ideas:

1. Large model, but sparsely activated;
2. Single model to solve many tasks;
3. Dynamically learn and grow pathways through a large model;
4. Hardware specialized for ML supercomputing;
5. ML for efficient mapping onto hardware.

The research direction taken by Dean and Google suggests that those considering stepping into the ML arena and contributing to its future would do well to think broadly and develop skills across research, engineering, and hardware architecture.

Source: Synced China

Journalist: Luna Qiu | Localization: Tingting Cao

Editor: Meghan Han, Michael Sarazen
