The blog author, Rob Farber, has worked on machine learning and related fields as a staff scientist, and has participated in related projects, since the 1980s. He is now a global technology consultant and author with an extensive background in high performance computing (HPC) and in developing machine learning technology, which he applies at national labs and commercial organizations. In this blog, he first explained some common misunderstandings. He then discussed the key technology requirements and capabilities of machine learning, which should help technologists, management, and data scientists make efficient and intelligent decisions when choosing hardware platforms.
- What does ‘Deep Learning’ really mean?
Rob pointed out that the phrase ‘deep learning’ is often misused and is generally treated as equivalent to artificial intelligence (AI), a term that is itself used far more loosely than its real definition warrants. Instead, he defined deep learning as a particular configuration of an artificial neural network (ANN) architecture that contains multiple hidden layers between the input and output layers. It is important that people neither misuse the phrase for marketing nor exaggerate what deep learning is able to do.
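To make the definition concrete, here is a minimal sketch of a network with several hidden layers between its input and output layers. It is not code from Rob's blog: the layer sizes, the tanh activation, and the helper names (`make_layer`, `forward`, `count_hidden_layers`) are all assumptions chosen for illustration.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers, x):
    """Propagate the input through every layer in turn, using tanh activations."""
    activation = x
    for weights, biases in layers:
        activation = [
            math.tanh(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


def count_hidden_layers(layer_sizes):
    """Hidden layers are the layers between the input and the output layer."""
    return max(len(layer_sizes) - 2, 0)


rng = random.Random(0)
# Hypothetical sizes: 4 inputs, three hidden layers of 8 units, 2 outputs.
layer_sizes = [4, 8, 8, 8, 2]
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
output = forward(layers, [0.5, -0.2, 0.1, 0.9])
print(count_hidden_layers(layer_sizes), output)
```

In this sketch the "depth" is simply the number of entries between the first and last element of `layer_sizes`, which is the sense in which the word is used above.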
- What does ‘training’ really mean?
Many people assume that “Training = Learning”, which is incorrect. Training is the process of fitting a set of model parameters for the ANN so as to minimize the error on the training set, which is not what humans do when we start to learn something. In addition, ANNs have no concept of a goal or of real-world constraints. They need guidance to keep them on the right track toward a good solution to the problem. Once training is done, the so-called ‘learning’ amounts to having solved a computational problem without the help of humans.
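The idea that training is parameter fitting can be sketched in a few lines. The example below is an illustration only: it uses a two-parameter linear model as a stand-in for an ANN, plain gradient descent as the optimizer, and made-up data generated from y = 2x + 1. None of these choices come from the original blog.

```python
def mean_squared_error(w, b, xs, ys):
    """Average squared error of the model y = w * x + b on the training set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, steps=2000):
    """Fit w and b by repeatedly stepping against the gradient of the error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training set drawn from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

initial_error = mean_squared_error(0.0, 0.0, xs, ys)
w, b = train(xs, ys)
final_error = mean_squared_error(w, b, xs, ys)
print(w, b, initial_error, final_error)
```

The loop has no notion of what the numbers mean. It only adjusts `w` and `b` until the error on the training set is small, which is the distinction between training and human learning drawn above.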
- What does ‘inferencing’ mean?
From a computer science perspective, inferencing is usually a sequential calculation that is subject to memory bandwidth limitations, unlike the training process, which is highly parallel when evaluating a set of candidate parameters. For ANNs, inferencing usually refers to what happens when the ANN computes the final result using the parameters obtained from the training process. Rob also pointed out the difference between serial and parallel processing. When data scientists perform volume processing of data, they usually use sequential computation rather than parallel inferencing. Individual data items are also usually handled with sequential inferencing.
- What is ‘parallelism’?
All hardware uses parallelism, but it is up to people to decide which kinds of devices can speed up the training process and minimize the ‘time-to-model’. During training, the error calculated for each candidate set of parameters guides the optimization algorithm, which finally chooses the set of parameters that produces the minimal error. To perform this process efficiently, Rob suggested applying a Single Instruction Multiple Data (SIMD) model, and summarized it in the following three key points:
- SIMD maps very efficiently to processors, vector processors, accelerators, FPGAs, and custom chips alike.
- The performance of the cache and memory subsystems determines the parallel performance of the hardware.
- To fully apply parallelism, the size of the training set needs to be determined before the hardware is selected.
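The data-parallel structure behind these points can be sketched as follows: the same error calculation is applied to every training example, and the per-chunk partial sums are then reduced to one total. This is only an illustration of the pattern. The linear model, the chunking scheme, and the function names are assumptions, and Python threads are used purely to show the structure; they do not deliver the speedup of real SIMD hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def example_error(params, example):
    """Squared error of a linear model on one (x, y) training example."""
    w, b = params
    x, y = example
    return (w * x + b - y) ** 2


def partial_error(params, chunk):
    """Same instruction stream applied to every example in the chunk."""
    return sum(example_error(params, ex) for ex in chunk)


def total_error(params, examples, n_workers=4):
    """Split the training set into chunks, evaluate each, then reduce."""
    chunk_size = max(1, (len(examples) + n_workers - 1) // n_workers)
    chunks = [examples[i:i + chunk_size]
              for i in range(0, len(examples), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda c: partial_error(params, c), chunks))
    return sum(partials)


# Hypothetical training set drawn from y = 2x + 1.
examples = [(float(x), 2.0 * x + 1.0) for x in range(100)]
print(total_error((2.0, 1.0), examples))  # parameters that fit the data exactly
print(total_error((0.0, 0.0), examples))  # a poor candidate parameter set
```

Because every chunk does the same work on different data, how evenly the training set divides across the available lanes or devices matters, which is one reason the size of the training set should be known before hardware is chosen.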
- What is Reduced Precision used for?
Half-precision arithmetic can double the effective performance of the hardware memory and computational systems, but for training it can actually be a bad idea. Numerical optimization iterates over many candidate parameter sets, and reduced precision slows the convergence of the model. In the worst case, if the precision is reduced too much, the training process can get stuck in a local minimum and fail to find a solution. It is therefore important to think carefully about reduced precision for training, or to avoid it altogether.
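One mechanism by which reduced precision can stall an optimizer is that small parameter updates get rounded away. The sketch below illustrates this with IEEE 754 half precision, emulated through the standard library's `struct` module (format code `e`). The weight value, the update size, and the helper name `to_half` are assumptions for illustration and do not come from the original blog.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weight = 1.0
update = 1e-4  # hypothetical small step proposed by an optimizer

# Apply the same small update 1000 times in half and in double precision.
half_weight = to_half(weight)
full_weight = weight
for _ in range(1000):
    half_weight = to_half(half_weight + to_half(update))
    full_weight = full_weight + update

print(half_weight)  # the half-precision weight never moves
print(full_weight)  # the double-precision weight drifts toward 1.1
```

Near 1.0, adjacent half-precision values are about 0.001 apart, so an update of 0.0001 is lost at every step, while the double-precision weight accumulates all of them.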
- What is the key to Gradient calculation?
Gradient calculation is widely used by effective and efficient optimization algorithms such as L-BFGS and Conjugate Gradient. However, the gradient becomes very large, and grows very quickly, as the number of parameters in the ANN model increases. It is therefore important to keep an eye on instruction memory capacity and to make sure it can hold all the machine instructions needed to perform the gradient calculation. Rob suggested looking for hardware and benchmark comparisons that use the stacked memory available on high-end devices, and for systems with large memory capacities.
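A rough sense of how quickly the gradient grows can be had by counting parameters, since the gradient has one entry per model parameter. The sketch below assumes fully connected layers with one bias per unit and 4-byte floating point values. The layer sizes and function names are hypothetical, and the estimate covers only the gradient vector itself, not the instructions needed to compute it.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def gradient_bytes(layer_sizes, bytes_per_value=4):
    """The gradient holds one value per parameter (assumed 4 bytes each)."""
    return count_parameters(layer_sizes) * bytes_per_value


small = [1000, 1000, 1000, 10]  # hypothetical network
wide = [1000, 2000, 2000, 10]   # same network with hidden layers twice as wide

print(count_parameters(small), gradient_bytes(small))
print(count_parameters(wide), gradient_bytes(wide))
```

Doubling the width of the hidden layers in this example roughly triples the number of gradient entries, because the weights between two hidden layers scale with the product of their widths.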
Fortunately, the industry has recognized the demand for faster memory for both processors and accelerators, and custom hardware specialized for ANNs (from Google and Intel Nervana) is becoming available alongside CPUs and GPUs. Some Intel processor SKUs offer on-package processor interfaces for attaching custom solutions (ASICs and FPGAs). These front-end processors should be matched to the performance capabilities of the custom devices, although how they will perform is still conjecture. People expect to see more custom hardware, and better performance, in the future market.
People nowadays try to profit from whatever hot topic they can get hold of, borrowing the term ‘deep learning’ and misleading others with it. Every time they say “our products incorporate deep learning”, their clients may think that is outstanding without having any real idea of what deep learning is. The term needs to be used correctly and carefully rather than to mislead people outside the field. For example, when it comes to big data, people may associate it with machine learning and neural networks because of the large amounts of data involved. However, they are not the same thing. This is why people often get confused when they apply for data scientist and machine learning engineer positions. The main duty of a data scientist is stated in the job title: to extract value from data. The job of a machine learning engineer, by contrast, is to optimize the learning algorithm and its performance rather than the data. Rob did a great job at the beginning of explaining the terminology and what deep learning really does.
In the second part, he turned to hardware performance and commented on future trends in the industry. The latest news he presented was that Google had announced that its custom ASIC hardware would be available exclusively on Google Cloud, and that Intel had unveiled the Nervana processor. With the rise of deep learning, custom hardware targeted at different sub-fields will emerge, and experts will need to identify and intelligently select the best options for their models and problems. As for Google, the newly designed processor is built specifically for deep neural networks and covers all of their sub-fields, from image and speech recognition to automated translation and robotics. However, this chip is not for sale; it is kept in-house and used for Google’s new cloud service. Google claimed that developers would be able to run their software on the cloud service, hosted in Google data centers, before the end of this year. Unlike other Google products, this cloud service may charge users when it is finally released.
Blog Author: Rob Farber
Author: Bin Liu | Localized by Synced Global Team: Xiang Chen