During the first quarter of 2017, Nvidia’s revenue was driven by 63% year-over-year growth in data center revenue. This growth was largely due to technology companies such as Google and Amazon, which have accelerated the rollout of AI cloud products that are mostly based on Nvidia’s GPU hardware. By contrast, Intel, the company that once dominated the data center market, reported only 9% growth in the same segment. The gap reflects the increasing adoption of deep learning technology in the market. Intel, however, has already increased its investment and development efforts in deep learning. In this article we provide some insights into Intel’s recent deep learning products.
Intel AI Products
Intel has stepped up its AI development efforts in both hardware and software. For data centers, the Intel Xeon and Intel Xeon Phi processors have been released for general use cases in machine learning and other High Performance Computing (HPC) applications. To meet the growing demands of AI, Intel also launched two products optimized for deep learning model training and efficient inference:
- Training: Intel Xeon processor + Intel Deep Learning Engine “Lake Crest”, which, according to Intel, has best-in-class neural network performance and offers unprecedented compute density with a high-bandwidth interconnect.
- Inference: Intel Xeon processor + FPGA (Arria 10). The FPGA engine is customizable and programmable, offering low latency and flexible precision with higher performance per watt for machine learning inference. This solution is designed for efficient inference and real-time pre-filtering in machine learning applications.
In the following subsections we provide more detail on Lake Crest, on Intel’s FPGA solution Arria 10, and on some evaluation results of the Xeon Phi processor in deep learning model training.
Intel’s deep learning engine, “Lake Crest”, is a new chip that enables hardware-level optimization of neural network computation. The main advantage of hardware-implemented networks over programmable FPGAs is that a chip like Lake Crest adapts to the code at run-time, and the network is also updated at the hardware level. Lake Crest has a tensor-based architecture. Its memory hierarchy has the following features: high-dimensional (>2) tensors are the default data type, no cache mechanism is used, and memory is allocated by the compiler. Tensors can be read either transposed or in regular order. The chip has ECC protection throughout and uses HBM2 memory, which is 12 times faster than DDR4.
Another important innovation of Lake Crest is its data transport. Lake Crest has a high-bandwidth interconnect with 6 bidirectional links forming a 3D torus, which is 20 times faster than PCIe. The twelve computation units of Lake Crest are each connected directly to all the others, with a throughput of up to 100 gigabits per second.
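To make the torus topology concrete, here is a minimal sketch of how six links per chip map onto a 3D torus: each node connects to its two neighbors along each of the three axes, with wrap-around at the edges. The 4x4x4 grid shape and the function name `torus_neighbors` are assumptions chosen for illustration, not figures or names from Intel.

```python
def torus_neighbors(node, dims):
    """Return the six neighbors of `node` in a 3D torus of shape `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # The modulo wraps around at the edges, which turns the grid into a torus.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


dims = (4, 4, 4)  # hypothetical cluster shape, not an Intel figure
print(torus_neighbors((0, 0, 0), dims))
```

Because of the wrap-around, a corner node such as (0, 0, 0) still has six distinct neighbors, so every chip uses the same number of links regardless of its position.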
Lake Crest supports the 16-bit FlexPoint format for deep models and focuses on optimizing matrix multiplication and convolution, since these take up the majority of neural network execution time. Lake Crest supports complex GEMM functions such as (A^2*4B)+C, as well as automatic matrix blocking, partial product addition, etc.
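The idea behind matrix blocking and partial product addition can be sketched in plain Python. The function below computes A*B + C tile by tile, adding each tile's partial product into the output. It is only an illustration of the general technique; the name `blocked_gemm` and the tile size are hypothetical, and nothing here reflects how Lake Crest implements it in hardware.

```python
def blocked_gemm(a, b, c, block=2):
    """Return a*b + c for list-of-lists matrices, computed tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    out = [row[:] for row in c]  # start from C so the result is A*B + C
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                # Add this tile's partial product into the output tile.
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        for p in range(p0, min(p0 + block, k)):
                            out[i][j] += a[i][p] * b[p][j]
    return out


a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 0], [0, 1], [2, 2]]
c = [[10, 10], [10, 10]]
print(blocked_gemm(a, b, c))  # [[17, 18], [26, 27]]
```

Working on small tiles keeps the operands of each partial product close to the compute units, which is the usual motivation for blocking on hardware with explicitly managed memory.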
The data types designed for Lake Crest are shown in Figure 1.
Figure 1. Supported data type in Lake Crest. (Image from Intel)
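The article does not spell out the FlexPoint layout. Assuming it works like a shared-exponent format, where all elements of a tensor are stored as 16-bit integer mantissas that share a single exponent, the idea can be sketched as follows. The function names and the rounding scheme are assumptions for illustration, not Intel's specification.

```python
import math


def to_shared_exponent(values, mantissa_bits=16):
    """Quantize a flat list of floats to integer mantissas with one shared exponent."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    limit = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa
    # Smallest power-of-two scale that keeps the largest value within range.
    exponent = math.ceil(math.log2(max_abs / limit))
    scale = 2.0 ** exponent
    return [round(v / scale) for v in values], exponent


def from_shared_exponent(mantissas, exponent):
    """Reconstruct approximate float values from mantissas and the shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]


mantissas, exponent = to_shared_exponent([0.5, -1.25, 3.0])
print(mantissas, exponent)
print(from_shared_exponent(mantissas, exponent))
```

In such a scheme the per-element arithmetic is integer arithmetic, while the shared exponent preserves the dynamic range of the tensor as a whole.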
The FlexPoint engine is able to achieve 50 TOPs based on a 12x100 Gbps interconnect and 32 GB of HBM2 RAM. The Lake Crest-based deep learning platform will be available in late 2017. The next generation of Intel’s deep learning engine, “Spring Crest”, can achieve 80-90 TOPs with 8g Winograd and will be available in late 2018.
Arria 10 FPGA
Arria 10 is Intel’s current FPGA generation for machine learning, with a compute capability of 1.5 TF in single precision, 3 TOPs in Int16 and 6 TOPs in Int8. The next FPGA generation is “Stratix 10”, which is planned for release in late 2017. Stratix 10 will have much more compute power, with up to 9 TF in single precision and up to 18/36 TOPs for Int16/Int8.
Intel offers two options for installing the Arria 10 FPGA module. It can be installed as an individual PCIe component, referred to as the “Discrete” version. Alternatively, it can be integrated into the Xeon processor package with a direct internal connection to the processor. This option also offers a direct connection pipeline from outside the package to the FPGA module, which enables flexible data access, and is referred to as the “Integrated” version.
Tables 1 and 2 show the throughput and energy consumption of a Xeon processor with the Arria 10 FPGA component. (All statistics are taken from Intel’s public materials.)
| Configuration | Throughput | Power | Efficiency |
|---|---|---|---|
| Arria 10-115, FP32, full size image, speed @306 MHz | 575 img/s | ~31 W | 18.5 img/s/W |
| Arria 10-115, FP16, full size image, speed @297 MHz | 1020 img/s | ~40 W | 25.5 img/s/W |
| Nvidia M4 | | | 20 img/s/W |
Table 1. Intel Xeon with Arria 10 Discrete
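The efficiency column in Table 1 is simply throughput divided by power. The short check below reproduces the two Arria 10 values from the listed throughput and the approximate power figures. The dictionary and function names are chosen for illustration only.

```python
# Throughput (img/s) and approximate power (W) as listed in Table 1.
TABLE_1 = {
    "Arria 10-115, FP32": (575, 31),
    "Arria 10-115, FP16": (1020, 40),
}


def images_per_watt(throughput, power):
    """Energy efficiency in images per second per watt."""
    return throughput / power


for name, (throughput, power) in TABLE_1.items():
    print(f"{name}: {images_per_watt(throughput, power):.1f} img/s/W")
```

Since the power figures are approximate, the resulting efficiencies should be read as approximate as well.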
| Configuration | Throughput |
|---|---|
| FixedP8 with Winograd | ~1200 img/s |
Table 2. Throughput of the “Integrated” version on a classification task. Results in this table are based on AlexNet classification with 224x224x3 input and 1000×1 output.
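Intel's materials do not say which Winograd variant is used here. As general background, the smallest case of Winograd's minimal filtering algorithm, F(2,3), produces two outputs of a 3-tap filter with four multiplications instead of six. The sketch below works in floating point for readability and makes no attempt to model the fixed-point FPGA implementation.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 4 multiplications."""
    # Filter transform: can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The four element-wise multiplications on transformed inputs.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs computed with 6 multiplications."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]


d, g = [1, 2, 3, 4], [1, 0, -1]
print(winograd_f23(d, g), direct_f23(d, g))
```

The savings grow when the same idea is applied in two dimensions to small convolution kernels, which is why the technique is popular for convolutional networks such as AlexNet.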
Xeon Phi Knights Mill
The Xeon Phi processor is designed for high-performance, general-purpose machine learning applications. The current release in 2017 is “Knights Landing” (KNL) with the Groveport platform. The next generation, “Knights Mill”, will be available in late 2017 and will have the following compute features: 13.8 TF in single precision and 27.6 TOPs in VNNI. VNNI doubles the operations per second by using Int16 inputs, and can achieve accuracy similar to single precision by using Int32 outputs.
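Why Int32 outputs matter for Int16 inputs can be shown with a small model of integer accumulation. The sketch below is plain Python that mimics two's-complement wrap-around. It does not use or emulate the actual VNNI instructions, and the helper names are hypothetical.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def dot_int16(a, b, acc_bits):
    """Dot product of Int16 vectors with an accumulator of `acc_bits` bits."""
    acc = 0
    for x, y in zip(a, b):
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc


a = [300, 200, 100]
b = [250, 150, 50]
print("exact:      ", sum(x * y for x, y in zip(a, b)))
print("32-bit acc: ", dot_int16(a, b, 32))
print("16-bit acc: ", dot_int16(a, b, 16))
```

Even modest Int16 values overflow a 16-bit accumulator after a single product, while a 32-bit accumulator holds the exact sum. The throughput figure in the text follows the same logic: twice the 13.8 TF single-precision rate gives 27.6 TOPs.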
Figure 2 shows benchmarking results of inference speed tests on various deep models using the MXNet framework. After hardware-level optimization, a speedup of up to 123 times was achieved on a 2S Intel Xeon processor E5 2699 v4 compared to out-of-box performance.
Figure 2. Inference Test on optimized Intel microprocessors. (Image from Intel)
To optimize training performance, Intel introduced the Knights Mill and Groveport platform, which brings overall improvements in speed, memory and consistency. It can achieve up to a 2.5 times single-precision performance gain over KNL, with highly distributed multi-node scaling for deep learning training workloads. The distributed multi-node scaling can span up to 72 cores. It offers high memory bandwidth with 16 GB of integrated MCDRAM and 384 GB of 6-channel DDR4 memory capacity for massive AI use cases. Common Intel Xeon programming is natively supported, and the platform has been optimized for industry-standard open source machine learning frameworks. Peak single-precision performance reaches up to 13.8 TF.
As reported by Intel, these optimizations yield up to a 340 times performance gain for training a VGG model in TensorFlow compared to out-of-box performance on a 2S Intel Xeon processor E5 2699 v4. Furthermore, as shown in Figure 3, they yield up to a 273 times cumulative speedup on the Intel Xeon Phi processor 7250 for training a VGG model.
Figure 3. Cumulative speedup on optimized Intel microprocessors. (Image from Intel)
Figure 4 shows the training time for GoogLeNet v1 when scaling up to 32-node clusters of Intel Xeon Phi processor 7250 with Intel Omni-Path Fabric. It demonstrates a scaling efficiency of up to 97%.
Figure 4. Training time on scaling. X-axis: number of node clusters, Y-axis: number of hours. (Image from Intel)
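Scaling efficiency is commonly defined as the achieved speedup divided by the number of nodes. Assuming that is the definition behind the 97% figure, the sketch below shows what it implies for a 32-node cluster. The function names are chosen for illustration, and no training times from Figure 4 are reproduced here.

```python
def scaling_efficiency(time_one_node, time_n_nodes, nodes):
    """Fraction of ideal linear speedup actually achieved on `nodes` nodes."""
    speedup = time_one_node / time_n_nodes
    return speedup / nodes


def implied_speedup(efficiency, nodes):
    """Speedup over a single node implied by a given scaling efficiency."""
    return efficiency * nodes


# 97% efficiency on 32 nodes corresponds to roughly a 31x speedup.
print(implied_speedup(0.97, 32))
```

In other words, at 97% efficiency a 32-node cluster behaves almost like 31 perfectly parallel nodes, with only about one node's worth of capacity lost to communication and synchronization.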
Software And Tools
Software is also an important part of Intel’s computation foundation for artificial intelligence. Figure 5 shows the software libraries and tools developed by Intel for deep learning and machine learning.
Figure 5. Intel’s deep learning software and tools. (Image from Intel)
It clearly shows that Intel intends to build a complete computation foundation for deep learning/AI products. Not only does Intel’s deep learning platform support all the mainstream open source deep learning libraries, but Intel also provides the math kernel library MKL-DNN, designed specifically for accelerating deep neural networks. We consider such libraries to be the computational primitives, while Intel’s machine learning scaling library serves as the communication primitives.
Recently, Intel’s research team published a paper, “Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?”, at the FPGA’17 conference. The paper reports extensive experiments on the performance gains from accelerating deep learning models with Intel’s FPGA products Arria 10 and Stratix 10. The authors also evaluated them against the current Nvidia Titan X Pascal GPU. The results show that Intel’s FPGA solution is very competitive with state-of-the-art GPU devices when used for deep learning.
Further details of Intel’s paper can be found by following this link.
*All rights to the images used in this article are reserved by Intel.
 Intel workshop (not publicly available)
I was invited to the Intel discovery workshop at the SAP Innovation Center on 09.Feb.2017. The team from Intel DCG (Data Center Group) presented their current progress in deep learning and their AI products. Based on the content of this workshop, we can anticipate the future trends and strategies of CPU hardware producers like Intel in the next wave of computing, particularly on the AI-related front.
Author: Ian|Editor: Jake Zhao | Localized by Synced Global Team: Xiang Chen