California artificial intelligence startup Cerebras Systems yesterday introduced its Cerebras Wafer Scale Engine (WSE), the world’s largest-ever chip built for neural network processing. Cerebras Co-Founder and Chief Hardware Architect Sean Lie introduced the gigantic chip at one of the semiconductor industry’s leading conferences, Hot Chips 31: A Symposium on High Performance Chips, hosted at Stanford University.
The 16nm WSE is a 46,225 mm² silicon chip, slightly larger than a 9.7-inch iPad, featuring 1.2 trillion transistors, 400,000 AI-optimized cores, 18 gigabytes of on-chip memory, 9 petabyte/s memory bandwidth, and 100 petabyte/s fabric bandwidth. Its area is 56.7 times that of the largest Nvidia graphics processing unit, which accommodates 21.1 billion transistors on an 815 mm² silicon base.
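The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the decimal interpretation of "gigabyte" and "petabyte" and the assumption of a square die are ours, not Cerebras'.

```python
import math

# Figures quoted in the article.
wse_area_mm2 = 46_225
gpu_area_mm2 = 815
wse_transistors = 1.2e12
gpu_transistors = 21.1e9
wse_cores = 400_000
wse_memory_bytes = 18e9      # assumption: decimal gigabytes
wse_mem_bw_bytes_s = 9e15    # assumption: decimal petabytes per second

area_ratio = wse_area_mm2 / gpu_area_mm2
transistor_ratio = wse_transistors / gpu_transistors
side_mm = math.sqrt(wse_area_mm2)  # edge length, assuming a square die
memory_per_core_kb = wse_memory_bytes / wse_cores / 1e3
bandwidth_per_core_gb_s = wse_mem_bw_bytes_s / wse_cores / 1e9

print(f"area ratio:         {area_ratio:.1f}x")
print(f"transistor ratio:   {transistor_ratio:.1f}x")
print(f"die edge (square):  {side_mm:.0f} mm")
print(f"memory per core:    {memory_per_core_kb:.0f} kB")
print(f"bandwidth per core: {bandwidth_per_core_gb_s:.1f} GB/s")
```

The area and transistor ratios come out nearly equal (about 56.7 and 56.9), which suggests the two chips have similar transistor density and that the WSE's advantage comes from sheer silicon area.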
Cerebras CEO Andrew Feldman, whose previous company SeaMicro was acquired by AMD for US$334 million in 2012, told Synced that the WSE is vastly more efficient than Nvidia GPUs. “Nvidia gets about 30 chips from a wafer. Each chip is put on a circuit board. If they sell it in a DGX, they have to buy two Intel processors and put together. If they sell it in a DGX-2, they have to put in switches. And to compete with us, they have to put in 20 to 30 switches. I just have one piece of Silicon!”
Feldman believes the Moore’s Law era is done, and that means the semiconductor industry now has to change its innovation focus. “Moore’s Law says we get more transistors which we can turn into more circuits in a given area. We think that is over. We need more compute but we are not getting smaller so we need more silicon.”
An inevitable cost of the unprecedented progress of deep learning is rapidly growing compute workloads, with estimates that the amount of compute used for AI training now doubles roughly every 3.5 months. DeepMind’s 2017 Go computer AlphaZero required over 300,000 times more compute than 2012’s breakthrough GPU-powered convolutional neural network AlexNet.
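As a rough consistency check, the two figures in this paragraph can be compared directly: a 300,000-fold increase corresponds to a certain number of doublings, and at 3.5 months per doubling that implies a time span. This is a back-of-the-envelope sketch, assuming a constant doubling period.

```python
import math

compute_growth = 300_000   # growth factor quoted in the article
doubling_months = 3.5      # doubling period quoted in the article

doublings = math.log2(compute_growth)
implied_months = doublings * doubling_months
implied_years = implied_months / 12

print(f"doublings needed: {doublings:.1f}")
print(f"implied span:     {implied_months:.0f} months (~{implied_years:.1f} years)")
```

The implied span is a little over five years, which is in line with the 2012 to 2017 gap between the two systems.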
Lie says deep learning training is challenging in terms of size and shape: a training process usually involves billions to trillions of operations per data sample, and millions to billions of total samples per training run. Today, however, the industry still largely relies on legacy technologies, an approach Lie calls “Brute Force Parallelism,” to train AI models. Dense vector processors like GPUs are limited when compute is not a large uniform block. Interconnect and networking technologies like PCIe, Ethernet, InfiniBand, and NVLink all offer some form of scale-out clustering, but are limited by the inherent serial nature of the problem.
Synced has highlighted some standout properties of the WSE:
Chip design: The chip’s 400,000 AI-optimized Sparse Linear Algebra Cores (SLAC) are flexible, programmable cores optimized for neural network primitives. SLAC supports general operations for control processing (arithmetic, logical, load/store, branch) as well as optimized tensor operations for data processing.
Sparse compute: A major innovation of SLAC is a hardware sparse-processing capability that filters out zero-valued data and skips the unnecessary processing it would otherwise trigger. Neural network operations like nonlinearities naturally create fine-grained sparsity, so native sparse processing enables higher efficiency and performance.
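The idea can be illustrated in software, although the WSE does this in hardware and its actual mechanism is not described here. In the sketch below, a ReLU nonlinearity zeroes out negative values, and a hypothetical `sparse_dot` helper skips every multiply whose activation is zero while producing the same result as the dense computation.

```python
import random

def relu(values):
    """ReLU nonlinearity: negative inputs become exact zeros."""
    return [v if v > 0.0 else 0.0 for v in values]

def sparse_dot(weights, activations):
    """Dot product that skips zero activations.

    Returns the result and the number of multiplies actually performed.
    """
    total = 0.0
    multiplies = 0
    for w, a in zip(weights, activations):
        if a == 0.0:
            continue  # nothing to contribute, so skip the work
        total += w * a
        multiplies += 1
    return total, multiplies

random.seed(0)
n = 1000
weights = [random.gauss(0.0, 1.0) for _ in range(n)]
activations = relu([random.gauss(0.0, 1.0) for _ in range(n)])

dense_result = sum(w * a for w, a in zip(weights, activations))
sparse_result, multiplies = sparse_dot(weights, activations)

print(f"multiplies performed: {multiplies} of {n}")
print(f"results match: {abs(dense_result - sparse_result) < 1e-9}")
```

With zero-centred random inputs, roughly half of the activations are zero after ReLU, so about half of the multiplies can be skipped without changing the answer.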
Optimized memory: Conventional memory designs typically involve central shared memory that is independent from the computing cores, which is not ideal for deep learning models because the memory is slow and physically distant. Such designs depend on caching, which pays off only when data is reused heavily, yet fundamental deep learning operations such as matrix-vector multiply have low data reuse. Optimal hardware for deep learning operations should provide massive compute with frequent access to data.
Lie says on-chip memory is a promising path for future hardware development as it can reduce the significant costs of data movement. All WSE memory is fully distributed within the compute datapath. The chip has 18 gigabytes of on-chip memory and 9 petabyte/s memory bandwidth — 3,000 times more on-chip memory and 10,000 times more memory bandwidth than the leading GPU.
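The low data reuse of matrix-vector multiply can be made concrete by counting how many multiply-accumulate operations each loaded weight takes part in. The sketch below is a simplified model of our own (it ignores activations and outputs and assumes each weight is fetched once); the function name and sizes are hypothetical.

```python
def weight_reuse(rows: int, cols: int, batch: int) -> float:
    """Multiply-accumulates performed per weight fetched from memory.

    Models a (rows x cols) weight matrix applied to `batch` input vectors,
    assuming each weight is loaded exactly once.
    """
    multiply_accumulates = rows * cols * batch
    weights_loaded = rows * cols
    return multiply_accumulates / weights_loaded

# Matrix-vector multiply: a single input vector.
print(weight_reuse(rows=4096, cols=4096, batch=1))
# Matrix-matrix multiply: 64 input vectors batched together.
print(weight_reuse(rows=4096, cols=4096, batch=64))
```

In the matrix-vector case each weight is used once and then discarded, so a cache has nothing to exploit and performance is bounded by how fast weights can be delivered. That is the argument for keeping memory next to the compute.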
Faster communication: A selling point of this chip is high fabric bandwidth at low latency for connecting cores. The chip uses a fast, fully configurable fabric that carries small, single-word messages. Communication is handled in hardware to avoid software overhead. WSE uses a 2D mesh topology for local communication.
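For readers unfamiliar with the term, a 2D mesh connects each core only to its immediate north, south, east and west neighbors. The following sketch models that topology in general terms; the grid dimensions and function names are illustrative assumptions and say nothing about the WSE's real layout or routing.

```python
def mesh_neighbors(x, y, width, height):
    """Return the in-bounds north/south/west/east neighbors of core (x, y)."""
    candidates = [(x, y - 1), (x, y + 1), (x - 1, y), (x + 1, y)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < width and 0 <= cy < height]

def mesh_hops(src, dst):
    """Minimum number of link traversals between two cores on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# A hypothetical 4 x 4 grid of cores.
print(mesh_neighbors(0, 0, width=4, height=4))  # a corner core has 2 neighbors
print(mesh_neighbors(1, 1, width=4, height=4))  # an interior core has 4
print(mesh_hops((0, 0), (3, 2)))                # 3 hops east + 2 hops south
```

Because every link spans only one core-to-core distance, wires stay short, which is consistent with the emphasis on low latency for local communication.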
Programming: Cerebras engineers co-designed software that allows neural network models to be expressed in common machine learning frameworks like TensorFlow and PyTorch. The software performs placement and routing to map neural network layers onto the fabric. This makes it possible to apply the entire WSE to a single neural network at once.
In addition, Lie listed five main challenges Cerebras researchers encountered in the chip development process:
Cross-die connectivity: Standard chip fabrication processes require dies to be independent and separated by scribe lines. In partnership with TSMC, Cerebras’ solution is to add wires across the scribe lines to extend a 2D mesh across the dies. As a result, connectivity across scribe lines matches connectivity between cores within a die, and the short wires enable ultra-high bandwidth with low latency.
Inevitable defects: Silicon and process defects are inevitable even in a mature process. Cerebras includes redundant cores in the chip design to replace defective cores, and redundant fabric links to restore the logical 2D mesh.
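One simple way to picture redundancy is as a mapping from logical core numbers to physical cores that skips the defective ones. The sketch below is a one-dimensional toy model under our own assumptions; Cerebras has not described its actual remapping scheme here, and `build_logical_map` is a hypothetical helper.

```python
def build_logical_map(num_physical, defective, num_logical):
    """Map logical core indices to working physical cores.

    Element i of the returned list is the physical core that serves
    logical core i. Raises ValueError if there are too few spares.
    """
    working = [p for p in range(num_physical) if p not in defective]
    if len(working) < num_logical:
        raise ValueError("not enough working cores to cover the logical array")
    return working[:num_logical]

# Ten physical cores, two of them defective, exposing eight logical cores.
logical_to_physical = build_logical_map(
    num_physical=10, defective={2, 5}, num_logical=8
)
print(logical_to_physical)
```

Software sees eight contiguous logical cores regardless of where the defects fell; the spare physical cores absorb the losses.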
Thermal expansion in the package: Silicon and printed circuit board (PCB) expand at different rates under temperature changes. At wafer scale, the resulting size mismatch could cause too much mechanical stress when using traditional package technology. Cerebras’ solution is to develop a custom connector between the wafer and the PCB to absorb size variations and maintain connectivity.
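The scale of the mismatch can be estimated with the linear expansion formula, change in length = coefficient × length × temperature change. The sketch below is an order-of-magnitude estimate only: the expansion coefficients are typical textbook values for silicon and FR-4 board material, the 50 K temperature swing is hypothetical, and the 215 mm edge assumes the 46,225 mm² die is square. None of these inputs come from Cerebras.

```python
# Assumed typical coefficients of thermal expansion, per kelvin.
CTE_SILICON = 2.6e-6
CTE_PCB_FR4 = 14e-6

edge_mm = 215.0        # assumption: square die, sqrt(46,225 mm^2)
delta_t_kelvin = 50.0  # hypothetical temperature swing

silicon_growth_mm = CTE_SILICON * edge_mm * delta_t_kelvin
pcb_growth_mm = CTE_PCB_FR4 * edge_mm * delta_t_kelvin
mismatch_mm = pcb_growth_mm - silicon_growth_mm

print(f"silicon grows by {silicon_growth_mm * 1000:.0f} micrometres")
print(f"PCB grows by     {pcb_growth_mm * 1000:.0f} micrometres")
print(f"mismatch:        {mismatch_mm * 1000:.0f} micrometres")
```

Under these assumptions the board outgrows the silicon by roughly a tenth of a millimetre across the die. The mismatch scales with edge length, so it is far larger for a wafer-sized part than for a conventional chip.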
Package assembly: Since there was no existing package method suitable for WSE, Cerebras designed custom machines and processes to ensure precision alignment and handling.
Power and cooling: The chip’s concentrated power density exceeds traditional power delivery and cooling capabilities. Cerebras’ solution is to use the third dimension for power delivery, and add a water-cooled “cold plate” atop the chip to further reduce operating temperature.
WSE’s mind-blowing design has made a powerful first impression. Senior Director of AI Strategy & Products at eSilicon Carlos Macian called WSE “an incredible masterpiece of engineering… Whether it’s useful or not, it is a thing of beauty.”
Macian told Synced he thinks Cerebras can provide a new path forward for computer chips. “The recent reality of the market, in particular data centres, is we are running out of the space. Space is the maximum size of one individual chip limited by reticle size, which is 800 mm². You can’t build bigger chips than that. However, many systems require more than 800 mm².”
Says Macian, “AMD for example is putting several dies next to one another into a package. That is one way. Cerebras is saying my problem is bigger — it needs a lot of performance and many dies. Instead of having independent dies, they are building a system that actually contains many dies.”
Is high-level performance on the scale of WSE actually needed? Macian says yes. “The percentage of attached accelerators on servers today is below one percent. If you are betting that one percent will someday grow up to 20 percent, then yes, there is a market for that.”
Macian said he is also curious about the as-yet unrevealed WSE lab tests and hardware bring-up process (in which hardware is iteratively tested, validated and debugged until it is ready for manufacturing).
Lie told the Hot Chips audience that Cerebras will disclose additional information about WSE application performance and related data in the coming months.
Journalist: Tony Peng | Editor: Michael Sarazen