Machine learning (ML) models continue to grow in scale and algorithmic complexity, a trend that burdens system developers with ever-increasing compute and power requirements for training and deployment. How can these demands be met? Nvidia’s high-performance graphics processing units (GPUs), most recently the A100, have long dominated the AI accelerator market; Google entered the fray in 2016 with its Tensor Processing Units (TPUs). This week, Google introduced its newest entry, a TPU-based supercomputer it says is both faster and more efficient than the A100.
In the new paper TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings, a Google Research team presents TPU v4, the company’s latest supercomputer. At its full 4096-chip scale, TPU v4 is roughly ten times faster than its TPU v3 predecessor, and it is 1.2x–1.7x faster than Nvidia A100 GPUs while using 1.3x–1.9x less power. Google believes the performance, scalability, and availability of TPU v4 will make it the new workhorse for today’s compute-hungry large language models (LLMs).
The team summarizes their paper’s main contributions as follows:
- It describes and evaluates the first production deployment of Optical Circuit Switches (OCSes) in a supercomputer and the first deployment to allow topology reconfiguration to improve performance.
- It describes and evaluates the first accelerator support for embeddings in a commercial ML system.
- It documents the rapid change in production model types since 2016 in the fast-changing ML field.
- It shows how Google uses ML to co-optimize deep neural network (DNN) models, OCS topology, and the SparseCore.
A key improvement in this version is its use of Optical Circuit Switches (OCSes), which interconnect TPU v4’s 4096 chips via optical data links to improve scale, availability, utilization, modularity, deployment, security, power, and performance. The Google Palomar OCSes are based on 3D Micro-Electro-Mechanical Systems (MEMS) mirrors that switch in milliseconds, and they advance the state of the art in reliability and cost.
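To give a feel for the flexibility an OCS interconnect buys, here is a minimal, hypothetical sketch that enumerates the 3D torus shapes a 4096-chip pod could in principle be wired into. It assumes, purely for illustration (the paper's full topology rules are not reproduced here), that each torus dimension must be a positive multiple of 4, matching the 4x4x4 building-block cubes the chips are grouped into.

```python
# Hypothetical sketch: enumerate candidate 3D torus shapes for an
# OCS-based pod. The multiple-of-4 constraint is an assumption made
# for illustration, reflecting 4x4x4 building-block cubes.

def torus_shapes(total_chips=4096, block=4):
    """Return sorted (x, y, z) shapes with x*y*z == total_chips,
    each dimension a positive multiple of `block`, and x <= y <= z."""
    shapes = []
    for x in range(block, total_chips + 1, block):
        if total_chips % x:
            continue
        rest = total_chips // x
        for y in range(x, rest + 1, block):
            if rest % y:
                continue
            z = rest // y
            if z >= y and z % block == 0:
                shapes.append((x, y, z))
    return sorted(shapes)

if __name__ == "__main__":
    for shape in torus_shapes():
        print(shape)  # e.g. (4, 4, 256) ... (16, 16, 16)
```

Because the optical switches reconfigure in milliseconds, choosing among such shapes per job (rather than fixing one at cabling time) is what lets the system match topology to workload.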
Like TPU v3, the TPU v4 package comprises two Tensor Cores (TCs). Each TC contains four 128×128 Matrix Multiply Units (MXUs), a Vector Processing Unit (VPU), and a Vector Memory (VMEM). Thanks to its OCSes, TPU v4 can quickly and easily change topology to suit the application, the number of nodes, and the system running a job, significantly improving training time. Each TPU v4 also incorporates SparseCores, dataflow processors that accelerate embedding-reliant models by 5x–7x while using only 5 percent of the die area and power.
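The split between MXUs and SparseCores reflects two very different workload shapes. The following NumPy sketch (not Google's implementation; all sizes except the 128×128 MXU tile are illustrative) contrasts them: a dense matmul of the kind MXUs execute, and an embedding lookup, which is a sparse gather plus a pooling reduction of the kind SparseCores accelerate.

```python
import numpy as np

# Minimal sketch of the two workload shapes discussed above.
# Table size, batch size, and pooling choice are illustrative only.
rng = np.random.default_rng(0)

# Dense path: a 128x128 matmul, the native tile size of one MXU.
a = rng.standard_normal((128, 128), dtype=np.float32)
b = rng.standard_normal((128, 128), dtype=np.float32)
dense_out = a @ b  # regular, compute-bound

# Sparse path: an embedding lookup is a gather followed by a reduction.
vocab, dim = 10_000, 64
table = rng.standard_normal((vocab, dim), dtype=np.float32)
batch_ids = rng.integers(0, vocab, size=(32, 8))  # 32 examples, 8 ids each
embedded = table[batch_ids]                       # gather: (32, 8, 64)
pooled = embedded.mean(axis=1)                    # per-example pooling: (32, 64)
```

The sparse path is dominated by irregular memory accesses rather than arithmetic, which is why a dedicated dataflow unit can speed it up so much while occupying a small fraction of the die.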
The researchers’ empirical study shows that TPU v4 is 2.1x faster than TPU v3 and improves performance per Watt by 2.7x, achieves ~4.3x–4.5x faster speeds than the Graphcore IPU Bow, and is 1.2x–1.7x faster than Nvidia A100s with 1.3x–1.9x less power consumption. The paper also reports that TPU v4s deployed in Google Cloud’s energy-optimized warehouse-scale computers use ~2x–6x less energy and produce ~20x less CO2e than contemporary domain-specific architectures (DSAs) in typical data centers.
The paper TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings will be presented at ISCA 2023 (International Symposium on Computer Architecture) in the Industry Track and is available on arXiv.
Author: Hecate He | Editor: Michael Sarazen