
Kwai, Kuaishou & ETH Zürich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters

A research team from Kwai Inc., Kuaishou Technology and ETH Zürich builds PERSIA, an efficient distributed training system that leverages a novel hybrid training algorithm to ensure both training efficiency and accuracy for extremely large deep learning recommender systems of up to 100 trillion parameters.

Modern recommender systems have countless real-world applications and have made rapid progress thanks to the ever-increasing size of deep neural network models, which have grown from Google’s 2016 model with 1 billion parameters to Facebook’s latest model with 12 trillion parameters. There seems to be no limit to the significant quality boosts delivered by such model scaling, and deep learning practitioners believe the era of 100 trillion parameter systems will arrive sooner rather than later.

Training extremely large models that are both memory- and computation-intensive is, however, challenging, even with the support of industrial-scale data centers. In the new paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters, a research team from Kwai Inc., Kuaishou Technology and ETH Zürich proposes PERSIA, an efficient distributed training system that leverages a novel hybrid training algorithm to ensure both training efficiency and accuracy in such recommender models. The team provides theoretical analysis and empirical studies to validate the effectiveness of PERSIA on recommender systems of up to 100 trillion parameters.

The team summarizes their study’s main contributions as:

  1. We present a natural but novel hybrid training algorithm to tackle the embedding layer and dense neural network modules differently.
  2. We provide a rigorous theoretical analysis on its convergence behaviour and further connect the characteristics of a recommender model to its convergence to justify its effectiveness.
  3. We design a distributed system to manage the hybrid computation resources (CPUs and GPUs) to optimize the co-existence of asynchronicity and synchronicity in the training algorithm.
  4. We evaluate PERSIA using both publicly available benchmark tasks and real-world tasks at Kwai. We show that PERSIA scales out effectively and produces up to 7.12× speedups compared to state-of-the-art approaches.

The team first proposes a novel sync-async hybrid algorithm, where the embedding module trains in an asynchronous fashion while the dense neural network is updated synchronously. This hybrid algorithm enables hardware efficiency that is comparable with the fully asynchronous mode without sacrificing statistical efficiency.
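The sync-async split described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not PERSIA's implementation: the scalar model `w * emb[id]`, the function name `hybrid_train`, and the fixed `staleness` delay are all hypothetical choices made for illustration. The dense weight is updated from gradients averaged across workers at every step (standing in for a synchronous all-reduce), while embedding gradients are queued and applied a few steps late (standing in for asynchronous updates).

```python
import random
from collections import deque


def hybrid_train(samples, num_ids, num_workers=4, staleness=2,
                 lr=0.02, steps=500, seed=0):
    """Toy hybrid training: synchronous dense update, delayed embedding update.

    samples: list of (feature_id, target) pairs.
    Model (an assumption for illustration): prediction = w * emb[feature_id].
    """
    rng = random.Random(seed)
    emb = [rng.uniform(0.5, 1.5) for _ in range(num_ids)]  # 1-d "embedding table"
    w = 1.0                                                 # single dense parameter
    pending = deque()                                       # embedding grads in flight
    losses = []

    for _ in range(steps):
        # Each simulated worker draws one sample for this step.
        batch = [rng.choice(samples) for _ in range(num_workers)]
        dense_grads, emb_grads, loss = [], [], 0.0
        for idx, y in batch:
            err = w * emb[idx] - y
            loss += 0.5 * err * err
            dense_grads.append(err * emb[idx])   # dL/dw
            emb_grads.append((idx, err * w))     # dL/d emb[idx]

        # Synchronous part: average dense gradients over all workers, apply now.
        w -= lr * sum(dense_grads) / num_workers

        # Asynchronous part: queue embedding gradients and apply only those
        # computed `staleness` steps ago, so they act on newer parameters.
        pending.append(emb_grads)
        if len(pending) > staleness:
            for idx, g in pending.popleft():
                emb[idx] -= lr * g

        losses.append(loss / num_workers)

    return w, emb, losses
```

Calling `hybrid_train([(0, 1.0), (1, 2.0), (2, 3.0)], num_ids=3)` returns the trained dense weight, the embedding list, and the per-step loss history. Each sample touches only one embedding entry, which is the sparse access pattern that makes delayed embedding updates plausible in this toy setting.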

The team designed PERSIA (parallel recommendation training system with hybrid acceleration) to support this hybrid algorithm, addressing two fundamental aspects: 1) the placement of the training workflow over a heterogeneous cluster, and 2) the corresponding training procedure over the hybrid infrastructure. PERSIA features four modules designed to provide efficient autoscaling and support recommender models of up to 100 trillion parameters:

  1. A data loader that fetches training data from distributed storage.
  2. An embedding parameter server (PS) that manages the storage and updates of parameters in the embedding layer.
  3. A group of embedding workers for fetching the embedding parameters from the embedding PS, aggregating embedding vectors (potentially), and putting embedding gradients back to the embedding PS.
  4. A group of NN workers that run the forward and backward propagation of the neural network.
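How the four modules hand data to one another can be sketched in a single process. The class and function names below (`EmbeddingPS`, `EmbeddingWorker`, `NNWorker`, `data_loader`, `train`) are hypothetical and do not reflect PERSIA's actual API. The sketch also assumes sum pooling as the aggregation step and a single linear layer as the "neural network", and it runs everything sequentially rather than across CPU and GPU machines.

```python
class EmbeddingPS:
    """In-memory stand-in for the embedding parameter server."""

    def __init__(self, dim, lr=0.1):
        self.dim, self.lr, self.table = dim, lr, {}

    def lookup(self, ids):
        # Unseen ids get a freshly initialised vector.
        return [self.table.setdefault(i, [0.1] * self.dim) for i in ids]

    def apply_gradients(self, ids, grads):
        for i, g in zip(ids, grads):
            self.table[i] = [v - self.lr * gv for v, gv in zip(self.table[i], g)]


class EmbeddingWorker:
    """Fetches embeddings, aggregates them, and pushes gradients back."""

    def __init__(self, ps):
        self.ps = ps

    def forward(self, ids):
        vectors = self.ps.lookup(ids)
        return [sum(col) for col in zip(*vectors)]  # sum pooling (assumed)

    def backward(self, ids, pooled_grad):
        # Under sum pooling, every looked-up vector receives the same gradient.
        self.ps.apply_gradients(ids, [pooled_grad] * len(ids))


class NNWorker:
    """Runs forward and backward propagation of a one-layer linear model."""

    def __init__(self, dim, lr=0.1):
        self.w, self.lr = [0.5] * dim, lr

    def step(self, pooled, label):
        pred = sum(a * b for a, b in zip(self.w, pooled))
        err = pred - label
        grad_pooled = [err * a for a in self.w]  # gradient sent to embedding worker
        self.w = [a - self.lr * err * p for a, p in zip(self.w, pooled)]
        return 0.5 * err * err, grad_pooled


def data_loader():
    """Yields (feature ids, label) pairs; a stand-in for distributed storage."""
    yield from [([1, 2], 1.0), ([2, 3], 0.0), ([1, 3], 1.0)]


def train(epochs=100):
    ps = EmbeddingPS(dim=2)
    emb_worker, nn_worker = EmbeddingWorker(ps), NNWorker(dim=2)
    history = []
    for _ in range(epochs):
        total = 0.0
        for ids, label in data_loader():
            pooled = emb_worker.forward(ids)            # embedding PS -> worker
            loss, grad = nn_worker.step(pooled, label)  # dense forward/backward
            emb_worker.backward(ids, grad)              # gradients back to PS
            total += loss
        history.append(total)
    return ps, history
```

Running `ps, history = train()` returns the populated parameter server and the per-epoch summed loss. Only pooled activations and their gradients cross the boundary between the embedding side and the NN side, which is the kind of separation that would let the two halves run on different hardware.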

The team evaluated PERSIA on three open-source benchmarks (Taobao-Ad, Avazu-Ad and Criteo-Ad) and the real-world production microvideo recommendation workflow at Kwai. They used two state-of-the-art distributed recommender training systems — XDL and PaddlePaddle — as their baselines.

In the evaluations, the proposed hybrid algorithm achieved higher throughput than all baseline systems. PERSIA reached nearly linear speedups, with significantly higher throughput than XDL and PaddlePaddle and 3.8× higher throughput than the fully synchronous algorithm on the Kwai-video benchmark. Moreover, PERSIA maintained stable training throughput even as model size increased up to 100 trillion parameters, achieving 2.6× higher throughput than the fully synchronous mode.

Overall, the results show that PERSIA supports efficient and scalable training of recommender models at a scale of up to 100 trillion parameters. The team hopes their study and insights can benefit both academia and industry.

The code is available on the project’s GitHub. The paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

