
ETH Zürich & Microsoft Study: Demystifying Serverless ML Training

A research team from ETH Zürich and Microsoft presents a systematic, comparative study of distributed ML training over serverless infrastructures (FaaS) and “serverful” infrastructures (IaaS), aiming to understand the associated system tradeoffs.

Serverless computing is a relatively new cloud computation model initially developed for web microservices and IoT applications. Because it frees model developers from capacity planning and from configuring, managing, maintaining, operating and scaling containers, VMs and physical servers, serverless computing has gained popularity with machine learning (ML) researchers in recent years.

Moreover, the benefits of serverless computing have also piqued interest in adopting it for data-intensive workloads such as ETL (extract, transform, load), query processing and ML, where it can deliver significant cost reductions. Riding this trend, the ETH Zürich and Microsoft researchers conducted their comparative study of FaaS versus IaaS to identify and understand the system tradeoffs involved in distributed ML training on serverless infrastructures.


Serverless computing is offered by every major cloud service provider through platforms such as AWS Lambda, Azure Functions and Google Cloud Functions. Although researchers are increasingly choosing FaaS for ML inference, it remains unclear whether FaaS is a good choice for ML training. The “training-as-a-service” paradigm appeals to both industry and academia, and AWS now provides serverless ML training in AWS Lambda via the SageMaker and AutoGluon platforms.

The paper Towards Demystifying Serverless Machine Learning Training poses the question: When can a serverless infrastructure (FaaS) outperform a VM-based, “serverful” infrastructure (IaaS) for distributed ML training?

The team summarizes their contributions as:

  1. Systematically explore the algorithm choice and system design for both FaaS and IaaS ML training strategies and depict the tradeoff over a diverse range of ML models, training workloads, and infrastructure choices.
  2. Develop an analytical model that characterizes the tradeoff between FaaS- and IaaS-based training, and use it to speculate on the performance of configurations that future systems might adopt.

The team uses LambdaML, a prototype FaaS-based ML system built on top of AWS Lambda, to study the tradeoffs involved in training ML models over serverless infrastructures. With this approach, a user specifies training configurations such as data location, resources, optimization algorithm and hyperparameters in the AWS web UI. The job is then submitted to the serverless infrastructure, which allocates resources according to the user’s request. The training data is partitioned and stored in AWS S3, and each “worker” (running instance) maintains a local model copy and uses the LambdaML library to train the model. The LambdaML training pipeline comprises five steps: load data, compute statistics, send statistics, aggregate statistics and update model.
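A minimal sketch of what one such worker iteration might look like, assuming S3 as the communication channel and gradient averaging; the bucket layout, key scheme and least-squares gradient below are our own illustration, not LambdaML's actual code:

```python
import io
import time

import boto3
import numpy as np

s3 = boto3.client("s3")

def wait_and_read(bucket, key):
    # Poll the shared channel until a peer's object appears (naive sync).
    while True:
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            return np.frombuffer(body, dtype=np.float64)
        except s3.exceptions.NoSuchKey:
            time.sleep(0.1)

def worker(bucket, partition_key, worker_id, n_workers, n_iters, lr=0.01):
    # Step 1: load this worker's data partition from S3 (assumes the
    # partition was stored with np.save, labels in the last column).
    raw = s3.get_object(Bucket=bucket, Key=partition_key)["Body"].read()
    data = np.load(io.BytesIO(raw))
    X, y = data[:, :-1], data[:, -1]
    w = np.zeros(X.shape[1])
    for t in range(n_iters):
        # Step 2: compute local statistics (a least-squares gradient here).
        grad = X.T @ (X @ w - y) / len(y)
        # Step 3: send the statistics over the storage-based channel.
        s3.put_object(Bucket=bucket, Key=f"grad/{t}/{worker_id}",
                      Body=grad.astype(np.float64).tobytes())
        # Step 4: aggregate all workers' statistics (AllReduce over S3).
        grads = [wait_and_read(bucket, f"grad/{t}/{i}")
                 for i in range(n_workers)]
        # Step 5: update the local model copy.
        w -= lr * np.mean(grads, axis=0)
    return w
```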

The researchers explore four major aspects of LambdaML implementation: the distributed optimization algorithm, communication channels, communication patterns, and synchronization protocols. They focus on two distributed optimization algorithms, distributed stochastic gradient descent (SGD) and distributed alternating direction method of multipliers (ADMM), and employ a storage service such as S3 or ElastiCache as the communication channel. The team uses AllReduce and ScatterReduce as their communication patterns and designs a two-phase synchronous protocol that includes a merging and an updating phase.
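The difference between the two communication patterns can be sketched as follows, with put and get standing in as hypothetical blocking wrappers over whichever storage channel (S3 or ElastiCache) is in use:

```python
import numpy as np

def allreduce(grad, wid, n, put, get):
    # AllReduce: each worker publishes its full gradient, then fetches
    # and averages all n of them; per-worker traffic grows with n.
    put(f"g/{wid}", grad)
    return np.mean([get(f"g/{i}") for i in range(n)], axis=0)

def scatterreduce(grad, wid, n, put, get):
    # ScatterReduce: the gradient is split into n chunks and worker j
    # reduces only chunk j, so per-worker traffic stays roughly constant,
    # which matters when models are large.
    chunks = np.array_split(grad, n)
    for j, chunk in enumerate(chunks):
        put(f"chunk/{j}/{wid}", chunk)      # send chunk j to its reducer
    mine = np.mean([get(f"chunk/{wid}/{i}") for i in range(n)], axis=0)
    put(f"reduced/{wid}", mine)             # publish the reduced chunk
    return np.concatenate([get(f"reduced/{j}") for j in range(n)])
```

The two-phase synchronous protocol follows the same shape: workers first merge their statistics (the publish-and-reduce steps above) and only then move to the updating phase on their local model copies.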

The team evaluated LambdaML by comparing the design options described above on the Higgs, RCV1 and CIFAR-10 datasets. They implemented GA-SGD (SGD with gradient averaging), MA-SGD (SGD with model averaging) and ADMM on top of LambdaML, employing ElastiCache for Memcached as the external storage service.
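The practical difference between the two SGD variants lies in how often they communicate; an illustrative sketch (not the LambdaML implementation):

```python
import numpy as np

def ga_sgd_step(w, local_grads, lr):
    # GA-SGD: gradients are averaged every iteration, so each global
    # step is exact but costs one communication round.
    return w - lr * np.mean(local_grads, axis=0)

def ma_sgd_round(local_models):
    # MA-SGD: each worker first runs many local SGD steps, then the
    # model copies are averaged, trading some statistical efficiency
    # per step for far fewer communication rounds.
    return np.mean(local_models, axis=0)
```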


From the empirical results, the team concluded that FaaS can be faster than IaaS, but only in a specific regime: when the underlying workload can be made communication-efficient, both in terms of convergence and the amount of data communicated. A second insight holds across all scenarios: even when FaaS is faster than IaaS, it is not meaningfully cheaper, and the two usually end up at a comparable price.
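A back-of-envelope calculation makes this concrete. All prices and runtimes below are hypothetical, chosen only to illustrate the effect: Lambda bills per GB-second at a higher unit rate than equivalent VM capacity, so halving the wall-clock time can still leave the bill roughly unchanged.

```python
lambda_price_gb_s = 0.0000167   # illustrative per-GB-second Lambda rate
ec2_price_hr = 0.17             # illustrative 4-vCPU/8 GB VM hourly rate

# 10 Lambda workers with 2 GB of memory each, finishing in 600 s ...
faas_cost = 10 * 2.0 * 600 * lambda_price_gb_s
# ... versus 4 VMs taking twice as long (1200 s).
iaas_cost = 4 * ec2_price_hr * (1200 / 3600)

print(f"FaaS: ${faas_cost:.2f} in 600 s, IaaS: ${iaas_cost:.2f} in 1200 s")
# -> FaaS: $0.20 in 600 s, IaaS: $0.23 in 1200 s: twice as fast, same cost.
```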

Overall, the results validate LambdaML as a vehicle for fair comparisons between FaaS and IaaS systems, taking a significant step toward demystifying serverless ML training.

The paper Towards Demystifying Serverless Machine Learning Training is on arXiv.


Author: Hecate He | Editor: Michael Sarazen, Chain Zhang


