Characterizing Distributed Machine Learning and Deep Learning Workloads

Bibliographic Details
Main Authors: Djebrouni, Yasmine, Rocha, Isabelly, Bouchenak, Sara, Chen, Lydia Y., Felber, Pascal, Marangozova-Martin, Vania, Schiavoni, Valerio
Other Authors: Université Grenoble Alpes (UGA), Université de Neuchâtel (UNINE), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon - Institut National des Sciences Appliquées (INSA), Delft University of Technology (TU Delft)
Format: Conference Object
Language: English
Published: HAL CCSD 2021
Subjects: DML
Online Access: https://hal.science/hal-03344132
https://hal.science/hal-03344132/document
https://hal.science/hal-03344132/file/COMPAS2021_paper_12%20%2810%29.pdf
Description
Summary: This article was published at the Conférence francophone d'informatique en Parallélisme, Architecture et Système (COMPAS) 2021. Nowadays, machine learning (ML) is widely used in many application domains to analyze datasets and build decision-making systems. With the rapid growth of data, ML users have switched to distributed machine learning (DML) platforms to speed up execution and handle large-scale training datasets. However, DML platforms introduce complex execution environments that can overwhelm uninitiated users. To guide the tuning of DML platforms toward good performance, it is crucial to characterize DML workloads. In this work, we focus on popular DML and distributed deep learning (DDL) workloads that leverage Apache Spark. We characterize the impact on performance of several platform parameters related to distributed execution, such as parallelization, data shuffling, and scheduling. Based on our analysis, we derive key takeaways on DML/DDL workload patterns, as well as unexpected behavior of workloads based on ensemble learning methods.
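
To make the studied parameter families concrete, the following is a minimal PySpark sketch (not taken from the paper; the application name and all configuration values are illustrative assumptions) showing how parallelization, data shuffling, and scheduling, the three kinds of platform parameters the abstract mentions, are exposed through standard Spark configuration keys:

    from pyspark.sql import SparkSession

    # Minimal sketch of the Spark knobs corresponding to the parameter
    # families characterized in the paper. Values are illustrative
    # assumptions, not the settings used in the paper's experiments.
    spark = (
        SparkSession.builder
        .appName("dml-workload-characterization")      # hypothetical app name
        .config("spark.default.parallelism", "64")     # parallelization: default partition count for RDD operations
        .config("spark.sql.shuffle.partitions", "64")  # data shuffling: partitions produced by wide transformations
        .config("spark.shuffle.compress", "true")      # data shuffling: compress map outputs before transfer
        .config("spark.scheduler.mode", "FAIR")        # scheduling: FAIR sharing instead of the default FIFO
        .getOrCreate()
    )

Sweeping such keys across repeated runs of the same workload is one plausible way to carry out the kind of characterization the abstract describes; the abstract's remark on ensemble learning suggests that workloads such as MLlib's random forests may respond to these knobs differently than other DML/DDL workloads.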