Characterizing Distributed Machine Learning and Deep Learning Workloads
| Field | Value |
|---|---|
| Main Authors | |
| Other Authors | |
| Format | Conference Object |
| Language | English |
| Published | HAL CCSD, 2021 |
| Subjects | |
| Online Access | https://hal.science/hal-03344132 https://hal.science/hal-03344132/document https://hal.science/hal-03344132/file/COMPAS2021_paper_12%20%2810%29.pdf |
Summary: This article was published in the Conférence francophone d'informatique en Parallélisme, Architecture et Système (COMPAS) 2021. Nowadays, machine learning (ML) is widely used in many application domains to analyze datasets and build decision-making systems. With the rapid growth of data, ML users have switched to distributed machine learning (DML) platforms for faster execution on large-scale training datasets. However, DML platforms introduce complex execution environments that can overwhelm uninitiated users. To guide the tuning of DML platforms and achieve good performance, it is crucial to characterize DML workloads. In this work, we focus on popular DML and distributed deep learning (DDL) workloads that leverage Apache Spark. We characterize the impact on performance of several platform parameters related to distributed execution, such as parallelization, data shuffle, and scheduling. From our analysis, we derive key takeaways on DML/DDL workload patterns, as well as unexpected behavior of workloads based on ensemble learning methods.
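The platform parameters the abstract names (parallelization, data shuffle, scheduling) correspond to standard Apache Spark configuration keys. A minimal, illustrative `spark-defaults.conf` sketch is shown below; the values are placeholders for the kind of knobs one would tune, not the settings used in the paper:

```properties
# Parallelization: default number of partitions for RDD operations
spark.default.parallelism        128
# Data shuffle: number of partitions produced by joins/aggregations,
# and whether map outputs are compressed before going over the network
spark.sql.shuffle.partitions     128
spark.shuffle.compress           true
# Scheduling: task scheduling policy within an application (FIFO or FAIR)
spark.scheduler.mode             FAIR
```

Characterization studies like this one typically vary such keys one at a time and measure the resulting change in job runtime and resource usage.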