Characterizing Distributed Machine Learning Workloads on Apache Spark
Published in: Proceedings of the 24th International Middleware Conference
Format: Conference Object
Language: English
Published: HAL CCSD, 2023
Online Access: https://hal.science/hal-04399409 and https://doi.org/10.1145/3590140.3629112
Summary: Distributed machine learning (DML) environments are widely used in many application domains to build decision-making systems. However, the complexity of these environments is overwhelming for novice users. On the one hand, data scientists are more familiar with hyper-parameter tuning and typically lack an understanding of the trade-offs and challenges of parameterizing DML platforms to achieve good performance. On the other hand, system administrators focus on tuning distributed platforms, unaware of the possible implications of the platform on the quality of the learning models. To shed light on such parameter configuration interplay, we run multiple DML workloads on the widely used Apache Spark distributed platform, leveraging 13 popular learning methods and 6 real-world datasets on two distinct clusters. We collect and perform an in-depth analysis of workload execution traces to compare the efficiency of different configuration strategies. We consider tuning only hyper-parameters, tuning only platform parameters, and jointly tuning both hyper-parameters and platform parameters. We publicly release our collected traces and derive key takeaways on DML workloads. Counter-intuitively, platform parameters have a higher impact on the model quality than hyper-parameters. More generally, we show that multi-level parameter configuration can provide better results in terms of model quality and execution time while also optimizing resource costs.
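The multi-level configuration the abstract describes can be made concrete with a small sketch. The PySpark fragment below is hypothetical, not code from the paper's released artifact: it shows a single workload where both platform parameters (executor sizing, shuffle partitions) and MLlib hyper-parameters (regularization strength, iteration budget) are set together. All parameter values and the dataset path are illustrative.

```python
# Hypothetical sketch of joint platform + hyper-parameter configuration
# for one DML workload on Apache Spark. Values are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

# Platform parameters: per the paper, these can affect not only
# execution time and resource cost but also model quality.
spark = (SparkSession.builder
         .appName("dml-tuning-sketch")
         .config("spark.executor.memory", "4g")          # platform parameter
         .config("spark.executor.cores", "2")            # platform parameter
         .config("spark.sql.shuffle.partitions", "64")   # platform parameter
         .getOrCreate())

# Hypothetical dataset in LIBSVM format (columns: label, features).
data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

# Hyper-parameters: regularization strength and iteration budget.
lr = LogisticRegression(regParam=0.01, maxIter=50)
model = lr.fit(data)

# Training accuracy from MLlib's built-in training summary.
print(model.summary.accuracy)
```

On a real cluster the platform parameters would typically be passed via spark-submit rather than hard-coded in the application; the point of the sketch is only that both configuration levels apply to the same workload, which is the interplay the study measures.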