Network scheduling for distributed machine learning

Bibliographic Details
Main Author:	Xia, Jiacheng
Format:	Thesis
Language:	English
Published:	2019
Subjects:	Computer scheduling; Machine learning; Distributed artificial intelligence; DML
Online Access:	https://repository.hkust.edu.hk/ir/Record/1783.1-123718 https://doi.org/10.14711/thesis-991012757568303412 https://repository.hkust.edu.hk/ir/bitstream/1783.1-123718/1/th_redirect.html

Description
Summary:	Distributed machine learning (DML) is of growing importance. As data scale and model complexity increase, many important machine learning problems can no longer be solved effectively on a single machine. Existing scheduling algorithms are insufficient because of the complex computation-communication pattern of DML. In the training stage of DML, the network becomes a bottleneck: models trained on different machines must be synchronized and updated frequently, transmitting megabytes to gigabytes of parameters at second to sub-second intervals. In this thesis, we focus on network scheduling problems for DML. First, we propose SaSP, an intra-job scheduler that allocates resources to the processes of a single DML job running on different servers. We show that DML training runs faster when the scheduler design decouples the computation and communication processes. Our prototype shows a 25% to 50% speedup over different parameter synchronization schemes on various DML applications. Second, we present DeepProphet, a tool that analyzes computation and network resource requirements offline by examining the dataflow graph that represents the DML application. Given a hardware configuration, DeepProphet predicts the iteration completion time with an average error below 10%. We demonstrate that the resource requirements of DML can be estimated accurately through offline analysis, a feature that benefits later inter-job scheduler designs.


Similar Items