Efficient Online Scheduling for Coflow-aware Machine Learning Clusters
Distributed machine learning (DML) is an increasingly important workload. In a DML job, each communication stage can comprise a coflow, and there are dependencies among its coflows. Thus, efficient coflow scheduling becomes critical for DML jobs. However, the majority of existing solutions focus on...
Published in: | IEEE Transactions on Cloud Computing |
---|---|
Main Authors: | Li, Wenxin; Chen, Sheng; Li, Keqiu; Qi, Heng; Xu, Renhai; Zhang, Song |
Format: | Article in Journal/Newspaper |
Language: | English |
Published: | 2020 |
Subjects: | Coflow Scheduling; Dependent Coflows; Distributed Machine Learning; Multi-Stage Job |
Online Access: | http://repository.ust.hk/ir/Record/1783.1-109354 https://doi.org/10.1109/TCC.2020.3040312 http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=2168-7161&rft.volume=&rft.issue=&rft.date=2020&rft.spage=&rft.aulast=Li&rft.aufirst=&rft.atitle=Efficient+Online+Scheduling+for+Coflow-aware+Machine+Learning+Clusters&rft.title=IEEE+Transactions+on+Cloud+Computing http://www.scopus.com/record/display.url?eid=2-s2.0-85097150313&origin=inward |
id |
ftunivsthongkong:oai:repository.ust.hk:1783.1-109354 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
The Hong Kong University of Science and Technology: HKUST Institutional Repository |
op_collection_id |
ftunivsthongkong |
language |
English |
topic |
Coflow Scheduling Dependent Coflows Distributed Machine Learning Multi-Stage Job |
description |
Distributed machine learning (DML) is an increasingly important workload. In a DML job, each communication stage can comprise a coflow, and there are dependencies among its coflows. Thus, efficient coflow scheduling becomes critical for DML jobs. However, the majority of existing solutions focus on scheduling single-stage coflows with no dependencies. While a few studies schedule dependent coflows of multi-stage jobs, they suffer from either practical or theoretical issues. In this paper, we study how to schedule dependent coflows of multiple DML jobs to minimize the total job completion time (JCT) in a shared cluster. To solve this problem without any prior knowledge of job information, we present an online coflow-aware optimization framework called Parrot. The core idea in Parrot is to infer, at each step, the job with the shortest remaining processing time (SRPT) and to dynamically control the inferred job's bandwidth based on how confident it is that the job is indeed the SRPT job, while taking care not to starve any other job. We have proved that the Parrot algorithm has an approximation ratio of $O(M)$, where $M$ is the number of jobs. The results from large-scale trace-driven simulations further demonstrate that Parrot can reduce the total JCT by up to 58.4% compared to the state-of-the-art solution Aalo. IEEE |
format |
Article in Journal/Newspaper |
author |
Li, Wenxin Chen, Sheng Li, Keqiu Qi, Heng Xu, Renhai Zhang, Song |
author_sort |
Li, Wenxin |
title |
Efficient Online Scheduling for Coflow-aware Machine Learning Clusters |
title_sort |
efficient online scheduling for coflow-aware machine learning clusters |
publishDate |
2020 |
url |
http://repository.ust.hk/ir/Record/1783.1-109354 https://doi.org/10.1109/TCC.2020.3040312 http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=2168-7161&rft.volume=&rft.issue=&rft.date=2020&rft.spage=&rft.aulast=Li&rft.aufirst=&rft.atitle=Efficient+Online+Scheduling+for+Coflow-aware+Machine+Learning+Clusters&rft.title=IEEE+Transactions+on+Cloud+Computing http://www.scopus.com/record/display.url?eid=2-s2.0-85097150313&origin=inward |
genre |
DML |
op_relation |
http://repository.ust.hk/ir/Record/1783.1-109354 IEEE Transactions on Cloud Computing, 2020 2168-7161 https://doi.org/10.1109/TCC.2020.3040312 http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=2168-7161&rft.volume=&rft.issue=&rft.date=2020&rft.spage=&rft.aulast=Li&rft.aufirst=&rft.atitle=Efficient+Online+Scheduling+for+Coflow-aware+Machine+Learning+Clusters&rft.title=IEEE+Transactions+on+Cloud+Computing http://www.scopus.com/record/display.url?eid=2-s2.0-85097150313&origin=inward |
op_doi |
https://doi.org/10.1109/TCC.2020.3040312 |
container_title |
IEEE Transactions on Cloud Computing |
container_start_page |
1 |
op_container_end_page |
1 |
_version_ |
1766397272734564352 |
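
The description field above summarizes the idea of guessing the SRPT job and scaling its bandwidth by a confidence value while not starving other jobs. The following is a minimal sketch of that general idea only, not the Parrot algorithm from the article. Everything in it is an assumption made for illustration: the `Job` class, the `allocate_bandwidth` function, the `confidence` and `min_share` parameters, and in particular the heuristic that the job with the least traffic sent so far is the likely SRPT job.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str          # assumed unique per job
    bytes_sent: float  # traffic observed so far; the only signal assumed available online


def allocate_bandwidth(jobs, capacity, confidence, min_share=0.1):
    """Return {job name: bandwidth}. Illustrative only, not the Parrot algorithm."""
    if not jobs:
        return {}
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if not 0.0 <= min_share <= 1.0:
        raise ValueError("min_share must lie in [0, 1]")

    # Assumption: guess the SRPT job as the one that has sent the least so far.
    guess = min(jobs, key=lambda j: j.bytes_sent)

    # Reserve a fraction of the link and split it evenly so that no job starves.
    floor = min_share * capacity / len(jobs)
    remaining = capacity * (1.0 - min_share)

    # The guessed job receives a boost proportional to the confidence;
    # whatever is left is shared equally among all jobs.
    boost = confidence * remaining
    equal = (remaining - boost) / len(jobs)

    alloc = {j.name: floor + equal for j in jobs}
    alloc[guess.name] += boost
    return alloc


if __name__ == "__main__":
    jobs = [Job("A", 10.0), Job("B", 50.0), Job("C", 200.0)]
    for name, bw in allocate_bandwidth(jobs, capacity=100.0, confidence=0.6).items():
        print(f"{name}: {bw:.2f}")
```

With `confidence` set to 0 the sketch degenerates to equal sharing, and with a positive `min_share` every job keeps a nonzero allocation even when `confidence` is 1. How the article actually infers the SRPT job and computes its confidence is not stated in this record.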