id ftunivsthongkong:oai:repository.hkust.edu.hk:1783.1-109354
record_format openpolar
spelling ftunivsthongkong:oai:repository.hkust.edu.hk:1783.1-109354 2023-05-15T16:01:23+02:00 Efficient Online Scheduling for Coflow-aware Machine Learning Clusters Li, Wenxin Chen, Sheng Li, Keqiu Qi, Heng Xu, Renhai Zhang, Song 2022 https://repository.hkust.edu.hk/ir/Record/1783.1-109354 https://doi.org/10.1109/TCC.2020.3040312 http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=2168-7161&rft.volume=&rft.issue=&rft.date=2020&rft.spage=&rft.aulast=Li&rft.aufirst=&rft.atitle=Efficient+Online+Scheduling+for+Coflow-aware+Machine+Learning+Clusters&rft.title=IEEE+Transactions+on+Cloud+Computing http://www.scopus.com/record/display.url?eid=2-s2.0-85097150313&origin=inward http://gateway.isiknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=LinksAMR&SrcApp=PARTNER_APP&DestLinkType=FullRecord&DestApp=WOS&KeyUT=000894810300024 English eng Ieee-inst Electrical Electronics Engineers Inc https://repository.hkust.edu.hk/ir/Record/1783.1-109354 IEEE Transactions on Cloud Computing, v. 
10, (4), October 2022, article number 9269382 2168-7161 https://doi.org/10.1109/TCC.2020.3040312 http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=2168-7161&rft.volume=&rft.issue=&rft.date=2020&rft.spage=&rft.aulast=Li&rft.aufirst=&rft.atitle=Efficient+Online+Scheduling+for+Coflow-aware+Machine+Learning+Clusters&rft.title=IEEE+Transactions+on+Cloud+Computing http://www.scopus.com/record/display.url?eid=2-s2.0-85097150313&origin=inward http://gateway.isiknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=LinksAMR&SrcApp=PARTNER_APP&DestLinkType=FullRecord&DestApp=WOS&KeyUT=000894810300024 Coflow Scheduling Dependent Coflows Distributed Machine Learning Multi-Stage Job Article 2022 ftunivsthongkong https://doi.org/10.1109/TCC.2020.3040312 2023-03-10T01:10:35Z Distributed machine learning (DML) is an increasingly important workload. In a DML job, each communication stage can comprise a coflow, and there are dependencies among its coflows. Thus, efficient coflow scheduling becomes critical for DML jobs. However, the majority of existing solutions focus on scheduling single-stage coflows with no dependencies. Although a few studies schedule dependent coflows of multi-stage jobs, they suffer from either practical or theoretical issues. In this paper, we study how to schedule dependent coflows of multiple DML jobs to minimize the total job completion time (JCT) in a shared cluster. To solve this problem without any prior knowledge of job information, we present an online coflow-aware optimization framework called Parrot. The core idea in Parrot is to infer the job with the shortest remaining processing time (SRPT) each time and dynamically control the inferred job's bandwidth based on how confident it is that the job is an SRPT job, while taking care not to starve any other job. 
We have proved that the Parrot algorithm has an approximation ratio of $O(M)$, where $M$ is the number of jobs. The results from large-scale trace-driven simulations further demonstrate that Parrot can reduce the total JCT by up to 58.4% compared to the state-of-the-art solution Aalo. IEEE Article in Journal/Newspaper DML The Hong Kong University of Science and Technology: HKUST Institutional Repository IEEE Transactions on Cloud Computing 10 4 2564 2579
institution Open Polar
collection The Hong Kong University of Science and Technology: HKUST Institutional Repository
op_collection_id ftunivsthongkong
language English
topic Coflow Scheduling
Dependent Coflows
Distributed Machine Learning
Multi-Stage Job
spellingShingle Coflow Scheduling
Dependent Coflows
Distributed Machine Learning
Multi-Stage Job
Li, Wenxin
Chen, Sheng
Li, Keqiu
Qi, Heng
Xu, Renhai
Zhang, Song
Efficient Online Scheduling for Coflow-aware Machine Learning Clusters
topic_facet Coflow Scheduling
Dependent Coflows
Distributed Machine Learning
Multi-Stage Job
description Distributed machine learning (DML) is an increasingly important workload. In a DML job, each communication stage can comprise a coflow, and there are dependencies among its coflows. Thus, efficient coflow scheduling becomes critical for DML jobs. However, the majority of existing solutions focus on scheduling single-stage coflows with no dependencies. Although a few studies schedule dependent coflows of multi-stage jobs, they suffer from either practical or theoretical issues. In this paper, we study how to schedule dependent coflows of multiple DML jobs to minimize the total job completion time (JCT) in a shared cluster. To solve this problem without any prior knowledge of job information, we present an online coflow-aware optimization framework called Parrot. The core idea in Parrot is to infer the job with the shortest remaining processing time (SRPT) each time and dynamically control the inferred job's bandwidth based on how confident it is that the job is an SRPT job, while taking care not to starve any other job. We have proved that the Parrot algorithm has an approximation ratio of $O(M)$, where $M$ is the number of jobs. The results from large-scale trace-driven simulations further demonstrate that Parrot can reduce the total JCT by up to 58.4% compared to the state-of-the-art solution Aalo.
format Article in Journal/Newspaper
author Li, Wenxin
Chen, Sheng
Li, Keqiu
Qi, Heng
Xu, Renhai
Zhang, Song
author_facet Li, Wenxin
Chen, Sheng
Li, Keqiu
Qi, Heng
Xu, Renhai
Zhang, Song
author_sort Li, Wenxin
title Efficient Online Scheduling for Coflow-aware Machine Learning Clusters
title_short Efficient Online Scheduling for Coflow-aware Machine Learning Clusters
title_full Efficient Online Scheduling for Coflow-aware Machine Learning Clusters
title_fullStr Efficient Online Scheduling for Coflow-aware Machine Learning Clusters
title_full_unstemmed Efficient Online Scheduling for Coflow-aware Machine Learning Clusters
title_sort efficient online scheduling for coflow-aware machine learning clusters
publisher Ieee-inst Electrical Electronics Engineers Inc
publishDate 2022
url https://repository.hkust.edu.hk/ir/Record/1783.1-109354
https://doi.org/10.1109/TCC.2020.3040312
http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=2168-7161&rft.volume=&rft.issue=&rft.date=2020&rft.spage=&rft.aulast=Li&rft.aufirst=&rft.atitle=Efficient+Online+Scheduling+for+Coflow-aware+Machine+Learning+Clusters&rft.title=IEEE+Transactions+on+Cloud+Computing
http://www.scopus.com/record/display.url?eid=2-s2.0-85097150313&origin=inward
http://gateway.isiknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=LinksAMR&SrcApp=PARTNER_APP&DestLinkType=FullRecord&DestApp=WOS&KeyUT=000894810300024
genre DML
genre_facet DML
op_relation https://repository.hkust.edu.hk/ir/Record/1783.1-109354
IEEE Transactions on Cloud Computing, v. 10, (4), October 2022, article number 9269382
2168-7161
https://doi.org/10.1109/TCC.2020.3040312
http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=2168-7161&rft.volume=&rft.issue=&rft.date=2020&rft.spage=&rft.aulast=Li&rft.aufirst=&rft.atitle=Efficient+Online+Scheduling+for+Coflow-aware+Machine+Learning+Clusters&rft.title=IEEE+Transactions+on+Cloud+Computing
http://www.scopus.com/record/display.url?eid=2-s2.0-85097150313&origin=inward
http://gateway.isiknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=LinksAMR&SrcApp=PARTNER_APP&DestLinkType=FullRecord&DestApp=WOS&KeyUT=000894810300024
op_doi https://doi.org/10.1109/TCC.2020.3040312
container_title IEEE Transactions on Cloud Computing
container_volume 10
container_issue 4
container_start_page 2564
op_container_end_page 2579
_version_ 1766397273961398272