RAT - Resilient Allreduce Tree for Distributed Machine Learning

Bibliographic Details
Published in: 4th Asia-Pacific Workshop on Networking
Main Authors: Wan, Xinchen; Zhang, Hong; Wang, Hao; Hu, Shuihai; Zhang, Junxue; Chen, Kai
Format: Conference Object
Language: English
Published: Association for Computing Machinery 2020
Subjects:
DML
Online Access: http://repository.ust.hk/ir/Record/1783.1-107368
https://doi.org/10.1145/3411029.3411037
http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=&rft.volume=&rft.issue=&rft.date=2020&rft.spage=52&rft.aulast=Wan&rft.aufirst=&rft.atitle=RAT-Resilient+Allreduce+Tree+for+Distributed+Machine+Learning&rft.title=ACM+International+Conference+Proceeding+Series
http://www.scopus.com/record/display.url?eid=2-s2.0-85094889062&origin=inward
Description
Summary: Parameter/gradient exchange plays an important role in large-scale distributed machine learning (DML). However, prior solutions such as parameter server (PS) and ring-allreduce (Ring) fall short because they are not resilient to issues or uncertainties such as oversubscription, congestion, or failures that may occur in datacenter networks (DCN). This paper proposes RAT, a new solution that determines the communication pattern for DML. At its heart, RAT establishes allreduce trees that take into account the physical topology and its oversubscription condition. The allreduce trees specify the aggregation pattern: each aggregator is responsible for aggregating gradients from all workers within an oversubscribed region in the reduce phase and for broadcasting the updates back to those workers in the broadcast phase. We show that this approach can effectively reduce cross-region traffic and shorten the dependency chain compared to prior solutions. We have evaluated RAT in both an oversubscribed network and a network with failures and found that RAT is resilient to these issues or uncertainties. For example, it delivers an average speedup of 25X over PS in an oversubscribed network and 5.7X over Ring in a network with failures. © 2020 ACM.
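
The reduce and broadcast phases described in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`vec_sum`, `tree_allreduce`, `flat_ps_cross_region`), the two-level tree, the region labels, and the assumption that the root sits above all regions are hypothetical choices made for illustration only. The sketch sums gradients inside each region first, then combines one partial result per region, and counts how many transfers cross a region boundary compared with a single flat parameter server.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def vec_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def tree_allreduce(
    regions: Dict[str, List[Vector]]
) -> Tuple[Dict[str, List[Vector]], int]:
    """Two-level reduce + broadcast over hypothetical regions.

    Assumption: one aggregator per region and a root placed above all
    regions. Returns the per-worker results and the number of
    cross-region vector transfers.
    """
    # Reduce phase, level 1: each aggregator sums its region's gradients.
    partial = {name: vec_sum(grads) for name, grads in regions.items()}
    # Reduce phase, level 2: the root sums one partial vector per region.
    total = vec_sum(list(partial.values()))
    # Broadcast phase: the root's result flows back to every worker.
    result = {
        name: [list(total) for _ in grads] for name, grads in regions.items()
    }
    # Only one vector up and one vector down per region leave the region.
    cross_region = 2 * len(regions)
    return result, cross_region


def flat_ps_cross_region(regions: Dict[str, List[Vector]], ps_region: str) -> int:
    """Cross-region transfers when every worker pushes to and pulls from
    a single parameter server located in `ps_region`."""
    remote_workers = sum(
        len(grads) for name, grads in regions.items() if name != ps_region
    )
    return 2 * remote_workers


if __name__ == "__main__":
    regions = {
        "rack0": [[1.0, 2.0], [3.0, 4.0], [1.0, 1.0]],
        "rack1": [[5.0, 6.0], [7.0, 8.0], [0.0, 0.0]],
    }
    result, tree_cost = tree_allreduce(regions)
    print(result["rack0"][0], tree_cost, flat_ps_cross_region(regions, "rack0"))
```

In this toy setting the tree's cross-region cost depends on the number of regions, while the flat parameter server's cost depends on the number of remote workers. The sketch does not model link bandwidth, congestion, or failures, so it says nothing about the speedups reported in the summary.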