Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Distributed machine learning (DML) technology makes it possible to train large neural networks in a reasonable amount of time. Meanwhile, as the computing power grows much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU cluster...


Bibliographic Details
Main Authors: Han, Xinchi, Jiang, Weihao, Cao, Peirui, Yang, Qinwei, Liu, Yunzhuo, Qi, Shuyao, Lin, Shengkai, Zhao, Shizhen
Format: Text
Language: unknown
Published: 2023
Subjects:
DML
Online Access: http://arxiv.org/abs/2308.05692
id ftarxivpreprints:oai:arXiv.org:2308.05692
record_format openpolar
spelling ftarxivpreprints:oai:arXiv.org:2308.05692 2023-09-05T13:19:05+02:00 2023-08-10 2023-08-16T17:52:45Z
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Distributed, Parallel, and Cluster Computing
description Distributed machine learning (DML) makes it possible to train large neural networks in a reasonable amount of time. However, because computing power is growing much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU clusters suffer from network contention caused by hash collisions, which not only increases communication overhead but also creates unfairness and degrades the user experience. In this paper, we first analyze how network contention affects training time in a cluster with 32 NVIDIA V100 GPUs. We then propose vClos, which eliminates network contention by jointly optimizing the network topology and the communication pattern in distributed training. We also propose OCS-vClos, which introduces a layer of optical circuit switches (OCSs) into the leaf-spine network to reduce the potential network resource fragmentation caused by the resource allocation strategy in vClos. Testbed experiments and large-scale simulations based on real traces demonstrate the superiority of vClos over existing network resource scheduling strategies.
format Text
author Han, Xinchi
Jiang, Weihao
Cao, Peirui
Yang, Qinwei
Liu, Yunzhuo
Qi, Shuyao
Lin, Shengkai
Zhao, Shizhen
title Isolated Scheduling for Distributed Training Tasks in GPU Clusters
publishDate 2023
url http://arxiv.org/abs/2308.05692
genre DML
op_relation http://arxiv.org/abs/2308.05692