Approximate Gradient Coding for Heterogeneous Nodes

In distributed machine learning (DML), the training data is distributed across multiple worker nodes to perform the underlying training in parallel. One major problem affecting the performance of DML algorithms is the presence of stragglers. These are nodes that are terribly slow in performing their tas...

Bibliographic Details
Main Authors: Johri, Amogh, Yardi, Arti, Bodas, Tejas
Format: Text
Language: unknown
Published: 2021
Subjects:
DML
Online Access: http://arxiv.org/abs/2105.06124
id ftarxivpreprints:oai:arXiv.org:2105.06124
record_format openpolar
spelling ftarxivpreprints:oai:arXiv.org:2105.06124 2023-09-05T13:19:05+02:00 Approximate Gradient Coding for Heterogeneous Nodes Johri, Amogh Yardi, Arti Bodas, Tejas 2021-05-13 http://arxiv.org/abs/2105.06124 unknown http://arxiv.org/abs/2105.06124 Computer Science - Information Theory text 2021 ftarxivpreprints 2023-08-16T16:29:21Z Text DML ArXiv.org (Cornell University Library)
institution Open Polar
collection ArXiv.org (Cornell University Library)
op_collection_id ftarxivpreprints
language unknown
topic Computer Science - Information Theory
description In distributed machine learning (DML), the training data is distributed across multiple worker nodes so that training can run in parallel. One major problem affecting the performance of DML algorithms is the presence of stragglers: nodes that are very slow in performing their task, which leaves the training data stored on them under-utilized. Gradient coding mitigates the impact of stragglers by adding sufficient redundancy to the data. Gradient coding and other straggler mitigation schemes assume that the straggler behavior of all worker nodes is identical. Our experiments on an Amazon AWS cluster, however, suggest otherwise: straggler behavior is correlated across iterations. To model this, we introduce a heterogeneous straggler model in which nodes are categorized into two classes, slow and active. To better utilize the training data stored on slow nodes, we modify existing gradient coding schemes by shuffling the training data among workers. Our results, from both simulations and cloud experiments, suggest a remarkable improvement with shuffling over existing schemes. We also provide a theoretical analysis of the proposed models that justifies their utility.
format Text
author Johri, Amogh
Yardi, Arti
Bodas, Tejas
title Approximate Gradient Coding for Heterogeneous Nodes
publishDate 2021
url http://arxiv.org/abs/2105.06124
genre DML
_version_ 1776199905750024192