Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model

For over a decade, MapReduce has been the leading programming model for parallel and massive processing of large volumes of data. This has been driven by the development of many frameworks such as Spark, Pig and Hive, facilitating data analysis on large-scale systems. Howeve...


Bibliographic Details
Main Authors:	Al Hajj Hassan, Mohamad; Bamha, Mostafa
Other Authors:	Lebanese International University (LIU), Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Université d'Orléans (UO)-Ecole Nationale Supérieure d'Ingénieurs de Bourges
Format:	Conference Object
Language:	English
Published:	HAL CCSD 2015
Subjects:	Join and GroupBy-join operations; Data skew; MapReduce programming model; Distributed file systems; Hadoop framework; Apache Pig Latin; [SCCO.COMP]Cognitive science/Computer science; Iceland
Online Access:	https://hal.science/hal-01160931

author	Al Hajj Hassan, Mohamad; Bamha, Mostafa
author2	Lebanese International University (LIU); Laboratoire d'Informatique Fondamentale d'Orléans (LIFO); Université d'Orléans (UO)-Ecole Nationale Supérieure d'Ingénieurs de Bourges
collection	Université d'Orléans: HAL
description	For over a decade, MapReduce has been the leading programming model for parallel and massive processing of large volumes of data. This has been driven by the development of many frameworks such as Spark, Pig and Hive, facilitating data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance problems. This can have a devastating effect on the performance and scalability of these systems, particularly when processing GroupBy-Join queries on large datasets. In this paper, we present a new GroupBy-Join algorithm that considerably reduces communication costs while avoiding data skew effects. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect balancing properties during all stages of GroupBy-Join computation, even for highly skewed data. These results have been confirmed by a series of experiments.
format	Conference Object
genre	Iceland
id	ftunivorleans:oai:HAL:hal-01160931v1
institution	Open Polar
language	English
op_collection_id	ftunivorleans
op_coverage	Reykjavik, Iceland
op_relation	hal-01160931 https://hal.science/hal-01160931
op_source	International Conference on Computational Science (ICCS 2015), Jun 2015, Reykjavik, Iceland. pp.70-79 https://hal.science/hal-01160931
publishDate	2015
publisher	HAL CCSD
record_format	openpolar
title	Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model
topic	Join and GroupBy-join operations; Data skew; MapReduce programming model; Distributed file systems; Hadoop framework; Apache Pig Latin; [SCCO.COMP]Cognitive science/Computer science
url	https://hal.science/hal-01160931


Similar Items