Comparative study of data transformation tools: An investigation of functionalities supported in common tools and case study of declarative and procedural data manipulation languages

Bibliographic Details
Main Author: Storvoll, Tine-Lovise
Other Authors: Soylu, Ahmet, Martin-Recuerda, Francisco
Format: Master Thesis
Language: English
Published: OsloMet - storbyuniversitetet 2022
Subjects:
DML
Online Access: https://hdl.handle.net/11250/3017395
Description
Summary: Today, organizations collect and store huge amounts of data that could potentially be very valuable. Finding trends and patterns in historical data can allow businesses to make more informed decisions. Data scientists are therefore working to extract meaning from these massive amounts of data. However, 80% of the time in data science projects is spent preparing the data for analysis. Selecting an efficient tool for the job can help reduce the time spent on data transformation. This thesis therefore provides some insight into existing tools and their performance. A selection of common tools is made in Chapter 3. The tools are reviewed against a framework to identify their support for common data preparation tasks, and an evaluation of the tools is given at the end of the chapter. In Chapter 4, one declarative and one procedural Data Manipulation Language (DML) are selected from the common data transformation tools. Python pandas, a procedural language, and SQL, a declarative language, are evaluated and compared in a case study. The case study examines the two tools in more depth through a use case, and the comparative analysis at the end provides some insight into the differences between the two DMLs. Thus, the first contribution of this thesis is a review of the support for common data preparation tasks provided by a selection of prevalent data transformation tools. The second contribution is an analysis of the differences between a declarative and a procedural approach to data manipulation, through a case study comparing two popular DMLs. The review of tools in Chapter 3 revealed that the most prevalent data transformation tools support the majority of the common data preparation tasks. This review gives some general insight into which tasks are supported, which tasks need more effort to perform, and which are not supported at all.
The review is exclusively based on information found in technical documentation of the tools, and no further experimentation is done to investigate the ...