Scalable and numerically stable descriptive statistics in SystemML

Abstract—With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine l...

Bibliographic Details
Main Authors: Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald
Other Authors: The Pennsylvania State University CiteSeerX Archives
Format: Text
Language: English
Published: 2012
Subjects:
DML
Online Access: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.648.7840
http://researcher.watson.ibm.com/researcher/files/us-ytian/stability.pdf
id ftciteseerx:oai:CiteSeerX.psu:10.1.1.648.7840
description Abstract: With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level language, called DML, which is syntactically similar to R. DML scripts are compiled, optimized, and executed in the SystemML runtime that is built on top of MapReduce. As the basis of virtually every quantitative analysis, descriptive statistics provide powerful tools to explore data in SystemML. In this paper, we describe our experience in implementing descriptive statistics in SystemML. In particular, we elaborate on how to overcome the two major challenges: (1) achieving numerical stability while operating on large data sets in a distributed setting of MapReduce; and (2) designing scalable algorithms to compute order statistics in MapReduce. By empirically comparing to algorithms commonly used in existing tools and systems, we demonstrate the numerical accuracy achieved by SystemML. We also highlight the valuable lessons we have learned in this exercise.
op_rights Metadata may be used without restrictions as long as the oai identifier remains attached to it.