SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets

This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk-by-chunk withi...
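The two-stage flow in the abstract (build and update a clustering model one memory load at a time, then rescan the data to score every object) can be illustrated with a minimal sketch. This is not the SDCOR algorithm: the simple radius-based grouping, the per-dimension variance scaling, and every name and parameter here (`ClusterSummary`, `build_model`, `score`, `radius`, `min_size`, `chunk_size`) are assumptions chosen for illustration, since the record does not state the paper's clustering procedure or scoring criterion.

```python
import math


def chunks(points, size):
    """Yield consecutive 'memory loads' of at most `size` points."""
    for start in range(0, len(points), size):
        yield points[start:start + size]


class ClusterSummary:
    """Running count, mean and per-dimension spread of one cluster."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations (Welford)

    def add(self, p):
        self.n += 1
        for i, x in enumerate(p):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def scaled_distance(self, p):
        # Distance to the cluster mean in units of per-dimension std. deviation.
        total = 0.0
        for i, x in enumerate(p):
            var = self.m2[i] / (self.n - 1) if self.n > 1 else 1.0
            total += (x - self.mean[i]) ** 2 / max(var, 1e-12)
        return math.sqrt(total)


def build_model(points, chunk_size, radius):
    """First scan: update the summaries chunk by chunk (assumed grouping rule)."""
    summaries = []
    for chunk in chunks(points, chunk_size):
        for p in chunk:
            best = min(summaries, key=lambda s: math.dist(p, s.mean), default=None)
            if best is None or math.dist(p, best.mean) > radius:
                best = ClusterSummary(len(p))  # start a new temporary cluster
                summaries.append(best)
            best.add(p)
    return summaries


def score(points, summaries, chunk_size, min_size):
    """Second scan: score each point against the sufficiently large clusters."""
    dense = [s for s in summaries if s.n >= min_size]
    scores = []
    for chunk in chunks(points, chunk_size):
        for p in chunk:
            scores.append(min(s.scaled_distance(p) for s in dense))
    return scores


# Toy data: two small grids of inliers and one far-away point (hypothetical).
grid = [(x / 2, y / 2) for x in range(-2, 3) for y in range(-2, 3)]
data = grid + [(x + 10.0, y + 10.0) for x, y in grid] + [(30.0, -30.0)]

model = build_model(data, chunk_size=10, radius=4.0)
scores = score(data, model, chunk_size=10, min_size=5)
```

Only the cluster summaries stay in memory between chunks, which is the property the abstract emphasizes; in this sketch the far-away point ends up in a singleton cluster that `min_size` filters out, so it is scored against the two dense groups instead.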

Bibliographic Details
Main Authors:	Naghavi-Nozad, Sayyed-Ahmad; Haeri, Maryam Amir; Folino, Gianluigi
Format:	Text
Language:	unknown
Published:	2020
Subjects:	Computer Science - Machine Learning; Statistics - Machine Learning; Orca
Online Access:	http://arxiv.org/abs/2006.07616

id	ftarxivpreprints:oai:arXiv.org:2006.07616
record_format	openpolar
spelling	ftarxivpreprints:oai:arXiv.org:2006.07616 2023-05-15T17:53:52+02:00 SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets Naghavi-Nozad, Sayyed-Ahmad Haeri, Maryam Amir Folino, Gianluigi 2020-06-13 http://arxiv.org/abs/2006.07616 unknown http://arxiv.org/abs/2006.07616 Computer Science - Machine Learning Statistics - Machine Learning text 2020 ftarxivpreprints 2021-05-02T00:15:02Z This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk-by-chunk within the confines of a limited memory buffer. A temporary clustering model is built at the first phase; then, it is gradually updated by analyzing consecutive memory loads of points. Subsequently, at the end of scalable clustering, the approximate structure of the original clusters is obtained. Finally, by another scan of the entire dataset and using a suitable criterion, an outlying score is assigned to each object called SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient compared to best-known conventional density-based methods, which need to load all data into the memory; and also, to some fast distance-based methods, which can perform on data resident in the disk. Comment: For ORCA in the efficiency test, N is set to n/2, although the final results are not significantly modified. Some minor grammatical errors are rectified Text Orca ArXiv.org (Cornell University Library)
institution	Open Polar
collection	ArXiv.org (Cornell University Library)
op_collection_id	ftarxivpreprints
language	unknown
topic	Computer Science - Machine Learning; Statistics - Machine Learning
spellingShingle	Computer Science - Machine Learning Statistics - Machine Learning Naghavi-Nozad, Sayyed-Ahmad Haeri, Maryam Amir Folino, Gianluigi SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
topic_facet	Computer Science - Machine Learning; Statistics - Machine Learning
description	This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk by chunk within the confines of a limited memory buffer. A temporary clustering model is built in the first phase; it is then gradually updated by analyzing consecutive memory loads of points. At the end of scalable clustering, the approximate structure of the original clusters is obtained. Finally, through another scan of the entire dataset and using a suitable criterion, an outlying score called SDCOR (Scalable Density-based Clustering Outlierness Ratio) is assigned to each object. Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient than the best-known conventional density-based methods, which need to load all data into memory, and also than some fast distance-based methods, which can operate on disk-resident data. Comment: For ORCA in the efficiency test, N is set to n/2, although the final results are not significantly modified. Some minor grammatical errors are rectified.
format	Text
author	Naghavi-Nozad, Sayyed-Ahmad; Haeri, Maryam Amir; Folino, Gianluigi
author_facet	Naghavi-Nozad, Sayyed-Ahmad; Haeri, Maryam Amir; Folino, Gianluigi
author_sort	Naghavi-Nozad, Sayyed-Ahmad
title	SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
title_short	SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
title_full	SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
title_fullStr	SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
title_full_unstemmed	SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets
title_sort	sdcor: scalable density-based clustering for local outlier detection in massive-scale datasets
publishDate	2020
url	http://arxiv.org/abs/2006.07616
genre	Orca
genre_facet	Orca
op_relation	http://arxiv.org/abs/2006.07616
_version_	1766161563107983360
