Cluster analysis for symbolic interval data using linear regression method

Bibliographic Details
Main Author: Liu, Fei
Format: Doctoral or Postdoctoral Thesis
Language: English
Published: uga 2016
Subjects:
Online Access: http://hdl.handle.net/10724/36237
http://purl.galileo.usg.edu/uga_etd/liu_fei_201605_phd
Description
Summary: Symbolic data records are becoming an increasingly powerful tool for handling large data sets. Interval-valued data are a special type of symbolic data in which each observation is a vector of intervals. The typical K-means methods for interval-valued data assume that the data separate into spherical clusters, and they usually fail to converge to the correct clusters when the data are not spherically clustered. We propose a K-regressions based clustering method for interval-valued data to recover more complicated data structures. Assuming the response and predictor variables follow K different linear relationships, the data are initially split into K groups at random. Then, we apply the newly developed "symbolic variation" least squares to estimate the parameters of the K symbolic regressions. Each data point is then reallocated to its closest group in terms of its symbolic distance to the regression lines. This two-step dynamic clustering algorithm continues until the clusters are stable. Further, we introduce an orthogonal regression clustering algorithm (ORCA) for interval-valued data to avoid specifying a response variable. Two orthogonal regression methods are proposed: the simple orthogonal regression method and the general orthogonal regression method. We use four different methods to determine the optimal number of clusters. A simulation study is conducted to investigate the performance of the ORCA algorithm. We use the Iris data (Fisher, 1936) to test the effectiveness of the ORCA algorithm.
PhD Statistics Statistics Lynne Billard Lynne Billard Paul Schliekelman Jaxk Reeves William McCormick Pengsheng Ji
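
The two-step dynamic clustering loop described in the summary (random initial split, fit K regressions, reallocate each point to its closest line, repeat until stable) can be illustrated with a minimal Python sketch. This is not the thesis's method: as a simplifying assumption, each interval is reduced to its midpoint, ordinary least squares stands in for the "symbolic variation" least squares, and the absolute vertical residual stands in for the symbolic distance. The names `fit_line` and `k_regressions` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate group (all x equal): use a horizontal line through the mean.
        return my, 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def k_regressions(x_intervals, y_intervals, k, max_iter=100, seed=0):
    """Cluster interval-valued points around k regression lines.

    Assumption (not from the thesis): every interval (lo, hi) is replaced
    by its midpoint, and distance to a line is the absolute vertical residual.
    """
    rng = random.Random(seed)
    xs = [(lo + hi) / 2 for lo, hi in x_intervals]
    ys = [(lo + hi) / 2 for lo, hi in y_intervals]
    n = len(xs)

    # Initial step: split the data into k groups at random.
    labels = [rng.randrange(k) for _ in range(n)]
    lines = [(0.0, 0.0)] * k

    for _ in range(max_iter):
        # Step 1: fit one regression line per group (empty groups keep their old line).
        for g in range(k):
            members = [i for i in range(n) if labels[i] == g]
            if members:
                lines[g] = fit_line([xs[i] for i in members],
                                    [ys[i] for i in members])

        # Step 2: reallocate each point to the line with the smallest residual.
        new_labels = [
            min(range(k),
                key=lambda g: abs(ys[i] - (lines[g][0] + lines[g][1] * xs[i])))
            for i in range(n)
        ]

        # Stop once the clusters are stable.
        if new_labels == labels:
            break
        labels = new_labels

    return labels, lines
```

Calling `k_regressions(x_intervals, y_intervals, k=2)` with lists of `(lower, upper)` pairs returns one cluster label per observation and one `(intercept, slope)` pair per cluster. Like any procedure of this kind, the result depends on the random initial split, so in practice one would likely try several seeds.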