A new procedure to optimize the selection of groups in a classification tree: applications for ecological data

Bibliographic Details
Published in: Ecological Modelling
Main Authors: Guidi, Lionel; Ibanez, Frederic; Calcagno, Vincent; Beaugrand, Grégory
Other Authors: Laboratoire d'océanographie de Villefranche (LOV), Observatoire océanologique de Villefranche-sur-mer (OOVM), Université Pierre et Marie Curie - Paris 6 (UPMC)-Institut national des sciences de l'Univers (INSU - CNRS)-Centre National de la Recherche Scientifique (CNRS), Texas A&M University System, McGill University = Université McGill (Montréal, Canada), Laboratoire d'Océanologie et de Géosciences (LOG) - UMR 8187, Institut national des sciences de l'Univers (INSU - CNRS)-Université du Littoral Côte d'Opale (ULCO)-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche pour le Développement (IRD France-Nord), Ministère de l'Education Nationale de l'Enseignement Supérieur et de la Recherche, EC GOCE-036949
Format: Article in Journal/Newspaper
Language: English
Published: HAL CCSD 2009
Online Access: https://hal.inrae.fr/hal-02660464
https://doi.org/10.1016/j.ecolmodel.2008.11.006
Description
Summary: Agglomerative cluster analyses encompass many techniques that have been widely used in various fields of science. In biology, and specifically in ecology, datasets are generally highly variable and may contain outliers, which makes it harder to identify the number of clusters. Here we present a new criterion to determine statistically the optimal level of partition in a classification tree. The robustness of the criterion is tested against perturbed data (outliers) using an observation or variable with randomly generated values. The technique, called the Random Simulation Test (RST), is (1) tested on the well-known Iris dataset [Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179-188], (2) tested on simulated data with predetermined numbers of clusters following Milligan and Cooper [Milligan, G.W., Cooper, M.C., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159-179], and (3) applied to real copepod community data previously analyzed in Beaugrand et al. [Beaugrand, G., Ibanez, F., Lindley, J.A., Reid, P.C., 2002. Diversity of calanoid copepods in the North Atlantic and adjacent seas: species associations and biogeography. Mar. Ecol. Prog. Ser. 232, 179-195]. The technique is compared with several standard techniques. RST generally performed better than existing algorithms on simulated data and proved especially efficient with highly variable datasets.