Distance Measures in Bioinformatics

Bibliographic Details
Main Author: Xiong, Feiyu
Other Authors: Kam, Moshe; Hrebien, Leonid, 1949-
Format: Thesis
Language: English
Published: Drexel University 2015
Subjects: DML
Online Access: http://hdl.handle.net/1860/idea:6403
Description
Summary: Many bioinformatics applications rely on the computation of similarities between objects. Distance and similarity measures applied to vectors of characteristics are essential to problems such as classification, clustering and information retrieval. This study explores the usefulness of distance and similarity measures in several bioinformatics applications, which fall into two categories: (1) estimation of the adverse reaction severity of unknown pharmaceutical treatments, based on the severity of known treatments, in order to provide guidance for testing the unknown treatments in clinical trials; and (2) classification of cancer tissue types and estimation of cancer stages, based on high-dimensional microarray data, in order to support clinical decision making. To address the first category, we studied several clustering and classification approaches for binary severity estimation of Cytokine Release Syndrome (CRS). With binary estimation we were able to identify the treatments that caused the most severe response and then build prediction models for CRS. To obtain graded severity estimates, we developed a Severity Estimation using Distance Metric Learning (SE-DML) approach. We evaluated SE-DML on four known data sets and showed that it outperformed other widely used methods on these data sets. For the second category, we presented Kernelized Information-Theoretic Metric Learning (KITML) algorithms that optimize distance metrics and effectively handle high-dimensional data. The metric learned by KITML is used to improve the performance of k-nearest neighbor classification for cancer tissue microarray data. We evaluated our approach on fourteen cancer microarray data sets and compared our results with other state-of-the-art approaches. We achieved the best overall performance for the classification task. In addition, we tested the KITML algorithm in estimating the severity stages of cancer samples, with accurate results.
Ph.D., Electrical Engineering -- Drexel University, 2015
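
Illustrative note: the summary describes using a learned distance metric inside k-nearest neighbor classification. The sketch below shows only that general idea and is not the KITML or SE-DML algorithm from the thesis. The function names, the toy data, and the matrix `hand_picked` are all assumptions made up for illustration; in particular, `hand_picked` is chosen by hand as a stand-in for a matrix that a metric learning method would estimate from data.

```python
from collections import Counter

def metric_distance_sq(x, y, M):
    """Squared distance (x - y)^T M (x - y), with M given as nested lists."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))

def knn_predict(train_X, train_y, query, M, k=3):
    """Majority vote among the k training points closest to query under M."""
    order = sorted(range(len(train_X)),
                   key=lambda i: metric_distance_sq(query, train_X[i], M))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up): feature 0 separates the classes, feature 1 acts as noise.
train_X = [[0.0, 0.0], [0.2, 1.0], [1.0, 5.0], [1.2, 6.0]]
train_y = [0, 0, 1, 1]
query = [0.1, 5.0]

identity = [[1.0, 0.0], [0.0, 1.0]]      # plain squared Euclidean distance
hand_picked = [[1.0, 0.0], [0.0, 0.01]]  # assumed stand-in for a learned metric

euclidean_label = knn_predict(train_X, train_y, query, identity)
reweighted_label = knn_predict(train_X, train_y, query, hand_picked)
print(euclidean_label, reweighted_label)
```

On this toy data the Euclidean vote assigns the query to class 1 because the noisy second feature dominates the distance, while the reweighted metric assigns it to class 0. A metric learning method aims to find such a weighting automatically rather than by hand.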