Comparison of regression and machine learning methods for classification in a large cohort study

Bibliographic Details
Main Author: Ingibjörg Magnúsdóttir 1989-
Other Authors: Háskóli Íslands
Format: Thesis
Language: English
Published: 2021
Subjects:
Online Access: http://hdl.handle.net/1946/38767
Description
Summary: Many different methods exist to determine associations between an outcome and a set of predictors. Regression and machine learning are two categories of methods that can be used to determine these associations and classify data. Regression models are most often based on linear mappings of predictors with few interactions, while machine learning methods use a combination of linear and non-linear mappings and higher-order interactions. The two also differ in objective: regression methods put more focus on statistical inference, while machine learning methods put more focus on accuracy. However, it is natural to assume that a better performance in modeling and classification can be achieved with such methods. In this thesis, two regression methods (logistic regression and lasso) and two machine learning methods (random forest and support vector machine) were examined, and their ability to classify outcomes in a large cohort study was observed. The classification was performed using both the full dataset and a split into training and test datasets. The cohort used was the SAGA cohort, a nationwide study in Iceland on the impact of trauma on women’s health. The cohort consists of 31,795 women aged 18 to 69. In the cohort study setting, the number of observations is usually much larger than the number of predictors. Still, there can be concern about misspecification and overfitting. It is of practical value to be able to judge how robust the modeling of the data is with respect to the modeling approach. The three binary outcomes that were studied and classified were posttraumatic stress disorder (PTSD), obesity, and hypertension. The classification ability of the methods was assessed using the area under the ROC (Receiver Operating Characteristic) curve (AUC) and accuracy, measured as the proportion correctly classified. The results of the study showed that the methods had similar performance, but there were some differences.
The difference between the worst and the best result for AUC was on ...
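
The two evaluation measures named in the summary, AUC and accuracy as the proportion correctly classified, can be illustrated with a minimal Python sketch. This is not the code used in the thesis: the function names, the 0.5 cut-off, and the toy labels and scores are assumptions made for illustration only. AUC is computed here by its pairwise definition, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float],
             threshold: float = 0.5) -> float:
    """Proportion of observations classified correctly at a given cut-off.

    The 0.5 default is an assumption; the record does not state a cut-off.
    """
    predictions = [1 if score >= threshold else 0 for score in y_score]
    correct = sum(pred == truth for pred, truth in zip(predictions, y_true))
    return correct / len(y_true)


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    Quadratic in the number of observations, so suitable for small
    illustrations only; a rank-based version would scale better.
    """
    positives = [s for s, t in zip(y_score, y_true) if t == 1]
    negatives = [s for s, t in zip(y_score, y_true) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive case ranked above negative case
            elif pos == neg:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: binary outcome and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.50, 0.70]

toy_auc = auc(y_true, y_score)
toy_accuracy = accuracy(y_true, y_score)
print(f"AUC = {toy_auc:.3f}, accuracy = {toy_accuracy:.3f}")
```

Accuracy depends on the chosen cut-off, while AUC summarizes ranking quality across all cut-offs, which is one reason the two measures can order methods differently.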