Comparison of regression and machine learning methods for classification in a large cohort study

Many different methods exist to determine associations between an outcome and a set of predictors. Regression and machine learning are two categories of methods that can be used to determine these associations and classify data. Regression models are most often based on linear mappings of predictors...


Bibliographic Details
Main Author: Ingibjörg Magnúsdóttir 1989-
Other Authors: Háskóli Íslands
Format: Thesis
Language: English
Published: 2021
Subjects: Líftölfræði (biostatistics)
Online Access: http://hdl.handle.net/1946/38767
id ftskemman:oai:skemman.is:1946/38767
record_format openpolar
spelling ftskemman:oai:skemman.is:1946/38767 2023-05-15T16:52:23+02:00 Comparison of regression and machine learning methods for classification in a large cohort study Ingibjörg Magnúsdóttir 1989- Háskóli Íslands 2021-05 application/pdf http://hdl.handle.net/1946/38767 en eng http://hdl.handle.net/1946/38767 Líftölfræði Thesis Master's 2021 ftskemman 2022-12-11T06:51:16Z Many different methods exist to determine associations between an outcome and a set of predictors. Regression and machine learning are two categories of methods that can be used to determine these associations and classify data. Regression models are most often based on linear mappings of predictors with few interactions, while machine learning methods use a combination of linear and non-linear mappings and higher-order interactions. The objectives of the two also differ: regression methods focus more on statistical inference, while machine learning methods focus more on accuracy. However, it is natural to assume that better performance in modeling and classification can be achieved with machine learning methods. In this thesis, two regression methods (logistic regression and lasso) and two machine learning methods (random forest and support vector machine) were examined, and their ability to classify outcomes in a large cohort study was assessed. Classification was performed both on the full dataset and with a split into training and test datasets. The cohort used was the SAGA cohort, a nationwide study in Iceland on the impact of trauma on women’s health. The cohort consists of 31,795 women aged 18 to 69. In a cohort study setting, the number of observations is usually much larger than the number of predictors. Still, there can be concerns about misspecification and overfitting. It is of practical value to be able to judge how robust the modeling of the data is with respect to the modeling approach. The three binary outcomes that were studied and classified were posttraumatic stress disorder (PTSD), obesity, and hypertension. The classification ability of the methods was assessed using the area under the receiver operating characteristic (ROC) curve (AUC) and accuracy, measured as the proportion correctly classified. The results showed that the methods performed similarly, with some differences. The difference between the worst and the best result for AUC was on ... Thesis Iceland Skemman (Iceland)
institution Open Polar
collection Skemman (Iceland)
op_collection_id ftskemman
language English
topic Líftölfræði
spellingShingle Líftölfræði
Ingibjörg Magnúsdóttir 1989-
Comparison of regression and machine learning methods for classification in a large cohort study
topic_facet Líftölfræði
description Many different methods exist to determine associations between an outcome and a set of predictors. Regression and machine learning are two categories of methods that can be used to determine these associations and classify data. Regression models are most often based on linear mappings of predictors with few interactions, while machine learning methods use a combination of linear and non-linear mappings and higher-order interactions. The objectives of the two also differ: regression methods focus more on statistical inference, while machine learning methods focus more on accuracy. However, it is natural to assume that better performance in modeling and classification can be achieved with machine learning methods. In this thesis, two regression methods (logistic regression and lasso) and two machine learning methods (random forest and support vector machine) were examined, and their ability to classify outcomes in a large cohort study was assessed. Classification was performed both on the full dataset and with a split into training and test datasets. The cohort used was the SAGA cohort, a nationwide study in Iceland on the impact of trauma on women’s health. The cohort consists of 31,795 women aged 18 to 69. In a cohort study setting, the number of observations is usually much larger than the number of predictors. Still, there can be concerns about misspecification and overfitting. It is of practical value to be able to judge how robust the modeling of the data is with respect to the modeling approach. The three binary outcomes that were studied and classified were posttraumatic stress disorder (PTSD), obesity, and hypertension. The classification ability of the methods was assessed using the area under the receiver operating characteristic (ROC) curve (AUC) and accuracy, measured as the proportion correctly classified. The results showed that the methods performed similarly, with some differences. The difference between the worst and the best result for AUC was on ...
author2 Háskóli Íslands
format Thesis
author Ingibjörg Magnúsdóttir 1989-
author_facet Ingibjörg Magnúsdóttir 1989-
author_sort Ingibjörg Magnúsdóttir 1989-
title Comparison of regression and machine learning methods for classification in a large cohort study
title_short Comparison of regression and machine learning methods for classification in a large cohort study
title_full Comparison of regression and machine learning methods for classification in a large cohort study
title_fullStr Comparison of regression and machine learning methods for classification in a large cohort study
title_full_unstemmed Comparison of regression and machine learning methods for classification in a large cohort study
title_sort comparison of regression and machine learning methods for classification in a large cohort study
publishDate 2021
url http://hdl.handle.net/1946/38767
genre Iceland
genre_facet Iceland
op_relation http://hdl.handle.net/1946/38767
_version_ 1766042587156709376