Comparison of regression and machine learning methods for classification in a large cohort study
Many different methods exist to determine associations between an outcome and a set of predictors. Regression and machine learning are two categories of methods that can be used to determine these associations and classify data. Regression models are most often based on linear mappings of predictors...
Main Author: | Ingibjörg Magnúsdóttir 1989- |
---|---|
Other Authors: | Háskóli Íslands |
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | Líftölfræði (Biostatistics) |
Online Access: | http://hdl.handle.net/1946/38767 |
id |
ftskemman:oai:skemman.is:1946/38767 |
---|---|
record_format |
openpolar |
spelling |
ftskemman:oai:skemman.is:1946/38767 2023-05-15T16:52:23+02:00 Comparison of regression and machine learning methods for classification in a large cohort study Ingibjörg Magnúsdóttir 1989- Háskóli Íslands 2021-05 application/pdf http://hdl.handle.net/1946/38767 en eng http://hdl.handle.net/1946/38767 Líftölfræði Thesis Master's 2021 ftskemman 2022-12-11T06:51:16Z Thesis Iceland Skemman (Iceland) |
institution |
Open Polar |
collection |
Skemman (Iceland) |
op_collection_id |
ftskemman |
language |
English |
topic |
Líftölfræði (Biostatistics) |
topic_facet |
Líftölfræði |
description |
Many different methods exist to determine associations between an outcome and a set of predictors. Regression and machine learning are two categories of methods that can be used to determine these associations and classify data. Regression models are most often based on linear mappings of predictors with few interactions, while machine learning methods use a combination of linear and non-linear mappings and higher-order interactions. The objectives of the two also differ: regression methods focus more on statistical inference, while machine learning methods focus more on accuracy. However, it is natural to assume that better performance in modeling and classification can be achieved with such methods. In this thesis, two regression methods (logistic regression and lasso) and two machine learning methods (random forest and support vector machine) were examined, and their ability to classify outcomes in a large cohort study was observed. The classification was performed using both a full dataset and a training and test dataset. The cohort used was the SAGA cohort, a nationwide study in Iceland on the impact of trauma on women’s health. The cohort consists of 31,795 women between the ages of 18 and 69. In the cohort study setting there is usually a much larger number of observations than predictors. Still, there can be concern about misspecification and overfitting. It is of practical value to be able to judge how robust the modeling of the data is with respect to the modeling approach. The three binary outcomes that were studied and classified were posttraumatic stress disorder (PTSD), obesity, and hypertension. The classification ability of the methods was assessed using the area under the ROC (receiver operating characteristic) curve (AUC) and accuracy, measured as the proportion correctly classified. The results of the study showed that the methods had similar performance, but there were some differences. The difference between the worst and the best result for AUC was on ... |
author2 |
Háskóli Íslands |
format |
Thesis |
author |
Ingibjörg Magnúsdóttir 1989- |
title |
Comparison of regression and machine learning methods for classification in a large cohort study |
title_sort |
comparison of regression and machine learning methods for classification in a large cohort study |
publishDate |
2021 |
url |
http://hdl.handle.net/1946/38767 |
genre |
Iceland |
genre_facet |
Iceland |
op_relation |
http://hdl.handle.net/1946/38767 |
_version_ |
1766042587156709376 |
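
The description in this record names two evaluation measures: the area under the ROC curve (AUC) and accuracy, the proportion correctly classified. The sketch below illustrates how those two measures can be computed for a binary outcome. It is not the thesis's own code: the function names, the 0.5 classification threshold, and the toy labels and scores are all assumptions made for illustration, and the AUC uses the standard pairwise (Mann-Whitney) formulation.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float],
             threshold: float = 0.5) -> float:
    """Proportion of observations classified correctly.

    The 0.5 threshold is an assumption for this sketch; the record does
    not state which cut-off the thesis used.
    """
    predictions = [1 if score >= threshold else 0 for score in y_score]
    correct = sum(1 for y, p in zip(y_true, predictions) if y == p)
    return correct / len(y_true)


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case, with ties
    counted as one half.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, invented for illustration only (not from the SAGA cohort).
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print(f"AUC:      {auc(y_true, y_score):.3f}")
print(f"Accuracy: {accuracy(y_true, y_score):.3f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for a small example but slow for a cohort of tens of thousands of observations; in practice a library routine such as scikit-learn's `roc_auc_score` would be the more usual choice.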