LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data

Abstract. Background: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Nois...


Bibliographic Details
Published in: BMC Bioinformatics
Main Authors: Josip Rudar, Teresita M. Porter, Michael Wright, G. Brian Golding, Mehrdad Hajibabaei
Format: Article in Journal/Newspaper
Language: English
Published: BMC 2022
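Subjects: Biomarker selection; Metagenomics; Metabarcoding; Biomonitoring; Ecological assessment; Machine learning; Computer applications to medicine. Medical informatics; R858-859.7; Biology (General); QH301-705.5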
Online Access: https://doi.org/10.1186/s12859-022-04631-z
https://doaj.org/article/4183141caf7345e8819f245225b26824
id ftdoajarticles:oai:doaj.org/article:4183141caf7345e8819f245225b26824
record_format openpolar
spelling ftdoajarticles:oai:doaj.org/article:4183141caf7345e8819f245225b26824 2023-05-15T18:44:20+02:00 LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data Josip Rudar Teresita M. Porter Michael Wright G. Brian Golding Mehrdad Hajibabaei 2022-03-01T00:00:00Z https://doi.org/10.1186/s12859-022-04631-z https://doaj.org/article/4183141caf7345e8819f245225b26824 EN eng BMC https://doi.org/10.1186/s12859-022-04631-z https://doaj.org/toc/1471-2105 doi:10.1186/s12859-022-04631-z 1471-2105 https://doaj.org/article/4183141caf7345e8819f245225b26824 BMC Bioinformatics, Vol 23, Iss 1, Pp 1-34 (2022) Biomarker selection Metagenomics Metabarcoding Biomonitoring Ecological assessment Machine learning Computer applications to medicine. Medical informatics R858-859.7 Biology (General) QH301-705.5 article 2022 ftdoajarticles https://doi.org/10.1186/s12859-022-04631-z 2022-12-30T22:03:51Z Abstract Background Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in high-throughput sequencing (HTS) studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. 
In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery. Results We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada’s Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark’s generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision ... Article in Journal/Newspaper Wood Buffalo Wood Buffalo National Park Directory of Open Access Journals: DOAJ Articles Wood Buffalo ENVELOPE(-112.007,-112.007,57.664,57.664) BMC Bioinformatics 23 1
institution Open Polar
collection Directory of Open Access Journals: DOAJ Articles
op_collection_id ftdoajarticles
language English
topic Biomarker selection
Metagenomics
Metabarcoding
Biomonitoring
Ecological assessment
Machine learning
Computer applications to medicine. Medical informatics
R858-859.7
Biology (General)
QH301-705.5
description Abstract. Background: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in high-throughput sequencing (HTS) studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery. Results: We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada’s Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark’s generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision ...
format Article in Journal/Newspaper
author Josip Rudar
Teresita M. Porter
Michael Wright
G. Brian Golding
Mehrdad Hajibabaei
author_sort Josip Rudar
title LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data
title_sort landmark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data
publisher BMC
publishDate 2022
url https://doi.org/10.1186/s12859-022-04631-z
https://doaj.org/article/4183141caf7345e8819f245225b26824
long_lat ENVELOPE(-112.007,-112.007,57.664,57.664)
geographic Wood Buffalo
genre Wood Buffalo
Wood Buffalo National Park
op_source BMC Bioinformatics, Vol 23, Iss 1, Pp 1-34 (2022)
op_relation https://doi.org/10.1186/s12859-022-04631-z
https://doaj.org/toc/1471-2105
doi:10.1186/s12859-022-04631-z
1471-2105
https://doaj.org/article/4183141caf7345e8819f245225b26824
op_doi https://doi.org/10.1186/s12859-022-04631-z
container_title BMC Bioinformatics
container_volume 23
container_issue 1
_version_ 1766234981971001344