LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data

BACKGROUND: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algor...

Bibliographic Details
Published in:	BMC Bioinformatics
Main Authors:	Rudar, Josip; Porter, Teresita M.; Wright, Michael; Golding, G. Brian; Hajibabaei, Mehrdad
Format:	Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research; Wood Buffalo; Wood Buffalo National Park
Online Access:	http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8969335/ https://doi.org/10.1186/s12859-022-04631-z

id	ftpubmed:oai:pubmedcentral.nih.gov:8969335
record_format	openpolar
spelling	ftpubmed:oai:pubmedcentral.nih.gov:8969335 2023-05-15T18:44:19+02:00 2022-03-31 2022-04-03T01:25:52Z
institution	Open Polar
collection	PubMed Central (PMC)
op_collection_id	ftpubmed
language	English
topic	Research
description	BACKGROUND: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in high-throughput sequencing (HTS) studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery. RESULTS: We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada’s Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark’s generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision boundaries. ...
format	Text
author	Rudar, Josip; Porter, Teresita M.; Wright, Michael; Golding, G. Brian; Hajibabaei, Mehrdad
author_sort	Rudar, Josip
title	LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data
publisher	BioMed Central
publishDate	2022
url	http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8969335/ https://doi.org/10.1186/s12859-022-04631-z
long_lat	ENVELOPE(-112.007,-112.007,57.664,57.664)
geographic	Wood Buffalo
genre	Wood Buffalo; Wood Buffalo National Park
op_source	BMC Bioinformatics
op_relation	http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8969335/ http://dx.doi.org/10.1186/s12859-022-04631-z
op_rights	© The Author(s) 2022. https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
op_rightsnorm	CC0 PDM CC-BY
op_doi	https://doi.org/10.1186/s12859-022-04631-z
container_title	BMC Bioinformatics
container_volume	23
container_issue	1

Similar Items