Combining natural language processing and metabarcoding to reveal pathogen-environment associations.

Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecologic...

Bibliographic Details
Published in: PLOS Neglected Tropical Diseases
Main Authors: David C Molik, DeAndre Tomlinson, Shane Davitt, Eric L Morgan, Matthew Sisk, Benjamin Roche, Natalie Meyers, Michael E Pfrender
Format: Article in Journal/Newspaper
Language: English
Published: Public Library of Science (PLoS) 2021
Subjects:
Online Access: https://doi.org/10.1371/journal.pntd.0008755
https://doaj.org/article/0b69eba847274be69a403a632b0d1869
id ftdoajarticles:oai:doaj.org/article:0b69eba847274be69a403a632b0d1869
record_format openpolar
institution Open Polar
collection Directory of Open Access Journals: DOAJ Articles
op_collection_id ftdoajarticles
language English
topic Arctic medicine. Tropical medicine
RC955-962
Public aspects of medicine
RA1-1270
description Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic model sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach, using a search of single-locus metagenetic data, gathering papers connected to the datasets, de novo determination of topics, the number of topics, and assignment of articles to the topics, illustrates how such an analysis pipeline can harness large-scale datasets that are published/available but not necessarily fully analyzed, or whose metadata is not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations.
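The topic-count step described in the abstract (fit Latent Dirichlet Allocation for several candidate numbers of topics, then keep the count with the best coherence score) can be illustrated with a minimal sketch. This is not the authors' code: the record does not say which coherence measure or software they used, so the sketch assumes a UMass-style coherence averaged over word pairs, a small collapsed Gibbs sampler, and a made-up toy corpus. The names `fit_lda`, `top_words`, `umass_coherence`, `coherence_by_k`, and `DOCS` are all hypothetical, and the Random Forest assignment step is left out.

```python
import math
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (topic_word, doc_topic): per-topic word counts and
    per-document topic counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    doc_topic = [[0] * n_topics for _ in docs]
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic


def top_words(word_counts, n=5):
    """Most frequent words of one topic, ignoring zero counts."""
    return [w for w, c in word_counts.most_common(n) if c > 0]


def umass_coherence(words, docs):
    """Mean UMass-style pair score; None if fewer than two words."""
    doc_sets = [set(doc) for doc in docs]
    total, pairs = 0.0, 0
    for i in range(1, len(words)):
        for j in range(i):
            d_j = sum(words[j] in s for s in doc_sets)
            d_ij = sum(words[i] in s and words[j] in s for s in doc_sets)
            total += math.log((d_ij + 1) / d_j)
            pairs += 1
    return total / pairs if pairs else None


def coherence_by_k(docs, candidates, **lda_kwargs):
    """Map each candidate topic count to its mean topic coherence."""
    scores = {}
    for k in candidates:
        topic_word, _ = fit_lda(docs, k, **lda_kwargs)
        per_topic = [umass_coherence(top_words(tw), docs) for tw in topic_word]
        per_topic = [s for s in per_topic if s is not None]
        scores[k] = sum(per_topic) / len(per_topic) if per_topic else float("-inf")
    return scores


# Hypothetical, pre-tokenized toy corpus (not data from the article).
DOCS = [
    "soil wood decay fungi forest".split(),
    "wood decay soil litter fungi".split(),
    "forest soil litter wood decay".split(),
    "marine water plankton salinity coast".split(),
    "water plankton coast marine sediment".split(),
    "salinity sediment marine water plankton".split(),
]

scores = coherence_by_k(DOCS, [2, 3, 4])
best_k = max(scores, key=scores.get)
print(scores, "selected number of topics:", best_k)
```

On a real corpus the candidate range, priors, and number of sweeps would need tuning, and an established library implementation of LDA and coherence would likely be preferable to this hand-written sampler.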
format Article in Journal/Newspaper
author David C Molik
DeAndre Tomlinson
Shane Davitt
Eric L Morgan
Matthew Sisk
Benjamin Roche
Natalie Meyers
Michael E Pfrender
title Combining natural language processing and metabarcoding to reveal pathogen-environment associations.
publisher Public Library of Science (PLoS)
publishDate 2021
url https://doi.org/10.1371/journal.pntd.0008755
https://doaj.org/article/0b69eba847274be69a403a632b0d1869
geographic Arctic
genre Arctic
op_source PLoS Neglected Tropical Diseases, Vol 15, Iss 4, p e0008755 (2021)
op_relation https://doi.org/10.1371/journal.pntd.0008755
https://doaj.org/toc/1935-2727
https://doaj.org/toc/1935-2735
1935-2727
1935-2735
doi:10.1371/journal.pntd.0008755
https://doaj.org/article/0b69eba847274be69a403a632b0d1869
op_doi https://doi.org/10.1371/journal.pntd.0008755
container_title PLOS Neglected Tropical Diseases
container_volume 15
container_issue 4
container_start_page e0008755
_version_ 1766342445762609152