Novel methods for comparing and evaluating single and metagenomic assemblies

The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome...

Bibliographic Details
Main Author:	Hill, Christopher Michael
Format:	Thesis
Language:	English
Published:	Digital Repository at the University of Maryland 2015
Subjects:	Bioinformatics; FOS Computer and information sciences; Computer science; Assembly; Genome; Arctic; Valet
Online Access:	https://dx.doi.org/10.13016/m28k9d http://hdl.handle.net/1903/17100

id	ftdatacite:10.13016/m28k9d
record_format	openpolar
spelling	ftdatacite:10.13016/m28k9d 2023-05-15T15:17:49+02:00 Novel methods for comparing and evaluating single and metagenomic assemblies Hill, Christopher Michael 2015 https://dx.doi.org/10.13016/m28k9d http://hdl.handle.net/1903/17100 en eng Digital Repository at the University of Maryland Bioinformatics FOS Computer and information sciences Computer science Assembly Genome Thesis Collection Dissertation thesis 2015 ftdatacite https://doi.org/10.13016/m28k9d 2021-11-05T12:55:41Z The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still heavily relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. The focus of this work is to develop reference-free computational methods to accurately compare and evaluate genome assemblies. We introduce a reference-free likelihood-based measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. Despite the unresolved challenges of single genome assembly, the decreasing costs of sequencing technology have led to a sharp increase in metagenomics projects over the past decade. These projects allow us to better understand the diversity and function of microbial communities found in the environment, including the ocean, Arctic regions, other living organisms, and the human body. We extend our likelihood-based framework and show that we can accurately compare assemblies of these complex bacterial communities. After an assembly has been produced, it is not an easy task determining what parts of the underlying genome are missing, what parts are mistakes, and what parts are due to experimental artifacts from the sequencing machine. Here we introduce VALET, the first reference-free pipeline that flags regions in metagenomic assemblies that are statistically inconsistent with the data generation process. VALET detects mis-assemblies in publicly available datasets and highlights the current shortcomings in available metagenomic assemblers. By providing the computational methods for researchers to accurately evaluate their assemblies, we decrease the chance of incorrect biological conclusions and misguided future studies. Thesis Arctic DataCite Metadata Store (German National Library of Science and Technology) Arctic Valet ENVELOPE(151.050,151.050,61.917,61.917)
institution	Open Polar
collection	DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id	ftdatacite
language	English
topic	Bioinformatics FOS Computer and information sciences Computer science Assembly Genome
spellingShingle	Bioinformatics FOS Computer and information sciences Computer science Assembly Genome Hill, Christopher Michael Novel methods for comparing and evaluating single and metagenomic assemblies
topic_facet	Bioinformatics FOS Computer and information sciences Computer science Assembly Genome
description	The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still heavily relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. The focus of this work is to develop reference-free computational methods to accurately compare and evaluate genome assemblies. We introduce a reference-free likelihood-based measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. Despite the unresolved challenges of single genome assembly, the decreasing costs of sequencing technology have led to a sharp increase in metagenomics projects over the past decade. These projects allow us to better understand the diversity and function of microbial communities found in the environment, including the ocean, Arctic regions, other living organisms, and the human body. We extend our likelihood-based framework and show that we can accurately compare assemblies of these complex bacterial communities. After an assembly has been produced, it is not an easy task determining what parts of the underlying genome are missing, what parts are mistakes, and what parts are due to experimental artifacts from the sequencing machine. Here we introduce VALET, the first reference-free pipeline that flags regions in metagenomic assemblies that are statistically inconsistent with the data generation process. VALET detects mis-assemblies in publicly available datasets and highlights the current shortcomings in available metagenomic assemblers. By providing the computational methods for researchers to accurately evaluate their assemblies, we decrease the chance of incorrect biological conclusions and misguided future studies.
format	Thesis
author	Hill, Christopher Michael
author_facet	Hill, Christopher Michael
author_sort	Hill, Christopher Michael
title	Novel methods for comparing and evaluating single and metagenomic assemblies
title_short	Novel methods for comparing and evaluating single and metagenomic assemblies
title_full	Novel methods for comparing and evaluating single and metagenomic assemblies
title_fullStr	Novel methods for comparing and evaluating single and metagenomic assemblies
title_full_unstemmed	Novel methods for comparing and evaluating single and metagenomic assemblies
title_sort	novel methods for comparing and evaluating single and metagenomic assemblies
publisher	Digital Repository at the University of Maryland
publishDate	2015
url	https://dx.doi.org/10.13016/m28k9d http://hdl.handle.net/1903/17100
long_lat	ENVELOPE(151.050,151.050,61.917,61.917)
geographic	Arctic Valet
geographic_facet	Arctic Valet
genre	Arctic
genre_facet	Arctic
op_doi	https://doi.org/10.13016/m28k9d
_version_	1766348085656551424
