ModEst - Precise estimation of genome size from NGS data

Accurate estimates of genome sizes are important parameters for both theoretical and practical biodiversity genomics. We present here a fast, easy-to-implement and precise method to estimate genome size from the number of bases sequenced and the mean sequencing depth. To estimate the latter, we take...
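The core relationship named in the summary, genome size as the number of bases sequenced divided by the mean sequencing depth, can be illustrated with a minimal sketch. The function name and the input numbers below are hypothetical and are not taken from the ModEst software.

```python
def estimate_genome_size(total_bases: int, mean_depth: float) -> float:
    """Estimate genome size in bp as bases sequenced / mean sequencing depth."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return total_bases / mean_depth


# Hypothetical example: 3 Gb of sequence at a mean depth of 30X.
size_bp = estimate_genome_size(total_bases=3_000_000_000, mean_depth=30.0)
print(f"Estimated genome size: {size_bp / 1e6:.1f} Mb")  # prints 100.0 Mb
```

The estimate is only as good as the mean depth, which is why the record's description focuses on estimating that quantity robustly.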

Bibliographic Details
Main Authors: Schell, Tilman; Pfenninger, Markus; Schönnenbeck, Philipp
Format: Article in Journal/Newspaper
Language: unknown
Published: Zenodo 2022
Subjects: genome size; simulation
Online Access: https://dx.doi.org/10.5281/zenodo.5903272
https://zenodo.org/record/5903272
id ftdatacite:10.5281/zenodo.5903272
record_format openpolar
spelling ftdatacite:10.5281/zenodo.5903272 2023-05-15T18:15:52+02:00 ModEst - Precise estimation of genome size from NGS data Schell, Tilman Pfenninger, Markus Schönnenbeck, Philipp 2022 https://dx.doi.org/10.5281/zenodo.5903272 https://zenodo.org/record/5903272 unknown Zenodo https://zenodo.org/communities/dryad https://dx.doi.org/10.5061/dryad.dr7sqvb0j https://dx.doi.org/10.5281/zenodo.5903271 https://zenodo.org/communities/dryad Open Access MIT License https://opensource.org/licenses/MIT mit info:eu-repo/semantics/openAccess MIT genome size simulation article Software SoftwareSourceCode 2022 ftdatacite https://doi.org/10.5281/zenodo.5903272 https://doi.org/10.5061/dryad.dr7sqvb0j https://doi.org/10.5281/zenodo.5903271 2022-02-09T13:46:27Z
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language unknown
topic genome size
simulation
spellingShingle genome size
simulation
Schell, Tilman
Pfenninger, Markus
Schönnenbeck, Philipp
ModEst - Precise estimation of genome size from NGS data
topic_facet genome size
simulation
description Accurate estimates of genome size are important parameters for both theoretical and practical biodiversity genomics. We present here a fast, easy-to-implement and precise method to estimate genome size from the number of bases sequenced and the mean sequencing depth. To estimate the latter, we take advantage of the fact that the Poisson distribution parameter lambda can be estimated precisely from truncated data, restricted to the part of the sequencing depth distribution that represents the true underlying distribution. Using simulations, we show that reasonable genome size estimates can be obtained even from low-coverage (10X), highly discontinuous genome drafts. Comparison of estimates from a wide range of taxa and sequencing strategies with flow-cytometry estimates of the same individuals showed a very good fit and suggested that both methods yield comparable, interchangeable results. To illustrate the influence of factors such as sequencing depth, genome size, repeat content and repeat distribution on the different genome size estimation methods, we simulated five different genomes based on real examples. The latest genome assemblies and annotations of Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster and Scophthalmus maximus were used to obtain the distributions of the sizes of annotated repeat regions and of the distances between them. Simulated genomes matching the sizes of these five assemblies were then created using a custom Python tool, available at https://github.com/Croxa/Simulate-Genome. Regions annotated as repeat regions (rr) were filled with random repeat units of up to 10 bp in length, and high-complexity regions with random nucleotides. For simplicity, each genome was simulated as a single chromosome. A mean GC content of 0.5 was applied to both categories.
format Article in Journal/Newspaper
author Schell, Tilman
Pfenninger, Markus
Schönnenbeck, Philipp
author_facet Schell, Tilman
Pfenninger, Markus
Schönnenbeck, Philipp
author_sort Schell, Tilman
title ModEst - Precise estimation of genome size from NGS data
title_short ModEst - Precise estimation of genome size from NGS data
title_full ModEst - Precise estimation of genome size from NGS data
title_fullStr ModEst - Precise estimation of genome size from NGS data
title_full_unstemmed ModEst - Precise estimation of genome size from NGS data
title_sort modest - precise estimation of genome size from ngs data
publisher Zenodo
publishDate 2022
url https://dx.doi.org/10.5281/zenodo.5903272
https://zenodo.org/record/5903272
genre Scophthalmus maximus
genre_facet Scophthalmus maximus
op_relation https://zenodo.org/communities/dryad
https://dx.doi.org/10.5061/dryad.dr7sqvb0j
https://dx.doi.org/10.5281/zenodo.5903271
op_rights Open Access
MIT License
https://opensource.org/licenses/MIT
mit
info:eu-repo/semantics/openAccess
op_rightsnorm MIT
op_doi https://doi.org/10.5281/zenodo.5903272
https://doi.org/10.5061/dryad.dr7sqvb0j
https://doi.org/10.5281/zenodo.5903271
_version_ 1766189104193601536
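
The description states that the Poisson parameter lambda can be estimated from truncated data, restricted to the part of the depth distribution that follows the true underlying distribution. The following is a minimal sketch of that idea, not the ModEst implementation: it fits a doubly truncated Poisson by maximum likelihood to a per-site depth histogram. The function names, the choice of the window `[lo, hi]`, the ternary search and the synthetic histogram are all assumptions made for illustration; this record does not describe how ModEst selects the truncation window.

```python
import math


def poisson_logpmf(k: int, lam: float) -> float:
    """Log of the Poisson probability mass at k for mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def truncated_loglik(lam: float, hist: dict, lo: int, hi: int) -> float:
    """Log-likelihood of a Poisson truncated to lo..hi (inclusive)."""
    # Log of the Poisson mass inside the window, via log-sum-exp.
    terms = [poisson_logpmf(j, lam) for j in range(lo, hi + 1)]
    peak = max(terms)
    log_norm = peak + math.log(sum(math.exp(t - peak) for t in terms))
    n = sum(hist.values())
    return sum(c * poisson_logpmf(k, lam) for k, c in hist.items()) - n * log_norm


def fit_truncated_poisson(hist: dict, lo: int, hi: int, iters: int = 200) -> float:
    """Estimate lambda from depth counts inside lo..hi only.

    hist maps sequencing depth -> number of sites with that depth.
    The log-likelihood is concave in log(lambda), so a ternary search
    over log(lambda) converges to the maximum.
    """
    window = {k: c for k, c in hist.items() if lo <= k <= hi and c > 0}
    if not window:
        raise ValueError("no observations inside the truncation window")
    a, b = math.log(1e-3), math.log(10.0 * hi + 10.0)  # assumed search bounds
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if truncated_loglik(math.exp(m1), window, lo, hi) < truncated_loglik(
            math.exp(m2), window, lo, hi
        ):
            a = m1
        else:
            b = m2
    return math.exp((a + b) / 2.0)


if __name__ == "__main__":
    # Synthetic histogram: Poisson(20) depths plus artificial excess at very
    # low depth (errors) and very high depth (collapsed repeats).
    hist = {k: 1e6 * math.exp(poisson_logpmf(k, 20.0)) for k in range(60)}
    hist[1] += 5e4
    hist[55] += 5e4
    lam = fit_truncated_poisson(hist, lo=15, hi=25)
    total_bases = 2_000_000_000  # hypothetical number of bases sequenced
    print(f"lambda = {lam:.3f}, genome size = {total_bases / lam:.0f} bp")
```

Restricting the fit to the window means that the distorted tails do not influence lambda, whereas a plain mean over the whole histogram would be pulled toward them.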