Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data

DNA sequencing has transformed the discipline of population genetics, which seeks to assess the level of genetic diversity within species or populations, and infer the geographic and temporal distributions between members of a population. Restriction-site associated DNA sequencing (RADSeq) is an NGS...

Bibliographic Details
Main Author:	Nadukkalam Ravindran, Praveen
Other Authors:	Faculty of Computer Science, Doctor of Philosophy, Dr. Timothy Frasier, Dr. Michael McAllister, Dr. Norbert Zeh, Dr. Nauzer Kalyaniwalla, Dr. Robert Beiko, Dr. Ian R. Bradbury
Language:	English
Published:	2020
Subjects:	Computational methods, Software optimization, Population genetics, Short-read DNA sequence processing, Atlantic salmon
Online Access:	http://hdl.handle.net/10222/78429

id	ftdalhouse:oai:DalSpace.library.dal.ca:10222/78429
record_format	openpolar
spelling	ftdalhouse:oai:DalSpace.library.dal.ca:10222/78429 2023-05-15T15:33:05+02:00 Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data Nadukkalam Ravindran, Praveen Faculty of Computer Science Doctor of Philosophy Dr. Timothy Frasier Dr. Michael McAllister Dr. Norbert Zeh Dr. Nauzer Kalyaniwalla Dr. Robert Beiko Dr. Ian R. Bradbury Not Applicable Yes 2020-04-13T12:42:01Z http://hdl.handle.net/10222/78429 en eng http://hdl.handle.net/10222/78429 Computational methods Software optimization Population genetics Short-read DNA sequence processing 2020 ftdalhouse 2022-03-06T00:10:52Z Other/Unknown Material Atlantic salmon Dalhousie University: DalSpace Institutional Repository
institution	Open Polar
collection	Dalhousie University: DalSpace Institutional Repository
op_collection_id	ftdalhouse
language	English
topic	Computational methods Software optimization Population genetics Short-read DNA sequence processing
spellingShingle	Computational methods Software optimization Population genetics Short-read DNA sequence processing Nadukkalam Ravindran, Praveen Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data
topic_facet	Computational methods Software optimization Population genetics Short-read DNA sequence processing
description	DNA sequencing has transformed the discipline of population genetics, which seeks to assess the level of genetic diversity within species or populations, and infer the geographic and temporal distributions between members of a population. Restriction-site associated DNA sequencing (RADSeq) is an NGS technique that produces data consisting of relatively short (typically 50 to 300 nucleotide) fragments, or “reads”, of sequenced DNA and enables large-scale analysis of individuals and populations. In this thesis, we describe computational methods that use graph-based structures to represent these short reads and to capture the relationships among them. A key challenge in RADSeq analysis is to identify optimal parameter settings for the assignment of reads to loci (singular: locus), which correspond to specific regions in the genome. A parameter sweep is computationally intensive, as the entire analysis needs to be run for each parameter set. We propose a graph-based structure (RADProc), which provides persistence and eliminates redundancy to enable parameter sweeps. For 20 green crab samples and 32 different parameter sets, RADProc took only 2.5 hours, while the widely used Stacks software took 78 hours. Another challenge is to identify paralogs: sequences that are highly similar due to recent duplication events but occur in different regions of the genome and should not be merged into the same locus. We introduce PMERGE, which identifies paralogs by clustering the catalog locus consensus sequences based on similarity. PMERGE is built on the fact that paralogs may be wrongly merged into a single locus in some but not all samples. PMERGE identified 62%-87% of paralogs in the Atlantic salmon and green crab datasets. Gene flow is the movement of alleles (specific sequence variants at a given locus) between populations, and it is an important indicator of population mixing that changes genetic diversity within the populations. We use the RADProc graph to infer gene flow among populations using allele frequency differences in exclusively shared alleles in each pair of populations. The method successfully inferred gene flow patterns in simulated datasets and provided insights into reasons for observed hybridization at two locations in a green crab dataset.
author2	Faculty of Computer Science Doctor of Philosophy Dr. Timothy Frasier Dr. Michael McAllister Dr. Norbert Zeh Dr. Nauzer Kalyaniwalla Dr. Robert Beiko Dr. Ian R. Bradbury Not Applicable Yes
author	Nadukkalam Ravindran, Praveen
author_facet	Nadukkalam Ravindran, Praveen
author_sort	Nadukkalam Ravindran, Praveen
title	Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data
title_short	Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data
title_full	Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data
title_fullStr	Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data
title_full_unstemmed	Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data
title_sort	computational methods for efficient processing and analysis of short-read next-generation dna sequencing data
publishDate	2020
url	http://hdl.handle.net/10222/78429
genre	Atlantic salmon
genre_facet	Atlantic salmon
op_relation	http://hdl.handle.net/10222/78429
_version_	1766363552803717120
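
Illustrative note on the description field above: the abstract says a parameter sweep is expensive because the whole analysis is rerun for each parameter set, and that a persistent graph structure removes that redundancy. The Python sketch below shows one way such reuse could work in principle. It is not RADProc's actual data structure or algorithm. The function names, the use of Hamming distance, and the single mismatch-threshold parameter are all assumptions made for illustration. Pairwise distances are computed once and stored as weighted edges, and each threshold in the sweep then only filters the stored edges and groups sequences with a union-find pass.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(seqs, max_mismatches):
    """Compute every pairwise distance once (hypothetical helper).

    Only edges up to the largest threshold in the sweep are kept.
    """
    edges = []
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        d = hamming(a, b)
        if d <= max_mismatches:
            edges.append((i, j, d))
    return edges


def loci_for_threshold(n, edges, m):
    """Group sequences into putative loci for one threshold m.

    A locus here is a connected component over edges with distance <= m.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, d in edges:
        if d <= m:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input: four made-up sequences, not real RADSeq reads.
seqs = ["ACGTAC", "ACGTAA", "ACGGAA", "TTTTTT"]
edges = build_graph(seqs, max_mismatches=2)

# The sweep reuses the stored edges instead of recomparing sequences.
sweep = {m: loci_for_threshold(len(seqs), edges, m) for m in (0, 1, 2)}
```

In this toy run the quadratic comparison step happens once, and each additional threshold costs only a pass over the stored edges. A real pipeline would also need to handle read depth, sequencing error, and per-sample catalogs, none of which are modeled here.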
