Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data
DNA sequencing has transformed the discipline of population genetics, which seeks to assess the level of genetic diversity within species or populations, and infer the geographic and temporal distributions between members of a population. Restriction-site associated DNA sequencing (RADSeq) is an NGS...
Main Author: | Nadukkalam Ravindran, Praveen |
---|---|
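Other Authors: | Faculty of Computer Science, Doctor of Philosophy, Dr. Timothy Frasier, Dr. Michael McAllister, Dr. Norbert Zeh, Dr. Nauzer Kalyaniwalla, Dr. Robert Beiko, Dr. Ian R. Bradbury, Not Applicable, Yes |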
Language: | English |
Published: | 2020 |
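Subjects: | Computational methods, Software optimization, Population genetics, Short-read DNA sequence processing |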
Online Access: | http://hdl.handle.net/10222/78429 |
id |
ftdalhouse:oai:DalSpace.library.dal.ca:10222/78429 |
---|---|
record_format |
openpolar |
spelling |
ftdalhouse:oai:DalSpace.library.dal.ca:10222/78429 2023-05-15T15:33:05+02:00 Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data Nadukkalam Ravindran, Praveen Faculty of Computer Science Doctor of Philosophy Dr. Timothy Frasier Dr. Michael McAllister Dr. Norbert Zeh Dr. Nauzer Kalyaniwalla Dr. Robert Beiko Dr. Ian R. Bradbury Not Applicable Yes 2020-04-13T12:42:01Z http://hdl.handle.net/10222/78429 en eng http://hdl.handle.net/10222/78429 Computational methods Software optimization Population genetics Short-read DNA sequence processing 2020 ftdalhouse 2022-03-06T00:10:52Z DNA sequencing has transformed the discipline of population genetics, which seeks to assess the level of genetic diversity within species or populations and to infer the geographic and temporal distributions between members of a population. Restriction-site associated DNA sequencing (RADSeq) is an NGS technique that produces data consisting of relatively short (typically 50 to 300 nucleotide) fragments, or “reads”, of sequenced DNA and enables large-scale analysis of individuals and populations. In this thesis, we describe computational methods that use graph-based structures to represent these short reads and to capture the relationships among them. A key challenge in RADSeq analysis is to identify optimal parameter settings for the assignment of reads to loci (singular: locus), which correspond to specific regions in the genome. A parameter sweep is computationally intensive, as the entire analysis needs to be run for each parameter set. We propose a graph-based structure (RADProc), which provides persistence and eliminates redundancy to enable parameter sweeps. For 20 green crab samples and 32 different parameter sets, RADProc took only 2.5 hours, while the widely used Stacks software took 78 hours.
Another challenge is to identify paralogs: sequences that are highly similar due to recent duplication events but occur in different regions of the genome and should not be merged into the same locus. We introduce PMERGE, which identifies paralogs by clustering the catalog locus consensus sequences based on similarity. PMERGE relies on the observation that paralogs may be wrongly merged into a single locus in some but not all samples. PMERGE identified 62% to 87% of paralogs in the Atlantic salmon and green crab datasets. Gene flow is the movement of alleles (specific sequence variants at a given locus) between populations and is an important indicator of population mixing that changes genetic diversity within the populations. We use the RADProc graph to infer gene flow among populations using allele frequency differences in exclusively shared alleles in each pair of populations. The method successfully inferred gene flow patterns in simulated datasets and provided insights into reasons for observed hybridization at two locations in a green crab dataset. Other/Unknown Material Atlantic salmon Dalhousie University: DalSpace Institutional Repository |
institution |
Open Polar |
collection |
Dalhousie University: DalSpace Institutional Repository |
op_collection_id |
ftdalhouse |
language |
English |
topic |
Computational methods, Software optimization, Population genetics, Short-read DNA sequence processing |
description |
DNA sequencing has transformed the discipline of population genetics, which seeks to assess the level of genetic diversity within species or populations and to infer the geographic and temporal distributions between members of a population. Restriction-site associated DNA sequencing (RADSeq) is an NGS technique that produces data consisting of relatively short (typically 50 to 300 nucleotide) fragments, or “reads”, of sequenced DNA and enables large-scale analysis of individuals and populations. In this thesis, we describe computational methods that use graph-based structures to represent these short reads and to capture the relationships among them. A key challenge in RADSeq analysis is to identify optimal parameter settings for the assignment of reads to loci (singular: locus), which correspond to specific regions in the genome. A parameter sweep is computationally intensive, as the entire analysis needs to be run for each parameter set. We propose a graph-based structure (RADProc), which provides persistence and eliminates redundancy to enable parameter sweeps. For 20 green crab samples and 32 different parameter sets, RADProc took only 2.5 hours, while the widely used Stacks software took 78 hours. Another challenge is to identify paralogs: sequences that are highly similar due to recent duplication events but occur in different regions of the genome and should not be merged into the same locus. We introduce PMERGE, which identifies paralogs by clustering the catalog locus consensus sequences based on similarity. PMERGE relies on the observation that paralogs may be wrongly merged into a single locus in some but not all samples. PMERGE identified 62% to 87% of paralogs in the Atlantic salmon and green crab datasets. Gene flow is the movement of alleles (specific sequence variants at a given locus) between populations and is an important indicator of population mixing that changes genetic diversity within the populations.
We use the RADProc graph to infer gene flow among populations using allele frequency differences in exclusively shared alleles in each pair of populations. The method successfully inferred gene flow patterns in simulated datasets and provided insights into reasons for observed hybridization at two locations in a green crab dataset. |
author2 |
Faculty of Computer Science, Doctor of Philosophy, Dr. Timothy Frasier, Dr. Michael McAllister, Dr. Norbert Zeh, Dr. Nauzer Kalyaniwalla, Dr. Robert Beiko, Dr. Ian R. Bradbury, Not Applicable, Yes |
author |
Nadukkalam Ravindran, Praveen |
title |
Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data |
title_sort |
computational methods for efficient processing and analysis of short-read next-generation dna sequencing data |
publishDate |
2020 |
url |
http://hdl.handle.net/10222/78429 |
genre |
Atlantic salmon |
op_relation |
http://hdl.handle.net/10222/78429 |
_version_ |
1766363552803717120 |