Population-haplotype models for mapping and tagging structural variation using whole genome sequencing
Main Author: | |
---|---|
Format: | Text |
Language: | unknown |
Published: | Imperial College London, 2018 |
Subjects: | |
Online Access: | https://dx.doi.org/10.25560/72185 http://spiral.imperial.ac.uk/handle/10044/1/72185 |
Summary: | Scientific interest in copy number variation (CNV) is increasing rapidly, mainly because of evidence of its phenotypic effects and its contribution to disease susceptibility. Single nucleotide polymorphisms (SNPs), which are abundant in the human genome, have been widely investigated in genome-wide association studies (GWAS). Despite the notable genomic effects of both CNVs and SNPs, the correlation between them has been relatively understudied. In the past decade, next generation sequencing (NGS) has been the leading high-throughput technology for investigating CNVs, and it offers mapping at high resolution. We created a map of NGS-defined CNVs tagged by SNPs using the 1000 Genomes Project phase 3 (1000G) sequencing data to examine patterns between the two types of variation in protein-coding genes. To investigate potential relationships between CNV-tagging SNPs and various phenotypes, we used SNPs with reported disease/phenotype associations from the GWAS Catalog. Moreover, we applied our method to DIAGRAM consortium and Northern Finland Birth Cohort (NFBC) data. Our analysis replicated existing CNV-tagging SNPs and also revealed novel relationships between them in almost all the datasets we analysed. We have developed a statistical framework, from a population perspective, for fast and accurate CNV detection. Using 202 drug-target genes defined in collaboration with GlaxoSmithKline (GSK), we applied our framework to the 1000G data. We calculated summary statistics based on the detected CNV calls, including the allele frequency (AF) for each of the 26 populations of the 1000G. In addition, we visualised our results using UCSC Genome Browser tracks for all 202 regions and successfully benchmarked our CNV calls by comparing them to a gold standard set of 1000G CNVs. Overall, in this thesis we present detailed maps of CNVs and CNV-tagging SNPs to enhance existing knowledge of their impact on the human genome. |