Detection and genotyping of Atlantic salmon structural variants with genome graphs

Bibliographic Details
Main Author: Kjelstrup, Anna Sofie
Other Authors: Lien, Sigbjørn
Format: Master Thesis
Language: English
Published: Norwegian University of Life Sciences, Ås 2022
Subjects:
Online Access: https://hdl.handle.net/11250/3030212
Description
Summary: Structural variants (SVs) are defined as genomic rearrangements of 50 base pairs (bp) or larger. Although they are less frequent in the genome, they can account for tenfold more variable base pairs than the widely studied single nucleotide polymorphisms (SNPs). SVs have been hard to detect by short-read sequencing, especially in repeat-rich regions. The recent addition of a new reference genome (GCA_905237065.2) and long-read sequencing data for eleven Atlantic salmon individuals has allowed for a more extensive characterization of SVs, revealing a significantly higher count than previously reported. By constructing a genome graph with new high-quality assemblies based on long reads, we aim to genotype salmon SVs in short-read data that are not detectable by traditional methods. We demonstrate how genome graphs, generated with the bioinformatics pipeline PGGB, can be used to detect and accurately represent SVs in Atlantic salmon genomes. We also present two pipelines for graph-based genotyping using short reads and discuss alternative metrics for genome graph quality improvement. Eventually, this work will contribute to building a whole-genome graph for Atlantic salmon, enabling population-scale SV calling based on already available short-read data.
Summary (translated from Norwegian): Structural variants (SVs) are defined as genomic changes of 50 base pairs or more. Although they are in the minority in the genome, SVs account for many times more variable base pairs than the widely studied single nucleotide polymorphisms (SNPs). Structural variants have previously been challenging to detect with older technology such as short-read sequencing, especially in regions with a high content of repetitive DNA. A new reference genome for Atlantic salmon (GCA_905237065.2), together with long-read sequencing data for eleven individuals, has opened the way for extended characterization and detection of structural variants. This has revealed higher occurrences than previously reported.
By constructing a genome graph from new high-quality assemblies, based on long-read ...